Address for reprint requests and other correspondence: M. Olivier, Biotechnology and Bioengineering Center, Medical College of Wisconsin, 8701 Watertown Plank Rd., Milwaukee, WI 53226 (e-mail: ude.wcm@reivilom).
Received 2012 Jun 21; Accepted 2012 Nov 5. Copyright © 2013 the American Physiological SocietyCopy number variation (CNV), generated through duplication or deletion events that affect one or more loci, is widespread in the human genomes and is often associated with functional consequences that may include changes in gene expression levels or fusion of genes. Genome-wide association studies indicate that some disease phenotypes and physiological pathways might be impacted by CNV in a small number of characterized genomic regions. However, the pervasiveness and full impact of such variation remains unclear. Suitable analytic methods are needed to thoroughly mine human genomes for genomic structural variation, and to explore the interplay between observed CNV and disease phenotypes, but many medical researchers are unfamiliar with the features and nuances of recently developed technologies for detecting CNV. In this article, we evaluate a suite of commonly used and recently developed approaches to uncovering genome-wide CNVs and discuss the relative merits of each.
Keywords: copy number variation, SNP array, next generation sequencing, detection method, analysis algorithm
a major focus in genomic studies is the identification of polymorphisms associated with phenotypic variation and disease. Forms of mammalian genomic variation include single nucleotide polymorphisms (SNPs), microsatellites and other repeats, and structural variation (SV) such as copy number variants (CNVs). CNVs are imbalances that alter the diploid status of a locus so that copy numbers increase (duplications) or decrease (deletions). They range in length between 50 bp and 1 Mbp (28). Genomic regions that are affected by CNVs are far greater than those with SNPs (52). CNVs are widespread among humans and contribute substantially to population diversity. About 15% of the human genome is affected by CNVs (79), and an average of 12 CNVs exist per individual compared with the reference genome. A substantial portion of CNVs is likely to have functional consequences, such as changing expression levels of genes in or near the affected regions due to positional effects, gene dosage alteration, and disruption or fusion of genes. Many CNVs have been shown to contribute to diseases, such as Alzheimer's disease, autism, schizophrenia, breast cancer, obesity, and developmental delay (15, 28, 54, 81, 86, 95).
The primary mechanisms that create CNVs are: nonhomologous end joining, nonallelic homologous recombination, transposition of transposable elements or pseudogenes, variable numbers of tandem repeats, and replication errors bought about by fork stalling or template switching. Recent studies of CNVs have shown that there are CNV distribution hotspots in the human genome. Most commonly, the following three categories of genes are enriched for structural variants: 1) genes that are involved in immunity and cell-cell signaling; 2) genes encoding proteins involved in interaction with the environment (e.g., immune response, perception of smell); and 3) retrovirus- and transposition-related protein coding genes. In contrast, CNV occurrence is significantly reduced in genes that encode proteins that mediate their effect through protein-protein interactions (21). Similarly, dosage-sensitive genes are underrepresented in regions affected by CNVs (76). More evolutionarily conserved genes appear to have fewer CNVs (50). For different functional parts of genes, CNV rates are not evenly distributed (50). A recent analysis by Li and colleagues (50) based on a de novo genome assembly, found a significantly higher number of structural variants in untranslated regions than had previously been identified. Additionally, protein-coding exonic regions had fewer variants than introns.
While the full impact of this “new” form of genome variation on genome function and dysfunction is still unclear, it is likely that a number of phenotypes and physiological functions in cells and tissues are being affected by these structural changes, as suggested by the disease associations reported so far. In this article, we will describe and summarize current platforms and methods to uncover CNVs to provide a summary to geneticists and physiologists alike. Since the vast majority of work (and tool development) has focused on the human genome, we will describe tools and analytical approaches used in these human studies. However, it is likely that detection methods used in humans, and the physiological consequences of genome structural variation, extend equally to other mammalian species.
The discovery and genotyping of CNVs have become central in human genetics and disease-marker development. Yet comprehensive and systematic cataloging of CNVs is still in its infancy. A few databases were developed to catalog CNVs (e.g., DGV, the Database of Genomic Variants, http://projects.tcag.ca/variation). Many challenges exist, including detecting regions where CNVs reside (e.g., in regions with abundant repetitive content, it is difficult to resolve particular CNVs), robust detection algorithms for accurate detection of CNVs using array-based platforms, distinguishing classes of CNVs (e.g., insertions, deletions, duplications, and inversions), and the various mechanisms that create CNVs. Several well-established methods have been applied for validation or replication of targeted CNV at either the single or multiple-locus scale, such as quantitative PCR (qPCR), paralog-ratio testing (PRT) and molecular copy number counting (MCC). qPCR compares threshold cycles (Ct) between the target gene and a reference sequence with normal copy numbers, to generate ΔCt values which are used for CNV calculation. This method has been used in large-scale CNV analysis in detecting disease associations, for example, psoriasis (34) and Crohn's disease (6). With the development of genome-wide CNV screening, qPCR is often used as a confirmation method for computationally identified loci. PRT, on the other hand, uses a single pair of primers to explore the degree of similarity between sequence elements (often dispersed repeats) both in the target locus (with CNV) and a remote reference locus (24). Derived from HAPPY mapping (19), MCC is a simple assay that exploits the ability of PCR to detect single molecules of a target sequence (marker) in DNA aliquots that have undergone limiting dilution. In MCC, the number of aliquots that are positive for a target sequence can be counted and compared with those of other target sequences, thus allowing the relative copy number of multiple sequences in the original DNA sample to be inferred. MCC can be rapidly adapted to regions of particular interest or to a set of scattered candidate loci for refining CNV at high resolution (18, 57). Other multiplex PCR-based approaches, such as multiplex amplifiable probe hybridization, multiplex ligation-dependent probe amplification, multiplex PCR-based real-time invader assay, quantitative multiplex PCR of short fluorescent fragments, and multiplex amplicon quantification, have also been used for targeted screening and validation of CNVs (8).
While these methodologies are useful and commonly used for targeted locus studies, we will focus, for this review, on methods for the high-throughput, genome-wide identification of CNVs and review the leading platforms and most recent analytical methods used for detection of CNVs in human genomes.
High-resolution array platforms have been used extensively to study gene expression, analyze SNP frequency, and, more recently, to detect CNVs. The most commonly used array platforms are SNP arrays and comparative genome hybridization (CGH) arrays. Agilent (http://www.genomics.agilent.com/) and Nimblegen (http://www.nimblegen.com/) first developed CGH as a method for copy number comparison between differentially labeled target and reference DNAs by measuring the fluorescence ratio along the length of each chromosomal region, indicating relative losses or gains in a target sample using fluorescence in situ hybridization. CGH arrays use arrays of bacterial artificial chromosomes, cDNA, or long synthetic oligonucleotides to probe specific regions of interest for copy number assessment. These probes can either be selected for an even genome-wide distribution or for unique, repetitive regions. Thus, CGH offers the highest sensitivity and specificity (30, 32), though this method has low spatial resolution, typically 5–10 Mb (37), and low throughput. SNP arrays, primarily supplied by Affymetrix (Santa Clara, CA) (http://www.affymetrix.com/estore/) and Illumina (San Diego, CA) (http://www.illumina.com/), use short base-pair sequences to capture fragments of DNA and to infer copy number based on hybridization intensities without the reference sample being cohybridized to a target sample. The advantage of this approach is that it allows for the determination of SNP genotypes and requires less sample DNA than CGH (63, 88). However, the distribution of probes in SNP arrays is restricted by the availability of informative SNPs throughout the genome and is known to affect platform performance (32).
Many high-resolution analysis platforms have been developed to detect CNVs that utilize CGH or whole genome SNP genotyping arrays or both. Some, e.g., NimbleGen (Madison, WI) and Agilent (Santa Clara, CA) CNV focused arrays, have included probes that specifically cover CNVs that are reported in the DGV. Currently, there are at least 12 array platforms available for CNV detection (32). By assessing the performance of these platforms in the detection of CNVs in HapMap CEU sample NA12878, Haraksingh and colleagues (32) concluded that CNV-focused arrays offer the highest overall sensitivity for detection of CNV abundance, size range, and breakpoint resolution. To increase the reliability of detecting CNVs, Illumina and Affymetrix have improved their probe selection by including additional probes for non-SNP sites to facilitate higher-resolution CNV detection. These arrays are more appealing for cost-effective, routine analyses. Currently, the Affymetrix Genome-wide Human SNP 6.0 chip includes approximately ∼907 K probes for SNP detection, and an additional ∼946 K non-SNP probes, which include 140K CNV-specific probes (63). Illumina 1M-Duo has over one million markers. Custom designed Agilent array sets contain ∼470 K CNV-specific probes (67).
Most commonly, CNV analysis is performed using DNA extracted from blood or tissue, as recommended by the manufacturers. However, array-based CNV profiling has also been applied to single cells (46, 82). The major problem with array-based, single cell CNV detection is that it requires whole genome amplification (WGA), either through PCR or multiple displacement amplification (MDA). However, it is difficult to distinguish between WGA artifacts and true genetic variants. Furthermore, a majority of these methods lacks the quality metrics required to exclude single cell WGA samples without which the reliable detection of copy number or SNP genotyping is impossible. For example, reliable, imbalanced copy number detection is almost impossible for S-phase cells because diploid components of the genome can be present in two or more copies. A recently developed protocol by Konings and colleagues (40) aims to resolve these difficulties, but the methodology has not been used widely enough to allow reliable assessment.
Complete, whole genome-analysis CNV involves three steps: 1) normalization, 2) probe-level modeling and segmentation, and 3) association analysis. The purpose of normalization is to clean the nonrelevant effects, including the GC content of the fragment amplified by PCR, probe-specific intensity bias due to difference in binding affinity and spatial artifacts. Probe-level modeling is usually performed at two levels: single locus and multilocus. Single-locus modeling combines the probes that measure the same fragment to produce a raw fragment copy number. Multilocus modeling combines the raw copy numbers of neighboring fragments or DNA probe loci into a “meta-probe set,” which measure the whole region's copy number (12). Segmentation involves the identification of break points, which separate neighboring regions based on the Log ratio of probe intensity. Association analysis uses statistical algorithms to assess the potential impact of particular CNVs compared with appropriate controls.
Array data from CGH include the log-ratios of normalized intensities of disease and control samples, arranged by the physical locations of the probes in the genome. Regions with concentrated high or low log-ratios are indicative of copy number changes of targets (44). To give some examples, algorithms developed for analyzing CGH data include circular binary segmentation (CBS) (64) and Gain and loss analysis of DNA (GLAD) (36). Using a maximum of likelihood ratio statistics to detect narrower segments of aberration, CBS (64) was able to divide the genome into regions of equal DNA copy number. This algorithm has a very consistent performance and low false discovery rate (FDR), although it has a heavy computational cost. GLAD (36) used an adaptive weights smoothing procedure, in which penalty terms are controlled by weights that can be adapted to the data, for automatic detection of breakpoints. This algorithm detects large regions with high accuracy but often fails to detect fine structure variation.
Because of their high probe density and coverage, SNP arrays have become the most frequently used tools to detect CNVs. High-density SNP array data require genotypes with high quality intensity and allelic ratio information for each surveyed locus. Briefly, if two alleles of one SNP are designated as A and B, and a and b are the normalized intensities of alleles A and B, respectively, deletion and duplication of a certain DNA region will result in a decrease or increase in signal intensity. For CNV estimation, a and b are transformed into R = a + b and θ = arctan(a/b)/(π/2), so that R measures the combined signal intensity of two alleles and θ measures the relative allelic intensity ratio. Log R ratio (LRR) is defined as log2(Robserved/Rexpected), in which Rexpected is measured from reference samples (66, 84). B allele frequency (BAF) is the normalized measure of relative signal intensity ratio of two alleles. Depending on the θ values for three canonical genotype clusters generated from reference samples, different formulas were followed for the normalization of BAF (84). Most methods that use SNP arrays to detect CNVs use both LRR and BAF (66). Both Affymetrix and Illumina offer free software packages for CNV analysis: Genotyping Console (Affymetrix) and Beadstudio (Illumina) (66). With Genotyping Console, two algorithms, the BirdSuite Package (43) and Birdseed, are used to enhance detection of CNVs. With Beadstudio, CNV detection is performed by calculating the mode of BAF for SNPs in a sliding window along the chromosome. It can be run rapidly for an overview of larger events in a sample, but this sliding window approach has limited and rough boundary resolution for CNVs. With the recent surge in SNP-genotyping array technology, many other algorithms have been developed concomitantly to improve the statistical modeling to achieve efficient and more accurate callings. Below, we discuss the features of recently developed algorithms based on their emphasis, (e.g., normalization, probe-level modeling and segmentation, see also Table 1 and Supplemental File). 1
Algorithms developed for CNV detection using array-based data
Platform | Data Type | Algorithm | Advantages | Disadvantages | Size Range of CNVs | Ref. List No. |
---|---|---|---|---|---|---|
PennCNV | Illunima (mainly) and Affymetrix | Hidden Markov Model (HMM) | Detects smaller CNVs and using family trio information to detect inherited and novel CNVs | PennCNV assumes that the model parameters are known which is not true when normal tissue contamination across samples require sample-specific model parameters | Kilo-bp resolution, with a median size of 12 kb | 85 |
Pedigree information significantly improved the sensitivity and false discovery rate | Sensitive to intensity noise | |||||
QuantiSNP | Illumina | Objective Bayes-HMM | Ability to combine data from several platforms of differing resolution | Size of detected CNVs are very large (>190 Kb) | >190 kb | 13 |
Provides probabilistic quantification of state classifications | ||||||
Significantly increased the accuracy of segmental aneuploidy identification and mapping | ||||||
GenoCNV | Illumina | HMM | Data-driven parameter estimation | Designed for Illumina SNP arrays | NA | 80 |
Effects of tissue contamination are explicitly modeled in tumor samples | Appropriate normalization and transformation method is needed for Affymetrix SNP arrays | |||||
Detects the cancer CNV in a tumor samples contaminated with normal tissue | ||||||
Reports genotype information within CNV regions | ||||||
MixHMM | Illumina | HMM | Directly deal with stromal contamination caused heterogeneity in clinical samples, allowing detection of tumor CNV events in heterogeneous tumor samples | Is not intended to distinguish among multiple clones | NA | 51 |
CNV detection for copy numbers is ≤7 | Works only for detection in autosomes | |||||
Allows more complete and accurate description of other forms of allelic imbalance | ||||||
ACNE | Affymetrix | Multisample summarization method based on nonnegative matrix factorization | Ddeals directly with the cross hybridization between probes within SNP probeset; estimates the allele-specific CNVs | Relies on the data given by CRMA v2 after background and offset removal | Mb resolution | 65 |
CRLMM | Affymetrix SNP 6.0 | Linear model to estimate batch- and locus-specific parameters | Specifically designed to address batch effects and uncertainty | For the study of germline traits, the model is most useful when 25 or more samples have been processed together in a batch | NA | 75 |
Does not require reference samples to estimate model parameters | The assumption of fixed batch effect and linearity between average intensity and allelic dosage is not always true | |||||
Uses of biallelic genotype calls from experimental data to estimate batch-specific and locus-specific parameters of background and signal without the requirement of training data | ||||||
ITALICS | Affymetrix | Uses GLAD algorithm to estimate biological signal; estimates nonrelevant effects by multiple linear regression | Estimates both biological and nonrelevant effects in an alternate, iterative manner, accurately eliminates irrelevant effects | Is based solely on perfect match (PM) probes; needs to obtain a reference dataset for calculating the quartet effect | NA | 71 |
Corrects for experimental artifacts caused by spatial effects | ||||||
CNstream | Illumina Beadchip | Heuristics and parametrical statistics to assign a confidence score to each sample at each probe | Joint clustering analysis of all the intensity samples at each probe with the computational speed and analytical accuracy of estimating copy numer polymorphisms (CNPs) from segments of consecutive probes | Designed for Illumina arrays | 10–100 kb | 3 |
Allows a whole-genome analysis to be carried out without the need for previously defined CNP maps | ||||||
High level of accuracy was achieved by analyzing the measures of each intensity channel separately and combining information from multiple samples | ||||||
cn.FARMS | Affymetrix, Illumina and Agilent array | Probabilistic latent variable model, which is optimized by a Bayesian maximum a posteriori approach. | Controls for false positive rate | Does not do segmentation of integer copy number estimation | NA | 12 |
Faster computation with low computational load | ||||||
GADA | CGH, Affymetrix and Illumina | Compact liner algebra representation for the genome copy number based on normalized probe intensities. Then it use sparse Bayesian learning (SBL) and backward elimination (BE) for breakpoint identification | Fast computational speed | Does not include allele-specific copy number data | NA | 69 |
Low false discovery rate | ||||||
CNVworkshop | A variety of SNP array platforms | Genotype-specific extension of the Circular Binary Segmentation algorithm | Integration with the UCSC Genome Browser | Predominantly uses circular binary segmentation (CBS) algorithm, which carries a heavy computational cost | NA | 26 |
Tabular displays of genomic and pathogenic attributes for each CNV | ||||||
Results are easily queried, sorted, filtered, and visualized via a web-based presentation layer that includes a GBrowse-based graphical representation of CNV content and relevant public data | ||||||
R-Gada | Illumina, Affymetrix and aCGH arrays | Uses Bayesian learning and BE for segmentation, parallel computation | Fast and accurate segmentation algorithm to call CNVs | Rely on other packages for normalization and segmentation procedure; only use Log R ratio in the analysis; Sensitivity, given by the maximum breakpoint sparseness, is controlled by the hyperparameter a | 2 kb–8.5 Mb | 68 |
Multivariate analysis of population structure | ||||||
Has tools for performing genome wide association studies | ||||||
Has a complete set of tools for visualizing and reporting copy number alterations | ||||||
NEXUS | CGH and SNP arrays, data generated by array-image analysis software | Rank Segmentation, SNPRank Segmentation, FASST Segmentation and SNP-FASST Segmentation | Convenient tool to explore the data on the population level by performing various statistical analyses, such as class comparisons clustering, enrichment analysis, survival analysis | NA | NA | 17 |
SCIMMkit | Illumina Infinium II and GoldenGate SNP assays | Two rounds of mixture likelihood-based clustering | For targeted interrogation of CNVs using Illumina Infinium II and GoldenGate SNP assaysis | Does not explicitly include batch effects in its statistical model | 1.6–560 kb | 93 |
Can be applied to standardized genome-wide SNP arrays and customized multiplexed SNP panels | ||||||
Provides economy, efficiency and flexibility in experimental design | ||||||
GLAD | Array-based, comparative genomic hybridization (CGH) data | Local constant Gaussian regression model using a weighted maximum likelihood estimator | Automatic detection of breakpoints | Fails to detect very fine structure | NA | 36 |
Detects large regions with high accuracy | ||||||
CBS | Array-based, comparative genomic hybridization (CGH)data | Maximum of a likelihood ratio statistic is used recursively to detect narrower segments of aberration | Consistent | Heavy computation cost with majority of its computational time spent on significance evaluation | NA | 35, 64 |
Low FDR |
* NA, not available.
A suite of algorithms has been developed to improve normalization, which may have a substantial effect on downstream analysis ( Table 1 and Supplemental File). Specifically designed for Affymetrix SNP arrays, ITALICS (71) takes into account that nonrelevant effects can be similar or higher than the biological effects. It estimates the biological signal using the GLAD algorithm (36), and nonrelevant effects with multiple linear regression in an alternate, iterative manner. CRLMM (75) uses biallelic genotype calls from experimental data to explicitly model batch and locus-specific parameters of background without the usage of training data. In addition, CRLMM uses shrinkage to improve locus-specific estimates of copy number uncertainty.
Bayesian and hidden Markov model (HMM) based algorithms are frequently used for copy number segmentation from high-density SNP array data. In HMM-based algorithms, hidden states represent the underlying copy number of probes. Beadstudio is the first software that uses an objective Bayes HMM algorithm to automatically infer regions of segmental aneuploidy (SNV) from BeadArray genotyping data (Illumina). Four additional recently developed algorithms are HMM based: QuantiSNP, PennCNV, GenoCNV, and MixHMM. QuantiSNP incorporates the LRR and BAF simultaneously in a HMM framework by using a fixed rate of heterozygosity for each SNP. Compared with Beadstudio, QuantiSNP significantly improved the resolution of CNV detection (13).
PennCNV (85) uses state-specific and distance-dependent transition probabilities in the HMM state-transition matrix, which is a more realistic model for state transition between different copy number states. To achieve finer mapping and improve a posteriori CNV validation, PennCNV incorporates several sources of available data, including population-based BAF for each SNP, distance between adjacent SNPs and pedigree information (when applied to Trio-analysis). PennCNV is especially efficient at identifying smaller sized CNVs (with a median size of 12 Kb). However, PennCNV was designed primarily for use with the Illumina Infinium high-density SNP genotyping platform. To use PennCNV with Affymetrix data, Affymetrix Power Tools are required for data normalization.
Parameter estimation in GenoCNV (80) is data driven. In addition, it explicitly models the effect of tissue contamination. The two components of this package, GenoCNV and GenoCNA, were designed for detection of CNVs in normal tissue and copy number aberration (CNA) in tumor tissue, respectively. In an effort to detect CNVs in tumor samples that have significant stromal cell contamination (up to 80%), Liu and colleagues (51) developed MixHMM. This Python package assumes that each CNV region in tumor cells is derived from the corresponding region in the mixed stromal cells and allows for more complete and accurate description of other forms of allelic imbalance, such as imbalanced amplifications.
In a Bayesian framework, the most plausible hypothesis concerning data segmentation is constructed by combining prior distributions with some posterior distribution functions. Genome alteration detection algorithm, GADA (69), first builds a compact linear algebra representation for the genome copy number based on normalized probe intensities. Then it uses sparse Bayesian learning and backward elimination for breakpoint identification, though GADA only uses LRR in the analysis. This algorithm was later incorporated into another comprehensive package, R-Gada (68), which is an integrated pipeline with the ability to do complete CNV analysis from data normalization to population structure and association studies. cn.FARMS implements a probabilistic latent variable model by a Bayesian maximum posteriori approach. In this R package, the raw copy numbers of neighboring fragments or DNA probe loci are combined to a “meta-probe set,” which targets a DNA region. Such a multilocus modeling has a lower FDR, but cn.FARMS does not implement integer copy number estimation (12).
Other statistical algorithms have also been developed for CNV analysis. CNstream (3) uses maximization algorithm with a two-dimensional (Rn, θn) Gaussian mixture model to assign a confidence score to each sample at each probe. A high level of accuracy and sensitivity was achieved by jointly calling the copy number state over a set of nearby and consecutive probes. ACNE (65) is a multisample summarization method based on nonnegative matrix factorization, which assumes that observed probe intensities are linear combinations of the true allele-specific CNVs.
As noted in Table 1 and the Supplemental File, existing algorithms for CNV detection using high-density SNP array data each have distinctive features. Researchers who desire high-throughput, whole genome profiling of CNVs and interpretation of their biological significance are challenged by both the sizeable variety and specific scope among existing tools. An emerging need is to combine a range of CNV detection algorithms with the downstream association analysis into one cohesive platform. Such platforms not only can process data from a wide variety of SNP arrays but also have the flexibility to implement different normalization and segmentation algorithms. Furthermore, they will facilitate the downstream analysis, including annotation, data sharing, and assessment of the importance of particular CNVs. While using CBS as the default algorithm, both CNVworkshop and NEXUS can implement other algorithms with little or no source code modification. CNVworkshop provides an infrastructure to quickly manage, annotate, and visualize CNVs in a sizable number of samples. It is an attractive platform for coordinating multiple associated projects (26). In addition to identifying regions of gain or loss on a per sample basis, several components in NEXUS make it a convenient tool to explore the data at a population level by performing various statistical analyses, such as class comparisons clustering, enrichment analysis, and survival analysis (17). R-Gada pipeline can process data from multiple sources (e.g., Affymetrix, Illumina, and Aroma.affymetrix). Its built-in logistic regression model and likelihood-ratio test features can perform population stratification analysis and association tests. When new methods or improvements are made, R-Gada can allow different modules to be easily substituted. With the rapid expansion of larger catalogs of common (>1% frequency in the population) and rare (≤1% frequency) CNVs (43, 80), it becomes feasible and attractive to some researchers to efficiently interrogate targeted CNVs at a large population scale. SCIMMkit (93) is one such software package that utilizes prior knowledge of the approximate location of each targeted CNV to perform specific CNV polymorphism studies in large populations. Three tools in this kit, SCIMM (SNP-conditional mixture modeling), SCIMM-Search, and SCOUT (SNP-conditional outlier detection), each assumes prior knowledge of the approximate location of targeted variants. Combined, they allow accurate and efficient detection of CNVs in an initial phenotypic association testing stage followed by a validation stage for testing the strongest associations in a much larger population.
High-density SNP array-based experiments can detect the total number of paralogous copies in a genomic region, but the size range and the breakpoints of CNVs are limited by the resolution and coverage of the SNPs and probes on the arrays. Often, such information has significant phenotypic consequences (39). Investigating the sequence content in these regions can provide valuable information for functional studies. Limited by a dependence on a reference genome, array-based technology cannot detect novel insertions relative to a reference genome. In recent years, there has been a rapid expansion of next-generation sequencing platforms, including Illumina (5), the SOLiD system from Applied Biosystems (59), and Roche 454 Life Sciences (55). These sequencing advancements opened the door for new approaches for detecting CNVs and led to an exponential increase in CNVs, especially for smaller SVs. An obvious advantage of detecting CNVs using genomic sequencing data is that it allows the identification of CNV breakpoints to specific base pair resolution. Precise characterization of breakpoints, which may capture the signature of potential mutational mechanisms, is crucial for designing robust genotyping assays and assessing the functional content of detected CNVs. Commonly used methods for CNV discovery include: clone-based sequencing (39), split-read mapping (23, 61), paired-end read mapping (41, 42), read-depth analysis (2, 10, 90, 92), and mated short-read analysis (60) (see also Table 2 ). Several factors, including read length, sequencing coverage, and average span between read pairs, affect the accuracy and quality of read mapping, which, in turn, directly affect the performance of each method. Different mapping methods are complementary to each other. For example, read-pair or split-read methods have poor performance in genomic regions enriched with duplications since these methods rely on confident and independent mapping of each end (92). In situations like this, read-depth analysis, which can use single or pair-end reads, offers superior sensitivity because it only requires unambiguous mapping of one end. On the other hand, read-pair and split-read based approaches can better ascertain copy number-neutral events including inversions and translocations (89) (see also Table 2 ).
Analysis methods for genome sequencing-based CNV detection
Analysis Method | Category of CNVs | Advantages | Disadvantages | Ref. List No. |
---|---|---|---|---|
Clone-based sequencing | Duplication, deletion, transposon insertion, novel sequence insertion, inversion | Generate a physical clone map of the whole genome with high resolution | Laborious and tedious cloning | 39 |
Pair-end or mate-pair mapping | Duplication, deletion, transposon insertion, novel sequence insertion, inversion | Allows global detection of SVs at 3 Kb resolution | Cannot detect SVs where the read pairs do not flank the SV breakpoints | 42 |
Average resolution of breakpoints at approximately 600 bp | Detectable size of insertion is limited by library insert size | |||
Split-read mapping | Duplication, deletion, transposon insertion, novel sequence insertion, inversion | Takes advantage of reads directly crossing a breakpoint, which enables accurate detection of breakpoints | Ability to define a breakpoint and genotype of an insertion is limited by the read length | 23, 45, 61 |
Read-depth | Deletions and duplications | Detects very large deletions and duplications | Cannot detect inversions | 10, 87, 92 |
Cannot indicate the exact location of duplicated sequences | ||||
De novo genome assembly | Duplication, deletion, transposon insertion, novel sequence insertion, inversion | Low false positive and false negative rate | Difficulty of de novo assembling transposable elements from massively parallel short-read sequencing data | 29, 49, 50 |
Can identify SVs with a wider range of lenghs, which facilitates more comprehensive mapping of SVs |
Generally, two sources of information are used for CNV calling using short-read alignments: depth of coverage (DOC) and break points. DOC positively correlates with copy number (2, 91, 92), which provides more reliable information for prediction of large CNVs and CNVs flanked by repeats. Breakpoints inferred from sequence alignment gaps are more powerful for accurate detection of SV boundaries and short indel detection (48, 58). Most algorithms developed so far use only one source of information. Shen and colleagues (77) developed an HMM method to detect medium-sized deletions at low coverage (
Because of the relatively short reads in high-throughput sequencing, there are unavoidable erroneous mappings. Sizeable discrepancies exist among different algorithms when they are used to analyze the same data set. There is no single CNV caller that can cover the full range of CNVs throughout the genome. Computational pipelines that can incorporate the results from multiple callers followed by computational validation using local de novo assembly provide superior advantage in detecting a wide spectrum of CNVs with enhanced breakpoint refinement and accuracy (e.g., SVMerge) (89). inGAP-sv provides a convenient tool to visualize detected structural variations with an advantage of detecting many more large insertions and complex variants with lower FDR. An obvious advantage of graphical visualization is that it can help users distinguish different types of SVs and filter false positives. Detailed information of current software packages and algorithms are listed in Table 3 and the Supplemental File.
Algorithms developed for CNV detection using direct genomic sequencing data
Platform | Data Type | Algorithm | Advantages | Disadvantages | Size Range of CNVs | Ref. List No. |
---|---|---|---|---|---|---|
Control-FREEC | Single-end, mate-pair or paired-end genomic sequencing | Uses a sliding window approach to calculate read count (RC) in nonoverlapping windows (raw copy number polymorphism). Uses LASSO-based algorithm for the segmentation | Designed for cancer deep sequencing when no control is available and/or the genome is polyploidy | Designed mainly for Illumina single-end, mate-pair or paired-end sequencing | NA | 7, 33 |
Able to identify the genotype status despite contamination of the tumor sample by normal cells (estimated percent of tumor cells was 60%) | ||||||
ExomeCNV | Exome sequencing data | Uses normalized depth-of-coverage and B allele frequency to infer CNV and LOH | Does not assume random, unbiased distribution of sequence reads, and continuity of search space | Same level of consistency was not observed when single-end data were compared with paired-end data | 120 bp–240 Mb in size | 74 |
Ability to detect copy-number variation (CNV) and loss of heterozygosity (LOH) from exome sequencing data | Resolution of CNV detection with ExomeCNV is limited largely by the probe design | |||||
For detecting deletions, 95% power is achieved for segments of size 500 bp or more | ||||||
SeqGene | Exome and transcriptome sequencing data | Uses a CBS for CNV detection | Detects CNV | Reference (such as a normal DNA sample or the average of a group of pooled samples) is needed for absolute copy number calls | NA | 20, 83 |
Supports gene expression quantification, mutation and coverage visualization and pathway analysis workflows | ||||||
mrsFast | Short-read sequencing data | Simple cache-oblivious, all-to-all list comparison algorithm | Rapidly finds all mapping locations of a collection of short reads | Mainly a mapping algorithm | NA | 31 |
Substantially faster and more accurate | Requires another algorithm, variation Hunter, to find CNVs | |||||
CNV-seq | Shotgun sequencing data | Derived aCGH statistical analysis | Method can be applied to relatively low sequence coverage with good specificity and sensitivity | Assumes sampling of DNA fragments is random, which may not be true when data were generated by different sequencing methods | 1 kb–2.9 Mb | 91 |
window-size (1947 bp) gives a specificity of 95.4% | ||||||
overall specificity between 91.7–99.9% and sensitivity between 72.2–96.5% | ||||||
MoDIL | Clone-end sequencing data | Using expectation-maximization algorithm and appropriate Bayesian priors | Detects small indels in the range of 20–50 bp | Designed for detecting small indels | 20–50 bp | 47 |
BreakDancer | Mate-pair genomic sequencing | Infers CNVs based on discordant mate pairs that have larger outer-distance deviations than a fixed threshhold | Predicts a wide variety of structural variants, including insertions, deletions, inversions and translocations | Detections for inversions are limited by library insert size | 10 bp–1 Mb | 9 |
BreakDancerMini, focuses on detecting small indels (typically 10–100 bp) | ||||||
Overall high and consistent specificity and sensitivity | ||||||
CNVer | Mated short reads sequencing data | Uses a computational framework, donor graph, for CNV calling based on both pair-end-mapping and depth of coverage | Uses both mate pair mapping and depth-of-coverage information jointly to achieve better accuracy and sensitivity of detecting CNVs flanked by segmental duplications | Cannot identify the locations where the duplicated sequence is inserted | Near perfect resolution in breakpoints identification | 60 |
The insert size does not limit the size of the variants that can be detected | Inability to identify novel insertions | |||||
SplazerS | Pair-end and single-end sequencing data | More sensitive alignment method for pre_x and suf_x matches, allowing for mismatches and small gaps | Detects medium-sized insertions and long deletions with precise break points | Shows highest sensitivity, especially in variant-rich regions | >10 bp, with exact breakpoints identification | 23 |
Robust in the presence of sequencing errors as well as alignment errors | Applicable to anchored paired-end as well as unanchored and single-end data | |||||
Can be used on reads of variable lengths | Is not constrained to short read lengths | |||||
Ability to analyze complex structural variations | ||||||
SVMerge | Single, pair-end sequencing data | Uses a collection of SV callers, which use paired-end mapping, split-mapping, clusters of one-end-mapped reads, read-depth and targeted insertion calling | Integrates calls from several existing structural variant callers | Users need to provide standard Binary Alignment Map files; CNV calls in highly repetitive regions in the genome are removed for quality control, difficulty in detecting true insertions | >100 bp | 89 |
Enhanced structural variant detection | ||||||
Breakpoint refinement using local de novo assembly | ||||||
A lower FDR | ||||||
New SV calling methods may be incorporated into the analysis pipeline | ||||||
AGE | Nucleotide or protein sequences | Alignment with gap excision simultaneously aligning the 5′ and 3′ ends of two given sequences and introducing a “large-gap jump” between the local end alignments to maximize the total alignment score | Defines the exact location of break points of SVs by performing precise local alignment at the flanking ends | Due to computational scalability, this algorithm is most practical at aligning sequence with only one deletion, insertion or inversion | 50 bp–1 Mb | 1 |
Can be applied to tandem duplications, inversions and complex events | ||||||
Can be universally applied in various biological studies relying on alignment | ||||||
Can be applied to the alignment of protein sequences | ||||||
SRiC | Split-read, sequencing data | Built upon BLAT, which maps sequence reads to reference genome with gapped alignment | Ability to pinpoint the exact breakpoints | Requires reference genome, with nonsplit reads, SRiC becomes unnecessarily time-consuming | 1 bp–700 kb | 96 |
Reveal the actual sequence content of insertions and cover the whole size spectrum of deletions | ||||||
Can achieve 70-80% call accuracy | ||||||
inGAP-sv | Paired end mapping data (PEM) | Uses both pair-end mapping and coverage of depth strategies to identify different types of SVs | Can detect many more large insertions and complex variants with lower FDR | Failed to detect some small indels (<50bp) | >50 bp | 70 |
Users can identify, visualize, annotate and manually edit SVs | ||||||
Can detect 58-80% of deletions in gold-standard sets in individual NA12878 | ||||||
CNVnator | Single-end and paried-end sequencing data | Statistical analysis of mapping density based on mean-shift approach with additional refinements (multiple-bandwidth partitioning and GC correction) assumes uniformity of coverage across the genome | High sensitivity (86–96%) | Misses CNVs created by retrotransposable elements | from a few hundred to mega bases | 2 |
Low FDR (3–20%) | ||||||
High genotyping accuracy (93–95%) | ||||||
High resolution in breakpoint discovery ( | ||||||
HMM | Short-read sequencing data | Uses an HMM by integrating both depth of coverage and mate-pair relationship to calculate emission probability | Detects small deletions (200–2,000 bp) with low coverage (<10x) | Hemizygous deletions are not modeled | 200–2,000 bp | 77 |
Exceeds 80% discovery rate at ≥3× depth of coverage for medium-size (400–800 bp) CNVs | ||||||
BIC-seq | Short-read sequencing data | Read-depth based Bayesian information criteria algorithm to detect CNVs based on minimizing the Bayesian information criterion | Detects small CNVs (10 bp resolution) with high sensibility and true positive rate | Harder to detect copy gains than loss | 40 bp–5.7 Mb | 90 |
Performs well for detecting large deletions (e.g., one-copy loss of size >50 kb) | ||||||
80% of the calling validared by qPCR |
Another example of high-throughput genotyping technology is exome sequencing, which is made possible by the rapid growth in DNA enrichment technologies (including genomic capture). Large numbers of disease-causing mutations so far have been found in coding regions or canonical splice sites (11), so high-throughput exome sequencing is often used as a powerful tool in human genetic and disease studies to identify new disease-causing mutations. Compared with whole genome sequencing, exome sequencing has more straightforward data analysis and interpretation for a reasonable cost. Recent studies have successfully identified disease-causing mutations using parallel exome sequencing technology (16, 25, 72).
With the availability of extensive exome sequencing data, CNV detection has shifted to utilize this resource. Traditional software and algorithms developed for CNV detection generally assume random, unbiased distribution of sequence reads and the continuity of search space. These assumptions no longer hold true regarding exome sequencing data since the probes used for exome capture have variable specificity and efficiency for the targeted coding regions. Several algorithms have been developed for detection of CNVs to compensate for this bias, e.g., SeqGene and ExomeCNV (see Table 3 and the Supplemental File). SeqGene uses a CBS algorithm for CNV calling though it requires a reference for the absolute calling of CNVs. In addition to CNV calling, SeqGene supports gene expression quantification, mutation and coverage visualization. ExomeCNV uses normalized DOC and BAF to infer CNV and loss of heterozygosity, though the resolution of CNV calling is largely dependent on probe design in exome sequencing.
Array-based platforms can provide a cost-effective means for CNV discovery. However, these platforms have relatively limited power to detect CNVs of smaller size and they have very rough break-point prediction. A wide range of analysis algorithms has been developed for array-based CNV detection, but a sizable difference exists among these methods regarding the number and size of CNVs detected. It is generally recommended to employ two or more analysis algorithms to the same data set and follow the confident consensus calls in multiple analyses in downstream studies. Despite these limitations, array-based platforms have led to the discovery of a number of disease-associated CNVs, and provide an easy-to-use platform for an initial CNV analysis of biological samples.
In contrast, next-generation sequencing-based methods are able to detect smaller SVs in the range of 10 to 2,000 bp with the capability of breakpoint prediction at single base resolution. In addition, such analysis has an obvious advantage of detecting inversions, intrachromosomal translocation, and de novo CNVs not present in the reference genome. Many algorithms have been developed to facilitate CNV detection, but these algorithms are hindered by several factors. First, it remains a problem to unambiguously map the typically short reads of the next-generation sequences (∼400 bp for Roche 454 platform and 2 × 100 bp for Illumina platform) to repetitive regions of the genome. An improvement in read-length in high-throughput sequencing technology will significantly improve both mapping quality and de novo genome assembling. Second, compared with array-based platforms, it is considerably more expensive to sequence a genome to 20× or greater coverage, which is the required depth of coverage for reliable calling of CNVs. Third, large-scale throughput is not feasible for sequencing. Currently, only a few projects are able to generate data at such a scale. Fourth, a computational pipeline for CNV detection on a large population scale would require a considerable hardware and software infrastructure. For example, since the launch of the 1000 Genomes Project (http://1000genomes.org), a massive data set has been generated. Only a few researchers have the computational power to analyze it.
To date, many CNVs have been associated with diseases [e.g., Charcot-Marie tooth disease (53), Angelman syndrome (56), Alzheimer's disease (73), and obesity (78)]. The repertoire of pathological CNVs is expanding, but a comprehensive map of all disease-causing CNVs is far from complete. Once obtained, a resource of this type would facilitate the development of improved CNV-specific, array-based platforms for analysis of copy number polymorphisms, which will likely mature into low-cost and efficient screening methods for clinical and research usage. However, due to the repetitive structure of the regions that are enriched for CNVs, a number of these structural variants will continue to pose a substantial challenge for current discovery and analysis methods, primarily because of the already high noise-to-signal ratio in array-based platforms and the relatively short-length reads produced by next generation sequencing. Improvements in CNV detection accuracy and sequencing read-lengths would enable the development of practical, accurate clinical methods.
New algorithms are developed at a fast pace to detect CNVs from array-based or direct sequencing-based platforms. Each one has its own merits and domain of greatest accuracy and efficiency. However, at present, there are large detection discrepancies among different platforms and analytic methods, indicating the need for ongoing comparative evaluations. Several groups have published comparative analyses of the performance and advantages of different methods they used to analyze identical sets of data, but such work is time-consuming, and, to date, such studies have limited themselves to only a few analytical platforms (4, 22, 88, 94). Accordingly, a comprehensive ranking of available methods is not yet feasible. Large-scale studies that make use of a set of reliable “gold standard” CNVs will help evaluate the most reliable and efficient CNV detection algorithms. Unfortunately, poor consistency exists among different gold standards (94). More rigorous effort should be focused on the construction of a superior set of CNV gold standards to facilitate ongoing algorithm and platform design. Efforts such as the 1000 Genomes Project already incorporate extensive SV validation at nucleotide resolution with precise locations in the genome (62). Thus, SVs identified through this project may provide an especially reliable data set that can be used in the development of more mature detecting platforms and algorithms for CNV studies.
Future efforts should focus on incorporating a wider range of analytical tools that can analyze both array-based and next-generation sequencing data. A thoroughly interconnected platform that implements both next-generation sequencing and low-cost array technology would provide a superior framework for CNV detection. For such platforms to work, one would start with a low-cost screen using an array platform. During this step, many well-tested algorithms or software packages will be employed to delimit a pool of suitably accurate tools that can be used. An improved gold-standard set of CNVs confirmed by multiple methods should be used to determine the most appropriate and efficient CNV-mining algorithms and software. Subsequently, an automated system will be connected to a high-throughput sequencing platform. This step will use the information obtained from the previous step to identify regions of the genome for further deep sequencing.
The authors acknowledge the financial support of the Innovation Center of Medical College of Wisconsin. W. Li is supported by a Kern Innovation Postdoctoral Fellowship from the Kern Innovation Foundation at Medical College of Wisconsin. M. Olivier is supported in part by National Human Genome Research Institute Grant P50HG-004952.
No conflicts of interest, financial or otherwise, are declared by the author(s).
Author contributions: W.L. drafted manuscript; W.L. and M.O. edited and revised manuscript; W.L. and M.O. approved final version of manuscript.