Current analysis platforms and methods for detecting copy number variation

Address for reprint requests and other correspondence: M. Olivier, Biotechnology and Bioengineering Center, Medical College of Wisconsin, 8701 Watertown Plank Rd., Milwaukee, WI 53226 (e-mail: ude.wcm@reivilom).

Received 2012 Jun 21; Accepted 2012 Nov 5. Copyright © 2013 the American Physiological Society

Associated Data

Supplemental Material GUID: E55D2DC2-B028-4F70-A65E-1A347A17FE1C

Abstract

Copy number variation (CNV), generated through duplication or deletion events that affect one or more loci, is widespread in the human genomes and is often associated with functional consequences that may include changes in gene expression levels or fusion of genes. Genome-wide association studies indicate that some disease phenotypes and physiological pathways might be impacted by CNV in a small number of characterized genomic regions. However, the pervasiveness and full impact of such variation remains unclear. Suitable analytic methods are needed to thoroughly mine human genomes for genomic structural variation, and to explore the interplay between observed CNV and disease phenotypes, but many medical researchers are unfamiliar with the features and nuances of recently developed technologies for detecting CNV. In this article, we evaluate a suite of commonly used and recently developed approaches to uncovering genome-wide CNVs and discuss the relative merits of each.

Keywords: copy number variation, SNP array, next generation sequencing, detection method, analysis algorithm

a major focus in genomic studies is the identification of polymorphisms associated with phenotypic variation and disease. Forms of mammalian genomic variation include single nucleotide polymorphisms (SNPs), microsatellites and other repeats, and structural variation (SV) such as copy number variants (CNVs). CNVs are imbalances that alter the diploid status of a locus so that copy numbers increase (duplications) or decrease (deletions). They range in length between 50 bp and 1 Mbp (28). Genomic regions that are affected by CNVs are far greater than those with SNPs (52). CNVs are widespread among humans and contribute substantially to population diversity. About 15% of the human genome is affected by CNVs (79), and an average of 12 CNVs exist per individual compared with the reference genome. A substantial portion of CNVs is likely to have functional consequences, such as changing expression levels of genes in or near the affected regions due to positional effects, gene dosage alteration, and disruption or fusion of genes. Many CNVs have been shown to contribute to diseases, such as Alzheimer's disease, autism, schizophrenia, breast cancer, obesity, and developmental delay (15, 28, 54, 81, 86, 95).

The primary mechanisms that create CNVs are: nonhomologous end joining, nonallelic homologous recombination, transposition of transposable elements or pseudogenes, variable numbers of tandem repeats, and replication errors bought about by fork stalling or template switching. Recent studies of CNVs have shown that there are CNV distribution hotspots in the human genome. Most commonly, the following three categories of genes are enriched for structural variants: 1) genes that are involved in immunity and cell-cell signaling; 2) genes encoding proteins involved in interaction with the environment (e.g., immune response, perception of smell); and 3) retrovirus- and transposition-related protein coding genes. In contrast, CNV occurrence is significantly reduced in genes that encode proteins that mediate their effect through protein-protein interactions (21). Similarly, dosage-sensitive genes are underrepresented in regions affected by CNVs (76). More evolutionarily conserved genes appear to have fewer CNVs (50). For different functional parts of genes, CNV rates are not evenly distributed (50). A recent analysis by Li and colleagues (50) based on a de novo genome assembly, found a significantly higher number of structural variants in untranslated regions than had previously been identified. Additionally, protein-coding exonic regions had fewer variants than introns.

While the full impact of this “new” form of genome variation on genome function and dysfunction is still unclear, it is likely that a number of phenotypes and physiological functions in cells and tissues are being affected by these structural changes, as suggested by the disease associations reported so far. In this article, we will describe and summarize current platforms and methods to uncover CNVs to provide a summary to geneticists and physiologists alike. Since the vast majority of work (and tool development) has focused on the human genome, we will describe tools and analytical approaches used in these human studies. However, it is likely that detection methods used in humans, and the physiological consequences of genome structural variation, extend equally to other mammalian species.

DISCOVERY METHODS FOR CNVS

The discovery and genotyping of CNVs have become central in human genetics and disease-marker development. Yet comprehensive and systematic cataloging of CNVs is still in its infancy. A few databases were developed to catalog CNVs (e.g., DGV, the Database of Genomic Variants, http://projects.tcag.ca/variation). Many challenges exist, including detecting regions where CNVs reside (e.g., in regions with abundant repetitive content, it is difficult to resolve particular CNVs), robust detection algorithms for accurate detection of CNVs using array-based platforms, distinguishing classes of CNVs (e.g., insertions, deletions, duplications, and inversions), and the various mechanisms that create CNVs. Several well-established methods have been applied for validation or replication of targeted CNV at either the single or multiple-locus scale, such as quantitative PCR (qPCR), paralog-ratio testing (PRT) and molecular copy number counting (MCC). qPCR compares threshold cycles (Ct) between the target gene and a reference sequence with normal copy numbers, to generate ΔCt values which are used for CNV calculation. This method has been used in large-scale CNV analysis in detecting disease associations, for example, psoriasis (34) and Crohn's disease (6). With the development of genome-wide CNV screening, qPCR is often used as a confirmation method for computationally identified loci. PRT, on the other hand, uses a single pair of primers to explore the degree of similarity between sequence elements (often dispersed repeats) both in the target locus (with CNV) and a remote reference locus (24). Derived from HAPPY mapping (19), MCC is a simple assay that exploits the ability of PCR to detect single molecules of a target sequence (marker) in DNA aliquots that have undergone limiting dilution. In MCC, the number of aliquots that are positive for a target sequence can be counted and compared with those of other target sequences, thus allowing the relative copy number of multiple sequences in the original DNA sample to be inferred. MCC can be rapidly adapted to regions of particular interest or to a set of scattered candidate loci for refining CNV at high resolution (18, 57). Other multiplex PCR-based approaches, such as multiplex amplifiable probe hybridization, multiplex ligation-dependent probe amplification, multiplex PCR-based real-time invader assay, quantitative multiplex PCR of short fluorescent fragments, and multiplex amplicon quantification, have also been used for targeted screening and validation of CNVs (8).

While these methodologies are useful and commonly used for targeted locus studies, we will focus, for this review, on methods for the high-throughput, genome-wide identification of CNVs and review the leading platforms and most recent analytical methods used for detection of CNVs in human genomes.

HIGH-RESOLUTION ARRAY PLATFORMS FOR WHOLE GENOME CNV PROFILING

High-resolution array platforms have been used extensively to study gene expression, analyze SNP frequency, and, more recently, to detect CNVs. The most commonly used array platforms are SNP arrays and comparative genome hybridization (CGH) arrays. Agilent (http://www.genomics.agilent.com/) and Nimblegen (http://www.nimblegen.com/) first developed CGH as a method for copy number comparison between differentially labeled target and reference DNAs by measuring the fluorescence ratio along the length of each chromosomal region, indicating relative losses or gains in a target sample using fluorescence in situ hybridization. CGH arrays use arrays of bacterial artificial chromosomes, cDNA, or long synthetic oligonucleotides to probe specific regions of interest for copy number assessment. These probes can either be selected for an even genome-wide distribution or for unique, repetitive regions. Thus, CGH offers the highest sensitivity and specificity (30, 32), though this method has low spatial resolution, typically 5–10 Mb (37), and low throughput. SNP arrays, primarily supplied by Affymetrix (Santa Clara, CA) (http://www.affymetrix.com/estore/) and Illumina (San Diego, CA) (http://www.illumina.com/), use short base-pair sequences to capture fragments of DNA and to infer copy number based on hybridization intensities without the reference sample being cohybridized to a target sample. The advantage of this approach is that it allows for the determination of SNP genotypes and requires less sample DNA than CGH (63, 88). However, the distribution of probes in SNP arrays is restricted by the availability of informative SNPs throughout the genome and is known to affect platform performance (32).

Many high-resolution analysis platforms have been developed to detect CNVs that utilize CGH or whole genome SNP genotyping arrays or both. Some, e.g., NimbleGen (Madison, WI) and Agilent (Santa Clara, CA) CNV focused arrays, have included probes that specifically cover CNVs that are reported in the DGV. Currently, there are at least 12 array platforms available for CNV detection (32). By assessing the performance of these platforms in the detection of CNVs in HapMap CEU sample NA12878, Haraksingh and colleagues (32) concluded that CNV-focused arrays offer the highest overall sensitivity for detection of CNV abundance, size range, and breakpoint resolution. To increase the reliability of detecting CNVs, Illumina and Affymetrix have improved their probe selection by including additional probes for non-SNP sites to facilitate higher-resolution CNV detection. These arrays are more appealing for cost-effective, routine analyses. Currently, the Affymetrix Genome-wide Human SNP 6.0 chip includes approximately ∼907 K probes for SNP detection, and an additional ∼946 K non-SNP probes, which include 140K CNV-specific probes (63). Illumina 1M-Duo has over one million markers. Custom designed Agilent array sets contain ∼470 K CNV-specific probes (67).

Most commonly, CNV analysis is performed using DNA extracted from blood or tissue, as recommended by the manufacturers. However, array-based CNV profiling has also been applied to single cells (46, 82). The major problem with array-based, single cell CNV detection is that it requires whole genome amplification (WGA), either through PCR or multiple displacement amplification (MDA). However, it is difficult to distinguish between WGA artifacts and true genetic variants. Furthermore, a majority of these methods lacks the quality metrics required to exclude single cell WGA samples without which the reliable detection of copy number or SNP genotyping is impossible. For example, reliable, imbalanced copy number detection is almost impossible for S-phase cells because diploid components of the genome can be present in two or more copies. A recently developed protocol by Konings and colleagues (40) aims to resolve these difficulties, but the methodology has not been used widely enough to allow reliable assessment.

ALGORITHMS FOR GENOMIC PROFILING OF CNVS USING HIGH-RESOLUTION ARRAY PLATFORMS

Complete, whole genome-analysis CNV involves three steps: 1) normalization, 2) probe-level modeling and segmentation, and 3) association analysis. The purpose of normalization is to clean the nonrelevant effects, including the GC content of the fragment amplified by PCR, probe-specific intensity bias due to difference in binding affinity and spatial artifacts. Probe-level modeling is usually performed at two levels: single locus and multilocus. Single-locus modeling combines the probes that measure the same fragment to produce a raw fragment copy number. Multilocus modeling combines the raw copy numbers of neighboring fragments or DNA probe loci into a “meta-probe set,” which measure the whole region's copy number (12). Segmentation involves the identification of break points, which separate neighboring regions based on the Log ratio of probe intensity. Association analysis uses statistical algorithms to assess the potential impact of particular CNVs compared with appropriate controls.

Array data from CGH include the log-ratios of normalized intensities of disease and control samples, arranged by the physical locations of the probes in the genome. Regions with concentrated high or low log-ratios are indicative of copy number changes of targets (44). To give some examples, algorithms developed for analyzing CGH data include circular binary segmentation (CBS) (64) and Gain and loss analysis of DNA (GLAD) (36). Using a maximum of likelihood ratio statistics to detect narrower segments of aberration, CBS (64) was able to divide the genome into regions of equal DNA copy number. This algorithm has a very consistent performance and low false discovery rate (FDR), although it has a heavy computational cost. GLAD (36) used an adaptive weights smoothing procedure, in which penalty terms are controlled by weights that can be adapted to the data, for automatic detection of breakpoints. This algorithm detects large regions with high accuracy but often fails to detect fine structure variation.

Because of their high probe density and coverage, SNP arrays have become the most frequently used tools to detect CNVs. High-density SNP array data require genotypes with high quality intensity and allelic ratio information for each surveyed locus. Briefly, if two alleles of one SNP are designated as A and B, and a and b are the normalized intensities of alleles A and B, respectively, deletion and duplication of a certain DNA region will result in a decrease or increase in signal intensity. For CNV estimation, a and b are transformed into R = a + b and θ = arctan(a/b)/(π/2), so that R measures the combined signal intensity of two alleles and θ measures the relative allelic intensity ratio. Log R ratio (LRR) is defined as log2(Robserved/Rexpected), in which Rexpected is measured from reference samples (66, 84). B allele frequency (BAF) is the normalized measure of relative signal intensity ratio of two alleles. Depending on the θ values for three canonical genotype clusters generated from reference samples, different formulas were followed for the normalization of BAF (84). Most methods that use SNP arrays to detect CNVs use both LRR and BAF (66). Both Affymetrix and Illumina offer free software packages for CNV analysis: Genotyping Console (Affymetrix) and Beadstudio (Illumina) (66). With Genotyping Console, two algorithms, the BirdSuite Package (43) and Birdseed, are used to enhance detection of CNVs. With Beadstudio, CNV detection is performed by calculating the mode of BAF for SNPs in a sliding window along the chromosome. It can be run rapidly for an overview of larger events in a sample, but this sliding window approach has limited and rough boundary resolution for CNVs. With the recent surge in SNP-genotyping array technology, many other algorithms have been developed concomitantly to improve the statistical modeling to achieve efficient and more accurate callings. Below, we discuss the features of recently developed algorithms based on their emphasis, (e.g., normalization, probe-level modeling and segmentation, see also Table 1 and Supplemental File). 1

Table 1.

Algorithms developed for CNV detection using array-based data

PlatformData TypeAlgorithmAdvantagesDisadvantagesSize Range of CNVsRef. List No.
PennCNVIllunima (mainly) and AffymetrixHidden Markov Model (HMM)Detects smaller CNVs and using family trio information to detect inherited and novel CNVsPennCNV assumes that the model parameters are known which is not true when normal tissue contamination across samples require sample-specific model parametersKilo-bp resolution, with a median size of 12 kb85
Pedigree information significantly improved the sensitivity and false discovery rateSensitive to intensity noise
QuantiSNPIlluminaObjective Bayes-HMMAbility to combine data from several platforms of differing resolutionSize of detected CNVs are very large (>190 Kb)>190 kb13
Provides probabilistic quantification of state classifications
Significantly increased the accuracy of segmental aneuploidy identification and mapping
GenoCNVIlluminaHMMData-driven parameter estimationDesigned for Illumina SNP arraysNA80
Effects of tissue contamination are explicitly modeled in tumor samplesAppropriate normalization and transformation method is needed for Affymetrix SNP arrays
Detects the cancer CNV in a tumor samples contaminated with normal tissue
Reports genotype information within CNV regions
MixHMMIlluminaHMMDirectly deal with stromal contamination caused heterogeneity in clinical samples, allowing detection of tumor CNV events in heterogeneous tumor samplesIs not intended to distinguish among multiple clonesNA51
CNV detection for copy numbers is ≤7Works only for detection in autosomes
Allows more complete and accurate description of other forms of allelic imbalance
ACNEAffymetrixMultisample summarization method based on nonnegative matrix factorizationDdeals directly with the cross hybridization between probes within SNP probeset; estimates the allele-specific CNVsRelies on the data given by CRMA v2 after background and offset removalMb resolution65
CRLMMAffymetrix SNP 6.0Linear model to estimate batch- and locus-specific parametersSpecifically designed to address batch effects and uncertaintyFor the study of germline traits, the model is most useful when 25 or more samples have been processed together in a batchNA75
Does not require reference samples to estimate model parametersThe assumption of fixed batch effect and linearity between average intensity and allelic dosage is not always true
Uses of biallelic genotype calls from experimental data to estimate batch-specific and locus-specific parameters of background and signal without the requirement of training data
ITALICSAffymetrixUses GLAD algorithm to estimate biological signal; estimates nonrelevant effects by multiple linear regressionEstimates both biological and nonrelevant effects in an alternate, iterative manner, accurately eliminates irrelevant effectsIs based solely on perfect match (PM) probes; needs to obtain a reference dataset for calculating the quartet effectNA71
Corrects for experimental artifacts caused by spatial effects
CNstreamIllumina BeadchipHeuristics and parametrical statistics to assign a confidence score to each sample at each probeJoint clustering analysis of all the intensity samples at each probe with the computational speed and analytical accuracy of estimating copy numer polymorphisms (CNPs) from segments of consecutive probesDesigned for Illumina arrays10–100 kb3
Allows a whole-genome analysis to be carried out without the need for previously defined CNP maps
High level of accuracy was achieved by analyzing the measures of each intensity channel separately and combining information from multiple samples
cn.FARMSAffymetrix, Illumina and Agilent arrayProbabilistic latent variable model, which is optimized by a Bayesian maximum a posteriori approach.Controls for false positive rateDoes not do segmentation of integer copy number estimationNA12
Faster computation with low computational load
GADACGH, Affymetrix and IlluminaCompact liner algebra representation for the genome copy number based on normalized probe intensities. Then it use sparse Bayesian learning (SBL) and backward elimination (BE) for breakpoint identificationFast computational speedDoes not include allele-specific copy number dataNA69
Low false discovery rate
CNVworkshopA variety of SNP array platformsGenotype-specific extension of the Circular Binary Segmentation algorithmIntegration with the UCSC Genome BrowserPredominantly uses circular binary segmentation (CBS) algorithm, which carries a heavy computational costNA26
Tabular displays of genomic and pathogenic attributes for each CNV
Results are easily queried, sorted, filtered, and visualized via a web-based presentation layer that includes a GBrowse-based graphical representation of CNV content and relevant public data
R-GadaIllumina, Affymetrix and aCGH arraysUses Bayesian learning and BE for segmentation, parallel computationFast and accurate segmentation algorithm to call CNVsRely on other packages for normalization and segmentation procedure; only use Log R ratio in the analysis; Sensitivity, given by the maximum breakpoint sparseness, is controlled by the hyperparameter a2 kb–8.5 Mb68
Multivariate analysis of population structure
Has tools for performing genome wide association studies
Has a complete set of tools for visualizing and reporting copy number alterations
NEXUSCGH and SNP arrays, data generated by array-image analysis softwareRank Segmentation, SNPRank Segmentation, FASST Segmentation and SNP-FASST SegmentationConvenient tool to explore the data on the population level by performing various statistical analyses, such as class comparisons clustering, enrichment analysis, survival analysisNANA17
SCIMMkitIllumina Infinium II and GoldenGate SNP assaysTwo rounds of mixture likelihood-based clusteringFor targeted interrogation of CNVs using Illumina Infinium II and GoldenGate SNP assaysisDoes not explicitly include batch effects in its statistical model1.6–560 kb93
Can be applied to standardized genome-wide SNP arrays and customized multiplexed SNP panels
Provides economy, efficiency and flexibility in experimental design
GLADArray-based, comparative genomic hybridization (CGH) dataLocal constant Gaussian regression model using a weighted maximum likelihood estimatorAutomatic detection of breakpointsFails to detect very fine structureNA36
Detects large regions with high accuracy
CBSArray-based, comparative genomic hybridization (CGH)dataMaximum of a likelihood ratio statistic is used recursively to detect narrower segments of aberrationConsistentHeavy computation cost with majority of its computational time spent on significance evaluationNA35, 64
Low FDR

* NA, not available.

A suite of algorithms has been developed to improve normalization, which may have a substantial effect on downstream analysis ( Table 1 and Supplemental File). Specifically designed for Affymetrix SNP arrays, ITALICS (71) takes into account that nonrelevant effects can be similar or higher than the biological effects. It estimates the biological signal using the GLAD algorithm (36), and nonrelevant effects with multiple linear regression in an alternate, iterative manner. CRLMM (75) uses biallelic genotype calls from experimental data to explicitly model batch and locus-specific parameters of background without the usage of training data. In addition, CRLMM uses shrinkage to improve locus-specific estimates of copy number uncertainty.

Bayesian and hidden Markov model (HMM) based algorithms are frequently used for copy number segmentation from high-density SNP array data. In HMM-based algorithms, hidden states represent the underlying copy number of probes. Beadstudio is the first software that uses an objective Bayes HMM algorithm to automatically infer regions of segmental aneuploidy (SNV) from BeadArray genotyping data (Illumina). Four additional recently developed algorithms are HMM based: QuantiSNP, PennCNV, GenoCNV, and MixHMM. QuantiSNP incorporates the LRR and BAF simultaneously in a HMM framework by using a fixed rate of heterozygosity for each SNP. Compared with Beadstudio, QuantiSNP significantly improved the resolution of CNV detection (13).

PennCNV (85) uses state-specific and distance-dependent transition probabilities in the HMM state-transition matrix, which is a more realistic model for state transition between different copy number states. To achieve finer mapping and improve a posteriori CNV validation, PennCNV incorporates several sources of available data, including population-based BAF for each SNP, distance between adjacent SNPs and pedigree information (when applied to Trio-analysis). PennCNV is especially efficient at identifying smaller sized CNVs (with a median size of 12 Kb). However, PennCNV was designed primarily for use with the Illumina Infinium high-density SNP genotyping platform. To use PennCNV with Affymetrix data, Affymetrix Power Tools are required for data normalization.

Parameter estimation in GenoCNV (80) is data driven. In addition, it explicitly models the effect of tissue contamination. The two components of this package, GenoCNV and GenoCNA, were designed for detection of CNVs in normal tissue and copy number aberration (CNA) in tumor tissue, respectively. In an effort to detect CNVs in tumor samples that have significant stromal cell contamination (up to 80%), Liu and colleagues (51) developed MixHMM. This Python package assumes that each CNV region in tumor cells is derived from the corresponding region in the mixed stromal cells and allows for more complete and accurate description of other forms of allelic imbalance, such as imbalanced amplifications.

In a Bayesian framework, the most plausible hypothesis concerning data segmentation is constructed by combining prior distributions with some posterior distribution functions. Genome alteration detection algorithm, GADA (69), first builds a compact linear algebra representation for the genome copy number based on normalized probe intensities. Then it uses sparse Bayesian learning and backward elimination for breakpoint identification, though GADA only uses LRR in the analysis. This algorithm was later incorporated into another comprehensive package, R-Gada (68), which is an integrated pipeline with the ability to do complete CNV analysis from data normalization to population structure and association studies. cn.FARMS implements a probabilistic latent variable model by a Bayesian maximum posteriori approach. In this R package, the raw copy numbers of neighboring fragments or DNA probe loci are combined to a “meta-probe set,” which targets a DNA region. Such a multilocus modeling has a lower FDR, but cn.FARMS does not implement integer copy number estimation (12).

Other statistical algorithms have also been developed for CNV analysis. CNstream (3) uses maximization algorithm with a two-dimensional (Rn, θn) Gaussian mixture model to assign a confidence score to each sample at each probe. A high level of accuracy and sensitivity was achieved by jointly calling the copy number state over a set of nearby and consecutive probes. ACNE (65) is a multisample summarization method based on nonnegative matrix factorization, which assumes that observed probe intensities are linear combinations of the true allele-specific CNVs.

As noted in Table 1 and the Supplemental File, existing algorithms for CNV detection using high-density SNP array data each have distinctive features. Researchers who desire high-throughput, whole genome profiling of CNVs and interpretation of their biological significance are challenged by both the sizeable variety and specific scope among existing tools. An emerging need is to combine a range of CNV detection algorithms with the downstream association analysis into one cohesive platform. Such platforms not only can process data from a wide variety of SNP arrays but also have the flexibility to implement different normalization and segmentation algorithms. Furthermore, they will facilitate the downstream analysis, including annotation, data sharing, and assessment of the importance of particular CNVs. While using CBS as the default algorithm, both CNVworkshop and NEXUS can implement other algorithms with little or no source code modification. CNVworkshop provides an infrastructure to quickly manage, annotate, and visualize CNVs in a sizable number of samples. It is an attractive platform for coordinating multiple associated projects (26). In addition to identifying regions of gain or loss on a per sample basis, several components in NEXUS make it a convenient tool to explore the data at a population level by performing various statistical analyses, such as class comparisons clustering, enrichment analysis, and survival analysis (17). R-Gada pipeline can process data from multiple sources (e.g., Affymetrix, Illumina, and Aroma.affymetrix). Its built-in logistic regression model and likelihood-ratio test features can perform population stratification analysis and association tests. When new methods or improvements are made, R-Gada can allow different modules to be easily substituted. With the rapid expansion of larger catalogs of common (>1% frequency in the population) and rare (≤1% frequency) CNVs (43, 80), it becomes feasible and attractive to some researchers to efficiently interrogate targeted CNVs at a large population scale. SCIMMkit (93) is one such software package that utilizes prior knowledge of the approximate location of each targeted CNV to perform specific CNV polymorphism studies in large populations. Three tools in this kit, SCIMM (SNP-conditional mixture modeling), SCIMM-Search, and SCOUT (SNP-conditional outlier detection), each assumes prior knowledge of the approximate location of targeted variants. Combined, they allow accurate and efficient detection of CNVs in an initial phenotypic association testing stage followed by a validation stage for testing the strongest associations in a much larger population.

DETECTING CNVS BY LARGE-SCALE GENOME SEQUENCING

High-density SNP array-based experiments can detect the total number of paralogous copies in a genomic region, but the size range and the breakpoints of CNVs are limited by the resolution and coverage of the SNPs and probes on the arrays. Often, such information has significant phenotypic consequences (39). Investigating the sequence content in these regions can provide valuable information for functional studies. Limited by a dependence on a reference genome, array-based technology cannot detect novel insertions relative to a reference genome. In recent years, there has been a rapid expansion of next-generation sequencing platforms, including Illumina (5), the SOLiD system from Applied Biosystems (59), and Roche 454 Life Sciences (55). These sequencing advancements opened the door for new approaches for detecting CNVs and led to an exponential increase in CNVs, especially for smaller SVs. An obvious advantage of detecting CNVs using genomic sequencing data is that it allows the identification of CNV breakpoints to specific base pair resolution. Precise characterization of breakpoints, which may capture the signature of potential mutational mechanisms, is crucial for designing robust genotyping assays and assessing the functional content of detected CNVs. Commonly used methods for CNV discovery include: clone-based sequencing (39), split-read mapping (23, 61), paired-end read mapping (41, 42), read-depth analysis (2, 10, 90, 92), and mated short-read analysis (60) (see also Table 2 ). Several factors, including read length, sequencing coverage, and average span between read pairs, affect the accuracy and quality of read mapping, which, in turn, directly affect the performance of each method. Different mapping methods are complementary to each other. For example, read-pair or split-read methods have poor performance in genomic regions enriched with duplications since these methods rely on confident and independent mapping of each end (92). In situations like this, read-depth analysis, which can use single or pair-end reads, offers superior sensitivity because it only requires unambiguous mapping of one end. On the other hand, read-pair and split-read based approaches can better ascertain copy number-neutral events including inversions and translocations (89) (see also Table 2 ).

Table 2.

Analysis methods for genome sequencing-based CNV detection

Analysis MethodCategory of CNVsAdvantagesDisadvantagesRef. List No.
Clone-based sequencingDuplication, deletion, transposon insertion, novel sequence insertion, inversionGenerate a physical clone map of the whole genome with high resolutionLaborious and tedious cloning39
Pair-end or mate-pair mappingDuplication, deletion, transposon insertion, novel sequence insertion, inversionAllows global detection of SVs at 3 Kb resolutionCannot detect SVs where the read pairs do not flank the SV breakpoints42
Average resolution of breakpoints at approximately 600 bpDetectable size of insertion is limited by library insert size
Split-read mappingDuplication, deletion, transposon insertion, novel sequence insertion, inversionTakes advantage of reads directly crossing a breakpoint, which enables accurate detection of breakpointsAbility to define a breakpoint and genotype of an insertion is limited by the read length23, 45, 61
Read-depthDeletions and duplicationsDetects very large deletions and duplicationsCannot detect inversions10, 87, 92
Cannot indicate the exact location of duplicated sequences
De novo genome assemblyDuplication, deletion, transposon insertion, novel sequence insertion, inversionLow false positive and false negative rateDifficulty of de novo assembling transposable elements from massively parallel short-read sequencing data29, 49, 50
Can identify SVs with a wider range of lenghs, which facilitates more comprehensive mapping of SVs

Generally, two sources of information are used for CNV calling using short-read alignments: depth of coverage (DOC) and break points. DOC positively correlates with copy number (2, 91, 92), which provides more reliable information for prediction of large CNVs and CNVs flanked by repeats. Breakpoints inferred from sequence alignment gaps are more powerful for accurate detection of SV boundaries and short indel detection (48, 58). Most algorithms developed so far use only one source of information. Shen and colleagues (77) developed an HMM method to detect medium-sized deletions at low coverage (

Because of the relatively short reads in high-throughput sequencing, there are unavoidable erroneous mappings. Sizeable discrepancies exist among different algorithms when they are used to analyze the same data set. There is no single CNV caller that can cover the full range of CNVs throughout the genome. Computational pipelines that can incorporate the results from multiple callers followed by computational validation using local de novo assembly provide superior advantage in detecting a wide spectrum of CNVs with enhanced breakpoint refinement and accuracy (e.g., SVMerge) (89). inGAP-sv provides a convenient tool to visualize detected structural variations with an advantage of detecting many more large insertions and complex variants with lower FDR. An obvious advantage of graphical visualization is that it can help users distinguish different types of SVs and filter false positives. Detailed information of current software packages and algorithms are listed in Table 3 and the Supplemental File.

Table 3.

Algorithms developed for CNV detection using direct genomic sequencing data

PlatformData TypeAlgorithmAdvantagesDisadvantagesSize Range of CNVsRef. List No.
Control-FREECSingle-end, mate-pair or paired-end genomic sequencingUses a sliding window approach to calculate read count (RC) in nonoverlapping windows (raw copy number polymorphism). Uses LASSO-based algorithm for the segmentationDesigned for cancer deep sequencing when no control is available and/or the genome is polyploidyDesigned mainly for Illumina single-end, mate-pair or paired-end sequencingNA7, 33
Able to identify the genotype status despite contamination of the tumor sample by normal cells (estimated percent of tumor cells was 60%)
ExomeCNVExome sequencing dataUses normalized depth-of-coverage and B allele frequency to infer CNV and LOHDoes not assume random, unbiased distribution of sequence reads, and continuity of search spaceSame level of consistency was not observed when single-end data were compared with paired-end data120 bp–240 Mb in size74
Ability to detect copy-number variation (CNV) and loss of heterozygosity (LOH) from exome sequencing dataResolution of CNV detection with ExomeCNV is limited largely by the probe design
For detecting deletions, 95% power is achieved for segments of size 500 bp or more
SeqGeneExome and transcriptome sequencing dataUses a CBS for CNV detectionDetects CNVReference (such as a normal DNA sample or the average of a group of pooled samples) is needed for absolute copy number callsNA20, 83
Supports gene expression quantification, mutation and coverage visualization and pathway analysis workflows
mrsFastShort-read sequencing dataSimple cache-oblivious, all-to-all list comparison algorithmRapidly finds all mapping locations of a collection of short readsMainly a mapping algorithmNA31
Substantially faster and more accurateRequires another algorithm, variation Hunter, to find CNVs
CNV-seqShotgun sequencing dataDerived aCGH statistical analysisMethod can be applied to relatively low sequence coverage with good specificity and sensitivityAssumes sampling of DNA fragments is random, which may not be true when data were generated by different sequencing methods1 kb–2.9 Mb91
window-size (1947 bp) gives a specificity of 95.4%
overall specificity between 91.7–99.9% and sensitivity between 72.2–96.5%
MoDILClone-end sequencing dataUsing expectation-maximization algorithm and appropriate Bayesian priorsDetects small indels in the range of 20–50 bpDesigned for detecting small indels20–50 bp47
BreakDancerMate-pair genomic sequencingInfers CNVs based on discordant mate pairs that have larger outer-distance deviations than a fixed threshholdPredicts a wide variety of structural variants, including insertions, deletions, inversions and translocationsDetections for inversions are limited by library insert size10 bp–1 Mb9
BreakDancerMini, focuses on detecting small indels (typically 10–100 bp)
Overall high and consistent specificity and sensitivity
CNVerMated short reads sequencing dataUses a computational framework, donor graph, for CNV calling based on both pair-end-mapping and depth of coverageUses both mate pair mapping and depth-of-coverage information jointly to achieve better accuracy and sensitivity of detecting CNVs flanked by segmental duplicationsCannot identify the locations where the duplicated sequence is insertedNear perfect resolution in breakpoints identification60
The insert size does not limit the size of the variants that can be detectedInability to identify novel insertions
SplazerSPair-end and single-end sequencing dataMore sensitive alignment method for pre_x and suf_x matches, allowing for mismatches and small gapsDetects medium-sized insertions and long deletions with precise break pointsShows highest sensitivity, especially in variant-rich regions>10 bp, with exact breakpoints identification23
Robust in the presence of sequencing errors as well as alignment errorsApplicable to anchored paired-end as well as unanchored and single-end data
Can be used on reads of variable lengthsIs not constrained to short read lengths
Ability to analyze complex structural variations
SVMergeSingle, pair-end sequencing dataUses a collection of SV callers, which use paired-end mapping, split-mapping, clusters of one-end-mapped reads, read-depth and targeted insertion callingIntegrates calls from several existing structural variant callersUsers need to provide standard Binary Alignment Map files; CNV calls in highly repetitive regions in the genome are removed for quality control, difficulty in detecting true insertions>100 bp89
Enhanced structural variant detection
Breakpoint refinement using local de novo assembly
A lower FDR
New SV calling methods may be incorporated into the analysis pipeline
AGENucleotide or protein sequencesAlignment with gap excision simultaneously aligning the 5′ and 3′ ends of two given sequences and introducing a “large-gap jump” between the local end alignments to maximize the total alignment scoreDefines the exact location of break points of SVs by performing precise local alignment at the flanking endsDue to computational scalability, this algorithm is most practical at aligning sequence with only one deletion, insertion or inversion50 bp–1 Mb1
Can be applied to tandem duplications, inversions and complex events
Can be universally applied in various biological studies relying on alignment
Can be applied to the alignment of protein sequences
SRiCSplit-read, sequencing dataBuilt upon BLAT, which maps sequence reads to reference genome with gapped alignmentAbility to pinpoint the exact breakpointsRequires reference genome, with nonsplit reads, SRiC becomes unnecessarily time-consuming1 bp–700 kb96
Reveal the actual sequence content of insertions and cover the whole size spectrum of deletions
Can achieve 70-80% call accuracy
inGAP-svPaired end mapping data (PEM)Uses both pair-end mapping and coverage of depth strategies to identify different types of SVsCan detect many more large insertions and complex variants with lower FDRFailed to detect some small indels (<50bp)>50 bp70
Users can identify, visualize, annotate and manually edit SVs
Can detect 58-80% of deletions in gold-standard sets in individual NA12878
CNVnatorSingle-end and paried-end sequencing dataStatistical analysis of mapping density based on mean-shift approach with additional refinements (multiple-bandwidth partitioning and GC correction) assumes uniformity of coverage across the genomeHigh sensitivity (86–96%)Misses CNVs created by retrotransposable elementsfrom a few hundred to mega bases2
Low FDR (3–20%)
High genotyping accuracy (93–95%)
High resolution in breakpoint discovery (
HMMShort-read sequencing dataUses an HMM by integrating both depth of coverage and mate-pair relationship to calculate emission probabilityDetects small deletions (200–2,000 bp) with low coverage (<10x)Hemizygous deletions are not modeled200–2,000 bp77
Exceeds 80% discovery rate at ≥3× depth of coverage for medium-size (400–800 bp) CNVs
BIC-seqShort-read sequencing dataRead-depth based Bayesian information criteria algorithm to detect CNVs based on minimizing the Bayesian information criterionDetects small CNVs (10 bp resolution) with high sensibility and true positive rateHarder to detect copy gains than loss40 bp–5.7 Mb90
Performs well for detecting large deletions (e.g., one-copy loss of size >50 kb)
80% of the calling validared by qPCR

Another example of high-throughput genotyping technology is exome sequencing, which is made possible by the rapid growth in DNA enrichment technologies (including genomic capture). Large numbers of disease-causing mutations so far have been found in coding regions or canonical splice sites (11), so high-throughput exome sequencing is often used as a powerful tool in human genetic and disease studies to identify new disease-causing mutations. Compared with whole genome sequencing, exome sequencing has more straightforward data analysis and interpretation for a reasonable cost. Recent studies have successfully identified disease-causing mutations using parallel exome sequencing technology (16, 25, 72).

With the availability of extensive exome sequencing data, CNV detection has shifted to utilize this resource. Traditional software and algorithms developed for CNV detection generally assume random, unbiased distribution of sequence reads and the continuity of search space. These assumptions no longer hold true regarding exome sequencing data since the probes used for exome capture have variable specificity and efficiency for the targeted coding regions. Several algorithms have been developed for detection of CNVs to compensate for this bias, e.g., SeqGene and ExomeCNV (see Table 3 and the Supplemental File). SeqGene uses a CBS algorithm for CNV calling though it requires a reference for the absolute calling of CNVs. In addition to CNV calling, SeqGene supports gene expression quantification, mutation and coverage visualization. ExomeCNV uses normalized DOC and BAF to infer CNV and loss of heterozygosity, though the resolution of CNV calling is largely dependent on probe design in exome sequencing.

SUMMARY OF ARRAY-BASED AND NEXT-GENERATION SEQUENCING BASED METHODS

Array-based platforms can provide a cost-effective means for CNV discovery. However, these platforms have relatively limited power to detect CNVs of smaller size and they have very rough break-point prediction. A wide range of analysis algorithms has been developed for array-based CNV detection, but a sizable difference exists among these methods regarding the number and size of CNVs detected. It is generally recommended to employ two or more analysis algorithms to the same data set and follow the confident consensus calls in multiple analyses in downstream studies. Despite these limitations, array-based platforms have led to the discovery of a number of disease-associated CNVs, and provide an easy-to-use platform for an initial CNV analysis of biological samples.

In contrast, next-generation sequencing-based methods are able to detect smaller SVs in the range of 10 to 2,000 bp with the capability of breakpoint prediction at single base resolution. In addition, such analysis has an obvious advantage of detecting inversions, intrachromosomal translocation, and de novo CNVs not present in the reference genome. Many algorithms have been developed to facilitate CNV detection, but these algorithms are hindered by several factors. First, it remains a problem to unambiguously map the typically short reads of the next-generation sequences (∼400 bp for Roche 454 platform and 2 × 100 bp for Illumina platform) to repetitive regions of the genome. An improvement in read-length in high-throughput sequencing technology will significantly improve both mapping quality and de novo genome assembling. Second, compared with array-based platforms, it is considerably more expensive to sequence a genome to 20× or greater coverage, which is the required depth of coverage for reliable calling of CNVs. Third, large-scale throughput is not feasible for sequencing. Currently, only a few projects are able to generate data at such a scale. Fourth, a computational pipeline for CNV detection on a large population scale would require a considerable hardware and software infrastructure. For example, since the launch of the 1000 Genomes Project (http://1000genomes.org), a massive data set has been generated. Only a few researchers have the computational power to analyze it.

FUTURE DIRECTIONS

To date, many CNVs have been associated with diseases [e.g., Charcot-Marie tooth disease (53), Angelman syndrome (56), Alzheimer's disease (73), and obesity (78)]. The repertoire of pathological CNVs is expanding, but a comprehensive map of all disease-causing CNVs is far from complete. Once obtained, a resource of this type would facilitate the development of improved CNV-specific, array-based platforms for analysis of copy number polymorphisms, which will likely mature into low-cost and efficient screening methods for clinical and research usage. However, due to the repetitive structure of the regions that are enriched for CNVs, a number of these structural variants will continue to pose a substantial challenge for current discovery and analysis methods, primarily because of the already high noise-to-signal ratio in array-based platforms and the relatively short-length reads produced by next generation sequencing. Improvements in CNV detection accuracy and sequencing read-lengths would enable the development of practical, accurate clinical methods.

New algorithms are developed at a fast pace to detect CNVs from array-based or direct sequencing-based platforms. Each one has its own merits and domain of greatest accuracy and efficiency. However, at present, there are large detection discrepancies among different platforms and analytic methods, indicating the need for ongoing comparative evaluations. Several groups have published comparative analyses of the performance and advantages of different methods they used to analyze identical sets of data, but such work is time-consuming, and, to date, such studies have limited themselves to only a few analytical platforms (4, 22, 88, 94). Accordingly, a comprehensive ranking of available methods is not yet feasible. Large-scale studies that make use of a set of reliable “gold standard” CNVs will help evaluate the most reliable and efficient CNV detection algorithms. Unfortunately, poor consistency exists among different gold standards (94). More rigorous effort should be focused on the construction of a superior set of CNV gold standards to facilitate ongoing algorithm and platform design. Efforts such as the 1000 Genomes Project already incorporate extensive SV validation at nucleotide resolution with precise locations in the genome (62). Thus, SVs identified through this project may provide an especially reliable data set that can be used in the development of more mature detecting platforms and algorithms for CNV studies.

Future efforts should focus on incorporating a wider range of analytical tools that can analyze both array-based and next-generation sequencing data. A thoroughly interconnected platform that implements both next-generation sequencing and low-cost array technology would provide a superior framework for CNV detection. For such platforms to work, one would start with a low-cost screen using an array platform. During this step, many well-tested algorithms or software packages will be employed to delimit a pool of suitably accurate tools that can be used. An improved gold-standard set of CNVs confirmed by multiple methods should be used to determine the most appropriate and efficient CNV-mining algorithms and software. Subsequently, an automated system will be connected to a high-throughput sequencing platform. This step will use the information obtained from the previous step to identify regions of the genome for further deep sequencing.

GRANTS

The authors acknowledge the financial support of the Innovation Center of Medical College of Wisconsin. W. Li is supported by a Kern Innovation Postdoctoral Fellowship from the Kern Innovation Foundation at Medical College of Wisconsin. M. Olivier is supported in part by National Human Genome Research Institute Grant P50HG-004952.

DISCLOSURES

No conflicts of interest, financial or otherwise, are declared by the author(s).

AUTHOR CONTRIBUTIONS

Author contributions: W.L. drafted manuscript; W.L. and M.O. edited and revised manuscript; W.L. and M.O. approved final version of manuscript.