A subset of human neocortical neurons harbors complex karyotypes wherein megabase-scale copy-number variants (CNVs) alter allelic diversity. Divergent levels of neurons with complex karyotypes (CNV neurons) are reported in different individuals, yet genome-wide and familial studies implicitly assume a single brain genome when assessing the genetic risk architecture of neurological disease. We assembled a brain CNV atlas using a robust computational approach applied to a new dataset (>800 neurons from 5 neurotypical individuals) and to published data from 10 additional neurotypical individuals. The atlas reveals that the frequency of neocortical neurons with complex karyotypes varies widely among individuals, but this variability is not readily accounted for by tissue quality or CNV detection approach. Rather, the age of the individual is anti-correlated with CNV neuron frequency. Fewer CNV neurons are observed in aged individuals than in young individuals.

We assembled a brain CNV atlas to evaluate how neuronal CNVs alter the genetic architecture of the neurotypical human cerebral cortex. A new dataset of 827 human cerebral cortical nuclei from 5 neurotypical individuals was combined with smaller published datasets () from 10 other neurotypical individuals. We developed an unbiased CNV detection approach based on population-level statistics and established a human neuronal CNV atlas with 507 CNVs. Initial analysis of the atlas identified substantial inter-individual variability in the frequency of neurons with complex karyotypes (CNV neurons), but also found support for the hypothesis () that some long genomic loci shape the genetic architecture of neurotypical human brains.

During the past decade, large CNVs have been recognized as major contributors to human genetic diversity (). At the population level, SNVs are collectively more numerous than CNVs, but CNVs affect an order of magnitude more genome sequence (∼10%), and some CNVs show evidence of positive selection during human evolution (). In individuals, de novo CNVs represent rare variants with a strong contribution to the genetic risk of schizophrenia, autism, and other neurological disorders (). Whereas the consequences of germline CNVs have been inferred from population-level studies, neuronal CNV studies to date have been underpowered to determine whether the genes affected by neuronal CNVs contribute to brain development, function, and disease.

Psychosis Endophenotypes International Consortium CNV and Schizophrenia Working Groups of the Psychiatric Genomics Consortium Contribution of copy number variants to schizophrenia from a genome-wide study of 41,321 subjects.

Every human neocortical neuron may contain private somatic variants. Single nucleotide variants (SNVs) are especially common, with hundreds per neuron reported () and with frequencies of >3,000 SNVs per neuron observed in aged individuals (). Endogenous mobile elements such as long interspersed nuclear element 1 (LINE1) retrotransposons are also active during brain development (). Reported frequencies of de novo mobile element events range from <1 to >7 per neuron (). Mobile element activity has also been linked to the generation of copy-number variants (CNVs) (). Whole and subchromosomal CNVs bring about complex karyotypes through the duplication or deletion of several megabases (Mb) of genomic sequence in a subpopulation of neocortical neurons (). Gene density in the human genome averages >10 genes per Mb; thus, by contrast to other somatic variants, Mb-scale CNVs almost always affect multiple genes. A reanalysis of published data () herein found an average of 63 genes affected per neuronal CNV.

Neocortical neurons are among the most diverse and longest-lived mammalian cells. The mammalian cerebral cortex is often put forward as a pinnacle of evolutionary complexity, and human-specific brain phenotypes are attributed to neocortical expansion during evolution (). Aberrant development and maturation of neocortical circuits are likewise associated with neuropsychiatric and neurodegenerative diseases (). Various approaches count ∼15–20 billion neurons and as many as 35 billion glia in the human cerebral cortex (). Single-cell transcriptomic approaches are beginning to comprehensively catalog human neuronal diversity () and identify new subtypes of human neurons (). After decades of debate, it is now clear that human neocortical neurons are not normally regenerated during the human lifespan (). With some exceptions (), neuronal cell types are also generally thought to be stable throughout life, but neuronal genomes are surprisingly labile.

The search for true numbers of neurons and glial cells in the human brain: a review of 150 years of cell counting.

CNV-affected loci may not be restricted to long genes. We assembled a comprehensive list of all of the genes affected in the brain CNV atlas and used PANTHER () to calculate enrichment statistics and plotting scripts from REViGO () to visualize these results. Gene Ontology categories associated with “sensory perception of smell” and “calcium-mediated signaling” were notably enriched in the atlas-wide gene set ( Figure 4 F). When assessed by age group, as in Figure 4 E, these enrichments were only detected in the aged groups ( Figures 4 G and 4H).

CNV neurons were rare in aged individuals, so we assessed whether the genetic architecture of CNV neurons may also change during the lifespan. We further tested each individual in the atlas for candidate gene enrichment using relevant thresholds for recurrent hotspots (see Method Details ) at stringent and lenient criteria. Significant corrected p values were not observed in the youngest individuals, but they were observed in aged individuals ( Figure 4 D). When individuals were pooled based on age groups, significant p values were also observed only in the most aged group ( Figure 4 E).

The assembled atlas identified 522 neural CNVs ( Table 1 ). We performed overlap and random permutation analyses to determine whether subsets of 93 candidate long genes ( Figure 4 A) were associated with CNVs more frequently than expected by chance. Genomic regions (i.e., bins) that accumulated CNVs (i.e., duplications and deletions) at an increased population-wide frequency compared to the rest of the genome (i.e., hotspots; Figure 4 B) were determined using different hotspot thresholds (see Method Details ). Enrichment was assessed by calculating 184 raw p values (8 candidate gene lists, 23 hotspot thresholds) from 10,000 permutations per gene list per hotspot threshold. After correcting for multiple hypothesis testing (Benjamini-Hochberg false discovery rate [FDR], 5% FDR cutoff), dataset-wide significance was observed for the entire candidate gene list in some cases, and, notably, with putative hotspots that include 3 of the 4 common candidate genes: GPC6, NRXN3, and RBFOX1 ( Figure 4 C).

(F–H) REViGO plots of enriched Gene Ontology (GO) terms for all of the data analyzed. The relative size of each category reflects significance; the largest groups have the lowest p values. GO enrichment determined using PANTHER analysis of CNV-affected genes in all neural data (F), and neurons from age groups 68–74 (G) and 81–95 years old (H).

(C–E) Enrichment results for hotspots from all cells (C), individuals (D), and age groups (E). p values < 0.05 are red.

Brain CNVs, like all non-V(D)J CNVs, occur because DNA repair is not perfect. Transcription and replication lead to DNA double-strand breaks (DSBs), which in turn sometimes lead to CNVs. Gene length increases susceptibility to transcriptional and replicative genomic stress. Genes encoded by >100 kb of genomic sequence (i.e., long genes) tend to be neuronally expressed genes with roles in neuronal connectivity and synaptic plasticity (). Long genes also overlap with DNA fragile sites, and replicative stress can lead to large CNVs that encompass these loci (). Likewise, transcriptional stress leads to DNA DSBs in neurons () and has a predominant effect on long gene transcript abundance (). Recent studies link these observations to DNA DSBs during mouse neurodevelopment () and motivate the hypothesis that somatic mutations affecting long genes mediate the functional consequences of brain somatic mosaicism ().

Our observation of fewer CNV neurons in aged individuals contrasts with the concept of genosenium (), which states that the accumulation of somatic mutations over one’s lifetime is associated with aging-related cellular and molecular phenotypes. Thus, we also analyzed the number of CNVs per CNV neuron. Most CNV neurons (62.8%) contained only 1 or 2 CNVs, but the average number of CNVs per CNV neuron in the atlas was 3.9 ( Figure 3 D). Aged CNV neurons, although rare, had more irregular karyotypes. For example, the 20-year-old individual had the highest percentage of CNV neurons (40% CNV neurons, 3.8 CNVs/CNV neuron), but the 49-year-old (11.1% CNV neurons) had the most CNVs/CNV neurons (10.3). Other individuals, such as the 24-year-old and the 86-year-old, also support this trend; 38.9% of neurons in the 24-year-old had CNVs with an average of 2.8 CNVs/CNV neuron, whereas only 4% of neurons in the 86-year-old had CNVs, but these CNV neurons averaged 5.0 CNVs. The individual with the highest mean CNV size (37.2 Mb) was 86 years old ( Figure S4 J); this was due in large part to 1 neuron containing 2 trisomies and 1 monosomy ( Figure S3 N). Genosenium, if true, may operate on different somatic mutations in distinct ways.

Anti-correlation between CNV neuron prevalence and individual age is roughly an order of magnitude more significant in the complete atlas (lenient: R= 0.5521, p = 0.00097, Figure 3 F; stringent: R= 0.3941, p = 0.0092, Figure S4 R). Despite narrower age ranges, CNV neuron prevalence was also anti-correlated with age in the published datasets (lenient: R= 0.5315, p = 0.011, Figure S4 S; stringent: R= 0.3524, p = 0.054, Figure S4 T). Potential confounding variables such as BIC scores or post-mortem interval showed no significant correlation with age ( Figure S5 ).

The general characteristics of CNVs were similar regardless of WGA approach or sex ( Figure 3 C–3E). The mean length of single-cell CNVs was 14.8 Mb across all of the datasets, which is notably larger than most CNVs observed in bulk, germline human genomes (1–10 kb;). As with PicoPLEX data, CNV size was variable among individuals ( Figures S4 J, S4K, S4M, S4N, S4P, and S4Q), but average CNV size was similar across WGA approaches ( Figures 3 C–3E). The average CNV size in GenomePlex libraries was 11.7 Mb, but it ranged from 4.1 to 15.9 Mb among individuals. The average CNV size in Strand-seq libraries was 13.2 Mb, but it ranged from 4.3 to 17.7 Mb among individuals. Similar but on average larger CNV sizes were observed at the stringent threshold ( Figures S4 I, S4L, and S4O). In contrast to a previous report (), no enrichment for LINE1 sequence was apparent in CNVs or CNV borders compared to chance, and equal rates of telomeric CNVs (∼14%) were observed in young and aged neurons (see Method Details ).

Four previous studies used different WGA and CNV detection approaches on 458 additional single neurons from 6 male and 5 female neurotypical individuals of different ages (). WGA was performed using GenomePlex (), a degenerate oligonucleotide-primed (DOP)-PCR-based approach, or Strand-seq (), a “pre-amplification free” approach. We harmonized these with our PicoPLEX dataset (e.g., alignment to hg19, exclusion of outlier bins in Figures S4 A and S4B, BIC thresholds in Figures 3 A, 3B, S4 G, and S4H). GenomePlex and Strand-seq datasets displayed distinct outlier bin profiles. Analysis of RepeatMasker (University of California, Santa Cruz [UCSC]) genomic features in outlier bins found a depletion (log2 fold change 0.47) for short interspersed nuclear element (SINE) and simple repeat features (Wilcoxon, p < 2.2E–16) compared to remaining bins. BIC thresholds removed 11/225 GenomePlex cells (4.9%) and 41/233 Strand-seq cells (18.0%) from further analysis. The CN state thresholds reported above ( Figure 1 F) were inclusive of all of the WGA libraries passing BIC cutoffs. To further test the effectiveness of outlier bin removal, BIC cutoffs, and CNV thresholds, we re-analyzed split-amplification GenomePlex libraries from. One split-sample exceeded the BIC cutoff; the other with bad bins removed showed perfect concordance using our analysis pipeline ( Figures S4 C–S4F).

(C–E) CNV attributes including size (C), number per cell (D), and percent genome coverage (E) were similar regardless of WGA approach.

The 474 neuronal nuclei we analyzed represent the largest CNV dataset of neurotypical human brains generated by a single laboratory using a single WGA approach to date. We identified 62 CNV neurons (13.1%) with a mean CNV size of 16.8 Mb. We find that both neuronal and non-neuronal genomes harbor Mb-scale CNVs, which is consistent with previous reports (). The frequency (4%–23.1%) of complex karyotypes was more variable among neurons than among non-neuronal cells (4.7%–8.7%) from the same individuals ( Figure 2 D). CNV frequency in non-neuronal cells is similar to that reported in other somatic cells (e.g., keratinocytes). Relative to non-neuronal CNVs ( Figures S3 B, S3C, S3E, S3F, S3H, and S3I), neuronal CNVs are larger ( Figure 2 E), occur in greater numbers ( Figure 2 F), and affect more of the genome (1.3%–6% in neurons, 0.2%–0.8% in non-neurons; Figure 2 G). Similar ratios of neuronal to non-neuronal CNVs were observed at stringent thresholds, but mean differences in non-euploid CN states were amplified ( Figures S3 K– S3M). A statistically significant linear decline in CNV neuron abundance was observed with age at both CN state thresholds (lenient: R= 0.9224, p = 0.0094, Figure 2 H; stringent: R= 0.9434, p = 0.0058, Figure S3 O).

Large CNVs inevitably alter the CN state of a brain-expressed gene, as more of the genome is expressed in neurons than in other cell types (). Our analysis ( Table 1 ) identified 310 CNVs in 70 of 589 neural genomes (11.9%). CNVs ranged in size from 2.9 to 159.1 Mb (mean = 16.5 ± 20.1 Mb; Figure 2 A). With stringent criteria that are prone to false-negatives, we still identify 168 CNVs (mean = 18.0 ± 22.3 Mb) in 42 neural nuclei (7.1%) ( Figure S3 A). CNVs were detected in each individual examined and, given their size and frequency ( Figures 2 B and S3 D), represent a clear contribution to the genetic architecture of the brain ( Figures 2 C and S3 G). We note that much smaller percentages of phenotypically distinct cells can bring about focal epilepsies () and are essential for normal brain function (e.g., adult-born dentate granule neurons;).

(E–G) CNVs have an increased impact on genetic architecture in neuronal genomes compared to non-neuronal genomes, as measured by size (E), number per cell (F), and percent genome coverage (G).

(A–C) Megabase-scale neuronal CNVs are observed across the human lifespan (ages 0.36–95 years) with varying size (A), number per cell (B), and percent genome coverage (C).

Additional NULL and ALT model simulations found that BIC cutoffs and CN state thresholds protect against false-positive CNVs brought about by simulated WGA-induced noise. We tested 6 neuronal WGA libraries in the BIC Figure S2 B), all of the simulated cells from neuron 10 were excluded by our BIC threshold as a consequence of strong autocorrelated noise (ϕ = 0.663; atlas mean ϕ = 0.1394). In the remaining 1,000 simulated cells, only 41 and 1 small (3.9 ± 0.7 Mb) segments, respectively, passed the lenient and stringent CN state thresholds. Thus, upper-bound estimates of false-positive CNV neuron detection rates are ∼3% (lenient threshold) and <0.1% (stringent threshold). All ALT model simulations ( Figure S2 C) passed the BIC threshold, and only ∼1% (14/1,200) of simulated cells contained DNAcopy segments that passed the lenient CN state thresholds without overlapping the synthetic CNV; these false-positive CNVs were also small (6.0 ± 2.7 Mb). As observed for monosomy X in the PicoPLEX dataset, improved detection of true-positives (synthetic CNVs) in ALT model cells was observed at lenient (95%) relative to stringent (81.8%) CN state thresholds.

Analysis of 827 PicoPLEX datasets identified additional population-based filters. First, we identified 101 genomic bins that routinely deviated from median read depth and confound segmentation ( Figure S1 K); these were excluded before the subsequent analysis. Second, we computed the 95percentile of the low BIC score Gaussian distribution (−2.21) to establish an objective filter for the highest quality single-cell datasets ( Figure 1 E). Third, identified segments displayed 2 modes near integer-like copy number (CN) states of 2 (euploid, mean = 1.97) and 1 (deletion, mean = 1.12) and a heavy tail near the CN state of 3 (duplication, mean = 2.92) ( Figure 1 F). Lenient and stringent CN state thresholds were established, respectively, at a 2-tailed p value ≤ 0.05 (<1.34 for a deletion and >2.60 for a duplication) and ≤0.01 (<1.14 for a deletion and >2.80 for a duplication). The BIC threshold produced a final dataset of 589/827 (71.2%) neural nuclei, including 474/681 (69.6%) neuronal (NeuN) nuclei and 115/146 (78.8%) non-neuronal (NeuN) nuclei ( Figures S1 L and S1M). The stringent CN state threshold identified monosomy X in 52.8% of male neural nuclei, while the lenient threshold identified 99.5% of these true-positives ( Figure S2 A).

We further assessed segmentation parameters with 2 in silico models ( Figure S1 F) built from the read-depth statistics (DNAcopy segments, Gaussian noise, and autocorrelated noise [ϕ]) of each test cell. NULL model simulations contained no CNVs but did contain strong autocorrelated noise, a known source of DNAcopy false-positives (). Alternative (ALT) model simulations harbored synthetic CNVs (i.e., DNAcopy calls) with residual autocorrelated noise. Segmentation outputs matched the ALT model, but not the NULL model, simulations well ( Figures S1 H and S1I). Concordance with ALT model cells (i.e., highest sensitivity and specificity) was also associated with the lowest BIC scores ( Figure S1 J).

Parameter space in DNAcopy () is defined by 3 user-tunable parameters: significance threshold (alpha), minimum number of genomic bins required to call a copy number state change (min.width), and the number of SDs between the levels of copy-number states to maintain the copy-number state change (undo.SD). We assessed DNAcopy parameters using Bayesian information criterion (BIC) (), a log-likelihood estimate of the performance of the segmentation algorithm, for dozens of parameter combinations. The lowest, or near-lowest, BIC scores for each test cell library were observed at alpha = 0.001, min.width = 5, and undo.SD = 0 ( Figures 1 D and S1 G); these parameters identified monosomy X in all 6 male cells, and trisomy 21 in only the fibroblast. We also observed that minimum BIC scores were lower in WGA libraries with less overall bin-to-bin variation in read-depth ( Figures 1 C, S1 A–S1E, and S1G), suggesting that low BIC scores represent an additional quality control filter.

Neuronal CNV detection is inherently challenging because one cannot know the state of the genome before WGA, and neuronal CNVs are rarely clonal. For this reason, CNV calling approaches have been used conservatively, with bias toward avoiding type I errors at the risk of type II errors. We used a test dataset of 6 representative PicoPLEX WGA libraries (5 neurons and 1 trisomy 21 fibroblast) with varied subjective quality ( Figures 1 C and S1 A–S1E) to optimize our approach.

Single neuronal and non-neuronal nuclei were isolated using fluorescence-activated nuclei sorting (FANS) of the prefrontal cortex from 5 non-diseased (neurotypical) male individuals aged 0.36, 26, 49, 86, and 95 years ( Figure 1 A; Table 1 ). Whole genome amplification (WGA) was performed using PicoPLEX (Rubicon Genomics), an approach that is similar to multiple annealing and loop-based amplification (MALBAC) (), which produces Illumina-compatible libraries with 48 unique barcodes. Before library pooling and sequencing on Illumina platforms, we found that ∼60% of WGA reactions produced a measurable product. Single-end or paired-end sequencing (50, 75, or 100 bp) of 48 pooled libraries on Illumina HiSeq Rapid platforms or of smaller pools (<17 libraries) on MiSeq platforms routinely produced >1 million reads per library after duplicate removal. Neither paired-end nor longer read sequencing altered the data quality. Reads were aligned to hg19 and read depth was calculated across 4,505 genomic bins, each containing ∼500 kb of uniquely mappable sequence (mean bin size = 687 ± 1,072 kb). This approach ( Figure 1 B) generated >134.2 Gb of genomic sequence from 827 male single-cell genome libraries (162.3 ± 115.1 Mb/cell).

(E) Histogram of BIC scores for new dataset. Gaussian distributions (black and red) were used to establish BIC cutoff (−2.21).

(C) CNV profile of test Neuron 1. Read depth-derived CN values of genomic bins are colored alternately (green, blue) by chromosome. Red line indicates DNAcopy segmentation.

Discussion

We assembled a brain CNV atlas from 1,285 single brain nucleus libraries, the accumulated work of our laboratory and 3 other laboratories. Development of a single, robust computational pipeline that both protected against false-positive CNV calls and minimized false-negative calls was essential. First, we identified and excluded a small group (101/4,505) of genomic bins that were consistently non-euploid (i.e., outliers) across our 5 analyzed individuals. Second, we applied an objective BIC cutoff to exclude WGA libraries that would be the most prone to aberant segmentation calls. Third, we evaluated published datasets from 4 other sources that used different WGA methods. Population-level statistics were generally similar, but each WGA approach identified unique outlier bins and slightly different BIC cutoffs. Fourth, CN state distributions centered near CN = 1, 2, and 3 were apparent across all of the WGA approaches, so we used 2-tailed p values (lenient = 0.05, stringent = 0.01) from pan-method DNAcopy segmentation calls to define 2 sets of CN state thresholds. Fifth, we showed that these filters (outlier bin removal, BIC cutoffs, and CN state thresholds) effectively eliminate false-positive CNV calls (15/2,400; <1%). Lenient CN state thresholds are mildly (<3%) more prone to short (∼<6 Mb) false-positive calls, but far better at identifying true-positives (>95%). Our central findings are unchanged when only the largest (>6 Mb) CNVs are considered. The brain CNV atlas comprises 879 neuronal and 115 non-neuronal genomic libraries; Mb-scale CNVs were identified in 129 neurons and 8 non-neurons.

The most salient feature of the brain CNV atlas is an anti-correlation between the age of an individual and the percentage of CNV neurons in the frontal cortex of that individual. This finding was significant in our new PicoPLEX dataset and in the published GenomePlex and Strand-seq datasets, and is highly significant using all of the available data (p = 0.00097). By contrast, the initial assessment of CNV location finds evidence for the enrichment of a subset of long genes and neurally associated Gene Ontology categories only in aged brains. Given the enrichment of these CNVs in aged, not young, neurons, CNVs affecting some genomic loci may be more compatible with neural survival than others. We found similar rates of CNV non-neurons at different ages; however, it will be interesting to determine whether other long-lived cells (e.g., cardiomyocytes) show a similar change in mosaic composition during aging.