The macronuclear genome of the ciliate Oxytricha trifallax displays an extreme and unique eukaryotic genome architecture with extensive genomic variation. During sexual genome development, the expressed, somatic macronuclear genome is whittled down to the genic portion of a small fraction (∼5%) of its precursor “silent” germline micronuclear genome by a process of “unscrambling” and fragmentation. The tiny macronuclear “nanochromosomes” typically encode single, protein-coding genes (a small portion, 10%, encode 2–8 genes), have minimal noncoding regions, and are differentially amplified to an average of ∼2,000 copies. We report the high-quality genome assembly of ∼16,000 complete nanochromosomes (∼50 Mb haploid genome size) that vary from 469 bp to 66 kb long (mean ∼3.2 kb) and encode ∼18,500 genes. Alternative DNA fragmentation processes ∼10% of the nanochromosomes into multiple isoforms that usually encode complete genes. Nucleotide diversity in the macronucleus is very high (SNP heterozygosity is ∼4.0%), suggesting that Oxytricha trifallax may have one of the largest known effective population sizes of eukaryotes. Comparison to other ciliates with nonscrambled genomes and long macronuclear chromosomes (on the order of 100 kb) suggests several candidate proteins that could be involved in genome rearrangement, including domesticated MULE and IS1595-like DDE transposases. The assembly of the highly fragmented Oxytricha macronuclear genome is the first completed genome with such an unusual architecture. This genome sequence provides tantalizing glimpses into novel molecular biology and evolution. For example, Oxytricha maintains tens of millions of telomeres per cell and has also evolved an intriguing expansion of telomere end-binding proteins. In conjunction with the micronuclear genome in progress, the O. trifallax macronuclear genome will provide an invaluable resource for investigating programmed genome rearrangements, complementing studies of rearrangements arising during evolution and disease.

The macronuclear genome of the ciliate Oxytricha trifallax, contained in its somatic nucleus, has a unique genome architecture. Unlike its diploid germline genome, which is transcriptionally inactive during normal cellular growth, the macronuclear genome is fragmented into at least 16,000 tiny (∼3.2 kb mean length) chromosomes, most of which encode single actively transcribed genes and are differentially amplified to a few thousand copies each. The smallest chromosome is just 469 bp, while the largest is 66 kb and encodes a single enormous protein. We found considerable variation in the genome, including frequent alternative fragmentation patterns, generating chromosome isoforms with shared sequence. We also found limited variation in chromosome amplification levels, though insufficient to explain mRNA transcript level variation. Another remarkable feature of Oxytricha's macronuclear genome is its inordinate fondness for telomeres. In conjunction with its possession of tens of millions of chromosome-ending telomeres per macronucleus, we show that Oxytricha has evolved multiple putative telomere-binding proteins. In addition, we identified two new domesticated transposase-like protein classes that we propose may participate in the process of genome rearrangement. The macronuclear genome now provides a crucial resource for ongoing studies of genome rearrangement processes that use Oxytricha as an experimental or comparative model.

Funding: The authors acknowledge the National Human Genome Research Institute (NHGRI) grant U54HG003079 to R.K.W. and the National Institute of General Medical Sciences (NIGMS) for funding the Princeton Sequencing Core's Illumina data sequence management interface, htseq.princeton.edu (GM071508) and for grant GM59708 to L.F.L. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

We report on the assembly and analysis of the first Oxytricha macronuclear genome, from the reference JRB310 strain. During and after assembly, we have addressed a number of challenges arising from the unusual structure of this genome, which we discuss. We focus on the most interesting and unique biological characteristics of this genome and place them in the context of the characteristics of the other sequenced ciliate macronuclear genomes.

As a consequence of the extensive fragmentation of the Oxytricha macronuclear genome, each macronucleus possesses tens of millions of telomeres, an abundance that enabled the first isolations of telomere end-binding proteins [27] , [28] . Oxytricha also has micronuclear transposons bearing telomeric repeats (C 4 A 4 ) that resemble those of nanochromosomes. These telomere-bearing elements, or TBE transposons [29] , play an important role in macronuclear genome development [30] . The exact site of telomere addition may vary for some nanochromosome ends [31] and is followed by a roughly 50 bp subtelomeric region of biased base composition with an approximately 10 bp periodicity of the bias ( Figure 3 ) [32] , [33] .

The phylogeny represents the bootstrap consensus of 100 replicates from PhyML (with the HKY85 substitution model) based on a MUSCLE multiple sequence alignment of 18S rRNA genes from seven ciliates (Oxytricha trifallax—FJ545743; Stylonychia lemnae—AJJRB310497; Euplotes crassus—AJJRB310492; Nyctotherus ovalis—AJ222678; Tetrahymena thermophila—M10932; Ichthyophthirius multifiliis—IMU17354; and Paramecium tetraurelia—AB252009) rooted with two other alveolates (Perkinsus marinus—X75762 and Plasmodium falciparum—NC_004325). All bootstrap values are ≥80, except for the node between Nyctotherus and Oxytricha/Stylonychia/Euplotes, which has a boostrap value of 60. Euplotes and Nyctotherus both have nanochromosomes, like Oxytricha. Other than the genome statistics for Oxytricha trifallax, which were determined in this study, table statistics were obtained from the following sources: a - [2] , b - [22] , [116] , c - [117] , d - [99] , e - [94] , f - [56] (the number of chromosomes is an estimate), g - [118] , h - [119] , i - [120] , j - [64] (for a single stage of the Ichthyophthirius life cycle), k - [121] , l - [69] , m - [122] . Table statistics for Perkinsus marinus are for the current assembly deposited in GenBank (GCA_000006405.1).

Two fundamental differences distinguish Oxytricha's macronuclear chromosomes from those of Tetrahymena and Paramecium: Oxytricha's chromosomes are tiny (“nanochromosomes,” with a mean length ∼3.2 kb reported in this study), each typically encoding just a single gene with a minimal amount of surrounding non-protein-coding DNA [7] , and they are differentially amplified ( Figure 2 and 3 ) [8] , [9] . In some cases, alternative fragmentation of macronuclear-destined micronuclear DNA produces different nanochromosome isoforms ( Figure 2 ) 10 – 12 , which may be present at very different levels of amplification (differing by as much as 10-fold [13] ). Gene expression and nanochromosome copy number may be moderately correlated [14] . Macronuclear chromosomes in all the model ciliates segregate by amitosis during cellular replication (without a mitotic spindle) [15] ,—a process that may lead to allelic fixation 17 – 20 . In ciliates with nanochromosomes, major fluctuations of nanochromosome copy number [8] may arise, since copy number is unregulated during normal cellular replication [21] . Theoretical models propose that these fluctuations are a cause of senescence in these ciliates [22] . In contrast to the lack of copy number regulation during cellular replication, both genetic [23] – [25] and epigenetic mechanisms [9] , [26] may influence chromosome copy number during sexual development in ciliates.

During conjugation of Oxytricha cells, segments of the micronuclear genome (MDSs) are excised and stitched together to form the nanochromosomes of the new macronuclear genome, and the remainder of the micronuclear genome is eliminated (including the IESs interspersed between MDSs). The old macronuclear genome is also degraded during development. The segments that are stitched together may be either in order (e.g., forming nanochromosome 1, on the left) or out of order or inverted (e.g., forming the two forms of nanochromosome 2), in which case they need to be “unscrambled.” Two rounds of DNA amplification produce nanochromosomes at an average copy number of ∼1,900 [2] . Alternative fragmentation of DNA during nanochromosome development may also occur, irrespective of unscrambling, giving rise to longer (2a) and shorter (2b) nanochromosome isoforms. The mature nanochromosomes are capped on both ends with telomeres.

In the model ciliates Oxytricha trifallax, Tetrahymena thermophila, and Paramecium tetraurelia, varying amounts of micronuclear DNA are deleted (including the “internally eliminated sequences,” or IESs, interspersed between “macronuclear destined sequences,” or MDSs) during conjugation or autogamy (two forms of sexual development) to give rise to the information-rich macronuclear genome ( Figure 1 ). A much larger fraction of the Oxytricha micronuclear genome—∼96% of the micronuclear complexity [2] —is eliminated during the macronuclear formation than in the oligohymenophoreans, Tetrahymena and Paramecium (which both eliminate ∼30% of their micronuclear genomes [4] , [5] ). The most remarkable difference in macronuclear development between Oxytricha and the two oligohymenophoreans is that the micronuclear-encoded MDSs that give rise to the macronuclear chromosomes may be nonsequential, or even in different orientations in the micronuclear genome [6] . Consequently, unlike the oligohymenophoreans, Oxytricha needs to unscramble its micronuclear genome during macronuclear development.

Oxytricha trifallax is a distinctive ciliate [1] —an ancient lineage of protists named for their coats of cilia. Like all ciliates, Oxytricha has two types of nuclei: a micronucleus, a germline nucleus that is largely transcriptionally inactive during vegetative growth, and a macronucleus, which is the transcriptionally active somatic nucleus [2] . Compared to the micronucleus, Oxytricha's macronucleus is massively enlarged due to ∼2,000-fold [2] amplification resulting from two rounds of DNA amplification [3] during development.

Results and Discussion

Macronuclear Genome Assemblies To assemble the Oxytricha macronuclear genome for the type strain—JRB310 [1]—we chose to build upon three assemblies, from ABySS [34], IDBA [35], and PE-Assembler [36]/SSAKE [37], based on Illumina sequences, and supplemented by a Sanger/454 assembly. To combine these assemblies, we developed a specialized meta-assembly pipeline (see Materials and Methods and Figure S1). Current genome assembly strategies for second-generation sequence data often employ multiple, hybrid strategies to overcome the experimental biases leading to low sequence coverage in particular genomic regions and repetitive DNA [38]. Since Oxytricha's macronuclear genome was expected to have a low repeat content [2], repetitive DNA was expected to be a relatively insignificant issue, and thus even greedy genome assemblers were able to produce useful preliminary assemblies. However, unlike conventional genomes, the Oxytricha macronuclear genome provides assembly challenges by virtue of its fragmented architecture, variable processing (“alternative chromosome fragmentation”), and nonuniform nanochromosome copy number. We resolved these challenges during and after assembly. The initial 454/Sanger genome assemblies contained a mixture of bacterial DNA, mitochondrial DNA, and up to two additional alleles other than those expected from the strain we originally proposed to sequence (JRB310) due to accidental contamination by a commonly used strain in our lab (JRB510—a complementary mating type), whereas the Illumina assemblies were produced from purified macronuclear DNA from the type strain (JRB310) alone. Given these contamination issues, we built our final assembly with the Illumina assemblies as the primary data source, rather than the 454/Sanger assemblies, to maintain the purity of our final assembly. This excluded virtually all bacterial and mitochondrial contamination in our final assembly since very few contigs in the Illumina assemblies derive from these sources (sucrose gradient purification of macronuclei eliminated almost all such contaminants) that could potentially be extended by the 454/Sanger data. We also kept JRB510 allelic data to a minimum in our final assembly, (i) by preferring best extensions, which were most likely to be from the more similar JRB310 derived contigs or reads, either from the Illumina assemblies or from the Sanger/454 assemblies and raw Sanger data (see Materials and Methods), and (ii) by the sequence consensus majority rule (the most abundant base at each position from the assembled contigs) of the CAP3 assembler [39] used to combine the three Illumina assemblies versus one 454/Sanger assembly during contig construction. The conclusion that our final assembly strategy succeeded in keeping JRB510 allelic information to a minimum is supported by matches to the three pure JRB310 Illumina starting assemblies. Our final assembly is 82.0% covered by identical BLAT matches ≥100 bp long to one of these pure JRB310 assemblies and 92.5% covered by 99.5% identity, ≥100 bp BLAT matches to these assemblies (note that consensus building by CAP3 may result in alternating selection of JRB310 polymorphisms from the original assemblies, and hence even meta-contigs assembled from the pure JRB310 assemblies may have BLAT matches that differ from all the original assemblies). We chose to keep alleles apart by applying moderately strict criteria for merging during our meta-assembly (e.g., by merging contigs with overlaps at least 40 bp long and ≥99% identical with CAP3 [39]; see Materials and Methods). However, in order to maximize the number of complete nanochromosomes in our assembly, we collapsed some alleles (i.e., producing “quasi-nanochromosomes” derived from two alleles; see “Extensive Genome Homozygosity and High SNP Heterozygosity”). Merging of contigs is also complicated by alternative fragmentation, which affects ∼10% of the nanochromosomes and may either result in collapsing or splitting of nanochromosome isoforms (see “Extensive Alternative Nanochromosome Fragmentation”). We discriminated between homozygous and heterozygous nanochromosomes after assembly (see the next section). We have not attempted to determine the haplotypes of the heterozygous contigs due to computational complications arising from both alternative nanochromosome fragmentation and variable representation of alleles (which need not be 1∶1). A comparison of the Oxytricha macronuclear genome assemblies and meta-assembly is given in Table 1 (also see Tables S11–S17 for the progressive improvements in the genome assembly through the successive steps of our meta-assembly approach shown in Figure S1). Since the size selection used in the construction of our paired-end (PE) sequence library results in poor sequence coverage for a span of approximately 160 bp, roughly 100 bp from the telomeric ends (see Materials and Methods), the incorporation of single-end (SE) sequence data allowed ABySS to assemble far more contigs with telomeric sequences than either IDBA or the PE-Assembler/SSAKE assembler combination, neither of which could use SE and PE sequences simultaneously (Table 1). The ABySS assembly is larger (78.0 Mb) than the other assemblies (47.8 Mb for PE-Asm/SSAKE and 57.7 Mb for IDBA) and also more complete, as evidenced by a higher fraction of reads that can be mapped to this assembly (Table 1). The ABySS assembly also incorporates a substantially higher proportion of telomeric reads than the other two assemblers (91.8% for ABySS versus 45.4% for PE-Asm/SSAKE and 42.4% for IDBA) and contains a larger proportion of telomere-containing contigs (66.5% versus 18.3% for PE-Asm/SSAKE and 8.6% for IDBA) and longer contigs (mean length of 2,273 bp for ABySS versus 2,090 bp for PE-Asm/SSAKE and 1,204 bp for IDBA). Consequently, the ABySS assembler produced almost an order of magnitude more full-length nanochromosomes than the other two assemblers. Even though the ABySS assembly incorporates more telomeric reads than the other assemblies, it also excludes a higher proportion of telomeric reads than nontelomeric reads (92% telomeric PE reads versus 98% total PE reads map to the assembly; telomeric reads comprise ∼13% of all the reads). While the ABySS assembly appears to be the most complete, the majority of its contigs (81.3%) are still missing one or both telomeres. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Table 1. Comparison of Oxytricha macronuclear genome assemblies. https://doi.org/10.1371/journal.pbio.1001473.t001 The initial meta-assembly of the ABySS, IDBA, and PE-Asm assemblies yielded a modest improvement in the total number of full-length nanochromosomes relative to ABySS alone, with the ratio of full-length nanochromosomes to contigs increasing from 21% to 24% of the total number of contigs. Since the meta-assembly was still highly fragmented, and our aim was to assemble full-length nanochromosomes with complete genes, we developed a strategy that consisted of two rounds of extension of nontelomeric contig ends and reassembly (see Materials and Methods). This strategy produced an assembly where the majority (77%) of the contigs had both 5′ and 3′ telomeres. For the final meta-assembly, the average contig length (2,982 bp) is longer than that of any of the original assemblies (2,273 for ABySS versus 2,090 for PE-Asm/SSAKE and 1,204 for IDBA) and the read coverage is as high as or higher than the most complete ABySS assembly (98.0% coverage for PE reads for ABySS and the final assembly; 88.2% SE read coverage for ABySS and 88.5% SE read coverage for the final assembly). However, a larger fraction of telomeric reads than nontelomeric reads still do not map to the final assembly (e.g., 9.0% of telomeric PE reads versus 2.0% of all PE reads do not map to the assembly), indicating that some telomeric regions are still missing from the final assembly.

Extensive Genome Homozygosity and High SNP Heterozygosity Since there are currently no published effective population size estimates for Oxytricha trifallax, we wanted to obtain an estimate from allelic diversity of the macronuclear genome. Furthermore, current estimates of effective population size for other free-living model ciliates, Paramecium and Tetrahymena, differ [40]–[42], so additional estimates from other species will be necessary to determine if there are general trends in population size within ciliates. Given our assembly conditions, we expect many allelic sequences to co-assemble, but visual inspection of reads mapped to the final assembly suggested that a substantial fraction of the genome is homozygous. A trivial explanation for this observed homozygosity is that the micronuclear genome regions (MDS) that form the nanochromosomes are homozygous, however it is also possible that a combination of other factors may be responsible for some of the observed homozygosity. These factors include both nanochromosomal allelic drift, arising from stochastic nanochromosome segregation during amitosis (during normal cellular replication) and nanochromosomal allelic selection, both of which could lead, in principle, to haplotype fixation (also known as “allelic assortment”), as well as allelic biases introduced during conjugation. Some well-studied macronuclear nanochromosomes, including the 81 locus [43], and the nanochromosomes encoding the telomere end-binding proteins α and β (Contig22209.0 and Contig22260.0) are homozygous in the Oxytricha JRB310 strain upon which our reference macronuclear genome assembly is based. Knowledge of the fraction of homozygous and heterozygous nanochromosomes is also necessary to obtain a reasonable estimate of macronuclear genome size. To determine nanochromosomal homozygosity, we focused on nonalternatively fragmented nanochromosomes in order to avoid both ambiguous alignments and possible false identification of heterozygosity due to the presence of alternative telomere locations. Of the nonalternatively fragmented nanochromosomes, 66% (7,487 out of 11,297) had no substantial BLAT [44] nonself matches (≥100 bp and ≥90% identical; default BLAT parameters) to any other contig in the final assembly (“matchless” nanochromosomes). Matchless nanochromosomes with variants present at ≥0.5% of the positions were considered heterozygous (see Materials and Methods for our precise definition of heterozygosity); otherwise they were considered to be homozygous. Note that our read mapping cutoff may reduce polymorphism estimates by filtering out reads with more polymorphisms (see Text S1, “Read Mapping Rationale”). These criteria overestimate low frequency variants at lower coverage sites and underestimate higher frequency variants at lower coverage sites (Figure S2) but comprise a small proportion of all the sites (e.g., 11.8% of sites identified as heterozygous are at 20–40× coverage). By these criteria, the well-characterized Oxytricha actin I [45],[46] nanochromosome (Contig19101.0) was correctly classified as heterozygous, with variants at 0.89% of the examined positions (12/1,350). Mismapped reads do not affect the identification of the heterozygous sites in this nanochromosome, since only two of the reads mapped to this nanochromosome map to any of the other contigs. For the actin I nanochromosome, the frequency of allelic variants varies from just over 5% (three positions) to 13.1% (mean 8.7%) corresponding to a roughly 1∶11 ratio of the minor∶major allele. Of the matchless nanochromosomes, 63% have sequence variants identified at 0%–0.5% of positions; hence, 37% of matchless nanochromosomes are classified as heterozygous by this criterion. Read mapping appears to be adequately sensitive to detect most SNPs (single nucleotide polymorphisms), since the mean number of reads per bp mapped to heterozygous and homozygous nanochromosomes is almost identical (see “Nanochromosome Copy Number Is Nonuniform”). Approximately 42% of nanochromosomes are homozygous when the proportion of putative homozygous nanochromosomes is calculated from the total number of matchless and matched nanochromosomes. The high levels of macronuclear genome homozygosity agree with preliminary observations of micronuclear sequence data for this strain (Chen et al., unpublished), suggesting that the majority of nanochromosomal homozygosity may derive from homozygosity in the micronuclear genome, rather than other possible factors (allelic assortment and/or developmental biases). Once the micronuclear genome is more complete, it will be possible to assess how much these factors have contributed to the observed homozygosity. Nevertheless, the abundance of homozygous nanochromosomes in the final assembly (∼42%) suggests that the wild-isolate JRB310 [47] may actually be substantially inbred and that this inbreeding arose at its source. Deleterious inbreeding effects may contribute to the complexity of Oxytricha trifallax mating types [47]. It will be interesting to determine whether the two “promiscuous” Oxytricha strains (JRB27 and JRB51 [47]) that mate with the broadest set of other mating types are less inbred. Sequence polymorphisms are abundant in the Oxytricha macronuclear genome: excluding the homozygous nanochromosomes (which may have arisen from inbreeding), the mean SNP heterozygosity is 4.0% (SD = 1.8%; Figure 4). From mapped reads, for heterozygous matchless nanochromosomes, mean SNP heterozygosity is 3.1% (SD = 1.3%), and for heterozygous matched nanochromosomes, it is 4.6% (SD = 1.9%; Figure 4). For alignments of heterozygous matched nanochromosomes (with BLAT matches ≥100 bp and ≥90% identical to each other) produced by MUSCLE (for nanochromosome pairs where one of the nanochromosomes is no more than 10% longer than the other and for alignments with ≤15% differences), mean SNP heterozygosity is 3.0% (SD = 2.5%). These estimates of SNP heterozygosity indicate that assembly has masked a substantial amount of allelic variation. Similar statistics were obtained for the subset of the final assembly's nanochromosomes that were present in the pure JRB310 strain ABySS assembly (e.g., heterozygous matchless mean SNP heterozygosity is 2.8%, SD = 1.2%, and for MUSCLE alignments of heterozygous matched nanochromosomes, mean heterozygosity is 3.2%, SD = 2.9%). Hence it is unlikely that any residual JRB510 strain allelic information in our final assembly has had a considerable effect upon our estimates of heterozygosity. Nevertheless, these are first estimates from a complex genome assembly, complicated by homozygosity due to potential inbreeding, and hence inferences based upon them (including our subsequent population size estimate) should be treated with caution until better estimates from the micronuclear genome and additional strains become available. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 4. Nanochromosomal SNP heterozygosity. The green histogram (left) corresponds to SNP heterozygosity estimated from mapped reads (see Materials and Methods) for “matchless” nanochromosomes (which have no non-self contig matches to the genome assembly) and includes homozygous nanochromosomes. The red histogram (right) corresponds to SNP heterozygosity estimated from mapped reads for “matched” nanochromosomes. The orange histogram (center) corresponds to SNP diversity assessed from pairwise alignments of matched nanochromosomes. The smallest bin is 0–0.005 (0%–0.5%) heterozygosity. https://doi.org/10.1371/journal.pbio.1001473.g004 At 4-fold synonymous sites from heterozygous nanochromosomes with matches (see Materials and Methods), the mean SNP heterozygosity is 8.3% (SD = 9.4%). This underestimates sequence diversity at 4-fold synonymous sites, since pairwise alignments of contigs underestimate SNP heterozygosity at all sites (see previous paragraph). If we apply a correction for the missing SNP heterozygosity based on our overall estimate of SNP heterozygosity, we obtain an estimate of 11.1% mean SNP heterozygosity at 4-fold synonymous sites [8.3%×(4.0%/3.0%)]. This 4-fold synonymous site SNP heterozygosity is very high—even higher than the current SNP heterozygosity record holder, Ciona savignyi, which has 8.0% 4-fold synonymous site mean SNP heterozygosity [48]. These high levels of SNP heterozygosity suggest that Oxytricha trifallax has a large effective population size. Assuming a mutation rate μ ∼10−9 per base per generation, as in Snoke et al. 2006 [42], for nucleotide diversity at 4-fold synonymous sites and π 4S = 4N e μ, we estimate an effective population size of 2.6×107. This effective population size is on the same order estimated for P. tetraurelia (using silent site diversity, π S , which will yield a smaller population size estimate than one based on π 4S ) [42]. However, this estimate of the P. tetraurelia effective population size may be an overestimate due to incorrect classification of species within the Paramecium aurelia species complex and may be closer to the order of 106 [40]. In contrast, the T. thermophila effective population size is estimated to be considerably smaller than Oxytricha's, at N e = 7.5×105 for π S = 0.003 (with μ = 10−9) [41],[42]. In laboratory culture conditions, Oxytricha trifallax tends to replicate asexually and rarely conjugates (resulting in meiotic recombination). Conjugation in the laboratory is induced by starvation as long as cells of compatible mating types are available. However, we do not know the frequency of conjugation relative to replication of Oxytricha trifallax in its natural environment. The relationships between the frequency of asexual reproduction, and additional population genetic factors arising from asexuality, such as the variance of asexual and sexual reproductive contributions, are complex and can result in increases or decreases in estimates of effective population size [49],[50]. As a result, effective population size estimates for Oxytricha should be treated with caution until these factors are better understood. As indicated for the well-characterized actin I locus, which has a roughly 1∶11 ratio of minor∶major allelic variant, there may be substantial deviations from the expected 1∶1 ratio for the two possible allelic variants at each site. For matchless nanochromosomes, we found that the distribution of median nanochromosomal variant frequency (i.e., the median frequency of putative allelic polymorphisms) is bimodal, with one mode close to the expectation at 40%–45% and the other at 5%–10% (Figure 5; since the lower peak is bounded by the cutoff we chose to assess variants, the true lower peak may be lower than this). The bimodality of this distribution persists even if we only consider nanochromosomes where the mean coverage of the variant sites is high (e.g., at mean coverage of ≥110× at variant sites). Though some deviation from the 1∶1 variant ratio might result from allele-specific read mapping biases [51], given the relatively relaxed read mapping parameters we used (see Materials and Methods, “Read Mapping and Variant Detection”), the two variant frequency modes differ too much to explain the lower mode's existence. Instead, the deviation from the expected ratio may indicate that allelic assortment has occurred, or that there are developmentally specific allelic biases. Since a high proportion of nanochromosomes deviate substantially from the 1∶1 expected allelic ratio, it is also possible that allelic assortment has occurred for some nanochromosomes, which may contribute to the observed abundance of homozygous nanochromosomes. Nanochromosomes that deviate the most from the expected 1∶1 allelic ratio tend to have lower mean SNP heterozygosities, which likely reflects the diminished ability to detect SNPs with more distorted variant frequencies (Figure 5). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 5. Nanochromosomal variant frequencies. (A) Normalized to form a probability density (cumulative frequency of 1) and (B) unnormalized median nanochromosomal variant frequencies for six increasing ranges of mean SNP heterozygosity. Variant frequencies were determined for nanochromosomes with no non-self matches to the genome assembly (the same nanochromosomes underlying the SNP heterozygosity histogram for “matchless” nanochromosomes in Figure 4), with variant positions called at the same minimum variant frequency as that used to determine potentially heterozygous sites (5% for sites with ≥20× read coverage). To exclude potentially paralogous mapped reads, we only analyzed nanochromosomes with ≤4 reads mapped to other contigs (using all nanochromosomes does not substantially change the form of the distributions). Variant frequency bins are labeled by their lower bounds. Variant frequencies ≥40 bp from either nanochromosome end were counted to avoid possible incorrect variant calling resulting from telomeric bases that were not masked (due to sequencing errors). https://doi.org/10.1371/journal.pbio.1001473.g005

Analysis of Assembly Redundancy and Estimation of Macronuclear Genome Size We desired an estimate for the total haploid Oxytricha trifallax macronuclear genome size since it is unknown. To obtain a reasonable estimate, we needed to determine the extent of redundancy in our genome assembly. As judged by visual inspection of our original assemblies, the main sources of redundancy are (i) the two alleles from the partially diallelic genome (see “Extensive Genome Homozygosity and High SNP Heterozygosity”), (ii) alternative nanochromosome fragmentation (see “Extensive Alternative Nanochromosome Fragmentation”), (iii) erroneous base calling that may result from high copy number regions and relatively abundant sequencing errors, and (iv) paralogous genes. The assembly with the most redundancy—from ABySS (Figure S3)—has approximately half of its contigs with nonself matches that are identical or almost identical (matches that are ≥100 bp and ≥99% identical). Visual inspection of the ABySS assembly revealed that much of the redundancy arose from the combined effect of high copy number DNA and sequencing errors. Our assembly strategy eliminated most of the redundancy from erroneous base calling, because it collapsed regions that are nearly identical. A small quantity of additional redundancy may have also been introduced by the inclusion of non-reference (JRB510) allelic sequences from the Sanger/454 genome assemblies, though the strategy we used prefers the inclusion of reference allelic sequences (see Materials and Methods, “Princeton Illumina Assembly and TGI Sanger/454 Assembly Integration”). Though some redundancy remains in our final genome assembly, this is counteracted by ∼1/4 of the nanochromosomes that have had their alleles collapsed during the assembly (see “Extensive Genome Homozygosity and High SNP Heterozygosity”). Given the ∼42% estimate of nanochromosomal homozygosity, we estimate that the haploid number of nanochromosomes is ∼15,600 [from 15,993+1,279 two- and multitelomere contigs and 5,303 single-telomere contigs (Table 1); we also estimate that ∼10% of nanochromosomes are alternatively fragmented (see “Extensive Alternative Nanochromosome Fragmentation”)]. With a mean nanochromosomal length of ∼3.2 kb, we estimate that the haploid macronuclear genome size is 50 Mb, which is similar to earlier experimental estimates [52],[53].

The Macronuclear Genome Encodes All the Genes Necessary for Vegetative Growth Traditional assessments of genome completeness are not very meaningful in Oxytricha because they usually measure genomes of uniform coverage with relatively long chromosomes. In two key ways, the Oxytricha macronuclear genome assembly is more similar to a de novo transcript assembly than to a conventional genome assembly: it contains multiple nanochromosome isoforms produced by alternative nanochromosome fragmentation (see “Extensive Alternative Nanochromosome Fragmentation”), and it is an assembly of nonuniformly amplified DNA (see “Nanochromosome Copy Number Is Nonuniform”). Unlike RNA transcripts, nanochromosome levels remain relatively stable during asexual growth [22], and variation of nanochromosome copy number is considerably lower than that of transcripts, so we are able to completely sample the genome's DNA over time. Simple genome metrics indicate that our assembly is largely complete. Firstly, we have sequenced the genome to a substantial depth: we have >62× haploid coverage of the genome assembly by Illumina 100 bp PE reads and >48× haploid coverage by SE reads. Secondly, nearly all high-quality reads map to our final assembly (98% of high quality PE reads) and the majority of contigs (71.3%) represent complete nanochromosomes, with only 5.1% of the contigs missing both telomeres and 23.6% missing one telomere (Table 1). Finally, our 50 Mb haploid genome assembly size estimate is similar to an earlier estimate of ∼55 Mb for the DNA complexity of the Oxytricha macronucleus [2]. To assess genome completeness we analyzed the completeness of two specific, functionally related gene sets—encoding ribosomal proteins and tRNAs—and one general gene data set in Oxytricha. All of these measures of completeness indicate that the macronuclear genome assembly is essentially complete. Firstly, the final genome assembly contains all 80 of the standard eukaryotic ribosomal proteins (32 small subunit and 48 large subunit proteins). Secondly, the Oxytricha macronuclear genome has a haploid complement of ∼59 unique tRNA nanochromosomes (including a selenocysteine tRNA on Contig21859.0). These tRNAs are sufficient to translate all of its codons if wobble position anticodon rules [54] are accounted for. As judged by searches of tRNAdb [55], codons without cognate tRNAs in Oxytricha are either absent or rare in other eukaryotes. Furthermore, with the exception of a Tetrahymena glycine tRNA that has a CCC anticodon [56], Oxytricha's tRNAs share the same anticodons as Tetrahymena. We also assessed the completeness of the macronuclear genome by searches of predicted proteins against 248 “core eukaryotic genes” (CEGs: defined by KOGS [57] based on the complete protein catalogs of H. sapiens, D. melanogaster, C. elegans, A. thaliana, S. cerevisiae, and S. pombe [58]). Using a strategy similar to that used to assess completeness during analyses of ncRNAs in Oxytricha [59], and based on the CEGMA analysis strategy [58], we searched for all the core proteins using a match coverage of ≥70% of the mean CEG sequence length, accumulated over the span of the query CEG sequence matches from BLASTP (BLAST+ [60]; with at least one of the matches for each query having an E-value≤1e-10). Of our predicted proteins 231 had substantial sequence similarity to the core eukaryotic protein sequences. The numbers of core eukaryotic proteins we found in the latest Tetrahymena (223) and Paramecium (230) gene predictions were similar to those found in Oxytricha (within the limits of the sensitivity of the searches we used and possible gene prediction failures). However, since CEGS are defined for just five genomes of animals, fungi, and one plant and exclude a diversity of other eukaryotes, the true set of CEGs may be somewhat smaller than this. Given the great evolutionary divergences of ciliates from these eukaryotes (possibly in excess of 1.5 billion years ago [61]), it is also possible that the BLAST criteria employed by CEGMA are not sufficiently sensitive to detect more distant ciliate homologs. This suggests that the predicted proteomes of all three ciliates are largely complete. Given that the deep divergences of ciliates might prevent detection of their homologs to the remaining 17 CEGS without matches, we attempted more sensitive searches at the domain level using HMMER3 [62]. We assigned Pfam domains to each KOG if the domains were best Pfam hits to the majority of the members of each KOG. Using domain searches, 13 of the remaining 17 core proteins had matches in Pfam-A (with domain, full-sequence E-values<1e-3; see Text S1, “Pfam Domains Detected for CEGs Missing in Oxytricha”). With the exception of KOG2531, these CEGs are relatively short, single-domain proteins. Of the four undetectable CEGs remaining after HMMER3 searches, one, KOG3285, is an ortholog group corresponding to the MAD2 [63] spindle assembly checkpoint (SAC) protein. We were unable to detect homologs of additional SAC proteins such as MAD1 and MAD3, suggesting that these checkpoint proteins have either been lost or that they are very divergent. The Tetrahymena macronuclear genome paper reported the absence of the checkpoint kinase CHK1 [56], but this is difficult to establish unambiguously given both the considerable divergence of ciliate proteins from the model organisms in which this protein was discovered and that the single defining domain for this protein is the widely distributed and extremely common protein kinase domain (PF00069). There is no evidence of a mitotic spindle in ciliate macronuclei [16], so they may not need SAC proteins, unlike conventional nuclei. It is also possible that the lack of these proteins contributed to the evolution of ciliate amitosis. However, micronuclei do appear to undergo a spindle-guided mitosis [2],[16]. Since ciliates need to coordinate the division of multiple nuclei, their nuclear cycle checkpoints may be more complex than most eukaryotes; hence, they may use genes that are nonorthologous to conventional checkpoint genes in nonciliates. The other three undetectable CEGs—KOG0563, KOG3147, and KOG2653—correspond to three key oxidative pentose phosphate pathway (OPPP) enzymes: glucose-6-phosphate dehydrogenase (G6PD), gluconolactonase (6PGL), and 6-phosphogluconate dehydrogenase (6PGD), which are also missing in Paramecium and Tetrahymena, and hence may have been lost in ciliates (Figure S4). Two of these enzymes—G6PD and 6PGL—were also noted to be missing in Paramecium, Tetrahymena, and Ichthyophthirius [64]. Thus, excluding these three CEGs that appear to be absent from ciliates, Oxytricha's macronuclear genome is only missing one CEG (KOG3285/Mad2). Together, since the macronuclear genome has 244/245 (99.6%) of the ciliate-restricted CEGs, it (together with the mitochondrial genome [65]) is likely to encode a complete set of genes required for vegetative growth. This is consistent with the observation of amicronucleate Oxytricha species in the wild, which are capable of vigorous replication for hundreds of generations in culture [2],[66], and with temporary dispensability of the micronucleus. Currently TBE transposon genes are the only published examples of micronuclear-limited genes (not encoded on any of the nanochromosomes in our assembly) in Oxytricha. These genes are exclusively expressed during sexual development and appear to be essential for accurate genome rearrangements [30], and hence may only need to be expressed from the micronucleus.

Extensive Alternative Nanochromosome Fragmentation A characteristic feature of the Oxytricha macronuclear genome is the existence of multiple, stable “versions” of nanochromosomes that share genic regions [10]–[12]. “Alternative processing” or “alternative fragmentation” of DNA is analogous to alternative splicing of introns from pre-mRNAs, but unlike alternative RNA splicing, macronuclear DNA is simply fragmented (with telomere addition), rather than joined together. Variable deletion of micronuclear DNA in Paramecium also gives rise to alternatively fragmented macronuclear chromosomes, though it produces much longer multigene chromosomes and this alternative fragmentation is much less frequent than that in Oxytricha [67]–[69]. Initial surveys of Oxytricha fallax nanochromosomes revealed a substantial amount of alternative nanochromosome fragmentation, with 40% (6/15) of the surveyed nanochromosomes alternatively fragmented [11], so we wanted to assess this on a genome-wide scale. We also sought evidence of possible functional relationships between alternative fragmentation and gene expression, since, in principle, alternative nanochromosome fragmentation may affect gene expression by (i) permitting variable amplification of nanochromosome isoforms, thereby affecting basal transcription levels of the genes encoded on these isoforms (see “Nanochromosome Copy Number Is Nonuniform”), (ii) gene truncation, and (iii) affecting regulation of gene expression by modulating which regulatory elements are present on nanochromosomes. The creation of contigs during assembly merges shorter alternative nanochromosomes into longer isoforms, obscuring telomeric repeats when they contribute a minority of bases, thus making it difficult to identify the alternative isoforms directly from the contig sequences. Therefore, we exploited two sources of raw sequence data to uncover this kind of variation: 454 telomeric read pairs and Illumina telomeric reads (see Materials and Methods). From alternative fragmentation sites predicted by either data source, almost 1/4 of all the nanochromosomes (3,369/14,390) in our final assembly are predicted to be alternatively fragmented (only counting contigs with terminal telomeres ≤100 bp from either end of the contig). We predict 11% (1,909/17,372) and 14% (2,380/17,372) of nanochromosomes are alternatively fragmented from 454 telomeric reads alone and Illumina telomeric reads alone, respectively. Of the nanochromosomes predicted to be alternatively fragmented by Illumina telomeric reads, 63% are also predicted to be alternatively fragmented by 454 telomeric reads, and 68% of the nanochromosomes predicted to be alternatively fragmented by 454 telomeric reads are also predicted to be alternatively fragmented by Illumina telomeric reads. The actual portion of alternatively fragmented nanochromosomes may be closer to 10% since many of the predicted sites are only supported by a few reads (which results in a poor correspondence between the predictions from the two data sources when there are few telomeric reads at a putative alternative fragmentation site; see Text S1, “Classification of Strongly and Weakly Supported Alternative Fragmentation Sites”). We propose that most of the nanochromosomes arising from weakly supported sites with few supporting telomeric reads (e.g., <9 llumina telomeric reads) may represent “developmental noise” or healing of broken nanochromosomes by capping the broken ends with telomeres, rather than functional nanochromosomes. We may not have recovered some alternatively fragmented nanochromosome isoforms due to limitations of our genome assembly. Since we focused on nanochromosomes with telomeres at both ends, some alternative fragmentation will be missed on nanochromosomes that lack telomeric ends (e.g., the alternatively fragmented 81-Mac locus [10],[11], represented by Contig13637.0.1, is missing both ends). Another possible failure to detect alternative fragmentation is a consequence of the semigreedy nature of our genome assembly strategy, since we stop extending nanochromosomes once we have detected at least one 5′ and at least one 3′ telomeric repeat (see Materials and Methods), which means that we may miss some longer unfragmented nanochromosome isoforms. Consequently, we consider our estimates of the level of alternative fragmentation to be conservative. Alternative fragmentation sites tend to map between predicted genes in intergenic regions rather than within intragenic regions [Table S1; we use inter-CDS regions rather than intergenic regions since CDS (coding sequence) predictions are more reliable than UTRs (untranslated regions)]. For contigs with single internal alternative telomere fragmentation sites, strongly supported alternative fragmentation sites are 58 times more likely to be located in inter-CDS regions than in intra-CDS regions (per bp of these sequence regions). For nanochromosomes with single-gene predictions, strongly supported alternative fragmentation sites are 27 times more likely to reside within non-CDS regions (i.e., introns, UTRs, subtelomeric regions, or regions with no predicted gene), than within CDSs (Table S2). Strongly supported, noncoding alternative fragmentation sites typically have more telomere-containing reads than do coding alternative fragmentation sites for both single (mean 213 versus 95 reads) and two-gene [mean 186 (intergenic region) versus 116 reads] nanochromosomes. For strongly supported alternative fragmentation sites predicted by Illumina telomeric reads, 74% (1,208/1,622) of alternatively fragmented nanochromosomes have one site of alternative fragmentation (giving rise to two nanochromosome isoforms: a long unfragmented form and a shorter fragmented isoform), and 21% of alternatively fragmented nanochromosomes have two sites of alternative fragmentation (similar statistics were obtained from strongly supported sites predicted from 454 telomeric reads). This means that typically only a few possible nanochromosome isoforms are produced for each of our assembled nanochromosomes and also that most alternative fragmentation is “directional,” giving rise to only one of the two possible single shorter isoforms. The most extreme example has seven alternative fragmentation sites predicted from the Illumina telomeric reads (at most nine from 454 telomeric reads) for Contig14329.0 (GenBank Accession: AMCR01001519.1). The observation of directional alternative fragmentation suggests there must either be differential amplification of particular isoforms or degradation of specific forms following excision. The higher amplification levels of alternatively fragmented nanochromosomes relative to nonalternatively fragmented nanochromosomes (see next section) provides support for the first model but does not exclude the second. In the future, it will be interesting to determine how these fragmentation signals relate to chromosome amplification, since the timing of DNA fragmentation correlates with nanochromosome copy number in Euplotes [25]. The longest isoform of the most extreme case of alternative fragmentation we discovered (Contig14329.0) is about 8 kb long with eight distinct protein-coding regions. This contig has 15 predicted telomere addition sites (TASs) (nine 5′ and six 3′ sites relative to the contig orientation in the assembly) from the 454 telomeric reads (with 11 strongly supported sites, including the two terminal sites), giving rise to up to 14 distinct nanochromosome isoforms from the same 8.1 kb region (Figure 6). An alternative fragmentation site at ∼6,000 bp is weakly supported by 454 telomeric reads but strongly supported by Illumina telomeric reads, suggesting that this site largely gives rise to longer nanochromosome isoforms that the 454 telomeric reads are less likely to detect. Every one of the alternative fragmentation sites predicted from Illumina telomeric reads, with the exception of a single weakly supported site at 4,767 bp, was corroborated by 454 telomeric reads within 100 bp of the site. The 454 telomeric reads suggest that each of the seven intergenic regions in this contig is a site of alternative fragmentation (no Illumina telomeric reads map to the site between genes 5 and 6). Consistent with the genome-wide pattern, this contig's alternative fragmentation sites typically reside between, not in the middle of, genes. For the 454 telomeric reads, only a small portion (2/15 sites) of the fragmentation sites are predicted to be in coding regions, and these sites are weakly supported, whereas the Illumina telomeric reads do not predict any sites in coding sequence regions. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 6. Extreme nanochromosomal fragmentation. Contig14329.0 is shown with coordinates in bp. Predicted genes, coding sequences, and introns are indicated by horizontal green, yellow, and white arrows, respectively. 5′ and 3′ fragmentation sites predicted from telomeric read pairs are indicated by red and navy arrows, respectively, with upward pointing solid arrows for sites predicted by 454 telomeric reads and downward pointing dashed arrows for sites predicted by Illumina telomeric reads. Numbers above/below the arrows indicate the number of telomeric reads found at each site for the two telomeric read sources. Alternative nanochromosome isoforms predicted from the 454 telomeric reads (isoforms B–O) are shown below the main locus, with the number of supporting read pairs next to each form. One additional isoform missed by our prediction method but documented in the 454 telomeric read pairs is indicated in pale green. Since the two fragmentation positions at 3,762 and 3,806 bp are in close proximity to each other, they were treated as a single point during alternative isoform prediction. Additional nanochromosome isoforms that were not detected by 454-telomeric reads, including the full eight-gene nanochromosome, but were detected by Southern blotting are indicated by stars (isoforms A, P, Q, and R). Sequence coverage, indicated by the cyan graph, shows the cumulative DNA amplification for all the nanochromosome isoforms. Sequence coverage is calculated from both Illumina telomereless and telomeric reads; telomeric read pairs appear as twin peaks ∼300 bp apart. https://doi.org/10.1371/journal.pbio.1001473.g006 To experimentally validate the predicted extreme fragmentation of Contig14329.0, we performed Southern hybridization (Figure S5) on the same vegetative Oxytricha JRB310 macronuclear DNA sequenced by Illumina. With two exceptions, our Southern analysis confirmed all tested nanochromosomes and identified four novel isoforms: the full length ∼8 kb isoform A, and isoforms P, Q, and R (Figure 6, Figure S5; Text S1, “Examination of Discrepancies Between Predicted and Experimentally Determined Alternative Fragmentation Isoforms of the Highly Fragmented Contig14329.0”). Since the process of generating 454 telomeric reads included a size selection (see Text S1, “Whole Nanochromosome Telomere-Based Library Construction”), it is unsurprising that sequencing missed the longer isoforms we were able to detect by Southern hybridization (A, P, Q, and R, at 8.1 kb, 6 kb, 6.5 kb, and 4.5 kb long, respectively). The eight genes encoded on these alternatively fragmented nanochromosomes are (1) an RNAse HII domain containing protein (Pfam: PF01351), (2) a dsDNA-binding domain (PF01984) protein, (3) a Tim10/DDP family zinc finger domain protein (PF02953), (4) a protein with no significant BLASTP (to GenBank NR; E-value<1e-3) or Pfam matches (E-value<1e-3), (5) a COPI-associated protein domain (PF08507) protein, (6) an uncharacterized conserved protein (DUF2036) protein (PF09724), (7) another protein with no significant BLASTP or Pfam matches, and (8) a translation initiation factor eIF3 subunit (PF08597) protein. From the domain annotations, no obvious functional relationship amongst these genes is evident. From Figure 6, it can be seen that the representation of genes 2, 3, 4, and 8 in the different nanochromosomal isoforms is greater than for the remainder of the genes. Two of the shortest Oxytricha proteins (encoded by genes 2 and 3; see also Text S1, “Analysis of Short Protein and ncRNA-Encoding Nanochromosome”) are encoded on the most abundant nanochromosomal isoforms. Remarkably, in contrast to the surrounding, heterozygous DNA encoding genes 1–3 and gene 8, the ∼4.2 kb DNA region encoding genes 4–7 appears to be completely homozygous, suggesting the possibility that these regions derive from different micronuclear sources.

Nanochromosome Copy Number Is Nonuniform In contrast to the oligohymenophorean ciliates, which typically have uniformly amplified macronuclear genomes [56],[69], there is considerable variation in nanochromosome copy number in Oxytricha. The distribution of copy number for nonalternatively fragmented nanochromosomes is right-skewed and is restricted around a mean relative copy number of 0.94, with ∼90% of the nanochromosomes contained within a relative copy number range of 0.12–1.76 centered on the mean (Figure 7A). It is possible that some lower copy number nanochromosomes may not have completely assembled since the combined depth of sequence coverage is <120× and lower bound copy number estimation is constrained by the >62× coverage of the PE reads. Mindful of these limitations, within the sequenced JRB310 clonal population of cells, nanochromosome copy number does not appear to vary as much as gene transcription. The most highly amplified nanochromosome, encoding the 18S, 5.8S, and 28S rRNA (Contig451.1), has a copy number that is ∼56× the mean of nonalternatively fragmented nanochromosomes, yet its transcripts typically yield more than 90% of the RNA in our non-poly(A)-selected RNA-seq samples. There is a roughly 2-fold difference between the most highly amplified nanochromosome and the next most highly amplified nanochromosome, encoding the 5S rRNA (Contig14476.0/Contig17968.0; quasi-allelic contigs). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 7. Nanochromosome copy number variation. (A) Relative nanochromosome copy number distribution (number of telomere-less reads/bp of nonsubtelomeric nanochromosome; see Materials and Methods) for homozygous matchless, heterozygous matchless, and heterozygous matching nanochromosomes. The mean copy number of the combined homozygous matchless and heterozygous matchless nanochromosomes is indicated by a dashed line at 0.94, with dotted lines corresponding to an interval of ∼1.3σ (∼0.12 to 1.76) either side of the mean, which includes ∼90% of all nanochromosomes. (B) Relative nanochromosome copy number of nonalternatively fragmented (with a single, directional fragmentation site per nanochromosome) versus alternatively fragmented nanochromosomes measured by the number of telomeric reads per nanochromosome. (C) Relative nanochromosome copy number of nonalternatively fragmented chromosomes versus nonalternatively fragmented chromosomes encoding ribosomal proteins and tRNAs. https://doi.org/10.1371/journal.pbio.1001473.g007 Since the method we used to estimate nanochromosome copy number combines reads from both possible alleles for heterozygous nanochromosomes, it is necessary to map the reads sensitively to avoid exclusion of reads and to minimize incorrect mapping in order to obtain accurate estimates (see “Genome Homozygosity and SNP Heterozygosity”). Our mapping procedure seems to be appropriate for matchless nanochromosomes, since there is no substantial difference in copy number distributions for homozygous and heterozygous matchless nanochromosomes (mean copy number of 0.93, SD = 0.61, and 0.97, SD = 0.67, respectively; Figure 7A). However, for heterozygous nanochromosomes with matches, the mean nanochromosome copy number is lower (0.81; SD = 0.59; Figure 7A) than for matchless nanochromosomes. This is likely because some of the nanochromosomes with matches exhibit higher heterozygosity regions than matchless heterozygous nanochromosomes (>6% mean SNP heterozygosity; see “Genome Homozygosity and SNP Heterozygosity”) and the mapping criteria (≥94% read identity to the mapped contig) eliminated some of the more heterozygous reads. To assess nanochromosome copy number of alternatively fragmented versus nonalternatively fragmented nanochromosomes, we examined the relationship between the number of telomeric reads and the number of nontelomeric reads per bp of the nanochromosomes (see Materials and Methods). We found that there was a good correlation between telomeric reads from either end of the nonalternatively fragmented nanochromosomes (Figure S6B) with r = 0.90. However, there are examples where the number of reads from each nanochromosome end differs substantially (e.g., Contig22209.0 and Contig5780.0 from Table S3). This may indicate the failure to extend the ends of some nanochromosomes completely or that the ends derive from the DNA of the nonreference strain (JRB510) and are relatively divergent with few reads from the reference DNA mapped to them (e.g., Contig5780.0). Alternatively, there may be experimental biases that skew the numbers of reads mapped to the two ends (e.g., Contig22209.0, which has JRB310 telomeric reads mapped to the nanochromosome end with fewer reads but no reads extending further, even with relaxed read mapping parameters). The correlation between the number of reads per bp and the number of telomeric reads per nanochromosome is also strong (r = 0.89; Figure S6A), indicating that assessment of telomeric reads alone is appropriate for large-scale analyses of nanochromosome copy number. Furthermore, our estimates of relative nanochromosome copy number, either via reads per bp or the number of telomeric reads per contig, are in good agreement with those obtained by qPCR (Table S3; Figure S7). For relative nanochromosome copy number measured by telomeric reads, the mean number of telomeric reads per alternatively fragmented nanochromosome with a single (directional) alternative fragmentation site (i.e., only two nanochromosome isoforms) is 2.4 times (885 reads, SD = 768 reads) that of nonalternatively fragmented nanochromosomes (363 reads; SD = 290 reads; K-S one-sided test D = 0.59 and p value<1e-9, with the alternative hypothesis that alternatively fragmented nanochromosome copy number>nonalternatively fragmented nanochromosome copy number; Figure 7B). It follows that the DNA of the shorter alternative nanochromosome isoforms is even more highly amplified than that of nonalternatively fragmented nanochromosomes. The greater amplification of alternatively fragmented nanochromosomes relative to nonalternatively fragmented nanochromosomes supports a model of net overamplification of specific alternatively fragmented nanochromosomes isoforms rather than a model of net destruction. The higher amplification of alternatively fragmented nanochromosomes may indicate a commensal DNA relationship between two genes, arising when one of the genes benefits from the amplification signal of a more highly amplified nanochromosome isoform bearing another gene. This relationship requires no functional association between the genes on alternatively fragmented nanochromosomes, consistent with our general observations (e.g., no specific functional associations between nonribosomal genes and ribosomal genes on alternatively fragmented nanochromosomes). For nonalternatively fragmented nanochromosomes, the ribosomal protein-encoding nanochromosomes are ∼3.9× more highly amplified than nonribosomal protein nanochromosomes, and tRNA-encoding nanochromosomes are ∼3.6× more highly amplified than non-tRNA-encoding nanochromosomes (Figure 7C; for ribosomal versus nonribosomal nanochromosomes: K-S one-sided test D = 0.64 and p value<1e-9, with the alternative hypothesis that ribosomal nanochromosome copy number>nonribosomal nanochromosome copy number; for tRNA versus non-tRNA nanochromosomes: K-S one-sided test D = 0.62 and p value<1e-6, with the alternative hypothesis that tRNA nanochromosome copy number>non-tRNA nanochromosome copy number). Similarly, the ribosomal protein- and tRNA-encoding nanochromosome isoforms arising from alternative fragmentation are typically overamplified relative to the isoforms that encode other genes (50/54 alternatively fragmented ribosomal nanochromosomes and 25/28 alternatively fragmented tRNA nanochromosomes; Figure S8). Given the modest variation in nanochromosome copy number, most notably the limited overamplification of nanochromosomes encoding highly expressed genes (rRNAs, tRNAs, and ribosomal proteins), even if a strong correlation exists between nanochromosome copy number and transcription levels, copy number may only be a modest contributor to the final RNA and protein expression levels. Regulation of expression at the transcriptional/posttranscriptional level may be essential to buffer the variation in DNA copy number that arises during extended periods of vegetative growth.

Nanochromosome Length Variation Oxytricha nanochromosomes range in length from ∼500 bp to 66 kb, with a mean size of ∼3.2 kb (Figure 8). Few nanochromosomes were assembled at either extremity of the length distribution, with just 32 shorter than 600 bp long and 61 longer than 15 kb, consistent with observations of macronuclear DNA on electrophoretic gels [2],[70]. While the mean length of two-telomere nanochromosomes in the final Oxytricha macronuclear genome assembly is ∼3.2 kb (Table 1), the true average length of nanochromosomes is shorter than this because the longest isoform of alternatively fragmented nanochromosomes is the one that tends to be assembled. On electrophoretic gels, Oxytricha nanochromosomes are visibly longer than those of Euplotes [2],[71], which we propose is primarily a consequence of the lack of alternative fragmentation in Euplotes (inspection of mapped reads to our preliminary Euplotes crassus assembly indicated no signs of alternative fragmentation; unpublished data). The longest isoforms of alternatively fragmented nanochromosomes average 5.0 kb (SD = 2.4 kb), while nonalternatively fragmented nanochromosomes have a mean length of 3.0 kb (SD = 2.4 kb; Figure 8). The mean length of the shortest nanochromosome isoforms produced by alternative fragmentation is 2.4 kb (SD = 1.6 kb). For single-gene nonalternatively fragmented nanochromosomes, the mean nanochromosome length is 2.2 kb (SD = 1.0 kb). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 8. Length distributions of alternatively and nonalternatively fragmented nanochromosomes. The shortest nanochromosome isoforms produced from single (directional) alternative fragmentation sites are labeled as “Short isoform.” The histograms show normalized frequencies for 1,587 alternatively fragmented nanochromosomes and 15,219 nonalternatively fragmented nanochromosomes. Alternatively fragmented nanochromosomes have at least one strongly supported (≥10 Illumina reads) alternative fragmentation site >250 bp from either end of the nanochromosome (these nanochromosomes are >500 bp long). https://doi.org/10.1371/journal.pbio.1001473.g008 The shortest assembled nanochromosome (Contig20269.0) is a mere 248 bp, excluding the telomeric sequences. Though we were unable to identify any ORFs or any ncRNAs on this nanochromosome by RFAM searches, we found two matching RNA-seq PE reads, suggesting that there is expression from this nanochromosome. The shortest nanochromosome (Contig19982.0) with a known protein is 469 bp (excluding the telomeres) and encodes a 98 aa ThiS/MoaD family protein, while the shortest ncRNA-bearing nanochromosome we found is 540 bp (excluding telomeres) and encodes tRNA-Gln(CUG) (see Text S1, “Analysis of Short Protein and ncRNA-Encoding Nanochromosomes”). Searches for shorter possible nanochromosomes in the Illumina and Sanger reads did not reveal additional plausible nanochromosome candidates (Text S1, “Reads Containing Both Putative Telomeric Repeats Are Not Genuine Nanochromosomes”). The longest nanochromosomes (>15 kb) typically encode a single large structural protein (Table S4), such as dynein heavy chain proteins (e.g., Contig354.1). None of the 20 longest nanochromosomes are alternatively fragmented. Seven of these 20 nanochromosomes contain multiple predicted genes (up to a maximum of four); however, all but one of these gene predictions are oriented head-to-tail, consistent with the possibility that their predictions may have been incorrectly split. Hence most of the longest nanochromosomes are likely still single-gene nanochromosomes. One ∼20 kb nanochromosome (Contig289.1) does indeed contain multiple genes, since it encodes a Pkinase domain (PF00069) protein on the opposite strand to two predicted PAS_9 domain (PF14326) proteins (though these latter two proteins may also be incorrectly split). Six of the longest nanochromosomes encode single proteins with no detectable Pfam domains (Pfam-A 26; independent E-value<0.01) but all have BLASTP NCBI non-redundant database (nrdb) matches (E-value<1e-10), typically to large proteins (>2,000 aa). The longest nanochromosome (Contig7580.0) is 66 kb (65,957 bp; excluding telomeres) and encodes a single giant protein (“Jotin,” after a Norse giant) with BLASTP best hits to Titin-like genes in the NCBI nrdb (see Text S1, “Characterization of the Jotin Protein”). We note that this single-gene nanochromosome is comparable in size to the entire, relatively large and gene-rich ∼70 kb Oxytricha mitochondrial genome [65], which was largely eliminated by the sucrose gradient isolation of macronuclei (see Materials and Methods). The Oxytricha Jotin ORF is 64,614 bp. AUGUSTUS predicts four short introns (117, 151, 77, and 63 bp), two of which are supported by and one of which conflicts with RNA-seq reads. This gene's entire coding sequence is well supported by pooled RNA-seq reads (covered from end to end).

Gene Predictions The gene prediction software, AUGUSTUS, predicted complete genes on 15,387 of the complete nanochromosomes we surveyed (96%) and 91% of the final assembly's contigs. Examination of three developmental time points (0, 10, and 20 h after initiation of conjugation) confirms transcription of 97% of Oxytricha nanochromosomes (94% of all contigs). AUGUSTUS predicts genes on 94% of nanochromosomes with expression evidence. Most Oxytricha nanochromosomes (80%) contain single genes, consistent with earlier studies (Figure S9) [2],[33]. Alternatively fragmented nanochromosomes tend to encode more genes per nanochromosome: only 15% of alternatively fragmented nanochromosomes have single gene predictions, versus 90% of all nonalternatively fragmented nanochromosomes. Roughly half (48%) of multigene nanochromosomes have alternative fragmentation. All nanochromosomes with five or more (maximum eight) predicted genes are alternatively fragmented (Figure S9), and only two nonalternatively fragmented nanochromosomes encode four genes. The nanochromosome with the largest number of separate gene products (Contig8800.0; ∼6.8 kb) is alternatively fragmented, with a shorter ∼3.5 kb nanochromosome isoform that encodes 12 C/D snoRNAs [59] and a putative protein-coding gene encoded by the remainder of the full-length isoform. Key properties of Oxytricha's gene predictions are consistent with a pilot survey [33], including relative AT-richness (34% GC) with noncoding regions that are more AT-rich than coding regions (e.g., introns are 23.6% GC), and 1.6 introns per gene (Table 2). Oxytricha gene lengths (mean length 1,839 bp excluding UTRs) are similar to those predicted for Tetrahymena [56]. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Table 2. Properties of gene predictions. https://doi.org/10.1371/journal.pbio.1001473.t002