Genome sequencing, assembly and assessment

We sequenced and assembled five genomes from four species within the B.plicatilis species complex: B. rotundiformis (Italy2), B. sp. ‘Tiscar’ (TiscarSM28), B. plicatilis s.s. (Tokyo1), and B. asplanchnoidis (OHJ82 and OHJ22). The number of sequenced base pairs (bp) ranged from 2.06 Gbp to 9.77 Gbp; we identified 0.4–8% of reads as coming from contaminants, and retained between 2.01 and 9.73 Gbp. Kmer analyses of the different cleaned read libraries revealed that the genomes of both B. asplanchnoidis strains (OHJ82, 0.412%; OHJ22, 0.412%) were more heterozygous than Italy2 (0.055%), TiscarSM28 (0.178%), and Tokyo1 (0.109%). The B. calyciflorus genome had an estimated heterozygosity of 1.66%. Assembly size for Italy2, TiscarSM28, and Tokyo1 was approximately half of the holoploid genome size, while the mean read depth across the entire assembly and in non-repetitive regions was slightly less than twice the expected coverage (Table 1). The contig N50 of these genomes, an indication of assembly contiguity, ranged from 15,643 bp in Tokyo1 to 42,810 bp in Italy2. In contrast, despite much greater sequencing effort the B. asplanchnoidis assemblies were about 27% of the genome size, with a mean read depth slightly more than twice the expected coverage in non-repetitive regions of the assemblies. Both the OHJ82 and OHJ22 assemblies were ~ 115 Mbp, with contig N50 values around 10,000 bp. Each of the five assemblies had 91–92% of the metazoan BUSCO genes (Table 1). Overall, 5.5% of the metazoan BUSCO genes (54 genes) were not found in any of our assemblies, and 740 genes (75.7%) were found in complete single copies in all five (Additional file 5: File S1).

Ploidy assessment

Because very large changes in genome size between species often suggest changes in ploidy, we examined our assemblies for differences in read coverage and allele frequency. For all species, median observed read coverage of the non-repetitive regions of the assembly was about twice the expected coverage (Table 1). In all cases, genome coverage was unimodal, arguing against ploidy differences between species (Fig. 2). The coverage distributions of the 740 shared BUSCO genes followed the overall genome coverage in each assembly; a small fraction of genes had coverage significantly higher than the median, and there were more of these in the larger genomes (Fig. 2, Additional file 5: File S1). With the exception of Tokyo1, which had a very low number of SNPs in the BUSCO genes, the frequency distributions of minor alleles in the shared BUSCO genes were similar across species, with the frequency of most minor alleles in the 0.4–0.5 range (Additional file 1: Figure S1).

Fig. 2 Distribution of observed coverage (on a per-gene basis) of a subset of BUSCO genes shared across all assemblies, dots indicate mean coverage values for each gene. Coverage distribution across the whole genome assemblies (in 500-bp windows) is shown in a grey overlay Full size image

To assess coverage and allele frequency independently from assembly, we examined coverage of heterozygous kmer pairs in each read library. Comparing the relative coverage of each pair to the normalized frequency of the minor sequence can reveal patterns of ploidy and heterozygosity. For all five read libraries, the spectra indicated that most heterozygous kmers were covered around 4n, with a minor kmer relative frequency around 0.5. There was indication of a minor peak around 2n, most visible in TiscarSM28 and both B. asplanchnoidis libraries. The B. calyciflorus PE500 read library had a major peak at 2n with a minor kmer frequency of 0.5, but also an extended tail of kmer pairs with 3n and 4n coverage and minor kmer frquency of 0.3 and 0.5, respectively (Additional file 2: Figure S2). Finally, we used the program nQuire to evaluate models of diploidy, triploidy, and tetraploidy using all reads, reads that did not map to highly repetitive regions (discussed below), and reads mapping to BUSCO genes. While the “denoise” step of analysis removed at least 40% of the sites from the first two datasets, all three datasets supported a model of diploidy for Italy2, TiscarSM28, OHJ22 and OHJ82, and tetraploidy for Tokyo1 and B. calyciflorus (Additional file 6: File S2).

Repetitive element analyses

RepeatMasker, using either its “Metazoa” library or de novo RepeatModeler libraries, identified a small number of repetitive elements in each assembly (Additional file 7: File S3). Although the total repetitive DNA content increased with assembly size, the proportion of repetitive DNA only increased from 6 to 11% and did not account for significant portions of the differences in genome size across the species complex. However, de novo repetitive element identification using the program dnaPipeTE directly on read libraries revealed more repetitive elements, in terms of both diversity and genome proportion (Fig. 3, Additional file 7: File S3). Estimates of the genome content of these elements consistently and significantly increased with genome size in both absolute (linear regression, p = 0.0014, df = 4) and relative amounts (linear, regression, p = 0.0003, df = 4), from 16.8 Mbp in Italy2 (15%) to 185.92 Mbp in OHJ22 (44%). The difference in repetitive content between Italy2 and OHJ22 was just over half (54%) of the total difference in genome size (Fig. 3). Repetitive elements could account for 71% of the genome size difference between OHJ82 and Tokyo1 (the most closely related species to B. asplanchnoidis). When the repetitive elements generated from this method were used as a library for RepeatMasker, similar, but slightly lower proportions of the genome assemblies were annotated as repetitive (Additional file 7: File S3).

Fig. 3 a Proportional repetitive element content estimates per genome using dnaPipeTE, b shows these estimates in Mbp of each genome, Bcal = B. calyciflorus Full size image

LTR (Long Terminal Repeat) and LINE (Long Interspersed Nuclear Element) retrotransposons, and DNA transposons are the three largest groups of annotated transposons in the B. asplanchnoidis genomes. Together, these account for 3.3% of the genome of Italy2 and 27% of the genome in OHJ22 (Fig. 3). Additionally, as genome size increases across the species complex, the number of less diverged elements in these three groups increases, and this increase is not observed when considering only assembly-based repeat annotation (Fig. 4). The proportion of less diverged elements in these classes also increases with genome size (Additional file 3: Figure S3). Within B. asplanchnoidis (OHJ82 and OHJ22), there are also changes in the number and proportion of less diverged elements.

Fig. 4 Distributions of repetitive element divergence estimates of three repetitive element classes from repetitive element annotation of read libraries (dnaPipeTE, red) and assemblies (dnaPipeTE_RM, blue). For dnaPipeTE the count reflects the number of reads which had a BLAST hit to any one dnaPipeTE assembled repetitive element, and for dnaPipeTE_RM, this represents one instance of a BLAST alignment of a dnaPipeTE assembled repetitive element in the respective genome assembly Full size image

Using the dnaPipeTE method we estimated that the B. calyciflorus genome consists of 38.9% repetitive elements (Fig. 3, Additional file 7: File S3), many of which are simple/satellite (10.9% of the genome) or low complexity repeats (5.6% of the genome). We also found all other classes of repetitive elements as in the B. plicatilis genomes in this genome, including SINE elements (0.26 Mbp, or 0.08% of the genome), which were not previously reported.

Gene annotations

We used the protein sequences of the predicted gene models from the published B. calyciflorus genome [32] to annotate 11,000–12,500 genes in each of our five genome assemblies (Table 2). The assemblies had fewer annotated genes than the B. calyciflorus reference. The difference in gene number could be accounted for due to our assemblies all having far fewer single-intron genes. Our assemblies also have smaller mean lengths of exons, introns, and intergenic regions. A smaller mean intergenic distance could be an artefact of a less-contiguous assembly, so intergenic distance for B. calyciflorus was recalculated as if each contig was broken in 10 pieces, however, this did not reduce the intergenic distance (not shown). In contrast, our assemblies had a higher proportion of pseudogenes than B. calyciflorus, and the number of pseudogenes increased with genome size (R2 = 0.93). In the species with smaller genomes (B. rotundiformis, B. sp. ‘Tiscar’, and B. plicatilis), mean intron size increased with genome size (R2 = 0.95), resulting in an increase in total intronic DNA. However, the total contribution of pseudogenes and intronic DNA is relatively small compared to overall differences in genome size.

Table 2 Gene number after annotation and quality filtering with fathom, the number of single exon genes, number of potential pseudogenes, sum total gene, exon and intron sizes, mean exon and intron size, mean intergenic size, intergenic50 (similar to N50, but calculated with intergenic size instead of contig size), and the GC content of the genes Full size table

Most of the annotated genes, when clustered by OrthoVenn, were shared between all, or most of the assemblies. Only 446 of 12,372 gene clusters were found in any single assembly and not shared by any others (Additional file 4: Figure S4). Most of these gene clusters (366) were in the B. calyciflorus genome assembly. The B. calyciflorus genome assembly also had about 1000 more gene clusters than the B. plicatilis genomes annotated here.