The genome of B. antarctica is the smallest yet reported for an insect. The estimate of total genome size based on flow cytometry is 1C=99.25±0.4 megabase pairs (Mbp) for the female and 1C=98.4±0.1 Mbp for the male (Supplementary Methods; Supplementary Fig. 1; Supplementary Table 1). On the basis of the raw sequence reads, we estimate the size of the B. antarctica genome to be >89.5 Mbp but <105 Mbp (Supplementary Fig. 2). Previous cytological preparations of polytene chromosomes from salivary glands indicate that B. antarctica has three linkage groups (2n=6)14. For the reference genome, we sequenced a single larva of B. antarctica of unknown sex from Cormorant Island, near Palmer Station, Antarctica, using Illumina sequencing technology and the Velvet de novo assembler15. Several assemblers were compared (Supplementary Table 2). Paired-end reads from RNA-seq data10 were used to improve the assembly by scaffolding contigs, resulting in 5,064 scaffolds. One Pacific Biosciences RSII SMRTbell library was also generated to scaffold the assembly, but it added minimal scaffolding owing to the limited amount of DNA in a single individual. The size of the assembled genome was 89.6 Mbp, including ambiguous bases; this represents over 90% of the total genome (Table 1). The assembly consists of 5,003 contigs >300 bp (Supplementary Fig. 3), with an N50 contig length of 98.2 kilobases (kb) and an average coverage of 177× (Supplementary Fig. 4). A total of 83.89 Mbp (93.7% of the assembled genome) was contained in 1,256 contigs >10 kb. The longest contig assembled was over 622 kb. These multiple lines of evidence, together with the identification of nearly all (97%) core eukaryotic genes, suggest a high-quality assembled genome (see also Supplementary Methods).
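For readers unfamiliar with the N50 statistic quoted above, the following minimal Python sketch shows how it is conventionally computed from a list of contig lengths. The function name `n50` and the toy lengths are illustrative assumptions; they are not taken from the assembly pipeline or the actual B. antarctica contigs.

```python
def n50(lengths):
    """Return the N50 of a set of contig lengths.

    The N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths supplied")


# Toy contig lengths in bp (hypothetical, not from the real assembly).
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))
```

Applied to the real contig lengths, the same calculation would be expected to return the 98.2 kb value reported here, provided the same minimum contig length (>300 bp) is used as a filter.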

Table 1 Genome assembly and annotation summary.

The genome of B. antarctica is smaller than even the tiny genomes reported for the body louse (104.7 Mbp) and Strepsiptera (108 Mbp)16,17. Previously published genome size estimates for three chironomid species, as well as new flow cytometry estimates for three additional members of the family Chironomidae (1C=108–118 Mbp), further suggest that the B. antarctica genome is small even for a chironomid (Supplementary Methods; Supplementary Table 1). The small genome found in this Antarctic midge does not conform to the coupling frequently reported between low temperatures and large genomes18, suggesting that alternative evolutionary forces are driving the small size of this genome. The only other Diptera with genomes near 100 Mbp are Coboldia fuscipes (Scatopsidae) and Psychoda cinerea (Psychodidae), cosmopolitan species whose genome sizes may be constrained by early developmental traits19.

Amplification, deletion and rearrangements of repeated DNA sequences may account for intraspecific variations in genome size20. In B. antarctica the small size of the genome is a function of a paucity of repeats, including a reduction in the number of TEs, and of the reduced length of introns (Fig. 2). Analysis of the repeat content of the genome assembly revealed that repeat elements comprise only 0.49% of the assembled genome and, if the discrepancy between the assembled genome size and the flow cytometry estimate is attributed entirely to repeat elements, ~10% of the entire genome. Most of the repeat elements we identified were found in low-complexity sequences (Table 2; Supplementary Methods; Supplementary Data 1; Supplementary Tables 3–5). Using known TE libraries and examining raw reads, we estimate that only 0.016% of the genome failed to assemble due to TE insertions. Furthermore, no species-specific TEs were detected (Supplementary Methods). TEs make up ~0.12% of the B. antarctica genome, a small proportion compared with Aedes aegypti (47%)21, Anopheles gambiae (16%)22, Culex quinquefasciatus (29%)23 and Drosophila melanogaster (20%)24,25. Unlike these four Diptera, the body louse Pediculus humanus humanus, like B. antarctica, has a small genome (1C=105 Mbp) associated with a low TE proportion (1.03% of genome)17.
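The ~10% whole-genome repeat estimate can be reproduced from figures given in the text. The sketch below is illustrative arithmetic only: it assumes, as the text does, that all sequence missing from the assembly is repetitive, and it uses the two flow cytometry estimates as the total genome size. The function and constant names are hypothetical.

```python
ASSEMBLED_MBP = 89.6                 # assembled genome size from the text
REPEAT_FRACTION_ASSEMBLED = 0.0049   # 0.49% of the assembly is repeats
FLOW_ESTIMATES_MBP = {"female": 99.25, "male": 98.4}


def whole_genome_repeat_fraction(genome_mbp,
                                 assembled_mbp=ASSEMBLED_MBP,
                                 repeat_fraction=REPEAT_FRACTION_ASSEMBLED):
    """Repeat fraction of the whole genome, assuming that every
    unassembled base is repetitive (the assumption stated in the text)."""
    unassembled_mbp = genome_mbp - assembled_mbp
    assembled_repeats_mbp = assembled_mbp * repeat_fraction
    return (unassembled_mbp + assembled_repeats_mbp) / genome_mbp


for sex, size_mbp in FLOW_ESTIMATES_MBP.items():
    fraction = whole_genome_repeat_fraction(size_mbp)
    print(f"{sex}: {fraction:.1%} of the genome")
```

Both flow cytometry estimates give values close to 10%, which is an upper bound because some unassembled sequence may not be repetitive.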

Figure 2: Distribution of genome annotations among five Diptera. Each panel is ordered with respect to overall assembled genome size. The four panels represent the total amount of sequence in each annotation: genome size, intron, coding sequence (CDS) and transposable elements (TEs).

Table 2 Repeat content in B. antarctica.

The TEs found in the B. antarctica genome were of multiple origins, representing 154 TE families from the three main TE orders (DNA elements, retroelements with long terminal repeats (LTRs) and non-LTR retroelements). A total of 513 TE insertion locations were identified in the assembled genome (Table 2). Of those 513 TE insertions, 74 were nested with >1 TE insertion, while the remaining 439 clearly correspond to unique TE insertions. An additional 23 TE insertions were identified as absent from the assembly because they were located in the flanking regions of the contigs. Most of the unique insertions are from retroelements. The reduced number of TEs in the genome is also reflected in the ribosomal genes. R1 and R2 non-LTR retroelements have been identified in the ribosomal DNA (rDNA) loci of nearly all arthropod lineages examined to date26. However, based on reconstruction of the rDNA region, B. antarctica lacks both R1 and R2 non-LTR retroelements. All lines of evidence suggest that the TE insertions in B. antarctica are of ancient origin.

Approximately 19.4% (just under 19 Mbp) of the B. antarctica genome is protein coding, and the genome contains 97% of the core eukaryotic genes (Supplementary Methods; Supplementary Tables 6 and 7). This is a large coding proportion in comparison with Ae. aegypti (22 Mbp, 1.6% of the genome), An. gambiae (20.7 Mbp, 7.6%), C. quinquefasciatus (24.9 Mbp, 4.3%) and D. melanogaster (22.8 Mbp, 13.6%) (Fig. 2). A total of 13,517 protein-coding genes were annotated, underscoring that loss of gene function is not driving the small genome of B. antarctica. On the basis of cox1 sequence data, our sample clusters with other samples collected at the same location (Supplementary Fig. 5). Of the 13,517 gene models, 12,914 were supported by at least one RNA-seq read, and 9,011 were supported by at least 100 RNA-seq reads. Among the annotated genes, 8,575 have unique alignments to entries in the SwissProt database, and 10,557 have matches in the non-redundant database. Genes are clustered in regions of relatively high GC content (the GC content of coding regions is 47%, compared with 37% for non-coding regions; Supplementary Table 8).

We compared the B. antarctica genome with those of four other dipteran species: three mosquitoes (Ae. aegypti, An. gambiae and C. quinquefasciatus) and D. melanogaster, the insect with the most completely annotated genome. Overall, B. antarctica has an intermediate genome GC content but a lower coding GC% than any of the other four Diptera (Supplementary Table 8). Analysis of codon usage suggests that B. antarctica is not unusual compared with the other four dipteran species (Supplementary Fig. 6). Potential clusters of orthologous genes for comparative analyses were determined using annotations from the four dipteran species (Fig. 3). In an orthoMCL27 comparison among the five species, 4,910 genes were unique to B. antarctica, and 3,582 one-to-one orthologous genes shared by all five species were identified. Given the lack of TEs in the B. antarctica genome, we interrogated the PIWI-interacting RNA (piRNA) pathway genes (Supplementary Methods). The piRNA pathway serves to control transposon activity28. We identified several key players in the piRNA pathway that are absent from the B. antarctica genome (Supplementary Table 9).

Figure 3: Orthologous gene clusters. Venn diagram of orthologous gene clusters among B. antarctica, An. gambiae, Ae. aegypti, C. quinquefasciatus and D. melanogaster. The numbers in each area indicate the number of orthologous gene clusters in each category, and the numbers in parentheses indicate the total number of genes in each area. The Venn diagram was generated at http://bioinformatics.psb.ugent.be/webtools/Venn/.

Intron size distribution was compared with protein-coding length distribution, calculated for the one-to-one orthologs as well as all genes for each of the five dipteran species (Supplementary Tables 10–12; Supplementary Fig. 7). The comparison showed that reduction in intron length also contributed to the reduced size of this genome.

While the number of genes is consistent with that of other Diptera, the relative proportion of genes in different ontologies differs between B. antarctica and An. gambiae. Gene ontology (GO) terms were assigned to the gene models for B. antarctica and the published genes of An. gambiae using Blast2GO29; this yielded 8,856 and 8,653 genes, respectively, with at least one GO term. A comparison of the gene sets using Fisher’s exact test revealed 162 GO terms positively enriched in B. antarctica and 20 terms negatively enriched (Supplementary Data 2). Many of the positively enriched terms fall into two broad categories, ‘development’ (38 terms) and ‘regulation of biological processes’ (50 terms).
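As an illustration of the per-term comparison, the sketch below implements a two-sided Fisher's exact test on a 2×2 table of annotated versus non-annotated genes in the two species. It is a minimal standard-library version, not the Blast2GO implementation; the function name and the example term counts (120 and 60) are hypothetical, and only the gene-set sizes (8,856 and 8,653) come from the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalised probability of a table whose top-left cell is x.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    extreme = sum(w for w in map(weight, range(lowest, highest + 1))
                  if w <= observed)
    return extreme / comb(row1 + row2, col1)


# Hypothetical GO term: annotated to 120 of 8,856 B. antarctica genes
# and to 60 of 8,653 An. gambiae genes (term counts are made up).
p_value = fisher_exact_two_sided(120, 8856 - 120, 60, 8653 - 60)
print(f"p = {p_value:.3g}")
```

Because the weights are compared as exact integers, no floating-point tolerance is needed when deciding which tables are at least as extreme as the observed one. Any correction for testing many GO terms at once is outside the scope of this sketch.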

The effective population size of B. antarctica has been decreasing over the past 10,000 years. Mapping reads onto the assembled genome allowed us to identify 195,860 of the 88,780,579 non-repeat-masked base pairs in the assembled genome as putative heterozygous sites; this is a heterozygosity rate of ~0.2%, or an average of one heterozygous position for every ~450 bp in this single individual (in contrast, D. melanogaster has an order of magnitude more heterozygosity, at ~2%). The result is similar (~185,000 single-nucleotide polymorphisms (SNPs) in 83.89 Mbp) when the analysis is limited to contigs greater than 10 kb. Using the D. melanogaster single-nucleotide mutation rate of 8.4 × 10⁻⁹ per site per generation (ref. 30), we estimate that the time-averaged effective population size of B. antarctica is ~60,000 diploid individuals. We used a pairwise sequentially Markovian coalescent analysis31 to infer changes in population size over time from this single individual (Fig. 4). Assuming that the mutation rate used for B. antarctica is correct, the analysis suggests that the population reached its maximum size just before the last glacial maximum, and that the midge populations declined markedly at the glacial maximum but survived in refugia during the period of extensive glaciation. The use of alternate mutation rates would, of course, shift the estimates either higher or lower (Supplementary Figs 8 and 9). Our work is consistent with current hypotheses on Antarctic arthropod dispersal, indicating that most endemic species established well before the last glacial maximum and survived in isolated refugia during glacial periods32. Moreover, low levels of genetic diversity suggest a small effective population size, implying that strong selective pressure drove the fixation of adaptive alleles underlying these unique features of the midge genome.
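The heterozygosity and effective population size figures above can be checked with a few lines of arithmetic. The sketch below assumes the standard neutral expectation for a diploid population, θ = 4Nₑμ, with per-site heterozygosity used as the estimate of θ; the text does not state which estimator was used, so this relationship is an assumption, as are the function names.

```python
HET_SITES = 195_860          # putative heterozygous positions (from the text)
CALLABLE_SITES = 88_780_579  # non-repeat-masked base pairs (from the text)
MUTATION_RATE = 8.4e-9       # D. melanogaster rate per site per generation


def heterozygosity(het_sites, callable_sites):
    """Fraction of examined sites that are heterozygous."""
    return het_sites / callable_sites


def effective_population_size(theta, mutation_rate):
    """Diploid Ne under the assumed neutral relationship theta = 4 * Ne * mu."""
    return theta / (4 * mutation_rate)


theta = heterozygosity(HET_SITES, CALLABLE_SITES)
print(f"heterozygosity: {theta:.4%}")
print(f"one heterozygous site per {1 / theta:.0f} bp")
print(f"Ne: {effective_population_size(theta, MUTATION_RATE):,.0f}")
```

Under these assumptions the calculation gives an effective population size on the order of 6 × 10⁴, in line with the ~60,000 reported. Because Nₑ scales inversely with the mutation rate, a different assumed rate shifts the estimate proportionally.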