Generation of models to detect NCLDV proteins

Initial hidden Markov models (HMMs) for the MCPs were built from a multiple sequence alignment of published NCLDV MCPs and subsequently updated on the basis of extracted metagenomic NCLDV MCP sequences. We screened around 537 million proteins encoded on about 45.1 million contigs with a length greater than 5 kb available in 8,535 public metagenomes in IMG/M43 (June 2018) for contigs that encode the NCLDV MCP using a version of hmmsearch (v.3.1b2, http://hmmer.org/) that is optimized44 for the supercomputer Cori, with a set of models for the NCLDV MCP (https://bitbucket.org/berkeleylab/mtg-gv-exp/) and an E-value cut-off of 1 × 10−10. The 1,003,222 proteins found on the 77,701 contigs with hits for MCPs were then clustered with CD-hit45 at a sequence similarity of 99% to remove nearly identical and identical proteins. This resulted in 524,161 clusters and singletons. The cluster representatives were used to infer protein families using orthofinder (v.2.27) with default settings and the -diamond flag46,47. Multiple sequence alignments were built with mafft48 (v.7.294b) for protein families that included at least 10 members and corresponding HMM models were obtained with hmmbuild (v.3.1b2, http://hmmer.org/). This led to a total of 7,182 HMMs that can detect NCDLV proteins that were then tested against all public genomes in IMG/M43 (June 2018). Models that gave rise to hits above an E-value cut-off of 1 × 10−10 in more than 10 reference genomes were removed. The resulting 5,064 models were then used for targeted binning of NCLDV metagenome contigs.

Identification of NCLDV-specific genome features and design of an automatic classifier

A set of representative genomes of bacteria, archaea, eukaryotes and non-NCLDV viruses was gathered from the IMG/M database43 (June 2018) and combined with NCLDV genomes assembled from metagenomes and protist genomes downloaded from NCBI GenBank to identify NCLDV-specific genome features. Genes were predicted for these genomes using Prodigal49 (v.2.6.3; February, 2016) in both ‘regular’ mode (default parameters) and with the option ‘-n’ activated, which forces a full motif scan. For genomes of less than 100 kb, the option ‘-p meta’ was used to apply precalculated training files rather than training the gene predictor from the genome, as recommended by the tool documentation. Next, a set of different metrics was calculated for each genome on the basis of the genes predicted with a confidence of ≥90 and score of ≥50. These included gene density (number of genes predicted on average per 10 kb of genome), coding density (number of bp predicted as part of a coding sequence per 10 kb of genome), spacer length (average length of the spacer between the predicted ribosomal binding site (RBS)), predicted start codon for genes in which a putative RBS was detected and RBS motif profile (the proportion of each type of RBS predicted in the genome, see below).

For the RBS motif profile, motifs were predicted using the full motif scan option of prodigal (see above). Notably, some of these motifs may not represent true RBSs, but are instead other conserved motifs (including transcription-related motifs) found upstream of start codons in these different genomes. These motifs were grouped into 11 categories as follows: (1) ‘None’ for cases in which prodigal did not predict a RBS; (2) ‘SD_Canonical’ for different variations of the canonical AGGAGG Shine–Dalgarno sequence (for example, AGGAG, AGxAG, GAGGA, as well as motifs identified by Prodigal as ‘3Base_5BMM’ or ‘4Base_6BMM’); (3) ‘SD_Bacteroidetes’ for variations of the motif predicted typically from Bacteroidetes genomes (TA{2,5}T{0,1}: T followed by 2–5 As, and with sometimes a terminal T); (4) ‘Other_GA’ for motifs that include ‘GA’ patterns but that are different from the canonical Shine–Dalgarno sequence, for example, GAGGGA, typically identified in a few archaeal and bacterial genomes; (5) ‘TATATA_3.6’ for variations of the motif typically detected in NCLDV, that is, a motif of 3–6 bp with alternating Ts and As (TAT, ATAT, TATA, TATAT, and so on); (6) ‘OnlyA’ for motifs exclusively composed of As not already included in a previous group, for example, AAAAA, most often found in Bacteroidetes; (7) ‘OnlyT’ for motifs exclusively composed of Ts not already included in a previous group, for example, TTTTT, found at a low frequency in some archaeal genomes; (8) ‘DoubleA’ for motifs with two consecutive As not already included in a previous group, for example, AAAAC, most often found in Bacteroidetes and bacteria from the candidate phyla radiation (CPR) group; (9) ‘DoubleT’ for motifs with two consecutive Ts not already included in a previous group, for example, TACTT, found at a low frequency in plants, Bacteroidetes and NCLDV; (10) ‘NoA’ for motifs without any As and not included in a previous group, for example, TCTCG, found in some archaeal genomes; and (11) ‘Other’ for motifs that did not fit into any of these categories.

Representative genomes were then grouped on the basis of the frequency of each motif type through hierarchical clustering (R function ‘hclust’). This enabled the delineation of 12 genome groups on the basis of taxonomy (at the kingdom or domain ranks) and motif profile (Extended Data Fig. 2). Two types of random-forest classifiers were then built on the basis of the 14 features (11 motifs, gene density, coding density and average spacer length, see above): one for which the category to be predicted was binary (that is, ‘Virus_NCLDV’ versus ‘Other’) and one for which the category to be predicted was the set of genome groups based on predicted RBS motifs (‘NCLDV (non-pandoraviruses)’, ‘animal and plants’, ‘protists & fungi’, ‘canonical bacteria and archaea’, ‘bacteroidetes-like’, ‘bacteria (CPR)’, ‘atypical bacteria’, ‘atypical archaea’, ‘plasmids’ and ‘other viruses’, which include pandoraviruses). The 14 features were evaluated on the whole genomes, as well as on fragments of 20 kb and 10 kb selected randomly along the genomes. These random fragments were used to train a classifier on input sequences more comparable to metagenome assemblies, which most often represent short genome fragments of a few kb. For these fragments, Prodigal was run with the ‘-p meta’ option and default parameters otherwise50, that is, without a full motif scan, as these sequences are typically too short to identify de novo RBS motifs. Animal and plant genomes were not included in this analysis as these are highly unlikely to be assembled from metagenomes. All classifiers were built using R library randomforest and included 2,000 trees, with default parameters otherwise, and 10-fold cross-validation was performed to evaluate the classifier accuracy. The probability ‘prob’ of NCLDV origin was used as a prediction score to evaluate the classifiers and was then applied to metagenome assemblies. Because the input dataset is easily skewed towards bacterial and archaeal genomes, specificity and sensitivity were evaluated separately for each group of genome (Extended Data Fig. 2c). Statistical tests were performed in R using the package stats (Kolmogorov–Smirnov test)51 and effsize (Cohen’s effect size)52.

MAGs from non-targeted binning of IMG genomes

Complementary to the targeted binning of NCLDV contigs, we performed genome binning of public metagenomes in IMG/M (assessed June 2018)15 with MetaBAT (v.0.32.4)53 in the ‘superspecific’ mode, using read coverage information, if available in IMG, and a minimum contig length of 5 kb. Resulting MAGs were then checked for quality using CheckM (v.1.0.7)54. Genome bins with completeness <50% were labelled as low quality according to the ‘minimum information for a MAG’ (MIMAG) standards55.

Targeted binning of putative NCLDV metagenome contigs

The 5,064 NCLDV-specific models were used for hmmsearch (v.3.1b2, http://hmmer.org/) on the initial set of around 537 million proteins encoded on about 45 million contigs with a length greater than 5 kb with an E-value cut-off of 1 × 10−10 (Extended Data Fig. 1). In addition to the screening of the metagenomic contigs with NCLDV-specific models, we also used an automatic classifier using gene density and RBS motifs (see above). On the basis of the output of the automatic classifier, a score was assigned to each contig: a score of 2 if Ratio_TATATA_36 > 0.3 or Pred_simple_NCLDV_score > 0.3 and the prediction result was ‘Virus_NCLDV’, a score of 1 if Ratio_TATATA_36 > 0.3 or Pred_simple_NCLDV_score > 0.1 or the prediction result was ‘Virus_NCLDV’, otherwise a score of 0. On the basis of the cross-validation of the classifier, these parameters were chosen to maximize sensitivity while retaining enough specificity. The resulting set of around 1.2 million contigs with an RBS score of at least 1 and/or at least 20% of encoded genes (1 out of 5) with hits to the NCLDV models were subject to metagenomic binning as follows: for each metagenome, putative NCLDV contigs were extracted and binning performed with MetaBAT56 (v.2) and contig read coverage information was used as input in case it was available in IMG43. The targeted binning approach gave rise to around 72,000 putative NCLDV MAGs.

Filtering of GVMAGs

Contigs with a length of less than 5 kb were removed from GVMAGs. Filtering was performed on the basis of the copy number of NCVOGs16 (Supplementary Tables 2, 3). GVMAGs were removed when they encoded more than 20 copies of NCVOG0023, 4 copies of NCVOG0038, 12 copies of NCVOG0076, 7 copies of NCVOG0249 or 4 copies of NCVOG0262. On the basis of the copy numbers of 16 conserved NCOVGs (NCVOG0035, NCVOG0036, NCVOG0038, NCVOG0052, NCVOG0059, NCVOG0211, NCVOG0249, NCVOG0256, NCVOG0262, NCVOG1060, NCVOG1088, NCVOG1115, NCVOG1117, NCVOG1122, NCVOG1127 and NCVOG1192), which are usually present at low copy numbers across all published NCLDV genomes, a duplication ratio was calculated as follows. The total number of copies of the 16 NCVOGs in the respective GVMAG was divided by the total number of unique observations of the 16 NCVOGs. GVMAGs with a duplication ratio higher than three were excluded from the dataset. We then used Diamond BLASTp47 against the NCBI non-redundant (nr) database (August 2018) and assigned a taxonomic affiliation on the basis of best BLASTp hits against Archaea, Bacteria, Eukaryota, phages or other viruses (including NCLDVs) to proteins using an E-value cut-off of 1 × 10−5. Best hits of query proteins to proteins derived from MAGs from the Tara Mediterranean metagenome binning survey57 were disregarded owing to the high number of misclassified genomes in this dataset. Proteins without a hit in the NCBI nr database were labelled as ‘Unknown’. We then applied filters to remove contigs from GVMAGs on the basis of the distribution of taxonomic affiliation of best blast hits (Supplementary Table 7). Finally, alignments were built with mafft48 (v.7.294b) for NCVOG0023, NCVOG0038, NCVOG0076, NCVOG0249 and NCVOG0262. Positions with 90% or more gaps were removed from the alignments with trimal58 (v.1.4). Protein alignments were concatenated and a species tree constructed with IQ-tree59 (LG + F + R8, v.1.6.10). The phylogenetic tree was then manually inspected and for each clade outliers were removed on the basis of the presence, absence and copy numbers of 20 conserved NCVOGs16, duplication factor (see above), coding density, GC content and genome size. In addition, GVMAGs that represented singletons on long branches were manually removed. The filtered dataset was then clustered together with all available NCLDV reference genomes (December 2018) using average nucleotide identities of greater than 95% and an alignment fraction of at least 50% with FastANI60 (v.1.1). For each 95% average nucleotide identity cluster the 6 NCVOGs16 with the on-average longest amino acid sequences (NCVOG0022, NCVOG0023, NCVOG0038, NCVOG0059, NCVOG0256 and NCVOG1117) were subjected to a within-cluster all-versus-all BLASTp. GVMAGs that had any full-length 100% identity hits between any of these maker proteins to other cluster members were removed from the dataset as potential duplicates. Duplicate GVMAGs originating from the conventional binning approach were removed first and GVMAGs with the largest assembly size were retained.

GVMAG quality on the basis of estimated completeness and contamination

Estimation of the quality of MAGs is critical for their interpretation and use in downstream applications. Standards exist for bacterial and archaeal MAGs that have proposed a three-tier classification (high, medium or low quality) based on estimated genome completeness and contamination55. These completeness and contamination metrics are typically calculated on the basis of a set of universal single-copy marker genes. A set of conserved genes in the NCLDV are the NCVOGs16, of which a subset has been shown to be probably vertically inherited16 (NCVOG20, Supplementary Table 2). We calculated for each superclade the average number of NCVOG20 present either as a single copy or as multiple copies (Supplementary Table 3). We then compared the number of observed single- and multicopy NCVOG20 in every GVMAG to the mean number of observations in the respective superclade. Considering the high genome plasticity of NCLDVs2,61, we tolerated a deviation from the mean by a factor of 1.2, which was considered low contamination, and a factor of 2 was considered medium contamination (Extended Data Fig. 4 and Supplementary Table 4). Higher deviations from the superclade mean were potentially caused by a non-clonal composition of the GVMAG; these were, as a consequence, considered to be of high contamination. We also estimated completeness on the basis of the presence of the NCVOG20 compared with other members of the respective superclade. The presence of 90% or more of the NCVOG20 compared with the superclade mean resulted in a classification as high quality in terms of completeness. If at least 50% of NCVOG20 were present in a GVMAG then the respective GVMAG was classified as medium quality in terms of estimated completeness, or low if less than 50% of NCVOG20 were present (Extended Data Fig. 4 and Supplementary Table 4). The final GVMAG quality was determined on the basis of a combination of contamination and completeness (Supplementary Table 8). Additional criteria to assign GVMAGs to the high-quality category were the presence of no more than 30 contigs, a minimum assembly size of 100 kb and the presence of at least one contig with a length greater than 30 kb. To assign a GVMAG to the medium-quality category were the presence no more than 50 contigs, a minimum assembly size of 100 kb and the presence of at least one contig with a length greater than 15 kb.

Annotation of GVMAGs

Gene calling was performed with GeneMarkS using the virus model62. For functional annotation proteins were subject to BLASTp against previously established NCVOGs16 and the NCBI nr database (May 2019) using Diamond (v.0.9.21) BLASTp47 with an E-value cut-off of 1.0 × 10−5. In addition, protein domains were identified by pfam_scan.pl (v.1.6) against Pfam-A63 (v.29.0), and rRNAs and introns were identified with cmsearch using the Infernal package64 (v.1.1.1) against the Rfam database65 (v.13.0). No rRNA genes were detected in the final set of GVMAGs. The eggNOG mapper66 (v.1.0.3) was used to assign functional categories to NCLDV proteins. Protein families were inferred with PorthoMCL67 (version of December 2018) with default settings.

Survey of the NCLDV MCP

We used hmmsearch (v.3.1b2, http://hmmer.org/) optimized for the supercomputer Cori44 to identify all copies of MCP encoded in the final set of GVMAGs and NCLDV reference genomes. Proteins were extracted and multiple sequence alignments were created with mafft48 (v.7.294b) for 74 NCLDV lineages with at least 5 copies of MCP. For each lineage-specific MCP alignment, we inferred models with hmmbuild (v.3.1b2, http://hmmer.org/). Using these models, the modified version of hmmsearch (v.3.1b2, http://hmmer.org/)44 was used to identify all MCPs in the entire set of metagenomes (IMG/M43, June 2018), MCPs with identical amino acid sequences were excluded as potential duplicates. A logistic-regression-based classifier (sklearn LogisticRegression, solver = ‘lbfgs’, multi_class = ‘ovr’) was trained for each NCLDV lineage taking into account the score distribution of all lineage MCPs hits against the entire set of lineage-specific MCP models. The accuracy of the classifier was 0.861. Unbinned metagenomic MCPs were assigned to NCLDV lineages if the classifier returned a probability greater than 50% (sklearn predict_proba), or as ‘novel’ if the probability was 50% or below. We then normalized the environmental MCP counts on the basis of the observed average copy number of MCP in GVMAGs and reference genomes in the respective lineage. Distribution of NCLDV lineages on the basis of MCPs was projected on a world map with Python 3/basemap on the basis of coordinates provided in IMG metagenomes43.

NCLDV species tree

To build a species tree of the extended NCLDV, viral genomes with at least three out of five core NCVOGs16 were selected: DNA polymerase elongation subunit family B (NCVOG0038), D5-like helicase-primase (NCVOG0023), packaging ATPase (NCVOG0249), DNA or RNA helicases of superfamily II (NCVOG0076), and poxvirus late transcription factor VLTF3-like (NCVOG0262). The NCVOGs were identified with hmmsearch (version 3.1b2, http://hmmer.org/) using an E-value cut-off of 1 × 10−10, extracted and aligned using mafft48 (v.7.294b). Columns with less than 10% sequence information were removed from the alignment with trimal58. The species tree was then calculated on the basis of the concatenated alignment of all five proteins with IQ-tree59 (v.1.6.10) with ultrafast bootstrap68 and LG + F + R8 as suggested by model test as the best-fit substitution model69. The percentage increase in phylogenetic diversity70 was calculated on the basis of the difference of the sum of branch lengths of the phylogenetic species trees of the NCLDV including the GVMAGs compared with a NCLDV species tree calculated from published NCLDV reference genomes (n = 205, no dereplication based on the average nucleotide identity) with IQ-tree as described above. Phylogenetic trees were visualized with iTol71 (v.5). Genus or subfamily level lineages were defined on the basis of their monophyly in the species tree and presence or absence pattern of conserved NCVOGs (Supplementary Table 4). If no viral isolates were present in the respective monophyletic clade we designated it MGVL. Neighbouring lineages with isolates and MGVLs were further combined under the working term superclade. Branch lengths separating clades differ based on the density of sampled viruses.

Protein trees

Target proteins were extracted from NCLDV genomes and used to query the NCBI nr database (June 2018) with Diamond BLASTp47. The top-50 hits per query were extracted, merged with queries, dereplicated on the basis of protein accession number and aligned with MAFFT (-linsi, v.7.294b)48, trimmed with trimal58 (removal of positions with more than 90% of gaps) and maximum-likelihood phylogenetic trees inferred with IQ-tree59 (multicore v.1.6.10) using ultrafast bootstrap68 and the model suggested by the model test feature implemented in IQ-tree69 based on Bayesian information criterion. Selected models are indicated in the legend of Extended Data Fig. 8. Owing to its size, the phylogenetic tree for ABC transporter was inferred with FastTree72 (v.2.1.10) LG and can be accessed at https://bitbucket.org/berkeleylab/mtg-gv-exp/. Phylogenetic trees were visualized with iTol71 (v.5). Information on functional genes including parent contigs is provided in Supplementary Table 5.

Virus–host linkage through HGT

To generate a cellular nr database, all non-cellular sequences and sequences from the Tara Mediterranean genome study57 were removed from the NCBI nr database. All proteins in the NCLDV genomes were then subjected to Diamond BLASTp47 against the cellular nr database using an E-value cut-off of 1 × 10−50, an alignment fraction of 50% and a minimum sequence identity of 50%. Best blast hits within the same lineage were removed. Proteins that had a hit in cellular nr with a lower E value compared with hits in the NCLDV blast database were considered HGT candidates. The total number of best hits from lineage pan-proteomes against defined groups of Eukaryotes were then used as edge weights to build an HGT network. The network was created in Gephi (v.0.92)73 using a force layout and filtered at an edge weight of 2. Pfam annotations of HGT candidates were based on the most commonly detected domains and functional categories were assigned with the eggNOG Mapper (v.1.03)66. Information on HGT candidates including parent contigs is provided in Supplementary Table 6. The number of HGT linkages was limited by the available of reference genomes and the stringency applied.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.