Phage- and plasmid-genome identification

Datasets generated in the current study, those from previous research conducted by our team, the Tara Oceans microbiomes42 and the Global Oceans Virome43 were searched for sequence assemblies that could have derived from phages with genomes of more than 200 kb in length. Read assembly, gene prediction and initial gene annotation followed standard, previously reported methods44,45,46,47,48.

Phage candidates were initially found by retrieving sequences that were not assigned to a genome and had no clear taxonomic profile at the domain level. Taxonomic profiles were determined through a voting scheme, in which the winning taxonomy had to have more than 50% votes for each taxonomic rank on the basis of protein annotations in the UniProt and ggKbase (https://ggkbase.berkeley.edu/) databases49. Phages were further narrowed down by identifying sequences with a high number of hypothetical protein annotations and/or the presence of phage-specific genes, such as capsid, tail, terminase, spike, holin, portal and baseplate. All candidate phage sequences were checked throughout to distinguish putative prophages from phages. Prophages were identified on the basis of a clear transition into the host genome with a high fraction of confident functional predictions, often associated with core metabolic functions and much higher similarity to bacterial genomes. Plasmids were distinguished from phages on the basis of matches to plasmid partitioning and conjugative transfer genes. Those that did not have phage-specific genes were assigned using phylogenetic tree placement using recA, polA, polB, dnaE and the DNA sliding clamp loader gene. Phages and placement assignments were further verified using a network of protein clustering with proteins from RefSeq prokaryotic viruses and 400 randomly sampled plasmids of more than 200 kb using vContact250 (Extended Data Fig. 6).

Phage- and plasmid-genome manual curation

All classified scaffolds were tested for end overlaps indicative of circularization. Assembled sequences that could be perfectly circularized were considered potentially complete. Erroneous concatenated sequence assemblies were initially flagged by searching for direct repeats of more than 5 kb using Vmatch51. Potentially concatenated sequence assemblies were manually checked for multiple large repeating sequences using the dotplot and RepeatFinder features in Geneious v.9. Sequences were corrected and removed from further analysis if the corrected length was more than 200 kb.

A subset of the phage sequences was selected for manual curation, with the goal of finishing (replacing all Ns at scaffolding gaps or local misassemblies by the correct nucleotide sequences and circularization). Curation generally followed previously described methods9. In brief, reads from the appropriate dataset were mapped using Bowtie2 v.2.3.4.152 to the de novo assembled sequences. Unplaced mate pairs of mapped reads were retained with shrinksam (https://github.com/bcthomas/shrinksam). Mappings were manually checked throughout to identify local misassemblies using Geneious v.9. N-filled gaps or misassembly corrections made use of unplaced paired reads, in some cases using reads relocated from sites to which they were mismapped. In such cases, mismappings were identified on the basis of much larger than expected paired read distances, high polymorphism densities, backwards mapping of one read pair or any combination of these. Similarly, ends were extended using unplaced or incorrectly placed paired reads until circularization could be established. In some cases, extended ends were used to recruit new scaffolds that were then added to the assembly. The accuracy of all extensions and local assembly changes were verified in a subsequent phase of read mapping. In many cases, assemblies were terminated or internally corrupted by the presence of repeated sequences. In these cases, blocks of repeated sequences as well as unique flanking sequences were identified. Reads were then manually relocated, respecting paired-read placement rules and unique flanking sequences. After gap closure, circularization and verification of accuracy throughout, end overlap was eliminated, genes were predicted and the start moved to an intergenic region, which was—in some cases—suspected to be origin on the basis of a combination of coverage trends and GC skew53. Finally, the sequences were checked to identify any repeated sequences that could have led to an incorrect path choice because the repeated regions were larger than the distance spanned by paired reads. This step also ruled out artefactual long phage sequences generated by end-to-end repeats of smaller phages, which occur in previously described datasets9.

Structural and functional annotations

After the identification and curation of phage genomes, coding sequences and Shine–Dalgarno ribosomal binding site motifs were predicted with Prodigal using genetic code 11 (-m -g 11 -p single). The resulting coding sequences were annotated as previously described by searching UniProt, UniRef100 and KEGG54. Functional annotations were further assigned by searching proteins in PFAM r3255, TIGRFAMS r1556, Virus Orthologous Groups (VOG) r90 (http://vogdb.org/) and Prokaryotic Virus Orthologous Groups57 (pVOG). tRNAs were identified with tRNAscan-s.e. v.2.058 using the bacterial model. tmRNAs were assigned using ARAGORN v.1.2.3859 with the genetic code of bacteria and plant chloroplasts.

Clustering of the coding sequences into families was achieved using a two-step procedure. A first protein clustering was done using the fast and sensitive protein-sequence searching software MMseqs60. An all-versus-all sequences search was performed using an E-value cut-off of 1 × 10−3, sensitivity of 7.5 and coverage of 0.5. A sequence similarity network was built on the basis of the pairwise similarities and the greedy set cover algorithm from MMseqs was performed to define protein subclusters. The resulting subclusters were defined as subfamilies. To test for distant homology, we grouped subfamilies into protein families using a comparison of hidden Markov models (HMMs). The proteins of each subfamily with at least two protein members were aligned using the result2msa parameter of MMseqs, and HMM profiles were built using the HHpred61 suite from the multiple sequence alignments. The subfamilies were then compared to each other using HHblits from the HHpred suite (with parameters -v 0 -p 50 -z 4 -Z 32000 -B 0 -b 0). For subfamilies with probability scores of at least 95% and coverage at least 0.50, a similarity score (probability × coverage) was used as weight of the input network in the final clustering using the Markov clustering algorithm62, with 2.0 as the inflation parameter. These clusters were defined as the protein families. Protein sequences were functionally annotated on the basis of their best hmmsearch match (v.3.1) (E-value cut-off 1 × 10−3) against an HMM database constructed on the basis of orthologous groups defined by the KEGG database63 (downloaded on 10 June 2015). Domains were predicted using the same hmmsearch procedure against the PFAM r31 database55. The domain architecture of each protein sequence was predicted using the DAMA software64 (default parameters). SIGNALP65 (v.4.1) (parameters, -f short -t gram+) and PSORT66 v.3 (parameters, --long --positive) were used to predict the putative cellular localization of the proteins. Prediction of transmembrane helices in proteins was performed using TMHMM67 (v.2.0) (default parameters). Hairpins (palindromes, based on identical overlapping repeats in the forward and reverse directions) were identified using the Geneious Repeat Finder and located across the dataset using Vmatch51. Repeats of more than 25 bp with 100% similarity were tabulated.

Reference genomes for size comparisons

RefSeq r92 genomes were recovered using the NCBI Virus portal and selecting only complete dsDNA genomes with bacterial hosts. Genomes from a previously published study14 were downloaded from IMG/VR and only sequence assemblies that were labelled ‘circular’ with predicted bacterial hosts were retained. Given the presence of sequences in IMG/VR that were based on erroneous concatenations, we only considered sequences from this source that were more than 200 kb; however, a subset of these was removed as artefactual sequences.

Alternative genetic codes

In cases in which the gene prediction using the standard bacterial code (code 11) resulted in seemingly anomalously low coding densities, potential alternative genetic codes were investigated. In addition to making a prediction using the fast and accurate genetic code inference and logo68 (FACIL) web server, we identified genes with well-defined functions (for example, polymerase or nuclease) and determined the stop codons terminating genes that were shorter than expected. We then repredicted genes using GLIMMER3 v.1.569 and Prodigal with TAG not interpreted as a stop codon. Other combinations of repurposed stop codons were evaluated and candidate codes (for example, code 6, with only one stop codon) were ruled out owing to unlikely gene-fusion predictions.

Large terminase subunit and MCP phylogenetic analyses

The phylogenetic tree of the large terminase subunit was constructed by recovering large terminases from the aforementioned protein-clustering and annotation pipeline. The coding sequences that matched with >30 bitscore to PFAM, TIGRFAMS, VOG and pVOG were retained. Any coding sequence that had a hit to large terminase, regardless of bitscore, was searched using HHblits70 against the uniclust30_2018_08 database. The resulting alignment was then further searched against the PDB70 database. Remaining coding sequences that clustered in protein families with a large terminase HMM were also included after manual verification. Detected large terminases were manually verified using the HHPred70 and jPred71 webservers. Large terminases from the >200-kb phage genomes14 and all >200-kb complete dsDNA phage genomes from RefSeq r92 were also included by protein family clustering with the phage-coding sequences from this study. The resulting terminases were clustered at 95% amino acid identity to reduce redundancy using CD-HIT72. Smaller phage genomes were included by searching the resulting coding sequences set against the full RefSeq protein database and retaining the top 10 best hits. Those hits that had no large terminase match against PFAM, TIGRFAMS, VOG or pVOG were removed from further consideration and the remaining set was clustered at 90% amino acid identity. The final set of large terminase coding sequences that were more than 100 amino acids in length were aligned using MAFFT73 v.7.407 (--localpair --maxiterate 1000) and poorly aligned sequences were removed and the resulting set was realigned. The phylogenetic tree was inferred using IQTREE v.1.6.6 using automatic model selection74. The phylogenetic tree of MCP genes was constructed by retrieving all MCPs annotated by combining the PFAM annotations of protein families and direct annotations by PFAM, TIGRFAMS, VOG and pVOG. Reference MCP gene sequences were collected using the same strategy and sources as for the large terminase subunit tree. The resulting set was further screened by searching against PFAM, TIGRFAMS, VOG and pVOG and removing matches that had no large terminase match regardless of bitscore. The final set of MCP sequences were aligned with MAFFT(--localpair --maxiterate 1000) and the phylogenetic tree was constructed using IQTREE with automatic model selection and 1,000 bootstrap replicates.

Whole-genome scale clustering

To identify phage genomes that were closely related at the whole-genome level, we compared sequences using whole-genome alignments. The goal of this analysis was to further corroborate the identified phylogenetic clades and test for the presence of very similar phages in different habitats and environments. Genomes grouped together in the primary clusters from dRep v.275 were evaluated for genome alignment using Mauve76 within Geneious v.9.

CRISPR–Cas locus and target detection

Phage- and host-encoded CRISPR loci (repeats and spacers) were identified using a combination of MinCED (https://github.com/ctSkennerton/minced) and CRISPRDetect77. A custom database of Cas genes was built by collecting Cas gene sequences from previous studies23,25,78,79,80,81,82 and built with MAFFT (--localpair --maxiterate 1000) and hmmbuild. The coding sequences from this study were searched against the HMM database using hmmsearch with E < 1 × 10−5. Matches were checked using a combination of hmmscan and BLAST searches against the NCBI nr database and manually verified by identifying colocated CRISPR arrays and Cas genes. Spacers extracted from between repeats of the CRISPR locus were compared to sequences assemblies from the same site using BLASTN-short83. Matches with alignment length >24 bp and ≤1 mismatch were retained and targets were classified as bacteria, phage or other. CRISPR arrays that had ≤1 mismatch, were further searched for more spacer matches in the target sequence by finding more hits with ≤3 mismatches.

Host identification

The phylum affiliations of bacterial hosts for phage and plasmid-like sequences were predicted by considering the UniProt taxonomic profiles of every coding sequence for each phage genome. The phylum level matches for each phage genome were summed and the phylum with the most hits was considered as the potential host phylum. However, only cases in which this phylum that had 3× as many counts as the next most-counted phylum were assigned as the tentative phage host phylum. Phage hosts were further assigned and verified using the CRISPR-targeting strategy describe above with the phage and plasmid-like genomes as targets. CRISPR arrays were predicted on all sequence assemblies from the same site that each phage genome was reconstructed. Sequence assemblies containing spacers with a match of length >24 bp and ≤1 mismatch were used to infer phage–host relationships. In all cases, the predicted host phylum based on taxonomic profiling and CRISPR targeting were in complete agreement. Similarly, the phyla of hosts were predicted on the basis of phylogenetic analysis of phage genes also found in host genomes (for example, involved in translation and nucleotide reactions). Inferences based on computed taxonomic profiles and phylogenetic trees were also in complete agreement.

Phage-encoded tRNA synthetase trees

Phylogenetic trees were constructed for phage-encoded tRNA synthetase, ribosomal and initiation factor protein sequences using a set of the closest reference sequences from NCBI and bacterial genomes from the current study. The tRNA synthetases were identified on the basis of annotation of genes using the standard ggKbase pipeline (see above), and confirmed by HMMs with datasets from TIGRFAMs. For each type of tRNA synthetase, references were selected by comparing all of the corresponding genes of this type against the NCBI nr database using DIAMOND v.0.9.2484, their top 100 hits were clustered by CD-HIT using a 90% similarity threshold72. The phylogenetic tree of each tRNA synthetase was constructed using RAxML v.8.0.2685 with the PROTGAMMALG model.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.