The 93 TARA Oceans metagenomes we analysed correspond to a size fraction targeting free-living microorganisms (0.2–3 μm) from 61 surface samples and 32 samples from the deep chlorophyll maximum layer of the water column (Supplementary Table 1). Presumed absent from this size fraction are the majority of those bacterial and archaeal cells that have a symbiotic relationship with eukaryotes, form large aggregates or attach to large particles. Of 33.7 billion metagenomic reads, 30.9 billion passed quality control criteria and were used as input for 12 metagenomic co-assemblies (1.14–5.33 billion reads per set) using geographically bounded samples (Supplementary Fig. 1). A total of 42,193,607 genes were identified in scaffolds longer than 1,000 nucleotides (see Supplementary Table 2 for a summary of the assembly statistics). A combination of automatic and manual binning was applied to each co-assembly output, which resulted in 957 manually curated, non-redundant metagenome-assembled genomes (MAGs) containing 2,288,202 genes (Supplementary Fig. 1; also see ref. 34 for an automatic binning effort that includes larger size fractions).

Our MAGs belonged to the domains Bacteria (n = 820), Eukarya (n = 72) and Archaea (n = 65) (Supplementary Table 3), and recruited 2.11 billion quality controlled reads (6.84% of the data set) when we mapped the metagenomic data back to this collection. The genomic completion estimates for archaeal and bacterial MAGs based on domain-specific single-copy core genes averaged to 79% and 76.1%, respectively, and resolved to the phyla Proteobacteria (n = 432), Bacteroidetes (n = 113), Euryarchaeota (n = 65), Verrucomicrobia (n = 65), Planctomycetes (n = 43), Actinobacteria (n = 37), Chloroflexi (n = 34), Candidatus Marinimicrobia (n = 27), Acidobacteria (n = 6), Cyanobacteria (n = 6), Spirochaetes (n = 5), Firmicutes (n = 2), Ignavibacteriae (n = 1) and diverse members of the Candidate Phyla Radiation (n = 4). We could assign only 6.33% of the bacterial and archaeal MAGs to described genera. Eukaryotic MAGs were substantially larger than bacterial and archaeal MAGs (7.24 Mbp versus 2.26 Mbp and 1.47 Mbp on average, respectively) and were dominated by a small number of genera: Micromonas (n = 14), Emilliana (n = 14), Bathycoccus (n = 8) and Ostreococcus (n = 4). Recovery of these MAGs complements decades of cultivation efforts by providing genomic context for lineages missing in culture collections (for example, Euryarchaeota and Candidatus Marinimicrobia), and allowed us to search for diazotrophs within a large pool of marine microbial populations.

Genomic stability of a well-studied nitrogen-fixing symbiotic population at large scale

Our genomic collection included six cyanobacterial MAGs, one of which (ASW 00003) contained genes that encode the catalytic (nifHDK) and biosynthetic (nifENB) proteins required for nitrogen fixation8. This MAG, which we recovered from the Atlantic southwest metagenomic co-assembly, showed remarkable similarity to the genome of the symbiotic cyanobacterium ‘Candidatus Atelocyanobacterium thalassa’35,36 (previously known as UCYN-A) sorted by flow cytometry from the North Pacific gyre (GenBank accession no. CP001842.1). Besides their comparable size of 1.43 Mbp (MAG ASW 00003) and 1.46 Mbp (consensus genome from isolated cells), their average nucleotide identity was 99.96% over the 1.43 Mbp alignment. ‘Ca. A. thalassa’ is a diazotrophic taxon that lacks key metabolic pathways and lives in symbiosis with photosynthetic eukaryotic cells19,36. The high genomic similarity between ASW 00003 and the ‘Ca. A. thalassa’ genome sorted by flow cytometry demonstrates the accuracy of our metagenomic workflow.

Genomic evidence for nitrogen fixation by Proteobacteria and Planctomycetes

Besides the cyanobacterial MAG, we also identified seven Proteobacteria and two Planctomycetes MAGs in our collection that contained the complete set of genes for nitrogen fixation. To the best of our knowledge, these MAGs (HBD-01 to HBD-09) represent the first genomic evidence of putative HBDs inhabiting the surface of the open ocean (Table 1). They were obtained from the Pacific Ocean (n = 6), Atlantic Ocean (n = 2) and Indian Ocean (n = 1), and possessed relatively large genomes (up to 6 Mbp and 5,390 genes) and a GC content ranging from 50% to 58.7%. One of the Proteobacterial MAGs resolved to the genus Desulfovibrio (HBD-01). The remaining MAGs from this phylum correspond to lineages within the orders Desulfobacterales (HBD-02), Oceanospirillales (HBD-03, HBD-04, HBD-05) and Pseudomonadales (HBD-06, HBD-07) (Table 1). The phylogenetic assignment of one Planctomycetes MAG (HBD-08) with a low completion estimate (33.5%) could not be resolved beyond the phylum level, possibly due to missing phylogenetic marker genes for taxonomic inferences. However, the length of this MAG (4.03 Mbp) suggests that its completion may have been underestimated, as we have observed in previous studies37,38. The second Planctomycetes MAG (HBD-09) was affiliated with the family Planctomycetaceae (order Planctomycetales) based on its single-copy core genes. This MAG contained a large fragment of the 16S rRNA gene (1,188 nt; Supplementary Table 4) for which the best match to any characterized bacterium in the NCBI’s non-redundant database was Algisphaera agarilytica (strain 06SJR6-2, NR_125472) with 88% identity.

Table 1 Summary of the genomic features of HBDs Full size table

We placed the nine HBDs in a phylogenomic analysis of the 432 Proteobacteria and 43 Planctomycetes MAGs using a set of 37 marker gene families (Fig. 1a; for an interactive version see https://anvi-server.org/merenlab/tara_hbds). The two deltaproteobacterial HBDs were closely related to each other, but not adjacent in the phylogenomic tree. The HBDs within Oceanospirillales (n = 3), Pseudomonadales (n = 2) and Planctomycetes (n = 2) formed three distinct phylogenomic lineages. These results suggest that closely related populations of diazotrophs inhabit the surface ocean, and nitrogen fixation genes occur sporadically among diverse putatively heterotrophic marine microbial lineages, consistent with previous investigations39.

Fig. 1: Nexus between phylogeny and function of HBDs. a, Phylogenomic analysis of 432 Proteobacteria MAGs and 43 Planctomycetes MAGs in the non-redundant genomic database (including the nine HBDs) using a collection of 37 phylogenetic marker gene families. Layers surrounding the phylogenomic tree indicate genome size and taxonomy of each MAG at the phylum and class level. b, Functional network of the nine HBDs based on a total of 5,912 identified gene functions. Size and colour of genomic nodes represent the number of detected functions and MAG taxonomy, respectively. Colours of functional nodes indicate their occurrence in the different HBDs. Full size image

Our initial binning results included 120 redundant MAGs that were observed multiple times in independent co-assemblies (Supplementary Table 5). Although they are not present in our final collection of 957 non-redundant MAGs (for an accurate assessment of the relative abundance of microbial populations), we used this redundancy to investigate the stability of the phylogeny and functional potential of populations recovered from multiple geographical regions. For instance, we characterized the genomic content of HBD-06 from the Atlantic northwest (5.49 Mbp) and from each of the three Pacific Ocean regions (5.56, 5.33 and 5.29 Mbp in regions PON, PSW and PSE, respectively) (Table 1 and Supplementary Table 5). Average nucleotide identities between the Atlantic MAG and three Pacific MAGs ranged from 99.89% to 99.97% over more than 97% of the genome length. We observed similar trends for HBD-07 and HBD-09 (Table 1 and Supplementary Table 5). The complete set of nitrogen fixation genes was present in all of the redundant MAGs, demonstrating the large-scale stability of this functional trait in these HBDs.

On average, the proportion of genes of unknown function was 27.6% (±2.63%) for the proteobacterial HBDs and 49.3% (±0.5%) for the Planctomycetes HBDs, reflecting our greater lack of functional understanding of the latter taxonomic group of diazotrophs. The 37,582 total genes identified in the nine HBDs encoded for 5,912 known functions (Supplementary Table 6), and a network analysis of HBDs based on known functions organized them into four distinct groups corresponding to Deltaproteobacteria, Oceanospirillales, Pseudomonadales and Planctomycetes (Fig. 1b), mirroring the results of our phylogenomic analysis. A large number of the functions identified in these HBDs (4,224 out of 5,912) were unique to one of the four groups (Fig. 1b and Supplementary Table 6). The relatively weak overlap of known functions between these groups indicates that the ability to fix nitrogen in marine populations may not be associated with a tightly defined functional lifestyle. The HBDs we characterized appeared to be involved in different steps of the nitrogen cycle (for example, denitrification for HBD-06) and possessed distinct strategies regulating nitrogen fixation (see section ‘Functional differences between HBDs’ in the Supplementary Information for additional functional insights), but shared traits related to energy conservation, motility, nutrient acquisition and gene regulatory processes. Swimming motility, which has previously been suggested as a potential mechanism to find anaerobic microniches favourable to nitrogen fixation28,40, was a common trait we observed in all the HBDs and may be an indication of particle-attached lifestyle rather than the symbiotic lifestyles observed in some cyanobacterial diazotrophs.

The taxonomy of HBDs is coherent with the phylogeny of nitrogen fixation genes

Our phylogenetic analysis of the catalytic nifH and nifD genes from a wide range of diazotrophs placed our HBDs in four distinct lineages (Fig. 2). Also included in this analysis were the genomic replicates that were removed from the non-redundant genomic collection. These replicates clustered with their representative MAGs in the phylogenetic tree, revealing near-identical nitrogen fixation genes in geographically distant HBDs. HBD-01 (Desulfovibrio) and HBD-02 (Desulfobacterales) were clustered with close taxonomic relatives. In addition, the gammaproteobacterial HBDs were most closely related to reference genomes of the genera Pseudomonas and Azotobacter from the same class. Finally, the nifD and nifH genes we identified in the Planctomycetes HBDs formed distinct clusters, which was particularly apparent for nifD (Fig. 2). All of the catalytic and biosynthetic genes for nitrogen fixation were located in a single operon in the Planctomycetes HBD-09 genome (HBD-08 was too fragmented to determine their organization). The agreement between the taxonomy of HBDs and their placement in the functional gene-based phylogeny, along with the synteny of genes involved in nitrogen fixation (see Supplementary Information), both favour a scenario where transmission of these genes is mainly vertical in the surface ocean, contributing to the ongoing debate regarding the extent of horizontal transmission for this key functionality9,30,41.

Fig. 2: Phylogeny of nitrogen fixation genes. a,b, Phylogenetic analysis of nifD (a) and nifH (b) occurring in the 15 nitrogen-fixing MAGs (including five redundant MAGs and ‘Ca. A. talassum’) we identified from TARA Oceans in relation to 252 and 316 reference proteins, respectively. MAGs are coloured based on their phylogenetic affiliation at the phylum level. Full size image

HBDs are not only diverse but are also abundant in the surface ocean

The cumulative relative abundance of the Planctomycetes and Proteobacteria HBDs in the metagenomic data set averaged 0.01% and 0.05%, respectively. In particular, HBD-06, the diazotrophic population that recruited the largest number of reads with an average and maximum relative abundance of 0.025% and 0.33% across all metagenomes, ranked 47th in our database of 957 MAGs (Supplementary Table 3). The relative abundance of Proteobacteria and Planctomycetes HBDs was very low in the Mediterranean Sea and Red Sea (0.00064% on average). In contrast, they were substantially enriched in metagenomes from the Pacific Ocean (0.14% on average) compared to the other regions (Fig. 3). In fact, the Pacific Ocean metagenomes contained 81.4% of the 17.8 million reads that were recruited by the HBD MAGs from the entire metagenomic data set. In particular, the two most abundant Proteobacteria and Planctomycetes HBDs (HBD-06 and HBD-09) showed a broad distribution (Fig. 3) and were significantly enriched in this ocean (Welch’s test, P < 0.005). HBD-06 was also abundant in the northwest region of the Atlantic Ocean and to a lesser extent in the Southern Ocean, revealing that the ecological niche of a single HBD population can encompass multiple oceans and a wide range of temperatures (Supplementary Table 3). Interestingly, HBD-07 and HBD-08, which are phylogenetically and functionally closely related to HBD-06 and HBD-09, respectively, were not only less abundant, but also exhibited a different geographical distribution (Fig. 3). We could not explain the increased signal for the nine HBDs in a few geographic regions using temperature, salinity or the concentration of essential inorganic chemicals including oxygen, phosphate and nitrate (Supplementary Table 1).

Fig. 3: Abundance of nitrogen-fixing populations of Planctomycetes and Proteobacteria in the surface ocean. Top: boxplots display the square-root-normalized cumulative relative distribution of the Planctomycetes (n = 2) and Proteobacteria (n = 7) HBDs in 93 metagenomes corresponding to 12 marine geographic regions (*assuming that each litre in the surface ocean contains 0.5 billion archaeal and bacterial cells83). Boxes represent the first quartile, median and third quartile of distribution values, and whiskers of 1.5 × interquartile range. Bottom: maps show the niche partitioning of HBD-06, HBD-07, HBD-08 and HBD-09 at the surface of four oceans and two seas (61 metagenomes from surface samples). Full size image

To reconcile the abundance of nitrogen-fixing populations in the surface of the open ocean with the inclusion of HBDs described in this study, we used the previous PCR-based estimations of the abundance of non-cyanobacterial nifH gene phylotypes. Quantitative PCR (qPCR) surveys have estimated that non-cyanobacterial nifH gene phylotypes generally range from 10 to 1,000 copies, and rarely reach 0.1 million copies per litre23,24,25,26,27,42. We translated genome-wide quantitative read recruitment of our HBDs into cells per litre (see Supplementary Information for details). Our estimates suggest that the nine populations of HBDs characterized in this study collectively correspond to 0.72 million cells per litre on average (and up to 3.16 million cells) in the surface of the Pacific Ocean, and 0.077 million cells per litre in the other regions. HBD-06 alone might contribute about 0.31 million cells per litre in the Pacific Ocean. These results indicate that HBD populations are orders of magnitude more abundant than previously thought in metagenomes covering large regions of the surface ocean.

PCR assays confirm the occurrence of Planctomycetes nifH genes in the surface ocean

We tracked HBDs at the long-term field study of Station ALOHA (22° 45′ N, 158° 00′ W) in the oligotrophic North Pacific Subtropical Gyre to compare the sensitivity of metagenomics and PCR surveys. The nine HBDs were below the detection limit in a data set of 624.2 million metagenomic reads originating from Station ALOHA43, indicating that HBDs are not as abundant at this location as they are in other regions of the Pacific Ocean (Supplementary Table 7). We developed digital droplet (dd)PCR assays for the two Planctomycetes nifH genes, and could detect HBD-08 at ~750 copies per litre in samples from Station ALOHA44 (Supplementary Table 7). We could also detect HBD-09 at levels near the limit of detection, confirming the occurrence of Planctomycetes nifH genes in the surface ocean.

Reconstructed nifH genes are more abundant than previously characterized nifH genes in surface ocean metagenomes

The non-redundant collection of 957 curated MAGs in which we searched for HBDs encompassed only 5.42% of the genes in our metagenomic assembly outputs. To identify more nifH genes, we also investigated those occurring in the remaining ‘orphan’ scaffolds (39,510,139 genes). Our search based on amino acid similarity with the HBD database resulted in the recovery of nine additional non-redundant nifH genes (Fig. 4 and Supplementary Table 8). Eight of them originated from the Pacific Ocean metagenomic co-assemblies, substantiating the unequal distribution patterns for nitrogen fixation genes we observed at the MAG level (Fig. 3). Phylogenetic analysis on these nifH genes affiliated them with Elusimicrobia (n = 2), Firmicutes (n = 2), Proteobacteria (n = 1), Spirochaeta (n = 1), Verrucomicrobia (n = 1), a group of uncultured bacteria (n = 1), and Euryarchaeota (n = 1) (Supplementary Fig. 1). This primer-independent survey identified a wide range of nifH gene lineages that spanned all four of the previously described phylogenetic clusters45 (Supplementary Table 8). The average nucleotide identity of short metagenomic reads each nifH gene recruited was between 97.4% and 100%, and above 99% for each of the nine HBDs (Supplementary Table 8), suggesting that these nifH gene sequences correspond to highly homogeneous phylotypes. Despite their high abundance in the surface ocean, most of these nifH genes were not in the NCBI non-redundant database, or reference nifH collections46,47, and none of them occurred in a large-scale amplicon survey of the surface ocean39, even when considering the subtle variations these phylotypes maintain in the environment (Supplementary Table 8). Our in silico analysis of widely used primer sequences (see Methods) revealed mismatches to these nifH genes, which is a likely reason for this discrepancy (Supplementary Table 8).

Fig. 4: Relative abundance of the TARA Oceans nifH genes in the context of reference collections and amplicons. Violin plots summarizing the average mean coverage of nifH genes retrieved in this study, nifH reference databases46,47 and nifH amplicon sequences from a large-scale survey39 across 93 TARA Oceans metagenomes using a competitive read recruitment strategy. The 18 nifH genes retrieved in this study were separated into two groups (‘HBD genomes’ and ‘Orphan genes’ for which we only have a scaffold) and compared to a database of nifH gene sequences. For each gene sequence, the coverage values were corrected by excluding nucleotide positions with coverage in the 1st and 4th quartiles to minimize the effect of non-specific mapping. Full size image

We used previously characterized reference and amplicon nifH gene sequences to recruit reads from metagenomes to estimate their relative abundance (see Methods). The large majority of these sequences were undetected in the TARA Oceans metagenomes (Supplementary Table 9), and the few sequences recruiting reads were less abundant than the nifH genes we reconstructed, confirming the remarkable abundance of the HBDs we characterized (Fig. 4). A notable exception was the ‘Ca. A. thalassa’ sequence, which was also present in our MAG database. Finally, the number of reads orphan nifH genes recruited suggests that HBDs abundant in the surface ocean might not be limited to Planctomycetes and Proteobacteria (Fig. 4).

A binomial naming for the HBDs characterized from multiple geographic regions

Most of the MAGs we characterized in our study correspond to unknown genera, but the lack of cultured representatives prevents a formal taxonomic characterization of these lineages. Here we suggest tentative names for the HBDs we independently characterized from multiple geographic regions (that is, those for which we have genomic replicates) using the candidatus status and binomial naming system: ‘Candidatus Azoaequarella praevalens’ gen. nov., sp. nov. (HBD-06) and ‘Ca. Azopseudomonas oceani’ gen. nov., sp. nov. (HBD-07) within the order Pseudomonadales (unknown family), and ‘Ca. Azoplanctomyces absconditus’ gen. nov., sp. nov. (HBD-09) within the phylum Planctomycetes (unknown order and family).