Genomic diversity in parasitic nematodes and platyhelminths

We have produced draft genomes for 45 nematode and platyhelminth species and predicted 0.8 million protein-coding genes, with 9,132–17,274 genes per species (5–95% percentile range; see Methods, Supplementary Tables 1–3, Supplementary Fig. 1 and Supplementary Notes 1.1 and 1.2). We combined these new data with 36 published worm genomes—comprising 31 parasitic8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30 and five free-living18,31,32,33,34 species—and 10 outgroups35,36,37,38,39,40,41,42,43,44 from other animal phyla, into a comparative genomics resource of 91 species (Fig. 1 and Supplementary Tables 2 and 4). There was relatively little variation in gene set completeness (coefficient of variation, c.v. = 0.15) among the nematodes and platyhelminths, despite variation in assembly contiguity (c.v. = 8.5; Fig. 1b and Supplementary Table 2). Nevertheless, findings made using a subset of high-quality assemblies that were designated ‘tier 1’ (Methods and Supplementary Table 4) were corroborated against all species.
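The c.v. values quoted above are ratios of the standard deviation to the mean. A minimal sketch of that calculation follows; it assumes the population standard deviation (the Methods may use a different estimator), and the input values are hypothetical, not taken from Supplementary Table 2.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the coefficient of variation (standard deviation / mean)."""
    m = mean(values)
    if m == 0:
        raise ValueError("c.v. is undefined when the mean is zero")
    # Assumption: population standard deviation; a sample estimator would use stdev().
    return pstdev(values) / m


# Hypothetical per-species values, for illustration only.
completeness = [0.95, 0.90, 0.80, 0.97, 0.70]   # fraction of expected genes found
n50_kb = [5.0, 12.0, 40.0, 900.0, 15000.0]      # assembly contiguity (N50, kb)

cv_completeness = coefficient_of_variation(completeness)
cv_contiguity = coefficient_of_variation(n50_kb)
```

With inputs like these, a metric that spans orders of magnitude (contiguity) gives a far larger c.v. than a bounded one (completeness), which is the contrast drawn in the text.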

Genome size varied greatly within each phylum, from 42 to 700 Mb in nematodes, and from 104 to 1,259 Mb in platyhelminths. In a small number of cases, size estimates may have been artifactually inflated by high heterozygosity causing alternative haplotypes to be represented within the assemblies (Supplementary Note 1.3 and Supplementary Table 2a). A more important factor appeared to be repeat content that ranged widely, from 3.8 to 54.5% (5–95% percentile; Supplementary Table 5). A multiple regression model, built to rank the major factors driving genome size variation, identified long terminal repeat transposons, simple repeats, assembly quality, DNA transposable elements, total length of introns and low complexity sequence as being the most important (Supplementary Note 1.3, Methods and Supplementary Table 6). Genome size variation is thus largely due to non-coding elements, as expected45, including repetitive and non-repetitive DNA, suggesting it is either non-adaptive or responding to selection only at the level of overall genome size.

Gene family births and expansions

We inferred gene families from the predicted proteomes of the 91 species using Ensembl Compara46. Of the 1.6 million proteins, 1.4 million were placed into 108,351 families (Supplementary Note 2.1 and Supplementary Data), for which phylogenetic trees were built and orthology and paralogy inferred (Methods, Supplementary Fig. 2 and Supplementary Table 7). Species trees inferred from 202 single-copy gene families that were present in at least 25% of species (Fig. 1), or from presence/absence of gene families, largely agreed with the expected species and clade relationships, except for a couple of known contentious issues (Supplementary Fig. 3, Supplementary Note 2.2 and Methods).

The species in our dataset contained significant novelty in gene content. For example, ~28,000 parasitic nematode gene families contained members from two or more parasitic species but were absent from Caenorhabditis elegans, and 47% of gene families lacked any functional annotation (Supplementary Note 2.1 and Methods). The latter families tended to be smaller than those with annotations (Supplementary Fig. 4) and, in many cases, correspond to families that are so highly diverged that ancestry cannot be traced, reflecting the huge breadth of unexplored parasite biology.

Gene families specific to particular parasite clades are likely to reflect important aspects of parasite biology and possible targets for new antiparasitic interventions. At key nodes in the phylogeny that are relevant to parasitism, we identified 5,881 families with apparent clade-specificity (synapomorphies; Supplementary Note 2.3, Methods and Supplementary Table 8), although our ability to discriminate truly parasite-specific clades was limited by the low number of free-living species. The apparent synapomorphies were either gene family births, or subfamilies that were so diverged from their homologues that they appeared as separate families. Functional annotation of these families was diverse (Fig. 2), but they were frequently associated with sensory perception (such as G-protein coupled receptors; GPCRs), parasite surfaces (platyhelminth tegument or nematode cuticle maintenance proteins) and protein degradation (proteases and protease inhibitors).
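The notion of apparent clade-specificity can be sketched from a family presence/absence table. The function below is an illustration only: the rule it applies (no members outside the clade, and members in at least a chosen fraction of clade species) and the `min_fraction` parameter are assumptions, not the criteria given in the Methods, and the species and family labels are hypothetical.

```python
def clade_specific_families(presence, clade, min_fraction=0.5):
    """Return families confined to `clade`.

    presence: dict mapping family id -> set of species with at least one member.
    clade: iterable of species descending from the node of interest.
    min_fraction: assumed minimum fraction of clade species that must carry the family.
    """
    clade = set(clade)
    hits = []
    for family, species in presence.items():
        # Skip empty families and any family with a member outside the clade.
        if not species or not species <= clade:
            continue
        if len(species) / len(clade) >= min_fraction:
            hits.append(family)
    return sorted(hits)


# Hypothetical presence/absence data.
presence = {
    "fam1": {"sp_A", "sp_B", "sp_C"},  # in every clade species, nowhere else
    "fam2": {"sp_A", "sp_D"},          # also found outside the clade
    "fam3": {"sp_A"},                  # confined to the clade but rare within it
    "fam4": set(),
}
clade = {"sp_A", "sp_B", "sp_C"}
specific = clade_specific_families(presence, clade)
```

As the text notes, a family that passes such a filter may be a true birth or a highly diverged subfamily, and sparse sampling of free-living species limits what the filter can distinguish.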

Fig. 2: Functional annotation of synapomorphic and expanded gene families. a, Rectangular matrices indicate counts of synapomorphic families grouped by 18 functional categories, detailed in the top left corner. Representative functional annotation of a family was inferred if more than 90% of the species present contained at least one gene with a particular domain. The node in the tree to which a panel refers is indicated in each matrix. ‘Other’ indicates families with functional annotation that could not be grouped into one of the 18 categories. ‘None’ indicates families that had no representative functional annotation. b, Expansions of apyrase and PUMA gene families. Families were defined using Compara. For color key and species labels, see Fig. 1. The plot for a family shows the gene count in each species, superimposed on the species tree. A scale bar beside the plot for a family shows the minimum, median and maximum gene count across the species, for that family.
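The rule for representative functional annotation stated in the legend (a domain found in at least one gene in more than 90% of the species present in the family) can be sketched as follows. The data layout, function name and domain labels are hypothetical.

```python
from collections import Counter


def representative_domains(family_genes, threshold=0.9):
    """Return domains carried by at least one gene in more than `threshold`
    of the species present in a family.

    family_genes: dict mapping species -> list of per-gene domain sets.
    """
    n_species = len(family_genes)
    species_with_domain = Counter()
    for genes in family_genes.values():
        # Pool the domains of all genes from this species, counting each once.
        domains_here = set().union(*genes)
        for domain in domains_here:
            species_with_domain[domain] += 1
    # Strict inequality, following "more than 90%".
    return sorted(d for d, n in species_with_domain.items() if n / n_species > threshold)


# Hypothetical family with three species; domain labels are placeholders.
family = {
    "sp_A": [{"dom_X"}, {"dom_X", "dom_Y"}],
    "sp_B": [{"dom_X"}],
    "sp_C": [{"dom_X", "dom_Z"}, {"dom_Y"}],
}
annotation = representative_domains(family)
```

A family for which this returns an empty list would fall under ‘None’ in panel a.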

Among nematodes, clade IVa (which includes Strongyloides spp.; Fig. 1) showed the highest number of clade-specific families, including a novel ferrochelatase-like family. Most nematodes lack functional ferrochelatases for the last step of haem biosynthesis47, but harbor ferrochelatase-like genes of unknown function, to which the synapomorphic clade IVa family was similar (Supplementary Fig. 5 and Methods). Exceptions are animal parasites in nematode clades III (for example ascarids and filaria) and IV that acquired a functional ferrochelatase via horizontal gene transfer48,49. Within the parasitic platyhelminths, a clade-specific inositol-pentakisphosphate 2-kinase (IP2K) was identified. In some species of Echinococcus tapeworms, IP2K produces inositol hexakisphosphate nanodeposits in the extracellular wall (the laminated layer) that protects larval metacestodes50. The deposits increase the surface area for adsorption of host proteins and may promote interactions with the host51.

Paralogous expansions of gene families, particularly those that are large or repeatedly involve related processes, can be evidence of adaptive evolution. We searched among our 10,986 highest-confidence gene families (those containing ≥10 genes from tier 1 species) for those that had expanded in parasite clades. A combination of scoring metrics (Methods) reduced the list to 995 differentially distributed families with a bias in copy number in at least one parasite clade. Twenty-five expansions have previously been observed, including 21 with possible roles in parasitism (Supplementary Fig. 6). A further 43 were placed into major functional classes that historically have been favored as drug targets (kinases, GPCRs, ion channels and proteases52; Supplementary Table 9a).
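The first filter described above (families containing ≥10 genes from tier 1 species) amounts to a simple count. A minimal sketch, with hypothetical family and species labels, is given below; the subsequent scoring metrics are described in the Methods and are not reproduced here.

```python
def high_confidence_families(families, tier1_species, min_genes=10):
    """Return families with at least `min_genes` genes from tier 1 species.

    families: dict mapping family id -> dict of species -> gene count.
    """
    tier1 = set(tier1_species)
    return sorted(
        family
        for family, counts in families.items()
        if sum(n for species, n in counts.items() if species in tier1) >= min_genes
    )


# Hypothetical counts: "x" is not a tier 1 species, so its genes are ignored.
families = {
    "famA": {"t1": 6, "t2": 5, "x": 1},
    "famB": {"t1": 3, "x": 40},
}
kept = high_confidence_families(families, tier1_species={"t1", "t2"})
```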

By manually inspecting the distribution of the remaining 927 families across the full species tree, we identified 176 families with striking expansions (Supplementary Table 9a and Supplementary Note 2.4). Thirty-two had no functional annotation; for example, family 393312 was highly expanded in clade Va nematodes (Supplementary Fig. 7 and Supplementary Table 9a). Even when families could be functionally annotated to some extent (for example, based on a protein domain), discerning their precise biological role was a challenge. For example, a sulfotransferase family that was expanded in flukes compared with tapeworms includes the Schistosoma mansoni locus that is implicated in resistance to the drug oxamniquine53, but the endogenous substrate for this enzyme is unknown (Supplementary Fig. 7j).

Among the newly identified expansions, we focused on those with richer functional information, especially where they were related to similar biological processes. For instance, we identified several expansions of gene families involved in innate immunity of the parasites, as well as their development. These included families implicated in protection against bacterial or fungal infections in nematode clade IVa (bus-4 GT31 galactosyltransferase54, irg-355) and clades Va/Vc (lysozyme56 and the dual oxidase bli-357) (Supplementary Fig. 8a–d). In nematode clade IIIb, a family was expanded that contains orthologs of the Parascaris coiled-coil protein PUMA, involved in kinetochore biology58 (Fig. 2b). This expansion possibly relates to the evolution of chromatin diminution in this clade, which results in an increased number of chromosomes requiring correct segregation during metaphase59. In nematode clade IVa and in Bursaphelenchus, an expansion of a steroid kinase family (Supplementary Fig. 8e) is suggestive of novelty in steroid-regulated processes in this group, such as the switch between free-living and parasitic stages in Strongyloides60.

Infections with parasitic worms are typified by their chronicity and a plausible involvement in host–parasite interactions is a recurring theme for many of the families. Taenia tapeworms and clade V strongylid nematodes (that is Va, Vb and Vc; Fig. 1) contained two expanded families with apyrase domains that may have a role in hydrolyzing ATP (a host danger signal) from damaged host tissue61 (Fig. 2b and Supplementary Fig. 9a). Moreover, many of the strongylid members also contained amine oxidoreductase domains, possibly to reduce production of pro-inflammatory amines, such as histamine, from host tissues62. In platyhelminths, we observed expansions of tetraspanin families that are likely components of the host/pathogen interface. Described examples show tetraspanins being part of extracellular vesicles released by helminths within hosts63; or binding the Fc domain of host antibodies64; or being highly immunogenic65 (Supplementary Fig. 9b,c). In strongylids, especially clade Vc, an expansion of the fatty acid and retinol-binding (FAR) family, implicated in host–parasite interaction of plant- and animal-parasitic nematodes66,67 (Supplementary Fig. 9d), suggests a role in immune modulation. Repertoires of glycosyl transferases have expanded in nematode clades Vc and IV, and tapeworms (Supplementary Fig. 10a–c), and may be used to evade or divert host immunity by modifying parasite surface molecules directly exposed to the immune system68; alternatively, surface glycoproteins may interact with lectin receptors on innate immune cells in an inhibitory manner69. An expanded chondroitin hydrolase family in nematode clade Vc may be used either for larval migration through host connective tissue or to digest host intestinal walls (Supplementary Fig. 9e). Similarly, an expanded GH5 glycosyl hydrolase family contained schistosome members with egg-enriched expression8,70 that may be used for traversing host tissues such as bladder or intestinal walls (Supplementary Fig. 9f). In nematode clade I, we found an expansion of a family with the PAN/Apple domain, which is implicated in attachment of some protozoan parasites to host cells71, and possibly modulates host lectin-based immune activation (Supplementary Fig. 9g).

The SCP/TAPS (sperm-coating protein/Tpx/antigen 5/pathogenesis-related protein 1) genes have been associated with parasitism through their abundance, secretion and evidence of their role in immunomodulation72 but are poorly understood. This diverse superfamily appeared as eight expanded Compara families. A more comprehensive phylogenetic analysis of the full repertoire of 3,167 SCP/TAPS sequences (Supplementary Note 2.5, Supplementary Table 10 and Methods) revealed intra- and interspecific expansions and diversification over different evolutionary timescales (Fig. 3 and Supplementary Figs. 11a,b and 12). In particular, the SCP/TAPS superfamily has expanded independently in nematode clade V (18–381 copies in each species) and in clade IVa parasites (39–166 copies) (Fig. 3 and Supplementary Fig. 11c). Dracunculus medinensis (Guinea worm) was unusual in being the only member of clade III to display an expansion (66 copies), which may reflect modulation of the host immune response during the tissue migration phase of its large adult females.

Fig. 3: Distribution and phylogeny of SCP/TAPS genes. A maximum-likelihood tree of SCP/TAPS genes. Colors represent different species groups. Homo sapiens GLIPR2 was used to root the tree. Blue dots show high bootstrap values (≥0.8). A clade was collapsed into a triangle if more than half its leaves were genes from the same species group. Nematode clade I had fewer counts, but was collapsed to show its relationship to other clades’ expansions. ‘Strongylid’ refers to clades Va, Vb and Vc.

Proteins historically targeted for drug development

Proteases, GPCRs, ion channels and kinases dominate the list of targets for existing drugs for human diseases52, and are attractive leads for developing new ones. We therefore explored the diversity of these superfamilies across the nematodes and platyhelminths (Supplementary Fig. 13, Supplementary Note 3 and Methods).

Proteases and protease inhibitors perform diverse functions in parasites, including immunomodulation, host tissue penetration, modification of the host environment (for example, anticoagulation) and digestion of blood73. M12 astacins have particularly expanded in nematode clade IVa (five families), as previously reported18, but there are two additional expansions in clades Vc and Vb (Fig. 4, Supplementary Fig. 14 and Supplementary Table 11). Because many of these species invade through skin (IVa, Vc; Supplementary Table 12) and migrate through the digestive system and lung (IVa, Vc, Vb; Supplementary Table 13), these expansions are consistent with evidence that astacins are involved in skin penetration and migration through connective tissue74. The cathepsin B C1-cysteine proteases are particularly expanded in species that feed on blood (two expansions in nematode clades Vc and Va30, with highest platyhelminth gene counts in schistosomatids and Fasciola12; Supplementary Fig. 14). Indeed, they are involved in blood digestion in adult nematodes75 and platyhelminths76, but some likely have different roles such as larval development77 and host invasion78.

Fig. 4: Abundances of superfamilies historically targeted for drug development. Relative abundance profiles for 84 protease and 31 protease inhibitor families represented in at least 3 of the 81 nematode and platyhelminth species. Thirty-three protease families and 6 protease inhibitor families present in fewer than 3 species were omitted from the visualization. For each species, the gene count in a class was normalized by dividing by the total gene count for that species. Families mentioned in the Results or Supplementary Note text are labeled; complete annotations of all protease families are in Supplementary Table 11.
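The normalization described in the legend (gene count in a class divided by the total gene count for the species, with families found in fewer than 3 species omitted) can be sketched as below. Function and variable names are hypothetical, and the counts are invented for illustration.

```python
def relative_abundance(counts, total_genes, min_species=3):
    """Return per-species relative abundance for each sufficiently widespread family.

    counts: dict mapping family -> dict of species -> gene count in that family.
    total_genes: dict mapping species -> total number of genes in that species.
    min_species: families present in fewer species than this are omitted.
    """
    profiles = {}
    for family, per_species in counts.items():
        n_present = sum(1 for n in per_species.values() if n > 0)
        if n_present < min_species:
            continue
        # Normalize by each species' total gene count; absent species get 0.
        profiles[family] = {
            species: per_species.get(species, 0) / total_genes[species]
            for species in total_genes
        }
    return profiles


# Invented counts for three hypothetical species.
counts = {
    "M12": {"sp1": 30, "sp2": 10, "sp3": 5},
    "I17": {"sp1": 4, "sp2": 0, "sp3": 0},
}
total_genes = {"sp1": 15000, "sp2": 10000, "sp3": 12500}
profiles = relative_abundance(counts, total_genes)
```

Dividing by the species total keeps species with large predicted gene sets from dominating the profile purely because of their size.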

Different protease inhibitors may modulate activity of parasite proteases or protect parasitic nematodes and platyhelminths from degradation by host proteases, facilitate feeding or manipulate the host response to the parasite79. The I2 (Kunitz-BPTI) trypsin inhibitors are the most abundant protease inhibitors across parasitic nematodes and platyhelminths (Fig. 4). An expansion of the I17 family, which includes secretory leukocyte peptidase inhibitor, was reported previously in Trichuris muris17 but the striking confinement of this expansion to most of the parasites of clade I is now apparent (Fig. 4). We also observed a notable family of α-2-macroglobulin (I39) protease inhibitors that are present in all platyhelminths but expanded in tapeworms (Supplementary Fig. 14). The tapeworm α-2-macroglobulins may be involved in reducing blood clotting at attachment or feeding sites; alternatively, they may modulate the host immune response, since α-2-macroglobulins bind several cytokines and hormones80. Chymotrypsin/elastase inhibitors (family I8) were particularly expanded in clades Vc and IVa (consistent with upregulation of I8 genes in Strongyloides parasitic stages18) and to a lesser extent in clade IIIb (Fig. 4), consistent with evidence that they may protect Ascaris from host proteases81. We also identified protein domain combinations that were specific to either nematodes or platyhelminths (131 and 50 domain combinations, respectively). Many of these involved protease and protease inhibitor domains. In nematodes, several combinations included Kunitz protease inhibitor domains, and in platyhelminths metalloprotease families M18 and M28 were found in novel combinations (Supplementary Table 14, Supplementary Note 3.2 and Methods).

Of the 230 gene families annotated as GPCRs (Supplementary Figs. 13 and 15 and Supplementary Note 3.3), only 21 were conserved across phyla. Chemosensory GPCRs, while abundant in nematodes, were not identified in platyhelminths, although they are identifiable in other Lophotrochozoa (such as Mollusca82), suggesting that either the platyhelminths have lost this class or they are very divergent (Supplementary Table 15). GPCR families lacking sequence similarity with known receptors included the platyhelminth-specific rhodopsin-like orphan families (PROFs), which are likely to be class A receptors and peptide responsive, and several other fluke-specific non-PROF GPCR families. The massive radiation of chemoreceptors in C. elegans was unmatched in any other nematode (87% versus ≤48% of GPCRs). All parasitic nematodes possessed chemoreceptors, with the most in clade IVa, including several large families synapomorphic to this clade (Supplementary Fig. 15), perhaps related to their unusual life cycles that alternate between free-living and parasitic forms.

Independent expansion and functional divergence have differentiated the nematode and platyhelminth pentameric ligand-gated ion channels (Supplementary Fig. 16, Supplementary Table 16 and Supplementary Note 3.4). For example, glutamate signaling arose independently in platyhelminths and nematodes83, and in trematodes the normal role of acetylcholine has been reversed, from activating to inhibitory84. Our analysis suggested the platyhelminth acetylcholine-gated anion channels are most related to the Acr-26/27 group of nematode nicotinic acetylcholine receptors that are the target of the anthelmintics morantel and pyrantel85, rather than to nematode acetylcholine-gated cation channels, targeted by nicotine and levamisole (Supplementary Fig. 17).

ABC transporters (Supplementary Table 17 and Supplementary Note 3.5) and kinases (Supplementary Note 3.6 and Supplementary Fig. 18) showed losses and independent expansion within nematodes and platyhelminths. The P-glycoprotein class of transporters, responsible for the transport of environmental toxins and linked with anthelmintic resistance, is expanded relative to vertebrates86, with increased numbers in nematodes (Supplementary Fig. 19).

Metabolic reconstructions of nematodes and platyhelminths

In the context of drug discovery, understanding the metabolic capabilities of parasitic worms may reveal vulnerabilities that can be exploited in target-based screens for new compounds. For each of the 81 nematode and platyhelminth species, metabolism was reconstructed based on high confidence assignment of enzyme classes (Supplementary Table 18a). The nematodes had a greater range of annotated enzymes per species than the platyhelminths (Supplementary Fig. 20a), in part reflecting the paucity of biochemical studies in platyhelminths. Because variation in assembly quality or divergence from model organisms87 could bias enzyme predictions, we identified losses of pathways and differences in pathway coverage across different clades (Supplementary Note 4, Methods, Fig. 5 and Supplementary Fig. 21). Pathways related to almost all metabolic superpathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG)88 showed significantly lower coverage for platyhelminths (versus nematodes) and filaria (versus other nematodes) (Supplementary Fig. 20b).

Fig. 5: Metabolic modules and biochemical pathways in platyhelminths and nematodes. a, Topology-based detection of KEGG metabolic modules among tier 1 species (dark green, present; light green, largely present (only one enzyme not found)). Only modules detected to be complete in at least one species are shown. The EC annotations used for this figure included those from pathway hole-filling and those based on Compara families (Supplementary Table 18a, b). b, Biochemical pathways that appear to have been completely or partially lost from certain platyhelminth and nematode clades. PRPP, phosphoribosyl pyrophosphate.
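The two categories in panel a (present; largely present, with only one enzyme not found) can be illustrated with a simplified check. The sketch below treats a module as a flat set of required enzymes, which is an assumption: the analysis in the paper is topology-based and KEGG modules can include alternative enzymes for a step. The enzyme labels are placeholders, not real EC numbers.

```python
def module_status(required_enzymes, annotated_enzymes):
    """Classify a module for one species by the number of required enzymes not found.

    Simplification (assumption): every listed enzyme is required and no
    alternative routes through the module are considered.
    """
    missing = set(required_enzymes) - set(annotated_enzymes)
    if not missing:
        return "present"
    if len(missing) == 1:
        return "largely present"
    return "not detected"


# Placeholder enzyme labels for one hypothetical module.
module = ["enz_1", "enz_2", "enz_3", "enz_4"]
status_full = module_status(module, {"enz_1", "enz_2", "enz_3", "enz_4", "enz_9"})
status_near = module_status(module, {"enz_1", "enz_2", "enz_4"})
status_lost = module_status(module, {"enz_1"})
```

Allowing a single missing enzyme gives some tolerance for gaps in draft assemblies or annotation, so that one absent gene model is not read as a pathway loss.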

In contrast to most animals, nematodes possess the glyoxylate cycle that enables conversion of lipids to carbohydrates, to be used for biosyntheses (for example, during early development) and to avert starvation89. The glyoxylate cycle appears to have been lost independently in the filaria and Trichinella species (Fig. 5a; M00012), both of which are tissue-dwelling obligate parasites. The filaria and Trichinella have also independently lost alanine-glyoxylate transaminase that converts glyoxylate to glycine (Fig. 5b). Glycine can be converted by the glycine cleavage system (GCS) to 5,10-methylenetetrahydrofolate, a useful one-carbon pool for biosyntheses, and two key GCS proteins appear to have been lost independently from filaria and tapeworms, suggesting their GCS is non-functional (Supplementary Table 19e). In addition, filaria have lost the ability to produce and use ketone bodies, a temporary store of acetyl coenzyme A (CoA) under starvation conditions (Supplementary Table 19b). The filaria lost these features after they diverged from D. medinensis, an outgroup to the filaria in clade IIIc that has a major difference in its life cycle, namely, a free-living larval stage (Supplementary Table 12).

The absence of multiple initial steps of pyrimidine synthesis was observed in some nematodes, including all filaria (as previously reported23) and tapeworms, suggesting they obtain pyrimidines from Wolbachia endosymbionts or from their hosts, respectively (Supplementary Table 19f). Similarly, all platyhelminths and some nematodes (especially clade IVa and filaria IIIc) appear to lack key enzymes for purine synthesis (Supplementary Table 19g) and rely on salvage instead. However, despite the widespread belief that nematodes cannot synthesize purines90,91, complete or near-complete purine synthesis pathways were found in most members of clades I, IIIb and V. Nematodes are known to be unable to synthesize haem47 but the pathway was found in platyhelminths, including S. mansoni (despite conflicting biochemical data47) (Supplementary Table 19h and Supplementary Table 20i).

Genes from the β-oxidation pathway, used to break down lipids as an energy source, were not detected in schistosomes and some cyclophyllidean tapeworms (Hymenolepis, Echinococcus; Fig. 5a, M00087; Supplementary Table 19a). These species live in glucose-rich environments and may have evolved to use glucose and glycogen as principal energy sources. However, biochemical data suggest they do perform β-oxidation92, so they may have highly diverged but functional β-oxidation genes.

The lactate dehydrogenase (LDH) pathway is a major source of ATP in anaerobic but glucose-rich environments. Platyhelminths have high numbers of LDH genes, as do blood-feeding Ancylostoma hookworms (Supplementary Fig. 22g). Nematode clades Vc (including Ancylostoma) and IIIb have expansions of α-glucosidases that may break down starch and disaccharides in host food to glucose (Supplementary Fig. 22a). Many nematodes and flatworms use malate dismutation as an alternative pathway for anaerobic ATP production93. The importance of the pathway for clade IIIb nematodes was reflected in expanded families encoding two key pathway enzymes, PEPCK and methylmalonyl CoA epimerase, and the intracellular trafficking chaperone for cobalamin (vitamin B12), a cofactor for the pathway (Supplementary Fig. 22c–e and Supplementary Table 9a). A second cobalamin-related family (CobQ/CbiP) is clade IIIb-specific and appears to have been gained by horizontal gene transfer from bacteria (Supplementary Fig. 23a, Supplementary Note 2.6 and Methods). A glutamate dehydrogenase family expanded in clade IIIb (Supplementary Fig. 22h) is consistent with a GABA (γ-aminobutyric acid) shunt that helps maintain redox balance during malate dismutation. In clade Va, an expansion in the propionate breakdown pathway94 (Supplementary Fig. 22f) suggested degradation of propionate, originating from malate dismutation or fermentation in the host’s stomach95. Clade I nematodes have an acetate/succinate transporter that appeared to have been gained from bacteria (Supplementary Note 2.6 and Methods), and may participate in acetate/succinate uptake or efflux (Supplementary Fig. 23b).

Identifying new anthelmintic drug targets and drugs

As an alternative to a purely target-based approach that would require extensive compound screening, we explored drug repurposing possibilities. We developed a pipeline to identify the most promising targets from parasitic nematodes and platyhelminths. These sequences were used in searches of the ChEMBL database that contains curated activity data on defined targets in other species and their associated drugs and compounds (Supplementary Note 5 and Methods). Our pipeline identified compounds that are predicted to interact with the top 15% of highest-scoring worm targets (n = 289). These targets included 17 out of 19 known or likely targets for World Health Organization-listed anthelmintics that are represented in ChEMBL (Supplementary Table 21b). When compounds within a single chemical class were collapsed to one representative, this potential screening set contained 5,046 drug-like compounds, including 817 drugs with phase III or IV approval and 4,229 medicinal chemistry compounds (Supplementary Table 21d). We used a self-organizing map to cluster these compounds based on their molecular fingerprints (Fig. 6). This classification showed that the screening set was significantly more structurally diverse than existing anthelmintic compounds (Supplementary Fig. 24).
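Two steps of the prioritization described above, keeping the top 15% of scored targets and collapsing compounds within a chemical class to one representative, can be sketched as follows. This is an illustration under stated assumptions, not the pipeline itself: the rounding rule for the cutoff, the choice of representative (highest clinical phase, ties broken by identifier) and all identifiers are hypothetical, and the target scoring scheme in Supplementary Note 5 is not reproduced.

```python
import math


def top_fraction(scores, fraction=0.15):
    """Return the highest-scoring target ids, keeping ceil(fraction * n) of them.

    scores: dict mapping target id -> score (higher is better).
    Assumption: the cutoff is rounded up; ties are broken by target id.
    """
    k = math.ceil(fraction * len(scores))
    ranked = sorted(scores, key=lambda target: (-scores[target], target))
    return ranked[:k]


def collapse_by_class(compounds):
    """Keep one representative compound per chemical class.

    compounds: iterable of (compound_id, chemical_class, max_phase).
    Assumption: the representative is the compound with the highest phase,
    with ties broken by the smallest compound id.
    """
    best = {}
    for compound_id, chem_class, max_phase in compounds:
        key = (-max_phase, compound_id)
        if chem_class not in best or key < best[chem_class]:
            best[chem_class] = key
    return {chem_class: key[1] for chem_class, key in best.items()}


# Hypothetical targets and compounds.
scores = {f"target_{i:02d}": float(i) for i in range(10)}
selected = top_fraction(scores)
compounds = [
    ("c1", "class_A", 4),
    ("c2", "class_A", 2),
    ("c3", "class_B", 0),
    ("c4", "class_B", 0),
]
representatives = collapse_by_class(compounds)
```

Collapsing to one compound per class reduces redundancy in a screening set, which is consistent with the emphasis on structural diversity in the text.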

Fig. 6: Self-organizing map of known anthelmintic compounds and the proposed screening set of 5,046 drug-like compounds. A self-organizing map clustering known anthelmintic compounds (Supplementary Table 21a) and our proposed screening set of 5,046 compounds. The density of red and green shows the number of screening set and known anthelmintic compounds clustered in each cell, respectively. Structures for representative known anthelmintic compounds are shown at the top, and examples from the proposed screening set along the bottom.

The 289 targets were further reduced to 40 high-priority targets, based on predicted selectivity, avoidance of side-effects (clade-specific chokepoints or lack of human homologues) and putative vulnerabilities, such as those suggested by gene family expansions in parasite lineages, or belonging to pathways containing known or likely anthelmintic targets (Supplementary Fig. 25). These 40 targets were associated with 720 drug-like compounds comprising 181 phase III/IV drugs and 539 medicinal chemistry compounds. There is independent evidence that some of these have anthelmintic activity. For example, we identified several compounds that potentially target glycogen phosphorylase, which is in the same pathway as a likely anthelmintic target (glycogen phosphorylase phosphatase, likely target of niridazole; Supplementary Fig. 25). These compounds included the phase III drug alvocidib (flavopiridol), which has anthelmintic activity against C. elegans96. Another example is the target cathepsin B, expanded in nematode clade Va (Supplementary Table 9a), for which we identified several compounds including the phase III drug odanacatib, which has been shown to have anthelmintic activity against hookworms97. Existing drugs such as these are attractive candidates for repurposing and fast-track therapy development, while the medicinal chemistry compounds provide a starting point for broader anthelmintic screening.