Plant materials and culture conditions

Axenically grown plants of C. follicularis were obtained from CZ Plants Nursery (Trebovice, Czech Republic) and were maintained in polycarbonate containers (60 × 60 × 100 mm) containing half-strength Murashige and Skoog solid medium31 supplemented with 3% sucrose, 1× Gamborg’s vitamins, 0.1% 2-(N-morpholino)ethanesulfonic acid, 0.05% Plant Preservative Mixture (Plant Cell Technology) and 0.3% Phytagel, at 25 °C in continuous light. For transcriptome sequencings, D. adelae was cultivated in a peat pot in an incubator at 25 °C in continuous light. N. alata was grown in soil in a greenhouse. S. purpurea was grown in peat-based soil and was maintained in a field. For digestive fluid sampling, C. follicularis, D. adelae, N. alata and S. purpurea were grown in a greenhouse.

Culture conditions for leaf fate regulation

Shoot apices with one or two expanded leaves were collected with fine forceps from plants grown at 25 °C and planted on medium. The plantlets were grown for 12 weeks under a light intensity of 20–40 μmol m−2 s−1. Numbers of youngest pitcher and flat leaves on main shoots were counted for each plantlet (Fig. 1b). Leaves with intermediate shapes were counted as either of the two categories based on morphological similarity.

DNA isolation

Total genomic DNA was isolated from young flat leaves and pitcher leaves of axenically grown plants. Collected leaves were homogenized in liquid nitrogen using a mortar and pestle. The homogenate was transferred into 2× CTAB extraction buffer (2% cetyltrimethylammonium bromide (CTAB), 1.4 M NaCl, 100 mM Tris-HCl (pH 8.0), 20 mM EDTA (pH 8.0)) preheated to 80 °C and was gently agitated at 60 °C for 1 h. An equal volume of chloroform:isoamyl alcohol (25:1) was added and agitated using a rotator at 20 r.p.m. for 10 min at room temperature. After centrifugation at 9,000 × g for 30 min at room temperature, supernatants were transferred to new tubes and supplemented with 1/10 volume of 10% CTAB and an equal volume of chloroform:isoamyl alcohol (25:1). The tubes were shaken with a rotator for 10 min. After centrifugation, supernatants were again transferred to new tubes and an equal volume of isopropanol was added. The tubes were centrifuged and supernatants were discarded. The crude DNA pellet was rinsed with 5 ml of 70% EtOH and air-dried for 10 min. The pellet was dissolved in 200 μl of TE (pH 8.0) containing 0.1 mg/ml RNase A, and gently agitated for 60 min at 37 °C. A 1/20 volume of 20 mg ml−1 Proteinase K was added, and tubes were incubated at 56 °C for 30 min. Subsequently, the DNA solution was further purified using Qiagen Genomic-tip, following the manufacturer’s instructions. DNA concentration was determined using fluorometry with Qubit 2.0 (Life Technologies).

Genome sequencing

Whole-genome shotgun short-read sequences were generated with an Illumina HiSeq 2000 to a depth of approximately 150-fold of the 2 Gb Cephalotus genome using paired-end and mate-pair protocols, according to the manufacturer’s instructions (Supplementary Table 1). For long read sequencing, genomic DNA samples were sheared to 6 kb or 10 kb using g-Tube (Covaris, Massachusetts). Libraries were prepared with DNA Template Prep Kit 2.0 (Pacific Biosciences, California) (3–10 kb) following the manufacturer’s instructions and sequencing was performed using PacBio RS with C2 chemistry, P2 polymerase and 45-min movies. Using 158 cells, a total of ca. 17 Gb were generated with a quality cut-off value of 0.75 (Supplementary Table 1).

Genome size estimation

The size of the Cephalotus genome was estimated by k-mer frequency analysis using JELLYFISH32 (Supplementary Fig. 1a).

Genome assembly

Illumina paired-end reads with all insert sizes, and mate-pair reads with insert sizes of 2 and 5 kb, were first assembled into 43,308 scaffolds using Allpaths-LG v4238133. Fragment filling was applied to paired-end libraries with insert sizes of 170 bp and 250 bp. Standard deviations of insert sizes were set to 10% of insert sizes. Gap filling and further scaffolding were performed by adding mate-pair reads with longer inserts using SSPACE34 and GapCloser35. PacBio reads were subjected to two rounds of error correction using Sprai v0.2.2.3 (http://zombie.cb.k.u-tokyo.ac.jp/sprai/) and used for four rounds of iterative gap filling with PBJelly v12.9.1436. The final assembly included 16,307 scaffolds with N50 of 287 kb (Supplementary Table 2).

Repeat identification

Repetitive elements of the Cephalotus genome were first identified and masked for gene prediction (Supplementary Tables 3 and 4). De novo prediction of transposable elements was performed using RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html) and LTR_FINDER (http://tlife.fudan.edu.cn/ltr_finder/). Known transposable elements were found using RepeatMasker and RepeatProteinMask (http://repeatmasker.org). Tandem repeat sequences were screened using Tandem Repeats Finder37.

RNA extraction

Plant materials were ground in liquid nitrogen using a mortar and pestle. Total RNA was extracted using the PureLink Plant RNA Reagent (Life Technologies) and subsequently purified using the RNeasy Mini Kit (QIAGEN). DNase treatment was performed during the column purification. Total RNA was qualified using a 2100 Bioanalyzer (Agilent).

Transcriptome sequencing

Extracted RNA was subjected to two rounds of mRNA enrichment using Dynabeads mRNA Purification Kit (Life Technologies) according to the manufacturer’s instructions. RNA-seq libraries were prepared using TruSeq RNA Sample Preparation kit v.2 (Illumina). Strand-specific mRNA libraries were constructed using the dUTP second-strand marking method38. These libraries were sequenced on an Illumina HiSeq 2000 with three biological replications (Supplementary Table 8).

Gene prediction

For gene predictions, we used homology-based, ab inito and transcript-based methods. Protein data sets of Arabidopsis thaliana, Linum usitatissimum, Manihot esculenta, Populus trichocarpa and Ricinus communis (Supplementary Table 11) were aligned to the Cephalotus genome using tblastn (cut-off: 1e−5) and then homology-based gene predictions were generated using GeneWise39. We also used Augustus (http://augustus.gobics.de/), GENSCAN (http://genes.mit.edu/GENSCAN.html), GlimmerHMM (https://ccb.jhu.edu/software/glimmerhmm/) and SNAP (http://korflab.ucdavis.edu/software.html) for ab initio predictions, with model parameters trained using 730 Cephalotus gene models that were well supported by homology evidence. RNA-seq data generated from 16 samples (Supplementary Table 8) were used for transcript-based predictions with the Bowtie–Tophat–Cufflinks pipeline40. These models were merged using GLEAN (http://glean-gene.sourceforge.net/). Finally, gene models that were not in the GLEAN non-redundant gene set but supported by both homology and RNA-seq evidences, or homology-based models (frame shift mutation not allowed and aligning rate >50%), or RNA-seq models encoding proteins ≧120 amino acids in length, were further added.

Gene annotation

Gene functions were assigned using BLAST searches (E-value cut-off of 10−5) against the following databases: KEGG (Release 58), nr (NCBI release 20130904), Swissprot and TrEMBL (Uniprot release 201203). Conserved protein domains were assessed by InterPro41 and InterProScan42 with applications including HMMPfam, HMMPanther, ProfileScan, HMMSmart, FPrintScan and BlastProDom.

Evaluation of genome assembly and gene prediction

Gene coverage of predicted gene sets was evaluated using CEGMA 2.443 (Supplementary Fig. 1b). Read mapping rates of 15 RNA-seq libraries from five tissues ranged from 74.4% to 83.6% (Supplementary Table 9), indicating consistency between the assembled genome and the sequenced transcriptome.

Small RNA extraction and sequencing

Plant tissues were ground in liquid nitrogen using a mortar and pestle. Total RNA was extracted using PureLink Plant RNA Reagent (Life Technologies) and subsequently purified using the miRNeasy kit (QIAGEN). DNase treatment was performed during the column purification. Briefly, for each sample, RNA of the desired size range (18–30 nucleotides) was size-fractionated and ligated with the 5' adapter and, subsequently, the 3' adapter. Ligated RNA was then subjected to PCR with reverse transcription (RT-PCR) to produce sequencing libraries. Small RNA-seq was performed on an Illumina HiSeq 2000 (Supplementary Table 10).

miRNA prediction and target prediction

Cephalotus miRNA loci were predicted in the genome by both transcriptome- and homology-based methods (Supplementary Table 6). Small RNA-seq reads were mapped onto genomic inverted repeats predicted by EMBOSS einverted44. miRNA loci were identified from the mapping results using ShortStack v1.2.345. For homology-based prediction, 7,385 mature miRNA sequences of Viridiplantae species were retrieved from miRbase release 2046. These miRNA sequences were mapped onto the Cephalotus genome using patscan47, allowing one mismatch. Putative loci mapped by less than five independent miRNAs were excluded. Secondary structures were identified from flanking regions of mapped loci (±350 bp) using RNAfold of Vienna RNA Package 2.048, and putative miRNA loci were predicted using miRcheck with default parameters49. When putative miRNAs were predicted on both strands of the same loci, the minor locus was collapsed. Putative targets of annotated miRNAs were identified using psRNATarget50 using default settings (Supplementary Table 7).

OrthoMCL gene classification

Orthologues were clustered by comparison of protein data sets among A. thaliana, C. follicularis, Theobroma cacao, Vitis vinifera, Prunus persica, Coffea canephora, Solanum lycopersicum, U. gibba and P. trichocarpa using BLASTP (cut-off: 10−5) and OrthoMCL6 (Supplementary Tables 11 and 12). Protein data sets of the nine genomes were BLAST searched against nr (NCBI release 20140407; BLASTP, E-value cut-off of 10−5). Functional terms (GO and enzyme codes) were then assigned to each query sequence using Blast2GO (https://www.blast2go.com/).

Maximum-likelihood inference of orthogroup gains and losses

We estimated the divergence times of the surveyed species using RAxML version 851, employing tree topologies published previously52–54. The reported placement of P. persica (Rosales) is discrepant between plastid-54 and nuclear-based analyses52,53. To account for that, we analysed phylogenetic relationships using the single-copy orthologue alignment (see below). Although the bootstrap supports were low, the maximum-likelihood tree supported the nuclear-based topology (Supplementary Fig. 1h), and therefore we placed P. persica as sister to the clade containing A. thaliana, T. cacao, P. trichocarpa and C. follicularis. The placement of V. vinifera is also different among previously published phylogenies52–54. To account for that, two alternative tree topologies with different placements of V. vinifera were assumed in this analysis. For that, we leveraged the amino acid sequence data of all single-copy orthologues, as defined by OrthoMCL (1,836 1:1 orthologues), after excluding all putative TE sequences identified in BLAST searches against different TE databases (TIGR Plant Repeat Databases55, TransposonPSI (http://transposonpsi.sourceforge.net) and NCBI's non-redundant (nr) protein database). We then aligned the sequences of each orthogroup with the program M-Coffee56 and used trimAl57 to automatically remove poorly aligned regions. The best-fit amino acid substitution model for each multiple sequence alignment was selected using ProtTest58 and specified in the RAxML analysis under a partitioned scheme. We finally used r8s59 to obtain the ultrametric trees required for the BadiRate60 analysis, by applying the nonparametric rate smoothing algorithm59 to the maximum-likelihood trees and fixing the age of the root to 113 Myr in both cases. This date, a compromise for the two trees we tested, was derived from the average of the 2 BEAST point estimates for the earliest split within the rosid clade (with Vitaceae as one sister lineage), as calculated in ref. 54 (their Fig. 1 and Table 2). The two trees tested are detailed below.

Tree 1:

((V_vinifera:92.251246,(((T_cacao:72.791098,A_thaliana:72.791098): 5.199433,(P_trichocarpa:72.150935,C_follicularis:72.150935):5.839595):5.054431,P_persica:83.044961):9.206285):20.748754,(U_gibba:101.840606,(C_canephora:89.804910,S_lycopersicum:89.804910):12.035696):11.159394).

Tree 2:

(((U_gibba:82.830331,(C_canephora:74.586612,S_lycopersicum:74.586612):8.243718):14.783045,(((T_cacao:74.611384,A_thaliana:74.611384):5.510359,(P_trichocarpa:73.737693,C_follicularis:73.737693):6.384050):5.848981,P_persica:85.970724):11.642652):15.386624,V_vinifera:113.000000).

To identify gene families specifically expanded in the Cephalotus genome, we followed the method implemented in refs 4 and 61, accepting a weighted Akaike information criterion (wAIC) ratio of 2.7 for the best-fit branch model to the second-best-fit model. We ran BadiRate60 twice, once for each of the two alternative topologies shown above. Only those families strongly supported as expanded (wAIC ratio >2.7) under both of the two alternative topologies were considered for further analyses (Supplementary Table 14).

GO enrichment analysis

Supplementary Table 12 shows the per species summary of orthogroups and singletons in nine plant species. Before BadiRate analyses, orthogroups containing sequences with significant similarity to transposable elements (resulting in E-values <10−15 in TBLASTX searches against sequences of the RepBase v19.12 database)62 were filtered out from all nine genomes. The functional categories (generic GO terms) differentially represented among 493, 495 and 492 Cephalotus-specific expanded genes families (grouping 2,560, 2,567 and 2,557 total genes, respectively), as identified in BadiRate analyses performed using tree 1, tree 2 and the intersection of both trees, are displayed in Supplementary Tables 15, 16 and 17, respectively. Similarly, differential representation of GO generic terms among 2,716 Cephalotus-specific singletons, 237 Cephalotus-specific two-gene families (474 total genes) and Cephalotus-specific 201 multigene families (1,714 total genes) are shown in Supplementary Tables 19, 20 and 21, respectively. Finally, differential representation of GO generic terms among five pairs of genes unique to Cephalotus and U. gibba is presented in Supplementary Table 18. We performed significance analyses of differential distribution of GO terms by comparing different subsets of genes with the entire complement of genes in the genome using Fisher’s exact test (seefor example, ref. 4). To control for multiple testing, the resulting P values were corrected according to ref. 63.

Selection of differentially expressed genes

Strand-specific RNA-seq reads were mapped to gene models on the genome assembly using Tophat264 with minimum and maximum intron lengths of 20 and 20,000 bp, respectively (Supplementary Table 9). Transcript abundances calculated by featureCounts65 were normalized using the iterative differentially expressed gene elimination strategy (iDEGES)66, which consists of sequential TMM-(edgeR-TMM) n normalization67,68. Using the normalized reads per million mapped reads (RPM) values, differentially expressed genes were identified by an exact test for a negative binomial distribution69 and subsequent multiple correction by adjusting the false discovery rate to q < 0.01 (ref. 63; Fig. 2c,d and Supplementary Figs 2–5 and 9). Normalized RPM values are used in Fig. 2c,d, whereas unnormalized RPM values are plotted in Supplementary Figs 2–5 and 9. The significantly differentially expressed genes were subjected to a subsequent GO-enrichment analysis (Supplementary Tables 22 and 23).

Protein sequencing of digestive fluids

Digestive fluids of C. follicularis, D. adelae, N. alata and S. purpurea were collected from soil-grown plants in a greenhouse. Fluids were freeze dried and stored at room temperature. Dried samples were dissolved in a protease inhibitor cocktail (cOmplete, Mini, EDTA-free, Roche), precipitated with 8% trichloroacetic acid (TCA) and then washed with 90% acetone. They were dissolved in SDS sample buffer (62.5 mM Tris-HCl, 2% SDS, 0.25% BPB, 10% glycerol, 5% 2-mercaptoethanol, pH 6.8), denatured at 95 °C for 3 min and then separated by 12% SDS-polyacrylamide gel electrophoresis. Negative staining was performed using the Gel-Negative Stain Kit (Nacalai Tesque) according to the manufacturer’s instructions. After destaining, proteins were transferred to polyvinylidene difluoride (PVDF) membranes. N-terminal sequences of each protein band were determined by the Edman degradation method using an ABI Procise 494-HT instrument (Applied Biosystems). To obtain internal protein sequences, protein bands were dissected from the gel, destained, dehydrated with 100% acetonitrile for 5 min, dried using an evaporator and then reduced by incubating in 10 mM DTT and 25 mM ammonium bicarbonate at 56 °C for 60 min. After washing with 25 mM ammonium bicarbonate, the proteins were alkylated in 55 mM iodoacetamide and 25 mM ammonium bicarbonate for 45 min at room temperature. After washing with 50% acetonitrile containing 25 mM ammonium bicarbonate, the samples were dried using an evaporator. The proteins were in-gel-digested with 10 ng μl−1 trypsin in 50 mM ammonium bicarbonate, 10 ng μl−1 lysyl endopeptidase in 25 mM Tris-HCl (pH 9.0) or 20 ng μl−1 V8 protease in 50 mM phosphate buffer (pH 7.8) at 37 °C overnight. The digested peptides were extracted twice by sonication in 50% acetonitrile containing 5% trifluoroacetic acid (TFA) for 10 min. The peptides were separated by high-performance liquid chromatography (HPLC) using the Pharmacia SMART System and a reverse-phase column (μRPC C2/C18 PC 3.2/3, GE Healthcare Life Sciences, or XBridge C8 5 μm 2.1×100 mm, Waters) under the following conditions: constant flow rate of 200 μl min−1; solvent A, 0.5% TFA, solvent B, acetonitrile containing 0.5% TFA; linear gradient from 10 to 40% (B over A in % (v/v)) over 30 min (1% min−1). Separated peptides were then used for protein sequencing by the Edman degradation method.

Transcriptome assembly and identification of transcripts encoding biochemically identified proteins

RNA-seq reads of D. adelae, N. alata, and S. purpurea (Supplementary Table 8) were assembled into transcripts using Trinity (version r2013-02-25)70 with a 200 bp minimum contig length cut-off. Partial amino acid sequences of digestive fluid proteins were subjected to TBLASTN searches71 against the transcriptome assemblies and the Cephalotus gene models to identify the corresponding transcripts (Supplementary Tables 25–28). Sequence variants within a Trinity’s component were considered as originating from the same gene.

Preparation of digestive fluid protein data sets

In addition to proteins identified in this study (Supplementary Tables 25–28), we obtained for phylogenetic analyses a number of previously published sequences of digestive fluid proteins8,9,72–78 (Supplementary Table 24). Although many protein and transcript sequences for possible digestive enzymes are available (for example, refs 17,79–84), we included only genes for which complete coding sequences were available and for which their presence in digestive fluid had been biochemically validated (Supplementary Table 24, last searched 20 January 2016).

Phylogenetic analyses of gene families

Phylogenetic relationships of digestive enzyme genes and other carnivory-related genes were analysed along with their homologues in the annotated genomes of ten angiosperm species (Supplementary Table 11). TBLASTX searches71 were performed against the above coding sequence (CDS) data sets with an E-value cut-off of 10. After sequence retrieval, multiple alignments were prepared using MAFFT 6.95685, and ambiguous codons were removed using trimAl57 implemented in Phylogears2-2.0.2013.03.15 (http://www.fifthdimension.jp/products/phylogears/) with the ‘gappyout’ option. Poorly aligned sequences were removed using MaxAlign86. Phylogenetic trees were reconstructed by the maximum-likelihood method using RAxML v8.0.2651 with the general time-reversible (GTR) model of nucleotide substitution and four discrete gamma categories of rate heterogeneity (‘GTRGAMMA’ option). Support for nodes was estimated by rapid bootstrapping with 100 replicates. Trees were rooted at the midpoint between the two most divergent genes. Gene duplication events shown in Figs 2b and 3a were inferred on the basis of species overlap between partitions87 using a Python package ‘ETE3’88. The trees were visualized using iTOL89.

Detection of orthologous relationships

Orthology of Cephalotus genes and digestive enzyme genes was inferred on the basis of tree topologies reconstructed by the maximum-likelihood method using the ten plant genomes described above (Supplementary Figs 2, 4 and 5). As we cannot exclude the possibility of parallel gene losses, a clade containing genes from at least five plant genomes was designated as a putative orthologous unit.

Expression profiling of Arabidopsis genes

Affymetrix ATH1 (25K) microarray data sets on stress-related experiments were retrieved from ArrayExpress90 if two or more replicates were available on wild-type Arabidopsis plants (Supplementary Table 30). Robust multi-array average normalized expression data91 were subjected to heatmap visualization using the R package ‘gplots’. Dendrograms were constructed using the furthest neighbour method with Euclidian distances. Significance of differential expression was analysed by a randomization test with 10,000 iterations in which resamplings were performed in each gene family and the sum of expression changes was compared with the original value.

Evaluation of detection methods for molecular convergence

To evaluate different tree reconstruction methods, simulated gene sequences were generated using the R package ‘Phylosim’92. We used publicly available simulated data sets for 16 fungi species93,94. These data sets contain 1,000 simulated tree topologies of gene families, each of which was generated under observed gene duplication and loss rates. Sequences of 300 codons were simulated on the tree topologies of the fungi data set. Codon usage was sampled from the actual frequencies in Saccharomyces cerevisiae95. The κ (transition/transversion rate) was set to 1. The ω (nonsynonymous / synonymous nucleotide substitution rate ratio (dN/dS)) of each codon position was randomly sampled from a gamma distribution (shape = 0.5, rate = 1). To mimic molecular convergence, two genes were randomly selected to be converged. In terminal branches of selected genes, codon usage of S. cerevisiae was replaced with a biased matrix in which frequencies of codons coding for two randomly selected amino acids were increased. Increased frequency was calculated by multiplying the original value by 100, and then total frequencies of all codons were scaled to 1.

Gene trees were inferred by the maximum-likelihood method51 using first, second, third and all codon positions as well as 300 nucleotide random sequences. To obtain a robust tree topology, the gene trees were reconciled with the species tree using Treefix 1.1.1094, which incorporates duplication-loss parsimony and a test statistic for likelihood equivalence. Reconciliation was accomplished using default settings for which 1,000 iterations of topology searches were performed and rearrangements were accepted when likelihood was not significantly reduced by the Shimodaira–Hasegawa test96 (P value threshold of 0.05). Branch lengths of reconciled trees were optimized using RAxML51. Finally, the numbers of convergent and divergent substitutions were estimated from the inferred tree topologies and the original simulated alignments using CodeMLancestral21 (Supplementary Fig. 10). Substitution pairs that result in the same descendant amino acid at the same alignment position in both branches were categorized as convergent changes, whereas the remaining substitution pairs were counted as divergent changes21,28.

Detection of molecular convergence in digestive fluid proteins

Genes encoding digestive fluid proteins identified in this study (Supplementary Tables 25–28) and previous research (Supplementary Table 24) were analysed. When corresponding gene sequences for a given species clustered together in the maximum-likelihood trees (Supplementary Fig. 5), they were considered to represent the same gene, whereafter we retained our own sequences to circumvent incorrect inference of gene duplication events in phylogeny reconciliation. A maximum-likelihood tree was reconstructed using third codon position sequences of the trimAl-processed alignments, and it was subsequently reconciled with a species tree prepared from a dated large-scale plastid phylogeny of flowering plants54 using Treefix94 with default parameters, except with the number of iterations increased to 1,000. Although the plastid-based topology54 is partly different from nuclear-based topology52,53 (Supplementary Fig. 1h), we employed it because of the necessity to include carnivorous lineages in which nuclear genome sequences are unavailable (for example, Drosera, Nepenthes and Sarracenia). Branch lengths of the reconciled trees were optimized against trimAl-processed CDS alignments using RAxML51. The trees were subsequently used for Bayesian ancestral state reconstruction using PhyloBayes97 over 12,000 generations (2,000 generations of burn-in) with an infinite mixture of GTR substitution models (CAT-GTR model) of amino acid substitution and five discrete gamma categories of rate heterogeneity to calculate posterior numbers of convergent and divergent substitution pairs. Background levels (null hypothesis) of convergent substitution pairs were estimated by a linear regression in which the posterior numbers of convergent changes were predicted by divergent changes21. Over-accumulation of convergent changes in a tree was examined by one-sided single-sample proportion tests98 with Yate’s continuity correction99 and subsequent Bonferroni adjustment for multiple comparisons100. Digestive enzyme branch pairs among independent carnivorous plant lineages were examined in the statistical test. Corrected P values are shown in Fig. 3b and Supplementary Fig. 6.

Homology modelling of protein structures

Protein structures of digestive enzymes were analysed using the SWISS-MODEL Workspace101. Template models were selected using the ‘Template Identification’ tool. SWISS-MODEL Template Library IDs of selected templates were 2dkv.1.A, 3zk4.1.A and 1dix.1.A for GH19 chitinases, purple acid phosphatases and RNase T2s, respectively. Predicted models were visualized using UCSF Chimera 1.10102. Relative exposure of amino acid surfaces was calculated by dividing solvent-accessible surface in protein structures by the theoretical maximum of corresponding amino acids in Gly-X-Gly tripeptide contexts103. The relative solvent-accessible surface area for a paired amino acid substitution was reported by averaging values in proteins constituting the two clades (Supplementary Fig. 8).

Data availability

The Cephalotus genome assembly and gene models are available from the DNA Data Bank of Japan (DDBJ) with the accession numbers BDDD01000001 to BDDD01016307. The genomic sequences, gene models and other source data are also available at CoGe (Genome ID = 29002) and Dryad (doi:10.5061/dryad.50tq3). The DDBJ accession numbers for DNA-seq (DRR053706–DRR053720), mRNA-seq (DRR053690–DRR051749; DRR029007–DRR29010) and small RNA-seq (DRR058704–DRR058708) are shown in Supplementary Tables 1, 8 and 10, respectively. DDBJ accessions and gene IDs for coding sequences of digestive fluid proteins are provided in Supplementary Tables 25–28.