To perform a comparative analysis of the innate immune gene repertoire, across multiple actiniarian taxa, we interrogated a suite of genomic resources, both publically available and newly generated in this study. Specifically, genomic datasets used in this study consisted of newly sequenced transcriptomes for Actinia tenebrosa (total n = 5; red colourmorph n = 2, brown n = 1, green n = 1, blue n = 1), Anthopleura buddemeieri (n = 1), Aulactinia veratra (n = 2), Calliactis polypus (n = 2), Nemanthus annamensis (n = 1) and Telmatactis sp. (n = 1), and as well as publically available transcriptomes for Anthopleura elegantissima (n = 1; SRX754678), Aiptasia pallida (n = 1; SRX231866, run SRR696721) and Nematostella vectensis (n = 1; SRX315372). These species were selected based on their phylogenetic distribution across most superfamilies within Actiniaria, availability of animals and general lack of sequence information for these species. Multiple transcriptomes were generated for A. tenebrosa and A. veratra which represent colourmorphs (red, brown, green and blue for A. tenebrosa) and aerial exposure treatments (one control and one treatment each for A. tenebrosa (red only) and A. veratra). All raw sequence reads were submitted to NCBI Sequence Read Archive under BioProject accession PRJNA313244 (see Additional file 1: Table S1 for all accession numbers).

Animal acquisition

Individual specimens of A. tenebrosa, A. buddemeieri and A. veratra were collected from the intertidal zone at Point Cartwright, (QLD, Australia) in 2014. C. polypus individuals were collected in 2014 from pumice washed onto rock pools at Snapper Rocks (QLD, Australia). Telmatactis sp. was bought from Cairns Marine (QLD, Australia) in 2014. N. annamensis was bought from Great Barrier Reef Marine Pty Ltd in 2015. All animals were housed in holding tanks, in the marine laboratory at QUT, until experimental use. Tanks were maintained at between 33 and 37 ppt salinity and 20–28 °C.

Sequencing, assembly and annotation

Total RNA was extracted from whole organisms by first homogenizing individuals in liquid nitrogen, followed by a TRIzol/chloroform RNA extraction protocol (TRIzol®, Life Technologies). Extracted RNA was assessed for quality and integrity on a Bioanalyzer 2100 (Agilent) using an RNA nano chip. Sequencing libraries were prepared using the Illumina TruSeq® Stranded mRNA Library Preparation Kit and sequenced on an Illumina NextSeq 500. 150 bp paired-end chemistry was used for all sample libraries, except N. annamensis and A. tenebrosa (4; blue) which were prepared at a different time using 75 bp paired-end chemistry. All transcriptomes were assembled using the Trinity de novo assembler [39]. Trinity assembler was used for A. tenebrosa (1, 2, 3 and 4), A. pallida, A. buddemeieri, A. elegantissima, A. veratra, C. polypus (1, 2), N. annamensis, N. vectensis and Telmatactis sp.. Quality trimming was performed on all datasets using Trimmomatic [40] with default settings to retain only high quality reads and to remove non-biological sequence.

Redundant and chimeric sequences were removed from all assemblies using CD-Hit v.4.6.1 [41, 42], by clustering sequences with greater than 95 % similarity within each assembly. In addition, CEGMA (Core Eukaryotic Genes Mapping Approach) v.2.5 [43] was used to assess the completeness of each assembly, by determining the percentage of full-length sequences in each transcriptome corresponding to 248 highly conserved eukaryotic proteins.

Functional annotation of individual transcriptomes was undertaken using Trinotate [44], for both nucleotide and predicted peptide sequences. Nucleotide sequences for each contig were annotated using BLASTx searches against the Swiss-Prot [45] and TrEMBL (Uniref90) [46] databases using BLAST+ v.2.2.31 [47, 48] software (E value 1 × 10−5). Batch extraction and translation of the longest open reading frames (ORFs) for each contig were performed using TransDecoder v.2.0.1 [49, 50]. From the predicted peptide sequences, signal peptides and protein families were detected in each ORF, using Signal Peptide v.4.1 [51] and Pfam v.3.1b1 [32], respectively. Longest ORFs were annotated using BLASTp searches against the Swiss-Prot [45] and TrEMBL [46] databases using BLAST+ v.2.2.31 [47, 48] software (E value 1 × 10−5).

Gene ontology and RSEM analysis

GO terms were assigned to contigs that received significant BLASTx hits (E value 1 × 10−5). WEGO [52] and CateGOrizer [53] were used to examine the distribution of top GO terms and the distribution of ‘immune class’ GO terms in each assembly, respectively. Estimates of transcript abundance for all newly generated transcriptomes were calculated using the Trinity [50] pipeline. Raw reads were aligned to assembled contigs using Bowtie 2 [54], followed by RNAseq by Expectation-Maximization (RSEM) [55] estimation of expression values as Fragments per Kilobase of transcript per Million mapped reads (FPKM).

Identification of candidate genes

The list of candidate genes interrogated in this project was limited to five innate immune gene families which included Toll-like receptor (TLR), Nucleotide-binding and Leucine-rich Repeat containing gene (NLR), Interleukin-1 receptor-like genes (IL-1R-like), Myeloid Differentiation primary response gene 88 (MyD88) and Nuclear Factor kappa-light-chain-enhancer of activated B cells (NF-κB). Transcripts containing full length ORFs of these candidate genes were identified using Pfam, BLASTp and BLASTx searches. Specifically, the search method for each candidate gene was as follows. TLR was identified by searching for all contigs with a TIR (PF01582) (or TIR_2, PF13676) domain, along with at least one leucine-rich repeat (LRR, CL0022). MyD88 was identified by searching for a TIR (or TIR_2) domain, along with a death domain (DD). IL-1R-like was identified by searching for a TIR (or TIR_2) domain, along with least one immunoglobulin (Ig, CL0011) domain. NF-κB was identified by searching for a Rel homology domain (RHD, PF00554), along with at least one ankyrin repeat (PF00023). The presence of other domains with a RHD may be indicative of different classes of NF-κB, however, in these actiniarian species no other domains were found in the same contig with a RHD. Hence, all identified NF-κB belong to Class 1. A Rel homology domain without ankyrin repeats, as seen in N. vectensis [56] were not considered as full length ORFs. NLR was identified by searching for a NACHT (domain found in NAIP, C2TA, HET-E and TP1) domain, along with at least one LRR. NLR genes may also include a variable N-terminal domain, which can be broadly classified in the ‘death-fold superfamily’ which includes the domains caspase activation and recruitment domain (CARD, PF00619), pyrin domain (PYD, PF02758), death domain (DD, PF00531) and death effector domain (DED, PF01335). Therefore, contigs also containing a domain from this superfamily were considered potential candidate NLRs. In addition, the translated ORF for each candidate gene was analysed by TMHMM v.2 [57], to identify transmembrane domains.

Finally, contigs were only retained as complete candidate genes if they consisted of a full length ORF which contained a start and stop codon, and received a BLASTp hit (from either the TrEMBL or Swiss-Prot databases) against other species using the open reading frame batch extracted by TransDecoder [49].

Presence/absence of other innate immune genes

Non-conservative counts of other innate immune genes in this study were identified from gene counts provided by Trinotate. CniFL was identified by searching for all genes containing the Pfam domains Collagen (PF01391), Ig (CL0011) and Fibrinogen (PF00147). MASP was identified from BLAST hits from TrEMBL with N. vectensis MASP and presence of the correct Pfam domains (CUB, EGF-like calcium binding, CUB2, Sushi 1, Sushi 2 and Peptidase S1). SRCR were identified by searching for the number of genes with Pfam annotations for the SRCR domain (PF00530). C-type lectin domains were identified by searching for the number of genes with Pfam annotations for lectin_C domains (PF00059). Complement control proteins (C3, Factor B, C6 and Factor I) were identified by searching for BLAST hits from either the TrEMBL or Swiss-Prot databases.

Identification of novel immune genes

Potential novel genes were identified by searching for novel domain architectures that included the TIR (PF01582) and/or TIR_2 (PF13676) domains. This domain is well documented in having a role in signal transduction and protein-protein interactions in immune pathways, and therefore, provides a robust target for finding putative novel genes that interact in innate immune pathways [58]. All species were also interrogated for novel actiniarian genes previously identified [11, 27]. Initially, all TIR or TIR_2 domain containing contigs, for each species, were identified in the Trinotate annotation report, and all contigs without a BLASTx or BLASTp annotation were further investigated. In addition, contigs that received BLASTx or BLASTp hits only against N. vectensis predicted proteins or genes were also interrogated in detail. The EMBL-EBI [32] Pfam domain architecture analysis (http://pfam.xfam.org/ Accessed 28th March 2016) was used to investigate known architectures with either the TIR or TIR_2 domains, and thereby identify which species, if any, have genes with novel or infrequently used architectures. Novel architectures were identified if no or few hits were returned from either BLAST or EMBL-EBI, which may indicate potential actiniarian lineage-specific (orphan) genes, horizontally transferred genes or taxonomically-restricted genes.

Candidate and novel gene validation

To validate the transcriptome assemblies, as well as the ORF of candidate transcripts, PCR validation was performed for the five candidate genes, as well as three novel genes and CniFL. All primers were designed using Primer-BLAST [59] using the settings of [60], to amplify the ORF of each respective gene for four species: A. tenebrosa, A. buddemeieri, A. veratra and C. polypus (except CniFL, for which genes were validated in A. tenebrosa and A. buddemeieri only). For large ORFs, multiple primer sets were designed to ‘walk’ the length of the ORF, with at least 100 bp overlap between products. A detailed list of primer sequences is presented in Additional file 3: Table S11. PCRs were performed using RNA extracted using the above described protocol, followed by cDNA synthesis using the SensiFAST™ cDNA synthesis kit (Bioline). PCR products were then amplified using MyFi™ DNA polymerase mix (Bioline) (PCR protocols and thermocycler conditions can be found in Additional file 3: Tables S12–S13). Purification was performed using the ISOLATE II PCR and Gel kit (Bioline), and sequenced using BigDye® Terminator v3.1 (ThermoFisher). Sequences were run on a Genetic Analyser 3500 from Applied Biosystems (ThermoFisher). Sequence chromatograms were visualised in Geneious version R8.1.4 [61] and aligned to the transcript from which the primers were designed, as per [62]. To confirm the ORF for the candidate gene, sequence similarity was compared.

Evolutionary and phylogenetic analyses

PAML

To investigate whether genes were under purifying, neutral or positive selection and thereby provide insights into the selective pressures on the actiniarian immune system, PAML analysis was performed. Non-synonymous vs synonymous substitution rates (d N /d S ) of all TLR, MyD88, NF-κB and the transmembrane domain containing NLRs, as well as Novel Genes 1, 2 and 3, were analysed using the method of Yang and Nielsen [33]. First, the nucleotide sequence for the complete ORF of all genes were extracted from the TransDecoder output and imported into Geneious [61] version R8.1.4. All stop codons were manually removed and then sequences were imported into MEGA version 7.0.14 [63] for alignment with MUSCLE [64] (codon), using default settings and gaps removed. Alignments were converted into PHYLIP format in Geneious [61] and then imported into pamlX [65, 66] version 1.3.1, where d n /d s ratios were calculated using YN00 [33]. All values for d N and d S were averaged for each gene to provide an average d N /d S.

RAxML

In order to understand how actiniarian NLRs have evolved in the context of the metazoan NLR family, the phylogenetic relationship of NLRs in this study and other metazoan NLRs were resolved using maximum likelihood analysis in RAxML. Of particular interest was the question of how the novel TMD containing NLRs have evolved (i.e., from a single origin or multiple independent origins). Initially, the translated ORFs for 35 complete and near-complete NLRs (near-complete only used for species where no complete NLR found) of the candidate NLR transcripts were imported into Geneious [61] version R8.1.4 and aligned along with NACHT domains from species across Metazoa (Additional file 6) which were obtained from [7]. Alignments were performed using MUSCLE [64] with default settings (duplicates removed). Sequences were trimmed to retain only the NACHT domain, as this represents the most evolutionary conserved domain of NLRs, across metazoan species and is useful for inferring phylogenetic relationships [7]. Alignments were imported into MEGA version 7 [63] to determine the best model of protein evolution. Maximum likelihood analysis was then performed in RAxML version 8.1.15 [67], using rapid bootstrapping (1000 replicates) with a random seed and no outgroup.

To understand how CniFL genes are disturbed across Anthozoa species, maximum likelihood analysis of CniFL proteins was performed. All but two species had at least one gene copy for which a complete ORF was identified; Telmatactis sp. had only partial sequences and no CniFL was identified in A. elegantissima. An initial alignment of the amino acid sequences of the complete ORFs was performed using MUSCLE [64]. Alignments were trimmed using Gblocks [68] to remove poorly aligned sections, allowing for gap positions within final blocks and not allowing many contiguous non-conserved positions. Protein model testing in MEGA [63] and maximum likelihood analysis in RAxML [67] was performed the same as above for NLR. The outgroups (Hs FCN1 and Mus FCN1) were set post-analysis in Geneious [61]. SMART analysis [69, 70] was used to identify the number of domains in each CniFL shown in the tree.