The total evidence assembly returned 1,533,193 unique RNA-seq contigs that were clustered into 1,345,464 potential gene-level (isoforms collapsed) transcript groups. The assembly consisted largely of a mixture of A. maculatum and O. amblystomatis transcripts. There were also 7,193 transcript groups (0.5%) corresponding to a predatory mite, Metaseiulus occidentalis, and 2,641 transcript groups corresponding to a dermal fungus, Malassezia globosa. Sequences corresponding to the mite and fungus were removed by BLASTN homology search against BLAST databases composed of all known transcript sequences from the genera Metaseiulus and Malassezia (all BLAST analyses were completed using BLAST+ algorithms, v 2.2.28+ [Camacho et al., 2009]). Transcripts corresponding to the alga O. amblystomatis and salamander A. maculatum were recovered by BLASTN against a database consisting of transcript sequences from lab-grown cultures of O. amblystomatis and sequences from the model salamander Ambystoma mexicanum (contributed by Randall Voss, University of Kentucky, and from [Stewart et al., 2013; Wu et al., 2013]). Transcripts were further filtered by a BLASTX homology search against a database containing the entire protein complement of: Arabidopsis thaliana, Chlamydomonas reinhardtii, Mesostigma viride, Micromonas pusilla, Ostreococcus tauri, Oryza sativa, O. amblystomatis, Chrysemys picta bellii, Xenopus tropicalis, A. mexicanum, Pseudozyma, Saccharomyces cerevisiae, and the genera Melanopsichium and Leptosphaeria. This assortment of species was chosen either for phylogenetic proximity to O. amblystomatis or A. maculatum, or because those genera or species were the best hits when a selection of transcripts was queried against the nr database. Best hits to plant or green algal species were noted as algal sequences and combined with the results of BLASTN against the O. amblystomatis database. Best hits to salamander or other animal species were noted as salamander sequences and combined with the results of BLASTN against the A. mexicanum database. Best hits to fungal sequences were discarded. The remaining transcripts, which had no known homology, were retained and included as putative algal or salamander transcripts based on their expression pattern across samples.
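The best-hit assignment described above can be illustrated with a minimal sketch. It assumes each transcript's best BLASTX hit has already been reduced to a species or genus name written out in full; the function name, the category labels, and the genus-level matching are illustrative assumptions, not the pipeline that was actually run.

```python
# Genus sets follow the species list in the text. Grouping by genus and all
# names below are illustrative assumptions, not the original pipeline.
PLANT_OR_GREEN_ALGA = {"Arabidopsis", "Chlamydomonas", "Mesostigma",
                       "Micromonas", "Ostreococcus", "Oryza", "Oophila"}
ANIMAL = {"Chrysemys", "Xenopus", "Ambystoma"}
FUNGUS = {"Pseudozyma", "Saccharomyces", "Melanopsichium", "Leptosphaeria"}


def classify_best_hit(best_hit_species):
    """Return 'alga', 'salamander', 'discard', or 'unassigned' for one transcript.

    best_hit_species: full name of the best BLASTX hit (for example
    "Oryza sativa"), or None / "" when no hit was found.
    """
    parts = (best_hit_species or "").split()
    if not parts:
        # No known homology: retained, to be assigned later by expression pattern.
        return "unassigned"
    genus = parts[0]
    if genus in PLANT_OR_GREEN_ALGA:
        return "alga"
    if genus in ANIMAL:
        return "salamander"
    if genus in FUNGUS:
        return "discard"
    return "unassigned"
```

In this sketch, transcripts labeled "unassigned" correspond to the sequences with no known homology, which the text assigns to alga or salamander by their expression pattern across samples.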

One expectation of differential expression analysis is that most genes are expressed equally between control and experimental samples. If expression levels are ordered from low to high expression in control samples and binned in a sliding window of 100 genes per bin, the median expression level in each bin will increase as the index increases. Based on the expectation of equal expression, for the same sets of genes, the median expression level of experimental samples should correspondingly increase.

For a subset of genes confirmed to belong to A. maculatum by BLAST homology, genes were sorted by Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values in the samples of A. maculatum without intracellular algae. Median FPKM values were calculated for bins of 100 genes, in a sliding window running from the lowest-expressed genes through the 100 genes with the highest FPKM values, for the samples of A. maculatum with and without intracellular algae. The median FPKM values for the bins of A. maculatum-only samples were plotted against the data from the samples of A. maculatum with intracellular algae (Figure 1—figure supplement 2a). At moderate to high expression levels, FPKM values in A. maculatum with intracellular algae increased with FPKM values in A. maculatum without intracellular algae (Figure 1—figure supplement 2a). At very low FPKM values, however, the data were essentially uncorrelated: median FPKM values in A. maculatum without intracellular algae increased (since the data were ordered by those values), but median FPKM values in A. maculatum with intracellular algae stayed the same (Figure 1—figure supplement 2a).
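The sliding-window binning described above can be sketched as follows. This is a minimal illustration that assumes per-gene FPKM values are supplied as two parallel lists and that the window advances one gene at a time; the function and parameter names are hypothetical.

```python
from statistics import median


def sliding_bin_medians(reference_fpkm, other_fpkm, window=100):
    """Order genes by reference_fpkm, then return per-window medians for both
    conditions over the same sets of genes.

    reference_fpkm, other_fpkm: parallel per-gene FPKM lists (assumed layout).
    window: genes per bin (100 in the text); the step of one gene is an assumption.
    """
    order = sorted(range(len(reference_fpkm)), key=lambda i: reference_fpkm[i])
    ref_sorted = [reference_fpkm[i] for i in order]
    other_sorted = [other_fpkm[i] for i in order]

    ref_medians, other_medians = [], []
    for start in range(len(order) - window + 1):
        ref_medians.append(median(ref_sorted[start:start + window]))
        other_medians.append(median(other_sorted[start:start + window]))
    return ref_medians, other_medians
```

Plotting the second list against the first gives the kind of curve described for Figure 1—figure supplement 2a, where the reference medians increase by construction and the other condition follows only once expression is high enough.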

Lower limit FPKM values were determined by finding the FPKM value above which A. maculatum without intracellular alga samples and A. maculatum with intracellular alga samples exhibited a positive correlation (Figure 1—figure supplement 2a). Genes with FPKM values below the lower limit of both sets of samples being analyzed for differential expression (i.e. intracapsular algae and intracellular algae) were not included in the analysis. The uncorrelated expression pattern of genes with FPKM values below the threshold suggests that there is either insufficient sequencing depth to compare those genes between the two conditions, or those lowly expressed genes are expressed stochastically in these cells and the fluctuations in expression levels of those genes are not indicative of a biological difference between conditions. The same analysis was completed for A. maculatum cells with intracellular algae by ordering the genes based on their expression levels (Figure 1—figure supplement 2b), and for intracapsular (Figure 1—figure supplement 2c) and intracellular algal samples (Figure 1—figure supplement 2d).

The lower limit FPKM values for A. maculatum genes were 0.55 FPKM for A. maculatum cells without algal endosymbionts (Figure 1—figure supplement 2a, vertical red-dashed line) and 0.61 FPKM for A. maculatum cells with algal endosymbionts (Figure 1—figure supplement 2b, vertical red-dashed line). The lower limit FPKM values for algal genes were 2 FPKM for the intracapsular alga (Figure 1—figure supplement 2c, vertical red-dashed line) and 0.04 FPKM for the intracellular alga (Figure 1—figure supplement 2d, vertical red-dashed line). The values reflect the sequencing depth of each sample and are close to the widely used lower limit threshold of FPKM > 1 applied in many RNA-seq studies (Fagerberg et al., 2014; Shin et al., 2014; Graveley et al., 2011). The exception is the intracellular alga samples, which suffer from low sequencing depth but nonetheless display correlated expression with intracapsular alga samples starting at low FPKM values.
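As a minimal sketch of how these lower limits enter the analysis, the rule stated earlier (a gene is excluded only when it falls below the lower limit in both sets of samples being compared) can be written as a single predicate. The constant and function names are hypothetical; the numeric limits are the ones reported above.

```python
# Lower limits reported in the text (FPKM). Names are illustrative assumptions.
SALAMANDER_LIMITS = {"without_algae": 0.55, "with_algae": 0.61}
ALGAL_LIMITS = {"intracapsular": 2.0, "intracellular": 0.04}


def passes_lower_limit(fpkm_a, fpkm_b, limit_a, limit_b):
    """Keep a gene unless it is below the lower limit in BOTH sample sets."""
    return not (fpkm_a < limit_a and fpkm_b < limit_b)


def filter_salamander_genes(genes):
    """genes: dict of gene name -> (FPKM without algae, FPKM with algae)."""
    return {
        name: values
        for name, values in genes.items()
        if passes_lower_limit(values[0], values[1],
                              SALAMANDER_LIMITS["without_algae"],
                              SALAMANDER_LIMITS["with_algae"])
    }
```

A gene expressed at 0.3 FPKM without algae and 0.7 FPKM with algae would be kept under this rule, because it exceeds the limit in one of the two conditions.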

After determining the lower limit thresholds, the algal gene set consisted of genes with at least one read pair mapping to each of the three intracapsular algal samples or each of the four endosymbiotic cell samples, and that were not found in the salamander, fungal, or mite BLAST data. In addition, the genes had to have expression values above the lower limits described above in the respective algal samples, and below the lower limit for A. maculatum without intracellular algae in the salamander-only cell samples. This resulted in a set of 8,989 potential algal genes. However, due to a low depth of sequencing of the algal component of the endosymbiotic cell samples, additional filtering was necessary.
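The selection criteria above can be summarized in a short sketch. Several details are assumptions made only for illustration: the record layout, the field names, and in particular the reading of "above the lower limits in respective algal samples" as exceeding the limit in at least one of the two algal sample types.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GeneRecord:
    # Hypothetical per-gene summary; all field names are assumptions.
    intracapsular_read_pairs: Sequence[int]  # three intracapsular algal samples
    endosymbiotic_read_pairs: Sequence[int]  # four endosymbiotic cell samples
    non_algal_blast_hit: bool                # salamander, fungal, or mite hit
    intracapsular_fpkm: float
    intracellular_fpkm: float
    salamander_only_fpkm: float


def detected_in_all(read_pairs):
    """True when every sample in a non-empty group has at least one read pair."""
    return len(read_pairs) > 0 and all(n >= 1 for n in read_pairs)


def is_candidate_algal_gene(gene, capsular_limit=2.0, cellular_limit=0.04,
                            salamander_limit=0.55):
    detected = (detected_in_all(gene.intracapsular_read_pairs)
                or detected_in_all(gene.endosymbiotic_read_pairs))
    # Assumed interpretation: above the limit in at least one algal sample type.
    above_algal_limit = (gene.intracapsular_fpkm > capsular_limit
                         or gene.intracellular_fpkm > cellular_limit)
    below_salamander_limit = gene.salamander_only_fpkm < salamander_limit
    return (detected and not gene.non_algal_blast_hit
            and above_algal_limit and below_salamander_limit)
```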

Genes that were not detected in intracellular algae could have been missing due to the lower depth of sequencing rather than representing an actual biological difference in expression between the algal populations. To determine what level of expression in intracapsular alga would be needed for a complete absence of measured expression in intracellular alga to be meaningful, the 8,989 algal genes were first ordered by intracapsular algal FPKM. Then the proportion of genes with no expression in intracellular samples was plotted against the median expression level in high-sequencing-depth intracapsular algae in bins of 100 genes in the ordered data. At low FPKM values in the intracapsular algae, up to 58% of the genes were absent from intracellular samples (Figure 1—figure supplement 3b). As expression in intracapsular alga samples increased, the proportion of genes with measurable expression levels in intracellular algal samples increased as well (Figure 1—figure supplement 3b). The same relationship was not observed for A. maculatum data sets, where the depth of sequencing between samples was approximately equal (Figure 1—figure supplement 4).
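The relationship between detection and expression level described above can be computed with a sketch like the following. It assumes per-gene FPKM values as two parallel lists and treats an intracellular FPKM of exactly zero as "no expression"; the text does not say whether the bins overlap, so the step size is exposed as a parameter and its default is an assumption.

```python
from statistics import median


def undetected_fraction_by_bin(intracapsular_fpkm, intracellular_fpkm,
                               window=100, step=1):
    """Return a list of (median intracapsular FPKM, fraction undetected) per bin.

    Genes are ordered by intracapsular FPKM. A gene counts as undetected when
    its intracellular FPKM is 0. step=1 gives a sliding window; step=window
    gives non-overlapping bins (the choice is an assumption).
    """
    pairs = sorted(zip(intracapsular_fpkm, intracellular_fpkm),
                   key=lambda pair: pair[0])
    bins = []
    for start in range(0, len(pairs) - window + 1, step):
        chunk = pairs[start:start + window]
        bin_median = median(capsular for capsular, _ in chunk)
        undetected = sum(1 for _, cellular in chunk if cellular == 0)
        bins.append((bin_median, undetected / window))
    return bins
```

Plotting the second element of each tuple against the first gives the kind of curve shown in Figure 1—figure supplement 3b.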

The dependence of gene detection in intracellular algal samples on FPKM level in intracapsular algal samples abated at intracapsular alga FPKM values where 95% or more genes could be detected in intracellular algal samples (Figure 1—figure supplement 3b, vertical red-dashed line). That expression level corresponded to 67.9 FPKM in intracapsular algal genes. After removing genes that had no detectable expression in intracellular algae and expression levels below 67.9 FPKM in intracapsular algae, the dependence of gene detection on expression level in intracapsular algae was removed, and the anomalous peak of undetected genes was removed from the expression histogram (see Figure 1—figure supplement 3a and c). This resulted in a set of 6,781 genes. Because the set of 6,781 genes included some genes with homology to anomalous organisms such as pine and beech trees, and without homologs in C. reinhardtii (perhaps due to pollen in the low cell number samples), only genes with homologs in the lab strain Oophila transcriptomes were considered. The final set of algal genes used in differential expression analysis consisted of 6,726 genes.
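A minimal sketch of this final filtering step follows. The removal rule uses the 67.9 FPKM value reported above. The threshold finder is only one plausible way to locate such a value from binned data (the lowest bin median at and above which every bin has at most 5% undetected genes) and is an assumption rather than the procedure that was used; all names are hypothetical.

```python
def find_detection_threshold(bins, max_undetected=0.05):
    """bins: iterable of (median intracapsular FPKM, fraction undetected).

    Returns the lowest bin median such that it and every higher bin have an
    undetected fraction <= max_undetected, or None if the top bin already fails.
    The selection rule itself is an assumption.
    """
    threshold = None
    for bin_median, fraction in sorted(bins, reverse=True):
        if fraction > max_undetected:
            break
        threshold = bin_median
    return threshold


def remove_depth_limited_genes(genes, threshold=67.9):
    """genes: dict of gene name -> (intracapsular FPKM, intracellular FPKM).

    Drops genes with no detectable intracellular expression whose
    intracapsular expression is below the threshold.
    """
    return {
        name: (capsular, cellular)
        for name, (capsular, cellular) in genes.items()
        if not (cellular == 0 and capsular < threshold)
    }
```

Under this rule, a gene absent from intracellular samples is retained only when its intracapsular expression is high enough that the absence is unlikely to be explained by sequencing depth alone.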