Sex determination in Xiphophorus hellerii

A total of 30 female and 30 male X. hellerii individuals from the same strain (origin Rio Lancetilla), which were also used for the backcrossing experiment, were genotyped to identify sex-related markers, and thus characterize the sex determination system of this swordtail species strain.

Sequencing and data filtering

Genomic DNA was extracted from fin clips using standard phenol–chloroform extraction66, followed by RNase treatment. DNA quality of each sample was assessed by agarose gel electrophoresis and quantified using a Qubit v2.0 fluorometer (Life Technologies, Darmstadt, Germany). Approximately 100 ng of DNA template of each sample was used to construct a single quaddRAD67 library (PstI and MspI were used as rare and frequent restriction enzymes, respectively), size selected from 450 to 550 bp using a Pippin Prep platform (Sage Sciences, Beverly, USA) and paired-end sequenced in an Illumina HiSeq2500 (rapid run mode, 2 × 151 cycles).

The raw fastq files obtained from one Illumina lane (NCBI’s SRA database, accession no. PRJNA493969) were first processed using the clone_filter module implemented in the Stacks v1.4668,69 package to identify and remove duplicate reads. The retained sequence data, stripped of the four random bases at the 5′ end of each paired read, was then separated by the dual index inner barcodes using the Stacks’ process_radtags script (option: –inline_inline) with no quality filters applied. Next, the sequences were processed by the dDocent v2.2.25 pipeline70, with default parameters, in order to get the individuals’ genotypes. For this RAD data set, the dDocent pipeline was preferred to Stacks because of its superior ability to deal with paired-end reads when calling genotypes. Briefly, dDocent calls Trimmomatic v0.3671, bwa v0.7.1572 and Freebayes v1.1.073 to respectively quality control the raw reads, align the filtered sequences to the reference genome and infer individuals’ genotypes at polymorphic loci. Trimmomatic was used to remove bases with a quality score <20 from the beginning and end of reads, and additionally trim bases with an average quality score <10 in a sliding 5 bp window. The ‘mem’ algorithm implemented in bwa was applied setting the match score value (-A) to 1, the penalty for a mismatch (-B) to 4 and the gap penalty (-O) to 6. Freebayes was used to call genotypes setting a minimum mapping and base quality score to 10.

Sequence variation analysis: filtering and GWAS of sex in X. hellerii

The raw VCF variant file was obtained by setting the X. maculatus v4.4.2 as the reference genome46. In this version of the X. maculatus genome, the original scaffolds43 were tiled along a dense genetic map to create chromosome-length genome assemblies. The VCF file was filtered using VCFtools v0.1.1574 to retain loci present in a minimum of 80% of the individuals and having mapping quality (MQ) > 30. Further stringent filtering was applied using the dDocent_filters script (http://ddocent.com/filtering/) that relies on both VCFtools and the vcffilter module in the vcflib package (https://github.com/vcflib/). A total of 65,417 SNPs passed the quality control applied and were used for the GWAS and the F ST analyses (the full set of filters, and the number of sites removed at each step, are reported in Supplementary Table 3; the individuals’ genotypes were deposited into the Dryad Data Repository: https://doi.org/10.5061/dryad.7h54h66).

The final VCF file was used to perform a standard genome-wide association analysis using PLINK v1.90b4.975 setting sex as the binary case/control variable. The −log 10 p-value (obtained with a Fisher’s exact test) of each SNP was shown as a function of genomic position on the 24 X. maculatus LGs46 in a Manhattan plot constructed with the R package qqman v0.1.476. The threshold for genome-wide significance was set at a p-value = 7.64e−7 after Bonferroni correction for multiple comparisons (0.05/65,417 variants). To identify highly divergent, potential candidate sex-chromosome regions using an additional line of investigation, the program VCFtools was used to calculate F ST -values between males and females across different non-overlapping sliding windows (1, 10 and 100 Kb). As a suggestive threshold for significance, we used the upper 1% percentile distribution of the per window F ST values (1 Kb window: F ST = 0.184; 1 Kb window: F ST = 0.181; 1 Kb window: F ST = 0.157).

Coverage analysis to identify sex association in X. hellerii

As an alternative and independent approach to identify and verify potential sex-associated SNPs, we analysed mapped sequence coverage information from X. hellerii males and females to identify loci located in the non-recombining and recombination restricted regions of the sex chromosomes. We first created a set of reference loci using the ustacks module of the Stacks package in default mode, but setting the minimum depth of coverage required to create a stack (-m) to 5. As a reference for aligning the RAD loci, we used the X. hellerii genome v3.0.1 (NCBI accession number: GCA_001443345.1)47 that is derived from a single female, strain Sarabia, from the Xiphophorus Genetic Stock Center (TX, USA). Existing evidence suggested that the female is the heterogametic sex in this species; therefore, because the genome of X. hellerii was generated using a single female individual, using the X. hellerii genome would allow us to recover a higher number of W-linked loci than using the X. maculatus genome. We used this approach to confirm the findings of the statistically more robust GWAS approach, and possibly to identify specific X. hellerii sex-determining regions (e.g. autosomal modifiers). The total 70,763 loci, extracted from the Stacks’ catalogue formatted in fasta format (deposited into the Dryad Data Repository: https://doi.org/10.5061/dryad.7h54h66), were used as a reference for the alignment of the sequence set of each sample carried out with Bowtie2 v2.3.077 in ‘--very-sensitive’ mode. To avoid redundant information, only one of the paired loci was retained to serve as a reference. Next, raw coverage values—i.e. the number of reads mapping to each locus—were extracted from each individual mapping file using SAMtools v1.2.178 idxstats and normalized using the Median Ratio Normalization (MRN) function implemented in DESeq2 v1.8.179.

We then calculated the mean coverage for each locus (including males and females), but first removed loci with coverage below 3 and above 400 because they most likely indicate mapping errors and repetitive elements, respectively (the number of loci was reduced to 29,444 after the coverage filters were applied). We assumed a ZW sex-determination system is operating in this system, given the results from our sequence variation analysis (see the previous paragraph), and from the filtered coverage data set we selected (1) potentially W-linked female-specific loci and (2) Z-linked loci. According to expectations, W-linked loci should be present only in females, the heterogametic sex, so we applied coverage filters to satisfy this condition (mean coverage in females >7; mean coverage in males <2). On the other hand, the Z-linked loci are expected to be present in both sexes, but they should have twice the coverage in males than in females, as they have two copies of the Z chromosome (coverage filter to identify these loci: mean coverage ratio in male/female >1.9). We note that this type of filtering criteria is dependent on the sequencing depth of coverage used in the experiment, and the use of these arbitrary filters means the results should be interpreted with caution. However, the main purpose of the coverage analysis was to confirm the results of the GWAS analysis. Finally, we used a two-tailed Fisher’s exact test to test for over-representation of these candidate sex-chromosome loci in the X. maculatus LGs that were associated with sex (SNP analysis).

Synthetic hybrids: laboratory cross experiment

Laboratory crosses were conducted which mimic the hybridisation with backcrossing evolutionary scenario that has potentially given rise to at least two Xiphophorus fish species, and may have contributed to the ancestry of other taxa. We sampled both first generation and approximately 100th generation backcrossed individuals that had been artificially selected for two pigmentation phenotypes. We note that all sampled backcross generations were raised under the same standard laboratory conditions, therefore excluding any environmental effects.

In these laboratory crosses, one parental species is X. maculatus, where the strain Jp163A (origin Rio Jamapa) was used, which is also the same strain from which the reference genome was produced43. This strain has been kept as a brother–sister mating line for more than 100 generations. The other parental species used in the laboratory crosses is X. hellerii. The stock is derived from fish collected from the Rio Lancetilla and has been kept for over 50 years in closed colony breeding. The X. hellerii stock has been through several bottlenecks (due to the breakdown of the stock) where only a handful of fish or even a single pair was used to regenerate the stock. Both parental species are representatives of the two clades that were earlier shown to have been involved in the initial hybridisation event generating X. clemenciae and X. monticolus33,36,37.

The first generation hybrid cross was between a female X. maculatus and a male X. hellerii. Next, a backcross was made between an F 1 female, carrying the pigmentation phenotypes intended for selection and an X. hellerii male. Females were then selected for each successive backcross according to the two colour markers. One is the erythrophore pigmentation pattern dorsal red, Dr, which causes a dark red colouration of the dorsal fin and in hybrids extends from there over the whole body and the unpaired fins. The second pattern is caused by the closely linked macromelanophore pattern locus spotted dorsal, Sd. It is expressed as jet-black pigmentation spots at the base of the dorsal fin in purebred X. maculatus. In the genetic background of X. hellerii, the macromelanophores cover the whole dorsal fin, and from there invade the dorsal and caudal body compartment of the fish. As an effect of the Sd locus component, xmrk oncogene74, the macromelanophore pattern in the genetic background of X. hellerii is enhanced to melanotic hyperpigmentation that can progress to malignant melanoma (for an overview of the pigmentation genetics of platyfish/swordtail hybrids, see ref. 40). Dr and Sd/xmrk are closely linked on the X-chromosome of the platyfish corresponding to linkage group 2143,46. In the serial backcross experiment, each generation consists of 50% fish with and 50% without the chromosome region where both genes are located, designated as ‘pigm’ or ‘wt’ group. Selecting fish with the two linked colour markers for breeding mimics natural selection for a certain phenotype. Therefore, we expect fish with the pigmentation phenotype to carry the parental platyfish genetic region of the X-chromosome. Backcrossing was continued in this fashion for at least 100 generations (more than 30 years of laboratory crosses have been performed, counting was terminated after backcross 50 and on average three backcross generations were produced per year). For the approximate 100th generation backcross offspring, we expect fish carrying the pigmentation genes (further on referred to as BC 100 _pigm) to be enriched for genetic markers of platyfish LG21. By contrast, segregants not carrying the colour genes (wild type pigmented, further on referred to as BC 100 _wt) should not be enriched for platyfish sequences of this region of the genome.

Sequencing, filtering and genotype calling

Genomic DNA was extracted from pooled organs (brain, eyes, gills, liver, spleen, kidney) of the two parental species (two males and two female individuals each for X. maculatus and X. hellerii), four F 1 females, 16 first generation backcross individuals and 18 individuals generated after about 100 backcross generations. DNA extraction was carried out using the standard phenol–chloroform extraction66. The DNA quality of each sample was determined by agarose gel electrophoresis and quantified using a Qubit v2.0 fluorometer. Approximately 300 ng of DNA template of each sample was used to construct double-digest restriction site-associated DNA (ddRAD)80 libraries following the modifications introduced in ref. 81. DNA digestion was carried out using the restriction enzymes PstI (rare cutter) and MspI (frequent cutter).

A single ddRAD library containing 46 barcoded individuals (see Supplementary Data 3 for details) was constructed, size selected from 360 to 430 bp using a Pippin Prep system and single-end sequenced in an Illumina HiSeq2500 using rapid run mode with 151 cycles (raw reads were deposited in the NCBI’s SRA database, accession no. PRJNA493969).

The process_radtags script implemented in the Stacks package was used for individual demultiplexing and for filtering erroneous and low-quality reads (options: -c –q). After this process, an average of 2.95 million (X. maculatus) and 2.08 million (X. hellerii) sequences were obtained for the parents; an average of 3.05, 1.91 and 2.34 million sequences were obtained for the F 1 , BC 1 and BC 100 individuals, respectively (see Supplementary Data 3). The sequence length of each read was 146 bp after removing its 5-bp barcode.

Filtered reads were individually aligned to the anchored version of the X. maculatus genome46 using Bowtie2 with default end-to-end mode. SAMtools was then used to filter the mapping result by retaining only sites with high quality score (MQ ≥ 30). Loci construction and genotype calling were performed using the Stacks’ ref_map.pl pipeline requiring a minimum of five reads (- m 5) to form a locus and calling genotypes at a 5% significance level using the bounded error model (upper bound of 0.05). The allelic variants (SNPs) were exported in VCF format using the Stacks’ populations module.

Sorting of parental alleles in the laboratory cross dataset

An in-house Perl script was used to identify the subset of sites fixed for alternative alleles in the two parental species (X. maculatus and X. hellerii) of the laboratory cross following these criteria: (1) the individuals’ genotypes are homozygous within each parental species, but they have different allelic variants; (2) the sites selected previously are heterozygous including both parentals’ allelic variants in F 1 individuals; (3) at these sites, all BC 1 and BC 100 individuals have at least one paternal (X. hellerii) allele. After these filtering steps, we obtained a total of 34,632 SNPs (the individuals’ genotypes were deposited into the Dryad Data Repository: https://doi.org/10.5061/dryad.7h54h66).

Tests for excess ancestry

As we focused on SNPs fixed between the parental species (see above), each SNP was fully informative about ancestry. Thus, we were able to directly calculate the frequency of platyfish ancestry in the BC 100 lines. We then used a stochastic simulation to develop null expectations for the frequency of platyfish ancestry expected in the absence of selection after 100 generations of backcrossing. We sampled platyfish ancestry frequencies each generation stochastically as y t = binomial(p t−1 * 0.5, 2N) and then set p t = y t /2N. Here, y t is the number of platyfish alleles in the sample of N diploid hybrid fish and p t and p t−1 are the ancestry frequencies in the current and previous generation, respectively. The expectation for the binomial is reduced by half each generation as the BC line is backcrossed to X. hellerii. Using this iterative process and starting from p 0 = 0.5 (F 1 s), we simulated expectations for 100 generations of backcrossing with N = 10 or 15 hybrid fish per generation (conservative estimates of the average number of fish used to maintain the line over time). We repeated this procedure 10,000 times. In all cases, we found expected platyfish ancestry at generation 100 of p 100 = 0.

Ethical statement

All animals were kept and sampled in accordance with the applicable EU and national German legislation governing animal experimentation. In particular, all experimental protocols were approved through an authorization (568/300-1870/13) of the Veterinary Office of the District Government of Lower Franconia, Germany, in accordance with the German Animal Protection Law (TierSchG).