Experimental design

To identify the closest relatives of the Lake Victoria Region Superflock (LVRS) and identify potential hybridization events, we performed restriction associated DNA (RAD) sequencing with haplochromine cichlid species from all major drainage systems that are either currently connected to the Lake Victoria region, or have been in the past, and representatives of all major radiations of the Lake Victoria Region Superflock. The closest relatives were inferred with maximum likelihood trees based on these data and on mitochondrial sequences of the same fish. D statistics, five population tests, and F4-ratio tests, were used to identify hybridization between lineages, determine the direction of gene flow and quantify ancestry proportions, respectively. Whole-genome sequencing was then performed with fish of the lineages identified as ancestral to the LVRS plus representatives of the superflock in order to corroborate the signatures of hybridization and to estimate ancestry block sizes.

Taxonomic sampling

We sampled haplochromines from all major lakes and several small lakes in the Lake Victoria region and from the Mpanga River that drains the Rwenzori Mountains in the drainage divide between Lakes Edward and Victoria (Supplementary Data 1, Fig. 1). Further, we collected or obtained from other collectors haplochromines belonging to all phylogenetic lineages and from almost all African river drainages hosting haplochromines. We also included members of the haplochromine species flocks of the other African Great Lakes Malawi, Tanganyika and Mweru (Fig. 1, Supplementary Data 1). Samples were collected under research and sample export permissions of the Tanzania Fisheries Research Institute and the Tanzanian Ministry of Agriculture, Livestock and Fisheries; the Ugandan National Fisheries Resources Research Institute and Ministry of Agriculture, Animal Industry and Fisheries Uganda; and the Department of Fisheries Zambia. All samples were collected in compliance with applicable international and national guidelines for the use of animals, and ethical standards.

RAD sequencing

DNA was extracted from fin clips or muscle tissue with a standard phenol-chloroform-isoamyl alcohol extraction method58. Restriction-site Associated DNA sequencing (RADseq) was performed following a standard protocol59. Restriction digestion was done overnight (8–10 h) using the restriction endonuclease HF-SbfI (NewEngland Biolabs) and 400–1,000 ng DNA per sample. P1 adaptors contained 5–8 bp long barcodes differing by at least two nucleotides from all other barcodes. The DNA was sheared with a Covaris S220 Focused-Ultra sonicator and fragments of 300–600 bp length were extracted from an agarose gel. We performed 18 PCR cycles to amplify the RAD fragments (30 s 98 °C, × 18 (10 s 98 °C, 30 s 65 °C, 30 s 72 °C), 5 min 72 °C). All libraries were single-end sequenced on an Illumina HiSeq 2,500 sequencer. The reads were de-multiplexed and trimmed to 84 nucleotides (nt, after barcode removal) with the process_radtags script from the Stacks pipeline60, correcting single errors in the barcode and discarding reads with incomplete restriction sites. The length of 84 nt results from removing the barcode (mostly 6 nt long) and trimming off the last 10 nt because of reduced quality at the read ends. The FastX toolkit (http://hannonlab.cshl.edu/fastx_toolkit) was used to remove all reads containing at least one base with a Phred quality score below 10 and reads with more than 10% of bases with quality less than 30.

The reads of each individual were then mapped to the Pundamilia nyererei reference genome using Bowtie2 (ref. 61) with the end-to-end alignment option. Single nucleotide polymorphisms (SNPs) and genotypes were called using GATK Unified Genotyper v. 3.1 (ref. 62). All sites were then filtered with a custom-made Python script and vcftools v. 4.1 (ref. 63). SNPs within 5 nt from indels (insertions and deletions) were removed to avoid false SNPs because of misalignment problems, and SNPs were required to have a quality value of at least 30.

Mitochondrial sequencing

Two mitochondrial markers (NADH Dehydrogenase Subunit 2 (ND2)64 using the primers ND2Met-F 5′-CAT ACC CCA AAC ATG TTG GT-3′ and ND2Trp-R 5′-GTS GST TTT CAC TCC CGC TTA-3′ and the mtDNA control region (D-loop)65 with the primers FISHL15926-F 5′-GAG CGC CGG TCT TGT AA-3′ and FISH12s-R 5′-TGC GGA GAC TTG CAT GTG TAA G-3′ were amplified with PCR and Sanger sequenced for the same individuals or downloaded from GenBank for the same species as those included in the RADseq dataset (Supplementary Data 1). The sequences were aligned in ClustalW implemented in BioEdit 7.2.5 (ref. 66) and manually curated for correct local alignment.

Phylogenetic analyses

Phylogenies were reconstructed for the concatenated mitochondrial genes and the concatenated RAD sequences separately, including both variant and invariant sites using a maximum likelihood approach (RAxML v. 7.7.7 and ExaML v. 1.0.4)67,68. For the mitochondrial dataset, a maximum likelihood tree was reconstructed using three partitions, one for Dloop, one for the first and the second codon positions of ND2, and one for the third codon position of ND2. For each dataset, we performed a RAxML analysis with 100 rapid bootstraps using the GTRGAMMA model of rate heterogeneity. For the RAD-seq dataset, we used all concatenated sites with no more than 40 individuals missing (25%) to reconstruct a maximum likelihood tree with RAxML (ref. 67) and ExaML (ref. 68). Each of 100 bootstraps was performed by randomly sampling with replacement sites from the concatenated dataset to get a dataset of the original size. The maximum likelihood tree was then inferred for each resampled dataset with ExaML (ref. 68) using a GTRGAMMA model of rate heterogeneity, as recommended in the RAxML-light manual68. We calculated bootstrap support values based on these 100 topologies with RAxML (ref. 67). The nuclear and mitochondrial trees were rooted with the reference genome of Oreochromis niloticus and ladderized and plotted using the R-package Ape v. 3.1 (ref. 69).

We then used RAD-derived SNP data to infer the species tree for the LVRS groups and their closest relatives with SNAPP (ref. 45). SNAPP bypasses gene trees and computes species trees directly from independently inherited markers by integrating over all possible gene trees45. We restricted this analysis to two individuals per species (one for A. sp. ‘Yaekama’) and as SNAPP assumes no linkage among loci, we included only biallelic sites that were at least 500 kb apart from each other. The resulting data set contained 31 individuals and 1,817 sites. We ran SNAPP for 1,000,000 iterations, sampling every 1,000th iteration using default priors. We discarded the first 50% of the trees as burn-in and visualized the posterior distribution of the remaining 500 trees as consensus trees in Densitree70.

Mitochondrial chronograms

Dated phylogenies were reconstructed based on mitochondrial D-loop65 and ND2 (ref. 64) sequences using BEAST v. 2.3.0 (ref. 71) and four different sets of calibration nodes (Supplementary Methods). We caution that the mitochondrial tree only shows the phylogeny of the maternal line and that time estimates more recent than one million years are most likely overestimates because of the increase of molecular rates towards the recent (Supplementary Methods).

To compare the splitting time between cichlid lineages with the major relevant geological events, we reconstructed paleogeographic maps at different time points based on previously published data and reviews (see Supplementary Methods).

Patterson’s D statistics

To test for evidence of ancient admixture among lineages, we computed Patterson’s D statistic46,72 (ABBA-BABA test), a method to detect admixture based on the frequencies of discordant SNP genealogies in a four-taxon tree, with the software package ADMIXTOOLS v. 1.1 (ref. 48). Genotypes were discarded if they had less than 6 reads or a genotype quality Phred score <20 (that is, error probability >1%). Significance of D statistics was assessed with a block jackknife procedure using a z score of three standard errors as a threshold48. We used three individuals of Astatotilapia flavijosephi from Lake Kinneret as the outgroup population and we tested for evidence of gene flow from each Eastern and Upper Nile clade into the LVRS relative to allele sharing with the closest relative of the LVRS, the Congo drainage taxa A. stappersi from Zambia (Fig. 2) or A. sp. ‘Yaekama’ from the central Congo. For these analyses we used all individuals with at least 50% of the sites sequenced at a depth of at least 10 reads. To exclude the possibility that the D statistics are biased by the alignment of the reads to the Lake Victoria species Pundamilia nyererei, we also aligned the reads to the Astatotilapia burtoni reference genome, which is an outgroup to all taxa used in the D statistics. To rule out that the choice of outgroup (A. flavijosephi) biases the D statistics, we repeated the tests with A. burtoni or A. desfontainii as outgroups. All species combinations with number of individuals and SNPs included are given in Supplementary Table 2.

Five population tests

To infer directionality of gene flow between LVRS and the Nilotic taxa, we used an extended version of the partitioned D statistic test developed by Eaton & Ree47 (we call it a ‘five population test’). Similar to the D statistics, this test is based on discordant allele sharing patterns, but by considering five populations it allows one to infer the directionality of gene flow. The five taxa with the topology ((P1,P2), (P3a,P3b)),O include the potential source of gene flow (for example, P3a), the taxon receiving gene flow (for example, P1), a close relative of each of these two taxa (for example, P3b and P2) and an outgroup (O, see also Supplementary Table 3 for visualization). If genes had introgressed from for example, P3a into P1, a close relative (P3b) of the introgression donor (P3a) would also show excess allele sharing with P1, but to a lesser extent than P3a. This is because many derived alleles that introgressed from P3a into P1 will be shared by P3a and P3b because of their recent common ancestry. In contrast, we would not expect that a close relative (P2) of the receiver of gene flow, P1, would show excess allele sharing with P3a and P3b. In the genomic data, this would be seen as an excess number of BABBA patterns, where P1 shares a derived allele (‘B’) with both P3a and P3b, whereas P2 has the outgroup allele (‘A’), as compared with the number of ABBBA patterns (P2 shares a derived allele with both P3 taxa). On the other hand, if the direction of gene flow was from P1 into P3a, P3b would not show excess allele sharing with P1 but instead, P2 would show excess allele sharing with P3a because of ancestrally shared alleles between P1 and P2 that introgressed into P3a. Therefore, gene flow from P1 into P3a would not affect the relative frequencies of BABBA and ABBBA patterns, but the pattern BBBAA (P1, P2 and P3a share a derived allele) would be more frequent than BBABA (P1 and P2 share a derived allele with P3b, see Supplementary Table 3). Counting discordant allele sharing patterns thus allows us to infer the direction of gene flow.

We computed the five population tests with a custom made script using the three individuals with least missing data for each LVRS group and the single individual with the most complete data for all other taxa. In contrast to the D statistics computed with ADMIXTOOLS, which are based on allele frequencies, our five population tests are calculated from a single individual for each focal population, following Eaton and Ree47. We tested each of the three individuals of each LVRS group separately, and report the means for each radiation. At heterozygous sites, one allele was chosen at random. For each combination of individuals tested, we counted the eight patterns needed to compute the four D statistics (Supplementary Table 3) using all sites without missing data. We calculated z scores in units of s.d. from 100 bootstrapped datasets (sites resampled with replacement) as in Eaton and Ree47.

Estimation of ancestry proportions

The absolute value of D statistics does not only depend on the admixture proportion but also on the demographic history, the divergence between the hybrid parental lineages, and the genetic distance between the real source taxon of gene flow and the sampled surrogate population73. Thus, the D statistic cannot be used directly to infer the magnitude of introgression. Instead, we applied the F4-ratio test48,73 to infer the Upper Nile ancestry proportions in the different LVRS taxa using all genotypes with at least ten reads and SNPs with a maximum of 10% missing data (72,443 SNPs) with ADMIXTOOLS v. 1.1 (ref. 48). The F4-ratio test is based on four populations with the genealogy (((A,B)C)O) and a fifth potential hybrid population X that is tested for ancestry proportions from B and C (Supplementary Fig. 6). We repeated the test with different Eastern taxa used as population ‘A’, T. pharyngalis or H. gracilior as ‘B’, A. sp. ‘Yaekama’ as ‘C’ and A. flavijosephi as ‘O’ as shown in Supplementary Table 4. The Upper Nile ancestry proportions estimated with the combination of populations with the lowest s.d. were used for further analyses and to explain the F4-ratio test in Supplementary Fig. 6.

Long-wave sensitive (LWS) opsin gene

We sequenced the LWS opsin gene from exon 2 to 6 using the Carleton & Kocher primer combinations F2 (5′-TTT GAG GGT CCC AAT TAC CA-3′), R2 (5′-TCC ACA CAG CAA GGT AGC AC-3′) and F3 (5′-ACT GGC CTC ATG GAC TGA AG-3′) and R4 (5′-TCC CAA AAT GGA GAA CAT GG-3′) (http://cichlid.umd.edu/cichlidlabs/protocols/Basic/pcrprimer.html)74. The sequences were aligned together with all available Lake Victoria cichlid LWS sequences from GenBank using ClustalW implemented in BioEdit 7.2.5 (ref. 66) and then manually curated. For sequences used, see Supplementary Data 3. A maximum likelihood tree was reconstructed including both variant and invariant sites using the GTRGAMMA model of rate heterogeneity with RAxML v. 7.7.7 (ref. 67). We used one partition with concatenated introns, one for the first plus second codon position of concatenated exons, and one for the third codon positions. The tree was rooted with the Oreochromis niloticus sequence and nodal support values were drawn from 100 rapid bootstraps performed with RAxML. The tree was ladderized and plotted using the R-package Ape v. 3.1 (ref. 69).

To visually assess the effects of recombination on the LWS opsin alleles in the Lake Victoria Region cichlids, we extracted all positions of the alignment that were polymorphic among the LVRS sequences and also differed between the Congolese and Upper Nile taxa. If one of the alleles was only found in the Congolese taxa but not in the Upper Nile taxa, we coloured it red, whereas alleles found only in the Upper Nile taxa were coloured blue. Many LVRS haplotypes could be explained by recombination between the two divergent allele classes occurring in the Congo and Upper Nile clades. The second LWS gene tree was reconstructed with all recombinant sequences falling between the two major allele classes in the phylogenetic tree removed from the alignment.

For information about the occurrence of LWS haplotypes across LVRS species, we compiled data from Tables 1+2 in Terai et al.51, Fig. 1 and Supplementary Table 4 in Terai et al.49, Supplementary Fig. 4b in Seehausen et al.39, Figure S3 in Miyagi et al.54 and additional GenBank sequences by Carleton et al. (GenBank). Information sources and GenBank accession numbers are provided in Supplementary Data 3.

Sorting of ancestral alleles

We created a RAD sequence dataset of our Congolese and Upper Nile LVRS relatives and previously published sequences36 of six sympatrically occurring Lake Victoria species that are ecologically and phenotypically diverse (Pundamilia pundamilia (n=10, blue male nuptial coloration, benthic insectivore-omnivore), P. nyererei (n=11, red planktivore-omnivore), Harpagochromis cf. serranus (n=10, blue piscivore), Lipochromis melanopterus (n=9, yellow paedophage), Paralabidochromis chilotes (n=9, blue benthic insectivore) and Neochromis omnicaeruleus (n=12, blue algae scraper)). The sites were filtered for having at least 5 individuals covered in each species at a sequencing depth of at least six reads. We used Arlequin v. 3.514 (ref. 75) to calculate global F ST values based on a hierarchical island model among the six Lake Victoria cichlid species. We calculated allele frequencies for the Congolese (Astatotilapia stappersi from Zambia and A. sp. ‘Yaekama’ from DRC, n=5) and the Upper Nile taxa (‘Haplochromis’ gracilior and Thoracochromis pharyngalis, n=9) to get estimates of the allele frequencies for the parental lineages of the LVRS and to infer the likely ancestry of the alleles segregating in Lake Victoria. We excluded sites monomorphic in the subset of Lake Victoria species or sequenced in less than three individuals in either the Congolese or the Upper Nile lineage taxa. We extracted sites for which only one of the two alleles segregating in Lake Victoria cichlids were found in the Congolese and Upper Nile lineage taxa together (category 1). Next, we extracted sites with both alleles segregating in the Congolese taxa representing ancestral standing genetic variation that would have been present in the Lake Victoria radiation also without ancient hybridization with the Upper Nile lineage (category 2).

We identified Lake Victoria polymorphic sites for which only one allele was found in the Congolese taxa, and the second allele was found exclusively in the Upper Nile taxa, representing genetic variation that can be attributed to the ancient admixture event (category 3). We also identified the subset of sites fixed for alternative alleles in the Congolese and Upper Nile taxa (category 4). These are sites that could be involved in epistatic BDM-like incompatibilities between Congolese and Upper Nile alleles. Sites fixed for alternative alleles in the Congolese and Upper Nile lineage taxa are expected to have been present at the onset of the Lake Victoria radiation at a mean minor allele frequency of 16% (that is, the inferred Upper Nile ancestry proportion for the Lake Victoria radiation). To take the effect of this moderately high initial allele frequency into account in our subsequent analyses, we created a dataset of SNPs with comparable expected ancestral allele frequency in Lake Victoria (category 5) by weighting the allele frequencies of the Congolese and Upper Nile lineage taxa at each site by 0.84 and 0.16, respectively, corresponding to the inferred ancestry proportions for the Lake Victoria radiation. Of these, we extracted all sites with a weighted minor allele frequency of 14–18% excluding sites that are divergently fixed between the Congolese and Upper Nile taxa. As an example, if at a given bi-allelic site the frequency of one allele was 0.1 in the Congolese taxa and the frequency of the same allele was 0.5 in the Upper Nile taxa, we calculated 0.1 × 0.84+0.5 × 0.16=0.164. As this site would have a weighted minor allele frequency between 14% and 18%, it would be included in the control set of SNPs. Note that such loci are not predicted to be involved in BDM incompatibilities because all their alleles were present within at least one of the two parental populations. The proportion of sites that were outliers of high global F ST (P<0.05) among the Lake Victoria species (‘LV outliers’) was calculated for each dataset and compared between datasets using a two-sided Fisher’s exact test. The robustness of the test to erroneously inferred ancestry proportions was assessed by repeating the calculations with different Upper Nile allele frequency weighting, ranging from 10 to 30% (Supplementary Discussion).

To check if LV outliers fixed for alternative alleles in the parental lineages inherently have a higher fixation probability than non-outliers, we analysed fixation patterns in other closely related cichlid species that are not part of the radiation. We used Astatotilapia bloyeti, A. sparsidens, A. flavijosephi, A. calliptera, A. burtoni and Astatoreochromis alluaudi as we have at least three sequenced individuals of each. For each pairwise comparison between these species, we extracted sites with at least three genotypes sequenced at a minimum depth of six reads in both species. We then checked if sites fixed for alternative alleles in the Congolese and Upper Nile lineage taxa were also fixed for alternative alleles between these other species. Finally, we tested with Fisher’s exact tests if sites that were high global F ST outliers among Lake Victoria species were also more often differentially fixed between these other species than non-outlier sites (see also Supplementary Discussion).

To analyse the genomic distribution of LV outliers that are fixed for alternative alleles between the Congolese and Upper Nile taxa, we used liftOver (http://genome.ucsc.edu/) with a chain file from Brawand et al.50 to get positions on the Oreochromis niloticus genome which includes chromosomal information.

Whole-genome sequencing data

To corroborate our findings from RAD sequence data, and for the analysis of ancestry block sizes, we sequenced whole genomes of nine fish from Lake Victoria, two from Lake Kivu, and of the taxa that we found to be the closest extant representatives of the lineages directly ancestral to the LVRS, the Congolese Astatotilapia stappersi, and the Upper Nile Thoracochromis pharyngalis and ‘Haplochromis’ gracilior. Whole-genome sequencing data was generated using PCR-free library preparation76 and Illumina HiSeq 3000 paired-end sequencing for 11 individuals. Three additional individuals, sequenced the same way on the same machine, were taken from McGee et al.77 (see Supplementary Data 1 for sample information). Local alignment against the Astatotilapia burtoni reference genome50 was performed with Bowtie 2 (ref. 61). For variant calling and genotyping we used Haplotype Caller (GATK v. 3.5)62. Genotypes with fewer than five reads, multiallelic sites, and indels were removed using vcftools v. 4.1 (ref. 63).

Whole-genome D statistics were calculated with ADMIXTOOLS v. 1.1 (ref. 48) using all biallelic sites with a 1% minor allele frequency cutoff and a maximum missing data proportion of 20% across all 14 genomes. The reference genome (A. burtoni) was used as an outgroup.

To study signatures of admixture along the genome, we used f d statistics by Martin et al.78. In comparison to D statistics, f d is more suited for small genomic regions78. ABBA and BABA pattern counts were calculated using allele frequencies by weighting each segregating site according to its fit to the ABBA or BABA pattern46,72,78. As an example, if a site has derived allele frequencies of 0, 0.5, 1 and 0 in P1, P2, P3 and the outgroup, respectively, it would count as half ABBA site. We used A. stappersi as P1, different LVRS individuals as P2, H. gracilior and T. pharyngalis as P3 and A. burtoni as outgroup (Supplementary Fig. 8). Note that P1 and P2 are switched as compared to D statistics, where LVRS are used as P1 and the Congolese lineage as P2. This difference is simply for consistency with the description of f d in Martin et al.78, and thus here ABBA represents sites with a derived allele shared between LVRS and Upper Nile, whereas BABA represents sites with a derived allele shared between Congolese and Upper Nile lineage taxa. f d is calculated as the difference between ABBA and BABA patterns compared to the maximum possible difference where gene flow between P2 and P3 would equal random mating78. Positive f d values indicate gene flow between P2 (LVRS) and P3 (Upper Nile). We calculated f d in non-overlapping sliding windows of 10 kb along the A. burtoni scaffolds using the Python script by Martin et al.78. Windows with less than five total ABBA and BABA patterns were excluded. As only positive f d values are indicative of excess allele sharing between Upper Nile and LVRS, and are correctly standardized, values with negative D scores were set to 0 as in Martin et al.78. Pearson’s product-moment correlations of f d values between different individuals were calculated with the stats R-package.

Putative Congolese or Upper Nile ancestry was assessed by comparing the frequency of ABBA sites (LVRS shares the derived allele exclusively with Upper Nile taxa) with the frequency of BBAA sites (LVRS shares the derived allele exclusively with the Congolese lineage representative A. stappersi) in non-overlapping sliding windows of 3 kb. Scaffolds with <100 kb data were removed. As for the f d statistic, ABBA and BBAA pattern counts were calculated using allele frequencies by weighting each segregating site according to its fit to the ABBA or BBAA pattern. Only windows with a total ABBA and BBAA pattern count exceeding one were used. Windows with a minimum ABBA proportion ((ABBA/(ABBA+BBAA)) of 0.7 were defined as candidate windows of Upper Nile lineage ancestry and are coloured in blue, whereas genomic regions with an ABBA proportion of 0.3 or less were defined as putative Congolese lineage derived windows and are highlighted in red. Ancestry tracts were defined as consecutive sliding windows of the same colour (red or blue) ignoring single sliding windows without data. Expected ancestry block sizes were calculated using a formula from Racimo et al.79. Assuming ∼50,000–100,000 generations since admixture, a recombination rate of 2.5 × 10−8 and an Upper Nile proportion of 20%, the expected mean length of the admixture tracts is (0.8 × 2.5 × 10−8 × (50,000 to 100,000)−1))−1=500 bp to 1 kb (ref. 79). The breakup of ancestry blocks may have slowed as the original hybrid population began to undergo genomic stabilization, speciated and formed geographically isolated radiations in separate lakes each of which underwent further stabilization independent of each other, likely associated with differential sorting of the ancestral variation.

Data availability

Mitochondrial and LWS opsin sequences are available on GenBank under the accession numbers KY366716-KY366843 for ND2, KY366844-KY366970 for D-loop, and KY366971-KY366986 for LWS opsin sequences. RADseq and whole-genome sequencing reads generated in this study can be downloaded from the NCBI Sequence Read Archive under Bioproject PRJNA355227. The JAVA program for 5-population tests is publicly available on GitHub (https://github.com/joanam/scripts).