Genome sequence assembly and annotation

The blood samples used for genome sequencing were acquired from the Everland Zoo of Korea (Amur tiger, white Bengal tiger, African lion and white African lion) following the Everland Zoo (Korea) ethical guidelines and procedures, and a muscle sample was obtained from a Mongolian snow leopard carcass preserved in the Conservation Genome Resource Bank for Korean Wildlife, Seoul National University. No animals were killed or captured as a result of this study. Libraries for the Amur tiger genome were constructed at BGI, Shenzhen, and the insert sizes of the libraries were 170 bp, 500 bp, 800 bp, 2 kb, 5 kb, 10 kb and 20 kb. The libraries were sequenced using HiSeq2000. Other big cat genomes were sequenced at Theragen BiO Institute (TBI), Korea, using HiSeq2000 with read and insert lengths of ~90 bp and ~400 bp, respectively.

The corrected reads were used to complete the genome assembly using SOAPdenovo13. First, the short insert size library (170 bp, 500 bp and 800 bp) data were used to construct a de Bruijn graph. Second, all reads were realigned with the contig sequences. The amount of shared paired-end relationships between pairs of contigs were calculated and weighted with the rate of consistent and conflicting paired ends, before constructing the scaffolds step by step from the short insert size paired ends to the long distant paired ends. Third, the gaps between the constructed scaffolds were closed using the paired-end information to retrieve read pairs where one end mapped to a unique contig while the other was located in the gap region.

The Amur tiger genes were predicted using three approaches. First, de novo prediction was performed using the repeat-masked genome using AUGUSTUS (version 2.5.5)32 and GENSCAN (version 1.0)33. Second, homologous proteins in other species were mapped to the genome using tBLASTn (Blast 2.2.23)34 with an E-value cutoff of 1E-5. The aligned sequence and its query protein were then filtered and passed to GeneWise (version 2.2.0)35 to search for accurately spliced alignments. Third, cat EST and full-length cDNA sequences (from UCSC) were aligned to the genome using BLAT36 to generate spliced alignments. For EST results, spliced alignments were linked according to overlap using PASA37. Source evidence generated from the three approaches was integrated with GLEAN38 to produce a consensus gene set. Then, the Amur tiger genome sequence was aligned to two well-assembled and annotated genomes (human and domestic cat) using LASTZ (version 1.02). Finally, mapped results yielding information on homologous proteins were filtered by syntenic blocks of genome sequences. We also predicted the domestic cat (Felis_catus-6.2) gene set, because the gene set of the cat genome is preliminary.

Orthologous gene families

A comparative analysis was used to examine the rate of protein evolution and the conservation of gene repertoires among orthologs in the genomes of the Amur tiger, dog, human, mouse, giant panda, domestic cat (Felis_catus-6.2) and opossum. We used the TreeFam methodology39 to define a gene family as a group of genes that descended from a single gene in the last common ancestor of a considered species. We assigned a connection (edge) between two nodes (genes) if more than 1/3 of the region was aligned to both genes. An H-score (minimum edge weight) that ranged from 0 to 100 was used to weigh the similarity (edge). For two genes, G1 and G2, the H-score was defined as score (G1G2)/max (score (G1G1), score (G2G2)), where the score shown is the BLAST raw score. Gene families were extracted by clustering using Hcluster_sg. We used the average distance for the hierarchical clustering algorithm, requiring the H-score to be larger than five, and the minimum edge density (total number of edges/theoretical number of edges) to be larger than 1/3. The clustering for a gene family would also stop if it already had one or more of the out-group genes.

We determined the expansion and contraction of the orthologous protein families among seven mammalian species (tiger, cat (Felis_catus-6.2), dog, human, mouse, giant panda and opossum) using CAFÉ 2.2 (ref. 40) with 0.001080 of lambda option. GO of all tiger genes was annotated by InterPro. A χ2 test followed by a Fisher’s exact test (P≤0.01) were used to test for over-represented functional categories among expanded genes and ‘genome background’ genes; Fisher’s exact test was used when any expected value of count was below 5, which would have make the χ2 test inaccurate41.

Gene evolution

We investigated Panthera lineage-specific amino-acid changes by comparison with the known genes from the human, dog and mouse (from the Ensembl 69 release). We used lion and snow leopard gene sets by mapping reads to the tiger scaffolds and substituting SNVs. Artifacts from the multiple sequence alignment (ClustalW242) limitations were removed by filtering option with ≥1/2 of coverage and ≥of well-matched amino acids (consensus string is ‘*’, ‘:’ or ‘.’).

To detect tiger genes evolving under positive selection, we used conserved genome synteny methodology19 to establish a high-confidence orthologous gene set. Briefly, whole-genome multiple alignments were performed between human (hg19) and other species (cat (Felis_catus-6.2), dog (CanFam2.0), mouse (mm9) and panda (ailMel1) genomes) by the LASTZ alignment pipeline. We collected all the human protein-coding genes from RefSeq43, KnownGene44 and VEGA45, and mapped them to the other species via the syntenic regions. We then filtered the resulting blocks with rigorous conditions to get large-scale synteny of high-alignment quality, and a conservation of exon–intron structure. Finally, we found 7,415 1:1 high-quality ortholog genes to analyse, most of which also correspond to genes in the panda, dog and mouse genomes. Then, we aligned ortholog genes by PRANK46 and used the optimized branch-site model of PAML (version 4.5) and likelihood ratio tests (LRTs) (P≤0.05). A GO annotation download from Ensembl was used to assign GO categories to 7,415 orthologs. A χ2 test followed by a Fisher’s exact test (P≤0.01) were used to test for over-represented functional categories among positively selected genes; a Fisher’s exact test was used when any expected value of count was below 5, which would have made the χ2 test inaccurate41.

We also used an approach based on Ka/Ks47,48 to identify GO categories significantly above or below average in the tiger genome. The Ka and Ks rates are estimated by PAML from all aligned bases with a quality score >20 in orthologs, using the F3 × 4 codon frequency model and the REV substitution matrix. To determine whether the GO categories are evolving under significantly high constraints, we repeated this procedure 10,000 times on the same data set after randomly permuting the GO annotations. Then, we acquired the GO categories if the P-value was less than 0.05.

Chromosomal rearrangement

Among the alignment data generated from SyMAP49, when one scaffold happened to be mapped to several physically distant cat (Felis_catus-6.2) chromosomal locations, they were considered to be inter- or intra-chromosomal rearrangement events of the Amur tiger genome relative to the cat genome. The species (tiger and domestic cat)-specific genomic rearrangements were also analysed. We performed the dog versus tiger and cat versus tiger whole-genome pair-wise alignments using LASTZ software on the repeat-masked genomes. Using these methods, we identified clusters of unique alignments with well-defined order and orientation. There was a total of 18 chromosomal rearrangement (12 inter- and 6 intra-chromosomal rearrangements) overlaps when the results from SyMAP and LASTZ were integrated by comparing syntenic break positions. As the tiger assembly was generally fragmented, we carefully validated the 18 syntenic breaks to examine the assembly integrity by aligning long insert mate-pair libraries (2 kb, 5 kb, 10 kb and 20 kb) to the tiger scaffolds. Finally, we reported six putative chromosomal rearrangements (two inter- and four intra-chromosomal rearrangements) between the tiger and cat. All six rearrangements were validated by long-range PCR experiments followed by the Sanger sequencing method.

Demographic history

The history of population size helps to develop insights into evolution. Based on the pairwise sequentially Markovian coalescent model (PSMC)31, we inferred detailed population size histories of Amur tiger (TG), African lion (LN), snow leopard (SL), white tiger (WTG) and white lion (WLN). Using SNV data sets scanned with all the big cat sequencing reads mapped to Felis_catus-6.2, the consensus sequences of each big cat were constructed and then divided into non-overlapping 100-bp bins marked as homozygous or heterozygous. The resultant bin sequences for their sex chromosomal parts were removed, and then they were taken as the input of the PSMC estimation. To test the estimation accuracy, bootstrapping was performed by randomly resampling 100 sequences from the original sequences. Using the neutral mutation rates, the raw PSMC outputs were scaled to time and population sizes. We obtained atmospheric surface air temperature and global relative sea level data of the past 3 million years50.