WGS catalog

To comprehensively represent the diversity of modern canids, we obtained publicly available WGS data from the genera Canis, Cuon, and Lycalopex (Sequence Read Archive: http://www.ncbi.nlm.nih.gov/sra; n = 314 unique individuals), as well as 128 unpublished genomes contributed by collaborators, 186 previously catalogued WGS25 and data from 94 domestic dogs sequenced by the Ostrander lab of which 52 were previously unpublished and now available on NCBI (accession number: PRJNA448733). All Biosample numbers for the 722 genomes are listed in the Supplementary Data 1 and the entire genome dataset can be found on NCBI. Long-term health status of most dogs is unknown. We applied standard QC methods to remove duplicate samples (see Methods) and validated the breed/species of each genome using a neighbor joining phylogeny comprising variant positions and data from Parker et al.4 (Supplementary Fig. 1). The final reference dataset contained 722 WGS from 144 established breeds, with 54 breeds represented by three or more dogs, 11 mixed breed samples, 26 samples of unknown breed status, 104 village and feral dogs from diverse locales, and 54 wild canids from six species (Supplementary Fig. 2a and Supplementary Data 1). The complete data set (vcf file containing 91 million variants and 722 genomes) is also available on NCBI (accession number: PRJNA448733).

To find genomic patterns enriched within breeds selected and maintained by human intervention, variants were called across all 722 individuals. The vast majority of the 91 million variants (including 17.3 million small (+/−24 bp) indel variants) observed are contained within intergenic regions (Supplementary Fig. 2c). Thirty-five percent of variants, including those in wild canids, are within introns or exons, 39% of exonic changes are non-synonymous and 7% are high impact variants as defined by both snpEFF26 and VEP27 (Supplementary Fig. 2c and Supplementary Table 1). The sequence depth for the 722 WGS ranged from 2.0x to 93.8x with a median of 18x (Supplementary Data 1). To optimize the dataset, we use previously published SNP chip data11, collected from a subset of the same individuals, to determine the minimum sequence depth required for confident genotype calls and opt to use a genome quality score (GQ) of 20 and an average sequence depth >10x (Supplementary Fig. 2b). We then define a primary reference dataset that retains only biallelic SNVs and small indels, for a total of 76.5 million variants (Supplementary Fig. 3). For the studies described here, we further refine the dataset, retaining only two males and two females from each modern breed, selecting those with the deepest sequence coverage. We also remove the genomes of village, mixed breed and dogs of unknown origin, but retain the genomes of wild canines in order to ascertain ancestral versus derived alleles, thus generating a working dataset of 268 modern breeds dogs and 54 wild canids. Finally, we use village dogs as an outlier group in order to identify genomic signatures of artificial selection in modern breeds.

Morphological traits analyses

We investigated 16 phenotypes using a Genome-Wide Mixed Model Association algorithm (GEMMA)28 which fits a univariate linear mixed model for marker association tests with a single phenotype, correcting for sex and using a relatedness matrix to correct for population stratification (Supplementary Fig. 3). The number of breeds used for each analysis depends on the availability of the standard breed information of the American Kennel Club5 (Table 1 and Supplementary Data 2). Keeping only variants with minor allele frequency above 1%, genome-wide data from an average of 14 million variants per phenotype are analyzed. Bonferroni corrections are applied to identify significant associations (threshold = 8.46) (Tables 1 and 2). Our initial findings validate our previously described associations for Mendelian morphological traits including fur growth patterns14 and coat color29; as well as complex traits such as standard breed height (SBH)1,13,16,18 (Fig. 1a). The analysis for SBH highlighted only genes/loci previously described in dogs such as the ligand-dependent nuclear receptor corepressor-like gene (LCORL), Stanniocalcin 2 (STC2), growth hormone receptor (GHR), SMAD family member 2 (SMAD2), high mobility group AT-hook 2 (HMGA2), fibroblast growth factor 4 (FGF4), insulin like growth factor 1 (IGF1), and one locus on Canis lupus familiaris chromosome 26 (CFA26)1,13,16,18. Signals at three previously identified genes, insulin like growth factor 1 receptor (IGF1R), insulin like growth factor 2 mRNA binding protein 2 (IGF2BP2) and immunoglobulin superfamily member 1 (IGSF1) were observed but did not pass the Bonferroni threshold (Fig. 1a).

Table 1 Summary of phenotypes used to perform GWAS using the WGS catalog Full size table

Table 2 Summary of significant associations identified by multiple GWAS using a maximum of 268 modern breed genomes Full size table

Fig. 1 GWAS results for morphological traits in dogs using the canine 722 genome catalog. Manhattan plots showing statistical significance (−log10 scale) for the 30,000 most associated biallelic variants for each canine autosome, and all variants for the X chromosome (X-axis). a Validation of this WGS-GWAS approach using known examples in dogs: presence or absence of moustache and eyebrows, length of fur, and height as a multigenic trait. b Associations identified using body mass including the bulky phenotype and life span. The red line represents the Bonferroni corrected significance threshold (−log 10 (P) ≃8.46) and variants passing this threshold are colored in red. Candidate genes identified in this study are in bold Full size image

We next run quantitative GWAS using breed-average measures for weight (SBW) as taken from the AKC breed standards (Fig. 1b). We identify 12 significant associations with weight (SBW) including the known canine body size genes/loci of LCORL, GHR, SMAD2, HMGA2, IGF116,18, as well as the two recently described genes: acyl-CoA synthetase long chain family member 4 (ASCL4) and IGSF117 (Fig. 1b). Our analysis also reveals three candidate genes on CFA11 (zinc finger protein 608-ZNF608)30, CFA19 (R3H domain-containing protein 1- R3HDM1)31 and CFA20 (ADAM metallopeptidase with thrombospondin type 1 motif 9 - ADAMTS9)32. In addition, we identified two genes, ADAMTS-like protein 3 (ADAMTSL3)33 on CFA3 and the hepatocyte nuclear factor 4-gamma gene (HNF4G)34 on CFA29 associated with the tall heavy muscled (bulky) phenotype we described previously17.

We observe a significant association at LCORL in the analysis of both SBW and SBH (p wald = 4.1 × 10−23 and 2.4 × 10−10, respectively), which are themselves highly correlated traits. No canine mutation has been previously described for this gene which encodes a transcription factor that has an established association with body size in other species35,36,37,38. The human gene has several isoforms, one of which is “long” (5,493 bp-NCBI: XP_022272118.1) and several that are “short” (≃1600 bp), differing significantly in the sequence of the last exons (4850 bp and 1301 bp, respectively) (Fig. 2b, c). Sanger sequencing of cDNA obtained from testis reveals three canine isoforms, two short and one long (Supplementary Data 3). Examination of both the WGS and testis cDNA reveals that large breeds (SBW > 41 kg) harbor a 1-bp insertion in the last exon of only the long isoform (Fig. 2a). With an allele frequency of 0.18 in the modern breed population, this mutation was never observed in small breeds (<10 kg), has a low frequency (af = 0.16) in medium sized breeds (between 10–41 kg), and is present in 80% of large breeds (>41 kg) (af = 0.67) (Supplementary Data 4). This insertion introduces a frameshift, changing the sequence of 11 amino acids and creating a premature stop codon (p.S1221*), resulting in the loss of 611 terminal amino acids (Fig. 2c). Alignment of human (ENSP00000490600.1) and canine LCORL protein sequences revealed strong conservation, with 81% identity. Interestingly, the long form of the protein contains a DUF4553 DNA-binding domain within the deleted portion of the dog protein. The strong conservation of this DNA-binding domain (86%) between human and dog suggests that, in large dogs, the 611 amino acid loss may disrupt transcription factor binding of LCORL with its target.

Fig. 2 Identification of LCORL mutation in large breeds and comparison with human. a Comparison of genomic sequences between human and the two canine alleles. A single nucleotide insertion is observed in large breeds (>41 kg). b Conservation of the two main LCORL proteins and their predicted functional domain using SIM68 and LALNVIEW69. c Schematic representations of LCORL proteins, highlighting the effect of the canine mutation (STOP codon after amino acid 1221 leads to a loss of 610 aa). The common part shared by all forms is colored in yellow. Source data are provided as a Source Data file Full size image

In addition to the above, regulatory element variants associated with canine SBW are identified in R3HDM1, ADAMTS9 and HNF4G, affecting promoter, long non-coding RNA and 3’UTR, respectively (Table 3). As expected, the identified body weight variants were never or rarely observed in wild canids (af < 0.06), defining them as derived alleles (Supplementary Data 4). Presence of the derived alleles in wild canids with low allele frequencies can be explained by post-divergence gene flow between wild canids and dog populations, and has been previously reported8. We also observe lower allele frequencies in village dogs compared to modern breeds, reflecting the absence of selective pressure in village dog populations for the specified body size genes under selection in modern breeds. The single exception was an allele frequency of 0.17 in wild canids and 0.59 in village dogs for the derived allele of IGSF1, which has been previously associated with the muscled phenotype in domestic dogs17, perhaps providing a fitness advantage in the “village dog” environment.

Table 3 Previously unreported candidate variants identified using the WGS canids catalog Full size table

We confirm all body size variants by Sanger sequencing DNA from 468 independent dogs encompassing 96 breeds of varying size and shape (five dogs/breed minimum) (Supplementary Data 5). We observed low allele frequencies (<0.03) for the described mutations in R3HDM1, ADAMTS9 and HNF4G, as estimated with the WGS data set. The derived allele for each of these genes was only observed in bulky breeds, including the Bernese Mountain Dog, Great Dane, English Mastiff, and Saint Bernard (Supplementary Data 4 and 5).

Combining our results with previously published data13,15,16,17, we estimate that variants in just 14 genes, i.e. IGF1R, LCORL, STC2, GHR(1), GHR(2), SMAD2, HMGA2, ZNF608, IGF1, R3HDM1, ADAMTS9-AS, HNF4G, ACSL4, and IGSF1 account for as much as 95% of SBW variation in purebred dogs (Table 4). Thus, while several hundred loci affect human height and body mass index (BMI)33,38,39, a much smaller number of genes of large effect explain the striking 40-fold range of body size observed across dog breeds.

Table 4 Allele frequencies at 14 markers explain 95% of weight variation in dog population Full size table

In order to provide more information about functional impact of these genes on body size, we utilized 51 RNA-seq experiments from SRA database and, in parallel, isolated RNA from 28 testes from 20 breeds for qRT-PCR analysis (Supplementary Data 6 and 7). As expected, we do not observe significant differences in either analysis, as the number of breeds is low and, in many cases, ideal tissue types were not available (Supplementary Fig. 4 and Supplementary Data 6 and 7).

Longevity analysis

We next considered the role of genetic predisposition in life span using American Kennel Club (AKC) breed-average life spans as a phenotype. Four of the 17 body weight/size loci identified in this study are significantly associated with longevity: LCORL, HMGA2, IGF1 and the locus on CFA26 (Fig. 1b). These results support and partially explain the previously reported correlation between body size and life span in domestic dog; large breeds breeds (SBW >30 kg) have a shorter average life span (8–10 years) than miniature and toy breeds, which can live ≥ 18 years24,40. We further investigate this observation using a panel of 746 dogs from 79 breeds genotyped using the Illumina Canine HD SNP array11 (Supplementary Data 8). Using the AKC metrics of breed-average for both weight and life span5, we observe a negative correlation between these traits (r = 0.72) (Fig. 3b). We use GEMMA28,41 to perform an association test with multivariate linear mixed models, which simultaneously estimates the association between a given variant and phenotypes of interest41, in this case body size and breed average lifespan (Fig. 3a and Supplementary Fig. 5), and observed the most significant associations (p wald < 5 × 10−10) for HMGA2, IGF1, IGSF1, IRS4, LCORL and SMAD2.

Fig. 3 Body mass and longevity analyses using 746 dogs genotyped on 170k SNP markers. a Manhattan plot of the multivariate GWAS for standard breed weight (SBW) and life span corrected by sex, using 746 dogs genotyped on Illumina HD SNP array11. The −log10 P values for each SNP are plotted on the y-axis versus each canine autosome and the X-chromosome on the x-axis. The red line represents the Bonferroni corrected significance threshold (−log 10 (P) = 6.48) and SNPs passing this threshold are colored in red. b Negative correlation between SBW and longevity. In blue, large breed outliers: Anatolian Shepherd Dogs (52.2 kg; 13 years) and Tibetan Mastiff (70.3 kg; 13.5 years) c SBW and longevity (y-axis) of each breed (without outliers) are plotted by genotype at each marker (x-axis). The homozygous D/D alleles have generally a stronger effect on the distribution of SBWs (or longevity) for a given genotype/marker combination (the median and first and third quartiles are indicated by the box-plots). Statistics for each genotype/marker combination are summarized in (d). P values estimated by Mann–Whitney–Wilcoxon tests (*P < 0.05; **P < 0.01; ***P < 0.001). SBWs and longevity of genotype classes are reported as mean ± SD. Source data are provided as a Source Data file Full size image

We test which genes contribute the most to both body size and life span, defining the “ancestral” allele for each gene (as opposed to “derived”) as that present in wild canid genomes (Supplementary Data 4). For SMAD2, HMGA2 and IGF1, the derived allele is associated with low SBW (average = 12.7 kg) and increased longevity (avg = 13 years), (p < 0.001, Mann–Whitney– Wilcoxon test). An increase in SBW and reduced lifespan (avg SBW = 39.5 kg; avg life span = 10.5 years; p < 0.001, Mann–Whitney–Wilcoxon test) are also observed in breeds homozygous for the derived allele of the most strongly associated marker at LCORL, IRS4 and IGSF1 (p < 0.01, Mann–Whitney–Wilcoxon test). Finally, a reduced life span is observed only for those breeds homozygous for the derived allele at IGSF1 (avg = 10.6, p < 0.001, Mann–Whitney–Wilcoxon test).

Additional morphologic phenotypes

We investigate several additional morphologic phenotypes including leg length, ear shape, and tail length and curl. We compare 22 dogs from 10 breeds with long hindquarters, as defined by the AKC5, including Sighthounds and tall working breeds (i.e. Great Dane, and Great Pyrenees (Fig. 4a)) versus 48 other breeds (80 small, medium and large dogs) and we find four large homozygous haplotypes that are significantly associated with long legs. The first and second, spanning the IGF1 and IRS4 genes have been previously described as body size genes13,17, and are validated herein (IGF1: p wald < 6.2 × 10−14 and IRS4: p wald < 2.6 × 10−13). Two associations on CFA1 (42–42.5 Mb) and CFA9 (53.4–54 Mb) were also observed. While no genes are annotated for the interval on CFA9, the association observed on CFA1 spans the estrogen receptor 1 (ESR1) gene, with the most significant variant located within the second intron of the gene (Fig. 4b). ESR1 is a major mediator of estrogen action, and is strongly linked to bone mass and osteoporosis in humans42. We confirm the CFA1 locus association using 855 dogs (88 breeds) genotyped on the Illumina Canine HD SNP array (Supplementary Table 2) and observe that >80% of long-legged dogs harbor the derived allele for the most associated SNP. Combining haplotype data from the 102 WGS and 855 genotyped dogs, we reduce the locus to 300 kb, spanning both ESR1 and its neighboring gene Spectrin Repeat Containing Nuclear Envelope Protein 1 (SYNE1). No mutations were identified within exonic sequences of either gene. However, examination of the human orthologous region reveals numerous annotated histone marks on the locus suggesting non-coding variants modulating regulatory elements in long-limbed dogs (Fig. 4b). qRT-PCR analysis using RNA extracted from testes revealed significantly higher levels of ESR1 expression in Sighthounds, with Irish Wolfhounds and Whippets displaying 20–70 times higher levels of ESR1 than other tested breeds (Fig. 4c). These results suggest that either over-expression of ESR1 is involved in a process leading to the elongation of long bones and epiphyseal fusion42, and/or that variation in gene expression is associated with an ossification disorder. The latter is of particular interest as many long-legged breeds are predisposed to develop bone diseases, including osteosarcoma21, for which ESR1 is reportedly a contributing factor43.

Fig. 4 ESR1 and the long leg phenotype in dogs. a Manhattan plots showing statistical significance (−log10 scale) for the 30,000 most associated biallelic variants for each canine autosome, and all variants for the X chromosome (X-axis) for the long-leg phenotype observed in Sighthounds, Great Dane, and Great Pyrenees. We distinguish four peaks: one peak pinpointing ESR1 gene on chromosome 1, one locus on CFA9 without any candidate genes in the interval, and IGF1 (CFA15) and IRS4 (CFAX) previously associated with height variation in dogs. Images to the left are Great Dane (top) and Greyhound (bottom). b UCSC genome browser showing the ESR1 locus in dog (top) and human (bottom). Vertical bars correspond to the most associated variants identified with the 722 genomes (in red), and the 855 dogs genotyped on 170k SNP array (in brown), and horizontal bars represent the homozygous haplotype observed. The bottom panel represents the human orthologous locus with tracks corresponding to the H3K4me1 and H3K27ac chromatin signals annotated by the ENCODE project55. c Expression level of ESR1 in a panel of 20 breeds, showing high expression in the Sighthounds, Irish Wolfhound and Whippet, in comparison to six different breeds with average leg length. Y-axis represents the relative normalized expression. d XP-CLR plot on ESR1 locus comparing Sighthounds (long legs breeds) with normal-sized legs breeds. We detected a significant selection signature located on ESR1 locus (in grey). Horizontal lines represent the empirical top 1% of genomic regions. Source data are provided as a Source Data file Full size image

We next sought genes underlying ear shape and size. The shape of the auricular cartilage determines the appearance of the pinna, which may be upright (prick ears) or pendulous (drop ears)44 (Fig. 5b). We compare variants from 60 breeds (113 dogs) with drop and 46 (101 dogs) with prick ears (Fig. 5a), and identify a significant association on CFA10 (p wald = 7.63 × 10−24) with a single nucleotide variant (chr10.g.8070103C > T) located in the exonic region of a long intergenic non-coding RNA (lincRNAs) (TCONS_00016758, TCONS_00016759) (Fig. 5c). This lincRNA is 29 kb downstream from the gene methionine sulfoxide reductase B3 (MSRB3), which is associated with human deafness45,46 (Fig. 5c and Table 3). The derived allele is detected in 76% of the drop ears dogs present in the WGS catalog, while only 5% of the prick ears dogs and wild canids carry the derived allele (Supplementary Data 4). Sanger sequencing of 855 dogs (88 breeds) reveals similar proportions, as 71 and 8% of drop and prick ear dogs carry the derived allele, respectively (Supplementary Table 3). Since the variant impacts a lncRNA, we hypothesize that a complex regulatory mechanism may be involved in determination of the drop ear phenotype, which includes this lincRNA, directly or indirectly, impacting MSRB3 expression.

Fig. 5 Ear morphology in dogs. a Manhattan plots showing one significant signal on the CFA10 for the drops ears phenotype and another one on chromosome 12 for the large and round ears. b Characteristic breeds representing four different ear shapes observed in dogs: Normal (1,3), large and round (2,4), prick (1,2) or drop (3,4). c UCSC genome browser showing the position on the canine genome (Canfam3.1) of the mutated lincRNA (in red) associated with the drop ears. d Combination of alleles at both loci create four phenotypes. Plus (+) and minus signs (−) indicate the presence or absence of variant (non-ancestral) genotype Full size image

We also perform GWAS to identify genes controlling large, round ears (e.g. Spaniel breeds, Beagle and Corgi) versus triangular, standard size ears (e.g. Eurasier or Miniature Pinscher) (Fig. 5b). Large ears are defined as having a greater area between the lateral and medial border of the ear, with a round and not triangular apex44. Comparing WGS from 31 dogs of 13 breeds with large, round ears to 182 dogs (85 breeds) that lack this phenotype we observe a significant association on CFA12 (p wald = 4.91 × 10−41). Analysis of variants either homozygous or heterozygous for a derived allele defined an interval of 33.8–35.1 Mb (Fig. 5a) which contains two genes: Regulating Synaptic Membrane Exocytosis 1 (RIMS1), a gene involved in cognition processes in humans47, which is an unlikely candidate, and Potassium Voltage-Gated Channel Subfamily Q Member 5 (KCNQ5). The latter has a vestibular role in mouse models48 and is a much stronger candidate. We did not detect coding variants in either gene, leading us to postulate non-exonic SNVs or structural variants as potential candidates involved in this phenotype. Acquisition of cartilage tissue, which has proven difficult to obtain, will allow future expression studies for both phenotypes (Supplementary Data 6 and 7, and Supplementary Fig. 6). Nevertheless, it is clear that combinations of variants at just these two loci control otherwise seemingly complex ear phenotypes in modern breeds (Fig. 5d). Other phenotypes (hairless, tail shape, behaviors) are described in Supplementary Fig. 7.

Signatures of selection on candidate genes

To further substantiate our hypothesis that genes responsible for the marked phenotypic variations among dog breeds have been driven by positive selection, we use the cross-population composite likelihood ratio (XP-CLR)49 and cross-population extended haplotype homozygosity (XP-EHH)50 to investigate extreme allele frequency and LD differentiation over extended linked regions in multiple breeds. Hypothesizing that breeds with different traits have experienced distinct evolutionary processes, we performed five independent case/control analyses based on a subset of traits previously defined: (1) long legs; (2) bulky (tall heavy muscled); (3) standard breed height/weight; (4) drop ears, and (5) large ears (Table 5 and Supplementary Data 9–11) with a goal of localizing signals of population-specific selection. Using the empirical top 1% of genomic regions, most of the candidate genes (13 of 18) identified from GWAS show significant allele frequency, or LD differentiation, between case and control populations (Table 5 and Supplementary Fig. 8), suggesting that human selection caused adaptive mutations to sweep to high prevalence or become rapidly fixed within a population. Nine of 13 significant genes are detected by both tests. The ESR1 gene, for example, reveals significant signals of positive selection (XP-CLR = 109.0, XP-EHH = 1.16) in breeds from the Sighthound clade (described in Supplementary Data 10 and 11) compared to average and short-legged breeds (Fig. 4d). We apply the same strategy to comparisons of case population versus random-bred village dogs and find that selection signatures remain significant (16 of 18 under a more relaxed threshold of 5%), highlighting the robustness of our results (Supplementary Data 10 and 11). Finally, the genetic distance between breeds of large and small size is significantly greater when estimated within body size genes compared to the whole genome (P < 2.2 × 10−16, Mann–Whitney U-test), based on the fixation index (F ST ) (Supplementary Fig. 9).