Study Populations and study design

For the discovery phase we used the samples from two isolated Italian populations participating to the INGI consortium. In particular, our study used 370 individuals from INGI-CARL, a population coming from Carlantino, a small village located in Puglia (Southern Italy), and 843 defined as INGI-FVG, making reference to 6 villages situated in the Friuli Venezia Region in North-Eastern Italy for a total of 1207 samples. For replication we used the samples coming from the Erasmus Rucphen Family (ERF) study, a cross-sectional cohort including 3,000 living descendants of 22 couples who had at least 6 children baptized in the community church around 1850–1900; 1731 samples were used from this study.

Phenotype and sample selection

Coffee consumption was assessed in the INGI cohorts through field interviews while it was self-reported in the ERF cohort. In all cohorts coffee consumption was measured as number of coffee cups per day. Given that the distribution of the trait was extremely skewed we transformed this to log 10 (N coffee cups + 1). In order to remove outliers, we excluded people who drank more than 9 cups of coffee per day in the Italian cohorts and more than 20 in ERF. This difference was due to the very different coffee consumption distribution observed in the Italian vs. the Dutch populations. Finally, all people under hypertensive medication were excluded from the analysis since those are usually advised to reduce or avoid coffee consumption.

Genotyping and Imputation

Genotyping was carried out as previously described21,22. Briefly, INGI-CARL, and INGI-FVG have been genotyped with the Illumina 370k high density SNP array. Genotype imputation on the INGI cohorts was conducted after standard QC using SHAPEIT223 for the phasing step and IMPUTE224 for the imputation using the1000 Genomes phase I v3 reference set25. ERF has been genotyped with different genotyping platforms: Illumina 318 k, 350 k, 610 k and Affymetrix 200 k. Genotypes were pooled together after QC, phased and imputed to the 1000 Genomes dataset phase I v326 using MaCH and minimac27. After imputation SNPs with MAF <0.01 or Info score <0.4 we excluded from the statistical analyses for each of the populations but ERF, for which R2 < 0.3 was used instead.

Association analysis

Association analysis was conducted using mixed model linear regression where the standardized food liking was used as the dependent variable and the SNP dosages as the independent variable. Sex and age were used as covariates. The kinship matrix based on all available genotyped SNPs was used as the random effect. For ERF the kinship matrix was estimated on 14.4 k SNPs common to all different genotyping platforms used. Association analysis was conducted using the FASTA method28 as implemented in the GenABEL 1.7–229 R package in order to eliminate the effect of familial relatedness from the trait. MixABEL30 was used for the actual association of the imputed SNPs. SNPs that did not pass quality control for more than one population were discarded. It is common practice in GWAS to run association presuming the genetic variants have an additive effect on the phenotype, however this is not necessarily true. Therefore, we decided to also use non-additive genetic models, in particular the dominant and recessive models. For the discovery step, association analysis was conducted separately for each INGI cohort and results were pooled using the inverse-variance weighting method. Given that no meta-analysis software supports non-additive genetic models we developed custom R scripts. After meta-analysis in the INGI cohorts, genomic control was used to eliminate any residual stratification and all significant SNPs where taken forward to the replication step in the ERF cohort. The significance threshold was estimated considering the number of equivalent tests performed when running multiple genetic models, which has been estimated as being 2.231, and was thus set at 2.27 × 10−8. In order to avoid false positive results due to rare categories, in each GWAS we eliminated those SNPs in which the coded category was either <0.01 or >0.99. This resulted in a different number of SNPs depending on the model used. Supplementary Table S2 summarizes the number of SNPs used for each cohort under each model.

Comparison of the association pattern of coffee consumption with eQTL results

In order to verify if the observed SNP association pattern could be traced back to an expression quantitative locus (eQTL), we have decided to compare it with that coming from the cis-eQTL of PDSS2 gene available from 14 different tissues from the GTEx database32. We have downloaded the full dataset of results and extracted all SNPs tested against the PDSS2 gene which fell inside the two recombination hotspots delimiting our association signal (chr6:107400000-107900000). If the two association signals can be traced back to each other we would expect that the strengths of the association should resemble each other and the relationships between the direction of their effect should be consistent across all SNPs examined and in particular in those showing the lowest p-values. In order to give more weight to the SNPs with the lowest p-values and to keep the directionality of the effect, we assigned to each SNP a score equal to −log 10 (p-value) × 1 if the direction of the effect was positive and log 10 (p-value) otherwise. Given that the two association patterns do not have a normal distribution we used the Kendal and correlation test to verify if the two patterns resembled each other and if the resulting correlation was significant.

Functional annotation of the discovered SNPs

In order to understand the possible functional role of the 5 replicated SNPs we annotated them using Haploreg v4.133. Briefly, Haploreg allows one to annotate a list of SNPs by reporting the associated chromatin stated on a 15-state model from the using the Roadmap epigenomics project34 and Encode project35. Moreover it checks if the SNPs have been previously associated to any eQTL, QTL or metabolite levels according to various databases. Finally, it allows one to set an LD threshold based on which all SNPs with r2 > threshold will be included in the annotation. For this study we limited the analysis to those SNPs that have r2 > 0.8 with our 5 SNPs. The results from this analysis can be found in Supplementary Table S4.

Ethical statement

All studies adhered to the tenets of the Declaration of Helsinki. The ERF study was approved by the Medical Ethics Committee of the Erasmus Medical Center in Rotterdam. Informed consent was obtained after explanation of the nature and possible consequences of the study.

All subjects in the INGI-CARL and INGI-FVG studies provided written informed consent before participation. Approval for the research protocol was obtained from the ethical committee of IRCCS-Burlo Garofolo Hospital.