Sample description

A total of 27 multigenerational, multiply affected Spanish (n = 23) and German (n = 4) families were investigated. A detailed description of the phenotypic assessment of the Spanish families is provided elsewhere24. In brief, the diagnostic assessment of affected and unaffected individuals in the Spanish families was performed using: the Schedule for Affective Disorders and Schizophrenia (SADS)25; the Operational Criteria Checklist for Psychotic Illness (OPCRIT)26; a review of medical records; and interviews with first and/or second-degree family members using the Family Informant Schedule and Criteria (FISC)27. Consensus best estimate diagnoses were assigned by two or more independent senior psychiatrists and/or psychologists, in accordance with the Diagnostic and Statistical Manual of Mental Disorders IV (DSM IV). In affected and unaffected members of the German families, phenotypic assessment and the assignment of diagnoses were performed by an experienced psychiatrist28.

No relationships were reported between the 27 individual families. In each of the 27 families, three individuals with BD were selected for WES (Supplementary Figure 1). These individuals were selected on the basis of being as distantly genetically-related as possible. The 81 selected individuals (37.0% male) had a DSM IV diagnosis of BD type I (n = 69); BD type II (n = 9); or BD not otherwise specified (NOS, n = 3).

The study was approved by the respective local ethics committees. Written informed consent was obtained from all participants prior to inclusion.

WES

Library enrichment for WES was conducted using SureSelectXTHuman All Exon v5 from Agilent Technologies (Santa Clara, CA, USA). Enriched samples were sequenced using an Illumina HiSeq2500 system (San Diego, CA, USA), and a 2 × 125 base pair (bp) paired end sequencing approach. Mean coverage of the sequences was 68.28×, and 96.90% of the sequencing reads had a coverage of >10×. Sequencing data were annotated according to the GRCh37/hg19 reference genome. The sequencing data are available upon request.

A detailed plan of the analytical steps is presented in the Supplement (Supplementary Figure 2). In a first step, separate analyses were conducted for each of the 27 families. For each of the three selected family members, Variant Calling Files were generated using the VARBANK pipeline (https://varbank.ccg.uni-koeln.de). The VARBANK pipeline integrates a number of publically available sequencing analysis tools. Among others, various GATK tools are used for diverse processing steps. The VARBANK pipeline is therefore based on GATK core components. VARBANK filter criteria were set for the detection of heterozygous variants (allele read frequency between 25% and 75%). Sequencing reads with a coverage of ≥10× were included in the subsequent analyses. The analyses focused on single-nucleotide variants/polymorphisms (SNVs, SNPs) and insertions or deletions (InDels) that: (i) resulted in an alteration in primary protein structure; or (ii) had strong splice site effects29. Only variants shared by all three investigated family members were included in the subsequent variant analysis, as these variants might be responsible for the exceptional aggregation of BD in the respective multiply affected families. The present rationale is based on the assumption that in multiply affected families, individual rare variants with a relatively strong effect (penetrance) on disease development may segregate. By concentrating solely on variants that were present in all three sequenced patients, the analysis focused on variants with a potentially high penetrance, and knowingly overlooked rare, disease-associated variants with lower penetrance. Although the term “segregation” is used in describing the exome-sequencing results, it should be noted that only “allele sharing” is actually observed. However, owing to the rarity of the identified variants, the observation of allele sharing between three exome-sequenced individuals from a given family is likely to reflect true segregation (i.e., identity-by-state = identity-by-descent).

The identified variants were filtered for a minor allele frequency (MAF) < 0.1% using the data of the Exome Aggregation Consortium (ExAC, http://exac.broadinstitute.org, release 0.3, non-psychiatric subsets)30. The majority of ExAC data originate from Europe (around 60%), and were thus considered appropriate in terms of estimating the MAF of variants identified in the present cohort.

To obtain functional predictions for the identified variants, the dbNSFP database was accessed. In accordance with Purcell et al.31,32,33, the five prediction tools SIFT, PolyPhen-2 HumDiv, PolyPhen-2 HumVar, LRT, and MutationTaster were used for the analysis of SNVs and SNPs. Only variants predicted to be potentially/probably damaging by at least three of the five prediction tools were included in the final list (Supplementary Table 1). For the analysis of InDels, the prediction tools MutationTaster and PROVEAN/SIFT were used. Only InDels predicted to be damaging by at least one of the three prediction tools were included in the subsequent analyses. Owing to their potential impact on protein function, nonsense variants that were classified as (probably) disease causing by the MutationTaster tool were included in the final list (Supplementary Table 1). For each of the identified variants, visual inspection of the sequencing reads in the VARBANK database was performed in order to control for technical artifacts.

Kinship analysis

A kinship analysis was conducted for the 81 exome-sequenced individuals from the 27 multiplex families using the Sample Kinship analysis tool within the VARBANK pipeline. This tool determines the proportion of alleles that are shared between pairs of individuals. The thresholds for shared variant analysis are: MAF < 0.1%; target distance < 100 bp; and passing of GATK’s variant quality score recalibration filter. The resulting values are the number and percentage of shared alleles. The results report the pairwise comparison of one individual with all other individuals within the investigated cohort.

Technical validation and extended segregation analyses

Using Sanger sequencing, technical validation and extended segregation analyses were performed for variants that were both: (i) rare, non-synonymous, and predicted to be potentially/probably damaging according to the above-mentioned criteria; and (ii) located in genes that harbored variants in at least two independent families. Extended segregation analysis was conducted in all family members (affected and unaffected) for whom DNA was available (Supplementary Figure 1). Primers for these experiments were designed using Primer334. Cycle Sequencing was conducted with the BigDye Terminator v3.1. Sanger sequencing was conducted using an ABI3130 Genetic Analyzer (Life Technologies, Carlsbad, CA, USA). Primer sequences and PCR conditions are obtainable upon request.

Rare variant association testing using RareIBD

RareIBD analyses were performed for the 16 rare variants that were investigated in the extended segregation analysis. RareIBD is a rare variant association method for large and extended pedigrees. RareIBD was selected as it is applicable to pedigrees with different family structures and those in which individuals in the top generations are missing35. RareIBD analysis (v1.2, http://genetics.bwh.harvard.edu/rareibd/) was conducted using the segregation analysis data of all family members for whom DNA was available. For the RareIBD analysis, individuals were defined as being affected if they were diagnosed with BD type I, BD type II, or BD NOS. All other individuals were defined as unaffected. RareIBD software settings were applied in accordance with the standard recommendations for the analysis. The resulting p values were Bonferroni-corrected for multiple testing according to the number of investigated variants (n = 16). For the quality control of the generated pedigree files, pedigrees were drawn using the CRAN-package kinship2, and the generated figures were then inspected to confirm correct family structure36.

Brain expression of candidate genes

To determine whether candidate genes identified in the WES analyses are expressed in the human brain, the Genotype-Tissue Expression (GTEx) database was accessed (https://gtexportal.org/)37. The GTEx Portal comprises expression data from multiple brain regions. Using the GTEx data, average expression values were generated for 12 different brain regions (excluding the spinal cord). Genes with a mean expression of > 0.5 Reads Per Kilobase Million were considered brain-expressed.

Investigation of identified candidate genes in published datasets

A literature search was conducted to determine whether candidate genes identified in the present study had been reported in previous independent next-generation sequencing studies of BD patients or in BD GWAS8,9,18,19,20,22,23,38,39,40,41,42.

Gene set enrichment analysis

For the 368 genes that harbored rare, non-synonymous, and segregating variants, a systematic investigation of gene set enrichment was performed using the permutation-based method described in Goes et al.22, in order to account for potential confounders, such as coding length, sequencing coverage, and overall mutability. Testing was performed for an enrichment of genes reported in previous de novo studies of autism and schizophrenia, and genes encoding postsynaptic density (PSD) proteins or targets of the fragile X mental retardation protein (FMRP)22. In brief, curated gene lists were retrieved from studies that had summarized genes with de novo nonsense and missense variants in autism (n = 1781), and schizophrenia (n = 670), as well as genes encoding proteins found in the PSD (n = 1398) and the FMRP pathway (n = 795)43,44. Newly curated lists of de novo autism and schizophrenia gene sets were also included in the permutation-based gene set enrichment analysis. The novel autism gene set was downloaded from the database de novo-db (version 1.5) and annotated using the Variant Effect Predictor (VEP) tool45,46. Non-synonymous variants that showed an association with a primary autism phenotype were selected, resulting in a set of 3679 genes. The novel schizophrenia gene set was compiled from seven published whole-exome studies of schizophrenia trios44,47,48,49,50,51,52. All available exonic variants were combined and re-annotated using the VEP tool, which is based on the Gencode v19 genome build46. The novel schizophrenia data set comprised 714 genes with at least one de novo non-synonymous variant.

The permutation-based gene set enrichment analysis was also performed for 55 additional gene sets that had shown association with BD in previous sequencing studies or GWAS9,16,17,20,21,53. Where possible, the original pathway definitions given in the respective studies were used. When this was not possible, current pathway definitions were obtained from the databases Gene Ontology (GO, http://geneontology.org/); Kyoto Encyclopedia of Genes and Genomes (KEGG, https://www.genome.jp/kegg/); and Molecular Signatures Database (MSigDB, software.broadinstitute.org/gsea/msigdb)54,55,56,57,58,59.

Tests were then conducted to determine whether the 368 genes that harbored rare non-synonymous segregating variants in the present cohort were enriched for any of these 61 gene sets. The gene set enrichment analysis was also performed for 139 genes that harbored rare non-synonymous segregating variants, which were predicted to be potentially/probably damaging by all applied prediction tools.

An equal number of genes captured by the present WES study were selected at random and matched with our candidate genes for the following three potentially confounding metrics: cumulative exon length (± 20%); sequence coverage (± 20%); and a gene-specific measure of intolerance to missense variation (ExAC missense constraint z score). A total of 10,000 permutations were performed, and the number of times that randomly selected genes were found in each of the gene sets was counted. To obtain empirical p values, a comparison was made between the observed degree of overlap and gene sets with this null distribution.

The Benjamini & Hochberg method was applied to the resulting 122 p values in order to perform a false discovery rate correction. Adjusted p values of < 0.05 were considered statistically significant.