Ethics statement

We have complied with all relevant ethical regulations. All study participants provided written, informed consent for use of their data in genetic studies of epilepsy. For minors, written informed consent was obtained from their parents or legal guardian. Local institutional review boards approved study protocols at each contributing site.

Cohorts and phenotype definition

A list of the sites included in this study is described in Supplementary Table 6. We classified seizures and epilepsy syndromes according to the classification and terminology outlined by the ILAE15,54. For all cases, epilepsy specialists assessed each phenotype at the contributing site. Individuals with epilepsy were initially assigned to one of three phenotypic categories: genetic generalized epilepsy, focal epilepsy, or unclassified epilepsy. Based on EEG, MRI and clinical histories we further classified our cohort into the epilepsy subtypes listed in Supplementary Table 1. We used a combination of population-based datasets as controls. Some control cohorts were screened by questionnaire for neurological disorders. 53.4% of cases were female compared to 51.6% of controls.

Study design

We conducted a case-control study in subjects of Caucasian, Asian (Han Chinese) and African-American ethnicities. Our primary analyses were structured to test common genetic variants for association with epilepsy according to broad epilepsy phenotypes. We pooled cases from cohorts of the same ethnic group to perform linear mixed model analysis, followed by subsequent meta-analyses of regression coefficients across the three ethnic groups. Our secondary analyses tested for associations with specific syndromes of genetic generalized epilepsy (childhood absence epilepsy, juvenile absence epilepsy, juvenile myoclonic epilepsy, and generalized tonic-clonic seizures alone) and phenotypes of focal epilepsy (lesion negative, focal epilepsy with hippocampal sclerosis, and focal epilepsy with other lesions). The secondary analyses were limited to Caucasian subjects due to sample size. We prioritized the results of the GWAS by incorporating eQTL information, transcriptome-wide analysis, and biological annotation. Finally, we estimated the genetic correlation of epilepsy phenotypes using Linkage-Disequilibrium Adjusted Kinships (LDAK).

Genotyping

The EpiPGX samples were genotyped at deCODE Genetics on Illumina OmniExpress-12 v1.1 and OmniExpress-24 v1.1 single nucleotide polymorphism (SNP) arrays. The EPGP samples were genotyped on Illumina HumanCore beadchips at Duke University, North Carolina. The remainder of the samples were genotyped on various SNP arrays, as previously published15.

Genotyping quality control and imputation

Quality control of genotyping was performed separately for each cohort using PLINK 1.955. Each genotype cohort was temporarily merged with a genetically similar reference population from the 1000 Genomes Project (CEU, CHB, or YRI). A test for Hardy–Weinberg equilibrium (HWE) was performed and SNPs significant at p < 1 × 10−10 were removed. All samples and all SNPs with missing genotype rate >0.05 and all SNPs with minor allele frequency (MAF) <0.01 were removed. Next, we pruned SNPs using the PLINK --indep-pairwise command (settings: window size 100 kb, step size 25, R2 > 0.1). Using this subset of SNPs, we removed samples with outlying heterozygosity values (>5 SD from the median of the whole cohort). Identity by descent (IBD) was calculated in PLINK to remove sample duplicates (>0.9 IBD) and to identify cryptic relatedness. We removed one from each sample pair with IBD>0.1875, with the exception of the EPGP familial epilepsy cohort. Subjects were removed if sex determined from X-chromosome genotype did not match reported gender. Array-specific maps were used to update all SNPs positions and chromosome numbers to the Genome Reference Consortium Human Build 37 (GRCh37), and remove all A/T and C/G SNPs to avoid strand issues. We applied pre-imputation checks according to scripts available on the website of Will Rayner of the Wellcome Trust Centre for Human Genetics (www.well.ox.ac.uk/~wrayner/tools/) to remove SNPs with allele frequencies deviating >20% from the frequency in the Haplotype Reference Consortium. Samples were submitted to the Sanger Imputation Service (https://imputation.sanger.ac.uk/)56. We selected the Human Reference Consortium (release 1.1; n = 32470) dataset as reference panel for Caucasian and Asian datasets and the African Genome Resources (n = 4956) for the African-American datasets. Post-imputation quality control filters were applied to remove SNPs within each imputed cohort with an imputation info score <0.9 or HWE p<1e-6. Imputed genotype dosages with a minimum certainty of 0.9 per subject were converted to hard-coded PLINK format after which all samples were pooled into a single cohort. We performed a principal components analysis using GCTA. From the PCA results we stratified our subjects into three broad ethnic groups (Caucasian, Asian and African) while removing extreme outliers. After stratifying by ethnicity, we removed SNPs with HWE p < 1e-6, call rate <0.95 or MAF<0.01. In total 816 subjects out of 45705 subjects were filtered out by quality control procedures, leaving 44889 subjects for analyses.

Study power

We estimated using PGA57 that the study had 80% power to detect a genetic predictor of relative risk for all epilepsy (approximated to odds ratio) ≥1.45 with MAF = 1% and an alpha level of 5 × 10−8. We estimated that our meta-analyses had 80% power to detect genome-wide significant SNPs of MAF = 1% with relative risks ≥1.5 and ≥1.8, for focal and generalized epilepsy respectively (see Supplementary Figure 15). Our analysis of generalized epilepsy sub-phenotypes had 80% power to detect genome-wide significant SNPs of MAF = 1% with relative risks ≥2.6, ≥3.3, and ≥2.4 for CAE, JAE, and JME respectively. Our analysis of focal epilepsy sub-phenotypes had 80% power to detect genome-wide significant SNPs of MAF = 1% with relative risks ≥1.9, ≥2.8, and ≥1.9 for focal epilepsy lesion negative, focal epilepsy with hippocampal sclerosis and focal epilepsy with lesion other than hippocampal sclerosis, respectively.

Statistical analyses

Association analyses were conducted within the three ethnic subgroups using a linear mixed model in BOLT-LMM58. A subset of SNPs, used to correct for (cryptic) relatedness and population stratification by BOLT-LMM, were derived by applying SNP imputation info score >0.99, MAF >0.01, call rate >0.99 before pruning the remaining variants using LDAK with a window size of 1 Mb and R2 > 0.243. All analyses included gender as a covariate and the threshold for statistical significance was set at 5 × 10−8. We compared χ2 values of the BOLT-LMM output between all pairs of SNPs in high LD (R2 > 0.4) and removed pairs of SNPs with extreme χ2 differences using a formula that scales exponentially with magnitude of χ2 and LD: χ2 difference cutoff = \(\frac{{3^\ast \sqrt {\frac{{{\mathrm{SNP}}1 - {{\chi }}2 + {\mathrm{SNP}}2 - {{\chi }}2}}{2}} }}{{(R^2)^2}}\); where SNP1- χ2 and SNP2− χ2 are the χ2-statistic of the two SNPs in each pair and R2 is their squared correlation (LD). We tested the homogeneity of all SNPs by splitting the pooled cohort into 13 distinct clusters of ethnically matched cases and controls and removed SNPs exhibiting significant heterogeneity of effect (P het < 1 × 10−8). Fixed effects, trans-ethnic meta-analyses were conducted using the software package METAL59. Manhattan plots for all analyses were created using qqman. Considering that our study had unequal case-control ratios, we calculated the effective sample size per ethnicity using the formula recommended by METAL: N eff = 4/(1/N cases + 1/N ctrls ). Since >95% of all cases were Caucasian, we included all SNPs that were present in at least the Caucasian dataset (~5 million).

Conditional association analysis was performed with PLINK on loci containing significant SNPs to establish whether other genetic variants in the region (500 Kb upstream and downstream) were independently associated with the same phenotype. The conditional threshold for significance was set at 2 × 10−5, based on approximately 2500 imputed variants per 1MB region.

Assessment of inflation of the test statistic

Potential inflation of the test statistic was assessed per ethnicity and phenotype by calculating the genomic inflation factor (λ; the ratio of the median of the empirically observed distribution of the test statistic to the expected median) and the mean χ2. Since λ is known to scale with sample size, we also calculated the λ 1000 , i.e λ corrected for an equivalent sample size of 1000 cases and 1000 controls60. We observed some inflation of the test statistic (λ > 1) across the different phenotypes (Supplementary Table 7), suggesting either polygenicity or confounding due to population stratification or cryptic relatedness. Therefore, we applied LD score regression61, estimating LD scores using matched populations from the 1000 GP (EUR for Caucasians (n = 669), AFR for African-Americans and EAS for Asians). These LDSC results suggested that inflation of the test statistic was primarily due to polygenicity for most analyses (Supplementary Table 7). Only the Caucasian focal and all epilepsy analyses had LDSC intercepts >1.1, suggesting confounding or an incomplete match of the LD-score reference panel. Our focal and all epilepsy analyses included cohorts from various Caucasian ethnicities, including Finnish and Italian focal epilepsy cohorts, and it has been shown that LD differs considerably between Finnish and Italian populations61. Therefore, we consider an incomplete match of the LD-score reference panel the most likely cause of the observed inflation, since we used a mixed-model analysis that corrects for population stratification and cryptic relatedness58.

Gene mapping and biological prioritization

Genome-wide significant loci of all phenotypes were mapped to genes in and around these loci using FUMA35. Genome-wide significant loci were defined as the region encompassing all SNPs with P < 10-4 that were in LD (R2 > 0.2) with the lead SNP (i.e. the SNP with the lowest P-value in the locus with P < 5 × 10−8). Positional mapping was performed to map genes that were located within 250 kb of these loci. Additionally, we mapped genes that were farther than 250 kb away from the locus using chromatin interaction data to identify genes that show a significant 3D interaction (P FDR < 10−6) between their promoter and the locus, based on Hi-C data from dorsolateral prefrontal cortex, hippocampus, and neural progenitor cells62. This resulted in a total of 146 mapped genes across all phenotypes, of which some genes (e.g. SCN1A) were associated with multiple epilepsy phenotypes.

We next devised various prioritization criteria to prioritize the most likely biological candidate genes out of the 146 mapped genes, similar to previous studies17,18,63, based on the following 6 criteria:

1. A significant correlation between the epilepsy phenotype and expression of the gene, as assessed with a transcriptome-wide association study (TWAS). Default settings of the FUSION software package64 were used to impute gene-expression based on our GWAS summary statistics and RNA-sequencing data from dorsolateral prefrontal cortex tissue (n = 452, CommonMind Consortium65), after which the association between the epilepsy phenotype with gene-expression was calculated. It was possible to test the TWAS expression association for 53 out of our 146 mapped genes, since only the expression of these 53 genes was significantly heritable (heritability p-value <0.01, as suggested by Gusev et al.64). We set a Bonferroni corrected p-value threshold of 0.05/53 = 0.00094 to define significant TWAS associations. 2. Genes for which a SNP in the genome-wide significant locus (as defined above) is a significant cis-eQTL (Bonferroni corrected P < 8 × 10−10)23 based on data from the ROS and MAP studies, which includes RNA-sequencing data from 494 dorsolateral prefrontal cortex tissues23. 3. The gene is preferentially expressed in the brain. This was assessed by using gene-expression data from all 53 tissues of the Gene-Tissue expression (GTEx) Consortium66. Genes were considered to be preferentially expressed in the brain when the average expression in all brain tissues was higher than the average expression in non-brain tissues. 4. Genes for which a SNP in the genome-wide significant locus (as defined above) is a missense variant, as annotated by ENSEMBL67. 5. Genes prioritized by protein-protein interaction network, as calculated using the default settings of DAPPLE68, which utilizes protein–protein interaction data from the InWeb database69. The 146 genes implicated by our GWAS were input after which DAPPLE assessed direct and indirect physical interactions to create a protein-interaction network. Next, DAPPLE assigned a significance score to each gene according to several connectivity parameters; genes with a corrected P < 0.05 were considered to be prioritized by DAPPLE. 6. Genes for which a nervous system or behavior/neurological phenotype was observed in knockout mice. Phenotype data of knockout mice was downloaded from the Mouse Genome Informatics database (http://www.informatics.jax.org/) on 17 January 2018 and nervous system (phenotype ID: MP:0003631) and behavior/neurological phenotype (MP:0005386) data were extracted.

All 146 genes were scored based on the number of criteria met (range 0–6 with an equal weight of 1 per criterion), see Supplementary Data 1 for a full list. We considered the gene(s) with the highest score in each locus as the most likely biological epilepsy candidate gene. Multiple genes in a locus were selected if they had an equally high score whilst no genes were selected in a locus if all genes within it had a score <2, similar to previous studies17,18.

Gene co-expression analysis for epilepsy with brain-coX

In silico gene prioritization was performed using brain-coX25. brain-coX uses a compendium of seven large-scale normal brain gene expression data resources to identify co-expressed genes with a set of given genes (known, or putative, disease causing genes) likely to encapsulate gene expression networks involved in disease. This approach can identify, and thus leverage networks that are not currently known and not present in available resources such as PPI networks and is a complementary approach to these. We used a set 102 monogenic epilepsy genes (Supplementary Table 8) as the set of known disease genes. An FDR of 0.2 was used to identify genes that significantly co-express with the known set of genes. Prioritization in at least three datasets at an FDR of 0.2 led to a specificity of 0.925.

In the first analysis we used a set of 16 candidate epilepsy genes identified by the GWAS analysis and prioritized using additional methods (Fig. 2). These excluded any genes already included in the set of known epilepsy genes (Supplementary Table 8). Supplementary Fig. 9 shows the gene co-expression pattern using the weighted average gene co-expression across all seven datasets for candidate genes from the GWAS that show significant gene co-expression with any of the 102 known epilepsy genes.

In the second analysis we used the set of all the 146 candidate genes identified in the GWAS analysis (Supplementary Data 1). Only 140 of these were identified as having available expression data in the gene expression resources. Many genes showed some evidence of gene co-expression but few showed co-expression in more than 2 datasets (18 out of 140). BCL11A (6) and GJA (6) remain the most robust candidate genes co-expressed with known epilepsy genes. The complete results are shown in Supplementary Table 9.

Functional annotations

We annotated all genome-wide significant SNPs (p < 5 × 10−8) from all phenotypes using the Variant Effect Predictor of ENSEMBL67 and the RegulomeDB database28. We annotated chromatin states using epigenetic data from the NIH Roadmap Epigenomics Mapping Consortium70 and ENCODE71. We used FUMA35 to annotate the minimum chromatin state (i.e. the most active state) across 127 tissues and cell types for each SNP, similar to a previous study27.

Heritability enrichment of epigenetic markers and gene-expression

We used stratified LD-score regression72 to assess tissue-specific heritability enrichment of epigenetic markers in 88 tissues, using standard procedures29. We used the same settings and pre-calculated weights that accompanied the paper by Finucane et al. to calculate the heritability enrichment of all epilepsy, focal epilepsy and generalized epilepsy, based on epigenetic data of 6 chromatin markers in 88 tissues from the Roadmap Consortium and gene-expression data in 53 tissues from the GTEx Consortium.

Enrichment analyses

Hypergeometric tests were performed with R (version 3.4.0) to assess whether the genes mapped to genome-wide significant loci and the subset of prioritized biological epilepsy genes (see above) were enriched for: (i) known monogenic epilepsy genes (n = 102) and (ii) known anti-epileptic drug target genes (n = 64), relative to the rest of the protein-coding genes in the genome (n = 19180). We supplemented the list of 43 known dominant epilepsy genes36 with an additional 59 monogenic epilepsy genes from the GeneDX comprehensive epilepsy panel (www.genedx.com). We compiled the list of drug target genes from73, supplemented with additional FDA & EMA licensed AEDs. The full list of gene targets considered in each analysis are listed in Supplementary Tables 8 and 10.

Enrichment analyses corrected for gene size

Brain expressed genes are known to be larger in size than non-brain expressed genes. To assess whether gene size could be a cause of bias for our enrichment analyses, we first assessed whether the size of the genes mapped in our analyses was different than non-mapped genes in the genome. We found that the size of the 146 genes mapped to genome-wide significant loci was 65.6 kb, whereas the average gene size of all other protein-coding genes is on average 62.2 kb, suggesting there is no strong bias towards preferentially mapping loci to small or large genes.

We also observed that the 102 established monogenic epilepsy genes are on average 2.44 times longer than non-epilepsy genes (152.0 kb vs 62.2 kb). As a conservative approach to correct for this size difference, we have used the Wallenius’ noncentral hypergeometric distribution, as implemented in the R-package ‘BiasedUrn’. Using this distribution, we repeated our hypergeometric analyses under the conservative assumption of a 2.42 times increased likelihood of mapping epilepsy genes as opposed to non-epilepsy genes. Using this distribution, the 146 genes that were mapped to genome-wide significant loci were significantly enriched for monogenic epilepsy genes (Wallenius’ noncentral hypergeometric test p = 8.3×10−3). When limiting our results to the 21 biological prioritized genes, the enrichment of monogenic epilepsy genes became more significant (Wallenius’ noncentral hypergeometric distribution p = 5.3×10−4).

Similarly, we observed that the targets of AEDs are on average 2.43 times longer than non-AED target genes (151.8 kb vs 62.4 kb). When correcting for this gene-size difference under the assumption of a 2.43 times increased likelihood of mapping our genome-wide significant loci to AED target genes, we find that the 146 mapped genes were significantly enriched for AED target genes (Wallenius’ noncentral hypergeometric test p = 1.7×10−5). When limiting our results to the 21 biological prioritized genes, the enrichment of AED target genes became more significant (Wallenius’ noncentral hypergeometric test p = 1.0×10−8).

Connectivity mapping

Connectivity mapping was performed using our GWAS results in order to identify drugs which can potentially be repurposed for the treatment of epilepsy, enabling significant savings in the time and cost of antiepileptic drug development. Recently, So et al. identified candidate drugs that could be repurposed for the treatment of schizophrenia by using GWAS results to impute the gene-expression changes associated with the disease and, then, identifying drugs that change gene-expression in the opposite direction in cell lines38. Interestingly, the set of candidate drugs they identified was significantly enriched with antipsychotics. We adopted a similar strategy.

Gene-expression changes associated with epilepsy were imputed from the all epilepsy GWAS summary statistics using the FUSION software package64 and dorsolateral prefrontal cortex tissue RNA-sequencing data (n = 452, CommonMind Consortium65). We calculated z-scores for the association between epilepsy and changes in expression of all 5261 significantly heritable genes, using default settings of the FUSION software package as described above64. The top 10% of the gene-expression changes most strongly associated with epilepsy were used to construct the disease signature. Then, we identified drugs that change gene-expression in the opposite direction in cell lines, using the Combination Connectivity Mapping bioconductor package and the Library of Integrated Network-Based Cellular Signatures (LINCS) data74. This package utilizes cosine distance as the (dis)similarity metric75,76. A higher (more negative) cosine distance value indicates that the drug induces gene-expression changes more strongly opposed to those associated with the disease. A lower (more positive) cosine distance value indicates that the drug induces gene-expression changes more similar to those associated with the disease. In the LINCS dataset, some drugs have been profiled in more than one cell line, concentration, and time-point. For such drugs, the highest absolute cosine distance, whether positive or negative, was selected, as this value is less likely to occur by chance. The output of this analysis comprised 24,051 drugs or ‘perturbagens’, each with a unique cosine distance value.

To demarcate the set of drugs predicted to significantly reverse epilepsy-associated gene-expression changes, the threshold of statistical significance for cosine distance values was determined. For this, we performed 100 permutations of the disease gene-expression z-scores and compared them to drug gene-expression signatures. We combined the distribution of cosine distance values across all permutations, such that the null distribution was derived from 2,405,100 cosine distance values under H 0 . The cosine distance value corresponding to α of 0.05 was −0.386. Of the drugs with a cosine distance less than −0.386, thirty were experimentally-validated drug repurposing candidates from the Prescribable Drugs with Efficacy in Experimental Epilepsies (PDE3) database—a recently published systematic and comprehensive compilation of licenced drugs with evidence of antiepileptic efficacy in animal models77. We determined whether this is more than expected by chance, by creating 1,000,000 random drug-sets of the same size as drugs with a significant cosine distance. Next, we counted the number of subsets containing an equal or higher number of experimentally-validated drug repurposing candidates, as those found within drugs with a significant cosine distance. This permutation-based p-value was 1.0 × 10−6.

Supplementary Table 4 lists the 30 candidate re-purposable drugs that are predicted to significantly reverse epilepsy-associated gene-expression changes, have published evidence of antiepileptic efficacy in animal models, and are already licensed for the treatment of other human diseases. Of this list, 22 drugs have corroborated evidence of antiepileptic efficacy from multiple published studies or multiple animal models. For each drug, we list the studies demonstrating antiepileptic efficacy in animal models, the animal models used, and the licensed indication(s).

Validation of connectivity mapping results

Validation of the connectivity mapping results was performed using two non-overlapping sets of drugs with known antiepileptic efficacy: (1) a set of ‘clinically-effective’ drugs that have antiepileptic efficacy in people, and (2) a set of ‘experimentally-validated’ drugs that have antiepileptic efficacy in animal models. For the clinically-effective drug-set, we used the names of all recognized antiepileptic drugs, as listed in category N03A of the World Health Organization (WHO) Anatomical Therapeutic Chemical (ATC) Classification System, and of benzodiazepines and their derivatives (ATC codes N05BA and N05CD), and of barbiturates (ATC code N05CA), as these drugs are known to have antiepileptic efficacy in people. For the experimentally-validated drug-set, we extracted drug names from the PDE3 database77.

We determined whether, in our results, clinically effective drugs are ranked higher than expected by chance. The median rank of all drugs was 12,026. The median rank of clinically effective drugs was 3725. Hence, the median rank of clinically-effective drugs was 8301 positions higher than that of all drugs. A permutation-based p-value was determined by calculating the median ranks of 1,000,000 random drug-sets, each equal in size to the number of clinically effective drugs in the LINCS database. This permutation-based p-value was <1.0 × 10−6. Similarly, we determined whether, in our results, experimentally-validated drugs are ranked higher than expected by chance. The median rank of experimentally-validated drugs was 6372. Hence, the median rank of experimentally-validated drugs was 5654 positions higher than that of all drugs. A permutation-based p-value was determined by calculating the median ranks of 1,000,000 random drug-sets, each equal in size to the number of experimentally-validated drug repurposing candidates in the LINCS database. This permutation-based p-value was <1.0 × 10−6.

Heritability analysis

Linkage-Disequilibrium Adjusted Kinships (LDAK42,43) was used to calculate SNP-based heritability of all epilepsy phenotypes. Since these analyses require homogeneous cohorts, only Caucasian subjects (which represent >95% of epilepsy cases) were used for these analyses. SNP based heritabilities (\(h_o^2\)) were converted to liability scale heritability estimates (\(h_L^2\)) using the formula:8 \(h_L^2 = h_o^2 \ast K^2(1 - K)^2/p(1 - p) \ast Z^2\), where K is the disease prevalence, p is the proportion of cases in the sample, and Z is the standard normal density at the liability threshold. We estimated disease prevalence based on a combination of previous studies8,78,79 (Supplementary Table 11). Although prevalence estimates vary between studies, the \(h_L^2\) estimate has been shown to be fairly robust to such differences8. Similarly, we have modeled \(h_L^2\) using half and double of our prevalence estimates which lead to \(h_L^2\) estimates that varied between 0.4 and 11% (Supplementary Table 11). In addition, we compared the heritability estimates from LDAK with the alternative methods BOLT-REML80 and LDSC58 (Supplementary Table 12). Next, LDAK was used to calculate the genetic correlation between the 7 epilepsy subtypes. Subjects with a diagnosis of both CAE and JAE were excluded from heritability and genetic correlation analyses.

We computed the genetic correlation between all, focal and genetic generalized epilepsy with other brain diseases and traits using LDSC, as implemented in LD hub81. LD hub is a centralized database that contains publicly available GWAS summary statistics from various diseases and traits. We selected published GWAS of psychiatric, neurological, auto-immune diseases with known brain involvement and cognitive/behavioral traits from LD hub. We contacted the authors of published GWAS to provide us with summary statistics when no summary statistics were available on LDhub or when a more recent GWAS of a disease/trait was published that was not included on LDhub. The Caucasian subset of our data was used for all analyses and only other GWAS with primarily Caucasian subjects were included in our analyses. We used a conservative Bonferroni correction to assess significance of genetic correlations (p = 0.05/48 = 0.001).

Multi-trait analysis of GWAS (MTAG)46 was used with default settings to increase the effective sample size from our Caucasian all and generalized epilepsy GWAS by pairing it with the significantly correlated GWAS on cognitive ability (as assessed above) with a larger sample size (n=78,307). MTAG utilizes the fact that estimations of effect size and standard error of a primary GWAS, in this case epilepsy, can be improved by matching them to a genetically correlated secondary GWAS, in this case cognitive ability.