Two complementary concepts underlie the use of PRSs: the idea that complex traits and behaviors are polygenic, or affected by many genes, and the idea that these outcomes are influenced in part by pleiotropy, in which each gene affects a number of behaviors8,10. This framework and converging evidence from GWASs illustrate that most complex behaviors have an underlying polygenic architecture10, which has prompted a recent rise in the examination of PRSs in relation to psychopathological outcomes. One method for forming PRSs has been to use a theoretically driven or hypothesis-driven approach, selecting SNPs from the literature based on known or assumed associations with the trait or behavior of interest; this theoretically derived approach is similar to the candidate gene approach5. For example, SNPs have been chosen based on the knowledge that they reside in genes that are broadly related to a certain biological function and the related behavior26. For some well-validated genetic effects, this is plausible, and links to biological processes have been established. In other cases, however, the theoretical approach is predicated on theorized or ambiguous relationships between genes and biological processes, with no evidence that SNPs within the proposed genes have any biological function. The theoretical approach also suffers from the limitations associated with candidate gene research, including multiple testing issues, small effects, and a high likelihood of missing meaningful associations27,28.

A second common method for generating PRSs uses a data-driven, hypothesis-free approach to selecting SNPs. Specifically, SNPs found to be associated with the outcome of interest in a previous GWAS are combined into a PRS, often composed of hundreds or thousands of SNPs5. Commonly, multiple PRSs are formed using different significance thresholds for data-driven SNP selection (e.g., p < 0.01, p < 0.05, p < 0.10…p < 0.70), and in some cases all SNPs from a GWAS are included in a score, weighted by effect size. Genetic associations between each PRS and a phenotype are then tested, often in one or more separate replication samples, and the best score is considered to be the one that explains the most variance in the outcome. However, this approach raises concerns about multiple testing and Type I error. Further, forming scores that include data-driven SNPs that are not even nominally associated in the original GWAS (e.g., p > 0.30), or including all SNPs from a GWAS, likely introduces spurious variance. Such a GWAS-based approach may be of little theoretical value and may dilute a true signal. Finally, scores formed using this GWAS-based method lack clear biological relevance. When SNPs are included based only on a broad statistical threshold for association with a phenotype, it is difficult to make connections to a single biological system or process, or even to multiple ones. Indeed, SNPs included in such scores may fall within genic regions yet have no functional relevance at all.
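The thresholding and weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular PRS software: the SNP identifiers, effect sizes, p-values, and the helper names `select_snps` and `polygenic_score` are all hypothetical, and the sketch assumes genotypes are coded as effect-allele dosages (0, 1, or 2) with no LD pruning or clumping.

```python
from typing import Dict, List, NamedTuple


class GwasHit(NamedTuple):
    snp: str        # SNP identifier (hypothetical placeholder names below)
    beta: float     # per-allele effect size from the discovery GWAS
    p_value: float  # association p-value from the discovery GWAS


def select_snps(hits: List[GwasHit], threshold: float) -> List[GwasHit]:
    """Keep SNPs whose discovery p-value is below the chosen threshold."""
    return [h for h in hits if h.p_value < threshold]


def polygenic_score(dosages: Dict[str, float], hits: List[GwasHit]) -> float:
    """Sum of effect-allele dosages, each weighted by its discovery effect size."""
    return sum(h.beta * dosages[h.snp] for h in hits if h.snp in dosages)


# Hypothetical discovery results and one child's genotype dosages.
hits = [
    GwasHit("rs_a", 0.10, 0.001),
    GwasHit("rs_b", -0.05, 0.04),
    GwasHit("rs_c", 0.02, 0.40),
]
child = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

# One score per threshold, as in the multi-threshold approach.
scores = {t: polygenic_score(child, select_snps(hits, t)) for t in (0.01, 0.05, 0.70)}
```

Each additional threshold yields another score to be tested against the phenotype, which is where the multiple-testing concern noted above arises.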

Bioinformatics tools offer the possibility of applying biological relevance to GWAS data in secondary analyses. Many methods for utilizing these tools exist, but one approach increasingly applied to GWAS data is Gene Set Enrichment Analysis (GSEA; 11–13). A gene set is a collection of genes that can represent a biological process (e.g., molecular, cellular, disease), but may also represent gene networks and ontologies. Known gene sets are available from numerous public databases, which specify the genes in each set and the process they represent13,29,30. GSEA is a statistical procedure that provides information about which gene sets a given gene, or multiple genes, belong to and what biological processes they represent, based on information accessible from these public databases. For a detailed walkthrough and recommendations for running GSEA, see Mooney and Wilmot12.

An innovative variation of GSEA is to statistically test whether SNPs within genes are significantly associated with a gene set. The broad steps involved in this type of GSEA vary based on the software used; the following description is based on iGSEA4GWAS v2 (31,32). First, the user typically specifies which gene sets to load from publicly available databases, either all gene sets or only those related to certain databases or processes (e.g., gene ontology). Next, the user provides a list of SNPs and their respective p-values from association tests with a given phenotype, such as those from a relevant discovery GWAS. These SNPs are then mapped to genes based on SNP and gene annotations from an online database (e.g., Ensembl Biomart) within a user-specified range upstream and downstream33. Each gene is ranked based on the number of SNPs in the gene and their respective p-values. These genes (and their ranks) are then compared with the available gene sets to calculate each set’s enrichment score; that is, the proportion of the association between the gene(s) and target gene sets compared with the association between gene(s) outside gene sets12. Finally, permutation tests apply a false-discovery rate (FDR) to correct for multiple testing, gene set size, and overlap in gene sets11,12. These procedures result in a list of SNPs, their respective genes, and the gene set to which each gene was mapped. Thus, GSEA can be applied to a large group of SNPs from a GWAS to filter and derive a smaller group of SNPs that map to a gene set. These SNPs can then be formed into a biologically informed PRS.
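The mapping and filtering steps above can be illustrated with a small Python sketch. It is not the iGSEA4GWAS implementation: the coordinates, gene names, gene sets, and the 20 kb window are hypothetical, the per-gene statistic (the largest −log10 p among mapped SNPs) is an assumed simplification, and the enrichment score and permutation-based FDR are not computed here. The set of significant gene sets is simply passed in as if it came from that step.

```python
import math
from typing import Dict, List, NamedTuple, Set


class Snp(NamedTuple):
    name: str
    chrom: str
    pos: int
    p_value: float


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def map_snps_to_genes(snps: List[Snp], genes: List[Gene],
                      window: int = 20_000) -> Dict[str, List[Snp]]:
    """Assign each SNP to every gene whose span, padded by `window` bp, contains it."""
    mapping: Dict[str, List[Snp]] = {}
    for g in genes:
        for s in snps:
            if s.chrom == g.chrom and g.start - window <= s.pos <= g.end + window:
                mapping.setdefault(g.name, []).append(s)
    return mapping


def gene_statistics(mapping: Dict[str, List[Snp]]) -> Dict[str, float]:
    """Assumed gene-level statistic: the largest -log10(p) among a gene's mapped SNPs."""
    return {g: max(-math.log10(s.p_value) for s in ss) for g, ss in mapping.items()}


def snps_in_gene_sets(mapping: Dict[str, List[Snp]],
                      gene_sets: Dict[str, Set[str]],
                      significant_sets: Set[str]) -> Set[str]:
    """Keep SNPs mapped to genes that belong to at least one significant gene set."""
    kept_genes = set().union(*(gene_sets[name] for name in significant_sets))
    return {s.name for g, ss in mapping.items() if g in kept_genes for s in ss}


# Hypothetical inputs.
snps = [Snp("rs1", "1", 1_000, 0.001), Snp("rs2", "1", 50_000, 0.03),
        Snp("rs3", "2", 500, 0.02)]
genes = [Gene("GENE_A", "1", 5_000, 10_000), Gene("GENE_B", "1", 60_000, 70_000),
         Gene("GENE_C", "2", 100_000, 110_000)]
gene_sets = {"set_x": {"GENE_A", "GENE_C"}, "set_y": {"GENE_B"}}

mapping = map_snps_to_genes(snps, genes)
stats = gene_statistics(mapping)
kept = snps_in_gene_sets(mapping, gene_sets, significant_sets={"set_x"})
```

In this toy example, rs3 maps to no gene and rs2 maps only to a gene outside the significant set, so only rs1 would carry forward into a biologically informed PRS.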

A recent option in some GSEA software further refines the list of SNPs that were successfully mapped to genes and gene sets to those SNPs (or a proxy SNP in LD) that are functional32. Functionality can be conferred by annotation (e.g., a SNP resides in a coding region associated with a protein or RNA product), regulatory regions (e.g., a SNP resides in a region that controls the expression of other coding regions), or eQTL (e.g., a SNP is associated with variation in expression of mRNA or protein). Collectively, GSEA with functional SNP identification can be used to classify SNPs at two levels: (1) those SNPs significantly mapped to a gene set, or (2) those SNPs both significantly mapped to a gene set and noted as functional.

To date, GSEA has been used with GWAS data to identify specific biological processes involved in disease outcomes in the medical literature (e.g., lung function34), with some emerging research on psychopathology outcomes such as ADHD35. However, these emerging studies typically examine single SNPs resulting from GSEA or the effect of single pathways. To our knowledge, no study has created PRSs composed of functional SNPs resulting from GSEA. Thus, the current study is the first to create biologically informed PRSs for psychopathological outcomes, by first filtering meta-GWAS data for child aggression through GSEA, then forming PRSs from SNPs that significantly mapped to gene sets and from the subset of those SNPs with a known biological function.

An additional strength of the present study is that the discovery meta-GWASs examined associations with child aggression separately in early and middle childhood18, allowing us to create separate PRSs targeted to each of the two developmental periods. As previously mentioned, most studies rely on GWASs in adult samples, and the resulting PRSs are tested in child samples, which is problematic given that genetic effects can vary with development. We tested the current PRSs in a replication sample of children that developmentally aligns with the periods represented in the discovery meta-GWAS. In addition, we tested genetic associations with childhood aggression using the most common measure used in the discovery meta-GWAS (18; parent report of aggression on the Child Behavior Checklist).

Greater alignment of sample characteristics and specificity in the developmental period and phenotypes can help to uncover more precise genetic associations. To address differences in genetic association across development, we considered these associations using a time-varying effect model (TVEM; 36) to explicitly model change in the association between PRSs and aggression across childhood.
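For orientation, a time-varying effect model of this kind can be written in a generic form. The notation below is ours rather than the study's, it shows a single PRS predictor, and it omits any covariates the fitted models may have included.

```latex
\[
  \mathrm{AGG}_{ij} = \beta_0(t_{ij}) + \beta_1(t_{ij})\,\mathrm{PRS}_i + \varepsilon_{ij}
\]
```

Here $\mathrm{AGG}_{ij}$ is the aggression score of child $i$ at measurement occasion $j$, $t_{ij}$ is the child's age at that occasion, $\beta_0(t)$ is an intercept function, $\beta_1(t)$ is the association between the PRS and aggression expressed as a smooth function of age, and $\varepsilon_{ij}$ is the error term. Because $\beta_1(t)$ is estimated as a function rather than a single coefficient, the strength of the genetic association is allowed to change across childhood.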

Based on the discovery meta-GWASs18, we created a total of six polygenic risk scores, all at the p < 0.05 threshold. For both developmental periods (early and middle childhood), we used meta-GWAS data to create three PRSs: one PRS formed from all SNPs at p < 0.05, one PRS formed from SNPs that significantly mapped to gene sets using GSEA at p < 0.05, and finally, one PRS formed from the subset of SNPs that both significantly mapped to gene sets and had a known biological function at p < 0.05. We chose p < 0.05 as a relatively stringent threshold for two reasons. First, it includes a smaller number of markers that represent statistically significant associations. Second, a more stringent threshold excludes more chance associations, i.e., SNPs that may be spuriously associated with aggression in the original meta-GWAS. As stated in a recent article, selecting the optimal p-value threshold is “analogous to a tuning parameter that balances a signal and noise tradeoff. This tradeoff arises because more significant p-value thresholds have higher proportions of causal variants”37. This is in line with our current approach to identifying functional variants, which is optimized by a more stringent statistical threshold.
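The three nested SNP sets described above can be sketched as successive filters. This is an illustrative sketch only: the record fields `mapped` and `functional` are assumed flags standing in for GSEA output, the SNP names and values are hypothetical, and the function would be run once per developmental period to give six SNP sets in total.

```python
from typing import Dict, NamedTuple, Set


class SnpRecord(NamedTuple):
    p_value: float     # p-value from the discovery meta-GWAS
    mapped: bool       # assumed flag: SNP mapped to a significant gene set
    functional: bool   # assumed flag: SNP (or LD proxy) annotated as functional


def build_snp_sets(records: Dict[str, SnpRecord],
                   threshold: float = 0.05) -> Dict[str, Set[str]]:
    """Return the nested SNP sets: all SNPs under threshold, mapped, and functional."""
    all_snps = {name for name, r in records.items() if r.p_value < threshold}
    mapped = {name for name in all_snps if records[name].mapped}
    functional = {name for name in mapped if records[name].functional}
    return {"all": all_snps, "mapped": mapped, "functional": functional}


# Hypothetical records for one developmental period.
early = {
    "rs_a": SnpRecord(0.001, True, True),
    "rs_b": SnpRecord(0.020, True, False),
    "rs_c": SnpRecord(0.040, False, False),
    "rs_d": SnpRecord(0.300, True, True),   # excluded: p >= 0.05
}
early_sets = build_snp_sets(early)
```

By construction, the functional set is a subset of the mapped set, which is in turn a subset of all SNPs under the threshold.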

To test the utility of these scores, we examined the association of the three early childhood PRSs with aggressive behavior from 2 to 5 years old in a time-varying effect model. We separately tested a similar model in middle childhood in which we examined the association of the three middle childhood PRSs with aggressive behavior from 7.5 to 10.5 years old in a time-varying effect model. The replication sample was drawn from a longitudinal study of child development in which children were randomly assigned to a family-based intervention condition38.

We hypothesized that the PRS formed from all SNPs at p < 0.05 would not be associated with aggressive behavior, given that we were not looking to maximize variance explained at multiple thresholds but rather chose a stringent threshold a priori. Whereas less stringent criteria may explain greater variance in a phenotype, they also likely include SNPs that are spuriously associated in the original meta-GWAS or those that have less biological relevance. Further, we hypothesized that the PRS composed of mapped SNPs would show small, albeit significant, associations with aggression and that the PRS composed of functional SNPs would show the most robust associations with child aggression. Finally, given that PRSs were created for unique developmental periods, we expected this pattern of findings to be replicated within both the early and middle childhood models, but expected no associations between PRSs and aggression across developmental periods.