Significance Health in later life and longevity vary substantially across sociodemographic groups, but the biological mechanisms of these disparities remain poorly understood. We conducted a transcriptome profiling study of inflammatory and antiviral gene activity in a large, nationally representative and ethnically diverse sample of young adults and found that sociodemographic variations in the activity of these molecular pathways emerge by young adulthood—well before they manifest as late-life chronic illness. Inflammation related to biobehavioral factors (BMI, smoking), interferons related to individual characteristics (sex, race/ethnicity), and transcription factor and immune-cell activation showed additional links to social context (family poverty, geographic region). These data suggest that interventions early in life may address the predisease physiological disparities that manifest as late-life health disparities.

Abstract Health in later life varies significantly by individual demographic characteristics such as age, sex, and race/ethnicity, as well as by social factors including socioeconomic status and geographic region. This study examined whether sociodemographic variations in the immune and inflammatory molecular underpinnings of chronic disease might emerge decades earlier in young adulthood. Using data from 1,069 young adults from the National Longitudinal Study of Adolescent to Adult Health (Add Health)—the largest nationally representative and ethnically diverse sample with peripheral blood transcriptome profiles—we analyzed variation in the expression of genes involved in inflammation and type I interferon (IFN) response as a function of individual demographic factors, sociodemographic conditions, and biobehavioral factors (smoking, drinking, and body mass index). Differential gene expression was most pronounced by sex, race/ethnicity, and body mass index (BMI), but transcriptome correlates were identified for every demographic dimension analyzed. Inflammation-related gene expression showed the most pronounced variation as a function of biobehavioral factors (BMI and smoking) whereas type I IFN-related transcripts varied most strongly as a function of individual demographic characteristics (sex and race/ethnicity). Bioinformatic analyses of transcription factor and immune-cell activation based on transcriptome-wide empirical differences identified additional effects of family poverty and geographic region. These results identify pervasive sociodemographic differences in immune-cell gene regulation that emerge by young adulthood and may help explain social disparities in the development of chronic illness and premature mortality at older ages.

Most chronic illnesses show marked demographic variations in prevalence and outcome, including cardiovascular (1), neoplastic (2), metabolic (3), and neurodegenerative diseases (4). These demographic disparities become increasingly pronounced in mid to later adulthood (5, 6), resulting in shorter life spans for men relative to women, for blacks and Hispanics relative to Asians and non-Hispanic whites, for the poor relative to the affluent, and for residents of the southern United States compared to other regions (7–9). However, the biological underpinnings of these late-life health disparities may emerge decades earlier in adolescence and young adulthood (9–15), well before such morbidities are commonly diagnosed. Most chronic diseases develop over the course of many years and are driven in part by the activity of disease-promoting molecular pathways involved in inflammation, metabolism, and immune function (16). Measurement of gene expression can provide insight into the molecular processes that underlie these sociodemographic gradients in health. However, little is known about sociodemographic variation in the molecular precursors of disease because population health studies have rarely surveyed the molecular characteristics of adolescents or young adults. Here we report results from a transcriptome profiling analysis of a large, nationally representative and ethnically diverse sample of young adults and find significant demographic variation in the molecular antecedents of chronic disease decades before those diseases typically manifest in late adulthood.

To determine whether demographic variations in gene regulation during young adulthood might contribute to social gradients in late-life disease risk, this study analyzed genome-wide transcriptional profiles in blood samples from a nationally representative sample of 1,069 young adults (mean age 37) participating in the National Longitudinal Study of Adolescent to Adult Health (Add Health) (17). Add Health is the largest, most comprehensive longitudinal study of adolescents ever undertaken, with national representation of all race, ethnic, immigrant, socioeconomic status, and geographic subgroups in the United States. Add Health used probability population-representative sampling to enroll a nationwide sample of adolescents (grades 7 to 12) in 1994 to 1995 and has followed that cohort longitudinally since then (17). We analyzed gene expression profiles in whole-blood samples collected ∼22 y later during young adulthood to assess transcriptome variation as a function of individual demographic characteristics (age, sex, race/ethnicity), sociodemographic conditions (family poverty status, geographic region), and biobehavioral factors that might be confounded with demographic characteristics [smoking, alcohol consumption, and body mass index (BMI)]. Our initial analysis focused on quantifying variations in health-relevant gene expression among young adults as a function of fundamental demographic, social, and behavioral factors known to define disparities in chronic disease. In addition to clarifying the molecular origins of late-life health disparities, this analysis provides an essential platform for more detailed analyses of specific risk factors in adolescence and young adulthood, as well as methodological guidance to avoid the risk of sociodemographic confounding in future genomic research.

Our analyses focused on two molecular pathways involved in the pathogenesis of multiple chronic diseases (16): 1) genes involved in inflammation and 2) genes involved in type I interferon (IFN) responses. These two gene sets represent functionally distinct immunoregulatory programs (18, 19) and were selected for analysis based on their well-established relationship to chronic disease and longevity, both as empirical predictors (16, 20–27) and as molecular mechanisms of disease (16, 28–33). Both gene sets are subject to physiological regulation by tissue injury and microbial stimuli as well as by the neural and endocrine systems (34).

Neural/endocrine regulation of gene expression has been hypothesized to constitute one pathway through which social environmental conditions might contribute to health disparities, for example, through stress-induced activation of the Conserved Transcriptional Response to Adversity (CTRA) RNA profile that involves up-regulation of inflammatory genes and a reciprocal down-regulation of type I IFN genes in the circulating leukocyte pool (35–37). Basic laboratory research has found the CTRA transcriptome shift to be mediated in part by sympathetic nervous system (SNS)-induced increases in hematopoietic output of myeloid lineage immune cells, namely monocytes, dendritic cells, and neutrophil granulocytes (38–40).

In addition to examining basic sociodemographic variations in inflammatory and type I IFN gene modules due to their established relevance for chronic disease, we also conducted analyses testing whether the more specific CTRA pattern (i.e., inflammation − IFN) and related neuroendocrine and cellular mechanisms might contribute to such demographic variations. As such, the present analysis quantified demographic variation in young adult blood-cell gene expression profiles using three complementary analytic approaches corresponding to three distinct levels of biological influence on gene expression (41): 1) analyzing expression of a-priori–defined sets of inflammatory and IFN indicator genes used in previous research (level 1) (42); 2) analyzing genome-wide empirical differences in RNA expression in terms of their coregulation by transcription factors involved in inflammatory, type I IFN, SNS, and neuroendocrine response (level 2) (34, 36); and 3) analyzing genome-wide empirical differences in RNA expression in terms of their coexpression in specific immune-cell subsets involved in inflammatory and IFN gene expression (particularly monocytes, dendritic cells, and neutrophils) (level 3) (38, 39, 43).

Methods Sample and Survey Procedures. Data come from Add Health, a nationally representative study of US adolescents in grades 7 to 12 in 1994 to 1995 who have been followed into adulthood over five waves of data collection. We used data from sample 1 of Wave V (2016 to 2017), collected when respondents were aged 32 to 42. Study design, interview procedures, and demographic and biobehavioral assessments have been previously described (13, 17). Participants provided written informed consent and all procedures were approved by the University of North Carolina School of Public Health Institutional Review Board. Details on measurement and coding are provided in SI Appendix.

Blood Transcriptome Profiling. Venipuncture whole-blood samples were assayed by RNA sequencing using a 3′ messenger RNA counting assay (Lexogen QuantSeq 3′ FWD) on an Illumina HiSeq 4000 system following the manufacturers’ standard protocols. The 65-base single-strand reads were mapped to the ENSEMBL hg38 human transcriptome to estimate gene-level transcript abundance using STAR. Transcript abundance values were normalized using 11 reference genes (64) and analyzed by linear statistical models relating log2 transcript abundance to individual demographic characteristics (age, sex, race/ethnicity), sociodemographic contextual characteristics (US region, family poverty status), biobehavioral factors (BMI, smoking, alcohol consumption), and technical covariates [sample RNA integrity number (RIN), assay plate, sequencing depth, and profile consistency with other samples].

Sociodemographic Variables and Technical Controls. Variables were coded as follows: age (continuous self-reported years); sex (self-reported biologically assigned male sex at birth, coded by an indicator relative to reference point female); race/ethnicity (self-identified Asian, non-Hispanic black, Hispanic, and other race/ethnicity, each coded by an indicator relative to reference point non-Hispanic white); US region (census regions 2 to 4: Midwest, South, and West, each coded by an indicator relative to reference point region 1, Northeast); family poverty status (self-reported household income less than or equal to 2015 US federal poverty level based on household size, coded by an indicator relative to nonpoverty status); BMI (continuous kg/m² derived from self-reported continuous height and weight); smoking history (self-reported ever smoked, coded by an indicator relative to never smoked reference point); and alcohol consumption [represented as two variables: one “regular drinking” variable indicating whether participants self-reported drinking beer, wine, or liquor every day or almost every day, relative to less frequent drinking during the past 12 mo; and a second “binge drinking” ordinal variable reflecting the number of days during the past 12 mo on which participants drank (female 4/male 5) or more drinks in a row (coded none = 0; 1 to 2 d/y = 1; 3 to 12 d/y (≤1 d/mo) = 2; 2 to 3 d/mo = 3; 1 to 2 d/wk = 4; 3 to 5 d/wk = 5; every/almost every day = 6)]; assay batch (nominal indicators for plates 1 to 11 relative to reference point plate 12); sample RIN (continuous 0 to 10); total mapped reads per sample (continuous/10⁶); read alignment rate (continuous percentage); and profile consistency (average Pearson r with 95 other samples).

Analytic Methods.
Data analyses examined inflammatory and type I IFN gene regulation at three distinct levels of biological function: 1) expression of a-priori-defined sets of inflammatory and type I IFN indicator genes (42); 2) activity of transcription factors involved in mediating inflammatory, type I IFN, SNS, and neuroendocrine responses (34, 36); and 3) activation of specific immune-cell subsets involved in inflammatory and IFN gene expression, particularly monocytes, dendritic cells (DCs), and neutrophils (38, 39, 43). For level 1 analyses, prespecified general inflammatory and IFN composite scores were computed by averaging standardized expression values for 19 genes involved in inflammation or for 32 genes involved in type I IFN responses (42). We also examined a previously derived CTRA indicator contrast score computed as the difference between inflammatory and type I IFN composites (inflammatory composite score − type I IFN composite score). Each of these molecular parameters was tested for differential expression as a function of individual demographic characteristics (age, sex, race/ethnicity) and contextual conditions (US region, family poverty status), with ancillary analyses examining potentially confounding effects of biobehavioral factors (BMI, smoking, alcohol consumption), while controlling for technical covariates as noted above. To avoid capitalizing on chance due to multiple testing, we followed standard biostatistical procedures by computing a single integrated omnibus hypothesis test of our primary hypothesis that there exists significant sociodemographic variation (either individual or contextual) in the expression of one or more of the examined gene sets (65–68).
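As an illustration of the level 1 composite scoring described above, the following minimal sketch standardizes each gene across samples, averages the standardized values within a gene set, and forms the CTRA contrast as inflammatory minus type I IFN. It is not the authors' SAS code; the function name, the toy data, and the placeholder gene labels are assumptions made for this example, and the actual 19- and 32-gene lists are those given in ref. 42.

```python
import numpy as np

def composite_scores(log2_expr, gene_names, inflammatory_genes, ifn_genes):
    """Return (inflammatory, ifn, ctra) composite scores, one value per sample.

    log2_expr: array of shape (n_samples, n_genes) holding log2 transcript abundance.
    gene_names: column labels for log2_expr.
    """
    log2_expr = np.asarray(log2_expr, dtype=float)
    # Standardize each gene across samples (z-scores using the sample SD).
    z = (log2_expr - log2_expr.mean(axis=0)) / log2_expr.std(axis=0, ddof=1)
    column = {gene: i for i, gene in enumerate(gene_names)}
    infl_cols = [column[g] for g in inflammatory_genes]
    ifn_cols = [column[g] for g in ifn_genes]
    # Average standardized expression within each a-priori gene set.
    inflammatory = z[:, infl_cols].mean(axis=1)
    ifn = z[:, ifn_cols].mean(axis=1)
    # CTRA contrast: inflammatory composite minus type I IFN composite.
    ctra = inflammatory - ifn
    return inflammatory, ifn, ctra

# Toy data: 6 samples x 4 placeholder genes (hypothetical labels, not real gene lists).
rng = np.random.default_rng(0)
genes = ["infl_gene_1", "infl_gene_2", "ifn_gene_1", "ifn_gene_2"]
expr = rng.normal(size=(6, 4))
infl, ifn, ctra = composite_scores(
    expr, genes, ["infl_gene_1", "infl_gene_2"], ["ifn_gene_1", "ifn_gene_2"]
)
```

In this sketch each composite would then serve as the outcome in a linear model on the demographic, contextual, biobehavioral, and technical covariates.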
Contingent on a significant omnibus test of global sociodemographic variation in gene set expression, we conducted interpretive follow-up analyses testing for significant sociodemographic variation in expression of each gene composite in isolation [with a false discovery rate (69) correction for multiple testing]. For gene sets showing a significant omnibus test of global sociodemographic variation in activity, we presented the individual parameter estimates underlying that global result for descriptive/interpretive purposes and conducted follow-up nested aggregate hypothesis tests to assess the respective effects of individual vs. contextual demographic factors (again with a false discovery rate correction for multiple testing). Ancillary aggregate hypothesis tests also examined biobehavioral factors that might confound sociodemographic effects. Throughout these analyses individual parameter estimates are presented for interpretive purposes only and do not serve as the analytic basis for primary substantive conclusions. To ensure that the a-priori-specified global inflammatory and type I IFN composite scores did not obscure the effects of more differentiated coregulated gene modules within each global set, we also conducted exploratory follow-up analyses of the analyzed gene sets to map their fine-grained coregulatory structure, using principal factor analysis (70) to identify sets of coregulated genes while accounting for residual sources of sampling variability (i.e., unique variance components). For level 2 and 3 analyses, empirical variations in genome-wide transcriptional profiles were mapped by identifying all genes showing >20% differential expression as a function of a binary demographic indicator variable or a 4-SD difference in a continuous demographic variable (ranging from 2 SD below the mean to 2 SD above the mean).
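The effect-size screening rule for level 2 and 3 analyses can be sketched as follows, assuming that each gene has a regression coefficient on the log2 scale. For a binary indicator the coefficient is the group difference itself; for a continuous predictor it is multiplied by 4 SD of that predictor. The function name and example values are hypothetical, and treating up- and down-regulation symmetrically on the log scale (a ratio above 1.2 or below 1/1.2) is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

# A >20% difference in expression corresponds to log2(1.2) on the log2 scale.
LOG2_THRESHOLD = np.log2(1.2)

def passes_screen(coef, is_binary, predictor_sd=None):
    """Flag genes whose estimated differential expression exceeds 20%.

    coef: per-gene regression coefficients for log2 transcript abundance.
    is_binary: True for an indicator variable, False for a continuous one.
    predictor_sd: SD of the continuous predictor (required if not binary).
    """
    coef = np.asarray(coef, dtype=float)
    if is_binary:
        effect = coef
    else:
        if predictor_sd is None:
            raise ValueError("predictor_sd is required for continuous predictors")
        # Contrast from 2 SD below the mean to 2 SD above the mean (4 SD total).
        effect = coef * 4.0 * predictor_sd
    # Assumption: screen symmetrically on the log scale.
    return np.abs(effect) > LOG2_THRESHOLD

# Binary indicator: coefficients are log2 group differences.
binary_flags = passes_screen([0.30, -0.10, 0.26], is_binary=True)

# Continuous predictor with a hypothetical SD of 7 units.
continuous_flags = passes_screen([0.01, 0.005], is_binary=False, predictor_sd=7.0)
```

Under these assumptions a per-unit coefficient of 0.01 with an SD of 7 gives a 4-SD contrast of 0.28 log2 units, which exceeds log2(1.2) ≈ 0.263 and so passes the screen.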
Gene-specific statistical significance was based on a 5% dependent false discovery rate allowing for potential correlation among genes (71). In level 2 analyses, transcription factor activity was assessed by TELiS bioinformatic analysis (45) of RefSeq core promoter DNA sequences for all genes showing a maximum-likelihood point estimate of >20% differential expression as a function of a target demographic variable. Genes were screened into TELiS analyses based on differential expression effect size because effect-size–screened gene lists have been found to be more replicable than those based on p- or q-value screening (42, 72–75). TELiS analyses used TRANSFAC position-specific weight matrices for NF-κB, AP-1, ISRE, CREB, and the glucocorticoid receptor (GR) (76), with detection by the TRANSFAC mat_sim information criterion and statistical significance assessed by bootstrap resampling of linear model residual vectors to account for correlation among genes (77). Level 3 analyses examined the relative contributions of 10 leukocyte subsets to the same set of differentially expressed genes using Transcript Origin Analysis (43) based on reference transcriptome profiles derived from isolated cell samples (Gene Expression Omnibus GSE101489) (46) and bootstrap analysis of statistical significance. Additional analytic details are available in SI Appendix. Analyses were performed using SAS 9.4 software.

Data Availability. Add Health data are available at https://www.cpc.unc.edu/projects/addhealth/documentation/. SAS code used in these analyses is available upon request from the corresponding authors.
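For readers unfamiliar with the false discovery rate corrections referred to in the Analytic Methods, the following minimal sketch computes adjusted p-values (q-values). It assumes the Benjamini-Hochberg step-up procedure for the standard correction and the Benjamini-Yekutieli variant for the dependent case; the text cites refs. 69 and 71 without naming the procedures, so this mapping is an assumption, and the function name and example p-values are hypothetical.

```python
import numpy as np

def fdr_adjust(p_values, dependent=False):
    """Return FDR-adjusted p-values in the original order.

    dependent=False: Benjamini-Hochberg step-up adjustment.
    dependent=True: Benjamini-Yekutieli adjustment, valid under arbitrary dependence.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    adjusted = p[order] * m / ranks
    if dependent:
        # Extra penalty: the harmonic sum 1 + 1/2 + ... + 1/m.
        adjusted = adjusted * np.sum(1.0 / ranks)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(adjusted, 0.0, 1.0)
    return q

example_p = [0.01, 0.04, 0.03, 0.20]
q_bh = fdr_adjust(example_p)                  # standard FDR
q_by = fdr_adjust(example_p, dependent=True)  # dependent FDR
significant_at_5pct = q_by <= 0.05
```

The dependent variant is never less conservative than the standard one, which is the price paid for allowing correlation among genes.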

Acknowledgments This research was supported by NIH Grants R01-HD087061 (specifying the present analyses), P30-AG017265, R01-AG043404, and R01-AG033590; and by the Jacobs Center for Productive Youth Development (University of Zürich). This research uses data from Add Health, a program project directed by Kathleen Mullan Harris and designed by J. Richard Udry, Peter S. Bearman, and Kathleen Mullan Harris (University of North Carolina at Chapel Hill) and funded by Grant P01-HD31921 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development, with cooperative funding from 23 other federal agencies and foundations (https://www.cpc.unc.edu/projects/addhealth/about/funders).

Footnotes Author contributions: S.W.C., M.J.S., L.G., and K.M.H. designed research; S.W.C. and K.M.H. performed research; S.W.C. and L.G. analyzed data; and S.W.C., M.J.S., L.G., and K.M.H. wrote the paper.

Reviewers: E.S.E., University of California, San Francisco; S.M., University of Pittsburgh; and C.M., University of Michigan.

The authors declare no competing interest.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1821367117/-/DCSupplemental.