Population and study design

The UK Biobank study design has been reported14. Briefly, all people aged 40–69 years who were registered with the National Health Service and living up to ~25 miles from one of the 22 study assessment centres were invited to participate in 2006–10. Overall, about 9·2 million invitations were mailed in order to recruit 503,325 participants (i.e. a response rate of 5·47%)15. Extensive self-reported baseline data were collected by questionnaire, in addition to anthropometric assessments. Age at menarche in women was self-reported in whole years. Age at voice breaking in men was reported as a categorical variable: participants were asked whether their voice broke at an age “younger”, “about average” or “older” than their peers. For the current analysis, individuals of non-white ethnicity (N = 29,819) were excluded to avoid potential confounding effects of ancestry on puberty timing and disease risks3. Furthermore, we excluded women who did not report age at menarche (N = 7,318) or who reported age at menarche at very extreme ages: <8 years (N = 29) or >19 years (N = 121). We also excluded men who did not report timing of voice breaking (N = 17,617). All participants provided informed written consent. The study was approved by the National Research Ethics Service Committee North West – Haydock, and all study procedures were performed in accordance with the World Medical Association Declaration of Helsinki ethical principles for medical research.
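The exclusion rules above can be sketched as a simple eligibility filter. This is only an illustration: the record fields (`sex`, `ethnicity`, `age_at_menarche`, `voice_breaking`) and their codings are hypothetical names chosen here, not UK Biobank field identifiers.

```python
from typing import Optional, TypedDict


class Participant(TypedDict):
    # All field names and codings below are hypothetical, for illustration only.
    sex: str                        # "F" or "M"
    ethnicity: str                  # "white" or any other value
    age_at_menarche: Optional[int]  # whole years; None if not reported (women)
    voice_breaking: Optional[str]   # "younger", "about average", "older"; None if not reported (men)


def is_eligible(p: Participant) -> bool:
    """Apply the exclusion criteria described in the text to one record."""
    if p["ethnicity"] != "white":
        return False
    if p["sex"] == "F":
        age = p["age_at_menarche"]
        # Exclude women with missing or extreme (<8 or >19 years) age at menarche.
        return age is not None and 8 <= age <= 19
    # Exclude men who did not report timing of voice breaking.
    return p["voice_breaking"] is not None
```

Applying `is_eligible` to every record and counting the rejections per rule would give the kind of exclusion counts reported above.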

Adverse health outcomes

Past or current diseases were self-reported in response to the question “Has a doctor ever told you that you have had any of the following conditions? (You can select more than one answer)”. To ensure good discrimination between medical conditions, the data were collected using a computer-assisted personal interview (CAPI), administered by trained interviewers. Twelve other adverse health outcomes were generated by re-classification of questionnaire data or objective measurements made at the baseline visit. Seven adverse outcomes were generated in both men and women: short stature (defined as the lowest 5% of measured height, separately for men and women); obesity (BMI > 30 kg/m² based on measured height and weight); low intelligence (scores of 2 or less out of a possible 13 on the UK Biobank fluid intelligence test, ~3·8% of the study sample); low FEV1 (forced expiratory volume in 1 second; defined as the lowest 5% of residuals for FEV1 from a model with height and sex as covariates); low trauma fracture (history of any fracture resulting from a simple fall); poor sleep (less than five hours of sleep, compared with those reporting 8 hours); and overall poor health (those who answered “Poor” to the question “In general how would you rate your overall health?”). The five adverse outcomes specific to women were: stillbirth; low birth weight of first child (<5·5 pounds or <2·49 kg); oophorectomy; hysterectomy; and early natural menopause (defined as menopause occurring before age 45, without a prior hysterectomy or oophorectomy and not taking hormone replacement therapy at the time of menopause)16.
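Two of the derived outcomes above, obesity and sex-specific short stature, can be sketched as follows. The function names are hypothetical, and the percentile convention (values strictly below the k-th smallest value, with k = floor(0·05 × n)) is an assumption made for illustration; the text does not state how the lowest 5% was computed.

```python
from typing import List, Sequence


def is_obese(weight_kg: float, height_m: float) -> bool:
    """Obesity as defined in the text: BMI > 30 kg/m^2 from measured values."""
    return weight_kg / height_m ** 2 > 30.0


def lowest_fraction_flags(values: Sequence[float], fraction: float = 0.05) -> List[bool]:
    """Flag values falling in the lowest `fraction` of a non-empty sample.

    Assumption: the cut-off is the value at sorted position floor(n * fraction),
    and values strictly below it are flagged.
    """
    k = int(len(values) * fraction)
    cutoff = sorted(values)[k]
    return [v < cutoff for v in values]


def short_stature_by_sex(heights_cm: Sequence[float], sexes: Sequence[str]) -> List[bool]:
    """Flag the lowest 5% of measured height separately within each sex."""
    flags = [False] * len(heights_cm)
    for sex in set(sexes):
        idx = [i for i, s in enumerate(sexes) if s == sex]
        sex_flags = lowest_fraction_flags([heights_cm[i] for i in idx])
        for i, flag in zip(idx, sex_flags):
            flags[i] = flag
    return flags
```

Computing the cut-off within each sex, rather than on the pooled sample, prevents the short-stature group from being dominated by women simply because they are shorter on average.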

Comparator groups were identified separately for each outcome. In general, any participant who did not report a specific disease was considered to be a non-case for that disease. For T2D, we excluded from the analysis model any case who might have type 1 diabetes (based on age at diagnosis ≤ 35 years, insulin use within 1 year of diagnosis, or diagnosis less than one year before their UK Biobank assessment). Where adverse outcomes were derived from re-classification of other data, in general the comparator group comprised all men or women who provided higher (i.e. less adverse) responses/measurements, except where stated above.

To provide sufficient power to meet our conservative significance threshold, we considered only those diseases/outcomes with at least 500 cases in either sex (~0·2% prevalence). In total, we considered 128 diseases plus 12 other adverse outcomes in women and 112 diseases plus 7 other adverse outcomes in men.

Statistical analysis

Separate logistic regression models were fitted in each sex to test the associations between puberty timing and each outcome. Age at menarche in women was analysed in linear models and also in two categorical models, which compared the earliest approximate quintile (8–11 years inclusive, N = 50,405) or the oldest approximate quintile (15–19 years inclusive, N = 41,338) to the median (13 years, N = 61,216). Age at voice breaking in men was analysed only in two categorical models, comparing either the “relatively younger” or the “relatively older” voice breaking group to the “about average” group.
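A minimal sketch of one categorical comparison (the earliest group versus the 13-year reference group) is given below. It assumes NumPy is available and uses a hand-written Newton-Raphson fit purely for illustration; it is not the software used in the study. The function names and the toy data are hypothetical. Participants outside the two groups being compared are dropped from the model.

```python
import numpy as np

EARLY = range(8, 12)   # 8-11 years inclusive
LATE = range(15, 20)   # 15-19 years inclusive
REFERENCE_AGE = 13     # median age at menarche, used as the reference group


def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


def odds_ratio_vs_reference(menarche_age, outcome, group, covariates=None):
    """Odds ratio for `group` versus the reference age, optionally adjusted.

    `covariates` is an (n, k) array; it should be centred and scaled for
    numerical stability (an assumption of this sketch).
    """
    menarche_age = np.asarray(menarche_age)
    outcome = np.asarray(outcome, dtype=float)
    in_group = np.isin(menarche_age, list(group))
    keep = in_group | (menarche_age == REFERENCE_AGE)
    columns = [np.ones(keep.sum()), in_group[keep].astype(float)]
    if covariates is not None:
        columns.extend(np.asarray(covariates, dtype=float)[keep].T)
    beta = fit_logistic(np.column_stack(columns), outcome[keep])
    return float(np.exp(beta[1]))


# Toy data: 1000 reference women (100 cases), 1000 early women (150 cases),
# and 500 women aged 14 at menarche who belong to neither comparison group.
ages = np.array([13] * 1000 + [10] * 1000 + [14] * 500)
cases = np.concatenate([np.ones(100), np.zeros(900),
                        np.ones(150), np.zeros(850), np.zeros(500)])
or_early = odds_ratio_vs_reference(ages, cases, EARLY)
```

Without covariates, the fitted odds ratio equals the cross-product ratio of the 2 × 2 table, here (150 × 900) / (850 × 100), which gives a quick check on the fit.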

Baseline models included birth year, age and age-squared, to account for potential confounding effects of the secular changes in puberty timing. Further adjusted models were also performed to account for the potential confounding and mediating effects of socio-economic position (SEP) and adiposity/body composition. To enable comprehensive adjustments without invalidating our models due to collinearity, we performed a two-stage analysis. First, we calculated the principal components for all available ‘adiposity/body composition’-related variables (BMI, hip circumference and waist circumference, measured by trained assessment centre staff; and weight, body fat percent, trunk fat free mass, trunk fat mass, whole body fat free mass, whole body fat mass and whole body water mass, all estimated by the Tanita BC418MA electrical bioimpedance analyser) and, separately, for the available ‘SEP’-related variables (alcohol intake; education, as 8 dummy variables for different levels of qualification; maternal smoking; reported income level; smoking, ever and current; and Townsend index of deprivation). We then included in our adjusted logistic models the top principal components (explaining in each case over 99% of the variance) for adiposity/body composition (5 principal components) and SEP (11 principal components). Where ‘Obesity’ was the outcome, the adjusted models included only the principal components for SEP.
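The first stage of this two-stage approach can be sketched as below, assuming NumPy is available. The function name is hypothetical, and standardising each variable before the decomposition is an assumption of this sketch; the text does not say whether the variables were scaled. The toy matrix contains deliberately collinear columns to show that the retained components are uncorrelated.

```python
import numpy as np


def top_principal_components(X, min_variance=0.99):
    """Return scores for the fewest leading components explaining > min_variance.

    Assumption: columns are standardised first; a constant column (zero
    standard deviation) is not handled in this sketch.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    cumulative = np.cumsum(explained)
    # Index of the first component at which cumulative variance exceeds the threshold.
    k = min(int(np.searchsorted(cumulative, min_variance, side="right")) + 1, len(s))
    return U[:, :k] * s[:k], explained[:k]


# Toy data: four variables built from only two independent signals.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X_demo = np.column_stack([base[:, 0], base[:, 1],
                          base[:, 0] + base[:, 1], 2.0 * base[:, 0]])
scores, explained = top_principal_components(X_demo)
```

In the second stage, the columns of `scores` would be added as covariates to the logistic model in place of the original, mutually correlated variables.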

A conservative multiple test-corrected threshold of P < 7·48 × 10⁻⁵ was used to identify significant disease associations; this value represented P = 0·05 divided by the total number of tests performed (140 outcomes × 3 models in women, plus 119 outcomes × 2 models in men).
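The arithmetic behind this kind of correction can be sketched as follows, using the outcome and model counts stated above; the function name is hypothetical. With these counts the division gives approximately 7·60 × 10⁻⁵, slightly above the reported threshold, so the reported value may reflect a marginally different test count. The sketch shows the calculation only and does not attempt to resolve that difference.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Counts taken from the text: 140 outcomes x 3 models in women,
# plus 119 outcomes x 2 models in men.
n_tests = 140 * 3 + 119 * 2
threshold = bonferroni_threshold(0.05, n_tests)
print(f"{n_tests} tests, per-test threshold {threshold:.2e}")
```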