The capacity to infer others' mental states (known as ‘mind reading’ and ‘cognitive empathy’) is essential for social interactions across species, and its impairment characterizes psychopathological conditions such as autism spectrum disorder and schizophrenia. Previous studies reported that testosterone administration impaired cognitive empathy in healthy humans, and that a putative biomarker of prenatal testosterone exposure (finger digit ratios) moderated the effect. However, empirical support for the relationship has relied on small sample studies with mixed evidence. We investigate the reliability and generalizability of the relationship in two large-scale double-blind placebo-controlled experiments in young men (n = 243 and n = 400), using two different testosterone administration protocols. We find no evidence that cognitive empathy is impaired by testosterone administration or associated with digit ratios. With an unprecedented combined sample size, these results counter current theories and previous high-profile reports, and demonstrate that previous investigations of this topic have been statistically underpowered.

1. Introduction

Decades of research on neuroendocrinological influences on animal behaviour have provided a reliable basis for exploring these influences in humans [1] and motivate a growing scientific focus on the biological basis of social aptitudes and the causes of their deficits [2]. One important element of social cognition is ‘cognitive empathy’, the capacity to infer from observation the emotions, beliefs and goals of others.1 This capacity exists across taxa [4], and its impairment in humans characterizes a broad range of psychopathological conditions and is part of the clinical diagnostic criteria for autism spectrum disorders (ASDs) [5].2

(a) Testosterone-based biological theory of social cognition

A popular biopsychological model known as the extreme male brain (EMB) hypothesis [6] proposes that two distinct cognitive styles, ‘systemizing’ and ‘empathizing’, typify males and females, respectively. The stereotypically male systemizing domain has no social dimension, and in its extreme form, social cognition is extinguished. Guided by observations that ASDs emerge early in life and are substantially more prevalent among males,3 and that males typically score lower than females in tests of cognitive empathy [7], the EMB hypothesis proposes that elevated prenatal exposure to the sex steroid testosterone causes impairments in cognitive empathy, through its masculinizing effect on the developing brain [8].

The EMB hypothesis found evidential support in a study that reported a correlation between amniotic testosterone levels and ASD traits ([9], though see [10]), and has remained popular yet controversial to date. Much of the research on it has relied on the assumption that the ratio of the lengths of the hand's second (index) and fourth (ring) digits (known as 2D : 4D) is a developmental proxy for prenatal testosterone exposure [11], which motivated examinations of correlations between 2D : 4D and both cognitive empathy and ASD occurrence. While some studies provided supporting evidence (e.g. [12]), several others failed to detect a relationship between digit ratio and cognitive empathy [13,14]. Moreover, because it is not feasible to experimentally manipulate prenatal testosterone exposure in humans (owing to ethical considerations), findings along this line of research have been correlational and therefore cannot establish causal relations [9].

(b) Testing testosterone's causal effect on cognitive empathy

A handful of experiments attempted to address the above limitation by testing the effects of testosterone administration on cognitive empathy in neurotypical adults, and investigating the dependency of these effects on the 2D : 4D biomarker [15–18]. This line of research critically relies on an assumption originating in animal research, that in utero androgen exposure moderates the activational effect of testosterone [19]. The seminal publication along this line of research reported a placebo-controlled within-subject experiment of 16 healthy females, in which exogenously administered testosterone strongly impaired cognitive empathy measured using the ‘Reading the Mind in the Eyes Test’ (RMET), a 36-item battery testing the ability to infer others' emotional states and intentions from pictures of their eye regions [7] (see the electronic supplementary material, figure S2 for an example item). In addition to a main effect of exogenous testosterone reducing cognitive empathy, the study reported that more than 50% of the individual differences in the effect on the RMET were explained by the participants’ variation in the right-hand 2D : 4D, implying involvement of prenatal testosterone exposure in the causal effect [18].

A similar experiment with roughly twice the sample size (n = 33, all female sample) found a much smaller4 main effect (p = 0.048, one-tailed), and no moderation by 2D : 4D [17]. A third experiment of 16 females found neither a main effect nor a moderation by 2D : 4D [16]. Last, one experiment investigated the effect of testosterone administration on the RMET in 30 healthy males and found neither a main effect nor a moderation by the right-hand 2D : 4D; however, subsample analysis revealed that testosterone administration reduced cognitive empathy in participants with relatively low (i.e. more masculine) left-hand 2D : 4D, with no such effect in participants with high left-hand 2D : 4D or for the right-hand 2D : 4D [15].

(c) Do the data support the hypothesis?

Despite these earlier findings, the current literature on the effects of testosterone on cognitive empathy is subject to important limitations, and its results reveal weaknesses under scrutiny. First, although a few findings agree on the negative direction of the effect of testosterone on RMET performance, replicability across experiments is lacking: only one of the four studies observed a statistically significant (p < 0.05) main effect of testosterone on the RMET [18]. Moreover, this publication's report of a strong moderating role of the right-hand 2D : 4D was not replicated in any of the other studies (the only other report of an interaction between testosterone administration and the 2D : 4D was observed for the left hand [15]).5

A second concern is statistical power. Although the RMET is a noisy psychological instrument,6 and 2D : 4D is, at best, a noisy proxy of prenatal testosterone exposure [20], all samples ranged between 16 and 33 participants, which might have been too meagre to credibly estimate a true effect size. It is therefore impossible to know whether the inconsistencies in the literature are owing to an absence of a true association, or the result of false-negative findings owing to low statistical power. Thus, these inconsistent results necessitate clarification through additional studies.

To this end, we conducted a powerful direct test of the activational and developmental effects of testosterone on cognitive empathy by measuring the causal effect of exogenous testosterone and the moderating role of putative prenatal androgenic biomarkers in two studies of healthy young men. Our studies constitute, to our knowledge, the two largest behavioural testosterone administration experiments conducted to date, with samples that were 15 and 25 times greater than the first study that reported a statistically significant effect of testosterone on the RMET in females [18] and 7 and 12 times greater than the largest experiment in males [15]. In both studies, we used a computer-based version of the RMET to test the hypothesis that testosterone administration and its purported developmental biomarkers affect cognitive empathy.

2. Methods

(a) Experiment 1

(i) Participants and experimental procedure

Two hundred and forty-three males aged 18–55 (mean age = 23.63, s.d. = 7.22) participated in the study; most were students of a private Southern California consortium and came from diverse ethnic backgrounds (see Participants and electronic supplementary material, table S1a). All data and materials are available on the Open Science Framework (https://osf.io/hztfe/).

Participants registered by their preferred session dates and were added to cohorts of 13–16. They arrived at the laboratory at 9.00, signed informed consent forms and had both of their hands scanned before being randomly assigned to private cubicles where they completed demographic and mood questionnaires (see the electronic supplementary material for all independent variables) and provided an initial saliva sample by passive drool. Next, participants proceeded to gel application (further details below), after which they were given printed material containing precautions and instructions prior to dismissal (experimental timeline shown in figure 1). All participants returned to the laboratory at 14.00, provided a second saliva sample, and began a battery of tasks that lasted approximately 2 h. We did not randomize the order of the behavioural tasks, in a similar fashion to previous studies [21], to standardize hormonal measurements (which have diurnal cycles) among participants. Following the experiment, participants completed an exit survey, where they indicated their beliefs about the treatment they had received using a five-point scale.

Figure 1. Time-coded experimental timeline shows that, following intake (morning), participants completed half of the RMET (portion A or B), provided a saliva sample and received gel prior to being dismissed. Upon their return to the laboratory (afternoon), another saliva sample was collected prior to taking the second portion of the RMET (B or A) (standard errors shown). Timeline generalizes experimental sequence for all three start times for experiment 2. Following arrival and completion of consent and self-report questionnaires, participants provided a saliva sample approximately 30 min prior to drug administration. The second sample was collected at the end of a 2 h protocol approximately 1.5 h after drug administration and 15 min after the RMET. (a) experiment 1; (b) experiment 2.

(ii) Treatment administration

Following initial intake, participants were escorted in groups of two to six to a semi-private room. There, they were given, as a group, small plastic cups containing either 10 g of a widely prescribed transdermal testosterone gel with clearly mapped pharmacokinetics [22] (100 mg, Vogelxo™, n = 123) or a volume equivalent of inert placebo of similar viscosity and texture (80% alcogel, 20% Versagel®, n = 118) under a double-blind protocol (see the electronic supplementary material, figure S1a for randomization protocol). Participants were instructed to apply the entire contents of the container following the manufacturer's instructions.

(iii) Saliva samples

Each participant provided four passive drool saliva samples (upon arrival prior to treatment administration, shortly after returning for the afternoon session, another closely following the RMET and a final sample prior to the exit survey) for subsequent assay (see the electronic supplementary material for precise timing). To allow robust manipulation checks and obtain statistical control for hormonal markers of participants' biological states, we used liquid chromatography tandem mass spectrometry (LC-MS/MS, detection levels and precision are available in the electronic supplementary material, table S2) to measure the following salivary steroids: oestrone, oestradiol, oestriol, testosterone, androstenedione, dehydroepiandrosterone (DHEA, a metabolic intermediate in the synthesis of sex steroids), dihydrotestosterone (DHT, a potent androgen synthesized from testosterone via 5-alpha-reductase), progesterone, 17OH-progesterone, 11-deoxycortisol, cortisol, cortisone and corticosterone (see the electronic supplementary material, table S7 for measurements).

(iv) Digit ratio measurements

Participants’ hand scans acquired at intake were measured by two independent raters using digital calipers to quantify 2D : 4D; the inter-rater correlation was 0.96, and the two raters' scores were averaged.
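The measurement step can be illustrated with a minimal Python sketch: each rater's finger lengths are converted to a 2D : 4D ratio, the inter-rater Pearson correlation is computed, and the two raters' ratios are averaged. The finger lengths and function names below are hypothetical illustrations, not study data or the code used in the study.

```python
from math import sqrt

def digit_ratio(index_len_mm, ring_len_mm):
    """2D : 4D = index-finger length divided by ring-finger length."""
    return index_len_mm / ring_len_mm

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical (index, ring) lengths in mm for four hands, as measured by two raters.
rater_1 = [(72.1, 75.0), (68.4, 72.3), (70.2, 71.0), (74.5, 79.8)]
rater_2 = [(72.4, 75.2), (68.1, 72.0), (70.0, 71.3), (74.9, 79.5)]

ratios_1 = [digit_ratio(i, r) for i, r in rater_1]
ratios_2 = [digit_ratio(i, r) for i, r in rater_2]

# Agreement between raters, then one averaged score per hand.
inter_rater_r = pearson_r(ratios_1, ratios_2)
averaged = [(a + b) / 2 for a, b in zip(ratios_1, ratios_2)]
print(round(inter_rater_r, 2), [round(v, 3) for v in averaged])
```

Averaging is only sensible when agreement is high, which is why the correlation is checked before the scores are combined.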

(v) Behavioural task

We administered the adult version of the RMET developed by Baron-Cohen et al. [7], which shows the eye region of an actor's face and a list of four words describing emotional states and cognitive processes, among which participants select the one that best describes the person in the image (see the electronic supplementary material, figure S2 for an example item). The task was divided into two segments, baseline (morning) and post-treatment (afternoon), in a repeated measures design (figure 1): each participant completed one half of the RMET in the morning (either part ‘A’, items 1–18, or part ‘B’, items 19–36) prior to receiving treatment, and the other half following treatment when, according to published pharmacokinetics, androgen levels are elevated and stable following exogenous application.

(vi) Psychological questionnaires

Because there are various feasible channels through which testosterone could affect RMET performance (affect being one of them), we measured mood using the PANAS-X scale [23], both pre- and post-treatment (see the electronic supplementary material, table S1a for aggregated responses).

(b) Experiment 2

(i) Participants

Experiment 2 included both students and participants from the general public for a total sample of 400 participants (mean age = 22.80, s.d. = 4.68). The all-male sample was composed predominantly of Caucasians and overall ethnic heterogeneity was representative of the region (see Participants and electronic supplementary material, table S1b). All accepted participants completed the task and were included in the analysis (for pre-screening criteria, see the electronic supplementary material). The Nipissing University Research Ethics Board approved this study, all participants gave informed consent, no adverse events occurred during any experimental session and no participant or researcher was harmed.

(ii) Experimental procedure

Participants arrived at one of three testing session times (10.00, 12.30 or 14.30) in cohorts of six and were brought individually into private testing rooms to read and sign an informed consent form, receive a participant number and complete questionnaires (see the electronic supplementary material for all independent variables). Afterwards, participants provided a 1–2 ml saliva sample before treatment administration, after which they had their photos taken and hands scanned. Approximately 2 h after arrival to the laboratory and 1.5 h after drug administration, participants completed the RMET then provided their final saliva sample. Upon session completion, participants received compensation and completed an exit survey asking which treatment they believed they had received (see the electronic supplementary material, figure S1b).

(iii) Treatment administration

Following initial saliva sample collection, a researcher provided two syringes, pre-filled by a pharmacist following a double-blind protocol, each containing 5.5 mg of either testosterone gel or placebo (for a total of 11 mg). The testosterone gel was Natesto®, a newly approved nasal gel used for the treatment of hypogonadism; the placebo was the volume equivalent of an inert gel of similar viscosity and texture. Pharmacokinetic data indicate that serum testosterone concentrations rise sharply within 15 min of testosterone gel application and remain elevated (relative to placebo) up to 180 min post-application [24]. Random assignment was arranged so that half the participants in every group received testosterone and half received placebo, splitting the total sample evenly with n = 200 in each of the testosterone and placebo groups (see the electronic supplementary material, figure S1b for randomization protocol).
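A balanced allocation of this kind can be sketched in a few lines of Python. This is only an illustration of the general idea of splitting each cohort evenly at random; the participant identifiers, the seed and the function name are hypothetical, and the actual double-blind protocol (electronic supplementary material, figure S1b) is not reproduced here.

```python
import random

def assign_cohort(participant_ids, rng):
    """Randomly give half of a cohort testosterone and half placebo."""
    if len(participant_ids) % 2 != 0:
        raise ValueError("cohort size must be even for an equal split")
    # Build a list with equal numbers of each arm, then shuffle it.
    arms = ["testosterone", "placebo"] * (len(participant_ids) // 2)
    rng.shuffle(arms)
    return dict(zip(participant_ids, arms))

rng = random.Random(2019)  # arbitrary seed so the sketch is reproducible
cohort = ["P01", "P02", "P03", "P04", "P05", "P06"]  # hypothetical IDs
allocation = assign_cohort(cohort, rng)
print(allocation)
```

Shuffling a pre-balanced list, rather than flipping a coin per participant, guarantees the equal split within every cohort.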

(iv) Saliva samples

Each participant provided two saliva samples, with the first sample collection time 30 min following arrival (and prior to gel application), and the second 120 min after arrival. Participants provided passive drool into a 5 ml polystyrene tube while situated in their individual testing rooms as instructed by a research assistant and samples were analysed for pre- and post-treatment testosterone and baseline cortisol using commercially available enzyme immunoassay kits (DRG International) (see the electronic supplementary material, table S2b for hormone measures).

(v) Digit ratio measurements

Participants’ 2D : 4D were measured by two independent raters using hand scans and digital calipers with an inter-rater correlation of 0.86 and their scores were averaged.

(vi) Behavioural task

Participants evaluated all 36 items of the RMET [7] as a single task.

(c) Comparison of experimental features to van Honk et al. [18]

We note several differences between our study and the primary positive report of testosterone administration on the RMET in 16 females [18].

(i) Participant sex

The EMB theory does not make any sex-specific predictions regarding the developmental and activational effects of testosterone on cognitive empathy. However, van Honk et al. [18] conducted their study in a female-only sample because the pharmacokinetics of a single dose of testosterone had only been studied in females at the time: ‘We exclusively recruited women because the parameters … for inducing neurophysiological effects … are known in women but not in men’ ([18], p. 3450). The recent pharmacokinetic mappings for short-term single-dose testosterone administrations [15,22] and availability of two unique administration modalities of FDA-approved exogenous testosterone provided us with a reliable foundation for testing the EMB hypothesis in men.

(ii) Drug dosage and delivery

van Honk et al. [18] used a sublingual testosterone administration procedure, which causes a sharp increase in serum testosterone of 10-fold or more within 15 min, with a rapid decline to normal levels within 90 min in women [25]. It is important to note that the pharmacokinetic data for sublingual administration (published in fig. 1 of Tuiten et al. [25]) show that at the time the task was performed—4 h after sublingual administration—participants' testosterone levels were the same across the testosterone and placebo groups. Moreover, the Tuiten et al. study that served as a justification for using a 4 h delay had only eight participants, and reported a statistically weak treatment effect (p = 0.04, uncorrected for multiple comparisons).

In experiment 1, we chose to administer testosterone using the United States Food and Drug Administration-approved transdermal gel for three reasons. First, transdermal gel had been extensively studied in the medical literature both prior to and following its approval [26,27]. Second, one of our laboratories found reliable behavioural effects with robust manipulation checks in serum using a single dose [28]. Third, the pharmacokinetics of a single dose of this testosterone administration method were mapped prior to the inception of our experiments by a study showing that plasma testosterone levels peaked 3 h after single-dose exogenous transdermal administration, and stabilized at high levels between 4 and 7 h following administration [22].7 Therefore, we had all participants return to the laboratory 4.5 h after receiving gel, when androgen levels were elevated and stable. We used a 100 mg transdermal dose, which quickly elevates testosterone levels and then holds them high and stable for approximately 24 h [26], and which was shown to generate effects on cognition, decision-making and other behaviours [28–32].

In experiment 2, we used nasal delivery, following a recent study indicating that serum testosterone concentrations rise sharply within 15 min after Natesto® gel application and remain elevated for approximately 3 h post-application among hypogonadal males [24] (see the electronic supplementary material, figure S3). This method conforms to our experimental paradigm's pharmacokinetic structure as serum testosterone approached its zenith in treated participants as they completed the RMET (see the electronic supplementary material, tables S2a and b for testosterone levels).

The doses in both experiments are commonly prescribed daily to men with low circulating testosterone levels and serve as two distinct physical transport channels (transdermal and intranasal, respectively) to reduce the probability that behavioural effects are transport channel-specific. Various studies show significant heterogeneity in change in testosterone levels depending on delivery method, location of application in the body and biofluid measured [15,22,25,28,30,33]. However, all the exogenous delivery methods in this particular literature cause a common hormonal trajectory characterized by a rapid initial rise, a peak above typical circulating levels, and eventual return to baseline.

(iii) Experimental designs

van Honk et al. [18] used the same questions in pre- and post-treatment testing. As testosterone treatment might affect participants' capacity to recall answers [34], this design choice might have introduced memory confounds. In experiment 1, we divided the RMET into two portions and administered each portion as either the pre- or the post-treatment measure, allowing us to capture baseline abilities while ruling out such confounds. In experiment 2, we conducted a between-subjects experiment that removes all effects of practice and recall from the data.

3. Results

(a) Manipulation check

In both experiments, pre-treatment (i.e. baseline) saliva testosterone levels were similar across the two treatment groups, and post-treatment saliva testosterone levels were greater in the testosterone group compared with the placebo group. In experiment 1, the mean logged pre-treatment testosterone level was 5.58 pg ml−1 (s.d. = 0.08) in the testosterone group and 5.77 pg ml−1 (s.d. = 0.09) in the placebo group (two-sided t-test: t(239) = 1.50, p = 0.13). The mean post-treatment logged testosterone levels were 8.38 pg ml−1 (s.d. = 0.15) in the testosterone group and 5.11 pg ml−1 (s.d. = 0.08) in the placebo group8 (t(239) = 18.43, p < 0.0001). Likewise in experiment 2, we found similar mean baseline saliva testosterone concentrations between the groups, with 5.3 pg ml−1 (s.d. = 0.88) in the testosterone group and 5.4 pg ml−1 (s.d. = 0.85) in the placebo group (two-sided t-test: t(394) = 0.69, p = 0.49). The mean logged post-treatment saliva testosterone levels were 8.00 pg ml−1 (s.d. = 1.27) in the testosterone group and 5.33 pg ml−1 (s.d. = 0.73) in the placebo group (two-sided t-test: t(396) = 22.14, p < 0.001) (figure 1; electronic supplementary material, tables S2a and S2b).

Consistent with previous reports, we found no treatment effects on mood or treatment expectancy (e.g. [35]) (see the electronic supplementary material, tables S1a and S1b), or on levels of other hormones unaffected by exogenous testosterone, as measured by LC-MS/MS in experiment 1 or enzyme immunoassay in experiment 2 (see the electronic supplementary material, tables S2a and S2b).

(b) Influence of testosterone on Reading the Mind in the Eyes Test scores

Overall RMET scores in our samples were comparable with previous studies of similar populations (figure 2a). Figure 2b shows baseline and post-treatment RMET scores in experiment 1, separated by treatment group and order. As expected, baseline (morning) RMET performance was reliably correlated with afternoon scores (r(241) = 0.40, p < 0.001). In addition, participants’ scores were, on average, slightly higher in the B portion of the test (A portion average = 13.54, s.d. = 2.43; B portion average = 13.95, s.d. = 2.19, t(241) = 2.53, p = 0.01). Figure 2c shows experiment 2 scores by treatment groups.

Figure 2. (a) Cumulative distributions of scores from our experiments juxtaposed with the original paper [7]; scores for experiment 1 portions A and B (both of which are unaffected by treatment) are combined into a total score: mean RMET scores were 27.5 (s.d. = 3.9) in experiment 1 and 25.6 (s.d. = 3.9) in experiment 2, which are similar to the average score of 27.3 (s.d. = 3.7) among male students in [7]. (b) Experiment 1 pre- and post-treatment RMET scores. No pre- or post-treatment differences were found between the two groups, regardless of the order in which portions of the tests were taken (standard errors shown). (c) Experiment 2 RMET scores. No differences were found between treatment groups.

To test for the main effect of testosterone administration on cognitive empathy for experiment 1, we estimated linear regression models with the post-treatment (afternoon) RMET score as the dependent variable, a binary treatment indicator (testosterone = 1, placebo = 0) as the independent variable of primary interest, controlling for baseline performance, the order of the two portions of the RMET (A and B) and additional control variables9 (the results remain unchanged when these control variables are excluded from the models; see the electronic supplementary material, tables S3a and S4a). Analogously, experiment 2 data were analysed using linear models with total RMET score as the dependent variable, a binary treatment indicator as the chief independent variable of interest, and control variables10 (results remain unchanged with their exclusion from the models; see the electronic supplementary material, tables S3b and S4b).
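The structure of the experiment 1 model can be illustrated with a short, dependency-free Python sketch that fits an ordinary least-squares regression of the post-treatment score on a treatment indicator, the baseline score and the portion order. Everything in it is an assumption made for illustration: the data are noise-free toy values (so the known coefficients can be checked), the variable names are hypothetical, and the additional control variables of the actual models are omitted.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Toy design matrix: intercept, treatment (1 = testosterone, 0 = placebo),
# baseline (morning) score, and order (1 = portion A taken in the morning).
rows, y = [], []
for i in range(40):
    treatment = i % 2
    order_a_first = (i // 2) % 2
    baseline = 10 + (i * 7) % 9
    rows.append([1.0, float(treatment), float(baseline), float(order_a_first)])
    # Arbitrary "true" coefficients chosen for the toy example only.
    y.append(6.0 + 0.25 * treatment + 0.5 * baseline + 0.4 * order_a_first)

intercept, b_treat, b_base, b_order = ols(rows, y)
print(round(b_treat, 3), round(b_base, 3), round(b_order, 3))
```

The coefficient on the treatment indicator is the quantity of primary interest; conditioning on the baseline score is what lets the repeated-measures design absorb stable individual differences in RMET ability.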

We found no reliable effect of testosterone administration on the RMET in experiment 1 (β = 0.11, 95% confidence interval (CI) = (−0.45, 0.68); t(237) = 0.37, p = 0.71; Cohen's d = 0.04, 95% CI = (−0.19, 0.28)). Thus, the effect's point estimate was positive and the 95% CI excluded the d = −0.49 reported in [18], as well as any negative effect greater in magnitude than d = 0.19. A sample of at least 870 participants (in a between-subject design), or 435 participants (in a within-subject design), which is over 26–54 times (within-subject: over 13–26 times) greater than the samples of previous investigations, would be required to reliably detect even this ‘optimistic’ negative effect size estimate with statistical power of 0.8. Regression analyses with comprehensive controls corroborate the absence of a main treatment effect, and the absence of moderation by 2D : 4D (right hand, left hand and their average), as implied by non-significant interaction coefficients (see the electronic supplementary material, table S3a). Furthermore, in an analysis analogous to the previous positive report [18], we found no correlation between the treatment effect on the RMET and the right-hand 2D : 4D in the testosterone group (r(123) = 0.04, p = 0.66, 95% CI = (−0.14, 0.22)) (see the electronic supplementary material, figure S4).
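The required between-subject sample size can be approximated with the standard normal-approximation formula for a two-sided, two-sample comparison, n per group = 2(z₁₋α/₂ + z_power)² / d². The sketch below assumes α = 0.05 (two-sided) and equal group sizes; the exact procedure behind the figures quoted above is not stated in the text, and the function name is hypothetical. Under these assumptions the calculation yields 435 participants per group, i.e. 870 in total, in line with the between-subject figure.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison with standardized effect size d."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

per_group = n_per_group(0.19)  # magnitude of the lower CI bound, d = -0.19
total = 2 * per_group
print(per_group, total)
```

An exact calculation based on the noncentral t distribution would give a marginally larger number; the approximation is used here only to keep the sketch free of external dependencies.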

In experiment 2, which had 400 participants, we likewise could not reject the null hypothesis (β = 0.27, 95% CI = (−0.49, 1.02); t(398) = 0.69, p = 0.49; Cohen's d = 0.04, 95% CI = (−0.15, 0.24)), and there was no significant treatment effect in any regression model (electronic supplementary material, tables S3b and S4b). As in experiment 1, the point estimate of the effect was positive, and the 95% CI did not include negative effects of testosterone administration on the RMET that were greater in magnitude than 0.15. Further analyses of each question in isolation using χ2 tests revealed no systematic differences between treatment conditions in any of the RMET items in either experimental dataset (see the electronic supplementary material, tables S5a and b).
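An item-level comparison of this kind can be sketched as a Pearson χ2 test on the 2 × 2 table of treatment group by correct/incorrect response. The counts below are hypothetical, the function name is an illustration, and the sketch assumes no continuity correction (the text does not say whether one was applied). For one degree of freedom, the p-value equals erfc(√(χ2/2)).

```python
from math import erfc, sqrt

def chi2_2x2(correct_t, n_t, correct_p, n_p):
    """Pearson chi-square (no continuity correction) comparing the share of
    correct answers on one item between testosterone and placebo groups.
    Assumes every row and column total is non-zero."""
    table = [[correct_t, n_t - correct_t], [correct_p, n_p - correct_p]]
    row_sums = [n_t, n_p]
    col_sums = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    total = n_t + n_p
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 d.f.
    return chi2, p_value

# Hypothetical item: 150 of 200 correct under testosterone, 146 of 200 under placebo.
chi2, p = chi2_2x2(150, 200, 146, 200)
print(round(chi2, 3), round(p, 3))
```

Repeating the test over all 36 items produces many p-values, so isolated small values should be read in light of the number of tests performed.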

(c) Testing for the effect of 2D : 4D

Putative prenatal testosterone proxies (2D : 4D, either right-hand, left-hand or their average) did not correlate with RMET scores in either experiment, nor did they moderate the effect of testosterone administration, echoing other recent findings [15–17] (see the electronic supplementary material, tables S3a and S4b). These results are in line with previous reports showing no correlation between the 2D : 4D and RMET scores [13,39,40], and in contrast with the two papers reporting an interaction between 2D : 4D and the effect of exogenous testosterone on the RMET [15,18].

4. Discussion

Our experiments used two notably large samples to test the effects of pharmacological testosterone manipulation on cognitive empathy. Despite the experimental differences between them, both datasets yielded consistent results, showing no effects of testosterone administration or 2D : 4D on cognitive empathy. These findings, and the literature as a whole, cast serious doubts on the proposal that testosterone causally impairs cognitive empathy, for several reasons.

First, the low statistical power of previous investigations undermines their reliability in capturing true effects. Even if we assume that the true size of testosterone's negative effect on cognitive empathy equals the overly ‘optimistic’ negative bound of our confidence interval in experiment 1 (d = −0.19), we find that all previous investigations of the topic were statistically underpowered (less than 0.3 power). Second, the results of the previous small sample studies are discrepant. Our large samples draw on drastically more data than all previous investigations combined, and generalize across geographically, economically and culturally distinct populations (see the Participants section of the electronic supplementary material). Our use of two different experimental designs and testosterone administration protocols across these populations further mitigates the concern that the outcomes were owing to a particular experimental factor. Of note, there are some design differences between our studies and previous investigations (table 1; differences from [18] discussed above). However, even if those design differences led to a complete abolishment of a ‘real’ effect of testosterone on cognitive empathy, our results demonstrate beyond a reasonable doubt that such an effect is not generalizable to both males and females. Future work with females could employ an approach similar to ours, characterized by large samples from different geographies, distinct administration methods and other design features that strongly inform whether a relationship (or its absence) generalizes across sexes.
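The claim about power can be checked approximately with a short Python sketch. It assumes a two-sided test at α = 0.05, treats d = −0.19 as the standardized mean of within-subject differences, and uses a normal approximation; these are simplifying assumptions, not the exact calculation behind the figure quoted above, and the function name is hypothetical. The sample sizes are those of the four earlier studies summarized in table 1.

```python
from math import sqrt
from statistics import NormalDist

def paired_power(d, n, alpha=0.05):
    """Approximate power of a two-sided paired (within-subject) test for a
    standardized mean difference d with n participants (normal approximation)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(d) * sqrt(n)
    # Probability of rejecting in either tail under the alternative.
    return z.cdf(shift - crit) + z.cdf(-shift - crit)

prior_samples = {"[18]": 16, "[17]": 33, "[16]": 16, "[15]": 30}
powers = {study: paired_power(-0.19, n) for study, n in prior_samples.items()}
for study, value in powers.items():
    print(study, round(value, 2))
```

Under these assumptions every earlier sample falls well below the conventional 0.8 threshold, which is the sense in which the earlier literature is described as underpowered.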

Table 1. Summary of the literature linking testosterone administration and the RMET. (In ‘repeated task’ paradigms, participants completed the RMET twice: prior to any treatment in experiment 1 in this study, and after receiving a placebo or testosterone in [18]. Effect size is calculated using Hedges' bias-corrected effect size owing to small sample size and unequal variances between treatment and control groups.)

| study | sex | n | design | dose | main effect | effect size | s.e. of effect size | 95% CI for effect size | 2D : 4D moderation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| van Honk et al. [18] | F | 16 | within subject; repeated task | 0.5 mg sublingual | one-tailed Wilcoxon, p = 0.01 | −0.49 | 0.25 | (−1.19, 0.21) | low right-hand ratio, negative relationship |
| Olsson et al. [17] | F | 33 | within subject; repeated task | 50 mg transdermal | one-tailed, p = 0.048 | −0.33 | 0.13 | (−0.15, 0.82) | no |
| Bos et al. [16] | F | 16 | within subject; no repetition | 0.5 mg sublingual | no effect, p = 0.78 | −0.10 | 0.10 | (−0.60, 0.79) | no |
| Carré et al. [15] | M | 30 | within subject | 150 mg transdermal | no effect, p = 0.25 | 0.20 | 0.35 | (−0.70, 0.31) | low left-hand ratio, negative relationship |
| Nadler et al. experiment 1 | M | 241 | between subject; no repetition | 100 mg transdermal | no effect, p = 0.83 | 0.03 | 0.36 | (−0.19, 0.28) | no |
| Nadler et al. experiment 2 | M | 400 | between subject; no repetition | 11 mg intranasal | no effect, p = 0.66 | 0.04 | 0.26 | (−0.15, 0.24) | no |

A third reason concerns the validity of the 2D : 4D biomarker. The initial findings that prenatal testosterone exposure correlates with 2D : 4D are supported in non-clinical and clinical human populations [12], and by preliminary causal evidence from relative phalanx/tibia lengths in mice [19]. However, recent work highlights concerns regarding the reliability of 2D : 4D as a biomarker [41,42]. For example, the 2D : 4D of patients with complete androgen insensitivity syndrome was found to be only somewhat feminized, with the same variance as in healthy controls, demonstrating that the preponderance of individual differences in the measure is not attributable to the influence of testosterone exposure [20]. There is also longitudinal evidence that 2D : 4D systematically changes during childhood [43,44], which is incompatible with the proposition that it accurately quantifies prenatal influences. Moreover, while many studies report 2D : 4D sexual dimorphism [45,46], other studies suggest that this dimorphism is not universal across ethnicities [47,48]. Finally, there is also debate on whether the sexual dimorphism is the product of an allometric shift in shape rather than of hormonal influences [49,50].

Furthermore, many reports of correlations between 2D : 4D and behavioural traits hold only for subsets of the population (e.g. a particular sex or race). Correlations sometimes hold only for the right-hand 2D : 4D, at other times only for the left hand, and at others only for the average of both hands [51]. Overall, significant results are seldom replicated, and few survive correction for multiple comparisons or meta-analytic aggregation (e.g. [52]). These concerns cast doubt on the validity of the measure as a biomarker and on its capacity to detect reliable correlations with noisy psychological constructs in studies of small samples.
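The point about correction for multiple comparisons can be illustrated with a short sketch. When the same hypothesis is tested separately for the left hand, the right hand, the average of both hands and several subgroups, some uncorrected p-values are likely to fall below 0.05 by chance. The example below applies the Holm-Bonferroni step-down procedure, which is one standard correction and not necessarily the one used in the cited work; the function name and the p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Illustrative sketch of the Holm-Bonferroni step-down procedure.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail as well
    return reject


# Hypothetical p-values for six 2D : 4D tests (hand x subgroup combinations).
p_values = [0.04, 0.20, 0.03, 0.60, 0.01, 0.35]
uncorrected = [p < 0.05 for p in p_values]
corrected = holm_bonferroni(p_values)
print(sum(uncorrected), "nominally significant;", sum(corrected), "after correction")
```

In this made-up case, three of the six tests are nominally significant, yet none remains so once the number of tests is taken into account.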

Despite our dissenting results, the absence of evidence is not necessarily evidence of absence. Specifically, the lack of an association between 2D : 4D and cognitive empathy could be attributable to the failure of the measure to serve as a reliable androgenic biomarker. We therefore agree with Baron-Cohen et al. ([53], p. 6) that it is worthwhile to study the occurrence of impaired cognitive empathy and other ASD traits in developmentally unique populations. One such study reported mixed evidence: compared with their unaffected relatives, women with congenital adrenal hyperplasia (CAH) had higher scores in some Autism Quotient self-rating subscales and lower scores in others [54], with no significant results in men. However, results along this line of research are also far from conclusive. For example, Kung et al. [10] found that young females with and without CAH did not differ in autistic traits, and that amniotic testosterone levels were not associated with scores in either sex individually or in the entire sample. Other longitudinal studies also found no association between autistic traits and various measures of prenatal androgens in umbilical cord blood and amniotic fluid [55,56]. Thus, further investigations, preferably using larger samples, are required to resolve the inconsistencies in this literature.

To conclude, we tested testosterone's causal role in cognitive empathy across distinct administration methods using notably large samples from two distinct populations, and found no evidence of an effect of testosterone administration on RMET performance in young adult neurotypical human males. While our results do not exclude all possible relationships between testosterone and interpreting others' emotions and states of mind, our large-scale study and evaluation of the previous literature provide robust evidence against a causal relationship between either activational or purported developmental testosterone exposure and cognitive empathy.

Ethics

The institutional review boards of Caltech and Claremont Graduate University approved the study. All participants gave informed consent, no adverse events occurred during any experimental session and no participant or researcher was harmed. Experiment 2 was approved by the Nipissing University Research Ethics Board.

Data accessibility

Data are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.jm6qd39 [57] and the Open Science Framework (https://osf.io/hztfe).

Authors' contributions

A.N. and G.N.: experimental design, manuscript and data analysis; C.F.C.: manuscript; D.T.Z.: manuscript and hormonal assay; T.L.O.: experimental design, data collection and hormonal assay; N.V.W.: manuscript; J.M.C.: experimental design and manuscript.

Competing interests

We declare we have no competing interests.

Funding

Funding for this work was generously provided by Caltech, Ivey Business School, IFREE, Russell Sage Foundation, University of Southern California, INSEAD, Stockholm School of Economics, Wharton Neuroscience Initiative, the Natural Sciences and Engineering Research Council of Canada and the Northern Ontario Heritage Fund Corporation.

Acknowledgements

The authors specially thank Jorge Barraza, Austin Henderson, Garrett Thoelen, Dylan Manfredi, Kimberly Gilbert, Caelan Mathers, Emily Jeanneault, Nicole Marley, Kendra Maracle, Victoria Bass-Parcher, Nadia Desrosiers, Charlotte Miller, Brittney Robinson, Dalton Rogers, Megan Phillips, Brandon Reimer, Camille Gray and Christine Jessamine, who assisted with this study, and David Kimball for LC-MS/MS assay testing. The authors thank Ralph Adolphs for his comments on an earlier version of the manuscript.

Footnotes

Endnotes

1 Cognitive empathy is the ability to interpret others’ emotions and understand their behaviour vis-a-vis their emotional state; this is distinct from emotional empathy, which is the vicarious feeling of others’ emotions along with them [3].

2 The DSM V criteria for ASDs include ‘Non-verbal communication problems, such as abnormal eye contact, posture, facial expressions, tone of voice and gestures, as well as an inability to understand these’.

3 ASD incidence rates vary widely by study, from 5.2 to 72.6 per 10 000 people and ratios range from 1.81 to 15.7 male : female.

4 The primary publication [17] had a statistical power of only 0.26 to detect the effect size found in the similar study with twice the sample size [18].

5 One experiment (in males) reported a statistically significant moderating effect, but only for the left-hand 2D : 4D; the two other experiments reported no moderation of the 2D : 4D [16].

6 The RMET has a test–retest reliability of 0.7 [7].

7 Subsequent studies measuring testosterone in serum in significantly larger sample sizes demonstrate an earlier hormonal peak at 60 min post administration with subsequent stabilization [15,29].

8 Median testosterone levels (unlogged) of the testosterone group were 33.5 times that of the placebo group post-treatment.

9 These include RMET baseline scores, portion A or B, 2D : 4D and treatment interactions, cognitive reflection task (CRT) scores, maths abilities, mood and affective measures, treatment expectancy, age, marital status, sexual preference and all other measured hormones that were not influenced by testosterone treatment and may affect cognition and decision-making (e.g. cortisol [36,37]). The CRT control was added because performance is impaired by exogenous testosterone [30] and people with ASD outperform non-ASD age-matched controls [38].

10 These include CRT scores, factor 1 and 2 psychopathy measures, treatment expectancy, age, marital status, sexual preference and all other measured hormones that were not influenced by testosterone treatment.

Electronic supplementary material is available online at https://doi.org/10.6084/m9.figshare.c.4635512.