General procedure

Two hundred forty-three males, aged 18–55 (M = 23.63, SD = 7.22), mostly (217, 89%) students at a private Southern California consortium, participated in the study. Non-student participants were community members from surrounding cities. For pre-screening criteria, see Supplementary Methods; detailed demographic characteristics of the two treatment groups are available in Supplementary Table 1. The institutional review boards of Caltech and Claremont Graduate University approved the study, all participants gave informed consent, and no adverse events occurred.

Similar to a previously described study (42), participants arrived at the lab at 9 a.m., signed informed consent forms, and proceeded to a designated room for hand scanning (see Supplementary Methods). Participants were then randomly assigned to private cubicles, where they completed demographic and mood questionnaires (see Supplementary Methods) and provided an initial saliva sample by passive drool. Next, participants proceeded to gel application (further details below), after which they were instructed to refrain from bathing, any activity that might cause excessive perspiration, and direct physical contact with females before the afternoon session; finish eating no later than 1 p.m.; and return to the lab promptly at 1:55 p.m. well hydrated. Participants were given printed material containing these precautions and instructions prior to dismissal.

Participants returned to the lab at 2 p.m. (no participant was late), provided a second saliva sample, and began the behavioral experiment in the same cubicle they had occupied in the morning. The experiment lasted approximately 2 h and consisted of a battery of behavioral tasks. None of the tasks included feedback about monetary payoffs or performance, with one exception: the final task included feedback regarding participants’ performance relative to others. We conducted a battery of tasks to maximize the knowledge gained from each human participant undergoing a pharmacological manipulation, a standard practice (58, 59). The two tasks reported in the paper were focal and were therefore conducted at the outset, immediately after participants’ arrival at the lab in the afternoon and the first post-treatment saliva sample. On average, participants earned $68.12 USD (SD = $17.36) for participating in the experiment; payouts varied as a function of performance in some of the other tasks.

To standardize hormonal measurements among participants, we did not randomize the order of the behavioral tasks, in line with previous studies (58, 59). Following the experiment, participants completed an exit survey, in which they indicated their beliefs about the treatment they had received using a five-point scale, and were privately paid in cash.

Testosterone administration

Participants were escorted in groups of two to six to a semiprivate room where a research assistant provided a small plastic cup containing 10 g of clear gel and stated that it was equally likely to contain T or placebo. The cups were filled in advance by the lab manager, who did not interact with participants and did not reveal the contents of the cup to the assistant. The gel contained either topical T 1% (2 × 50 mg packets Vogelxo® by Upsher-Smith) or the volume equivalent of an inert placebo (80% alcogel, 20% Versagel®). Participants were instructed to remove the clothing from their upper bodies and apply the entire contents of the gel container to their shoulders, upper arms, and chest, as demonstrated by the research assistant, and were told to wait until the gel fully dried before putting their clothes back on.

We chose to administer T using topical gel because this was the only T administration method for which the pharmacokinetics of a single-dose administration had been investigated at the time (41). That study demonstrated that plasma T levels peaked 3 h after single-dose exogenous topical administration, and that T measurements stabilized at high levels during the time window between 4 and 7 h following administration. Therefore, we had all participants return to the lab 4.5 h after receiving the gel, when androgen levels were elevated and stable.

Saliva sampling

Each participant provided four saliva samples by passively drooling into a plastic tube, at predetermined sampling times throughout the study: (1) before treatment administration; (2) upon return to the lab, just prior to starting the behavioral tasks; (3) in the middle of the behavioral tasks battery; and (4) following the one and only task involving performance feedback, at the end of the experiment. We used saliva samples to avoid potential stress that might be induced by high-resolution blood drawing throughout the experimental session. Each saliva sample was time stamped. No food or drinks were allowed into the laboratory, and the only water given to the participants was after their third saliva draw (an hour before the fourth and final saliva draw).

To allow robust manipulation checks and to control for participants’ biological state, we used LC-MS/MS (detection levels and precision are available in Supplementary Table 2) to measure the following salivary steroids: estrone, estradiol, estriol, testosterone, androstenedione, DHEA, 5-alpha DHT, progesterone, 17OH-progesterone, 11-deoxycortisol, cortisol, cortisone, and corticosterone. A series of one-sample Kolmogorov–Smirnov tests for conformity to a Gaussian distribution (Supplementary Table 3) indicated that all hormonal measurement distributions were better approximated by a Gaussian distribution after log-transformation, as indicated by higher p-values. Thus, all hormonal measurements were log-transformed prior to data analysis. We provide further technical details of the procedure and analysis of hormonal changes following the treatment in Supplementary Table 3.
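The logic of this check can be illustrated with a minimal sketch. It is not the authors' analysis script: it uses synthetic right-skewed values rather than study data, fits the Gaussian using the sample mean and SD (an assumption), and compares the Kolmogorov–Smirnov distance D instead of p-values. For a fixed sample size, a smaller D corresponds to a higher p-value.

```python
import math
from statistics import NormalDist, mean, stdev

def ks_statistic_vs_gaussian(values):
    """KS distance between the sample and a Gaussian with the sample's mean and SD."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Compare the fitted CDF with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

# Synthetic "hormone levels" (NOT study data): exact quantiles of a log-normal
# distribution, so the raw values are right-skewed by construction.
n = 200
raw = [math.exp(NormalDist(4.0, 0.8).inv_cdf((i + 0.5) / n)) for i in range(n)]
logged = [math.log(v) for v in raw]

d_raw = ks_statistic_vs_gaussian(raw)
d_log = ks_statistic_vs_gaussian(logged)
print(f"KS D (raw) = {d_raw:.3f}; KS D (log-transformed) = {d_log:.3f}")
```

In this toy example the log-transformed values lie much closer to a Gaussian than the raw values, which is the pattern the paragraph above describes for the hormonal measurements.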

Task 1: Testosterone’s effect on preference for brands high in social rank

In a pretest, we presented 184 students of a private Southern California college (with demographic characteristics similar to those of our participants) with the logos of 15 familiar apparel brands in a randomized, counterbalanced order. Participants rated each brand’s association with quality and social rank using 100-point subscales (0 = not at all to 100 = very much). Social rank was constructed by averaging three items related to status (status, conspicuousness, prestige; 60, 61) and three items related to power (power, performance, control; 62–64).

Importantly, the first task did not allow us to directly disentangle status-enhancement and power-enhancement motives: based on the pretest data, we could not identify any brand pairs that were perceived differently with respect to their status and power associations. We therefore averaged the six items into a general measure of social rank associations. Our data indicated that brands high in social rank were typically also perceived as high in quality. However, we were able to identify five pairs of brands for which the difference in social rank associations was significantly greater than the difference in quality associations. Supplementary Table 5 summarizes these pairs, along with their perceived social rank and quality associations among the experiment participants. To mitigate the possibility that participants would guess the study’s purpose, the task included an additional pair for which both brands were associated with lower social rank (Gap vs. H&M).

In the experimental task, we presented participants with the five brand pairs in a randomized, counterbalanced order (Fig. 2a). One brand appeared on the left side of the screen and the other on the right (sides were randomized and counterbalanced). Participants indicated which of the two brands they preferred using a 10-point Likert rating scale (1 = strongly prefer the left brand, 10 = strongly prefer the right brand). For standardization, we z-scored the ratings at the question level; all of the results are robust to this analytical choice.
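Z-scoring at the question level means standardizing each brand pair's ratings across participants. The sketch below illustrates this under stated assumptions: the ratings and the keys `pair_1` and `pair_2` are made up, and the use of the sample SD (n − 1 denominator) is an assumption, since the text does not specify it.

```python
from statistics import mean, stdev

def zscore_by_question(ratings):
    """Standardize each question's ratings to mean 0 and SD 1 across participants."""
    standardized = {}
    for question, values in ratings.items():
        m = mean(values)
        s = stdev(values)  # sample SD (n - 1 denominator); an assumption
        standardized[question] = [(v - m) / s for v in values]
    return standardized

# Hypothetical 1-10 preference ratings, one list per brand pair (question),
# one entry per participant. These are illustrative numbers, not study data.
ratings = {
    "pair_1": [2, 5, 7, 9, 4],
    "pair_2": [6, 6, 8, 3, 10],
}

z = zscore_by_question(ratings)
for question, values in z.items():
    print(question, [round(v, 2) for v in values])
```

After this step every question has the same mean and spread, so ratings for different brand pairs can be pooled on a common scale.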

We followed the behavioral task with a survey that examined the participants’ associations with the brands used. We showed participants all brands in a randomized order and asked them to rate their associations with quality and social rank (i.e., power and status) using 100-point scales. We constructed a social rank scale by averaging their power and status ratings in a similar fashion to the pretest. This scale allowed us to examine whether T affects preferences for social-rank-enhancing brands rather than forming social rank associations.

Using the post-experimental survey, we conducted a manipulation check to verify that, for the participants of our main study, our pairs of brands differed in social rank associations more than they differed in quality associations (Supplementary Table 5). Paired t-tests comparing the within-pair differences in social rank associations with the within-pair differences in quality associations showed that the former were greater than the latter for all of the brand pairs (all p’s < 0.001).
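For a single brand pair, this manipulation check amounts to a paired t-test on two per-participant difference scores. The sketch below shows the computation with hypothetical ratings (not study data); the helper name `paired_t` is introduced here for illustration. It returns only the t statistic and degrees of freedom; obtaining a p-value would require the t distribution from a statistics library.

```python
import math
from statistics import mean, stdev

def paired_t(diff_a, diff_b):
    """Paired t statistic and degrees of freedom for diff_a versus diff_b."""
    d = [a - b for a, b in zip(diff_a, diff_b)]
    n = len(d)
    t = mean(d) / (stdev(d) / math.sqrt(n))
    return t, n - 1

# Hypothetical 0-100 ratings per participant for one brand pair:
# (rank of brand A, rank of brand B, quality of brand A, quality of brand B)
ratings = [
    (80, 50, 70, 60),
    (75, 50, 65, 50),
    (90, 50, 70, 60),
    (85, 50, 80, 60),
    (70, 50, 60, 55),
    (80, 50, 72, 60),
]

# Within-pair differences for each participant.
rank_diff = [ra - rb for ra, rb, _, _ in ratings]
quality_diff = [qa - qb for _, _, qa, qb in ratings]

t_stat, df = paired_t(rank_diff, quality_diff)
print(f"t({df}) = {t_stat:.2f}")
```

A positive t statistic indicates that the two brands in the pair differ more in social rank associations than in quality associations.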

To rule out the possibility that T administration affected brand associations rather than preferences for these brands, we tested for effects of T on the perceived quality and social rank associations of the brands, using two-sample t-tests (Supplementary Table 6). We found no reliable treatment effects on any of the brands’ rank perceptions (all p’s > 0.25). Only one of the seven brands showed a significant difference (at the α = 0.05 level, uncorrected) in quality perception. Thus, T did not appreciably influence how our participants perceived the brands.

Task 2: Testosterone’s effect on attitudes toward goods associated with status, power, or quality

For each of six goods, we composed three different text descriptions, describing the goods as either power-enhancing, status-enhancing, or high in quality. The descriptions included the goods’ images accompanied by the text (all ads are available online on the project’s Open Science Framework page). We pretested the different text descriptions in two separate online surveys (N = 714 and N = 744 Amazon Mechanical Turk users). Participants saw one of the three descriptions for each of the goods in a counterbalanced, randomized order and reported to what extent the descriptions and the goods conveyed status, power, and quality on a 10-point Likert scale (0 = not at all, 10 = very much). As in the pretest for Task 1, respondents rated the descriptions’ and goods’ associations with quality, three items related to power, and three items related to status. The pretest results are summarized in Supplementary Fig. 1.

In the experimental task, we manipulated social rank (i.e., power and status) and quality associations for identical goods and investigated whether T administration altered attitudes toward these goods. (We included high quality as a third condition to conceptually replicate the findings of Task 1 and to assess the extent to which social rank-promoting behaviors might stem from a preference for characteristics typically associated with high-end options, such as quality, as opposed to deeper psychological motives directly tied to social rank promotion. This is important because perceived quality is often influenced by price and brand effects.)

We presented each participant with one of the three text descriptions of each good. Each participant saw the text descriptions for all six of the goods, such that two of the descriptions focused on quality, two on power, and two on the status features of the goods. We randomly assigned each participant to one of three groups that saw a different combination of goods × text description interaction (i.e., a third of the participants saw the status description, another third the power description, and another third the quality description for each of the goods, such that two out of the six descriptions were in each description condition for each participant). This resulted in a 2 (T/placebo treatment, between participants) × 3 (description condition, between participants) factorial design repeated over six good categories (within participant).
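One way to obtain a counterbalancing scheme with these properties is a Latin-square rotation, sketched below. This is an illustration under assumptions: the exact rotation the authors used is not stated, and the good names are placeholders.

```python
from collections import Counter

CONDITIONS = ["status", "power", "quality"]
GOODS = [f"good_{i}" for i in range(1, 7)]  # placeholder names for the six goods

def assignment_for_group(group):
    """Map each good to a description condition by rotating conditions across groups."""
    return {
        good: CONDITIONS[(index + group) % len(CONDITIONS)]
        for index, good in enumerate(GOODS)
    }

# One assignment per counterbalancing group (0, 1, 2).
groups = [assignment_for_group(k) for k in range(3)]

for k, assignment in enumerate(groups):
    # Each group should see exactly two goods per description condition.
    print(f"group {k}: {assignment} counts={dict(Counter(assignment.values()))}")
```

With this rotation, each group sees two goods under each description condition, and each good appears under all three descriptions across the three groups.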

Participants reported their attitudes toward each good (e.g., “What is your attitude toward Alpina watches?”) using three 10-point Likert scales (1 = unfavorable, 10 = favorable; 1 = bad, 10 = good; 1 = negative, 10 = positive) that were averaged to create a single attitude score. The attitude score was z-scored at the text description level (all results are robust to this analytical choice). In addition, we asked participants to report hypothetical purchase intentions (10-point Likert scale) and WTP (open text entry); the two measures were z-scored and averaged to create an index for hypothetical purchasing behavior. We found that the attitude ratings predicted both hypothetical purchase intentions (r(1,446) = 0.329, 95% CI = [0.282, 0.374], p < 0.001) and WTP (r(1,441) = 0.249, 95% CI = [0.200, 0.297], p < 0.001).
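Confidence intervals for correlations of this kind are commonly obtained with the Fisher z-transformation. The sketch below applies it to the reported coefficients. Two assumptions are made: that the number of observations equals the reported degrees of freedom plus two, and that this method is appropriate here. The text does not state how the reported intervals were computed.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Confidence interval for a Pearson correlation via the Fisher z-transformation."""
    z = math.atanh(r)                  # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)        # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

# Reported r values and degrees of freedom; n = df + 2 is an assumption.
ci_intent = fisher_ci(0.329, 1446 + 2)
ci_wtp = fisher_ci(0.249, 1441 + 2)

print("purchase intentions:", [round(v, 3) for v in ci_intent])
print("WTP:", [round(v, 3) for v in ci_wtp])
```

Under these assumptions, the computed intervals agree with the reported ones to three decimal places.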

To account for two sources of variance in purchase intentions, we included two task-related questions in our post-experiment survey, after completion of the full experimental battery. First, we asked participants whether they already owned goods in the target category. Second, we asked participants about their general buying intentions for the category (i.e., participants were asked “Within the next month, how likely are you to purchase goods of the following categories?” measured on 1–10 Likert scales). Our regression models included controls for these two factors, both of which were highly significant predictors (p’s < 0.001) of hypothetical purchasing behavior.

Data analysis

Data were analyzed using linear mixed-effects regression models with item-specific and participant-specific random effects (65). All estimated models and their detailed results across experimental tasks are available in the Supplementary Information online.

Availability of materials and data. Materials, data, and analysis scripts are available on the project’s Open Science Framework (OSF) page: https://osf.io/jqmnx.