Scientific interest in whether women experience systematic psychological changes across their ovulatory cycle has increased in recent years. A substantial amount of research indicates that women’s sexual interests change across the ovulatory cycle. Although cycle shifts in sexual desire appear robust, with higher levels of desire during women’s fertile phase (e.g., Arslan, Schilling, Gerlach, & Penke, 2018; Grebe, Thompson, & Gangestad, 2016; Jones, Hahn, Fisher, Wang, Kandrik, & DeBruine, 2018; Roney & Simmons, 2013, 2016), there is ongoing discussion whether there are changes in mate preferences as well. According to the good-genes-ovulatory-shift hypothesis (GGOSH; Gangestad, Garver-Apgar, Simpson, & Cousins, 2007; Gangestad, Thornhill, & Garver-Apgar, 2005), women’s mate preferences should differ according to the mating context: When fertile, women should prefer men with characteristics indicative of good genes for sexual relationships. These preferences should be absent in the luteal phase (i.e., between ovulation and menstrual onset) and when evaluating men for long-term relationships (given that long-term bonding with these men can be costly, because they may be less willing to provide parental effort; Gangestad & Simpson, 2000).

Evidence for this hypothesis is mixed. Previous research has documented cycle shifts in women’s mate preferences for several physical and behavioral traits (for an overview, see Gildersleeve, Haselton, & Fales, 2014). However, changes in preferences for masculine faces, bodies, and voices did not replicate in more recent studies (e.g., Jones, Hahn, Fisher, Wang, Kandrik, Han, et al., 2018; Jünger, Kordsmeyer, Gerlach, & Penke, 2018; Jünger, Motta-Mena, et al., 2018; Marcinkowska, Galbarczyk, & Jasienska, 2018; Muñoz-Reyes et al., 2014). Moreover, two meta-analyses came to strikingly diverging conclusions on whether cycle effects exist (Gildersleeve et al., 2014; Wood, Kressel, Joshi, & Louie, 2014). Additionally, previously conducted studies have been criticized for potentially serious methodological problems, such as inappropriate sample sizes, use of between-participants designs, lack of direct assessments of steroid hormones, and failure to use luteinizing-hormone (LH) tests to validate women’s fertile phase (Blake, Dixson, O’Dean, & Denson, 2016; Gangestad et al., 2016). In sum, to clarify the scientific discourse about the existence of ovulatory-cycle shifts, there is a strong need for adequately designed and powered replications conducted in different interpersonal contexts.

The female ovulatory cycle is regulated by shifts in hormone concentrations. Estradiol rises in the fertile phase and decreases during the luteal phase, with a second, smaller peak in the midluteal phase. Progesterone levels are usually lower in the fertile phase and higher in the luteal phase. Therefore, cycle shifts in mate preferences should be mediated by natural within-women changes in hormone levels: higher estradiol and lower progesterone (Hypothesis 4). Because recent research suggests that progesterone effects on mate preferences are between women rather than within women (DeBruine, Hahn, & Jones, 2019; Marcinkowska, Kaminski, Little, & Jasienska, 2018), we also tested between-women hormone effects in an exploratory manner. An important variable that might affect the strength of ovulatory-cycle shifts is women’s relationship status. According to the dual-mating hypothesis, women may receive fitness benefits when forming a relationship with a reliably investing man while seeking good genes from other men through extra-pair sexual encounters (Pillsworth & Haselton, 2006). Because it remains unclear whether singles also pursue different mating strategies across the cycle, we formed two alternative hypotheses: Cycle shifts in preferences for short-term mates will be larger for partnered women than for single women (Hypothesis 5a), or, alternatively, relationship status will not affect the strength of cycle shifts in preferences for short-term mates (Hypothesis 5b).

In light of previous findings on ovulatory-cycle shifts, we hypothesized that fertile women, compared with women in their luteal phase, evaluate men’s behaviors as more attractive for sexual relationships (Hypothesis 1). Moreover, women’s mate preferences should shift across the cycle: When fertile, women should be more sexually attracted to men who show more overt flirting behavior, more self-displays, more direct gazes toward the women they are talking to, and more behavior that is consensually perceived as attractive (behavioral attractiveness; Hypothesis 2a). Furthermore, although this hypothesis was not preregistered, we expected comparable findings for behavioral cues of dominance, arrogance, assertiveness, confrontativeness, respectability, and likelihood of winning a physical fight. When evaluating long-term attractiveness, we expected preference shifts to be absent or only weakly present (Hypothesis 3). We predicted that our findings would be robust when we controlled for men’s age, physical attractiveness, and voice attractiveness. We also formed the alternative hypothesis that women’s mate preferences for sexual relationships would not shift (Hypothesis 2b).

In the current study, we set out to directly probe the GGOSH for men’s behaviors while overcoming previously reported methodological problems. In particular, we aimed to clarify (a) whether there are preference shifts for men’s behaviors across the ovulatory cycle, (b) which hormonal mechanisms might potentially mediate these effects, and (c) which moderators affect them.

In our preregistration, we explicitly stated our hypotheses, methods, recruitment strategy, and stopping rule. However, our preregistration was not fully explicit about our statistical analyses. Hence, we decided to run a number of robustness checks, which consist of analyses combining various reasonable analytical decisions. To substantiate that these choices were reasonable, we based them on procedures followed in previously published studies investigating cycle shifts in mate preferences or on suggestions we received during the review process. As described in our preregistration, all data analyses were done using multilevel modeling. Details can be found on our OSF project.

In addition, two trained research assistants coded objective male gazes (percentage of total amount of time the man looked the confederate directly in the face) using Observer software (Noldus, Leesburg, VA). Intraclass correlations were high (.99); thus, codings from both assistants were averaged. Additionally, men’s facial and vocal attractiveness were rated on 7-point Likert scales as control variables. For facial attractiveness, frontal face pictures with neutral facial expressions were rated by 15 independent undergraduate students. Interrater reliabilities were high (α = .91), so ratings were aggregated after z scoring. For vocal attractiveness, voice recordings (counting from 1 to 10) were rated by six trained research assistants, and ratings were aggregated afterward (α = .80). Behaviors varied substantially among the videos; descriptive values for all can be found in the Supplemental Material. More details about the rating and coding procedures can be found in Penke and Asendorpf (2008).
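The reliability check and z-scored aggregation described above can be illustrated with a minimal Python sketch. It is not the authors' code: the rating values are hypothetical, the function names are invented for illustration, and it assumes that z scoring was applied within each rater before averaging across raters, which the text does not spell out.

```python
from statistics import mean, stdev, variance


def cronbach_alpha(ratings):
    """Cronbach's alpha with raters treated as items.

    ratings: list of per-rater lists, each holding one rating per target.
    """
    k = len(ratings)
    totals = [sum(col) for col in zip(*ratings)]      # summed rating per target
    rater_variances = sum(variance(r) for r in ratings)
    return k / (k - 1) * (1 - rater_variances / variance(totals))


def zscore(values):
    """Standardize one rater's ratings (sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def aggregate(ratings):
    """Average z-scored ratings across raters (assumed within-rater z scoring)."""
    standardized = [zscore(r) for r in ratings]
    return [mean(col) for col in zip(*standardized)]


# Hypothetical 7-point ratings: three raters, five rated targets.
ratings = [
    [2, 4, 5, 3, 6],
    [1, 4, 4, 3, 7],
    [2, 5, 5, 2, 6],
]

alpha = cronbach_alpha(ratings)
scores = aggregate(ratings)
print(round(alpha, 2), [round(s, 2) for s in scores])
```

Standardizing within rater before averaging removes differences in how leniently or how widely each rater uses the scale, so that no single rater dominates the aggregate.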

Further ratings of the male behavior were collected separately later on. The following dimensions were rated: dominance, arrogance, assertiveness, confrontativeness, social respectability, and likelihood of winning a physical fight. Each dimension was separately rated by 10 independent raters (5 women, 5 men) on the basis of the 30-s videos using 7-point Likert scales. Interrater agreement was high (dominance: α = .88, arrogance: α = .71, assertiveness: α = .89, confrontativeness: α = .83, social respectability: α = .89, likelihood of winning a physical fight: α = .86); thus, ratings of all raters were aggregated for each dimension.

To assess the behaviors of all men, four independent, trained raters (two women, two men) who were unacquainted with the participants first rated the videos. Ratings were done using 7-point Likert scales for the 30-s sequences on the following behavioral dimensions: flirting behavior, self-displays, and behavioral attractiveness. Ratings were collected in two rounds, the first based on recordings from a side perspective, and the second based on the frontal recordings that were used as stimuli in the present study. In both rounds, videos were presented with audio. Interrater agreement was high (side perspective: αs = .84–.88; frontal perspective: αs = .85–.90); thus, ratings of all raters and both perspectives were aggregated.

Thirty-second-long sequences of videos of men in dyadic interactions, recorded in a study on sociosexuality (Penke & Asendorpf, 2008), were presented. We selected the videos of 70 men who were single at the time of the initial study out of a larger pool of 283 videos in total. For every video, a male participant was seated in a room with an attractive female confederate. They were instructed to get to know each other, while the experimenter left the room (see Penke & Asendorpf, 2008, for details). From each conversation, we extracted the sequence from 2 min to 2.5 min to avoid the potential awkwardness of the first moments and ensure that the interaction was in full flow. The participants saw the conversation from a camera recording over the shoulder of the female confederate, so they saw a frontal view of only the man in each interaction.

For hormone assays, we collected four saliva samples from each participant, one per testing session. Contamination of saliva samples was minimized by asking participants to abstain from eating, drinking (except plain water), smoking, chewing gum, or brushing their teeth for at least 1 hr before each session. Samples were visually inspected for blood contamination and stored at −80 °C directly after collection until shipment on dry ice to the Kirschbaum Lab at the Technical University of Dresden, where estradiol, progesterone, testosterone, and cortisol were assessed via liquid chromatography–mass spectrometry (LCMS). Estradiol levels could be detected by LCMS analysis in only 22% of the hormone samples. Therefore, all samples were reanalyzed using a highly sensitive 17β-estradiol enzyme immunoassay kit (IBL International, Hamburg, Germany). These latter estradiol values were used in subsequent analyses. We centered all hormone values on their participant-specific means and then scaled them (i.e., divided them by a constant) so that the majority of the distribution for each hormone varied from −0.5 to 0.5, to facilitate calculations in linear mixed models (e.g., Jones, Hahn, Fisher, Wang, Kandrik, & DeBruine, 2018; Jones, Hahn, Fisher, Wang, Kandrik, Han, et al., 2018). This is a common procedure to isolate effects of within-participants changes in hormones, avoiding the influence of outliers on results and dealing with the nonnormal distribution of hormone levels. It is also in line with the procedure followed by Jünger and colleagues (Jünger, Kordsmeyer, et al., 2018; Jünger, Motta-Mena, et al., 2018). Hormone levels were nearly normally distributed afterward; Figure S1 in the Supplemental Material available online shows the distribution of hormone levels after this procedure. Importantly, this procedure did not change any findings compared with analyses with untransformed hormone values.
The R code for this procedure can be found on our OSF project ( https://osf.io/8ntuc/ ). One woman had extremely high levels of progesterone and could be considered an outlier. However, results remained virtually identical when we excluded her from all hormone analyses. All analyses excluding this woman can be found on our OSF project.
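As a rough illustration of the centering and scaling step, the following Python sketch subtracts each participant's own mean from her hormone values and divides the result by a constant. It is not the R code from the OSF project: the hormone values, the participant labels, the function name, and the divisor of 400 are all hypothetical, chosen only so that the example values fall between −0.5 and 0.5.

```python
from statistics import mean


def center_and_scale(samples, scale):
    """Within-participant centering followed by division by a constant.

    samples: dict mapping participant ID -> list of hormone values,
             one value per testing session.
    scale:   constant divisor (an illustrative choice, not the authors' value).
    """
    result = {}
    for pid, values in samples.items():
        participant_mean = mean(values)
        result[pid] = [(v - participant_mean) / scale for v in values]
    return result


# Hypothetical progesterone values for two participants, four sessions each.
progesterone = {
    "p01": [40.0, 120.0, 50.0, 110.0],
    "p02": [200.0, 400.0, 220.0, 380.0],
}

within = center_and_scale(progesterone, scale=400.0)
for pid, values in within.items():
    print(pid, [round(v, 3) for v in values])
```

After centering, each participant's values sum to zero, so only within-participant fluctuation remains; stable differences between participants are carried by the participant means, which can be entered separately as a between-women predictor.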

For analyses of the main cycle phase, we excluded 45 participants because of negative LH tests in both cycles, irregular ovulatory cycles, or inappropriate scheduling of testing sessions (see Preliminary Analyses for more details), leaving a final sample of 112 women. Of these participants, 46 started with the first session in their luteal phase, and 66 started in their fertile phase. However, all 157 women were included in the robustness checks.

Women’s cycle phase was determined by the reverse-cycle-day method, which is based on the estimated day of the next menstrual onset (Gildersleeve, Haselton, Larson, & Pillsworth, 2012) and confirmed by highly sensitive (10 mIU/ml) urine ovulation test strips (Purbay Ovulation Tests, MedNet, Muenster, Germany), which measure LH. These LH tests had to be done at home on the estimated day of ovulation and the 4 days prior to that. We investigated two ovulatory cycles, in which each participant reported to the lab twice: once while being fertile (on the days immediately preceding ovulation, usually Reverse Cycle Day 16 to 18, with Reverse Cycle Day 16 as the most ideal date) and once when not fertile (during the luteal phase, after ovulation and prior to the next menstrual onset, usually Reverse Cycle Day 4 to 11, with Reverse Cycle Days 6 to 8 as the most ideal dates). Out of all participants who finished every session, 66 participants started the first session in their luteal phase, and 91 started in the fertile phase.
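The backward-counting logic can be sketched in a few lines of Python. This is a simplified illustration rather than the scheduling procedure actually used: the dates and function names are hypothetical, and it assumes that the day before the estimated next menstrual onset counts as Reverse Cycle Day 1. The target windows (Days 16 to 18 and Days 4 to 11) are taken from the description above.

```python
from datetime import date

FERTILE_DAYS = range(16, 19)  # Reverse Cycle Days 16 to 18
LUTEAL_DAYS = range(4, 12)    # Reverse Cycle Days 4 to 11


def reverse_cycle_day(session, next_onset):
    """Count days backward from the estimated next menstrual onset.

    Assumption: the day before the onset is Reverse Cycle Day 1.
    """
    days = (next_onset - session).days
    if days < 1:
        raise ValueError("session must precede the next menstrual onset")
    return days


def scheduled_phase(session, next_onset):
    """Classify a session date against the two target windows."""
    day = reverse_cycle_day(session, next_onset)
    if day in FERTILE_DAYS:
        return "fertile"
    if day in LUTEAL_DAYS:
        return "luteal"
    return "outside target windows"


onset = date(2017, 3, 30)  # hypothetical estimated next menstrual onset
print(scheduled_phase(date(2017, 3, 14), onset))  # 16 days before onset
print(scheduled_phase(date(2017, 3, 23), onset))  # 7 days before onset
```

Because the count depends on an estimated onset date, a classification of this kind is only provisional; in the study, fertile-phase estimates were additionally checked against the LH tests.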

Video clips were presented in a randomized order using the experimental software Alfred (Treffenstaedt & Wiemann, 2018), which is based on the programming language Python (Version 2.7; http://www.python.org). After watching each sequence, participants separately rated each man’s sexual attractiveness (to assess short-term attraction) and attractiveness for long-term relationships. Ratings were made on 11-point Likert scales from −5 (extremely unattractive) to 5 (extremely attractive), including 0 as a neutral point. Definitions of sexual attractiveness and attractiveness for a long-term relationship were provided prior to the rating task. Sexually attractive was defined as follows: “Men that score high would be very attractive for a sexual relationship that can be short-lived and must not contain any other commitment. Men scoring low would be very unattractive for a sexual relationship.” Attractiveness for a long-term relationship was defined as follows: “Men that score high would be very attractive for a committed relationship with a long-term perspective. Men that score low would be very unattractive as a long-term partner.” After each session, the appointment for the next session was arranged individually on the basis of the participant’s ovulatory cycle.

In the first testing session, participants saw a short preview video that presented facial pictures of all men they were about to rate for 1 s each. Participants were then instructed to evaluate the men in the following videos, which were the actual stimulus material, according to their attractiveness as they perceived it “in that moment,” independently of their own current relationship status or general interest in other men, and to rate the attractiveness of the men by focusing only on the behavior exhibited in the videos.

Sessions two to five were computer-based testing sessions and took place once during the fertile phase and once during the luteal phase for two consecutive cycles per participant. To control for possible effects of diurnal changes in hormone levels, we scheduled all sessions in the second half of the day (mainly between 11:30 a.m. and 6:00 p.m.). After arriving at the lab, participants first completed a screening questionnaire that assessed their eligibility and some control variables for saliva sampling (e.g., the last time participants had eaten something; Schultheiss & Stanton, 2009). Saliva samples were collected via passive drool before the participants started the first rating task. Participants also completed two other rating tasks in which they had to rate the attractiveness of men’s bodies or voices (see Jünger, Kordsmeyer, et al., 2018, and Jünger, Motta-Mena, et al., 2018, for detailed descriptions of these tasks). The order in which participants completed all rating tasks (the videos described in the current study, as well as bodies or voices as described by Jünger, Kordsmeyer, et al., 2018; Jünger, Motta-Mena, et al., 2018) was randomized between participants and sessions. Additionally, anthropometric data were collected between these tasks (a) to make sure that participants got breaks between the rating tasks and (b) as part of a larger study (see the preregistration).

All participants took part in five individually scheduled sessions. In the first introductory session, participants received detailed information about the general procedure, the duration of the study, and compensation. Furthermore, the experimenter explained the ovulation tests and checked the inclusion criteria. To count the days to the next ovulation and to plan the dates of the experimental sessions, we assessed cycle length as well as the dates of the last and the next menstrual onset. Finally, demographic data were collected.

In total, we recruited 180 participants, of whom 23 could not be included in the final sample: 17 women who attended only the introductory session of the study dropped out before participation (6 failed one of the inclusion criteria above, 4 quit the study, 4 did not respond to e-mails, and 3 had scheduling problems). Another 6 dropped out during the study because they completed only the first testing session (4 had scheduling problems, 2 did not respond to e-mails after the first session). One of the participants later reported that she was 35 years old. We included her data for robustness checks because she met all other inclusion criteria and had positive LH tests. Excluding her data did not change the results. One hundred fifty-seven heterosexual female participants (age: range = 18–35 years, M = 23.3, SD = 3.4) finished all sessions and could therefore be included in further analyses. At the beginning of the study, 75 of these participants reported that they were in a relationship, and 82 reported that they were single. Our sample vastly exceeded the size required to achieve 80% power given a within-participants design and anticipated effects of moderate magnitude (Cohen’s d = 0.5 with N = 48 for LH-test-validated cycle phases and two testing sessions per participant, suggesting that there was sufficient power to detect much smaller effect sizes in our study), as suggested by recent guidelines for sample sizes in ovulatory-shift research (Gangestad et al., 2016). After completing all sessions, participants received a payment of €80 or course credit.

We used the same sample as did Jünger and colleagues (Jünger, Kordsmeyer, et al., 2018; Jünger, Motta-Mena, et al., 2018). Participants were recruited following the same inclusion criteria as those used in other ovulatory-cycle studies and had to fit the following preregistered criteria: female, between 18 and 30 years old, and naturally cycling (i.e., not having taken hormonal contraception for at least 3 months, not unexpectedly switching to hormonal contraception during the study, not currently pregnant or breastfeeding, not having given birth or breastfed during the previous 3 months, not taking hormone-based medication or antidepressants). Additionally, they had to report that their ovulatory cycles had a regular length between 25 and 35 days during the last 3 months.

Our hypotheses, the study design, and the sampling and the analysis plan were preregistered online at the Open Science Framework (OSF; https://osf.io/m6pnz/) before any data on the women were collected or analyzed. This preregistration also contains further hypotheses that are not part of the present article, because they were written for a larger project. Data, analysis scripts, and instruction materials are also provided on our OSF project (https://osf.io/8ntuc/). All participants signed a written consent form, and the local ethics committee approved the study protocol (No. 144).

Results

Preliminary analyses

First, we checked how many of the participants’ ovulatory cycles had positive LH tests (showing an LH surge) in the calculated fertile phase to detect nonovulatory cycles. Twelve participants reported negative LH test results for both investigated cycles, and 9 reported negative LH test results for one cycle. In total, LH tests in 33 of all 314 cycles (10.5%) were negative. Next, we counted how many cycles were reported as being irregular, that is, where days of the testing sessions deviated from the prior defined phase of appropriate testing days by more than 3 days (see Ovulatory-Cycle Phase). Eight women reported irregular cycles in both investigated cycles, and 32 reported one cycle being irregular, resulting in 48 out of 314 (15.3%) cycles being irregular (despite all participants reporting having regular ovulatory cycles in the introductory session prior to the testing sessions). Additionally, we checked the temporal relationship between the reported day of LH surge and the date of scheduled testing session. Because ovulation usually occurs within 24 to 36 hr after the observed LH surge, testing sessions that were scheduled more than 2 days after the surge might have already been in the early luteal phase. Out of the 281 cycles for which an LH surge was observed, 13 (4.63%) purportedly fertile-phase sessions were scheduled 3 or 4 days after the LH surge. Therefore, 268 (95.37%) were scheduled within an appropriate range of 3 days before to 2 days after the LH surge (in total: M = −0.12, SD = 1.39 days in relation to the day of the observed LH surge). For a histogram showing the distribution of days of fertile-phase testing sessions relative to the observed LH surge, see Figure S2 in the Supplemental Material.
Participants with irregular cycles, negative LH tests, or the risk of early luteal phase instead of fertile-phase testing session were excluded from the main analyses (as cycle-phase estimates based on LH have a much higher validity than estimates based on backward counting alone; e.g., Blake et al., 2016; Gangestad et al., 2016) but included in robustness checks described in the Supplemental Material.
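The cycle counts and percentages reported in the preliminary analyses follow from the participant-level figures given there. The short Python sketch below retraces that arithmetic; every input number comes from the text, whereas the variable names and the helper function are illustrative only.

```python
participants = 157
total_cycles = participants * 2              # two investigated cycles each

# Negative LH tests: 12 women in both cycles, 9 women in one cycle.
negative_lh_cycles = 12 * 2 + 9
# Irregular cycles: 8 women in both cycles, 32 women in one cycle.
irregular_cycles = 8 * 2 + 32

# Cycles with an observed LH surge, and fertile-phase sessions
# scheduled 3 or 4 days after that surge.
surge_cycles = total_cycles - negative_lh_cycles
late_sessions = 13
in_range_sessions = surge_cycles - late_sessions


def pct(part, whole, digits):
    """Percentage of `part` in `whole`, rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


print(total_cycles, negative_lh_cycles, pct(negative_lh_cycles, total_cycles, 1))
print(irregular_cycles, pct(irregular_cycles, total_cycles, 1))
print(surge_cycles, pct(late_sessions, surge_cycles, 2),
      in_range_sessions, pct(in_range_sessions, surge_cycles, 2))
```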

Hormonal mechanism potentially underlying preference shifts

To investigate possible effects of steroid hormones underlying cycle shifts in women’s mate preferences (Hypothesis 4), we used sexual-attractiveness ratings as the dependent variable in our multilevel model. We entered within-women and between-women estradiol and progesterone, together with either the competitiveness factor (Model 5) or the courtship factor (Model 6), as fixed effects; female participants and male stimuli as random effects; and random slopes for estradiol, progesterone, and the respective behavioral factor varying within participants. These models did not converge; thus, random-slope variance was reduced as described above (details can be found on our OSF project). Results revealed no significant interaction of within-women or between-women estradiol or progesterone and the behavioral factors, indicating that women’s mate preferences for specific cues in men’s behavior did not shift because of within-women changes in estradiol or progesterone, contradicting Hypothesis 4. However, there were significant main effects for both factors on sexual-attractiveness ratings (Table 2). The effects were comparable for ratings of long-term attractiveness (Table S25B in the Supplemental Material), except for significant interaction effects between the behavioral factors and between-women progesterone levels, in that both factors were rated as being more attractive for long-term relationships when between-women progesterone levels were lower (competitiveness: p = .029; courtship: p = .021).

Table 2. Multilevel Regression Analyses of Sexual-Attractiveness Ratings as a Function of Within-Women (WW) or Between-Women (BW) Estradiol or Progesterone and Behavioral Factors

Results were virtually identical when we computed the model with a global factor as predictor variable instead (Table S39 in the Supplemental Material) or when we added estradiol-to-progesterone ratio rather than estradiol and progesterone as predictor variables (Table S17 in the Supplemental Material). All results were virtually identical when we controlled for men’s age, physical attractiveness, and voice attractiveness (Tables S16 and S30 in the Supplemental Material).