Participants

Fifty-three older adults (34 women, mean age: 70.9 years, SD: 4.4; 19 men, mean age: 69.4 years, SD: 4.9) volunteered to participate in the study. These participants all completed health screening to minimise the risk of an adverse event occurring during the exercise test. Screening information was reviewed by a cardiologist, resulting in 25 older adults being excluded from participating in aerobic fitness testing (for details on exclusion criteria, see below). Consequently, 28 older adults (20 women, mean age: 70.3 years, SD: 4.4; 8 men, mean age: 67.6 years, SD: 5.0) completed the aerobic fitness test and the language experiment. The average education level of the 28 older adults was 16.4 years (SD: 3.2) of formal education, which in the UK starts at age 4. The average height was 165.4 cm (SD: 10.5) and the average weight was 66.9 kg (SD: 10.7). We obtained a Montreal Cognitive Assessment (MoCA) score for all but 3 older adults; all of those assessed scored 26 or higher, which is considered normal.

To provide a baseline against which to compare the older adults’ language abilities, 27 young participants (19 women, mean age: 23.4 years, SD: 3.9; 8 men, mean age: 22.9 years, SD: 2.5) completed the language experiment. The young participants did not complete aerobic fitness testing. All young participants were enrolled as students at the University of Birmingham at the time of testing.

All participants gave informed consent and were monetarily compensated for participation. All were non-bilingual native British English speakers with no speech or language disorders and no dyslexia. The research was conducted at the University of Birmingham. The research had full ethical approval (UoB ERN_16-0230) and all experimentation was performed in accordance with the relevant guidelines and regulations.

Electrocardiogram (ECG) and general health screening

All 53 older adults underwent a screening procedure prior to the exercise testing. The evaluation consisted of a general health questionnaire, a resting 12-lead electrocardiogram (ECG) assessment and a resting blood pressure measurement. This information was then reviewed by a cardiologist (MR). Participants who revealed a contraindication to non-medically supervised exercise testing in the general health questionnaire (N = 5; e.g., heart condition, family history of heart attack, asthma, prevention medication for stroke), had high resting blood pressure (N = 3; systolic > 160, diastolic > 90), or showed ECG abnormalities (N = 12; e.g., ST depression, abnormal QRS axis, abnormalities of cardiac rhythm) were excluded from the aerobic fitness testing and referred to their GP.

Furthermore, for 3 older adults who passed the screening protocol, exercise testing on the ergometer had to be terminated before a fitness score was obtained because the participants experienced knee pain. Two older adults chose to withdraw from the study after screening. As a result, fitness scores were obtained for 28 older adults.

In the retained sample of 28 older adults who completed fitness testing, 4 older adults were taking antihypertensive medication, 3 older adults were taking cholesterol-lowering medication, and 1 older adult was taking both types of medication.

Aerobic fitness testing

After screening and inclusion into the study, participants completed a graded sub-maximal aerobic fitness test on a cycle ergometer to estimate maximal oxygen consumption (\(\dot{V}\mathrm{O}_{2\max}\)). The sub-maximal fitness test was based on the Åstrand-Ryhming Cycle Ergometer Test, which has been shown to provide a reliable and valid estimate of \(\dot{V}\mathrm{O}_{2\max}\) 35. Sub-maximal estimation of maximal aerobic power is a standard procedure for measurement of fitness in sedentary older adults and clinical populations.

For this test, participants were asked to cycle on an electromagnetically braked cycle ergometer at 60–70 rpm (revolutions per minute). The initial workload was 35 watts and then, depending on the participant’s sex, body mass and habitual physical activity levels, the workload increased in increments of 20 to 35 watts every three minutes. This continued until heart rate reached 80% of the participant’s estimated maximum heart rate (i.e., 220 minus the participant’s age), until the participant was unable to maintain a cadence above 50 rpm, or until the participant reached volitional exhaustion. Respiratory gases and volume were collected for measurement of the rate of oxygen consumption (\(\dot{V}\mathrm{O}_{2}\)). Maximal \(\dot{V}\mathrm{O}_{2}\) was then estimated from the relationship between oxygen uptake and heart rate across multiple measurements. The resulting regression equation predicted each participant’s \(\dot{V}\mathrm{O}_{2\max}\) (as per the standard procedure: Guiney, et al.28, Siconolfi, et al.35). Prior to this test, participants were asked to abstain from heavy physical exercise and alcohol for 24 hours. They were also instructed not to consume food for 2 hours prior to reporting to the laboratory.
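The estimation step can be illustrated with a minimal sketch. It assumes that the heart rate and oxygen uptake relationship is fitted with an ordinary least-squares line and extrapolated to the age-predicted maximum heart rate (220 minus age); the exact regression procedure follows the cited references and may differ. The function names, the stage data and the unit (ml/kg/min) are hypothetical and chosen for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def stop_heart_rate(age):
    """Heart rate that ends the test: 80% of the age-predicted maximum."""
    return 0.8 * (220 - age)


def estimate_vo2max(age, heart_rates, vo2_values):
    """Extrapolate the fitted HR-VO2 line to the age-predicted maximum heart rate.

    Assumption: a simple linear extrapolation; not necessarily the exact
    equation used in the study.
    """
    hr_max = 220 - age
    intercept, slope = fit_line(heart_rates, vo2_values)
    return intercept + slope * hr_max


# Hypothetical stage measurements for a 70-year-old participant.
hr = [90, 100, 110, 120]          # beats per minute
vo2 = [12.0, 14.0, 16.0, 18.0]    # oxygen uptake (assumed unit: ml/kg/min)
vo2max = estimate_vo2max(70, hr, vo2)
```

In this made-up example the test would stop once heart rate reached `stop_heart_rate(70)`, and the fitted line is then read off at the predicted maximum heart rate rather than at any heart rate the participant actually attained.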

For female participants, the average predicted \(\dot{V}\mathrm{O}_{2\max}\) score was 23.32 (SD = 7.04) with values ranging from 9.4 to 35.1. For male participants, the average predicted \(\dot{V}\mathrm{O}_{2\max}\) score was 31.16 (SD = 6.55) with values ranging from 24.7 to 44.1. It is a standard finding that males have higher \(\dot{V}\mathrm{O}_{2\max}\) scores than females36. In general, males have larger body mass (including lung size and cardiovascular capacity) than females, so direct comparison of the raw \(\dot{V}\mathrm{O}_{2\max}\) score for a male and a female is not valid37. Scores therefore must be normed or standardized. We calculated z-scores within each sex group for the purpose of relating \(\dot{V}\mathrm{O}_{2\max}\) scores to tip-of-the-tongue occurrence. Using the standardized score allows male and female participants to be placed on a single common dimension with regard to \(\dot{V}\mathrm{O}_{2\max}\) scores.
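Standardizing within sex can be sketched as follows. This is a minimal illustration that assumes the sample standard deviation (n - 1 denominator) is used; the function name and the scores are hypothetical and are not the study data.

```python
from statistics import mean, stdev


def zscores_within_group(scores, groups):
    """Standardize each score against the mean and SD of its own group."""
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    # Assumption: sample SD (n - 1 denominator) within each group.
    stats = {g: (mean(v), stdev(v)) for g, v in by_group.items()}
    return [(s - stats[g][0]) / stats[g][1] for s, g in zip(scores, groups)]


# Hypothetical predicted scores for three women and three men.
scores = [20.0, 24.0, 28.0, 27.0, 31.0, 35.0]
sexes = ["F", "F", "F", "M", "M", "M"]
z = zscores_within_group(scores, sexes)
```

In this toy example a woman scoring 28 and a man scoring 35 receive the same z-score, because each lies one standard deviation above the mean of their own group.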

Tip-of-the-tongue experiment

Participants completed a definition filling task: a definition appeared on screen, and participants were asked to indicate whether they knew the word (No/Yes, produce the word) or had a tip-of-the-tongue experience.

The definition materials consisted of 20 definitions of low frequency words (adapted from Jones38), 20 questions about people famous in the UK, such as authors, politicians and actors (some adapted from39), and 20 definitions of easy words – see Table 3 for examples. Each participant received the 60 definitions in a random order.

Table 3. Examples of definitions, target words and foils for the multiple-choice questions shown when participants indicated that they had experienced a tip-of-the-tongue state.

The sequence of events on each trial was as follows. A warning signal was displayed for 500 ms after which a definition appeared centred on the screen. The definition remained on screen until the participant responded as follows: they knew the word (button press ‘Yes’, and then said the word out loud), did not know the word (button press ‘No’), or had a tip-of-the-tongue experience (button press ‘ToT’). In the instructions to the participants we defined a tip-of-the-tongue experience as: “Usually we are sure if we know or don’t know a word. However, sometimes we feel sure we know a word but are unable to think of it. This is known as a ‘tip-of-the-tongue’ experience”.

If participants indicated that they were experiencing a tip-of-the-tongue state, they were asked to provide three pieces of information about the sound structure of the word in response to on-screen prompts: 1) guess the initial letter or sound; 2) guess the final letter or sound; and 3) guess the number of syllables. Finally, in order to determine whether they were correct in thinking that they knew the target word, participants were asked to select it from a list of four words displayed on the screen (the correct answer and three foils; see Table 3 for examples) or to indicate that the word they were thinking of was not in the list.

Data analyses

We analysed the data using mixed effects models, which are an extension of classical linear regression models. Mixed effects models are the most suitable models to analyse the present dataset because they can account for the fact that there are repeated observations for both items and participants. We modelled tip-of-the-tongue occurrence using mixed effects logistic regression40,41 in R42.

For a categorical outcome variable such as tip-of-the-tongue occurrence, logistic regression is much better suited than ANOVA to model the data40. Using ANOVA models when the dependent variable is categorical (e.g., yes/no, counts, percentages) can lead to spurious significance values40,43,44. In such instances, regression methods are thus preferred45. However, ordinary regression analysis ignores the correlation of observations within clusters and treats within-cluster observations the same as between-cluster observations, producing invalid standard errors for the fitted coefficients46. Any subsequent analysis based on these standard errors (e.g., a hypothesis test) is therefore invalid. Mixed effects models account for the fact that there are repeated observations for both items and participants40,41 and are therefore used frequently in the psycholinguistic literature.
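As an illustration of the general form of such a model, a mixed effects logistic regression with crossed random intercepts for participants and items can be written as below. This is a sketch only: the notation is introduced here for illustration, and showing random intercepts without random slopes is an assumption, since the random effects structure actually retained depends on the convergence procedure.

\[
\operatorname{logit} P(\mathrm{ToT}_{ij} = 1) = \beta_0 + \sum_{k} \beta_k x_{kij} + u_i + w_j, \qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad w_j \sim \mathcal{N}(0, \sigma_w^2)
\]

Here \(\mathrm{ToT}_{ij}\) indicates whether participant \(i\) reported a tip-of-the-tongue state on item \(j\), \(x_{kij}\) are the fixed-effect predictors, and \(u_i\) and \(w_j\) capture the dependence among repeated observations from the same participant and the same item, respectively.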

In addition to modelling tip-of-the-tongue occurrence, we also fitted a model for phonological access scores. We calculated a phonological access score for each trial on which the participant reported a tip-of-the-tongue: 1 point for listing the correct initial sound, 1 point for the correct final sound and 1 point for the correct number of syllables (resulting in a score between 0 and 3). Phonological access scores were modelled using mixed effects linear regression41 in R42, again to account for the fact that there are repeated observations for both items and participants.
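The scoring rule can be sketched as follows. The dictionary representation, the function name and the example target are hypothetical, and matching guesses by exact equality is a simplification (participants could report either a letter or a sound).

```python
def phonological_access_score(guess, target):
    """Return 0-3: one point each for initial sound, final sound, syllable count."""
    features = ("initial", "final", "syllables")
    return sum(guess[f] == target[f] for f in features)


# Hypothetical trial: the target word is "abacus".
target = {"initial": "a", "final": "s", "syllables": 3}
guess = {"initial": "a", "final": "t", "syllables": 3}
score = phonological_access_score(guess, target)
```

In this example the participant guessed the initial sound and the number of syllables correctly but not the final sound, giving a score of 2.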

The regression models for tip-of-the-tongue occurrence and phonological access scores were based on the following predictors: Number of Phonemes of the Target Word, Number of Syllables of the Target Word, Vocabulary Size (% of items named correctly), Education Level (years of formal education), and Age Group (young vs. older adults)/Age (in years). When modelling tip-of-the-tongue occurrence in the older adult group, we also included the standardized \(\dot{V}\mathrm{O}_{2\max}\) scores (see above) as a predictor.

Continuous variables were centered. Group was deviation coded. We started with a maximal random effects structure as justified by the design and, in the case of non-convergence, simplified it by progressively dropping the random effect associated with the smallest variance until convergence was reached41. For the fixed effects, we started with a model including all fixed effects and then simplified it in a stepwise fashion using model comparison until we reached the model with the lowest AIC (Akaike information criterion) value.
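The predictor coding and the selection criterion can be illustrated with a minimal sketch. It assumes that deviation coding assigns -0.5 and +0.5 to the two groups (a common convention, although other scalings exist) and uses the standard definition AIC = 2k - 2 ln(L). All function names, values and candidate models below are hypothetical.

```python
def center(values):
    """Subtract the mean so that zero corresponds to the average value."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def deviation_code(labels, reference):
    """Code a two-level factor as -0.5 (reference level) and +0.5 (other level).

    Assumption: +/-0.5 scaling; the source only states 'deviation coded'.
    """
    return [-0.5 if label == reference else 0.5 for label in labels]


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical values for illustration only.
education_centered = center([16, 18, 14])
age_group = deviation_code(["young", "older", "older"], reference="young")

# Hypothetical candidate models: (log-likelihood, number of parameters).
candidates = {"full": (-100.0, 8), "reduced": (-101.0, 5)}
best = min(candidates, key=lambda name: aic(*candidates[name]))
```

In this toy comparison the reduced model is preferred: its fit is slightly worse, but the penalty for its three fewer parameters is smaller by more than that loss.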

Main models are summarized in tables; coefficient estimates are included in the text only when a full summary is not included in the tables.

Sample size estimation and power analysis

No previous study has investigated the effects of aerobic fitness on any aspect of language functioning; previous studies have so far reported effects of aerobic fitness only in other cognitive domains: cognitive control, executive functioning, visuo-spatial memory, learning and processing speed (Colcombe & Kramer, 2003). Thus there is no direct evidence on which we could base a hypothesized effect size prior to conducting the present study. Rather than postulating a hypothesized effect size based on indirect evidence, we performed a power analysis simulation study, which can serve as a basis for future work. Indeed, any result of a power calculation depends entirely on the size of the hypothesized effect, for which it is impossible to obtain an accurate estimate until direct evidence from a first study is available. The current research may then serve as a benchmark for future studies.

We investigated the power to detect a non-zero effect of aerobic fitness on the probability of experiencing a tip-of-the-tongue state by means of a simulation study under the statistical model describing the present data. As such, we can, based on informative evidence, investigate the Type II error rate, which depends on both the effect size and the sample size. The simulation study allows us to generate data independent of the data described in the current work. Moreover, simulation studies are an effective way of obtaining a power estimate for complex models47,48,49 and the approach has been used in the field of psychology in recent years50,51,52. The results of our simulation will serve as an important indicator for determining sample size in future studies. Further details are described in the results section.
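The logic of a simulation-based power estimate can be sketched as follows. This is a deliberately simplified stand-in and not the simulation reported here: data are generated from a logistic model with a participant random intercept, but each simulated study is analysed at the participant level with a Fisher z-test of the correlation between fitness and tip-of-the-tongue rate, rather than by fitting the mixed effects logistic model. All function names and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_study(n_participants, n_items, beta0, beta_fitness, sd_participant, rng):
    """Simulate one study; return fitness z-scores and per-participant ToT rates."""
    fitness, tot_rates = [], []
    for _ in range(n_participants):
        z = rng.gauss(0.0, 1.0)             # standardized fitness score
        u = rng.gauss(0.0, sd_participant)  # participant random intercept
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta_fitness * z + u)))
        tots = sum(rng.random() < p for _ in range(n_items))
        fitness.append(z)
        tot_rates.append(tots / n_items)
    return fitness, tot_rates


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def is_significant(xs, ys, alpha=0.05):
    """Two-sided Fisher z-test of zero correlation (normal approximation)."""
    r = max(min(pearson_r(xs, ys), 0.999999), -0.999999)
    z = math.atanh(r) * math.sqrt(len(xs) - 3)
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def estimate_power(n_sims, seed=1, **design):
    """Proportion of simulated studies in which the effect is detected."""
    rng = random.Random(seed)
    hits = sum(is_significant(*simulate_study(rng=rng, **design)) for _ in range(n_sims))
    return hits / n_sims


# Hypothetical design: 28 participants, 60 items, assumed effect of -0.5 logits.
design = dict(n_participants=28, n_items=60, beta0=-2.0,
              beta_fitness=-0.5, sd_participant=0.5)
power = estimate_power(300, **design)
false_positive_rate = estimate_power(300, **{**design, "beta_fitness": 0.0})
```

Repeating the call over a grid of `n_participants` and `beta_fitness` values shows how power changes with sample size and assumed effect size, and setting the effect to zero gives a check that the test rejects at roughly the nominal rate.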

Data availability

The stimulus materials and the datasets analysed during the current study can be obtained by emailing the corresponding author.