Participants and eligibility

We recruited college students planning to take the LSAT within 1 year. Inclusion criteria were being a native English speaker, being at least 18 years of age, having normal or corrected-to-normal vision, and having no history of psychiatric disorders or learning disabilities and no prior LSAT experience. Participants were assigned pseudo-randomly to study for one of two sections of the LSAT: Logic Games or Reading Comprehension. The first quarter of participants were assigned to a group at random, whereas we distributed the rest so as to match the groups on age, gender, reasoning, working memory, and LSAT performance (Table S1). We collected data from 2015 to 2017, following the semester structure of UC Berkeley: Spring (January–May) 2015, Summer (June–August) 2015, Fall (August–December) 2015, Spring 2016, Summer 2016, Fall 2016, and Spring 2017.

Ninety-five participants completed the pre-tests, and 49 completed the LSAT course and post-tests. We excluded two of these participants because they failed to study for their assigned course. Participants in our final sample did not differ from those who only completed one time point on either cognitive performance or demographic variables. The final sample who prepared for the Logic Games section of the LSAT included 23 students (14 females, mean age 21.55 years). The final sample who prepared for the Reading Comprehension section of the LSAT included 24 students (13 females, mean age 21.88 years). Levels of attrition did not differ significantly between the groups (χ² = 0.01, p = 0.93). For analyses involving the transitive inference task, we excluded two subjects from each group for having >60% of trials missing valid fixation data, and one subject from the Reasoning group for having performance below chance levels (20% accuracy, chance was 50%). The research was approved by the Committee for the Protection of Human Subjects at the University of California, Berkeley. Written informed consent was obtained from all participants.

Summary of procedures

Before and after studying for the LSAT courses, participants completed a battery of nine online cognitive assessments (ref. 47), followed by an in-person testing session. Participants were blind to their LSAT group at pre-test, and the experimenters carrying out the testing sessions were blind to the group assignment at both time points.

During the laboratory sessions, we recorded gaze data from participants while they completed a transitive inference task, followed by two tests of inductive reasoning. Data from the transitive inference task are the subject of the current investigation. After finishing the eyetracking tasks, participants completed a standardized test of reasoning termed Analysis Synthesis (Woodcock–Johnson Battery III; ref. 48), LSAT sample problems, and a survey. The survey included demographic and ocular health questions, questions regarding prior experience with the LSAT, and, at post-test, questions about the participant’s experience with their LSAT course. The order of tests was the same at both time points.

LSAT courses

Participants studied for either the Logic Games or Reading Comprehension section of the LSAT with a commercially available online course (Kaplan, Inc.) for 6 weeks. The courses were similar in critical ways. Both courses included six lessons, each consisting of online videos and homework practice problems designed to help improve timing and increase mastery with different question types. Both courses featured the same instructors in the online videos, who explained problem-solving strategies and had students practice those skills with real LSAT problems. Students had access to the online course materials and were given a companion workbook that included practice problems.

We requested that participants (1) study only for the LSAT section we assigned to them, (2) complete all six lessons of the course within 6 weeks (approximately one lesson/week), and (3) space their practice (i.e., study every other day, three times per week), in keeping with prior work showing that spacing practice promotes learning (ref. 49) and transfer effects (ref. 50). We chose these practice intervals so that students could incorporate their LSAT courses more easily into their typical school schedules. Participants reported having complied with these instructions, having completed on average one lesson per week (range = 0.5–2 lessons), and having studied for a median of 24 h (first quartile: 16 h; third quartile: 36 h). Both groups reported similar studying times (median = 24 h for each group).

The Logic Games section involves solving word problems that contain many rules that must be integrated to find the correct answer (sample problems: https://www.lsac.org/jd/lsat/prep/analytical-reasoning). The preparatory course for this section instructed on strategies such as organizing relational information into sketches to minimize the amount of information one needs to remember, as well as to facilitate deductions, rule abstractions, and correct rule application.

The Reading Comprehension section involves reading long passages and answering multiple choice questions based on relevant information in the passages (sample problems: https://www.lsac.org/jd/lsat/prep/reading-comprehension). The preparatory course for this section involved learning strategic reading techniques, such as finding keywords based on the passage questions and annotating main ideas on the passages to minimize working memory demands.

Participants in the two groups found their respective courses relatively effective and enjoyable, with no differences between groups (Table S2). However, we measured the effectiveness of the LSAT courses with short Logic Games and Reading Comprehension problem sets that participants completed in the laboratory and found little evidence of the effectiveness of the mini-courses in improving performance on either section (S1).

Eyetracking apparatus and procedures

We recorded binocular gaze data from participants completing a transitive inference task using a Tobii T120 Eye Tracker (17-inch monitor, 1280 × 1024 pixel resolution). We sampled at a temporal resolution of 120 Hz, with participants sitting 60 cm from the eyetracker camera. We took several precautions to collect high-quality ocular data, following the recommendations in ref. 51. Furthermore, participants reported that they did not suffer from medical conditions or use medication that could affect ocular behaviors. We used Presentation® software (v. 18.0, Neurobehavioral Systems, Inc.) to present the task stimuli and the Tobii Eye Tracker Extension for Presentation v1.1 (ref. 52) to synchronize the timing of the stimulus presentation and ocular events.

Transitive inference task

In the transitive inference task (adapted from a task we had developed previously for functional magnetic resonance imaging research; refs. 18, 19; Fig. 1), participants saw four balance scales, each with two colored balls. Based on the relations shown by the scales, participants needed to infer the relative weights of two target balls. To solve the problems correctly, it was necessary to integrate the relations shown by two of the four scales (i.e., the relevant scales). Participants completed 60 of these problems, divided into two blocks of 30 trials. We recalibrated the eyetracker during the short break between blocks.

We minimized potential confounds in gaze patterns by controlling for features that could impact visual saliency and subjects’ expectations as to where the relevant scales were likely to appear and which balls were likely to be relevant. We changed the position of the relevant scales across trials, and the program selected the color of the five balls at random from a set of six colors, which were all matched in luminance. Additionally, we biased the participant’s first fixation to the question area by first presenting the question alone for 100 ms and then adding the four scales (see trial sequence in Fig. 1b). We staggered the stimulus presentation in this way in an effort to encourage participants to begin the task by searching for the relevant relations and then proceed to integrating them.

Behavioral outcome measures

We examined changes in RTs and accuracy (proportion of trials answered correctly). Performance did not vary as a function of the spatial arrangement of the scales (e.g., the position of relevant scales) or the number of scales showing inequalities (Fig. S1). Thus, we did not include these factors in our analyses, so as to maximize the statistical power to assess our hypotheses.

Given that pre-test RTs were highly positively skewed (sk = 4.55), we trimmed outlier trials falling on the long end of the tail (i.e., RTs exceeding Q3 + 1.5 × IQR) to minimize bias in our gaze analysis that could result from including the highly variable fixation durations that could occur on these atypically long trials. Outlier trials were identified separately by subject, time point, and block, to retain individual differences in performance. Approximately 5% of trials were trimmed owing to outlier RTs in each group at each time point.
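The trimming rule can be illustrated with a minimal Python sketch. It is not the authors' script: the trial record keys ('subject', 'time_point', 'block', 'rt') are hypothetical, and the quartile convention (the "inclusive" method of the standard library) is an assumption, since the text does not state which one was used.

```python
from collections import defaultdict
from statistics import quantiles


def upper_fence(values):
    """Return Q3 + 1.5 * IQR (quartile convention is an assumption)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def trim_long_rts(trials):
    """Drop trials whose RT exceeds the upper fence of their own cell.

    `trials` is a list of dicts with the hypothetical keys
    'subject', 'time_point', 'block', and 'rt'. A cell is one
    subject x time point x block combination, so the fence is
    computed separately for each of them.
    """
    cells = defaultdict(list)
    for t in trials:
        cells[(t["subject"], t["time_point"], t["block"])].append(t["rt"])
    fences = {key: upper_fence(rts) for key, rts in cells.items()}
    return [
        t for t in trials
        if t["rt"] <= fences[(t["subject"], t["time_point"], t["block"])]
    ]
```

Computing the fence within each cell, rather than over the pooled data, means that a slow but consistent participant is not penalized relative to faster participants.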

Gaze preprocessing and outcome measures

We classified gaze data into fixations using a standard dispersion-based algorithm adapted from ref. 53, allowing a maximum dispersion of 35 px over a 100 ms window (see details in Section S2). Participants had a median of 22 fixations on correct trials. Our analysis included only trials with at least three valid fixations, under the assumption that this is the minimum number of fixations needed to solve the problem, and with at most 64 fixations (the upper outlier threshold, Q3 + 1.5 × IQR), to minimize the bias that outlier trials could induce.
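For readers unfamiliar with dispersion-based classification, the following Python sketch shows a generic dispersion-threshold detector using the parameters given above (35 px, 100 ms, 120 Hz, i.e., a 12-sample minimum window). It is an illustration only. The dispersion definition (horizontal range plus vertical range) and the handling of samples are assumptions, missing samples are not handled, and the adapted algorithm actually used is described in Section S2.

```python
def dispersion(points):
    """Horizontal range plus vertical range of a set of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(samples, max_dispersion=35.0,
                     min_duration_ms=100.0, sample_rate_hz=120.0):
    """Group consecutive gaze samples into fixations.

    `samples` is a list of (x, y) positions in pixels, one per
    tracker sample. Returns a list of tuples
    (start_index, end_index, centroid_x, centroid_y), with
    end_index exclusive.
    """
    window = max(1, round(min_duration_ms * sample_rate_hz / 1000.0))
    fixations = []
    i, n = 0, len(samples)
    while i + window <= n:
        j = i + window
        if dispersion(samples[i:j]) <= max_dispersion:
            # Grow the window until adding a sample exceeds the threshold.
            while j < n and dispersion(samples[i:j + 1]) <= max_dispersion:
                j += 1
            pts = samples[i:j]
            cx = sum(p[0] for p in pts) / len(pts)
            cy = sum(p[1] for p in pts) / len(pts)
            fixations.append((i, j, cx, cy))
            i = j
        else:
            # No fixation starts here; slide the window by one sample.
            i += 1
    return fixations
```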

We assigned an area of interest (AOI) label to the fixations. The AOIs included each of the four scales (two relevant and two irrelevant scales) and the area where the target balls and question appeared. We used these labeled fixations to calculate the number of gaze transitions between different AOIs. For instance, a fixation on “Relevant Scale 1” followed by a fixation on “Relevant Scale 2” was coded as one transition between the relevant scales. We refer to these events as transitions because we were primarily concerned with measuring how often fixations shifted between two different scales; we ignored, at most, one fixation that may have occurred elsewhere between those two target fixations.
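The transition rule can be sketched in Python as follows. This is one possible reading of the rule, not the authors' code: the AOI label strings are hypothetical, and "elsewhere" is taken to mean any AOI other than the two scales of interest.

```python
def count_transitions(labels, aoi_a, aoi_b, max_gap=1):
    """Count gaze transitions between two different AOIs.

    `labels` is the sequence of AOI labels of successive fixations.
    A transition is a fixation on one AOI of the pair followed by a
    fixation on the other, with at most `max_gap` intervening
    fixations elsewhere.
    """
    pair = {aoi_a, aoi_b}
    count = 0
    for i, start in enumerate(labels):
        if start not in pair:
            continue
        target = aoi_b if start == aoi_a else aoi_a
        for j in range(i + 1, min(i + 2 + max_gap, len(labels))):
            if labels[j] == target:
                count += 1
                break
            if labels[j] == start:
                # Refixation on the same AOI: the later fixation
                # is evaluated as its own starting point.
                break
    return count
```

For example, under this reading the sequence Relevant Scale 1, Question, Relevant Scale 2 yields one transition, whereas two intervening fixations elsewhere yield none.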

We used the transitions and fixation data from each trial to derive three gaze outcome measures (Table 1), informed by an analysis of fixation sequences performed across groups and time points (Fig. 2). To compute the gaze metrics, we first marked the point at which it became more probable that a participant had homed in on the relevant scales during a trial. For each trial, and on an individual subject basis, we measured that point in the trial by calculating the empirical probability that the number of fixations on irrelevant scales was below chance (25%) and that the number of fixations on relevant scales was greater than chance. We estimated these probabilities with a sliding window that evaluated 20% of the fixations at once (min. size 4, max. size 8 fixations). This approach enabled us to capture a common pattern of fixations (Fig. 2), whereby participants began to preferentially fixate on the relevant scales after a certain point in the trial. Accordingly, the visual search metric constitutes the number of fixations the participant made on any scale prior to that point, and we indexed relational thinking as the duration of fixations on relevant scales occurring after that point. We additionally computed a more specific metric of relational integration as the number of transitions between the two relevant scales.
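A simplified Python sketch of this procedure is given below. It is an assumption-laden illustration rather than the implementation in our repository: the criterion is reduced to the first window in which the proportion of fixations on irrelevant scales is below 25% and the proportion on relevant scales is above 25%, and the rounding of the window size, the function names, and the AOI labels are all hypothetical.

```python
def window_size(n_fixations, fraction=0.2, smallest=4, largest=8):
    """20% of the trial's fixations, clamped to the range 4 to 8."""
    return max(smallest, min(largest, round(fraction * n_fixations)))


def find_onset(labels, relevant, irrelevant, chance=0.25):
    """Index of the first fixation of the first qualifying window.

    Returns None if no window meets the (simplified) criterion.
    """
    size = window_size(len(labels))
    for start in range(len(labels) - size + 1):
        window = labels[start:start + size]
        p_rel = sum(lab in relevant for lab in window) / size
        p_irr = sum(lab in irrelevant for lab in window) / size
        if p_irr < chance and p_rel > chance:
            return start
    return None


def gaze_metrics(fixations, relevant, irrelevant):
    """Derive two of the gaze metrics from (label, duration_ms) pairs."""
    labels = [lab for lab, _ in fixations]
    onset = find_onset(labels, relevant, irrelevant)
    if onset is None:
        return None
    scales = set(relevant) | set(irrelevant)
    return {
        "onset": onset,
        # Fixations on any scale before the onset point.
        "visual_search": sum(lab in scales for lab in labels[:onset]),
        # Total fixation time on relevant scales from the onset point on.
        "relational_thinking_ms": sum(
            dur for lab, dur in fixations[onset:] if lab in relevant
        ),
    }
```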

Composite reasoning measure and other transfer tasks

Three subtests included in the composite reasoning measure (Table 2) were part of a larger battery of nine online assessments, which included tests of selective attention, planning, and working memory (Table S4). These tests were developed by the Cambridge Brain Sciences Laboratory (http://www.cambridgebrainsciences.com) as an online adaptation of assessments designed and validated at the Medical Research Council Cognition and Brain Sciences Unit (refs. 47, 54).

Task difficulty in all the assessments was adaptive as a function of performance. Performance metrics differed between the tasks (e.g., maximum level achieved versus total correct responses), so we standardized the scores after removing outlier scores (i.e., scores that deviated by more than 3 SD from the grand pre-test mean). Using this normalized dataset, we created composite measures of reasoning, planning, and working memory by averaging performance across related assessments. Composite measures provide a robust test of transfer (refs. 38, 40) and help minimize the number of statistical tests necessary. We derived these composite measures with a theory-driven approach, given that factor analytic methods were not appropriate for our sample size. For the reasoning measure, we averaged the standardized scores from the Analogical Reasoning, Object Reasoning, and Odd One Out tests, as well as the Analysis Synthesis test administered in the laboratory.
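The standardization and averaging steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: scores are z-scored against the pre-test mean and SD recomputed after outlier removal, removed or missing scores are represented as None, and a composite is the mean of whatever standardized scores remain for a participant. The text does not specify these details, and the function names are hypothetical.

```python
from statistics import mean, stdev


def standardize(pre_scores, scores, cutoff=3.0):
    """z-score `scores` against the pre-test distribution.

    Scores more than `cutoff` SD from the grand pre-test mean are
    treated as outliers and returned as None.
    """
    m, s = mean(pre_scores), stdev(pre_scores)
    kept = [x for x in pre_scores if abs(x - m) <= cutoff * s]
    m_kept, s_kept = mean(kept), stdev(kept)
    return [
        None if x is None or abs(x - m) > cutoff * s
        else (x - m_kept) / s_kept
        for x in scores
    ]


def composite(z_by_test):
    """Average each participant's available z-scores across tests.

    `z_by_test` maps a test name to a list of z-scores, with all
    lists aligned by participant.
    """
    n = len(next(iter(z_by_test.values())))
    out = []
    for i in range(n):
        vals = [z[i] for z in z_by_test.values() if z[i] is not None]
        out.append(mean(vals) if vals else None)
    return out
```

Standardizing each test before averaging puts metrics with different units (e.g., maximum level versus total correct) on a common scale, so that no single test dominates the composite.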

Statistical analysis

We used Bayesian models to quantify the strength of evidence supporting the model that tested a given hypothesis, as described in the Results section. For all analyses, we used each participant’s median score on the measure of interest and a uniform distribution of prior probabilities, with the default Cauchy prior scales from the BayesFactor R package (ref. 55). The sample size was sufficient for the Bayesian analyses performed. For traditional hypothesis testing, the sample size was sufficient to test for the effects of reasoning practice between time points with a power of 0.86 at an alpha criterion of 0.05, as well as for a Group × Time interaction effect with a power of 0.73 at an alpha criterion of 0.05.

Code availability

We used custom scripts written in Python (v3.6) to preprocess and calculate gaze outcome metrics and R (v3.2) to perform the Bayesian analysis. The code and instructions can be found in the Open Science Framework repository, https://osf.io/hkzgw/?view_only=8f4749510a2f44ef86fea154e9f6e9c4.