There are several sources of inter-individual variability in human gaze patterns, such as genetics1,2, social functioning3,4, cognitive functioning5 and age6,7. These differences are often revealed using areas of interest (AOIs) chosen by the experimenter or other human observers. We adopted a data-driven approach for characterizing gaze patterns, using a combination of support vector machine (SVM) and deep learning (DL) techniques to classify 18- and 30-month-old toddlers (final sample n = 41) by age based on their exploratory gaze patterns, which were recorded while they viewed static images of natural scenes. We chose these age groups because (1) social and symbolic communication changes dramatically over this period, and (2) delays in social and symbolic communication during this developmental period can represent risk factors for autism spectrum disorder8. Differences in fixation patterns could reflect meaningful changes in how toddlers process information about the world around them.

Images consisted of social and non-social scenes containing multiple dominant objects, rather than a single central dominant object9 (Fig. 1A). For example, rather than depicting a social object, such as a person, front and center, an image might show non-human objects in the foreground with a person in the background. Thus, semantically salient objects (such as faces) are not always the most visually salient (low-level) features in the scenes, removing a common confound in scene-viewing research. Multiple dominant objects in the same scene also permit a statistical analysis of the relative importance of objects or features in gaze allocation. These images provide visual information more similar to real-world free viewing than traditional, controlled experimental stimuli, which often display only the objects of immediate interest, and they allow more equitable competition between low-level image salience (e.g. pixel-level salience) and high-level, semantic salience.

Figure 1 Basic eye movement metrics. (A) Examples of the visual stimuli, fixation maps of each age group, and difference maps. In the difference maps, bright regions attract more fixations from the 30-month-olds, while dark regions attract more fixations from the 18-month-olds. (B,C) Significant differences between age groups were observed. (B) 30-month-olds had more fixations than 18-month-olds. (C) 30-month-olds had shorter average fixation durations than 18-month-olds.

Toddlers (n = 73) viewed 100 naturalistic images for 3 seconds each while their eye movements were tracked using a Tobii TX300 eye tracker (Tobii Technology AB, www.tobii.com). We first assessed data quality by generating measures of calibration accuracy and precision using the procedure described in a previous study10. Using only data from recordings with acceptable calibration accuracy (18-month-olds n = 19; 30-month-olds n = 22, see Methods), we examined the basic eye movement metrics of fixation number and duration. A two-tailed unpaired t test showed that 18-month-olds made fewer fixations than 30-month-olds (18-month-olds: mean = 5.96, SD = 0.84; 30-month-olds: mean = 6.81, SD = 0.87, Fig. 1B, t(39) = −3.151, p = 0.003, 95% CI [−1.4, −0.3], Cohen’s d = −0.987). We also observed that 18-month-olds had longer fixation durations than 30-month-olds (18-month-olds: mean = 433.93 ms, SD = 74.45; 30-month-olds: mean = 348.63 ms, SD = 68.71, Fig. 1C, t(39) = 3.814, p < 0.001, 95% CI [40.1, 130.5], Cohen’s d = 1.194). We used the fixation information to generate fixation maps for the two groups (Fig. 1A). These fixation maps indicate where toddlers from each age group fixated and allowed us to compute difference maps highlighting the key areas within the naturalistic scenes that distinguish the looking patterns of the groups. The primary differences are that 18-month-olds show greater interest in faces, whereas 30-month-olds show greater interest in things that are gazed at, or touched, by others. This difference between age groups could reflect changes in the cognitive representation of others’ intentionality.
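The group fixation maps and the difference map derived from them can be sketched as follows. This is a minimal illustration assuming Gaussian-smoothed fixation densities; the image size, smoothing width, and the synthetic fixation coordinates are made-up values, not the study's parameters or data:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(fixations, shape, sigma=25):
    """Build a smoothed fixation density map from (x, y) fixation points.

    fixations: array of (x, y) pixel coordinates pooled over one age group.
    shape:     (height, width) of the stimulus image.
    sigma:     Gaussian smoothing in pixels (hypothetical value).
    """
    h, w = shape
    counts = np.zeros((h, w))
    for x, y in fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < h and 0 <= xi < w:
            counts[yi, xi] += 1
    smoothed = gaussian_filter(counts, sigma)
    return smoothed / (smoothed.sum() + 1e-12)  # normalize to a density

# Difference map: positive regions attract more 30-month fixations,
# negative regions more 18-month fixations (cf. Fig. 1A).
rng = np.random.default_rng(0)
shape = (120, 160)
fix_30 = rng.uniform([0, 0], [160, 120], size=(50, 2))  # synthetic fixations
fix_18 = rng.uniform([0, 0], [160, 120], size=(50, 2))
diff_map = fixation_map(fix_30, shape) - fixation_map(fix_18, shape)
```

Because each group map is normalized to a density before subtraction, the difference map sums to approximately zero and highlights only where the groups' gaze distributions diverge.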

Beyond analysing basic fixation metrics and looking patterns, our research method comprises two computational approaches: (1) a group-wise analysis based on predefined features of interest, and (2) a DL-based classifier for estimating the age group of individual toddlers. First, we conducted computational and statistical analyses to identify group-wise differences between 18-month-olds and 30-month-olds. To conduct a better-controlled statistical analysis, for each toddler we trained a linear SVM model to classify salient vs. non-salient image regions using a set of predefined features of interest. The combination weights of the various features indicated their respective contributions to attention allocation (namely, saliency weights), which were compared between the two age groups. The saliency model used five categories of predefined feature maps: (1) pixel-level features (e.g. color, intensity, orientation), (2) object-level features (e.g. size, complexity), (3) semantic-level features (e.g. faces, emotional expressions, objects touched by a human or animal), (4) image center, and (5) background9. In our previous studies, these features have proven reliable in predicting eye movements9 and identifying people with autism4. We trained the saliency model using gaze patterns from each child and then compared feature weights between 18-month-olds and 30-month-olds to determine whether the groups differed. The comparison indicated that the two groups differed in meaningful ways: while both the 18- and 30-month-olds looked at faces early on, the 18-month-olds looked at faces more and continued to fixate faces across subsequent fixations (18-month-olds: mean = 0.096, SD = 0.035; 30-month-olds: mean = 0.065, SD = 0.033, t(39) = 2.881, Fig. 2A, p = 0.006, 95% CI [0.009, 0.053], Cohen’s d = 0.902).
In contrast, the 30-month-olds ultimately explored more of the scene, showing interest in the objects of other people’s gaze (18-month-olds: mean = 0.016, SD = 0.018; 30-month-olds: mean = 0.030, SD = 0.019, Fig. 2A, t(39) = −2.365, p = 0.023, 95% CI [−0.026, −0.002], Cohen’s d = −0.741), objects being touched by a human or animal (18-month-olds: mean = 0.010, SD = 0.020; 30-month-olds: mean = 0.034, SD = 0.030, Fig. 2A, t(39) = −3.042, p = 0.004, 95% CI [−0.041, −0.008], Cohen’s d = −0.953), objects that elicit strong tactile sensations (e.g. sharp knife, soft pillow; 18-month-olds: mean = 0.003, SD = 0.011; 30-month-olds: mean = 0.012, SD = 0.009, Fig. 2A, t(39) = −2.896, p = 0.006, 95% CI [−0.015, −0.003], Cohen’s d = −0.907), and objects that can be tasted (e.g. food and drink; 18-month-olds: mean = 0.009, SD = 0.014; 30-month-olds: mean = 0.026, SD = 0.017, Fig. 2A, t(39) = −3.519, p = 0.001, 95% CI [−0.027, −0.007], Cohen’s d = −1.102). These differences between groups emerge early, often within the first 1–3 fixations (excluding the first fixation after stimulus onset) (Fig. 2B) and represent semantic categories that reflect meaningful development of cognitive processing (see Discussion). Figure 2B illustrates how differences emerged in the groups across fixation number and highlights the importance of analyzing eye movement data across time, rather than simply extracting an outcome based on aggregate data from entire trials.
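The per-toddler saliency-weight estimation can be sketched as a linear SVM separating fixated from non-fixated image regions described by the predefined feature maps. The feature names, region counts, labels, and effect sizes below are illustrative stand-ins, not the study's actual features or data:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Each image region is described by its values on the predefined feature maps
# (pixel-, object-, and semantic-level features, image center, background).
# Regions a toddler fixated are labeled salient (1), the rest non-salient (0).
rng = np.random.default_rng(1)
n_regions, n_features = 400, 5  # e.g. [face, gazed, touched, center, background]
X = rng.normal(size=(n_regions, n_features))
w_true = np.array([1.5, 0.8, 0.6, 1.0, -0.5])  # hypothetical ground truth
y = (X @ w_true + rng.normal(scale=0.5, size=n_regions) > 0).astype(int)

# The learned coefficient on each feature is that feature's saliency weight
# for this toddler; weights are then compared between the two age groups.
svm = LinearSVC(C=1.0).fit(X, y)
saliency_weights = svm.coef_.ravel()  # one weight per feature
```

Because the classifier is linear, each coefficient is directly interpretable as the feature's contribution to attracting gaze, which is what makes the group-wise t tests on the weights possible.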

Figure 2 Saliency weights of semantic features. (A) Saliency weights of semantic features. The 30-month-olds had a greater interest in objects being touched or gazed at. Error bars denote the SEM over the group of toddlers. Asterisks indicate significant differences between 18-month-olds and 30-month-olds using a two-tailed unpaired t test. *p < 0.05, **p < 0.01. (B) Evolution of the saliency weights of the five semantic features with the most significant differences between age groups. Fixations start with the first fixation away from the center fixation dot post-stimulus-onset. Shaded areas denote ± SEM over the group of toddlers. Asterisks indicate significant differences between 18-month-olds and 30-month-olds using a two-tailed unpaired t test. *p < 0.05, **p < 0.01, and ***p < 0.001.

In the aforementioned weight analysis, we used predefined features at multiple levels and analyzed their relative importance in eliciting gaze in the two age groups of interest. Given that differences exist between the gaze patterns of the two groups, our next analysis uses DL to automatically learn and visualize hierarchical features from the images and eye-tracking data in a data-driven manner, without prior hypotheses or hand-designed features. In particular, by fitting the difference between the fixations of 18-month-olds and 30-month-olds, our DL model (Fig. 3A) learns multi-level image features that distinguish the two age groups. A difference map was first computed by subtracting the fixation maps of the two age groups. The model was composed of two parallel convolutional neural networks (CNNs) encoding two scales of input that extract multi-level image properties and then predict the corresponding difference map. At each fixation point, a 1024-dimensional feature vector was extracted from the convolutional layers. These features were aggregated across all experimental trials of each toddler. Figure 4A provides visualizations of the features that best represent the gaze patterns of 18- and 30-month-olds. Like the saliency weight analysis, DL revealed that 18-month-olds spent more time gazing at faces, while 30-month-olds fixated more diverse image features, such as the objects of other people’s gaze (e.g. text). It is also interesting to note that the predefined feature maps were categorized by adult raters (see ref. 9), yet did a reasonably good job of classifying toddlers’ gaze positions (Fig. 4B).
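The two-scale idea can be sketched with a toy filter bank: the same filters are applied to the image at full and half resolution, and the responses around a fixation point are concatenated into one per-fixation vector. The actual model uses trained CNNs producing a 1024-dimensional vector; this numpy stand-in with random 3×3 kernels only illustrates the mechanics:

```python
import numpy as np
from scipy.ndimage import convolve, zoom

def conv_features(img, kernels):
    """One toy 'convolutional layer': a filter bank followed by ReLU."""
    return np.stack([np.maximum(convolve(img, k), 0) for k in kernels])

def two_scale_features(img, fixation, kernels, patch=8):
    """Apply the same filters at two spatial scales and read out the
    responses in a patch around the fixation point (illustrative only;
    the paper's two parallel CNNs yield a 1024-d vector)."""
    fine = conv_features(img, kernels)               # full-resolution stream
    coarse = conv_features(zoom(img, 0.5), kernels)  # half-resolution stream
    y, x = fixation
    f1 = fine[:, y - patch:y + patch, x - patch:x + patch].mean(axis=(1, 2))
    f2 = coarse[:, y//2 - patch//2:y//2 + patch//2,
                   x//2 - patch//2:x//2 + patch//2].mean(axis=(1, 2))
    return np.concatenate([f1, f2])  # per-fixation feature vector

rng = np.random.default_rng(2)
img = rng.normal(size=(64, 64))                 # stand-in grayscale image
kernels = [rng.normal(size=(3, 3)) for _ in range(4)]
vec = two_scale_features(img, fixation=(32, 32), kernels=kernels)
```

Per-fixation vectors like `vec` would then be averaged across all of a toddler's trials to form the single representation that is passed to the classifier.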

Figure 3 Deep Learning model. (A) An overview of the DL model, which was composed of two parallel convolutional neural networks (CNNs) encoding two scales of visual input to extract high-level representations of an image and predict the corresponding difference map between fixations of 18-month-olds and 30-month-olds. At each fixation point, a 1024-dimensional feature vector was extracted from the convolutional layers. The feature vectors were integrated across trials to represent each toddler’s eye-tracking patterns. These representations were classified with a linear SVM to distinguish the two age groups. (B) The ROC curve of the DL classification. Positive values indicate 30-month-olds; negative values indicate 18-month-olds.

Figure 4 Visualization of DL features. (A) Visualization of the DL feature channels sorted into columns corresponding to classification weights. The lowest weights (row 1) correspond to 18-month-olds and the highest weights (row 2) correspond to 30-month-olds. Image patches represent features that are most characteristic of looking patterns for each age group across all fixations. (B) Visualization and classification weights of the DL features corresponding to each feature from Fig. 2B. Features that most strongly correlate with the labeled semantic features are presented. Bars indicate the evolution of classification weights across fixations. Higher classification weights suggest that 30-month-olds are more likely than 18-month-olds to fixate the feature. Note: for privacy reasons, faces are blocked in this figure. Faces were intact in the original stimuli viewed by participants.

Finally, we used a linear SVM classifier on the features extracted by our DL model to classify age groups. The goal was to find a linear decision boundary with a maximum margin separating the two groups of toddlers. The ROC curve of our DL classification indicates a high degree of classification performance (accuracy = 0.83, AUC = 0.90, sensitivity = 0.81, specificity = 0.84, Fig. 3B). For comparison, a baseline model using a simple gaze-based linear SVM without DL achieved an accuracy of 0.68 and AUC = 0.86 (sensitivity = 0.68, specificity = 0.68). Taken together, our model, with DL for feature learning and SVM for classification, produces highly accurate classifications. It also reveals the features that drive that classification, consistent with conclusions from the saliency weight analysis. These features represent areas within the images that attract toddlers’ attention differently depending on age and could therefore provide information about cognitive changes taking place during this period of development.
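The final classification stage can be sketched as follows: each toddler is one sample whose features are the trial-aggregated DL vector, a linear SVM is fit, and cross-validated decision values give the ROC curve. The feature dimensionality and the group separation are synthetic placeholders, not the study's representations:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score, accuracy_score

# One row per toddler: a vector of DL features aggregated over trials.
# Dimensions and group effect size here are made up for illustration.
rng = np.random.default_rng(3)
n_18, n_30, dim = 19, 22, 32
X = np.vstack([rng.normal(0.0, 1.0, (n_18, dim)),
               rng.normal(0.8, 1.0, (n_30, dim))])
y = np.array([0] * n_18 + [1] * n_30)  # 0 = 18-month-old, 1 = 30-month-old

# Cross-validated decision values, so each toddler is scored by a model
# that never saw them; these scores trace out the ROC curve.
svm = LinearSVC(C=1.0)
scores = cross_val_predict(svm, X, y, cv=5, method="decision_function")
auc = roc_auc_score(y, scores)
acc = accuracy_score(y, scores > 0)
```

Positive decision values correspond to the 30-month class and negative values to the 18-month class, matching the sign convention described for Fig. 3B.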

The current study demonstrates the efficacy of using SVM and DL for classifying toddlers based on their fixation patterns during free viewing of naturalistic images. The models identified meaningful image features that differentiated the two groups, such as faces, objects of others’ gaze, objects that can be touched, objects that elicit strong tactile sensations, and things that can be tasted. The combination of eye-tracking, DL, and SVM to investigate age-related variability in toddlers represents a new application of the methodology.

The SVM and DL techniques are unique in that they take looking patterns from individuals scanning very complex images and analyze them according to both pixel-level (bottom-up) salience, and semantic-level (top-down) salience. In other words, while some models focus on either bottom-up11 or top-down information12,13 for analyzing looking patterns across natural scenes, these techniques use low-level and high-level information from the scenes, in addition to where individuals look, to accurately classify observers according to age. This combination of selection mechanisms represents a more accurate account of how attention is actually deployed during visual exploration14. We used a probability map to identify group-level differences between 18- and 30-month-olds, from which the DL model can learn discriminative image features that maximally distinguish the two age groups. Compared with training a separate network for each group and combining their features for classification (AUC = 0.85, accuracy = 0.83, sensitivity = 0.86, specificity = 0.79), our DL model trained on the difference maps performed better with less computational cost.

Although DL has been criticized in the past for being a “black box”, in that it lacks transparency regarding how it generates its classifications15,16, we applied visualization techniques to determine which critical image features distinguish the looking patterns of the two groups. In other words, rather than simply determining that age-related variability in toddlers can be captured by applying DL to their gaze patterns, this technique provides meaningful insights into what distinguishes the groups, particularly which image features tend to be fixated differently by the two groups. The data-driven DL model converged on features similar to those of the saliency weight model, which used a three-layer saliency model including semantic-level features identified by human raters9. Although DL has previously been shown to effectively predict gaze behavior in humans17,18,19,20,21,22, this study goes a step further by using SVM and DL together to classify toddlers according to age, while additionally revealing meaningful information about how the two groups allocate their attention differently to features within the scenes.

Researchers have investigated gaze patterns in toddlers, and it is clear that the complexity of the images they are shown matters. Bornstein, Mash, and Arterberry23 recorded looking patterns of 4-month-old infants as they examined natural scenes and compared those patterns to viewing patterns directed towards artificial (experimental) scenes. They discovered that infants treat objects and context differently in naturalistic scenes compared to experimental scenes. Pushing ecological validity a step further, others have shown infants dynamic scenes (videos) and reported differences between scanning patterns based on age6,7. However, features in these studies were coded according to broad experimenter-determined categories (AOIs) such as faces, hands, and salient objects, and groups were compared in terms of percentages of gaze directed to each category. Although our stimuli are static, they consisted of social and non-social scenes that included multiple dominant objects, rather than a central dominant object, in the same scene, providing visual information more similar to real-world free viewing than traditional, controlled stimuli. By using naturalistic images we are getting closer to determining how toddlers distribute gaze in the real world.

We used static images in this work because they are more controlled than dynamic stimuli. Importantly, image saliency models for static images have been explored far more extensively in the literature, providing a better-established platform for the study and allowing finer-grained and more comprehensive comparisons of features of interest. Our approach could be applied to dynamic stimuli with straightforward extensions (i.e., learning spatio-temporal features instead of spatial features, at increased computational cost). Recent deep neural networks have successfully extended gaze prediction from images to videos24,25,26,27. These models can encode complex and dynamic visual cues, making it feasible to extend our approach to video stimuli in future studies.

The features that are fixated differently by 18- and 30-month-olds in this study map onto meaningful developmental changes during this age period. In particular, language production and comprehension change dramatically during this period28,29, perhaps augmenting the explicit representation of others’ intentionality. This change is reflected in the transition from looking primarily at faces at 18 months to looking at the objects of other people’s gaze, and items that are touched by others, at 30 months. In contrast to the 18-month-olds, the 30-month-old toddlers appear to be interested not only in the people in the scenes, but also in what these people are attending to and in actions that could be implemented. These findings are consistent with results from other studies using natural scenes with toddlers7. Using predefined areas of interest, Frank, Vul, and Saxe7 found a systematic change in what children aged 3 to 30 months looked at in dynamic stimuli. Infants looked at stimuli differently depending on whether they consisted of a single face, a whole person, or a complex scene; in particular, older children looked at hands in the complex scenes more than younger children did.

One consideration is whether the differences between 18- and 30-month-olds reported in the present study could be driven by differences in information processing speed between the two age groups. Although we did not include a procedure that would allow us to characterize information processing speed, Fig. 2B suggests that information processing speed alone cannot explain the differences in gaze patterns. If information processing speed were driving the differences between groups, we would expect that the 18-month-olds would show a lagged pattern of similar features to 30-month-olds (i.e. the 18-month-olds would fixate the same things as the 30-month-olds, but on later fixations).
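The lagged-pattern prediction can be made concrete: if the 18-month curve were simply a delayed copy of the 30-month curve, shifting it forward by some number of fixations would maximize their correlation at a positive lag. A toy check of this logic, using synthetic saliency-weight curves rather than the study's data:

```python
import numpy as np

def best_lag(c18, c30, max_lag=3):
    """Return the shift (in fixations) that best aligns the 18-month
    saliency-weight curve with the 30-month curve; 0 means no lag."""
    corrs = {}
    for k in range(max_lag + 1):
        a = c18[k:] if k else c18          # 18-month curve shifted forward
        b = c30[:len(c30) - k] if k else c30
        corrs[k] = np.corrcoef(a, b)[0, 1]
    return max(corrs, key=corrs.get)

# Synthetic example: an 18-month curve trailing the 30-month curve by 2 fixations.
c30 = np.array([0., 1., 2., 3., 4., 5., 4., 3., 2., 1.])
c18 = np.array([0., 0., 0., 1., 2., 3., 4., 5., 4., 3.])
lag = best_lag(c18, c30)
```

Under the processing-speed account, curves like those in Fig. 2B would show a clear positive best-fitting lag; the absence of such a lagged pattern argues against that account.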

In summary, leveraging a model-based eye-tracking approach in combination with machine learning techniques, we elucidated age-related variability in gaze patterns to natural scenes during an important developmental period. This work illustrates an analytic approach that emphasizes prediction30. However, the approach does not neglect explanation, as DL revealed specific features of a complex scene that differentiate gaze patterns between 18- and 30-month-olds. The capability to accurately predict age based on gaze patterns over a relatively short window of time has broad implications for the characterization of age-related variability in other psychological domains, as well as developmental delays in various aspects of social cognition.