However, before the potential of online developmental research can be fully realized, we need a secure, robust platform that can translate developmental methods to a computer-based home testing environment. Here we present a new online developmental research platform, Lookit. Parents access Lookit through their web browsers, participate at their convenience in self-administered studies with their child, and transmit the data collected by their webcam for analysis. In what follows, we address broad ethical, technological, and methodological issues related to online testing, describe the demographics of our online participant population, and offer recommendations for researchers seeking to adapt studies to an online platform. For an empirical report of our case studies, including raw data and analysis code, please see Scott, Chu, and Schulz (2017). For technical and methodological details regarding the platform itself and video coding procedures, please see the Supplemental Materials (Scott & Schulz, 2017).

In adult psychology, online testing through Amazon Mechanical Turk (AMT) has begun to lower barriers to research, enabling scientists to quickly collect large datasets from diverse participants (Buhrmester, Kwang, & Gosling, 2011; Paolacci, Chandler, & Ipeirotis, 2010; Rand, 2012; Shapiro, Chandler, & Mueller, 2013). As the technical hurdles involved in online testing dwindle, we are poised to expand the scope of questions developmental science can address as well. Online testing can allow access to more representative populations, children from particular language groups or affected by specific developmental disorders, and information about children’s behavior in the home. Access to larger sample sizes will also allow researchers to estimate effect sizes with greater precision, detect small or graded effects, and generate sufficient data to test computational models. The motivation to bring studies online is bolstered by growing awareness of the importance of direct replication and reproducible results (Open Science Collaboration, 2015; Pashler & Wagenmakers, 2012).

Behavioral research with infants and children stands to illuminate the roots of human cognition. However, many important questions about cognitive development remain unasked and unanswered due to the practical demands of recruiting participants and bringing them into the lab. Such demands limit participation by families from diverse cultural, linguistic, and economic backgrounds; deter scientists from studies involving large sample sizes, narrow age ranges, and repeated measures; and restrict the kinds of questions researchers can answer. It is hard to know, for instance, whether an ability is present in all or only most children, or whether an ability is absent or weakly present. Such small distinctions can have large theoretical and practical implications. Fulfilling the promise of the field depends on scientists’ ability to measure the size and stability of effects in diverse populations.

At the start of each study, a consent form and the webcam video stream were displayed. The parent was instructed to record a brief verbal statement of consent (see Supplemental Materials for examples), ensuring that parents understood they were being videotaped. Parents were free to end the study at any point. After completing a study, the parent selected a privacy level for the video collected. Across our three test studies, 31% of sessions were marked “private” (video can be viewed only by our research team), 41% “scientific” (video can be shared with other researchers for scientific purposes), and 28% “free” (video can be shared for publicity or educational purposes). Parents also had the option to withdraw their data from the study at this point; this option was chosen by less than one percent of participants and was treated as invalid consent. A coder checked each consent video before looking at any other video associated with the session. Valid consent was absent for 16% of participants overall (N = 961) due to technical failures in the video or audio transmission, parents not reading the statement, or subsequent withdrawal from the study.

Participants were recruited via AMT and linked to the Lookit site (https://lookit.mit.edu). Participants were paid three to five dollars (depending on the study) for participation to ensure payment of at least the minimum wage nationally, even in cases where parents encountered technical difficulties and contacted the lab. This policy is in accordance with guidelines for researchers regarding fair payment (http://guidelines.wearedynamo.org/). To ensure that parents did not feel any pressure to complete a study, especially if their child was unwilling to continue, parents were paid if they initiated a study, regardless of completion and of any issues in compliance or implementation. Participation required creating a user account and registering at least one child. As in the lab, parents provided their child’s date of birth to determine study eligibility. A demographic survey was available for parents to fill out at any point.

Before completing study-specific coding, one coder checked that video from the study was potentially usable. Video was unusable 35% of the time (282 of 805 unique participants with valid consent records). The most common reasons for unusable video were absence of any study videos (44% of records with unusable video), an incomplete set of study videos (20%), and insufficient framerate (15%). Rarely, videos were unusable because a child was present but generally outside the frame (3%) or there was no child present (1%).

Multiple short clips were recorded in each study during periods of interest. Video quality varied due to participants’ upload speed. Our primary concern for coding looking measures was the effective framerate of the video. Because the putative framerate of the video was unreliable due to details of the streaming procedure, we estimated an effective framerate based on the number of “changed frames” in each clip. A “changed frame” differed from the previous changed frame in at least 20% of pixels; note that this underestimates higher framerates since frames close in time without major movement may actually differ by fewer than 20% of pixels. Videos with an effective framerate under 2 frames per second (fps) were excluded as unusable for looking studies (see SI for examples of video at various effective framerates). The median effective framerate across sessions with any video was 5.6 fps (interquartile range = 2.9–8.6 fps).
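The changed-frame heuristic can be sketched as follows. This is a simplified illustration, not the platform’s implementation: the function name, array-based frame representation, and exact comparison rule are our own assumptions.

```python
import numpy as np

def effective_framerate(frames, duration_s, pixel_thresh=0.2):
    """Estimate effective fps as the number of 'changed frames' per second.

    A frame counts as changed when at least `pixel_thresh` of its pixels
    differ from the previous *changed* frame. Note this underestimates
    higher framerates: frames close in time without major movement may
    differ by fewer than `pixel_thresh` of their pixels.
    """
    if len(frames) == 0:
        return 0.0
    changed = 1                    # the first frame always counts
    reference = frames[0]
    for frame in frames[1:]:
        if np.mean(frame != reference) >= pixel_thresh:
            changed += 1
            reference = frame      # compare against the last changed frame
    return changed / duration_s
```

Under this rule, a session whose estimate fell below 2 fps would be excluded from looking analyses.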

Condition assignment and counterbalancing were initially achieved simply by assigning each participant to whichever condition had the fewest sessions already in the database. Because many sessions could not be included in analysis, we later manually updated lists of conditions needed to achieve more even counterbalancing and condition assignment. Condition assignment was still not as balanced as in the lab, so we used analysis techniques robust to this variation. Future versions of the platform will allow researchers to continually update the number of included children per condition as coding proceeds.
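The initial assignment rule amounts to picking whichever condition has the fewest sessions recorded so far; a minimal sketch (the names and data structures are illustrative, not the platform’s code):

```python
def assign_condition(conditions, session_counts):
    """Return the condition with the fewest recorded sessions
    (ties broken by the order in which conditions are listed)."""
    return min(conditions, key=lambda c: session_counts.get(c, 0))

# Example: condition "B" has the fewest sessions, so it is assigned next.
counts = {"A": 5, "B": 3, "C": 5}
counts[assign_condition(["A", "B", "C"], counts)] += 1
```

Because exclusions happen only after assignment, counts of *included* sessions can still drift apart under this rule, which is why the lists of needed conditions had to be updated manually.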

Parents were largely compliant with testing protocols, including requests that they refrain from talking or close their eyes during parts of the studies to avoid inadvertently biasing the child. However, compliance was far from perfect: 9% of parents had their eyes mostly open on at least one of 3–4 test trials, and 26% briefly peeked on at least one test trial (see Table S2). In the forced-choice study with preschoolers, parents interfered in 8% of trials by repeating the two options or answering the question themselves before the child’s final answer, generally in cases where the child was reluctant to answer. Practice trials before the test trials mitigated data loss due to parent interference; additional recommendations based on our experience are covered in the Discussion and Recommendations section.

To address these issues, we established criteria for fussiness (defined as crying or attempting to leave the parent’s lap), distraction (whether any lookaway was caused by an external event), and various parental actions, including peeking at the video during trials where their eyes should be closed. (See Table S1 and Coding Manual for details.) Exclusion criteria were then based on the number of clips where there was parental interference or where the child was determined to be fussy or distracted. In the two studies using looking measures, two blind coders recorded which actions occurred during individual clips. The first author arbitrated disagreements. This constitutes one of the first direct studies of intercoder agreement on these measures. Coders agreed on fussiness and distraction in at least 85% of clips, with Cohen’s kappa ranging from .37 to .55 (see Table 1).
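For reference, agreement and Cohen’s kappa on a per-clip binary judgment (e.g., fussy vs. not fussy) can be computed as below. This is a generic sketch of the statistic, not the authors’ analysis code:

```python
def cohens_kappa(coder1, coder2):
    """Cohen's kappa: chance-corrected agreement between two raters.

    coder1, coder2: equal-length lists of categorical judgments,
    one entry per clip (e.g., True = fussy)."""
    n = len(coder1)
    p_observed = sum(a == b for a, b in zip(coder1, coder2)) / n
    categories = set(coder1) | set(coder2)
    # Chance agreement from each rater's marginal category frequencies.
    p_expected = sum(
        (coder1.count(c) / n) * (coder2.count(c) / n) for c in categories
    )
    if p_expected == 1.0:          # both raters used a single category
        return 1.0 if p_observed == 1.0 else 0.0
    return (p_observed - p_expected) / (1 - p_expected)
```

Note that when a behavior is rare, raw agreement can be high while kappa remains modest, consistent with the pattern of 85% agreement but kappas of .37–.55 reported above.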

Two natural concerns about online testing are that the home environment might be more distracting than the laboratory or that parents might be more likely to interfere with study protocols. In laboratory-based developmental studies, 14% of infants and children on average are excluded due to fussiness, while only 2% of studies give an operational definition of fussiness (Slaughter & Suddendorf, 2007). Looking times from crying children are unlikely to be meaningful, but subjective exclusion criteria reduce the generalizability of results. Similar issues arise with operationalizing parental interference.

Each session of the preferential looking study was coded using VCode (Hagedorn et al., 2008) by two coders blind to the placement of test videos. Looks to the left and right are generally clear; for examples, see Figure 1. Three calibration trials were included in which an animated attention getter was shown on one side and then the other. During calibration videos, all 138 coded participants looked, on average, more to the side with the attention getter. For each of nine preferential looking trials, we computed fractional right/left looking times (the fraction of total looking time spent looking to the right/left). Substantial differences (fractional looking time difference greater than .15, when that difference constituted at least 500 ms) were flagged and those clips recoded. A disagreement score was defined as the average of the coders’ absolute disagreement in fractional left looking time and fractional right looking time, as a fraction of trial length. The mean disagreement score across the 138 coded participants was 4.44% (SD = 2.00%, range 1.75–13.44%).
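The flagging rule can be expressed concretely as below. The thresholds follow the text; the dict-based trial representation and function names are our own assumptions, not the authors’ coding scripts:

```python
def fractional_looks(trial):
    """Fraction of total looking time spent looking left vs. right.
    `trial` maps 'left'/'right' to looking time in milliseconds."""
    total = trial["left"] + trial["right"]
    if total == 0:
        return {"left": 0.0, "right": 0.0}
    return {side: trial[side] / total for side in ("left", "right")}

def needs_recoding(c1, c2, min_frac=0.15, min_ms=500):
    """Flag a trial for recoding when the two coders' fractional
    looking times differ by more than `min_frac`, and that difference
    amounts to at least `min_ms` of looking time."""
    f1, f2 = fractional_looks(c1), fractional_looks(c2)
    for side in ("left", "right"):
        if (abs(f1[side] - f2[side]) > min_frac
                and abs(c1[side] - c2[side]) >= min_ms):
            return True
    return False
```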

Measuring until the first continuous lookaway of a given duration introduces a thresholding effect in addition to the small amount of noise induced by a reduced framerate. The magnitude of this effect depends on the dynamics of children’s looks to and away from the screen. We examined a sample of 1,796 looking times, measured until the first one-second lookaway, from 252 children (M = 13.9 months, SD = 2.6 months) tested in our lab with video recorded at 30 Hz. Reassuringly, in 68% of measurements, the lookaway that ended the measurement was over 1.5 s. We also simulated coding of these videos at framerates ranging from 0.5 to 30 Hz; the median absolute difference between looking times calculated from our minimum required framerate of 2 Hz vs. the original video was only 0.16 s (interquartile range = 0.07–0.29 s; see Figure S1 in Scott, Chu, & Schulz, 2017).

Each session of the looking time study was coded using VCode (Hagedorn, Hailpern, & Karahalios, 2008) by two coders blind to condition. Looking time for each of eight trials per session was computed based on the time from the first look to the screen until the start of the first continuous one-second lookaway, or until the end of the trial if no valid lookaway occurred. Differences of 1 s or greater, and differences in whether a valid lookaway was detected, were flagged and those trials recoded. Agreement between coders was excellent; coders agreed on whether children were looking at the screen on average 94.6% of the time (N = 63 children; SD = 5.6%). The mean absolute difference in looking time computed by two coders was 0.77 s (SD = 0.94 s).
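The lookaway rule translates to a simple scan over per-frame looking judgments. The sketch below assumes a fixed framerate and a boolean per-frame record; this representation is our own illustration, not VCode’s output format:

```python
def looking_time(samples, fps, lookaway_s=1.0):
    """Looking time from the first look at the screen until the start
    of the first continuous lookaway of `lookaway_s` seconds, or until
    the end of the trial if no such lookaway occurs.

    samples: per-frame booleans (True = looking at the screen)."""
    threshold = int(round(lookaway_s * fps))
    try:
        first_look = samples.index(True)
    except ValueError:
        return 0.0                     # child never looked at the screen
    away_run = 0
    for i in range(first_look, len(samples)):
        if samples[i]:
            away_run = 0
        else:
            away_run += 1
            if away_run == threshold:
                # The qualifying lookaway began `threshold` frames ago.
                return (i + 1 - threshold - first_look) / fps
    return (len(samples) - first_look) / fps
```

At low effective framerates, `threshold` shrinks to only a few frames, which is the source of the thresholding noise discussed above.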

To test the feasibility of online developmental research across a variety of methods and age groups, we conducted three studies: a looking time study with infants (11–18 months) based on Téglás, Girotto, Gonzalez, and Bonatti (2007), a preferential looking study with toddlers (24–36 months) based on Yuan and Fisher (2009), and a forced-choice study with preschoolers (ages 3 and 4) based on Pasquini, Corriveau, Koenig, and Harris (2007). These studies allowed us to assess how online testing affected coding and reliability, children’s attentiveness, and parental interference. For details on the specific studies, see Scott et al. (2017).

Even where research samples are racially or economically diverse, participation in research studies is often skewed toward parents with higher education levels. We estimated the expected distribution of educational attainment by weighting American census data based on the distribution of Lookit parents’ age ranges and genders. Lookit parents had education levels much more representative of the American population than, for instance, a sample of 96 parents in a recent study conducted by our lab at the Boston Children’s Museum (an institution specifically committed to affordability and diversity; see Figure 2). Nonetheless, additional outreach is likely necessary to reach parents who have not completed high school and to obtain a truly nationally representative sample.

We collected demographic information from participants to see whether the promise of expanded participation in developmental research was fulfilled. Families participating on Lookit were more representative of the U.S. population than typical lab samples on several measures. Fifty percent of participants came from families with a yearly income under $50,000 and 73% from families with a yearly income under $75,000 (N = 552 responding out of 759 unique participants in their study’s age range and with a valid consent video). Our participants reported 21 distinct languages spoken in the home in addition to English; 9% (N = 571 responding) were multilingual. Participants’ races were roughly representative of the American population; Table 2 summarizes the racial distribution of participants compared to recent census data.

DISCUSSION AND RECOMMENDATIONS

Our case studies confirmed the viability of Lookit as a method for remote data collection in developmental psychology in several important respects. Most importantly, the platform worked, for both parents and researchers. Parents were able to log into the system, select studies, administer the experiments themselves, and upload their child’s data. Researchers were able to host multiple studies on the site, control timing and counterbalancing of stimuli, assign participants to conditions, limit the age range for each study, receive transmitted video, monitor consent, and code dependent measures, including preferential looking, looking time, and verbal response measures across ages from 11 months through 4 years.

The coding results show that preferential looking and looking time measures can be collected from streamed webcam video, without extensive instruction to parents about positioning. Despite varying webcam placement and video resolution, mean disagreement between blind coders on looking time was less than 1 s, with agreement 95% of the time, typical for offline coding in labs (90–95% agreement reported by Baillargeon, Spelke, & Wasserman, 1985; Feigenson, Carey, & Spelke, 2002; Onishi & Baillargeon, 2005; Starkey, Spelke, & Gelman, 1990; Xu & Garcia, 2008; Xu, Spelke, & Goddard, 2005). Mean disagreement on time spent looking to the left/right of the screen was less than 5% of trial length, also in line with lab estimates, although agreement is typically reported regarding whether the child was looking left, right, or away before summing those intervals (e.g., 91% by Smith & Yu, 2008; 98% by Yuan & Fisher, 2009).

The studies also mitigate concerns about parent-administered testing, although several modifications could improve compliance. The request that posed the biggest challenge was that parents close their eyes. For example, several parents reported peeking due to concerns about what was being shown to their child. If asking parents not to watch what their child is shown, we recommend that researchers provide a clear, simple explanation of why this is necessary and, if possible, allow parents to view stimuli in advance. To address more general difficulties with parent blinding, we recommend including practice trials and asking parents to close their eyes several seconds before blinding is necessary (for instance, so that they can first ensure a video is playing). We also recommend that if possible the parent be asked to face away from the computer with the child looking over their shoulder (to make peeking less tempting and easier to detect). We emphasize, though, that even with minimal instructions, parents generally closed their eyes when asked and failures to comply were readily detectable.

One striking difference between Lookit and traditional lab studies was that our overall yield was quite low at 26%: of 997 nonrepeat sessions, only 255 were included in the final analysis across the three test studies. Data loss was primarily due to factors unique to the online testing environment (failure to provide informed consent, technical failure of video recording, and parents leaving the study early). Because these factors were apparent early in the coding process, they did not create an undue coding burden.

Exclusion rates due to infant behavior were broadly comparable to those in the lab, although higher than the average of 22% (range 0–87%) for violation-of-expectation paradigms reported by Slaughter and Suddendorf (2007). In the looking time study, 43 of 112 valid video submissions were excluded due to the child’s behavior and 20 for parent interference or technical criteria that were not relevant in the original study; the effective exclusion rate was 47%, compared to 50% excluded in Téglás et al. (2007). However, overall exclusion rates due to child and parent behavior were generally higher than in the original studies. In the preferential looking study, we excluded 36% of children due to small differences in looking preferences when asked to find familiar verbs on opposite sides of the screen, in addition to 10% excluded due to parent interference and 5% due to low attention. In contrast, Yuan and Fisher (2009) excluded only 10% due to side bias, distraction, poor practice trial performance, or outlier preference at test, and did not report any parent interference. Finally, in the forced-choice study we excluded 20% of children due to incorrect naming of familiar objects and 17% due to insufficient valid answers to test questions (see Supplemental Materials of Scott et al., 2017, for details). In contrast, Pasquini et al. (2007) excluded only 6% of children, solely for inaccuracy on questions about familiar object names.
However, having conducted studies both online and in the laboratory, we believe the cost of the low online yield is unlikely to approach the in-lab costs of outreach, recruitment, and scheduling: it takes only around 2 minutes total to code consent, check video usability, and process an AMT submission, plus about 20 minutes per coder to code a complete session (which would be necessary for most in-lab studies using looking measures as well). In contrast, recruiting, scheduling, and testing one child in the lab generally takes around an hour, exclusive of coding. Further technical and user-experience optimization should also decrease the rates of invalid consent videos and video failure in online studies.

Moving protocols from the lab to the web browser will require continued methodological refinement. Encouragingly, we found looking times similar to those reported in the lab in the looking time study and excellent attention (over 80% looking) to the dialogues in the preferential looking study. However, despite children’s overall attentiveness, we recommend that researchers choose methods as robust as possible to minor interruptions. For this reason, we suggest using preferential looking rather than looking time paradigms if possible. In preferential looking, a distracted lookaway (e.g., to the family dog) simply decreases the measurement period; in a looking time paradigm it ruins an entire measurement. Preferential looking based on audio prompts may need to be optimized for online presentation to induce more reliable responses, as we observed wide variation in 2-year-olds’ looking to familiar verbs. We suggest that labs seeking to use verbal response measures design studies to be robust to nonresponses (e.g., by using many short questions and pooling responses); provide guidance for parents in prompting their children to respond, including example videos or practice trials; and encourage engagement by allowing the online interface to respond contingently (e.g., by repeating back an answer the child chose, as selected by the parent). As we move studies online, it will also be crucial to clearly define behavioral criteria for exclusion of trials or participants in order to harness Lookit’s full potential for replicable results.

Another striking difference between Lookit and traditional lab studies was the diversity of incomes, parental education levels, races, and language backgrounds represented among the participants. Inclusion criteria, both technical and behavioral, did not disproportionately affect lower socioeconomic status (SES) families in any of the studies, nor was performance linked to SES where there were clear normative choices (see Scott et al., 2017, for details). In the context of the current studies, we believe the absence of any relationship between performance and SES is encouraging with respect to the accessibility of the platform. Moreover, the diversity of the participant pool suggests that for studies where SES is a critical variable, online testing may be an appropriate interface for assessing its impact.

Nonetheless, there are limitations to online testing. Any dependence on the child’s behavior must be implemented via the parent, for instance, by allowing the parent to pause a study or repeat a question; infant-contingent displays are not yet possible. The experimenter cannot directly engage in joint attention or pedagogical cueing or adjust fluidly for momentary distractions. For studies where synchronous attention or direct pedagogical engagement is critical, or its effects are the question of interest (e.g., Csibra & Gergely, 2009; Yu & Smith, 2012), or where the visual angle subtended by stimuli must be tightly controlled, Lookit will not be an appropriate interface.

Even within the scope of methods adaptable to the online environment, several issues must be addressed before online testing achieves its full potential. First, Lookit is currently a prototype and does not yet include a “plug and play” interface. Programming expertise beyond what is expected in most graduate developmental programs was required to implement studies on the platform. In collaboration with the Center for Open Science, we are working toward an easy-to-use experimenter interface. Second, AMT is not designed specifically for recruiting parents, so recruitment is not as efficient as it might be. Lookit may become a more effective tool as it connects with sites that directly target parents. At that stage, however, maintaining parents’ interest will require a steady supply of novel content from developmental labs. Thus, expanding interest in the site from both researchers and parents should be mutually reinforcing.

We look forward to creative uses of this method. Although it will not be appropriate for every study, online research can expand access to both more representative populations and rare populations, and make it easier to conduct large-scale longitudinal studies, detect small and graded effects, generate data sufficient for testing computational models, and assess individual differences and developmental change. We hope this tool will also be used to replicate classic effects and make the replication of new results easier. Finally, in connecting families and scientists, Lookit offers exciting new opportunities for education and outreach. As a venue for “citizen science” as well as scientific research, our goal for Lookit is to expand the scope of both the questions we ask and the people we reach.