
Short answer: The data are likely to be noisier and the absolute reaction times can't be trusted, but given enough power (which is easy to obtain on the Internet), relative reaction time differences should be similar to those in the lab. However, web-based reaction time studies might pose other problems, because you have less control over stimulus presentation and over how participants behave.

Long answer: There is some research that has looked at Internet-based collection of reaction time data using different software approaches. The number of papers is small, but they converge on the conclusion that there will be more noise, but that online collection can be quite useful depending on the specific research question.

The effect of additional noise

Some noise stems from the fact that hardware and software vary widely "in the wild". For example, using a Java applet, Eichstaedt (2001) showed considerable variation in reaction times across different PCs. Some of this variation between computers is based on factors that add a constant to the reaction times measured on a specific machine. These constants don't matter if you do within-subject reaction time comparisons, as are common in cognitive paradigms. Other factors will add random noise. For example, some keyboards only transmit responses at a fixed polling rate (e.g., every 20 ms), so the timing resolution is bound by this limit. Also, other software running in the background may add random noise. Nevertheless, given enough trials and enough participants, this random noise may be a manageable nuisance.
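To make the distinction concrete, here is a minimal simulation (my own sketch, not taken from any of the cited papers; all parameter values are illustrative assumptions) showing that a per-machine constant inflates absolute reaction times but cancels out of within-subject differences, while keyboard polling adds quantization noise:

```python
# Illustrative sketch: per-machine constants vs. polling quantization.
# All parameter values (offsets, polling interval, effect size) are
# assumptions for demonstration, not estimates from the literature.
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 200

true_a = rng.normal(500, 50, n_subjects)           # "true" RTs in condition A (ms)
true_b = true_a + 30                               # condition B: true 30 ms effect

machine_offset = rng.uniform(20, 120, n_subjects)  # constant per machine

def measure(rt, offset, poll_ms=20):
    """Observed RT: true RT plus machine constant, rounded up to the next keyboard poll."""
    return np.ceil((rt + offset) / poll_ms) * poll_ms

obs_a = measure(true_a, machine_offset)
obs_b = measure(true_b, machine_offset)

print(obs_a.mean() - true_a.mean())   # absolute RTs biased upward by ~80 ms
print((obs_b - obs_a).mean())         # within-subject difference still ~30 ms
```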

In fact, using simulations, Brand and Bradley (2012) have found that adding a random 10 to 100 ms delay to response times reduced statistical power only by 1-4% across a range of different effect sizes.
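A rough simulation in the spirit of this finding (my own sketch, not Brand and Bradley's actual code; group sizes, effect size, and trial SDs are assumptions) can be run with a simple t-test comparison:

```python
# Sketch in the spirit of Brand and Bradley (2012): how much power is lost
# when a uniform 10-100 ms technical delay is added to each measured RT?
# All parameters are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(effect_ms=30, n=40, n_sims=2000, web_noise=False):
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(500, 100, n)               # control group RTs
        b = rng.normal(500 + effect_ms, 100, n)   # treatment group RTs
        if web_noise:                             # add technical delay per trial
            a = a + rng.uniform(10, 100, n)
            b = b + rng.uniform(10, 100, n)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    return hits / n_sims

print(power(web_noise=False))  # lab-like power
print(power(web_noise=True))   # only a few percentage points lower
```

The intuition: a uniform 10-100 ms delay has an SD of only about 26 ms, which barely inflates a trial-to-trial SD of 100 ms, so power changes little.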

Research that has compared response times collected with online and lab technologies suggests similar conclusions. For example, using the Flash-based ScriptingRT, Schubert et al. (2013, Study 1) showed that

the SDs of [reaction times] stayed below 7 ms in all three browsers. That value is comparable to many regular keyboards and standard reaction time software. In addition, the constant added by measuring in ScriptingRT was about 60 ms. This result suggests that researchers using ScriptingRT should thus focus primarily on differences between RTs and be cautious when interpreting absolute latencies.

From Study 2:

ScriptingRT resulted in both longer response latencies and a larger standard deviation than all other packages except SuperLab and E-Prime in one configuration. Nevertheless, in absolute terms, the SD of 4.21 is comparable to what was standard for keyboards for a long time [16]. It is thus clear that any test with ScriptingRT should be well powered and used to assess primarily paradigms with a large effect size.

Similarly, comparing JavaScript- and Flash-based data collection, Reimers and Stewart (2014) concluded that, in general,

within-system reliability was very good for both Flash and HTML5—standard deviations in measured response times and stimulus presentation durations were generally less than 10 ms. External validity was less impressive, with overestimations of response times of between 30 and 100 ms, depending on system. The effect of browser was generally small and nonsystematic, although presentation durations with HTML5 and Internet Explorer tended to be longer than in other conditions. Similarly, stimulus duration and actual response time were relatively unimportant—actual response times of 150-, 300-, and 600-ms response times gave similar overestimations.

Replications of cognitive paradigms with online samples

Several papers have used online data collection to replicate well-known effects from lab-based research.

For example, Schubert et al. (2013) replicated the Stroop effect with both online and lab technology and found that the size of the effect was independent of the software used. Using Java, Keller et al. (2009) replicated the results of a self-paced reading paradigm from the psycholinguistic literature. The most comprehensive replication project has been published by Crump et al. (2013), who replicated Stroop, task switching, flanker, Simon, Posner cuing, attentional blink, subliminal priming, and category learning tasks on Amazon's Mechanical Turk.

Other challenges and limitations

There are several other challenges and limitations associated with online response time collection:

A different question is the accuracy with which stimuli can be presented online. There will be limits to timing resolution (see, e.g., Garaizar et al., 2014; Reimers & Stewart, 2014; Schubert et al., 2013) and visual differences (color and resolution) depending on hardware and environmental light.

Often online samples will be more diverse with regard to age and education, and some participants may have difficulty understanding complex instructions. Also, in an online study it is easier to abandon a boring RT task than in the lab (Crump et al., 2013).

Participants' hardware may be confounded with other variables, so that a systematic RT constant may be added for certain demographic groups. This is not a problem for reaction time differences within participants. However, correlations of absolute reaction times with personality or demographic variables may be spurious, as warned by Reimers and Stewart (2014); see the sketch below.
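A minimal sketch of this confound (illustrative assumptions only: the group labels, hardware lags, and effect sizes are invented for the demonstration):

```python
# Sketch of the confound warned about by Reimers and Stewart (2014):
# if slower hardware is more common in one demographic group, absolute RTs
# correlate with group membership even though the true RTs do not.
# Within-subject differences are unaffected. All numbers are assumptions.
import numpy as np

rng = np.random.default_rng(2)
n = 500
group = rng.integers(0, 2, n)                    # two demographic groups
hw_lag = rng.uniform(20, 60, n) + 40 * group     # group 1 uses slower hardware

true_a = rng.normal(500, 50, n)                  # true RTs, unrelated to group
true_b = true_a + 30 + rng.normal(0, 20, n)      # true within-subject effect

obs_a = true_a + hw_lag
obs_b = true_b + hw_lag

print(np.corrcoef(group, obs_a)[0, 1])           # clearly nonzero: spurious
print(np.corrcoef(group, obs_b - obs_a)[0, 1])   # near zero: lag cancels out
```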

References

Brand, A., & Bradley, M. T. (2012). Assessing the Effects of Technical Variance on the Statistical Outcomes of Web Experiments Measuring Response Times. Social Science Computer Review, 30, 350–357. doi:10.1177/0894439311415604

Crump, M. J. C., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon’s Mechanical Turk as a Tool for Experimental Behavioral Research. PLoS ONE, 8, e57410. doi:10.1371/journal.pone.0057410

Eichstaedt, J. (2001). An inaccurate-timing filter for reaction time measurement by JAVA applets implementing Internet-based experiments. Behavior Research Methods, Instruments, & Computers, 33, 179–186. doi:10.3758/BF03195364

Garaizar, P., Vadillo, M. A., & López-de-Ipiña, D. (2014). Presentation Accuracy of the Web Revisited: Animation Methods in the HTML5 Era. PLoS ONE, 9, e109812. doi:10.1371/journal.pone.0109812

Keller, F., Gunasekharan, S., Mayo, N., & Corley, M. (2009). Timing accuracy of Web experiments: A case study using the WebExp software package. Behavior Research Methods, 41, 1–12. doi:10.3758/BRM.41.1.12

Reimers, S., & Stewart, N. (2014). Presentation and response timing accuracy in Adobe Flash and HTML5/JavaScript Web experiments. Behavior Research Methods, 1–19. doi:10.3758/s13428-014-0471-1

Schubert, T. W., Murteira, C., Collins, E. C., & Lopes, D. (2013). ScriptingRT: A software library for collecting response latencies in online studies of cognition. PLoS ONE, 8, e67769. doi:10.1371/journal.pone.0067769