Amazon’s Mechanical Turk (MTurk) is arguably one of the most important research tools of the past decade. The ability to rapidly collect large amounts of high-quality human subjects data has advanced multiple fields, including personality and social psychology. Beginning in summer 2018, concerns arose regarding MTurk data quality, leading to questions about the utility of MTurk for psychological research. We present empirical evidence of a substantial decrease in data quality using a four-wave naturalistic experimental design: pre-, during, and post-summer 2018. During and to some extent post-summer 2018, we find significant increases in participants failing response validity indicators, decreases in reliability and validity of a widely used personality measure, and failures to replicate well-established findings. However, these detrimental effects can be mitigated by using response validity indicators and screening the data. We discuss implications and offer suggestions to ensure data quality.
