Although no method can guarantee the validity of an inference, research design and sample choices can have important consequences for validity (Shadish et al. 2002). Shadish et al. (2002) developed a validity typology to guide researchers’ evaluation of a variety of research designs and methods, comprising (1) internal validity, (2) statistical conclusion validity, (3) external validity, and (4) construct validity. We draw from this typology to focus on ten methodological concerns with MTurk samples. We discuss each concern in relation to relevant types of validity evidence, highlighting how each issue can potentially threaten the validity of inferences made from MTurk samples. We also provide a set of practical recommendations for the use of MTurk based on our evaluations (see Table 1 for a summary).¹

Subject Inattentiveness

Subject inattentiveness refers to the phenomenon whereby respondents answer questions without paying full attention to study instructions, complying with those instructions, accurately understanding item content, or providing accurate responses. Inattentive responding can be particularly problematic in online samples because experimental or survey administrations are often unproctored (Fleischer et al. 2015).

Threat to Internal Validity

Subject inattentiveness, and the related issue of insufficient effort responding (IER; Huang et al. 2012, 2015b), represents an important threat to internal validity with MTurk samples: when MTurk Workers do not attend to the experimental stimuli or study instructions, the manipulation and measurements may not work effectively. For example, if participants do not pay attention to the survey instructions before responding to the survey items, IER can introduce an additional source of extraneous variance that may be misinterpreted as part of the hypothesized effect or may confound the observed covariations (McGonagle et al. 2016). As we detail in our recommendations below, researchers collecting data from MTurk samples should take steps to monitor subject inattentiveness and detect IER.

Threat to Statistical Conclusion Validity

Statistical conclusion validity refers to the extent to which statistical inferences made about the correlation (or covariation) between two variables are warranted. As with internal validity, subject inattentiveness can threaten statistical conclusion validity for studies of MTurk Workers and other similar online samples (Fleischer et al. 2015). When MTurk Workers do not pay attention to the items they respond to, the reliability and quality of the resulting data can be substantially compromised. Past research has shown that IER can increase measurement error variance and thus attenuate or inflate observed correlations between variables (Huang et al. 2015a, b; Kam and Meyer 2015; McGrath et al. 2010).
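The attenuation mechanism can be illustrated with a short simulation. The sketch below is purely hypothetical (not a reanalysis of any cited study): it generates two standardized variables with a known true correlation, replaces a fraction of respondents with content-independent random responses, and returns the observed correlation, which shrinks relative to the true value. All parameter values are illustrative.

```python
import numpy as np

def simulate_attenuation(n=2000, rho=0.5, careless_rate=0.2, seed=0):
    """Observed correlation when a fraction of respondents answer randomly.

    Simulates `n` respondents on two standardized variables with true
    correlation `rho`, then replaces `careless_rate` of them with uniform
    random responses that are independent of item content.
    """
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    data = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    n_careless = int(n * careless_rate)
    # Careless respondents answer independently of item content.
    data[:n_careless] = rng.uniform(-2, 2, size=(n_careless, 2))
    return np.corrcoef(data[:, 0], data[:, 1])[0, 1]
```

With 20% careless responders mixed in, the observed correlation falls noticeably below the true value of .50, mirroring the attenuation effect documented in the IER literature.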

Threat to Construct Validity

Subject inattentiveness can also affect construct validity. Construct validity refers to “the degree to which inferences are warranted from the observed persons, settings, and cause and effect operations included in a study to the constructs that these instances might represent” (Shadish et al. 2002, p. 38). Inattentiveness is especially problematic when scale development and validation efforts are based on data containing substantial amounts of inattentive or careless responses (Meade and Craig 2012). In samples with as few as 10% careless responses, factor structures can become inaccurate, item correlations and model fit indices can be negatively affected, and subsequent conclusions about the measured constructs can become unreliable (Fleischer et al. 2015; Meade and Craig 2012; Schmitt and Stults 1985; Woods 2006).

Recommendations for Screening for Subject Inattentiveness

MTurk respondents may differ in the levels of effort and attention exerted in their tasks. Recent research by Hauser and Schwarz (2016) found that MTurk Workers tend to be more attentive than traditional subject pool samples, but the detection of IER or inattentiveness is still important in ensuring data quality (Aust et al. 2013; Oppenheimer et al. 2009; Ran et al. 2015). Attention check questions (ACQs) are particularly important for data quality if researchers choose not to use MTurk Workers’ approval ratings as a participation criterion (Peer et al. 2014).

There are multiple steps researchers can take to detect and screen out inattentive or careless responding. For example, Meade and Craig (2012) recommended incorporating instructed response items, computing response consistency indices (e.g., based on response patterns), and conducting multivariate outlier analyses. Huang et al. (2012, 2015a, b) recommended using response time, infrequency items (i.e., items with improbable factual statements to which attentive participants should all respond the same way), psychometric antonyms (i.e., flagging inconsistent responses across item pairs that should be answered in opposite ways), and individual reliability estimates to detect IER. Similarly, DeSimone, Harms, and DeSimone (2015) recommended using direct (e.g., instructed or bogus items), archival (e.g., response time, long string), and statistical (e.g., psychometric antonyms) screening techniques, and advised that multiple techniques be used because they detect different types of problems related to subject inattentiveness.
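As an illustration, the “long string” archival index described by DeSimone et al. (2015) can be computed with a short function. The sketch below is a minimal Python implementation; the cutoff of 10 identical consecutive responses is an arbitrary placeholder, not a validated threshold, and researchers should justify a cutoff appropriate to their own scale length and response format.

```python
def longest_run(responses):
    """Length of the longest run of identical consecutive responses."""
    if not responses:
        return 0
    best = run = 1
    for prev, curr in zip(responses, responses[1:]):
        run = run + 1 if curr == prev else 1
        best = max(best, run)
    return best

def flag_long_string(respondents, threshold=10):
    """Return IDs of respondents whose longest run meets the threshold.

    `respondents` maps a respondent ID to that person's ordered list of
    Likert responses. The default threshold is illustrative only.
    """
    return [rid for rid, resp in respondents.items()
            if longest_run(resp) >= threshold]
```

Because a long string of identical responses can occasionally be legitimate (e.g., on a short unidimensional scale), flagged cases are best reviewed alongside other indices rather than dropped automatically.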

Researchers implementing any of these data screening techniques should consider explaining to MTurk Workers in the HIT instructions/informed consent that response patterns will be monitored and that any indication of random responding (without specifying the detection methods) will result in nonpayment. Huang et al. (2015a) found that a benign warning about IER detection was effective at reducing IER without provoking negative reactions. When responses are monitored, Workers can be expected to pay closer attention to instructions, and if a Worker fails a detection item, the researchers have justifiable grounds to reject the work.

We also recommend that researchers offer second chances to Workers who fail data screening on their first attempt. Oppenheimer et al. (2009) found that after a prompt requesting that respondents pay closer attention to the instructions, the responding behaviors of those who initially failed the ACQs were indistinguishable from those who passed them. Offering Workers a second chance may improve data quality while limiting data loss without risking selection biases (Aust et al. 2013). We encourage researchers to explain in their instructions or informed consent documents that each Worker is allowed a specific number of attempts.² Clarity about the maximum number of attempts provides strong justification for refusing further attempts or rejecting a HIT, minimizes perceptions of unfairness, and protects Requesters’ reputations in the MTurk community.

Selection Biases

Selection biases occur when MTurk Workers (1) self-select into the MTurk Worker population and (2) self-select into a particular study. While the former applies uniquely to MTurk, the latter occurs in almost all types of research that require human subject participation. Specifically, regardless of the research design, participants must voluntarily make themselves available to take part in a study (Woo et al. 2015). Issues related to selection biases on MTurk can pose threats to construct and external validity.

Threat to Construct Validity

Selection biases in MTurk samples may raise construct validity concerns because the participants sampled in a study have a direct bearing on construct validity. Researchers must consider the fact that MTurk Workers self-selected to become a part of the MTurk population. That is, even a randomly selected MTurk sample would still be subject to selection biases by virtue of who participates in MTurk and who opts for particular HITs. In particular, self-selection can threaten the extent to which participant characteristics correspond to the population to which inferences are made. For example, construct validity may become questionable if a study purported to examine the retirement intentions of older employees but used predominantly younger MTurk Workers as participants. In other words, there may be a lack of correspondence between the operational definitions used in the study and the measured constructs about which researchers draw inferences.

There are several distinct characteristics of MTurk Workers that may lead to self-selection biases, including irregular employment status, interests in monetary incentives, and inherent enjoyment of participating in HITs (Ipeirotis 2010). Depending on the study purposes, these factors may or may not weaken the validity of findings. The larger point that we will reiterate throughout our review is that the extent to which validity can be threatened by MTurk depends heavily on the research questions being investigated. There is not a “one-size-fits-all” answer to the appropriateness of MTurk; rather, researchers need to critically evaluate whether MTurk is appropriate for their research objectives.

Threat to External Validity

MTurk is commonly praised for overcoming external validity concerns across different study designs, including experimental, quasi-experimental, and nonexperimental (e.g., survey) designs. Unlike other frequently used samples (e.g., undergraduates and employees from the same organization/occupation), MTurk allows researchers to overcome some generalizability problems by providing easy access to heterogeneous populations (i.e., with greater occupational and demographic diversity). MTurk samples can be particularly suitable for researchers who seek to understand work outside of WEIRD (Western, educated, industrialized, rich, and democratic) and traditional organizational contexts (e.g., humanitarian work psychology, cross-cultural psychology; Woo et al. 2015) or in hard-to-reach employee populations (e.g., employees who are disabled, marginalized, victims of workplace harassment, or of low socioeconomic status; Smith et al. 2015). Additionally, Bergman and Jean (2016) discussed the need for increased attention to underrepresented workers in organizational psychology, and MTurk may help facilitate studies of certain groups of underrepresented workers (e.g., underemployed workers).

The ability to sample MTurk Workers from a large participant pool (more than 500,000 Workers from 190 countries according to mturk.com, although the active population has been estimated at about 7,300 Workers; Stewart et al. 2015) strengthens the external validity argument, but it does not preclude the issues arising from the fact that random sampling is highly unlikely on MTurk. Random sampling can simplify external validity inferences because observed relationships are expected to be the same as in any other random sample of the same size from the same population (Shadish et al. 2002). However, MTurk Workers self-select into the MTurk population and choose which HITs to complete based on their personal preferences. Respondent characteristics or personal preferences may thus become confounds of the observed relationships—a key feature of selection bias (Shadish et al. 2002). Moreover, the large MTurk participant pool with its international membership (mostly from the U.S. and India) is ideal for testing organizational theories expected to be broadly applicable across different organizational contexts (Landers and Behrend 2015), but it may not work for phenomena relevant only to certain industries or to non-English-speaking organizations.

We should also note that the generalizability of a research study is not limited to variations across persons, but also across settings, treatments, and outcomes (Shadish et al. 2002). Therefore, depending on the research objectives, MTurk’s ability to sample from a more diverse pool may not necessarily overcome all types of generalizability concerns.

Recommendations for Evaluating Selection Biases

MTurk Workers self-select into the MTurk participant pool and have the discretion to choose which HITs to complete. Based on their research questions, researchers should evaluate—before they begin MTurk data collection—the extent to which self-selection into MTurk may threaten the validity of their findings. For example, if a study involves constructs that are inherently reflected in the decision to sign up as an MTurk Worker, such as being tech-savvy, having access to computers, or having an interest in online surveys, then MTurk is not recommended as a data source because the construct measurement or manipulation would be contaminated. However, if a study aims to examine psychological phenomena in a diverse population spanning various geographies and industries, an experimental manipulation is more likely to be successful with MTurk Workers than with employees from a single organization (Woo et al. 2015). MTurk Workers’ motivation to participate in a study can also affect experimental manipulations and survey responses. Researchers can include questions about Workers’ motivation to participate and investigate whether these motivations affect survey responses and the resulting findings (e.g., motivation-related common method variance; McGonagle 2015).

Demand Characteristics

Demand characteristics, or what Shadish et al. (2002) refer to as experimenter expectancies, are a potential methodological concern because researchers may influence participants’ responses by conveying expectations about desirable (or correct) responses; those expectations may become a part of the measured constructs and subsequently confound findings.

Threat to Internal Validity

The manner in which experimenter expectancies manifest differs on MTurk compared with other settings in which they might be a concern. In traditional research settings, face-to-face interactions between researchers and participants can lead participants to react to their perceptions of which responses or behaviors are expected. Although MTurk offers the advantage that demand characteristics can be reduced because of the lack of face-to-face interaction between experimenters and participants (Highhouse and Zhang 2015), measurement contamination may still occur because MTurk Workers may communicate in MTurk forums and learn the purposes of certain HITs (Schmidt 2015).

Threat to Construct Validity

Shadish et al. (2002) suggested that demand characteristics can be minimized by limiting contact between researchers and participants; MTurk is advantageous in that experimenter demand effects are less likely than in laboratory or field settings (Highhouse and Zhang 2015). The anonymity afforded by MTurk also reduces evaluation apprehension on the part of MTurk Workers. However, demand characteristics can still be present and can provide cues about expected behaviors if MTurk Workers are asked to pass a number of qualification requirements before they can proceed to a HIT. Workers may be motivated to respond untruthfully, based on the demand characteristics they perceive, in order to qualify for a HIT or to avoid losing approval ratings.

Participant motivation can also lead to reactive self-report changes and conformity to experimenter expectancies. MTurk Workers tend to be more honest in reporting behaviors and to exhibit weaker social desirability tendencies than in-person samples, and they are more comfortable disclosing personal feelings because of MTurk’s anonymous platform (Shapiro et al. 2013; Smith et al. 2015; Woo et al. 2015). However, socially desirable responding may still occur when participants’ payments and approval ratings are contingent on their behaviors (Antin and Shaw 2012). Participant motivation to earn money from HITs can thus contaminate the measured constructs and can lead to issues such as changes in item quality (Fleischer et al. 2015).

Recommendations for Minimizing Demand Characteristics and Understanding Participant Motivation

As Schmidt (2015) noted, there is a vibrant online community of MTurk Workers (a unique MTurk feature that is uncommon among other sample sources). Multiple websites have been developed for Workers to rate Requesters, comment on them and their HITs, and communicate with Requesters and other Workers about individual HITs. Examples of MTurk online communities include Turkopticon (http://turkopticon.ucsd.edu/), Turker Nation (http://www.turkernation.com/), and Reddit (http://www.reddit.com/r/mturk). Researchers are encouraged to actively monitor these websites while they collect data from Workers. When deception or manipulations are involved, it is important to make sure that the study purposes are not revealed to other Workers; otherwise, contamination may occur and the integrity of findings may be compromised. Additionally, to the extent possible while adhering to the principles of informed consent, researchers should avoid cues that signal the study purposes or desired participant characteristics (or eligibility criteria) to Workers ahead of time. In this way, researchers can minimize demand characteristics and prevent Workers from fabricating their identities to participate in a HIT.

Experimenter demand effects may vary depending on the nature of participant motivation. According to Podsakoff et al. (2012), motivational factors may cause biased responding. To fully understand the differential effects of these motives, researchers should measure Workers’ motives for participating in their studies (e.g., inherent enjoyment and monetary incentives). Doing so allows researchers to better understand how Workers’ motivation might moderate the findings or change the study outcomes.

Repeated Participation

MTurk Workers are not limited in the number of HITs they can complete for each Requester. Repeated participation can occur especially when MTurk Workers are inclined to complete tasks published by their “favored” Requesters (Chandler et al. 2014). Repeated participation is a particularly prominent concern in online research because of the lack of face-to-face interaction between researchers and respondents. For instance, a two-wave study in which MTurk Workers completed the same set of experimental tasks at two points in time showed markedly smaller effect sizes in the second wave (Chandler et al. 2015). Additionally, a recent study of Internet panels identified four types of respondents, one of which was the professional respondent (Matthijsse et al. 2015). Even though Matthijsse et al. (2015) did not find substantial differences in data quality between professional and nonprofessional respondents, demographic and motivational differences between the two groups may lead to inaccurate validity inferences.

Threat to Internal Validity

Repeated participation can cause problems with manipulations, especially given the potential for cross-experiment stimuli contamination, which means that random assignment might not actually be completely random. For example, MTurk Workers who repeatedly participate in the HITs published by their favorite Requesters may have knowledge about the study purposes and the content of different experimental conditions. Evidence is mixed concerning the prevalence of habitual or repeated participation on MTurk. Berinsky et al. (2012) found that repeated survey taking (i.e., multiple responses from a single IP address) is not a large problem on MTurk, whereas Harms and DeSimone (2015) highlighted that there are “professional” Turkers (e.g., those with the Master Qualification) who are active MTurk users and whose representation across multiple MTurk samples can cause problems such as sample nonindependence. Specifically, MTurk Workers may specialize in particular types of HITs, and their experiences with these HITs may confound the observed or hypothesized effects. Although repeated participation can threaten the internal validity of many experiments, it is not clear how severely it affects survey research. Ultimately, researchers should ask: What is the base rate of repeated participation among MTurk Workers? How would MTurk users’ nonnaiveté affect researchers’ ability to answer their research questions? And are naïve (i.e., nonrepeating) or seasoned, experienced MTurk users more suitable for the research questions at hand?

Threat to Construct Validity

Habitual or repeated participation can create treatment diffusion—a threat to construct validity—because participants may receive information about conditions to which they were not assigned (Shadish et al. 2002). For example, a lucrative HIT may prompt MTurk users to create multiple accounts (even though doing so is prohibited under Amazon’s user agreement), and one person may participate in two or more study conditions under the guise of different MTurk Worker IDs. Alternatively, the study purposes may be discussed among MTurk Workers on MTurk forums, causing potential treatment diffusion. However, cross-experiment stimuli contamination is arguably less likely on MTurk than in studies of workers from a single organization (Highhouse and Zhang 2015). Even though MTurk Workers can talk among themselves on the Internet, employees within the same organization are more likely to communicate about the “treatment” they receive because of physical proximity and familiarity (Shadish et al. 2002).

Recommendations for Reducing Repeated Participation

We encourage researchers to monitor discussions about their HITs on MTurk forums. In addition to these forums, there are application plug-ins that allow MTurk Workers to monitor the activity of their “favored” Requesters and thus increase the chances of repeated participation (Chandler et al. 2014). By default, a Worker can complete a given HIT only once, so researchers can deploy multiple surveys within the same HIT to avoid duplicate Workers. If researchers intend to combine multiple related HITs and treat them as one, they must take steps to ensure that all Workers are unique and that certain Workers are not overrepresented in their samples (i.e., sample nonindependence). MTurk Worker IDs and IP addresses can serve as identifying information for this purpose. Note that duplicated IP addresses can occur legitimately when two different people from the same household complete the same HIT; in such cases, researchers should examine participant demographic characteristics before removing any data.
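As one way to operationalize this screening, the sketch below (with illustrative field names, not any standard schema) drops repeat submissions from the same Worker ID across combined HITs but only flags shared IP addresses for manual review, consistent with the caveat that one household can legitimately contain two Workers.

```python
from collections import Counter

def screen_duplicates(records):
    """Partition submissions from combined HITs by duplicate status.

    `records` is a list of dicts with (at least) 'worker_id' and 'ip'
    keys. Duplicate Worker IDs indicate repeated participation, so only
    each Worker's first submission is kept; duplicate IPs among the kept
    records are flagged for manual review rather than removed.
    """
    seen_workers = set()
    kept, dropped = [], []
    for rec in records:
        if rec["worker_id"] in seen_workers:
            dropped.append(rec)
        else:
            seen_workers.add(rec["worker_id"])
            kept.append(rec)
    ip_counts = Counter(rec["ip"] for rec in kept)
    flagged_ips = {ip for ip, n in ip_counts.items() if n > 1}
    return kept, dropped, flagged_ips
```

Keeping the first submission is one defensible rule among several; researchers should decide (and preregister or document) which submission to retain before looking at the data.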

Nonnaïve or “professional” Workers who have completed a large number of HITs might be preferred in some instances (e.g., Master Workers), but researchers risk these Workers having foreknowledge of the study purposes or of the presence of attention check items, as well as the occurrence of treatment diffusion. It is difficult to completely prevent experienced Workers from participating in a study, but there are steps researchers can take to limit their participation. Researchers who wish to recruit naïve MTurk Workers can establish qualification criteria that exclude more experienced Workers; for example, they can use system qualifications such as the number of HITs completed (e.g., fewer than 10 HITs). Customized qualifications assigned by Requesters can also be created according to the researchers’ needs. For instance, Requesters can use MTurk’s web interface or command line tools to assign a qualification value to Workers who have completed a previous study, Workers who have completed related studies, or Workers who belong to a certain demographic group, so that these Workers cannot be granted access to the HIT.
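For researchers using the MTurk API, such a customized qualification can be expressed programmatically. The sketch below assumes the boto3 SDK; the qualification name and Worker ID are hypothetical, and the API calls are shown in comments because they require live Requester credentials. The helper simply builds the QualificationRequirement entry that blocks Workers who hold a “completed a prior study” qualification.

```python
# Assumed setup (not executed here; requires AWS Requester credentials):
#
#   import boto3
#   mturk = boto3.client("mturk", region_name="us-east-1")
#   qual = mturk.create_qualification_type(
#       Name="CompletedStudy1",                       # hypothetical name
#       Description="Worker completed our earlier study",
#       QualificationTypeStatus="Active")
#   mturk.associate_qualification_with_worker(
#       QualificationTypeId=qual["QualificationType"]["QualificationTypeId"],
#       WorkerId="A1EXAMPLE",                          # hypothetical ID
#       IntegerValue=1,
#       SendNotification=False)

def exclusion_requirement(qualification_type_id):
    """Build a QualificationRequirement dict (for create_hit) that blocks
    Workers who hold the given qualification, i.e., who already completed
    a previous study. They cannot even discover or preview the HIT."""
    return {
        "QualificationTypeId": qualification_type_id,
        "Comparator": "DoesNotExist",
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    }
```

The resulting dict would be passed in the `QualificationRequirements` list when creating the HIT; sharing the underlying qualification type among collaborators, as Chandler et al. (2014) suggest, extends the exclusion across research teams.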

Requesters can also maintain a pool of MTurk participants who meet their research criteria to either sample them for a future study or exclude them for having completed similar studies (Chandler et al. 2014). Chandler et al. (2014) also recommended the sharing of customized qualifications among a group of researchers with similar interests in a specific population, so that sample nonindependence can be minimized when they pool their samples and findings together. However, the importance of sample independence may depend on the research objectives and nonindependence may not always be problematic. For example, researchers considering repeated-measures study designs and examining within- and between-subject comparisons may benefit from within-person measurements, along with increased power and reduced effects of measurement error variance.

Range Restriction

The common characteristics of MTurk samples (e.g., younger individuals, Internet users, and lower- to middle-income households) may in some cases produce range-restricted samples. Researchers may attempt to estimate parameters of an unrestricted employee population but only have data from a restricted population (Hunter et al. 2006). If left uncorrected, range restriction can reduce statistical power, weaken relationships, and lead to inaccurate conclusions.

Threat to Statistical Conclusion Validity

Range restriction is particularly problematic when researchers sample from a single organization to answer questions about the general working population (Roulin 2015). MTurk may be a better alternative in this case, particularly in comparison to organizations with a highly idiosyncratic workforce. Researchers can also address statistical power concerns by using MTurk as a low-cost data source to obtain larger sample sizes with more diverse sets of participants. In addition, the anonymity afforded by MTurk tends to produce more honest responses from a more diverse set of respondents, especially to personal questions such as those about health-related and sexual behaviors (Smith et al. 2015). However, range restriction can still occur, especially if researchers screen MTurk participants on certain factors in order to sample more representatively from their targeted populations. For example, researchers studying the effects of job stress on older workers may choose to filter participants based on age and may thereby restrict the ranges of other measured variables. This is an example of a trade-off in which one type of validity may be compromised to enhance another. Shadish et al. (2002) noted that researchers need not worry about controlling for all validity threats, but should instead recognize the trade-offs and make justifiable decisions based on their priorities and research questions.

Recommendations for Evaluating Possible Range Restriction

Although range restriction is less likely among MTurk Workers than in single-organization samples, it can still occur if researchers use qualification requirements to screen out some Workers. Researchers should carefully consider their qualification requirements and ensure that each is crucial for their research questions. For example, researchers interested in studying the experiences of older employees might find it important to recruit based on Workers’ age, but not necessarily based on Workers’ years of work experience. To strike a balance in the trade-off between range restriction and sample representativeness (an issue we discuss below), researchers must base their decisions on their research objectives.

Consistency of Treatment and Study Design Implementation

Shadish et al. (2002) noted that unreliability or inconsistency of treatment implementation is problematic, especially when the study design is meant to be implemented and interpreted in a standardized manner. This can be particularly problematic when researchers combine samples from MTurk and other sources to generate conclusions.

Threat to Statistical Conclusion Validity

We noted in our literature search above that many papers published in top I-O journals used MTurk samples in conjunction with other types of samples (e.g., undergraduates and working professionals). Even though using samples from different sources may bolster researchers’ conclusions, it is sometimes unclear whether experimental manipulations or survey designs were implemented consistently and reliably across the MTurk and non-MTurk studies. Without standardization and consistent implementation, drawing study conclusions from the different samples can lead to misestimated effect sizes, and it becomes difficult to attribute differences in effect sizes to particular design features and/or constructs.

Recommendations for Ensuring Consistency of Treatment and Design Implementation

Skepticism about MTurk may prompt researchers to use other data sources, in addition to MTurk, to support their findings. Although this may address generalizability concerns, researchers must ensure that their study designs are implemented consistently from sample to sample. For example, if an online survey is deployed to an MTurk sample, the same online survey should be administered to the comparison organizational sample. This obviously does not rule out environmental inconsistency, given that Workers complete their HITs in different settings; however, researchers should make their best effort to administer the study consistently. If a lack of standardization is inherent to the study design (e.g., widely differing training provided to different units), researchers should measure the differing study components and explore how they relate to changes in relationships and outcomes (Shadish et al. 2002).

Extraneous Factors

The research settings afforded by MTurk through the Internet are different from traditional settings where researchers meet with participants face-to-face in the same physical environment (e.g., distributing surveys at an organization or conducting laboratory experiments). There are multiple factors that could contribute to extraneous variance in MTurk samples.

Threat to Internal Validity

In laboratory settings, many extraneous variables can be controlled by placing participants in a uniform environment, so that systematic differences in environmental features are less likely to contribute to errors in manipulations and measurements. Even though MTurk can facilitate random assignment, it cannot provide environmental uniformity, given that MTurk Workers complete their HITs in different physical environments. Without understanding or measuring the sources of extraneous variance, the internal validity of a study can be compromised.

Threat to Statistical Conclusion Validity

Inferences made about covariations and the strength of relationships can be erroneous if extraneous variables are not appropriately measured and controlled. Specifically, extraneous factors can introduce additional sources of variance that may be misinterpreted as part of the hypothesized or observed effects. For example, MTurk Workers may be unable to pay attention to study instructions because of salient environmental features (e.g., distracting noises); such features may add a systematic source of variance that researchers should take into account.

Threat to Construct Validity

The online MTurk platform cannot guarantee that the experimental settings theorized by researchers correspond to the actual settings in which MTurk Workers participate in the study. That is, construct validity becomes questionable if extraneous factors (e.g., environmental distractions) introduce deviations from the settings and operational definitions assumed by the researchers.

Recommendations for Accounting for Extraneous Variables

Given that MTurk facilitates random assignment in settings where Workers are subject to different physical environmental influences, we encourage researchers to identify possible sources of noise, measure these extraneous variables during data collection, and include them in data analysis. Such analyses can shed light on the extent to which extraneous factors change construct measurement, study relationships, and outcomes. Pilot studies can be conducted to identify these extraneous factors, which commonly include the physical environment, browser experience, environmental distractions, respondent interest, and motivation (Meade and Craig 2012); such variables should be considered and controlled for across all MTurk samples. Researchers should also take proactive steps prior to data collection to minimize the effects of such factors. For example, they can specify in the instructions that participants must be in a quiet room, or must use a certain browser or software, when completing the HITs.

Sample Representativeness and Appropriateness

The extent to which a sample is representative of a specific population or appropriate for the research objectives has implications for whether conclusions drawn from that sample apply to the population of interest. All types of samples can be evaluated for their representativeness and appropriateness, and MTurk samples are no exception.

Threat to External Validity

Sample representativeness is often discussed when evaluating the external validity of a research study. One criticism of using online samples, such as those from MTurk, is that the identities of MTurk Workers are unknown. In addition, previous reviews have raised concerns about whether MTurk Workers represent the general population, especially given that MTurk Workers are Internet users who may differ systematically from non-Internet users (Paolacci and Chandler 2014). Although the diversity of MTurk Workers has been praised for increasing external validity (e.g., Landers and Behrend 2015), certain demographic groups (e.g., by age, education, and race) are over- or underrepresented on MTurk. The suitability of MTurk samples is thus best determined by the research questions researchers want answered.

Finally, we note that external validity evidence is not only limited to whether the samples are representative of or generalizable to a population, but also whether certain phenomena hold across settings. Single-organizational samples are limited partly because inferences are confounded with the fact that the employees went through the same hiring and selection procedures, orientations, training, socialization processes, etc. Researchers studying psychological phenomena expected to vary across settings may benefit from using MTurk because MTurk Workers are situated in a variety of settings. Researchers who are able to measure characteristics of their organizational settings may also be able to study contextual factors and their potential moderating effects on study relationships of interest.

Threat to Construct Validity

A lack of sample representativeness can also threaten construct validity. While external validity indicates the extent to which effects observed in one set of sampling particulars (e.g., persons, settings, treatments, and outcomes) are also observed in other sampling particulars, construct validity represents “the degree of correspondence between the constructs referenced by a researcher and their empirical realizations” (Stone-Romero 2011, p. 40).

Sample representativeness can threaten construct validity because it affects the extent to which a set of sampling particulars (e.g., participants, settings, treatments, and outcomes) correspond to the population to which researchers want to draw inference. For example, construct validity would be limited if researchers aim to study the behaviors of upper-level managers but their MTurk sample consists of only entry-level workers.

Recommendations for Maximizing Sample Representativeness and Determining Sample Appropriateness

Even though the diversity of MTurk Workers may benefit researchers from an external validity perspective, researchers should consider possible trade-offs with statistical conclusion validity and/or construct validity. Specifically, while the heterogeneity of respondents creates greater variance on measures and may therefore obscure the systematic covariation between variables, the homogeneity of respondents or treatment conditions may limit arguments for external validity. On the other hand, having participants from different (but relevant) populations may increase external validity, yet it may threaten construct validity when those participants do not all strictly belong to the target population.

The nature of MTurk’s diverse participant pool needs to be understood as researchers decide whether to use MTurk and whether an MTurk sample would represent their targeted population. Since random sampling is not feasible with MTurk, researchers need to make their best efforts to ensure that their sample characteristics closely resemble their population of interest. MTurk is unique in the way it creates system qualifications and allows Requesters to create customized qualifications based on desired sample characteristics (see Chandler et al. 2014). Researchers should recruit and select MTurk Workers by proactively utilizing both system and customized qualifications (e.g., age, location, gender, occupation, employment status) to increase the correspondence between their actual and desired sample characteristics. In administering qualification tests or questionnaires, researchers should avoid overt cues about the eligibility criteria that could influence participants’ responses. Researchers should also restrict Workers from attempting a qualification test more than a certain number of times, to prevent Workers from deducing the eligibility/inclusion criteria and subsequently giving “correct” but untruthful responses.
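For researchers scripting their data collection, system and customized qualifications can be expressed programmatically. The sketch below composes a QualificationRequirements list of the kind accepted by the MTurk API (e.g., via the boto3 client); the system qualification type IDs and threshold values shown follow the MTurk API documentation as we understand it at the time of writing and should be verified against the current API reference before use.

```python
# Sketch only: composing MTurk QualificationRequirements for a HIT.
# The system qualification IDs below are assumptions drawn from the MTurk
# API documentation; confirm them before relying on this in production.

LOCALE_QUAL = "00000000000000000071"          # system qualification: Worker locale
APPROVAL_RATE_QUAL = "000000000000000000L0"   # system qualification: % assignments approved

qualification_requirements = [
    {   # restrict participation to Workers registered in the United States
        "QualificationTypeId": LOCALE_QUAL,
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": "US"}],
    },
    {   # restrict to Workers with a high prior approval rate
        "QualificationTypeId": APPROVAL_RATE_QUAL,
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
    },
]

# The list would then be passed when creating the HIT, e.g.:
# import boto3
# mturk = boto3.client("mturk", region_name="us-east-1")
# mturk.create_hit(..., QualificationRequirements=qualification_requirements)
```

For customized qualification tests, the API's create_qualification_type operation also accepts a retry-delay parameter that slows repeated attempts, which speaks to the retake concern raised above.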

Researchers should also attempt to verify the desired characteristics of Workers and the consistency of their responses. For example, if Workers are required to be full-time employees, researchers can verify their employment status (in addition to using qualification tests) by embedding questions that only someone with the desired characteristics could answer plausibly and consistently, such as questions about job title, work schedule, and salary. Responses with inconsistent or implausible combinations should be removed prior to any data analysis to avoid generalizability issues.
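A screening rule of this kind can be automated before analysis. The sketch below flags implausible combinations for a self-described full-time employee; the field names, thresholds, and example records are ours and purely illustrative, not prescriptions from the literature.

```python
def flag_inconsistent(record):
    """Return True when a response pattern is implausible for a
    self-described full-time employee (illustrative thresholds)."""
    if record["employment_status"] == "full-time":
        if record["hours_per_week"] < 30:
            return True            # claims full-time but reports far fewer hours
        if record["job_title"].strip() == "":
            return True            # cannot name their own job
    return False

# Hypothetical survey responses.
responses = [
    {"employment_status": "full-time", "hours_per_week": 40, "job_title": "Accountant"},
    {"employment_status": "full-time", "hours_per_week": 8,  "job_title": "Manager"},
    {"employment_status": "part-time", "hours_per_week": 15, "job_title": "Cashier"},
]

# Remove flagged cases prior to analysis.
retained = [r for r in responses if not flag_inconsistent(r)]
```

Here the second record is dropped because eight weekly hours is implausible for a full-time employee, leaving two retained responses.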

Finally, MTurk’s participant pool is international but by no means representative of the global workforce; MTurk Workers are predominantly U.S. citizens, Indians, and English speakers (Ipeirotis 2010). Therefore, MTurk samples may be most appropriate for testing theories or phenomena that are not expected to vary across cultures or that are specifically relevant to U.S. or Indian populations. An investigation of, for example, cross-cultural issues among non-English-speaking organizations or employees may not be feasible on MTurk. Moreover, researchers sampling from MTurk Workers may miss employees with certain characteristics (e.g., white-collar professionals) and, in some instances, a large single-organizational sample might be more appropriate. In other instances, researchers should carefully consider the relevance of measured constructs to MTurk Workers, and whether those constructs are industry specific. For example, personality effects on job performance outcomes may be studied using an MTurk sample representing different occupations/industries, but the effects of industry-specific knowledge on job performance may not be relevant to all MTurk Workers. Therefore, we urge researchers to carefully consider whether MTurk samples are suitable for answering their research questions about their targeted populations, and not let MTurk’s cost-efficient and convenient nature dominate their decisions.

Consistency Between Construct Explication and Study Operations

Construct validity is assessed largely based on the extent to which the properties of operational definitions in a study are consistent with the properties of theorized constructs. Discrepancies between the two not only affect construct validity, but also other validity types and conclusions made about the constructs based on discrepant study operations. This issue applies to all types of samples, but it is particularly important when MTurk researchers use system/customized qualifications to select participants.

Threat to Construct Validity

Inferences about construct validity are more strongly supported when the characteristics of a sample collected from MTurk match the characteristics specified in the construct definition. As illustrated in Shadish et al.’s (2002) example, discrepancies between constructs and operations may occur when a researcher interested in the construct of unemployed and disadvantaged workers samples from families below the poverty level, who may not necessarily be unemployed or disadvantaged. Such a mismatch would undermine inferences about construct validity and lead to inaccurate conclusions about the measured constructs.

Recommendations for Ensuring Consistency Between Construct Explication and Study Operations

As noted, a mismatch between construct explication and study operations can be problematic. For instance, a mismatch may occur if researchers interested in the construct of transformational leadership obtain a sample of MTurk Workers who are predominantly low-wage earners, with upper-level employees underrepresented. MTurk Workers in such a sample may be asked questions about executive leadership styles that are irrelevant to them, because they have not had the opportunity to directly observe executives’ behavior. Beyond creating inconsistencies between the sample and the construct explication, this design may lead to erroneous conclusions about executives based on the perceptions of a low-wage worker sample. Such consistency between construct explication and study operations is important not only for persons, but also for settings, treatments, and outcomes. Researchers should thus evaluate the appropriateness of MTurk samples with careful consideration of the nature of the constructs they intend to measure.

Method Bias

Method bias is a commonly discussed methodological concern in the behavioral sciences (e.g., Spector 2006). A number of meta-analyses have indicated that the impact of method biases on item validity and reliability can contribute to inappropriate conclusions if not appropriately controlled for (Podsakoff et al. 2012). As with any other sample source, researchers using MTurk samples should consider the possibility of method biases in light of their research questions/targeted populations and take steps to account for them in analyses.

Threat to Construct Validity

According to Shadish et al. (2002), mono-method bias is one of the method biases that can threaten the credibility of construct validity evidence. For example, using MTurk as a sole source of construct measurement may introduce problems with mono-method bias (also known as common method bias), a threat where the method (or measurement context) may become a part of the construct actually studied (Podsakoff et al. 2012; Shadish et al. 2002). MTurk is usually limited to one method in how treatments or surveys are presented to respondents given that Requesters can only distribute their HITs through the Web interface, and thus the measurement contexts are the same for all Workers. Another type of mono-method bias in MTurk samples is based on common rater effects, where the same respondents provide responses to both the predictor and criterion (Podsakoff et al. 2003). As of now, the MTurk platform does not have convenient and direct access to alternative rater sources, such as MTurk Workers’ supervisors, peers, or spouses.

It is important to note that mono-method bias is only one of many sources of method biases, but it is one that is particularly applicable to MTurk samples. Researchers should also consider the implications of other potential sources of method bias that are common to MTurk samples and other sample sources, such as item characteristic effects and item context effects (Podsakoff et al. 2003, 2012).

Recommendations for Examining Method Bias

The measurement context in MTurk is primarily the Web interface, and it may become a part of the measured constructs unless researchers separate out the “method” factor in their analyses. Researchers using MTurk samples should examine whether method factors emerge in their latent measurement models; if they do, appropriate measures should be taken to control for them (Podsakoff et al. 2012). To minimize method biases arising from measurement context effects, researchers may consider adopting time-lagged research designs, in which predictors and criterion variables are measured at different points in time, or administering predictors and criterion variables through different mediums (e.g., Qualtrics vs. the MTurk interface). Other method biases, such as common rater biases, are harder to overcome on MTurk. MTurk does not currently have the capability to survey and match data from multiple rater sources (e.g., supervisors and coworkers). Requesters would have to administer the surveys to other raters through the MTurk Worker, which can cause a different set of problems (e.g., honesty and compensation issues). In this case, single-organizational samples would be easier to manage and more feasible for collecting data from multiple raters.
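In a time-lagged design of the kind suggested above, the two waves of data must be joined on the Worker identifier before analysis, retaining only Workers who completed both waves. The sketch below uses hypothetical WorkerIds, variable names, and scores; it is a minimal illustration of the matching step, not a complete data pipeline.

```python
# Hypothetical Time 1 (predictor) and Time 2 (criterion) exports,
# keyed by WorkerId so the two waves can be joined.
time1 = {
    "W001": {"predictor": 4.2},
    "W002": {"predictor": 3.1},
    "W003": {"predictor": 4.8},
}
time2 = {
    "W001": {"criterion": 3.9},
    "W003": {"criterion": 4.5},
    "W004": {"criterion": 2.7},   # no Time 1 record; dropped below
}

# Keep only Workers who completed both waves, merging their responses.
matched = {
    worker_id: {**time1[worker_id], **time2[worker_id]}
    for worker_id in time1.keys() & time2.keys()
}
```

Attrition between waves is common on MTurk, so researchers should plan their Time 1 sample size with the expected dropout in mind.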