Faunalytics’ Research Director describes an attack on a Faunalytics survey, as well as best practices for ourselves and other researchers going forward.

Have you ever administered a survey to people online or thought about doing so? Then this blog post is for you! (And if you haven’t, maybe you’d be interested in checking out our resources for running your own research so that you can in the future.)

Animal advocacy research involves a lot of surveys. In general, when you’re interested in “the population at large,” experienced researchers will recruit participants through companies who verify that they are real people through background and data quality checks. That’s necessary because, unfortunately, surveys can be easily overrun by people and bots trying to make a quick buck with multiple fake survey completions.

This is a sad reality of the world researchers live in, and although it’s one I’ve been aware of for a long time, I’ve just recently been made painfully aware of how much worse it’s become. I hope other researchers will read this as a warning and share it widely.

A Sophisticated Scam

Faunalytics doesn’t ordinarily recruit participants over social media because they wouldn’t necessarily be very representative of the population. However, the longitudinal study we’re currently running has a particularly tiny population of interest: people who have started a vegetarian or vegan (veg*n) diet in the past two months. Recruiting such a specific sample using a panel company would be prohibitively expensive at best and impossible at worst.

We turned to social media to recruit, with the intention of weighting the data to better represent known veg*n demographics (handily available from our previous study of current and former veg*ns). Because this is a six-month longitudinal study, participant drop-off is a concern, so we’re paying $5 per survey to keep people motivated. We started recruiting from a large Facebook group and got about 300 participants in the first few days. That was a surprise and sent up a yellow flag of suspicion for me. However, since it was the beginning of the study, we didn’t have a good sense of how quick sign-ups should be.

As usual, I checked several standard data quality indicators before paying the participants: I looked for duplicate IP addresses, IP addresses outside North America, and overly fast completion times. We also checked for consistency between the age they entered on the survey and the year of birth that they entered in another part of the survey. With a handful of exceptions that I removed, everyone passed those checks.
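For readers who want to automate checks like these, here is a minimal sketch in Python. It is not the script we ran. The `Response` fields, the 120-second speed threshold, and the three-country list standing in for "North America" are all assumptions made for illustration, and the country is assumed to come from a separate IP-geolocation lookup.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Response:
    respondent_id: str
    ip_address: str
    country: str          # assumed to come from a separate IP-geolocation lookup
    seconds_taken: float
    age: int
    birth_year: int


def basic_flags(responses, survey_year, min_seconds=120.0,
                allowed_countries=("US", "CA", "MX")):
    """Return {respondent_id: [reasons]} for responses failing any basic check.

    min_seconds and allowed_countries are illustrative defaults, not
    recommended values.
    """
    ip_counts = Counter(r.ip_address for r in responses)
    flagged = {}
    for r in responses:
        reasons = []
        if ip_counts[r.ip_address] > 1:
            reasons.append("duplicate IP")
        if r.country not in allowed_countries:
            reasons.append("IP outside allowed countries")
        if r.seconds_taken < min_seconds:
            reasons.append("completed too fast")
        # The age implied by a birth year is survey_year - birth_year, or one
        # less if the birthday has not happened yet this year.
        implied = survey_year - r.birth_year
        if r.age not in (implied, implied - 1):
            reasons.append("age/birth-year mismatch")
        if reasons:
            flagged[r.respondent_id] = reasons
    return flagged
```

As our experience shows, passing all of these checks does not prove a response is genuine. They only screen out the least sophisticated fraud.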

I still had a niggling sense of unease, so the research team—including my co-investigator at Carleton University—also looked closely at the demographic breakdown of our sample compared to the previous Faunalytics study. But with the exception of a high proportion of male respondents that we chalked up to the nature of the Facebook group we recruited from, they all mapped onto what we expected. With all those checks complete, I told myself I was just being paranoid and paid our “participants.”

The Discovery

Between the time of sending out payments and the first follow-up survey going out, I downloaded the data and combed through it closely. In hindsight, I wish I had listened harder to my suspicions and taken the time sooner, but after all, I had already checked everything we could think of, and a really close check takes a substantial amount of time. Once I did that close check, though, I found one small inconsistency after another over about 5 hours of looking. Those included odd, clunky phrasing in open-ended questions, reported gender not matching the apparent gender of a name in an email address, and email addresses including a name that didn’t seem to match the initials in the unique identifier code.

Over time, I also started seeing that many of the email addresses, while realistically human-looking at a glance (e.g., [email protected] or [email protected]), shared a pattern: a few letters or numbers at the end or middle of a name.

In the end, as I combed the data, no single factor was very suspicious on its own, but all together they added up to the confirmation that the data must have come from human scammers, sophisticated bots, or a combination of the two.

Impact

At this point, there were clear questions of financial and data quality impact. If we hadn’t caught the situation, our results would be overrun with meaningless, false data. To avoid any potential for data contamination, I took the decision to exclude all responses that came from the suspect Facebook group (we had tracked the URL source while collecting the data). In total, 316 responses had to be discarded, of which we were able to recover payments from only 26. I made the decision to exclude all of the participants after about 10 hours of data checking, trying to exclude suspicious responses, and realizing that the proportion was so high as to render additional effort cost-ineffective.

On the financial side, this was a hard lesson: $1,450 (290 unrecovered payments of $5 each) went to people who never existed. However, there are two silver linings: First, with this cautious approach to data cleaning, the potential impact on data quality in this study is close to zero. And second, I can share this lesson with other advocates and researchers to help prevent future occurrences. To that end, we have spent significant time investigating best practices for dealing with different kinds of fraud and present them below.

Prevention of Future Occurrences

Some fraud detection measures are low-cost: They are invisible or minimally time-consuming for participants and easy for the researcher to check. Here at Faunalytics, this type of measure will be used in all future studies, regardless of platform (see table below). They are included in all studies because, although I have labeled some platforms as particularly “risky,” there’s no such thing as a risk-free online platform. The rest are lower risk, but you should always include data quality indicators regardless.

Other fraud detection measures are stricter and more costly in terms of participant and researcher time. Going forward, we will employ these additional measures whenever we run studies on risky platforms that do not include prior participant vetting or quality assurance. Risky platforms most notably include social media sites and Amazon Mechanical Turk. The latter is less risky when used through companies like Positly, Prolific, or TurkPrime that screen participants for their panels.

At Faunalytics, we will not be using these stricter measures for all studies because they increase respondent and/or researcher burden, and therefore the cost of the study. However, I now consider them a requirement to ensure data quality on risky platforms going forward, and recommend that other researchers do something similar.

The battle against scammers is an arms race. As we become more sophisticated in our data quality checks, they become more sophisticated in getting around them. Although we have a plan in place for now and encourage others to do the same, this isn’t a one-time fix. We will need to continue to adapt and improve going forward. As part of that ongoing process, we invite feedback from experts in the field. If you would like to share your own experiences or additional ideas with the community, please comment below or contact Faunalytics’ Research Director at [email protected].

As a final note, although we encourage you to share this post widely, we have hidden it from search engines’ crawlers. We don’t expect scammers to be keeping a close eye on us in particular, but we don’t need to help them out more than necessary by making this information easily searchable.

Faunalytics Data Quality Assurance Plan



The goal of these checks is to catch two types of fraudulent survey completion: bots and survey farmers (people who “farm” paid surveys, often but not always from foreign countries with lower hourly rates of pay; see TurkPrime, 2018). Except where otherwise noted, most of these checks could catch either type of scam, though some are passable by more sophisticated scammers with VPNs or by bots working in conjunction with humans. The checks we previously had in place were not sufficient to catch more sophisticated scammers.



For each study, our pre-registration plan will indicate which specific metrics and pass/fail criteria will be used.



Usage Guidelines and Examples

General Usage Guidelines

Data quality checks that stem from responses to survey questions can be very helpful, but it is important to remember that they are subject to human error. That is, even an honest responder may make a mistake from time to time if they misread something, are sleepy, or briefly aren’t paying attention. We recommend using these checks in combination—for example, with a two- or three-strike rule—to avoid accidentally excluding honest responders.
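A strike rule like this is easy to apply once each check has produced its own list of failures. The sketch below is one possible way to do it in Python. The function name, the input shape (a dictionary mapping each respondent ID to the set of checks they failed), and the default of two strikes are assumptions for illustration.

```python
def exclude_by_strikes(failed_checks_by_respondent, strikes_to_exclude=2):
    """Return the sorted IDs of respondents who failed at least
    `strikes_to_exclude` distinct checks.

    failed_checks_by_respondent: {respondent_id: set of failed check names}
    """
    return sorted(
        respondent_id
        for respondent_id, failed in failed_checks_by_respondent.items()
        if len(failed) >= strikes_to_exclude
    )
```

With a two-strike rule, a respondent who fails only one check is kept, which protects honest people who slipped up once.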

Edit (8/29/19): Humans who farm surveys with bot assistance are likely to go through the survey once manually to program the bot with correct answers to these types of questions. It is best to have a larger bank of items for each from which you can randomly select a question so that all participants don’t get the same one.
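One way to draw from a bank is to pick an item per respondent using a seeded random number generator, so the choice can be reproduced later when checking answers. This is only a sketch: the bank below reuses options from the attention-check example in this post, and `pick_item` and `study_seed` are hypothetical names. Most survey platforms offer their own randomization features, which may be a better fit.

```python
import random

# Items taken from the attention-check example in this post. Each pairs an
# option with the answer an attentive, honest respondent should give
# (True = should be checked, False = should not be checked).
ATTENTION_BANK = [
    ("Used a computer or mobile phone", True),
    ("Ran a mile in less than 3 minutes", False),
    ("Typed faster than 220 words per minute", False),
]


def pick_item(bank, respondent_id, study_seed="example-study"):
    """Pick one item from the bank for this respondent.

    Seeding with the study and respondent ID makes the pick reproducible,
    so the same respondent always maps to the same item.
    """
    rng = random.Random(f"{study_seed}:{respondent_id}")
    return rng.choice(bank)
```

A real bank would need many more items than this for the randomization to make bot programming meaningfully harder.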

Consistency Checks

This category of check refers to asking the same question twice in different ways (preferably not close together in the survey) or with one version reverse-coded so that it says the opposite of an earlier question. The goal is to catch people who aren’t answering honestly—if they are, their answers should be consistent.

An example of asking the same thing in two different ways is requesting age and birth year. Although this failed for us, it may work if the questions are placed farther apart in the survey.

An example of asking the same thing with reverse-coding is “Do you usually make decisions quickly?” and “Do you generally take a long time to make a decision?” If the respondent provides the same answer to both questions (e.g., “Strongly agree”), that should be flagged as suspicious.

Another example with reverse-coding is to ask people to check boxes for feedback at the end of a survey. If they check, for example, both “annoying” and “enjoyable,” that response should be flagged as suspicious.
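Both kinds of consistency check can be scored automatically. The following Python sketch assumes a five-point agreement scale and treats the neutral midpoint as not contradictory. The scale labels, function names, and the midpoint rule are assumptions for illustration, not part of our plan.

```python
# Assumed five-point agreement scale; adjust to match the survey's labels.
LIKERT = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neither agree nor disagree": 3,
    "Agree": 4,
    "Strongly agree": 5,
}


def reverse_coded_inconsistent(answer, reversed_answer):
    """True if the same non-neutral answer was given to both items of a
    reverse-coded pair (e.g., "Strongly agree" to both)."""
    a, b = LIKERT[answer], LIKERT[reversed_answer]
    return a == b and a != 3


def contradictory_feedback(checked, opposing_pairs=(("annoying", "enjoyable"),)):
    """True if both boxes of any opposing pair were checked."""
    checked = set(checked)
    return any(a in checked and b in checked for a, b in opposing_pairs)
```

A stricter version could also flag answers on the same side of the scale (e.g., "Agree" and "Strongly agree"), at the cost of more false alarms among honest respondents.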

Attention Checks

This category of check is intended to catch people who are answering randomly, too quickly, or selecting as many options as possible to try and pass a screening question.

An example of an attention check is: “Which of the following have you done in the past 12 months? Please select all that apply.”

Ran a half-marathon or marathon

Purchased a new television

Used a computer or mobile phone ***flag if unchecked; they’re using one for the survey***

Ran a mile in less than 3 minutes ***flag if checked; the world record is 3:43***

Ate a pre-packaged frozen dinner

Read a book

Typed faster than 220 words per minute ***flag if checked; the world record is 216 WPM***

Donated to charity

Applied for a patent

None of the above
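Scoring a question like this comes down to two sets: options that must be checked and options that must not be. Here is a minimal Python sketch using the example above. The constant and function names are hypothetical.

```python
# Options every real respondent should check.
MUST_CHECK = {"Used a computer or mobile phone"}

# Options no real respondent should check.
MUST_NOT_CHECK = {
    "Ran a mile in less than 3 minutes",
    "Typed faster than 220 words per minute",
}


def attention_flags(selected):
    """Return a list of reasons this answer looks inattentive (empty if none)."""
    selected = set(selected)
    reasons = []
    for option in sorted(MUST_CHECK - selected):
        reasons.append(f"did not check: {option}")
    for option in sorted(MUST_NOT_CHECK & selected):
        reasons.append(f"checked impossible option: {option}")
    return reasons
```

Note that a respondent who selects only “None of the above” is flagged too, because they left the computer or mobile phone option unchecked.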



More overt attention checks (e.g., “Please select strongly agree”), while formerly common, have more recently been criticized as introducing demographic and social desirability bias into a sample (e.g., Clifford & Jerit, 2015; Vanette, 2016), so Faunalytics no longer recommends using them.



Basic Comprehension Checks

This category of check asks about the meaning of a simple sentence in order to flag possible foreign workers. However, if your sample may legitimately include people with a low level of reading comprehension, it would be both unethical and methodologically unsound to exclude people for failing only this check.

An example of a comprehension check is: “Sally’s blue dress is her favorite. What does the previous sentence imply?”

Sally only has one dress

Sally has a favorite dress ***flag if not checked***

Sally doesn’t wear dresses

Sally only wears blue



Open-Ended Question Coding

Like the basic comprehension checks described above, this type of check is meant to flag foreign workers and survey farmers who just want to get through your survey as quickly as possible. We have adopted it based on the findings of TurkPrime (2018), who found that 95% of non-farmers passed this check and only 8% of farmers did. As with the basic comprehension checks above, however, people with low reading comprehension may also fail, so this check should only be used in conjunction with others.

Once you have your first batch of data to check, skim the full set of open-ended responses to get a sense of what the answers are like. Then, for each response, consider whether it:

Is a duplicate of another response

Has poor grammar

Is hard to understand (i.e., not perfectly intelligible)

Is clunky (e.g., “You are examining the mood of the complex moments”)

Does not answer the question (e.g., “Nice survey” in response to a question about food)

If any of the above are true, flag the response as suspicious.
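Most of these criteria need human judgment, but the duplicate check can be partly automated. The Python sketch below groups responses that are identical after ignoring case, punctuation, and extra spaces. The function names and the normalization rule are assumptions for illustration. Very short answers (e.g., “yes”) will often match between honest respondents, so treat the output as a list to review by hand rather than a list to exclude.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase the text and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def duplicate_responses(answers):
    """Return the set of respondent IDs whose normalized answer is shared
    with at least one other respondent.

    answers: {respondent_id: open-ended response text}
    """
    groups = defaultdict(list)
    for respondent_id, text in answers.items():
        key = normalize(text)
        if key:  # skip blank answers
            groups[key].append(respondent_id)
    return {rid for ids in groups.values() if len(ids) > 1 for rid in ids}
```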



Acknowledgements

Our sincere thanks go to Luke Freeman of Positly and Harish Sethu for their expert review and suggestions on this blog post.

