Fraud Detection

As more and more work is crowdsourced, it is important to ensure that tasks are actually being completed honestly. During the 2012 US presidential election, for example, both campaigns had online phone-banking tools that let volunteers call voters. How do you ensure those volunteer calls are legitimate contacts and not the work of Tea Party saboteurs? Other companies, such as YouGov, pay people for consumer survey data. It's important to them that respondents actually put effort into their answers instead of clicking through as quickly as possible. The most effective way to start solving this problem is to build a model for detecting fraudulent behavior. Below we build a very simple model from real survey data and discuss ways to extend it.

The Setup

I gathered example data by running a survey of my own. I wrote some nonsense questions (e.g. "Would you rather have 10 kids, 1 evil kid, 0 kids, or 2 bad kids?") that sometimes required a bit of thought before answering. I then hacked together a server to gather responses. I would have liked to just use Survey Monkey, but they neither report the time spent on each question nor allow free access to raw CSVs of the data. The server, along with deployment scripts, is on GitHub, so feel free to use it for any purpose you want.

Then came the fun part: I ran the survey on Amazon's Mechanical Turk infrastructure. I removed all competency requirements for respondents and used two different prompts. One asked participants to complete the survey and submit their code. The other told participants to complete the survey, but that I didn't really care about their answers and that they should just finish it as quickly as possible. The way these two groups worked through the survey was used to train a model, with the latter group treated as the more "fraudulent" activity type. We can then classify whether someone behaves more like the normal participant who thinks about each question, or like the person who is just trying to rush through the survey as quickly as possible.

Results

The results showed a clear difference between those who were asked to rush through the survey and those who gave it more thought. I broke the distributions down by question. The more the timings differ for a question, the more useful that question's timings are for building our model. For example, one question asked respondents to recall their first political memory. That should take a reasonable person a little while, and there should be a story with it. Looking at the two distributions, we do indeed see a nice spread:

For the "good" reviewers, they are most likely going to take around 25 seconds. It is not unlikely for bad reviewers to take that much time either. But as the duration increases, it is more and more reasonable to expect the reviewer to be good. It is twice as likely for a good reviewer to take 60 seconds on this question as a bad reviewer. Beyond that, this difference continues to grow.

Some questions, though, really don't matter. The easy ones are answered in about the same time regardless of which group you are in:

Timings for a question like this give very little insight into whether someone is fraudulent or not.

Using these distributions to build a model, we can compute a fraud probability for each person who took the survey. The distribution of probabilities from this simple model looks like this:

We see some separation between the scores, a very useful property when constructing a binary classifier. To sanity-check a model like this, it's also useful to look at some of the real results.

Let's pick someone at random from the low and high probability buckets and get a sense of what the answers look like.

A low fraud probability response

| Question | Answer | Duration (s) | Fraud Probability |
|---|---|---|---|
| First name | Farrell | 14.641 | 0.08054 |
| People with your name honest? | Yes. | 60.766 | 0.01655 |
| Earliest political memory? | Dead: Abraham Lincoln, because I was obsessed with the Civil War as a child and his views on slavery. Alive: George W. Bush Sr and his first run for office, as well as the conflicts of the Gulf. | 81.358 | 0.00772 |
| Men or women need more exercise? | More. Women are prone to cancer, yet men are prone to heart-disease. Heart-disease can be deterred with plenty of exercise, while cancer isn't as easily defeated by more exercise. | 71.565 | 0.00482 |
| What country do you live in? | United States of America. | 9.084 | 0.00348 |
| Allocating money to different departments | immigration:1 healthcare:2 education:3 warfare:2 transportation:2 | 54.985 | 0.00190 |
| How sad would you be if various plants went away? | radish:1 lettuce:10 eggplant:1 tomato:10 aubergine:1 kiwi:1 | 33.447 | 0.00135 |
| What animal would you not want to leave with a sheep? | Wolf. | 15.939 | 0.00113 |
| 10 kids, 1 evil kid, 0 kids, or 2 bad kids? | With those options, I would rather have no children, because 1 evil child would be just as stressful as 2 bad children and 10 kids would be equally as stressful; thus, I would choose none. If there was a fifth option, I would say adoption. | 77.048 | 0.00056 |
| Do you have any idea what the word 'Telluride' means? | No. None. | 11.941 | 0.00076 |
| Who would your parents like? | Nabokov:1 Obama:1 Fidel Castro:1 Your favorite TV host:5 Babe Ruth:4 | 21.941 | 0.00084 |

Farrell clearly put real effort into her responses, and the timings show it. This is among the least fraudulent results, and we got that from timing data alone.

A high fraud probability response

| Question | Answer | Duration (s) | Fraud Probability |
|---|---|---|---|
| First name | brandon | 4.393 | 0.13300 |
| People with your name honest? | nope | 5.377 | 0.23076 |
| Earliest political memory? | Clinton and monica. | 26.647 | 0.20252 |
| Men or women need more exercise? | more | 10.700 | 0.31294 |
| What country do you live in? | usa | 3.859 | 0.35673 |
| Allocating money to different departments | immigration:0 healthcare:2 education:8 warfare:0 transportation:0 | 16.906 | 0.49815 |
| How sad would you be if various plants went away? | radish:1 lettuce:5 eggplant:1 tomato:10 aubergine:1 kiwi:1 | 12.375 | 0.64341 |
| What animal would you not want to leave with a sheep? | wolf | 6.337 | 0.79127 |
| 10 kids, 1 evil kid, 0 kids, or 2 bad kids? | 0 kids | 9.182 | 0.87548 |
| Do you have any idea what the word 'Telluride' means? | nope | 4.478 | 0.93092 |
| Who would your parents like? | Nabokov:1 Obama:2 Fidel Castro:1 Your favorite TV host:2 Babe Ruth:5 | 15.773 | 0.95546 |

Brandon, on the other hand, rushed and didn't give any response much effort. Even with the slower response on the last question, his prior probability of being a "fraudster" is so high by then that his score doesn't come down much. Digging further, we see that users with high fraud probability also write things like "no clue brah", mash keys for answers, and fill in all fields identically. Each of these would be another signal to add as the model is made more robust.

These are extreme examples, but it's interesting that we picked them out with no manual feature specification, using timing alone. It's also important to note that the distributions do not need to be disjoint for a classifier to work. The more disjoint the better, but with Bayesian reasoning and enough data, you will eventually figure out which distribution someone actually belongs to. Even with a simple model, you can dig into the data and start getting better ideas for further signals.
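The "enough data" point can be demonstrated with a small simulation. Assuming two heavily overlapping normal timing distributions (all parameters invented), repeated Bayesian updates still drive the posterior toward the respondent's true class:

```python
import math
import random

random.seed(0)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * math.sqrt(2 * math.pi))

# Two heavily overlapping timing distributions (seconds): means only
# 8 seconds apart, each with a 10-second spread.
good_mu, bad_mu, sigma = 30.0, 22.0, 10.0

# Simulate a genuinely "good" respondent answering 100 questions,
# updating the fraud posterior after each one.
p_fraud = 0.5
for _ in range(100):
    t = random.gauss(good_mu, sigma)
    like_fraud = normal_pdf(t, bad_mu, sigma)
    like_good = normal_pdf(t, good_mu, sigma)
    p_fraud = p_fraud * like_fraud / (p_fraud * like_fraud + (1 - p_fraud) * like_good)

print(p_fraud)  # tends toward 0 despite the heavy overlap
```

Any single observation is ambiguous here, but each one nudges the posterior a little, and the nudges accumulate.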

An Interactive Example

Below is the actual survey I ran, along with the simple model that was generated. It will tell you after every question what it believes your current level of fraud is. Take the survey twice, the first time answering it normally (giving thought to each question) and once where you try to just get through it as quickly as possible. See how the results differ.


The Math

The model here is a basic Bayesian classifier. Timings for the two classes were converted into distributions via kernel density estimation. We use those probability density functions to compute the likelihood of a given timing under each class, multiply by our prior belief that the respondent belongs to that class, and normalize the probabilities to sum to 1. The IPython notebook is checked into GitHub and available on nbviewer. If you want to learn about Bayesian modeling, the best first book on the subject is Allen Downey's Think Bayes. The book is available as a free PDF, but I suggest you give him some money anyway.
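That update can be sketched end to end. Every number below is a stand-in (two hypothetical questions, invented training timings, a fixed KDE bandwidth); the structure, per-question KDE likelihoods folded into a running posterior, is the method described above:

```python
import math

def kde(samples, bandwidth):
    """Gaussian kernel density estimate as a callable pdf."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    def pdf(x):
        return sum(math.exp(-((x - s) / bandwidth) ** 2 / 2) / norm
                   for s in samples) / len(samples)
    return pdf

# One (good, fraud) pair of invented training timings per question.
question_timings = [
    ([20, 30, 50, 70], [4, 6, 9, 14]),
    ([10, 12, 15, 18], [3, 4, 5, 7]),
]

def fraud_probability(durations, prior_fraud=0.5, bandwidth=5.0):
    """Fold each question's timing likelihoods into a running posterior."""
    p = prior_fraud
    for (good, fraud), t in zip(question_timings, durations):
        like_fraud = kde(fraud, bandwidth)(t)
        like_good = kde(good, bandwidth)(t)
        # Bayes' rule, normalized so the two posteriors sum to 1.
        p = p * like_fraud / (p * like_fraud + (1 - p) * like_good)
    return p

print(fraud_probability([60, 15]))  # slow, thoughtful timings -> low
print(fraud_probability([5, 4]))    # rushed timings -> high
```

The posterior after one question becomes the prior for the next, which is why Brandon's early rushed answers kept his score high even when he slowed down at the end.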

Extensions

This is, of course, a simplistic model built on a single variable: how long someone takes to answer each question. Several confounding issues weaken its utility. Someone on a mobile device or behind a slow ISP will have longer answer times simply by virtue of how they are taking the survey. With enough data, though, you could start to account for those factors in the model. You would also want to add other signals, such as the variety of responses someone gives when assigning scores (all 1s or all 10s is not a great sign), and you would probably want to build some predictions around keymashing. All of this fits cleanly into a Bayesian framework like this one, and each additional piece of information should further separate fraudsters from legitimate workers. One other interesting wrinkle is that we expect users to get better at tasks over time. In the phone-banking context, for example, it would help to build different models for different experience levels of callers.
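Each extra signal slots in the same way: under a naive independence assumption, you multiply in one likelihood per signal before normalizing. The likelihood values below are placeholders rather than quantities fit from data, and the "rating variety" signal is hypothetical:

```python
def posterior_fraud(likelihoods, prior=0.5):
    """likelihoods: (p(signal | fraud), p(signal | good)) pairs,
    one per signal, combined under a naive independence assumption."""
    p_fraud, p_good = prior, 1 - prior
    for like_fraud, like_good in likelihoods:
        p_fraud *= like_fraud
        p_good *= like_good
    return p_fraud / (p_fraud + p_good)

# A respondent whose answer timing favors fraud and whose all-identical
# ratings (a hypothetical "variety" signal) also favor fraud:
print(posterior_fraud([(0.30, 0.05), (0.40, 0.10)]))
```

Two weakly discriminative signals combine into a much stronger verdict than either gives alone, which is the appeal of layering in keymashing, variety, and similar features.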

Takeaways