Adversarial analytics and business hacking: Amazon case study.

Chances are that you have purchased a book, or visited a restaurant, as a result of reading fake reviews. The problem affects companies such as Amazon and Yelp. On Facebook, massive disinformation campaigns are funded by political money, hit thousands of profiles, and are managed by public relations companies that create fake profiles and try to become friends with influencers. Here the focus is specifically on Amazon book reviews. The Facebook issue will be discussed later. The Yelp issue is well known and has resulted in a class action lawsuit: allegedly, Yelp's account managers create bad reviews for restaurants, and if you pay a monthly advertising fee, your rating suddenly and dramatically improves.


Amazon sells books, so it has a conflict of interest when it comes to book (or product) reviews. The purpose of this article is threefold:

How to detect fake reviews, and improve recommendation engines

Test Amazon's review system: post fake reviews and reverse-engineer Amazon's algorithm (as a proof of concept or feasibility study)

Discuss a business risk that could sink Amazon: a company selling good reviews to authors, after having trashed their books with bad reviews

This is the new project for candidates interested in our data science apprenticeship. The full list of projects can be found here. The project description is as follows:

You will have to assess the proportion of fake book reviews on Amazon, test a fake review generator (possibly using EC2 to deploy the reviews), reverse engineer an Amazon algorithm, and identify how the review scoring engine can be improved. Extra mile: create and test your own review scoring engine. Scrape thousands of sampled Amazon reviews and score them, as well as users posting these reviews.

Note that we do not study here the impact of reviews and stars on purchasing behavior or pricing; this will be the subject of another article.

1. Fake review detection

Which metrics would you use to detect fake reviews?

Recency of the review

Same review posted by the same user multiple times (or by multiple users)

1-star review

User's IP address is blacklisted

High number of likes in a short time span for the review in question

User's IP address is anonymous, non-static, or not an ISP address (see also Internet topology mapping)

Review posted in the middle of the night

These are features that should probably be included in any fake review detection system. HDT (hidden decision trees) is a data science technique well suited to designing such engines for scoring reviews. What other metrics would you suggest?
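The metrics listed above can be combined into a single suspicion score. The following is only a minimal sketch of that idea: it is a plain weighted sum of binary flags, not HDT and not Amazon's actual algorithm. The `Review` fields, the weights, and the thresholds (7 days, 10 likes in the first day, 1am to 5am) are all assumptions made up for illustration, and would need to be tuned on real labeled data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Review:
    user_id: str
    text: str
    stars: int
    posted_at: datetime     # reviewer's local time (assumed to be known)
    likes_first_day: int    # "useful" votes received within 24 hours
    ip_blacklisted: bool
    ip_anonymous: bool      # anonymous, non-static, or non-ISP address


# Illustrative weights, one per metric in the list above; not tuned on data.
WEIGHTS = {
    "recent": 1.0,
    "duplicate_text": 3.0,
    "one_star": 1.0,
    "ip_blacklisted": 3.0,
    "like_burst": 2.0,
    "ip_anonymous": 2.0,
    "night_post": 1.0,
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially edited copies match."""
    return " ".join(text.lower().split())


def flags(review: Review, text_counts: Counter, now: datetime) -> dict[str, bool]:
    """Evaluate each metric as a yes/no flag (thresholds are assumptions)."""
    return {
        "recent": now - review.posted_at <= timedelta(days=7),
        "duplicate_text": text_counts[normalize(review.text)] > 1,
        "one_star": review.stars == 1,
        "ip_blacklisted": review.ip_blacklisted,
        "like_burst": review.likes_first_day >= 10,
        "ip_anonymous": review.ip_anonymous,
        "night_post": 1 <= review.posted_at.hour < 5,
    }


def score_reviews(reviews: list[Review], now: datetime) -> list[float]:
    """Return one suspicion score in [0, 1] per review, in input order."""
    text_counts = Counter(normalize(r.text) for r in reviews)
    total = sum(WEIGHTS.values())
    scores = []
    for r in reviews:
        hits = flags(r, text_counts, now)
        scores.append(sum(WEIGHTS[name] for name, hit in hits.items() if hit) / total)
    return scores


now = datetime(2024, 1, 15, 12, 0)
sample = [
    Review("u1", "Great intro to statistics.", 5, datetime(2023, 6, 1, 14, 0), 0, False, False),
    Review("u2", "Terrible  book", 1, datetime(2024, 1, 14, 3, 0), 12, True, True),
    Review("u3", "terrible book", 1, datetime(2024, 1, 14, 3, 30), 11, False, True),
]
scores = score_reviews(sample, now)
```

In this toy sample the first review triggers no flag, while the other two share the same normalized text and trigger most flags. A real system would replace the hand-picked weights with ones learned from labeled reviews, and would score the users as well as the reviews.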

2. Experimental design and proof of concept: test fake reviews on Amazon

Here the data science apprentice is asked to try various strategies to post fake reviews for targeted books on Amazon, and check what works (that is, what goes undetected by Amazon). The purpose is to reverse-engineer Amazon's review scoring algorithm (used to detect bogus reviews), to identify weaknesses and report them to Amazon.

Strategies will involve:

Using 2 or 3 ISPs or non-static IP addresses (so you can play with different IP addresses to fool detection algorithms). It's better if these IP addresses are attached to different locations; you might be able to leverage EC2 to accomplish this on a bigger scale.

Using 4 Internet browsers

Create 12 Amazon accounts (one per IP/browser combination) to post reviews

Post 3 to 5 bad reviews (1-star) per target book, and have each of these reviews liked by the remaining 11 users (11 = your 12 fake users minus the fake one who wrote the review). By "liked", I mean click on "I found this review useful".

Target new books that don't have many reviews yet, say fewer than 8 reviews

For each book, deploy your reviews slowly over a period of 2 weeks. But you can work on multiple (4-5) books at the same time.

Include a decent amount of variance in your actions (for instance, do not create exactly 11 likes for each review, use a number that varies between 4 and 11)

Any other strategy that you discover yourself.

You might have to fine-tune the suggested parameters to optimize the performance of your fake review posting process. Success here is measured by the proportion of 4- or 5-star books where you managed to reduce the number of stars to 3 or below. The deliverable is a paper summarizing the results of your test, how scalable your strategy is (can it be automated?), and recommended fixes to make Amazon reviews more trustworthy (that is, designing a better review scoring system). A review scoring system scores the reviews and automatically "reviews the reviews" to decide which ones should be accepted.

3. The real business risk associated with reviews

Amazon authors are vulnerable to the following fraud, which would eventually result in significant business loss for Amazon.

Picture a start-up company selling good reviews for $500 per book, plus a $100 monthly fee. It would work as follows:

A new book receives several negative reviews (1-star) using the methodology developed in the previous section

The author is then reached by email: typically, most authors have a public email address easy to harvest with automated tools, or easy to purchase from mailing list re-sellers

The start-up offers to post good reviews only (and it does not mention the bad reviews it planted before reaching out to the author to "fix" the problem)

How scalable is this? A college student could easily make $500 a day, targeting only a few books each day. That's about $100k per year (assuming roughly 200 working days), with the money collected via PayPal. Because the money is relatively easy to make, a large number of (educated and under-employed) people could be interested in setting up such a scheme, together eventually targeting thousands of authors each day. Or someone might find a way to automate this activity, maybe using a botnet, and make millions of dollars each year. Many authors would eventually refuse to have their books listed on Amazon, and choose to self-publish with platforms such as Lulu. Publishers would also opt out of Amazon. Amazon's revenue from book sales would drop. Or Amazon could simply eliminate all reviews and not accept new ones.

Interestingly, it appears that Yelp might be making money with a similar scheme, based on fake reviews and blackmailing small businesses listed on its website. And I've seen companies selling fake Twitter followers or Facebook profiles, though they quickly disappear. Even LinkedIn was recently the victim of a massive scheme involving automatically generated fake profiles.



Conclusions

Websites relying on reviews (book, product, or restaurant reviews, etc.) are vulnerable to massive attacks that could destroy their reputation, and eventually their income.

How could Amazon protect itself from such a risk? By using a better review scoring engine. By relying more on its recommendation engine (users who purchased A also purchased B). By designing a better, fraud-resistant user reputation engine, and integrating user reputation as a metric in the review scoring engine. By displaying reviews with a high score at the top, or more frequently. Or by dropping user-generated reviews altogether.

Also, Amazon could categorize users, so that a data science book review by a user categorized as "interested in web design" does not carry the same weight as a data science book review by a user categorized as "interested in data science". Or a new company could emerge and start competing with Amazon by offering a much better user experience. Such a company could make additional revenue by offering authors the possibility to have their book featured at the top when a user is searching for books, just like Google does with webmasters who want to promote their website.
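To make the user categorization and user reputation ideas concrete, here is a minimal sketch of a star rating weighted by reviewer reputation and by whether the reviewer's interest category matches the book's category. Everything in it is hypothetical: the `UserReview` fields, the 0-to-1 reputation scale, and the `off_topic_factor` of 0.25 are illustrative assumptions, not a description of how Amazon computes ratings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserReview:
    stars: int
    user_interest: str   # category assigned to the reviewer (hypothetical)
    reputation: float    # 0.0 (untrusted) to 1.0 (fully trusted), hypothetical


def weighted_rating(
    reviews: list[UserReview],
    book_category: str,
    off_topic_factor: float = 0.25,
) -> Optional[float]:
    """Average star rating weighted by reputation and category match.

    A review from a user whose interest matches the book's category counts
    fully; any other review is discounted by off_topic_factor.
    Returns None when no review carries any weight.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for r in reviews:
        match = 1.0 if r.user_interest == book_category else off_topic_factor
        weight = r.reputation * match
        total_weight += weight
        weighted_sum += weight * r.stars
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


book_reviews = [
    UserReview(stars=5, user_interest="data science", reputation=1.0),
    UserReview(stars=1, user_interest="web design", reputation=1.0),
]
rating = weighted_rating(book_reviews, "data science")
```

With these two sample reviews, the plain average is 3.0 stars, while the weighted rating is 4.2 because the off-topic 1-star review counts for only a quarter of the on-topic 5-star review. A review from a user with zero reputation would be ignored entirely.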

Note: I never write reviews, despite the many requests that I receive from authors or publishers. I don't have the time, and I expect to be paid to provide quality content (reviews, bad or good, of high quality). No executive has time to spend on writing reviews anyway; thus, if you write a book aimed at executives, you won't get any reviews from fellow executives. In short, all the reviews will be worthless.
