Recently reported problems relating to the use of bots on Amazon’s Mechanical Turk service demonstrate some of the dangers of outsourcing data labeling to the platform and highlight some advantages of Neuromation’s synthetic data approach.

A major bottleneck in traditional artificial intelligence research and development is the generation of high-quality datasets. Frequently, AI researchers and developers begin a proposed project with a large amount of raw, unlabeled data in the form of images, video, text, sound, etc. This data must then be accurately labeled in order to arrive at a ground truth from which a deep learning model can learn.

As we have discussed previously, the manual labeling of datasets for artificial intelligence applications has several shortcomings. It is limited in scale, given the finite number of people who are suitable and available for a given task. It is limited by the capabilities of the people doing the labeling — for example, people cannot accurately estimate certain aspects of an image visually, such as the precise distance between two objects, or the rotation angle of an image. And finally, the expense of manual labeling creates severe limitations on the amount of accurately labeled data a developer can afford to produce.

There is, however, an additional shortcoming of manual labeling that we have not previously discussed on this blog: data labeling fraud. This occurs when work advertised as manual, or the product of actual human labor, is instead the work of bots or other automated tools.

This is the situation that has been observed in some cases on Amazon’s Mechanical Turk service, which is an extremely popular (and useful) platform for manual data labeling for AI applications. According to a recent article in Wired magazine, a researcher at the University of Minnesota raised the alarm that he was seeing suspicious, nonsensical answers to research surveys that shared duplicate GPS coordinates. In one case, the problem was so bad that he had to throw out half his data. Other researchers subsequently corroborated that they had seen similar behavior on the platform.

For AI researchers, this problem represents a major challenge. Without accurate labeling, a supervised learning algorithm, for example, would not have an accurate ground truth on which to train.

To be fair, it has always been incumbent upon users of Amazon Mechanical Turk to conduct QA on the work product they receive and to test data for accuracy. This continues to be the case, but the presence of technically sophisticated platform participants who seek to pass off automated output as human work presents a new and serious challenge to artificial intelligence practitioners, who may be working with datasets containing hundreds of thousands or millions of items.
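As one illustration of the kind of QA described above, here is a minimal sketch in Python. It is not a tool provided by Amazon or Neuromation, and every name in it is an assumption made for the example: each submission is taken to be a dictionary with the hypothetical fields `worker_id`, `lat`, `lon`, and `answers`, and the requester is assumed to have seeded the task with a few "gold" questions whose correct answers are already known. The sketch flags workers who share exact GPS coordinates with other workers, and workers whose accuracy on the gold questions falls below a chosen threshold.

```python
from collections import defaultdict


def flag_duplicate_locations(submissions, max_workers_per_location=1):
    """Return IDs of workers whose exact coordinates are shared by too many workers."""
    workers_at = defaultdict(set)
    for s in submissions:
        workers_at[(s["lat"], s["lon"])].add(s["worker_id"])
    flagged = set()
    for workers in workers_at.values():
        if len(workers) > max_workers_per_location:
            flagged |= workers
    return flagged


def gold_accuracy(submissions, gold):
    """Return each worker's accuracy on gold questions.

    Workers who answered no gold questions are omitted from the result.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in submissions:
        for question_id, answer in s["answers"].items():
            if question_id in gold:
                total[s["worker_id"]] += 1
                correct[s["worker_id"]] += int(answer == gold[question_id])
    return {worker: correct[worker] / total[worker] for worker in total}


def flag_low_accuracy(submissions, gold, min_accuracy=0.8):
    """Return IDs of workers whose gold-question accuracy is below min_accuracy."""
    scores = gold_accuracy(submissions, gold)
    return {worker for worker, score in scores.items() if score < min_accuracy}
```

Neither signal is proof of fraud on its own. Shared coordinates can have innocent explanations, and an honest worker can miss a gold question, so flagged submissions are better treated as candidates for manual review than as automatic rejections. The thresholds shown are arbitrary placeholders.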

An additional issue to be aware of when vetting data accuracy on these platforms is that Mechanical Turk workers whose work product is rejected, and who are therefore not paid, may become upset. This can result in negative comments and reputational problems on the platform, which could hinder an organization’s ability to have additional work completed there and potentially reduce the available talent pool.

We would also expect Amazon to work hard on this issue and to successfully clean up its platform, if this is in fact a widespread problem, and to close any loopholes that bad actors may now be using. But as we have seen in other areas such as anti-virus and anti-fraud in financial services, progress can be akin to an arms race between participants, in which the platforms increase security only to see scammers create new and more sophisticated means of evading it. The job of a security professional is never done, and the task of keeping these platforms free of bad actors will require vigilance not only on the part of the platforms themselves but also on the part of each user, who must ensure that they are receiving only accurate data.

Neuromation is a pioneer in the field of synthetic data for artificial intelligence applications. Synthetic data is the term for a range of techniques that, rather than merely automating the labeling of existing images, create de novo unique and varied images or other data types displaying the desired elements and characteristics for the training of deep learning algorithms. Synthetic data can show variety that may be difficult or impossible to capture otherwise, and can eliminate bias found in many datasets that negatively impacts performance in the real world. It also has significant advantages in terms of cost and scalability.

Accuracy in data labeling is another major advantage of synthetic data, as it can achieve labeling accuracy far beyond that of human-labeled or real-world data. In a world now facing a potential new arms race with bot-enabled data labelers working on major platforms, this feature of synthetic data takes on new importance.
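To illustrate why labels for synthetic data can be exact, consider the following toy sketch in Python. It is not Neuromation’s pipeline, and the function name `make_sample` and the image format are assumptions made purely for illustration. The sketch places two single-pixel "objects" on a blank grid and records the distance between them as the label. Because the generator chooses the positions itself, the label is known by construction rather than estimated by a person looking at the image.

```python
import math
import random


def make_sample(rng, size=64):
    """Render two single-pixel objects on a blank grid and return (image, label)."""
    # Pick two distinct pixel positions; the generator knows them exactly.
    positions = [(x, y) for x in range(size) for y in range(size)]
    (x1, y1), (x2, y2) = rng.sample(positions, 2)

    # "Render" the scene as a grid of 0s with a 1 at each object position.
    image = [[0] * size for _ in range(size)]
    image[y1][x1] = 1
    image[y2][x2] = 1

    # The label comes straight from the scene parameters, with no human estimate.
    label = {
        "centers": [(x1, y1), (x2, y2)],
        "distance": math.hypot(x2 - x1, y2 - y1),
    }
    return image, label


rng = random.Random(0)  # fixed seed so the dataset is reproducible
dataset = [make_sample(rng) for _ in range(100)]
```

A realistic generator would render far richer scenes, but the principle is the same: whatever quantity the scene was built from can be written out as its label.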

A recent article in Analytics India Magazine described the situation in AI development, in which the need for training data is leading organizations and researchers to rely increasingly on third parties and platforms providing Training-Data-as-a-Service. The article warns that it is essential that these services be vetted responsibly, and goes on to suggest that synthetic data is quickly becoming the go-to approach to solving the data-labeling problem, given its improved security and accuracy.

Neuromation believes that synthetic data is an extremely important tool for AI development, but one that in many applications will never completely replace the need for real-world data. For this reason, it is extremely important that manual labeling be done responsibly and accurately, and we hope that the platforms will improve their security and reliability. Otherwise, the problem could have a major negative impact on AI research and development.

For additional information on Neuromation’s services and experience in a wide range of industries (including healthcare, retail, manufacturing, agriculture, pharmaceuticals and more), and to keep track of the progress in our development of the Neuromation Platform, you can check out our website or leave us a message and we’ll have one of our experts get in touch.

By Angus Roven,

Neuromation Investor Relations Analyst