Machines have been "seeing" for decades. Early uses of machine vision included a wide range of applications, from medical imaging to checking products for defects. More recent work has focused on improving image recognition: looking at a picture and (somehow) determining what it might be a picture of. Uses range from tagging and cataloging images for retrieval, to determining whether images violate Terms of Service.

Most social networks, for example, currently employ human "content moderators" to handle this challenging work. Recognition algorithms are trained by running them on sets of labeled images and checking the results; law enforcement users of facial recognition software have discovered that a diverse set of training images helps the algorithms recognize faces across the full range of ethnic groups.

To make it easy for you to experience the capabilities of commercially available image recognition, Parthenon's Tyler Spaeth and Tim Tate have written an application that lets you compare outcomes from Google Vision, Amazon Rekognition and CloudSight. Tyler's full report is below.

Python source code

Robovision Report

Google Vision, Amazon Rekognition, and CloudSight have much in common:

A REST API and a variety of SDKs. (I found the Python versions straightforward.)

Metered billing for API calls, with the first tier free.

The key difference in output: Google Vision and Amazon Rekognition provide ranked keywords (my implementation displays the first five for you), while CloudSight provides human-readable descriptions.
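Displaying the first five ranked keywords amounts to sorting labels by confidence and taking the head of the list. A minimal sketch in Python, assuming a Rekognition-style response shape (a list of dicts with "Name" and "Confidence" keys; the sample values here are invented):

```python
def top_keywords(labels, n=5):
    """Return the n highest-confidence label names, best first."""
    ranked = sorted(labels, key=lambda label: label["Confidence"], reverse=True)
    return [label["Name"] for label in ranked[:n]]

# Sample shaped like Rekognition's detect_labels output (values invented)
sample = [
    {"Name": "Cat", "Confidence": 98.2},
    {"Name": "Pet", "Confidence": 92.1},
    {"Name": "Animal", "Confidence": 99.0},
]
print(top_keywords(sample, n=2))  # ['Animal', 'Cat']
```

Google Vision's responses use different field names ("description" and "score"), but the same approach applies.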

My conclusions are based on my subjective experience while working with the APIs. Below, I've described results using one instance each of a single object image, multiple objects in one image, and an image that forms an entire scene.

Single object detection

I started with a picture of a cat with no background:

CloudSight:     "silver tabby cat"
Google Vision:  cat, mammal, vertebrate, european shorthair, cat like mammal
Rekognition:    Animal, Cat, Mammal, Manx, Pet

While all three services turned up results involving cats, CloudSight distinguished itself by returning a short, common sense description. Google Vision and Rekognition provide keywords that could have come from a nature taxonomy at varying levels of specificity.

Multi-object detection

It is not uncommon to have multiple disparate subjects in a single image, such as in this image of a place setting:

CloudSight:     "white ceramic round plate in the middle of fork, knife and spoon"
Google Vision:  cutlery, fork, tableware, plate, dishware
Rekognition:    Goblet, Dish, Food, Plate, Beverage

CloudSight's description of the image is phenomenal, even ordering the subjects from left to right and giving their relative positions, not to mention describing the plate in detail. While Google Vision returned reasonable results, Rekognition included terms that are clearly not present in the image, such as "Goblet" and "Beverage".

Scene detection

Images often have backdrops, providing the context for the subject of the image:

CloudSight:     "sunflower field during sunset"
Google Vision:  sunflower, flower, plant, yellow, field
Rekognition:    Landscape, Scenery, Daisies, Daisy, Nature

I hadn't paid attention to the sunset in the background until I read CloudSight's result, even though the sunset takes up almost half the frame. It's worth noting that Google Vision did list "sunset" toward the bottom of its list, and "Sunflower" appeared at the end of Rekognition's results. I think CloudSight and Google Vision both did a fine job here.

Conclusion

In my testing, I found CloudSight to have the highest accuracy and relevance of the three image recognition services. But I'd be remiss if I didn't point out that although CloudSight appealed a great deal to me as a user, it isn't particularly machine-friendly. In fact, while working on this review, I found myself installing Google's SyntaxNet to parse CloudSight's results before it occurred to me that I was not going down the most time-efficient path.

From the development side, all three services were trivial to use, and each of the SDKs followed a similar path: authenticate, pass up an image, get back your results. I found the documentation for each roughly on par as well: easy to read but not always deeply detailed, especially in regard to error handling.
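That common flow can be sketched as a single function. In this sketch, `client` stands in for an already-authenticated SDK client, and the response shape assumed is Rekognition's; a stub client is used below in place of a real boto3 client, which would require AWS credentials:

```python
def recognize(client, image_bytes, max_labels=5):
    """The common flow: pass up an image, get back ranked label names.

    `client` is any object with a Rekognition-style detect_labels method;
    in production it would be boto3.client("rekognition").
    """
    response = client.detect_labels(
        Image={"Bytes": image_bytes}, MaxLabels=max_labels
    )
    return [label["Name"] for label in response["Labels"]]
```

The other two services follow the same shape with different method and field names, so a thin wrapper like this makes it easy to swap services for comparison.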

Reliability was the same across the board. Each service had hiccups, but not often. In most cases, immediately retrying a query produced the correct results.

Speed was where the services differed: CloudSight took several times longer than either Rekognition or Google Vision. To mitigate this, you can opt in to asynchronous recognition, and CloudSight's API will return immediately, providing a token your code can use to check the status of the request while it is in flight.
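A generic polling loop for that asynchronous flow might look like the sketch below. Here `check_status` is a placeholder for whatever call the SDK exposes to look up a request by its token; the function name and timing values are my own, not CloudSight's API:

```python
import time

def wait_for_result(check_status, interval=1.0, timeout=60.0):
    """Poll until check_status() returns a non-None result or we time out.

    check_status is a zero-argument callable that queries the service
    with the token from the initial request and returns None while the
    recognition is still in flight.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check_status()
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError("recognition request did not complete in time")
```

This keeps the slow call off the critical path: submit the image, keep the token, and poll (or poll from a background thread) while the rest of your pipeline proceeds.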

Without factoring in the price of each service, where machine readable results are needed, I'd recommend Google Vision over Amazon Rekognition. For human readable descriptions of images, I'd choose CloudSight (far and away). Since in real life price is a factor, your mileage may vary.

Note: CloudSight excels at decoding CAPTCHAs in a way that suggests there may be humans involved somewhere in their process.

This CAPTCHA was correctly described as "good morning text". Perhaps the service is augmented by people, or the people are augmented by deep learning or are helping to train the algorithms. It is unlikely that known current technology is completely sufficient for these results.

Tyler Spaeth

Related Links

In 2016 Gaurav Oberoi pondered how CloudSight might be getting its results. His app Cloudy Vision compared six services in April 2017.

In 2015 redditors speculated that CloudSight might be combining deep learning and reverse image search (reading the tags from the images found) with human assistance. They pointed out that having humans and software work together to offer and test the acceptance of a product is a known "lean startup" technique. One redditor claimed to have worked for the company as a tagger in 2013.