Many years ago I was wandering through the University of Maryland CS Library and found a dusty old book titled What Computers Can’t Do, adjacent to its successor, What Computers Still Can’t Do. The second book was thicker, which made me realize that Computer Science was a worthwhile field to study. While preparing to write this post I found an archive copy of the first book and found an interesting observation:

Since a human being using and understanding a sentence in a natural language requires an implicit knowledge of the sentence’s context-dependent use, the only way to make a computer that could understand and translate a natural language may well be, as Turing suspected, to program it to learn about the world.

This was a very prescient observation and I’d like to tell you about Amazon Comprehend, a new service that actually knows (and is very happy to share) quite a bit about the world!

Introducing Amazon Comprehend

Amazon Comprehend analyzes text and tells you what it finds, starting with the language, from Afrikaans to Yoruba, with 98 more in between. It can identify different types of entities (people, places, brands, products, and so forth), key phrases, and sentiment (positive, negative, mixed, or neutral), all from text in English or Spanish. Finally, Comprehend’s topic modeling service extracts topics from large sets of documents for analysis or topic-based grouping.

The first four functions (language detection, entity categorization, sentiment analysis, and key phrase extraction) are designed for interactive use, with responses available in hundreds of milliseconds. Topic extraction works on a job-based model, with run time proportional to the size of the document collection.

Comprehend is a continuously trained Natural Language Processing (NLP) service. Our team of engineers and data scientists continues to extend and refine the training data, with the goal of making the service increasingly accurate and more broadly applicable over time.

Exploring Amazon Comprehend

You can explore Amazon Comprehend using the Console and then build applications that make use of the Comprehend APIs. I’ll use the opening paragraph from my recent post on Direct Connect to exercise the Amazon Comprehend API Explorer. I simply paste the text into the box and click on Analyze:

Comprehend processes the text at lightning speed, highlights the entities that it identifies (as you can see above), and makes all of the other information available at a click:

Let’s look at each part of the results. Comprehend can detect many categories of entities in the text that I supply:

Here are all of the entities that were found in my text (they can also be displayed in list or raw JSON form):

Here are the first key phrases (the rest are available by clicking Show all):

Language and sentiment are simple and straightforward:

OK, so those are the interactive functions. Let’s take a look at the batch ones! I already have an S3 bucket that contains several thousand of my older blog posts, an empty bucket for my output, and an IAM role that allows Comprehend to access both. I enter the job details and click on Create job to get started:

I can see my recent jobs in the Console:

The output appears in my bucket when the job is complete:

For demo purposes I can download the data and take a peek (in most cases I would feed it in to a visualization or analysis tool):

$ aws s3 ls s3://comp-out/348414629041-284ed5bdd23471b8539ed5db2e6ae1a7-1511638148578/output/
2017-11-25 19:45:09     105308 output.tar.gz
$ aws s3 cp s3://comp-out/348414629041-284ed5bdd23471b8539ed5db2e6ae1a7-1511638148578/output/output.tar.gz .
download: s3://comp-out/348414629041-284ed5bdd23471b8539ed5db2e6ae1a7-1511638148578/output/output.tar.gz to ./output.tar.gz
$ gzip -d output.tar.gz
$ tar xf output.tar
$ ls -l
total 1020
-rw-r--r-- 1 ec2-user ec2-user 495454 Nov 25 19:45 doc-topics.csv
-rw-rw-r-- 1 ec2-user ec2-user 522240 Nov 25 19:45 output.tar
-rw-r--r-- 1 ec2-user ec2-user  20564 Nov 25 19:45 topic-terms.csv
$

The topic-terms.csv file clusters related terms within a common topic number (first column). Here are the first 25 lines:

topic,term,weight
000,aw,0.0926182
000,week,0.0326755
000,announce,0.0268909
000,blog,0.0206818
000,happen,0.0143501
000,land,0.0140561
000,quick,0.0143148
000,stay,0.014145
000,tune,0.0140727
000,monday,0.0125666
001,cloud,0.0521465
001,quot,0.0292118
001,compute,0.0164334
001,aw,0.0245587
001,service,0.018017
001,web,0.0133253
001,video,0.00990734
001,security,0.00810732
001,enterprise,0.00626157
001,event,0.00566274
002,storage,0.0485621
002,datar,0.0279634
002,gateway,0.015391
002,s3,0.0218211
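As a quick sketch of how this output might be post-processed, here is a short Python snippet that groups terms by topic number and orders them by descending weight. The sample rows are copied from the listing above; the `top_terms` helper is my own illustration, not part of the service.

```python
# Group topic-terms.csv rows by topic and keep the highest-weight terms.
# The sample rows below are copied verbatim from the job output above.
import csv
import io
from collections import defaultdict

SAMPLE = """topic,term,weight
000,aw,0.0926182
000,week,0.0326755
001,cloud,0.0521465
001,quot,0.0292118
002,storage,0.0485621
"""

def top_terms(csv_text, n=3):
    """Return a dict mapping topic number to its n highest-weight terms."""
    by_topic = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_topic[row["topic"]].append((float(row["weight"]), row["term"]))
    return {topic: [term for _, term in sorted(pairs, reverse=True)[:n]]
            for topic, pairs in by_topic.items()}

top = top_terms(SAMPLE)
```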

The doc-topics.csv file then indicates which of the topics in the first file each document refers to, along with the proportion of the document devoted to each topic. Again, the first 25 lines:

docname,topic,proportion
calillona_brows.html,015,0.577179
calillona_brows.html,062,0.129035
calillona_brows.html,003,0.128233
calillona_brows.html,071,0.125666
calillona_brows.html,076,0.039886
amazon-rds-now-supports-sql-server-2012.html,003,0.851638
amazon-rds-now-supports-sql-server-2012.html,059,0.061293
amazon-rds-now-supports-sql-server-2012.html,032,0.050921
amazon-rds-now-supports-sql-server-2012.html,063,0.036147
amazon-rds-support-for-ssl-connections.html,048,0.373476
amazon-rds-support-for-ssl-connections.html,005,0.197734
amazon-rds-support-for-ssl-connections.html,003,0.148681
amazon-rds-support-for-ssl-connections.html,032,0.113638
amazon-rds-support-for-ssl-connections.html,041,0.100379
amazon-rds-support-for-ssl-connections.html,004,0.066092
zipkeys_simplif.html,037,1.0
cover_art_appli.html,093,1.0
reverse-dns-for-ec2s-elastic-ip-addresses.html,040,0.359862
reverse-dns-for-ec2s-elastic-ip-addresses.html,048,0.254676
reverse-dns-for-ec2s-elastic-ip-addresses.html,042,0.237326
reverse-dns-for-ec2s-elastic-ip-addresses.html,056,0.085849
reverse-dns-for-ec2s-elastic-ip-addresses.html,020,0.062287
coming-soon-oracle-database-11g-on-amazon-rds-1.html,063,0.368438
coming-soon-oracle-database-11g-on-amazon-rds-1.html,041,0.193081
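A common follow-up step is to tag each document with its single strongest topic. Here is a minimal sketch that does that; the sample rows are copied from the listing above, and the `dominant_topics` helper is illustrative rather than part of the service.

```python
# Read doc-topics.csv rows and keep each document's highest-proportion
# topic. The sample rows below are copied verbatim from the output above.
import csv
import io

SAMPLE = """docname,topic,proportion
calillona_brows.html,015,0.577179
calillona_brows.html,062,0.129035
zipkeys_simplif.html,037,1.0
"""

def dominant_topics(csv_text):
    """Return a dict mapping docname to its largest-proportion topic."""
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        proportion = float(row["proportion"])
        if proportion > best.get(row["docname"], ("", -1.0))[1]:
            best[row["docname"]] = (row["topic"], proportion)
    return {doc: topic for doc, (topic, _) in best.items()}

dominant = dominant_topics(SAMPLE)
```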

Building Applications with Amazon Comprehend

In most cases you will be using the Amazon Comprehend API to add natural language processing to your own applications. Here are the principal interactive functions:

DetectDominantLanguage – Detect the dominant language of the text. Some of the other functions require you to provide this information, so call this function first.

DetectEntities – Detect entities in the text and return them in JSON form.

DetectKeyPhrases – Detect key phrases in the text and return them in JSON form.

DetectSentiment – Detect the sentiment in the text and return POSITIVE, NEGATIVE, NEUTRAL, or MIXED.
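To make the flow concrete, here is a hedged sketch of calling these functions from Python with boto3. The `pick_sentiment` helper and the sample scores are my own illustration; the sample response follows the documented DetectSentiment shape, and the actual service calls are left commented so the snippet runs standalone.

```python
# Handle a DetectSentiment-style response. The sample scores below are
# invented for illustration; the boto3 calls are commented out so this
# snippet runs without AWS credentials.

def pick_sentiment(response):
    """Return the overall sentiment label and its confidence score."""
    label = response["Sentiment"]                        # e.g. "POSITIVE"
    score = response["SentimentScore"][label.capitalize()]
    return label, score

sample = {
    "Sentiment": "POSITIVE",
    "SentimentScore": {"Positive": 0.93, "Negative": 0.01,
                       "Neutral": 0.05, "Mixed": 0.01},
}

# import boto3
# comprehend = boto3.client("comprehend")
# lang = comprehend.detect_dominant_language(Text=text)["Languages"][0]["LanguageCode"]
# response = comprehend.detect_sentiment(Text=text, LanguageCode=lang)

label, score = pick_sentiment(sample)
```

Note that the detected language code feeds the later calls, which is why DetectDominantLanguage comes first.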

There are also four variants of these functions (each prefixed with Batch) that can process up to 25 documents in parallel. You can use them to build high-throughput data processing pipelines.
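Since the Batch variants accept at most 25 documents per request, a pipeline needs to split larger collections into request-sized chunks. A minimal sketch (the chunking helper and document list are illustrative; the service call is left commented):

```python
# Split a document list into chunks of 25, the per-request limit for
# the Batch* functions noted above.

def chunks(items, size=25):
    """Return consecutive slices of items, each at most size long."""
    return [items[i:i + size] for i in range(0, len(items), size)]

docs = ["document %d" % i for i in range(60)]  # hypothetical documents
batches = chunks(docs)

# for batch in batches:
#     comprehend.batch_detect_sentiment(TextList=batch, LanguageCode="en")
```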

Here are the functions that you can use to create and manage topic detection jobs:

StartTopicsDetectionJob – Create a job and start it running.

ListTopicsDetectionJobs – Get the list of current and recent jobs.

DescribeTopicsDetectionJob – Get detailed information about a single job.
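The job functions above can be sketched with boto3 as follows. The bucket names, role ARN, and topic count are placeholders rather than values from this post, and the service calls are left commented so the sketch runs standalone; the parameter and status names follow the Comprehend API.

```python
# Hedged sketch of creating and polling a topic detection job; all
# names below are placeholders and the boto3 calls are commented out.

def build_topics_job_request(input_s3_uri, output_s3_uri, role_arn, num_topics=100):
    """Assemble the parameters for StartTopicsDetectionJob."""
    return {
        "InputDataConfig": {"S3Uri": input_s3_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,
        "NumberOfTopics": num_topics,
    }

def is_terminal(status):
    """True once a job has finished, successfully or not."""
    return status in ("COMPLETED", "FAILED")

params = build_topics_job_request(
    "s3://my-blog-posts",  # hypothetical input bucket
    "s3://comp-out",       # output bucket from the walkthrough above
    "arn:aws:iam::123456789012:role/ComprehendAccess",  # placeholder role ARN
)

# import boto3, time
# comprehend = boto3.client("comprehend")
# job_id = comprehend.start_topics_detection_job(**params)["JobId"]
# while True:
#     props = comprehend.describe_topics_detection_job(JobId=job_id)
#     if is_terminal(props["TopicsDetectionJobProperties"]["JobStatus"]):
#         break
#     time.sleep(30)
```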

Now Available

Amazon Comprehend is available now and you can start building applications with it today!

— Jeff;