In my previous blog post, Twitter Sentiment Analysis using Talend, I showed how to extract tweets from Twitter using Talend and how to do some basic sentiment analysis on them. In this post, I will introduce the Stanford CoreNLP toolkit and show how to integrate it with Talend to perform various NLP (Natural Language Processing) analyses, including sentiment analysis.

Where were we?

Previously, I had managed to perform some basic sentiment analysis on tweets. However, I’d noticed a major flaw in my technique: it scored each word in a sentence individually and then averaged those scores, ignoring how the words relate to one another. I explain the issue in more detail in my original post, but to give you a flavour of it, here are examples of correct and incorrect sentiment identification produced by my previous method:

Negative Sentiment Detected – Correct

“I really hated his performance, it was awful”

Positive Sentiment Detected – Incorrect

“His performance wasn’t amazing”

In the first example, the process correctly detects negative sentiment, because all of the sentiment-carrying words in the sentence are negative. In the second example, the process detects positive sentiment. This is incorrect: the sentence clearly carries negative sentiment. The process has failed to identify the phrase “wasn’t amazing” as negative and has instead judged the sentence as positive overall due to the presence of the word “amazing”.
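
To make the flaw concrete, here is a minimal sketch of a word-averaging scorer in Java. It is not the code from my previous post: the class name, the three-word lexicon and its scores are all made up for illustration, and I am assuming that only words found in the lexicon contribute to the average.

```java
import java.util.HashMap;
import java.util.Map;

/** Naive word-averaging sentiment scorer (illustration only). */
final class NaiveSentiment {
    // Hypothetical mini-lexicon: -1.0 is negative, +1.0 is positive.
    private static final Map<String, Double> LEXICON = new HashMap<>();
    static {
        LEXICON.put("hated", -1.0);
        LEXICON.put("awful", -1.0);
        LEXICON.put("amazing", 1.0);
    }

    /** Averages the scores of the words found in the lexicon; 0.0 if none match. */
    static double score(String sentence) {
        double total = 0.0;
        int matched = 0;
        // Split on anything that is not a letter, so "wasn't" becomes "wasn" and "t".
        for (String word : sentence.toLowerCase().split("[^a-z]+")) {
            Double wordScore = LEXICON.get(word);
            if (wordScore != null) {
                total += wordScore;
                matched++;
            }
        }
        return matched == 0 ? 0.0 : total / matched;
    }

    public static void main(String[] args) {
        // Both sentiment words are negative, so the average is negative (correct).
        System.out.println(score("I really hated his performance, it was awful"));
        // Only "amazing" matches, so the average is positive (incorrect).
        System.out.println(score("His performance wasn't amazing"));
    }
}
```

Because each word is scored in isolation, the negation in “wasn’t” never reaches the word “amazing”, which is exactly the failure described above.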

So it turns out that sentiment analysis is difficult!

Luckily, the clever people at the Stanford Natural Language Processing Group have built a fantastic toolkit that can perform many useful analyses on text. These include tokenization, sentence splitting, lemmatization, part-of-speech tagging and named entity recognition, amongst others.

If you would like to know more about each of these analyses and how the Stanford CoreNLP tool implements them then you can check out their web demo of the tool here: http://corenlp.run/. There are additional links provided at the bottom of this post which I found useful while researching for this piece of work.

How to integrate Stanford CoreNLP with Talend

The first step is to download CoreNLP from this link: http://stanfordnlp.github.io/CoreNLP/. It might also be a good idea to have a scan of the documentation. Unzip the file and save it to a sensible location. If you open the unzipped folder you will find the source code, models, dependencies and documentation for CoreNLP.

Below is a simple Talend job that uses the CoreNLP toolkit.

You’ll need to add a tLibraryLoad component for each of the following jars in the stanford-corenlp folder:

ejml-0.23.jar

slf4j-simple.jar

slf4j-api.jar

stanford-corenlp-3.6.0-models.jar

stanford-corenlp-3.6.0.jar
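
If you want to confirm that the jars are actually visible to the job, a quick sanity check along these lines can help. This is only a sketch: the helper class is hypothetical, and the three class names are the ones I would expect to find in the CoreNLP, EJML and SLF4J API jars, so adjust them if your versions differ.

```java
/** Reports whether classes can be found on the classpath (illustration only). */
final class ClasspathCheck {
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initialisers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] expected = {
            "edu.stanford.nlp.pipeline.StanfordCoreNLP", // assumed: stanford-corenlp-3.6.0.jar
            "org.ejml.simple.SimpleMatrix",              // assumed: ejml-0.23.jar
            "org.slf4j.Logger"                           // assumed: slf4j-api.jar
        };
        for (String name : expected) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

In a Talend job, the body of main could be pasted into a tJava component that runs after the tLibraryLoad components.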

Configure the advanced settings for your stanford-corenlp-3.6.0.jar tLibraryLoad component as below.

Insert the below text into your tJava component.
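
The original listing is not reproduced in this copy of the post. As a rough sketch of what such code might look like with CoreNLP 3.6.0, the standalone class below builds a pipeline with the sentiment annotator and prints the predicted class for each sentence. The class name is my own, and it needs the jars listed above on the classpath. In Talend, I would expect the import lines to go into the component's advanced settings and the body of main into the tJava code box, but treat that placement as an assumption.

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;

public class CoreNlpSentimentSketch {
    public static void main(String[] args) {
        String text = "His performance was hardly amazing";

        // The sentiment annotator works on parse trees, so the parser has to run first.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");

        // Building the pipeline loads the models jar, which is slow: do it once per job.
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation annotation = pipeline.process(text);

        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            // Tree annotated with a sentiment prediction at every node; the root covers the sentence.
            Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
            int predictedClass = RNNCoreAnnotations.getPredictedClass(tree);
            String label = sentence.get(SentimentCoreAnnotations.SentimentClass.class);
            String sentenceText = sentence.get(CoreAnnotations.TextAnnotation.class);
            System.out.println(sentenceText + " -> " + label + " (class " + predictedClass + ")");
        }
    }
}
```

Unlike word averaging, the model scores phrases in the parse tree, so a modifier such as “hardly” can change the sentiment of the words it applies to.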

Now let’s run this example and see how it deals with the text “His performance was hardly amazing”, which is a slightly more difficult version of the example I gave earlier.

The result: Negative sentiment detected!
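
CoreNLP's sentiment model predicts one of five classes, numbered 0 to 4, from very negative to very positive. If you only care about negative, neutral or positive, a small helper like the one below can collapse the scale. The class and method names are hypothetical, and the cut-off points are my own choice.

```java
/** Maps CoreNLP's five sentiment classes (0..4) to labels (illustration only). */
final class SentimentLabels {
    private static final String[] FIVE_POINT = {
        "Very negative", "Negative", "Neutral", "Positive", "Very positive"
    };

    /** Returns the five-point label for a predicted class index. */
    static String fivePoint(int predictedClass) {
        if (predictedClass < 0 || predictedClass >= FIVE_POINT.length) {
            throw new IllegalArgumentException("Expected 0..4 but got " + predictedClass);
        }
        return FIVE_POINT[predictedClass];
    }

    /** Collapses the five classes to Negative (0, 1), Neutral (2) or Positive (3, 4). */
    static String polarity(int predictedClass) {
        fivePoint(predictedClass); // reuse the range check
        if (predictedClass < 2) {
            return "Negative";
        }
        return predictedClass == 2 ? "Neutral" : "Positive";
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(i + ": " + fivePoint(i) + " / " + polarity(i));
        }
    }
}
```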

Conclusion

Hopefully this is enough to get you going with some very effective sentiment analysis using Talend. If you are wondering how to use some of the other analyses (annotators) included in the Stanford CoreNLP toolkit and would like some help, please feel free to add a comment at the bottom of this post and I will do my best to answer your questions.

In my next blog post I will go through a job that I have built that performs sentiment analysis on tweets stored in a database with a modified version of the above code.

Links / References