Analysing Stack Overflow comment sentiment using Google Cloud Platform

The decline of Stack Overflow?

A few months back I read this post from 2015 (yes, I know I’m a little late to the party) about how Stack Overflow (SO) was in serious decline, and heading for total and utter oblivion. In the post, the first item to be called out was that SO “hated new users“:

Stack Overflow has always been a better-than-average resource for finding answers to programming questions. In particular, I have found a number of helpful answers to really obscure questions on the site, many of which helped me get past a road block either at work or in my hobby programming. As such, I decided I’d join the site to see if I could help out. Never before has a website given me a worse first impression.

At the time, I remember thinking that this seemed like somewhat of an unfair statement. That was mostly down to the fact that when I joined the community (many years ago), I had fond memories of a smooth on-boarding, and never experienced any snarky remarks on my initial questions. Yes, gaining traction for noobs is very, very hard, but there is a good reason why it exists.

For me, SO is invaluable. How else would I be able to pretend to know what I’m doing? How else could I copy and paste code from some other person who’s obviously a lot smarter than me, and take all the credit for it? Anyway, once I had read the post, and gotten on with my life (e.g. copying and pasting more code from SO), I did’t think too much more about the post. Maybe I had just been lucky with my foray into the SO community?

However, just last week, I was reminded of that post once again, when I noticed that BigQuery (BQ) now has a public dataset which includes all the data from SO – including user comments and answers. Do you see where I am going with this yet? If not, then don’t worry. Neither did I when I started writing this.

Hatching a plan

And so, I got to thinking. I thought it could be really interesting, and fun at the same time, to try and analyse the freshly uploaded data in BQ to see if it was really true that SO hated new users. The idea I had was (*cough*) simple-ish:

BQ: craft some SQL to extract all comments made on questions posted by noobs (SO reputation <= 1). Dataflow & Natural Language API: spin up a pipeline to execute the SQL from 1, and process each comment to get a sentiment score using the Google Natural Language API. BQ: analyse the sentiment results.

I’d been eager to play with some more of the Machine Learning (ML) magic on GCP. On a recent POC for a client, we had tested out the Cloud Vision API (stay tuned for a blog on that), and everyone involved was super-duper impressed with it. So, by convincing my boss that experimenting with the Natural Language API (NL-API) was a very good thing indeed, he then let me have some TEL time to spin up this test. However, there were some constraints to my uber scientific experiment:

Over the entire dataset, there were ~250K comments made on noob questions. See next point. The NL-API currently has some strict limits i.e. only 1000 API requests per 100s. Ouch. The NL-API is (obviously) not free.

Taking those three points into consideration, I trimmed it down to just analysing just one year’s worth (2016). I cut it down even further by only looking at either JS, Java or Python tagged questions (three languages I use regularly enough). Applying both of those filters left me with a manageable dataset size of 69662 records/comments to flick through the NL-API. Outstanding!

Disclaimer at this point: I don’t really know a whole lot about how the incredibly complex world of machine learning works under the hood. I’m not smart enough. But the beauty of it, is that I don’t have to understand. I leave all that up to the whizzes at Google, and who simply provide an API for folks like me to use. Also, I’m neither a statistician, nor a data analyst so my results may be…ehh…questionable in terms of accuracy, correctness, confidence levels etc. Hence the disclaimer.

Step 1: crafting some (horribly inefficient) SQL in BigQuery

One of the awesome things about BQ is that it’s so bloody fast. Like, insanely fast. We love using it at Shine. However, this can also make you lazy at times because it affords you the luxury of not having to worry too much about writing inefficient SQL. Of course, when you’re dealing with massive datasets in production, then you definitely want to be thinking of efficiency – even when using BQ.

But for sake of a blog or POC, then knocking up something quick and dirty is absolutely fine in my book. Alas, without further ado, here’s the SQL for grabbing all comments made on noobs question posts – filtered by just 2016, JS/Java/Python tags only, and not comments made by the OP:

#standardSQL SELECT comments_on_questions_only.id, comments_on_questions_only.text, comments_on_questions_only.creation_date, comments_on_questions_only.post_id, comments_on_questions_only.user_id, comments_on_questions_only.score, comments_on_questions_only.tags, comments_on_questions_only.owner_user_id, users.reputation FROM ( SELECT comments.id, comments.text, comments.creation_date, comments.post_id, comments.user_id, comments.user_display_name, comments.score, questions.id AS question_id, questions.tags, questions.owner_user_id FROM `bigquery-public-data.stackoverflow.comments` comments LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` questions ON comments.post_id = questions.id GROUP BY comments.id, comments.text, comments.creation_date, comments.post_id, comments.user_id, comments.user_display_name, comments.score, question_id, questions.tags, questions.owner_user_id HAVING questions.id IS NOT NULL AND creation_date >= TIMESTAMP('2016-01-01')) comments_on_questions_only LEFT JOIN `bigquery-public-data.stackoverflow.users` users ON comments_on_questions_only.owner_user_id = users.id GROUP BY comments_on_questions_only.id, comments_on_questions_only.text, comments_on_questions_only.creation_date, comments_on_questions_only.post_id, comments_on_questions_only.user_id, comments_on_questions_only.user_display_name, comments_on_questions_only.score, comments_on_questions_only.tags, comments_on_questions_only.owner_user_id, users.reputation HAVING users.reputation <= 1 AND user_id != owner_user_id AND (tags LIKE '%javascript%' OR tags LIKE '%java%' OR tags LIKE '%python%')

Listen up: I’ve shared the query publicly here, and you can also find the public dataset with some of the tables I created along the way here. I also saved the query as a view to make the pipeline code easier to read (see next step).

Step 2: ML magic

Google have opened up a lot of their machine learning technologies on GCP recently e.g. Cloud Vision API, TensorFlow, Natural Language API etc. are all available now as public APIs. I’ll let the NL-API website take over at this point:

Google Cloud Natural Language API reveals the structure and meaning of text by offering powerful machine learning models in an easy to use REST API. You can use it to extract information about people, places, events and much more, mentioned in text documents, news articles or blog posts. You can use it to understand sentiment about your product on social media or parse intent from customer conversations happening in a call center or a messaging app. You can analyze text uploaded in your request or integrate with your document storage on Google Cloud Storage.

I was interested in the “understand sentiment” feature. What I wanted to was to grab all the comments made on noob questions, and pass each one to the NL-API to get a sentiment score to do some analysis on i.e. what overall attitude of comments made on noob questions was like (according to NL-API), and which developers were the most (un)welcoming of noobs – JS, Java or Python! Using the NL-API, sentiment scores are grouped into 3 bands (positive, neutral or negative):

Each response has a score and magnitude associated with it. The Google docs explain it best (and it saves me trying to fumble around with my awful Irish vocabulary for the right wording):

The score of a document’s sentiment indicates the overall emotion of a document. The magnitude of a document’s sentiment indicates how much emotional content is present within the document, and this value is often proportional to the length of the document. A document with a neutral score (around 0.0 ) may indicate a low-emotion document, or may indicate mixed emotions, with both high positive and negative values which cancel each out. Generally, you can use magnitude values to disambiguate these cases, as truly neutral documents will have a low magnitude value, while mixed documents will have higher magnitude values.

For this test, I didn’t believe analysing the magnitude was necessary because what I was trying to analyse were just comments (max 300 chars), and not lengthly prose text or paragraphs. However, if you were to use NL-API on large text samples, then you do need to consider the magnitude. See here for a detailed explanation.

Moving on. OK, so I didn’t really have to use Cloud Dataflow for this part of my experiment. I could have just as easily ran the query directly in BQ, download the results, and parse each row for sentiment analysis using a Python script, or even just some simple bash cat | awk | curl magic. Yes, Dataflow was a bit of overkill to be honest. But, I wanted to demonstrate the ease of wiring up several tools/services on GCP, so that if you plan on doing something similar (but with much bigger datasets) then it will scale. Well that, and I just wanted to show off that I know how to use Dataflow too.

The pipeline process:

With 2 workers in the pool, run the SQL from step 1 against BQ. Process each table row (record) using a simple `ParDo` (dataflow distributed function). The ParDo calls out to the NL-API (yes, synchronously!) to request the sentiment score for each comment. Write the results back to BQ for more SQL analysis.

Here’s the Java code for the pipeline:

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential; import com.google.api.services.bigquery.BigqueryScopes; import com.google.api.services.bigquery.model.TableFieldSchema; import com.google.api.services.bigquery.model.TableRow; import com.google.api.services.bigquery.model.TableSchema; import com.google.api.services.language.v1beta1.CloudNaturalLanguage; import com.google.api.services.language.v1beta1.CloudNaturalLanguageScopes; import com.google.api.services.language.v1beta1.model.AnalyzeSentimentRequest; import com.google.api.services.language.v1beta1.model.AnalyzeSentimentResponse; import com.google.api.services.language.v1beta1.model.Document; import com.google.cloud.dataflow.sdk.Pipeline; import com.google.cloud.dataflow.sdk.io.BigQueryIO; import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions; import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory; import com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner; import com.google.cloud.dataflow.sdk.transforms.DoFn; import com.google.cloud.dataflow.sdk.transforms.ParDo; import com.google.cloud.dataflow.sdk.values.PCollection; import java.io.IOException; import java.util.ArrayList; import java.util.List; import java.util.concurrent.TimeUnit; import java.util.stream.Collectors; import java.util.stream.Stream; import static com.google.cloud.dataflow.sdk.io.BigQueryIO.Read.fromQuery; import static com.google.cloud.dataflow.sdk.io.BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED; import static com.google.cloud.dataflow.sdk.io.BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE; import static com.google.cloud.dataflow.sdk.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType.NONE; /** * Created by Graham Polley | Shine Solutions on 23/02/2016. */ public class SentimentDataflowPipeline { public static transient CloudNaturalLanguage LANGUAGE_SERVICE; //lazy load (don't serialize in ParDo) /** * A quick and dirty Dataflow pipeline that analyses Stack Overflow sentiment. * * @param args * @throws IOException */ public static void main(String[] args) throws IOException { // -- Boring credential stuff. Remember to add the correct scopes. List<String> scopes = Stream.concat(BigqueryScopes.all().stream(), CloudNaturalLanguageScopes.all().stream()).collect(Collectors.toList()); final GoogleCredential credential = GoogleCredential.getApplicationDefault().createScoped(scopes); // -- Construct the Dataflow/Beam pipeline DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class); options.setRunner(BlockingDataflowPipelineRunner.class); options.setProject("stackoverflow-sentiment"); options.setWorkerMachineType("n1-standard-1"); options.setNumWorkers(2); options.setZone("asia-east1-a"); options.setStagingLocation("gs://sentiment-pipeline/"); options.setTempLocation("gs://sentiment-pipeline/"); options.setAutoscalingAlgorithm(NONE); options.setGcpCredential(credential); options.setJobName("analyse-stackoverflow-comment-sentiment"); Pipeline pipeline = Pipeline.create(options); PCollection<TableRow> commentsOnNoobsQuestions = pipeline.apply("read-StackOverflow-from-BQ", fromQuery("SELECT * FROM `stackoverflow-sentiment.sentiment.v_comments_on_questions_from_noobs`").usingStandardSql()).apply(ParDo.of(new DoFn<TableRow, TableRow>() { @Override public void processElement(ProcessContext c) throws Exception { DataflowPipelineOptions options = (DataflowPipelineOptions) c.getPipelineOptions(); if (LANGUAGE_SERVICE == null) { LANGUAGE_SERVICE = new CloudNaturalLanguage.Builder( options.getGcpCredential().getTransport(), options.getGcpCredential().getJsonFactory(), request -> options.getGcpCredential().initialize(request)).setApplicationName("NL-API").build(); } //Construct the NL-API request per comment Document doc = new Document(); doc.setLanguage("EN"); doc.setType("PLAIN_TEXT"); doc.setContent((String) c.element().get("text")); AnalyzeSentimentRequest sentimentRequest = new AnalyzeSentimentRequest(); sentimentRequest.setDocument(doc); CloudNaturalLanguage.Documents.AnalyzeSentiment res = LANGUAGE_SERVICE.documents().analyzeSentiment(sentimentRequest); AnalyzeSentimentResponse response = res.execute(); TimeUnit.MILLISECONDS.sleep(100); //Output the row back to BigQuery TableRow outputRow = new TableRow(); outputRow.set("id", c.element().get("id")); outputRow.set("text", c.element().get("text")); outputRow.set("creation_date", c.element().get("creation_date")); outputRow.set("post_id", c.element().get("post_id")); outputRow.set("user_id", c.element().get("user_id")); outputRow.set("score", c.element().get("score")); outputRow.set("tags", c.element().get("tags")); outputRow.set("owner_user_id", c.element().get("owner_user_id")); outputRow.set("reputation", c.element().get("reputation")); outputRow.set("sentiment_score", response.getDocumentSentiment().getScore()); outputRow.set("sentiment_magnitude", response.getDocumentSentiment().getMagnitude()); c.output(outputRow); } }).named("Process-comment-using-NL-API")); // --Boiler plate BQ schema stuff List<TableFieldSchema> fields = new ArrayList<>(); fields.add(new TableFieldSchema().setName("id").setType("INTEGER")); fields.add(new TableFieldSchema().setName("text").setType("STRING")); fields.add(new TableFieldSchema().setName("creation_date").setType("TIMESTAMP")); fields.add(new TableFieldSchema().setName("post_id").setType("INTEGER")); fields.add(new TableFieldSchema().setName("user_id").setType("INTEGER")); fields.add(new TableFieldSchema().setName("score").setType("INTEGER")); fields.add(new TableFieldSchema().setName("tags").setType("STRING")); fields.add(new TableFieldSchema().setName("owner_user_id").setType("INTEGER")); fields.add(new TableFieldSchema().setName("reputation").setType("INTEGER")); fields.add(new TableFieldSchema().setName("sentiment_score").setType("FLOAT")); fields.add(new TableFieldSchema().setName("sentiment_magnitude").setType("FLOAT")); TableSchema schema = new TableSchema().setFields(fields); // --Write the results back to BigQuery commentsOnNoobsQuestions.apply(BigQueryIO.Write .named("write-results-to-BQ") .to("stackoverflow-sentiment:sentiment.noob_sentiment_results_2016") .withSchema(schema) .withWriteDisposition(WRITE_TRUNCATE) .withCreateDisposition(CREATE_IF_NEEDED)); pipeline.run(); } }

Here’s what the pipeline looked like while it was chugging away, and running on two GCE (n1-standard-1) instances. Because the NL-API limits requests to 1000 per 100s, I simply let the pipeline do its thing while I got on with trying to figure out what to do next.

When the the pipeline had finished, it left me with the following table in BQ ready to be analysed:

To finish it off, I ran this SQL over the top of it to clean, and roll up the tags and sentiment scores into Strings – `positive`, `neutral`, or `negative` to make it more results more humanly readable.

Step 3: pretending to be a data analyst super hero

Let’s recap. I now had all comments for 2016 (tagged with either JS, Java or Python) made on a noob (reputation <= 1) question post, and each one had a sentiment score assigned by the NL-API. Next came the fun part.

The average sentiment score is-0.1026. In other words, it’s neutral. So, the overall comment sentiment for 2016 on noob questions is in fact neutral – at least according to the NL-API. This doesn’t align to the ‘hates new users ‘ statement! On the contrary, there are a total of just 652 negative comments compared to 62810 neutral and 6200 positive.

So, what is the split of sentiment across the three different types of developers? Well, it turns out that Python developers post the lowest percent of negative comments overall, followed by Java, and then it’s JS developers that are the (according to the NL-API) the most unwelcoming to new users on SO:

Here are the top 10 positive comments. I would not consider all of these positive comments, but on the whole the NL-API looks to have gotten these ones relatively correct:

Row sentiment_score tag_rolled_up text 1 0.9 java Lookup abstract base class and virtual methods. 2 0.9 javascript Google is an invaluable tool. 3 0.9 python @DavidZemens Useful info, thanks! 4 0.9 java Improved readability, code formatting 5 0.9 java Use the debugger in your IDE, this is a perfect opportunity to learn how to use it 6 0.9 python This is a tricky thing to do in matplotlib – here is a good starting point: http://stackoverflow.com/questions/14827650/pyplot-scatter-plot-marker-size 7 0.9 python Awesome stuff! 🙂 8 0.9 java Improved readability, updated the title. 9 0.9 java I would recommend using [simplejson](https://pypi.python.org/pypi/simplejson/), works great for reading JSON-files. 10 0.9 java @Tunaki: Excellent finds, I used the first one.

Let’s now have a look at the top 10 negative ones. Number 3 & 5 are particularly harsh, and clearly negative comments! But, the other ones don’t look particularly bad in my opinion:

Row sentiment_score tag_rolled_up text 1 -0.9 java Unless `OAuth2TokensForPkce` subclasses `Throwable` *(very unlikely)*, the `throw` statement won’t compile either: [*The Expression in a throw statement must either denote a variable or value of a reference type which is assignable (§5.2) to the type **Throwable**, or denote the null reference, or a compile-time error occurs.*](https://docs.oracle.com/javase/specs/jls/se8/html/jls-14.html#jls-14.18) 2 -0.9 java `Compilation failure [ERROR] /var/local/modular/src/main/java/demo/SecUserDetails.java:[6,8] demo.SecUserDetails is not abstract and does not override abstract method isEnabled() in org.springframework.security.core.userdetails.UserDetails`, doesn’t this suggest that you should override `isEnabled()`? 3 -0.9 java This is very poor object oriented design 4 -0.9 java Looks like you have incompatible versions in your pom.xml somewhere: `java.lang.NoSuchMethodError: org.springframework.util.ReflectionUtils.doWithLocalFields(…)` 5 -0.9 java no offence, but your code is horrible and doesn’t show any logic even if your code did work, you will always get 0 returned 6 -0.9 python Your code outputs `(7, 3)`, it does not use `numlist`, it might return `None` if the `if` is never executed; it is unclear what it is supposed to do, or why you called it `sum_2_array`. 7 -0.9 java Do you mean to say — you want to set Pass / Fail result in testNG results ?? 8 -0.9 javascript Unrelated: `return false; // <=== breaks out of the loop early` that will not work the way you assume. 9 -0.9 javascript `chrome.storage.sync.get` has no returned value, you can’t assign it to `prefSite`, see duplicated link for how to retrieve returned data from aynnchronous call. 10 -0.9 javascript You have a module for that (https://github.com/witoldsz/angular-http-auth) all you need to do is returen http error code 403 for unauthorized request, and it will trigger a global event that you catch in your code and redirect the user to the login page

How about negative comments broken down by the day of the week. Are developers influenced by the Monday Blues when posting on SO? Here’s what the NL-API thinks of that:

Final thoughts

This was a really fun blog post. Hooking up all the tools, and dabbling with the NL-API was really quite interesting, and it has given the team some ideas on how it could be used for some upcoming projects that we have in the pipeline.

While the NL-API is really easy to use, I think it’s extremely hard for the model itself to accurately capture the true sentiment of text, because after all it’s just looking at classified keywords. That is clearly evident in the results posted above, which I don’t believe can be truely taken at face value. In fact, this SO answer sums up the challenges of sentiment analysis quite nicely I think:

I don’t think there’s anything particularly wrong with your algorithm, it’s a fairly straightforward and practical way to go, but there are a lot of situations where it will get make mistakes. Ambiguous sentiment words – “This product works terribly” vs. “This product is terribly good” Missed negations – “I would never in a millions years say that this product is worth buying” Quoted/Indirect text – “My dad says this product is terrible, but I disagree” Comparisons – “This product is about as useful as a hole in the head” Anything subtle – “This product is ugly, slow and uninspiring, but it’s the only thing on the market that does the job”

There’s lot more that could be done with this data i.e. much deeper analysis. Maybe you have some more ideas for analysing the data? Or you’d like to extend the queries some more – how about analysing the entire dataset and plot the sentiment across time! If so, please feel free to take any of my code/queries and re-use modify as you see fit. Then ping me with your results. I’d love to hear from you.

Happy analysing on GCP!