That makes a lot of sense: for example, the most popular type of change is “fix”, the documentation modifications mostly affect the tutorial, and so on. One interesting observation: the “$compile” scope demands a lot of attention in all the development areas. I don’t know a lot about $compile, but I found a quite popular article that says:

View compilation in Angular is some of the most ingenious functional programming I’ve seen in JavaScript.

That explains the amount of attention required! How cool is that? Simple text mining tools available to everyone allow us to reveal what’s going on in the development of one of the most popular open source libraries!

Okay then, what about React’s most common words in the commit messages?

A lot of messages contain the words “merge”, “pull” and “request”. In fact, 2274 of React’s commit messages start with the string “Merge pull request #xxxx”. The culprit seems to be the “Collaborating on projects using issues and pull requests” workflow, and in particular its recommended approach to merging pull requests. The very same pattern of commit messages can be found in other open source projects; take a look at the Rails change history, for example.

But it was not like that at all for the Angular commit messages! Why is that? It seems that Angular maintainers use an alternative technique for integrating contributors’ changes, which boils down to applying the series of patches from the pull requests. You can find a detailed explanation in the excellent “Merge pull request Considered Harmful” blog post. Angular’s approach, along with its rules for writing commit messages, turns the change history into a useful product story that is easy to read and analyze.

Unfortunately, React’s commits lack this beauty, but that’s not a reason to give up. There are other text mining methods that can tell the story of React; in particular, term frequency–inverse document frequency (tf-idf) can be used effectively. Here is the code that selects the words that are used frequently in React’s messages but occur rarely across the React and Angular messages taken together.
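As a rough sketch of this kind of selection in Python (not necessarily the code behind the results discussed here), each project can be treated as a single document, with a word scored by its share of that project's words multiplied by the log of the inverse document frequency. The sample messages, the `tokenize` helper and the `tf_idf` function are illustrative assumptions, not part of the original analysis.

```python
import math
import re
from collections import Counter


def tokenize(message):
    # Lowercase and keep alphabetic runs only, so tokens like "#2343" are dropped.
    return re.findall(r"[a-z]+", message.lower())


def tf_idf(corpora):
    """corpora maps a project name to its list of commit messages.

    Each project is treated as one document:
    tf = word count / total words in the project,
    idf = ln(number of projects / number of projects containing the word).
    """
    counts = {
        name: Counter(word for msg in messages for word in tokenize(msg))
        for name, messages in corpora.items()
    }
    n_docs = len(counts)
    # Iterating a Counter yields each distinct word once per project.
    doc_freq = Counter(word for c in counts.values() for word in c)
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in c.items()
        }
    return scores


# Made-up sample messages, for illustration only.
corpora = {
    "react": [
        "Merge pull request from zpao/proptypes-deprecation",
        "Update PropTypes docs",
        "Fix typo in docs",
    ],
    "angular": [
        "fix($compile): handle empty templates",
        "docs(tutorial): fix typo",
    ],
}

scores = tf_idf(corpora)
top_react = sorted(scores["react"].items(), key=lambda kv: kv[1], reverse=True)
for word, score in top_react[:5]:
    print(f"{word}\t{score:.4f}")
```

With only two documents, any word that appears in both projects gets an idf of zero, so shared vocabulary such as “fix” or “docs” drops out and only project-specific words remain.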

These stats are much more informative than the previous ones. First of all, “spicyj”, “zpao”, “sebmarkbage”, “chenglou”, “syranide”, “jimfb” and “benjamn” refer to the GitHub accounts of the most active contributors. These names are used in messages like

Merge pull request #2343 from zpao/proptypes-deprecation Update PropTypes for ReactElement & ReactNode

so let’s ignore them and look at the other words with the highest tf-idf. What about “korean” and “japanese”, for example? It seems that React maintainers put a certain amount of effort into translating the flagship documentation into CJK languages. That was not observed for Angular, whose documentation seems to have an English version only. Other common words indicate the areas that are frequently affected by the changes, and every React developer should be familiar with them. Funnily enough, they describe almost everything you need to implement and test a React-based application; here is the illustration:

That’s all that I was able to extract from React’s commit messages. Other methods, like working with combinations of words, did not help to obtain any other interesting information.

Sentiment Analysis

Finally, let’s try to conduct a sentiment analysis of the commit messages by using a dictionary that labels each word as “positive” or “negative”. Here is the code that plots the distribution of the most common positive and negative words in Angular’s messages:
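A minimal Python sketch of the dictionary-based approach might look like the following. It counts the labeled words rather than plotting them, and the tiny `LEXICON`, the `sentiment_word_counts` function and most of the sample messages are hypothetical stand-ins; a real analysis would load a published sentiment lexicon and the full set of commit messages.

```python
import re
from collections import Counter

# Hypothetical miniature lexicon, for illustration only.
LEXICON = {
    "error": "negative",
    "chore": "negative",
    "bug": "negative",
    "support": "positive",
    "improve": "positive",
    "clean": "positive",
}


def sentiment_word_counts(messages, lexicon=LEXICON):
    """Count how often each lexicon word occurs, grouped by its label."""
    counts = {"positive": Counter(), "negative": Counter()}
    for message in messages:
        for word in re.findall(r"[a-z]+", message.lower()):
            label = lexicon.get(word)
            if label is not None:
                counts[label][word] += 1
    return counts


# Sample messages; the first and third are made up.
messages = [
    "chore(docs): clean up the build script",
    "fix 'type mismatch' error on IE8 after each request",
    "feat(ngModel): support custom validators",
]

counts = sentiment_word_counts(messages)
for label, counter in counts.items():
    print(label, counter.most_common(3))
```

Because the lookup is purely word by word, it has no notion of context, so a word is counted under its dictionary label regardless of how it is used in the message.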

Indeed, the words labeled as “negative” are not actually negative in context! We already know that “chore” is just a type of change suggested in the guidelines for contributors. Or, for example, “error” is certainly not negative in the message “fix ‘type mismatch’ error on IE8 after each request”. A similar picture is observed for React:

In practice, this means that React and Angular contributors do not put a lot of emotion into their commit messages; for example, there is only one f-bomb among 15K commit messages. See my previous post for examples of a slightly different approach to formatting commit messages ;)

By way of conclusion

In his great book “How Google Works” Eric Schmidt explains how astonishing things are in the Internet Century:

Three powerful technology trends have converged to fundamentally shift the playing field in most industries. First, the Internet has made information free, copious, and ubiquitous — practically everything is online. Second, mobile devices and networks have made global reach and continuous connectivity widely available. And third, cloud computing has put practically infinite computing power and storage and a host of sophisticated tools and applications at everyone’s disposal, on an inexpensive, pay-as-you-go basis.

This story is an example of using these powerful technologies: the data to analyze was obtained from a publicly available source (Google Cloud) using free computational power (BigQuery), which took only 15.6 seconds to process 121 GB of data. The analysis was done on my local machine, but I could have leveraged the Kaggle Kernels platform to run the code in the cloud.