The Most Frequent Commit Messages on GitHub are Mostly Useless

Last Wednesday Google and GitHub announced several interesting new GitHub datasets that can be analyzed using Google's BigQuery service. Having delved into GitHub commit messages before, I knew that a lot of the messages fail at communicating what a commit actually changes.

Now that there is a table with roughly 151 million commits available, I wanted to see what the most frequent commit messages are and charted the top 30 in the image below.

Methodology

In a first naive attempt, I simply grouped the commits table by the lowercased message field. The result contained a lot of messages starting with 'merge'. Since git itself provides default merge messages, I decided to ignore them. I also trimmed leading and trailing whitespace, because it doesn't make a message more meaningful. I ended up using the following query to aggregate the message counts:

    SELECT msg, COUNT(msg) AS msg_count
    FROM (
      SELECT LOWER(REGEXP_REPLACE(message, r'^\s+|\s+$', '')) AS msg
      FROM [bigquery-public-data:github_repos.commits]
    )
    WHERE LEFT(msg, 5) != 'merge'
    GROUP BY msg
    ORDER BY msg_count DESC
    LIMIT 1000000

Without the limit, the query resulted in the error "Resources exceeded during query execution."
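To make the normalization steps easier to follow, here is a minimal Python sketch of the same logic applied to a small in-memory list. It is not the code used for the analysis. The names normalize and count_messages and the sample messages are made up for illustration, and Python's regex engine may treat some exotic whitespace characters differently from BigQuery's.

```python
import re
from collections import Counter

# Same pattern as in the query: whitespace at the start or at the end.
_EDGE_WHITESPACE = re.compile(r"^\s+|\s+$")


def normalize(message: str) -> str:
    """Trim surrounding whitespace and lowercase, like the inner SELECT."""
    return _EDGE_WHITESPACE.sub("", message).lower()


def count_messages(messages):
    """Count normalized messages, skipping those that start with 'merge'."""
    counts = Counter()
    for raw in messages:
        msg = normalize(raw)
        if msg[:5] != "merge":  # mirrors LEFT(msg, 5) != 'merge'
            counts[msg] += 1
    return counts


# Hypothetical sample data, not taken from the dataset.
sample = [
    "Initial commit",
    "  initial commit\n",
    "Merge branch 'master' of example",
    "typo",
    "   ",
]
counts = count_messages(sample)
top = counts.most_common(2)
```

Note that a whitespace-only message is not dropped by this approach. It is trimmed down to the empty string and counted as its own entry.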

Interpretation of results

The number one spot goes to "initial commit", which is the message the GitHub user interface suggests when you create a new repository. It is not the worst message in the list, but a brief description of the project would be more helpful, I guess.

Messages in the form of "update FILE" occur very frequently as well. When you edit a file directly on GitHub, this is the suggested placeholder message. I'm not sure whether it was an actual default form value before, but it probably influences the frequency of this message type.

There are also a lot of "empty" commit messages in the form of whitespace-only strings or texts such as "no message" or just a period. The messages "[maven-release-plugin] prepare for next development iteration" and "translation update done using pootle." look like they are generated automatically by some kind of build system.
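If you wanted to group such messages into coarse categories, a rough sketch could look like the following. The function name bucket, the pattern for "update FILE", and the set of placeholder texts are my own assumptions for illustration and are certainly incomplete. The input is expected to be trimmed and lowercased already.

```python
import re

# Assumption: "update FILE" means the word "update" followed by one file name.
_UPDATE_FILE = re.compile(r"update \S+")
# Assumption: an illustrative, incomplete set of placeholder texts.
_PLACEHOLDERS = {"", ".", "no message"}


def bucket(msg: str) -> str:
    """Map a trimmed, lowercased message to a coarse category."""
    if msg in _PLACEHOLDERS:
        return "empty"
    if _UPDATE_FILE.fullmatch(msg):
        return "update FILE"
    return msg


# Hypothetical examples, not taken from the dataset.
labels = [
    bucket(m)
    for m in ["update readme.md", "", ".", "no message", "update the parser to handle tabs"]
]
```

A message like "update the parser to handle tabs" is left untouched, because it says more than just which file changed.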

Conclusion

Looking at the top 30 messages, I'd argue that "typo" and "version bump" are still the most useful ones. Obviously, messages that occur very often won't be very specific, but when you consider that the top 30 make up about 4.8% of all messages, there is certainly a lot of room for improvement.
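A share like this is simply the summed count of the top messages divided by the total number of messages. Here is a small Python sketch of that arithmetic, with a hypothetical helper top_share and made-up counts rather than the real figures:

```python
from collections import Counter


def top_share(counts: Counter, n: int) -> float:
    """Return the percentage of all messages covered by the n most frequent ones."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_total = sum(count for _, count in counts.most_common(n))
    return 100.0 * top_total / total


# Hypothetical counts for illustration only, not figures from the dataset.
example = Counter(
    {"initial commit": 6, "typo": 3, "fix parser crash on empty input": 1}
)
share = top_share(example, 2)  # (6 + 3) of 10 messages
```

For the real data, the total has to be the number of all messages that survived the filtering, not just those in the truncated result list.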

Some of GitHub's UI choices seem to affect these messages, but in the end it is the git users who make the final choice, and they should take some time to think about the messages they send.