Interestingly, the speeches can start anywhere in terms of sentiment, but they always end on an overall positive note, at a level higher than where they started.

Topic modelling

Topic modelling is an unsupervised machine learning technique in NLP that clusters the words in a set of documents. The assumption is that words that occur close together in a body of text are more closely related to each other, and topic modelling lets us recover these clusters of related words.

We use latent Dirichlet allocation (LDA), one of several techniques available for topic modelling. As there is only a small number of documents (11 speeches), we limited our analysis to five topic clusters with ten keywords per cluster. “Singapore” inevitably turns up in each year’s topic modelling, so we removed it, leaving nine keywords. For illustration, we share one of the five clusters from each year in the list below (we didn’t want to bore you with too many words, but you can check out the rest of the clusters in our Google Colab notebook).
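To make the idea concrete, here is a minimal sketch of LDA with five topics and ten keywords per topic, followed by the “Singapore” filter described above. It is not the code from our notebook: it swaps in a tiny collapsed Gibbs sampler written in plain Python so that it runs without any libraries. The function names fit_lda and top_keywords, the hyperparameters alpha and beta, and the toy documents are all assumptions made for illustration.

```python
import random


def fit_lda(docs, n_topics=5, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not optimised).

    docs is a list of token lists. Returns (vocab, topic_word), where
    topic_word[k][v] counts how often word v was assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start by assigning every token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic given all the other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, topic_word


def top_keywords(vocab, topic_word, n_keywords=10, drop=("singapore",)):
    """Take the top n_keywords per topic, then remove the words in drop."""
    clusters = []
    for counts in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda v: counts[v], reverse=True)
        top = [vocab[v] for v in ranked[:n_keywords] if counts[v] > 0]
        clusters.append([w for w in top if w not in drop])
    return clusters


# Toy stand-ins for cleaned, tokenised speeches (hypothetical data).
toy_docs = [
    "singapore flats hdb leases flats old government".split(),
    "preschool teachers children moe preschool early".split(),
    "singapore china sea asean countries president".split(),
    "cpf government help cpf singapore year".split(),
]

vocab, topic_word = fit_lda(toy_docs, n_topics=5)
for cluster in top_keywords(vocab, topic_word):
    print(cluster)
```

On a corpus this small the clusters will not mean much. The point is only to show the shape of the pipeline: fit the topics, rank the words in each topic, keep the top ten, then drop “singapore”.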

Selected topic clusters by year

2008: [work children new people women quite good want life]
2009: [religious government time know harmony problem think new group]
2010: [line foreign good need think workers students new train]
2011: [special help needs education people good schools support need maybe]
2012: [children people work time students home just singapore year old]
2013: [people make singaporeans community new time day government healthcare]
2014: [jurong gardens lake garden good just cpf government year help]
2015: [singapore years sg50 flats old hdb did home children parents]
2016: [china sea people south countries asean president race friends]
2017: [preschool good teachers children moe early child years start career]
2018: [years flats hip old new flat hdb government leases time]

As you can see, some years’ clusters make more sense than others. However, if you read through the transcripts for each year’s rally, you’d find that the clusters of words more or less sum the topics up.

Predictions for National Day Rally 2019

The wordclouds and modelled topics are indicative of each year’s concerns. As such, for fun, we’d also like to make predictions on what to expect from this year’s speech on the 18th. We predict that the following ideas and terms will appear on our wordcloud, besides the obviously consistent “one”, “Singapore”, and “will”:

Merdeka/Pioneer
CPF
Coding/Data science
Bicentennial
Entrepreneurship
Resilience

One of the things we could have done was to see if there was a correlation between a year's National Day message and the subsequent National Day Rally. However, due to a lack of time, that is something to be explored in the near future.

Conclusion

In this 5-week project (done in the evenings), we successfully obtained our data, performed data cleaning, and applied three different NLP techniques to extract interesting insights. To the best of our knowledge, this is the first time data analytics has been used to analyze the National Day Rally speeches over the years. You can check out the analysis in the Google Colab notebook.

P.S. Check back again after this year’s National Day Rally (18th August) to see if we nailed our (non-quantitative) predictions. We will be updating this space.

Liked what you saw in our projects and wish you could do it too? You definitely can. Embark on cool projects with UpLevel, guided by mentors over 4 to 6 weeks. Head over to https://discover.uplevel.work for more details!