Yes, this great East African country has been among the countries with the best media freedom conditions in the world. — Daily News (government-owned Tanzanian newspaper) in an article with no byline

It seems that the media will never be free from the clutches of the government. The laws imposed on the media are strict and makes it practically impossible for journalists to do their job. — The Citizen (foreign-owned Tanzanian newspaper) in an article with no byline

If you find yourself in Tanzania and can’t read the headlines above, hakuna matata. Major publishers here offer both English and Swahili language dailies. Last summer, as media coverage ramped up around the now-concluded general election, I started to wonder how much it matters that I, an English-speaking foreigner, read the news in a different language than most Tanzanians. Were there differences in coverage in English vs. Swahili, or between publishing houses? Were foreigners getting the same news as locals, and could I figure out a way to measure the difference?

Analysis of almost 9000 articles showed that election coverage was the primary focus of Tanzanian print media this fall, especially in Swahili. Unsurprisingly, the ruling party and topics favorable to the Tanzanian government received the most coverage in government-owned publications. I didn’t find an obvious difference in which topics were covered in English vs. Swahili, but election-related terms appeared more often in Swahili. I think this consistency between languages reflects well on the Tanzanian media. Experiments with topic modeling highlighted some unexpected aspects of the election coverage, including a dominant focus on political promises and electoral policy and procedures.

What election?

On October 25 Tanzania held its 5th general election since 1992, electing John Magufuli as President by a 58% majority. Dr. Magufuli represented the ruling party CCM (Chama Cha Mapinduzi), and defeated Edward Lowassa, the candidate for opposition party CHADEMA (Chama cha Demokrasia na Maendeleo). It was a lively campaign but CCM’s victory was expected, and the transition of power from former president Jakaya Kikwete was smooth and generally peaceful.

A wrinkle worth mentioning happened in Zanzibar, which runs its own presidential election in parallel with the mainland. Amid rumors of an opposition victory the CCM-dominated Zanzibar Electoral Commission (ZEC) annulled the election result on October 28, citing irregularities. The move was widely criticized and left Zanzibar in an uncertain political situation that is still unresolved.

Three publishers, six newspapers

Tanzania has an active media landscape including over 30 print publications that vary widely in scope and quality. Conveniently the three most prominent publishers each produce a daily newspaper in English and Swahili, so these six papers were an obvious choice for this project. The Kenya-based Nation Media Group publishes The Citizen (EN) and Mwananchi (SW), Tanzanian-owned IPP Media publishes The Guardian (EN) and Nipashe (SW), and government-owned TSN (Tanzanian Standard Newspapers) publishes the Daily News (EN) and Habarileo (SW). Here they are in a table:

Publisher       English       Swahili
Nation Media    Citizen       Mwananchi
IPP Media       Guardian      Nipashe
TSN             Daily News    Habarileo

The setup

On September 14, 2015 I set up scrapers to download all articles linked from the front page of these six newspapers’ websites. The scrapers were built in Python with scrapy and ran twice daily, with deduplication done in post-processing. The election was held on October 25, and by mid-November the scrapers had accumulated almost 9000 articles: about 4900 in English and 3900 in Swahili. I used Python to clean and organize the scraped articles, and looked at the data in three ways:

Counted terms of interest by language and publication

Manually compared selected English and Swahili headlines

Experimented with topic modeling using Latent Dirichlet Allocation (LDA) on the English articles
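Because the scrapers ran twice daily, the same front-page article would be captured more than once, which is why deduplication was needed in post-processing. The sketch below shows one way that step could look. It is not the code used for this project: the `url` and `text` field names and the sample records are assumptions made for illustration.

```python
import hashlib

def article_key(article):
    """Build a key that is identical for repeat scrapes of the same article."""
    # Collapse whitespace and case so trivial formatting changes still match.
    normalized = " ".join(article["text"].split()).lower()
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return (article["url"], digest)

def deduplicate(articles):
    """Keep the first copy of each article, preserving scrape order."""
    seen = set()
    unique = []
    for article in articles:
        key = article_key(article)
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique

# Hypothetical records: the first two are the same article scraped twice.
scraped = [
    {"url": "http://example.com/a", "text": "Magufuli  wins election"},
    {"url": "http://example.com/a", "text": "Magufuli wins election"},
    {"url": "http://example.com/b", "text": "Observers hail peaceful polls"},
]
unique_articles = deduplicate(scraped)
```

Keying on both the URL and a hash of the text means an article that is edited after publication would be kept as a separate record; keying on the URL alone would be the stricter choice.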

Counting terms

I selected some key terms related to the election, plus a few more general terms for comparison, and counted the number of times they were mentioned over two months from September 15 - November 15.
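As a rough illustration of this kind of counting, the sketch below tallies case-insensitive whole-word matches per publication. The record fields (`publication`, `text`), the shortened term list and the sample sentences are assumptions for the example rather than the project's actual code or data. Whole-word matching will not count inflected forms such as "elections".

```python
import re
from collections import Counter

# Terms of interest; English and Swahili spellings share one label.
TERMS = {
    "Magufuli": ["magufuli"],
    "CCM": ["ccm"],
    "election/uchaguzi": ["election", "uchaguzi"],
}

def compile_patterns(terms):
    """Compile one case-insensitive whole-word regex per label."""
    patterns = {}
    for label, words in terms.items():
        alternatives = "|".join(re.escape(word) for word in words)
        patterns[label] = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return patterns

def count_terms(articles, patterns):
    """Return a Counter keyed by (publication, label) with total mentions."""
    counts = Counter()
    for article in articles:
        for label, pattern in patterns.items():
            matches = pattern.findall(article["text"])
            counts[(article["publication"], label)] += len(matches)
    return counts

# Made-up sample articles, one per language.
articles = [
    {"publication": "Citizen",
     "text": "Magufuli and CCM open the election campaign. CCM rallies continue."},
    {"publication": "Mwananchi",
     "text": "Uchaguzi mkuu: Magufuli azindua kampeni za CCM."},
]
counts = count_terms(articles, compile_patterns(TERMS))
```

Adding the scrape date to the Counter key would give the daily counts instead of the two-month totals.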

[Chart: daily counts for Magufuli, Lowassa, CCM, UKAWA, CHADEMA, Zanzibar, election (uchaguzi), corruption (rushwa), sports (michezo) and Kenya]

Election-related terms were consistently mentioned more in Swahili than English, often by about a factor of 2, and absolute counts were high across all publications. Ruling party CCM and candidate [John] Magufuli were mentioned more than opposition party CHADEMA/UKAWA and candidate [Edward] Lowassa. This was true for all publications, but the difference was largest for the TSN (government-owned) papers Daily News and Habarileo. For example, Magufuli was mentioned 3.6 times as often as Lowassa by TSN, vs 2.2 times as often in IPP Media publications. TSN mentioned CCM 1.8 times as often as CHADEMA and UKAWA combined. TSN’s coverage of Zanzibar stands out as the only case where an election-related term was mentioned more in English (Daily News) than Swahili (Habarileo).

Nation Media (Kenya-owned)

Term                 Citizen  Mwananchi  Total  Citizen:Mwananchi
Magufuli                 644       1164   1808               0.55
Lowassa                  457        928   1385               0.49
CCM                     1101       2007   3108               0.55
UKAWA                    292        596    888               0.49
CHADEMA                  614       1087   1701               0.56
Zanzibar                 425        578   1003               0.74
election/uchaguzi*      1322       2259   3581               0.59
corruption/rushwa*       237        171    408               1.39
sports/michezo*          126        138    264               0.91
Kenya                    345        127    472               2.72

IPP Media (TZ independent)

Term                 Guardian  Nipashe  Total  Guardian:Nipashe
Magufuli                  326     1099   1860              0.30
Lowassa                   267      683    853              0.39
CCM                       733     1828   2801              0.40
UKAWA                     172      433    574              0.40
CHADEMA                   467     1007   1334              0.46
Zanzibar                  501      744   1266              0.67
election/uchaguzi*        889     2134   3176              0.42
corruption/rushwa*        128      169    322              0.76
sports/michezo*           143      114    454              1.25
Kenya                     170       68    338              2.50

TSN (TZ government)

Term                 Daily News  Habarileo  Total  Daily News:Habarileo
Magufuli                    761       1205   1966                  0.63
Lowassa                     170        371    541                  0.46
CCM                         973       1013   1986                  0.96
UKAWA                       141        169    310                  0.83
CHADEMA                     327        484    811                  0.68
Zanzibar                    522        459    981                  1.14
election/uchaguzi*         1042       1897   2939                  0.55
corruption/rushwa*          153        217    370                  0.71
sports/michezo*             340        313    653                  1.09
Kenya                       270        183    453                  1.48

EN/SW totals

Term                 Total (EN)  Total (SW)  Total (EN):Total (SW)
Magufuli                   1731        3468                   0.50
Lowassa                     894        1982                   0.45
CCM                        2807        4848                   0.58
UKAWA                       605        1198                   0.51
CHADEMA                    1408        2578                   0.55
Zanzibar                   1448        1781                   0.81
election/uchaguzi*         3253        6290                   0.52
corruption/rushwa*          518         557                   0.93
sports/michezo*             609         565                   1.08
Kenya                       785         378                   2.08
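The ratios quoted in the text come straight from these counts. As a worked check, the sketch below recomputes the TSN comparisons and a few of the EN:SW ratios from the figures above. The variable names and the `ratio` helper are mine, introduced only for this example.

```python
# Counts copied from the TSN and EN/SW totals tables above.
tsn = {"Magufuli": 1966, "Lowassa": 541, "CCM": 1986, "UKAWA": 310, "CHADEMA": 811}
totals_en = {"Magufuli": 1731, "Zanzibar": 1448, "Kenya": 785}
totals_sw = {"Magufuli": 3468, "Zanzibar": 1781, "Kenya": 378}

def ratio(numerator, denominator):
    """Return numerator / denominator rounded to two decimal places."""
    return round(numerator / denominator, 2)

# Magufuli vs Lowassa mentions in the TSN papers: about 3.6 to 1.
magufuli_vs_lowassa = ratio(tsn["Magufuli"], tsn["Lowassa"])

# CCM vs the opposition (CHADEMA and UKAWA combined) in TSN: about 1.8 to 1.
ccm_vs_opposition = ratio(tsn["CCM"], tsn["CHADEMA"] + tsn["UKAWA"])

# English-to-Swahili ratios; values below 1 mean more mentions in Swahili.
en_to_sw = {term: ratio(totals_en[term], totals_sw[term]) for term in totals_en}
```

An EN:SW ratio near 0.5, as for Magufuli, is what "about twice as often in Swahili" looks like in the tables, while Kenya at 2.08 runs the other way.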

Some academic studies of media bias normalize term counts, for example as counts per 10,000 words or as a fraction of words published. I briefly played with these techniques and didn’t find them useful for highlighting trends between publications, especially on days when few articles were published, so the tables and charts here use absolute counts.
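For reference, the per-10,000-words normalization is a one-line calculation. The sketch below uses made-up daily figures, not values from the dataset, to show why it gets noisy when little is published.

```python
def per_10k_words(term_count, word_count):
    """Scale a raw term count to mentions per 10,000 words published."""
    if word_count == 0:
        return 0.0  # nothing published, so no meaningful rate
    return term_count * 10_000 / word_count

# Hypothetical daily figures: (term mentions, total words published).
busy_day = per_10k_words(40, 25_000)   # many articles published
quiet_day = per_10k_words(3, 600)      # only a couple of short articles
```

In this invented example the quiet day, with only 3 mentions, ends up with a rate roughly three times higher than the busy day with 40, which is the kind of distortion that made absolute counts easier to read here.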

Reading the headlines

Manually reading headlines around controversial events helps put these term counts in context. It’s not quantitative, but coverage of political scandals can be more revealing of a publication’s editorial bias than topic selection. Headlines are also how many people in Tanzania get their news: the photo at the top of this post is a common scene.

The Zanzibar election annulment on October 28 is a good example because it’s a discrete, high-profile and polarizing event. Ben Taylor (@mtega), a blogger and consultant with TZ civil society organization Twaweza, graciously helped with translations from Swahili to English. The list below shows headlines from all articles scraped on October 30 (2 days after the annulment) that mention Zanzibar. There’s unavoidable subjectivity in interpreting them, but two trends stand out.

First, consistent with the word counts above, TSN is pro-government and pro-CCM, IPP Media is pro-opposition and Nation is more centrist. This isn’t surprising (government-owned media supports the government, foreign-owned publisher less opinionated, nobody shocked), but it does make for colorful comparisons. Two days after a major story the Guardian calls Zanzibar a “sure cause for worry” and Nipashe speculates on what will happen if Zanzibar becomes violent, while the Daily News hails “Peaceful Elections” and Habarileo runs a fluffy piece about Zanzibari architecture. You’ll see a similar pattern most days.

Citizen

Congrats, Dr Magufuli; we now must move on

CUF: No need to conduct fresh polls in Zbar

EAC releases poll result report

Forge a democratic Zanzibar, Moyo tells CCM

Peace resumes after tense 3 days in Zanzibar

Pressure mounts on ZEC to reverse decision on polls

Take leadership responsibility, Maalim Seif asks Kikwete, Shein

What it’ll cost to repeat Zbar polls

Mwananchi *

Congratulations Dr. Magufuli, take this into account

Lowassa contests Dr Magufuli presidency

Maalim Seif orders Jakaya Kikwete and Shein to stand for peace

Magufuli 2015

Observers put pressure on ZEC

Guardian

Final Whistle: Magufuli is President

Situation in Zanzibar sure cause for worry

Six Zanzibar presidential candidates fault ZEC on election results rulling

Status of tuna fisheries in Tanzania under spotlight

UK, observers call upon ZEC to resume tabulation process

Nipashe *

Maalim Seif: What ZEC has done is a revolution

Nine hard questions on the Zanzibar elections

Nine houses burned down on Zanzibar while Magufuli is announced as the winner

ZEC will take the blame if Zanzibar becomes violent

Daily News

Businesses open in Isles after a weeks lull

Let us all give Magufuli full support

Observers hail Tanzania over peaceful polls

PBZ posts 2.19bn/- profit for July-Sept

Peaceful elections? The people have shown the way!

Tanzania Postal Bank makes 1.88bn/- profit in Q3

TSA picks constitutional amendment committee

Habarileo *

Election challenges should be settled peacefully

Human rights organization wants to assist ZEC

Stone town's valued architectural art

Quiet returns to Zanzibar

Second, and more positive, is that for each publisher there doesn’t appear to be a major difference between their coverage of important current events in English vs Swahili. If anything Swahili headlines are more emotionally charged. In the example above Nipashe discusses houses burning, violence, revolution (though the word has less volatile connotations in Swahili) and asks hard questions. It’s impressive to see strong dissenting viewpoints in a major local language publication.

Topic modeling

Word counts turned out to be a simple if rough way to quantify topic coverage, but counts can’t incorporate word sense or context. Latent Dirichlet Allocation (LDA) is a computational technique for discovering groups of words that represent topics covered by a collection of documents. It is often applied to find topics in large, unstructured texts, for example Sarah Palin’s leaked emails in 2011 (this page also links to a good general discussion of LDA). In the end it wasn’t especially useful here, but it is worth including because it highlighted two aspects of the overall election coverage that I didn’t expect.

I ran LDA on the English-language articles, n = 4935. I used NLTK and Gensim to clean the text (lowercase, remove punctuation, whitespace and stop words, and identify common bigrams), and then ran Gensim’s LDA implementation with k = 100. k is the LDA parameter that sets the number of topics and is often chosen heuristically. I then manually reviewed each topic and assigned it a label. For example a topic including these terms:

players, stars, team, taifa_stars, tournament, mkwasa, tanzania, dar_es, teams, match

was labeled “sports”. I then used the LDA model to assign these labels to articles based on the most strongly represented topics in each article.
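To make the cleaning and labeling steps concrete, here is a simplified, standard-library-only sketch. It stands in for the NLTK and Gensim steps rather than reproducing them: the stop word list is a tiny placeholder, the bigram rule is a plain frequency threshold (Gensim scores phrases differently), and the topic ids, weights and thresholds in `label_article` are hypothetical values of the kind a fitted LDA model would supply.

```python
import string
from collections import Counter

STOP_WORDS = {"the", "a", "in", "of", "and", "to", "for"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOP_WORDS]

def merge_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least min_count times, e.g. taifa_stars."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

def label_article(topic_weights, labels, min_weight=0.2, top_n=3):
    """Return labels for the strongest topics in one article's topic mixture."""
    ranked = sorted(topic_weights, key=lambda pair: pair[1], reverse=True)
    return [labels[topic] for topic, weight in ranked[:top_n]
            if weight >= min_weight and topic in labels]

# Made-up sample text and topic mixture for illustration.
raw = ["Taifa Stars win the match.",
       "Mkwasa names Taifa Stars team for the tournament."]
docs = merge_bigrams([tokenize(text) for text in raw])

labels = {7: "sports", 12: "political promises"}   # hypothetical topic ids
weights = [(12, 0.05), (7, 0.81), (3, 0.14)]       # (topic id, weight) pairs
article_labels = label_article(weights, labels)
```

The `min_weight` cutoff is a judgment call: set too low, every article picks up several weakly related labels; set too high, many articles get none.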

Unfortunately most of the topics discovered by LDA, at least at my level of skill with the technique, were too general (e.g. “wildlife”) or too specific (e.g. “stampede during the Hajj”) to help identify editorial differences. Still, two stand out as noteworthy.

The most frequently occurring topic in Daily News and Habarileo, and second most frequent overall, was labeled “political promises”. It looks like this:

government, would, people, dr_magufuli, ensure, country, residents, water, promised, area

Many articles strongly represented by this topic have headlines like:

Lowassa: I’ll make Tanzania land of milk, honey

Dr Shein vows to uphold Union as CCM launches campaign in Zanzibar

Tanzanians like to see real development, says Magufuli

This is neat because LDA turned out to be good at unsupervised identification of political promises, and a little surprising because most Tanzanians don’t have faith in politicians’ promises. Apparently they still like to read about them.

Another interesting topic I labeled “election process”. Its terms include:

election, nec (National Electoral Commission), vote, people, political_parties, polling_stations, peace, country, campaigns

and some typical headlines:

Why NEC needs to act fairly to all

General Election campaigns should be smooth, peaceful

IGP warns parties’ security groups over ‘grabbing’ of powers of police

NEC working on problems in voter registration

This topic was among the top 10 for each publication, and 4th most common overall. There’s room for interpretation, but I think this shows a media focus on the procedures and mechanics of the election. It suggests a lively interest in the electoral process from a young democracy during its 5th multi-party general election.
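Rankings like "top 10 for each publication" and "4th most common overall" can be produced by tallying the labels assigned to each article. The sketch below shows one way to do that, assuming each article has already been given a list of topic labels. The sample data is invented and far smaller than the real collection.

```python
from collections import Counter, defaultdict

def rank_topics(labeled_articles):
    """Count articles per topic label, overall and per publication."""
    overall = Counter()
    by_publication = defaultdict(Counter)
    for publication, labels in labeled_articles:
        for label in set(labels):  # count each label once per article
            overall[label] += 1
            by_publication[publication][label] += 1
    return overall, by_publication

# Hypothetical labeled articles: (publication, labels assigned to the article).
labeled = [
    ("Daily News", ["political promises"]),
    ("Daily News", ["political promises"]),
    ("Citizen", ["election process"]),
    ("Guardian", ["sports"]),
    ("Citizen", ["election process", "political promises"]),
]
overall, by_publication = rank_topics(labeled)

# Topics ordered from most to least common across all publications.
top_overall = [label for label, _ in overall.most_common(2)]
top_daily_news = by_publication["Daily News"].most_common(1)
```

Counting articles rather than summing topic weights keeps the ranking easy to explain, at the cost of treating a faint topic match the same as a strong one.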

Wrapping up

When I started this project I thought it would be an opportunity to learn more about algorithms from text analytics, but simple tools ended up being able to identify high-level trends. While the analysis surfaced some interesting features of the election coverage, overall topic coverage in English and Swahili publications seems similar. I don’t think it would be valuable to probe for more subtle differences with computational techniques alone.

Many academic studies of media bias use human labeling to supplement results from LDA and other machine learning approaches. If I were to take this project further, for example to examine what topic coverage is associated with the high counts of election-related words in Swahili, I would start by having human readers label articles. I’d only look to topic models or machine classification if trends were still unclear, or I wanted to try and generalize the results to new articles.

Data

Want to go deeper? Lonely on a Tuesday? The data’s online! You can download the original articles in json format or check out the scrapers on github. The brave might peruse my idiosyncratic Python scripts for data cleaning, including a Jupyter notebook with the LDA experiments.

Thanks!

Ben Taylor for context, translation and thoughtful feedback. Josh Levens, Jessica Padron and Daniel Waistell for help with translation. Angela Ambroz, Mike Dewar, David Feldman, Kelly Hamblin, and Ashely Price for useful discussions. Kelly and Jennifer Hamblin contributed photos.