One of them is “20th-century physicists.” That belongs to the category “Physicists by century,” which belongs to “Scientists by century,” and so on through “People by occupation and century,” “People by century,” “People by time,” and finally… “People.” Bingo! It seemed to work for creative works too — e.g. Moby-Dick belonged to “1851 American novels” which, after a number of hops, eventually belonged to “Books”:

Wikipedia articles belong to categories which can be traversed upwards to desired “root” categories

All I needed to do was identify all the subcategories underneath certain top-level categories such as “People” and “Books”, and I’d have the cultural items I wanted. Excitedly, I added code to my script to record in my database the categories each article belonged to (including the categories each “category article” belonged to, so I had the full tree). Then, to check the results, I wrote a query to run the hierarchy in reverse and find all the articles covered by subcategories of “People” a certain number of levels deep. I hit “execute”, and…
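
Roughly speaking, that reverse traversal amounts to something like the following sketch. It assumes a hypothetical category_links table with child_title and parent_category columns, which is my invention rather than the actual schema or script:

```python
import sqlite3

# Minimal sketch of the reverse traversal. Assumes a hypothetical table
# category_links(child_title TEXT, parent_category TEXT), where child_title
# may be either a plain article or another "Category:..." page.
def articles_under(conn, root_category, max_depth):
    """Walk downward from root_category, collecting plain article titles."""
    frontier = {root_category}
    seen = set(frontier)
    articles = set()
    for _ in range(max_depth):
        if not frontier:
            break
        placeholders = ",".join("?" * len(frontier))
        rows = conn.execute(
            "SELECT child_title FROM category_links"
            f" WHERE parent_category IN ({placeholders})",
            tuple(frontier),
        )
        next_frontier = set()
        for (child,) in rows:
            if child.startswith("Category:"):
                if child not in seen:
                    seen.add(child)
                    next_frontier.add(child)
            else:
                articles.add(child)
        frontier = next_frontier
    return articles

conn = sqlite3.connect("wikipedia.db")
people = articles_under(conn, "Category:People", max_depth=8)
```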

…it was a disaster. Every kind of article was popping up! For example, it listed “Apple” the company, which was certainly not a person. Digging in, it turned out “Apple” belonged to the category “Steve Jobs,” which eventually belonged to… “People,” of course. It turned out Wikipedia categories aren’t strictly hierarchical at all, but are used for so many “related” things as to make them useless for determining what kind of thing an article represented. Deflated, I wondered: was there another way?

Infoboxes

I remembered that in the top-right corner of many Wikipedia articles, there’s a box of information, usually with a photo… an “infobox” in Wikipedia parlance, it turned out. And looking at the Wikitext of each article, each infobox seemed to belong to a clearly defined category as well. Steve Jobs had a “person” infobox, Apple had a “company” infobox, and Moby-Dick had a “book” infobox. Promising… how many infoboxes were there? More than 1,500, I discovered. Could it work?

Infoboxes are on the right-hand side and can contain a title, photo, and standardized fields. The article for Albert Einstein uses the “scientist” infobox.

I manually made a list of the couple-hundred infoboxes referring to people (“person”, “NFL biography”, “Christian leader”, etc.), and found the couple-dozen infoboxes for the creative works I wanted (“film”, “book”, “album”, etc.). I wrote some further simple string processing to locate the infobox (if present) in the Wikitext of each article and extract its name, mapped it to my manual list of desired categories, and then saved it to my database.
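
To make the string processing concrete, here’s a rough sketch of the kind of extraction involved; the regex and the tiny infobox-to-category map are illustrative stand-ins, not the actual lists:

```python
import re

# Illustrative subset of the hand-made mapping from infobox template names
# (lowercased) to item categories; the real list covered a couple hundred
# "people" infoboxes and a couple dozen for creative works.
INFOBOX_CATEGORIES = {
    "person": "person",
    "nfl biography": "person",
    "christian leader": "person",
    "film": "film",
    "book": "book",
    "album": "album",
}

# Matches "{{Infobox <name>" in an article's Wikitext and captures the name.
INFOBOX_RE = re.compile(r"\{\{\s*Infobox\s+([^|}\n]+)", re.IGNORECASE)

def item_category(wikitext):
    """Return the mapped category of the first infobox in the article, if any."""
    match = INFOBOX_RE.search(wikitext)
    if not match:
        return None
    name = match.group(1).strip().lower()
    return INFOBOX_CATEGORIES.get(name)
```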

The good news: it did work! The bad news: it needed a lot of further tweaks. While virtually all creative works have infoboxes, most of the “long tail” of people in Wikipedia don’t, so I had to find another way of identifying them (solution: detect string matches for birth-year categories like “1923 births”). Some articles have an infobox not for the main topic but for something related later in the page (e.g. an author without an infobox of their own, but with an infobox for one of their books further down), so I needed to limit myself to infoboxes at the start of an article. But sometimes valid text comes before the infobox (e.g. various types of headers), so I had to develop various heuristics to find infoboxes that “count” (roughly sketched after the list below). But in the end, I successfully wound up with:

1,500,000 people both current and historical (also including groups like musical artists)

200,000 albums and songs, 120,000 movies, 40,000 TV shows, and 20,000 video games

40,000 books and short stories, 9,000 comics, anime and manga, and 5,000 plays, musicals and operas

6,000 artworks (such as paintings and sculptures), and 1,500 compositions (such as symphonies)
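
(For the curious, the fallback and “leading infobox” heuristics mentioned above might look roughly like this; the birth-year regex is straightforward, while the 2,000-character allowance for text before the infobox is an invented placeholder for whatever heuristics were actually used.)

```python
import re

# Fallback for people without an infobox: a birth-year category such as
# "[[Category:1923 births]]" is a strong signal the article is about a person.
BIRTH_YEAR_RE = re.compile(r"\[\[Category:\s*\d{3,4} births\s*\]\]")

def is_person_without_infobox(wikitext):
    return bool(BIRTH_YEAR_RE.search(wikitext))

# Only count an infobox that appears near the top of the article; the
# max_offset allowance for headers preceding it is an invented placeholder.
INFOBOX_RE = re.compile(r"\{\{\s*Infobox\s+([^|}\n]+)", re.IGNORECASE)

def leading_infobox(wikitext, max_offset=2000):
    match = INFOBOX_RE.search(wikitext)
    if match and match.start() <= max_offset:
        return match.group(1).strip().lower()
    return None
```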

So I had my database of cultural items — but that was just the start.

Determining item popularity

Now that I had my database of nearly two million cultural items, I needed a way to determine their relative levels of popularity. After all, I couldn’t simply sample quiz items at random from all two million, since that would result in a quiz where hardly anyone would know any of the items. One of the keys to Test Your Vocab’s success was making the quiz adaptive — showing easier words to people with smaller vocabularies, and harder words to people with larger ones. How could I determine the level of difficulty of a Wikipedia article?

Import Wikipedia pageviews

Fortunately, along with the text of articles, Wikipedia makes another piece of data public: the number of times each article is viewed. Wikipedia provides raw pageviews data as separate downloadable files for each hour of traffic in Wikipedia’s history (split across two datasets, one covering 2007–2016 and its successor from 2015 onwards). But since each (compressed) file is about 50–100 MB, fully analyzing even one year would mean downloading something like 600 GB across close to 9,000 files. Yikes! Fortunately, some further digging revealed that starting in 2011, Wikipedia processed these into monthly files — phew. So I wrote a second script to download a range of months of pageviews files and import the lines into a new database table.
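
The import step looks roughly like the sketch below, with an invented table schema. The line format assumed here (“domain page_title view_count …”) matches the hourly dumps; the monthly aggregates may differ slightly in detail:

```python
import bz2
import sqlite3

def import_pageviews_file(conn, path, year, month):
    """Parse one downloaded pageviews dump and load it into the database.

    Assumes lines of the form 'domain page_title view_count ...', as in the
    hourly dumps; the monthly files may use a slightly different format.
    """
    rows = []
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) < 3:
                continue
            domain, title, views = parts[0], parts[1], parts[2]
            if domain != "en" or not views.isdigit():
                continue  # keep English Wikipedia article views only
            rows.append((title, int(views), year, month))
    conn.executemany(
        "INSERT INTO pageviews (title, views, year, month) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()

conn = sqlite3.connect("wikipedia.db")
import_pageviews_file(conn, "pageviews-2019-06.bz2", 2019, 6)  # placeholder path
```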

It wasn’t quite that simple, however — pageviews files list article URLs, not article titles, and a single article can be listed under a wide variety of different URLs thanks to Wikipedia redirects, which cover things like alternate names, common misspellings, etc. So I added code to my original import script to create an additional table of Wikipedia redirects. With the new table, a simple join maps an article to the pageviews associated with all its redirects and sums them together to determine its popularity. And fortunately, Wikipedia doesn’t support redirects to redirects, so a single join is all that’s needed.
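
As a sketch, that join can be expressed in a single query. The table and column names here are my own invention; COALESCE maps each pageview row to its redirect target when one exists, and to itself otherwise:

```python
import sqlite3

conn = sqlite3.connect("wikipedia.db")

# Assumed tables: pageviews(title, views, ...) and redirects(source_title,
# target_title), with at most one hop per redirect (no redirect chains).
TOTAL_VIEWS_SQL = """
SELECT COALESCE(r.target_title, p.title) AS canonical_title,
       SUM(p.views)                      AS total_views
FROM pageviews AS p
LEFT JOIN redirects AS r
       ON r.source_title = p.title
GROUP BY canonical_title
"""

popularity = dict(conn.execute(TOTAL_VIEWS_SQL).fetchall())
```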

A selection of redirects leading to “Lady Gaga”. To accurately measure article popularity, traffic must be combined from all possible redirects

(There is the minor detail that a redirect now isn’t necessarily the same as a redirect in the past, and articles can be renamed — so historical traffic could wind up getting sent to the wrong present-day article, or being lost entirely when an article turns into a disambiguation page, for example. But it’s a relatively rare occurrence, and something I’ll just have to accept for now.)

Improving pageviews quality

Initial results were encouraging: at a glance, pageviews really did seem to reflect the cultural popularity of items. It passed the smell test. But it still needed some tweaks.

Initially I tried ranking items by their current popularity (the most recent full month) — but this made the list feel too current, giving cinema blockbusters and musical artists of the moment far too much importance. Experimenting with different time ranges, I settled on averaging the popularity of items over the previous five years, which seemed to strike the right balance between making cultural items feel durable and still changing with the times. (I also had to normalize pageviews for any given month by the month’s total, so that Wikipedia’s overall growing traffic wouldn’t give stronger weight to more recent months.)

But there were still a few oddball items with unexpectedly high scores. Digging into the data, I saw that certain articles would have unexpectedly massive spikes in traffic for a month or two, and then level off again (sometimes clearly due to a news event, other times with no obvious explanation). So to remove outliers, for each article I ignored the 5% of months with the highest traffic.
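
Putting these adjustments together, the scoring might be sketched roughly like this (the data layout and the exact windowing are assumptions on my part):

```python
def popularity_score(article_views, total_views_by_month, months, trim=0.05):
    """Average an article's normalized monthly traffic over a window of months.

    article_views and total_views_by_month are dicts keyed by (year, month).
    Each month is normalized by Wikipedia's total traffic that month, and the
    top `trim` fraction of months is dropped to discount one-off spikes.
    """
    shares = sorted(
        article_views.get(ym, 0) / total_views_by_month[ym]
        for ym in months
        if total_views_by_month.get(ym)
    )
    if not shares:
        return 0.0
    keep = shares[: max(1, int(len(shares) * (1 - trim)))]
    return sum(keep) / len(keep)

# e.g. a window covering the previous five years:
# months = [(y, m) for y in range(2015, 2020) for m in range(1, 13)]
```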

With these strategies combined, the list felt intuitively right and I was on my way. The top 5 items were:

1. Donald Trump
2. Game of Thrones
3. Elizabeth II
4. Barack Obama
5. Cristiano Ronaldo

But already from this list, I found myself questioning which culture I would be measuring. American? British? Or even international, when it comes to soccer (football)?

Test whose culture?

I was basing my measure of cultural popularity on English-language Wikipedia pageviews. When I looked up traffic by country, the majority was American. But there was still significant traffic from the UK and India, which explained the presence of a bit more British royalty and Bollywood stars than Americans would be accustomed to — as well as footballers and mixed martial artists.