Disclaimer: I originally published this article on my old blog. I’ve since taken that blog down for legal reasons which my lawyers tell me not to discuss, but kept the articles so that I could repurpose the more popular ones here on our Medium. This was one of my highest performing articles, so I figured I’d update it and publish it here.

My Rationale

In building BandNameOrigins.com, I needed to seed the database with a wide set of musical artists, so that the site launched with a rich base of content that would encourage users to participate in the community and contribute more of their own.

In the past, when I’ve wondered where a band’s name came from, I’ve gone to Wikipedia. Most band name origins are there, but they tend to be buried deep within the page, at inconsistent locations and with inconsistent keywords.

Wikipedia is also probably the broadest single well-organized set of musical artists likely to have information online about where their band names came from.

To top it all off, Wikipedia, as part of its grandiose goal of organizing all human knowledge and making it available to everybody for free, exports its entire database for anybody to use, free of charge.

So our goal here is to export the list of all musical artists from Wikipedia, for two purposes. First, it will seed our database with a fairly large and comprehensive set of popular bands; second, the actual text of the Wikipedia pages can be fed into an annotation tool to extract the origin story whenever the information exists on the page.

Getting the Data

Exports of English Wikipedia data are periodically snapshotted and distributed here, via one of the only legitimate uses of the BitTorrent protocol. We want the latest version of the “pages-articles-multistream.xml.bz2” file, which contains the current versions of all pages on Wikipedia, excluding talk and user pages.

These files are large, so I found it easiest to do all this work on a rented virtual instance on Google Compute Engine. You could work on AWS or your local machine if you want, but this article is written in the context of a completely fresh Debian box.

I downloaded the file via a command-line torrent client:



# install node and webtorrent
curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
sudo apt-get install -y nodejs
sudo npm install webtorrent-cli -g

# download and unzip the export
webtorrent http://itorrents.org/torrent/09EED43E8A9C5E086F3728F66F986CFE3B9FD0DD.torrent
bzip2 -dk enwiki-20170820-pages-articles-multistream.xml.bz2

It downloaded in under 10 minutes. Not bad for the sum total of all human knowledge. We live in a truly amazing time!!

Unzipping the file, on the other hand, took a lot longer: a little over an hour for me. If you want a sense of progress, you can run a watch script to check the size of the uncompressed file, which will eventually reach about 59 gigs.

watch ls -lah enwiki-20170820-pages-articles-multistream.xml

Categorizing Every Concept in the Universe

Wikipedia pages are organized into categories, which is helpful, but unfortunately, the categories themselves are very open-ended. It’s the classic dilemma of how to categorize literally everything in the world, and it’s probably one of the things that makes us human, but I digress.

In essence, Wikipedia models categories as a directed graph of category and subcategory relationships. All in, there are a total of 1,652,623 different categories. Considering there are about 6 million articles, that’s about 1 category for every 4 articles. The amount of data contained in that categorization graph is astounding.

If you visit a Wikipedia page, you’ll notice a list of categories at the bottom. These are the categories the article itself belongs to, and they tend to be hyper-specific, for example, “Musical groups disestablished in 2014”.

It is ultimately a way of connecting the data of the world, and it’s a topic I’d be very interested in one day exploring further, but we have a very specific task at hand, so we’ll constrain ourselves.

None of the categories listed satisfies my goal, which I should define more precisely. The universe of things I’m interested in is whatever falls under the “Artist” column on Spotify. It’s not entirely clear what best captures that, because sometimes we are referring to a Musical Group (or Band), and sometimes we are referring to a Musician.

Speaking generally, though, the category Musician also includes individual musicians who only release music as part of a larger Musical Group. Ideally, we’d find a single top-level category that can be crawled downward to expand into our entire universe. Alternatively, we may find a small set of top-level categories worth crawling.

Clicking through the categories brings you to a few different types of pages, none of which thoroughly explains the criteria for inclusion, but most are fairly self-explanatory.

Using the export, we can also explore the category graph a little more directly, to figure out exactly what we’re looking for here.

The XML Export File

This XML dump contains only pages, a term that broadly encompasses a number of page types on Wikipedia, including articles and categories. Each page has some limited metadata, and then consists of wikitext, which can be parsed to yield all sorts of interesting properties.

We don’t need a full-on wikitext parser for our purposes, since we only really want to extract category relationships. The way categories are expressed in wikitext is with a tag like `[[Category:Anti-capitalism]]`, which we should be able to extract using a simple regex. (Full documentation here).
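For instance, a simple pattern like the following should capture most category links. This is just a sketch (the sample text is made up), and it deliberately ignores edge cases like lowercase “category:” links:

import re

# Captures the category name from [[Category:Foo]] or [[Category:Foo|sort key]]
CATEGORY_RE = re.compile(r"\[\[Category:([^\]\|]+)")

sample = "...wikitext... [[Category:Anti-capitalism]] [[Category:Some category|sort key]]"
print(CATEGORY_RE.findall(sample))
# ['Anti-capitalism', 'Some category']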

The XML file, as we know, is quite large, so unless you have a massive computer to work with, parsing it all into memory is out of the question. XML has the nice property of being parseable iteratively, so we will use the Python library lxml to do so. First, grab any dependencies you might need:

sudo apt-get install python3-dev python3-pip libxml2-dev libxslt-dev

sudo pip3 install lxml

Since our goal right now is just to extract the category graph, we’ll output the extracted data into a simpler file for exploration, so that we can process it more quickly moving forward. The following Python script can do that:
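A minimal sketch of what such a script (extract-categories.py) needs to do, assuming the 0.10 export namespace used by 2017-era dumps and a tab-separated output format of my own choosing (title first, then that page’s categories):

import re
import sys
from lxml import etree

NS = "{http://www.mediawiki.org/xml/export-0.10/}"
CATEGORY_RE = re.compile(r"\[\[Category:([^\]\|]+)")

with open("categories.tsv", "w") as out:
    # iterparse streams one <page> element at a time instead of
    # loading the whole ~59 GB document into memory
    for _, page in etree.iterparse(sys.argv[1], tag=NS + "page"):
        title = page.findtext(NS + "title")
        text = page.findtext(NS + "revision/" + NS + "text") or ""
        cats = [c.strip() for c in CATEGORY_RE.findall(text)]
        if cats:
            out.write(title + "\t" + "\t".join(cats) + "\n")
        # release memory held by elements we've already processed
        page.clear()
        while page.getprevious() is not None:
            del page.getparent()[0]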

There are a total of 17,773,690 pages to scan through, so I recommend running this in the background and doing something else for a while (a little under an hour for me).

nohup python3 extract-categories.py enwiki.xml &> extract-categories.log &

Feel free to grab the file generated from the 8/20/17 export. (It’s big, 633M compressed, so make sure you can handle that.)

Exploring the Category Graph

Now that we have the metadata of the category graph extracted into a much smaller file (down to just 3 gigs from 59), we can go a step further and extract just the category -> subcategory relationships. From there, once we find our master set of top-level categories, we’ll be able to crawl down the hierarchy and enumerate our entire universe of pages.

There are a total of 1,411,252 unique categories referenced from pages, but a total of 1,626,013 categories with their own category page. That was unexpected. All in, there are 241,371 categories that contain no pages, and 26,610 referenced categories that don’t have an associated category page. It’s also worth noting that there are some category cycles, including cases in which a child and parent are clearly swapped by mistake. This is all to be expected with a dataset grown as organically as Wikipedia’s.

We can take this raw set of categories and produce the nodes and edges of a category metagraph with the following python code:
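A sketch of that step, assuming the categories.tsv format from the earlier sketch. Category pages yield parent -> subcategory edges, while regular pages just tell us which categories are referenced:

nodes = set()
edges = set()

with open("categories.tsv") as f:
    for line in f:
        title, *cats = line.rstrip("\n").split("\t")
        if title.startswith("Category:"):
            # a category page's own categories are its parents
            child = title[len("Category:"):]
            nodes.add(child)
            for parent in cats:
                nodes.add(parent)
                edges.add((parent, child))
        else:
            nodes.update(cats)

with open("category-graph.tsv", "w") as out:
    for parent, child in sorted(edges):
        out.write(parent + "\t" + child + "\n")

print(len(nodes), "nodes,", len(edges), "edges")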

This produces a graph with 1,652,623 nodes and 3,417,964 edges, which we can then process into a proper graph for analysis using the excellent python-graph library.

Feel free to grab the file generated from the 8/20/17 export. (75M compressed)

First, we’ll need to install the graph lib:



git clone https://github.com/Shoobx/python-graph
cd python-graph/
make install-core
make install-dot
cd core
python3 setup.py install
cd ../dot
python3 setup.py install

Then we can parse the graph file, and start to explore a bit.
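A minimal sketch of that load step, assuming the category-graph.tsv file written by the previous sketch:

from pygraph.classes.digraph import digraph

G = digraph()   # parent -> subcategory
GR = digraph()  # subcategory -> parent

with open("category-graph.tsv") as f:
    for line in f:
        parent, child = line.rstrip("\n").split("\t")
        # python-graph requires nodes to exist before edges reference them
        for node in (parent, child):
            if not G.has_node(node):
                G.add_node(node)
                GR.add_node(node)
        G.add_edge((parent, child))
        GR.add_edge((child, parent))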

We can then explore either the Parent-Child graph by examining G, or the Child-Parent graph by examining GR.

A fun game for exploration is to print out the hierarchy, either traversing from a child category up, or from a parent category down.
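A sketch of such a printer (the helper name print_hierarchy is mine); pass G to go downward from a parent, or GR to go upward from a child:

def print_hierarchy(g, root, max_depth=None):
    """Iterative DFS from root, printing an indented tree; returns the visited set."""
    seen = set()
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        print("  " * depth + node)
        if max_depth is None or depth < max_depth:
            for neighbor in g.neighbors(node):
                stack.append((neighbor, depth + 1))
    return seen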

If we start with a few examples of pages we want included in our category, we can look at their first-level categories and go upward. For example, let’s start with my favorite band, Rilo Kiley.

One of Rilo Kiley’s categories is “Indie rock musical groups from California”. Let’s do a DFS of that category going upward…
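Using the hypothetical helper above, that’s just:

seen = print_hierarchy(GR, "Indie rock musical groups from California")
print(len(seen), "categories visited")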

If you do this, you’ll find a total iteration count of about 14k categories! That’s far more than we had in mind, and it’s clear why: eventually we fall into very broad categories like “Arts”, “Creativity”, and “Consciousness”.

Safer to set a depth limit when going in this direction! Exploring the results, one thing that stands out is “Musical groups”, but it’s not clear whether that would contain solo artists.

For instance, if we check out Jenny Lewis, none of her categories references the term “Musical groups”. However, I do see a category type in common between the two: “Saddle Creek Records artists”.

This is hopeful. Going up that hierarchy, we find “Artists by record label”, which appears to have subcategories for all major record labels, and should theoretically give us all artists signed to any label.

While this categorization might miss artists who are unsigned, or whose label information isn’t on Wikipedia, it’ll likely cover most acts popular enough to be both signed and documented. We could always take the union of “Musical groups” and “Artists by record label” if we don’t like the results, but for our purposes, this should suffice.

As a general rule, I’ve found the best categories to start from are those that break your target set down by some concrete shared attribute (like record label), rather than the higher-level generic categories, which have a tendency to be over-expansive.

At the end of the day, this dataset is crafted by a huge array of people involved in the editing to varying degrees of seriousness. It’s not reasonable to expect them to be perfectly consistent in how they categorize things.

If we check out the hierarchy for “Artists by record label”, we get mostly what we’re looking for, though some subcategories go in a weird direction; for example, G-Unit spirals out of control…

Though it appears that if we limit ourselves to categories ending in “artists”, we’re pretty much good to go. We can export all of these categories to a file, and then use them to extract the list of articles in those categories. All in, we’re looking at 808 different record labels to explore.
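Assuming the G digraph from earlier (the output filename is mine), that export might look like:

# direct subcategories of "Artists by record label" that end in "artists"
label_cats = [c for c in G.neighbors("Artists by record label") if c.endswith("artists")]

with open("label-categories.txt", "w") as out:
    out.write("\n".join(sorted(label_cats)) + "\n")

print(len(label_cats), "record label categories")  # 808 on the 8/20/17 export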

Extracting Articles in Those Categories

We can go back to our full article -> category graph to enumerate all the articles that exist in any of the provided categories. The following Python code will do that, given a list of line-separated categories.
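A sketch, assuming the categories.tsv file from earlier and the label-categories.txt list we just exported:

import sys

# categories we want, one per line (e.g. label-categories.txt)
with open(sys.argv[1]) as f:
    wanted = {line.strip() for line in f if line.strip()}

# keep any non-category page that belongs to at least one wanted category
with open("categories.tsv") as f, open("articles.txt", "w") as out:
    for line in f:
        title, *cats = line.rstrip("\n").split("\t")
        if not title.startswith("Category:") and wanted.intersection(cats):
            out.write(title + "\n")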

For our data set, this produces a total of 19,370 articles, which you can download here. Scanning through the output, there are a few misses, but for the most part it’s highly accurate, including both bands and solo artists while excluding individual band members!

An Alternative Way to Extract Items by High-Level Category

There is more than one way to skin a cat, and the category hierarchy is a pretty messy beast to tangle with.

For certain very common types of things on Wikipedia, there is a special feature known as the infobox, which expects certain properties for that type of item. If your category is common enough to warrant this, that might be another useful way to enumerate items by category. The full list of infoboxes is here.

Musical artists have just such an infobox, so we can look for all articles that contain it!

This approach also has an implicit selection bias, insofar as infoboxes are only present on articles important enough for somebody to go in and craft the special formatting.

Python code working from the original XML file below:
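A sketch of that scan, reusing the streaming pattern from before and matching the “Infobox musical artist” template case-insensitively (output filename is mine):

import re
import sys
from lxml import etree

NS = "{http://www.mediawiki.org/xml/export-0.10/}"
INFOBOX_RE = re.compile(r"\{\{\s*Infobox musical artist", re.IGNORECASE)

count = 0
with open("infobox-artists.txt", "w") as out:
    for _, page in etree.iterparse(sys.argv[1], tag=NS + "page"):
        text = page.findtext(NS + "revision/" + NS + "text") or ""
        if INFOBOX_RE.search(text):
            out.write(page.findtext(NS + "title") + "\n")
            count += 1
        page.clear()
        while page.getprevious() is not None:
            del page.getparent()[0]

print(count, "articles with the infobox")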

From this export, we got 88,580 total articles containing the “musical artist” infobox. Unfortunately, this appears to include band members who do not release music on their own, so it’s not an appropriate data set for our purposes. But you could theoretically find this valuable for your purposes, or you could leverage some of the detailed properties of the infobox to further refine your universe.

Conclusion

For my purposes, I ended up using the hierarchy below the category “Artists by record label” to seed my database, which was my goal.

The Wikipedia data set is incredibly rich, if a little disorganized, but what could you expect from compiling all human knowledge into a single file small enough to fit on a cell phone?

I’d love to spend a week just looking through the categorization data for insights. To wax philosophic for a second, categorization is a fundamental human impulse. In a sense, you can describe nearly any fact as merely a form of classification. Plato and Aristotle did a lot of thinking about categorization for a reason: it’s critical.

Sound off in the comments if you end up using any of these ideas in your projects!

Next week, we’ll look at ranking these articles by importance.