Today I dove into the new WikiReading dataset that a team at Google released alongside a fascinating paper which describes how they’re using it to benchmark their advances in deep learning for natural language understanding.

In the dataset there are almost 19 million instances of (document, property, values), where document is the full text of a Wikipedia article, and property and values are facts from the corresponding WikiData item. E.g. the article for Barack Obama has the following statement on its WikiData item:

In that statement, “spouse” is the property, and “Michelle Obama” is the value. The dataset doesn’t include WikiData’s qualifiers (e.g. start time) or references. An item can also have multiple values for the same property.

In the released files, each of these (doc, property, values) instances is a JSON object on its own line. Uncompressed, the dataset weighs in at 208GB. It’s split up into ~180 different files, and also split into training, testing and validation sets.
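To make the one-object-per-line layout concrete, here’s a minimal Python sketch of how such a file could be read. The field names and sample values are placeholders I made up for illustration; they are not the dataset’s real schema.

```python
import json
from typing import Iterable, Iterator


def iter_instances(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Tiny in-memory stand-in for one of the released files.
# "property" and "values" are assumed field names, not the real ones.
sample_lines = [
    '{"property": "spouse", "values": ["Michelle Obama"]}',
    "",
    '{"property": "example property", "values": ["a", "b"]}',
]
instances = list(iter_instances(sample_lines))
```

Because the function takes any iterable of strings, an open file handle can be passed in directly, and the file is then read one line at a time rather than loaded whole.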

Each object has a bunch of unnecessary reformulations of the same information — as far as I can work out, it’s just an artefact of the machine learning workflow Google used to produce their paper. For example, each object has 16 fields, when there probably only need to be three. But that’s fine, I can slim it down with a little processing (see below, or check out this jqplay snippet).

The bigger problem I can see with the dataset so far is that both the count of instances and their distribution across the training / validation / testing sets are significantly different from the numbers cited in the paper. I’ve raised an issue on GitHub, so we’ll see what they say. But right now it looks like it’s going to be hard to do a fair comparison between their algorithms and other natural language understanding methods, which is the main reason I wanted to dive into this in the first place.

For getting the JSON into a format more suitable for my needs (slimmer, and with raw text rather than arrays of tokens), I used the jq command line tool.
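For readers who don’t know jq, the shape of that transformation looks roughly like this in Python. This is a sketch of the idea, not my actual jq filter: `doc_tokens`, `property_tokens` and `values` are hypothetical field names, and joining tokens with single spaces only approximates the original text.

```python
def slim(instance: dict) -> dict:
    """Keep three fields and turn token arrays back into plain strings.

    The input field names are assumptions made for this example.
    """
    return {
        # Joining on spaces is lossy: the original spacing and
        # punctuation placement are not restored.
        "document": " ".join(instance["doc_tokens"]),
        "property": " ".join(instance["property_tokens"]),
        "values": list(instance["values"]),
    }


bloated = {
    "doc_tokens": ["Barack", "Obama", "is"],
    "property_tokens": ["spouse"],
    "values": ["Michelle Obama"],
    "unused_field": [1, 2],  # stands in for the redundant fields
}
slimmed = slim(bloated)
```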

Initially I used Cloud Dataflow, driven by the datasplash library in Clojure. It was interesting to learn, and I enjoyed watching a whole bunch of computers in the cloud spin up to work on my data, but I was also getting frustrated with the long minutes between starting a job and getting back the latest data — it felt like the opposite of the interactive process I usually have at the REPL.

At the same time I stumbled on an article arguing for using simple UNIX tools unless you really, really need to bring in the heavyweights. It’s convincing.

So I took a look at jq:

I won’t do a full tutorial on it here, but if you want to see how I’m transforming the data, check out this interactive jqplay snippet.

I loved how easy it was to express the transformation, but it was slower than I was expecting:

150 seconds for a 1.3GB, 100,000-line file. There are 180 files. It’d take about 7.5 hours to run. How can I speed it up?
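The estimate is plain multiplication, assuming every file takes about as long as the one I timed:

```python
seconds_per_file = 150   # measured on one 1.3GB file
file_count = 180         # approximate number of files

total_seconds = seconds_per_file * file_count
total_hours = total_seconds / 3600
```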

Well, when I took a look at htop, I could see that only one of the 16 cores on my machine was under load:

The fact that it’s at 100% means we can safely assume that it’s CPU bound. So how do I get it onto those other cores?

GNU parallel. It’s a tool for taking input and automatically distributing it to new commands so that they run across multiple cores. It has a “pipe” option that lets you pipe input into it (in this case by running cat over all the JSON files we care about), and it will break the input up into 1MB chunks and pass each chunk through to a separate process. Witness:
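The chunk-and-distribute idea can be sketched in Python with the standard library. This illustrates the technique only and is not how GNU parallel is implemented; `process_chunk` is a hypothetical stand-in for the per-line work, and the function names are my own.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Iterable, Iterator, List, Optional


def chunk_lines(lines: Iterable[str], max_bytes: int = 1_000_000) -> Iterator[List[str]]:
    """Group whole lines into chunks of at most roughly max_bytes each.

    A single line larger than max_bytes still gets a chunk of its own.
    """
    chunk: List[str] = []
    size = 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        if chunk and size + line_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += line_size
    if chunk:
        yield chunk


def process_chunk(chunk: List[str]) -> List[str]:
    """Placeholder for the real per-line transformation (here: upper-casing)."""
    return [line.upper() for line in chunk]


def run_parallel(lines: Iterable[str], max_bytes: int = 1_000_000,
                 workers: Optional[int] = None) -> List[str]:
    """Process chunks in separate worker processes, keeping input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_chunk, chunk_lines(lines, max_bytes))
        return [line for chunk in results for line in chunk]
```

Chunks always end on a line boundary, so no JSON object is ever cut in half. One simplification to be aware of: `run_parallel` gathers all output in memory, which is fine for a demo but not for hundreds of gigabytes.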

That’s a speedup of almost eight times. I actually have no idea why it isn’t sixteen times, given that it went from running on a single core to running on sixteen. Can anyone enlighten me?
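One possible lens on this is Amdahl’s law, which says that any part of a job that stays serial caps the overall speedup. The sketch below works backwards from the numbers above to the parallel fraction they would imply. It assumes the 16 cores are all equally capable and treats “almost eight times” as exactly 8, so read it as a rough illustration and not a diagnosis.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def implied_parallel_fraction(speedup: float, cores: int) -> float:
    """Invert Amdahl's law: which parallel fraction yields this speedup?"""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


# Assumed inputs: a speedup of 8 on 16 cores.
p = implied_parallel_fraction(8, 16)   # 14/15, about 0.93
check = amdahl_speedup(p, 16)          # recovers the assumed speedup
```

Under those assumptions, a serial share of only about 7% would be enough to halve the ideal sixteen-fold speedup. Reading the input and splitting it into chunks are plausible serial parts, but I haven’t measured that.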

Even after my bloat-removing jq transformation, the file size is only about four times smaller (~50GB). Still a little too big to just slurp into Clojure and start playing with the data on my laptop, which is the quickest way for me to explore.

I reckon most of the filesize is due to the full text of the article being repeated for every single (property, values) statement associated with the item. Next step is to change it to be a two-level map of {document -> {property -> value}}.
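Here’s a rough Python sketch of that regrouping, so each article’s text is stored once instead of once per statement. It assumes slimmed records with `document`, `property` and `values` keys, which are names I chose for illustration. A real version would probably key on an article id or a hash of the text instead of the full text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_document(instances: Iterable[dict]) -> Dict[str, Dict[str, List[str]]]:
    """Build {document -> {property -> values}} from flat instances."""
    grouped: Dict[str, Dict[str, List[str]]] = defaultdict(dict)
    for inst in instances:
        properties = grouped[inst["document"]]
        # Merge values if the same (document, property) pair shows up twice.
        properties.setdefault(inst["property"], []).extend(inst["values"])
    return dict(grouped)


flat = [
    {"document": "doc A text", "property": "p1", "values": ["x"]},
    {"document": "doc A text", "property": "p2", "values": ["y", "z"]},
    {"document": "doc B text", "property": "p1", "values": ["w"]},
]
nested = group_by_document(flat)
```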

But it’s getting a bit late, and I can get to that tomorrow.

I’ve been a fan of WikiData for ages. It’s exciting to see big research teams paying attention to it and using it to further the state of the art.

My goal for the next few days is to run Facebook’s fastText classifier over the dataset and compare its performance on text classification to the deep learning methods used in Google’s paper.