Lately, I’ve been working on a project called Kapok, which aims to create a knowledge graph of Wikipedia. In this knowledge graph, nodes are articles and edges are links from one article to another. Categories are also nodes that can be referred to by articles but do not refer to anything themselves.

The first part of creating the graph is parsing Wikipedia’s corpus. Wikipedia provides a database download here, which is essentially a 44GB XML file of 4.4 million articles.

Pprof

Since I don’t have lots of money to blow on distributing this job, I’ve resorted to optimizing my parser so that it can run reasonably well on my laptop. To do this, I’ve been using Go’s wonderful pprof tool. I don’t really know what it stands for, but it’s probably something like “pretty profiler.” It’s a tool that analyzes the CPU and memory usage of a program function by function. pprof even has a visualization tool that can create pretty graphs of programs and highlight slow spots. If you’re interested in learning about how to use pprof, check out this article on the Golang blog.

To generate these profiles, I parsed the titles, links, and categories from the first 10,000 Wikipedia articles (the archive is sorted alphabetically). I used my laptop, which has a 1.8GHz Core i5-3427U processor with 4GB of RAM and an SSD. On the software side, I used XUbuntu 13.10 with the newly released Golang 1.2.

Original Performance

Here’s the CPU profile for my first attempt, which processed 10,000 pages in about 110 seconds (91 pages/sec) while using 350MB of memory. You can view the code used in this iteration here. Basically, it reads the file line by line, feeds the lines through an XML parser, and extracts the title, links, and categories of each page with regular expressions.
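Here’s a minimal sketch of that regex-based style of extraction. The patterns below are simplified stand-ins, not the actual expressions Kapok used; wiki links take the form [[Target]] or [[Target|display text]]:

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified stand-in patterns for titles and wiki links.
var (
	titleRe = regexp.MustCompile(`<title>(.*?)</title>`)
	linkRe  = regexp.MustCompile(`\[\[([^\]|]+)(?:\|[^\]]*)?\]\]`)
)

// extractLinks returns the link targets found in a page's wikitext,
// dropping any |display text suffix.
func extractLinks(text string) []string {
	var links []string
	for _, m := range linkRe.FindAllStringSubmatch(text, -1) {
		links = append(links, m[1])
	}
	return links
}

func main() {
	page := "<title>Anarchism</title> See [[Political philosophy|philosophy]] and [[Liberty]]."
	fmt.Println(titleRe.FindStringSubmatch(page)[1])
	fmt.Println(extractLinks(page))
}
```

Each page runs through several patterns like these, which is where all those regexp.doExecute() calls come from.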

If you click on the ‘Original SVG’ link to the left of the image, you’ll be able to see the entire image and delve into the details. The parser spent almost 75% of its time on regexp.doExecute(), which is the node just above regexp.(*machine).match(). The main caller of regexp.doExecute() is regexp.(*Regexp).Match() which is in turn called by parse.GetRawPages(). This function accounts for 6986/8127 (86%) of the regex calls, so it’s the first function that we should optimize. The other slow spot was XML parsing, which took about 12% of the time.

Here’s the memory profile for that first iteration:

Before interpreting this graph, a quick note about the code: generate.GetID() is a function used for allocating and retrieving unique ids for each node. It stores a map from article names to unique ids, which can then be used to retrieve nodes and create relationships between them. It is used to export the graph to a CSV file, not to parse pages.
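Such an id allocator can be sketched as follows (the names and structure here are illustrative, not Kapok’s actual code): a map from article names to sequential ids, returning the existing id when a name has been seen before.

```go
package main

import "fmt"

// IDMap hands out sequential unique ids for article names.
type IDMap struct {
	ids  map[string]int
	next int
}

func NewIDMap() *IDMap {
	return &IDMap{ids: make(map[string]int)}
}

// GetID returns the existing id for name, or allocates a new one.
func (m *IDMap) GetID(name string) int {
	if id, ok := m.ids[name]; ok {
		return id
	}
	id := m.next
	m.next++
	m.ids[name] = id
	return id
}

func main() {
	m := NewIDMap()
	fmt.Println(m.GetID("Anarchism")) // first name gets id 0
	fmt.Println(m.GetID("Liberty"))   // second name gets id 1
	fmt.Println(m.GetID("Anarchism")) // repeated name reuses id 0
}
```

With millions of article names, a map like this necessarily holds a lot of memory for the lifetime of the program, which is why it shows up prominently in the profile.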

In the initial version, XML parsing consumed about 75% of the total memory. On top of this, Go’s garbage collector didn’t like to throw away the memory allocated by the XML parser. At this stage, Kapok could parse about 100,000 pages before filling my laptop’s 4GB of memory and 4GB of swap space. It would slow to a crawl as the garbage collector frantically tried and failed to free up more memory. I would then have to manually kill the process.

Parser Overhaul

At this point, I decided to completely overhaul the parser. I removed all regular expressions and used the bytes package instead, especially bytes.Index(). I also ditched Go’s XML parser. I know that XML isn’t supposed to be parsed with regular expressions, so it certainly shouldn’t be parsed on the byte level. However, since the XML parser was hogging memory and producing a lot of data that I didn’t need, I decided to ditch it anyway. For those who are curious, the code for extracting pages is in parse/parse.go, and the code for extracting links is in parse/page.go. It isn’t very readable code, but I think that it demonstrates Go’s lower-level file-reading abilities quite well.
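Here’s a minimal sketch of the bytes.Index() style of extraction (simplified; the real code in parse/parse.go and parse/page.go handles many more cases): find the opening marker, then slice up to the closing one, with no regex machinery and no intermediate XML tree.

```go
package main

import (
	"bytes"
	"fmt"
)

// between returns the bytes between the first occurrence of open and the
// next occurrence of closeTag, or nil if either marker is missing.
func between(data, open, closeTag []byte) []byte {
	start := bytes.Index(data, open)
	if start < 0 {
		return nil
	}
	start += len(open)
	end := bytes.Index(data[start:], closeTag)
	if end < 0 {
		return nil
	}
	return data[start : start+end]
}

func main() {
	page := []byte("<page><title>Anarchism</title><text>...</text></page>")
	fmt.Printf("%s\n", between(page, []byte("<title>"), []byte("</title>")))
}
```

Because between() returns a subslice of the original buffer, it allocates nothing, which is a big part of why this approach is so much lighter than the XML parser.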

Final Performance

So how much faster is this ad hoc parser? It can parse 10,000 pages in just 6 seconds, while only using 26MB of memory. That is 18 times faster than the original, and 13 times less memory intensive. It can parse the entire corpus of Wikipedia (~40GB) in 20 minutes. It also plays well with Go’s garbage collector, and tends to only use about 2GB of memory instead of taking all that it can.

Judging by the final CPU profile, there isn’t much low-hanging fruit left to optimize. One interesting thing is the time taken on syscalls, which is about 28% of the total time. At first, I thought that this was due to the concurrency I had used in my parser. However, making the parser completely sequential did not reduce the number of syscalls; it only decreased memory usage and increased compute time. Now I think that the excessive syscalls are due to file reading. The current iteration reads until it hits a > sign, and then parses whatever came before that, generally adding it to a buffer. Since XML contains a ton of > signs, there are a lot of unnecessary reads.

The memory side is looking pretty good, too. parse.getLinks() uses quite a bit because it deals with a lot of byte buffers that store pages. generate.GetID() is a hash map that has to store millions of entries, so it makes sense that it would use quite a bit.

Performance over time

Another large part of Kapok’s performance is how it performs over time and interacts with the garbage collector. Below is a graph of the number of pages processed per second over time.

To generate this graph, I first took the average time to process each set of 1,000 pages. Then I sampled every 5th data point and plotted it. Because of this, some garbage collection dips are missing.

As you can see from this graph, Kapok’s performance is quite volatile. I’m pretty sure that this can be attributed to Go’s mark-and-sweep garbage collector that halts the program while it frees up memory. A better indicator of Kapok’s overall performance would be a graph of the cumulative average of pages per second, which I’ve made below.

Overall performance is looking pretty good. The parser even speeds up over time, and averages out at about 4,100 pages per second. I suspect that the speedup occurs because, at the beginning, most of the articles encountered are new and have to be created and allocated ids. Near the end, almost every article has been linked to before, so the program needs to make fewer new ids.

Future

Later on, I might try to make Kapok even faster. Its current throughput is about 30MB/s on my laptop, which is far from the potential read rate of even HDDs. Chunking and then concurrently parsing a file has the potential to be very fast, but it also presents quite a few problems. For instance, what happens if half of a page is in one chunk, and the other half is in another?
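One possible way to handle that boundary problem (a sketch under the assumption that pages are delimited by literal <page> tags, not something Kapok currently does): give each worker a rough byte offset, then slide its real starting point forward to the next <page> marker. The previous worker parses up to that same boundary, so every page lands in exactly one chunk.

```go
package main

import (
	"bytes"
	"fmt"
)

// alignToPage slides offset forward to the start of the next <page>
// marker, or to the end of data if no marker follows.
func alignToPage(data []byte, offset int) int {
	i := bytes.Index(data[offset:], []byte("<page>"))
	if i < 0 {
		return len(data)
	}
	return offset + i
}

func main() {
	data := []byte("<page>A</page><page>B</page><page>C</page>")
	mid := len(data) / 2 // a naive split point that may land mid-page
	fmt.Println(alignToPage(data, mid))
}
```

The same idea works on a real file by seeking to the rough offset and scanning forward, at the cost of one extra pass over a small window per chunk.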

If you want to have a look at Kapok’s source code, it’s all on GitHub: https://github.com/aaasen/kapok

Thanks for reading, and remember that: