Wouldn’t it be cool to import Wikipedia into Neo4j?

Mirko Nasato thought so, and built graphipedia, which uses the Neo4j batch importer to do just that.

It’s written in Java, so if you’re a pure Ruby guy, I’ll walk you through the steps.

Let’s clone the project and jump in.

git clone https://github.com/mirkonasato/graphipedia.git
cd graphipedia

If you look in here you’ll see a pom.xml file, which means you’ll need to install Maven and build the project.

sudo apt-get install maven2
mvn install

You’ll see a bunch of stuff flying by; that’s just the dependencies being downloaded. At the end you should see this:



[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] ------------------------------------------------------------------------
[INFO] Graphipedia Parent .................................... SUCCESS [1:08.932s]
[INFO] Graphipedia DataImport ................................ SUCCESS [1:16.018s]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2 minutes 25 seconds
[INFO] Finished at: Thu Feb 16 11:36:55 CST 2012
[INFO] Final Memory: 28M/434M
[INFO] ------------------------------------------------------------------------

OK, so now let’s get the Wikipedia dump we need. You can download it with wget.

wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Whoa, hold up. That’s a 7.6 GB file… can we try a smaller data set first?

Sure. Let’s go with Lea faka-Tonga (the Tongan-language Wikipedia) ’cause it just sounds cool… and we’ll unzip it.

wget http://dumps.wikimedia.org/towiki/latest/towiki-latest-pages-articles.xml.bz2
bzip2 -d towiki-latest-pages-articles.xml.bz2

It’s a two-step process, so first let’s create a smaller intermediate XML file containing page titles and links only:

java -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks towiki-latest-pages-articles.xml towiki-links.xml

You should see:

Parsing pages and extracting links...
..
2835 pages parsed in 0 seconds.
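
If you want a rough sanity check on that number, you can count the page elements in the dump yourself. This is just a sketch: count_pages is a helper name I made up (it is not part of graphipedia), and it assumes each opening <page> tag sits on its own line, which is how the Wikimedia dumps are normally laid out. Don’t worry if the count is a bit off, since the extractor may skip some pages.

```shell
#!/bin/sh
# count_pages: hypothetical helper, not part of graphipedia.
# Counts the lines that contain an opening <page> tag in a MediaWiki XML dump.
# Closing </page> tags do not match the pattern, so each page is counted once.
count_pages() {
    grep -c '<page>' "$1"
}

# Only run against the dump if it has actually been downloaded and unpacked.
if [ -f towiki-latest-pages-articles.xml ]; then
    count_pages towiki-latest-pages-articles.xml
fi
```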

Then we run the batch importer on this file and write the resulting database into the graph.db directory:

java -Xmx3G -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.neo4j.ImportGraph towiki-links.xml graph.db

You should see:

Importing pages...
..
2835 pages imported in 0 seconds.
Importing links...
.....
5799 links imported in 0 seconds; 6383 broken links ignored
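
If you plan to try other languages, the two steps wrap up nicely in a tiny shell function. Treat this as a sketch, not a finished tool: import_wiki is a name I made up, the jar path and class names are copied from the commands in this post, and the links file name is simply derived from the dump name.

```shell
#!/bin/sh
# Path to the jar built by "mvn install"; run this from the graphipedia directory.
JAR=./graphipedia-dataimport/target/graphipedia-dataimport.jar

# import_wiki: hypothetical wrapper around the two graphipedia steps.
# $1 = unpacked dump, e.g. towiki-latest-pages-articles.xml
# $2 = output database directory, e.g. graph.db
import_wiki() {
    dump=$1
    db=$2
    # towiki-latest-pages-articles.xml -> towiki-links.xml
    links=${dump%-latest-pages-articles.xml}-links.xml

    # Step 1: extract titles and links; step 2 only runs if step 1 succeeds.
    java -classpath "$JAR" org.graphipedia.dataimport.ExtractLinks "$dump" "$links" &&
    java -Xmx3G -classpath "$JAR" org.graphipedia.dataimport.neo4j.ImportGraph "$links" "$db"
}

# Example:
#   import_wiki towiki-latest-pages-articles.xml graph.db
```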

Go inside and take a look and you’ll see our neostore files.

cd graph.db
ls

You can use this folder in place of any existing Neo4j database by overwriting the /neo4j/data/graph.db folder. Enjoy!
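
Here is one way that last copy step might look. It’s a minimal sketch with a few assumptions of mine: install_graph_db is a made-up helper, /neo4j stands in for wherever your Neo4j lives, and the server should be stopped before you run it. Instead of overwriting outright, it moves the old database aside first so you can get it back.

```shell
#!/bin/sh
# install_graph_db: hypothetical helper, not part of graphipedia or Neo4j.
# $1 = freshly imported graph.db directory
# $2 = Neo4j install directory (the /neo4j in the text; adjust to yours)
install_graph_db() {
    src=$1
    target=$2/data/graph.db

    # Keep the old database around under a timestamped name.
    if [ -d "$target" ]; then
        mv "$target" "$target.bak.$(date +%Y%m%d%H%M%S)" || return 1
    fi

    # Copy the new database into place.
    mkdir -p "$2/data" && cp -R "$src" "$target"
}

# Example (stop the Neo4j server first):
#   install_graph_db ./graph.db /neo4j
```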