I have Couchbase Server installed on drive E. From the folder containing the Wikidata JSON file (mine is called wikidata-20161219-all.json, but yours may differ), I ran:

E:\Couchbase\Server\bin\cbimport.exe json -c couchbase://localhost -u Administrator -p password -b wikibase -d file://wikidata-20161219-all.json --generate-key %id% --format list

Based on the Wikibase data model documentation, I knew that each item would have an id field with a unique value. That's why I used %id%. More complex keys can be generated with the relatively robust key-generation templates that cbimport offers.
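To make the %id% template concrete, here is a rough sketch (in Python, not cbimport's actual implementation) of how a field-reference template resolves against a trimmed-down Wikidata entity. The sample document and the generate_key helper are illustrative, not part of cbimport:

```python
import json

# A heavily trimmed sample of a Wikidata entity, shaped like the records
# in the dump (real entities carry labels, claims, sitelinks, and more).
sample = json.loads(
    '{"id": "Q42", "type": "item", '
    '"labels": {"en": {"language": "en", "value": "Douglas Adams"}}}'
)

def generate_key(template: str, doc: dict) -> str:
    """Sketch of cbimport-style key generation: each %field% placeholder
    is replaced with the value of that top-level field in the document."""
    key = template
    for field, value in doc.items():
        key = key.replace(f"%{field}%", str(value))
    return key

print(generate_key("%id%", sample))            # -> Q42
print(generate_key("wikidata::%id%", sample))  # -> wikidata::Q42
```

Since every Wikidata entity id (Q42, P31, and so on) is unique, using it directly as the document key gives free, predictable lookups later.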

While cbimport ran, I kept a close eye on its memory usage, since I was afraid it would have trouble with such a huge dataset. No problem at all: it never exceeded 21MB of RAM while running.

I started my bucket in Couchbase with a 512MB RAM quota and raised it to 924MB during the import. I only have one node, so I expected a lot of ejections from the cache to take place. That is exactly what happened.

The total file is 99GB, so there's no way it could all fit in RAM on my desktop. In production, fitting 99+GB into RAM across a handful of nodes wouldn't be unrealistic. And as Wikibase continues to grow, it could be accommodated by Couchbase's easy scaling: just rack up another server and keep going.
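A quick back-of-envelope calculation shows why ejections were inevitable on my setup. This assumes the in-memory footprint scales with the raw dump size, which is only a rough approximation (Couchbase's on-disk and in-memory formats differ from the raw JSON, and metadata overhead is ignored):

```python
# Rough resident ratio: what fraction of the dataset could live in the
# bucket's RAM quota at once. The 99GB raw-dump size is used as a stand-in
# for the stored dataset size, which is only approximately true.
ram_quota_mb = 924
dataset_gb = 99

resident_ratio = ram_quota_mb / (dataset_gb * 1024)
print(f"~{resident_ratio:.2%} of the data fits in RAM at once")  # -> ~0.91%
```

With under 1% of the data resident, nearly every working-set miss forces an ejection, so heavy cache churn on a single node is exactly what the math predicts.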

This takes a long time to run on my desktop. In fact, as I write this blog post, it's still running. It's up to 5.2 million documents and going (I don't know how many records there are in total, but disk usage is currently at 9.5GB, so I think I have a long way to go).
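For a very rough sense of how far along the import is, you can extrapolate from those two numbers. This assumes bucket disk usage grows in proportion to how much of the source file has been consumed, which is shaky (Couchbase compresses values and stores them differently than the raw dump), so treat it as a ballpark only:

```python
# Figures from the running import (see above).
docs_so_far = 5_200_000   # documents imported so far
disk_used_gb = 9.5        # bucket disk usage so far
source_file_gb = 99       # size of the raw JSON dump

# Rough assumption: disk usage tracks progress through the source file.
fraction_done = disk_used_gb / source_file_gb
estimated_total_docs = docs_so_far / fraction_done

print(f"~{fraction_done:.0%} done, ~{estimated_total_docs / 1e6:.0f} million docs expected")
```

Under that assumption, the import is only around 10% done, with somewhere in the neighborhood of 54 million documents expected in total, so "a long way to go" is putting it mildly.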