There’s been a bit of buzz today about “Whoosh” – a search engine written in Python. I did some performance measurements, and posted them to the Whoosh mailing list, but it looks like there’s wider interest, so I thought I’d give a brief summary here.

First, I took a corpus of (slightly over) 100,000 documents containing text from the English Wikipedia, and indexed them with Whoosh and with Xapian.

– Whoosh took 81 minutes, and produced a database of 977MB.

– Xapian took 20 minutes, and produced a database of 1.2GB. (Note: the unreleased “chert” backend of Xapian produced a 932MB database, and compacting that with “xapian-compact” produced a 541MB database, but it’s only fair to compare against released versions.)

Next, I ran a search speed test: 10,000 single-word searches (words picked at random from /usr/share/dict/british-english) against each database, measuring the number of searches performed per second. I ran the test with the cache cleared (with “echo 3 > /proc/sys/vm/drop_caches”), and then with the cache full (by running the same test again):

– With an empty cache, Whoosh achieved 19.8 searches per second.

– With a full cache, Whoosh achieved 26.3 searches per second.

– With an empty cache, Xapian achieved 83 searches per second.

– With a full cache, Xapian achieved 5408 searches per second.
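
For anyone wanting to repeat this kind of measurement, here is a minimal sketch of a timing harness. It is not the script that produced the numbers above. `search_fn` is a hypothetical stand-in for whatever runs one query against the engine under test, and the helper names, the fixed seed, and picking words with replacement are all assumptions made for illustration.

```python
import random
import time


def load_words(path="/usr/share/dict/british-english"):
    """Read one word per line from the dictionary file, skipping blanks."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def pick_queries(words, count=10000, seed=0):
    """Pick `count` single-word queries at random (with replacement).

    The fixed seed is an assumption: it lets both engines, and both the
    empty-cache and full-cache runs, see exactly the same queries.
    """
    rng = random.Random(seed)
    return [rng.choice(words) for _ in range(count)]


def searches_per_second(search_fn, queries):
    """Run every query through search_fn and return the achieved rate.

    search_fn is a hypothetical callable taking one query string; wrap the
    real engine's search call in it.
    """
    start = time.perf_counter()
    for query in queries:
        search_fn(query)
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed if elapsed > 0 else float("inf")
```

To mirror the two runs described above, clear the cache (the “echo 3 > /proc/sys/vm/drop_caches” step needs root), call `searches_per_second` once for the empty-cache figure, then call it again with the same queries for the full-cache figure.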

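In summary, Whoosh is damn good for a pure-Python search engine, but Xapian is capable of much better performance: roughly 4 times faster at indexing and at empty-cache searching, and about 200 times faster at searching with a full cache.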