{
    title: Trie vs Hash Table Deathmatch!
    description: a cautionary tale about verifying one's extraordinary claims.
}

Trie vs Hash Table Deathmatch!
==============================

A few days ago, I submitted my simple string interning implementation
on [Reddit][]. I got a few internet points, and 2 _major_ objections
to my approach:

[Reddit]: https://www.reddit.com/r/programming/comments/3axeca/easy_string_interning/

- Hash tables are actually simple. They can fit in fewer than 100 lines
  of code.
- I didn't measure the performance of my code. String interning is an
  optimisation after all.

If hash tables were the scary data structure I thought they were,
performance would have been a secondary concern. They're not, so
I have to measure. Which I did.

The data
--------

I used the [GitHub Java Corpus][GJC]. For each project, I concatenated
all `.java` files together, then used [this program][ID] to remove
everything but the keywords and identifiers. The result is a long list
of newline-separated identifiers as they appear in the project. Of
course, there is a lot of repetition.

[GJC]: http://groups.inf.ed.ac.uk/cup/javaGithub/
[ID]: identifiers.c

The tests were performed on some of the smallest projects (a batch of
2000 of them, most of which were only a few KB worth of source code),
and on 50 of the biggest projects. The biggest of all weighed 65MB,
including 85,000 unique identifiers (3 million if you count the
repetitions).

The programs
------------

I tested 4 implementations:

- a [control][] in pure C,
- a [Trie][] in pure C,
- a hand-rolled [hash table][HT] in pure C,
- and an `std::unordered_map` (hash table) [wrapper][], in C++.

[control]: intern_control.c
[Trie]: intern_trie.c
[HT]: intern_hash.c
[wrapper]: intern_hash.cpp

All 4 programs have the same `main()` function. The only difference is
which data structure is used to store the strings. The control program
just reads the files.
The other 3 put all the identifiers of each project in an intern pool
(either a trie-based one, or a hash-table-based one). A new pool is
created for each project. Everything happens in a single thread.

The protocol
------------

All my timings measure total CPU time (user + system). Each program was
run about half a dozen times against the same 2 batches of data (many
small projects, or a few big ones). Timings were very stable and easy
to measure, so I just ran the same tests a couple of times to check for
outliers.

The timings of the "control" program were used to deduce more precise
timings for the other 3. By subtracting the time spent doing IO, we get
the time actually spent on adding identifiers to the pool.

The only operation tested here was adding a string to the pool. There
is no support for deletion (which hardly matters anyway) or retrieval.
(Tries don't support retrieval by default, hash tables do.)

The rig
-------

Intel Core i5, Debian, i686 (32-bit mode).

The results
-----------

Here are the timings (in seconds):

                        IO alone   Hash table   Hash table (STL)   Trie
    big projects        3.50s      5.76s        12.35s             10.12s
    small projects      0.21s      0.30s         0.70s              0.53s
    big - control       0          2.26s         8.85s              6.62s
    small - control     0          0.09s         0.49s              0.32s
    Slowdown (big)                 ×1            ×3.9               ×2.9
    Slowdown (small)               ×1            ×5.4               ×3.6