Update: Jan 03, 2019 Based on comments, the Bonus: Unicode section was clarified to mention canonical equivalence.

“How hard can it be?”

That’s what I told myself about two weeks ago. I was so unhappy with the responsiveness and usability of the stock Doxygen search that I usually went straight to Google instead. With that experience in mind, for the new theme I wanted a solution that:

Is fast and stable, without results being reordered as they appear

Is small enough to not need any complex server-side solution

Shows relevant results first, so I don’t need to scroll through endless lists to find what I’m looking for

Is usable for API discovery (immediate lookahead autocompletion instead of only a full-text search)

Has a UX that allows me to pop it up anywhere on a page, navigate using just the keyboard and return back to the page without losing context

With not even 500 lines of pure JavaScript and a roughly half-megabyte binary containing the search data, I’m able to search among all 7500 symbols of the Magnum engine faster than I can type.

The idea

The trie search structure seemed to fit the case for lookahead autocompletion, however building a huge tree structure in JavaScript didn’t really get my hopes up for better performance. The first hit when searching for “javascript fast tries” was this blog post by John Resig (the author of jQuery). I was quite horrified that even the fast implementation still involved building a tree structure — until I saw that the article was written in 2011, in the age before asm.js, typed arrays and Emscripten. Phew, okay. I decided to drop “javascript” from my search query and a paper about Tightly packed tries finally popped up in the results. This started to look like the right direction to follow.

Initial implementation

The basic idea is: during output generation in the m.css Doxygen theme, take the input search data and preprocess them into a trie structure using simple nested Python dicts. Then, once fully populated, linearize the trie to a binary file. The binary file is later loaded using JavaScript in the browser and the search performed directly on it, without building any new structure out of it, to minimize initial startup time and memory usage. Each node in the trie stores a list of results (represented by simple integer indices into an external array, explanation below) and a map from characters to child nodes. I set a reasonable limit of 64k symbols in total, a max binary file size of 16 MB (the most a 24-bit offset can address), max 256 results and at most 256 child 8-bit characters for each node. With that, the binary serialization format for the trie is roughly as follows:

    Field size    Content
    32b           Root node offset
    …             …
    8b            Node N result count r
    8b            Node N child count c
    r × 16b       Node N result indices
    8b + 24b      Node N child 1 character + byte offset
    8b + 24b      Node N child 2 character + byte offset
    …             …
    8b + 24b      Node N child c character + byte offset
    …             …

The trie is traversed in post-order when serializing, which means that each node offset in the file is already known when serializing its parent. The only exception is the root node offset, which is updated at the start of the file as the last step. Now, what actually gets saved into the trie? Let’s start with a very simplified case: having the Magnum and Magnum::Math namespaces, the Magnum::Math::Vector and Magnum::Math::Range classes and just three functions Magnum::Math::Vector::min(), Magnum::Math::Range::min() and Magnum::Math::min().
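To make the format above concrete, here is a minimal Python sketch of the post-order serialization (a simplified model following the table above, not the actual m.css code; the node representation is hypothetical):

```python
import struct

def serialize(node, out):
    # Post-order: serialize children first, so their absolute byte
    # offsets are known by the time the parent node is written.
    # `node` is {'results': [...], 'children': {char: node}}.
    child_offsets = [(char, serialize(child, out))
                     for char, child in node['children'].items()]
    offset = len(out)
    out += struct.pack('<BB', len(node['results']), len(child_offsets))
    for r in node['results']:
        out += struct.pack('<H', r)                # 16-bit result index
    for char, child_offset in child_offsets:
        out += struct.pack('<B', ord(char))        # 8-bit character
        out += child_offset.to_bytes(3, 'little')  # 24-bit byte offset
    return offset

# a root node with result 0 and a single child 'a' holding results 1, 2
root = {'results': [0], 'children': {
    'a': {'results': [1, 2], 'children': {}}}}

data = bytearray(4)                          # reserve the root offset slot
root_offset = serialize(root, data)
struct.pack_into('<I', data, 0, root_offset)  # patch it in as the last step
```

Note how the root offset slot is reserved up front and only filled in once the whole tree has been written, exactly as described above.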
The stock Doxygen search allows me to only search using the leaf name — if I want to look up Math::min(), entering min() gives too many results and I need to filter them manually, which is a waste of brain matter. The goal here is to be able to search using any prefix, which means that each symbol has to be saved into the trie with all its possible prefixes. Having a good debug visualization is key to understanding the data. A trie containing the above seven symbols inserted with different prefixes looks like this — each column is one tree depth level (going to the next column involves searching for the correct letter) and particular leaf nodes contain result IDs (the numbers in brackets). The search is case-insensitive, so all strings are normalized to lowercase:

    magnum [0]
    ||| ::math [1]
    ||| ::vector [2]
    ||| | ::min [3]
    ||| range [4]
    ||| | ::min [5]
    ||| min [6]
    || th [1]
    ||| ::vector [2]
    ||| | ::min [3]
    ||| range [4]
    ||| | ::min [5]
    ||| min() [6]
    | in [3, 5, 6]
    vector [2]
    | ::min() [3]
    range [4]
    | ::min() [5]
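The “insert each symbol with all its possible prefixes” step can be sketched like this (a hypothetical helper on nested dicts, not the actual m.css code) — for Magnum::Math::min() it inserts the lowercased magnum::math::min(), math::min() and min():

```python
def add_symbol(trie, joined_name, result_id):
    # Insert the lowercased name once for every suffix that starts at a
    # component boundary, so any prefix of any component matches.
    name = joined_name.lower()              # search is case-insensitive
    starts = [0] + [i + 2 for i in range(len(name)) if name[i:i+2] == '::']
    for start in starts:
        node = trie
        for char in name[start:]:
            node = node['children'].setdefault(
                char, {'results': [], 'children': {}})
        node['results'].append(result_id)

trie = {'results': [], 'children': {}}
add_symbol(trie, 'Magnum::Math::min()', 6)
```

After this single call, walking the trie along either m-i-n-(-) or m-a-t-h-:-:-m-i-n-(-) ends in a node carrying result 6, which is exactly what makes prefix search on any component possible.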

Subtree merging

There’s one problem with the above approach — almost all symbols are contained in the trie two or more times, depending on their depth. With thousands of symbols this would be a significant size increase. Thanks to the debug visualization, the solution was immediately visible — what if I simply eliminated subtrees that look the same? During serialization the tree is traversed in post-order, which means that the final binary form of a node is known when serializing its parent. It’s then enough to maintain a lookup table of serialized subtrees. When a newly serialized subtree is already in the lookup table, it’s not written to the file again; the byte offset of the already existing instance is used instead. This of course requires that subtrees with the same data serialize identically, independently of their position in the file — in order to achieve that, I had to switch from relative child offsets (as suggested in the paper for space saving) to absolute offsets. Comparing with the visualization above, # denotes where a duplicate subtree was eliminated by this method:

    magnum [0]
    ||| ::math [1]
    ||| ::vector [2]
    ||| | ::min [3]
    ||| range [4]
    ||| | ::min [5]
    ||| min [6]
    || t #
    | in [3, 5, 6]
    v #
    r #

With the subtree merging in place, the serialized size of this minimal case went down from 592 bytes to less than half of that: 290 bytes. In the full Magnum search data, this saved over half a megabyte. A nice thing about this approach is that it’s completely language-oblivious — it just works on the binary data.
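The deduplication can be sketched as a variant of the post-order serializer (again a simplified model, not the real implementation), where each node’s raw serialized bytes act as the key into a cache of already-written subtrees — which is what makes the scheme language-oblivious:

```python
import struct

def serialize_dedup(node, out, cache):
    # Post-order serialization with subtree merging: identical
    # serialized subtrees are written only once, and parents reuse the
    # absolute byte offset of the first copy. Absolute offsets ensure
    # that equal subtrees serialize to equal bytes anywhere in the file.
    child_offsets = [(char, serialize_dedup(child, out, cache))
                     for char, child in sorted(node['children'].items())]
    blob = bytearray(
        struct.pack('<BB', len(node['results']), len(child_offsets)))
    for r in node['results']:
        blob += struct.pack('<H', r)
    for char, off in child_offsets:
        blob += struct.pack('<B', ord(char)) + off.to_bytes(3, 'little')
    key = bytes(blob)
    if key in cache:                  # duplicate subtree, reuse offset
        return cache[key]
    cache[key] = offset = len(out)
    out += blob
    return offset

# two identical leaf subtrees (result 3) under two different parents
def leaf():
    return {'results': [3], 'children': {}}

root = {'results': [], 'children': {
    'a': {'results': [1], 'children': {'m': leaf()}},
    'b': {'results': [2], 'children': {'m': leaf()}}}}

out, cache = bytearray(), {}
serialize_dedup(root, out, cache)
```

In this toy tree the identical leaf under 'a' and 'b' is written only once; the second parent just stores the offset of the first copy.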

Result map

So far there was just the trie, containing result indices such as [3, 5, 6]. That’s obviously not enough to show a search result with a link to the user. Next to the trie there has to be a map from those indices to actual result titles and URLs. Each title and URL is a string of a different length, and in order to make the lookup a constant-time operation, there first needs to be a table mapping an index to a byte offset of the string data. Because the trie data uses only 24 bits for a byte offset, I can do the same here and use the remaining 8 bits for various additional information, such as the symbol type.

    Field size    Content
    8b + 24b      Result 0 flags + byte offset o_0
    8b + 24b      Result 1 flags + byte offset o_1
    …             …
    8b + 24b      Result N flags + byte offset o_n
    32b           File size l
    o_1 - o_0     Result 0 title, '\0', URL
    …             …
    l - o_n       Result N title, '\0', URL

Looking up a result r involves reading data from the byte offset range given by indices r and r + 1. The title and URL are null-separated. Result data corresponding to the above simplified case look like this when visualized:

    0: Magnum [type=NAMESPACE] -> namespaceMagnum.html
    1: Magnum::Math [type=NAMESPACE] -> namespaceMagnum_1_1Math.html
    2: Magnum::Math::Vector [type=CLASS] -> classMagnum_1_1Math_1_1Vector.html
    3: Magnum::Math::Vector::min() [type=FUNC] -> classMagnum_1_1Math_1_1Vector.html#af029f9f7810201f0bd8d9580af273bde
    4: Magnum::Math::Range [type=CLASS] -> classMagnum_1_1Math_1_1Range.html
    5: Magnum::Math::Range::min() [type=FUNC] -> classMagnum_1_1Math_1_1Range.html#ad4919361a2086212fac96da0221e4dcd
    6: Magnum::Math::min() [type=FUNC] -> namespaceMagnum_1_1Math.html#ae22ef0cb2a5a5e4c5e626a3df670be21
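The flags + 24-bit-offset packing and the constant-time lookup can be sketched like this (a simplified model of the layout above; the flag values are made up for illustration):

```python
import struct

def pack_result_map(entries):
    # entries: list of (flags, title, url). Builds the offset table
    # followed by the null-separated string data.
    table, strings = bytearray(), bytearray()
    header = 4 * (len(entries) + 1)      # strings start after the table
    for flags, title, url in entries:
        offset = header + len(strings)
        table += struct.pack('<I', (flags << 24) | offset)  # 8b + 24b
        strings += title.encode() + b'\0' + url.encode()
    table += struct.pack('<I', header + len(strings))       # file size
    return bytes(table + strings)

def lookup(data, i):
    # Constant-time: read offsets i and i + 1, slice, split on '\0'.
    word = struct.unpack_from('<I', data, 4 * i)[0]
    flags, start = word >> 24, word & 0xffffff
    end = struct.unpack_from('<I', data, 4 * (i + 1))[0] & 0xffffff
    title, url = data[start:end].split(b'\0')
    return flags, title.decode(), url.decode()

data = pack_result_map([
    (0x01, 'Magnum', 'namespaceMagnum.html'),
    (0x02, 'Magnum::Math::min()',
     'namespaceMagnum_1_1Math.html#ae22ef0cb2a5a5e4c5e626a3df670be21')])
```

The trailing file-size word doubles as the “offset of result N + 1”, so the very last entry needs no special-casing in lookup().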

Result map prefix merging

Again, thanks to the data visualization it’s immediately clear that there’s a lot of redundant information negatively affecting the data size. The redundancy is mainly in the prefixes, so I reused the already implemented trie to perform the optimization. The following visualization shows the result — the details are way too boring to explain in full here, but basically every entry title can optionally be prefixed by another entry title (denoted by prefix=M) and take the first [:N] characters of that entry’s URL, recursively.

    0: Magnum [type=NAMESPACE] -> namespaceMagnum.html
    1: ::Math [prefix=0[:15], type=NAMESPACE] -> _1_1Math.html
    2: ::Vector [prefix=1[:0], type=CLASS] -> classMagnum_1_1Math_1_1Vector.html
    3: ::min() [prefix=2[:34], type=FUNC] -> #af029f9f7810201f0bd8d9580af273bde
    4: ::Range [prefix=1[:0], type=CLASS] -> classMagnum_1_1Math_1_1Range.html
    5: ::min() [prefix=4[:33], type=FUNC] -> #ad4919361a2086212fac96da0221e4dcd
    6: ::min() [prefix=1[:28], type=FUNC] -> #ae22ef0cb2a5a5e4c5e626a3df670be21

In this particular case the map size went down from 494 bytes to 321 bytes. In case of the Magnum data, the saving was over a megabyte.

Client-side search

With the binary data in hand I was finally ready to leave the Python world and start the client-side JavaScript implementation. Motivated by how Emscripten is able to run huge codebases with minimal garbage collector involvement, I reached directly for typed arrays and implemented the whole search directly on the binary data. Boring details aside, the client-side search works simply by following the pointer to the root trie node and then consuming the search string one 8-bit character at a time. The child lookup is performed using a simple linear search (contrary to what the paper suggests), as that is fast enough and not the bottleneck. The child count has some reasonable upper bound (which is not even 256, more like 26 or so), so the whole lookup is O(n), linear in the search string length. Fast. However, with just a lookup the search would show results only if the user entered the whole search string (for example min()). That would be wasting users’ time and not really improving API discoverability. The lookahead autocompletion instead performs a breadth-first tree search, collecting result indices along the way, shortest results first.
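The two phases — exact lookup of the typed string, then breadth-first lookahead — can be sketched in Python like this (the real client walks the binary data in JavaScript; this toy trie also omits the full prefix insertion for brevity):

```python
from collections import deque

def search(trie, query):
    # Phase 1: consume the query one character at a time. Each child
    # lookup is bounded, so this is linear in the query length.
    node = trie
    for char in query.lower():
        node = node['children'].get(char)
        if node is None:
            return []
    # Phase 2: breadth-first lookahead, so results reachable through the
    # shortest completed suffix come out first.
    found, queue = [], deque([node])
    while queue:
        node = queue.popleft()
        found += node['results']
        queue.extend(node['children'].values())
    return found

# toy trie with three entries (full prefix insertion omitted here)
trie = {'results': [], 'children': {}}
for name, result_id in [('magnum', 0), ('math', 1), ('min()', 6)]:
    node = trie
    for char in name:
        node = node['children'].setdefault(
            char, {'results': [], 'children': {}})
    node['results'].append(result_id)
```

Searching for m here yields math, then min(), then magnum: breadth-first order is exactly shortest-completion-first, with no sorting step needed.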
With the above trie, searching for just m, the autocompletion would collect the following output:

    [{ title: 'Magnum::Math::min()', // matching the `min` suffix
       url: 'namespaceMagnum_1_1Math.html#ae22ef0cb2a5a5e4c5e626a3df670be21' },
     { title: 'Magnum::Math::Range::min()', // matching the `min` suffix
       url: 'classMagnum_1_1Math_1_1Range.html#a22af2191e4ab88b45f082ef14aa45185' },
     { title: 'Magnum::Math::Vector::min()', // matching the `min` suffix
       url: 'classMagnum_1_1Math_1_1Vector.html#af029f9f7810201f0bd8d9580af273bde' },
     { title: 'Magnum::Math', // matching the `Math` suffix
       url: 'namespaceMagnum_1_1Math.html' },
     { title: 'Magnum', // matching the `Magnum` suffix
       url: 'namespaceMagnum.html' },
     { title: 'Magnum::Math::min()', // matching the `Math::min` suffix
       url: 'classMagnum_1_1Math_1_1Vector.html#af029f9f7810201f0bd8d9580af273bde' },
     { title: 'Magnum::Math::Range', // matching the `Math::Range` suffix
       url: 'classMagnum_1_1Math_1_1Range.html' },
     ...

Unlike looking up the search string, this is the bottleneck — and as just about everything in Magnum starts with m (and not just C++ symbols; don’t forget about preprocessor macros, directories and files), such autocompletion can spew out endless pages of results on entering the first character.

Lookahead barriers

Looking at the above, something seems off. It’s definitely not the user’s expectation to learn about every symbol that ever existed after entering a bunch of characters. For example, when I’m searching for a namespace, I am not interested in all its members. Or when some prefix appears both in a function name and in the enclosing namespace, I’d get the result twice in the list. So how to solve that? When adding symbols to the trie, let’s add barriers that signal to the lookahead autocompletion that it should not dig deeper looking for possible results. However, I don’t want to prevent API discovery, so when one enters e.g. Magnum:, the search should go past the barrier and show all members of the namespace. (But not nested members!) In the following visualization, the $ characters denote a lookahead barrier. If the autocompletion reaches a barrier, it will not attempt to continue past it.
    magnum [0]
    ||| : $
    ||| :math [1]
    ||| : $
    ||| :vector [2]
    ||| | : $
    ||| | :min [3]
    ||| range [4]
    ||| | : $
    ||| | :min [5]
    ||| min [6]
    || th [1]
    ||| : $
    ||| :vector [2]
    ||| | : $
    ||| | :min [3]
    ||| range [4]
    ||| | : $
    ||| | :min [5]
    ||| min [6]
    | in [3, 5, 6]
    vector [2]
    | : $
    | :min [3]
    range [4]
    | : $
    | :min [5]

With these barriers in place, searching for m gives the following output — unique symbols starting with m, sorted from shortest to longest:

    [{ title: 'Magnum::Math::min()',
       url: 'namespaceMagnum_1_1Math.html#ae22ef0cb2a5a5e4c5e626a3df670be21' },
     { title: 'Magnum::Math::Range::min()',
       url: 'classMagnum_1_1Math_1_1Range.html#a22af2191e4ab88b45f082ef14aa45185' },
     { title: 'Magnum::Math::Vector::min()',
       url: 'classMagnum_1_1Math_1_1Vector.html#af029f9f7810201f0bd8d9580af273bde' },
     { title: 'Magnum::Math',
       url: 'namespaceMagnum_1_1Math.html' },
     { title: 'Magnum',
       url: 'namespaceMagnum.html' }]

Searching for math gives just this:

    [{ title: 'Magnum::Math',
       url: 'namespaceMagnum_1_1Math.html' }]

While searching for math: provides a list of all its immediate members:

    [{ title: 'Magnum::Math::min()',
       url: 'namespaceMagnum_1_1Math.html#ae22ef0cb2a5a5e4c5e626a3df670be21' },
     { title: 'Magnum::Math::Range',
       url: 'classMagnum_1_1Math_1_1Range.html' },
     { title: 'Magnum::Math::Vector',
       url: 'classMagnum_1_1Math_1_1Vector.html' }]
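The barrier behavior can be sketched as a small variant of a breadth-first trie search (a toy model with a hypothetical 'barrier' flag on each node; the real implementation encodes this differently in the binary data). Consuming the typed query may step over barriers, but the lookahead afterwards never expands past one:

```python
from collections import deque

def insert(trie, name, result_id):
    # The first ':' of every '::' becomes a lookahead barrier, the '$'
    # in the visualization above.
    name = name.lower()
    node = trie
    for i, char in enumerate(name):
        node = node['children'].setdefault(
            char, {'results': [], 'children': {}, 'barrier': False})
        if name[i:i + 2] == '::':
            node['barrier'] = True
    node['results'].append(result_id)

def search(trie, query):
    node = trie
    for char in query.lower():       # typing may consume barriers
        node = node['children'].get(char)
        if node is None:
            return []
    found, queue = [], deque([node])
    while queue:                     # BFS that stops at barriers
        node = queue.popleft()
        found += node['results']
        queue.extend(c for c in node['children'].values()
                     if not c['barrier'])
    return found

trie = {'results': [], 'children': {}, 'barrier': False}
insert(trie, 'Magnum', 0)
insert(trie, 'Magnum::Math', 1)
insert(trie, 'Magnum::Math::min()', 6)
```

With this, magnum yields only the namespace itself, while magnum: steps one level past the barrier and lists its immediate (but not nested) members, matching the behavior described above.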

Extra goodies

Just replicating the stock search functionality and making it faster would be boring, so I added some cherries on top. The first is propagating info about deprecated and deleted functions to the search results (and the member listing as well), to save the users some extra clicks when trying to find a function that’s actually callable. You can see it in the GIF above. The second feature is the ability to specify additional search keywords for particular symbols. This is useful especially in the OpenGL wrapping layer of Magnum — one can search for a well-known OpenGL symbol and get to know which API it corresponds to. This is exposed through the custom @m_keywords, @m_keyword and @m_enum_values_as_keywords commands; head over to the docs for details. A self-explanatory example:

    /**
     * @brief Set texture storage
     *
     * @m_keywords{glTexStorage2D() glTextureStorage2D()}
     */
    Texture2D& Texture2D::setStorage(...);

    /**
     * @brief Renderer feature
     *
     * @m_enum_values_as_keywords
     */
    enum class RendererFeature: GLenum {
        /** Depth test */
        DepthTest = GL_DEPTH_TEST,

        ...
    };

Final implementation

In this article I tried to explain mainly the key aspects of the search implementation. The rest is important too, but would make the article too long and boring. In order to properly highlight the currently typed part of a search result, the trie lookup counts the number of suffix characters when gathering the results. This is extended further in order to display function argument lists and const / r-value overloads (which are not part of the trie data).

Due to restrictions of Chromium-based browsers, it’s not possible to download data using XMLHttpRequest when served from a local filesystem. Because that’s a very common use case, I had to work around this by converting the binary to a Base85-encoded *.js file that’s loaded directly via <script>.

A lot of effort went into client-side UX, like keyboard navigation, cutting off overly long prefixes on the proper side, avoiding page jumps etc. There are still some known issues left to be resolved.

Some stats

With the current state of the Magnum documentation, where there are 7565 unique symbols and I’m in the middle of adding @m_keywords to all OpenGL wrapper APIs, the data size is as follows, depending on what features are enabled:

    Data type                              Uncompressed    Gzipped
    Baseline, binary                       2635.6 kB       884.2 kB
    Subtree merging, binary                2033.3 kB       537.8 kB
    Subtree and prefix merging, binary     959.3 kB        501.5 kB
    Subtree and prefix merging, base85     1202.5 kB       752.5 kB

As the docs are accessed through a webserver, I can use the binary and enable server-side gzip compression. That gives slightly below half a megabyte, which, in comparison to the sizes of contemporary websites, is more than okay. The documentation is updated no more than once a week, so browsers will usually just happily serve this dataset from cache. One important thing to note: the prefix merging doesn’t give that much of an advantage when comparing the gzipped sizes, as the data is already very nicely compressible.