Fast HTTP body searching in JavaScript to parse strings (Part 1)

Read time: 8 minutes

15/10/2019

Searching for one piece of text in another is an expensive business. searching small amounts of text is pretty performant, say:

1 2 3 const myText = "toothbrushes" myText . search ( "tooth" ) // 0 myText . search ( "brush" ) // 5

innocent enough, right? works for most of us and would actually be slower to use any other kind of algorithm to search.

But, what if you had to search a 1mb file upload, 100mb? 1GB? well, thats just going to be undeniably slow.

There are lots of papers on the Boyer-Moore family of algorithms but I'm going to write about the Boyer-Moore Horspool algorithm. It is a text searching algorithm with a low complexity and medium memory footprint.

I built my own to

learn more about this type of algorithm

Write a full test suite (which others don't have)

Fully type it with TypeScript

It works by constructing a table of characters present within the search string and using that to actually skip checking a certain number of characters entirely.

Your browser probably uses the Boyer-Moore algorithm every time you search in a web page.

What am I using it for?

I'm using it in the rebuild of Multicolour to help parse incoming form data and file uploads in multipart format.

Typically a multipart/form-data body is split by a token to indicate the start and end of each field. Here is an example multipart payload separated by --boundary

1 2 3 4 5 6 7 8 9 10 11 12 13 POST /test.html HTTP/1.1 Host: example.org Content-Type: multipart/form-data;boundary="boundary" --boundary Content-Disposition: form-data; name="field1" value1 --boundary Content-Disposition: form-data; name="field2"; filename="example.txt" My text file contents, this is super secret. Don't tell the NSA or the aliens will come and get us and take us to Zona. --boundary--

We can see two fields in this payload. field1 is a form-data field. field2 is a piece of form-data too. Except, we see it has a filename , this is the file that was uploaded.

We can also see that the boundary token is specified in the Content-Type header which is handy!

Well, thats nice and all but what the heck do we do with it? Well, we have a couple of important pieces of information in the request. We have the boundary token and we know its .length . This + our body is all we need to get started parsing this.

The first thing we need to do is create some storage for each known character of the search text. We'll call this our "bad char table". I'm going to use a Buffer to construct this table as we'll be comparing against a buffer in our search body.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 function makeBadCharTable ( pattern : string ) : Buffer { const DICT_SIZE = 65535 const badCharShift : Buffer = Buffer . allocUnsafe ( DICT_SIZE ) const truePatternLength = pattern . length - 1 // Populate the table with the default. for ( let i = 0 ; i <= DICT_SIZE ; i ++ ) { badCharShift [ i ] = pattern . length } // Add our offsets. for ( let char = 0 ; char < truePatternLength ; char ++ ) { badCharShift [ pattern [ char ] ] = truePatternLength - char } return badCharShift }

Theres a few things happening here, first I declare the size of our dictionary. This is a "magic" number but it happens to be the maximum code point for a character in JavaScript. See MDN

I then pre-populate our table with the un true length of the pattern. I do this because if we hit a mismatch while we're searching later, we want to skip to the true length of the pattern + 1 and JavaScript does this for us anyway because 0-indexed arrays.

Then we loop over the length of the pattern and we assign a value to it which is true length of string - index we're at . This is how many characters we're going to skip when there's a mismatch.

The eagle eyed might notice I'm using Buffer.allocUnsafe(DICT_SIZE) and wince but this is perfectly safe here as we write over it with our default values straight away instead of letting Node write it with 0s and then giving it to us.

Then we return our table of indexes.

You can speed this process up by memoizing this function. Why not try it and let me know how you did?

#Time to search

We're going to search our body now and get the indexes of each occurance of our boundary. This information is what we need to slice up our body Buffer into it's separate fields for further parsing.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 function search ( text : Buffer , pattern : string , badCharShift : Buffer ) { const results : number [ ] = [ ] let skip = 0 haystackLoop : for ( let haystackChar = 0 , maxTextChar = text . length - 1 ; haystackChar <= maxTextChar ; haystackChar += skip ) { pattern : for ( let needleChar = pattern . length - 1 ; needleChar >= 0 ; needleChar -- ) { const lookupIndex = haystackChar + needleChar skip = badCharShift [ text [ haystackChar + ( this . pattern . length - 1 ) ] ] if ( text [ lookupIndex ] !== pattern [ needleChar ] ) { break pattern } else if ( needleChar === 0 ) { results . push ( haystackChar ) skip = pattern . length break pattern } } }

Wow, there's a lot happening here. Let's break it down.

Goals

We want the indexes of each occurance of --boundary . We assume the last index is the closing token for the body so it's trailing -- is moot.

. Do this using only the body, bad char table and our search pattern.

Do this correctly and quickly.

I'm using labelled loops for readability. You can ommit the labels without changing behaviour.

So, what the heck is happening?

First up is we start to loop over every character in the body, we then start another loop over the search text from it's last character to the first. We do this because if there's a mismatch we can at least skip the entire length of the search pattern + 1. This is where the algorithm shines, when you have a large search string you skip more characters and the performance of this algorithm actually decreases the shorter the search string.

When we hit a character that matches, we move onto the next character in the search text until either theres no more search text which is a full match and we push to our results array or we skip the number of characters specified by our bad char table on a mismatch.

This algorithm is super fast for getting the locations of one string in a massive amount of data. However, it is utterly utterly rubbish and wasteful on small haystacks and with tiny search strings. The worst case would be not finding a match and a single character search.

This helps me to take a payload in a request and very quickly parse it into fields to pass to a handler and it's great for searching text in files.