A few weeks ago I set out to create a new multipart/form-data parser for node.js. We need this parser for the new version of transloadit that we have been working on since our setback last month.

The result is a new library called formidable, which, on a high level, makes receiving file uploads with node.js as easy as:

var formidable = require('formidable'),
    http = require('http'),
    sys = require('sys');

http.createServer(function(req, res) {
  if (req.url == '/upload' && req.method.toLowerCase() == 'post') {
    // parse a file upload
    var form = new formidable.IncomingForm();
    form.parse(req, function(fields, files) {
      res.writeHead(200, {'content-type': 'text/plain'});
      res.write('received upload:\n\n');
      res.end(sys.inspect({fields: fields, files: files}));
    });
    return;
  }

  // show a file upload form
  res.writeHead(200, {'content-type': 'text/html'});
  res.end(
    '<form action="/upload" enctype="multipart/form-data" method="post">'+
    '<input type="text" name="title"><br>'+
    '<input type="file" name="upload" multiple="multiple"><br>'+
    '<input type="submit" value="Upload">'+
    '</form>'
  );
});

Essentially this works similarly to other platforms, where file uploads are saved to disk before your script is invoked with a path to the uploaded file.

What's nice here, however, is that you can also hook into the whole thing at a lower level:

form.parse(req);
form.addListener('file', function(field, file) {
  // file looks like this:
  // {path: '...', filename: '...', mime: '...'}
});

We use that interface for processing HTML5 multi-file uploads as they come in, rather than waiting for the entire upload to finish.

You could even override the onPart handler, which gives you direct access to the raw data stream:

form.onPart = function(part) {
  part.addListener('data', function(chunk) {
    // do cool stuff, like streaming incoming files somewhere else
  });
};

All of this is possible thanks to the underlying multipart parser, which makes heavy use of node.js buffers.

Buffers in node are basically just (efficient) arrays of raw memory that you can access byte by byte. The parser works by looping over each incoming buffer chunk, while maintaining as little state as possible to do its work:
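For illustration, here is what byte-by-byte buffer access looks like (Buffer.from is the modern API; the node of this era used the Buffer constructor instead):

```javascript
// A buffer is indexable byte by byte; each index yields the raw byte
// value (0-255) at that position, which is exactly what the parser's
// state machine compares against the boundary bytes.
var boundary = Buffer.from('--AaB03x');

console.log(boundary.length); // → 8
console.log(boundary[0]);     // → 45, the byte value of '-'
console.log(String.fromCharCode(boundary[2])); // → 'A'
```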

// simplified excerpt from MultipartParser.write
// chunk = Buffer of incoming data
// this.index = position within the boundary (a boundary can span chunks)

for (var i = 0; i < chunk.length; i++) {
  var character = chunk[i];
  switch (this.state) {
    case 'BOUNDARY-BEGIN':
      if (character != this.boundary[this.index]) {
        // unexpected character, abort parsing
        return 0;
      }
      this.index++;

      if (this.index == this.boundary.length) {
        // emit event, advance to next state
        this.onPartHeaderBegin();
        this.state = 'HEADER-BEGIN';
      }
      break;
    case 'HEADER-BEGIN':
      // ...
      break;
  }
}

But, as you can imagine, this approach turned out to be somewhat limited in speed. I was only able to get about 16-20 MB/s out of it. My goal, however, was to get somewhere around 125 MB/s, enough to saturate a 1 Gbit network connection.

So I started to look for ways to optimize. The data I was parsing looks like this:

--AaB03x
content-disposition: form-data; name="title"

A picture of me on my unicycle!
--AaB03x
content-disposition: form-data; name="pic"; filename="unicycle.jpg"
Content-Type: image/jpeg

... binary data ...
--AaB03x--

The sequence here being:

1. A series of boundary characters (--AaB03x), announcing the beginning of a new part
2. Headers for that part
3. \r\n\r\n
4. Data for this part
5. A series of boundary characters, announcing the end of the part or the end of the stream

What stands out is the fact that no parsing is needed for step #4. All the data of a part itself is a plain octet stream. So after talking to Ryan about it, he recommended looking into the Boyer-Moore algorithm to speed things up.

The algorithm is usually the fastest method for finding a substring within a string of text. It basically works by analyzing the needle string and building a lookup table of its characters in order to efficiently skip over parts of the haystack.
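For reference, the core skip idea can be sketched with the Boyer-Moore-Horspool variant, a simplification of full Boyer-Moore that keeps only the bad-character table (this is an illustration, not formidable's code):

```javascript
// Boyer-Moore-Horspool: for each character of the needle (except the
// last), record how far the search window may shift when that character
// appears as the last byte of the current window.
function horspoolSearch(haystack, needle) {
  var skip = {};
  for (var i = 0; i < needle.length - 1; i++) {
    skip[needle[i]] = needle.length - 1 - i;
  }

  var pos = 0;
  while (pos + needle.length <= haystack.length) {
    // compare the current window right-to-left
    var j = needle.length - 1;
    while (j >= 0 && haystack[pos + j] == needle[j]) {
      j--;
    }
    if (j < 0) {
      return pos; // match found
    }
    // shift by the skip value of the window's last character, or by the
    // full needle length if that character never occurs in the needle
    var last = haystack[pos + needle.length - 1];
    pos += skip[last] || needle.length;
  }
  return -1;
}
```

The same code works on strings or buffers, since both are indexable. The `skip[last] || needle.length` fallback is the big win: a window ending in a character that is not in the needle at all lets the search leap ahead by the needle's full length.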

But implementing it was not exactly easy. The algorithm is not trivial, and many of the example implementations I found were wrong. That, however, was not the problem. The real challenge was that I was working with a stream of data, rather than a big string I had full access to.

This meant keeping lots of additional state in the parser, as well as creating a very complicated piece of code. I like challenges, but I also like efficiently using my time, so I started looking for a shortcut.

And then it hit me. Most of the Boyer-Moore algorithm is designed to improve the performance of the worst-case scenario. The worst case for this problem is hitting a character in the haystack that is also a character in the needle string. Boyer-Moore deals with this case by knowing the offset of each character in the needle string, so it can maximize the number of characters to skip in any situation.

But file uploads rarely cause these worst-case scenarios! With human text, character repetition is pretty high. But file uploads are binary data, so most bytes are likely to fall outside the ASCII range of characters usually used for the boundary.
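To make "most bytes" concrete: the example boundary --AaB03x contains only 7 distinct byte values, so a uniformly random data byte collides with the boundary's byte set less than 3% of the time (real file data is not uniform, but the intuition holds):

```javascript
// Back-of-the-envelope check of how rarely a random byte collides with
// the boundary's byte set (assumes uniformly distributed data bytes).
var boundary = Buffer.from('--AaB03x');

var distinct = {};
for (var i = 0; i < boundary.length; i++) {
  distinct[boundary[i]] = true;
}
var count = Object.keys(distinct).length;

console.log(count); // → 7 distinct byte values
console.log((count / 256 * 100).toFixed(1) + '%'); // → 2.7% collision chance
```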

That made the solution much simpler. All I had to do was generate a list of all characters in the boundary; whenever I hit a character that was not in that list, I knew I could safely skip the full length of the boundary:

while (i + boundary.length <= chunk.length) {
  if (chunk[i + boundary.length - 1] in boundaryChars) {
    // worst case: go back to byte-by-byte parsing until a non-matching char occurs
    break;
  }
  i += boundary.length;
}
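For completeness, here is a self-contained sketch of that skip, including how the boundaryChars lookup can be built (the names mirror the excerpt above; the boundary value is just the example one):

```javascript
// Build a lookup of every byte value occurring in the boundary; any
// chunk byte outside that set lets us leap forward by the boundary's
// full length, since no boundary match could cover that byte.
var boundary = Buffer.from('--AaB03x');

var boundaryChars = {};
for (var b = 0; b < boundary.length; b++) {
  boundaryChars[boundary[b]] = true;
}

function skipAhead(chunk, i) {
  // advance i in boundary-sized jumps while the byte at the end of the
  // window cannot possibly be part of a boundary match
  while (i + boundary.length <= chunk.length) {
    if (chunk[i + boundary.length - 1] in boundaryChars) {
      // worst case: fall back to byte-by-byte parsing from here
      break;
    }
    i += boundary.length;
  }
  return i;
}

// a chunk of "binary" data containing no boundary bytes at all
console.log(skipAhead(Buffer.from('________________'), 0)); // → 16 (two jumps of 8)
// a chunk that starts with the boundary: no skipping possible
console.log(skipAhead(Buffer.from('--AaB03x________'), 0)); // → 0
```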

This resulted in an incredible speedup, allowing the parser to handle uploads at 500 MB/s. Parsing is even faster with a longer boundary sequence; short boundaries and frequent worst-case hits will slow things down.

The benchmark I created uses an actual boundary generated by Mozilla Firefox. Your mileage may vary slightly.

The whole thing could still be optimized further, but at this point I believe it is fast enough to make other parts of the system more likely to become the bottleneck.

Formidable is licensed under the MIT license. Questions & suggestions regarding the library, node.js or the parser would be most welcome.

--fg