Text processing in node.js has historically been both slow and cumbersome. But don’t fret, there’s hope on the horizon! The new async iterators proposal solves the “cumbersome” problem brilliantly. But sadly, it misses an opportunity to make text processing fast, too. While your code may look cleaner with async iterators, it’ll still be as slow as ever.

Read on for the sordid tale…

Python: Living the Text Processing Dream

The Python code to read a file line by line is beautifully concise:

for line in open('filename'):
    process(line)

But while this code looks simple on the surface, there’s quite a bit going on behind the scenes. Control is alternating back and forth between your Python code (for line in...) and the Python interpreter, which is reading the file from disk. Python is making some decisions for you about how to read the file (how many bytes at a time should it read before yielding control?) and it’s figuring out the line delimiters. Memory consumption stays low, since the whole file needn’t be read all at once.

What’s also impressive is that this code is fast! On my 2015 MacBook with an SSD, it can process around 200MB/sec. That’s well within an order of magnitude of wc -l, which can process ~800MB/sec.

Reading a CSV file is still concise, though a bit slower:

import csv

for row in csv.reader(open('file.csv')):
    process(row)  # row is a list of values

On my machine, this drops us down to about 35MB/sec. This combination of concision and speed makes Python a great choice for processing any kind of text file.

Node Streams: Slow and Messy

For those of us who work in node.js, the situation isn’t so good. Until recently, processing a text file required either loading it all into memory synchronously or using streams and callbacks, concepts which are intimidating for beginners and cumbersome for experts. (Update: it’s been pointed out that this isn’t true; you can be synchronous and read the file piecemeal. See response below)
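The synchronous, piecemeal approach mentioned in the update above can be sketched with a plain generator. This is only an illustration under some assumptions: linesSync is a hypothetical helper name, the file is assumed to be UTF-8 with \n line endings, and the 64KB default chunk size is an arbitrary choice.

```javascript
const fs = require('fs');
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: yields the lines of a file synchronously,
// reading one fixed-size chunk at a time instead of the whole file.
function* linesSync(filename, chunkSize = 64 * 1024) {
  const fd = fs.openSync(filename, 'r');
  const buffer = Buffer.alloc(chunkSize);
  // StringDecoder holds back incomplete multi-byte characters that
  // happen to be split across two chunks.
  const decoder = new StringDecoder('utf8');
  let leftover = ''; // partial line carried over from the previous chunk
  try {
    let bytesRead;
    while ((bytesRead = fs.readSync(fd, buffer, 0, chunkSize, null)) > 0) {
      const text = leftover + decoder.write(buffer.subarray(0, bytesRead));
      const parts = text.split('\n');
      leftover = parts.pop(); // the last piece may be an unfinished line
      yield* parts;
    }
    leftover += decoder.end();
    if (leftover.length > 0) yield leftover; // final line without a trailing \n
  } finally {
    fs.closeSync(fd);
  }
}
```

With this in place, for (const line of linesSync('filename')) { process(line); } reads much like the Python loop, at the cost of blocking the event loop during each read.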

Here’s what a “simple” program looks like in Node using the popular csv-parse module:

[Code listing omitted] Reading a CSV file in node.js. Does this seem more complicated than the Python?

Holy cow! That’s a lot more complicated. We’re using callbacks, streams and pipes. And our logic is split across three separate callbacks.

Even worse, this code is slow as molasses! On my machine, it can process 3MB/sec of CSV data. That’s over 10x slower than the Python CSV reader, which managed 35MB/sec!
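The csv-parse listing itself is not reproduced in this text. As a rough stand-in that shows the same callback-driven shape, here is a sketch that reads plain lines (not CSV rows) using only Node's built-in fs and readline modules. The readLines helper and its three callback parameters are hypothetical names chosen for illustration, and this is not the program that was benchmarked.

```javascript
const fs = require('fs');
const readline = require('readline');

// Hypothetical helper: the logic ends up split across three callbacks,
// one per line, one for the end of the file, and one for errors.
function readLines(filename, onLine, onEnd, onError) {
  const input = fs.createReadStream(filename, { encoding: 'utf8' });
  const rl = readline.createInterface({ input, crlfDelay: Infinity });
  let failed = false;

  input.on('error', (err) => {
    failed = true; // make sure onEnd is not also called
    rl.close();
    onError(err);
  });
  rl.on('line', (line) => onLine(line));
  rl.on('close', () => {
    if (!failed) onEnd();
  });
}
```

A caller that just wants to count lines has to keep its counter outside the callbacks, for example: let n = 0; readLines('file.csv', () => { n += 1; }, () => console.log(n), (err) => console.error(err));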

Async iteration to the rescue?

ES2015 (aka ES6) introduced iterators and generators. These provide a clean way to produce new iterables: you implement a next() method which returns a {done, value} pair. A for-of loop will then iterate over the values, finishing when the done field is set.
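To make the protocol concrete, here is a minimal hand-written iterable. The range function is a hypothetical example, not part of any library; a generator function would produce an equivalent object with far less ceremony.

```javascript
// Hypothetical example: an iterable over the integers in [from, to).
function range(from, to) {
  return {
    // for-of calls this method once to obtain an iterator.
    [Symbol.iterator]() {
      let i = from;
      return {
        // for-of then calls next() repeatedly until done is true.
        next() {
          return i < to
            ? { done: false, value: i++ }
            : { done: true, value: undefined };
        },
      };
    },
  };
}

const seen = [];
for (const value of range(0, 3)) {
  seen.push(value);
}
// seen is now [0, 1, 2]
```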

Unfortunately, generators and iterators are synchronous. Unless we’re willing to block on synchronous reads or to read our entire file into memory, they can’t help us.

That’s why I was excited about the new proposal for asynchronous iterators and generators. Rather than returning a {done, value} pair, an asynchronous iterator returns Promise<{done, value}>. This makes it possible to read a file line-by-line using a nice syntax:

[Code listing omitted] Reading a file line-by-line using async iterators. This looks much cleaner than the streams!

There’s some boilerplate to put ourselves in an async function, but the meat of the program looks much more like the concise Python version from the start of this post. This is a big improvement!
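The listing itself is not reproduced in this text, so here is a sketch of what a line-by-line async iterator could look like. It is not the original code: the lines and countLines helpers are hypothetical, and the sketch assumes a recent version of Node in which readable streams can be consumed directly with for await, rather than the TypeScript-compiled setup used for the measurements in this post.

```javascript
const fs = require('fs');

// Hypothetical helper: an async generator that yields one line at a time.
async function* lines(filename) {
  let leftover = ''; // partial line carried over from the previous chunk
  // With an encoding set, each chunk arrives as an already-decoded string.
  const stream = fs.createReadStream(filename, { encoding: 'utf8' });
  for await (const chunk of stream) {
    const parts = (leftover + chunk).split('\n');
    leftover = parts.pop();
    for (const line of parts) {
      yield line;
    }
  }
  if (leftover.length > 0) yield leftover; // final line without a trailing \n
}

// The boilerplate: for await is only allowed inside an async function.
async function countLines(filename) {
  let count = 0;
  for await (const line of lines(filename)) {
    count += 1;
  }
  return count;
}
```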

But what about the performance? I got 7.8MB/sec. Compare this with 200MB/sec for the equivalent Python code. Not so great!

(note: I’m using TypeScript to convert the async iterators and generators into something that node.js can run. See my code here.)

We can get a better sense of why this is still so slow by creating the simplest async iterator imaginable:

async function* asyncRange(from, to) {
  for (let i = from; i < to; i++) {
    yield i;
  }
}

(async () => {
  for await (const i of asyncRange(0, 550000)) {}
})().catch(e => console.error(e));

My CSV file has 550,000 lines.

The example to read it line-by-line (above) ran in 4 seconds.

This trivial async iterator runs in 3 seconds.

The equivalent trivial synchronous iterator runs in 0.2 seconds.

It’s the Promises that are killing our performance!
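If you want to repeat the comparison yourself, a small harness along these lines should do. The timeSync and timeAsync helpers are hypothetical names, and the numbers you get will depend heavily on your Node version and on whether the async generator runs natively or is compiled down by TypeScript, so don't expect to reproduce the exact timings above.

```javascript
const { performance } = require('perf_hooks');

function* syncRange(from, to) {
  for (let i = from; i < to; i++) yield i;
}

async function* asyncRange(from, to) {
  for (let i = from; i < to; i++) yield i;
}

// Hypothetical helper: time a plain for-of loop over n values.
function timeSync(n) {
  const start = performance.now();
  let count = 0;
  for (const _ of syncRange(0, n)) count += 1;
  return { count, ms: performance.now() - start };
}

// Hypothetical helper: time a for-await loop over n values.
// Every iteration awaits a promise, which is where the overhead comes from.
async function timeAsync(n) {
  const start = performance.now();
  let count = 0;
  for await (const _ of asyncRange(0, n)) count += 1;
  return { count, ms: performance.now() - start };
}
```

Calling timeSync(550000) and then timeAsync(550000).then(r => console.log(r.ms)) gives the two figures to compare.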

A missed opportunity

The problem is that Promises have to be resolved in the next turn of the event loop. This is part of the spec. There’s no way around it. Spinning the event loop 550,000 times is going to be slow, no matter what your code does.

(Update: it’s been pointed out that this isn’t quite correct; the spec allows multiple promises to be resolved in a single tick, even if contemporary browsers and node don’t do it that way. So there’s hope that this will get better!)
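One way to see what the update above is getting at is to watch the order in which callbacks actually run. In this sketch (observeOrdering is a hypothetical helper), the promise callbacks never run synchronously, but both of them run before a zero-delay timer fires, so a promise callback does not have to wait for its own timer-style turn.

```javascript
// Hypothetical helper: records the order in which callbacks run.
function observeOrdering() {
  return new Promise((resolve) => {
    const order = [];
    // A zero-delay timer: runs in a later turn of the event loop.
    setTimeout(() => {
      order.push('timeout');
      resolve(order);
    }, 0);
    // Promise callbacks: deferred until the current synchronous code finishes.
    Promise.resolve().then(() => order.push('promise 1'));
    Promise.resolve().then(() => order.push('promise 2'));
    order.push('sync');
  });
}

// observeOrdering() resolves to ['sync', 'promise 1', 'promise 2', 'timeout']
```

This only demonstrates scheduling order. It says nothing about how expensive each promise is, which is what the timings in this post measure.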

The solution is simple: yield the lines in batches, so that multiple lines are processed in each turn of the event loop:

[Code listing omitted] By reading lines in batches, we can get good performance at the cost of exposing an implementation detail.

This runs in ~350ms, equivalent to ~90MB/sec. This is within a factor of two of the equivalent Python.
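The batching listing is not reproduced in this text either, so here is a sketch of the idea rather than the code that was timed. It assumes a recent Node version where readable streams work with for await; lineBatches is the name used later in this post, while the chunkSize parameter and the countLines helper are hypothetical additions.

```javascript
const fs = require('fs');

// Yields arrays of lines: one promise per chunk read from disk,
// rather than one promise per line.
async function* lineBatches(filename, chunkSize = 64 * 1024) {
  let leftover = ''; // partial line carried over from the previous chunk
  const stream = fs.createReadStream(filename, {
    encoding: 'utf8',
    highWaterMark: chunkSize,
  });
  for await (const chunk of stream) {
    const batch = (leftover + chunk).split('\n');
    leftover = batch.pop();
    if (batch.length > 0) yield batch;
  }
  if (leftover.length > 0) yield [leftover]; // final line without a trailing \n
}

// The caller has to know about the batches: an async outer loop
// and a plain synchronous inner loop.
async function countLines(filename) {
  let count = 0;
  for await (const batch of lineBatches(filename)) {
    for (const line of batch) {
      count += 1;
    }
  }
  return count;
}
```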

The cost of this speed is that we’ve exposed an implementation detail (the batching) that Python hides. What’s more, you can’t write a function to encapsulate the batching without killing your performance, since you’ll be back to one Promise, and one turn of the event loop, per line:

// This is slow again.
export async function* lines(filename) {
  for await (const batch of lineBatches(filename)) {
    for (const line of batch) {
      yield line;
    }
  }
}

A simple solution to this would have been to make an AsyncIteratorResult include a values array, rather than a value scalar. This would mean that multiple values could be processed per turn of the event loop, rather than a single value. This corresponds to something like a paginated result, where you get many values on each network response. If you wanted a separate turn of the event loop for each value, you could write a function to do that:

export async function* onePerEventLoop(asyncIterator) {
  for await (const value of asyncIterator) {
    yield value;
  }
}

This would slow down the CSV example, but you’d be knowingly paying that cost. Being fast would at least be possible.
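To illustrate the idea, here is a sketch of what such a protocol might look like if you wrote it by hand. Everything in it is hypothetical: the { done, values } result shape is not part of any spec or proposal, and batchedRange and forEachValue are made-up names. The consumer function stands in for what a batch-aware for await loop might do internally.

```javascript
// Hypothetical producer: next() resolves to { done, values }
// instead of the real protocol's { done, value }.
function batchedRange(from, to, batchSize) {
  let i = from;
  return {
    async next() {
      if (i >= to) return { done: true, values: [] };
      const values = [];
      while (i < to && values.length < batchSize) {
        values.push(i++);
      }
      return { done: false, values };
    },
  };
}

// Hypothetical consumer: one await per batch, then a plain
// synchronous loop over the values in that batch.
async function forEachValue(batchedIterator, callback) {
  while (true) {
    const { done, values } = await batchedIterator.next();
    if (done) return;
    for (const value of values) {
      callback(value);
    }
  }
}
```

The caller's code stays flat, as in forEachValue(batchedRange(0, 550000, 1000), process), while the number of awaits drops from one per value to one per batch.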

Conclusions

Async iterators finally give node.js an alternative to streams with a clean, concise syntax. Unfortunately, they do it in a way that makes them inherently slow. They represent a missed opportunity to overcome node’s difficulty doing text processing compared to languages like Python.