At first glance, JavaScript may not seem like the ideal language for scripts or applications that manipulate large amounts of data. It's (mostly) single-threaded, limited in how much memory it can allocate, and powered by an improving but not quite industry-leading garbage collector. There are still good reasons to use JavaScript, though, such as the ease of running in web browsers and the diverse ecosystem of packages available. This post goes over a few key problem areas and how you can avoid them.

Data IO

While this is far from a rarely-used optimisation, it's a critical first step for anyone new to using JavaScript for data-heavy tasks. For those who don't know, JavaScript code runs in a single thread, which means your application can only execute one piece of code at a time. When working with data stored in files or on the web, reading or writing can be a massive bottleneck. Luckily, JavaScript is a concurrent language: IO operations run in the background while the rest of your code carries on. By passing multiple fetch or fs.promises.readFile calls to Promise.all(), you can start all of the file or network requests at once and wait for every one of them to complete before continuing. For more information about this, check out Mozilla's guide on using promises. If you instead await each call on its own, the application will perform each file read or network request one after the other, which could be significantly slower depending on the circumstances.
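As a minimal sketch of the difference, assuming NodeJS and its built-in fs/promises module, the two functions below read the same list of files. The function names are hypothetical and only for illustration.

```javascript
const { readFile } = require('fs/promises');

// Concurrent: every read is started first, then all of them are awaited together.
// The results come back in the same order as the input paths.
async function readAllConcurrently(paths) {
  return Promise.all(paths.map((p) => readFile(p, 'utf8')));
}

// Sequential, for comparison: each read only starts once the previous one has finished.
async function readAllSequentially(paths) {
  const contents = [];
  for (const p of paths) {
    contents.push(await readFile(p, 'utf8'));
  }
  return contents;
}
```

Both return the same data; the concurrent version simply overlaps the waiting time. Note that Promise.all rejects as soon as any single read fails, so handle errors accordingly.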

Chunk data

Ideally, operate on as small a set of data as possible at any one time. As NodeJS by default has a low memory limit, and most browsers heavily limit the amount of memory a page can use, performing the operations on chunks of data can make data-heavy processing possible. Whether you can do this depends on what the operations are, however. In cases where you can't do any processing before seeing all of the data, but only need a small subset of it, one common technique is to read the file twice. On the first read, gather only the information the processing depends on, and then on the second read, do the actual processing.
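The two-pass idea can be sketched as follows, assuming NodeJS and a hypothetical input file with one number per line. The file is streamed line by line with the built-in readline module, so it is never held in memory as a whole. The first pass computes the mean, and the second pass counts the values above it.

```javascript
const fs = require('fs');
const readline = require('readline');

// Stream a file line by line instead of loading it all at once.
function lines(filePath) {
  return readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity,
  });
}

// Hypothetical task: count how many values are above the mean of the file.
async function countAboveMean(filePath) {
  // First pass: gather only what the second pass needs (sum and count).
  let sum = 0;
  let count = 0;
  for await (const line of lines(filePath)) {
    if (line.trim() === '') continue;
    sum += Number(line);
    count++;
  }
  if (count === 0) return 0;
  const mean = sum / count;

  // Second pass: do the actual processing, still one line at a time.
  let above = 0;
  for await (const line of lines(filePath)) {
    if (line.trim() === '') continue;
    if (Number(line) > mean) above++;
  }
  return above;
}
```

The trade-off is reading the file from disk twice in exchange for memory usage that stays roughly constant regardless of file size.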

Workers

While promises are good at reducing IO bottlenecks, they don't help at all with CPU-heavy tasks. If anything, they may slow the program down slightly due to their overhead. One way around this is to use worker threads: Web Workers in the browser, or the worker_threads module in NodeJS. These allow you to run code in other threads, getting around the limitation of only being able to do one thing at a time. One major downside of employing workers in JavaScript is that they can't share ordinary objects with the main thread (SharedArrayBuffer is the main exception, and it only holds raw binary data). You must send data to the worker via message channels, and then send data from the worker back to the main application via the same message channel system. For more complex setups, this can become messy.
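A minimal sketch of that message passing, assuming NodeJS and its built-in worker_threads module. The function name and the sum-of-squares workload are hypothetical, and the worker's source is inlined as a string only so the example fits in one file.

```javascript
const { Worker } = require('worker_threads');

// Code that runs on the other thread. It receives a copy of the input via
// workerData and posts the result back over the message channel.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  let sum = 0;
  for (const n of workerData) sum += n * n;
  parentPort.postMessage(sum);
`;

// Wrap the worker in a promise so callers can simply await the result.
function sumOfSquaresInWorker(numbers) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: numbers });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}
```

The input array is copied when it is sent to the worker, so for very large inputs the cost of that copy has to be weighed against the time saved by moving the work off the main thread.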

A library I've used in the past to make this simpler is microjob on npm. It allows you to send data as a parameter to a function, and also get the result by treating the call to the other thread as a promise. The library is for NodeJS only and does not support browsers. When writing code for the browser, you must either use another library or use Web Workers directly.

Hidden Classes

While this is considered a micro-optimisation in most cases, for large-scale data manipulation in JavaScript, this internal V8 (the JavaScript engine NodeJS and Chrome use) optimisation can be crucial. The V8 engine creates hidden classes to speed up property access. When the structure of an object changes, such as by adding or removing a property, V8 has to move the object to a different hidden class. Doing this lowers the overall performance of object accesses by a small factor, which can add up when doing large amounts of data processing. To avoid this, try to keep the structure of objects as rigid as possible. If you know you'll be adding another field later in the program, try setting it to a dummy value at the start.
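To illustrate, here is a small sketch with two hypothetical factory functions. The first produces objects whose set of properties varies from record to record. The second always produces the same structure by filling the optional field with a placeholder. How much this matters in practice depends on the engine version and the workload, so treat it as something to measure rather than a guarantee.

```javascript
// Varying structure: only some records get a `label` property, so the
// resulting objects do not all share one shape.
function makeRecordVarying(id, label) {
  const record = { id };
  if (label !== undefined) {
    record.label = label; // added after creation, and only sometimes
  }
  return record;
}

// Stable structure: every record has the same properties in the same order,
// using null as a dummy value when there is no label.
function makeRecordStable(id, label) {
  return { id, label: label !== undefined ? label : null };
}
```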

Memory Allocations

One of the biggest issues with processing large amounts of data in NodeJS is hitting the garbage collector. Most garbage collectors optimise for gradual memory allocations over time, rather than massive amounts in short durations. JavaScript array helper methods such as map, filter, flatMap, and others each return a new array rather than modifying the original, which allocates more memory that the garbage collector must later free. Once memory usage gets over a certain amount, the garbage collector will pause the application while it frees memory. The time it takes to free memory is often non-trivial, and can significantly slow down the application.

In a case where an application has three chained map calls and a filter call on a 10000 element array, four new arrays are allocated, and the three intermediate ones are discarded immediately. Cases such as this are rather typical in data-heavy applications, and very quickly add up. When possible, try to avoid throw-away memory allocations.
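One way to avoid the intermediate arrays is to fuse the chain into a single loop. The sketch below uses hypothetical arithmetic steps purely for illustration; both functions produce the same result, but the second allocates only the output array.

```javascript
// Chained version: each map call allocates a full-size intermediate array.
function chained(values) {
  return values
    .map((v) => v + 1)
    .map((v) => v * 2)
    .map((v) => v - 3)
    .filter((v) => v > 5);
}

// Fused version: one pass over the input and a single output array.
function fused(values) {
  const out = [];
  for (let i = 0; i < values.length; i++) {
    const v = (values[i] + 1) * 2 - 3;
    if (v > 5) out.push(v);
  }
  return out;
}
```

The fused version is harder to read, so it is best reserved for code paths that a profiler has shown to be hot.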

Array mutation

In slow areas of the code where it is safe to modify the original array, writing a custom map function that doesn't create a copy can be beneficial.

    function inlineMap(arr, f) {
      for (let i = 0; i < arr.length; i++) {
        arr[i] = f(arr[i]);
      }
      return arr;
    }

This function is simple and doesn't support features of map such as getting the index; however, it will massively reduce memory usage. You can add further functionality if the application requires it. While this allocates less memory, it does modify the array object passed to it. Due to this, you must take care to ensure that nothing requires the original data afterwards.
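As an example of adding functionality, here is a sketch of a variant that also passes the index and the array to the callback, mirroring the arguments the native map provides. The name inlineMapWithIndex is hypothetical.

```javascript
// Like inlineMap, but the callback also receives the index and the array,
// matching the (value, index, array) arguments of Array.prototype.map.
function inlineMapWithIndex(arr, f) {
  for (let i = 0; i < arr.length; i++) {
    arr[i] = f(arr[i], i, arr);
  }
  return arr;
}

// Usage: the same array object is returned, with its elements replaced.
const values = [1, 2, 3];
const result = inlineMapWithIndex(values, (value, index) => value * index);
// result === values, and values is now [0, 2, 6]
```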

Another boost to apply here is modifying the objects inside the array as well. When running map on an array, the function generally takes the current value and returns another. In cases where you're just modifying the original object, you can instead perform that modification and return it. Consider the following two examples. The first is a typical map function, while the second mutates the original object.

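    // Typical map callback: allocates a new object for every element.
    inlineMap(data, i => ({ id: i.id, data: i.data * 2 }));

    // Mutating callback: updates the existing object and returns it.
    inlineMap(data, i => { i.data *= 2; return i; });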

One major caveat with this method is that it modifies the object in every array that references it. If you make a shallow copy of an array and modify an object inside the copy, the modification will also apply to the original, because both arrays point to the same objects. You should avoid using these methods in most general code. This method dramatically sacrifices readability and maintainability of the codebase, so you should not use it unless necessary.
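A short demonstration of that caveat, using hypothetical sample data:

```javascript
const original = [{ id: 1, data: 10 }];

// slice() makes a shallow copy: a new array that holds the same objects.
const copy = original.slice();

// Mutating the object through the copy...
copy[0].data *= 2;

// ...also changes what the original array sees, since both arrays
// reference the same object.
console.log(original[0].data); // prints 20
```

If the original values are still needed elsewhere, the objects themselves must be copied before mutating them, which reintroduces the allocations this technique was meant to avoid.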

For reference, here is a jsperf entry on the above inlineMap function. While it only shows a small gain over the native function on most modern browsers, the benefits come from the minimised garbage collection.

Use a Profiler

Despite being the last point in this post, using a profiler is always the first thing you should do when optimising code. Optimisation is a data-driven process. If you don't know whether something is slow, you don't know what to fix. To quote Donald Knuth, from his 1974 Turing Award lecture Computer Programming as an Art,

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimisation is the root of all evil (or at least most of it) in programming.

Once you have verified something is slow, and modified the code, you must then also use a profiler to confirm that you have fixed the issue. If you haven't established a speed-up with metrics, you have no way of knowing that you improved anything. An excellent tool for testing out differences in small chunks of code is jsperf, as linked earlier.
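A full profiler, such as the one in the browser's developer tools or the one NodeJS exposes through its --prof and --inspect flags, shows where the time actually goes. For a quick before-and-after check of a single function, a small timing helper can be enough. The sketch below is not a profiler; it assumes NodeJS, uses the built-in perf_hooks module, and the name timeIt is hypothetical.

```javascript
const { performance } = require('perf_hooks');

// Run fn several times and report the median duration in milliseconds.
// The median is less sensitive to one-off pauses (such as a garbage
// collection) than a single measurement or the mean.
function timeIt(label, fn, runs = 5) {
  const timings = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    timings.push(performance.now() - start);
  }
  timings.sort((a, b) => a - b);
  return { label, medianMs: timings[Math.floor(runs / 2)] };
}

// Usage: time the same workload before and after a change and compare.
const data = Array.from({ length: 10000 }, (_, i) => i);
const measured = timeIt('native map', () => data.map((v) => v * 2));
console.log(`${measured.label}: ${measured.medianMs.toFixed(3)} ms`);
```

Micro-benchmarks like this are easily skewed by JIT warm-up and small inputs, so confirm any result against the real workload.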

Conclusion

Optimising software can be complex. Two things are essential to keep in mind: understanding a few of the common pitfalls, such as excessive memory allocation or loading files one after the other, and validating performance issues before making changes. Optimising code without validating that it's slow and that you've improved it is not only a waste of time but also potentially a detriment to the overall quality of the codebase.