So what does it take to analyse NPM and all its modules and versions? To begin with, you need data about what is in the repository. This is actually surprisingly easy, because NPM provides you with a simple way to set up your own copy of it. All it takes is to:

install couchdb

clone npm-registry-couchapp

run the specified scripts to create views in couchdb

set up data replication from the registry

wait for the data to download

Now that we have all the meta-information about NPM modules, we can actually export the data into a different database where it is possible to run the desired queries and compare various data points.
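A minimal sketch of that export step, assuming the usual layout of a registry package document (`versions`, `dist-tags`, `dependencies`, `dist.tarball`). The sample document, the `registry.example` URLs, and the row shape are made up for illustration and are not taken from the original analysis.

```javascript
// Flatten one registry document into one row per published version.
// The row shape is a hypothetical choice for illustration.
function flattenDoc(doc) {
  const latest = (doc['dist-tags'] || {}).latest;
  return Object.entries(doc.versions || {}).map(([version, meta]) => ({
    name: doc.name,
    version,
    isLatest: version === latest,
    dependencyCount: Object.keys(meta.dependencies || {}).length,
    tarball: meta.dist ? meta.dist.tarball : null,
  }));
}

// Made-up sample document, trimmed to the fields used above.
const sampleDoc = {
  name: 'example-module',
  'dist-tags': { latest: '1.1.0' },
  versions: {
    '1.0.0': {
      dependencies: {},
      dist: { tarball: 'https://registry.example/example-module-1.0.0.tgz' },
    },
    '1.1.0': {
      dependencies: { lodash: '^4.0.0' },
      dist: { tarball: 'https://registry.example/example-module-1.1.0.tgz' },
    },
  },
};

const rows = flattenDoc(sampleDoc);
console.log(rows.length); // 2
```

Rows like these can be loaded into any database that supports the queries you want to run.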

But metadata alone gives only part of the picture if we also want to analyse the code itself: check files, run some regex, get size information etc.

To get the full results we would have to:

download metadata from the repository

analyse metadata

download the code for all versions of the module

unpack and analyse the code

Easy right? Right?

Not so fast! It turns out that downloading and analysing a single version takes about 1s on average. That’s not a lot, but with 1.65M versions it adds up to ~20 days of non-stop downloading and analysis. I didn’t have that much time.
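The estimate is easy to check. The snippet below only restates the two figures quoted above (1.65M versions, about 1s each); the variable names are illustrative.

```javascript
// Back-of-the-envelope estimate using the figures quoted in the text.
const versionCount = 1.65e6;        // published versions
const secondsPerVersion = 1;        // average download + analysis time
const secondsPerDay = 24 * 60 * 60; // 86400

const sequentialDays = (versionCount * secondsPerVersion) / secondsPerDay;
console.log(sequentialDays.toFixed(1)); // "19.1"
```

So a single sequential run comes to roughly 19 days, in line with the ~20 days mentioned.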

Luckily Google helped me out: I happened to have $300 of trial credit from the Google Cloud first-time sign-up bonus. So I decided to put it to good use by creating a script to do the following:

divide the whole repository into 30 chunks

provision a machine for each chunk

install and start the analysis process in them all

then collect the results and destroy the corresponding machines
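The first step can be sketched as follows. This is only one possible way to split the work (round-robin over module names), not necessarily how the original script did it, and the module names are placeholders. In the ideal case, 1.65M versions at about 1s each split 30 ways comes to about 55,000s, or roughly 15 hours per machine.

```javascript
// Split a list of module names into `chunkCount` nearly equal chunks
// by dealing them out round-robin.
function splitIntoChunks(items, chunkCount) {
  const chunks = Array.from({ length: chunkCount }, () => []);
  items.forEach((item, index) => chunks[index % chunkCount].push(item));
  return chunks;
}

// Hypothetical module names standing in for the real registry listing.
const moduleNames = Array.from({ length: 100 }, (_, i) => `module-${i}`);
const chunks = splitIntoChunks(moduleNames, 30);

console.log(chunks.length);    // 30
console.log(chunks[0].length); // 4
```

Round-robin keeps chunk sizes within one item of each other, although the actual workload per chunk still depends on how many versions each module has.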

This allowed me to do the bulk of the analysis in a day. Thank you, Google.

PS! Doing this in Google Cloud was surprisingly simple, so kudos to Google for that as well.

But I only got the bulk of the analysis done this way, because, you see, NPM lacks this thing called validation. There are no checks on the validity of data or code. I’m not sure there are any checks at all: I found modules with invalid semvers, dependencies like “../../../etc/passwd”, licences like ‘hglv2<! — \” onmouseover=alert(1)”’, packages with invalid tar headers etc.

This in turn created a mind-boggling number of edge cases, which had to be handled, especially since I wanted to analyse whole dependency trees, which of course meant that the dependency requirements had to be resolved. Oh no, not the dependencies!
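To give an idea of the kind of defensive checks this calls for, here is a minimal sketch. The semver pattern is a deliberately simplified approximation, the function name `findProblems` is hypothetical, and checking the dependency name (rather than its version spec) is an assumption about where the bad values appeared.

```javascript
// Simplified semver check: MAJOR.MINOR.PATCH with optional pre-release and
// build metadata. An illustrative approximation, not the full grammar.
const SIMPLE_SEMVER = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$/;

function findProblems(pkg) {
  const problems = [];
  if (typeof pkg.version !== 'string' || !SIMPLE_SEMVER.test(pkg.version)) {
    problems.push('invalid version');
  }
  for (const name of Object.keys(pkg.dependencies || {})) {
    // A dependency name should never look like a file system path.
    if (name.includes('..') || name.startsWith('/')) {
      problems.push(`suspicious dependency: ${name}`);
    }
  }
  // Markup characters have no business in a licence identifier.
  if (typeof pkg.license === 'string' && /[<>"]/.test(pkg.license)) {
    problems.push('suspicious licence string');
  }
  return problems;
}

console.log(findProblems({
  version: 'one',
  dependencies: { '../../../etc/passwd': '*' },
  license: 'x" onmouseover=alert(1)',
}).length); // 3
```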

After spending weeks fixing various edge cases and rerunning tests (…and losing more hair than usual), the analysis is finally complete and we can move on to the last part.

Part 5 — Mirror, mirror on the wall, what’s the fairest module of them all

tl;dr

lots of dead code in NPM

everything depends on everything

NPM is a wild mix of everything

etc

Having learned how deep the NPM rabbit hole goes, it is time to look at the findings. The analysis revealed a lot of interesting statistics and averages, as well as creations that are simply funny, baffling, or scary. So let’s take a closer look at what the NPM jungle has to offer.

PS. For the following statistics, I analysed only the latest version of each module.

First off, there is a surprising number of placeholder modules in NPM: almost 2% of all modules have no content besides a package.json and maybe a licence. That doesn’t necessarily mean they are all placeholders, but most of them are. Some, however, define scripts, so the package.json really is the whole content of the package.

Speaking of licences: MIT is by far the most popular one, with over 50% of modules. Then of course we have a large chunk of missing licence information, and the last quarter is divided between a myriad of less popular licences. For me the most interesting one was WTFPL (Do What The Fuck You Want To Public License), which is surprisingly popular, with nearly 1,000 modules sporting it.

Licences of modules

Next up I looked at the age of modules, or more specifically, how long it has been since they were last updated. To no great surprise, things get old and people stop paying attention to them. Looking at the graph below, we see that about 40% of modules were not updated in the last year.

Last update to module

Let’s now shift over to what we actually use NPM for: dependencies. It turns out that over half of the modules in NPM have only 0 or 1 dependencies, and the mean is only 2.42. And then there is a module called mikolalysenko-hoarders, which has 389 wildcard dependencies.

Dependencies in a package.json
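Counting direct dependencies from a package.json is straightforward. The sketch below also counts wildcard specs, treating `*` and the empty string as wildcards; the function name and the sample manifest are made up for illustration.

```javascript
// Count direct dependencies and how many of them accept any version.
function dependencyStats(pkg) {
  const specs = Object.values(pkg.dependencies || {});
  return {
    total: specs.length,
    wildcard: specs.filter((spec) => spec === '*' || spec === '').length,
  };
}

// Made-up manifest for illustration.
const stats = dependencyStats({
  name: 'example-module',
  dependencies: { a: '*', b: '^1.0.0', c: '' },
});
console.log(stats); // { total: 3, wildcard: 2 }
```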

The low number of dependencies in a package.json is actually deceptive, because if we calculate the full tree (dependencies of dependencies) installed for a module, the picture is drastically different.

Nr of packages actually installed

The mean number of packages installed for a module is 35.3, with the maximum being a whopping 1615 for npm-collection-explicit-installs, a module that collects popular NPM modules under one package.

While the outliers do skew the data and over 50% of modules install 4 or fewer packages, over 10% of the modules pull in 100+ packages, which I find rather disturbing. It is especially disturbing in the context of my average project (not just a package, but an actual service), where I usually have a heck of a lot more than 2–3 dependencies in my package.json, which means the actual number of installed dependencies is very high.
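The gap between direct and installed dependencies comes from following the tree all the way down. Here is a simplified sketch that counts every distinct package reachable from a root. It ignores version ranges entirely (real resolution may install several versions of the same package), and the toy graph is invented.

```javascript
// Count every distinct package reachable from `root` in a dependency graph.
// `graph` maps a package name to the names it depends on directly.
function countInstalled(graph, root) {
  const seen = new Set();
  const queue = [...(graph[root] || [])];
  while (queue.length > 0) {
    const name = queue.shift();
    if (seen.has(name)) continue; // already counted; also guards against cycles
    seen.add(name);
    queue.push(...(graph[name] || []));
  }
  seen.delete(root); // a cycle back to the root is not an extra install
  return seen.size;
}

// Invented example: `app` lists 2 dependencies but ends up installing 4.
const toyGraph = {
  app: ['a', 'b'],
  a: ['c'],
  b: ['c', 'd'],
  c: [],
  d: ['a'],
};
console.log(countInstalled(toyGraph, 'app')); // 4
```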

Another equally interesting dependency-related nuance is that, on average, original code makes up only about 45% of what gets installed for a module; the rest is dependency code. (Note: I calculated this based on unpacked file sizes with no regard to file types etc.)

The data itself is very polarised: there are loads of modules that have only original code, which is consistent with our original dependency graph (modules that have none). And then there are loads of modules that are completely overshadowed by their dependencies. This in no way means that they are poorly done, just that their dependencies are either large or plentiful, or have large or plentiful dependencies of their own.
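As a sketch of how such a percentage can be computed per module, assuming we already know the unpacked size of the module itself and of everything it pulls in (the function name and the byte counts are illustrative):

```javascript
// Share of a module's installed size that is its own code, in percent.
// Returns null when there is nothing on disk at all.
function originalCodeShare(ownBytes, dependencyBytes) {
  const total = ownBytes + dependencyBytes;
  return total === 0 ? null : (100 * ownBytes) / total;
}

console.log(originalCodeShare(2000, 0));  // 100 (no dependencies)
console.log(originalCodeShare(450, 550)); // 45
```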

Original code % based on size

Now this particular metric should be taken with a generous pinch of salt, because we all know that estimating the amount of work or usefulness by the size of the package is an act of futility.

But the last few graphs do show that the packages do depend a lot on each other. So what are the top packages that creep into dependency trees most often?

Top 10

1. readable-stream: Node.js core stream functionality as a separate package

2. async: Control flow management

3. isarray: Array.isArray

4. inherits: Node.js core util.inherits functionality as a separate module

5. glob: Match files using the patterns the shell uses, like stars and stuff.

6. minimist: For parsing command line arguments

7. lodash: Utility function collection

8. minimatch: A minimal matching utility.

9. commander: For parsing command line arguments

10. assert-plus: A wrapper around core assert to provide convenience functions

What to make of this? Well, apparently we are not confident in the language itself. There are a lot of people who need to depend on a separate module to check if something is an array. That function has been in the language since Chrome 5 (2010) FFS.

And those of you who say that you have to consider ancient systems — well write/copy the function. It’s here in all its glory.
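A minimal sketch of such a fallback, assuming the classic `Object.prototype.toString` check. It is illustrative and not necessarily identical to the published isarray source.

```javascript
// Use the native implementation when present, otherwise fall back to the
// classic Object.prototype.toString check.
const isArray = Array.isArray || function (value) {
  return Object.prototype.toString.call(value) === '[object Array]';
};

console.log(isArray([1, 2, 3]));     // true
console.log(isArray({ length: 0 })); // false
```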

Or don’t. I mean, for offloading that gigantic effort of including it in your code, you will instead get to download 2kb of metadata and 2kb of files, with a chance that something is down and your build breaks.

It’s even in the core util module if you feel like using that, because looking at Nr 1 and Nr 4, a lot of people don’t. And while inherits was written for the browser, I have serious doubts that it is only used in that context. Here is the module in its entirety for non-browser environments:
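As a sketch, assuming the non-browser entry point simply re-exports the core function (an assumption, not checked against a specific release of the package), it amounts to this:

```javascript
// Assumed shape of the non-browser entry point: hand back Node's own
// util.inherits under the module's name.
const inherits = require('util').inherits;
module.exports = inherits;
```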

For the browser, there is a separate version with 23 lines.

I don’t want to rant on this topic for much longer, but we all remember the left-pad incident, right? It proved that having lots of dependencies might actually not be the best thing.

Having your build break down on someone else’s account is indeed frustrating. But do also consider that your application’s attack surface grows with every one of those modules. An attacker won’t have to compromise you directly; it will be enough to compromise any one of your dependencies. And that means it will be enough to compromise any one of the developers of the said modules. It’s a terrifying world out there that is made even more terrifying by the possibility of lifecycle scripts in package.json.

For those who do not know — you can specify commands that get run during install or uninstall of a module. These scripts can include anything and will be run with the user’s privileges.
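Spotting such scripts in a package.json is simple. The sketch below looks for the install and uninstall hook names that npm has documented; newer npm versions may treat the uninstall hooks differently. The function name and the sample manifest are made up.

```javascript
// Script names npm has run automatically around install and uninstall.
const INSTALL_HOOKS = [
  'preinstall', 'install', 'postinstall',
  'preuninstall', 'uninstall', 'postuninstall',
];

// Return the hooks a package.json defines, ignoring ordinary scripts.
function findInstallHooks(pkg) {
  const scripts = pkg.scripts || {};
  return INSTALL_HOOKS.filter((hook) => typeof scripts[hook] === 'string');
}

// Made-up manifest: `test` is harmless, `postinstall` runs on every install.
const hooks = findInstallHooks({
  name: 'example-module',
  scripts: { test: 'mocha', postinstall: 'node setup.js' },
});
console.log(hooks); // [ 'postinstall' ]
```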

Looking at packages in the repository, about 2% contain such scripts, and they vary greatly in purpose and implementation. One of the funnier ones I found was in a package called closing-time. What it does is download and execute a shell script, which in turn downloads Semisonic’s “Closing Time” and adds a row to crontab that starts playing the song every day at 5pm.

Naturally there are also packages that try to demonstrate the level of danger this all brings, for example by printing out your SSH keys.

And while my quick analysis didn’t find any purposely malicious scripts, I did find horrors such as packages that, on install, run the NPM install again against a different repository.

If that’s not enough, there was recently an interesting NPM Worm concept attack, which could use the lifecycle scripts to propagate through the NPM packages.

Whether you like it or not, scripts are dangerous, and thus I strongly recommend running install with the ignore-scripts flag or setting it to true by default.

npm config set ignore-scripts true

Speaking of bad dependencies: there are 4k modules with dependencies that point outside the NPM ecosystem. Dependencies are allowed to point to GitHub, external URLs, the file system etc.
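A rough way to flag such dependencies is to look at the version spec rather than the name. The patterns below approximate the URL, Git, hosted-shorthand, and local-path forms npm accepts; they are a sketch under that assumption and not an exhaustive parser.

```javascript
// Return true when a dependency spec points outside the registry.
function isExternalSpec(spec) {
  // Explicit URL-style specs: tarball URLs, git URLs, file: paths.
  if (/^(https?|git|git\+ssh|git\+https?|git\+file|file):/.test(spec)) return true;
  // Hosted shorthands such as "github:user/repo".
  if (/^(github|gitlab|bitbucket|gist):/.test(spec)) return true;
  // Local paths.
  if (/^(\.{1,2}\/|\/|~\/)/.test(spec)) return true;
  // GitHub shorthand "user/repo", optionally followed by "#ref".
  return /^[^@\s/:]+\/[^\s/:]+$/.test(spec);
}

console.log(isExternalSpec('^1.2.3'));           // false
console.log(isExternalSpec('user/repo#master')); // true
console.log(isExternalSpec('file:../local'));    // true
```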

I understand that it is a nice feature to have in your own development — don’t want to publish everything to NPM, no problem, just point to your own projects. But I seriously think published modules should not have dependencies from outside the system. Not only because I lost half my hair trying to handle the various cases in the dependency tree, but because we as users lose control.

Semver has no meaning if your dependency points to the master branch of a repository. It can change at any point and you’re just going to have to hope that the owner doesn’t suddenly introduce breaking changes or include something malicious.

But not everything in NPM is dark and out to get you. There were some weird modules that stood out and gave me a good laugh during the analysis.

For example a module called 0126af95c0e2d9b0a7c78738c4c00a860b04acc8 — which ironically exports a function to produce a random string.

Or the biggest module of them all, which after unpacking comes in at 8.6GB of data. You might have guessed it — it’s called:

yourmom @ 1.0.0

You’ve got to give credit to the guy — the troll is strong with him. The readme even includes nice puns:

It’s very easy to use yourmom, in fact you can do it in under 5 minutes (provided your hardware can handle the sheer weight of yourmom). — from the README.md of yourmom

Leaving yourmom aside, we also have a lot of modules that try to test various aspects of NPM itself. There were dozens of examples of basic script injections, path traversals etc. in the various fields of the package.json. However, for now, the great majority of modules out there are just what their authors advertise: blocks of code that solve problems in the best-known or easiest way possible.