1. Monitoring! TL;DR: Monitoring is a game of finding out about issues before our customers do, so it obviously deserves top priority. The market is overwhelmed with offerings, so consider starting by defining the basic metrics you must follow, then go over the additional fancy features and choose the solution that ticks all the boxes. Otherwise: Failure === disappointed customers. Simple.

2. Increase transparency using smart logging TL;DR: Logs can be a dumb warehouse of debug statements or the enabler of a beautiful dashboard that tells the story of your app. Plan your logging platform from day 1: decide how logs are collected, stored and analyzed, to ensure that the desired information (e.g. error rate, following an entire transaction through services and servers) can really be extracted. Otherwise: You end up with a black box that is hard to reason about, and then you start rewriting all logging statements to add the missing information.
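One way to make logs analyzable from day 1 is to emit one JSON object per line instead of free text. The following is a minimal sketch, not a replacement for a real logging library; `createLogger`, the field names and the `orders-api` service name are illustrative assumptions.

```javascript
// Minimal structured logger: one JSON object per line, so a log shipper
// can index fields (level, service, orderId...) instead of parsing free text.
// All names here are hypothetical and chosen for illustration only.
function createLogger(baseFields = {}, write = (line) => process.stdout.write(line + '\n')) {
  const log = (level, msg, fields = {}) => {
    const entry = {
      time: new Date().toISOString(),
      level,
      msg,
      ...baseFields, // fields shared by every entry, e.g. the service name
      ...fields,     // fields specific to this entry
    };
    write(JSON.stringify(entry));
    return entry;
  };
  return {
    info: (msg, fields) => log('info', msg, fields),
    error: (msg, fields) => log('error', msg, fields),
  };
}

const logger = createLogger({ service: 'orders-api' }); // hypothetical service name
logger.info('order created', { orderId: 42, durationMs: 17 });
```

Because every entry carries a `level` field, a metric such as the error rate becomes a simple count over indexed fields rather than a text search.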

3. Delegate anything possible (e.g. gzip, SSL) to a reverse proxy TL;DR: Node is awfully bad at CPU-intensive tasks like gzipping and SSL termination. Instead, use ‘real’ middleware services like nginx, HAProxy or cloud vendor services. Otherwise: Your poor single thread will keep busy with networking tasks instead of dealing with your application core, and performance will degrade accordingly.

4. Lock dependencies TL;DR: Your code must be identical across all environments, but amazingly NPM lets dependencies drift across environments by default: when you install packages in various environments, it tries to fetch each package’s latest patch version. Overcome this by using the NPM config file, .npmrc, which tells each environment to save the exact (not the latest) version of each package. Alternatively, for finer-grained control use NPM “shrinkwrap”. *Update: as of NPM 5, dependencies are locked by default. The new package manager in town, Yarn, also has us covered by default. Otherwise: QA will thoroughly test the code and approve a version that will behave differently in production. Even worse, different servers in the same production cluster might run different code.
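In .npmrc, the setting that saves exact versions is `save-exact=true`. To spot dependencies that were already saved as ranges, a small check over package.json can help. This is a rough sketch: `findUnpinned` and the sample package object are hypothetical, and the regular expression only recognizes plain `x.y.z` versions with an optional pre-release tag.

```javascript
// Returns the names of dependencies whose version spec is not an exact
// version such as "4.17.1" (for example "^4.17.21", "~10.2.0", "*", "latest").
const EXACT_VERSION = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$/;

function findUnpinned(pkg) {
  // Spreading undefined is a no-op, so missing sections are tolerated.
  const all = { ...pkg.dependencies, ...pkg.devDependencies };
  return Object.keys(all)
    .filter((name) => !EXACT_VERSION.test(all[name]))
    .sort();
}

// Hypothetical package.json content used only for the demonstration.
const examplePkg = {
  dependencies: { express: '4.17.1', lodash: '^4.17.21' },
  devDependencies: { mocha: '~10.2.0' },
};
console.log(findUnpinned(examplePkg)); // [ 'lodash', 'mocha' ]
```

A check like this could run in CI, for example with `findUnpinned(require('./package.json'))`, and fail the build when the returned list is not empty.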

5. Guard process uptime using the right tool TL;DR: The process must go on and get restarted upon failures. For simple scenarios, ‘restarter’ tools like PM2 might be enough, but in today’s ‘dockerized’ world, cluster management tools should be considered as well. Otherwise: Running dozens of instances without a clear strategy and with too many tools together (cluster management, Docker, PM2) might lead to devops chaos.

6. Ensure error management best practices are met TL;DR: Error management is probably the most time-consuming and painful task in keeping Node.js environments stable. This is mostly due to the ‘one thread’ model and the lack of a proper strategy for error paths in asynchronous flows. No shortcuts here: you must fully understand and tame the error management beast. My list of error handling best practices might get you there quicker. Otherwise: Crazy stuff will go on, such as the process crashing only because a user passed in invalid JSON, errors disappearing without a trace, and stack-trace information being revealed to the end user.
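The invalid-JSON crash mentioned above can be avoided by never letting `JSON.parse` throw across a request boundary. The sketch below assumes a hypothetical `handleBody` function standing in for a route handler; it covers only this one error path, not a full error-handling strategy.

```javascript
// Parse untrusted input without letting a SyntaxError escape.
function parseJsonSafely(raw) {
  try {
    return { ok: true, value: JSON.parse(raw) };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}

// Map the parse result to an HTTP-style response. The client gets a generic
// message; the parser's error text and stack trace stay inside the process.
function handleBody(raw) {
  const result = parseJsonSafely(raw);
  if (!result.ok) {
    return { status: 400, body: { error: 'Invalid JSON payload' } };
  }
  return { status: 200, body: result.value };
}

console.log(handleBody('{"name":"Ada"}').status); // 200
console.log(handleBody('{oops').status);          // 400
```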

7. Utilize all CPU cores TL;DR: In its basic form, a Node app runs on a single CPU core while all the others are left idle. It’s your duty to replicate the Node process and utilize all CPUs. For small-to-medium apps you may use Node Cluster or PM2. For a larger app, consider replicating the process using some Docker cluster (e.g. K8S, ECS) or deployment scripts that are based on the Linux init system (e.g. systemd). Otherwise: Your app will likely utilize only 25% of its available resources(!) or even less. Note that a typical server has 4 CPU cores or more, while a naive deployment of Node.js utilizes only 1 (even when using PaaS services like AWS Elastic Beanstalk!).
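For the small-to-medium case, replication with Node’s built-in cluster module might look roughly like this. `workerCount` and `startCluster` are hypothetical helper names, and the restart logic is deliberately naive: it has no back-off, so a worker that crashes on startup would be re-forked in a loop.

```javascript
const cluster = require('cluster');
const os = require('os');

// One worker per core, minus any cores reserved for other processes,
// but never fewer than one.
function workerCount(cpuCount = os.cpus().length, reserved = 0) {
  return Math.max(1, cpuCount - reserved);
}

// In the primary process: fork the workers and replace any that exit.
// In a worker process: run the application via `startWorker`.
function startCluster(startWorker) {
  // `isPrimary` exists on newer Node versions; `isMaster` is the older name.
  const isPrimary = cluster.isPrimary ?? cluster.isMaster;
  if (!isPrimary) {
    startWorker();
    return;
  }
  for (let i = 0; i < workerCount(); i++) {
    cluster.fork();
  }
  cluster.on('exit', (worker) => {
    console.error(`worker ${worker.process.pid} exited, forking a replacement`);
    cluster.fork();
  });
}
```

A typical entry point would call `startCluster(() => require('./server'))`, where `./server` is a placeholder for the module that starts your HTTP listener.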

8. Create a ‘maintenance endpoint’ TL;DR: Expose a set of system-related information, such as memory usage and a REPL, in a secured API. Although it’s highly recommended to rely on standard and battle-tested tools, some valuable information and operations are easier to get using code. Otherwise: You’ll find that you’re performing many “diagnostic deploys”: shipping code to production only to extract some information for diagnostic purposes.
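A maintenance endpoint can start as small as a token-protected handler that returns a few process facts. The sketch below is an assumption-laden illustration: the `x-maintenance-token` header name, `maintenanceHandler` and `collectDiagnostics` are all made up for this example, and a shared token is only the simplest possible form of ‘secured’.

```javascript
const crypto = require('crypto');

// Gather a few system-related facts about the current process.
function collectDiagnostics() {
  const mem = process.memoryUsage();
  return {
    pid: process.pid,
    nodeVersion: process.version,
    uptimeSeconds: Math.round(process.uptime()),
    rssMb: Math.round(mem.rss / 1024 / 1024),
    heapUsedMb: Math.round(mem.heapUsed / 1024 / 1024),
  };
}

// Compare tokens in constant time; a missing header is always rejected.
function isAuthorized(presented, expected) {
  if (typeof presented !== 'string') return false;
  const a = Buffer.from(presented);
  const b = Buffer.from(expected);
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

// Build a (req, res) handler usable with Node's http server.
// `expectedToken` must be a non-empty string supplied by the caller.
function maintenanceHandler(expectedToken) {
  return (req, res) => {
    if (!isAuthorized(req.headers['x-maintenance-token'], expectedToken)) {
      res.writeHead(403, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ error: 'forbidden' }));
      return;
    }
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(collectDiagnostics()));
  };
}
```

The handler can be passed to `http.createServer(...)` on an internal-only port, so that the diagnostics never share a listener with public traffic.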

9. Discover errors and downtime using APM products TL;DR: Monitoring and performance products (a.k.a. APM) proactively gauge the codebase and API so they can auto-magically go beyond traditional monitoring and measure the overall user experience across services and tiers. For example, some APM products can highlight a transaction that loads too slowly on the end user’s side while suggesting the root cause. Otherwise: You might spend great effort on measuring API performance and downtime, yet you’ll probably never be aware of which code parts are slowest under real-world scenarios and how they affect the UX.

10. Make your code production-ready TL;DR: Code with the end in mind and plan for production from day 1. This sounds a bit vague, so I’ve compiled a few development tips that are closely related to production maintenance. Otherwise: A world-champion IT/devops guy won’t save a system that is badly written.

11. Tick the obvious security boxes TL;DR: Node embodies some unique security challenges; in this bullet I’ve grouped the straightforward security measures. It goes without saying that a “secured” system requires a much more extensive security analysis. Otherwise: What is worse than a security leak that is covered in the press? A no-brainer security issue that you just forgot to address.

12. Measure and guard the memory usage TL;DR: Node.js has a controversial relationship with memory: the V8 engine has soft limits on memory usage (1.4GB) and there are known paths to leak memory in Node’s code, so watching the Node process memory is a must. In small apps you may gauge memory periodically using shell commands, but in medium-to-large apps consider baking your memory watch into a robust monitoring system. Otherwise: Your process memory might leak a hundred megabytes a day, as happened at Walmart.
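As a starting point before a full monitoring system is in place, the process can watch its own memory with `process.memoryUsage()`. This is a minimal sketch; `checkMemory`, `watchMemory` and the threshold values are illustrative assumptions, and what to do when the limit is exceeded (alert, dump the heap, restart) is left to the callback.

```javascript
const MB = 1024 * 1024;

// Compare a memory snapshot (resident set size) against a limit in megabytes.
function checkMemory(limitMb, usage = process.memoryUsage()) {
  const rssMb = usage.rss / MB;
  return { rssMb: Math.round(rssMb), limitMb, exceeded: rssMb > limitMb };
}

// Poll periodically and call `onExceeded` whenever the limit is crossed.
// Returns a function that stops the watcher.
function watchMemory(limitMb, intervalMs, onExceeded) {
  const timer = setInterval(() => {
    const status = checkMemory(limitMb);
    if (status.exceeded) onExceeded(status);
  }, intervalMs);
  timer.unref(); // the watcher alone should not keep the process alive
  return () => clearInterval(timer);
}
```

For example, `watchMemory(1024, 60000, (status) => console.error('memory limit exceeded', status))` would check once a minute against a 1GB limit.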

13. Get your frontend assets out of Node TL;DR: Serve frontend content using dedicated middleware (nginx, S3, CDN), because Node performance really gets hurt when dealing with many static files due to its single-threaded model. Otherwise: Your single Node thread will keep busy streaming hundreds of HTML/image/Angular/React files instead of allocating all its resources to the task it was born for: serving dynamic content.


14. Be stateless, kill your servers almost every day TL;DR: Store any type of data (e.g. user sessions, cache, uploaded files) in external data stores. Consider ‘killing’ your servers periodically or using a ‘serverless’ platform (e.g. AWS Lambda) that explicitly enforces stateless behavior.

Otherwise: A failure at a given server will result in application downtime instead of just killing a faulty machine. Moreover, scaling out elastically will get more challenging due to the reliance on a specific server.

15. Use tools that automatically detect vulnerabilities TL;DR: Even the most reputable dependencies, such as Express, have known vulnerabilities from time to time that put a system at risk. This can be easily tamed using community and commercial tools that constantly check for vulnerabilities and warn (locally or on GitHub); some can even patch them immediately. Otherwise: Keeping your code clean of vulnerabilities without dedicated tools will require you to constantly follow online publications about new threats. Quite tedious.

16. Assign ‘TransactionId’ to each log statement TL;DR: Assign the same identifier, transaction-id: {some value}, to each log entry within a single request. Then, when inspecting errors in logs, you can easily conclude what happened before and after. Unfortunately, this is not easy to achieve in Node due to its async nature.

Otherwise: Looking at a production error log without the context – what happened before – makes it much harder and slower to reason about the issue
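On recent Node versions, one way to carry a transaction id across async calls is `AsyncLocalStorage` from the built-in `async_hooks` module. The sketch below is one possible approach rather than the method this article originally had in mind; `withTransaction`, `formatLog` and `handleRequest` are hypothetical names.

```javascript
const { AsyncLocalStorage } = require('async_hooks');
const { randomUUID } = require('crypto');

const transactionStore = new AsyncLocalStorage();

// Run `fn` with a transaction id that stays visible to everything it calls,
// including code that resumes after an await or a timer.
function withTransaction(fn, transactionId = randomUUID()) {
  return transactionStore.run({ transactionId }, fn);
}

// Build a log line that automatically includes the current transaction id.
function formatLog(message) {
  const store = transactionStore.getStore();
  const transactionId = store ? store.transactionId : 'none';
  return JSON.stringify({ transactionId, message });
}

// Demo: two concurrent "requests", each keeping its own id across an await.
async function handleRequest() {
  console.log(formatLog('request received'));
  await new Promise((resolve) => setTimeout(resolve, 10));
  console.log(formatLog('request finished')); // same id as 'request received'
}

withTransaction(handleRequest);
withTransaction(handleRequest);
```

In a web framework, the same idea would usually live in a middleware that wraps the rest of the request handling in `transactionStore.run(...)`, so that no function needs to pass the id around explicitly.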


17. Set NODE_ENV=production TL;DR: Set the environment variable NODE_ENV to ‘production’ or ‘development’ to flag whether production optimizations should be activated; many NPM packages determine the current environment and optimize their code for production. Otherwise: Omitting this simple property might greatly degrade performance. For example, when using Express for server-side rendering, omitting NODE_ENV makes it slower by a factor of three!


18. Design automated, atomic and zero-downtime deployments TL;DR: Research shows that teams who perform many deployments lower the probability of severe production issues. Fast and automated deployments that don’t require risky manual steps and service downtime significantly improve the deployment process. You should probably achieve that using Docker combined with CI tools, as they have become the industry standard for streamlined deployment. Otherwise: Long deployments -> production downtime & human-related errors -> team unconfident in making deployments -> fewer deployments and features. This is a generic topic that is not directly related to Node.js; read further information on the web.



19. Bump your NPM version in each deployment TL;DR: Any time a new version is released, increase the package.json version attribute so that it is clear in production which version is deployed. This matters all the more in a microservice environment, where different servers might hold different versions. The command “npm version” can do that for you automatically. Otherwise: Developers frequently try to hunt a production bug within a distributed system (i.e. multiple versions of multiple services) only to realize that the presumed version is not deployed where they are looking.

20. Use an LTS release of Node.js TL;DR: Ensure you are using an LTS version of Node.js to receive critical bug fixes, security updates and performance improvements. Otherwise: Newly discovered bugs or vulnerabilities could be used to exploit an application running in production, and your application may become unsupported by various modules and harder to maintain.
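A running process can report whether it is on an LTS line through `process.release.lts`, which holds the LTS codename on LTS releases and is undefined otherwise. The following startup check is a small sketch; `describeRuntime` is a hypothetical helper, and note that the property does not tell you whether an LTS line has already reached end-of-life.

```javascript
// Describe the current runtime; parameters default to the live process values
// but can be overridden, which keeps the function easy to test.
function describeRuntime(release = process.release, version = process.version) {
  const isLts = typeof release.lts === 'string';
  return { version, isLts, codename: isLts ? release.lts : null };
}

const runtime = describeRuntime();
if (!runtime.isLts) {
  console.warn(`Node ${runtime.version} is not an LTS release`);
}
```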

21. Check your monitoring against real chaos (using a monkey) TL;DR: Life tells us that unpredicted things will happen in production. To name a few: servers will get killed, SSL certificates may get revoked, the event loop tends to get blocked, DNS records may change, and the list goes on. These sound like rare conditions, but they do happen and usually leave a huge impact, like a storm. There’s no way to truly mitigate this risk without simulating these chaotic conditions and verifying that the application can survive or at least report their occurrence. Netflix’s chaos-monkey is the most famous tool for chaos generation. My Node.js chaos monkey is focused on Node and process-related chaos. Otherwise: No escape here, Murphy’s law will hit your production without mercy.