This is part one of a three-part guest series by Alex Giamas, Co-Founder and CTO of CareAcross, a stealth mode startup seeking to empower patients. Alex is also a proud Carnegie Mellon alumnus, a graduate of the onsite courses offered at MongoDB University and a Cloudera Certified developer for Apache Hadoop (CDH-410).

At Upstream Systems, Persado, Care Across and through various consulting roles, I have dealt with all types of MongoDB installations ranging from single server instances, medium size deployments, to large cloud-based sharded clusters. Whether large or small, monitoring is essential to assuring performance and reliability. We needed to visualize the health of production environments and maintain a clearly defined procedure for metrics exceeding threshold values, as well as measure the impact of development changes.

MongoDB Management Service (MMS) is rich with metrics, but in my experience, the most valuable metrics in practice are the following:

Lock percentage: This was more important in earlier versions, where the global write lock could eat you alive and lock yielding was not yet implemented. While it’s less important with more recent versions (please vote on SERVER-1240!), lock percentage still shows a lot about your database activity. A continuously high lock percentage will affect reads as they will eventually queue up behind writes.

Replication lag: Designing your application to read data from a secondary node can sometimes be a good idea, when it reduces latency of the read. But if your application is using the secondary’s data and you have high replication lag, your application will use stale data. In addition, a primary node failure when you have a high replication lag means that a secondary may not be sufficiently up-to-date in a failover scenario.

Journal writes: If your writes are overwhelming your journal file this will impact performance and stability of your MongoDB installation.

Page faults: Page faults are expensive to process and at sufficiently high rates, it probably means that your working set is not fitting in memory. In complex data driven applications, page faults may indicate a deeper root cause hidden in the implementation of the business logic of the app.

Non Mapped Virtual Memory: When this grows without an end, this usually means a memory leak. It’s better to monitor it and proactively restart the server or try to hunt down the leak rather than wait for the crash to happen.

There’s a lot of data in MMS Monitoring but I have found that these metrics are the most interesting. In my next post, I will go over how to make this data actionable.