# MongoDB and the MEAN Stack: Optimizing an Example Application

Join us for the story of a fictional startup using MongoDB and the rest of the MEAN stack to build a geolocation-based app. At MongoDB we want to ensure that all our technical new hires become subject matter experts with our technology before they begin to contribute. Part of this effort is a 3-4 week “bootcamp” project focused on a real implementation with MongoDB, where we build an app with the intent of teaching new hires how to optimize real deployments. In this post, we’ll look at one such application. As our fictional startup evolves, you’ll see the trials, tribulations, false turns and head-slapping moments, along with some of the insights we found as we built, debugged and optimized the application.

## What is MEAN?

MEAN is a JavaScript-centered stack for developing web applications that uses the following technologies:

- **MongoDB**: the database
- **Express**: web application framework/router
- **AngularJS**: client-side model–view–controller (MVC) framework
- **Node.js**: JavaScript runtime environment for server-side and networking applications

The great thing about MEAN is that you get much of what you need to begin development right out of the box. To get up and running even faster with the MEAN stack, check out mean.io, which also includes a number of other useful packages like Mongoose.

## The Application

Our example app is relatively simple. It uses the geolocation reported by a user’s mobile device to deliver offers from nearby businesses. Mongoose allows you to define a schema for a MongoDB collection in order to define the shape of the documents within that collection.
We started by defining a data model that represents our advertisers:

```javascript
var AdvertiserSchema = new Schema({
  name: { type: String, default: '', trim: true },
  loc: {
    'type': {
      type: String,
      required: true,
      enum: ['Point', 'LineString', 'Polygon'],
      default: 'Point'
    },
    coordinates: { type: Array, default: [0.0, 0.0] }
  }
});
```

This schema defines individual documents in MongoDB that look something like this:

```javascript
{
  name: 'Long Hall',
  loc: { type: 'Point', coordinates: [-6.265535, 53.3418364] }
}
```

Next, we implemented a function for Express to handle the REST customer queries for the nearest businesses:

```javascript
exports.near = function(req, res) {
  var dist = Number(req.query.dist) || 1000;
  var findQuery = {
    loc: {
      $near: {
        $geometry: {
          type: 'Point',
          coordinates: [Number(req.query.lng), Number(req.query.lat)]
        },
        $maxDistance: dist
      }
    }
  };
  req.session = null;
  Advertiser.find(findQuery, function (err, advertisers) {
    if (err) {
      // Error handling
    } else {
      res.jsonp(advertisers);
    }
  });
};
```

One of the biggest advantages of the MEAN stack is the ability to rapidly prototype our app. As you can see, the implementation is simple. The challenge came when we began to optimize for performance. The following details some of the strategies we found and surprises we encountered.

## Test with a Production Dataset

A common error with MongoDB is that developers don’t test with a dataset representative of what the app will see in production. For example, when developing our app we began by separately benchmarking the number of requests per second it could handle when using each of the different types of geo queries available in MongoDB. In our initial test on a development laptop we found that geoHaystack performed the best, but at this point we were using a very small dataset of nine businesses, all located within a small area in Dublin. Once that dataset was expanded to a much larger, real-world scenario of tens of thousands of locations, the `$near` and `$nearSphere` queries performed better.
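One prerequisite the snippets above rely on: `$near` and `$nearSphere` queries against GeoJSON data require a `2dsphere` index on the queried field. A minimal sketch of the index declaration (the shell collection name is an assumption based on the model name):

```javascript
// Declare a 2dsphere geospatial index on the GeoJSON loc field so that
// $near / $nearSphere queries can use it; without such an index,
// a $near query returns an error rather than falling back to a scan.
AdvertiserSchema.index({ loc: '2dsphere' });

// Equivalent, in the mongo shell:
//   db.advertisers.ensureIndex({ loc: '2dsphere' })
```

Note also that GeoJSON points are ordered `[longitude, latitude]`, which is why the handler builds the query point as `[lng, lat]`.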
The point is, while a small dataset on your local machine is great for prototyping, it should never be considered indicative of the performance you will get out of MongoDB in production with a much larger amount of data.

## Look under the Covers

The great thing about frameworks like MEAN and Express is that they provide a lot of the standard functionality used in many applications. These frameworks free developers to focus on the features that are especially valuable in their application. On the other hand, they can be opaque, and it can be hard to see what’s going on under the covers. We used the MongoDB profiler, which logs every operation in MongoDB to a separate, queryable collection. This allowed us to see what was really happening in the database layer of our application. What we found when running our customer REST requests was surprising:

```javascript
{"op": "query",   "ns": "things.clients", ...
{"op": "command", "command": {"geoSearch": ...
{"op": "update",  "ns": "things.sessions", ...
```

In the first two lines of the log, we see a client id query and the geo query executing as expected, but this was followed by an unexpected write operation. Why was a REST request also executing an update? We discovered that under the hood, when a JSON response is returned, the MEAN stack was checking whether a session was set in the request, and was attempting to update a session document in MongoDB. The solution was to set the session variable in the request to null, since the REST request does not need to store session information:

```javascript
exports.all = function(req, res) {
  ...
  req.session = null;
  res.jsonp(advertisers);
}
```

On top of this behavior being unexpected, it’s also not very well documented, so it’s unlikely we would have found the issue without using the profiler. When we identified the problem and fixed it, we saw a 10% performance improvement in our application.
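The profiler is easy to turn on from the mongo shell; the profiling levels and the `system.profile` collection are standard MongoDB, though the exact query shown is just an illustration:

```javascript
// Enable profiling for the current database.
// Level 1 logs only operations slower than the slowms threshold;
// level 2 logs every operation (fine in development, too heavy for production).
db.setProfilingLevel(2)

// Each operation is recorded in the capped system.profile collection,
// which can be queried like any other collection:
db.system.profile.find({ ns: 'things.sessions' }).sort({ ts: -1 }).limit(5)
```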
Long story short, use the MongoDB profiler to ensure that the database operations that are actually happening coincide with what you are expecting. The profiler can be a useful tool to identify unexpected queries, and the performance savings it can help you find can be significant.

## Aggregation Strategy

The user story in our app development was to allow advertisers to query how many users saw their offers. To do this, we initially took a very naive, pragmatic approach: we stored all the customer queries, with their geolocation, in MongoDB as separate documents. When an advertiser queried the database for how many users saw their offer within a given distance of their business, we would run an aggregation on the dataset to count the matching documents.

Assuming an average of 2,000 customer requests per second, it’s easy to see how this quickly becomes a problem: at that rate, the aggregation would need to process an additional 7.2 million documents for every hour of traffic. As one might expect, the performance of an aggregation decays as the number of documents grows. At 7.2 million documents, our aggregation took approximately 1 second, and if two such aggregations were run every second, the performance of other simultaneous queries fell dramatically (by a factor of 4).

To overcome this, we took an approach similar to how MMS works, performing a small amount of pre-aggregation: taking a small performance hit when the customer query was recorded in order to save a much larger performance hit later. To do this, we created a new document that recorded a counter for the number of times a customer saw the ad for a given advertiser in a given month. This meant performing a very light aggregation up front to find the document tracking the customer impressions for a given advertiser, and updating that document by incrementing the counter.
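As a sketch of the recording step (the document layout, field names and helper function here are illustrative assumptions, not the app's actual schema), the pre-aggregation boils down to an upsert that increments a per-customer counter in a monthly document:

```javascript
// Build the filter and update documents for recording one impression.
// Field names (advertiserId, month, counts.<customerId>) are hypothetical.
function buildImpressionInc(advertiserId, customerId, date) {
  // Bucket counters by calendar month, e.g. '2014-06'.
  var month = date.getUTCFullYear() + '-' +
      ('0' + (date.getUTCMonth() + 1)).slice(-2);
  var inc = {};
  inc['counts.' + customerId] = 1;   // one counter per customer, per month
  return {
    filter: { advertiserId: advertiserId, month: month },
    update: { $inc: inc }            // $inc creates the field if missing
  };
}

var op = buildImpressionInc('ad42', 'cust7', new Date(Date.UTC(2014, 5, 15)));
// op.filter → { advertiserId: 'ad42', month: '2014-06' }
// op.update → { $inc: { 'counts.cust7': 1 } }
```

In the app, this pair would be passed to an upsert such as `Impressions.update(op.filter, op.update, { upsert: true }, callback)`, so recording an impression costs one small in-place update instead of inserting a new document per request.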
As a result, when we did our aggregation for the advertiser query, all we needed to do was aggregate unique customers to get our total number of impressions per advertiser, a much lighter-weight operation than aggregating millions of documents on demand. To make this step even more efficient, we then stored all of our counters in a single document and performed multi-updates, which gave us additional performance improvements over updates on multiple documents.

## Find External Bottlenecks

By this point we had gained sizable performance increases by optimizing our application code, so the next step was to begin looking outside our app for potential bottlenecks on the infrastructure side. We were using NodeLoad to send REST requests to an HAProxy load balancer, which forwarded them to a number of Node.js instances, which in turn queried a mongod instance. We also had a separate mongod instance for replication. We used medium-sized AWS instances for everything.

This type of performance optimization can, of course, become a game of whack-a-mole, since bottlenecks can occur at every step and often don’t become apparent until other bottlenecks are resolved. We began by looking at the number of Node.js application servers. Unsurprisingly, increasing the number of servers running Node.js increased the number of requests per second our application could handle, until performance began to flatten out around 8 Node.js instances. This told us that the number of Node servers becomes a bottleneck before MongoDB, but once we reached that threshold, the question was whether our mongod instances were becoming I/O or CPU bound. A quick experiment showed that increasing the size of the AWS instance running mongod led to a minimal increase in performance, so the issue was CPU.
When we ran this test again on a production-sized dataset of 110k documents, mongod became saturated when it was hit by 6 application servers running on AWS instances of the same size. Thanks to our simple, earlier experiment, we could feel reasonably sure of where the bottleneck was occurring should we need to scale our app to handle more traffic.

In our tests we found additional bottlenecks in both the HAProxy load balancer and NodeLoad. In the case of the load balancer, as we scaled up the load, it turned out the medium Amazon instance was not fast enough to handle the traffic we were sending it. This created a false impression that either MongoDB or Node was hitting a limit; moving the load balancer to a more powerful instance solved it. In the case of NodeLoad, we found that as load increased, performance became constrained by the number of parallel requests it was able to make: another unforeseen issue that yielded better results once we increased the size of the instance it was running on, but whose symptoms we might otherwise have blamed on our application.

These types of bottlenecks don’t tend to have as large an impact as some of the issues we discussed earlier, but they show that it’s important to perform systems tuning before going to production to get a clearer picture of what’s actually happening.

## Recap

As we’ve seen, performance optimization isn’t a simple task. Bottlenecks can occur at almost every step along the way, from application code to environment configuration to the database, but luckily there’s a lot you can do to mitigate these issues.
Here are some of the takeaways:

- Test with a real-world dataset
- Use the profiler to verify that what you are expecting is what’s actually happening in MongoDB (especially when using third-party frameworks like the MEAN stack)
- Be smart about how you aggregate, especially when dealing with large datasets
- Run experiments on varying configurations and iterate
- Search for bottlenecks inside and outside your application, including in application code, infrastructure, CPU, I/O and network bandwidth

And lastly, don’t forget to read our [production notes](http://docs.mongodb.org/manual/administration/production-notes/)! They are a great resource with lots of information and tips based on what we’ve seen on systems running MongoDB in the real world. For even more information about the details of our sample app, check out the full presentation or read our Performance Best Practices White Paper.

## About Ger Hartnett

Ger Hartnett manages MongoDB’s Technical Services team in EMEA. Before joining MongoDB he was CTO of Goshido and a software architect at Intel, Tellabs, Digital and Motorola. He’s enthusiastic about technology, especially concurrent systems, design patterns, agile/lean methods and new forms of collaboration and team organization. He co-authored a book for Intel Press on embedded networking, performance and security.