Server Selection in Next Generation MongoDB Drivers

I love to cook. Sometimes, my guests like something so much that they ask for the recipe. Occasionally, I have to confess there isn't one: I just made it up as I went along! Improvisation is fine in the kitchen, but it's not a great approach for consistency in software development.

The MongoDB Drivers team is responsible for writing and maintaining eleven drivers across ten languages. We want our drivers to have similar behaviors, even while staying idiomatic for each language. One way we do that is by writing and sharing driver specification documents for those behaviors that we'd like to have in common across all drivers. Just as a recipe helps a chef serve a consistently great dish night after night, these specifications guide software development for consistency across all drivers, at MongoDB and in our community.

One of the most recent specifications we've developed covers server selection. Production MongoDB deployments typically consist of multiple servers, either as a replica set or as a sharded cluster. Server selection describes the process by which a driver chooses the right server for any given read or write operation, taking into account the last known status of all servers. The specification also covers when to recheck server status and when to give up if an appropriate server isn't available.

The rest of this article describes our design goals and how server selection will work in the next generation of MongoDB drivers.

Design Goals

The most important goal is that server selection be predictable. If an application is developed against a standalone server, later deployed in production against a replica set, then finally used with a sharded cluster, the application code should stay the same and need only appropriate changes to configuration.
For example, if some part of an application queries a secondary, that should succeed with a standalone server (where the notion of primary and secondary is irrelevant), work as expected against a replica set, and keep working in a sharded cluster where secondary reads are proxied by a mongos.

The second design goal is that server selection be resilient whenever possible. That means that in the face of detectable server failures, drivers should try to continue with alternative servers rather than immediately fail with an error. For a write, that means waiting for a primary to become available or switching to another mongos (for a sharded cluster). For a read, that means selecting an alternative server, if the read preference allows.

The third design goal is that server selection be low-latency. That means that if more than one server is appropriate for an operation, servers with a lower average round-trip time (RTT) should be preferred over others.

Overview of the Server Selection Specification

The Server Selection specification [1] has four major parts:

Configuration
Average Round-Trip Time (RTT)
Read Preferences
Server Selection Algorithm

Configuration

Server selection is governed primarily by two configuration variables:

serverSelectionTimeoutMS. The serverSelectionTimeoutMS variable gives the amount of time in milliseconds that drivers should allow for server selection before giving up and raising an error. Users can set this higher or lower depending on whether they prefer to be patient or to return an error to users quickly (e.g. a "fail whale" web page). The default is 30 seconds, which is enough time for a typical new-primary election to occur during failover.

localThresholdMS. If more than one server is appropriate for an operation, the localThresholdMS variable defines the size of the acceptable "latency window" in milliseconds relative to the server with the best average RTT. One server in the latency window will be selected at random.
When this is zero, only the server with the best average RTT will be selected. When this is very large, any appropriate server could be selected. The default is 15 milliseconds, which allows only a little bit of RTT variance.

For example, in the illustration below, Servers A through E are all appropriate for an operation (perhaps all are mongos servers able to handle a write operation) and localThresholdMS has been set to 100. Server A has the lowest average RTT at 15ms, so it defines the lower bound of the latency window. The upper bound is at 115ms (15ms + 100ms), thus only Servers A, B and C are in the latency window.

[Figure: Servers A, B and C are in the latency window]

The localThresholdMS variable used to be called secondaryAcceptableLatencyMS, but was renamed for consistency with mongos (which already had localThreshold as a configuration option) and because it no longer applies only to secondaries.

Average Round-Trip Time

Another driver specification, Server Discovery and Monitoring, defines how drivers should find servers from a seed list and monitor server status over time. During monitoring, drivers regularly record the RTT of ismaster commands. The Server Selection specification calls for the average of these measurements to be calculated using an exponentially weighted moving average function. If the prior average is denoted $RTT_{t-1}$, then the new average ($RTT_t$) is computed from a new RTT measurement ($X_t$) and a weighting factor ($\alpha$) using the following formula:

$$RTT_t = \alpha \cdot X_t + (1 - \alpha) \cdot RTT_{t-1}$$

The weighting factor is set to 0.2, which was chosen to put about 85% of the weight of the average RTT on the 9 most recent observations. Weighting recent observations more heavily means that the average responds quickly to sudden changes in latency.

Read Preferences

A read preference indicates which servers should handle reads under a replicated deployment. Read preferences are usually configured in the connection string or the top-level client object in a driver.
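As an aside on the average RTT calculation described above, the moving-average update can be sketched in a few lines of Python. This is only an illustration, not driver code: the function names and the sample RTT values are hypothetical, and seeding the average with the first measurement is an assumption that the article does not state.

```python
ALPHA = 0.2  # weighting factor given in the article


def update_average_rtt(previous_avg, new_rtt, alpha=ALPHA):
    """Return the new exponentially weighted moving average RTT.

    Assumption: with no previous average, the first measurement
    is used as the starting value.
    """
    if previous_avg is None:
        return new_rtt
    return alpha * new_rtt + (1 - alpha) * previous_avg


def weight_of_recent(n, alpha=ALPHA):
    """Share of the average contributed by the n most recent measurements."""
    return 1 - (1 - alpha) ** n


avg = None
for sample_ms in [20.0, 20.0, 20.0, 70.0]:  # hypothetical RTT samples in ms
    avg = update_average_rtt(avg, sample_ms)

print(round(avg, 2))                   # one 70ms spike moves a 20ms average to about 30ms
print(round(weight_of_recent(9), 2))   # share of weight on the 9 most recent samples
```

With a weighting factor of 0.2, the nine most recent measurements carry 1 - 0.8^9, or roughly 87%, of the total weight, which is in line with the "about 85%" figure quoted above.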
Some drivers may also allow read preferences to be set at the database, collection or even individual query level.

A read preference can be thought of as a document with a mode field and an optional tag_sets field.

The mode determines whether primaries or secondaries are preferred:

primary: only read from the primary
secondary: only read from a secondary
primaryPreferred: read from the primary if possible, or fall back to reading from a secondary
secondaryPreferred: read from a secondary if possible, or fall back to reading from the primary
nearest: no preference between primary or secondary; read from any server in the latency window

The tag_sets field, if provided, contains a tag set list that is used to filter secondaries from consideration (thus it only applies when the mode is not "primary"). The terminology around tags and tag sets can be a little confusing, so the Server Selection specification defines them like this:

tag: a single key/value pair
tag set: a document containing zero or more tags
tag set list: an ordered list of tag sets

In a replica set, one can assign a tag set to each server to indicate user-defined properties for each server. A read preference tag set matches a server tag set if the read preference tag set is a subset of the server tag set. For example, a read preference tag set { dc: 'ny', rack: 2 } would match a server with the tag set { dc: 'ny', rack: 2, size: 'large' }:

{ dc: 'ny', rack: 2 } ⊆ { dc: 'ny', rack: 2, size: 'large' }

Because the tag set list is ordered, the first tag set that matches any secondary is used to choose eligible secondaries.
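The subset rule and the ordered fallback can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the specification's reference logic: the function names, the dictionary layout of a server and the sample secondaries are all hypothetical.

```python
def tag_set_matches(read_pref_tags, server_tags):
    """True if every tag in read_pref_tags appears with the same value in server_tags."""
    return all(
        key in server_tags and server_tags[key] == value
        for key, value in read_pref_tags.items()
    )


def select_by_tag_sets(tag_set_list, secondaries):
    """Return the secondaries matching the first tag set that matches any of them."""
    for tag_set in tag_set_list:
        matched = [s for s in secondaries if tag_set_matches(tag_set, s["tags"])]
        if matched:
            return matched
    return []


# Hypothetical secondaries, each carrying a user-defined tag set.
secondaries = [
    {"name": "s1", "tags": {"dc": "ny", "rack": 1}},
    {"name": "s2", "tags": {"dc": "sf", "rack": 2}},
]

tag_set_list = [{"dc": "ny", "rack": 2}, {"dc": "ny"}, {}]

eligible = select_by_tag_sets(tag_set_list, secondaries)
print([s["name"] for s in eligible])  # ['s1']: no NY rack 2 server, so { dc: 'ny' } is used
```

Note that an empty tag set matches every server here because `all()` over no tags is true, which mirrors the point that the empty set is a subset of any set.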
For example, consider the following tag set list (where 'dc' stands for 'data center'):

[ { dc: 'ny', rack: 2 }, { dc: 'ny' }, { } ]

First, the driver tries to choose any secondaries in the NY data center on rack 2. If there aren't any, then any secondaries at all in the NY data center are chosen. If the NY data center itself is down, the last tag set allows any secondary to be chosen.

If the behavior of the empty tag set ({ }) seems surprising, remember that in mathematical terms, the empty set is a subset of any set, thus the empty set matches all secondaries. It's a good fallback for an application that prefers particular secondaries, but doesn't want to fail if those secondaries aren't available.

Server Selection Algorithm

When a driver needs to select a server, it follows a series of steps to either select a server or else try again until the server selection timeout is reached. Within the algorithm, there are slight differences for different deployment types to ensure that the overall selection process achieves the predictable design goal. A high-level overview of the algorithm follows [3]:

1. Record the server selection start time

When selection starts, the driver records the starting time so that it can tell when the selection timeout has been exceeded.

2. Find suitable servers by topology type

A 'suitable' server is one that satisfies all the criteria to carry out an operation. For example, for write operations, the server must be able to accept a write. The specific rules for suitability vary by the type of deployment:

Single server: The single server is suitable for both reads and writes. Any read preference is ignored.
Replica set: Only the primary is suitable for writes. Servers are suitable for reads if they meet the criteria of the read preference in effect.
Sharded cluster: Because mongos is a proxy for the shard servers, any mongos server is suitable for reads and writes.
For reads, the read preference in effect will be passed to the selected mongos for it to use in carrying out the operation on the shards.

3. Choose a suitable server at random from within the latency window

If there is only one suitable server, it is selected and the algorithm ends. If more than one server is suitable, they are further filtered to those within the latency window. If there is more than one suitable server in the window, one is chosen at random to distribute the load fairly, and the algorithm ends. Because the server with the shortest average RTT defines the lower bound of the latency window, it is always one of the servers that might be selected.

4. If there are no suitable servers, wait for a server status update

If no server is selected (for example, when the driver needs to find a replica set primary, but the replica set has failed over and is holding an election to choose the new primary), then the driver tries to update the status of the servers it is monitoring and waits for a change in status.

5. If the server selection timeout has been exceeded, raise an error

If more than serverSelectionTimeoutMS milliseconds have elapsed since the start of server selection, the driver raises a server selection error to the application.

6. Go to Step 2

If the timeout has not expired and the status of the servers has been updated, then the selection algorithm continues looking for suitable servers.

Summary

The Server Selection specification will guide the next generation of MongoDB drivers in a consistent approach to server selection that delivers on the three goals of being predictable, resilient, and low-latency. Users will be able to control how long server selection is allowed to take with the serverSelectionTimeoutMS configuration variable and control the size of the acceptable latency window with the localThresholdMS configuration variable.
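As a recap, the six steps of the selection algorithm and the two configuration variables can be sketched in Python. This is a simplified illustration, not any driver's actual implementation. All names here (`find_suitable`, `within_latency_window`, `select_server`, `get_servers`, `wait_for_update`) and the server dictionaries are hypothetical. Tag sets and the primaryPreferred and secondaryPreferred modes are left out. Only Server A's 15ms RTT comes from the article; the other RTT values are invented to reproduce the latency window example.

```python
import random
import time


def find_suitable(servers, topology, is_write, read_mode="primary"):
    """Step 2: filter servers by topology type (simplified)."""
    if topology == "single":
        return list(servers)  # any read preference is ignored
    if topology == "sharded":
        return [s for s in servers if s["type"] == "mongos"]
    # Replica set: writes and primary reads need the primary.
    if is_write or read_mode == "primary":
        return [s for s in servers if s["type"] == "primary"]
    if read_mode == "secondary":
        return [s for s in servers if s["type"] == "secondary"]
    # Any other mode is treated as "nearest" in this sketch.
    return [s for s in servers if s["type"] in ("primary", "secondary")]


def within_latency_window(candidates, local_threshold_ms):
    """Keep servers whose average RTT is within the threshold of the fastest one."""
    fastest = min(s["avg_rtt_ms"] for s in candidates)
    return [s for s in candidates if s["avg_rtt_ms"] <= fastest + local_threshold_ms]


def select_server(get_servers, topology, is_write, read_mode="primary",
                  server_selection_timeout_ms=30000, local_threshold_ms=15,
                  wait_for_update=lambda: time.sleep(0.01)):
    start = time.monotonic()                                    # step 1
    while True:
        suitable = find_suitable(get_servers(), topology,
                                 is_write, read_mode)           # step 2
        if suitable:                                            # step 3
            window = within_latency_window(suitable, local_threshold_ms)
            return random.choice(window)
        wait_for_update()                                       # step 4
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > server_selection_timeout_ms:            # step 5
            raise TimeoutError("no suitable server within the selection timeout")
        # step 6: loop back to step 2


mongoses = [
    {"name": "A", "type": "mongos", "avg_rtt_ms": 15},
    {"name": "B", "type": "mongos", "avg_rtt_ms": 60},   # hypothetical RTT
    {"name": "C", "type": "mongos", "avg_rtt_ms": 110},  # hypothetical RTT
    {"name": "D", "type": "mongos", "avg_rtt_ms": 150},  # hypothetical RTT
    {"name": "E", "type": "mongos", "avg_rtt_ms": 200},  # hypothetical RTT
]

window = within_latency_window(mongoses, local_threshold_ms=100)
print([s["name"] for s in window])  # ['A', 'B', 'C']

chosen = select_server(lambda: mongoses, "sharded", is_write=True,
                       local_threshold_ms=100)
print(chosen["name"])  # one of A, B or C, chosen at random
```

In this sketch `get_servers` stands in for whatever monitoring state a driver keeps, and `wait_for_update` simply sleeps; a real driver would instead block until its monitors report a change in server status.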
For more on the next-generation MongoDB drivers, see our blog post, Announcing the Next Generation Drivers for MongoDB.

About the Author - David

David is a senior engineer on the Developer Experience team. He has been active in open-source software for over 15 years, with particular emphasis on the Perl language and community. When he's not writing spec documents, David maintains the MongoDB Perl driver and avoids social media as much as possible.

[1] http://goo.gl/HM3tgS
[2] http://goo.gl/wOsmJb
[3] For this article, some steps have been simplified and some client-server interoperability checks have been omitted.