Introduction

Vert.x is an incredibly performant library for implementing low-latency services. Its multi-reactor pattern makes it possible to handle many requests per second, each within a few milliseconds.

Working with Real-Time Bidding, we receive thousands of requests per second, and we have to answer in less than 100 milliseconds. That’s why we chose Vert.x.

In this article, I will give you the lessons we have learned from 4 years of operating production services based on this library.

Basics

A Vert.x application looks like the following snippet of code. The main method creates a Vertx instance; verticles are then deployed through this instance with some deployment options.

Endpoints

Now you probably need to expose an endpoint as you are working API-first, right? Endpoints are exposed through a Router that maps a route to a handler, which is basically your business code.

Request Handlers

Handlers are responsible for reacting to something that is happening in the system. They manage the code execution once an asynchronous operation ends (by succeeding, failing, or timing out).

Failure Handlers

We can also attach a failure handler to our Route to execute a piece of code if a failure occurs during the request processing.

The failure handler is a great place to make sure connections are properly closed and to record metrics such as error types, which help us analyze our application's behavior, especially when unexpected errors occur.

Self Adaptability

Configurations

Some parts of the application will be configurable because we want the same binary to be executed in several environments, following the twelve-factor principles.

The Vert.x ecosystem offers vertx-config, a really well-designed module for configuration loading. It is organized around the notion of a ConfigStore, which represents anything able to contain a configuration (Redis, Consul, files, environment variables).

As shown in the example below, configuration stores are chained using the ConfigRetriever, and a value from a later store overrides the value from an earlier one.

This design is particularly good when using an external configuration system (like Redis, Consul, or your cloud provider's, such as Google Runtime Config). Your system tries to retrieve the configuration from the remote system; if that fails, it falls back to the environment variables, and if they are not defined, it uses the default configuration file. Be careful to declare the remote stores as optional to avoid exceptions when a remote system fails.
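The override and fallback order described above can be sketched in plain Java, without the vertx-config API. This is only an illustration of the idea: the class LayeredConfig, its add and load methods, and the sample keys are all hypothetical names, not part of Vert.x.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a chain of configuration stores (not the vertx-config API).
final class LayeredConfig {
  // One store in the chain: a loader plus a flag saying whether its failure is tolerated.
  private static final class Store {
    final Supplier<Map<String, String>> loader;
    final boolean optional;

    Store(Supplier<Map<String, String>> loader, boolean optional) {
      this.loader = loader;
      this.optional = optional;
    }
  }

  private final List<Store> stores = new ArrayList<>();

  LayeredConfig add(Supplier<Map<String, String>> loader, boolean optional) {
    stores.add(new Store(loader, optional));
    return this;
  }

  // Loads stores in declaration order; keys from later stores override earlier ones.
  Map<String, String> load() {
    Map<String, String> merged = new HashMap<>();
    for (Store store : stores) {
      try {
        merged.putAll(store.loader.get());
      } catch (RuntimeException e) {
        if (!store.optional) {
          throw e; // a mandatory store failed: surface the error
        }
        // an optional store is unavailable: keep the values loaded so far
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> config = new LayeredConfig()
        .add(() -> Map.of("greeting", "Hello", "port", "8080"), false) // default file
        .add(() -> Map.of("port", "9090"), false)                      // environment
        .add(() -> { throw new IllegalStateException("remote store unreachable"); }, true)
        .load();
    System.out.println(config.get("greeting") + " on port " + config.get("port"));
  }
}
```

Declaring the remote store last gives it the highest priority, and marking it optional is what lets the earlier stores act as a fallback when it is unreachable.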

Configuration is also hot-reloadable, which lets our application adapt its behavior without downtime. The ConfigRetriever periodically refreshes the configuration (every 5 seconds by default), so you can inject it into your application and call it to retrieve the latest values.

Our main method now looks like this. Please note that we wait for the first configuration retrieval before deploying the verticle, and that we propagate configuration changes through the Event Bus.

Our verticle now needs to decide how to react to configuration changes by subscribing to the Event Bus topic.

For example, I chose to mutate a variable in each handler to adapt the endpoint response.
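As a rough plain-Java sketch of that choice (not the actual verticle code), a handler can keep the configurable value in a field and replace it whenever a new configuration arrives. The class GreetingHandler, the onConfigChange callback, and the "greeting" key are assumptions made for this illustration; in the real application the callback would be wired to the Event Bus subscription.

```java
import java.util.Map;

// Hypothetical handler whose response depends on a hot-reloadable value.
final class GreetingHandler {
  // volatile so a value written by the config listener is visible to other threads
  private volatile String greeting;

  GreetingHandler(String initialGreeting) {
    this.greeting = initialGreeting;
  }

  // Called each time a new configuration is published.
  void onConfigChange(Map<String, String> newConfig) {
    // keep the current value when the key is absent from the new configuration
    greeting = newConfig.getOrDefault("greeting", greeting);
  }

  // The "business code": builds the endpoint response from the current value.
  String handle(String name) {
    return greeting + ", " + name + "!";
  }

  public static void main(String[] args) {
    GreetingHandler handler = new GreetingHandler("Hello");
    System.out.println(handler.handle("Ash"));
    handler.onConfigChange(Map.of("greeting", "Bonjour"));
    System.out.println(handler.handle("Ash"));
  }
}
```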

A quick word about the extensibility of the module: it is really easy to add new configuration stores. We needed to extend vertx-config to support Google Runtime Config, and it was a breeze to implement. You only need to implement two classes: a ConfigStore, which defines how to retrieve the configuration from the store, and a ConfigStoreFactory, which defines how to create a ConfigStore and pass it its own configuration, such as credentials or filter criteria.

Design for failure

Our production services are rarely alone: they depend on external systems such as a database, a message queue, a key/value store, a remote API, …

As we cannot rely on the health of external systems, we have to prepare our application to adapt when upstream dependencies face an outage or unexpected latency. The Vert.x ecosystem contains a module that implements the Circuit Breaker pattern to deal with this easily.

For this example, let’s define a new handler that will contact the awesome PokeAPI to list pokemon! As this is an external dependency, we need to wrap the call with a Circuit Breaker.

Now, we can simulate network latency by using Traffic Control and see the circuit breaker open once the max failure limit is reached. From then on, our handler answers immediately with the fallback value.

tc qdisc add dev eth0 root netem delay 2000ms

To simulate the recovery of the external service, let’s remove the latency rule and observe that the breaker closes again and returns the remote API response.

tc qdisc del dev eth0 root netem
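The open and close cycle observed above can be sketched as a small state machine in plain Java. This is not the Vert.x circuit breaker module, only an illustration of the pattern it implements; the class SimpleCircuitBreaker and the parameter names maxFailures and resetTimeoutMillis are assumptions of this sketch, and the clock is injected so the behavior can be exercised without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker (the call runs while holding the lock).
final class SimpleCircuitBreaker {
  enum State { CLOSED, OPEN, HALF_OPEN }

  private final int maxFailures;
  private final long resetTimeoutMillis;
  private final LongSupplier clock;
  private State state = State.CLOSED;
  private int failures = 0;
  private long openedAt = 0L;

  SimpleCircuitBreaker(int maxFailures, long resetTimeoutMillis, LongSupplier clock) {
    this.maxFailures = maxFailures;
    this.resetTimeoutMillis = resetTimeoutMillis;
    this.clock = clock;
  }

  // Returns the current state, moving from OPEN to HALF_OPEN once the reset timeout elapsed.
  synchronized State state() {
    if (state == State.OPEN && clock.getAsLong() - openedAt >= resetTimeoutMillis) {
      state = State.HALF_OPEN;
    }
    return state;
  }

  // Runs the call unless the breaker is open; any failure yields the fallback value.
  synchronized <T> T execute(Supplier<T> call, Supplier<T> fallback) {
    if (state() == State.OPEN) {
      return fallback.get(); // fail fast: the dependency is not contacted at all
    }
    try {
      T result = call.get();
      failures = 0;
      state = State.CLOSED; // a success (including a half-open trial) closes the breaker
      return result;
    } catch (RuntimeException e) {
      failures++;
      if (state == State.HALF_OPEN || failures >= maxFailures) {
        state = State.OPEN;
        openedAt = clock.getAsLong();
      }
      return fallback.get();
    }
  }

  public static void main(String[] args) {
    SimpleCircuitBreaker breaker =
        new SimpleCircuitBreaker(2, 1_000L, System::currentTimeMillis);
    for (int i = 0; i < 3; i++) {
      String answer = breaker.execute(
          () -> { throw new IllegalStateException("remote API timed out"); },
          () -> "fallback");
      System.out.println(answer + " (breaker is " + breaker.state() + ")");
    }
  }
}
```

Once the reset timeout has elapsed, the next call is let through as a trial: if it succeeds the breaker closes again, otherwise it reopens immediately.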

Observability

No one goes in production blindly, so we have to define how to observe our application at runtime.

Health checks

The most basic way to observe our beloved software is to periodically ask how it fares. Thanks to the Vert.x ecosystem, there is a module for that!

Our policy is to expose two endpoints: one to tell whether the application is alive, and another to tell whether it is healthy. It can be alive without being healthy, for instance while it still needs to load some initial data.

I like to use the “healthy” endpoint for monitoring purposes, to know when our service quality is degraded because of an external dependency failure, and the “alive” endpoint for alerting, because a failure there requires an external action to recover (restart the service, replace instances, …).

Let’s add these health checks to the PokemonHandler.

Now we need to expose the relevant endpoints in our Router.

With this configuration, our software can tell us if it is alive and, in that case, if it is ready to achieve its objective in optimal conditions. Here is the mapping I have in mind:

Alive and healthy: NORMAL, everything is fine.

Alive not healthy: WARNING, the application works in degraded mode.

Not alive: CRITICAL, the application does not work.
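The mapping above is simple enough to write down as a small Java function. The names HealthMapping, Severity, and severity are hypothetical and only serve this sketch; in practice the two booleans would come from the results of the two health endpoints.

```java
// Hypothetical helper turning the two health check results into an alerting level.
final class HealthMapping {
  enum Severity { NORMAL, WARNING, CRITICAL }

  static Severity severity(boolean alive, boolean healthy) {
    if (!alive) {
      return Severity.CRITICAL; // the application does not work
    }
    // alive: either everything is fine, or we run in degraded mode
    return healthy ? Severity.NORMAL : Severity.WARNING;
  }

  public static void main(String[] args) {
    System.out.println(severity(true, true));
    System.out.println(severity(true, false));
    System.out.println(severity(false, false));
  }
}
```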

Logs

The first thing we do once an alert has been raised is often to look at the metrics and error logs. So we need to publish logs in a usable way.

We need to:

Export logs with an easy to parse format like JSON.

Add information about the execution environment (host, container id, platform, environment) to easily identify whether the problem is isolated to a process, a compute node, or a version of the application.

Add information about execution context (user id, session id, correlation id) to understand the sequence that caused an error.

Export format

We need to configure logback to output logs with the expected format.

A little trick you need to know: Vert.x's own logging does not pick up the logback configuration by itself. We have to explicitly tell it which backend to use by setting the “vertx.logger-delegate-factory-class-name” system property to “io.vertx.core.logging.SLF4JLogDelegateFactory”.

System.setProperty("vertx.logger-delegate-factory-class-name", "io.vertx.core.logging.SLF4JLogDelegateFactory");

You can also do it at the JVM level by launching your app with the -Dvertx.logger-delegate-factory-class-name=io.vertx.core.logging.SLF4JLogDelegateFactory parameter.

Now that our application outputs standard JSON, let’s add some contextual data about the environment it runs on. For this purpose, we use MDC that maintains a context in which metadata can be added.

Beware that MDC cannot be used to add request-scoped metadata (userId, requestId, sessionId, correlationId, …), because it relies on thread-local values, which are not compatible with the asynchronous nature of Vert.x. See this topic to dig further.

We need another solution to log these data… As a workaround, let’s add them to the log message itself and let our centralized logs platform parse it and transform it into metadata.
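A minimal sketch of this workaround, in plain Java: the context is rendered as key=value pairs in front of the message so that a log platform can extract them later. The class ContextualLog, the format method, and the bracketed key=value layout are assumptions of this example, not the format used in the original project.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper that embeds the execution context in the log message itself.
final class ContextualLog {
  // Renders the context as "[k1=v1 k2=v2] message"; returns the bare message if it is empty.
  static String format(String message, Map<String, String> context) {
    String prefix = context.entrySet().stream()
        .map(entry -> entry.getKey() + "=" + entry.getValue())
        .collect(Collectors.joining(" "));
    return prefix.isEmpty() ? message : "[" + prefix + "] " + message;
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps the insertion order, so the fields are always rendered the same way
    Map<String, String> context = new LinkedHashMap<>();
    context.put("userId", "42");
    context.put("correlationId", "abc-123");
    System.out.println(format("Name must start with an uppercase char", context));
  }
}
```

The resulting string would then be passed to the regular logger call; a stable field order makes the parsing rule on the log platform side easier to write.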

Now, triggering a validation error gives us a message full of context, making logs a powerful tool to analyze the path that led to an error.

Metrics

Vert.x is natively integrated with Dropwizard Metrics and Micrometer. Let’s say we want to see how many requests each endpoint handled. I will not demonstrate how to report metrics to a backend, but you basically have two options: Dropwizard Reporters and OpenCensus.

Here is how to configure metrics using a standard Dropwizard Metric Reporter.

We can observe the evolution of our Service Level Indicators over time. It allows us to see both application indicators (requests/second, HTTP response codes, thread usage, …) and business metrics (those that make sense for my application, like new customers, deactivated customers, mean activity duration, …).

If we want to count application errors, for instance, we can now use Dropwizard as we usually do.

Traces

Last but not least, we may need to add traces to our application to understand what happens when we detect an unexpected behavior and walk through particular requests across our systems.

Vert.x is not currently ready to handle this feature set, but it is definitely moving in this direction. See the RFC.

Security

No surprise here, we need to secure access to our application before deploying it in production.

Authentication & Authorization

We use Auth0 for both. As the values are not meant to be public, I did not add this part of the article to the GitHub repository. Instead, I give you links to some useful resources:

Please remember that you should never trust the network, so implement authentication and authorization at the application level.

Authentication associates an identity with each action; authorization restricts actions depending on the permissions associated with the identity performing them. For instance, our health endpoints can be publicly accessible, but other actions should be restricted by applying the principle of least privilege.

Using Auth0, decoding and verifying the token is enough to perform authentication; checking the scopes is necessary to perform authorization.
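The authorization half can be illustrated with a plain-Java scope check. This sketch assumes the token has already been decoded and verified by a proper JWT library, and that the granted scopes arrive as a single space-delimited string; the class ScopeCheck, the method isAuthorized, and the scope names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical authorization check on an already verified token.
final class ScopeCheck {
  // Returns true when every required scope is present in the space-delimited scope claim.
  static boolean isAuthorized(String scopeClaim, Set<String> requiredScopes) {
    if (scopeClaim == null || scopeClaim.isBlank()) {
      return requiredScopes.isEmpty(); // no scope granted: only unrestricted actions pass
    }
    Set<String> granted = new HashSet<>(Arrays.asList(scopeClaim.trim().split("\\s+")));
    return granted.containsAll(requiredScopes);
  }

  public static void main(String[] args) {
    String scopeClaim = "read:pokemons read:greetings"; // made-up scope names
    System.out.println(isAuthorized(scopeClaim, Set.of("read:pokemons")));
    System.out.println(isAuthorized(scopeClaim, Set.of("write:pokemons")));
  }
}
```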

Input Validation

Inputs should always be validated to ensure our application never processes malicious or erroneous data that could cause damage to the system. Again, there is a module for that!

Let’s add a new “hello world” endpoint to our application, and say that it is only callable with the following parameters:

A name path parameter, which is a required string beginning with an uppercase character and composed of alphabetical characters.

An Authorization header, which is a required string.

A Version header, which is an optional int.

Let’s add the validation and greeting handlers to our Router without forgetting to implement a custom validator to handle the business validation of the name.

If we call the endpoint without any of the required parameters, a ValidationException is thrown and the registered FailureHandler will continue the request processing.

curl -i -H 'Authorization: toto' -H 'Version: 1' http://localhost:8080/greetings/fds

> HTTP/1.1 400 Bad Request

> content-length: 38

> Name must start with an uppercase char
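The business rule on the name that produces this response can be sketched in plain Java, outside of the Vert.x validation module. The class NameValidator and the second error message are hypothetical; only the first message comes from the response above, and "alphabetical" is assumed here to mean ASCII letters.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical stand-alone version of the custom name validation.
final class NameValidator {
  static final String UPPERCASE_ERROR = "Name must start with an uppercase char";
  // Made-up message for this sketch; the article only shows the uppercase error.
  static final String ALPHABETICAL_ERROR = "Name must contain only alphabetical chars";

  private static final Pattern LETTERS_ONLY = Pattern.compile("[A-Za-z]+");

  // Returns an error message when the name is invalid, or an empty Optional when it is valid.
  static Optional<String> validate(String name) {
    if (name == null || name.isEmpty() || !Character.isUpperCase(name.charAt(0))) {
      return Optional.of(UPPERCASE_ERROR);
    }
    if (!LETTERS_ONLY.matcher(name).matches()) {
      return Optional.of(ALPHABETICAL_ERROR);
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(validate("fds").orElse("valid"));
    System.out.println(validate("Ash").orElse("valid"));
  }
}
```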

Conclusion

The Vert.x ecosystem is really impressive regarding the number of modules, which cover almost all of the features we need in production. The library is well designed and proposes an SPI for each concept, which makes it very extensible.

For instance, we needed to add new Configuration Stores to load configurations from Google Runtime Configurations and Google Compute Metadata, and the first implementation draft took us less than a day!

Although tracing is a missing piece for observability, it is not a blocker for releasing our services in production environments, as we are able to observe them through health checks, metrics, and logs.

You can find the source code here: https://github.com/migibert/vertx-in-production