Going full reactive?

Following this blog post on Spring Boot and MongoDB, I decided to port it in order to be "fully reactive":

- Migrating from the standard Spring Web stack to Spring Webflux, which uses Project Reactor in order to have a reactive API.

- Migrating from CosmosDB with the MongoDB API to the newest CosmosDB SDK, using the SQL API. If you want more information on CosmosDB, here is the documentation.

The idea of this post is that we should be fully reactive from the database to the Web layer, so we can study the required APIs and have a look at the performance and scalability of this architecture.

Giving a spin to the new Azure CosmosDB SDK

Please note that this uses the newest Azure CosmosDB SDK, which is not yet finished, so this post is also a world-first preview of that SDK, allowing us to test it and discuss it. I (Julien Dubois) am directly in contact with the SDK team at Microsoft, so if you have any comment or issue, don't hesitate to post it on this article, or to contact me directly!

This new SDK is available on https://github.com/Azure/azure-cosmosdb-java/tree/v3.

At the time of this writing, the documentation and sample applications are not ready yet.

If you are using the previous SDK, please note that the Maven artifactId has changed: this SDK is now available as com.microsoft.azure:azure-cosmos.

As this is some new code, you can expect a few issues. For example, I found this one while writing this blog post, and I proposed a fix here. But as you'll see throughout this post, this new API is much better than the previous one, and behaves really well.

Doing a CRUD with the new CosmosDB SDK

The really awesome news about the new CosmosDB SDK is that it uses Project Reactor, like Spring Webflux, instead of the aging RxJava v1 API from the previous release. This means it's going to return Mono and Flux objects, and those are exactly what Spring Webflux likes, so integrating both is going to be very smooth and easy.

The whole demo project is available at https://github.com/jdubois/spring-webflux-cosmosdb-sql/, but let's focus on the repository layer as this is where all the CosmosDB SDK magic lives.

CosmosDB configuration and connection

We have created a specific Spring Boot configuration properties class in order to hold our configuration. This is used in our repository layer, which looks like this:



```java
ConnectionPolicy connectionPolicy = new ConnectionPolicy();
connectionPolicy.connectionMode(ConnectionMode.DIRECT);
client = CosmosClient.builder()
        .endpoint(accountHost)
        .key(accountKey)
        .connectionPolicy(connectionPolicy)
        .build();
```

The important part is that it uses the "direct mode" connection policy, instead of the default "gateway" policy: in our tests, this made a huge difference, as our reactive code is maybe too efficient for the gateway, and was quickly flooding it. So we had a lot of connection errors to the gateway, which simply disappeared as soon as we switched to direct mode: if you can use it, it is very highly recommended in this "reactive" scenario.

In the init() method (available here), we also made two blocking calls to create the database and its container. There are a couple of interesting tweaks here:

- Our container doesn't have an indexing policy, as we used indexingPolicy.automatic(false);. By default, CosmosDB indexes all fields of all stored objects, which has a significant cost during insertion. We didn't need this for our tests, but we also believe it is too aggressive, and should be tuned for each specific use case.

- The container is created with a default throughput of 400 RU/s, using database.createContainerIfNotExists(containerSettings, 400). Be careful with this setting, as it can quickly cost a lot of money if it is set too high. Strangely, it was set to 1000 with the MongoDB API, when it is 400 by default with the SQL API - but anyway, this is such an important setting that it's better to fix it than to rely on the defaults.

- When doing new CosmosContainerProperties(CONTAINER_NAME, "/id");, we used our id as the partition key. This is why we get an item using container.getItem(id, id): the first argument is the id and the second is the partition key, which happens to also be the id. This works fine, and in our demo the Project id should indeed be used to partition everything, so this makes business sense.
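Put together, the container setup in init() might look roughly like the sketch below. It is based on the calls quoted above, but the surrounding accessor and builder names (createDatabaseIfNotExists, block(), the DATABASE_NAME constant) are my assumptions about this preview SDK, so check the demo project for the authoritative version.

```java
// Sketch of the init() setup, based on the calls discussed above.
// DATABASE_NAME and CONTAINER_NAME are assumed constants.
IndexingPolicy indexingPolicy = new IndexingPolicy();
indexingPolicy.automatic(false); // don't index every field of every object

CosmosContainerProperties containerSettings =
        new CosmosContainerProperties(CONTAINER_NAME, "/id"); // id doubles as partition key
containerSettings.indexingPolicy(indexingPolicy);

// Two blocking calls: acceptable here, as this only runs once at startup
CosmosDatabase database = client.createDatabaseIfNotExists(DATABASE_NAME)
        .block().database();
container = database.createContainerIfNotExists(containerSettings, 400) // 400 RU/s
        .block().container();
```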

Creating, finding and deleting an item

For simple operations, when we have an item's id (and partition key), we can directly use a simple API, for example for creating:



```java
public Mono<Project> save(Project project) {
    project.setId(UUID.randomUUID().toString());
    return container.createItem(project)
            .map(i -> {
                Project savedProject = new Project();
                savedProject.setId(i.item().id());
                savedProject.setName(i.properties().getString("name"));
                return savedProject;
            });
}
```

As there is no ORM provided, we need to manually map the returned result to our domain object. That's quite a lot of boilerplate code for bigger objects, but it's common with this kind of technology. The great news, of course, is that in this case we can easily return a Mono<Project>, which is exactly what Spring Webflux wants to have.
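Reading and deleting follow the same pattern when the id (and thus the partition key) is known, using the item API reached through container.getItem(id, id) mentioned above. This is only a sketch: the read()/delete() method names are my assumption about the preview SDK, so refer to the demo repository for the exact calls.

```java
public Mono<Project> findById(String id) {
    // First argument is the id, second is the partition key (also the id here)
    return container.getItem(id, id)
            .read()
            .map(i -> {
                Project project = new Project();
                project.setId(i.item().id());
                project.setName(i.properties().getString("name"));
                return project;
            });
}

public Mono<Void> delete(String id) {
    return container.getItem(id, id)
            .delete()
            .then();
}
```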

Querying

Doing an SQL query is a bit more complex, and we had two issues here:

- As our id is also our partition key, we had to allow cross-partition queries in order to get all the data, using options.enableCrossPartitionQuery(true);. This of course has a performance cost.

- As we wanted paginated data, we used TOP 20 in our SQL query to only get 20 items, and not flood the system.

Here is the resulting code:



```java
FeedOptions options = new FeedOptions();
options.enableCrossPartitionQuery(true);
return container.queryItems("SELECT TOP 20 * FROM Project p", options)
        .map(i -> {
            List<Project> results = new ArrayList<>();
            i.results().forEach(props -> {
                Project project = new Project();
                project.setId(props.id());
                project.setName(props.getString("name"));
                results.add(project);
            });
            return results;
        });
```

Be careful when trying to limit the number of returned values, as you might be tempted to configure the FeedOptions instance using options.maxItemCount(20). This will not work, and it is quite tricky:

- The query returns paginated values, and maxItemCount is in fact the number of values in each page. This comes from the CosmosDB API (it is the name of the HTTP header used underneath when doing the query), so there is some logic to this name, but it can definitely cause trouble as the name is misleading. So if you set it to 20, you will still get your whole item list, just in small pages, and this is going to be really costly.

- Please note that the documentation doesn't say what the default maxItemCount is, but it is hard-coded to 100.

- It is because of this API that our query returns a Flux<List<Project>> and not a Flux<Project>: we have a flux of pages, and not just a flux of items.
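To make the page-size pitfall concrete, here is a small plain-Java simulation (no SDK involved, all names are illustrative): shrinking maxItemCount only shrinks the pages, and a client that drains all pages still receives every item.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: "maxItemCount" bounds the page size, not the total
// result size, so draining every page still returns the whole result set.
public class Main {

    // Simulate a query over `total` stored items, returned in pages
    // of at most `maxItemCount` items each.
    static List<List<Integer>> queryPages(int total, int maxItemCount) {
        List<List<Integer>> pages = new ArrayList<>();
        for (int start = 0; start < total; start += maxItemCount) {
            List<Integer> page = new ArrayList<>();
            for (int i = start; i < Math.min(start + maxItemCount, total); i++) {
                page.add(i);
            }
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        // 1,000 stored items, maxItemCount "limited" to 20
        List<List<Integer>> pages = queryPages(1000, 20);
        int received = pages.stream().mapToInt(List::size).sum();
        // Every item still arrives: 50 pages of 20 items each
        System.out.println("pages=" + pages.size() + " items=" + received);
    }
}
```

This is why TOP 20 in the SQL query itself, and not maxItemCount, is the right way to bound the result set.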

Performance testing

At last, we arrive at performance testing! We're going to do something similar to the blog post on Spring Boot and MongoDB so you can have a look at both results, but don't compare apples and oranges, as that other application was created using JHipster. JHipster does not (yet) fully support reactive programming, so the Spring Webflux application was coded manually, and is thus quite different:

The JHipster application had security, auditing and metrics: these all consume quite a lot of performance, but they are essential if you want to deploy real applications in production. Our Spring Webflux demo is much simpler.

Also, JHipster provides several performance tweaks that we don't have in the Spring Webflux demo: for example, it uses Afterburner.

So, while it is interesting to compare both applications, keep in mind they are not exactly 1-to-1.

Going to production

As we did in the Spring Boot/MongoDB blog post, we deployed the application on Azure Web Apps using the provided Maven plugin (see it here in our pom.xml).

Test scenario

Our test scenario is made with Gatling, and is available at https://github.com/jdubois/spring-webflux-cosmosdb-sql/blob/master/src/test/gatling/user-files/simulations/ProjectGatlingTest.scala. It's a simple script that simulates users going through the API: creating, querying and deleting items.

Running with 100 users

Our first test is with 100 users, and as expected everything works really well as it's not a lot of concurrent requests:

Nothing to see, let's move on!

Going to 500 users

Going to 500 users is interesting: it still works really well, but we had 3 errors:

This is because deleting items is a costly operation on CosmosDB (it uses a bit more than 5 RU), so doing this while the application is under full load means we're hitting our RU limit. This is a result of having a more performant and stable application than with the classical Spring Web framework: we are hitting our backend harder, and we need to take this into account.

Reaching 1,000 users

To go past 500 users, we needed to increase our CosmosDB RU/s, as we did with Spring Web. Here, 1,200 RU/s seemed enough, but to be honest we pushed it to 5,000 RU/s so we didn't have to worry about this for the rest of the tests.

Again, everything went fine without any issue, let's scale up!

10,000 users

Going to 10,000 users had an interesting side-effect: our Gatling tests started to fail, on the client side. So we had to increase the ulimit on our load testing machine: this is quite usual, but it didn't happen with Spring Web, so here again we see that going fully reactive has an effect, as the application runs too fast for our load testing machine. Still, we had a few client-side errors, as Gatling could not resolve the hostname of our server: this is sadly why we couldn't go to 20,000 users...

We also started to have some server errors after reaching 5,000 users: these are basically the same as on the client side, with too many open files on the server. As we are using Azure Web Apps, we couldn't modify anything on the server, but we could easily scale it out. From our tests, it seems that 2 or 3 servers would be enough, but we used 5 just to be sure. Please note that with Spring Web we used 20 servers: once again, both tests are not 1-to-1 equivalents and should be refined, but it's pretty clear that we use fewer resources with the reactive stack.

Please also note that our 99th percentile performance was excellent, and that we scaled very easily to 1,000 requests/second in one minute, with a very clean graph:

Profiling

As everything looked really great in our graphs, but our load testing tool prevented us from going further, we decided to do some profiling with YourKit, in order to be sure that nothing was blocking us or holding us back.

Running with 5,000 users on a local machine, we could see that no thread was blocking:

And also that our CPU usage was extremely low, our thread count stable, and our memory low and stable:

We also ran some YourKit analysis to find bottlenecks, locks or memory-hungry objects: we'll spare you the details, as we couldn't find anything!

Conclusion and final thoughts

By going "full reactive", we got a number of advantages:

- The application starts faster, and uses less CPU and memory.

- It has a very stable throughput.

- It scales easily.

However, everything isn't perfect:

- There is much more code, which is quite complex and requires a good technical background.

- Everything needs to be non-blocking: it's awesome in this simple use case, but in real life it's a bit more complex. For instance, I love to use Spring Cache: it's easy to use, and using a Memcached or Redis server is probably way cheaper than scaling CosmosDB. But as this is a blocking operation, we can't use it here!

- There's only a significant benefit in going "full reactive" when there is a high number of users. If you have just 500 requests/second, you're probably over-engineering.
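On the caching point above: Reactor's usual escape hatch for an unavoidable blocking call is to wrap it in Mono.fromCallable and isolate it on a dedicated scheduler, at the cost of a thread hop. This is only a sketch, not something used in this demo: "cache" is a hypothetical blocking store (such as a Memcached or Redis client), and findById(id) an assumed repository read method.

```java
// Sketch only: bridging a blocking cache into a reactive flow.
// "cache" is a hypothetical blocking Memcached/Redis client.
public Mono<Project> findByIdCached(String id) {
    return Mono.fromCallable(() -> cache.get(id))      // blocking lookup...
            .subscribeOn(Schedulers.boundedElastic())  // ...isolated on its own scheduler
            .switchIfEmpty(findById(id)                // cache miss: query CosmosDB
                    .doOnNext(p -> cache.put(id, p))); // populate the cache (also blocking!)
}
```

This works, but every blocking dependency needs this kind of treatment, which is exactly the extra complexity mentioned above.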

We also had a first taste of the new v3 version of the CosmosDB SDK: we proved it works extremely well under high load, and we also had the good fortune that it uses the same reactive framework as Spring Webflux.

There are definitely still a few bugs, and also APIs to improve, for instance:

- It doesn't use setters/getters: for example, we used options.maxItemCount(20) to set the max item count, and options.maxItemCount() to get that count. I personally don't find this very easy to use.

- I find it strange that to create an item you just give a POJO using container.createItem(project), but if you need to read that item you receive a CosmosItemResponse and then need to build the POJO manually. I think we could have some automatic POJO mapper, like we have with MongoDB.

- For querying, we could have a fluent query builder.
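To illustrate that last point, a fluent query builder could hypothetically look like the following. None of this exists in the SDK today; it is purely a suggestion for discussion.

```java
// Hypothetical API, for discussion only - none of these methods exist today
Flux<Project> projects = container.query(Project.class)
        .where("name").is("JHipster")
        .top(20)
        .crossPartition(true)
        .execute();
```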

As there is still time to improve the API, please take the time to read the demo code and provide feedback: add a comment on this post, send a Twitter message... Don't hesitate, I'll be happy to relay that feedback to the SDK team at Microsoft.