At OpsGenie, we have been growing aggressively, both in headcount and product features. To give you an idea, our engineering team grew from 15 to 50 just last year. To scale up development, we divided our engineering force into eight-person teams, following the two-pizza team rule.

As you would expect, our current product is somewhat monolithic. Developing and operating it is challenging in terms of the teams' parallel development efforts, the CI/CD (Continuous Integration/Continuous Delivery) process, etc. We are following the current trend and working on transitioning from the monolith to a microservices architecture. You can read more about microservices architecture and its benefits in this article by Martin Fowler.

There are some recommended architectural patterns for applying microservice concepts. One of them is the API Gateway. An API gateway is a single entry point for all clients. It handles requests in one of two ways: some requests are simply proxied/routed to the appropriate service, while others are handled by fanning out to multiple services.
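The two request-handling styles above can be sketched in a few lines of plain Java. This is only an illustration, not a real gateway: the routing table, service names, and paths are all hypothetical.

```java
import java.util.List;
import java.util.Map;

public class GatewaySketch {
    // Hypothetical routing table: path prefix -> backing service.
    static final Map<String, String> ROUTES = Map.of(
            "/alerts", "alert-service",
            "/users", "user-service");

    // Case 1: simple proxying -- forward the request to the one matching service.
    static String route(String path) {
        return ROUTES.entrySet().stream()
                .filter(e -> path.startsWith(e.getKey()))
                .map(Map.Entry::getValue)
                .findFirst()
                .orElse("monolith"); // unmatched paths still hit the monolith
    }

    // Case 2: fan-out -- one client request answered by aggregating several services.
    static List<String> dashboard() {
        return List.of(route("/alerts/recent"), route("/users/me"));
    }

    public static void main(String[] args) {
        System.out.println(route("/alerts/123"));
        System.out.println(dashboard());
    }
}
```

The `orElse("monolith")` branch is what makes this pattern a good migration tool: anything not yet extracted keeps flowing to the old application.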

The API Gateway pattern is a good starting point for a microservice architecture because it enables routing specific requests to the services we detach from the monolith. Actually, an API gateway is not a new concept for us. Until now, we have been using Nginx as the API gateway in front of our monolithic application, but we wanted to re-evaluate that decision in the context of the microservice transition. We care about performance, ease of extensibility, and additional capabilities such as rate limiting. The first step is to evaluate the performance of the alternatives under heavy load to ensure that they will scale to meet our needs.

In this blog post, we explain how we set up our test environment and compare the performance of alternative API gateways: Zuul 1, Nginx, Spring Cloud Gateway, and Linkerd. There are also other alternatives, such as Lyft's Envoy and Undertow. We are going to perform similar tests with these tools and share the results in future blog posts.

Zuul 1 seems promising to us since it is developed in Java and has the Spring framework's strong support. There are already some blog posts comparing Zuul with Nginx, but we also wanted to evaluate the performance of Spring Cloud Gateway and Linkerd. Besides, we plan to perform further load tests, so we decided to set up our own test workbench.

To evaluate the performance of the API gateways independently, we created an isolated test environment, independent of the OpsGenie product. We used the Apache HTTP Server Benchmarking tool, ab, for the benchmarks.

We first installed Nginx on an AWS EC2 t2.micro instance according to the official Nginx documentation. This was our initial test environment, and we later added the Zuul and Spring Cloud Gateway installations to it. The Nginx web server hosts the static resources, and we defined reverse proxies to this web server for Nginx, Zuul, and Spring Cloud Gateway. We also started another t2.micro EC2 instance to issue the requests (the client EC2).

Initial Test Environment

The dashed arrows in the figure are our test paths. There are five of them:

Direct access

Access via Nginx reverse proxy

Access via Zuul

Access via Spring Cloud Gateway

Access via Linkerd

We know that you are impatient to see the results, so here are the results first, and the details later.

Performance Benchmark Summary

Test Strategy

We used the Apache HTTP Server Benchmarking tool, making 10,000 requests in total with 200 concurrent threads in each test run:

ab -n 10000 -c 200 http://<server-address>/<path to resource>

We performed tests on three different AWS EC2 server configurations, narrowing down the test cases at each step for clarity:

We performed an additional direct-access test only in the first step to see the overhead of the proxies; since direct access is not an option for us, we didn't perform this test in the subsequent steps.

Since Spring Cloud Gateway has not been formally released yet, we evaluated it only in the last step.

Zuul performs better on subsequent calls than on the first one, probably because JIT (Just-In-Time) compilation kicks in during the first call, so we labeled the first Zuul run as "warm-up". The values shown in the summary tables below are the post-warm-up performance.

We know that Linkerd is a resource-intensive proxy, so we compared it only in the last step, with the highest resource configuration.

Test Configuration

t2.micro — single-core CPU, 1GB of memory: We ran tests for direct access, the Nginx reverse proxy, and Zuul (after warm-up).

m4.large — dual-core CPU, 8GB of memory: We compared the performance of the Nginx reverse proxy and Zuul (after warm-up).

m4.2xlarge — 8-core CPU, 32GB of memory: We compared the performance of the Nginx reverse proxy, Zuul (after warm-up), Spring Cloud Gateway, and Linkerd.

Test Results

The performance benchmark summary is below:

Test Details

Direct Access Request

First, we accessed a static resource directly, without any proxy. The results are as follows; the mean time per request is 30ms.

Direct access to static resource

Access Via Nginx Reverse Proxy

In our second test, we accessed the resource via the Nginx reverse proxy. The mean time per request is 40ms. We can say that the Nginx reverse proxy adds about a 33% overhead on average ((40 − 30) / 30 ≈ 33%) compared to the direct access described in the previous section.

Performance of Nginx reverse proxy

Access via Zuul Reverse Proxy

After that, we created a Spring Boot application with a main method:

Spring Boot main application for Zuul
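The original listing was an image; a minimal sketch of such a gateway application, assuming the standard spring-cloud-starter-zuul dependency on the classpath, would look like:

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.zuul.EnableZuulProxy;

// @EnableZuulProxy turns this plain Spring Boot app into a Zuul reverse proxy;
// the actual routes come from application.yml.
@EnableZuulProxy
@SpringBootApplication
public class GatewayApplication {
    public static void main(String[] args) {
        SpringApplication.run(GatewayApplication.class, args);
    }
}
```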

And this is our application.yml file:
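The file itself was shown as an image; a representative sketch, assuming Zuul's standard `zuul.routes` configuration and a placeholder upstream address, could be:

```yaml
server:
  port: 8080
zuul:
  routes:
    static-resource:
      path: /**
      url: http://<nginx-server-address>/   # hypothetical Nginx host serving the static file
```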

The results of the initial Zuul test are as follows:

Zuul Reverse Proxy first run performance

The mean times per request were 30ms for direct access and 40ms for the Nginx reverse proxy, whereas Zuul's first run takes 388ms per request. As mentioned in other blog posts, a JVM warm-up may help. When we reran the test, we got the following results:

Zuul Reverse proxy after warmup

The Zuul proxy performs better after the warm-up (time per request is 200ms), but it is still not that good compared to the Nginx reverse proxy, which scores 40ms.

What if we upgrade the server to m4.large?

As shown in Figure 1, the server is a t2.micro EC2 instance with a single core and 1GB of memory. Nginx is a native C application while Zuul is Java-based, and we know that Java applications are a little bit :) more demanding. So we changed the server to an m4.large instance, which has two CPU cores and 8GB of memory.

We ran the Nginx and Zuul reverse proxy tests again, and the results are given below:

Nginx reverse proxy on m4.large

Zuul Reverse Proxy on m4.large (after warmup)

As shown in the figures above, the mean time per request is 32ms for Nginx and 95ms for Zuul. These results are much better than the t2.micro results, which were 40ms and 200ms, respectively.

A valid criticism is that we introduce extra overhead by running Zuul inside a Spring Boot application. It would most probably perform better if we ran Zuul as a standalone application.

What if we upgrade the server to m4.2xlarge?

We also evaluated an m4.2xlarge server, which has eight cores and 32GB of memory. The results for Nginx and Zuul are given in the following figures:

Nginx reverse proxy on m4.2xlarge server

Zuul reverse proxy on m4.2xlarge server

Zuul outperformed Nginx on the m4.2xlarge server. We did some research to find out what type of EC2 instances Netflix uses to host its Zuul instances, but we couldn't find any information about it. In some blog posts, people complain about Zuul's performance and ask how Netflix scales it; we think this is the answer: as they say, Zuul is CPU-bound :)

Benchmark for Linkerd

Linkerd is a Cloud Native Computing Foundation project. It is a service mesh application developed in Scala, and it provides reverse proxy capabilities in addition to service mesh features such as service discovery. We evaluated the performance of Linkerd, and the results are given below; Linkerd's performance is very close to Zuul's.

Linkerd reverse proxy on m4.2xlarge server

Benchmark for Spring Cloud Gateway

The Spring Cloud community is also developing a gateway module. Although it has not been officially released yet, we think it is worth comparing with the other alternatives, so we modified Spring Cloud Gateway's sample application to fit our test environment.
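The modified sample was not shown; a sketch of what such an application might look like, assuming the spring-cloud-starter-gateway dependency and its route-builder API (the upstream address is a placeholder), is:

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class GatewaySampleApplication {

    @Bean
    public RouteLocator routes(RouteLocatorBuilder builder) {
        // Forward every request to the Nginx server hosting the static resource.
        return builder.routes()
                .route(r -> r.path("/**")
                        .uri("http://<nginx-server-address>/"))
                .build();
    }

    public static void main(String[] args) {
        SpringApplication.run(GatewaySampleApplication.class, args);
    }
}
```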

We ran the same performance test with the Apache HTTP Server Benchmarking tool: 10,000 requests with 200 concurrent threads. The results are shown in the following figure:

Spring Cloud Gateway on m4.2xlarge server

As shown in the figure, Spring Cloud Gateway handles 873 requests per second with a mean time per request of 229ms. According to our tests, Spring Cloud Gateway's performance cannot yet reach the level of Zuul, Linkerd, and Nginx, at least with the current codebase on GitHub. The comparison of Nginx, Zuul, Linkerd, and Spring Cloud Gateway is given above, at the end of the Benchmark Summary section.

What is Next?

In this blog post, we compared the performance of Zuul, Nginx, Linkerd, and Spring Cloud Gateway with the Apache HTTP Server Benchmarking tool, ab. The next steps we are planning are:

We are going to evaluate Envoy. Envoy is actually more than an API gateway; it is a service mesh, but it also provides an API gateway that can be used in front of the application.

Undertow also has reverse proxy capabilities, so we are going to evaluate it, too.

Netflix has redesigned Zuul as a Netty-based non-blocking application. This new version is called "Zuul 2". We are going to benchmark the new Zuul and share the results whenever its open-source version is officially released.

Spring Cloud Gateway is still under development. It is a Netty-based non-blocking gateway developed in Java, so it is a good candidate for us. We are going to evaluate the performance of its official release.

Some of the API gateways (Zuul 1, Nginx) are blocking, while others (Zuul 2, Linkerd, Envoy) are non-blocking. Blocking architectures are simple to develop and make requests easy to trace, but their blocking nature can cause scalability problems. Non-blocking architectures are more complex in terms of development and traceability, but they scale better and are more resilient. We will decide between a blocking and a non-blocking architecture later.

We are going to perform tests with Gatling. We will share the results in our next blog post.

We are going to share the results of each step in subsequent blog posts, so stay tuned!
