Key Takeaways

AppsFlyer processes roughly 70 billion HTTP requests a day, and is built using a microservices architecture style. The entry point to the system that wraps all of the frontend services is a mission-critical (non-micro) service called the API Gateway.

The original API Gateway, written in AppsFlyer's default language, Clojure, started accumulating technical debt.

Golang was selected as the language to benchmark against Clojure for the proposal for a newly designed API Gateway service.

Benchmarking was conducted with NGINX (enhanced by Lua) as an option, alongside Golang and Clojure. Go delivered improved throughput versus Clojure, and was selected as the language of choice for the implementation.

The fact that the API gateway is now built in a typed language provides the ability to plug in diverse functionalities and introduce new technologies much more easily with Golang’s library support and community.

The newly deployed solution is capable of supporting far more traffic than it handles today. With traffic and requests growing by factors of 10, this was important from a forward-thinking perspective.



AppsFlyer, a leading mobile attribution and marketing analytics platform, processes roughly 70 billion HTTP requests a day (approximately 50 million requests a minute), and is built using a microservices architecture style. The entry point to the system that wraps all of the frontend services is a mission-critical (non-micro) service called the API Gateway. This essentially serves as a single point for routing traffic from customers to our backend services, greatly simplifying authentication and authorization for our clients, but with the tradeoff of also potentially being a single point of failure.

This article explores why and how the engineering team migrated from a Clojure-based API gateway implementation to a Go-based implementation.

Accumulating Technical Debt within the API Gateway

We’ve talked previously about how technical debt originates, and it often happens just as it did with our API Gateway service.



Originally, AppsFlyer’s services were a Python monolith, which required a single solution for authentication and authorization as part of the monolith itself. As time went by, traffic and complexity grew and we migrated to a microservices architecture. As such, we needed to create a unified API gateway solution that would serve as our authentication and authorization provider.

We started by just rolling up our sleeves and writing this in Clojure, skipping the design phases, and building the service largely in proof-of-concept mode. Our company is one of the largest Clojure shops in production in EMEA, and therefore Clojure is often the default language of choice, without much consideration of the specific project at hand. While this is good for velocity and a “get stuff done” mindset, it’s less ideal for the long-term maintenance of a project. As traffic grew, we quickly realized that the code for the newly rolled out API gateway was too complex, and needed constant refactoring to enable the throughput required.

We eventually came to a crossroads where the service was too unstable, and we realized that we needed to rewrite the project completely - either in Clojure (but with a better design), or explore other language options as well. With this iteration, we decided not to embrace our cognitive biases and revert to our Clojure comfort zone, but instead do the proper design work required to build the service we need, and not just rework a service we already have.

We eventually selected Golang as the language to benchmark against Clojure for this API Gateway service, which also brought with it the added benefits of language diversity and contributed to our mentality of code craftsmanship, by mastering additional syntaxes.

We understood the flip side of adding another programming language to our stack. We are strong believers in the CI/CD mentality, and introducing a new language that is not JVM-based (as opposed to Clojure) had its operational costs, but we were able to resolve that in a short time.

There were also, of course, learning curves with mastering a new language, and the need to ensure that the code would be stellar and robust enough for the long-term, which is hard to know before actually writing your first project in a specific language and seeing how it performs in production.



I’ll provide a brief aside on why we selected Go for this specific service -- just for some context. Go has very strong support for building network services and specifically for proxy-like services with the built-in reverse-proxy. Its biggest advantage versus other solutions like the http-kit that we’ve used in Clojure, is the ability to stream the data through the proxy instead of storing it in-memory, and return it to the client only after the last byte was received from the server. This feature alongside the support for efficient I/O without the price of overly complicated asynchronous code that we would have to write in other platforms like the JVM, made the choice of Go very compelling. An additional advantage that became apparent while we started to implement the service, was the fact that a statically typed language makes it a lot easier to refactor the code and reason about it, since the types are an excellent way to self-document your code.

Evaluating Our Options

We understood that to properly evaluate the different languages' suitability, we would need to examine a few aspects: performance, as well as the specific benefits of each language for the task at hand. To measure performance, we understood we would need to benchmark Clojure vs. Go in as close to a production simulation as possible.

To do so, we started with stress testing, with NGINX (enhanced by Lua) as an option alongside Golang and Clojure. Go delivered improved throughput versus Clojure.



The basic statistics of the test:

We used WRK as our benchmarking tool

3-minute bursts

64 threads

Pool of 1000 connections

2-minute request timeout

Each request returned a static file weighing 500 KB

All traffic was fired from the same AZ, using c4.xlarge instances, to mitigate network noise

| Proxy solution | Req/Sec | Trans/Sec | Total requests | Total transaction size | Bad Req | Avg. Latency |
|---|---|---|---|---|---|---|
| Direct | 190 | 72 MB | 34500 | 12.8 GB | ~ 400 (drop: 200) | 4.41 Sec |
| NGINX | 185 | 73 MB | 33486 | 12.7 GB | ~ 300 (drop: 37) | 7.95 Sec |
| Clojure (basic Http-Kit implementation) | 190 | 72 MB | 34412 | 12.8 GB | ~ 100 (drop: 600) | 8.48 Sec |
| Golang (native reverse proxy & http layer) | 185 | 73 MB | 33443 | 12.7 GB | ~ 200 (drop: 0) | 5.42 Sec |

We moved away from re-writing the service in Clojure not only because Go showed better performance but also because we wanted to challenge ourselves and be exposed to a different language and a different way of thinking.

The design phases started by outlining the functionality we required the service to have, and after having the basic concepts specified, we examined backward compatibility considerations and potential pitfalls with migrating our production user base to the new service. Once we ensured that we had covered all of our bases we started to get to work by assigning an architect and developer to the project.

From Concept to Delivery

We were surprised by how quickly the coding part of the project was completed, with only about two months of work required. Because this was the first time we introduced Go in-house, we were very careful with the coding part of the project. We did two iterations on each function to ensure we were doing it right, and did many code reviews. This is because we knew that this code had to be crafted and clean, as it would serve as a source for other Go projects going forward.

Despite this being the first project introduced in Go, we had the opportunity to really get a good grasp of the language and its core functionality, as we had to compensate for libraries used in Clojure for communication with additional parts of the stack, including Redis (persistent state of user login counters to prevent DDoS and bots) and Kafka (we manage a CQRS of domain events, one of which is successful or unsuccessful logins), which required creating similar libraries in Go.

In order to match the ecosystem we have in Clojure, we needed to integrate a whole range of libraries, such as a metrics collection library, a logging library, and a JWT library, among others. We were very happy to find all of them at a level of maturity that is a strong indication of how widely the community has adopted the Go language. The sustainability and maturity of a language's community is an important consideration when deciding to migrate to a new language.

We were ready for the basic migration after approximately two months, having the basic functionality covered and tested. We started migrating services iteratively within the parent group (our domain group) in a controlled way to the new API Gateway, which was basically a canary release.

We decided to do a controlled rollout with the first few services over the course of the first few weeks, so we could discover the bugs and flaws in production, and have the time to properly fix them before rolling out all of our services. We wanted to learn from the mistake of moving too quickly with the original API solution, which eventually led to delivering low quality.

Once we felt we were ready and had fixed all the flaws, we started the migration plan for all of our services. This included a migration guide PDF for each service, covering the exact steps needed to transfer over to the new service, the benefits of such a move, and the optimal way to perform the migration based on the service's specific stack and dependencies.

To roll out the new reverse proxy in a gradual manner, we used an application load balancer (ALB) to route the traffic based on a set of predefined URLs that indicate the services we want to be exposed via the new API gateway vs. the old one.

This enabled a very controlled approach to how to route traffic with minimal effort and risk. We took our time, tested each migrated service and worked hand-in-hand with all the other teams that were responsible for their user-facing services. It took us six months, but we managed to migrate ~40 microservices to use the new API gateway with zero downtime.
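The routing decision behind the gradual rollout can be sketched in a few lines. In production this was an ALB rule set rather than application code, so the snippet below only illustrates the logic: requests whose path matches a migrated service go to the new gateway, and everything else stays on the old one. The prefixes and gateway names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical upstream names, used only for illustration.
const (
	oldGateway = "old-gateway"
	newGateway = "new-gateway"
)

// migratedPrefixes lists the URL prefixes of services already moved to the
// new gateway. These values are made up for the example.
var migratedPrefixes = []string{"/billing/", "/reports/"}

// pickGateway returns the gateway that should serve the given request path:
// the new one if the path belongs to a migrated service, the old one otherwise.
func pickGateway(path string, migrated []string) string {
	for _, prefix := range migrated {
		if strings.HasPrefix(path, prefix) {
			return newGateway
		}
	}
	return oldGateway
}

func main() {
	for _, path := range []string{"/billing/invoices", "/users/42"} {
		fmt.Printf("%s -> %s\n", path, pickGateway(path, migratedPrefixes))
	}
	// Prints:
	// /billing/invoices -> new-gateway
	// /users/42 -> old-gateway
}
```

Migrating one more service then amounts to adding one more prefix to the list, and removing it again is the rollback.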

Results

The end result enabled us to go from 25 instances (c4.xlarge) running Clojure code, able to process 60 concurrent requests, to two instances (c3.2xlarge) running Go code, able to support ~5000 concurrent requests a minute: a huge improvement. The new architecture design was also robust enough for our next phase of growth, giving us both a powerful service that can withstand high scale and grow in business complexity easily due to its procedural approach, and a new language to add to our toolbox when dealing with high scale.

Let’s take for example our reverse proxy solution in Clojure and in Go.

Clojure:

;; Creating a connection manager
(let [cm (clj-http.conn-mgr/make-reusable-conn-manager
           {:timeout 1 :threads 20 :default-per-route 10})]
  ;; Creating a proxy server using cm (connection manager)
  (client/request
    {:method :get
     :url (service/service-uri service-spec uri-match)
     :headers (dissoc (into {} (:headers req)) "content-length")
     :body (when-let [len (get-in req [:headers "content-length"])]
             (bs/to-byte-array (:body req)))
     :follow-redirects false
     :throw-exceptions false
     :connection-manager cm
     :as :stream}))

And in Golang:

func NewProxy(
	spec *serviceSpec.ServiceSpec,
	director func(*http.Request),
	respDirector func(*http.Response) error,
	dialTimeout, dialKAlive, transTLSHTimeout, transRHTimeout time.Duration,
) *MultiReverseProxy {
	return &MultiReverseProxy{
		proxy: &httputil.ReverseProxy{
			Director:       director, // request director function
			ModifyResponse: respDirector,
			Transport: &http.Transport{
				// DialContext replaces the deprecated Dial field.
				DialContext: (&net.Dialer{
					Timeout:   dialTimeout, // limits the time spent establishing a TCP connection (if a new one is needed)
					KeepAlive: dialKAlive,  // interval between TCP keep-alive probes on an active connection
				}).DialContext,
				TLSHandshakeTimeout:   transTLSHTimeout, // limits the time spent performing the TLS handshake
				ResponseHeaderTimeout: transRHTimeout,   // limits the time spent waiting for the response headers
			},
		},
	}
}

Notice how Golang has many features oriented towards better management of connection pools, and reverse proxy capabilities baked into its standard library.

In Summary

Choosing to write the new version of the API Gateway in Go has proven to be a very good decision. The minimal learning curve of Go made it an excellent language to learn “on the fly” while working on a real production service. Its support for low-level networking constructs such as a reverse proxy, and a general mindset towards performance, made the final result both a real measurable improvement and more robust. All of the production issues that we had as a result of the previous code are now a thing of the past, it is much easier to add new features to the gateway, and the increased traffic we can now support enables us all to sleep better at night.

This article was updated 15 February 2019 to clarify several minor points raised in the comments discussion.

About the Author

Asaf Yonay is the R&D Group Manager at AppsFlyer, who is passionate about taking managerial and technical challenges and turning them into success stories by adding the human element into the mix. Asaf is a firm believer in defining processes that help R&D teams grow and scale without losing their velocity, and in taking a hands-on, full-stack approach to staying in touch with the challenges, believing that's what evolves managers into leaders. He has worked in start-ups in various roles, ranging from support and QA to various R&D roles, building scalable, robust systems in Clojure, Golang, Node.js and Python to power React and Angular services, while working with Kafka, Aerospike and Neo4J to handle large-scale or complex business logic states.