At QCon SF, Suudhan Rangarajan presented "Netflix Play API: Why We Built an Evolutionary Architecture". Key takeaways from the talk included: services that have a single identity/responsibility are easier to maintain and upgrade; engineers should spend time identifying core decisions that need to be made when building a service, and determine whether these are "Type 1" or "Type 2" decisions which require thorough deliberation or rapid experimentation, respectively; and designing an "evolutionary architecture", using tools like fitness functions, provides many benefits.

Rangarajan, senior software engineer at Netflix, began the presentation by talking about two key business milestones within Netflix in 2016 that also had a large engineering impact. The first was the release of "#netflixeverywhere" in January, which enabled customers in many countries where Netflix was previously not available to sign up and watch content. The second milestone was the release of new "Download & Go" functionality in November, which provided the ability for customers to download content to a device for offline viewing. Both of these releases put an increased strain on the "Play API" that was, amongst other tasks, responsible for starting the streaming of content to customers. This resulted in several service outages, and also a decrease in deployment frequency and an increase in rollbacks, which, as stated by Forsgren, Humble and Kim in their book "Accelerate", are key metrics correlated with software delivery performance.

An overview of the previous architecture of the Play API Service was provided. Customer devices connected to an API Proxy Service running at the edge (presumably the Zuul API gateway) that communicated with a monolithic API Service, which contained several APIs, including the Play API. This API service in turn communicated with domain-specific microservices to handle the user's downstream request.

The remainder of the talk was divided into three sections, and discussed the context and guiding principles related to the recent enhancements made to the monolithic API service in more detail. The sections included: service identity -- exploring why the service exists; identifying type 1 and 2 decisions -- determining which decisions will have a big long-term impact, and as such require a large amount of up-front investment; and evolvability -- exploring how to build a service that can evolve alongside changing requirements and constraints.

In the service identity section, Rangarajan suggested that engineers must "start with why", that is, ask why a service exists in order to determine its responsibility. For the Play API the chain of motivation went from Netflix wanting to "lead the Internet TV revolution to entertain billions of people across the world", to maximising customer engagement from signup to streaming, to ultimately "enabling acquisition, discovery and playback functionality 24/7". The audience was reminded of the single responsibility principle, and cautioned that when conducting this exercise they should be wary of "multiple-identities rolled up into a single service", as this can lead to a service being created with the architectural antipatterns of low cohesion and high coupling. Accordingly, the first big change the Play API team proposed was to divide the existing monolithic API service into an "API service per function" model. The Play API would be rebuilt and deployed as a microservice, and this was made as a "Type 1" decision.

We believe in a simple singular identity for our services. The identity relates to and complements the identities of the company, organization, team and its peer services

The "Type 1 and 2 decisions" part of the talk began with an explanation of the source quotes about this decision-making model from Jeff Bezos. Type 1 decisions are highly consequential and have long-ranging impact, and so these decisions must be made methodically and by engaging in consultation with others. Type 2 decisions are easily changeable, and do not have long-ranging implications, and therefore these decisions should be made quickly and by "high judgment individuals or small groups". The three Type 1 decisions identified by the Play API team were appropriate coupling, synchronous versus asynchronous communications, and data architecture.

Some decisions are consequential and irreversible or nearly irreversible – one-way doors – and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation [...] We can call these Type 1 decisions…

Looking first at appropriate coupling, Rangarajan stated that there are effectively two types of shared libraries when designing a microservice-based architecture: libraries that provide common functions, and client libraries used for inter-service communications. When using shared libraries with common functions -- for example the "utilities package" -- it is easy to introduce excessive "binary coupling", which makes maintenance and upgrading of this library challenging. With client communication libraries it can also be easy to introduce "operational coupling", for example, where well-intentioned fallback functionality provided within a client library can consume excessive resources and cause cascade failure. It is also easy to introduce "language coupling" with communication libraries, for example, if an upstream service team only provides a Java client library.

The combination of these issues alongside the identification and discussion of current requirements led the Play API team to decide to actively work to minimise the use of shared "utility" libraries in the new services they were planning to create. The team also decided to use gRPC instead of REST via JSON and HTTPS for service-to-service communication, which allowed RPC methods and entities to be defined via Protocol Buffers, and client libraries/SDKs automatically generated in a variety of languages. The summary advice for this "appropriate coupling" Type 1 decision was to "consider 'thin' auto-generated clients with bi-directional communication and minimize code reuse across service boundaries".
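As a rough illustration of what a "thin" client means in practice, the sketch below keeps the client to serialisation and transport only, with no caching, retries or fallbacks that could create operational coupling. It is not gRPC code: the message types, the `ThinPlayClient` class, the `StartPlayback` method name and the `fake_transport` stand-in are all hypothetical, and in a real gRPC setup the equivalent types and stub would be generated from a Protocol Buffers definition rather than written by hand.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


# Hypothetical message types. With gRPC these would be generated from a
# Protocol Buffers definition instead of being hand-written.
@dataclass(frozen=True)
class PlayRequest:
    customer_id: str
    title_id: str


@dataclass(frozen=True)
class PlayResponse:
    stream_url: str


# A transport takes a method name and request bytes and returns response bytes.
Transport = Callable[[str, bytes], bytes]


class ThinPlayClient:
    """Serialises, sends and deserialises. Deliberately nothing else."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def start_playback(self, request: PlayRequest) -> PlayResponse:
        payload = json.dumps(asdict(request)).encode("utf-8")
        raw = self._transport("StartPlayback", payload)
        return PlayResponse(**json.loads(raw.decode("utf-8")))


def fake_transport(method: str, payload: bytes) -> bytes:
    # Stand-in for a network call, used only to make the sketch runnable.
    body = json.loads(payload.decode("utf-8"))
    url = f"https://example.invalid/{body['title_id']}"
    return json.dumps({"stream_url": url}).encode("utf-8")
```

Because the client holds no business logic, any fallback behaviour has to live in the calling service, where its resource usage is visible to the team that owns it.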

The second Type 1 decision, synchronous versus asynchronous, was discussed next. After deliberation the team decided that they did not have a need beyond request/response type interaction between the Play API and supporting services, and therefore they implemented a blocking request handler with non-blocking I/O for outgoing inter-service calls.
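The shape of that choice can be sketched as follows: the handler is synchronous from the caller's point of view, while the outgoing calls are issued concurrently using non-blocking I/O. This is a minimal sketch using Python's `asyncio`, not the team's implementation; the downstream service names and the `call_service` helper are assumptions made for illustration, with `asyncio.sleep` standing in for network latency.

```python
import asyncio
from typing import List


async def call_service(name: str, delay: float) -> str:
    # Stand-in for a non-blocking network call to a downstream service.
    await asyncio.sleep(delay)
    return f"{name}:ok"


async def _fan_out() -> List[str]:
    # The three calls run concurrently; results come back in argument order.
    results = await asyncio.gather(
        call_service("license", 0.05),
        call_service("manifest", 0.05),
        call_service("session", 0.05),
    )
    return list(results)


def handle_play_request() -> List[str]:
    # Plain request/response for the caller: this function blocks until
    # every outgoing call has completed, then returns the combined result.
    return asyncio.run(_fan_out())
```

The handler stays simple to reason about, while total latency is closer to the slowest downstream call than to the sum of all of them.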

As the discussion turned to the third Type 1 decision encountered by the team, "data architecture", Rangarajan cautioned that "without an intentional data architecture, data becomes its own monolith". In the previous Play API architecture, several services accessed the same data source, which resulted in high coupling and reduced the paths for evolution of both the services and underlying data schema. With a deliberate smile, he quoted David Wheeler by saying that "all problems in computer science can be solved by another level of indirection", and stated that after discussion and analysis the Play API team ultimately introduced an intermediate data loader and data store layer that effectively implemented a materialised view between the service and underlying data sources.

In summary, the advice for the Type 1 decision related to data architecture was:

Isolate Data from the Service. At the very least, ensure that data sources are accessed via a layer of abstraction, so that it leaves room for extension later
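One way to picture this advice is a service that depends only on a small store interface, while a separate loader copies data from the underlying sources into that store. The sketch below is an assumption-laden illustration rather than the Play API design: `TitleStore`, `InMemoryTitleStore`, `DataLoader`, `PlayService` and the `licensed` field are all hypothetical names, and the "materialised view" here is just an in-memory dictionary.

```python
from typing import Callable, Dict, List, Optional, Protocol

Record = Dict[str, object]
Source = Callable[[], Dict[str, Record]]


class TitleStore(Protocol):
    """The only data interface the service is allowed to see."""

    def get(self, title_id: str) -> Optional[Record]: ...


class InMemoryTitleStore:
    """A simple materialised view, populated by the loader."""

    def __init__(self) -> None:
        self._view: Dict[str, Record] = {}

    def put(self, title_id: str, record: Record) -> None:
        self._view[title_id] = dict(record)

    def get(self, title_id: str) -> Optional[Record]:
        record = self._view.get(title_id)
        return dict(record) if record is not None else None


class DataLoader:
    """Merges records from several sources into the store."""

    def __init__(self, sources: List[Source], store: InMemoryTitleStore) -> None:
        self._sources = sources
        self._store = store

    def refresh(self) -> None:
        for source in self._sources:
            for title_id, fields in source().items():
                merged = self._store.get(title_id) or {}
                merged.update(fields)
                self._store.put(title_id, merged)


class PlayService:
    def __init__(self, store: TitleStore) -> None:
        self._store = store

    def can_play(self, title_id: str) -> bool:
        record = self._store.get(title_id)
        return bool(record and record.get("licensed"))
```

With this shape, a source schema can change, or a source can be replaced, by touching only the loader; the service keeps reading the same view.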

The final piece of advice in this section of the talk was in relation to Type 2 decisions, which should be made by "choosing a path, experimenting and iterating". The guiding principle for decision making when building a service is to focus on identifying the type of the decisions being faced:

Identify your Type 1 and Type 2 decisions; Spend 80% of your time debating and aligning on Type 1 decisions

The final part of the talk focused on "big picture" architecture principles, and began with a quote from the "Building Evolutionary Architectures" book by Neal Ford, Rebecca Parsons, and Patrick Kua: "an evolutionary architecture supports guided and incremental change as first principle among multiple dimensions". Rangarajan argued that the results of the Type 1 decisions discussed previously ultimately led to a microservices architecture with appropriate coupling, which supported the type of evolution required. He discussed that "Fitness Functions" can be used to monitor and guide future change, and also allow discussion to be focused around the inevitable tradeoffs that must be made when designing architecture. For the Play API team these were simplicity over reliability (e.g. fallbacks can cause cascade issues), and scalability over throughput (e.g. extensive caching gave high performance but did not scale well due to the time taken for the initial cache warming).
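A fitness function is, in essence, an automated check that an architectural property still holds. As a hedged example tied to the coupling discussion earlier in the talk, the sketch below flags libraries that are shared by more than a chosen number of services unless they are explicitly allowed. The function name, the service and library names, and the threshold are all hypothetical; the talk did not describe the specific fitness functions the Play API team used.

```python
from collections import Counter
from typing import Dict, List, Set


def shared_library_fitness(
    deps: Dict[str, Set[str]],
    allowed_shared: Set[str],
    max_services: int = 1,
) -> List[str]:
    """Return libraries used by more than `max_services` services
    that are not on the explicit allow-list. Empty list means "fit"."""
    usage = Counter(lib for libs in deps.values() for lib in libs)
    return sorted(
        lib
        for lib, count in usage.items()
        if count > max_services and lib not in allowed_shared
    )


# Hypothetical dependency data, e.g. extracted from build files.
example_deps = {
    "play-api": {"grpc-runtime", "common-utils"},
    "signup-api": {"grpc-runtime", "common-utils"},
    "discovery-api": {"grpc-runtime"},
}
violations = shared_library_fitness(example_deps, allowed_shared={"grpc-runtime"})
```

Run as part of a build pipeline, a non-empty `violations` list could fail the build or simply prompt a discussion about whether the new coupling is an acceptable tradeoff.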

In conclusion, Rangarajan stated that over the year since the new changes had been made there had been no production incidents, and the team was close to its deployment target with an average of 4.5 deployments per week with only two rollbacks.

The complete video and accompanying transcript for the talk, "Netflix Play API: Why We Built an Evolutionary Architecture", is available on InfoQ.