Netflix Platform Engineering — we’re just getting started

by Ruslan Meshenberg

“Aren’t you done with every interesting challenge already?”

I get this question in various forms a lot. During interviews. At conferences, after we present on some of our technologies and practices. At meetups and social events.

You have fully migrated to the Cloud, you must be done…

You created multi-regional resiliency framework, you must be done…

You launched globally, you must be done…

You deploy everything through Spinnaker, you must be done…

You open sourced large parts of your infrastructure, you must be done…

And so on. These assumptions could not be farther from the truth, though. We’re now tackling tougher and more interesting challenges than in years past, but the nature of the challenges has changed, and the ecosystem itself has evolved and matured.

Cloud ecosystem:

When Netflix started our Cloud Migration back in 2008, the Cloud was new. The collection of Cloud-native services was fairly limited, as was the knowledge about best practices and anti-patterns. We had to trail-blaze and figure out a few novel practices for ourselves. For example, practices such as Chaos Monkey gave birth to new disciplines like Chaos Engineering. The architectural pattern of multi-regional resiliency led to the implementation and contribution of Cassandra asynchronous data replication. The Cloud ecosystem is a lot more mature now. Some of our approaches resonated with other companies in the community and became best practices in the industry. In other cases, better standards, technologies and practices have emerged, and we switched from our in-house developed technologies to leverage community-supported Open Source alternatives. For example, a couple of years ago we switched to use Apache Kafka for our data pipeline queues, and more recently to Apache Flink for our stream processing / routing component. We’ve also undergone a huge evolution of our Runtime Platform. From replacing our old in-house RPC system with gRPC (to better support developers outside the Java realm and to eliminate the need to hand-write client libraries) to creating powerful application generators that allow engineers to create new production-ready services in a matter of minutes.

As new technologies and development practices emerge, we have to stay on top of the trends to ensure ongoing agility and robustness of our systems. Historically, a unit of deployment at Netflix was an AMI / Virtual Machine — and that worked well for us. A couple of years ago we made a bet that Container technology will enable our developers be more productive when applied to the end to end lifecycle of an application. Now we have a robust multi-tenant Container Runtime (codename: Titus) that powers many batch and service-style systems, whose developers enjoy the benefits of rapid development velocity.

With the recent emergence of FaaS / Serverless patterns and practices, we’re currently exploring how to expose the value to our engineers, while fully integrating their solutions into our ecosystem, and providing first-class support in terms of telemetry / insight, secure practices, etc.

Scale:

Netflix has grown significantly in recent years, across many dimensions:

The number of subscribers

The amount of streaming our members enjoy

The amount of content we bring to the service

The number of engineers that develop the Netflix service

The number of countries and languages we support

The number of device types that we support

These aspects of growth led to many interesting challenges, beyond standard “scale” definitions. The solutions that worked for us just a few years ago no longer do so, or work less effectively than they once did. The best practices and patterns we thought everyone knew are now growing and diverging depending on the use cases and applicability. What this means is that now we have to tackle many challenges that are incredibly complex in nature, while “replacing the engine on the plane, while in flight”. All of our services must be up and running, yet we have to keep making progress in making the underlying systems more available, robust, extensible, secure and usable.

The Netflix ecosystem:

Much like the Cloud, the Netflix microservices ecosystem has grown and matured over the recent years. With hundreds of microservices running to support our global members, we have to re-evaluate many assumptions all the way from what databases and communication protocols to use, to how to effectively deploy and test our systems to ensure greatest availability and resiliency, to what UI paradigms work best on different devices. As we evolve our thinking on these and many other considerations, our underlying systems constantly evolve and grow to serve bigger scale, more use cases and help Netflix bring more joy to our users.

Summary:

As Netflix continues to evolve and grow, so do our engineering challenges. The nature of such challenges changes over time — from “greenfield” projects, to “scaling” activities, to “operationalizing” endeavors — all at great scale and break-neck speed. Rest assured, there are plenty of interesting and rewarding challenges ahead. To learn more, follow posts on our Tech Blog, check out our Open Source Site, and join our OSS Meetup group.

Done? We’re not done. We’re just getting started!