by Amit Joshi, Andrew Leung, Corin Dwyer, Fabio Kung, Sargun Dhillon, Tomasz Bak, Andrew Spyker, Tim Bozarth

Today, we are open-sourcing Titus, our container management platform.

Titus powers critical aspects of the Netflix business, including video streaming, recommendations and machine learning, big data, content encoding, studio technology, internal engineering tools, and other Netflix workloads. Titus offers a convenient model for managing compute resources, allows developers to maintain just their application artifacts, and provides a consistent developer experience from a developer’s laptop to production by leveraging Netflix’s container-focused engineering tools.

Over the last three years, Titus evolved from initially supporting batch use cases to running service applications, both internal and, ultimately, critical customer-facing ones. Through that evolution, container use at Netflix has grown from thousands of containers launched per week to as many as three million launched per week as of April 2018. Titus hosts thousands of applications globally across seven regionally isolated stacks spanning tens of thousands of EC2 virtual machines. Open sourcing Titus shares the technology assembled through three years of production learnings in container management and execution.

Why are we open sourcing?

Over the past few years of talking about Titus, we’ve been asked over and over again, “When will you open source Titus?” It was clear that we were discussing ideas, problems, and solutions that resonated with those at a variety of companies, both large and small. We hope that by sharing Titus we are able to help accelerate like-minded teams, and to bring the lessons we’ve learned forward in the container management community.

Multiple container management platforms (Kubernetes, Mesosphere DC/OS, and Amazon ECS) have been adopted across the industry during the last two years, delivering benefits to a wide range of use cases. Additionally, a handful of web-scale companies have developed solutions on top of Apache Mesos to meet the unique needs of their organizations. Titus shares a foundation of Apache Mesos and was optimized to solve for Netflix’s production needs.

Our experience talking with peers across the industry indicates that other organizations are also looking for some of the same technologies in a container management platform. By sharing the code as open source, we hope others can help the overall container community absorb those technologies. We would also be happy for the concepts and features in Titus to land in other container management solutions. This has an added benefit for Netflix in the longer term, as it will provide us with better off-the-shelf solutions in the future.

And finally, part of why we are open sourcing Titus is our desire to give back and share with the community outside Netflix. We hope open sourcing will lead to active engagements with other companies working on similar engineering challenges. Our team members also enjoy being able to present their work externally, and future team members can learn what they will have an opportunity to work on.

How is Titus different from other container platforms?

To ensure we are investing wisely at Netflix, we stay well informed about off-the-shelf infrastructure technologies. In addition to the container orchestration platforms mentioned above, we stay deeply connected with the direction and challenges of the underlying container runtime technologies such as Docker (Moby, containerd) and CRI-O. We regularly meet with the engineering teams building these solutions as well as the teams using them in their production infrastructures. By balancing the knowledge of what existing solutions offer against our needs, we believe Titus is the best solution for container management at Netflix.

A few of those key reasons are highlighted below:

The first is a tight integration between Titus and both Amazon and Netflix infrastructure. Given that Netflix infrastructure leverages AWS so broadly, we decided to integrate seamlessly with, and take advantage of, as much of the functionality AWS has to offer. Titus has advanced ENI and security group management support spanning not only our networking fabric but also our scheduling logic. This allows us to handle ENIs and IPs as resources and ensure safe large-scale deployments that consider EC2 VPC API call rate limits. Our IAM role support, which allows secure EC2 applications to run unchanged, is delivered through our Amazon EC2 metadata proxy. This proxy also allows Titus to give each container a container-specific metadata view, which enables various application aspects such as service discovery. We have leveraged AWS Auto Scaling to provide container cluster auto scaling with the same policies that would be used for virtual machines. We also worked with AWS on the design of IP target groups for Application Load Balancers, which brings support for full IP stack containers and AWS load balancing. All these features together enable containerized applications to transparently integrate with internal applications and Amazon services.
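To make the metadata proxy idea concrete, here is a minimal sketch of the routing decision such a proxy might make. All names (`iam_role`, `eni_ip`, `task_id`) are hypothetical illustrations, not Titus's actual implementation: paths that must be container-specific are answered locally from the container's own context, while everything else would be forwarded to the real EC2 metadata endpoint.

```python
# Hypothetical sketch of an EC2 metadata proxy's routing decision.
# Container-scoped paths are answered locally so each container sees
# its own IAM role and ENI IP; other paths would be forwarded to the
# real metadata service at 169.254.169.254.

def route(path, container):
    """Return ('local', value) for container-scoped paths,
    ('forward', path) for everything else."""
    if path.startswith("/latest/meta-data/iam/"):
        # Serve per-container IAM role credentials, not the host's.
        return ("local", container["iam_role"])
    if path == "/latest/meta-data/local-ipv4":
        # The container's own ENI IP, not the host's primary IP.
        return ("local", container["eni_ip"])
    if path == "/latest/meta-data/instance-id":
        # A container-scoped identity, useful for service discovery.
        return ("local", container["task_id"])
    return ("forward", path)

container = {"iam_role": "myAppRole", "eni_ip": "10.0.1.7",
             "task_id": "titus-task-123"}
print(route("/latest/meta-data/iam/security-credentials/", container))
print(route("/latest/meta-data/ami-id", container))
```

Because the proxy sits on the metadata path the application already uses on EC2, the application's AWS SDK credential chain works unchanged inside the container.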

In order to incrementally enable applications to transition to containers while keeping as many systems familiar as possible, we decided to leverage existing Netflix cloud platform technologies, making them container aware. We chose this path to ensure a common developer and operational approach between VMs and containers. This is evident through our Spinnaker enablement, support in our service discovery (Eureka), changes in our telemetry system (Atlas), and performance insight technologies.

Next is scale, which has many dimensions. We run over a thousand different applications: some very compute heavy (media encoding), some critical Netflix customer-facing services, some memory and GPU heavy (algorithm training), some network bound (stream processing), some that tolerate resource overcommitment (big data jobs), and some that do not. We launch up to a half million containers and 200,000 clusters per day. We also rotate hundreds of thousands of EC2 virtual machines per month to satisfy our elastic workloads. While there are solutions that help solve some of these problems, we do not believe there are off-the-shelf solutions that can take on each of these scale challenges.

Finally, Titus allows us to quickly and nimbly add features that are valuable as our needs evolve, and as we grow to support new use-cases. We always try to maintain a philosophy of “just enough” vs “just in case” with the goal of keeping things as simple and maintainable as possible. Below are a few examples of functionality we’ve been able to quickly develop in response to evolving business and user needs:

In the scheduling layer, we support advanced concepts such as capacity management, agent management, and dynamic scheduling profiles. Capacity management ensures all critical applications have the capacity they require. Agent management provides the functions needed to support a fleet of thousands of hosts, including host registration and lifecycle management, automatic handling of failing hosts, and host autoscaling for efficiency. Dynamic scheduling profiles capture the differences in scheduling needed between application types (customer-facing services vs. internal services vs. batch) and between periods of normal and degraded health. These profiles help us optimize scheduling against real-world trade-offs between reliability, efficiency, and job launch latency.
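The trade-off a scheduling profile encodes can be sketched as a small lookup. This is a hypothetical illustration (the profile names, knobs, and values are invented, not Titus's actual configuration): services get placement spread for reliability, batch jobs get bin-packing for efficiency, and under degraded fleet health the scheduler trades placement quality for launch latency.

```python
# Hypothetical sketch of dynamic scheduling profiles. Services are
# spread across hosts for reliability; batch jobs are bin-packed for
# efficiency; degraded fleet health shortens how long the scheduler
# waits for an ideal placement.

PROFILES = {
    "service":  {"placement": "spread",   "fitness_timeout_s": 30},
    "internal": {"placement": "spread",   "fitness_timeout_s": 10},
    "batch":    {"placement": "bin-pack", "fitness_timeout_s": 1},
}

def scheduling_profile(app_type, fleet_healthy=True):
    profile = dict(PROFILES[app_type])
    if not fleet_healthy:
        # Under degraded health, accept worse placements sooner so
        # replacement capacity comes online quickly.
        profile["fitness_timeout_s"] = min(profile["fitness_timeout_s"], 5)
    return profile

print(scheduling_profile("service"))
print(scheduling_profile("service", fleet_healthy=False))
```

The point of making profiles dynamic is that the same job stream is scheduled differently depending on system health, without operators reconfiguring each application.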

In container execution, we have a unique approach to container composition, Amazon VPC networking support, isolated log management, a novel approach to vacating decommissioned nodes, and an advanced operational health check subsystem. For container composition, we inject our system services into each container before running the user’s workload. We classify container networking traffic using BPF and perform QoS using HTB and ECN, providing every container with performant, burstable, and sustained throughput. We isolate log uploading and stdio processing within the container’s cgroup. Leveraging Spinnaker, we are able to offload node draining during upgrades in an application-specific way. We have operationalized the detection and remediation of kernel, container runtime, EC2, and container control-plane health issues. For our security needs, we run all containers with user namespaces, and provide transparent direct user access to only the container.
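The "burstable as well as sustained" property maps naturally onto HTB's two bandwidth parameters. The sketch below is a hypothetical illustration of that idea (the function, the burst factor, and the host NIC size are invented, not Titus's actual shaping policy): each container is guaranteed its requested rate, and may additionally borrow idle capacity up to a ceiling.

```python
# Hypothetical sketch of HTB-style bandwidth shaping parameters.
# "rate" is the guaranteed sustained share of the host NIC; "ceil"
# is how far the container may burst when other classes are idle.

def htb_params(requested_mbps, host_mbps=10_000, burst_factor=4):
    rate = requested_mbps                               # guaranteed floor
    ceil = min(requested_mbps * burst_factor, host_mbps)  # burst ceiling
    return {"rate_mbps": rate, "ceil_mbps": ceil}

print(htb_params(100))    # small container: bursts well above its rate
print(htb_params(5000))   # large container: ceiling capped by the NIC
```

Because HTB classes can borrow unused bandwidth from siblings, a scheme like this keeps every container's sustained throughput protected while letting bursty workloads use whatever the host has spare.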

Titus is designed to satisfy Netflix’s complex scalability requirements and deep Amazon and Netflix infrastructure integration, all while giving us the ability to quickly innovate on the exact scheduling and container execution features we require. Hopefully, by detailing our goals we have shown how Titus’s approach to container management may apply to your use cases.

Preparing for open sourcing

In the fourth quarter of 2017, we opened up Titus’s source code to a set of companies facing technical challenges in the container management space similar to Netflix’s. Some of these companies were looking for a modern container batch and service scheduler on Mesos. Others were looking for a container management platform tightly integrated with AWS. And others still were looking for a container management platform that works well with NetflixOSS technologies such as Spinnaker and Eureka.

By working with these companies to get Titus working in their AWS accounts, we learned how we could better prepare Titus for being fully open sourced. Those experiences taught us how to disconnect Titus from internal Netflix systems, the level of documentation needed to get people started with Titus, and what hidden assumptions we relied on in our EC2 configuration.

Through these partnerships, we received feedback that Titus really shines thanks to our AWS integration and the production-focused operational aspects of the platform. We also heard that operating a complex container management platform like Titus will be challenging for many.

With all these learnings in mind, we strove to create the best possible documentation for getting Titus up and running. We’ve captured that information on the Titus documentation site.

Wrapping Up

Open sourcing Titus marks a major milestone after over three years of development, operational battle hardening, customer focus, and sharing/collaboration with our peers. We hope that this effort can help others with the challenges they are facing, and bring new options to container management across the whole OSS community.

In the near future we will keep feature development in Titus well aligned with Netflix’s product direction. We plan to share our roadmap in case others are interested in seeing our plans and contributing. We’d love to hear your feedback. We will be discussing Titus at our NetflixOSS meetup this evening and will post the video later in the week.

Appendix

Conference talks: Dockercon 2015, QCon NYC 2016, re:Invent 2016, QCon NYC 2017, re:Invent 2017, and Container World 2018

Articles: Netflix techblog posts (1, 2), and ACM Queue