On Tuesday I attended a fabulous talk by Netflix's Jeremy Edberg about the suite of tools that Netflix developed (and open-sourced) to manage its AWS clusters.

The Netflix API handles 2 billion requests per day, and makes 12 billion outbound requests to its API dependencies. Netflix essentially wrote the book on high-availability system administration at scale.

The principles guiding Netflix's infrastructure setup are:

Automate everything—build tools, machine image creation, image deployment, application startup, and monitoring. Triple redundancy—everything is "built for three." Minimum rules, policies, and bureaucracy—hire good people and trust them. Developers deploy when they want, manage their own capacity and autoscaling, and fix anything that they break. Highly aligned, loosely coupled services built by different teams working together ("service oriented architecture").





Netflix built a suite of open source tools that sit on top of AWS—essentially an open source platform-as-a-service tailored to work with the Amazon cloud. Netflix's application code then sits on top of the application-agnostic open source PaaS software.

The Netflix PaaS provides:

Support for all regions and zones Support for multiple accounts Cross-region and cross-account replication Internationalization Localization GeoIP routing Advanced key management Autoscaling thousands of AWS instances Monitoring and alerting at scale

Netflix has 42 public repositories on their github page.





Application Availability

Hystrix makes distributed systems resilient by stopping cascading failure, acting as a sort of circuit breaker. The Simian Army is made up of 10 different monkeys that patrol the cluster looking for issues. Chaos Monkey kills random instances.

kills random instances. Chaos Gorilla kills random zones.

kills random zones. Chaos Kong kills random regions.

kills random regions. Latency Monkey randomly degrades the network and injects faults.

randomly degrades the network and injects faults. Conformity Monkey looks for outliers.

looks for outliers. Circus Monkey maintains balance in the zone by killing and launching instances.

maintains balance in the zone by killing and launching instances. Doctor Monkey fixes unhealthy resources.

fixes unhealthy resources. Janitor Monkey cleans up unused resources to keep bills where they need to be.

cleans up unused resources to keep bills where they need to be. Howler Monkey complains about issues such as Amazon limit violations.

complains about issues such as Amazon limit violations. Security Monkey identifies security issues and expiring certificates. Turbine connects to thousands of Hystrix-enabled servers and aggregates real-time streams from them.





Cloud Management

Netflix has four open-source libraries for cloud management:

Ice monitors AWS usage to keep an eye on costs. Asgard provides a web interface for application deployments and cloud management. Frigga, which refers to the wife of Odin and queen of Asgard (get it?) holds the logic that Asgard uses to generate and parse AWS object names. Glisten is a Groovy library for building JVM apps with Amazon Simple Workflow Service (SWF).





Infrastructure Services

Edda keeps track of infrastructure changes. Suro is a data pipeline service designed to collect and and aggregate large volumes of log data and other application events. Eureka handles load balancing and failover of middle-tier servers. Zuul is an edge service that provides dynamic routing, monitoring, resiliency, and security.





Platform Libraries

Karyon is the blueprint for the rest of the platform libraries. This is the "nucleus" or base container for the other platform libraries. Archaius is a set of configuration-management APIs. Denominator is a portable Java library for manipulating DNS clouds. Feign is a Java to HTTP client binder designed to reduce the complexity of binding Denominator uniformly to HTTP. Ribbon is a zone-aware software load balancer. Servo allows application developers to publish metrics to Amazon CloudWatch. Blitz4j is a high-performance, asynchronous logging framework built on top of log4j. Governator is a library of extensions and utilities that enhance Google Guice to provide classpath scanning, lifecycle management, injector bootstrapping, configuration to field mapping, field validation, parallelized object warmup, and generic binding annotations.





Deployment

Netflix deploys by creating new custom EBS images. To do that they use a library called aminator.





Database

Jeremy told us that Netflix stores every piece of data 9 different times (triple redundancy times three). Seems a little excessive for a movie streaming service, but I guess you can't be too paranoid when it comes to database administration.

Netflix uses the Apache Cassandra database because it's open source, it's written in Java, and it favors availability over consistency and writes over reads.

Netflix has nine open source tools related to database administration:

Astyanax is an object-oriented abstraction to Cassandra with multi-region support. CassJMeter supports using JMeter to monitor a Cassandra cluster. Curator is a set of Java libraries to make using Apache ZooKeeper easier. EVCache is a memcached- and spymemcached-based caching solution designed for use with AWS EC2. Exhibitor is a supervisor system for ZooKeeper that supports instance monitoring, backup/recovery, cleanup, and visualization. Priam provides zero-touch auto-config, state management, token assignment, node replacement, and S3 backup/restore for Cassandra. Staash is a REST-based web interface for accessing a data store that automates common data access patterns and abstracts the complexity of the underlying database.

Jeremy's slides are available on Slideshare.