Our Approach

Given the challenges and design principles described above, we settled on a declarative reconciliation architecture to drive a self-service platform. At a high level, users come to the UI to declare the desired job attributes, and the platform orchestrates and coordinates sub-services to ensure the goal states are met as quickly as possible, even in the face of failures.

The following section covers the high-level architecture and lightly touches on various areas of the design. We’ll share more in-depth technical details and use cases in future follow-up posts.

1. Declarative Reconciliation

The declarative reconciliation protocol is used across the entire architectural stack, from the control plane to the data plane. The logical way to take advantage of this protocol is to store a single copy of the user-declared goal states as a durable source of truth, from which all other services reconcile. When a state conflict arises, whether from transient failures or from normal user-triggered actions, the source of truth is always treated as authoritative, and every other version of the state is considered only the current view of the world. The entire system is expected to eventually reconcile towards the source of truth.

The Source of Truth Store is durable, persistent storage that keeps all the desired state information. We currently use AWS RDS. It is the single source of truth for the entire system. For example, if a Kafka cluster is wiped out because of corrupted ZooKeeper state, we can always recreate the entire cluster solely from the source of truth. The same principle applies to the stream processing layer, where it is used to correct any current state that deviates from its desired goal state. This makes continuous self-healing and automated operations possible.

Another advantage of this protocol design is that operations are encouraged to be idempotent. As control instructions pass from the user to the control plane and then to the job cluster, inevitable failure conditions do not result in prolonged adverse effects. The services simply reconcile on their own, eventually. This in turn brings operational agility.
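The reconciliation idea can be illustrated with a minimal sketch. It is not the platform's actual implementation: the `Cluster` class, the state keys, and the `reconcile` function are hypothetical names chosen for illustration, and the "cluster" is just an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Cluster:
    """Hypothetical stand-in for a job cluster's observed (current) state."""
    state: Dict[str, object] = field(default_factory=dict)


def diff(desired: Dict[str, object], current: Dict[str, object]) -> List[Tuple[str, object]]:
    """Return (key, desired_value) pairs where the current view disagrees with the goal state."""
    return [(key, value) for key, value in desired.items() if current.get(key) != value]


def reconcile(desired: Dict[str, object], cluster: Cluster) -> int:
    """Drive the cluster toward the desired state; return the number of changes applied.

    Setting a key to its desired value is idempotent: repeating the call after a
    partial failure or a duplicated instruction converges to the same result.
    """
    changes = diff(desired, cluster.state)
    for key, value in changes:
        cluster.state[key] = value
    return len(changes)


# The source of truth is authoritative; the cluster only holds a current view.
source_of_truth = {"parallelism": 4, "image": "job:v2"}
cluster = Cluster(state={"parallelism": 2, "image": "job:v2"})

first_pass = reconcile(source_of_truth, cluster)   # corrects the drifted parallelism
second_pass = reconcile(source_of_truth, cluster)  # nothing left to change
```

Because each pass only applies the difference between the goal state and the current view, running it repeatedly is safe, which is the property that lets a real reconciler retry after failures.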

2. Deployment Orchestration

The control plane drives the orchestration workflow by interacting with Spinnaker, Netflix’s internal continuous deployment engine. Spinnaker abstracts the integration with the Titus container runtime, which allows the control plane to orchestrate deployments with different trade-offs.

A Flink cluster is composed of job managers and task managers. Today, we enforce complete job-instance-level isolation by creating an independent Flink cluster for each job. The only shared services are ZooKeeper for consensus coordination and the S3 backend for storing checkpoint state.

During redeployment, a stateless application may choose between latency and duplicate trade-offs, and the corresponding deployment workflow is used to satisfy that requirement. For a stateful application, the user can choose to resume from a checkpoint or savepoint, or to start from fresh state.
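One plausible reading of these trade-offs is sketched below. The step names, the `Tradeoff` enum, and the ordering of steps are assumptions made for illustration; the post does not describe the actual workflows.

```python
from enum import Enum
from typing import List, Optional


class Tradeoff(Enum):
    """Hypothetical names for the two stateless redeployment preferences."""
    MINIMIZE_LATENCY = "minimize_latency"
    MINIMIZE_DUPLICATES = "minimize_duplicates"


def stateless_redeploy_steps(tradeoff: Tradeoff) -> List[str]:
    """Return an ordered list of (hypothetical) workflow steps for a stateless job."""
    if tradeoff is Tradeoff.MINIMIZE_LATENCY:
        # Assumption: starting the new cluster before stopping the old one leaves
        # no processing gap, but both may briefly process the same events.
        return ["start_new_cluster", "stop_old_cluster"]
    # Assumption: stopping the old cluster first avoids overlap, at the cost of
    # events waiting until the new cluster is up.
    return ["stop_old_cluster", "start_new_cluster"]


def stateful_redeploy_steps(restore_from: Optional[str]) -> List[str]:
    """Return ordered steps for a stateful job; `restore_from` is None for fresh state."""
    steps = ["stop_old_cluster"]
    if restore_from is not None:
        steps.append(f"restore_state:{restore_from}")
    steps.append("start_new_cluster")
    return steps
```

For example, `stateful_redeploy_steps(None)` models starting from fresh state, while passing a checkpoint or savepoint identifier adds a restore step before the new cluster starts.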

3. Self-service Tooling

For routing jobs: through self-service, a user can request a stream to produce events to, optionally declare filtering or projection, and then route the events to a managed sink such as Elasticsearch or Hive, or make them available for downstream real-time consumption. The self-service UI takes these inputs from the user and translates them into concrete desired system states. This lets us build a decoupled orchestration layer that drives the goal states. It also lets us abstract away information the user may not care about, such as which Kafka cluster to produce to or certain container configurations, which gives us flexibility when we need it.
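As a rough sketch of that translation step, the snippet below expands a user request into a goal state. All field names, the `placement` lookup, and the example stream and cluster names are hypothetical, not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional, Tuple


@dataclass(frozen=True)
class RoutingRequest:
    """What the user declares in the UI (field names are hypothetical)."""
    stream: str
    sink: str                                     # e.g. "elasticsearch" or "hive"
    filter_expr: Optional[str] = None             # optional filtering
    projection: Optional[Tuple[str, ...]] = None  # optional projected fields


def to_desired_state(req: RoutingRequest, placement: Mapping[str, str]) -> Dict[str, object]:
    """Expand a user request into a concrete goal state.

    The Kafka cluster is chosen by the platform from `placement` (a hypothetical
    stream-to-cluster lookup), so the user never has to name one.
    """
    return {
        "stream": req.stream,
        "kafka_cluster": placement[req.stream],
        "sink": req.sink,
        "filter": req.filter_expr,
        "projection": list(req.projection) if req.projection else None,
    }


request = RoutingRequest(stream="playback_events", sink="hive",
                         projection=("user_id", "title_id"))
desired = to_desired_state(request, placement={"playback_events": "kafka-a"})
```

Keeping the Kafka cluster out of the request is what allows the platform to move a stream to a different cluster later without involving the user.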

For custom SPaaS jobs, we provide command-line tooling that generates a Flink code template repository, CI integration, and so on.

Once the user customizes and checks in the code, CI automation kicks off to build the Docker image, register the image and configurations with the platform backend, and allow the user to perform deployments and other administrative operations.

4. Stream Processing Engines

We are currently focusing on leveraging Apache Flink and building an ecosystem around it for Keystone analytic use cases. Moving forward, we plan to integrate and extend the Mantis stream processing engine for operational use cases.

5. Connectors, Managed Operators and Application Abstraction

To help our users develop and innovate faster, we offer a full range of abstractions, including managed connectors and operators that users can plug into the processing DAG, as well as integration with various platform services.

We provide managed connectors to Kafka, Elasticsearch, Hive, and others. The connectors abstract away the underlying complexity of custom wire formats, serialization (so we can keep track of different payload formats to optimize storage and transport), and batching/throttling behaviors, and they are easy to plug into the processing DAG. We also provide a dynamic source/sink operator that allows users to switch between different sources or sinks at runtime without having to rebuild.

Other managed operators include filter, projector, and data hygiene operators with an easy-to-understand custom DSL. We continue to work with our users to contribute proven operators to the collection and make them accessible to more teams.

6. Configuration & Immutable Deployment

Multi-tenant configuration management is challenging. We want to make the configuration experience dynamic (so users do not have to rebuild or reship code) and at the same time easily manageable.

Both default managed configurations and user-defined configurations are stored alongside the application properties files. We’ve done the plumbing to allow these configurations to be overridden by environment variables, and they can be further overridden through the self-service UI. This approach fits the reconciliation architecture: users come to our UI to declare the intended configs, and deployment orchestration ensures eventual consistency at runtime.
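The override order described here (properties file, then environment variables, then self-service UI) can be sketched with Python's standard `collections.ChainMap`. The function name and configuration keys are hypothetical, and the sketch assumes environment overrides have already been normalized to the same key names as the properties file.

```python
from collections import ChainMap
from typing import Dict, Mapping


def effective_config(properties: Mapping[str, str],
                     env_overrides: Mapping[str, str],
                     ui_overrides: Mapping[str, str]) -> Dict[str, str]:
    """Merge configuration layers; later layers in the override order win.

    ChainMap searches its maps from left to right, so the highest-priority
    layer (UI overrides) is listed first and the properties file last.
    """
    return dict(ChainMap(ui_overrides, env_overrides, properties))


config = effective_config(
    properties={"parallelism": "2", "checkpoint.interval": "60s"},
    env_overrides={"parallelism": "4"},
    ui_overrides={"checkpoint.interval": "30s"},
)
```

Here `config` takes `parallelism` from the environment layer and `checkpoint.interval` from the UI layer, while any key set only in the properties file would pass through unchanged.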

7. Self-healing

Failures are inevitable in distributed systems. We fully expect them to happen at any time, and we designed our system to self-heal so we don’t have to be woken up in the middle of the night for incident mitigation.

Architecturally, platform component services are isolated to reduce the blast radius when failures arise. The reconciliation architecture also ensures system-level self-recovery by continuously reconciling away any drift.

At the individual job level, the same isolation pattern is followed to reduce failure impact. In addition, to detect and recover from such failures, each managed streaming job comes with a health monitor. The health monitor is an internal component that runs in the Flink cluster and is responsible for detecting failure scenarios and performing self-healing:

Cluster Task Manager drift: if Flink’s view of the container resources persistently differs from the container runtime’s view, the drift is automatically corrected by proactively terminating the affected containers.

Stalled Job Manager leader: if a leader fails to be elected, the cluster becomes brainless. Corrective action is performed on the job manager.

Unstable container resources: if a task manager shows an unstable pattern, such as periodic restarts or failures, it is replaced.

Network partition: if any container experiences network connectivity issues, it is automatically terminated.
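The four scenarios above can be summarized as a small rule table. The sketch below is illustrative only: the `ClusterView` fields, the action strings, and the restart threshold are hypothetical, and it checks a single snapshot rather than a persistent condition over time.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ClusterView:
    """Hypothetical snapshot of what a health monitor might observe."""
    flink_task_managers: Set[str]   # container ids Flink believes it has
    runtime_containers: Set[str]    # container ids the container runtime reports
    has_leader: bool                # whether a job manager leader is elected
    restart_counts: Dict[str, int] = field(default_factory=dict)  # recent restarts per container
    unreachable: Set[str] = field(default_factory=set)            # containers failing connectivity checks


def plan_actions(view: ClusterView, max_restarts: int = 3) -> List[str]:
    """Map detected failure scenarios to corrective actions, without duplicates."""
    actions: List[str] = []
    # Task manager drift: containers present in one view but not the other.
    for cid in sorted(view.flink_task_managers ^ view.runtime_containers):
        actions.append(f"terminate:{cid}")
    # Stalled leader: no elected job manager leader.
    if not view.has_leader:
        actions.append("corrective_action:job_manager")
    # Unstable containers: restart count above a (hypothetical) threshold.
    for cid, count in sorted(view.restart_counts.items()):
        if count > max_restarts:
            actions.append(f"replace:{cid}")
    # Network partition: containers with connectivity issues.
    for cid in sorted(view.unreachable):
        actions.append(f"terminate:{cid}")
    # dict.fromkeys keeps first-seen order while dropping repeated actions.
    return list(dict.fromkeys(actions))
```

A production monitor would also need to track how long each condition has held before acting, since the drift rule in particular applies only to persistent mismatches.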

8. Backfill & Rewind

Again, failures are inevitable, and sometimes a user may need to backfill or rewind a processing job.

For source data that is backed up into the data warehouse, we have built functionality into the platform to switch the source dynamically without having to modify and rebuild code. This approach comes with certain limitations and is only recommended for stateless jobs.

Alternatively, the user can choose to rewind processing to a previous, automatically taken checkpoint.
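Choosing which checkpoint to rewind to can be sketched as picking the most recent checkpoint completed at or before a target time. The `Checkpoint` tuple layout, the function name, and the checkpoint identifiers are assumptions for illustration, not the platform's actual data model.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (completion time in epoch seconds, checkpoint id).
Checkpoint = Tuple[int, str]


def checkpoint_for_rewind(checkpoints: List[Checkpoint], rewind_to: int) -> Optional[Checkpoint]:
    """Return the latest checkpoint completed at or before `rewind_to`, or None if there is none."""
    candidates = [cp for cp in checkpoints if cp[0] <= rewind_to]
    return max(candidates, key=lambda cp: cp[0], default=None)


history = [(100, "chk-1"), (200, "chk-2"), (300, "chk-3")]
chosen = checkpoint_for_rewind(history, rewind_to=250)
```

Rewinding to a checkpoint taken before the incident means events after that point are processed again, so downstream consumers should expect reprocessed output.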

9. Monitoring & Alerting

Every streaming job comes with a personalized monitoring and alerting dashboard. This helps both the platform/infrastructure team and the application team diagnose and monitor for issues.

10. Reliability & Testing

As the platform and underlying infrastructure services innovate to provide new features and improvements, the pressure to quickly adopt those changes comes from the bottom up (architecturally).

As applications are developed and productionized, the pressure for reliability comes from the top down.

The two pressures meet in the middle. To build trust on both sides, we need to enable both the platform team and users to test the entire stack efficiently.

We are big believers in making unit tests, integration tests, operational canaries, and data parity canaries accessible to all our users and easy to adopt for the stream processing paradigm. We are making progress on this front and still see many challenges to solve.