(updated April 9, 2020) We recently open-sourced multicluster-scheduler, a system of Kubernetes controllers that intelligently schedules workloads across clusters. In this blog post, we will use it with Argo to run multicluster workflows (pipelines, DAGs, ETLs) that better utilize resources and/or combine data from different regions or clouds.

UPDATE (2020-04-09) - Multicluster-scheduler has changed quite a bit since this blog post was originally published. Multicluster-scheduler is now a virtual-kubelet provider; observations and decisions were replaced by more direct control loops. The integration with Argo Workflows discussed here still works. Multicluster-scheduler also works with Argo CD, as shown in this ITNEXT blog post published today by Gokul Chandra.

Most enterprises that use Kubernetes manage multiple clusters. For various reasons, you may have one or several clusters per team, region, environment, or combination thereof. Your clusters may be hosted by different cloud providers—in a multicloud infrastructure—and/or on premises—in a hybrid infrastructure. The benefits of isolation, however, come at the expense of, among other things, reduced bin-packing efficiency and data fragmentation. Let's explore two scenarios.

Scenario A: You need to run a large parallel workflow, e.g., a machine learning training pipeline, which requires more resources than available in your team's cluster. You could scale out to go fast, or limit parallelism to save money. In the meantime, available resources in other teams' clusters stay idle. Multicluster-scheduler allows you to elect pods to be delegated to other clusters, where resources are available, from the comfort of your own cluster, with a single pod or pod template annotation.

Scenario B: You need to run a workflow that combines data from multiple clouds or regions. Running some steps closer to their data sources is either more efficient or outright required. To optimize throughput or save on data egress charges, you may want to compress or aggregate the data before loading the results closer to you. Or, to respect privacy regulations and minimize your attack surface, you may want to anonymize the data as far upstream as possible. You could deploy remote services or functions and call them from your workflow, but that would be complicated. Multicluster-scheduler allows you to simply specify which cluster a pod should run in, again with a single pod or pod template annotation.

A Multicluster Pod's Journey

Here's a quick summary of a multicluster pod's journey.

1. When a pod is created with the multicluster.admiralty.io/elect="" annotation, the multicluster-scheduler agent's mutating pod admission webhook replaces the pod's containers with a dummy busybox container that just waits to be killed. The original spec is saved for later as another annotation. We call the resulting pod a proxy pod.

2. The agent then sends an observation of the proxy pod to the scheduler's cluster (which can be the same cluster). The agent also watches other pods, nodes, and node pools, and sends observations of them to the scheduler's cluster to guide its decisions.

3. The scheduler creates a delegate pod decision in its own cluster. If the original pod was annotated with multicluster.admiralty.io/clustername=foo, the decision is targeted at cluster "foo". Otherwise, the scheduler targets the cluster that could accommodate the most replicas of the pod, based on current observations. More advanced scheduling options are in the works.

4. The agent in the target cluster sees the decision and creates a delegate pod, which has the same spec as the original pod. An observation of the delegate pod is sent back to the scheduler's cluster.

5. When the delegate pod is annotated, the same annotation is fed back to the proxy pod (e.g., so Argo can read step outputs), and when the delegate pod succeeds or dies, a signal is sent to the proxy pod's container so that it succeeds or dies too.
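For illustration, here's a minimal sketch of the pod a user would create to trigger this journey (the pod name is hypothetical; the elect annotation is the only multicluster-specific addition, and the webhook does the rest):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-step  # hypothetical name
  annotations:
    # Elect this pod for multicluster scheduling; the webhook turns it
    # into a proxy pod and saves the original spec in another annotation.
    multicluster.admiralty.io/elect: ""
spec:
  containers:
  - name: main
    image: busybox
    command: [sleep, 10]
    resources:
      requests:
        cpu: 100m  # requests guide the scheduler's placement decision
```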

For more details, check out the README.

Let's see that in action. Create two clusters, e.g., with Minikube or your favorite cloud provider. In this blog post, we'll assume their associated contexts in your kubeconfig are "cluster1" and "cluster2", but we'll use variables so you can use your own:

CLUSTER1=cluster1
CLUSTER2=cluster2

Now, following the installation guide, install the scheduler in cluster1 and the agent in both clusters.

We also need Argo in either cluster. We'll use cluster1 in this guide, but feel free to change the variables:

ARGO_CLUSTER=$CLUSTER1
NON_ARGO_CLUSTER=$CLUSTER2

Install the Argo controller and UI (step 2 of the Argo getting started guide):

kubectl --context $ARGO_CLUSTER create ns argo
kubectl --context $ARGO_CLUSTER apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.2.1/manifests/install.yaml

We're not big fans of giving Argo pods admin privileges, as recommended in step 3 of the Argo getting started guide, so we'll use a minimal service account instead. Because pods will run in the two clusters, we need this service account in both:

ARGO_POD_RBAC=https://raw.githubusercontent.com/admiraltyio/multicluster-scheduler/master/config/samples/argo-workflows/_service-account.yaml
kubectl --context $ARGO_CLUSTER apply -f $ARGO_POD_RBAC
kubectl --context $NON_ARGO_CLUSTER apply -f $ARGO_POD_RBAC
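The actual manifest lives at the URL above; as a rough sketch (not the exact file), a minimal service account for Argo workflow pods looks something like this, since the Argo executor sidecar patches pod annotations to report step outputs and reads container logs:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argo-workflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-workflow
rules:
- apiGroups: [""]
  resources: [pods]
  verbs: [get, watch, patch]  # the executor patches pod annotations with outputs
- apiGroups: [""]
  resources: [pods/log]
  verbs: [get, watch]         # and reads logs to capture results
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-workflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-workflow
subjects:
- kind: ServiceAccount
  name: argo-workflow
```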

Let's also install the Argo CLI locally—although it's optional—to nicely submit and track workflows:

# macOS:
brew install argoproj/tap/argo
# Linux:
curl -sSL -o /usr/local/bin/argo https://github.com/argoproj/argo/releases/download/v2.2.1/argo-linux-amd64
chmod +x /usr/local/bin/argo
# Windows:
curl -sSL -o argo https://github.com/argoproj/argo/releases/download/v2.2.1/argo-windows-amd64

(UPDATE, 2019-03-17, multicluster-scheduler v0.3) Finally, let's enable multicluster-scheduler in the default namespace:

kubectl --context $ARGO_CLUSTER label ns default multicluster-scheduler=enabled

You can now turn any Argo workflow into a multicluster workflow by adding multicluster.admiralty.io annotations to its pod templates. Also, don't forget to specify resource requests if you want the scheduler to decide where to run your pods.

Scenario A: Optimizing a Large Parallel Workflow

A default GKE cluster has three nodes, with 1 vCPU and 3.75GB of memory each, of which 940m vCPU and 2.58GiB of memory are allocatable. The system pods, along with multicluster-scheduler and Argo, already request 1840m vCPU in cluster1 and 1740m vCPU in cluster2. Therefore, cluster1 has 980m vCPU available and cluster2 has 1080m. We don't need to spend extra money for this experiment: we will model a "large parallel workflow" with 10 parallel steps requiring 200m vCPU each (including 100m for the Argo sidecar).
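As a quick sanity check on these numbers, here is the arithmetic in shell (all values in millicores):

```shell
# Allocatable CPU per cluster: 3 nodes x 940m each.
allocatable=$((3 * 940))
echo "cluster1: $((allocatable - 1840))m available"  # 2820m - 1840m requested
echo "cluster2: $((allocatable - 1740))m available"  # 2820m - 1740m requested
echo "per step: $((100 + 100))m"                     # main container + Argo sidecar
```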

First, let's run the following single-cluster workflow (also available in the multicluster-scheduler samples directory):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: singlecluster-parallel-
spec:
  entrypoint: singlecluster-parallel
  templates:
  - name: singlecluster-parallel
    steps:
    - - name: sleep
        template: sleep
        withItems: [0, 1, 2, 3, 4, 5, 6, 7, 9, 10]
  - name: sleep
    container:
      image: busybox
      command: [sleep, 10]
      resources:
        requests:
          cpu: 100m

Submit it:

argo --context $ARGO_CLUSTER submit --serviceaccount argo-workflow --watch https://raw.githubusercontent.com/admiraltyio/multicluster-scheduler/master/config/samples/argo-workflows/blog-scenario-a-singlecluster.yaml

Here's the final state:

Duration: 1 minute 16 seconds

STEP                  PODNAME                                  DURATION  MESSAGE
✔ singlecluster-parallel-6rtkc
└-·-✔ sleep(0:0)      singlecluster-parallel-6rtkc-839758060   11s
  ├-✔ sleep(1:1)      singlecluster-parallel-6rtkc-1823198064  12s
  ├-✔ sleep(2:2)      singlecluster-parallel-6rtkc-4064072188  11s
  ├-✔ sleep(3:3)      singlecluster-parallel-6rtkc-2040401880  27s
  ├-✔ sleep(4:4)      singlecluster-parallel-6rtkc-3078784476  27s
  ├-✔ sleep(5:5)      singlecluster-parallel-6rtkc-3529283624  27s
  ├-✔ sleep(6:6)      singlecluster-parallel-6rtkc-3081898924  43s
  ├-✔ sleep(7:7)      singlecluster-parallel-6rtkc-2914639584  43s
  ├-✔ sleep(8:9)      singlecluster-parallel-6rtkc-3024028329  43s
  └-✔ sleep(9:10)     singlecluster-parallel-6rtkc-3224503614  1m

It took the workflow 1 minute 16 seconds to run on cluster1 alone. We can see that cluster1 could only run three steps concurrently, in four waves, which is less than ideal, but expected, because the 980m vCPU available are spread across three "bins".

Let's annotate our workflow's pod template with multicluster.admiralty.io/elect="" to make it run on two clusters:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: multicluster-parallel-
spec:
  entrypoint: multicluster-parallel
  templates:
  - name: multicluster-parallel
    steps:
    - - name: sleep
        template: sleep
        withItems: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
  - name: sleep
    container:
      image: busybox
      command: [sleep, 10]
      resources:
        requests:
          cpu: 100m
    metadata:
      annotations:
        multicluster.admiralty.io/elect: ""

Submit it:

argo --context $ARGO_CLUSTER submit --serviceaccount argo-workflow --watch https://raw.githubusercontent.com/admiraltyio/multicluster-scheduler/master/config/samples/argo-workflows/blog-scenario-a-multicluster.yaml

Here's the final state:

Duration: 31 seconds

STEP                  PODNAME                                 DURATION  MESSAGE
✔ multicluster-parallel-lmw2d
└-·-✔ sleep(0:0)      multicluster-parallel-lmw2d-1353848687  12s
  ├-✔ sleep(1:1)      multicluster-parallel-lmw2d-714502387   14s
  ├-✔ sleep(2:2)      multicluster-parallel-lmw2d-894725111   14s
  ├-✔ sleep(3:3)      multicluster-parallel-lmw2d-711387939   13s
  ├-✔ sleep(4:4)      multicluster-parallel-lmw2d-479610983   14s
  ├-✔ sleep(5:5)      multicluster-parallel-lmw2d-1696675651  13s
  ├-✔ sleep(6:6)      multicluster-parallel-lmw2d-1336174783  15s
  ├-✔ sleep(7:7)      multicluster-parallel-lmw2d-2767328819  29s
  ├-✔ sleep(8:9)      multicluster-parallel-lmw2d-3117624962  29s
  └-✔ sleep(9:10)     multicluster-parallel-lmw2d-2469206667  29s

It took the workflow only 31 seconds to run across cluster1 and cluster2. Seven steps were able to run concurrently at first, followed by the three remaining steps. Notice that some of the steps were run in cluster1 and the others in cluster2:

kubectl --context $ARGO_CLUSTER get pods

outputs:

NAME                                                     READY  STATUS     RESTARTS  AGE
cluster1-default-multicluster-parallel-lmw2d-1336174783  0/2    Completed  0         4m
cluster1-default-multicluster-parallel-lmw2d-1696675651  0/2    Completed  0         4m
cluster1-default-multicluster-parallel-lmw2d-2767328819  0/2    Completed  0         4m
cluster1-default-multicluster-parallel-lmw2d-3117624962  0/2    Completed  0         4m
cluster1-default-multicluster-parallel-lmw2d-479610983   0/2    Completed  0         4m
multicluster-parallel-lmw2d-1336174783                   0/2    Completed  0         4m
multicluster-parallel-lmw2d-1353848687                   0/2    Completed  0         4m
multicluster-parallel-lmw2d-1696675651                   0/2    Completed  0         4m
multicluster-parallel-lmw2d-2469206667                   0/2    Completed  0         4m
multicluster-parallel-lmw2d-2767328819                   0/2    Completed  0         4m
multicluster-parallel-lmw2d-3117624962                   0/2    Completed  0         4m
multicluster-parallel-lmw2d-479610983                    0/2    Completed  0         4m
multicluster-parallel-lmw2d-711387939                    0/2    Completed  0         4m
multicluster-parallel-lmw2d-714502387                    0/2    Completed  0         4m
multicluster-parallel-lmw2d-894725111                    0/2    Completed  0         4m
... (and all the pods from the single-cluster workflow)

The five pods whose names are prefixed with "cluster1-default-" are delegate pods. The prefix indicates their origin. The other pods are the proxy pods.

In cluster2, there are only delegate pods:

kubectl --context $NON_ARGO_CLUSTER get pods

outputs:

NAME                                                     READY  STATUS     RESTARTS  AGE
cluster1-default-multicluster-parallel-lmw2d-1353848687  0/2    Completed  0         4m
cluster1-default-multicluster-parallel-lmw2d-2469206667  0/2    Completed  0         4m
cluster1-default-multicluster-parallel-lmw2d-711387939   0/2    Completed  0         4m
cluster1-default-multicluster-parallel-lmw2d-714502387   0/2    Completed  0         4m
cluster1-default-multicluster-parallel-lmw2d-894725111   0/2    Completed  0         4m

Scenario B: Multicluster ETL

We will model this scenario with a simple DAG workflow, where steps A and C can run in cluster1, but step B must run in cluster2; step C depends on steps A and B. Note the use of the multicluster.admiralty.io/clustername pod template annotation to enforce a placement:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: multicluster-dag-
spec:
  entrypoint: multicluster-dag
  templates:
  - name: multicluster-dag
    dag:
      tasks:
      - name: A
        template: sleep
      - name: B
        template: sleep-remote
        arguments:
          parameters:
          - name: clustername
            value: cluster2
      - name: C
        dependencies: [A, B]
        template: sleep
  - name: sleep
    container:
      image: busybox
      command: [sleep, 10]
  - name: sleep-remote
    inputs:
      parameters:
      - name: clustername
    container:
      image: busybox
      command: [sleep, 10]
    metadata:
      annotations:
        multicluster.admiralty.io/elect: ""
        multicluster.admiralty.io/clustername: "{{inputs.parameters.clustername}}"

If NON_ARGO_CLUSTER is not equal to "cluster2" in your case, modify the workflow before submitting it.

argo --context $ARGO_CLUSTER submit --serviceaccount argo-workflow --watch https://raw.githubusercontent.com/admiraltyio/multicluster-scheduler/master/config/samples/argo-workflows/blog-scenario-b.yaml

Here's the final state:

Duration: 26 seconds

STEP                      PODNAME                           DURATION  MESSAGE
✔ multicluster-dag-ftrwh
├-✔ A                     multicluster-dag-ftrwh-745251266  11s
├-✔ B                     multicluster-dag-ftrwh-728473647  12s
└-✔ C                     multicluster-dag-ftrwh-711696028  12s

Note that step B was delegated to cluster2:

kubectl --context $NON_ARGO_CLUSTER get pods

outputs:

NAME                                               READY  STATUS     RESTARTS  AGE
cluster1-default-multicluster-dag-ftrwh-728473647  0/2    Completed  0         2m

In a real-world scenario, you would pipe data between steps using artifact repositories and/or step inputs/outputs.
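As a rough sketch of that idea (template names, commands, and paths here are hypothetical, and the DAG wiring is elided), an "extract" step can expose a file as an Argo output artifact that a "load" step consumes as an input artifact:

```yaml
# Hypothetical template fragments. In the DAG, the task running "load"
# would reference the artifact via
# "{{tasks.<extract-task>.outputs.artifacts.data}}".
- name: extract
  container:
    image: busybox
    command: [sh, -c, "echo hello > /tmp/data.txt"]
  outputs:
    artifacts:
    - name: data
      path: /tmp/data.txt  # uploaded to the artifact repository
- name: load
  inputs:
    artifacts:
    - name: data
      path: /tmp/data.txt  # downloaded before the container starts
  container:
    image: busybox
    command: [cat, /tmp/data.txt]
```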

As we've demonstrated, multicluster-scheduler integrates nicely with Argo. We didn't have to modify the Argo source code, and simple annotations to the workflows' manifests were enough to make them run across clusters. This would not have been possible with a project like Federation v2, which requires clients to use new, federated APIs, e.g., federated deployment templates, placements, and overrides. The main advantage of multicluster-scheduler is that it federates clusters at the pod level, "the smallest and simplest unit in the Kubernetes object model." The entire Kubernetes ecosystem revolves around pods. By choosing pods as multicluster-scheduler's unit, we're enabling a series of "for free", loosely coupled integrations. Multicluster Horizontal Pod Autoscaler with custom and external metrics should work out-of-the-box (we'll prove that soon), while integrations with Istio and Knative are in the works.

Multicluster-scheduler can run Argo workflows across Kubernetes clusters, delegating pods to where resources are available, or as specified by the user. It can make parallel workflows run faster without scaling out clusters, and it simplifies multi-region and multicloud ETL processes. This integration was made possible by multicluster-scheduler's architecture centered around pods, "the smallest and simplest unit in the Kubernetes object model." The way forward is exciting and includes, among other things, more integrations with the cloud native ecosystem, and advanced scheduling. We're curious to hear the thoughts and feedback of the community, and we welcome contributions!

Many thanks to the Argo authors for designing a great cloud-native workflow engine, and to the authors of controller-runtime, which powers a lot of the components of multicluster-scheduler.