Following William’s post on gRPC Load Balancing on Kubernetes without Tears, I became interested in finding out how much work is actually involved in implementing gRPC load balancing.

In this post, I’d like to share what I’ve learned about using the gRPC-Go balancer and resolver packages to implement simple client-side¹ round robin load balancing. Then I will show how you can use Linkerd 2 to automatically load balance gRPC traffic, without any application code changes or the deployment of an additional load balancer.

The code used in this post is available in my GitHub repo. The gRPC applications are tested with:

Go 1.11.5

protoc 3.6.1

gRPC 1.18.0

How It Works²

Let’s explore the two main components needed for gRPC client-side load balancing to work: the name resolver and the load balancing policy.

Name Resolution

When a gRPC client wants to interact with a gRPC server, it first attempts to resolve the server name by issuing a name resolution request to the resolver. The resolver returns a list of resolved IP addresses, each carrying an indicator that tells the client whether it is a backend or a load balancer address. A service config object, which specifies the load balancing policy to use, is also returned.

gRPC uses dns as its default name-system.
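To make this concrete, in gRPC-Go the built-in DNS resolver can be selected explicitly by dialing a dns:/// target. A minimal sketch (the host name is illustrative, and grpc.WithInsecure() is used only to keep the example short):

// Dial a dns:/// target so the built-in DNS resolver is used.
// The host name and port below are placeholders for illustration.
conn, err := grpc.Dial(
    "dns:///routeguide.example.com:8080",
    grpc.WithInsecure(),
)
if err != nil {
    log.Fatalf("failed to dial: %v", err)
}
defer conn.Close()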

For the purpose of code demonstration, let’s use the manual resolver. In the client application code, we can create and register the resolver like this:

// create a manual.Resolver with a random scheme
r, _ := manual.GenerateAndRegisterManualResolver()

// define the initial list of addresses
r.InitialAddrs([]resolver.Address{
    {Addr: "10.0.0.10", Type: resolver.Backend},
    {Addr: "10.0.0.11", Type: resolver.Backend},
    {Addr: "10.0.0.12", Type: resolver.Backend},
})

// set the default name resolution scheme to that of the resolver
resolver.SetDefaultScheme(r.Scheme())

The manual.GenerateAndRegisterManualResolver() function returns a resolver of type *manual.Resolver, created with a random naming scheme. Since the resolved IP addresses are registered manually (in the second step above), it doesn’t really matter what this resolver’s scheme is. The resolver’s InitialAddrs() method lets us register the list of addresses to be returned during name resolution. Finally, resolver.SetDefaultScheme() sets our application’s default name resolution scheme to that of our new resolver.

With this in place, whenever the client needs to resolve a server name, the resolver will always return these (and only these) registered addresses.
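In other words, the host and port passed to grpc.Dial become little more than a placeholder. A minimal sketch, assuming an insecure connection:

// The manual resolver ignores the dial target, so ":8080" below is only a
// placeholder; name resolution returns the three addresses registered with
// InitialAddrs() above.
conn, err := grpc.Dial(":8080", grpc.WithInsecure())
if err != nil {
    log.Fatalf("failed to dial: %v", err)
}
defer conn.Close()

Which of the registered addresses each RPC actually goes to is decided by the load balancing policy, described next.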

Load Balancing Policy

The second component is the load balancing policy. The two built-in policies in the gRPC-Go library are the roundrobin and grpclb policies. The grpclb policy is normally used with an external load balancer like this one. There is also a base package which is generally used to build more complex picking algorithms.

For each non-load balancer address returned by the resolver, the load balancing policy creates a new sub-connection to that address. The policy then returns a picker, which provides the interface the client uses to pick a sub-connection for each RPC call.
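To get a feel for how the base package is used, here is a minimal sketch of a custom policy that picks a ready sub-connection at random. It is written against the balancer and balancer/base APIs as they exist in gRPC-Go 1.18 (the Picker interface has changed in later releases), and the "random" policy name and the randomPicker types are purely illustrative:

package randompick

import (
    "context"
    "math/rand"

    "google.golang.org/grpc/balancer"
    "google.golang.org/grpc/balancer/base"
    "google.golang.org/grpc/resolver"
)

// randomPickerBuilder builds a picker from the set of ready sub-connections.
type randomPickerBuilder struct{}

func (*randomPickerBuilder) Build(readySCs map[resolver.Address]balancer.SubConn) balancer.Picker {
    scs := make([]balancer.SubConn, 0, len(readySCs))
    for _, sc := range readySCs {
        scs = append(scs, sc)
    }
    return &randomPicker{subConns: scs}
}

// randomPicker selects a ready sub-connection at random for every RPC.
type randomPicker struct {
    subConns []balancer.SubConn
}

func (p *randomPicker) Pick(ctx context.Context, opts balancer.PickOptions) (balancer.SubConn, func(balancer.DoneInfo), error) {
    if len(p.subConns) == 0 {
        return nil, nil, balancer.ErrNoSubConnAvailable
    }
    return p.subConns[rand.Intn(len(p.subConns))], nil, nil
}

func init() {
    // Register the policy so a client can select it with
    // grpc.WithBalancerName("random").
    balancer.Register(base.NewBalancerBuilder("random", &randomPickerBuilder{}))
}

The built-in roundrobin policy follows the same pattern, with a picker that cycles through the sub-connections instead of choosing one at random.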

The following code snippet shows how to enable the client-side round robin load balancing policy by specifying it as a grpc.DialOption:

opts := []grpc.DialOption{
    grpc.WithBalancerName(roundrobin.Name),
    // ....
}

conn, err := grpc.Dial(serverAddr, opts...)

Running The Applications

The gRPC server and client applications used in this example are based on the routeguide example found on the gRPC Basics - Go page, with the following modifications:

The server implements health checking and uses an interceptor to simulate faulty responses (a rough sketch of such an interceptor appears below).

The client can be started in either the firehose or the repeat-N mode. The firehose mode causes the client to issue random calls to all 4 APIs in an infinite loop. The repeat-N mode is better suited for tests that need more control and predictability: it issues N calls to a pre-selected API and then terminates.
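The repository has its own implementation, but a fault-injecting unary server interceptor can be sketched roughly as follows (the faultInterceptor helper and the choice of codes.Unavailable are assumptions made for illustration):

package server

import (
    "context"
    "math/rand"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// faultInterceptor fails roughly faultPercent (0.0-1.0) of unary RPCs with an
// UNAVAILABLE error and passes the rest through to the real handler.
func faultInterceptor(faultPercent float64) grpc.UnaryServerInterceptor {
    return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
        if rand.Float64() < faultPercent {
            return nil, status.Errorf(codes.Unavailable, "injected fault on %s", info.FullMethod)
        }
        return handler(ctx, req)
    }
}

// The interceptor is installed when constructing the gRPC server, e.g.:
//   s := grpc.NewServer(grpc.UnaryInterceptor(faultInterceptor(0.3)))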

The server can be started using the make server target.

$ make server

go build -o ./cmd/server/server ./cmd/server/

./cmd/server/server -port=8080 -fault-percent=0.3

2019/02/20 20:08:49 [main] fault percentage: 30%

2019/02/20 20:08:49 [main] hostname: orca:8080

Start the client without load balancing by running:

$ ENABLE_LOAD_BALANCING=false make client

go build -o ./cmd/client/client ./cmd/client/

./cmd/client/client \

-server=:8080 \

-timeout=20s \

-mode=REPEATN \

-api=GetFeature \

-n=15 \

-enable-load-balancing=false \

-server-ipv4=127.0.0.1:8080,127.0.0.1:8081,127.0.0.1:8082

2019/02/18 20:35:35 [main] connecting to server at :8080

2019/02/18 20:35:35 [main] running in REPEATN mode

2019/02/18 20:35:35 [main] calling getfeature 15 times

2019/02/18 20:35:35 [GetFeature] (req) latitude:402133926 longitude:-743613249

2019/02/18 20:35:35 [GetFeature] (resp) (server=orca:8080) location:<latitude:402133926 longitude:-743613249 >

2019/02/18 20:35:38 [GetFeature] (req) latitude:411349992 longitude:-743694161

2019/02/18 20:35:38 [GetFeature] (resp) (server=orca:8080) location:<latitude:411349992 longitude:-743694161 >

2019/02/18 20:35:41 [GetFeature] (req) latitude:416855156 longitude:-744420597

...

By default, the client is configured to invoke the GetFeature API 15 times. Notice from the client logs that all responses are coming from orca:8080.

Now start up two more instances of the server on ports 8081 and 8082:

$ SERVER_PORT=8081 make server &

$ SERVER_PORT=8082 make server &

Restart the client with load balancing enabled:

$ make client

go build -o ./cmd/client/client ./cmd/client/

./cmd/client/client \

-server=:8080 \

-timeout=20s \

-mode=REPEATN \

-api=GetFeature \

-n=15 \

-enable-load-balancing=true \

-server-ipv4=127.0.0.1:8080,127.0.0.1:8081,127.0.0.1:8082

2019/02/18 20:58:07 [main] load balancing scheme: round_robin

2019/02/18 20:58:07 [main] resolver type: manual

2019/02/18 20:58:07 [main] connecting to server at :8080

2019/02/18 20:58:07 [main] running in REPEATN mode

2019/02/18 20:58:07 [main] calling getfeature 15 times

2019/02/18 20:58:07 [GetFeature] (req) latitude:402133926 longitude:-743613249

2019/02/18 20:58:07 [GetFeature] (resp) (server=orca:8081) location:<latitude:402133926 longitude:-743613249 >

2019/02/18 20:58:10 [GetFeature] (req) latitude:411349992 longitude:-743694161

2019/02/18 20:58:10 [GetFeature] (resp) (server=orca:8082) location:<latitude:411349992 longitude:-743694161 >

2019/02/18 20:58:13 [GetFeature] (req) latitude:416855156 longitude:-744420597

2019/02/18 20:58:13 [GetFeature] (resp) (server=orca:8080) name:"103-271 Tempaloni Road, Ellenville, NY 12428, USA" location:<latitude:416855156 longitude:-744420597 >

2019/02/18 20:58:16 [GetFeature] (req) latitude:409146138 longitude:-746188906

2019/02/18 20:58:16 [GetFeature] (resp) (server=orca:8081) name:"Berkshire Valley Management Area Trail, Jefferson, NJ, USA" location:<latitude:409146138 longitude:-746188906 >

...

We can see that responses are coming back from orca:8080, orca:8081 and orca:8082.

Great! It looks like our client-side round robin load balancing works on localhost 👍 👍 🎈 🎈.

On Kubernetes

Let’s try to run it on Kubernetes. The k8s-server.yaml and k8s-client.yaml manifest files deploy 3 gRPC server replicas and a gRPC client. Both the servers and the client read their configuration from ConfigMaps.

By default, the client is configured to use the dns resolver type. We can’t use the manual resolver type here because we don’t know the servers’ pod IPs ahead of time, and we don’t have a watcher to keep track of changes to the pod IP addresses.

The following is the client’s default configuration:

kind: ConfigMap
apiVersion: v1
metadata:
  name: rg-client-config
  labels:
    app: rg-client
data:
  SERVER_HOST: rg-server.default.svc.cluster.local
  SERVER_PORT: "80"
  GRPC_TIMEOUT: 60s
  MODE: repeatn
  MAX_REPEAT: "20"
  REMOTE_API: GetFeature
  ENABLE_LOAD_BALANCING: "true"
  RESOLVER_TYPE: dns

To deploy the servers and client to a Kubernetes cluster, run:

$ make deploy

The client logs show that all the responses are coming back from one (instead of all) of the server instances. That’s disappointing, but not surprising 😦: the dns resolver resolves the rg-server Service name to its single ClusterIP, gRPC keeps one long-lived HTTP/2 connection to that virtual IP, and every request is multiplexed onto that connection, so they all land on the same pod.

$ kubectl logs -f rg-client

2019/02/20 05:27:21 [main] load balancing scheme: round_robin

2019/02/20 05:27:21 [main] resolver type: dns

2019/02/20 05:27:21 [main] connecting to server at rg-server.default.svc.cluster.local:80

2019/02/20 05:27:21 [main] running in repeatn mode

2019/02/20 05:27:21 [main] calling GetFeature 20 times

2019/02/20 05:27:21 [GetFeature] (req) latitude:402133926 longitude:-743613249

2019/02/20 05:27:21 [GetFeature] (resp) (server=rg-server-6c49b4dcf5-c7bxm:80) location:<latitude:402133926 longitude:-743613249 >

2019/02/20 05:27:24 [GetFeature] (req) latitude:410873075 longitude:-744459023

2019/02/20 05:27:24 [GetFeature] (resp) (server=rg-server-6c49b4dcf5-c7bxm:80) name:"Clinton Road, West Milford, NJ 07480, USA" location:<latitude:410873075 longitude:-744459023 >

2019/02/20 05:27:27 [GetFeature] (req) latitude:414777405 longitude:-740615601

2019/02/20 05:27:27 [GetFeature] (resp) (server=rg-server-6c49b4dcf5-c7bxm:80) location:<latitude:414777405 longitude:-740615601 >

2019/02/20 05:27:30 [GetFeature] (req) latitude:415301720 longitude:-748416257

2019/02/20 05:27:30 [GetFeature] (resp) (server=rg-server-6c49b4dcf5-c7bxm:80) name:"282 Lakeview Drive Road, Highland Lake, NY 12743, USA" location:<latitude:415301720 longitude:-748416257 >

....

With Linkerd 2

Let’s install Linkerd 2 and use it to enable automatic gRPC load balancing and (as a bonus) TLS encrypted connections.

Install the Linkerd control plane:

$ linkerd install --tls=optional | kubectl apply -f -

Disable the client-side round robin load balancing by updating the client config map:

kind: ConfigMap
apiVersion: v1
metadata:
  name: rg-client-config
  labels:
    app: rg-client
data:
  SERVER_HOST: rg-server.default.svc.cluster.local
  SERVER_PORT: "80"
  GRPC_TIMEOUT: 60s
  MODE: repeatn
  MAX_REPEAT: "20"
  REMOTE_API: GetFeature
  ENABLE_LOAD_BALANCING: "false"
  RESOLVER_TYPE: dns

Mesh and deploy the servers and client:

$ linkerd inject --tls=optional k8s-server.yaml | kubectl apply -f -

$ linkerd inject --tls=optional k8s-client.yaml | kubectl apply -f -

Looking at the client logs now, we can see that responses are coming back from all 3 servers.

$ kubectl logs -f rg-client rg-client

2019/02/20 05:50:21 [main] connecting to server at rg-server.default.svc.cluster.local:80

2019/02/20 05:50:21 [main] running in repeatn mode

2019/02/20 05:50:21 [main] calling GetFeature 20 times

2019/02/20 05:50:21 [GetFeature] (req) latitude:402133926 longitude:-743613249

2019/02/20 05:50:21 [GetFeature] (resp) (server=rg-server-b7b84d954-fh9q2:80) location:<latitude:402133926 longitude:-743613249 >

2019/02/20 05:50:24 [GetFeature] (req) latitude:410873075 longitude:-744459023

2019/02/20 05:50:24 [GetFeature] (resp) (server=rg-server-b7b84d954-9ttc7:80) name:"Clinton Road, West Milford, NJ 07480, USA" location:<latitude:410873075 longitude:-744459023 >

2019/02/20 05:50:27 [GetFeature] (req) latitude:414777405 longitude:-740615601

2019/02/20 05:50:27 [GetFeature] (resp) (server=rg-server-b7b84d954-s2vx5:80) location:<latitude:414777405 longitude:-740615601 >

2019/02/20 05:50:30 [GetFeature] (req) latitude:415301720 longitude:-748416257

2019/02/20 05:50:30 [GetFeature] (resp) (server=rg-server-b7b84d954-fh9q2:80) name:"282 Lakeview Drive Road, Highland Lake, NY 12743, USA" location:<latitude:415301720 longitude:-748416257 >

2019/02/20 05:50:33 [GetFeature] (req) latitude:402647019 longitude:-747071791

2019/02/20 05:50:33 [GetFeature] (resp) (server=rg-server-b7b84d954-fh9q2:80) name:"330 Evelyn Avenue, Hamilton Township, NJ 08619, USA" location:<latitude:402647019 longitude:-747071791 >

2019/02/20 05:50:36 [GetFeature] (req) latitude:405957808 longitude:-743255336

2019/02/20 05:50:36 [GetFeature] (resp) (server=rg-server-b7b84d954-fh9q2:80) name:"82-104 Amherst Avenue, Colonia, NJ 07067, USA" location:<latitude:405957808 longitude:-743255336 >

....

Looking at the Linkerd dashboard, notice that: