This is a consolidation of learnings from weeks of tuning and debugging NGINX behind the Google Cloud Platform HTTP(S) Load Balancer.

There is an unfortunate lack of documentation around the web for some of this, so I hope it helps you! But remember, tuning is always specific to your environment and conditions, so your mileage may vary.

1. Enable gzip to work with Google’s load balancer

By default, NGINX does not compress responses to proxied requests (requests that come from the proxy server). The fact that a request comes from a proxy server is determined by the presence of the Via header field in the request.

- NGINX Admin Guide: Compression and Decompression

Google’s load balancer adds the “Via: 1.1 google” header, so nginx will not gzip responses by default behind the GCP HTTP(S) Load Balancer. This happens because nginx does not think the proxy can handle the gzipped response.

To re-enable gzipped responses, configure gzip_proxied in nginx.conf (in http, server, or location blocks):

gzip_proxied any;
gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
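Putting it together, a minimal sketch of the relevant http-block settings might look like this (gzip on may already be set elsewhere in your config, and gzip_min_length is just an illustrative extra, not something this fix requires):

# http block of nginx.conf
gzip on;
gzip_proxied any;
gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
gzip_min_length 256;   # optional: skip compressing very small responses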

2. Firewall rule: only allow load balancer traffic

Traffic from the load balancer to your instances has an IP address in the range of 130.211.0.0/22. When viewing logs on your load balanced instances, you will not see the source address of the original client. Instead, you will see source addresses from this range.

- https://cloud.google.com/compute/docs/load-balancing/http/

For security reasons, you can force all HTTP(S) traffic to flow through the load balancer and block direct access to your instances (from port scanners, for example). You just need to know this specific CIDR range:

130.211.0.0/22

When making a GCE firewall rule, just set Source IP ranges to this range.

Update: Google has added more ranges that load balancer traffic might come from. As of Jan. 31, 2018, you’ll also need to allow this range:

35.191.0.0/16

These ranges apply only to the HTTP(S) Load Balancer and SSL Proxy. If you are using Network Load Balancing, see the docs for the applicable ranges.
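If you manage firewall rules with the gcloud CLI instead of the console, a sketch of such a rule might look like the following (the rule name, network, ports, and target tag are placeholders for illustration):

gcloud compute firewall-rules create allow-gclb-only \
    --network default \
    --allow tcp:80,tcp:443 \
    --source-ranges 130.211.0.0/22,35.191.0.0/16 \
    --target-tags http-server

For the rule to actually lock things down, make sure no other firewall rule still allows ports 80/443 from 0.0.0.0/0 to these instances.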

3. NGINX timeouts: fix a nasty 502 Bad Gateway race condition

This was a hard one. I spent many hours tuning sysctl settings, running tcpdump, re-architecting flows, and trying to recreate a rare and intrusive error before figuring out what was happening.

Summary: the default nginx keepalive_timeout is incompatible with the Google Cloud Platform HTTP(S) Load Balancer!

You must increase nginx’s keepalive_timeout, or risk sporadic 502 Bad Gateway responses to POST requests.

# Tune nginx keepalives to work with the GCP HTTP(S) Load Balancer:
keepalive_timeout 650;
keepalive_requests 10000;

The “650 seconds” here is not arbitrary; see below for why we picked this specific timeout. Notably, this is the opposite of the advice most articles will give you, but most of them are configuring nginx for direct client connections, not for sitting behind a global load balancer.
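After changing these values, validate the config and reload nginx (a small sketch, assuming nginx runs as a systemd service; adjust for your init system):

sudo nginx -t                 # validate the new configuration
sudo systemctl reload nginx   # reload without dropping existing connections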

In-depth root cause

Several times a day, POST requests to our API would return a 502 Bad Gateway response, with no backend log of the error. Long ago we added client-side retries to our API libraries to handle these cases, but I decided to finally root-cause this bug once and for all so we didn’t have to keep patching it across libraries.
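(Those retries live in our API client libraries; purely as a hypothetical illustration of the pattern, a shell version of “retry a POST when the load balancer returns a 502” could look like this, with the URL and payload as placeholders.)

for attempt in 1 2 3; do
  status=$(curl -s -o /dev/null -w '%{http_code}' -X POST -d '{}' https://api.example.com/v1/resource)
  [ "$status" != "502" ] && break   # only retry the load balancer 502s
  sleep "$attempt"                  # crude backoff between attempts
done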

In Google Cloud Logging, you can see if you are experiencing these particular 502 responses using an advanced filter:

metadata.serviceName="network.googleapis.com"
httpRequest.status=502
structPayload.statusDetails="backend_connection_closed_before_data_sent_to_client"
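The same filter can also be run from the command line (a sketch, assuming the Cloud SDK is installed and authenticated against the right project):

gcloud logging read 'metadata.serviceName="network.googleapis.com" AND httpRequest.status=502 AND structPayload.statusDetails="backend_connection_closed_before_data_sent_to_client"' --limit 10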

Now it gets tricky. There were no other logs that correlated with this rare error: nothing from nginx itself, nothing from the API app, nothing in syslog or the kernel log. Nothing, except the rare 502s in the load balancer logs.

That looks very much like a problem with the GCP HTTP(S) Load Balancer itself. But, as an ex-Googler who has configured services behind Google load balancers in the past, I knew it was very unlikely that my site was a special snowflake that was hitting some new bug uncaught by the billions of requests that flow through those LBs every day. It was much more likely I had a bad config somewhere.

Digging further, we see more info:

backend_connection_closed_before_data_sent_to_client

Ah ha. So, this was not a problem finding a backend machine (which would have been “failed_to_pick_backend”). The docs have a tiny description of this error:

backend_connection_closed_before_data_sent_to_client: The backend unexpectedly closed its connection to the load balancer before the response was proxied to the client.

- Google Cloud Platform: Setting Up HTTP(S) Load Balancing

Meaning, a TCP connection is established from the load balancer to the GCE instance, but the instance terminates the connection prematurely.

Some things were common to the error:

- It only happened on POST requests, never on GET requests.

- It only happened on our nginx + API services, never on our nginx + static content services.

- It only happened under moderate traffic load, even though the server was not overloaded (still only 2–5% CPU usage and plenty of free memory). This made it feel like a race condition or timeout problem of some sort.

I finally found the right knob to turn. It turns out that there is a race condition between the Google Cloud HTTP(S) Load Balancer and NGINX’s default keep-alive timeout of 65 seconds. The NGINX timeout might be reached at the same time the load balancer tries to re-use the connection for another HTTP request, which breaks the connection and results in a 502 Bad Gateway response from the load balancer.

Let’s dig deeper.

I was actually able to reliably reproduce this error using a simple curl POST load test script, combined with an aggressive nginx timeout: