Intro

At Helpshift, our production engineering team is dedicated to ensuring the uptime and scalability of our cloud infrastructure. We extensively use HAProxy to load balance and scale our TCP and HTTP-based applications.

In this post, we look at how we have used HAProxy to achieve load balancing for Apache Phoenix Query Server with sticky sessions.

What is Apache Phoenix and what is it used for?

Apache Phoenix enables OLTP and operational analytics in Hadoop for low latency applications. It brings the power of standard SQL and JDBC APIs into HBase.

We use Apache Phoenix to power our Analytics APIs. Customers can derive their own product usage insights using this API. These APIs are synchronous and expected to have low latencies. Phoenix helps us run aggregated SQL queries on our massive data sets of tens of millions of rows with a latency of just a few seconds.

To know more about how we engineered our analytics API, you can read here.

Before we define the problem statement, let me first introduce you to the Analytics API architecture at Helpshift.

Following are the components involved:

HBase: a column-oriented NoSQL database. We use it to store our Analytics data.

Phoenix: provides standard SQL and JDBC APIs on top of HBase. It is designed for low latency OLTP workloads.

Phoenix Query Server (PQS): provides an alternative means of interaction with Phoenix and HBase. It is responsible for managing Phoenix connections on behalf of clients.

HAProxy: for load balancing Phoenix Query Servers.

API service: exposes this data to our customers over predefined documented schemas. This utilizes the SQL interface over PQS.

Analytics API Architecture at Helpshift

As you can see in the figure above, the API service makes JDBC requests (SQL queries) over HTTP to PQS. PQS then transforms and optimizes these queries into HBase scans, aggregates the results, and returns them to the API service.

Problem statement

PQS requires clients (in our case, the API service) to have a sticky session with it. This ensures that clients do not spend an excessive amount of time recreating server-side state on PQS.

The PQS documentation mentions:

The Query Server can use off-the-shelf HTTP load balancers such as the Apache HTTP Server, nginx, or HAProxy. The primary requirement of using these load balancers is that the implementation must implement “sticky session” (when a client communicates with a backend server, that client continues to talk to that backend server).

A standard solution to this problem is an architecture like the figure below.

Solution using HAProxy stick tables

HAProxy can share stick tables among multiple nodes using its "peers" feature. In this method, HAProxy stick tables help us maintain stickiness based on a given request attribute, e.g. source IP, an HTTP header, etc.
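For illustration, a shared stick table keyed on source IP could be configured roughly as follows (peer names, addresses, table size, and expiry here are illustrative assumptions, not our production values):

```
peers lb_peers
    peer lb01 10.0.0.1:1024
    peer lb02 10.0.0.2:1024

backend phoenix_backend
    stick-table type ip size 100k expire 30m peers lb_peers
    stick on src
    server phoenix01 10.0.0.10:8765 check
    server phoenix02 10.0.0.11:8765 check
```

The "peers" section replicates stick-table entries between the HAProxy nodes, so each node routes a given source IP to the same backend.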

We started testing our setup with the approach mentioned above. While testing for fault tolerance, we realized that this stick table solution comes with the following caveat.

If any of the PQS backends goes down, its traffic gets redistributed across the available backend servers. But when the node comes back up, the existing traffic is not redistributed; only traffic from new client connections reaches the recovered node. This leaves the load distribution imbalanced.

The issue with stick table solution
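The imbalance is easy to see with a toy model. The sketch below (plain Python, a deliberate simplification of HAProxy's actual behavior; server and client names are made up) assigns sticky clients to three servers, fails one, and shows that it carries none of the existing traffic after recovering:

```python
from collections import Counter

def assign(clients, servers):
    # round-robin initial assignment, a stand-in for HAProxy's balancing
    return {c: servers[i % len(servers)] for i, c in enumerate(clients)}

clients = [f"client{i}" for i in range(90)]
table = assign(clients, ["pqs1", "pqs2", "pqs3"])

# pqs2 goes down: its stick-table entries move to the surviving servers
for c, s in table.items():
    if s == "pqs2":
        table[c] = "pqs1" if hash(c) % 2 else "pqs3"

# pqs2 comes back up, but existing stick-table entries are never rebalanced
counts = Counter(table.values())
assert counts["pqs2"] == 0                    # recovered node sits idle
assert counts["pqs1"] + counts["pqs3"] == 90  # survivors carry everything
```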

To overcome the problem mentioned above, stick table entries have to be purged manually, which might result in application errors (due to server-side state on PQS) as the connections are redistributed.
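For completeness, the manual purge can be done over HAProxy's Runtime API with commands like the ones below (the socket path is an assumption about the local setup):

```
echo "show table phoenix_backend"  | socat stdio /var/run/haproxy.sock
echo "clear table phoenix_backend" | socat stdio /var/run/haproxy.sock
```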

Our solution

To address the problems outlined above, we decided to get rid of the existing HAProxy stick table solution. Instead, we set up a local HAProxy on every API service client node and route requests through it. Using the HDR load balancing algorithm in HAProxy, we ensure that stickiness is maintained per HTTP keepalive connection.

API clients connect to PQS through the HAProxy running locally on each API server. This avoids the caveat of the stick table method described above.

Our solution using HAProxy’s HDR routing algorithm

What is HDR and how does it work?

We use the HDR routing algorithm to achieve stickiness at the TCP connection level. This algorithm selects a backend server based on the value of an HTTP header in each request.

From HAProxy documentation:

HDR: The HTTP header <name> will be looked up in each HTTP request. Just as with the equivalent ACL 'hdr()' function, the header name in parenthesis is not case sensitive. If the header is absent or if it does not contain any value, the round-robin algorithm is applied instead

In our case, the HTTP header is "X-Unique-ID". We configure this header using the "unique-id-format" and "unique-id-header" options in the HAProxy frontend configuration for PQS. We use a combination of client IP and client port as the value of this header.
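To make the header value concrete, the sketch below (plain Python, an approximation based on HAProxy's log-format documentation, where the +X modifier renders fields in hexadecimal) reproduces what a format like "%{+X}o %ci:%cp" yields for a given client:

```python
import socket
import struct

def unique_id(client_ip: str, client_port: int) -> str:
    # %ci with the +X modifier: client IP as 8 uppercase hex digits
    ip_hex = "%08X" % struct.unpack("!I", socket.inet_aton(client_ip))[0]
    # %cp with the +X modifier: client port in uppercase hex
    return f"{ip_hex}:{client_port:X}"

assert unique_id("127.0.0.1", 50000) == "7F000001:C350"
assert unique_id("10.0.0.5", 255) == "0A000005:FF"
```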

HAProxy HDR routing configuration:

frontend phoenix
    bind 127.0.0.1:8888
    unique-id-format %{+X}o\ %ci:%cp
    unique-id-header X-Unique-ID
    option tcp-smart-accept
    option splice-request
    option splice-response
    default_backend phoenix_backend

backend phoenix_backend
    option tcp-smart-connect
    option splice-request
    option splice-response
    balance hdr(X-Unique-ID)
    server phoenix01 10.0.0.10:8765 maxconn 10 weight 1 check
    server phoenix02 10.0.0.11:8765 maxconn 10 weight 1 check

The balancing is done on the value of the "X-Unique-ID" header as traffic passes from the frontend to the backend. The header value (the client IP and client port combination) remains the same for a single HTTP keepalive connection, so HAProxy always sends all HTTP requests of a single keepalive session to the same PQS server.
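Conceptually, header-based balancing behaves like hashing the header value onto the server list, as in the sketch below (plain Python; HAProxy uses its own hash functions, not MD5, so this only models the idea, with the two server names from our configuration):

```python
import hashlib

SERVERS = ["phoenix01", "phoenix02"]

def pick_server(unique_id: str) -> str:
    # hash the X-Unique-ID value and map it onto the server list
    h = int(hashlib.md5(unique_id.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

# the same client IP:port value always lands on the same PQS server
assert pick_server("10.1.2.3:45678") == pick_server("10.1.2.3:45678")
```

Because the mapping depends only on the header value, no shared state between load balancers is needed, which is what lets each API node run its own local HAProxy.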

Conclusion

HAProxy is a versatile piece of software that lets you shape traffic to fit almost any requirement. Our solution efficiently meets the stickiness requirement and provides better fault tolerance for Phoenix, and the same approach can be applied to any generic stickiness problem.