Introduction

My team: the Cloudflare PROTOCOLS team is responsible for termination of HTTP traffic at the edge of the Cloudflare network. We deal with features related to: TCP, QUIC, TLS and Secure Certificate management, HTTP/1 and HTTP/2. Over Q1, we were responsible for implementing the Enhanced HTTP/2 Prioritization product that Cloudflare announced during Speed Week.

This is a very exciting project to be part of, and doubly exciting to see the results of, but during the course of the project, we had a number of interesting realisations about NGINX: the HTTP oriented server onto which Cloudflare currently deploys its software infrastructure. We quickly became certain that our Enhanced HTTP/2 Prioritization project could not achieve even moderate success if the internal workings of NGINX were not changed.

Due to these realisations we embarked upon a number of significant changes to the internal structure of NGINX in parallel to the work on the core prioritization product. This blog post describes the motivation behind the structural changes, how we approached them, and what impact they had. We also identify additional changes that we plan to add to our roadmap, which we hope will improve performance further.

Background

Enhanced HTTP/2 Prioritization aims to do one thing to web traffic flowing between a client and a server: it provides a means to shape the many HTTP/2 streams as they flow from upstream (server or origin side) into a single HTTP/2 connection that flows downstream (client side).

Enhanced HTTP/2 Prioritization allows site owners and the Cloudflare edge systems to dictate the rules about how various objects should combine into the single HTTP/2 connection: whether a particular object should have priority and dominate that connection and reach the client as soon as possible, or whether a group of objects should evenly share the capacity of the connection and put more emphasis on parallelism.

As a result, Enhanced HTTP/2 Prioritization allows site owners to tackle two problems that exist between a client and a server: how to control precedence and ordering of objects, and: how to make the best use of a limited connection resource, which may be constrained by a number of factors such as bandwidth, volume of traffic and CPU workload at the various stages on the path of the connection.

What did we see?

The key to prioritisation is being able to compare two or more HTTP/2 streams in order to determine which one’s frame is to go down the pipe next. The Enhanced HTTP/2 Prioritization project necessarily drew us into the core NGINX codebase, as our intention was to fundamentally alter the way that NGINX compared and queued HTTP/2 data frames as they were written back to the client.

Very early in the analysis phase, as we rummaged through the NGINX internals to survey the site of our proposed features, we noticed a number of shortcomings in the structure of NGINX itself, in particular: how it moved data from upstream (server side) to downstream (client side) and how it temporarily stored (buffered) that data in its various internal stages. The main conclusion of our early analysis of NGINX was that it largely failed to give the stream data frames any 'proximity'. Either streams were processed in the NGINX HTTP/2 layer in isolated succession or frames of different streams spent very little time in the same place: a shared queue for example. The net effect was a reduction in the opportunities for useful comparison.

We coined a new, barely scientific but useful measurement: Potential, to describe how effectively the Enhanced HTTP/2 Prioritization strategies (or even the default NGINX prioritization) can be applied to queued data streams. Potential is not so much a measurement of the effectiveness of prioritization per se, that metric would be left for later on in the project, it is more a measurement of the levels of participation during the application of the algorithm. In simple terms, it considers the number of streams and frames thereof that are included in an iteration of prioritization, with more streams and more frames leading to higher Potential.

What we could see from early on was that by default, NGINX displayed low Potential: rendering prioritization instructions from either the browser, as is the case in the traditional HTTP/2 prioritization model, or from our Enhanced HTTP/2 Prioritization product, fairly useless.

What did we do?

With the goal of improving the specific problems related to Potential, and also improving general throughput of the system, we identified some key pain points in NGINX. These points, which will be described below, have either been worked on and improved as part of our initial release of Enhanced HTTP/2 Prioritization, or have now branched out into meaningful projects of their own that we will put engineering effort into over the course of the next few months.

HTTP/2 frame write queue reclamation

Write queue reclamation was successfully shipped with our release of Enhanced HTTP/2 Prioritization and ironically, it wasn’t a change made to the original NGINX, it was in fact a change made against our Enhanced HTTP/2 Prioritization implementation when we were part way through the project, and it serves as a good example of something one may call: conservation of data, which is a good way to increase Potential.

Similar to the original NGINX, our Enhanced HTTP/2 Prioritization algorithm will place a cohort of HTTP/2 data frames into a write queue as a result of an iteration of the prioritization strategies being applied to them. The contents of the write queue would be destined to be written the downstream TLS layer. Also similar to the original NGINX, the write queue may only be partially written to the TLS layer due to back-pressure from the network connection that has temporarily reached write capacity.

Early on in our project, if the write queue was only partially written to the TLS layer, we would simply leave the frames in the write queue until the backlog was cleared, then we would re-attempt to write that data to the network in a future write iteration, just like the original NGINX.

The original NGINX takes this approach because the write queue is the only place that waiting data frames are stored. However, in our NGINX modified for Enhanced HTTP/2 Prioritization, we have a unique structure that the original NGINX lacks: per-stream data frame queues where we temporarily store data frames before our prioritization algorithms are applied to them.

We came to the realisation that in the event of a partial write, we were able to restore the unwritten frames back into their per-stream queues. If it was the case that a subsequent data cohort arrived behind the partially unwritten one, then the previously unwritten frames could participate in an additional round of prioritization comparisons, thus raising the Potential of our algorithms.

The following diagram illustrates this process:

We were very pleased to ship Enhanced HTTP/2 Prioritization with the reclamation feature included as this single enhancement greatly increased Potential and made up for the fact that we had to withhold the next enhancement for speed week due to its delicacy.

HTTP/2 frame write event re-ordering

In Cloudflare infrastructure, we map the many streams of a single HTTP/2 connection from the eyeball to multiple HTTP/1.1 connections to the upstream Cloudflare control plane.

As a note: it may seem counterintuitive that we downgrade protocols like this, and it may seem doubly counterintuitive when I reveal that we also disable HTTP keepalive on these upstream connections, resulting in only one transaction per connection, however this arrangement offers a number of advantages, particularly in the form of improved CPU workload distribution.

When NGINX monitors its upstream HTTP/1.1 connections for read activity, it may detect readability on many of those connections and process them all in a batch. However, within that batch, each of the upstream connections is processed sequentially, one at a time, from start to finish: from HTTP/1.1 connection read, to framing in the HTTP/2 stream, to HTTP/2 connection write to the TLS layer.

The existing NGINX workflow is illustrated in this diagram:

By committing each streams’ frames to the TLS layer one stream at a time, many frames may pass entirely through the NGINX system before backpressure on the downstream connection allows the queue of frames to build up, providing an opportunity for these frames to be in proximity and allowing prioritization logic to be applied. This negatively impacts Potential and reduces the effectiveness of prioritization.

The Cloudflare Enhanced HTTP/2 Prioritization modified NGINX aims to re-arrange the internal workflow described above into the following model:

Although we continue to frame upstream data into HTTP/2 data frames in the separate iterations for each upstream connection, we no longer commit these frames to a single write queue within each iteration, instead we arrange the frames into the per-stream queues described earlier. We then post a single event to the end of the per-connection iterations, and perform the prioritization, queuing and writing of the HTTP/2 data frames of all streams in that single event.

This single event finds the cohort of data conveniently stored in their respective per-stream queues, all in close proximity, which greatly increases the Potential of the Edge Prioritization algorithms.

In a form closer to actual code, the core of this modification looks a bit like this:

ngx_http_v2_process_data(ngx_http_v2_connection *h2_conn, ngx_http_v2_stream *h2_stream, ngx_buffer *buffer) { while ( ! ngx_buffer_empty(buffer) { ngx_http_v2_frame_data(h2_conn, h2_stream->frames, buffer); } ngx_http_v2_prioritise(h2_conn->queue, h2_stream->frames); ngx_http_v2_write_queue(h2_conn->queue); }

To this:

ngx_http_v2_process_data(ngx_http_v2_connection *h2_conn, ngx_http_v2_stream *h2_stream, ngx_buffer *buffer) { while ( ! ngx_buffer_empty(buffer) { ngx_http_v2_frame_data(h2_conn, h2_stream->frames, buffer); } ngx_list_add(h2_conn->active_streams, h2_stream); ngx_call_once_async(ngx_http_v2_write_streams, h2_conn); }

ngx_http_v2_write_streams(ngx_http_v2_connection *h2_conn) { ngx_http_v2_stream *h2_stream; while ( ! ngx_list_empty(h2_conn->active_streams)) { h2_stream = ngx_list_pop(h2_conn->active_streams); ngx_http_v2_prioritise(h2_conn->queue, h2_stream->frames); } ngx_http_v2_write_queue(h2_conn->queue); }

There is a high level of risk in this modification, for even though it is remarkably small, we are taking the well established and debugged event flow in NGINX and switching it around to a significant degree. Like taking a number of Jenga pieces out of the tower and placing them in another location, we risk: race conditions, event misfires and event blackholes leading to lockups during transaction processing.

Because of this level of risk, we did not release this change in its entirety during Speed Week, but we will continue to test and refine it for future release.

Upstream buffer partial re-use

Nginx has an internal buffer region to store connection data it reads from upstream. To begin with, the entirety of this buffer is Ready for use. When data is read from upstream into the Ready buffer, the part of the buffer that holds the data is passed to the downstream HTTP/2 layer. Since HTTP/2 takes responsibility for that data, that portion of the buffer is marked as: Busy and it will remain Busy for as long as it takes for the HTTP/2 layer to write the data into the TLS layer, which is a process that may take some time (in computer terms!).

During this gulf of time, the upstream layer may continue to read more data into the remaining Ready sections of the buffer and continue to pass that incremental data to the HTTP/2 layer until there are no Ready sections available.

When Busy data is finally finished in the HTTP/2 layer, the buffer space that contained that data is then marked as: Free

The process is illustrated in this diagram:

You may ask: When the leading part of the upstream buffer is marked as Free (in blue in the diagram), even though the trailing part of the upstream buffer is still Busy, can the Free part be re-used for reading more data from upstream?

The answer to that question is: NO

Because just a small part of the buffer is still Busy, NGINX will refuse to allow any of the entire buffer space to be re-used for reads. Only when the entirety of the buffer is Free, can the buffer be returned to the Ready state and used for another iteration of upstream reads. So in summary, data can be read from upstream into Ready space at the tail of the buffer, but not into Free space at the head of the buffer.

This is a shortcoming in NGINX and is clearly undesirable as it interrupts the flow of data into the system. We asked: what if we could cycle through this buffer region and re-use parts at the head as they became Free? We seek to answer that question in the near future by testing the following buffering model in NGINX:

TLS layer Buffering

On a number of occasions in the above text, I have mentioned the TLS layer, and how the HTTP/2 layer writes data into it. In the OSI network model, TLS sits just below the protocol (HTTP/2) layer, and in many consciously designed networking software systems such as NGINX, the software interfaces are separated in a way that mimics this layering.

The NGINX HTTP/2 layer will collect the current cohort of data frames and place them in priority order into an output queue, then submit this queue to the TLS layer. The TLS layer makes use of a per-connection buffer to collect HTTP/2 layer data before performing the actual cryptographic transformations on that data.

The purpose of the buffer is to give the TLS layer a more meaningful quantity of data to encrypt, for if the buffer was too small, or the TLS layer simply relied on the units of data from the HTTP/2 layer, then the overhead of encrypting and transmitting the multitude of small blocks may negatively impact system throughput.

The following diagram illustrates this undersize buffer situation:

If the TLS buffer is too big, then an excessive amount of HTTP/2 data will be committed to encryption and if it failed to write to the network due to backpressure, it would be locked into the TLS layer and not be available to return to the HTTP/2 layer for the reclamation process, thus reducing the effectiveness of reclamation. The following diagram illustrates this oversize buffer situation:

In the coming months, we will embark on a process to attempt to find the ‘goldilocks’ spot for TLS buffering: To size the TLS buffer so it is big enough to maintain efficiency of encryption and network writes, but not so big as to reduce the responsiveness to incomplete network writes and the efficiency of reclamation.

Thank you - Next!

The Enhanced HTTP/2 Prioritization project has the lofty goal of fundamentally re-shaping how we send traffic from the Cloudflare edge to clients, and as results of our testing and feedback from some of our customers shows, we have certainly achieved that! However, one of the most important aspects that we took away from the project was the critical role the internal data flow within our NGINX software infrastructure plays in the outlook of the traffic observed by our end users. We found that changing a few lines of (albeit critical) code, could have significant impacts on the effectiveness and performance of our prioritization algorithms. Another positive outcome is that in addition to improving HTTP/2, we are looking forward to carrying our newfound skills and lessons learned and apply them to HTTP/3 over QUIC.

We are eager to share our modifications to NGINX with the community, so we have opened this ticket, through which we will discuss upstreaming the event re-ordering change and the buffer partial re-use change with the NGINX team.

As Cloudflare continues to grow, our requirements on our software infrastructure also shift. Cloudflare has already moved beyond proxying of HTTP/1 over TCP to support termination and Layer 3 and 4 protection for any UDP and TCP traffic. Now we are moving on to other technologies and protocols such as QUIC and HTTP/3, and full proxying of a wide range of other protocols such as messaging and streaming media.

For these endeavours we are looking at new ways to answer questions on topics such as: scalability, localised performance, wide scale performance, introspection and debuggability, release agility, maintainability.

If you would like to help us answer these questions and know a bit about: hardware and software scalability, network programming, asynchronous event and futures based software design, TCP, TLS, QUIC, HTTP, RPC protocols, Rust or maybe something else?, then have a look here.