by Mark Kalman, Geraint Davies, Michael Hill, and Benjamin Pracht

The Low-Latency Streaming Challenge

Periscope is a live streaming application where viewers respond to broadcasters in real time with comments and “heart” feedback. A major part of the application’s appeal is the interactivity it gives participants. During development, we found that to maintain interactivity, feedback from viewers needed to reach the broadcaster within two to five seconds. This placed a relatively tight constraint on media stream latency, defined as the time between an event being captured on the broadcaster side and when it is rendered for the viewer. The challenge for the Periscope video engineering team was to build a system that meets the latency constraint in a way that cost-effectively scales to large audience sizes.

To stream cost-effectively to a large audience, an application typically uses an HTTP-based streaming protocol that can leverage content delivery networks (CDNs) and commodity HTTP servers. We found that we couldn’t meet our latency constraint using standard HTTP-based streaming servers and clients. With HTTP Live Streaming (HLS) using standard players, for example, we found that we could get latencies down to about six seconds in good conditions, but closer to 10 seconds was more typical.

An Initial Approach

When Periscope publicly launched, we used a hybrid approach. The first 100 viewers of a broadcast would each get a low-latency RTMP stream served over a socket connection from our backend server. After the first 100, viewers would get higher-latency HLS streams. It worked, but there were a few downsides in addition to the higher latency when audience size exceeded our limit. One was that the egress bandwidth from backend servers for the individual RTMP streams became a significant operating cost. Another was that the long-lived individual socket connections to the low-latency portion of the audience made it time-consuming to deploy updates to our servers. A third issue was that while we could choose a streaming server in the datacenter closest to the broadcaster, viewers might be on the other side of the world, connecting over a TCP link with a round-trip time in the hundreds of milliseconds. Because TCP throughput under loss falls as round-trip time grows, packet loss on the viewer’s access network would have a high impact on throughput to the viewer. Finally, we had to support two server-side infrastructures and two players on the clients.
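To put rough numbers on the round-trip-time issue, the sketch below uses the well-known Mathis et al. approximation for steady-state TCP throughput, rate ≈ (MSS / RTT) · (C / √p). The MSS, RTT, and loss values are illustrative assumptions, not measurements from the Periscope system.

```python
import math

def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state TCP throughput in bits/s (Mathis et al. model)."""
    c = math.sqrt(3 / 2)  # ~1.22, the model's constant without delayed ACKs
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))

# Assumed values: 1460-byte MSS, 1% packet loss, nearby vs. far-away viewer.
near = mathis_throughput_bps(1460, rtt_s=0.030, loss_rate=0.01)
far = mathis_throughput_bps(1460, rtt_s=0.300, loss_rate=0.01)
print(f"30 ms RTT:  {near / 1e6:.2f} Mbit/s")
print(f"300 ms RTT: {far / 1e6:.2f} Mbit/s")
```

Under this model, the same loss rate costs the distant viewer ten times the throughput, because the achievable rate is inversely proportional to round-trip time.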

Our LHLS Solution

To improve upon our hybrid approach, we’ve moved to a new scheme we call Low-Latency HLS (LHLS), which gives us the low-latency performance of individual socket-connected streams and allows us to leverage commodity HTTP CDN infrastructure. There are two fundamental mechanisms that distinguish our scheme from standard HLS. One is that segments are delivered using HTTP/1.1 Chunked Transfer Coding. The other is that segments are advertised in the live HLS playlist before they are actually available.

The benefit of using chunked transfer coding is that it eliminates the segmentation delay normally associated with HTTP-based streaming. In HLS live streaming, for instance, the succession of media frames arriving from the broadcaster is normally aggregated into TS segments that are each a few seconds long. Only when a segment is complete can a URL for the segment be added to a live media playlist. The latency issue is that by the time a segment is completed, the first frame in the segment is as old as the segment duration. While the “segmentation latency” can be reduced by reducing the size of the segment, this would lower video coding efficiency, assuming each segment starts with an I-frame, because I-frames are typically many times larger than predicted frames. By using chunked transfer coding, on the other hand, the client can request the yet-to-be completed segment and begin receiving the segment’s frames as soon as the server receives them from the broadcaster.
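As a minimal sketch of the wire format involved, the function below frames a stream of payloads as HTTP/1.1 chunks: each chunk is its size in hexadecimal, CRLF, the data, and CRLF, and a zero-size chunk ends the body. The function name and the sample payloads are hypothetical; a real server would normally rely on its HTTP library for this framing.

```python
from typing import Iterable, Iterator

def chunked_encode(payloads: Iterable[bytes]) -> Iterator[bytes]:
    """Frame each payload as an HTTP/1.1 chunk as soon as it is available."""
    for data in payloads:
        if not data:
            continue  # a zero-length chunk would prematurely end the body
        yield b"%x\r\n" % len(data) + data + b"\r\n"
    yield b"0\r\n\r\n"  # last-chunk followed by an empty trailer section

# Hypothetical usage: two batches of media bytes arriving from a broadcaster.
batches = [b"\x47" * 188, b"\x47" * 376]
wire_bytes = b"".join(chunked_encode(batches))
```

Because the total length is never declared up front, the server can start the response before the segment is complete and keep appending chunks as frames arrive.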

In LHLS, we add segment URLs to the playlist several segment periods early. For example, when a stream is starting and the first frame of the stream arrives at the server from the broadcaster, the server will immediately publish an HLS media playlist containing, say, three segments. When clients receive the playlist, they request all three. The server responds to each request with a chunked transfer coded response. The request for the first segment will initially get whatever media has accumulated in the segment by the time the request arrives, but will then get chunks of media as they arrive at the server from the broadcaster for the remaining duration of the segment. Meanwhile, the request for the second segment initially receives only MPEG Transport Stream (TS) segment header data, then receives nothing while the first segment completes, then starts receiving data in real time as the server receives it from the broadcaster. Similarly, the request for the third segment receives just TS header data until the second segment is completed.

The benefit of advertising segments in the playlist well before they are available is that it eliminates the playlist latency caused by client playlist polling frequency and playlist TTL in CDN caches. Since segments are advertised several seconds before they actually contain media, it doesn’t impact latency if playlist TTLs of, say, one second are compounded by a layer of shield caching in the CDN. Clients learn about upcoming segments several seconds early, request them, and receive frames of media as soon as they are available at the server.
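The early-advertising idea can be sketched as a playlist builder that lists segments up to a fixed lookahead beyond the one currently being filled. The function name, the segment naming scheme, and the use of the target duration as a nominal EXTINF value are all assumptions for illustration; the actual LHLS playlist layout is not specified here.

```python
def build_live_playlist(first_seq: int, live_seq: int, lookahead: int,
                        target_duration_s: int = 2) -> str:
    """List segments first_seq..live_seq plus `lookahead` not-yet-started ones."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target_duration_s}",
        f"#EXT-X-MEDIA-SEQUENCE:{first_seq}",
    ]
    for seq in range(first_seq, live_seq + lookahead + 1):
        # Real durations are unknown for future segments, so a nominal
        # value is written here (assumption).
        lines.append(f"#EXTINF:{target_duration_s:.3f},")
        lines.append(f"segment_{seq}.ts")  # hypothetical URI scheme
    return "\n".join(lines) + "\n"

# Stream start: segment 0 is being filled, segments 1 and 2 are advertised early.
playlist = build_live_playlist(first_seq=0, live_seq=0, lookahead=2)
print(playlist)
```

With a lookahead of two segments and a 2-second target duration, a client holding a playlist that is a few seconds stale still knows the URL of the segment that is live right now.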

Figure 1 — A plot comparing video frames received at the playback client over time for LHLS and HLS. In this example, segment duration is fixed at 2 seconds for both LHLS and HLS. For LHLS, frames arrive continuously because responses are chunked transfer coded and because segments appear in the media playlist well before they are “live”. For HLS, in contrast, the client can only request segments after they are completed at the server and after the client receives the updated playlist. Frame arrival is therefore bursty, with typically no frames added to the buffer in between periodic segment requests.

Figure 2 — A plot comparing playback buffer fullness for LHLS and HLS for the example video arrival traces used in Figure 1. In this example, playback begins after 4 seconds of media are buffered for both LHLS and HLS. For HLS, the buffer empties steadily between each bursty arrival of frames. For LHLS, frame arrival is more continuous and the buffer state is more constant. The playback buffer occupancy dips to 2.75 s during a bandwidth fluctuation. In comparison, HLS buffer occupancy frequently dips below 1 s.

LHLS in Operation

When we initially experimented with LHLS, we were happy to see that our CDN vendor supported chunked transfer coding and that the behavior on a cache miss was as we had hoped. When a client requests a TS segment that is not in the edge cache, the request is backended via a shield cache to our server which responds with a chunked transfer. During the chunked response, which can last seconds, chunks are forwarded (streamed, if you will) to the client through the shield and edge caches. Meanwhile, requests for the same segment coming from other clients get a cache hit even while the original response is ongoing. They immediately receive whatever data has already been transferred and then receive subsequent chunks of the response as they arrive from the backend server for the original request. As an aside, the shield cache between the edge and our backend server is for request coalescing, making it so that concurrent requests coming from different edge PoPs don’t cause the same object to be backended from our server multiple times.
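Request coalescing, as performed by the shield cache, can be sketched with asyncio: concurrent requests for the same key share one in-flight origin fetch. This is an illustrative model, not the CDN's implementation. The class and function names are hypothetical, and unlike the real caches it hands out only complete bodies rather than forwarding partial chunks.

```python
import asyncio

class CoalescingCache:
    """Collapse concurrent requests for one key into a single origin fetch."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._tasks = {}  # key -> asyncio.Task, in flight or already finished

    async def get(self, key: str) -> bytes:
        task = self._tasks.get(key)
        if task is None:
            # First requester triggers the origin fetch; later ones await it.
            task = asyncio.create_task(self._fetch(key))
            self._tasks[key] = task
        return await task

async def demo():
    origin_calls = 0

    async def origin(key: str) -> bytes:
        nonlocal origin_calls
        origin_calls += 1
        await asyncio.sleep(0.01)  # stand-in for a slow backend response
        return key.encode()

    cache = CoalescingCache(origin)
    bodies = await asyncio.gather(*(cache.get("segment_1.ts") for _ in range(5)))
    return origin_calls, bodies

origin_calls, bodies = asyncio.run(demo())
print(origin_calls, len(bodies))
```

Five concurrent requests reach the cache, but the origin function runs once, which is the property that keeps each segment from being fetched from the backend multiple times.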

We’ve been pleased with the performance of our LHLS protocol and architecture. Metrics like playback stall rate as a function of end-to-end latency and time to first frame have been favorable compared to our old RTMP architecture. Even with the shielding layer in the CDN, the delay between frame capture at the broadcaster and arrival at the client is consistently in the low hundreds of milliseconds. At the same time, egress bandwidth from our servers, formerly our largest operational expense, has been cut down to where it is similar to our ingress bandwidth. This is because each segment is only backended to the CDN once.

We’ve also reduced the cost of serving broadcast “replays”. When a live broadcast is complete, we make it available as video on demand (VOD) using HLS. Before, for streams with fewer than 100 live viewers, none of the stream’s HLS segments would be cached in the CDN and they’d all have to be backended for replay viewers. This incurred additional egress bandwidth and storage access expense. Now that segments are always cached during live viewing, segment requests from replay viewers are often served from CDN caches and don’t need to be backended.

LHLS has also allowed us to eliminate the “statefulness” of our viewing service. With RTMP, we needed to serve outgoing streams from the same server that was receiving the ingest from the broadcaster. Now, as frames of media arrive at the ingest server, we package them into TS packets and store the packets in a Redis cache. We then serve LHLS requests from a stateless service that accesses the TS segment from Redis.
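The separation between ingest and serving can be sketched as follows. The `SegmentStore` class is an in-memory stand-in for a shared cache such as Redis, and every name in it is hypothetical; the point is only that the serving function holds no state beyond a read offset, so any server instance can handle any request.

```python
from collections import defaultdict

class SegmentStore:
    """In-memory stand-in for a shared cache such as Redis (assumption)."""

    def __init__(self):
        self._packets = defaultdict(list)
        self._closed = set()

    def append(self, segment: str, packet: bytes) -> None:
        self._packets[segment].append(packet)  # called by the ingest server

    def close(self, segment: str) -> None:
        self._closed.add(segment)  # ingest marks the segment complete

    def read_from(self, segment: str, offset: int):
        """Return (packets at or after offset, whether the segment is complete)."""
        return self._packets[segment][offset:], segment in self._closed

def serve_segment(store: SegmentStore, segment: str, wait):
    """Yield a segment's packets in order until the segment is complete."""
    offset = 0
    while True:
        packets, done = store.read_from(segment, offset)
        yield from packets
        offset += len(packets)
        if not packets:
            if done:
                return
            wait()  # e.g. a short sleep or blocking read in a real service
```

Each yielded packet would be written to the client as one part of a chunked response; the `wait` callback is where a real service would pause until more data arrives.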

But What About…

What about the impact on CDN capacity of all those open HTTP requests lasting several seconds? So far it hasn’t been an issue. We are planning to experiment with HTTP/2, which should reduce the number of open socket connections because a client’s requests would all be multiplexed onto the same socket.

What about the multiplexing inefficiency of TS streams? Initially, efficiency was lousy at low bit rates because of the way we were packing frames into packetized elementary streams (PES). After improving PES packing, however, TS muxing overhead has dropped to ~5% for our lowest bit-rate streams. Note also that LHLS would work equally well for fragmented MP4.
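A simplified calculation shows why PES packing matters at low bit rates. TS packets are a fixed 188 bytes with a 4-byte header, and each PES packet starts in a fresh TS packet, so a small PES packet leaves most of its last TS packet as stuffing. The frame size, the 14-byte PES header, and the choice of grouping several frames per PES packet are assumptions for illustration (the actual packing change is not described here), and the model ignores PAT/PMT tables and PCR fields.

```python
import math

TS_PACKET = 188  # bytes per MPEG-TS packet
TS_HEADER = 4    # fixed header bytes in each TS packet
PES_HEADER = 14  # PES header carrying a PTS only (assumed)

def ts_overhead(frame_bytes: int, frames_per_pes: int) -> float:
    """Fraction of transmitted bytes that is not elementary-stream payload."""
    payload = frame_bytes * frames_per_pes
    pes_bytes = PES_HEADER + payload
    # The final TS packet of each PES packet is padded out with stuffing.
    ts_packets = math.ceil(pes_bytes / (TS_PACKET - TS_HEADER))
    return 1 - payload / (ts_packets * TS_PACKET)

# Assumed 100-byte frames, as for low bit-rate audio.
for n in (1, 4, 10):
    print(f"{n:2d} frame(s) per PES: {ts_overhead(100, n):.1%} overhead")
```

In this model, one small frame per PES packet wastes nearly half of every TS packet, while grouping frames amortizes the headers and stuffing across more payload.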

What about segment duration and discontinuity tags in the HLS playlist? If you add segments to the playlist before they’ve even started, how can you know their duration or whether there was a discontinuity? The answer is that we’ve implemented our own players that don’t rely on discontinuity and segment duration tags in the playlist. They identify timestamp discontinuities with heuristics and identify changes in elementary streams by comparing elementary stream parameter sets. Note, though, that to support standard players, we also create HLS-compliant playlists that refer to the same TS segments as our LHLS playlists. These higher-latency playlists don’t advertise segments until they are completed. We can therefore accurately provide discontinuity and segment duration tags for the players that rely on them.
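One possible timestamp heuristic, offered only as an assumption since the players' actual heuristics are not described, is to flag any step between consecutive timestamps that goes backwards or exceeds a threshold. MPEG-TS timestamps use a 90 kHz clock and wrap at 33 bits, so the difference is taken modulo 2^33. The function name and the one-second threshold are hypothetical.

```python
PTS_CLOCK_HZ = 90_000  # MPEG-TS presentation timestamps tick at 90 kHz
PTS_WRAP = 1 << 33     # timestamps are 33-bit values and wrap around

def find_discontinuities(timestamps, max_gap_s: float = 1.0):
    """Return indices whose timestamp jumps back or ahead by more than max_gap_s.

    Assumes timestamps are in presentation order (or are DTS values), so
    normal playback produces small forward steps.
    """
    max_gap = int(max_gap_s * PTS_CLOCK_HZ)
    breaks = []
    for i in range(1, len(timestamps)):
        # The modular difference makes a normal 33-bit wraparound look like a
        # small forward step, while a backward jump becomes a huge value.
        delta = (timestamps[i] - timestamps[i - 1]) % PTS_WRAP
        if delta > max_gap:
            breaks.append(i)
    return breaks

# 30 fps frames are 3000 ticks apart; the fourth timestamp jumps ~10 s ahead.
print(find_discontinuities([0, 3000, 6000, 900_000, 903_000]))
```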

What about variant selection for adaptive bit-rate content? Don’t streamed responses complicate bandwidth estimation? They do, yes. Normally a client can estimate bandwidth by observing how long it takes to receive segments with known sizes. With responses streamed in real time, it’s harder for a client to observe bandwidths higher than the rate of the current variant. Estimating how much more bandwidth is available is important, however, for safely switching to a higher quality stream. The players we’ve implemented rely on indirect measurements to infer whether additional bandwidth is available. For standard HLS clients, the higher-latency playlists that we provide ensure that segments aren’t streamed in real time but are complete before they are requested.
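The estimation problem can be illustrated with the conventional size-over-time estimator. In the sketch below, a response streamed in real time cannot finish before the segment itself does, so the estimate is capped near the variant's bit rate no matter how fast the link is. The function names and numbers are assumptions for illustration, and the sketch does not show the indirect measurements the LHLS players use.

```python
def estimate_bandwidth_bps(segment_bytes: int, download_s: float) -> float:
    """Conventional estimate: bits received divided by time taken."""
    return segment_bytes * 8 / download_s

def download_time_s(segment_bytes: int, duration_s: float,
                    link_bps: float, streamed_live: bool) -> float:
    transfer_s = segment_bytes * 8 / link_bps
    # A live-streamed response lasts at least as long as the segment.
    return max(transfer_s, duration_s) if streamed_live else transfer_s

# Assumed: a 2 s segment of a 1 Mbit/s variant over an 8 Mbit/s link.
size = 250_000  # bytes
for live in (False, True):
    t = download_time_s(size, duration_s=2.0, link_bps=8e6, streamed_live=live)
    label = "streamed live" if live else "complete segment"
    print(f"{label}: {estimate_bandwidth_bps(size, t) / 1e6:.1f} Mbit/s estimated")
```

For the completed segment, the estimate recovers the link rate; for the live-streamed one, it only reflects the rate at which the broadcaster produces media.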

Conclusion

The problem of streaming to large audiences in a way that meets our latency requirements didn’t have a standard solution. Protocols like RTMP, which require a stateful server connection, would meet our latency constraint but were too expensive to scale. Existing HTTP-based protocols would scale but didn’t meet our latency constraint. With LHLS, we found a way to meet our latency constraint and achieve scale. Since deployment, LHLS has proven itself in practice, reducing our operational costs while achieving performance metrics that are favorable compared to our former RTMP solution.