Seeking 100GB of Video: How We Made Sightline

By Kevin

Nest Cam can record up to 30 days of video (with Nest Aware subscription), which is great for capturing everything that happened, but with all that content, it can be difficult for our brains to figure out which moments are worth our attention. The volume simply isn’t digestible.

To solve this problem, in September 2016, Nest launched Sightline, a new interface to Nest Cam on iOS and Android. Sightline introduced several new features, but the biggest breakthrough was the ability to quickly rewind and fast forward through all recorded video simply by scrolling on the vertical timeline.

When the design team first demoed the concept, everyone got really excited. It felt like your thumb was connected to the video as you smoothly moved through time — as opposed to the previous experience of guessing where you move a slider and then waiting several seconds for buffering before seeing if you were close to what you wanted to find.

Early drawing

But was it even technically possible? How can you smoothly scroll through up to 30 days of video history stored in the cloud, which can be over 100GB of data, from a phone?

It was clear that we would need to subsample the video, but would that be enough? Even if we could make it 1000 times smaller, it could still be on the order of hundreds of megabytes. Just getting that much data to a phone is problematic. After all, phones have limited disk space, flaky internet connections, and capped cellular data plans.

We formed a small engineering team to explore the problem, and we prototyped the kernel of a solution: extract thumbnail frames from the HD video stream and transcode them into condensed video chunks that can be downloaded in batch and aggressively cached. From there, we could tune the various parameters controlling the stream to achieve the performance characteristics necessary. We call this condensed thumbnail video stream the Scrubby Stream.

Our services process a lot of new video every day, even more than YouTube does. That makes transcoding and serving the video difficult to do in a low-latency and cost-effective way. But that’s another article. For now, we’ll focus on the tradeoffs we made in order to dial in the Scrubby Stream to get the Sightline experience just right.

Paper prototypes from design sprint

Frame density vs scroll distance

There are two parameters that primarily control the user experience. One is the frame density, which is the number of frames displayed during a period of time. The other is the scroll distance, which is the physical distance the user must scroll on the phone to move through a period of time.

The frame density and scroll distance complement each other. If scrolling one pixel seeks by one second, then having more than one frame per second would be wasteful.

Frame density and scroll distance also have a widespread impact on the overall system design and UI. The frame density affects the design of the transcoding service, including the impact it will have on other services. The scroll distance controls how much real estate is on the screen for other UI elements, and how quickly we expect the device to be consuming video from the service.

Through prototyping and testing, we settled on a frame density of approximately one frame per minute and a scroll distance of 3pt per minute. This gives a nice balance between showing enough detail about what happened throughout a day of video, while also making it easy to scroll back several days quickly.
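To make the relationship between the two parameters concrete, here is a minimal sketch of how a scroll offset could map to a point in time and to a thumbnail frame. The function names are hypothetical and the constants are just the approximate values quoted above; the real app may compute this quite differently.

```python
# Values taken from the article; treated here as exact for illustration.
FRAMES_PER_MINUTE = 1.0   # approximate frame density
POINTS_PER_MINUTE = 3.0   # scroll distance


def offset_to_minutes(scroll_offset_pt: float) -> float:
    """Minutes back in time represented by a scroll offset in points."""
    return scroll_offset_pt / POINTS_PER_MINUTE


def offset_to_frame_index(scroll_offset_pt: float) -> int:
    """Index of the thumbnail frame to show, counting back from the start."""
    return int(offset_to_minutes(scroll_offset_pt) * FRAMES_PER_MINUTE)


def scroll_length_for_days(days: float) -> float:
    """Total scrollable length, in points, needed to cover `days` of video."""
    return days * 24 * 60 * POINTS_PER_MINUTE
```

With these numbers, a full day of video occupies 4320pt of scroll distance and about 1440 thumbnail frames, and the frame on screen changes every 3pt of scrolling.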

With these essentially fixed in place, we began thinking about how best to encode the Scrubby Stream.

Seek speed vs file size

A key aspect of the Sightline experience is being able to seek through the Scrubby video stream with very low latency. This meant that all the Scrubby data would have to be cached locally on the device, but there was still more work to do to get great seek performance on the wide range of devices we support. To do that, we had to look at how H.264 is decoded.

One way that H.264 achieves high compression is by grouping consecutive frames into a Group of Pictures (GOP). The first frame in the GOP contains the entire image, and the rest just contain delta information, making them much smaller. To decode a particular frame, you must generally decode all prior frames in the GOP.

Therefore, we can improve the seek performance by shortening the GOP length, since it decreases the decode time. However, the shorter the GOP, the more GOPs are required to cover the same number of video frames. And since each GOP has one frame that contains the entire image, the overall file size goes up.

After a lot of testing on an array of devices, we found that 20-frame GOPs made the Sightline scroll experience very responsive. Luckily, by carefully tuning other encoder parameters, we were also able to rein in the file size.
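The tradeoff can be sketched with a little arithmetic. The helpers below are hypothetical and assume the simplest possible model: every GOP opens with a full frame, and each later frame depends on all earlier frames in the same GOP. Real H.264 streams can have more complicated frame dependencies, so treat this as a rough illustration only.

```python
def frames_to_decode(target_frame: int, gop_length: int) -> int:
    """Frames the decoder must process to display `target_frame`.

    Decoding starts at the first frame of the GOP that contains the
    target and runs forward until the target itself is decoded.
    """
    return target_frame % gop_length + 1


def average_decode_cost(gop_length: int) -> float:
    """Average frames decoded per seek when seek targets are uniform."""
    return (gop_length + 1) / 2


def full_frames_needed(total_frames: int, gop_length: int) -> int:
    """Number of GOPs (and therefore full frames) covering the stream."""
    return -(-total_frames // gop_length)  # ceiling division
```

Under this model, a 20-frame GOP means a seek decodes 10.5 frames on average and at most 20, while a day of roughly 1440 frames carries 72 full frames. Halving the GOP length would roughly halve the decode work per seek but double the number of full frames.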

Early prototype

Quality vs file size

The overall file size of the Scrubby Stream is extremely important. It impacts the download speed, the disk space consumed by the cache, and the network usage. Both the frame density and the GOP size affect the file size, but so does the quality.

With the x264 encoder, the two primary controls over the quality are the resolution and the Constant Rate Factor (CRF). Since the Scrubby Stream is just used for thumbnails for the HD stream, we can carefully dial down the quality to dramatically reduce the file size. This makes the download speed very fast, even on slow networks, and reduces the amount of disk space consumed by the app.
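As an illustration of where these knobs live, here is a sketch that builds an ffmpeg command line using the libx264 encoder. This is an assumption on our part for the sake of the example: the article does not describe the actual transcoding pipeline, and the resolution and CRF defaults below are placeholder values, not the ones used in production. Only the 20-frame GOP length comes from the text above.

```python
def build_scrubby_encode_args(
    src: str,
    dst: str,
    width: int = 320,      # placeholder thumbnail width (assumption)
    height: int = 180,     # placeholder thumbnail height (assumption)
    crf: int = 30,         # placeholder CRF; higher means lower quality
    gop_length: int = 20,  # GOP length mentioned in the article
) -> list[str]:
    """Return an ffmpeg argument list for a small, low-quality H.264 stream."""
    return [
        "ffmpeg",
        "-i", src,
        "-an",                              # drop audio; thumbnails only
        "-vf", f"scale={width}:{height}",   # reduce the resolution
        "-c:v", "libx264",
        "-crf", str(crf),                   # Constant Rate Factor
        "-g", str(gop_length),              # maximum keyframe interval
        dst,
    ]
```

The resulting list can be passed to `subprocess.run` on a machine with ffmpeg installed. Raising the CRF or shrinking the resolution both reduce the file size, at the cost of blurrier thumbnails.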

The average size for a full day of Scrubby data is just 5 MB. And to be extra considerate to users on capped mobile data plans, who may even have multiple cameras or occasionally need to download multiple days' worth of Scrubby data, we encode an additional lower-quality stream that averages a mere 1 MB per day.
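To put those averages in perspective, here is a quick back-of-the-envelope calculation. It assumes decimal megabytes and exactly one frame per minute, both of which are simplifications, and the helper names are hypothetical.

```python
FRAMES_PER_DAY = 24 * 60      # one frame per minute, approximately
BYTES_PER_MB = 1_000_000      # assumption: decimal megabytes


def history_size_mb(days: int, mb_per_day: float) -> float:
    """Total Scrubby data for `days` of history at a given daily average."""
    return days * mb_per_day


def average_bytes_per_frame(mb_per_day: float) -> float:
    """Average encoded size of one thumbnail frame."""
    return mb_per_day * BYTES_PER_MB / FRAMES_PER_DAY
```

At 5 MB per day, a full 30 days of history comes to about 150 MB, or about 30 MB with the lower-quality stream, and the average frame in the higher-quality stream is only around 3.5 KB.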

We also take special care to avoid ever unnecessarily re-downloading Scrubby data. Each individual GOP in the Scrubby Stream is cached to disk, so even if the connection dies partway through the download, the app never needs to download the same data twice. One exception, however, is that any lower-quality Scrubby data downloaded while on a cellular connection will automatically be replaced with high-quality data next time the user is on WiFi.
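The caching rules described above can be sketched as a small bookkeeping class. Everything here is hypothetical: the names, the two-level quality enum, and the in-memory dictionary that stands in for the on-disk cache. It only illustrates the decision of which GOPs still need to be fetched.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Quality(IntEnum):
    LOW = 0
    HIGH = 1


@dataclass
class GopCache:
    # Maps a GOP identifier to the quality of the copy currently stored.
    stored: dict = field(default_factory=dict)

    def put(self, gop_id: int, quality: Quality) -> None:
        """Record a downloaded GOP, never replacing it with a worse copy."""
        current = self.stored.get(gop_id)
        if current is None or quality > current:
            self.stored[gop_id] = quality

    def gops_to_fetch(self, wanted_ids, on_wifi: bool):
        """Return (gop_id, quality) pairs that still need downloading."""
        target = Quality.HIGH if on_wifi else Quality.LOW
        pending = []
        for gop_id in wanted_ids:
            current = self.stored.get(gop_id)
            if current is None:
                pending.append((gop_id, target))       # never downloaded
            elif on_wifi and current < Quality.HIGH:
                pending.append((gop_id, Quality.HIGH))  # upgrade on WiFi
        return pending
```

Because each GOP is tracked individually, an interrupted download only leaves the missing GOPs in the pending list, and GOPs already stored at high quality are never requested again.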

Frame density vs motion events

Some readers may note that a frame density of one per minute may not be enough to capture short motion events. Indeed, if there is motion in the scene that only lasts for, say, 10 seconds, naively sampling one frame every 60 seconds will usually miss it.

Luckily, for subscribers to Nest Aware, we already perform computer vision analysis on all recorded video to help users identify the most interesting events, like when there is motion or a person in the scene. This is the basis for our intelligent alerts.

Now we also use this information to pick out the best frames to include in the Scrubby Stream. So while the frame density averages one per minute, it actually varies slightly because every frame is chosen by considering all of the interesting activity that occurred over the time period. This way we can capture even the short-lived events with a thumbnail in the Scrubby Stream.
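A simplified version of this idea looks like the sketch below. It assumes each candidate frame arrives with a timestamp in seconds and a single activity score from the vision analysis, and it picks the highest-scoring frame in each fixed 60-second bucket. The scoring input and the fixed buckets are assumptions for illustration; the actual selection logic is more flexible than this.

```python
def pick_thumbnails(frames, bucket_seconds: int = 60):
    """Choose one frame per time bucket, preferring the most activity.

    `frames` is an iterable of (timestamp_seconds, activity_score) pairs.
    Ties are resolved in favor of the earliest frame in the bucket.
    Returns the chosen timestamps in chronological order.
    """
    best = {}
    for timestamp, score in frames:
        bucket = int(timestamp // bucket_seconds)
        if bucket not in best or score > best[bucket][1]:
            best[bucket] = (timestamp, score)
    return [best[bucket][0] for bucket in sorted(best)]
```

For example, if the only motion in a given minute happens around the 25-second mark, the frame at 25 seconds is chosen instead of the frame at the start of the minute, so the event still shows up in the timeline.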

Sightline 1.0

Wrapping up

With Sightline, we managed to bring an idea to life that we didn’t think would be possible at first. It took a lot of talented individuals from several teams working together to launch this to all our users on both mobile platforms. It’s now fun and easy to flick through the video captured with Nest Cam and see all the interesting activity that happened during the last 10 or 30 days of cloud recording, and we’re very happy with the early feedback we’ve received. And while Sightline is a great new experience, it’s really just the start of many new features we have planned for Nest Cam.


The information contained in this blog is provided only as general information for educational purposes, and may or may not be up to date. The information is provided as-is with no warranties. This blog is not intended to be a factual representation of how Nest’s products and services actually work. No license is granted under any intellectual property rights of Nest, Google, or others.