Wed 24 January 2018

We have a new app out on the Mac App Store. It's called RetroClip, and it makes taking instant replay video captures of your Mac's screen as easy as taking a screenshot. There is a browser-based demo you can try if you're on a Mac or a PC (and you really should try it because it's super cool).

I got the idea to write RetroClip after playing the game Fortnite Battle Royale and winning, and then having nothing to show for it besides a static screenshot. Current generation video game consoles all have a feature where you can press a button and capture the last minute or so of gameplay and I wanted this for my Mac. The key idea is that you don't know when you want to save a video clip until after something exciting has happened, so it needs to work retroactively — it's no good if you have to press a button to start the recording because it's already too late.

Unlike other screen capture utilities for macOS, RetroClip does not save to an ever-growing scratch file on disk as it is recording. Instead, it saves video frames to a fixed-size circular buffer in main memory. The neat thing is, we don't even need that much memory to do a good job. We can fit 30 seconds of HD video in around 40 MB of memory. Once you see something you want to save, just press the RetroClip keyboard shortcut, and RetroClip writes the memory region with the video data out to disk, essentially instantaneously.
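To put the memory figure in perspective, here is a back-of-the-envelope calculation of the average bitrate it implies. This is only a sketch: it reads "around 40MB" as 40 decimal megabytes and assumes a 60 frames per second capture rate, neither of which is stated as an exact setting of the app.

```python
# Rough check: what average bitrate fits 30 seconds of video in about 40 MB?
BUFFER_BYTES = 40 * 1000 * 1000  # "around 40MB", taken as decimal megabytes (assumption)
CLIP_SECONDS = 30                # clip length quoted in the text
FPS = 60                         # assumed capture rate

bytes_per_second = BUFFER_BYTES / CLIP_SECONDS
megabits_per_second = bytes_per_second * 8 / 1_000_000
avg_frame_bytes = BUFFER_BYTES / (CLIP_SECONDS * FPS)

print(f"{megabits_per_second:.1f} Mbit/s average video bitrate")
print(f"{avg_frame_bytes / 1000:.1f} kB per encoded frame on average")
```

Under those assumptions the buffer corresponds to an average of roughly 10.7 Mbit/s, or about 22 kB per encoded frame.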

Anyway, you should download it and give it a try. Even if you don't use it for games, you may find it useful for capturing those pesky bugs that you can't seem to reproduce a second time. Please let us know what you think, either by email or on Twitter.

I've been code golfing RetroClip in my spare time for the past couple of months. It actually only took me a few hours to get an initial version of the app up and running, but eking out the maximum performance became something of an obsession. It's not just for bragging rights, though; the idea is to be able to leave RetroClip running all the time, even while playing games, watching videos, compiling code, writing blog posts, and so on. The more efficient RetroClip can be, the more is left over for doing important things, and for saving battery life.

If you're interested in the inner workings of RetroClip, read on because this section is for you. If not, that's fine, there's no reason to know this stuff anyway.

Screen capture in macOS inevitably starts with the Window Server. The Window Server is the process in macOS responsible for keeping track of windows and their contents and for compositing them together to make the image you see on your display. As an application, we can ask the Window Server to send us the screen contents as they're displayed via the CGDisplayStream APIs. When we do this, the Window Server will feed us, via IPC, IOSurfaces, which are pointers to shared graphics texture memory, as fast as it can create them or as fast as we can consume them, whichever is slower.

Once we get the pointers to the texture data, we can turn around and feed them, via the VideoToolbox APIs, into the hardware H.264 video encoder that is present in most Macs.

Everything described so far works great, and it is a highly optimized area of macOS as it's used in important built-in features such as AirPlay Mirroring. For convenience, it's even wrapped up in a higher-level AVFoundation API that works pretty well (and this is either the basis of QuickTime Player's screen recording functionality or their implementations are substantially similar).

With a bit of effort, however, it is possible to optimize things further and save some CPU time along the way. We can't get around asking the Window Server to do some work on our behalf to get the display data in the first place, and we can't really get around having to H.264 encode the images. But, there is still a place to save time: the mouse cursor.

The mouse cursor is special. It lives in an overlay that is composited on top of everything else by the GPU at the last moment before an image is shown on screen. This actually makes a difference in how responsive your computer feels. If the Window Server had to re-composite the entire scene every time you moved the mouse, with the cursor locked to vsync, you would subtly feel it as lag. You may have experienced this effect when playing some full screen video games that attempt to implement their own cursor rendering. You can also notice this effect when turning on AirPlay Mirroring, as this apparently forces the Window Server into a mode where it will composite the cursor along with everything else, as opposed to using the hardware cursor overlay.

So going back to the CGDisplayStream API, notice that we can ask the Window Server to composite the cursor for us. This works just fine and produces the correct picture, but disables the hardware cursor overlay, which will slowly drive you crazy. So, to avoid a slow descent into madness, the best approach is to instruct the Window Server to not include the mouse cursor in the display stream and just composite it myself on my own time. And, in fact, this is the approach that QuickTime Player and AVCaptureScreenInput also employ. However, I know a way to do it faster.

My trick is to observe that the IOSurfaces sent over to my application from the Window Server are writeable. The Window Server apparently rotates through a small handful of them, so I can't hang on to them for more than a few frames or I risk a backlog, but what I can do is quickly modify them, and this is a lot faster than trying to copy them. So, I do a quick blit to a small scratch texture to save aside the pixels in the area where the cursor is, then I composite the cursor into the original full screen texture I got from the Window Server. If the mouse cursor moves without a screen update (which can and should happen when the Window Server is using the hardware cursor overlay), then I can use that scratch texture I saved aside to undo the cursor compositing I did earlier, and then I have a clean canvas on which to re-composite the cursor at its new location.
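The save, composite, and undo sequence can be sketched in a few lines. This is a toy model rather than RetroClip's code: the frame is a plain 2D list of integers standing in for an IOSurface, the cursor is drawn as an opaque patch with no alpha blending or edge clipping, and `CursorCompositor`, `save_region`, and `blit` are hypothetical names chosen for illustration.

```python
def save_region(frame, x, y, w, h):
    """Copy the pixels under the cursor rectangle into a small scratch patch."""
    return [row[x:x + w] for row in frame[y:y + h]]


def blit(frame, patch, x, y):
    """Write a patch into the frame in place at (x, y). Assumes it fits."""
    for dy, row in enumerate(patch):
        frame[y + dy][x:x + len(row)] = row


class CursorCompositor:
    def __init__(self, cursor):
        self.cursor = cursor  # 2D list of cursor pixels
        self.frame = None     # the most recent full-screen frame
        self.saved = None     # (x, y, patch) of the pixels hidden by the cursor

    def new_frame(self, frame):
        """A fresh frame arrived, so there is nothing to undo on it."""
        self.frame = frame
        self.saved = None

    def draw(self, x, y):
        """Composite the cursor at (x, y), undoing any earlier composite first."""
        if self.saved is not None:
            old_x, old_y, patch = self.saved
            blit(self.frame, patch, old_x, old_y)  # restore the clean pixels
        h, w = len(self.cursor), len(self.cursor[0])
        self.saved = (x, y, save_region(self.frame, x, y, w, h))
        blit(self.frame, self.cursor, x, y)


# Usage: the cursor moves twice without a new frame arriving in between.
frame = [[0] * 8 for _ in range(4)]
compositor = CursorCompositor(cursor=[[9, 9], [9, 9]])
compositor.new_frame(frame)
compositor.draw(1, 1)
compositor.draw(4, 2)  # the old location is restored before redrawing
```

Only the small rectangle under the cursor is ever copied, which is the point of the trick: the full-screen texture is modified in place and never duplicated.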

I think the mouse cursor trick is the main source of my performance advantage over QuickTime Player/AVCaptureScreenInput, along with generally careful programming to avoid unnecessary copies, allocations, and indirections. On my machine, RetroClip typically uses about one-fifth of the CPU time that AVCaptureScreenInput does to capture the same content at 60 frames per second. Finally, if there is no mouse cursor, because you're playing a game without one or you're watching a video, then we can go down an even faster code path and avoid all of the cursor compositing magic.

RetroClip stores the encoded H.264 video frames in memory, rather than immediately writing them out to disk like QuickTime Player does. The goal in RetroClip for buffering this video is to hold it in memory as efficiently as possible, and to have stable memory use over days or weeks of use. Additionally, when the user requests to save a clip, we want to do this as quickly as possible.

My first pass was to just put the reference-counted pointers to the frames output from the video encoder into a queue. This works OK, but it has two downsides: first, we have to lock whenever doing a save operation until we've had a chance to duplicate the queue and increment the retain counts on all of the sample buffers within it, and second, keeping thousands of these variably sized things scattered around turns the heap into Swiss cheese after a while.

The solution to both of these problems is to use a circular queue to store the encoded video data end to end without any fragmentation. As new frames are generated, we write them at the head, and we can forget about old frames at the tail simply by incrementing a pointer.
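A minimal sketch of such a queue follows, assuming a byte array as the backing store and a small index of frame offsets and lengths. The class and method names (`FrameRing`, `push`, `frame_bytes`) are made up for this example and are not RetroClip's actual data structures. A frame that runs past the end of the buffer is simply split into two pieces and read back in two segments.

```python
from collections import deque


class FrameRing:
    """Fixed-size circular buffer holding variable-length frames end to end."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.head = 0          # offset where the next frame will be written
        self.used = 0          # bytes occupied by frames we still remember
        self.frames = deque()  # (start offset, length), oldest first

    def push(self, data):
        n = len(data)
        if n > self.capacity:
            raise ValueError("frame larger than buffer")
        # Forget the oldest frames until the new one fits.
        while self.used + n > self.capacity:
            _, old_len = self.frames.popleft()
            self.used -= old_len
        # Write up to the end of the buffer, then wrap around to the start.
        first = min(n, self.capacity - self.head)
        self.buf[self.head:self.head + first] = data[:first]
        self.buf[0:n - first] = data[first:]
        self.frames.append((self.head, n))
        self.head = (self.head + n) % self.capacity
        self.used += n

    def frame_bytes(self):
        """Return every remembered frame, rejoining any that wrap around."""
        out = []
        for start, length in self.frames:
            first = min(length, self.capacity - start)
            out.append(bytes(self.buf[start:start + first])
                       + bytes(self.buf[0:length - first]))
        return out


# Usage: a 10-byte ring receiving three 4-byte frames.
ring = FrameRing(10)
for frame in (b"aaaa", b"bbbb", b"cccc"):
    ring.push(frame)
print(ring.frame_bytes())
```

In this run the third frame evicts the first and wraps around the end of the buffer, yet it reads back intact. No per-frame allocation happens after the buffer is created, so memory use stays flat no matter how long the ring runs.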

Mike Ash wrote a very interesting series of articles about using Mach virtual memory tricks to avoid having to expose segmented objects that are split across the edge of a circular queue, and you should definitely read it if you haven't already, but I actually don't need to use this trick myself in RetroClip. This is because the code that writes out the mp4 file can already easily handle a split frame if necessary.

Fortunately, there is still an opportunity to do a Mach virtual memory trick. When the user requests a clip to be saved, we can make use of the rarely used vm_remap function to make a quick copy of the video data. An I/O thread gets the copy and writes it out to disk, and the media buffer thread keeps on going with the original.

What's neat is, the copy isn't really a copy, it's just some new virtual memory page table entries pointing to the same physical memory containing the video data as the original. Only as new frames are received and written to the original do the contents of the circular buffers begin to diverge, and the kernel handles finding new memory to store the data for us transparently. And because doing a contiguous write of 40 or 80 MB from memory to disk is incredibly fast these days, much faster than the incoming data rate from the H.264 encoder, there really isn't much copy-on-write that the kernel needs to do for us, at the most maybe a couple of megabytes.
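A rough estimate shows why so little data diverges during a save. The disk throughputs below are assumed round numbers, not measurements, and the incoming rate is simply the 40 MB per 30 seconds quoted earlier. The real amount the kernel copies also depends on page granularity, which this sketch ignores.

```python
BUFFER_MB = 80.0             # the larger buffer size mentioned in the text
INCOMING_MB_PER_S = 40 / 30  # encoder output implied by "30 seconds in around 40MB"


def diverged_mb(disk_mb_per_s):
    """Megabytes of new video written to the original while the copy is saved."""
    write_seconds = BUFFER_MB / disk_mb_per_s
    return INCOMING_MB_PER_S * write_seconds


# Assumed sequential write speeds: a slow disk and a typical SSD.
for disk in (50.0, 500.0):
    print(f"{disk:.0f} MB/s disk: ~{diverged_mb(disk):.2f} MB written during the save")
```

Even with the slow 50 MB/s figure, only about 2 MB of new frames arrive before the write finishes, which lines up with the "couple of megabytes" upper bound.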

As purveyors of real native Mac software, Nick and I occasionally worry that we are ignoring the advancing capabilities of web browsers as application platforms. Regarding our first product a couple years ago, a Hacker News reader confidently wrote:

Also all the stated reasons for using native are actually wrong and just as possible in web app, or will be by end of 2016.

As it's already 2018, we knew we had to get with the times. While at first, I just wanted to make a video, write a blurb, and have a Mac App Store link for our website, Nick quickly convinced me that I could and should do better.

So, last week, armed with nothing besides the latest version of Safari and my trusty text editor, I set off to port RetroClip to the web.

My first challenge was to figure out how to do H.264 encoding in a web browser. Using the hardware encoder was obviously off limits, but I figured doing software encoding on a background thread would have to suffice. I found a couple of projects that aimed at doing it, but I was hoping to do better than 25 MB of uncompressed JavaScript.

I wasn't about to write my own H.264 encoder, but I thought maybe if I got x264 built as WebAssembly with as little extra glue code as needed to get RGB data from the browser into H.264 video in an MP4 container and back, that would be good enough. And, it turns out, it is. It works out to about 850 KB of uncompressed WebAssembly, and it will encode scaled image data to H.264 much faster than real time (at least on decent computers).

Once the video encoding piece of the puzzle was solved, I was confident that RetroClip for Web could become a reality. All I needed now was to create for the web a Window Server, a menu bar, some Cocoa view hierarchy, window management, event handling code, NSVisualEffectView, NSUserNotificationCenter, and the picture in picture media player introduced in macOS Sierra, and then I'd have everything I needed to port and run RetroClip on the web!

In the end, it wasn't too hard. The HTML Canvas element is basically the same thing as CGContext, so reimplementing a macOS UI on the web proceeds somewhat naturally.

Granted, an environment in which one can only run RetroClip is a bit too self-referential to be of any practical use beyond marketing RetroClip for Mac, but I figure, with the groundwork I've already done, I will have a leg up on porting other Cocoa apps to the web. Hey, it worked for 280 North a few years ago, and now with the advent of exciting new web technologies it might be time to revisit the idea.