An adventure with WebKitGTK+, v8, and multithreaded C++

At untapt, resumes are our bread and butter. When we need to show them in a browser, we lay them out with HTML and CSS. Rendering that HTML and CSS to a PDF is a crucial task for us, both because we have downstream vendors that import candidate data by parsing PDFs (ugh), and because our clients need the ability to share resumes with their hiring teams.

From the perspective of a software engineer, this should be an easy task to automate, should it not? Users all over the world are taking HTML and rendering PDFs. They click File->Print, then Save as PDF, and happily carry on. Sadly, headless HTML to PDF rendering is in a sorry state. Headless Chrome is under active development, but it does not yet support PDF output. We previously used PhantomJS, which uses the WebKit engine, in production, but the PDFs that it generates are very poor. Notice the odd selection outline in this screenshot:

Selecting text in a PDF generated by PhantomJS.

PhantomJS places each character glyph on the page individually. Here’s what happens when you copy and paste this as text:

Fir.t emplo4ee hire at thi. earl4 .tage .tartup. We're uilding an automated digital tech recruitment platform. uilt the we platform for our two .ided market place with AngularJ on the frontend and P4thon/Fla.k/MongoD/Redi. on the ackend. uilt a data warehou.e that aggregate. cu.tom u.ine.. data with marketing .pend to optimize acqui.ition. Feature engineering from .tructured and un.tructured re.ume data including named entit4 recognition and proaili.tic topic modeling. Trained and productionized an en.emle model to predict the proailit4 of an interview given a candidate and jo. Thi. wa. uilt u.ing .cikit learn linear model. a. well a. Markov chain Monte Carlo method. in p4mc. etup creative, targeting and udget for digital marketing acro.. Google, Faceook and Twitter. Developed growth .trateg4 4 tracking ke4 u.ine.. metric.. Participated in the due diligence portion of a venture capital round. 2

This will certainly not be parsed correctly by any vendor, and it’s annoying that you can’t simply copy and paste text from our resumes. In addition, a lot of horrific code is required to end up with this equally horrific output. I wanted a Node.js module that would allow me to write some simple code like this:

and receive a beautiful PDF on my hard drive, like this:

A PDF rendered by our new tool with proper text selection.

If you’d like to cut to the chase, the full source code is available on GitHub. Otherwise, read on :). There are two parts to this project: building the renderer and connecting it to Node.js.¹

Building the Renderer

As a foundation, we’re going to use a project called WebKitGTK+, which exposes WebKit, the same rendering engine used in Safari on macOS and iOS, as a library that we can call from C++. Our goal will be to expose a simple, synchronous API that takes some HTML and gives us a PDF, which we’ll later connect to Node. Here are the signatures of the two functions that we’ll create.

Before we get into the code, there are a couple of gotchas with GTK+ to keep in mind. First, all code that calls out to it is required to do so from a single thread. The Node threading model is complex, and we’ll get into how to make the two play nice later on. Second, GTK+ will only write the PDF data out to an actual file, not a stream (yeah, you read the sentence correctly). My unfortunate solution involves the /tmp directory². I don’t see a workaround here that does not involve hacking on GTK+, but if you have one I’d love to hear it!

The first function we’ll look at is setup_printer. Please note that some helper functions and almost all error handling are removed; refer to the GitHub repo for the full source.

This is pretty straightforward. Notice that we call gtk_init; from now on, we can only make GTK+ calls from this thread! Next we’ll need a function that takes some HTML and loads up our web view with it:

HTML is stored as a raw pointer and length as opposed to an std::string to avoid any encoding issues. The ownership of that pointer will be discussed once we get into the Node interop, but note that the data is not copied and its ownership does not change in this function.

Next we’ll look at the function that submits the print job to WebKit:

Step by step, this function:

1. Creates a new temporary file
2. Configures the print settings to write to that file
3. Creates a new print operation with the settings from the setup function
4. Requests a notification when the print operation is finished
5. Submits the operation
6. Enters the gtk_main loop

We’re strategically using gtk_main and gtk_main_quit to make this call synchronous, because a callback is the only way to know that the operation has finished. Let’s put these two functions together into our main render function:

Here we string together the previous two functions, read the PDF data back, and clean up the temp files. Notice that we’re allocating memory on the heap for the PDF data here, and giving away ownership to the caller.
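The read-back-and-clean-up step can be sketched in plain C++ without GTK+. This is a minimal sketch under stated assumptions, not the code from the repository: make_temp_path and read_and_remove are hypothetical names, the /tmp/render- prefix is invented for illustration, and a POSIX system is assumed for mkstemp and unlink.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <stdlib.h>  // POSIX: mkstemp
#include <unistd.h>  // POSIX: close, unlink, access

// Hypothetical helper: create an empty, uniquely named file under /tmp.
// Returns its path, or an empty string on failure.
std::string make_temp_path() {
    char path[] = "/tmp/render-XXXXXX";
    int fd = mkstemp(path);  // fills in the XXXXXX suffix and creates the file
    if (fd == -1) return std::string();
    close(fd);
    return std::string(path);
}

// Hypothetical helper: read the whole file into a heap buffer, delete the
// file, and hand the buffer to the caller, who must free it with delete[].
// Returns nullptr if the file cannot be opened or its size cannot be read.
char* read_and_remove(const std::string& path, size_t* out_len) {
    assert(out_len != nullptr);
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (f == nullptr) return nullptr;

    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    if (size < 0) {
        std::fclose(f);
        unlink(path.c_str());
        return nullptr;
    }

    char* data = new char[static_cast<size_t>(size)];
    size_t got = std::fread(data, 1, static_cast<size_t>(size), f);
    std::fclose(f);
    unlink(path.c_str());  // the temp file is no longer needed

    *out_len = got;
    return data;  // ownership passes to the caller
}
```

Returning a raw pointer plus a length mirrors the ownership handoff described above: whoever receives the buffer is responsible for releasing it.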

We’ve successfully implemented the simple rendering API. Now let’s connect it to Node.

Building a Multithreaded Node.js Addon

In order to follow GTK+’s rule about only being called from one thread, everything we just built will run in a background thread³. We’ll use the Producer-Consumer Pattern and a mutex to queue HTML for rendering. Then the challenge will be to get the PDF data back into Node land; for this we’ll use libuv, the library that provides Node’s event loop and cross-thread signaling.

First let’s look at the main loop of the background thread. Notice that there is no v8 code in here at all. That is all handled by the WorkItem data structure that we’ll get to later.

The work_queue and mutex are declared statically, and setup_printer is called once when this thread is spawned; remember, this is the background thread, so we get to call GTK+ here.

Next we enter the loop. Here are the steps that happen inside:

1. Wait for a lock on work_queue.
2. Pop out work if it’s available.
3. Release the lock.
4. If work was available, call out to our render method.
5. Call work->done, sending the PDF data back to Node.

If you’re keeping score, the memory allocated by render passes through this function, and is then given to the WorkItem.
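For readers who want to experiment with the queueing pattern on its own, here is a minimal sketch in standard C++. It is not the addon's code: Job is a hypothetical stand-in for WorkItem, std::mutex is swapped in for whatever mutex type the addon actually uses, the "render" step just counts bytes, and the loop stops after a fixed number of jobs so that it can terminate, whereas the real background thread presumably runs for the life of the process.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the article's WorkItem.
struct Job {
    std::string html;
};

static std::queue<Job*> work_queue;
static std::mutex queue_mutex;

// Producer side: lock, push, unlock (the lock_guard unlocks on scope exit).
void enqueue(Job* job) {
    std::lock_guard<std::mutex> lock(queue_mutex);
    work_queue.push(job);
}

// Consumer side, steps 1-3: lock, pop if available, unlock.
Job* try_dequeue() {
    std::lock_guard<std::mutex> lock(queue_mutex);
    if (work_queue.empty()) return nullptr;
    Job* job = work_queue.front();
    work_queue.pop();
    return job;
}

// Consumer loop: handles `expected` jobs, then returns the bytes "rendered".
size_t consume(size_t expected) {
    size_t handled = 0;
    size_t bytes = 0;
    while (handled < expected) {
        Job* job = try_dequeue();
        if (job == nullptr) {
            // Nothing queued yet: back off briefly instead of spinning.
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            continue;
        }
        bytes += job->html.size();  // stand-in for the real render call
        delete job;
        ++handled;
    }
    return bytes;
}

// Starts one consumer and several producers, and returns the consumer's total.
size_t run_demo(int producers, int jobs_each) {
    size_t total = 0;
    const size_t expected = static_cast<size_t>(producers) * jobs_each;
    std::thread consumer([&total, expected] { total = consume(expected); });

    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p) {
        threads.emplace_back([jobs_each] {
            for (int j = 0; j < jobs_each; ++j) enqueue(new Job{"<p>hi</p>"});
        });
    }
    for (std::thread& t : threads) t.join();
    consumer.join();
    return total;
}
```

Note that the lock is held only while touching the queue, never while rendering, so producers are not blocked by a slow render.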

Now let’s look at the Node interop methods that start this loop and add to the work_queue. I’ll leave some error checking in this time to show how to throw Node exceptions.

This is pretty simple: initialize the mutex and start the background thread.

Next the render method:

The goal here is to properly extract the parameters passed into the function from JavaScript, and then get them into a WorkItem and onto the queue safely. To do that, we:

1. Extract the callback function and buffer from args.
2. Allocate our own memory for the HTML and copy it in.
3. Construct a WorkItem to manage this render request.
4. Wait for a lock on work_queue.
5. Push the work.
6. Release the lock.

Now we’ve completed the classic Producer-Consumer Pattern. We have many threads writing, and one thread reading. A mutex synchronizes access to the queue. In order to actually expose these methods to JavaScript, we have to finish setting up the module:

Next, let’s look at the WorkItem data structure. This class has three jobs: it stores the callback given to us in the render function, it owns the memory containing the HTML and PDF data, and it manages the transition from our background thread back into Node land in order to trigger the callback.

I’ll show the entire data structure and then go through it piece by piece:

It’s clear that this object is initialized with the html and callback that were passed to the render function.

The next interesting thing that happens is when the done method is called from the background thread:

Here we’re trying to wrangle all of the data that we’ll need later to invoke the callback function, and then tell Node that we want to get back into its territory. The best way to keep track of this data as we surrender control to Node is to allocate a payload struct on the heap, and then keep a reference to that struct through the transition.

1. Instantiate a new payload object on the heap.
2. Grab the pointer to the PDF data and its length.
3. Grab a reference to our callback.
4. Keep a reference to the original WorkItem. Remember that render instantiated it, and it can’t be cleaned up until after the callback is fired.
5. Send the payload through to Node land with uv_async_send.

Finally, let’s look at the function that executes as a result of uv_async_send , in which it’s safe to invoke the callback.

In this final function we just need to construct proper Node objects from the raw data, and send them up to JavaScript.

1. Prepare the objects we’ll call back into JavaScript with; the signature is the typical (err, result).
2. Create a Node Buffer from our PDF data. Node now takes care of this memory for us and we don’t have to delete it!
3. Call the callback.
4. Clean up the WorkItem.
5. Clean up the payload.
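The round trip from done to the callback can be imitated without libuv or v8. The following is a rough sketch, not the addon's implementation: FakeAsync is a hypothetical stand-in for a uv_async_t handle (which carries a user-owned void* data field), Callback stands in for the JavaScript function, and the "send" is simulated by calling the handler directly on the same thread. Unlike the real addon, which hands the PDF buffer to a Node Buffer, this sketch copies the bytes into a std::string and frees the buffer itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for uv_async_t: a handle with a user-owned pointer.
struct FakeAsync {
    void* data;
    void (*cb)(FakeAsync*);
};

// Stand-in for the JavaScript callback, using the (err, result) convention.
using Callback = std::function<void(const char* err, const std::string& pdf)>;

struct WorkItem;

// Everything the handler needs after control has left the background thread.
struct Payload {
    char* pdf;
    size_t len;
    Callback* callback;
    WorkItem* item;
};

struct WorkItem {
    Callback callback;
    FakeAsync async;

    explicit WorkItem(Callback cb)
        : callback(std::move(cb)), async{nullptr, nullptr} {}

    void done(char* pdf, size_t len);
};

// Runs "in Node land": build the result, fire the callback, then clean up.
void on_main_thread(FakeAsync* handle) {
    Payload* p = static_cast<Payload*>(handle->data);
    std::string result(p->pdf, p->len);  // copy; the addon uses a Node Buffer
    delete[] p->pdf;
    (*p->callback)(nullptr, result);
    delete p->item;  // the WorkItem outlives the callback, as in the article
    delete p;
}

// Called from the background thread once rendering has finished.
void WorkItem::done(char* pdf, size_t len) {
    Payload* p = new Payload{pdf, len, &callback, this};
    async.data = p;
    async.cb = on_main_thread;
    // With libuv this is where uv_async_send(&async) would go; here the
    // handler is invoked directly to keep the sketch self-contained.
    async.cb(&async);
    // `this` has been deleted by the handler; touch no members past here.
}

// Drives one fake render and returns what the callback received.
std::string demo() {
    std::string received;
    WorkItem* item = new WorkItem(
        [&received](const char* err, const std::string& pdf) {
            if (err == nullptr) received = pdf;
        });
    char* pdf = new char[5];
    std::memcpy(pdf, "%PDF-", 5);
    item->done(pdf, 5);
    return received;
}
```

The order of the clean-up matters: the callback is stored inside the WorkItem, so the WorkItem has to stay alive until after the callback has been invoked.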

That’s it! All of this scary native code is completely hidden behind a simple JavaScript API. Have a look at the GitHub repo if you’d like to use it!

My C++ was quite rusty going into this project, and I have no doubt that a guru would have a lot to say about the choices I made. The biggest criticism I have of my own implementation is that I did not use modern RAII primitives such as unique_ptr. All of the heap-allocated data in this project ended up being passed around magically through either GTK or libuv, so I did not see a way to follow that best practice. I’m curious to know if there was a strategy for this that I missed. That said, I’ve been meaning to learn Rust, and it would probably eliminate all of these problems by default!
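One commonly used pattern for the unique_ptr question is to release ownership into the C library's void* slot and re-wrap the pointer as the first statement of the callback. The sketch below only illustrates that idea and is not taken from the project: CHandle, Result, submit, on_done, and make_result are all hypothetical names, and CHandle merely imitates the kind of user-data pointer that GTK signal handlers and libuv handles carry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// Hypothetical result type that owns its PDF bytes through unique_ptr.
struct Result {
    std::unique_ptr<char[]> pdf;
    size_t len;
};

// Hypothetical C-style handle: all it can carry is an untyped pointer.
typedef void (*c_callback)(void* user_data);
struct CHandle {
    void* user_data;
    c_callback cb;
};

// Records what the callback saw, purely so the sketch is observable.
static std::string last_seen;

// Callback side: re-wrap immediately, so every exit path frees the Result.
void on_done(void* user_data) {
    std::unique_ptr<Result> r(static_cast<Result*>(user_data));
    last_seen.assign(r->pdf.get(), r->len);
}  // r goes out of scope here and releases the Result and its buffer

// Submitting side: ownership is parked in the void* until the callback runs.
void submit(CHandle& handle, std::unique_ptr<Result> r) {
    handle.cb = on_done;
    handle.user_data = r.release();
}

// Builds a Result holding a copy of `bytes`.
std::unique_ptr<Result> make_result(const char* bytes, size_t len) {
    std::unique_ptr<Result> r(new Result());
    r->pdf.reset(new char[len]);
    std::memcpy(r->pdf.get(), bytes, len);
    r->len = len;
    return r;
}
```

The raw pointer still exists for the duration of the handoff, so if the callback never fires the memory leaks; what this buys is that both ends of the handoff are exception-safe and that ownership is visible in the function signatures.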

¹ For the informed reader, I’m using “Node” interchangeably with “v8” throughout this article.

² It’s quite possible that this exposes a security vulnerability. However, exploiting it would require RCE ability within my Docker container, which runs inside my private Kubernetes network.