Creating intuitive and engaging experiences for users is a critical goal for companies of all sizes, and it’s a process driven by quick cycles of prototyping, designing, and user testing. Large corporations like Facebook have the bandwidth to dedicate entire teams to the design process, which can take several weeks and involve multiple stakeholders; small businesses don’t have these resources, and their user interfaces may suffer as a result.

My goal at Insight was to use modern deep learning algorithms to significantly streamline the design workflow and empower any business to quickly create and test webpages.

The design workflow today

A design workflow goes through multiple stakeholders

A typical design workflow might look like the following:

Product managers perform user research to generate a list of specifications

Designers take those requirements and explore low-fidelity prototypes, eventually creating high-fidelity mockups

Engineers implement those designs into code and finally ship the product to users

The length of the development cycle can quickly turn into a bottleneck, and companies like Airbnb have started to use machine learning to make this process more efficient.

Airbnb’s demo of their internal AI tool to go from drawings to code

Though promising as an example of machine-assisted design, it’s unclear how much of this model is trained fully end-to-end and how much relies on hand-crafted image features. There’s no way to know for sure, because the solution is closed-source and proprietary to the company. I wanted to create an open-source version of drawing-to-code technology that’s accessible to the wider community of developers and designers.

Ideally, my model would be able to take a simple hand-drawn prototype of a website design, and instantly generate a working HTML website from that image:

The SketchCode model takes drawn wireframes and generates HTML code

In fact, the example above is an actual website generated by my model from a test-set image! You can check out the code on my GitHub page.

Drawing inspiration from image captioning

The problem I was solving falls under a broader umbrella of tasks known as program synthesis, the automated generation of working source code. Though much of program synthesis deals with code generated from natural language specifications or execution traces, in my case I could leverage the fact that I had a source image (the hand-drawn wireframe) to start with.

There’s a well-studied domain in machine learning called image captioning that seeks to learn models that tie together images and text, specifically to generate descriptions of the contents of a source image.

Image captioning models generate descriptions of source images

Drawing inspiration from a recent paper called pix2code and a related project by Emil Wallner that used this approach, I decided to reframe my task into one of image captioning, with the drawn website wireframe as the input image and its corresponding HTML code as its output text.
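Framed this way, each training example is simply a wireframe image paired with a target sequence of markup tokens, and the model learns to emit tokens one at a time the same way a captioning model emits words. Here’s a minimal sketch of that pairing (the file name, image size, and tokens are illustrative placeholders, not the actual project data):

```python
from keras.preprocessing.image import load_img, img_to_array

# One training example in the image-captioning framing: the wireframe is the "image",
# the markup tokens are the "caption". File name, size, and tokens are placeholders.
wireframe = img_to_array(load_img("wireframes/sample_0001.png", target_size=(256, 256)))
caption = "<START> header button button row div text button <END>".split()

# Training objective: predict each token given the wireframe and the tokens before it.
```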

Getting the right dataset

Given the image captioning approach, my ideal training dataset would have been thousands of pairs of hand-drawn wireframe sketches and their HTML code equivalents. Unsurprisingly, I couldn’t find that exact dataset, and I had to create my own data for the task.

I started with an open-source dataset from the pix2code paper, which consists of 1,750 screenshots of synthetically generated websites paired with their corresponding source code.

The pix2code dataset of generated website images and source code

This was a great dataset to start with, and it has a couple of points worth noting:

Each generated website in the dataset consists of combinations of just a few simple Bootstrap elements such as buttons, text boxes, and divs. Though this means my model would be limited to these few elements as its ‘vocabulary’ — the elements it can choose from to generate websites — the approach should easily generalize to a larger vocabulary of elements

The source code for each sample consists of tokens from a domain-specific language (DSL) that the authors of the paper created for their task. Each token corresponds to a snippet of HTML and CSS, and a compiler is used to translate from the DSL to working HTML code
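To make this concrete, a compiler of this kind can be as simple as a lookup that expands each DSL token into its HTML snippet. The token names and snippets below are a minimal illustration, not the exact pix2code grammar:

```python
# Minimal sketch of a DSL-to-HTML compiler: each token expands into an HTML snippet.
# Token names and snippets are illustrative, not the actual pix2code DSL.
TOKEN_TO_HTML = {
    "small-title": "<h4>Title</h4>",
    "text": "<p>Lorem ipsum dolor sit amet.</p>",
    "btn-green": '<button class="btn btn-success">Button</button>',
}

def compile_tokens(tokens):
    """Translate a flat sequence of DSL tokens into an HTML string."""
    return "\n".join(TOKEN_TO_HTML.get(token, "") for token in tokens)

html_body = compile_tokens(["small-title", "text", "btn-green"])
```

A full compiler would also handle nested structures such as rows and containers with opening and closing snippets, but the idea is the same: the model only has to learn a small DSL vocabulary, and the compiler takes care of the verbose HTML.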

Making the images look hand-drawn

Turning colorful website images into hand-drawn versions

In order to modify the dataset for my own task, I needed to make the website images look like they were drawn by hand. I explored using tools like OpenCV and the PIL library in Python to make modifications to each image, such as grayscale transformations and contour detection.

Ultimately, I decided to directly modify the CSS stylesheet of the original websites, performing a series of operations:

Changed the border radius of elements on the page to curve the corners of buttons and divs

Adjusted the thickness of borders to mimic drawn sketches, and added drop shadows

Changed the font to one that looks like handwriting
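A sketch of what those stylesheet tweaks might look like in a preprocessing script is below; the selectors, values, and font are illustrative guesses, not the exact rules used:

```python
# Append "hand-drawn" override rules to each generated site's stylesheet.
# Selectors, values, and the font are illustrative, not the exact rules used in SketchCode.
HAND_DRAWN_OVERRIDES = """
.btn, div {
  border-radius: 12px;             /* curve the corners of buttons and divs */
  border-width: 2px;               /* thicker, sketch-like outlines */
  box-shadow: 2px 2px 2px #888888; /* drop shadow to mimic pen strokes */
}
body {
  font-family: 'Comic Sans MS', cursive; /* handwriting-style font */
}
"""

def make_hand_drawn(css_path):
    """Append the override rules to an existing stylesheet."""
    with open(css_path, "a") as f:
        f.write(HAND_DRAWN_OVERRIDES)
```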

My final pipeline added one further step, which augmented these images by adding skews, shifts, and rotations to mimic the variability in actual drawn sketches.
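One straightforward way to apply such random skews, shifts, and rotations is Keras’s ImageDataGenerator; the snippet below is a sketch with illustrative parameter values and file names, not the exact augmentation settings used:

```python
from keras.preprocessing.image import (ImageDataGenerator, array_to_img,
                                       img_to_array, load_img)

# Random skews, shifts, and rotations to mimic the variability of real hand-drawn sketches.
# Parameter values and the file name are illustrative, not the actual pipeline settings.
augmenter = ImageDataGenerator(
    rotation_range=3,         # small random rotations, in degrees
    shear_range=0.05,         # slight skew
    width_shift_range=0.02,   # horizontal jitter
    height_shift_range=0.02,  # vertical jitter
    fill_mode="nearest",
)

original = img_to_array(load_img("hand_drawn/sample_0001.png"))
augmented = array_to_img(augmenter.random_transform(original))
```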

Using an image captioning model architecture

Now that I had my data ready, I could finally feed it into the model!

I leveraged a model architecture used in image captioning that consists of three major parts: