About a year ago I read two blog posts about generating fonts with deep learning: one by Erik Bernhardsson and one by TJ Torres at StitchFix. Inspired by their work, I wanted to give fonts a go as well, so I set up a variational autoencoder* that learned a low-dimensional representation of the word “Endless” from 1,639 different fonts and could generate very smooth interpolations between the different fonts, as can be seen in the animation below.

*A variational autoencoder is, in short, a model that learns to take a high-dimensional input, transform it into a lower-dimensional representation and then transform this back into the original. By doing this the model effectively learns how to boil the input down to its essentials, which in this case allows one to interpolate smoothly between the original fonts, creating completely new ones in the process.

Almost perfectly smooth interpolation between fonts. Please excuse the corny background imagery.

However, generating different interpolations between fonts of a single word is not quite creating an entirely new font. It very much lacks the expressiveness of a full alphabet.

So why not just take what Erik and TJ have made and simply use that to generate new fonts? Because their models are lacking something: Even though they manage to capture the styles of individual characters very well, they do not incorporate the styling found between pairs of characters, namely the intended spacing in between them, known as kerning.

For those that are unfamiliar with the term, kerning is the spacing between specific pairs of characters, which makes fonts look nice and more readable. For instance, in “LEEWAY” there is no overlap between “L” and “E”, nor “E” and “E”, whereas “W” and “A”, and “A” and “Y” are overlapping. Another example, taken directly from Wikipedia, is the four bigrams below, with and without kerning applied:

So I want to build a model which incorporates both intra-character and inter-character styling. But how to do this?

Requirements to get there

I am going to list out the requirements in an odd order, by leaving the first requirement until the end, as it actually was how I went about thinking about this problem:

The second requirement is how I will make the model learn about kerning. The best way of achieving this I could think of was to use bigrams: If a model learns to reproduce all combinations of two characters, along with the spacing encoded in the original fonts, it has effectively learned how to kern as intended.

However, just producing bigrams is not quite equal to creating a fully fledged font, as these bigrams cannot readily be used to write sentences (e.g. “CO”+“OM”+“MP”+“PU”+“UT”+“TE”+“ER” is quite different from “COMPUTER”). So something needs to automatically overlay these bigrams, which is the third and final requirement of my solution.

Having figured out how I want to make a model that can learn how to kern, I need to have a representation of what font style and which bigram I want the model to generate. This will be the input to the model, denoted X, and needs to be composed of two parts, namely the style and bigram. The encoding for “font style” has to be in a continuous space as I want to be able to interpolate between the styles of the fonts, creating a continuum of different font-styles which I can sample to generate completely new fonts. And the encoding for bigrams needs to enable me to create all combinations of characters, such that I end up with a font that can be used for whatever purpose. This is the first requirement of the solution.

In short, the three requirements are:

1. Figure out the encoding of the input, X, such that it incorporates the styles of fonts and bigrams
2. Teach a model to generate bigrams of various fonts
3. Automatically overlay the generated bigrams to form words and sentences

Encoding of the input

As mentioned, the encoding of X will need to consist of two parts: Bigram and style.

Encoding of the bigram

I want my generated fonts to be able to write all the letters of the English alphabet in upper case (A..Z) as well as a hyphen (-). This adds up to 27 different characters, and we need to be able to represent two times this as we are generating bigrams.

How about simply encoding it as two times a number in the range 1 to 27 where the characters are numbered according to the order I just described? Then 15 and 11 would correspond to “O” and “K”.

Yeah, no. This is a bad idea because even though machine learning algorithms are “smart” enough to be taught certain things, they operate within the world of arithmetic. The model would interpret the characters “A” and “B” as being very similar, but “B” and “P” as being quite dissimilar, because their numerical values in this encoding would be 1 and 2, and 2 and 16, respectively.

Instead I am going with something called one-hot encoding for representing the characters in the bigram. A one-hot encoding is a list of a single one and N-1 zeros, for an alphabet of length N. For our alphabet of 27 characters, each position in the list corresponds to a specific character, as the example here should hopefully show:

00000000 00000000 00000000 100
ABCDEFGH IJKLMNOP QRSTUVWX YZ-

The 25th position in the one-hot vector is a 1, meaning that the vector represents the 25th character of the alphabet, which is “Y”.

So to represent a bigram of an alphabet with 27 characters such as ours, we will have 52 zeroes and 2 ones. For example, one will encode “ET” as follows:

00001000 00000000 00000000 000 00000000 00000000 00010000 000
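The encoding above can be reproduced in a few lines of plain Python. This is only a minimal sketch: the helper names `one_hot`, `encode_bigram` and `as_groups` are made up for illustration and are not taken from the original code.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ-"  # 26 upper-case letters plus the hyphen


def one_hot(char):
    """Return 27 values: a single 1 at the character's position, 0 elsewhere."""
    vector = [0] * len(ALPHABET)
    vector[ALPHABET.index(char)] = 1
    return vector


def encode_bigram(bigram):
    """Concatenate the one-hot vectors of both characters (54 values)."""
    if len(bigram) != 2:
        raise ValueError("a bigram has exactly two characters")
    first, second = bigram
    return one_hot(first) + one_hot(second)


def as_groups(vector):
    """Format 27 values in the 8-8-8-3 grouping used in the text."""
    bits = "".join(str(bit) for bit in vector)
    return " ".join(bits[i:i + 8] for i in range(0, len(bits), 8))


et = encode_bigram("ET")
print(as_groups(et[:27]))  # 00001000 00000000 00000000 000
print(as_groups(et[27:]))  # 00000000 00000000 00010000 000
```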

One-hot encoding ensures that the machine sees absolutely zero correlation between characters, which is what I want.
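To make the difference between the two encodings concrete, the sketch below compares how far apart “A”/“B” and “B”/“P” are under each of them. Using Euclidean distance as the yardstick is my assumption for illustration; the point is only that the integer encoding imposes an ordering, while one-hot vectors place every pair of distinct characters equally far apart.

```python
import math

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ-"


def integer_code(char):
    """Number the characters 1..27 in alphabet order."""
    return ALPHABET.index(char) + 1


def one_hot(char):
    """Return 27 values: a single 1 at the character's position, 0 elsewhere."""
    vector = [0] * len(ALPHABET)
    vector[ALPHABET.index(char)] = 1
    return vector


def euclidean(u, v):
    """Straight-line distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


for left, right in [("A", "B"), ("B", "P")]:
    integer_gap = abs(integer_code(left) - integer_code(right))
    one_hot_gap = euclidean(one_hot(left), one_hot(right))
    print(left, right, integer_gap, round(one_hot_gap, 3))
# A B 1 1.414
# B P 14 1.414
```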

That being said, one can make the argument that some characters look more alike than others, such as “M” and “N” but certainly not “Q” and “T”, and thus maybe one should use a representation that is a combination of these two types of encoding, allowing for some correlation between the characters.

Encoding of the font-style

That was representing the bigrams, but we also need something representing the style of the font we want the model to generate. As I want to be able to interpolate between existing fonts to generate completely new ones, the encoding of a given type of font has to be in a continuous space.

A quick and dirty solution is to use an algorithm called t-Distributed Stochastic Neighbor Embedding (t-SNE), which simply put can take images of the 1,639 fonts and map these to a z-dimensional space, Z, where z is much smaller than the dimensionality of the imagery, while trying to have the proximity of the points in the Z-space correspond to the similarity of the original images of the fonts. When z is set to be 2, one can plot the fonts at their corresponding positions in this space:

The result of generating images of the nonsense word “Aon” with the 1,639 fonts and having t-SNE map these to a 2-dimensional space

The dimensionality of the space Z can theoretically be any integer from 1 to the original number of dimensions in the images (which was 2304 in this case). Setting the number of dimensions, z, closer to 1 means that more information about the font similarities is lost, while each of the dimensions encodes more information about the similarity of the fonts. Conversely, if a larger z is chosen, each axis encodes relatively less information about the similarity of the fonts, but less error is introduced by the reduction of dimensionality. I chose to set z to 10, as a trade-off between the pros and cons.

Going back to why I wanted to have this in a continuous space, let us assume that I had chosen z = 2. Using t-SNE I can once again map the 1,639 fonts to a 2-dimensional space. This time I plot only the locations of the fonts in this space rather than the fonts themselves, as this makes it easier to see how t-SNE clusters them together:

t-SNE map of the points corresponding to the 1,639 fonts

Each of these 1,639 points corresponds to a specific, original font. Zooming in on two points in this 2-dimensional space, we can see their corresponding fonts:

Two fonts mapped to the 2-dimensional Z space, not particularly interesting as is

As Z is a continuous space we can interpolate between the original fonts, and below is an example from the final model where I interpolated between two fonts:

Interpolation between font-styles: The dark blue dots correspond to two original fonts, whereas the three teal dots correspond to the novel fonts in between

As you can see, this will allow us to create a continuum of fonts which are novel combinations of the original fonts and their looks!
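Interpolating in Z amounts to picking points on the line between two style vectors. The sketch below assumes plain linear interpolation, which the text does not spell out, and uses made-up 2-dimensional coordinates for the two fonts; it produces three in-between styles, matching the three teal dots in the figure.

```python
def interpolate(style_a, style_b, steps):
    """Return `steps` evenly spaced points strictly between two style vectors."""
    points = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)  # fraction of the way from style_a to style_b
        points.append([(1 - t) * a + t * b for a, b in zip(style_a, style_b)])
    return points


# Made-up coordinates for two original fonts in a 2-dimensional Z.
font_a = [0.0, 0.0]
font_b = [4.0, -8.0]

for point in interpolate(font_a, font_b, 3):
    print(point)
# [1.0, -2.0]
# [2.0, -4.0]
# [3.0, -6.0]
```

The same function works unchanged for the 10-dimensional style vectors, since it only zips over the coordinates.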

Now we have settled on how to encode the input to capture both style and which characters to produce, and with z = 10 plus two times 27 characters with one-hot encoding, we end up with the input to the model being a 64-dimensional vector. With that in place, it is time to look at the model and how it is taught to generate these bigrams.

The Model and training hereof

In order to teach any kind of model to generate bigrams, we need some training data. For this problem, it comes quite easily, as I happen to already have the 1,639 fonts lying around on a USB-stick from when I wrote “Endless” in “endlessly” many interpolations of fonts (sorry, I couldn’t resist the pun). With this, I can generate both the 64-dimensional input vectors and the corresponding expected output.

For any of the 1,639 fonts, the 10-dimensional style-vector is simply the point in the space Z which t-SNE told me corresponds to the given font, and the bigram is the 54-dimensional one-hot encoding corresponding to the two characters it should output. Concatenate these two vectors and you have the input.
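As a rough sketch of that concatenation, the snippet below glues a 10-dimensional style vector to the 54-dimensional bigram encoding. The function name `build_input` and the style coordinates are invented for illustration; real style vectors would come from the t-SNE mapping.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ-"
STYLE_DIMS = 10  # z = 10, as chosen in the text


def one_hot(char):
    """Return 27 values: a single 1 at the character's position, 0 elsewhere."""
    vector = [0] * len(ALPHABET)
    vector[ALPHABET.index(char)] = 1
    return vector


def build_input(style, bigram):
    """Concatenate the style vector with the one-hot encoding of the bigram."""
    if len(style) != STYLE_DIMS:
        raise ValueError("expected a 10-dimensional style vector")
    first, second = bigram
    return list(style) + one_hot(first) + one_hot(second)


# Made-up style coordinates standing in for one font's position in Z.
style = [0.1 * i for i in range(STYLE_DIMS)]
x = build_input(style, "OK")
print(len(x))  # 64
```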

And the outputs are simply 41 by 65 pixel images generated from the 1,639 fonts, with all 27*27 combinations of the two characters in the bigrams. This yields roughly 1.2 million inputs, x, and corresponding outputs, y, to train the model on, over and over.
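The size of the training set follows directly from those numbers. A quick check, enumerating every ordered pair of the 27 characters:

```python
from itertools import product

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ-"
NUM_FONTS = 1639

# Every ordered pair of characters, including doubles such as "LL".
bigrams = ["".join(pair) for pair in product(ALPHABET, repeat=2)]
num_examples = NUM_FONTS * len(bigrams)

print(len(bigrams), num_examples)  # 729 1194831
```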

A very simplistic view of the model is this:

Given an input vector, x, the model produces an output image, ỹ, of the estimated corresponding bigram.

The model, which happens to be a deep neural network, is trained by having it produce an estimated ỹ from x. ỹ is then subtracted from the expected image, y, and the mean absolute error is then used to update the model such that it produces a ỹ closer to y.
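For reference, the mean absolute error between two images is just the average per-pixel difference. The sketch below computes it for a tiny 2 by 2 example with made-up pixel values; it only illustrates the error measure, not the network or the update step, which a deep learning framework would handle.

```python
def mean_absolute_error(expected, predicted):
    """Average absolute per-pixel difference between two equally sized images."""
    total = 0.0
    count = 0
    for expected_row, predicted_row in zip(expected, predicted):
        for y_pixel, y_hat_pixel in zip(expected_row, predicted_row):
            total += abs(y_pixel - y_hat_pixel)
            count += 1
    return total / count


# Made-up 2x2 "images" with pixel intensities between 0 and 1.
y = [[0.0, 1.0], [1.0, 0.0]]
y_hat = [[0.0, 0.5], [1.0, 0.5]]

print(mean_absolute_error(y, y_hat))  # 0.25
```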

For this particular problem, the model generally seems to converge after having iterated over a few million examples of inputs and expected outputs, which takes roughly 24 hours on a computer with a powerful GPU.

At this point we can create bigrams (and monograms, because I made a model for that too) in a continuum of styles. Here’s a video of interpolation between 20+ different styles for the upper case alphabet + hyphen:

20+ interpolations between fonts. I can recommend looking at “K” and “S”. They are my favorites here and undergo some quite dramatic changes, yet interpolate very smoothly.

Overlaying bigrams into n-grams

With a model capable of producing bigrams all we need now before we can write full words in the generated fonts is to be able to stitch them together automatically (because we are somewhat lazy and don’t want to do this manually).

Given a word, such as “HELLO”, what I did was to break this down into bigrams: “HE”, “EL”, “LL”, “LO” and then have an algorithm figure out the pairwise overlap between them. E.g. for “HE” and “EL” it should figure out that the two E’s should overlap.
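Splitting a word into overlapping bigrams is a one-liner. A minimal sketch, with a function name of my own choosing:

```python
def to_bigrams(word):
    """Return every pair of neighbouring characters, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


print(to_bigrams("HELLO"))  # ['HE', 'EL', 'LL', 'LO']
```

Each bigram shares its first character with the last character of the previous one, and it is this shared character that the overlay step has to line up.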

For this purpose I used an algorithm called Simulated Annealing, and programmed it to maximise the overlap between the black parts of the image (corresponding to where the characters actually are and ignoring the white space surrounding them).

In short, simulated annealing works by taking a random action, and if some condition is met it updates the state according to this action. An action in this case is moving (translating) the second bigram along the x and y axes, as well as scaling it up or down along the x and y axes. The state is the information about how much the second bigram currently has been moved and scaled. The condition has two parts: if the random action improves the state with regards to the objective (which is maximising the overlap of the two bigrams), the algorithm always applies the action and updates the state, but if the action worsens the state with regards to the objective (i.e. makes the bigrams overlap less), it randomly decides whether to update the state with the action. This decision depends on the “temperature” of the algorithm, which is an exponentially decreasing number, and the algorithm becomes less likely to accept a worsening of the state at lower temperatures.
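The loop described above can be sketched as follows. This is a simplified illustration under several assumptions, not the code used for the results here: glyphs are sets of black pixel coordinates, the only action is a one-pixel translation (scaling is left out), worse states are accepted with the Metropolis probability exp(delta / temperature), and the step count, starting temperature and decay rate are arbitrary.

```python
import math
import random


def overlap(first, second, dx, dy):
    """Count black pixels of `second`, shifted by (dx, dy), that land on `first`."""
    return sum(1 for (x, y) in second if (x + dx, y + dy) in first)


def anneal(first, second, steps=2000, start_temp=2.0, decay=0.995, seed=0):
    """Search for the shift of `second` that maximises its overlap with `first`."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    state = (0, 0)
    score = overlap(first, second, *state)
    best_state, best_score = state, score
    temp = start_temp
    for _ in range(steps):
        # Random action: nudge the second image by at most one pixel per axis.
        candidate = (state[0] + rng.choice((-1, 0, 1)),
                     state[1] + rng.choice((-1, 0, 1)))
        candidate_score = overlap(first, second, *candidate)
        delta = candidate_score - score
        # Accept any action that does not worsen the overlap; accept a worse
        # one with a probability that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            state, score = candidate, candidate_score
            if score > best_score:
                best_state, best_score = state, score
        temp *= decay  # exponentially decreasing temperature
    return best_state, best_score


# Toy "E" glyph as a set of black pixel coordinates.
GLYPH = {(0, y) for y in range(5)} | {(x, y) for x in (1, 2) for y in (0, 2, 4)}
FIRST = {(x + 2, y + 1) for (x, y) in GLYPH}  # the same glyph, displaced
SECOND = GLYPH

start_score = overlap(FIRST, SECOND, 0, 0)
best_state, best_score = anneal(FIRST, SECOND)
print("overlap before:", start_score, "after:", best_score, "shift:", best_state)
```

The function returns the best state seen rather than the final one, so the result is never worse than the starting overlap, even if a late random move was accepted.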

Why am I describing how this particular algorithm works, but not the others? Because I made the visualisation below of the annealing process, which makes little sense without some understanding of how the algorithm works!

Simulated annealing in action, working on matching pairs of bigrams for the sentence ‘hello-there’

With this in place we can now pairwise match the bigrams and chain them together to create a full word or sentence (where hyphens replace spaces).

Breaking the sentence “MACHINE-LEARNING” into bigrams and getting the neural network to produce images of these bigrams in a random font yields this:

And matching them using simulated annealing we get this:

Et voilà! That’s how you generate almost perfect fonts!

The end.

… Just kidding. The annealing process often messes up, especially with hyphens:

Here the annealing found maximum overlap by having the hyphens overlap with the top of ‘E’ and the bottom of ‘L’

It happens less often with characters other than hyphens, but when it does, it’s quite amusing:

A few more examples of exotic overlaps

This is not a fault of the simulated annealing algorithm, but rather of the objective, which I defined as naively maximising the overlap of the bigrams.

Here’s an example of successful matching of the bigrams of a long list of interpolated fonts writing the nonsense word “WAVTSRXA”:

The gray arrow denotes the direction of the interpolation (starting in the upper left corner and ending in the upper right). Note how the kerning changes, especially between “X” and “A”, which touch each other in the bottom of the image, but are far apart in the top of the image. This goes to show that the neural network has indeed learned to kern bigrams according to their font-style and characters!

Conclusion

While this does a pretty decent job at generating new fonts with proper kerning, there is room for improvement. An obvious improvement is that the mapping from a font to the low-dimensional style-space, Z, should be done with something other than t-SNE. t-SNE doesn’t allow for adding new fonts without completely remapping all the fonts and retraining the neural network. An autoencoder is likely well-suited for this purpose.

And in order for this to be truly useful on a large scale, the generated images of the fonts have to be converted into a vectorised font-format such as .woff, .otf, .ttf, or you-name-it. Having this in place would actually remove the need for the fuzzy bigram matching with simulated annealing, as the kerning is already incorporated into the bigrams.

Hopefully this piece went to show a framework for generating novel and fully fledged fonts with proper kerning of the characters, and hopefully it was interesting and somewhat enlightening at the same time!

Finally, the code used to do all of this can be found here on GitHub.