A team of students in the Applied Deep Learning class spent two months working on a deep learning model for converting an image to LaTeX. They wrote this article to describe their process and what they learned along the way:

Galileo, the father of modern science, said that “the book of nature is written in the language of mathematics,” and in modern times mathematics is written in the language of LaTeX. Scientific disciplines from psychology to artificial intelligence use LaTeX to communicate their brilliant ideas through beautiful, perfectly formatted equations. Unfortunately, transcribing an equation into LaTeX carries overhead. Im-2-LaTeX is a project that aims to reduce this overhead for scientists.




Problem Statement

To reduce the time a scientist spends writing a LaTeX equation, we created an automated process that translates images of formulas into LaTeX code. We hope that with this application our users can stop spending time learning and writing correct LaTeX and instead focus on what’s really important: their work. A scientist could capture an equation that already exists in a paper or on the internet and instantly get LaTeX code to modify for their own purpose. By leveraging deep learning, we trained a model that, in our experiments, reached a lower training loss and lower perplexity than the public state-of-the-art model for this task.







Previous Work



This project draws heavily on the Harvard paper What You Get Is What You See. In the paper, the authors use a neural encoder-decoder model, based on a scalable coarse-to-fine attention mechanism, to convert images into presentational markup. Our work stands on the shoulders of the Harvard research group: as the model section shows, we built upon their encoder-decoder architecture.





The Dataset



This problem was inspired by an OpenAI Request for Research prompt. The authors of the Harvard paper published a prebuilt dataset for image-to-LaTeX systems, Im2LaTeX-100K. It contains about 100k LaTeX formulas collected from arXiv, together with their rendered images, split into train, validation, and test sets. Each image is a fixed-size PNG in which the formula is black and the rest of the image is transparent. Before model training, we performed heavy preprocessing on the data. One example is equation normalization, which is needed because the same equation can be written in multiple ways.
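To make the idea of equation normalization concrete, here is a minimal Python sketch. The rules it applies (splitting a formula into tokens, dropping redundant whitespace, and giving single-token subscripts and superscripts explicit braces) are assumptions chosen for illustration, not the actual rules used to prepare Im2LaTeX-100K, and `tokenize` and `normalize` are hypothetical helper names.

```python
import re
from typing import List

# Hypothetical tokenizer: a LaTeX command (\frac, \alpha), an escaped
# symbol (\{, \,), or any single non-space character.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")


def tokenize(formula: str) -> List[str]:
    """Split a LaTeX formula into tokens, discarding whitespace."""
    return TOKEN_RE.findall(formula)


def normalize(formula: str) -> str:
    """Map equivalent spellings of a formula to one canonical token string.

    Assumed rule for illustration: a subscript or superscript that applies
    to a single token is rewritten with explicit braces, so `x^2` and
    `x^{2}` produce the same training target.
    """
    tokens = tokenize(formula)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        out.append(tok)
        if tok in ("^", "_") and i + 1 < len(tokens) and tokens[i + 1] != "{":
            out.extend(["{", tokens[i + 1], "}"])
            i += 1  # the argument has already been emitted
        i += 1
    return " ".join(out)
```

With these rules, `normalize("x^2")` and `normalize("x ^ {2}")` both return `x ^ { 2 }`, so the model sees a single target sequence for both spellings.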



Data example:





Model Architecture



During experimentation, we tested two models. The first used a naive CNN encoder and a GRU decoder with Bahdanau attention. We treated this model as a baseline because it was already implemented in an image captioning tutorial for TensorFlow 2.0, making it relatively straightforward to use on our dataset.





Our final model architecture was based on the Harvard paper: we essentially used TensorFlow 2.0 to implement a model following the specifications given in the paper. The model consisted of three main components:





Convolutional neural network (CNN) encoder

Bidirectional LSTM row encoder

LSTM decoder with Luong-style attention





We designed the convolutional neural network without a fully connected layer so that it can handle inputs of any shape. The row encoder localizes relative positions in the image by scanning through each row. The final component is an LSTM decoder with a Luong attention mechanism, which builds a context vector at each decoding step so that the decoder can focus on the relevant part of the image.
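The attention step can be sketched in a few lines of plain Python. This is an illustration rather than our TensorFlow implementation: it assumes the simplest Luong scoring function (a dot product between the decoder state and each encoder state), and `luong_dot_attention` is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def luong_dot_attention(decoder_state, encoder_states):
    """Return (context, weights) for one decoding step.

    decoder_state:  list of floats, the current decoder hidden state.
    encoder_states: list of such lists, one per position in the encoded
                    feature grid (flattened row by row).
    """
    # Score each encoder position by its dot product with the decoder state.
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states
    ]
    weights = softmax(scores)
    # The context vector is the attention-weighted average of encoder states.
    dim = len(decoder_state)
    context = [
        sum(w * enc[k] for w, enc in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return context, weights
```

In Luong-style attention the context vector is then combined with the decoder state to predict the next token. Positions whose encoder state aligns with the decoder state receive most of the weight.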













Training



During training, we experimented with several key components to improve on the baseline. For optimizers, we tried stochastic gradient descent (SGD), Adam, and RMSprop. For initialization, we tried Glorot uniform and He normal. We also experimented with different learning rates and batch sizes.



After multiple tests, we found that the SGD optimizer with a learning rate that is updated when the loss plateaus, a batch size of 32, and the He normal initializer gave the best results. The loss plot can be seen below.
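A plateau-based learning rate schedule can be sketched as follows. The class name `ReduceOnPlateau` and the default values for `factor`, `patience`, `min_delta`, and `min_lr` are assumptions for illustration, not the settings from our runs. Keras offers a comparable built-in callback, `tf.keras.callbacks.ReduceLROnPlateau`.

```python
class ReduceOnPlateau:
    """Lower the learning rate when the monitored loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=2, min_delta=1e-4, min_lr=1e-6):
        self.lr = lr
        self.factor = factor        # multiply lr by this on a plateau
        self.patience = patience    # epochs without improvement to tolerate
        self.min_delta = min_delta  # smallest change that counts as progress
        self.min_lr = min_lr        # never go below this learning rate
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, loss):
        """Record one epoch's loss and return the learning rate to use next."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.stale_epochs = 0
        return self.lr
```

Calling `step` once per epoch with the validation (or training) loss keeps the learning rate high while the model is still improving and shrinks it only once progress stalls.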









We compared the loss of the baseline model, the state-of-the-art model, and our experimental model, as shown above. The loss is calculated using the cross-entropy loss function, and the training loss of our best-performing model was lower than that of the state-of-the-art model.





In addition, we adopted an evaluation metric from the Harvard paper, the perplexity score:









The perplexity scores on the training and validation datasets were also lower than those of the state-of-the-art model.
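For readers unfamiliar with the metric, the sketch below shows how perplexity relates to cross-entropy. It assumes the common definition, the exponential of the mean per-token negative log-likelihood measured in nats, and the function names are hypothetical.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-probability assigned to the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(cross_entropy(token_probs))
```

Here `token_probs` holds the probability the model assigned to each correct token in a formula. A model that is always certain and correct has a perplexity of 1, while a model that spreads its probability evenly over four candidates at every step has a perplexity of 4, so lower is better.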





Sample Images



Below are some examples of images given to the model and the rendered output predictions. All images were taken from the test set - a subset of the data that was never used for model training or hyperparameter tuning.














The most surprising characteristic of the model is its ability to write complete LaTeX syntax that we can render with no changes. Only the last example, with multiple equations, needed \end{array} added to the end for it to render. Eventually we would like to deploy the model so that users can take screenshots of equations and try it out themselves.







Current Challenges & Future Work



Although our final model achieves high predictive performance on the dataset, it still gives poor predictions on conventional screenshots of equations. This points to a significant distribution shift between the pre-processed input images in the dataset and the images expected from regular users (see figure below).





The above figure illustrates how differently the model behaves on an image from the dataset (pre-processed) and an image from a screenshot. (A) shows that the model gives an accurate output prediction on the dataset image. (B) shows that the model gives a poor output prediction on the screenshot image.



To deploy our model as a useful software tool, we must first address this distribution shift. From our initial experiments, we identified the rigid pre-processing pipeline for the dataset as the likely main cause: the pre-processed input images are not an accurate representation of expected user inputs. We are currently exploring two main methods to address this issue:



1. Create a flexible pre-processing pipeline with random image augmentation - the augmentations should capture the distortions that are common in screenshot images

2. Replace hard-coded pre-processing with an additional deep learning component that can learn the optimal mapping from raw to preprocessed images - although this may be more difficult, we believe that it would represent the most promising route towards general optical character recognition (OCR)
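As a rough illustration of the first method, the sketch below applies two random distortions to a grayscale image stored as nested lists. The choice of distortions (random margins and small pixel noise), the parameter defaults, and the name `augment` are all assumptions; which distortions best match real screenshots is still an open question for us.

```python
import random


def augment(image, rng, max_pad=8, max_noise=10, background=255):
    """Return a randomly padded, lightly noised copy of a grayscale image.

    image: list of rows, each a list of ints in [0, 255].
    rng:   a random.Random instance, so augmentations are reproducible.
    """
    top, bottom, left, right = (rng.randint(0, max_pad) for _ in range(4))
    width = len(image[0]) + left + right

    # Random margins imitate loosely cropped screenshots.
    padded = [[background] * width for _ in range(top)]
    for row in image:
        padded.append([background] * left + list(row) + [background] * right)
    padded.extend([background] * width for _ in range(bottom))

    # Small per-pixel noise imitates compression and anti-aliasing artifacts.
    return [
        [min(255, max(0, px + rng.randint(-max_noise, max_noise))) for px in row]
        for row in padded
    ]
```

Applying a fresh random augmentation each time an image is drawn during training exposes the model to many variations of the same formula, instead of only the single tightly cropped rendering stored in the dataset.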





Conclusion



We are excited about the performance of the model and look forward to continuing to improve it. We plan to use Weights & Biases’ newly released hyperparameter tuning feature and to adjust the dataset so that the acceptable inputs are less rigid.

Thank you to Weights & Biases for putting on the Applied Deep Learning course, which allows us to work on amazing projects with other deep learning enthusiasts.

