It was at this point that I came up with the idea for an app that would let users enter their favorite restaurant and their favorite menu item from that restaurant. The app would then return similar menu items at other restaurants, along with details such as restaurant location, menu item price and Yelp rating.

Scraping restaurant menu data using BeautifulSoup

Obtaining restaurant menu item data proved to be much more challenging than I had originally thought. I found a few APIs, such as OpenMenu, Grubhub and Locu, but each either didn't allow me to sign up or was too limited in the amount of data it provided. After days of panicking, I finally found two websites, menupages.com and allmenus.com, that allowed me to scrape menu data from a variety of restaurants in many cities. I went with allmenus.com, as its HTML was much cleaner to work with. I scraped a set of over 75,000 menu items from more than 1,000 restaurants, using the BeautifulSoup library to extract the data I needed: each restaurant's name, address and Yelp rating, along with the name and description of each menu item. I then cleaned the text by removing words that appeared infrequently, as well as stop-words such as "the" and "for."
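To give a flavor of the scraping step, here is a minimal sketch of the approach using requests and BeautifulSoup. The URL and CSS class names are placeholders for illustration, not allmenus.com's actual markup, and the stop-word list is an abbreviated stand-in.

```python
import requests
from collections import Counter
from bs4 import BeautifulSoup

# Placeholder URL and class names; the real allmenus.com markup differs.
url = "https://www.allmenus.com/ca/san-francisco/some-restaurant/menu/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

menu_items = []
for item in soup.find_all("li", class_="menu-item"):           # assumed class name
    name = item.find("span", class_="item-title")               # assumed class name
    desc = item.find("p", class_="item-description")            # assumed class name
    menu_items.append({
        "name": name.get_text(strip=True) if name else "",
        "description": desc.get_text(strip=True) if desc else "",
    })

# Basic cleaning: lowercase, drop stop-words and words that appear only once.
stop_words = {"the", "for", "and", "with", "a", "of"}            # abbreviated list
tokens = [(d["name"] + " " + d["description"]).lower().split() for d in menu_items]
counts = Counter(w for doc in tokens for w in doc)
cleaned = [[w for w in doc if w not in stop_words and counts[w] > 1] for doc in tokens]
```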

NLP-based algorithms: cycling through my options

When I came to Insight, I knew that I wanted to work on a project that used Natural Language Processing (NLP)-based algorithms. This was further reinforced when I saw some of the alumni projects that demonstrated the power of NLP. The big problem, however, was that I knew nothing about NLP and how to use it.

Typically, one can use Google and StackOverflow to get the answers one needs. The strength of the Insight program, though, is the number of alumni with a diverse range of skills who are willing to help Fellows with their projects. In the second week, Chris Moody from Stitch Fix (inventor of lda2vec) came in to talk to us about different NLP-based algorithms and when each is typically used. This is what I gathered from listening to him:

Latent Semantic Analysis (LSA) / Latent Dirichlet Allocation (LDA): These algorithms look at the words appearing in each document (in this case, each menu item would be a document), compare them to the words appearing in other documents, and classify each document by a set of "topics". Since my intention was not to classify each food item, I did not pursue these algorithms.
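For completeness, here is what a topic-model pass over a menu corpus might look like with scikit-learn's LatentDirichletAllocation; the toy documents and topic count are made up for illustration, since I didn't end up going this route.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy "documents" (menu items); the real corpus had 75,000+ of these.
docs = [
    "mushroom crepe with cheese",
    "two eggs with bacon and potatoes",
    "burger with fries",
    "ham and cheese crepe",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # each row is one document's topic mixture
print(doc_topics)
```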

Word2vec: This algorithm looks at the entire corpus of text and assigns each word to a point in vector space. The more often two words appear close to each other in the corpus, the closer they are in vector space. So in the context of menu items, one can imagine that "bacon" and "eggs" or "burger" and "fries" would probably end up very close in vector space.
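As a quick illustration, here is roughly how one might train such a model with gensim (the 4.x API; older versions use size= instead of vector_size=). The tokenized menu items below are toy data.

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized menu items (illustrative only)
sentences = [
    ["two", "eggs", "bacon", "potatoes"],
    ["eggs", "scrambled", "bacon", "cheese"],
    ["burger", "fries"],
    ["cheeseburger", "fries", "pickle"],
]

# Words that co-occur frequently end up close together in vector space
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=0)
print(model.wv.most_similar("bacon", topn=3))
```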

Term frequency (TF): This algorithm transforms your corpus of text into a matrix, where the number of rows is the total number of documents and the number of columns is the total number of unique words appearing across all the documents. Each column corresponds to one word, and each row to one document (in this case, a menu item). The number of times a particular word appears in a document populates the corresponding entry in the TF matrix.
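In scikit-learn terms, this is what CountVectorizer produces; a small sketch with two made-up menu items:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["mushroom crepe with cheese", "mushroom crepe with tomato"]   # toy documents

vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(docs)            # rows: documents, columns: unique words

# get_feature_names_out() requires scikit-learn >= 1.0 (get_feature_names on older versions)
print(vectorizer.get_feature_names_out())      # the vocabulary (column labels)
print(tf.toarray())                            # raw term counts per menu item
```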

Term frequency-inverse document frequency (tf-idf): This algorithm is very similar to TF described above, except that each term count is scaled by how well that word differentiates between documents: words that appear in nearly every document are weighted down, while rarer, more distinctive words are weighted up.
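A minimal sketch with TfidfVectorizer, again on toy menu items, shows the effect: a word like "cheese" that appears in most documents gets a lower idf weight than a rarer word like "tomato".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus in which "cheese" is ubiquitous and "tomato" is rare
docs = [
    "mushroom crepe with cheese",
    "mushroom crepe with tomato",
    "grilled cheese sandwich",
    "three cheese omelette",
]

tfidf = TfidfVectorizer(stop_words="english")
weights = tfidf.fit_transform(docs)            # tf-idf matrix: documents x words

# Inverse document frequency per word: common words get smaller values
print(dict(zip(tfidf.get_feature_names_out(), tfidf.idf_)))
```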

Settling on tf-idf representation of my text corpus

In my case, it was clear that I wanted to find menu items that were similar to the one I was craving: in this case, a mushroom crepe. To achieve this, I would first have to map the entire corpus of menu items I had scraped to vector space. Each menu item would then be a point, and to quantitatively measure how similar two menu items are, I used a metric called "cosine similarity". For those familiar with basic linear algebra, this is essentially the cosine of the angle between the two vectors; for those who aren't, it is a number between 0 and 1, where 1 indicates that the two items are very similar (cos 0) and 0 indicates that they are completely dissimilar (cos pi/2).
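Here is a small sketch of the metric on two toy count vectors (the numbers are made up):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Two toy menu-item vectors, e.g. term counts over a shared four-word vocabulary
a = np.array([1, 1, 0, 1])
b = np.array([1, 1, 1, 0])
print(cosine_similarity(a, b))   # ~0.67: similar, but not identical
```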

So imagine there were two menu items, "Two eggs with bacon and potatoes" and "Two eggs scrambled, bacon, potato, cheese"; we would hope that their cosine similarity would be close to one. With that goal in mind, I considered the different NLP-based algorithms that could transform my corpus into vector space.

I first considered using word2vec to transform my corpus. However, under word2vec each document (a list of words) ends up similar to others that are merely "semantically similar": an item with only bacon might be judged similar to one with just eggs, since the two words appear together frequently within the corpus. Since I was looking specifically for a mushroom crepe (and not for items that merely tend to appear alongside one), I passed on word2vec.

It then came down to deciding between using TF and tf-idf. Both would have probably worked fine for my use case, but the power of tf-idf is illustrated by this simple example.

Suppose a user searches for a menu item labeled simply "Mushroom crepe", and the corpus contains two similar items, "Mushroom crepe with cheese" and "Mushroom crepe with tomato". Which of these should rank higher? As it turns out, because "cheese" is more ubiquitous throughout the corpus (and hence a poorer differentiator of menu items) than "tomato", the tf-idf weighting ensures that "Mushroom crepe with cheese" has a larger cosine similarity to "Mushroom crepe" than "Mushroom crepe with tomato" does. This example persuaded me to use a tf-idf mapping for my app.

[Figure: tf-idf weights for "Mushroom crepe with cheese" vs. "Mushroom crepe with tomato"]
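A rough sketch of this comparison, using scikit-learn's TfidfVectorizer and cosine similarity on a toy corpus in which "cheese" shows up everywhere and "tomato" only once:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: "cheese" appears in most documents, "tomato" in only one
corpus = [
    "mushroom crepe with cheese",
    "mushroom crepe with tomato",
    "grilled cheese sandwich",
    "mac and cheese",
    "three cheese omelette",
    "ham and cheese crepe",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus)

query = vectorizer.transform(["mushroom crepe"])
scores = cosine_similarity(query, doc_vectors).ravel()

# The ubiquitous "cheese" carries little weight, so the cheese crepe ends up
# closer to the query than the tomato crepe (roughly 0.90 vs 0.73 here)
print(scores[0], scores[1])
```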

Putting the algorithm to use: the web app points me to my mushroom crepe

I used this algorithm to build an app that searched for items similar to my favorite mushroom crepe from Ti-Couz and returned the five most similar menu items, ranked by cosine similarity. Boom! The very first result was a mushroom crepe from a restaurant called Café Europa. It did indeed come with a mushroom sauce, and I went there at the first opportunity. After six years of waiting, sinking my teeth into that crepe was incredibly satisfying.
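For the curious, the core lookup boils down to something like the following sketch (the function and variable names are hypothetical, and the real app also returned restaurant location, price and Yelp rating alongside each match):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_similar_items(query, menu_items, n=5):
    """Return the n menu items most similar to the query, ranked by tf-idf cosine similarity."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_vectors = vectorizer.fit_transform(menu_items)      # fit on the scraped corpus
    query_vector = vectorizer.transform([query])             # map the craving into the same space
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    best = np.argsort(scores)[::-1][:n]                      # indices of the top-n matches
    return [(menu_items[i], scores[i]) for i in best]

# Hypothetical usage on the scraped corpus of menu item strings:
# results = top_similar_items("mushroom crepe", scraped_menu_items, n=5)
```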