Since publishing our post about “Extracting Structured Data From Recipes Using Conditional Random Fields,” we’ve received a tremendous number of requests to release the data and our code. Today, we’re excited to release the roughly 180,000 labeled ingredient phrases that we used to train our machine learning model.

You can find the data and code in the ingredient-phrase-tagger GitHub repo. Instructions are in the README and the raw data is in nyt-ingredients-snapshot-2015.csv.

There are some things to be aware of before using this data:

The ingredient phrases were manually annotated by people hired by The New York Times, whose efforts were instrumental to the success of our model.

The data can be inconsistent and incomplete. But what it lacks in quality, it makes up for in quantity.

There is not a tag for every word, and some words carry multiple tags.

We have spent little time optimizing the conditional random fields (CRF) features and settings because the initial results met our accuracy needs. We would love to receive pull requests that improve accuracy further.
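Because the data is incomplete in places, it can help to measure how often each column is empty before training on it. The following is only a minimal sketch, not part of the released code: the helper `summarize_missing` and the column names in the in-memory sample are hypothetical, so check the header of the actual CSV before relying on any of them.

```python
import csv
import io
from collections import Counter


def summarize_missing(rows):
    """Return (row_count, {column: number_of_rows_with_an_empty_value})."""
    missing = Counter()
    total = 0
    for row in rows:
        total += 1
        for column, value in row.items():
            if column is None:
                # DictReader stores surplus fields under the key None; skip them.
                continue
            if value is None or not value.strip():
                missing[column] += 1
    return total, dict(missing)


# Tiny in-memory stand-in for the real file. These column names are
# illustrative assumptions, not the documented schema of the release.
sample = io.StringIO(
    "input,name,qty,unit\n"
    "1 cup flour,flour,1,cup\n"
    "Salt to taste,salt,,\n"
)

total, missing = summarize_missing(csv.DictReader(sample))
print(total, missing)
```

To try this on the released data, pass `csv.DictReader(open("nyt-ingredients-snapshot-2015.csv", newline=""))` instead of the sample, adding an `encoding` argument if your platform's default does not match the file.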

Examples