Part I: Scraping and preparing the data

In order to train a model at all, you need enough data (data augmentation and fine-tuning of pre-trained models can serve as a remedy when data is scarce). Only with a sufficiently large amount of data can generalization beyond the training set be improved to some degree and high accuracy be achieved on a test set. The first part of this tutorial deals with data acquisition and with the analysis and visualization of features and their relationships.

Shameless plug: I’m working on a Python code editor that simplifies data analysis and data plotting. More information is available at: Möbius Code Editor

Peter Norvig, Google’s Director of Research, said in an interview in 2011:

We do not have better algorithms. We just have more data.

In any case, the quality and quantity of the data set are not negligible. That is why Europe’s biggest cooking platform will be scraped: every recipe, in the end 316'756 recipes (as of December 2017), is downloaded together with a total of 879'620 images. It is important not to proceed too fast when downloading and not to overload the servers with too many requests, since otherwise a ban of one’s own IP address would make the data collection much harder.

More data leads to more dimensions, but more dimensions do not necessarily lead to a better model or a better representation. Outlier patterns in the data set that disturb learning can be unintentionally amplified by additional dimensions; generalization over the data set becomes harder for the neural network, and the signal-to-noise ratio decreases.

All 300k recipes sorted by date: http://www.chefkoch.de/rs/s30o3/Rezepte.html

When scraping a website, it is important to respect the robots.txt file. Some administrators do not want bots to visit specific directories. https://www.chefkoch.de/robots.txt provides:

User-agent: *  # directed at all spiders, not just Scooter

Disallow: /cgi-bin
Disallow: /stats
Disallow: /pictures/photoalbums/
Disallow: /forumuploads/
Disallow: /pictures/user/
Disallow: /user/
Disallow: /avatar/
Disallow: /cms/
Disallow: /products/
Disallow: /how2videos/

Listed are only directories that do not interest us, so we can confidently continue. Nevertheless, measures such as random headers and sufficiently long pauses between individual requests are recommended to avoid a possible ban from the website (I learned this the hard way on another project).
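The scraping code below calls a helper random_headers() that is not shown in the shortened snippets. A minimal sketch of such a helper, where the user-agent strings are just example values and any realistic pool will do:

import random

# Sketch of the random_headers() helper used by the scraper below.
# The user-agent strings are merely examples, not part of the original code.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/604.4.7 (KHTML, like Gecko) Version/11.0.2 Safari/604.4.7',
    'Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0',
]

def random_headers():
    # Pick a different user agent for every request to look less like a bot.
    return {'User-Agent': random.choice(USER_AGENTS),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}

With that helper in place, the scraping itself can begin: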

import time

import requests
from bs4 import BeautifulSoup

# Chefkoch.de Website
CHEFKOCH_URL = 'http://www.chefkoch.de'
START_URL = 'http://www.chefkoch.de/rs/s'
CATEGORY = '/Rezepte.html'

category_url = START_URL + '0o3' + CATEGORY


def _get_html(url):
    # Retry until the page could be fetched; wait 10 seconds after a failure.
    page = ''
    while page == '':
        try:
            page = requests.get(url, headers=random_headers())
        except requests.exceptions.RequestException:
            print('Connection refused')
            time.sleep(10)
            continue
    return page.text


def _get_total_pages(html):
    # The last pagination link holds the total number of result pages.
    soup = BeautifulSoup(html, 'lxml')
    total_pages = soup.find('div', class_='ck-pagination qa-pagination') \
                      .find('a', class_='qa-pagination-pagelink-last').text
    return int(total_pages)


html_text_total_pages = _get_html(category_url)
total_pages = _get_total_pages(html_text_total_pages)
print('Total pages: ', total_pages)

Total pages: 10560

The next important step is feature selection, that is, leaving out unimportant data. Preparing raw data for the neural net is commonplace in practice. In the first pass, the recipe name, the average rating, the number of ratings, the difficulty level, the preparation time and the publication date are downloaded. In the second pass follow the ingredient list, the recipe text, all images and the number of times the recipe has been printed. These features describe the data set very well and help to build a solid understanding of it, which is important for selecting the algorithms.

Data such as the recipe name, the rating and the upload date of the recipe are stored in a CSV file. If the recipe has an image, the thumbnail is placed in the search_thumbnails folder. We make use of multiprocessing to shorten the download time; for further information see Python’s documentation.
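The snippet below maps scrap_main over a url_list that is not constructed in the shortened code. A minimal sketch of how it could be built from total_pages, assuming the result pages advance in steps of 30 recipes (s0o3, s30o3, s60o3, ...):

# Sketch: build the list of result-page URLs from the total page count.
# Assumption: each result page shows 30 recipes, so the offset grows by 30.
url_list = [START_URL + str(i * 30) + 'o3' + CATEGORY
            for i in range(total_pages)]

With url_list in place, the actual scraping can be parallelized: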

from multiprocessing import Pool


def scrap_main(url):
    print('Current url: ', url)
    html = _get_html(url)
    # _get_front_page() (not shown here) parses the result page and stores
    # the overview data and thumbnails described above.
    _get_front_page(html)
    # sleep(randint(1, 2))


start_time = time.time()
with Pool(15) as p:
    p.map(scrap_main, url_list)
print("--- %s seconds ---" % (time.time() - start_time))

Please note the given code has been shortened. For the full code visit the corresponding Jupyter Notebook.
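One of the shortened parts is the thumbnail download mentioned above. A minimal sketch of how it might look, where download_thumbnail, img_url and recipe_id are hypothetical names and only the search_thumbnails folder comes from the text above:

import os

SEARCH_THUMBNAILS_FOLDER = 'search_thumbnails/'  # assumed target folder


def download_thumbnail(img_url, recipe_id):
    # Fetch the thumbnail and store it under the recipe id.
    os.makedirs(SEARCH_THUMBNAILS_FOLDER, exist_ok=True)
    response = requests.get(img_url, headers=random_headers())
    with open(os.path.join(SEARCH_THUMBNAILS_FOLDER, str(recipe_id) + '.jpg'), 'wb') as f:
        f.write(response.content)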

Next we need to scrape the list of ingredients, the preparation instructions, the tags and all images of each recipe.

import csv


def write_recipe_details(data):
    # DATAST_FOLDER and DFILE_NAME are defined in the full notebook.
    dpath = DATAST_FOLDER + DFILE_NAME
    with open(dpath, 'a', newline='') as f:
        writer = csv.writer(f)
        try:
            writer.writerow((data['link'],
                             data['ingredients'],
                             data['zubereitung'],
                             data['tags'],
                             data['gedruckt:'],
                             data['n_pics']
                             # data['reviews'],
                             # data['gespeichert:'],
                             # data['Freischaltung:'],
                             # data['author_registration_date'],
                             # data['author_reviews']
                             ))
        except KeyError:
            # If a field is missing, write an empty row instead.
            writer.writerow('')
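The extraction of these fields from the recipe page itself is omitted above. A rough sketch of how such a scraper could feed write_recipe_details(); the function name scrap_recipe_details and all CSS selectors are assumptions and would have to be matched against the actual page markup:

def scrap_recipe_details(recipe_url):
    # Sketch only: the selectors below are placeholders, not the real
    # Chefkoch markup, and must be adapted to the actual page structure.
    soup = BeautifulSoup(_get_html(recipe_url), 'lxml')

    data = {
        'link': recipe_url,
        'ingredients': [li.get_text(strip=True)
                        for li in soup.select('.ingredients li')],    # placeholder selector
        'zubereitung': (soup.select_one('.instructions').get_text(strip=True)
                        if soup.select_one('.instructions') else ''),  # placeholder selector
        'tags': [a.get_text(strip=True)
                 for a in soup.select('.tags a')],                     # placeholder selector
        'gedruckt:': None,  # parsing of the print counter omitted in this sketch
        'n_pics': len(soup.select('.recipe-image img')),               # placeholder selector
    }
    write_recipe_details(data)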

If everything went smoothly with the download, our data looks like this: