Recently, a good friend of mine and I have been trying to teach ourselves machine learning. According to many industry experts, it’s a necessary next step in the programming world to stay gainfully employed for the next 10–15 years.

Note: If you’ve heard the same but aren’t sure where to start, I’d point you to this incredible Medium post and this Coursera course to start.

One of the things we quickly ran into was needing a data set. While there are plenty of incredibly interesting data sets in the world that you can download for free, we have specific interests (rocket launches, space exploration, Mars, etc.) and figured it would be fun to build our own by scraping.

Personally, I’d never built a scraper (or used Python, for that matter), but I figured it couldn’t be too hard. It took us about a day and a half to build our spiders, set up a system to run them on a daily schedule, and drop the data into an S3 bucket.

It was a bit difficult to find any sort of guide on the process, so I figured I’d write up how we did it and expose our process to the internet. If you see anything we did wrong, please feel free to tell us how we could have done things better. We’re learning.

Building the spiders

1. Install Scrapy. Their documentation is awesome; you should have no issues if you follow the installation guide.
2. Install dependencies: boto (might already be installed, depending on how you install Scrapy) and dotenv (so we don’t check AWS secrets into our VCS).
3. Set up a new Scrapy project with scrapy startproject directoryname

You should now have a scrapy.cfg file and a directory (named in the step above). Inside the directory, you’ll see what makes Scrapy tick. Primarily, we’ll be working with settings.py and the /spiders directory.

Let’s start building spiders. Drop into your /spiders directory and create a new spidername.py file. Here’s an example spider using both CSS and XPath selectors.

This is an incredibly simplified version but let’s be real. Your needs are going to vary and Scrapy’s documentation is 100x better than what I can write in this post. The important thing to realize is you can create multiple spider.py files and have different spiders for different sites.

Once you have your spiders up and running, the next step is to get your settings.py file set up correctly.

There is a fair amount of boilerplate code that Scrapy starts you out with. I’m only including the relevant parts I’ve changed. You’ll notice a couple of important things:

- We’re being good citizens of the internet by obeying robots.txt and manually checking each site’s Terms of Service before we scrape. We don’t scrape sites that ask not to be scraped.
- We’re using dotenv to protect our AWS access key ID and secret access key. This means you’ll have a .env file in the same directory as settings.py. You’ll want to include .env in your .gitignore.
- The super legit part of Scrapy is that those couple of options are all it needs to handle pushing to S3.
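Put together, the relevant slice of settings.py might look something like this. It’s a sketch: the bucket name is a placeholder, the environment-variable names are whatever you put in your .env, and we assume CSV feed exports (FEED_FORMAT and FEED_URI are Scrapy’s built-in feed-export settings):

```python
# settings.py -- only the parts changed from Scrapy's boilerplate
import os
from dotenv import load_dotenv

load_dotenv()  # pulls the AWS keys out of the .env file next to settings.py

# Be a good citizen: respect robots.txt
ROBOTSTXT_OBEY = True

# Credentials come from the environment, never from the repo
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")

# Scrapy's feed exports handle the push to S3 once these are set
FEED_FORMAT = "csv"
FEED_URI = "s3://[your-bucket-name]/%(name)s/%(time)s.csv"
```

The %(name)s and %(time)s tokens are expanded by Scrapy per run, so each spider writes its own timestamped file.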

Cool. We’ve got Scrapy all set. Spiders are built and settings.py is all set up to be pushing the data to S3 once we give it the correct credentials.

Setting up AWS

AWS can be fairly intimidating if you’re not familiar with it. We need to do two things:

1. Create an IAM user and get an access key ID and secret access key
2. Set up a bucket and add the correct permissions

IAM users can be a little tricky so I’d suggest reading AWS’s documentation. Essentially, you need to log into your AWS console, go into the IAM section, create a new user and generate an access key. I would recommend creating this user solely for the purpose of making API calls to your bucket. You’re going to need three strings from your created user. Save them somewhere safe.

User ARN

Access Key ID

Secret Access Key

Now, let’s set up a new bucket. It’s pretty simple. Create your bucket and then navigate to the “Permissions” tab. You need to set a bucket policy that allows the IAM user you just created to push data to this bucket. Here’s what ours looks like:
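A minimal write-only policy along these lines does the job (a sketch, assuming Scrapy only needs to put objects; [your-user-id] stands in for the User ARN you saved earlier):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowScraperPush",
      "Effect": "Allow",
      "Principal": { "AWS": "[your-user-id]" },
      "Action": ["s3:PutObject", "s3:PutObjectAcl"],
      "Resource": "arn:aws:s3:::[your-bucket-name]/*"
    }
  ]
}
```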

Swap out the [your-user-id] and [your-bucket-name] parts where necessary. It probably goes without saying, but don’t include the brackets.

Finally, add the access id & key to your .env file in your Scrapy project.
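The .env file itself is just two key=value lines. The variable names here are examples; they need to match whatever your settings.py reads from the environment:

```
AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
```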

Deploy to ScrapingHub

ScrapingHub is a nifty service run by the awesome folks who support Scrapy and a dozen or so other open source projects. It’s free for manually triggered spider crawls, and it has a very reasonably priced $9/month plan that allows one spider to run at any given time. For our needs, that let us schedule a different spider every 5 minutes, i.e. up to 288 unique spiders a day, as long as each one finishes in under 5 minutes.

We’re in the home stretch. This part is easy. You’ll create a ScrapingHub account, log in, and generate an API key. Then you’ll install shub, ScrapingHub’s CLI, and follow its directions to add your key and project ID.

Note: If you’ve obfuscated your AWS id & secret key with .env, then you’re going to need to comment out a couple of lines in your settings.py before deploying to ScrapingHub. Namely, the dotenv import/load lines and the AWS credential settings that read from the environment, since your gitignored .env file won’t be shipped with the deploy.

Once everything is set up you’ll use shub deploy to push your project to ScrapingHub.

The last thing you need to do is enter the Spider Settings area and manually add in the AWS credentials. You’ll notice we also added a timeout to make sure we were failing gracefully if any errors occurred.

You’re done! Well, almost.

I’d recommend running your spider manually at least once first, checking your S3 bucket for the generated file and making sure you’re happy with all the results. Once you’re happy with that, go over to the Periodic Jobs section of your project and set your spiders to run at whatever intervals you need. E.g. Daily at 04:30 UTC.

Summary

I realize this isn’t an entirely exhaustive guide to every line of code or button clicked, but hopefully it gets you on the right track. Up next we’re going to be taking our generated CSV files, normalizing the data, and inserting it into a Postgres database.

Follow me for the updates!