SnooScraper

A small program to scrape content from a subreddit, a reddit user, or various popular image sites, downloading files that match your criteria.

You can scrape a subreddit or a username, downloading posts and the content they link to, depending on the options you pick. I have tested it by downloading large amounts of data and it works well. There are several parameters to customise what kind of content to get and where to get it from: the more you turn on, the longer a scrape will take, but the more content you will be able to download.

Animated GIFs and videos, for example, can be large and quickly eat up valuable disk space, so downloading them can be turned off. Another issue on reddit is that images are often not linked directly but via webpages on sites like imgur. I have included functions which follow those links and fetch the data from the external sites so that nothing is missed. These functions can also be called separately via the command line options, for example to download a whole imgur album on its own.

My aim is to write SnooScraper in POSIX-compliant shell script for portability and efficiency, so it should work in any shell. I've tried it in bash and dash on GNU/Linux, and in bash in Cygwin on Windows so far. It should also work on other operating systems that I haven't yet tried.

SnooScraper is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Supported External Sites: imgur, instagram, pinterest, tumblr

Getting SnooScraper

If you use Arch Linux or an Arch-based GNU/Linux distribution, you can get SnooScraper from the AUR.

Other than a shell and the standard tools curl, sed and grep, the only dependency is jq. jq is available in most GNU/Linux distributions, on Homebrew for macOS, and in Cygwin on Windows, or it can be downloaded from GitHub.
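To show what jq is needed for: reddit serves any listing as JSON when .json is appended to its URL, with posts nested under .data.children[].data. A minimal sketch of the extraction (the canned listing below stands in for a live response; the subreddit in the commented curl line is just an example):

```shell
# In real use the listing JSON would come from something like:
#   curl -s -A "my-scraper" "https://www.reddit.com/r/pics/new.json?limit=5"
# Here a canned one-post listing stands in for the live response.
listing='{"data":{"children":[{"data":{"id":"abc123","url":"https://i.imgur.com/x.jpg"}}]}}'

# Print each post's ID and linked URL, one per line.
printf '%s\n' "$listing" | jq -r '.data.children[].data | "\(.id) \(.url)"'
```

The same filter works unchanged on a full listing with many children, which is what makes jq a better fit here than grep/sed for nested JSON.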

You can get the script and default config file by downloading and extracting a release, or by cloning this repository, then make the script executable:

git clone https://notabug.org/odg/SnooScraper.git # if you haven't downloaded it
cd SnooScraper
chmod +x snooscraper
./snooscraper -h

Using SnooScraper

The help text (-h) explains the command line options. Review the parameters in the config file and customise them as you like before use. Basically, a subreddit name or username can be passed to the script to download the contents of its posts, or URLs of the supported external sites can be passed directly. Any files already present will not be downloaded again; delete them to force redownloading.
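The "don't redownload" rule amounts to an existence check before each fetch. A sketch of the idea (not the script's actual code; the function name is made up):

```shell
# Fetch a URL to a target file only when that file is not already
# present; existing files are skipped, so deleting one forces a
# redownload on the next run.
fetch_if_missing() {
  url=$1
  file=$2
  if [ -e "$file" ]; then
    echo "skipping $file (already present)"
  else
    curl -s -o "$file" "$url"
  fi
}
```

Because the check is per-file, interrupted scrapes can simply be rerun and will only fetch what is missing.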

Files are named after the ID (a base 36 number) of the reddit post from which they originate, or a Unix timestamp if the URL is passed to the script directly. Reddit post IDs are zero-padded, so sorting them alphabetically also sorts them chronologically. Images in albums get a zero-padded number suffix after a dash, for example: abcxyz-01.jpg, abcxyz-02.jpg...
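The album suffixes above come down to printf's zero-padding, which is what keeps alphabetical and chronological order aligned. A sketch (the post ID and function name are made up):

```shell
# Print zero-padded album filenames for a post ID: %02d pads the
# index to two digits, so -02 sorts before -10 alphabetically.
album_names() {
  id=$1
  count=$2
  i=1
  while [ "$i" -le "$count" ]; do
    printf '%s-%02d.jpg\n' "$id" "$i"
    i=$((i + 1))
  done
}

album_names abcxyz 3
```

This prints abcxyz-01.jpg, abcxyz-02.jpg, abcxyz-03.jpg, matching the naming scheme described above.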

A cleaning function (-c) is also included. This attempts to remove any duplicates (i.e. reposts), empty files (from 404 errors or other failed downloads) or files with the wrong extension (depending on configuration) in a given directory.
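What such a cleaning pass does can be pictured in portable shell. This is a simplified sketch, not the script's actual implementation: it deletes zero-byte files and keeps only the first copy of byte-identical duplicates (the function name is made up, and cksum stands in for whatever duplicate detection the script uses):

```shell
# Remove zero-byte files (failed downloads) in a directory, then
# delete later copies of byte-identical files, keeping the first in
# sorted (i.e. chronological) order.
clean_dir() {
  dir=$1
  find "$dir" -type f -size 0 -exec rm -f {} +
  seen=""
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    sum=$(cksum < "$f" | cut -d' ' -f1)
    case " $seen " in
      *" $sum "*) rm -f "$f" ;;        # checksum seen before: duplicate
      *) seen="$seen $sum" ;;          # first occurrence: keep it
    esac
  done
}
```

Keeping the first copy in sorted order means the oldest post's file survives, consistent with the chronological naming scheme.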

Planned Features

In the future I plan to add support for more sites (twitter, flickr, wikimedia, maybe more) and the ability to scrape multiple subreddits/accounts simultaneously. I also plan more options: changing the sort method, narrowing down by upvotes, starting/ending searches at different dates, etc. Also under consideration are different download methods to increase speed and better handle parallel downloads.

Thanks