I consume a lot of content on Reddit. I spend most of my time lurking, hoping to discover the best things. I love the serendipity.

I also like to browse subreddits where users post only images, /r/spaceporn to name one.

I also like to download those images and build private collections to use as wallpapers. But I find it tedious to open each image and save it by hand.

That’s why I decided to write a small script that downloads an entire subreddit for me! :) I’m lazy enough to spend some time on it!

It turns out that with a one-liner you can download everything you see on a subreddit!

Copy, Paste and Run!

curl -s -H "User-Agent: cli:bash:v0.0.0 (by /u/codesharer)" \

https://www.reddit.com/r/pics/.json \

| jq '.data.children[].data.url' \

| xargs -P 0 -n 1 -I {} bash -c 'curl -s -O {}' -s -H "User-Agent: cli:bash:v0.0.0 (by /u/codesharer)"'.data.children[].data.url'-P 0 -n 1 -I {} bash -c 'curl -s -O {}'

With this one-liner you can download an entire subreddit page in parallel. Let’s break down each command and see what is going on.

The curl command downloads the JSON view of the subreddit, /r/pics in this case. It runs in silent mode and sends a custom User-Agent header that tells Reddit this is a custom application and not a browser. This way Reddit will not hard-limit your requests.
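If you are curious about what that JSON looks like, you can poke around with jq before extracting anything. These exploratory commands are just a sketch; the fields they touch (.data.children and .data.after) are the same ones the one-liner and the script below rely on:

# Peek at the top-level keys of a subreddit listing
curl -s -H "User-Agent: cli:bash:v0.0.0 (by /u/codesharer)" https://www.reddit.com/r/pics/.json | jq '.data | keys'

# Show the title and external url of the first submission
curl -s -H "User-Agent: cli:bash:v0.0.0 (by /u/codesharer)" https://www.reddit.com/r/pics/.json | jq '.data.children[0].data | {title, url}'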

jq '.data.children[].data.url'

The result of this command is piped into jq, which extracts only the external URL of each submission.
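Note that without the -r flag jq prints each URL as a quoted JSON string; xargs happens to strip the surrounding quotes when it splits its input, which is why the one-liner still works. If you prefer plain, unquoted lines (for example to save them to a file), you can ask jq for raw output instead; this is just an optional variation:

# Same extraction, but with raw (unquoted) output
jq -r '.data.children[].data.url'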

xargs -P 0 -n 1 -I {} bash -c 'curl -s -O {}'

Then the magic! With the xargs utility, you can download each URL in parallel. The -P 0 argument tells xargs to run as many processes in parallel as it can, and the -n 1 argument instructs it to process one argument at a time. The bash command then downloads each URL using curl; the -O option tells curl to save the file under its remote name.
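If you would rather not let xargs spawn an unbounded number of downloads at once, you can cap the parallelism instead; the value 4 below is just an arbitrary example:

# Download at most 4 urls at a time instead of unbounded parallelism
xargs -P 4 -n 1 -I {} bash -c 'curl -s -O {}'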

I know, there is a lot of stuff to digest!

Cobbling Together a Script

Believe it or not, that’s all! But of course, I want to grab an entire subreddit, not just its first page. Reddit listings are paginated: each page carries an after field pointing to the next page, and you can follow it until it becomes null. That’s why I have prepared a script that sifts through the entire listing and downloads all the URLs.

#!/bin/bash
# By Nicola Malizia https://unnikked.ga under MIT License

USER_AGENT="User-Agent: cli:bash:v0.0.0 (by /u/codesharer)"

if [ $# -ne 1 ]; then
	echo "USE: $0 subreddit"
	exit 1
fi

# Get the first page
DATA="$(curl -s -H "$USER_AGENT" https://www.reddit.com/r/$1/.json)"
AFTER="$(echo "$DATA" | jq '.data.after')"

# Parallel download all the links
echo "$DATA" | jq '.data.children[].data.url' | xargs -P 0 -n 1 -I {} bash -c 'curl -s -O {}'

# Iterate over the listing and get all links
while [[ $AFTER != "null" ]]; do
	DATA="$(curl -s -H "$USER_AGENT" https://www.reddit.com/r/$1/.json?after=${AFTER:1:-1})"

	# Parallel download all the links
	echo "$DATA" | jq '.data.children[].data.url' | xargs -P 0 -n 1 -I {} bash -c 'curl -s -O {}'

	AFTER="$(echo "$DATA" | jq '.data.after')"
done

To use this script, just specify the subreddit name on the command line:

~$ ./scrape_sub spaceporn

Remarks

This script is not battle tested, so it might have some quirks.

It downloads whatever the external link points to, as-is: if the link is an image it will download an image, if it is a web page it will download a web page, and so on.
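If you only care about images, you can filter the URLs by extension before handing them to xargs. This is a rough sketch; it will miss images served from URLs that do not end with a file extension:

echo "$DATA" | jq -r '.data.children[].data.url' | grep -iE '\.(jpe?g|png|gif)$' | xargs -P 0 -n 1 -I {} bash -c 'curl -s -O {}'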

Consider setting a custom User-Agent to prevent rate limits, using the format <platform>:<app ID>:<version string> (by /u/<reddit username>) as explained in the Reddit API Wiki.
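For example, if your Reddit username were jane_doe (a made-up name here, as are the app name and version), a conforming header could look like this:

# Hypothetical custom User-Agent following the recommended format
USER_AGENT="User-Agent: linux:subreddit-scraper:v1.0.0 (by /u/jane_doe)"
curl -s -H "$USER_AGENT" https://www.reddit.com/r/pics/.json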

Conclusions

It was a nice experience to use existing tools to achieve this goal. If you are a Linux user like me and you want to automate something, always ask yourself whether you can do it without writing a full program. These are the occasions to learn more about the vast world of Linux!