I usually don't write about short scripts that I've written, but this one might be useful to others. Link for the impatient.

I needed to download videos from Khan Academy so that I could watch them offline. That should be easy enough, right? The videos are hosted on YouTube, so it should just be a matter of finding a playlist and running get_flash_videos on all the URLs. Turns out this isn't the case: the playlists on YouTube do not match up with all the videos on the Khan Academy website. Argh.

I could try to go through each of the sections on the website and copy the URLs into a file, but doing that for 700 videos isn't my idea of a fun way to spend a couple of hours. I looked around for a way to download the videos, but all I found was this download page, which had an old torrent. I looked for an API and found one that was a bit under-documented. After trying to figure out the easiest way to use the API, I decided that unraveling the 10 MB JSON file returned by http://api.khanacademy.org/api/v1/topictree wasn't worth it. Time to scrape the site!
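For a sense of what "unraveling" that payload would involve: the topictree is a nested structure of topics containing children. Here's a minimal sketch in Python (my actual scripts are Perl) that walks such a tree and pulls out the video URLs. The miniature tree and the exact field names ("kind", "children", "url") are illustrative assumptions, not the real 10 MB response.

```python
import json

# A tiny made-up stand-in for the (roughly 10 MB) topictree response.
# Field names here are assumptions for illustration.
sample = json.loads("""
{
  "kind": "Topic",
  "title": "Root",
  "children": [
    {"kind": "Topic", "title": "Algebra", "children": [
      {"kind": "Video", "title": "Intro", "url": "http://youtu.be/abc"}
    ]},
    {"kind": "Video", "title": "Welcome", "url": "http://youtu.be/xyz"}
  ]
}
""")

def collect_videos(node, found=None):
    """Recursively gather (title, url) pairs from Video nodes, in tree order."""
    if found is None:
        found = []
    if node.get("kind") == "Video":
        found.append((node["title"], node["url"]))
    for child in node.get("children", []):
        collect_videos(child, found)
    return found

print(collect_videos(sample))
# → [('Intro', 'http://youtu.be/abc'), ('Welcome', 'http://youtu.be/xyz')]
```

Even this toy version hints at the problem: with hundreds of topics and no documentation on which fields to trust, scraping the rendered pages felt like less work.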

The final code as of this writing is here. The scraping code in download.pl isn't exactly great, but it does the job: it recursively follows child URLs and records them in a data structure, which is written out to ka-data.json. Then process.pl takes over and reads that data structure back in. The important thing here is that the files get written out with some way of preserving the playlist order. I use the order of the child URLs on each page to assign a numeric prefix to every directory and file, so that sorting by name reproduces the playlist order.
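The prefixing trick can be sketched like this (Python for illustration; the real scripts are Perl): zero-pad an index into each name so that a plain lexicographic sort of the filenames matches the original order of the children on the page.

```python
def prefixed(names):
    """Prepend a zero-padded index so a plain name sort preserves list order."""
    width = len(str(len(names)))  # enough digits for the largest index
    return ["%0*d-%s" % (width, i, name) for i, name in enumerate(names, 1)]

titles = ["Introduction", "Adding fractions", "Zero and negatives"]
print(prefixed(titles))
# A plain sort of the prefixed names matches the original playlist order:
print(sorted(prefixed(titles)) == prefixed(titles))
```

Without the padding, "10-foo" would sort before "2-bar"; padding the index to the width of the largest one avoids that.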