Alright! Let's get things going on this blog. This is officially the first post on my new blog. As I mentioned in the 'about this blog' post, one of the things I love to do is figure out how to get something done with the wrong set of tools. It usually reveals obscure and unexpected features of those tools and lets me approach things creatively. I love obscure things and creativity.

This will be a tutorial on how to download YouTube videos with nothing but Awk. I love watching YouTube videos. One of the latest videos I watched was this brilliant commercial. Also I love programming and one of the languages I learned recently was AWK or as I like to capitalize it – Awk. The name of this language comes from the initials of its designers – Aho, Weinberger, Kernighan.

The question is – is it possible to download YouTube videos with the awk command line tool?! After all, it's a text processing language.

I don't want to go into language details today as that's not my goal. I only want to download videos. If you want to learn this awesome language, check out this tutorial.

The Awk language doesn't have networking support built in, so without the help of some networking tool which would create a network connection for us and pipe contents from YouTube to awk, we would be out of luck. Awk is also not well suited to handling binary data, so we will have to figure out how to read large amounts of binary data from the net in an efficient manner.

Let's find out what Google has to say about awk and networking. A quick search for 'awk + networking' gives us an interesting result – "TCP/IP Internetworking With `gawk'". Hey, wow! Just what we were looking for! Networking support for awk in GNU's awk implementation through special files!

Quoting the manual:

The special file name for network access is made up of several fields, all of which are mandatory: /inet/protocol/localport/hostname/remoteport

Cool! We know that the web talks over the TCP protocol on port 80, and we are accessing www.youtube.com for videos. So the special file for accessing the YouTube website would be:

/inet/tcp/0/www.youtube.com/80

(localport is 0 because we are a client)

Now let's test this out and get the banner of YouTube's web server by making a HEAD HTTP request and reading the response back. The following script will get the HEAD response from YouTube:

BEGIN {
    YouTube = "/inet/tcp/0/www.youtube.com/80"
    print "HEAD / HTTP/1.0\r\n\r\n" |& YouTube
    while ((YouTube |& getline) > 0)
        print $0
    close(YouTube)
}

I saved this script to a file called youtube.head.awk and ran gawk on it from the command line on my Linux box:

pkrumins@graviton:~$ gawk -f youtube.head.awk
HTTP/1.1 200 OK
Date: Mon, 09 Jul 2007 21:41:59 GMT
Server: Apache
... [truncated]

Yeah! It worked!

Now, let's find out how YouTube embeds videos on their site. We know that the video is played with a flash player, so the HTML code which displays it must be present. Let's find it.

I'll go a little easy here so that less experienced readers can learn something, too. Suppose we did not know how the flash was embedded in the page. How could we find it?

One way would be to notice that the title of the video is 'The Wind' and then search for this string in the HTML source until we notice something like 'swf', which is the extension for flash files, or 'flash'.
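This kind of search is a one-liner with grep. A sketch of the idea (the file video.html and its contents here are made up for illustration, not the real page source):

```shell
# pretend we saved the video page as video.html
printf '%s\n' '<title>YouTube - The Wind</title>' \
              'SWFObject("/player2.swf?video_id=2mTLO2F_ERY")' > video.html

# -n prints the line number, so we land right on the embedding code
matches=$(grep -n 'swf' video.html)
echo "$matches"
```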

The other way would be to use a better tool, like the Firefox browser's Firebug extension, and arrive at the correct place in the source instantly, without searching at all, by bringing up Firebug's console and inspecting the embedded flash movie.

After doing this we would find that YouTube videos are displayed on the page by calling this JavaScript function, which generates the appropriate HTML:

SWFObject("/player2.swf?hl=en&video_id=2mTLO2F_ERY&l=123&t=OEgsToPDskK5DwdDH6isCsg5GtXyGpTN&soff=1&sk=sZLEcvwsRsajGmQF7OqwWAU"

Visiting this URL http://www.youtube.com/player2.swf?hl=en... loads the video player in full screen. Not quite what we want. We want just the video file that is being played in the player. How does this flash player load the video? There are two ways to find out – use a network traffic analyzer like Wireshark (previously Ethereal), or disassemble their flash player with SoThink's SWF Decompiler (it's commercial; I don't know of a free alternative) to see the ActionScript which loads the movie. I hope to show how to find the video file url using both of these methods in future posts.

UPDATE: This is no longer true. Now YouTube gets videos by taking 'video_id' and 't' id from the following JavaScript object:

var swfArgs = {hl:'en',video_id:'xh_LmxEuFo8',l:'39',t:'OEgsToPDskKwChZS_16Tu1BqrD4fueoW',sk:'ZU0Zy4ggmf9MYx1oVLUcYAC'};

UPDATE: This is no longer true. Now YouTube gets videos by taking 'video_id' and 't' id from the following JavaScript object:

var swfArgs = {"BASE_YT_URL": "http://youtube.com/", "video_id": "JJ51hx3wGgI", "l": 242, "sk": "sZLEcvwsRsajGmQF7OqwWAU", "t": "OEgsToPDskJfAwvlG0JDr8cO-HVq2RaB", "hl": "en", "plid": "AARHZ9SrFgUPvbFgAAAAcADYAAA", "e": "h", "tk": "KVRgpgeftCUWrYaeqpikCbNxXMXKmdUoGtfTNVkEouMjv1SwamY-Wg=="};

UPDATE: This is also no longer true. Now YouTube gets videos by requesting it from one of the urls specified in 'fmt_url_map', which is located in the following JavaScript object:

var swfArgs = {"rv.2.thumbnailUrl": "http%3A%2F%2Fi4.ytimg.com%2Fvi%2FCSG807d3P-U%2Fdefault.jpg", "rv.7.length_seconds": "282", "rv.0.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DOF5T_7fDGgw", "rv.0.view_count": "2379471", "rv.2.title": "Banned+Commercials+-+Levis", "rv.7.thumbnailUrl": "http%3A%2F%2Fi3.ytimg.com%2Fvi%2FfbIdXn1zPbA%2Fdefault.jpg", "rv.4.rating": "4.87804878049", "length_seconds": "123", "rv.0.title": "Variety+Sex+%28LGBQT+Part+2%29", "rv.7.title": "Coke_Faithless", "rv.3.view_count": "2210628", "rv.5.title": "Three+sheets+to+the+wind%21", "rv.0.length_seconds": "364", "rv.4.thumbnailUrl": "http%3A%2F%2Fi3.ytimg.com%2Fvi%2F6IjUkNmUcHc%2Fdefault.jpg", "fmt_url_map": "18%7Chttp%3A%2F%2Fv22.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Cburst%252Cfactor%26itag%3D18%26ipbits%3D0%26signature%3D41B6B8B8FC0CF235443FC88E667A713A8A407AE7.CF9B5B68E39D488E61FE8B50D3BAEEF48A018A3C%26sver%3D3%26expire%3D1251270000%26key%3Dyt1%26factor%3D1.25%26burst%3D40%26id%3Dda64cb3b617f1116%2C34%7Chttp%3A%2F%2Fv19.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Cburst%252Cfactor%26itag%3D34%26ipbits%3D0%26signature%3DB6853342CDC97C85C83A872F9E5F274FE8B7B4A2.2B24E4836216C2F54428509388BC74043DB1782A%26sver%3D3%26expire%3D1251270000%26key%3Dyt1%26factor%3D1.25%26burst%3D40%26id%3Dda64cb3b617f1116%2C5%7Chttp%3A%2F%2Fv17.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Cburst%252Cfactor%26itag%3D5%26ipbits%3D0%26signature%3DB84AF2BE4ED222EC0217BA3149456F1164827F0C.1ECC42B7587411B734CC7B37209FDFA9A935391D%26sver%3D3%26expire%3D1251270000%26key%3Dyt1%26factor%3D1.25%26burst%3D40%26id%3Dda64cb3b617f1116", "rv.2.rating": "4.77608082707", "keywords": "the%2Cwind", "cr": "US", "rv.1.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dmp7g_8rEdg8", "rv.6.thumbnailUrl": 
"http%3A%2F%2Fi1.ytimg.com%2Fvi%2Fx-OqKWXirsU%2Fdefault.jpg", "rv.1.id": "mp7g_8rEdg8", "rv.3.rating": "4.14860864417", "rv.6.title": "best+commercial+ever", "rv.7.id": "fbIdXn1zPbA", "rv.4.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D6IjUkNmUcHc", "rv.1.title": "Quilmes+comercial", "rv.1.thumbnailUrl": "http%3A%2F%2Fi2.ytimg.com%2Fvi%2Fmp7g_8rEdg8%2Fdefault.jpg", "rv.3.title": "Viagra%21+Best+Commercial%21", "rv.0.rating": "3.79072164948", "watermark": "http%3A%2F%2Fs.ytimg.com%2Fyt%2Fswf%2Flogo-vfl106645.swf%2Chttp%3A%2F%2Fs.ytimg.com%2Fyt%2Fswf%2Fhdlogo-vfl100714.swf", "rv.6.author": "hbfriendsfan", "rv.5.id": "w0BQh-ICflg", "tk": "OK0E3bBTu64aAiJXYl2eScsjwe3ggPK1q1MXf7LPuwIFAjkL2itc1Q%3D%3D", "rv.4.author": "yaquijr", "rv.0.featured": "1", "rv.0.id": "OF5T_7fDGgw", "rv.3.length_seconds": "30", "rv.5.rating": "4.42047930283", "rv.1.view_count": "249202", "sdetail": "p%3Awww.catonmat.net%2Fblog%2Fdownload", "rv.1.author": "yodroopy", "rv.1.rating": "3.66379310345", "rv.4.title": "epuron+-+the+power+of+wind", "rv.5.thumbnailUrl": "http%3A%2F%2Fi4.ytimg.com%2Fvi%2Fw0BQh-ICflg%2Fdefault.jpg", "rv.5.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dw0BQh-ICflg", "rv.6.length_seconds": "40", "sourceid": "r", "rv.0.author": "kicesie", "rv.3.thumbnailUrl": "http%3A%2F%2Fi4.ytimg.com%2Fvi%2FKShkhIXdf1Y%2Fdefault.jpg", "rv.2.author": "dejerks", "rv.6.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dx-OqKWXirsU", "rv.7.rating": "4.51851851852", "rv.3.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DKShkhIXdf1Y", "fmt_map": "18%2F512000%2F9%2F0%2F115%2C34%2F0%2F9%2F0%2F115%2C5%2F0%2F7%2F0%2F0", "hl": "en", "rv.7.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DfbIdXn1zPbA", "rv.2.view_count": "9744415", "rv.4.length_seconds": "122", "rv.4.view_count": "162653", "rv.2.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DCSG807d3P-U", "plid": "AARyAMgw_jlMzIA7", "rv.5.length_seconds": "288", "rv.0.thumbnailUrl": 
"http%3A%2F%2Fi4.ytimg.com%2Fvi%2FOF5T_7fDGgw%2Fdefault.jpg", "rv.7.author": "paranoidus", "sk": "I9SvaNetkP1IR2k_kqJzYpB_ItoGOd2GC", "rv.5.view_count": "503035", "rv.1.length_seconds": "61", "rv.6.rating": "4.74616639478", "rv.5.author": "hotforwords", "vq": "None", "rv.3.id": "KShkhIXdf1Y", "rv.2.id": "CSG807d3P-U", "rv.2.length_seconds": "60", "t": "vjVQa1PpcFOeKDyjuF7uICOYYpHLyjaGXsro1Tsfao8%3D", "rv.6.id": "x-OqKWXirsU", "video_id": "2mTLO2F_ERY", "rv.6.view_count": "2778674", "rv.3.author": "stephancelmare360", "rv.4.id": "6IjUkNmUcHc", "rv.7.view_count": "4260"};

We need to extract these two ids and make a request string '?video_id=xh_LmxEuFo8&t=OEgsToPDskKwChZS_16Tu1BqrD4fueoW'. The rest of the article describes the old way YouTube handled videos (before these updates), but the approach is basically the same.

For now, I can tell you that once the video player loads, it gets the FLV (flash video) file from:

http://www.youtube.com/get_video?hl=en&video_id=2mTLO2F_ERY&l=123&t=OEgsToPDskK5DwdDH6isCsg5GtXyGpTN&soff=1&sk=sZLEcvwsRsajGmQF7OqwWAU

The query string after 'http://www.youtube.com/get_video' is exactly the same one that followed player2.swf in the previous fragment.

If you now enter this URL into a browser, it should pop up a download dialog and you should be able to save the flash movie to your computer. But it's not that easy! YouTube actually 302-redirects you to one or two other URLs before the video download starts! So we will have to handle these HTTP redirects in our awk script, because awk does not know anything about the HTTP protocol!

So basically all we have to do is construct an awk script which finds this request string, appends it to 'http://www.youtube.com/get_video', handles the 302 redirects, and finally saves the video data to a file.

Since awk has great pattern matching built in, we can extract the request string by getting the HTML source of the video page, searching for the line which contains SWFObject("/player2.swf, and extracting everything from the ? up to the closing double quote.

So here is the final script. Save it to a file named 'get_youtube_vids.awk'; it can then be used from the command line as follows:

gawk -f get_youtube_vids.awk <http://www.youtube.com/watch?v=ID1> [http://youtube.com/watch?v=ID2 | ID2] ...

For example, to download the commercial which I said was great, you'd call the script as:

gawk -f get_youtube_vids.awk http://www.youtube.com/watch?v=2mTLO2F_ERY

or just using the ID of the video:

gawk -f get_youtube_vids.awk 2mTLO2F_ERY

Here is the source code of the program:

#!/usr/bin/gawk -f
#
# Peter Krumins (peter@catonmat.net)
# https://catonmat.net -- good coders code, great reuse
#
# Usage: gawk -f get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ...
# or just ./get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ...
#

BEGIN {
    if (ARGC == 1) usage();

    BINMODE = 3  # binary mode for input and output (matters on non-POSIX systems)

    delete ARGV[0]
    print "Parsing YouTube video urls/IDs..."
    for (i in ARGV) {
        vid_id = parse_url(ARGV[i])
        if (length(vid_id) < 6) { # haven't seen youtube vids with IDs < 6 chars
            print "Invalid YouTube video specified: " ARGV[i] ", not downloading!"
            continue
        }
        VIDS[i] = vid_id
    }

    for (i in VIDS) {
        print "Getting video information for video: " VIDS[i] "..."
        get_vid_info(VIDS[i], INFO)

        if (INFO["_redirected"]) {
            print "Could not get video info for video: " VIDS[i]
            continue
        }

        if (!INFO["video_url"]) {
            print "Could not get video_url for video: " VIDS[i]
            print "Please go to my website, and submit a comment with an URL to this video, so that I can fix it!"
            print "Url: https://catonmat.net/downloading-youtube-videos-with-gawk/"
            continue
        }

        if ("title" in INFO) {
            print "Downloading: " INFO["title"] "..."
            title = INFO["title"]
        }
        else {
            print "Could not get title for video: " VIDS[i]
            print "Trying to download " VIDS[i] " anyway"
            title = VIDS[i]
        }

        download_video(INFO["video_url"], title)
    }
}

function usage() {
    print "Downloading YouTube Videos with GNU Awk"
    print
    print "Peter Krumins (peter@catonmat.net)"
    print "https://catonmat.net -- good coders code, great reuse"
    print
    print "Usage: gawk -f get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ..."
    print "or just ./get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ..."
    exit 1
}

#
# function parse_url
#
# takes a url or an ID of a youtube video and returns just the ID
# for example the url could be the full url: http://www.youtube.com/watch?v=ID
# or it could be www.youtube.com/watch?v=ID
# or just youtube.com/watch?v=ID or http://youtube.com/watch?v=ID
# or just the ID
#
function parse_url(url) {
    gsub(/http:\/\//, "", url)                # get rid of http:// part
    gsub(/www\./, "", url)                    # get rid of www. part
    gsub(/youtube\.com\/watch\?v=/, "", url)  # get rid of youtube.com... part

    if ((p = index(url, "&")) > 0)            # get rid of &foo=bar&... after the ID
        url = substr(url, 1, p-1)

    return url
}

#
# function get_vid_info
#
# function takes the youtube video ID and gets the title of the video
# and the url to .flv file
#
function get_vid_info(vid_id, INFO,    InetFile, Request, HEADERS, matches, escaped_urls, fmt_urls, fmt) {
    delete INFO
    InetFile = "/inet/tcp/0/www.youtube.com/80"
    Request  = "GET /watch?v=" vid_id " HTTP/1.1\r\n"
    Request  = Request "Host: www.youtube.com\r\n\r\n"

    get_headers(InetFile, Request, HEADERS)
    if ("Location" in HEADERS) {
        INFO["_redirected"] = 1
        close(InetFile)
        return
    }

    while ((InetFile |& getline) > 0) {
        if (match($0, /"fmt_url_map": "([^"]+)"/, matches)) {
            escaped_urls = url_unescape(matches[1])
            split(escaped_urls, fmt_urls, /,?[0-9]+\|/)
            for (fmt in fmt_urls) {
                if (fmt_urls[fmt] ~ /itag=5/) {
                    # fmt number 5 is the best video
                    INFO["video_url"] = fmt_urls[fmt]
                    close(InetFile)
                    return
                }
            }
            close(InetFile)
            return
        }
        else if (match($0, /<title>YouTube - ([^<]+)</, matches)) {
            # let's try to get the title of the video from the html title tag,
            # which is less likely to be a subject to future html design changes
            INFO["title"] = matches[1]
        }
    }
    close(InetFile)
}

#
# function url_unescape
#
# given a string, it url-unescapes it.
# characters such as %20 get converted to their ascii counterparts.
#
function url_unescape(str,    nmatches, entity, entities, seen, i) {
    nmatches = find_all_matches(str, "%[0-9A-Fa-f][0-9A-Fa-f]", entities)
    for (i = 1; i <= nmatches; i++) {
        entity = entities[i]
        if (!seen[entity]) {
            if (entity == "%26") { # special case for gsub(s, r, t), when r = '&'
                gsub(entity, "\\&", str)
            }
            else {
                gsub(entity, url_entity_unescape(entity), str)
            }
            seen[entity] = 1
        }
    }
    return str
}

#
# function find_all_matches
#
# http://awk.freeshell.org/FindAllMatches
#
function find_all_matches(str, re, arr,    j, a, b) {
    j = 0
    a = RSTART; b = RLENGTH   # to avoid unexpected side effects
    while (match(str, re) > 0) {
        arr[++j] = substr(str, RSTART, RLENGTH)
        str = substr(str, RSTART+RLENGTH)
    }
    RSTART = a; RLENGTH = b
    return j
}

#
# function url_entity_unescape
#
# given an url-escaped entity, such as %20, return its ascii counterpart.
#
function url_entity_unescape(entity) {
    sub("%", "", entity)
    return sprintf("%c", strtonum("0x" entity))
}

#
# function download_video
#
# takes the url to video and saves the movie to current directory using
# sanitized video title as filename
#
function download_video(url, title,    filename, InetFile, Request, Loop, HEADERS, FOO) {
    title = sanitize_title(title)
    filename = create_filename(title)

    parse_location(url, FOO)
    InetFile = FOO["InetFile"]
    Request  = "GET " FOO["Request"] " HTTP/1.1\r\n"
    Request  = Request "Host: " FOO["Host"] "\r\n\r\n"

    Loop = 0 # make sure we do not get caught in Location: loop
    do {     # we can get more than one redirect, follow them all
        get_headers(InetFile, Request, HEADERS)
        if ("Location" in HEADERS) { # we got redirected, let's follow the link
            close(InetFile)
            parse_location(HEADERS["Location"], FOO)
            InetFile = FOO["InetFile"]
            Request  = "GET " FOO["Request"] " HTTP/1.1\r\n"
            Request  = Request "Host: " FOO["Host"] "\r\n\r\n"
            if (InetFile == "") {
                print "Downloading '" title "' failed, couldn't parse Location header!"
                return
            }
        }
        Loop++
    } while (("Location" in HEADERS) && Loop < 5)

    if (Loop == 5) {
        print "Downloading '" title "' failed, got caught in Location loop!"
        return
    }

    print "Saving video to file '" filename "' (size: " bytes_to_human(HEADERS["Content-Length"]) ")..."
    save_file(InetFile, filename, HEADERS)
    close(InetFile)
    print "Successfully downloaded '" title "'!"
}

#
# function sanitize_title
#
# sanitizes the video title, by removing ()'s, replacing spaces with _, etc.
#
function sanitize_title(title) {
    gsub(/\(|\)/, "", title)
    gsub(/[^[:alnum:]-]/, "_", title)
    gsub(/_-/, "-", title)
    gsub(/-_/, "-", title)
    gsub(/_$/, "", title)
    gsub(/-$/, "", title)
    gsub(/_{2,}/, "_", title)
    gsub(/-{2,}/, "-", title)
    return title
}

#
# function create_filename
#
# given a sanitized video title, creates a nonexisting filename
#
function create_filename(title,    filename, i) {
    filename = title ".flv"
    i = 1
    while (file_exists(filename)) {
        filename = title "-" i ".flv"
        i++
    }
    return filename
}

#
# function save_file
#
# given a special network file and filename reads from network until eof
# and saves the read contents into a file named filename
#
function save_file(Inet, filename, HEADERS,    done, cl, perc, hd, hcl) {
    OLD_RS  = RS
    OLD_ORS = ORS

    ORS = ""

    # clear the file
    print "" > filename

    # here we will do a little hackery to write the downloaded data
    # to file chunk by chunk instead of downloading it all to memory
    # and then writing
    #
    # the idea is to use a regex for the record separator
    # everything that gets matched is stored in the RT variable
    # which gets written to disk after each match
    #
    # RS = ".{1,512}" # let's read 512 byte records

    RS = "@" # I replaced the 512 byte block reading with something better.
             # To read blocks I had to force users to specify --re-interval,
             # which made them uncomfortable.
             # I did statistical analysis on YouTube video files and
             # I found that hex value 0x40 appears pretty often (every 200 bytes or so)!

    cl = HEADERS["Content-Length"]
    hcl = bytes_to_human(cl)
    done = 0
    while ((Inet |& getline) > 0) {
        done += length($0 RT)
        perc = done*100/cl
        hd = bytes_to_human(done)
        printf "Done: %d/%d bytes (%d%%, %s/%s)            \r", done, cl, perc, hd, hcl
        print $0 RT >> filename
    }
    printf "Done: %d/%d bytes (%d%%, %s/%s)            \n", done, cl, perc, hd, hcl

    RS  = OLD_RS
    ORS = OLD_ORS
}

#
# function get_headers
#
# given a special inet file and the request saves headers in HEADERS array
# special key "_status" can be used to find HTTP response code
# issuing another getline() on inet file would start returning the contents
#
function get_headers(Inet, Request, HEADERS,    matches, OLD_RS) {
    delete HEADERS

    # save global vars
    OLD_RS = RS

    print Request |& Inet

    # get the http status response
    if (Inet |& getline > 0) {
        HEADERS["_status"] = $2
    }
    else {
        print "Failed reading from the net. Quitting!"
        exit 1
    }

    RS = "\r\n"
    while ((Inet |& getline) > 0) {
        # we could have used FS=": " to split, but i could not think of a good
        # way to handle header values which contain multiple ": "
        # so i better go with a match
        if (match($0, /([^:]+): (.+)/, matches)) {
            HEADERS[matches[1]] = matches[2]
        }
        else break
    }
    RS = OLD_RS
}

#
# function parse_location
#
# given a Location HTTP header value the function constructs a special
# inet file and the request storing them in FOO
#
function parse_location(location, FOO) {
    # location might look like http://cache.googlevideo.com/get_video?video_id=ID
    if (match(location, /http:\/\/([^\/]+)(\/.+)/, matches)) {
        FOO["InetFile"] = "/inet/tcp/0/" matches[1] "/80"
        FOO["Host"]     = matches[1]
        FOO["Request"]  = matches[2]
    }
    else {
        FOO["InetFile"] = ""
        FOO["Host"]     = ""
        FOO["Request"]  = ""
    }
}

#
# function bytes_to_human
#
# given bytes, converts them to human readable format like 13.2mb
#
function bytes_to_human(bytes,    MAP, map_idx, bytes_copy) {
    MAP[0] = "b"
    MAP[1] = "kb"
    MAP[2] = "mb"
    MAP[3] = "gb"
    MAP[4] = "tb"

    map_idx = 0
    bytes_copy = int(bytes)
    while (bytes_copy > 1024) {
        bytes_copy /= 1024
        map_idx++
    }

    if (map_idx > 4)
        return sprintf("%d bytes", bytes)
    else
        return sprintf("%.02f%s", bytes_copy, MAP[map_idx])
}

#
# function file_exists
#
# given a path to file, returns 1 if the file exists, or 0 if it doesn't
#
function file_exists(file,    foo) {
    if ((getline foo < file) >= 0) {
        close(file)
        return 1
    }
    return 0
}

Each function is well documented, so the code should be easy to understand. If you see something that can be improved or optimized, just comment on this page. Also, if you would like me to explain any fragment of the source code in even more detail, let me know.

The most interesting function in this script is save_file which does chunked downloading in a hacky way (see the comments in the source).

Download link: get_youtube_vids.awk

Next, I'll make a web server in Awk. See you then!