This is a .torrent file parser. It was written for fun/challenge purposes, and is not necessarily intended to be useful for any serious purpose.

Torrent files are "bencoded". Writing a bencode parser in bash poses several difficult challenges, and pushes against the limitations of the language.
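For readers who have not met the format, bencoding has four types: integers, length-prefixed strings, lists and dictionaries. The following is a minimal sketch, separate from the parser on this page, of decoding one length-prefixed string with plain parameter expansion. The variable names (data, len, str, rest) are illustrative only.

```shell
# Bencode in brief:
#   integer     i42e
#   string      4:spam          (<length>:<bytes>)
#   list        l4:spami42ee
#   dictionary  d3:bar4:spame   (keys are strings, in sorted order)

data='4:spam3:egg'            # two strings back to back
len=${data%%:*}               # digits before the first colon
str=${data:${#len}+1:len}     # skip "<len>:" and take len characters
rest=${data:${#len}+1+len}    # whatever follows the string
printf '%s\n' "$str" "$rest"
```

The same idea (read the length, then slice that many characters) is what makes the string case the simplest branch of a bencode parser.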

The approach used here is to parse the data stream and store the contents as fields within an associative array. The structure of the fields is determined dynamically during the parsing, which is by necessity recursive (a dictionary can contain a list, which can contain a dictionary, which can contain a list, etc.). Once the associative array has been built, all the fields become available for any purpose, assuming the application can determine how to identify the field(s) of interest.

This approach would also work well for parsing an XML document, I think. I haven't actually tried it yet.

First, here is the output. You can see the hierarchical structure of the field-namespace contained within the .torrent file. The dots between namespace elements were chosen arbitrarily, and could be replaced by another delimiter. I chose the dot partly because the .torrent namespace is reminiscent of the window namespace in Tk. The way I've represented lists with numeric namespace components is a design choice I made.

I chose to sort the output lexicographically, to keep related elements together; the order of the fields shown here is almost the same as the order inside the .torrent file (the dictionaries must be sorted lexicographically, but my use of .n. for lists, without zero-padding, means the lists are printed out of sequence if they have more than 10 elements).

    $ ben /stuff/tori2007-09-11.torrent
    .announce                http://tracker.toritraders.com/announce.php
    .comment                 2007-09-11
    .created by              BitsOnWheels 1.0.6
    .creation date           1192017330
    .encoding                UTF-8
    .info.files.0.length     1083
    .info.files.0.path.0     Tori Amos 2007-09-11.txt
    .info.files.1.length     1276
    .info.files.1.path.0     tori2007-09-11.ffp
    .info.files.10.length    40970187
    .info.files.10.path.0    Tori2007-09-11d1t09.flac
    .info.files.11.length    2100576
    .info.files.11.path.0    Tori2007-09-11d1t10.flac
    .info.files.12.length    37919531
    .info.files.12.path.0    Tori2007-09-11d1t11.flac
    .info.files.13.length    39882119
    .info.files.13.path.0    Tori2007-09-11d1t12.flac
    .info.files.14.length    32640182
    .info.files.14.path.0    Tori2007-09-11d1t13.flac
    .info.files.15.length    21839105
    .info.files.15.path.0    Tori2007-09-11d1t14.flac
    .info.files.16.length    35343829
    .info.files.16.path.0    Tori2007-09-11d2t01.flac
    .info.files.17.length    10502680
    .info.files.17.path.0    Tori2007-09-11d2t02.flac
    .info.files.18.length    22704770
    .info.files.18.path.0    Tori2007-09-11d2t03.flac
    .info.files.19.length    27933648
    .info.files.19.path.0    Tori2007-09-11d2t04.flac
    .info.files.2.length     36546205
    .info.files.2.path.0     Tori2007-09-11d1t01.flac
    .info.files.20.length    37142439
    .info.files.20.path.0    Tori2007-09-11d2t05.flac
    .info.files.21.length    48347811
    .info.files.21.path.0    Tori2007-09-11d2t06.flac
    .info.files.22.length    42226577
    .info.files.22.path.0    Tori2007-09-11d2t07.flac
    .info.files.23.length    38228003
    .info.files.23.path.0    Tori2007-09-11d2t08.flac
    .info.files.3.length     44009544
    .info.files.3.path.0     Tori2007-09-11d1t02.flac
    .info.files.4.length     27210926
    .info.files.4.path.0     Tori2007-09-11d1t03.flac
    .info.files.5.length     28364865
    .info.files.5.path.0     Tori2007-09-11d1t04.flac
    .info.files.6.length     29017426
    .info.files.6.path.0     Tori2007-09-11d1t05.flac
    .info.files.7.length     31116854
    .info.files.7.path.0     Tori2007-09-11d1t06.flac
    .info.files.8.length     26280337
    .info.files.8.path.0     Tori2007-09-11d1t07.flac
    .info.files.9.length     29748219
    .info.files.9.path.0     Tori2007-09-11d1t08.flac
    .info.name               Tori2007-09-11
    .info.name.utf8          Tori2007-09-11
    .info.piece length       524288
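Once the fields are in a flat array, an application can pick out what it needs by building key names. The following is a sketch only: it fills a ben array by hand with the first few entries from the listing above (rather than running the parser) and totals the file lengths by counting list indices upward until a key is missing. The variables total and k are illustrative.

```shell
# Hand-filled stand-in for the parser's result (values copied from the listing).
declare -A ben=(
    [".info.name"]="Tori2007-09-11"
    [".info.files.0.length"]=1083
    [".info.files.0.path.0"]="Tori Amos 2007-09-11.txt"
    [".info.files.1.length"]=1276
    [".info.files.1.path.0"]="tori2007-09-11.ffp"
)

# Walk the list numerically: stop at the first index with no .length key.
total=0
k=0
while [[ -n ${ben[".info.files.$k.length"]+set} ]]; do
    total=$(( total + ${ben[".info.files.$k.length"]} ))
    k=$(( k + 1 ))
done
echo "$k files, $total bytes"
```

Counting indices this way also visits list elements in their true order, regardless of how the keys would sort as text.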

And here is the script:

    #!/usr/bin/env bash
    export LC_ALL=C

    # Requires Bash 4.0 or higher for the associative array.
    # Declared at global scope so it works on Bash 4.1 (no declare -g).
    declare -A ben
    benparse() {
        local data skip p max
        [[ -r $1 ]] || { echo "cannot read file '$1'"; return 1; }
        # Slurp the whole file, turning NUL bytes into \1 on the way in.
        IFS= read -rd '' data < <(tr \\0 \\1 < "$1")
        max=${#data}

        # Start parsing at position 0, with an empty namespace.
        bp_parse 0 ""
    }

    # Recursive.
    # Input:  $1 = position within data at which to start parsing
    #         $2 = current namespace prefix
    # Output: skip = number of bytes consumed
    #         p    = parsed value, if it was a string or an integer
    bp_parse() {
        (($1 >= max)) && return
        case "${data:$1:1}" in
        d)
            # Dictionary: d<key><value>...e, where each key is a string.
            local i=$1 j=$(($1+1)) key value
            while ((j < max)) && [[ ${data:j:1} != e ]]; do
                bp_parse $j "$2."
                key=$p
                ((j+=skip))
                bp_parse $j "$2.$key"
                value=$p
                ((j+=skip))
                [[ $value ]] && ben["$2.$key"]=$value
            done
            p=""
            skip=$((j-i+1))
            ;;
        i)
            # Integer: i<digits>e
            local i=$1 j=$(($1+1))
            while [[ ${data:j:1} != e ]]; do
                ((j++))
            done
            p=${data:i+1:j-i-1}
            skip=$((j-i+1))
            ;;
        l)
            # List: l<item>...e; items are stored under $2.0, $2.1, ...
            local i=$1 j=$(($1+1)) k=0 value
            while [[ ${data:j:1} != e ]]; do
                bp_parse $j "$2.$k"
                [[ $p ]] && ben["$2.$k"]=$p
                ((k++, j+=skip))
            done
            p=""
            skip=$((j-i+1))
            ;;
        *)
            # String: <length>:<contents>
            local n n_len
            bp_getnum $1
            n_len=${#n}
            p=${data:$1+n_len+1:n}
            skip=$((n_len+1+n))
            ;;
        esac
    }

    # Input:  $1 = position within data at which to start reading
    # Output: n = the number that was read
    bp_getnum() {
        local i=$1 j=$1
        while [[ ${data:j:1} = [[:digit:]-] ]]; do
            ((j++))
        done
        n=${data:i:j-i}
    }

    # Stop here if the file could not be read; there is nothing to dump.
    benparse "$1" || exit

    # Dump the fields in sorted order, skipping very long values.
    printf "%s\n" "${!ben[@]}" | sort | while IFS= read -r idx; do
        (( ${#ben["$idx"]} > 80 )) && continue
        printf "%-24.24s %s\n" "$idx" "${ben["$idx"]}"
    done

Here are some points of interest in the code:

We override the locale ( LC_ALL=C ) because we're reading binary data from a file, and we don't want Bash doing anything funny like treating multiple bytes as a single character.
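As a small illustration of why the locale matters (a sketch, not part of the parser; the variable names are made up), consider a two-byte UTF-8 sequence. In the C locale Bash counts and slices it byte by byte, which is what a parser working on byte offsets needs.

```shell
s=$'\xc3\xa9'        # the two bytes that encode "é" in UTF-8

LC_ALL=C
bytes=${#s}          # C locale: every byte counts as one character
echo "length in the C locale: $bytes"

# With a UTF-8 locale in effect, ${#s} would typically report 1 instead,
# and ${s:0:1} would return both bytes rather than just the first one.
```

A bencoded string length is a byte count, so any character-based counting would throw every later offset off.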

The associative array requires Bash 4.0 or higher. The machine on which this was written and tested has Bash 4.1, which means the associative array ( ben ) has to be declared in the global scope. In Bash 4.2, the AA could have been declared inside the benparse function using declare -g .
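A minimal sketch of the declare -g alternative, assuming Bash 4.2 or later. The names init_store and store, and the URL, are hypothetical and chosen only for the example.

```shell
# Requires Bash 4.2 or later for declare -g.
init_store() {
    # -g creates the array at global scope even though we are inside a
    # function; a plain "declare -A store" here would make it local, and
    # it would vanish when the function returns.
    declare -gA store
    store[".announce"]="http://example.invalid/announce"
}

init_store
echo "${store[".announce"]}"
```

On Bash 4.1 the same effect needs the declare -A at the top level of the script, as done on this page.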

There are a lot of function calls in this parser, and the functions often need to return multiple pieces of data -- both strings and integers. Both for performance reasons, and because multiple returned elements are needed, I decided to return the data in variables that would be available to the function's caller (rather than some sort of foo=$(myfunc) capturing, which has many issues). In order to permit this, the variables to be populated by the function must be global, or must be declared at a scope outside of the function. Bash uses dynamic scoping, meaning that in a function call tree f -> g -> h, variables declared in g are available to h. If a variable by the same name is declared in f, or at the global scope, the definition in g will shadow it. So, we declare all the variables that our functions will populate as return values ( p , skip , etc.) in the caller. The variables declared at the top of the benparse function (the first function in the call chain) are available to every function in the parser, but not to the global scope. This means every function "shares" the same data (the full bencoded stream), along with max , p and skip . The functions simply index data as a string as needed.
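The return-through-caller-variables pattern can be sketched in isolation. This is an illustration with made-up names (outer, inner, result, consumed), not code from the parser.

```shell
outer() {
    local result consumed        # "return slots" owned by the caller
    inner "hello world"
    echo "outer sees: result=$result consumed=$consumed"
}

inner() {
    # No "local" here: these assignments land in outer's variables,
    # because Bash looks names up through the chain of callers.
    result=${1%% *}              # first word of the argument
    consumed=${#result}          # how many characters that was
}

outer
echo "global sees: result=${result-unset}"
```

Nothing leaks to the global scope, and no subshell is forked as it would be with foo=$(myfunc).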

Bash variables cannot hold NUL bytes, and .torrent files most definitely contain them (within the info dictionary!). So we munge all the NULs into \1s during the read. Any field that contained NULs is ruined by this and can no longer serve its real purpose. You can't write an actual BitTorrent client in Bash, or at least, not using this approach.
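The effect of the munging can be seen on a five-byte sample. This sketch uses printf as a stand-in for the file; the variable names raw and munged are illustrative.

```shell
# Without munging, read -d '' treats the first NUL as its delimiter and stops.
IFS= read -rd '' raw < <(printf 'ab\0cd') || true
echo "${#raw}"       # only "ab" was captured

# Mapping NUL to byte 1 keeps every byte position intact, at the cost of
# corrupting any field that legitimately contained NULs.
# (read returns non-zero at end of input, hence the "|| true".)
IFS= read -rd '' munged < <(printf 'ab\0cd' | tr '\0' '\1') || true
echo "${#munged}"
```

Keeping the byte positions intact is what matters for parsing, because every bencoded string length must still line up with the data that follows it.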

When we dump the AA at the end of the script, we sort the indices (piping to sort and reading them back in), and omit any long fields. In my .torrent file, there's a gigantic binary blob called pieces ( .info.pieces in my namespace). This is the data that has munged NUL bytes and is thus ruined for its purpose.

If your version of 'sort' supports the -V option (natural sort of version numbers within text), then this solves the 'lists are printed out of sequence if they have more than 10 elements' problem. However, it does then alter the order of other elements. Strangely, this is fixed by piping the output of the script to 'sort -V' again! --Nemo
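A quick way to see the difference, assuming a sort that supports -V (GNU coreutils does); the three keys are taken from the namespace shown earlier:

```shell
keys=(.info.files.10.length .info.files.2.length .info.files.1.length)

# Plain sort compares byte by byte, so "10" lands before "2".
printf '%s\n' "${keys[@]}" | LC_ALL=C sort

# With -V, runs of digits inside the text are compared as numbers.
printf '%s\n' "${keys[@]}" | LC_ALL=C sort -V
```

The second pipeline lists the keys for elements 1, 2 and 10 in that order.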
