Today I want to tell you about my recent adventure in data recovery. A friend of mine had a broken USB disk that was no longer readable. The single 230 GB partition was formatted with NTFS, and neither Windows nor Ubuntu (with the NTFS-3g driver, I assume) was willing to read it. The disk contained photos, audio files and videos; the mission was to at least restore the photos.

Make a disk image

The first step in a data recovery project should be to make an image of the drive or partition. When a drive starts to lose data because of physical errors on the disk, these errors tend to spread: things get worse, and at some point you will not get any data out of it anymore. Every tool will run into the same problems when it tries to read defective sectors on the disk, and those sectors cannot be repaired.

By the way, if your data is really valuable you shouldn’t even try to recover it yourself, that is, if the drive shows signs of some physical damage. You should disconnect it as soon as possible and hand it to a professional data recovery company – if it is worth a few hundred or thousand Euros.

The data I was working with, however, wasn’t business critical. So, after consulting $SEARCH_ENGINE, I made an image of the drive with dd_rescue – this works similarly to a normal “dd” but handles I/O errors more gracefully. But here things started to get confusing: there are two programs for this purpose with almost identical names:

Kurt Garloff’s original dd_rescue tool uses an executable named “dd_rescue”; the Debian/Ubuntu package is named “ddrescue”.

Antonio Diaz Diaz’s new and improved GNU ddrescue provides an executable named “ddrescue”; the Debian/Ubuntu package carries the name “gddrescue”.

The latter is the one to choose: you get much better progress information – copying hundreds of gigabytes takes quite some time, so you want to know what’s going on – and the capability to interrupt the process and continue where you left off with the help of a log file. After finding out about that the hard way I got my image with

sudo ddrescue -r3 /dev/sdb1 hdimage logfile

where “-r3” means: “in case of an error, retry 3 times” and /dev/sdb1 is the name of the partition of the USB disk, obviously.
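The log file is also a handy way to see how much of the disk could not be read. Here is a rough sketch of a helper (the name bad_bytes is my own invention) that adds up the blocks marked as bad. It assumes the plain-text format GNU ddrescue writes as far as I know it: comment lines start with “#”, and each data line holds a position, a size (both in hex) and a one-character status, where “-” marks a bad-sector block. Check this against the log file your version produces before trusting the number.

```shell
#!/bin/sh
# Sum the sizes of all blocks marked "-" (bad sector) in a ddrescue log file.
# Assumed data line format: "<pos> <size> <status>", pos and size as 0x... hex.
bad_bytes() {
    total=0
    while read -r pos size status; do
        case $pos in
            '#'*|'') continue ;;   # skip comment lines and blank lines
        esac
        if [ "$status" = "-" ]; then
            # $size is expanded textually, so the shell parses the hex constant
            total=$((total + $size))
        fi
    done < "$1"
    echo "$total"
}
```

Calling “bad_bytes logfile” prints the number of unreadable bytes. The line holding the current position has no “-” in the third field, so it is ignored.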

Unfortunately, the resulting image still couldn’t be mounted. “ddrescue” only reported a few bad sectors on the disk, but that was obviously enough to make file access impossible. Another idea that I wasn’t able to pursue: it might have been possible to repair the NTFS filesystem with a virtualized Windows instance running in VirtualBox – but VirtualBox only takes complete disks as images, not single partitions. If I had made an image of the complete disk instead, including the partition table, this might have worked out. I didn’t feel like copying the 230 GB over into a new image with a partition table and also didn’t have enough free disk space to do it.

Recovery tools: file carvers

The next step was trying to recover as much of the data as possible. I had successfully used a “file carver” before to recover images from my digital camera’s memory card after the FAT filesystem became corrupted. A file carver is a program that scans a raw binary stream for the headers of known file types, like that of JPEG images or MP3 audio files, and tries to extract the contents, completely ignoring the file system. The advantage is that it doesn’t matter how broken your filesystem is – the program doesn’t have to know anything about the filesystem’s structure. It can also recover deleted files. The disadvantage is that you lose all information that is stored in the filesystem, the file name and directory structure. It’s also prone to errors for fragmented file systems, which also means that you’re less likely to succeed when recovering large files.
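To make the idea of header scanning concrete, here is a toy sketch that lists the byte offsets at which a JPEG header signature (the bytes FF D8 FF) shows up in a file. The function name find_jpeg_offsets is made up for this example, and it is nowhere near what foremost or photorec really do: they also look for file ends, validate the contents and handle many more formats. It is also slow on large images, since it converts the whole file to hex first.

```shell
#!/bin/sh
# Print the byte offset of every JPEG header signature (FF D8 FF) in a file.
find_jpeg_offsets() {
    # Dump the file as one long hex string, two characters per byte.
    od -An -v -tx1 "$1" | tr -d ' \n' | awk '{
        s = $0
        pos = 1
        while ((i = index(substr(s, pos), "ffd8ff")) > 0) {
            p = pos + i - 1          # 1-based position in the hex string
            if (p % 2 == 1)          # only hits aligned on a byte boundary count
                print (p - 1) / 2    # convert to a 0-based byte offset
            pos = p + 1
        }
    }'
}
```

Running “find_jpeg_offsets hdimage” would print one candidate offset per line; a carver would then start reading a file at each of them.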

I tried two tools from this category, “foremost” and “photorec”. “foremost” is a simple command line tool; you call it like this:

foremost -i hdimage -o recovered -v

and it will sort the files it can find by file type into sub folders of “recovered”.

Photorec has a curses interface. It also takes hints about the structure of the image, like presence of a partition table or filesystem type. It is part of the “testdisk” package. The command line invocation shows that this tool was ported from DOS:

photorec /log /debug /d output-directory hdimage

Recovery tools: Sleuth Kit

Researching further, I stumbled upon the Sleuth Kit and Autopsy. These are forensic analysis tools and therefore are designed to recover data that someone deliberately tried to hide or destroy. The Sleuth Kit is a suite of command line tools which Autopsy is a web frontend for. Autopsy comes with its own web server. I started it with these commands:

mkdir my-autopsy-dir/
autopsy -d my-autopsy-dir/
firefox http://localhost:9999/autopsy

Getting around the web interface can be a bit confusing: you have to create a “case” first, then add a host to investigate and finally a hd image to look at. Anyway, the time it took me to get used to autopsy wasn’t wasted because I now was able to see the complete contents of the original NTFS filesystem! I was able to look at the data, browse the filesystem, download single files and compute MD5 sums. However, autopsy offers no feature for copying whole directory trees. This is because it is intended for forensic analysis rather than data recovery. So you, the computer forensics expert, are supposed to look at every single file and make notes about it which in turn are then recorded in the “case”.

I wasn’t really interested in a forensic analysis of the contents of my friend’s drive so I took a closer look at the command line tools. The relevant commands from the Sleuth Kit are “fls” for listing files in an image and “icat” for getting at the contents. You use “fls” like this:

fls -urp hdimage

where -u means that I’m not interested in deleted files, -r that I want a recursive listing and -p that I need to have the full path for every file. The output looks something like this:

d/d 180-144-8: some-dir
d/d 5192-144-1: some-dir/some sub dir
r/r 5190-128-3: some-dir/some sub dir/some_file.exe
r/r 5188-128-3: some-dir/some sub dir/another_file.jpg

The funny numbers in the second column are the file’s “inode”, which you need to feed into “icat” to get the contents. So how do you recover a whole directory tree with these tools? What I should have done is use a script like this one:

#!/bin/sh
# Recreate the directory tree from the image: make a directory for every
# d/d entry and extract every r/r entry with icat.
IMAGE=hdimage
fls -urp "$IMAGE" | while read -r type inode name; do
    case $type in
        d/d) mkdir -p "$name" ;;
        r/r) icat "$IMAGE" "$(echo "$inode" | sed 's/://g')" > "$name" ;;
    esac
done

But I was lazy and so I saved the file listing in a text file which I turned into a big shell script using Emacs’ rectangle functions, regular expressions and keyboard macros. This wasn’t working so well: there were some funny characters in the file names I forgot to escape, like single quotes and backticks. So, as always, it turned out to be more work doing it “the easy way”. However, in the end I was able to completely recover the data from the partition.
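If you do want to go the generated-script route, the quoting problem can be solved mechanically. Here is a minimal sketch (shell_quote and listing_to_script are names I made up) that wraps every file name in single quotes and escapes embedded single quotes, so backticks, dollar signs and spaces lose their special meaning. It assumes the listing looks like the “fls -urp” output shown earlier and that no file name contains a newline.

```shell
#!/bin/sh
# Quote one string so that a POSIX shell reads it back literally.
shell_quote() {
    # Each embedded ' becomes '\'' (close quote, escaped quote, reopen quote).
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Read an "fls -urp" listing on stdin and print recovery commands on stdout.
# $1 is the image file the generated icat calls should read from.
listing_to_script() {
    image=$1
    while read -r type inode name; do
        case $type in
            d/d) printf 'mkdir -p %s\n' "$(shell_quote "$name")" ;;
            r/r) printf 'icat %s %s > %s\n' "$(shell_quote "$image")" \
                     "${inode%:}" "$(shell_quote "$name")" ;;
        esac
    done
}
```

With the listing saved in a file, “listing_to_script hdimage < listing.txt > recover.sh” produces a script you can read through before running it with sh.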

Analyzing the data

Since I now had all the data back, and had already tried other recovery methods before, this can serve as a nice real-world benchmark of the usefulness of the file carving tools I used.

Just counting how many files these tools think they’ve found doesn’t help us much; we also need to know whether the recovered files are really complete and undamaged. I did a quick check of the files the Sleuth Kit recovered, and all files I checked seemed to be ok: the photos were fine and the videos and mp3s played without any hiccups. So, let’s assume that the data I got from the Sleuth Kit is really genuine. To find out which files recovered by the other tools are identical to those, I computed the MD5 hash for all of them with this little script:

for tool in foremost photorec sleuthkit; do
    find $tool -type f -print0 | xargs -0 md5sum | tee md5sums/${tool}.txt
done

And here’s a script I hacked together to do some analysis on these files:

#!/bin/bash
# The md5sum listings are read from files named sleuthkit, photorec and
# foremost in the current directory.

# Hashes of all files with the given extension in the given listings.
md5s_by_ext() {
    local ext=$1
    shift
    grep -hi "\.${ext}\$" "$@" | awk '{ print $1 }'
}

unique_md5s_by_ext() {
    md5s_by_ext "$@" | sort | uniq
}

# Distinct hashes of all files in the given listings.
unique_md5s() {
    cat "$@" | awk '{ print $1 }' | sort | uniq
}

clean_wc() {
    wc -l | sed 's/ //g'
}

# One table row: for each tool print found, matching and matching+ext.
common_files() {
    local ext="$1"
    echo -ne "${ext}\t"
    echo -ne $(unique_md5s_by_ext $ext sleuthkit | clean_wc) "\t"
    for tools in photorec foremost "photorec foremost"; do
        echo -ne $(unique_md5s_by_ext $ext $tools | clean_wc) "\t" \
            $(comm -12 <(unique_md5s $tools) <(unique_md5s_by_ext $ext sleuthkit) | clean_wc)"\t"\
            $(comm -12 <(unique_md5s_by_ext $ext $tools) <(unique_md5s_by_ext $ext sleuthkit) | clean_wc)"\t"
    done
    echo
}

# The "total" row: found and matching over all extensions.
common_files_total() {
    echo -e "total\t"\
        $(unique_md5s sleuthkit | clean_wc) "\t"\
        $(unique_md5s photorec | clean_wc) "\t"\
        $(comm -12 <(unique_md5s photorec) <(unique_md5s sleuthkit) | clean_wc) "\t\t"\
        $(unique_md5s foremost | clean_wc) "\t"\
        $(comm -12 <(unique_md5s foremost) <(unique_md5s sleuthkit) | clean_wc) "\t\t"\
        $(unique_md5s photorec foremost | clean_wc) "\t"\
        $(comm -12 <(unique_md5s foremost photorec) <(unique_md5s sleuthkit) | clean_wc)
}

echo -e "\tsleuthkit\tphotorec\t\t\tforemost\t\t\tphotorec+foremost"
common_files_total
for i in jpg gif mp3 avi mpg zip rar exe cab dll txt htm rtf pdf doc xls; do
    common_files $i
done

And here are the results as a really ugly table:

       sleuthkit  photorec                       foremost                       photorec+foremost
                  found  matching  matching+ext  found  matching  matching+ext  found  matching  matching+ext
total  4391       6600   3669                    1210   771                     6960   3718
jpg    831        768    711       711           853    747       747           901    755       755
gif    1          1      0         0             46     1         1             47     1         1
mp3    3218       4697   2851      2851          0      0         0             4697   2851      2851
avi    128        5      0         0             5      0         0             10     0         0
mpg    1          207    0         0             1      0         0             208    0         0
zip    5          3      3         3             13     0         0             16     3         3
rar    25         29     24        24            30     8         8             50     24        24
exe    37         60     4         4             78     6         6             83     6         6
cab    0          3      0         0             0      0         0             3      0         0
dll    10         69     6         6             71     7         7             80     8         8
txt    12         699    4         4             1      0         0             700    4         4
htm    6          0      2         0             3      1         1             3      2         1
rtf    1          2      0         0             0      0         0             2      0         0
pdf    0          1      0         0             1      0         0             1      0         0
doc    7          15     6         5             16     0         0             31     6         5
xls    0          2      0         0             0      0         0             2      0         0

This table needs a bit of an explanation:

“found” means the number of files the tool extracted from the image

“matching” means the number of files the tool found that are identical to files recovered with the Sleuth Kit

“matching+ext” means that we’ve also got the extension right
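Stripped of the table formatting, the “matching” number is just the size of the intersection of two sets of hashes. A minimal sketch of that single step, using temporary files so it runs in plain sh (unique_hashes and count_matching are illustrative names; the two arguments are md5sum listings like the ones produced above):

```shell
#!/bin/sh
# Print the distinct hashes (first column) of an md5sum listing, sorted.
unique_hashes() {
    awk '{ print $1 }' "$1" | sort -u
}

# Count the hashes that occur in both md5sum listings.
count_matching() {
    a=$(mktemp) b=$(mktemp)
    unique_hashes "$1" > "$a"
    unique_hashes "$2" > "$b"
    # comm -12 keeps only the lines present in both sorted inputs
    comm -12 "$a" "$b" | wc -l | tr -d ' '
    rm -f "$a" "$b"
}
```

Because the hashes are deduplicated first, a file that a carver extracted twice still counts only once.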

Foremost recovered almost 90% of the images, with Photorec following close behind at 85%; Photorec found only 8 photos that foremost couldn’t identify. Looking at other data types, Photorec is clearly superior: it found 24 of the 25 RAR files present in the image, while foremost only got 8 of them right. And only Photorec was able to recover any mp3s: it found 89% of them, but it also produced quite a few false positives (4697 files found, 2851 of them matching). Neither tool was able to recover any movies – possibly because they were fragmented on disk.
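The percentages are simply the “matching+ext” count divided by the number of files the Sleuth Kit recovered for that extension. A tiny helper to redo the arithmetic from the table (recovery_rate is a made-up name):

```shell
#!/bin/sh
# Print matching/total as a percentage with one decimal place.
recovery_rate() {
    # LC_ALL=C keeps the decimal point independent of the locale
    LC_ALL=C awk -v m="$1" -v t="$2" 'BEGIN {
        if (t > 0) printf "%.1f\n", 100 * m / t
        else print "n/a"
    }'
}
```

For example, “recovery_rate 747 831” gives the foremost figure for the photos and “recovery_rate 2851 3218” the Photorec figure for the mp3s.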

Conclusion

So here comes the take home message: