
Introduction

File system analysis is a very important part of digital forensics. A lot of investigations involve hard drives whose contents need to be analyzed. Possibly, deleted files might need to be recovered as well. There are several file system types and NTFS is currently one of the most popular.

File system corruption can occur for several reasons and it may compromise the ability to access and recover files. Hence, forensic tools must understand the structure of a file system and they need to be able to extract as much data as possible, even in harsh conditions. File carving is a popular technique for extracting files from damaged media. However, carved files generally lose their metadata and the directory structure of the partition cannot be retrieved. A better approach is needed because file names, paths and timestamps are very important information.

In this article, you will learn how the directory structure of an NTFS drive can be rebuilt even if parts of the metadata are partial, broken or completely absent. All the phases of the process will be explained. The presented algorithm leads to an interpretation of the file system that allows for the recovery of file names, paths, timestamps and contents of files (including fragmented ones). Finally, you will learn how to use and customize an open source tool that implements these techniques.

How does NTFS reconstruction work?

Before delving into the techniques that can be used to rebuild an NTFS partition, a few concepts need to be introduced. NTFS is a proprietary file system developed by Microsoft and used by default on Windows starting from Windows 2000. It is also found in high capacity external hard drives, as Linux and macOS support it as well.

NTFS inner workings

The main concept of NTFS is that everything is a file: even the metadata the file system needs in order to work is stored in several files. The most important one is the MFT (Master File Table), which includes at least one file record (called MFT entry) for every allocated file or directory. Each entry contains several attributes, and a file might have more than one entry if its attributes cannot fit into a single one.

These are the most important attributes:

$STANDARD_INFORMATION (id 0x10 = 16) stores the MAC (Modification, Access and Creation) times in the weirdest timestamp format ever: the number of 100-nanosecond intervals since January 1, 1601 UTC

$ATTRIBUTE_LIST (id 0x20 = 32) includes references to the non-base MFT entries that contain other attributes of the same file (if any)

$FILE_NAME (id 0x30 = 48) stores multiple file names: generally the DOS 8.3 file name and the long one used by recent versions of Windows, unless the two are the same

$DATA (id 0x80 = 128) includes the whole file if it is smaller than about 700 bytes, otherwise we have to refer to the runlist (a list pointing to the location and size of each fragment on disk)

$INDEX_ROOT (id 0x90 = 144, only for directories) contains the $FILE_NAME attributes of (some) child elements

$INDEX_ALLOCATION (id 0xA0 = 160, only for directories) refers to external index records stored on the disk that store the remaining references to child elements
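The timestamp format used by $STANDARD_INFORMATION is easier to grasp with a small conversion routine. The following is a standalone Python 3 sketch, not code taken from any tool discussed here; the function name ntfs_time_to_datetime is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# NTFS timestamps count 100-nanosecond intervals since this moment.
NTFS_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def ntfs_time_to_datetime(value):
    """Convert a raw 64-bit NTFS timestamp into an aware UTC datetime."""
    # Ten intervals of 100 ns make one microsecond, the finest unit
    # supported by timedelta; the sub-microsecond remainder is dropped.
    return NTFS_EPOCH + timedelta(microseconds=value // 10)


# 11644473600 seconds separate 1601-01-01 from the Unix epoch.
unix_epoch = ntfs_time_to_datetime(116444736000000000)
```

Here unix_epoch evaluates to midnight of January 1, 1970 UTC, which is a handy sanity check when parsing real records.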

File records are 1 KB in size and they can be recognized because they start with the FILE signature, or BAAD if the record is marked as damaged by the OS. Index records are 4 KB in size and their signature is INDX.

Finally, the first and the last sectors of an NTFS partition are two copies of the boot record, which contains these important parameters:

Sectors per cluster (SPC), needed because addresses in NTFS are expressed in clusters (groups of sectors) and must be translated when recovering files or metadata

Starting address of the MFT, relative to the first sector of the partition (which we will refer to as Cluster Base or CB)

Size of the file system, expressed in sectors and clusters
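As a rough illustration of how these parameters can be read, here is a standalone Python 3 sketch that extracts them from a 512-byte boot sector. The field offsets follow the commonly documented NTFS boot sector layout; the function name and the returned dictionary keys are my own choice for this example, and the unusual encoding used for very large clusters is deliberately ignored.

```python
import struct


def parse_boot_sector(sector):
    """Extract geometry fields from a 512-byte NTFS boot sector.

    Returns None if the OEM identifier is not the NTFS one.
    """
    if bytes(sector[3:11]) != b"NTFS    ":
        return None
    # Offset 0x0B: bytes per sector (2 bytes), 0x0D: sectors per cluster.
    bytes_per_sector, spc = struct.unpack_from("<HB", sector, 0x0B)
    # Offset 0x28: total sectors, 0x30: first cluster of the MFT.
    total_sectors, mft_cluster = struct.unpack_from("<QQ", sector, 0x28)
    return {
        "bytes_per_sector": bytes_per_sector,
        "spc": spc,
        "total_sectors": total_sectors,
        # MFT position in sectors, relative to the start of the partition.
        "mft_sector": mft_cluster * spc,
    }
```

The MFT address is stored in clusters, so it is multiplied by SPC to obtain a sector offset relative to the beginning of the partition.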

For a detailed discussion on how NTFS works, refer to the Linux-NTFS project documentation. I strongly recommend reading Brian Carrier’s invaluable File System Forensic Analysis book as well.

Carving the metadata

Normally, an operating system would start from the boot record, then jump to the MFT and start reading file records and index records while exploring the entire directory structure. However, to perform an effective NTFS reconstruction, we need to keep in mind that metadata could be partially corrupted:

The partition table might be wrong

We may not have the boot record nor its backup

Some MFT entries and/or index records might be unreadable

MFT entries could be lost due to a new file system partially covering a deleted one

For these reasons, a safer approach to this problem is assuming the worst case scenario. Instead of relying on the partition table or the boot record of a partition, the correct process is to scan the entire drive for traces of boot records, file records and index records. This allows us to find the MFT entries of recently deleted files, too.

Basically, instead of carving files, we perform metadata carving, collecting the interesting sectors in three lists.
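The scan can be pictured with the following standalone Python 3 sketch. It is a simplification written for this explanation, not the actual scanner: it assumes 512-byte sectors, looks only at signatures, and performs none of the validation a real tool needs. The name carve_metadata is hypothetical.

```python
import io

SECTOR_SIZE = 512  # assumption: the image uses 512-byte sectors


def carve_metadata(image, sector_size=SECTOR_SIZE):
    """Scan a file-like disk image and return three lists of sector numbers:
    candidate boot records, file records and index records."""
    boot_records, file_records, index_records = [], [], []
    position = 0
    while True:
        sector = image.read(sector_size)
        if len(sector) < sector_size:
            break  # end of image (a trailing partial sector is ignored)
        if sector[3:11] == b"NTFS    ":
            boot_records.append(position)
        elif sector[:4] in (b"FILE", b"BAAD"):
            file_records.append(position)
        elif sector[:4] == b"INDX":
            index_records.append(position)
        position += 1
    return boot_records, file_records, index_records


# Tiny synthetic "drive": an empty sector, a 1 KB file record (two sectors)
# and the first sector of an index record.
demo = io.BytesIO(
    b"\x00" * 512 + b"FILE" + b"\x00" * 1020 + b"INDX" + b"\x00" * 508
)
boot, files, indexes = carve_metadata(demo)
# files == [1] and indexes == [3]; no boot record is present.
```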

This gives a list of sectors where potential file records are, but they may belong to several different partitions. Hence, the final step of this phase is clustering them. We cannot rely on their position because our assumption is that we don’t know yet where the partitions (or their MFTs) were located. Moreover, one partition could have been partially overwritten by another one.

We can, however, exploit the fact that each file record contains its own identifier and the MFT is generally contiguous. Through a simple linear formula, we can compute the position of the corresponding file record #0 for each file record we are considering:

p = y - 2x

Where y is the record position on disk (in sectors) and x is the record number. We can then divide the file records based on their p-value.
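The grouping step can be sketched in a few lines of standalone Python 3. The factor 2 comes from a 1 KB file record spanning two 512-byte sectors; the function name and the input format (pairs of sector position and record number) are assumptions made for this example.

```python
from collections import defaultdict

SECTORS_PER_RECORD = 2  # a 1 KB file record spans two 512-byte sectors


def cluster_file_records(records):
    """Group (position, record_number) pairs by p = y - 2x, i.e. by the
    position where file record #0 of the same MFT would be located."""
    groups = defaultdict(list)
    for y, x in records:
        p = y - SECTORS_PER_RECORD * x
        groups[p].append((y, x))
    return dict(groups)


found = [
    (223274, 5),    # both of these point to an MFT starting at 223264
    (223392, 64),
    (1010, 5),      # this one belongs to a different MFT starting at 1000
]
partitions = cluster_file_records(found)
```

In this made-up input the first two records share the p-value 223264, while the third one ends up in its own group with p-value 1000.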

A visual representation of the file record clustering process

Bottom-up tree reconstruction

At this point we have some “partitions” that are just lists of file records. There is no directory tree structure yet. We cannot read the file system structure top-down, because if any directory record were missing, all files under it would be inaccessible.

Instead, we can leverage the fact that MFT entries contain the identifier of the parent directory as well. Moreover, we know that the root directory in NTFS is always #5. We can proceed to rebuild the directory tree in a bottom up fashion using a simple process.

For each file:

If we found the record of its parent, link the file to the parent

If we know the parent id but there was no file record, create a fake file record called Dir_[number] and link the file to it

If we don’t know the parent, link the file to a LostFiles directory

If the file record is 5, consider this the root of the file system

Accessing the file system top-down (left) versus reconstructing the tree bottom-up (right)

As you can see, this simple process does not depend on the file system type and it could be easily applied to other file systems as long as the file records have an id and a pointer to the parent record. If the file record of a directory is missing, it’s not a big deal because we create one for it that includes all of its children. The missing directory is then put among the lost files.
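The linking rules listed above can be sketched as follows in standalone Python 3. The data layout (a dictionary mapping each record id to a pair of name and parent id) is invented for this example and is much simpler than what a real tool keeps in memory.

```python
ROOT_ID = 5         # the root directory in NTFS
LOST_FILES_ID = -1  # conventional id for the LostFiles directory


def rebuild_tree(records):
    """records maps record id -> (name, parent id or None).

    Returns (names, children), where children maps a directory id to the
    list of ids linked under it."""
    names = {rid: name for rid, (name, _) in records.items()}
    names.setdefault(ROOT_ID, "Root")
    names[LOST_FILES_ID] = "LostFiles"
    children = {}

    def link(child, parent):
        children.setdefault(parent, []).append(child)

    for rid, (_, parent) in sorted(records.items()):
        if rid == ROOT_ID:
            continue  # the root is the top of the tree
        if parent is None:
            link(rid, LOST_FILES_ID)  # unknown parent
            continue
        if parent not in names:
            # Known parent id without a file record: create a placeholder
            # directory and put it among the lost files.
            names[parent] = "Dir_%s" % parent
            link(parent, LOST_FILES_ID)
        link(rid, parent)
    if LOST_FILES_ID in children:
        link(LOST_FILES_ID, ROOT_ID)
    return names, children


names, children = rebuild_tree({
    5: ("Root", 5),
    64: ("other", 5),
    65: ("executables", 64),
    170: ("pwd", 65),
    300: ("orphan.txt", 99),    # record #99 was never found
    301: ("unknown.bin", None),
})
```

In this example the placeholder Dir_99 is created for the missing directory and linked under LostFiles together with unknown.bin, while the rest of the tree hangs from the root as usual.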

Example of bottom-up reconstruction

Up to now, we have not considered index records at all. This means that the approach would work even if all index records were unavailable. However, the resulting tree may be completed with additional information about what we can call “ghost files”. These are files whose records couldn’t be found, but their existence could be determined indirectly by entries found in the index records.

Ghost files cannot be recovered, but the fact that they existed in the past might be relevant in an investigation.
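Conceptually, detecting ghost files boils down to a set difference between the identifiers seen in index records and those for which a file record exists. The standalone Python 3 sketch below illustrates the idea; the function name and the tuple format of the index entries are assumptions made for this example.

```python
def find_ghost_files(file_record_ids, index_entries):
    """Return the files that are only known through index records.

    index_entries is an iterable of (file id, parent id, name) tuples
    read from the entries of the index records."""
    known = set(file_record_ids)
    ghosts = {}
    for file_id, parent_id, name in index_entries:
        if file_id not in known:
            ghosts[file_id] = (name, parent_id)
    return ghosts


ghosts = find_ghost_files(
    [5, 64, 65],
    [(64, 5, "other"), (65, 64, "executables"), (170, 65, "pwd")],
)
# Only #170 has no file record, so it is the single ghost file here.
```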

Finding the partition geometry

We have seen how the directory tree (or parts thereof) can be rebuilt using file records and index records. This yields an accurate representation of the file system, including file names, timestamps and other relevant metadata.

This is a very good result. However, we are still missing a tiny detail: file contents.

Metadata is certainly important, but so is data. All the pointers and references in NTFS are expressed in clusters and refer to the beginning of the partition. We need to find these two parameters if we want to access the files:

Sectors per cluster (SPC)

Cluster base (CB)
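Once both values are known, translating a cluster address into an absolute sector on the drive is a matter of simple arithmetic. The following standalone Python 3 sketch uses hypothetical function names; the numbers are those reported for the sample image analyzed later in this article (CB = 223232 and SPC = 16), where the MFT offset of 223264 would correspond to cluster 2.

```python
def cluster_to_sector(cluster, spc, cb):
    """Translate a cluster address into an absolute sector on the drive."""
    return cb + cluster * spc


def run_to_sectors(start_cluster, length, spc, cb):
    """Return (first sector, number of sectors) for one file fragment
    described by its starting cluster and its length in clusters."""
    return cluster_to_sector(start_cluster, spc, cb), length * spc


mft_sector = cluster_to_sector(2, 16, 223232)  # 223264
```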

If we have found a boot record that can be linked to the MFT entries we have, then it’s all good. The SPC parameter is written in the boot sector and the CB value is given by its position, or that minus the partition size if we were reading the backup boot record.

Unfortunately, we cannot really rely on the presence of the boot sector when dealing with damaged hard drives. We need a way to recover both parameters even when we only have file records and index records.

We can exploit the information contained inside index records. As we have seen before, index records contain a list of $FILE_NAME attributes. Among other data, these also contain a reference to the parent directory. Therefore, for each index record we can estimate what directory it belongs to just by looking at its entries.

By repeating this process for all records, we can turn the hard drive into a “text” where each letter corresponds to a sector. Each sector has a blank space, except those that contain an index record. In that case, the place is filled with the id of its owner.
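One simple way to estimate the owner is a majority vote over the parent references found in the entries of an index record. Note that the voting strategy is my assumption for this standalone Python 3 sketch; the text above only says that the owner is estimated from the entries. The “text” is represented as a dictionary, so that blank sectors simply do not appear in it.

```python
from collections import Counter


def estimate_owner(parent_ids):
    """Guess the directory an index record belongs to (majority vote)."""
    votes = Counter(parent_ids)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def build_text(index_records):
    """index_records maps a sector to the list of parent ids read from the
    entries of the index record found there. Returns sector -> owner id."""
    text = {}
    for sector, parent_ids in index_records.items():
        owner = estimate_owner(parent_ids)
        if owner is not None:
            text[sector] = owner
    return text


text = build_text({102: [6, 6, 6], 104: [9, 9, 5], 200: []})
```

In this example sector 104 is attributed to directory #9 despite one stray reference, and sector 200 stays blank because its record has no usable entries.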

Next, we consider all MFT entries of directories that have at least one $INDEX_ALLOCATION attribute. From the attributes, we can see in what position (in clusters) there should be a corresponding index record. If done for all MFT entries, we get a “word” whose letters correspond to clusters and contain either a blank space or the identifier of a directory.

What we don’t know yet is how many sectors are in a cluster (SPC), but we can guess. Valid values are small powers of two, like 1, 2, 4, 8, 16, 32, 64 and 128. The most common are actually 1 and 8.

Our goal is to find the correct values of both SPC and CB. We enumerate the candidate SPC values and, for each of them, the process consists of:

Converting the “word” from clusters to sectors using the current SPC value. For instance, [_, 6, 9, _, 2, 11] with SPC = 2 becomes [_, _, 6, _, 9, _, _, _, 2, _, 11, _].

Matching the “word” on the “text” given by index records found on the drive. We can use an optimized version of the Baeza-Yates–Perleberg approximate string matching algorithm to compute the position that gives the most accurate match.

This provides a CB value for each possible SPC one. The value with the highest accuracy computed during the matching process wins. If we recover the partition geometry, we can extract all the files by reading their MFT entries.
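The whole search can be sketched in standalone Python 3 as shown below. Be aware that this is a plain brute-force matcher used as a stand-in: it is not the optimized Baeza-Yates–Perleberg algorithm mentioned above, and all function names are hypothetical. Blanks are represented by None and the “text” is a dictionary mapping sectors to owner ids.

```python
def expand_word(word, spc):
    """Convert the "word" from clusters to sectors for a given SPC."""
    expanded = []
    for owner in word:
        expanded.append(owner)
        expanded.extend([None] * (spc - 1))
    return expanded


def best_offset(text, pattern):
    """Find the offset (candidate CB) where the pattern matches the most
    letters of the text. Returns (offset, number of matching letters)."""
    wanted = [(i, owner) for i, owner in enumerate(pattern) if owner is not None]
    # Only offsets aligning a letter with a non-blank sector can score.
    candidates = {s - i for s in text for i, _ in wanted if s - i >= 0}
    best = (None, 0)
    for cb in sorted(candidates):
        score = sum(1 for i, owner in wanted if text.get(cb + i) == owner)
        if score > best[1]:
            best = (cb, score)
    return best


def find_geometry(text, word, spc_values=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Try every SPC value and keep the one with the best match.
    Returns (spc, cb, score)."""
    results = []
    for spc in spc_values:
        cb, score = best_offset(text, expand_word(word, spc))
        results.append((score, spc, cb))
    score, spc, cb = max(results, key=lambda r: r[0])
    return spc, cb, score


word = [None, 6, 9, None, 2, 11]
text = {102: 6, 104: 9, 108: 2, 110: 11}
geometry = find_geometry(text, word)
```

With this synthetic input the search settles on SPC = 2 and CB = 100, where all four letters of the word line up with the text. A real implementation also needs to report the accuracy of the match and deal with ties between different SPC values.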

Also, at this point we can check if some partitions are actually two parts of the same partition (which may happen if the MFT was fragmented). If this is the case, the two parts can be merged.

The algorithms outlined here are discussed in greater detail in Chapter 6 of my thesis RecuperaBit: Forensic File System Reconstruction Given Partially Corrupted Metadata. They have been successfully implemented in an open source tool.

Enter RecuperaBit

RecuperaBit is a command line program I developed using all the techniques described above. It is free open source software written in Python, hence you can study the code, extend it and customize it as needed.

Currently, it supports only NTFS, but its architecture allows for new plug-ins to be added in the future. The program attempts directory tree reconstruction regardless of:

missing partition table

unknown partition boundaries

partially-overwritten metadata

quick format

Installing the software

RecuperaBit can be run on Linux, Windows and macOS provided that Python 2.7 is available on the system. During the latest ESC2016 hack camp in Venice, Italy it was even shown running on a rooted Android tablet. Installation is very simple because the tool is self-contained. You just need to download the archive from GitHub and uncompress it, then call the main.py file.

Only a simple symlink is required to get the convenience of calling recuperabit from the command line. On a Linux system, installation can be as simple as running the following commands as root:

cd /opt
git clone https://github.com/Lazza/RecuperaBit.git
ln -s /opt/RecuperaBit/main.py /usr/local/bin/recuperabit

You can now show the usage instructions with:

recuperabit -h

This will run with the default CPython implementation. For better performance, you might want to install PyPy and change the shebang of main.py (that is, the first line of the file) so that it refers to /usr/bin/pypy.

RecuperaBit in the CAINE 8 applications menu

If you are using the amazing CAINE 8 digital forensics Linux OS, be aware that RecuperaBit is already included among the tools provided in the default installation. However, due to a small bug, it is not easily available from the command line. On CAINE you don’t need to download RecuperaBit, but you may want to symlink it with the following command:

ln -s /usr/share/caine/pacchetti/RecuperaBit/main.py /usr/local/bin/recuperabit

Nevertheless, please be aware that a bunch of bug fixes for RecuperaBit have been released after CAINE 8 came out. For this reason, I suggest you install the latest version as outlined above, to avoid crashes on particularly damaged file systems.

A concrete example

You can test the tool on any raw disk image that contains an NTFS partition or some traces of it. You might even run it on the dump of a single MFT entry and you would still get some basic information, like the file name or the file record id.

For this article, we are going to use a specially crafted example that you can download here. The compressed archive contains a 1 GB disk image with an NTFS file system somewhere in the middle, using an unusual SPC value of 16. No partition table is available and boot sectors have been erased. Also, the first four entries of the MFT and their backup (constituting the so called "MFT mirror") are gone as well.

The root directory contained the following items:

A directory called other with two subdirectories: libraries and executables, containing a bunch of files extracted from an Ubuntu installation

A directory called pictures, including JPG and PNG files

A directory called texts with several text files

In total, there were roughly 500 files.

Assuming you are working in the directory where you saved the image file, you can pass the filename to RecuperaBit as an argument. You should also specify an output directory and a filename for storing the list of interesting sectors:

recuperabit borderless_1GB_v6.img -o recovered -s borderless.save

The program will print a brief recap of what it’s going to do and it will wait until you press Enter. It will then proceed to scan the drive and print a detailed log of what operations are going on. At the end, you will get a primitive command line that you can use to analyze and extract the partitions that have been found.

RecuperaBit after the initial processing phase

You can see that RecuperaBit tried to determine the partition geometry even though the boot records were not available, finding a match:

INFO:root:Finding partition geometry
DEBUG:root:Found MATCH in positions set([768640]) with weight 16 (100.0%)

If you type recoverable at the prompt, you will see that basic information about the damaged file system has been detected and files can be recovered:

Partition #0 -> Partition (NTFS, ??? b, 517 files, Recoverable, Offset: 223232, Offset (b): 114294784, Sec/Clus: 16, MFT offset: 223264, MFT mirror offset: None)

At this point, you can recursively extract the files. The root directory in NTFS has identifier 5 and the partition you want is #0. Therefore, you can do:

restore 0 5

This command will extract allocated and deleted files. It will also create empty placeholders for ghost files, if any. The output looks promising:

INFO:root:Restoring #5 Root
INFO:root:Restoring #64 Root/other
INFO:root:Restoring #65 Root/other/executables
INFO:root:Restoring #166 Root/other/executables/ping6
INFO:root:Restoring #167 Root/other/executables/plymouth
INFO:root:Restoring #168 Root/other/executables/plymouth-upstart-bridge
INFO:root:Restoring #146 Root/other/executables/networkctl
INFO:root:Restoring #169 Root/other/executables/ps
INFO:root:Restoring #170 Root/other/executables/pwd
INFO:root:Restoring #171 Root/other/executables/rbash
INFO:root:Restoring #172 Root/other/executables/readlink
[... snip ...]

In some cases, files would end up under the lost files directory, which is given a conventional id equal to -1. In that case, you can run restore 0 -1 as well. Now it’s time to look inside the recovered folder and check what RecuperaBit has extracted!

Some of the recovered files

With RecuperaBit, you can also get:

A tree representation of the files (don’t do it on large drives!), e.g.: tree 0

A CSV listing of the files, e.g.: csv 0 contents.csv

A body file listing of the files, e.g.: bodyfile 0 contents.body

Adding a plug-in for another file system type

Currently, only NTFS partitions can be scanned using RecuperaBit. However, it is possible to write new plug-ins for other file systems. Discussing the inner workings of a file system like FAT or HFS+ would be a very difficult task and it is outside the scope of this article. Nevertheless, I will briefly describe what you need to start writing a RecuperaBit plug-in.

The program works by feeding each sector of the drive to a set of scanners (each of which is for a different file system type). A scanner should inherit the DiskScanner class and implement two methods: feed (taking a sector and its position as input) and get_partitions (providing an associative array of detected partitions).

RecuperaBit uses a simple abstract file system with two classes: File and Partition. They can be used directly, but in most cases this is not sufficient. They should be extended to add file system specific attributes. You should pay special attention to the get_content method, which may return either raw bytes or an iterator. If the file is large, an iterator is recommended.
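To give an idea of what returning an iterator might look like, here is a standalone Python 3 sketch of a generator that reads a region of a disk image in chunks. It does not use the RecuperaBit classes; the function name, its parameters and the chunk size are assumptions made for this example. A get_content implementation could return such a generator instead of loading a large file in memory.

```python
import io


def iter_content(image, offset, size, chunk_size=1024 * 1024):
    """Yield the bytes of a region of a file-like disk image in chunks.

    offset and size are expressed in bytes."""
    image.seek(offset)
    remaining = size
    while remaining > 0:
        data = image.read(min(chunk_size, remaining))
        if not data:
            break  # the image ended before the region did
        remaining -= len(data)
        yield data


# Read 10 bytes starting at offset 4, three bytes at a time.
demo_image = io.BytesIO(b"0123456789abcdefghij")
chunks = list(iter_content(demo_image, 4, 10, chunk_size=3))
```

The last chunk is shorter than the others because the region size is not a multiple of the chunk size.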

The following file is a complete stub implementation of a scanner that detects sectors with the Ubuntu keyword, dividing them in two “partitions” (even and odd):

"""Ubuntu plug-in.

A pretty useless plug-in that matches sectors containing the word 'Ubuntu'."""

from core_types import File, Partition, DiskScanner

from ..utils import sectors


class UbuntuFile(File):
    """Ubuntu File."""
    def __init__(self, offset):
        name = "Sector " + str(offset)
        size = 0
        is_dir = False
        is_del = False
        is_ghost = False
        File.__init__(self, offset, name, size, is_dir, is_del, is_ghost)
        self.set_parent("fakeID")
        self.set_offset(offset)

    def get_content(self, partition):
        """Just extract the sector, for demonstration purposes."""
        image = DiskScanner.get_image(partition.scanner)
        dump = sectors(image, File.get_offset(self), 1)
        return str(dump)


class UbuntuPartition(Partition):
    """Partition for all Ubuntu fans."""
    def __init__(self, scanner, position=None):
        Partition.__init__(self, 'Ubuntu', 10, scanner)
        self.set_recoverable(True)

    def additional_repr(self):
        """Return additional values to show in the string representation."""
        return [
            ('Ubuntu version', '16.04')
        ]


class UbuntuScanner(DiskScanner):
    """Ubuntu Scanner."""
    def __init__(self, pointer):
        DiskScanner.__init__(self, pointer)
        self.found_ubuntu = set()

    def feed(self, index, sector):
        """Feed a new sector."""
        # check boot sector
        if 'Ubuntu' in sector:
            self.found_ubuntu.add(index)
            return 'Ubuntu sector'

    def get_partitions(self):
        """Get a list of the found partitions."""
        partitioned_files = {
            'even': UbuntuPartition(self),
            'odd': UbuntuPartition(self)
        }
        for offset in self.found_ubuntu:
            if offset % 2 == 0:
                partitioned_files['even'].add_file(UbuntuFile(offset))
            else:
                partitioned_files['odd'].add_file(UbuntuFile(offset))
        return partitioned_files

Save the file as ubuntu.py in the recuperabit/fs/ directory. Then open the main.py file and add this instruction near the top:

from recuperabit.fs.ubuntu import UbuntuScanner

Finally, update the list of scanners:

plugins = ( NTFSScanner, UbuntuScanner )

Run the program again and this time you should see three partitions:

> recoverable
Partition #0 -> Partition (NTFS, ??? b, 517 files, Recoverable, Offset: 223232, Offset (b): 114294784, Sec/Clus: 16, MFT offset: 223264, MFT mirror offset: None)
Partition #1 -> Partition (Ubuntu, ??? b, 10 files, Recoverable, Offset: None, Offset (b): None, Ubuntu version: 16.04)
Partition #2 -> Partition (Ubuntu, ??? b, 14 files, Recoverable, Offset: None, Offset (b): None, Ubuntu version: 16.04)

You can now type tree 1 and check the directory tree:

> tree 1
Rebuilding partition...
Done
----------
Root/ (Id: 10, Offset: None, Offset bytes: None) [GHOST]
  LostFiles/ (Id: -1, Offset: None, Offset bytes: None) [GHOST]
    Dir_fakeID/ (Id: fakeID, Offset: None, Offset bytes: None) [GHOST]
      Sector 1251888 (Id: 1251888, Offset: 1251888, Offset bytes: 640966656, Size: 0.00 B)
      Sector 834514 (Id: 834514, Offset: 834514, Offset bytes: 427271168, Size: 0.00 B)
      Sector 978102 (Id: 978102, Offset: 978102, Offset bytes: 500788224, Size: 0.00 B)
      Sector 224418 (Id: 224418, Offset: 224418, Offset bytes: 114902016, Size: 0.00 B)
      Sector 224420 (Id: 224420, Offset: 224420, Offset bytes: 114903040, Size: 0.00 B)
      Sector 1433746 (Id: 1433746, Offset: 1433746, Offset bytes: 734077952, Size: 0.00 B)
      Sector 1631030 (Id: 1631030, Offset: 1631030, Offset bytes: 835087360, Size: 0.00 B)
      Sector 1697158 (Id: 1697158, Offset: 1697158, Offset bytes: 868944896, Size: 0.00 B)
      Sector 510918 (Id: 510918, Offset: 510918, Offset bytes: 261590016, Size: 0.00 B)
      Sector 1492998 (Id: 1492998, Offset: 1492998, Offset bytes: 764414976, Size: 0.00 B)
----------

Note how RecuperaBit rebuilt the directory structure on its own, detecting traces of a directory identified by fakeID. Plug-ins do not need to implement directory tree reconstruction, as it is built-in.

The example I provided simulates file contents by extracting single sectors, therefore you can test the restore command as well.

In the future

NTFS analysis is a useful application. However, there are also other file system types that are very common. It would be interesting to extend the program in order to work for those, too.

Moreover, the command line version works well for scripting and testing but not so well for advanced options (like choosing what files to recover). I am planning to work on a GUI that will make using RecuperaBit much easier. In the future, it might be interesting to consider packaging the program for the most common operating systems.

Luckily, the open source nature of RecuperaBit allows anyone interested to help the development or suggest patches. If you are working on an investigation and need to look for a specific kind of data, you can just tweak a scanner or write your own!

About the author

Andrea Lazzarotto is an independent full stack developer and IT consultant based in Italy. He holds an MSc in Computer Science, obtained with a thesis about forensic techniques for reconstructing NTFS with partially corrupted metadata. This work led to the development of RecuperaBit, an open source program for NTFS data recovery.

Website (Italian only): https://andrealazzarotto.com

RecuperaBit on GitHub: https://github.com/Lazza/RecuperaBit