Prelude

Today’s RE community is focused on code research, and its main question is “How does code work and how does it handle data?” We will look at reverse engineering from another angle and ask “How is the data processed by code organized?” We will see that reverse data engineering is less evolved than reverse code engineering.

What you can expect is analysis of proprietary file formats and databases, interfile formats, approaches to reverse engineering data structures, research methods that need neither a disassembler nor a debugger, and ways to unpack data. Please do not expect complete source code or SQL databases.

File Format Reverse Engineering

Database RE is a process based on file format RE; the former is more abstract than the latter.

Abstraction levels

At the first abstraction level, we research single files that do not depend on each other or are bound only semantically, e.g. a set of .jpg files in a folder. Let’s first define the terms file, file format, and binary file.

A file is data structured by certain rules.

A file format is the set of rules that structure data in a file.

A simple example is a text file containing several lines. The format of that file can be treated as a UTF-8 encoded text divided by newline characters.

For us, and for our tragedy,

Here stooping to your clemency,

We beg your hearing patiently.

A binary file is a file containing raw bytes and/or human-readable information.

Suppose a binary file stores messages and the dates they were added. You can see the content of the file below. Its format is stricter than the one shown before: the first four bytes are the message length (0xC = 12, stored little-endian), the following bytes are the message of that length without a terminating NULL byte (“Some message”), followed by one byte for the day added (0x6 = 6), one byte for the month added (0xB = 11), and two bytes for the year added (0x7E1 = 2017, stored little-endian as E1 07).

0C 00 00 00-53 6F 6D 65-20 6D 65 73-73 61 67 65  ....Some message
06 0B E1 07                                      ....
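The layout described above can be decoded with a few lines of Python. This is only a minimal sketch: the names `RAW` and `parse_message` are made up for illustration, and the assumption that the message text is ASCII comes from the single sample in the dump.

```python
import struct
from datetime import date

# The 20 bytes from the dump: length, message, day, month, year.
RAW = bytes.fromhex("0C000000 536F6D65206D657373616765 06 0B E107")


def parse_message(buf):
    """Decode one message record (hypothetical helper, not from the article)."""
    # "<I": 4-byte little-endian unsigned message length.
    (length,) = struct.unpack_from("<I", buf, 0)
    # The message follows the length and has no NULL terminator.
    text = buf[4:4 + length].decode("ascii")
    # "<BBH": day (1 byte), month (1 byte), year (2 bytes, little-endian).
    day, month, year = struct.unpack_from("<BBH", buf, 4 + length)
    return text, date(year, month, day)


message, added = parse_message(RAW)
print(message, added.isoformat())  # Some message 2017-11-06
```

The `<` prefix in the format strings matters: it selects little-endian byte order and disables alignment padding, which is what the byte sequence E1 07 for 2017 implies.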

There are a few articles on file format reverse engineering on the internet, but they are far from common. There are also tools for describing file formats, primarily hex editors augmented with imperative scripting languages. One of the cutting-edge tools is Kaitai Struct, which lets you describe a binary file format declaratively.

Database Reverse Engineering

The second abstraction level appears when we research a set of binary files tied together internally. We call that set a “database”.

A database is a set of binary files that contain structured data and cross-reference each other.

Note that this is a very low-level definition of a database. Today the reverse engineering process is still a detailed one, and database RE is no exception: there are very few cases where we can bypass research into database internals and start working with the data immediately. In those rare cases, there is no need to reverse engineer.

Consider a musical band database that consists of three binary files. Band.dat1 is a list of bands, Album.dat2 holds the albums of each band, and Image.dat3 contains band member photos and album covers.

Band.dat1

Album.dat2

Image.dat3

Let’s describe Band.dat1. It consists of an array with a single element, described by the following structure:

Word: band ID.
0x20 bytes: band name; unused bytes are zeroed out.
List of words up to the 0xFFFF terminator: album IDs.
0xFFFF: terminator.
List of dwords up to the 0xFFFFFFFF terminator: photo IDs.
0xFFFFFFFF: terminator.

There are two dependencies: album ID (the cross reference to Album.dat2) and photo ID (the cross reference to Image.dat3).

00 00 54 6F-6F 6C 00 00-00 00 00 00-00 00 00 00  ..Tool..........
00 00 00 00-00 00 00 00-00 00 00 00-00 00 00 00  ................
00 00 00 00-01 00 02 00-03 00 FF FF-00 00 00 80  ................
01 00 00 80-FF FF FF FF                          ........
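A record with terminated lists like this one can be read with a small loop per list. The sketch below rebuilds the 56 bytes of the dump and parses them; `read_terminated` and `parse_band` are hypothetical names, and the ASCII encoding of the band name is an assumption based on the single sample.

```python
import struct

# The 56 bytes of Band.dat1 shown in the dump.
BAND_DAT1 = (
    bytes.fromhex("0000")                            # band ID
    + b"Tool".ljust(0x20, b"\x00")                   # zero-padded band name
    + bytes.fromhex("0000 0100 0200 0300 FFFF")      # album IDs + terminator
    + bytes.fromhex("00000080 01000080 FFFFFFFF")    # photo IDs + terminator
)


def read_terminated(buf, offset, fmt, terminator):
    """Read little-endian integers until the terminator value is found."""
    size = struct.calcsize(fmt)
    values = []
    while True:
        (value,) = struct.unpack_from(fmt, buf, offset)
        offset += size
        if value == terminator:
            return values, offset
        values.append(value)


def parse_band(buf, offset=0):
    """Parse one band record; returns the record and the next offset."""
    band_id, raw_name = struct.unpack_from("<H32s", buf, offset)
    offset += struct.calcsize("<H32s")
    name = raw_name.rstrip(b"\x00").decode("ascii")
    album_ids, offset = read_terminated(buf, offset, "<H", 0xFFFF)
    photo_ids, offset = read_terminated(buf, offset, "<I", 0xFFFFFFFF)
    record = {"id": band_id, "name": name,
              "albums": album_ids, "photos": photo_ids}
    return record, offset


band, end = parse_band(BAND_DAT1)
print(band["name"], band["albums"], [hex(p) for p in band["photos"]])
```

Returning the next offset lets the same function be called in a loop if the file held more than one band, which the variable-length lists would otherwise make awkward.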

Now look at Album.dat2. It is an array of four elements, each described as follows:

Word: album ID.
Word: release year.
Byte: album title length (N).
N bytes: album title.
Dword: cover ID.

Here we see only one dependency: cover ID (the cross reference to Image.dat3).

00 00 C9 07-08 55 6E 64-65 72 74 6F-77 02 00 00  .....Undertow...
80 01 00 CC-07 06 41 65-6E 69 6D 61-03 00 00 80  ......Aenima....
02 00 D1 07-09 4C 61 74-65 72 61 6C-75 73 04 00  .....Lateralus..
00 80 03 00-D6 07 0A 31-30 30 30 30-20 44 61 79  .......10000 Day
73 05 00 00-80                                   s....
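Because the title is length-prefixed, the records have variable size and must be walked sequentially. A minimal sketch, assuming ASCII titles and using the hypothetical names `ALBUM_DAT2` and `parse_albums`:

```python
import struct

# The 69 bytes of Album.dat2 shown in the dump, one record per line.
ALBUM_DAT2 = bytes.fromhex(
    "0000 C907 08 556E646572746F77 02000080"
    "0100 CC07 06 41656E696D61 03000080"
    "0200 D107 09 4C61746572616C7573 04000080"
    "0300 D607 0A 31303030302044617973 05000080"
)


def parse_albums(buf):
    """Walk the file record by record until the buffer is exhausted."""
    albums, offset = [], 0
    while offset < len(buf):
        # Fixed head: album ID (word), release year (word), title length (byte).
        album_id, year, title_len = struct.unpack_from("<HHB", buf, offset)
        offset += struct.calcsize("<HHB")
        title = buf[offset:offset + title_len].decode("ascii")
        offset += title_len
        # Fixed tail: cover ID (dword) referencing Image.dat3.
        (cover_id,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        albums.append((album_id, year, title, cover_id))
    return albums


for album_id, year, title, cover_id in parse_albums(ALBUM_DAT2):
    print(album_id, year, title, hex(cover_id))
```

A useful sanity check when guessing such a format is that the walk ends exactly at the end of the file; a wrong field size usually makes `struct.unpack_from` run past the buffer and raise `struct.error`.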

Finally, study Image.dat3. It is an array of six elements of 0x10 bytes each, with no dependencies on other files. The element format is:

Dword: image ID.
0xC bytes: image (fictitious, of course).

00 00 00 80-50 49 43 00-00 00 00 00-00 00 00 00 АPIC

01 00 00 80-50 49 43 00-00 00 00 00-00 00 00 00 ☺ АPIC

02 00 00 80-50 49 43 00-00 00 00 00-00 00 00 00 ☻ АPIC

03 00 00 80-50 49 43 00-00 00 00 00-00 00 00 00 ♥ АPIC

04 00 00 80-50 49 43 00-00 00 00 00-00 00 00 00 ♦ АPIC

05 00 00 80-50 49 43 00-00 00 00 00-00 00 00 00 ♣ АPIC
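Fixed-size records are the easiest case: the file length must be a multiple of the record size, and the records can be unpacked in one pass. In this sketch the sample bytes are regenerated rather than typed in, and `RECORD`, `IMAGE_DAT3` and `parse_images` are hypothetical names.

```python
import struct

# One record: image ID (dword, little-endian) + 0xC bytes of image data.
RECORD = struct.Struct("<I12s")

# Rebuild the 96 bytes of the dump; "12s" pads b"PIC" with zero bytes.
IMAGE_DAT3 = b"".join(RECORD.pack(0x80000000 + i, b"PIC") for i in range(6))


def parse_images(buf):
    """Return a mapping from image ID to the raw 0xC-byte payload."""
    if len(buf) % RECORD.size:
        raise ValueError("file size is not a multiple of the record size")
    return {image_id: payload for image_id, payload in RECORD.iter_unpack(buf)}


images = parse_images(IMAGE_DAT3)
print(len(images), [hex(i) for i in images])
```

Keying the result by image ID makes it straightforward to resolve the photo IDs and cover IDs found in the other two files.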

Below are the three file formats and their dependencies, which together form the database architecture.

The database architecture

So we have learned the two abstraction levels, the file format level and the database level, and are ready to formulate the database research problem.

The database reverse engineering problem is the problem of reconstructing unknown data structures and finding the dependencies between them.
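One simple way to look for dependencies, once candidate fields have been extracted, is to test whether every value of a field also occurs among the IDs of another file. The sketch below applies this to the values read from the three sample files; `dangling` and `find_dependencies` are hypothetical helpers, and set inclusion is only evidence of a cross reference, not proof, since small ID ranges can overlap by accident.

```python
# Values read from the Band.dat1, Album.dat2 and Image.dat3 dumps.
fields = {
    "Band.album_ids": [0, 1, 2, 3],
    "Band.photo_ids": [0x80000000, 0x80000001],
    "Album.cover_id": [0x80000002, 0x80000003, 0x80000004, 0x80000005],
}
keys = {
    "Album.id": [0, 1, 2, 3],
    "Image.id": [0x80000000 + i for i in range(6)],
}


def dangling(references, targets):
    """Return referenced values that match no target ID."""
    return sorted(set(references) - set(targets))


def find_dependencies(fields, keys):
    """List (field, key) pairs where every field value resolves to a key."""
    return [
        (field, key)
        for field, values in fields.items()
        for key, ids in keys.items()
        if values and not dangling(values, ids)
    ]


for field, key in find_dependencies(fields, keys):
    print(field, "->", key)
```

On the sample data this reports the same three links as the format descriptions: album IDs point into Album.dat2, while photo IDs and cover IDs point into Image.dat3.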

Database RE is a phase rather than a final goal in many real tasks. After this phase comes data processing, followed by conversion of the data to a new, possibly relational, database. That process can be called ETL (Extract, Transform, Load), but it has nothing to do with database reverse engineering.

I did not find any papers on the subject. I suppose it is because specialists do not distinguish between file format RE and database RE. Another reason is that database reverse engineering is a very time-consuming process performed mostly for customers under an NDA, which limits what potential papers can disclose.

Summary

We gave a reasonably formal definition of a database as a set of files, but in reality the borders may be blurred, and often we call something a database only because we subjectively see it as a linked data set. It is important to research how data is linked, no matter whether it is located in a single file or partitioned across multiple files. But the whole complexity is concentrated in that research process. Starting a new task, we have a ton of files with meaningless names and extensions and a program that processes those files. We begin by reviewing the files one by one in a hex editor and noting their differences, their possible structure, the presence or absence of text files, etc. After that we review the program itself. When the review is complete, we are left alone with dozens of blobs whose makers had no burning desire to let us research them. In the following parts, you will learn what to do next.