This blog post covers reverse engineering the Dropbox client, breaking its obfuscation mechanisms, decompiling it to Python code, and modifying the client to enable debug features that are normally hidden from view. If you're just interested in the relevant code and notes, please scroll to the end. As of this writing it is up to date with the current versions of Dropbox, which are based on the CPython 3.6 interpreter.

NOTE: On 20 May 2019 this blog post was updated with a reference to more interesting related work. See the bottom of this post.

Introduction

As tempting as it would be to turn a blog post with this title into a postmodern critical analysis of the movie Se7en, I’ll be discussing another kind of box today. Dropbox, to be exact. I have been fascinated by Dropbox from the moment it came on my radar shortly after it launched. Dropbox’s concept is still deceptively simple. Here’s a folder. Put files in it. Now it syncs. Move to another computing device. It syncs. The folder and files are there now too!

The amount of work that goes on behind the scenes of such an application is staggering though. First there are all the issues the engineers need to deal with when building and maintaining a cross-platform application for the major desktop operating systems (OS X, Linux, Windows). Add to that all the support for different web browsers and different mobile operating systems. And that’s just talking about the local client. The back end of Dropbox’s infrastructure, which enabled them to achieve scalability and low latency with insanely write-heavy workloads whilst supporting half a billion users, is just as interesting to me.

It’s for those reasons I always liked seeing what Dropbox did under the hood and how it evolved over the years. My first attempts at figuring out how the Dropbox client actually worked were roughly eight years ago after I saw some unknown broadcast traffic on a hotel network. Upon investigating it turned out to have been a part of Dropbox’ feature called LanSync which enables faster synchronization if Dropbox nodes on the same local network have access to the same files. However, the protocol was not documented and I wanted to know more. So I decided to look at the client in more detail. I ended up reverse engineering a lot of the client. This research was never published although I did share some notes here and there with some folks.

When starting Anvil Ventures, Chris and I evaluated a number of tools for document storage, sharing and collaboration. One of these was obviously Dropbox, and that was another reason for me to dig up my old research notes and check them against the current state of the Dropbox client.

Decryption and Unobfuscation

At the time I downloaded the Dropbox client for Linux. I was quick to find by running strings over it that the Dropbox client was written in Python. As the Python license is fairly permissive it’s easy for people to modify and distribute a Python interpreter together with other dependencies as commercial software. I then embarked on a reverse engineering project to see if I could figure out how the client worked.

At the time the byte-compiled files were in a ZIP file that was concatenated to the Linux *dropbox* binary itself. The main binary was simply a modified Python interpreter that would then load itself by hijacking the Python import mechanisms. Every subsequent import call would then be redirected by parsing the ZIP file inside the binary. Of course extracting the ZIP file from this binary was easy by simply running unzip on it. Besides that a tool like binwalk would have done the job to extract the file with all the byte-compiled pyc-files in it.
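As a rough illustration of why the appended archive was so easy to get at: Python's own zipfile module locates the ZIP directory from the end of the file, so it tolerates arbitrary data in front of the archive. The sketch below builds a stand-in "binary" (a fake ELF stub plus a small ZIP with made-up member names) and lists its contents. The stub bytes and file names are assumptions for the demo, not the layout of the real Dropbox binary.

```python
import io
import zipfile


def list_appended_zip(blob):
    """List the members of a ZIP archive that has other data prepended to it."""
    # zipfile finds the end-of-central-directory record by scanning from the
    # end of the file, so the leading executable bytes do not get in the way.
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return archive.namelist()


def make_demo_binary():
    """Build a stand-in for 'interpreter + appended ZIP' (hypothetical contents)."""
    payload = io.BytesIO()
    with zipfile.ZipFile(payload, "w") as archive:
        archive.writestr("hashlib.pyc", b"not real bytecode")
        archive.writestr("dropbox/client.pyc", b"not real bytecode")
    fake_executable = b"\x7fELF" + b"\x00" * 64  # placeholder, not a real ELF
    return fake_executable + payload.getvalue()


demo_names = list_appended_zip(make_demo_binary())
print(demo_names)
```

Against a real file one would read the bytes from disk instead of calling make_demo_binary(), which is essentially what running unzip on the binary did.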

As I couldn't break the encryption applied to the byte-compiled pyc-files at the time, I ended up taking a Python standard library shared object that I recompiled with a "backdoor" in it. When Dropbox now ran and loaded this .so file I could use this to easily execute arbitrary Python code in the running interpreter. Although I discovered this independently the same technique was used by Florian Ledoux and Nicolas Ruff in a presentation given at Hack.lu in 2012.

Being able to investigate and manipulate the running Dropbox Python code led me down the rabbit hole. Several anti-debugging tricks were used to make it harder to dump the code objects. For example, under normal CPython interpreter conditions it's easy to get the compiled bytecode representing a function back. A quick example:

>>> def f(i=0):
...     return i * i
...
>>> f.__code__.co_code
b'|\x00|\x00\x14\x00S\x00'
>>> import dis
>>> dis.dis(f)
  2           0 LOAD_FAST                0 (i)
              2 LOAD_FAST                0 (i)
              4 BINARY_MULTIPLY
              6 RETURN_VALUE
>>>

But the co_code property was patched out of the exposed member list in the compiled version of Objects/codeobject.c. This member list normally looks something like the following and by simply removing the co_code property one cannot dump those code objects anymore.

static PyMemberDef code_memberlist[] = {
    ...
    {"co_flags", T_INT, OFF(co_flags), READONLY},
    {"co_code", T_OBJECT, OFF(co_code), READONLY},
    {"co_consts", T_OBJECT, OFF(co_consts), READONLY},
    ...
};
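A quick way to see the effect of such a patch from inside an interpreter is to probe for the attribute instead of assuming it exists. The sketch below is only an illustration: StrippedCode is a hypothetical stand-in for a code object whose co_code member has been removed, since a stock CPython cannot reproduce the patched behaviour.

```python
def exposes_bytecode(code_obj):
    """Return True if the object exposes its raw bytecode through co_code."""
    return isinstance(getattr(code_obj, "co_code", None), bytes)


def f(i=0):
    return i * i


class StrippedCode:
    """Hypothetical stand-in for a code object without a co_code member."""
    co_flags = 0
    co_consts = ()


stock_result = exposes_bytecode(f.__code__)         # stock interpreter
stripped_result = exposes_bytecode(StrippedCode())  # simulated patched one
print(stock_result, stripped_result)
```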

Besides that other libraries such as the standard Python disassembler were removed. In the end I managed to dump the code objects to files but I still couldn't decompile them. It took me a while to figure out what was going on until I realized that the opcodes as used by the Dropbox interpreter were not the same as the standard Python opcodes. So the problem then becomes how to figure out what the new opcodes are so one can rewrite the code objects back to the original Python byte code.

One option for this is called opcode remapping and it was, to the best of my knowledge, pioneered by Rich Smith and presented at Defcon 18. In this talk he also introduced pyREtic which was an approach to in-memory reverse engineering of Python bytecode. The pyREtic code seems fairly unmaintained and targeted towards "old" Python 2.x binaries. Rich's talk is highly recommended for the techniques he pioneers there.

The opcode remapping technique takes all the code objects of the Python standard library and compares them to the ones extracted from the Dropbox binary, for example the code objects in hashlib.pyc or socket.pyc, which are both part of the standard library. If, each time, obfuscated opcode 0x43 lines up with standard opcode 0x21, one can slowly build up a translation table to rewrite code objects. Those rewritten code objects can then be put through a Python decompiler. It still requires patching the modified interpreter to even be able to dump the code objects, by making sure that the co_code attribute is exposed properly.
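The table-building step can be sketched in a few lines. This is a minimal illustration, not the tooling used for this research: it assumes the two-bytes-per-instruction "wordcode" layout of CPython 3.6 and later, it assumes both versions of a module compile to bytecode of identical length, and the scramble() function is a made-up permutation standing in for the unknown remapping in the modified interpreter.

```python
def build_opcode_map(pairs):
    """Derive {obfuscated_opcode: standard_opcode} from matching bytecode pairs."""
    table = {}
    for obfuscated, reference in pairs:
        if len(obfuscated) != len(reference):
            continue  # not compiled from identical source; skip this pair
        # Wordcode: even offsets hold opcodes, odd offsets hold arguments.
        for offset in range(0, len(reference), 2):
            seen = table.setdefault(obfuscated[offset], reference[offset])
            if seen != reference[offset]:
                raise ValueError("inconsistent mapping for opcode %#x"
                                 % obfuscated[offset])
    return table


def rewrite_opcodes(co_code, table):
    """Translate obfuscated bytecode back to standard opcodes."""
    out = bytearray(co_code)
    for offset in range(0, len(out), 2):
        out[offset] = table[out[offset]]
    return bytes(out)


def scramble(co_code):
    """Hypothetical obfuscation: permute opcode values, leave arguments alone."""
    out = bytearray(co_code)
    for offset in range(0, len(out), 2):
        out[offset] = (out[offset] * 7 + 3) % 256  # 7 is odd, so this is a bijection
    return bytes(out)


def square(i=0):
    return i * i


def greet(name):
    return "hello " + name.upper()


references = [fn.__code__.co_code for fn in (square, greet)]
table = build_opcode_map((scramble(ref), ref) for ref in references)
recovered = rewrite_opcodes(scramble(square.__code__.co_code), table)
print(recovered == square.__code__.co_code)
```

The table only covers opcodes that actually occur in the compared modules, which is why comparing a large body of code such as the whole standard library matters in practice.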

Another option is to break the serialization format. In Python this is called marshalling. Simply trying to load the obfuscated files by unmarshalling them the usual route did not work. Upon reverse engineering the binary using IDA Pro I discovered that there's a specific decryption phase taking place. The first person that seemed to have published something on this publicly was Hagen Fritsch in this blogpost. In it he alludes to changes being made in newer versions of Dropbox (when Dropbox switched from using Python 2.5 to Python 2.7 for its builds). The algorithm works as follows:

When unmarshalling a pyc file the header is being read to determine the marshalling version. This format is explicitly undocumented save for the CPython implementation itself.

The format defines a list of types which are encoded in it. Types are True, False, floats etc. but the most important one is the type for the aforementioned Python code object.

When loading a code object two extra values first are being read from the input file.

The first is a 32 bit sized random value.

The second is a 32 bit sized length value denoting the length of the serialized code object.

Both the rand and length value are then fed into a simple RNG function yielding a seed.

This seed value is then supplied to a Mersenne Twister and four 32-bit values are being generated.

These four values concatenated together yield the encryption key for the serialized data. The encryption algorithm then is the Tiny Encryption Algorithm which is then used to decrypt the data.
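For readers unfamiliar with the two primitives named in the steps above, here is a minimal sketch of both in their textbook form: MT19937 with the reference integer seeding, and TEA with 32 rounds on a 64-bit block. Whether the Dropbox build uses exactly these variants (seeding routine, round count, byte order, padding) is an assumption on my part, and the seed used here is a placeholder because the function that mixes rand and length is not described above.

```python
MASK32 = 0xFFFFFFFF
TEA_DELTA = 0x9E3779B9


class MT19937:
    """Textbook 32-bit Mersenne Twister with the reference integer seeding."""

    def __init__(self, seed):
        self.state = [0] * 624
        self.index = 624
        self.state[0] = seed & MASK32
        for i in range(1, 624):
            prev = self.state[i - 1]
            self.state[i] = (1812433253 * (prev ^ (prev >> 30)) + i) & MASK32

    def _twist(self):
        for i in range(624):
            y = (self.state[i] & 0x80000000) | (self.state[(i + 1) % 624] & 0x7FFFFFFF)
            nxt = self.state[(i + 397) % 624] ^ (y >> 1)
            if y & 1:
                nxt ^= 0x9908B0DF
            self.state[i] = nxt
        self.index = 0

    def next_u32(self):
        if self.index >= 624:
            self._twist()
        y = self.state[self.index]
        self.index += 1
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        y ^= y >> 18
        return y & MASK32


def _tea_mix(v, total, k_a, k_b):
    """The TEA round function for one half of the block."""
    return ((((v << 4) & MASK32) + k_a) ^ (v + total) ^ ((v >> 5) + k_b)) & MASK32


def tea_encrypt_block(v0, v1, key):
    """Textbook TEA encryption; only here to round-trip the demo."""
    total = 0
    for _ in range(32):
        total = (total + TEA_DELTA) & MASK32
        v0 = (v0 + _tea_mix(v1, total, key[0], key[1])) & MASK32
        v1 = (v1 + _tea_mix(v0, total, key[2], key[3])) & MASK32
    return v0, v1


def tea_decrypt_block(v0, v1, key):
    """Textbook TEA decryption of one 64-bit block given as two 32-bit words."""
    total = (TEA_DELTA * 32) & MASK32
    for _ in range(32):
        v1 = (v1 - _tea_mix(v0, total, key[2], key[3])) & MASK32
        v0 = (v0 - _tea_mix(v1, total, key[0], key[1])) & MASK32
        total = (total - TEA_DELTA) & MASK32
    return v0, v1


seed = 5489  # placeholder; the real seed comes from the rand and length values
generator = MT19937(seed)
key = [generator.next_u32() for _ in range(4)]  # four 32-bit words = 128-bit key

block = (0x01234567, 0x89ABCDEF)
encrypted = tea_encrypt_block(block[0], block[1], key)
decrypted = tea_decrypt_block(encrypted[0], encrypted[1], key)
print(decrypted == block)
```

TEA works on 8-byte blocks, so a serialized code object would be processed as a sequence of 32-bit word pairs, with the length value telling the loader how many bytes of the decrypted output are real data.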

For my own code I ended up writing a Python-based unmarshaller from scratch. The part that decrypts the code objects looks something like the excerpt below. Note that this method has to be called recursively: the top-level object of a pyc file is a code object, which contains further code objects for classes, functions or lambdas, which can themselves contain methods, functions or lambdas. It’s code objects all the way down!