Overview

After reading feedback from the first part to the Accelerated Assembly guide, I’ve decided to take on a custom target, and call back to high-level languages when we encounter obscure or new pieces in the assembly. I realize that the level of detail in my last article may have been cumbersome to some readers, but I plan to stick to covering what is necessary to understand the material on the page. That being said this article is going to be about the same length, but only because the example was created by a friend so I have no prior knowledge of the implementation details. We’ll go from blackbox to well-documented. Along the way, I’ll be teaching the readers how I go about assessing a target and documenting functionality, as well as techniques I use to understand complex assembly listings. We’ll be referencing the Intel and AMD software development manuals often. It’s important to remember that this series serves as a guide to reverse engineering on a Windows OS, and how to think about reverse engineering. All skills learned can be taken and applied to other systems.

All demos are performed on Windows 10 Version 2004; Build 19035. (This build is not required. Having Windows 10 will be sufficient.)

Disclaimer All examples and information provided in these articles are based on C/C++ applications. It is assumed you have programmed and have experience in a high-level language like C, CPP, Rust, or otherwise. If you do not, the contents of this series may be difficult to follow. All author projects are written with Visual Studio 2019, and compiled using Intel C++ 19.1 Compiler. All optimizations are turned off to reduce the number of obscure assembly listings due to compiler optimizations complicating comprehension. Some details are omitted to prevent diving down the rabbit hole even further. If you’re an avid reader and want to know more than is provided see the recommended reading section at the end.

Addendum: I want to quickly take a moment to address my style of writing and teaching. I’m a firm believer in learning by doing as that’s the way I learned and how I continue learning in regards to reverse engineering or development of anything, for that matter. I realize some learn better through extensive breakdowns, simpler examples, live demonstrations, etc. As much as I wish I could I don’t have the means to cater to all learning styles, so writing is the best outlet I can give. I write so that interested readers, regardless of learning style, can come back without having to timestamp a video or scour many examples to find some piece of information they forgot or need. I hope that if your learning style is much different than mine that you still find value in these pieces and know that I’m always available to answer questions or help improve your understanding.

I want you to succeed at learning how to reverse engineer and apply it to the real world, but I’m not a teacher by any means so if there are gaps please bear with me, or let me know so that I can add it in! And thank you for your patience while I write these :).

— Omissions from Part 1

𝛿 Linked List/Doubly-Linked List Example (Intentional)

Target Acquired

Now that we’ve covered a few necessary examples, and you now know the details of all the calling conventions – let’s get right into it. We’re going to assess a single target in this section. It’s going to be a long one, complicated, and probably mildly frustrating for you as well but you will come out on the other side knowing much more than you did before!

— Robbing A Bank Requires A Blueprint

This first example is based on an authorization protocol I’ve seen used in the wild. It’s quite shoddy and broken, and yes – this is a rough recreation of it. There are multiple procedures used in this function, however, all license validation was performed locally and in the entry point of the application. The application itself was widely used and presented a number of attack vectors in regards to exploitation which we’ll cover as we encounter them. The assembly is somewhat confusing if you’re just starting out, but we’re going to break it down piece by piece and establish a knowledge base of the target. We’ll note things like local variables used, potentially inlined functions, CRT procedures, and all the attack vectors. This example makes use of structures and we’ll see how these structures are used as well as how to deduce what the different members of the structure(s) are. There will be a lot of new things encountered, so make note of anything that may be confusing and be sure to review the breakdown afterward.

We will be starting without prior knowledge of the source code, and to ensure that I did that as well I had a friend write the application and then I reversed it ahead of time. I did this so that assumptions aren’t made with insider knowledge, so to speak. That way your results will be consistent with mine as we walk through it. I’ll provide the source code given at the end of this break down for you to compare your pseudocode with.

What about the tools?

The reason I haven’t introduced any of the tools yet is that as you learn to reverse engineer it’s important to not become dependent on the tools you may have such as IDA, x64dbg, Hiew, etc. To become proficient at RE you need to be able to work from a cut and dry disassembly listing and be able to deduce as much as you can from that. There may not always be a tool that supports an architecture your target is running on and then at that point, if you’re dependent on tools, you become useless. We’re going to work from a standpoint of only having knowledge of basic instructions, architecture, and make deductions of behavior and program flow. Once we understand the flow of the program overall we’ll dig into the details and hunt down what is used where. This will be slow, but it will allow you to move quickly in the future if you don’t have any tools at your disposal.

With that being said, here’s the listing from the entry point of the target program.

push rbp sub rsp, 0B0h lea rbp, [rsp+40h] mov [rbp+68h], rdi mov [rbp+60h], rsi mov [rbp+58h], rbx mov dword ptr [rbp+0], 0 mov eax, 105h mov rcx, rax call _unk_fnc mov [rbp+28h], rax mov rax, [rbp+28h] mov [rbp+30h], rax mov dword ptr [rbp+4], 105h mov dword ptr [rbp+8], 0 mov dword ptr [rbp+0Ch], 0 mov rax, cs:GetVolumeInformationW mov [rbp+38h], rax mov eax, 105h mov rcx, rax call _unk_fnc mov [rbp+40h], rax mov rax, [rbp+40h] mov [rbp+48h], rax lea rax, unk_140028000 mov rdx, [rbp+48h] mov rcx, rax call sub_14000113C mov [rbp+10h], eax mov rax, [rbp+38h] mov edx, 0 mov rcx, [rbp+30h] mov ebx, [rbp+4] lea rsi, [rbp+0] lea rdi, [rbp+8] mov [rsp+20h], rdi lea rdi, [rbp+0Ch] mov [rsp+28h], rdi mov qword ptr [rsp+30h], 0 mov edi, 0FFFFFFFFh add edi, [rbp+4] mov [rsp+38h], edi mov [rbp+50h], rcx mov rcx, rdx mov rdx, [rbp+50h] mov r8d, ebx mov r9, rsi call rax mov [rbp+14h], eax mov eax, [rbp+0] mov ecx, eax call sub_14000136C mov [rbp+18h], eax mov eax, [rbp+18h] test eax, eax jz short loc_140001125 mov rax, [rbp+48h] mov rcx, rax call sub_1400014A4 mov [rbp+1Ch], eax mov eax, [rbp+0] mov ecx, eax call sub_14000136C mov [rbp+20h], eax mov eax, [rbp+1Ch] mov edx, [rbp+20h] cmp eax, edx jnz short loc_140001125 lea rax, unk1 mov rcx, rax call sub_140001254 mov [rbp+24h], eax loc_140001125: mov eax, 0 mov rbx, [rbp+58h] mov rsi, [rbp+60h] mov rdi, [rbp+68h] lea rsp, [rbp+70h] pop rbp retn

Initially, this may appear quite daunting given the examples in the previous post. When you see long listings like this your first instinct should be to break things off in chunks. Upon immediate overview, we only notice two branching instructions that lead us to an area where cleanup is performed and the function returns. This means there’s two conditionals inside of this function, and we know that right off the bat which is helpful. There are no error checks after invocation (otherwise there would be other branches after call instructions), and the code path is linear until we hit the instruction where we could potentially branch. Knowing this simplifies analysis. We can walk down the listing until the branching instruction.

Let’s start by pulling a section of this assembly out and walking through it.

push rbp sub rsp, 0B0h lea rbp, [rsp+40h] mov [rbp+68h], rdi mov [rbp+60h], rsi mov [rbp+58h], rbx mov dword ptr [rbp+0], 0 mov eax, 105h mov rcx, rax call _unk_crt mov [rbp+28h], rax

How did I know to pull out the code up to that point? I just looked for the first call saw where the return value in rax was stored and pulled the instructions from that instruction to the first instruction. We’re going to start off with some simple math to attempt to determine how many local variables are present, and guess their type. In the snippet, we see that we push rbp to save the previous functions stack frame, then we subtract B0 (176 bytes) from the stack. This is a total of 184 bytes, and 184 modulo 16 = 8. We can assume that when a compiler generates this code the stack will be aligned properly, but to infer local storage allocation we need to know how much is initially allocated. Then we see the function setup our stack frame using lea rbp, [rsp+40h] . With all this information, we want to take our initial allocation value B8h (184 [incl. push rbp]) and subtract 40h (64) which leaves us with 78h (120) bytes. That’s still a lot of space. Let’s then take the size of our shadow store (spill space) and subtract it from 78h: 78h (120) - 20h (32) = 58h (88) . We have 88 bytes of on the stack for local storage. 88 modulo 16 = 8 – we’re aligned. That’s still a lot of storage, so let’s try something different. Refer back to the full function disassembly. We need to find the lowest and highest offset from rbp , and note missing offsets when addressing and storing values.

The lowest offset is shown in our snippet at the 7th instruction, mov dword ptr [rbp+0], 0 , so the lowest offset is 0. How about our highest? At the very end lea rsp, [rbp+70h] is restoring the stack to its state before we ran the code of our function. Now comes the tedious part: make note of all offsets used.

58, 60, 68, 70, 0, 28, 30, 4, 8, C, 40, 48, 10, 38, 20, 28, 50, 14, 18, 1C, 24

What can we do with these? Well, first we should sort them and identify the gaps. The offsets that are unused are typically used for alignment purposes when different sized variables are used. I ran this list through a sorting algorithm and removed duplicates, and this was the output.

00 04 08 0C 10 14 18 1C 20 24 28 30 38 40 48 50 58 60 68 70

We accidentally included the offsets that go into our shadow store, so let’s chop those off.

00 04 08 0C 10 14 18 1C 20 24 28 30 38 40 48 50

Now we’re getting somewhere. Simply glancing at the list tells you there are some 4-byte offsets indicating some 32-bit storage, and then it jumps to 8-byte offsets indicating some 64-bit storage area. We could take this information and assume this is how many local variables were used in this function. If we did that we’d wind up calculating that there are potentially:

32-bit variables = 10 64-bit variables = 5

We want to be more thorough than that. How could we determine the true number of local variables used? Remember that this example is unoptimized and the compiler tends to repeat operations that could be cut out (think back to the examples in last post). If you thought about counting uses of temporary storage you’re absolutely correct. Counting uses of temporary storage is as simple as looking for rbp+N offsets that are used to store register contents and then are followed by the copying of that value to a different register. Here’s an instance of it from the code above:

call _unk_crt mov [rbp+28h], rax mov rax, [rbp+28h] mov [rbp+30h], rax

It takes the return value of _unk_crt and stores it in [rbp+28h] , then copies [rbp+28h] back into rax , and then finally stores rax into [rbp+30h] . Why didn’t it copy rax into [rbp+30h] to begin with? No optimization. The compiler was so lazy it made its job even more difficult, but this helps us! Look around in the code and you’ll see there are no other references to [rbp+28h] . Great, now we can mark it as temporary storage and take it out of our 32-bit variable count. If we do the same for all instances of temporary storage and even potentially wasted storage we wind up with 4 32-bit local variables. By looking for unused storage, or temporary storage/unnecessary copies we narrowed down the count of our 32-bit variables. What do I mean by unused storage?

mov [rbp+10h], eax ; [rbp+10h] is not used anywhere else in function.

Okay, how about our 64-bit variables? Well, we see there are a few function calls and then the return value is stored in a stack location much like the call to _unk_crt . We see that it winds up in some temporary storage at [rsp+30h] then in [rsp+38h] finally. That’s one 64-bit variable. There are two calls to _unk_crt so it’s safe to assume there are 2 64-bit variables at a minimum. If we look for some copies to stack locations with the specifier qword ptr we could possibly note those as 64-bit variables. But wait, the only move to a 64-bit location with this instruction encoding uses [rsp+30h] and isn’t offsetting from the current stack frame! This is generally indicative that the value you’re looking at is used as an argument to a function.

Calling Convention Matters The default calling convention on Windows is fastcall. The first 4 arguments are passed through registers rcx, rdx, r8, and r9; respectively. Any other arguments are pushed onto the stack, or in this case, loaded into preallocated stack locations.

This means that the 64-bit copy you see occurring offsetting from rsp is loading that location with an argument to some function call down the instruction stream. You can see this call happen a few instructions later: call rax . In this case, we don’t consider this a 64-bit local variable for our current function. If we continue to look around you may also notice the copying of the address of GetVolumeInformationW into rax which then stores rax in [rbp+40h] using the default operation size (64-bits). That’s 3 64-bit variables we’ve tracked so far. Scanning the excerpt more I don’t see any other instructions that store a 64-bit value to a stack location. We’ve now determined with reasonable confidence that we have the following variables used in our function:

32-bit variables = 4 64-bit variables = 3

If you thought that was a lot of work to determine this you’re right, it’s more than you’d normally have to do when using commercial tools like IDA or Binary Ninja. Learning as if you have no tools is the best way to become proficient, however, so we’re going to continue operating as if we don’t have the tools at our disposal.

Reversing Challenge There is an easier way, in this example, to determine the number of 32-bit variables being used in this function. Can you identify how? There are hints in the above explanation.

# Returning Back to Snippet

push rbp sub rsp, 0B0h lea rbp, [rsp+40h] mov [rbp+68h], rdi mov [rbp+60h], rsi mov [rbp+58h], rbx mov dword ptr [rbp+0], 0 mov eax, 105h mov rcx, rax call _unk_crt mov [rbp+28h], rax

After the stack allocation, stack frame setup, and the copying of our arguments to shadow store we see a 32-bit move to [rbp] . This is one of our local variables being initialized to 0. Notice the dword ptr specifier – this is the answer to the challenge above. Moving on we see a value of 105h (261) loaded into rcx and then a call to _unk_crt . The 261 is one of the arguments as per the calling convention. We don’t know what this function does, but we do know that its return value is 64-bits and is stored in a local variable. Let’s pick out the next snippet.

mov rax, [rbp+28h] mov [rbp+30h], rax mov dword ptr [rbp+4], 105h mov dword ptr [rbp+8], 0 mov dword ptr [rbp+0Ch], 0 mov rax, cs:GetVolumeInformationW mov [rbp+38h], rax mov eax, 105h mov rcx, rax call _unk_crt mov [rbp+40h], rax mov rax, [rbp+40h] mov [rbp+48h], rax

The return value from _unk_crt is stored into rax then rax into its final location. The next three instructions are initializing 3 more 32-bit locals to the respective value shown. Make note of the 105h (261) constant again. It may be useful in the future. The next part is quite interesting and maybe not something you’ve seen before. It’s taking the address of GetVolumeInformationW and storing it in rax which is then copied to the stack location [rsp+38h] which we’ve already determined is 64-bits in size. If you’ve worked in C or C++ before you might recognize this pattern as some form of function pointer perhaps, but that’s a little presumptive with what information we have now. Following, we see the subroutine _unk_crt executed again with the same constant used in the beginning, and then the temporary storage is used before copying the returned value to its local variable’s stack location. The next snippet is a little bit longer and somewhat more confusing. Since we know which offsets are used for our local variables lets give those offsets aliases instead.

[rbp+0] => [rbp+v1] [rbp+4] => [rbp+v2] [rbp+8] => [rbp+v3] [rbp+0Ch] => [rbp+v4] [rbp+30h] => [rbp+unk_crt_ret_1] [rbp+38h] => [rbp+gviw_address] [rbp+48h] => [rbp+unk_crt_ret_2]

This will make identifying where certain things are used much simpler. To be clear v1 is simply an alias for 0, v2 for 4, and so on. We’re going to use these from now on and I’ve replaced the offsets in the snippets with their associated alias. Here’s the modified assembly of the next excerpt we have to analyze:

lea rax, unk_140028000 mov rdx, [rbp+unk_crt_ret_2] mov rcx, rax call sub_14000113C mov [rbp+10h], eax mov rax, [rbp+gviw_address] mov edx, 0 mov rcx, [rbp+unk_crt_ret_1] mov ebx, [rbp+v2] lea rsi, [rbp+v1] lea rdi, [rbp+v3] mov [rsp+20h], rdi lea rdi, [rbp+v4] mov [rsp+28h], rdi mov qword ptr [rsp+30h], 0 mov edi, 0FFFFFFFFh add edi, [rbp+v2] mov [rsp+38h], edi mov [rbp+50h], rcx mov rcx, rdx mov rdx, [rbp+50h] mov r8d, ebx mov r9, rsi call rax

We can quickly pinpoint where certain locals are being used now. First, it loads the address of some unknown object using lea . We don’t know what this object is but it later placed in rcx to be used as an argument to call sub_14000113C . We also see that the return value from the second call to _unk_crt is used as the second argument. Alright, there may be some sort of string or memory operation being performed by sub_14000113C . We’ll get back to that. The next instruction is a garbage store because rbp+10h is not used anywhere in the assembly. Let’s start speeding this up and absorb what multiple instructions are doing at one time. We’re loading registers with contents of local variables, then we load the address of three local variables – v1 , v3 , and v4 . Notice the two instructions not copying to areas in our stack frame. They’re offsetting from rsp so the addresses of these local variables are being used as function arguments! Below is a view of what using these locals in a high-level language would look like.

rand_func(.., .., .., &v1, &v3, &v4, .., etc);

There’s the mov qword ptr [rsp+30h], 0 which is passing 0 as an argument to some function, then -1 to edi and adding the value of v2 to it. This is the same as v2 - 1 . We see it copy one more argument then preserve the value of rcx in a stack location, and then we load up our calling convention registers rcx , rdx , r8 , and r9 . Then the program executes call rax – wait what? Recall that a few instructions above we loaded the address of GetVolumeInformationW into rax . This confirms that the local storing that address was a function pointer.

# Developing Pseudocode



So far we’ve uncovered the local variable initialization, two function calls where the return value is saved, two function calls where the return address is ignored or unused, and now we have a function call using almost all of the locals we initialized. At this point, we need to take a second to start developing the pseudocode of this function. Let’s bring the instructions up to the end of the last excerpt into view.

push rbp sub rsp, 0B0h lea rbp, [rsp+40h] mov [rbp+68h], rdi mov [rbp+60h], rsi mov [rbp+58h], rbx mov dword ptr [rbp+v1], 0 mov eax, 105h mov rcx, rax call _unk_crt mov [rbp+28h], rax mov rax, [rbp+28h] mov [rbp+unk_crt_ret_1], rax mov dword ptr [rbp+v2], 105h mov dword ptr [rbp+v3], 0 mov dword ptr [rbp+v4], 0 mov rax, cs:GetVolumeInformationW mov [rbp+gviw_address], rax mov eax, 105h mov rcx, rax call _unk_crt mov [rbp+40h], rax mov rax, [rbp+40h] mov [rbp+unk_crt_ret_2], rax lea rax, unk_140028000 mov rdx, [rbp+unk_crt_ret_2] mov rcx, rax call sub_14000113C mov [rbp+10h], eax mov rax, [rbp+gviw_address] mov edx, 0 mov rcx, [rbp+unk_crt_ret_1] mov ebx, [rbp+v2] lea rsi, [rbp+v1] lea rdi, [rbp+v3] mov [rsp+20h], rdi lea rdi, [rbp+v4] mov [rsp+28h], rdi mov qword ptr [rsp+30h], 0 mov edi, 0FFFFFFFFh add edi, [rbp+v2] mov [rsp+gviw_address], edi mov [rbp+50h], rcx mov rcx, rdx mov rdx, [rbp+50h] mov r8d, ebx mov r9, rsi call rax

Start on line 7 since the prolog is of no interest. We have a 32-bit variable initialized to 0, we know the other three are 105h, 0, and 0 in that order.

int __cdecl main(int argc, char** argv) { u32 v1 = 0; u32 v2 = 0x105; u32 v3 = 0; u32 v4 = 0; return 0; }

We know it’s the main entry point and we know it’s using the cdecl calling convention as well since it’s saving rdi , and rsi into the shadow store. We also know there are three 64-bit variables. One of which is a function pointer to GetVolumeInformationW and the other two store some return value from _unk_crt and the arguments to each of those calls is the value 105h.

int __cdecl main(int argc, char** argv) { u32 v1 = 0; u32 v2 = 0x105; u32 v3 = 0; u32 v4 = 0; u64 unk_crt_ret_1 = _unk_crt(0x105); u64 unk_crt_ret_2 = _unk_crt(0x105); // Create function pointer prototype and bind it. // typedef int (__stdcall *gviw_t)( const char*, char*, u32, u32*, u32*, u32*, char*, u32 ); gviw_t gviw = (gviw_t)GetVolumeInformationW; // Call GetVolumeInformationW indirectly. // return 0; }

Let’s simplify this a little bit more. We can use v2 instead of the two constants for the arguments to _unk_crt .

int __cdecl main(int argc, char** argv) { u32 v1 = 0; u32 v2 = 0x105; u32 v3 = 0; u32 v4 = 0; u64 unk_crt_ret_1 = _unk_crt(v2); u64 unk_crt_ret_2 = _unk_crt(v2); // Create function pointer prototype and bind it. // typedef int (__stdcall *gviw_t)( const char*, char*, u32, u32*, u32*, u32*, char*, u32 ); gviw_t gviw = (gviw_t)GetVolumeInformationW; // Call GetVolumeInformationW indirectly. // return 0; }

Nice, we’re beginning to develop a clear picture of what’s going on. Now is the tricky part where it comes in handy to know your calling conventions and how the stack is manipulated to pass arguments. Here are the instructions that prepare the arguments for the indirect call to GetVolumeInformationW.

mov rax, [rbp+gviw_address] mov edx, 0 mov rcx, [rbp+unk_crt_ret_1] mov ebx, [rbp+v2] lea rsi, [rbp+v1] lea rdi, [rbp+v3] mov [rsp+20h], rdi lea rdi, [rbp+v4] mov [rsp+28h], rdi mov qword ptr [rsp+30h], 0 mov edi, 0FFFFFFFFh add edi, [rbp+v2] mov [rsp+38h], edi mov [rbp+50h], rcx mov rcx, rdx mov rdx, [rbp+50h] mov r8d, ebx mov r9, rsi call rax

There’s an easy way to do this and a hard way. We’re going to do the hard way first, of course.

# The Hard Way

Let’s begin with what we know. Parameters are passed through registers from left to right into rcx , rdx , r8 , and r9 – in that order. If the function has more than 4 arguments they will be placed on the stack from right to left, meaning the last argument will be at the highest offset from rsp . This means that the last argument of this procedure call is placed in [rsp+20h] . Remember that offsets from rbp are into our stack frame and offsets from rsp are placing arguments on the stack for the function call. Let’s quickly look at what our function call would look like without the variables:

rax(rcx, rdx, r8d, r9, [rsp+20h], [rsp+28h], [rsp+30h], [rsp+38h]);

In this pseudo-call, we see that rax is used as a function and the arguments are in there place as they would appear at a high-level language. How do we know that [rsp+38h] is the last argument of the function? Using context clues from the excerpt and the fact that offset from rsp are not used anywhere else we are inferring that these are used in the function call. Let’s start by determining what is in each register and argument space. If we look at the first instruction it’s copying gviw_address into rax . Simple. Afterward, we load edx with 0 – now to determine all uses without scanning we can just search for any instances of edx in this chunk. We see it winds up getting loaded into rcx toward the end and used nowhere else, so we know our first argument is 0. We’ll do this same process for the rest of the variables. You might notice that rcx gets loaded with [rbp+unk_crt_ret_1] , however, rcx copies its value into [rbp+50h] which is then stored in rdx prior to the call . Sweet, we know that our second argument is unk_crt_ret_1 . If we continue doing this for all the arguments you’ll wind up determining that the function call looks like this:

gviw(0, unk_crt_ret_1, v2, &v1, &v3, &v4, 0, v2 - 1);

Remember that when lea is used it is loading the address of the location referenced, and not the data inside. Thus our couple of lea instructions are supplying the address of the locals they reference.

# The Easy Way

We know that the function being called is GetVolumeInformationW – it’s just using a function pointer instead of a direct call to the API. If we open up the MSDN page for GetVolumeInformationW we can see the arguments that are used. This is why having context clues is a world of help when reverse engineering a program. This will also help us determine which arguments are of what type and then we can properly write the types of our local variables.

Why didn’t we do this from the start? Imagine you have a target that implements a custom CRT, doesn’t reference any documented API, and is a completely black box with minimal context clues. If you only learned how to take advantage of documented information you would be lacking fundamental knowledge. It’s important to be able to determine what arguments are passed to the function without utilizing reference material. If it’s available, definitely use it. If it’s not then knowing how to do it the hard way will be advantageous.

The prototype of GetVolumeInformationW is this:

BOOL GetVolumeInformationW( LPCWSTR lpRootPathName, LPWSTR lpVolumeNameBuffer, DWORD nVolumeNameSize, LPDWORD lpVolumeSerialNumber, LPDWORD lpMaximumComponentLength, LPDWORD lpFileSystemFlags, LPWSTR lpFileSystemNameBuffer, DWORD nFileSystemNameSize );

Let’s take this knowledge now and rename our locals and adjust our pseudocode implementation.

int __cdecl main(int argc, char** argv) { u32 VolumeSerialNumber = 0; u32 VolumeNameSize = 0x105; u32 MaximumComponentLength = 0; u32 FileSystemFlags = 0; char* VolumeNameBuffer = _unk_crt(VolumeNameSize); u64 unk_crt_ret_2 = _unk_crt(VolumeNameSize); // Create function pointer prototype and bind it. // typedef int (__stdcall *gviw_t)( const char*, char*, u32, u32*, u32*, u32*, char*, u32 ); gviw_t gviw = (gviw_t)GetVolumeInformationW; // Call GetVolumeInformationW indirectly. // gviw(0, VolumeNameBuffer, VolumeNameSize, &VolumeSerialNumber, &MaximumComponentLength, &FileSystemFlags, 0, VolumeNameSize - 1); return 0; }

This looks so much better now. We can also guess what _unk_crt is at this point. Since our VolumeNameBuffer is a pointer to the return address of _unk_crt and it takes the size of the buffer as the argument an educated guess would be malloc.

int __cdecl main(int argc, char** argv) { u32 VolumeSerialNumber = 0; u32 VolumeNameSize = 0x105; u32 MaximumComponentLength = 0; u32 FileSystemFlags = 0; // Allocate buffers for two objects. // char* VolumeNameBuffer = (char*)malloc(VolumeNameSize); u64 unk_crt_ret_2 = malloc(VolumeNameSize); // Create function pointer prototype and bind it. // typedef int (__stdcall *gviw_t)( const char*, char*, u32, u32*, u32*, u32*, char*, u32 ); gviw_t gviw = (gviw_t)GetVolumeInformationW; // Call GetVolumeInformationW indirectly. // gviw(0, VolumeNameBuffer, VolumeNameSize, &VolumeSerialNumber, &MaximumComponentLength, &FileSystemFlags, 0, VolumeNameSize - 1); return 0; }

# Completing Analysis

This is coming together nicely, but we’re not quite done with the main function. There’s one buffer we don’t know about and still a little more functionality to document before we get into the different calls. Let’s bring the disassembly from call rax to the end of the function.

call rax ; GetVolumeInformationW(...); mov [rbp+14h], eax mov eax, [rbp+VolumeSerialNumber] mov ecx, eax call sub_14000136C mov [rbp+18h], eax mov eax, [rbp+18h] test eax, eax jz short loc_140001125 mov rax, [rbp+unk_crt_ret_2] mov rcx, rax call sub_1400014A4 mov [rbp+1Ch], eax mov eax, [rbp+VolumeSerialNumber] mov ecx, eax call sub_14000136C mov [rbp+20h], eax mov eax, [rbp+1Ch] mov edx, [rbp+20h] cmp eax, edx jnz short loc_140001125 lea rax, unk1 mov rcx, rax call sub_140001254 mov [rbp+24h], eax loc_140001125: mov eax, 0 mov rbx, [rbp+58h] mov rsi, [rbp+60h] mov rdi, [rbp+68h] lea rsp, [rbp+70h] pop rbp retn

Starting after our procedure call to GetVolumeInformationW we see that the return value is stored in some stack location, but we already noted that this is unused and essentially discarded. The VolumeSerialNumber is stored into eax then copied to ecx , and then the program calls sub_14000136C . This means our function prototype for this function looks like sub_14000136C(VolumeSerialNumber) . The return value is stored and then checked using test eax, eax . We’ve covered this sequence before, and know it’s used to determine if eax is 0 and if so the following jump instruction will be taken. Why? The zero flag will be set if the result is 0. This means that our function return value is used in a conditional statement.

if( sub_14000136C(VolumeSerialNumber) ) { }

The target of the branching instruction is the else block and in this chunk, it just cleans up the stack and returns. We can assume that this isn’t an else block and that the if statement exists independently. The next call instruction, call sub_1400014A4 , uses the unknown buffer unk_crt_ret_2 . We don’t yet know what this does, but we’ll investigate soon. It then calls sub_14000136C with the VolumeSerialNumber as the argument again, and then stores the two return values in stack storage locations and then those are moved to registers prior to their comparison.

call sub_1400014A4 mov [rbp+1Ch], eax mov eax, [rbp+VolumeSerialNumber] mov ecx, eax call sub_14000136C mov [rbp+20h], eax mov eax, [rbp+1Ch] mov edx, [rbp+20h] cmp eax, edx jnz short loc_140001125

It then compares the results and jumps to loc_140001125 – similar to the first condition. Note that all of this code occurs within the if-block of the first condition, so we have a nested condition. If you’re confused about why I recommend revisiting the first examples of the previous article. Our block will look like this:

if( sub_14000136C(VolumeSerialNumber) ) { if( sub_1400014A4(unk_crt_ret_2) == sub_14000136C(VolumeSerialNumber) ) { } }

I know that the comparison of the return values is occurring because of this sequence:

mov eax, [rbp+1Ch] mov edx, [rbp+20h] cmp eax, edx jnz short loc_140001125

The code inside of the nested if-block is pretty trivial and with tools, but it’s not immediately obvious what sub_140001254 is without them.

lea rax, unk1 mov rcx, rax call sub_140001254 mov [rbp+24h], eax

Since we’re operating as if we don’t have these tools so we’ll need to be creative. The unk1 object is globally available to the program, and we don’t have any idea where to look for it (you’ll learn about this in the PE File Format article). Given the structure so far of this program, and taking into account the conditional statements we’re going to have to investigate the function. Below is the disassembly of sub_140001254 .

push rbp sub rsp, 50h lea rbp, [rsp+20h] mov [rbp+30h+var_10], rbx mov [rbp+30h+arg_0], rcx mov [rbp+30h+arg_8], rdx mov [rbp+30h+arg_10], r8 mov [rbp+30h+arg_18], r9 mov eax, 0 imul eax, 8 movsxd rax, eax lea rdx, [rbp+30h+arg_8] add rdx, rax lea rax, [rbp+30h+var_28] mov [rax], rdx mov eax, 1 mov ecx, eax call __acrt_iob_func mov [rbp+30h+var_20], rax mov rax, [rbp+30h+var_20] mov rdx, [rbp+30h+arg_0] mov ecx, 0 mov rbx, [rbp+30h+var_28] mov [rbp+30h+var_18], rcx mov rcx, rax mov rax, [rbp+30h+var_18] mov r8, rax mov r9, rbx call sub_1400011F4 mov [rbp+30h+var_30], eax mov eax, [rbp+30h+var_30] mov [rbp+30h+var_2C], eax mov [rbp+30h+var_28], 0 mov eax, [rbp+30h+var_2C] mov rbx, [rbp+30h+var_10] lea rsp, [rbp+30h] pop rbp retn

We’re not going to break this down we just want to quickly scan for any potential clues as to what this function is. Right smack-dab in the middle of the disassembly is a reference to __acrt_iob_func, and with a quick google, we see tons of results. One of which is particularly helpful:

If we look at the argument supplied to __acrt_iob_func , we see that ecx is set to 1 – this is our stdout file descriptor which is most commonly used in printf! This function is almost guaranteed to be printf, and I know from my own experience that it is. You’ll learn to recognize CRT function patterns as you gain experience reversing applications. Now that we know this if we go back to our main function we know that unk1 is a string and sub_140001254 is printf. We are now one step closer to completing our analysis and pseudocode implementation.

lea rax, unk1 ; format string mov rcx, rax call printf mov [rbp+24h], eax

The rest of the code for this main function is just clean-up and exit. What do we know about this application? If we’re unable to run it then we can only guess based on the flow of the program given the disassembly. However, for this example, we have the ability to execute it. Let’s do that, and see what we get initially.

Running Targets Being able to run a target application prior to analysis provides a lot of information or clues you can use to your advantage when reversing it. In this breakdown, I wanted to save it until the end to make sure you were able to make the connections to the disassembly and program overall. Normally, if available, I would run the target and look for information that could help identify certain constructs like menu items, login form labels, and so on.

Upon running that program it waits for keyboard input. If you recall there was a call to some function earlier in the disassembly, call sub_14000113C , that used an unknown object as well – unk_140028000 . If I type some random characters and press enter the application exits.

Hm, so what are some candidate functions in C that take user input? The first one that comes to mind is scanf. If we look at where this function is called it makes sense.

call _unk_crt mov [rbp+40h], rax mov rax, [rbp+40h] mov [rbp+unk_crt_ret_2], rax lea rax, unk_140028000 mov rdx, [rbp+unk_crt_ret_2] mov rcx, rax call sub_14000113C

We see that sub_14000113C takes the malloc allocated variable and some unknown object: sub_14000113C(unk_140028000, unk_crt_ret_2) . Since this application requires a login name that works we know that if this is indeed scanf then unk_crt_ret_2 is the allocated storage where the user input is stored. That also means that unk_140028000 is the format string, likely %s to specify string format. It fits the purpose of this application so we’re going to go with it. Here is our first pass pseudocode implementation of our main function:

int __cdecl main(int argc, char** argv) { u32 VolumeSerialNumber = 0; u32 VolumeNameSize = 0x105; u32 MaximumComponentLength = 0; u32 FileSystemFlags = 0; // Allocate buffers for two objects. // char* VolumeNameBuffer = (char*)malloc(VolumeNameSize); char* user_name = (char*)malloc(VolumeNameSize); // Create function pointer prototype and bind it. // typedef int (__stdcall *gviw_t)( const char*, char*, u32, u32*, u32*, u32*, char*, u32 ); gviw_t gviw = (gviw_t)GetVolumeInformationW; // Call GetVolumeInformationW indirectly. // gviw(0, VolumeNameBuffer, VolumeNameSize, &VolumeSerialNumber, &MaximumComponentLength, &FileSystemFlags, 0, VolumeNameSize - 1); // Check if serial is valid? Make sure serial is not null? // if( sub_14000136C(VolumeSerialNumber) ) { // Compare some value based on user_name against serial number? I // did not get a print when entering random characters so we know // it's doing something else with these inputs. // if( sub_1400014A4(user_name) == sub_14000136C(VolumeSerialNumber) ) { printf(unk1); } } return 0; }

Reversing Challenge If you were writing your own pseudocode implementation yourself compare it against the one provided above and see how similar it looks. What did you miss? What could be done better? Did you make any assumptions that misled you? If so, what were they?

Deeper Investigation

Unfortunately, we’re not done yet. We still have to investigate what those functions in the conditional statements do. We have some ideas based on behavior when providing random input, but we can’t be for certain. That being said if we were looking to bypass this sort of authentication protocol we could modify some of the conditionals through byte patching and bypass the nested comparison to reach the printf. Yes, we could do all of this without continuing our reversal. We’re not going to do that – this is not meant to be a crackme on its own, but a way of teaching assembly in a “real” investigation. We’re going to start with analyzing sub_14000136C and then move to sub_1400014A4 . These functions are much more confusing than the first one given the lack of optimization and pollution with useless operations, but as always we’ll walk through to solidify the concepts you’ve learned so far.

# sub_14000136C

The disassembly you’re about to see is a jumbled nightmare, and we’re going to encounter some new instructions and learn how to simplify this unoptimized disassembly. Take a break if necessary because this section is going to be a long one.

push rbp sub rsp, 150h lea rbp, [rsp+20h] mov [rbp+128h], rdi mov [rbp+120h], rsi mov [rbp+140h], ecx lea rax, [rbp+0] mov edx, 0 mov ecx, 104h mov rdi, rax mov eax, edx and eax, 0FFFFh mov ah, al mov edx, eax shl eax, 10h or eax, edx mov esi, ecx shr rcx, 2 rep stosd mov ecx, esi and ecx, 3 rep stosb mov byte ptr [rbp+104h], 0 loc_1400013F0: movzx eax, byte ptr [rbp+104h] movzx eax, al cmp eax, 4 jl short loc_140001418 jmp loc_1400014B2 loc_140001404: movzx eax, byte ptr [rbp+104h] movzx eax, al inc eax mov [rbp+104h], al jmp short loc_1400013F0 loc_140001418: mov eax, [rbp+140h] lea rdx, [rbp+0] mov ecx, 0Ah mov [rbp+118h], ecx mov ecx, eax mov eax, [rbp+118h] mov r8d, eax call _itoa mov [rbp+110h], rax mov rax, [rbp+110h] mov rcx, rax call hash mov [rbp+108h], eax movzx eax, byte ptr [rbp+104h] movzx eax, al imul rax, 8 lea rdx, dword_140024000 add rdx, 4 add rdx, rax mov eax, [rdx] mov edx, [rbp+108h] cmp eax, edx jnz short loc_140001404 movzx eax, byte ptr [rbp+104h] movzx eax, al imul rax, 8 lea rdx, dword_140024000 add rdx, rax mov eax, [rdx] mov rsi, [rbp+120h] mov rdi, [rbp+128h] lea rsp, [rbp+130h] pop rbp retn loc_1400014B2: mov eax, 0 mov rsi, [rbp+120h] mov rdi, [rbp+128h] lea rsp, [rbp+130h] pop rbp retn

This is quite painful to see at first glance. But we’re going to do it as we always have, byte-sized pieces. Really bad joke. My sanity is waning – I’m sure yours is too.

Let’s take a moment to quickly scan the disassembly and look for any hints as to the local variables used. The only immediate hint I see is this instruction: mov byte ptr [rbp+104h], 0 . This is a byte-sized local at [rbp+104h] . We don’t know what it’s used for yet though. Now we have to do the dirty work and see what the function prologue is doing, quickly though.

push rbp sub rsp, 150h lea rbp, [rsp+20h] mov [rbp+128h], rdi mov [rbp+120h], rsi mov [rbp+140h], ecx lea rax, [rbp+0]

The function saves the old stack frame, allocates 150h (336) bytes of space for our function, and creates a new stack frame that’s based from [rsp+20h] . It stores rdi , rsi , and ecx into the spill space then loads the address of stack location [rbp+0] into rax . This is a local variable of some sort. The fact that it’s loading the address of this local is a hint that its some sort of data structure – most likely an array. The next section of assembly code is very different and uses a lot of the bitwise instructions for some macro-operation. We’re going to encounter some new instructions here as well.

lea rax, [rbp+0] mov edx, 0 mov ecx, 104h mov rdi, rax mov eax, edx and eax, 0FFFFh mov ah, al mov edx, eax shl eax, 10h or eax, edx mov esi, ecx shr rcx, 2 rep stosd mov ecx, esi and ecx, 3 rep stosb mov byte ptr [rbp+104h], 0

The first instruction following our local address load is zeroing edx , copying 104h (260) into ecx , loading the address in rax into rdi , and then moving edx into eax . That’s a lot, so let’s log register states by hand.

rax = address_of([rbp+0]) edx = 0 ecx = 104h rdi = rax eax = edx ----- SIMPLIFY ----- edx = 0 ecx = 260 rdi = address_of([rbp+0]) eax = 0

We took the initial states and then simplified it so that the states are less confusing and require fewer lookups by us. The next instruction is the and instruction. Since most readers have worked in C/C++ I expect familiarity with bitwise operations like and , or , xor , and so on. The instruction and eax, 0FFFFh performs a bitwise and operation on eax with the second operand. You might write this in a high-level language like eax & 0FFFFh . This may appear to be a pointless operation since the value of eax is 0 and the result will be 0, however, the and instruction affects a few status flags. The OF and CF status flags are cleared and the SF, ZF, and PF status flags are modified according to the result of the operation. We’ll see why this happened further down.

The next instruction is mov ah, al . This is a pointless instruction inserted as the result of no optimization applied to the program – it’s just wasted cycles. As is the instruction after, mov edx, eax . The register edx is already 0. The instruction shl is a bitwise left-shift, in this case it doesn’t have any effect because 0 << 10h is still 0, but in normal cases, it will take the value in the first operand and shift the bits 10h bits to the left. I’m sure you can guess what the or instruction does, and it’s also just pollution in this case. We arrive at mov esi, ecx which stores the value 104h into esi . Our register states have now changed.

edx = 0 ecx = 260 rdi = address_of([rbp+0]) eax = 0 esi = 260

We then see shr rcx, 2 where the value of rcx is 260. This is an operation worth noting! Every bit shift to the right effectively divides the value to be shifted by 2, N times. As an example, shr rcx, 1 takes the value in rcx and shifts all bits to the left which is equivalent to a divide by 2. The result would be 130 and stored in rcx . In this way, shr rcx, 2 is the same as taking the first operand and dividing it by 4. 260 divided by 4 is 65, so the value of rcx becomes 65 after this instruction executes. The opposite arithmetic operation applies to shl – multiply.

shr rcx, 1 => rcx / 2 shl rcx, 1 => rcx * 2 shr rcx, 4 => loop_4_Times:[rcx / 2] shr rcx, N => loop_N_Times:[rcx / 2]

Using Bitwise Operations Bitwise instructions are used by compilers to do many types of things. You might see them in place of standard arithmetic operations like multiple, divide, add, or subtract. This is because bitwise instructions have a lower instruction execution time than arithmetic instructions. There are other instructions on newer architectures that have faster execution times, but they’re primarily used when compiler optimizations are at their highest setting.

Observing our disassembly chunk again let’s take a look at the instruction after the shift right: rep stosd .

lea rax, [rbp+0] mov edx, 0 mov ecx, 104h mov rdi, rax mov eax, edx and eax, 0FFFFh mov ah, al mov edx, eax shl eax, 10h or eax, edx mov esi, ecx shr rcx, 2 rep stosd mov ecx, esi and ecx, 3 rep stosb mov byte ptr [rbp+104h], 0

This instruction is composed of two parts, as many x86 instructions can be. The first part is the instruction prefix: rep . The REP means to repeat the string operation of the instruction it prefixes. It’s typically used in string operations, but you may also see it in memory copy operations. From this function, the whole instruction is rep stosd . This is the encoding that instructs the processor to repeat a dword sized store operation on edi with the value in eax , ecx number of times. The size of the operation is indicated through the d appended to the stos instruction.

How’s that for a confusing mess? Let me put it in high-level terms, although this isn’t exactly what it would look like in translation.

rep_stosd() { while( ecx > 0 ) { edi[ecx] = eax; ecx -= sizeof(dword); } }

Think of it as if the processor is looping from 0 to ecx and storing the value of eax into edi[ecx] . It decrements the temp counter by the size of a doubleword because the size of the store operation is 32-bits. Hopefully, the instruction isn’t too confusing now. We just need to determine what these register states are at the time of execution to discover what it’s operating on. Let’s repost our register states:

edx = 0 ecx = 65 rdi = address_of([rbp+0]) eax = 0 esi = 260

Note that ecx changed as a result of the bitwise right shift that occurred. Knowing that rep stosd operates on edi we’ll define [rbp+0] as a data structure of size ecx * 4 . I inferred the size of the structure because rep stosd loops ecx times and writes a doubleword ( eax ) into each doubleword element of edi[ecx] . This is an unoptimized form of memset, and is typically seen when a string or data structure is set to 0 using an initializer-list. This particular data structure is 260 bytes in length, so it divided the length by 4 so that it can be initialized to 0 in 32-bit chunks. Neat!

Register Names Ever wondered why it chose EDI over EBX? EDI is formally known as the Extended Destination Index. Where is our instruction write destination? EDI. Likewise, with ECX, this is also known as the Extended Counter Register. It’s typically used in looping/repeating instructions to keep track of iterations. The more you know!

We’ve determined that [rbp+0] is a data structure allocated on the stack with a size of 260 bytes. The next three instructions serve little purpose as esi is 260 which is then AND’d with 3 which means that ecx is now 0. The termination condition (like when a loop completes) for rep stosd / rep stosb instructions is when rcx / ecx = 0. We can just move over those and get our pseudocode started.

u32 __fastcall sub_14000136C(u32 VolumeSerialNumber) { char buffer[260] = { 0 }; return -1; }

It looked like a lot, but once we acknowledge which instructions were wasted cycles and junk it wasn’t so bad! Moving on to the next section of assembly there will be some behavior that should be recognized.

mov byte ptr [rbp+104h], 0 loc_1400013F0: ; ................. here movzx eax, byte ptr [rbp+104h] movzx eax, al cmp eax, 4 jl short loc_140001418 jmp loc_1400014B2 loc_140001404: movzx eax, byte ptr [rbp+104h] movzx eax, al inc eax mov [rbp+104h], al jmp short loc_1400013F0 ; <---- loops back to ^

At first glance, we know there’s a loop because of the unconditional jump back to loc_1400013F0 . But this code is a little messy with all the potential branches so let’s rename the ones we know of right now.

mov byte ptr [rbp+104h], 0 outer_loop: ; .................. here movzx eax, byte ptr [rbp+104h] movzx eax, al cmp eax, 4 jl short loc_140001418 jmp loc_1400014B2 loc_140001404: movzx eax, byte ptr [rbp+104h] movzx eax, al inc eax mov [rbp+104h], al jmp short outer_loop ; <---- loops back to ^

A little bit easier to follow. This is a pretty standard loop setup in assembly. The first instruction initializes a counter to 0 using mov byte ptr [rbp+104h], 0 . Then we see it takes that local variable and stores it in eax , and then compares it against 4. It will branch to loc_140001418 if eax is less than 4 otherwise, it will jump to loc_1400014B2 . The jl instruction is jump if less than. Alright, so we know that [rbp+104h] is a local variable that’s a byte in width and this loop will execute 4 times. Before we analyze the jump targets lets add it to our pseudocode.

u32 __fastcall sub_14000136C(u32 VolumeSerialNumber) { char buffer[260] = { 0 }; u8 counter = 0; while(counter < 4) { // do something counter++; } return -1; }

Now, look at the jump target loc_140001418 :

loc_140001418: mov eax, [rbp+140h] lea rdx, [rbp+buffer] mov ecx, 0Ah mov [rbp+118h], ecx mov ecx, eax mov eax, [rbp+118h] mov r8d, eax call _itoa mov [rbp+110h], rax mov rax, [rbp+110h] mov rcx, rax call sub_1400014A4 mov [rbp+108h], eax movzx eax, byte ptr [rbp+counter] movzx eax, al imul rax, 8 lea rdx, dword_140024000 add rdx, 4 add rdx, rax mov eax, [rdx] mov edx, [rbp+108h] cmp eax, edx jnz short loc_140001404 movzx eax, byte ptr [rbp+counter] movzx eax, al imul rax, 8 lea rdx, dword_140024000 add rdx, rax mov eax, [rdx] mov rsi, [rbp+120h] mov rdi, [rbp+128h] lea rsp, [rbp+130h] pop rbp retn

I’ve renamed some of the offsets as was demonstrated before. We see a reference to a register that was preserved in the shadow store at [rbp+140h] which just so happens to be ecx and is also the argument of the function (calling convention). The argument is the VolumeSerialNumber that was acquired from our call to GetVolumeInformationW in the previous function. We know this from looking at the pseudocode we generated for the first function. The next instruction loads the base address of our local data structure into rdx . The value 0Ah (10) is then copied into ecx and then ecx into [rbp+118h] . We’re beginning to see a lot of temporary storage uses so we need to just skim and see where the final location is. It looks like 10, buffer, and our serial number are arguments to the _itoa function. The function _itoa is used to convert an integer to a string – if you don’t know the details of the arguments check the manual pages. The result of this function is then loaded into some temporary storage then copied to rcx prior to a call to sub_1400014A4 . So we know this function is likely used in a manner similar to this:

sub_1400014A4( _itoa( VolumeSerialNumber, buffer, 10 ) );

Scan the rest of the excerpt and you’ll notice a branching instruction – this means that there is another conditional statement involved. Let’s continue from the line designated below…

loc_140001418: mov eax, [rbp+140h] lea rdx, [rbp+buffer] mov ecx, 0Ah mov [rbp+118h], ecx mov ecx, eax mov eax, [rbp+118h] mov r8d, eax call _itoa mov [rbp+110h], rax mov rax, [rbp+110h] mov rcx, rax call sub_1400014A4 mov [rbp+108h], eax ; <------ continue from here movzx eax, byte ptr [rbp+counter] movzx eax, al imul rax, 8 lea rdx, dword_140024000 add rdx, 4 add rdx, rax mov eax, [rdx] mov edx, [rbp+108h] cmp eax, edx jnz short loc_140001404 movzx eax, byte ptr [rbp+counter] movzx eax, al imul rax, 8 lea rdx, dword_140024000 add rdx, rax mov eax, [rdx] mov rsi, [rbp+120h] mov rdi, [rbp+128h] lea rsp, [rbp+130h] pop rbp retn

The return value of sub_1400014A4 is stored on the stack, and we can see it’s used later down the line. We copy the counter to the eax register and zero the remaining bits. The following instruction movzx eax, al does nothing for us – ignore it. This is where things get a little dicey.

We multiply the value in rax by 8 and store the result in rax . This is called scaling, and is the process of taking an index (the counter) and scaling it by the width of an address. A 64-bit address is 8 bytes in width. You will commonly see this type of scaling used when accessing an array or similar data structure. Here’s an example to help:

uint64_t a[2] = { 0 }; // // To access the second element of the array // we can take the index * size of an element. // *(uint64_t*)(a + (1 * sizeof(uint64_t))) = 10; // // The above is the same as doing this. // a[1] = 10;

Array Access in Assembly

Pretend the array is based from 0. If we wanted to access the first 8-byte element (an unsigned 64-bit integer) then we would just read from the base to 8-bytes ahead, so from address 00 to 07. Conversely, to read the second 8-byte element we’d need to offset 8 bytes from the base (to get the start of the second element) and read to address 15. The scaling is done because we have to offset the correct number of bytes to get the desired element. If you think about the math it makes sense.

a = base of array = 00 (like diagram) // // Access first element // *(u64*)(a + (0 * sizeof(u64))) => *(u64*)(a + 0) => read 64-bit integer @ base of 'a' (address 00) // // Access second element // *(u64*)(a + (1 * sizeof(u64))) => *(u64*)(a + 8) => read 64-bit integer @ base of 'a' + 8 (address 08)

This is how indexing into arrays work, but the array access in the code is a little more complicated. We’re going to have to understand how to index into a structure.

Structure Accesses in Assembly

The reason being familiar with a language like C or C++ is important is because sometimes you’ll run into assembly that accesses a data structure that isn’t an array, and its layout isn’t immediately obvious. Take the below structure for example:

struct _s { u32 first; u32 second; };

When a structure like this is allocated, whether on the stack or the heap, accessing may not be intuitive. For the above example, it’s similar to the array accesses. Take note of a few things, however. This structure is not 32-bits in size, it is 64-bits because it contains two 32-bit integers. So how would we go about accessing the first or second members of this structure _s ? Let’s take a small program to help us.

struct _s temp; temp.first = 0; temp.second = 1; printf( "%d %d

", temp.first, temp.second ); // 0 1

In order to initialize the first member of the temp structure, we’d need the base of it. Once we have the base it’s very much like an array where the second member would be at the base address + sizeof(first member) . For the _s structure both members are 32-bit integers so the offset would be 4. The instructions to initialize these two would look similar to this.

lea rdx, qword ptr [_s] mov dword ptr [rdx+0], 0 mov dword ptr [rdx+4], 1

Different Types We know that structures allow the storage of different types in a packed data structure, so you can guess that accessing a member of a different type may require a different offset. This is not always true thanks to memory alignment requirements and the compiler. If you want to learn more about structure padding and how it affects memory accesses be sure to check the recommended reading.

Knowing how structures are accessed is sufficient for this example, but there’s something off about the assembly we’re looking at. I’ll bring it back into view.

movzx eax, byte ptr [rbp+counter] movzx eax, al imul rax, 8 lea rdx, dword_140024000 add rdx, 4 add rdx, rax mov eax, [rdx] mov edx, [rbp+108h] cmp eax, edx jnz short loc_140001404

You might’ve noticed the lea rdx, dword_140024000 instruction. This is quite confusing since our counter is currently 0, we’re multiplying the counter value by 8, loading rdx with the base of some data structure, and then adding 4 to the base then also adding 8. When you encounter sequences like this writing it out in generic terms helps. Let’s do that.

eax = 0 rax * 8 = 0 rdx = 140024000 <add rdx, 4> rdx = 140024004 <add rdx, 0> rdx = 140024004 <mov eax, [rdx]> eax = *(u32*)140024004 edx = [rbp+108h] RECALL: rbp+108h is the return value of sub_1400014A4

To me, it looks like structure access and then comparing one of the members to the return value of sub_1400014A4 . We can observe the pattern similar to our structure access example here:

lea rdx, dword_140024000 add rdx, 4

Then what is the scaling of the counter with 8 for? Great question. This is because this data structure is actually an array of structures! Something that you’ll see quite often in the wild. If you’re wondering what I mean by an array of structures picture the earlier example but as an array of _s structs. You’d recognize it in C – check it out.

static struct _s sarr[ N ] = { { 0, 1 }, { 2, 3 }, { 4, 5 }, ...etc... { X, X+1}, };

The elements of this array are _s structures and are initialized inside of the static array using { } . This relates to our disassembly because these structures are 8-bytes in size and our array of structures, therefore, is operated on in memory as being an array of 64-bit integers. Remember that when attempting to get the next element of an array when the elements are 8-bytes in size you have to add 8 * the index – just like in our target function:

movzx eax, byte ptr [rbp+counter] movzx eax, al imul rax, 8

On the first iteration, this scale value is 0 meaning that it will read from the first element in the array of structures! We load the base of the data structure into rdx :

lea rdx, dword_140024000 add rdx, 4 add rdx, rax

Add 4 to rdx which is the offset into the structure for the second member, and then add the scale value to rdx . Realistically these two add operations could be swapped and would be more intuitive, but assembly isn’t always intuitive. If it helps, I put together a diagram to represent this array of structures. We know that there are 4 structures in this array of structures based on the condition of our function loop and that the size of these structures is 8 bytes, and judging by the index into it with 4 the members of that structure are likely 32-bit integers. Study the illustration below and try to connect the dots.

The instructions on the left represent the instructions executed in each loop. I added comments that represents the state of the register in that specific loop. For the first loop, we see that the code attempts to scale, but 0 * 8 is 0 so the add rdx, rax doesn’t modify the address that will be accessed on the next instruction. The 8-byte scale is to index into the array of structures since each structure contains two 4-byte integers. If we read from rdx on first iteration, it will yield the first structure in the array. Adding 4 to the address gives us the second member of the first structure in the array. I’ll layout the structure of the array based on how these accesses are occurring.

typedef struct _unk_struct { u32 a; u32 b; } unk_struct; static unk_struct [ 4 ] = { { 0, 0 }, { 0, 0 }, { 0, 0 }, { 0, 0 } };

The disassembly is noting it as an array of doublewords due to some compiler headache hence the dword_140024000 . However, it’s loading the 64-bit address of this data structure into rdx so we assume it’s an array, and then use context clues from the indexing method to realize that it’s an array of 8-byte structures. Now that we’ve covered all of that we just need to see what’s compared, check the jump target, and construct our pseudocode.

call sub_1400014A4 mov [rbp+ret_of_14a4], eax movzx eax, byte ptr [rbp+counter] movzx eax, al imul rax, 8 lea rdx, dword_140024000 add rdx, 4 add rdx, rax mov eax, [rdx] mov edx, [rbp+ret_of_14a4] cmp eax, edx jnz short loc_140001404

So the second member in the specific structure in our array of structures is loaded into eax and compared against the return of sub_1400014A4 . We see that it then jumps to loc_140001404 if the values are not equal. The target, loc_140001404 , is presented below.

loc_140001404: movzx eax, byte ptr [rbp+counter] movzx eax, al inc eax mov [rbp+counter], al jmp short outer_loop

Ahh, hey wait a second! We’ve seen this before. Way at the beginning of the analysis of this function. This block simply copies the counter value into eax and increments it then stores it back in the local variable and jumps back to the outer_loop . We knew there were nested conditions in the loop, but now we know the logic in them is pretty simple. This is an unoptimized look at a for-loop where the counter increment is put in its own jump block away from the rest of the looping code. Let’s bring the whole disassembly back into view and add our new pseudo-C.

push rbp sub rsp, 150h lea rbp, [rsp+20h] mov [rbp+128h], rdi mov [rbp+120h], rsi mov [rbp+VolumeSerialNumber], ecx lea rax, [rbp+buffer] mov edx, 0 mov ecx, 104h mov rdi, rax mov eax, edx and eax, 0FFFFh mov ah, al mov edx, eax shl eax, 10h or eax, edx mov esi, ecx shr rcx, 2 rep stosd ; zero our local buffer mov ecx, esi and ecx, 3 rep stosb ; ignore mov byte ptr [rbp+counter], 0 ; zero our loop counter outer_loop: movzx eax, byte ptr [rbp+counter] movzx eax, al cmp eax, 4 ; compare counter to 4 jl short loc_140001418 ; if (counter < 4) ? loc_140001418 : loc_1400014B2 jmp loc_1400014B2 loc_140001404: movzx eax, byte ptr [rbp+counter] movzx eax, al inc eax ; counter++ mov [rbp+counter], al jmp short outer_loop ; for(...) loc_140001418: mov eax, [rbp+VolumeSerialNumber] lea rdx, [rbp+buffer] mov ecx, 0Ah mov [rbp+118h], ecx mov ecx, eax mov eax, [rbp+118h] mov r8d, eax call _itoa ; _itoa(VolumeSerialNumber, buffer, 10) mov [rbp+110h], rax mov rax, [rbp+110h] mov rcx, rax call sub_1400014A4 ; sub_1400014A4( _itoa(VolumeSerialNumber, buffer, 10) ) mov [rbp+ret_of_14A4], eax ; ret_of_14A4 = ^^ movzx eax, byte ptr [rbp+counter] movzx eax, al imul rax, 8 lea rdx, dword_140024000 add rdx, 4 add rdx, rax mov eax, [rdx] ; eax = *(u64*)( dword_140024000[ counter ] ) + 4 ; ----- SIMPLIFIED ----- ; eax = dword_140024000[ counter ].u32b mov edx, [rbp+ret_of_14A4] cmp eax, edx ; if( ret_of_14A4 == eax ) ? loc_140001404 : next_instr jnz short loc_140001404 movzx eax, byte ptr [rbp+counter] movzx eax, al imul rax, 8 lea rdx, dword_140024000 ; dword_140024000[counter] add rdx, rax ; no index into struct by 4, 0 assumed mov eax, [rdx] ; eax = dword_140024000[counter].u32a mov rsi, [rbp+120h] mov rdi, [rbp+128h] lea rsp, [rbp+130h] pop rbp retn ; return dword_140024000[counter].u32a (eax) loc_1400014B2: mov eax, 0 mov rsi, [rbp+120h] mov rdi, [rbp+128h] lea rsp, [rbp+130h] pop rbp retn ; return 0

And let’s add the pseudo-C we believe to be used in this function. I provided comments so you won’t have to scroll up and recall information from the explanation.

u32 __fastcall sub_14000136C(u32 VolumeSerialNumber) { char buffer[260] = { 0 }; u8 counter = 0; for(u8 counter = 0; counter < 4; counter++) { if(dword_140024000[ counter ].u32b == sub_1400014A4(_itoa(VolumeSerialNumber, buffer, 10))) return dword_140024000[ counter ].u32a; } return 0; }

Assembly Complexity Complex or confusing assembly will not always result in the most extravagant functions, as seen above. The important takeaway is that the amount of indirection used in the above code greatly impacted the assembly we had to analyze. This will be true for many targets you’ll encounter in the real world. That’s why it’s so important to be able to do the hard parts by hand, and understand where the result came from. If all of this made sense to you, that’s awesome! If not, don’t feel down – it takes time to recognize these things. You will get it.

In order to determine what this function is comparing, you’re gonna have to analyze the last function. There’s good news though! It’s not nearly as complicated as the last two we’ve looked at. If you noticed I said you will have to analyze it to uncover the functionality. Remember this isn’t about cracking the program or authorization protocol, just documenting it. The breaking of it comes later.

Final Challenge From the start of this series to the end of this section we’ve covered an enormous amount of information. It’s time for you to try to float in the deep end, lucky for you the last function isn’t super complicated. Use the resources you have at your disposal to determine what’s going on. Google, references, anything. Once you document the functionality of this function you’ll be able to reconstruct the program and understand how it’s validating the user’s input. The actual source code is provided following the disassembly, but try to complete it without referencing it. Leave your solutions in the comments, and show off how close you got! Good luck!

# sub_1400014A4

push rbp sub rsp, 30h lea rbp, [rsp+20h] mov [rbp+20h], rcx mov dword ptr [rbp+0], 1505h loc_1400014E5: mov rax, [rbp+20h] movzx eax, byte ptr [rax] movzx eax, al mov [rbp+4], eax mov eax, [rbp+4] mov [rbp+8], eax mov eax, [rbp+4] mov [rbp+0Ch], eax mov eax, 1 add rax, [rbp+20h] mov [rbp+20h], rax mov eax, [rbp+0Ch] test eax, eax jz short loc_140001523 mov eax, [rbp+0] shl eax, 5 add eax, [rbp+0] add eax, [rbp+8] mov [rbp+0], eax jmp short loc_1400014E5 loc_140001523: mov eax, [rbp+0] lea rsp, [rbp+10h] pop rbp retn

Program Source Code | Daax’s Pseudocode

Conclusion

This behemoth of an article concludes the Accelerated Assembly saga of the Applied Reverse Engineering series. We’ve covered a ton of instructions, sequences of instructions, types of accesses and compiler-generated headaches. At this point, if you completed the final challenge with little to no reference to my pseudo or the actual source code, you are well on your way. If you still struggled I commend you for making it to the end of the article and following along. I sincerely hope you learned something about assembly and this didn’t intimidate you or put you off from continuing down this path. It’s not easy and the Accelerated Assembly articles are not meant to be a one-stop-shop but more of a jump in and sink or swim type approach.

Learning assembly is one of the harder tasks when it comes to reverse engineering, but if you can become proficient at reading listings like the ones I gave above you’ll excel quickly in any facet of reverse engineering or vulnerability research. I intended to do more than one example, but as you can see this one got lengthy on even a simple program. The others were a bit more complex and would require a series themselves. I don’t want to get away from the point of these articles too much so I stuck with one example!

The next article will step away from the architecture to an extent as we are introduced to OS constructs that track objects within the system. These objects are processes and threads. It will detail the ins and outs of the structures from how they’re accounted for by the operating system to their related structures that control operation in a normal operating environment. You’ll learn about a few techniques for analyzing thread and process states, and how standard code injection works. It will be MUCH shorter than these last two articles, but still just as in-depth.

I hope you were able to take away something from these assembly posts, and I look forward to providing more content in the next articles!

As always feel free to leave me a comment, question, feedback, or a coffee to keep me awake while writing these. Good luck and all the best to everyone reading! My DMs are open on Twitter as well: @daax_rynd

Recommended Reading