Defeating HyperUnpackMe2 With an IDA Processor Module

Thursday, February 22 2007 19:21.58 CST Author: RolfRolles

1.0 Introduction

This article is about breaking modern executable protectors. The target, a crackme known as HyperUnpackMe2, is modern in the sense that it does not follow the standard packer model of yesteryear wherein the contents of the executable in memory, minus the import information, are eventually restored to their original forms.



Modern protectors mutilate the original code section, use virtual machines operating upon polymorphic bytecode languages to slow reverse engineering, and take active measures to frustrate attempts to dump the process. Meanwhile, the complexity of the import protections and the amount of anti-debugging measures has steadily increased.



This article dissects such a protector and offers a static unpacker through the use of an IDA processor module and a custom plugin. The commented IDB files and the processor module source code are included. In addition, an appendix covers IDA processor module construction. In short, this article is an exercise in overkill.



NOTE: all code snippets beginning with "ROM:" come from the disassembled VM code; all other snippets come from the protected binary.



HyperUnpackMe2.zip is provided as an ancillary to this article and includes:

codeseg--lightly--commented.idb : IDB of Virtual Machine (VM)

dumped.exe : Statically unpacked executable

Notepad.idb : IDB of packed executable

processor_module_source.zip : Source code for IDA processor module

th.w32 : IDA processor module

The processor module (th.w32) belongs in %IDADIR%\procs. It requires IDA 5.0, as do both of the IDBs. Although I own IDA 5.0, these IDBs are linked with the pirated 5.0 key. This is due to the fact that IDB files contain the majority of your personal keyfile. Hence, the IDBs will stop working under 5.1, unless you patch out the blacklist code (which is trivial). If you are a legitimate customer of IDA and would like IDBs for a later version, contact me under the information at the bottom of the article.



1.1 Modern Protectors

Protectors of generations past mainly compress/encrypt the original contents of the executable's sections; redirect the entrypoint to a new section that contains the decompression/decryption stub mixed in with anti-disassembly and anti-debugging techniques; strip the import information at protect-time and rebuild the import address tables at runtime; and finally transfer control back to the original entrypoint. In other words, while the sections' contents are modified on disk, they are mostly (with the exception of the import information) restored to their original state before execution is transferred back to the original program. Although there are some protectors which are exceptions, this is the basic idiom.



To unpack such protectors, execution is traced back to the original entrypoint, the process is dumped to create a new executable, and the import information is rebuilt. ImpRec and a sufficiently patched debugger are all that is needed to unpack protectors of this variety.



Rather than having an unmolested image in memory, new protectors are applying transformations to the original code in an effort to thwart understanding it and to make dumping the executable more difficult. Examples include converting portions of the code into proprietary byte-code formats which are executed by an embedded interpreter (so-called virtualization, virtual machines or VMs) and copying portions of the code elsewhere in the process' address space (so-called stolen bytes, stolen functions). These techniques are now mainstream in all areas of software protection, from crackmes and commercial packers to industrial-grade protections.



1.2 Transformations Applied by HyperUnpackMe2 to the Original Code

HyperUnpackMe2 extensively modifies the original code, and the entirety of the packer code is executed in a virtual machine. The anti-debugging is heavy and some of it is novel.



By quickly examining the code at the beginning of the binary, we notice the following:

Direct inter-module API calls are replaced with int 3 / 5x NOP. It is not known a priori whether these are fixed up directly, or whether they actually require a trip through a SEH. This could be problematic: think about Armadillo.

Thunks to APIs are similarly obfuscated.

The relevant data in the original IIDs and IATs have been zeroed.

Instructions which reference imports without calling them directly, i.e.

.text:01004462 mov esi, ds:__imp__lstrcpyW@8 ; lstrcpyW(x,x)

have been replaced with zeroes. However, the surrounding context remains the same:

TheHyper:0103A44D push ebx
TheHyper:0103A44E mov ebx, [esp+8]
TheHyper:0103A452 push esi
TheHyper:0103A453 db 0,0,0,0,0,0
TheHyper:0103A459 push edi
TheHyper:0103A45A push _szAnsiText
TheHyper:0103A460 push ebx
TheHyper:0103A461 call esi

So clearly, the missing instructions must be re-inserted (in some form) into the code before it'll execute properly. Perhaps this happens via a trip through the virtualizer, perhaps they're patched directly, perhaps a SEH-triggering event is patched in. Without further analysis, we have no way of knowing.

Intra-module calls are replaced with call $+5. It seems likely that these references are directly fixed up prior to execution; this turns out not to be the case (the 'directly' part is false).

Long jump instructions have had their targets replaced with a zero dword.

.text:010023CE E9 00 00 00 00 jmp $+5

Again, it's unknown what sort of obfuscation is being applied here.

Functions have been stolen, with zeroes left behind in place of the original code. These functions have been deposited towards the end of the packer section.

.text:01001C5B ; __stdcall SetTitle(x)
.text:01001C5B 00 _SetTitle@4 db 0
.text:01001C5C 00 db 0
.text:01001C5D 00 db 0
.text:01001C5E 00 db 0

2.0 Virtual Machines

Although VM assembly languages are often simple, VMs pose a challenge because they severely dilute the value of existing tools. Standard dynamic analysis with a debugger is possible, but very tedious because of the low ratio of signal to noise: one traces the same VM parsing / dispatching code over and over again. Static analysis is broken because each different VM has a different instruction encoding format (and this can be polymorphic). Patching the VM program requires a familiarity with the instruction set that must be gained through analysis of the VM parser. Basically, reverse engineering a VM with the common tools is like reverse engineering a scripted installer without a script decompiler: it's repetitious, and the high-level details are obscured by the flood of low-level details.



2.1 General Setup of VM Protections

The virtual machine needs an environment to execute in. This is generally implemented as a structure, hereinafter "the VM context structure". Each VM is different, but of the ones I've encountered thus far, each is based on the concept of a register architecture, and so the VM context structures typically consist of registers, flags, and various pointers (e.g. stack, maybe a heap of some sort, or a static data section).



Before the first instruction is executed, the VM context structure is allocated, and the registers and pointers are initialized, which usually involves allocating memory (perhaps on the host stack) for the VM stack.



After initialization, the archetypal VM enters into a loop which:

Decodes instructions at VM_context.EIP,

Performs the commands specified by the instruction, and then

Calculates the next EIP.

The process of execution usually involves examining the first byte of the instruction and determining which function/switch statement case to execute.



Eventually, the VM reaches some stop condition, and either exits or transfers control back to the native processor.
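The loop described above can be sketched in a few lines of Python. This is only an illustration: the ToyVM class, its three opcodes (01, 02, FF) and their operand layout are made up for the example and are not HyperUnpackMe2's actual encoding.

```python
import struct


class ToyVM:
    """Toy register VM: fetch at EIP, dispatch on the first byte, advance EIP."""

    def __init__(self, code):
        self.code = code          # VM bytecode in a hypothetical encoding
        self.eip = 0              # offset of the next instruction
        self.regs = [0] * 16      # sixteen dword registers
        self.running = True
        # Dispatch table keyed on the first byte of each instruction.
        self.handlers = {
            0x01: self.op_mov_imm,
            0x02: self.op_add,
            0xFF: self.op_stop,
        }

    def op_mov_imm(self):
        # 01 rr imm32 : regs[rr] = imm32
        reg, imm = struct.unpack_from("<BI", self.code, self.eip + 1)
        self.regs[reg] = imm
        return 6                  # instruction length in bytes

    def op_add(self):
        # 02 dd ss : regs[dd] += regs[ss], wrapping at 32 bits
        dst, src = struct.unpack_from("<BB", self.code, self.eip + 1)
        self.regs[dst] = (self.regs[dst] + self.regs[src]) & 0xFFFFFFFF
        return 3

    def op_stop(self):
        # FF : stop condition
        self.running = False
        return 1

    def run(self):
        while self.running:
            opcode = self.code[self.eip]        # decode
            length = self.handlers[opcode]()    # perform
            self.eip += length                  # calculate the next EIP
        return self.regs


program = (
    b"\x01\x01" + struct.pack("<I", 40)     # mov r1, 40
    + b"\x01\x02" + struct.pack("<I", 2)    # mov r2, 2
    + b"\x02\x01\x02"                       # add r1, r2
    + b"\xFF"                               # stop
)
regs = ToyVM(program).run()
```

A real VM harness differs mainly in scale: more handlers, richer operand decoding, and a context structure instead of a Python object.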



3.0 Description of HyperUnpackMe2's VM Harness

The HyperUnpackMe2 VM context structure contains sixteen dword registers, including ESP, which can each be accessed as a little-endian byte, word, or dword. There is an EIP register and an EFLAGS register as well. There is a pointer to the VM data (which is where EIP begins), and its length. The structure is zeroed upon creation. Its declaration follows. See the included x86 IDB for all of the gory details.



struct TH_registers
{
  unsigned long rESP;
  unsigned long r1;
  unsigned long r2;
  unsigned long r3;
  unsigned long r4;
  unsigned long r5;
  unsigned long r6;
  unsigned long r7;
  unsigned long r8;
  unsigned long r9;
  unsigned long rA;
  unsigned long rB;
  unsigned long rC;
  unsigned long rD;
  unsigned long rE;
  unsigned long rF;
};

struct TH_context
{
  unsigned char *vm_data;
  unsigned long vm_data_len;
  unsigned char *EIP;
  unsigned long EFLAGS;
  TH_registers registers;
  TH_keyed_mem keyed_mem_array[502];
  unsigned long stack[0x9000/4];
};

3.1 Instruction Encoding

The HyperUnpackMe2 VM consists of 36 instructions, split up into five groups. Each group has a different instruction encoding format, with a few commonalities. The commands understood by the VM are the following (non-obvious ones will be explained in detail in subsequent sections):

Group One : Two-operand arithmetic instructions: mov, add, sub, xor, and, or, imul, idiv, imod, ror, rol, shr, shl, cmp

Group Two : One-operand arithmetic and general instructions: push, pop, inc, dec, not

Group Three : One-operand control flow instructions: jmp, jz, jnz, jge, jg, jle, jl, vmcall, x86call

Group Four : Memory-related instructions: valloc, vfree, halloc, hfree

Group Five : Miscellaneous instructions: getefl, getmem, geteip, getesp, retd, stop

The VM itself is heavily based on the x86 architecture, as evident from the following snippets:



TheHyper:0104A159 VM_set_flags_dword:
TheHyper:0104A159 cmp [edi], esi
TheHyper:0104A15B pushf
TheHyper:0104A15C pop [eax+VM_context_structure.EFLAGS]

TheHyper:0104A316 VM_jz:
TheHyper:0104A316 push [eax+VM_context_structure.EFLAGS]
TheHyper:0104A319 popf
TheHyper:0104A31A jnz short loc_104A31F
TheHyper:0104A31C mov [eax+VM_context_structure.EIP], edi
TheHyper:0104A31F loc_104A31F:
TheHyper:0104A31F jmp short VM_dispatcher_13h_locret

The VM uses the host processor's flags in a very literal fashion. Instructions in groups one and two, and to some extent group three, are implemented very thinly on top of existing x86 instructions, reflecting the fundamental similarity of this virtual processor to the x86.



3.2 X86 <-> VM Crossover

The x86call instruction, depicted below, switches the host ESP with the VM ESP, and transfers control to the x86 code pointed to by EDI (what EDI is depends on the specifics of the instruction's encoding). The result of the function call is placed in virtual register #A. We'll find out later that this functionality is only ever used to call small functions associated with the protector, so we don't have to worry about alternative calling conventions and the clobbering of EDX and EBP by the function.



The switching of the host ESP with the VM ESP signifies that parameters to x86 functions are pushed onto the VM stack in the same order and manner as they would be if the calls were being made natively.



TheHyper:0104A36A mov esi, esp
TheHyper:0104A36C mov edx, [eax+VM_context_structure.VM_registers.rESP]
TheHyper:0104A36F mov esp, edx
TheHyper:0104A371 call edi
TheHyper:0104A373 mov edx, [ebp+arg_0]
TheHyper:0104A376 mov [edx+VM_context_structure.VM_registers.rA], eax
TheHyper:0104A379 mov esp, esi

The stop instruction in group five, depicted below, is suspicious and looks like it's used to transfer control back to OEIP. EBP, the frame pointer, points to the saved frame pointer coming into the function, which is the first thing pushed after the return address of the caller. Therefore, [ebp+4] is the return address.



TheHyper:0104A69F cmp cl, 0FFh
TheHyper:0104A6A2 jnz short go_on_parsing
TheHyper:0104A6A4 popa
TheHyper:0104A6A5 mov eax, [ebp+var_4_VM_context_structure]
TheHyper:0104A6A8 mov eax, [eax+VM_context_structure.VM_registers.rA]
TheHyper:0104A6AB mov [ebp+4], eax ; [ebp+4] = return address
TheHyper:0104A6AE leave
TheHyper:0104A6AF retn 8

We thus expect that the packer will return to OEIP by using the stop instruction, with OEIP in virtual register #A.



3.3 Memory Keying

The virtual machine also maintains an associative array of memory locations. Each block of memory that it tracks has a keying tag associated with it. There are native functions to add memory pointers with keys, retrieve a pointer by passing in its associated key, remove a pointer given its key, and update a pointer given a key and a new block of memory to point to. Not all of these functions are accessible through the instruction set; they seem to be for debugging purposes.



Some of the memory blocks contain non-obfuscated x86 code, some obfuscated, some contain VM code, and some contain data.



The internal data structure for a keyed memory entry looks like the following:



struct TH_keyed_mem
{
  unsigned char *ptr;
  unsigned long key;
};

Analyzing the functions which manipulate this structure can be slightly confusing due to negative structure displacements:



TheHyper:0104A3DC mov [esi], edx ; key
TheHyper:0104A3DE mov [esi-4], eax ; ptr
TheHyper:0104A3E1 add dword ptr [esi-4], 8 ; ptr



3.3.1 Initializing the Associative Array



During initialization, the VM calls a function which scans the VM's data looking for all occurrences of the dword '$$$$'. For each instance found, it treats the next dword as the key, and takes the address of the dword following that as the pointer.



['$$$$'][4-byte key]^[arbitrary data]
^: pointer
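The marker scan described above can be mimicked statically. The sketch below is a minimal Python version under a few assumptions of mine: the key is read as a little-endian dword, the search proceeds at byte granularity and resumes after the key, and a later duplicate key overwrites an earlier one. The function name and the `base` parameter are illustrative.

```python
import struct

MARKER = b"$$$$"


def scan_keyed_memory(vm_data, base=0):
    """Return {key: pointer} for every '$$$$' marker found in vm_data.

    `base` is the address at which vm_data is mapped, so the returned
    pointers are addresses rather than buffer offsets.
    """
    table = {}
    pos = vm_data.find(MARKER)
    while pos != -1 and pos + 8 <= len(vm_data):
        # The dword after the marker is the key (assumed little-endian).
        (key,) = struct.unpack_from("<I", vm_data, pos + 4)
        # The pointer is the address of the data following the key.
        table[key] = base + pos + 8
        # Assumption: the scan resumes after the key dword.
        pos = vm_data.find(MARKER, pos + 8)
    return table


sample = (
    b"\x90\x90"
    + MARKER + struct.pack("<I", 6) + b"AAAA"
    + MARKER + struct.pack("<I", 0x10000) + b"BB"
)
table = scan_keyed_memory(sample, base=0x1013000)
```

The pointer arithmetic matches the native code's `add dword ptr [esi-4], 8`: marker address plus eight bytes.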



3.3.2 Using the Associative Array



In the instruction set, group four specifically, there are two pairs of instructions which add and remove memory blocks from the internal associative array. The first pair allocates memory with VirtualAlloc, and the second pair uses HeapAlloc. There is no protection in the VM against attempting to de-allocate a block which wasn't allocated in the first place.



Group five contains an instruction, getmem, to fetch a memory block given a key. Group three, the control-flow transfer instructions, can take memory keys as arguments. In other words, jmp/jcc key will transfer control into the memory region pointed at by the key. In fact, the first instruction executed by the VM is of the form jmp key, and this is the primary form of control-flow transfer in the VM.
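As a mental model, the associative array behaves like a dictionary from keys to memory blocks. The Python sketch below is a toy model only: the KeyedMemory class and its method names are mine, a bytearray stands in for VirtualAlloc'ed memory, and, unlike the VM, it raises an error when asked to free an unknown key.

```python
class KeyedMemory:
    """Toy model of the VM's key -> memory block table (names are illustrative)."""

    def __init__(self):
        self.blocks = {}

    def add(self, key, block):
        # Register (or update) the block associated with `key`.
        self.blocks[key] = block

    def valloc(self, size, key):
        # Stand-in for VirtualAlloc: a zero-filled buffer registered under `key`.
        self.add(key, bytearray(size))

    def getmem(self, key):
        # Fetch the block previously registered under `key`.
        return self.blocks[key]

    def vfree(self, key):
        # Unlike the VM, this model raises KeyError for an unknown key.
        del self.blocks[key]


mem = KeyedMemory()
mem.valloc(0x195, 6)         # allocate 0x195 bytes under key 6
stub_area = mem.getmem(6)    # fetch the block by key
stub_area[0] = 0xE9          # write a byte into it
```

A jmp key instruction would then amount to setting the VM's EIP to the start of `getmem(key)`.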



4.0 Static Analysis of HyperUnpackMe2's VM Code

Based on the analysis of the VM dispatching harness, I constructed an IDA processor module to examine the code inside of the VM -- dead and natively. As such, the anti-debugging tricks are generally beyond the scope of this article, but a brief discussion can be found in appendix A. See appendix B for information about writing IDA processor modules.



Beyond the anti-debugging, there's a lot of anti-dump protection in this packer. The main "tricks" all involve the redirection of certain aspects of normal code execution.

The stolen functions are copied into VirtualAlloc'ed memory.

The API calls and API-referencing instructions point to obfuscated stubs which eventually redirect to their intended targets, which are actually in copies of the referenced DLLs, not the originals. There are 73 kilobytes' worth of obfuscated stubs in the packer section.

Relative jumps and calls travel through tiny stub functions in VirtualAlloc'ed memory onto their destinations. Further, all API references are changed to relatively-addressed varieties instead of direct references, i.e. 0xE8 [displacement to import] (call import_address) instead of 0xFF 0x15 [IAT entry] (call dword ptr [IAT entry]).



The point is to make dumping as hard as possible by creating a rigid reliance on the exact layout of the process' address space as it exists during that particular invocation (including the VirtualAlloc'ed memory regions and copied DLLs), and by removing any trace of the import table.



The following sections fill in the gaps (no pun intended) left in section 1.2 by describing precisely what happens under the covers of the VM. In the course of examination, we find that the fixups for each type take place in clusters, with similar code being used repeatedly to perform the same type of fixup. This turns out to be all of the information needed to break the protection, resulting in an automatic, static unpacker for any binary packed with it (of which there are no more -- TheHyper informed me that the protector was lost due to a disk crash).



4.1 Stolen Functions

The first thing we'll need to deal with are the missing functions. As we can see in the following snippet, it turns out that the functions are copied into allocated memory, and a long jump into the relevant function in allocated memory is inserted at the site of the function in the original code section. It should be noted that the stolen functions are still subject to the modifications described in subsequent sections.



.text:01001B9A ; __stdcall UpdateStatusBar(x)
.text:01001B9A _UpdateStatusBar@4 db 0B7h dup(0)

ROM:0103AFAA mov r0B, 1038A82h ; location of function in the VM section
ROM:0103AFB2 mov r06, 0B7h ; notice this matches up with the
ROM:0103AFBA push r06 ; size of the stolen function above
ROM:0103AFBD push r0B
ROM:0103AFC0 push r0F ; points to a block of allocated mem
ROM:0103AFC3 x86call x86_memcpy
ROM:0103AFC9 add rESP, 0Ch
ROM:0103AFD1 mov r0B, 1001B9Ah ; address of UpdateStatusBar
ROM:0103AFD9 mov r0E, r0F
ROM:0103AFDD sub r0E, r0B
ROM:0103AFE1 sub r0E, 5 ; r0E is the displacement of the jmp
ROM:0103AFE9 add r0F, r06 ; point after the copied function
ROM:0103AFED mov [r0Bb], 0E9h ; assemble a long jmp
ROM:0103AFF2 inc r0B
ROM:0103AFF5 mov [r0B], r0E ; write the displacement for the jmp

Locating these copies is easy enough: references to x86_memcpy following the final memory key are the ones which copy the stolen functions into VirtualAlloc'ed memory. We can easily extract the source of the copy and the destination of the write and copy the function back into its original real estate within the binary.
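Copying a stolen function back is a plain memory copy within the image. The Python sketch below illustrates the idea on a stand-in buffer; the function name, the bytearray image and the 0xC3 filler bytes are assumptions for the example, while the addresses and the size are the ones from the UpdateStatusBar snippet.

```python
def restore_stolen_function(image, image_base, src_va, dst_va, size):
    """Copy `size` bytes from the packer-section copy back over the zeroed hole."""
    src = src_va - image_base
    dst = dst_va - image_base
    # Sanity check: the original location should still be zero-filled.
    if any(image[dst:dst + size]):
        raise ValueError("destination is not a zero-filled hole")
    image[dst:dst + size] = image[src:src + size]


IMAGE_BASE = 0x1000000
image = bytearray(0x40000)                        # stand-in for the mapped image
image[0x38A82:0x38A82 + 0xB7] = b"\xC3" * 0xB7    # fake body in the packer section
restore_stolen_function(image, IMAGE_BASE, 0x1038A82, 0x1001B9A, 0xB7)
```

In a real unpacker the three values would be pulled from the operands of the VM instructions preceding each x86_memcpy reference.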



While we're on the subject, when fixups are made to functions which have been copied into allocated memory, they are made as a displacement against the beginning of that memory. I.e. we might see a fixup of a long jump made against address [displacement + 100h]. Thus, in order to know where in the original binary this long jump is, we need to retain information about where the functions in the original binary are situated in the allocated memory.



For example:



Displacement 0h into allocated memory -> 1001B9Ah
Displacement B7h into allocated memory -> 1001EEFh
Displacement 11Dh into allocated memory -> 100696Ah

Then, when we see one of these arbitrary displacements, we can map it to a location in the original binary by looking for the greatest lower bound in the set of displacements. I.e. for displacement C0h, this is +9h into the function with displacement B7h, and is therefore at the address 1001EEFh + 9h. Here's an example:



ROM:01049920 getmem r0B, 10000h
ROM:01049926 mov r0B, [r0B]
ROM:0104992A add r0B, 639h ; where does this point?
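The greatest-lower-bound lookup described in this section is a sorted-table search. Here is a minimal Python sketch using the three sample entries from the example map; the table is deliberately incomplete (a full table would hold one entry per stolen function), and the names are illustrative.

```python
import bisect

# Displacement into allocated memory -> original address.
# Only the three sample entries from the example; sorted by displacement.
STOLEN_MAP = [
    (0x000, 0x1001B9A),
    (0x0B7, 0x1001EEF),
    (0x11D, 0x100696A),
]


def map_displacement(disp, table=STOLEN_MAP):
    """Map a displacement into the allocated block to an original address."""
    starts = [start for start, _ in table]
    # Index of the greatest table displacement that is <= disp.
    index = bisect.bisect_right(starts, disp) - 1
    if index < 0:
        raise ValueError("displacement precedes the first stolen function")
    start, original = table[index]
    return original + (disp - start)


address = map_displacement(0xC0)   # +9h into the function at B7h
```

With the complete table, a displacement such as 639h resolves the same way.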



4.2 Long Jump Obfuscation

.text:010019DF E9 00 00 00 00 jmp $+5

Here we see, in the x86 IDB, an example of the jmp obfuscation. What actually happens here, at runtime, is that a chunk of memory is allocated, and gets filled with what looks like API thunk functions. The jmps in the binary are patched to jmp into the allocated memory, which subsequently jmps to the correct location in the binary. The following VM code illustrates this:



ROM:0104840F valloc 195h, 6 ; allocate 0x195 bytes of vmem under the
ROM:01048418 getmem r0E, 6 ; tag 0x6
ROM:0104841E mov r0F, 10019DFh ; see above: same address
ROM:01048426 mov r0D, r0F
ROM:0104842A add r0D, 5 ; point after the jump
ROM:01048432 mov r09, r0E ; point at the currently-assembling stub
ROM:01048436 sub r09, r0D ; calculate the displacement for the jmp
ROM:0104843A inc r0F ; point to the 0 dword in e9 00000000
ROM:0104843D mov [r0F], r09 ; insert reference to allocated memory
ROM:01048441 mov r0B, 1001AE1h ; this is the target of the jmp
ROM:01048449 mov r0C, r0E
ROM:0104844D add r0C, 5 ; calculate address after allocated jmp
ROM:01048455 sub r0B, r0C ; calculate displacement for jmp
ROM:01048459 mov [r0Eb], 0E9h ; build jmp in VirtualAlloc'ed memory
ROM:0104845E inc r0E
ROM:01048461 mov [r0E], r0B ; insert address into jmp
ROM:01048465 add r0E, 4

This is the general code sequence used to fix up the jumps when the function to be fixed up remains in the original binary's sections. When the function has been copied into memory, as described in the previous section, the code changes slightly: the addresses loaded into r0F and r0B are the displacements described previously. For example, the code at -41E and -441 is replaced with these snippets, respectively:



ROM:010498F3 getmem r0F, 10000h
ROM:010498F9 mov r0F, [r0F]
ROM:010498FD add r0F, 469h

ROM:01049920 getmem r0B, 10000h
ROM:01049926 mov r0B, [r0B]
ROM:0104992A add r0B, 639h

Given the sequences above, and making use of the stolen address -> real address mapping, it's trivial to cut out the middleman and insert the proper displacements into the correct dword locations. In the code above, we retrieve the dword operands from -41E and -441 and simply fix the jumps ourselves.
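Cutting out the middleman amounts to writing the direct rel32 displacement into the instruction. The Python sketch below shows the arithmetic on a stand-in buffer; the helper names and the bytearray image are assumptions for the example, and the addresses are those from the VM snippet above.

```python
import struct


def rel32(site_va, target_va):
    """Displacement for a 5-byte E8/E9 instruction at site_va reaching target_va."""
    return (target_va - (site_va + 5)) & 0xFFFFFFFF


def patch_rel32(image, image_base, site_va, target_va):
    """Overwrite the zero dword after the opcode with the direct displacement."""
    offset = site_va - image_base
    struct.pack_into("<I", image, offset + 1, rel32(site_va, target_va))


IMAGE_BASE = 0x1000000
image = bytearray(0x2000)                              # stand-in for the image
image[0x19DF:0x19E4] = bytes.fromhex("E900000000")     # jmp $+5 at 10019DFh
patch_rel32(image, IMAGE_BASE, 0x10019DF, 0x1001AE1)   # jump straight to 1001AE1h
```

The opcode byte is left untouched, so the same helper serves any 5-byte relative jump or call whose displacement was zeroed.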



4.3 Calls-To Obfuscation

These are handled in much the same way as the jump obfuscation: the code that fixes up the calls-to references is exactly the same as the code for the jump fixups. The calls also go through stubs in allocated memory which jmp to their proper destinations.



.text:01001C51 6A 00 push 0
.text:01001C53 E8 00 00 00 00 call $+5
.text:01001C58 C2 1C 00 retn 1Ch

ROM:0104419A valloc 3F2h, 5 ; allocate 0x3f2 bytes of memory under
ROM:010441A3 getmem r0E, 5 ; the tag 0x5
ROM:010441A9 mov r0F, 1001C53h ; address of call to be fixed up (above)
ROM:010441B1 mov r0D, r0F
ROM:010441B5 add r0D, 5 ; point after the call
ROM:010441BD mov r09, r0E ; r09 points to the allocated jmp stub
ROM:010441C1 sub r09, r0D
ROM:010441C5 inc r0F
ROM:010441C8 mov [r0F], r09 ; insert the proper displacement
ROM:010441CC mov r0B, 1001B9Ah ; we would be calling this address
ROM:010441D4 mov r0C, r0E
ROM:010441D8 add r0C, 5
ROM:010441E0 sub r0B, r0C ; calculate displacement
ROM:010441E4 mov [r0Eb], 0E9h ; form the long jmp in allocated memory
ROM:010441E9 inc r0E
ROM:010441EC mov [r0E], r0B
ROM:010441F0 add r0E, 4

Notice that, in this case, a call from a non-stolen function is being fixed up to call a non-stolen function: the addresses on lines -1A9 and -1CC are hard-coded within the binary. When a call in a stolen function is fixed up to call another function, the beginning of the above code sequence is different: it uses the getmem idiom, as we saw previously. The code at -1A9 becomes:



ROM:010441F8 getmem r0F, 10000h
ROM:010441FE mov r0F, [r0F]
ROM:01044202 add r0F, 20Dh

However, the destination address is not loaded via getmem, because as we saw previously, calls to stolen functions are routed to their destinations via these jumps. I.e. calls to stolen functions behave just like calls to the original functions.



Recovering the proper displacement from the caller to the callee is as simple as it was for the jumps, because the code is identical; see the closing remarks of the previous section for how to fix up these calls.



4.4 Import Obfuscation

Here's a sample of the import redirection. Instead of referencing the imports directly, the jmp/call-to-import instructions are patched to reference locations such as these:



TheHyper:01021524 pushf
TheHyper:01021525 pusha
TheHyper:01021526 call sub_1021548
TheHyper:01021548 pop eax
TheHyper:01021549 add eax, 16h
TheHyper:0102154C jmp eax

This sort of thing goes on for a while (six layers for this one) with some random junked garbage interspersed before eventually redirecting control to the original import:



TheHyper:010215DE 61 popa
TheHyper:010215DF 9D popf
TheHyper:010215E0 E9 00 00 00 00 jmp $+5



4.4.1 IAT Reconstruction



Believe it or not, the first thing that HyperUnpackMe2 does when it really gets down to business is to correctly rebuild the original IAT.



First, the DLL names are retrieved from memory byte-by-byte. The names are not stored contiguously, but rather, the bytes corresponding to the DLL names are randomly mixed together. The DLL is then LoadLibraryA'd.



ROM:01026058 mov r04, 1013000h ; point to beginning of packer section
ROM:0102607B getmem r05, dword_101382D
ROM:01026081 mov r0B, r09
ROM:01026085 mov r06, r04
ROM:01026089 add r06, 41Ch
ROM:01026091 mov [r0Bb], [r06b] ; copy byte of DLL name from 0x101341c
ROM:01026095 inc r0B
ROM:01026098 mov r06, r04
ROM:0102609C add r06, 93h
ROM:010260A4 mov [r0Bb], [r06b] ; copy byte of DLL name from 0x1013093
; idiom repeats a variable number of times
ROM:010260A8 inc r0B
ROM:0102617C mov [r0Bb], 0
ROM:01026181 push r09
ROM:01026184 x86call r0C ; LoadLibraryA
ROM:01026186 add rESP, 4

Next, the entire DLL's address space is copied into a freshly-allocated chunk of memory. Yes, you read that right. The DLL's SizeOfImage is used as the size parameter to VirtualAlloc, and then the entire DLL is memcpy'd into the result. This is responsible for a huge bloat in the memory footprint. I didn't think that this trick would work, but the crackme does run, after all. Personal correspondence with TheHyper reveals that this is why the crackme only runs on XP SP2 (although I haven't investigated why -- help me out here, Alex?).



The following code illustrates the process:



ROM:0102618E push r09
ROM:01026191 push r0B
ROM:01026194 push r0D
ROM:01026197 mov r09, r0A
ROM:0102619B getmem r0A, g_Copy_Of_Kernel32_Address_Space
ROM:010261A1 mov r0A, [r0A]
ROM:010261A5 push kernel32_hashes_VirtualAlloc
ROM:010261AB push r0A
ROM:010261AE vmcall API__GetProcAddress
ROM:010261B4 mov r0D, r0A
ROM:010261B8 mov r0B, r09
ROM:010261BC add r0B, 3Ch
ROM:010261C4 mov r0B, [r0B]
ROM:010261C8 add r0B, r09
ROM:010261CC add r0B, 50h
ROM:010261D4 mov r0B, [r0B] ; retrieve this DLL's SizeOfImage
ROM:010261D8 push 40h
ROM:010261DE push 1000h
ROM:010261E4 push r0B
ROM:010261E7 push 0
ROM:010261ED x86call r0D ; allocate that much memory
ROM:010261EF add rESP, 10h
ROM:010261F7 mov r03, r0A
ROM:010261FB mov [r05], r0A
ROM:010261FF push r0B
ROM:01026202 push r09
ROM:01026205 push r0A
ROM:01026208 x86call x86_memcpy ; copy DLL's address space
ROM:0102620E add rESP, 0Ch
ROM:01026216 pop r0D
ROM:01026219 pop r0B
ROM:0102621C pop r09
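The SizeOfImage lookup at -1B8 through -1D4 is ordinary PE header arithmetic: the dword at offset 3Ch of the module is e_lfanew, and for a 32-bit PE the SizeOfImage field sits 50h bytes past the PE signature. The Python sketch below reproduces the two reads on a synthetic buffer; the function name and the fake header values are made up for the example.

```python
import struct


def size_of_image(module):
    """Read SizeOfImage from a 32-bit PE image the way the VM code does."""
    # [base + 3Ch] is e_lfanew, the offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", module, 0x3C)
    # [base + e_lfanew + 50h] is OptionalHeader.SizeOfImage.
    (size,) = struct.unpack_from("<I", module, e_lfanew + 0x50)
    return size


# Synthetic header with only the two fields of interest filled in.
fake_module = bytearray(0x200)
struct.pack_into("<I", fake_module, 0x3C, 0x80)           # e_lfanew = 80h
struct.pack_into("<I", fake_module, 0x80 + 0x50, 0x5000)  # SizeOfImage = 5000h
size = size_of_image(fake_module)
```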

Next, the imported APIs are loaded, but not in the normal way. The protector includes a VM-function that I've called API__GetProcAddress, which takes a pseudo-HMODULE and a shellcode-like API hash as arguments. The pseudo-HMODULE is the address of the memory that the DLL was copied into above. Thus, the addresses returned by this function reside in the copied DLL bodies, and not the originals.



API__GetProcAddress works by iterating through the DLL's exports and hashing each function's name, stopping when it finds the corresponding hash that was passed in as an argument. It then returns the address of that function.



This makes it harder for dynamic tools to identify which APIs are actually being used: after all, the API addresses are not contained within a loaded module.



The hashes are retrieved from the jumble of data at the beginning of the packer section in a similar fashion to the assembling of the DLL names. The address at which each resolved import belongs in the original IAT is likewise assembled from scattered data.



ROM:0102621F xor r07, r07 ; r07 = hash
ROM:01026223 mov r06, r04
ROM:01026227 add r06, 12h
ROM:0102622F mov r05, [r06]
ROM:01026233 and r05, 0FFh
ROM:0102623B or r07, r05 ; get a single byte of the hash
ROM:0102623F ror r07, 8 ; idiom repeats three times

ROM:010262B3 xor r08, r08 ; r08 = where to put the resolved import
ROM:010262B7 mov r06, r04
ROM:010262BB add r06, 5C7h
ROM:010262C3 mov r05, [r06]
ROM:010262C7 and r05, 0FFh
ROM:010262CF or r08, r05
ROM:010262D3 ror r08, 8 ; idiom repeats three times

ROM:01026347 push r07
ROM:0102634A push r03 ; point at copied DLL
ROM:0102634D vmcall API__GetProcAddress
ROM:01026353 add r08, 1000000h
ROM:0102635B mov [r08], r0A ; store resolved address back into IAT
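The or / ror 8 idiom assembles a dword one byte at a time. The Python sketch below mimics it under one assumption of mine: that all four iterations, including the last, end with the rotate, which makes the first byte fetched the least significant one. Only the offset 5C7h comes from the listing; the other three offsets, the byte values and the helper names are hypothetical.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by `count` bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def gather_dword(section, offsets):
    """Rebuild a dword from four scattered bytes using the or / ror 8 idiom."""
    result = 0
    for offset in offsets:
        result |= section[offset] & 0xFF   # or rX, byte
        result = ror32(result, 8)          # ror rX, 8
    return result


section = bytearray(0x600)                 # stand-in for the packer section
section[0x5C7] = 0xA0                      # first byte, offset from the listing
section[0x100] = 0x12                      # remaining offsets are hypothetical
section[0x200] = 0x00
section[0x300] = 0x00
rva = gather_dword(section, [0x5C7, 0x100, 0x200, 0x300])
iat_slot = rva + 0x1000000                 # add r08, 1000000h
```

With these made-up bytes the gathered value is 12A0h, giving the slot address 10012A0h after the image base is added.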

The DLL names, hashes, and IAT addresses can all be recovered with no difficulties, and we can ignore the DLLs being copied into dynamically allocated memory. It's a simple matter to reverse the hashes into API names. Therefore, the entirety of the import information can be reconstructed statically: we can simply mimic what the packer itself does, rebuild the IDTs/IATs with no difficulties, and then point the imports directory pointer in the PE header to our rebuilt structures.
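Both the hash-based lookup and its reversal reduce to hashing every export name and comparing. The Python sketch below shows the shape of that process. The hash function here is a stand-in (a rotate-right-by-13 additive hash of the kind often seen in shellcode), not the protector's actual algorithm, and the export table, addresses and function names are hypothetical.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by `count` bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def stand_in_hash(name):
    """Placeholder hash; the protector's real hash function is not shown here."""
    h = 0
    for byte in name.encode("ascii"):
        h = (ror32(h, 13) + byte) & 0xFFFFFFFF
    return h


def get_proc_address_by_hash(exports, wanted_hash):
    """Walk the exports, hash each name, and return the matching address."""
    for name, address in exports.items():
        if stand_in_hash(name) == wanted_hash:
            return address
    return None


def build_reverse_table(export_names):
    """Map hash -> API name so that recovered hashes can be turned into names."""
    return {stand_in_hash(name): name for name in export_names}


# Hypothetical export table: name -> address inside a copied DLL.
exports = {"VirtualAlloc": 0x1000, "LoadLibraryA": 0x2000}
reverse = build_reverse_table(exports)
resolved = get_proc_address_by_hash(exports, stand_in_hash("VirtualAlloc"))
```

For the static unpacker only the reverse table matters: each hash recovered from the VM code is looked up to obtain the import name for the rebuilt import tables.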



I was anticipating things would be harder than they turned out to be, so I decided to move the FirstThunk lists (into which the original import references were made) instead of keeping them at their original addresses. This turned out to be an unnecessary mistake that complicates some of what follows. I apologize.



In order to rectify this situation, I kept a map from the old IAT addresses into the new IATs that I created.



For example:



.text:010012A0 __imp__PageSetupDlgW@4 dd 0
010012A0 -> [Address of new FirstThunk entry for PageSetupDlgW import]



4.4.2 IAT Redirection



The next thing that happens is that the addresses which were resolved in the previous section are inserted into API-obfuscating stubs described in 4.4, and the addresses of these API-obfuscating stub functions are inserted into the IAT atop the import addresses.



.text:010012A0 ; BOOL __stdcall PageSetupDlgW(LPPAGESETUPDLGW)
.text:010012A0 __imp__PageSetupDlgW@4 dd 0

TheHyper:01014126 pushf
TheHyper:01014127 pusha
TheHyper:01014128 call sub_1014150 ; eventually ends up at next snippet
TheHyper:0101423A popa
TheHyper:0101423B popf
TheHyper:0101423C jmp near ptr 0B97002DDh ; patch here + 1 byte

ROM:0103659A mov r0B, 10012A0h ; see above: IAT addr
ROM:010365A2 mov r0E, 1014126h ; see above: beginning of import obfs
ROM:010365AA mov r08, 101423Dh ; see above: end of import obfs
ROM:010365B2 mov r06, [r0B]
ROM:010365B6 mov [r0B], r0E ; replace IAT addr with obfuscated addr
ROM:010365BA mov r03, r08
ROM:010365BE dec r03
ROM:010365C1 add r03, 5
ROM:010365C9 sub r06, r03
ROM:010365CD mov [r08], r06 ; form relative jump to real import

This makes no difference to the static examiner, and does not require fixups.



4.4.3 Call Instruction Fixup



Next, the CALL instructions which reference the IAT are re-created as relatively-addressed instructions which reference the API-obfuscating stub functions. The instructions in the original binary were 0xFF 0x15 [direct address], the pre-fixup instructions are 0xCC 0x90 0x90 0x90 0x90 0x90, and the new instructions are 0xE8 [relative address] 0x90. As the relative form is one byte shorter than the original six-byte directly-addressed reference, a NOP is needed to fill the remaining byte.



.text:010019D4 int 3 ; Trap to Debugger
.text:010019D5 nop
.text:010019D6 nop
.text:010019D7 nop
.text:010019D8 nop
.text:010019D9 nop

ROM:0103C747 mov r03, 10019D4h ; address of the snippet above
ROM:0103C74F mov r06, r03
ROM:0103C753 add r06, 5 ; point after call
ROM:0103C75B mov [r03b], 0E8h ; insert relative call
ROM:0103C760 inc r03
ROM:0103C763 mov r04, 100121Ch ; where we call to
ROM:0103C76B sub r04, r06 ; create relative displacement
ROM:0103C76F mov [r03], r04 ; insert relative address
ROM:0103C773 add r03, 4
ROM:0103C77B mov [r03b], 90h ; insert NOP in empty byte spot
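The six bytes that this VM sequence writes can be reproduced with a short helper. This is a sketch for checking the arithmetic, not code from the protector; make_rel_call is a hypothetical name.

```cpp
#include <array>
#include <cstdint>

// Builds E8 rel32 followed by a NOP, as the VM code above does.
std::array<std::uint8_t, 6> make_rel_call(std::uint32_t site, std::uint32_t target)
{
    // The displacement is relative to the end of the 5-byte CALL.
    std::uint32_t rel = target - (site + 5);
    return {{ 0xE8,
              static_cast<std::uint8_t>(rel),
              static_cast<std::uint8_t>(rel >> 8),
              static_cast<std::uint8_t>(rel >> 16),
              static_cast<std::uint8_t>(rel >> 24),
              0x90 }};
}
```

For the values in the listing (site 0x10019D4, target 0x100121C) the displacement is -0x7BD, giving the bytes E8 43 F8 FF FF 90.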

As before, the idiom is slightly different for fixing the calls in stolen functions, in that r03 is fetched from memory instead of referenced directly. The code at -747 would become, for instance:



ROM:0103C67E getmem r03, 10000h
ROM:0103C684 mov r03, [r03]
ROM:0103C688 add r03, 1318h

In order to fix these up, we retrieve the address of the call from -747, and the import destination from -763. We then manually insert the correct instruction which calls into this IAT slot. Actually, due to my previously-described mistake, we first run the IAT address through the old IAT slot -> new IAT slot map before fixing the instruction.
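On the unpacker's side, the instruction to insert is the original six-byte indirect call. A minimal sketch, assuming the IAT slot address passed in has already been translated through the old-to-new slot map (make_iat_call is a hypothetical name):

```cpp
#include <array>
#include <cstdint>

// Re-creates "call dword ptr [iat_slot]": FF 15 followed by the absolute
// 32-bit address of the slot, little-endian.
std::array<std::uint8_t, 6> make_iat_call(std::uint32_t iat_slot)
{
    return {{ 0xFF, 0x15,
              static_cast<std::uint8_t>(iat_slot),
              static_cast<std::uint8_t>(iat_slot >> 8),
              static_cast<std::uint8_t>(iat_slot >> 16),
              static_cast<std::uint8_t>(iat_slot >> 24) }};
}
```

Because the encoding is absolute, the bytes do not depend on where the call site lives, which also covers calls inside stolen functions.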



4.4.4 Mov Instruction Fixup



Next, instructions of the form mov reg32, [dword from IAT] are fixed up by the protector in the same fashion as in the previous section. They are relatively addressed to point directly to the obfuscated stubs (whose addresses are fetched out of the IAT), instead of the direct addressing that was present in the original binary. The registers involved in this process are ESI, EDI, EBP, EBX, and EAX.



Stop and think for a second. So far, we've made the assumption that all imports are functions, but this is not always true. The MSVC CRT contains references to two imported data items. Trying to run a data import through the import-obfuscating procedure is an incorrect transformation and will always result in a crash. This is an Achilles' heel of this protection.



The mov-instruction fixup is accomplished in much the same way as the call-instruction fixups. There are several idioms: for stolen functions, for regular functions, for EAX versus the other registers (as the instruction for EAX is five bytes, while the others are six bytes). The EAX-references are assumed to point to data and are fixed up directly instead of relatively.
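The size difference between the EAX form and the others follows from the x86 encoding, which a static fixup has to reproduce. The sketch below assumes the original binary used the short A1 form for EAX and the generic 8B /r form with an absolute 32-bit address for the other registers; the enum and function names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Standard x86 register numbers as used in the ModRM reg field.
enum Reg32 { EAX = 0, ECX = 1, EDX = 2, EBX = 3, ESP = 4, EBP = 5, ESI = 6, EDI = 7 };

// Encodes "mov reg, dword ptr [iat_slot]".
std::vector<std::uint8_t> make_iat_mov(Reg32 reg, std::uint32_t iat_slot)
{
    std::vector<std::uint8_t> out;
    if (reg == EAX) {
        out.push_back(0xA1);  // mov eax, [moffs32]: 5 bytes in total
    } else {
        out.push_back(0x8B);  // mov r32, r/m32: 6 bytes in total
        // ModRM with mod=00 and r/m=101 selects a bare 32-bit address.
        out.push_back(static_cast<std::uint8_t>((reg << 3) | 0x05));
    }
    for (int i = 0; i < 4; ++i)  // little-endian address
        out.push_back(static_cast<std::uint8_t>(iat_slot >> (8 * i)));
    return out;
}
```

For example, ESI and slot 0x010012A0 give 8B 35 A0 12 00 01, while EAX gives A1 A0 12 00 01.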



Once again, extracting this information from the code sequences is not difficult to do statically, and I think I'm starting to develop RSI so I'll skip the details here.



4.4.5 IAT Zeroing



After all of the references are correctly fixed up, the import addresses in the IAT are no longer needed, and are zeroed. We can ignore this step.



4.5 The Rest of the Protection

As is usual in unpacking tasks, we must set the original entrypoint field of the PE header to the real entrypoint. We scan the disassembly listing for the instruction 'stop' and then statically backtrace to find the value of r0A.



ROM:01049F05 mov r0A, 1006AE0h
ROM:01049F0D stop
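The scan-and-backtrace step can be sketched over a linear list of decoded instructions. The VmInsn structure is a hypothetical simplification, not the processor module's representation, and the backtrace below ignores control flow: it simply takes the nearest preceding immediate load of r0A.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified view of one decoded VM instruction.
struct VmInsn {
    std::string mnem;   // e.g. "mov", "stop"
    std::string dst;    // destination register name, empty if none
    bool has_imm;       // true when the source operand is an immediate
    std::uint32_t imm;  // the immediate value
};

// For each 'stop' in listing order, walk backwards to the nearest
// "mov r0A, imm" and return that immediate. Returns 0 if no 'stop' is
// preceded by such a mov.
std::uint32_t find_oep(const std::vector<VmInsn> &code)
{
    for (std::size_t i = 0; i < code.size(); ++i) {
        if (code[i].mnem != "stop")
            continue;
        for (std::size_t j = i; j-- > 0; )
            if (code[j].mnem == "mov" && code[j].dst == "r0A" && code[j].has_imm)
                return code[j].imm;
    }
    return 0;
}
```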

Finally, the NumberOfRVAsAndSizes field of the PE header has been set to -1 in order to confuse OllyDbg, so we should set that back to 0x10, the default. And while we're at it, reduce the raw and virtual sizes of the last section, reduce the SizeOfImage, and truncate the last section in the executable. The final executable is exactly 1kb larger than the copy of notepad.exe which ships with Windows XP SP2.
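The two header patches (entrypoint and NumberOfRvaAndSizes) amount to a couple of 32-bit writes into the PE32 optional header. The sketch below operates on a raw file buffer and uses the standard PE32 field offsets; the function names are hypothetical, and the size-reduction and truncation steps are deliberately left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static std::uint32_t rd32(const std::vector<std::uint8_t> &b, std::size_t off)
{
    return b[off] | (b[off + 1] << 8) | (b[off + 2] << 16) |
           (static_cast<std::uint32_t>(b[off + 3]) << 24);
}

static void wr32(std::vector<std::uint8_t> &b, std::size_t off, std::uint32_t v)
{
    for (int i = 0; i < 4; ++i)
        b[off + i] = static_cast<std::uint8_t>(v >> (8 * i));
}

// Sets AddressOfEntryPoint from a virtual address and restores
// NumberOfRvaAndSizes to 0x10. Returns false if img is not a PE32 file.
bool fix_pe_header(std::vector<std::uint8_t> &img, std::uint32_t oep_va)
{
    if (img.size() < 0x40)
        return false;
    std::size_t pe  = rd32(img, 0x3C);  // e_lfanew
    std::size_t opt = pe + 4 + 20;      // skip signature and IMAGE_FILE_HEADER
    if (img.size() < opt + 96)          // need the fixed part of the optional header
        return false;
    if (std::memcmp(&img[pe], "PE\0\0", 4) != 0)
        return false;
    if (img[opt] != 0x0B || img[opt + 1] != 0x01)  // Magic 0x10B = PE32
        return false;
    std::uint32_t image_base = rd32(img, opt + 28);
    wr32(img, opt + 16, oep_va - image_base);  // AddressOfEntryPoint is an RVA
    wr32(img, opt + 92, 0x10);                 // NumberOfRvaAndSizes
    return true;
}
```

Assuming an image base of 0x01000000, which the addresses in the listings suggest, the entrypoint 0x1006AE0 is stored as the RVA 0x6AE0.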



After making all of the above modifications, the binary runs properly. Success!



5.0 Comments On The Protection

It took a lot of work to unpack this protector, but ultimately, the static solution was both obvious and straightforward. On the other hand, dynamic dumping of this protector would be difficult, although still feasible.



5.1 Problems With The Protection

This protection has a few problems in the theoretical sense. For one, it requires disassembling the binary: because the IAT is zeroed, _every_ reference to the IAT must be accounted for; if not, the program will simply crash. For example, if a trivial packer which XORed the code section, but left the imports alone, was applied first, all references would be missed and the binary would have no hope of running. This could be mitigated by not zeroing the original IAT (but still applying fixups to the references which can be found), so that any references which were not found continue to work properly.



Another problem is, of course, that disassembly isn't perfect, and you could end up with all sorts of bugs if you just blindly replace what you think is a reference to the IAT if it is instead just plain old data, for instance.



Another problem is functions which have merged tails. If a function with a shared exit path is stolen, there are going to be problems.



Another problem, discussed in a previous section, is the assumption that imports will always point to functions and not data. This is a faulty assumption, and will cause many failures.



All of that being said, if one assumes perfect disassembly (which is possible manually via IDA and/or full debugging information) and allows a blacklist of imports which are data, then this is a working protection, one which I expect will be quite potent after a few generations. By no means is this a "fire and forget" packer like UPX, but it can be made to work on a case-by-case basis.



Appendix A: Anti-Debugging Tricks

There are 53 anti-debug mechanisms and checks in the VM, 49 of which can be broken automatically with either a tiny IDC script patching the bytecode directly, or a small patch to the VM harness. Of the remaining four, there are two which I've never heard of before (although I don't do this type of work often), so they are worth checking out in the IDB, but I won't ruin the surprises here. I didn't look too heavily into those which could be broken automatically, so some of their descriptions in the IDB may be incorrect.



You may notice conditional jump instructions in the IDB which don't have their jump targets resolved, such as the following:



ROM:01035E90 cmp r07, 1
ROM:01035E98 jz 77026DDFh

At first I figured my processor module was buggy, and that this instruction was supposed to transfer control to a keyed memory region which the processor module had failed to locate. Inspection of the raw bytecode and a close look at the relevant VM harness code showed otherwise: this instruction really does move the VM EIP to the immediate value 0x77026DDF, which will cause a read access violation or undefined behavior during the next VM cycle, depending upon whether that is a valid address. Hence, jumps with unresolved targets are anti-debugging tricks. TheHyper confirmed this afterwards in private correspondence.
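Such jumps can be flagged mechanically by testing each target against the known VM code regions. The sketch below is an assumption about how one might do this, not something the included module does; the Region type and the bounds used in any call are hypothetical, and the target is assumed to have been resolved to a plain address already.

```cpp
#include <cstdint>
#include <vector>

// Half-open address range [start, end) holding VM bytecode.
struct Region { std::uint32_t start; std::uint32_t end; };

// True when target lies outside every known VM code region, i.e. the jump
// can only crash the VM and is a candidate anti-debugging trap.
bool is_unresolved_target(std::uint32_t target, const std::vector<Region> &vm_code)
{
    for (const Region &r : vm_code)
        if (target >= r.start && target < r.end)
            return false;
    return true;
}
```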



Appendix B: IDA Processor Module Construction

The main difference between writing a simple disassembler and writing an IDA processor module is that, instead of printing the disassembly immediately and moving on to the next instruction, information about each individual instruction and operand must be retained for later analysis and display.



For example, according to this VM's instruction encoding, 0x20 0x?[0-0xf] means "get flags into specified register". Whereas in a trivial disassembler one might write this:



case 0x20:
  printf("%lx: getefl %s\n", address, decode_register(next_byte & 0xf));
  return 2; // size of instruction

In an IDA processor module, one must write something like this (in ana.cpp):



case 0x20:
  cmd.itype = TH_getefl;   // instruction code is TH_getefl; this comes
                           // from an enumeration
  cmd.Op1.type = o_reg;    // operand 1 is register
  cmd.Op1.reg = TH_Regnum(4, ua_next_byte() & 0xf); // get register num
  cmd.Op1.dtyp = dt_dword; // register is dword size
  cmd.Op2.type = o_void;   // operands 2+ do not exist
  length = 2;              // instruction size is 2
  break;

TH_getefl is an element of an enum (ins.hpp), which in turn has a text representation and flags (ins.cpp). The operand information is eventually retrieved and printed (out.cpp):



case o_reg: OutReg( x.reg ); break;

Clearly, writing an IDA processor module is a significant amount of work compared to writing a simple disassembler, and in the case of small portions of straight-line VM code, the latter approach (via IDC) is preferable. However, in the case of large amounts of VM code with non-trivial control flow structure, the traditional advantages of IDA (cross-reference tracking, comment-ability, ability to name locations, creation and application of structures, and the ability to run scripts and existing plugins) really begin to shine.



B.1 Logical and Physical Divisions of an IDA Processor Module

It should be noted that, as with all C++ source code, physical divisions are irrelevant as long as all references can be resolved at link-time; however, the layout presented herein is consistent with the processor modules released in the IDA SDK, and also with the included processor module. Coincidentally, this information is laid out in the same order as specified by Ilfak in %idasdk%\readme.txt:



" Usually I write a new processor module in the following way: - copy the sample module files to a new directory - first I edit INS.CPP and INS.HPP files - write the analyser ana.cpp - then outputter - and emulator (you can start with an almost empty emulator) - and describe the processor & assembler, write the notify() function"

It should also be noted that I have written only one processor module and am not an expert on the subject. The information presented is correct as far as I am aware, but should not be considered authoritative. When in doubt, consult the processor module sources in the IDA SDK, inquire on the DataRescue forums, ask Ilfak, and buy the SDK support plan as a last resort.



B.2 Assigning Each Mnemonic a Numeric Code and Textual Representation

The two files described here are solely responsible for defining the mnemonics of the opcodes used by the processor, in both numeric and textual form.



B.2.1 Ins.hpp



This file contains an enum, called "nameNum" by Ilfak, which assigns each opcode a number. The enum begins with a special, unused leading entry ([processor]_null, set to zero) and ends with a trailing entry ([processor]_last; TH_end in the included module); these denote the beginning and the end of the enum.



enum nameNum
{
  TH_null = 0, // Unknown Operation
  TH_mov,      // Move
  [...,]       // [more instructions here]
  TH_stop,     // Stop execution, return to x86
  TH_end       // No more instructions
};



B.2.2 Ins.cpp



This file is the counterpart to the header file above. It contains an array of instruc_t structures, each of which consists of a const char * (the mnemonic's textual description) and a flags dword. The entries in this array correspond numerically to the values given in the enum. The flags specify which operands the instruction uses and changes, whether the instruction is a call/switch jump, and whether to continue disassembling after this instruction is encountered (e.g. return instructions and unconditional jumps do not generally transmit control flow to the following instruction).



instruc_t Instructions[] =
{
  { "", 0 }, // Unknown Operation

  // GROUP 1: Two-Operand Arithmetic Instructions
  { "mov" , CF_USE2 | CF_USE1 | CF_CHG1 }, // Move
  [{...,...},] // [more instructions]
  { "stop" , CF_STOP } // Stop execution, return to x86
};



B.3 Assigning Each Register a Numeric Code and Textual Representation

B.3.1 Reg.hpp



This file contains an enum, whose real entries begin at 0 (unlike the previously described enums with a bogus leading entry), consisting of the legal registers supported by the processor. As IDA takes into account the concept of segmentation, you will need to define fake code and data segment registers if your processor does not use them.



enum TH_regs
{
  rESPb = 0,
  rESPw,
  rESP,
  [...,]
  r0F,
  rVcs, // fake registers for segmentation
  rVds, // fake
  rEND
};



B.3.2 [Processor].hpp

This file contains an array of const char *s which map the elements of the enum described in the previous subsection to a textual representation thereof.



static char *TH_regnames[] =
{
  "rESPb",
  "rESPw",
  "rESP",
  [...,]
  "r0F"
};



B.4 Analyzing an Instruction And Filling IDA's "cmd" Structure

The main disassembler function in an IDA processor module is called int ana() and lives in ana.cpp. This function takes no parameters, and instead retrieves the relevant bytes to decode via the functions ua_next_byte(), ua_next_word(), and ua_next_long().



This function, or collection of functions as the case may be, is responsible for:

Setting cmd.itype to the correct value from the nameNum enum described in section B.2.1.

Setting the fields of cmd.Op[1-6] to describe the types of operands (registers, immediates, addresses, etc.) used by this instruction.

Returning the length of the instruction.

An example from the included processor module:



case 0x1d:
  cmd.itype = TH_vfree;            // virtualalloc'ed memory free
  cmd.Op1.type = o_imm;            // type of operand 1 is immediate
  cmd.Op1.value = ua_next_long();  // value = memory key to free
  cmd.Op1.dtyp = dt_dword;         // 4-byte memory key
  length = 5;                      // 5 bytes, 1 for opcode, 4 for operand
  break;

The cmd structure ties together the functions described in the next two sections: these functions do not take arguments, and instead retrieve information from the cmd structure in order to perform their duties.
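To make the "decode into a structure, print later" idea concrete without the SDK, here is a self-contained sketch covering the two opcodes shown in this appendix. The types and names (Insn, Operand, decode) are hypothetical stand-ins for cmd and ana(); the immediate is assumed to be little-endian, and the register operand is stored as the raw nibble rather than run through TH_Regnum.

```cpp
#include <cstddef>
#include <cstdint>

enum Mnem   { M_null = 0, M_vfree, M_getefl };
enum OpType { OP_void = 0, OP_reg, OP_imm };

struct Operand { OpType type; std::uint32_t value; };
struct Insn    { Mnem itype; Operand op1; int size; };

// Decodes one instruction from p (n bytes available) into out.
// Returns its length, or 0 if the bytes are truncated or the opcode is
// not one of the two handled here.
int decode(const std::uint8_t *p, std::size_t n, Insn &out)
{
    out = Insn{ M_null, { OP_void, 0 }, 0 };
    if (n == 0)
        return 0;
    switch (p[0]) {
    case 0x1d:                       // vfree imm32
        if (n < 5) return 0;
        out.itype = M_vfree;
        out.op1.type = OP_imm;
        out.op1.value = p[1] | (p[2] << 8) | (p[3] << 16) |
                        (static_cast<std::uint32_t>(p[4]) << 24);
        out.size = 5;
        return 5;
    case 0x20:                       // getefl reg
        if (n < 2) return 0;
        out.itype = M_getefl;
        out.op1.type = OP_reg;
        out.op1.value = p[1] & 0xf;  // low nibble selects the register
        out.size = 2;
        return 2;
    default:
        return 0;
    }
}
```

A separate printing routine can then switch on op1.type, which is the division of labor between ana() and outop() described in this appendix.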



B.5 Displaying Operands

Out.cpp is responsible for providing two functions, bool outop( op_t & ) and void out(). out() is responsible for outputting the mnemonic and deciding whether to output the operands.



There's a bit of subtlety here: processors which use conditional execution, for example ARM with an instruction such as MOVEQ, may have a single nameNum/Instructions entry for an opcode ("MOV"), with the logic for appending the condition suffix ("EQ") to the mnemonic living in out(). I have not encountered this while coding a processor module and cannot speak about it.



One thing to notice about the code below is how gl_comm is set to 1 every time out() is called. If you do not do this, you will not see comments in the disassembly. Figuring this out required an email to Ilfak. Frankly, it's puzzling why displaying comments is not the default behavior, but this is the reality, so be sure to set this variable.



void out( void )
{
  char buf[MAXSTR];
  init_output_buffer(buf, sizeof(buf));

  OutMnem();

  if( cmd.Op1.type != o_void )
    out_one_operand( 0 );        // output first operand

  if( cmd.Op2.type != o_void )   // do we have a second operand?
  {
    out_symbol( ',' );           // put a ", " in the output
    OutChar( ' ' );
    out_one_operand( 1 );        // output second operand
  }

  term_output_buffer();

  // attach a possible user-defined comment to this instruction
  gl_comm = 1;
  MakeLine( buf );
}

The other function, bool outop(op_t &), is responsible for translating the contents of the op_t structure it is given into a textual description of that operand. The structure of this function is a simple switch statement on the op_t.type field. This function should be written concurrently with ana().



The output takes place through a number of functions exported from ua.hpp in the SDK: these functions tend to begin with "Out" or "out_" (out_register, OutValue, out_keyword, out_symbol, etc).



All in all, coding this function is mainly trivial. Here's one of the more complicated operand types from the included processor module:



case o_displ:
  out_symbol('[');
  OutReg( x.phrase );
  out_symbol('+');
  OutValue(x, OOF_ADDR );
  out_symbol(']');
  break;



B.6 Creating Cross-References

Most, but not all, instructions implicitly transfer control flow to the next instruction, and create no other cross-references. Some instructions like "ret" and "jmp" do not reference the next instruction. Other instructions, like "call" and conditional jumps, create additional references to the address(es) targeted. Still other instructions create references to data variables specified by immediate values.



This knowledge is not inherent in the depiction of the instruction set which has been developed thus far, and must be specified programmatically. This is the responsibility of the int emu() function, which resides in emu.cpp, the smallest .cpp file in the supplied processor module.



static void TouchArg( op_t &x, bool bRead ); // defined below

int emu( void )
{
  ulong Feature = cmd.get_canon_feature();

  if((Feature & CF_STOP) == 0) // does this instruction pass flow on?
    ua_add_cref( 0, cmd.ea+cmd.size, fl_F ); // yes -- add a regular flow

  if(Feature & CF_USE1) // does this instruction have a first operand?
    TouchArg(cmd.Op1, 0); // process it

  if(Feature & CF_USE2)
    TouchArg(cmd.Op2, 1);

  return 1; // return value seems to be unimportant
}

// "emulation" performed on a given op_t, see emu()
static void TouchArg( op_t &x, bool bRead )
{
  switch( x.type )
  {
    case o_vmmem:
      ua_add_cref( 0, get_keyed_address(x.addr),
                   InstrIsSet(cmd.itype, CF_CALL) ? fl_CN : fl_JN);
      // add a code reference to the targeted address, either a call or a
      // jump depending on whether that instruc_t's flags has CF_CALL set.
      break;
  }
}
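The decision logic in emu() can be isolated from the SDK to see which references each kind of instruction produces. In the sketch below the flag constants, the Xref type and code_xrefs are all hypothetical; the flag values are chosen for illustration and are not taken from the SDK headers.

```cpp
#include <cstdint>
#include <vector>

// Illustrative flag bits for this sketch only.
const unsigned FLAG_STOP = 1u << 0;  // control does not fall through
const unsigned FLAG_CALL = 1u << 1;  // a code target is a call, not a jump

enum XrefKind { XREF_FLOW, XREF_CALL, XREF_JUMP };

struct Xref { std::uint32_t from; std::uint32_t to; XrefKind kind; };

// Lists the code references created by one instruction at address ea.
std::vector<Xref> code_xrefs(std::uint32_t ea, std::uint32_t size, unsigned flags,
                             bool has_target, std::uint32_t target)
{
    std::vector<Xref> out;
    if ((flags & FLAG_STOP) == 0)    // falls through to the next instruction
        out.push_back(Xref{ ea, ea + size, XREF_FLOW });
    if (has_target)                  // explicit code operand
        out.push_back(Xref{ ea, target,
                            (flags & FLAG_CALL) ? XREF_CALL : XREF_JUMP });
    return out;
}
```

A 'stop' (FLAG_STOP, no target) yields no references at all, a conditional jump yields a flow plus a jump reference, and a call yields a flow plus a call reference.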



B.7 Declaring IDA's Relevant Processor Module Structures

The bulk of what remains is the creation of structures which are directly or indirectly exported by the processor module.



B.7.1 asm_t Structure



This structure defines an "assembler" which determines what the disassembly listing should look like. Specifically, what the syntax is for declaring data, origins, section boundaries, comments, strings, etc.



Since we don't need to re-assemble virtual machine code (in the case of VMs found in protectors), the choices made here are immaterial, and this structure can be created once and re-used for all VM processor modules.



B.7.2 Function Begin and End Sequences



Both of these are optional. IDA employs both a linear-sweep and a flow-following method of disassembly: on the first pass, it marks all entrypoints as code, and then scans the raw bytes looking for the function begin sequences (such as push ebp / mov ebp, esp). These sequences can be specified in the processor module; however, when dealing with a throwaway VM, they aren't so important, because you're unlikely to know a priori what a function prologue looks like.



B.7.3 Processor Notification Event Handler



This is where my ignorance of processor module construction is most transparent. This function is called by the kernel upon certain events being triggered; such events include closing the database, opening an existing IDB, creating a new IDB, changing the processor module type, creating a new segment, and so on. A complete list of events can be found in idp.hpp.



For the creation of this processor module, I did not need to utilize many processor events, so I did not explore this further.



B.7.4 processor_t Structure



This is the "main" structure employed by the processor module, as plugin_t is the main structure employed by a plugin. In this structure, the pieces gathered in the previous sections are stitched together.



The processor module must know:

The numeric ID of the processor module (custom-defined).

The long and short names of the processor module, e.g. metapc and pc respectively. There's an important point here which isn't documented: the makefile has a line called "DESCRIPTION" which MUST be in the format "[long name]:[short name]". If it is not, the processor module will not be shown in the list of valid processor modules. Without knowing this, you'll be mailing Ilfak for advice, like I did.

The assembler(s) available. We'll only need the one we defined in B.7.1.

A function pointer to int ana() (see B.4).

A function pointer to int emu() (see B.6).

A function pointer to void out() and bool outop(op_t &) (see B.5).

A function pointer to int notify(processor_t::idp_notify, ...) (see B.7.3).

The number of registers, and a pointer to the const char * array of register names (both laid out in B.3).

The function begin and end sequences described in B.7.2. We can set these to NULL.

The number of mnemonics, and a pointer to the instruc_t array of mnemonic names (both laid out in B.2).



Appendix C: Obligatory Greets

TheHyper: Very innovative, good work! You keep making them, I'll keep breaking them.



blorght and Zen: Two of my favorite people, with or without the charming accents. Way too talented and more than a step or two over the edge. Stay just the way you are: I love both of you.



Nicholas Brulez: My bro the PE killer :) I hope we get to meet up again soon. I don't have to tell you to keep kicking ass, mate.



Neural Noise: One of my best friends, and a very gracious host. I can't wait to meet up again in the world's most alluring mafia-run slum that is Napoli (what a crazy city!). You bring the beautiful women, and I'll bring my bummy self, and we can have panic attacks in traffic waiting for the party to start ;-). Stay cool, man! :-)



Solar Eclipse: Congrats on the Pietrek thing!



spoonm: Thanks for the crash space and the informed conversation, and I'm looking forward to seeing what you publish next, too.



Pedram: For being the modern-day Fravia of OpenRCE and editing this tripe.



Rossi: For the much-needed proofreading.



lin0xx: Calm down!



LeetNet, kw, and upb: Self-explanatory.



Skape: For uninformed and for rocking.



Finally, to all true friends everywhere: I couldn't do it without you.



Appendix D: Contact

Contacting the author directly with large cash donations:



[firstname].[lastname]@gmail.com



To license binary analysis, binary diffing and malware taxonomy source code, please see and mail:



www.rolfrolles.info



[the single word "licensing"]@rolfrolles.info
