In today’s post we will discuss various methods to alter memory, execute

machine code, inject (system) calls, and more in another process, using

only the Windows thread context API.

Throughout the next paragraphs we will introduce the reader to the concept

of thread context, why we will use the thread context API instead of the

existing API, how we will use it to perform functionality as provided

by existing APIs (without actually using those), some improvements to the

technique, and x64 support.

Finally, we present the reader source and binaries of the methods

described, as well as a Proof of Concept which allocates an executable

page in another process, writes some shellcode to it, and executes it.

The thread context denotes the state of a thread. It stores the values

of the general purpose registers [1], the instruction pointer (or

program counter), the eflags register [2], the floating point

registers, and more [3]. However, in this post, we are only interested in

the general purpose registers and the instruction pointer.

Each thread has its own thread context. By altering the thread context

of another thread, we can change the execution flow of this particular

thread. We will use this technique to our advantage in order to execute

code as we like. But first, why would we want to?

Note: A friend of mine, Echo, already wrote a good Proof of Concept a

few years ago which illustrates most of the techniques explained in

this post. Feel free to check his code as well.

Second Note: if anything is still unclear after reading this post, try

reading the Proof of Concept code as well; every step is heavily

commented.

You might wonder why one would use techniques described in this post

instead of using the normal functions which result in the same

functionality.

For starters, because you can (or at least, you can after reading this post.)

Besides that, on newer versions of Windows the normal functions appear to

have funky side-effects [4].

And last, but not least, certain APIs which we will be emulating are

flagged as malicious by software such as Anti-Virus products. By applying

techniques described in this post, we may or may not bypass these

Anti-Virus heuristics and/or any limitations given by such software.

So what does it take in order to hijack the thread context of another

thread, and use it in such a way that it does our thing?

First of all, one has to obtain a handle to the thread which we

want to hijack. This can be done in one of the following ways (btw, this

is not a list of all possible methods.)

Enumerating the threads of a process, followed by an OpenThread [5] call

Retrieving the thread identifier based on a window name, using GetWindowThreadProcessId [6], followed by an OpenThread call

Retrieving the thread handle after creating a new process, using CreateProcess [7]

Iterating through all thread handles of a process using NtGetNextThread [8] (after obtaining a process handle using e.g. NtGetNextProcess [8])

Once we have obtained a thread handle using our favourite method, we can

start with hijacking.

Before we do anything with the thread context, we have to suspend the

thread, otherwise the results of the thread context API are undefined.

(Think about it, why would you want to overwrite registers in a running

thread?)

After suspending the thread, we can obtain the thread context, modify it,

store the new thread context, and resume the thread (make the thread run

again.) We can do this as often as we want. In other words, we can resume

the thread for example five times with registers set as we like, and after

that restore the original thread context. By resuming and suspending the

thread a few times with registers set to our values, we can manipulate the

memory of the process.

In order to find gadgets that will work in the remote process, we will be

using a shared library. That is, a library that we can scan in our own

process, which (optimally) has the same base address in the other process.

On Windows, the best example of this would be ntdll.dll, since it

is loaded in every process. That being said, all of our work will be

done on the ntdll.dll library.

In order to manipulate the memory in the other process, we have

to find so-called gadgets. Before we dive into specific gadgets

for different operations, we will first examine what exactly gadgets do.

Usually a gadget will perform one particular operation (such as writing

data to a register or address) and after that jump to the next gadget. In

other words, a gadget is a really, really basic sequence of instructions

(usually two to at most ten instructions.)

For more information regarding Gadgets, you could read some more on

topics such as Return Oriented Programming

[9].

A gadget must meet the following requirements to be usable.

We must control all input variables

With the same input, the output must always be the same

The gadget must return into a location controlled by us

To match the first criterion, we must find a gadget which uses registers

only as input. We don’t control the stack directly, so we should not use

gadgets which read from the stack. (An exception to this will be presented

later though.)

The second criterion states that the gadget should always perform the same

operations given the same input, in other words, the gadget cannot contain

any conditional jumps (and for simplicity, we also ignore relative jumps.)

Finally, the third criterion, which is pretty interesting, states that the

gadget should always return to an address controlled by the attacker. This

is because we somehow want to know when the gadget has finished execution.

A very reliable method to do this is by jumping to or returning into a

busy-loop, an instruction that jumps to itself. What happens now is that

the thread will, after executing the gadget, run into an infinite loop.

As attacker we can request the instruction pointer of the thread (by

obtaining the thread context), we know the address of the infinite loop

(as we told the thread to go there), so now we can simply wait until the

thread has reached the loop. Once we see that the thread is in the

infinite loop, we know that the gadget has finished execution, and we can

suspend the thread again.
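
To make the suspend, modify, resume, and wait cycle a bit more concrete, here is a minimal sketch in C. It is not taken from the Proof of Concept. The struct regs32 type and the names regs_for_write and run_until_loop are made up for this illustration, and the register assignment assumes a write gadget of the form mov dword [ebx], eax followed by retn, with the busy-loop address already stored at [esp]. The Windows part is only compiled on 32bit Windows builds and uses SuspendThread, ResumeThread, GetThreadContext and SetThreadContext.

```c
#include <stdint.h>

/* Portable model of the few x86 registers this sketch touches. */
struct regs32 {
    uint32_t eax, ebx, esp, eip;
};

/* Registers for one run of a `mov dword [ebx], eax ; retn` write gadget.
 * Assumes the dword at [esp] already holds the busy-loop address. */
struct regs32 regs_for_write(uint32_t gadget, uint32_t esp,
                             uint32_t addr, uint32_t value)
{
    struct regs32 r = { .eax = value, .ebx = addr, .esp = esp, .eip = gadget };
    return r;
}

#if defined(_WIN32) && (defined(_M_IX86) || defined(__i386__))
#include <windows.h>

/* Load r into the thread, let it run, and return 0 once the thread is
 * suspended again with eip parked on busy_loop. The thread must already
 * be suspended when this function is called. */
int run_until_loop(HANDLE thread, const struct regs32 *r, uint32_t busy_loop)
{
    CONTEXT ctx;

    ctx.ContextFlags = CONTEXT_CONTROL | CONTEXT_INTEGER;
    if (!GetThreadContext(thread, &ctx))
        return -1;
    ctx.Eax = r->eax;
    ctx.Ebx = r->ebx;
    ctx.Esp = r->esp;
    ctx.Eip = r->eip;
    if (!SetThreadContext(thread, &ctx))
        return -1;
    if (ResumeThread(thread) == (DWORD)-1)
        return -1;

    for (;;) {
        Sleep(1);                       /* give the thread time to run */
        if (SuspendThread(thread) == (DWORD)-1)
            return -1;
        ctx.ContextFlags = CONTEXT_CONTROL;
        if (!GetThreadContext(thread, &ctx))
            return -1;
        if (ctx.Eip == busy_loop)
            return 0;                   /* gadget finished, still suspended */
        if (ResumeThread(thread) == (DWORD)-1)
            return -1;
    }
}
#endif
```

A caller would save the original CONTEXT before the first step and write it back with SetThreadContext once all steps are done, so that the thread continues as if nothing happened.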

Before we examine the gadgets further, let’s first see what a busy-loop

is, and how we would use it in x86.

As mentioned earlier, a busy-loop is an instruction that jumps to itself.

More specifically, in x86, we use the jmp short instruction. This

is an unconditional jump with an 8bit relative offset, which is calculated

in such a way, that it points to the beginning of the instruction.

(Actually it’s called just jmp, but to make it clear that we want

an 8bit relative offset, we say jmp short.)

In assembly the instruction looks like one of the following

representations.

jmp short $

loop: jmp short loop

This instruction is only two bytes long and therefore quite easily found

in a large library such as ntdll. Actually, the ntdll version shipped

with x86_64 Windows 7 SP1 contains 13 busy-loops.

As one is more than enough for us, this will do just fine.
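
Since jmp short $ assembles to the two bytes EB FE, looking for a busy-loop boils down to a byte search. The following is a minimal sketch of such a search over a buffer holding the bytes of an executable section. The function name find_busy_loop is made up for this example, and turning the returned offset into a remote address assumes that the library is mapped at the same base address in the other process.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the offset of the first `jmp short $` (bytes EB FE) in code,
 * or -1 if the buffer contains none. */
long find_busy_loop(const uint8_t *code, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (code[i] == 0xEB && code[i + 1] == 0xFE)
            return (long)i;
    }
    return -1;
}
```

Note that the match does not have to sit on an instruction boundary of the original code. We point the instruction pointer at it ourselves, so any EB FE byte pair in executable memory will do.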

We have seen what criteria a gadget must meet. Now it’s time to examine

the two types of gadgets which we will be using.

Read Gadget → read 32bits of data (one dword)

Write Gadget → write 32bits of data (one dword)

Using only these two gadgets, we will be able to do anything we want (as

we will see later.)

Now that we’ve defined the types of gadgets, we have to figure out what a

gadget looks like that fulfills all three criteria.

The first criterion is fairly simple: we are only looking for gadgets which

contain an instruction that reads the value at an address into a register,

or writes data to an address.

For a reading gadget, the following instruction will do. (It obtains the

32bit integer at the address specified by ebx and stores it into

eax. We can later retrieve the value in eax from the thread

context.)

mov eax, dword [ebx]

For a writing gadget, we reverse the operands in the mov instruction,

resulting in the following instruction. (Writes the 32bit integer in

eax to the address specified by ebx.)

mov dword [ebx], eax

The second criterion, that the output is always the same for a given

input, is fairly easy to meet if we keep the gadgets as simple as possible.

(That is, no conditional stuff etc.)

This brings us to the last criterion: we have to be able to control where

the gadget returns after execution. There are two easy ways to do this.

By jumping to an address specified in a general purpose register

Using a return instruction on a stack value we have overwritten

The first method is the easiest and, including the read gadget, may look

like the following. (Where ecx points to an address specified by

us.) Unfortunately research showed that this method does not give us

any gadget at all, but it’s still a nice technique to keep in mind.

mov eax, dword [ebx]
jmp ecx

Although we used hardcoded registers in this example, any register should

do (as long as the source or destination operand in the mov

instruction is not the same as the address register in the jmp

instruction.)

For instance, the following is not a valid gadget for us (because the

ebx register is referenced twice.)

mov eax, dword [ebx]
jmp ebx

The second method involves setting up the stack in such a way that it has

the address to which we want to jump, and then a return instruction.

This method requires us to do an additional 32bit write before we can do

any other reads, writes or other stuff (because we have to initialize the

stack with our return address.) A simple example follows (with a

write gadget.)

mov dword [ebx], eax
retn

Note that, in this case, neither the source nor the destination operand of

the mov instruction can be esp (because that’s where retn gets its

return address, unless that’s what you want.)
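
As an illustration of how one might look for the second kind of gadget, here is a rough sketch that scans a byte buffer for a mov followed directly by retn. It relies on the standard x86 encodings (8B /r for mov reg, [mem], 89 /r for mov [mem], reg, and C3 for retn) and only matches the plain [base] addressing form. The names gadget_kind, gadget and find_mov_ret are made up for this example, and it is not how the Proof of Concept does its scanning.

```c
#include <stddef.h>
#include <stdint.h>

enum gadget_kind { GADGET_READ, GADGET_WRITE };

struct gadget {
    size_t offset;   /* offset of the mov inside the scanned buffer */
    int reg;         /* data register number (0=eax .. 7=edi) */
    int base;        /* address register number */
};

/* Find the first `mov reg, dword [base] ; retn` (read) or
 * `mov dword [base], reg ; retn` (write). Returns 1 and fills *out on
 * success, 0 otherwise. Only ModRM mod=00 without SIB is matched. */
int find_mov_ret(const uint8_t *code, size_t len,
                 enum gadget_kind kind, struct gadget *out)
{
    const uint8_t opcode = (kind == GADGET_READ) ? 0x8B : 0x89;

    for (size_t i = 0; i + 2 < len; i++) {
        if (code[i] != opcode || code[i + 2] != 0xC3)
            continue;
        uint8_t modrm = code[i + 1];
        int mod = modrm >> 6, reg = (modrm >> 3) & 7, rm = modrm & 7;
        if (mod != 0 || rm == 4 || rm == 5)   /* SIB byte or [disp32] */
            continue;
        if (reg == 4)                         /* esp as data clashes with retn */
            continue;
        if (kind == GADGET_WRITE && reg == rm)
            continue;                         /* could only store the address */
        out->offset = i;
        out->reg = reg;
        out->base = rm;
        return 1;
    }
    return 0;
}
```

For the byte sequence 8B 03 C3 this reports a read gadget with eax as data register and ebx as address register, which is exactly the mov eax, dword [ebx] example from above followed by retn.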

One problem that came up during testing was the following.

When hijacking the thread context of a thread that did a simple infinite

loop, there were no problems, and the message box (see the

Proof of Concept section) was shown correctly.

However, after adding a call to Sleep in the loop, problems

occurred. That is, the registers in the write gadget were corrupted.

This has to do with non-volatile registers. Out of the eight

general purpose registers, four of them are labeled as non-volatile

(ebx, ebp, esi and edi.) Non-volatile

registers are preserved

across function calls, whereas a register such as eax is always

corrupted because the return value of the function is stored in it.

This is likely not the entire explanation, by far, but if anyone knows

more about this particular subject, please do leave a comment.

Anyway, basically if we want to be able to hijack threads which might be

in a blocking system call (e.g. Sleep [10]), then we are limited to

gadgets which use only non-volatile registers. Fortunately for us, this

doesn’t give too many problems, as there are plenty of gadgets left.
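
The restriction to non-volatile registers can be expressed as a small filter on the ModRM byte of a candidate mov instruction. The sketch below uses the standard x86 register numbering. The function names are made up for this example, and only the [base] and [base+disp8] addressing forms are considered.

```c
#include <stdint.h>

/* x86 register numbers as used in ModRM bytes. */
enum { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };

/* ebx, ebp, esi and edi are the four registers the text calls non-volatile. */
int is_nonvolatile(int reg)
{
    return reg == EBX || reg == EBP || reg == ESI || reg == EDI;
}

/* Does a `mov` with this ModRM byte touch only non-volatile registers?
 * Handles the [base] (mod=00) and [base+disp8] (mod=01) forms without
 * a SIB byte. Everything else is rejected to keep the sketch short. */
int modrm_is_nonvolatile(uint8_t modrm)
{
    int mod = modrm >> 6, reg = (modrm >> 3) & 7, rm = modrm & 7;

    if (mod > 1 || rm == ESP)       /* disp32, register-direct, or SIB */
        return 0;
    if (mod == 0 && rm == EBP)      /* mod=00 rm=101 means [disp32] */
        return 0;
    return is_nonvolatile(reg) && is_nonvolatile(rm);
}
```

As an example, a read of the shape mov edi, dword [ebp+disp8] has the ModRM byte 0x7D and passes the filter, whereas mov eax, dword [ebx] (ModRM byte 0x03) is rejected because of eax.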

As we can now read any value from the process and write any value to the

process (you could chain multiple write commands in order to write more

than four bytes), it is now time to look into function calling in the

other process.
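
Chaining write commands, as mentioned above, could look like the following sketch, which splits a byte buffer into a list of 32bit writes. The names dword_write and plan_writes are made up for this example. Note the simplification: a trailing partial dword is padded with zeroes, so up to three bytes after the end of the buffer are overwritten in the other process. One could avoid that by reading the last dword with a read gadget first.

```c
#include <stddef.h>
#include <stdint.h>

struct dword_write {
    uint32_t addr;    /* remote address for the write gadget */
    uint32_t value;   /* little-endian dword to store there */
};

/* Split len bytes destined for remote address dst into 32-bit writes.
 * A trailing partial dword is zero-padded. Returns the number of writes
 * stored in out, or 0 if out (max_out entries) is too small. */
size_t plan_writes(uint32_t dst, const uint8_t *data, size_t len,
                   struct dword_write *out, size_t max_out)
{
    size_t count = (len + 3) / 4;

    if (count > max_out)
        return 0;
    for (size_t i = 0; i < count; i++) {
        uint32_t v = 0;
        for (size_t b = 0; b < 4 && i * 4 + b < len; b++)
            v |= (uint32_t)data[i * 4 + b] << (8 * b);
        out[i].addr = dst + (uint32_t)(i * 4);
        out[i].value = v;
    }
    return count;
}
```

Each entry then corresponds to one run of the write gadget, with the address and value placed in whatever registers the gadget uses.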

A function call is basically setting up the stack correctly and jumping

to the function address, and this is exactly what we will be doing.

Let’s assume that we want to call VirtualAlloc [11] in the other process,

rather than calling VirtualAllocEx [12] in our own process. (See the

difference? Using VirtualAlloc you can allocate memory in your own

process, whereas VirtualAllocEx is able to allocate memory in another

process. VirtualAlloc is like mmap(2) [13].)

As you can see on the MSDN page (follow the link in the footnote),

VirtualAlloc takes four parameters. What we will do is the

following.

We will allocate enough space for these four parameters and the return address (where code execution will continue after finishing the function call) on the stack

We will write the four parameters to the corresponding locations on the stack

We will write the address of a busy-loop as return address

And finally, we will call the function

First of all we will allocate enough space on the stack. Assuming that the

remote thread has a normal stack layout, that is, esp points to the

lowest stack address currently in use, we can simply subtract our needed

space from the esp register. In order to call VirtualAlloc we need

to store five values on the stack (four parameters and the return

address.) In other words, we will be writing our parameters/data at

esp-20, where 20 represents five 32bit integers.

Now it’s time to write our data on the allocated stack space. We do this

by using our write gadgets five times in a row. So this is actually pretty

easy, once the correct gadgets have been found.

After we prepare the other thread for the particular function call, it

is now time to execute the function. We do this by pointing esp to

the address we calculated earlier (e.g. esp-20 in our example) and

besides that, we set the instruction pointer to the address of the

function we want to call.

From there on, after resuming the thread, the function will be executed

and the thread will end up in the busy-loop once the function returns. We

have now successfully executed the function, and we can read the return

value in the eax register in the thread context.
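
The stack layout for such a call can be worked out up front. The sketch below computes, for a 32bit stdcall function, the new value of esp, the value for the instruction pointer, and the list of dword writes (the return address first, then the parameters). The names call_plan, plan_stdcall and MAX_ARGS are made up for this example, and all addresses used with it are placeholders.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ARGS 8

struct call_plan {
    uint32_t new_esp;                 /* esp to load before resuming */
    uint32_t eip;                     /* function entry point */
    uint32_t addr[MAX_ARGS + 1];      /* where each dword goes */
    uint32_t value[MAX_ARGS + 1];     /* return address, then arguments */
    size_t count;                     /* number of dword writes needed */
};

/* Lay out a 32-bit stdcall frame below esp: [new_esp] holds the return
 * address (the busy-loop), [new_esp+4] the first argument, and so on.
 * Returns 0 on success, -1 if nargs exceeds MAX_ARGS. */
int plan_stdcall(uint32_t esp, uint32_t func, uint32_t busy_loop,
                 const uint32_t *args, size_t nargs, struct call_plan *p)
{
    if (nargs > MAX_ARGS)
        return -1;
    p->count = nargs + 1;
    p->new_esp = esp - 4 * (uint32_t)p->count;
    p->eip = func;
    p->addr[0] = p->new_esp;
    p->value[0] = busy_loop;
    for (size_t i = 0; i < nargs; i++) {
        p->addr[i + 1] = p->new_esp + 4 * (uint32_t)(i + 1);
        p->value[i + 1] = args[i];
    }
    return 0;
}
```

For VirtualAlloc the four arguments could for example be 0 (let the system pick the address), 0x1000 (size), 0x3000 (MEM_COMMIT | MEM_RESERVE) and 0x40 (PAGE_EXECUTE_READWRITE), which gives five writes starting at esp-20.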

Note that some functions write output data to a memory address given by a

pointer, e.g. sprintf [14]. In this case one could read the output data

from the address by chaining one or more read gadgets.

The gadgets presented earlier are as basic as they come, however, since we

want our attack to be fairly robust, we will support somewhat more

advanced gadgets as well. This is because the library in the other process

(we use ntdll.dll) might not contain the basic gadgets.

Advanced mov instruction

So, let’s start with supporting mov instructions which take more

registers in the memory address.

mov ebx, dword [esi+eax*2+0x20]

In order to support this gadget, we will most likely zero the eax

register, subtract 0x20 from the address, and store the result in

esi. However, if that’s not possible (e.g. this gadget is followed

by a jmp instruction with esi as the register to jump to), then

we will have to do some additional calculations (e.g.

eax = (address - esi - 0x20) / 2, which only works when address - esi is even.)
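
The register values for such an addressing mode follow from the formula address = base + index*scale + displacement. Below is a small sketch of both cases described above: the simple one where the base register is free, and the one where the base register is already taken and we have to solve for the index register. The function names are made up for this example, and all arithmetic wraps modulo 2^32 just like the processor does.

```c
#include <stdint.h>

/* Base register is free: zero the index register and fold the
 * displacement into the base. */
uint32_t base_for_target(uint32_t target, uint32_t disp)
{
    return target - disp;
}

/* Base register is pinned to a fixed value (e.g. it is also the jump
 * target): solve for the index register instead. Returns 0 and stores
 * the index on success, -1 if the remaining distance is not a multiple
 * of scale. */
int index_for_target(uint32_t target, uint32_t base, uint32_t disp,
                     uint32_t scale, uint32_t *index)
{
    uint32_t rest = target - base - disp;   /* wraps modulo 2^32 */

    if (scale == 0 || rest % scale != 0)
        return -1;
    *index = rest / scale;
    return 0;
}
```

A negative displacement such as 0xffffffe4 (that is, -0x1c) works the same way: base_for_target simply ends up adding 0x1c to the target address.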

More encodings for Jump instruction

The following example is an improvement on the jmp instruction.

In this case we use a call instruction instead of a jmp

instruction. This brings a few caveats though: the instructions before the

call cannot use the esp register, and a 32bit return address is pushed

onto the stack (controlled by the esp register.) An example read

gadget follows.

mov eax, dword [ebx]
call esi

Additional encoding for retn

Besides a normal retn instruction, there is also a variant of the

retn instruction which takes a 16bit immediate, indicating how many

bytes should be added to esp after returning (in our case, to the

busy-loop.) Other than that, the instruction is not very special, but it

is used in functions with the stdcall calling convention (also referred to

as WINAPI on Windows.) A simple example of such an instruction looks like

the following.

retn 4
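
When scanning for gadgets, both variants can be recognized by their first byte: retn is C3 and retn imm16 is C2 followed by the 16bit immediate in little-endian order. The following sketch (the function name is made up for this example) returns how far esp moves when the instruction executes, which is what we need in order to know the value of esp once the thread sits in the busy-loop.

```c
#include <stddef.h>
#include <stdint.h>

/* If code starts with `retn` (C3) or `retn imm16` (C2 lo hi), return the
 * number of bytes esp advances when it executes: 4 for the popped return
 * address plus the immediate. Returns -1 if it is neither. */
long ret_stack_adjust(const uint8_t *code, size_t len)
{
    if (len >= 1 && code[0] == 0xC3)
        return 4;
    if (len >= 3 && code[0] == 0xC2)
        return 4 + (long)(code[1] | (code[2] << 8));
    return -1;
}
```

So for retn 4 the stack pointer ends up eight bytes higher than it was before the return.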

The x64 architecture is slightly different, and so is its calling

convention. Whereas x86 throws all parameters on the stack by default,

x64 has a fastcall calling convention [15]. The first four parameters

to a function are passed in general purpose registers, and any

other parameters are passed on the stack. In theory this means that

we could write the return address somewhere on the stack (the busy-loop is

exactly the same as in x86) and execute a function such as

VirtualAlloc simply by passing all the parameters in registers in the

thread context.

Practically, however, we might encounter problems regarding non-volatile

registers, etc.
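
As a rough sketch of what such a call setup could look like, the function below fills in the registers for a call with up to four integer parameters. It goes slightly beyond what this post covers: the Microsoft x64 calling convention also expects 32 bytes of so-called shadow space above the return address and a stack pointer for which rsp+8 is a multiple of 16 at function entry, so the sketch accounts for both. The names regs64 and plan_x64_call are made up for this example, and this has not been tested against a real thread.

```c
#include <stdint.h>

struct regs64 {
    uint64_t rcx, rdx, r8, r9;   /* first four integer arguments */
    uint64_t rsp, rip;
};

/* Register values for calling a function of up to four integer arguments
 * under the Microsoft x64 convention. The busy-loop address still has to
 * be written to [rsp] with a write gadget before resuming the thread. */
struct regs64 plan_x64_call(uint64_t rsp, uint64_t func, const uint64_t args[4])
{
    struct regs64 r;

    /* Leave room for the return address plus 32 bytes of shadow space,
     * and make rsp+8 a multiple of 16, as at a normal function entry. */
    uint64_t top = (rsp - 64) & ~(uint64_t)15;

    r.rsp = top - 8;
    r.rip = func;
    r.rcx = args[0];
    r.rdx = args[1];
    r.r8  = args[2];
    r.r9  = args[3];
    return r;
}
```

After the function has returned into the busy-loop, the return value can be read from rax, and the original thread context should be restored, because rsp no longer has its original value.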

That said, the gadgets for reading and writing remain the same as they

will be automatically “promoted” to use x64 registers. The only difference

is, obviously, that you will be working with 64bit integers and addresses.

Up-to-date source of the Proof of Concept can be found

here.

Binaries (with source as well) can be found

here.

The Proof of Concept basically does what we discussed during this post.

First of all it enumerates all the executable sections in ntdll, then it

looks for possible gadgets in these sections (there is actually only one

section, but still.) From there, after finding a busy-loop, read gadget

and write gadget, it prepares the stack in the remote thread and calls the

VirtualAlloc function to allocate a RWX page (read, write,

execute.) It then copies some shellcode to the page and executes it. This

shellcode is a simple MessageBox call, but then again, it’s just a

Proof of Concept.

Example execution looks like the following:

$ cat target.c
#include <stdio.h>
#include <windows.h>
int main()
{
    printf("threadid: %d\n", GetCurrentThreadId());
    while (1) {
        Sleep(100);
    }
}
$ gcc target.c
$ ./a &
threadid: 9000
$ ./poc 9000
0x77b6b48d read edi dword [ebp+0xffffffe4]
0x77b931ea write dword [ebx+0xffffffe4] edi
Allocated page: 0x00300000
... msgbox pops up ...