


This post presents a technique that can be used to achieve Universal
System Call Hooking under WOW64 [1] from usermode.

Throughout the next paragraphs we will briefly introduce the reader to
system calls, the motivation for this article, the way Linux’s ptrace(2)
handles system call interception, and how Windows performs system calls
under WOW64. Finally, we show how we take advantage of this to intercept
every system call, giving third-party software a way to analyze and/or
modify the system calls made by a process.

Additionally, we will discuss possible improvements, advantages and

disadvantages of the presented technique.

Finally, we provide the reader with a Proof of Concept including all sources,

pre-compiled binaries, a sample output and a small analysis based on the

output and the source.

A system call is, as the name suggests, a call to the “system”, that is,
the kernel. System calls are the lowest level a process can reach, because
they are the only way to communicate with the kernel. Whatever the kernel
does with the data a process provides cannot be seen by the process; the
process only sees the result of the system call. It is worth noting that
any kind of Input/Output goes through the kernel (be it a file, socket,
etc.).

Whereas the Linux kernel provides the ptrace(2) [2] API, which allows
processes to intercept system calls made by child processes [3], Windows
does not provide such functionality. Windows does, however, ship with a
fairly comprehensive debugging framework, which we will not use here. That
is, we do not use dbghelp.dll to help debugging, nor do we attach to a
process as a debugger at all. This means that anti-debugger methods
employed by a process do not work against this technique, or against
RREAT [POC] in general.

Because Windows does not natively support system call hooking, there have
been a number of kernel drivers (e.g. rootkits) that intercept system
calls by hooking the SSDT [4], installing kernel hooks, etc.

Using the technique presented in this post, one does not need
administrator privileges to perform system call hooking (these are
required to load drivers into the Windows kernel). Instead, it requires at
least the same privileges as the process we want to intercept. This makes
the technique fairly attractive. For one, a user doesn’t have to run as
administrator in order to debug another process. Secondly, 64-bit Windows
kernels make it fairly hard to load drivers at all, and once loaded, the
driver is not allowed to hook anything (using the traditional methods),
because PatchGuard [5] will detect the modification and halt the system
with a bug check.

The Linux kernel provides the ptrace(2) API which handles everything

related to the debugging of child processes. It also supports intercepting

system calls and does this in a very clean way, so we follow the same
approach in our implementation.

Linux’s ptrace(2) works as follows. When a child process performs a system
call, the parent (the debugger) gets a notification before the system call
is executed (we will call this the pre-event from now on). From here on
the parent is able to inspect the arguments to the system call by reading
the registers (or the stack, depending on the syscall calling convention
[6]). The parent also has the ability to alter arguments, because it can
change registers and write memory in the child. After the parent finishes
the pre-event inspection, it tells ptrace(2) that the child can execute
the system call. The Linux kernel then continues with the execution of the
system call that was made by the child process (with either the original
parameters, or parameters that were altered by the parent).

After the system call finishes, and right before the child process regains
execution, the post-event is triggered: the parent receives another
notification, and this time it is able to read the return code and any
altered parameter (e.g. functions such as read(2) fill a buffer, given as
a parameter, with contents from a file). Note that the parent is,
logically, also capable of altering the return code and/or anything else.
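
The pre-event/post-event cycle can be sketched with a small tracer. This
is only an illustration under a few assumptions: it targets Linux, it uses
PTRACE_O_TRACESYSGOOD to tell system call stops apart from other stops, and
the helper name trace_child() is made up for this example. It merely counts
events; a real tracer would read or modify the child's registers at each
stop.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child, trace its system calls and count the
 * pre-events (syscall entry) and post-events (syscall exit) that are seen.
 * Returns 0 on success and -1 on failure. */
int trace_child(long *pre_events, long *post_events)
{
    int status;
    int in_syscall = 0;  /* toggled on every system call stop */
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: ask to be traced, then stop until the parent is ready. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        raise(SIGSTOP);
        getppid();       /* a harmless system call for the parent to see */
        _exit(0);
    }

    *pre_events = 0;
    *post_events = 0;

    /* Wait for the child's initial SIGSTOP. */
    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;

    /* Make system call stops report SIGTRAP | 0x80. */
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) < 0)
        goto fail;

    for (;;) {
        /* Resume the child until the next system call entry or exit. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) < 0)
            goto fail;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return 0;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            if (!in_syscall)
                ++*pre_events;   /* arguments could be inspected here */
            else
                ++*post_events;  /* return value could be inspected here */
            in_syscall = !in_syscall;
        }
    }

fail:
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return -1;
}
```

The exact counts depend on the C library, which may issue extra system
calls of its own. The final exit call only produces a pre-event, because
the child never returns from it.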

It should be obvious that the ability to intercept system calls gives full

control over a process’s behaviour.

The technique presented here is specific to WOW64 processes (that is,
32-bit binaries running on a 64-bit Windows version). With some extra
work, however, this technique can also be deployed on 32-bit Windows
versions (although it’s not as “reliable” on 32-bit Windows for various
reasons).

In order to be able to run 32-bit binaries on a 64-bit Windows version,
there has to be an extra layer between usermode and kernelmode. This is
because a 64-bit kernel expects 64-bit pointers as parameters, whereas the
32-bit application provides 32-bit pointers.

This brings us to segments. On WOW64 there are two code segments: the code
from our 32-bit binary runs in segment 0x23, which tells the CPU to
execute in 32-bit compatibility mode. However, when a syscall is made, the
CPU has to switch to segment 0x33, which makes the CPU use the 64-bit
instruction set. Switching segments in WOW64 is done by a so-called far
jump (a jump which has both an address and a segment). Because this far
jump is kind of tricky, and hardcoded, it only appears once in ntdll.dll
(where all system calls take place).

By replacing the instruction at this particular address, we can intercept

every system call.

As stated earlier, there is only one address in ntdll.dll where the segment

switch is done. So whenever a process makes a system call, this particular

address is hit. In other words, by redirecting the execution flow from this

address, we are able to intercept every system call the process makes.

The next problem we face: how do we get that address? Let’s dive into some
assembly, namely that of the ZwCreateFile() [7] function (on Windows 7
SP1).

All this function does is set up the arguments correctly for the system
call. The important parts are 0x52 (another notation for 52h), which is
the system call number for the ZwCreateFile() API, and the fact that the
‘edx’ register is loaded with the address of the first argument. The
instructions that initialize a few registers are followed by the call
instruction, which continues code execution at the address stored at
fs:[0xc0]. We fire up a debugger and find the address in fs:[0xc0] (simply
by executing “mov eax, fs:[0xc0]” and reading the value of ‘eax’
afterwards). We analyze the instruction at this address and see that it
is, as you might have guessed already, a far jump!
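
To make the far jump concrete: on x86, a direct far jump is encoded as the
opcode 0xEA followed by a 32-bit offset and a 16-bit segment selector. The
following sketch decodes such an instruction from a byte buffer. The type
and function names are hypothetical, and the sample bytes in the note
below are made up for illustration rather than taken from a real system.

```c
#include <stddef.h>
#include <stdint.h>

/* A far pointer as used by "jmp ptr16:32": a 32-bit offset plus a 16-bit
 * segment selector (hypothetical type, for illustration only). */
typedef struct {
    uint32_t offset;
    uint16_t selector;
} far_ptr;

#define FAR_JMP_OPCODE 0xEA
#define FAR_JMP_LENGTH 7  /* 1 opcode byte + 4 offset bytes + 2 selector bytes */

/* Decode a direct far jump at `code`. Returns 1 and fills `out` when the
 * bytes form a far jump, 0 otherwise. */
int decode_far_jmp(const uint8_t *code, size_t len, far_ptr *out)
{
    if (len < FAR_JMP_LENGTH || code[0] != FAR_JMP_OPCODE)
        return 0;

    /* x86 stores multi-byte values in little-endian order. */
    out->offset = (uint32_t)code[1] |
                  (uint32_t)code[2] << 8 |
                  (uint32_t)code[3] << 16 |
                  (uint32_t)code[4] << 24;
    out->selector = (uint16_t)(code[5] | code[6] << 8);
    return 1;
}
```

For instance, the bytes EA 1E 27 A5 74 33 00 decode to the offset
0x74A5271E with selector 0x33, i.e. a jump into the 64-bit code segment.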

We have just seen what the ZwCreateFile() function looks like, and it is
worth noting that every system call looks just like it, except that each
has a different number for the ‘eax’ register (every function has a unique
system call number).

Now that we know what to hook and at which address, it is time to get to
our implementation.

Our PoC is based on several components: a redirection at the child’s far
jump, some injected code in the child to handle system calls and notify
the parent, and finally a thread in the parent which waits for
notifications.

Setting up the Universal System Call Hooking mechanism goes as follows.

The parent creates an Event object and duplicates it to the child process.
This way, when the child signals the event object, the parent gets a
notification, which is what we need for pre- and post-event notifications.

The parent allocates a memory page in the child and writes some
hand-written machine code to it. Before the machine code (which is 32-bit
machine code) is written to the child, a few dependencies are updated in
it, such as the address of the far jump and the handle of the Event object
in the child process. As ntdll.dll appears to be mapped to the same base
address in the child and parent process, we can simply take the far
address (an address with a segment) from the parent process.

The parent creates a thread (in its own process) which is basically an
infinite loop that waits for the child to signal the event object; more on
this later.

The parent overwrites the far jump in the child with a jump to the
hand-written machine code that we injected earlier.
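
The last step replaces the 7-byte far jump with a jump to the injected
code. One way to build such a patch, assuming a 5-byte near jump (opcode
0xE9 with a 32-bit relative displacement) padded with two NOPs, is
sketched below. The text does not state which encoding the PoC uses, so
treat build_jmp_patch() as a hypothetical helper; writing the patch into
the child is left out.

```c
#include <stdint.h>
#include <string.h>

#define PATCH_LENGTH 7  /* size of the far jump that is being replaced */

/* Hypothetical helper: fill `patch` with "jmp target" as it must appear at
 * address `site`, followed by NOPs up to the length of the far jump. */
void build_jmp_patch(uint8_t patch[PATCH_LENGTH], uint32_t site,
                     uint32_t target)
{
    /* The displacement is relative to the end of the 5-byte jmp. */
    uint32_t rel = target - (site + 5);

    memset(patch, 0x90, PATCH_LENGTH);  /* 0x90 = nop */
    patch[0] = 0xE9;                    /* jmp rel32 */
    patch[1] = (uint8_t)(rel);
    patch[2] = (uint8_t)(rel >> 8);
    patch[3] = (uint8_t)(rel >> 16);
    patch[4] = (uint8_t)(rel >> 24);
}
```

Because the displacement is computed with unsigned 32-bit arithmetic, the
same code handles targets both above and below the patched address.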

Implementation of the Injected Machine Code

The injected machine code behaves just like ptrace(2)’s method, but in
order to achieve this, a few hacks are necessary. (ptrace(2)’s method:
pre-event, perform real system call, post-event.)

After the child notifies the parent, it enters a busy-loop [8]; the
busy-loop gives the parent enough time to catch up. The parent will now
suspend [9] the thread in the child process and, after it has been
suspended, read its CPU registers.

The parent keeps track of whether a notification is a pre-event or a
post-event, simply by toggling a boolean value on every notification.

If the notification is a pre-event, the parent reads the values of the
‘eax’ and ‘edx’ CPU registers (these contain the system call number and
the address of the first argument, respectively). From here on, the parent
can decide to read all arguments to the function, by reading data from the
child process (at the address specified by the ‘edx’ register).

When the notification is a post-event, however, the parent can read the
return value of the system call and optionally any altered arguments. A
lot of Windows APIs alter arguments to the system call, such as
ZwReadFile(), a low-level variant of fread(), which fills the ‘buffer’
parameter with contents from a file (or any other stream). In other words,
the post-event allows the parent to read the ‘buffer’ parameter that was
filled by the ZwReadFile() system call, and therefore the parent is able
to read and modify these contents.

When the parent finishes processing either a pre- or post-event, it sets
the Instruction Pointer (also called Program Counter) past the busy-loop
and resumes the thread, so that the thread happily continues executing.
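
The parent's bookkeeping boils down to a small state machine. The sketch
below models it without any Windows API calls: the thread's registers are
passed in as plain values, and all type and function names are
hypothetical. In a real tracer the register values would come from the
suspended thread's context, and the arguments would be read from the
child's memory at the computed addresses.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { EVENT_PRE, EVENT_POST } event_kind;

/* Per-child state kept by the parent (hypothetical layout). */
typedef struct {
    bool     in_syscall;    /* toggled on every notification */
    uint32_t syscall_no;    /* 'eax' as seen at the pre-event */
    uint32_t args_addr;     /* 'edx' as seen at the pre-event */
    uint32_t return_value;  /* 'eax' as seen at the post-event */
} tracer_state;

/* Handle one notification, given the 'eax' and 'edx' values read from the
 * suspended thread. Returns which kind of event was processed. */
event_kind handle_notification(tracer_state *st, uint32_t eax, uint32_t edx)
{
    if (!st->in_syscall) {
        st->syscall_no = eax;
        st->args_addr = edx;
        st->in_syscall = true;
        return EVENT_PRE;
    }
    st->return_value = eax;
    st->in_syscall = false;
    return EVENT_POST;
}

/* Address in the child of the n-th (zero-based) argument, assuming each
 * argument occupies one 32-bit slot starting at 'edx'. */
uint32_t argument_address(const tracer_state *st, unsigned n)
{
    return st->args_addr + 4u * n;
}
```

Keeping the values captured at the pre-event in the state means the
post-event handler can still locate the arguments, even if ‘edx’ has
changed by the time the system call returns.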

To signal the parent we need the ZwSetEvent() API. However, we don’t want
to intercept our own API call, so instead of calling the API (or
fs:[0xc0] for that matter), we call the far address directly.

The main feature is, well, Universal System Call Hooking in another
process.

Advantages of the technique described here include the ability to run
without the need for administrator access, the ability to read and modify
arguments (or even the system call itself!), etc.

There are two major disadvantages. First, a process will always be able to
bypass this method if it is specifically written to do so. Second, there
can be quite some slowdown, as a system call turns into three system calls
(ZwSetEvent() twice, for the pre- and post-events, and of course the real
system call) plus any overhead that the parent brings. This does not yet
include the extra overhead the kernel brings for Inter Process
Communication (signalling an event object in another process) and
manipulation of the thread in the child process (i.e. suspending and
resuming a thread, reading and writing a thread’s CPU registers, and
reading and writing memory in the child).

The current implementation, which can be found in the Proof of Concept, is

unfortunately single-threaded only.

A major improvement, speed-wise, is to use a whitelist table in the child.
Our Proof of Concept already does this: when the parent injects the
hand-written machine code, it also allocates a 64k table (no optimizations
here). Each entry in this table maps to a boolean value which indicates
whether the system call should be hooked or not. So, when the child
performs a system call, the value of the ‘eax’ register (which contains
the system call number) is checked against the table. If the boolean value
is false, the code simply executes the original system call and does
nothing else. This way, system calls we are not interested in are not sent
to the parent, so the only overhead for those system calls is a table
lookup, which is quite “cheap.”
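
The table lookup can be modelled in a few lines of C. In the PoC the check
is performed by the injected machine code; the sketch below only
illustrates the idea, and it assumes that the table has one byte per entry
and is indexed by the low 16 bits of ‘eax’. The names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_ENTRIES 0x10000  /* 64k entries, one per possible index */

/* One flag per system call number: non-zero means "notify the parent". */
static uint8_t hook_table[TABLE_ENTRIES];

/* Mark a system call number as interesting (or not). */
void set_hooked(uint32_t syscall_no, bool hooked)
{
    hook_table[syscall_no & 0xFFFF] = hooked ? 1 : 0;
}

/* The check made on every system call: a single table lookup. */
bool should_notify_parent(uint32_t eax)
{
    return hook_table[eax & 0xFFFF] != 0;
}
```

With only 0x52 marked, a ZwCreateFile() call would be reported to the
parent, while every other system call would go straight to the kernel.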

As the current implementation only supports single-threaded applications,
multi-threaded support should also be added. A simple approach is as
follows. When the child creates a new thread, the parent creates a new
event object and duplicates it to the child. The parent then makes sure
that the event object’s handle in the child is put somewhere in the
thread’s Thread Local Storage [10]. From here on, when the child makes a
system call, it notifies the parent with an event object unique to the
current thread. The parent then sees that the system call occurred on a
certain thread and does its thing on that specific thread.

Note that the parent is able to receive a notification when a child
creates a new thread, because this is a system call as well.

Because replacing the far jump with a normal jump is quite easy to detect,
it might be interesting to, instead of replacing the instruction
completely, only replace the far address (this would result in an address
with a segment of 0x23). If software still picks this up, one could go
further and modify the 64-bit code located at the original far address
(e.g. jump back to 32-bit code from there). Methods to make it harder to
detect are only limited by your imagination.

If somebody were to add multi-threaded support, great care has to be taken
regarding Race Conditions [11]. Race Conditions can occur when multiple
threads try to read and write the same memory addresses, for example. In
our situation, a malicious thread might alter another thread’s registers
(by using the Windows API), thereby possibly causing undefined behaviour
in the parent. The best way to avoid this is to read all registers, stack
memory and whatever else is needed only once and to reuse this data,
rather than reading it every time it’s needed.

Source with Binaries can be found here.

Up-to-date source can be found on github.

This Proof of Concept has been tested successfully on 64-bit Windows Vista
with Service Pack 2 and 64-bit Windows 7 with Service Pack 1.

The PoC consists of two parts, a parent and a child. The child attempts to
open a file called “hello world!”, which does not exist. The parent reads
the filename from the child process after receiving the pre-event
notification and shows the return value (which is an error, because the
file does not exist) after receiving the post-event.

Running the Proof of Concept (on a 64-bit Windows machine of course!)
gives something like the following.

    $ parent.exe
    Universal System Call Hooking PoC (C) 2012 Jurriaan Bremer
    Started hooking child with process identifier: 3308
    Child 3308 opens file: "hello world!".
    Child 3308 return value: 0xc0000034 -1073741772

Note that 0xc0000034 is defined as STATUS_OBJECT_NAME_NOT_FOUND; in other
words, no file with that name was found.
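
The two numbers in the last line of the output are the same 32-bit value,
printed once as hexadecimal and once as a signed integer. Below is a short
sketch of how such a status code can be interpreted, assuming the usual
NTSTATUS convention that the top two bits encode the severity and that
negative values indicate failure. The helper names are made up for this
example.

```c
#include <stdint.h>

/* Reinterpret the raw 32-bit status as a signed value, which is how the
 * second number in the output is printed. */
int32_t status_as_signed(uint32_t status)
{
    /* Avoid an implementation-defined conversion for values >= 2^31. */
    if (status <= INT32_MAX)
        return (int32_t)status;
    return -(int32_t)(UINT32_MAX - status) - 1;
}

/* Severity: 0 = success, 1 = informational, 2 = warning, 3 = error. */
unsigned status_severity(uint32_t status)
{
    return status >> 30;
}

/* Same test as the NT_SUCCESS() macro: non-negative values count as
 * success (this covers the success and informational severities). */
int status_is_success(uint32_t status)
{
    return status_as_signed(status) >= 0;
}
```

For 0xc0000034 the severity is 3 (error) and the signed value is
-1073741772, matching the output of the PoC.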

If we dive into the child’s source, we see a single line of code:

fclose(fopen("hello world!", "r"));

What is interesting about this is the fact that, although we call the
fopen() function in our code, code execution still ends up at
ZwCreateFile(). This simply shows that fopen() is a big wrapper around
CreateFile(), which is in turn a wrapper around ZwCreateFile().

The parent’s source is also fairly straightforward and well-documented; go
read it if you like. As the PoC’s source will show, the internals of our
implementation lie in a library named RREAT, so that is where you can find
them.

A small note regarding RREAT: RREAT was made with unpacking packed
software in mind. Therefore it’s built in such a way that a failing API
leads to an exit() call. When you’re running an unpacking script and
something goes differently than expected, it means that the target was not
packed in the way you thought it was. Your unpacking script is therefore
wrong, and there is no need for the process to keep running. Because of
this, RREAT is not very robust by default; keep this in mind when
developing on top of RREAT.

That was it for today, hopefully you liked the content of this post. Feel
free to contact me with any suggestions, criticism etc. Pull requests are
also more than welcome, see here.

Cheers,

Jurriaan