Although not the most common vulnerability class, it sometimes happens that a ring-0 module (or the kernel itself) references a local variable or buffer that was not properly initialized beforehand. The threat is usually mitigated by compiler warnings / errors informing about potential security flaws present in the source code – as life shows, this is not always enough to avoid serious vulnerabilities of that kind. Knowing about a buggy function, one might stumble upon a serious problem – how to actually make use of that information in a beneficial way (e.g. execute code in the security context of the faulty function – here, kernel mode). The problem boils down to another question – how can one control the garbage bytes present on the ring-0 stack from a ring-3 perspective? The difficulty stems primarily from the fact that every thread on the Windows platform has two separate, dedicated stacks – a user-mode and a kernel-mode one. This security design obviously limits an application's ability to tamper with the execution path of highly-privileged code. Unfortunately, it also narrows down the amount of controllable data that the user is able to (indirectly) move onto the ring-0 stack… or does it? In this post, I am going to focus on my recent, minor discovery, which makes it possible to place large amounts of arbitrary data in the stack memory areas of the current thread.

In the remaining part of the post, an assumption is made that the faulty device driver code runs in the same thread context as the code that actually triggers the vulnerability.

Let’s take a look at a simplified picture of a kernel stack layout, while executing a system call handler:

+---------------------------------+ <== Stack limit
...................................
...................................
...................................
...................................
+---------------------------------+
|    VOID (*HandlerRoutine)();    | <== Uninitialized local pointer
...................................
...................................
|   nt!KeInternalFunction local   |
|            variables            |
+---------------------------------+
|          Stack Frame #3         |
+---------------------------------+
|    nt!KeRandomService local     |
|            variables            |
+---------------------------------+
|          Stack Frame #2         |
+---------------------------------+
|    nt!NtRandomService local     |
|            variables            |
+---------------------------------+
|          Stack Frame #1         |
+---------------------------------+
|                                 |
|            TRAP FRAME           |
|                                 |
+---------------------------------+
...................................
................................... <== Stack Init (default ESP value)
|                                 |
|         Irrelevant stack        |
|             content             |
|                                 |
+---------------------------------+ <== Stack Base
??????????????????????????????????? <== Unmapped memory

As can be seen, the ring-0 stack during kernel code execution consists of a few components, such as a trap frame, several stack frames (depending on the number of nested calls), and memory reserved for local variables and buffers. Most notably, in the case of a system call, the first stack frame includes a certain number of parameters, previously transferred from the user-mode stack. However, the number of arguments assigned to each service is strictly defined, and the real number of bytes moved across privilege levels is very low (random guess: up to 60 bytes). Sixty bytes is definitely not enough to reach an example HandlerRoutine variable, which can be placed many levels down the stack (usually not more than 4096 bytes below the Stack Init address). Obviously, one might try to manipulate the internal kernel routines' call stack in such a way that the uninitialized variable lands at exactly the place where an input syscall parameter used to be. Even though the approach might be successful at times, the entire process of adjusting the code execution path seems definitely too time-consuming and unreliable.

Instead of just trying to match single DWORDs with the uninitialized variables' addresses, it would be best to control a larger memory area at once. More precisely, it would be reasonable to find an internal kernel routine which directly copies user-supplied data into a local buffer (it doesn't really matter what happens next); the larger the buffer, the better. My first idea was to begin searching among the standard NT core syscall set, so I created a very simple syscall fuzzer and started watching the stack contents. After several iterations, I apparently achieved the desired effect, seeing a huge, consistent block of 0x41's. Yay!

The fortunate NT API turned out to be nt!NtMapUserPhysicalPages. After a cursory look at the routine prologue, it becomes apparent that the function perfectly meets the conditions listed above:

mov     edi, edi
push    ebp
mov     ebp, esp
push    0FFFFFFFFh
push    offset dword_452498
push    offset __except_handler3
mov     eax, large fs:0
push    eax
mov     large fs:0, esp
push    ecx
push    ecx
mov     eax, 10E8h
call    __chkstk

Given the fact that __chkstk is a special procedure which lowers the stack pointer by a given number of bytes – here 0x10e8 – the function must definitely use a huge amount of local storage (more than one typical memory page!). If we take a look into the WRK (Windows Research Kernel), or more precisely the \base\ntos\mm\physical.c file, we can find the following code snippet:

NTSTATUS
NtMapUserPhysicalPages (
    __in PVOID VirtualAddress,
    __in ULONG_PTR NumberOfPages,
    __in_ecount_opt(NumberOfPages) PULONG_PTR UserPfnArray
    )
(...)
    ULONG_PTR StackArray[COPY_STACK_SIZE];

After digging a little more, we can see the original COPY_STACK_SIZE definition, used as the size of the stack buffer:

//
// This local stack size definition is deliberately large as ISVs have told
// us they expect to typically do up to this amount.
//

#define COPY_STACK_SIZE 1024

Wow, things are getting a little clearer now. The last piece of code required to completely understand what's going on here is presented below:

PoolArea = (PVOID)&StackArray[0];
(...)
if (NumberOfPages > COPY_STACK_SIZE) {
    PoolArea = ExAllocatePoolWithTag (NonPagedPool,
                                      NumberOfBytes,
                                      'wRmM');
    if (PoolArea == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }
}
(...)
Status = MiCaptureUlongPtrArray (PoolArea,
                                 UserPfnArray,
                                 NumberOfPages);

As shown, the syscall handler allocates a local buffer of 1024 items, each of size sizeof(ULONG_PTR) = 4 bytes on the considered Intel x86 platform. It also allows the user to pass larger amounts of data, but at most 4096 user-supplied bytes (exactly one memory page) can be locally stored by the function. Consequently, after performing the following call:

NtMapUserPhysicalPages( arbitrary r-3 pointer,
                        1024,
                        pointer to 1024 copies of 0x41414141 );

the local thread’s stack layout should be as follows:

+---------------------------------+ <== Stack limit
...................................
...................................
...................................
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| \
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| |
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| |
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| |
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| |
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| |
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| |
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| | 4096 bytes of controlled memory
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| |
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| |
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| |
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| |
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| |
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| |
|AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA| /
| nt!NtMapUserPhysicalPages local |
|            variables            |
+---------------------------------+
|          Stack Frame #1         |
+---------------------------------+ <== Trap Frame start
|                                 |
|            TRAP FRAME           |
|                                 |
+---------------------------------+ <== Trap Frame end
...................................
................................... <== Stack Init (default ESP value)
|                                 |
|         Irrelevant stack        |
|             content             |
|                                 |
+---------------------------------+ <== Stack Base
??????????????????????????????????? <== Unmapped memory

I think it is also worthwhile to mention that the technique can be used for other purposes (although it works particularly well for controlling uninitialized fields). For example, one might try to use it to store attacker-controlled bytes at a known address in kernel memory, e.g. to create a fake structure. It can also be used to move an exploitation payload into kernel memory areas – though keep in mind that this will only work on platforms with no hardware-enforced DEP; otherwise, you will end up with nothing more than an ATTEMPTED_EXECUTE_OF_NOEXECUTE_MEMORY bugcheck.

The system call has been around since Microsoft Windows 2000, and as far as I can tell, it is not going to change or be removed any time soon. On 64-bit platforms, the size of the local buffer is even greater, since the native CPU word is twice the old size (thus a total of 0x2000 bytes to be controlled there). In general, I can imagine quite a few interesting usages of the technique, and I hope someone will benefit from the knowledge :-) If you know any other interesting ways of controlling / manipulating the kernel stack from within user mode, feel free to drop me a line!

Watch out for the follow up Sunday Blog Entries! ;)