Someone may ask what the purpose of debugging the PE loader is. Here are a few reasons:

checking why an executable is not loaded properly (imports, TLS, other initialization-related issues)

looking for hidden features (e.g. LdrpCheckNXCompatibility)

plain curiosity

Of course, debugging the ring 3 part of the PE/PE+ loader reveals only part of the truth. For the second part (or rather the first part, if I want to be strict) there is the MiCreateImageFileMap function inside ntoskrnl (the source code of this function can be found in the Windows Research Kernel: \base\ntos\mm\creasect.c; it is a bit old, but most of it hasn’t changed much). In this short article I’ll cover only the ring 3 part, on x86 and x64.

The ring 3 entry point for a new process (and also for a new thread) is located in NTDLL and exported as LdrInitializeThunk; more information about this callback can be found on Skywing’s blog: http://www.nynaeve.net/?p=205. That post inspired me to think about another method of debugging process initialization. That was a few years ago, and I came up with a very simple idea (flawed, as it turned out recently when I got back to this project). The initial concept looked like this:

Create the process with dwCreationFlags set to CREATE_SUSPENDED

Allocate one temporary page in the new process (VirtualAllocEx)

Inject a small shellcode that checks the PEB.BeingDebugged field in a loop; once a debugger is detected, the loop ends and int3 is executed

Redirect LdrInitializeThunk to the shellcode

Resume the process

Attach your favourite debugger
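The redirect step can be sketched at the byte level. This is only an illustration under stated assumptions: the stub encodings (push imm32 / ret on x86, mov rax, imm64 / jmp rax on x64) are my choice and not necessarily what the original tool writes, although their sizes (6 and 12 bytes) match the number of bytes the shellcode restores later. The function names are hypothetical. In a real injector the entry bytes would be read from and written to the target with ReadProcessMemory and WriteProcessMemory; here they are plain byte strings.

```python
import struct


def make_redirect_stub(shellcode_addr, is_x64):
    """Return bytes that transfer control to shellcode_addr (assumed encodings)."""
    if is_x64:
        # mov rax, imm64 ; jmp rax  ->  48 B8 <8 bytes> FF E0  (12 bytes)
        return b"\x48\xB8" + struct.pack("<Q", shellcode_addr) + b"\xFF\xE0"
    # push imm32 ; ret  ->  68 <4 bytes> C3  (6 bytes)
    return b"\x68" + struct.pack("<I", shellcode_addr) + b"\xC3"


def redirect(entry_bytes, shellcode_addr, is_x64):
    """Patch a copy of a function's first bytes.

    Returns (patched_bytes, saved_original) so that the overwritten
    bytes can later be restored by the shellcode.
    """
    stub = make_redirect_stub(shellcode_addr, is_x64)
    if len(entry_bytes) < len(stub):
        raise ValueError("not enough bytes to hold the stub")
    saved = entry_bytes[:len(stub)]
    patched = stub + entry_bytes[len(stub):]
    return patched, saved
```

The saved bytes are exactly what has to end up in the "restore original code" placeholders of the shellcode, so the hook removes itself once the debugger is attached.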

I used this scenario and it was sufficient at the time, but it sometimes failed. Recently I got back to it and finally found the reason. There is a race condition: during debugger attachment the system creates an additional thread that should execute DbgBreakPoint. So in my case, after resuming the application, one of the threads was reaching my shellcode and the second one was waiting until I hit ‘step over’ instead of ‘step into’; in some cases it was then taking over the initialization process first, leaving me with an already initialized application. Here is the new version of the x86 shellcode:

BITS 32

_begin:
    jmp     _skip
    push    0
    push    0
    mov     eax, 12345678h          ; NtTerminateThread
    call    eax
_skip:
    call    $+5
    pop     eax
    mov     word [eax - ($ - _begin - 1)], 9090h
    mov     eax, [fs:18h]           ; TEB
    mov     eax, [eax + 30h]        ; PEB
_loop:
    pause
    cmp     byte [eax + 2], 0       ; PEB.BeingDebugged
    je      _loop
    int3
    mov     eax, 12345678h          ; LdrInitializeThunk
    mov     dword [eax], 12345678h  ; restore original
    mov     word [eax + 4], 1234h   ; code
    jmp     eax

And the x64 version:

BITS 64
default rel

_begin:
    jmp     _skip
    xor     rcx, rcx
    xor     rdx, rdx
    mov     rax, 1234567890abcdefh      ; NtTerminateThread
    call    rax
_skip:
    mov     word [_begin], 9090h
    mov     rax, [gs:30h]               ; TEB
    mov     rax, [rax + 60h]            ; PEB
_loop:
    pause
    cmp     byte [rax + 2], 0           ; PEB.BeingDebugged
    je      _loop
    int3
    mov     rax, 1234567890abcdefh      ; LdrInitializeThunk
    mov     dword [rax], 12345678h      ;\
    mov     dword [rax + 4], 12345678h  ;| restore original code
    mov     dword [rax + 8], 12345678h  ;/
    jmp     rax

The code above takes care of the second thread created during debugger attachment: before entering the loop, the first thread overwrites the first two bytes of the shellcode (jmp _skip) with NOPs, so the second thread falls straight through to NtTerminateThread.
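The constants 1234567890abcdefh and 12345678h in the listings are placeholders that have to be replaced with real addresses and with the saved LdrInitializeThunk bytes before the shellcode is written into the target. How the original tool does this is not shown, so the following is only a sketch for the x64 template: it assumes the placeholders are located by searching the assembled blob and filled in order of appearance. The function names are hypothetical.

```python
import struct

QWORD_MARKER = struct.pack("<Q", 0x1234567890ABCDEF)
DWORD_MARKER = struct.pack("<I", 0x12345678)


def find_all(blob, marker):
    """Return the offsets of every non-overlapping occurrence of marker."""
    offsets, start = [], 0
    while True:
        pos = blob.find(marker, start)
        if pos == -1:
            return offsets
        offsets.append(pos)
        start = pos + len(marker)


def fill_x64_template(template, nt_terminate_thread, ldr_initialize_thunk,
                      original_code):
    """Fill the x64 placeholders: two addresses, then 12 saved bytes."""
    if len(original_code) != 12:
        raise ValueError("the x64 template restores exactly 12 bytes")
    q_offs = find_all(template, QWORD_MARKER)
    # The upper half of the qword marker is byte-identical to the dword
    # marker, so hits at (qword offset + 4) must be ignored.
    inside_qword = {q + 4 for q in q_offs}
    d_offs = [d for d in find_all(template, DWORD_MARKER)
              if d not in inside_qword]
    if len(q_offs) != 2 or len(d_offs) != 3:
        raise ValueError("unexpected number of placeholders")
    out = bytearray(template)
    # Order of appearance in the listing: NtTerminateThread first,
    # LdrInitializeThunk second.
    for off, addr in zip(q_offs, (nt_terminate_thread, ldr_initialize_thunk)):
        out[off:off + 8] = struct.pack("<Q", addr)
    # The three dword stores restore the original code at +0, +4 and +8.
    for i, off in enumerate(d_offs):
        out[off:off + 4] = original_code[4 * i:4 * i + 4]
    return bytes(out)
```

All offsets are collected from the untouched template before anything is written, so a real address that happens to contain the marker bytes cannot confuse the search. Patching at fixed offsets taken from the assembler listing would work equally well.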

To make life easier I’ve created a small application called LdrDebug that uses the above method. It detects the format of the executable (PE or PE+), injects the proper version of the shellcode and prints the PID of the created process:

e:\...\LdrDebug\Release>LdrDebug.exe notepad64.exe
Creating process: notepad64.exe
Arguments       : (null)
Type            : x64
PID             : 6216 (00001848)

e:\...\LdrDebug\Release>LdrDebug.exe notepad.exe
Creating process: notepad.exe
Arguments       : (null)
Type            : x86
PID             : 6988 (00001B4C)

e:\...\LdrDebug\Release>LdrDebug.exe /x64 notepad.exe
Creating process: notepad.exe
Arguments       : (null)
Type            : x86
PID             : 4240 (00001090)
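The PE / PE+ detection that selects the shellcode version can be done from the file headers alone. The sketch below is not taken from LdrDebug; it reads the optional header magic (0x10B for PE32, 0x20B for PE32+) as defined by the PE format, and the function name and the 'x86' / 'x64' labels are my own. Treating PE32+ as x64 is a simplification, since other 64-bit machine types use the same magic.

```python
import struct

PE32_MAGIC = 0x10B       # IMAGE_NT_OPTIONAL_HDR32_MAGIC
PE32_PLUS_MAGIC = 0x20B  # IMAGE_NT_OPTIONAL_HDR64_MAGIC


def pe_type(image):
    """Return 'x86' for PE32 and 'x64' for PE32+; raise ValueError otherwise."""
    if len(image) < 0x40 or image[:2] != b"MZ":
        raise ValueError("missing MZ header")
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature.
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    # Signature (4 bytes) and IMAGE_FILE_HEADER (20 bytes) precede the
    # optional header, whose first field is the magic.
    magic_off = e_lfanew + 4 + 20
    if len(image) < magic_off + 2:
        raise ValueError("truncated headers")
    if image[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    (magic,) = struct.unpack_from("<H", image, magic_off)
    if magic == PE32_MAGIC:
        return "x86"
    if magic == PE32_PLUS_MAGIC:
        return "x64"
    raise ValueError("unknown optional header magic")
```

A caller would pass the file contents, for example the result of reading the executable in binary mode, and pick the 32-bit or 64-bit shellcode based on the returned string.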

There is an additional switch, ‘/x64’, that can be used to debug the x64 part of an x86 process under the WOW64 subsystem. The application was tested on Windows 7, so I can’t guarantee that it will work on any other system. It might not work under Windows 8, as it uses the wow64ext library and I have had some reports that this library does not work on that system.

Link to the binary package: http://rewolf-ldrdebug.googlecode.com/files/rewolf.ldrdebug.zip

Link to the Google Code page: http://code.google.com/p/rewolf-ldrdebug/

Enjoy!