Introduction

It seems that much of the malware analysis and reverse engineering world takes for granted some of the extended features the processor provides. This write-up explains the system debug MSRs (Model Specific Registers) on both AMD and Intel processors and how these features can be leveraged by user-mode debuggers, not just by code running at a CPL of 0.

It is important to note that certain debug feature MSRs vary between Intel and AMD processors. For example, certain Intel CPUs provide up to 15 last branch records whereas AMD does not. In either case, however, those records cannot be leveraged from code running at a CPL of 3 and are outside the scope of this discussion.

Background

The author assumes that you have a decent knowledge of Windows debugger APIs, Windows internals, and assembly. Last branch recording and branch tracing should have a solid place among the malware analyst's or software reverse engineer's arsenal of tactics for analysis of ring 3 code. The Windows OS itself provides several backdoors into leveraging these techniques from user mode. The goal of this article is to provide a decent explanation of how to use these features and incorporate them into your own debuggers and analysis tools.

Branch tracing

A branch is an instruction that can conditionally or unconditionally transfer control flow. Examples include any conditional jump, unconditional jump, call, ret, far call, far jump, iret, retf, int n, syscall, sysexit, icebp, and so on.

The term 'branch taken' means there was an actual change in control flow resulting from the branch. An unconditional branch instruction is always taken. A conditional jump (for example, one following a bitwise comparison) is taken only when the result of the prior comparison satisfies its condition.

As you hopefully already know, the processor single-step feature ( EFLAGS.TF=1 ) causes a #DB exception to occur after each and every instruction boundary is reached. This type of exception is known as a trap, meaning the instruction pointer that is pushed onto the interrupt handler stack will point to the next instruction to be executed.

Simple x86 example:

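    pushfd
    or dword ptr [esp], 0x100   ; set TF (bit 8) in the saved EFLAGS image
    popfd                       ; TF=1 takes effect after the next instruction
    inc eax                     ; #DB after this instruction; saved EIP points to push ebx
    push ebx                    ; and again after each instruction while TF stays set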

The DebugCtl MSR provides a bit (BTF) that, when set along with EFLAGS.TF=1, raises the #DB trap (single step) only after a taken branch instead of after every instruction. A branch that is not taken does not trap. The instruction pointer pushed onto the handler stack is then the destination of the branch, which is of course the EIP/RIP you see in the Windows debugger CONTEXT structure.

Simple x86 example:

( EFLAGS.TF=1 and DebugCtl.BTF=1 )

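    push ebx
    push eax
    call ecx        ; taken branch: #DB, saved EIP = destination held in ecx
    xor eax, eax    ; not a branch: no trap
    inc ebx
    pop eax
    pop ebx
    ret             ; taken branch: #DB, saved EIP = return address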

Now, how do we access DebugCtl from user mode? It's simple: Windows exposes both the BTF and LBR bits of DebugCtl through bits 8 and 9 of DR7. If interested, see KiRestoreDebugRegisterState.

Bit 8 of DR7 represents bit 0 of DebugCtl. This is the LBR bit (last branch record, explained further on).

Bit 9 of DR7 represents bit 1 of DebugCtl. This is the BTF bit (single-step on branches).
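The bit mapping above can be sketched in C. This is a minimal illustration, not the author's code: the macro names and both helper functions are made up for this sketch, and the Windows-only part assumes the target thread is suspended or stopped at a debug event when its context is changed.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as described in the text; the names are this sketch's own. */
#define EFLAGS_TF      (1u << 8)   /* trap flag: single-step */
#define DR7_LBR_ALIAS  (1u << 8)   /* DR7 bit 8 -> DebugCtl bit 0 (LBR) */
#define DR7_BTF_ALIAS  (1u << 9)   /* DR7 bit 9 -> DebugCtl bit 1 (BTF) */

/* Return dr7 with the requested alias bits switched on; other bits are kept. */
static uint32_t dr7_with_branch_bits(uint32_t dr7, int want_lbr, int want_btf)
{
    if (want_lbr) dr7 |= DR7_LBR_ALIAS;
    if (want_btf) dr7 |= DR7_BTF_ALIAS;
    return dr7;
}

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: arm branch tracing (TF + BTF) and LBR on one thread. */
static BOOL arm_branch_trace(HANDLE thread)
{
    CONTEXT ctx;
    ZeroMemory(&ctx, sizeof ctx);
    ctx.ContextFlags = CONTEXT_CONTROL | CONTEXT_DEBUG_REGISTERS;
    if (!GetThreadContext(thread, &ctx))
        return FALSE;
    ctx.Dr7 |= dr7_with_branch_bits(0, 1, 1);  /* bits 8 and 9 */
    ctx.EFlags |= EFLAGS_TF;                   /* BTF only acts when TF=1 */
    return SetThreadContext(thread, &ctx);
}
#endif
```

The pure helper keeps the bit arithmetic separate from the API calls, so the same mask can be reused wherever a DR7 value is built.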

As I'm sure you can imagine, this can speed up a running trace by a long shot. When looking for a difference in control flow, or for a bug, the answer most likely lies in which branches are taken and which are not. When tracing only branches, you can cover hundreds of thousands of instructions per second, as opposed to generating an interrupt after every instruction boundary.

Maybe you have noticed that this leaves us with a problem. The instruction pointer pushed onto the handler stack is that of the destination of the branching instruction, so RIP/EIP in your user-mode CONTEXT structure will be the destination. What if we want to know the location of the branching instruction itself? This is where the last branch record stack, also known as LBR, comes in.

Let's imagine you are already branch tracing a program with your user-mode debugger or analysis tool. You have bit 9 of DR7 set to enable branch tracing, and the trap flag set as well. Here is what to do: additionally set the LBR bit via DR7 (bit 8, as shown above). When a #DB exception occurs due to a taken branch, look at EIP/RIP in your CONTEXT. That, as stated before, is the destination instruction. Now for the yummy part of the article: the address of the branching instruction itself is tucked away by Windows at EXCEPTION_RECORD->ExceptionInformation[0], provided of course that you properly enabled LBR. This is the virtual address of the instruction that branched to wherever your instruction pointer now is.
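One way a tracer might collect these source/destination pairs is sketched below. The `branch_edge` and `branch_log` types and both functions are hypothetical names for this illustration. The Windows-only handler takes ExceptionInformation[0] as the branch source, as described above, and expects the caller to pass in the instruction pointer it read from the thread's CONTEXT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BRANCH_LOG_CAP 1024

typedef struct { uint64_t from, to; } branch_edge;

typedef struct {
    branch_edge edges[BRANCH_LOG_CAP];
    size_t count;
} branch_log;

/* Append one taken branch; returns 1 on success, 0 when the log is full. */
static int branch_log_push(branch_log *log, uint64_t from, uint64_t to)
{
    if (log->count >= BRANCH_LOG_CAP)
        return 0;
    log->edges[log->count].from = from;
    log->edges[log->count].to = to;
    log->count++;
    return 1;
}

#ifdef _WIN32
#include <windows.h>

/* Hypothetical handler for one single-step exception seen by the debugger.
   context_ip is EIP/RIP taken from the thread's CONTEXT by the caller. */
static int on_single_step(const EXCEPTION_RECORD *rec, uint64_t context_ip,
                          branch_log *log)
{
    if (rec->ExceptionCode != EXCEPTION_SINGLE_STEP)
        return 0;
    /* Per the text: with LBR enabled, slot 0 holds the branch source. */
    return branch_log_push(log, (uint64_t)rec->ExceptionInformation[0],
                           context_ip);
}
#endif
```

A real tracer would also set the trap flag again before resuming the thread so that the next taken branch traps too.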

I got a little creative with this myself, and since I couldn't find any in-depth articles on the web about these features, I decided to write one. I was analyzing a little piece of software for a friend and noticed that, prior to calling ws2_32.send(), it would clear the stack, set up a fake return address, and then JMP to send() so that the original return address was never pushed onto the stack, making the job of finding where the call originated a royal pain.

LBR to the rescue

Here is how we can easily overcome this problem. Some of you may already know the answer by this point, but read on for important details.

First of all, we must initialize LBR on the thread we are analyzing. In this case we do not need the BTF feature. So set bit 8 of DR7 for the thread.

Next we must establish a breakpoint on ws2_32.send() or whatever you are analyzing. Here is the important part: the type of exception raised MUST be a #DB exception.

This is because the only Windows interrupt handler that inserts the LastBranchFromIp into EXCEPTION_RECORD->ExceptionInformation[0] is the Windows int 01 handler.

The Windows int 3 handler does not do this for us, and if you use int 3 your ExceptionInformation[0] member will be empty.

You can use either a debug register breakpoint or ICEBP (int 01 with no DPL check). I would personally recommend ICEBP, and here is why: the code could IRET to send() with the resume flag (EFLAGS.RF) set. If the user of your debugger had placed a debug register breakpoint on the first instruction of send(), it would be ignored!

And let's face it, a lot of us like to put breakpoints on the first instruction of an API.
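Placing an ICEBP breakpoint amounts to swapping the first byte of the target for 0xF1 and remembering the original. The sketch below shows that bookkeeping on a local byte buffer; the `icebp_patch` type and function names are invented for this illustration. In a real debugger the byte lives in the debuggee, so the same read/replace/restore sequence would go through ReadProcessMemory and WriteProcessMemory, followed by FlushInstructionCache.

```c
#include <assert.h>
#include <stdint.h>

#define ICEBP_OPCODE 0xF1u   /* one-byte icebp instruction */

typedef struct {
    uint8_t saved;   /* original first byte of the patched instruction */
    int     armed;   /* nonzero while the 0xF1 byte is in place */
} icebp_patch;

/* Replace the first byte at code with icebp, remembering the original. */
static void icebp_arm(icebp_patch *p, uint8_t *code)
{
    if (p->armed)
        return;
    p->saved = code[0];
    code[0] = (uint8_t)ICEBP_OPCODE;
    p->armed = 1;
}

/* Put the original byte back so the real instruction can execute. */
static void icebp_disarm(icebp_patch *p, uint8_t *code)
{
    if (!p->armed)
        return;
    code[0] = p->saved;
    p->armed = 0;
}
```

Because icebp raises #DB rather than #BP, a breakpoint placed this way goes through the int 01 handler, which is the path the text says fills in ExceptionInformation[0].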

Points of Interest - VM detection

Besides analyzing a branch to a function that was set up with a bogus stack, the LBR feature works as a pretty decent method of detecting whether or not your program is running within a hypervisor. This works because most virtualization software, including both VMware and VirtualBox, does not make use of LBR virtualization (even though it is possible and supported).

Here is a rough example. I will leave the logic of the exception handler up to you.

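RunningInHyperVisor PROC
    ; expects eax -> CONTEXT structure, ecx = thread handle
    mov dword ptr [eax], 0x10     ; CONTEXT.ContextFlags: debug registers bit
    lea ebx, [eax+0x18]           ; address of CONTEXT.Dr7 (x86 layout)
    mov dword ptr [ebx], 0x100    ; DR7 bit 8: enable LBR
    push eax                      ; lpContext
    push ecx                      ; hThread
    call SetThreadContext
    _emit 0xeb                    ; jmp short +0: a taken branch to the next instruction
    _emit 0x00
    _emit 0xf1                    ; icebp: raises #DB so the handler can inspect ExceptionInformation[0]
RunningInHyperVisor ENDP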
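One possible shape for the handler's decision is sketched below, under stated assumptions. It assumes the #DB raised by icebp is reported as a trap, so the instruction pointer seen by the handler points just past the 0xF1 byte, and that on real hardware with LBR enabled ExceptionInformation[0] holds the address of the two-byte jmp emitted right before it. The function names are hypothetical, and what a given hypervisor actually reports is not something this sketch can guarantee.

```c
#include <assert.h>
#include <stdint.h>

#define PROBE_JMP_LEN   2u   /* EB 00: jmp short +0 */
#define PROBE_ICEBP_LEN 1u   /* F1: icebp */

/* Address of the probe's jmp, derived from the instruction pointer the
   handler sees after icebp (assumed to point past the F1 byte). */
static uint64_t probe_jmp_address(uint64_t ip_after_icebp)
{
    return ip_after_icebp - (PROBE_JMP_LEN + PROBE_ICEBP_LEN);
}

/* Returns 1 when the reported branch source is not the probe's jmp,
   which this sketch treats as a hint that LBR is not being honored. */
static int lbr_record_missing(uint64_t reported_from, uint64_t ip_after_icebp)
{
    return reported_from != probe_jmp_address(ip_after_icebp);
}
```

In the exception handler, `reported_from` would come from ExceptionInformation[0] and `ip_after_icebp` from the CONTEXT; a result of 1 is only a heuristic signal and is best combined with other checks.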