Forgotten Win32 APIs

There is a set of Win32 APIs that were introduced in Windows XP to monitor the working set of a process. A process' working set is the collection of pages (chunks of memory) that are currently in RAM (physical memory) and are accessible to that process without inducing a page fault. In particular, the APIs of interest for us are InitializeProcessForWsWatch, GetWsChanges, and GetWsChangesEx.





After reading the MSDN documentation, it's easy to discover what the intended use for these APIs was. These APIs profile the number of page faults that occur within a process' address space.

What's a page fault? A quick recap.

There are 3 general categories of page faults.





A hard page fault occurs when memory is accessed that's not currently in RAM (physical). In situations like this, the OS will need to retrieve the memory from disk (e.g. pagefile.sys) and make it accessible to the faulting process.





A soft page fault occurs when memory is in RAM (physical), but not currently accessible to the process that induced the fault. This memory might be shared amongst multiple processes and the process that caused the page fault might not have it mapped into its working set. These types of page faults are much more performant than hard page faults as there is no disk I/O conducted.





The final type of page fault is known formally as an invalid fault. These can also be referred to as access violations. This can be caused when a program, for example, tries to access unallocated memory or tries to write to memory that's marked read-only.
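To make the three categories concrete, here's a minimal Python sketch that classifies a single memory access. It's a toy model, not how the memory manager is implemented: the `Fault` enum, the `classify_access` function, and its four boolean inputs are names I made up for illustration.

```python
from enum import Enum


class Fault(Enum):
    NONE = "no fault"
    SOFT = "soft page fault"
    HARD = "hard page fault"
    INVALID = "invalid fault (access violation)"


def classify_access(allocated, permitted, in_ram, in_working_set):
    """Classify one memory access using the categories from the recap.

    All four inputs are hypothetical flags for this toy model.
    """
    if not allocated or not permitted:
        return Fault.INVALID   # unallocated memory, or e.g. a write to read-only memory
    if in_working_set:
        return Fault.NONE      # already accessible to the process
    if in_ram:
        return Fault.SOFT      # resident in RAM, just not in this working set
    return Fault.HARD          # has to be brought back from disk
```

The order of the checks matters in this sketch: an access that isn't permitted is invalid regardless of where the page currently lives.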





Paging is necessary to make modern operating systems work. You probably have many processes running on your system, but not nearly enough RAM to hold all the possible contents of each process in physical memory. To learn more about paging, I strongly recommend this article posted by my colleague.

Demo

The best way to illustrate what's broken is through an example. I created two simple programs.





The first application, WorkingSetWatch.exe, calls the InitializeProcessForWsWatch and GetWsChangesEx APIs. This application logs when a specific memory region is paged into our process' working set:





The second application, ReadProcessMemory.exe , implements reading of an arbitrary memory blob from another target process' memory space:





The basic idea is to use ReadProcessMemory.exe to read from the monitored memory address inside of WorkingSetWatch.exe . This will induce a page fault.

Windows 7: Build 7601 (SP1)

The WorkingSetWatch.exe application works as expected. We're able to read any (valid) sized buffer using ReadProcessMemory.exe and log it.





Windows 10: Build 15063 (Creator's Update)

Unfortunately, WorkingSetWatch.exe does not seem to log the page fault that occurs when our remote application, ReadProcessMemory.exe , reads a buffer greater than or equal to 512 bytes; however, it does seem to work as expected when a read occurs that's less than 512 bytes.









This renders these working set APIs useless for profiling reasons on Windows 8+.

What went wrong?

To determine what went wrong, we'll need to reverse engineer parts of Windows and see exactly how the implementation changed in Windows 8+ from Windows 7.



All disassembly and pseudo-source is reconstructed from system files that are provided with Windows x64 10.0.15063 (Creator's Update).

Enabling process working set logging

To enable working set logging for a process, we need to call InitializeProcessForWsWatch. From the MSDN documentation, we're told that on newer versions of Windows this API is exported as K32InitializeProcessForWsWatch within kernel32.dll. Our analysis begins there:

This function is very simple. It invokes an import from another library. In this case, it executes a function of the same name (K32InitializeProcessForWsWatch), but contained within a different library, api-ms-win-core-psapi-l1-1-0.dll. This library doesn't exist on disk, but rather resolves to an API Set mapping corresponding to kernelbase.dll (which does exist on disk) for this version of Windows. A look into kernelbase.dll's implementation shows that a call to NtSetInformationProcess is performed without any parameter marshalling:

Our next target is NtSetInformationProcess within ntdll.dll:

This is just a simplistic syscall stub that will eventually make its way into the implementation contained within ntoskrnl.exe, the Windows kernel. nt!NtSetInformationProcess is a massive function that contains a huge switch statement that supports all the different PROCESSINFOCLASS values that can be passed to it.

We're interested in the PROCESSINFOCLASS for ProcessWorkingSetWatch. This is case 15 (0xF). A snippet of the relevant parts (with the cleaned-up disassembly):






Our investigation leads us to nt!KiPageFault .

Logging a page fault

When a page fault occurs, the CPU transfers execution to nt!KiPageFault :



If the PsWatchEnabled global is set, that means we've enabled working set logging for processes on the system and execution is passed to nt!PsWatchWorkingSet . This function is documented below:



As I mentioned above, there are 3 types of page faults. Access violations are not logged to our process' working set due to an early out by nt!MmAccessFault in nt!KiPageFault . Since this function is executed for the other 2 types of page faults (hard and soft) on the system, it will be accessed heavily by the operating system. Luckily, one of the first things the routine does is check whether or not a working set watch was enabled on the process where the page fault occurred. If there is no working set watch on the process, the routine completes.



As per the documentation, nt!PsWatchWorkingSet will not function while records are being processed ( EntrySelector.Busy ). We'll describe this part in depth at a later time. Since higher priority interrupts can preempt our working set monitor, most of the logic in this routine needs to have adequate sanity (safety) checks and complete as atomically ( Interlocked*** operations) as possible. The first part of the function will safely select a free index in the _PAGEFAULT_HISTORY.WatchInfo array that it can use for logging purposes. If the array is full (there can be at most 1024 entries), a "miss" is recorded ( _PAGEFAULT_HISTORY.MissingRecords ) and the routine completes. If everything is successful, a page fault event is recorded in a free slot in the _PAGEFAULT_HISTORY.WatchInfo array. An interesting (and undocumented) feature changes the entry's _PROCESS_WS_WATCH_INFORMATION.FaultingVa least significant bit to 0 if a hard page fault occurred and 1 if a soft page fault occurred.
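The logging path just described can be sketched as a small Python model. This is a toy under stated assumptions, not the kernel's code: the names `PageFaultHistory`, `watch_working_set`, and `is_soft_fault` are mine, a plain list stands in for the Interlocked slot selection, and I'm assuming that a busy buffer is counted as a miss in the same way a full one is.

```python
MAX_ENTRIES = 1024  # capacity of the WatchInfo array, per the text


class PageFaultHistory:
    """Toy stand-in for _PAGEFAULT_HISTORY (fields named after the text)."""

    def __init__(self):
        self.watch_info = []       # (faulting_pc, tagged_faulting_va) tuples
        self.missing_records = 0   # the "miss" counter
        self.busy = False          # set while a query is processing records


def watch_working_set(history, faulting_pc, faulting_va, soft):
    """Log one hard or soft fault. Returns True if an entry was stored."""
    if history is None:
        return False               # no working set watch on this process
    if history.busy or len(history.watch_info) >= MAX_ENTRIES:
        history.missing_records += 1   # assumption: busy counts as a miss too
        return False
    # Low bit of FaultingVa: 1 for a soft fault, 0 for a hard fault.
    tagged_va = (faulting_va & ~1) | (1 if soft else 0)
    history.watch_info.append((faulting_pc, tagged_va))
    return True


def is_soft_fault(tagged_va):
    """Decode the fault type from a logged FaultingVa."""
    return bool(tagged_va & 1)
```

Filling the model past 1024 entries without draining it only bumps `missing_records`, which mirrors why the records have to be queried frequently.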



Ultimately, there don't seem to be any apparent bugs in this code. Additionally, this code closely matches the Windows 7 version, which we know works. Our investigation leads us to the working set watch retrieval functions: GetWsChanges/Ex.

Querying working set logging

For article brevity, I'll give a quick summary of the call-flow of kernel32!GetWsChanges (kernel32!K32GetWsChanges) and kernel32!GetWsChangesEx (kernel32!K32GetWsChangesEx). These functions call into their kernelbase.dll variants. From there, they branch into kernelbase!GetWsChangesInternal, which invokes ntdll!NtQueryInformationProcess with the appropriate PROCESSINFOCLASS: ProcessWorkingSetWatch for GetWsChanges and ProcessWorkingSetWatchEx for GetWsChangesEx. From ntdll!NtQueryInformationProcess, a syscall is made, which lands in the implementation of NtQueryInformationProcess within the kernel. A massive switch statement awaits:






Reading memory

The ReadProcessMemory.exe application is simple enough to understand. We know that we're not logging a page fault when we're reading a buffer that is greater than or equal to 512 bytes. Since there is no apparent bug in the working set APIs, the problem most likely resides in kernel32!ReadProcessMemory .





I'll step past the irrelevant details, but the same strategy is applied as in the previous parts. In particular, kernel32!ReadProcessMemory calls into kernelbase!ReadProcessMemory. These functions do nothing special and more-or-less directly issue a system call by invoking ntdll!NtReadVirtualMemory. This takes us to the implementation of nt!NtReadVirtualMemory in the kernel:






The bug: an optimization in nt!MmProbeAndLockPages

The implementation of nt!MmProbeAndLockPages underwent drastic changes between Windows 7 and now. If you looked at the two versions side-by-side, you'd quickly notice that the Windows 7 implementation was in some ways much simpler.

The purpose of nt!MmProbeAndLockPages (per the documentation) is to ensure that the specified virtual pages (in the argument contained within MemoryDescriptorList) are backed by physical memory. Additionally, there is a series of permission checks to ensure that the virtual pages permit the user-specified access rights. In Windows 7, to perform this access check, the routine actually "probed" the memory by directly accessing it. This would induce a page fault in the context of the correct process and therefore we'd be able to log it using our WorkingSetWatch.exe application.

On Windows 10, this process was optimized. Instead of accessing the memory directly, a PTE (Page Table Entry) walk is performed to ensure that the correct permissions exist. This change makes the process more efficient, especially since the PTEs are leveraged to lock the memory into physical pages anyway.

OS development isn't easy

One seemingly inconspicuous change can break functionality in an entirely unrelated part of the operating system. In our case, an optimization in the underlying logic of how nt!MmProbeAndLockPages functioned broke backwards compatibility of the working set APIs. This bug seems to be entirely unnoticed, but it unfortunately renders the performance profiling nature of the GetWsChanges/Ex APIs useless.
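To see why this optimization is invisible to a working set watch, here's a toy Python model of the two probing styles. Everything in it is hypothetical (`ToyProcess`, `probe_windows7_style`, `probe_windows10_style`), and it assumes the PTE walk can make a page resident without ever going through the page fault path, which is my reading of the behavior rather than something lifted from the kernel.

```python
class ToyProcess:
    """Hypothetical process with a working set and a fault log."""

    def __init__(self):
        self.resident = set()   # page numbers currently in the working set
        self.fault_log = []     # what a working set watch would record

    def _page_fault(self, page):
        # The fault path is the only place where logging happens.
        self.fault_log.append(page)
        self.resident.add(page)

    def touch(self, page):
        """A real memory access: non-resident pages take the fault path."""
        if page not in self.resident:
            self._page_fault(page)


def probe_windows7_style(process, page):
    # Probing by directly accessing the memory can fault, so it gets logged.
    process.touch(page)


def probe_windows10_style(process, page):
    # Assumption: the PTE walk makes the page resident without a fault,
    # so nothing ever reaches the fault log.
    process.resident.add(page)
```

Both probes leave the page resident. Only the first one leaves a trace in `fault_log`, which is the discrepancy the demo exposed.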





A potential fix for Microsoft is to simply raise a page fault for "invalid" pages if the PsWatchEnabled global is set or, more granularly, if a process' _EPROCESS.WorkingSetWatch is set.

Microsoft has a great track record of maintaining support for legacy software running under Windows. There is an entire compatibility layer baked into the OS that is dedicated to fixing issues with decades old software running on modern iterations of Windows. To learn more about this application compatibility infrastructure, I'd recommend swinging over to Alex Ionescu's blog. He has a great set of posts describing the technical details on how user (even kernel) mode shimming is implemented.

With all of that said, it's an understatement to say that Microsoft takes backwards compatibility seriously. Occasionally, the humans at Microsoft make mistakes. Usually, though, they're very quick to address these problems.

This blog post will go over an unnoticed bug that was introduced in Windows 8 with a documented Win32 API. At the time of this post, this bug is still present in Windows 10 (Creator's Update) and has been around for over 5 years.

It's interesting to note that you're able to start monitoring on a process' working set with either a class of ProcessWorkingSetWatch (15) or ProcessWorkingSetWatchEx (42). This can be achieved by invoking nt!NtSetInformationProcess directly instead of going through the documented route with kernel32!InitializeProcessForWsWatch. The latter utilizes only the ProcessWorkingSetWatch class.

The actual logic of nt!NtSetInformationProcess is pretty trivial to understand. A blob of memory is allocated per process that we're monitoring. This blob of memory is a _PAGEFAULT_HISTORY structure and contains up to 1024 _PROCESS_WS_WATCH_INFORMATION structures internally. Each _PROCESS_WS_WATCH_INFORMATION structure is an entry that describes a page fault. These entries will be cycled through as the array fills up. Recall from the MSDN documentation (the "Remarks" section) that you must call GetWsChanges/Ex with enough frequency to avoid record loss. This makes perfect sense because we can see that there are a fixed number of these records (1024) allocated.
I took the liberty of documenting these structures:

The union at the beginning of the _PAGEFAULT_HISTORY structure may be a little confusing, but it'll be explained later.

On successful execution of this routine, the monitored process object will have an internal member (_EPROCESS.WorkingSetWatch) updated to include this recently allocated _PAGEFAULT_HISTORY pointer. Additionally, the PsWatchEnabled global will be set. This value informs the system to track page faults for processes. It will remain set until the system reboots (even if there are no processes running that have working sets tracked). There are only 2 references to PsWatchEnabled and we've already inspected the one in nt!NtSetInformationProcess.

The part that interests us resides one level deeper within nt!PspQueryWorkingSetWatch.

There's some input validation (e.g. alignment checks) and a safety check (nt!ExIsRestrictedCaller) to avoid kernel pointer leaks in low integrity processes. After that, the process object is retrieved from the supplied process handle. The operating system checks to see that the _EPROCESS.WorkingSetWatch member is set. Just like the documentation states, at most one query can access a process' working set buffer at a time (EntrySelector.Busy). Additionally, while the buffer is being accessed, logging (by nt!PsWatchWorkingSet in nt!KiPageFault) will produce misses.

As long as there's enough space in the user supplied buffer, the operating system will copy over the entry array to the user supplied buffer. The data will be structured in the appropriate way for the appropriate PROCESSINFOCLASS. The last entry in the user supplied buffer (PSAPI_WS_WATCH_INFORMATION/EX) will be terminated with a FaultingPc member of NULL. Additionally, the number of "misses" will be recorded in the FaultingVa member of the last entry.

Finally, the _PAGEFAULT_HISTORY.WatchInfo array of the _EPROCESS.WorkingSetWatch will be reset after a successful call.

/rant.

The InitializeProcessForWsWatch and GetWsChanges/Ex APIs are surprisingly very finicky. There are many weird restrictions and caveats which make it surprisingly difficult for developers to retrieve information regarding the complete set of page faults that occurred within a process.

There is a very good chance that you will run into situations where records will wind up missing, especially in a multi-processor and multi-threaded environment.
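Misses are at least observable, because the terminating entry of the returned buffer carries the miss count in its FaultingVa member. Here's a minimal Python sketch of walking such a buffer. It assumes the basic two-pointer entry layout on x64 (two 64-bit values per entry) and that the soft/hard tag in the low bit of FaultingVa survives the copy to user mode. The helper name `parse_ws_changes` is hypothetical.

```python
import struct

# Assumption: x64 layout, FaultingPc followed by FaultingVa, 8 bytes each.
ENTRY = struct.Struct("<QQ")


def parse_ws_changes(buffer):
    """Return (records, misses) from a GetWsChanges-style byte buffer.

    Each record is (faulting_pc, faulting_va, "soft" or "hard").
    """
    records = []
    for offset in range(0, len(buffer) - ENTRY.size + 1, ENTRY.size):
        faulting_pc, faulting_va = ENTRY.unpack_from(buffer, offset)
        if faulting_pc == 0:
            # Terminator entry: FaultingPc is NULL, FaultingVa is the miss count.
            return records, faulting_va
        kind = "soft" if faulting_va & 1 else "hard"
        records.append((faulting_pc, faulting_va & ~1, kind))
    raise ValueError("buffer has no terminating entry")
```

A non-zero miss count from a call like this means the history is already incomplete for that interval, no matter how the remaining records look.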
For example, if a thread is querying the working set of a process, but a page fault occurs on another thread within that same process, a miss could be recorded since the _PAGEFAULT_HISTORY.Busy member will be acquired by nt!PspQueryWorkingSetWatch. This will prevent the page fault logging logic in nt!PsWatchWorkingSet. Functionally, this weakens the usability of the API for profiling purposes. To compound this problem, only 1024 entries can be stored in the array between calls of GetWsChanges/Ex. That's at most 4 MB (1024*PAGE_SIZE) of page fault history. This really isn't enough for modern applications which can be very complex.

In our specific situation, we ran our tests on a VM that had 1 processor allocated to it. Furthermore, our application was simple enough that it had 1 thread. This mitigates the chance of page fault "misses". Additionally, after a thorough investigation of the working set APIs, we've concluded that we've still not discovered where the bug is. In particular, why does the buffer size play a role in the success of these APIs? In our demo, we were unable to log page faults on Windows 10 when the buffer size was greater than or equal to 512 bytes. Is it possible that the bug is not within WorkingSetWatch.exe, but rather ReadProcessMemory.exe?

To continue our investigation, we need to turn to ReadProcessMemory.exe.

nt!NtReadVirtualMemory just invokes nt!MiReadWriteVirtualMemory. On some versions of ntoskrnl, this routine may just be inlined into the caller's body.

Aside from a check that prevents reading and writing to protected processes (ProcessObject->Pcb.SecurePid), nt!MiReadWriteVirtualMemory is nearly identical to the one in the Windows 7 kernel. We need to go deeper. We traverse into nt!MmCopyVirtualMemory.

This function is massive. It contains many subfunctions that have been inlined. For article brevity, the important parts of nt!MmCopyVirtualMemory will be highlighted. One of the first things that this routine does is search for VAD entries that correspond to the input addresses (FromAddress and ToAddress). The idea is to leverage the "region size" information for memory, but this isn't really relevant to our bug. We'll leave the discussion of the VAD (Virtual Address Descriptor) to another time.

nt!MmCopyVirtualMemory's next task is to determine the input buffer's length.
In particular, there are a couple of checks against the buffer length and the value 512. This is significant to us because we know the bug only seems to manifest when the buffer size is greater than or equal to 512 bytes.

Basically, it seems that if the buffer is greater than or equal to 512 bytes, nt!MmCopyVirtualMemory will utilize nt!MmProbeAndLockPages and nt!MmMapLockedPagesSpecifyCache followed by a memcpy to clone over memory.

If the buffer is less than 512 bytes, nt!MmCopyVirtualMemory will just leverage memcpy directly by using a buffer on the stack or a buffer allocated in dynamic memory (based on buffer size) via nt!ExAllocatePoolWithTag.

This is probably done for performance reasons. Larger memory copies probably benefit from direct mapping instead of memory pool copying. If we do leverage memory pool copying (buffers that are less than 512 bytes in size), we trigger a page fault and the event is logged by our WorkingSetWatch.exe application. On the other hand, if we leverage a direct mapping to copy memory, we do not trigger a page fault.

One incorrect assumption is to believe that on Windows 7 this optimization did not exist. On the contrary, there is very similar logic inside of the older version of nt!MmCopyVirtualMemory. However, something did change, otherwise we would not have any discrepancies with our WorkingSetWatch program. Our investigation leads us into nt!MmProbeAndLockPages.
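The length check described above boils down to a single threshold. Here's a small Python sketch of that decision, tied to what the demo showed on Windows 10. The function names (`copy_strategy`, `logged_on_windows_10`) are hypothetical, and the cutoff between the stack buffer and the pool buffer on the small path isn't modeled because I haven't stated it.

```python
DIRECT_MAP_THRESHOLD = 512  # bytes; the value the length checks compare against


def copy_strategy(length):
    """Pick the copy path described above for a read of `length` bytes."""
    if length >= DIRECT_MAP_THRESHOLD:
        # nt!MmProbeAndLockPages + nt!MmMapLockedPagesSpecifyCache, then memcpy
        return "direct mapping"
    # memcpy through a stack or pool buffer (stack/pool cutoff not modeled)
    return "buffered copy"


def logged_on_windows_10(length):
    """Per the demo, only the buffered path produced a logged page fault."""
    return copy_strategy(length) == "buffered copy"
```

With this model, a 511 byte read takes the buffered path and is logged, while a 512 byte read takes the direct mapping path and is not, which matches the boundary seen in the demo.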