Introducing LuaQEMU

When dealing with complex code in firmware, it is often desirable to have some kind of dynamic runtime introspection as well as the ability to modify behavior on the fly. For example, when reverse engineering embedded solutions such as cellular basebands or custom operating system code, the analyst's understanding of a target is often fueled by assisting binary analysis with the ability to look at protocol stacks, operating system tasks, and memory at runtime. Similarly, when developing for binary components (e.g. custom fuzzers, exploit code, debuggers, tracing, …), it can be advantageous to deal with an emulated representation of the underlying system instead of a physical device. There are different reasons for this, including availability of physical devices, development time, complexity, but also costs.

Comparable scenarios are not only present in the software security world, but equally apply to other engineering roles such as the development of driver code, test code development, bug presence testing (e.g. for release gating) or even bring-up development of embedded OS code.

Why LuaQEMU?

Such an emulation environment shall provide capabilities for rapid prototyping on multiple CPU architectures, while providing a flexible API for interacting with binary code, including full system emulation, emulation of specific System on a Chip (SoC) solutions, and support for peripherals. It is probably best described as a mixture between a system emulator, a debugger, and a dynamic binary instrumentation framework. Historically, we have been missing a solution that perfectly fits our needs in practice. As a result, today we introduce LuaQEMU.

LuaQEMU is a QEMU-based framework exposing several QEMU-internal APIs to a LuaJIT core injected into QEMU itself. Among other things, this allows fast prototyping of target systems in Lua with minimal effort and without any native code.

When initially evaluating the idea of LuaQEMU, we had the following specific functional requirements:

Mature multi-architecture support

Full-system emulation support, including drivers and peripherals, MMU, interrupts, and timers

Ease of long-term maintainability (i.e. little to no QEMU core modifications)

Easy target prototyping (e.g. definition of specific boards) without native code

The first two properties are almost provided by QEMU out of the box. We gain flexibility for target prototyping by being able to write board definitions completely in Lua without native code. We have implemented this such that each hardware architecture comes with a newly introduced Lua board, which can be used to interact with and source from other native board definitions, without requiring modifications to QEMU core code. For the time being we focused on ARM support here, but this approach can be easily transferred to other architectures supported by QEMU.

Our API requirements are simple:

KISS Debugging API:
    Exposure and manipulation of CPU context such as registers
    Control flow tracing (breakpoints, watchpoints)
    Memory read/write

Trapping of operations on memory regions (e.g. for drivers)

Scripted mapping of memory (e.g. for loaders)

We have specifically chosen the API to be simple, as this makes it easy to build more powerful features on top using scripting capabilities. This includes custom stubbing/hooking of code, as we will see in the remainder of this article.
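To make the idea of trapping operations on memory regions more concrete, here is a minimal pure-Lua model. It does not use LuaQEMU's API: the names `new_bus`, `trap`, and `read`, as well as the address used, are hypothetical and only illustrate how reads in an address range could be routed to a script callback, for instance for a status register that boot code polls.

```lua
-- Toy model of "trapping operations on memory regions".
-- Everything here is hypothetical and independent of LuaQEMU.
local function new_bus()
    local bus = { traps = {}, mem = {} }

    -- Register a callback for reads in [start, start + size).
    function bus.trap(start, size, on_read)
        table.insert(bus.traps, { start = start, size = size, on_read = on_read })
    end

    -- Read one value; trapped ranges take priority over plain memory.
    function bus.read(addr)
        for _, t in ipairs(bus.traps) do
            if addr >= t.start and addr < t.start + t.size then
                return t.on_read(addr - t.start)
            end
        end
        return bus.mem[addr] or 0
    end

    return bus
end

-- Emulate a status register (arbitrary address) that always reports "ready".
local bus = new_bus()
bus.trap(0x18000000, 0x1000, function(offset)
    if offset == 0x4 then return 1 end
    return 0
end)

print(string.format("status = %d", bus.read(0x18000004)))
```

A driver stub built this way only has to answer the handful of reads the firmware actually performs, which keeps the amount of required reverse engineering small.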

Why QEMU?

When it comes to the emulation features themselves, QEMU is an obvious first choice. First, it is the only available free/open source software for full-system emulation and virtualization with an extensive list of supported architectures besides x86/x64 (ARM/AArch64, PPC, MIPS, TriCore, Xtensa, …).

Moreover, it supports different versions/compatibility levels of these as well. Admittedly, there are (important) gaps. For example, it currently does not support PowerPC VLE, which is common among certain automotive software solutions. However, the QEMU project itself is very active (over 7000 commits in 2016), constantly extended, and reasonably easy to extend in general. More importantly, for common architectures, the code is considerably mature and tested well enough to rely on it in practice.

It also provides great runtime performance compared to other solutions due to its binary translation and caching of instructions via its Translation Blocks and Tiny Code Generator (TCG). Lastly, it is coupled with other features that prove to be quite useful in practice such as snapshotting, a monitor, and gdb debug stubs.

The Unicorn Engine is one of the tools that took advantage of this as well and greatly helped reverse engineers over the last two years by providing APIs around QEMU’s capabilities. Its goals differ slightly from those of LuaQEMU, however (i.e. it lacks a strong focus on full-system emulation and makes different design choices).

Why LuaJIT?

Technically there is no scripting support required to interact with QEMU’s emulation. The reason for this is its built-in gdb stub server. Our early experiments in this field were in fact conducted using QEMU and gdb scripts to achieve certain functionality. However, this does not scale, is not great to maintain especially for larger scripts, and more importantly, is also very slow. So instead, we opted for adding scripting support directly into QEMU itself.

When injecting scripting capabilities into processes, there are a number of options (e.g. those used by frida.re), and one could write a blog post series on the pros and cons of these alone.

While Lua definitely has its quirks and it can be somewhat weird to use its C API (mostly due to its stack-based nature), it has a clean syntax, is very simple, well-documented, reasonably small, equally useful as a language for configuration and imperative programming, and, more importantly, allows us to express anything we consider important for the time being. With the existence of LuaJIT, which is what we are actually making use of, there also exists a solution that comes with great performance thanks to its Just in Time compilation (JIT) and a very powerful FFI API for interacting with native code.

Depending on long-term experiences, we may reconsider this decision, but so far we are fairly happy with LuaJIT as our choice.

Example - Broadcom HNDRTE WiFi Stack

If you look at market shares of WiFi chips used in mobile devices, the Broadcom WiFi stack is the juiciest target to investigate for security vulnerabilities: Broadcom’s BT/WiFi combo chips are used by the majority of flagship Android devices as well as every single iPhone and iPad. Based on publicly available headers, the name of the operating system underlying this stack is the HND Run Time Environment, where HND apparently is an acronym for Broadcom’s Home Networking Division.

Background

Earlier in April, Gal Beniamini of Google’s Project Zero published his excellent research on exploiting Broadcom’s WiFi stack with over-the-air frames. Throughout the course of his research, he found several critical issues allowing an attacker to remotely compromise the WiFi SoC (ultimately leading to application processor compromise as well). We highly recommend his blog post to the interested reader. As we have been eyeballing the BCM WiFi stack as well, we set out to see if we can use LuaQEMU to reproduce one of the issues in a lab environment, without having to fiddle with over-the-air messages. We will use this as an example to highlight how to test this vulnerability using LuaQEMU and some of the challenges with such an approach.

The vulnerability in question makes a good example of why more involved or full-system emulation is useful. Specifically, as this is a heap buffer overflow, a functional and emulated heap implementation (working malloc/free) is required. Of course one could also abstract away this detail by providing wrappers for certain functions, but in order to reproduce this issue in a non-staged environment we want the heap to behave as on the real device, without fully understanding and reimplementing its internal logic. The heap in turn is initialized during the device start-up routines, which in turn requires emulating boot code as well.

Moreover, we want the capability to inject other types of WiFi frames for vulnerability testing, i.e. traverse the entire WiFi handling code instead of merely hooking the vulnerable function. We have chosen a BCM4358 firmware on a Samsung Galaxy S6 device as a target (MMB29K.G920FXXU4DPGU).

Goals

Our specific goals are:

Trigger TDLS Setup Confirm buffer overflow using LuaQEMU

Emulate system boot-up for (at least) functional stack/heap

Simulate WiFi receive path to inject arbitrary frames

Of course, to reach this, we will not get around additional manual reverse engineering work.

System Startup

To emulate the system start-up and define a QEMU board for our target, we first need to identify the target architecture. Studying literature and evaluating binary patterns, it is obvious that the main WiFi baseband code runs on an ARM core. In the space of ARM-powered cellular basebands, designs based on Cortex-R* and Cortex-M* cores are fairly common. The excellent Nexmon research and Google’s blog post suggest that Broadcom is making use of Cortex-R4 cores. We believe this is accurate even though, for unknown reasons, QEMU does not officially support the Cortex-R4. An unofficial patch exists, but it suffers from not defining the core's coprocessor registers, which is why some MRC instructions result in undefined instruction exceptions (the BTCM configuration MRC/MCR, to be specific). As a result, we used “cortex-r5”, which is compatible and does not require a patch, as the target CPU.

Broadcom does not widely distribute complete firmware files. Instead, mobile devices usually contain patch ram files on flash, which are later loaded to and executed in memory. It is worth noting that there are no cryptographic checks performed on these as far as we can tell so that they are useful for injecting malicious functionality into the WiFi SoC firmware at runtime or dumping ROM.

However, as previous research points out, Broadcom kindly enough provides dhdutil to read/write memory.

As the ROM is known to start at address zero and to have a fixed size of 0x180000, we can utilize this tool to dump ROM code on Android devices to subsequently attempt to emulate and reverse engineer it (dhdutil -i wlan0 membytes -r 0x0 0x180000). The Nexmon work suggests that right after ROM memory, the BCMDHD driver/patchram starts at 0x180000.

So to summarize, we know the following system parameters:

ROM start address: 0x0

RAM start address: 0x180000

ROM code execution starting at base

ARM Cortex-R4

These bits can be verified using static reverse engineering of the ROM and patch RAM code.

Board Initialization

We have mentioned that we want flexible board definitions using Lua without writing native code. Let us see how a board definition with LuaQEMU looks – the following is a minimal definition for the board that we are using to emulate the BCM WiFi stack:

require('hw.arm.luaqemu')

machine_cpu = 'cortex-r5'

memory_regions = {
    region_rom = {
        name = 'mem_rom',
        start = 0x0,
        size = 0x180000
    },
    region_ram = {
        name = 'mem_ram',
        start = 0x180000,
        size = 0xC0000
    },
}

file_mappings = {
    main_rom = {
        name = 'examples/bcm4358/bcm4358.rom.bin',
        start = 0x0,
        size = 0x180000
    },
    main_ram = {
        name = 'kernel',
        start = 0x180000,
        size = 0xC0000
    }
}

cpu = {
    env = {
        thumb = true,
    },
    reset_pc = 0
}

This is a very simple example and similar functionality can be achieved with stock QEMU, but we use this to illustrate the concept.

The machine_cpu setting configures our target to be a cortex-r5, which as we outlined above is similar enough to the Cortex-R4 to be compatible.

The memory_regions block registers QEMU-internal memory regions. We can use these to map code to memory. This is what happens using the file_mappings configuration entry, which defines a set of files and their respective addresses in memory. LuaQEMU takes care of loading these at the very beginning (however, this can also be done at runtime). The name “kernel” is currently reserved for convenience to map directly to the -kernel command line option of QEMU and we are using that to load the patch RAM code. Technically this is not needed here yet as all the common functionality required by the boot code is contained within ROM (i.e. the heap implementation and various libc functions). An arbitrary number of files can be added here.

The cpu block can be used to initialize the CPU and its registers (not included here), and to define whether to reset and which address QEMU jumps to on a reset.
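Because the board definition is plain Lua data, it can be sanity-checked by ordinary Lua code before the emulator starts. The following is a minimal sketch, assuming tables shaped like the definition above; the helper names `overlaps`, `contains`, and `check_board` are made up for illustration and are not part of LuaQEMU.

```lua
-- Sanity checks for a board definition shaped like the one above.
-- The checker itself is an illustration and not part of LuaQEMU.
local memory_regions = {
    region_rom = { name = 'mem_rom', start = 0x0,      size = 0x180000 },
    region_ram = { name = 'mem_ram', start = 0x180000, size = 0xC0000 },
}

local file_mappings = {
    main_rom = { name = 'examples/bcm4358/bcm4358.rom.bin', start = 0x0, size = 0x180000 },
    main_ram = { name = 'kernel', start = 0x180000, size = 0xC0000 },
}

-- True if [a.start, a.start + a.size) and [b.start, b.start + b.size) intersect.
local function overlaps(a, b)
    return a.start < b.start + b.size and b.start < a.start + a.size
end

-- True if range `inner` lies completely inside range `outer`.
local function contains(outer, inner)
    return inner.start >= outer.start
       and inner.start + inner.size <= outer.start + outer.size
end

local function check_board(regions, mappings)
    local list = {}
    for _, r in pairs(regions) do list[#list + 1] = r end
    -- No two memory regions may overlap.
    for i = 1, #list do
        for j = i + 1, #list do
            if overlaps(list[i], list[j]) then
                return false, list[i].name .. ' overlaps ' .. list[j].name
            end
        end
    end
    -- Every file mapping must fit inside one region.
    for _, m in pairs(mappings) do
        local ok = false
        for _, r in ipairs(list) do
            if contains(r, m) then ok = true end
        end
        if not ok then return false, m.name .. ' is not backed by a region' end
    end
    return true
end

print(check_board(memory_regions, file_mappings))
```

With the values from the text, the ROM ends exactly where the RAM begins (0x0 + 0x180000 = 0x180000), so the two regions are adjacent and do not overlap.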

System Initialization

This is of course not sufficient to get to a functional state of the WiFi stack, let alone the initialization of the heap and other important data structures.

We roughly expect the BCM WiFi to set up stack and heap, configure interrupts and trap handlers, configure caches, configure memory protections (MPU on the Cortex-R4), potentially configure NVRAM, configure DMA/PCIe, initialize the internal WiFi interface, etc. This remains an estimate, however, as it is our intention to manually reverse engineer as little as needed in order to reach the main WiFi handling code.

The following picture is a block diagram from a Cypress document, annotated in red with the function of some of the components:

The WiFi antenna on the bottom right is used to send and receive WiFi frames. As a typical WiFi adapter has just one antenna, a diplexer is used to split/combine two signals (RX and TX) into one. The signal is then (after analog/digital conversion) processed by DSP cores forming the physical WiFi layer. For these parts, realtime functionality is important. The DOT11MAC (D11) oversees this functionality by receiving the actual signal and taking care of acknowledging frames and more.

The D11 core and the Cortex-R4 core communicate through DMA operations. The R4 core continuously dequeues packets from a shared FIFO before processing actual WiFi Layer2/3 data (more on this later). On an architectural level, the frames from the D11 core are not raw WiFi frames yet, but are encapsulated and contain a physical frame header. More details about this can be found in the Nexmon work, Cypress datasheets, and SoftMAC kernel drivers.

The important part to understand here is that between the boot initialization code and the actual WiFi frame reception code, there are still quite a few additional components with unknown functionality involved. Keep in mind we want to reverse engineer as little as possible from this in order to simulate the reception of WiFi frames.

So one approach to do that is to assume that the Cortex-R4 core actually needs to know very little about the other components that are part of this picture. In fact, after passing all of the boot code, the code simply waits for an interrupt to appear, which causes it to dequeue a frame from the D11 core, inspect the physical D11 frame header, and start processing the packet via the wlc_dpc function. This is relatively straightforward to identify either by looking for wlc strings or by following the interrupt handler.

So while we do not know what functionality exactly comprises the system start-up, manual reverse engineering gives us a rough idea what code we need to hit in order to process frames. As a result, if we assume that the vast majority of the code on the way to the receive path is irrelevant for the actual frame reception and we know an effective address we want to reach, we can try to stumble our way through the code while skipping as little relevant functionality as possible.

Trial & Error

To reach our goal, we need to skip irrelevant code and run through important code. LuaQEMU cannot magically solve this problem and manual reverse engineering is definitely needed at this point to get a rough idea what code paths we definitely have to cross (e.g. heap initialization in our case) and to get an idea what an error path looks like.

In this particular case, we have either panic/trap handlers indicating a problem or infinite loops caused by the system being in an unexpected state.

For example, the above code reads from the backplane addresses at start-up and expects specific values. It keeps doing that until the value is found. For certain values it immediately jumps into an endless loop. Manually reverse engineering these parts is fairly time-intensive and so is manual execution. This, however, is where LuaQEMU can assist us as it is aware of internal CPU states.

For this, we introduced a Lua callback that is invoked once a number of instructions (actually Translation Blocks, but the QEMU details here are out of scope of this article) have executed without changing the CPU state internally. This heuristic simply records a window of executions, produces a CPU state hash and, every time this hash is already known, increments a counter until the threshold is hit. Once it is hit, we can be reasonably sure that we are in a stuck state. When we are, we manually dissect the respective code to see if we can simply skip it (oftentimes the results of this will only become clear later) or if it has to work. The procedure here is somewhat trial and error and we have plans for automating it in parts, but we are not quite there yet.
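The heuristic can be modeled in a few lines of plain Lua. This is only a sketch of the idea described above and not LuaQEMU's actual implementation: the hash function, the way the window is reset, and the names `state_hash` and `new_stuck_detector` are assumptions made for illustration.

```lua
-- Toy model of the stuck-state heuristic; not LuaQEMU's implementation.
local function state_hash(regs)
    -- Simple polynomial hash over register values (illustrative only).
    local h = 17
    for _, v in ipairs(regs) do
        h = (h * 31 + v) % 4294967296
    end
    return h
end

-- Returns an observer to be called once per executed block.
local function new_stuck_detector(stuck_max, stuck_cb)
    local seen, repeats = {}, 0
    return function(regs)
        local h = state_hash(regs)
        if seen[h] then
            repeats = repeats + 1
            if repeats >= stuck_max then
                stuck_cb(regs)
                seen, repeats = {}, 0   -- start a fresh window
                return true
            end
        else
            seen[h] = true
        end
        return false
    end
end

-- A busy loop alternating between two CPU states trips the detector.
local fired = 0
local observe = new_stuck_detector(5, function(regs)
    fired = fired + 1
    print(string.format("stuck around 0x%x", regs[16]))  -- index 16 holds the pc here
end)

for i = 1, 20 do
    local pc = (i % 2 == 0) and 0x1000 or 0x1002
    observe({ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, pc })
end
```

A loop that makes real progress (for example one that decrements a counter register) produces a new hash on every iteration and therefore never reaches the threshold, which is what makes the heuristic usable for spotting polling loops.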

Following is an example of what this looks like in practice.

We define a threshold as part of the cpu initialization:

cpu = {
    env = {
        stuck_max = 200000,
        stuck_cb = lua_stuck_cb,
        ...
    }
}

Our Lua callback then continues to simply dump the CPU registers:

function lua_stuck_cb()
    C.printf("CPU is stuck around 0x%x\n", lua_get_pc())
    local rregs = lua_get_all_registers()
    for idx, val in ipairs(rregs) do
        C.printf("r%d\t0x%x\n", ffi.new('int', idx-1), val); -- indices start with 1 in Lua
    end
end

This allows us to quickly notice dead ends and attempt to resolve these. Skipping a handful of such stuck locations and making minor modifications to register contents and memory at relevant places is already enough to get to a functional heap state, which can be easily verified by hooking malloc/free functionality.
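As an illustration of what hooking malloc/free can give us, the following pure-Lua sketch tracks live allocations. It is a hypothetical helper: the wiring of `on_malloc` and `on_free` to actual breakpoints in the emulator is omitted, and the addresses used in the example are arbitrary.

```lua
-- Toy allocation tracker such as malloc/free hooks could feed.
-- Hypothetical helper; the hook wiring to the emulator is omitted.
local function new_heap_tracker()
    local live = {}
    local stats = { allocs = 0, frees = 0, bad_frees = 0 }
    local t = { stats = stats }

    function t.on_malloc(ptr, size)
        if ptr ~= 0 then            -- a NULL result means the allocation failed
            live[ptr] = size
            stats.allocs = stats.allocs + 1
        end
    end

    function t.on_free(ptr)
        if live[ptr] then
            live[ptr] = nil
            stats.frees = stats.frees + 1
        else
            stats.bad_frees = stats.bad_frees + 1  -- double or invalid free
        end
    end

    function t.live_bytes()
        local total = 0
        for _, size in pairs(live) do total = total + size end
        return total
    end

    return t
end

local heap = new_heap_tracker()
heap.on_malloc(0x1E4000, 64)
heap.on_malloc(0x1E4100, 128)
heap.on_free(0x1E4000)
heap.on_free(0x1E4000)  -- second free of the same pointer is flagged
print(heap.live_bytes(), heap.stats.bad_frees)
```

If allocations return plausible, non-overlapping pointers and frees match earlier allocations, that is a good indication that the heap initialization code ran as intended.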

At this point we also already hit other interesting parts of the code base. One of these areas is the setup of memory protections, which in turn is interesting for exploit mitigations such as XN. As the init code crosses protection code, we can already use that to dump its configuration on the fly.

In summary, this approach allows us to get the overall system APIs into a functional state. However, we failed to fully emulate the PCIe device (i.e. the internal WiFi interface). As a result, we decided to skip this code entirely as part of the boot-up and focus on emulating required data on the actual receive path.

MPU configuration

The Memory Protection Unit (MPU) used in BCM4358 is an important part during the system boot-up as it defines used memory regions and permissions, which is also interesting for exploitation. The respective code to do so can quickly be identified statically by searching for MCR instructions using CP15 and opcode 6. While it is possible to dump this configuration at runtime (as Gal has shown), we want to trace the initialization itself. As a side effect, this allows us to quickly dump the configuration in emulation and therefore track potential mitigation changes introduced by Broadcom in the future. Additionally, we want to do this without reverse engineering the concrete logic of BCM’s MPU configuration.

Dumping the configuration in LuaQEMU is trivial as all that is needed is a Lua callback to dump the values (useful helper) and a breakpoint at the respective instructions:

function bp_dump_mpu()
    -- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0363e/Biiijafh.html
    local mpu_size_map = {
        [19] = '1 MB',
        [20] = '2 MB',
        [21] = '4 MB',
        [22] = '8 MB',
        [23] = '16 MB',
        [24] = '32 MB',
        [25] = '64 MB',
        [26] = '128 MB',
        [27] = '256 MB',
        [28] = '512 MB',
        [29] = '1 GB',
        [30] = '2 GB',
        [31] = '4 GB'
    }
    local mpu_access_map = {
        [0] = 'priv: no access; user: no access',
        [1] = 'priv: read/write; user: no access',
        [2] = 'priv: read/write; user: read-only',
        [3] = 'priv: read/write; user: read/write',
        [4] = 'reserved',
        [5] = 'priv: read only; user: no access',
        [6] = 'priv/user: read-only'
    }
    local bit = require("bit")

    local r0 = lua_get_register(0) -- Write MPU Memory Region Number Register
    local r1 = lua_get_register(1) -- Write Data MPU Region Size and Enable Register
    local r2 = lua_get_register(2) -- Write MPU Region Base Address Register
    local r3 = lua_get_register(3) -- Write Region access control Register

    local size_idx = bit.band(bit.arshift(tonumber(r1), 1), 0x1f)
    local access_idx = bit.band(bit.arshift(tonumber(r3), 8), 0x7)
    local xn = bit.band(bit.arshift(tonumber(r3), 12), 0x1)
    local region_size = mpu_size_map[size_idx]
    local access_bits = mpu_access_map[access_idx]

    C.printf("Region %d: 0x%x (access: %s; size: %s; XN: %d)\n", r0, r2, access_bits, region_size, xn)
    lua_continue()
end

Resulting in the following output at boot-up:

Region 0: 0x0 (priv: read/write; user: read/write; size: 256 MB; XN: 0)
Region 1: 0x10000000 (priv: read/write; user: read/write; size: 256 MB; XN: 0)
Region 2: 0x20000000 (priv: read/write; user: read/write; size: 512 MB; XN: 0)
Region 3: 0x40000000 (priv: read/write; user: read/write; size: 1 GB; XN: 0)
Region 4: 0x80000000 (priv: read/write; user: read/write; size: 2 GB; XN: 0)

As we can see, this is the same memory configuration as found by Project Zero, i.e. all regions are rwx and XN is not used. It seems likely that there is little deviation here between WiFi SoCs. It will be interesting to reuse this portion on newer image releases to see any change by Broadcom in this space.
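The lookup table in the callback encodes a simple rule: a size field of N describes a region of 2^(N+1) bytes, so 27 corresponds to 2^28 bytes, i.e. 256 MB. The following standalone sketch decodes the same fields with plain arithmetic so that it also runs outside LuaJIT. The helper names `bits` and `decode_region` are made up, and the register values in the example are constructed by hand rather than taken from the firmware.

```lua
-- Decode Cortex-R4 style MPU region registers using plain arithmetic,
-- so it runs on stock Lua as well as LuaJIT. Helper names are made up.
local function bits(value, shift, width)
    return math.floor(value / 2 ^ shift) % 2 ^ width
end

local function decode_region(size_reg, access_reg)
    local n = bits(size_reg, 1, 5)            -- size field, bits [5:1]
    return {
        enabled = bits(size_reg, 0, 1) == 1,  -- enable bit, bit 0
        size_bytes = 2 ^ (n + 1),             -- region spans 2^(N+1) bytes
        ap = bits(access_reg, 8, 3),          -- access permissions, bits [10:8]
        xn = bits(access_reg, 12, 1),         -- execute never, bit 12
    }
end

-- Size field 27 with the enable bit set, AP = 3 (full access), XN clear.
local region = decode_region(27 * 2 + 1, 3 * 256)
print(region.size_bytes / (1024 * 1024), region.ap, region.xn)
```

Computing the size instead of looking it up also covers the smaller region sizes that the table in the callback does not list.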

WiFi Receive Path

Going back to our original goal of emulating the WiFi receive path, it is important to have a better understanding of the underlying code. Based on the SoftMAC implementation and manual reverse engineering (as pointed out by others already, the firmware is fairly verbose in terms of debug strings) we know that the frame reception functionality starts off at a function called wlc_bmac_recv. The frame data is pulled by working with dma_rx, which pulls the frames from the aforementioned FIFO and subsequently processes each frame by calling wlc_recv.

wlc_recv is where WiFi frame data is actually processed for the first time and bytes are parsed. This receive path is triggered from an interrupt context through a service routine handler. When simulating the WiFi frame reception, we could either simulate interrupts directly or, as the easier option, simply instrument this receive path directly and bypass interrupt handling. We have chosen to try the latter as we are not interested in the earlier low-level processing.

Similarly to our earlier approach during system startup, we define an error condition we would like to prevent. Next, we stumble our way through the reception functionality while attempting to not hit that error condition. In this case, every time an error is observed by the code, a handler to free the WiFi packet (we call it packet_free) will be called and we return to the interrupt loop. We do this as long as we have not hit the receive paths we are interested in, i.e. the data and control frame handlers or, more specifically, the vulnerable wlc_tdls_cal_mic_chk handler.

Once the emulator is stuck waiting for an interrupt, we modify the control flow to directly call wlc_recv with a packet of our choice.

Global WiFi States

Before any frame parsing routine does its job, there are several problems to our emulation approach.

First, wlc_recv receives two arguments, one being the packet payload, including a dot11 receive header, and one being a pointer to a wlc_info structure. This structure is utilized throughout the entire receive path and contains various pointers to other data structures as well.

An excerpt from the version defined in the SoftMAC driver:

struct wlc_info {
    struct wlc_pub *pub;    /* pointer to wlc public state */
    struct osl_info *osh;   /* pointer to os handle */
    struct wl_info *wl;     /* pointer to os-specific private state */
    d11regs_t *regs;        /* pointer to device registers */
    ...
    bool device_present;    /* (removable) device is present */
    ...
    wlc_bsscfg_t *cfg;      /* the primary bsscfg (can be AP or STA) */
    ...
    uint hwrxoff;
    ...

The structure is fairly large and complex and also likely not the same within the BCM WiFi firmware. Therefore, we cannot simply craft a copy by hand. Its content cannot be ignored either.

For example, the osh handle contains the function pointer that will be used by the identified packet_free function. The hwrxoff will be used to determine offsets to raw frame data. The structure additionally contains information about associated WiFi access points. It also contains its own hardware address (i.e. MAC address), which is used in several places in order to determine whether a packet is directed at the adapter or not (also important in the context of monitor mode).

The structure is also important due to the nature of the vulnerability, i.e. the processing of a SETUP confirmation message. The WiFi stack utilizes this structure to keep track of whether a SETUP request has been sent before processing a confirmation message. Virtually all states and WiFi configuration settings can be accessed through this structure. It quickly became clear that ignoring the contents of this structure gets us into more trouble. As a result, we first need a copy of it and then decide which parts of the code have to be patched to account for potentially undesired values and states.

For replaying valid TDLS frames, we took one of the sample PCAPs kindly provided by the author of wpa_supplicant.

Inspecting State and Packet Data at Runtime

As dhdutil allows us to read and write WiFi SoC memory, we decided to write a short assembly stub that dumps the memory behind the parameters passed to wlc_recv so we can later reuse it during our emulation. We also used this approach to better understand the structure of raw packet data as it is not the same as in the SoftMAC driver.

As Gal also wrote, a larger chunk of the system RAM is going to be reclaimed for the heap after initialization. We abused a static location within this space as a buffer to dump the memory we are interested in (we could have used malloc as well).

Following is an assembly stub that we used for inspecting packet data:

.syntax unified // thumb2
.global _start
_start:
.thumb
    mov r2, #200        // length
    mov r1, r5          // r5 is a pointer to p
    ldr r0, .scratch    // dest ptr
    ldr r3, .memcpy
    blx r3
    mov r2, #200        // length
    mov r1, r6          // r6 is a pointer to wrxh
    ldr r0, .scratch2   // dest ptr
    ldr r3, .memcpy
    blx r3
    ldr r3, .cont
    bx r3
.memcpy:
    .word 0x000035F9
.scratch:
    .word 0x1E3410
.scratch2:
    .word 0x1e3510
.cont:
    .word 0x0019B749

This can be directly written to memory using dhdutil -i wlan0 membytes -h 0x0019B710 ... . This patches a part of wlc_recv that evaluates the monitor mode setting, i.e. irrelevant code during normal operations. As you can see, unless monitor mode is used, this code is dead, so it is enough for our purposes to overwrite it (also, as outlined before, the MPU does not prevent that).

The memory copied can then be read using dhdutil by reading from our scratch locations. This significantly helped speed up manual RE efforts and allowed us to understand the content of relevant structures at runtime.

Due to the heavily nested structure of wlc_info, we did not use this approach to dump wlc_info and stitch it back together manually. Instead, we opted for taking a ramdump, which we then attach dynamically at runtime to our emulation. This can be done using dhdutil coredump and skipping 0x146 bytes that form some sort of header.

This way we can obtain a fully initialized wlc_info structure without fully understanding its contents. Since the heap lives in that space, it is important to note that this gets the content of our functional heap into a somewhat different (but functional) state. However, in practice this was no issue for us.
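Preparing such a dump for loading amounts to cutting off the header and keeping the rest. Below is a minimal sketch under the assumptions stated in the text (a 0x146 byte header, RAM starting at 0x180000). The helper `strip_header` is hypothetical and the dump in the example is synthetic.

```lua
-- Strip a fixed-size header from a dump and report where it would be loaded.
-- HEADER_LEN and RAM_START come from the text; the helper is hypothetical.
local HEADER_LEN = 0x146
local RAM_START  = 0x180000

local function strip_header(dump)
    if #dump <= HEADER_LEN then
        return nil, "dump is shorter than the expected header"
    end
    return dump:sub(HEADER_LEN + 1)   -- Lua strings are 1-indexed
end

-- Synthetic stand-in for a real dump: header bytes followed by RAM contents.
local fake_dump = string.rep("H", HEADER_LEN) .. "RAMDATA"
local ram = assert(strip_header(fake_dump))
print(string.format("would load %d bytes at 0x%x", #ram, RAM_START))
```

Checking the length first avoids silently loading an empty buffer when the dump file is truncated.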

Calling wlc_recv

At this point we already have enough functionality to call wlc_recv with a raw packet. Following is the Lua code we used for this:

function call_wlc_recvdata_tdls()
    local wlc_recv = 0x0019B698
    local wlc_info = 0x001ff418 -- this is allocated very early on and at a static location; wl_info lives at 0x22DDB0
    local packet_p = 0x1E3410 -- scratch space TODO: use malloc here instead
    local packet_raw = 0x1E3510 -- scratch space
    local frame_len = 229

    -- load ramdump
    local buf = read_file('./examples/bcm4358/bcmdhd_sta_samsung_sm-g920i-ucode963.patchram.bin')
    lua_write_memory(0x180000, buf, #buf)

    lua_write_byte(packet_p + 0, 0x1); -- unkn
    lua_write_byte(packet_p + 1, 0x0); -- unkn
    lua_write_byte(packet_p + 2, 0x1); -- refcnt
    lua_write_byte(packet_p + 3, 0x1); -- alloc status
    lua_write_byte(packet_p + 4, 0x0); -- unkn
    lua_write_byte(packet_p + 5, 0x0); -- unkn
    lua_write_byte(packet_p + 6, 0x0); -- unkn
    lua_write_byte(packet_p + 7, 0x0); -- unkn
    lua_write_dword(packet_p + 8, packet_raw-0x28); -- data pointer (-0x28 to account for wrxh)
                                                    -- this works well enough, because this area is nulled
    lua_write_word(packet_p + 0xc, frame_len); -- frame length
    lua_write_word(packet_p + 0xe, 0x0) -- unkn

    local buf = read_file('./examples/bcm4358/packets/data/tlds/1.tdls_setup-conf.raw')
    lua_write_memory(packet_raw, buf, #buf)

    hex_dump(lua_read_mem(packet_p, 0xf), packet_p)
    print("------")
    hex_dump(lua_read_mem(packet_raw, #buf), packet_raw)

    print("call_wlc_recv()")
    lua_set_register(0, wlc_info)
    lua_set_register(1, packet_p)
    lua_set_register(14, 0x181BA7) -- endless loop

    lua_set_pc(wlc_recv)
end

As can be seen, we dynamically load a ramdump into memory, reuse the aforementioned scratch space to store our TDLS packet, and adjust a few header bytes in memory to match the values expected by wlc_recv, before finally redirecting control flow to wlc_recv().
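The layout of the small packet descriptor is easier to see when it is assembled in one place. The sketch below builds the same 16 bytes as a Lua string instead of writing them into emulated memory. The field widths are inferred from the offsets used in the function above, little-endian byte order is an assumption of this sketch, and the helpers `le` and `pack_descriptor` are made up.

```lua
-- Build the 16-byte packet descriptor written by the lua_write_* calls above
-- as a Lua string. Little-endian byte order is an assumption of this sketch.
local function le(value, nbytes)
    local out = {}
    for i = 1, nbytes do
        out[i] = string.char(value % 256)   -- emit lowest byte first
        value = math.floor(value / 256)
    end
    return table.concat(out)
end

local function pack_descriptor(data_ptr, frame_len)
    return table.concat({
        le(0x1, 1), le(0x0, 1),   -- +0x0, +0x1: unknown
        le(0x1, 1),               -- +0x2: refcnt
        le(0x1, 1),               -- +0x3: alloc status
        le(0x0, 4),               -- +0x4..+0x7: unknown
        le(data_ptr, 4),          -- +0x8: data pointer
        le(frame_len, 2),         -- +0xc: frame length
        le(0x0, 2),               -- +0xe: unknown
    })
end

local packet_raw = 0x1E3510
local desc = pack_descriptor(packet_raw - 0x28, 229)
print(#desc, string.format("0x%x", packet_raw - 0x28))
```

Keeping the layout in one function makes it straightforward to generate descriptors for other frames by changing only the data pointer and the frame length.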

Receive Path Problems

Stumbling our way through the receive path means that there are still a few hiccups that end up freeing the packet and that we need to address. Following is a list of issues that we observed and subsequently patched.

A function called wlc_recvfilter is used to determine whether a packet is tossed or not based on authentication state and class of the frame. We entirely bypass this function.

A few places perform checks based on a bsscfg and we skip these as well.

wlc_recv_data checks whether a packet is directed towards itself by comparing the MAC addresses. We skip these checks as well so that our raw packet does not have to match our emulated adapter.

The TDLS implementation performs checks that Gal also described in his blog post, but that are not relevant from a parsing/security perspective. Namely, the code verifies the Link-ID IE contained in the packet, evaluates the BSSID, and verifies the “Snonce” value by comparing it to a stored value.

This process is similar to skipping stuck code paths. Since we know that we want to hit relevant TDLS handlers while not hitting packet_free, we can selectively disable checks with minimal additional manual reverse engineering while attempting to get as far as possible. Based on our experience also from emulating different targets, this works surprisingly well in practice.

How are we skipping code with LuaQEMU? We simply use breakpoints and adjust CPU registers. Breakpoints can be initialized at runtime (using lua_breakpoint_insert) or at initialization:

breakpoints = {
    [0x0000374C] = bp_log,
    [0x0018A9F0] = bp_pkt_free,
    [0x00021818] = bp_wlc_recvfilter,
    ...
}
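To show the mechanics of skipping a check with a breakpoint and a register edit, here is a toy model in plain Lua. It deliberately avoids LuaQEMU's functions: the `cpu` table, `force_return`, `step`, and all addresses are hypothetical stand-ins, and details such as the Thumb bit are ignored.

```lua
-- Toy CPU with a breakpoint table, modelling "skip a check by editing
-- registers at a breakpoint". All names are hypothetical, not LuaQEMU's.
local cpu = { regs = {}, pc = 0 }
for i = 0, 15 do cpu.regs[i] = 0 end

-- Handler: pretend the function returned `value` and resume at the caller.
local function force_return(value)
    return function(c)
        c.regs[0] = value        -- r0 carries the return value on ARM
        c.pc = c.regs[14]        -- r14 (lr) holds the return address
    end
end

local breakpoints = {
    [0x2000] = force_return(1),  -- e.g. a filter that should always accept
}

-- Called before executing the instruction at cpu.pc.
local function step(c)
    local handler = breakpoints[c.pc]
    if handler then handler(c) end
end

-- Simulate a call into the filtered function with a return address of 0x1008.
cpu.regs[14] = 0x1008
cpu.pc = 0x2000
step(cpu)
print(string.format("pc=0x%x r0=%d", cpu.pc, cpu.regs[0]))
```

Placing the breakpoint on the first instruction of the function means its body never executes, so none of its side effects on global state occur either. Whether that is acceptable has to be decided per function.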

Overall, surprisingly few changes are required to make this approach work. In total, we patched 17 locations to get through the relevant boot code and enable reception of control and data frames. We additionally patched 7 locations that are part of the TDLS receive path. With more accurate wlc_info content, fewer may be required.

Triggering the issue then requires merely one more code patch to adjust some values of our raw sample packet in memory:

-- see comments below explaining this
function bp_fix_RSN_ie_len_in_mem()
  print("RSN IE len hexdump...")
  hex_dump(lua_read_mem(RSN_ie_len_ptr, 0x10), 0)
  lua_write_byte(RSN_ie_len_ptr, 0x14) -- original length from sample pcap
  hex_dump(lua_read_mem(RSN_ie_len_ptr, 0x10), 0)
  lua_continue()
end

function bp_change_tdls_rsn_ie_len()
  -- If we simply modify r2 here, we won't trigger the overflow, because the value
  -- is again fetched at 0007A8CA from the ie ptr. This became clear after tracing the memcpy offsets
  -- as well. Instead, the below code modifies the memory structure directly. There is one issue with this,
  -- namely that the copied bytes are used as an offset again to determine the interval and FT IE location.
  -- If we do that, we also influence how bcm_parse_tlvs() works though as it keeps iterating over TLVs by
  -- adding the length of the previous IE to find the next one. That means we need to craft the interval and FT
  -- IEs in the TLV buffer again at the right offsets and also adjust the tlv buffer length again as now
  -- bcm_parse_tlvs() has to scan much further. Instead of doing that, we make sure that after the memcpy
  -- of the RSN IE happened, we write back the original length in the TLV buffer. This way we also don't corrupt
  -- the src heap chunk.
  RSN_ie_len_ptr = lua_get_register(0) + 1 -- start of TLV + 1 = len
  lua_write_byte(RSN_ie_len_ptr, 0xff-0x23)
  hex_dump(lua_read_mem(RSN_ie_len_ptr, 0x10), 0)
  lua_breakpoint_insert(0x0007A8D0, bp_fix_RSN_ie_len_in_mem) -- location after memcpy and next bcm_parse_tlvs
  -- if we however modify the structure in memory, we also need to fix the subsequent data
  -- because the length that was copied determines the next offset where the interval IE is expected
  print("heap memory corruption shall commence")
  --lua_continue()
end

Heap Tracing

Next, we wanted to use LuaQEMU to inspect heap allocation states. This is interesting both for exploitation and for noticing heap overflows such as the TDLS one in the first place. Due to the firmware's heap implementation, heap overflows may not be directly visible or otherwise result in crashes. So we would like to use LuaQEMU to trace linear heap out of bounds (OOB) conditions. Something like that would also be useful for fuzzing, which would be one of the applications for this LuaQEMU setup as well. We ran a very simple experiment to see whether LuaQEMU can assist us with this.

To do so, we track all relevant malloc and free calls, by adding breakpoint stubs for them.

function add_malloc_breakpoints()
  -- malloc_1 for now, everything else is a wrapper
  entry_eas = { 0x00181F28 }
  for k,v in pairs(entry_eas) do
    lua_breakpoint_insert(v, malloc_entry_hook)
  end

  exit_eas = { 0x182024 }
  for k,v in pairs(exit_eas) do
    lua_breakpoint_insert(v, malloc_exit_hook)
  end

  -- free
  entry_eas = { 0x0018203C }
  for k,v in pairs(entry_eas) do
    lua_breakpoint_insert(v, free_entry_hook)
  end
end

On each malloc entry, we simply record the allocated size. The function exit is more interesting, as it returns a pointer to the allocated buffer. Now a very simple heap OOB detection just needs to trigger a callback once the allocated area is left during a write. A more complete implementation would track the entire heap region and its allocated chunks and trap on any access outside an allocated chunk. It is possible to implement that, but requires more knowledge about the internals of the heap. So for demonstration purposes, we decided to try a simpler implementation that focuses on detecting linear out of bounds conditions only.

heap_entries = {} -- allocated chunks
bounds_entries = {} -- allocated chunks for access that OOB

function malloc_entry_hook()
  alloc_size = lua_get_register(0)
  lua_continue()
end

function malloc_exit_hook()
  alloc_ptr = tonumber(lua_get_register(0))
  if alloc_ptr == 0 then
    C.printf("malloc returned 0 at 0x%x\n", lua_get_pc());
    lua_continue()
    return
  end

  C.printf("0x%x = malloc(%lld)\n", lua_get_register(0), alloc_size);
  --lua_trapped_physregion_add(alloc_ptr, alloc_size, heap_read, heap_write)
  lua_watchpoint_insert(alloc_ptr, alloc_size, WP_MEM_ACCESS, heap_access)

  local oob_ptr = tonumber(alloc_ptr + alloc_size)
  lua_watchpoint_insert(oob_ptr, 4, WP_MEM_ACCESS, bounds_access)
  bounds_entries[oob_ptr] = alloc_ptr
  heap_entries[alloc_ptr] = alloc_size
  -- the bounds entry is needed so we can remove the watchpoint on a free
  lua_continue()
end

The exit hook computes the location of the dword adjacent to the allocated heap chunk and places a watchpoint on that address. As soon as this watchpoint triggers, we know that a heap OOB write occurred. Note that this works here because the embedded heap implementation is rather simple.

LuaQEMU offers us two ways to trigger on memory access: watchpoints and trap regions. The former is similar to a watchpoint in a debugger triggering on a virtual address, while trap regions trap read and write accesses to physical memory regions. Moreover, the trap region handler can call its own read and write operations. The latter is also useful for emulating drivers or memory-mapped IO ranges, but both are somewhat similar for our purpose.

The allocated pointer, size, and oob_ptr are stored in Lua tables (comparable to Python dictionaries) for management purposes. The free hook makes use of them to remove the inserted watchpoints:

function free_entry_hook()
  free_ptr = tonumber(lua_get_register(0))
  C.printf("free(%x)\n", free_ptr)
  --lua_trapped_physregion_remove(free_ptr, heap_entries[free_ptr])
  if heap_entries[free_ptr] ~= nil then
    local size = heap_entries[free_ptr]
    lua_watchpoint_remove(free_ptr, size, WP_MEM_ACCESS)
    lua_watchpoint_remove(free_ptr + size, 4, WP_MEM_ACCESS)
    -- both tables are keyed by address, so entries are dropped by assigning nil
    -- (table.remove() only operates on array positions)
    bounds_entries[free_ptr + size] = nil -- remove oob ptr
    heap_entries[free_ptr] = nil
  else
    C.printf("%x freed, but we have not seen alloc\n", free_ptr)
  end
  lua_continue()
end
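The bookkeeping shared by the malloc and free hooks can be modelled without the emulator. The sketch below is a simplified stand-in with hypothetical names (new_tracker, track_alloc, owner_of_guard, track_free); it only mirrors the two address-keyed tables and leaves out the watchpoint calls. The sample values are the vulnerable 256-byte buffer that shows up in the log further down.

```lua
-- Hypothetical tracker mirroring heap_entries and bounds_entries.
function new_tracker()
  return {
    sizes = {},   -- alloc_ptr -> size
    guards = {},  -- oob_ptr   -> alloc_ptr
  }
end

-- Record an allocation and return the address of its guard dword.
function track_alloc(t, ptr, size)
  local oob_ptr = ptr + size
  t.sizes[ptr] = size
  t.guards[oob_ptr] = ptr
  return oob_ptr
end

-- Map a guard address back to the chunk it protects.
function owner_of_guard(t, addr)
  local ptr = t.guards[addr]
  if ptr == nil then return nil end
  return ptr, t.sizes[ptr]
end

-- Forget a chunk; returns false if the allocation was never seen.
function track_free(t, ptr)
  local size = t.sizes[ptr]
  if size == nil then return false end
  t.guards[ptr + size] = nil  -- keys are removed by assigning nil
  t.sizes[ptr] = nil
  return true
end

tracker = new_tracker()
guard = track_alloc(tracker, 0x1eb204, 256)
print(string.format("guard dword at 0x%08x", guard))
```

Keeping the reverse mapping from guard address to chunk is what lets the OOB callback name the destination buffer without walking all allocations.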

Now if an OOB access occurs, our bounds_access callback is triggered. Watchpoint callbacks in LuaQEMU receive the address, the length, and the access type as arguments, which we can use for evaluation. One important detail is that free and malloc themselves work on heap metadata and thus touch data that we marked as OOB.

As a result, we filter the memory ranges of malloc and free before indicating an OOB condition.

function bounds_access(args)
  local pc = lua_get_pc()
  local free_start = 0x0018203C
  local free_end = 0x001820A2
  local malloc_start = 0x00181F28
  local malloc_end = 0x182024

  if free_start < pc and free_end > pc then
    lua_continue()
    return
  end

  if malloc_start < pc and malloc_end > pc then
    lua_continue()
    return
  end

  C.printf("linear out of bounds heap access@0x%08llx accessing 0x%08llx (%lld) (%lld)\n", lua_get_pc(), args.addr, args.len, args.flags)
  local aptr = bounds_entries[tonumber(args.addr)]
  local asize = heap_entries[aptr]
  local cdata_aptr = ffi.new('uint32_t', aptr)
  C.printf("destination buffer: 0x%08llx[0x%08llx] (0x%08llx-0x%08llx)!\n", cdata_aptr, asize, cdata_aptr, cdata_aptr + asize - 4);
end
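The program counter filter can also be written as a small table-driven check, which scales better once more allocator wrappers need to be excluded. This is a sketch only: allocator_ranges and in_allocator are hypothetical names, the addresses are the ones used in bounds_access, and the bounds here are inclusive, whereas bounds_access uses strict comparisons. Which variant is right depends on whether the first and last instructions touch heap metadata, which we have not checked.

```lua
-- Address ranges taken from bounds_access(); names are hypothetical.
allocator_ranges = {
  { name = "malloc", first = 0x00181F28, last = 0x00182024 },
  { name = "free",   first = 0x0018203C, last = 0x001820A2 },
}

-- Return the name of the allocator routine containing pc, or nil.
function in_allocator(pc)
  for _, r in ipairs(allocator_ranges) do
    if pc >= r.first and pc <= r.last then
      return r.name
    end
  end
  return nil
end

-- 0x000036aa is the PC reported for the OOB access in the log.
print(in_allocator(0x00181F30), in_allocator(0x000036aa))
```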

Triggering the TDLS Setup Confirmation OOB Write

From system start to triggering the heap overflow, this is roughly the information that LuaQEMU gives us currently:

QEMU 2.8.91 monitor - type 'help' for more information
(qemu) Region 0: 0x0 (priv: read/write; user: read/write; size: 256 MB; XN: 0)
Region 1: 0x10000000 (priv: read/write; user: read/write; size: 256 MB; XN: 0)
Region 2: 0x20000000 (priv: read/write; user: read/write; size: 512 MB; XN: 0)
Region 3: 0x40000000 (priv: read/write; user: read/write; size: 1 GB; XN: 0)
Region 4: 0x80000000 (priv: read/write; user: read/write; size: 2 GB; XN: 0)
...
RTE (PCIE-MSG_BUF) 7.112.41.4 (A3 Station/P2P feature) on BCM7332 r1620263011 @ 37.4/0.0/0.0MHz
...
000000.000 TCAM: 256 used: 204 exceed:0
000000.000 reclaim section 1: Returned 140312 bytes to the heap
WFI loop reached
001E3410 01 00 01 01 00 00 00 00 E8 34 1E 00 E5 00 00 ........<E8>4..<E5>..
------
001E3510 0A 04 E8 06 7E 01 08 01 2C 00 00 03 7F 12 2A 14 ..<E8>.~...,.....*.
001E3520 00 22 68 AC BC BD 1C 4B D6 55 38 BB E0 00 AA AA ."h<AC><BC><BD>.K<D6>U8<BB><E0>.<AA><AA>
001E3530 03 00 00 00 89 0D 02 0C 02 00 00 01 DD 18 00 50 ....<89>.......<DD>..P
001E3540 F2 02 01 01 00 00 02 A4 40 00 27 A4 00 00 42 43 <F2>......<A4>@.'<A4>..BC
001E3550 5E 00 62 32 2F 00 30 14 01 00 00 0F AC 07 01 00 ^.b2/.0.....<AC>...
001E3560 00 0F AC 04 01 00 00 0F AC 07 0C 02 37 52 00 00 ..<AC>.....<AC>...7R..
001E3570 26 BB 51 DB D7 BB 0C AA 42 3B BD DA 0E F3 FB 37 &<BB>Q<DB><U+05FB>.<AA>B;<BD><DA>.<F3><FB>7
001E3580 5A DE 03 3C 44 E6 A5 29 18 B4 1C 91 EE 1E B1 CA Z<DE>.<D<E6><A5>).<B4>.<91><EE>.<B1><CA>
001E3590 2F 52 B3 B4 AB 27 FA D2 E0 48 0B A6 5C 9C 1D 96 /R<B3><B4><AB>'<FA><D2><E0>H.<A6>\<9C>.<96>
001E35A0 85 D7 C4 BB 74 D6 E7 FD 8C 04 20 EC CC C1 DB 37 <85><D7>Ļt<D6><E7><FD><8C>. <EC><CC><C1><DB>7
001E35B0 D7 AE 92 99 ED DC 53 8F 2F 92 84 C0 73 66 7E CB <U+05EE><92><99><ED><DC>S<8F>/<92><84><C0>sf~<CB>
001E35C0 38 05 02 C0 A8 00 00 3D 16 06 08 00 00 00 00 00 8..<C0><A8>..=........
001E35D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 65 ...............e
001E35E0 12 00 03 7F 12 2A 14 00 22 68 AC BC BD 1C 4B D6 .....*.."h<AC><BC><BD>.K<D6>
001E35F0 55 38 BB U8<BB>
call_wlc_recv()
0x1eb30c = malloc(296)
heap access@0x00003728 accessing 0x001eb30c (296) (135) < rwa
heap access@0x00003728 accessing 0x001eb30c (296) (135) < rwa
heap access@0x00003728 accessing 0x001eb30c (296) (135) < rwa
...
memcpy(0x1eb30c, 0xd8, 6)
...
0x1eb204 = malloc(256) < this is going to be our vulnerable buffer
0x1ec048 = malloc(16)
memcpy(0x1eb204, 0x1e35e7, 6)
heap access@0x000036aa accessing 0x001eb204 (256) (135) < rwa
...
memcpy(0x1eb20a, 0x1e35ed, 6)
...
memcpy(0x1eb211, 0x1e35df, 20)
heap access@0x000036aa accessing 0x001eb204 (256) (135) < rwa
...
RSN IE len hexdump...
00000000 DC 01 00 00 0F AC 07 01 00 00 0F AC 04 01 00 00 <DC>....<AC>.....<AC>....
00000000 14 01 00 00 0F AC 07 01 00 00 0F AC 04 01 00 00 .....<AC>.....<AC>....
memcpy(0x1eb303, 0x1e35c0, 7)
heap access@0x000036aa accessing 0x001eb204 (256) (135) < rwa
linear out of bounds heap access@0x000036aa accessing 0x001eb304 (4) (135)
destination buffer: 0x001eb204[0x00000100] (0x001eb204-0x001eb300)!
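The addresses in the log can be checked with a little arithmetic. The sketch below uses a hypothetical helper, bytes_past_end, to compute how many bytes of a write land beyond a chunk. It takes the memcpy lines of the log at face value, that is, it assumes the arguments shown are destination, source, and length.

```lua
-- Number of bytes of the write [dst, dst+len) that fall beyond the
-- chunk [buf, buf+size). Writes starting before buf are not considered.
function bytes_past_end(buf, size, dst, len)
  local chunk_end = buf + size
  local write_end = dst + len
  if write_end <= chunk_end then
    return 0
  end
  return write_end - math.max(dst, chunk_end)
end

-- Vulnerable buffer from the log: 0x1eb204 = malloc(256)
print(bytes_past_end(0x1eb204, 256, 0x1eb211, 20))  -- earlier memcpy, in bounds
print(bytes_past_end(0x1eb204, 256, 0x1eb303, 7))   -- final memcpy
```

Under that reading, the final memcpy starts one byte before the end of the 256-byte chunk at 0x1eb304 and writes six bytes past it, which is consistent with the watchpoint on the guard dword at 0x001eb304 firing.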

We have used the same implementation to play with the other IE parsing issues discovered by Google. The emulation is also not limited to data frames – in our tests, it also processed control frames just fine.

Encryption of Packets?

When first trying this approach, we were not sure how encryption is handled. In particular, it wasn’t clear whether the Cortex-R4 core operates on encrypted packets or not. If it did, this challenge would be slightly more complex. Looking at the assembly code on the wlc_recv paths, however, we did not see any packet decryption. This also makes sense, as decryption is likely hardware accelerated.

In fact, as can be seen studying the Cypress documentation, this does not happen as part of the WiFi frame parsing:

The crypto engine is used transparently before the RX FIFO and after the TX FIFO. The wlc_info structure then contains pointers to information that encodes whether the original frame was encrypted, as well as session key material, but the actual data handling in the frame parsing operates entirely on plain-text frames, which makes a lot of sense. This is important to keep in mind when working with frame data. It also means that our approach cannot be used to evaluate the crypto implementation itself.

Summary

While this is not perfect full-system emulation, we have shown that it is possible and useful to approximate full-system emulation for security research with reasonable effort. This does of course not mean that all code paths are working 100% correctly. In fact if you look at the above log message, you will notice that it claims a BCM7332 chip has been initialized: This likely is an artifact of skipping parts of the initialization. However, often enough such small details do not impact the overall result.

As a nice side effect of such emulation, we can, with a reasonable amount of effort, perform similar emulation on new or different firmware releases, e.g. to track changes as they may be deployed by Broadcom. This also provides us with good inroads for further security research, e.g. fuzzing WiFi frames while also being able to do coverage analysis. There are always multiple ways to approach a problem. It would also have been possible to use the patchram facilities to inject a debugger into the WiFi firmware directly. However, being able to run the stack for the most part off-target while being able to instrument code parts in Lua without losing too much performance has its upsides as well.

LuaQEMU is of course no substitute for additional manual reverse engineering work. To give a rough idea, starting from zero, an additional three weeks of manual reverse engineering was required to get through most of the code paths required to understand frame handling and to get the emulation to a reasonable point.

Future plans

At this point, LuaQEMU should be considered an experiment and cannot be assumed to be stable: do not be surprised if anything misbehaves. In some parts its code has also grown during our testing phase and most definitely needs to be reworked in the future. However, we already make use of it for various projects. We will continue to play with this idea and build more powerful features on top of the existing API.

To complement this blog post, we are releasing an early version of LuaQEMU – onto GitHub – for you to play with: https://github.com/comsecuris/luaqemu.

Please contact us at luaqemu@domain for feedback, patches, comments, or suggestions.