This blog post gives a rough insight into turning the Mario Kart 8 exploit into a primary entrypoint for homebrew. It covers both the technical details and the problems that came up during development. This time I also want to tell you about the ideas that didn’t work, instead of just the one that works fine.

The beginning

At the beginning of this year Rambo6Glaz made yet another implementation of the GX2 kernel exploit, which uses a different PM4 packet to manipulate the kernel heap. This new implementation brought back the idea of implementing a kernel exploit inside a rop chain.

Besides the browser exploit and haxchi there are currently three other userland exploits:

ROBChain, an exploit in the main character scripting of Super Smash Brothers Wii U

an exploit in the network protocol of Mario Kart 8

a savegame exploit in Donkey Kong Tropical Freeze.

But there is a problem with all of these exploits: none of them has access to the JIT-area. This means no access to an area in memory which is both writeable and executable, which makes arbitrary code execution without a kernel exploit impossible.

Out of these exploits the Mario Kart 8 one is special. It can be run on a previously unmodified console and could be a potential primary entrypoint into the system. Because of this the focus went to the Mario Kart 8 exploit.

Exploiting the network protocol of Mario Kart 8

Back in 2018 Kinnay found a bug in the P2P protocol of Mario Kart 8. He released a PoC which could crash the console of someone hosting a friend room, displaying the message “rop chains are fun :)”. This initial implementation allows (remote!) execution of a rop chain with a maximum length of ~1000 bytes, more than enough to play around with different payloads.

The original repository has detailed information about the exact bug and exploitation. In summary it’s possible to achieve a 4 byte arbitrary write due to a bug in parsing the “identification token”. That’s enough to manipulate a vtable, and turn a call of Md5Context::GetHashSize into a memcpy to the stack, effectively copying the content of another packet onto the stack, leading to a rop chain execution.

The kernel exploit in a rop chain - theory

In theory, implementing the kernel exploit in a rop chain doesn’t sound that hard. From the wiiuhaxx-common repository we already have rop gadgets we can re-use. This includes for example gadgets to call a function or write a value to an arbitrary address in memory. Detailed information about the kernel exploit can be found in part 4 of my “homebrew environment” blog series, but here is a quick overview:

1. Place a fake heap entry into a specific address in memory.

2. Create a PM4 packet and send it to the GPU to override the “next id” on the kernel heap.

3. Register an OSDriver and hope it’s allocating memory using our fake heap entry placed in step 1.

4. Manipulate the “SaveArea” pointer in the OSDriver struct (which is now in userland memory) to point into the kernel data.

5. Use the OSDriver_CopyToSaveArea and OSDriver_CopyFromSaveArea functions to get arbitrary read/write with kernel privileges.

This doesn’t really seem that complicated. It’s just a few function calls and it fits relatively easily into a 1000 byte rop chain. We also have the advantage that the address of the stack is consistent. This allows us to place data (like the fake heap entry or the PM4 packet) at the end of the rop chain and simply calculate their positions in memory beforehand. Rambo6Glaz talked about starting to implement the kernel exploit in the Mario Kart 8 exploit, and I thought I would give it a shot too. Using the existing gadgets and already knowing the kernel exploit in detail made me think this would be rather trivial and would be done in maybe a few hours.
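To illustrate the “calculate their positions beforehand” idea, here is a minimal sketch in Python. It is not the code from the actual exploit: the stack address, the gadget addresses and the helper name build_chain are made up for illustration. The only things taken from the text are the fixed stack address, the data placed behind the gadgets and the ~1000 byte limit.

```python
import struct

# Assumed values for illustration only; the real addresses are not given here.
STACK_BASE = 0x20000000   # hypothetical address where the rop chain ends up
MAX_CHAIN_SIZE = 1000     # approximate limit of the initial rop chain


def build_chain(words, blobs, base=STACK_BASE):
    """Assemble gadget words followed by data blobs.

    `words` is a list of 32-bit values; a string entry is a placeholder that
    gets replaced by the final address of the blob with that name.
    Returns (chain_bytes, addresses_of_blobs).
    """
    code_size = 4 * len(words)
    data = b""
    addresses = {}
    for name, blob in blobs.items():
        # Keep every blob 4-byte aligned.
        data += b"\x00" * (-(code_size + len(data)) % 4)
        addresses[name] = base + code_size + len(data)
        data += blob

    # The Wii U is big endian, so every word is packed with ">I".
    code = b"".join(
        struct.pack(">I", addresses[w] if isinstance(w, str) else w)
        for w in words
    )
    chain = code + data
    if len(chain) > MAX_CHAIN_SIZE:
        raise ValueError("rop chain does not fit: %d bytes" % len(chain))
    return chain, addresses


if __name__ == "__main__":
    chain, addrs = build_chain(
        [0x10000000, "fake_heap_entry", 0x10000004, "pm4_packet"],
        {"fake_heap_entry": b"\xAA" * 6, "pm4_packet": b"\xBB" * 8},
    )
    for name, addr in addrs.items():
        print(name, hex(addr))
```

Because the base address is known in advance, the chain can reference its own trailing data without any pointer arithmetic at runtime.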

I started to play around with the exploit and tried to implement the kernel exploit step by step. Sometimes I had random crashes, and testing was quite annoying. For each try you have to restart the console, go online, open a friend room and send the payload. Then maybe also read the crash log by restarting again and firing up a CFW to access it. In total each attempt took at least 2-3 minutes.

Fast forward a few days. After many hours of testing and trying I still had nothing. But somehow this whole exploit was quite addicting: it started with one simple idea and ended (or didn’t end) every day with “just one more try”. At the same time Rambo6Glaz was doing the same thing. Slowly but steadily we got a better understanding of what’s going on. Eventually we got a working memory write using the kernel exploit, but something was still wrong. It turned out that the exploit was indeed sometimes working (or at least partially working), but only in about 20% of the tries. This made testing even more annoying. Each idea required at least 5 failed attempts to make sure the idea was wrong and it wasn’t just the exploit randomly failing.
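A quick back-of-the-envelope calculation shows why this is so painful. The sketch below assumes that the attempts are independent and that the 20% success rate is accurate, both of which are simplifications.

```python
import math


def false_negative_rate(success_rate, attempts):
    """Chance that a *working* idea still fails `attempts` times in a row."""
    return (1.0 - success_rate) ** attempts


def attempts_needed(success_rate, max_false_negative):
    """Smallest number of failed attempts that pushes the chance of
    wrongly discarding a working idea below `max_false_negative`."""
    return math.ceil(math.log(max_false_negative) / math.log(1.0 - success_rate))


if __name__ == "__main__":
    print("after 5 failures: %.1f%%" % (100 * false_negative_rate(0.2, 5)))
    print("attempts for < 5%%: %d" % attempts_needed(0.2, 0.05))
```

Under these assumptions five failures in a row still leave roughly a one in three chance that the idea was fine and the exploit just failed randomly, so five really is the bare minimum.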

At this point we had collected some facts that helped us understand what was going on:

the kernel exploit did sometimes work, but only in rare cases

the rop chain needs to have a specific length to be stable (otherwise you get really strange behaviour and crashes)

the rop chain is running on core 2, but the main GX2 core is 1 (the kernel exploit expects to be run on the GX2 core…)

For me personally an unstable exploit was enough, I just wanted to finish this. Even if performing the exploit would require several attempts, I just wanted to see it working once, so I could finally spend my time on other projects.

Because of this I tried to split up the exploit into multiple rop chains which need to be executed one after another. Fitting the kernel exploit in 1000 bytes is doable, but also bundling a real payload and copying/executing it won’t fit anymore. One challenge was to actually restart the game. But from reverse engineering for HID to VPAD I knew there was a function to force opening the home menu (OSSendAppSwitchRequest), and it indeed worked.

I also tried to improve the success rate of the kernel exploit by adding some waiting. But every time I waited via OSSleepTicks or added a GX2DrawDone the console crashed. Knowing the kernel exploit would work only in rare cases, I tried to think of a way to give the user feedback on whether the exploit had succeeded or failed. In a rop chain “code execution” is really limited, it’s only possible to run existing chunks of code. Branches and loops are really hard (at least I haven’t found a way to pull them off yet, I am not a rop chain expert though), so the only option I saw was to manipulate the rop chain itself. I placed an OSFatal at the end of the rop chain to make the console crash, and tried to override it with an OSExitThread using the (hopefully) newly gained kernel write. This way exiting would mean success and crashing would mean failure. Again I spent too much time on this but never really got anything working.

At this point more than a week and literally dozens of hours were already wasted on this, without much progress. It was time to change the strategy. Rambo6Glaz suggested finding rop gadgets to perform a stack pivot, which would make it possible to execute a bigger rop chain.

Rop chain basics

Before working on this I had been working with rop chains, but I hadn’t found or written any rop gadgets myself. I didn’t really understand rop chains, I was just using the “high level” functions from the wiiuhaxx_common repository, so it was time to dig deeper and learn something new.

(If you already are familiar with rop chains you can skip this part.)

Why do we need to use a rop chain? On the Wii U no region in memory is executable and writeable at the same time (except for the JIT area, but we have no access to it in Mario Kart 8), so the idea is to use existing code. If you can control the stack, you can control the code flow. When calling a function, the position in the code of the “calling function” is saved on the stack. Whenever the called function returns, it jumps back to the address which was saved on the stack. By manipulating this return address it’s possible to jump anywhere in the code. Using clever places in the code it’s possible to chain multiple of these jumps to execute the needed instructions.

Functions which use the stack to store local variables have a common pattern. At the end of the function they are loading the saved return address from the stack and increase the stack pointer. By carefully crafting a stack we can jump to parts of code that are written directly before this pattern. Each address which you jump to is called a “gadget”.

Let’s imagine a stack where currently the stack pointer (r1) is pointing to address 0x20000000:

# Stack before running the gadget:
0x20000000: 0                     <-- Stackpointer (r1)
0x20000004: 0x10000000            <-- current gadget address
0x20000008: 0                     <-- Stackpointer (r1) + 0x08
0x2000000C: [NEW GADGETADDRESS]   <-- Stackpointer (r1) + 0x0C

Now we assume that some function was just returning, setting the stack pointer to 0x20000000 and reading the address to jump to from 0x20000004. This means that in this state the code flow continues at 0x10000000, with r1 = 0x20000000.

# Instructions of the gadget at 0x10000000
0x10000000: [SOME USEFUL INSTRUCTION 1]
0x10000004: [SOME USEFUL INSTRUCTION 2]
0x10000008: [SOME USEFUL INSTRUCTION 3]
0x1000000C: lwz r0, 0xc(r1);    # load return address from stack pointer + 0x0c
0x10000010: mtlr r0;            # move it to the link register (lr)
0x10000014: addi r1, r1, 8;     # increase the stack pointer by 0x08
0x10000018: blr;                # branch to link register

The first three instructions are the ones we are really interested in. Using these we want to achieve our planned behaviour. This could be for example loading values into registers (from the stack, which we can control!), moving values between registers, calling functions, writing values to memory and much more. The instructions at 0x1000000C and 0x10000010 read the new return address from stack pointer + 0xC (address 0x2000000C), which is the value we’ve previously put on the stack, and move it into the link register.

The instruction at 0x10000014 will increase the stack pointer by 0x08, afterwards instruction 0x10000018 will branch to the link register which was set in the previous instructions.

After executing the gadget the stack will look like this:

# Stack after running the gadget:
0x20000000: 0                     <--
0x20000004: 0x10000000            <--
0x20000008: 0                     <-- Stackpointer (r1)
0x2000000C: [NEW GADGETADDRESS]   <-- current gadget address
[...]                             <-- stack data for the gadget in 0x2000000C

And a new gadget will be executed. This way it’s possible to chain multiple gadgets to achieve the intended behaviour.
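The walkthrough above can be replayed with a tiny simulation. This is only a sketch of the four epilogue instructions from the listing: the stack is modelled as a Python dict, and 0x10000100 is a made-up stand-in for [NEW GADGETADDRESS].

```python
def run_epilogue(stack, r1):
    """Simulate: lwz r0, 0xc(r1); mtlr r0; addi r1, r1, 8; blr.

    Returns the address execution continues at and the new stack pointer.
    """
    r0 = stack[r1 + 0xC]   # lwz r0, 0xc(r1)
    lr = r0                # mtlr r0
    r1 = r1 + 8            # addi r1, r1, 8
    return lr, r1          # blr jumps to lr


if __name__ == "__main__":
    stack = {
        0x20000000: 0,
        0x20000004: 0x10000000,   # the gadget that is currently running
        0x20000008: 0,
        0x2000000C: 0x10000100,   # hypothetical address of the next gadget
    }
    next_gadget, r1 = run_epilogue(stack, 0x20000000)
    print("next gadget: 0x%08X, r1: 0x%08X" % (next_gadget, r1))
```

Running it shows execution continuing at the value stored at 0x2000000C while r1 moves to 0x20000008, matching the “after” picture of the stack.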

How to find rop gadgets

There are several tools that help you find rop gadgets. I had the best luck with the tool Ropper. Before you can use Ropper with Wii U binaries, you need to convert them to ELF files. Ropper allows you to display and filter all rop gadgets in a binary up to a specified length.
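To get a feeling for what such a tool has to do, here is a deliberately naive sketch that scans a raw, big endian .text dump for the blr instruction (encoded as 0x4E800020) and lists the instruction words in front of it. This is not how Ropper is implemented; it does no disassembly or filtering, and the function names are my own.

```python
import struct

BLR = 0x4E800020  # PowerPC encoding of `blr`


def find_blr_offsets(text):
    """Return the byte offset of every `blr` in a big endian .text dump."""
    return [
        off
        for off in range(0, len(text) - 3, 4)
        if struct.unpack_from(">I", text, off)[0] == BLR
    ]


def candidate_gadgets(text, base, max_instructions=4):
    """Yield (address, words) for 1..max_instructions instructions before each blr."""
    for off in find_blr_offsets(text):
        for count in range(1, max_instructions + 1):
            start = off - 4 * count
            if start < 0:
                break
            words = list(struct.unpack_from(">%dI" % (count + 1), text, start))
            yield base + start, words


if __name__ == "__main__":
    # nop; lwz r0, 0xc(r1); mtlr r0; addi r1, r1, 8; blr
    sample = struct.pack(">5I", 0x60000000, 0x8001000C, 0x7C0803A6, 0x38210008, BLR)
    for address, words in candidate_gadgets(sample, 0x10000000):
        print("0x%08X: %s" % (address, " ".join("%08X" % w for w in words)))
```

A real tool additionally disassembles the candidates and lets you filter them by the registers they touch, which is the part that makes it usable in practice.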

Besides the actual binary of the exploited application you can also use rop gadgets from the system libraries (.rpl files). The “core” system libraries are always at the same location in memory, which makes them easily usable for rop gadgets. In fact it’s preferable to use gadgets from these libraries to be independent of the application that is being exploited.

Here is a list of all system libraries that are at a fixed position in memory and their location (.text section, FW 5.5.x+):

coreinit   101C400 - 1090F00
tve        1090F40 - 10B9BC0
nsysccr    10B9C00 - 10BFD40
nsysnet    10BFD80 - 10CFE60
uvc        10CFEC0 - 10D2120
tcl        10D2180 - 10ED6E0
dc         110D600 - 111FEC0
vpadbase   111FF00 - 1128840
vpad       1128880 - 113D5E0
avm        113D640 - 114EBE0
gx2        114EC40 - 11C3020
snd_core   11C3080 - 11E3820
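When reading crash logs or sorting gadget addresses it’s handy to map an address back to its library. A small sketch using the ranges listed above; treating the end address as exclusive is my assumption.

```python
# .text ranges of the fixed system libraries (FW 5.5.x+), taken from the list above.
FIXED_LIBRARIES = [
    ("coreinit", 0x0101C400, 0x01090F00),
    ("tve",      0x01090F40, 0x010B9BC0),
    ("nsysccr",  0x010B9C00, 0x010BFD40),
    ("nsysnet",  0x010BFD80, 0x010CFE60),
    ("uvc",      0x010CFEC0, 0x010D2120),
    ("tcl",      0x010D2180, 0x010ED6E0),
    ("dc",       0x0110D600, 0x0111FEC0),
    ("vpadbase", 0x0111FF00, 0x01128840),
    ("vpad",     0x01128880, 0x0113D5E0),
    ("avm",      0x0113D640, 0x0114EBE0),
    ("gx2",      0x0114EC40, 0x011C3020),
    ("snd_core", 0x011C3080, 0x011E3820),
]


def library_for(address):
    """Return the name of the fixed library containing `address`, or None."""
    for name, start, end in FIXED_LIBRARIES:
        if start <= address < end:   # end treated as exclusive (assumption)
            return name
    return None


if __name__ == "__main__":
    for addr in (0x0101C400, 0x01150000, 0x02000000):
        print("0x%08X -> %s" % (addr, library_for(addr)))
```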

It’s a good idea not to hardcode any of the addresses for rop gadgets, but instead to get them from the binaries either via the ELF symbols or a hash. For improving the browser exploit I built a small Java tool that returns a list of gadgets for a config file. This way the rop gadgets for different versions of the binary can easily be found.

Finding actual useful gadgets

After some research I finally knew enough to find a rop gadget on my own for the first time. The goal was to perform a stack pivot to be able to switch to a different (bigger!) stack. As we have learned in the previous sections, the stack pointer is stored in register r1. To modify the stack pointer, we need to find a gadget that modifies r1.

To achieve this, I searched for any gadget that loads a value directly into r1, without any results. But I found a gadget that moves the content of r12 to r1, so I started searching for gadgets to control r12, without any success. But I found one that moves the content of r11 to r12… and so on. You see how this is going to end. The ultimate goal was to find a “chain” that starts by reading a value from the stack and moves it over several gadgets into r1. In the end I really managed to find a working set of gadgets to perform a stack pivot. It wasn’t the most gorgeous solution, but it worked. As the project moved on I was able to improve and shorten the chain multiple times.
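This kind of search is basically a shortest path problem: registers are nodes and every “move” gadget is an edge. The sketch below runs a breadth first search over a hypothetical gadget catalogue. Only the r12 to r1 and r11 to r12 moves are mentioned above; all other entries and all names are invented for illustration. I did the search by hand and backwards from r1, a forward search from the stack finds the same kind of chain.

```python
from collections import deque

# Hypothetical catalogue of gadgets as (name, source, destination).
GADGETS = [
    ("load_r28_from_stack", "stack", "r28"),
    ("mr_r11_r28", "r28", "r11"),
    ("mr_r12_r11", "r11", "r12"),
    ("mr_r1_r12", "r12", "r1"),
    ("mr_r5_r12", "r12", "r5"),
]


def find_chain(gadgets, start="stack", goal="r1"):
    """Return the shortest list of gadget names moving a value from start to goal."""
    by_source = {}
    for name, src, dst in gadgets:
        by_source.setdefault(src, []).append((name, dst))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        location, path = queue.popleft()
        if location == goal:
            return path
        for name, dst in by_source.get(location, []):
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [name]))
    return None  # no chain of moves reaches the goal


if __name__ == "__main__":
    print(find_chain(GADGETS))
```

The sketch ignores that a gadget may clobber registers used by an earlier step, which in practice is what makes finding a working chain tedious.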

Besides the rop size limitation, there was still the problem of being on the wrong CPU core. To switch the affinity of a thread, it needs to be suspended. This means it’s not possible for a thread to move itself to another CPU core. The obvious solution is to create another thread with the affinity to run on the target core. But there is one problem: the OSCreateThread function takes 9 arguments, but with the existing rop gadgets it’s only possible to call a function with up to 6 arguments.

Motivated by the success of finding a stack pivot gadget, I tried to find a rop gadget to create a thread. For quite some time I tried to find a gadget to call an arbitrary function with 9 arguments, but without success. Then I realized that OSCreateThread is just a wrapper for an internal “create thread” function, where the function call uses registers r25 to r31 as arguments instead of r3-r9. In the PowerPC architecture the arguments of a function are stored in registers r3 to r9 before the call, and finding code that sets these at the end of a function is much less likely than for the upper registers. The “upper” registers (e.g. r24 - r31) are often saved on the stack at the beginning of a function, and restored (loaded from the stack) at the end of a function. The combination of having an OSCreateThread gadget which loads its arguments from r25 to r31 and having an easy gadget to set these registers makes this function call with a huge number of arguments feasible.

How to execute long rop chains

At this point it was possible to do a stack pivot and create another thread on the right core. But there was still the problem of the size limited rop chain. Rambo6Glaz and I tried to figure out a way to allow bigger rop chains and came up with two different ideas:

Create a rop chain to load a bigger chain via the network

Split up the “final” rop chain into multiple chunks, run the exploit multiple times and each time save one chunk inside an OSDriver.

While Rambo6Glaz focused on the network solution, I gave the OSDriver idea a shot.

Running the exploit multiple times!

The Wii U OS has a feature that allows libraries to install OSDrivers. Besides registering callbacks for certain events like acquiring or losing the foreground, OSDrivers can also store data inside the kernel. This is useful to store permanent data that can be used even after restarting or switching the application. Using the kernel syscalls directly lets us bypass some checks and simplifies the usage.

Here is a general workflow of this idea:

1. Run the exploit in Mario Kart 8 to get rop chain execution.

2. Build a rop chain that registers a new OSDriver and stores embedded data (in this case a part of a big rop chain) inside the kernel using CopyToSaveArea.

3. Open the Home Menu via rop chain and exit the game.

4. Go back to step 1 until the whole rop chain is placed in different OSDrivers.

5. Build another rop chain that takes the data saved in the OSDrivers and executes it on a new thread on core 1 (GX2 main core in Mario Kart 8).

Using this approach I was able to store 816 bytes inside an OSDriver with each restart. I improved the rop chain generation to automatically take care of generating all the different rop chains that are needed.
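The splitting itself is the easy part. A minimal sketch of it, assuming the chunks are simply consecutive 816 byte slices of the final rop chain (the function names are mine, not the ones from the generator):

```python
CHUNK_SIZE = 816  # bytes that fit into one OSDriver per run, as stated above


def split_chain(chain, chunk_size=CHUNK_SIZE):
    """Split the final rop chain into consecutive chunks of at most chunk_size bytes."""
    return [chain[i:i + chunk_size] for i in range(0, len(chain), chunk_size)]


def exploit_runs_needed(chain, chunk_size=CHUNK_SIZE):
    """One run per stored chunk plus one last run that loads and executes them."""
    return len(split_chain(chain, chunk_size)) + 1


if __name__ == "__main__":
    final_chain = bytes(2000)  # placeholder for a 2000 byte rop chain
    chunks = split_chain(final_chain)
    print([len(c) for c in chunks], exploit_runs_needed(final_chain))
```

Every additional chunk costs one more full run of the exploit, which is why this approach gets slow very quickly.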

It worked quite well. Finally I could build a rop chain without thinking about the size limit. In fact the size of the final rop chain is limited by the number of “read data from OSDriver X” gadgets, but I never reached that limit (~8000 bytes were possible). The downside: each try took quite long. I had to run the exploit at least three times to get the “final” rop chain running to check if it’s working. This leads to a test cycle of more than 5 minutes. For testing just some ideas it was enough, but in the long term it was really annoying.

Using this I was able to test some ideas that were previously not possible due to size constraints. One of the first things I tried was to shut down the GX2 engine and restart it again to have it in a clean state for the kernel exploit. This was now possible because we were on the right CPU core. But this resulted in a crash because the actual game was still running and using the GX2 engine. A simple solution was to suspend the main thread (which luckily is at a fixed address which can be easily obtained from the crash logs), and resume it at the end of the rop chain. Without resuming the main thread exiting the game wouldn’t be possible. But even with stopping the main thread and reinitializing the GX2 engine the exploit was still not working. Adding some waiting in various forms didn’t help either.

The best theory at the time was that it didn’t work because something in the background was still running and using the GX2 engine, interfering with the exploit. At this point I was really desperate and tried every single implementation of the exploit in the rop chain, hoping one of them would actually work. But nothing was working.

From working on the plugin system I knew that threads on CPU core 2 will actually keep running when opening the Home Menu. My idea was to perform the exploit while the game was suspended in the background, but this also didn’t work.

We need more gadgets!

Each application implements a ProcUI loop. ProcUI is a wrapper library which allows an easier usage of the system message queue from Cafe OS. The ProcUI loop is the place in the application where it’s decided if the application is requested to move to the background, just gained the foreground or should be closed. I thought that by sending a “close application” message to the game and keeping our own thread running we would have a chance of running a rop chain in a pretty clean environment, without the actual game running and interfering with it.

The easiest way to tell a game that it should be closed is by calling the function SYSRelaunchTitle from the sysapp library, but actually using it was way harder than I thought. In this blog post we’ve already talked about the system libraries that are always at a fixed address in memory, but sysapp is not one of them. The function address can be easily obtained using OSDynLoad_Acquire and OSDynLoad_FindExport. The real problem is using any of the return values and calling a function not by its address but via a function pointer.

To accomplish this, once again more rop gadgets needed to be found. The function OSDynLoad_FindExport takes the module handle acquired via OSDynLoad_Acquire as first argument, which changes after each restart. So the first gadget needed was a function call where the first argument is dereferenced from an address. In addition a gadget is needed to call the function pointer that is returned by the OSDynLoad_FindExport function.

After finding these gadgets it was finally possible to call SYSRelaunchTitle to trigger a game shutdown, but it turns out it also kills any other existing threads. The idea of keeping rop chain execution after shutting down the game didn’t work either.

But these new gadgets really helped to test new things. For example we were able to test the “magic” IM_SetDeviceState call which is used in the browser exploit to shut down the browser. It turns out that it just emulates pressing the home button, which is not helping.

Loading bigger rop chains via the network!

The whole time I was using my slow “run the exploit multiple times to get a bigger rop chain” approach, while Rambo6Glaz was working on loading a second rop chain over the network.

At some point Rambo6Glaz finally managed to get stable execution of a rop chain sent via TCP to the console. The workflow was something like this:

Create a new thread on CPU core 1

Inside the thread connect to a TCP server and receive a bigger rop chain

Do a stack pivot to execute the received rop chain

Profit!

This was really stable and massively sped up the testing of new rop chains.
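On the PC side this only needs a tiny TCP server that hands the second stage to the console. The sketch below is not Rambo6Glaz’s implementation: the function name, the port handling and the 4 byte big endian length prefix are assumptions I made to keep the example self-contained.

```python
import socket
import struct
import threading


def serve_chain_once(chain, host="127.0.0.1", port=0):
    """Send one length-prefixed rop chain to the first client that connects.

    Returns (port, thread). With port=0 the OS picks a free port.
    The length prefix is an assumed framing, not taken from the real exploit.
    """
    server = socket.create_server((host, port))
    bound_port = server.getsockname()[1]

    def worker():
        with server:
            conn, _ = server.accept()
            with conn:
                conn.sendall(struct.pack(">I", len(chain)) + chain)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return bound_port, thread


if __name__ == "__main__":
    second_stage = struct.pack(">3I", 0x10000000, 0x10000004, 0x10000008)
    port, thread = serve_chain_once(second_stage)
    # Stand-in for the console: connect and read everything that is sent.
    with socket.create_connection(("127.0.0.1", port)) as client:
        received = b""
        while True:
            part = client.recv(4096)
            if not part:
                break
            received += part
    thread.join(5)
    print(len(received), "bytes received")
```

Since nothing has to be restarted between two tries except the game itself, changing the second stage is just a matter of restarting this server with different bytes.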

Just keep GX2 running

Due to the faster testing I tried several new things. One of them was to stop trying to shut down and restart GX2, but still suspend the main thread of Mario Kart 8. This led to an exception in the kernel, so something was happening. To perform the kernel exploit we place a fake heap entry and modify the kernel heap to use it. The crash log suggested the kernel was indeed trying to read from the right address, but the data it read was not what we had placed there. I wasn’t sure (and I am still not sure) if this was because of some weird caching issue, but I went the safe route and modified the exploit to read the fake heap entry from 0x2F200014 instead of 0x1F200014, and it worked first try. I gave it a few more shots and it was indeed stable. Finally.

From now on we had a stable kernel exploit which granted us read/write access with kernel privileges. The JIT-area isn’t just helpful for providing easy userland code execution, but also provides easy kernel execution. It’s also the only region in memory which allows write and execute for the kernel, but we still had no access to this region.

Without kernel execution and with the default memory mapping there isn’t really anything special you can do with kernel privileged writes, only modifying the kernel .data section and registering a new syscall. Without being able to run custom code a new syscall isn’t that helpful. But kernel write is enough to change the tables inside the kernel which are used for the memory mapping, and to give us a mapping of an “execute only” region with write privileges. The downside of this is that we need to restart the application before the changes take effect. So we still need at least one restart.

Before restarting it’s important to revert the changes we made to the kernel heap. We also register a new syscall 0x25 which points to a memcpy function (0xfff09e44 on 5.5.x) to keep an easy way to perform copy operations with kernel privileges.

Userland code execution!

After performing the kernel exploit, setting up the memcpy syscall, mapping the memory and restarting Mario Kart 8 we perform the exploit once again. Now we can finally achieve code execution. Using the new memory mapping we can copy any executable into the free 0x011DD000...0x011E0000 region. Afterwards we override the “main()” function call with a jump to our code and switch to the Mii Maker. This will execute our payload in Mii Maker context!
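For reference, this is how a relative PowerPC branch instruction that could serve as such a jump is encoded. It is a sketch of the instruction encoding only: the address of the patched call site below is hypothetical, and I am not claiming this is exactly the instruction the exploit writes.

```python
def encode_branch(source, target, link=False):
    """Encode a relative PowerPC `b` (or `bl` when link=True) from source to target."""
    offset = target - source
    if offset % 4:
        raise ValueError("branch offset must be a multiple of 4")
    if not -0x02000000 <= offset < 0x02000000:
        raise ValueError("target is out of range for a relative branch")
    # opcode 18 in the top 6 bits, 24 bit word offset, AA = 0, LK = link
    return 0x48000000 | (offset & 0x03FFFFFC) | (1 if link else 0)


if __name__ == "__main__":
    call_site = 0x02000100   # hypothetical address of the "main()" call
    payload = 0x011DD000     # start of the free region mentioned above
    print("b  -> 0x%08X" % encode_branch(call_site, payload))
    print("bl -> 0x%08X" % encode_branch(call_site, payload, link=True))
```

A relative branch only reaches targets within about 32 MB of the call site; anything further away would need an absolute branch or a jump through a register instead.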

But we still have no real control of the kernel without kernel execution. Unfortunately the free 0x011DD000...0x011E0000 region which we are using for userland code execution has no kernel execution rights. I spent some time thinking about a solution when I remembered the RPX version of the homebrew launcher. The RPX version of the homebrew launcher was intended to run as a channel in an environment without kernel access, so it ships with its own kernel exploit. It also has no access to the JIT-area, but somehow achieves kernel execution. Looking at the code reveals that there is a region in memory (0x017FF000, just before the JIT-area) that is writable using the memory mapping and also has kernel execution rights. This is enough to get arbitrary kernel execution by placing a payload in this area and registering it as a syscall. By changing an IBAT (which controls the memory mapping) kernel execution rights can be provided for any other region in memory.

payload.elf loader

In previous blog posts I talked about a homebrew environment where all exploits should be able to load a payload.elf from the sd card and execute it. To achieve this we need to fulfill the requirements of the payload loader, and run the payload loader afterwards. One of the requirements is having a syscall which allows the modification of IBAT0 to gain kernel code execution. The other ones are just the “default” kern_read and kern_write syscalls.

After installing these syscalls we just need to load the payload.elf loader into memory and run it.

Based on the JsTypeHax_payload I created a payload for the Mario Kart 8 exploit which sets up the needed syscalls for the payload loader and copies the loader into memory. The 0x011DD000...0x011E0000 region is barely big enough for this “payload loader installer” and the actual payload.elf loader, but it somehow fits.

After copying the payload.elf loader into memory it can finally be executed. An arbitrary payload.elf will be loaded from the sd card and executed. We are finally done.

Conclusion

In the end I spent way more time on this than I ever would have thought. So many times I was so close to just giving up, but somehow the exploit was really addicting. Once again a big shoutout to Rambo6Glaz (aka NexoCube) who worked on this at the same time. We shared our ideas and tried to motivate each other. In the end we both came up with a working solution, which is quite nice.

This blog post may not be the most technical one, and maybe not the most exciting one, but this is how developing such an exploit really is, at least in my experience. 95% of the time you’re just failing and trying different ideas. Several times you will be stuck, but somehow there is always a solution. On one side it feels like I’ve wasted way too much time on this, but on the other side I also learned so much. And it feels nice to actually finish such a demotivating project. Even if no one will ever actually use it.

Where can I find the code

I put all of the code on Github: