Hardware bugs can be frustrating: the hardware is likely the last place you would expect a bug. Unfortunately, they can and do happen, and nearly every CPU ships with a list of errata. This post outlines the process taken to hunt down a bug in the i.MX 7 ARM M4 implementation.

TL;DR: LDREX/STREX instructions on the i.MX7 ARM M4 implementation bypass the cache and go straight to main memory, resulting in an incoherent cache.

What in the World?

The bug was first stumbled upon during the development of FreeRTOS tasks. For those unfamiliar with real-time operating systems, a task is comparable to a process in a Linux/Windows environment. One FreeRTOS task was responsible for receiving commands (in this case, from the heterogeneous A7 core over rpmsg), executing some function, and returning a response. Depending on the command type, the task either executed the function itself or passed the command to a lower-priority “slow command” task. The slow task had the restriction that it could execute only one command at a time. To control and monitor the slow command task, a global atomic enum was used.

```cpp
enum class TaskState { IDLE, RUNNING, ABORTED };

std::atomic<TaskState> slowTaskState(TaskState::IDLE);
Command slowCommand;

void commandTask() {
    while (true) {
        Command command = receiveCommand();
        switch (command.Type) {
        case CommandType::NORMAL:
            // Do work
            send(SuccessResponse);
            break;
        case CommandType::SLOW:
            if (slowTaskState == TaskState::IDLE) {
                slowCommand = command;
                slowTaskState = TaskState::RUNNING;
                send(SuccessResponse);
            } else {
                send(BusyResponse);
            }
            break;
        case CommandType::ABORT: {
            TaskState expected(TaskState::RUNNING);
            atomic_compare_exchange_strong(
                &slowTaskState, &expected, TaskState::ABORTED);
            send(SuccessResponse);
            break;
        }
        }
    }
}

void slowCommandTask() {
    while (true) {
        // Actually uses xEventGroupWaitBits instead of busy-wait
        if (slowTaskState != TaskState::IDLE) {
            // Do some work
            if (slowTaskState == TaskState::ABORTED) {
                slowTaskState = TaskState::IDLE;
                continue;
            }
            // Continue work
        }
    }
}
```

The code almost works as expected. The command task received and processed normal commands, and handed off slow commands to the slow command task without issue. The problem appeared when attempting to abort slow commands.

Isolating the Problem

To better understand the problem, the steps were broken down: first into a single task, and then outside of FreeRTOS entirely. This was to ensure that a corner-case race condition wasn’t being missed (though there was confidence that this was not the case).

```c
_Atomic int state = ATOMIC_VAR_INIT(3);

int main(void) {
    // Show that the atomic was initialized correctly to 3
    debug_printf("state initial value: %d\r\n", atomic_load(&state));

    // Set the value to 1
    state = 1;
    debug_printf("state after store 1: %d\r\n", atomic_load(&state));

    int expected = 1;
    if (!atomic_compare_exchange_strong(&state, &expected, 2)) {
        debug_printf("CAS failed: Value is %u, expected 1\r\n", expected);
    }
}
```

Program Output:

```
state initial value: 3
state after store 1: 1
CAS failed: Value is 3741011900, expected 1
```

As you can see in the output, the values read by atomic_compare_exchange_strong do not make any sense. The atomic store and load of 1 succeeded, but the compare-and-swap (CAS) reads an incorrect, stale value!

Suspicions about cache incoherence started to form. How else could a read to the exact same address result in different values?

Aside on ARM Atomics

A quick background on ARM atomics is needed to understand the bug and when it presents itself. In the ARM architecture, plain loads and stores are already atomic. To implement other atomic operations, such as test_and_set or fetch_and_add, ARM extended the instruction set with LDREX (Load-Exclusive/Load-Link) and STREX (Store-Exclusive/Store-Conditional). Load-Link loads a value from a memory address. Store-Conditional stores a value only if the memory address last accessed by Load-Link was not accessed between the two instructions. (In actuality, the hardware does not track the exact memory location accessed, but rather a region, such as a cache line, or possibly any memory access at all.) Used together, these instructions allow ARM to implement read-modify-write operations. Functions such as atomic_compare_exchange_strong require these instructions to operate (without disabling interrupts), and the compiler emits them whenever such functions are used.

Testing the Cache

The test originally ran out of cacheable RAM. To test the hypothesis, the code was linked to run out of the Tightly Coupled Memory (TCM), which is non-cacheable, and out of On-Chip RAM (OCRAM), which is cacheable. The OCRAM was included to prove that the issue was not with one particular RAM, but with any cacheable memory. The TCM didn’t exhibit the issue and ran as expected, heightening suspicion of the cache. To prove that the cache was involved in the problem, the test was expanded to manually control the cache.

```c
_Atomic int state = ATOMIC_VAR_INIT(3);

int main(void) {
    // Show that the atomic was initialized correctly to 3
    debug_printf("state initial value: %d\r\n", atomic_load(&state));

    // Set the value to 1
    state = 1;
    debug_printf("state after store 1: %d\r\n", atomic_load(&state));

    int expected = 1;
    if (!atomic_compare_exchange_strong(&state, &expected, 2)) {
        debug_printf("CAS failed: Value is %u, expected 1\r\n", expected);
    }

    // Flushing cache to update main memory
    debug_printf("Pushing Cache\r\n");
    LMEM_FlushSystemCache(LMEM_BASE_PTR);

    expected = 1;
    // After pushing the cache to update the value in main memory,
    // the compare_exchange reads the correct value
    if (!atomic_compare_exchange_strong(&state, &expected, 2)) {
        debug_printf("CAS failed: Value is %u, expected 1\r\n", expected);
    } else {
        debug_printf("CAS Passed\r\n");
    }

    // The value stored is not what the CAS should have stored
    debug_printf("state after CAS (should = 2): %d\r\n", atomic_load(&state));

    // Invalidating the cache shows the value written by CAS is in main memory
    debug_printf("Invalidating Cache\r\n");
    LMEM_PSCCR = LMEM_PSCCR & ~(LMEM_PCCCR_PUSHW1_MASK | LMEM_PCCCR_PUSHW0_MASK);
    LMEM_PSCCR |= LMEM_PCCCR_INVW1_MASK | LMEM_PCCCR_INVW0_MASK;
    LMEM_PSCCR |= LMEM_PSCCR_GO_MASK;
    while (LMEM_PSCCR & LMEM_PSCCR_GO_MASK);

    debug_printf("state after invalidate (should = 2): %d\r\n", atomic_load(&state));
}
```

Program Output:

```
state initial value: 3
state after store 1: 1
CAS failed: Value is 3741011900, expected 1
Pushing Cache
CAS Passed
state after CAS (should = 2): 1
Invalidating Cache
state after invalidate (should = 2): 2
```

After the manual cache flush, the CAS reads the correct value and supposedly stores a new value, yet the subsequent atomic load reads an unexpected value. Invalidating the cache, which forces a normal load to fetch from main memory, then reads the expected value. ARM documentation was scoured to ensure that we did not miss some memory configuration bit or documented hardware limitation. After becoming reasonably confident the hardware was at fault, a post was made in the NXP (the processor vendor) community forum, in which NXP confirmed the hardware bug. (As of this writing, the errata has yet to be updated to include this bug.)

Finding a Path to Success

After the problem was verified by NXP, a solution was still needed. Not using the cache at all would have worked, although it would obviously degrade performance significantly. Another option would be to run everything out of faster, non-cacheable memory, e.g. the TCM, but the TCM is very limited in size. The solution we came upon was to place all atomics (anything the compiler may emit LDREX/STREX instructions for), and any class or struct containing an atomic, in the TCM using GCC linker section attributes. To verify that all atomics are correctly linked into the TCM, libclang was used to analyze the object files during compilation, ensuring that this bug does not come back from the dead in the future to bite us in the butt.