Recently I've started working on yet another version of Touchy Blinky (let's call it V4) with two main goals: reducing assembly time and cutting down on component cost.

As the single most expensive component in the BOM was 48 meters of APA102C LED strips, I decided to tackle that one first. For those of you who aren't familiar with them, APA102 strips use the SPI protocol, which has both clock and data signals and is capable of a much higher bit rate than the more common WS2812 LED strips. WS2812 strips cost about half as much, but they have a slow data rate (800 kHz, limiting updates to about 1000 LEDs at 30 FPS) and are VERY sensitive to the timing of the signal. That makes them popular mostly with microcontrollers, which can easily meet those tight tolerances. Touchy Blinky, however, is controlled by a Raspberry Pi single board computer, and as I work in an industry that specializes in soft real time, I decided to give it a go. Spoilers: It's working.
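To put numbers on the WS2812 limit, here is a small back-of-the-envelope sketch in C. It assumes the datasheet figures of 1250ns per bit and 24 bits per LED, plus a 50µs reset gap between frames (that gap is an assumption; some strip revisions want a longer one). The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define WS2812_BIT_NS        1250u   /* 1 / 800 kHz */
#define WS2812_BITS_PER_LED    24u   /* 8 bits each for G, R, B */
#define WS2812_RESET_NS     50000u   /* assumed latch gap after a frame */

/* Time needed to clock out one full frame, in nanoseconds. */
static uint64_t ws2812_frame_ns(uint32_t led_count)
{
    return (uint64_t)led_count * WS2812_BITS_PER_LED * WS2812_BIT_NS
           + WS2812_RESET_NS;
}

/* Largest strip that can still be refreshed `fps` times per second. */
static uint32_t ws2812_max_leds(uint32_t fps)
{
    assert(fps > 0);
    uint64_t budget_ns = 1000000000ull / fps;
    if (budget_ns <= WS2812_RESET_NS)
        return 0;
    return (uint32_t)((budget_ns - WS2812_RESET_NS)
                      / (WS2812_BITS_PER_LED * WS2812_BIT_NS));
}
```

Under these assumptions, 1000 LEDs take about 30ms per frame, and a 30 FPS budget fits roughly 1100 LEDs.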

First step is understanding the obstacles and what causes the execution speed to fluctuate. The things I could think of are:

* Interrupts, context switches, the OS just getting in the way

* Cache misses and memory retrieval

* Writing to the IO system from different places

So I fired up my Raspberry Pi 3 and started addressing them one by one.

Interrupts and the OS

So the first thing I did was dedicate a core to just my LED strips. There are a few good reasons to do that. First, the scheduler will never ever context switch you. Second, you get the L1 cache all to yourself; no one else will put any data there. The way to do that is the isolcpus feature of the Linux kernel. You can read more about it here.
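As a rough sketch of how that looks in practice: isolcpus is a kernel boot parameter (for example `isolcpus=3` on the kernel command line, which on Raspberry Pi OS lives in /boot/cmdline.txt). The scheduler then keeps everything off that core, so the LED program has to move itself there explicitly. The helper below is hypothetical and uses the standard Linux sched_setaffinity() call; core 3 is just an assumed choice.

```c
#define _GNU_SOURCE          /* needed for CPU_SET and sched_setaffinity */
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one core. Returns 0 on success, -1 on failure. */
static int pin_to_core(int core)
{
    assert(core >= 0 && core < CPU_SETSIZE);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```

Calling `pin_to_core(3)` at the start of the program would put it on the isolated core, assuming the machine booted with `isolcpus=3`.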

Next step was getting rid of all of the interrupts. The way to follow those is looking at the /proc/interrupts file, or the way I like to do it:

watch -n .1 cat /proc/interrupts

Alas, that was not enough. There were still plenty of interrupts interrupting all over my precious dedicated core. To name them: the arch_timer and Rescheduling interrupts. Time to take them out.

Looking at this document containing the interrupts, I played around until I managed to turn the interrupts off altogether. Here is the code to do that:

Having done that (I will put the code in a branch on my git repo), timing was almost working. Or rather, working most of the time, with lots of offsets. Some of them looked like the clock slowing down by half, which turned out to be exactly that: the clock slowing down by half for power reasons. The solution was to tell the CPU governor (at /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor) to stay at a constant speed. I tried 'performance' to set the speed to the max, but it still sometimes slowed down, so I switched to 'powersave', which locked the cores to 600 MHz. That is still fast enough, but it increased the time wobble.
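Setting the governor is just a matter of writing its name into that sysfs file. A minimal sketch in C, with hypothetical helper names, could look like this (writing to the real file needs root):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write a governor name (e.g. "powersave") to a cpufreq sysfs file. */
static int set_governor(const char *path, const char *governor)
{
    assert(path != NULL && governor != NULL);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fputs(governor, f) >= 0;
    if (fclose(f) != 0)      /* sysfs reports a rejected value on flush/close */
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the current governor back, to check that the write took effect. */
static int get_governor(const char *path, char *out, size_t out_len)
{
    assert(path != NULL && out != NULL && out_len > 0);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *line = fgets(out, (int)out_len, f);
    fclose(f);
    if (!line)
        return -1;
    out[strcspn(out, "\n")] = '\0';   /* sysfs values end with a newline */
    return 0;
}
```

For the setup described here, the call would be `set_governor("/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor", "powersave")`.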

Cache Misses

This is actually an easy one. I do have enough time to load everything from the main memory to the L1 cache, I just need to do it in a known amount of time.

Luckily GCC has the __builtin_prefetch operation that translates into the PLD instruction. More data is available here.

Before every delay I just prefetch the next value I'll need and let the CPU fetch it behind the scenes while the delay runs.
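The pattern can be sketched like this. The delay function is a hypothetical stand-in for the real timed wait, and the running sum stands in for whatever gets done with each byte:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the timed delay between output steps. */
static void short_delay(void)
{
    for (volatile int i = 0; i < 100; i++) { }
}

/* Walk a buffer, requesting the next byte's cache line before each delay. */
static uint32_t walk_with_prefetch(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        if (i + 1 < len)
            __builtin_prefetch(&data[i + 1], 0 /* read */, 3 /* keep cached */);
        short_delay();      /* the memory fetch overlaps with this wait */
        sum += data[i];     /* by now the value should be an L1 hit */
    }
    return sum;
}
```

A prefetch is only a hint, so the code stays correct even if the CPU ignores it; the hope is just that the slow memory access no longer lands in the timing-critical part.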

IO System Congestion

After stopping all interrupts and prefetching everything, things mostly worked, but glitched very often. No idea what to do with that. Me sad. Things just took a variable amount of time. What could I do? While I couldn't think of anything else to do (or not do, as the point of this is to keep things as simple as possible), I could try to measure how long things took and compensate. On x86/x64 this is pretty easy: for a VERY high frequency timer you just use the __rdtsc() intrinsic, which gives you a count of how many clock cycles have passed since the processor started ticking. On ARM, however, this is not as easy. The biggest hardship is that the cycle counter isn't available in userland.

A couple of long nights and a kernel module programming crash course later, we were in the kernel. It was lots of fun.

Using the following code I tried to compensate for the write time.

I was very sad :-(

I didn't know what to do next.

After eliminating every other element, it just seemed like writing to the GPIO registers, even when using the memory-barrier-less writel_relaxed() function, just varies too much in how long it takes, and it was especially bad when running things on any of the other cores. Womp womp.

Suddenly it hit me.

Why have any other cores? Initially I needed many cores because I wanted to isolate one with isolcpus, but if I'm willing to give that up I can do with one core, and then there's no competition from any other cores. I get to do whatever I want with NOTHING getting in the way.

Out with Raspberry Pi 3, in with Raspberry Pi Zero!

Raspberry Pi Zero!

So down to a single core. All mine. Slightly different architecture, and much better documented. Peripherals are documented here.

Plan is - get a frame from userland, freeze everything, bitbang all the GPIO, resume.

Does it work?

Yes.

Brilliantly.

Freezing the interrupts is much easier as there are just 3 registers for all of them, and then everything just works.
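To show the shape of the freeze, bitbang, resume plan without any hardware, here is a simulation sketch. It is not the kernel module's code: the hardware hooks are hypothetical function pointers, the mocks only record what happened, and the pulse widths are typical WS2812B datasheet values (an assumption, check your strip).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware hooks; a real driver would touch registers here. */
struct ws2812_hw {
    void (*irq_freeze)(void);
    void (*irq_resume)(void);
    void (*pin_high)(void);
    void (*pin_low)(void);
    void (*delay_ns)(uint32_t ns);
};

/* Assumed pulse widths in ns; each bit adds up to 1250ns. */
enum { T0H = 400, T0L = 850, T1H = 800, T1L = 450 };

static void ws2812_send_byte(const struct ws2812_hw *hw, uint8_t byte)
{
    for (int bit = 7; bit >= 0; bit--) {        /* MSB first */
        int one = (byte >> bit) & 1;
        hw->pin_high();
        hw->delay_ns(one ? T1H : T0H);          /* long high = 1, short = 0 */
        hw->pin_low();
        hw->delay_ns(one ? T1L : T0L);
    }
}

/* Send one frame; bytes are expected in the strip's G, R, B order. */
static void ws2812_send_frame(const struct ws2812_hw *hw,
                              const uint8_t *grb, size_t len)
{
    assert(hw != NULL && (grb != NULL || len == 0));
    hw->irq_freeze();                           /* nothing may interrupt us */
    for (size_t i = 0; i < len; i++)
        ws2812_send_byte(hw, grb[i]);
    hw->irq_resume();
}

/* Mock hooks that record activity, so the logic can run anywhere. */
static uint32_t g_total_ns, g_high_ns, g_rises;
static int g_level, g_frozen;
static void mock_freeze(void) { g_frozen = 1; }
static void mock_resume(void) { g_frozen = 0; }
static void mock_high(void)   { g_level = 1; g_rises++; }
static void mock_low(void)    { g_level = 0; }
static void mock_delay(uint32_t ns)
{
    g_total_ns += ns;
    if (g_level)
        g_high_ns += ns;
}
```

Wiring the mocks into a `struct ws2812_hw` and sending three bytes produces 24 rising edges and 24 x 1250ns of simulated time, which is a handy sanity check on the encoding before moving anything into the kernel.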

I've created the most minimal kernel module I could, and a C++ header file to make everything nice and easy.

Here is the business end of things.

Problems still there

So there are a few things that are not perfect:

* Timing is not up to spec - The first bit up time seems to move back and forth a bit, and each bit is sent in ~900ns instead of the datasheet's 1250ns. It works though, and increases possible frame rate and reduces time the kernel is disabled.

* Concurrency. The kernel module is my first one. I know it has some thread safety problems. But hey, this is for shiny LEDs, not for storing your bitcoin.

* Stopping the kernel is generally not the best idea. For LED strips of 300 LEDs (a standard 5m strip) you're stopping the kernel for 9ms per frame. Is that a lot? A little? I don't know. If you're taking it to 50 FPS, that's about 45% of the time that the kernel is stopped. I tested it mostly with strip lengths of 100 LEDs and everything seemed to be fine. I'm worried about sound, but I haven't tested it yet. Just be aware and expect your Raspberry Pi to be much slower.
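The stall figures in the last point are simple arithmetic, sketched below. The helper names are hypothetical, and the reset gap between frames is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Time the kernel is frozen for one frame, in microseconds. */
static uint32_t stall_us(uint32_t leds, uint32_t bit_ns)
{
    return (uint32_t)((uint64_t)leds * 24u * bit_ns / 1000u);
}

/* Whole percent of each second spent frozen at a given frame rate. */
static uint32_t stall_percent(uint32_t leds, uint32_t bit_ns, uint32_t fps)
{
    assert(fps > 0);
    return (uint32_t)((uint64_t)stall_us(leds, bit_ns) * fps / 10000u);
}
```

At the datasheet's 1250ns per bit, 300 LEDs come out to 9000µs per frame and 45% of each second at 50 FPS. Plugging in the ~900ns per bit from the first point gives about 6.5ms and 32%.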

The Code!

Can be found in my GitHub repo, right here:

https://github.com/UrielGuy/raspi_ws2812

Good luck and let me know if you have any ideas/bug reports/pull requests/cool projects you've done with this!