CNXSoft: Guest post by Blu about Baikal T1 development board and SoC, potentially one of the last MIPS consumer grade platforms ever.

It took me a long time to start writing this article, even though I had been poking at the test subject for months, and I felt during that time that there were findings worth sharing with fellow embedded devs. What was holding me back was the thought that I might be seeing one of the last consumer-grade specimens of a paramount ISA that once turned the CPU world upside down. That thought was giving me mixed feelings of part sadness, part hesitation ‒ I did not want to do some injustice to a possibly last-of-its-kind device. So it was with these feelings that I took to writing this article. But first, a short personal story.

Two winters ago I was talking to a friend of mine over beers. We were discussing CPU architectures and hypothesizing on future CPU developments in the industry, when I mentioned to him that the latest Imagination Technologies’ MIPS P5600 ‒ a MIPS32r5 ‒ hosted an interesting SIMD extension ‒ a previously-unseen one in the MIPS world. I had just skimmed through the docs for that extension ‒ MIPS SIMD Architecture (MSA), and I was impressed with how clean and practical this new vector instruction set looked in comparison to the SIMD ISAs of the day, particularly to those by a very venerable CPU manufacturer. We discussed how the P5600 had found its way into a SoC by the Russian semiconductor vendor Baikal Electronics, and how they were releasing a devboard, which, thanks to limited-series manufacturing, would be well out-of-reach for mortal devs.

Fast forward to this summer, when I got a ping from my friend ‒ he was currently in St. Petersburg, Russia, and he was browsing the online store of a Moscow computer shop, and there was the Baikal T1 BFK 3.1 board, for the equivalent of 500 EUR, so if I ever wanted to get one, now was the time.

Did I want one? Last MIPS I had an encounter with was the Imagination CI20 board, hosting an Ingenic JZ4780 application SoC ‒ a dual-core MIPS32r2 implementation, and that was a mixed experience. I just had higher expectations of that SoC, as neither the SoC vendor nor Imagination did a good job setting the user expectations of what the XBurst MIPS cores actually were ‒ short in-order pipelines, with a non-pipelined scalar FPU, and an obscure integer-only SIMD specialized for video codecs. The one interesting part in that SoC, from my perspective, was the fully-fledged GLESv2/EGL stack for the aging SGX540. What I was looking for this time around was a “meatier” MIPS, one which was closer to the state of the art of this ISA, and the P5600 was precisely that.

So, yes, I very much wanted one. That price was very close to my threshold of ‘buy for science’, but I still had to keep in check my overgrown annual ‘scientific budget’ (as I refer to my devboard expenses in front of my wife), so I hesitated for a moment. To which my friend suggested ‘Listen, your birthday occurs annually, so how about I get you a birthday present, with some credit from future birthdays?’ [A huge thank you, Mitia, for your ingenuity, kindness and generosity!]

The BFK 3.1 is a sub-uATX board ‒ namely of the flexATX form factor ‒ a bit larger than mini-ITX, which means it’s compact ‒ not RPi compact, mind you, but still compact for a devboard. Baikal T1 itself is a compact SoC ‒ not much larger than the Ingenic JZ4780. The latter is 17x17mm BGA390 (40nm), vs 25x25mm BGA576 (28nm) for the T1. But the T1 is a proper SoC that contains everything needed for a small gen-purpose computer (sans a GPU), which is what the BFK 3.1 seeks to be. Combined with the versatile MCU STM32F205 (ARM Cortex-M3 @ 120MHz), the T1 allows for an essentially two-chip devboard. Aside from the SoC and its companion MCU, the BFK 3.1 hosts a PCIe x16 connector (x4 active lanes), a SO-DIMM slot, an ATX power connector, 2x 1Gb Ethernet and 2x SATA 3 connectors, a USB2.0, a UART (via mini-USB) and what appears to be a USB OTG, a couple of JTAGs and even an RPi GPIO connector ‒ the rest of the board’s top surface is nearly pristine. Ok, there’s one more connector ‒ a proprietary one for the optional 10Gb Ethernet add-on, but that comes more as a curiosity from my current perspective.

Getting the board live was practically uneventful. BFK 3.1 power delivery is via a 24-pin ATX connector ‒ no barrel connectors of any kind, which in my case made two large drawers worth of PSUs useless, but I also had a 20-pin ATX picoPSU at hand (80W DC-DC, 12V input) and a spare AC-DC 12V converter (60W) ‒ that improvised power delivery covered the board plus an SSD more than fine ‒ actually it was overkill, given the manufacturer’s TDP rating of 5W for the SoC. I also had a leftover 4GB DDR3 SO-DIMM from a decommissioned notebook, so I thought I had the RAM covered as well. A “minor” detail had escaped my attention ‒ that SO-DIMM was of the 1333MT/s (667MHz) variety, whereas the board took 1600MT/s (800MHz) sharp ‒ my first booting of the board took me only as far as RAM controller negotiations.

One facepalm and a visit to the local store later, the board was hosting shiny-new 8GB of DDR3, to specs and all.

Yet another minor detail about the RAM had originally escaped my attention, but that detail was not crucial to the booting of the board, and I found it out only after the first boot: the SoC had a 32-bit RAM bus, so it was seeing half the capacity of the 64-bit DIMM. Perhaps it could be arranged for such a bus to see the full DIMM capacity ‒ I’m not a hw engineer to know such things, and the designers of the BFK 3.1 clearly did not arrange for that. Which is a bit unfortunate for a devboard. Oh well ‒ back to square ‘4GB of RAM’.
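One consequence of a 32-bit bus that is easy to put a number on is peak memory bandwidth. The following is only a back-of-the-envelope sketch: the function names are made up for the example, and it assumes the textbook relation that DDR memory performs two transfers per clock and moves one bus-width of data per transfer. Real sustained bandwidth would be lower.

```python
def ddr_clock_mhz(transfer_rate_mts: float) -> float:
    """DDR moves data on both clock edges, so the I/O clock is half the transfer rate."""
    return transfer_rate_mts / 2.0


def peak_bandwidth_bytes_per_s(transfer_rate_mts: float, bus_width_bits: int) -> float:
    """Theoretical peak: transfers per second times bytes moved per transfer."""
    return transfer_rate_mts * 1e6 * (bus_width_bits // 8)


if __name__ == "__main__":
    print(f"DDR3-1600 I/O clock: {ddr_clock_mhz(1600):.0f} MHz")
    # Compare the T1's 32-bit bus with the 64-bit width of a full DIMM.
    for width in (32, 64):
        gbs = peak_bandwidth_bytes_per_s(1600, width) / 1e9
        print(f"DDR3-1600 on a {width}-bit bus: {gbs:.1f} GB/s theoretical peak")
```

Under these assumptions the 32-bit bus tops out at half the theoretical figure of a full-width 64-bit channel at the same transfer rate.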

Apropos, as it turned out, I did really need RAM, since for exposing the full potential of the P5600 I had some compiler building ahead of me, and I always self-host builds when possible. But I’m getting ahead of myself.

The board arrives with a Busybox in SPI flash, and Baikal Electronics provide two revisions of Debian Stretch images with kernel 4.4 for day-to-day uses from a SATA drive. All available boot media are exposed via the cleanest U-Boot menu interface I’ve seen yet.

Footnote: aside from dd-ing the Debian image to the SSD, all interactions with the BFK 3.1 were done without involvement of PCs ‒ the above screengrab is from my trusty chromebook.

The obligatory dump of basic caps follows:



    blu@baikal:~$ uname -a
    Linux baikal 4.4.100-bfk3 #4 SMP Thu Feb 15 17:25:02 MSK 2018 mips GNU/Linux
    blu@baikal:~$
    blu@baikal:~$ cat /proc/cpuinfo
    system type             : Baikal-T Generic SoC
    machine                 : Baikal-T1 BFK3 evaluation board
    processor               : 0
    cpu model               : MIPS P5600 V3.0 FPU V2.0
    BogoMIPS                : 1196.85
    wait instruction        : yes
    microsecond timers      : yes
    tlb_entries             : 576
    extra interrupt vector  : yes
    hardware watchpoint     : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
    isa                     : mips1 mips2 mips32r1 mips32r2
    ASEs implemented        : vz msa eva xpa
    shadow register sets    : 1
    kscratch registers      : 3
    package                 : 0
    core                    : 0
    VCED exceptions         : not available
    VCEI exceptions         : not available
    processor               : 1
    cpu model               : MIPS P5600 V3.0 FPU V2.0
    BogoMIPS                : 1196.85
    wait instruction        : yes
    microsecond timers      : yes
    tlb_entries             : 576
    extra interrupt vector  : yes
    hardware watchpoint     : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
    isa                     : mips1 mips2 mips32r1 mips32r2
    ASEs implemented        : vz msa eva xpa
    shadow register sets    : 1
    kscratch registers      : 3
    package                 : 0
    core                    : 1
    VCED exceptions         : not available
    VCEI exceptions         : not available
    blu@baikal:~$



Whether the kernel saw this as a MIPS32r2 machine or it made use of the address extensions ‒ all that was beyond the scope of this first reconnaissance. I wanted to examine uarch performance, and as long as compilers were in the clear about the CPU’s true ISA capabilities I was set.

The VZ extension is a virtualization thing ‒ far from my interests. The EVA and XPA are addressing extensions ‒ Enhanced Virtual Address and Extended Physical Address, respectively. The former allows more efficient virtual-space mapping between kernel and userspace for the 32-bit/4GB process-addressable memory space. And the latter is, well, a physical address extension. From the P5600 manual:

Extended Physical Address (XPA) that allows the physical address to be extended from 32-bits to 40-bits.
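To put the quoted widths in perspective, here is a small arithmetic sketch. The helper names are invented for the example, and it only counts byte addresses; how much of that space a given SoC actually wires up is a separate matter.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 1 << address_bits


def as_binary_units(n_bytes: int) -> str:
    """Format a byte count using binary prefixes (KiB, MiB, GiB, TiB)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


if __name__ == "__main__":
    for bits in (32, 40):
        print(f"{bits}-bit physical address: {as_binary_units(addressable_bytes(bits))}")
    print(f"growth factor: {addressable_bytes(40) // addressable_bytes(32)}x")
```

Eight extra address bits multiply the physical address space by 2^8 = 256.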

Clearly both addressing extensions could be of good use to kernel developers. As for me, of the listed ISA extensions, MSA was the one I truly cared about.

How about FS performance?



    root@baikal:/home/blu# hdparm -tT /dev/sda1
    /dev/sda1:
     Timing cached reads:   2352 MB in  2.00 seconds = 1176.26 MB/sec
     Timing buffered disk reads: 1206 MB in  3.00 seconds = 401.89 MB/sec
    root@baikal:/home/blu#



As wise men say, ‘Have decent SATA performance ‒ will use for a build machine.’

And finally, an interrupts-related observation that might help me obtain cleaner benchmarking results:



    blu@baikal:~$ cat /proc/interrupts
               CPU0       CPU1
      1:      16906       9097  MIPS GIC Local   1  timer
      2:          0          0  MIPS GIC Local   0  watchdog
      8:       5428          0  MIPS GIC   8  IPI resched
      9:          0       4970  MIPS GIC   9  IPI resched
     10:       4693          0  MIPS GIC  10  IPI call
     11:          0      15118  MIPS GIC  11  IPI call
     23:          0          0  MIPS GIC  23  be-apb
     31:          0          0  MIPS GIC  31  timer0
     38:          0          0  MIPS GIC  38  1f200000.pvt
     40:          0          0  MIPS GIC  40  1f046000.i2c0
     41:        191          0  MIPS GIC  41  1f047000.i2c1
     47:          5          0  MIPS GIC  47  dw_spi0
     48:          0          0  MIPS GIC  48  dw_spi1
     55:       2464          0  MIPS GIC  55  serial
     56:         10          0  MIPS GIC  56  serial
     63:          0          0  MIPS GIC  63  dw_dmac
     71:      21832          0  MIPS GIC  71  1f050000.sata
     75:          0          0  MIPS GIC  75  xhci-hcd:usb1
     79:        652          0  MIPS GIC  79  eth1
     87:          0          0  MIPS GIC  87  eDMA-Tx-0
     88:          0          0  MIPS GIC  88  eDMA-Tx-1
     89:          0          0  MIPS GIC  89  eDMA-Tx-2
     90:          0          0  MIPS GIC  90  eDMA-Tx-3
     91:          0          0  MIPS GIC  91  eDMA-Rx-0
     92:          0          0  MIPS GIC  92  eDMA-Rx-1
     93:          0          0  MIPS GIC  93  eDMA-Rx-2
     94:          0          0  MIPS GIC  94  eDMA-Rx-3
     95:          0          0  MIPS GIC  95  MSI PCI
     96:          0          0  MIPS GIC  96  AER PCI
    103:          0          0  MIPS GIC 103  emc-dfi
    104:          0          0  MIPS GIC 104  emc-ecr
    105:          0          0  MIPS GIC 105  emc-euc
    134:          0          0  MIPS GIC 134  be-axi
    ERR:          0
    blu@baikal:~$



Notice how all serial and SATA interrupts are serviced by the 1st core? We could put that to some use.

Now the actual fun could begin! Being the control freak that I am, I tend to run a couple of micro-benchmarks when testing new uarchitectures ‒ one on the ‘gen-purpose’ side of performance, and one on the ‘sustained fp’ side of performance. Both of them being single-threaded, and the CPU at hand not featuring SMT, that meant I could focus on the details of the uarch by isolating all tests to the relatively-uninterrupted 2nd core.
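For readers who want to reproduce that kind of isolation, here is a minimal sketch of pinning a process to a single core from within the process itself. It is not the harness used for the numbers in this article: the function names `pin_to_last_cpu` and `timed` are made up for the example, and it assumes Linux, where Python exposes `os.sched_setaffinity`. On the two-core T1 the highest-numbered CPU is CPU1, the core that the interrupt table shows as mostly left alone.

```python
import os
import time


def pin_to_last_cpu():
    """Pin the calling process to the highest-numbered CPU it is allowed to run on.

    Returns the chosen CPU index, or None on platforms that lack
    os.sched_setaffinity (the call is Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    cpu = max(os.sched_getaffinity(0))  # pid 0 means "the calling process"
    os.sched_setaffinity(0, {cpu})
    return cpu


def timed(fn, *args):
    """Run fn(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    cpu = pin_to_last_cpu()
    # A toy workload stands in for a real single-threaded benchmark.
    _, seconds = timed(sum, range(1_000_000))
    print(f"pinned to CPU {cpu}, toy workload took {seconds:.4f} s")
```

From a shell, `taskset -c 1 ./benchmark` has the same effect for an arbitrary binary, with `./benchmark` standing in for the actual executable.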

Unfortunately, there was one last obstacle before me ‒ Debian Stretch comes with gcc-6.3 which does not know of the MSA extension in the P5600. For that I needed one major compiler revision later ‒ gcc-7.3 was fully aware of the novel instruction set, and so my next step was building gcc-7.3 for the platform. Easy-peasy. Or so I thought.

A short rant: I have difficulties understanding why a compiler’s default-settings self-hosted build would fail with an ‘illegal instruction’ in the bootstrap phase. But that’s the case with g++-7.3 on Debian Stretch when doing a self-hosted --target=mipsel-linux-gnu build on the BFK 3.1, and that’s what made me approach the gcc-dev mailing list with the wrong kind of support question, to which, luckily, I still got helpful responses.

Back to the BFK 3.1, where I eventually got a good g++-7.3 build via the following config, largely copied over from Debian’s g++-6.3:



    $ CPATH=/usr/include/mipsel-linux-gnu/ LIBRARY_PATH=/usr/lib/mipsel-linux-gnu/ \
        ../gcc-7.3.0/configure --enable-languages=c,c++ --enable-libstdcxx-debug \
        --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --disable-multiarch \
        --with-arch-32=mips32r2 --with-fp-32=xx --with-madd4=no --with-lxc1-sxc1=no \
        --enable-checking=release --build=mipsel-linux-gnu --host=mipsel-linux-gnu \
        --target=mipsel-linux-gnu
    $ CPATH=/usr/include/mipsel-linux-gnu/ LIBRARY_PATH=/usr/lib/mipsel-linux-gnu/ make -j2



Which gave me:



    blu@baikal:~$ /usr/bin/g++ --version
    g++ (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
    Copyright (C) 2016 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    blu@baikal:~$ g++ --version
    g++ (GCC) 7.3.0
    Copyright (C) 2017 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    blu@baikal:~$



Yay, got MSA compiler support! Now I could do all the SIMD I wanted, fp32 and otherwise.

But first I stumbled upon a surprise coming from the non-SIMD micro-benchmark ‒ a Mandelbrot plot written in the language Brainfuck, and run through a home-grown Brainfuck interpreter.

Running that before and after upgrading the compiler showed the following results:

Brainstorm Mandelbrot ‒ three versions of the code, across two compilers:

g++-6.3.0: 0m43.539s (vanilla)

g++-6.3.0: 0m38.176s (alt)

g++-6.3.0: 0m38.176s (alt^2)

g++-7.3.0: 0m36.003s (vanilla)

g++-7.3.0: 0m36.561s (alt)

g++-7.3.0: 0m31.852s (alt^2)

Notice how, for the exact-same code and the exact-same optimization flags, the two compilers produced a performance delta for the resulting binary as large as 20% in favor of the newer g++? That was not due to some new, smarter P5600 instructions utilized by the newer compiler ‒ nope, the generated code in both cases used the same ISA. It’s just that the newer compiler produced notably better-quality code ‒ fewer branches, more linear control flow. Yay for better compilers!
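The percentages behind that comparison are easy to recompute from the timings listed above. The following is a small sketch rather than part of the benchmark itself: the helper names are made up, the timings are copied verbatim from the list, and "faster" is defined here as the old run time divided by the new one, which is one of several reasonable conventions.

```python
import re


def parse_time(stamp: str) -> float:
    """Convert a shell-style duration such as '0m43.539s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time format: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def speedup(old_seconds: float, new_seconds: float) -> float:
    """Ratio of old run time to new run time; 1.2 means 20% faster."""
    return old_seconds / new_seconds


# (g++-6.3.0, g++-7.3.0) timings per code variant, as listed in the article.
TIMINGS = {
    "vanilla": ("0m43.539s", "0m36.003s"),
    "alt": ("0m38.176s", "0m36.561s"),
    "alt^2": ("0m38.176s", "0m31.852s"),
}

if __name__ == "__main__":
    for variant, (gcc6, gcc7) in TIMINGS.items():
        ratio = speedup(parse_time(gcc6), parse_time(gcc7))
        print(f"{variant:8s} g++-7.3.0 is {100 * (ratio - 1):.1f}% faster than g++-6.3.0")
```

By this measure the gain depends heavily on the code variant: it is close to 20% for two of the three variants and much smaller for the third.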

Those g++-7.3 results positioned the P5600 firmly between the AMD A8-7600 and the Intel Core2 Duo P8600 in the clock-normalized Mandelbrot performance charts (where the Penryn also takes advantage of the custom Apple clang compiler, which generally outperforms gcc at this combination of CPU and task).

Per-clock, the P5600 also scored ahead of the Cortex-A15, which I believe is the closest competitor in the category of the P5600. Where the P5600, or perhaps its incarnation in the Baikal T1, fell short, was in absolute performance due to low clocks. Should that core reach clocks closer to 2GHz, we’d be seeing much more interesting absolute-performance results.
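To get a rough feel for what higher clocks would buy, here is a sketch of the clock-normalization arithmetic. Two things in it are assumptions rather than facts from this article: a 1.2 GHz core clock for the T1 (the BogoMIPS value of 1196.85 hints at it but does not prove it), and run time scaling inversely with clock, which is optimistic because memory does not get faster along with the core. The function names are invented for the example.

```python
ASSUMED_T1_CLOCK_GHZ = 1.2  # assumption, not stated in the article


def giga_cycles(seconds: float, clock_ghz: float) -> float:
    """Clock-normalized cost of a run: billions of core cycles consumed."""
    return seconds * clock_ghz


def projected_seconds(seconds: float, clock_ghz: float, target_ghz: float) -> float:
    """Run time at another clock, assuming the cycle count stays the same."""
    return giga_cycles(seconds, clock_ghz) / target_ghz


if __name__ == "__main__":
    best_run = 31.852  # seconds, g++-7.3.0 'alt^2' result from the article
    cost = giga_cycles(best_run, ASSUMED_T1_CLOCK_GHZ)
    at_2ghz = projected_seconds(best_run, ASSUMED_T1_CLOCK_GHZ, 2.0)
    print(f"cost: {cost:.2f} Gcycles; idealized run time at 2.0 GHz: {at_2ghz:.2f} s")
```

Comparing the cycle cost, rather than seconds, is what allows cores running at very different clocks to be ranked against each other.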

Ok, it was time to see how the P5600 did at fp32 SIMD. For that an SGEMM matrix multiplier was to be used. Making use of the novel MSA ISA took minimal effort, partially thanks to gcc’s support for generic vectors, partially thanks to the simplicity of the MSA ISA. The MSA version of the matmul code, dubbed ‘ALT=8’, took less than an hour to code and tune, and resulted in ~3.9 flop/clock for the small, cache-fitting dataset (64×64 matrices), and 2.1 flop/clock for the large dataset (512×512 matrices). Those results placed the P5600 firmly between Intel Merom and Intel Penryn for the small dataset, and slightly below the level of ARM Cortex-A72 and Intel Merom for the large dataset. The large dataset, though, exhibited rather erratic behavior ‒ run-times varied considerably even when pinned to the 2nd core. It was as if the memory subsystem, past L2D, was behaving inconsistently when doing 128-bit-wide accesses. That warranted further investigation, which would happen on a better day.
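For anyone unfamiliar with the flop/clock metric, here is a sketch of how such a figure can be derived. It assumes the conventional operation count for a square matrix multiply, 2·n³ flops (n³ multiplies plus n³ adds), which is not spelled out in this article; the function names are made up, and the clock frequency is a parameter rather than a measured value.

```python
def matmul_flops(n: int) -> int:
    """Flops in an n x n by n x n multiply: n^3 multiplies plus n^3 adds."""
    return 2 * n ** 3


def flops_per_clock(n: int, seconds: float, clock_hz: float) -> float:
    """Throughput of one n x n multiply, normalized by core clock."""
    return matmul_flops(n) / (seconds * clock_hz)


def clocks_needed(n: int, flop_per_clock: float) -> float:
    """Core cycles one n x n multiply takes at a given flop/clock rate."""
    return matmul_flops(n) / flop_per_clock


if __name__ == "__main__":
    # Dataset sizes and flop/clock rates are the ones reported in the article.
    for n, rate in ((64, 3.9), (512, 2.1)):
        print(f"{n}x{n}: {matmul_flops(n):,} flops, "
              f"about {clocks_needed(n, rate):,.0f} clocks at {rate} flop/clock")
```

Because the metric divides out the clock, it describes the core and memory subsystem design rather than how fast a particular chip happens to be clocked.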

But let me finish my BFK 3.1 story here, and give my subjective, not-guaranteed-impartial opinion of the test subject.

My impressions of the P5600 in the Baikal T1 are largely positive. Using my limited micro-benchmark set as a basis, that uarchitecture does largely deliver on its promises of good gen-purpose IPC and good SIMD throughput per clock, and could be considered a direct competitor to the best of the 32-bit ARM Cortex designs. That said, Baikal T1 could use higher clocks, which would position it in absolute-performance terms right in the group of the Core2 lineup by Intel and the Cortex-A12/15/17 lineup by ARM. Which, if one thinks of it in the grand scheme of things, would be nothing short of a great achievement for the Baikal Warrior (Imagination aptly named the P-series MIPS designs ‘Warrior’ ‒ they’d have to fight for the survival of their ISA). If we ever live to see another Baikal T-series, that is ‒ Baikal Electronics are also developing their Baikal M-series ‒ ARM Cortex-A57 designs.

MIPS once turned the CPU world around. Can it survive its darkest hour (at least in the West ‒ in the East the Chinese have their Loongson) and step into a renaissance, or will it perish into oblivion? I, for one, would love to see the former, but I’m just an old coder, and old coders don’t get much say these days.