You managed to stand up again already? Ok, let’s go on:

The next thing we see is that the NVIDIA Jetson Nano isn’t scoring well at all. Although it has a CUDA-enabled GPU, it’s really not much faster than my old i7–4870HQ. But that’s the catch: ‘not much faster’ still means faster than a 50W, quad-core, hyperthreading CPU. From a few years back, true, but still. The Jetson Nano could never have consumed more than a short-term average of 12.5W, because that’s what I’m powering it with. That’s a 75% power reduction, with a 10% performance increase.
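To put those two numbers together, here is a small back-of-the-envelope sketch. It only uses the figures quoted above (50W for the i7, the 12.5W supply limit for the Nano, and roughly 10% more throughput). Keep in mind that both power figures are upper bounds rather than measurements, so treat the result as an indication, not a benchmark. The function names are made up for this example.

```python
# Figures taken from the text; both power values are upper bounds, not measurements.
I7_POWER_W = 50.0           # rated power of the i7
NANO_POWER_W = 12.5         # what the Jetson Nano's power supply can deliver
NANO_RELATIVE_SPEED = 1.10  # Nano throughput relative to the i7 (i7 = 1.00)


def power_reduction(old_w, new_w):
    """Fraction of power saved when going from old_w to new_w."""
    return 1.0 - new_w / old_w


def perf_per_watt_gain(relative_speed, old_w, new_w):
    """How many times more work per Watt the new device delivers."""
    return relative_speed * old_w / new_w


reduction = power_reduction(I7_POWER_W, NANO_POWER_W)
gain = perf_per_watt_gain(NANO_RELATIVE_SPEED, I7_POWER_W, NANO_POWER_W)

print(f"power reduction: {reduction:.0%}")
print(f"performance per Watt: {gain:.1f}x")
```

Under these assumptions the Nano does roughly 4.4 times more work per Watt, which is a much friendlier way of looking at ‘not much faster’.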

Clearly, the Raspberry Pi on its own isn’t anything impressive, not with the floating point model, and still not really anything useful with the quantised model. But hey, I had the files ready anyway, and it was capable of running the tests, so more is always better, right? And it’s still kind of interesting, because it shows the difference between the ARM Cortex A53 in the Pi and the A57 in the Jetson Nano.

NVIDIA Jetson Nano

So the Jetson Nano isn’t pumping out impressive FPS rates with the MobileNetV2 classifier, but as I already stated, that doesn’t mean it isn’t a great piece of useful engineering. It’s cheap, it doesn’t need a shitload of energy to run, and maybe the most important property is that it runs TensorFlow-gpu (or any other ML platform) like any other machine you’ve always been using before. As long as your script isn’t diving too deep into CPU architectures, you can run the exact same script you would on an i7+CUDA GPU, also for training! I do still feel like NVIDIA should preload L4T with TensorFlow, but I’ll try not to rage about this any longer. After all, they have a nice explanation on how to install it (don’t be fooled though, TensorFlow 1.12 is not supported, only 1.13.1).

Source:NVIDIA

Google Coral Edge TPU

Ok, I have a big love for nicely engineered, highly efficient, purpose-built electronic devices, so I’m maybe not perfectly objective. But this thing… It’s a thing of absolute beauty!

Penny for scale, source:Google

The Edge TPU is what we call an “ASIC” (Application Specific Integrated Circuit), which means that it has a combination of small electronic parts such as FETs and capacitors burned directly onto the silicon layer, in such a way that it does exactly what it needs to do to speed up inference.

Inference, yes: the Edge TPU is not able to perform backpropagation.

Coral USB Accelerator

The logic behind this sounds more complex than it is though. (Actually creating the hardware, and making it work, is a whole different thing, and is very, very complex. But the logic functions are much simpler.) The next image shows the basic principle around which the Edge TPU has been designed.

Source:Google

A net like MobileNetV2 consists mostly of convolutions followed by activation layers. A convolution is defined as:

Convolution

Which means nothing more than sliding the kernel over the image, multiplying each element (pixel) of the image under the kernel with the matching element of the kernel, and then adding these results up to create one pixel of a new ‘image’ (feature map). That is exactly what the main component of the Edge TPU was meant for: multiplying everything at the same time, then adding it all up at insane speeds. There is no ‘CPU’ behind this, it just does that whenever you pump data into the buffers on the left. If you’re really interested in how this works, look up “Digital Circuit” and “FPGA”, and you’ll probably find enough information to keep you busy for the next few months. Sometimes rather complex to start with, but really, really interesting!

But this is exactly why the Coral is in such a different league when comparing performance/Watt numbers: it is a bunch of electronics designed to do exactly the bitwise operations needed, with basically no overhead at all.
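If you prefer code over formulas, the multiply-and-add idea can be sketched in a few lines of plain Python. This is only an illustration of the principle (single channel, stride 1, no padding, and no kernel flip, which is how CNN frameworks usually define it), not how the Edge TPU or TensorFlow actually implement it. The `conv2d` function and the example values are made up for this sketch.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; each output pixel is a sum of products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = 0
            # Multiply every kernel element with the pixel underneath it...
            for i in range(kh):
                for j in range(kw):
                    acc += image[y + i][x + j] * kernel[i][j]
            # ...and the accumulated sum becomes one pixel of the feature map.
            row.append(acc)
        feature_map.append(row)
    return feature_map


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [1, 0, 1, 3],
]
kernel = [
    [1, 0],
    [0, -1],
]

print(conv2d(image, kernel))
# [[-4, -4, 2], [-4, -4, 4], [7, 7, 6]]
```

On a CPU those four nested loops run one multiplication after another. The whole point of dedicated hardware is that all those multiplications can happen at the same time.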

Internal schematic of a Google Cloud TPU — Source:Google

Why no 8-bit model for GPU?

A GPU is inherently designed as a fine-grained parallel float calculator. So using floats is exactly what it was created for, and what it is good at. The Edge TPU has been designed to do 8-bit stuff, and CPUs have clever ways of being faster with 8-bit stuff than with full bit-width floats, because they have to deal with this in a lot of cases.
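For those wondering what ‘8-bit stuff’ looks like in practice, here is a minimal sketch of the general idea behind a quantised model: floats are mapped to integers between 0 and 255 using a scale and a zero point. The `scale` and `zero_point` values below are picked by hand to keep the numbers readable. A real converter derives them per tensor from the data, so don’t read this as the exact scheme TensorFlow Lite uses.

```python
def quantize(values, scale, zero_point):
    """Map floats to 8-bit integers: q = round(x / scale) + zero_point, clamped to 0..255."""
    return [max(0, min(255, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map the 8-bit integers back to (approximate) floats."""
    return [(q - zero_point) * scale for q in q_values]


# Hand-picked, hypothetical parameters: representable range is -2.0 .. ~1.98
scale = 1 / 64
zero_point = 128

weights = [-1.0, -0.5, 0.0, 0.3, 1.0]
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)

print(q_weights)  # [64, 96, 128, 147, 192]
print(restored)   # [-1.0, -0.5, 0.0, 0.296875, 1.0]
```

Note how 0.3 comes back as 0.296875: you trade a bit of precision for values that fit in a single byte, which is exactly the kind of data the Edge TPU was built to chew through.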

Why MobileNetV2?

I could give you a lot of reasons why MobileNetV2 is a good model, but the main reason is that it’s one of the pre-compiled models that Google made available for the Edge TPU.

What else is available on the Edge TPU?

It used to be just MobileNet and Inception in their different versions, but as of the end of last week, Google pushed an update which allows us to compile custom TensorFlow Lite models. But the limit is, and will probably always be, TensorFlow Lite models. That is different with the Jetson Nano: that thing runs anything you can imagine.

Raspberry Pi + Coral vs the rest

Why does the Coral seem so much slower when connected to a Raspberry Pi? The answer is simple and straightforward: the Raspberry Pi only has USB 2.0 ports, while the others have USB 3.0 ports. And since we can see the i7–7700K is faster with the Coral than the Jetson Nano, but still doesn’t seem to score as well as the Coral Dev Board did when NVIDIA tested it, we can conclude the bottleneck is the data rate, and not the Edge TPU.
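To get a feeling for the difference between the two USB generations, here is a rough best-case estimate of how long it takes just to push one input image over the wire. I’m assuming a 224x224 RGB image with one byte per channel (a common MobileNetV2 input size, not something I measured here) and the nominal signalling rates of 480 Mbit/s for USB 2.0 and 5 Gbit/s for USB 3.0. Real-world throughput is lower because of protocol overhead and latency, so these numbers are lower bounds at best.

```python
USB2_BITS_PER_S = 480e6      # USB 2.0 High Speed, nominal signalling rate
USB3_BITS_PER_S = 5e9        # USB 3.0 SuperSpeed, nominal signalling rate
INPUT_BYTES = 224 * 224 * 3  # assumed input: 224x224 RGB, 1 byte per channel


def transfer_ms(num_bytes, bits_per_s):
    """Best-case time in milliseconds to move num_bytes over a link."""
    return num_bytes * 8 / bits_per_s * 1000


usb2_ms = transfer_ms(INPUT_BYTES, USB2_BITS_PER_S)
usb3_ms = transfer_ms(INPUT_BYTES, USB3_BITS_PER_S)

print(f"USB 2.0: {usb2_ms:.2f} ms per image")
print(f"USB 3.0: {usb3_ms:.2f} ms per image")
```

Under these assumptions a single image needs about 2.5 ms on USB 2.0 versus roughly 0.24 ms on USB 3.0, and that is before any overhead. When the accelerator itself only needs a few milliseconds per inference, a slow link quickly becomes a noticeable part of the total.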

Source:NVIDIA

Fading away

Ok, I’m the last one left in the office by now, I think this has been long enough for me, and probably for you as well. I have been absolutely blown away by the power of the Google Coral Edge TPU. But to me, the most interesting setup here was the NVIDIA Jetson Nano in combination with the Coral USB Accelerator. I will most certainly use that setup, it feels like a dream to work with.

I hope you had an interesting read. If there are any remarks or questions, do not hesitate to contact me. As usual, this is also where I tell you I will probably write something new soon, so yeah, keep your eyes open and all that. Cheers!