Google’s Coral project has recently gone out of beta. According to the benchmarks, Coral devices provide excellent neural network inference acceleration for DIY makers. These devices are built around a specialized Tensor Processing Unit ASIC (the Edge TPU), which proved to be somewhat tricky to work with, but its enforced limitations and quirks make the hacking rewarding. I was eager to explore the deep internals of the interoperation between TensorFlow and the Edge TPU, and to hack both to do cool, nonstandard, crazy things.

The following assumes that you are familiar with TensorFlow and Edge TPU basics. The official documentation is good, so looking through “TensorFlow models on the Edge TPU” and “Edge TPU Compiler” should be enough to proceed. Repeating my experiments requires Ubuntu Linux and an external Coral USB Accelerator.

First of all, the Edge TPU software is not fully open source. The most “tasty” bits, the edgetpu_compiler executable and the libedgetpu.so shared library, are proprietary. This fact increases the potential hacking complexity but also makes it more fun! For example, the only way to see which APIs libedgetpu.so exposes is to dump the exported symbols with objdump:

$ objdump -TCj .text /usr/lib/x86_64-linux-gnu/libedgetpu.so.1

/usr/lib/x86_64-linux-gnu/libedgetpu.so.1:  file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
000000000006baa0 g  DF .text  000000000000000d  VER_1.0  edgetpu::RegisterCustomOp()
0000000000072b40 g  DF .text  000000000000001f  VER_1.0  edgetpu::EdgeTpuContext::~EdgeTpuContext()
0000000000072ad0 g  DF .text  0000000000000006  VER_1.0  edgetpu::EdgeTpuContext::~EdgeTpuContext()
0000000000072ad0 g  DF .text  0000000000000006  VER_1.0  edgetpu::EdgeTpuContext::~EdgeTpuContext()
000000000006dc10 g  DF .text  000000000000000a  VER_1.0  tflite_plugin_destroy_delegate
000000000006be50 g  DF .text  00000000000001dd  VER_1.0  edgetpu_list_devices
000000000006bb80 g  DF .text  0000000000000107  VER_1.0  edgetpu_version
000000000006bab0 g  DF .text  000000000000000a  VER_1.0  edgetpu::EdgeTpuManager::GetSingleton()
000000000006d090 g  DF .text  0000000000000b7c  VER_1.0  tflite_plugin_create_delegate
000000000006bb20 g  DF .text  0000000000000012  VER_1.0  edgetpu_free_devices
000000000006bac0 g  DF .text  000000000000005e  VER_1.0  edgetpu::operator<<(std::ostream&, edgetpu::DeviceType)
000000000006c030 g  DF .text  0000000000000c1a  VER_1.0  edgetpu_create_delegate
000000000006bb40 g  DF .text  000000000000000a  VER_1.0  edgetpu_free_delegate
000000000006bb50 g  DF .text  0000000000000024  VER_1.0  edgetpu_verbosity

This output clearly shows that the Edge TPU API is nailed to TensorFlow Lite with long, thick nails. It is simply impossible to use the device in any other way. If you expected to see a lower-level API like “multiply this matrix by that vector on the Edge TPU”, bad luck.

So let’s quickly recap how the TensorFlow Lite API for the Edge TPU works:

1. Generate a computation graph using regular TensorFlow, e.g., by training a deep neural network.
2. Convert it to the TensorFlow Lite format, which is flatbuffers instead of protobuf and uses a different schema. The new graph must be prepared specially to make friends with the Edge TPU; notably, the contained operations (ops) must be quantized to uint8 because the Edge TPU can only work with unsigned bytes.
3. Cross your fingers and convert it once again, this time with edgetpu_compiler. The underlying format stays the same, but the supported ops get fused and compiled into a single magical Edge TPU block.
4. Make sure the Coral device is attached, create a new TensorFlow Lite Interpreter with the Edge TPU ops delegate, and invoke.

Thus it is impossible to run an arbitrary calculation on the Edge TPU without calling external programs and reading and writing files along the way.

The crucial detail in this multistep procedure is the list of ops that can be compiled for the Edge TPU. TensorFlow Lite does not support all the bells and whistles of its elder brother, and the Edge TPU supports only a fraction of what’s left. For example, there is no matrix multiplication (surprise! I empirically checked the “output tensor is one-dimensional” restriction of FullyConnected). These limitations make anything other than convolutional and fully-connected neural network inference on the Edge TPU hard, really hard. But not impossible. Here is where the post turns fun.

Motion Blur on Edge TPU

The motion blur effect is the result of a 2D convolution of the image with a “radius” kernel: a straight line of equal weights running through the kernel’s center at the motion angle.

Horizontal motion blur. Left: original photo. Right: photo after applying the effect. Source: Wikipedia.

In TensorFlow terms, that operation is called DepthwiseConv2d; it is widely used in deep convolutional neural networks and is supported by the Edge TPU. Image pixels can be represented in RGB format, one byte per channel, which is exactly what the Edge TPU needs. Let’s go through all the pits and perils and benchmark how fast motion blur image filtering on the Edge TPU is with Python!

0 → tf.function

Forget about the existence of TensorFlow Lite and the Edge TPU in this section; let’s get familiar with the main logic first. The following code creates the convolutional kernel: dim is its size and angle is the motion angle on the plane, in radians.
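The original gist is not embedded here, but the idea fits in a few lines of NumPy. Treat this as a sketch rather than the article’s exact code: the function name create_motion_blur_kernel and the sub-pixel line-sampling scheme are my own; only dim and angle come from the text.

```python
import numpy as np

def create_motion_blur_kernel(dim: int, angle: float) -> np.ndarray:
    # Hypothetical helper: draw a line ("radius") through the kernel's
    # center at the given angle and normalize the weights so that the
    # overall image brightness is preserved.
    kernel = np.zeros((dim, dim), dtype=np.float32)
    center = (dim - 1) / 2
    # Walk along the line in sub-pixel steps, accumulating hits per cell.
    for t in np.linspace(-center, center, dim * 4):
        x = int(round(center + t * np.cos(angle)))
        y = int(round(center + t * np.sin(angle)))
        kernel[y, x] += 1.0
    return kernel / kernel.sum()
```

For example, create_motion_blur_kernel(25, 1.57) produces a nearly vertical streak of weights summing to 1.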

matplotlib is always handy when it comes down to dirty visuals.

The next step is to test our motion blur effect with regular TensorFlow 2.0. We leverage tf.nn.depthwise_conv2d to calculate the 2D convolution of an image with our kernel. All the strides equal 1 so that the image dimensions do not change.
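The corresponding code is roughly the following; a sketch of the approach rather than the article’s verbatim gist (the factory name create_motion_blur_func is mentioned later in the text, but the argument plumbing here is my reconstruction):

```python
import numpy as np
import tensorflow as tf

def create_motion_blur_func(kernel: np.ndarray):
    # tf.nn.depthwise_conv2d expects a filter of shape
    # [height, width, in_channels, channel_multiplier], so we repeat the
    # 2D kernel over the three RGB channels.
    filt = tf.constant(
        np.repeat(kernel[:, :, None, None], 3, axis=2), dtype=tf.float32)

    @tf.function
    def motion_blur(images):
        # images: a float32 tensor of shape [batch, height, width, 3].
        # Stride 1 and SAME padding keep the spatial dimensions intact.
        return tf.nn.depthwise_conv2d(
            images, filt, strides=[1, 1, 1, 1], padding="SAME")

    return motion_blur
```

Usage: `blurred = create_motion_blur_func(kernel)(images)`, where images has shape [batch, height, width, 3].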

In Jupyter, one can quickly measure the motion blur performance with %timeit motion_blur(images). It yields something like 5.30s±0.09 on my 4x2 (HT) Intel i7-8565U CPU.

tf.function → tflite

Now that we are sure that the overall approach works, it is time to port it to TensorFlow Lite.

We have to specify the input signature of the tf.function in create_motion_blur_func because TensorFlow Lite currently does not allow variable shapes except for the first "batch" dimension. Hence our motion blur can only work with images of a fixed size.

create_motion_blur_func_lite is a wrapper around create_motion_blur_func that converts the latter to TensorFlow Lite. generate_lite_model initializes tf.lite.TFLiteConverter from the computation graph belonging to the tf.function, our motion blur algorithm, and writes the conversion result to disk. create_func_lite loads it back, sets up a new tf.lite.Interpreter, and returns the invocation closure.
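A minimal sketch of this plumbing might look as follows. The helper names follow the ones mentioned above, but their bodies are my reconstruction, not the article’s exact code:

```python
import tensorflow as tf

def generate_lite_model(func, input_shape, path):
    # Freeze the tf.function with a fixed input signature: TensorFlow Lite
    # only tolerates a variable first ("batch") dimension, so everything
    # else must be pinned down.
    concrete = func.get_concrete_function(
        tf.TensorSpec(input_shape, tf.float32))
    converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete])
    with open(path, "wb") as fout:
        fout.write(converter.convert())

def create_func_lite(path):
    # Load the converted model back and wrap the interpreter in a closure.
    interpreter = tf.lite.Interpreter(model_path=path)
    interpreter.allocate_tensors()
    input_detail = interpreter.get_input_details()[0]
    output_detail = interpreter.get_output_details()[0]

    def func(images):
        interpreter.set_tensor(input_detail["index"], images)
        interpreter.invoke()
        return interpreter.get_tensor(output_detail["index"])

    return func
```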

According to %timeit, the new implementation is faster: 3.50s±0.24 vs. 5.3s. This performance boost is surprising because, according to the system monitor, the execution utilizes only one of my eight CPU cores. We can visualize the resulting .tflite model with netron:

TensorFlow Lite motion blur graph visualized with netron.

tflite → Edge TPU

Finally, we need to transit from vanilla TensorFlow Lite to Edge TPU. This step is by far the trickiest and the most complex. We will continue building on top of the existing code, adding one feature at a time.

The Edge TPU requires the uint8 operation data type (dtype) instead of float32. Unfortunately, we cannot make tf.nn.depthwise_conv2d work directly with uint8: only float64, float32, bfloat16, and float16 are supported. Therefore we have to resort to "post-training quantization," which means forging the dtypes and adding quantization properties to all the ops. gen_input_samples emulates the range of pixel values from 0 to 255; that's how quantization is parameterized in TensorFlow Lite. We then invoke edgetpu_compiler on the quantized model to replace the 2D convolution op with optimized code for the Edge TPU. Finally, tf.lite.Interpreter has to be augmented with experimental_delegates=[load_delegate("libedgetpu.so.1.0")] to let it know what to do with that optimized Edge TPU op.
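Sketched in code, the quantizing conversion might look as follows. gen_input_samples is the name used above; generate_quantized_lite_model and the exact converter settings are my assumptions about the setup, not verbatim code from the article:

```python
import numpy as np
import tensorflow as tf

def gen_input_samples(shape, count=8):
    # Emulate the 0..255 range of pixel values so that the converter can
    # derive quantization parameters that cover real images.
    for _ in range(count):
        yield [np.random.uniform(0, 255, size=shape).astype(np.float32)]

def generate_quantized_lite_model(func, input_shape, path):
    # Hypothetical helper: post-training quantization of a tf.function.
    concrete = func.get_concrete_function(
        tf.TensorSpec(input_shape, tf.float32))
    converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete])
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = lambda: gen_input_samples(input_shape)
    with open(path, "wb") as fout:
        fout.write(converter.convert())
```

The written model is then fed to edgetpu_compiler, and the compiled result is loaded into an interpreter constructed with the experimental_delegates argument shown above.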

In an ideal world where edgetpu_compiler supported TensorFlow 2.0, the code above would work. Let's run it and see.

Edge TPU Compiler version 2.0.267685300

Model compiled successfully in 231 ms.

Input model: motion_blur_1_1920_1058_3_25_1.57.tflite
Input size: 3.03KiB
Output model: motion_blur_1_1920_1058_3_25_1.57_edgetpu.tflite
Output size: 296.85KiB
On-chip memory available for caching model parameters: 1.73MiB
On-chip memory used for caching model parameters: 10.00KiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 3
Operation log: motion_blur_1_1920_1058_3_25_1.57_edgetpu.log

Model successfully compiled but not all operations are supported by the Edge TPU.
A percentage of the model will instead run on the CPU, which is slower. If possible,
consider updating your model to use only operations supported by the Edge TPU.
For details, visit g.co/coral/model-reqs.

Number of operations that will run on Edge TPU: 1
Number of operations that will run on CPU: 2

Operator           Count  Status
DEPTHWISE_CONV_2D  1      Mapped to Edge TPU
DEQUANTIZE         1      Operation is working on an unsupported data type
QUANTIZE           1      Operation is otherwise supported, but not mapped due to some unspecified limitation

DEPTHWISE_CONV_2D compiles successfully; however, there are weird DEQUANTIZE and QUANTIZE ops that do not. They are TensorFlow 2.0 artifacts that the compiler does not support and that arose from the float32 dtype enforced in motion_blur_func's signature. The netron visualization should make everything clear.

Quantized TensorFlow Lite model (left) and compiled Edge TPU model (right) visualized with netron.

Thus we have to do redundant work four times:

1. We switch the pixel values from uint8 to float32 and pass them to the TensorFlow Lite engine.
2. TensorFlow Lite executes QUANTIZE and switches back to uint8.
3. Having calculated the convolution, it returns to float32 in DEQUANTIZE.
4. TensorFlow Lite returns control to the caller, and we convert back to uint8 in order to save the image.

Throwing away QUANTIZE and DEQUANTIZE from the original .tflite would make the model great again. Unfortunately, there is no easy way to achieve this: there is simply no official API to manipulate TensorFlow Lite models. We need to dig deeper.

Digging deeper

I mentioned that the TensorFlow Lite model format is flatbuffers. It is a relatively new, general-purpose serialization format that, I bet, competes with protobuf internally at Google. flatbuffers' Python API does not allow modifying existing files, bummer. We are lucky that flatbuffers exhibits two different representations of an object, binary and JSON, and supports lossless conversion between them. There are flatc -j and flatc -b commands to convert from .tflite to JSON and back, respectively. We are going to exercise them to rip the redundant ops out of the model, provided the schema is public. In fact, this is the method the TensorFlow Lite developers themselves use to upgrade .tflite models.
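Once flatc has produced the JSON, the surgery itself is plain dictionary manipulation. The following is a simplified sketch of the idea: strip_quantize_ops is a hypothetical name, it assumes a single subgraph whose only surviving op is the convolution, and the real schema has more fields than shown here.

```python
def strip_quantize_ops(model: dict) -> dict:
    """Remove the leading QUANTIZE and trailing DEQUANTIZE ops from a
    .tflite model dumped to JSON, rewiring the subgraph inputs/outputs
    to the surviving op, and flip int8 tensors to uint8."""
    names = [oc["builtin_code"] for oc in model["operator_codes"]]
    subgraph = model["subgraphs"][0]
    kept = [op for op in subgraph["operators"]
            if names[op.get("opcode_index", 0)]
            not in ("QUANTIZE", "DEQUANTIZE")]
    assert len(kept) == 1, "expected exactly one op to survive"
    subgraph["operators"] = kept
    # The model's real input is the surviving op's first input tensor
    # (the rest are weights and bias); its outputs become the model's.
    subgraph["inputs"] = kept[0]["inputs"][:1]
    subgraph["outputs"] = kept[0]["outputs"]
    # Flip the int8 tensors that TensorFlow 2.0 produced to uint8,
    # shifting zero points accordingly (int8 x maps to uint8 x + 128).
    for tensor in subgraph["tensors"]:
        if tensor.get("type") == "INT8":
            tensor["type"] = "UINT8"
            quant = tensor.get("quantization", {})
            if "zero_point" in quant:
                quant["zero_point"] = [zp + 128
                                       for zp in quant["zero_point"]]
    return model
```

The patched JSON is then converted back to binary with flatc and fed to edgetpu_compiler again.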

The code reveals that the original .tflite model had the wrong dtypes: int8 instead of uint8. TensorFlow 2.0 tries to apply multi-channel quantization without being asked; the Edge TPU does not support multi-channel quantization, so we have to fix that, too.

The second attempt is more successful:

Edge TPU Compiler version 2.0.267685300

Model compiled successfully in 171 ms.

Input model: motion_blur_1_1920_1058_3_25_1.57.tflite
Input size: 2.71KiB
Output model: motion_blur_1_1920_1058_3_25_1.57_edgetpu.tflite
Output size: 296.56KiB
On-chip memory available for caching model parameters: 1.73MiB
On-chip memory used for caching model parameters: 10.00KiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 1
Operation log: motion_blur_1_1920_1058_3_25_1.57_edgetpu.log

Operator           Count  Status
DEPTHWISE_CONV_2D  1      Mapped to Edge TPU

Patched TensorFlow Lite model (left) and compiled Edge TPU model (right) visualized with netron.

Now, the promised benchmark. I installed libedgetpu-max, which does not limit the operating frequency. My results are 5.00s±0.25 and 0.262s±0.001 for the original and Edge TPU versions, respectively. The Edge TPU is 10-20 times faster than the fastest float32 implementation on my CPU! The comparison is not fair, of course, since only one CPU core is utilized to run the original .tflite, and I cannot change that in Python (it looks possible in C++). I expect the real speedup to lie between 2x and 4x. Besides, a properly vectorized uint8 CPU implementation should be about 4x faster than float32 (e.g., pillow-simd). So the Edge TPU has no fair supremacy. On the bright side, the Coral device consumes at least 20x less power.

The image produced on the Edge TPU looks identical to the ground truth, but it is not byte-exact because of the quantization precision loss.