Snips Open Sources Tract

A Rust Neural Network library for the Edge

By Mathieu Poumeyrol and Joseph Dureau

After open sourcing Snips-NLU a year ago, Snips now shares Tract, a new piece of its embedded voice platform. Tract is Snips’ neural network inference engine.

While TensorFlow and, to a lesser extent, PyTorch dominate the ecosystem of neural network training solutions, the landscape for inference engines on tiny devices, such as mobile and IoT, is still pretty much open. TensorFlow teams are pushing TensorFlow Lite, and Microsoft's ONNX Runtime seems like the perfect complement to PyTorch. Hardware vendors are pushing their own solutions, be it the Android NN API, the ARM NN SDK or Apple's BNNS.

These libraries are not really interoperable, and have limited scopes: most of them, TensorFlow Lite included, only support a small subset of TensorFlow’s operator set. This means there is no clear go-to solution for an engineering team when their machine learning colleagues hand them a neural network to run on-device. On the other hand, machine learning teams will often have to pick a network architecture to satisfy the available runtime of a given target. Mimicking the interoperability efforts done by the software industry over the past decades, we think it is time to aim for a “train once, infer anywhere” approach.

This ambition led us, nearly two years ago, to develop our own solution for embedded inference. It has been used in production for over a year as part of our Wake word engine, and it enables our machine learning team to freely explore new families of networks with the confidence that we will be able to bring these networks to production.

Even in today’s context, reverting to an off-the-shelf inference library would mean sacrificing either modelling creativity, performance, or portability. We believe the embedded neural network inference landscape still needs to progress and converge, which is why we decided to contribute to the collective effort by open sourcing our solution. In this post, we describe the library, its performance, and provide high-level motivation for the implementation choices we made to optimize an application field that is close to our hearts: voice processing.

Cross-compilation

One of the major hurdles we encounter at Snips is cross compiling our solution for all of our targets. This includes small single-board computers designed for hobbyists like the Raspberry Pi Zero, 2 and 3, as well as industrial ones like NXP’s i.MX7 and i.MX8, Samsung’s Artik platforms, NVIDIA’s Jetson TX2, etc. The two main mobile platforms, iOS and Android, are also natural targets. This means we support ARMv6, ARMv7 and ARMv8 hardware platforms, on three major operating systems: GNU/Linux, iOS, and Android. We also support the regular non-IoT Intel systems, so that developers can work comfortably.

Snips Flow, our voice platform, includes a Wake word engine. This engine listens continuously, and triggers the speech recognition component when it hears the user say the Wake word (“Hey Snips”, “Alexa”, etc). The Wake word detector relies on a neural network that our machine learning team trains using TensorFlow. Two years ago, before TensorFlow Lite came into existence, the natural approach we tried was to embed TensorFlow as a library to execute our models on device.

However, TensorFlow is a big, complex framework, and, nearly two years ago, cross-compiling it as a library, for Android in particular, caused us so much pain that we realized we needed a plan B.

At Snips, we chose Rust as the main language to develop the Snips Flow platform. Our team got used to the comfort brought by working with a modern software environment where cross-compiling is not an obscure afterthought but a solid design principle. So we gave it a try. We used the protobuf library to parse the TensorFlow format, ndarray to operate on tensors, and implemented a handful of operators that were part of our first Wake word models. This was the birth of Tract.

While we finally managed to get TensorFlow working on all our target platforms, the sheer size of it motivated us to actually switch to shipping Tract instead of TensorFlow a few months later.

While the origin story may sound anecdotal, the reasons behind it are not. As of today, the continuous integration infrastructure of the Snips platform targets dozens of operating system and hardware combinations, with more added every month. This leads to some strong guiding principles in the evolution of Tract:

it is written in Rust

it is trivial to cross-compile

it is free of external dependencies: instead of linking computing libraries, like BLAS, we integrated the small bits that we actually use (which in turn gave us a few opportunities for optimizing even more)

at runtime, Tract detects the device it is running on in order to adapt and optimize its performance appropriately. This allows us to ship the same binary for ARMv6 and ARMv7 devices without sacrificing performance on the smarter chips.
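To give an idea of what this last principle means in practice, here is a minimal sketch of runtime CPU detection: parsing the Features line that Linux exposes in /proc/cpuinfo to choose between a NEON-optimized kernel and a plain fallback. All the names (select_kernel, matmul_neon, etc.) are hypothetical, for illustration only; this is not Tract’s actual code.

```rust
// Illustrative sketch: pick an optimized compute kernel at runtime
// based on the CPU features advertised in /proc/cpuinfo.
// Kernel names are hypothetical, not Tract internals.

fn has_feature(cpuinfo: &str, feature: &str) -> bool {
    cpuinfo
        .lines()
        .filter(|l| l.to_lowercase().starts_with("features"))
        .any(|l| l.split_whitespace().any(|f| f == feature))
}

fn select_kernel(cpuinfo: &str) -> &'static str {
    if has_feature(cpuinfo, "neon") {
        "matmul_neon" // ARMv7 and up with NEON SIMD
    } else if has_feature(cpuinfo, "vfp") {
        "matmul_vfp" // ARMv6 with VFP only (e.g. Raspberry Pi Zero)
    } else {
        "matmul_generic" // portable fallback
    }
}

fn main() {
    // Feature lines as they might appear on a Pi 2 and a Pi Zero.
    let armv7 = "processor : 0\nFeatures : half thumb fastmult vfp edsp neon vfpv3\n";
    let armv6 = "processor : 0\nFeatures : half thumb fastmult vfp edsp java tls\n";
    println!("{}", select_kernel(armv7)); // matmul_neon
    println!("{}", select_kernel(armv6)); // matmul_vfp
}
```

This is why a single shipped binary can still exploit NEON when it happens to run on a capable chip: the dispatch decision is deferred to startup instead of compile time.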

Today, some of these principles are actually vindicated by TensorFlow Lite. The latter is much easier to cross-compile than TensorFlow, and also takes the form of a static library free of external dependencies.

Streaming aware

Snips is focused on voice assistant technology. Neural networks are the natural solution for several tasks involved: Wake word detection, recognition of voice commands, speaker identification (finding out who is speaking) and acoustic modelling (translating sounds to phones or words, as part of a speech-to-text engine).

All of these tasks have to be performed in the context of interactive sessions with a human user: time is a central element of the problem description, and minimizing latency is key in making the interaction comfortable.

Wake word detection, for instance, has to happen “live”. Such a detector is always-on: there is no natural end to the input signal that it should wait for before processing the entire signal. In other words, the detection needs to happen in a streaming fashion, which means the engine needs to make a decision at every step in time based on the signal captured in the immediate past. This puts hard constraints on the neural network architecture, forbidding, for instance, reductions over the entire time axis.

This streaming constraint pushes the inference frameworks to their limits: most of the popular frameworks are designed with image classification in mind. They only work naturally over the entire signal, and become very awkward to work with when streaming is required, if they work at all. Kaldi — probably the most popular open-source speech-to-text framework — is a notable exception: its neural network inference engine is designed around streaming.

In TensorFlow, network graphs are “frozen” at train time, and contain a series of training-related idiosyncrasies. They must be optimized to run inference efficiently. Tract transforms networks after loading them, first by decluttering the network of those idiosyncrasies, then by adapting the network to the runtime environment. While TensorFlow Lite does this during a network translation stage, we chose to do it just before runtime, to be able to perform machine-dependent tweaks.

One of the critical transformations Tract performs is translating convolutional models to a streaming form. Indeed, our preferred architecture for Wake word detection and user identification consists of stacks of convolutional layers, in the same fashion as ResNet and WaveNet. While recurrent networks are relatively natural to run in a streaming fashion, convolutional networks need a bit more work: a lot of operations can be skipped by implementing caching around each convolution operator.

Convolutional networks transformed naively to frame-by-frame streaming stateful networks would suffer from operation and data dispersion: a lot of flow control logic and data access work would have to happen for each incoming audio frame while performing a relatively modest number of useful computing operations. To run efficiently, it is necessary to re-introduce a grouping of frames, which we nicknamed pulses. By processing a reasonable number of frames together, on the order of 8 or 16, we amortize the flow control overhead. Additionally, this provides vectorization opportunities to optimize the execution even further.
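The caching-plus-pulsing idea can be sketched in a few lines: a streaming 1D convolution that keeps the last k − 1 input frames as state, so each incoming pulse of frames is convolved without recomputing the past. This is a minimal illustration invented for this post, not Tract’s actual pulse implementation.

```rust
// Illustrative sketch (not Tract's real code): a streaming causal 1D
// convolution that caches the last k - 1 input frames between pulses.

struct StreamingConv1d {
    kernel: Vec<f32>, // k taps applied over the time axis
    cache: Vec<f32>,  // last k - 1 frames from previous pulses
}

impl StreamingConv1d {
    fn new(kernel: Vec<f32>) -> Self {
        let cache = vec![0.0; kernel.len() - 1]; // zero left-padding at start
        StreamingConv1d { kernel, cache }
    }

    /// Process one pulse of frames, producing one output per input frame.
    fn pulse(&mut self, frames: &[f32]) -> Vec<f32> {
        let k = self.kernel.len();
        // Prepend the cached context to the new frames.
        let mut buf = self.cache.clone();
        buf.extend_from_slice(frames);
        // Causal convolution: output[t] depends on inputs t-k+1 ..= t.
        let out: Vec<f32> = (0..frames.len())
            .map(|t| (0..k).map(|i| self.kernel[i] * buf[t + i]).sum::<f32>())
            .collect();
        // Keep the last k - 1 frames as context for the next pulse.
        self.cache = buf[buf.len() - (k - 1)..].to_vec();
        out
    }
}

fn main() {
    // A 3-tap box filter (sum of the last 3 frames), pulses of 4 frames.
    let mut conv = StreamingConv1d::new(vec![1.0, 1.0, 1.0]);
    println!("{:?}", conv.pulse(&[3.0, 3.0, 3.0, 6.0])); // [3.0, 6.0, 9.0, 12.0]
    println!("{:?}", conv.pulse(&[6.0, 6.0, 0.0, 0.0])); // [15.0, 18.0, 12.0, 6.0]
}
```

Note how the second pulse produces the same values a whole-signal convolution would at those positions: the cache carries exactly the context that would otherwise be recomputed, and processing several frames per call amortizes the per-pulse bookkeeping.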

Once networks run through these transformations, they can be fed in a streaming fashion and output decisions in real time. While we still consider it experimental, this “streaming and pulsing” transformation is applicable to all convolutional networks. We have successfully used it to run our WaveNet Wake word networks and ResNet speaker identification ones. Acoustic models will be next.

Beyond voice


Given the need to efficiently reuse cached data from previous iterations, our main use cases do not fit well into regular frameworks. But rather than solely working on voice-related applications, we felt like we could learn a lot from experimenting with more common use cases, in order to get a complementary assessment of the performance of Tract itself. We chose to implement enough operators to make a few popular models run, in order to get a fair comparison of the intrinsic performance of Tract against TensorFlow Lite. Obviously, supporting these operators also widens the field for possible third-party use cases of the library. Today, for instance, Tract can run unmodified pretrained Inception v3, MobileNet, and the acoustic model from DeepSpeech.

We also implemented about 85% of ONNX, initially to take advantage of its extensive test-suite that helps us cover the internal operations it shares with TensorFlow. But now, as a “free” side effect, our machine learning teams can choose to work with TensorFlow or PyTorch on a per-project basis.

On top of ONNX and TensorFlow, we are also considering new importers: we are specifically interested in importing the Kaldi format, as our speech-to-text engine uses it. We are also paying close attention to the NNEF format: despite broad support from hardware and software vendors and a very elegant specification, it has received very little attention from the machine learning community.

How good is Tract?

There are several answers to this question.

The first angle is “will Tract run this network?”. Supporting TensorFlow’s immense operator set is not, and will probably never be, a goal for Tract. On the TensorFlow front, we implement missing operators on a per-application basis. As of today, Inception v3, several of ARM’s keyword spotting networks, Mozilla’s DeepSpeech, MobileNet networks and others run. Adding operators is not difficult, or at least not much more so than implementing the computation itself.

On the ONNX front, things are a bit simpler: the operator set has a reasonable size, and comes with a test-suite. We cover about 85% of version 9 of the operator set. Recurrent operators are coming soon, and the last remaining flow control features are also on the roadmap. 100% coverage of ONNX is an objective for Tract.

Beyond the perimeter of supported operators comes the question of performance. Tract does not try to compete on the big-hardware side of things: Snips’ main interest lies in running neural networks on small devices. For instance, we have no support for GPU acceleration, but we want to be as efficient as we can on ARM CPUs. We pay attention to devices ranging from ARMv5 or v6 with VFP (like a Raspberry Pi Zero) to the bigger ARMv8 chips with extended SIMD instruction sets that can be found in most modern smartphones and may become the workhorse of the IoT industry. We also keep an eye on Intel CPUs, as they are the native environment for most developers.

On the chart below, we compare Tract and TensorFlow Lite’s performance on two neural network architectures that are supported by both libraries. The first one is an early version of the Snips Wake word detector, which we’ll call v1 for the sake of clarity. We have since switched to the convolutional architectures we prefer today, which aren’t supported by TensorFlow Lite. Snips Wake word v1 relies on a 1D CNN architecture. The second architecture we consider here is a 2D CNN network from the “Hello Edge” paper by ARM.