The photo below taken by VentureBeat’s Dean Takahashi at the first-ever Virtual Beings Summit two weeks ago says it all about this blog post: “AI makes a virtual being possible”. Over the past few weeks, several of our readers reached out to us interested in learning more about the intelligence behind TwentyBN’s AI avatar Millie.

“AI makes a virtual being possible” (Credit: Dean Takahashi @ Virtual Beings Summit)

Therefore, we are excited to take you on a trip deep inside the brain of TwentyBN’s AI avatar, Millie. Make sure to watch the 3 video demos below👇

The Brain of an AI Avatar

Millie’s brain is powered by a deep learning model trained on over 3 million proprietary video data collected by TwentyBN. Since our motto is to teach machines to perceive the world like humans, Millie’s brain has two major components, a vision model and an audio model, resembling how our brains process the sights and sounds we experience everyday.

Vision Model: Understanding human behavior

Our vision model is an extension of classical image-based object detectors to videos. Similar to typical object detectors such as YOLOv3, our vision model simultaneously localizes and classifies: it detects objects by putting bounding boxes around them and recognizes the object type of each bounding box (e.g. face, body, etc.).

Unlike image-based detectors, our model also understands the action being performed by each object. This means that it not only detects and distinguishes a human face from a human body but also determines, for instance, if a face is “smiling” or “talking” or if a body is “jumping” or “dancing”. This is made possible by training our model on video data rather than static images.

Compared to image detectors, which only process two-dimensional images, our model can process the time dimension. On the architecture side, we achieve this by using 3D-convolutional layers instead of 2D-convolutional layers, which enables the neural network to extract the relevant motion features that are needed to recognize the action performed by each object.

This end-to-end approach provides us with a very effective tool for analyzing and understanding human actions. In total, our model is trained to detect over 1,100 types of human activities and hand gestures. Another approach that we experimented with consists of breaking localization and classification into two independent sub-tasks: first, we extract bounding boxes and crop around each object using an object detector; then, we classify the action performed in each crop using a second action recognition model.

Our experience tells us that the end-to-end approach has several benefits over this two-step approach:

Simplicity : only one deep learning model to maintain.

: only one deep learning model to maintain. Multi-tasking / Transfer learning : solving both tasks at once should make the model better on each individual task.

: solving both tasks at once should make the model better on each individual task. Omniscience : the recognition of actions is not restricted to one individual in the scene; the model is aware of all the actions taking place at the same time.

: the recognition of actions is not restricted to one individual in the scene; the model is aware of all the actions taking place at the same time. Constant complexity : it does not matter how many people are in the scene. The computation required for multi-person recognition and action classification is as efficient as for one person.

: it does not matter how many people are in the scene. The computation required for multi-person recognition and action classification is as efficient as for one person. Speed: recognizing and classifying human actions simultaneously saves inference time. This makes it possible to run the vision model in real-time on edge-devices like the iPad Pro.

Audio Model: Speech-to-Intent

Previously relying on cloud solutions, we now have our own at-the-edge audio model for the avatar to understand not only what people do but also what they say. Unlike a typical yet complicated pipeline (see graph below), we are inspired by how humans process speech and went for the most direct and end-to-end approach: speech-to-intent. In this approach, one does not need to listen and transcribe every single word to understand the meaning of another person. A variety of speech-to-intent approaches have been explored (such as this). Similar to the Jasper model, the architecture of our model is fully convolutional.

Roland, TwentyBN’s CEO, favors the speech-to-intent approach:

“We are an intent-capturing company so going straight to human intent is how we roll at TwentyBN. By developing in-house audio capabilities for our AI avatar, we are simply transferring our success in video understanding to speech recognition. This is possible thanks to the world’s largest crowd-acting platform we’ve built, which allows us to rapidly source intent-based data at scale.”

3 benefits of the speech-to-intent approach include:

Robustness: in our experiments with different approaches to speech recognition, speech-to-intent ends up being more robust against noises, accents, and variations in people’s pronunciation of words.

in our experiments with different approaches to speech recognition, speech-to-intent ends up being more robust against noises, accents, and variations in people’s pronunciation of words. Light-weight: the neural network can easily run on small embedded devices in real time, such as a tablet and a smartphone.

the neural network can easily run on small embedded devices in real time, such as a tablet and a smartphone. Reactiveness: running the audio model on premise reduces latency drastically as compared to using cloud solutions, significantly improving the interaction and overall user experience with AI avatars like Millie.

Towards SenseNet: an Audio-Visual Model

As we speak, TwentyBN’s researchers and engineers are working on a new deep learning model called SenseNet, which fuses our vision and audio models into a single audio-visual model for intent classification.

SenseNet is crucial in bringing virtual beings from highly restricted environments into “the wild” (think of a seeing Alexa in a crowded supermarket). AI, just like humans, will need to rely on visual cues, that reveal the origin of a spoken utterance or the context of a scene. Merging vision with audio is a necessary step to achieve high intent classification accuracy in a messy environment.

We hope you have enjoyed this trip inside the brain of a virtual being. Let us know if you have any comments or questions. Make sure you share this article with your friends 🙌 and subscribe below 👇

Written by Guillaume and Nahua. Edited by Roland, David, Antoine, and Moritz. Illustrated by Anny.