Deep Learning as the Solution

Artificial neural networks (deep learning) improve with scale (Jeff Dean)

One theme went undebated: deep learning will dominate advancements in commercial AI for the foreseeable future. Why? The short answer is that it works incredibly well across a wide variety of applications. As Jeff Dean (Google) pointed out during his keynote presentation, deep neural networks can now outperform humans at image classification, and deep learning (DL) is used in almost every Google product. In a world of ever-growing volumes of unstructured data, DL provides a flexible framework for translating that information into a desired output or goal. As Charles Fan (Cheetah Mobile) put it:

“70% of the time, deep learning will solve 100% of the problem.”

Feature engineering vs. end-to-end learning

One of the biggest promises of deep learning (DL) is that domain expertise and handcrafted rules are no longer required to create powerful predictive or generative models. This point was underscored in Li Deng’s (Microsoft) presentation on the evolution of spoken dialog ([chat]bot) technology. Language models of the ’90s and early 2000s were based on human conceptions of language structure and meaning. Speech was first translated to text; handcrafted rules were then used to extract meaning from that text, decide on an answer, and synthesize a response. The DL approach, in contrast, does not require a human-interpretable internal model of language structure and meaning. This means an engineer no longer needs a PhD in linguistics to build a powerful algorithm that translates a spoken command into a computerized task.
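To make the contrast concrete, here is a minimal, purely illustrative Python sketch. Every function below is a hypothetical placeholder (not any vendor’s actual API), and the rule-based stages are reduced to toy stubs:

```python
# Purely illustrative; every function here is a hypothetical placeholder.

def transcribe(audio):          # speech -> text (acoustic + language model)
    return "what time is it"

def parse_intent(text):         # handcrafted rules extract meaning
    return "ask_time" if "time" in text else "unknown"

def decide_answer(intent):      # hand-written dialog logic
    return "It is noon." if intent == "ask_time" else "Sorry?"

def synthesize_speech(answer):  # rule-based text-to-speech stage
    return f"<audio: {answer}>"

# Classic pipeline: meaning is represented explicitly at every stage.
def classic_dialog(audio):
    return synthesize_speech(decide_answer(parse_intent(transcribe(audio))))

# End-to-end deep learning: one trained network maps audio to response,
# with no human-interpretable intermediate representation required.
def end_to_end_dialog(audio, model):
    return model(audio)
```

The point is structural: the classic pipeline exposes an explicit, human-designed representation at every stage, while the end-to-end model leaves those intermediate representations to be learned from data.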

Does that mean that traditional machine learning and human language paradigms no longer have a role in spoken machine tech?

Diagram of Alexa’s speech recognition algorithm (Nikko Ström)

Not quite. Nikko Ström (Amazon) described the combination of machine learning, acoustic modeling (phoneme classification), and DL in Alexa’s speech recognition and synthesis algorithms. Adam Coates (Baidu Research), however, made the point that “it’s hard to scale our own cleverness.” His research team has been applying deep neural networks to end-to-end speech recognition, and he emphasized that the same algorithm can be used for many different languages. While this approach, coined Deep Speech, demands incredible amounts of data and computation, Coates and Baidu are making it a commercial reality.

The possibility of having a single algorithm go from raw data to a desired task, or “end-to-end learning,” is a huge advantage offered by DL. Does this mean that feature engineering has become a thing of the past?

Deep Speech approach for speech recognition (Adam Coates)

Not necessarily. Even with the Deep Speech algorithm, the raw audio signal is transformed into a spectrogram, which represents the information as binned frequencies over binned time. The benefit is that this provides discretized features as inputs to the neural network; the cost is that decisions have to be made about how to bin the signal in both frequency and time. This process of transforming a sound signal into frequency components is actually what the cochlea, the sensory organ in our ears, does for us — it parses sounds into distinct frequency bands and transmits that information to downstream neurons. In a way, this can be thought of as a static method of extracting features from a signal, or feature engineering.
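As a concrete sketch of this pre-processing step, here is a minimal Python example using scipy.signal.spectrogram on a synthetic signal. The window and overlap values are arbitrary choices made for illustration, which is exactly the kind of binning decision at stake:

```python
import numpy as np
from scipy.signal import spectrogram

# Synthetic stand-in for a raw audio signal: 1 second of a 440 Hz tone
# plus noise, sampled at 16 kHz (a typical rate for speech).
fs = 16000
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)

# The binning decisions live in these parameters: nperseg sets the time
# window (here 25 ms), and noverlap sets the stride (a 10 ms hop).
freqs, times, spec = spectrogram(audio, fs=fs, nperseg=400, noverlap=240)

# spec is a (frequency bins x time bins) array: the discretized
# features that would be fed into the neural network.
print(spec.shape)  # e.g. (201, 98)
```

Change nperseg or noverlap and the feature grid the network sees changes with it.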

Can an artificial neural network transform an audio signal into its frequency components (i.e., compute a spectrogram)? Yes, but why force a network to learn features that you already know are essential for the task? By pre-processing raw data, you reduce the required depth of the neural network, and thus the number of parameters to fit and the amount of data needed for training. On the other hand, you run the risk of throwing out valuable information. That is why the extent to which you pre-process, or transform, the data fed into a DL network is still a decision that AI researchers have to make, and it’s not a trivial one.

These considerations are relevant not only for spoken dialog technologies but also for autonomous vehicles, the internet of things (IoT), and computer vision. While there is general agreement on the steps for transforming audio data, and to some extent 2D visual data, there are many other sensor and data types with no obvious pre-processing transformations. For GPS input, for example, there is no organic sensory organ, like the cochlea or retina, to mimic.

Sensor fusion and machine perception

Tesla’s multi-sensor research vehicle (Junli Gu)

Our world is full of photons, sound pressure waves, and objects all moving around. As humans, we process these inputs with millisecond precision and adjust our movements and behaviors to achieve ever-changing goals. With a 64-beam laser, 4 radars, 1 camera, and a GPS, an autonomous vehicle faces a similar challenge, as Junli Gu (Tesla) explained. The vehicle collects this analog data, digitizes it, and must make fast decisions that account for goals and contexts that can change constantly. To do so, information from all of these different sensors needs to be integrated across resolutions, timescales, and modalities, and translated into instructions for direction, speed, and braking.
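As a toy illustration of one small piece of that integration problem, the sketch below resamples two hypothetical sensor streams, recorded at different rates, onto a common clock. The rates and signals are invented for the example:

```python
import numpy as np

# Hypothetical sensor streams at different rates: a radar range reading
# at 20 Hz and a GPS speed estimate at 5 Hz, over a 2-second window.
t_radar = np.arange(0, 2, 1 / 20)
radar_range = 50.0 - 3.0 * t_radar + 0.2 * np.random.randn(t_radar.size)

t_gps = np.arange(0, 2, 1 / 5)
gps_speed = 3.0 + 0.1 * np.random.randn(t_gps.size)

# Resample both onto a common 10 Hz decision clock by linear
# interpolation, so downstream logic sees time-aligned inputs.
t_common = np.arange(0, 2, 1 / 10)
radar_aligned = np.interp(t_common, t_radar, radar_range)
gps_aligned = np.interp(t_common, t_gps, gps_speed)

# Stack into a (time steps x sensors) feature matrix for the controller.
fused_inputs = np.column_stack([radar_aligned, gps_aligned])
print(fused_inputs.shape)  # (20, 2)
```

Real systems face much harder versions of this problem (asynchronous clocks, dropped frames, differing spatial reference frames), but time alignment of this kind is a necessary precursor to any fusion.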

Deep Learning for semantic scene segmentation (Junli Gu)

While DL has enabled major progress in semantic scene segmentation, 3D depth inference, and reinforcement learning, it does so mostly by computing over each sensor type separately. Sensor fusion, or the combining of different sensory data, is still immature in autonomous vehicle technology. Mohawk Shas (Bosch) described a similar need for an infrastructure to combine sources in IoT technology. Will this challenge be solved by further applying principles of neuroscience (biology)? Will hardcoded logic still play a role? Or will an entirely new field with a new name emerge? Only the future has the answers to these questions, and according to Gu:
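To make “sensor fusion” slightly more concrete, here is a schematic PyTorch sketch of one common design, late fusion: each sensor gets its own small encoder, and the learned features are concatenated before a shared decision head. This is a generic illustration, not Tesla’s or Bosch’s actual architecture; all dimensions and layers are invented:

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """Toy late-fusion model: per-sensor encoders plus a shared head."""
    def __init__(self, camera_dim=512, lidar_dim=64, radar_dim=16):
        super().__init__()
        # Each modality is encoded separately, as most current systems do.
        self.camera_enc = nn.Sequential(nn.Linear(camera_dim, 128), nn.ReLU())
        self.lidar_enc = nn.Sequential(nn.Linear(lidar_dim, 64), nn.ReLU())
        self.radar_enc = nn.Sequential(nn.Linear(radar_dim, 32), nn.ReLU())
        # Fusion happens here: concatenated features feed one decision head
        # that outputs, say, steering, throttle, and brake commands.
        self.head = nn.Sequential(nn.Linear(128 + 64 + 32, 64), nn.ReLU(),
                                  nn.Linear(64, 3))

    def forward(self, camera, lidar, radar):
        fused = torch.cat([self.camera_enc(camera),
                           self.lidar_enc(lidar),
                           self.radar_enc(radar)], dim=-1)
        return self.head(fused)

# Fake batch of pre-extracted sensor features (batch size 8).
net = LateFusionNet()
out = net(torch.randn(8, 512), torch.randn(8, 64), torch.randn(8, 16))
print(out.shape)  # torch.Size([8, 3])
```

The hard research questions sit upstream of a sketch like this: when to fuse (early, late, or in between), how to weight unreliable sensors, and how to keep everything synchronized in real time.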

“whomever addresses the technical challenge [of sensor fusion] will harvest the influence.”

The ability to map multiple external inputs onto a cohesive internal signal has another name: perception. Tackling this challenge is exactly what Jay Yagnik (Google Research) plans to spend the next 3–4 years on. This includes developing frameworks for cross-modal (e.g., audio-visual) signals, scene understanding, and active (rather than passive) perception. The limitations of machine perception technology are also being confronted by Liu Ren and Bosch’s human-machine interaction (HMI) team as they develop sensory-aware augmented reality (AR) for their wearables. Gary Bradski of OpenCV outlined his endeavors in machine perception with his new company Arraiy.com. Quite distinct from AR, IoT, and self-driving cars, Bradski’s new team is developing AI to aid humans in generating creative content. As all of these groups develop their own digital cortices for different applications, the next question is: