To make a smart speaker

This is a collection of resources for making a smart speaker. The hope is that one day we can build an open source one for daily use.

A simplified flowchart of a smart speaker looks like this:

    +---+   +----------------+   +---+   +---+   +---+
    |Mic|-->|Audio Processing|-->|KWS|-->|STT|-->|NLU|
    +---+   +----------------+   +---+   +---+   +-+-+
                                                   |
                                                   |
    +-------+   +---+   +----------------------+   |
    |Speaker|<--|TTS|<--|Knowledge/Skill/Action|<--+
    +-------+   +---+   +----------------------+

Audio Processing includes Acoustic Echo Cancellation (AEC), Beamforming, Noise Suppression (NS), etc.

Keyword Spotting (KWS) detects a keyword (such as OK Google, Hey Siri) to start a conversation.

Speech To Text (STT) converts the spoken audio into text.

Natural Language Understanding (NLU) converts raw text into structured data.

Knowledge/Skill/Action - Knowledge base and plugins (Alexa Skill, Google Action) to provide an answer.

Text To Speech (TTS) converts the text answer into speech.
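The stages above can be sketched as a chain of functions. This is only a minimal illustration, not a real implementation: every function name, the wake word string and the toy intent format are hypothetical placeholders, and plain text stands in for audio so that the example runs without any speech library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Intent:
    """Structured data produced by the NLU stage (hypothetical format)."""
    name: str
    slots: dict = field(default_factory=dict)


WAKE_WORD = "hey speaker"  # hypothetical keyword, chosen for illustration


def spot_keyword(utterance: str) -> Optional[str]:
    """KWS: return the text after the wake word, or None if it is absent."""
    lowered = utterance.lower().strip()
    if lowered.startswith(WAKE_WORD):
        return lowered[len(WAKE_WORD):].strip(" ,")
    return None


def understand(text: str) -> Intent:
    """NLU: convert raw text into structured data with a single toy rule."""
    prefix = "turn on the "
    if text.startswith(prefix):
        return Intent("turn_on", {"device": text[len(prefix):]})
    return Intent("unknown")


def run_skill(intent: Intent) -> str:
    """Knowledge/Skill/Action: produce a text answer for the intent."""
    if intent.name == "turn_on":
        return f"Turning on the {intent.slots['device']}."
    return "Sorry, I did not understand that."


def handle(utterance: str) -> Optional[str]:
    """Run KWS -> NLU -> Skill on text.

    A real system would run STT before NLU and TTS on the returned answer.
    """
    command = spot_keyword(utterance)
    if command is None:
        return None  # no wake word heard, stay idle
    return run_skill(understand(command))
```

Calling `handle("Hey Speaker, turn on the lamp")` walks through all three stages, while an utterance without the wake word is ignored.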

KWS + STT + NLU + Skill + TTS

Active open source projects

Mycroft - a hackable open source voice assistant

dingdang robot - a Chinese voice interaction robot based on Jasper and built with Raspberry Pi

SDK

Google Assistant SDK - It arguably has the smartest brain. Its extensions, called Google Actions, can be created in a few steps with digitalflow.ai, and its Device Actions are well suited to smart home devices.

KWS

Mycroft Precise - A lightweight, simple-to-use, RNN wake word listener

Snowboy - DNN based hotword and wake word detection toolkit

Honk - PyTorch reimplementation of Google's TensorFlow CNNs for keyword spotting

ML-KWS-For-MCU - Perhaps the most promising option for resource-constrained devices such as ARM Cortex-M7 microcontrollers

STT

Mozilla DeepSpeech - A TensorFlow implementation of Baidu's DeepSpeech architecture

Kaldi

PocketSphinx - a lightweight speech recognition engine using HMM + GMM

NLU

Rasa NLU

Rasa NLU for Chinese

Snips NLU - a Python library that parses sentences written in natural language and extracts structured information

TTS

Mimic - Mycroft's TTS engine, based on CMU's Flite (Festival Lite)

MaryTTS - an open-source, multilingual text-to-speech synthesis system written in pure Java

espeak-ng - an open source speech synthesizer that supports 99 languages and accents.

ekho - Chinese text-to-speech engine

WaveNet, Tacotron 2

Audio Processing

Acoustic Echo Cancellation

SpeexDSP, its python binding speexdsp-python

EC - Echo Cancellation Daemon based on SpeexDSP AEC for Raspberry Pi or other devices running Linux

Direction Of Arrival (DOA) - The most widely used DOA algorithm is GCC-PHAT

tdoa

odas - ODAS (Open embeddeD Audition System) is a library for sound source localization, tracking, separation and post-filtering. It is written entirely in C for portability, is optimized to run on low-cost embedded hardware, and is free and open source.
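To give a feel for GCC-PHAT, here is a minimal dependency-free sketch that estimates the delay between two microphone signals. It is not taken from tdoa or odas; the function names are made up for illustration, and a naive O(n^2) DFT is used only to keep the example self-contained (a real implementation would use an FFT library).

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, fine for short frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(sig, ref, max_shift):
    """Estimate the delay of `sig` relative to `ref`, in samples.

    A positive result means `sig` lags behind `ref`.
    """
    n = len(sig) + len(ref)  # zero-pad so circular shifts do not wrap
    sig_f = dft(list(sig) + [0.0] * (n - len(sig)))
    ref_f = dft(list(ref) + [0.0] * (n - len(ref)))

    # Cross-power spectrum with PHAT weighting: keep phase, drop magnitude
    phat = []
    for s, r in zip(sig_f, ref_f):
        c = s * r.conjugate()
        mag = abs(c)
        phat.append(c / mag if mag > 1e-12 else 0j)

    # Back to the time domain: the peak marks the most likely delay
    cc = [v.real for v in dft(phat, inverse=True)]

    # Search lags in [-max_shift, max_shift]; negative lags wrap to the end
    lags = range(-max_shift, max_shift + 1)
    return max(lags, key=lambda lag: cc[lag % n])
```

For two microphones spaced d metres apart, and assuming a far-field source, a delay of tau samples at sample rate fs corresponds to an arrival angle of roughly asin(tau * c / (fs * d)) from broadside, where c is the speed of sound (about 343 m/s).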

Beamforming

BeamformIt - filter-and-sum beamforming

CGMM Beamforming - a reference implementation

MVDR Beamforming

GSC Beamforming

Voice Activity Detection

WebRTC VAD, py-webrtcvad

DNN VAD
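As a rough idea of what a VAD does, the sketch below labels fixed-size frames as speech-like or quiet using a plain energy threshold. This is a much simpler stand-in, not the method used by WebRTC VAD or a DNN VAD; the function names, frame length and threshold are assumptions chosen for illustration.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(samples, frame_len=160, threshold=0.05):
    """Label each full frame True (speech-like) or False (quiet).

    frame_len=160 is 10 ms at 16 kHz. The threshold is an assumed RMS
    level for samples scaled to [-1, 1] and would need tuning in practice.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_rms(frame) > threshold)
    return flags
```

An energy threshold like this breaks down quickly in noisy rooms, which is why the projects listed above are the better choice for real use.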

Noise Suppression

NS of WebRTC audio processing, python-webrtc-audio-processing

Audio I/O