Dan Stowell's PhD thesis

Queen Mary University of London, 2010

In brief: This thesis is about controlling synthesisers using vocal sounds such as beatboxing. Focusing on the timbre, we use machine learning methods to automatically infer how to map each vocal sound onto a synth sound.

Abstract

People can achieve rich musical expression through vocal sound: human beatboxing, for example, achieves a wide timbral variety through a range of extended techniques. Yet the vocal modality is under-exploited as a controller for music systems. If we can analyse a vocal performance suitably in real time, then this information could be used to create voice-based interfaces with the potential for intuitive and fulfilling levels of expressive control.

Conversely, many modern techniques for music synthesis do not imply any particular interface. Should a given parameter be controlled via a MIDI keyboard, or a slider/fader, or a rotary dial? Automatic vocal analysis could provide a fruitful basis for expressive interfaces to such electronic musical instruments.

The principal questions in applying vocal-based control are how to extract musically meaningful information from the voice signal in real time, and how to convert that information suitably into control data. In this thesis we address these questions, with a focus on timbral control. In particular, we develop approaches that can be used with a wide variety of musical instruments, applying machine learning techniques to automatically derive the mappings between expressive audio input and control output. The vocal audio signal is construed to include a broad range of expression, in particular encompassing the extended techniques used in human beatboxing.

The central contribution of this work is the application of supervised and unsupervised machine learning techniques to automatically map vocal timbre to synthesiser timbre and controls. Component contributions include a delayed decision-making strategy for low-latency sound classification, a regression-tree method to learn associations between regions of two unlabelled datasets, a fast estimator of multidimensional differential entropy, and a qualitative method for evaluating musical interfaces based on discourse analysis.
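
As a rough illustration of what mapping vocal timbre to synthesiser controls can mean in practice, the sketch below uses a plain nearest-neighbour lookup. It is not the method developed in the thesis. The two timbre features, the two synth controls, the training pairs and the function name `map_voice_to_synth` are all hypothetical, chosen only to show the shape of the problem.

```python
import math

# Hypothetical training pairs: (vocal timbre features, synth control settings).
# Assumed features: (spectral centroid in kHz, spectral flatness 0-1).
# Assumed controls: (filter cutoff 0-1, noise level 0-1).
TRAINING_PAIRS = [
    ((0.3, 0.10), (0.2, 0.0)),  # kick-like vocal sound
    ((2.5, 0.60), (0.7, 0.8)),  # snare-like vocal sound
    ((6.0, 0.90), (1.0, 1.0)),  # hi-hat-like vocal sound
]


def map_voice_to_synth(features, pairs=TRAINING_PAIRS):
    """Return the synth controls paired with the closest vocal example.

    Closeness is plain Euclidean distance in the (unnormalised) feature space.
    """
    nearest = min(pairs, key=lambda pair: math.dist(features, pair[0]))
    return nearest[1]


# A new vocal sound whose features sit closest to the snare-like example.
controls = map_voice_to_synth((2.2, 0.55))
print(controls)
```

In this toy version the mapping is written down by hand from labelled pairs and the features are left unscaled. The thesis is concerned with deriving such mappings automatically, including from unlabelled data.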