Last updated: 28-05-2020

We are all exposed to different sounds every day: car horns, sirens, music, and so on. How about teaching a computer to classify such sounds automatically into categories? In this blog post, we will learn techniques to classify urban sounds into specific categories (such as car horn and siren) with machine learning. Earlier blog posts covered classification problems where the data can easily be expressed in vector form. For example, in the textual dataset, each word in the corpus becomes a feature and its tf-idf score becomes its value. Likewise, the anomaly detection dataset had two features, "throughput" and "latency", that were fed into a classifier to predict outliers. But when it comes to sound, feature extraction is not quite as straightforward. Today, we will first see what features can be extracted from sound data and how easy it is to extract such features in Python using an open-source library called Librosa.

To get started with this tutorial, please make sure you have the following tools installed (a quick way to verify the setup is shown after this list):

Tensorflow 2.x

Librosa

Numpy

Matplotlib
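If everything is installed correctly, the imports below should succeed. This is just a minimal sanity check and not part of the tutorial code; the exact version numbers on your machine may differ.

# Sanity check: all required libraries should import without errors
import tensorflow as tf
import librosa
import numpy as np
import matplotlib

print("Tensorflow:", tf.__version__)    # expecting 2.x
print("Librosa:", librosa.__version__)
print("Numpy:", np.__version__)
print("Matplotlib:", matplotlib.__version__)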

Dataset

We need a labelled dataset that can be used to train a machine learning model. Fortunately, researchers have open-sourced an annotated dataset of urban sounds. It contains 8,732 labelled sound clips (4 seconds each) from ten classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music. To get the dataset, please visit the following link, and if you use this dataset in your research, kindly don't forget to acknowledge the authors. The sound files in this dataset are in .wav format, but if you have files in another format such as .mp3, it's good to convert them into .wav first, because .mp3 is a lossy compression format; check this link for more information.

The dataset is pre-split into 10 folds, and to keep things manageable, we will pre-process the sound files and save them as NumPy arrays so that they are easy to use afterwards. Let's read some sound files and visualise them to understand how different each sound clip is from the others. Matplotlib's specgram method performs all the required calculation and plotting of the spectrum. Likewise, Librosa provides handy methods for waveform and log power spectrogram plotting. Looking at the plots shown in Figures 1 and 2, we can see apparent differences between sound clips of different classes.

### Load necessary libraries ###
import glob
import os
import librosa
import librosa.display   # not imported automatically by `import librosa`
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import specgram
from sklearn.model_selection import KFold
import tensorflow as tf
from tensorflow import keras

%matplotlib inline
plt.style.use('ggplot')

### Define helper functions ###
def load_sound_files(file_paths):
    raw_sounds = []
    for fp in file_paths:
        # librosa.load resamples every clip to 22050 Hz mono by default
        X, sr = librosa.load(fp)
        raw_sounds.append(X)
    return raw_sounds

def plot_waves(sound_names, raw_sounds):
    i = 1
    fig = plt.figure(figsize=(25, 60), dpi=900)
    for n, f in zip(sound_names, raw_sounds):
        plt.subplot(10, 1, i)
        librosa.display.waveplot(np.array(f), sr=22050)
        plt.title(n.title())
        i += 1
    plt.suptitle('Figure 1: Waveplot', x=0.5, y=0.915, fontsize=18)
    plt.show()

def plot_specgram(sound_names, raw_sounds):
    i = 1
    fig = plt.figure(figsize=(25, 60), dpi=900)
    for n, f in zip(sound_names, raw_sounds):
        plt.subplot(10, 1, i)
        specgram(np.array(f), Fs=22050)
        plt.title(n.title())
        i += 1
    plt.suptitle('Figure 2: Spectrogram', x=0.5, y=0.915, fontsize=18)
    plt.show()

def plot_log_power_specgram(sound_names, raw_sounds):
    i = 1
    fig = plt.figure(figsize=(25, 60), dpi=900)
    for n, f in zip(sound_names, raw_sounds):
        plt.subplot(10, 1, i)
        # librosa.logamplitude was removed in librosa 0.6;
        # power_to_db is the current equivalent
        D = librosa.power_to_db(np.abs(librosa.stft(f))**2, ref=np.max)
        librosa.display.specshow(D, x_axis='time', y_axis='log')
        plt.title(n.title())
        i += 1
    plt.suptitle('Figure 3: Log power spectrogram', x=0.5, y=0.915, fontsize=18)
    plt.show()

def extract_feature(file_name):
    X, sample_rate = librosa.load(file_name)
    stft = np.abs(librosa.stft(X))
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T, axis=0)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X), sr=sample_rate).T, axis=0)
    return mfccs, chroma, mel, contrast, tonnetz

def parse_audio_files(parent_dir, sub_dir, file_ext='*.wav'):
    features, labels = np.empty((0, 193)), np.empty(0)  # 193 => total features
    for fn in glob.glob(os.path.join(parent_dir, sub_dir, file_ext)):
        mfccs, chroma, mel, contrast, tonnetz = extract_feature(fn)
        ext_features = np.hstack([mfccs, chroma, mel, contrast, tonnetz])
        features = np.vstack([features, ext_features])
        # the class label is the second '-'-separated field of the file name
        labels = np.append(labels, int(os.path.basename(fn).split('-')[1]))
    return np.array(features, dtype=np.float32), np.array(labels, dtype=np.int8)

### Plot few sound clips along with their spectrograms ###
sound_file_paths = ["57320-0-0-7.wav", "24074-1-0-3.wav",
                    "15564-2-0-1.wav", "31323-3-0-1.wav",
                    "46669-4-0-35.wav", "89948-5-0-0.wav",
                    "46656-6-0-0.wav", "103074-7-3-2.wav",
                    "106905-8-0-0.wav", "108041-9-0-4.wav"]
sound_names = ["air conditioner", "car horn", "children playing",
               "dog bark", "drilling", "engine idling",
               "gun shot", "jackhammer", "siren", "street music"]

raw_sounds = load_sound_files(sound_file_paths)

plot_waves(sound_names, raw_sounds)
plot_specgram(sound_names, raw_sounds)
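A quick aside on the hard-coded sr=22050 in the plotting helpers: librosa.load resamples every file to 22,050 Hz mono by default, so all clips returned by load_sound_files share that rate. You can confirm this on any of the example clips above:

# Confirm librosa's default resampling behaviour on one example clip
X, sr = librosa.load("57320-0-0-7.wav")
print(sr)        # 22050, librosa's default target sample rate
print(X.shape)   # up to 4 seconds of mono audio, i.e. at most ~4 * 22050 samples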

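Also, as mentioned earlier, files in .mp3 format should be converted to .wav before processing. One way to do this is with the pydub library; the sketch below assumes pydub and its ffmpeg dependency are installed (neither is used elsewhere in this post), and "example.mp3" is a placeholder file name:

# Hypothetical conversion of an .mp3 file to .wav using pydub (requires ffmpeg)
from pydub import AudioSegment

sound = AudioSegment.from_mp3("example.mp3")  # "example.mp3" is a placeholder
sound.export("example.wav", format="wav")     # write an uncompressed .wav copy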
Feature Extraction

To extract useful features from the sound data, we will use the Librosa library. It provides several methods to extract a variety of features from a sound clip. We are going to use the following methods:

melspectrogram: Compute a mel-scaled power spectrogram

mfcc: Mel-frequency cepstral coefficients

chroma_stft: Compute a chromagram from a waveform or power spectrogram

spectral_contrast: Compute spectral contrast, using the method defined in [1]

tonnetz: Compute the tonal centroid features (tonnetz), following the method of [2]

To make the process of extracting features from sound clips easy, we defined two helper functions in the code listing above. The first, parse_audio_files, takes the name of a parent directory, the name of a sub-directory within the parent directory, and a file extension (default .wav) as input. It then iterates over all the files within the sub-directory and invokes the second helper function, extract_feature, which takes a file path as input, reads the file with the librosa.load method, and extracts and returns the features mentioned above. These two functions are all that is required to convert raw sound clips into informative features (along with a class label for each sound clip) that we can directly use to learn a model with a classification method of our choice.

Note that the class label of each sound clip is encoded in its file name. For example, if the file name is 108041-9-0-4.wav, then the class label is 9: splitting the file name by '-' and taking the second item of the resulting array gives us the class label. To summarize, we will iterate over each file within a fold to extract features and the corresponding labels, and save them as NumPy arrays (a quick check of the saved arrays follows the code below).

# Pre-process and extract features from the data
parent_dir = 'UrbanSounds8K/audio/'
save_dir = "UrbanSounds8K/processed/"
sub_dirs = np.array(['fold1', 'fold2', 'fold3', 'fold4',
                     'fold5', 'fold6', 'fold7', 'fold8',
                     'fold9', 'fold10'])
for sub_dir in sub_dirs:
    features, labels = parse_audio_files(parent_dir, sub_dir)
    np.savez("{0}{1}".format(save_dir, sub_dir),
             features=features, labels=labels)
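Each saved fold can now be loaded back with np.load. The sketch below (assuming fold1 has been processed and the paths above are unchanged) verifies the shapes: every clip yields a 193-dimensional feature vector, the concatenation of 40 MFCCs, 12 chroma bins, 128 mel bands, 7 spectral contrast values, and 6 tonnetz dimensions.

# Load one processed fold back and inspect it (assumes fold1 was saved above)
data = np.load("UrbanSounds8K/processed/fold1.npz")
features, labels = data["features"], data["labels"]

print(features.shape)  # (num_clips, 193): 40 mfcc + 12 chroma + 128 mel + 7 contrast + 6 tonnetz
print(labels.shape)    # (num_clips,), integer class ids 0-9
print(labels[:10])     # e.g. the first few class labels in the fold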