Front Matter

If you just want to look at Python code, check out the repo on GitHub. The blog post was written in an IPython Notebook. You can download the notebook here.

# Audio Analysis with Image Processing Algorithms

GPUs were originally developed to simulate a specific system - the physics of light - yet today we use them to solve problems across many different domains. Could the same thing be done with image processing algorithms? Why not use them to process signals from other domains? To explore this idea I decided to use image template matching to identify audio. Basic template matching algorithms accept an image and a template, and return the likelihood of the template being at any position in the image. My goal was to match a given audio sample against a database of stored audio tracks using image template matching. (Basically, do what Shazam or SoundHound does.)

## Preprocessing

Audio data is usually represented as a 1D array of samples evenly spaced in time. Single-channel images are often stored and processed as 2D arrays, so I needed some representation of audio in two dimensions. The naive approach is to simply reshape the 1D array into a 2D array. However, this would be very fragile in practice. Say we started sampling an audio source half a second later than the one in the database: the reshaped image would be quite different. Two adjacent peaks in the database may end up on opposite sides of the image in the sample.

In [48]:

```python
audio_data = (np.sin(200 * np.arange(-30, 30, .01))
              + np.sin(500 * np.arange(-30, 30, .01)))
audio_sample = audio_data[60:1440]  # take a subsample
full_image = audio_data.reshape((60, -1))
sample_image = audio_sample.reshape((60, -1))
imshow(full_image)
figure()
imshow(sample_image)
```

Out[48]: <matplotlib.image.AxesImage at 0x1155bdcd0>

The second image does not immediately appear to be a subsample of the first. A simple image template matching algorithm slides the template over the image, comparing each overlap. (Here I will be using the match_template function from skimage.feature, which does essentially this.) Image template matching algorithms that work in this manner would not recognize the smaller image above inside the larger one.
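To make "slides the template over the image" concrete, here is a minimal pure-NumPy sketch of that idea. `naive_match` is a toy helper of my own, not the skimage implementation (which is far faster); it just computes the normalized correlation coefficient at every overlap position with explicit loops.

```python
import numpy as np

def naive_match(image, template):
    """Slide `template` over `image`, computing the normalized
    correlation coefficient at every overlap position."""
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t * t).sum())
    out = np.zeros((ih - th + 1, iw - tw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = t_norm * np.sqrt((p * p).sum())
            if denom > 0:
                out[r, c] = (p * t).sum() / denom
    return out

# Embed the template exactly and see where the score peaks.
np.random.seed(0)
image = np.random.rand(10, 10)
template = image[3:6, 4:8].copy()
scores = naive_match(image, template)
print(np.unravel_index(scores.argmax(), scores.shape))  # -> (3, 4)
```

An exact overlap scores 1.0 and everything else scores lower, which is the behavior the rest of the post relies on.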

In [49]:

```python
from skimage.feature import match_template
print match_template.__doc__[:340]
```

    Match a template to an image using normalized correlation.

    The output is an array with values between -1.0 and 1.0, which
    correspond to the probability that the template is found at that
    position.

    Parameters
    ----------
    image : array_like
        Image to process.
    template : array_like
        Template to locate.

In [50]:

```python
result = match_template(full_image, sample_image)
result.max()
```

Out[50]: 0.019117253

As you can see, match_template gives a very low probability of the template being anywhere in the image.

## Enter the Fourier Transform

This large sensitivity to time shifts can be mitigated by moving out of the time domain and into the frequency domain. This can be accomplished with the Discrete Fourier Transform, and it can be accomplished quickly with the Fast Fourier Transform. (For a good discussion of the DFT and FFT see this blog post by Jake VanderPlas.) In a nutshell, the Fourier transform tells us which frequencies carry the most energy in a time domain signal. If we take the Fourier transform of each column in our reshaped signal, we can get a feel for how the frequency components change over time. It turns out this is a common technique, known as a spectrogram. Since the frequency components in an audio signal usually change over time, the spectrogram combines the nice properties of both the time and frequency domains. Let's see a spectrogram of the dummy audio data:
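First, a quick standalone sanity check of the claim that the FFT reports the dominant frequencies. This toy example is separate from the notebook's data; with a one-second window, FFT bin k corresponds directly to k Hz.

```python
import numpy as np

# A one-second, 50 Hz tone sampled at 1 kHz.
rate = 1000
t = np.arange(rate) / float(rate)
signal = np.sin(2 * np.pi * 50 * t)

# For a real signal only the first half of the FFT bins is unique,
# and with a one-second window bin k corresponds to k Hz.
spectrum = np.abs(np.fft.fft(signal))
print(spectrum[:rate // 2].argmax())  # -> 50
```

The magnitude spectrum peaks in exactly the bin of the tone we synthesized, which is all the spectrogram below does, column by column.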

In [51]:

```python
def spectrogram(data, segment_size=60):
    end = len(data) - len(data) % segment_size
    stacked = data[:end].reshape((-1, segment_size))
    freq_space = np.fft.fft(stacked)
    real = np.abs(freq_space)
    # fft results are mirrored, this trims the excess
    trimmed = real.T[:segment_size // 2, :]
    return trimmed

spec = spectrogram(audio_data)
imshow(spec)
```

Out[51]: <matplotlib.image.AxesImage at 0x11174bdd0>

As you can see, there are two prominent horizontal bands. This indicates the audio signal consists of a combination of two frequencies that do not change over time. This makes sense, of course, because we generated the dummy audio data simply by summing two sine waves of different frequencies. Now we have something that looks like an image and, in theory, corresponds well with time-shifted audio subsamples. Let's test that theory:

In [52]:

```python
sample_spec = spectrogram(audio_sample)
result = match_template(spec, sample_spec)
result.max()
```

Out[52]: 1.0000006

Success! match_template is telling us that these two "images" are highly correlated. However, does it work with audio samples that are more than just the sum of two sine waves?

## Real Data

I ultimately want to use my computer to detect which episode of a given TV series is currently playing. Here I will be using Adventure Time Season 1. I have 11 WAV files (one for each episode) in a subdirectory adv_time.

In [56]:

```python
from scipy.io import wavfile

sampling_rate, audio = wavfile.read('adv_time/ep1.wav')
audio = np.sum(audio, 1)  # sum the channels
sample = audio[10000000:10200000]  # ~4.5 second subsample
spec = spectrogram(audio, segment_size=512)
sample_spec = spectrogram(sample, segment_size=512)
imshow(sample_spec)
```

Out[56]: <matplotlib.image.AxesImage at 0x114eb8090>

In [64]:

```python
%timeit result = match_template(spec, sample_spec)
```

    1 loops, best of 3: 18 s per loop

In [62]:

```python
plot(result[0, :])  # plot 1 dim as a line
```

Out[62]: [<matplotlib.lines.Line2D at 0x115184050>]

A peak in the result! match_template is fairly confident that the subsample is part of the original audio data. However, the template matching algorithm took about 20 seconds to run, which is entirely too long. We will need to lose some data to speed this up.

## Ask Nyquist

A man named Nyquist taught us all that the highest frequency we can reliably detect in a signal is equal to one half the sampling rate. This is known as the Nyquist frequency. The sampling rate of our data was returned by wavfile.read():

In [63]:

```python
sampling_rate
```

Out[63]: 44100

A sampling rate of 44.1 kHz means we can detect audio frequencies up to about 22 kHz. This is convenient, since the upper limit of human hearing is around 20 kHz. However, looking at the spectrogram above, it seems like almost all of the interesting bits are in the top half, and a large number of them are in the top eighth. It would seem the useful information embedded in the audio track of Adventure Time is not close to the upper limit of human perception. This implies we can downsample the audio by a factor of 8 and not lose too much information.
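Downsampling lowers the effective sampling rate, and with it the Nyquist frequency: any remaining energy above the new limit folds back into lower frequencies. A standalone toy illustration of that folding (not part of the episode pipeline; the 1 kHz rate and tone frequencies are made up for the demo):

```python
import numpy as np

rate = 1000  # samples per second, so the Nyquist frequency is 500 Hz
t = np.arange(rate) / float(rate)

# A 900 Hz tone sampled at 1 kHz is above the Nyquist frequency,
# so it aliases: sample for sample it equals a (negated) 100 Hz tone.
alias = np.sin(2 * np.pi * 900 * t)
tone_100 = np.sin(2 * np.pi * 100 * t)
print(np.allclose(alias, -tone_100))  # -> True

# The FFT agrees: the energy shows up in the 100 Hz bin, not at 900 Hz.
peak = np.abs(np.fft.fft(alias))[:rate // 2].argmax()
print(peak)  # -> 100
```

Averaging groups of adjacent samples, as in the cell below, also acts as a crude low-pass filter, which blunts (but does not eliminate) this folding.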

In [15]:

```python
downsampled = audio.reshape((-1, 8)).mean(1)
downsampled_sample = sample.reshape((-1, 8)).mean(1)
spec = spectrogram(downsampled, segment_size=512)
sample_spec = spectrogram(downsampled_sample, segment_size=512)
result = match_template(spec, sample_spec)
print result.max()
```

    0.714121

In [16]:

```python
%timeit match_template(spec, sample_spec)
```

    1 loops, best of 3: 1.23 s per loop

In [17]:

```python
plot(result[0, :])
```

Out[17]: [<matplotlib.lines.Line2D at 0x126972c10>]

Much better. The downsampling gave us a factor of ~15 speedup while still leaving plenty of signal above the noise. Even though the highest match score is now only 0.71, that is still much higher than the surrounding noise floor. However, I still want to drop some data to speed up the matching. If we downsample any more in the time domain we will really begin to lose important higher-frequency information, but we can still try to downsample in the frequency domain.

In [18]:

```python
def downsample2d(a, factor):
    e0 = a.shape[0] - a.shape[0] % factor
    e1 = a.shape[1] - a.shape[1] % factor
    shape = a.shape[0] // factor, a.shape[1] // factor
    sh = shape[0], a.shape[0] // shape[0], shape[1], a.shape[1] // shape[1]
    return a[:e0, :e1].reshape(sh).mean(-1).mean(1)

down_spec = downsample2d(spec, 2)
down_sample_spec = downsample2d(sample_spec, 2)
result = match_template(down_spec, down_sample_spec)
plot(result[0, :])
```

Out[18]: [<matplotlib.lines.Line2D at 0x11110bf50>]

In [19]:

```python
%timeit match_template(down_spec, down_sample_spec)
```

    1 loops, best of 3: 278 ms per loop

Another factor of ~4 speedup while still maintaining a good signal-to-noise ratio! At this point I am happy enough with the speed: with the downsampling I can compare a ~5 second subsample to about two hours of audio in one second. But there is still one final test. So far we have only been running the tests using exact subsamples. I need to find out whether this can work with samples taken using my computer's microphone. Below is the code I used to run this experiment. First I preprocessed the audio for each episode and stored it in a dictionary called store. I then used pyaudio on my laptop to record the audio playing from an episode of Adventure Time on my TV.

In [33]:

```python
import os
import pyaudio

def process_wavfile(filename, store):
    """Open the given wavfile, downsample it, compute the spectrogram,
    and downsample again. Store the result in the given `store` keyed
    under the filename."""
    name = filename.split('/')[-1].split('.')[0]
    sampling_rate, audio = wavfile.read(filename)
    downsampled = audio.reshape((-1, 16)).mean(1)
    spec = spectrogram(downsampled, segment_size=512)
    down_spec = downsample2d(spec, 2)
    store[name] = down_spec

def acquire_audio(seconds=5):
    """Acquire audio for the given duration."""
    rate = 11025
    chunk = 1024
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=rate,
                    input=True, frames_per_buffer=chunk)
    frames = []
    for _ in range(int(rate / chunk * seconds)):
        frames.append(stream.read(chunk))
    stream.stop_stream()
    stream.close()
    p.terminate()
    ary = np.fromstring(b''.join(frames), dtype=np.short)
    ary = ary.reshape((-1, 2)).mean(1)
    return ary

def process_acquired(ary):
    """Calculate the spectrogram and downsample the given audio array."""
    spec = spectrogram(ary, segment_size=512)
    down_spec = downsample2d(spec, 2)
    return down_spec

store = {}
for filename in os.listdir('adv_time'):
    process_wavfile('adv_time/' + filename, store)
```

In [42]:

```python
acquired = acquire_audio(5)
processed = process_acquired(acquired)
results = {}
for name, signature in store.iteritems():
    result = match_template(signature, processed)
    results[name] = result

top = sorted(results.items(), key=lambda i: i[1].max(), reverse=True)
for name, result in top[:3]:  # print the top three matches
    print name, result.max()
```

    ep1 0.750591
    ep8 0.458766
    ep10 0.440036

In [39]:

```python
for name, result in results.iteritems():
    plot(result[0, :])
```
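With ep1 well clear of the runners-up, one way to turn these scores into an accept/reject decision is a best-to-runner-up ratio test. This is my own suggestion, not part of the original experiment; the `best_match` helper and the 1.3 threshold are made up for illustration.

```python
import numpy as np

def best_match(results, min_ratio=1.3):
    """Return the name whose top match score beats the runner-up's by
    at least `min_ratio`, or None when no match is convincing.
    `results` maps name -> match_template result array."""
    ranked = sorted(results.items(), key=lambda kv: kv[1].max(), reverse=True)
    if len(ranked) < 2:
        return ranked[0][0] if ranked else None
    best_name, best_scores = ranked[0]
    _, runner_up_scores = ranked[1]
    if best_scores.max() >= min_ratio * runner_up_scores.max():
        return best_name
    return None

# With the scores printed above, ep1 (0.75) beats ep8 (0.46) comfortably.
demo = {'ep1': np.array([0.750591]),
        'ep8': np.array([0.458766]),
        'ep10': np.array([0.440036])}
print(best_match(demo))  # -> ep1
```

A ratio test is more robust than a fixed score threshold here, since absolute correlation values drop when the microphone adds noise.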