Formally known as Audio Source Separation, the problem we are trying to solve here consists of recovering or reconstructing one or more source signals that, through some (linear or convolutive) process, have been mixed with other signals. The field has many practical applications, including speech denoising and enhancement, music remixing, spatial audio, and remastering. In the context of music production, it is sometimes referred to as unmixing or demixing. There are plenty of resources on the subject, ranging from ICA-based (blind) source separation, to semi-supervised Non-negative Matrix Factorization techniques, to more recent neural-network-based approaches. For a nice walkthrough of the first two, you can check out these tutorial mini-series from CCRMA, which I found very useful back in the day.
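To make the linear (instantaneous) mixing case concrete, here is a minimal, hypothetical sketch of blind source separation using scikit-learn's FastICA. The signals, mixing matrix, and parameters below are made up purely for illustration; real audio mixtures (especially convolutive ones) are much harder than this toy setup suggests.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two made-up synthetic sources (a sine and a square wave), lightly noised.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                     # sinusoidal source
s2 = np.sign(np.sin(3 * t))            # square-wave source
S = np.c_[s1, s2] + 0.02 * rng.standard_normal((2000, 2))

# Linear, instantaneous mixing: each observed channel is a weighted sum
# of the sources, x = A s. A is an assumed (illustrative) mixing matrix.
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
X = S @ A.T                            # observed mixtures

# Blind separation: ICA estimates the sources without knowing A,
# up to permutation and scaling/sign of the components.
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)
```

Note that ICA only recovers the sources up to an arbitrary ordering and scale, which is exactly the kind of ambiguity that makes "blind" separation blind.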

But before jumping into design stuff… a liiittle bit of Applied Machine Learning philosophy…

As someone who’s been working in signal & image processing for a while, since before the ‘deep-learning-solves-it-all’ boom, I will present the solution as a Feature Engineering journey and show why, for this particular problem, an artificial neural network ends up being the best approach. Why bother with the journey? Very often I find people writing things like:

“with deep learning you don’t have to worry about feature engineering anymore; it does it for you”

or worse…

“the difference between machine learning and deep learning < let me stop you right there… Deep Learning is still Machine Learning! > is that in ML you do the feature extraction and in deep learning it happens automatically inside the network”.

These generalizations, probably stemming from the fact that DNNs can be pretty effective at learning good latent spaces, are just wrong. It frustrates me to see recent grads and practitioners being sold on the above misconceptions, going for the ‘deep-learning-solves-it-all’ approach as something you just throw a bunch of raw data at (yes, even after doing some pre-processing you can still be a sinner :)) and expecting things to just work as desired. In the real world, where your data is not as simple, clean, and pretty as the MNIST dataset, and where you have to care about things like real-time constraints, memory, and so on, these misconceptions can leave you stuck in experimentation mode for a very long time…