Improving single-network single-channel separation of musical audio with convolutional layers

Gerard Roma, Owen Green, Pierre Alexandre Tremblay
University of Huddersfield

Most convolutional neural network architectures explored so far for musical audio separation follow an autoencoder structure, where the mixture is considered to be a corrupted version of the original source. On the other hand, many approaches based on deep neural networks make use of several networks with different objectives for estimating the sources. In this paper we propose a discriminative approach based on traditional convolutional neural network architectures for image classification and speech recognition. Our results show that this architecture performs similarly to current state-of-the-art approaches for separating singing voice, and that adding convolutional layers improves separation results compared to using only fully-connected layers.
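As a rough illustration of the kind of discriminative architecture described in the abstract, the sketch below (PyTorch) stacks convolutional layers over a magnitude-spectrogram context window of the mixture and maps the result through fully-connected layers to time-frequency mask values for the four sources. All hyperparameters (FFT size, context length, filter shapes, layer widths) are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of a discriminative CNN for mask-based separation:
# conv layers over a mixture-spectrogram patch, then fully-connected
# layers predicting per-bin mask values for the centre frame.
# All sizes below are assumptions for illustration.
import torch
import torch.nn as nn

N_BINS = 513      # assumed FFT size of 1024 -> 513 magnitude bins
CONTEXT = 25      # assumed number of spectrogram frames per input patch
N_SOURCES = 4     # vocals, bass, drums, other

class SeparationCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        conv_out = 64 * (N_BINS // 4) * (CONTEXT // 4)
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(conv_out, 1024), nn.ReLU(),
            nn.Linear(1024, N_SOURCES * N_BINS),
        )

    def forward(self, x):
        # x: (batch, 1, N_BINS, CONTEXT) magnitude-spectrogram patches
        return self.fc(self.conv(x)).view(-1, N_SOURCES, N_BINS)
```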

This page contains audio examples of separation results on the test set of the DSD100 dataset. CNN1 corresponds to a network trained to optimize MSE loss over the 4 soft masks simultaneously, while CNN2 was trained to optimize negative log-likelihood with a softmax output. One good and one bad example were chosen based on the SDR of the extracted vocals.
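The two training objectives differ only in the output stage of the network. The sketch below contrasts them: a sigmoid output trained with MSE against soft mask targets for CNN1, and a per-bin softmax trained with negative log-likelihood of the dominant source for CNN2. Tensor shapes and the choice of mask targets are assumptions for illustration, not the exact setup used in the experiments.

```python
# Sketch (PyTorch) of the two output/loss configurations described above.
import torch
import torch.nn.functional as F

def cnn1_loss(logits, soft_mask_targets):
    # CNN1: four independent soft masks per frequency bin, trained with MSE.
    # logits, soft_mask_targets: (batch, 4, n_bins)
    soft_masks = torch.sigmoid(logits)
    return F.mse_loss(soft_masks, soft_mask_targets)

def cnn2_loss(logits, dominant_source):
    # CNN2: softmax over the four sources at each bin, trained with the
    # negative log-likelihood of the dominant source per bin.
    # logits: (batch, 4, n_bins); dominant_source: (batch, n_bins) int labels
    log_probs = F.log_softmax(logits, dim=1)
    return F.nll_loss(log_probs, dominant_source)
```

In either case, separation at test time would typically multiply the predicted masks with the mixture magnitude spectrogram and resynthesize each source using the mixture phase.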

CNN1

Target             Good example   Bad example
Original Mixture   [audio]        [audio]
Vocals             [audio]        [audio]
Bass               [audio]        [audio]
Drums              [audio]        [audio]
Other              [audio]        [audio]

CNN2

Target             Good example   Bad example
Original Mixture   [audio]        [audio]
Vocals             [audio]        [audio]
Bass               [audio]        [audio]
Drums              [audio]        [audio]
Other              [audio]        [audio]