Jose J. Lopez | Home page

Sound Source Separation

We start working Sound Source Separation in Oct 2006 by means of a funded project called WaFiSep . Now we have some results you can hear at the demo page.

Sound Source Separation Applications

During the last years, Sound Source Separation has been a subject of intense research. It refers to the task of estimating the signals produced by the individual sound sources from a complex acoustic mixture [3][4][5]. Although human listeners are able to perceptually segregate one sound source from an acoustic mixture, “machine listening” systems are still a great challenge.

SSS has a large number of potential applications: high quality separation of musical sources, signal/speech enhancement, multimedia documents indexing, speech recognition in a “cocktail party” environment or source localization for auditory scene analysis. However, current limitations in the existing methods might render some applications if not impossible, at least impractical. A given separation algorithm may perform well on some tasks and poorly on others. It’s because of this fact, that depending on the application, various factors affect the difficulty of the separation, and distinct criteria may be used to evaluate the performance of an algorithm.

Depending on the application, we could be interested in each individual extracted source or maybe just in extracting one source from the mixture (the target source). For example, extraction of singing voice from a song would be an important achievement [6], not just for remixing purposes, but areas like automatic lyrics recognition, singer identification or music information retrieval.

This paper is focused to an Audio Quality Oriented (AQO) application [7]. This means that the extracted sources will be listened to after the separation. In the case of our work, the main purpose will be to examine the possibilities offered by current audio source separation techniques applied to espatial sound systems. Positioning different sources in different space locations is well accomplished when separated tracks for each source are available. Most of the commercial music productions are recorded this way, but clean information of each source is lost in the mixing process. SSS techniques are the only way to recover the maximum possible information of the different sources.

Although separation algorithms produce resulting signals with plenty of artifacts, they might have less importance when separated sources are mixed again in a surround 5.1 or a WFS system. The isolated tracks for each instrument present artifacts that include mainly, inter-source crosstalk and metallic sound. However, when listening to these tracks all together processed with the WFS system, masking mechanisms are involved. This can make the audition of the resynthesized scene perceptually acceptable even if the separation methods applied are not very sophisticated or flawy.

Traditional Approaches

The main traditional approaches to the source separation problem have always been beamforming and independent component analysis (ICA). Beamforming achieves sound separation by using the principle of spatial filtering. The aim of beamforming is to boost the signal coming from a specific direction by a suitable configuration of a microphone array at the same time that signals coming from other directions are rejected. The amount of noise attenuation increases as the number of microphones and the array length increase. With a properly configured array, beamforming can achieve high-quality separation.

Independent component analysis models the mixture signal as a standard form of linear superposition of source signals. A mixing model of the form x(t)=As(t) is assumed, where s(t) is a vector of unknown source signals, A is a mixing matrix, and x(t) is a vector of the mixed signals recorded by several sensors. The main assumption in ICA is that sources involved in the mixing process are statistically independent. The separation problem consists in estimating the unmixing matrix (inverse of A). Separation results with ICA are excellent when the assumptions are satisfied, but this not always happen with audio signals [5]. In addition, the number of sensors should be at least equal to the number of sources to be separated. Another fundamental limitation is that the mixing matrix A needs to be stationary for a period of time. This assumption is difficult to satisfy in situations in which sound sources slightly move or the environment (acoustic path) changes.

The above techniques are useful just when several observations of the mixture are available. For WFS scene recreation, it would be much more interesting to develop specific algorithms for monaural or stereo recordings. We should concentrate on separation methods where the sources to be separated are not known in advance. These algorithms are based in common properties of real-world sounds, like continuity, sparseness or their harmonic spectral structures.

One-channel Sound Source Separation

The first works on one-channel sound source separation concentrated on the separation of speech signals [8][9]. Analysis and processing of music signals have recently received increasing attention [10][11]. Generally speaking, music is more difficult to be separated than speech. Musical instruments have a wide range of sound production mechanisms, and the resulting signals have a wide range of spectral and temporal characteristics. Even though the acoustic signals are produced independently in each source, it is their consonance and interplay which makes up the music [12]. This results in source signals which depend on each other, which may cause some separation criteria, such as statistical independence to fail.

Approaches used in one-channel sound source separation which do not use source-specific prior knowledge can be roughly divided into three categories, following the classification proposed in [12]:

Model based inference: These methods use a parametric model of the sources to be separated, and the model parameters are estimated from the observed mixture signal. In music applications, the most commonly used parametric model is the sinusoidal model. The model easily enables the prior information of harmonic spectral structure, which makes it the most suitable for the separation of pitched musical instruments and voiced speech [13].
Unsupervised learning: Unsupervised learning methods apply a simple non-parametric model, and use less prior information of the sources to be estimated. Instead, they try to learn the source characteristics from the observed data. The algorithms can apply information-theoretical principles, such as statistical independence between sources. Algorithms which are used to estimate the sources are based on independent subspace analysis [14], non-negative matrix factorization [12], and sparse coding [4].
Computational Auditory Stream Analysis (CASA): CASA methods [15] are based in the ability of humans to perceive and recognize individual sound sources in a mixture referred to as auditory scene analysis [16]. Computational models of this function typically consist of two main stages. First, the mixture signal is decomposed into its elementary time-frequency components. Then, these components are organized and grouped to their respective sound sources. Even though our brain does not resynthesize the acoustic waveforms of each source separately, the human auditory system is a useful reference in the development of one-channel sound source separation systems, since it is the only existing system which can robustly separate sound sources in various circumstances.

Sound Source Separation from Stereo Mixtures

Apart from monaural techniques, other approaches have been made to the problem of source separation in music recordings taking advantage of the stereo mixing process:

The ADRess algorithm [17], is able to distinguish different sources, analyzing the difference signal from the left and right channels. This is made by searching for minima in planes created from frequency and panning information. Obviously, if one-channel SSS could be achieved, the stereo problem would be solved just working on each channel independently.
The DUET algortihm [18] that is similar to ADRess.

References

[3] C. Jutten , M. Babaie-Zadeh, “Source separation: principles, current advances and applications”, presented at the 2006 German-French Institute for Automation and Robotic Annual Meeting, IAR 2006, Nancy, France, November 2006.

[4] P. D. O’Grady, B. A. Pearlmutter, and S. T. Rickard, “Survey of sparse and non-sparse methods in source separation”, IJIST (International Journal of Imaging Systems and Technology), 2005.

[5] K. Torkkola, “Blind separation for audio signals: are we there yet?”, Proceedings of the Workshop on Independent Component Analysis and Blind Signal Separation, 1999.

[6] Li Y., Wang D.L, “Separation of singing voice from music accompaniment for monaural recordings”, IEEE Transactions on Audio, Speech, and Language Processing, in press.

[7] E. Vincent, X. Rodet, A. Röbel, C. Févotte, É. Le Carpentier, R. Gribonval, L. Benaroya, and Fréderic Bimbot, “A tentative typology of audio source separation tasks”, ICA 2003.

[8] C. K. Lee, D. G. Childers, “Cochannel speech separation”, Journal of the Acoustical Society of America, 83(1), 1988.

[9] T. F. Quatieri, R. G. Danisewicz, “An approach to co-channel talker interference suppression using a sinusoidal model for speech”, IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(1), 1990.

[10] E. Vincen, X. Rodet, “Music transcription with ISA and HMM”, in Proceedings of the 5th International Symposium on Independent Component Analysis and Blind Signal Separation, Granada, Spain, 2004.

[11] T. Virtanen, "Unsupervised Learning Methods for Source Separation", in Signal Processing Methods for Music Transcription, eds. Klapuri, A., Davy, M., Springer-Verlag, 2006.

[12] T. Virtanen, “Sound Source Separation in Monaural Music Signals”, PhD. Thesis, presented at Tampere University of Technology, November 2006.

[13] T. Virtanen, “Accurate Sinusoidal Model Analysis and Parameter Reduction by Fusion of Components”, presented at the 110th Audio Engineering Society Convention, Amsterdam, Netherlands 2001.

[14] S. Dubnov, “Extracting sound objects by independent subspace analysis”, presented at the AES 22nd International Conference on Virtual, Synthetic and Entertainment Audio, Espoo, Finland, June 2002.

[15] D.L. Wang, G. J. Brown. “Computational Auditory Scene Analysis: Principles, Algorithms, and Applications”. IEEE Press/Wiley-Interscience, 2006.

[16] A. Bregman, “Auditory scene analysis”. MIT Press, Cambridge, USA, 1990.

[17] D. Barry, B. Lawlor, and E. Coyle, “Sound source separation: Azimuth discrimination and resynthesis”. Proceedings of the 7th Int. Conference on Digital Audio Effects (DAFTX 04), 2004.