
Blind Separation of Audio Sources from Single and Stereo Mixtures
with a Special Consideration on Underdetermined Condition

A thesis/project submitted in partial fulfillment of the requirements of
Varendra University for the degree of BSc Engineering in CSE.

May 2017

Submitted by
ID: xxxxxxxxxx
Name:
Department of Computer Science and Engineering
Varendra University
Rajshahi, Bangladesh

Supervised by
Md. Khademul Islam Molla, PhD
Professor
Department of Computer Science and Engineering
Varendra University
Rajshahi, Bangladesh
Abstract
The separation of mixed audio signals is the problem of automatically separating the audio
sources present around a set of differently placed microphones capturing an acoustical scene.
The problem resembles the task a human solves in a cocktail-party situation, where, using two
sensors (ears), the brain can focus on a specific source of interest while suppressing all other
sources present. In this thesis, we examine the audio source separation problem using a range of
approaches to segregate the component sources from monophonic and stereo recordings. In
particular, we consider the underdetermined condition (i.e., the number of sensors is less than
the number of sources), which is a challenging topic in the field of blind source separation. …
Acknowledgements
Completing a PhD is usually a journey through a long and winding road, where one has to tame
oneself more than the actual phenomena under research. Luckily, I was not alone on this trip.
My sincerest thanks go to all those who have helped me get this far.
First, I would like to thank my supervisor, Professor Keikichi Hirose, for his inspiration,
support, and very fruitful collaboration, not to mention for coping with the mountains of
paperwork I must have caused. Without his generous help it would have been impossible to
finish this work. Next comes Dr. Nobuaki Minematsu, for his fruitful suggestions during this
journey.
Thanks to the Japanese Government (Monbukagakusho) scholarship for funding my study and
living expenses in Tokyo during my course, and to the 21st Century Center of Excellence (COE)
project of the University of Tokyo for the financial support to attend and present my work at
several conferences. Thanks to the Nippon Telegraph and Telephone (NTT) communication
research laboratory for permission to use the anechoic room and for technical support during the
audio recordings. I am grateful to the people of Bangladesh, who supported my graduate study
at a public university. Thanks also to the University of Rajshahi, Bangladesh, for granting me
study leave to pursue this degree.
Thanks to everyone else in the laboratory who rescued me from frustration during my early
days in Japan and helped me continuously despite my limited ability in the Japanese language.
They have made the lab a fun place to live for the last three and a half years.
I am deeply grateful to my parents for bringing me into this world and to my siblings for
keeping the warm family in which I grew up. My wife, Sheuly, has always inspired me and
deserves more than a mere acknowledgement.
Finally, I wish to thank all of my friends and well-wishers for their support over the time it has
taken to get this done.
Contents

Chapter 1: Introduction …………………………………………… 1


1.1 General Introduction 1
1.2 Applications of Audio Source Separation 8
1.3 Thesis Overview 10
1.4 Publications derived from this work 14

Chapter 2: Time-Frequency Representation ……………………… 17


2.1 Introduction 17
2.2 EMD and Hilbert spectrum (HS) 20
2.2.1 Modification of EMD 22
2.2.2 Instantaneous frequency 23
2.2.3 Hilbert Spectrum 25
2.2.4 Marginal Spectrum 27
2.2.5 Orthogonality of the Decomposition 30
2.2.6 Signal Reconstruction and Completeness 31
2.3 Short-time Fourier transform (STFT) 33
2.4 Some comparisons between STFT and Hilbert spectrum 36
2.4.1 Time-frequency Uncertainty 36
2.4.2 Accumulation of Cross-Spectral Energy 39
2.5 Disjoint Orthogonality of Audio Sources 42
2.6 Experimental results and discussion 44

Chapter 3: Audio source separation by subspace decomposition …. 48


3.1 Introduction 48
3.2 Subspace analysis with STFT 50
3.2.1 Proposed Separation Algorithm 52
3.2.1.1 Mathematical model 53
3.2.1.2 Dimension reduction 55
3.2.1.3 Selection of basis vectors 58
3.2.1.4 Independent basis vectors 62
3.2.1.5 Clustering of independent basis vectors 63
3.2.1.6 Source reconstruction 64
3.2.2 Experimental results 65
3.2.3 Discussion 67
3.3 Subspace decomposition of Hilbert spectrum 69
3.3.1 Separation algorithm 70
3.3.2 Deriving source-subspace and re-synthesis 72
3.3.3 Experimental results and discussion 74

Chapter 4: Localization Based Separation ……………………… 77


4.1 Introduction 77
4.2 Separation by beamforming 81
4.2.1 Spatial localization of the sources 82
4.2.2 Multi-band decomposition with EMD 87
4.2.3 Separation by sub-band beamforming 90
4.2.4 Experimental results 92
4.3 Separation by Time-Frequency masking 95
4.3.1 Source Localization 99
4.3.2 Source separation 101
4.3.3 Experimental results and discussion 102

Chapter 5: Moving Source Separation …………………………… 106


5.1 Introduction 106
5.2 Moving source tracking and separation 109
5.2.1 Source Tracking 109
5.2.2 Source Separation 112
5.3 Source Indexing 114
5.3.1 Computing sub-band LPCC using HOS 118
5.3.2 Dominant features selection with PCA 119
5.3.3 The weighted scoring function 124

Chapter 6: Experimental Results and Discussion ……………………… 126


6.1 Experimental setup 126
6.1.1 Sequential source indexing 126
6.1.2 Simultaneous moving source indexing 127

Chapter 7: Conclusions and Future Works ……………………… 131

References ………………………………………………………… 136

Chapter 1

Introduction

1.1 General Introduction


The separation of mixed audio sources can be defined as the problem of decomposing a real-
world sound mixture into its individual audio signals. In a multi-source audio environment,
humans exhibit a remarkable ability to extract a sound object of interest. Modern communication
systems, such as cellular phones, employ some speech enhancement procedure at the
preprocessing stage, prior to further processing [1]. One approach to separating mixed audio
signals is microphone array processing [2]. However, array processing requires heavy
computation and is inefficient for real-world applications. Hence, the present research trend is to
reduce the number of microphones used to record the intended acoustical environment.
Several noise reduction schemes have been developed that try to suppress the signal
components corresponding to noise and enhance the target component (e.g., speech) by
exploiting their respective characteristics. This technique corresponds to the use of only one
microphone. For instance, when spectral noise suppression schemes [3, 4, 5] are applied to
speech enhancement, it is assumed that the signal of interest is speech, with its typical speech
pauses, while the noise signal is regarded as stationary and uninterrupted. It is therefore possible
to estimate the noise spectrum during speech pauses and subsequently subtract it from the
spectrum of the noise-contaminated speech segments in order to obtain the enhanced speech
signal.
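As an illustration of this procedure, the following minimal Python sketch performs basic spectral subtraction; it is not the code of the systems in [3–5], and the function name, frame count, and STFT parameters are assumptions chosen for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, nperseg=512):
    """Estimate the noise magnitude spectrum from the first few
    (assumed speech-free) frames and subtract it from every frame
    of the noisy spectrogram."""
    f, t, Z = stft(noisy, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    # Noise spectrum estimated during an assumed speech pause
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract and half-wave rectify so magnitudes stay non-negative
    clean_mag = np.maximum(mag - noise_mag, 0.0)
    # Reuse the noisy phase for re-synthesis of the time-domain signal
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return enhanced
```

In practice the noise estimate is updated during every detected speech pause rather than taken once from the leading frames, but the fixed estimate above keeps the sketch short.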
Studies of human auditory perception construct experimental stimuli consisting of a few simple
sounds, such as sine tones or noise bursts, and then record human subjects' interpretation and
perception of these test sounds [6, 7].

1.2 Applications of Audio Source Separation


There are many potential applications where an audio source separation system can be useful:
 Robust Automatic Speech Recognition (RASR): Automatic speech recognition
accuracy degrades in the presence of noise or interfering sources, e.g., inside a car, in a
conventional office room, or on the street. A source separation method can be
employed to segregate the target from the mixture(s) of all the interfering audio sources.
Using source separation as the front end of a speech recognizer enhances its
robustness in real-world applications.
 Music transcription: Demixing a recording into the actual instruments playing in it is
an extremely useful tool for music transcribers. Listening to an instrument playing solo,
rather than the full recording, facilitates the transcription process. The same applies to
the automated polyphonic transcription algorithms that have appeared in research.
Combining a source separation algorithm with a polyphonic transcriber would lead to a
very powerful musical analysis tool.
 Audio coding: Each instrument in a music signal has different pitch, attack, and timbre
characteristics, and thus requires a different bandwidth for transmission. Decomposing a
musical signal into sound objects (instruments) enables different encoding and
compression levels for each instrument, depending on its characteristics. The result is a
more efficient, high-quality audio codec, more in line with the general framework of
MPEG-4 for video and audio.
 Telecommunication: Having unmixed the sources present in an auditory scene, one
can remove the unwanted noise sources in a multiple-source environment. This can
serve as a denoising utility for mobile phones and other telecommunication devices.
 Robotics: The localization-based separation of audio sources from stereo mixtures can
be used to implement the auditory system of a humanoid robot. A robot equipped with
such a separation technique can localize the audio sources even in adverse acoustical
conditions and separate the target source.
 Surveillance applications: The ability to localize and discriminate the audio
components of an auditory scene will enhance the performance of surveillance
applications.
 Remixing of studio recordings: In tomorrow's audio applications, with powerful tools
that can search for songs similar to the ones we like or that sound like a given artist,
audio source separation will make it possible to remix a studio recording according to
our personal liking. In addition, current stereo recordings can be remixed into a 5.1
speaker configuration (five satellite speakers and one subwoofer) without using the
original masters.

1.3 Thesis Overview


This thesis focuses on the separation of mixed audio signals implemented in the time–frequency
domain. In this attempt, we address a couple of open problems in the field, as explained further
on. The rest of the thesis is organized as follows:

1.3.1 Modification of EMD


It is noticed that EMD yields only a small number of modes (IMFs) whose energy falls
completely inside the frequency range of the original data. During the sifting process, most of
the IMFs include energy at frequencies that cannot be associated with the data. Table 1.1 shows
such redundant signal energy for the first three IMF components of Fig. 1.2. To demonstrate this
observation, the analyzing signal s(t) is passed through a band-pass filter (BPF). Each of the
IMFs extracted from the band-passed version of s(t) is then passed through the same BPF. A
fourth-order Butterworth filter with a frequency range of 80 Hz to 4 kHz is used in this
experiment, since most audio signals fall within that range.
Fig. 1.2: Hilbert spectrum (HS) using 256 frequency bins.
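The experiment can be sketched in Python as below; the PyEMD package (pip: EMD-signal) and the helper name out_of_band_energy are assumptions for illustration, not the exact code used in this work.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt
from PyEMD import EMD  # assumed third-party package for the decomposition

def out_of_band_energy(s, fs, lo=80.0, hi=4000.0):
    """Band-pass s(t), decompose it with EMD, and measure the fraction
    of each IMF's energy removed by a second pass through the same
    fourth-order Butterworth band-pass filter (80 Hz - 4 kHz)."""
    sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
    s_bp = sosfiltfilt(sos, s)           # band-limited analyzing signal
    imfs = EMD().emd(s_bp)               # IMFs of the band-limited signal
    ratios = []
    for imf in imfs:
        imf_bp = sosfiltfilt(sos, imf)   # same BPF applied to each IMF
        # Energy fraction lying outside the passband; nonzero values
        # indicate redundant out-of-band energy in that IMF
        ratios.append(1.0 - np.sum(imf_bp ** 2) / np.sum(imf ** 2))
    return ratios
```

If an IMF carried only in-band energy, its ratio would be near zero; the redundant energy reported in Table 1.1 corresponds to ratios well above that.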
The value of the overall index of orthogonality is less than the order of 10⁻⁵, which implies that
the amount of cross-term energy among the IMF components is negligible. This is analytical
evidence that the HS does not include cross-spectral energy. In Fourier-based filtering the index
of orthogonality mainly depends on the number of filters and their predefined bandwidths,
whereas EMD requires none of these parameters a priori to decompose the given signal into an
arbitrary number of IMF components. Another attractive characteristic of this decomposition is
its almost negligible reconstruction error. In many applications of TF representation, including
audio source separation, the time-domain signal must be reconstructed after some processing is
performed in the TF domain. Reconstruction from a spectrogram introduces spurious error
compared with the original signal.
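For concreteness, one common form of the overall index of orthogonality (after Huang et al.) and the maximum reconstruction error can be computed as in the sketch below; this is an illustrative formulation, not necessarily the exact one used in this work.

```python
import numpy as np

def orthogonality_index(imfs, residue, signal):
    """Overall index of orthogonality: total cross-term energy of all
    component pairs normalized by the signal energy, plus the maximum
    reconstruction (completeness) error. Values of the index near zero
    mean the components are mutually (almost) orthogonal."""
    comps = np.vstack([imfs, residue])   # IMFs and residue as rows
    recon = np.sum(comps, axis=0)        # completeness: sum of components
    cross = 0.0
    for j in range(len(comps)):
        for k in range(len(comps)):
            if j != k:
                cross += np.sum(comps[j] * comps[k])
    io = cross / np.sum(signal ** 2)
    max_err = np.max(np.abs(signal - recon))
    return io, max_err
```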

Table 1.1: Parameters of EMD analysis of different audio signals

Audio signal       Number of IMFs   Number of IMFs   Overall index of   Maximum
                   (original EMD)   (modified EMD)   orthogonality      error
Male speech-1      13               23               0.000061           3.12×10⁻¹⁴
Male speech-2      14               22               0.000028           2.53×10⁻¹⁵
Female speech-1    16               24               0.000186           1.82×10⁻¹⁴
Female speech-2    13               21               0.000037           2.43×10⁻¹⁴
Jazz music         12               21               0.000142           3.11×10⁻¹³
Flute sound        16               23               0.000073           2.56×10⁻¹⁴

1.3.1.1 Mathematical model


The magnitude spectrogram X of the mixture signal can be represented as the superposition of k
independent source spectrograms x_i:

X = \sum_{i=1}^{k} x_i \qquad (3.2)

The magnitude spectrogram of the i-th source is defined as x_i = F_i Z_i^T, where F_i and Z_i
can be represented as

F_i = [f_1^{(i)} \; f_2^{(i)} \; \cdots \; f_{b_i}^{(i)}], \qquad
Z_i = [z_1^{(i)} \; z_2^{(i)} \; \cdots \; z_{b_i}^{(i)}] \qquad (3.3)

where f_j^{(i)} and z_j^{(i)} are column vectors whose lengths equal the number of frequency
bins and time frames of X, respectively. Each f_j^{(i)} corresponds to a spectral basis vector
derived from X. The group of such bases, denoted by F_i, represents the overall spectrum of the
i-th source.
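A minimal NumPy sketch of this model follows; the dimensions and the random toy bases and activations are assumptions chosen purely to make the shapes in Eqs. (3.2) and (3.3) concrete, not the subspace method used to derive the bases in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative assumptions): 257 frequency bins,
# 200 time frames, k = 2 sources with b_i basis vectors each
n_bins, n_frames, b = 257, 200, [4, 3]

# F_i: spectral basis vectors as columns of length n_bins
# Z_i: time activations as columns of length n_frames
F = [rng.random((n_bins, bi)) for bi in b]
Z = [rng.random((n_frames, bi)) for bi in b]

# Magnitude spectrogram of each source: x_i = F_i Z_i^T   (cf. Eq. 3.3)
x = [Fi @ Zi.T for Fi, Zi in zip(F, Z)]

# The mixture spectrogram is the superposition of the sources (Eq. 3.2)
X = sum(x)
assert X.shape == (n_bins, n_frames)
```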

References

[1] Anemuller, J. and Gramss, T.: On-line blind separation of moving sound
sources. Proc. of Int. Conf. on Independent Component Analysis and Blind
Source Separation (ICA'99), pp. 331-334, 1999.
[2] Asano, F., Goto, M., Itou, K. and Asoh, H.: Real-time sound source localization
and separation system and its application to automatic speech recognition. Proc.
of Eurospeech 2001, pp. 1013-1016, 2001.
[3] Allen, J. B.: How do humans process and recognize speech? IEEE Transactions
on Speech and Audio Processing, Vol. 2, No. 4, pp. 567-577, 1994.
[4] Brown, G. J. and Cooke, M.: Computational auditory scene analysis. Computer
Speech and Language, Vol. 8, No. 4, pp. 297-336, 1994.
[5] Bofill, P.: Underdetermined blind separation of delayed sound sources in the
frequency domain. Neurocomputing, Vol. 55, No. 3/4, pp. 627-641, 2003.
[6] Boll, S. F.: Suppression of acoustic noise in speech using spectral subtraction.
IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 27, pp.
113-120, 1979.
[7] Bregman, A. S.: Auditory scene analysis. MIT Press, Cambridge, 1990.
[8] Bregman, A. S.: Auditory Scene Analysis: The Perceptual Organization of
Sound. MIT Press, 2nd edition, 1999.
[9] Breebaart, J., Van de Par, S. and Kohlrausch, A.: Binaural processing model
based on contralateral inhibition. I. Model structure. Journal of the Acoustical
Society of America, Vol. 110, pp. 1074-1088, 2001.
[10] Baeck, M. and Zolzer, U.: Real-time implementation of a source separation
algorithm. Proc. of DAFx-03, pp. 29-34, 8-11 Sep., 2003.
