Source Separation and DOA Estimation for Underdetermined Auditory Scene

In human-machine communication, the separation of a target speech signal and its localization in noisy environments are important tasks. [1] Recent advances in sensor array signal processing offer a promising technology for carrying out these tasks. [2] It uses multi-channel acoustic data collected by an array of microphones to detect and produce output signals that are far more intelligible and suitable for communication and automatic speech recognition. [3]


Introduction

BSS problem
Blind source separation (BSS) aims to estimate source signals using only their mixed signals, without any a priori information about the mixing process or the acoustic circumstances. The cocktail-party problem is a typical BSS problem. [1] Basically, the BSS problem can be solved by exploiting intrinsic properties of speech signals, and many methods have been proposed depending on which property is used. The two most widely applied approaches are:
1. Independent component analysis (ICA) [4]-[8], and
2. Time-frequency sparseness of source signals [9]-[14].
The ICA-based separation relies on the statistical independence of speech signals in the time domain [5] [7] as well as in the frequency domain [6]. In addition, [8] proposed a dynamic recurrent separation system that exploits the spatial independence of located sources as well as their temporal dependence. The second approach, on the other hand, exploits the sparseness of speech signals in the time-frequency (T-F) domain, where only a small number of T-F components are dominant in representing a speech signal. The T-F sparseness leads to a disjointness of T-F components between speech signals, called the W-disjoint orthogonality (WDO) property [11] [12]. It means that at most one source dominates at every T-F point; in other words, different speech signals rarely generate the same frequency at the same time.
Though the ICA approach performs well even in reverberant conditions, it has difficulty with the underdetermined case, in which the number of sources is greater than the number of sensors. Additionally, in frequency-domain ICA [6] the permutation ambiguity of the solution is a serious problem: the separated frequency components originating from the same source need to be aligned.
The T-F masking method, which is the most popular sparseness-based approach, is the topic of this chapter. The representative method is known as DUET (Degenerate Unmixing Estimation Technique) [11]. The flow of conventional sparseness-based separation can be summarized as follows.

Observed signals in T-F domain:
Transform the time-domain acoustic observations over a few seconds into T-F domain signals by applying the short-time Fourier transform (STFT), which yields a sparse representation of the speech signals. [15] The T-F components of a speech signal are then distributed in the T-F domain with little overlap with those of other speech signals.

Features of T-F cells:
As is known in auditory scene analysis, interaural time differences and level differences are significant spatial features of sources. [1] These localization cues arise from differences in the direction and distance of the speakers. In a microphone array, the geometric parameters of the sources can be obtained from the phase differences and attenuation ratios at the mixture T-F cells.

Clustering T-F cells:
Under the WDO assumption, the distribution of feature vectors obtained at all T-F cells forms as many clusters as there are sources. The essential task of separation therefore reduces to clustering the feature vectors. The preliminary clustering method adopted in [9]-[12] is to build a histogram of the features and to find the peaks corresponding to the sources. Each T-F cell of the mixed signal is then associated with one peak according to the distance in the feature space.

Masking T-F cells:
Using the clustering results, individual binary masks are applied to the T-F domain spectrogram to extract the components originating from each source.

Inverse transform:
Each set of masked T-F components is transformed back to the time domain by the inverse STFT, which yields the restored speech signal.
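The five-step flow above can be sketched end to end in code. The following toy example is our own illustration, not the chapter's implementation: two sinusoidal "sources" are mixed, an STFT is applied, cells are clustered by the level-ratio feature with a simple threshold, masked, and resynthesized.

```python
import numpy as np

def stft(x, L=256, S=128):
    # L-point windowed STFT with half-window shift (S = L/2), as in the text.
    win = np.hanning(L)
    return np.array([np.fft.rfft(win * x[k:k + L])
                     for k in range(0, len(x) - L + 1, S)])

def istft(X, L=256, S=128):
    # Weighted overlap-add inverse of stft() above.
    win = np.hanning(L)
    x = np.zeros(S * (len(X) - 1) + L)
    norm = np.zeros_like(x)
    for k, spec in enumerate(X):
        x[k * S:k * S + L] += win * np.fft.irfft(spec, n=L)
        norm[k * S:k * S + L] += win ** 2
    return x / np.maximum(norm, 1e-12)

fs = 8000
t = np.arange(fs) / fs
s1 = np.sin(2 * np.pi * 440 * t)        # toy "source" 1
s2 = np.sin(2 * np.pi * 1200 * t)       # toy "source" 2
X1 = stft(s1 + s2)                      # mixture at sensor 1
X2 = stft(1.0 * s1 + 0.5 * s2)          # sensor 2: source 2 attenuated

# Feature per T-F cell: level ratio alpha (delay feature omitted in this toy).
alpha = np.abs(X2) / np.maximum(np.abs(X1), 1e-12)
mask1 = alpha > 0.75                    # crude two-way clustering by threshold
s1_hat = istft(np.where(mask1, X1, 0))
s2_hat = istft(np.where(~mask1, X1, 0))
```

A real system would use the phase-difference feature and a proper clustering stage; the threshold 0.75 here is an arbitrary toy choice matching the two attenuation levels.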

Remarks:
1. T-F domain sparseness of speech signals is also employed as a separation principle in the single-channel (monaural) source separation problem, where the harmonic structure in the spectrogram is crucial for segregation. [16] [17]
2. The features conventionally used to characterize T-F cells are summarized in [13], where they are evaluated from the separation performance point of view.
3. The clustering scheme in T-F masking is crucial for high separation ability. Following the DUET-like approaches [11][12], subsequent studies have proposed a maximum-likelihood (ML) based method for real-time operation [18], the k-means algorithm, hierarchical clustering, and the EM algorithm [19]. The method called MENUET [13] applies the k-means algorithm to a vector space consisting of the signal level ratio and the frequency-normalized phase difference, with appropriate weighting terms for effective clustering; the optimization problem is solved by an efficient iterative update algorithm. In [14], the k-means algorithm is applied to cluster spatial features for an arbitrary sensor array configuration, even with wide sensor spacing where spatial aliasing may occur. The clustering procedure is divided into two steps: the first is applied to the non-aliasing (lower) frequency band, and the second treats the remaining frequency band where aliasing occurs.

DOA estimation
Localization of acoustic sources using a microphone array is a significant issue in many practical applications, such as hands-free phones, camera control in video conference systems, and robot audition. The latter half of this chapter focuses on the Direction-Of-Arrival (DOA) estimation of sources. Since this monograph is concerned with speech signals, we make no mention of methods addressed to narrowband signals, for instance in radar/sonar processing. A large number of DOA estimation methods have been proposed for broadband signals [20], [21]. Typical array processing approaches include subspace-based methods: the MUSIC-like algorithms are well-known for narrowband target signals, and for broadband signals such as speech, several frequency-domain extensions have been proposed. Subspace-based approaches with a small number of sensors have to overcome two drawbacks: the limited precision of DOA estimation, and the inability to deal with the underdetermined case.

Sparseness-based approaches
The third category of DOA estimation algorithms is based on the sparseness of speech signals and is closely related to BSS. The source sparseness assumption implies WDO or its weaker condition TIFROM [24]. These conditions are the crucial properties for solving DOA problems with underdetermined multiple sources. The BSS approach associated with these assumptions is the T-F masking framework. In DUET-like methods [9]-[14], the delay time, i.e., the frequency-normalized phase of the ratio of the frequency-domain observations at each T-F point, is used to compute the TDOA. An alternative DOA estimation method proposed by Araki et al. [27], in the context of the k-means algorithm, estimates each DOA as the centroid of a cluster of normalized observation vectors corresponding to an individual source. The DEMIX algorithm [25] introduces a statistical model in order to exploit a local confidence measure that detects the regions where robust mixing information is available. The computational cost of DEMIX can be high because a principal component analysis is performed on the local scatter plot of observation vectors at every individual T-F point.
For robust cocktail-party speech recognition, the localization cue, such as the TDOA or the spatial direction evaluated at each T-F cell, plays a central role. As in [29][30], approaches integrating the segregation/localization of sound sources with speech recognition against background interference form significant CASA (Computational Auditory Scene Analysis) front-ends.

DOA Tracking
Not only estimating but also tracking sound sources has recently drawn much attention in robot auditory systems. For instance, tracking a speaker's DOA with a microphone array mounted on a mobile robot is a problem of both moving sources and moving sensors.

BSS and DOA Problems:
The underlying BSS and DOA estimation problems addressed in this chapter are characterized as follows:
a. A pair of microphones is used.
b. Multiple speech signals are uttered simultaneously, under the assumption that the number of sources is known a priori.
c. Underdetermined cases are considered, where the sources outnumber the sensors.
d. The inter-sensor distance is bounded so as to avoid spatial aliasing (for instance, less than 4 cm spacing for an 8 kHz sampling rate).
While a stereophonic pair is the simplest sensor array, studying how to improve separation performance and obtain accurate DOA estimates with a pair of microphones is meaningful, because any complex array configuration can be regarded as an integration of such pairs.
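Condition (d) follows from a one-line calculation: the inter-channel phase difference 2πf(d/c)sinθ stays unambiguous (within ±π) for all directions up to the Nyquist frequency f_s/2 when d < c/f_s. A quick sketch:

```python
# Spatial aliasing bound for a microphone pair: the inter-channel phase
# difference 2*pi*f*(d/c)*sin(theta) must stay within (-pi, pi] for every
# direction |theta| <= 90 deg up to the Nyquist frequency f_max = fs / 2,
# which requires d < c / (2 * f_max) = c / fs.
c = 343.0          # sound velocity [m/s], a room-temperature assumption
fs = 8000.0        # sampling rate [Hz]
d_max = c / fs     # maximum spacing [m]
print(round(d_max * 100, 2))   # 4.29 (cm) -- hence the "< 4 cm" spacing
```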
The rest of this chapter is organized as follows. In section 2, the underlying BSS and DOA estimation problems are described in detail. The proposed BSS method based on a frame-wise scheme is introduced in section 3. Section 4 describes a DOA estimation algorithm using T-F cell selection and the kernel density estimator. The last section concludes this chapter.

Observation model
Source mixing models in the time domain and their T-F domain descriptions are as follows. The convolutive mixture observed at the m-th sensor is

x_m(t) = Σ_i Σ_τ h_mi(τ) s_i(t − τ), m = 1, …, M, (1)

where h_mi(τ) represents the impulse response from the i-th source to the m-th sensor. The observed signals x_m(t) (m = 1, …, M) are converted into T-F domain signals X_m[k, l] by the L-point windowed STFT:

X_m[k, l] = Σ_r win(r) x_m(kS + r) e^(−j2πlr/L), (2)

where r is the dummy variable of the convolution sum, win(r) is a window function, and S is the window shift length. Here we apply a half-window-size overlap, namely S = L/2 in (2). The transformed T-F mixture model of Eq. (1) can be described by instantaneous mixtures at each time frame index k and frequency bin l:

X_m[k, l] ≈ Σ_i ℋ_mi(l) 𝒮_i[k, l], (3)

where ℋ_mi(l) is the frequency response (DFT) of h_mi(t), 𝒮_i[k, l] is the windowed STFT representation of the i-th source signal s_i(t), and the point [k, l] is called a "T-F cell" in this chapter.

Assuming an anechoic mixing, the source signals we want to recover are alternatively redefined as the signals observed at the first sensor. In this case, the following mixing models in the T-F domain are henceforth considered without loss of generality:

X_1[k, l] = Σ_i S_i[k, l], (4a)
X_m[k, l] = Σ_i H_mi(l) S_i[k, l], m = 2, …, M, (4b)

where S_i[k, l] and H_mi(l) are different from 𝒮_i[k, l] and ℋ_mi(l) in (3): S_i[k, l] is the i-th source signal observed at the first sensor (m = 1), and H_mi(l) represents the DFT-domain operation of the transfer function with the relative attenuation and delay between the m-th and the first sensors.

From here on, consider the mixture of two sources S_1[k, l] and S_2[k, l] received at a pair of microphones. The mixture system (4a) and (4b) can thus be expressed as

X_1[k, l] = S_1[k, l] + S_2[k, l],
X_2[k, l] = H_21(l) S_1[k, l] + H_22(l) S_2[k, l]. (5)

Basic assumptions
As stated in Section 1, the WDO is commonly assumed in sparseness-based separation approaches. First, we denote by Ω the T-F domain on which S_1[k, l] and S_2[k, l] are defined:

Ω := { [k, l] : k = 1, …, K, l ∈ B }, B := { l_1, …, L/2 }, (6)

where B is the frequency band after deleting the lower frequency components that do not exist in actual speech signals, l_1 = ⌊f_1 L / f_s⌋ with ⌊·⌋ the Gauss (floor) function, and f_1 is the lowest analog frequency of speech components, e.g., 80 Hz in the later experiments. The support of each source is defined as

Ω_i := { [k, l] ∈ Ω : |S_i[k, l]| ≥ ε }, i = 1, 2, (7)

where ε (> 0) is a sufficiently small value. Although, in theory, the support of S_i[k, l] (i = 1, 2) is defined by the condition |S_i[k, l]| ≠ 0, Eq. (7) gives the set of components of actual signals excluding noise-like ones satisfying |S_i[k, l]| < ε.

We may consequently express the WDO assumption between the two source signals s_1(t) and s_2(t) by the disjoint condition

Ω_1 ∩ Ω_2 = ∅. (8)

This can equivalently be represented as

S_1[k, l] S_2[k, l] = 0 at any [k, l] ∈ Ω. (9)

The above WDO condition is verified for actual speech signals in Fig. 1, where (a) and (b) show the spectrograms of two speech signals in the T-F domain, and (c) shows their product, in which the two spectrograms rarely overlap.

Obviously, the supports of X_1[k, l] and X_2[k, l] coincide; this common support, denoted by Ω_X, is given by

Ω_X := Ω_1 ∪ Ω_2. (10)

In addition, the following null component domain, denoted by Ω_N, is introduced:

Ω_N := Ω − (Ω_1 ∪ Ω_2). (11)

The WDO condition (8) accordingly implies that the T-F domain representation of the mixed signal X_1[k, l], given by Eq. (5), can be decomposed into three non-overlapping parts in Ω:

X_1[k, l] = S_1[k, l] on Ω_1, X_1[k, l] = S_2[k, l] on Ω_2, and X_1[k, l] = 0 on Ω_N. (12)

Source separation
Under the WDO assumption expressed in (12), binary masking in the T-F domain is performed as follows. Clustering the T-F cells in the support Ω_X of the mixture X_1[k, l] into two sub-regions Ω_1 and Ω_2, the binary masks are defined as

M_i[k, l] := 1 if [k, l] ∈ Ω_i, and 0 otherwise, i = 1, 2, (13)

and the separated source estimates in the T-F domain, Ŝ_1[k, l] and Ŝ_2[k, l], are obtained by applying the masks to X_1[k, l]:

Ŝ_i[k, l] = M_i[k, l] X_1[k, l], i = 1, 2. (14)
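A minimal numerical illustration of this masking under exact WDO, using oracle masks on a synthetic pair of disjoint supports (our own toy construction, not the chapter's data):

```python
import numpy as np

rng = np.random.default_rng(0)
K, L2 = 4, 8                      # toy grid of T-F cells (frames x bins)
S1 = np.where(rng.random((K, L2)) < 0.4, rng.standard_normal((K, L2)), 0.0)
S2 = np.where(S1 == 0.0, rng.standard_normal((K, L2)), 0.0)  # disjoint support
X1 = S1 + S2                      # Eq. (5)-style mixture at the first sensor

M1 = (S1 != 0.0).astype(float)    # oracle masks in the spirit of Eq. (13)
M2 = (S2 != 0.0).astype(float)
S1_hat = M1 * X1                  # Eq. (14)-style masking
S2_hat = M2 * X1
```

Under exact WDO the masked mixture recovers each source perfectly; the real difficulty, addressed next, is estimating the masks blindly.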

Clustering features
The separation task is to classify the T-F cells composing the support Ω_X of X_1[k, l] into either Ω_1 or Ω_2. The pair X_1[k, l] and X_2[k, l] is used to characterize each T-F cell [k, l], at which spatial features are computed, and the clustering is performed in the estimated feature space.
Effective features are the signal level or attenuation ratio defined by

α[k, l] := |X_2[k, l]| / |X_1[k, l]|, (15)

and the arrival time difference defined as the frequency-normalized phase difference (PD) between X_1[k, l] and X_2[k, l]:

δ[k, l] := ϕ[k, l] / ω_l, ω_l := 2πl f_s / L, (16)

where ϕ[k, l] is the PD defined by

ϕ[k, l] := arg( X_2[k, l] / X_1[k, l] ). (17)

Other features used for characterizing T-F cells, as well as modifications of the attenuation ratio, are listed in [13]. Note that the attenuation ratio does not give a distinctive difference for a microphone array with a short inter-sensor distance. In our experimental setup, for example, the distance between the microphones is 4 cm in order to avoid spatial aliasing at the 8 kHz sampling rate.
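A sketch of the two features under the pure-delay (anechoic) model; the function and variable names are ours:

```python
import numpy as np

def tf_features(X1, X2, L, fs):
    # Level ratio alpha (Eq. (15)) and frequency-normalized phase
    # difference delta (Eqs. (16)-(17)) per T-F cell; delta is in seconds.
    l = np.arange(1, X1.shape[1])                 # skip the DC bin
    ratio = X2[:, l] / np.where(X1[:, l] == 0, 1e-12, X1[:, l])
    alpha = np.abs(ratio)
    phi = np.angle(ratio)                         # phase difference
    omega = 2 * np.pi * l * fs / L                # bin angular frequency [rad/s]
    return alpha, phi / omega

# Toy check: a pure inter-channel delay of tau seconds, no attenuation.
fs, L, tau = 8000, 256, 5e-5
l = np.arange(L // 2 + 1)
X1 = np.ones((1, L // 2 + 1), dtype=complex)
X2 = np.exp(-1j * 2 * np.pi * l * fs / L * tau)[None, :]
alpha, delta = tf_features(X1, X2, L, fs)
```

With this sign convention a positive inter-channel delay τ yields δ = −τ at every bin; only the consistency across cells matters for clustering.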

Clustering scheme
Given the features at the T-F cells in Ω_X, the next step is to cluster them. In DUET, where a pair of microphones is used, the two-dimensional histogram of the feature vectors (α[k, l], δ[k, l])^T within a time interval of several seconds is generated, and the clustering is performed by finding the largest peaks, which correspond to the sources. When the attenuation feature is omitted, the clustering is based solely on the histogram of the time-delay feature. The dimension of the feature space is higher for array configurations with more than two microphones; in those cases, a more sophisticated clustering scheme, such as the k-means algorithm or the EM algorithm [19], should be adopted.
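A DUET-style toy clustering with the delay feature only: histogram the per-cell delays, pick the two tallest well-separated bins as peaks, and assign each cell to the nearest peak. The greedy peak-picking rule below is a simplification of ours:

```python
import numpy as np

def histogram_peaks(delta, n_src=2, bins=50):
    # Histogram the per-cell delays and greedily take the n_src tallest
    # bins that are not adjacent to an already chosen peak.
    hist, edges = np.histogram(delta, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    peaks = []
    for idx in np.argsort(hist)[::-1]:
        if all(abs(int(idx) - p) > 2 for p in peaks):
            peaks.append(int(idx))
        if len(peaks) == n_src:
            break
    return np.sort(centers[peaks])

rng = np.random.default_rng(1)
# Cells from two sources with delays -0.10 ms and +0.15 ms plus noise:
delta = np.concatenate([rng.normal(-1e-4, 1e-5, 500),
                        rng.normal(1.5e-4, 1e-5, 500)])
centers = histogram_peaks(delta)
labels = np.argmin(np.abs(delta[:, None] - centers[None, :]), axis=1)
```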

Inverse STFT
The final stage of the separation process is to obtain the time-domain separated signals ŝ_i(t) (i = 1, 2) by applying the inverse STFT.

Phase-difference vs. frequency data
As a T-F cell feature that depends on the spatial locations of the sources, our strategy exploits a frame-wise distribution, namely a time sequence of the phase difference of the observations versus frequency (PD-F). In the k-th frame, the PD-F point plot is defined as the collection of two-dimensional vectors

p_k(l) := ( l, ϕ[k, l] )^T, l ∈ B. (18)

An example of the PD-F in the (l, ϕ)-plane and its time sequence for a mixture of two speech signals are shown in Fig. 2(a) and (b), respectively. The relationship between the gradient β of a line in the PD-F plane defined by Eq. (18) and the source direction θ is [33]

β = (2π f_s d / (L c)) sinθ, (19)

where d is the distance between the sensors, c is the sound velocity, and θ is the direction of the source. Here θ = 0 corresponds to the broadside direction, and the term (d/c) sinθ represents the arrival delay of the wave between the microphones. For example, the dot distribution in Fig. 2(a) concentrates along two lines corresponding to the two source directions. By determining the gradients of these lines, the two source directions are estimated from the relationship (19).
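Inverting Eq. (19) gives the source direction from an estimated PD-F gradient; a small round-trip sketch (names are ours):

```python
import numpy as np

def gradient_to_doa(beta, fs, L, d, c=343.0):
    # Invert Eq. (19): beta = (2*pi*fs*d / (L*c)) * sin(theta) -> theta [deg].
    s = beta * L * c / (2 * np.pi * fs * d)
    return np.degrees(np.arcsin(np.clip(s, -1.0, 1.0)))

fs, L, d, c = 8000, 256, 0.04, 343.0
theta_true = 30.0
beta = 2 * np.pi * fs * d / (L * c) * np.sin(np.radians(theta_true))
theta_hat = gradient_to_doa(beta, fs, L, d, c)
```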
The conventionally used delay-type features at each T-F cell can be obtained from the PD-F dots by frequency normalization of the dot corresponding to each cell. Unlike the conventional delay-like features, the PD-F dots keep a linear dot distribution in the plane, which is exploited effectively in both the source separation and the direction finding methods that follow. We may now define three sets of time-frame indices as follows:

Frame categorization
The whole set of time frames, denoted by K := {1, …, K}, is categorized into three non-overlapping sets: the Single Source Active (SSA) frames K_SSA, the Double Source Active (DSA) frames K_DSA, and the No Source Active (NSA) frames K_NSA. In addition, we define the Source Active (SA) frame index set K_SA := K_SSA ∪ K_DSA. This frame categorization suggests a source separation algorithm consisting of the following two parts:
• Assign each T-F component of an SSA frame to one of the sources by identifying its direction.
• Apply the separation algorithm only to the DSA frames.
The details are described in the next section.

Outline of the method
The outline of the separation method using the PD-F plot is shown in Fig. 4 and summarized below.
Step 1: Discriminate SA from NSA frames. The average power E(k) of the observation over the frequency bins of the k-th frame (Eq. (22)) is employed to check the presence of speech in the frame. A threshold operation then provides basic voice activity detection:

K_SA := { k ∈ K : E(k) > Th_SA }, (23)

where Th_SA is determined by a pre-experiment estimating the noise level during no utterance. In the later experiments, Th_SA was computed via Eq. (24) from the average noise power estimate E_0 and its standard deviation estimate σ_E, given by Eqs. (25) and (26), respectively.

Step 2-1: Discriminate SSA from DSA frames. Let R_k be the covariance matrix of the PD-F vectors p_k(l), l ∈ B, at the k-th frame (Eq. (27)). Denoting the eigenvalues of R_k by λ_1(k) and λ_2(k) (assume λ_1(k) ≥ λ_2(k)), the ratio of the principal eigenvalues (Eq. (28)) is introduced to discriminate the SSA frames from the DSA frames. As shown in Fig. 3(c), (d), the PD-F vector distribution at an SSA frame tends to concentrate around the first principal axis. This observation leads to the following discrimination of SSA from DSA frames and to the estimation of the source directions.
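The SSA/DSA discrimination idea can be sketched as follows: compute the covariance of the PD-F dots in a frame and compare its eigenvalue ratio; a frame whose dots hug a single line yields a ratio near zero. The exact forms of Eqs. (27)-(28) are paraphrased here:

```python
import numpy as np

def eig_ratio(band, phi):
    # Covariance of the PD-F dots p_k(l) = (l, phi[l]) over the band, then
    # lambda_2 / lambda_1: near zero when the dots hug a single line (SSA-
    # like frame), larger when two crossing lines are present (DSA-like).
    P = np.stack([band, phi], axis=1).astype(float)
    lam = np.sort(np.linalg.eigvalsh(np.cov(P.T)))[::-1]
    return lam[1] / lam[0]

rng = np.random.default_rng(2)
band = np.arange(10, 120)
ssa_phi = 0.01 * band + rng.normal(0, 0.01, band.size)       # one line
dsa_phi = np.where(rng.random(band.size) < 0.5,              # two lines mixed
                   0.01 * band, -0.015 * band) + rng.normal(0, 0.01, band.size)
r_ssa, r_dsa = eig_ratio(band, ssa_phi), eig_ratio(band, dsa_phi)
```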
The criterion of Eq. (29), a threshold on the principal eigenvalue ratio, is applied to detect SSA frames.

Step 2-2: DOA estimation and SSA identification. Define the normalized eigenvector of the first principal eigenvalue as in Eq. (30); its direction gives the gradient β(k) of the principal axis at the k-th frame (Eq. (31)). The histogram of the set of directions computed from β(k) by Eq. (19) will have two peaks, corresponding to the two source directions θ_1 and θ_2. By clustering the set of directions into two groups according to the distance from θ_1 and θ_2, each SSA frame in K_SSA is assigned to one of the sources.

Double Source Active (DSA)
Given the set of DSA frames K_DSA, the problem is to cluster the vectors p_k(l), l ∈ B, into two sets. Before describing this separation algorithm, three frequency bands, denoted by B_high, B_low, and B_mid, are introduced for use in the algorithm.

Frequency Bands
The three frequency bands are delimited by two boundary frequencies f_2 and f_3: B_low covers the bins from l_1 up to l_2 = ⌊f_2 L / f_s⌋, B_mid the bins from l_2 up to l_3 = ⌊f_3 L / f_s⌋, and B_high the bins above l_3. In the later experiments, f_2 is set to 400 Hz and f_3 to 1 kHz.
The source separation at DSA frames utilizing these bands is divided into two parts according to the frequency band.
1. The first scheme, called initial separation, is applied to the T-F cells in B_high, based on the source directions that have been estimated at the SSA frames.
2. The clustering in B_low is performed by exploiting the harmonic-structure relationship between the spectral components in B_low and those in B_mid. The harmonic structure in B_mid is obtained from the initial separation results.

Initial separation
Denote the source directions estimated at the SSA frames by θ_1 and θ_2, with corresponding gradients β_1 and β_2 in the PD-F plane as defined in Eq. (31). The points on these two lines can be expressed as

p(l) = ( l, β_i l )^T, i = 1, 2. (32)

At k ∈ K_DSA, the nearest neighbor rule gives the binary mask M_i[k, l] in B_high, defined as

M_i[k, l] := 1 if i = arg min_j |ϕ[k, l] − β_j l|, and 0 otherwise. (33)
As a result, the initially separated signals Ŝ_i[k, l] (i = 1, 2) in B_high are represented by

Ŝ_i[k, l] = M_i[k, l] X_1[k, l], i = 1, 2. (34)
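A sketch of the nearest neighbor rule on noiseless toy phase differences; the distance measure |ϕ − β_i l| is our reading of the rule:

```python
import numpy as np

def initial_masks(phi, band, beta1, beta2):
    # Assign each high-band T-F cell to the direction line beta_i * l whose
    # phase prediction is closer to the observed PD (nearest neighbor rule).
    closer_to_1 = np.abs(phi - beta1 * band) <= np.abs(phi - beta2 * band)
    M1 = closer_to_1.astype(float)
    return M1, 1.0 - M1

band = np.arange(32, 128)
beta1, beta2 = 0.012, -0.008
truth = np.tile([0, 1], band.size // 2)                    # alternating owners
phi = np.where(truth == 0, beta1 * band, beta2 * band)     # noiseless PDs
M1, M2 = initial_masks(phi, band, beta1, beta2)
```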

Local maximum points in B_mid
The final task of the separation process is to generate the individual masks applied to B_low. In this final separation step, the observed amplitude spectrum |X_1[k, l]| with l ∈ B_low is compared, in terms of harmonic relationships, with the initially separated spectra Ŝ_1[k, l] and Ŝ_2[k, l] with l ∈ B_mid. First, with the help of the local maximum frequencies of |Ŝ_i[k, l]|, the harmonic structure in B_mid is estimated for each separated spectrum. We denote the obtained local maximum frequencies of |Ŝ_i[k, l]| by b_i1(k), b_i2(k), …, and the number of local maxima in B_mid by q_i(k).

Harmonics estimation
The spacing Δd_i(k) of adjacent harmonics is defined as the mean difference between consecutive local maximum frequencies b_ij(k) (Eq. (35)). When q_i(k) = 0 or 1, we regard frame k as containing no harmonics. The harmonics in the low frequency band are then estimated as g_in(k), n = 1, 2, 3, …, with g_in(k) ∈ B_low (Eq. (36)), where g_in(k) represents the harmonic structure of source i at frame k.
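One plausible reading of the harmonic extrapolation step, sketched in code; the mean-spacing rule and the downward comb are our assumptions:

```python
import numpy as np

def low_band_harmonics(peaks_mid, band_low):
    # Mean spacing of adjacent local maxima in B_mid, extrapolated downward
    # as a comb of candidate harmonic bins inside B_low.
    if len(peaks_mid) <= 1:
        return np.array([], dtype=int)       # q_i(k) = 0 or 1: no harmonics
    dd = np.mean(np.diff(peaks_mid))         # adjacent-harmonic spacing
    g = np.rint(peaks_mid[0] - dd * np.arange(1, 10)).astype(int)
    return g[(g >= band_low[0]) & (g <= band_low[-1])]

band_low = np.arange(3, 13)                  # toy low band: bins 3..12
peaks_mid = np.array([16, 20, 24])           # comb with 4-bin spacing
g = low_band_harmonics(peaks_mid, band_low)
```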

Mask generation
We assume that the bandwidth of each harmonic is the same, and use five adjacent cells as the bandwidth in the T-F domain. The mask in B_low (Eq. (37)) sets M_i[k, l] = 1 for the cells within this bandwidth around each estimated harmonic g_in(k), and 0 otherwise. The integrated mask (Eq. (38)) combines Eq. (33) and Eq. (37) over the whole band.
Finally, the separated signals are obtained as shown in Eq.(14).

Experimental condition
Some real-life experiments were performed in a conference room to evaluate the separation methods. Fig. 5(a), (b) show the experimental environment and setup. The experimental parameters are shown in Tab. 1. One source was placed at the broadside (θ = 0°) and the location of the other source was varied from 0° to 80° at intervals of 10°. The separation performance is evaluated by the SIR improvement, defined as the output SIR minus the input SIR. The proposed frame-wise PD-F approach exceeds the conventional method in terms of SIR improvement.

DOA estimation
The DOA estimation method discussed in this section is built on three components: the selection of reliable T-F cells by a reliability index η[k, l], a DOA error distribution model, and a kernel density estimator. A T-F cell with a large reliability index is considered reliable; we observed the tendency that the PD error decreases as the reliability index increases. The cell group with reliability index η[k, l] > η_th is therefore selected for the subsequent DOA estimation. In this chapter, η_th is set to 0.96; the reason for using this value and related remarks are given later.
For each selected reliable T-F cell, the direction θ is computed using Eq. (41). The set of computed directions is denoted by

{ θ[l_i] : i = 1, …, I }, (47)

where i is the index of the selected cells, I is the total number of selected cells, and l_i is the frequency bin at which the i-th cell is located.

DOA error distribution model
Consider a T-F cell at which the n-th source dominates, the source being located in the unknown direction θ_n. From Eq. (41), the theoretical PD at the cell is given by

ϕ_n[l] = B_n l, where B_n = ΔωT sinθ_n.

The frame index k is omitted because it is not essential in this section. In the l-th frequency bin, the observed PD is distributed around its mean value B_n l,

ϕ_n[l] = B_n l + Δϕ[l],

where Δϕ[l] is a random variable representing the PD estimation error. We then assume that Δϕ[l] follows an independent, identical Gaussian distribution with zero mean and constant variance σ_ϕ², that is, N(0, σ_ϕ²). The constant variance means that Δϕ[l] is independent of the frequency bin l. Fig. 7(a) illustrates the Gaussian error distribution at the l-th frequency bin in the PD-F plane for the two-source case. The Gaussian distribution assumption is motivated by the simplicity of the theoretical manipulation. Given this error distribution model, the problem is to estimate the probability distribution of the direction θ, as shown in Fig. 7(b).
Now, the following proposition, which gives the variance of the estimated direction θ[l] (Eq. (52)), can be proved. The DOA error distribution model, mapping the phase difference error to the direction angle, is shown in Fig. 8.

DOA estimation using kernel density estimator
The kernel density estimation algorithm, known as the Parzen window in machine learning [32], is useful for statistical estimation even in the multiple-source problem. The algorithm provides an approximate estimate of the PDF of θ[l] from the observed samples (47). The maximum point, or mode, of the estimated PDF can then be taken as the optimal estimate of θ_n in the sense of the most probable value.
It is necessary to generalize the above theoretical investigation to the multi-source and multi-frequency cases. The theoretical PDF of θ in the case of multiple sources should be a Gaussian mixture with the same number of local modes (local peaks), each of which corresponds to an individual source. For the selected reliable data in Eq. (47), the kernel density estimator is applied to estimate the multi-modal PDF as

p̂(θ) = (1/I) Σ_{i=1}^{I} (1/ε_{l_i}) K( (θ − θ[l_i]) / ε_{l_i} ), (53)

where K(θ) is a kernel function, for which a Gaussian function is adopted in this study, and ε_l is the bandwidth of the kernel. The idea behind applying the kernel density estimator is to reflect the theoretical result represented by the above proposition in the determination of the bandwidth. Since the variance of θ[l] depends on l and θ_n, as indicated in Eq. (52), the bandwidth ε_{l_i} is set proportional, with a control parameter ℏ, to the standard deviation of θ[l] given by Eq. (52), in which the observed θ[l_i] is substituted in place of the real, unknown θ_n (Eq. (54)). Accordingly, the dependence of the bandwidth on θ_n is indirectly controlled. The control parameter ℏ is predetermined experimentally. Fig. 9 shows three examples of estimated PDFs for a two-source case with different ℏ. Finally, by finding the same number of local modes (peaks) as the pre-assigned number of sources, the source directions are estimated.
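A sketch of the mode-finding step: a Gaussian-kernel density estimate with per-sample bandwidths over the reliable per-cell DOA samples, with the two largest local modes taken as the source directions. A constant bandwidth stands in for Eq. (54) in this toy case:

```python
import numpy as np

def kde_doa(samples, bandwidths, grid):
    # Gaussian-kernel density estimate with a per-sample bandwidth, in the
    # spirit of Eq. (53); the local modes are read off as DOA estimates.
    z = (grid[None, :] - samples[:, None]) / bandwidths[:, None]
    k = np.exp(-0.5 * z ** 2) / (bandwidths[:, None] * np.sqrt(2 * np.pi))
    return k.mean(axis=0)

rng = np.random.default_rng(3)
# Reliable per-cell DOA samples around two true directions (degrees):
samples = np.concatenate([rng.normal(-20, 2, 300), rng.normal(25, 3, 300)])
bw = np.full(samples.size, 2.0)        # constant bandwidth in this toy case
grid = np.linspace(-90, 90, 721)
pdf = kde_doa(samples, bw, grid)

# Take the two largest local modes as the estimated source directions.
is_peak = (pdf[1:-1] > pdf[:-2]) & (pdf[1:-1] > pdf[2:])
peak_idx = np.where(is_peak)[0] + 1
top2 = np.sort(grid[peak_idx[np.argsort(pdf[peak_idx])[-2:]]])
```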

Experiments
Some experiments were conducted with the same setup and parameters as shown in Tab. 1. The first experiment is the two-source case, with one source placed at the broadside (near 0 degrees), as shown in Fig. 10(a). The results are shown in Fig. 10(b) and (c). While the proposed method gives unbiased estimates, the estimates of the conventional method [27] tend to be biased when the source positions are not symmetric with respect to the broadside. The second experiment addresses the underdetermined case of three sources. Three sources were located symmetrically, at the closer positions {-23, 4, 23 degrees} and at positions farther apart {-42, 4, 42 degrees}. Fig. 11(a) and (b) show the results of the conventional method [27] and the proposed method. In the "far apart" case, both methods estimate the source directions well; in the "closer" case, however, the proposed method provides less biased estimates than [27]. Additional experimental results with diffuse noise, presented in [33] and [34], show that the proposed cell selection method provides more noise-robust estimation than the conventional one.

Conclusions
This monograph has summarized speech segregation and speaker direction estimation methods based on the sparseness of the T-F components of speech signals. Throughout the discussion we have been interested in underdetermined source-sensor conditions. First, recent progress on BSS and DOA estimation algorithms associated with sparse T-F representations was reviewed. We then focused on the authors' solution to the BSS problem, which exploits a time series of phase difference versus frequency data. In the algorithm, time frames are classified according to their source activity states, and the actual separation procedure is applied only to the frames where the sources are mixed.
The latter half of this chapter treated a DOA estimation algorithm for a pair of microphones. The basic error propagation mechanism was introduced, and the kernel density estimator was then applied. The method provides robust and unbiased DOA estimation, and the theory has been extended to arbitrary microphone array configurations. [35] Recent research on segregation and localization for human-machine speech communication is associated with robot auditory systems, where the tracking of moving sources and moving sensors has to be considered. [36] To cope with such cases, particle filters and adaptive array processing have become attractive, and further efforts will be made in this direction.