Advancements in the Time-Frequency Approach to Multichannel Blind Source Separation

The ability of the human cognitive system to distinguish between multiple, simultaneously active sources of sound is a remarkable quality that is often taken for granted. This capability has been studied extensively within the speech processing community and many an endeavor at imitation has been made. However, automatic speech processing systems are yet to perform at a level akin to human proficiency (Lippmann, 1997) and are thus frequently faced with the quintessential "cocktail party problem": the inadequacy in the processing of the target speaker/s when there are multiple speakers in the scene (Cherry, 1953). The implementation of a source separation algorithm can improve the performance of such systems. Source separation is the recovery of the original sources from a set of observations; if no a priori information of the original sources and/or mixing process is available, it is termed blind source separation (BSS). Rather than rely on the availability of a priori information of the acoustic scene, BSS methods often employ an assumption on the constituent source signals, and/or an exploitation of the spatial diversity obtained through a microphone array. BSS has many important applications in both the audio and biosignal disciplines, including medical imaging and communication systems.


Introduction
In the last decade, the research field of BSS has evolved significantly to become an important technique in acoustic signal processing (Coviello & Sibul, 2004). The general BSS problem can be summarized as follows. M observations of N sources are related by the equation

X = AS,    (1)

where X is a matrix representing the M observations of the N sources contained in the matrix S, and A is the unknown M × N mixing matrix. The aim of BSS is to recover the source matrix S given only the observed mixtures X; however, rather than directly estimating the source signals, the mixing matrix A is typically estimated first. The number of sensors relative to the number of sources present determines the class of BSS problem: evendetermined (M = N), overdetermined (M > N) or underdetermined (M < N). The evendetermined system can be solved via a linear transformation of the data, whilst the overdetermined case can be solved by an estimation of the mixing matrix A. However, due to its intrinsically noninvertible nature, the underdetermined BSS problem cannot be resolved via a simple mixing matrix estimation, and the recovery of the original sources from the mixtures is considerably more complex than in the other aforementioned BSS instances. As a result of its intricacy, the underdetermined BSS problem is of growing interest in the speech processing field.
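The evendetermined case above can be illustrated with a minimal numerical sketch: given (an estimate of) the square mixing matrix A, the sources are recovered by a linear transformation of the data. The matrix values below are illustrative only.

```python
import numpy as np

# A minimal sketch of the evendetermined (M = N) linear mixing model X = AS.
# With an estimate of the mixing matrix A in hand, the sources are recovered
# by a linear transformation (matrix inversion) of the observations.
rng = np.random.default_rng(0)
T = 1000                              # number of time samples
S = rng.standard_normal((2, T))       # N = 2 source signals (one per row)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])            # M x N mixing matrix (M = N = 2)
X = A @ S                             # M observed mixtures

S_hat = np.linalg.inv(A) @ X          # linear transformation recovers the sources

assert np.allclose(S_hat, S)
```

In the underdetermined case (M < N) the matrix A has no inverse, which is precisely why the TF masking methods discussed in this chapter are needed.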
Traditional approaches to BSS are often based upon assumptions about the statistical properties of the underlying source signals; for example, independent component analysis (ICA) (Hyvarinen et al., 2001) aims to find a linear representation of the sources in the observation mixtures. Not only does this rely on the condition that the constituent source signals are statistically independent, it also requires that no more than one of the independent components (sources) follows a Gaussian distribution. Furthermore, because techniques of ICA depend on matrix inversion, the number of microphones in the array must be at least equal to the number of sources present (i.e. the even- or overdetermined cases exclusively). This poses a significant constraint on its applicability in many practical applications of BSS. Moreover, whilst statistical assumptions hold well for instantaneous mixtures of signals, in most audio applications the expectation of instantaneous mixing conditions is largely impractical, and the convolutive mixing model is more realistic.
The concept of time-frequency (TF) masking in the context of BSS is an emerging field of research that is receiving an escalating amount of attention due to its ease of applicability to a variety of acoustic environments. The intuitive notion of TF masking in the speech processing discipline originates from analyses of human speech perception and the observation of the phenomenon of masking in human hearing: in particular, the fact that the human auditory system preferentially processes the higher-energy components of observed speech whilst suppressing the lower-energy components. This notion can be applied within the BSS framework as described below.
In the TF masking approach to BSS, the assumption of sparseness between the speech sources, as initially investigated in (Yilmaz & Rickard, 2004), is typically exploited. There exist several definitions of sparseness in the literature: (Georgiev et al., 2005) simply defines it as the existence of "as many zeros as possible", whereas others offer a more quantifiable measure such as kurtosis (Li & Lutman, 2006). Often, a sparse representation of speech mixtures can be acquired through the projection of the signals onto an appropriate basis, such as the Gabor or Fourier basis. In particular, the sparseness of speech signals in the short-time Fourier transform (STFT) domain was investigated in (Yilmaz & Rickard, 2004) and subsequently termed W-disjoint orthogonality (W-DO). This significant discovery of W-DO in speech signals motivated the degenerate unmixing estimation technique (DUET), which was proven to successfully recover the original source signals from just a pair of microphone observations. Using a sparse representation of the observation mixtures, the relative attenuation and phase parameters between the observations are estimated at each TF cell. The parameter estimates are utilized in the construction of a power-weighted histogram; under the assumption of sufficiently ideal mixing conditions, the histogram will inherently contain peaks that denote the true mixing parameters. The final mixing parameter estimates are then used in the calculation of a binary TF mask.
The initiation of the TF masking approach to BSS is thus often credited to the authors of the DUET algorithm. Due to its versatility and applicability to a variety of acoustic conditions (under-, even- and overdetermined), the TF masking approach has since evolved into a popular and effective tool in BSS, and the formulation of the DUET algorithm has consequently motivated a plethora of demixing techniques.
Among the first extensions to the DUET was the time-frequency ratio of mixtures (TIFROM) algorithm (Abrard & Deville, 2005), which relaxed the condition of W-DO of the source signals and had a particular focus on underdetermined mixtures for arrays consisting of more than two sensors. However, its performance in reverberant conditions was not established, and the observations were restricted to the idealized linear and instantaneous case. Subsequent research in (Melia & Rickard, 2007) extended the DUET to echoic conditions with the DESPRIT (DUET-ESPRIT) algorithm, which made use of the existing ESPRIT (estimation of signal parameters via rotational invariance techniques) algorithm (Roy & Kailath, 1989). The ESPRIT algorithm was combined with the principles of DUET; however, in contrast to the DUET, it utilized more than two microphone observations with the sensors arranged in a uniform linear array. Due to this restriction in the array geometry, the algorithm was naturally subject to front-back confusions. Furthermore, a linear microphone arrangement constrains the spatial diversity obtainable from the microphone observations.
A different avenue of research in (Araki et al., 2004) composed a two-stage algorithm which combined the sparseness approach of DUET with the established ICA algorithm to yield the SPICA algorithm. The sparseness of the speech signals was first exploited in order to estimate and subsequently remove the active speech source at a particular TF point; following this removal, the ICA technique could be applied to the remaining mixtures. Naturally, a constraint upon the number of sources present at any TF point relative to the number of sensors was inevitable due to the ICA stage. Furthermore, the algorithm was only investigated for the stereo case.
The authors of the SPICA expanded their research to nonlinear microphone arrays in (Araki et al., 2005; 2006a;b) with the introduction of the clustering of normalized observation vectors. Whilst remaining similar in spirit to the DUET, this research was inclusive of nonideal conditions such as room reverberation. It eventually culminated in the development of the multiple sensor DUET (MENUET) (Araki et al., 2007). The MENUET is advantageous over the DUET in that it allows more than two sensors in an arbitrary nonlinear arrangement, and it was evaluated on underdetermined reverberant mixtures. In this algorithm the mask estimation was also automated through the application of the k-means clustering algorithm.
Another algorithm proposing a clustering approach to mask estimation is presented in (Reju et al., 2010). That study is based upon the concept of complex angles in the complex vector space; however, its evaluations were restricted to a linear microphone array.
Despite its advancements over previous techniques, the MENUET is not without limitations: most significantly, the k-means clustering is not robust in the presence of outliers or interference in the data. This often leads to suboptimal localization and partitioning results, particularly for reverberant mixtures. Furthermore, binary masking, as employed in the MENUET, has been shown to degrade the separation quality through musical noise distortions. The authors of (Araki et al., 2006a) suggest that fuzzy TF masking approaches bear the potential to significantly reduce the musical noise at the output.
In (Kühne et al., 2010) the use of fuzzy c-means (FCM) clustering for mask estimation was investigated within the TF masking framework of BSS; in contrast to MENUET, this approach integrated a fuzzy partitioning in the clustering in order to model the inherent ambiguity surrounding the membership of a TF cell to a cluster. Examples of factors contributing to such ambiguous conditions include the effects of reverberation and additive channel noise at the sensors in the array. However, this investigation, as with many others in the literature, possessed the significant restriction of being limited to a linear microphone arrangement.
Another clustering approach to TF mask estimation lies in the implementation of Gaussian mixture models (GMM). The use of GMMs in conjunction with the Expectation-Maximization (EM) algorithm for the representation of feature distributions has been previously investigated in the sparseness approach to BSS (Araki et al., 2009; Izumi et al., 2007; Mandel et al., 2006). This avenue of research is motivated by the intuitive notion that the individual component densities of the GMM may model some underlying set of hidden parameters in a mixture of sources. Due to the reported success of BSS methods that employ such Gaussian models, the GMM-EM may be considered a standard algorithm for mask estimation in this framework, and it is therefore regarded as a comparative model in this study.
However, none of the TF mask estimation approaches to BSS discussed above is inclusive of the noisy reverberant BSS scenario. Almost all real-world applications of BSS suffer the undesired aspect of additive noise at the recording sensors (Cichocki et al., 1996). The influence of additive noise has been described as a very difficult and continually open problem in the BSS framework (Mitianoudis & Davies, 2003). Numerous studies have been proposed to solve this problem: (Li et al., 2006) presents a two-stage denoising/separation algorithm; (Cichocki et al., 1996) implements an FIR filter at each channel to reduce the effects of additive noise; and (Shi et al., 2010) suggests a preprocessing whitening procedure for enhancement.
Whilst noise reduction has been achieved with denoising techniques implemented as a pre- or post-processing step, the performance has been shown to degrade significantly at lower signal-to-noise ratios (SNR) (Godsill et al., 1997). Furthermore, the aforementioned techniques for the compensation of additive noise have yet to be extended and applied in depth to the TF masking approach to BSS.
Motivated by these shortcomings, this chapter presents an extension of the MENUET algorithm via a novel amalgamation with the FCM as in (Kühne et al., 2010) (see Fig. 1). The applicability of MENUET to underdetermined mixtures and arbitrary sensor constellations renders it superior in many scenarios to the investigation in (Kühne et al., 2010); however, its performance is hindered by its non-robust approach to mask estimation. Firstly, this study proposes that the combination of fuzzy clustering with the MENUET algorithm, henceforth denoted MENUET-FCM, will improve the separation performance in reverberant conditions. Secondly, it is hypothesized that this combination is sufficiently robust to withstand the degrading effects of reverberation and random additive channel noise. For all investigations in this study, the GMM-EM clustering algorithm for mask estimation is implemented with the MENUET (and denoted MENUET-GMM) for comparative purposes. As a side note, it should be observed that all ensuing instances of the term MENUET refer to the original MENUET algorithm as in (Araki et al., 2007).
The remainder of the chapter is structured as follows. Section 2 provides a detailed overview of the MENUET and the proposed modifications to the algorithm. Section 3 explains the three different clustering algorithms and their utilization for TF mask estimation. Section 4 presents details of the experimental setup and evaluations, and demonstrates the superiority of the proposed MENUET-FCM combination over the baseline MENUET and MENUET-GMM for BSS in realistic acoustic environments. Section 5 provides a general discussion with insight into potential directions for future research. Section 6 concludes the chapter with a brief summary.

Fig. 1. Basic scheme of the proposed time-frequency masking approach for BSS.

Source separation with TF masking
This section provides an introduction to the problem statement of underdetermined BSS and insight into the TF masking approach for BSS. The MENUET, MENUET-FCM and MENUET-GMM algorithms are described in greater detail.

Problem statement
Consider a microphone array made up of M identical sensors in a reverberant enclosure where N sources are present. It is assumed that the observation x_m(t) at the m-th sensor can be modeled as the summation of the received images, denoted s_mn(t), of each source s_n(t):

x_m(t) = Σ_{n=1..N} s_mn(t) + n_m(t),    (2)

where

s_mn(t) = Σ_p h_mn(p) s_n(t − p),    (3)

and where t indicates time, h_mn(p) represents the room impulse response from the n-th source to the m-th sensor and n_m(t) denotes the additive noise present at the m-th sensor.
The goal of any BSS system is to recover the sets of separated source signal images {ŝ_11(t), ..., ŝ_M1(t)}, ..., {ŝ_1N(t), ..., ŝ_MN(t)}, where the n-th set corresponds to the estimated source signal ŝ_n(t), and ŝ_mn(t) denotes the estimate of the n-th source image s_mn(t) at the m-th sensor. Ideally, the separation is performed without any information about s_n(t), h_mn(p) or the true source images s_mn(t).

Feature extraction
The time-domain microphone observations x_m(t), sampled at frequency f_s, are converted into their corresponding frequency-domain time series X_m(k, l) via the STFT

X_m(k, l) = Σ_τ x_m(τ) win(τ − k τ_0) e^{−j l ω_0 τ},    (4)

where k ∈ {0, ..., K − 1} is a time frame index, l ∈ {0, ..., L − 1} is a frequency bin index, win(τ) is an appropriately selected window function and τ_0 and ω_0 are the TF grid resolution parameters. The analysis window is typically chosen such that sufficient information is retained within each frame whilst simultaneously reducing signal discontinuities at the frame edges. A suitable window is the Hann window

win(τ) = (1/2)(1 − cos(2πτ/L)), 0 ≤ τ ≤ L − 1,    (5)

where L denotes the frame size.
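The windowed analysis above can be sketched in a few lines; the frame size and hop below are illustrative choices, not values prescribed by the chapter.

```python
import numpy as np

def stft(x, frame=512, hop=256):
    """Frame-wise DFT of a signal with a Hann analysis window.

    A minimal sketch of the STFT used in the text; real implementations
    would also handle padding and reconstruction (inverse STFT).
    """
    # Hann window: 0.5 * (1 - cos(2*pi*tau / frame))
    win = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(frame) / frame))
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[k * hop : k * hop + frame] * win
                       for k in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # shape: (time frames, frequency bins)

x = np.sin(2 * np.pi * 0.05 * np.arange(4096))
X = stft(x)
assert X.shape == (1 + (4096 - 512) // 256, 512 // 2 + 1)
```

Each row of the returned matrix corresponds to one time frame index k, and each column to one frequency bin l, matching the X_m(k, l) notation of the text.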
It is assumed that the frame size L is sufficient such that the main portion of the impulse responses h_mn is covered. The convolutive BSS problem may therefore be approximated as an instantaneous mixture model (Smaragdis, 1998) in the STFT domain:

X_m(k, l) ≈ Σ_{n=1..N} H_mn(l) S_n(k, l) + N_m(k, l),    (6)

where (k, l) represents the time and frequency indices respectively, H_mn(l) is the frequency response from source n to sensor m, and X_m(k, l), S_n(k, l) and N_m(k, l) are the STFTs of the m-th observation, the n-th source and the additive noise at the m-th sensor respectively. The sparseness of the speech signals admits at most one dominant speech source S_n(k, l) per TF cell (Yilmaz & Rickard, 2004). The sum in (6) is therefore reduced to

X_m(k, l) ≈ H_mn(l) S_n(k, l) + N_m(k, l),    (7)

where n is the index of the source dominant at (k, l). Whilst this assumption holds true for anechoic mixtures, it becomes increasingly unreliable as the reverberation in the acoustic scene increases, due to the effects of multipath propagation and multiple reflections (Kühne et al., 2010; Yilmaz & Rickard, 2004).
In this work the TF mask estimation is realized through the estimation of the TF points at which a signal is assumed dominant. To estimate such TF points, a spatial feature vector is calculated from the STFT representations of the M observations. Previous research has identified level ratios and phase differences between the observations as appropriate features in this BSS framework, as such features retain information on the magnitude and the argument of the TF points. A comprehensive review is presented in (Araki et al., 2007), with further discussion presented in Section 4.2.1. Should the source signals exhibit sufficient sparseness, the clustering of the level ratios and phase differences will yield geometric information on the source and sensor locations, and thus facilitate effective separation.
The feature vector per TF point is estimated as

θ(k, l) = [θ^L(k, l), θ^P(k, l)]^T,    (8)

with level-ratio and phase-difference components

θ_j^L(k, l) = |X_j(k, l)| / A(k, l),    (9)

θ_j^P(k, l) = (1/(α f_l)) arg[ X_j(k, l) / X_J(k, l) ], j ≠ J,    (10)

where A(k, l) = (Σ_{m=1..M} |X_m(k, l)|²)^{1/2}, f_l is the centre frequency of the l-th bin, α is a phase-normalization constant proportional to c^{−1} d_max, c is the propagation velocity, d_max is the maximum distance between any two sensors in the array and J is the index of the reference sensor. The weighting parameters A(k, l) and α ensure appropriate amplitude and phase normalization of the features respectively. It is widely known that in the presence of reverberation, a greater accuracy in phase ratio measurements can be achieved with greater spatial resolution; however, it should be noted that the value of d_max is upper bounded by the spatial aliasing theorem.
The frequency normalization of the phase ratios ensures their frequency independence in order to prevent the frequency permutation problem in the later stages of clustering. It is possible to cluster without such frequency independence, as for example in (Sawada et al., 2007; 2011); however, the utilization of all frequency bins in the clustering stage avoids this problem and also permits data observations of short length (Araki et al., 2007).
Rewriting the feature vector in complex representation yields

θ̃_j(k, l) = θ_j^L(k, l) exp( j θ_j^P(k, l) ),    (11)

where θ_j^L and θ_j^P are the j-th components of (9) and (10) respectively. In this feature vector representation, the phase difference information is captured in the argument term, and the level ratio is normalized by the normalization term A(k, l).
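The feature extraction described above can be sketched as follows. The normalization constant alpha and the bin centre frequencies are illustrative assumptions in the spirit of (Araki et al., 2007), not values prescribed by the chapter.

```python
import numpy as np

# Hedged sketch of MENUET-style spatial features per TF cell: level ratios
# normalized by A(k,l), and phase differences relative to a reference sensor J
# divided by frequency for frequency independence.
rng = np.random.default_rng(1)
M, K, B = 3, 10, 8                     # sensors, time frames, frequency bins
X = rng.standard_normal((M, K, B)) + 1j * rng.standard_normal((M, K, B))

J = 0                                  # reference sensor index
c, d_max = 340.0, 0.04                 # speed of sound (m/s), max spacing (m)
f = np.arange(1, B + 1) * 100.0        # illustrative bin centre frequencies (Hz)
alpha = 4.0 * d_max / c                # assumed phase-normalization constant

others = [m for m in range(M) if m != J]
A = np.sqrt(np.sum(np.abs(X) ** 2, axis=0))        # amplitude normalizer A(k,l)
level = np.abs(X[others]) / A                      # level-ratio features
phase = np.angle(X[others] / X[J]) / (alpha * f)   # frequency-normalized phases
theta = np.concatenate([level, phase], axis=0)     # 2(M-1) features per cell

assert theta.shape == (2 * (M - 1), K, B)
```

Stacking the level and phase components per cell yields the 2(M − 1)-dimensional feature vectors that are subsequently clustered.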
Fig. 2(a) and 2(b) depict the histograms of extracted level ratios and phase differences, respectively, in an ideal anechoic environment. The clear peaks in the phase histogram in Fig. 2(b) are distinctly visible and correspond to the sources. However, when the anechoic assumption is violated and reverberation is introduced into the environment, the distinction between the peaks loses clarity, as is evident in the phase ratio histogram in Fig. 2(c). Furthermore, the degrading effects of additive channel noise can be seen in Fig. 2(d), where the phase ratio completely loses its reliability. It is hypothesized in this study that a sufficiently robust TF mask estimation technique will be able to withstand the effects of reverberation and/or additive noise in the acoustic environment.
The masking approach to BSS relies on the observation that, in an anechoic setting, the extracted features are expected to form N clusters, where each cluster corresponds to a source at a particular location. Since the relaxation of the anechoic assumption reduces the accuracy of the extracted features, as mentioned above in Section 2.2, it is imperative that a sufficiently robust TF clustering technique is implemented in order to effectively separate the sources.
The extracted feature vector set is subsequently clustered for mask estimation. The FCM clustering yields a partition matrix with entries u_n(k, l), where u_n(k, l) indicates the degree of membership of the TF cell (k, l) to the n-th cluster. The GMM-EM clustering results in the parameter set Λ = {λ_1, ..., λ_G} associated with the Gaussian mixture densities, where G is the number of mixture components in the Gaussian densities, and each λ_i has a representative mean and covariance matrix. Further details on the three main clustering algorithms used in this study are provided in Section 3.

Mask estimation and separation
In this work source separation is effectuated by the application of TF masks, which are the direct result of the clustering step.
For the k-means algorithm, a binary mask for the n-th source is simply estimated as

M_n(k, l) = 1 if θ(k, l) ∈ C_n, and M_n(k, l) = 0 otherwise.    (12)

In the instance of FCM clustering, the membership partition matrix is interpreted as a collection of N fuzzy TF masks:

M_n(k, l) = u_n(k, l).    (13)

For the GMM-EM algorithm, the mask estimation is based upon the calculation of probabilities from the final optimized parameter set Λ = {λ_1, ..., λ_N}. The parameter set is used to estimate the masks as

M_n(k, l) = p(θ(k, l) | λ_n) / Σ_{i=1..N} p(θ(k, l) | λ_i),    (14)

where λ_n denotes the parameter set pertaining to the n-th source, and the probabilities p(θ(k, l) | λ_n) are calculated using a simple normal distribution (Section 3.3).
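The application of the masks to the observed STFT can be sketched with toy data: each source estimate is obtained by weighting a reference observation's STFT with that source's mask, whether binary (hard clustering) or fuzzy (membership-based).

```python
import numpy as np

# Sketch of separation by TF masking. The cluster assignments and memberships
# below are random toy values standing in for the output of a clustering step.
rng = np.random.default_rng(2)
K, B, N = 6, 4, 3                       # time frames, frequency bins, sources
X_ref = rng.standard_normal((K, B)) + 1j * rng.standard_normal((K, B))

labels = rng.integers(0, N, size=(K, B))            # hard cluster assignments
binary_masks = np.stack([(labels == n).astype(float) for n in range(N)])

U = rng.random((N, K, B))
fuzzy_masks = U / U.sum(axis=0)                     # memberships sum to 1/cell

est_binary = binary_masks * X_ref                   # N masked source images
est_fuzzy = fuzzy_masks * X_ref

# Both mask families conserve the observation when summed over sources.
assert np.allclose(est_binary.sum(axis=0), X_ref)
assert np.allclose(est_fuzzy.sum(axis=0), X_ref)
```

An inverse STFT of each masked image would then return the separated signals to the time domain.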

Hard k-means clustering
Previous methods (Araki et al., 2006b; 2007) employ hard clustering techniques such as the hard k-means (HKM) (Duda et al., 2000). In this approach, the feature vectors θ(k, l) are clustered to form N distinct clusters C_1, ..., C_N.
The clustering is achieved through the minimization of the objective function

J_km = Σ_{n=1..N} Σ_{θ(k,l) ∈ C_n} ||θ(k, l) − c_n||²,    (15)

where ||·|| denotes the Euclidean norm and c_n denotes the n-th cluster centroid. Starting with a random initialization of the centroids, this minimization is iteratively realized by alternating the following equations until convergence is met:

C_n* = { θ(k, l) : n = argmin_i ||θ(k, l) − c_i||² },    (16)

c_n* = E{θ(k, l)}_{θ(k,l) ∈ C_n},    (17)

where E{·}_{θ(k,l) ∈ C_n} denotes the mean operator over the TF points within the cluster C_n, and the (*) operator denotes the optimal value. The resulting N clusters are then utilized in the mask estimation as described in Section 2.3. Due to the algorithm's sensitivity to the initialization of the cluster centres, it is recommended either to design initial centroids using an assumption on the sensor and source geometry (Araki et al., 2007), or to utilize the best outcome of a predetermined number of independent runs.
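The alternating assignment and centroid updates described above can be sketched as follows. Initialization here is deterministic for brevity; as the text notes, geometric initialization or the best of several random runs is used in practice.

```python
import numpy as np

# Minimal sketch of the hard k-means alternation used for mask estimation:
# assign each feature vector to its nearest centroid, then move each centroid
# to the mean of its assigned vectors, until the centroids stop changing.
def hard_kmeans(theta, N, n_iter=50):
    centroids = theta[:: max(1, len(theta) // N)][:N].astype(float).copy()
    labels = np.zeros(len(theta), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(theta[:, None, :] - centroids[None], axis=2)
        labels = d.argmin(axis=1)                          # assignment step
        new = np.stack([theta[labels == n].mean(axis=0) for n in range(N)])
        if np.allclose(new, centroids):
            break
        centroids = new                                    # centroid update
    return labels, centroids

# Two well-separated toy clusters are recovered exactly.
theta = np.vstack([np.zeros((20, 2)), 5.0 + np.zeros((20, 2))])
labels, cents = hard_kmeans(theta, 2)
assert (labels[:20] == labels[0]).all()
assert (labels[20:] == labels[20]).all()
assert labels[0] != labels[20]
```

The resulting hard labels are exactly the binary mask assignments used in the separation step.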
Whilst this binary clustering performed satisfactorily in both simulated and realistic reverberant environments, the authors of (Jafari et al., 2011; Kühne et al., 2010) demonstrate that the application of a soft masking scheme improves the separation performance substantially.


Fuzzy c-means clustering
In the fuzzy c-means clustering, the feature set Θ = {θ(k, l) | θ(k, l) ∈ R^{2(M−1)}, (k, l) ∈ Ω} is clustered using the fuzzy c-means algorithm (Bezdek, 1981) into N clusters, where Ω = {(k, l) : 0 ≤ k ≤ K − 1, 0 ≤ l ≤ L − 1} denotes the set of TF points in the STFT plane. Each cluster is represented by a centroid v_n and the partition matrix U = {u_n(k, l) ∈ R | n ∈ (1, ..., N), (k, l) ∈ Ω}, which specifies the degree u_n(k, l) to which a feature vector θ(k, l) belongs to the n-th cluster. Clustering is achieved by the minimization of the cost function

J_fcm = Σ_{(k,l) ∈ Ω} Σ_{n=1..N} u_n(k, l)^q d_n²(k, l),    (23)

where d_n²(k, l) = ||θ(k, l) − v_n||² is the squared Euclidean distance between the vector θ(k, l) and the n-th cluster centre. The fuzzification parameter q > 1 controls the membership softness; a value in the range q ∈ (1, 1.5] has been shown to result in a fuzzy performance akin to hard (binary) clustering (Kühne et al., 2010). However, superior mask estimation ability has been established with q = 2; thus, in this work the fuzzification parameter q is set to 2.
The minimization problem in (23) can be solved using Lagrange multipliers and is typically implemented as an alternating optimization scheme due to the open nature of its solution (Kühne et al., 2010; Theodoridis & Koutroumbas, 2006). Initialized with a random partitioning, the cost function J_fcm is iteratively minimized by alternating the updates for the cluster centres

v_n* = Σ_{(k,l) ∈ Ω} u_n(k, l)^q θ(k, l) / Σ_{(k,l) ∈ Ω} u_n(k, l)^q, ∀n,    (24)

and memberships

u_n*(k, l) = 1 / Σ_{i=1..N} ( d_n(k, l) / d_i(k, l) )^{2/(q−1)}, ∀n, (k, l),    (25)

where (*) denotes the optimal value, until a suitable termination criterion is satisfied. Typically, convergence is declared when the difference between successive partition matrices falls below some predetermined threshold (Bezdek, 1981). However, as is also the case with the k-means (Section 3.1), the alternating optimization scheme may converge to a local, as opposed to global, optimum; it is therefore suggested to independently run the algorithm several times prior to selecting the most fitting result.
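The alternating FCM updates can be sketched on toy data as follows; in practice the inputs are the feature vectors θ(k, l), and the tolerance and iteration cap below are illustrative.

```python
import numpy as np

# Sketch of the fuzzy c-means alternation (Bezdek, 1981): centroid and
# membership updates repeated until the partition matrix stops changing.
# q = 2, as adopted in the chapter.
def fcm(theta, N, q=2.0, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((N, len(theta)))
    U /= U.sum(axis=0)                               # random initial partition
    V = np.zeros((N, theta.shape[1]))
    for _ in range(n_iter):
        W = U ** q
        V = (W @ theta) / W.sum(axis=1, keepdims=True)       # centroid update
        d2 = ((theta[None] - V[:, None]) ** 2).sum(axis=2)   # squared distances
        d2 = np.maximum(d2, 1e-12)                           # avoid divide-by-0
        U_new = d2 ** (-1.0 / (q - 1.0))
        U_new /= U_new.sum(axis=0)                           # membership update
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V

theta = np.vstack([np.zeros((20, 2)), 5.0 + np.zeros((20, 2))])
U, V = fcm(theta, 2)
assert np.allclose(U.sum(axis=0), 1.0)     # memberships sum to 1 per point
assert (U.max(axis=0) > 0.99).all()        # well-separated data -> crisp masks
```

For well-separated data the memberships approach the binary case, while ambiguous points receive graded memberships, which is precisely what the fuzzy masks exploit.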

Gaussian mixture model clustering
To further examine the separation ability of the MENUET-FCM scheme, another clustering approach, based upon GMM clustering, is presented in this study. A GMM of a multivariate distribution Θ may be represented by a weighted sum of G component Gaussian densities:

p(Θ | Λ) = Σ_{i=1..G} w_i g(Θ | µ_i, Σ_i),    (26)

where w_i, i = 1, ..., G are the mixture weights, g(Θ | µ_i, Σ_i) are the component Gaussian densities and Λ = {λ_1, ..., λ_G} is the set of hidden parameters of the Gaussian components. Each component density is a D-variate Gaussian function of the form

g(Θ | µ_i, Σ_i) = (2π)^{−D/2} |Σ_i|^{−1/2} exp( −(1/2)(Θ − µ_i)^T Σ_i^{−1} (Θ − µ_i) ),    (27)

with mean vector µ_i and covariance matrix Σ_i. The mixture weights are constrained to satisfy

Σ_{i=1..G} w_i = 1.    (28)

The goal of the GMM-EM clustering is to fit the feature data to a Gaussian mixture model and then estimate the maximum-likelihood values of the hidden parameters Λ = {λ_1, ..., λ_G}, where each λ_i comprises its associated mean vector µ_i and covariance matrix Σ_i. The features Θ(k, l) will henceforth be denoted as Θ for simplicity. Under the assumption of independence between the features, the likelihood of the parameters L(Λ | Θ) is given by

L(Λ | Θ) = Π_{t=1..T} p(Θ_t | Λ),    (29)

where T is the total number of TF cells per feature (i.e. K × L). The estimation of the optimum hidden parameter set Λ* relies on the maximization of (29):

Λ* = argmax_Λ L(Λ | Θ).    (30)

Because the log of the likelihood is typically maximized in lieu of the likelihood itself, and (29) is a nonlinear function of Λ, the maximization over the G mixture components is a difficult problem. However, the maximum-likelihood (ML) estimates of these parameters may be calculated using the Expectation-Maximization (EM) algorithm (Izumi et al., 2007). The EM algorithm is iterated until a predetermined convergence threshold is reached.
The choice of the number of Gaussian mixture components for fitting the microphone array data is critical, and is typically determined by trial and error (Araki et al., 2007). In this study, the number of mixture components is set equal to the number of sources in order to facilitate the association of clusters with sources. In the case where G > N, this association becomes ambiguous.

This assumption that each resulting Gaussian cluster uniquely fits one source allows the calculation of the probability that a TF cell originates from the n-th source, since this probability is equivalent to the probability that the TF cell originates from the n-th mixture component. It is assumed in this study that the membership probability follows a normal distribution:

p(θ(k, l) | λ_n*) = g(θ(k, l) | µ_n*, Σ_n*),    (31)

where λ_n* ∈ Λ* = {λ_1*, ..., λ_N*}.
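Given fitted per-source parameters, the GMM-based mask for source n at a TF cell is the posterior probability that the cell's feature vector came from the n-th component. A sketch with toy parameters (standing in for EM estimates):

```python
import numpy as np

# Hedged sketch of GMM-based mask computation from already-fitted parameters
# (means, covariances, weights). The parameter values are toy assumptions,
# not the output of an actual EM fit.
def gaussian_pdf(x, mu, cov):
    # D-variate Gaussian density evaluated at a single point x
    D = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def gmm_masks(theta, means, covs, weights):
    # theta: (T, D) feature vectors; returns (N, T) posterior masks
    p = np.array([[w * gaussian_pdf(x, m, c) for x in theta]
                  for m, c, w in zip(means, covs, weights)])
    return p / p.sum(axis=0)

theta = np.array([[0.1, -0.1], [4.9, 5.2]])      # two toy feature vectors
means = [np.zeros(2), 5.0 * np.ones(2)]
covs = [np.eye(2), np.eye(2)]
masks = gmm_masks(theta, means, covs, [0.5, 0.5])

assert np.allclose(masks.sum(axis=0), 1.0)
assert masks[0, 0] > 0.99 and masks[1, 1] > 0.99
```

Because the posteriors sum to one over the sources, these masks are soft, like the FCM memberships, rather than binary.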

Experimental evaluations

Experimental setup
Fig. 3. The room setup for the three-sensor nonlinear arrangement used in the experimental evaluations.
The experimental setup was designed to reproduce that in (Araki et al., 2007) and (Jafari et al., 2011) for comparative purposes. Fig. 3 depicts the speaker and sensor arrangement, and Table 1 details the experimental conditions. The wall reflections of the enclosure, as well as the room impulse responses for each sensor, were simulated using the image model method for small-room acoustics (Lehmann & Johansson, 2008). The room reverberation was quantified by the measure RT_60, defined as the time required for reflections of a direct sound to decay 60 dB below the level of the direct sound (Lehmann & Johansson, 2008).
For the noise-robust evaluations, spatially uncorrelated white noise was added to each sensor mixture such that the overall channel SNR assumed a value as given in Table 1. The SNR definition of (Loizou, 2007) was implemented, which employs the standardized method given in (ITU-T, 1994) to objectively measure the speech level. The four speech sources, the genders of which were randomly selected, were realized with phonetically rich utterances from the TIMIT database (Garofolo et al., 1993), and a representative number of mixtures was constructed in total for evaluation purposes. In order to avoid the spatial aliasing problem, the microphones were placed at a maximum distance of 4 cm apart.
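The noise-addition procedure can be sketched as follows. Note that the chapter measures speech level with the standardized ITU-T method; this sketch substitutes plain average power for simplicity.

```python
import numpy as np

# Sketch of adding white noise to a sensor mixture at a prescribed channel
# SNR: the noise is scaled so that 10*log10(signal power / noise power)
# equals the target value. Average power stands in for the ITU-T P.56
# active speech level used in the actual evaluations.
def add_noise_at_snr(x, snr_db, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(x))
    p_sig = np.mean(x ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return x + scale * noise, scale * noise

x = np.sin(2 * np.pi * 0.01 * np.arange(8000))
noisy, n = add_noise_at_snr(x, 10.0)
snr = 10 * np.log10(np.mean(x ** 2) / np.mean(n ** 2))
assert abs(snr - 10.0) < 1e-6
```

Repeating this independently per sensor yields the spatially uncorrelated channel noise described above.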

Table 1. Experimental conditions.

As briefly discussed in Sections 3.1 and 3.2, it is widely recognized that the performance of the clustering algorithms is largely dependent on their initialization. For both the MENUET and MENUET-FCM, the best of 100 runs was selected in order to minimize the possibility of finding a local, as opposed to global, optimum. In order to ensure the GMM fitting of the mixtures in the MENUET-GMM evaluations, the initial values for the mean and variance in the parameter set Λ had to be selected appropriately. The initialization of these parameters has been proven to be an imperative yet difficult task; should the selection be unsuccessful, the GMM fitting may completely fail (Araki et al., 2007). In this study, the mean and variance for each parameter set were initialized using the k-means algorithm.

Evaluation measures
For the purposes of speech separation performance evaluation, two versions of the publicly available MATLAB toolbox BSS_EVAL were implemented (Vincent et al., 2006; 2007). These performance criteria are applicable to all source separation approaches, and no prior information about the separation algorithm is required. Separation performance was evaluated with respect to the global image-to-spatial-distortion ratio (ISR), signal-to-interference ratio (SIR), signal-to-artifact ratio (SAR) and signal-to-distortion ratio (SDR), as defined in (Vincent et al., 2007); in all instances, a higher ratio indicates better separation performance.
This assumes the decomposition of the estimated source image

ŝ_mn(t) = s^img_mn(t) + ê^spat_mn(t) + ê^int_mn(t) + ê^artif_mn(t),

where s^img_mn(t) is the true source image and ê^spat_mn(t), ê^int_mn(t) and ê^artif_mn(t) are the undesired error components corresponding to the spatial distortion, interferences and artifacts respectively. This decomposition is motivated by the auditory notion of distinguishing between sounds originating from the target source, sounds from the other sources present, and "gurgling" noise, corresponding to s^img_mn(t) + ê^spat_mn(t), ê^int_mn(t) and ê^artif_mn(t) respectively. The decomposition of the estimated signal was executed using the function bss_eval_images, which computes the spatial distortion and interferences by means of a least-squares projection of the estimated source image onto the corresponding signal subspaces. As recommended in (Vincent et al., 2007), the filter length was set to the maximal tractable length of 512 taps (64ms).
The ISR of the n-th recovered source is then calculated as

ISR_n = 10 log10 ( Σ_{m,t} s^img_mn(t)² / Σ_{m,t} ê^spat_mn(t)² ),

which provides a measure of the relative amount of spatial distortion present in the recovered signal.
The SIR, given by

SIR_n = 10 log10 ( Σ_{m,t} (s^img_mn(t) + ê^spat_mn(t))² / Σ_{m,t} ê^int_mn(t)² ),

provides an estimate of the relative amount of interference in the target source estimate. For all SIR evaluations the gain SIR_gain = SIR_output − SIR_input was computed in order to quantify the improvement between the input and the output of the proposed studies.
The SAR is computed as

SAR_n = 10 log10 ( Σ_{m,t} (s^img_mn(t) + ê^spat_mn(t) + ê^int_mn(t))² / Σ_{m,t} ê^artif_mn(t)² )

in order to give a quantifiable measure of the amount of artifacts present in the n-th source estimate.
As an estimate of the total error in the n-th recovered source (or equivalently, a measure of the overall separation quality), the SDR is calculated as

SDR_n = 10 log10 ( Σ_{m,t} s^img_mn(t)² / Σ_{m,t} (ê^spat_mn(t) + ê^int_mn(t) + ê^artif_mn(t))² ).

Similarly, the SNR of the estimated output signal was also evaluated using the BSS_EVAL toolkit. The estimated source ŝ_n(t) was assumed to follow the decomposition (Vincent et al., 2006)

ŝ_n(t) = s^target_n(t) + ê^noise_n(t) + ê^int_n(t) + ê^artif_n(t),

where s^target_n(t) is an allowed distortion of the original source, and ê^noise_n(t), ê^int_n(t) and ê^artif_n(t) are the noise, interference and artifact error terms respectively. The decomposition of the estimated signal in this instance was executed using the function bss_decomp_filt, which permits time-invariant filter distortions of the target source. As recommended in (Vincent et al., 2006), the filter length was set to 256 taps (32ms). The global SNR for the n-th source was subsequently calculated as

SNR_n = 10 log10 ( Σ_t (s^target_n(t) + ê^int_n(t))² / Σ_t ê^noise_n(t)² ).
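Given the four decomposed components of an estimated source image, the ratios above reduce to energy ratios expressed in decibels. The sketch below mirrors the definitions in (Vincent et al., 2007), assuming the components are supplied as numpy arrays; `bss_eval_ratios` is an illustrative name and not part of the BSS_EVAL toolbox itself.

```python
import numpy as np

def _db(num, den, eps=1e-12):
    """Energy ratio in decibels, guarded against division by zero."""
    return 10 * np.log10(num.sum() / (den.sum() + eps) + eps)

def bss_eval_ratios(s_img, e_spat, e_int, e_artif):
    """ISR, SIR, SAR and SDR for one source estimate.

    Each argument holds the samples of one decomposition component
    (a flattened channel-by-time array is fine); together the four
    components sum to the estimated source image.
    """
    isr = _db(s_img ** 2, e_spat ** 2)
    sir = _db((s_img + e_spat) ** 2, e_int ** 2)
    sar = _db((s_img + e_spat + e_int) ** 2, e_artif ** 2)
    sdr = _db(s_img ** 2, (e_spat + e_int + e_artif) ** 2)
    return isr, sir, sar, sdr
```

Obtaining the decomposition itself still requires the least-squares projections performed by bss_eval_images; this sketch only covers the final ratio computation.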

Initial evaluations of fuzzy c-means clustering
Firstly, to establish the feasibility of c-means clustering as a credible approach to the TF mask estimation problem for underdetermined BSS, the algorithm was applied to a range of feature sets as defined in (Araki et al., 2007). The authors of (Araki et al., 2007) present a comprehensive review of suitable location features for BSS within the TF masking framework, and evaluate their effectiveness using the k-means clustering algorithm. The experimental setup for this set of evaluations was chosen to replicate that in (Araki et al., 2007) as closely as possible. In an enclosure of dimensions 4.55m x 3.55m x 2.5m, two omnidirectional microphones were placed a distance of 4cm apart at an elevation of 1.2m. Three speech sources, also at an elevation of 1.2m, were situated at 30°, 70° and 135°, and the distance R between the array and the speakers was set to 50cm. The room reverberation time was constant at 128ms. The speech sources were randomly chosen from both genders of the TIMIT database in order to emulate the investigations in (Araki et al., 2007), which utilized English utterances.
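As a point of reference for the clustering stage, the standard fuzzy c-means updates (membership-weighted centroid means, followed by inverse-distance memberships with fuzzifier m) can be sketched in a few lines of numpy. This is a generic textbook FCM, not the MENUET-FCM implementation itself; the rows of the returned partition matrix U sum to one and can be read directly as a soft TF mask.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Fuzzy c-means on feature vectors X (n_points x n_dims).

    Returns the centroids v (n_clusters x n_dims) and the membership
    matrix U (n_points x n_clusters), where U[i, n] is the degree to
    which point i belongs to cluster n.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)          # rows sum to one
    for _ in range(n_iter):
        Um = U ** m
        # centroid update: membership-weighted means of the data
        v = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # membership update from inverse distances to the centroids
        d = np.linalg.norm(X[:, None, :] - v[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return v, U
```

With m → 1 the memberships harden towards the k-means assignment; m = 2 is the common default and the value typically used in soft-masking studies.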
A comparison of separation performance with respect to SIR improvement, shown in Table 2, reveals that the c-means outperformed the original k-means clustering on all but one feature set. This firstly establishes the applicability of c-means clustering in the proposed BSS framework, and secondly demonstrates the robustness of c-means clustering across a variety of spatial features. The results of this investigation provide further motivation to extend the fuzzy TF masking scheme to other sensor arrangements and acoustic conditions.

Separation performance in reverberant conditions
Once the feasibility of the fuzzy c-means clustering for source separation was established, the study was extended to a nonlinear three-sensor, four-source arrangement as in Fig. 3. The separation results with respect to the ISR, SIR gain, SDR and SAR for a range of reverberation times are given in Fig. 4(a)-(d) respectively. Fig. 4(a) depicts the ISR results; from these it is evident that there are considerable improvements in the MENUET-FCM over both the MENUET and the MENUET-GMM. Additionally, the MENUET-GMM demonstrates a slight improvement over the MENUET.

Table 2. Comparison of separation performance in terms of SIR improvement in dB for typical spatial features. Separation results are evaluated with the SIR for the TF masking approach to BSS when the hard k-means and fuzzy c-means algorithms are implemented for mask estimation. The reverberation time was constant at RT60 = 128ms.
The SIR gain in Fig. 4(b) clearly demonstrates the superiority in source separation of the MENUET-FCM. For example, at the high reverberation time of 450ms, the proposed MENUET-FCM outperformed both the baseline MENUET and the MENUET-GMM by almost 5dB.
Similar results were noted for the SDR, with substantial improvements when fuzzy masks were used. As the SDR provides a measure of the total error in the algorithm, this suggests that the fuzzy TF masking approach to BSS is more robust against algorithmic error than the other approaches.
The superiority of the fuzzy masking scheme is further established by the SAR values depicted in Fig. 4(d). A consistently high value is achieved across all reverberation times, unlike the other approaches, which fail to attain such values. This indicates that the fuzzy TF masking scheme yields source estimates with fewer artifacts present. This is in accordance with the study in (Araki et al., 2006a), which demonstrated that soft TF masks are able to significantly reduce the musical noise in recovered signals, owing to the inherent tendency of the fuzzy mask to prevent excess zero padding in the recovered source signals.
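The contrast between binary and fuzzy masking can be made concrete: a binary mask zeroes every TF cell not won by a source, while the fuzzy membership mask shares each cell's energy between sources, avoiding the abrupt zeros associated with musical noise. A minimal sketch, with an illustrative function name not taken from the study:

```python
import numpy as np

def apply_masks(S, memberships):
    """Recover per-source spectrograms from a mixture STFT S.

    memberships has shape (n_sources,) + S.shape and sums to one over the
    source axis. The hard mask keeps each TF cell for exactly one source
    (zeroing it for all others), while the fuzzy mask weights the cell by
    its membership; ties in the hard mask are rare with real memberships.
    """
    hard = (memberships == memberships.max(axis=0, keepdims=True)).astype(float)
    return hard * S, memberships * S
```

Summing the fuzzy-masked outputs over the source axis reconstructs the mixture exactly, which is precisely the property the binary mask gives up.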

It is additionally observed that the FCM algorithm yields a significantly reduced standard deviation, which further implies consistency in the algorithm's source separation ability.

Separation performance in reverberant conditions with additive noise
The impact of additive white channel noise on separation quality was evaluated next. The reverberation time was varied from 0ms to 300ms, and the SNR at the sensors of the microphone array was varied from 0dB to 30dB in 5dB increments.
Tables 3(a)-(d) depict the separation results of the evaluations with respect to the measured ISR, SIR gain, SDR and SAR respectively. It is clear from the tables that the proposed MENUET-FCM algorithm has significantly increased separation ability over all tested conditions and for all performance criteria. In particular, the MENUET-FCM demonstrates excellent separation ability even in the higher 300ms reverberation condition.

SNR evaluations
For the purposes of speech quality assessment, the SNR of each recovered speech signal was calculated using the definition in (Vincent et al., 2006) and averaged across all evaluations, with the results shown in Table 4. The MENUET-FCM approach is again observed to be more robust against additive channel noise at the recovered output. However, a remarkable improvement in SNR values for the recovered speech sources is observed for all clustering techniques. This suggests that the original MENUET, the MENUET-GMM and the MENUET-FCM have applications beyond BSS alone, and may in fact be useful in applications that also require speech enhancement capabilities. This has important repercussions, as it demonstrates that these approaches are able to withstand additive noise without significant degradation in performance, and thus bear the potential to additionally be utilized as a speech enhancement stage in a BSS system.

Discussion
The experimental results presented have demonstrated that the implementation of the fuzzy c-means clustering with the nonlinear microphone array setup as in the MENUET renders superior separation performance in conditions where reverberation and/or additive channel noise exist.
The feasibility of the fuzzy c-means clustering was first tested on a range of spatial feature vectors in an underdetermined setting using a stereo microphone array, and compared against the original baseline k-means clustering of the MENUET algorithm. The successful outcome of this prompted further investigation, with a natural extension to a nonlinear microphone array. The GMM-EM clustering algorithm was also implemented as a second baseline in order to assess the quality of the c-means against binary masking schemes other than the k-means. Evaluations confirmed the superiority of c-means clustering, with positive improvements recorded for the average performance in all acoustic settings. In addition, the consistent performance even under increased reverberation establishes the potential of fuzzy c-means clustering for the TF masking approach.
Rather than focus solely upon the reverberant BSS problem, this study extended it to include additive channel noise. It was suggested that, given the fuzzy c-means' documented robustness in reverberant environments, the extension to the noisy reverberant case would demonstrate similar abilities. Evaluations confirmed this hypothesis, with especially noteworthy improvements in the measured SIR gain and SDR. Furthermore, the MENUET, MENUET-GMM and MENUET-FCM approaches were all shown to possess inherent speech enhancement abilities, with higher SNRs measured at the recovered signals.
However, a possible hindrance in the MENUET-GMM clustering was discussed previously regarding the correct selection of the number of fitted Gaussians (Section 3.3). Should the number of Gaussians be increased in a bid to improve performance, an appropriate clustering approach must then be applied in order to group the Gaussians originating from the same speaker; for example, a nearest-neighbour or correlative clustering algorithm may be used.
Ultimately, the goal of any speech processing system is to mimic the auditory and cognitive abilities of humans as closely as possible, and the appropriate implementation of a BSS scheme is an encouraging step towards this goal. This study has demonstrated that with the use of suitable time-frequency masking techniques, robust blind source separation can be achieved in the presence of both reverberation and additive channel noise. The success of the MENUET-FCM suggests that future work on this subject is highly feasible for real-life speech processing systems.

Conclusions
This chapter has presented an introduction to advancements in the time-frequency approach to multichannel BSS. A non-exhaustive review of mask estimation techniques was presented, with insight into the shortcomings affiliated with existing masking techniques. In a bid to overcome these shortcomings, a novel amalgamation of two existing BSS approaches was proposed and evaluated in (simulated) realistic multisource environments.
It was suggested that a binary masking scheme for the TF masking approach to BSS is inadequate at encapsulating the inevitable reverberation present in any practical acoustic setup, and thus a more suitable means of clustering the observation data, such as the fuzzy c-means, should be considered. The presented MENUET-FCM algorithm integrated fuzzy c-means clustering with the established MENUET technique for automatic TF mask estimation.
In a number of experiments designed to evaluate the feasibility and performance of the c-means in the BSS context, the MENUET-FCM was found to outperform both the original MENUET and the MENUET-GMM in source separation performance. The experiments ranged from a stereo (linear) microphone array setup to a nonlinear arrangement, in both anechoic and reverberant conditions. Furthermore, additive white channel noise was also included in the evaluations in order to better reflect the conditions of realistic acoustic environments.
Future work should endeavor to refine the robustness of the feature extraction/mask estimation stage, and to improve the clustering technique, in order to propel the MENUET-FCM towards a truly blind system; details are presented in the following section. Furthermore, the evaluation of BSS performance in alternative contexts, such as automatic speech recognition, should also be considered in order to gain greater perspective on its potential for implementation in real-life speech processing systems.

Future directions
Future work should focus upon improving the robustness of the mask estimation (clustering) stage of the algorithm. For example, an alternative distance measure in the FCM can be considered: it has been shown (Hathaway et al., 2000) that the Euclidean distance metric employed in the c-means distance calculation may not be robust to outliers caused by undesired interferences in the acoustic environment. A measure such as the l1-norm could be implemented in a bid to reduce this error (Kühne et al., 2010).
Additionally, the authors of (Kühne et al., 2010) also considered the implementation of observation weights and contextual information in an effort to emphasize reliable features whilst simultaneously attenuating unreliable ones. In such a study, a suitable metric is required to determine this reliability: in the formulation of such a metric, consideration may be given to the behavior of proximate TF cells through a property such as variance (Kühne et al., 2009).
Alternatively, the robustness of the feature extraction stage can also be investigated. As described in Section 2.2, the inevitable conditions of reverberation and nonideal channels interfere with the reliability of the extracted features. A robust approach to feature extraction would further ensure the accuracy of the TF mask estimation. The authors of (Reju et al., 2010) employ a feature extraction scheme based upon the Hermitian angle between the observation vector and a reference vector; in a spirit similar to the MENUET-FCM, the features were clustered using the FCM, and encouraging separation results were reported.
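As a sketch of the kind of feature used in (Reju et al., 2010), the Hermitian angle between a complex observation vector a and a reference vector b can be computed as arccos(|aᴴb| / (‖a‖ ‖b‖)); taking the magnitude in the numerator makes the feature invariant to a common phase rotation of the observation, which is the property that makes it attractive as a spatial feature. The function name below is illustrative, not from the cited work.

```python
import numpy as np

def hermitian_angle(a, ref):
    """Hermitian angle between complex vectors a and ref, in [0, pi/2]."""
    # np.vdot conjugates its first argument, giving the Hermitian inner product
    c = np.abs(np.vdot(ref, a)) / (np.linalg.norm(a) * np.linalg.norm(ref))
    return np.arccos(np.clip(c, 0.0, 1.0))  # clip guards rounding above 1
```

Evaluating this angle for the observation vector at every TF cell yields a one-dimensional feature that can be clustered with the FCM exactly as above.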

where Θ(k, l) denotes the set of TF points in the STFT plane. Depending on the selection of the clustering algorithm, the clusters are represented by distinct sets of TF points (hard k-means clustering); a set of prototype vectors and a membership partition matrix (fuzzy c-means); or a parameter set (GMM-EM approach). Specifically, the k-means algorithm results in N distinct clusters C_1, ..., C_N, where each cluster is comprised of its constituent TF cells and Σ_{n=1}^{N} |C_n| = |Θ(k, l)|, where the operator |·| denotes cardinality. The fuzzy c-means yields the N centroids v_n and a partition matrix U.

Fig. 2. Example histograms of the MENUET features as in (9) and (10) for varying acoustic conditions: (a) histogram of the level ratio in an anechoic environment, (b) histogram of the phase difference in an anechoic environment, (c) phase difference in the presence of reverberation (RT60 = 300ms), (d) phase difference in the presence of channel noise.
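For concreteness, a simplified rendering of level-ratio and phase-difference features of the kind histogrammed in Fig. 2 is sketched below. The exact normalizations of Eqs. (9)-(10) are not reproduced here; `menuet_features`, the reference-microphone convention, and the constants d_max (the 4cm maximum sensor spacing) and c (speed of sound) are illustrative assumptions.

```python
import numpy as np

def menuet_features(X, fs, nfft, d_max=0.04, c=340.0, ref=0):
    """Level-ratio and phase-difference features from a multichannel STFT.

    X has shape (n_mics, n_freq, n_frames). The level ratio normalizes each
    cell's amplitude by the across-microphone norm; the phase difference to
    the reference microphone is divided by 2*pi*f*d_max/c so that its range
    is bounded by the array geometry rather than growing with frequency.
    """
    amp = np.abs(X)
    level = amp / (np.linalg.norm(amp, axis=0, keepdims=True) + 1e-12)
    f = np.arange(X.shape[1]) * fs / nfft              # bin center frequencies
    f[0] = f[1] if X.shape[1] > 1 else 1.0             # avoid divide-by-zero at DC
    phase = np.angle(X / (X[ref] + 1e-12))
    phase = phase / (2 * np.pi * f[None, :, None] * d_max / c)
    return level, phase
```

Stacking the two features per TF cell yields the vectors that the clustering stage (k-means, GMM-EM or FCM) operates on; a pure delay between microphones maps to a frequency-independent phase feature, which is what makes the clusters in Fig. 2(b) compact.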
where C_win is a Hann window normalization constant, and the individual frequency components of the recovered signal ŝ_mn(t) are acquired through an inverse STFT.

Fig. 4. Source separation results in reverberant conditions using the three separation approaches: MENUET, MENUET-GMM and MENUET-FCM. Performance results are given in terms of (a) ISR, (b) SIR gain, (c) SDR and (d) SAR for all RT60 values. The error bars denote the standard deviation over all evaluations.

Table 1. The parameters used in the experimental evaluations.

Table 4. Results for the measured SNR at the BSS output, averaged over all recovered signals. Results are given for all RT60 and input channel SNR values. The highest achieved ratio per acoustic scenario is denoted in boldface.