A Study of Methods for Initialization and Permutation Alignment for Time-Frequency Domain Blind Source Separation

The problem of the blind signal separation (BSS) consists of estimating the latent component signals in a linear mixture, referred to as the sources, starting from several observed signals, without relying on any specific knowledge of the sources. In particular, when the sources are audible, this problem is known as to the cocktail-party problem, making reference to the ability of the human ear to isolate the conversation of our interest among several conversations immersed in a noisy environment with many people talking at the same time.


Introduction
The problem of the blind signal separation (BSS) consists of estimating the latent component signals in a linear mixture, referred to as the sources, starting from several observed signals, without relying on any specific knowledge of the sources. In particular, when the sources are audible, this problem is known as to the cocktail-party problem, making reference to the ability of the human ear to isolate the conversation of our interest among several conversations immersed in a noisy environment with many people talking at the same time.
The complexity of the blind separation problem greatly depends on the mixture model, the number of sources and sensors that better adjust to reality and the presence of noise in the mixture. The simplest case regards the linear and instantaneous mixture, that is, when the sources are mixed affected only by some scaling. However, in a real room recording the situation becomes more difficult, since the source signals do not only follow the direct path from the source to the sensor, but there are also other paths coming from the reflections in the walls. Hence, the problem becomes convolutive rather than instantaneous, and the mixture process is then modelled by means of a convolution of the sources by some acoustic mixing filters. In the present chapter we assume that the channel between sources and microphones is time-invariant, the number of sources equals the number of sensors, that is, the N × N case, and there is no additive noise. In such recording environments the separation is very complex, especially in highly reverberant conditions where the mixing filters can be very long (greater than 250 ms) and can contains strong peaks corresponding to the echoes.
Several component analysis techniques solve the instantaneous and determined case in the time domain. One of the most popular is the Independent Component Analysis (ICA), which is a method to recover statistically independent sources by using implicitly or explicitly high-order statistics Comon (1994). Some of those techniques have been extended to solve the convolutive case in the time domain. However, its use for the separation of real speech recordings is limited because of the high length of the acoustical mixing filters (of the order of hundreds of milliseconds). Since it is required the adjustment of too many parameters, those method present convergence problems and a high computational cost. An extended strategy, referred to as fd-ICA in the literature, consists of formulating the problem in the time-frequency domain instead of the time domain (Smaragdis, 1998). The main reason is that the convolutive mixture can be approximated by a set of instantaneous mixtures, one for each frequency bin, that can be solved independently by applying several separation algorithms. However, this simplification introduces some additional problems referred to as scaling and permutation problems, since the obtained solutions in each frequency exhibit an arbitrary complex scaling and order. The scaling ambiguity introduces a filtering effect in the estimated sources, that can be removed by introducing some constraint on the separation filters. Nevertheless, the permutation problem leads to non-consistent time-frequency representations of the estimated signals, and needs to be addressed to successfully recover the original sources. During the last years, several algorithms have been proposed in order to solve this problem, although nowadays there is no satisfactory solution, specially for highly reverberant environments. Furthermore, the problem increases in complexity with the number of sources in the mixture, and the developed algorithms usually deal with the case of two source signals and two observations only.
Here, we will focus our attention on the initialization of the separation algorithms and the permutation problem. The initialization procedure is very important since some separation algorithms are very sensitive to initial conditions, and often there are some frequency bins in which the separation algorithm fails to converge. Furthermore, a suitable initialization can achieve a spectacular reduction of permutation misalignments, since it favours the preservation of the order of the separated components in wide frequency blocks, which facilitate the permutation correction. A new permutation algorithm is also proposed based on the spectral coherence property of the speech signal. For that, we will derive a contrast function which maximization achieves the solution to the permutation. For the assessment of the developed algorithms, an exhaustive study has been performed in both synthetic measurements and real recordings, for various mixture environments.

Model of BSS of convolutive mixtures in the time-frequency domain
It is well known that any acoustic signal acquired from microphones in a real recording environment suffers from reflections on the walls and surfaces inside the room. In this sense, the recorded signals can be accurately modelled as a convolutive mixture, where the mixing filter is usually considered a high-order FIR filter. The standard convolutive mixing model of N sources, s j (n), j = 1, · · · , N, in a noiseless situation can be written as where x i (n), i = 1, . . . , N are the N sensor signals, and h ij (n) is the impulse response from source j th to microphone i th . In order to blindly recover the original speech signals (sources) one can apply a matrix of demixing filters to the observations x i (n) that yields an estimate of each of the sources where the coefficients b ij (k) denote the impulse response of demixing system filter of Q taps.
The transformation of time-domain signals to the time-frequency domain is usually performed by the short-time Fourier transform (STFT). The main advantage of using the time-frequency domain is that convolutive mixture in Equation (1) Let X i ( f , t) and S i ( f , t) be, respectively, the STFT of x i (n) and s i (n), and H ij ( f ) be the frequency response of the channel h ij (n). From Equation (1) we obtain which can be rewritten, in matrix notation, as X( f , t) where the observation and source vectors for each time-frequency point are T , respectively, and H( f ) is the frequency response of the mixing filter whose elements are H ij ( f ) = [H( f )] ij ∀i, j. The superscript T represents the matrix transpose operator. From now on, we will assume that the mixing matrices H( f ) are full rank. In practice, the approximation (3) is considered valid when the length of the DFT is significantly greater than the length of the mixing filters Parra & Spence (2000). For instance, in fd-ICA context for speech separation it is common that the DFT is twice as long as the length of the mixing filters Araki et al. (2003).
Each of the separation matrices B( f ) can be estimated independently with a suitable algorithm for instantaneous mixtures of complex values, which is computationally very efficient. The vector of outputs or estimated sources Nevertheless, the simplification (3) has some disadvantages that need to be solved to successfully recover the sources. As each instantaneous separation problem is solved independently, the recovered signals will have an arbitrary permutation and scaling in each frequency bin. Those ambiguities are inherent to the problem of the blind source separation. In consequence, Y( f , t) is usually modelled as where Π(f) is a permutation matrix and D( f ) is an arbitrary nonsingular diagonal matrix of complex scalars, representing respectively the permutation and scaling ambiguities.
The scaling ambiguity is not a serious problem. In fact, it causes an overall filtering of the sources. However, the correction of the permutation is essential. Even when perfect separation is achieved in all frequency bins, the transformation of the recovered signals into the time domain will be erroneous if the order of the extracted components are not the same in all frequency bins. Therefore, it is necessary to determine the permutation matrix P * ( f ) in each frequency bin in such way that the order of the outputs remains constant over all the frequencies, Once the separated components are well aligned, the sources can be finally recovered by converting the time-frequency representations Y j ( f , t) back to the time domain. It is also possible to estimate the sources by first transforming the separation matrices B( f ) to the time domain, correcting previously the ambiguities, and then by applying the Equation (2).

299
A Study of Methods for Initialization and Permutation Alignment for Time-Frequency Domain Blind Source Separation

The separation stage
The most widely used methods for solving the instantaneous separation problems in the standard fd-ICA approach relies on the statistical independence among different sources and on the notion of contrast function. The statistical independence of the sources is a plausible assumption in real-room recordings, since each speakers acts independently of the others. On the other hand, the notion of contrast function defines a correspondence between the distribution of the estimated sources and the real line which is only maximized when the sources are mutually independent Comon (1994).
In the fd-ICA context, it is important to note that the separation algorithm must be capable of handling complex data, given that the separation problem is formulated in the time-frequency domain. Nowadays, most of the ICA methods that work with complex data often use a preliminary whitening step that leads to Z ≡ Z( f ) the spatially whitened observations. This preprocessing simplifies the problem and, in some cases, it is also used because it improves the convergence of the algorithm. The whitening procedure consists of a linearly transform of the observed variables to zero mean and unit variance, that can be accomplished by e.g. Principal Component Analysis (Comon, 1994). One of the most widely used algorithm is FastICA Hyvärinen & Oja (1997), which exploits the property of the non-Gaussianity of the sources. The extension to complex data was formulated in Bingham & Hyvärinen (2000). The solution is obtained by finding the extrema of the following contrast function where E represents expectation, u is the extraction vector (a row of the separating matrix U H ), while G is a smooth even function whose expectation measures the departure (in a given sense) from the Gaussian distribution. Some usual choices for function G can be found in Bingham & Hyvärinen (2000).
The optimization of the contrast function (7) is performed by the Newton's method, resulting the following update rule for the fixed-point algorithm for one unit where i is the iteration index, and g(·) and g (·) denote the first and the second derivatives of G(·), respectively. This method can be combined with a deflation procedure to retrieve all the original components. An optimized variant of FastICA consists of introducing an adaptive choice of function G. For this purpose, the distributions of the independent components can be modelled by a generalized Gaussian distribution. The resulting algorithm is called efficient FastICA or simply EFICA Koldovský et al. (2006).
The ICA algorithms previously commented ignore the time structure of source signals. However, for the speech signals, nearby samples are highly correlated and when comparing the statistics for distant samples the nonstationary behaviour is revealed. It is possible to exploit any of these features to achieve the separation using only second order statistics (SOS). One important advantage of the SOS based systems is that they are less sensitive to noise 300 Independent Component Analysis for Audio and Biosignal Applications and outliers. One popular method of this family of algorithms is the second order blind identification (SOBI) algorithm, proposed in Belouchrani et al. (1997).
Under the assumption of spatial decorrelation of the sources, the correlation matrices of the sources R s (τ) = E[s(t + τ)s * (t)] for any nonzero time lag τ are diagonal, where superscript * denotes conjugate operation. If we consider now time-delayed correlation matrices of whitened observations, the next relation for prewhitened sensor signals is satisfied where W is the whitening matrix and U is the unitary mixing matrix. Since R s (τ) is diagonal, the separation matrix U H may be estimated by enforcing an unitary diagonalization of a covariance matrix R z (τ) for some non zero lag. Instead of use only one time lag, SOBI approximated jointly diagonalizes a set of covariance matrices computed for a fixed set of time lags.
An extension of this algorithm that jointly exploits the non-stationary and the temporal structure of the source signals is second-order non-stationary source separation (SEONS) algorithm proposed in Choi & Cichocki (2000). This method estimates a set of covariance matrices at different time-frames. For that, the whitened observations are divided into non-overlapping blocks, where different time-delayed covariance matrices are computed. Then, a joint approximate diagonalization method is applied to this set of matrices to estimate the separation matrix. The application of the SEONS algorithm in the simulations of this chapter considers covariance matrices for τ = 0 and one sample in each block.

The ThinICA algorithm
The higher order cumulants of the outputs have been one of the first class of contrast functions proposed in the context of blind deconvolution Donoho (1981) and later extensively used in the context of blind source separation Comon (1994); Cruces et al. (2004a). In its simpler form, the contrast function takes the form of a sum of the fourth-order cumulants of the outputs subject to a unitary constraint on the separation matrix (U H U = I). Indeed, the first implementation of the Fast-ICA algorithm Hyvärinen & Oja (1997) considered the maximization of (10). Nearly at the same time, other authors developed in DeLathauwer et al. (2000) a higher-order power method that consider the separation of the sources with a contrast function based on a least squares fitting of a higher-order cumulant tensor.
The ThinICA algorithm was proposed in Cruces & Cichocki (2003) as a flexible tool to address the simultaneous optimization of several correlation matrices and/or cumulant tensors. These matrices and tensors can be arbitrarily defined, so the algorithm is able to exploit not only the independence of the sources, but also their individual temporal correlations and also their possible large-term non-stationarity Cruces et al. (2004b).
For simplicity, it is assumed that sources are locally stationary and standardized to zero mean and unit variance. In order to determine the statistics that take part in the ThinICA contrast function one should specify the order of the cumulants q and the considered time tuples θ = (t 1 , · · · , t q ) which are grouped in the set Θ = {θ m ∈ R q , m = 1, · · · , r}. The algorithm works 301 A Study of Methods for Initialization and Permutation Alignment for Time-Frequency Domain Blind Source Separation with positive weighting scalars w θ and with q unitary matrix estimates U [k] , k = 1, · · · , q, of the mixing system WA and their respective linear estimates Y [k] (t) = U [k]H Z(t), k = 1, . . . , q, of the vector of desired sources. It was shown in Cruces et al. (2004b) that the function is a contrast function whose global maxima are only obtained when all the estimates agree (U [1] = · · · = U [q] ) and the sources of the mixture are recovered. Moreover, the constrained maximization of the previous contrast function is equivalent to the constrained minimization of the weighted least squares error between a set of q-order cumulant tensors of the observations {C Z q (θ), ∀θ ∈ Θ} and their best approximations that take into account the mutual independence statistical structure of the sources If D θ denote diagonal matrices, the maximization of (11) is equivalent to the minimization of with respect to the unitary matrices U [k] , k = 1, · · · , q. See Cruces et al. (2004b) for more details on the equivalence between the contrast functions (11) and (12).
The optimization of the ThinICA contrast function can be implemented either hierarchically or simultaneously, with respective implementations are based on the thin-QR and thin-SVD factorizations. A MatLab implementation of this algorithm can be found at the ICAlab toolbox icalab (2012), or obtained from the authors upon request.
The ThinICA contrast function and the algorithm has been also extended in Durán & Cruces (2007) to allow the simultaneous combination of correlation matrices and cumulant tensors of arbitrary orders. In this way, the algorithm is able to simultaneously exploit the information of different statistics of the observations, what makes it suitable for obtaining accurate estimates from a reduced set of observations.
The application of the ThinICA algorithm in the simulations of this chapter tries to exploit the non-stationarity behavior of the speech signals by considering q = 2 and the set Θ = {(t m , t m ) ∈ R q , m = 1, · · · , r}, i.e., it uses the information of several local autocorrelations of the observations in frequency domain in order to estimate the latent sources.

Initialization procedure for ICA algorithms
The ICA algorithm used for estimating the optimal separation system in each frequency bin, is often randomly initialized. However, there are several advantages to make a suitable initialization of the algorithm. For instance, if the algorithm is initialized near the optimal solution, one can guarantee a high convergence speed. Also, the permutation ambiguity can be avoided if the mixture have some properties.
One interesting approach to develop an appropriate initialization method is to consider the continuity of the frequency response of the mixing filter H( f ) and its inverse. Under this assumption, it seems reasonable to initialize the separation system B ini ( f ) from the value of the optimal separation system at the previous frequency B o ( f − 1). However, we can not directly apply B( f ) = B o ( f − 1) in those separation algorithms that whiten the observations as a preprocessing step.

Independent Component Analysis for Audio and Biosignal Applications
The whitening is performed by premultiplying the observations with an The computation of the whitening matrix can be accomplished by e.g. Principal Component Analysis (Comon, 1994). After that, the new observations Z( f , t) can be expressed as a new mixture of the sources through a new unitary Given an estimate of the unitary mixing matrix U o ( f ), then it is immediate to see that the separation matrix U( f ) −1 = U( f ) H is also unitary. Therefore, the estimated components or outputs yields the decomposition of the separation matrix B( f ) as the product of an unitary matrix and the whitening system Due to the variability of the sources spectra, even at contiguous frequencies, the whitening matrices W( f ) and W( f − 1) are different.
Consequently, in general we violate the unitary assumption of U( f ) by solving directly for An alternative method to initialize from previous solutions while avoiding the previously described problem, consists of initially preprocessing the observations at frequency f by the separation matrix determined for the previous frequency. This technique, referred on now as classical initialization, first computes the new observations as and then, determines the matrix W( f ) which whitens these new observations. Finally, the separation matrices are obtained by any preferred ICA method on those new observations. In brief, this classical initialization method decomposes the overall separation matrix in the following three factors Instead of this classical initialization, here we aim to exploit the continuity of the frequency response of the separation filter in a different way. We propose to initialize the separation system B ini ( f ) from its joint closest value to a set of optimal separation systems already computed at nearby frequencies (Sarmiento et al., 2009;2010) This leads to the following constrained minimization problem arg min where · F denotes the Frobenius norm and α i are weights assigned to the separation matrices of nearby frequencies. This problem can be solved by applying Lagrange multipliers, where the corresponding Lagrangian function L is given by where 0 N denotes null matrix of dimension N × N. After some manipulations, one obtains the desired solution where Q L and Q R are, respectively, the left and right singular vectors of the following factorization As we will se below, this initialization procedure helps to preserve the ordering of the separated components across the frequencies. However, we can not guarantee that all the frequencies will be correctly aligned. In fact, in the audio context, the mixing filters, and therefore the demixing filters, can contain strong echoes. Thus, in general, the assumption of continuity of the filter frequency response is not valid in all the frequency bins.
Furthermore, it can exist some isolated frequency bins in which the separation problem is ill conditioned, and in consequence, the estimated separation matrices should not correspond to the optimal solution. Despite those aspects, in practice, the initialization procedure can achieve a spectacular reduction of permutation misalignments when it is applied to various ICA separation algorithms.
In order to corroborate this point, we now present various 2 × 2 separation experiments. In Figure 1, we show the number of transitions in the ordering of the estimated components when we apply both, the classical and the initialization procedures aforementioned to various standard ICA algorithm that whitens the observations to the estimation of the separation matrices. For comparison, we have selected three representative ICA algorithms: ThinICA, SEONS and EFICA. As it can be seen, the initialization overperforms the classical procedure, achieving a drastic reduction in the number of permutations in all the cases. Although it is possible to take into account the separation matrices from several frequencies to estimate the initial separation matrix, in our experience the best performing initialization is achieved when we use only one preceding frequency.
The initialization procedure also preserves the ordering of the separated components in wide frequency blocks. This last property is illustrated in Figure 2, where it is shown the spectrograms of the original and estimated components from a simulation for separating two speech sources from a synthetic convolutive mixture. The estimated components have been obtained by using the ThinICA algorithm initialized with the procedure described above, but without correcting the permutation ambiguity. In this simulation there are only four transitions in the order of the estimated components, where it is easy to see that the components are well aligned in wide frequency blocks. This property is particularly very interesting for our purposes, because it could be used to alleviate the computational burden of the algorithms that solve the permutation problem, although this issue will not be discussed in this chapter. In the first row it is shown the spectrogram of the two original speech sources, whereas in the second row it is shown the estimated spectrograms. There are only four frequencies in which the order is not preserved, indicated by a dotted line.

Avoiding the indeterminacies
As we described above, due to the decoupled nature of the solutions across different frequencies, the correspondence between the true sources and their estimates, in general suffers from scaling and ordering ambiguities. Hereinafter, we describe some existing method to try to avoid these ambiguities.

305
A Study of Methods for Initialization and Permutation Alignment for Time-Frequency Domain Blind Source Separation

The scaling ambiguity
The scale ambiguity can be fixed by setting some constraints to the separation filters or by using some a priori knowledge of the source signals. One option is to constraint the separating matrices to have unit determinant Smaragdis (1998), whereas another one is to constraint the diagonal elements of the separating matrices to unity Parra & Spence (2000). However, the most extended option is based on the minimal distortion principle introduced in Matsuoka & Nakashima (2001). The goal of this procedure is to obtain the signals as received by the microphones, that is, including the distortion of the mixing system while not adding other distortion effects. The solution consists of multiplying the separation matrices in each frequency bin by the diagonal of the matrix B( f ) −1

The permutation ambiguity
Nowadays the permutation ambiguity constitutes the main problem in fd-ICA of acoustic signals, and it is still not satisfactory solved in high reverberating environments or for a large number of sources and observations. In order to tackle the problem, it is necessary to take some aspects into consideration. First, it is important to note that, when there are N sources in the mixtures, there are N! possible permutations in each frequency bins, so the problem becomes difficult as the number of sources increase. In fact, a great number of the existing methods work only on the 2 × 2 case, and unfortunately, such methods cannot be directly extended to the general N × N case. On the other hand, in general, we can not guarantee the optimal solution of the instantaneous separation problem in all frequency bins, since the source signals are not homogeneous in their statistical properties along different frequencies.
Therefore, there will be some frequencies in which the estimated sources do not correspond to the original sources. This can affect hardly the robustness of the permutation correction algorithms, and often it causes fatal errors in some of the existing methods.
The general structure of the permutation correction algorithms is presented below. The main goal of the permutation correction algorithms consist of estimating a set of permutation correction matrices, one for each frequency bin, P : = P f 1 , P f 2 , · · · , P f n F , P f k ∈ P, where n F is the number of frequency bins and P represents all the possible permutation matrices of dimension N × N. Those permutation matrices are applied either to the outputs Y( f , t) or to the separation filters B( f ) to fix the permutation problem.
If we denote Π : = Π f 1 , Π f 2 , · · · , Π f n F , Π f k ∈ P a set of permutation matrices, one for each frequency bin, that describes mathematically the permutation ambiguity, then it is possible to define the set of global permutation matrices Q : Then, it is immediate to deduce that the set P : will be an optimal solution to the permutation problem if the corresponding set of global permutation matrices Q : satisfy the following condition, which implies that the permutation problem has N! possible optimal solutions.

306
Independent Component Analysis for Audio and Biosignal Applications

Brief review of existing methods
Here, we present the main ideas emerged in last years to solve the permutation problem, putting special emphasis on the drawbacks and limitations of the techniques. Many different approaches have been proposed during last years. Basically, those methods are based on one of the following two assumptions, or in a combination of both (Pedersen et al., 2008): consistency of the spectrum of the recovered signals and consistency of the filter coefficients.
The first set of methods use the consistency of the spectrum of the recovered signals, which relies on the property of amplitude modulation correlation Anemüller & Kollmeier (2000) or simply co-modulation, of speech signals. This property refers to the spectrogram of a speech signal reveals that there is a pattern in the changes in amplitude at different frequency bins. This can be explained since the energy seems to vary in time in a similar way over different frequency bins, up to a gain factor. In fact, when a speaker starts talking, the power of the signal increases in a similar way at all the frequencies, and the same happens when the speaker stops talking, that is, the power decreases in a similar way in all the frequencies. This pattern is in general different for different speakers, at least at some parts of the recording. Therefore, it is possible to propose a permutation correction algorithm based on some evaluation procedure of the similarity between the envelopes of separated signals.
This idea has been used extensively to propose different methods. One option consists of adjusting the permutation between either adjacent or sorted frequency bins in a sequential order. The method was first proposed in Ikeda & Murata (1998); Murata et al. (2001), where the output components are ordered according to the highest correlation between the frequency bin to be order and a global envelope calculated with the already ordered frequency bins. However, the sequential approach has a major drawback, since an error while estimating the correct permutation at a frequency bin can be propagated to the rest of frequencies to be ordered. In Rahbar & Reilly (2005) it is proposed a dyadic hierarchical sorting scheme to prevent this situation.
Rather than solving the permutation problem a posteriori, some methods try to avoid it. One option consists of introducing some constraint that penalizes the permuted solutions in the separation problem in each frequency bin. For instance, in Anemüller & Kollmeier (2000) the separation and permutation problems are solved simultaneously, so the computational cost is limited. However, it does not work well in high reverberation environments. Another option proposed in Kim et al. (2006) is based on the concept of Independent Vector Analysis (IVA), which is an extension of ICA from univariate components to multivariate components. This method models the time-frequency representations of speech signals with a multivariate probability density function, and separates the fellow source components together. The contrast proposed is a multidimensional extension of the maximum likelihood (ML) approach. The method performs successfully in most conditions, recovering high quality speech signals. However, the convergence to local minima limits the robustness of the method, since in those cases it does not successfully separate all the components.
The second set of methods, which are based on the spectral coherence of the separation filters, includes methods based on the continuity and smoothness of the frequency response of the separation filters and methods based on the sparsity of the separation filters. The property of continuity and smoothness refers to the fact that the frequency response of the separation filters has not got abrupt transitions. Under this assumption, in Pham et al. (2003) the permutation is solved by checking if the ratio R( f , f − 1) = B( f )B −1 ( f − 1) is close to a diagonal matrix, in which case the frequencies f and f − 1 are well aligned. In a similar way,

307
A Study of Methods for Initialization and Permutation Alignment for Time-Frequency Domain Blind Source Separation in Asano et al. (2003) the permutation is corrected by minimizing a distance measure between the filters evaluated at contiguous frequencies. The main weakness of those methods is the propagation of error, since an error in one frequency bin can lead to wrong permutations over the rest of frequency bins to be solved.
The continuity of the separation filters is equivalent to constraint the separation filters to have short support in the time domain. This idea, proposed in Parra & Spence (2000) is based on the observation that the existence of permutations will produce time domain filters with greater lengths. Therefore, if we impose a short length on the separation filters in the separation stage, one can assume that the estimated filters will preserve the same order in all the frequencies. Unfortunately, this method tends to fail in reverberant acoustic environments since the acoustic filters are already quite long. A recent method introduced in Sudhakar & Gribonval (2009) uses the temporal sparsity of the filters to solve the permutation problem, where the sparsity means that the filters have few non-zero coefficients. The main idea is that the permutation errors decreases the sparsity of the reconstructed filters in the time domain, so it is possible to solve the permutation problem by maximizing the sparsity of the time domain demixing filters. This method has a high computational cost, and also only works in absence of the scaling ambiguity, which is not a realistic assumption.
Another family of methods, closed to the beamforming techniques, are based on the different direction of arrival (DOA) of source signals Parra & Alvino (2002). For that, it is necessary the knowledge of the geometry of the sensor array, and the distance between the microphones to be small enough to prevent the problem of spatial aliasing. Those methods assume that the direct path dominates the mixing filter response, and therefore the frequency response of the mixing filter from the i source to the j sensor can be approximately modelled as an anechoic model, where θ i is the direction of arrival of the i source, d ij is the distance between the microphones i and j, and c is the propagation speed of sound. Due to the coherence of the separation filter, some authors, as in Kurita et al. (2000); Saruwatari et al. (2003), assume that the quotient of the frequency response of the mixing filters between a given source and whatever two sensors will present a continuous variation with frequency, so this property is exploited to match the order of the components. However, a correct estimation of the DOAs is not always possible and the method tends to fail in high reverberation conditions or when the sources are near.
Finally, some methods combine the direction of arrival estimation with signal inter-frequency dependence to provide robust solutions, as in Sawada et al. (2004). The method, first fix the permutations by using the DOA approach in those frequencies where the confidence of the method is high enough. Then the remaining frequencies are solved by a correlation approach on nearby frequencies, without changing the permutation fixed by the DOA approach. This method has been extended when the geometry of the sensor array is unknown, in Sawada et al. (2005), and when spatial aliasing happens, in Sawada et al. (2006).

A coherence contrast for solving the permutation
In this section we present a method for solving the permutation problem in the general N × N case, based on the amplitude modulation correlation property of speech signals. For that, we will define a global coherence measure of the separated components that constitutes a contrast for solving the permutation problem Sarmiento et al. (2011). Then, the set of permutation 308 Independent Component Analysis for Audio and Biosignal Applications matrices to align the separated components are estimated in an iterative way by using a block-coordinate gradient ascent method that maximize the contrast.
First, we transform the profiles of the separated components in a logarithmic scale, since it will exhibit clearly the coherence property of speech signals. Given a source signal s i (k) and Consider now two source signals s i (k) and s j (k). The correlation coefficient between i component at f k frequency, |S i | dB ( f k , t), and j component at f p frequency, |S j | dB ( f p , t) is given by where the cross correlation, mean and variance of the spectrograms are respectively Although, in general, the speech signals fulfil the co-modulation property , several authors have stated that the direct comparison between the separated components at different frequencies is not always efficient to solve the permutation problem, mainly owing to the fact that the inter-frequency correlation is degraded in certain conditions. In fact, one speech signal will have high correlation coefficients in nearby frequency bins, but this assumption is not always correct if the frequencies are far apart or the correlation is evaluated in certain frequency range, mainly at very low frequencies or very high frequencies (approximately over 5 kHz).To overcome this, we define the mean correlation coefficient ρ ij ( f k ), as an averaged measure of the correlation coefficients, in other words, a measure of similarity between the i component at frequency f k and the j component in all the frequencies where n F is the number of frequency bins.
Due to the spectral properties of speech signals, it is reasonable to expect that the mean correlation coefficient between one source at any f k and itself will be greater than if we compare with another different source. Therefore, given the set of sources s(k), one can deduce that the following property will be satisfied ∀ f k This property is illustrated in Figure 3, where we it is shown the mean correlation coefficients of a set of 3 sources, evaluated in all frequency bins. From Figure 3, we can see that the assumption of Equation (31) is clearly valid in most frequency bins, except at lower frequencies, as it was expected. Considering the mean correlation coefficient, we define the global coherence of the source vectorρ as the average as in frequency as in components of the mean correlation coefficients, that isρ

Description of the permutation correction algorithm
Consider the ordering vector π f k = [π f k (1), · · · , π f k (N)] associated to the existing permutation matrix Π f k , defined in such way that its i element represents the non-nule row index of the i column of Π f k . Therefore, the i component of the estimated source vectorŜ i ( f k , t) at frequency f k corresponds to the component π f k (i) of the output vector, that is

Independent Component Analysis for Audio and Biosignal Applications
In order to clarify this point, an example for N = 3 sources it is presented, The global coherence of the outputs with alignment errors is given then bȳ From Equation (31) we can deduce that the global coherence of the outputs with alignment errors will be lower than the global coherence of the sources. Hence, it is possible to derive a contrast based on the global coherence which maximization achieves to solve the permutation problem. For that, we define, analogously to Equation (35), a global coherence of the corrected outputsρ where q f k (i) = p f k (π f k (i)), i = 1, . . . , N are the elements of the ordering global vector at frequency f k . This global coherence will be maximum when the global permutation matrices satisfy the condition of Equation (23). Hence, the Equation (36) constitutes a coherence contrast for solving the permutation problem.
In order to calculate the permutation matrices that correct the alignment, it is necessary to solve the next constrained optimization problem P : = P f 1 , P f 2 , · · · , P f n F = arg max P : However, since it is not possible to find an analytical solution to the optimization problem, it is necessary to estimate the permutation correction matrices in an iterative way. Here we adopt a block coordinated ascent method, since it provides good permutation corrections with an efficient computational cost. In this method, at each iteration, the correction permutation matrices are calculated in a independent manner in all the frequencies as follows, • Step 1: Calculate the mean correlation coefficients ρ (l) ij ( f k ) for all N separated components and for all frequency bins. The superindex (l) denotes the iteration index of the algorithm.

311
A Study of Methods for Initialization and Permutation Alignment for Time-Frequency Domain Blind Source Separation • Step 2: Find at each f k the permutation matrix P (l) ( f k ) that maximizes the sum of the mean correlation coefficients Step 3: If P (l) f k = I N for any f k , reorder the estimated components as set the iteration index l = l + 1 and go to step 1. Otherwise, it is considered that the estimated components are well aligned and end the algorithm.
It is important to note that convergence to the optimal solution is not guaranteed. However, in practice, the convergence to local optima that provide highly erroneous solutions is highly improbable.

Performance in a perfect separation situation
Here we present some experiments that were conducted to illustrate the robustness of the proposed method in perfect separation situation. For that, we artificially applied randomly selected permutation matrices to a set of spectrograms of speech sources S( f , t) at each frequency bin. The result corresponds with the outputs of a frequency domain blind source separation scheme, when the separation is achieved perfectly in all the frequencies. We used speech sources of 5-seconds long sampled at 10 kHz, randomly chosen from the database of 12 individual male and female sources in sources (2012). The parameter for the calculus of the STFT were FFT length of 2048 points, Hanning windows of 1024 samples and 90 % overlap. Then, we used the correction permutation algorithm in order to recover the original spectrograms. In Table 1, it is presented the average number of unsolved permutations for a set of 30 different simulation for each configuration from N = 2, · · · , 8 speech sources.
In all the simulations the algorithm correctly order the frequency components, remaining some permuted solutions at lower frequencies as we expected, since the speech sources do not always satisfy the property of spectral coherence. However, those errors do not affect the quality of the recovered speech sources, since they are always located at very low frequencies.
In Figure 4, we show the spectrograms of the original, permuted and recovered signals for one simulation of 6 speech sources, where it can be corroborated the robustness of the permutation correction algorithm. Another important feature of the algorithm is its capacity to order the source components by using a reduced number of iterations. For instance, in the previous experiment, the convergence was achieved in only four iterations.

Simulations
In this section we are going to test the performance of the initialization procedure and the permutation correction algorithm with both simulated and live recording by means of the 312 Independent Component Analysis for Audio and Biosignal Applications quality of the recovered sources. This quality was measured in terms of both objective and perceptually measures. The objective measures are the Source to Distortion Ratio (SDR), the Source to Interferences Ratio (SIR) and the Source to Artifacts Ratio (SAR) computed by the BSS_EVAL toolbox, Fèvotte et al. (2006), whereas the perceptually measure is the Perceptual Evaluation of Speech Quality (PESQ) with a maximum value of 4.5. The Matlab code for calculating the PESQ index can be found in Loizou (2007).

Performance for simulated recording
For the synthetic mixtures, we considered the 2 × 2 and 3 × 3 mixing system for the configuration of microphones and loudspeakers showed in Figure 5. The corresponding channel impulse responses were determined by the Roomsim toolbox roomsim (2012). The The separation experiments have been carried out as following. For the computation of the STFT, the parameters chosen were: Hanning windows of length 1024 samples, FFT of 2048 points and 90% overlap. Then, we estimated the separation matrices by initializing the ThinICA algorithm with the two procedures presented in Section 4: the classical initialization and the initialization with k = 1 preceding frequency, which will be referred from now on as ThinICA classic and ThinICA ini1 , respectively. After that, we fixed the permutation problem applying the method described in Section 6, and the scaling ambiguity by the Minimal Distortion Principle. Finally, we transformed the separation matrices back to the time domain and filtered the observations to obtain the time domain estimated sources and the quality of those signals was computed with the aforementioned methods. For comparison, we also carried out the same separation experiments by using IVA algorithm. The obtained results are presented In Table 2.
In Figure 6 is depicted an example of original sources, mixtures and demixed sources of one 3 × 3 separation experiment by using ThinICA ini1 simulation configuration.
Note that, in the simplest case, the 2 × 2 case, the initialization procedure does not seem to introduce any significant improvement in the quality of the recovered sources respect to the classical procedure. This can be explained when the separation algorithm and the permutation correction algorithm adequately converge in all the experiments. Nevertheless, in the most complex 3 × 3 case, where the convergence of the separation algorithms can be more difficult to achieve, the initialization procedure over perform the classical procedure. Thus, one can conclude that the initialization procedure can obtain better performances in hard situations, mainly as the increment of the number of sources, or when it is available a reduced number of data. It is important to note that IVA method failed in three of the fifteen simulations. In those experiments IVA recovered only one sources, remaining the other two mixed.
From with IVA method. However, the PESQ quality measure obtained in all the cases is similar. This discrepancy can be explained by the conceptual differences between the different quality measures. In general, simulations show that fd-ICA methods obtain better separation ratios than the IVA method, despite hearing the recovered speech reveals that fd-ICA introduce more reverberation in the estimated sources than IVA method. This reverberation degrades the perceived quality of resulting sound, which explains the similar PESQ score of fd-ICA and IVA methods.

Performance for live recording
In this study, we have reproduce two clean speech sources in a typical office room to obtain a real recording in a noisy environment. A sampling frequency of 10 kHz has been used. The recording setup includes Logitech 5.1 Z-5450 loudspeakers, Shure Lavalier MX_ 180 Serie Microflex microphones and a Digi003Rack recording interface. The source signals were estimated by using both fd-ICA by means of ThinICA algorithm and IVA method. In this case, the STFT were computed using Hanning windows of length 2048 samples, FFT of 4096 points and 90% overlap.
For correctly interpret the results, it is important to note that the mixing conditions on live recordings present significant differences respect to the synthetic mixtures. One of the most important feature is the presence of additive noise coming from different noise sources in the recording environment, such us computers, air conditioning, etc. As a consequence, the estimated components will not correspond to the original sources, since they will have also a component of additive noise. Thus, we have included a component of additive noise in the objective quality measure. This noise component have been estimated in the silence periods of the recording. In Table 3 we show the obtained results. Due to computational limitations in the BSS EVAL toolbox, we present only the SIR and SNR ratios.
As it can be seen in Table 3, the three methods perform well in this situation, although the best performance is achieved by the ThinICA method including the initialization procedure. Moreover, fd-ICA methods present better SIR ratios than IVA method, as in the synthetic mixtures experiments. Also, in Figure 7 is depicted the original sources, real recordings and demixed sources by using ThinICA ini1 simulation configuration.   Table 3. SIR (dB), SAR (dB), SDR (dB) and PESQ index for 2 × 2 real recording separation.
To conclude, we have also applied the complete method to a live recording of 3 sources and 3 mixtures provided in SISEC (2012). The quality of the estimated sources was measured in terms of Source to Interferences Ratio (SIR) by E. Vincent, since the original sources are not public. An average SIR of 10.1 dB was obtained.

Conclusions
In this chapter we have considered the problem of the blind separation of speech signals recorded in a real room, when the number of speakers equals the number of simultaneous recordings. We have adopted the time-frequency approach, focusing our attention in the initialization of the separation algorithms and the permutation problem which is ubiquitous to fd-ICA methods. In order to improve the performance of the existing methods we have incorporated an initialization procedure for those ICA algorithms that work in the time-frequency domain and require the whitening of the observations as a preprocessing step. This initialization exploit the local continuity of the demixing filter in the frequency domain, which is a valid property for reverberant filters in a wide range of frequencies. For that, the separation matrix in one frequency bin is initialized from its joint closest value to a set of separation systems already computed at nearby frequencies. Computer simulations show that this initialization, when it is incorporated to the existing ICA algorithms, reduces drastically the number of permutations, preserving the separated components well aligned in wide frequency blocks. Simulations with more than two sources reveal that the proposed initialization also helps to the convergence of the ICA algorithms that solve the separation in each frequency.

317
A Study of Methods for Initialization and Permutation Alignment for Time-Frequency Domain Blind Source Separation The permutation problem becomes a severe problem when the number of sources is large or in high reverberant environments. Nowadays, it is still considered an open problem. For solving the permutation problem, we have present a method, based on the amplitude correlation modulation property of speech signals, that arises the general case of N sources and N observations. We have defined for each frequency bin a measure of coherence based on the amplitude modulation correlation property of speech signals. This measure has been used to formulate a coherence contrast function which maximization allows to successfully arrange the estimated components. An iterative method has been provide for searching the maxima of the contrast. The robustness of the algorithm has been illustrated for artificially permuted sources, which corresponds with a situation of perfect separation. Results show that the algorithm is able to reorder completely the frequency components, except for some very low frequencies that in some cases remained permuted. However, this does not affect to the quality of the recovered sources. Finally, experiments with simulated and live recording in a room with reverberation, for the case where two or three sources are mixed, show that the complete method improves considerably the performance of classical fd-ICA method, as well as IVA method, by means of both objective and perceptually measures.