Blind Source Separation for Speech Application Under Real Acoustic Environment


Introduction
A hands-free speech recognition system [1] is essential for the realization of an intuitive, unconstrained, and stress-free human-machine interface, where users can talk naturally because they require no microphone in their hands. In this system, however, since noise and reverberation always degrade speech quality, it is difficult to achieve high recognition performance, compared with the case of using a close-talk microphone such as a headset microphone. Therefore, we must suppress interference sounds to realize a noise-robust hands-free speech recognition system.
Source separation is one approach to removing interference sound source signals. Source separation for acoustic signals involves the estimation of original sound source signals from mixed signals observed in each input channel. Various methods have been presented for acoustic source signal separation. They can be classified into two groups: methods based on single-channel input, e.g., spectral subtraction (SS) [2], and those based on multichannel input, e.g., microphone array signal processing [3]. There have been various studies on microphone array signal processing; in particular, the delay-and-sum (DS) [4][5][6] array and adaptive beamformer (ABF) [7][8][9] are the most conventionally used microphone arrays for source separation and noise reduction. ABF can achieve higher performance than the DS array. However, ABF requires a priori information, e.g., the look direction and speech break interval. These requirements are due to the fact that conventional ABF is based on supervised adaptive filtering, which significantly limits its applicability to source separation in practical applications. Indeed, ABF cannot work well when the interfering signal is nonstationary noise.
Recently, alternative approaches have been proposed. Blind source separation (BSS) is an approach to estimating original source signals using only mixed signals observed in each input channel. In particular, BSS based on independent component analysis (ICA) [10], in which the independence among source signals is mainly used for the separation, has recently been studied actively [11][12][13][14][15][16][17][18][19]. Indeed, conventional ICA could work, particularly in speech-speech mixing, i.e., when all sources can be regarded as point sources, but such a mixing condition is very rare and unrealistic; real noises are often widespread sources. In this chapter, we mainly deal with generalized noise that cannot be regarded as a point source. Moreover, we assume this noise to be nonstationary noise that arises in many acoustical environments; ABF cannot treat this noise well. Although ICA, unlike ABF, is not influenced by the nonstationarity of signals, this is still a very challenging task that can hardly be addressed by conventional ICA-based BSS because ICA cannot separate widespread sources.
To improve the performance of BSS, some techniques combining conventional ICA and beamforming have been proposed [18,20]. However, these studies dealt with the separation of point sources, and the behavior of such methods under a non-point-source condition was not explicitly analyzed to our knowledge. Therefore, in this chapter, first, we analyze ICA under a non-point-source noise condition and point out that ICA is proficient in noise estimation rather than in speech estimation under such a noise condition. This analysis implies that we can still utilize ICA as an accurate noise estimator.
Next, we review blind spatial subtraction array (BSSA) [21], an improved BSS algorithm recently proposed in order to deal with real acoustic sounds. BSSA consists of an ICA-based noise estimator, and noise reduction in the proposed BSSA is achieved by subtracting the power spectrum of the estimated noise via ICA from the power spectrum of the noisy observations. This "power-spectrum-domain subtraction" procedure provides better noise reduction than conventional ICA with estimation-error robustness. The efficacy of BSSA can be confirmed in various experiments, including computer-simulation-based and real-recording-based experiments. This chapter shows strong evidence of BSSA providing promising speech enhancement results in a railway-station environment.
Finally, the real-time implementation issue of BSS is discussed. Several recent studies have dealt with the real-time implementation of ICA, but they still required high-speed personal computers. Consequently, BSS implementation on a small LSI still receives much attention in industrial applications. In this chapter, an example of hardware implementation of BSSA is introduced, which has yielded commercially available microphones adopted by the Japanese National Police Agency.
The rest of this chapter is organized as follows. In Sect. 2, the sound mixing model and conventional ICA are discussed. In Sect. 3, the analysis of ICA under a non-point-source condition is described in detail. In Sect. 4, BSSA is reviewed in detail. In Sect. 5, the experimental results are shown and compared with those of conventional methods. In Sect. 6, an example of hardware implementation of BSSA is introduced. Following the example, the chapter conclusions are given in Sect. 7.

Sound mixing model of microphone array
In this chapter, a straight-line array is assumed. The coordinates of the elements are designated d_j (j = 1, ..., J), and the directions of arrival (DOAs) of multiple sound sources are designated θ_k (k = 1, ..., K) (see Fig. 1). Then, we consider that only one target speech signal, some interference signals that can be regarded as point sources, and additive noise exist. This additive noise represents noises that cannot be regarded as point sources, e.g., spatially uncorrelated noises, background noises, and leakage of reverberation components outside the frame analysis. Multiple mixed signals are observed at the microphone array elements, and a short-time analysis of the observed signals is conducted by frame-by-frame discrete Fourier transform (DFT). The observed signals are given by
$$\boldsymbol{x}(f,\tau) = \boldsymbol{A}(f)\{\boldsymbol{s}(f,\tau) + \boldsymbol{n}(f,\tau)\} + \boldsymbol{n}_a(f,\tau), \tag{1}$$
where f is the frequency bin and τ is the time index of DFT analysis. Also, x(f, τ) is the observed signal vector, A(f) is the mixing matrix, s(f, τ) is the target speech signal vector in which only the Uth entry contains the signal component s_U(f, τ) (U is the target source number), n(f, τ) is the interference signal vector that contains the signal components except the Uth component, and n_a(f, τ) is the nonstationary additive noise signal term that generally represents non-point-source noises. These are defined as
$$\boldsymbol{x}(f,\tau) = [x_1(f,\tau), \ldots, x_J(f,\tau)]^{\mathrm{T}},$$
$$\boldsymbol{s}(f,\tau) = [\underbrace{0, \ldots, 0}_{U-1}, s_U(f,\tau), \underbrace{0, \ldots, 0}_{K-U}]^{\mathrm{T}},$$
$$\boldsymbol{n}(f,\tau) = [n_1(f,\tau), \ldots, n_{U-1}(f,\tau), 0, n_{U+1}(f,\tau), \ldots, n_K(f,\tau)]^{\mathrm{T}},$$
$$\boldsymbol{n}_a(f,\tau) = [n_1^{(a)}(f,\tau), \ldots, n_J^{(a)}(f,\tau)]^{\mathrm{T}},$$
$$\boldsymbol{A}(f) = [\boldsymbol{a}_1(f), \ldots, \boldsymbol{a}_K(f)].$$
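The mixing model in (1) can be illustrated numerically for a single frequency bin. The following is a minimal sketch, not the recording setup used in this chapter: it assumes an anechoic plane-wave mixing matrix built from steering vectors, complex Gaussian toy signals in place of speech DFT coefficients, and hypothetical values for the geometry, the DOAs, and the helper names `steering_vector` and `cgauss`.

```python
import numpy as np

def steering_vector(f_bin, theta, d, M, fs, c=343.0):
    """Plane-wave steering vector for element coordinates d [m] and DOA theta [rad]."""
    return np.exp(1j * 2 * np.pi * (f_bin / M) * fs * d * np.sin(theta) / c)

rng = np.random.default_rng(0)
J, K, U = 4, 2, 0                      # microphones, point sources, target index (0-based)
M, fs, f_bin, T = 512, 16000, 64, 200  # DFT size, sampling rate, bin, frames
d = 0.02 * np.arange(J)                # hypothetical 2 cm element spacing
thetas = np.deg2rad([0.0, 40.0])       # hypothetical DOAs of the K sources

# Assumption: anechoic stand-in for A(f), one steering vector per column.
A = np.stack([steering_vector(f_bin, th, d, M, fs) for th in thetas], axis=1)

def cgauss(shape):
    """Complex Gaussian samples used as toy DFT coefficients."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

src = cgauss((K, T))
s = np.zeros_like(src)
s[U] = src[U]                          # target vector: only the U-th entry is nonzero
n = src - s                            # interference vector: U-th entry is zero
n_a = 0.1 * cgauss((J, T))             # additive non-point-source noise

x = A @ (s + n) + n_a                  # observed signals, shape (J, T)
```

In a reverberant room, A(f) would be obtained from measured impulse responses rather than from steering vectors; the sketch only fixes the roles of s, n, and n_a.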

Conventional frequency-domain ICA
Here, we consider a case where the number of sound sources, K, equals the number of microphones, J, i.e., J = K. In addition, as in conventional ICA contexts, we assume that the additive noise n_a(f, τ) is negligible in (1). In frequency-domain ICA (FDICA), signal separation is expressed as
$$\boldsymbol{o}(f,\tau) = \boldsymbol{W}_{\mathrm{ICA}}(f)\,\boldsymbol{x}(f,\tau), \tag{7}$$
where o(f, τ) is the resultant output of the separation and W_ICA(f) is the complex-valued unmixing matrix (see Fig. 2).

Fig. 2. Blind source separation procedure in FDICA.
The unmixing matrix W_ICA(f) is optimized by ICA so that the output entries of o(f, τ) become mutually independent. Indeed, many kinds of ICA algorithm have been proposed.
In the second-order ICA (SO-ICA) [15,17], the separation filter is optimized by the joint diagonalization of co-spectra matrices using the nonstationarity and coloration of the signal. For instance, the following iterative updating equation based on SO-ICA has been proposed by Parra and Spence [15]: where μ is the step-size parameter, [p] is used to express the value of the pth step in iterations, off-diag[X] is the operation for setting every diagonal element of matrix X to zero, and where the superscript H denotes Hermitian transposition. This criterion is to be minimized with respect to W_ICA(f).
On the other hand, a higher-order-statistics-based approach exists. In higher-order ICA (HO-ICA), the separation filter is optimized on the basis of the non-Gaussianity of the signal. The optimal W_ICA(f) in HO-ICA is obtained using the iterative equation
$$\boldsymbol{W}_{\mathrm{ICA}}^{[p+1]}(f) = \mu\left[\boldsymbol{I} - \left\langle \boldsymbol{\varphi}(\boldsymbol{o}(f,\tau))\,\boldsymbol{o}^{\mathrm{H}}(f,\tau) \right\rangle_{\tau}\right]\boldsymbol{W}_{\mathrm{ICA}}^{[p]}(f) + \boldsymbol{W}_{\mathrm{ICA}}^{[p]}(f), \tag{11}$$
where I is the identity matrix, ⟨•⟩_τ denotes the time-averaging operator, and ϕ(•) is the nonlinear vector function. Many kinds of nonlinear function ϕ(•) have been proposed.

Considering a batch algorithm of ICA, it is well known that tanh(•) or the sigmoid function is appropriate for super-Gaussian sources such as speech signals [22]. In this study, we define the nonlinear vector function ϕ(•) as
$$\boldsymbol{\varphi}(\boldsymbol{o}(f,\tau)) \equiv [\varphi(o_1(f,\tau)), \ldots, \varphi(o_K(f,\tau))]^{\mathrm{T}}, \qquad \varphi(o_k(f,\tau)) \equiv \tanh o_k^{(\mathrm{R})}(f,\tau) + \mathrm{j}\tanh o_k^{(\mathrm{I})}(f,\tau), \tag{12}$$
where the superscripts (R) and (I) denote the real and imaginary parts, respectively. The nonlinear function given by (12) indicates that the nonlinearity is applied to the real and imaginary parts of complex-valued signals separately. This type of complex-valued nonlinear function has been introduced by Smaragdis [14] for FDICA, where it can be assumed for speech signals that the real (or imaginary) parts of the time-frequency representations of sources are mutually independent. According to Refs. [19,23], the source separation performance of HO-ICA is almost the same as or superior to that of SO-ICA. Thus, in this chapter, HO-ICA is utilized as the basic ICA algorithm in the simulation (Sect. 3.4) and experiments (Sect. 5).
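The HO-ICA iteration with the split tanh nonlinearity can be sketched for one frequency bin as follows. This is an illustrative batch implementation under stated assumptions, not the code used in the experiments: the update is taken in the natural-gradient form W ← μ[I − ⟨ϕ(o)o^H⟩_τ]W + W, the sources are Laplacian toy signals standing in for speech DFT coefficients, and the mixing matrix, step size, and iteration count are hypothetical.

```python
import numpy as np

def phi(o):
    """Split-complex tanh: applied to real and imaginary parts separately."""
    return np.tanh(o.real) + 1j * np.tanh(o.imag)

def ho_ica_step(W, x, mu=0.1):
    """One batch update W <- mu * (I - <phi(o) o^H>_tau) W + W."""
    o = W @ x                                  # separated outputs, shape (K, T)
    T = x.shape[1]
    corr = (phi(o) @ o.conj().T) / T           # time average of phi(o) o^H
    return mu * (np.eye(W.shape[0]) - corr) @ W + W

rng = np.random.default_rng(1)
K, T = 2, 5000
# Assumption: Laplacian (super-Gaussian) toy sources instead of real speech.
s = rng.laplace(size=(K, T)) + 1j * rng.laplace(size=(K, T))
A = np.array([[1.0, 0.6], [0.5, 1.0]], dtype=complex)   # hypothetical mixing matrix
x = A @ s

W = np.eye(K, dtype=complex)
for _ in range(200):
    W = ho_ica_step(W, x)

G = np.abs(W @ A)   # global system; near a scaled permutation when separation succeeds
```

Inspecting G after the loop gives a quick check of how much each output is dominated by a single source; in FDICA the same update is run independently in every frequency bin.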

Analysis of ICA under non-point-source noise condition
In this section, we investigate the proficiency of ICA under a non-point-source noise condition.
In relation to the performance analysis of ICA, Araki et al. have reported that ICA-based BSS has equivalence to parallel constructed ABFs [24]. However, this investigation was focused on separation with a nonsingular mixing matrix, and thus was valid for only point sources.
First, we analyze beamformers that are optimized by ICA under a non-point-source condition. In the analysis, it is clarified that beamformers optimized by ICA become specific beamformers that maximize the signal-to-noise ratio (SNR) in each output (so-called SNR-maximize beamformers). In particular, the beamformer for target speech estimation is optimized to be a DS beamformer, and the beamformer for noise estimation is likely to be a null beamformer (NBF) [16].
Next, a computer simulation is conducted. Its result also indicates that ICA is proficient in noise estimation under a non-point-source noise condition. Then, it is concluded that ICA is suitable for noise estimation under such a condition.

Can ICA separate any source signals?
Many previous studies on BSS provided strong evidence that conventional ICA could perform source separation, particularly in the special case of speech-speech mixing, i.e., all sound sources are point sources. However, such sound mixing is not realistic under common acoustic conditions; indeed, the following scenario and problem are likely to arise (see Fig. 3):
• The target sound is the user's speech, which can be approximately regarded as a point source. In addition, the users themselves are located relatively near the microphone array (e.g., 1 m apart), and consequently the accompanying reflection and reverberation components are moderate.
• For the noise, we are often confronted with interference sound(s) that is not a point source but a widespread source. Also, the noise is usually far from the array and is heavily reverberant.
Fig. 3. Expected directivity patterns that are shaped by ICA.
In such an environment, can ICA separate the user's speech signal and a widespread noise signal? The answer is no. It is well expected that conventional ICA can suppress the user's speech signal to pick up the noise source, but ICA is very weak in picking up the target speech itself via the suppression of a distant widespread noise. This is due to the fact that ICA with small numbers of sensors and filter taps often provides only directional nulls against undesired source signals. Results of the detailed analysis of ICA for such a case are shown in the following subsections.

SNR-maximize beamformers optimized by ICA
In this subsection, we consider beamformers that are optimized by ICA in the following acoustic scenario: the target signal is the user's speech and the noise is not a point source. Then, the observed signal contains only one target speech signal and an additive noise. In this scenario, the observed signal is defined as
$$\boldsymbol{x}(f,\tau) = \boldsymbol{A}(f)\,\boldsymbol{s}(f,\tau) + \boldsymbol{n}_a(f,\tau).$$
Note that the additive noise n_a(f, τ) cannot be neglected in this scenario. Then, the output of ICA contains two components, i.e., the estimated speech signal y_s(f, τ) and estimated noise signal y_n(f, τ); these are given by
$$[y_s(f,\tau),\; y_n(f,\tau)]^{\mathrm{T}} = \boldsymbol{W}_{\mathrm{ICA}}(f)\,\boldsymbol{x}(f,\tau). \tag{15}$$
Therefore, ICA optimizes two beamformers; these can be written as
$$\boldsymbol{W}_{\mathrm{ICA}}(f) = [\boldsymbol{g}_s(f),\; \boldsymbol{g}_n(f)]^{\mathrm{T}},$$
where g_s(f) = [g_1^(s)(f), ..., g_J^(s)(f)]^T is the coefficient vector of the beamformer used to pick up the target speech signal, and g_n(f) = [g_1^(n)(f), ..., g_J^(n)(f)]^T is the coefficient vector of the beamformer used to pick up the noise. Therefore, (15) can be rewritten as
$$[y_s(f,\tau),\; y_n(f,\tau)]^{\mathrm{T}} = [\boldsymbol{g}_s(f),\; \boldsymbol{g}_n(f)]^{\mathrm{T}}\,\boldsymbol{x}(f,\tau).$$
In SO-ICA, the multiple second-order correlation matrices of distinct time block outputs, ⟨o(f, τ_b) o^H(f, τ_b)⟩_τb, are diagonalized through joint diagonalization.
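On the other hand, in HO-ICA, the higher-order correlation matrix is also diagonalized. Using the Taylor expansion, we can express the factor of the nonlinear vector function of HO-ICA, ϕ(o_k(f, τ)), as
$$\varphi(o_k(f,\tau)) = \tanh o_k^{(\mathrm{R})}(f,\tau) + \mathrm{j}\tanh o_k^{(\mathrm{I})}(f,\tau) = o_k(f,\tau) - \left(\frac{\big(o_k^{(\mathrm{R})}(f,\tau)\big)^3}{3} + \mathrm{j}\,\frac{\big(o_k^{(\mathrm{I})}(f,\tau)\big)^3}{3}\right) + \cdots.$$
Thus, the calculation of the higher-order correlation in HO-ICA, ⟨ϕ(o(f, τ)) o^H(f, τ)⟩_τ, can be decomposed into a second-order correlation matrix and the summation of higher-order correlation matrices of each order. This is shown as
$$\left\langle \boldsymbol{\varphi}(\boldsymbol{o}(f,\tau))\,\boldsymbol{o}^{\mathrm{H}}(f,\tau) \right\rangle_{\tau} = \left\langle \boldsymbol{o}(f,\tau)\,\boldsymbol{o}^{\mathrm{H}}(f,\tau) \right\rangle_{\tau} + \boldsymbol{\Psi}(f), \tag{20}$$
where Ψ(f) is a set of higher-order correlation matrices. In HO-ICA, separation filters are optimized so that all orders of correlation matrices become diagonal matrices. Then, at least the second-order correlation matrix is diagonalized by HO-ICA. Thus, in both SO-ICA and HO-ICA, at least the second-order correlation matrix is diagonalized. Hence, we prove in the following that ICA optimizes beamformers as SNR-maximize beamformers, focusing on only part of the second-order correlation. The absolute value of the normalized cross-correlation coefficient (off-diagonal entries) of the second-order correlation, C, is defined by
$$C = \frac{\left|\left\langle y_s(f,\tau)\,y_n^{*}(f,\tau)\right\rangle_{\tau}\right|}{\sqrt{\left\langle |y_s(f,\tau)|^2\right\rangle_{\tau}}\,\sqrt{\left\langle |y_n(f,\tau)|^2\right\rangle_{\tau}}}. \tag{21}$$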

Here, the outputs of ICA are decomposed as
$$y_s(f,\tau) = \hat{s}(f,\tau) + r_s\,\hat{n}(f,\tau), \tag{22}$$
$$y_n(f,\tau) = \hat{n}(f,\tau) + r_n\,\hat{s}(f,\tau), \tag{23}$$
where ŝ(f, τ) is the target speech component in ICA's output, n̂(f, τ) is the noise component in ICA's output, r_s is the coefficient of the residual noise component, r_n is the coefficient of the target-leakage component, and the superscript * represents a complex conjugate. Therefore, the SNRs of y_s(f, τ) and y_n(f, τ) can be respectively represented by
$$\Gamma_s = \frac{\left\langle |\hat{s}(f,\tau)|^2\right\rangle_{\tau}}{|r_s|^2\left\langle |\hat{n}(f,\tau)|^2\right\rangle_{\tau}}, \tag{24}$$
$$\Gamma_n = \frac{\left\langle |\hat{n}(f,\tau)|^2\right\rangle_{\tau}}{|r_n|^2\left\langle |\hat{s}(f,\tau)|^2\right\rangle_{\tau}}, \tag{25}$$
where Γ_s is the SNR of y_s(f, τ) and Γ_n is the SNR of y_n(f, τ). Using (22), (23), (24), and (25), and assuming that ŝ(f, τ) and n̂(f, τ) are mutually uncorrelated, we can rewrite (21) as
$$C = \frac{\left|\dfrac{1}{\sqrt{\Gamma_s}}\,e^{\mathrm{j}\arg r_s} + \dfrac{1}{\sqrt{\Gamma_n}}\,e^{\mathrm{j}\arg r_n^{*}}\right|}{\sqrt{1 + \dfrac{1}{\Gamma_s}}\,\sqrt{1 + \dfrac{1}{\Gamma_n}}}, \tag{26}$$
where arg r represents the argument of r. Thus, C is a function of only Γ_s and Γ_n. Therefore, the cross-correlation between y_s(f, τ) and y_n(f, τ) only depends on the SNRs of beamformers g_s(f) and g_n(f).
Now, we consider the minimization of C, which is identical to the second-order correlation matrix diagonalization in ICA. When |arg r_n^* − arg r_s| > π/2, where −π < arg r_s ≤ π and −π < arg r_n^* ≤ π, it is possible to make C zero or minimum independently of Γ_s and Γ_n. This case is appropriate for the orthogonalization between y_s(f, τ) and y_n(f, τ), which is related to principal component analysis (PCA) unlike ICA. However, SO-ICA requires that all correlation matrices in the different time blocks are diagonalized (joint diagonalization) to maximize independence among all outputs. Also, HO-ICA requires that all order correlation matrices are diagonalized, i.e., not only ⟨o(f, τ) o^H(f, τ)⟩_τ but also Ψ(f) in (20) is diagonalized. These diagonalizations result in the prevention of the orthogonalization of y_s(f, τ) and y_n(f, τ); consequently, hereafter, we can consider only the case of |arg r_n^* − arg r_s| ≤ π/2. Then, the partial differential of C² with respect to Γ_s is given by
$$\frac{\partial C^2}{\partial \Gamma_s} = \frac{1 - \Gamma_n}{(\Gamma_s + 1)^2(\Gamma_n + 1)} + \frac{\sqrt{\Gamma_s\Gamma_n}\,(1 - \Gamma_s)}{2\Gamma_s(\Gamma_s + 1)^2(\Gamma_n + 1)} \cdot 2\,\mathrm{Re}\!\left\{e^{\mathrm{j}(\arg r_n^{*} - \arg r_s)}\right\} < 0, \tag{27}$$
where Γ_s > 1 and Γ_n > 1. For the partial differential of C² with respect to Γ_n, we can also prove in the same manner that ∂C²/∂Γ_n < 0, where Γ_s > 1 and Γ_n > 1. Therefore, C is a monotonically decreasing function of Γ_s and Γ_n. The above-mentioned fact indicates the following in ICA.
• The absolute value of cross-correlation only depends on the SNRs of the beamformers spanned by each row of an unmixing matrix.
• The absolute value of cross-correlation is a monotonically decreasing function of SNR.
• Therefore, the diagonalization of a second-order correlation matrix leads to SNR maximization.
Thus, it can be concluded that ICA, in a parallel manner, optimizes multiple beamformers, i.e., g_s(f) and g_n(f), so that the SNR of the output of each beamformer becomes maximum.
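The monotonic behavior of C can be checked numerically. The sketch below assumes that the normalized cross-correlation takes the form C = |Γ_s^(−1/2) e^(j arg r_s) + Γ_n^(−1/2) e^(j arg r_n^*)| / [(1 + 1/Γ_s)(1 + 1/Γ_n)]^(1/2), which follows from the definitions of the SNRs when the speech and noise components are uncorrelated; the function name `cross_corr`, the phase difference of π/4, and the SNR grid are illustrative choices.

```python
import numpy as np

def cross_corr(gamma_s, gamma_n, delta=0.0):
    """Normalized cross-correlation C; delta = arg r_n^* - arg r_s (|delta| <= pi/2)."""
    num = np.abs(1 / np.sqrt(gamma_s) + np.exp(1j * delta) / np.sqrt(gamma_n))
    den = np.sqrt(1 + 1 / gamma_s) * np.sqrt(1 + 1 / gamma_n)
    return num / den

gammas = np.array([2.0, 5.0, 10.0, 100.0, 1000.0])    # linear SNRs, all > 1
c_vs_gs = cross_corr(gammas, 10.0, delta=np.pi / 4)   # sweep the SNR of y_s
c_vs_gn = cross_corr(10.0, gammas, delta=np.pi / 4)   # sweep the SNR of y_n
```

Both sweeps decrease as the SNR grows, which is the property used above to argue that diagonalizing the second-order correlation matrix drives each beamformer toward maximum SNR.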


What beamformers are optimized under non-point-source noise condition?
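In the previous subsection, it has been proved that ICA optimizes beamformers as SNR-maximize beamformers. In this subsection, we analyze what beamformers are optimized by ICA, particularly under a non-point-source noise condition, where we assume a two-source separation problem. The target speech can be regarded as a point source, and the noise is a non-point-source noise. First, we focus on the beamformer g_s(f) that picks up the target speech signal. The SNR-maximize beamformer for g_s(f) minimizes the undesired signal's power under the condition that the target signal's gain is kept constant. Thus, the desired beamformer should satisfy
$$\min_{\boldsymbol{g}_s(f)} \; \boldsymbol{g}_s^{\mathrm{T}}(f)\,\boldsymbol{R}(f)\,\boldsymbol{g}_s^{*}(f) \quad \text{subject to} \quad \boldsymbol{g}_s^{\mathrm{T}}(f)\,\boldsymbol{a}(f,\theta_s(f)) = 1, \tag{28}$$
$$\boldsymbol{a}(f,\theta_s(f)) = \left[\exp\!\big(\mathrm{j}2\pi (f/M) f_s d_1 \sin\theta_s(f)/c\big), \ldots, \exp\!\big(\mathrm{j}2\pi (f/M) f_s d_J \sin\theta_s(f)/c\big)\right]^{\mathrm{T}}, \tag{29}$$
where a(f, θ_s(f)) is the steering vector, θ_s(f) is the direction of the target speech, M is the DFT size, f_s is the sampling frequency, c is the sound velocity, and R(f) = ⟨n_a(f, τ) n_a^H(f, τ)⟩_τ is the correlation matrix of n_a(f, τ). Note that θ_s(f) is a function of frequency because the DOA of the source varies in each frequency subband under a reverberant condition. Here, using the Lagrange multiplier, the solution of (28) is
$$\boldsymbol{g}_s^{\mathrm{T}}(f) = \frac{\boldsymbol{a}^{\mathrm{H}}(f,\theta_s(f))\,\boldsymbol{R}^{-1}(f)}{\boldsymbol{a}^{\mathrm{H}}(f,\theta_s(f))\,\boldsymbol{R}^{-1}(f)\,\boldsymbol{a}(f,\theta_s(f))}. \tag{30}$$
This beamformer is called a minimum variance distortionless response (MVDR) beamformer [25]. Note that the MVDR beamformer requires the true DOA of the target speech and the noise-only time interval. However, we cannot determine the true DOA of the target source signal and the noise-only interval because ICA is an unsupervised adaptive technique. Thus, the MVDR beamformer is expected to be the upper limit of ICA in the presence of non-point-source noises.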
Although the correlation matrix is often not diagonalized in lower-frequency subbands [25], e.g., for diffuse noise, we approximate that the correlation matrix is almost diagonalized in the subbands over the entire frequency range. Then, regarding the power of noise signals as approximately δ²(f), the correlation matrix results in R(f) = δ²(f) · I. Therefore, the inverse of the correlation matrix is R⁻¹(f) = I/δ²(f), and (30) can be rewritten as
$$\boldsymbol{g}_s^{\mathrm{T}}(f) = \frac{\boldsymbol{a}^{\mathrm{H}}(f,\theta_s(f))}{\boldsymbol{a}^{\mathrm{H}}(f,\theta_s(f))\,\boldsymbol{a}(f,\theta_s(f))} = \frac{1}{J}\,\boldsymbol{a}^{\mathrm{H}}(f,\theta_s(f)). \tag{31}$$
This filter g_s(f) is approximately equal to a DS beamformer [4]. Note that the filter g_s(f) is not a simple DS beamformer but a reverberation-adapted DS beamformer because it is optimized for a distinct θ_s(f) in each frequency bin. The resultant noise power is δ²(f)/J when the noise is spatially uncorrelated and white Gaussian. Consequently, the noise-reduction performance of the DS beamformer optimized by ICA under a non-point-source noise condition is proportional to 10 log₁₀ J [dB]; this performance is not particularly good.
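The reduction of the MVDR solution to a DS beamformer under spatially white noise, and the resulting gain of 10 log₁₀ J dB, can be verified with a few lines. This is a sketch under assumptions: the MVDR weights are taken as g^T = a^H R⁻¹ / (a^H R⁻¹ a), the steering vector follows a plane-wave model, and the array geometry, DOA, and noise power are hypothetical values.

```python
import numpy as np

def steering_vector(f_bin, theta, d, M, fs, c=343.0):
    """Plane-wave steering vector for element coordinates d [m] and DOA theta [rad]."""
    return np.exp(1j * 2 * np.pi * (f_bin / M) * fs * d * np.sin(theta) / c)

def mvdr_weights(a, R):
    """Return g such that g^T = a^H R^{-1} / (a^H R^{-1} a), for Hermitian R."""
    Ri_a = np.linalg.solve(R, a)                 # R^{-1} a
    return Ri_a.conj() / np.vdot(a, Ri_a).real   # np.vdot(a, b) computes a^H b

J, M, fs, f_bin = 4, 512, 16000, 64
d = 0.02 * np.arange(J)                          # hypothetical 2 cm spacing
a = steering_vector(f_bin, np.deg2rad(20.0), d, M, fs)
delta2 = 0.5                                     # noise power per microphone
R = delta2 * np.eye(J)                           # spatially white noise: R = delta^2 I

g_mvdr = mvdr_weights(a, R)
g_ds = a.conj() / J                              # DS beamformer steered to the same DOA
noise_power = (g_mvdr @ R @ g_mvdr.conj()).real  # output noise power g^T R g^*
gain_db = 10 * np.log10(delta2 / noise_power)    # noise reduction in dB
```

With J = 4 the gain is about 6 dB, which illustrates why the speech-side beamformer offers only modest noise reduction in this setting.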
Next, we consider the other beamformer g_n(f), which picks up the noise source. Similarly, for the noise signal, the beamformer that removes the target signal arriving from θ_s(f) is the SNR-maximize beamformer. Thus, the beamformer that steers the directional null to θ_s(f) is the desired one for the noise signal. Such a beamformer is called NBF [16]. This beamformer compensates for the phase of the signal arriving from θ_s(f), and carries out subtraction. Thus, the signal arriving from θ_s(f) is removed. For instance, NBF with a two-element array is designed as
$$\boldsymbol{g}_n^{\mathrm{T}}(f) = \left[\exp\!\big(-\mathrm{j}2\pi (f/M) f_s d_1 \sin\theta_s(f)/c\big),\; -\exp\!\big(-\mathrm{j}2\pi (f/M) f_s d_2 \sin\theta_s(f)/c\big)\right] \cdot \sigma(f), \tag{32}$$
where σ(f) is the gain compensation parameter. This beamformer surely satisfies g_n^T(f) · a(f, θ_s(f)) = 0. The steering vector a(f, θ_s(f)) expresses the wavefront of the plane wave arriving from θ_s(f). Thus, g_n(f) actually steers the directional null to θ_s(f). Note that this always occurs regardless of the number of microphones (at least two microphones). Hence, this beamformer achieves a reasonably high, ideally infinite, SNR for the noise signal. Also, note that the filter g_n(f) is not a simple NBF but a reverberation-adapted NBF because it is optimized for a distinct θ_s(f) in each frequency bin. Overall, the performance of enhancing the target speech is very poor but that of estimating the noise source is good.
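A two-element NBF and its directivity pattern can be sketched as follows. The construction assumes the phase-compensate-and-subtract form g_n^T = σ [e^(−jφ₁), −e^(−jφ₂)] with φ_j = 2π(f/M) f_s d_j sin θ_s / c and a plane-wave steering vector; the target DOA of 10 degrees, the unit gain σ = 1, and the function names are illustrative assumptions.

```python
import numpy as np

def steering_vector(f_bin, theta, d, M, fs, c=343.0):
    """Plane-wave steering vector for element coordinates d [m] and DOA theta [rad]."""
    return np.exp(1j * 2 * np.pi * (f_bin / M) * fs * d * np.sin(theta) / c)

def nbf_two_element(f_bin, theta_s, d, M, fs, sigma=1.0, c=343.0):
    """Two-element null beamformer: compensate the phase for theta_s, then subtract."""
    phase = 2 * np.pi * (f_bin / M) * fs * d * np.sin(theta_s) / c
    return sigma * np.array([np.exp(-1j * phase[0]), -np.exp(-1j * phase[1])])

M, fs, f_bin = 512, 8000, 128          # bin 128 of 512 at 8 kHz corresponds to 2 kHz
d = np.array([0.0, 0.043])             # 4.3 cm interelement spacing
theta_s = np.deg2rad(10.0)             # hypothetical target DOA
g_n = nbf_two_element(f_bin, theta_s, d, M, fs)

# Directivity pattern: magnitude response to plane waves from -90 to 90 degrees.
angles = np.deg2rad(np.arange(-90, 91, 10))
pattern = np.array([abs(g_n @ steering_vector(f_bin, th, d, M, fs)) for th in angles])
```

The pattern vanishes at the target direction and stays nonzero elsewhere, so the output contains noise from all other directions while the point-source target is removed.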

Computer simulations
We conduct computer simulations to confirm the performance of ICA under a non-point-source noise condition. Here, we used HO-ICA [14] as the ICA algorithm. We used the following 8-kHz-sampled signals as the ICA's input: the original target speech (3 s) was convolved with impulse responses that were recorded in an actual environment, and three types of noise emitted from 36 loudspeakers were added to it. The reverberation time (RT60) is 200 ms; this corresponds to mixing filters with 1600 taps in 8 kHz sampling. The three types of noise are an independent Gaussian noise, actually recorded railway-station noise, and interference speech by 36 people. Figure 4 illustrates the reverberant room used in the simulation. We use 12 speakers (6 males and 6 females) as sources of the original target speech, and the input SNR of test data is set to 0 dB. We use a two-, three-, or four-element microphone array with an interelement spacing of 4.3 cm.
The simulation results are shown in Figs. 5 and 6. Figure 5 shows the result for the average noise reduction rate (NRR) [16] of all the target speakers. NRR is defined as the output SNR in dB minus the input SNR in dB. This measure indicates the objective performance of noise reduction. NRR is given by
$$\mathrm{NRR}\;[\mathrm{dB}] = \frac{1}{J}\sum_{j=1}^{J}\left(\mathrm{OSNR} - \mathrm{ISNR}_j\right), \tag{33}$$
where OSNR is the output SNR and ISNR_j is the input SNR of microphone j.
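As an illustration of the NRR measure, the following sketch computes it from separately available speech and noise components, which is possible in a simulation where both are known. The averaging over microphones follows the definition above (output SNR minus the input SNR of each microphone, averaged over j); the function names and the toy signals are assumptions made for the example.

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB from a signal component and a noise component."""
    return 10 * np.log10(np.mean(np.abs(signal) ** 2) / np.mean(np.abs(noise) ** 2))

def nrr_db(out_speech, out_noise, in_speech, in_noise):
    """NRR = (1/J) * sum_j (OSNR - ISNR_j); in_* have shape (J, T)."""
    osnr = snr_db(out_speech, out_noise)
    isnr = np.array([snr_db(s, n) for s, n in zip(in_speech, in_noise)])
    return float(np.mean(osnr - isnr))

# Toy data: 0 dB input SNR at both microphones, noise amplitude reduced tenfold.
in_speech = np.ones((2, 100))
in_noise = np.ones((2, 100))
out_speech = np.ones(100)
out_noise = 0.1 * np.ones(100)

nrr = nrr_db(out_speech, out_noise, in_speech, in_noise)   # 20 dB for this toy case
```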
From this result, we can see an imbalance between the target speech estimation and the noise estimation in every noise case; the performance of the target speech estimation is significantly poor, but that of noise estimation is very high. This result is consistent with the previously stated theory. Moreover, Fig. 6 shows directivity patterns shaped by the beamformers optimized by ICA in the simulation. It is clearly indicated that beamformer g_s(f), which picks up the target speech, resembles the DS beamformer, and that beamformer g_n(f), which picks up the noise, becomes NBF. From these results, it is confirmed that the previously stated theory, i.e., the beamformers optimized by ICA under a non-point-source noise condition are DS and NBF, is valid.

Fig. 4. Layout of reverberant room in our simulation.

Fig. 6. Typical directivity patterns under non-point-source noise condition shaped by ICA at 2 kHz and two-element array for case of white Gaussian noise.

Blind spatial subtraction array

Motivation and strategy
As clearly shown in Sects. 3.3 and 3.4, ICA is proficient in noise estimation rather than in target-speech estimation under a non-point-source noise condition. Thus, we cannot use ICA for direct target estimation under such a condition. However, we can still use ICA as a noise estimator. This motivates us to introduce an improved speech-enhancement strategy, i.e., BSSA [21]. BSSA consists of a DS-based primary path and a reference path including ICA-based noise estimation (see Fig. 7). The estimated noise component in ICA is efficiently subtracted from the primary path in the power-spectrum domain without phase information. This procedure can yield better target-speech enhancement than simple ICA, even with the additional benefit of estimation-error robustness in speech recognition applications. The detailed process of signal processing is shown below.


Partial speech enhancement in primary path
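We again consider the generalized form of the observed signal as described in (1). The target speech signal is partly enhanced in advance by DS. This procedure can be given as
$$y_{\mathrm{DS}}(f,\tau) = \boldsymbol{w}_{\mathrm{DS}}^{\mathrm{T}}(f)\,\boldsymbol{x}(f,\tau) = \boldsymbol{w}_{\mathrm{DS}}^{\mathrm{T}}(f)\boldsymbol{A}(f)\boldsymbol{s}(f,\tau) + \boldsymbol{w}_{\mathrm{DS}}^{\mathrm{T}}(f)\boldsymbol{A}(f)\boldsymbol{n}(f,\tau) + \boldsymbol{w}_{\mathrm{DS}}^{\mathrm{T}}(f)\boldsymbol{n}_a(f,\tau), \tag{35}$$
$$\boldsymbol{w}_{\mathrm{DS}}(f) = [w_1^{(\mathrm{DS})}(f), \ldots, w_J^{(\mathrm{DS})}(f)]^{\mathrm{T}}, \qquad w_j^{(\mathrm{DS})}(f) = \frac{1}{J}\exp\!\big(-\mathrm{j}2\pi (f/M) f_s d_j \sin\theta_U/c\big),$$
where y_DS(f, τ) is the primary-path output that is a slightly enhanced target speech, w_DS(f) is the filter coefficient vector of DS, and θ_U is the estimated DOA of the target speech given by the ICA part in Sect. 4.3. In (35), the second and third terms on the right-hand side express the remaining noise in the output of the primary path.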

ICA-based noise estimation in reference path
BSSA provides ICA-based noise estimation. First, we separate the observed signal by ICA and obtain the separated signal vector o(f, τ) as
$$\boldsymbol{o}(f,\tau) = [o_1(f,\tau), \ldots, o_{K+1}(f,\tau)]^{\mathrm{T}} = \boldsymbol{W}_{\mathrm{ICA}}(f)\,\boldsymbol{x}(f,\tau),$$
where the unmixing matrix W_ICA(f) is optimized by (11). Note that the number of ICA outputs becomes K + 1, and thus the number of sensors, J, is more than K + 1 because we assume that the additive noise n_a(f, τ) is not negligible. We cannot estimate the additive noise perfectly because it is deformed by the filter optimized by ICA. Moreover, other components also cannot be estimated perfectly when the additive noise n_a(f, τ) exists. However, we can estimate at least noises (including interference sounds that can be regarded as point sources, and the additive noise) that do not involve the target speech signal, as indicated in Sect. 3. Therefore, the estimated noise signal is still beneficial.
Next, we estimate DOAs from the unmixing matrix W_ICA(f) [16]. This procedure is represented by
$$\theta_u = \sin^{-1}\frac{\arg\!\left(\dfrac{[\boldsymbol{W}_{\mathrm{ICA}}^{-1}(f)]_{ju}}{[\boldsymbol{W}_{\mathrm{ICA}}^{-1}(f)]_{j'u}}\right)}{2\pi (f/M) f_s c^{-1}(d_j - d_{j'})},$$
where θ_u is the DOA of the uth sound source and [X]_ju denotes the entry in the jth row and uth column of X. Then, we choose the Uth source signal, which is nearest the front of the microphone array, and designate the DOA of the chosen source signal as θ_U. This is because almost all users are expected to stand in front of the microphone array in a speech-oriented human-machine interface, e.g., a public guidance system. Other strategies for choosing the target speech signal can be considered as follows.
• If the approximate location of a target speaker is known in advance, we can utilize the location of the target speaker.For instance, we can know the approximate location of the target speaker at a hands-free speech recognition system in a car navigation system in advance.Then, the DOA of the target speech signal is approximately known.For such systems, we can choose the target speech signal, selecting the specific component in which the DOA estimated by ICA is nearest the known target-speech DOA.
• For an interaction robot system [26], we can utilize image information from a camera mounted on a robot.Therefore, we can estimate DOA from this information, and we can choose the target speech signal on the basis of this estimated DOA.
• If the only target signal is speech, i.e., none of the noises are speech, we can choose the target speech signal on the basis of the Gaussian mixture model (GMM), which can classify sound signals into voices and nonvoices [27].
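The DOA estimation from the unmixing matrix and the front-nearest selection rule can be sketched as follows. This is an illustration under assumptions rather than the exact procedure of [16]: the columns of the (pseudo-)inverse of W_ICA(f) are assumed to approximate plane-wave steering vectors up to scaling, the phase ratio between two microphones is converted to an angle, and the geometry, the true DOAs, and the arbitrary output scaling are hypothetical test values.

```python
import numpy as np

def estimate_doas(W, f_bin, d, M, fs, j=0, jp=1, c=343.0):
    """Estimate one DOA per ICA output from the inverse of the unmixing matrix."""
    W_inv = np.linalg.pinv(W)                   # columns approximate mixing vectors
    ratio = W_inv[j, :] / W_inv[jp, :]          # scaling ambiguity cancels in the ratio
    scale = 2 * np.pi * (f_bin / M) * fs * (d[j] - d[jp]) / c
    return np.arcsin(np.clip(np.angle(ratio) / scale, -1.0, 1.0))

M, fs, f_bin = 512, 16000, 64                   # bin 64 of 512 at 16 kHz is 2 kHz
d = np.array([0.0, 0.02])                       # hypothetical two-element array
true_doas = np.deg2rad([40.0, 0.0])
# Assumption: anechoic mixing matrix made of plane-wave steering vectors.
A = np.exp(1j * 2 * np.pi * (f_bin / M) * fs * np.outer(d, np.sin(true_doas)) / 343.0)
W = np.diag([2.0, 0.5j]) @ np.linalg.inv(A)     # ideal unmixing with arbitrary scaling

doas = estimate_doas(W, f_bin, d, M, fs)
U = int(np.argmin(np.abs(doas)))                # output nearest the array front
```

In a reverberant room the estimate varies from bin to bin, and the microphone pair must be close enough to avoid spatial aliasing at the analyzed frequency.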
Next, in the reference path, no target speech signal is required because we want to estimate only noise. Therefore, we eliminate the user's signal from the ICA's output signal o(f, τ). This can be written as
$$\boldsymbol{q}(f,\tau) = [o_1(f,\tau), \ldots, o_{U-1}(f,\tau), 0, o_{U+1}(f,\tau), \ldots, o_{K+1}(f,\tau)]^{\mathrm{T}},$$
where q(f, τ) is the "noise-only" signal vector that contains only noise components. Next, we apply the projection back (PB) [13] method to remove the ambiguity of amplitude. This procedure can be represented as
$$\hat{\boldsymbol{q}}(f,\tau) = \boldsymbol{W}_{\mathrm{ICA}}^{+}(f)\,\boldsymbol{q}(f,\tau),$$
where M⁺ denotes the Moore-Penrose pseudo-inverse matrix of M. Thus, q̂(f, τ) is a good estimate of the noise signals received at the microphone positions, i.e.,
$$\hat{\boldsymbol{q}}(f,\tau) \simeq \boldsymbol{A}(f)\,\boldsymbol{n}(f,\tau) + \hat{\boldsymbol{n}}_a(f,\tau),$$
where n̂_a(f, τ) contains the deformed additive noise signal and separation error due to an additive noise. Finally, we construct the estimated noise signal z(f, τ) by applying DS as
$$z(f,\tau) = \boldsymbol{w}_{\mathrm{DS}}^{\mathrm{T}}(f)\,\hat{\boldsymbol{q}}(f,\tau) \simeq \boldsymbol{w}_{\mathrm{DS}}^{\mathrm{T}}(f)\boldsymbol{A}(f)\boldsymbol{n}(f,\tau) + \boldsymbol{w}_{\mathrm{DS}}^{\mathrm{T}}(f)\hat{\boldsymbol{n}}_a(f,\tau). \tag{45}$$
This equation means that z(f, τ) is a good candidate for noise terms of the primary path output y_DS(f, τ) (see the 2nd and 3rd terms on the right-hand side of (35)). Of course this noise estimation is not perfect, but we can still enhance the target speech signal via oversubtraction in the power-spectrum domain, as described in Sect. 4.4. Note that z(f, τ) is a function of the frame index τ, unlike the constant noise prototype in the traditional spectral subtraction method [2]. Therefore, the proposed BSSA can deal with nonstationary noise.
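The reference path (target removal, projection back, and DS) can be summarized in a short sketch. It assumes an idealized case in which the unmixing matrix is the exact inverse of an anechoic mixing matrix and no additive noise is present, so that the result can be compared with the true noise term; the function name `estimate_noise` and all numerical values are illustrative.

```python
import numpy as np

def estimate_noise(o, W, U, w_ds):
    """Reference path: zero the target output, project back, then apply DS."""
    q = o.copy()
    q[U, :] = 0.0                      # "noise-only" vector: remove the U-th output
    q_hat = np.linalg.pinv(W) @ q      # projection back to the microphone positions
    return w_ds @ q_hat                # estimated noise z(f, tau), shape (T,)

rng = np.random.default_rng(2)
J, T, U = 2, 100, 0
M, fs, f_bin = 512, 16000, 64
d = np.array([0.0, 0.02])
doas = np.deg2rad([0.0, 40.0])         # target at the front, interference at 40 degrees
A = np.exp(1j * 2 * np.pi * (f_bin / M) * fs * np.outer(d, np.sin(doas)) / 343.0)

src = rng.standard_normal((2, T)) + 1j * rng.standard_normal((2, T))
x = A @ src                            # observation without additive noise (assumption)
W = np.linalg.inv(A)                   # idealized unmixing matrix (assumption)
o = W @ x

w_ds = A[:, U].conj() / J              # DS weights steered to the target DOA
z = estimate_noise(o, W, U, w_ds)

n = src.copy()
n[U] = 0.0                             # true interference vector for comparison
```

In this idealized case z coincides with the interference term of the primary-path output; with additive noise and imperfect separation it is only an approximation, which is why oversubtraction and flooring are applied afterwards.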


Noise reduction processing in BSSA
In BSSA, noise reduction is carried out by subtracting the estimated noise power spectrum (45) from the partly enhanced target speech signal power spectrum (35). This procedure is given as
$$y_{\mathrm{BSSA}}(f,\tau) = \begin{cases} \left\{|y_{\mathrm{DS}}(f,\tau)|^2 - \beta\,|z(f,\tau)|^2\right\}^{1/2} & \text{if } |y_{\mathrm{DS}}(f,\tau)|^2 - \beta\,|z(f,\tau)|^2 \geq 0, \\ \gamma\,|y_{\mathrm{DS}}(f,\tau)| & \text{otherwise}, \end{cases} \tag{46}$$
where y_BSSA(f, τ) is the final output of BSSA, β is the oversubtraction parameter, and γ is the flooring parameter. Their appropriate setting, e.g., β > 1 and γ ≪ 1, results in efficient noise reduction. For example, a larger oversubtraction parameter (β ≫ 1) leads to a larger SNR improvement. However, the target signal would be distorted. On the other hand, a smaller oversubtraction parameter (β ≪ 1) gives a less-distorted target signal. However, the SNR improvement is decreased. In the end, a trade-off between SNR improvement and the distortion of the output signal exists with respect to the parameter β; 1 < β < 2 is usually used.
The system switches between two equations depending on the conditions in (46). If the calculated noise components using ICA in (45) are underestimated, i.e., |y_DS(f, τ)|² > β|z(f, τ)|², the resultant output y_BSSA(f, τ) corresponds to power-spectrum-domain subtraction among the primary and reference paths with an oversubtraction rate of β. On the other hand, if the noise components are overestimated in ICA, i.e., |y_DS(f, τ)|² < β|z(f, τ)|², the resultant output y_BSSA(f, τ) is floored with a small positive value to avoid a negative-valued unrealistic spectrum. These oversubtraction and flooring procedures enable error-robust speech enhancement in BSSA rather than a simple linear subtraction. Although the nonlinear processing in (46) often generates an artificial distortion, so-called musical noise, it is still applicable in the speech recognition system because the speech decoder is not very sensitive to such a distortion. BSSA involves mel-scale filter bank analysis and directly outputs the mel-frequency cepstrum coefficient (MFCC) [28] for speech recognition. Therefore, BSSA requires no transformation into the time-domain waveform for speech recognition.
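The switching between subtraction and flooring can be written compactly. The sketch below assumes the amplitude-domain form of the rule described above, i.e., the square root of |y_DS|² − β|z|² when this quantity is nonnegative and γ|y_DS| otherwise; the function name and the two toy bins are illustrative, and the defaults β = 1.4 and γ = 0.2 are the values quoted for the reverberant-room experiment.

```python
import numpy as np

def bssa_subtract(y_ds, z, beta=1.4, gamma=0.2):
    """Power-spectrum-domain oversubtraction with flooring (amplitude output)."""
    p = np.abs(y_ds) ** 2 - beta * np.abs(z) ** 2        # subtracted power
    subtracted = np.sqrt(np.maximum(p, 0.0))             # valid where p >= 0
    floored = gamma * np.abs(y_ds)                       # used where p < 0
    return np.where(p >= 0, subtracted, floored)

y_ds = np.array([2.0 + 0.0j, 0.0 + 1.0j])   # primary-path output in two toy bins
z = np.array([1.0 + 0.0j, 1.0 + 0.0j])      # estimated noise in the same bins
y_bssa = bssa_subtract(y_ds, z)             # first bin subtracted, second bin floored
```

Because only amplitudes are produced, the output can be passed directly to mel-scale filter bank analysis without reconstructing a time-domain waveform.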
In BSSA, DS and SS are processed in addition to ICA. In HO-ICA or SO-ICA, to calculate the correlation matrix, at least hundreds of product-sum operations are required in each frequency subband. On the other hand, in DS, at most J product-sum operations are required in each frequency subband. A mere 4 or 5 products are required for SS. Therefore, the complexity of BSSA does not increase by as much as 10% compared with ICA.

Variation and extension in noise reduction processing
As mentioned in the previous subsection, the noise reduction processing of BSSA is mainly based on SS, and therefore it often suffers from the problem of musical noise generation due to its nonlinear signal processing. This becomes a serious problem in audio applications intended for human listening, e.g., hearing aids and teleconference systems.
To improve the sound quality of BSSA, many variations have been proposed and implemented in the post-processing part in (46). Generalized SS and parametric Wiener filtering algorithms [29] have been introduced to successfully mitigate musical noise generation [30]. Furthermore, the minimum mean-square error (MMSE) short-time spectral amplitude (STSA) estimator [31] can be used for achieving low-distortion speech enhancement in BSSA [32]. In addition, this MMSE-STSA estimator with ICA-based noise estimation has been modified to deal with binaural signal enhancement, where the spatial cue of the target speech signal can be maintained in the output of BSSA [33].
In recent studies, an interesting extension of the signal processing structure has been addressed [34,35]. Two types of BSSA structures are shown in Fig. 8. One is the original BSSA structure, which performs SS after DS (see Fig. 8(a)), and the other performs SS channelwise before DS (chBSSA; see Fig. 8(b)). It has been theoretically clarified via higher-order statistics analysis that chBSSA is superior to BSSA in mitigating musical noise generation.
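The difference between the two structures is only the order of the operations, which the following sketch makes explicit for one time-frequency bin. It is an illustration under simplifying assumptions rather than the structures in Fig. 8: the channel spectra are assumed to be already time-aligned toward the target, DS is taken as a plain average, SS uses an oversubtract-and-floor rule whose floored form γ|·| is assumed, the noise magnitudes are supplied by the caller (in BSSA they would come from ICA), and all function names are hypothetical.

```python
import cmath
import math

def spectral_subtract(mag, noise_mag, beta=1.4, gamma=0.2):
    """Oversubtraction with flooring on a magnitude (floored form assumed)."""
    power = mag ** 2 - beta * noise_mag ** 2
    return math.sqrt(power) if power > 0.0 else gamma * mag

def delay_and_sum(channels):
    """DS for pre-aligned complex channel spectra: a plain average."""
    return sum(channels) / len(channels)

def bssa_magnitude(channels, noise_mag_after_ds):
    """Original order: DS first, then a single SS on the DS output."""
    y_ds = delay_and_sum(channels)
    return spectral_subtract(abs(y_ds), noise_mag_after_ds)

def chbssa_magnitude(channels, noise_mags):
    """Channelwise order: SS on every channel (phase kept), then DS."""
    enhanced = [
        cmath.rect(spectral_subtract(abs(x), n), cmath.phase(x))
        for x, n in zip(channels, noise_mags)
    ]
    return abs(delay_and_sum(enhanced))
```

For identical, perfectly aligned channels the two orders give the same magnitude; they differ once the channels carry different noise, which is the situation analyzed in [34,35].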

Experiment in reverberant room
In this experiment, we present a comparison of typical blind noise reduction methods, namely, the conventional ICA [14] and the traditional SS [2] cascaded with ICA (ICA+SS). We utilize the HO-ICA algorithm as conventional ICA [14]. Hereafter, 'ICA' simply indicates HO-ICA. For ICA+SS, we first obtain the estimated noise from the speech pause interval in the target speech estimated by ICA. In the noise reduction achieved by SS, n_remain(f) is the noise signal from the speech pause in the target speech estimated by ICA. Moreover, a DOA-based permutation solver [16] is used in conventional ICA and in the ICA part of BSSA.
We used 16-kHz-sampled signals as test data; the original speech (6 s) was convolved with impulse responses recorded in an actual environment, to which cleaner noise or a male's interfering speech recorded in an actual environment was added. Figure 9 shows the layout of the reverberant room used in the experiment. The reverberation time of the room is 200 ms; this corresponds to mixing filters of 3200 taps at 16 kHz sampling. The cleaner noise is not a simple point source signal but consists of several nonstationary noises emitted from a motor, an air duct, and a nozzle. Also, the male's interfering speech is not a simple point source but is slightly moving. In addition, these interference noises involve background noise. The SNR of the background noise (power ratio of target speech to background noise) is about 28 dB. We use 46 speakers (200 sentences) as the source of the target speech. The input SNR is set to 10 dB at the array. We use a four-element microphone array with an interelement spacing of 2 cm. The DFT size is 512. The oversubtraction parameter β is 1.4 and the flooring coefficient γ is 0.2. These parameters were experimentally determined. The speech recognition task and conditions are shown in Table 1.
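The filter length quoted above follows directly from the reverberation time and the sampling rate. A small helper (hypothetical name) makes the arithmetic explicit, using integer milliseconds to avoid rounding issues.

```python
def mixing_filter_taps(reverb_ms, sample_rate_hz):
    """Number of samples spanned by the reverberation time."""
    return reverb_ms * sample_rate_hz // 1000

print(mixing_filter_taps(200, 16000))  # 3200
```

The same helper gives 16000 taps for a reverberation time of 1000 ms at 16 kHz.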
Regarding the evaluation indices, we use the NRR described in (34), cepstral distortion (CD), and speech recognition performance; the last is the final goal of BSSA and fully reflects the separated sound quality. CD [36] is a measure of the degree of distortion in the cepstrum domain.
It indicates the distortion between two signals. In its definition, T is the frame length, C_out(ρ; τ) is the ρth cepstrum coefficient of the output signal in frame τ, C_ref(ρ; τ) is the ρth cepstrum coefficient of the speech signal convolved with the impulse response, D_b is a constant that transforms the measure into dB, and B is the number of dimensions of the cepstrum used in the evaluation. Moreover, we use the word accuracy (WA) score as the measure of speech recognition performance. In the definition of this index, W_WA is the number of words, S_WA is the number of substitution errors, D_WA is the number of dropout errors, and I_WA is the number of insertion errors.
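The two indices can be sketched as follows. The word accuracy function follows the usual definition in terms of the four counts named above. For the cepstral distortion, the exact expression is not reproduced in this text, so the code assumes a commonly used form: the per-frame Euclidean distance over B cepstral coefficients, with a factor of 2 under the square root, scaled by D_b and averaged over T frames. That form, the default D_b = 20/ln 10, and the function names are assumptions.

```python
import math

def word_accuracy(num_words, substitutions, dropouts, insertions):
    """Word accuracy in percent: 100 * (W - S - D - I) / W."""
    correct = num_words - substitutions - dropouts - insertions
    return 100.0 * correct / num_words

def cepstral_distortion(c_out, c_ref, d_b=20.0 / math.log(10.0)):
    """Mean cepstral distortion in dB over frames (assumed form).

    c_out, c_ref: lists of frames, each a list of B cepstral coefficients.
    d_b:          dB conversion constant (default value is an assumption).
    """
    total = 0.0
    for frame_out, frame_ref in zip(c_out, c_ref):
        squared = sum((a - b) ** 2 for a, b in zip(frame_out, frame_ref))
        total += d_b * math.sqrt(2.0 * squared)
    return total / len(c_out)
```

Note that insertion errors are subtracted as well, so the word accuracy can become negative when a recognizer inserts many spurious words.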
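First, actual separation results obtained by ICA for the case of cleaner noise and interference speech are shown in Fig. 10. We can confirm the imbalanced performance between target estimation and noise estimation, similar to the simulation-based results (see Sect. 3.4).
Next, we discuss the NRR-based experimental results shown in Figs. 11(a) and 12(a). From the results, we can confirm that the NRRs of BSSA are more than 3 dB greater than those of conventional ICA and ICA+SS. However, we can see from Figs. 11(b) and 12(b) that the distortion of BSSA is slightly higher. This is due to the fact that the noise reduction of BSSA is performed on the basis of spectral subtraction. However, the increase in the degree of distortion is expected to be negligible.
Finally, we show the speech recognition results in Figs. 11(c) and 12(c). It is evident that BSSA is superior to conventional ICA and ICA+SS.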

Experiment in real world
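An experiment in an actual railway-station environment is discussed here. Figure 13 shows the layout of the railway-station environment used in this experiment, where the reverberation time is about 1000 ms; this corresponds to mixing filters of 16000 taps at 16 kHz sampling. We used 16-kHz-sampled signals as test data; the original speech (6 s) was convolved with impulse responses recorded in the same railway-station environment, to which real-recorded noise was added. We use 46 speakers (200 sentences) as the original source of the target speech. The noise in the environment is nonstationary and is almost a non-point source; it consists of various kinds of interference noise, namely, background noise and the sounds of trains, ticket-vending machines, automatic ticket gates, footsteps, cars, and wind. Figure 14 shows two typical noises, i.e., noises 1 and 2, which were recorded in distinct time periods and are used in this experiment. A four-element array with an interelement spacing of 2 cm is used.
Figure 15 shows the real separation results obtained by ICA in the railway-station environment. We can ascertain the imbalanced performance between target estimation and noise estimation, similar to the simulation-based results (see Sect. 3.4).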
In the next experiment, we compare conventional ICA, ICA+SS, and BSSA in terms of NRR, cepstral distortion, and speech recognition performance. Figure 16(a) shows the results of the average NRR for whole sentences. From these results, we can see that the NRR of BSSA, which utilizes ICA as a noise estimator, is superior to those of the conventional methods. However, we find from Fig. 16(b) that the cepstral distortion in BSSA is greater than that in ICA.
Finally, we show the results of speech recognition, where the extracted sound quality is fully considered, in Fig. 16(c). The speech recognition task and conditions are the same as those in Sect. 5.1, as shown in Table 1. From this result, it can be concluded that the target-enhancement performance of BSSA, i.e., the method that uses ICA as a noise estimator, is evidently superior to that of the method that uses ICA directly as well as that of ICA+SS.

Real-time implementation of BSS
Several recent studies [19,38,39] have dealt with the issue of real-time implementation of ICA. The methods used, however, require high-speed personal computers, and BSS implementation on a small LSI still receives much attention in industrial applications. As a recent example of the implementation of real-time BSS, a real-time BSSA algorithm and its development are described in the following.
In BSSA's signal processing, the DS, SS, and separation filtering parts can work in real time. However, it is difficult to optimize (update) the separation filter in real time because the optimization of the unmixing matrix by ICA requires a huge amount of computation. Therefore, we introduce a strategy in which the separation filter optimized using data from a past time period is applied to the current data. Figure 17 illustrates the configuration of the real-time implementation of BSSA. Signal processing in this implementation is performed as follows.
Step 1: Input signals are converted into time-frequency-domain series by using a frame-by-frame fast Fourier transform (FFT).
Step 2: ICA is conducted using the past 1.5-s-duration data to estimate the separation filter during the current 1.5 s. The optimized separation filter is applied to the next (not the current) 1.5 s of samples. This staggered relation arises because the filter update in ICA requires substantial computation and cannot provide an optimal separation filter for the current 1.5 s of data.
Step 3: The input data are processed in two paths. In the primary path, the target speech is partly enhanced by DS. In the reference path, ICA-based noise estimation is conducted. Again, note that the separation filter for ICA is optimized by using the data from the past time period.
Step 4: Finally, we obtain the target-speech-enhanced signal by subtracting the power spectrum of the estimated noise signal in the reference path from the power spectrum of the primary path's output.
Although the update of the separation filter in the ICA part is not performed in real time and involves a total latency of 3.0 s, the entire system still appears to run in real time because DS, SS, and separation filtering can be carried out on the current segment with no delay. In the system, the performance degradation due to the latency problem in ICA is mitigated by oversubtraction in spectral subtraction.
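The staggered relation between filter estimation and filter application can be sketched as a simple block schedule. This is an offline simulation of the data dependency, not a concurrent real-time implementation: `learn_filter` and `apply_filter` are hypothetical callables standing in for ICA optimization and separation filtering, and the use of an initial filter for the first two blocks is an assumption.

```python
BLOCK_SECONDS = 1.5
# A filter is learned from block n-2 (optimized while block n-1 is playing)
# and applied to block n.
FILTER_LAG_BLOCKS = 2

def run_staggered(blocks, learn_filter, apply_filter, initial_filter):
    """Process blocks in order, each with the filter learned two blocks earlier."""
    outputs = []
    for n, block in enumerate(blocks):
        if n >= FILTER_LAG_BLOCKS:
            separation_filter = learn_filter(blocks[n - FILTER_LAG_BLOCKS])
        else:
            # No optimized filter is available yet (assumed fallback).
            separation_filter = initial_filter
        outputs.append(apply_filter(separation_filter, block))
    return outputs

filter_latency_seconds = FILTER_LAG_BLOCKS * BLOCK_SECONDS  # 3.0
```

Each block is filtered as soon as it arrives; only the filter itself is 3.0 s old, which is why the output path has no added delay.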
Figure 18 shows an example of the hardware implementation of BSSA, which was developed by KOBELCO Ltd., Japan [40]. They have fabricated a pocket-size real-time BSS microphone, where the BSSA algorithm can work on a general-purpose DSP (TEXAS INSTRUMENTS TMS320C6713; 200 MHz clock, 100 kB program size, 1 MB working memory). This microphone was made commercially available in 2007 and has been adopted for the purpose of surveillance by the Japanese National Police Agency.

Conclusion
This chapter addressed the BSS problem for speech applications under real acoustic environments, particularly focusing on BSSA, which utilizes ICA as a noise estimator. It was pointed out that, under a non-point-source noise condition, the beamformers optimized by ICA are a DS beamformer for extracting the target speech signal, which can be regarded as a point source, and an NBF for picking up the noise signal. Thus, ICA is proficient in noise estimation under a non-point-source noise condition. Therefore, it is valid to use ICA as a noise estimator.
In experiments involving computer-simulation-based and real-recording-based data, the SNR improvement and speech recognition results of BSSA are superior to those of conventional methods. These results indicate that ICA-based noise estimation is beneficial for speech enhancement in adverse environments. Also, the hardware implementation of BSS was discussed with a typical example of a real-time BSSA algorithm.

Fig. 9. Layout of reverberant room used in our experiment.

Fig. 10. NRR-based separation performance of conventional ICA in environment shown in Fig. 9.

R xx ( f , τ b ) and R oo ( f , τ b ) are the cross-power spectra of the input x( f , τ) and output o( f
1is a normalization factor ( • represents the Frobenius norm).

Table 1. Conditions for Speech Recognition