Dereverberation Based on Spectral Subtraction by Multi-Channel LMS Algorithm for Hands-Free Speech Recognition



Introduction
In a distant-talking environment, channel distortion drastically degrades speech recognition performance because of a mismatch between the training and testing environments. Current approaches to automatic speech recognition (ASR) robustness against reverberation and noise can be classified into speech signal processing [1, 4, 5, 14], robust feature extraction [10, 20], and model adaptation [3, 25].
In this chapter, we focus on speech signal processing in the distant-talking environment. Because both the speech signal and the reverberation are nonstationary, recovering clean speech from the convolution of a nonstationary speech signal with an impulse response is very difficult. Several studies have focused on mitigating this problem [8, 9, 11, 12]. [1] explored a speech dereverberation technique whose principle was the recovery of the envelope modulations of the original (anechoic) speech; they applied a technique originally developed to treat background noise [11] to the dereverberation problem. [7] proposed a novel approach for multi-microphone speech dereverberation based on constructing the null subspace of the data matrix in the presence of colored noise, employing generalized singular-value decomposition or generalized eigenvalue decomposition of the respective correlation matrices.
A reverberation compensation method for speaker recognition using spectral subtraction, in which the late reverberation is treated as additive noise, was proposed by [16, 17]. However, the drawback of this approach is that the optimum parameters for spectral subtraction are empirically estimated from a development dataset, and the late reverberation cannot be subtracted well since it is not modeled precisely. [18] proposed a novel dereverberation method utilizing multi-step forward linear prediction: the linear prediction coefficients are estimated in the time domain, and the amplitude of late reflections is suppressed through spectral subtraction in the spectral domain.
In this chapter, we propose a robust distant-talking speech recognition method based on spectral subtraction (SS) employing the multi-channel least mean square (MCLMS) algorithm. Speech captured by distant-talking microphones is distorted by reverberation. With a long impulse response, the spectrum of the distorted speech is approximated by convolving the spectrum of clean speech with the spectrum of the impulse response, as explained in the next section. This enables us to treat the late reverberation as additive noise, so a noise reduction technique based on spectral subtraction can easily be applied to compensate for it. By excluding the phase information from the dereverberation operation, reverberation reduction in the power spectral domain provides robustness against certain errors to which the conventional, sensitive inverse filtering method is vulnerable [18]. Spectral subtraction requires a compensation parameter, namely the spectrum of the impulse response. An adaptive MCLMS algorithm was proposed to blindly identify the channel impulse response in the time domain [12-14]. In this chapter, we extend this method to blindly estimate the spectrum of the impulse response for spectral subtraction in the frequency domain. The early reverberation is normalized by CMN [6]. Power SS is the most commonly used SS method. A previous study has shown that generalized SS (GSS) with a lower exponent parameter is more effective than power SS for noise reduction [26]. In this chapter, both power SS and GSS are employed to suppress late reverberation. A diagram of the proposed method is shown in Fig. 1.
In this chapter, we also investigate the robustness of the power SS-based dereverberation under various reverberant conditions for large vocabulary continuous speech recognition (LVCSR). We analyze the factors affecting compensation parameter estimation for dereverberation based on power SS (the numbers of reverberation windows and channels, the length of utterance, and the distance between sound source and microphone) in a simulated reverberant environment.
The remainder of this chapter is organized as follows. Section 2 outlines blind dereverberation based on spectral subtraction. A multi-channel method based on the LMS algorithm, used to estimate the power spectrum of the impulse response (that is, the compensation parameter for spectral subtraction), is described in Section 3. Section 4 describes the experimental results of hands-free speech recognition in both simulated and real reverberant environments. Finally, Section 5 summarizes the chapter.

The speech x[t] captured by a distant-talking microphone is modeled as

x[t] = s[t] * h[t] + n[t],    (1)

where s[t] is the clean speech, h[t] is the channel impulse response, n[t] is additive noise, and * denotes the convolution operation. In this chapter, additive noise is ignored for simplification, so Eq. (1) becomes

x[t] = s[t] * h[t].    (2)

Outline of blind dereverberation
To analyze the effect of the impulse response, h[t] can be separated into two parts, h_early[t] and h_late[t], as [16, 17]

h_early[t] = h[t] if t < T, and 0 otherwise,
h_late[t] = h[t] − h_early[t],

where T is the length of the spectral analysis window. Eq. (1) can then be rewritten as

x[t] = s[t] * h_early[t] + s[t] * h_late[t],    (3)

where the early effect is distortion within a frame (analysis window), and the late effect comes from previous multiple frames.
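The split above can be checked numerically. The helper name and synthetic signals in this sketch are illustrative, not from the chapter; the identity follows from the linearity of convolution.

```python
import numpy as np

# Numerical sketch of the early/late split: h = h_early + h_late, so the
# distorted speech x = s * h_early + s * h_late (Eq. (3)) equals s * h.
def split_impulse_response(h, T):
    h_early = np.where(np.arange(len(h)) < T, h, 0.0)
    return h_early, h - h_early

rng = np.random.default_rng(0)
s = rng.standard_normal(512)                                   # stand-in for clean speech
h = rng.standard_normal(256) * np.exp(-np.arange(256) / 64.0)  # decaying impulse response
h_early, h_late = split_impulse_response(h, T=128)

x = np.convolve(s, h_early) + np.convolve(s, h_late)
assert np.allclose(x, np.convolve(s, h))                       # linearity of convolution
```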
When the length of the impulse response is much shorter than the analysis window size T used for the short-time Fourier transform (STFT), the STFT of the distorted speech equals the STFT of the clean speech multiplied by the STFT of the impulse response h[t] (in this case, h[t] = h_early[t]). However, when the length of the impulse response is much longer than the analysis window size, the STFT of the distorted speech is usually approximated by

X(f, ω) ≈ Σ_{d=0}^{D} S(f − d, ω) H(d, ω),    (4)

where f is the frame index, S(f, ω) is the STFT of the clean speech s, H(ω) is the STFT of the impulse response, H(d, ω) denotes the part of H(ω) corresponding to frame delay d, and D is the number of reverberation windows. That is to say, with a long impulse response, the channel distortion is no longer multiplicative in the linear spectral domain; rather, it is convolutional [25]. [17] proposed far-field speaker recognition based on spectral subtraction. In this method, the early term of Eq. (3) was compensated by conventional CMN, whereas the late term of Eq. (3) was treated as additive noise and a noise reduction technique based on spectral subtraction was applied as Eq. (5), where α is the noise overestimation factor, β is the spectral floor parameter that avoids negative or underflow values, and g(ω) is a frequency-dependent value determined on a development set and set as |1 − 0.9e^{jω}| [17]. However, the drawback of this approach is that the optimum parameters α and β for the spectral subtraction are empirically estimated on a development dataset, and the STFT of the late effect of the impulse response (the delayed terms on the right-hand side of Eq. (4)) cannot be subtracted in a straightforward manner since the late reverberation is not modeled precisely.
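The frame-delay approximation of Eq. (4) can be written as a short routine. The function name and the array layout (one row of `H` per frame delay, starting at d = 0) are assumptions for illustration.

```python
import numpy as np

# Sketch of Eq. (4): X(f, w) ~ sum_d S(f - d, w) H(d, w). Rows of H are the
# per-delay spectra H(d, w); row 0 is the within-frame (early) part.
def distorted_spectrum(S, H):
    """S: (frames, bins) clean STFT; H: (delays, bins) per-delay response spectra."""
    F, K = S.shape
    X = np.zeros((F, K), dtype=S.dtype)
    for f in range(F):
        for d in range(min(H.shape[0], f + 1)):
            X[f] += S[f - d] * H[d]
    return X

# With a single delay (impulse response shorter than the window), the distortion
# reduces to the familiar multiplicative model X(f, w) = S(f, w) H(0, w):
S = np.ones((3, 4))
H = np.full((1, 4), 2.0)
assert np.allclose(distorted_spectrum(S, H), 2 * S)
```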
In this chapter, we propose a dereverberation method based on spectral subtraction that estimates the STFT of the clean speech Ŝ(f, ω) based on Eq. (4); the spectrum of the impulse response used for the spectral subtraction is blindly estimated using the method described in Section 3. Assuming for simplification that the phases of different frames are uncorrelated, the power spectrum of Eq. (4) can be approximated as

|X(f, ω)|² ≈ Σ_{d=0}^{D} |S(f − d, ω)|² |H(d, ω)|².    (6)

The estimated power spectrum of clean speech may not be very accurate owing to the estimation error of the impulse response, especially of its early part. In addition, an unreliable estimate of the clean power spectrum in a previous frame causes a further estimation error in the current frame. In this chapter, the late reverberation is therefore reduced by power SS, while the early reverberation is normalized by CMN at the feature extraction stage. A diagram of the proposed method is shown in Fig. 1. SS is used to prevent the estimated power spectrum obtained by reducing the late reverberation from becoming negative; the estimated power spectrum of clean speech then becomes

|Ŝ(f, ω)|² = max( (|X(f, ω)|² − α Σ_{d=1}^{D} |Ŝ(f − d, ω)|² |Ĥ(d, ω)|²) / |Ĥ(0, ω)|², β |X(f, ω)|² ),    (7)

where Ŝ(f, ω) is the spectrum of the estimated clean speech and Ĥ(d, ω) is the estimated STFT of the impulse response. To estimate the power spectra of the impulse responses, we extend the multi-channel LMS algorithm for identifying the impulse responses in the time domain [14] to the frequency domain in Section 3.2.
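The recursion of Eq. (7) can be sketched as a frame loop. The defaults α = 1 and β = 0.15 follow the chapter's isolated-word setup; the function name and array layout are illustrative.

```python
import numpy as np

# Sketch of the power-SS dereverberation rule of Eq. (7): subtract the estimated
# late reverberation (built from previously estimated clean frames), divide by
# the early term |H(0, w)|^2, and floor the result at beta * |X|^2.
def power_ss_dereverb(X2, H2, alpha=1.0, beta=0.15):
    """X2: (frames, bins) observed power spectra; H2: (delays, bins) |H(d, w)|^2."""
    F, _ = X2.shape
    S2 = np.zeros_like(X2)
    for f in range(F):
        late = sum(S2[f - d] * H2[d] for d in range(1, min(H2.shape[0], f + 1)))
        S2[f] = np.maximum((X2[f] - alpha * late) / H2[0], beta * X2[f])
    return S2

# Sanity check: with no late reverberation and a flat early response, the
# estimate is the observation itself.
X2 = np.ones((3, 4))
assert np.allclose(power_ss_dereverb(X2, np.ones((1, 4))), X2)
```

The spectral floor β |X(f, ω)|² guarantees the output stays nonnegative even when the subtracted late-reverberation estimate is too large.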

Dereverberation based on GSS
Previous studies have shown that GSS with an arbitrary exponent parameter is more effective than power SS for noise reduction [26]. In this chapter, we extend GSS to suppress late reverberation. Instead of the power SS-based dereverberation given in Eq. (7), GSS-based dereverberation is formulated as

|Ŝ(f, ω)|^{2n} = max( (|X(f, ω)|^{2n} − α Σ_{d=1}^{D} |Ŝ(f − d, ω)|^{2n} |Ĥ(d, ω)|^{2n}) / |Ĥ(0, ω)|^{2n}, β |X(f, ω)|^{2n} ),    (8)

where n is the exponent parameter. For power SS, the exponent parameter n is equal to 1. In this chapter, n is set to 0.1, as this value yielded the best results [26].
The methods given in Eq. (7) and Eq. (8) are referred to as the power SS-based and GSS-based dereverberation methods, respectively.
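The GSS rule of Eq. (8) can be sketched in the same recursive fashion; n = 0.1 follows the chapter, while the function name and the α, β defaults are illustrative.

```python
import numpy as np

# Sketch of the GSS rule of Eq. (8): the power-SS recursion with exponent 2
# replaced by 2n (n = 1 recovers power SS; the chapter uses n = 0.1).
def gss_dereverb(X, Hmag, n=0.1, alpha=1.0, beta=0.15):
    """X: (frames, bins) magnitude spectra; Hmag: (delays, bins) estimated |H(d, w)|."""
    X2n, H2n = X ** (2 * n), Hmag ** (2 * n)
    Sg = np.zeros_like(X2n)                 # holds |S|^(2n)
    for f in range(X.shape[0]):
        late = sum(Sg[f - d] * H2n[d] for d in range(1, min(Hmag.shape[0], f + 1)))
        Sg[f] = np.maximum((X2n[f] - alpha * late) / H2n[0], beta * X2n[f])
    return Sg ** (1.0 / (2 * n))            # back to the magnitude domain

# With no late reverberation and a flat response, the input is returned unchanged.
X = np.full((3, 4), 2.0)
assert np.allclose(gss_dereverb(X, np.ones((1, 4))), X)
```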

Dereverberation and denoising based on GSS
The precision of impulse response estimation degrades drastically when additive noise is present. We therefore present a dereverberation and denoising method based on GSS. A diagram of the processing is shown in Fig. 2. First, the spectrum of the additive noise is estimated and noise reduction is performed; then the reverberation is suppressed using the estimated spectra of the impulse responses. When additive noise is present, Eq. (6) becomes

|X(f, ω)|² ≈ Σ_{d=0}^{D} |S(f − d, ω)|² |H(d, ω)|² + |N(f, ω)|²,    (9)

where N(f, ω) is the spectrum of the noise n(t). To suppress the noise and reverberation simultaneously, Eq. (8) is modified as Eqs. (10) and (11), where N̄(ω) is the mean of the noise spectrum N(f, ω) and X_N(f, ω) is the spectrum obtained by subtracting the estimated mean noise spectrum N̄(ω) from the spectrum of the observed speech. In this chapter, we set the parameter β₁ equal to β₂.

Identifiability and principle
An adaptive multi-channel LMS algorithm for blind single-input multiple-output (SIMO) system identification in the time domain was proposed by [13, 14].
Before introducing the MCLMS algorithm for blind channel identification, we explain which SIMO systems are blindly identifiable. A multi-channel FIR (finite impulse response) system can be blindly identified primarily because of the channel diversity.
As an extreme counter-example, if all channels of a SIMO system are identical, the system reduces to a single-input single-output (SISO) system and becomes unidentifiable. In addition, the source signal needs to have sufficient modes to fully excite the channels. The following two assumptions are made to guarantee an identifiable system: 1. The polynomials formed from h_n, n = 1, 2, ..., N, where h_n is the n-th impulse response and N is the number of channels, are co-prime, i.e., the channel transfer functions H_n(z) do not share any common zeros; 2. The autocorrelation matrix R_ss = E{s(k)s^T(k)} of the input signal is of full rank (such that the SIMO system can be fully excited).
In the following, these two conditions are assumed to hold, so that we deal with a blindly identifiable FIR SIMO system.
In the absence of additive noise, we can take advantage of the fact that x_i * h_j = s * h_i * h_j = x_j * h_i, which gives the following cross relation at time t:

x_i^T(t) h_j(t) = x_j^T(t) h_i(t),  i, j = 1, 2, ..., N, i ≠ j,    (13)

where h_i(t) is the i-th impulse response at time t and

x_n(t) = [x_n(t), x_n(t − 1), ..., x_n(t − L + 1)]^T,

where x_n(t) is the speech signal received from the n-th channel at time t and L is the number of taps of the impulse response. Multiplying Eq. (13) by x_n(t) and taking the expectation yields

R_{x_n x_i}(t + 1) h_j(t) = R_{x_n x_j}(t + 1) h_i(t),    (15)

where R_{x_i x_j}(t + 1) = E{x_i(t + 1) x_j^T(t + 1)}. Eq. (15) comprises N(N − 1) distinct equations. By summing up the N − 1 cross relations associated with one particular channel h_j(t), we get

Σ_{i ≠ j} R_{x_i x_i}(t + 1) h_j(t) − Σ_{i ≠ j} R_{x_i x_j}(t + 1) h_i(t) = 0.

Over all channels, we then have a total of N such equations. In matrix form, this set of equations is written as

R_{x+}(t + 1) h(t) = 0,

where h(t) = [h_1^T(t), h_2^T(t), ..., h_N^T(t)]^T, h_n(t) = [h_n(t, 0), h_n(t, 1), ..., h_n(t, L − 1)]^T, h_n(t, l) is the l-th tap of the n-th impulse response at time t, and R_{x+}(t + 1) is the NL × NL block matrix whose j-th diagonal block is Σ_{i ≠ j} R_{x_i x_i}(t + 1) and whose (j, i) off-diagonal block (i ≠ j) is −R_{x_i x_j}(t + 1). If the SIMO system is blindly identifiable, the matrix R_{x+} is rank deficient by 1 (in the absence of noise) and the channel impulse responses can be uniquely determined.
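The cross relation of Eq. (13) can be verified on a toy noise-free two-channel system; all signals below are synthetic and the helper name is illustrative.

```python
import numpy as np

# Numerical check of the cross relation x_i^T(t) h_j = x_j^T(t) h_i (Eq. (13)):
# since x_i * h_j = s * h_i * h_j is symmetric in (i, j), the two sides agree.
rng = np.random.default_rng(1)
L = 8                                        # taps per channel impulse response
s = rng.standard_normal(200)                 # persistently exciting source
h = rng.standard_normal((2, L))              # two distinct channel responses
x = np.stack([np.convolve(s, h[n]) for n in range(2)])

def tap_vector(xn, t, L):
    """x_n(t) = [x_n(t), x_n(t - 1), ..., x_n(t - L + 1)]^T."""
    return xn[t - L + 1:t + 1][::-1]

for t in range(L - 1, L - 1 + 50):
    assert np.isclose(tap_vector(x[0], t, L) @ h[1],
                      tap_vector(x[1], t, L) @ h[0])
```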
When the estimated channel impulse responses deviate from the true values, an error at time t + 1 is produced:

e_{ij}(t + 1) = x_i^T(t + 1) ĥ_j(t) − x_j^T(t + 1) ĥ_i(t),  i, j = 1, 2, ..., N,

where ĥ(t) is the estimated model filter at time t. This error can be used to define a cost function at time t + 1:

J(t + 1) = Σ_{i<j} e_{ij}²(t + 1) / ||ĥ(t)||²,    (23)

which can also be written as ĥ^T(t) R̃_{x+}(t + 1) ĥ(t) / ||ĥ(t)||², where R̃_{x+}(t + 1) is constructed from the instantaneous correlations R̃_{x_i x_j}(t + 1) = x_i(t + 1) x_j^T(t + 1), i, j = 1, 2, ..., N, in the same way as R_{x+}. Here we put a tilde on R̃_{x_i x_j} to distinguish this instantaneous value from its mathematical expectation R_{x_i x_j}. By minimizing the cost function J of Eq. (23), the impulse responses are blindly derived. There are various methods to minimize J, for example, the constrained multi-channel LMS (MCLMS) algorithm, the constrained multi-channel Newton (MCN) algorithm, and the variable step-size unconstrained MCLMS (VSS-UMCLMS) algorithm [12, 14]. Among these, the VSS-UMCLMS achieves a good balance between complexity and convergence speed [14]. Moreover, the VSS-UMCLMS is more practical and much easier to use since the step size does not have to be specified in advance. Therefore, in this chapter, we apply the VSS-UMCLMS algorithm to identify the multi-channel impulse responses.

Variable step-size unconstrained multi-channel LMS algorithm in time domain
As the cost function J(t + 1) at time t + 1 is minimized, its gradient with respect to ĥ(t) can be approximated as

∇J(t + 1) ≈ 2 R̃_{x+}(t + 1) ĥ(t) / ||ĥ(t)||²,    (24)

and the model filter at time t + 1 is updated as

ĥ(t + 1) = ĥ(t) − 2µ R̃_{x+}(t + 1) ĥ(t),    (25)

which is theoretically equivalent to the adaptive algorithm proposed by [2], although the cost functions are defined in different ways in these two adaptive blind SIMO identification algorithms. In Eq. (25), µ is the step size of the multi-channel LMS.
With such a simplified adaptive algorithm, the primary concern is whether it converges to the trivial all-zero estimate. Fortunately, this will not happen as long as the initial estimate ĥ(0) is not orthogonal to the true channel impulse response vector h [2].
Finally, an optimal step size for the unconstrained MCLMS at time t + 1 is derived; the details of the VSS-UMCLMS are described in [14].
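The update loop can be sketched for two channels. This is a simplified stand-in for the VSS-UMCLMS: the variable step-size rule of [14] is replaced by a fixed µ plus a unit-norm projection (which also rules out the trivial all-zero solution), and all constants are illustrative.

```python
import numpy as np

# Simplified two-channel, fixed-step multi-channel LMS on the cross-relation
# error e(t) = x_1^T(t) h2_hat - x_2^T(t) h1_hat.
rng = np.random.default_rng(2)
L, steps, mu = 4, 4000, 0.005
s = rng.standard_normal(steps)
h_true = rng.standard_normal((2, L))
x = np.stack([np.convolve(s, h_true[n])[:steps] for n in range(2)])

h_hat = rng.standard_normal((2, L))
h_hat /= np.linalg.norm(h_hat)               # nonzero, unit-norm initialization
errs = []
for t in range(L - 1, steps):
    x1 = x[0, t - L + 1:t + 1][::-1]         # x_1(t) tap vector
    x2 = x[1, t - L + 1:t + 1][::-1]
    e = x1 @ h_hat[1] - x2 @ h_hat[0]
    h_hat[0] += 2 * mu * e * x2              # gradient step on e^2
    h_hat[1] -= 2 * mu * e * x1
    h_hat /= np.linalg.norm(h_hat)           # stay away from the all-zero estimate
    errs.append(e * e)

# The cross-relation error shrinks as h_hat aligns with the true channels (up to scale).
assert np.mean(errs[-200:]) < np.mean(errs[:200])
```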

Extending VSS-UMCLMS algorithm to compensation parameter estimation for spectral subtraction
To blindly estimate the compensation parameter (that is, the spectrum of the impulse response), we extend the MCLMS algorithm described in Section 3.1 from the time domain to the frequency domain.
As shown in Eq. (4), the spectrum of the distorted signal is the convolution of the spectrum of the clean speech with that of the impulse response. The spectrum of the impulse response depends on the frequency ω; the variable ω is omitted below for simplicity. Thus, in the absence of additive noise, the spectra of the distorted signals satisfy the following relation at frame f in the frequency domain:

X_i^T(f) H_j = X_j^T(f) H_i,  i, j = 1, 2, ..., N, i ≠ j,    (27)

where X_n(f) = [X_n(f), X_n(f − 1), ..., X_n(f − D + 1)]^T is a D-dimensional vector of spectra of the distorted speech received from the n-th channel, X_n(f) is the spectrum of the distorted speech received from the n-th channel at frame f for frequency ω, H_n = [H_n(f, 0), H_n(f, 1), ..., H_n(f, D − 1)]^T is the corresponding vector of spectra of the impulse response, and H_n(f, d) is the spectrum of the impulse response for frequency ω at frame f corresponding to frame delay d (that is, at frame f + d).
Using Eq. (27) in place of Eq. (13), the spectra of the impulse responses can be blindly estimated by the VSS-UMCLMS mentioned in Section 3.1.2.
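The frequency-domain cross relation can be checked per bin; all quantities in this sketch are synthetic, and the frame-delay model used to generate the spectra is the one of Eq. (4).

```python
import numpy as np

# Per-bin check of the cross relation behind Eq. (27): spectral frame sequences
# generated by X_n(f) = sum_d S(f - d) H_n(d) satisfy the same identity as the
# time-domain taps, so the MCLMS recursion can run independently in each bin.
rng = np.random.default_rng(3)
D, F = 4, 60                                 # frame delays, number of frames
S = rng.standard_normal(F) + 1j * rng.standard_normal(F)         # clean spectra (one bin)
H = rng.standard_normal((2, D)) + 1j * rng.standard_normal((2, D))
X = np.stack([np.convolve(S, H[n])[:F] for n in range(2)])       # distorted spectra

def frame_vector(Xn, f, D):
    """X_n(f) = [X_n(f), X_n(f - 1), ..., X_n(f - D + 1)]^T."""
    return Xn[f - D + 1:f + 1][::-1]

for f in range(D - 1, F):
    assert np.isclose(frame_vector(X[0], f, D) @ H[1],
                      frame_vector(X[1], f, D) @ H[0])
```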

Experimental setup
The proposed dereverberation method based on spectral subtraction is evaluated on an isolated word recognition task in a simulated reverberant environment, and on a large vocabulary continuous speech recognition task in both simulated and real reverberant environments. The microphone array is illustrated in Fig. 3. A four-channel circular or linear microphone array was taken from a circular + linear microphone array (30 channels). The four-channel circular microphone array had a diameter of 30 cm, with the four microphones located at equal 90° intervals. The four microphones of the linear microphone array were located at 11.32 cm intervals. Impulse responses were measured at several positions 2 m from the microphone array. The sampling frequency was 48 kHz.
For clean speech, 20 male speakers each uttered 100 isolated words with a close-talking microphone. The 100 isolated words were phonetically balanced common words selected from the Tohoku University and Panasonic isolated spoken word database [21]. The average duration of the utterances was about 0.6 s. The sampling frequency was 12 kHz; the impulse responses sampled at 48 kHz were downsampled to 12 kHz so that they could be convolved with the clean speech. The frame length was 21.3 ms, and the frame shift was 8 ms with a 256-point Hamming window. Then, 116 Japanese speaker-independent syllable-based HMMs (strictly speaking, mora-unit HMMs [22]) were trained using 27,992 utterances read by 175 male speakers from the Japanese Newspaper Article Sentences (JNAS) corpus [15]. Each continuous-density HMM had five states, four of which had output probability density functions (pdfs). Each pdf consisted of four Gaussians with full-covariance matrices. The acoustic model was common to the baseline and proposed methods, and it was trained on clean speech. The feature vector comprised 10 mel-frequency cepstral coefficients; first- and second-order derivatives of the cepstra plus first- and second-order derivatives of the power component were also included (32 feature parameters in total).
The number of reverberant windows D in Eq. (4) was set to eight, which was determined empirically. In general, the window size D is proportional to RT60; however, it is also affected by the reverberation properties, for example, the ratio of the power of the late reverberation to that of the early reverberation. In a preliminary experiment with partial test data, our proposed method with a window size D = 2 to 16 outperformed the baseline significantly, and D = 8 achieved the best result. Automatic estimation of the optimum window size D is left as future work. The length of the Hamming window for the discrete Fourier transform was 256 points (21.3 ms), and the overlap rate was 1/2. An illustration of the analysis window is shown in Fig. 4. For the proposed dereverberation based on spectral subtraction, the previous clean power spectra estimated with a skip window were used to estimate the current clean power spectrum. The spectrum of the impulse response Ĥ(d, ω) was estimated using the corresponding utterance to be recognized, with an average duration of about 0.6 s. No special parameters such as over-subtraction parameters were used in the spectral subtraction (α = 1), except that the subtracted value was floored so that it did not become negative (β = 0.15). The speech recognition accuracy for clean isolated words was 96.0%.

Experimental setup for LVCSR task
In this study, both the artificial reverberant speech and real reverberant speech were used to evaluate our proposed method.
For artificial reverberant speech, multi-channel distorted speech signals simulated by convolving multi-channel impulse responses with clean speech were used. Fifteen sets of multi-channel impulse responses measured in various acoustical reverberant environments were selected from the Real World Computing Partnership (RWCP) sound scene database [23] and the CENSREC-4 database [24]. Table 1 lists the details of the 15 recording conditions. The microphone array is illustrated in Fig. 3. For the RWCP database, a 2-8 channel circular or linear microphone array was taken from a circular + linear microphone array (30 channels). The circular microphone array had a diameter of 30 cm; the microphones of the linear microphone array were located at 2.83 cm intervals. Impulse responses were measured at several positions 2 m from the microphone array. For the CENSREC-4 database, 2 or 4 channel microphones were taken from a linear microphone array (7 channels), with the microphones located at 2.125 cm intervals. Impulse responses were measured at several positions 0.5 m from the microphone array. The Japanese Newspaper Article Sentences (JNAS) corpus [15] was used as clean speech. One hundred utterances from the JNAS database convolved with the multi-channel impulse responses shown in Table 1 were used as test data. The average duration of the utterances was about 5.8 s.
For reverberant speech in a real environment, we recorded multi-channel speech degraded simultaneously by background noise and reverberation. Table 2 gives the conditions and content of the recordings. One hundred utterances from the JNAS corpus, uttered by five male speakers seated on the chairs labeled A to E in Fig. 5, were recorded by a multi-channel recording device. The heights of the microphone array and the utterance position of each speaker were about 0.8 m and 1.0 m, respectively. An electric fan with a high air volume located behind the speaker in position A served as background noise; the average SNR of the speech was about 18 dB. We used a microphone array with 9 channels (Fig. 5) and a pin microphone to record speech in the distant-talking and close-talking environments, respectively. Table 3 gives the conditions for speech recognition. The acoustic models were trained with the ASJ speech databases of phonetically balanced sentences (ASJ-PB) and the JNAS corpus; in total, around 20K sentences (clean speech) uttered by 132 speakers were used for each gender. Table 4 gives the conditions for SS-based denoising and dereverberation; the parameters shown in Table 4 were determined empirically. For the SS-based dereverberation method without background noise, the parameter α was equal to α₁ and β was equal to β₁. The number of reverberant windows D was set to 6 (192 ms). An illustration of the analysis window is shown in Fig. 4. An open-source LVCSR decoder, "Julius" [19], based on word trigrams and triphone context-dependent HMMs, was used.

Isolated word recognition results
Table 5 shows the isolated word recognition results in the simulated reverberant environment. "Distorted speech #" in Table 5 corresponds to "array no" in Table 1. Delay-and-sum beamforming [27] is performed for all methods in this chapter. The conventional CMN combined with delay-and-sum beamforming was used as the baseline.
The power SS-based dereverberation method of Eq. (7) improved speech recognition significantly compared with CMN under all severe reverberant conditions, because the proposed method compensates for both the late and early reverberation. The proposed method achieved an average relative error reduction rate of 24.5% relative to conventional CMN with beamforming.

(a) Effect factor analysis of power SS-based dereverberation in the simulated reverberant environment
In this section, four microphones are used to estimate the spectra of the impulse responses unless otherwise stated. Delay-and-sum beamforming (BF) was performed on the 4-channel dereverberant speech signals; for the proposed method, each speech channel was compensated with the corresponding estimated impulse response. Preliminary experimental results for isolated word recognition showed that the power SS-based dereverberation method significantly improved speech recognition performance compared with traditional CMN with beamforming. In this section, we evaluate the power SS-based dereverberation method for LVCSR and analyze the factors affecting compensation parameter estimation based on power SS (the number of reverberation windows D in Eq. (7), the number of channels, and the length of utterance) using the RWCP database. The word accuracy rate for LVCSR with clean speech was 92.6%.
The effect of the number of reverberation windows on speech recognition is shown in Fig. 6. Detailed results for different numbers of reverberation windows D and reverberant environments (that is, different reverberation times) are given in Table 6. The results in Fig. 6 and Table 6 were obtained without delay-and-sum beamforming. They show that the optimal number of reverberation windows D depends on the reverberation time. The best average result over all reverberant speech was obtained when D equals 6. Speech recognition performance with the number of reverberation windows between 4 and 10 did not vary greatly and was significantly better than the baseline.
We analyzed the influence of the number of channels on parameter estimation and delay-and-sum beamforming. Besides four channels, two and eight channels were also used to estimate the compensation parameter and perform beamforming; the channel numbers shown in Table 7, corresponding to Fig. 3(a), were used. The results are shown in Fig. 7. The speech recognition performance of the SS-based dereverberation method without beamforming was hardly affected by the number of channels; that is, the compensation parameter estimation is robust to the number of channels. Combined with beamforming, the more channels used, the better the speech recognition performance.
Thus far, the whole utterance has been used to estimate the compensation parameter. The effect of the length of utterance used for parameter estimation was investigated, with the results shown in Fig. 8. The longer the utterance used, the better the speech recognition performance; no deterioration was observed once the utterance used for parameter estimation was longer than 1 s. The speech recognition performance of the SS-based dereverberation method exceeded the baseline even when only 0.1 s of the utterance was used to estimate the compensation parameter.
We also compared the power SS-based dereverberation method on LVCSR in different simulated reverberant environments; the experimental results are shown in Fig. 9. Naturally, the speech recognition rate deteriorated as the reverberation time increased. With the SS-based dereverberation method, the reduction in the speech recognition rate was smaller than with conventional CMN, especially for impulse responses with long reverberation times. For the RWCP database, the SS-based dereverberation method achieved a relative word recognition error reduction rate of 19.2% relative to CMN with delay-and-sum beamforming. We also conducted an LVCSR experiment with SS-based dereverberation under different reverberant conditions (CENSREC-4), with reverberation times between 0.25 and 0.75 s and a microphone-to-source distance of 0.5 m; a similar trend to the above results was observed. Therefore, the SS-based dereverberation method is robust to various reverberant conditions for both isolated word recognition and LVCSR, because it compensates for late reverberation through SS using an estimated power spectrum of the impulse response.

(b) Results of GSS-based method in the simulated reverberant environment
In this section, reverberation and noise suppression using only 2 speech channels is described.
In both the power SS-based and GSS-based dereverberation methods, speech signals from two microphones were used to blindly estimate the compensation parameters for the power SS and GSS (that is, the spectra of the channel impulse responses); reverberation was then suppressed by SS, and the spectrum of the dereverberant speech was converted back to the time domain. Finally, delay-and-sum beamforming was performed on the two-channel dereverberant speech.
The results of the power SS-based and GSS-based methods without background noise are compared in Table 8. "Distorted speech #" in Table 8 corresponds to "array no" in Table 1.
The speech recognition performance was drastically degraded under reverberant conditions because the conventional CMN did not suppress the late reverberation. Delay-and-sum beamforming with CMN (41.91%) could not markedly improve performance because of the small number of microphones and the small distance between the microphone pair. In contrast, the power SS-based dereverberation using Eq. (7) markedly improved the speech recognition performance, and the GSS-based dereverberation using Eq. (8) improved it significantly compared with both the power SS-based dereverberation and CMN under all reverberant conditions. The GSS-based method achieved average relative word error reduction rates of 31.4% compared with conventional CMN and 9.8% compared with the power SS-based method.
Table 9 shows the speech recognition results for the power SS and GSS-based denoising and dereverberation methods on the simulated noisy and reverberant speech. "Distorted speech #", "DN", and "DNR" in Table 9 denote the "array #" in Table 1, denoising, and denoising and dereverberation, respectively. The speech recognition performance of conventional CMN was drastically degraded owing to the noisy and reverberant conditions and the fact that CMN does not suppress the late reverberation. The power SS-based DN improved speech recognition performance significantly compared with CMN under all reverberant conditions. The GSS-based DN using Eq. (11), however, did not improve the performance compared with the power SS-based DN. On the other hand, the power SS-based DNR achieved a marked improvement over CMN, and the GSS-based DNR using Eq. (10) improved speech recognition performance significantly compared with both CMN and the power SS-based DNR under almost all reverberant conditions.

(c) Results in the real noisy reverberant environment
Table 10 shows the speech recognition results for the real noisy reverberant speech under the same conditions as the simulated noisy reverberant speech. The word accuracy rate for close-talking speech recorded in the real environment was 88.3%. We investigated the best channel combination in the real environment; the best speech recognition performance was obtained with channels 6, 7, 8, and 9 of Fig. 5, so this combination was used in this study. Power SS-based DN and GSS-based DN achieved smaller improvements in recognition performance than in the simulated noisy reverberant environment because the type of background noise in the real environment differed from that in the simulated environment. On the other hand, the power SS-based DNR markedly improved the speech recognition performance compared with CMN, and the GSS-based DNR improved it significantly compared with both CMN and the power SS-based DNR for almost all speakers. The GSS-based DNR achieved average relative word error reduction rates of 39.1% and 11.5% compared with conventional CMN and the power SS-based DNR, respectively. These results show that our proposed method is also effective in a real environment under the same denoising and dereverberation conditions as the simulated noisy reverberant environment.

Conclusion
In this chapter, we proposed a blind dereverberation method based on spectral subtraction for hands-free speech recognition. We treated the late reverberation as additive noise and applied a noise reduction technique based on spectral subtraction to compensate for it, while the early reverberation was normalized by CMN. The time-domain MCLMS algorithm was extended to blindly estimate the spectrum of the impulse response for spectral subtraction in the frequency domain. We evaluated the proposed methods on an isolated word recognition task and an LVCSR task. The proposed spectral subtraction based on multi-channel LMS significantly outperformed conventional CMN. For the isolated word recognition task, a relative error reduction rate of 24.5% over conventional CMN was achieved. For the LVCSR task without background noise, the proposed method achieved an average relative word error reduction rate of 31.5% over conventional CMN in the simulated reverberant environment. We also presented a denoising and dereverberation method based on spectral subtraction and evaluated it in both simulated and real noisy reverberant environments. The GSS-based method achieved average relative word error reduction rates of 39.1% and 11.5% compared with conventional CMN and the power SS-based method, respectively. These results show that our proposed method is also effective in a real noisy reverberant environment.
In this chapter, we also investigated the factors affecting compensation parameter estimation (the numbers of reverberation windows and channels, and the length of utterance). We reached the following conclusions: 1) speech recognition performance with between 4 and 10 reverberation windows did not vary greatly and was significantly better than the baseline; 2) compensation parameter estimation was robust to the number of channels; and 3) speech recognition did not degrade when the utterance used for parameter estimation was longer than 1 s. We also compared the SS-based dereverberation method on LVCSR in different simulated reverberant environments and observed a similar trend.

Figure 2. Schematic diagram of an SS-based dereverberation and denoising method.

Figure 4. Illustration of the analysis window for spectral subtraction.

Figure 6. Effect of the number of reverberation windows D on power SS-based dereverberation for speech recognition.

Figure 7. Effect of the number of channels on power SS-based dereverberation for speech recognition.

Figure 8. Effect of the length of utterance used for parameter estimation on power SS-based dereverberation for speech recognition.

Figure 9. Word accuracy for LVCSR in different simulated reverberant environments.

Table 1. Details of recording conditions for impulse response measurement [23].

4.1.1. Experimental setup for isolated word recognition task
Multi-channel distorted speech signals, simulated by convolving multi-channel impulse responses with clean speech, were used to create artificial reverberant speech. Six kinds of multi-channel impulse responses measured in various acoustical reverberant environments were selected from the Real World Computing Partnership (RWCP) sound scene database [23]. Table 1 lists the details of the recording conditions (impulse responses with array nos. 3-8 in RWCP).

Table 2. Conditions for recording in the real environment.

Table 3. Conditions for large vocabulary continuous speech recognition.

Table 5. Corresponds to "array no" in Table 1. Delay-and-sum

Table 6. Detailed results for different numbers of reverberation windows D and reverberant environments (%).

Table 7. Channel numbers corresponding to Fig. 3(a) used for dereverberation and denoising (RWCP database).

Table 8. Comparison of word accuracy for LVCSR with the power SS-based and GSS-based methods in the simulated reverberant environment (%).

Table 9. Word accuracy for LVCSR with the simulated noisy reverberant speech (%).

Table 10. Word accuracy for LVCSR with the real noisy reverberant speech (%). Delay-and-sum beamforming was performed for all methods. Average values: 60.9, 61.6, 73.1, 62.9, 76.2.