A Real-Time Speech Enhancement Front-End for Multi-Talker Reverberated Scenarios

In direct human interaction, verbal and nonverbal communication modes play a fundamental role: they cooperate in assigning semantic and pragmatic content to the conveyed message, and in manipulating and interpreting the participants' cognitive and emotional states from the interactional context. Understanding, modelling, analysing, and automating such behaviours requires converging competences from social and cognitive psychology, linguistics, philosophy, and computer science.


Introduction
The exchange of information (more or less conscious) that takes place during interactions builds up new knowledge that often needs to be recalled in order to be re-used, but sometimes it also needs to be appropriately supported as it occurs. Currently, international scientific research is strongly committed to the realization of intelligent instruments able to recognize, process and store relevant interactional signals: the goal is not only to allow efficient use of the data retrospectively, but also to assist and dynamically optimize the experience of interaction itself while it is taking place. To this end, both verbal and nonverbal (gestures, facial expressions, gaze, etc.) communication modes can be exploited. Nevertheless, voice is still a popular choice due to the informative content it carries: words, emotions and dominance can all be detected by means of different kinds of speech processing techniques. Examples of projects exploiting this idea are CHIL (Waibel et al. (2004)), AMI-AMIDA (Renals (2005)) and CALO (Tur et al. (2010)).
The application scenario taken here as reference is a professional meeting, where the system can readily assist the participants and where the participants themselves do not have particular expectations about the forms of support provided by the system. In this scenario, it is assumed that people are sitting around a table, and the system supports and enriches the conversation experience by projecting graphical information and keywords on a screen.
A complete architecture of such a system has been proposed and validated in (Principi et al. (2009); Rocchi et al. (2009)). It consists of three logical layers: Perception, Interpretation and Presentation. The Perception layer aims to achieve situational awareness in the workplace and is composed of two essential elements: the Presence Detector and the Speech Processing Unit. The first determines the operating state of the system: presence (the system checks if there are people around the table) or conversation (the system senses that a conversation is ongoing). The Speech Processing Unit processes the captured audio signals and identifies the keywords that are exploited by the system in order to decide which stimuli to project. It consists of two main components: the multi-channel front-end (speech enhancement) and the automatic speech recognizer (ASR).
The Interpretation module is responsible for the recognition of the ongoing conversation. At this level, semantic representation techniques are adopted in order to structure both the content of the conversation and how the discussion is linked to the speakers present around the table. Closely related to this module is the Presentation one which, based on the conversational analysis just made, dynamically decides which stimuli have to be proposed and sent. The stimuli are classified in terms of conversation topics and, on the basis of their recognition, they are selected and projected on the table.
The focus of this chapter is on the speech enhancement stage of the Speech Processing Unit, and in particular on the set of algorithms constituting the front-end of the ASR. In a typical meeting scenario, participants' voices can be acquired through different types of microphones. Depending on the choice made, the microphone signals are more or less susceptible to the presence of noise, the interference from other co-existing sources and the reverberation produced by multiple acoustic paths. The usage of close-talking microphones can mitigate the aforementioned problems, but they are invasive and the meeting participants can feel uncomfortable in such a situation. A less invasive and more flexible solution is the choice of far-field microphone arrays. In this situation, the extraction of a desired speech signal can be a difficult task, since noise, interference and reverberation are more relevant.
In the literature, several solutions have been proposed in order to alleviate these problems (Naylor & Gaubitch (2010); Woelfel & McDonough (2009)). Here, the attention is on two popular techniques among them, namely blind source separation (BSS) and speech dereverberation. In (Huang et al. (2005)), a two-stage approach leading to sequential source separation and speech dereverberation based on blind channel identification (BCI) is proposed. This can be accomplished by converting the multiple-input multiple-output (MIMO) system into several single-input multiple-output (SIMO) systems free of any interference from the other sources. Since each SIMO system is blindly identified at a different time, the BSS algorithm does not suffer from the annoying permutation ambiguity problem. Finally, if the room impulse responses (RIRs) of the obtained SIMO systems do not share common zeros, dereverberation can be performed by using the Multiple-Input/Output Inverse Theorem (MINT) (Miyoshi & Kaneda (1988)).
A real-time implementation of this approach has been presented in (Rotili et al. (2010)), where the optimum inverse filtering approach is replaced by an iterative technique, which is computationally more efficient and allows the inversion of long RIRs in real-time applications (Rotili et al. (2008)). Iterative inversion is based on the well-known steepest-descent algorithm, where a regularization parameter taking into account the presence of disturbances makes the dereverberation more robust to RIR fluctuations or estimation errors due to the BCI algorithm (Hikichi et al. (2007)).
The major drawback of such an implementation is that the BCI stage needs to know "who speaks when" in order to estimate the RIRs related to the right speaker. To overcome this problem, a solution which exploits a speaker diarization system is proposed in this chapter. Speaker diarization steers the BCI and the ASR, thus allowing the identification task to be accomplished directly on the microphone mixture.
The proposed framework is developed on the NU-Tech platform (Squartini et al. (2005)), a freeware software platform which allows the efficient management of the audio stream by means of the ASIO interface. NU-Tech provides a useful plug-in architecture which has been exploited for the C++ implementation. Experiments performed over synthetic conditions at 16 kHz sampling rate confirm the real-time capabilities of the implemented architecture and its effectiveness as a multi-channel front-end for the subsequent speech recognition engine. The chapter outline is the following. Sec. 2 describes the speech enhancement front-end, aimed at separating and dereverberating the speech sources, whereas Sec. 3 details the ASR engine and its parametrization. Sec. 4 discusses the simulation setup and the experiments performed. Conclusions are drawn in Sec. 5.

Speech enhancement front-end
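Let $M$ be the number of independent speech sources and $N$ the number of microphones. The relationship between them is described by an $M \times N$ MIMO FIR (finite impulse response) system. According to such a model, the $n$-th microphone signal at the $k$-th sample time is:

$$x_n(k) = \sum_{m=1}^{M} \mathbf{h}_{nm}^T \, \mathbf{s}_m(k, L_h), \qquad n = 1, 2, \ldots, N, \tag{1}$$

where $(\cdot)^T$ denotes the transpose operator and

$$\mathbf{s}_m(k, L_h) = [s_m(k) \;\; s_m(k-1) \;\; \cdots \;\; s_m(k-L_h+1)]^T$$

is the $m$-th source. The term

$$\mathbf{h}_{nm} = [h_{nm,0} \;\; h_{nm,1} \;\; \cdots \;\; h_{nm,L_h-1}]^T$$

is the $L_h$-taps RIR between the $n$-th microphone and the $m$-th source. Applying the z-transform, Eq. 1 can be rewritten as:

$$X_n(z) = \sum_{m=1}^{M} H_{nm}(z) \, S_m(z), \qquad n = 1, 2, \ldots, N,$$

where

$$H_{nm}(z) = \sum_{l=0}^{L_h-1} h_{nm,l} \, z^{-l}.$$

The objective is to recover the original clean speech sources $s_m$ by means of a speech dereverberation approach: it is necessary to automatically identify who is speaking, estimate the unknown RIRs accordingly, and then apply a separation and dereverberation process to restore the original speech quality.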
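As a minimal sketch of the MIMO FIR model above, the following Python fragment builds each microphone signal as the sum, over the sources, of the source convolved with the corresponding RIR. The helper names (`convolve`, `fir_mix`) and all numeric values are illustrative assumptions, not taken from the chapter.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def fir_mix(sources, rirs):
    """Return the N microphone signals x_n = sum_m (h_nm * s_m).

    sources: list of M equal-length source sequences s_m
    rirs:    rirs[n][m] is the RIR h_nm (microphone n, source m),
             all RIRs having the same number of taps L_h
    """
    mics = []
    for h_n in rirs:
        x_n = [0.0] * (len(sources[0]) + len(h_n[0]) - 1)
        for s_m, h_nm in zip(sources, h_n):
            for k, value in enumerate(convolve(s_m, h_nm)):
                x_n[k] += value
        mics.append(x_n)
    return mics


# Toy example (assumed values): M = 2 sources, N = 3 microphones, L_h = 2 taps.
s = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
h = [
    [[1.0, 0.5], [0.3, 0.1]],  # h_11, h_12
    [[0.8, 0.2], [1.0, 0.4]],  # h_21, h_22
    [[0.6, 0.3], [0.7, 0.9]],  # h_31, h_32
]
x = fir_mix(s, h)
```

Each entry of `x` is one microphone mixture; with real RIRs (1024 taps in the experiments reported later) an FFT-based convolution would be preferable to this direct form.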
The reference framework proposed in (Huang et al. (2005); Rotili et al. (2010)) consists of three main stages: source separation, speech dereverberation and BCI. First, source separation is accomplished by transforming the original MIMO system into a certain number of SIMO systems; second, the separated (but still reverberated) sources pass through the dereverberation process, yielding the final cleaned-up speech signals. For the two procedures to work properly, it is necessary to estimate the MIMO RIRs of the audio channels between the speech sources and the microphones by means of the BCI stage.
As mentioned in the introductory section, this approach suffers from the inability of the BCI stage to estimate the RIRs without knowledge of the speakers' activities. To overcome this disadvantage, a speaker diarization system can be introduced to steer the BCI stage. The block diagram of the proposed framework is shown in Fig. 1, where $N = 3$ and $M = 2$ have been considered. Speaker Diarization takes as input the central microphone mixture and, for each frame, the output $P_m$ is "1" if the $m$-th source is the only active one, and "0" otherwise. In this way, the front-end is able to detect when to perform the required operation and when not to. Using the information provided by the Speaker Diarization stage, the BCI will estimate the RIRs and the speech recognition engine will perform recognition only if the corresponding source is the only active one.

Blind channel identification
Considering a SIMO system for a specific source $s_{m^*}$, a BCI algorithm aims to find the RIRs vector $\mathbf{h}_{m^*} = [\mathbf{h}_{1m^*}^T \;\; \mathbf{h}_{2m^*}^T \;\; \cdots \;\; \mathbf{h}_{Nm^*}^T]^T$ by using only the microphone signals $x_n(k)$. To ensure this, two identifiability conditions are assumed to be satisfied (Xu et al. (1995)):
1. The polynomials formed from $\mathbf{h}_{nm^*}$ are co-prime, i.e. the room transfer functions (RTFs) $H_{nm^*}(z)$ do not share any common zeros (channel diversity);
2. $C\{s(k)\} \ge 2L_h + 1$, where $C\{s(k)\}$ denotes the linear complexity of the sequence $s(k)$.
The BCI stage relies on the unconstrained normalized multi-channel frequency-domain least mean square (UNMCFLMS) algorithm (Huang & Benesty (2003)). It is an adaptive technique well suited to the real-time constraints imposed by the case study, since it offers a good compromise among fast convergence, adaptivity, and low computational complexity.
Here, we briefly review the UNMCFLMS in order to explain why it was chosen for the proposed front-end. Refer to (Huang & Benesty (2003)) for details. The derivation of the UNMCFLMS is based on the cross-relation criterion (Xu et al. (1995)) using the overlap-save technique (Oppenheim et al. (1999)).
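The frequency-domain cost function for the $q$-th frame is defined as

$$J_f(q) = \sum_{n=1}^{N-1} \sum_{i=n+1}^{N} \mathbf{e}_{ni}^H(q) \, \mathbf{e}_{ni}(q),$$

where $\mathbf{e}_{ni}(q)$ is the frequency-domain block error signal between the $n$-th and $i$-th channels and $(\cdot)^H$ denotes the Hermitian transpose operator. The update equation of the UNMCFLMS is expressed as

$$\underline{\hat{\mathbf{h}}}_{nm^*}(q+1) = \underline{\hat{\mathbf{h}}}_{nm^*}(q) - \rho \left[ \mathbf{P}_{nm^*}(q) + \delta \mathbf{I}_{2L_h \times 2L_h} \right]^{-1} \sum_{i=1}^{N} \mathbf{D}_{x_i}^H(q) \, \mathbf{e}_{in}(q), \qquad n = 1, \ldots, N,$$

where $0 < \rho < 2$ is the step-size, $\delta$ is a small positive number and

$$\underline{\hat{\mathbf{h}}}_{nm^*}(q) = \mathbf{F}_{2L_h \times 2L_h} \left[ \hat{\mathbf{h}}_{nm^*}^T(q) \;\; \mathbf{0}_{1 \times L_h} \right]^T, \qquad \mathbf{P}_{nm^*}(q) = \sum_{i=1,\, i \neq n}^{N} \mathbf{D}_{x_i}^H(q) \, \mathbf{D}_{x_i}(q),$$

while $\mathbf{F}$ denotes the discrete Fourier transform (DFT) matrix. The frequency-domain error function $\mathbf{e}_{ni}(q)$ is given by

$$\mathbf{e}_{ni}(q) = \mathbf{D}_{x_n}(q) \, \underline{\hat{\mathbf{h}}}_{im^*}(q) - \mathbf{D}_{x_i}(q) \, \underline{\hat{\mathbf{h}}}_{nm^*}(q),$$

where the diagonal matrix

$$\mathbf{D}_{x_n}(q) = \mathrm{diag}\left( \mathbf{F} \, [x_n(qL_h - L_h) \;\; x_n(qL_h - L_h + 1) \;\; \cdots \;\; x_n(qL_h + L_h - 1)]^T \right)$$

is the DFT of the $q$-th frame input signal block for the $n$-th channel. From a computational point of view, the UNMCFLMS algorithm ensures an efficient execution of the circular convolution by means of the fast Fourier transform (FFT). In addition, it can be easily implemented in a real-time application, since the normalization matrix $\mathbf{P}_{nm^*}(q) + \delta \mathbf{I}_{2L_h \times 2L_h}$ is diagonal and its inverse is straightforward to compute.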
Though the UNMCFLMS allows the estimation of long RIRs, it requires a high input signal-to-noise ratio. In this work, the presence of noise has not been taken into account, and therefore the UNMCFLMS remains an appropriate choice. Different solutions have been proposed in the literature in order to alleviate the misconvergence problem of the UNMCFLMS in the presence of noise. Among them, the algorithms presented in (Haque et al. (2007); Haque & Hasan (2008); Yu & Er (2004)) guarantee a significant robustness against noise and could be used to improve our front-end.
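The cross-relation criterion on which the UNMCFLMS is built can be illustrated with a small noise-free example: in a SIMO system, convolving the signal of channel $n$ with the RIR of channel $i$ gives the same result as convolving the signal of channel $i$ with the RIR of channel $n$, because both equal the source convolved with the two RIRs. The sketch below only checks this identity in the time domain on toy data; the helper names and values are assumptions, and it does not implement the adaptive frequency-domain algorithm itself.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def cross_relation_error(x_n, x_i, h_n, h_i):
    """Time-domain cross-relation error e_ni = x_n * h_i - x_i * h_n."""
    lhs = convolve(x_n, h_i)
    rhs = convolve(x_i, h_n)
    return [p - q for p, q in zip(lhs, rhs)]


# Toy single-source, two-channel system (assumed values).
s = [1.0, -2.0, 0.5, 3.0]
h1 = [1.0, 0.5, 0.25]
h2 = [0.7, -0.3, 0.1]
x1 = convolve(s, h1)
x2 = convolve(s, h2)

# With the true RIRs the error vanishes; with a perturbed estimate it does not.
err_true = cross_relation_error(x1, x2, h1, h2)
err_wrong = cross_relation_error(x1, x2, h1, [0.7, -0.3, 0.2])
energy_true = sum(e * e for e in err_true)
energy_wrong = sum(e * e for e in err_wrong)
```

An adaptive BCI algorithm drives the energy of this error towards zero over all channel pairs, which is why it needs frames in which a single source is active.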


Source separation
Here we briefly review the procedure already described in (Huang et al. (2005)), according to which it is possible to transform an $M \times N$ MIMO system (with $M < N$) into $M$ $1 \times N$ SIMO systems free of interference, as described by the following relation:

$$Y_{s_m,p}(z) = F_{s_m,p}(z) \, S_m(z) + B_{s_m,p}(z), \qquad m = 1, \ldots, M, \quad p = 1, \ldots, P, \tag{11}$$

where $P = C_N^M$ is the number of combinations. It must be noted that the SIMO system outputs are reverberated, likely more than the microphone signals, due to the long impulse responses of the equivalent channels $F_{s_m,p}(z)$. The related formulas and a detailed description of the algorithm can be found in (Huang et al. (2005)). Different choices can be made in order to calculate the equivalent SIMO system. The block scheme of Fig. 2, representing the MIMO-SIMO conversion, depicts a possible solution for $M = 2$ and $N = 3$. With this choice the first SIMO system, corresponding to the source $s_1$, is

$$F_{s_1,1}(z) = H_{32}(z) H_{21}(z) - H_{22}(z) H_{31}(z),$$
$$F_{s_1,2}(z) = H_{32}(z) H_{11}(z) - H_{12}(z) H_{31}(z),$$
$$F_{s_1,3}(z) = H_{22}(z) H_{11}(z) - H_{12}(z) H_{21}(z).$$

The second SIMO system, corresponding to the source $s_2$, can be found in a similar way, which results in $F_{s_1,p}(z) = F_{s_2,p}(z)$ with $p = 1, 2, 3$. As stated in the previous section, the presence of additive noise is not taken into account in this contribution, and thus all the terms $B_{s_m,p}(z)$ of Eq. 11 are equal to zero. Finally, it is important to highlight that this separation algorithm achieves a lower computational complexity than traditional independent component analysis techniques and, since the MIMO system is decomposed into a number of SIMO systems which are blindly identified at different times, the permutation ambiguity problem is avoided.
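The MIMO-SIMO conversion for $M = 2$ and $N = 3$ can be sketched as follows. The example assumes that the RIRs are known (in the front-end they are supplied by the BCI stage) and uses one possible pairing of microphones 2 and 3 to cancel $s_2$; the dictionary layout, helper names and numeric values are illustrative assumptions.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def sub(a, b):
    return [p - q for p, q in zip(a, b)]


# Toy RIRs h[(n, m)]: microphone n, source m (assumed values, L_h = 2).
h = {
    (2, 1): [0.8, 0.2], (2, 2): [1.0, 0.4],
    (3, 1): [0.6, 0.3], (3, 2): [0.7, 0.9],
}
s1 = [1.0, 0.5, -1.0]
s2 = [0.2, -0.7, 0.4]

# Microphone mixtures x_n = h_n1 * s_1 + h_n2 * s_2.
x2 = add(convolve(h[(2, 1)], s1), convolve(h[(2, 2)], s2))
x3 = add(convolve(h[(3, 1)], s1), convolve(h[(3, 2)], s2))

# Cancel s_2: y = h_32 * x_2 - h_22 * x_3.
y = sub(convolve(h[(3, 2)], x2), convolve(h[(2, 2)], x3))

# Equivalent channel seen by s_1: f = h_32 * h_21 - h_22 * h_31.
f = sub(convolve(h[(3, 2)], h[(2, 1)]), convolve(h[(2, 2)], h[(3, 1)]))
expected = convolve(f, s1)
max_dev = max(abs(p - q) for p, q in zip(y, expected))
```

The output `y` contains only $s_1$, filtered by an equivalent channel of $2L_h - 1$ taps; this longer response is the reason the separated signals are still reverberated.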


Speech dereverberation
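Given the equivalent SIMO system $F_{s_{m^*},p}(z)$ related to the specific source $s_{m^*}$, a set of inverse filters $G_{s_{m^*},p}(z)$ can be found by using the MINT theorem such that

$$\sum_{p=1}^{P} F_{s_{m^*},p}(z) \, G_{s_{m^*},p}(z) = 1,$$

assuming that the polynomials $F_{s_{m^*},p}(z)$ have no common zeros. In the time domain, the inverse filter vector, denoted as $\mathbf{g}_{s_{m^*}}$, is calculated by minimizing the following cost function:

$$C = \left\| \mathbf{F}_{s_{m^*}} \mathbf{g}_{s_{m^*}} - \mathbf{v} \right\|^2, \tag{14}$$

where $\|\cdot\|$ denotes the $l_2$-norm operator and

$$\mathbf{g}_{s_{m^*}} = \left[ \mathbf{g}_{s_{m^*},1}^T \;\; \cdots \;\; \mathbf{g}_{s_{m^*},P}^T \right]^T, \qquad \mathbf{g}_{s_{m^*},p} = \left[ g_{s_{m^*},p}(0) \;\; \cdots \;\; g_{s_{m^*},p}(L_g - 1) \right]^T,$$

with $p = 1, 2, \ldots, P$.
The vector $\mathbf{v}$ is the target vector, i.e. the Kronecker delta shifted by an appropriate modeling delay $d \ge 0$, while $\mathbf{F}_{s_{m^*}} = [\mathbf{F}_{s_{m^*},1} \;\; \cdots \;\; \mathbf{F}_{s_{m^*},P}]$, where $\mathbf{F}_{s_{m^*},p}$ is the convolution matrix of the equivalent FIR filter $\mathbf{f}_{s_{m^*},p}$ of length $L_f$. When the matrix $\mathbf{F}_{s_{m^*}}$ is obtained as shown in the previous section, the inverse filter set can be calculated as

$$\mathbf{g}_{s_{m^*}} = \mathbf{F}_{s_{m^*}}^{\dagger} \mathbf{v},$$

where $(\cdot)^{\dagger}$ denotes the Moore-Penrose pseudoinverse. In order to have a unique solution, $L_g$ must be chosen in such a way that $\mathbf{F}_{s_{m^*}}$ is square, i.e.

$$L_g = \frac{L_f - 1}{P - 1}.$$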
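A minimal NumPy sketch of MINT inverse filtering is given below for two toy channels without common zeros. The function names, the channel coefficients and the zero modeling delay are assumptions chosen for illustration; with $L_f = 3$ and $P = 2$ the matrix becomes square for $L_g = 2$.

```python
import numpy as np


def convolution_matrix(f, L_g):
    """(L_f + L_g - 1) x L_g matrix C such that C @ g == np.convolve(f, g)."""
    L_f = len(f)
    C = np.zeros((L_f + L_g - 1, L_g))
    for j in range(L_g):
        C[j:j + L_f, j] = f
    return C


def mint_inverse(filters, L_g, delay=0):
    """Least-squares inverse filters g = pinv(F) v for a delayed-delta target."""
    F = np.hstack([convolution_matrix(f, L_g) for f in filters])
    v = np.zeros(F.shape[0])
    v[delay] = 1.0
    g = np.linalg.pinv(F) @ v
    return np.split(g, len(filters)), F, v


# Two toy equivalent channels (assumed values) that share no zeros.
f1 = np.array([1.0, 0.5, 0.2])
f2 = np.array([1.0, -0.4, 0.3])
P = 2
L_g = (len(f1) - 1) // (P - 1)  # makes F square: 4 x 4

g_list, F, v = mint_inverse([f1, f2], L_g)

# Sum over channels of f_p * g_p should reproduce the target delta.
equalized = sum(np.convolve(f, g) for f, g in zip([f1, f2], g_list))
```

Here `equalized` matches `v` up to numerical precision, i.e. the multichannel inverse is exact even though neither channel is individually invertible by a 2-tap FIR filter.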
Considering the presence of disturbances, i.e. additive noise or RTF fluctuations, the cost function in Eq. 14 is modified as follows (Hikichi et al. (2007)):

$$C = \left\| \mathbf{F}_{s_{m^*}} \mathbf{g}_{s_{m^*}} - \mathbf{v} \right\|^2 + \gamma \left\| \mathbf{g}_{s_{m^*}} \right\|^2, \tag{20}$$

where the parameter $\gamma$ ($\ge 0$), called the regularization parameter, is a scalar coefficient representing the weight assigned to the disturbance term. It should be noticed that Eq. 20 has the same form as Tikhonov regularization for ill-posed problems (Egger & Engl (2005)).
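Let the RTF for the fluctuation case be given by the sum of two terms, the mean RTF ($\bar{\mathbf{F}}_{s_{m^*}}$) and the fluctuation from the mean RTF ($\tilde{\mathbf{F}}_{s_{m^*}}$), and let $E\langle \tilde{\mathbf{F}}_{s_{m^*}}^T \tilde{\mathbf{F}}_{s_{m^*}} \rangle = \gamma \mathbf{I}$. In this case a general cost function, embedding the noise and fluctuation cases, can be derived:

$$C = \mathbf{g}_{s_{m^*}}^T \mathbf{F}^T \mathbf{F} \mathbf{g}_{s_{m^*}} - \mathbf{g}_{s_{m^*}}^T \mathbf{F}^T \mathbf{v} - \mathbf{v}^T \mathbf{F} \mathbf{g}_{s_{m^*}} + \mathbf{v}^T \mathbf{v} + \gamma \, \mathbf{g}_{s_{m^*}}^T \mathbf{g}_{s_{m^*}}, \tag{21}$$

where $\mathbf{F} = \mathbf{F}_{s_{m^*}}$ in the noise case and $\mathbf{F} = \bar{\mathbf{F}}_{s_{m^*}}$ in the fluctuation case. The filter that minimizes the cost function in Eq. 21 is obtained by taking derivatives with respect to $\mathbf{g}_{s_{m^*}}$ and setting them equal to zero. The required solution is

$$\mathbf{g}_{s_{m^*}} = \left( \mathbf{F}^T \mathbf{F} + \gamma \mathbf{I} \right)^{-1} \mathbf{F}^T \mathbf{v}. \tag{23}$$

The usage of Eq. 23 to calculate the inverse filters requires a matrix inversion that, in the case of long RIRs, can result in a high computational burden. Instead, an adaptive algorithm (Rotili et al. (2008)) has been adopted here to satisfy the real-time constraint. It is based on the steepest-descent technique, whose recursive estimator has the form

$$\mathbf{g}_{s_{m^*}}(q+1) = \mathbf{g}_{s_{m^*}}(q) - \frac{\mu(q)}{2} \nabla C. \tag{24}$$

Moving from Eq. 21, through simple algebraic calculations, the following expression is obtained:

$$\nabla C = -2 \left[ \mathbf{F}^T \left( \mathbf{v} - \mathbf{F} \mathbf{g}_{s_{m^*}}(q) \right) - \gamma \, \mathbf{g}_{s_{m^*}}(q) \right]. \tag{25}$$

Substituting Eq. 25 into Eq. 24 gives

$$\mathbf{g}_{s_{m^*}}(q+1) = \mathbf{g}_{s_{m^*}}(q) + \mu(q) \left[ \mathbf{F}^T \left( \mathbf{v} - \mathbf{F} \mathbf{g}_{s_{m^*}}(q) \right) - \gamma \, \mathbf{g}_{s_{m^*}}(q) \right],$$

where $\mu(q)$ is the step-size. The convergence of the algorithm to the optimal solution is guaranteed if the usual conditions on the step-size, in terms of the eigenvalues of the autocorrelation matrix $\mathbf{F}^T \mathbf{F}$, hold. However, reaching the optimum can be slow if a fixed step-size value is chosen. The convergence speed of the algorithm can be increased by following the approach in (Guillaume et al. (2005)), where the step-size is chosen in order to minimize the cost function at the next iteration. The analytical expression obtained for the step-size is the following:

$$\mu(q) = \frac{\mathbf{e}^T(q) \, \mathbf{e}(q)}{\mathbf{e}^T(q) \left( \mathbf{F}^T \mathbf{F} + \gamma \mathbf{I} \right) \mathbf{e}(q)},$$

where $\mathbf{e}(q) = \mathbf{F}^T [\mathbf{v} - \mathbf{F} \mathbf{g}_{s_{m^*}}(q)] - \gamma \, \mathbf{g}_{s_{m^*}}(q)$.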
The algorithm illustrated above offers several advantages: the regularization parameter, which takes into account the presence of disturbances, makes the dereverberation process more robust to estimation errors due to the BCI algorithm (Hikichi et al. (2007)); the real-time constraint can be met also in the case of long RIRs, since no matrix inversion is required. Finally, the complexity of the algorithm has been decreased by computing the required operations in the frequency domain by means of FFTs.
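The iterative regularized inversion can be sketched in NumPy as follows. This is a time-domain illustration on a small toy matrix, assuming a dense matrix product instead of the FFT-based implementation used in the front-end; the function name, the matrix entries and the value of the regularization parameter are assumptions. The closed-form regularized solution is computed only to check the iterative result.

```python
import numpy as np


def regularized_inverse_iterative(F, v, gamma, n_iter=2000):
    """Steepest descent on ||F g - v||^2 + gamma ||g||^2.

    The step-size is recomputed at every iteration so that the cost is
    minimized along the current descent direction.
    """
    g = np.zeros(F.shape[1])
    R = F.T @ F + gamma * np.eye(F.shape[1])
    for _ in range(n_iter):
        e = F.T @ (v - F @ g) - gamma * g  # minus one half of the gradient
        denom = e @ R @ e
        if denom <= 1e-30:                 # already at the minimum
            break
        mu = (e @ e) / denom               # optimal step-size
        g = g + mu * e
    return g


# Toy two-channel convolution matrix (assumed values) and delta target.
F = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.5, 1.0, -0.4, 1.0],
    [0.2, 0.5, 0.3, -0.4],
    [0.0, 0.2, 0.0, 0.3],
])
v = np.array([1.0, 0.0, 0.0, 0.0])
gamma = 0.05

g_iter = regularized_inverse_iterative(F, v, gamma)
g_closed = np.linalg.solve(F.T @ F + gamma * np.eye(4), F.T @ v)
```

Each iteration needs only matrix-vector products, which is what makes the approach attractive for long RIRs, where the products can be replaced by fast convolutions.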


Speaker diarization
The speaker diarization stage drives the BCI and the ASRs so that they can operate in speaker-homogeneous regions. Current state-of-the-art speaker diarization systems are based on clustering approaches, usually combining hidden Markov models (HMMs) and the Bayesian information criterion metric (Fredouille et al. (2009); Wooters & Huijbregts (2008)). Despite their state-of-the-art performance, such systems have the drawback of operating on the entire signal, which makes them unsuitable for the online operation required by the proposed framework.
The approach taken here as reference has been proposed in (Vinyals & Friedland (2008)), and its block scheme for $M = 2$ and $N = 3$ is shown in Fig. 3.
Fig. 3. The speaker diarization block scheme: "SPK1" and "SPK2" are the speaker identity labels assigned to each chunk.
In the recognition phase, the first operation consists of voice activity detection, in order to remove the silence periods: frames are tagged as silence or not based on the bi-Gaussian model, using a maximum likelihood criterion.
After the voice activity detection, the signals are divided into non-overlapping chunks, and the same feature extraction pipeline used in the training phase extracts the feature vectors. The decision is then taken using a majority vote on the likelihoods: every feature vector in the current segment is assigned to one of the known speakers' models based on the maximum likelihood criterion. The model which has the majority of vectors assigned determines the speaker identity for the current segment. The Demultiplexer block associates each speaker label with a distinct output and sets it to "1" if the speaker is the only active one, and to "0" otherwise.
It is worth pointing out that the speaker diarization algorithm is not able to detect overlapped speech, and an oracle overlap detector is used to overcome this limitation.
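The majority-vote decision can be sketched as follows. For brevity each speaker is modelled here by a single diagonal-covariance Gaussian on two-dimensional toy features, which is a simplifying assumption: the system described in this chapter uses Gaussian mixtures on MFCC-based features. All names and values are illustrative.

```python
import math


def log_likelihood(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def classify_chunk(frames, models):
    """Assign every frame to its maximum-likelihood model, then majority-vote."""
    votes = {name: 0 for name in models}
    for x in frames:
        best = max(models, key=lambda name: log_likelihood(x, *models[name]))
        votes[best] += 1
    return max(votes, key=votes.get), votes


# Hypothetical speaker models: (mean vector, variance vector).
models = {
    "SPK1": ([0.0, 0.0], [1.0, 1.0]),
    "SPK2": ([3.0, 3.0], [1.0, 1.0]),
}
# Feature vectors of one chunk (toy values).
chunk = [[0.2, -0.1], [2.9, 3.2], [0.5, 0.4], [-0.3, 0.1], [2.0, 2.5]]

label, votes = classify_chunk(chunk, models)
```

In this toy chunk three of the five frames are closer to the first model, so the whole chunk is labelled "SPK1" even though two frames voted for "SPK2".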


Speech enhancement front-end operation
The proposed front-end requires an initial training phase in which each speaker is asked to talk for 60 s. During this period, the speaker diarization stage trains both the VAD and the speakers' models.
In the testing phase, the input signal is divided into non-overlapping chunks of 2 s, and the speaker diarization stage provides as output the speakers' activity $P_m$. This information is employed both in the BCI stage and in the ASR engines: only when the $m$-th source is the only active one are the related RIRs updated and the dereverberated speech recognized. In all the other situations the BCI stage provides as output the RIRs estimated at the previous step, while the ASRs are idle.
The Separation stage takes as input the microphone signals and outputs the interference-free signals, which are subsequently processed by the Dereverberation stage. Both stages perform their operations using the RIRs vector provided by the BCI stage.
The front-end performance is strictly related to the speaker diarization errors. In particular, the BCI stage is sensitive to false alarms (speaker in hypothesis but not in reference) and speaker errors (mapped reference is not the same as hypothesis speaker). If one of these occurs, the BCI adapts the RIRs using an inappropriate input frame and provides an incorrect estimate as output. An additional error which produces the same behaviour is a missed speaker overlap detection.
The sensitivity to false alarms and speaker errors could be reduced by imposing a constraint on the estimation procedure and updating the RIRs only when a decrease in the cost function occurs. A solution to the missed overlap error would be to add an overlap detector and not to perform the estimation if more than one speaker is simultaneously active. On the other hand, missed speaker errors (speaker in reference but not in hypothesis) do not negatively affect the RIR estimation procedure, since the BCI stage does not perform the adaptation in such frames. Only a reduced convergence rate can be noticed in this case.
The real-time capabilities of the proposed front-end have been evaluated by calculating the real-time factor on an Intel® Core™ i7 machine running at 3 GHz with 4 GB of RAM. The obtained value for the speaker diarization stage is 0.03, meaning that a new result is output every 2.06 s. The real-time factor for the other stages is 0.04, resulting in a total value of 0.07 for the entire front-end.
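The figures above can be reproduced with a few lines of arithmetic, assuming the usual definition of the real-time factor as processing time divided by the duration of the processed audio (the chapter does not spell the definition out). The function and constant names are illustrative.

```python
def output_period(chunk_s, rtf):
    """Seconds between outputs when a chunk of chunk_s seconds must be
    fully acquired before being processed at the given real-time factor."""
    return chunk_s * (1.0 + rtf)


CHUNK_S = 2.0            # chunk length used by the speaker diarization stage
RTF_DIARIZATION = 0.03   # reported for the speaker diarization stage
RTF_OTHER_STAGES = 0.04  # reported for the remaining front-end stages

period = output_period(CHUNK_S, RTF_DIARIZATION)   # 2 s + 0.03 * 2 s
total_rtf = RTF_DIARIZATION + RTF_OTHER_STAGES
runs_in_real_time = total_rtf < 1.0
```

A total real-time factor below 1 means that the front-end processes audio faster than it arrives, leaving headroom for the ASR engines.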

ASR engine
Automatic speech recognition has been performed by means of the Hidden Markov Model Toolkit (HTK) (Young et al. (2006)) using HDecode, which has been specifically designed for large vocabulary speech recognition tasks. Features have been extracted through the HCopy tool, and are composed of 13 MFCCs, deltas and double deltas, resulting in a 39-dimensional feature vector. Cepstral mean normalization is included in the feature extraction pipeline. Recognition has been performed based on the acoustic models available in (Vertanen (2006)).
The models differ with respect to the amount of training data, the use of word-internal or cross-word triphones, the number of tied states, the number of Gaussians per state, and the initialization strategy. The main focus of this work is to achieve real-time execution of the complete framework, thus an acoustic model able to obtain adequate accuracy while running in real time was required. The computational cost strongly depends on the number of Gaussians per state, and in (Vertanen (2006)) it has been shown that real-time execution can be obtained using 16 Gaussians per state. The main parameters of the selected acoustic model are summarized in Table 1. The language model consists of the 5k-word bigram model included in the Wall Street Journal (WSJ) corpus. Recognizer parameters are the same as in (Vertanen (2006)): using such values, the word accuracy obtained on the November '92 test set is 94.30% with a real-time factor of 0.33 on the same hardware platform mentioned above. It is worth pointing out that the ASR engine and the front-end can jointly operate in real-time.

Corpus description
The acoustic scenario under study is made of an array of three microphones and two speech sources located in a small office. The room arrangement is depicted in Fig. 4. A suitable database representing the described scenario has been artificially created from the clean WSJ November '92 evaluation data using the following procedure: the 330 clean sentences are first reduced to 320 in order to have the same number of sentences for each speaker. These are then convolved with RIRs generated using the RIR Generator tool (Habets (2008)). No background noise has been added. Two different reverberation conditions have been taken into account: the low and the high reverberant ones, corresponding to $T_{60}$ = 120 ms and $T_{60}$ = 240 ms respectively (with RIRs 1024 taps long).
For each channel, the final overlapped and reverberated sentences have been obtained by coupling the sentences of two speakers. Following the WSJ November '92 notation, speaker 440 has been paired with 441, 442 with 443, etc. This choice makes it possible to cover all the combinations of male and female speakers, resulting in 40 sentences per couple of speakers.
The mean value of overlap has been fixed to 15% of the speech frames for the overall dataset. For each sentence the amount of overlap is obtained as a random value drawn from the uniform distribution on the interval [12, 18] (in percent of the speech frames). This assumption allows the artificial database to reflect the frequency of overlapped speech in real-life scenarios such as two-party telephone conversations or meetings (Shriberg et al. (2000)).

Front-end evaluation
As stated in Sec. 2, the proposed speech enhancement front-end consists of four different stages. Here we focus on the evaluation of the Speaker Diarization and BCI stages, which represent the most crucial parts of the entire system. An extensive evaluation of the Separation and Dereverberation stages can be found in (Huang et al. (2005)) and (Rotili et al. (2008)) respectively.
The performance of the speaker diarization algorithm is measured by the diarization error rate (DER). DER is defined by the following expression:

$$\mathrm{DER} = \frac{\sum_{s=1}^{S} \mathrm{dur}(s) \left( \max\left( N_{ref}(s), N_{hyp}(s) \right) - N_{correct}(s) \right)}{\sum_{s=1}^{S} \mathrm{dur}(s) \, N_{ref}(s)},$$

where $\mathrm{dur}(s)$ is the duration of the segment $s$, $S$ is the total number of segments in which no speaker change occurs, $N_{ref}(s)$ and $N_{hyp}(s)$ indicate respectively the number of speakers in the reference and in the hypothesis, and $N_{correct}(s)$ indicates the number of speakers that speak in the segment $s$ and have been correctly matched between the reference and the hypothesis. As recommended by the National Institute of Standards and Technology (NIST), evaluation has been performed by means of the "md-eval" tool with a collar of 0.25 s around each segment to take into account timing errors in the reference. The same metric and tool are used to evaluate the VAD performance.
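The DER expression can be evaluated with a few lines of Python once the signal has been split into segments without speaker changes. The sketch below is not a replacement for the "md-eval" tool: it ignores the collar and assumes that the per-segment speaker counts are already available; the function name and the toy segments are illustrative.

```python
def diarization_error_rate(segments):
    """DER from per-segment counts.

    segments: iterable of (dur, n_ref, n_hyp, n_correct) tuples, one per
              segment in which no speaker change occurs.
    """
    errors = sum(d * (max(n_ref, n_hyp) - n_ok)
                 for d, n_ref, n_hyp, n_ok in segments)
    reference_time = sum(d * n_ref for d, n_ref, _, _ in segments)
    return errors / reference_time


# Toy segments: (duration in s, N_ref, N_hyp, N_correct).
segments = [
    (4.0, 1, 1, 1),  # correctly labelled speech
    (1.0, 1, 1, 0),  # speaker error
    (2.0, 2, 1, 1),  # overlap with one speaker missed
    (1.0, 0, 1, 0),  # false alarm
    (2.0, 1, 0, 0),  # missed speaker
]
der = diarization_error_rate(segments)
```

In this toy case the error time is 6 s against 11 s of reference speaker time, so `der` equals 6/11.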
The performance of the VAD alone is reported in Table 2. Table 3 shows the results obtained by testing the speaker diarization algorithm on the clean signals, as well as on the two reverberated scenarios in the previously illustrated configurations. For the sake of comparison, two different configurations have been considered:
• REAL-SD w/ ORACLE-VAD: The speaker diarization system uses an "Oracle" VAD;
• REAL-SD w/ REAL-VAD: The system described in Sec. 2.4.
The performance across the three scenarios is similar due to the matching of the training and testing conditions, and is consistent with (Vinyals & Friedland (2008)).
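Table 2. VAD performance (DER) on the clean and reverberated signals.

            Clean    T60 = 120 ms    T60 = 240 ms
REAL-VAD    1.85     1.96            1.68

The BCI stage is evaluated by means of the normalized projection misalignment (NPM):

$$\mathrm{NPM}(q) = 20 \log_{10} \left( \frac{\left\| \boldsymbol{\epsilon}(q) \right\|}{\left\| \mathbf{h} \right\|} \right),$$

where

$$\boldsymbol{\epsilon}(q) = \mathbf{h} - \frac{\mathbf{h}^T \hat{\mathbf{h}}(q)}{\hat{\mathbf{h}}^T(q) \, \hat{\mathbf{h}}(q)} \, \hat{\mathbf{h}}(q)$$

is the projection misalignment vector, $\mathbf{h}$ is the real RIR vector whereas $\hat{\mathbf{h}}(q)$ is the estimated one at the $q$-th iteration, i.e. the frame index. To understand how the performance of the Speaker Diarization stage affects the RIR identification, we compare the NPM curves obtained for the ORACLE-SD case, where the speaker diarization operates in an "Oracle" fashion, i.e. at 100% of its possibilities, and for the REAL-SD case. As expected, the REAL-SD NPM is always above the ORACLE-SD NPM. Parts where the curves are flat indicate speech segments in which source $s_1$ is not the only active source, i.e. it overlaps with $s_2$ or there is silence.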
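The normalized projection misalignment is straightforward to compute from the true and the estimated RIR vectors. The sketch below is a direct, pure-Python transcription under the assumption that both vectors have the same length; the function name and the test vectors are illustrative.

```python
import math


def npm_db(h_true, h_est):
    """Normalized projection misalignment in dB.

    The estimate is first projected onto the true RIR vector, so that an
    arbitrary scaling of h_est (inherent to blind identification) is not
    counted as an error.
    """
    dot = sum(a * b for a, b in zip(h_true, h_est))
    est_energy = sum(b * b for b in h_est)
    scale = dot / est_energy
    eps = [a - scale * b for a, b in zip(h_true, h_est)]
    eps_norm = math.sqrt(sum(e * e for e in eps))
    if eps_norm == 0.0:
        return float("-inf")  # perfect estimate up to a scale factor
    return 20.0 * math.log10(eps_norm / math.sqrt(sum(a * a for a in h_true)))


h = [1.0, 0.5, 0.25]
npm_scaled = npm_db(h, [2.0 * a for a in h])   # same shape, different gain
npm_partial = npm_db([1.0, 1.0], [1.0, 0.0])   # only one tap recovered
```

A scaled copy of the true response gives an NPM of minus infinity, while recovering only one of two equal taps gives about -3 dB, which is why lower curves indicate better identification.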

Full system evaluation
In this section the objective is to evaluate the recognition capabilities of the ASR engine fed by the speech signals coming from the multi-channel DSP front-end; therefore, the performance metric employed is the word recognition accuracy.
The word recognition accuracy obtained assuming ideal source separation and dereverberation is 93.60%. This situation will be denoted as "Reference" in the remainder of the section.
Four different setups have been addressed:
• Unprocessed: The recognition is performed on the reverberant speech mixture acquired from Mic 2 (see Fig. 4);
• ASR w/o SD: The ASRs do not exploit the speaker diarization output;
• ASR w/ ORACLE-SD: The ASRs exploit the "Oracle" speaker diarization output;
• ASR w/ REAL-SD: The ASRs exploit the "Real" speaker diarization output.
Fig. 6 reports the word accuracy for both the low and high reverberant conditions when the complete test file is processed by the multi-channel DSP front-end and recognition is performed on the separated and dereverberated streams (Overall) for all three setups. Fig. 7 shows the word accuracy values attained when the recognition is performed starting from the first silence frame after the BCI and Dereverberation stages converge (Convergence; see footnote 3).
Observing the results of Fig. 6, it can be immediately stated that feeding the ASR engine with unprocessed audio files leads to very poor performance. The missing source separation and the related wrong matching between the speaker and the corresponding word transcriptions result in a significant number of insertions, which explains the occurrence of negative word accuracy values.
Conversely, when the audio streams are processed, the ASRs are able to recognize most of the spoken words, especially once the front-end algorithms have reached convergence. The usage of speaker diarization information to drive the ASRs' activity significantly increases the performance. As expected, the usage of the "Real" speaker diarization instead of an "Oracle" one leads to a decrease in performance of about 15% for the low reverberant condition and of about 10% for the high reverberant condition. Despite this, the word accuracy is still higher than the one obtained without speaker diarization, providing an average increase of about 20% for both reverberation times.
In the Convergence evaluation case study, when $T_{60}$ = 120 ms and the "Oracle" speaker diarization is employed, a word accuracy of 86.49% is obtained, which is about 7% less than the result attainable in the "Reference" conditions. In this case, the usage of the "Real" speaker diarization leads to a decrease of only 8%. As expected, the reverberation effect has a negative impact on the recognition performance, especially in the presence of high reverberation, i.e. $T_{60}$ = 240 ms. However, it must be observed that the convergence margin is even more significant w.r.t. the low reverberant scenario, further highlighting the effectiveness of the proposed algorithmic framework as a multi-channel front-end.
Footnote 3: Additional experiments have demonstrated that convergence is reached after 20-25 s of speech activity.

Conclusion
In this paper, an ASR system was successfully enhanced by an advanced multi-channel front-end to recognize the speech content coming from multiple speakers in reverberated acoustic conditions. The overall architecture is able to blindly identify the impulse responses, to separate the existing multiple overlapping sources, to dereverberate them and to recognize the information contained within the original utterances. A speaker diarization system able to steer the BCI stage and the ASRs has also been included in the overall framework. All the algorithms work in real-time and a PC-based implementation of them has been discussed in this contribution. The simulations performed, based on an existing large vocabulary database (WSJ) and suitably addressing the acoustic scenario under test, have shown the effectiveness of the developed system, making it appealing for real-life human-machine interaction scenarios. As future work, an overlap detector will be integrated in the speaker diarization system and its impact in terms of final recognition accuracy will be evaluated. In addition, applications other than ASR, such as emotion recognition (Schuller et al. (2011)), dominance detection (Hung et al. (2011)) or keyword spotting (Wöllmer et al. (2011)), will be considered in order to assess the effectiveness of the front-end in other recognition tasks.
Fig. 1. Block diagram of the proposed framework.

Fig. 3.
The algorithm operation is divided into two phases, training and recognition. In the first, the acquired signals, after a manual removal of silence periods, are transformed into feature vectors composed of 19 mel-frequency cepstral coefficients (MFCCs) plus their first and second derivatives. Cepstral mean normalization is applied to deal with stationary channel effects. Speaker models are represented by mixtures of Gaussians trained by means of the expectation maximization algorithm. The number of Gaussians and the end accuracy at convergence have been empirically determined, and set to 100 and $10^{-4}$ respectively. In this phase the voice activity detector (VAD) is also trained. The adopted VAD is based on a bi-Gaussian model of the frame log-energy. During the training, a two-Gaussian model is estimated using the input sequence: the Gaussian with the smallest mean models the silence frames, whereas the other Gaussian corresponds to frames of speech activity.
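A bi-Gaussian VAD of this kind can be sketched with a small expectation-maximization loop on one-dimensional log-energy values. The initialization, the variance floor, the number of iterations and the toy data are assumptions made for this illustration, as is the decision rule that compares the two Gaussian likelihoods without the mixture weights.

```python
import math


def gauss_logpdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def train_bigaussian(log_energy, n_iter=50, var_floor=1e-3):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    n = len(log_energy)
    means = [min(log_energy), max(log_energy)]
    overall = sum(log_energy) / n
    var0 = max(sum((x - overall) ** 2 for x in log_energy) / n, var_floor)
    variances, weights = [var0, var0], [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame.
        resp = []
        for x in log_energy:
            logp = [math.log(w) + gauss_logpdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            top = max(logp)
            p = [math.exp(l - top) for l in logp]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            means[k] = sum(r[k] * x for r, x in zip(resp, log_energy)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2
                    for r, x in zip(resp, log_energy)) / nk,
                var_floor)
            weights[k] = max(nk / n, 1e-12)
    # The Gaussian with the smallest mean models silence.
    lo, hi = sorted(range(2), key=lambda k: means[k])
    return {"silence": (means[lo], variances[lo]),
            "speech": (means[hi], variances[hi])}


def is_speech(x, model):
    """Maximum-likelihood decision between the two Gaussians."""
    return gauss_logpdf(x, *model["speech"]) > gauss_logpdf(x, *model["silence"])


# Toy frame log-energies: five silence-like and five speech-like frames.
frames = [-10.2, -9.8, -10.1, -9.9, -10.0, -0.5, 0.3, 0.1, -0.2, 0.4]
model = train_bigaussian(frames)
```

After training, `is_speech(-0.1, model)` is true and `is_speech(-9.7, model)` is false for this toy data.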

Fig. 4. Room setup.
The data set used for the speech recognition experiments has been constructed from the WSJ November '92 speech recognition evaluation set. It consists of 330 sentences (about 40 minutes of speech), uttered by eight different speakers, both male and female. The data set is recorded at 16 kHz and does not contain any additive noise or reverberation.

Fig. 5 shows the NPM curve for the identification of the RIRs relative to source $s_1$ at $T_{60}$ = 240 ms for an input signal of 40 s.

Table 1. Characteristics of the selected acoustic model.