Robust Speech Recognition for Adverse Environments

Introduction
Although state-of-the-art speech recognizers can achieve very high recognition rates for clean speech, recognition performance generally degrades drastically in noisy environments. Noise-robust speech recognition has therefore become an important task for speech recognition in adverse environments. Recent research on noise-robust speech recognition has mostly focused on two directions: (1) removing the noise from the corrupted noisy signal in the signal space or feature space, either by noise filtering, such as spectral subtraction (Boll 1979), Wiener filtering (Macho et al. 2002) and RASTA filtering (Hermansky et al. 1994), or by model-based speech or feature enhancement, such as SPLICE (Deng et al. 2003) and stochastic vector mapping (Wu et al. 2002); (2) compensating for the noise effect in the acoustic models in the model space so that the training environment matches the test environment, such as PMC (Wu et al. 2004) or multi-condition/multi-style training (Deng et al. 2000). The noise filtering approaches require some assumption of prior information, such as the spectral characteristics of the noise. Their performance degrades when the noisy environment varies drastically or when the noise environment is unknown. Furthermore, (Deng et al. 2000; Deng et al. 2003) have shown that denoising or preprocessing is superior to retraining the recognizers under matched noise conditions with no preprocessing.
Stochastic vector mapping (SVM) (Deng et al. 2003; Wu et al. 2002) and sequential noise estimation (Benveniste et al. 1990; Deng et al. 2003; Gales et al. 1996) for noise normalization have been proposed and have achieved significant improvement in noisy speech recognition. However, some drawbacks and limitations remain. First, the performance of sequential noise estimation decreases when the noisy environment varies drastically. Second, the environment mismatch between training data and test data still exists and results in performance degradation. Third, the maximum-likelihood-based stochastic vector mapping (SPLICE) requires annotation of the environment type and stereo training data, yet stereo data are not available for most noisy environments.
To overcome the insufficient tracking ability of the sequential expectation-maximization (EM) algorithm, this chapter introduces prior models that provide more information for sequential noise estimation. Furthermore, an environment model adaptation is constructed to reduce the mismatch between the training data and the test data. Finally, a minimum classification error (MCE)-based approach is employed without stereo training data, and an unsupervised frame-based auto-clustering is adopted to automatically detect the environment type of the training data (Hsieh et al. 2008).
For recognition of disfluent speech, a number of cues can be observed when edit disfluency occurs in spontaneous speech. These cues can be detected from linguistic features, acoustic features (Shriberg et al. 2000) and integrated knowledge sources (Bear et al. 1992). The phonetic consequences of disfluency have also been outlined to improve models for disfluency processing in speech applications. Four types of disfluency based on intonation, segment duration and pause duration were presented in (Savova et al. 2003). A discriminatively trained full covariance Gaussian system was used for rich transcription in (Soltau et al. 2005). (Furui et al. 2005) presented approaches to corpus collection, analysis and annotation for conversational speech processing. (Charniak et al. 2001) proposed an architecture for parsing transcribed speech in which an edit word detector removes edit words or fillers from the sentence string and a standard statistical parser then parses the remaining words. The statistical parser and the parameters estimated by boosting were employed to detect and correct the disfluency. (Heeman et al. 1999) presented a statistical language model that is able to identify POS tags, discourse markers, speech repairs and intonational phrases. A noisy channel model was used to model the disfluency in (Johnson et al. 2004). (Snover et al. 2004) combined lexical information and rules generated from 33 rule templates for disfluency detection. (Hain et al. 2005) presented techniques in front-end processing, acoustic modeling, and language and pronunciation modeling for transcribing conversational telephone speech automatically. HMM, maximum entropy and conditional random field models have also been compared in detail for disfluency detection.
In this chapter an approach to the detection and correction of edit disfluency based on word order information is presented (Yeh et al. 2006). The first process attempts to detect the interruption points (IPs) based on hypothesis testing. Acoustic features including duration, pitch and energy features are adopted in hypothesis testing. To circumvent the problems resulting from disfluency, especially edit disfluency, a reliable and robust language model for correcting speech recognition errors is employed. For handling language-related phenomena in edit disfluency, a cleanup language model characterizing the structure of the cleanup sentences and an alignment model for aligning words between the deletable region and the correction part are proposed for edit disfluency detection and correction.
Furthermore, multilinguality frequently occurs in speech content, and the ability of speech recognition systems to process speech in multiple languages has become increasingly desirable due to the trend of globalization. In general, there are different approaches to achieving multilingual speech recognition. One approach employs an external language identification (LID) system to first identify the language of the input utterance; the corresponding monolingual system is then selected to perform the speech recognition (Waibel et al. 2000). The accuracy of the external LID system is the main factor in the overall system performance.
Another approach to multilingual speech recognition is to run all the monolingual recognizers in parallel and select the output generated by the recognizer that obtains the maximum likelihood score. The performance of the multilingual speech recognition then depends on the post-end selection of the maximum likelihood sequence. A popular approach to multilingual speech recognition is the use of a multilingual phone set. The multilingual phones are usually created by merging acoustically similar phones across the target languages in an attempt to obtain a minimal phone set that covers all the sounds existing in all the target languages (Kohler 2001).
In this chapter, an approach to phonetic unit generation for mixed-language or multilingual speech recognition is presented (Huang et al. 2007). The International Phonetic Alphabet (IPA) representation is employed for phonetic unit modeling. Context-dependent triphones for Mandarin and English speech are constructed based on the IPA representation. Acoustic and contextual analysis is performed to characterize the properties of the multilingual context-dependent phonetic units. Acoustic likelihood is adopted for the pair-wise similarity estimation of the context-dependent phone models to construct a confusion matrix. The hyperspace analog to language (HAL) model is used for contextual modeling and then for contextual similarity estimation between phone models.
The organization of this chapter is as follows. Section 2 presents two approaches to cepstral feature enhancement for noisy speech recognition using noise-normalized stochastic vector mapping. Section 3 describes an approach to edit disfluency detection and correction for rich transcription. In Section 4, the fusion of acoustic and contextual analysis is described to generate phonetic units for mixed-language or multilingual speech recognition. Finally, the conclusions are provided in the last section.

Speech recognition in noisy environment
In this section, an approach to feature enhancement for noisy speech recognition is presented. Three prior models are introduced to characterize clean speech, noise and noisy speech, respectively. The framework of the system is shown in Figure 1. Sequential noise estimation is employed for prior model construction based on noise-normalized stochastic vector mapping (NN-SVM). Feature enhancement can therefore work without stereo training data and without manual tagging of the background noise type, relying instead on auto-clustering of the estimated noise data. Environment model adaptation is also adopted to reduce the mismatch between the training data and the test data.
The SVM-based feature enhancement approach estimates the clean speech feature $\hat{x}$ from the noisy speech feature $y$ through an environment-dependent mapping function whose parameters depend on the environment $e$. For the estimation of the mapping function parameters, if stereo data, which contain a clean speech signal and the corrupted noisy speech signal derived from the identical clean speech signal, are available, the SPLICE-based approach can be directly adopted. However, stereo data are not easily available in real-life applications. In this chapter an MCE-based approach is proposed to overcome this limitation. Furthermore, the environment type of the noisy speech data is needed for training the environment model. If the noisy speech data are manually classified into $N_E$ noisy environment types, each noisy speech file is assigned to only one environment type, and the process is very time consuming. In practice, each noisy speech file contains several segments with different types of noisy environment. Since the noisy speech annotation affects the purity of the training data for the environment model, this section introduces a frame-based unsupervised noise clustering approach to construct a more precise categorization of the noisy speech.

Noise-Normalized Stochastic Vector Mapping (NN-SVM)
In (Boll 1979), the concept of noise normalization is proposed to reduce the effect of background noise in noisy speech for feature enhancement. If the noise feature vector $\hat{n}$ of each frame can be estimated first, the NN-SVM is obtained from Eq. (2) by replacing $y$ and $\hat{x}$ with $y - \hat{n}$ and $\hat{x} - \hat{n}$, respectively. Noise normalization makes the environment model more noise-tolerant.
The estimation of the noise feature vector $\hat{n}$ therefore plays an important role in noise-normalized stochastic vector mapping.
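The noise normalization step can be sketched in a few lines. The sketch below assumes, purely for illustration, a SPLICE-style mapping of the form $\hat{x} = y + \sum_k p(k \mid y)\, r_k$ with diagonal-covariance Gaussian components; the exact mapping function of this chapter is not reproduced here, and the function names, the argument layout and the example parameter values are all hypothetical.

```python
import math

def gauss_logpdf(v, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector v."""
    return sum(-0.5 * (math.log(2 * math.pi * s) + (a - m) ** 2 / s)
               for a, m, s in zip(v, mean, var))

def svm_map(y, weights, means, variances, corrections):
    """Assumed SPLICE-style mapping: x_hat = y + sum_k p(k|y) * r_k."""
    logp = [math.log(w) + gauss_logpdf(y, m, s)
            for w, m, s in zip(weights, means, variances)]
    top = max(logp)                      # subtract the max for stability
    post = [math.exp(l - top) for l in logp]
    total = sum(post)
    post = [p / total for p in post]     # posterior p(k|y)
    return [yi + sum(p * r[i] for p, r in zip(post, corrections))
            for i, yi in enumerate(y)]

def nn_svm_map(y, n_hat, weights, means, variances, corrections):
    """Noise normalization: map y - n_hat, then add n_hat back."""
    shifted = [yi - ni for yi, ni in zip(y, n_hat)]
    mapped = svm_map(shifted, weights, means, variances, corrections)
    return [xi + ni for xi, ni in zip(mapped, n_hat)]

# Hypothetical two-component environment model in a 2-D feature space.
model = ([0.5, 0.5],
         [[0.0, 0.0], [4.0, 4.0]],
         [[1.0, 1.0], [1.0, 1.0]],
         [[-0.5, -0.5], [-1.0, -1.0]])
enhanced = nn_svm_map([10.0, 10.0], [10.0, 10.0], *model)
```

Because the mapping operates on $y - \hat{n}$, a change in the noise level that is tracked by $\hat{n}$ leaves the component posteriors unchanged, which is one way to read the claim that normalization makes the environment model more noise-tolerant.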

Prior model for sequential noise estimation
This section employs a frame-based sequential noise estimation algorithm (Benveniste et al. 1990; Deng et al. 2003; Gales et al. 1996) that incorporates the prior models. In the procedure, only the noisy speech feature vector of the current frame is observed. Since the noise and clean speech feature vectors are both missing, the relation among clean speech, noise and noisy speech is required first. The sequential EM algorithm is then introduced for online noise estimation based on this relation. In the meantime, the prior models are involved to provide more information for noise estimation.

The acoustic environment model
The nonlinear acoustic environment model introduced in (Deng et al. 2003) is first adopted for noise estimation. Given the cepstral features of clean speech $x$, additive noise $n$ and channel distortion $h$, the approximate nonlinear relation among $x$, $n$, $h$ and the corrupted noisy speech $y$ in the cepstral domain is estimated by a model in which $C$ denotes the discrete cosine transform matrix. To linearize the nonlinear model, a first-order Taylor series expansion is used around two updated operating points, $n_0$ and $\mu_0^x$, denoting the initial noise feature and the mean vector of the prior clean speech model, respectively. Ignoring the channel distortion effect, i.e. setting $h = 0$, the linearized model of Eq. (5) is then derived.
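As an illustration of the nonlinear relation, the sketch below uses the form commonly given in the literature for this kind of environment model, $y = x + h + C \ln\left(1 + \exp\left[C^{-1}(n - x - h)\right]\right)$. Treating this as the model meant here is an assumption, as are the square orthonormal DCT-II matrix (real front ends usually truncate the cepstrum) and all helper names.

```python
import math

def dct_matrix(size):
    """Orthonormal DCT-II matrix; its inverse is its transpose."""
    rows = []
    for k in range(size):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / size)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / size)
                     for i in range(size)])
    return rows

def matvec(mat, vec):
    """Matrix-vector product on plain lists."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]

def transpose(mat):
    return [list(col) for col in zip(*mat)]

def noisy_cepstrum(x, n, h=None):
    """Assumed model: y = x + h + C ln(1 + exp(C^-1 (n - x - h)))."""
    size = len(x)
    h = h or [0.0] * size                # h = 0 ignores channel distortion
    C = dct_matrix(size)
    C_inv = transpose(C)
    # Move the cepstral difference to the log-spectral domain.
    diff = matvec(C_inv, [ni - xi - hi for ni, xi, hi in zip(n, x, h)])
    # Apply the nonlinearity there, then return to the cepstral domain.
    log_term = matvec(C, [math.log1p(math.exp(d)) for d in diff])
    return [xi + hi + li for xi, hi, li in zip(x, h, log_term)]
```

Two limiting cases are easy to check: when the noise is far below the speech in every log-spectral bin, $y$ approaches $x$, and when noise and speech are equal, the log-spectrum of $y$ exceeds that of $x$ by $\ln 2$ in every bin.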

The prior models
The three prior models, for noise ($n$), clean speech ($x$) and noisy speech ($y$) respectively, provide more information for sequential noise estimation. First, the noise and clean speech prior models are characterized by GMMs, and pre-training data for noise and clean speech are required to train the parameters of these two GMMs.
Since the prior noisy speech model is also needed in sequential noise estimation, its parameters are derived from the prior clean speech and noise models using the approximated linear model around the two operating points $n_0$ and $\mu_0^x$. The noisy speech prior model is employed to search for the most similar clean speech mixture component and noise mixture component in sequential noise estimation.

Sequential noise estimation
The sequential EM algorithm is employed for sequential noise estimation. In this section, the prior clean speech, noise and noisy speech models are used to construct a robust noise estimation procedure. Based on the sequential EM algorithm, the estimated noise is obtained as $\hat{n}_{t+1} = \arg\max_{n} Q_{t+1}(n)$. In the E-step of the sequential EM algorithm, the objective function $Q_{t+1}(n)$ is defined, and a forgetting factor is employed to control the effect of the features of the preceding frames.
In the M-step, an iterative stochastic approximation is introduced to derive the solution, which yields the sequential noise estimate. The prior models are used to search for the most similar noise and clean speech mixture components. Given these two mixture components, the estimation of the posterior probability over the component pair $(m, d)$ becomes more accurate.
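The role of the forgetting factor can be illustrated in isolation. The sketch below is not the sequential EM update of this section; it is a deliberately simplified running average, with hypothetical names, in which a frame that lies $s$ steps in the past receives weight `forgetting ** s`.

```python
def forgetting_average(frames, forgetting):
    """Running estimate where a frame s steps back has weight forgetting**s."""
    num, den = 0.0, 0.0
    estimates = []
    for value in frames:
        num = forgetting * num + value   # discounted sum of observations
        den = forgetting * den + 1.0     # discounted count of observations
        estimates.append(num / den)
    return estimates

# forgetting = 1.0 keeps all history; smaller values track changes faster.
slow = forgetting_average([1.0, 1.0, 5.0, 5.0], 0.95)
fast = forgetting_average([1.0, 1.0, 5.0, 5.0], 0.50)
```

A factor close to one gives a smooth but slowly adapting estimate, while a small factor follows abrupt changes in the noise at the cost of higher variance, which is the trade-off behind the tracking problem discussed above.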

Environment model adaptation
Because the prior models are usually not complete enough to represent the universal data, the environment mismatch between the training data and the test data results in a degradation of feature enhancement performance. In this section, an environment model adaptation strategy applied before the test phase is proposed to deal with this problem. The environment model adaptation procedure contains two parts. The first is model parameter adaptation of the noise prior model and the noisy speech prior model in the training phase and the adaptation phase. The second is adaptation of the noise-normalized SVM function and the environment model in the adaptation phase.

Model adaptation on noise and noisy speech prior models
For noise and noisy speech prior model adaptation, MAP adaptation is first applied to the noise prior model. The adaptation equations for the noise prior model parameters are defined given $T$ frames of adaptation noise data $z$, which are estimated using the un-adapted prior models. The conjugate prior density of the mixture weights is the Dirichlet distribution, and the joint conjugate prior density of the mean and variance parameters is the Normal-Wishart distribution, each with hyper-parameters defined per mixture component $d$. After adaptation of the noise prior model, the noisy speech prior model is adapted using the clean speech prior model and the newly adapted noise prior model based on Eq. (8).
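For orientation, the sketch below applies the standard MAP re-estimation formulas for GMM mixture weights and means under Dirichlet and normal priors to one-dimensional data. It is a minimal illustration rather than the adaptation equations of this section: variance adaptation is omitted, and the names `nu` (Dirichlet hyper-parameters) and `tau` (prior weights on the means) are assumptions introduced here.

```python
import math

def normal_pdf(z, mean, var):
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def map_adapt(data, weights, means, variances, nu, tau):
    """MAP update of GMM weights and means from adaptation frames `data`."""
    D = len(weights)
    occ = [0.0] * D      # sum_t gamma_t(d)
    first = [0.0] * D    # sum_t gamma_t(d) * z_t
    for z in data:
        lik = [w * normal_pdf(z, m, v)
               for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        for d in range(D):
            gamma = lik[d] / total       # posterior of component d
            occ[d] += gamma
            first[d] += gamma * z
    denom = sum(nu) - D + len(data)
    new_weights = [(nu[d] - 1.0 + occ[d]) / denom for d in range(D)]
    new_means = [(tau[d] * means[d] + first[d]) / (tau[d] + occ[d])
                 for d in range(D)]
    return new_weights, new_means

# Hypothetical noise prior with two components and a few adaptation frames.
w, m = map_adapt([0.8, 1.1, 3.9], [0.5, 0.5], [0.0, 4.0], [1.0, 1.0],
                 nu=[2.0, 2.0], tau=[1.0, 1.0])
```

With no adaptation data the parameters stay at their prior values, and as more frames arrive the estimates move toward the statistics of the adaptation data, which is the behavior that makes MAP adaptation suitable when only a small amount of noise data is available.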

Model adaptation of noise-normalized SVM (NN-SVM)
Since the mapping function parameter is not a random variable and does not follow any conjugate prior density, a maximum likelihood (ML)-based adaptation similar to the correction vector estimation of SPLICE is employed, in which the temporary estimates of the clean speech $\hat{x}_t$ are obtained using the un-adapted noise-normalized stochastic mapping function in Eq. (4).
Table 1 shows the experimental results of the proposed approach on the AURORA2 database. The AURORA2 database contains both clean and noisy utterances of the TIDIGITS corpus and is available from ELDA (Evaluations and Language resources Distribution Agency). Two results of previous research are shown for comparison, and three experiments were conducted under different experimental conditions: no denoising, SPLICE with recursive EM using stereo data (Deng et al. 2003), the proposed approach using manual annotation without adaptation, and the proposed approach using auto-clustered training data without and with adaptation. The overall results show that the proposed approach slightly outperformed the SPLICE-based approach with the recursive EM algorithm, even without stereo training data and manual annotation. Furthermore, based on the results in Set B with 0.11% improvement (background noise types different from the training data) and Set C with 0.04% improvement (background noise types and channel characteristics different from the training data), the environment model adaptation can slightly reduce the mismatch between the training data and the test data.

Conclusions
In this section two approaches to cepstral feature enhancement for noisy speech recognition using noise-normalized stochastic vector mapping were presented. The prior models were introduced for precise noise estimation, and the environment model adaptation was constructed to reduce the environment mismatch between the training data and the test data. Experimental results demonstrate that the proposed approach can slightly outperform the SPLICE-based approach without stereo data on the AURORA2 database.

Speech recognition in disfluent environment
In this section, a novel approach to detecting and correcting edit disfluency in spontaneous speech is presented. Hypothesis testing using acoustic features is first adopted to detect potential interruption points (IPs) in the input speech. The word order of the utterance is then cleaned up based on the potential IPs using a class-based cleanup language model. The deletable region and the correction are aligned using an alignment model. Finally, a log-linear weighting mechanism is applied to optimize the performance.

Edit disfluency analysis
In conversational utterances, several problems such as interruption, correction, filled pause, and ungrammatical sentence are detrimental for speech recognition. The definitions of disfluencies have been discussed in SimpleMDE. Edit disfluencies are portions of speech in which a speaker's utterance is not complete and fluent; instead the speaker corrects or alters the utterance, or abandons it entirely and starts over. In general, edit disfluencies can be divided into four categories: repetitions, revisions, restarts and complex disfluencies. Since complex disfluencies consist of multiple or nested edits, it seems reasonable to consider the complex disfluencies as a combination of the other simple disfluencies: repetitions, revisions, and restarts. Edit disfluencies have a complex internal structure, consisting of the deletable region (delreg), interruption point (IP) and correction. Editing terms such as fillers, particles and markers are optional and follow the IP in edit disfluency.
In spontaneous speech, acoustic features such as short pauses (silence and fillers), energy reset and pitch reset generally appear along with the occurrence of edit disfluency. Based on these features, we can detect the possible IPs. Furthermore, since IPs generally appear at the boundary of two successive words, we can exclude the unlikely IPs whose positions fall within a word. In addition, since the structural patterns of the deletable word sequence and the correction word sequence are very similar, the deletable word sequence in edit disfluency is replaceable by the correction word sequence.

Framework of edit disfluency transcription system
The overall transcription task for conversational speech with edit disfluency in the proposed method is composed of two main mechanisms: an IP detection module and an edit disfluency correction module. The framework is shown in Figure 2. The IP detection module first predicts the potential IPs. The edit disfluency correction module then generates the rich transcription, which contains the interruption information, the text transcription of the speaker's utterances and the cleaned-up text transcription without disfluencies. Figure 3 shows the correction process for edit disfluency.
The speech signal is fed to both the acoustic feature extraction module and the speech recognition engine in the IP detection module. Information about the durations of syllables and silences from speech recognition is provided for acoustic feature extraction. Combined with this side information from speech recognition, duration-, pitch- and energy-related features are extracted and used to model the IPs using a Gaussian mixture model (GMM). In addition, to perform hypothesis testing on IP detection, an anti-IP GMM is constructed from the features extracted from the non-IP regions. The hypothesis testing verifies whether the posterior probability of the acoustic features at a syllable boundary is above a threshold and thereby determines whether the syllable boundary is an IP. Since an IP is an event that happens at an inter-word location, detected IPs that do not fall on a word boundary can be removed.
There are two processing stages in the edit disfluency correction module: cleanup and alignment. As shown in Figure 4, the cleanup process divides the word string into three parts, the deletable region (delreg), the editing term and the correction, according to the locations of the potential IPs detected by the IP detection module. Cleanup is performed by shifting the correction part so that it replaces the deletable region, forming a new cleanup transcription. The edit disfluency correction module is composed of an n-gram language model and the alignment model. The n-gram model regards the cleanup transcriptions as fluent utterances and models their word order information. The alignment model finds the optimal correspondence between the deletable region and the correction in edit disfluency.

Potential interruption point detection
For IP detection, instead of detecting the exact IP, potential IPs are selected for further processing. Since the IP is the point at which the speaker breaks off the deletable region, some acoustic events accompany it. For syllabic languages like Chinese, every character is pronounced as a monosyllable, while a word is composed of one to several syllables. The speech input of a syllabic language with $n$ syllables can be described as a sequence of syllables and the silences between them, $Syl_1, Silence_1, Syl_2, \ldots, Silence_{n-1}, Syl_n$.

IP detection using posterior probability of silence duration
Since IPs always appear at inter-syllable positions, the $n-1$ silence positions between the $n$ syllables are considered as the IP candidates. IP detection can thus be treated as the problem of verifying whether each of the $n-1$ silence positions is an IP or not. In conversation, speakers may hesitate to find the correct words when disfluency appears, and hesitation is usually realized as a pause. Since the length of silence is very sensitive to disfluency, we use normal distributions to model the posterior probabilities that an IP appears and does not appear in $Silence_k$, respectively.
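The decision for a single silence can be sketched as follows. The sketch assumes that the two normal distributions act as class-conditional densities of the silence duration and combines them with Bayes' rule; the means, standard deviations (in seconds) and the prior used below are hypothetical placeholders, not values estimated from any corpus.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def ip_posterior(duration, ip=(0.60, 0.20), non_ip=(0.20, 0.20), prior_ip=0.5):
    """P(IP | silence duration) from two assumed normal densities."""
    p_ip = prior_ip * normal_pdf(duration, *ip)
    p_non = (1.0 - prior_ip) * normal_pdf(duration, *non_ip)
    return p_ip / (p_ip + p_non)

def is_potential_ip(duration, threshold):
    """Keep the silence as an IP candidate if the posterior passes the threshold."""
    return ip_posterior(duration) >= threshold

# Durations of the n-1 silences of a hypothetical utterance.
candidates = [k for k, d in enumerate([0.05, 0.70, 0.15, 0.45])
              if is_potential_ip(d, threshold=0.25)]
```

A low threshold keeps more silences as candidates, which suits a first-pass detector whose false alarms can still be rejected by later processing.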

Syllable-based acoustic features extraction
Acoustic features including duration, pitch and energy for each syllable (Soltau et al. 2005) are adopted for IP detection. A feature vector of the syllables within an observation window around the silence is formed as the input of the GMM. That is, we are interested in the syllables around the silence that may appear as an IP. A window of $2w$ syllables, with $w$ syllables before and $w$ syllables after $Silence_k$, is used. First, the syllable subscripts are re-indexed relative to the position of the silence. The features of the syllables within the observation window are then extracted.
Since the durations of syllables are not the same even for the same syllable, the duration ratio is defined as the average duration of the syllable normalized by the average duration over all syllables. In this definition, $syllable_{i,j}$ denotes the $j$-th sample of syllable $i$ in the corpus, $|syllable|$ denotes the number of syllables, and $n_i$ is the number of samples of syllable $i$ in the corpus. Similarly, for energy and pitch, frame-based statistics are used to calculate the normalized features for each syllable.
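A small sketch of the duration normalization follows. It assumes one possible reading of the definition, in which the denominator is the mean of the per-syllable average durations; the syllable labels and durations are hypothetical.

```python
def duration_ratios(samples):
    """Average duration of each syllable divided by the mean of all
    per-syllable averages (one assumed reading of the definition)."""
    averages = {syl: sum(durs) / len(durs) for syl, durs in samples.items()}
    overall = sum(averages.values()) / len(averages)
    return {syl: avg / overall for syl, avg in averages.items()}

# Hypothetical corpus statistics: durations in seconds per syllable sample.
ratios = duration_ratios({"ba": [0.2, 0.4], "ma": [0.1]})
```

A ratio above one marks a syllable that is intrinsically long, so an observed duration can be judged relative to what is typical for that syllable rather than in absolute terms.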
Considering the result of speech recognition, the features are normalized to form the first-order features. To model the speaking rate and the variation in energy and pitch during the utterance, the second-order features, called delta-duration, delta-energy and delta-pitch, are obtained from the forward difference of the first-order features. Delta-duration is estimated in this way, and the same estimation applies to delta-energy and delta-pitch.
Here $w$ is half of the observation window size. In total, there are three kinds of features, each in two orders, after feature extraction. We combine these features to form a vector with $24w-6$ features as the observation vector of the GMM.
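The forward difference used for the second-order features can be sketched as below. The function name and the example values are hypothetical, and the sketch covers a single first-order feature track only, so it does not reproduce the full observation vector.

```python
def forward_difference(values):
    """Second-order (delta) features: difference between each syllable's
    first-order feature and that of the preceding syllable in the window."""
    return [nxt - cur for cur, nxt in zip(values, values[1:])]

# Normalized durations of the 2w syllables around a silence, here w = 2.
window = [0.9, 1.1, 1.0, 1.4]
delta_duration = forward_difference(window)   # 2w - 1 values
```

The same function applies unchanged to the energy and pitch tracks of the window.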

Gaussian mixture model for interruption point detection
The GMM is adopted for IP detection using the acoustic features.

Potential interruption point extraction
Based on the assumption that an IP generally appears at the boundary of two successive words, we can remove the detected IPs that do not fall on a word boundary. After the removal of unlikely IPs, the remaining IPs are kept for further processing. Since the word graph or word lattice is obtained from the speech recognition module, every path in the word graph or word lattice forms its own potential IP set for an input utterance.

Linguistic processing for edit disfluency correction
In the previous section, potential IPs were detected from the acoustic features. However, correcting edit disfluency using linguistic features is, in fact, one of the keys to rich transcription. In this section, the edit disfluency is detected by maximizing the likelihood of the language model for the cleaned-up utterances and the word correspondence between the deletable region and the correction given the position of the IP. Consider the word sequence $W^*$ in the word lattice generated by the speech recognition engine. We can model the word string $W^*$ using a log-linear mixture model in which both the language model and the alignment model are included. The two combination weights, which sum to one, are assigned to the cleanup language model and the alignment model, respectively. IP denotes the interruption point obtained from the IP detection module and $n$ is the position of the potential IP.
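The log-linear combination can be sketched as a weighted sum of two log-probabilities, maximized over the candidate IP positions. The field names, the weight value and the scores below are hypothetical; the sketch only shows how the two knowledge sources are traded off.

```python
def combined_score(log_p_lm, log_p_align, weight):
    """Log-linear mixture: weight on the cleanup language model,
    (1 - weight) on the alignment model."""
    return weight * log_p_lm + (1.0 - weight) * log_p_align

def best_ip(candidates, weight):
    """Pick the candidate IP position with the highest combined score."""
    best = max(candidates,
               key=lambda c: combined_score(c["lm"], c["align"], weight))
    return best["position"]

# Hypothetical log-probabilities for two potential IP positions.
candidates = [
    {"position": 2, "lm": -10.0, "align": -4.0},
    {"position": 5, "lm": -8.0, "align": -9.0},
]
chosen = best_ip(candidates, weight=0.5)
```

Moving the weight toward one lets the cleanup language model dominate, while moving it toward zero favors the hypothesis whose deletable region aligns best with the correction.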

Language model of cleanup utterance
In the past, statistical language models have been applied to speech recognition and have achieved significant improvement in the recognition results. However, probability estimation of word sequences can be expensive and always suffers from the problem of data sparseness. In practice, the statistical language model is often approximated by a class-based n-gram model, in which $Class(\cdot)$ denotes the conversion function that translates a word sequence into a word class sequence. In this section, we employ two kinds of word classes: semantic classes and part-of-speech (POS) classes. A semantic class, such as the synsets in WordNet (http://wordnet.princeton.edu/) or the concepts in the UMLS (http://www.nlm.nih.gov/research/umls/), contains words that share a semantic property based on semantic relations such as hyponymy and hypernymy. POS classes, also called syntactic or grammatical categories, are defined by the role that a word plays in a sentence, such as noun, verb or adjective.
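To make the class-based approximation concrete, the sketch below uses the common class-based bigram factorization $P(w_i \mid w_{i-1}) \approx P(w_i \mid c_i)\, P(c_i \mid c_{i-1})$ with POS tags as classes. This factorization is assumed here for illustration and is not claimed to be the exact model of this section; the tiny tagged corpus is hypothetical and no smoothing is applied.

```python
from collections import Counter

def train_class_bigram(tagged_sentences):
    """Estimate P(word | class) and P(class | previous class) by counting."""
    word_class = Counter()
    class_count = Counter()
    class_bigram = Counter()
    class_context = Counter()
    for sentence in tagged_sentences:
        classes = [cls for _, cls in sentence]
        for word, cls in sentence:
            word_class[(word, cls)] += 1
            class_count[cls] += 1
        for prev, cur in zip(classes, classes[1:]):
            class_bigram[(prev, cur)] += 1
            class_context[prev] += 1

    def prob(word, cls, prev_cls):
        emit = word_class[(word, cls)] / class_count[cls]
        trans = class_bigram[(prev_cls, cls)] / class_context[prev_cls]
        return emit * trans

    return prob

corpus = [
    [("I", "PRON"), ("want", "VERB"), ("tea", "NOUN")],
    [("you", "PRON"), ("like", "VERB"), ("coffee", "NOUN")],
]
prob = train_class_bigram(corpus)
p_unseen_pair = prob("tea", "NOUN", "VERB")   # "like tea" never occurs
```

The word pair "like tea" never appears in the corpus, yet it receives a non-zero probability because the class pair VERB followed by NOUN has been observed, which is how class-based models relieve data sparseness.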
The other essential issue of the n-gram model for correcting edit disfluency is the order of the Markov model. The IP is the point at which the speaker breaks off the deletable region, and the correction consists of the portion of the utterance that has been repaired by the speaker and can be considered fluent. Removing part of the word string leads to a shorter string, and a shorter word string obtains a higher probability. As a result, short word strings will be favored. To deal with this problem, we can increase the order to constrain the perplexity and normalize the word length by aligning the deletable region and the correction.

Alignment model between the deletable region and the correction
In conversational speech, the structural pattern of a deletable region is usually similar to that of the correction. Sometimes, the deletable region appears as a substring of the correction. Accordingly, the structural pattern can be found at the starting point of the correction, which generally follows the IP. We can then take the potential IP as the center and align the word strings before and after it. Since the correction replaces the deletable region and ends the utterance, there exists a correspondence between the words in the deletable region and those in the correction. We may, therefore, model the alignment as the conditional probability of the correction given the possible deletable region. Based on this observation, a class-based alignment is proposed to clean up edit disfluency. In the alignment model, the fertility $f_k$ is the number of words in the correction corresponding to the word $w_k$ in the deletable region; $k$ and $l$ are the positions of the words $w_k$ and $w_l$ in the deletable region and the correction, respectively; and $m$ denotes the number of words in the deletable region. The alignment model for cleanup contains three parts: fertility probability, translation or corresponding probability, and distortion probability. The fertility probability of word $w_k$ is defined using an indicator function and $N$, the maximum value of fertility. The translation or corresponding probability is measured according to (Wu et al. 1994).

Experimental results and discussion
To evaluate the performance of the proposed approach, a transcription system for spontaneous Mandarin speech with edit disfluencies was developed. A speech recognition engine built with the Hidden Markov Model Toolkit (HTK) was used as the syllable recognizer, with 8 states per syllable (3 states for the initial and 5 states for the final in Mandarin).

Experimental data
The Mandarin Conversational Dialogue Corpus (MCDC), collected from 2000 to 2001 at the Institute of Linguistics of Academia Sinica, Taiwan, consists of 30 digitized conversational dialogues with a total length of 27 hours. Sixty subjects were randomly chosen from daily life in the Taiwan area. The corpus was annotated according to (Yeh et al. 2006), which gives concise explanations and detailed operational definitions of each tag in Mandarin. Corresponding to SimpleMDE, direct repetitions, partial repetitions, overt repairs and abandoned utterances are taken as edit disfluencies in MCDC. The dialogues numbered 01, 02, 03 and 05 are used as the test corpus. For training the parameters of the speech recognizer, the MAT Speech Database, TCC-300 and MCDC were employed.

Potential interruption point detection
According to the observation of the MCDC, the probability density function (pdf) of the duration of the silences with or without IPs is obtained. The average duration of the silences with IP is larger than that of the silences without IP. According to this result, we can estimate the posterior probability of silence duration using a GMM for IP detection. For hypothesis testing, an anti-IP GMM is also constructed.
Since IP detection can be regarded as a position determination problem, an observation window over several syllables is adopted. Within this observation window, the pitch and energy values of the syllables just before an IP are usually larger than those after the IP. This phenomenon means that pitch reset and energy reset co-occur with the IP in an edit disfluency. It generally happens in the syllables of the first word just after the IP. The pitch reset event is very obvious when the disfluency type is repair. Energy plays a similar role when an edit disfluency appears, but the effect is less obvious than that of pitch. The filler words or phrases after an IP are lengthened, which gives the speaker time to construct the correction and draws the listener's attention. This factor can achieve significant improvement in the IP detection rate.
The hypothesis test, combined with the GMM with four mixture components using the syllable features, determines whether a silence contains an IP. The decision threshold should be tuned to achieve a better result. The overall IP error rate defined in RT'04F is simply the average number of missed IP detections and falsely detected IPs per reference IP:

$$\mathrm{Error}_{IP} = \frac{n_{M\text{-}IP} + n_{FA\text{-}IP}}{n_{IP}} \qquad (38)$$

where $n_{M\text{-}IP}$ and $n_{FA\text{-}IP}$ denote the numbers of missed and false alarm IPs, respectively, and $n_{IP}$ is the number of reference IPs. Adjusting the threshold trades off $n_{M\text{-}IP}$ against $n_{FA\text{-}IP}$.
Since the goal of the IP detection module is to detect the potential IPs, a false alarm in IP detection is a less serious problem than a miss. That is to say, we want to obtain a high recall rate without much increase in the false alarm rate. Finally, the threshold was set to 0.25. Since the IP always appears at a word boundary, this constraint can be used to remove unlikely IPs.
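The hypothesis test and the error metric described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a one-dimensional feature (silence duration), a log-likelihood ratio between the IP GMM and the anti-IP GMM as the test statistic, and made-up mixture parameters. The function names are hypothetical, and the chapter does not state in which domain its threshold of 0.25 is applied.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of a scalar x under a 1-D Gaussian mixture."""
    terms = []
    for w, mu, var in zip(weights, means, variances):
        log_pdf = -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        terms.append(math.log(w) + log_pdf)
    # log-sum-exp for numerical stability
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def is_potential_ip(x, ip_gmm, anti_ip_gmm, threshold):
    """Assumed test: accept an IP when the log-likelihood ratio exceeds the threshold."""
    llr = gmm_log_likelihood(x, *ip_gmm) - gmm_log_likelihood(x, *anti_ip_gmm)
    return llr > threshold

def ip_error_rate(n_missed, n_false_alarm, n_reference):
    """Missed plus falsely detected IPs per reference IP."""
    return (n_missed + n_false_alarm) / n_reference

# Illustrative (made-up) models: (weights, means, variances) over silence duration in seconds.
ip_gmm = ([0.5, 0.5], [0.6, 0.9], [0.04, 0.04])
anti_ip_gmm = ([1.0], [0.2], [0.01])

long_silence_is_ip = is_potential_ip(0.7, ip_gmm, anti_ip_gmm, threshold=0.25)
short_silence_is_ip = is_potential_ip(0.2, ip_gmm, anti_ip_gmm, threshold=0.25)
```

Lowering the threshold in such a test accepts more silences as potential IPs, which raises recall at the cost of more false alarms. That is the trade-off the text resolves in favor of recall.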

Clean-up disfluency using linguistic information
To evaluate the edit disfluency correction model, two different types of transcriptions were used: human generated transcription (REF) and speech-to-text recognition output (STT). Using the reference transcriptions provides the best case for the evaluation of the edit disfluency correction module because there are no word errors in the transcription. To assess practical performance, the syllable lattice from speech recognition is fed to the edit disfluency correction module.
For the class-based approach, part of speech (POS) and semantic class are employed as the word class. Herein, the semantic class is obtained from Hownet (http://www.keenage.com/), which defines the relation "IS-A" as the primary feature. There are 26 POS classes and 30 semantic classes. In this way, we can categorize the words according to their hypernyms or concepts, and every word can be mapped to its own semantic class.
The edit word detection (EWD) task is to detect the regions of the input speech containing the words in the deletable regions. One of the primary metrics for edit disfluency correction is the edit word detection error defined in RT'04F (Chen et al. 2002), which is similar to the metric for IP detection shown in Eq. (38).
Due to the lack of structural information, the unigram does not yield any improvement. The bigram provides a more significant improvement when combined with POS class-based alignment than with semantic class-based alignment. Using the 3-gram with semantic class-based alignment outperforms the other combinations. The reason is that the 3-gram, with its stricter constraints, can reduce the false alarm rate for edit word detection. We also tried a 4-gram in the hope of further improvement over the 3-gram, but the gain was slight and did not justify the excess computation. Besides, the statistics of the 4-gram are too sparse compared to those of the 3-gram model. The best combination in the edit disfluency correction module is the 3-gram with semantic class.
According to the analysis of the results shown in Table 2, the probabilities of the n-gram model are much smaller than those of the alignment model. Since the alignment can be taken as the penalty for edit words, we should balance the effects of the 3-gram and the alignment with semantic class using a log-linear combination weight. To optimize the performance, we estimate this weight empirically by minimizing the edit word errors.
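A log-linear combination of this kind can be sketched as below. The exact form used in the chapter is not given, so this assumes the common convex form, weight times the log n-gram probability plus one minus the weight times the log alignment probability, and picks the weight by a simple grid search. The names `combined_log_score`, `pick_weight` and the `dev_error` callback are hypothetical.

```python
import math

def combined_log_score(p_ngram, p_align, weight):
    """Assumed log-linear form: weight * log P_ngram + (1 - weight) * log P_align."""
    return weight * math.log(p_ngram) + (1.0 - weight) * math.log(p_align)

def pick_weight(candidates, dev_error):
    """Return the candidate weight with the lowest edit word error.

    dev_error is a caller-supplied function mapping a weight to the
    edit word error measured on held-out data (hypothetical here).
    """
    return min(candidates, key=dev_error)

# The n-gram probability is several orders of magnitude smaller than the alignment one.
score = combined_log_score(1e-6, 1e-2, weight=0.5)

# Grid search over 0.0, 0.1, ..., 1.0 with a stand-in error curve.
grid = [i / 10 for i in range(11)]
best = pick_weight(grid, dev_error=lambda w: (w - 0.3) ** 2)
```

Working in the log domain keeps the much smaller n-gram probabilities from being swamped, and the single weight makes the empirical tuning a one-dimensional search.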


Conclusion and future work
This investigation has proposed an approach to edit disfluency detection and correction for rich transcription. The proposed theoretical approach, based on a two-stage process, aims to model the behavior of edit disfluency and clean up the disfluency. An IP detection module using hypothesis testing on the acoustic features is employed to detect the potential IPs. A word-based linguistic module, consisting of a cleanup language model and an alignment model, is used to verify the position of the IP and thereby correct the edit disfluency.
Experimental results indicate that the IP detection mechanism is able to recall IPs by adjusting the threshold in hypothesis testing. In an investigation of the linguistic properties of edit disfluency, the linguistic module was explored for correcting disfluency based on the potential IPs. The experimental results indicate that a significant improvement in performance was achieved. In the future, this framework will be extended to deal with the problems arising from subwords in order to improve the performance of the rich transcription system.

Speech recognition in multilingual environment
This section presents an approach to generating phonetic units for mixed-language or multilingual speech recognition. Acoustic and contextual analysis is performed to characterize multilingual phonetic units for phone set creation. Acoustic likelihood is utilized for similarity estimation of phone models. The hyperspace analog to language (HAL) model is adopted for contextual modeling and contextual similarity estimation. A confusion matrix combining acoustic and contextual similarities between every two phonetic units is built for phonetic unit clustering. Multidimensional scaling (MDS) method is applied to the confusion matrix for reducing dimensionality.

Introduction
In multilingual speech recognition, it is very important to determine a global phone inventory for different languages. When an authentic multilingual phone set is defined, the acoustic models and pronunciation lexicon can be constructed (Chen et al. 2002). The simplest approach to phone set definition is to combine the phone inventories of different languages together without sharing the units across the languages. The second one is to map language-dependent phones to the global inventory of the multilingual phonetic association based on phonetic knowledge to construct the multilingual phone inventory. Several global phone-based phonetic representations such as International Phonetic Alphabet (IPA) (Mathews 1979), Speech Assessment Methods Phonetic Alphabet (Wells 1989) and Worldbet (Hieronymus 1993) are generally used. The third one is to merge the language-dependent phone models using a hierarchical phone clustering algorithm to obtain a compact multilingual inventory. In this approach, the distance measure between acoustic models, such as Bhattacharyya distance (Mak et al. 1996) and Kullback-Leibler (KL) divergence (Goldberger et al. 2005), is employed to perform the bottom-up clustering. Finally, the multilingual phone models are generated with the use of a phonetic top-down clustering procedure (Young et al. 1994).

Multilingual phone set definition
From the viewpoint of multilingual speech recognition, a phonetic representation is functionally defined by mapping the fundamental phonetic units of the languages to describe the corresponding pronunciation. In this section, an IPA-based multilingual phone definition is adopted because it is suitable and consistent for phonetic representation. Using the phonetic representation of the IPA, the number of recognition units can be effectively reduced for multilingual speech recognition. To account for co-articulated pronunciation, context-dependent triphones are adopted in the expansion of the IPA-based phonetic units.
In multilingual speech recognition, misrecognition generally results from incorrect pronunciation or a confusable phonetic set. For example, in Mandarin speech, "ei_M" and "zh_M" are usually pronounced as "en_M" and "z_M", respectively. In this section, statistical methods are proposed to deal with the misrecognition caused by the confusing characteristics between phonetic units in multilingual speech recognition. Based on the analysis of these confusing characteristics, confusing phones that are due in part to the confusable phonetic representation are redefined to alleviate the misrecognition problem.

Contextual analysis
A co-articulation pattern can be considered as a semantically plausible combination of phones. This section presents a text mining framework to automatically induce co-articulation patterns from a mixed-language or multilingual corpus. A crucial step in inducing the co-articulation patterns is to represent speech intonation as well as the combination of phones. To achieve this goal, the hyperspace analog to language (HAL) model constructs a high-dimensional contextual space for the mixed-language or multilingual corpus. Each context-dependent triphone in the HAL space is represented as a vector of its context phones, which reflects that the sense of a phone can be co-articulated through its context phones. This notion is derived from the observation of articulation behavior. Based on the co-articulation behavior, the more common context two phones share, the more similarly they are articulated.
The HAL model represents the multilingual triphones using a vector representation. Each dimension of the vector is a weight representing the strength of association between the target phone and its context phone. The weights are computed by applying an observation window of length $\ell$ over the corpus. All phones within the window are considered to be co-articulated with each other. For any two phones of distance $d$ within the window, the weight between them is defined as $\ell - d + 1$. After moving the window by one phone increment over the sentence, the HAL space is constructed. The resultant HAL space is an $N \times N$ matrix, where $N$ is the number of triphones. Table 3 presents the HAL space for the example English and Mandarin mixed sentence "查一下<look up> (CH A @ I X I A) Baghdad (B AE G D AE D)." For each phone in Table 3, the corresponding row vector represents its left contextual information, i.e., the weights of the phones preceding it. The corresponding column vector represents its right contextual information.
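The window-based construction can be illustrated with a short sketch. It assumes the standard HAL weighting (window length minus distance plus one), treats the example sentence as one phone sequence without a word boundary, and works on plain phones rather than triphones for brevity. The function name `build_hal_space` is hypothetical.

```python
from collections import defaultdict

def build_hal_space(phones, window):
    """Accumulate HAL weights: hal[target][context] for phones preceding the target.

    A context phone at distance d (1 <= d <= window) to the left of the
    target contributes a weight of window - d + 1.
    """
    hal = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(phones):
        for d in range(1, window + 1):
            j = i - d
            if j < 0:
                break  # ran off the start of the sequence
            hal[target][phones[j]] += window - d + 1
    return hal

# Phone sequence of the mixed sentence from the text, concatenated (an assumption).
phones = "CH A @ I X I A B AE G D AE D".split()
hal = build_hal_space(phones, window=3)

left_context_of_a = dict(hal["A"])
```

Row `hal[p]` holds the left context of phone `p`. Its right context is the matching column, that is, `hal[q][p]` collected over all phones `q`.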
In each triphone vector, $w_{k,l}$ indicates the $k$-th weight of the $l$-th triphone. Furthermore, the weights in the vector are re-estimated as follows.
where $N$ denotes the total number of phone vectors and $N_l$ represents the number of vectors of the $l$-th phone with nonzero dimension. After each dimension is re-weighted, the HAL space is transformed into a probabilistic framework, and thus each weight can be redefined as

Fusion of confusing matrices and dimensional reduction
The multidimensional scaling (MDS) method is used to project multilingual triphones to the orthogonal axes where the ranking distance relation between them can be estimated using Euclidean distance. MDS is generally a procedure which characterizes the data in terms of a matrix of pairwise distances using Euclidean distance estimation. One of the purposes of

Phone clustering
This section presents how to cluster the triphones with similar acoustic and contextual properties into a multilingual triphone cluster. The cosine measure between triphones $Y_k$ and $Y_l$ is adopted as follows.

$$\cos(Y_k, Y_l) = \frac{\sum_i y_{k,i}\, y_{l,i}}{\sqrt{\sum_i y_{k,i}^2}\,\sqrt{\sum_i y_{l,i}^2}}$$

where $y_{k,i}$ and $y_{l,i}$ are the elements of the triphone vectors $Y_k$ and $Y_l$. The modified k-means (MKM) algorithm is applied to cluster all the triphones into a compact phonetic set. The convergence of the closeness measure is determined by a pre-set threshold.
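The clustering step can be sketched roughly as follows. The chapter does not spell out the details of the MKM algorithm, so this block substitutes a plain k-means loop that assigns vectors by cosine similarity and stops when the average closeness changes by less than a pre-set tolerance. The function names, the deterministic initialization from the first `k` vectors and the toy data are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine measure between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def kmeans_cosine(vectors, k, tol=1e-4, max_iter=100):
    """Plain k-means with cosine assignment; a stand-in for MKM, not MKM itself."""
    centroids = [list(v) for v in vectors[:k]]  # deterministic initialization
    labels = [0] * len(vectors)
    previous = -1.0
    for _ in range(max_iter):
        # Assign each vector to the most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Stop when the average closeness has converged.
        closeness = sum(cosine(v, centroids[lab]) for v, lab in zip(vectors, labels)) / len(vectors)
        if abs(closeness - previous) < tol:
            break
        previous = closeness
    return labels

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
toy_labels = kmeans_cosine(toy_vectors, k=2)
```

Varying the tolerance (or `k`) changes how many clusters survive, which is how different sizes of the phone set could be produced for evaluation.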

Experimental evaluations
For evaluation, an in-house multilingual speech recognizer was implemented and experiments were conducted to evaluate the performance of the proposed approach on an English-Mandarin multilingual corpus.

Multilingual database
In Taiwan, English and Mandarin are popular in conversation, culture, media, and everyday life. For bilingual corpus collection, the English across Taiwan (EAT) project (EAT [online] http://www.aclclp.org.tw/), sponsored by the National Science Council, Taiwan, prepared 600 recording sheets. Each sheet contains 80 reading sentences, including English long sentences, English short sentences, English words and mixed English and Mandarin sentences. Each sheet was recorded separately by English-major students and non-English-major students. The microphone corpus was recorded as sound files with a 16 kHz sampling rate and 16-bit sample resolution. The recording information of the EAT corpus is summarized in Table 4. In this section, the mixed English-Mandarin sentences from the microphone recordings were used. The average sentence length is around 12.62 characters.

Table 4. Recording information of the EAT corpus (columns: English-major, non-English-major).

Evaluation of the phone set generation based on acoustic and contextual analysis
In this section, the phone recognition rate was adopted to evaluate acoustic modeling accuracy. Three classes of speech recognition errors were considered: insertion errors (Ins), deletion errors (Del) and substitution errors (Sub). The fusion of acoustic and contextual analysis was applied to generate the multilingual triphone set. Since the optimal number of acoustic model clusters was unknown, several sets of HMMs were produced by varying the MKM convergence threshold during multilingual triphone clustering. Three approaches were compared: acoustic likelihood (ACL), contextual analysis (HAL), and fusion of acoustic and contextual analysis (FUN). The proposed fusion method achieves a better result than either the ACL or the HAL method alone. Comparing acoustic analysis with contextual analysis, HAL achieves a higher recognition rate than ACL, which indicates that contextual analysis is more significant than acoustic analysis for clustering confusing multilingual phones. The curves show that phone accuracy first increases with the number of states and finally decreases, owing to the confusing triphone definition and the need for a large multilingual training corpus. The proposed multilingual phone generation approach performs better than the ordinary multilingual triphone sets. In this section, the English and Mandarin triphone sets are defined based on the expansion of the IPA definition. The multilingual speech recognition system for English and Mandarin contains 924 context-dependent triphone models. The best phone recognition accuracy was 67.01%, obtained with a HAL window size of 3. Therefore, this setting was used in the following experiments. Table 5 compares different acoustic and language models for multilingual speech recognition.
For the comparison of monophone- and triphone-based recognition, different phone inventory definitions were evaluated: direct combination of language-dependent phones (MIX), language-dependent IPA phone definition (IPA), tree-based clustering procedure (TRE) (Mak et al. 1996), and the proposed fusion of acoustic and contextual analysis (FUN). In the acoustic comparison, multilingual context-independent (MIX and IPA) and context-dependent (TRE and FUN) phone sets were investigated. With the language model of English and Mandarin, the approach based on MIX achieved 45.81% phone accuracy and the IPA method achieved 66.05% phone accuracy. The IPA performance is evidently better than that of the MIX approach. The TRE method achieved 76.46% phone accuracy and our proposed approach achieved 78.18%. It is clear that the triphone models achieved better performance than the monophone models. There is around 2.25% relative improvement from 76.46% accuracy for the baseline system based on TRE to 78.18% accuracy for the approach using acoustic and contextual analysis.
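The arithmetic behind these figures can be checked with a small sketch. The relative improvement follows directly from the two accuracies quoted above. The accuracy formula is an assumption, since the chapter names the three error classes but does not state the formula: the block uses the HTK-style definition, in which substitutions, deletions and insertions are all subtracted from the number of reference phones. The function names and error counts are illustrative only.

```python
def phone_accuracy(n_ref, n_sub, n_del, n_ins):
    """HTK-style accuracy in percent (assumed definition): (N - Sub - Del - Ins) / N."""
    return 100.0 * (n_ref - n_sub - n_del - n_ins) / n_ref

def relative_improvement(baseline, improved):
    """Relative gain in percent of an improved accuracy over a baseline accuracy."""
    return 100.0 * (improved - baseline) / baseline

# Accuracies reported in the text: TRE baseline versus the proposed FUN approach.
gain = relative_improvement(76.46, 78.18)

# Made-up error counts, only to show how the accuracy is formed.
example_accuracy = phone_accuracy(n_ref=1000, n_sub=150, n_del=50, n_ins=30)
```

The gain evaluates to about 2.25%, which matches the relative improvement stated in the text. The absolute difference between the two systems is 1.72 percentage points.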

Comparison of acoustic and language models for multilingual speech recognition
In order to evaluate the acoustic modeling performance, the experiments were also conducted without a language model. Without the language model, the MIX approach, the IPA method, the TRE method, and the proposed approach achieved phone accuracies of 32.58%, 51.98%, 65.32%, and 67.01%, respectively. In conclusion, multilingual speech recognition obtains the best performance using the FUN approach for the context-dependent phone definition with a language model.

Comparison of monolingual and multilingual speech recognition
In this experiment, the English word and English sentence utterances in the EAT corpus were collected for the evaluation of monolingual speech recognition. A comparison of monolingual and multilingual speech recognition using the EAT corpus is shown in Table 6. In total, 2496 English words, 3072 English sentences and 5884 mixed English and Mandarin utterances were separately used for training. Another 200 utterances were used for evaluation. In the context-dependent condition without a language model, the accuracy for monolingual English words was 76.25%, which is higher than the 67.42% for monolingual English sentences. The phone recognition accuracy of 67.42% for monolingual English sentences is slightly better than the 67.01% for mixed English and Mandarin sentences.

Table 6. Comparison of monolingual (English word, English sentence) and multilingual (English and Mandarin mixed sentence) speech recognition.

Conclusions
In this section, the fusion of acoustic and contextual analysis is proposed to generate phonetic units for mixed-language or multilingual speech recognition. The context-dependent triphones are defined based on the IPA representation. Furthermore, the confusing characteristics of multilingual phone sets are analyzed using acoustic and contextual information. From the acoustic analysis, the acoustic likelihood confusion matrix is constructed from the posterior probabilities of the triphones. From the contextual analysis, the hyperspace analog to language (HAL) approach is employed. Using the multidimensional scaling and data fusion approaches, the combination matrix is built and each phone is represented as a vector. Finally, the modified k-means algorithm is used to cluster the multilingual triphones into a compact and robust phone set. Experimental results show that the proposed approach gives encouraging results.

Conclusions
In this chapter, speech recognition techniques for adverse environments are presented. For speech recognition in noisy environments, two approaches to cepstral feature enhancement for noisy speech recognition using noise-normalized stochastic vector mapping are described. Experimental results show that the proposed approach outperformed the SPLICE-based approach without stereo data on the AURORA2 database. For speech recognition in disfluent environments, an approach to edit disfluency detection and correction for rich transcription is presented. The proposed theoretical approach, based on a two-stage process, aims to model the behavior of edit disfluency and clean up the disfluency. Experimental results indicate that the IP detection mechanism is able to recall IPs by adjusting the threshold in hypothesis testing. For speech recognition in multilingual environments, the fusion of acoustic and contextual analysis is proposed to generate phonetic units for mixed-language or multilingual speech recognition. The confusing characteristics of multilingual phone sets are analyzed using acoustic and contextual information. The modified k-means algorithm is used to cluster the multilingual triphones into a compact and robust phone set. Experimental results show that the proposed approach improves recognition accuracy in multilingual environments.

Author details
Chung-Hsien Wu * and Chao-Hong Liu Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.