Robust Speech Recognition for Adverse Environments

Chung-Hsien Wu; Chao-Hong Liu

doi:10.5772/47843

Author Information


Chung-Hsien Wu*
- Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
Chao-Hong Liu
- Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.

*Address all correspondence to:

1. Introduction

Although state-of-the-art speech recognizers can achieve a very high recognition rate for clean speech, recognition performance generally degrades drastically in noisy environments. Noise-robust speech recognition has therefore become an important task for speech recognition in adverse environments. Recent research on noise-robust speech recognition has mostly focused on two directions: (1) removing the noise from the corrupted noisy signal in signal space or feature space, either through noise filtering such as spectral subtraction (Boll 1979), Wiener filtering (Macho et al. 2002) and RASTA filtering (Hermansky et al. 1994), or through model-based speech or feature enhancement such as SPLICE (Deng et al. 2003) and stochastic vector mapping (Wu et al. 2002); (2) compensating for the noise effect in the acoustic models in model space so that the training environment matches the test environment, for example with PMC (Wu et al. 2004) or multi-condition/multi-style training (Deng et al. 2000). The noise filtering approaches require some prior information, such as the spectral characteristics of the noise. Their performance degrades when the noisy environment varies drastically or when the noise environment is unknown. Furthermore, (Deng et al. 2000; Deng et al. 2003) have shown that denoising or preprocessing is superior to retraining the recognizers under matched noise conditions with no preprocessing.

Stochastic vector mapping (SVM) (Deng et al. 2003; Wu et al. 2002) and sequential noise estimation (Benveniste et al. 1990; Deng et al. 2003; Gales et al. 1996) have been proposed for noise normalization and have achieved significant improvement in noisy speech recognition. However, some drawbacks and limitations remain. First, the performance of sequential noise estimation decreases when the noisy environment varies drastically. Second, the environment mismatch between training data and test data still exists and results in performance degradation. Third, the maximum-likelihood-based stochastic vector mapping (SPLICE) requires annotation of the environment type and stereo training data, yet stereo data are not available for most noisy environments.

To overcome the insufficient tracking ability of the sequential expectation-maximization (EM) algorithm, this chapter introduces prior models that provide more information for sequential noise estimation. Furthermore, environment model adaptation is constructed to reduce the mismatch between the training data and the test data. Finally, a minimum classification error (MCE)-based approach (Wu et al. 2002) is employed so that stereo training data are not required, and unsupervised frame-based auto-clustering is adopted to automatically detect the environment type of the training data (Hsieh et al. 2008).

For recognition of disfluent speech, a number of cues can be observed when edit disfluency occurs in spontaneous speech. These cues can be detected from linguistic features, acoustic features (Shriberg et al. 2000) and integrated knowledge sources (Bear et al. 1992). (Shriberg et al. 2005) outlined phonetic consequences of disfluency to improve models for disfluency processing in speech applications. Four types of disfluency based on intonation, segment duration and pause duration were presented in (Savova et al. 2003). Soltau et al. used a discriminatively trained full covariance Gaussian system for rich transcription (Soltau et al. 2005). (Furui et al. 2005) presented approaches to corpus collection, analysis and annotation for conversational speech processing.

(Charniak et al. 2001) proposed an architecture for parsing transcribed speech in which an edit word detector removes edit words or fillers from the sentence string, and a standard statistical parser then parses the remaining words. The statistical parser and the parameters estimated by boosting were employed to detect and correct the disfluency. (Heeman et al. 1999) presented a statistical language model that is able to identify POS tags, discourse markers, speech repairs and intonational phrases. A noisy channel model was used to model the disfluency in (Johnson et al. 2004). (Snover et al. 2004) combined lexical information and rules generated from 33 rule templates for disfluency detection. (Hain et al. 2005) presented techniques in front-end processing, acoustic modeling, and language and pronunciation modeling for transcribing conversational telephone speech automatically. (Liu et al. 2005) compared HMMs, maximum entropy models, and conditional random fields for disfluency detection in detail.

In this chapter, an approach to the detection and correction of edit disfluency based on word order information is presented (Yeh et al. 2006). The first process attempts to detect the interruption points (IPs) based on hypothesis testing. Acoustic features including duration, pitch and energy features are adopted in hypothesis testing. To circumvent the problems resulting from disfluency, especially edit disfluency, a reliable and robust language model for correcting speech recognition errors is employed. For handling language-related phenomena in edit disfluency, a cleanup language model characterizing the structure of the cleanup sentences and an alignment model for aligning words between the deletable region and the correction part are proposed for edit disfluency detection and correction.

Furthermore, multilinguality frequently occurs in speech content, and the ability of speech recognition systems to process speech in multiple languages has become increasingly desirable due to the trend of globalization. In general, there are different approaches to achieving multilingual speech recognition. One approach employs an external language identification (LID) system (Wu et al. 2006) to first identify the language of the input utterance; the corresponding monolingual system is then selected to perform the speech recognition (Waibel et al. 2000). The accuracy of the external LID system is the main factor in the overall system performance.

Another approach to multilingual speech recognition is to run all the monolingual recognizers in parallel and select the output generated by the recognizer that obtains the maximum likelihood score. The performance of multilingual speech recognition then depends on this subsequent selection of the maximum likelihood sequence. A popular approach to multilingual speech recognition is the use of a multilingual phone set. The multilingual phones are usually created by merging acoustically similar phones across the target languages, in an attempt to obtain a minimal phone set that covers all the sounds existing in all the target languages (Kohler 2001).

In this chapter, an approach to phonetic unit generation for mixed-language or multilingual speech recognition is presented (Huang et al. 2007). The International Phonetic Alphabet (IPA) representation is employed for phonetic unit modeling. Context-dependent triphones for Mandarin and English speech are constructed based on the IPA representation. Acoustic and contextual analysis is used to characterize the properties of the multilingual context-dependent phonetic units. Acoustic likelihood is adopted for the pair-wise similarity estimation of the context-dependent phone models to construct a confusion matrix. The hyperspace analog to language (HAL) model is used for contextual modeling and then for contextual similarity estimation between phone models.

The organization of this chapter is as follows. Section 2 presents two approaches to cepstral feature enhancement for noisy speech recognition using noise-normalized stochastic vector mapping. Section 3 describes an approach to edit disfluency detection and correction for rich transcription. Section 4 describes the fusion of acoustic and contextual analysis to generate phonetic units for mixed-language or multilingual speech recognition. Finally, conclusions are provided in the last section.

2. Speech recognition in noisy environment

In this section, an approach to feature enhancement for noisy speech recognition is presented. Three prior models are introduced to characterize clean speech, noise and noisy speech, respectively. The framework of the system is shown in Figure 1. Sequential noise estimation is employed for prior model construction based on noise-normalized stochastic vector mapping (NN-SVM). With auto-clustering of the estimated noise data, feature enhancement can work without stereo training data and without manual tagging of the background noise type. Environment model adaptation is also adopted to reduce the mismatch between the training data and the test data.

Figure 1.
Diagram of training and test phases for noise-robust speech recognition

2.1. NN-SVM for cepstral feature enhancement

2.1.1. Stochastic Vector Mapping (SVM)

The SVM-based feature enhancement approach estimates the clean speech feature $\hat{x}$ from the noisy speech feature $y$ through an environment-dependent mapping function $F(y;\Theta^{(e)})$, where $\Theta^{(e)}$ denotes the mapping function parameters and $e$ denotes the corresponding environment of the noisy speech feature $y$.

Assuming that the training data of the noisy speech $Y$ can be partitioned into $N_E$ different noisy environments, the feature vectors of $Y$ under an environment $e$ can be modeled by a Gaussian mixture model (GMM) with $N_k$ mixtures:

$$p(y|e;\Omega_e)=\sum_{k=1}^{N_k}p(k|e)\,p(y|k,e)=\sum_{k=1}^{N_k}\omega_k^{e}\cdot N(y;\xi_k^{e},R_k^{e}) \tag{1}$$

where $\Omega_e$ represents the environment model. The clean speech feature $\hat{x}$ can be estimated using a stochastic vector mapping function which is defined as follows:

$$\hat{x}\triangleq F(y;\Theta^{(e)})=y+\sum_{e=1}^{N_E}\sum_{k=1}^{N_k}p(k|y,e)\,r_k^{(e)} \tag{2}$$

where the posterior probability $p(k|y,e)$ can be estimated using Bayes' theorem based on the environment model $\Omega_e$ as follows:

$$p(k|y,e)=\frac{p(k|e)\,p(y|k,e)}{\sum_{j=1}^{N_k}p(j|e)\,p(y|j,e)} \tag{3}$$

and $\Theta^{(e)}=\{r_k^{(e)}\}_{k=1}^{N_k}$ denotes the mapping function parameters. Generally, $\Theta^{(e)}$ is estimated from a set of training data using the maximum likelihood criterion.
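As a minimal sketch of Eqs. (1)-(3) for a single environment, the following NumPy code computes the mixture posteriors and applies the correction vectors. The function names, the diagonal-covariance simplification of $R_k^{e}$ and the array shapes are assumptions made for illustration; they are not taken from the original system.

```python
import numpy as np

def log_gauss_diag(y, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at y."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

def svm_enhance(y, weights, means, variances, corrections):
    """Apply the mapping of Eq. (2) for one environment e.

    weights:     (K,)   mixture weights p(k|e)
    means:       (K, D) Gaussian means
    variances:   (K, D) diagonal covariances (assumed simplification)
    corrections: (K, D) correction vectors r_k^(e)
    """
    log_joint = np.array([
        np.log(w) + log_gauss_diag(y, m, v)
        for w, m, v in zip(weights, means, variances)
    ])
    # Eq. (3): posterior p(k|y,e), normalized in the log domain for stability.
    post = np.exp(log_joint - np.logaddexp.reduce(log_joint))
    # Eq. (2): add the posterior-weighted correction to the noisy feature.
    return y + post @ corrections, post
```

Working in the log domain avoids underflow when a frame lies far from most mixture components, which is common for high-dimensional cepstral features.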

For the estimation of the mapping function parameters $\Theta^{(e)}$, the SPLICE-based approach can be directly adopted if stereo data are available, that is, pairs of a clean speech signal and the noisy signal obtained by corrupting that same clean signal. However, stereo data are not easily available in real-life applications. In this chapter an MCE-based approach is proposed to overcome this limitation. Furthermore, the environment type of the noisy speech data is needed for training the environment model $\Theta^{(e)}$. Conventionally, the noisy speech data are manually classified into $N_E$ noisy environment types. This strategy assigns each noisy speech file to only one environment type and is very time consuming. In practice, each noisy speech file contains several segments with different types of noisy environment. Since the noisy speech annotation affects the purity of the training data for the environment model $\Theta^{(e)}$, this section introduces a frame-based unsupervised noise clustering approach to construct a more precise categorization of the noisy speech.

2.1.2. Noise-Normalized Stochastic Vector Mapping (NN-SVM)

In (Boll 1979), the concept of noise normalization is proposed to reduce the effect of background noise in noisy speech for feature enhancement. If the noise feature vector $\tilde{n}$ of each frame can be estimated first, the NN-SVM is obtained from Eq. (2) by replacing $y$ and $\hat{x}$ with $y-\tilde{n}$ and $\hat{x}-\tilde{n}$ as

$$\hat{x}-\tilde{n}\triangleq F(y-\tilde{n};\Theta^{(e)})=y-\tilde{n}+\sum_{e=1}^{N_E}\sum_{k=1}^{N_k}p(k|y-\tilde{n},e)\,r_k^{(e)} \tag{4}$$

The noise normalization process makes the environment model $\Omega_e$ more noise-tolerant. The algorithm used to estimate the noise feature vector $\tilde{n}$ therefore plays an important role in noise-normalized stochastic vector mapping.

2.2. Prior model for sequential noise estimation

This section employs a frame-based sequential noise estimation algorithm (Benveniste et al. 1990; Deng et al. 2003; Gales et al. 1996) that incorporates the prior models. In the procedure, only the noisy speech feature vector of the current frame is observed. Since the noise and clean speech feature vectors are both missing, the relation among clean speech, noise and noisy speech is required first. The sequential EM algorithm is then introduced for online noise estimation based on this relation. In the meantime, the prior models are used to provide more information for noise estimation.

2.2.1. The acoustic environment model

The nonlinear acoustic environment model is introduced first for noise estimation in (Deng et al. 2003). Given the cepstral features of a clean speech $x$, an additive noise $n$ and a channel distortion $h$, the approximated nonlinear relation among $x$, $n$, $h$ and the corrupted noisy speech $y$ in the cepstral domain is estimated as:

$$y\approx h+x+g(n-h-x),\qquad g(z)=C\ln\left(I+\exp[C^{T}(z)]\right) \tag{5}$$

where $C$ denotes the discrete cosine transform matrix. In order to linearize the nonlinear model, the first-order Taylor series expansion is used around two updated operating points $n_0$ and $\mu_0^{x}$, denoting the initial noise feature and the mean vector of the prior clean speech model, respectively. By ignoring the channel distortion effect, that is, setting $h=0$, Eq. (5) becomes:

$$y\approx\mu_0^{x}+g(n_0-\mu_0^{x})+G(n_0-\mu_0^{x})(x-\mu_0^{x})+[I-G(n_0-\mu_0^{x})](n-n_0) \tag{6}$$

where

$$G(z)=-C\,\mathrm{diag}\left(\frac{I}{I+\exp[C^{T}z]}\right)C^{T} \tag{7}$$
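The nonlinear relation of Eq. (5) can be checked numerically with a short sketch. It assumes a square orthonormal DCT-II matrix, so that $C^{T}$ is the inverse transform, and it ignores the channel term ($h=0$); the function names are illustrative only, and a real front end would normally truncate the cepstrum.

```python
import numpy as np

def dct_matrix(dim):
    """Orthonormal DCT-II matrix (assumed form of C), so C.T inverts C."""
    k = np.arange(dim)[:, None]
    n = np.arange(dim)[None, :]
    C = np.sqrt(2.0 / dim) * np.cos(np.pi * (2 * n + 1) * k / (2 * dim))
    C[0, :] /= np.sqrt(2.0)
    return C

def g(z, C):
    """Eq. (5): g(z) = C ln(1 + exp(C^T z)), applied element-wise in the log-spectral domain."""
    return C @ np.log1p(np.exp(C.T @ z))

def noisy_cepstrum(x, n, C):
    """Eq. (5) with the channel distortion ignored (h = 0)."""
    return x + g(n - x, C)
```

Under these assumptions, mapping the result back with `C.T` gives the log of the sum of the speech and noise spectra, which is the additive-noise model that Eq. (5) expresses in the cepstral domain.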

2.2.2. The prior models

The three prior models $\Phi_n$, $\Phi_x$ and $\Phi_y$, which denote the noise, clean speech and noisy speech models respectively, can provide more information for sequential noise estimation. First, the noise and clean speech prior models are characterized by GMMs as:

$$p(n;\Phi_n)=\sum_{d=1}^{N_d}w_d^{n}\cdot N(n;\mu_d^{n},\Sigma_d^{n}),\qquad p(x;\Phi_x)=\sum_{m=1}^{N_m}w_m^{x}\cdot N(x;\mu_m^{x},\Sigma_m^{x}) \tag{8}$$

where pre-training data for noise and clean speech are required to train the model parameters of the two GMMs, $\Phi_n$ and $\Phi_x$.

Since the prior noisy speech model is needed in sequential noise estimation, the noisy speech model parameters are derived from the prior clean speech and noise models using the approximated linear model around the two operating points $\mu_0^{n}$ and $\mu_0^{x}$ as follows:

$$p(y;\Phi_y)=\sum_{m=1}^{N_m}\sum_{d=1}^{N_d}w_{m,d}^{y}\cdot N(y;\mu_{m,d}^{y},\Sigma_{m,d}^{y}) \tag{9}$$

$$\begin{aligned}
\mu_{m,d}^{y}&=\mu_0^{x}+g(\mu_0^{n}-\mu_0^{x})+G(\mu_0^{n}-\mu_0^{x})(\mu_m^{x}-\mu_0^{x})+[I-G(\mu_0^{n}-\mu_0^{x})](\mu_d^{n}-\mu_0^{n})\\
\Sigma_{m,d}^{y}&=[I+G(\mu_0^{n}-\mu_0^{x})]\,\Sigma_m^{x}\,[I+G^{T}(\mu_0^{n}-\mu_0^{x})]^{T}\\
\mu_0^{n}&=E[\mu_d^{n}],\qquad \mu_0^{x}=E[\mu_m^{x}],\qquad w_{m,d}^{y}=w_m^{x}\cdot w_d^{n}
\end{aligned} \tag{10}$$

The noisy speech prior model is employed to search for the most similar clean speech mixture component and noise mixture component in sequential noise estimation.

2.2.3. Sequential noise estimation

The sequential EM algorithm is employed for sequential noise estimation. In this section, the prior clean speech, noise and noisy speech models are used to construct a robust noise estimation procedure. Based on the sequential EM algorithm, the estimated noise is obtained from $n_{t+1}=\arg\max_{n}Q_{t+1}(n)$. In the E-step of the sequential EM algorithm, an objective function is defined as:

$$Q_{t+1}(n)\triangleq E\left[\ln p(y_1^{t+1},M_1^{t+1},D_1^{t+1}|n)\,\middle|\,y_1^{t+1},n_1^{t}\right] \tag{11}$$

where $M_1^{t+1}$ and $D_1^{t+1}$ denote the mixture index sequences of the clean speech GMM and the noise GMM in which the noisy speech $y$ occurs from frame 1 to frame $t+1$. The objective function is simplified for the M-step as:

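where $\delta_{m_\tau,m}$ denotes the Kronecker delta function and $\gamma_\tau(m,d)$ denotes the posterior probability. $\gamma_\tau(m,d)$ can be estimated according to the Bayes rule as:

where the likelihood $p(y_\tau|m,d,n_{\tau-1})$ can be approximated using the approximated linear model as:

$$\begin{aligned}
p(y_\tau|m,d,n_{\tau-1})&\sim N\left[y_\tau;\mu_{m,d}^{y}(n_{\tau-1}),\Sigma_{m,d}^{y}\right]\\
\mu_{m,d}^{y}(n_{\tau-1})&=\mu_0^{x}+g(n_0-\mu_0^{x})+G(n_0-\mu_0^{x})(\mu_m^{x}-\mu_0^{x})+[I-G(n_0-\mu_0^{x})](n_{\tau-1}-n_0)\\
\Sigma_{m,d}^{y}&=[I+G(n_0-\mu_0^{x})]\,\Sigma_m^{x}\,[I+G^{T}(n_0-\mu_0^{x})]^{T}
\end{aligned} \tag{14}$$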
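The displayed forms of the simplified objective and of the posterior are not reproduced in the text above. As a hedged sketch only, and not the authors' verified equations, they would plausibly take the following form, assuming that the simplified objective is Eq. (15) without the forgetting factor and that the mixture weights of Eq. (8) act as priors over the pair $(m,d)$:

$$Q_{t+1}(n)=\sum_{\tau=1}^{t+1}\sum_{m=1}^{N_m}\sum_{d=1}^{N_d}E\left[\delta_{m_\tau,m}\,\delta_{d_\tau,d}\,\middle|\,y_1^{t+1},n_1^{t}\right]\ln p(y_\tau|m,d,n)+\mathrm{Const}=\sum_{\tau=1}^{t+1}\sum_{m=1}^{N_m}\sum_{d=1}^{N_d}\gamma_\tau(m,d)\,\ln p(y_\tau|m,d,n)+\mathrm{Const}$$

$$\gamma_\tau(m,d)=\frac{p(y_\tau|m,d,n_{\tau-1})\,w_m^{x}\,w_d^{n}}{\sum_{m'=1}^{N_m}\sum_{d'=1}^{N_d}p(y_\tau|m',d',n_{\tau-1})\,w_{m'}^{x}\,w_{d'}^{n}}$$

The second expression is simply Bayes' rule applied to the likelihood of Eq. (14); the exact conditioning used in the original derivation may differ.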

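In addition, a forgetting factor $\varepsilon$ is employed to control the effect of the features of the preceding frames.

$$\begin{aligned}
Q_{t+1}(n)&=\sum_{\tau=1}^{t+1}\varepsilon^{t+1-\tau}\sum_{m=1}^{N_m}\sum_{d=1}^{N_d}\gamma_\tau(m,d)\cdot\ln p(y_\tau|m,d,n)+\mathrm{Const}\\
\tilde{Q}_{t+1}(n)&=-\sum_{\tau=1}^{t+1}\varepsilon^{t+1-\tau}\sum_{m=1}^{N_m}\sum_{d=1}^{N_d}\gamma_\tau(m,d)\cdot[y_\tau-\mu_{m,d}^{y}(n)]^{T}(\Sigma_{m,d}^{y})^{-1}[y_\tau-\mu_{m,d}^{y}(n)]=\varepsilon\tilde{Q}_t(n)-R_{t+1}(n)\\
R_{t+1}(n)&=\sum_{m=1}^{N_m}\sum_{d=1}^{N_d}\gamma_{t+1}(m,d)\cdot[y_{t+1}-\mu_{m,d}^{y}(n)]^{T}(\Sigma_{m,d}^{y})^{-1}[y_{t+1}-\mu_{m,d}^{y}(n)]
\end{aligned} \tag{15}$$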

In the M-step, the iterative stochastic approximation is introduced to derive the solution. Finally, sequential noise estimation is performed as follows:

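$$\begin{aligned}
n_{t+1}&=n_t+(K_{t+1})^{-1}s_{t+1}\\
K_{t+1}&=-\left.\frac{\partial^{2}Q_{t+1}}{\partial n^{2}}\right|_{n=n_t}=\sum_{\tau=1}^{t+1}\varepsilon^{t+1-\tau}\sum_{m=1}^{N_m}\sum_{d=1}^{N_d}\gamma_\tau(m,d)\,[I-G(n_0-\mu_0^{x})]^{T}(\Sigma_{m,d}^{y})^{-1}[I-G(n_0-\mu_0^{x})]\\
s_{t+1}&=-\left.\frac{\partial R_{t+1}}{\partial n}\right|_{n=n_t}=\sum_{m=1}^{N_m}\sum_{d=1}^{N_d}\gamma_{t+1}(m,d)\,[I-G(n_0-\mu_0^{x})]^{T}(\Sigma_{m,d}^{y})^{-1}[y_{t+1}-\mu_{m,d}^{y}(n_t)]
\end{aligned} \tag{16}$$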

The prior models are used to search for the most similar noise and clean speech mixture components. Given these two mixture components, the estimate of the posterior probability $\gamma_\tau(m,d)$ becomes more accurate.
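One step of the recursion in Eq. (16) can be sketched as follows. The sketch assumes that the posteriors, the predicted noisy means and the inverse covariances have already been computed for the new frame, and that the matrix `J`, standing for $[I-G(n_0-\mu_0^{x})]$, stays fixed between frames so that $K_{t+1}$ can be accumulated as $\varepsilon K_t$ plus the new-frame term. All names and the default forgetting factor are illustrative assumptions.

```python
import numpy as np

def sequential_noise_step(n_t, K_prev, y_next, gammas, mu_y_at_nt,
                          precisions, J, forget=0.95):
    """One update n_t -> n_{t+1} following Eq. (16).

    n_t:        (D,)      current noise estimate
    K_prev:     (D, D)    accumulated K_t (zeros before the first frame)
    y_next:     (D,)      noisy feature y_{t+1}
    gammas:     (P,)      posteriors gamma_{t+1}(m, d), flattened over pairs
    mu_y_at_nt: (P, D)    predicted noisy means mu^y_{m,d}(n_t)
    precisions: (P, D, D) inverse covariances of the noisy speech prior
    J:          (D, D)    assumed fixed matrix [I - G(n_0 - mu_0^x)]
    """
    K_new = forget * K_prev              # discount the older frames
    s = np.zeros_like(n_t)
    for gam, mu, prec in zip(gammas, mu_y_at_nt, precisions):
        K_new = K_new + gam * J.T @ prec @ J
        s = s + gam * J.T @ prec @ (y_next - mu)
    # Solve K_{t+1} * step = s_{t+1} instead of forming the inverse explicitly.
    n_next = n_t + np.linalg.solve(K_new, s)
    return n_next, K_new
```

A smaller forgetting factor lets the estimate follow fast-changing noise at the cost of a noisier track; the value 0.95 here is only a placeholder.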

2.3. Environment model adaptation

Because the prior models are usually not complete enough to represent all possible data, the environment mismatch between the training data and the test data degrades feature enhancement performance. In this section, an environment model adaptation strategy applied before the test phase is proposed to deal with this problem. The environment model adaptation procedure contains two parts. The first is model parameter adaptation of the noise prior model $\Phi_n$ and the noisy speech prior model $\Phi_y$ in the training phase and the adaptation phase. The second is adaptation of the noise-normalized SVM function parameters $\Theta^{(e)}$ and the environment model $\Omega_e$ in the adaptation phase.

2.3.1. Model adaptation on noise and noisy speech prior models

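For noise and noisy speech prior model adaptation, MAP adaptation is first applied to the noise prior model $\Phi_n$. Given $T$ frames of adaptation noise data $z$, estimated using the un-adapted prior models, the adaptation equations for the noise prior model parameters are defined as:

$$\begin{aligned}
\tilde{w}_d&=\frac{(\nu_d-1)+\sum_{t=1}^{T}s_{d,t}}{\sum_{d=1}^{N_d}(\nu_d-1)+\sum_{d=1}^{N_d}\sum_{t=1}^{T}s_{d,t}}\\
\tilde{\mu}_d^{n}&=\frac{\tau_d\rho_d+\sum_{t=1}^{T}s_{d,t}\cdot z_t}{\tau_d+\sum_{t=1}^{T}s_{d,t}}\\
(\tilde{\Sigma}_d^{n})^{-1}&=\frac{\upsilon_d+\sum_{t=1}^{T}s_{d,t}(z_t-\tilde{\mu}_d^{n})(z_t-\tilde{\mu}_d^{n})^{T}+\tau_d(\rho_d-\tilde{\mu}_d^{n})(\rho_d-\tilde{\mu}_d^{n})^{T}}{(\alpha_d-p)+\sum_{t=1}^{T}s_{d,t}}
\end{aligned} \tag{17}$$

where the conjugate prior density of the mixture weights is the Dirichlet distribution with hyper-parameters $\nu_d$, and the joint conjugate prior density of the mean and variance parameters is the Normal-Wishart distribution with hyper-parameters $\tau_d$, $\rho_d$, $\alpha_d$ and $\upsilon_d$. The two distributions are defined as follows:

$$\begin{aligned}
g(w_1,\ldots,w_{N_d}|\nu_1,\ldots,\nu_{N_d})&\propto\prod_{d=1}^{N_d}w_d^{\nu_d-1}\\
g(\mu_d^{n},\Sigma_d^{n}|\tau_d,\rho_d,\alpha_d,\upsilon_d)&\propto|\Sigma_d^{n}|^{(\alpha_d-p)/2}\exp\left[-\frac{\tau_d}{2}(\mu_d^{n}-\rho_d)^{T}\Sigma_d^{n}(\mu_d^{n}-\rho_d)\right]\exp\left[-\frac{1}{2}\mathrm{tr}(\upsilon_d\Sigma_d^{n})\right]
\end{aligned} \tag{18}$$

where $\nu_d>0$, $\alpha_d>p-1$ and $\tau_d>0$. After adaptation of the noise prior model, the noisy speech prior model $\Phi_y$ is adapted using the clean speech prior model $\Phi_x$ and the newly adapted noise prior model $\Phi_n$ based on Eq. (8).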
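The weight and mean updates of Eq. (17) can be sketched as below. The sketch assumes that $s_{d,t}$ is the posterior occupation probability of mixture $d$ for frame $z_t$, as is usual in MAP adaptation of GMMs; the text does not define it explicitly. The function name and array layout are illustrative.

```python
import numpy as np

def map_adapt_weights_means(z, resp, prior_means, tau, nu):
    """MAP re-estimation of mixture weights and means, following Eq. (17).

    z:           (T, D)  adaptation noise frames z_t
    resp:        (T, Nd) assumed occupation probabilities s_{d,t}
    prior_means: (Nd, D) hyper-parameters rho_d
    tau:         (Nd,)   hyper-parameters tau_d
    nu:          (Nd,)   Dirichlet hyper-parameters nu_d
    """
    occ = resp.sum(axis=0)                         # sum over t of s_{d,t}
    weights = (nu - 1.0 + occ) / np.sum(nu - 1.0 + occ)
    # Interpolate between the prior mean rho_d and the data mean.
    means = (tau[:, None] * prior_means + resp.T @ z) / (tau + occ)[:, None]
    return weights, means
```

The hyper-parameter $\tau_d$ acts like a count of pseudo-observations located at $\rho_d$: with little adaptation data the adapted mean stays close to the prior, and with more data it moves toward the sample mean.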

2.3.2. Model adaptation of noise-normalized SVM (NN-SVM)

For NN-SVM adaptation, the model parameters $\Omega_e$ and the mapping function parameters in $F(y;\Theta^{(e)})$ need to be adapted in the adaptation phase. First, adaptation of the model parameters $\Omega_e$ is similar to that of the noise prior model. Second, the adaptation of $\Theta^{(e)}=\{r_k^{(e)}\}_{k=1}^{N_k}$ is an iterative procedure. Since $\Theta^{(e)}=\{r_k^{(e)}\}_{k=1}^{N_k}$ is not a random variable and does not follow any conjugate prior density, a maximum likelihood (ML)-based adaptation, similar to the correction vector estimation of SPLICE, is employed as:

$$\tilde{r}_k^{(e)}=\frac{\sum_{t}p(k|y_t-\tilde{n},e)(\tilde{x}_t-y_t)}{\sum_{t}p(k|y_t-\tilde{n},e)} \tag{19}$$

where the temporary clean speech estimates $\tilde{x}_t$ are obtained using the un-adapted noise-normalized stochastic mapping function in Eq. (4).
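Eq. (19) is a posterior-weighted average of the differences between the temporary clean estimates and the noisy features. A minimal NumPy sketch, with illustrative names and assuming every mixture receives non-zero total posterior mass, is:

```python
import numpy as np

def reestimate_corrections(post, x_tmp, y):
    """Re-estimate the correction vectors r_k^(e) following Eq. (19).

    post:  (T, K) posteriors p(k | y_t - n, e)
    x_tmp: (T, D) temporary clean speech estimates
    y:     (T, D) noisy speech features
    Assumes each column of `post` has a non-zero sum.
    """
    numerator = post.T @ (x_tmp - y)          # (K, D) weighted sums
    denominator = post.sum(axis=0)[:, None]   # (K, 1) total posterior mass
    return numerator / denominator
```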

2.4. Experimental results

Table 1 shows the experimental results of the proposed approach on the AURORA2 database. The AURORA2 database contains both clean and noisy utterances of the TIDIGITS corpus and is available from ELDA (Evaluations and Language resources Distribution Agency). Results of two previous methods are listed for comparison, and three experiments were conducted with the proposed approach, giving the following conditions: no denoising, SPLICE with recursive EM using stereo data (Deng et al. 2003), the proposed approach using manual annotation without adaptation, and the proposed approach using auto-clustered training data without and with adaptation. The overall results show that, without stereo training data or manual annotation, the proposed approach slightly outperforms the SPLICE-based approach with the recursive EM algorithm. Furthermore, for multi-condition training, the improvements of 0.11% on Set B (background noise types different from the training data) and 0.04% on Set C (background noise types and channel characteristics different from the training data) show that environment model adaptation can slightly reduce the mismatch between the training data and the test data.

Methods	Training mode	Set A	Set B	Set C	Overall
No Denoising	Multi-condition	87.82	86.27	83.78	86.39
No Denoising	Clean only	61.34	55.75	66.14	60.06
MCE	Multi-condition	92.92	89.15	90.09	90.85
MCE	Clean only	87.82	85.34	83.77	86.02
SPLICE with Recursive-EM	Multi-condition	91.49	89.16	89.62	90.18
SPLICE with Recursive-EM	Clean only	87.82	87.09	85.08	86.98
Proposed approach (manual tag, no adaptation)	Multi-condition	91.42	89.18	89.85	90.21
Proposed approach (manual tag, no adaptation)	Clean only	87.84	86.77	85.23	86.89
Proposed approach (auto-clustering, no adaptation)	Multi-condition	91.06	90.79	90.77	90.89
Proposed approach (auto-clustering, no adaptation)	Clean only	87.56	87.33	86.32	87.22
Proposed approach (auto-clustering, with adaptation)	Multi-condition	91.07	90.90	90.81	90.95
Proposed approach (auto-clustering, with adaptation)	Clean only	87.55	87.44	86.38	87.27

Table 1.

Experimental results (%) on AURORA2

2.5. Conclusions

In this section, two approaches to cepstral feature enhancement for noisy speech recognition using noise-normalized stochastic vector mapping were presented. Prior models were introduced for more precise noise estimation, and environment model adaptation was constructed to reduce the environment mismatch between the training data and the test data. Experimental results on the AURORA2 database demonstrate that the proposed approach, which requires no stereo data, slightly outperforms the SPLICE-based approach.

3. Speech recognition in disfluent environment

In this section, a novel approach to detecting and correcting edit disfluency in spontaneous speech is presented. Hypothesis testing using acoustic features is first adopted to detect potential interruption points (IPs) in the input speech. The word order of the utterance is then cleaned up based on the potential IPs using a class-based cleanup language model. The deletable region and the correction are aligned using an alignment model. Finally, a log-linear weighting mechanism is applied to optimize the performance.

3.1. Edit disfluency analysis

In conversational utterances, several problems such as interruptions, corrections, filled pauses, and ungrammatical sentences are detrimental to speech recognition. The definitions of disfluencies have been discussed in SimpleMDE. Edit disfluencies are portions of speech in which a speaker's utterance is not complete and fluent; instead the speaker corrects or alters the utterance, or abandons it entirely and starts over. In general, edit disfluencies can be divided into four categories: repetitions, revisions, restarts and complex disfluencies. Since complex disfluencies consist of multiple or nested edits, it seems reasonable to consider them as combinations of the other, simple disfluencies: repetitions, revisions, and restarts. Edit disfluencies have a complex internal structure, consisting of the deletable region (delreg), the interruption point (IP) and the correction. Editing terms such as fillers, particles and markers are optional and follow the IP in an edit disfluency.

In spontaneous speech, acoustic features such as short pauses (silence and fillers), energy and pitch reset generally appear along with the occurrence of edit disfluency. Based on these features, the possible IPs can be detected. Furthermore, since IPs generally appear at the boundary of two successive words, unlikely IPs whose positions fall within a word can be excluded. In addition, since the structural patterns of the deletable word sequence and the correction word sequence are very similar, the deletable word sequence in an edit disfluency can be replaced by the correction word sequence.

3.2. Framework of edit disfluency transcription system

The overall transcription task for conversational speech with edit disfluency in the proposed method is composed of two main mechanisms: the IP detection module and the edit disfluency correction module. The framework is shown in Figure 2. The IP detection module first predicts the potential IPs. The edit disfluency correction module then generates the rich transcription, which contains the interruption information, the text transcription of the speaker's utterances and the cleaned-up text transcription without disfluencies. Figure 3 shows the correction process for edit disfluency.

In the IP detection module, the speech signal is fed to both the acoustic feature extraction module and the speech recognition engine. Information about the durations of syllables and silences from speech recognition is provided for acoustic feature extraction. Combined with this side information from speech recognition, duration-, pitch-, and energy-related features are extracted and used to model the IPs using a Gaussian mixture model (GMM). In addition, in order to perform hypothesis testing for IP detection, an anti-IP GMM is constructed from the features extracted from the non-IP regions. The hypothesis test verifies whether the posterior probability of the acoustic features of a syllable boundary is above a threshold and thereby determines whether the syllable boundary is an IP. Since an IP is an event that happens at an inter-word location, detected IPs that do not fall on a word boundary are removed.

Figure 2.
The framework of transcription system for spontaneous speech with edit disfluencies

Figure 3.
The correction process for the edit disfluency

There are two processing stages in the edit disfluency correction module: cleanup and alignment. As shown in Figure 4, the cleanup process divides the word string into three parts, the deletable region (delreg), the editing term, and the correction, according to the locations of the potential IPs detected by the IP detection module. Cleanup is performed by shifting the correction part so that it replaces the deletable region, forming a new cleanup transcription. The edit disfluency correction module is composed of an n-gram language model and the alignment model. The n-gram model regards the cleanup transcriptions as fluent utterances and models their word order information. The alignment model finds the optimal correspondence between the deletable region and the correction in an edit disfluency.

Figure 4.
The cleanup language model for the edit disfluency
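The cleanup step described above can be illustrated with a small sketch. It assumes the boundaries of the deletable region, the IP and the editing terms are already known; in the actual system they are hypotheses scored by the language and alignment models. The function name and the example sentence are made up for illustration.

```python
def cleanup(words, delreg_start, ip_index, n_editing_terms=0):
    """Return the cleaned-up word list for one edit disfluency.

    words[delreg_start:ip_index] is the deletable region, the IP lies just
    before words[ip_index], the next n_editing_terms words are editing
    terms, and the correction starts right after them.
    """
    correction_start = ip_index + n_editing_terms
    # Drop the deletable region and the editing terms; the correction
    # shifts left to take the place of the deletable region.
    return words[:delreg_start] + words[correction_start:]


utterance = "I want to go to Taipei uh I mean to Tainan tomorrow".split()
cleaned = cleanup(utterance, delreg_start=4, ip_index=6, n_editing_terms=3)
print(" ".join(cleaned))
```

In this toy utterance, "to Taipei" is the deletable region, "uh I mean" is the editing term, and "to Tainan" is the correction.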

3.3. Potential interruption point detection

For IP detection, instead of detecting the exact IP, potential IPs are selected for further processing. Since the IP is the point at which the speaker breaks off the deletable region, some acoustic events accompany it. For syllabic languages like Chinese, every character is pronounced as a monosyllable, while a word is composed of one to several syllables. The speech input of a syllabic language with $n$ syllables can be described as a sequence,

$$\mathrm{Seq}_{\mathrm{syllable\_silence}}\equiv \mathrm{syllable}_1,\mathrm{silence}_1,\mathrm{syllable}_2,\mathrm{silence}_2,\ldots,\mathrm{silence}_{n-1},\mathrm{syllable}_n,$$

and this sequence can then be separated into a syllable sequence

$$\mathrm{Seq}_{\mathrm{syllable}}\equiv \mathrm{syllable}_1,\mathrm{syllable}_2,\ldots,\mathrm{syllable}_n,$$

and a silence sequence

$$\mathrm{Seq}_{\mathrm{silence}}\equiv \mathrm{silence}_1,\mathrm{silence}_2,\ldots,\mathrm{silence}_{n-1}.$$

We model the interruption detection problem as choosing between $H_0$, the hypothesis that the IP is not embedded in the silence, and $H_1$, the hypothesis that the IP is embedded in the silence. The likelihood ratio test is employed to detect the potential IPs. The function $L(\mathrm{Seq}_{\mathrm{syllable\_silence}})$ is termed the likelihood ratio since it indicates, for each value of $\mathrm{Seq}_{\mathrm{syllable\_silence}}$, the likelihood of $H_1$ versus the likelihood of $H_0$.

$$L(\mathrm{Seq}_{\mathrm{syllable\_silence}})=\frac{P(\mathrm{Seq}_{\mathrm{syllable\_silence}};H_1)}{P(\mathrm{Seq}_{\mathrm{syllable\_silence}};H_0)} \tag{20}$$

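By introducing the threshold $\gamma$ to adjust the precision and recall rates, $H_1: L(\mathrm{Seq}_{\mathrm{syllable\_silence}})\geq\gamma$ means that the IP is embedded in $\mathrm{silence}_k$. Conceptually, $\mathrm{silence}_k$ is then a potential IP. Under the assumption of independence, the probability of an IP appearing in $\mathrm{silence}_k$ can be regarded as the product of the probabilities obtained from $\mathrm{silence}_k$ and from the syllables around it. The probability density functions (PDFs) under each hypothesis are denoted and estimated as

$$P(\mathrm{Seq}_{\mathrm{syllable\_silence}};H_1)=P(\mathrm{Seq}_{\mathrm{syllable\_silence}}|E_{ip})=P(\mathrm{Seq}_{\mathrm{silence}}|E_{ip})\times P(\mathrm{Seq}_{\mathrm{syllable}}|E_{ip}) \tag{21}$$

and

$$P(\mathrm{Seq}_{\mathrm{syllable\_silence}};H_0)=P(\mathrm{Seq}_{\mathrm{syllable\_silence}}|\neg E_{ip})=P(\mathrm{Seq}_{\mathrm{silence}}|\neg E_{ip})\times P(\mathrm{Seq}_{\mathrm{syllable}}|\neg E_{ip}) \tag{22}$$

where $E_{ip}$ denotes that the IP is embedded in $\mathrm{silence}_k$ and $\neg E_{ip}$ means that the IP does not appear in $\mathrm{silence}_k$, that is,

$$E_{ip}:\ \text{interruption point}\in\mathrm{silence}_k,\qquad \neg E_{ip}:\ \text{interruption point}\notin\mathrm{silence}_k$$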

3.3.1. IP detection using posterior probability of silence duration

Since IPs always appear at inter-syllable positions, the $n-1$ silence positions between the $n$ syllables are considered as the IP candidates. IP detection can thus be treated as the problem of verifying whether each of the $n-1$ silence positions is an IP or not. In conversation, speakers may hesitate while searching for the correct words when disfluency appears. Hesitation is usually realized as a pause. Since the length of the silence is very sensitive to disfluency, we use normal distributions to model the probabilities that an IP appears and does not appear in $\mathrm{silence}_k$, respectively.

$$P(\mathrm{Seq}_{\mathrm{silence}}|E_{ip})=\frac{2}{\sqrt{2\pi}\,\sigma_{ip}}\exp\left(-\frac{(\mathrm{Seq}_{\mathrm{silence}}-\mu_{ip})^{2}}{2\sigma_{ip}^{2}}\right) \tag{23}$$

$$P(\mathrm{Seq}_{\mathrm{silence}}|\neg E_{ip})=\frac{2}{\sqrt{2\pi}\,\sigma_{nip}}\exp\left(-\frac{(\mathrm{Seq}_{\mathrm{silence}}-\mu_{nip})^{2}}{2\sigma_{nip}^{2}}\right) \tag{24}$$

where $\mu_{ip}$ and $\sigma_{ip}^{2}$ denote the mean and variance of the duration of silences containing an IP, and $\mu_{nip}$ and $\sigma_{nip}^{2}$ denote the mean and variance of the duration of silences not containing an IP.
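The silence-duration part of the likelihood ratio test in Eq. (20) can be sketched as follows. The constant factor in Eqs. (23) and (24) is the same under both hypotheses and cancels in the ratio, so a plain normal density is used. The parameter values in the usage note are made-up placeholders, not values estimated in this chapter.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (math.sqrt(2 * math.pi) * std)

def silence_likelihood_ratio(duration, mu_ip, sigma_ip, mu_nip, sigma_nip):
    """Ratio of P(silence | E_ip) to P(silence | not E_ip) for one silence duration."""
    return normal_pdf(duration, mu_ip, sigma_ip) / normal_pdf(duration, mu_nip, sigma_nip)

def is_potential_ip(duration, threshold, mu_ip, sigma_ip, mu_nip, sigma_nip):
    """Accept H1 (IP embedded in the silence) when the ratio reaches the threshold."""
    ratio = silence_likelihood_ratio(duration, mu_ip, sigma_ip, mu_nip, sigma_nip)
    return ratio >= threshold
```

For example, with hypothetical parameters `mu_ip=0.4`, `sigma_ip=0.15`, `mu_nip=0.1`, `sigma_nip=0.08` (in seconds), a 0.5 s pause yields a ratio well above 1, while a 0.05 s pause yields a ratio below 1. Raising the threshold trades recall for precision.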

3.3.2. Syllable-based acoustic feature extraction

Acoustic features including duration, pitch, and energy for each syllable (Soltau et al. 2005) are adopted for IP detection. A feature vector of the syllables within an observation window around the silence is formed as the input of the GMM. That is, we are interested in the syllables around a silence that may be an IP. A window of $2w$ syllables, with $w$ syllables before and $w$ syllables after $\mathrm{silence}_k$, is used. First, the subscripts are shifted according to the position of the silence as $\mathrm{Syl}_{n-k}\leftarrow\mathrm{Syl}_{n}$. The features of the syllables within the observation window are then extracted.

Since syllable durations vary even for the same syllable, the duration ratio is defined as the average duration of the syllable normalized by the average duration over all syllables.

$$\mathrm{nfduration}_i\equiv\frac{\sum_{j=1}^{n_i}\mathrm{duration}(\mathrm{syllable}_{i,j})}{\sum_{i=1}^{|\mathrm{syllable}|}\sum_{j=1}^{n_i}\mathrm{duration}(\mathrm{syllable}_{i,j})} \tag{25}$$

where $\mathrm{syllable}_{i,j}$ denotes the $j$-th sample of $\mathrm{syllable}_i$ in the corpus, $|\mathrm{syllable}|$ is the number of distinct syllables, and $n_i$ is the number of occurrences of $\mathrm{syllable}_i$ in the corpus. Similarly, for energy and pitch, frame-based statistics are used to calculate the normalized features for each syllable.

Considering the result of speech recognition, the features are normalized to give the first-order features. To model the speaking rate and the variation in energy and pitch during the utterance, the second-order features, called delta-duration, delta-energy and delta-pitch, are obtained from the forward differences of the first-order features. The following equation shows the estimation of delta-duration; the same form applies to delta-energy and delta-pitch.

$$\Delta\mathrm{nfduration}_i=\begin{cases}\mathrm{nfduration}_{i+1}-\mathrm{nfduration}_i & \text{if } -w<i<w\\ 0 & \text{otherwise}\end{cases} \tag{26}$$

Where w is half of the observation window size. Totally, there are three kinds of two orders features after feature extraction. We combine these features to form a vector with 24w-6 features to be the observation vector of the GMM. The acoustic features are denoted as the syllable-based observationsequence that corresponds to the potential IP, silence_k, by

{O=[OD,OP,OE]∈Rdim}E27

Where Os∈Rdims, S∈{D,P,E}represents the single kind feature vectors and dim means the dimensions of the feature vector consisting of duration-related, pitch-related and energy-related features. The following equation shows the estimation for duration-related features.

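$$O_D \equiv \left[ \begin{array}{l} \mathrm{nfduration}_{-w+1}, \ldots, \mathrm{nfduration}_{-1}, \mathrm{nfduration}_{0}, \mathrm{nfduration}_{+1}, \mathrm{nfduration}_{+2}, \ldots, \mathrm{nfduration}_{+w}, \\ \Delta\mathrm{nfduration}_{-w+1}, \ldots, \Delta\mathrm{nfduration}_{-1}, \Delta\mathrm{nfduration}_{0}, \Delta\mathrm{nfduration}_{+1}, \Delta\mathrm{nfduration}_{+2}, \ldots, \Delta\mathrm{nfduration}_{+w-1} \end{array} \right]^{T} \tag{28}$$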
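As a rough illustration of Eqs. (25), (26) and (28), the following Python sketch computes normalized durations, takes forward differences, and assembles the duration-related vector. The function names, the toy syllables and the duration values are assumptions made for illustration only; they are not taken from the original system.

```python
def normalized_durations(durations_by_syllable):
    """Eq. (25) as written: total duration of each syllable type divided by
    the total duration of all syllable samples in the corpus."""
    total = sum(sum(samples) for samples in durations_by_syllable.values())
    return {syl: sum(samples) / total
            for syl, samples in durations_by_syllable.items()}


def duration_vector(nf, w):
    """Eq. (28): nf holds the normalized durations of the 2w syllables at
    window positions -w+1, ..., +w (list index 0 is position -w+1)."""
    assert len(nf) == 2 * w
    # Eq. (26): forward differences for positions -w+1 .. w-1
    deltas = [nf[i + 1] - nf[i] for i in range(2 * w - 1)]
    return nf + deltas


# Hypothetical corpus: durations (in seconds) of each observed sample.
corpus = {"ba": [0.20, 0.30], "ma": [0.25], "shi": [0.15, 0.10]}
nf_table = normalized_durations(corpus)

# Hypothetical window (w = 2) of syllables around a silence.
window = ["ba", "ma", "shi", "ba"]
o_d = duration_vector([nf_table[s] for s in window], w=2)
print(len(o_d))  # 4w - 1 duration-related features
```

The pitch-related and energy-related vectors would be built in the same way from their own normalized features.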

3.3.3. Gaussian mixture model for interruption point detection

The GMM is adopted for IP detection using the acoustic features.

$$P(Seq_{syllable} \mid C_j) \equiv P(O_t \mid \lambda_j) = \sum_{i=1}^{W} \omega_i \, N(O_t; \mu_i, \Sigma_i) \tag{29}$$

where C_j ∈ {E_ip, ¬E_ip} is the hypothesis that silence_k does or does not contain the IP, λ_j is the GMM for class C_j, and ω_i is a mixture weight which must satisfy the constraint ∑_{i=1}^{W} ω_i = 1, where W is the number of mixture components, and N(⋅) is the Gaussian density function:

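$$N(O_t; \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{dim/2} \, |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (O_t - \mu_i)^T \Sigma_i^{-1} (O_t - \mu_i) \right) \tag{30}$$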

where μ_i and Σ_i are the mean vector and covariance matrix of the i-th component, and O_t denotes the t-th observation in the training corpus. The parameters θ = [ω_i, μ_i, Σ_i], i = 1, …, W, can be estimated iteratively using the EM algorithm. For mixture i:

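$$\hat{\omega}_i = \frac{1}{N} \sum_{t=1}^{N} P(i \mid O_t, \lambda) \tag{31}$$

$$\hat{\mu}_i = \frac{\sum_{t=1}^{N} P(i \mid O_t, \lambda) \, O_t}{\sum_{t=1}^{N} P(i \mid O_t, \lambda)} \tag{32}$$

$$\hat{\Sigma}_i = \frac{\sum_{t=1}^{N} P(i \mid O_t, \lambda) \, (O_t - \hat{\mu}_i)(O_t - \hat{\mu}_i)^T}{\sum_{t=1}^{N} P(i \mid O_t, \lambda)} \tag{33}$$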

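where the posterior probability of component i is

$$P(i \mid O_t, \lambda) = \frac{\omega_i \, N(O_t; \mu_i, \Sigma_i)}{\sum_{j=1}^{W} \omega_j \, N(O_t; \mu_j, \Sigma_j)}$$

and N denotes the total number of feature observations.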
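The re-estimation formulas of Eqs. (31)-(33) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the synthetic two-dimensional data, the initial parameters and the function names are hypothetical, and no covariance regularization is included.

```python
import numpy as np


def gaussian_pdf(X, mean, cov):
    """Multivariate Gaussian density of Eq. (30) for each row of X."""
    dim = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum("td,de,te->t", diff, inv, diff)
    norm = (2 * np.pi) ** (dim / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(expo) / norm


def em_step(X, weights, means, covs):
    """One EM iteration following Eqs. (31)-(33)."""
    # E-step: posterior P(i | O_t, lambda) for every observation and component.
    dens = np.stack([w * gaussian_pdf(X, m, c)
                     for w, m, c in zip(weights, means, covs)], axis=1)
    post = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means and covariances.
    n_i = post.sum(axis=0)
    new_weights = n_i / len(X)
    new_means = (post.T @ X) / n_i[:, None]
    new_covs = []
    for i in range(len(weights)):
        diff = X - new_means[i]
        new_covs.append((post[:, i, None] * diff).T @ diff / n_i[i])
    return new_weights, new_means, np.array(new_covs)


def log_likelihood(X, weights, means, covs):
    """Sum over observations of log P(O_t | lambda), using Eq. (29)."""
    mix = sum(w * gaussian_pdf(X, m, c)
              for w, m, c in zip(weights, means, covs))
    return float(np.log(mix).sum())


# Synthetic data: two well-separated clusters (made up for this sketch).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
params = (np.array([0.5, 0.5]),
          np.array([[0.0, 1.0], [3.0, 3.0]]),
          np.array([np.eye(2), np.eye(2)]))
before = log_likelihood(X, *params)
params = em_step(X, *params)
after = log_likelihood(X, *params)
print(before, after)
```

In the setting described above, one such model would be trained for silences containing an IP and another for silences without an IP, and the two likelihoods would be compared in the hypothesis test.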

3.3.4. Potential interruption point extraction

Based on the assumption that an IP generally appears at the boundary of two successive words, we can remove the detected IPs that do not appear at a word boundary. After the removal of unlikely IPs, the remaining IPs are kept for further processing. Since the word graph or word lattice is obtained from the speech recognition module, every path in the word graph or word lattice forms its own potential IP set for an input utterance.

3.4. Linguistic processing for edit disfluency correction

In the previous section, potential IPs were detected from the acoustic features. However, correcting edit disfluency using linguistic features is, in fact, one of the keys for rich transcription. In this section, the edit disfluency is detected by maximizing the likelihood of the language model for the cleaned-up utterances and the word correspondence between the deletable region and the correction, given the position of the IP. Consider the word sequence W^* in the word lattice generated by the speech recognition engine. We can model the word string W^* using a log-linear mixture model in which both the language model and the alignment are included.

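$$
\begin{aligned}
W^* &= \arg\max_{W,IP} P(W; IP) \\
&= \arg\max_{W,IP} P(w_1, w_2, \ldots, w_t, w_{t+1}, \ldots, w_n, w_{n+1}, \ldots, w_{2n-t}, w_{2n-t+1}, \ldots, w_N; IP) \\
&= \arg\max_{W,n,t} \Big( P(w_1, w_2, \ldots, w_t, w_{n+1}, \ldots, w_{2n-t}, w_{2n-t+1}, \ldots, w_N)^{\alpha} \\
&\qquad\qquad \times P(w_{t+1}, \ldots, w_n \mid w_{n+1}, \ldots, w_{2n-t}, w_{2n-t+1}, \ldots, w_N)^{(1-\alpha)} \Big) \\
&= \arg\max_{W,n,t} \Big( \alpha \log P(w_1, w_2, \ldots, w_t, w_{n+1}, \ldots, w_{2n-t}, w_{2n-t+1}, \ldots, w_N) \\
&\qquad\qquad + (1-\alpha) \log P(w_{t+1}, \ldots, w_n \mid w_{n+1}, \ldots, w_{2n-t}, w_{2n-t+1}, \ldots, w_N) \Big)
\end{aligned}
\tag{34}
$$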

where α and 1−α are the combination weights for the cleanup language model and the alignment model, respectively. IP denotes the interruption point obtained from the IP detection module, and n is the position of the potential IP.
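The log-linear combination in the last line of Eq. (34) can be sketched as a small scoring function. The candidate boundaries (t, n) and their probabilities below are made-up values for illustration; in the described system they would come from the cleanup language model and the alignment model.

```python
import math


def combined_score(p_cleanup, p_alignment, alpha=0.5):
    """Last line of Eq. (34): alpha * log P_LM + (1 - alpha) * log P_align."""
    return alpha * math.log(p_cleanup) + (1 - alpha) * math.log(p_alignment)


# Hypothetical candidates keyed by (t, n), with
# (cleanup language model probability, alignment probability).
candidates = {
    (2, 4): (1e-6, 0.30),
    (1, 4): (5e-7, 0.05),
    (3, 4): (2e-6, 0.01),
}
best = max(candidates, key=lambda tn: combined_score(*candidates[tn]))
print(best)
```

Working in the log domain avoids numerical underflow when the language model probabilities are very small, and the weight alpha balances the two knowledge sources.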

3.4.1. Language model of cleanup utterance

In the past, statistical language models have been applied to speech recognition and have achieved significant improvement in the recognition results. However, probability estimation of word sequences can be expensive and always suffers from the problem of data sparseness. In practice, the statistical language model is often approximated by the class-based n-gram model with modified Kneser-Ney discounting probabilities for further smoothing.

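$$P(w_1, w_2, \ldots, w_t, w_{n+1}, \ldots, w_{2n-t}, w_{2n-t+1}, \ldots, w_N) = \prod_{i=1}^{t} P(w_i \mid Class(w_1^{i-1})) \; P(w_{n+1} \mid Class(w_1^{t})) \prod_{j=n+2}^{N} P(w_j \mid Class(w_1^{t} w_{n+1}^{j-1})) \tag{35}$$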

where Class(⋅) is the conversion function that translates a word sequence into a word class sequence. In this section, we employ two word classes: semantic class and part-of-speech (POS) class. A semantic class, such as the synsets in WordNet (http://wordnet.princeton.edu/) or the concepts in the UMLS (http://www.nlm.nih.gov/research/umls/), contains the words that share a semantic property based on semantic relations, such as hyponymy and hypernymy. POS classes, also called syntactic or grammatical categories, are defined by the role that a word plays in a sentence, such as noun, verb or adjective.

The other essential issue of the n-gram model for correcting edit disfluency is the order of the Markov model. The IP is the point at which the speaker breaks off the deletable region, and the correction consists of the portion of the utterance that has been repaired by the speaker and can be considered fluent. Removing part of the word string leads to a shorter string, and a shorter word string obtains a higher probability. As a result, short word strings are favored. To deal with this problem, we can increase the order to constrain the perplexity and normalize the word length by aligning the deletable region and the correction.

3.4.2. Alignment model between the deletable region and the correction

In conversational speech, the structural pattern of a deletable region is usually similar to that of the correction. Sometimes, the deletable region appears as a substring of the correction. Accordingly, we can find the structural pattern in the starting point of the correction which generally follows the IP. Then, we can take the potential IP as the center and align the word string before and after it. Since the correction is used for replacing the deletable region and ending the utterance, there exists a correspondence between the words in the deletable region and the correction. We may, therefore, model the alignment assuming the conditional probability of the correction given the possible deletable region. According to this observation, class-based alignment is proposed to clean up edit disfluency. The alignment model can be described as

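$$P(w_{n+1}, \ldots, w_{2n-t}, w_{2n-t+1}, \ldots, w_N \mid w_{t+1}, \ldots, w_n) = \prod_{k=t+1}^{n} \left( P(f_k \mid Class(w_k)) \prod_{l=1}^{f_k} P(Class(w_l) \mid Class(w_k)) \right) \prod_{k,l,m} P(l \mid k, m) \tag{36}$$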

where the fertility f_k is the number of words in the correction corresponding to the word w_k in the deletable region. k and l are the positions of the words w_k and w_l in the deletable region and the correction, respectively, and m denotes the number of words in the deletable region. The alignment model for cleanup contains three parts: the fertility probability, the translation (or corresponding) probability, and the distortion probability. The fertility probability of word w_k is defined as

$$P(f_k \mid Class(w_k)) = \frac{\sum_{w_i \in Class(w_k)} \delta(f_i = f_k)}{\sum_{p=0}^{N} \sum_{w_j \in Class(w_k)} \delta(f_j = p)} \tag{37}$$

where δ(⋅) is an indicator function and N means the maximum value of fertility. The translation or corresponding probability is measured according to (Wu et al. 1994).

$$P(Class(w_l) \mid Class(w_k)) = \frac{2 \times Depth(LCS(Class(w_l), Class(w_k)))}{Depth(Class(w_l)) + Depth(Class(w_k))} \tag{38}$$

where Depth(⋅) denotes the depth of the word class and LCS(⋅) denotes the lowest common subsumer of the two word classes. The distortion probability P(l|k,m) is the mapping probability of the word sequence between the deletable region and the correction.
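The depth-based measure of Eq. (38) can be illustrated on a toy class hierarchy. The taxonomy below is hypothetical and much smaller than WordNet or HowNet, and counting depth in nodes (so that the root has depth 1) is an assumption of this sketch.

```python
# Hypothetical IS-A taxonomy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(node):
    """List of nodes from the given node up to the root."""
    path = [node]
    while PARENT[node] is not None:
        node = PARENT[node]
        path.append(node)
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lcs(a, b):
    """Lowest common subsumer: deepest ancestor shared by a and b."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks from b up toward the root
        if node in ancestors_a:
            return node


def class_similarity(a, b):
    """Eq. (38): 2 * Depth(LCS) / (Depth(a) + Depth(b))."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))


print(class_similarity("dog", "cat"))  # share the class "animal"
print(class_similarity("dog", "car"))  # share only the root "entity"
```

Classes that meet deep in the hierarchy receive a value close to 1, while classes that only share the root receive a small value.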

3.5. Experimental results and discussion

To evaluate the performance of the proposed approach, a transcription system for spontaneous speech with edit disfluencies in Mandarin was developed. A speech recognition engine using the Hidden Markov Model Toolkit (HTK) was constructed as the syllable recognizer using 8 states (3 states for the initial and 5 states for the final in Mandarin).

3.5.1. Experimental data

The Mandarin Conversational Dialogue Corpus (MCDC), collected from 2000 to 2001 at the Institute of Linguistics of Academia Sinica, Taiwan, consists of 30 digitized conversational dialogues with a total length of 27 hours. The 60 subjects were randomly chosen from daily life in the Taiwan area. The corpus was annotated according to (Yeh et al. 2006), which gives concise explanations and detailed operational definitions of each tag in Mandarin. Corresponding to SimpleMDE, direct repetitions, partial repetitions, overt repairs and abandoned utterances are taken as edit disfluency in MCDC. The dialogs numbered 01, 02, 03 and 05 are used as the test corpus. For training the parameters of the speech recognizer, the MAT Speech Database, TCC-300 and MCDC were employed.

3.5.2. Potential interruption point detection

According to the observation of the MCDC, the probability density function (pdf) of the duration of the silences with or without IPs is obtained. The average duration of the silences with IP is larger than that of the silences without IP. According to this result, we can estimate the posterior probability of silence duration using a GMM for IP detection. For hypothesis testing, an anti-IP GMM is also constructed.

Since IP detection can be regarded as a position determination problem, an observation window over several syllables is adopted. In this observation window, the values of pitch and energy of the syllables just before an IP are usually larger than those after the IP. This phenomenon means that pitch reset and energy reset co-occur with the IP in edit disfluency. This generally happens in the syllables of the first word just after the IP. The pitch reset event is very obvious when the disfluency type is repair. Energy plays the same role as pitch when edit disfluency appears, but the effect is less obvious than for pitch. The filler words or phrases after the IP are lengthened to gain time for the speaker to construct the correction and to attract the listener's attention. This factor can achieve significant improvement in the IP detection rate.

The hypothesis testing, combined with the GMM model with four mixture components using the syllable features, will determine if the silence contains the IP. The parameter γ should be determined to achieve a better result. The overall IP error rate defined in RT’04F will be simply the average number of missed IP detections and falsely detected IPs per reference IP:

$$Error_{IP} = \frac{n_{M-IP} + n_{FA-IP}}{n_{IP}} \tag{39}$$

where n_{M−IP} and n_{FA−IP} denote the numbers of missed and false-alarm IPs, respectively, and n_IP is the number of reference IPs. The threshold γ can be adjusted to trade off n_{M−IP} against n_{FA−IP}.

Since the goal of the IP detection module is to detect the potential IPs, a false alarm in IP detection is not as serious a problem as a miss error. That is to say, we want to obtain a high recall rate without much increase in the false alarm rate. Finally, the threshold γ was set to 0.25. Since the IP always appears at a word boundary, this constraint can be used to remove unlikely IPs.
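The trade-off controlled by the threshold can be made concrete with a small sketch of Eq. (39). The detector scores and reference labels below are invented, and the decision rule "score >= γ means IP detected" is an assumption of this example rather than the exact test used in the experiments.

```python
def ip_error(scores, labels, gamma):
    """Eq. (39): (missed IPs + false-alarm IPs) / reference IPs.
    scores: detector output per silence; labels: True if a reference IP."""
    detected = [s >= gamma for s in scores]
    n_miss = sum(l and not d for l, d in zip(labels, detected))
    n_fa = sum(d and not l for l, d in zip(labels, detected))
    n_ip = sum(labels)
    return (n_miss + n_fa) / n_ip, n_miss, n_fa


# Made-up detector scores and reference labels for six silences.
scores = [0.9, 0.6, 0.3, 0.2, 0.7, 0.1]
labels = [True, True, True, False, False, False]

for gamma in (0.25, 0.5, 0.8):
    print(gamma, ip_error(scores, labels, gamma))
```

Lowering the threshold reduces the number of misses at the cost of more false alarms, which matches the preference for high recall stated above.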

3.5.3. Clean-up disfluency using linguistic information

For evaluating the edit disfluency correction model, two different types of transcriptions were used: human generated transcription (REF) and speech-to-text recognition output (STT). Using the reference transcriptions provides the best case for the evaluation of the edit disfluency correction module because there are no word errors in the transcription. For practicability, the syllable lattice from speech recognition is fed to the edit disfluency correction module for performance assessment.

For class-based approach, part of speech (POS) and semantic class are employed as the word class. Herein, semantic class is obtained based on Hownet (http://www.keenage.com/) that defines the relation “IS-A” as the primary feature. There are 26 and 30 classes in POS class and semantic class respectively. By this, we can categorize the words according to their hypernyms or concepts, and every word can map to its own semantic class.

The edit word detection (EWD) task is to detect the regions of the input speech containing the words in the deletable regions. One of the primary metrics for edit disfluency correction is the edit word detection error defined in RT’04F (Chen et al. 2002), which is similar to the metric for IP detection shown in Eq. (39).

Due to the lack of structural information, the unigram does not obtain any improvement. The bigram provides more significant improvement when combined with POS class-based alignment than with semantic class-based alignment. Using the 3-gram with semantic class-based alignment outperforms the other combinations. The reason is that the 3-gram, with stricter constraints, can reduce the false alarm rate for edit word detection. We also tried the 4-gram in the hope of gaining more improvement than the 3-gram, but the improvement was slight and did not justify the excess computation. Besides, the statistics of the 4-gram are too sparse compared to the 3-gram model. The best combination in the edit disfluency correction module is the 3-gram with semantic class.

According to the analysis of the results shown in Table 2, the probabilities of the n-gram model are much smaller than those of the alignment model. Since the alignment can be taken as the penalty for edit words, we should balance the effects between the 3-gram and the alignment with semantic class using a log-linear combination weight α. To optimize the performance, we estimate α empirically by minimizing the edit word errors.

	Human generated transcription (REF)			Speech-to-text recognition output (STT)
	n_M−EWD / n_EWD	n_FA−EWD / n_EWD	Error_EWD	n_M−EWD / n_EWD	n_FA−EWD / n_EWD	Error_EWD
1-gram+alignment¹	0.15	0.17	0.32	0.58	0.65	1.23
1-gram+alignment²	0.23	0.12	0.35	0.62	0.42	1.04
2-gram+alignment¹	0.09	0.15	0.24	0.46	0.43	0.87
2-gram+alignment²	0.10	0.11	0.21	0.38	0.36	0.74
3-gram+alignment¹	0.12	0.04	0.16	0.39	0.23	0.62
3-gram+alignment²	0.11	0.04	0.15	0.36	0.24	0.60

Table 2.

Results (%) of linguistic module with equal weight α=(1−α)=0.5 for edit word detection on REF and STT conditions

3.6. Conclusion and future work

This investigation has proposed an approach to edit disfluency detection and correction for rich transcription. The proposed theoretical approach, based on a two-stage process, aims to model the behavior of edit disfluency and clean up the disfluency. An IP detection module using hypothesis testing on the acoustic features is employed to detect the potential IPs. A word-based linguistic module, consisting of a cleanup language model and an alignment model, is used to verify the position of the IP and thereby correct the edit disfluency. Experimental results indicate that the IP detection mechanism is able to recall IPs by adjusting the threshold in hypothesis testing. In an investigation of the linguistic properties of edit disfluency, the linguistic module was explored for correcting disfluency based on the potential IPs. The experimental results indicate that a significant improvement in performance was achieved. In the future, this framework will be extended to deal with the problems arising from subwords to improve the performance of the rich transcription system.

4. Speech recognition in multilingual environment

This section presents an approach to generating phonetic units for mixed-language or multilingual speech recognition. Acoustic and contextual analysis is performed to characterize multilingual phonetic units for phone set creation. Acoustic likelihood is utilized for similarity estimation of phone models. The hyperspace analog to language (HAL) model is adopted for contextual modeling and contextual similarity estimation. A confusion matrix combining acoustic and contextual similarities between every two phonetic units is built for phonetic unit clustering. Multidimensional scaling (MDS) method is applied to the confusion matrix for reducing dimensionality.

4.1. Introduction

In multilingual speech recognition, it is very important to determine a global phone inventory for different languages. When an authentic multilingual phone set is defined, the acoustic models and pronunciation lexicon can be constructed (Chen et al. 2002). The simplest approach to phone set definition is to combine the phone inventories of different languages together without sharing the units across the languages. The second one is to map language-dependent phones to the global inventory of the multilingual phonetic association based on phonetic knowledge to construct the multilingual phone inventory. Several global phone-based phonetic representations such as the International Phonetic Alphabet (IPA) (Mathews 1979), the Speech Assessment Methods Phonetic Alphabet (Wells 1989) and Worldbet (Hieronymus 1993) are generally used. The third one is to merge the language-dependent phone models using a hierarchical phone clustering algorithm to obtain a compact multilingual inventory. In this approach, a distance measure between acoustic models, such as the Bhattacharyya distance (Mak et al. 1996) or the Kullback-Leibler (KL) divergence (Goldberger et al. 2005), is employed to perform the bottom-up clustering. Finally, the multilingual phone models are generated with the use of a phonetic top-down clustering procedure (Young et al. 1994).

4.2. Multilingual phone set definition

From the viewpoint of multilingual speech recognition, a phonetic representation is functionally defined by the mapping of the fundamental phonetic units of languages to describe the corresponding pronunciation. In this section, the IPA-based multilingual phone definition is adopted as a suitable and consistent phonetic representation. Using the phonetic representation of the IPA, the recognition units can be effectively reduced for multilingual speech recognition. Considering the co-articulated pronunciation, context-dependent triphones are adopted in the expansion of IPA-based phonetic units.

In multilingual speech recognition, misrecognition generally results from incorrect pronunciation or a confusable phonetic set. For example, in Mandarin speech, “ei_M” and “zh_M” are usually pronounced as “en_M” and “z_M”, respectively. In this section, statistical methods are proposed to deal with the problem of misrecognition caused by the confusing characteristics between phonetic units in multilingual speech recognition. Based on the analysis of confusing characteristics, confusing phones due in part to the confusable phonetic representation are redefined to alleviate the misrecognition problem.

4.2.1. Acoustic likelihood

For the estimation of the confusion between two phone models, the posterior probabilities obtained from the phone-based hidden Markov model (HMM) are employed. Given two phone models, ω_k and ω_l, trained with the corresponding training data, x_i^k, 1≤i≤I, and x_j^l, 1≤j≤J, the symmetric acoustic likelihood (ACL) between the two phone models is estimated as follows.

$$a_{k,l} = \frac{\sum_{j=1}^{J} P(x_j^l \mid \omega_k) + \sum_{i=1}^{I} P(x_i^k \mid \omega_l)}{I + J} \tag{40}$$

where I and J are the numbers of training samples for phone models ω_k and ω_l, respectively. The acoustic confusing matrix A=(a_{k,l})_{N×N} is obtained from the pairwise similarities between every two phone models, and N denotes the number of phone models.
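The symmetric measure of Eq. (40) scores the data of each phone with the model of the other phone and averages over all samples. The sketch below replaces the HMM likelihood with a one-dimensional Gaussian purely for illustration; the phone labels, model parameters and samples are all hypothetical.

```python
import math


def gaussian_likelihood(x, model):
    """Stand-in for P(x | omega): a 1-D Gaussian instead of an HMM."""
    mean, std = model
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def acl(data_k, model_k, data_l, model_l, likelihood=gaussian_likelihood):
    """Eq. (40): data of each phone scored by the other phone's model,
    averaged over all I + J samples."""
    cross = sum(likelihood(x, model_k) for x in data_l)
    cross += sum(likelihood(x, model_l) for x in data_k)
    return cross / (len(data_k) + len(data_l))


# Hypothetical phones: (mean, std) models and a few training samples each.
models = {"a": (0.0, 1.0), "b": (0.5, 1.0), "c": (5.0, 1.0)}
data = {"a": [-0.2, 0.1, 0.3], "b": [0.4, 0.7], "c": [4.8, 5.1, 5.3]}

phones = sorted(models)
A = [[acl(data[k], models[k], data[l], models[l]) for l in phones]
     for k in phones]
print(A[0][1], A[0][2])
```

In this toy setting the acoustically close pair ("a", "b") receives a much larger value than the distant pair ("a", "c"), which is the behaviour the confusion matrix is meant to capture.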

4.2.2. Contextual analysis

A co-articulation pattern can be considered as a semantically plausible combination of phones. This section presents a text mining framework to automatically induce co-articulation patterns from a mixed-language or a multilingual corpus. A crucial step to induce the co-articulation patterns is to represent speech intonation as well as combination of phones. To achieve this goal, the hyperspace analog to language (HAL) model constructs a high-dimensional contextual space for the mixed-language or multilingual corpus. Each context-dependent triphone in the HAL space is represented as a vector of its context phones, which means that the sense of a phone can be co-articulated through its context phones. This notion is derived from the observation of articulation behavior. Based on the co-articulation behavior, if two phones share more common context, they are more similarly articulated.

The HAL model represents the multilingual triphones based on a vector representation. Each dimension of the vector is a weight representing the strength of association between the target phone and its context phone. The weights are computed by applying an observation window of length ℓ over the corpus. All phones within the window are considered to be co-articulated with each other. For any two phones at distance d within the window, the weight between them is defined as ℓ−d+1. After moving the window by one phone increment over the sentence, the HAL space G=(g_{k,l})_{N×N} is constructed. The resultant HAL space is an N×N matrix, where N is the number of triphones.

Table 3 presents the HAL space for the example of the English and Mandarin mixed sentence “查一下<look up> (CH A @ I X I A) Baghdad (B AE G D AE D).” For each phone in Table 3, the corresponding row vector represents its left contextual information, i.e. the weights of the phones preceding it. The corresponding column vector represents its right contextual information. w_{k,l} indicates the k-th weight of the l-th triphone φ_l. Furthermore, the weights in the vector are re-estimated as follows.

$$\bar{w}_{k,l} = w_{k,l} \times \log\frac{N}{N_l} \tag{41}$$

where N denotes the total number of phone vectors and N_l represents the number of vectors of phone φ_l with nonzero dimension. After each dimension is re-weighted, the HAL space is transformed into a probabilistic framework, and thus each weight can be redefined as

$$\hat{w}_{k,l} = \frac{\bar{w}_{k,l}}{\sum_{k=1}^{N} \bar{w}_{k,l}} \tag{42}$$

To generate a symmetric matrix, the weight is averaged as

$$g_{k,l} = \frac{\hat{w}_{k,l} + \hat{w}_{l,k}}{2}, \quad 1 \le k, l \le N$$
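The window-based weighting can be sketched in a few lines of Python. This is a simplified illustration: it works on single phones rather than triphones, uses a window length of 3 as an assumed setting, and skips the re-weighting of Eq. (41), applying the normalization of Eq. (42) directly to the raw weights before the averaging step. The function names are hypothetical.

```python
def hal_counts(phones, window):
    """Raw HAL weights: w[k][l] accumulates (window - d + 1) whenever
    context phone k precedes target phone l at distance d <= window."""
    units = sorted(set(phones))
    index = {p: i for i, p in enumerate(units)}
    n = len(units)
    w = [[0.0] * n for _ in range(n)]
    for pos, target in enumerate(phones):
        for d in range(1, window + 1):
            if pos - d >= 0:
                w[index[phones[pos - d]]][index[target]] += window - d + 1
    return units, w


def normalize_and_symmetrize(w):
    """Eq. (42) applied to the raw weights, followed by the averaging step.
    The re-weighting of Eq. (41) is skipped in this sketch."""
    n = len(w)
    col_sum = [sum(w[k][l] for k in range(n)) for l in range(n)]
    w_hat = [[w[k][l] / col_sum[l] if col_sum[l] else 0.0
              for l in range(n)] for k in range(n)]
    return [[(w_hat[k][l] + w_hat[l][k]) / 2 for l in range(n)]
            for k in range(n)]


sentence = "CH A @ I X I A B AE G D AE D".split()
units, w = hal_counts(sentence, window=3)
g = normalize_and_symmetrize(w)

d = units.index("D")
print({c: w[units.index(c)][d] for c in ("B", "AE", "G")})
```

For the target phone D, the weights contributed by the preceding phones B, AE and G come out as 1, 5 and 4, in line with the D row of Table 3.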

4.2.3. Fusion of confusing matrices and dimensional reduction

The multidimensional scaling (MDS) method is used to project multilingual triphones onto orthogonal axes where the ranking distance relation between them can be estimated using Euclidean distance. MDS is generally a procedure that characterizes the data in terms of a matrix of pairwise distances using Euclidean distance estimation. One of the purposes of MDS is to reduce the data dimensionality into a low-dimensional space. The IPA-based phone alphabet for English and Mandarin contains 55 phones, which gives 166,375 (55×55×55) possible triphones. When the number of target languages is increased, the dimension of the confusing matrix becomes huge. Another purpose of multidimensional scaling is to project the elements in the matrix onto orthogonal axes where the ranking distance relation between elements in the confusion matrix can be estimated. Compared to the hierarchical clustering method (Mak et al. 1996), this section applies MDS to the global similarity measure of multilingual triphones.

	CH	A	@	I	X	B	AE	G	D
CH
A	3			4	1
@	2	3
I	1	2	4		3
X		1	2	3
B		3		2	1
AE		2		1		3		2	3
G		1				2	3
D						1	5	4

Table 3.

Example of multilingual sentence “查一下<look up> (CH A @ I X I A) Baghdad (B AE G D AE D)” in HAL space

In this section, the multidimensional scaling method, which is suitable for representing high-dimensional relations, is adopted to project the confusing characteristics of multilingual triphones onto a lower-dimensional space for similarity estimation. The multidimensional scaling approach is similar to the principal component analysis (PCA) method. The difference is that MDS focuses on the distance relation between any two variables, whereas PCA focuses on the discriminative principal components in the variables. MDS is applied for estimating the similarity of pairwise triphones. The similarity matrix V=(v_{k,l})_{N×N} contains pairwise similarities between every two multilingual triphones. The element of row k and column l in the similarity matrix is computed as

$$v_{k,l} = -\left( \alpha \times \log(a_{k,l}) + (1-\alpha) \times \log(g_{k,l}) \right), \quad 1 \le k, l \le N \tag{43}$$

where α denotes the combination weight. The sum rule of data fusion is used to combine the acoustic likelihood (ACL) and contextual analysis (HAL) confusing matrices, as shown in Figure 5.
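Eq. (43) is an element-wise weighted sum of log values, which can be sketched as below. The two small matrices are invented, and the floor value eps that guards against log(0) is an assumption of this sketch, not part of the original formulation.

```python
import math


def fuse(acl, hal, alpha=0.5, eps=1e-12):
    """Eq. (43): v = -(alpha * log(a) + (1 - alpha) * log(g)), element-wise.
    eps guards against log(0); it is an assumption of this sketch."""
    n = len(acl)
    return [[-(alpha * math.log(max(acl[k][l], eps))
               + (1 - alpha) * math.log(max(hal[k][l], eps)))
             for l in range(n)] for k in range(n)]


# Hypothetical 2x2 ACL and HAL matrices.
acl_matrix = [[0.90, 0.50], [0.50, 0.80]]
hal_matrix = [[0.60, 0.25], [0.25, 0.70]]

V = fuse(acl_matrix, hal_matrix, alpha=0.5)
print(V[0][1])
```

Because of the negative logarithm, pairs with high acoustic and contextual confusion receive small values of v, so v behaves like a dissimilarity between the two units.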

MDS is then adopted to project the triphones onto the orthogonal axes where the ranking distance relation between triphones can be estimated based on the similarity matrices of triphones. The first step of MDS is to obtain the following matrices

$$B = HSH \tag{44}$$

where H = I − (1/n)11' is the centering matrix, I is the identity matrix and 1 is the vector of ones. The elements of matrix B are computed as

$$b_{kl} = s_{kl} - \bar{s}_{k\bullet} - \bar{s}_{\bullet l} + \bar{s}_{\bullet\bullet} \tag{45}$$

where

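$$\bar{s}_{k\bullet} = \frac{\sum_{l=1}^{N} s_{kl}}{N} \tag{46}$$

is the average similarity value over the k-th row,

$$\bar{s}_{\bullet l} = \frac{\sum_{k=1}^{N} s_{kl}}{N} \tag{47}$$

denotes the average similarity value over the l-th column, and

$$\bar{s}_{\bullet\bullet} = \frac{\sum_{k=1}^{N} \sum_{l=1}^{N} s_{kl}}{N^2} \tag{48}$$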

is the average similarity value over all rows and columns of the matrix S. Eigenvector analysis is applied to matrix B to obtain the coordinates of each triphone in a low-dimensional space. The singular value decomposition (SVD) is applied to solve the eigenvalue and eigenvector problems. Afterwards, the first z nonzero eigenvalues in descending order, i.e. λ1≥λ2≥…≥λz>0, are obtained. The corresponding ordered eigenvectors are denoted as u. Then, each triphone is represented by a projected vector as

$$Y = [\lambda_1 u_1, \lambda_2 u_2, \ldots, \lambda_z u_z] \tag{49}$$
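The projection steps of Eqs. (44) and (49) can be sketched with NumPy as follows. Two points are assumptions of this sketch: `numpy.linalg.eigh` is used for the symmetric eigenproblem in place of the SVD mentioned above, and the toy similarity matrix is built from inner products of made-up points so that positive eigenvalues exist.

```python
import numpy as np


def mds_project(S, z):
    """Double-centre S (Eq. 44), eigen-decompose the result, keep the z
    largest positive eigenvalues and scale the eigenvectors as in Eq. (49)."""
    n = S.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = H @ S @ H
    eigvals, eigvecs = np.linalg.eigh((B + B.T) / 2)  # symmetric solver
    order = np.argsort(eigvals)[::-1]            # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = [i for i in range(n) if eigvals[i] > 1e-10][:z]
    return eigvecs[:, keep] * eigvals[keep]      # row k is the vector of unit k


# Hypothetical similarity matrix: inner products of four made-up points.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
S = points @ points.T

Y = mds_project(S, z=2)
print(Y.shape)
```

Each row of Y is the low-dimensional vector of one unit, which can then be compared with the vectors of other units.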

4.2.4. Phone clustering

This section presents how to cluster the triphones with similar acoustic and contextual properties into a multilingual triphone cluster. Cosine measure between triphones Yk and Yl is adopted as follows.

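$$C(Y_k, Y_l) = \frac{\vec{y}_k \cdot \vec{y}_l}{\|\vec{y}_k\| \cdot \|\vec{y}_l\|} = \frac{\sum_{i=1}^{z} y_{k,i} \times y_{l,i}}{\sqrt{\sum_{i=1}^{z} y_{k,i}^2} \times \sqrt{\sum_{i=1}^{z} y_{l,i}^2}} \tag{50}$$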

where y_{k,i} and y_{l,i} are the elements of the triphone vectors Y_k and Y_l. The modified k-means (MKM) algorithm is applied to cluster all the triphones into a compact phonetic set. The convergence of the closeness measure is determined by a pre-set threshold.
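The cosine measure of Eq. (50) and a single assignment step of a k-means-style procedure can be sketched as follows. This is not the modified k-means (MKM) algorithm itself; the vectors and centroids are invented, and the function names are hypothetical.

```python
import math


def cosine(y_k, y_l):
    """Eq. (50): cosine of the angle between two projected vectors."""
    dot = sum(a * b for a, b in zip(y_k, y_l))
    norm_k = math.sqrt(sum(a * a for a in y_k))
    norm_l = math.sqrt(sum(b * b for b in y_l))
    return dot / (norm_k * norm_l)


def assign(vectors, centroids):
    """Assign every vector to the centroid with the highest cosine measure."""
    return [max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors]


# Hypothetical projected vectors and two cluster centroids.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
print(assign(vectors, centroids))
```

A full clustering procedure would alternate this assignment step with a centroid update until the change in the closeness measure falls below the pre-set threshold.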

Figure 5.
An illustration of fusion of acoustic likelihood (ACL) and contextual analysis (HAL) confusing matrices for the MDS process

4.3. Experimental evaluations

For evaluation, an in-house multilingual speech recognizer was implemented and experiments were conducted to evaluate the performance of the proposed approach on an English-Mandarin multilingual corpus.

4.3.1. Multilingual database

In Taiwan, English and Mandarin are popular in conversation, culture, media, and everyday life. For bilingual corpus collection, the English across Taiwan (EAT) project (EAT [online] http://www.aclclp.org.tw/), sponsored by the National Science Council, Taiwan, prepared 600 recording sheets. Each sheet contains 80 reading sentences, including English long sentences, English short sentences, English words and mixed English and Mandarin sentences. Each sheet was used for speech recording individually by English-major students and non-English-major students. The microphone corpus was recorded as sound files with a 16 kHz sampling rate and 16-bit sample resolution. The recording information of the EAT corpus is summarized in Table 4. In this section, we used the mixed English-Mandarin sentences recorded with a microphone. The average sentence length is around 12.62 characters.

	English-Major		Non-English-Major
	male	female	male	female
No. of Sentences	11,977	30,094	25,432	15,540
No. of Speakers	166	406	368	224

Table 4.

EAT-MIC Multilingual Corpus Information

4.3.2. Evaluation of the phone set generation based on acoustic and contextual analysis

In this section, the phone recognition rate was adopted for the evaluation of acoustic modeling accuracy. Three classes of speech recognition errors, including insertion errors (Ins), deletion errors (Del) and substitution errors (Sub), were considered. The fusion of acoustic and contextual analysis was applied to generate the multilingual triphone set. Since the optimal number of clusters of acoustic models was unknown, several sets of HMMs were produced by varying the MKM convergence threshold during multilingual triphone clustering. Three approaches were compared: acoustic likelihood (ACL), contextual analysis (HAL) and fusion of acoustic and contextual analysis (FUN). The proposed fusion method achieves a better result than the individual ACL or HAL methods. Comparing acoustic analysis with contextual analysis, HAL achieves a higher recognition rate than ACL, which indicates that contextual analysis is more significant than acoustic analysis for multilingual confusing phone clustering. The curves show that phone accuracy increases with the number of states and finally decreases due to the confusing triphone definition and the requirement of a large multilingual training corpus. The proposed multilingual phone generation approach achieves better performance than the ordinary multilingual triphone sets. In this section, the English and Mandarin triphone sets are defined based on the expansion of the IPA definition. The multilingual speech recognition system for English and Mandarin contains 924 context-dependent triphone models. The best phone recognition accuracy was 67.01%, obtained with a HAL window size of 3. Therefore, this setting was applied in the following experiments.

4.3.3. Comparison of acoustic and language models for multilingual speech recognition

Table 5 shows the comparison of different acoustic and language models for multilingual speech recognition. For the comparison of monophone-based and triphone-based recognition, different phone inventory definitions were considered, including direct combination of language-dependent phones (MIX), language-dependent IPA phone definition (IPA), the tree-based clustering procedure (TRE) (Mak et al. 1996) and the proposed method (FUN). The phonetic units of Mandarin can be represented as 37 fundamental phones and those of English as 39 fundamental phones. The phone set for the direct combination of English and Mandarin is 78 phones with two silence models. The phone set for the IPA definition of English and Mandarin contains 55 phones.

	Monophone		Triphone
	MIX	IPA	TRE	FUN
Phone models	78	55	1172	924
With language model	45.81%	66.05%	76.46%	78.18%
Without language model	32.58%	51.98%	65.32%	67.01%

Table 5.

Comparison of acoustic and language models for multilingual speech recognition

In acoustic comparison, multilingual context-independent (MIX and IPA) and context-dependent (TRE and FUN) phone sets were investigated. With the language model of English and Mandarin, the approach based on MIX achieved 45.81% phone accuracy and the IPA method achieved 66.05% phone accuracy. The IPA performance is evidently better than MIX approach. TRE method achieved 76.46% phone accuracy and our proposed approach achieved 78.18%. It is obvious that triphone models achieved better performance than monophone models. There is around 2.25% relative improvement from 76.46% accuracy for the baseline system based on TRE to 78.18% accuracy for the approach using acoustic and contextual analysis.
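The relative improvement quoted above can be checked directly from the two accuracies in Table 5:

$$\frac{78.18 - 76.46}{76.46} = \frac{1.72}{76.46} \approx 0.0225 = 2.25\%$$

In absolute terms, the gain is 1.72 percentage points of phone accuracy.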

In order to evaluate the acoustic modeling performance, the experiments were conducted without using language model. Without the language model, the MIX approach achieved 32.58%, IPA method achieved 51.98%, TRE method achieved 65.32%, and the proposed approach achieved 67.01% phone accuracies. In conclusion, multilingual speech recognition can obtain the best performance using FUN approach for the context-dependent phone definition with language model.

4.3.4. Comparison of monolingual and multilingual speech recognition

In this experiment, the utterances of English words and English sentences in the EAT corpus were collected for the evaluation of monolingual speech recognition. A comparison of monolingual and multilingual speech recognition using the EAT corpus is shown in Table 6. In total, 2496 English words, 3072 English sentences and 5884 mixed English and Mandarin utterances were separately used for training. Another 200 utterances were used for evaluation. In the context-dependent condition without a language model, the phone recognition accuracy for monolingual English words was 76.25%, which is higher than the 67.42% for monolingual English sentences. The accuracy of 67.42% for monolingual English sentences is slightly better than the 67.01% for mixed English and Mandarin sentences.

	Monolingual		Multilingual
	English word	English sent.	English and Mandarin mixed sent.
Training corpus	2496	3072	5884
Phone recognition accuracy	76.25%	67.42%	67.01%

Table 6.

Comparison of monolingual and multilingual speech recognition

4.4. Conclusions

In this section, the fusion of acoustic and contextual analysis is proposed to generate phonetic units for mixed-language or multilingual speech recognition. The context-dependent triphones are defined based on the IPA representation. Furthermore, the confusing characteristics of multilingual phone sets are analyzed using acoustic and contextual information. From the acoustic analysis, the acoustic likelihood confusing matrix is constructed by the posterior probability of triphones. From the contextual analysis, the hyperspace analog to language (HAL) approach is employed. Using the multidimensional scaling and data fusion approaches, the combination matrix is built and each phone is represented as a vector. Furthermore, the modified k-means algorithm is used to cluster the multilingual triphones into a compact and robust phone set. Experimental results show that the proposed approach gives encouraging results.

5. Conclusions

In this chapter, speech recognition techniques for adverse environments are presented. For speech recognition in noisy environments, two approaches to cepstral feature enhancement using noise-normalized stochastic vector mapping are described. Experimental results show that the proposed approach outperforms the SPLICE-based approach without stereo data on the AURORA2 database. For speech recognition in disfluent environments, an approach to edit disfluency detection and correction for rich transcription is presented. The proposed approach, based on a two-stage process, aims to model the behavior of edit disfluency and to clean up the disfluency. Experimental results indicate that the IP detection mechanism is able to recall IPs when the threshold in hypothesis testing is adjusted. For speech recognition in multilingual environments, the fusion of acoustic and contextual analysis is proposed to generate phonetic units for mixed-language or multilingual speech recognition. The confusability of multilingual phone sets is analyzed using acoustic and contextual information, and the modified k-means algorithm is used to cluster the multilingual triphones into a compact and robust phone set. Experimental results show that the proposed approach improves recognition accuracy in multilingual environments.

Acknowledgement

This work was partially supported by NCKU Project of Promoting Academic Excellence & Developing World Class Research Centers.

References

1. Bear, J., Dowding, J., Shriberg, E. (1992). Integrating multiple knowledge sources for detection and correction of repairs in human-computer dialog. Proc. of ACL. Newark, Delaware, USA, Association for Computational Linguistics: 56-63.
2. Benveniste, A., Métivier, M., Priouret, P. (1990). Adaptive Algorithms and Stochastic Approximations. Applications of Mathematics. New York, Springer. 22.
3. Boll, S. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech and Signal Processing, 27(2), 113-120.
4. Charniak, E., Johnson, M. (2001). Edit detection and parsing for transcribed speech. Proc. of NAACL, Association for Computational Linguistics: 118-126.
5. Chen, Y. J., Wu, C. H., Chiu, Y. H., Liao, H. C. (2002). Generation of robust phonetic set and decision tree for Mandarin using chi-square testing. Speech Communication, 38(3-4), 349-364.
6. Deng, L., Acero, A., Plumpe, M., Huang, X. (2000). Large-vocabulary speech recognition under adverse acoustic environments. Proc. ICSLP-2000, Beijing, China.
7. Deng, L., Droppo, J., Acero, A. (2003). Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition. IEEE Transactions on Speech and Audio Processing, 11(6), 568-580.
8. Furui, S., Nakamura, M., Ichiba, T., Iwano, K. (2005). Analysis and recognition of spontaneous speech using Corpus of Spontaneous Japanese. Speech Communication, 47(1-2), 208-219.
9. Gales, M. J. F., Young, S. J. (1996). Robust continuous speech recognition using parallel model combination. IEEE Transactions on Speech and Audio Processing, 4(5), 352-359.
10. Goldberger, J., Aronowitz, H. (2005). A distance measure between GMMs based on the unscented transform and its application to speaker recognition. Proc. of EUROSPEECH. Lisbon, Portugal: 1985-1988.
11. Hain, T., Woodland, P. C., Evermann, G., Gales, M. J. F., Liu, X., Moore, G. L., Povey, D., Wang, L. (2005). Automatic transcription of conversational telephone speech. IEEE Transactions on Speech and Audio Processing, 13(6), 1173-1185.
12. Heeman, P. A., Allen, J. F. (1999). Speech repairs, intonational phrases, and discourse markers: modeling speakers’ utterances in spoken dialogue. Computational Linguistics, 25(4), 527-571.
13. Hermansky, H., Morgan, N. (1994). RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2(4), 578-589.
14. Hieronymus, J. L. (1993). ASCII phonetic symbols for the world’s languages: Worldbet. Journal of the International Phonetic Association, 23.
15. Hsieh, C. H., Wu, C. H. (2008). Stochastic vector mapping-based feature enhancement using prior-models and model adaptation for noisy speech recognition. Speech Communication, 50(6), 467-475.
16. Huang, C. L., Wu, C. H. (2007). Generation of phonetic units for mixed-language speech recognition based on acoustic and contextual analysis. IEEE Transactions on Computers, 56(9), 1225-1233.
17. Johnson, M., Charniak, E. (2004). A TAG-based noisy channel model of speech repairs. Proc. of ACL, Association for Computational Linguistics: 33-39.
18. Kohler, J. (2001). Multilingual phone models for vocabulary-independent speech recognition tasks. Speech Communication, 35(1-2), 21-30.
19. Liu, Y., Shriberg, E., Stolcke, A., Harper, M. (2005). Comparing HMM, maximum entropy, and conditional random fields for disfluency detection. Proc. of Eurospeech: 3313-3316.
20. Macho, D., Mauuary, L., Noé, B., Cheng, Y. M., Ealey, D., Jouvet, D., Kelleher, H., Pearce, D., Saadoun, F. (2002). Evaluation of a noise-robust DSR front-end on Aurora databases. Proc. ICSLP-2002, Denver, Colorado, USA.
21. Mak, B., Barnard, E. (1996). Phone clustering using the Bhattacharyya distance. Proc. ICSLP, IEEE. 4: 2005-2008.
22. Mathews, R. H. (1979). Mathews’ Chinese-English Dictionary, Harvard University Press.
23. Savova, G., Bachenko, J. (2003). Prosodic features of four types of disfluencies. Proc. of DiSS: 91-94.
24. Shriberg, E., Ferrer, L., Kajarekar, S., Venkataraman, A., Stolcke, A. (2005). Modeling prosodic feature sequences for speaker recognition. Speech Communication, 46(3-4), 455-472.
25. Shriberg, E., Stolcke, A., Hakkani-Tur, D., Tur, G. (2000). Prosody-based automatic segmentation of speech into sentences and topics. Speech Communication, 32(1-2), 127-154.
26. Snover, M., Dorr, B., Schwartz, R. (2004). A lexically-driven algorithm for disfluency detection. Proc. of HLT/NAACL, Association for Computational Linguistics: 157-160.
27. Soltau, H., Kingsbury, B., Mangu, L., Povey, D., Saon, G., Zweig, G. (2005). The IBM 2004 conversational telephony system for rich transcription. Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’05), Philadelphia, USA.
28. Waibel, A., Soltau, H., Schultz, T., Schaaf, T., Metze, F. (2000). Multilingual Speech Recognition. Verbmobil: foundations of speech-to-speech translation, Springer-Verlag.
29. Wells, J. C. (1989). Computer-coded phonemic notation of individual languages of the European Community. Journal of the International Phonetic Association, 19(1), 31-54.
30. Wu, C. H., Chiu, Y. H., Shia, C. J., Lin, C. Y. (2006). Automatic segmentation and identification of mixed-language speech using delta-BIC and LSA-based GMMs. IEEE Transactions on Audio, Speech, and Language Processing, 14(1), 266-276.
31. Wu, C. H., Yan, G. L. (2004). Acoustic Feature Analysis and Discriminative Modeling of Filled Pauses for Spontaneous Speech Recognition. Journal of VLSI Signal Processing Systems, 36(2), 91-104.
32. Wu, J., Huo, Q. (2002). An environment compensated minimum classification error training approach and its evaluation on Aurora2 database. Proc. ICSLP-2002, Denver, Colorado, USA.
33. Wu, Z., Palmer, M. (1994). Verbs semantics and lexical selection. Proc. 32nd ACL, Association for Computational Linguistics: 133-138.
34. Yeh, J. F., Wu, C. H. (2006). Edit disfluency detection and correction using a cleanup language model and an alignment model. IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1574-1583.
35. Young, S. J., Odell, J., Woodland, P. (1994). Tree-based state tying for high accuracy acoustic modelling. Proc. ARPA Human Language Technology Conference. Plainsboro, USA, Association for Computational Linguistics: 307-312.
