A Particle Filter Compensation Approach to Robust Speech Recognition

Aleem Mushtaq

doi:10.5772/51532


1. Introduction

Speech production goes through several stages. First, a thought is generated in the speaker's mind. The thought is put into a sequence of words. These words are converted into a speech signal using various muscles, including those of the face, chest and tongue. This signal is then distorted by environmental factors such as background noise and reverberation, and by channel distortions when it passes through a microphone, a telephone channel, etc. The aim of Automatic Speech Recognition (ASR) systems is to reconstruct the spoken words from the speech signal. From an information theoretic [1] perspective, we can treat everything between the speaker and the machine as a distortion channel, as shown in Figure 1.

Figure 1.
Information theoretic view of Speech Recognition

Here, $W$ represents the spoken words and $X$ is the speech signal. The problem of extracting $W$ from $X$ can be viewed as finding the word sequence that most likely resulted in the observed signal $X$, as given in equation (1):

$$\hat{W} = \arg\max_{W} p(X \mid W) \tag{1}$$

As in any other machine learning or pattern recognition problem, the conditional density $p(X \mid W)$ plays a fundamental role in the decoding process. This distribution is parametric and its parameters are estimated from the available training data. Modern ASR systems do well when the environment of the speech signal being tested matches that of the training data, because the parameter values then correspond well to the speech signal being decoded. However, if the training and testing environments do not match well, the performance of ASR systems degrades. Many schemes have been proposed to overcome this problem, but humans still outperform these systems, especially in adverse conditions.

Approaches to overcome this problem fall into two categories. One is to adapt the parameters of $p(X \mid W)$ so that they better match the testing environment; the other is to choose features $X$ that are more robust to environmental variations. The features can also be transformed to make them better suited to the parameters of $p(X \mid W)$ obtained from the training data.

1.1. Typical ASR system

A typical small-vocabulary ASR system comprises three main components, as shown in Figure 2. Speech data is available as a waveform, which is first converted into feature vectors. Mel Frequency Cepstrum Coefficient (MFCC) [2] features have been widely used in the speech community for speech recognition because of their superior discriminative capability.

The features from an available training speech corpus are used to estimate the parameters of the acoustic models. An acoustic model for a particular speech unit, say a phoneme or a word, gives the likelihood of observing the features given that unit, as in equation (1). The most commonly used structure for acoustic models in ASR systems is the Hidden Markov Model (HMM). These models capture the dynamics and variations of the speech signal well. The test speech signal is then decoded using a Viterbi decoder.

1.2. Distortions in speech

Distortions in the speech signal can be viewed in the signal space, the feature space and the model space [3], as shown in Figure 3. Resilience to environmental distortions can be added in the feature extraction process, by modifying the distorted features, or by adapting the acoustic models to better match the environment from which the test signal has emanated. $S_X$ and $F_X$ represent the speech signal and the speech features respectively, and $M_X$ represents the acoustic models.

Figure 3.
Stages where noise robustness can be added

In stage 1, the feature extraction process is improved so that the features are robust to distortions. In stage 2, the features are modified to better match the training environment. The mismatch in this stage is usually modeled by nuisance parameters. These are estimated from the environment and the test data, and their effect is minimized according to some optimality criterion. In stage 3, the acoustic models are improved to better match the testing environment. One way to achieve this is multi-condition training, i.e. using data from diverse environments to train the models. Another way is to transform the models, with the transformation matrix obtained from the test environment.

1.3. Speech and noise tracking for noise compensation

A sequential Monte Carlo feature compensation algorithm was initially proposed in [4-5]. There, the noise was treated as the state variable while speech was considered as the signal corrupting the observation of the noise, and a vector Taylor series (VTS) approximation was used to estimate the clean speech signal through a minimum mean square error (MMSE) procedure. In [5], extended Kalman filters were used to model a dynamical system representing the noise, which was further improved by using Polyak averaging and feedback with a switching dynamical system [6]. These were initial attempts to use particle filters for speech recognition in an indirect fashion, since they tracked the noise rather than the speech signal itself. Because the speech signal is treated as a signal corrupting the noise, little or none of the information readily available from the HMMs or the recognition process can be used efficiently in the compensation process.

Particle filters are powerful numerical mechanisms for sequential signal modeling and are not constrained by the conventional linearity and Gaussianity [7] requirements. The particle filter is a generalization of the Kalman filter [8] and is more flexible than the extended Kalman filter [9] because the stage-by-stage linearization of the state space model is no longer required [7]. One difficulty of using particle filters lies in obtaining a state space model for speech, as consecutive speech features are usually highly correlated. Just as in the Kalman filter and HMM frameworks, the state transition is an integral part of particle filter algorithms.

In contrast to the previous particle filter attempts [4-6], in this chapter we describe a method in which we treat the speech signal as the state variable and the noise as the corrupting signal, and attempt to estimate clean speech from noisy speech. We incorporate statistical information available in the acoustic models of clean speech, e.g., the HMMs trained with clean speech, as an alternative state transition model [10-11]. The similarity between HMMs and particle filters can be seen as follows: the observation probability density function of each HMM state describes, in statistical terms, the characteristics of the source generating a signal of interest when the source is in that particular state, whereas in particle filters we try to estimate the probability distribution of the state the system is in when it generates the observed signal of interest. Particle filters are suited to feature compensation because the probability density of the state can be updated dynamically on a sample-by-sample basis. The state densities of HMMs, on the other hand, are assumed independent of each other. Although they are good for speech inference problems, HMMs do not adapt well to fast changing environments.

By establishing a close interaction between particle filters and HMMs, the potential of both models can be harnessed in a joint framework to perform feature compensation for robust speech recognition. We improve the recognition accuracy through compensation of noisy speech, and we enhance the compensation process by utilizing information in the HMM state transition and mixture component sequences obtained in the recognition process. When state sequence information is available, we found that we can attain a 67% digit error reduction relative to multi-condition training on the Aurora-2 connected digit recognition task. If the missing parameters are estimated in operational situations, we observe only a 13% error reduction in the current study. Moreover, by tracking the speech features, compensation can be done using only partial information about the noise, and consequently good recognition performance can be obtained despite potential distortion caused by non-stationary noise within an utterance.

The remainder of the chapter is organized as follows. In section 2, tracking schemes in general are described, followed by an explanation of the well known Kalman filter tracking algorithm. Particle filters, which form the backbone of particle filter compensation (PFC), are also described in this section. In section 3, the steps involved in tracking and then extracting the clean speech signal from the noisy speech signal are laid out. We also discuss various methods to obtain the information required to couple the particle filters and the HMMs in a joint framework. Finally, experimental results and performance comparisons for PFC are given before the conclusions are drawn in section 4.

2. Tracking algorithms

Tracking is the problem of estimating the trajectory of an object as it moves through a space. The space could be an image plane captured directly from a camera or it could be synthetically generated from a radar sweep. More generally, tracking schemes can be applied to any system that can be represented by a time dynamical system consisting of a state space model and an observation model:

$$x_t = f(x_{t-1}, w_t), \qquad y_t = h(x_t, n_t) \tag{2}$$

where $n_t$ is the observation noise and $w_t$ is the process noise, which represents the model uncertainties in the state transition function $f(\cdot)$. What is available is an observation $y_t$, which is a function of $x_t$. We are interested in finding a good estimate of the current state given the observations up to the current time $t$, i.e. $p(x_t \mid y_t, y_{t-1}, \ldots, y_0)$. The state space model $f(\cdot)$ represents the relation between states adjacent in time. The model in equation (2) assumes that the state sequence is a first-order Markov process:

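$$f(x_{t+1} \mid x_t, x_{t-1}, \ldots, x_0) = f(x_{t+1} \mid x_t) \tag{3}$$

It is further assumed that, given the current state, each observation is independent of the earlier observations:

$$f(y_{t+1} \mid x_{t+1}, y_t, \ldots, y_0) = f(y_{t+1} \mid x_{t+1}) \tag{4}$$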

Tracking is a two-step process. The first step, propagation, is to obtain the density of $x_t$ given the observations up to time $t-1$. This is called the prior density of $x_t$. Once it is available, we can construct the posterior density when the observation $y_t$ becomes available. The propagation step is given in equation (5). The update step follows from Bayes' rule (equation (6)).

$$f(x_t \mid y_{t-1}, \ldots, y_0) = \int f(x_t \mid x_{t-1})\, f(x_{t-1} \mid y_{t-1}, \ldots, y_0)\, dx_{t-1} \tag{5}$$

$$f(x_t \mid y_t, y_{t-1}, \ldots, y_0) = \frac{f(y_t \mid x_t, y_{t-1}, \ldots, y_0)\, f(x_t \mid y_{t-1}, \ldots, y_0)}{f(y_t \mid y_{t-1}, \ldots, y_0)} \tag{6}$$

2.1. Kalman filter as a recursive estimation tracking algorithm

The Kalman filter is the optimal recursive estimation solution for the posterior density $p(x_{t+1} \mid y_t, \ldots, y_0)$ if the time dynamical system is linear:

$$x_{t+1} = A_t x_t + w_t, \qquad y_t = C_t x_t + n_t \tag{7}$$

where $A_t$ and $C_t$ are known as the state transition matrix and the observation matrix respectively. The subscript $t$ indicates that both can vary with time. Under the assumption that the process noise $w_t$ and the observation noise $n_t$ are both Gaussian with zero mean and covariances $Q_t$ and $R_t$ respectively, $p(x_{t+1} \mid x_t)$ can be readily obtained:

$$\operatorname{mean}(x_{t+1} \mid x_t) = E(A_t x_t + w_t) = A_t x_t, \qquad \operatorname{covariance}(x_{t+1} \mid x_t) = E(w_t w_t^T) = Q_t \tag{8}$$

and therefore

$$p(x_{t+1} \mid x_t) \sim N(A_t x_t, Q_t) \tag{9}$$

$$p(x_t \mid y_t, y_{t-1}, \ldots, y_0) \sim N(\hat{x}_{t|t}, P_{t|t}) \tag{10}$$

where $P_{t|t}$ is the covariance of $x_t \mid y_t, \ldots, y_0$ and is given by $E[(x_t - E x_t)(x_t - E x_t)^T \mid y_t, \ldots, y_0]$. Both components of the integral in equation (5) are now available from equations (9) and (10). Solving the integral by expanding and completing the squares [12], we get

$$p(x_{t+1} \mid y_t, y_{t-1}, \ldots, y_0) \sim N(A_t \hat{x}_{t|t},\; A_t P_{t|t} A_t^T + Q_t) \tag{11}$$

This is the propagation step and is sometimes also written as

$$p(x_{t+1} \mid y_t, y_{t-1}, \ldots, y_0) \sim N(\hat{x}_{t+1|t}, P_{t+1|t}) \tag{12}$$

To get the update step, we note that the distributions of $x_{t+1} \mid y_t, \ldots, y_0$ and $y_{t+1}$ are both Gaussian. For two random variables, say $x$ and $y$, that are jointly Gaussian, the distribution of one of them given the other, for example $x \mid y$, is also Gaussian. Consequently, $x_{t+1} \mid y_{t+1}, y_t, \ldots, y_0$ is a Gaussian distribution with the following mean and variance:

$$\hat{x}_{t+1|t+1} = E[x_{t+1} \mid y_{t+1}, y_t, \ldots, y_0] = \hat{x}_{t+1|t} + R_{xy} R_{yy}^{-1} \left(y_{t+1} - E[y_{t+1} \mid y_t, \ldots, y_0]\right) \tag{13}$$

where

$$R_{xy} = P_{t+1|t} C_{t+1}^T \tag{14}$$

Similarly,

$$R_{yy} = C_{t+1} P_{t+1|t} C_{t+1}^T + R_{t+1} \tag{15}$$

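Back-substituting equations (14) and (15) into equation (13), we get

$$\hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}\left(y_{t+1} - C_{t+1} \hat{x}_{t+1|t}\right) \tag{16}$$

where $K$ is called the Kalman gain and is given by

$$K_{t+1} = P_{t+1|t} C_{t+1}^T \left(C_{t+1} P_{t+1|t} C_{t+1}^T + R_{t+1}\right)^{-1} \tag{17}$$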

The covariance can also be obtained from the fact that, for two jointly Gaussian random variables, the covariance of $x \mid y$ is given by

$$\operatorname{cov}(X \mid Y) = R_{xx} - R_{xy} R_{yy}^{-1} R_{yx} \tag{18}$$

We thus obtain the covariance of the estimate $\hat{x}_{t+1|t+1}$ as follows:

$$P_{t+1|t+1} = P_{t+1|t} - P_{t+1|t} C_{t+1}^T \left(C_{t+1} P_{t+1|t} C_{t+1}^T + R_{t+1}\right)^{-1} C_{t+1} P_{t+1|t} = (I - K_{t+1} C_{t+1})\, P_{t+1|t} \tag{19}$$

The block diagram in Figure 4 shows the steps of a general recursive estimation algorithm, starting from some initial state estimate $x_0$. The block labeled Kalman filter summarizes the steps specific to the Kalman filter algorithm.

Figure 4.
Recursive Estimation Algorithm
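The recursion of equations (11) to (19) can be illustrated with a minimal scalar sketch. It assumes time-invariant scalar values `a`, `c`, `q` and `r` in place of $A_t$, $C_t$, $Q_t$ and $R_t$; the function names and default values are hypothetical and chosen only for illustration.

```python
def kalman_step(x_est, p_est, y_next, a=1.0, c=1.0, q=0.01, r=0.1):
    """One propagate/update cycle of a scalar Kalman filter.

    x_est, p_est : posterior mean and variance at time t
    y_next       : observation at time t+1
    a, c, q, r   : scalar stand-ins for A_t, C_t, Q_t, R_t (illustrative values)
    """
    # Propagation step, equations (11)-(12)
    x_pred = a * x_est
    p_pred = a * p_est * a + q
    # Kalman gain, equation (17)
    gain = p_pred * c / (c * p_pred * c + r)
    # Update step, equations (16) and (19)
    x_new = x_pred + gain * (y_next - c * x_pred)
    p_new = (1.0 - gain * c) * p_pred
    return x_new, p_new


def kalman_filter(observations, x0=0.0, p0=1.0, **params):
    """Run kalman_step over a sequence, starting from the initial estimate x0."""
    x, p = x0, p0
    estimates = []
    for y in observations:
        x, p = kalman_step(x, p, y, **params)
        estimates.append(x)
    return estimates
```

With `q=0` and `r=1`, repeatedly observing the same value pulls the estimate from `x0` toward that value while the variance shrinks, which is the expected behaviour of the update step.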

2.2. Grid based methods

It is hard to obtain analytical solutions for most recursive estimation problems. If the state space of a problem is discrete, we can use grid based methods and still obtain the optimal solution. Considering that the state $x$ takes $N_s$ possible values, we can represent the discrete density $p(x \mid y)$ using $N_s$ samples [7]:

$$p(x_k \mid y_k, y_{k-1}, \ldots, y_0) = \sum_{i=1}^{N_s} w_{k|k}^i\, \delta(x_k - x_k^i) \tag{20}$$

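where the weights are computed as follows:

$$w_{k|k-1}^i = \sum_{j=1}^{N_s} w_{k-1|k-1}^j\, p(x_k^i \mid x_{k-1}^j), \qquad w_{k|k}^i = C\, w_{k|k-1}^i\, p(y_k \mid x_k^i) \tag{21}$$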

Here $C$ is the normalizing constant that makes the total probability equal to one. The assumption that the state can be represented by a finite number of points gives us the ability to sample the whole state space. The weight $w_k^i$ represents the probability of being in state $x_k^i$ when the observation at time $k$ is $y_k$. In the grid based method we construct the discrete density at every time instant in two steps. First we estimate the weights at time $k$ without the current observation, $w_{k|k-1}^i$, and then update them when the observation is available to obtain $w_{k|k}^i$. In the propagation step we take into account the probabilities (weights) of all possible state values at time $k-1$ to estimate the weights at time $k$, as shown in Figure 5.

If the prior $p(x_k^i \mid x_{k-1}^j)$ and the observation probability $p(y_k \mid x_k^i)$ are available, the grid based method gives the optimal solution for tracking the state of the system. If the state of the system is not discrete, we can still obtain an approximate solution using this method. We divide the continuous space into, say, $J$ cells, and for each cell we compute the prior and the observation probability in a way that takes into account the range of the whole cell:

$$p(x_k^i \mid x_{k-1}^j) = \int_{x \in x_k^i} p(x \mid \bar{x}_{k-1}^j)\, dx, \qquad p(y_k \mid x_k^i) = \int_{x \in x_k^i} p(y_k \mid x)\, dx \tag{22}$$

where $\bar{x}_{k-1}^j$ is the center of the $j$-th cell at time $k-1$. The weight update in equation (21) remains unchanged.
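The two-step weight computation of the grid based method (propagate over all previous grid points, then multiply by the observation probability and normalize) can be sketched as follows. The function name and the row convention `trans[j][i]` $= p(x_k^i \mid x_{k-1}^j)$ are assumptions made for this illustration.

```python
def grid_filter_step(w_prev, trans, lik):
    """One step of the grid based method over Ns grid points.

    w_prev : weights w_{k-1|k-1}^j from the previous time step
    trans  : trans[j][i] = p(x_k^i | x_{k-1}^j)  (assumed row convention)
    lik    : lik[i] = p(y_k | x_k^i) for the current observation
    """
    ns = len(w_prev)
    # Propagation: weights at time k before the observation y_k is used
    w_pred = [sum(w_prev[j] * trans[j][i] for j in range(ns)) for i in range(ns)]
    # Update: multiply by the observation probability, then normalize to sum to one
    unnorm = [w_pred[i] * lik[i] for i in range(ns)]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

For a two-point grid, `grid_filter_step([0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]], [0.7, 0.1])` shifts most of the probability mass to the first grid point, because the observation is far more likely there.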

2.3. Particle filter method

Particle filtering is a way to model signals emanating from a dynamical system. If the underlying state transition is known and the relationship between the system state and the observed output is available, then the system state can be found using Monte Carlo simulations [13]. Consider the discrete time Markov process such that

$$X_1 \sim \mu(x_1), \qquad X_t \mid X_{t-1} = x_{t-1} \sim p(x_t \mid x_{t-1}), \qquad Y_t \mid X_t = x_t \sim p(y_t \mid x_t) \tag{23}$$

We are interested in obtaining $p(x_t \mid y_t, \ldots, y_0)$ so that we have a filtered estimate of $x_t$ from the measurements available so far, $y_t, \ldots, y_0$. If the state space model for the process is available, and both the state and the observation equations are linear, then the Kalman filter described above can be used to determine the optimal estimate of $x_t$ given the observations $y_t, \ldots, y_0$. This holds under the condition that the process and observation noises are white, Gaussian, zero mean and mutually independent. If the state and observation equations are nonlinear, the Extended Kalman Filter (EKF) [9], a modified form of the Kalman filter, can be used. The particle filter algorithm estimates the posterior density of the state, $p(x_t \mid y_t, \ldots, y_0)$, represented by a finite set of support points [7]:

$$p(x_t \mid y_t, y_{t-1}, \ldots, y_1) = \sum_{i=1}^{N_s} w_t^i\, \delta(x_t - x_t^i) \tag{24}$$

where $x_t^i$ for $i = 1, \ldots, N_s$ are the support points and $w_t^i$ are the associated weights. We thus have a discretized and weighted approximation of the posterior density without the need for an analytical solution. Note the similarity with the grid based method, in which the support points of the discrete distribution are predefined and cover the whole space. In the particle filter algorithm, the support points are determined by importance sampling: instead of drawing from $p(\cdot)$, we draw points from another distribution $q(\cdot)$ and compute the weights as

$$w^i = \frac{\pi(x^i)}{q(x^i)} \tag{25}$$

where $\pi(\cdot)$ is the density of $p(\cdot)$ and $q(\cdot)$ is an importance density from which we can draw samples. For the sequential case, the weights can be updated one step at a time:

$$w_t^i = w_{t-1}^i\, \frac{p(y_t \mid x_t^i)\, p(x_t^i \mid x_{t-1}^i)}{q(x_t^i \mid x_{t-1}^i, y_t)} \tag{26}$$

The density $q(\cdot)$ propagates the samples to new positions at time $t$ given the samples at time $t-1$, and is derived from the state transition model of the system.
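As an illustration of equations (25) and (26), the sketch below first estimates the mean of a target density using samples drawn from a different importance density, and then applies the sequential weight update to a set of particles. The target N(1, 0.5^2), the proposal N(0, 2^2) and all function names are hypothetical choices for this example; the weights are normalized to sum to one after each computation.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_estimate(num_samples, seed=0):
    """Estimate the mean of a target N(1, 0.5^2) from samples of a proposal N(0, 2^2)."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 2.0) for _ in range(num_samples)]
    # Equation (25): weight = target density / importance density
    weights = [gaussian_pdf(x, 1.0, 0.5) / gaussian_pdf(x, 0.0, 2.0) for x in samples]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, samples)) / total


def update_weights(prev_weights, lik, trans, prop):
    """Equation (26) applied to every particle, followed by normalization.

    lik[i]   = p(y_t | x_t^i)
    trans[i] = p(x_t^i | x_{t-1}^i)
    prop[i]  = q(x_t^i | x_{t-1}^i, y_t)
    """
    new = [w * l * t / q for w, l, t, q in zip(prev_weights, lik, trans, prop)]
    total = sum(new)
    return [w / total for w in new]
```

When the importance density equals the transition density, `trans` and `prop` cancel and the update reduces to multiplying each weight by the observation likelihood.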

3. Tracking algorithms for noise compensation

State transition information is an integral part of the particle filter algorithm and is used to propagate the particle samples through time transitions of the signal being processed. Specifically, the state transition is needed to position the samples at the right locations. To solve this problem, statistics from HMMs can be used. Although HMMs have only discrete states, each state is characterized by a continuous density Gaussian mixture model (GMM), which enables us to capture part of the variation in speech features when generating particle samples for feature compensation. Using particle filter algorithms with side information about the statistics of clean speech available in the clean HMMs, we can perform feature compensation. If the clean speech is corrupted by an additive noise, $n$, and a distortion channel, $h$, then we can represent the noise corrupted speech with an additive noise model [14], assuming known statistics of the noise parameters:

$$y = x + h + \log(1 + \exp(n - x - h)) \tag{27}$$

where $y = \log(S_y(m_p))$, $x = \log(S_x(m_p))$ and $h = \log(|H(m_p)|^2)$, and $S(m_p)$ denotes the $p$-th mel spectrum:

$$S_y(m_p) = S_x(m_p)\,|H(m_p)|^2 + S_N(m_p) \tag{28}$$
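The equivalence of equations (27) and (28) can be checked numerically with a short sketch. It takes $n = \log(S_N(m_p))$, which the two equations imply but the text does not state explicitly; the function names are hypothetical.

```python
import math


def noisy_log_spectrum(x, n, h=0.0):
    """Equation (27): noisy log mel spectrum from clean speech x, noise n and
    channel h, all expressed in the log mel-spectral domain."""
    return x + h + math.log1p(math.exp(n - x - h))


def noisy_log_spectrum_linear(x, n, h=0.0):
    """The same quantity via equation (28), computed in the linear spectral
    domain under the assumption n = log S_N(m_p)."""
    return math.log(math.exp(x) * math.exp(h) + math.exp(n))
```

Both functions return the same value for any inputs, and when the noise is far below the speech level the result approaches `x + h`, i.e. the noise term becomes negligible.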

The additional side information needed for feature compensation is a set of nuisance parameters, $\Phi$. Similar to stochastic matching [3], we can iteratively find $\Phi$ followed by decoding, as shown in Figure 6:

$$\Phi' = \arg\max_{\Phi} P(Y' \mid \Phi, \Lambda) \tag{29}$$

where $Y'$ is the noisy or compensated utterance.

Figure 6.
General feature compensation scheme

The clean HMMs and the background noise information enable us to generate appropriate samples from $q(\cdot)$ in equation (26). In our particle filter compensation (PFC) implementation, the parameters $\Phi$ in equation (29) correspond to the correct HMM state sequence and mixture component sequence. These sequences provide critical information for density approximation in PFC. As shown in Figure 6, this can be done in two stages. We first perform a front-end compensation of the noisy speech. Recognition is then done in the second stage to generate the side information $\Phi$ so as to improve compensation. This process can be iterated, similar to what is done in maximum likelihood stochastic matching [3]. During compensation, the observed speech $y$ is mapped to clean speech features $x$. Clean speech cannot be represented by a finite set of points, and therefore HMMs by themselves cannot be used directly for tracking $x$. However, if an HMM $\lambda_m$ is available that adequately represents the speech segment under consideration, along with an estimated state sequence $s_1, s_2, \ldots, s_T$ corresponding to the $T$ feature vectors in the segment, then we can generate the samples from the $i$-th sample according to

$$p(x_t \mid x_{t-1}^i) \sim \sum_{k=1}^{K} c_{k,s_t}\, N(\mu_{k,s_t}, \Sigma_{k,s_t}) \tag{30}$$

where $N(\mu_{k,s_t}, \Sigma_{k,s_t})$ is the $k$-th Gaussian mixture component for the state $s_t$ in $\lambda_m$ and $c_{k,s_t}$ is its mixture weight. The total number of particles is fixed, and the contribution from each mixture, computed at run time, depends on its weight. We have chosen the importance sampling density $q(x_t \mid x_{t-1}^i, y_t)$ in equation (26) to be $p(x_t \mid x_{t-1}^i)$ in equation (30). This is known as the sampling importance resampling (SIR) filter [7]. It is one of the simplest implementations of particle filters, and it enables the generation of samples independently of the observation. For the SIR filter, we only need to know the state and observation equations and be able to sample from the prior, as in equation (3). Also, the resampling step is applied at every stage, and the weight assigned to the $i$-th support point of the distribution of the speech signal at time $t$ is updated as

$$w_t^i \propto p(y_t \mid x_t^i) \tag{31}$$
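The sampling step of equation (30) with a fixed particle budget can be sketched as below for a single feature dimension. The proportional allocation rule (truncate, then hand the leftover particles to the largest weights) is an assumption made for this illustration, since the text only states that each mixture's contribution depends on its weight; all names are hypothetical.

```python
import random


def allocate_particles(mixture_weights, num_particles):
    """Split a fixed particle budget across mixtures in proportion to their weights.

    Assumes the weights sum to one.
    """
    counts = [int(w * num_particles) for w in mixture_weights]
    # Hand out the particles lost to truncation, largest weights first
    order = sorted(range(len(counts)), key=lambda k: -mixture_weights[k])
    for k in order[: num_particles - sum(counts)]:
        counts[k] += 1
    return counts


def sample_state_particles(weights, means, stds, num_particles, seed=0):
    """Draw particles from the GMM of one HMM state (equation (30)), scalar case."""
    rng = random.Random(seed)
    particles = []
    counts = allocate_particles(weights, num_particles)
    for count, mu, sd in zip(counts, means, stds):
        particles.extend(rng.gauss(mu, sd) for _ in range(count))
    return particles


def normalize(likelihoods):
    """Equation (31): weights proportional to p(y_t | x_t^i), scaled to sum to one."""
    total = sum(likelihoods)
    return [l / total for l in likelihoods]
```

Because the importance density is the prior itself, the samples are drawn without looking at the observation; the observation enters only through the likelihoods passed to `normalize`.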

The procedure for obtaining the HMMs and the state sequence is described in detail later. To obtain $p(y_t \mid x_t^i)$, the distribution of the log spectra of the noise in each channel is assumed Gaussian with mean $\mu_n$ and variance $\sigma_n^2$. Assuming there is additive noise only, with no channel effects,

$$y = x + \log(1 + e^{n - x}) \tag{32}$$

We are interested in evaluating $p(y \mid x)$, where $x$ represents clean speech and $n$ is the noise with density $N(\mu_n, \sigma_n^2)$. Then

$$P[Y < y \mid x] = P[x + \log(1 + e^{N - x}) < y \mid x] = P[N < u \mid x] = F(u), \qquad p(y \mid x) = \frac{d}{dy} F(u) = p(u)\, \frac{e^{y-x}}{e^{y-x} - 1} \tag{33}$$

where $F(\cdot)$ is the Gaussian cumulative distribution function with mean $\mu_n$ and variance $\sigma_n^2$, $p(\cdot)$ is the corresponding density, and $u = \log(e^{y-x} - 1) + x$. In the case of MFCC features, the nonlinear transformation is [14]

$$y = x + D \log\left(1 + e^{D^{-1}(n - x)}\right) \tag{34}$$

Consequently,

$$p(y \mid x) = p_N\left(g^{-1}(y)\right) J_{g^{-1}}(y) \tag{35}$$

where $g$ is the mapping from the noise $n$ to the observation $y$ for a given $x$ in equation (34), $p_N(\cdot)$ is a Gaussian pdf, $J_{g^{-1}}(y)$ is the corresponding Jacobian and $D$ is a discrete cosine transform matrix, which is not square and thus not invertible. To overcome this problem, we zero-pad the $y$ and $x$ vectors and extend $D$ to be a square matrix. The variance of the noise density is obtained from the available noise samples. Once the point density of the clean speech features is available, we estimate the compensated features using a discrete approximation of the expectation:

$$x_t = \sum_{i=1}^{N_s} w_t^i\, x_t^i \tag{36}$$

where $N_s$ is the total number of particle samples at time $t$.
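Putting the pieces together for one frame and one log-spectral channel, the sketch below draws particles from a state GMM, weights them with the likelihood of equation (33), and forms the estimate of equation (36). It is a simplified illustration rather than the implementation used in this chapter: it handles a single channel, picks mixture components at random in proportion to their weights, and falls back to the noisy value when no particle explains the observation. All names are hypothetical.

```python
import math
import random


def gaussian_pdf(v, mean, std):
    z = (v - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def noise_from_observation(y, x):
    """u = log(exp(y - x) - 1) + x: the noise value that turns clean x into noisy y."""
    return math.log(math.expm1(y - x)) + x


def likelihood(y, x, noise_mean, noise_std):
    """p(y | x) from equation (33) for a single log-spectral channel."""
    if y <= x:
        # Additive noise can only raise the log spectrum, so y <= x has zero density
        return 0.0
    em1 = math.expm1(y - x)
    jacobian = (em1 + 1.0) / em1  # exp(y - x) / (exp(y - x) - 1)
    u = noise_from_observation(y, x)
    return gaussian_pdf(u, noise_mean, noise_std) * jacobian


def compensate_frame(y, gmm, noise_mean, noise_std, num_particles=200, rng=None):
    """Compensated feature for one frame and one channel.

    gmm : list of (weight, mean, std) for the HMM state assumed active at this frame
    """
    rng = rng or random.Random(0)
    # Sample from the prior (equation (30)), choosing components by mixture weight
    comps = rng.choices(gmm, weights=[c[0] for c in gmm], k=num_particles)
    particles = [rng.gauss(mu, sd) for _, mu, sd in comps]
    # Weights proportional to the observation likelihood (equation (31))
    weights = [likelihood(y, x, noise_mean, noise_std) for x in particles]
    total = sum(weights)
    if total == 0.0:
        return y  # assumption: leave the frame unchanged if no particle fits
    # Weighted mean of the particles (equation (36))
    return sum(w * x for w, x in zip(weights, particles)) / total
```

For example, with clean speech near 5.0 and noise near 4.0 in the log domain, the noisy observation is about 5.31, and `compensate_frame(5.3133, [(1.0, 5.0, 0.5)], 4.0, 0.5)` returns a value below the observation and close to the clean level.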

3.1. Estimation of HMM side information

As described above, it is important to obtain $\Phi = \{\lambda_m, S\}$, where $\lambda_m$ is an HMM that faithfully represents the speech segment being compensated and $S = s_1, s_2, \ldots, s_T$ is the state sequence corresponding to the utterance of length $T$. To obtain $\lambda_m$ for the $m$-th word $W_m$ in the utterance, we choose the N-best models $\lambda_{m_1}, \lambda_{m_2}, \ldots, \lambda_{m_N}$ from HMMs trained using clean speech data. The $N$ models are combined to obtain a single model $\lambda_m$ as follows.

3.1.1. Gaussian Mixtures Estimation

To obtain the observation model for each state $j$ of model $\lambda_m$, we concatenate the mixtures from the corresponding states of all component models:

$$\hat{b}_j^{(m)}(o) = \sum_{l=1}^{L} \sum_{k=1}^{K} c_{k,j}^{(m_l)}\, N\left(\mu_{k,j}^{(m_l)}, \Sigma_{k,j}^{(m_l)}\right) \tag{37}$$

where $K$ is the number of Gaussian mixtures in each original HMM and $L$ is the number of different words $m_1, m_2, \ldots, m_L$ in the N-best hypotheses. $\mu_{k,j}^{(m_l)}$ and $\Sigma_{k,j}^{(m_l)}$ are the mean and covariance of the $k$-th mixture in the $j$-th state of model $m_l$. The mixture weights are normalized by scaling them according to the likelihood of occurrence of the model from which they come:

$$c_{k,j}^{(m_l)} \leftarrow c_{k,j}^{(m_l)} \times p(W_m = \lambda_{m_l}) \tag{38}$$

The mixture weight is an important parameter because it determines the number of samples that will be generated from the corresponding mixture. The state transition coefficients for $\lambda_m$ are computed as follows:

$$\begin{aligned} \hat{a}_{ij}^{(m)} &= \sum_{l=1}^{L} p\left[s_t^{(m_l)} = i,\ s_{t-1}^{(m_l)} = j \mid W_m = \lambda_{m_l}\right] p\left[W_m = \lambda_{m_l}\right] \\ \hat{a}_{ij}^{(m)} &= \sum_{l=1}^{L} \left[a_{ij}^{(m_l)} \mid W_m = \lambda_{m_l}\right] p\left[W_m = \lambda_{m_l}\right] \end{aligned} \tag{39}$$
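The model combination of equations (37) to (39) can be illustrated with the following sketch. The data layout is an assumption made for the example: each candidate HMM is a list indexed by state, each state holds a list of `(weight, mean, variance)` tuples, and `model_probs` contains $p(W_m = \lambda_{m_l})$ for the candidates, summing to one.

```python
def combine_state_mixtures(models, model_probs, state):
    """Equations (37)-(38): pool the mixtures of `state` from all candidate models,
    scaling each mixture weight by the probability of the model it comes from."""
    pooled = []
    for hmm, prob in zip(models, model_probs):
        for weight, mean, var in hmm[state]:
            pooled.append((weight * prob, mean, var))
    return pooled


def combine_transitions(trans_list, model_probs):
    """Equation (39): probability-weighted average of the candidates' transition
    matrices, computed element by element."""
    n = len(trans_list[0])
    return [
        [sum(p * a[i][j] for a, p in zip(trans_list, model_probs)) for j in range(n)]
        for i in range(n)
    ]
```

Since each candidate's mixture weights sum to one within a state and the model probabilities sum to one, the pooled weights again sum to one, so the combined state is still a valid mixture.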

3.1.2. State sequence estimation

The recognition performance can be greatly improved if a good estimate of the HMM state sequence $S$ is available. However, obtaining this sequence in a noisy operational environment is very challenging. The simplest approach is to use the decoded state sequence obtained with multi-condition trained models in an ASR recognition process, as shown at the bottom of Figure 6. However, these states often correspond to incorrect models and can deviate significantly from the optimal ones. Alternatively, we can determine the states (to generate samples from) sequentially during compensation. For left-to-right HMMs, given the state $s_{t-1}$ at time $t-1$, we choose $s_t$ using equation (40) as follows:

$$s_t \sim a_{s_t, s_{t-1}}, \qquad s_t = \arg\max_i\,(a_{ij}) \tag{40}$$

where $a$ comes from the state transition matrix of $\lambda_m$. The mixture indices are subsequently selected from among the mixtures corresponding to the chosen state.
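The sequential state choice of equation (40) can be sketched in two variants: drawing the next state from the transition probabilities, or taking the most probable transition. The function name, the `greedy` switch and the row convention `trans[j][i]` (probability of moving from state `j` to state `i`) are assumptions for this illustration.

```python
import random


def next_state(trans, prev_state, rng=None, greedy=False):
    """Choose s_t given s_{t-1} following equation (40).

    trans[j][i] : probability of moving from state j to state i (assumed layout)
    greedy      : if True take the arg max, otherwise sample from the row
    """
    row = trans[prev_state]
    if greedy:
        return max(range(len(row)), key=lambda i: row[i])
    rng = rng or random.Random(0)
    return rng.choices(range(len(row)), weights=row, k=1)[0]
```

For a left-to-right model, a row has non-zero entries only for the current state and its successor, so either variant can only stay in the current state or advance.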

3.1.3. Experiments

To investigate the properties of the proposed approach, we first assume that a decent estimate of the state is available at each frame. Moreover, we assume that the speech boundaries are marked, so that the silence and speech sections of the utterance are known. To obtain this information, we use a set of digit HMMs (18 states, 3 Gaussian mixtures) that have been trained using clean speech represented by 23-channel mel-scale log spectral features. The speech boundaries and state information for a particular noisy utterance are then captured through digit recognition performed on the corresponding clean speech utterance. The speech boundary information is critical because the noise statistics have to be estimated from the noise-only (non-speech) section of the utterance. To get the HMM needed for particle filter compensation, $L$ models $\lambda_1, \lambda_2, \ldots, \lambda_L$ are selected based on the N-best hypothesis list. For our experiments, we set $L = 3$. We combine these models to get $\lambda'_m$ for the $m$-th word in the utterance. The best results are obtained if the correct word model is present in the pool of models that contribute to $\lambda'_m$. Once this information is available, the noisy log spectral features are compensated using sequential importance sampling. To see the efficacy of the compensation process, we consider the noisy, clean and compensated filter bank outputs (channel 8) for the whole utterance shown in Figure 7. The SNR for this particular case is 5 dB. It is clear that the compensated feature matches the clean feature well. It should be noted, however, that such a good restoration of the clean speech signal from the noisy signal is achievable only when a good estimate of the side information about the state and mixture component sequences is available.

Figure 7.
Fbank channel 8 for noisy speech and the corresponding underlying clean and compensated speech (SNR = 5 dB).

Assuming all such information is given (the ideal oracle case), recognition can be performed on MFCCs (39 features: 13 MFCCs and their first and second time derivatives) extracted from these compensated log spectral features. The HMMs used for recognition are trained with noisy data that has been compensated in the same way as the testing data. The performance, compared to multi-condition (MC) and clean condition training (Columns 5 and 6 in Table 1), is given in Column 2 of Table 1 (Adapted Models I). A very significant 67% digit error reduction is attained if the missing information is made available.

Word accuracy (%)	Adapted Models I	Adapted Models II	Adapted Models III	MC Training	Clean Training
clean	99.10	99.10	99.10	98.50	99.11
20dB	97.75	96.46	97.38	97.66	97.21
15dB	97.61	95.98	96.47	96.95	92.36
10dB	96.66	94.00	94.40	95.16	75.14
5dB	95.20	90.64	88.02	89.14	42.42
0dB	92.13	82.62	68.28	64.75	22.57
-5dB	89.28	72.13	32.92	27.47	NA
0-20dB	95.86	90.23	88.91	88.73	65.94

Table 1.

ASR accuracy comparisons for Aurora-2

In actual operational scenarios, when no side information is available, the models were chosen from the N-best list while the states were computed using Viterbi decoding. Of course, the states then correspond to only one model, which might not be correct, and there might be a significant mismatch between the actual and computed states. Moreover, the misalignment of words exacerbated the problem. The results for this case (Adapted Models III, Column 4 of Table 1) were only marginally better than those obtained with the multi-condition trained models. To see the improvement when the states are better aligned, we made use of whatever information we could get. The boundaries of the words were extracted from the N-best list using exhaustive search, and the states for the words between these boundaries were assigned by splitting the digits into equal-sized segments and assigning one state to each segment. This limited the damage done by state misalignment, and a 13% digit error reduction from MC training was observed (Adapted Models II, Column 3 of Table 1).

3.2. A clustering approach to obtaining correct HMM information

HMM states are used to spread the particles at the right locations for subsequent estimation of the underlying clean speech density. If the state is incorrect, the location of particles will be wrong and the density estimate will be erroneous. One solution is to merge the states into clusters. Since the total number of clusters can be much less than the number of states, the problem of choosing the correct information block for sample generation is simplified. A tree structure to group the Gaussian mixtures from clean speech HMMs into clusters can be built with the following distance measure [15]:

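$$d(m,n) = \int g_m(x) \log\frac{g_m(x)}{g_n(x)}\, dx + \int g_n(x) \log\frac{g_n(x)}{g_m(x)}\, dx \tag{41}$$

$$= \frac{1}{2} \sum_i \left[ \frac{\sigma_m^2(i) - \sigma_n^2(i) + (\mu_n(i) - \mu_m(i))^2}{\sigma_n^2(i)} + \frac{\sigma_n^2(i) - \sigma_m^2(i) + (\mu_n(i) - \mu_m(i))^2}{\sigma_m^2(i)} \right] \tag{42}$$

where $\mu_m(i)$ is the $i$-th element of the mean vector $\mu_m$ and $\sigma_m^2(i)$ is the $i$-th diagonal element of the covariance matrix $\Sigma_m$. The parameters of the single Gaussian representing the cluster, $g_{c_k}(X) = N(X \mid \mu_k, \sigma_k^2)$, are computed as follows:

$$\mu_k(i) = \frac{1}{M_k} \sum_{m=1}^{M_k} E\left(x_m^{(k)}(i)\right) = \frac{1}{M_k} \sum_{m=1}^{M_k} \mu_m^{(k)}(i) \tag{43}$$

$$\sigma_k^2(i) = \frac{1}{M_k} \sum_{m=1}^{M_k} E\left(\left(x_m^{(k)}(i) - \mu_k(i)\right)^2\right) = \frac{1}{M_k} \left[ \sum_{m=1}^{M_k} \sigma_m^{2(k)}(i) + \sum_{m=1}^{M_k} \mu_m^{(k)2}(i) - M_k\, \mu_k^2(i) \right] \tag{44}$$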
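Equations (43) and (44) amount to moment matching: the cluster Gaussian takes the mean and variance of an equally weighted mixture of its members. A minimal sketch for one feature dimension $i$ is given below; the function name is hypothetical.

```python
def cluster_gaussian(means, variances):
    """Equations (43)-(44) for one feature dimension.

    means, variances : parameters of the M_k member Gaussians of the cluster
    Returns the mean and variance of the single Gaussian representing the cluster.
    """
    m = len(means)
    mu_k = sum(means) / m
    var_k = (sum(variances) + sum(mu * mu for mu in means) - m * mu_k * mu_k) / m
    return mu_k, var_k
```

For two members with means 0 and 2 and unit variances, the cluster mean is 1 and the cluster variance is 2: one unit from the within-member variance and one from the spread of the member means around the cluster mean.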

Alternatively, we can group the components at the state level using the following distance measure [16]:

$$d(n,m) = -\frac{1}{S} \sum_{s=1}^{S} \frac{1}{P} \sum_{p=1}^{P} \left\{ \log\left[b_{ms}(\mu_{nsp})\right] + \log\left[b_{ns}(\mu_{msp})\right] \right\} \tag{45}$$

where $S$ is the total number of states in the cluster, $P$ is the number of mixtures per state and $b(\cdot)$ is the observation probability. This method makes it easy to track the state level composition of each cluster. In both cases, the clustering algorithm proceeds as follows:

Create one cluster for each mixture, up to $k$ clusters.
While $k > M_k$, find the $n$ and $m$ for which $d(n,m)$ is minimum and merge them.
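The two clustering steps above can be sketched as a simple agglomerative loop over diagonal Gaussians. Several choices here are assumptions for the illustration: each Gaussian is a `(means, variances)` pair of lists, the distance between two clusters is measured between their moment-matched representative Gaussians, and the divergence is the symmetric Kullback-Leibler divergence including its factor of 1/2 (a constant factor does not change which pair is merged). The names `divergence`, `representative` and `agglomerate` are hypothetical.

```python
def divergence(g1, g2):
    """Symmetric KL divergence between two diagonal Gaussians (means, variances)."""
    total = 0.0
    for mu_m, var_m, mu_n, var_n in zip(g1[0], g1[1], g2[0], g2[1]):
        diff2 = (mu_n - mu_m) ** 2
        total += (var_m - var_n + diff2) / var_n + (var_n - var_m + diff2) / var_m
    return 0.5 * total


def representative(cluster):
    """Single Gaussian for a cluster by moment matching of its equally weighted members."""
    dims = len(cluster[0][0])
    m = len(cluster)
    means = [sum(g[0][i] for g in cluster) / m for i in range(dims)]
    variances = [
        sum(g[1][i] + g[0][i] ** 2 for g in cluster) / m - means[i] ** 2
        for i in range(dims)
    ]
    return means, variances


def agglomerate(gaussians, target):
    """Start with one cluster per Gaussian and merge the closest pair until
    only `target` clusters remain."""
    clusters = [[g] for g in gaussians]
    while len(clusters) > target:
        reps = [representative(c) for c in clusters]
        pairs = [
            (divergence(reps[a], reps[b]), a, b)
            for a in range(len(reps))
            for b in range(a + 1, len(reps))
        ]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Given four one-dimensional Gaussians with means 0.0, 0.1, 5.0 and 5.2 and unit variances, `agglomerate(..., 2)` groups the first two and the last two, as expected from their separation.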

Once clustering is complete, it is important to pick the most suitable cluster for feature compensation at each frame. The particle samples are then generated from the representative density of the chosen cluster. Two methods can be explored. The first is to decide the cluster based on the N-best transcripts obtained from recognition using multi-condition trained models. Denote the states obtained from the N-best transcripts for the noisy speech feature vector at time $t$ as $s_t^1, s_t^2, \ldots, s_t^N$. If state $s_t^i$ is a member of cluster $c_k$, we increment $M(c_k)$ by one, where $M(c_k)$ is a count of how many states from the N-best list belong to cluster $c_k$. We choose the cluster given by $\arg\max_k M(c_k)$ and generate samples from it. If more than one cluster satisfies this criterion, we merge their probability density functions. In the second method, we choose the cluster that maximizes the likelihood of the MFCC vector at time $t$, $O_t$, belonging to that cluster:

$$C \sim \arg\max_k\, g_{mc}(O_t \mid C_k) \tag{46}$$
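Both cluster selection methods can be sketched briefly. The data structures are assumptions for the illustration: `state_to_cluster` maps a state identifier to its cluster index, and each multi-condition cluster is represented by a single diagonal Gaussian `(means, variances)`; the function names are hypothetical.

```python
import math
from collections import Counter


def choose_by_nbest(nbest_states, state_to_cluster):
    """Method 1: count how many N-best states fall in each cluster, M(c_k), and
    return every cluster that attains the maximum count (ties are all returned)."""
    counts = Counter(state_to_cluster[s] for s in nbest_states)
    best = max(counts.values())
    return sorted(k for k, c in counts.items() if c == best)


def log_gaussian(obs, means, variances):
    """Log density of a diagonal Gaussian at the observation vector."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (o - m) ** 2 / v)
        for o, m, v in zip(obs, means, variances)
    )


def choose_by_likelihood(obs, mc_clusters):
    """Method 2, equation (46): the multi-condition cluster under which the
    observation O_t is most likely.

    mc_clusters : dict mapping cluster index -> (means, variances)
    """
    return max(mc_clusters, key=lambda k: log_gaussian(obs, *mc_clusters[k]))
```

When `choose_by_nbest` returns more than one index, the corresponding cluster densities would be merged before sampling, as described in the text.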

It is important to emphasize here that $g_{mc}$ is derived from multi-condition speech models and has a different distribution from the one used to generate the samples. The relationship between clean clusters and multi-condition clusters is shown in Figure 8. Clean clusters are obtained using the methods described in section 3. The composition information of these clusters is then used to build a corresponding multi-condition cluster set from the multi-condition HMMs. A cluster $C_j$ in the clean cluster set represents statistical information about a particular section of clean speech. Its multi-condition counterpart represents the statistics of the noisy version of the same speech section.

Figure 8.
Clustering of multi-condition trained HMMs

Clean clusters are necessary to track clean speech because we need to generate samples from clean speech distributions. However, they are not the best choice for evaluating equation (46), because the observation is noisy and has a different distribution. The best candidate for evaluating equation (46) is the multi-condition cluster set, which is constructed from multi-condition HMMs that match noisy speech more closely. A block diagram of the overall compensation and recognition process is shown in Figure 9. We infer the cluster to be used for observation vector $O_t$ by combining the N-best transcripts with equation (46). Samples at frame $t$ are then generated using the pdf of the chosen cluster. The weights of the samples are computed using equation (46) and the compensated features are obtained using equation (36). Once the compensated features are available for the whole utterance, recognition is performed again using HMMs retrained with compensated features.

3.2.1. Experiments

To evaluate the proposed framework, we experimented on the Aurora 2 connected digit task. We extracted two feature streams from the test speech: 39-element features (13 MFCCs and their first and second time derivatives) and 23-channel filter-bank features. The 1-best transcript was obtained from the MFCC stream using the multi-condition trained HMMs. PFC was then applied to the filter-bank stream (stream two). We chose two clusters, one based on the 1-best transcript and the other selected with equation (46). The multi-condition clusters used in equation (46) were built from 23-channel filter-bank features so that the test features from stream two could be used directly to evaluate the likelihood of the observations. For the results in these experiments, clusters were formed using method two, i.e., tracking the state-wise composition of each cluster. The number of clusters and the number of particles were varied to evaluate the performance of the algorithm under different settings. From the compensated filter-bank features of stream two, we extracted 39-element MFCC features. Final recognition on these features was done using the retrained HMMs, i.e., HMMs trained on multi-condition training data compensated in the same fashion as described above.
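The last feature-extraction step, going from compensated filter-bank frames to 39-element vectors, can be sketched as follows. The chapter does not give these details, so everything specific here is an assumption: the inputs are taken to be log filter-bank energies, the cepstra are an unnormalised DCT-II keeping coefficients 0 to 12, and the time derivatives use a regression formula over a window of two frames with edge frames replicated.

```python
import math

def dct_cepstra(log_fbank, num_ceps=13):
    """Unnormalised DCT-II of one frame of log filter-bank energies."""
    n = len(log_fbank)
    return [
        sum(e * math.cos(math.pi * k * (j + 0.5) / n) for j, e in enumerate(log_fbank))
        for k in range(num_ceps)
    ]

def deltas(frames, window=2):
    """Regression-based time derivatives; edge frames are replicated."""
    denom = 2.0 * sum(th * th for th in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[0])):
            num = sum(
                th * (frames[min(t + th, last)][d] - frames[max(t - th, 0)][d])
                for th in range(1, window + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out

def mfcc_39(log_fbank_frames):
    """13 static cepstra plus first and second derivatives per frame."""
    static = [dct_cepstra(f) for f in log_fbank_frames]
    first = deltas(static)
    second = deltas(first)
    return [s + a + b for s, a, b in zip(static, first, second)]

# Five toy frames of 23 identical log energies.
features = mfcc_39([[1.0] * 23 for _ in range(5)])
print(len(features), len(features[0]))
```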

Word accuracy (%)	20 clusters	25 clusters	30 clusters	MC trained	Clean trained
clean	99.11	99.11	99.11	98.50	99.11
20dB	97.76	98.00	97.93	97.66	97.21
15dB	97.00	97.14	96.69	96.80	92.36
10dB	95.21	95.41	93.88	95.32	75.14
5dB	89.48	89.59	87.08	89.14	42.42
0dB	70.16	70.38	68.84	64.75	22.57
-5dB	36.30	36.63	36.94	27.47	NA
0-20dB	89.92	90.10	88.88	88.73	65.94

Table 2.

Variable number of clusters (100 particles)

The results for a fixed number of particles (100) are shown in Table 2. The number of clusters was 20, 25 or 30. To reach a given number of clusters, HMM states were merged and clustering was stopped when the specified number was reached. All HMM sets used 18 states, with each state represented by 3 Gaussian mixture components. For the 11-digit vocabulary, this gives a total of approximately 180 states. With 20 clusters, for example, there is a 9-to-1 reduction in the number of information blocks to choose from when plugging into the PF scheme.
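As a quick check of the reduction factor for each setting, taking the approximate figure of 180 states from the text at face value:

```latex
\frac{180}{20} = 9, \qquad \frac{180}{25} = 7.2, \qquad \frac{180}{30} = 6
```

So each cluster pools roughly 9, 7 or 6 states on average. The real clusters need not be of equal size, since states are merged by statistical similarity rather than in fixed groups.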

It is interesting to note that the best results were obtained for 25 clusters. Increasing the number of clusters beyond 25 did not improve the accuracy. The larger the number of clusters, the more specific the section of speech statistics each cluster covers. More specific information in each cluster is good for compensation and recognition because the particles can be placed more accurately. However, with a large number of clusters to choose from, it is harder to pick the correct cluster for generating particles. More errors were then made in the cluster selection process, which degraded the overall performance.

This is further illustrated in Figure 10. If the correct cluster is known, having a large number of clusters, and consequently more specific information per cluster, only improves the performance. The results are for 20, 25 and 30 clusters. In the known cluster case, one cluster is obtained using equation (46) and the second cluster is the correct one. The correct cluster is the one containing the state to which the observation actually belongs, obtained by recognizing the clean version of the noisy utterance with clean HMMs. For the unknown cluster case, the clusters are obtained using equation (46) and the 1-best transcript. It can readily be observed from the known cluster case that if the choice of cluster is always correct, the recognition performance improves drastically. The error rate was reduced by 54%, 59% and 61.4% for 20, 25 and 30 clusters, respectively. Moreover, the improvement grows consistently with the number of clusters used. This is corroborated by the fact that when the cluster is specific down to the HMM state level, i.e., the exact HMM state sequence is assumed known and each state is a separate cluster (a total of approximately 180 clusters), the error rate was reduced by as much as 67% [10].

For the results in Table 3, we fixed the number of clusters and varied the number of particles. As the number of particles increased, the accuracy of the algorithm improved for Sets A and B combined, i.e., for additive noise. The error reduction is 17% over MC trained models. Using a larger number of particles means that more samples are used to construct the predicted density of the underlying clean speech features, which is then denser and thus better approximated. Accordingly, a gradual improvement in the recognition results was observed as the number of particles increased. For Set C, however, the performance was worse when more particles were used. This is because the underlying distribution is different due to distortions other than additive noise.

Word accuracy (%)	Set A	Set B	Set C	Average
100 particles	90.02	91.03	89.26	90.1
500 particles	90.03	91.10	89.07	90.07
1000 particles	90.02	91.13	89.07	90.07
MC Trained	88.41	88.82	88.97	88.73
Clean Trained	64.00	67.46	65.39	65.73

Table 3.

Variable number of particles (25 clusters)
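The 17% figure quoted for Sets A and B can be reproduced from Table 3 with a short calculation. Two assumptions are made here that the text does not state explicitly: the error rate is taken as 100 minus the word accuracy, and Sets A and B are weighted equally when combined.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Relative error rate reduction (%) from two word accuracies given in %."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err

def mean(values):
    return sum(values) / len(values)

# Word accuracies (%) for Set A and Set B, copied from Table 3.
table3 = {
    "100 particles": (90.02, 91.03),
    "500 particles": (90.03, 91.10),
    "1000 particles": (90.02, 91.13),
}
mc_trained = (88.41, 88.82)

reductions = {
    name: relative_error_reduction(mean(mc_trained), mean(accs))
    for name, accs in table3.items()
}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative error reduction over MC trained")
```

Under these assumptions the reduction comes out at roughly 16.8%, 17.1% and 17.2% for 100, 500 and 1000 particles, consistent with the quoted 17% and with the gradual improvement described for additive noise.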

Figure 10.
Accuracy when correct cluster known vs. unknown

4. Conclusions

In this chapter, we proposed a particle filter compensation approach to robust speech recognition and showed that a tight coupling and sharing of information between HMMs and particle filters has strong potential to improve recognition performance in adverse environments. We noted that the state and mixture sequences used for compensation with particle filters need to be accurately aligned with the actual HMM state sequences that describe the underlying clean speech features. Although we observed improved performance with the current particle filter compensation implementation, there is still a considerable performance gap between the oracle setup with correct side information and what is achievable in this study with the missing side information estimated from noisy speech. We further developed a scheme to merge statistically similar information in HMM states, which enables us to find the right section of the HMMs to plug dynamically into the particle filter algorithm. Results show that if we use information from HMMs that specifically matches the section of speech being compensated, significant error reduction is possible compared to multi-condition HMMs.

References

1. C.-H. Lee and Q. Huo, "On adaptive decision rules and decision parameter adaptation for automatic speech recognition," Proc. IEEE, vol. 88, pp. 1241-1269, 2000.
2. S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllable word recognition in continuously spoken sentences," Proc. ICASSP, 1980, vol. 28, no. 4, pp. 357-366.
3. A. Sankar and C.-H. Lee, "A maximum-likelihood approach to stochastic matching for robust speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, pp. 190-202, May 1996.
4. B. Raj, R. Singh, and R. Stern, "On tracking noise with linear dynamical system models," Proc. ICASSP, 2004.
5. M. Fujimoto and S. Nakamura, "Particle filter based non-stationary noise tracking for robust speech recognition," Proc. ICASSP, 2005.
6. M. Fujimoto and S. Nakamura, "Sequential non-stationary noise tracking using particle filtering with switching dynamical system," Proc. ICASSP, 2006.
7. M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Trans. Signal Proc., 2002.
8. Robert Grover Brown and Patrick Y. C. Hwang, Introduction to Random Signals and Applied Kalman Filtering, 3rd edition, Prentice Hall, 1996.
9. Simon Haykin, Adaptive Filter Theory, 4th edition, Prentice Hall, 2009.
10. A. Mushtaq, Y. Tsao, and C.-H. Lee, "A particle filter compensation approach to robust speech recognition," Proc. Interspeech, 2009.
11. A. Mushtaq and C.-H. Lee, "An integrated approach to feature compensation combining particle filters and Hidden Markov Model for robust speech recognition," Proc. ICASSP, 2012.
12. Todd K. Moon and Wynn C. Stirling, Mathematical Methods and Algorithms for Signal Processing, Pearson Education, 2007.
13. A. Doucet and A. M. Johansen, "A tutorial on particle filtering and smoothing: Fifteen years later," Tech. Rep., 2008. [Online]. http://www.cs.ubc.ca/~arnaud/doucet_johansen_tutorialPF.pdf
14. A. Acero, L. Deng, T. Kristjansson, and J. Zhang, "HMM adaptation using vector Taylor series for noisy speech recognition," Proc. ICSLP, pp. 869-872, 2002.
15. T. Watanabe, K. Shinoda, K. Takagi, and E. Yamada, "Speech recognition using tree-structured probability density function," in Proc. Int. Conf. Speech Language Processing '94, 1994, pp. 223-226.
16. S. J. Young, J. J. Odell, and P. C. Woodland, "Tree-based state tying for high accuracy acoustic modeling," Proc. ARPA Human Language Technology Workshop, pp. 307-312, 1994.

Modern Speech Recognition Approaches with Case Studies
