ASR accuracy comparisons for Aurora-2
Speech production goes through several stages. First, a thought is generated in the speaker's mind. The thought is put into a sequence of words. These words are converted into a speech signal using various muscles, including those of the face, chest, and tongue. This signal is distorted by environmental factors such as background noise, reverberation, and channel distortions when it is sent through a microphone or a telephone channel. The aim of automatic speech recognition (ASR) systems is to reconstruct the spoken words from the speech signal. From an information-theoretic perspective, we can treat what lies between the speaker and the machine as a distortion channel, as shown in figure 1.
Here, $W$ represents the spoken words and $X$ is the speech signal. The problem of extracting $W$ from $X$ can be viewed as finding the word sequence that most likely resulted in the observed signal, as given in equation (1):

$$\hat{W} = \arg\max_{W} P(W \mid X) \qquad (1)$$
As in any other machine learning or pattern recognition problem, the posterior distribution in equation (1) plays a fundamental role in the decoding process. This distribution is parametric, and its parameters are estimated from the available training data. Modern ASR systems do well when the environment of the speech signal being tested matches that of the training data, because the parameter values then correspond well to the speech signal being decoded. However, if the training and testing environments do not match well, the performance of ASR systems degrades. Many schemes have been proposed to overcome this problem, but humans still outperform these systems, especially in adverse conditions.
The approaches to overcome this problem fall into two categories. One is to adapt the parameters of this distribution so that they better match the testing environment; the other is to choose features that are more robust to environmental variations. The features can also be transformed to make them better suited to the parameters obtained from the training data.
1.1. Typical ASR system
Typical small-vocabulary ASR systems comprise three main components, as shown in figure 2. Speech data is available as a waveform, which is first converted into feature vectors. Mel Frequency Cepstrum Coefficient (MFCC) features have been widely used in the speech community for speech recognition because of their superior discriminative capability.
The features from an available training speech corpus are used to estimate the parameters of the acoustic models. An acoustic model for a particular speech unit, say a phoneme or a word, is the likelihood of observing that unit based on the features, as given in equation 1.1. The most commonly used structure for acoustic models in ASR systems is the Hidden Markov Model (HMM). These models capture the dynamics and variations of the speech signal well. The test speech signal is then decoded using a Viterbi decoder.
1.2. Distortions in speech
The distortions in a speech signal can be viewed in the signal space, the feature space, and the model space, as shown in figure 3. Resilience to environmental distortions can be added in the feature extraction process, by modifying the distorted features, or by adapting the acoustic models to better match the environment from which the test signal has emanated. The figure distinguishes the speech signal, the speech features, and the acoustic models.
In stage 1, the feature extraction process is improved so that the features are robust to distortions. In stage 2, the features are modified to match them better with the training environment. The mismatch in this stage is usually modeled by nuisance parameters. These are estimated from the environment and the test data, and their effect is minimized based on some optimality criterion. In stage 3, the acoustic models are improved to better match the testing environment. One way to achieve this is multi-condition training, i.e., using data from diverse environments to train the models. Another way is to transform the models, where the transformation matrix is obtained from the test environment.
1.3. Speech and noise tracking for noise compensation
A sequential Monte Carlo feature compensation algorithm was initially proposed [4-5] in which the noise was treated as a state variable while speech was considered as the signal corrupting the observation noise, and a VTS approximation was used to approximate the clean speech signal by applying a minimum mean square error (MMSE) procedure. Extended Kalman filters were also used to model a dynamical system representing the noise, an approach that was further improved by using Polyak averaging and feedback with a switching dynamical system. These were initial attempts to incorporate particle filters into speech recognition, and they did so indirectly, because the filter tracked the noise instead of the speech signal itself. Since the speech signal is treated as a signal corrupting the noise, little or none of the information readily available from the HMMs or the recognition process can be used efficiently in the compensation process.
Particle filters are powerful numerical mechanisms for sequential signal modeling and are not constrained by the conventional linearity and Gaussianity requirements. They generalize the Kalman filter and are more flexible than the extended Kalman filter because the stage-by-stage linearization of the state space model required by the extended Kalman filter is no longer needed. One difficulty of using particle filters lies in obtaining a state space model for speech, as consecutive speech features are usually highly correlated. Just as in the Kalman filter and HMM frameworks, state transition is an integral part of particle filter algorithms.
In contrast to the previous particle filter attempts [4-6], in this chapter we describe a method that treats the speech signal as the state variable and the noise as the corrupting signal, and attempts to estimate clean speech from noisy speech. We incorporate statistical information available in the acoustic models of clean speech, e.g., the HMMs trained with clean speech, as an alternative state transition model [10-11]. The similarity between HMMs and particle filters can be seen as follows: the observation probability density function of each HMM state describes, in statistical terms, the characteristics of the source generating a signal of interest when the source is in that particular state, whereas in particle filters we try to estimate the probability distribution of the state the system is in when it generates the observed signal of interest. Particle filters are suited to feature compensation because the probability density of the state can be updated dynamically on a sample-by-sample basis. The state densities of HMMs, on the other hand, are assumed to be independent of each other. Although HMMs are good for speech inference problems, they do not adapt well to fast-changing environments.
By establishing a close interaction of the particle filters and HMMs, the potentials of both models can be harnessed in a joint framework to perform feature compensation for robust speech recognition. We improve the recognition accuracy through compensation of noisy speech, and we enhance the compensation process by utilizing information in the HMM state transition and mixture component sequences obtained in the recognition process. When state sequence information is available we found we can attain a 67% digit error reduction from multi-condition training in the Aurora-2 connected digit recognition task. If the missing parameters are estimated in the operational situations we only observe a 13% error reduction in the current study. Moreover, by tracking the speech features, compensation can be done using only partial information about noise and consequently good recognition performance can be obtained despite potential distortion caused by non-stationary noise within an utterance.
The remainder of the chapter is organized as follows. In section 2, a general tracking scheme is described, followed by an explanation of the well-known Kalman filter tracking algorithm. Particle filters, which form the backbone of particle filter compensation (PFC), are also described in this section. In section 3, the steps involved in tracking and then extracting the clean speech signal from the noisy speech signal are laid out. We also discuss various methods to obtain the information required to couple the particle filters and the HMMs in a joint framework. Finally, the experimental results and performance comparison for PFC are given before drawing the conclusions in section 4.
2. Tracking algorithms
Tracking is the problem of estimating the trajectory of an object as it moves through a space. The space could be an image plane captured directly from a camera, or it could be synthetically generated from a radar sweep. Generally, tracking schemes can be applied to any system that can be represented by a time dynamical system consisting of a state space model and an observation model:

$$x_t = f_t(x_{t-1}, w_{t-1}), \qquad y_t = h_t(x_t, n_t) \qquad (2)$$

where $n_t$ is the observation noise and $w_{t-1}$ is the process noise, which represents the model uncertainties in the state transition function $f_t$. What is available is an observation $y_t$, which is a function of $x_t$. We are interested in finding a good estimate of the current state given the observations up to the current time, i.e., $p(x_t \mid y_{1:t})$. The state space model represents the relation between states adjacent in time. The model in equation (2) assumes that the state sequence is a one-step Markov process:

$$p(x_t \mid x_{0:t-1}) = p(x_t \mid x_{t-1}) \qquad (3)$$

It is further assumed that the observations are independent of one another given the states:

$$p(y_t \mid x_{0:t}, y_{1:t-1}) = p(y_t \mid x_t) \qquad (4)$$

Tracking is a two-step process. The first step is to obtain the density $p(x_t \mid y_{1:t-1})$ at time $t$. This is called the prior density of $x_t$. Once it is available, we can construct the posterior density $p(x_t \mid y_{1:t})$ when the observation $y_t$ becomes available. The propagation step is given in equation (5). The update step is obtained using Bayes' theorem (equation (6)).

$$p(x_t \mid y_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1} \qquad (5)$$

$$p(x_t \mid y_{1:t}) = \frac{p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})} \qquad (6)$$
2.1. Kalman filter as a recursive estimation tracking algorithm
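The Kalman filter is the optimal recursive estimator of the posterior density if the time dynamical system is linear:

$$x_t = F_t x_{t-1} + w_{t-1}, \qquad y_t = H_t x_t + n_t$$

where $F_t$ and $H_t$ are known as the state transition matrix and the observation matrix, respectively. The subscript $t$ indicates that both can vary with time. Under the assumption that the process noise and the observation noise are both Gaussian with zero mean and covariances $Q_t$ and $R_t$, respectively, $p(x_t \mid x_{t-1})$ can be readily obtained:

$$p(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, F_t x_{t-1},\, Q_t) \qquad (9)$$

To obtain the propagation step, we need $p(x_{t-1} \mid y_{1:t-1})$ in addition to $p(x_t \mid x_{t-1})$. Since the procedure is iterative, the estimate of $x_{t-1}$ given the observations up to time $t-1$ is available at time $t$; call it $\hat{x}_{t-1|t-1}$, and let its covariance be $P_{t-1|t-1}$. Then

$$p(x_{t-1} \mid y_{1:t-1}) = \mathcal{N}(x_{t-1};\, \hat{x}_{t-1|t-1},\, P_{t-1|t-1}) \qquad (10)$$

where $P_{t-1|t-1}$ is the covariance of $\hat{x}_{t-1|t-1}$ and is given by $E[(x_{t-1} - \hat{x}_{t-1|t-1})(x_{t-1} - \hat{x}_{t-1|t-1})^T]$. Both components of the integral in equation (5) are now available in equations (9) and (10). Solving the integral by expanding and completing the squares, we get

$$p(x_t \mid y_{1:t-1}) = \mathcal{N}(x_t;\, F_t \hat{x}_{t-1|t-1},\, F_t P_{t-1|t-1} F_t^T + Q_t)$$

This is the propagation step, and it is sometimes also written as

$$\hat{x}_{t|t-1} = F_t \hat{x}_{t-1|t-1}, \qquad P_{t|t-1} = F_t P_{t-1|t-1} F_t^T + Q_t$$

To get the update step, we note that the distributions of $x_t$ given $y_1, \dots, y_{t-1}$ and of $y_t$ given $x_t$ are both Gaussian. For two random variables, say $a$ and $b$, that are jointly Gaussian, the distribution of one given the other, for example $p(a \mid b)$, is also Gaussian. Consequently, $p(x_t \mid y_1, \dots, y_t)$ is a Gaussian distribution with covariance $P_{t|t}$ and the following mean:

$$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t (y_t - H_t \hat{x}_{t|t-1})$$

where $K_t$ is called the Kalman gain and is given by

$$K_t = P_{t|t-1} H_t^T (H_t P_{t|t-1} H_t^T + R_t)^{-1}$$

The covariance can be obtained from the fact that the covariance of $a$ given $b$, for two jointly Gaussian random variables, is given by

$$\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba}$$

We thus obtain the covariance of the estimate of $x_t$ as follows:

$$P_{t|t} = P_{t|t-1} - K_t H_t P_{t|t-1}$$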
The block diagram in Figure 4 shows the steps of a general recursive estimation algorithm, starting from some initial state estimate. The block labeled Kalman filter summarizes the steps specific to the Kalman filter algorithm.
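The propagation and update steps of this section can be illustrated with a minimal scalar sketch. This is an illustration under assumptions rather than the chapter's implementation: the state and observation are one-dimensional, and the names `kalman_step`, `kalman_filter`, `f`, `h`, `q` and `r` (standing for the state transition, observation, process noise variance and observation noise variance) are hypothetical.

```python
def kalman_step(x_est, p_est, y, f=1.0, h=1.0, q=0.01, r=0.25):
    """One propagate/update cycle of a scalar Kalman filter."""
    # Propagation: push the previous estimate through the linear state model.
    x_pred = f * x_est
    p_pred = f * p_est * f + q
    # Kalman gain: how much weight the new observation receives.
    k = p_pred * h / (h * p_pred * h + r)
    # Update: correct the prediction with the innovation y - h * x_pred.
    x_new = x_pred + k * (y - h * x_pred)
    p_new = p_pred - k * h * p_pred
    return x_new, p_new


def kalman_filter(observations, x0=0.0, p0=1.0, **model):
    """Run the recursion over a sequence and return the filtered estimates."""
    estimates = []
    x_est, p_est = x0, p0
    for y in observations:
        x_est, p_est = kalman_step(x_est, p_est, y, **model)
        estimates.append(x_est)
    return estimates
```

Calling `kalman_filter(ys, q=0.0, r=1.0)` on a constant sequence pulls the estimate from the initial guess toward that constant while the variance shrinks at every step.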
2.2. Grid-based methods
It is hard to obtain analytical solutions for most recursive estimation problems. If the state space of a problem is discrete, we can use grid-based methods and still obtain the optimal solution. Assuming that the state takes $N_s$ possible values $x^i$, $i = 1, \dots, N_s$, we can represent the discrete density $p(x_{t-1} \mid y_{1:t-1})$ using $N_s$ samples:

$$p(x_{t-1} \mid y_{1:t-1}) = \sum_{i=1}^{N_s} w^i_{t-1|t-1}\, \delta(x_{t-1} - x^i)$$

The prior and posterior densities at time $t$ take the same form with weights $w^i_{t|t-1}$ and $w^i_{t|t}$, where the weights are computed as follows:

$$w^i_{t|t-1} = \sum_{j=1}^{N_s} w^j_{t-1|t-1}\, p(x^i \mid x^j)$$

$$w^i_{t|t} = \frac{1}{C}\, w^i_{t|t-1}\, p(y_t \mid x^i), \qquad C = \sum_{j=1}^{N_s} w^j_{t|t-1}\, p(y_t \mid x^j) \qquad (21)$$

Here $C$ is the normalizing constant that makes the total probability equal to one. The assumption that the state can be represented by a finite number of points gives us the ability to sample the whole state space. The weight $w^i_{t|t}$ represents the probability of being in state $x^i$ when the observation at time $t$ is $y_t$. In the grid-based method we construct the discrete density at every time instant in two steps. First we estimate the weights at time $t$ without the current observation, and then we update them when the observation is available to obtain $w^i_{t|t}$. In the propagation step we take into account the probabilities (weights) of all possible state values at time $t-1$ to estimate the weights at time $t$, as shown in figure 5.
If the prior $p(x_t \mid x_{t-1})$ and the observation probability $p(y_t \mid x_t)$ are available, the grid-based method gives us the optimal solution for tracking the state of the system. If the state of the system is not discrete, we can obtain an approximate solution using this method. We divide the continuous space into, say, $N_s$ cells, and for each cell we compute the weights in a way that takes into account the range of the whole cell:

$$w^i_{t|t-1} = \sum_{j=1}^{N_s} w^j_{t-1|t-1} \int_{x \in \text{cell } i} p(x \mid \bar{x}^j_{t-1})\, dx$$

where $\bar{x}^j_{t-1}$ is the center of the $j$-th cell at time $t-1$. The weight update in equation (21) subsequently remains unchanged.
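For a discrete state space, the two-step weight recursion described in this section can be sketched in a few lines. The function name `grid_filter_step` and its argument layout are assumptions made for illustration; the transition table and the per-state observation likelihoods are taken as given.

```python
def grid_filter_step(weights, transition, likelihood):
    """One propagate/update cycle of a grid-based filter.

    weights[j]       probability of state j given past observations
    transition[j][i] probability of moving from state j to state i
    likelihood[i]    probability of the current observation in state i
    """
    n = len(weights)
    # Propagation: every previous state j contributes to every new state i.
    predicted = [sum(weights[j] * transition[j][i] for j in range(n))
                 for i in range(n)]
    # Update: weight by the observation likelihood, then normalise.
    unnormalised = [predicted[i] * likelihood[i] for i in range(n)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero likelihood in every state")
    return [u / total for u in unnormalised]
```

The cost per frame grows with the square of the number of grid points, which is one reason the approach becomes impractical for large or continuous state spaces.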
2.3. Particle filter method
Particle filtering is a way to model signals emanating from a dynamical system. If the underlying state transition is known and the relationship between the system state and the observed output is available, then the system state can be found using Monte Carlo simulations. Consider the discrete time Markov process such that

$$x_t \mid x_{t-1} \sim p(x_t \mid x_{t-1}), \qquad y_t \mid x_t \sim p(y_t \mid x_t)$$

We are interested in obtaining $p(x_t \mid y_{1:t})$ so that we have a filtered estimate of $x_t$ from the measurements available so far, $y_{1:t}$. If the state space model for the process is available, and both the state and the observation equations are linear, then the Kalman filter described above can be used to determine the optimal estimate of $x_t$ given the observations $y_{1:t}$. This holds under the condition that the process and observation noises are white, Gaussian, zero mean, and mutually independent. If the state and observation equations are nonlinear, the Extended Kalman Filter (EKF), which is a modified form of the Kalman filter, can be used. The particle filter algorithm estimates the posterior density of the state, represented by a finite set of support points:

$$p(x_t \mid y_{1:t}) \approx \sum_{i=1}^{N_s} w^i_t\, \delta(x_t - x^i_t)$$

where $x^i_t$ for $i = 1, \dots, N_s$ are the support points and $w^i_t$ are the associated weights. We thus have a discretized and weighted approximation of the posterior density without the need for an analytical solution. Note the similarity with the grid-based method, in which the support points of the discrete distribution were predefined and covered the whole space. In the particle filter algorithm, the support points are determined by importance sampling: instead of drawing from $p(\cdot)$, we draw points from another distribution $q(\cdot)$ and compute the weights using the following:

$$w^i \propto \frac{p(x^i)}{q(x^i)}$$

where $p(\cdot)$ is the target distribution and $q(\cdot)$ is an importance density from which we can draw samples. For the sequential case, the weights can be updated one step at a time:

$$w^i_t \propto w^i_{t-1}\, \frac{p(y_t \mid x^i_t)\, p(x^i_t \mid x^i_{t-1})}{q(x^i_t \mid x^i_{t-1}, y_t)} \qquad (26)$$

The density $q(x_t \mid x_{t-1}, y_t)$ propagates the samples to new positions at time $t$ given the samples at time $t-1$ and is derived from the state transition model of the system.
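A minimal sketch of a particle filter with resampling at every step may help make the recursion concrete. It is a toy under stated assumptions, not the chapter's system: the state is a scalar random walk, the observation is the state plus Gaussian noise, the state transition prior is used as the importance density, and all names (`sir_step`, `run_sir`, `process_std`, `obs_std`) are hypothetical.

```python
import math
import random


def gaussian_pdf(value, mean, std):
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sir_step(particles, y, rng, process_std=0.5, obs_std=1.0):
    """Propagate with the prior, weight by the likelihood, estimate, resample."""
    # Importance density = state transition prior (assumed: a random walk).
    proposed = [x + rng.gauss(0.0, process_std) for x in particles]
    # With the prior as proposal, the weight reduces to the likelihood p(y | x).
    weights = [gaussian_pdf(y, x, obs_std) for x in proposed]
    total = sum(weights)
    if total == 0.0:
        weights = [1.0 / len(proposed)] * len(proposed)
    else:
        weights = [w / total for w in weights]
    estimate = sum(w * x for w, x in zip(weights, proposed))
    # Resampling leaves every surviving particle with equal weight.
    resampled = rng.choices(proposed, weights=weights, k=len(proposed))
    return resampled, estimate


def run_sir(observations, num_particles=500, seed=0):
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
    estimates = []
    for y in observations:
        particles, estimate = sir_step(particles, y, rng)
        estimates.append(estimate)
    return estimates
```

Because the proposal ignores the current observation, many particles can land in low-likelihood regions; this is the practical reason good state transition information matters for placing the samples.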
3. Tracking algorithms for noise compensation
State transition information is an integral part of the particle filter algorithm and is used to propagate the particle samples through time transitions of the signal being processed. Specifically, the state transition is important for positioning the samples at the right locations. To solve this problem, statistics from HMMs can be used. Although we only have discrete states in HMMs, each state is characterized by a continuous density Gaussian mixture model (GMM), which enables us to capture part of the variation in speech features when generating particle samples for feature compensation. Using particle filter algorithms with side information about the statistics of clean speech available in the clean HMMs, we can perform feature compensation. If the clean speech is corrupted by an additive noise, n, and a distortion channel, h, then we can represent the noise corrupted speech with an additive noise model, assuming known statistics of the noise parameters,
where, and and denotes the mel spectrum.
where Y’ is the noisy or compensated utterance.
The clean HMMs and the background noise information enable us to generate appropriate samples from the importance density in equation (26). The parameters in equation (30), in our particle filter compensation (PFC) implementation, correspond to the correct HMM state sequence and mixture component sequence. These sequences provide critical information for density approximation in PFC. As shown in Figure 6, this can be done in two stages. We first perform a front-end compensation of noisy speech. Recognition is then done in the second stage to generate the side information so as to improve compensation. This process can be iterated, similar to what is done in maximum likelihood stochastic matching. During compensation, the observed speech is mapped to clean speech features. For this purpose, clean speech alone cannot be represented by a finite set of points, and therefore HMMs by themselves cannot be used directly for tracking it. Now, if an HMM is available that adequately represents the speech segment under consideration for compensation, along with an estimated state sequence that corresponds to the feature vectors to be considered in the segment, then we can generate the samples according to
where each term is a Gaussian mixture component for the given state of the HMM, paired with its corresponding mixture weight. The total number of particles is fixed, and the contribution from each mixture, computed at run time, depends on its weight. We have chosen the importance sampling density in equation (26) to be the density p in equation (31). This is known as the sampling importance resampling (SIR) filter. It is one of the simplest implementations of particle filters, and it enables the generation of samples independently of the observation. For the SIR filter, we only need to know the state and the observation equations and should be able to sample from the prior as in Eq. (3). Also, the resampling step is applied at every stage, and the weight assigned to each support point of the distribution of the speech signal is updated at every time step as:
The procedure for obtaining HMMs and the state sequence will be described in detail later. To obtain $p(y \mid x)$, the distribution of the log spectra of noise for each channel is assumed Gaussian with mean $\mu_n$ and variance $\sigma_n^2$. Assuming there is additive noise only, with no channel effects,

$$y = x + \log(1 + e^{n - x})$$

We are interested in evaluating $p(y \mid x)$, where $x$ represents clean speech and $n$ is the noise with density $\mathcal{N}(\mu_n, \sigma_n^2)$. Then

$$p(y \mid x) = \frac{d}{dy} F(u) = \mathcal{N}(u;\, \mu_n, \sigma_n^2)\, \frac{e^{y}}{e^{y} - e^{x}}$$

where $F(\cdot)$ is the Gaussian cumulative distribution function with mean $\mu_n$ and variance $\sigma_n^2$, and $u = \log(e^{y} - e^{x})$. In the case of MFCC features, the nonlinear transformation is

$$y = x + C \log\left(1 + e^{C^{-1}(n - x)}\right)$$

and the likelihood is the product of a Gaussian pdf and the corresponding Jacobian. Here $C$ is a discrete cosine transform matrix, which is not square and thus not invertible. To overcome this problem, we zero-pad the $y$ and $x$ vectors and extend $C$ to be a square matrix. The variance of the noise density is obtained from the available noise samples. Once the point density of the clean speech features is available, we estimate the compensated features using a discrete approximation of the expectation:

$$\hat{x}_t = \sum_{i=1}^{N_s} w^i_t\, x^i_t \qquad (36)$$

where $N_s$ is the total number of particle samples at time $t$.
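The weighting and averaging of particles for one log-spectral channel can be sketched as follows. This is a sketch under assumptions: it takes the purely additive model in which the noisy log energy equals log(exp(x) + exp(n)) with Gaussian n, so the likelihood is the Gaussian density of the implied noise level times the Jacobian of the change of variables. The function names and the fallback used when no particle explains the observation are hypothetical choices, not taken from the chapter.

```python
import math


def noisy_likelihood(y, x, noise_mean, noise_std):
    """p(y | x) for y = log(exp(x) + exp(n)), with n ~ N(noise_mean, noise_std^2)."""
    if y <= x:
        return 0.0  # additive noise can only raise the log energy
    gap = -math.expm1(x - y)  # 1 - exp(x - y), in (0, 1]
    u = y + math.log(gap)  # noise level implied by (y, x): log(exp(y) - exp(x))
    z = (u - noise_mean) / noise_std
    pdf = math.exp(-0.5 * z * z) / (noise_std * math.sqrt(2.0 * math.pi))
    return pdf / gap  # Jacobian du/dy = exp(y) / (exp(y) - exp(x)) = 1 / gap


def compensate(y, particles, noise_mean, noise_std):
    """Weighted mean of clean-speech particles given one noisy observation."""
    weights = [noisy_likelihood(y, x, noise_mean, noise_std) for x in particles]
    total = sum(weights)
    if total == 0.0:
        return y  # assumed fallback: no particle explains y, keep the observation
    return sum(w * x for w, x in zip(weights, particles)) / total
```

Particles that lie above the observed value receive zero weight, so the compensated value never exceeds the noisy observation in this model.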
3.1. Estimation of HMM side information
As described above, it is important to obtain an HMM that faithfully represents the speech segment being compensated, together with the state sequence corresponding to the utterance. To obtain this HMM for each word in the utterance, we choose the N-best models from HMMs trained using clean speech data. The models are combined to obtain a single model as follows.
3.1.1. Gaussian Mixtures Estimation
To obtain the observation model for each state of the combined model, we concatenate the mixtures from the corresponding states of all component models,
where the sum runs over the Gaussian mixtures in each original HMM and over the different words in the N-best hypothesis, and each mixture contributes the mean and covariance of the given mixture in the given state of its component model. The mixture weights are normalized by scaling them according to the likelihood of occurrence of the model from which they come,
The mixture weight is an important parameter because it determines the number of samples that will be generated from the corresponding mixture. The state transition coefficients for the combined model are computed using the following:
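The pooling of mixtures and the resulting particle budget can be illustrated with a small sketch. The representation of a mixture as a `(weight, mean, variance)` tuple, the use of non-negative per-model scores as the likelihoods, and the rounding rule for the particle counts are all assumptions made for illustration.

```python
def pool_state_mixtures(state_gmms, model_likelihoods):
    """Concatenate the mixtures of one state across several word models.

    state_gmms[m]        list of (weight, mean, variance) tuples for model m
    model_likelihoods[m] non-negative score of model m
    """
    total = sum(model_likelihoods)
    pooled = []
    for gmm, score in zip(state_gmms, model_likelihoods):
        for weight, mean, var in gmm:
            # Scale each mixture weight by the relative likelihood of its model.
            pooled.append((weight * score / total, mean, var))
    return pooled


def particles_per_mixture(pooled, num_particles):
    """Split a fixed particle budget across mixtures in proportion to weight."""
    counts = [int(w * num_particles) for w, _, _ in pooled]
    leftover = max(0, num_particles - sum(counts))
    # Hand out particles lost to rounding, largest weights first.
    order = sorted(range(len(pooled)), key=lambda i: pooled[i][0], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

If each input GMM has weights summing to one, the pooled weights also sum to one, so mixtures from more likely word models receive proportionally more particles.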
3.1.2. State sequence estimation
The recognition performance can be greatly improved if a good estimate of the HMM state sequence is available, but obtaining this sequence in a noisy operational environment is very challenging. The simplest approach is to use the decoded state sequence obtained with multi-condition trained models in an ASR recognition process, as shown at the bottom of Figure 6. However, these states often correspond to incorrect models and can deviate significantly from the optimal ones. Alternatively, we can determine the states (from which to generate samples) sequentially during compensation. For left-to-right HMMs, given the state at the current time, we choose the next state using equation (41) as follows:
where the transition probability comes from the state transition matrix of the model. The mixture indices are subsequently selected from among the mixtures corresponding to the chosen state.
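Sequential state selection in a left-to-right HMM can be sketched as below. The selection rule used here, the transition probability multiplied by a per-frame likelihood, is an assumed criterion chosen for illustration; it is not claimed to be the chapter's equation (41), and the function names are hypothetical.

```python
def next_state(current, transition, obs_likelihood):
    """Pick the successor of `current` in a left-to-right HMM.

    transition[i][j]  probability of moving from state i to state j
    obs_likelihood[j] score of the current frame under state j
    """
    last = len(transition) - 1
    # A left-to-right model can only stay in place or advance by one state.
    candidates = [current] if current == last else [current, current + 1]
    # Assumed criterion: transition probability times frame likelihood.
    return max(candidates,
               key=lambda j: transition[current][j] * obs_likelihood[j])


def track_states(transition, frame_likelihoods, start=0):
    """Choose one state per frame, moving monotonically through the model."""
    states, state = [], start
    for likelihoods in frame_likelihoods:
        state = next_state(state, transition, likelihoods)
        states.append(state)
    return states
```

Because each decision is made frame by frame, an early wrong advance cannot be undone later, which is one way to see why a decoded or oracle state sequence can behave differently from a sequential choice.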
To investigate the properties of the proposed approach, we first assume that a decent estimate of the state is available at each frame. Moreover, we assume that speech boundaries are marked, so that the silence and speech sections of the utterance are known. To obtain this information, we use a set of digit HMMs (18 states, 3 Gaussian mixtures) that have been trained using clean speech represented by 23-channel mel-scale log spectral features. The speech boundaries and state information for a particular noisy utterance are then captured through digit recognition performed on the corresponding clean speech utterance. The speech boundary information is critical because the noise statistics have to be estimated from the noisy section of the utterance. To get the HMM needed for particle filter compensation, models are selected based on the N-best hypothesis list. The number of selected models was fixed in our experiments. We combine these models to get a single HMM for each word in the utterance. Best results are obtained if the correct word model is present in the pool of models that contribute to this combined HMM. Once this information is available, the noisy log spectral features are compensated using sequential importance sampling. To see the efficacy of the compensation process, we consider the noisy, clean and compensated filter banks (channel 8) for the whole utterances shown in Figure 7. The SNR for this particular case is 5 dB. It is clear that the compensated feature matches the clean feature well. It should be noted, however, that such a good restoration of the clean speech signal from the noisy signal is achievable only when a good estimate of the side information about the state and mixture component sequences is available.
Assuming all such information were given (the ideal oracle case) recognition can be performed on MFCCs (39 MFCCs with 13 MFCCs and their first and second time derivatives) extracted from these compensated log spectral features. The HMMs used for recognition are trained with noisy data that has been compensated in the same way as the testing data. The performance compared to multi-condition (MC) and clean condition training (Columns 5 and 6 in Table 1) is given in Column 2 of Table 1 (Adapted Model I). It is clearly noted that a very significant 67% digit error reduction was attained if the missing information were made available to us.
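For readers comparing the accuracy figures, the relative error reduction quoted in this chapter is presumably computed in the usual way from word accuracies. The following is a sketch of that convention, assuming the error rate is taken as 100% minus the word accuracy; the symbols are introduced here for illustration only.

```latex
% Assumed convention: error rate E = 100% - word accuracy.
\[
E_{\mathrm{MC}} = 100\% - \mathrm{Acc}_{\mathrm{MC}}, \qquad
E_{\mathrm{PFC}} = 100\% - \mathrm{Acc}_{\mathrm{PFC}}
\]
% Relative error reduction of the compensated system over the MC baseline.
\[
\mathrm{RER} = \frac{E_{\mathrm{MC}} - E_{\mathrm{PFC}}}{E_{\mathrm{MC}}} \times 100\%
\]
```

Under this convention, a given relative reduction corresponds to a smaller absolute accuracy gain when the baseline accuracy is already high.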
|Word Accuracy||Adapted Models I||Adapted Models II||Adapted Models III||MC Training||Clean Training|
In the case of the actual operational scenarios, when no side information is available, models were chosen from the N-Best list while the states were computed using Viterbi decoding. Of course, the states would correspond to only one model which might not be correct, and there might be a significant mismatch between actual and computed states. Moreover the misalignment of words also exacerbated the problem. The results for this case (Adapted Model III as shown in Table 1 Column 4) were only marginally better than those obtained with the multi-condition trained models. To see the effects of the improvements for the case where the states are better aligned, we made use of whatever information we could get. The boundaries of words were extracted from the N-Best list using exhaustive search and the states for the words between these boundaries were assigned by splitting the digits into equal-sized segments and assigning one state to each segment. This limited the damage done by state misalignment, and it can be seen that a 13% digit error reduction from MC training was observed (Adapted Model II in Table 1 Column 3).
3.2. A clustering approach to obtaining correct HMM information
HMM states are used to spread the particles at the right locations for subsequent estimation of the underlying clean speech density. If the state is incorrect, the particles will be placed at the wrong locations and the density estimate will be erroneous. One solution is to merge the states into clusters. Since the total number of clusters can be much smaller than the number of states, the problem of choosing the correct information block for sample generation is simplified. A tree structure that groups the Gaussian mixtures from clean speech HMMs into clusters can be built with the following distance measure:
where the distance is computed from the individual elements of the mean vectors and the corresponding diagonal elements of the covariance matrices. The parameters of the single Gaussian representing the cluster are computed as follows:
Alternatively, we can group the components at the state level using the following distance measure:
where the measure depends on the total number of states in the cluster, the number of mixtures per state, and the observation probability. This method makes it easy to track the state level composition of each cluster. In both cases, the clustering algorithm proceeds as follows:
1. Create one cluster for each mixture.
2. While the number of clusters exceeds the target number, find the pair of clusters for which the distance is minimum and merge them.
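The bottom-up merging loop can be sketched as follows. Two ingredients are assumptions swapped in for illustration, since the chapter's own distance measures are not reproduced here: the distance is a symmetric Kullback-Leibler divergence between diagonal Gaussians, and two clusters are merged by moment matching, weighted by how many original mixtures each contains.

```python
def sym_kl(g1, g2):
    """Symmetric KL divergence between two diagonal Gaussians (mean, var)."""
    (m1, v1), (m2, v2) = g1, g2
    total = 0.0
    for a, s, b, t in zip(m1, v1, m2, v2):
        diff2 = (a - b) ** 2
        total += 0.5 * ((s + diff2) / t + (t + diff2) / s - 2.0)
    return total


def merge(c1, c2):
    """Moment-match two clusters (count, mean, var) into a single Gaussian."""
    n1, m1, v1 = c1
    n2, m2, v2 = c2
    n = n1 + n2
    mean = [(n1 * a + n2 * b) / n for a, b in zip(m1, m2)]
    var = [(n1 * (s + a * a) + n2 * (t + b * b)) / n - m * m
           for a, s, b, t, m in zip(m1, v1, m2, v2, mean)]
    return n, mean, var


def cluster_gaussians(gaussians, target):
    """Merge the closest pair repeatedly until `target` clusters remain."""
    clusters = [(1, list(m), list(v)) for m, v in gaussians]
    target = max(1, target)
    while len(clusters) > target:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: sym_kl(clusters[p[0]][1:],
                                               clusters[p[1]][1:]))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

The exhaustive pair search keeps the sketch short; for a few hundred mixtures it is still fast enough, but a cached distance matrix would be the natural refinement.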
Once clustering is complete, it is important to pick the most suitable cluster for feature compensation at each frame. The particle samples are then generated from the representative density of the chosen cluster. Two methods can be explored. The first is to decide the cluster based on the N-best transcripts obtained from recognition using multi-condition trained models. Consider the states obtained from the N-best transcripts for the noisy speech feature vector at a given time. If a state is a member of a cluster, we increment that cluster's count by one, where the count records how many states from the N-best list belong to the cluster. We choose the cluster with the highest count and generate samples from it. If more than one cluster satisfies this criterion, we merge their probability density functions. In the second method, we choose the cluster that maximizes the likelihood of the MFCC vector at that time belonging to the cluster, as follows:
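The first selection method, counting how many N-best states fall into each cluster, can be sketched with a simple vote. The state labels, the `state_to_cluster` lookup, and the choice to return all tied clusters are hypothetical details for illustration.

```python
from collections import Counter


def vote_for_clusters(nbest_states, state_to_cluster):
    """Return the cluster(s) containing the most N-best states at one frame."""
    counts = Counter(state_to_cluster[s] for s in nbest_states)
    best = max(counts.values())
    # All tied clusters are returned so their densities can be merged.
    return sorted(c for c, n in counts.items() if n == best)
```

For example, with a lookup that places the hypothetical states `"one_s3"` and `"nine_s3"` in cluster 0 and `"five_s4"` in cluster 1, a frame whose N-best states are those three selects cluster 0.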
It is important to emphasize here that this likelihood is derived from multi-condition speech models and has a different distribution from the one used to generate the samples. The relationship between clean clusters and multi-condition clusters is shown in figure 1. Clean clusters are obtained using methods described in section 3. The composition information of these clusters is then used to build a corresponding multi-condition cluster set from multi-condition HMMs. A cluster in the clean set represents statistical information of a particular section of clean speech. Its multi-condition counterpart represents the statistics of the noisy version of the same speech section.
Clean clusters are necessary to track clean speech because we need to generate samples from clean speech distributions. However, they are not the best choice for evaluating equation (46), because the observation is noisy and has a different distribution. The best candidate for computing equation (46) is the multi-condition cluster set, which is constructed from multi-condition HMMs that match noisy speech more closely. A block diagram of the overall compensation and recognition process is shown in Figure 9. We infer the cluster to be used for each observation vector by combining the N-best transcripts and equation (46). Samples at each frame are then generated using the pdf of the chosen cluster. The weights of the samples are computed using equation (46), and the compensated features are obtained using equation (36). Once the compensated features are available for the whole utterance, recognition is performed again using HMMs retrained with compensated features.
To evaluate the proposed framework, we experimented on the Aurora 2 connected digit task. We extracted MFCC features (39 elements: 13 MFCCs and their first and second time derivatives) from the test speech, as well as 23-channel filter-bank features, thereby forming two streams. A one-best transcript was obtained from the MFCC stream using the multi-condition trained HMMs. PFC was then applied to the filter-bank stream (stream two). We chose two clusters, one based on the 1-best transcript and the other selected with equation (46). The multi-condition clusters used in equation (46) were built from 23-channel filter-bank features, so that the test features from stream two could be used directly to evaluate the likelihood of the observations. For the results in these experiments, clusters were formed using method two, i.e., tracking the state-wise composition of each cluster. The number of clusters and the number of particles were varied to evaluate the performance of the algorithm under different settings. From the compensated filter-bank features of stream two, we extracted 39-element MFCC features. Final recognition on these features was done using the retrained HMMs, i.e., HMMs trained on multi-condition data compensated in the same fashion as described above.
|Word Accuracy||20 Clusters||25 Clusters||30 Clusters||MC Trained||Clean Trained|
The results for a fixed number of particles (100) are shown in Table 1. The number of clusters was 20, 25 or 30. To set the specific number of clusters, HMM states were combined and clustering was stopped when the specified number was reached. HMM sets for all purposes were 18 states, with each state represented by 3 Gaussian mixtures. For the 11-digit vocabulary, we have a total of approximately 180 states. In case of, for example, 20 clusters, we have a 9 to 1 reduction of information blocks to choose from for plugging in the PF scheme.
It is interesting to note that the best results were obtained for 25 clusters. Increasing the number of clusters beyond 25 did not improve the accuracy. The larger the number of clusters, the more specific the section of speech statistics each cluster covers. More specific information per cluster is good for compensation and recognition because the particles can be placed more accurately. However, with a large number of clusters to choose from, it is difficult to pick the correct cluster for generating particles. More errors were made in the cluster selection process, resulting in degradation of the overall performance.
This is further illustrated in Figure 10. If the correct cluster is known, having a large number of clusters, and consequently more specific information per cluster, only improves the performance. The results are for 20, 25 and 30 clusters. In the known cluster case, one cluster is obtained using equation (46) and the second cluster is the correct one. The correct cluster is the one that contains the state to which the observation actually belongs, obtained by running recognition on the clean version of the noisy utterance using clean HMMs. For the unknown cluster case, the clusters are obtained using equation (46) and the 1-best transcript. It can readily be observed from the known cluster case that if the choice of cluster is always correct, the recognition performance improves drastically. The error rate was reduced by 54%, 59% and 61.4% for 20, 25 and 30 clusters, respectively. Moreover, the improvement faithfully follows the number of clusters used. This was also corroborated by the fact that if the cluster is specific down to the HMM state level, i.e., the exact HMM state sequence is assumed known and each state is a separate cluster (approximately 180 clusters in total), the error rate was reduced by as much as 67%.
For the results in Table 2, we fixed the number of clusters and varied the number of particles. As the number of particles increased, the accuracy of the algorithm improved for sets A and B combined, i.e., for additive noise. The error reduction is 17% over MC trained models. Using a large number of particles means that more samples are used to construct the predicted densities of the underlying clean speech features, which are therefore denser and better approximated. Thus, a gradual improvement in the recognition results was observed as the number of particles increased. In the case of Set C, however, the performance was worse when more particles were used. This is because the underlying distribution is different, owing to distortions other than additive noise.
|Set A||Set B||Set C||Average|
In this chapter, we proposed a particle filter compensation approach to robust speech recognition and showed that a tight coupling and sharing of information between HMMs and particle filters has strong potential to improve recognition performance in adverse environments. We need an accurate alignment between the state and mixture sequences used for compensation with particle filters and the actual HMM state sequences that describe the underlying clean speech features. Although we have observed improved performance with the current particle filter compensation implementation, there is still a considerable performance gap between the oracle setup with correct side information and what is achievable in this study with the missing side information estimated from noisy speech. We further developed a scheme to merge statistically similar information in HMM states, which enables us to find the right section of the HMMs to plug dynamically into the particle filter algorithm. Results show that if we use information from HMMs that matches the section of speech being compensated particularly well, significant error reduction is possible compared to multi-condition HMMs.