Linear Feature Transformations in Slovak Phoneme-Based Continuous Speech Recognition

The most common acoustic front-ends in automatic speech recognition (ASR) systems are based on the state-of-the-art Mel-Frequency Cepstral Coefficients (MFCCs). Practice shows that this general technique is a good choice for obtaining a satisfactory speech representation. In the past few decades, researchers have made great efforts to develop and apply techniques that may improve the recognition performance of the conventional MFCCs. In general, these methods were taken from mathematics and applied in many research areas such as face and speech recognition, high-dimensional data and signal processing, video and image coding, and many others. One group of these methods is represented by linear transformations.


Introduction
The most common acoustic front-ends in automatic speech recognition (ASR) systems are based on the state-of-the-art Mel-Frequency Cepstral Coefficients (MFCCs). Practice shows that this general technique is a good choice for obtaining a satisfactory speech representation.
In the past few decades, researchers have made great efforts to develop and apply techniques that may improve the recognition performance of the conventional MFCCs. In general, these methods were taken from mathematics and applied in many research areas such as face and speech recognition, high-dimensional data and signal processing, video and image coding, and many others. One group of these methods is represented by linear transformations.
Linear feature transformations (also referred to as subspace learning or dimensionality reduction methods) are used to convert the original data set into an alternative, more compact set while retaining as much information as possible. They are also used to increase the robustness and the performance of the system. In speech recognition, the basic acoustic front-end based on MFCCs can be supplemented by some kind of linear feature transformation. The linear transformation is applied in the feature extraction step, so the whole feature extraction process is carried out in two stages: parameter extraction and feature transformation. The linear transformation is applied to a sequence of acoustic vectors obtained by some kind of preprocessing method. Usually, the spectral, log-spectral, Mel-filtered spectral or cepstral features are projected into a more relevant and more decorrelated subspace, which is directly used in acoustic modeling. A dimension reduction step is often performed during the transformation. This is achieved by retaining only the relevant dimensions after the transformation according to some optimization criterion. The dimension reduction step helps to alleviate the problem known as the curse of dimensionality.
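To make the two-stage pipeline concrete, the following minimal sketch (our own illustration, not code from an existing toolkit) applies an already estimated transformation matrix W to a sequence of acoustic vectors; the array shapes and the random placeholder matrix are assumptions made only for this example.

import numpy as np

def transform_features(frames: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project acoustic vectors into a lower-dimensional subspace.

    frames: (T, N) matrix of T acoustic vectors (e.g. MFCCs) of dimension N.
    W:      (N, p) transformation matrix with p <= N retained dimensions.
    Returns a (T, p) matrix of transformed feature vectors.
    """
    return frames @ W

# toy usage: 100 frames of 39-dimensional features reduced to 13 dimensions
rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 39))
W = rng.standard_normal((39, 13))            # stands in for a trained PCA/LDA matrix
print(transform_features(frames, W).shape)   # (100, 13)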
In practice, supervised and unsupervised subspace learning methods are used. The most popular data-driven unsupervised transformation used in ASR is Principal Component Analysis (PCA). It is known that supervised methods need information about the structure of the data, which is partitioned into classes. Therefore, it is necessary to use appropriate class labels. A widely used supervised method is known as Linear Discriminant Analysis (LDA).
Numerous research works and publications have shown that the above-mentioned linear transformations can be successfully applied in ASR to multiple languages with different speech characteristics. The Slovak speech recognition research group follows this trend. In this work, we present a practical methodology, with the relevant theoretical principles, for the application of linear feature transformations in Slovak phoneme-based large vocabulary continuous speech recognition (LVCSR).
The main subject of this chapter is the application of LDA in Slovak ASR, but the core of most experiments is based on Two-Dimensional LDA (2DLDA), which is an extension of LDA. Several context lengths of the basic vectors are used in the discriminant analysis, and different final dimensions of the transformation matrix are utilized. The classical procedures are supplemented by several of our modifications. The second part of the chapter is oriented to PCA and to our proposed method of PCA training from a limited amount of training data. The third part investigates the interaction of the above-mentioned PCA and 2DLDA applied in one recognition task. The closing part compares and evaluates all experiments and concludes the chapter by presenting the best achieved results.
This chapter is divided into a few basic units. Sections 2 and 3 describe LDA and 2DLDA as used in speech recognition. Section 4 surveys PCA and also presents the proposed partial-data trained PCA method. Section 5 presents the setup of the system for continuous phoneme-based speech recognition. Section 6 presents extensive experiments and evaluations of the used methods in different configurations. Finally, Section 7 concludes the chapter and Section 8 gives the future intentions of our research.

Conventional Linear Discriminant Analysis (LDA)
Linear discriminant analysis is a well-known dimensionality reduction and transformation method that maps the N-dimensional input data to a p-dimensional (p < N) subspace while retaining maximum discrimination information. A general mathematical model of a linear transformation can be written in the following manner:

y = W^T x,   (1)

where y is the output transformed feature set, W is the transformation matrix and x is the input feature set. The aim of LDA is to find this transformation matrix W with respect to some optimization criterion (information loss, class discrimination, ...). It can be obtained by applying an eigendecomposition to the covariance matrices. The p best basis vectors resulting from the decomposition are used to transform the feature vectors to a reduced representation.
In LDA, the transformation is represented by a transformation matrix W ∈ R^{N×p} that maps each column x_i of the data matrix X to a column vector y_i in the p-dimensional space as

y_i = W^T x_i.

Consider that the original data is partitioned into k classes as X = {Π_1, . . . , Π_k}, where the class Π_i contains n_i elements (feature vectors) from the i-th class. Notice that n = ∑_{i=1}^{k} n_i. The classes can be represented by their class mean vectors

μ_i = (1/n_i) ∑_{x ∈ Π_i} x

and their class covariance matrices

Σ_i = (1/n_i) ∑_{x ∈ Π_i} (x − μ_i)(x − μ_i)^T,

which are defined to quantify the quality of the cluster. Since LDA in ASR is mostly used in a class-independent manner, we define the within-class covariance matrix as the sum of all class covariance matrices,

Σ_W = ∑_{i=1}^{k} Σ_i.

To quantify the covariance between classes, the between-class covariance matrix is used. It is defined as

Σ_B = ∑_{i=1}^{k} (μ_i − μ)(μ_i − μ)^T,   (6)

where

μ = (1/n) ∑_{j=1}^{n} x_j

is the global mean vector (computed disregarding the class label information). Note that the variable x in speech recognition represents a supervector created by concatenating acoustic vectors computed on successive speech frames. To build a supervector of J acoustic vectors (J is typically 3, 5, 7, 9 or 11 frames), the vector x_j at the current position j is spliced together with (J−1)/2 vectors on the left and right as

x̂_j = [ x_{j−(J−1)/2}^T, . . . , x_j^T, . . . , x_{j+(J−1)/2}^T ]^T.   (8)

It should be noted that when the length of the supervector was greater than the number of classes (13 × J > k, where J ≥ 5, k = 45), the between-class covariance matrix became close to singular or singular. This resulted in an eigendecomposition with a complex-valued transformation matrix, which was undesirable.
Therefore, for these cases we used a modified computation of Σ_B according to [7], in which the between-class covariance is accumulated over all training supervectors instead of over the class means only:

Σ_B = ∑_{i=1}^{k} ∑_{x ∈ Π_i} (μ_i − μ)(μ_i − μ)^T = ∑_{i=1}^{k} n_i (μ_i − μ)(μ_i − μ)^T.   (9)

This way of computation can be interpreted as a finer estimation of Σ_B because each training supervector contributes to the final estimation of Σ_B (more data points are used) in comparison with the estimation represented by Equation 6.
The given covariance matrices are used to formulate the optimization criterion for LDA, which tries to maximize the between-class scatter (covariance) over the within-class scatter (covariance). It can be shown that the covariance matrices resulting from the linear transformation are W^T Σ_W W and W^T Σ_B W. The objective function can therefore be defined as

J(W) = arg max_W |W^T Σ_B W| / |W^T Σ_W W|.

This optimization problem is equivalent to the generalized eigenvalue problem

Σ_B v = λ Σ_W v,

where v is a square matrix of eigenvectors and λ represents the eigenvalues. The solution can be obtained by applying an eigendecomposition to the matrix Σ_W^{−1} Σ_B. The reduced representation W_p of W is made by choosing the p eigenvectors corresponding to the p largest eigenvalues.
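The complete class-independent LDA estimation described above can be sketched as follows. This is a minimal illustration under the assumption that the spliced supervectors and their integer class labels are already available; all variable names are ours and the toy data are random.

import numpy as np

def lda_transform(X: np.ndarray, labels: np.ndarray, p: int) -> np.ndarray:
    """Estimate a p-dimensional LDA projection from supervectors X (n_samples, dim)."""
    dim = X.shape[1]
    mu = X.mean(axis=0)                               # global mean (ignores labels)
    Sw = np.zeros((dim, dim))
    Sb = np.zeros((dim, dim))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += np.cov(Xc, rowvar=False, bias=True)     # sum of class covariances
        d = (mu_c - mu)[:, None]
        Sb += len(Xc) * (d @ d.T)                     # finer estimate, weighted by class size
    # solve Sigma_W^{-1} Sigma_B v = lambda v and keep the p leading eigenvectors
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:p]].real                 # (dim, p) transformation matrix W_p

# usage with toy data: 45 classes, 65-dimensional supervectors (J = 5), reduced to 13
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 65))
labels = rng.integers(0, 45, size=2000)
W_p = lda_transform(X, labels, p=13)
Y = X @ W_p                                           # transformed features
print(W_p.shape, Y.shape)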

Class definition in LDA
Since LDA is a supervised method, it needs additional information about the class structure of the training data. In the past few years, several choices for the LDA class definition in ASR were proposed and experimentally investigated. For small vocabulary phoneme-based ASR systems, LDA yielded an improvement with the conventional phone-level class definition [4,8].
In these cases the Viterbi-trained context independent phonemes are used as classes. For HMM-based recognizers the time-aligned HMM states can define the classes [14]. Another reasonable method is to use the subphone levels as LDA classes [15]. We showed in our work [17] that an alternative phonetic class definition based on phonetic segmentation can lead to improvement.
For large vocabulary phoneme-based ASR systems there exist several ways to define the classes. One might argue that the conventional phone-level definition is the appropriate one. For triphone-based recognizers the context-dependent or context-independent triphones can be used [13] or the tied states in context dependent acoustic models [6].
In this work we used the conventional phone-level classes for LDA and 2DLDA. The phonetic segmentation was obtained from embedded training and automatic phone alignment (see Section 5.3). Thus, the number of classes in the LDA-based experiments was identical to the number of phonemes and also to the number of trained monophone models. A potential disadvantage of the phone segmentation obtained from embedded training is the inaccuracy of the determined phone boundaries compared to the actual boundaries.

Two-Dimensional Linear Discriminant Analysis
Linear Discriminant Analysis used as a feature extraction or dimension reduction method in applications with high-dimensional data may not always perform optimally. In particular, when the dimension of the data exceeds the number of data points, the scatter matrices can become singular. This is known as the singularity or undersampled problem in LDA, which is its intrinsic limitation.
Two-Dimensional Linear Discriminant Analysis (hereinafter 2DLDA) [19] was primarily designed to overcome the singularity problem of classical LDA. 2DLDA overcomes the singularity problem implicitly. The key difference between LDA and 2DLDA is in the data representation model. While conventional LDA works with a vectorized representation of the data, the 2DLDA algorithm works with data in matrix representation. Therefore, the data are collected as a set of matrices instead of a single large data matrix. This concept has been used, for example, in [18] for PCA.
It is known that the optimal transformation matrix in LDA can be obtained by applying an eigendecomposition to the scatter matrices. Generally, these matrices can be singular because they are estimated from high-dimensional data. In recent years, several approaches have been developed to solve such problems related to high-dimensional computing [10]. One of these approaches, called PCA+LDA, is a widely used two-stage algorithm, especially in face recognition [3]. All of the mentioned methods require the eigendecomposition of large matrices, which can degrade efficiency.
2DLDA alleviates the difficult computation of the eigendecomposition in methods discussed above. Since it works with matrices instead of high-dimensional supervectors (as in classical LDA), the eigendecomposition in 2DLDA is computed on matrices with much smaller sizes than in LDA. This reduces the processing time and memory costs of 2DLDA compared to LDA.

Mathematical description
Let A_i ∈ R^{r×c}, i ∈ ⟨1; n⟩, be the n training data matrices in the corpus. Suppose there are k classes Π_1, . . . , Π_k, where Π_i has n_i elements. Let

M_i = (1/n_i) ∑_{X ∈ Π_i} X

be the mean of the i-th class and

M = (1/n) ∑_{i=1}^{k} ∑_{X ∈ Π_i} X

be the global mean. In [19], for face recognition, X originally represents a training image. For speech recognition, X represents the concatenated acoustic vectors (supervector) computed on successive speech frames [12]. In fact, X is a matrix composed by combining acoustic vectors computed on successive speech frames. Analogously to the supervector, we can call this matrix a supermatrix.
2DLDA considers an (l_1 × l_2)-dimensional space L ⊗ R, which is a tensor product of the space L spanned by the vectors {u_i}, i = 1, . . . , l_1, and the space R spanned by the vectors {v_i}, i = 1, . . . , l_2. Since in 2DLDA the speech is considered as a two-dimensional element, two transformation matrices are defined: L = [u_1, . . . , u_{l_1}], L ∈ R^{r×l_1}, and R = [v_1, . . . , v_{l_2}], R ∈ R^{c×l_2}. These matrices map each A_i ∈ R^{r×c} to a matrix B_i ∈ R^{l_1×l_2} as

B_i = L^T A_i R.

Because it is difficult to compute the optimal L and R simultaneously, an iterative algorithm was derived in [19] which, for a fixed R, computes the optimal L; with the computed L, R can then be updated, and the procedure is repeated several times. As in classical LDA, the scatter matrices are computed similarly, but in the two-dimensional concept. Note that 2DLDA defines two within-class scatter matrices, S_w^R and S_w^L, and two between-class scatter matrices, S_b^R and S_b^L, concurrently. The scatter matrices coupled with R are defined as follows:

S_w^R = ∑_{i=1}^{k} ∑_{X ∈ Π_i} (X − M_i) R R^T (X − M_i)^T,
S_b^R = ∑_{i=1}^{k} n_i (M_i − M) R R^T (M_i − M)^T.

For a fixed R, L can then be computed by solving the optimization problem

max_L trace( (L^T S_w^R L)^{−1} (L^T S_b^R L) ).

This problem can be solved as an eigenvalue problem, S_b^R u = λ S_w^R u, and L can be obtained, in a similar way as in LDA, by applying an eigendecomposition to the matrix (S_w^R)^{−1} S_b^R. The scatter matrices coupled with L are defined as follows:

S_w^L = ∑_{i=1}^{k} ∑_{X ∈ Π_i} (X − M_i)^T L L^T (X − M_i),
S_b^L = ∑_{i=1}^{k} n_i (M_i − M)^T L L^T (M_i − M).

In this way, with the obtained L, the optimal R can be computed by solving the optimization problem

max_R trace( (R^T S_w^L R)^{−1} (R^T S_b^L R) ).

This problem can be solved as an eigenvalue problem, S_b^L v = λ S_w^L v, and the optimal R can be obtained by applying an eigendecomposition to the matrix (S_w^L)^{−1} S_b^L. It should be noted that the sizes of the scatter matrices in 2DLDA are much smaller than those in LDA. Specifically, the size of S_w^R and S_b^R is r × r and the size of S_w^L and S_b^L is c × c.

Pseudocode of 2DLDA algorithm
The most time-consuming steps of the 2DLDA algorithm are the computation of the scatter matrices and the two eigendecompositions inside the iteration loop. The algorithm depends on the initial choice of R_0. In [19] it was shown and recommended to choose an identity matrix as R_0.
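Since the original pseudocode listing is not reproduced here, the following sketch is our own reconstruction of the iterative procedure summarized above and in [19]; the data layout (a set of r × c supermatrices with class labels) and all names are assumptions.

import numpy as np

def two_dlda(mats, labels, l1, l2, iters=1):
    """Iterative 2DLDA: returns (L, R) for supermatrices of shape (r, c)."""
    mats = np.asarray(mats, dtype=float)                 # (n, r, c)
    labels = np.asarray(labels)
    r, c = mats.shape[1], mats.shape[2]
    classes = np.unique(labels)
    M = mats.mean(axis=0)                                # global mean matrix
    means = {k: mats[labels == k].mean(axis=0) for k in classes}
    counts = {k: np.sum(labels == k) for k in classes}

    def leading_eigvecs(Sw, Sb, dim):
        vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
        order = np.argsort(vals.real)[::-1]
        return vecs[:, order[:dim]].real

    R = np.eye(c, l2)                                    # recommended initialisation R_0
    for _ in range(iters):
        # scatter matrices coupled with R (size r x r), then update L
        SwR = sum((X - means[k]) @ R @ R.T @ (X - means[k]).T
                  for X, k in zip(mats, labels))
        SbR = sum(counts[k] * (means[k] - M) @ R @ R.T @ (means[k] - M).T
                  for k in classes)
        L = leading_eigvecs(SwR, SbR, l1)                # (r, l1)
        # scatter matrices coupled with L (size c x c), then update R
        SwL = sum((X - means[k]).T @ L @ L.T @ (X - means[k])
                  for X, k in zip(mats, labels))
        SbL = sum(counts[k] * (means[k] - M).T @ L @ L.T @ (means[k] - M)
                  for k in classes)
        R = leading_eigvecs(SwL, SbL, l2)                # (c, l2)
    return L, R

# toy usage: 13 x 7 supermatrices, 45 classes, reduced to 13 x 3
rng = np.random.default_rng(2)
mats = rng.standard_normal((900, 13, 7))
labels = rng.integers(0, 45, size=900)
L, R = two_dlda(mats, labels, l1=13, l2=3, iters=3)
B = L.T @ mats[0] @ R                                    # one transformed supermatrix
print(L.shape, R.shape, B.shape)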

Principal component analysis
Principal component analysis (PCA) [9] is a linear feature transformation and dimensionality reduction method which maps the n-dimensional, possibly correlated input data to K-dimensional (K < n) linearly uncorrelated variables (mutually uncorrelated principal components) with respect to the variability. PCA converts the data by a linear orthogonal transformation using the first few principal components, which usually represent about 80% of the overall variance. The principal component basis minimizes the mean square error of approximating the data. This linear basis can be obtained by applying an eigendecomposition to the global covariance matrix estimated from the original data.

Mathematical description
The characteristic mathematical stages of PCA can be briefly described as follows [2,9]. First, suppose that the training data are represented by M n-dimensional feature vectors x_1, x_2, . . . , x_M. One of the integral parts of PCA is the centering of all vectors (subtracting the mean) as

Φ_i = x_i − x̄,   (30)

where

x̄ = (1/M) ∑_{i=1}^{M} x_i   (31)

is the training mean vector. From the centered vectors Φ_i the centered data matrix A with dimension n × M is created as

A = [Φ_1, Φ_2, . . . , Φ_M].   (32)

To represent the variance of the data across different dimensions, the global covariance matrix is computed as

C = (1/M) A A^T = (1/M) ∑_{i=1}^{M} Φ_i Φ_i^T.   (33)

An eigendecomposition is applied to the covariance matrix in order to obtain its eigenvectors u_1, u_2, . . . , u_n and corresponding eigenvalues λ_1, λ_2, . . . , λ_n, which satisfy the linear equation

C u_i = λ_i u_i,  i = 1, . . . , n.   (34)

The principal components are determined by the K leading eigenvectors resulting from the decomposition. The dimensionality reduction step is performed by keeping only the eigenvectors corresponding to the K largest eigenvalues (K < n). These eigenvectors form the transformation matrix U_K with dimension n × K:

U_K = [u_1, u_2, . . . , u_K],   (35)

while λ_1 > λ_2 > . . . > λ_n. Finally, the linear transformation R^n → R^K is computed according to Equation (1) as

y_i = U_K^T Φ_i,   (36)

where y_i represents the transformed feature vector. The value of K can be chosen as needed or according to the following comparative criterion:

( ∑_{i=1}^{K} λ_i ) / ( ∑_{i=1}^{n} λ_i ) ≥ T,   (37)

where the threshold T ∈ ⟨0.9; 0.95⟩. The comparative criterion can equivalently be rewritten as ∑_{i=1}^{K} λ_i ≥ T ∑_{i=1}^{n} λ_i.
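A compact sketch of Equations 30-37 is given below. It is a minimal illustration in NumPy, assuming the training vectors are stored row-wise in a single matrix; the symmetric eigendecomposition (eigh) is used because the covariance matrix is symmetric.

import numpy as np

def pca_train(X, K=None, T=0.95):
    """PCA training following Equations 30-37.

    X: (M, n) matrix of M n-dimensional feature vectors.
    K: number of principal components; if None, choose the smallest K
       whose retained variance ratio reaches the threshold T.
    Returns the mean vector and the (n, K) transformation matrix U_K.
    """
    mean = X.mean(axis=0)                        # Eq. 31
    Phi = X - mean                               # Eq. 30 (centering)
    C = (Phi.T @ Phi) / X.shape[0]               # Eq. 33, global covariance (n x n)
    eigvals, eigvecs = np.linalg.eigh(C)         # Eq. 34
    order = np.argsort(eigvals)[::-1]            # sort lambda_1 > lambda_2 > ...
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if K is None:                                # Eq. 37, comparative criterion
        ratio = np.cumsum(eigvals) / eigvals.sum()
        K = int(np.searchsorted(ratio, T) + 1)
    return mean, eigvecs[:, :K]                  # Eq. 35, U_K

def pca_apply(X, mean, U_K):
    """Eq. 36: project centered vectors into the K-dimensional PCA space."""
    return (X - mean) @ U_K

# toy usage on 26-dimensional LMFE-like vectors
rng = np.random.default_rng(3)
X = rng.standard_normal((5000, 26)) @ rng.standard_normal((26, 26))
mean, U_K = pca_train(X, T=0.95)
Y = pca_apply(X, mean, U_K)
print(U_K.shape, Y.shape)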

Classical PCA in ASR
In this section we describe PCA trained from the whole amount of training data (see Section 5.1). Two kinds of input data for PCA were used: the first was represented by the 26-dimensional LMFE features and the second by the 13-dimensional MFCCs. Each parametrized speech signal in the corpus is represented by an LMFE or MFCC matrix X^(i), i ∈ ⟨1; N⟩, with dimension 26 × n_i (or 13 × n_i, see Section 5.2), where n_i represents the number of frames in the i-th recording and N represents the number of training speech signals (N = 36917).
In the first stage, the initial data preparation is performed, which requires the mathematical computations described by Equations 30-32. The global covariance matrix is computed according to Equation 33 and then decomposed into a set of eigenvector-eigenvalue pairs. The eigenvectors corresponding to the K largest eigenvalues were chosen; these formed the transformation matrix U_K (see Equation 35), which was used to transform the training and test corpora into the PCA feature space.
Note that the final dimension K of the feature vectors after the PCA transformation was chosen independently of the criterion formula (Equation 37). Detailed reasons are given in Sections 5.2 and 6.3. However, for interest, the optimal dimensions determined by Equation 37 for different PCA configurations are listed in Section 6.3.

Partial-data trained PCA
In the case of a relatively small training corpus, there is no problem computing the covariance matrix. However, in the case of large corpora (thousands of recordings) and high-dimensional data, problems related to processing time (several hours) and memory requirements (≈ 20 GB) may occur. We found that it is not necessary to use the whole training data set for PCA learning; a part of it may be sufficient [16]. In other words, PCA can be trained from a limited (reduced) amount of training data while the performance is maintained or even improved. We call this procedure partial-data trained PCA.
Partial-data PCA training can be viewed as a kind of feature selection process. The main idea is to select the statistically significant data (feature vectors) from the whole amount of training data. There are two major processing stages. The first stage is the data selection based on PCA separately applied to all training feature vectors. Suitable vectors are concatenated into one train matrix, which is treated as the input for the main PCA. The second stage is the main PCA (see Section 4.1).
Suppose now that the same conditions as in Section 4.1 apply. The selection process based on PCA (without the projection phase) can then be described as follows. Each 26-dimensional LMFE (or 13-dimensional MFCC) feature vector x_i, i ∈ ⟨1; M⟩ (see Section 5.2), is reshaped to its matrix version X_i with dimension 2 × 13 (in the case of MFCC vectors, the 13-dimensional vector is extended with a zero coefficient in order to be reshaped to a matrix of dimension 2 × 7). After mean subtraction, the 2 × 2 covariance matrix C_i of X_i is computed analogously to Equation 33. In the next step, an eigendecomposition is performed on the covariance matrix C_i, which results, for each i, in a set of eigenvectors w_{i1}, w_{i2} and eigenvalues α_{i1}, α_{i2}:

C_i w_{ij} = α_{ij} w_{ij},  j = 1, 2,

where α_{i1} ≥ α_{i2}. Note that the parameters w_{i1}, w_{i2} and α_{i1}, α_{i2} are updated at each iteration i with new parameters resulting from a new eigendecomposition. For the PCA-based selection the eigenvectors w_{i1}, w_{i2} are not used. On the other hand, the eigenvalues α_{i1}, α_{i2} are the key elements because the selective criterion is based exactly on them. Using these eigenvalues, the percentage proportion P_i is computed as

P_i = 100 · α_{i1} / (α_{i1} + α_{i2}),

which determines the percentage of the variance explained by the first eigenvalue in the eigenspectrum. Further, it is necessary to choose a threshold T. It can be chosen from two different intervals: the first is defined as T_1 ∈ (50; ≈65⟩ and the second as T_2 ∈ ⟨≈85; 99.9⟩. The selective criterion can then be based on the following logical expressions:

P_i ≤ T_1

for the first interval, or

P_i ≥ T_2

for the second interval. If the evaluation of the expression yields a logical true, the current feature vector is classified as statistically significant for PCA training. This vector is stored and the selection continues with the next vector. In this way, the whole training corpus is processed. From the selected vectors a training matrix is composed, which is treated as the input for the main PCA described in Section 4.
From the selected vectors the new centered train matrix

A' = [φ_1, φ_2, . . . , φ_{M'}]

is formed, where φ_i is the mean-subtracted feature vector in the new train matrix and M' is the number of selected vectors. The next mathematical computations are identical to Equations 33-36. The partial-data training procedure for LMFE feature vectors is illustrated in Figure 4.

The new train matrix can be viewed as a radically reduced, more relevant representation of the training corpus. It has a nearly homoscedastic variance structure because it contains only those feature vectors which have almost the same variance distribution. Feature vectors selected from the interval represented by threshold T_1 can be characterized as data clusters which have a very small variance distribution, explained by the first eigenvalue, along the direction of the corresponding first eigenvector. On the other hand, the feature vectors from the interval represented by threshold T_2 are clusters which have a large variance distribution along the first eigenvector. In both cases, the magnitude of the variance is determined by the first eigenvalue. The size of the selected partial data set depends on the value of T_1 or T_2. It can be expressed as a percentage of the training corpus as

S = 100 · M' / M [%].

We found that, in practice, it is useful to keep S ≤ 15%, so the selected subset contains at most 15% of the whole training data amount. For example, there are approximately 19 million training vectors in our corpus, and it can be sufficient to extract ≈ 19000 vectors for partial-data training. But, as will be shown in Section 6.3.2, this argument does not apply to all cases. The time consumption and memory costs of the covariance matrix computation for the reduced data set are much smaller than those for the whole corpus. In the case of partial-data training it is necessary to allocate memory only for the currently investigated feature vector and for the other data elements needed in the mathematical computations. These memory requirements are on the order of a few megabytes. In other words, the advantage of partial-data training is that it does not require loading the whole data matrix into the main memory.
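The selection stage can be sketched as follows. This is our own minimal illustration of the criterion described above; the exact normalization of the 2 × 2 covariance matrix and the per-row mean subtraction are assumptions, but they do not affect the eigenvalue proportion P_i on which the selection is based.

import numpy as np

def select_partial_data(X, T1=None, T2=None):
    """PCA-based selection of statistically significant feature vectors.

    X: (M, 26) LMFE feature vectors (for 13-dim MFCCs, pad with one zero
       and reshape to 2 x 7 instead of 2 x 13).
    T1, T2: percentage thresholds; a vector is kept if P_i <= T1 or P_i >= T2.
    Returns the selected subset of X.
    """
    keep = []
    for i, x in enumerate(X):
        Xi = x.reshape(2, -1)                    # matrix version, 2 x 13
        Xi = Xi - Xi.mean(axis=1, keepdims=True) # mean subtraction per row
        C = (Xi @ Xi.T) / Xi.shape[1]            # 2 x 2 covariance matrix
        alpha = np.sort(np.linalg.eigvalsh(C))[::-1]
        P = 100.0 * alpha[0] / alpha.sum()       # variance explained by 1st eigenvalue
        if (T1 is not None and P <= T1) or (T2 is not None and P >= T2):
            keep.append(i)
    return X[keep]

# toy usage: keep vectors whose first eigenvalue explains at least 95 % of variance
rng = np.random.default_rng(4)
X = rng.standard_normal((10000, 26))
subset = select_partial_data(X, T2=95.0)
print(subset.shape)                              # selected subset of the "corpus"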

Speech corpus
All experiments were evaluated using the Slovak speech corpus ParDat1 [5], which contains approximately 100 hours of spontaneous parliamentary speech recorded from 120 speakers (90% men). Exactly 36917 training utterances were used for acoustic modeling, and 884 utterances were used for testing.

Speech preprocessing
The speech signal was preemphasized and windowed using a Hamming window. The window size was set to 25 ms and the step size was 10 ms. The fast Fourier transform was applied to the windowed segments. Mel-filterbank analysis with 26 channels was followed by taking the logarithm of the filter outputs. This processing resulted in 26-dimensional LMFE features, which were used for the PCA-based processing.
In the case of the MFCC baseline feature extraction, the LMFE vectors were further decorrelated by the discrete cosine transform (DCT). The first 12 MFCCs were retained and augmented with the 0-th coefficient. During acoustic modeling the first- and second-order derivatives were computed and added to the basic vectors. Thus, the final MFCC vectors were 39-dimensional.
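For illustration, the front-end described in these two paragraphs can be approximated with the librosa and SciPy libraries as sketched below (the chapter itself used HTK, whose exact MFCC scaling is not reproduced here); the sampling frequency and the floor constant added before the logarithm are assumptions.

import numpy as np
import librosa
import scipy.fftpack

def lmfe_and_mfcc(signal, sr=16000, n_mels=26, n_mfcc=13, preemph=0.97):
    """Compute 26-dim LMFE and 13-dim MFCC (incl. 0-th coeff.) frames."""
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])  # pre-emphasis
    win = int(0.025 * sr)                       # 25 ms Hamming window
    hop = int(0.010 * sr)                       # 10 ms step
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=win,
                                         hop_length=hop, win_length=win,
                                         window='hamming', n_mels=n_mels,
                                         power=2.0)
    lmfe = np.log(mel + 1e-10)                  # 26 x T log mel-filterbank energies
    mfcc = scipy.fftpack.dct(lmfe, axis=0, type=2, norm='ortho')[:n_mfcc]  # DCT decorrelation
    return lmfe.T, mfcc.T                       # frames in rows

# toy usage on one second of noise
sig = np.random.default_rng(5).standard_normal(16000)
lmfe, mfcc = lmfe_and_mfcc(sig)
print(lmfe.shape, mfcc.shape)                   # (T, 26), (T, 13)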
For the LDA and 2DLDA-based processing, the 13-dimensional MFCC vectors were used as the input to these methods. In order to allow a consistent comparison of recognition accuracy levels in the evaluation process, the LDA and 2DLDA models were trained using 39-dimensional LDA (2DLDA) vectors. In the evaluation, the 39-dimensional MFCC models were treated as reference models, so the dimensions were identical. The number of classes k used in LDA and 2DLDA was identical to the number of phonetic classes in acoustic modeling (k = 45).

Acoustic modeling
Our recognition system used context-independent monophones modeled by three-state left-to-right HMMs. The number of Gaussian mixtures per state was a power of 2, ranging from 1 to 256. The phone segmentation of 45 phones was obtained from embedded training and automatic phone alignment. The number of trained monophone models corresponded to the number of phonemes and to the number of basic classes for LDA and 2DLDA. For testing purposes, a word lattice was created from a bigram language model. The language model was built from the test set. The vocabulary size was 125k. Feature extraction, HMM training and testing were performed using the HTK (Hidden Markov Model Toolkit) [20].

Evaluation
In order to evaluate the experiments, we chose the accuracy as the evaluation parameter. Accuracies were computed as the ratio of the number of word matches (resulting from the recognizer) to the number of reference words [20]. In all experiments the accuracy is given as a percentage.

Experiments and results
This section is a major part of the whole chapter. It provides a detailed and extensive experimental evaluation of the performance of the mentioned linear transformation methods and their combinations. The section presents the results of the recognition accuracy levels resulting from different experimental configurations.

Conventional LDA-based processing
In this section, the conventional LDA is investigated. The LDA-based statistical computation was performed according to the mathematical description in Section 2.1 (Equations 3-12). Note that the class label of each supervector composed according to Equation 8 was assigned according to the class label of the current basic vector x_j at the center position j. In our experiments we tried five supervector lengths J: 3, 5, 7, 9 and 11. This means that the dimensions of the covariance matrices in the statistical estimation were 39 × 39, 65 × 65, 91 × 91, 117 × 117 and 143 × 143. As mentioned in Section 2.1, when the length of the supervector was greater than the number of classes, the between-class scatter (covariance) matrices were close to singular. For this reason we used in these cases the computation of Σ_B according to Equation 9.

Supervector compositions and the scatter matrices
It is known that covariance (scatter) matrices are in general symmetric, square, positive semi-definite matrices; for well-conditioned data they are positive definite and thus regular. This also holds for the matrices in LDA. Since in LDA the covariance matrices are computed from supervectors, a problem with the symmetry of these matrices may occur. We found that the symmetry depends on the way in which the supervectors are constructed. Figure 2 illustrates two types of supervector construction using an example vector length of 4. Subfigure (a) illustrates the classical construction of the supervector by simple concatenation. Subfigure (b) illustrates a construction in which the structure of the supervector follows the structure of the basic vectors. Thus, if the first few coefficients of the basic vector carry higher energy than the higher-order coefficients, the new supervector follows this tendency.
It should be noted that the arrangement of the coefficients in the supervector impacts the symmetry of the matrices, and this can affect other properties. These facts are illustrated in Figure 3, from which the influence of the supervector construction on the symmetry of the scatter matrices can be seen. Figures 3 (a) and (c) represent the within-class scatter matrices in the case when the supervectors are constructed according to Figure 2 (a); it can be seen that these matrices are multisymmetric. On the other hand, the matrices in Figures 3 (b) and (d) are purely symmetric; they were computed from supervectors constructed according to Figure 2 (b).
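The two construction schemes can be made concrete with the following sketch; the exact interleaving used for Figure 2 (b) is our assumption (coefficients of the same order from the J context frames are grouped together).

import numpy as np

def splice_concat(frames, j, J=5):
    """Figure 2 (a): plain concatenation of J consecutive basic vectors."""
    half = (J - 1) // 2
    return np.concatenate([frames[j + t] for t in range(-half, half + 1)])

def splice_grouped(frames, j, J=5):
    """Figure 2 (b): coefficients of the same order from the J context frames
    are placed next to each other, preserving the energy ordering of the
    basic vectors."""
    half = (J - 1) // 2
    ctx = np.stack([frames[j + t] for t in range(-half, half + 1)])  # (J, 13)
    return ctx.T.reshape(-1)        # c0 of all frames, then c1 of all frames, ...

frames = np.arange(7 * 13).reshape(7, 13).astype(float)   # 7 toy 13-dim vectors
print(splice_concat(frames, 3, J=5)[:6])
print(splice_grouped(frames, 3, J=5)[:6])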

Between-class scatter matrix and the singularity
As mentioned in Section 2.1, the between-class scatter matrices for context lengths greater than J = 3 were computed according to Equation 9 instead of the classical Equation 6. Figure 4 (a) demonstrates that the between-class scatter matrix computed for context length J = 5 according to Equation 6 is not symmetric; in addition, it is computed from supervectors constructed according to Figure 2 (a). Figure 4 (b) illustrates a similar case, but the matrix is computed from supervectors constructed according to Figure 2 (b). It can be seen that this matrix is only close to symmetric, and in the statistical estimation this can result in a singular between-class matrix and complex-valued numbers in the LDA transformation matrix. Note that the symmetric between-class scatter matrices in Figure 3 were computed according to Equation 9.

Results
The experiments based on LDA can be divided into three categories related to the dimension of the LDA transformation matrix. The first category is represented by an LDA matrix with dimension 13 × 39; thus, only the first 13 eigenvectors corresponding to the 13 leading eigenvalues were retained for the transformation. The final dimension of the features was expanded to 39 with Δ and ΔΔ coefficients. The second category is represented by an LDA matrix with dimension 19 × 39, so more LDA coefficients were used for the transformation; note that the final dimension of the features was 38 (19 + Δ). The third category is represented by an LDA matrix with dimension 39 × 39, and in this case the Δ and ΔΔ coefficients were not used. The difference between these three categories is that various numbers of dimensions and data-dependent or data-independent Δ and ΔΔ coefficients were used for acoustic modeling. The higher-order LDA coefficients (14-39) can be viewed as Δ and ΔΔ coefficients estimated in a data-dependent manner. The experimental results for LDA are given in Table 1. The results are analyzed separately for the mentioned categories.
1. The highest accuracies were achieved for 13 LDA coefficients expanded with Δ and ΔΔ coefficients and for J = 3. The maximum improvement compared to the MFCC model is +2.05% for 4 mixtures. Only for 1 mixture was no improvement achieved.
3. The results in the last case, when the dimension of the LDA matrix was 39 × 39, are not satisfactory. In all cases the performance decreased. However, we can conclude that longer context lengths are more suitable for higher dimensions of the transformation matrix (without Δ and ΔΔ).

2DLDA-based processing
In this section we extensively evaluate the performance of 2DLDA in different configurations and compare it with the reference MFCC model and also with the performance of conventional LDA. The main difference is that it is necessary to compute two eigendecompositions and there are two transformation matrices, L and R. 2DLDA does not deal with supervectors as LDA does, but with supermatrices, which are the basic data elements in 2DLDA (instead of vectors). These supermatrices were created from the basic cepstral vectors by coupling them together. Similarly to LDA, we used 5 different sizes of supermatrices according to the number of contextual vectors (context size J). Thus, the sizes of the supermatrices were 13 × 3, 13 × 5, 13 × 7, 13 × 9 and 13 × 11. Consequently, the class means, global mean, within-class scatter matrices and between-class scatter matrices have corresponding sizes according to the current length of context. For example, when the context size J was set to 7, 7 cepstral vectors were coupled together in the statistical estimation to form a 13 × 7 supermatrix. The statistical estimators then have the following dimensions:
• class means M_i: 13 × 7,
• global mean M: 13 × 7,
• left within-class scatter matrix S_w^L: 7 × 7,
• left between-class scatter matrix S_b^L: 7 × 7,
• right within-class scatter matrix S_w^R: 13 × 13,
• right between-class scatter matrix S_b^R: 13 × 13,
• left transformation matrix L: 13 × 13,
• right transformation matrix R: 7 × 7.
The mathematical computations resulted in the transformation matrices L and R. These matrices were then used to transform the whole speech corpus. In this way, each supermatrix created from the coupled vectors in a recording was transformed to its reduced version. The dimension reduction step was done by choosing the required sizes of L and R. In the next step, each transformed supermatrix was rearranged back into a vector according to the matrix-to-vector alignment. The specific dimensions used in the transformations are listed in Table 2. Since the mathematical part of 2DLDA is an iterative algorithm, it was necessary to set the number of iterations I. In [19] it is recommended to run the iteration loop only once (I = 1), which significantly reduces the total running time of the algorithm. In our 2DLDA experiments we ran the iteration loop three times (I = 3).
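The transformation of one supermatrix and the subsequent matrix-to-vector alignment can be sketched as follows; the row-major flattening and the random placeholder matrices are assumptions made for illustration.

import numpy as np

def transform_supermatrix(A, L, R, l1, l2):
    """Project a 13 x J supermatrix A with 2DLDA matrices L and R and
    rearrange the reduced l1 x l2 matrix back into a feature vector."""
    B = L[:, :l1].T @ A @ R[:, :l2]      # dimension reduction via column selection
    return B.reshape(-1)                 # matrix-to-vector alignment (row-major)

# toy usage: context J = 7, reduced to 13 x 3, i.e. a 39-dimensional vector
rng = np.random.default_rng(6)
A = rng.standard_normal((13, 7))
L = rng.standard_normal((13, 13))        # stands in for the trained left matrix
R = rng.standard_normal((7, 7))          # stands in for the trained right matrix
vec = transform_supermatrix(A, L, R, l1=13, l2=3)
print(vec.shape)                         # (39,)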
The 2DLDA results can be divided into three categories, similarly to the LDA case, and are given in Table 2.
1. The first category is represented by a vector of dimension 13, which resulted from the transformation. The final dimension was 39 (13 2DLDA + Δ + ΔΔ coefficients). As can be seen from the table, the maximum improvement achieved by 2DLDA was +2.01% for context length J = 3 and for one iteration (I = 1).

PCA-based processing
In this section, we experimentally evaluate the performance of the full-data trained PCA method, i.e. PCA trained from the whole amount of training data, for LMFE and MFCC features. In the next part of this section we present the results of the partial-data trained PCA with various parameters. Note that all of the PCA-based models were transformed with a PCA matrix of dimension 13 × 13 and the features were then expanded with Δ and ΔΔ coefficients, resulting in a final dimension of 39.

Full-data trained PCA
As mentioned, PCA requires the allocation of the whole data matrix in memory. In addition, the covariance matrix is computed from this data matrix, which can be a computationally very demanding operation. In order to compare the partial-data trained models with the full-data trained model, it was necessary to perform the above-mentioned computation.
The full-data trained PCA was performed on a Linux machine with 32 GB of memory. The training data were loaded into memory sequentially in data blocks and then concatenated into one data matrix (see Equation 32). From this matrix the covariance matrix was computed according to Equation 33. Then the integral parts of PCA according to Equations 34-36 were performed. In the next step, the acoustic modeling based on the PCA-transformed features was done. The evaluation results of the full-data trained PCA are listed in Table 5 for LMFE features and in Table 6 for MFCC features.

Partial-data trained PCA
The selection process for the feature vectors, as outlined in Fig. 1, was performed and repeated M times. Overall, 10 partial-data trained models with LMFE features were learned: 5 models were learned for selection based on threshold T_1 and 5 for T_2. An identical scheme applies to the MFCC features. The parameters for these models are listed in Table 3 and Table 4.

Table 4. Parameters used for partial-data PCA models trained from MFCC

One of the output parameters of the partial-data PCA is the optimal dimension d determined by Equation (37). It represents the number of principal components which could be used to transform the input data while retaining 95% of the global variance. Note that the threshold values T_1 and T_2 were determined on an experimental basis. The results of the partial-data PCA models are listed in Table 5 and Table 6 for LMFE and MFCC features, respectively. Note that the tables contain only the highest accuracies chosen from all models.
From Table 5 we can conclude that for LMFE features the selected subsets of size 0.1% and 5% are not suitable for partial-data PCA training. In addition, an improvement in comparison with the full-data trained PCA was achieved only for 32-256 mixtures. The maximum absolute improvement of +0.43% was achieved for 64 mixtures.

Table 6. Accuracy levels for MFCC-based full-data and partial-data trained PCA

PCA-based 2DLDA
As mentioned in Section 1, one of the topics of this chapter is the interaction of two types of linear transformations in one experiment. More specifically, the aim of this section is to present an evaluation of the mentioned interaction of PCA and 2DLDA. In other words, in this experiment we used the PCA-based feature vectors instead of MFCC vectors as the input for 2DLDA. We wanted to demonstrate that the PCA features have properties comparable to MFCC features and that 2DLDA trained from PCA features can achieve performance comparable to 2DLDA trained from MFCC features. The PCA training was done in two ways: the first is the classical full-data training and the second is the partial-data training (see Table 7).
From the results of the experiment given in Table 7 we can draw the following conclusions. In 4 of 9 cases the performance of 2DLDA was improved by using PCA features as its input, and in 3 of these 4 cases the improvement was achieved with full-data training.

Table 7. Accuracy levels (%) of PCA-based 2DLDA

Global experimental evaluation of all methods
In this last section we summarize the experimental results presented in the whole chapter. Overall, we present seven types of experiments evaluating the performance of some kind of linear feature transformation applied in the feature extraction of Slovak phoneme-based continuous speech recognition. Each result of the partial experiments is summarized and compared with the other results in Table 8. A graphical comparison is given in Figure 5.

Conclusions and discussions
The global conclusion of the experimental part of this chapter can be divided into the following deductions.
• Principal Component Analysis can improve the performance of the MFCC-based acoustic model. Either LMFE or MFCC features can be used as the input for PCA.
• The proposed partial-data trained PCA achieves better results than the full-data trained PCA. Higher improvements can be achieved when MFCC features are used as the input for the partial-data PCA.
• The conventional Linear Discriminant Analysis leads to improvements for almost all mixture counts, but a problem related to the singularity of the between-class scatter matrix may occur for larger context lengths J.
• 2DLDA achieves improvements comparable to LDA (slightly smaller). On the other hand, it is much more stable than LDA and there is no problem with singularity, because 2DLDA overcomes it implicitly (the dimensions of the scatter matrices are much smaller).
• In the last step, we demonstrated that the combination of PCA and 2DLDA (subspace learning) can lead to further refinement and improvement compared to the performance of 2DLDA alone.

Future research intentions
Based on the presented knowledge and our research intentions, in the near future we would like to develop an algorithm that eliminates the use of class label information (the class definition) in the LDA-based experiments. In other words, we want to train LDA and similar supervised modifications in an unsupervised way, without using the labeling of the speech corpus.