The most similar syllables
Speech recognition is the process of converting speech signals to the text. Studies on speech recognition have increased very fast for the last twenty-five years. Most of these studies have used phoneme and word as speech recognition units. Namely, in phoneme based speech recognition systems, all phonemes in a language have been modelled by a speech recognition method, and then the phonemes can be detected by these models. The recognized words are constructed with concatenating these phonemes. However, word based systems model the word utterances and try to recognize the word as a text unit. Word based systems have better success rate than phoneme based systems. As a rule, if a speech recognition system has longer recognition unit than sub-word units, it has better success rate on recognition process. In addition, phoneme end-point detection is quite difficult operation, and this effects the success of the system.
Turkish, that is one of the least studied languages in the speech recognition field, has different characteristics than European languages which require different language modelling technique. Since Turkish is an agglutinative language, the degree of inflection is very high. So, many words are generated from a Turkish word’s root by adding suffixes. That’s why, word based speech recognition systems are not adequate for large scaled Turkish speech recognition. Because Turkish is a syllabified language and there are approximately 3500 different Turkish syllables, speech recognition system for Turkish will be suitable to use syllable (sub-word unit) as a speech recognition unit.
Turkish speech recognition studies have increased in the past decade. These studies were based on self organizing feature map (Artuner, 1994), DTW (Meral, 1996; Özkan, 1997), HMM (Arısoy & Dutağacı, 2006; Karaca, 1999; Koç, 2002; Salor & Pellom, 2007; Yılmaz, 1999) and Discrete Wavelet Neural Network (DWNN) and Multi-layer Perceptron (MLP) (Avcı, 2007).
In a simplified way, speech recognizer includes the operations as preprocessing, feature extraction, training, recognition and postprocessing. After the speech recognizer takes the acoustic speech signal as an input, the output of the recognizer will be the recognized text.
The most common approaches to speech recognition can be divided into two classes: “template based approach” and “model based approach”. Template based approaches as Dynamic Time Warping (DTW) are the simplest techniques and have the highest accuracy when used properly. The electrical signal from the microphone is digitized by an analog-to-digital converter. The system attempts to match the input with a digitized voice sample, or template. The system contains the input template, and attempts to match this template with the actual input. Model based approaches as Artificial Neural Network (ANN), Hidden Markov Model (HMM) and Support Vector Machine (SVM) tend to extract robust representations of the speech references in a statistical way from huge amounts of speech data. Model based approaches are currently the most popular techniques. However, when the size of the vocabulary is small and the amount of training data is limited, template based approaches are still very attractive.
Speech recognition system can be speaker dependent or speaker independent. A speaker dependent system is developed to operate for a single speaker. These systems are usually more successful. A speaker independent system is developed to operate for any speaker. These systems are the most difficult to develop, most expensive and accuracy is lower than speaker dependent systems.
The size of vocabulary of a speech recognition system affects the complexity, processing requirements and the accuracy of the system. Some applications only require a few words such as only numbers; others require very large dictionaries such as dictation machines. According to vocabulary size, speech recognition systems can be divided into three main categories as small vocabulary recognizers (smaller than 100 words), medium vocabulary recognizers (around 100-1000 words) and large vocabulary recognizers (over 1000 words).
Speech recognition studies started in the late 1940s. Dreyfus-Graf (1952) designed his first “Phonetographe”. This system transcribed speech into phonetic “atoms”. Davis et al. (1952) designed the first speaker dependent, isolated digit recognizer. This system used a limited number of acoustic parameters based on zero-crossing counting. In Bell Laboratories, a word recognizer is designed with a phonetic decoding approach using phonetic units (Dudley & Balashek, 1958). Jakobson et al. (1952) developed the speaker independent recognition of vowels. At RCA laboratories, phonetic approach was used in the first “phonetic typewriter” which recognizes syllables by a single speaker (Olson & Belar, 1956).
The first experiments on computer based speech recognition were started in the late 1950s. The studies as speaker independent recognition of ten vowels (Forgie & Forgie, 1959), phoneme identification (Sakai & Doshita, 1962), and digit recognition (Nagata et al., 1963) have been made at this period.
Dynamic time warping using dynamic programming was proposed by Russian researchers (Slutsker, 1968; Vintsyuk, 1968). Reddy (1966) developed continuous speech recognition by dynamic tracking of phonemes.
Statistical modelling methods as Hidden Markov Models (HMM) (Ferguson, 1980; Rabiner, 1989) were used in 1980s. Neural networks as perceptron were proposed in 1950s. In 1980s, neural network approaches were presented again (Lippmann, 1987).
Support Vector Machine (SVM) was presented as a novel method for solving pattern recognition problems (Blanz et al., 1996; Cortes & Vapnik, 1995; Osuna et al., 1997). SVM has been successfully used for implementation speech recognition systems in 1990s and 2000s.
This chapter is organized as follows. In the next section, system databases, preprocessing operation, word and syllable end-point detection, feature extraction, dynamic time warping, artificial neural network and postprocessing operation have been explained. In Section 3, how to detect misspelled words using syllable n-gram frequencies is presented in detail. The experimental results of the developed systems have been given in Section 4. The final section concludes the chapter.
2. Implementation of syllable based speech recognition
In this study, speaker dependent isolated word speech recognition systems using DTW and ANN have been designed and implemented, and the speech signal features as mfcc, lpc, parcor, cepstrum, rasta and the mixture of mfcc, lpc and parcor have been used for the speech recognition approaches.
The speech recognition applications have been executed on the computer which has the following features: Pentium Centrino 1.6 CPU, 768 MB RAM, 40 GB harddisk, Windows XP Operating System, a sound card and a microphone. The codes of the applications have been written with Matlab 6.5.
2.1. System databases
System dictionary consists of 200 different Turkish words (Aşlıyan, 2010). Using this dictionary, two databases have been constructed (Aşlıyan, 2010). One database has been used for training and the other is for testing of the system.
The training speech database (approximately 2.7 hours of 250 MB speech material) involves 5000 Turkish word utterances (25x200) which were recorded by a male speaker. Each word in the dictionary was recorded 25 times.
The testing speech database was constructed by recording every word 10 times in the dictionary. Total number of utterances is 2000 (about 1.1 hours of 100 MB speech material).
The recording procedure took place in a noise-free environment. A head-mounted close-talking microphone was used. The format of the file recording is WAVE file format. The waveforms of the utterances are encoded using Pulse Code Modulation (PCM) coding format, 11025 Hz sampling rate, 2 bytes per sample. The utterances are recorded in 2 seconds time duration.
2.2. Preprocessing operation
After the digitization of the word speech signal, preemphasis filter has been applied to spectrally flatten the signal. For the speech signal the syllable end-point detection is applied. After that each syllable utterance is divided into frames of 20 ms by frameblocking. To reduce the signal discontinuity at the ends of each block, Hamming window is applied for each frame.
2.3. Word and syllable end-point detection
An important problem in speech processing is to detect the presence of speech in a background of noise. This problem is often referred to as the end-point location problem (Rabiner & Sambur, 1975). The accurate detection of a word’s start and end-points means that subsequent processing of the data can be kept to a minimum.
A major cause of errors in isolated-word automatic speech recognition systems is the inaccurate detection of the beginning and ending boundaries of test and reference patterns (Junqua et al., 1997). It is essential for automatic speech recognition algorithms that speech segments are reliably separated from non-speech.
The reason for requiring an effective end-point algorithm is that the computation for processing the speech is minimum when the end-points are accurately located (Savoji, 1989).
In syllable end-point detection operation, the speech signals are taken and after processing them, the number of syllables and the indexes of syllable end-points have been detected. Namely, the beginning and end indexes are computed on the digital speech signal.
After sampling the sound wav files, the mean of the speech signal as a vector is calculated and translated to axis. Assume that is a speech signal. The new speech signal, which is focused on axis, is. After that, the voiced and unvoiced parts of the speech signal are approximately computed with the slope between the beginning value of the digital sound and the maximum value of the sound. This slope is the threshold slope. The utterance is divided into windows which have 350 samples. If the slope, which is calculated between two windows, which are one after the other, is greater than the threshold slope, this means that the voiced part of the sound begins at the index. However, these beginning and end index of the voiced part are nearly true, but not certain value. The distance data of zero-crossing index of sound vector has been used because of obtaining more accurate results. A new vector which represents the zero-crossing distances, has been constructed, then a threshold has been defined (say zero-crossing threshold=100). The beginning index is found earlier but not certainly true index. Now this index goes on one by one to the first index if the zero-crossing distance is between 1 and zero-crossing threshold. Zero values of the vector are not taken into account. In the same way, the end index of the voiced part is calculated. At the end, the voiced part is found exactly.
To discover syllable end-points, the windows that consist of 900 samples are generated without overlapping. The mean values of these windows are computed and assembled for constructing a new mean vector as shown in Figure 1. The slope between one element and the next element of the mean vector is determined, and if the slope is zero or greater than zero, a new vector’s value is +1. Otherwise, the value is -1. Using these vectors, the boundaries of syllables on the sound vector are obtained approximately. The samples between 500 samples backward and 500 samples forward from the found syllable end-points are divided into windows which include 20 samples. After that, the middle index of the window which has the minimum mean is syllable end-point. Finally, the beginning and end index of the syllables can be calculated for each word before processing them. Now we have the number of syllables of the word and their end-point indexes.
According to the number of syllables in a word using syllable end-point detection algorithm which is mentioned in Subsection 2.3.2, it is found that the accuracy result is approximately 99%. For example, the word which has five syllables is successfully divided into five syllables and the end-points of the syllables are detected.
2.3.1. Word end-point detection algorithm
Step 1.in Equation 1 is digital sound vector.(is the number of samples in the utterance)E1
Step 2. is the mean of the values of first 200 samples in is a vector which translated to the axisE2E3
Step 3. is the maximum value of the vector is the index of maximum value of the vector and are the beginning and end threshold values respectively.E4E5
Step 4. The vector is divided into windows which consist of 350 samples. The vector is the mean vector of above windows.
Step 5. For, and are calculated as shown in Equation 8.
Step 6.is the beginning index of the sound vector.
Step 7. is the end index of the sound vector.
For r =DownTo 1
Step 8. The beginning and end indexes are approximately determined from Step 6 and 7. To decide exactly the end-points of the sound, the zero-crossing indexes are fixed. Using the sound vector, the zero-crossing vector is generated.
Step 9. After the distances between one after the other zero-crossing indexes are computed, new distance vector of zero-crossing is calculated as the followings.
if and after the index k, its first value of the following indexes is 1,
Step 10. The threshold value of zero-crossing is accepted as is the value at the index which the sound begins.E21
For DownTo 1
if and then
if then continue
if then break
Step 11. is the value at the index which the sound ends.E22
if and then
if then continue
if then break
2.3.2. Syllable end-point detection of the word utterances
After detecting the end-points (and) of the words, the end-points of the syllables in the words are determined with the following algorithm.
Step 2. The vector is divided into windows, which have 900 samples, without overlapping. The vector is the mean vector of each window above.
Step 3. The slope vector is composed by computing the slopes between the values of the vector which follow one after another.
Step 4. Using the slope vector, we calculate the vector which has the values, +1 and -1. Namely, the increasing and decreasing positions are determined.
Step 5. is the number of syllables in the word.E31
if and then
Step 6. The values of the middle indexes, which include the value -1 in the vector are approximately the end-points of syllables. There are syllable end-points. The syllable end-point vector is calculated as shown in the following. The values are the indexes which are the values of the vector
if the middle index of the indexes, which have the k-th value -1 that follows one after the other, is w thenE34
Step 7. Until now, the beginning point and end point is detected. The vector represents the end-points of syllables. To find the syllable end-points more accurately the following algorithm is performed, and the vector is attained.
The windows with 20 samples between and are constructed, and the mean values are computed for each windows.
if the middle index of the window, which has the smallest mean, is
Step 8. The vector which represents the syllables end-points in the word sound vector is decided accurately. There are syllables in the word. The beginning index of k-th syllable is and the end index is
2.4. Feature extraction
After preprocessing the speech signal, we have the syllable end-points of the word. The syllable utterances are framed with Hamming window. The length of one frame is 20 ms with 10 ms shift (overlapping time=10 ms). 10 features are computed from each frame. These features are lpc, parcor, cepstrum, mfcc and rasta. The number of frames is not constant, but the number of frames is normalized to 30 frames with the length of 10 as shown in Figure 2. The normalized features are used only for the speech recognition method as ANN. is the syllable feature vector. is the syllable feature matrix. For normalized features, m is 30 as illustrated in Figure 2. In Figure 3, the time duration for each speech feature is shown, and the fastest speech feature extraction algorithm among these features is mfcc.
2.5. Dynamic Time Warping
Dynamic Time Warping (DTW) (Fornés et al., 2010; Myers et al., 1980 ; Sankoff & Kruskal, 1999) is a template-based classification technique which has the principle of matching a input speech signal converted into a feature matrix against reference templates. The templates are simply feature matrix examples of each syllable of a word or syllable in the vocabulary of the system. Consequently, DTW is normally used for recognition of isolated words or syllables. The similarity between a template and the unknown speech feature matrix is assumed to be inversely proportional to the minimum cost alignment. This is normally evaluated by calculating a local distance between each input features and all feature matrices of the reference template. Calculating the minimum cost alignment is then a matter of finding the optimal path from the bottom left-hand to the top right-hand corner of the matrix. Namely, the path that accumulates the lowest sum of local distances and does not stray too far away from the diagonal.
If we define two time series and, then DTWCost(S, T) calculates the distance between these two time series (S and T). Using dynamic programming DTW[i, j] is computed as DTW[i, j] = C + Minimum(DTW[i-1, j], DTW[i, j-1], DTW[i-1, j-1]) where C= Distance(s i, t j). As shown in the following, the warping path is found by detecting the minimum cost index pairs (i, j) form (0, 0) to (n, m).
2.5.1. Dynamic Time Warping algorithm
The following algorithm takes two speech feature matrices (inputs) to calculate the distance of them. The output of this algorithm is the distance of these two features. These two matrices consist of n and m frames respectively. Each frame has been assumed to get ten speech features. stands for infinity.
DTW_Cost (S[1..10, 1..n], T[1..10, 1..m])
DTW[0..n, 0..m] as a matrix variable
i, j, Cost as integer variables
For i=1 To m
DTW[0, i] =
For i=1 To n
DTW[i, 0] =
DTW[0, 0] = 0
For i=1 To n
For j=1 To m
Cost = Calculate Distance(S[1..10, i], T[1..10, j])
DTW[i, j] = Cost + Find Minimum(DTW[i-1, j], DTW[i, j-1], DTW[i-1, j-1])
Return DTW[n, m]
2.6. Artificial neural networks
A Neural Network (NN) is a computer software that simulates a simple model of neural cells in humans. The purpose of this simulation is to acquire the intelligent features of these cells.
Backpropagation networks are a popular type of network that can be trained to recognize different patterns including images, signal, and text. Backpropagation networks have been used for this speech recognition system.
2.6.1. Sigmoid function
The function as Equation 11 is called a Sigmoid function. The coefficient a is a real number constant. In NN applications, a is usually chosen between 0.5 and 2. As a starting point, we can use a=1 and modify it later when we are fine-tuning the network. Note that s(0) = 0.5, s(∞) = 1, s(-∞) = 0 (The symbol ∞ means infinity).
The Sigmoid function will convert values less than 0.5 to 0, and values greater than 0.5 to 1. The Sigmoid function is used on the output of neurons.
In NNs, a neuron is a model of a neural cell in humans. This model is simplistic, but as it turned out, is very practical. The neuron has been thought as a program or a class that has one or more inputs and produces one output. The inputs simulate the signals that a neuron gets, while the output simulates the signal which the neuron generates. The output is calculated by multiplying each input by a different number which is called weight, adding them all together, then scaling the total to a number between 0 and 1.
Figure 4 shows a simple neuron with:
Three hundred inputs. The input values are usually scaled to values between 0 and 1.
300 input weights. The weights are real numbers that usually are initialized to some random numbers. The weights are variables of type real. We can initialize to a random number between 0 and 1.
One output z. A neuron has only one output. Its value is between 0 and 1, it can be scaled to the full range of actual values.
In a more general manner, for n number of inputs, d is defined as Equation 13.
Let be a real number which is known as a threshold. Experiments have shown that best values for are between 0.25 and 1. is just a variable of type real that is initialized to any number between 0.25 and 1.
In Equation 14, the output z is the result of applying the sigmoid function onIn NN applications, the challenge is to find the right values for the weights and the threshold.
The structure of the system is shown in Figure 5. This NN consists of four layers: Input layer with 300 neurons, first hidden layer with 30 neurons, second hidden layer with 10 neurons and output layer with one neuron.
The output of a neuron in a layer goes to all neurons in the following layer. Each neuron has its own input weights. The weights for the input layer are assumed to be 1 for each input. In other words, input values are not changed. The output of the NN is reached by applying input values to the input layer, passing the output of each neuron to the following layer as input. The Backpropagation NN must have at least an input layer and an output layer. It could have zero or more hidden layers.
The number of neurons in the input layer depends on the number of possible inputs while the number of neurons in the output layer depends on the number of desired outputs. In general, the addition of a hidden layer could allow the network to learn more complex patterns, but at the same time decreases its performance.
2.6.3. Supervised learning
The Backpropagation NN works supervised training. The training can be summarized as the following algorithm.
Start by initializing the input weights for all neurons to some random numbers between 0 and 1.
Apply input to the network.
Calculate the output.
Compare the resulting output with the desired output for the given input. This is called the error.
Modify the weights and threshold for all neurons using the error.
Repeat the process until the error reaches an acceptable value (the error, 0.006), which means that the NN was trained successfully, or if we reach a maximum count of iterations, which means that the NN training was not successful.
The challenge is to find a good algorithm for updating the weights and thresholds in each iteration (Step 5) to minimize the error.
Changing weights and threshold for neurons in the output layer is different from hidden layers. For the input layer, weights remain constant at 1 for each input neuron weight.
For the training operation, the followings are defined.
The Learning Rate,: A real number constant, 0.02 for the system.
The change,: For example is the change in x.
18.104.22.168. Output layer training
Let z be the output of an output layer neuron. Let y be the desired output for the same neuron, it should be scaled to a value between 0 and 1. This is the ideal output which we like to get when applying a given set of input. Then the error, e, will be as Equation 15.
represents the change in. is the change in weight i of the neuron. In other words, for each output neuron, calculate its error e, and then modify its threshold and weights using Equation 15, 16 and 17.
22.214.171.124. Hidden layer training
Consider a hidden layer neuron as shown in Figure 6. Let be the output of the hidden layer neuron. Let be the weight at neuron in the layer following the current layer. This is the weight for the input coming from the current hidden layer neuron. Let be the error at neuron. Let be the number of neurons in the layer following the current layer. (In Figure 6,).
is the error at the hidden layer neuron. is the change in is the change in weight i. In calculating, we use the weight and error from the following layer, which means that the error and weights in this following layer should have already been calculated. This implies that during a training iteration of a Backpropagation NN, we start modifying the weights at the output layer, and then we proceed backwards on the hidden layers one by one until we reach the input layer. It is the method of proceeding backwards which gives this network its name Backward Propagation.
2.7. The postprocessing of the system
After the syllables of the word utterance are recognized using the speech recognition method, and the most similar 10 syllables are put in order, the recognized syllables are concatenated and generated the recognized word. We can find the most similar words in order by concatenation of the most similar syllables. From the uppermost recognized words, it can be determined whether or not the word is Turkish. If the word is Turkish, it is the recognized word of the system. If these words are not Turkish, the system does not recognize any word.
For example, as shown in Table 1, the recognized syllables are ordered. Hence, the most similar syllables as “kı”, “tap” and “lik” have been found. These syllables are concatenated, and the most similar word as “kıtaplik” is constructed. But, the system decides that the word is not Turkish word. Then, the next most similar word is concatenated, and it is determined whether or not the word is Turkish. This process is continued until a Turkish word is found in these concatenated words. In this example, the word “kitaplık” which is generated from the syllables “ki”, “tap” and “lık” is detected by the system. Therefore, it is the recognized word using the postprocessing.
|The order of the most similar syllables||Recognized Syllables|
3. Detecting misspelled words using syllable n-gram frequencies
To detect misspelled words in a text is an old problem. Today, most of word processors include some sort of misspelled word detection. Misspelled word detection is worthy in the area of cryptology, data compression, speech synthesis, speech recognition and optical character recognition (Barari & QasemiZadeh, 2005; Deorowicz & Ciura, 2005; Kang & Woo, 2001; Tong & Evans, 1996). The traditional way of detecting misspelled words is to use a word list, usually also containing some grammatical information, and to look up every word in the word list (Kukich, 1992) from dictionary.
The main disadvantage of this approach is that if the dictionary is not large enough, the algorithm will report some of correct words as misspelled, because they are not included in the dictionary. For most natural languages, the size of dictionary needed is too large to fit in the working memory of an ordinary computer. In Turkish this is a big problem, because Turkish is an agglutinative language and too many new words can be constructed by adding suffixes.
To overcome this difficulties, a new approach has been proposed for detecting misspelled words in Turkish text. For that, Turkish syllable n-gram frequencies which are generated from several Turkish corpora have been used. From the corpora, syllable monogram, bigram and trigram frequencies have been extracted using TASA (Aşlıyan & Günel, 2005). These n-gram frequencies have been used for calculating a word probability distribution. After that the system has decided whether a word is misspelled or not. This approach does not need any word list.
3.1. System architecture
The system consists of three main components. First component is preprocessing which cleans a text. Second component is TASA, and third component is calculating probability distribution of words. As shown in Figure 7, the system takes words in Turkish text as input and gives the result for each word as “Misspelled Word” or “Correctly Spelled Word”.
In preprocessing component of the system, punctuation marks are cleaned. All letters in the text are converted to lower case. Blank characters between two successive words are limited with only one blank character.
In second component, TASA takes the Turkish clean text as an input and gives the Turkish syllabified text. The system divides words into syllables putting the dash character between two syllables. For example, the word “kitaplık” in Turkish text is converted into the syllabified word “ki-tap-lık” in Turkish syllabified text.
In third component, the probability distribution is calculated for each syllabified word. The system uses syllable monogram, bigram and trigram frequencies to find this probability distribution.
3.2. Calculation of the probability distribution of words
An n-gram is a sub-sequence of n items from a given sequence. n-grams are used in various areas of statistical natural language processing and genetic sequence analysis. The items in question can be letters, syllables, words according to the application.
An n-gram of size 1 is a “monogram”; size 2 is a “bigram”; size 3 is a “trigram”; and size 4 or more is simply called an “n-gram” or “(n−1)-order Markov model” (Zhuang et al., 2004).
An n-gram model predicts based on. When used for language modelling independence assumptions are made so that each word depends only on the last n words. This Markov model is used as an approximation of the true underlying language. This assumption is important because it massively simplifies the problem of learning the language model from data.
Suppose that a word W in Turkish syllabified text consists of the syllable sequence. This word has t syllables. To obtain the n-gram probability distribution (Jurafsky & Martin, 2000) of the word W, we have used in Equation 22.
So, it is concluded as Equation 24.
In Equation 23 and Equation 24, is the frequency of the syllable sequence. Furthermore, is the frequency of the syllable sequence. The frequencies of these syllable sequences can be calculated from some Turkish corpora.
For example, according to bigram model it can be calculated the probability distribution of a word in Turkish text. Assume that we have a text which includes some words as “... Bu gün okulda, şenlik var...”. This text is converted to syllabified text as “... Bu gün o-kul-da, şen-lik var...”. Syllables in the words are delimited with dash character. Assume that the word W=”okulda” in the text is taken for computing its probability distribution and W can be written as the syllable sequence”o”, “kul”, “da”. Here, “o”, “kul”, “da”. Blank character is accepted as a syllable. This syllable is called as. So, assume that syllable monogram frequencies are C(“”)=0.003, C(“o”)=0.002, C(“kul”)=0.004 and syllable bigram frequencies are C(“”,“o”)=0.0001, C(“o”,“kul”)=0.0002, C(“kul”,“da”)=0.0003. P(“okulda”) has been calculated using bigram model. It can be found that the probability distribution of the word “okulda” is 0.0002475 as shown in Equation 27.
3.3. Testing and evaluation of the system
Two systems have been designed and implemented to detect misspelled words in Turkish text. One uses monogram and bigram frequencies. The size of monogram database is 41 kilobytes and the monogram database consists of 4141 different syllables. The sizes of bigram and trigram databases are 570 and 2858 kilobytes respectively. While the bigram database includes 46684 syllable pairs, the trigram database consists of 183529 ternary syllables. The other uses bigram and trigram frequencies. These two systems have been tested for two Turkish texts. One is correctly spelled text which includes 685 correctly spelled words. The other is misspelled text which has 685 misspelled words. These two texts have same words. Namely, misspelled words are generated with putting errors on the correctly spelled words. These error types are substitution, deletion, insertion, transposition and split word errors. The system takes correctly spelled and misspelled texts as input and gives the results for each word as “correctly spelled word” or “misspelled word”. Probability distributions are calculated for each word. If the probability distribution of a word is equal to zero, system decides that the word is misspelled. If it is greater than zero, system decides that the word is correctly spelled.
Firstly, the system on the correctly spelled text has been tested using monogram and bigram frequencies. The system determines 602 correctly spelled words from the correctly spelled text, so the words are correctly recognized with 88% success rate. Also, 589 misspelled words within the misspelled text are decided successfully by the system. Namely, the system which is tested on the misspelled text correctly recognized the words with 86% success rate.
Then the system on the correctly spelled text has been tested using bigram and trigram frequencies. The system determines 671 of 685 correctly spelled words from the correctly spelled text. The success rate on correctly recognition of the words is 98%. Furthermore, 664 of 685 misspelled words within the misspelled text are decided successfully by the system. Thus, the system which is tested on the misspelled text correctly recognized the words with 97% success rate. The system can analyze 75000 words per second.
4. Experimental results
In this section, the extperimental results of the developed syllable based speech recognition have been given according to DTW and ANN approach, and the success rates of the systems have been compared.
The most widely used evaluation measure for speech recognition performance is Word Error Rate (WER) (Hunt, 1990; McCowan et al., 2005). The general difficulty of measuring the performance lies on the fact that the recognized word sequence can have different length from the reference word sequence. The WER is derived from the Levenshtein distance, working at word level instead of character.
This problem is solved by first aligning the recognized word sequence with the reference sequence using dynamic string alignment. The word error rate can then be computed as in Equation 28.
where is the number of wrongly detected words, and is the number of words in the reference set.
In Figure 8, the WER results are given for dynamic time warping. If we evaluate the system, we can say that the best result for DTW is obtained with the mfcc feature. It is followed by rasta feature. The system accuracy rate is increased with postprocessing operation about 9% using DTW.
In Figure 9, the WER results are given for artificial neural network. If we evaluate the system, it can be said that the best result for ANN is obtained with the mfcc feature. The system accuracy rate is increased with postprocessing operation about 15% using ANN.
According to the results of WER of the system, the most successful feature and speech recognition method are mfcc and DTW (5.8% WER) respectively. The postprocessing improves approximately 14% of the system accuracy.
DTW does not need training operation. It uses the extracted features. The best feature is mfcc because its extraction time duration is the lowest. ANN needs training, and it constructs a model for each syllable in words using training. As shown in Table 2, the average training time for ANN is about 1102.5 seconds.
Table 3 displays average testing time duration. Average testing time duration of ANN (7.4 seconds) is quite shorter than that of DTW.
|Methods||Training Time (Seconds)|
|Methods||Testing Time (Seconds)|
In this study, syllable based isolated word Turkish speech recognition systems using the speech recognition methods as DTW and ANN have been designed and implemented. These speaker dependent systems use the features as mfcc, lpc, parcor, cepstrum, rasta and the mixture of mfcc, lpc and parcor. Syllable models of the words in the dictionary are constructed syllable databases to compare the word utterence. The system firstly recognizes the syllables of the word utterence. Recognized word is found by the concatenation of the recognized syllables.
To use in postprocessing stage of the system, firstly, TASA have been designed and implemented. TASA’s correct spelling rate is about 100%. Then, Turkish syllable n-gram frequencies for some Turkish corpora have been calculated. In addition, using syllable n-gram frequencies, a system which decides whether or not a word is misspelled in Turkish text has been developed. The system takes words as inputs. The system produces two results for each word: “Correctly spelled word” or “Misspelled word”. According to the system designed with bigram and trigram frequencies, the success rate is 97% for the misspelled words, and 98% for the correctly spelled words.
In postprocessing operation, after the recognized word is constructed by concatenating of the recognized syllables, the system decides whether it is Turkish word or not. If the word is Turkish word, then it is new recognized word. This postprocessing increases the accuracy rate of the system approximately 14%.
After testing of the middle scaled speech recognition system, the most successful method is DTW whose word error rate is about 5.8%. It can be said that the best feature for the speech recognition is mel frequency cepstral coefficients.
For future work, Support Vector Machine and Hidden Markov Model can be applied for syllable based speech recognition, and we can compare among the results obtained using these speech recognition methods.
The author would like to thank the Scientific Research Unit (BAP) of Adnan Menderes University.