
Discrete Wavelet Transform & Linear Prediction Coding Based Method for Speech Recognition via Neural Network

By K. Daqrouq, A.R. Al-Qawasmi, K.Y. Al Azzawi and T. Abu Hilal

Submitted: November 11th 2010. Reviewed: July 14th 2011. Published: September 12th 2011.

DOI: 10.5772/20978


1. Introduction

In the proposed work, wavelet transform (WT) and neural network techniques are introduced for speech-based text-independent speaker identification and Arabic vowel recognition. A feature extraction method based on the linear prediction coding coefficients (LPCC) of the discrete wavelet transform (DWT) at level 3 was developed. The feature vectors are fed to a probabilistic neural network (PNN) for classification. Feature extraction and classification are performed by the combined wavelet transform and neural network (DWTPNN) expert system. The results show that the proposed method achieves a powerful analysis, with average identification rates reaching 93%. Two published methods were investigated for comparison, and the best recognition rate was obtained with the framed DWT. The DWT was also studied to improve the system's robustness against noise at 0 dB SNR. The performance of the speaker-independent Arabic vowel classifier was examined through several experiments depending on vowel type; the results show that the proposed method achieves an effective analysis, with identification rates reaching 93%.

In general, a speaker identification system can be implemented by observing the voiced/unvoiced components or by analyzing the energy distribution of utterances. A number of digital signal processing algorithms, such as the LPC technique (Adami & Barone, 2001; Tajima, Port, & Dalby, 1997), Mel frequency cepstral coefficients (MFCCs) (Mashao & Skosan, 2006; Sroka & Braida, 2005; Kanedera, Arai, Hermansky & Pavel, 1999; Daqrouq & Al-Faouri, 2010), the DWT (Fonseca, Guido, Scalassara, Maciel, & Pereira, 2007), and the wavelet packet transform (WPT) (Lung, 2006; Zhang & Jiao, 2004), are extensively utilized. In the early 1990s, the Mel frequency cepstral technique became the most widely used technique for recognition purposes due to its ability to represent the speech spectrum in compact form (Sarikaya & Hansen, 2000). MFCCs simulate the model of human auditory perception and have proven very effective in automatic speech recognition and in modeling the individual frequency components of speech signals. ESI has been under research for about four decades (Reynolds, Quatieri, & Dunn, 2000). From a commercial point of view, ESI is a technology with a potentially large market, with applications ranging from the automation of operator-assisted services to speech-to-text aiding systems for hearing-impaired individuals (Reynolds et al., 2000).

Artificial neural network performance depends mainly on the size and quality of the training samples (Visser, Otsuka, & Lee, 2003). When the number of training samples is small and not representative of the possibility space, standard neural network results are poor (Kosko & Bart, 1992). Incorporating neuro-fuzzy or wavelet techniques can improve performance in this case, particularly by decreasing the dimensionality of the input matrix (Nava & Taylor, 1996). Artificial neural networks (ANN) are known to be excellent classifiers, but their performance can be hindered by the size and quality of the training set. Fuzzy theory has been used successfully in many applications (Gowdy & Tufekci, 2000), which show that it can also be used to improve neural network performance.

In this study, the authors develop an effective feature extraction method for a text-independent system, taking into consideration that the size of the ANN input is a crucial issue that affects the quality of the training set. For this reason, the presented feature extraction method reduces the dimensionality of the features compared with conventional methods: the LPCC of the DWT sub-signals are utilized. For classifying the extracted feature coefficients, a PNN is proposed.

In this chapter, an expert system for speaker identification is proposed for investigating speech signals using pattern identification. The speaker identification performance of this method was demonstrated on a total of 59 speakers (39 male and 20 female). A feature extraction method based on LPCC in conjunction with the DWT at level seven was developed, and a PNN was investigated for the classification stage. Feature extraction and classification are performed by the DWTPNN expert system. The results show that the proposed method achieves an effective analysis: the average identification rate was 94.89%, better than previously published methods. It was found that the recognition rate improves as the number of feature sets increases (through higher DWT levels); nevertheless, this improvement implies a tradeoff between recognition rate and extraction time. The proposed method offers a significant computational advantage by reducing the dimensionality of the WT coefficients by means of LPCC. Applying feature extraction to the DWT approximation sub-signals at several levels, instead of the original signal, performed well against real noise, particularly at levels 3 and 4.

2. Discrete Wavelet Transform

The DWT represents an arbitrary square-integrable function as a superposition of a family of basis functions called wavelet functions. A family of wavelet basis functions can be produced by translating and dilating the mother wavelet of the family (Mallat, 1989). The DWT coefficients are generated by taking the inner product between the input signal and the wavelet functions. Since the basis functions (wavelet functions) are translated and dilated versions of each other, a simpler algorithm, known as Mallat's pyramid tree algorithm, has been proposed in (Mallat, 1989).

The DWT can be treated as the multiresolution decomposition of a sequence. It takes a length-N sequence a(n) as input and produces a length-N sequence as output. The output has N/2 values at the highest resolution (level 1), N/4 values at the next resolution (level 2), and so on. Let N = 2^m and let the number of frequencies, or resolutions, be m; that is, we consider m = log2(N) octaves [18]. The frequency index k then varies as 1, 2, ..., m, corresponding to the scales 2^1, 2^2, ..., 2^m.

As described by the Mallat pyramid algorithm (Fig. 1), the DWT coefficients of one stage are computed from those of the previous stage as follows (Souani et al., 2000):

$$W_L(n,k) = \sum_i W_L(i,\,k-1)\, h(i - 2n) \qquad (1)$$

$$W_H(n,k) = \sum_i W_L(i,\,k-1)\, g(i - 2n) \qquad (2)$$

where $W_L(p,q)$ is the $p$-th scaling coefficient at the $q$-th stage, $W_H(p,q)$ is the $p$-th wavelet coefficient at the $q$-th stage, and $h(n)$, $g(n)$ are the dilation coefficients associated with the scaling and wavelet functions, respectively.

To compute the DWT coefficients of discrete-time data, it is assumed that the input data represent the DWT coefficients of a high-resolution stage. Equations (1) and (2) may then be used to obtain the DWT coefficients of subsequent stages. In practice, this decomposition is carried out for only a few stages. Note that the dilation coefficients h(n) represent a low-pass filter, while g(n) represents a high-pass filter; the DWT thus extracts information from the signal at different scales. The first level of wavelet decomposition extracts the details of the signal (high-frequency components), while the second and all subsequent decompositions extract progressively coarser information (lower-frequency components). Each step of re-transforming the low-pass output is called a dilation. A schematic of a three-stage DWT decomposition is shown in Fig. 1, where H denotes the high-pass filter and L the low-pass filter. At the output of each filter, the result is downsampled (decimated) by keeping every other coefficient (Souani et al., 2000).
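To make the pyramid recursion concrete, the following Python sketch implements Eqs. (1) and (2) literally as a nested sum. The Haar filter pair in the usage example is an illustrative assumption, not a filter specified by the chapter.

```python
import numpy as np

def dwt_stage(w_prev, h, g):
    """One Mallat pyramid stage, Eqs. (1) and (2).
    w_prev : scaling coefficients W_L(., k-1) of the previous stage
    h, g   : low-pass (scaling) and high-pass (wavelet) analysis filters
    Returns (W_L(., k), W_H(., k)), each half the input length."""
    n_out = len(w_prev) // 2
    wl = np.zeros(n_out)
    wh = np.zeros(n_out)
    for n in range(n_out):
        for i in range(len(w_prev)):
            j = i - 2 * n                  # filter index; the 2n shift is the decimation
            if 0 <= j < len(h):
                wl[n] += w_prev[i] * h[j]  # Eq. (1)
                wh[n] += w_prev[i] * g[j]  # Eq. (2)
    return wl, wh

# Usage with the Haar pair (an illustrative choice):
h = np.array([1.0, 1.0]) / np.sqrt(2)    # low-pass filter
g = np.array([1.0, -1.0]) / np.sqrt(2)   # high-pass filter
a = np.arange(8, dtype=float)
approx, detail = dwt_stage(a, h, g)      # level 1; feed `approx` back in for level 2
```

Iterating `dwt_stage` on the approximation output reproduces the three-stage tree of Fig. 1.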

In order to reconstruct the original data, the DWT coefficients are upsampled (a zero is inserted between consecutive samples) and passed through another set of low- and high-pass filters, expressed as

$$W_L(n,k) = \sum_p W_L(p,\,k+1)\, h(n - 2p) + \sum_l W_H(l,\,k+1)\, g(n - 2l) \qquad (3)$$

where $h(n)$ and $g(n)$ are the low- and high-pass synthesis filters, respectively. It follows from Eq. (3) that the $k$-th level DWT coefficients may be obtained from the $(k+1)$-th level DWT coefficients. Compactly supported wavelets are generally used in applications.
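The perfect-reconstruction property implied by Eq. (3) is easy to check numerically. A minimal sketch using the PyWavelets package (an assumed dependency, not one named by the chapter):

```python
import numpy as np
import pywt

x = np.random.default_rng(0).standard_normal(1024)

# Analysis stage, Eqs. (1)-(2): low-/high-pass filtering plus decimation.
cA, cD = pywt.dwt(x, 'db2')

# Synthesis stage, Eq. (3): upsampling plus synthesis filtering.
x_rec = pywt.idwt(cA, cD, 'db2')

print(np.allclose(x, x_rec[:len(x)]))   # True: the input is recovered exactly
```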

In the last decade, there has been a huge increase in the application of wavelets in various scientific disciplines. Typical applications include signal processing, image processing, security systems, numerical analysis, statistics, biomedicine, etc. The wavelet transform offers a wide variety of useful features, in contrast to other transforms such as the Fourier transform or the cosine transform. Some of these are as follows:

  • Adaptive time-frequency windows;

  • Lower aliasing distortion for signal processing applications;

  • Computational complexity of O(N), where N is the length of the data;

  • Inherent scalability;

  • Efficient Very Large Scale Integration (VLSI) implementation.

Figure 1.

a. DWT-tree by Mallat's Algorithm; b. IDWT by Mallat's Algorithm

3. The use of DWT for feature extraction

Before the feature extraction stage, the speech data are processed by a silence-removal algorithm, followed by normalization of the speech signals so that they are comparable regardless of differences in magnitude. In this study, three feature extraction methods based on the discrete wavelet transform are discussed in the remainder of this section.
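The chapter does not specify its silence-removal algorithm; the following is a minimal sketch of one common variant, an energy-threshold approach, with peak normalization. The function name, frame length, and threshold are illustrative assumptions.

```python
import numpy as np

def preprocess(speech, frame_len=256, energy_thresh=1e-3):
    """Peak-normalize, then drop low-energy (silent) frames.
    `frame_len` and `energy_thresh` are illustrative values, not the chapter's."""
    x = speech / (np.max(np.abs(speech)) + 1e-12)        # normalization step
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    voiced = frames[np.mean(frames ** 2, axis=1) > energy_thresh]
    return voiced.reshape(-1) if voiced.size else x      # silence-removed signal
```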

3.1. DWT method with LPC

For an orthogonal wavelet function, a library of DWT bases can be generated. Each of these bases offers a particular way of coding signals, preserving global energy and reconstructing exact features. The DWT is used here to extract additional features that guarantee a higher recognition rate. In this study, the DWT is applied at the feature extraction stage, but the raw coefficients are not suitable for the classifier because of their great length. Thus, a better representation of the speaker features must be sought. Previous studies have suggested that using the LPC of the DWT as features in recognition tasks is effective: (Adami & Barone, 2001; Tajima, Port, & Dalby, 1997) proposed computing LPC orders of the wavelet transform for speaker recognition.

In this method, the LPC coefficients are obtained from the DWT sub-signals. The DWT is computed at level three, and 30 LPC orders are then obtained for each sub-signal and combined into one feature vector. The main advantage of this feature method is that it captures the LPC characteristics at the multiple resolutions provided by the DWT, so the LPC order sequence carries distinguishable information alongside the wavelet transform. Fig. 2 shows the LPC orders calculated for the DWT at depth 3 for three different utterances by the same person; the feature vector extracted by DWT and LPC is evidently appropriate for speaker recognition. A sketch of this feature extraction follows Fig. 2.

Figure 2.

LPC orders calculated for DWT at depth 3 for three different utterances for the same person
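A minimal sketch of the DWTLPC feature extraction described above, assuming a db2 mother wavelet (the chapter specifies db2 only for the entropy method) and the autocorrelation (Levinson-Durbin) formulation of LPC:

```python
import numpy as np
import pywt

def levinson(r, order):
    """Levinson-Durbin recursion: prediction polynomial a (a[0] = 1)
    from the autocorrelation sequence r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]                              # assumes a non-degenerate (nonzero) signal
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a[1:i] += k * a[i - 1:0:-1]       # update inner coefficients
        a[i] = k                          # reflection coefficient
        e *= (1.0 - k * k)                # prediction error update
    return a

def lpc_coeffs(x, order=30):
    """LPC orders of one sub-signal (autocorrelation method).
    The sub-signal must be longer than `order` samples."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    return levinson(r, order)[1:]         # drop the leading 1

def dwt_lpc_features(speech, wavelet='db2', level=3, order=30):
    """DWTLPC: 30 LPC orders per DWT sub-signal at level 3,
    concatenated into one feature vector (4 sub-signals -> 120 values)."""
    subs = pywt.wavedec(speech, wavelet, level=level)   # [A3, D3, D2, D1]
    return np.concatenate([lpc_coeffs(s, order) for s in subs])
```

The resulting 120-value vector is what the classifier of Section 4 would consume, a far smaller input than the raw DWT coefficients.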

3.2. DWT method with entropy

Turkoglu et al. (2003) suggested a method to calculate the entropy value of the wavelet norm for digital modulation recognition. [16] proposed a feature extraction method for speaker recognition based on a combination of three entropy types (sure, logarithmic energy, and norm). Lastly, (Daqrouq, 2011) investigated a speaker identification system using adaptive wavelet sure entropy.

As seen in the above studies, the entropy of a specific sub-band signal may be employed as a feature for recognition tasks. This is possible because each Arabic vowel has a distinct energy (see Fig. 3). In this work, entropies obtained from the DWT are employed for speaker recognition. The feature extraction method can be explained as follows:

  • Decomposing the speech signal by wavelet packet transform at level 7, with Daubechies type (db2).

  • Calculating three entropy types for all 256 nodes at depth 7 of the wavelet packet tree, using the following equations:

Shannon entropy:

$$E_1(s) = -\sum_i s_i^2 \log\left(s_i^2\right) \qquad (4)$$

Log energy entropy:

$$E_2(s) = \sum_i \log\left(s_i^2\right) \qquad (5)$$

Sure entropy:

$$E_3(s) = \sum_i \min\left(s_i^2,\, p^2\right) \qquad (6)$$

where $s$ is the signal, $s_i$ are the DWT coefficients, and $p$ is a positive threshold. Entropy is a common concept in many fields, particularly signal processing. Classical entropy-based criteria describe information-related properties of a precise representation of a given signal. Entropy is commonly used in image processing, where it quantifies the concentration of the image; more generally, measuring entropy is a powerful tool for quantifying the ordering of non-stationary signals. Fig. 3 shows the three entropies calculated for the DWT at depth 3 for three different utterances by the same person. The feature vector extracted by DWT and entropy is appropriate for speaker recognition. This conclusion follows from the criterion that an extracted feature vector should: 1) vary widely from class to class; 2) be stable over a long period of time; and 3) not be correlated with other features (see Fig. 3). A sketch of the entropy computation follows Fig. 3.

Figure 3.

Entropy calculated for DWT at depth 3 for three different utterances for the same person
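A minimal sketch of Eqs. (4)-(6) applied to the terminal nodes of a wavelet packet tree, using PyWavelets. The threshold value `p = 1.0` and the helper names are illustrative assumptions; the db2 wavelet and level 7 follow the text.

```python
import numpy as np
import pywt

def shannon_entropy(s):
    s2 = s[s != 0] ** 2                        # skip zeros to avoid log(0)
    return -np.sum(s2 * np.log(s2))            # Eq. (4)

def log_energy_entropy(s):
    s2 = s[s != 0] ** 2
    return np.sum(np.log(s2))                  # Eq. (5)

def sure_entropy(s, p):
    return np.sum(np.minimum(s ** 2, p ** 2))  # Eq. (6)

def wp_entropy_features(speech, wavelet='db2', level=7, p=1.0):
    """Three entropies for every terminal node of a level-7 WP tree.
    The signal must be long enough to decompose to the requested level."""
    wp = pywt.WaveletPacket(data=speech, wavelet=wavelet, maxlevel=level)
    feats = []
    for node in wp.get_level(level, order='natural'):   # 2**level terminal nodes
        s = node.data
        feats += [shannon_entropy(s), log_energy_entropy(s), sure_entropy(s, p)]
    return np.array(feats)
```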

4. Proposed probabilistic neural networks algorithm

We create a probabilistic neural network algorithm for classification problem (see Fig.4 and Fig.5):

$$\mathrm{Net} = \mathrm{PNN}(P,\, T,\, \mathrm{SPREAD}) \qquad (7)$$

where $P$ is a $\left(4 \cdot 2^{q+1}\right) \times 24$ matrix of the 24 input vowel feature vectors for net training, with $2^{q+1}$ WP nodes (minus 2, for the repeated original node);

$$P = \begin{bmatrix} WR_1^1 & WR_1^2 & \cdots & WR_1^{24} \\ WR_2^1 & WR_2^2 & \cdots & WR_2^{24} \\ \vdots & \vdots & \ddots & \vdots \\ WR_{4\cdot 2^{q+1}}^1 & WR_{4\cdot 2^{q+1}}^2 & \cdots & WR_{4\cdot 2^{q+1}}^{24} \end{bmatrix} \qquad (8)$$

$T$ is the target class vector

$$T = [1,\, 2,\, 3,\, \ldots,\, 24] \qquad (9)$$

and SPREAD is the spread of the radial basis functions. We employ a SPREAD value of 1 because that is a typical distance between the input vectors. If SPREAD is near zero, the network acts as a nearest-neighbor classifier; as SPREAD becomes larger, the designed network takes several nearby design vectors into account. A minimal sketch of such a classifier is given below.
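The following numpy sketch shows the structure of the PNN in Eq. (7): a Gaussian (radial basis) pattern layer, a per-class summation layer, and a competitive output layer. It mirrors the MATLAB newpnn-style interface suggested by Eq. (7) but is an illustrative reimplementation, not toolbox code.

```python
import numpy as np

class PNN:
    """Probabilistic neural network: a Parzen-window classifier, cf. Eq. (7)."""
    def __init__(self, spread=1.0):
        self.spread = spread                       # SPREAD in Eq. (7)

    def train(self, P, T):
        self.P = np.asarray(P, dtype=float)        # features x patterns, as in Eq. (8)
        self.T = np.asarray(T)                     # class label per column, as in Eq. (9)

    def classify(self, x):
        d2 = np.sum((self.P - np.asarray(x)[:, None]) ** 2, axis=0)
        act = np.exp(-d2 / (2.0 * self.spread ** 2))          # radial basis pattern layer
        classes = np.unique(self.T)
        scores = [act[self.T == c].mean() for c in classes]   # summation layer
        return classes[int(np.argmax(scores))]                # competitive (output) layer

# Usage: net = PNN(spread=1.0); net.train(P, T); label = net.classify(feature_vector)
```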

Figure 4.

Structure of the original probabilistic neural network

Figure 5.

Flow chart for proposed expert system

5. Results and discussion

5.1. Speaker identification by DWTLPC

A testing database of Arabic utterances was produced. The recording environment was a normal office environment, using a PC sound card with a signal bandwidth of 4 kHz and a sampling frequency of 16 kHz.

These utterances are Arabic spoken words. A total of 47 speakers (19 to 40 years old), 31 male and 16 female, spoke these words for the training and testing phases. The total number of tokens considered for training and testing was 653.

Experiments were performed using all 653 Arabic utterances of the 47 speakers (31 male and 16 female). For each speaker, up to 15 speech signals were used: 6 for training and from 4 to 9 (depending on the recordings available for each speaker) for testing the expert system (Fig. 6). In this experiment, 93.26% correct classification was obtained by DWTLPC among the 47 speaker classes. The testing results are tabulated in Tab. 1 and clearly indicate the usefulness and trustworthiness of the proposed feature extraction approach for the speaker identification system.

| Speaker | Number of Signals | Recognized Signals | Recognition Rate [%] |
|---------|-------------------|--------------------|----------------------|
| Sp.1    | 9  | 9  | 100   |
| Sp.2    | 9  | 8  | 88.88 |
| Sp.3    | 9  | 9  | 100   |
| Sp.4    | 9  | 8  | 88.88 |
| Sp.5    | 9  | 9  | 100   |
| Sp.6    | 9  | 6  | 66.66 |
| Sp.7    | 9  | 9  | 100   |
| Sp.8    | 9  | 9  | 100   |
| Sp.9    | 9  | 9  | 100   |
| Sp.10   | 9  | 9  | 100   |
| Sp.11   | 9  | 8  | 88.88 |
| Sp.12   | 9  | 6  | 66.66 |
| Sp.13   | 9  | 9  | 100   |
| Sp.14   | 9  | 9  | 100   |
| Sp.15   | 9  | 9  | 100   |
| Sp.16   | 9  | 9  | 100   |
| Sp.17   | 8  | 7  | 87.5  |
| Sp.18   | 8  | 8  | 100   |
| Sp.19   | 8  | 7  | 87.5  |
| Sp.20   | 4  | 4  | 100   |
| Sp.21   | 4  | 4  | 100   |
| Sp.22   | 4  | 4  | 100   |
| Sp.23   | 4  | 4  | 100   |
| Sp.24   | 4  | 4  | 100   |
| Sp.25   | 8  | 8  | 100   |
| Sp.26   | 8  | 8  | 100   |
| Sp.27   | 8  | 8  | 100   |
| Sp.28   | 8  | 5  | 62.5  |
| Sp.29   | 8  | 7  | 87.5  |
| Sp.30   | 8  | 8  | 100   |
| Sp.31   | 8  | 8  | 100   |
| Sp.32   | 8  | 8  | 100   |
| Sp.33   | 8  | 8  | 100   |
| Sp.34   | 8  | 7  | 87.5  |
| Sp.35   | 8  | 7  | 87.5  |
| Sp.36   | 8  | 8  | 100   |
| Sp.37   | 8  | 8  | 100   |
| Sp.38   | 8  | 8  | 100   |
| Sp.39   | 8  | 8  | 100   |
| Sp.40   | 8  | 8  | 100   |
| Sp.41   | 8  | 7  | 87.5  |
| Sp.42   | 8  | 7  | 87.5  |
| Sp.43   | 8  | 8  | 100   |
| Sp.44   | 8  | 7  | 87.5  |
| Sp.45   | 7  | 7  | 100   |
| Sp.46   | 8  | 6  | 75    |
| Sp.47   | 8  | 5  | 62.5  |
| Total   | 371 | 346 | 93.26 |

Table 1.

DWTLPC Identification Rate results

Table 2 shows the experimental results of the different approaches used in the experimental investigation for comparison: the modified DWT with the proposed feature extraction method (MDWTLPC); framed DWTLPC (FDWTLPC), illustrated in Fig. 7, in which the LPC orders are obtained from six frames of each DWT sub-signal; and the proposed DWTLPC method. The recognition rate of MDWTLPC was the lowest, while the best recognition rate, 93.53%, was obtained with FDWTLPC; a sketch of the framed variant follows Table 2.

| Identification Method | Identification System | Number of Signals | Identification Rate [%] |
|-----------------------|------------------------|-------------------|-------------------------|
| DWTLPC   | Text-independent | 653 | 93.26 |
| MDWTLPC  | Text-independent | 653 | 92.66 |
| FDWTLPC  | Text-independent | 653 | 93.53 |

Table 2.

Comparison of different classification approaches
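For comparison, a minimal sketch of the framed variant FDWTLPC, in which each DWT sub-signal is split into six frames and the LPC orders are computed per frame. It reuses `lpc_coeffs` from the Section 3.1 sketch; the equal, non-overlapping framing is an assumption, since the chapter does not detail the frame boundaries.

```python
import numpy as np
import pywt

def fdwt_lpc_features(speech, wavelet='db2', level=3, n_frames=6, order=30):
    """FDWTLPC: LPC orders from six frames of each DWT sub-signal."""
    subs = pywt.wavedec(speech, wavelet, level=level)
    feats = []
    for s in subs:
        flen = len(s) // n_frames                    # equal, non-overlapping frames
        for k in range(n_frames):
            frame = s[k * flen:(k + 1) * flen]
            feats.append(lpc_coeffs(frame, order))   # helper from the Section 3.1 sketch
    return np.concatenate(feats)                     # 4 x 6 x 30 = 720 values
```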

Figure 6.

Proposed system performance using DWT approximation sub-signals (at levels 1 to 4).

To improve the robustness of DWTLPC to additive white Gaussian noise (AWGN), the same feature extraction process was applied to the DWT approximation sub-signals at several levels instead of the original signal (Daqrouq, 2011). The features were then extracted from each of the obtained wavelet decomposition sub-signals (see Fig. 6). After performing the proposed classification mechanism for each sub-signal at each DWT level, we observe that the highest recognition rates were achieved at levels 3 and 4 (see Tab. 3). It was also found that the recognition rates did not improve when the DWT level was increased beyond four. A sketch of this noisy-signal pipeline is given below.
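A minimal sketch of the 0 dB AWGN experiment: noise is added at a target SNR and the DWTLPC features are computed from the approximation sub-signal rather than the raw signal. It reuses `dwt_lpc_features` from the Section 3.1 sketch; the function names are illustrative.

```python
import numpy as np
import pywt

def add_awgn(x, snr_db=0.0, rng=None):
    """Add white Gaussian noise at a target SNR (here 0 dB)."""
    rng = rng or np.random.default_rng(0)
    noise_power = np.mean(x ** 2) / 10 ** (snr_db / 10)
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

def robust_features(speech, approx_level=3, wavelet='db2'):
    """DWTLPC features from the level-3 approximation sub-signal
    instead of the raw (noisy) signal, as in Tab. 3."""
    approx = pywt.wavedec(speech, wavelet, level=approx_level)[0]
    return dwt_lpc_features(approx)       # reuses the Section 3.1 sketch

# Usage: noisy = add_awgn(clean, snr_db=0.0); feats = robust_features(noisy)
```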

Figure 7.

Feature extraction vectors for three signals of the same speaker obtained by a. FDWTLPC; b. DWTLPC

Recognition rate [%] by DWT approximation level:

| Speaker | Number of Signals | Level 1 | Level 2 | Level 3 | Level 4 |
|---------|-------------------|---------|---------|---------|---------|
| Sp.1    | 24 | 0  | 0  | 25 | 40  |
| Sp.2    | 24 | 0  | 22 | 87 | 10  |
| Sp.3    | 24 | 0  | 25 | 25 | 25  |
| Sp.4    | 24 | 0  | 0  | 50 | 100 |
| Sp.5    | 24 | 10 | 55 | 60 | 40  |
| Sp.6    | 24 | 25 | 12 | 25 | 100 |

Table 3.

DWTLPC identification rate results through DWT with SNR = 0 dB

5.2. Arabic vowel classification by using DWTLPC

In recent times, Arabic has become one of the most significant and broadly spoken languages in the world, with an estimated 350 million speakers distributed all over the world and mostly covering 22 Arab countries. Arabic is a Semitic language characterized by the existence of particular consonants, such as pharyngeal, glottal, and emphatic consonants. Furthermore, it presents some phonetic and morpho-syntactic particularities; the morpho-syntactic structure is built around pattern roots (CVCVCV, CVCCVC, etc.) (Zitouni and Sarikaya, 2009).

The Arabic alphabet consists of 28 letters, which can be expanded to a set of 90 by additional shapes, marks, and vowels. The 28 letters represent the consonants and long vowels, such as ى and ٱ (both pronounced /a:/), ي (pronounced /i:/), and و (pronounced /u:/). The short vowels and certain other phonetic information, such as consonant doubling (shadda), are not represented by letters directly but by diacritics: short strokes placed above or below the consonant. The Arabic diacritics can be split into three sets: short vowels, doubled case endings, and syllabification marks. Short vowels are written as symbols either above or below the letter in diacritized text and are dropped altogether in undiacritized text. There are three short vowels: fatha, which represents the /a/ sound and is an oblique dash over a letter; damma, which represents the /u/ sound and has the shape of a comma over a letter; and kasra, which represents the /i/ sound and is an oblique dash under a letter.

In this work, speech signals were obtained via a PC sound card with a sampling frequency of 16000 Hz. The Arabic vowels were recorded by 27 speakers: 5 females and 22 males. The recording was conducted under normal university office conditions. Our study of the speaker-independent Arabic vowel classifier performance proceeds through several experiments depending on vowel type; in the following experiments, the feature extraction method is DWTLPC.

Experiment 1

We tested 200 long Arabic vowel ٱ (pronounced /a:/) signals, 400 long Arabic vowel ي (pronounced /e:/) signals, and 90 long Arabic vowel و (pronounced /u:/) signals. The results indicate that 96% of the ٱ signals, 90% of the ي signals, and 94% of the و signals were classified correctly. Tab. 4 shows the recognition rate results.

| Long Vowels | Number of Signals | Accepted Signals | Not Recognized Signals | Recognition Rate [%] |
|-------------|-------------------|------------------|------------------------|----------------------|
| Long A (أ)  | 200 | 192 | 8  | 96 |
| Long E (ي)  | 400 | 360 | 40 | 90 |
| Long O (و)  | 90  | 85  | 5  | 94 |
| Avr. Recognition Rate | | | | 93.33 |

Table 4.

The recognition rate results for long vowels

Experiment 2

In this experiment, we study the recognition rates for long vowels connected with other consonants, such as ل (pronounced /l/) and ر (pronounced /r/). Tab. 5 reports the recognition rates; the results indicate an 88.5% average recognition rate.

| Long Vowels | Number of Signals | Recognized Signals | Not Recognized Signals | Recognition Rate [%] |
|-------------|-------------------|--------------------|------------------------|----------------------|
| La (لا) | 60 | 57 | 3  | 95  |
| Le (لي) | 60 | 60 | 0  | 100 |
| Lo (لو) | 60 | 42 | 18 | 70  |
| Ra (را) | 60 | 54 | 6  | 90  |
| Re (ري) | 60 | 57 | 3  | 95  |
| Ro (رو) | 60 | 49 | 11 | 81  |
| Avr. Recognition Rate | | | | 88.5 |

Table 5.

The recognition rate results for long vowels connected with other letters

A probabilistic neural network based speech recognition system has been presented in this work, built on a wavelet feature extraction method. An effective feature extraction method for Arabic vowels was developed, taking into consideration that computational complexity is a crucial issue. The experimental results on a subset of the recorded database show that the proposed feature extraction method is suitable for an Arabic recognition system. Our study of the speaker-independent Arabic vowel classifier performance was carried out through two experiments depending on vowel type; the results show that the proposed method achieves an effective analysis, with identification rates reaching 93%.

The proposed future work of this study is to improve the capability of the proposed system to operate in real time. This may be achieved by modifying the recording apparatus, adding a data acquisition system (such as the NI-6024E), and interfacing online with Matlab code that simulates the expert system.

6. Conclusion

In this work, an expert system for speaker identification was investigated for analyzing speech signals using pattern identification. The speaker identification performance of this method was demonstrated on a total of 47 speakers (31 male and 16 female). A feature extraction method based on LPC in conjunction with the framed DWT at level three was developed, and a PNN was proposed for the classification stage. The stated results show that the proposed method achieves a powerful analysis; the performance of the intelligent system is given in Table 1 and Table 2, with an average identification rate of 93.26%, better than other methods. Our investigation of speaker-independent Arabic vowel classification was performed through several experiments depending on vowel type; the results show that the proposed method achieves an effective analysis, with identification rates reaching 93%.
