DWTLPC Identification Rate results
In the proposed work, wavelet transform (WT) and neural network techniques were introduced for speech-based, text-independent speaker identification and Arabic vowel recognition. A feature extraction method based on the linear prediction coding coefficients (LPCC) of the discrete wavelet transform (DWT) at level 3 was developed. The feature vectors are fed to a probabilistic neural network (PNN) for classification. Feature extraction and classification are performed by the wavelet transform and neural network (DWTPNN) expert system. The reported results show that the proposed method provides a powerful analysis, with average identification rates reaching 93%. Two published methods were investigated for comparison. The best recognition rate was obtained for framed DWT. The discrete wavelet transform was also studied as a way to improve the robustness of the system against noise at 0 dB. The performance of the speaker-independent Arabic vowel classifier was investigated in several experiments, depending on vowel type, where identification rates may also reach 93%.
In general, a speaker identification system can be implemented by observing the voiced/unvoiced components or by analyzing the energy distribution of utterances. A number of digital signal processing algorithms are used extensively, such as the LPC technique (Adami & Barone, 2001; Tajima, Port, & Dalby, 1997), Mel frequency cepstral coefficients (MFCCs) (Mashao & Skosan, 2006; Sroka & Braida, 2005; Kanedera, Arai, Hermansky & Pavel, 1999; Daqrouq & Al-Faouri, 2010), DWT (Fonseca, Guido, Scalassara, Maciel, & Pereira, 2007), and the wavelet packet transform (WPT) (Lung, 2006; Zhang & Jiao, 2004). At the beginning of the 1990s, the Mel frequency cepstral technique became the most widely used technique for recognition purposes because of its ability to represent the speech spectrum in a compact form (Sarikaya & Hansen, 2000). MFCCs model human auditory perception and have proven very effective in automatic speech recognition systems and in modeling the individual frequency components of speech signals. ESI has been studied by a large number of researchers for about four decades (Reynolds, Quatieri, & Dunn, 2000). From a commercial point of view, ESI is a technology with a potentially large market, because its applications range from the automation of operator-assisted services to speech-to-text aids for hearing-impaired individuals (Reynolds et al., 2000).
Artificial neural network (ANN) performance depends mainly on the size and quality of the training samples (Visser, Otsuka, & Lee, 2003). When the training data are few and not representative of the possibility space, standard neural network results are poor (Kosko & Bart, 1992). Incorporating neural fuzzy or wavelet techniques can improve performance in this case, particularly by reducing the dimensionality of the input matrix (Nava & Taylor, 1996). ANNs are known to be excellent classifiers, but their performance can be limited by the size and quality of the training set. Fuzzy theory has been used successfully in many applications (Gowdy & Tufekci, 2000). These applications show that fuzzy theory can be used to improve neural network performance.
In this study, the authors develop an effective feature extraction method for a text-independent system, taking into consideration that the size of the ANN input is a crucial issue, since it affects the quality of the training set. For this reason, the presented feature extraction method reduces the dimensionality of the features compared with conventional methods. LPCC is used in conjunction with DWT. A PNN is proposed for classifying the extracted feature coefficients.
In this paper, an expert system for speaker identification was proposed for investigating speech signals using pattern identification. The speaker identification performance of this method was demonstrated on a total of 59 individual speakers (39 male and 20 female). A feature extraction method using LPCC in conjunction with DWT at level seven was developed, and a PNN was investigated for the classification process. Feature extraction and classification are performed by the DWPN expert system. The reported results show that the proposed method provides an effective analysis. The average identification rate was 94.89%, better than other previously published methods. The recognition rates improved as the number of feature sets increased (through higher DWT levels). Nevertheless, this improvement implies a tradeoff between recognition rate and extraction time. The proposed method can offer a significant computational advantage by using LPCC to reduce the dimensionality of the WT coefficients. Applying the method to the DWT approximation sub-signal at several levels, instead of to the original (imposter) signal, performed well in the presence of real noise, particularly at levels 3 and 4.
2. Discrete Wavelet Transform
The DWT represents an arbitrary square-integrable function as a superposition of a family of basis functions called wavelet functions. A family of wavelet basis functions can be produced by translating and dilating the mother wavelet of the family (Mallat, 1989). The DWT coefficients can be generated by taking the inner product between the input signal and the wavelet functions. Since the basis functions (wavelet functions) are translated and dilated versions of each other, a simpler algorithm, known as Mallat's pyramid tree algorithm, was proposed by Mallat (1989).
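The DWT can be treated as a multiresolution decomposition of a sequence. It takes a length-$N$ sequence as input and produces a length-$N$ sequence as output. The output has $N/2$ values at the highest resolution (level 1), $N/4$ values at the next resolution (level 2), and so on. Let $N = 2^m$, and let the number of frequencies, or resolutions, be $m$, so that $m = \log_2 N$ octaves are considered. The frequency index $k$ then varies as $1, 2, \ldots, m$, corresponding to the scales $2^1, 2^2, \ldots, 2^m$. In Mallat's pyramid algorithm, the DWT coefficients of any stage can be computed from the DWT coefficients of the previous stage:
$$W_L(n, j) = \sum_{i} W_L(i, j-1)\, h_0(i - 2n) \quad (1a)$$
$$W_H(n, j) = \sum_{i} W_L(i, j-1)\, h_1(i - 2n) \quad (1b)$$
where $W_L(p, q)$ is the $p$th scaling coefficient at the $q$th stage, $W_H(p, q)$ is the $p$th wavelet coefficient at the $q$th stage, and $h_0(n)$ and $h_1(n)$ are the dilation coefficients relating to the scaling and wavelet functions, respectively.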
To compute the DWT coefficients of discrete-time data (a signal), it is assumed that the input data represent the DWT coefficients of a high-resolution stage. Equations (1a) and (1b) may then be used to obtain the DWT coefficients of subsequent stages. In practice, this decomposition is carried out for only a few stages. The dilation coefficients of the scaling function act as a low-pass filter, whereas those of the wavelet function act as a high-pass filter. In this way, the DWT extracts information from the signal at different scales. The first level of wavelet decomposition extracts the details of the signal (high-frequency parts), while the second and all subsequent decompositions extract progressively coarser information (lower-frequency parts). Each step of retransforming the low-pass output is called dilation. A schematic of a three-stage DWT decomposition is shown in Fig. 1, where H denotes the high-pass filter and L denotes the low-pass filter. At the output of each filter, the result is downsampled (decimated) by keeping one coefficient and discarding the next (Souani et al., 2000).
To reconstruct the original data, the DWT coefficients are upsampled (a zero is inserted between every two samples) and passed through another set of low- and high-pass filters, which is expressed as
$$W_L(n, j) = \sum_{k} W_L(k, j+1)\, g_0(n - 2k) + \sum_{l} W_H(l, j+1)\, g_1(n - 2l) \quad (2)$$
where $W_L$ and $W_H$ denote the scaling and wavelet coefficients, and $g_0(n)$ and $g_1(n)$ are the low- and high-pass synthesis filters, respectively. It is observed from Eq. (2) that the $j$th-level DWT coefficients may be obtained from the $(j+1)$th-level DWT coefficients. Compactly supported wavelets are generally used in various applications.
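The analysis and synthesis steps described in this section can be illustrated with a minimal sketch. It assumes the Haar wavelet (the text does not name the wavelet at this point) and an input whose length stays even at every stage; the function names `analysis_step`, `synthesis_step`, `dwt`, and `idwt` are hypothetical and chosen only for this illustration.

```python
import math

# Haar filters: an assumption made for this sketch only.
S = 1 / math.sqrt(2)
H0 = [S, S]    # low-pass (scaling) filter
H1 = [S, -S]   # high-pass (wavelet) filter


def analysis_step(c):
    """One pyramid stage: filter, then keep one output per pair of inputs."""
    half = len(c) // 2
    approx = [H0[0] * c[2 * n] + H0[1] * c[2 * n + 1] for n in range(half)]
    detail = [H1[0] * c[2 * n] + H1[1] * c[2 * n + 1] for n in range(half)]
    return approx, detail


def synthesis_step(approx, detail):
    """Inverse stage: upsample both branches and sum the filtered results."""
    out = []
    for a, d in zip(approx, detail):
        out.append(H0[0] * a + H1[0] * d)
        out.append(H0[1] * a + H1[1] * d)
    return out


def dwt(signal, levels):
    """Return [A_levels, D_levels, ..., D_1] (coarsest sub-signal first)."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = analysis_step(approx)
        details.append(detail)
    return [approx] + details[::-1]


def idwt(coeffs):
    """Rebuild the signal by applying the synthesis stage level by level."""
    approx = coeffs[0]
    for detail in coeffs[1:]:
        approx = synthesis_step(approx, detail)
    return approx


signal = [float(i % 5) for i in range(16)]
coeffs = dwt(signal, 3)
print([len(c) for c in coeffs])   # sub-signal lengths N/8, N/8, N/4, N/2
rebuilt = idwt(coeffs)
print(max(abs(a - b) for a, b in zip(signal, rebuilt)))  # close to zero
```

For a 16-sample input and three stages, the sub-signals have 2, 2, 4, and 8 coefficients, so the output length equals the input length, as stated above.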
In the last decade, there has been a huge increase in the applications of wavelets in various scientific disciplines. Typical applications include signal processing, image processing, security systems, numerical analysis, statistics, and biomedicine. In contrast to other transforms, such as the Fourier transform or the cosine transform, the wavelet transform offers a wide variety of useful features. Some of these are as follows:
Adaptive time-frequency windows;
Lower aliasing distortion for signal processing applications;
Computational complexity of $O(N)$, where $N$ is the length of the data;
Efficient very large scale integration (VLSI) implementation.
3. The use of DWT for feature extraction
Before feature extraction, the speech data are processed by a silence-removal algorithm and then pre-processed by normalizing the speech signals, which makes the signals comparable regardless of differences in magnitude. In this study, three feature extraction methods based on the discrete wavelet transform are discussed in the following part of the paper.
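The text does not specify which silence-removal or normalization algorithm was used. The following is only a sketch of one common choice, assuming frame-energy thresholding for silence removal and peak normalization; the function names, `frame_len`, and `rel_threshold` are hypothetical parameters introduced for illustration.

```python
def remove_silence(signal, frame_len=160, rel_threshold=0.05):
    """Drop frames whose mean energy is below rel_threshold * max frame energy.

    Assumption: energy thresholding stands in for the unspecified
    silence-removal algorithm of the paper.
    """
    frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
    energies = [sum(x * x for x in f) / len(f) for f in frames]
    peak = max(energies, default=0.0)
    kept = []
    for frame, energy in zip(frames, energies):
        if peak > 0 and energy >= rel_threshold * peak:
            kept.extend(frame)
    return kept


def normalize(signal):
    """Scale the signal so that its largest absolute sample equals 1."""
    peak = max((abs(x) for x in signal), default=0.0)
    return [x / peak for x in signal] if peak > 0 else list(signal)


# Two silent frames, one active frame, one silent frame.
raw = [0.0] * 320 + [0.5, -0.25] * 80 + [0.0] * 160
speech = normalize(remove_silence(raw))
print(len(speech), max(abs(x) for x in speech))
```

With a 16 kHz sampling frequency, a 160-sample frame corresponds to 10 ms; both values would need tuning for real recordings.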
3.1. DWT method with LPC
For an orthogonal wavelet function, a library of DWT bases is generated. Each of these bases offers a particular way of coding signals, preserving global energy, and reconstructing exact features. The DWT is used to extract additional features to secure a higher recognition rate. In this study, DWT is applied at the feature extraction stage, but the resulting data are not suitable for the classifier because of their great length. A better representation of the speaker features is therefore needed. Previous studies proposed that using the LPC of the DWT as features in recognition tasks is effective. Adami and Barone (2001) and Tajima, Port, and Dalby (1997) suggested a method to calculate the LPC orders of the wavelet transform for speaker recognition.
In this method, the LPC is obtained from the DWT sub-signals. The DWT at level three is generated, and 30 LPC orders are then obtained for each sub-signal and combined into one feature vector. The main advantage of this feature method is that it captures a different LPC description at each resolution, exploiting the multiresolution capability of the DWT. The sequence of LPC orders thus carries distinguishable information together with that of the wavelet transform. Fig. 2 shows the LPC orders calculated for the DWT at depth 3 for three different utterances of the same person. The feature vector extracted by DWT and LPC appears appropriate for speaker recognition.
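A minimal sketch of this feature construction follows. It assumes the Haar wavelet and the autocorrelation method with the Levinson-Durbin recursion for the LPC; neither choice is stated in the text, and the function names are hypothetical. With a three-level DWT (sub-signals A3, D3, D2, D1) and 30 LPC orders each, the sketch yields a 120-element vector, which is an inference from the description rather than a figure reported in the paper.

```python
import math


def haar_dwt(signal, levels):
    """Haar DWT (assumed wavelet); returns [A_levels, D_levels, ..., D_1]."""
    s = 1 / math.sqrt(2)
    approx, details = list(signal), []
    for _ in range(levels):
        pairs = [(approx[2 * n], approx[2 * n + 1]) for n in range(len(approx) // 2)]
        details.append([s * (a - b) for a, b in pairs])
        approx = [s * (a + b) for a, b in pairs]
    return [approx] + details[::-1]


def lpc(signal, order):
    """Predictor coefficients a_1..a_order (x[n] ~ sum_j a_j * x[n-j]),
    computed with the autocorrelation method and Levinson-Durbin."""
    n = len(signal)
    r = [sum(signal[i] * signal[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:  # degenerate (for example all-zero) sub-signal
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1 - k * k
    return a[1:]


def dwt_lpc_features(signal, levels=3, order=30):
    """Concatenate the LPC coefficients of every DWT sub-signal."""
    features = []
    for sub_signal in haar_dwt(signal, levels):
        features.extend(lpc(sub_signal, order))
    return features


utterance = [math.sin(0.01 * n * n) for n in range(512)]  # synthetic stand-in
print(len(dwt_lpc_features(utterance)))
```

The vector length depends only on the DWT level and the LPC order, not on the utterance length, which is what keeps the classifier input small.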
3.2. DWT method with entropy
Turkoglu et al. (2003) suggested a method to calculate the entropy value of the wavelet norm in digital modulation recognition. A feature extraction method for speaker recognition based on a combination of three entropy types (sure, logarithmic energy, and norm) has also been proposed. Lastly, Daqrouq (2011) investigated a speaker identification system using adaptive wavelet sure entropy.
As seen in the above studies, the entropy of a specific sub-band signal may be employed as a feature for recognition tasks. This is possible because each Arabic vowel has a distinct energy (see Fig. 3). In this paper, the entropy obtained from the DWT is employed for speaker recognition. The feature extraction method can be explained as follows:
Decomposing the speech signal by wavelet packet transform at level 7, with Daubechies type (db2).
Calculating three entropy types for all 256 nodes at depth 7 for wavelet packet using the following equations:
Log energy entropy:
$$E(s) = \sum_{i} \log\left(s_i^2\right)$$
where $s$ is the signal, $s_i$ are the DWT coefficients, and $p$ is a positive threshold (used by the sure entropy). Entropy is a common concept in many fields, mainly in signal processing. The classical entropy-based criterion describes information-related properties for an accurate representation of a given signal. Entropy is commonly used in image processing, where it carries information about the concentration of the image. Moreover, entropy measurement is a powerful tool for quantifying the ordering of non-stationary signals. Fig. 2 shows the three entropies calculated for the DWT at depth 3 for three different utterances of the same person. The feature vector extracted by DWT and entropy appears appropriate for speaker recognition. This conclusion follows from the criterion that an extracted feature vector should: 1) vary widely from class to class; 2) be stable over a long period of time; and 3) not be correlated with other features (see Fig. 3).
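As an illustration of how such entropy features could be computed from a vector of wavelet coefficients, the sketch below implements the log energy, sure, and norm entropies in their commonly used forms. These definitions are an assumption: they are standard in wavelet toolboxes, but the exact set of formulas used in this paper is not recoverable from the text, and the function names are hypothetical.

```python
import math


def log_energy_entropy(coeffs):
    """Sum of log(c_i^2), using the convention log(0) = 0."""
    # 2*log|c| equals log(c^2) and avoids underflow when squaring tiny values.
    return sum(2.0 * math.log(abs(c)) for c in coeffs if c != 0)


def sure_entropy(coeffs, p):
    """n - #{i : |c_i| <= p} + sum_i min(c_i^2, p^2), for a positive threshold p."""
    n_small = sum(1 for c in coeffs if abs(c) <= p)
    return len(coeffs) - n_small + sum(min(c * c, p * p) for c in coeffs)


def norm_entropy(coeffs, p):
    """Sum of |c_i|^p, with p >= 1."""
    return sum(abs(c) ** p for c in coeffs)


coeffs = [0.0, 1.0, -2.0, 0.5]
print(log_energy_entropy(coeffs))
print(sure_entropy(coeffs, p=1.0))
print(norm_entropy(coeffs, p=2.0))
```

Each function maps a whole sub-signal to a single number, so one value per entropy type and per sub-band would be collected into the feature vector.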
4. Proposed probabilistic neural networks algorithm
The network is designed from three inputs: the matrix of input vowel feature vectors used for training the net, whose size is set by the number of WP nodes (minus 2, the repeated original node); the target class vector; and SPREAD, the spread of the radial basis functions. We employ a SPREAD value of 1 because that is a typical distance between the input vectors. If SPREAD is near zero, the network acts as a nearest-neighbor classifier. As SPREAD becomes larger, the designed network takes into account several nearby design vectors.
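The role of SPREAD can be seen in a minimal PNN sketch. It assumes a Gaussian radial basis function whose width is set by `spread` and a summation layer that adds the activations of each class; the exact kernel scaling used by the authors' toolbox is not given in the text, and `pnn_classify` is a hypothetical name.

```python
import math


def pnn_classify(train_vectors, train_labels, x, spread=1.0):
    """Return the class whose summed radial-basis activation at x is largest."""
    scores = {}
    for vec, label in zip(train_vectors, train_labels):
        dist2 = sum((a - b) ** 2 for a, b in zip(vec, x))
        # Gaussian kernel; the scaling by 2*spread^2 is an assumption.
        activation = math.exp(-dist2 / (2.0 * spread ** 2))
        scores[label] = scores.get(label, 0.0) + activation
    return max(scores, key=scores.get)


train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = ["speaker_a", "speaker_a", "speaker_b", "speaker_b"]
print(pnn_classify(train, labels, [0.2, 0.3], spread=1.0))
print(pnn_classify(train, labels, [5.5, 5.1], spread=1.0))
```

A PNN has no iterative training: the training vectors themselves form the pattern layer. With a very small `spread`, only the closest training vector contributes noticeably, which gives the nearest-neighbor behavior mentioned above.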
5. Results and discussion
5.1. Speaker identification by DWTLPC
A testing database was created from the Arabic language. The recordings were made in a normal office environment through a PC sound card, with a frequency of 4 kHz and a sampling frequency of 16 kHz.
The utterances are spoken Arabic words. A total of 47 individual speakers (19 to 40 years old), 31 male and 16 female, spoke these Arabic words for the training and testing phases. The total number of tokens considered for training and testing was 653.
Experiments were performed using the 653 Arabic utterances of the 47 individual speakers (31 male and 16 female). For each speaker, up to 15 speech signals were used. Six of these signals were used for training, and four to nine (depending on the number of recordings available for each speaker) were used for testing the expert system (Fig. 6). In this experiment, 93.26% correct classification was obtained by means of DWTLPC among the 47 different speaker signal classes. The testing results are tabulated in Tab. 1. They clearly indicate the usefulness and trustworthiness of the proposed approach for extracting features from speech signals gender identification system.
|Recognized Signals||Number of Signals||Speaker|
Table 2 shows the experimental results of the different approaches used for comparison. Three methods were investigated: modified DWT with the proposed feature extraction method (MDWTLPC); framing DWTLPC (FDWTLPC), illustrated in Fig. 8, where the LPC orders are obtained from six frames of each DWT sub-signal; and the proposed DWTLPC method. MDWTLPC reached the lowest recognition rate. The best recognition rate, 93.53%, was obtained for FDWTLPC.
|Number of Signals||Identification|
To improve the robustness of DWTLPC to additive white Gaussian noise (AWGN), the same wavelet decomposition process was applied to the DWT approximation sub-signal at several levels instead of to the original (imposter) signal (Daqrouq, 2011). Feature extraction was then applied to each of the resulting wavelet decomposition sub-signals (see Fig. 6). After applying the proposed classification mechanism to each sub-signal of a distinct DWT level, the highest recognition rate was achieved at levels 3 and 4 (see Tab. 4). In this experiment, the recognition rates did not improve when the DWT level was increased beyond four.
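A noise-robustness experiment of this kind can be sketched as follows: white Gaussian noise is added at a chosen signal-to-noise ratio, and the approximation sub-signal at a given level is extracted for subsequent feature extraction. The Haar wavelet, the function names, and the use of Python's `random` module are assumptions of this sketch, not details taken from the paper.

```python
import math
import random


def add_awgn(signal, snr_db, seed=None):
    """Add white Gaussian noise so that the expected SNR equals snr_db."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def haar_approximation(signal, level):
    """Approximation sub-signal after `level` Haar low-pass/decimate stages."""
    s = 1 / math.sqrt(2)
    approx = list(signal)
    for _ in range(level):
        approx = [s * (approx[2 * n] + approx[2 * n + 1])
                  for n in range(len(approx) // 2)]
    return approx


clean = [math.sin(0.05 * n) for n in range(20000)]
noisy = add_awgn(clean, snr_db=0, seed=1)   # 0 dB: noise power equals signal power
for level in (1, 2, 3, 4):
    sub_signal = haar_approximation(noisy, level)
    print(level, len(sub_signal))            # features would be extracted per level
```

Because white noise is spread evenly over frequency while the low-pass branch keeps only the lower band, each approximation stage discards part of the noise, which is one plausible reason why intermediate levels behave better under noise.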
|Recognized Signals [100%]||Number of Signals||Speaker|
|Level 4||Level 3||Level 2||Level 1|
5.2. Arabic vowel classification by using DWTLPC
In recent times, Arabic has become one of the most significant and widely spoken languages in the world, with an estimated 350 million speakers distributed all over the world and mostly covering 22 Arabic countries. Arabic is a Semitic language characterized by the existence of particular consonants, such as pharyngeal, glottal, and emphatic consonants. Furthermore, it presents some phonetic and morpho-syntactic particularities. The morpho-syntactic structure is built around pattern roots (CVCVCV, CVCCVC, etc.) (Zitouni and Sarikaya, 2009). The Arabic alphabet consists of 28 letters that can be expanded to a set of 90 by additional shapes, marks, and vowels. The 28 letters represent the consonants and long vowels such as ى and ٱ (both pronounced as /a:/), ي (pronounced as /i:/), and و (pronounced as /u:/). The short vowels and certain other phonetic information, such as consonant doubling (shadda), are not represented by letters directly but by diacritics. A diacritic is a short stroke located above or below the consonant. Table 1 shows the complete set of Arabic diacritics. We split the Arabic diacritics into three sets: short vowels, doubled case endings, and syllabification marks. Short vowels are written as symbols either above or below the letter in text with diacritics, and dropped altogether in text without diacritics. There are three short vowels: fatha, which represents the /a/ sound and is an oblique dash over a letter; damma, which represents the /u/ sound and has the shape of a comma over a letter; and kasra, which represents the /i/ sound and is an oblique dash under a letter, as reported in Table 1.
In this work, speech signals were obtained via a PC sound card, with a sampling frequency of 16000 Hz. The Arabic vowels were recorded by 27 speakers: 5 females and 22 males. The recordings were made under normal university office conditions. The performance of the speaker-independent Arabic vowel classifier is studied in several experiments, depending on vowel type. In the following three experiments, the feature extraction method used is DWTLPC.
We tested 200 signals of the long Arabic vowel ٱ (pronounced as /a:/), 400 signals of the long Arabic vowel ي (pronounced as /e:/), and 90 signals of the long Arabic vowel و (pronounced as /u:/). The results indicated that 96% of the signals were classified correctly for the vowel ٱ, 90% for the vowel ي, and 94% for the vowel و. Tab. 5 shows the recognition rates.
|Number of Signals||Long Vowels|
In this experiment, we study the recognition rates for long vowels connected with other consonants, such as ل (pronounced as /l/) and ر (pronounced as /r/). Tab. 6 reports the recognition rates. The results indicated an average recognition rate of 88.5%.
|Number of Signals||Long Vowels|
A speech recognition system based on a probabilistic neural network is presented in this work. The system uses a wavelet feature extraction method. An effective feature extraction method for an Arabic vowel system is developed, taking into consideration that computational complexity is a crucial issue. The experimental results on a subset of the recorded database showed that the proposed feature extraction method is suitable for an Arabic recognition system. The performance of the speaker-independent Arabic vowel classifier was studied in two experiments, depending on vowel type. The reported results show that the proposed method provides an effective analysis, with identification rates that may reach 93%.
Future work on this study will aim to improve the capability of the proposed system to work in real time. This may be achieved by modifying the recording apparatus, adding a data acquisition system (such as NI-6024E), and interfacing online with Matlab code that simulates the expert system.
In this work, an expert system for speaker identification was investigated for analyzing speech signals using pattern identification. The speaker identification performance of this method was demonstrated on a total of 47 individual speakers (31 male and 16 female). A feature extraction method using LPC in conjunction with framed DWT at level three was developed, and a PNN was proposed for the classification process. The reported results show that the proposed method provides a powerful analysis. The performance of the intelligent system is given in Table 1 and Table 2. The average identification rate was 93.26%, better than other methods. The performance of the speaker-independent Arabic vowel classifier was investigated in several experiments, depending on vowel type, and the reported results show that the proposed method provides an effective analysis, with identification rates that may reach 93%.