Multi-Resolution Spectral Analysis of Vowels in Tunisian Context

The classic speech spectrogram shows log-magnitude amplitude (dB) versus time and frequency. The sound pressure level in dB is approximately proportional to the loudness perceived by the ear. The classic speech sonagram offers a single integration time, the length of the analysis window: it implements a uniform filter bank, whose spectral samples are regularly spaced and correspond to equal bandwidths. The choice of the window length therefore fixes the time-frequency resolution for all frequencies of the sonagram. The narrower the window, the better the time resolution and the worse the frequency resolution. As a consequence, formants, voicing and frication at low frequencies are displayed less sharply than bursts at high frequencies, or vice versa. It is therefore necessary to choose the window to match the signal.


Introduction
The classic speech sonagram offers a single integration time, the window length, which fixes the time-frequency resolution for all frequencies: a narrower window gives better time resolution and worse frequency resolution, and vice versa. Mallat (1989, p. 674) remarks that "it is difficult to analyze the information content of an image directly from the gray-level intensity of the image pixels... Generally, the structures we want to recognize have very different sizes. Hence, it is not possible to define a priori an optimal resolution for analyzing images." To improve the standard spectral output, we can calculate a multi-resolution (MR) spectrum. In the original papers, MR analysis is based on discrete wavelet transforms (Grossmann & Morlet, 1984; Mallat, 1989). Since then it has been applied to several domains: image analysis (Mallat, 1989), time-frequency analysis (Cnockaert, 2008), speech enhancement (Fu & Wan, 2003; Manikandan, 2006), and automatic signal segmentation by searching for stationary areas in the scalogram (Leman & Marque, 1998).
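The window-length trade-off described above can be sketched numerically. The following Python fragment (ours, with an illustrative test signal and parameter choices, not code from this chapter) computes two fixed-resolution magnitude spectrograms and compares their frequency spacing:

```python
# Illustrative sketch of the single-resolution trade-off: one window length
# fixes both the time and the frequency resolution for all bands.
import numpy as np

def stft_mag(signal, win_len, hop):
    """Magnitude STFT with a Hamming window: one fixed resolution everywhere."""
    window = np.hamming(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)          # shape: (n_frames, win_len // 2 + 1)

fs = 8000
t = np.arange(fs) / fs               # 1 s test tone
x = np.sin(2 * np.pi * 440 * t)

wide = stft_mag(x, win_len=64, hop=32)       # wideband: good time, poor frequency
narrow = stft_mag(x, win_len=1024, hop=512)  # narrowband: poor time, good frequency

# The long window yields 16x finer frequency spacing than the short one.
print(fs / 64, "Hz/bin vs", fs / 1024, "Hz/bin")
```

The wideband analysis produces many frames with coarse bins, the narrowband one few frames with fine bins; the MR spectrum discussed below combines both behaviours in a single representation.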
The MR spectrum, a compromise that provides both higher frequency and higher temporal resolution, is not a new method. In phonetic analysis, Annabi-Elkadri & Hamouda (2010) present a study of the two common vowels [a] and [E] in the Tunisian dialect and in French, with the vowels pronounced in a Tunisian context. The analysis of their results shows that, due to the influence of French on the Tunisian dialect, the vowels [a] and [E] are, in some contexts, similarly pronounced. Annabi-Elkadri & Hamouda (2011, in press) apply the MRS to an automatic Silence/Sonorant/Non-Sonorant detection method based on ANOVA. The results are compared with classical classification statistics such as the standard deviation and the mean; ANOVA performs better. The MRS-based method for automatic Silence/Sonorant/Non-Sonorant detection also provides better results than classical spectral analysis. Cheung & Lim (1991) present a method for combining the wideband and narrowband spectrograms by taking the geometric mean of their corresponding pixel values. The combined spectrogram appears to preserve the visual features associated with high resolution in both frequency and time. Lee & Ching (1999) describe an approach using multi-resolution analysis (MRA) for clean connected speech and noisy telephone conversation speech. Experiments show that MRA cepstra significantly reduce insertion errors compared with MFCCs. For music signals, Cancela et al. (2009) review and compare two algorithms, an efficient constant-Q transform and a multi-resolution FFT, against a new proposal based on IIR filtering of the FFT. The proposed method proves to be a good compromise between design flexibility and reduced computational effort, and was used as part of an effective melody extraction algorithm.
In the same context, Dressler (2006) described a spectral analysis for melody extraction based on multi-resolution spectrograms. The approach aims to extract the sinusoidal components of the audio signal: spectra at different frequency resolutions are computed in order to detect sinusoids that are stable across different FFT frames. The evaluation showed that multi-resolution analysis improves the extraction of the sinusoidal components.
The aim of this study is to extend the research of Annabi-Elkadri & Hamouda (2010). We present and test the concept of multi-resolution spectral analysis (MRS) for vowels in Tunisian words and in French words pronounced in a Tunisian context. Our method is composed of two parts. The first part applies the MRS to the signal; the MRS is calculated by combining several FFTs of different lengths (Annabi-Elkadri & Hamouda, 2010). The second part is formant detection by multi-resolution LPC (Annabi-Elkadri & Hamouda, 2010). We present an improvement of our multi-resolution spectral analysis method, the MR FFT. As an application, we use our VASP system on a Tunisian Dialect corpus pronounced by Tunisian speakers. Standard Arabic is composed of 34 phonemes (Muhammad, 1990). It has three vowels, with long and short forms of [a], [i] and [u]. Arabic phonemes include two classes, pharyngeal and emphatic, which are characteristic of Semitic languages (Elshafei, 1991; Muhammad, 1990). Arabic has two kinds of syllables: open syllables (CV) and (CVV), and closed syllables (CVC), (CVVC) and (CVCC), where V is a vowel and C is a consonant. The syllables (CVVC) and (CVCC) occur only at the end of a sentence (Muhammad, 1990).
In section 2, we present a brief history of the Tunisian Dialect (TD) and its relationship with Arabic and French. In section 3, we present our method for calculating the multi-resolution FFT. In section 4, we present the materials and methods: our corpus and our system, the Visual Assistance of Speech Processing software (VASP). We present our experimental results in section 5 and discuss them in section 6. Our conclusion is given in section 7.
A brief history of the Tunisian Dialect

After the French colonization, the French government wanted to spread the French language in the country. The French instituted a bilingual education system through the Franco-Arab schools. The programs of the bilingual schools were modeled primarily on French primary education for children of European origin (French, Italian and Maltese), to which courses in colloquial Arabic were added. As for Tunisian children, they received their education in classical Arabic in order to study the Quran. Only a small Tunisian elite received a truly bilingual education, in order to co-administer the country; the Tunisian Muslim masses continued to speak only Arabic or one of its many varieties. A report by Jean-Jules Jusserand pursued the logic of Jules Ferry: in a "Note on Education in Tunisia" dated February 1882, Jusserand exposed his ideas: "We have at this time no better way to assimilate the Arabs of Tunisia, to the extent that it is possible, than to teach them our language; this is the opinion of all who know them best. We cannot rely on religion to achieve this rapprochement, as they do not convert to Christianity; but as they learn our language, a host of European ideas will become accessible to them, as experience has sufficiently demonstrated. In the reorganization of Tunisia, a large part must be given to education."
After independence, instruction in French, like Arabic, was required for all Tunisian children in primary school. This explains why French has become the second language in Tunisia; it is spoken by the majority of the population.
There are different varieties of TD depending on the region, such as the dialects of Tunis, the Sahel, Sfax, etc. Its morphology, syntax, pronunciation and vocabulary are quite different from those of Standard Arabic (Marcais, 1950). There are several differences in pronunciation between Standard Arabic and TD. Short vowels are frequently omitted, especially where they would occur as the final element of an open syllable. While Standard Arabic can have only one consonant at the beginning of a syllable, after which a vowel must follow, TD commonly has two consonants in the onset. For example, Standard Arabic "book" is /kitaːb/, while in TD it is /ktaːb/. The nucleus in TD may contain a short or long vowel, and at the end of the syllable, in the coda, it may have up to three consonants, whereas Standard Arabic cannot have more than two consonants at the end of a syllable. Word-internal syllables are generally heavy, in that they have either a long vowel in the nucleus or a consonant in the coda. Non-final syllables composed of just a consonant and a short vowel (i.e. light syllables) are very rare in TD and are generally loans from Standard Arabic: short vowels in this position have generally been lost, resulting in the many initial CC clusters. For example, /ʒawaːb/ "reply" is a loan from Standard Arabic, while the same word has undergone the natural development /ʒwaːb/, which is the usual word for "letter" (Gibson, 1998).
In TD's non-pharyngealised contexts, there is a strong fronting and closing of /aː/, which, especially among younger speakers in Tunis, can reach as far as /eː/, and to a lesser extent of /a/.

Multi-resolution FFT
It is difficult to choose a window with ideal characteristics. The size of the ideal window (Hancq & Leich, 2000) is equal to twice the pitch period of the signal: a wider window shows the harmonics in the spectrum, while a shorter window only roughly approximates the spectral envelope. This amounts to estimating the energy dispersion with the least error. When we calculate the windowed FFT, we suppose that the energy is concentrated at the center of the frame (Haton et al., 2006, p. 41), which we denote C_p. Our problem is therefore the estimation of C_p.

The center estimation in the case of the Discrete Fourier Transform (DFT)
We would like to calculate the spectrum of a speech signal s. We note L the length of s. The first step is to sample s into frames. The size of each frame is between 10 ms and 20 ms (Calliope, 1989; Ladefoged, 1996) to meet the stationarity condition. We chose the Hamming window, fixed its size to N = 512 points and the overlap to 50%. Figure 1 shows the principle of the center estimation. With a 50% overlap, the hop between frames is N/2, so frame number p is composed of the N = 512 components s((p-1)N/2 + 1), ..., s((p-1)N/2 + N), and in the general case the component number l of frame p is s((p-1)N/2 + l). The center of frame p is estimated as C_p = s((p-1)N/2 + [N/2]), where [.] denotes the integer part. The windowed FFT of frame number p is then calculated on these N components multiplied by the Hamming window.
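The framing scheme above can be sketched as follows. This Python fragment (ours, an illustrative reconstruction of the scheme just described, with a random test signal) builds N = 512-point Hamming frames at 50% overlap and returns the centre sample index C_p of each frame:

```python
# Minimal sketch of the framing described in the text: N = 512, 50% overlap,
# frame p (1-indexed) covering samples (p-1)*N/2 + 1 .. (p-1)*N/2 + N.
import numpy as np

N = 512          # frame length (samples)
hop = N // 2     # 50% overlap

def frame_centers(signal):
    """For each frame p (1-indexed), the 0-indexed position of its centre C_p."""
    n_frames = 1 + (len(signal) - N) // hop
    # 0-indexed, frame p starts at (p-1)*hop, so its centre is at
    # (p-1)*hop + N//2 -- the [N/2]-th component of the frame.
    return [(p - 1) * hop + N // 2 for p in range(1, n_frames + 1)]

def windowed_fft(signal, p):
    """Hamming-windowed FFT of frame number p (1-indexed)."""
    start = (p - 1) * hop
    return np.fft.rfft(signal[start:start + N] * np.hamming(N))

x = np.random.default_rng(0).standard_normal(44100)
centers = frame_centers(x)
print(centers[:3])   # consecutive centres are hop = 256 samples apart
```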

The center estimation in the case of the MR FFT
To improve the standard spectrum, we calculate the MR FFT by combining several FFTs of different lengths. The temporal accuracy is then higher in the high-frequency region, and the frequency resolution is higher in the low-frequency region.
We calculate the windowed FFT of the signal NB times, where the number of steps NB is equal to the number of frequency bands, fixed a priori. For each step i (i ≤ NB), the signal s is sampled into frames s_i(p_i) and windowed with a window w. We note N_i the length of the frames and of w for step i, and C_i,p_i the center of w.
The spectrum S_i,k(p_i) for each step i is the windowed FFT of frame p_i, with center C_i,p_i = s_i,[N_i/2](p_i), the component of index [N_i/2] of the frame p_i, when the overlap is N_i/2.
In the MRS, an overlap of N_i/2 cannot satisfy the principle of continuity of the MRS across the different frequency bands. Too low an overlap causes a discontinuity in the MRS spectrum and thus gives a bad estimate of the energy dispersion. Our problem therefore consisted in choosing the overlap: the frames must overlap by more than 50% of the frame length. We chose an overlap equal to 75% (fig. 2).
With a 75% overlap, the hop between frames is N_i/4. The frame p_i = 1 of step i is composed of the N_i components s_i(1), ..., s_i(N_i); the frame p_i = 2 is composed of s_i(N_i/4 + 1), ..., s_i(N_i/4 + N_i); and, in the general case, the frame p_i is composed of s_i((p_i - 1)N_i/4 + 1), ..., s_i((p_i - 1)N_i/4 + N_i). The center C_i,p_i of frame p_i = 1 is s_i([N_i/2]); the center of p_i = 2 is s_i(N_i/4 + [N_i/2]); and, in the general case, C_i,p_i = s_i((p_i - 1)N_i/4 + [N_i/2]). The spectrum S_i,k(p_i) of each step i is then the windowed FFT of frame p_i, with this center and an overlap equal to 75%.
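The MR FFT construction can be sketched as follows. In this Python fragment (ours; the band edges and window lengths N_i are illustrative values, not the chapter's exact configuration), each step i analyses the signal with its own window length at 75% overlap, and only the bins falling inside that step's frequency band are kept:

```python
# Hedged sketch of the MR FFT: long windows for low bands (fine frequency
# resolution), short windows for high bands (fine time resolution).
import numpy as np

fs = 16000
bands = [(0, 2000, 1024), (2000, 4000, 512), (4000, 8000, 256)]  # (lo, hi, N_i)

def band_spectra(signal, n_i):
    """Hamming-windowed FFT magnitudes for one step, 75% overlap (hop N_i/4)."""
    hop = n_i // 4
    w = np.hamming(n_i)
    out = []
    for start in range(0, len(signal) - n_i + 1, hop):
        out.append(np.abs(np.fft.rfft(signal[start:start + n_i] * w)))
    return np.array(out)

def mr_fft(signal):
    """Keep, from each step, only the bins inside that step's own band."""
    pieces = []
    for lo, hi, n_i in bands:
        spec = band_spectra(signal, n_i)
        freqs = np.fft.rfftfreq(n_i, 1 / fs)
        mask = (freqs >= lo) & (freqs < hi)
        pieces.append((freqs[mask], spec[:, mask]))
    return pieces

x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)    # 1 s, 440 Hz test tone
for freqs, spec in mr_fft(x):
    print(f"{freqs[0]:.0f}-{freqs[-1]:.0f} Hz: "
          f"{spec.shape[0]} frames x {spec.shape[1]} bins")
```

The low band ends up with few frames of many narrow bins, the high band with many frames of few wide bins, which is exactly the resolution trade per band described above.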
To improve the standard spectrum, we calculate a multi-resolution spectrum (MRS) with two methods: by combining several FFTs of different lengths and by combining several windows for each FFT (Annabi-Elkadri & Hamouda, 2010).
The diagrams displayed in figure 3 illustrate the difference between the standard FFT and the MRS. For a standard FFT, the window size is the same for every frequency band, unlike the MRS, whose window size depends on the frequency band. Figure 4 shows the classical sonagram (Hamming window, 11 ms, overlap of 1/3) of the sentence "Le soir approchait, le soir du dernier jour de l'année". Figure 5 shows the multi-resolution sonagram of the same sentence, computed over the frequency bands delimited by [0, 2000, 4000, 7000, 10000] Hz: it offers several integration times, obtained by combining several FFTs of different lengths depending on the frequency band.

Corpus
Our corpus is composed of TD pronounced by Tunisian speakers. The sampling frequency is 44.1 kHz and the wav format was adopted. We avoided all types of noise filtering that would degrade the quality of the signal and thus cause a loss of information. We recorded real-time spontaneous speech and discussions of 4 speakers, and removed noise and background sounds such as laughing, music, etc. It was difficult to build a spontaneous corpus because, in real time, it is impossible to obtain all phonemes and syllables; another difficulty was the variability of the discussion themes and of the pronounced sounds. For these reasons, we decided to complete our corpus with a second one. We prepared a text in Tunisian dialect containing all the sounds to be studied, in which every phoneme and syllable appears 15 times. We asked four speakers, two men and two women, all between 25 and 32 years old, to read the text aloud under the same recording conditions as the first corpus.

Speech Enhancement, Modeling and Recognition - Algorithms and Applications

The speakers did not know the text beforehand. We avoided marked accents. Our corpus was transcribed into sentences, words and phonemes.

VASP software: Visual assistance of speech processing software
For our study, we created a first prototype of a System for Visual Assistance of Speech Processing (VASP). It offers many functions for speech visualization and analysis and was developed with the Matlab GUI toolkit. In the following subsections, we present some of the functionalities offered by our system. VASP reads sound files in wav format. It represents a wav file in the time domain by its waveform, and in the time-frequency domain by spectral representations: classical spectrograms in narrow band and wide band (see figure 6), spectrograms calculated with linear prediction and cepstral coefficients, gammatone, discrete cosine transform (DCT) and Wigner-Ville representations, multi-resolution LPC (MR LPC), the multi-resolution spectrum (MR FFT) and the multi-resolution spectrogram. From the waveform, we can choose, in real time, the frame for which we want to display a spectrum (see figure 7). Parameters are set from a menu: the window type (Hamming, Hanning, triangular, rectangular, Kaiser, Bartlett, Gaussian and Blackman-Harris), the window length (64, 128, 256, 512, 1024 or 2048 samples) and the LPC factor. In all visual representations, the coordinates of any pixel can be read; for example, we can select a point on a spectrogram and read its time, frequency and intensity directly.
VASP offers the possibility to choose a part of a signal to calculate and visualize it in any time-frequency representations.
Our system can automatically detect Silence/Speech from the waveform. From the spectrogram, it can detect acoustic cues such as formants and classify them automatically into two classes, sonorant or non-sonorant. The system can analyse the visual representations with two methods: image analysis by edge detection, and analysis of the signal itself. Edge detection is calculated with a gradient method or a median filter method (fig. 8). The second method is based on detecting energy in a time-frequency representation.
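The gradient-based edge detection on a time-frequency image can be sketched as follows. This Python fragment is ours (not VASP's Matlab code) and uses a toy "spectrogram" with one horizontal energy band standing in for a formant track:

```python
# Illustrative sketch of gradient-based edge detection on a spectrogram-like
# image: mark pixels whose gradient magnitude exceeds a threshold.
import numpy as np

def gradient_edges(image, threshold):
    """Boolean edge map from the per-pixel gradient magnitude."""
    gy, gx = np.gradient(image.astype(float))  # frequency- and time-axis gradients
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold

# Toy image: zero energy everywhere except one horizontal band (a "formant").
spec = np.zeros((64, 100))
spec[20:24, :] = 1.0

edges = gradient_edges(spec, threshold=0.25)
print(edges.sum() > 0, edges[40, 50])  # edges appear only around the band
```

A median filter variant would replace the gradient with the difference between the image and its median-smoothed version; both flag the sharp intensity transitions that outline formant tracks.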

Experimental results
Formant frequencies are properties of the vocal tract system and need to be inferred from the speech signal rather than measured directly. The spectral shape of the vocal tract excitation strongly influences the observed spectral envelope; moreover, not all vocal tract resonances cause peaks in the observed spectral envelope.
To extract formant frequencies from the signal, we resample it to 8 kHz and use a linear prediction method for the analysis. Linear prediction models the signal as if it were generated by a minimum-energy signal passed through a purely recursive IIR filter. The multi-resolution LPC (MR LPC) is calculated as the LPC of the average of the convolutions of several windows with the signal.
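The basic LPC step underlying this analysis can be sketched as follows. This Python fragment is a hedged illustration (ours, not the chapter's MR LPC): it computes autocorrelation LPC coefficients with the Levinson-Durbin recursion and reads formant candidates from the angles of the prediction-filter roots, on a synthetic signal with known spectral peaks:

```python
# Hedged sketch of formant estimation by linear prediction: the signal is
# modelled as the output of an all-pole (purely recursive IIR) filter, and
# the filter's resonances are taken as formant candidates.
import numpy as np

def lpc(signal, order):
    """LPC coefficients via the Levinson-Durbin recursion on autocorrelations."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    a, err = [1.0], r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a = [a[j] + k * (a[i - j] if i - j < len(a) else 0.0)
             for j in range(i)] + [k]
        err *= 1.0 - k * k
    return np.array(a)

def formants(signal, fs, order=10):
    """Formant candidates (Hz): angles of LPC roots close to the unit circle."""
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    a = lpc(emphasized * np.hamming(len(emphasized)), order)
    roots = [z for z in np.roots(a) if z.imag > 0 and abs(z) > 0.9]
    return sorted(np.angle(z) * fs / (2 * np.pi) for z in roots)

fs = 8000
t = np.arange(0, 0.03, 1 / fs)
# synthetic frame with spectral peaks near 700 Hz and 1200 Hz
x = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
print([round(f) for f in formants(x, fs)])
```

The pre-emphasis, Hamming windowing and LPC order here are common illustrative choices; the MR LPC of the chapter additionally averages the convolutions of several windows with the signal before this step.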
To our knowledge, there are no normative studies of Standard Arabic vowels comparable to those of Peterson & Barney (1952) for American English or of Fant (1969) for Swedish.
We applied VASP to TD and to French. We measured the first two formants F1 and F2 of vowels in Tunisian and French words, and compared our experimental results with those of Calliope (1989), whose corpus consists of vowel repetitions in the [p_R] context pronounced by 10 men and 9 women. The analysis of the results shows that the vowels are, in some contexts, similarly pronounced in the two languages.

Tunisian dialect and French spectral analysis of [a] in Tunisian context
We measured the first two formants F1 and F2 of the vowel [a] in Tunisian and French words in a Tunisian context. Figures 9(a) and 9(b) show the formant scatters for [a] in Tunisian words and in French words, respectively. There is a high overlap of the two scatters for F1 (400-700 Hz) and for F2 (500-3000 Hz).

Tunisian dialect and French spectral analysis of [E] in Tunisian context
We measured the first two formants F1 and F2 of the vowel [E] in Tunisian and French words in a Tunisian context. Figures 10(a) and 10(b) show the formant scatters for [E] in Tunisian words and in French words, respectively. There is a high overlap of the two scatters for F1 (300-550 Hz) and for F2 (1500-3000 Hz).

Tunisian dialect and French spectral analysis of [i] in Tunisian context
We measured the first two formants F1 and F2 of the vowel [i] in Tunisian and French words in a Tunisian context. Figures 11(a) and 11(b) show the formant scatters for [i] in Tunisian words and in French words, respectively. There is a high overlap of the two scatters for F1 (250-400 Hz) and for F2 (1800-2500 Hz).

Tunisian dialect spectral analysis of [o] and [e] in Tunisian context
We measured the first two formants F1 and F2 of the vowels [o] and [e] in Tunisian words. Figure 12 shows the corresponding formant scatters.

Tunisian dialect and French spectral analysis of [u] in Tunisian context
We measured the first two formants F1 and F2 of the vowel [u] in Tunisian and French words in a Tunisian context. Figures 13(a) and 13(b) show the formant scatters for [u] in Tunisian words and in French words, respectively. There is a high overlap of the two scatters for F1 (300-440 Hz) and for F2 (1500-3000 Hz).

Discussion
We compared our experimental results with those of Calliope (1989). The standard deviations of F1 and F2 were high and far from Calliope's values for both TD and French. This may be explained by the fact that, in the Tunisian dialect, the tongue constriction is narrower and the tongue is positioned as far forward as possible in the mouth. The high standard deviation is also related to the small size of our corpus.
We observed that Tunisian speakers pronounce these vowels in the same way in both French and TD.

Conclusion
The analysis of the obtained results shows that due to the influence of French language on the Tunisian dialect, the vowels are, in some contexts, similarly pronounced. It will be interesting to extend the study to other vowels, on a large corpus and to compare it with the study of other languages corpus like Standard Arabic, Berber, Italian, English and Spanish.