Oesophageal Speech’s Formants Measurement Using Wavelet Transform

Our research group has presented several works to the scientific community [2], [3] aimed at improving esophageal speech quality by stabilizing the poles of the LPC system that models the vocal tract. The wavelet transform is now being used to enhance the harmonics-to-noise ratio (HNR). For this task, it is crucial to know the formant frequencies of vowels accurately [7].


Introduction
One of the main concerns for otorhinolaryngology specialists and for patients who have undergone a laryngectomy is the complexity of the rehabilitation process. At present, no advanced technique is available for either learning or evaluating this process.
Esophageal speech is characterized by its low intelligibility, which implies that its objective measurement parameters e.g. pitch, jitter, shimmer or HNR have values outside normal ranges [1]. One of the consequences of this fact is the impossibility of using speech recognizers, speech to text converters or any kind of automatic response device that requires a speech signal.
This paper describes work that forms part of a research project whose objective is to adapt speech-controlled systems so that they can be used by people with vocal disorders. Esophageal voices are the most severe of these pathologies.
Our research group has presented several works to the scientific community [2], [3] aimed at improving esophageal speech quality by stabilizing the poles of the LPC system that models the vocal tract. The wavelet transform is now being used to enhance the harmonics-to-noise ratio (HNR). For this task, it is crucial to know the formant frequencies of vowels accurately [7].
In this paper, the results of a new algorithm are presented. The algorithm is based on the wavelet transform but proposes a new technique to improve calculation accuracy. To evaluate this technique, its results are compared with those obtained with LPC, taking the results of FFT analysis as the reference [4].
The general objective of the chapter is to enhance the quality of esophageal speech in communication with humans and machines. This aim arises from the low intelligibility of people who speak with an esophageal voice after a laryngectomy, an operation carried out as a treatment for larynx cancer [6].
As the signals used here are digital, it is more useful to use the semi-discrete wavelet transform (discretized on a dyadic grid, described by $s = 2^j$ and $\tau = k\,2^j$) or the discrete wavelet transform (DWT). The DWT analyzes the signal in different frequency bands with different resolutions by decomposing it into a coarse approximation and detail information [5].
The decomposition of the signal into different frequency bands is obtained simply by successive highpass and lowpass filtering of the time-domain signal. The original signal $x[n]$ is first passed through a half-band highpass filter $g[n]$ and a lowpass filter $h[n]$. This constitutes one level of decomposition and can be expressed mathematically as follows:
$$y_{high}[k] = \sum_{n} x[n]\, g[2k - n], \qquad y_{low}[k] = \sum_{n} x[n]\, h[2k - n]$$
where $y_{high}[k]$ and $y_{low}[k]$ are the outputs of the highpass and lowpass filters, respectively, after subsampling by 2. This decomposition halves the time resolution, since only half the number of samples now characterizes the entire signal.
However, this operation doubles the frequency resolution, since each output now spans only half the previous frequency band, which halves the uncertainty in frequency. This procedure, also known as subband coding, can be repeated for further decomposition.
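One level of the decomposition described above can be sketched as follows. This is only an illustration under stated assumptions: the two-tap Haar pair stands in for the half-band filters $h[n]$ and $g[n]$ (the chapter does not say which wavelet is used), and the 300 Hz test tone, the 8 kHz sampling rate and the function names are hypothetical.

```python
import math

def haar_analysis(x):
    """One decomposition level: half-band filtering followed by subsampling by 2.

    Assumption: Haar filters are used purely for illustration.
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = list(zip(x[0::2], x[1::2]))
    y_low = [s * (a + b) for a, b in pairs]   # coarse approximation
    y_high = [s * (a - b) for a, b in pairs]  # detail information
    return y_low, y_high

def energy(v):
    """Sum of squared samples."""
    return sum(c * c for c in v)

fs = 8000.0  # hypothetical sampling frequency in Hz
x = [math.sin(2.0 * math.pi * 300.0 * n / fs) for n in range(1024)]
y_low, y_high = haar_analysis(x)

# Each output holds half the samples, so the time resolution is halved.
print(len(x), len(y_low), len(y_high))
# The Haar pair is orthonormal, so the energy is preserved across the two bands.
print(energy(x), energy(y_low) + energy(y_high))
```

Applying `haar_analysis` again to `y_low` gives the next level of the ordinary wavelet decomposition; applying it to both outputs gives a wavelet packet tree.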
The wavelet packet method is a generalization of wavelet decomposition that offers a richer signal analysis. Wavelet packet atoms are waveforms indexed by three naturally interpreted parameters: position, scale (as in wavelet decomposition), and frequency. The most suitable decomposition of a given signal is then selected with respect to an entropy-based criterion.

Basis of speech analysis
At the present time, many otolaryngologists (ORLs) use the software tools they have available in order to corroborate the diagnosis of vocal cord pathologies by means of objective parameters. These parameters complete the information gathered by the specialist, which usually comprises: the images obtained from a stroboscope and several perceptual tests carried out on the patient.
Special attention needs to be paid to vocal cord cancer, that is to say, to its diagnosis, treatment, rehabilitation and monitoring, as this cancer can cause the death of the patient suffering from it. Once the cancer has been detected, the ORL specialist removes the patient's vocal cords. This means that the patient will no longer be able to produce what is called laryngeal voice and thus loses his/her speech.
After the operation, during rehabilitation, the patient begins the process of learning how to emit oesophageal voice: the voice produced by modulating air coming from the oesophagus. This enables the patient to communicate, albeit experiencing great difficulty to maintain fluent conversations, due to the poor quality of oesophageal voice. However, one of the major problems is that this type of oesophageal voice cannot be evaluated during the rehabilitation process as there is no application available on the market that can automatically obtain the previously mentioned acoustic parameters. The quality of oesophageal voice is so low that the algorithms obtaining the periodicity of the voice do not work properly, and thus measurements obtained by such software packs are not reliable.
Obviously, the accuracy of measurements made by the software pack presented in this work will also be applicable to less severe pathologies, such as polyps, nodules, hypo mobility of the vocal cords, etc. The deterioration of the voice in this type of pathology is also too high for the measurement of objective parameters to be precise. This means that these commercial software packs are not suitable for measuring these parameters in voices suffering from some kind of pathology. Being able to obtain accurate objective parameters is advantageous for the early detection of cancer in cases where the patient's laryngeal voice is of a very poor quality and has high noise levels [1].
The pitch, or fundamental frequency of speech, is the property of a sound or musical tone that is perceived through its frequency. Owing to the natural pseudo-periodicity of voiced speech, there are small variations between voice cycles that change their per-cycle frequency $f_i$, and the pitch is defined from these values.
Estimating the fundamental frequency has been a recurring issue in digital signal processing, because obtaining the time instants that define voice cycles is a very complex task. These cycles are used to obtain the per-cycle frequencies $f_i$. Calculating these instants is vitally important in acoustic parameterization, as it is the cornerstone of voice characterizations of this kind.
Jitter [2] is a parameter representing the variation of the fundamental frequency, that is, the variation of pitch from one voice cycle to the next. Specialists also commonly use the shimmer parameter [2], which represents the variation in the amplitude of voice cycle peaks. A voice produced through larynx modulation keeps the peak amplitude of its voice periods almost constant, so an increase in shimmer can be a symptom of a voice disorder. Tables 1 and 2 present the various mathematical definitions of the jitter and shimmer objective parameters.
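As a rough illustration of how such perturbation measures are computed once the voice cycles have been marked, the sketch below uses the widely used "local" definition: the mean absolute difference between consecutive cycles, relative to the mean value. This is an assumption, and the definitions in Tables 1 and 2 may differ. The cycle periods and peak amplitudes are invented example values, and `local_perturbation` is a hypothetical helper name.

```python
def local_perturbation(values):
    """Mean absolute cycle-to-cycle difference, as a percentage of the mean value."""
    if len(values) < 2:
        raise ValueError("at least two voice cycles are required")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical measurements taken from five consecutive voice cycles.
periods_ms = [10.0, 10.2, 9.9, 10.1, 9.8]          # cycle durations
peak_amplitudes = [0.80, 0.78, 0.82, 0.79, 0.81]   # cycle peak amplitudes

jitter_percent = local_perturbation(periods_ms)
shimmer_percent = local_perturbation(peak_amplitudes)
print(jitter_percent, shimmer_percent)
```

Both measures depend entirely on where the cycle boundaries are placed, which is why an error in cycle marking propagates directly into jitter and shimmer.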
As previously mentioned, a number of authors have written on the detection of voice cycles [3,4], and many highly detailed techniques can be found in the literature, such as time-domain estimators (zero-crossing rate [5]), estimators of fundamental frequency [6,7], autocorrelation methods (the YIN estimator [8]), phase-space representation [9], the cepstrum [10] and statistical methods [11,12,13]. Some of these directly define voice cycles [3], whereas others use numerical approximations [8] to obtain fundamental frequency values.
In that respect, another step must be taken if we are to clearly identify the instants that define voice cycles.
However, none of these works were tried out on oesophageal voices and, what is more, it can be stated without a shadow of a doubt that these algorithms are not suitable for voices of this kind. The software pack presented here is a tool designed for use by specialists in otolaryngology, and is specifically designed to obtain objective voice parameters with excellent precision. The tool contains a basic algorithm to calculate the acoustic parameters related to speech periodicity and serves as an aid for not only diagnosis and rehabilitation but also for monitoring the patient.
It can be concluded that the tool is user-friendly and that ORL specialists can use it for measuring such objective parameters as pitch, jitter and shimmer, as well as for keeping patient records on these parameters.

Software interface
Speech signal processing plays an important role in digital signal processing research. Within this field, esophageal voices are being analyzed and transformed [2], [3], but their quality can only be measured with subjective criteria such as listening tests. This is because an evaluation based on objective parameters such as pitch, jitter, shimmer or the harmonics-to-noise ratio (HNR) demands high precision in locating the beginning and end of each cycle in the voice signal.
The oesophageal voice is generated by air passing through the oesophagus, without the possibility of modulation by the vocal folds, which have been removed, generally because of larynx cancer. As a result, its time-spectral characteristics are atypical and include noise, fundamental frequency asymmetry and poorly structured formants. This leads to wrong measurements in commercial applications, which makes it impossible to assess the quality of oesophageal voices with them. The same applies to voices with severe pathologies.
An algorithm is therefore needed that accurately calculates the marks corresponding to each cycle of an oesophageal or pathological voice signal, so that the pitch, and with it the jitter, shimmer and signal-to-noise ratio, can be measured accurately. This algorithm has been included in a software interface that lets users measure the acoustic parameters of the speech signal and plot the results. The interface is suitable for evaluating and comparing the original oesophageal speech signal with the signal processed by the wavelet transform.

System design
The system design has been divided into two parts: the algorithm for improving the quality of oesophageal speech using wavelet transform and the user interface including the speech signal processing using that algorithm and the acoustic analysis of speech parameters.

Algorithm using wavelets
As previously mentioned, wavelet packets are used to detect formant locations. This technique was chosen for its ability to split the speech signal into subbands, which allows the formant bands to be separated quite precisely.
The proposed method uses a two-stage analysis. First, a general analysis is applied over the whole spectrum to approximate the band in which the formant is located. Second, the formant location is determined more accurately within that band; the accuracy can be adjusted by introducing more analysis levels inside the approximated formant band.
The main advantage of this method is that it achieves a high frequency resolution without consuming excessive computational resources, which is crucial when implementing the algorithms in a real-time device such as a DSP.

Step 1: Band approximation of formants location
The first step consists of a rough analysis of the signal's wavelet packet tree. To locate the formant frequencies, the energy of each subband is analyzed; the maxima of this energy sequence determine the formant locations. The scheme of the process is shown in Figure 1.
First, the wavelet packet tree is calculated up to the desired level, which is chosen taking into account the sampling frequency and the resolution required.
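The relation between level, sampling frequency and resolution can be made concrete with a small calculation. The sketch assumes that each last-level node of a full tree of depth L covers a band of width (fs/2)/2^L; the function names and the 8 kHz / 20 Hz figures are hypothetical.

```python
import math

def band_width_hz(fs, level):
    """Width of one last-level node band for a full packet tree of the given depth."""
    return (fs / 2.0) / (2 ** level)

def packet_level(fs, resolution_hz):
    """Smallest tree depth whose node bands are no wider than the requested resolution."""
    return max(0, math.ceil(math.log2((fs / 2.0) / resolution_hz)))

fs = 8000.0                      # hypothetical sampling frequency in Hz
level = packet_level(fs, 20.0)   # hypothetical target resolution of 20 Hz
print(level, band_width_hz(fs, level))
```

Each extra level halves the band width, so a finer resolution requirement translates directly into a deeper tree.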
Once the wavelet packets have been obtained, the energy of each last-level node is calculated. The energies are stored in an array and their envelope is estimated. This envelope smooths the energy sequence so that its maxima can be located easily.
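The band approximation step can be sketched as below. This is a minimal illustration, not the authors' implementation. It assumes a Haar wavelet, a three-point weighted average as the envelope, and a single synthetic tone at 1312.5 Hz standing in for a formant. All function names are hypothetical.

```python
import math

def haar_split(x):
    """Split one node into its low-pass and high-pass children (Haar, subsampled by 2)."""
    s = 1.0 / math.sqrt(2.0)
    pairs = list(zip(x[0::2], x[1::2]))
    return [s * (a + b) for a, b in pairs], [s * (a - b) for a, b in pairs]

def packet_nodes(x, level):
    """Last-level wavelet packet nodes, ordered by increasing frequency."""
    nodes = [x]
    for _ in range(level):
        nxt = []
        for i, node in enumerate(nodes):
            lo, hi = haar_split(node)
            # Nodes at odd positions hold a mirrored spectrum, so their children swap places.
            nxt.extend([lo, hi] if i % 2 == 0 else [hi, lo])
        nodes = nxt
    return nodes

def envelope(e):
    """Smooth the energy array with weights 1/4, 1/2, 1/4 (edge values are repeated)."""
    padded = [e[0]] + list(e) + [e[-1]]
    return [0.25 * padded[i] + 0.5 * padded[i + 1] + 0.25 * padded[i + 2]
            for i in range(len(e))]

def approximate_band(x, fs, level):
    """Index and frequency limits (Hz) of the node where the smoothed energy peaks."""
    energies = [sum(c * c for c in node) for node in packet_nodes(x, level)]
    env = envelope(energies)
    k = max(range(len(env)), key=env.__getitem__)
    width = fs / 2.0 / len(env)
    return k, (k * width, (k + 1) * width)

fs = 8000.0  # hypothetical sampling frequency in Hz
tone = [math.sin(2.0 * math.pi * 1312.5 * n / fs) for n in range(2048)]
node_index, band_hz = approximate_band(tone, fs, 3)
print(node_index, band_hz)
```

With a level-3 tree the bands are 500 Hz wide, so this step only places the tone in the 1000 to 1500 Hz band. Real speech has several formants, which would call for locating several maxima of the envelope rather than a single one.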

Step 2: Adjustable resolution analysis
In the step explained above, an approximation to the formant frequencies was obtained.
As will be explained in the next section, the resolution obtained with this approximation, though better than that of conventional methods, may not be enough for some environments.
In order to achieve a finer resolution, an adjustable resolution analysis was designed. The scheme of this analysis is shown in Figure 1. The core idea of the designed technique is to obtain a higher resolution in the previously detected bands by dividing the selected nodes and their adjacent ones.
The main reason for using narrower bands is that the energy in wavelet packets spreads among several adjacent nodes. The solution is to divide the spectrum into bands narrow enough that the formant energy falls in a single node.
As can be seen in Figure 1, the first step of the algorithm consists of splitting the approximated formant bands and their adjacent bands as many times as necessary. Next, the energy of each node is calculated again and the maximum value is located; this value indicates the formant location.

The main advantage of this method is that it saves a lot of computational load while preserving a high level of accuracy. For example, if an 8-level basic tree is taken and its formant nodes are expanded by two levels, a 10-level resolution is obtained at roughly the computational load of an 8-level tree.
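The refinement can be sketched as follows: only the peak node of the basic tree and its two neighbours are split further, and the maximum energy among the resulting sub-nodes gives the refined location. This is a sketch under stated assumptions. It uses a Haar wavelet, a synthetic 1312.5 Hz tone in place of a formant, a small 3-level basic tree expanded by two levels, and raw node energies rather than their envelope to pick the peak. All names are hypothetical.

```python
import math

def haar_split(x):
    """Split one node into its low-pass and high-pass children (Haar, subsampled by 2)."""
    s = 1.0 / math.sqrt(2.0)
    pairs = list(zip(x[0::2], x[1::2]))
    return [s * (a + b) for a, b in pairs], [s * (a - b) for a, b in pairs]

def energy(v):
    return sum(c * c for c in v)

def expand(nodes, extra_levels):
    """Split (position, coefficients) nodes further, keeping frequency-ordered positions."""
    for _ in range(extra_levels):
        nxt = []
        for pos, coeffs in nodes:
            lo, hi = haar_split(coeffs)
            # Odd positions hold a mirrored spectrum, so their children swap places.
            first, second = (lo, hi) if pos % 2 == 0 else (hi, lo)
            nxt += [(2 * pos, first), (2 * pos + 1, second)]
        nodes = nxt
    return nodes

def refine_formant(x, fs, base_level, extra_levels):
    """Centre frequency and width (Hz) of the strongest sub-node near the energy peak."""
    base = expand([(0, x)], base_level)               # full basic tree
    peak = max(base, key=lambda n: energy(n[1]))[0]   # band approximation
    selected = [n for n in base if abs(n[0] - peak) <= 1]
    fine = expand(selected, extra_levels)             # peak node and its neighbours only
    best = max(fine, key=lambda n: energy(n[1]))[0]
    width = fs / 2.0 / 2 ** (base_level + extra_levels)
    return (best + 0.5) * width, width

fs = 8000.0  # hypothetical sampling frequency in Hz
tone = [math.sin(2.0 * math.pi * 1312.5 * n / fs) for n in range(2048)]
centre_hz, band_width_hz = refine_formant(tone, fs, 3, 2)
print(centre_hz, band_width_hz)
```

Here the band width drops from 500 Hz to 125 Hz while only three of the eight basic nodes are split further, which is the same trade-off as the 8-level tree with two-level expansion described above.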

User interface
Building on the algorithms described above, the authors have developed a tool called "PAS Voice". On start-up, the welcome screen is displayed:

Fig. 2. Welcome Screen
Once the application has been started up, the main screen is displayed. The following areas can be observed on this screen:
1. Menu: The program's general option menus can be identified in this area:
File.-Menu with the "Open file" option, which allows you to open a voice signal in order to process it. The signal has to be in .WAV, .AU or .AIFF format. Voice processing begins automatically once the file to be analyzed has been chosen.
Save Results.-This enables you to save the signal processing results; results from several sessions can be added for the same person or a new profile can be created. Once the results have been correctly saved, a graph is displayed showing the evolution of the parameters throughout all analysis sessions. When this graph is closed, an informative message on progress since the previous session is displayed.
Tools.-Tool menu for application configuration.
Language.-This allows the language of the program to be chosen (initially English and Spanish, although personalized translations can be applied). If the language is changed, the application has to be restarted.
Octave Path.-The octave.exe file, essential for running this program, can be specified using this option.
Help.-By clicking on this, help is provided for running the program.
2. Measurement area: In this area, once a voice signal has been processed (through the File/Open file option), the numerical measurements of pitch, jitter and shimmer are displayed. To see the measurements in graphic form against the normality threshold, click the "Vocaligram" button; it is only enabled once the processing needed to create the vocaligram has been performed. The vocaligram is a graphic representation with one measurement on each axis (in blue) superimposed over the threshold values for each parameter. The measurements are scaled so that abnormal values are always greater than the threshold (a value above the threshold implies that it is abnormal).
Play.-By pressing this button, the voice signal is played through the computer loudspeakers (if applicable).

Results
Tables 1 and 2 show the measurements of the first formant location for healthy (left) and esophageal (right) voices. Tables 3 and 4 show the absolute and relative measurement errors; the relative value is calculated by comparing the error with the average formant value. As can be seen in those tables, conventional methods obtain very poor results, with an average deviation of about 70 Hz, approximately the value of the pitch in esophageal speech. Such deviations may be unacceptable for applications that require great accuracy, so a new measurement method is necessary.
A simple wavelet algorithm with approximation to the formant band improves these results considerably, reducing the deviation by about 30%. This is a clear improvement over LPC, but higher resolution can be obtained without substantially increasing the computational cost. The results of the adjustable resolution algorithm show that the average deviation can be reduced by up to 50%.
The values obtained show that it is feasible to locate formant positions with small errors using efficient algorithms. This constitutes a fundamental advance in esophageal speech regeneration, because formant location is of great importance in many speech processing algorithms. Taking previous work of the research group as an example, an algorithm such as the one presented in [2] would obtain much better results with more accurate formant location estimates.
It is important to highlight the relevance that these results may have in other speech technology fields, such as speech recognition. The applications of this analysis are therefore not restricted to esophageal speech processing; it can be used for many other purposes.
In our case we are going to create a new profile. As the name "Example" does not exist, typing it out completely and clicking on "Save Results…" creates the new name and saves the data. No results will appear, as this is the patient's first session. We could also have added results for a patient who is not coming for the first time. We choose an already existing patient, "Man 1", by selecting the name from the list and clicking on "Save Results…". A graph showing all the results saved to date from previous sessions is provided (pitch information is separated from that on jitter and shimmer, as they are in different units).
Recent scientific progress has made it possible to take great steps forward in fields of major interest such as biomedical engineering. In this area, the application of new technologies becomes essential in order to improve techniques for the diagnosis, treatment and rehabilitation of certain medical pathologies. However, there are also groups suffering from illnesses or treatments that affect only a minority of people. This usually implies that the level of technological development of the resources used for these pathologies is well behind that for more common disorders.

Speech
Laryngectomized people are those who, for various reasons, have had to undergo surgery to remove their larynx, vocal cords, epiglottis and the cartilages surrounding the larynx. These elements are of vital importance for the generation of speech, as they form part of the phonatory apparatus. Their removal therefore seriously affects the quality of speech.
The issue of treating a barely intelligible voice is also of great use from the point of view of the patient's psychology. We have noticed that a high proportion of the laryngectomized feel embarrassed when using this voice, particularly women, who would rather not speak than do so with oesophageal voice, as they consider it unfeminine.
The results obtained from this research work have been useful mainly due to the IT contribution involving the design, development and implementation of a software application specifically intended for the assessment of laryngectomized voices, with a view to correct medical monitoring that makes it possible to measure evolution and prevent relapses. In order to verify improvement in the quality of oesophageal voices, a database containing several phonemes from all kinds of voices was used; these voices, both pathological and healthy, were recorded with the help of members of the Asociación Vizcaína de Laringectomizados.
Future work deriving from this research includes, most importantly, the incorporation of functionalities for vocal recognition and synthesis of phrases, as well as implementing the digital signal processing algorithms developed in systems based on cell phones and PDAs; all this with the goal of improving the laryngectomized's quality of life.

Conclusions
Given the relevance of the wavelet transform for the analysis and processing of esophageal speech, and assuming that the final goal is an implementation on a DSP-based hardware device with very strict real-time requirements, a significant optimization of computing resources has been achieved, together with a reduction in code length, in order to minimize computational load. It is also important to highlight that the wavelet calculations obtained can be reused in later processing.
These advantages are achieved through a preprocessing algorithm which, although wavelet-based, includes two improvements: first, an approximation to the formant subband; and second, an adjustable resolution applied over the bands among which the formant energy is shared.
Moreover, the proposed algorithm makes it possible to optimize previous research work on the treatment of the poles of the LPC system that models esophageal speech. Given the accuracy obtained, it is reasonable to expect an improvement in results if this technique is used as the first stage of the whole algorithm.