
"Speech Technologies", book edited by Ivo Ipsic, ISBN 978-953-307-996-7, Published: June 13, 2011 under CC BY-NC-SA 3.0 license. © The Author(s).

Chapter 21

Speech Research in TUSUR

By Roman V. Meshchryakov
DOI: 10.5772/10568



1. Introduction

This article presents the results of speech research at TUSUR, which has been under way for more than 20 years. The main users of the results are medical institutions.

Research in speech technologies has been conducted at TUSUR for more than 20 years. The main directions are speaker identification, speech synthesis from printed text, and adaptive analysis of the speech signal for speech analysis and recognition tasks. Medical applications have important practical value: they support the speech therapist and are used in procedures for rehabilitating a person's voice and speech functions.

2. Processing of a speech signal

A hierarchical model is used to process the speech signal. Information is drawn from various levels of the speech system hierarchy, which makes it possible to use information presented at different scales for full-featured analysis and synthesis of the speech signal.

At each level, elements are identified and their properties and possible transformations are defined. In addition, elements at the lower levels of the hierarchy form, under certain conditions, configurations that serve as elements of the upper levels.

In this way, the regularity conditions of the configurations, which determine the computations, are formed. Conversely, during synthesis of a speech signal, elements of the upper levels are transformed into configurations of elements at the lower levels of the hierarchy.

2.1. Speech signal analysis

An acoustic model of the speech perception system was developed for studying the speech signal. It comprises an original mathematical model and processing algorithms. Simultaneous masking eliminates redundancy in the speech signal that is associated with the delayed response of the sensory system after peaks in the frequency domain. Frequency masking eliminates surplus spectral information that a person does not perceive because it is hidden behind other frequencies.

Together, simultaneous and frequency masking considerably reduce the volume of speech signal data to be processed without losing information, while improving its quality and informativeness.
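The chapter does not give the equations of its masking model. As a rough illustration of the idea of frequency masking only, the sketch below drops spectral components that fall under a triangular masking threshold spread around stronger components on the Bark scale. The function names, the 10 dB offset, and the 27 and 12 dB/Bark slopes are assumptions chosen for the example, not values from the TUSUR model.

```python
import math


def hz_to_bark(f_hz):
    """Convert frequency in Hz to the Bark scale (Zwicker-Terhardt approximation)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def audible_components(components, offset_db=10.0, lower_slope=27.0, upper_slope=12.0):
    """Keep the (freq_hz, level_db) components that no other component masks.

    Each component acts as a masker with a triangular threshold on the Bark
    scale: it falls by `upper_slope` dB/Bark towards higher frequencies and by
    `lower_slope` dB/Bark towards lower ones. All three parameters are
    illustrative assumptions.
    """
    kept = []
    for i, (freq, level) in enumerate(components):
        z = hz_to_bark(freq)
        masked = False
        for j, (masker_freq, masker_level) in enumerate(components):
            if i == j:
                continue
            dz = z - hz_to_bark(masker_freq)
            slope = upper_slope if dz > 0 else lower_slope
            threshold = masker_level - offset_db - slope * abs(dz)
            if level < threshold:
                masked = True
                break
        if not masked:
            kept.append((freq, level))
    return kept
```

With a strong 1000 Hz component at 70 dB, a weak neighbour at 1100 Hz and 40 dB is discarded, while an equally weak component at 4000 Hz is far enough away to survive.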

Fig. 1 shows results for a segment of a speech signal: FFT results on the left, and results obtained with the developed model on the right. The model makes it possible to segment the speech signal into voiced and unvoiced sequences, which are then used for the parametric description of the speech signal.

In addition, on voiced sequences the fundamental frequency is determined with very high accuracy compared with other algorithms.
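The chapter does not describe the algorithm behind this segmentation and fundamental frequency estimate. To make the task concrete, the sketch below uses plain normalized autocorrelation, a standard textbook technique swapped in here and not the model developed at TUSUR. The function names, the 60 to 400 Hz search range, and the 0.5 voicing threshold are assumptions made for the example.

```python
import math


def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0, voicing_threshold=0.5):
    """Return the fundamental frequency of a frame in Hz, or None if it looks unvoiced."""
    n = len(frame)
    if n == 0:
        return None
    mean = sum(frame) / n
    x = [s - mean for s in frame]
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), n - 1)
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        head, tail = x[: n - lag], x[lag:]
        norm = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        if norm == 0.0:
            continue  # silent segment, no correlation defined
        r = sum(a * b for a, b in zip(head, tail)) / norm
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag == 0 or best_r < voicing_threshold:
        return None  # treated as unvoiced (assumed threshold)
    return sample_rate / best_lag


def f0_track(signal, sample_rate, frame_len, hop):
    """Split the signal into frames; each entry is an F0 in Hz or None (unvoiced)."""
    return [
        estimate_f0(signal[start : start + frame_len], sample_rate)
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
```

Runs of consecutive non-None entries in the track correspond to voiced sequences, and runs of None to unvoiced ones. A practical system would add smoothing and sub-sample peak interpolation, which this sketch omits.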


Figure 1.

2.2. Medical research

Medical research is aimed at improving the quality of diagnosis and reducing the time needed for patient rehabilitation. The first information system developed shows the speech therapist the features of the speech signal (from its spectrum) and suggests possible disorders of the patient's speech production organs.

The second component makes it possible to form an esophageal voice using biofeedback. The patient's larynx is removed and artificial vocal folds are formed in the esophagus. Fig. 2 shows the anatomical features after removal of the larynx and the formation of a new vocal tract.


Figure 2.

Fig. 3 shows the basic elements of the man-machine system. The speech therapist sets a task for the patient. The patient trains to form or restore the voice function using a hardware-software complex. The hardware includes a microphone, through which data are entered, and a monitor, which displays the interaction with the patient and its results. The software and algorithms comprise the mathematical model and the functions for processing the speech signal and interacting with the patient, as well as the history of speech rehabilitation.

Using biofeedback, phonation is formed first; the patient is then trained to control the vocal folds (stability of fundamental frequency, phonation duration, etc.).

Figure 4 shows a photo from the Research Institute of Oncology of the Russian Academy of Medical Sciences, where the medical research is conducted. Professor Balatskaja L.N., Professor Chojnzonov L.T., and speech therapist Krasavina E.A. take an active part. At TUSUR, Bondarenko V.P., Kornilov A.J., and Kotsubinsky V.P. worked on this problem, and Konev A.A., Kostjuchenko E.J., and Kvasov A.N. are working on it now.

Figure 5 shows how the fundamental frequency characteristics change over the stages of voice rehabilitation: a) fundamental frequency, b) deviation of fundamental frequency, c) duration of the sound [a].


Figure 3.


Figure 4.


Figure 5.

Clearly, the basic characteristic under study is the fundamental frequency. Measurements were taken at the following stages of voice-restoration therapy:

  1. before the start of complex rehabilitation;

  2. at the start of complex rehabilitation;

  3. after completion of voice rehabilitation;

  4. 6 months after voice rehabilitation;

  5. 1 year after voice rehabilitation.

For patient M, Figure 5a shows the change of fundamental frequency over the stages, Figure 5b the deviations of fundamental frequency from its mean value, and Figure 5c the duration of the spoken sound [a]. This case reflects the most common kinds of change in the characteristics studied. The patient learns to say longer phrases and to control the fundamental frequency over a wide range. Clearly, the new methods of speech signal analysis are useful for tracking positive dynamics in voice rehabilitation.

Over the stages of voice-restoration therapy, the person forms a new mechanism of speech production. As noted above, when the larynx is removed the vocal folds are lost from the person's speech production tract. During voice rehabilitation, a pseudo-glottis is formed at the first physiological narrowing of the esophagus and subsequently serves as an analogue of the laryngeal vocal folds. One of the most important characteristics to study is the fundamental frequency, which reflects the oscillation frequency of the vocal folds and determines voice quality.

The effectiveness of voice function restoration was 92.6%, achieved within 8 to 22 days. The effectiveness of the developed technique is confirmed by objective methods of functional study: acoustic analysis, electromyography, and quality-of-life assessment.

The acoustic parameters of the esophageal voice are stored in a database after each voice training session. The spectral components of the formed voice are analyzed in detail and the dynamics of the main indicators are tracked. Phonation duration increased on average from 80 ms to 850 ms, depending on the number of training sessions. According to the training results, the fundamental frequency varied from 40 Hz to 120 Hz.
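The indicators tracked across sessions (fundamental frequency, its deviation from the mean, and phonation duration) can be derived from a per-frame fundamental frequency track. The sketch below is one minimal way to compute them, assuming a hypothetical track format in which each frame holds an F0 value in Hz or None when unvoiced; the function name and dictionary keys are illustrative and do not come from the system described here.

```python
import statistics


def session_summary(f0_track, frame_step_ms):
    """Summarize one training session from a per-frame F0 track.

    f0_track: list of F0 values in Hz, with None for unvoiced frames
              (assumed format).
    frame_step_ms: time between consecutive frames in milliseconds.
    Returns None when the track has no voiced frames.
    """
    voiced = [f for f in f0_track if f is not None]
    if not voiced:
        return None
    longest_run = current_run = 0
    for f in f0_track:
        current_run = current_run + 1 if f is not None else 0
        longest_run = max(longest_run, current_run)
    return {
        "mean_f0_hz": statistics.fmean(voiced),
        # Population standard deviation as the measure of F0 stability.
        "f0_deviation_hz": statistics.pstdev(voiced),
        # Longest uninterrupted voiced stretch, taken as phonation duration.
        "phonation_ms": longest_run * frame_step_ms,
    }
```

Storing one such summary per session would give the kind of stage-by-stage series plotted in Figure 5.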

2.3. Speech synthesis

A system model of human speech production is used for speech synthesis. The formation of the message by the person and the control of the speech production organs are modeled, including inhalation, exhalation, phonation, vocal fold oscillation, and changes in the articulatory organs.

The sound system of the Russian language is taken as the basis, and the transformation of information from the pragmatic level to the level of the physical signal is modeled. Each model used in speech synthesis imposes its own restrictions, because elements of various levels are extracted from the input text. The system then generates elements of a new type and their configurations. To realize intonation, the predicted change in prosody must first be generated, and its updating must then be controlled.

All external and internal data used in speech synthesis are shown in Fig. 6. The blocks on the left contain all the data on which configurations are built and features are defined. The middle part contains information on each transformation stage and the result of performing it. The blocks on the right are the rules of the language; they enter the blocks of the left and central parts directly in the form of tables, rules, regularity conditions, and processing algorithms. This scheme serves as the methodological basis for speech synthesis systems.

Because this model describes the predicted parameters of the speech signal generation model, some feedback loops are absent from the scheme. They are added at the stage of direct control of the lungs and the speech production tract during generation of the speech signal.


Figure 6.

Based on the scheme in Figure 6 and the practical results obtained, the following recommendations can be made on organizing knowledge bases for language systems:

  1. Any speech system should first be examined to identify the most informative blocks that influence the result. Based on this analysis, criteria should be created that rank information by significance. In particular, a criterion for the efficiency of the transmitted information should be used.

  2. All information needed for a transformation should be divided into base data (tables and dictionaries) and the rules by which the transformation is performed.

  3. Dictionaries are used for information that cannot be formalized as rules. The size of the dictionaries is limited by the required quality and the admissible number of objects. The given model uses 3 dictionaries and 5 sets of rules that take into account both the language and the parameters of the speaker.

  4. For information that is uniquely determined by an object, it is advisable to use tables from which the feature of a given object is looked up. A duration table is one example.

  5. The general parameters of the speech signal should be based on the physiological parameters of the human speech production system. This makes it possible to tune the speech synthesis system to a particular speaker and to obtain natural-sounding speech.

  6. Efficiency criteria must be introduced both at the intermediate stages and for the final result. They can be specified either explicitly or implicitly, in the form of restrictions. The result is a control system for producing the speech signal.
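Recommendations 2 to 4 amount to keeping tables, dictionaries, and rules as separate parts of the knowledge base. The sketch below shows one possible layout under that reading. The class, its field names, the transliterated units, and every duration value are hypothetical placeholders for illustration; they are not the 3 dictionaries and 5 rule sets of the model described above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rule = Callable[[List[str]], List[str]]


@dataclass
class KnowledgeBase:
    # Table: a feature uniquely determined by the object (unit -> duration).
    duration_table_ms: Dict[str, int]
    # Dictionary: cases that rules cannot capture (word -> units).
    exception_dictionary: Dict[str, List[str]]
    # Rules: transformations applied to everything else.
    rules: List[Rule] = field(default_factory=list)

    def to_units(self, word: str) -> List[str]:
        """Dictionary entries win; otherwise start from letters and apply rules."""
        if word in self.exception_dictionary:
            return list(self.exception_dictionary[word])
        units = list(word)  # trivial default: one unit per letter
        for rule in self.rules:
            units = rule(units)
        return units

    def durations(self, units: List[str], default_ms: int = 80) -> List[int]:
        """Look up each unit in the duration table (placeholder default)."""
        return [self.duration_table_ms.get(u, default_ms) for u in units]


def devoice_final(units: List[str]) -> List[str]:
    """Simplified rule: a word-final voiced obstruent becomes voiceless."""
    pairs = {"b": "p", "d": "t", "g": "k", "z": "s", "v": "f"}
    if units and units[-1] in pairs:
        return units[:-1] + [pairs[units[-1]]]
    return units
```

For example, with `devoice_final` registered as a rule, the transliterated word "sad" yields the units s, a, t, while an entry such as "chto" mapped to sh, t, o in the exception dictionary bypasses the rules entirely. Keeping the three stores apart means the dictionaries can be resized for the required quality without touching the rules.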

3. Conclusion

Effective algorithms are now being developed and implemented. Research on various languages is being conducted.


The team was headed for a long time by Professor Vladimir Bondarenko. Two students, Evgenie Kostjuchenko and Evgenie Shchipunov, received ASA grants (2008, 2010).

