Recording environments for microphone and accelerometer
1. Introduction
Speech communication can be impaired by the wide range of noise conditions present in air. Researchers in the field of speech applications have been investigating how to improve the performance of signal extraction and speech recognition under such conditions. However, it is not yet possible to measure clear speech in environments with low signal-to-noise ratios (SNR) of about 0 dB or less (H. Hirsch and D. Pearce, 2000). Standard evaluation frameworks such as CENSREC (N. Kitaoka et al., 2006) and AURORA (H. Hirsch and D. Pearce, 2000) are typically used to evaluate speech recognition performance in noisy environments; they have shown that recognition rates fall to approximately 50-80% under the influence of noise, demonstrating the difficulty of achieving high rates. Against this background, many signal extraction and retrieval methods have been proposed in previous research. One approach to signal extraction is body-conducted speech (BCS), which is little influenced by airborne noise but lacks frequency components above 2 kHz. Moreover, many retrieval methods require normal speech or parameters measured simultaneously with the body-conducted speech, and these cannot be measured in noisy environments. The authors have therefore been investigating the use of body-conducted speech, generally called bone-conducted speech, in which the signal is conducted through the skin and bone of the human body (S. Ishimitsu, 2008) (M. Nakayama et al., 2011). Conventional methods for retrieving the sound quality of body-conducted speech include the Modulation Transfer Function (MTF), linear predictive coefficients (LPC), direct filtering and the use of a throat microphone (T. Tamiya and T. Shimamura, 2006) (T. T. Vu et al., 2006) (Z. Liu et al., 2004) (S. Dupont et al., 2004).
As state-of-the-art research, the field has expanded to speech communication between a patient and an operator in a Magnetic Resonance Imaging (MRI) room, a noisy sound environment with a strong magnetic field (A. Moelker et al., 2005). Conventional microphones, such as accelerometers composed of magnetic materials, are not allowed in this environment, which requires a special microphone made of non-magnetic material.
For this environment the authors proposed a speech communication system that uses a BCS microphone with an optical fiber Bragg grating (OFBG microphone) (M. Nakayama et al., 2011). Composed only of non-magnetic materials, it is suitable for the environment and should provide clear signals when combined with our retrieval method. Previous research using an OFBG microphone demonstrated the effectiveness and performance of signal extraction in an MRI room, where speech recognition performance was evaluated using an acoustic model constructed from the normal speech of unspecified speakers (M. Nakayama et al., 2011). It was concluded that an OFBG microphone can produce a clear signal, with improved performance against an acoustic model built from unspecified speech. The original signal of an OFBG microphone enabled conversation, although some stress was felt because its sound quality was low. One aim of this research is therefore to improve that quality with our retrieval method, which combines differential acceleration and noise reduction.
In this chapter, experiments and discussions are presented for body-conducted speech measured with an accelerometer and an OFBG microphone and processed with the proposed method, as a state-of-the-art topic in the field of signal extraction under noisy environments. In particular, the chapter focuses on evaluating the microphones, retrieving signals with the method, and applying the method to sentence-length signals to estimate and recover sound quality.
2. Speech and body-conducted speech
2.1. Conventional body-conducted speech microphone
Speech as air-conducted sound is easily affected by surrounding noise. In contrast, body-conducted speech is solid-propagated sound and thus less affected by noise. A word was uttered by a 20-year-old male in a quiet room. Table 1 details the recording environment for the microphone and accelerometer employed in this research. Speech is measured 30 cm from the mouth using a microphone, and body-conducted speech is extracted from the upper lip using the accelerometer as the conventional BCS microphone, as shown in Figure 1. This microphone position is the one commonly used for the speech input of a car navigation system. The upper lip, as a signal-extraction position, provides the best cepstral coefficients as feature parameters for speech recognition (S. Ishimitsu et al., 2004). Figures 2 and 3 show the uttered word "Asahi" in a quiet room, taken from the JEIDA database, which contains 100 local place names (S. Itahashi, 1991). Speech yields a clear signal in its frequency characteristics, whereas body-conducted speech lacks high-frequency components above 2 kHz, so recognition performance is reduced when the signal is used directly.
Recorder | TEAC RD-200T |
Microphone | Ono Sokki MI-1431 |
Microphone amplifier | Ono Sokki SR-2200 |
Microphone position | 30 cm (between mouth and microphone) |
Accelerometer | Ono Sokki NP-2110 |
Accelerometer amplifier | Ono Sokki PS-602 |
Accelerometer position | Upper lip |
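The missing high-frequency content noted above can be quantified as the fraction of spectral energy above a cutoff. A minimal Python sketch, where the function name and the 2 kHz cutoff are illustrative choices, not from the chapter:

```python
import numpy as np

def high_band_ratio(signal, fs, cutoff_hz=2000.0):
    """Fraction of spectral energy at or above cutoff_hz (0..1)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    total = spectrum.sum()
    if total == 0:
        return 0.0
    return spectrum[freqs >= cutoff_hz].sum() / total

# Example: a 3 kHz tone carries all of its energy above 2 kHz,
# while a 500 Hz tone carries none.
fs = 16000
t = np.arange(fs) / fs
assert high_band_ratio(np.sin(2 * np.pi * 3000 * t), fs) > 0.99
assert high_band_ratio(np.sin(2 * np.pi * 500 * t), fs) < 0.01
```

Applied to recorded speech and BCS frames, this ratio would be markedly smaller for the accelerometer signal, reflecting the absent components above 2 kHz.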
2.2. Optical Fiber Bragg Grating microphone
To extend testing to scenarios in which noise is generated together with a strong magnetic field, such as communication between a patient and an operator in an MRI room, an OFBG microphone is employed to record body-conducted speech there, because it can measure a clearer signal than an accelerometer and can be used in an environment with a strong magnetic field. The effectiveness of the microphone was examined in an MRI room in which a magnetic field is produced by an open-type magnetic resonance imaging system. Tables 2 and 3 detail the recording environment for the OFBG microphone, which is shown in Figure 4. Noise levels in the room could not be measured at the recording point, such as at the mouth of the speaker, because the sound-level meter, being composed of magnetic materials, was not permitted in the room. Therefore, the noise level was measured at the entrance of the room, and consequently may be higher than the noise level at the signal recording point; it is given in Table 2. Owing to patient discomfort during the recordings, only 20 words and 5 sentences were recorded in the room; the recording scene is shown in Figure 5. Figure 6 shows the body-conducted speech recorded with the OFBG microphone in the room while the MRI was active. The signal is clearer than body-conducted speech measured by an accelerometer, because frequency components above 2 kHz are present.
MRI model | HITACHI AIRIS II |
Environment | MRI (OFF): 61.6 dB SPL |
MRI (ON): 81.1 dB SPL | |
Speakers | two males (22 and 23 years old) |
| two females (23 and 24 years old) |
Vocabulary | twenty words × two sets: JEIDA 100 local place names |
| five sentences × three sets: ATR database sentences |
Device name | Type name |
Pickup | Optoacoustics Optimic4130 |
Optical-electronic conversion device | Optoacoustics EOU200 |
Recorder | TEAC LX-10 |
3. Speech recognition with OFBG microphone
The quality of the signal recorded with the OFBG microphone is higher than that of BCS recorded with an accelerometer. Generally, the quality of speech sound is evaluated by a mean opinion score from 1 to 5; however, this requires a large amount of evaluation data to achieve adequate significance levels. For this reason, sound quality is evaluated here through speech recognition, using acoustic models estimated from the speech of unspecified speakers, with recognition performance as the result. In speech recognition, the best candidate is chosen on the basis of likelihoods derived from acoustic models and feature parameters, such as cepstral parameters, calculated from the recorded speech (D. Li and D. O'Shaughnessy, 2003) (L. Rabiner, 1993). The recognition performances and likelihoods are thus statistical results, since human errors and other factors are not considered.
3.1. Experimental conditions
Table 4 shows the experimental conditions for isolated-word recognition. The experiment employs Julius, a speech-recognition decoder, which is a large-vocabulary continuous-speech recognition system for the Japanese language (T. Kawahara et al., 1999) (A. Lee et al., 2001). The decoder requires a dictionary, acoustic models and language models. The dictionary describes the connections of sub-words, such as phonemes and syllables, within each word; the sub-words correspond to the acoustic models. Language models give the probability of the present word given the former word in corpora. The purpose of the experiment is only to evaluate the clarity of the signals, that is, their similarity to the acoustic models. Since language models are not required in this experiment, Julian version 3.4.2, a variant for isolated-word recognition, is used. The experiments use the same acoustic models, estimated by HTK with the JNAS corpus, to evaluate the closeness of the signals when the highest recognition performance is achieved (S. Young et al., 2000) (K. Itou et al., 1999).
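The feature vectors used here (Table 4) combine 12 static MFCCs with 12 delta MFCCs and a delta power term; the delta coefficients are obtained by HTK-style linear regression over neighboring frames. A minimal sketch of that delta computation, assuming the common ±2-frame window (the window size in HTK is a configuration parameter):

```python
import numpy as np

def delta_features(feats, window=2):
    """HTK-style delta (regression) coefficients over a frame sequence.
    feats: (num_frames, dim) array of static features (e.g. 12 MFCCs)."""
    num_frames, dim = feats.shape
    denom = 2 * sum(th * th for th in range(1, window + 1))
    # Pad by repeating edge frames so every frame has a full window.
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    deltas = np.zeros_like(feats)
    for th in range(1, window + 1):
        deltas += th * (padded[window + th: window + th + num_frames]
                        - padded[window - th: window - th + num_frames])
    return deltas / denom

# Example: for features increasing linearly by 1 per frame,
# interior delta coefficients equal the slope (1.0).
ramp = np.outer(np.arange(10.0), np.ones(3))
assert np.allclose(delta_features(ramp)[2:-2], 1.0)
```

The 25-dimensional vector is then the concatenation of the 12 static MFCCs, their 12 deltas, and the delta of the log power.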
3.2. Experimental results
Table 5 shows the isolated-word recognition results for each data set, and Table 6 gives the average recognition result for each speaker. The recognition results for the OFBG microphone are superior to those for the conventional BCS microphone, with differences in isolated-word recognition rates of about 15% to 35%. These results show the effectiveness of the OFBG microphone, which measures clear signals.
Speaker | two males (22 and 23 years old), two females (23 and 24 years old) |
Number of datasets | 20 words × three sets/person |
Vocabulary | JEIDA 100 local place names |
Recognition system | Julian 3.4.2 |
Acoustic model | gender-dependent triphone model |
Model conditions | 16 mixture Gaussian, clustered 3000 states |
Feature vectors | MFCC(12)+ΔMFCC(12)+ΔPow(1)=25 dim. |
Training condition | more than 20,000 samples, JNAS with HTK 2.0 |
Speaker | MRI off, set 1 | MRI off, set 2 | MRI off, set 3 | MRI on, set 1 | MRI on, set 2 | MRI on, set 3 |
Male 1 | 85% | 80% | 90% | 30% | 40% | 50% |
Male 2 | 90% | 75% | 85% | 50% | 60% | 60% |
Female 1 | 35% | 35% | 35% | 20% | 20% | 20% |
Female 2 | 80% | 70% | 70% | 75% | 70% | 75% |
Speaker | MRI off | MRI on |
Male 1 | 85.0% | 40.0% |
Male 2 | 83.3% | 56.7% |
Female 1 | 35.0% | 20.0% |
Female 2 | 73.3% | 73.3% |
4. Improvement on sound quality of body-conducted speech in word unit
The OFBG microphone can measure a high-quality signal compared with the BCS of an accelerometer. To realize conversation without stress, signals with improved sound quality are required. Consequently, one aim of this research is to devise and examine a method for improving sound quality. Many of the studies already introduced in the Introduction overlook the fact that BCS lacks frequency components of 2 kHz and higher. Mindful of this condition, conventional retrieval methods for BCS that need the speech or its parameters have been proposed and investigated; however, speech is not easily measured in noisy environments. A signal retrieval method for BCS should therefore perform well using the BCS alone. To realize this idea, a signal retrieval method was devised that needs neither the speech nor other parameters, because effective frequency components above 2 kHz do exist in the signal, although at very low gain.
4.1. Differential acceleration
Formula (1) shows the equation for estimating the differential acceleration from the original BCS.
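Formula (1) itself does not appear in this text, so the following is a hedged sketch that assumes the differential acceleration is the first-order time difference of the BCS signal; such a difference emphasizes high-frequency components (its response rises with frequency), which is consistent with recovering content above 2 kHz:

```python
import numpy as np

def differential_acceleration(bcs, fs):
    """First-order backward time difference of a body-conducted
    speech signal, scaled by the sampling rate.

    Assumption: Formula (1) is modeled here as a simple backward
    difference; the exact form in the chapter may differ.
    """
    diff = np.empty_like(np.asarray(bcs, dtype=float))
    diff[0] = 0.0
    diff[1:] = (bcs[1:] - bcs[:-1]) * fs
    return diff
```

For a linear ramp sampled at `fs`, the output is the constant slope, confirming the derivative-like behavior.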
4.2. Noise reduction method
As a first approach to noise reduction, the effectiveness of a spectral subtraction method was examined for the reduction of stationary noise. However, the improvement in the frequency components was inadequate with this approach. Spectral subtraction simply subtracts the noise spectrum, whereas a Wiener-filtering method can estimate the spectral envelope of speech using linear prediction coefficients. Therefore, extraction of a clear signal was attempted using the Wiener-filtering method, which can estimate and recover the effective frequency components from noisy speech. Formula (2) shows the equation used for the Wiener-filtering method.
An estimated spectrum of the clean signal is then obtained by applying this filter to the observed signal.
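Since Formula (2) is not reproduced in this text, the following sketch uses the standard Wiener gain H = Ps / (Ps + Pn), with the speech power spectrum Ps modeled per frame by a low-order LPC envelope and the noise power spectrum Pn assumed to be estimated beforehand from a noise-only segment. The order-1 LPC, 764-sample frame width and repeated application follow the settings described in the evaluations below; all function names are illustrative:

```python
import numpy as np

def lpc_order1(frame):
    """Order-1 LPC coefficient and squared gain (autocorrelation method)."""
    r0 = np.dot(frame, frame)
    r1 = np.dot(frame[:-1], frame[1:])
    a1 = -r1 / r0 if r0 > 0 else 0.0
    g2 = r0 + a1 * r1            # residual energy
    return a1, max(g2, 1e-12)

def wiener_retrieve(noisy, noise_psd, frame_len=764, iterations=5):
    """Iterative Wiener filtering of a (body-conducted) speech signal.

    Sketch under assumptions: speech PSD from an order-1 LPC envelope
    per frame, noise PSD given, gain H = Ps / (Ps + Pn), applied
    `iterations` times as the chapter repeats the procedure.
    """
    out = np.asarray(noisy, dtype=float).copy()
    n_bins = frame_len // 2 + 1
    w = 2 * np.pi * np.arange(n_bins) / frame_len
    for _ in range(iterations):
        for start in range(0, len(out) - frame_len + 1, frame_len):
            frame = out[start:start + frame_len]
            a1, g2 = lpc_order1(frame)
            # LPC spectral envelope: G^2 / |1 + a1 e^{-jw}|^2
            ps = g2 / (np.abs(1 + a1 * np.exp(-1j * w)) ** 2 + 1e-12)
            h = ps / (ps + noise_psd[:n_bins])
            spec = np.fft.rfft(frame)
            out[start:start + frame_len] = np.fft.irfft(h * spec, n=frame_len)
    return out
```

With a zero noise estimate the filter passes the signal unchanged, and a large noise estimate attenuates it, which is the expected limiting behavior of the Wiener gain.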
4.3. Evaluations
Signal retrieval for a signal measured by an OFBG microphone is performed using the same parameters as in the method, because the propagation path of body-conducted speech within the human body is not affected by either quiet or noisy environments. Figure 8 shows the signal retrieved from Figure 7 using the Wiener-filtering method, where the order of the linear prediction coefficients and autocorrelation functions is 1 and the frame width is 764 samples. The procedure was repeated five times on the signal to remove stationary noise. In the retrieved signal, high-frequency components of 2 kHz and above were recovered with these settings. The proposed method can therefore also be applied to obtain a clear signal from body-conducted speech measured with an OFBG microphone in a noisy, high-magnetic-field environment.
5. Improvement on sound quality of body-conducted speech in sentence unit
The effectiveness of signal retrieval for word-unit body-conducted speech measured by an accelerometer and an OFBG microphone has been demonstrated in the former sections. Although the effectiveness for word units has been proven, sentence-unit signals need to be examined for practical use, such as conversation in noisy environments. The investigation of sentence units is an important evaluation, as it could revolutionize speech communication in such environments. As a first step in sentence-unit signal retrieval, the method used for word-unit signals is adopted, because the transfer function between the microphone and the sound source appears to change little between word and sentence units; body-conducted speech in sentence units measured directly by an accelerometer and an OFBG microphone is then examined.
5.1. Body-conducted speech from an accelerometer
In the experiments on signal retrieval using an accelerometer, speech and body-conducted speech were measured in a quiet room of our laboratory and in the engine room of the training ship of the Oshima National College of Maritime Technology, a noisy environment with a working main engine and two generators, shown in Figures 9 (a) and (b). The recording environment of Table 1 is used again, although the speaker differs from the speaker in the former section. Noise levels within the engine room under the two conditions of anchorage and cruising were 93 and 98 dB SPL, respectively, and the SNRs measured at the microphone were -20 and -25 dB, respectively. In this research, the signal recorded under the cruising condition is used to estimate the retrieved signals.
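SNR figures such as the -20 and -25 dB quoted above follow from the ratio of signal power to noise power. A minimal sketch, assuming a speech segment and a noise-only segment of the recording are available separately (function name is illustrative):

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB, from a signal segment and a
    noise-only segment of a recording."""
    ps = np.mean(np.asarray(signal, dtype=float) ** 2)
    pn = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(ps / pn)
```

For example, a noise waveform with ten times the amplitude of the speech yields an SNR of -20 dB, matching the order of magnitude reported for the anchorage condition.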
A 22-year-old male uttered sentence A01 from the ATR503 sentence database; this sentence is commonly used in speech recognition and its applications (M. Abe et al., 1991). The sentence is composed of the following sub-word (mora) sequence:
/a/ /ra/ /yu/ /ru/ /ge/ /N/ /ji/ /tsu/ /wo/ /su/ /be/ /te/ /ji/ /bu/ /N/ /no/ /ho/ /u/ /he/ /ne/ /ji/ /ma/ /ge/ /ta/ /no/ /da/
Figures 10 and 11 show speech and body-conducted speech in sentence units measured by the conventional microphone and the accelerometer in a quiet room while the 22-year-old male uttered the sentence. Although the accelerometer is held with the fingers, the sounds are measured clearly because it was held firmly to the upper lip with suitable pressure. Figure 12 shows the differential acceleration from Figure 11; it becomes a clear signal with little noise because the BCS has a high SNR.
Figures 13 and 14 show speech and body-conducted speech in sentence units in the noisy environment. The speech is completely swamped by the intense noise from the engine and generators. The body-conducted speech in Figure 14, on the other hand, is affected only a little by the noise and can still be measured. Because the signal in Figure 14 has a low SNR, the performance of signal retrieval from the differential acceleration in Figure 15 is considered to be reduced. Figure 16 shows that signal retrieval from the differential acceleration works well when the procedure is applied four times, since this is sufficient to recover the frequency characteristics. As a result, it is concluded that body-conducted speech can be made as clear as possible without noise disturbance.
5.2. Body-conducted speech from OFBG microphone
The quality of the signal measured by the OFBG microphone in the noisy environment of an MRI room was investigated here. A speaker uttered sentence A01 during the operation of the MRI devices, giving an 81 dB SPL noise environment; since a sound-level meter was not permitted in the room, the level was measured in front of the gate door of the room. Figure 17 shows the uttered sentence recorded by the OFBG microphone in the MRI room while the MRI equipment was in operation. Since the signal is clear, it is expected that the frequency characteristics of the signal can be recovered using the signal retrieval method. Figures 18 and 19 show the differential acceleration and the retrieved signal from the OFBG microphone, with the method applied three times. These figures confirm the improvement in the sound quality of sentence-unit BCS, and it is also concluded that retrieval works best when the original BCS has a high SNR.
6. Conclusions and future works
This chapter has presented improvements in the sound quality of body-conducted speech measured with an accelerometer and an OFBG microphone. In particular, an MRI room is a heavily noisy environment with a high magnetic field, into which an accelerometer, a conventional body-conducted speech microphone made from magnetic materials, cannot be brought. For conversation and communication between a patient and an operator in the room, an OFBG microphone is proposed, which can measure clear signals compared with an accelerometer.
The performance of the signal retrieval method on sentence-unit signals from both microphones, the accelerometer and the OFBG microphone, was then evaluated, and its effectiveness was confirmed with time-frequency analysis and speech recognition. Clear body-conducted speech in sentence units was estimated from the OFBG microphone with our signal retrieval method, which combines differential acceleration and noise reduction. Applying the method to the measured signals recovered their sound quality, which was evaluated using time-frequency analysis. The retrieval method can be applied to a signal measured by an OFBG microphone with the same settings, because its conduction path is not affected by the noise in the air. The signals were measured in quiet and noisy rooms, specifically an engine room and an MRI room, and were clearly obtained employing the signal retrieval method with the same settings used for the word unit as a first step. To obtain a clearer signal with the signal retrieval method, the pressure at which the microphone is held is important, and the original BCS should have a high SNR.
As future work, the signal retrieval method needs to be extended for practical use, and its algorithm needs further improvement.
Acknowledgement
The authors thank Mr. K. Oda, Mr. H. Nagoshi and his colleagues in Ishimitsu laboratory of Hiroshima City University, members of the Living Informatics Research Group, Health Research Institute, National Institute of Advanced Industrial Science and Technology (AIST) for their support in the signal recording, and crew members of the training ship, Oshima-maru, Oshima National College of Maritime Technology.
References
- 1. Lee, A., Kawahara, T. and Shikano, K. (2001). Julius: an open source real-time large-vocabulary recognition engine, in Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), pp. 1691-1694.
- 2. Moelker, A., Maas, R. A. J. J., Vogel, M. W., Ouhlous, M. and Pattynama, P. M. T. (2005). Importance of bone-conducted sound transmission on patient hearing in the MR scanner, Journal of Magnetic Resonance Imaging, 22(1), pp. 163-169.
- 3. Li, D. and O'Shaughnessy, D. (2003). Speech Processing: A Dynamic and Optimization-Oriented Approach, Marcel Dekker Inc.
- 4. Hirsch, H. and Pearce, D. (2000). The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in Proceedings of ISCA ITRW ASR2000, pp. 181-188.
- 5. Durbin, J. (1960). The Fitting of Time-Series Models, Review of the International Statistical Institute, 28(3), pp. 233-244.
- 6. Itou, K., Yamamoto, M., Takeda, K., Takezawa, T., Matsuoka, T., Kobayashi, T., Shikano, K. and Itahashi, S. (1999). JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research, Journal of the Acoustical Society of Japan (E), 20(3), pp. 199-206.
- 7. Rabiner, L. (1993). Fundamentals of Speech Recognition, Prentice Hall.
- 8. Abe, M., Sagisaka, Y., Umeda, T. and Kuwabara, H. (1990). Manual of Japanese Speech Database, ATR.
- 9. Nakayama, M., Ishimitsu, S. and Nakagawa, S. (2011). A study of making clear body-conducted speech using differential acceleration, IEEJ Transactions on Electrical and Electronic Engineering, 6(2), pp. 144-150.
- 10. Nakayama, M., Ishimitsu, S., Nagoshi, H., Nakagawa, S. and Fukui, K. (2011). Body-conducted speech microphone using an Optical Fiber Bragg Grating for high magnetic field and noisy environments, in Proceedings of Forum Acusticum 2011.
- 11. Kitaoka, N., Yamada, T., Tsuge, S., Miyajima, C., Nishiura, T., Nakayama, M., Denda, Y., Fujimoto, M., Yamamoto, K., Takiguchi, T., Kuroiwa, S., Takeda, K. and Nakamura, S. (2006). CENSREC-1-C: development of evaluation framework for voice activity detection under noisy environment, IPSJ SIG Technical Report, 2006-SLP-63, pp. 1-6.
- 12. Dupont, S., Ris, C. and Bachelart, D. (2004). Combined use of close-talk and throat microphones for improved speech recognition under non-stationary background noise, in Proceedings of the COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, paper 31.
- 13. Ishimitsu, S. (2008). Construction of a Noise-Robust Body-Conducted Speech Recognition System, in Speech Recognition, IN-TECH.
- 14. Ishimitsu, S., Kitakaze, H., Tsuchibushi, Y., Yanagawa, H. and Fukushima, M. (2004). A noise-robust speech recognition system making use of body-conducted signals, Acoustical Science and Technology, 25(2), pp. 166-169.
- 15. Ishimitsu, S., Nakayama, M. and Murakami, Y. (2004). Study of Body-Conducted Speech Recognition for Support of Maritime Engine Operation, Journal of the JIME, 39(4), pp. 35-40 (in Japanese).
- 16. Itahashi, S. (1991). A noise database and Japanese common speech data corpus, Journal of the ASJ, 47(12), pp. 951-953.
- 17. Young, S., Jansen, J., Odell, J. and Woodland, P. (2000). The HTK Book, Cambridge University.
- 18. Kawahara, T., Lee, A., Kobayashi, T., Takeda, K., Minematsu, N., Itou, K., Ito, A., Yamamoto, M., Yamada, A., Utsuro, T. and Shikano, K. (1999). Japanese dictation toolkit: 1997 version, Journal of the ASJ, 20(3), pp. 233-239.
- 19. Vu, T. T., Unoki, M. and Akagi, M. (2006). A Study on Restoration of Bone-conducted Speech With LPC-Based Model, IEICE Technical Report, SP2005-174, pp. 67-78.
- 20. Tamiya, T. and Shimamura, T. (2006). Improvement of Body-Conducted Speech Quality by Adaptive Filters, IEICE Technical Report, SP2006-191, pp. 41-46.
- 21. Liu, Z., Zhang, Z., Acero, A., Droppo, J. and Huang, X. (2004). Direct Filtering for Air- and Bone-Conductive Microphones, in Proceedings of the IEEE International Workshop on Multimedia Signal Processing (MMSP'04), pp. 363-366.