Improvement on Sound Quality of the Body Conducted Speech from Optical Fiber Bragg Grating Microphone

Written By

Masashi Nakayama, Shunsuke Ishimitsu and Seiji Nakagawa

Submitted: 25 November 2011 Published: 28 November 2012

DOI: 10.5772/47844

From the Edited Volume

Modern Speech Recognition Approaches with Case Studies

Edited by S. Ramakrishnan

1. Introduction

Speech communication can be impaired by the wide range of noise conditions present in air. Researchers in the field of speech applications have been investigating how to improve the performance of signal extraction and recognition under such conditions. However, it is not yet possible to measure clear speech in environments with low signal-to-noise ratios (SNR) of about 0 dB or less (H. Hirsch and D. Pearce, 2000). Standard evaluation frameworks such as CENSREC (N. Kitaoka et al., 2006) and AURORA (H. Hirsch and D. Pearce, 2000) are typically used to evaluate speech recognition performance in noisy environments, and they have shown that speech recognition rates are approximately 50–80% under the influence of noise, which demonstrates the difficulty of achieving high recognition rates. Against this background, many signal extraction and retrieval methods have been proposed in previous research. One approach to signal extraction is body-conducted speech (BCS), which is little influenced by noise in air but lacks frequency components above 2 kHz. Conventional retrieval methods for improving the sound quality of body-conducted speech include the Modulation Transfer Function (MTF), Linear Predictive Coefficients (LPC), direct filtering and the use of a throat microphone (T. Tamiya, and T. Shimamura, 2006) (T. T. Vu et al., 2006) (Z. Liu et al., 2004) (S. Dupont, et al., 2004). However, these methods need normal speech, or parameters derived from it, measured simultaneously with body-conducted speech. Because such parameters cannot be measured in noisy environments, the authors have been investigating the use of body-conducted speech alone. Body-conducted speech is generally called bone-conducted speech, although the signal is conducted through the skin as well as the bone of the human body (S. Ishimitsu, 2008) (M. Nakayama et al., 2011).
As a state-of-the-art application, the research field has expanded to speech communication between a patient and an operator in a Magnetic Resonance Imaging (MRI) room, which is a noisy environment with a strong magnetic field (A. Moelker et al., 2005). Conventional microphones such as an accelerometer, which are composed of magnetic materials, are not allowed in this environment, so a special microphone made of non-magnetic material is required.

For this environment, the authors proposed a speech communication system that uses a BCS microphone with an optical fiber Bragg grating (OFBG microphone) (M. Nakayama et al., 2011). It is composed only of non-magnetic materials, is suitable for the environment and is expected to provide clear signals when combined with our retrieval method. Previous research using an OFBG microphone demonstrated the effectiveness and performance of signal extraction in an MRI room. Its speech recognition performance was evaluated using an acoustic model constructed from the normal speech of unspecified speakers (M. Nakayama et al., 2011). It was concluded that an OFBG microphone can produce a clear signal, as indicated by improved recognition performance with the acoustic model built from unspecified speakers. The original signal of the OFBG microphone enabled conversation; however, listeners felt some stress because its sound quality was low. Therefore, one of the research aims is to improve the quality with our retrieval method, which uses differential acceleration and noise reduction.

This chapter presents experiments and discussion on body-conducted speech that is measured with an accelerometer and an OFBG microphone and processed with the method, as one state-of-the-art topic in the research field of signal extraction under noisy environments. In particular, it investigates the evaluation of the microphones, signal retrieval with the method, and the application of the method to sentence-length signals for estimating and recovering sound quality.

2. Speech and body-conducted speech

2.1. Conventional body-conducted speech microphone

Speech, as air-conducted sound, is easily affected by surrounding noise. In contrast, body-conducted speech is solid-propagated sound and is thus less affected by noise. In this experiment, a word was uttered by a 20-year-old male in a quiet room. Table 1 details the recording environment for the microphone and accelerometer employed in this research. Speech is measured 30 cm from the mouth using a microphone, and body-conducted speech is extracted from the upper lip using the accelerometer shown in Figure 1, which serves as the conventional BCS microphone. This microphone position is commonly used for the speech input of a car navigation system. The upper lip, as a signal-extraction position, provides the best cepstral coefficients as feature parameters for speech recognition (S. Ishimitsu et al., 2004). Figures 2 and 3 show the uttered word “Asahi” recorded in a quiet room; the word is taken from the JEIDA database, which contains 100 local place names (S. Itahashi, 1991). The speech signal is clear across its frequency characteristics, whereas body-conducted speech lacks high-frequency components above 2 kHz. Recognition performance is therefore reduced when the body-conducted signal is used directly for recognition.

| Item | Specification |
|---|---|
| Recorder | TEAC RD-200T |
| Microphone | Ono Sokki MI-1431 |
| Microphone amplifier | Ono Sokki SR-2200 |
| Microphone position | 30 cm (between mouth and microphone) |
| Accelerometer | Ono Sokki NP-2110 |
| Accelerometer amplifier | Ono Sokki PS-602 |
| Accelerometer position | Upper lip |

Table 1.

Recording environments for microphone and accelerometer

Figure 1.

Accelerometer

Figure 2.

Speech from microphone in quiet

Figure 3.

BCS from accelerometer in quiet

2.2. Optical Fiber Bragg Grating microphone

To extend testing to scenarios in which noise is generated together with a strong magnetic field, such as communication between a patient and an operator in an MRI room, an OFBG microphone is employed to record body-conducted speech, because it can measure a clearer signal than an accelerometer and can be used in a strong magnetic field. The effectiveness of the microphone is examined in an MRI room in which the magnetic field is produced by an open-type magnetic resonance imaging system. Tables 2 and 3 detail the recording environment for the OFBG microphone, which is shown in Figure 4. The noise level could not be measured at the recording point, that is, near the mouth of the speaker, because a sound-level meter is composed of magnetic materials and was not permitted in the room. Therefore, the noise level was measured at the entrance of the room, and consequently may be higher than the noise level at the signal recording point; the noise level is given in Table 2. Owing to patient discomfort during the recordings, only 20 words and 5 sentences were recorded in the room; the recording scene is shown in Figure 5. Figure 6 shows the body-conducted speech recorded with the OFBG microphone in the room while the MRI system was operating. Compared with conventional BCS measured by an accelerometer, the signal is clearer because frequency components above 2 kHz can be observed.

Figure 4.

OFBG microphone

Figure 5.

Signal recording in an MRI room

Figure 6.

BCS from OFBG microphone

| Item | Details |
|---|---|
| MRI model | HITACHI AIRIS II |
| Environment | MRI (OFF): 61.6 dB SPL; MRI (ON): 81.1 dB SPL |
| Speakers | two males (22 and 23 years old); two females (23 and 24 years old) |
| Vocabulary | twenty words × two sets: JEIDA 100 local place names; five sentences × three sets: ATR database sentences |

Table 2.

Recording environment 1 for OFBG microphone

| Device name | Type name |
|---|---|
| Pickup | Optoacoustics Optimic4130 |
| Optical-electronic conversion device | Optoacoustics EOU200 |
| Recorder | TEAC LX-10 |

Table 3.

Recording environment 2 for OFBG microphone

3. Speech recognition with OFBG microphone

The quality of the signal recorded with the OFBG microphone is higher than that of BCS recorded with an accelerometer. Generally, the quality of speech sound is evaluated by the mean opinion score on a scale from 1 to 5; however, this requires a large amount of evaluation data to achieve adequate significance levels. For this reason, sound quality is evaluated here through speech recognition performance, using acoustic models estimated from the speech of unspecified speakers. In speech recognition, the best candidate is chosen according to likelihoods derived from acoustic models and feature parameters such as cepstral parameters, which are calculated from the recorded speech (D. Li, and D. O’Shaughnessy, 2003) (L. Rabiner, 1993). Consequently, the recognition performance and likelihoods are statistical results that are not subject to human error or other subjective factors.
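
The decision rule described above can be sketched as follows. This is only a minimal illustration: the function names, candidate words and log-likelihood values are hypothetical and invented for the example. In a real decoder such scores are computed from the acoustic models and the feature parameters of the recorded speech.

```python
def recognize(log_likelihoods):
    """Return the candidate word whose model gives the highest log-likelihood."""
    return max(log_likelihoods, key=log_likelihoods.get)


def recognition_rate(references, hypotheses):
    """Percentage of utterances whose recognized word matches the reference."""
    correct = sum(ref == hyp for ref, hyp in zip(references, hypotheses))
    return 100.0 * correct / len(references)


# Hypothetical scores for one utterance (values invented for illustration only).
scores = {"Asahi": -412.7, "Akashi": -431.2, "Asahikawa": -455.9}
best = recognize(scores)
```

Because the score depends only on the signal and the models, a signal that is closer to the normal speech used for training is expected to yield a higher recognition rate, which is why the rate can serve as a proxy for sound quality.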

3.1. Experimental conditions

Table 4 shows the experimental conditions for isolated-word recognition. The experiment employs the speech recognition decoder Julius, a large-vocabulary continuous-speech recognition system for the Japanese language (T. Kawahara et al., 1999) (A. Lee et al., 2001). The decoder requires a dictionary, acoustic models and language models. The dictionary describes how each word is composed of sub-words such as phonemes and syllables, which are the units of the acoustic models. Language models give the probability of the present word given the preceding word in corpora. The purpose of the experiment is only to evaluate the clarity of the signals, that is, their similarity to the acoustic models. Since language models are not required in this experiment, Julian version 3.4.2 is used for isolated-word recognition. The experiments use the same acoustic models, estimated by HTK with JNAS, so that a higher recognition performance indicates a signal closer to the models (S. Young et al., 2000) (K. Itou et al, 1999).

3.2. Experimental results

Table 5 shows the isolated-word recognition results for each data set, and Table 6 gives the average recognition result for each speaker. The recognition results for the OFBG microphone are found to be superior to those for the conventional BCS microphone. The differences in isolated-word recognition rates are about 15% to 35%. These results show the effectiveness of the OFBG microphone, with which clear signals can be measured.

| Item | Condition |
|---|---|
| Speaker | two males (22 and 23 years old); two females (23 and 24 years old) |
| Number of datasets | 20 words × three sets/person |
| Vocabulary | JEIDA 100 local place names |
| Recognition system | Julian 3.4.2 |
| Acoustic model | gender-dependent triphone model |
| Model conditions | 16 mixture Gaussian, clustered 3000 states |
| Feature vectors | MFCC(12) + ΔMFCC(12) + ΔPow(1) = 25 dim. |
| Training condition | more than 20,000 samples; JNAS with HTK 2.0 |

Table 4.

Experimental conditions for isolated word recognition

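| Speaker | MRI off, set 1 | MRI off, set 2 | MRI off, set 3 | MRI on, set 1 | MRI on, set 2 | MRI on, set 3 |
|---|---|---|---|---|---|---|
| Male 1 | 85% | 80% | 90% | 30% | 40% | 50% |
| Male 2 | 90% | 75% | 85% | 50% | 60% | 60% |
| Female 1 | 35% | 35% | 35% | 20% | 20% | 20% |
| Female 2 | 80% | 70% | 70% | 75% | 70% | 75% |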

Table 5.

Recognition results of isolated word recognition in each data set

| Speaker | MRI off | MRI on |
|---|---|---|
| Male 1 | 85.0% | 40.0% |
| Male 2 | 83.3% | 56.7% |
| Female 1 | 35.0% | 20.0% |
| Female 2 | 73.3% | 73.3% |

Table 6.

Averages of recognition results
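
As a cross-check, the per-speaker averages in Table 6 can be reproduced from the per-set rates in Table 5. The short sketch below copies the rates from Table 5; the variable and function names are chosen here for illustration only.

```python
# Recognition rates (%) per data set, copied from Table 5.
table5 = {
    "Male 1": {"MRI off": [85, 80, 90], "MRI on": [30, 40, 50]},
    "Male 2": {"MRI off": [90, 75, 85], "MRI on": [50, 60, 60]},
    "Female 1": {"MRI off": [35, 35, 35], "MRI on": [20, 20, 20]},
    "Female 2": {"MRI off": [80, 70, 70], "MRI on": [75, 70, 75]},
}


def average(rates):
    """Mean recognition rate, rounded to one decimal place as in Table 6."""
    return round(sum(rates) / len(rates), 1)


table6 = {
    speaker: {condition: average(rates) for condition, rates in conditions.items()}
    for speaker, conditions in table5.items()
}

for speaker, row in table6.items():
    print(f"{speaker}: MRI off {row['MRI off']:.1f}%, MRI on {row['MRI on']:.1f}%")
```

All rates in Table 5 are multiples of 5%, which is consistent with 20 words per set, where each word contributes 5 percentage points.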

4. Improvement on sound quality of body-conducted speech in word unit

The OFBG microphone can measure a higher-quality signal than BCS measured with an accelerometer. To realize stress-free conversation, signals with improved sound quality are required. Consequently, one aim of this research is to devise and examine a method for improving sound quality. Many of the studies introduced in the Introduction assume that BCS has no frequency components at 2 kHz and above. Under this assumption, conventional retrieval methods for BCS that need speech and its parameters have been proposed and investigated; however, speech is not easily measured in noisy environments. A signal retrieval method that works with BCS alone is therefore desirable. To realize this idea, a signal retrieval method that requires neither speech nor other parameters was devised, because effective frequency components above 2 kHz are in fact present in the signal, although with very low gain.

4.1. Differential acceleration

Formula (1) shows the equation for estimating the differential acceleration from the original BCS.

$$x_{\mathrm{differential}}(i) = x(i+1) - x(i) \qquad (1)$$

Here, $x_{\mathrm{differential}}(i)$ is the differential acceleration signal, which is calculated from each frame of the BCS signal $x(i)$. Because its amplitude is low, its gain must be adjusted to a suitable level for listening or processing. Figure 7 shows the differential acceleration estimated from Figure 6 using Formula (1), with the adjusted gain. The differential acceleration signal appears to be composed of speech mixed with stationary noise, so the noise is expected to be removable with a noise reduction method, because the signal has a high SNR compared to the original signal. Consequently, a signal estimation method that combines differential acceleration with a conventional noise reduction method is proposed (M. Nakayama et al., 2011).
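
A minimal sketch of Formula (1) with a gain adjustment is given below. The chapter does not state how the gain is adjusted, so peak normalization to a hypothetical `target_peak` is assumed here purely for illustration; the function name is likewise an assumption.

```python
def differential_acceleration(x, target_peak=1.0):
    """Forward difference x(i+1) - x(i), rescaled so that the largest
    absolute sample equals target_peak (assumed gain adjustment)."""
    diff = [x[i + 1] - x[i] for i in range(len(x) - 1)]
    peak = max((abs(v) for v in diff), default=0.0)
    if peak == 0.0:
        # Constant or too-short input: nothing to rescale.
        return diff
    gain = target_peak / peak
    return [v * gain for v in diff]


# Toy input; a real input would be the sampled BCS signal.
example = differential_acceleration([0.0, 1.0, 3.0, 6.0])
```

The forward difference has the magnitude response 2|sin(ω/2)|, which is small at low frequencies and grows toward the Nyquist frequency. This is consistent with the observation that the weak components above 2 kHz become visible after differencing, while the overall amplitude drops and has to be compensated.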

4.2. Noise reduction method

As a first approach to noise reduction, the effectiveness of a spectral subtraction method for reducing stationary noise was examined. However, the improvement in the frequency components was inadequate with this approach. Spectral subtraction simply subtracts the noise spectrum, whereas a Wiener-filtering method is expected to estimate the spectral envelope of speech using linear prediction coefficients. Therefore, the Wiener-filtering method, which can estimate and recover the effective frequency components from noisy speech, was used to extract a clear signal. Formula (2) shows the equation used for the Wiener-filtering method.

$$H_{\mathrm{Estimate}}(\omega) = \frac{H_{\mathrm{Speech}}(\omega)}{H_{\mathrm{Speech}}(\omega) + H_{\mathrm{Noise}}(\omega)} \qquad (2)$$

The estimated spectrum $H_{\mathrm{Estimate}}(\omega)$, obtained from the differential acceleration signal, can be converted to a retrieval signal. It is calculated from the speech spectrum $H_{\mathrm{Speech}}(\omega)$ and the noise spectrum $H_{\mathrm{Noise}}(\omega)$. In particular, $H_{\mathrm{Speech}}(\omega)$ is calculated with autocorrelation functions and linear prediction coefficients using the Levinson-Durbin algorithm (J. Durbin, 1960), and $H_{\mathrm{Noise}}(\omega)$ is then estimated using autocorrelation functions.
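
The following sketch shows one way the quantities in Formula (2) could be computed for a single frame. It is an illustration under stated assumptions, not the authors' implementation: the speech spectrum is taken as the all-pole LPC envelope obtained with the Levinson-Durbin recursion, the noise spectrum is taken as the power spectrum of truncated autocorrelation values (floored to stay positive), and the two test frames are synthetic. All function names are hypothetical.

```python
import cmath
import math


def autocorrelation(frame, max_lag):
    """Raw autocorrelation sums r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (a, err): prediction-error filter a (a[0] == 1) and residual energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:  # silent or perfectly predictable frame: stop early
            break
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        reflection = -acc / err
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + reflection * prev[m - k]
        a[m] = reflection
        err *= 1.0 - reflection * reflection
    return a, err


def lpc_envelope(a, err, n_bins):
    """All-pole envelope err / |A(e^{jw})|^2 on n_bins points in [0, pi]."""
    envelope = []
    for b in range(n_bins):
        w = math.pi * b / (n_bins - 1)
        denom = abs(sum(a[k] * cmath.exp(-1j * w * k) for k in range(len(a)))) ** 2
        envelope.append(err / max(denom, 1e-12))
    return envelope


def autocorr_spectrum(r, n_bins, floor=1e-12):
    """Power spectrum r[0] + 2*sum(r[k]*cos(w*k)), floored to stay positive."""
    spectrum = []
    for b in range(n_bins):
        w = math.pi * b / (n_bins - 1)
        value = r[0] + 2.0 * sum(r[k] * math.cos(w * k) for k in range(1, len(r)))
        spectrum.append(max(value, floor))
    return spectrum


def wiener_gain(speech_spec, noise_spec):
    """Formula (2), evaluated bin by bin."""
    return [s / (s + n) if s + n > 0.0 else 0.0
            for s, n in zip(speech_spec, noise_spec)]


# Synthetic frames, for illustration only.
speech_frame = [math.sin(0.3 * i) for i in range(200)]
noise_frame = [0.1 * (-1) ** i for i in range(200)]
a, err = levinson_durbin(autocorrelation(speech_frame, 1), 1)
gain = wiener_gain(lpc_envelope(a, err, 64),
                   autocorr_spectrum(autocorrelation(noise_frame, 1), 64))
```

The gain lies between 0 and 1 in every bin: it approaches 1 where the speech envelope dominates and 0 where the noise estimate dominates. Applying it to a frame would additionally require a transform to and from the frequency domain, which is omitted here.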

4.3. Evaluations

Signal retrieval for a signal measured by an OFBG microphone is performed using the same parameters of the method, because the propagation path of body-conducted speech in the human body is not affected by whether the environment is quiet or noisy. Figure 8 shows the retrieval signal obtained from Figure 7 using the Wiener-filtering method, where the linear prediction coefficients and autocorrelation functions are both set to 1 and the frame width is 764 samples. The procedure was repeated five times on the signal to remove stationary noise. In the retrieval signal, high-frequency components at 2 kHz and above were recovered with these settings. The proposed method can therefore also be applied to obtain a clear signal from body-conducted speech measured with an OFBG microphone in an environment with loud noise and a strong magnetic field.
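
The repeated frame-wise processing described above can be sketched as a small driver loop. The frame width of 764 samples and the five passes are taken from the text; the use of non-overlapping frames and the `process_frame` callback, which stands in for the per-frame noise reduction, are assumptions made only for this illustration.

```python
def apply_in_frames(signal, process_frame, passes=5, frame_width=764):
    """Apply process_frame to consecutive non-overlapping frames, then
    repeat the whole pass over the result `passes` times."""
    out = list(signal)
    for _ in range(passes):
        processed = []
        for start in range(0, len(out), frame_width):
            # The last frame may be shorter than frame_width.
            processed.extend(process_frame(out[start:start + frame_width]))
        out = processed
    return out


# Placeholder frame processor: halves every sample (not a real noise reducer).
def halve(frame):
    return [0.5 * v for v in frame]


demo = apply_in_frames([8.0] * 2000, halve, passes=3)
```

Each additional pass removes more of the residual stationary noise but also risks attenuating speech, so the number of passes is a tuning parameter.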

Figure 7.

Differential acceleration from OFBG microphone

Figure 8.

Retrieval signal from OFBG microphone

5. Improvement on sound quality of body-conducted speech in sentence unit

The effectiveness of signal retrieval for word-unit body-conducted speech measured by an accelerometer and an OFBG microphone has been demonstrated in the previous sections. Although the effectiveness for word units has been shown, sentence-unit signals need to be examined for practical use, such as conversation in a noisy environment. The investigation of sentence units is an important evaluation, since it could revolutionize speech communication in such environments. As a first step in signal retrieval for sentence units, the method is applied with the same settings as for word units, because the transfer function between the sound source and the microphone seems to change little between word and sentence units. It is examined on sentence-unit body-conducted speech measured directly by an accelerometer and an OFBG microphone.

5.1. Body-conducted speech from an accelerometer

In the experiments on signal retrieval using an accelerometer, speech and body-conducted speech were measured in a quiet room of our laboratory and in the engine room of the training ship of the Oshima National College of Maritime Technology, which is a noisy environment with a main engine and two generators in operation, as shown in Figures 9 (a) and (b). The recording environment is the same as in Table 1; however, the speaker differs from the speaker in the previous section. The noise levels within the engine room under the two conditions of anchorage and cruising were 93 and 98 dB SPL, respectively, and the SNRs measured with the microphone were -20 and -25 dB, respectively. In this research, signals recorded under the cruising condition are used to estimate retrieval signals.
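
The relation between the reported noise levels and SNRs can be made explicit with a short calculation. Since an SNR in dB is the difference between the speech level and the noise level, the reported values imply a speech level of about 73 dB SPL at the microphone in both conditions. This value is an inference from the reported numbers, not a figure stated in the chapter, and the function names are illustrative.

```python
import math


def snr_db(signal_power, noise_power):
    """SNR in dB from signal and noise power."""
    return 10.0 * math.log10(signal_power / noise_power)


def implied_speech_level(noise_level_db_spl, snr_value_db):
    """Speech level implied by a noise level and an SNR, both in dB."""
    return noise_level_db_spl + snr_value_db


anchorage = implied_speech_level(93, -20)  # noise 93 dB SPL, SNR -20 dB
cruising = implied_speech_level(98, -25)   # noise 98 dB SPL, SNR -25 dB
```

An SNR of -20 dB corresponds to noise power 100 times the speech power, which illustrates why air-conducted speech is unusable in the engine room.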

A 22-year-old male uttered sentence A01 from the ATR503 sentence database, a sentence commonly used in speech recognition and its applications (M. Abe et al., 1991). The sentence is composed of the following morae as sub-word units.

  • /a/ /ra/ /yu/ /ru/ /ge/ /N/ /ji/ /tsu/ /wo/ /su/ /be/ /te/ /ji/ /bu/ /N/ /no/ /ho/ /u/ /he/ /ne/ /ji/ /ma/ /ge/ /ta/ /no/ /da/

Figure 9.

The engine room in Oshima-maru

Figures 10 and 11 show speech and body-conducted speech in sentence units, measured by a conventional microphone and an accelerometer in a quiet room when the 22-year-old male uttered the sentence. Although the accelerometer was held with the fingers, the sounds were measured clearly because it was held firmly against the upper lip with suitable pressure. Figure 12 shows the differential acceleration obtained from Figure 11, which is a clear signal with little noise because the BCS has a high SNR.

Figures 13 and 14 show speech and body-conducted speech in sentence units in the noisy environment. Speech is completely swamped by the intense noise from the engine and generators. In contrast, the body-conducted speech in Figure 14 is slightly affected by the noise but can still be measured. Because the signal in Figure 14 has a low SNR, the performance of signal retrieval from the differential acceleration in Figure 15 is expected to be reduced. Nevertheless, Figure 16 shows that signal retrieval from the differential acceleration works well when the method is applied four times, since the performance is sufficient to recover the frequency characteristics. As a result, it is concluded that the body-conducted speech can be retrieved clearly with little disturbance from noise.

Figure 10.

Speech of sentence in quiet

Figure 11.

BCS of sentence in quiet

Figure 12.

Differential acceleration of sentence in quiet

Figure 13.

Speech of sentence in noise environment

Figure 14.

BCS of sentence in noise environment

Figure 15.

Differential acceleration of sentence in noise environment

Figure 16.

Retrieval BCS of sentence in noise environment

5.2. Body-conducted speech from OFBG microphone

The quality of the signal measured by the OFBG microphone in the noisy environment of an MRI room was investigated. A speaker uttered sentence A01 during the operation of the MRI device, in a noise environment of 81 dB SPL. Because a sound level meter was not permitted in the room, the noise level was measured in front of the door of the room. Figure 17 shows the signal of the uttered sentence recorded by the OFBG microphone in the MRI room while the MRI equipment was in operation. Since the signal is clear, the frequency characteristics of the signal are expected to be recoverable with the signal retrieval method. Figures 18 and 19 show the differential acceleration and the retrieved signal from the OFBG microphone in the MRI room while the MRI equipment was in operation, with the method applied three times. These figures confirm the improvement in the sound quality of BCS in sentence units, and it is also concluded that retrieval works best when the BCS has a high SNR.

Figure 17.

BCS of sentence in MRI room

Figure 18.

Differential acceleration of sentence in MRI room

Figure 19.

Retrieval signal of sentence in MRI room

6. Conclusions and future works

This chapter presented improvements in the sound quality of body-conducted speech measured with an accelerometer and an OFBG microphone. In particular, an MRI room is an environment with loud noise and a strong magnetic field. An accelerometer, the conventional body-conducted speech microphone, is made of magnetic materials and cannot be brought into this environment. For conversation and communication between a patient and an operator in the room, an OFBG microphone was proposed, which can measure clearer signals than an accelerometer.

The performance of the signal retrieval method on sentences measured with an accelerometer and an OFBG microphone was then evaluated, and its effectiveness was confirmed with time–frequency analysis and speech recognition. Against this background, the estimation of clear sentence-unit body-conducted speech from an OFBG microphone was investigated with our signal retrieval method, which combines differential acceleration and noise reduction. Applying the method to the measured signal recovered its sound quality, as evaluated using time–frequency analysis. The retrieval method can thus also be applied, with the same settings, to a signal measured by an OFBG microphone, because its conduction path is not affected by noise in the air. The signals were measured in quiet and noisy rooms, specifically an engine room and an MRI room. Clear signals were obtained by employing the signal retrieval method with the same settings as used for word units, as a first step. To obtain a clearer signal with the signal retrieval method, the pressure with which the microphone is held is important, and the original BCS needs to have a high SNR.

As future work, the signal retrieval method needs to be extended for practical use, and its algorithm needs further improvement.

Acknowledgement

The authors thank Mr. K. Oda, Mr. H. Nagoshi and his colleagues in Ishimitsu laboratory of Hiroshima City University, members of the Living Informatics Research Group, Health Research Institute, National Institute of Advanced Industrial Science and Technology (AIST) for their support in the signal recording, and crew members of the training ship, Oshima-maru, Oshima National College of Maritime Technology.

References

  1. Lee, A., Kawahara, T., Shikano, K. (2001). Julius - an open source real-time large vocabulary recognition engine, in Proceedings of European Conference on Speech Communication and Technology (EUROSPEECH), 1691–1694.
  2. Moelker, A., Maas, R. A. J. J., Vogel, M. W., Ouhlous, M., Pattynama, P. M. T. (2005). Importance of bone-conducted sound transmission on patient hearing in the MR scanner, Journal of Magnetic Resonance Imaging, 22(1), 163–169.
  3. Li, D., O’Shaughnessy, D. (2003). Speech Processing: A Dynamic and Optimization-Oriented Approach, Marcel Dekker Inc.
  4. Hirsch, H., Pearce, D. (2000). The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in Proceedings of ISCA ITRW ASR2000, 181–188.
  5. Durbin, J. (1960). The Fitting of Time-Series Models, Review of the International Statistical Institute, 28(3), 233–244.
  6. Itou, K., Yamamoto, M., Takeda, K., Takezawa, T., Matsuoka, T., Kobayashi, T., Shikano, K., Itahashi, S. (1999). JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research, Journal of the Acoustical Society of Japan (E), 20(3), 199–206.
  7. Rabiner, L. (1993). Fundamentals of Speech Recognition, Prentice Hall.
  8. Abe, M., Sagisaka, Y., Umeda, T., Kuwabara, H. (1990). Manual of Japanese Speech Database, ATR.
  9. Nakayama, M., Ishimitsu, S., Nakagawa, S. (2011). A study of making clear body-conducted speech using differential acceleration, IEEJ Transactions on Electrical and Electronic Engineering, 6(2), 144–150.
  10. Nakayama, M., Ishimitsu, S., Nagoshi, H., Nakagawa, S., Fukui, K. (2011). Body-conducted speech microphone using an Optical Fiber Bragg Grating for high magnetic field and noisy environments, in Proceedings of Forum Acusticum 2011.
  11. Kitaoka, N., Yamada, T., Tsuge, S., Miyajima, C., Nishiura, T., Nakayama, M., Denda, Y., Fujimoto, M., Yamamoto, K., Takiguchi, T., Kuroiwa, S., Takeda, K., Nakamura, S. (2006). CENSREC-1-C: development of evaluation framework for voice activity detection under noisy environment, IPSJ SIG Technical Report, 2006-SLP-63, 1–6.
  12. Dupont, S., Ris, C., Bachelart, D. (2004). Combined use of close-talk and throat microphones for improved speech recognition under non-stationary background noise, in Proceedings of COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, paper 31.
  13. Ishimitsu, S. (2008). Construction of a Noise-Robust Body-Conducted Speech Recognition System, in Chapter of Speech Recognition, IN-TECH.
  14. Ishimitsu, S., Kitakaze, H., Tsuchibushi, Y., Yanagawa, H., Fukushima, M. (2004). A noise-robust speech recognition system making use of body-conducted signals, Acoustical Science and Technology, 25(2), 166–169.
  15. Ishimitsu, S., Nakayama, M., Murakami, Y. (2004). Study of Body-Conducted Speech Recognition for Support of Maritime Engine Operation, Journal of the JIME, 39(4), 35–40 (in Japanese).
  16. Itahashi, S. (1991). A noise database and Japanese common speech data corpus, Journal of ASJ, 47(12), 951–953.
  17. Young, S., Jansen, J., Odell, J., Woodland, P. (2000). The HTK Book (for version 2), Cambridge University.
  18. Kawahara, T., Lee, A., Kobayashi, T., Takeda, K., Minematsu, N., Itou, K., Ito, A., Yamamoto, M., Yamada, A., Utsuro, T., Shikano, K. (1999). Japanese dictation toolkit - 1997 version, Journal of ASJ, 20(3), 233–239.
  19. Vu, T. T., Unoki, M., Akagi, M. (2006). A Study on Restoration of Bone-conducted Speech With LPC Based Model, IEICE Technical Report, SP2005-174, 67–78.
  20. Tamiya, T., Shimamura, T. (2006). Improvement of Body-Conducted Speech Quality by Adaptive Filters, IEICE Technical Report, SP2006-191, 41–46.
  21. Liu, Z., Zhang, Z., Acero, A., Droppo, J., Huang, X. (2004). Direct Filtering for Air- and Bone-Conductive Microphones, in Proceedings of IEEE International Workshop on Multimedia Signal Processing (MMSP’04), 363–366.
