Open access peer-reviewed chapter

Low Computational Robust F0 Estimation of Speech Based on TV-CAR Analysis

By Keiichi Funaki and Takehito Higa

Submitted: April 11th 2012; Reviewed: July 18th 2012; Published: January 16th 2013

DOI: 10.5772/51694


1. Introduction

F0 estimation determines the performance of speech processing applications such as speech coding, tonal speech recognition, speaker recognition, and speech enhancement. The F0 estimator named “YIN” [1] is widely used around the world because of its high performance and open-source policy. Speech processing is commonly applied in realistic noisy environments, where performance degrades seriously. It is well known that YIN does not perform well for noisy speech, although it performs best for clean speech. Accordingly, a more robust F0 estimation algorithm is desired, and robust F0 estimation is a long-standing problem in speech processing.

We have already proposed a robust F0 estimation algorithm based on time-varying complex speech analysis of an analytic speech signal [2][3]. An analytic signal is a complex-valued signal whose real part is the speech signal and whose imaginary part is the Hilbert transform of the real part. Since the analytic signal has a spectrum only on positive frequencies, it can be decimated by a factor of two with no degradation. As a result, complex analysis offers attractive features, for example, more accurate spectral estimation at low frequencies. In [2] and [3], the complex LPC residual is used to calculate the criterion, which is the autocorrelation function (AUTOC) weighted by the reciprocal of the Average Magnitude Difference Function (AMDF) [6]. The complex residual is calculated from the analytic speech signal by means of the time-varying complex AR (TV-CAR) speech analysis method [4][5]. In [2], MMSE-based TV-CAR speech analysis [4] is used to calculate the complex LPC residual signal, and in [3], ELS-based TV-CAR speech analysis [5] is used. It has been reported in [2] that the method estimates F0 more accurately for IRS (Intermediate Reference System) filtered speech corrupted by white Gaussian noise. Moreover, it has been reported in [3] that the ELS-based complex speech analysis performs better even for additive pink noise.

Furthermore, in order to investigate the effectiveness of the time-varying analysis, the performance was compared across frames according to their degree of voicing [7]. The experiments using IRS filtered speech corrupted by white Gaussian noise or pink noise demonstrate that ELS-based robust time-varying complex speech analysis performs better for stationary voiced speech, and that ELS-based time-invariant speech analysis performs better for ordinary voiced frames. However, introducing time-varying analysis increases the computational cost. In this paper, pre-selection is introduced in order to reduce the computational cost. The pre-selection is performed by peak picking on the speech spectrum estimated by the TV-CAR analysis [8]. The evaluation is carried out using the Keele Pitch Database [9].

The remainder of the chapter is organized as follows. Section 2 explains TV-CAR speech analysis: the analytic signal, the Time-Varying Complex AR (TV-CAR) model, and two algorithms that estimate the TV-CAR parameters from an analytic signal, viz., the MMSE and ELS methods. Section 3 explains the F0 estimation algorithm in detail, covering the sample-based pre-selection and the frame-based final-selection. Section 4 presents experimental results that confirm the effectiveness of the proposed method.


2. TV-CAR speech analysis

In this section, the ELS-based robust TV-CAR speech analysis method is explained. Before that, the analytic signal and the TV-CAR model are explained; the analytic signal is the output of the TV-CAR model. In 2.5, the benefit of the robust TV-CAR analysis is explained by showing spectra estimated from natural speech.

2.1. Analytic speech signal

The target signal of the time-varying complex AR (TV-CAR) method is an analytic signal, a complex-valued signal that is modeled by an all-pole model. It is defined as follows.

$$y^c(t) = \frac{y(2t) + j\,y_H(2t)}{\sqrt{2}} \tag{1}$$

where $y^c(t)$, $y(t)$, and $y_H(t)$ denote the analytic signal at time $t$, the observed signal at time $t$, and the Hilbert transform of the observed signal, respectively. Notice that the superscript $c$ denotes a complex value in this paper. Since analytic signals have spectra only over the range $(0, \pi)$, they can be decimated by a factor of two; the index $2t$ expresses this decimation. The factor $1/\sqrt{2}$ is applied in order to match the power of the analytic signal to that of the observed one.
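The construction of the decimated analytic signal can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the Hilbert transform is obtained by zeroing the negative-frequency bins of a DFT, uses a naive O(N²) DFT to stay within the standard library, and the function names (`dft`, `analytic_signal`, `decimated_analytic_signal`) are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(y):
    """Return y(t) + j*y_H(t) by removing the negative-frequency DFT bins."""
    n = len(y)
    spec = dft(y)
    gain = [0.0] * n
    gain[0] = 1.0                      # DC is kept once
    if n % 2 == 0:
        gain[n // 2] = 1.0             # Nyquist bin is kept once
        for k in range(1, n // 2):
            gain[k] = 2.0              # positive frequencies are doubled
    else:
        for k in range(1, (n + 1) // 2):
            gain[k] = 2.0
    return dft([s * g for s, g in zip(spec, gain)], inverse=True)


def decimated_analytic_signal(y):
    """Keep every second sample and scale by 1/sqrt(2) (assumed power scaling)."""
    z = analytic_signal(y)
    return [z[t] / math.sqrt(2.0) for t in range(0, len(z), 2)]


# A cosine with 8 samples per period turns into a complex exponential
# with 4 samples per period after decimation.
y = [math.cos(2.0 * math.pi * t / 8.0) for t in range(64)]
yc = decimated_analytic_signal(y)
```

For this cosine input the magnitude of every `yc` sample is constant, so the mean power of `yc` equals that of `y`, which is the purpose of the scaling factor.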

2.2. Time-varying complex AR (TV-CAR) model

The conventional LPC model is defined by

$$Y_{LPC}(z^{-1}) = \frac{1}{1 + \sum_{i=1}^{I} a_i z^{-i}} \tag{2}$$

where $a_i$ and $I$ are the $i$-th LPC coefficient and the LPC order, respectively. Since the conventional LPC model cannot express a time-varying spectrum, LPC analysis cannot extract time-varying spectral features from a speech signal. In order to represent the time-varying features, the TV-CAR model employs a complex basis expansion:

$$a_i^c(t) = \sum_{l=0}^{L-1} g_{i,l}^c\, f_l^c(t) \tag{3}$$

where $a_i^c(t)$, $I$, $L$, $g_{i,l}^c$, and $f_l^c(t)$ are the $i$-th complex AR coefficient at time $t$, the AR order, the finite order of the complex basis expansion, a complex parameter, and a complex-valued basis function, respectively. Substituting Eq.(3) into Eq.(2) gives the following transfer function, Eq.(4), which is the TV-CAR model.

$$Y_{TVCAR}(z^{-1}) = \frac{1}{1 + \sum_{i=1}^{I} a_i^c(t)\, z^{-i}} = \frac{1}{1 + \sum_{i=1}^{I}\sum_{l=0}^{L-1} g_{i,l}^c\, f_l^c(t)\, z^{-i}} \tag{4}$$


The input-output relation is defined as

$$y^c(t) = -\sum_{i=1}^{I} a_i^c(t)\, y^c(t-i) + u^c(t) = -\sum_{i=1}^{I}\sum_{l=0}^{L-1} g_{i,l}^c\, f_l^c(t)\, y^c(t-i) + u^c(t) \tag{5}$$

where $u^c(t)$ and $y^c(t)$ are the complex-valued input and the analytic speech signal of Eq.(1), respectively. In the TV-CAR model, the complex AR coefficient is modeled by a finite number of arbitrary complex basis functions, such as a Fourier basis or a wavelet basis. Note that Eq.(3) parameterizes AR coefficient trajectories that change continuously as a function of time, so the time-varying analysis can estimate a continuously time-varying speech spectrum. In addition, as mentioned above, the complex-valued analysis facilitates accurate spectral estimation at low frequencies; this allows for more accurate F0 estimation if the formant structure is removed by inverse filtering. Eq.(5) can be represented in vector-matrix notation as


where $N$ is the analysis interval, $\bar{y}_f$ is an $(N-I, 1)$ column vector whose elements are the analytic speech signal, $\bar{\theta}$ is an $(L \cdot I, 1)$ column vector whose elements are the complex parameters, and $\bar{\Phi}_f$ is an $(N-I, L \cdot I)$ matrix whose elements are the analytic speech signal weighted by the complex basis. The superscript $T$ denotes transposition.
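To make the basis expansion and the input-output relation concrete, the following sketch evaluates the coefficient trajectories and runs the time-varying filter in both directions. It assumes the sign convention y(t) = -Σ a_i(t) y(t-i) + u(t), a polynomial basis f_l(t) = t^l (which reduces to (1, t) for L = 2), and zero initial state; the function names and the example parameter values are hypothetical.

```python
import cmath


def tv_coefficients(g, t):
    """a_i(t) = sum_l g[i][l] * f_l(t), with assumed polynomial basis f_l(t) = t**l."""
    return [sum(g_il * t ** l for l, g_il in enumerate(g_i)) for g_i in g]


def synthesize(g, u):
    """All-pole synthesis: y(t) = -sum_i a_i(t) y(t-i) + u(t), zero initial state."""
    order = len(g)
    y = []
    for t, u_t in enumerate(u):
        a = tv_coefficients(g, t)
        past = sum(a[i] * y[t - 1 - i] for i in range(order) if t - 1 - i >= 0)
        y.append(u_t - past)
    return y


def inverse_filter(g, y):
    """Residual r(t) = y(t) + sum_i a_i(t) y(t-i), the inverse of synthesize()."""
    order = len(g)
    r = []
    for t, y_t in enumerate(y):
        a = tv_coefficients(g, t)
        past = sum(a[i] * y[t - 1 - i] for i in range(order) if t - 1 - i >= 0)
        r.append(y_t + past)
    return r


# One complex coefficient (I = 1) that drifts linearly in time (L = 2).
g = [[-0.5 * cmath.exp(0.3j), -0.005 * cmath.exp(0.3j)]]
u = [1.0 + 0.0j] + [0.0j] * 49      # unit impulse as the input
y = synthesize(g, u)
r = inverse_filter(g, y)            # reproduces u up to rounding error
```

Setting every `g[i]` to a single-element list gives the time-invariant case L = 1, in which `tv_coefficients` returns the same values for every `t`.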

2.3. MMSE-based algorithm [4]

There are several algorithms that estimate the TV-CAR model parameters from a complex-valued signal, such as MMSE, WLS (Weighted Least Squares), M-estimation, GLS (Generalized Least Squares), and ELS (Extended Least Squares). The MMSE algorithm is the basic one and is used for the initial estimation in ELS, so it is explained before the ELS algorithm.

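The MSE criterion is defined by

$$r^c(t) = y^c(t) + \sum_{i=1}^{I}\sum_{l=0}^{L-1} \hat{g}_{i,l}^c\, f_l^c(t)\, y^c(t-i) \tag{8}$$

$$E = \sum_{t=I}^{N-1} \left| r^c(t) \right|^2 = (\bar{y}_f + \bar{\Phi}_f \hat{\theta})^H (\bar{y}_f + \bar{\Phi}_f \hat{\theta}) \tag{9}$$

where $\hat{g}_{i,l}^c$ is the estimated complex parameter, $r^c(t)$ is the equation error, or complex AR residual, and $E$ is the Mean Squared Error (MSE) of the equation error. To obtain the optimal complex AR coefficients, we minimize the MSE criterion. Minimizing the MSE criterion of Eq.(9) with respect to the complex parameters leads to the following MMSE algorithm.

$$(\bar{\Phi}_f^H \bar{\Phi}_f)\, \hat{\theta} = -\bar{\Phi}_f^H \bar{y}_f \tag{10}$$

The superscript $H$ denotes Hermitian transposition. After solving the linear equation of Eq.(10), the complex AR parameter $a_i^c(t)$ at time $t$ is obtained by evaluating Eq.(3) with the estimated complex parameters $\hat{g}_{i,l}^c$.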
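As a small worked instance of the least-squares step, the sketch below solves the normal equations for the simplest time-varying case, I = 1 and L = 2 with basis (1, t), where the 2×2 system can be solved in closed form. It assumes the model y(t) = -(g_0 + g_1 t) y(t-1) + r(t); the function names and the test signal are hypothetical, and a practical implementation with I = 7 would use a general complex linear solver instead.

```python
import cmath


def mmse_tvcar_order1(y):
    """Least-squares (g0, g1) for y(t) = -(g0 + g1*t) * y(t-1) + r(t)."""
    a00 = a01 = a11 = 0.0        # entries of Phi^H Phi (real for this basis)
    b0 = b1 = 0j                 # entries of -Phi^H y
    for t in range(1, len(y)):
        power = abs(y[t - 1]) ** 2
        cross = y[t - 1].conjugate() * y[t]
        a00 += power
        a01 += t * power
        a11 += t * t * power
        b0 -= cross
        b1 -= t * cross
    det = a00 * a11 - a01 * a01
    g0 = (b0 * a11 - a01 * b1) / det     # Cramer's rule on the 2x2 system
    g1 = (a00 * b1 - a01 * b0) / det
    return g0, g1


def residual(y, g0, g1):
    """Equation error r(t) = y(t) + (g0 + g1*t) * y(t-1) for t >= 1."""
    return [y[t] + (g0 + g1 * t) * y[t - 1] for t in range(1, len(y))]


# Noise-free test signal generated with known parameters.
g0_true = -0.9 * cmath.exp(0.4j)
g1_true = 0.002 * cmath.exp(0.4j)
y = [1.0 + 0.0j]
for t in range(1, 40):
    y.append(-(g0_true + g1_true * t) * y[t - 1])

g0_hat, g1_hat = mmse_tvcar_order1(y)
r = residual(y, g0_hat, g1_hat)
```

Because the test signal contains no noise, the estimate recovers the generating parameters and the residual vanishes; with noisy input the same estimate becomes biased, which is the motivation for the ELS algorithm in the next subsection.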

2.4. ELS-based algorithm [5]

Figure 1 shows the block diagram of ELS estimation. If the equation error of Eq.(8) is white Gaussian, the MMSE estimate is optimal; however, this is rarely the case. As a result, the MMSE estimate is biased. In the ELS method, an AR filter is adopted to whiten the equation error as follows (Figure 1(2)).

$$r^c(t) = -\sum_{k=1}^{K} b_k^c\, r^c(t-k) + e^c(t) \tag{11}$$

where $b_k^c$ is the $k$-th parameter of the AR filter of order $K$, and $e^c(t)$ is the zero-mean white Gaussian equation error at time $t$. The inverse filter of Eq.(11) is called a whitening filter. Using Eq.(5) and Eq.(11), the TV-CAR model can be represented as follows.

$$y^c(t) = -\sum_{i=1}^{I}\sum_{l=0}^{L-1} g_{i,l}^c\, f_l^c(t)\, y^c(t-i) - \sum_{k=1}^{K} b_k^c\, r^c(t-k) + e^c(t) \tag{12}$$

Eq.(12) is the ELS model shown in Figure 1(3). In the ELS algorithm, the parameters are estimated so as to minimize the MSE of the whitened equation error, whereas in the MMSE algorithm they are estimated so as to minimize the MSE of the equation error itself, as shown in Figure 1(1).

Eq.(12) can be expressed by the following vector-matrix notation.




By minimizing the MSE for Eq.(13), one can get the following equation.


By applying the well-known matrix inversion lemma to Eq.(15), one can obtain the following equation.


The MMSE-estimated parameter $\hat{\theta}_0$ contains the bias element $\hat{\theta}_{bias}$. The unbiased estimate $\hat{\theta}$ is calculated as $\hat{\theta}_0 - \hat{\theta}_{bias}$. The ELS algorithm is equivalent to the GLS (Generalized Least Squares) algorithm and is more sophisticated than the MMSE algorithm. Since the equation error $r^c(t)$ cannot be observed, an iterative algorithm that estimates $A(z)$ and $B(z)$ is required. The iteration procedure is as follows.

1. The initial estimate $\hat{\theta}_0$ is obtained by MMSE (Eq.(10)).

2. The equation error is calculated by Eq.(8).

3. $\hat{b}$ is estimated so as to minimize Eq.(18) using $r^c(t)$.

4. The bias parameter $\hat{\theta}_{bias}$ is calculated by Eq.(16).

5. The unbiased parameter $\hat{\theta}$ is calculated by Eq.(17).

6. Go to 2.


In Eq.(18), $R(z)$ is the z-transform of $r^c(t)$ and $B(z)$ is the transfer function of the whitening filter. Steps 2 to 5 are iterated a predetermined number of times. The ELS algorithm thus estimates two kinds of AR filters, $A(z)$ and $B(z)$, iteratively. Since the ELS algorithm estimates a speech spectrum that is unbiased and less affected by additive noise, F0 and formant frequencies can be estimated more accurately. Thus, more accurate F0 trajectories can be estimated than with MMSE estimation.
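Step 3 of the iteration, fitting the whitening filter to the equation error, can be illustrated for the simplest case K = 1. The sketch below is an assumption-laden illustration, not the chapter's procedure: it takes the criterion to be the ordinary least-squares fit of r(t) = -b r(t-1) + e(t), omits the bias-correction steps 4 and 5, and uses hypothetical function names and synthetic data.

```python
import random


def fit_whitening_ar1(r):
    """Least-squares b minimizing sum |r(t) + b * r(t-1)|^2 (assumed criterion, K = 1)."""
    num = sum(r[t - 1].conjugate() * r[t] for t in range(1, len(r)))
    den = sum(abs(r[t - 1]) ** 2 for t in range(1, len(r)))
    return -num / den


def whiten(r, b):
    """Apply B(z) = 1 + b z^{-1} to obtain the whitened error e(t)."""
    return [r[t] + b * r[t - 1] for t in range(1, len(r))]


def lag1_correlation(x):
    """Magnitude of the normalized lag-1 autocorrelation, a crude whiteness check."""
    num = sum(x[t - 1].conjugate() * x[t] for t in range(1, len(x)))
    den = sum(abs(v) ** 2 for v in x)
    return abs(num) / den


# Synthetic colored equation error generated with a known b.
random.seed(0)
b_true = -0.6 + 0.3j
r = [0j]
for _ in range(5000):
    e_t = complex(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0))
    r.append(-b_true * r[-1] + e_t)

b_hat = fit_whitening_ar1(r)
e = whiten(r, b_hat)
```

After whitening, the lag-1 correlation of `e` is much smaller than that of `r`, which is the property the ELS criterion relies on.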

Figure 1.

Block diagrams of MMSE and ELS estimation.

2.5. Benefit of robust TV-CAR speech analysis

In this subsection, we explain the benefit of robust TV-CAR speech analysis by showing estimated speech spectra, and explain its effectiveness for F0 estimation of speech. Figure 2 shows an example of the estimated speech spectra of the natural Japanese vowel /o/, obtained by complex analysis of the analytic signal and by conventional LPC analysis of the speech signal.

Figure 2.

Estimated Spectra of vowel /o/ with complex and conventional LPC analysis.

In Figure 2, the left side shows the estimated spectra: the upper panel is for real-valued LPC analysis and the lower panel is for complex-valued LPC analysis. The blue line is the spectrum estimated by LPC analysis and the green line is the DFT spectrum. The right side shows the poles of the estimated AR filter. Figure 3 shows the estimated running spectra for the clean natural speech /arayu/ and for the same speech corrupted by white Gaussian noise (10 [dB]). In Figure 3, (1) is the speech waveform, and (2), (3), (4), (5), and (6) are the spectra estimated by MMSE-based time-invariant real-valued AR speech analysis, by MMSE-based time-invariant complex-valued AR speech analysis (L=1), by MMSE-based time-varying complex AR (TV-CAR) speech analysis (L=2), by ELS-based time-invariant complex-valued AR speech analysis (L=1), and by ELS-based time-varying complex AR (TV-CAR) speech analysis (L=2), respectively. The analysis order $I$ is 14 for real analysis and 7 for complex analysis. The basis function is the first-order polynomial $(1, t)$. One can observe that the complex analysis estimates a more accurate spectrum at low frequencies, whereas its accuracy is lower at high frequencies. Since the speech spectrum has much of its energy at low frequencies, the high spectral estimation accuracy at low frequencies is expected to improve F0 estimation. Furthermore, the ELS analysis estimates a more accurate spectrum than MMSE, so the ELS analysis makes it possible to estimate F0 more accurately. Time-varying analysis can estimate the time-varying spectrum of speech. It is expected to yield a more accurate F0, since F0 varies within the analysis interval.

Figure 3.

Estimated spectrum for noise corrupted speech /arayu/ (10[dB]).

3. F0 Estimation method

The proposed method employs a two-stage search for F0. In the first stage, pre-selection, F0 and F1 are estimated by sample-based F0 contour estimation [8]. In the second stage, final-selection, F0 is estimated by frame-based F0 estimation [3] within a limited range determined by the pre-estimated F0 and F1. The two-stage estimation makes it possible to reduce the computation with little degradation. In 3.1, the pre-selection algorithm is explained. In 3.2, the final-selection algorithm is explained.

3.1. Sample-based pre-selection

F0 and F1 are estimated as the two lowest peak frequencies, viz., the glottal and first formant frequencies, by peak-picking on the estimated time-varying speech spectrum. The procedure of F0 and F1 contour estimation is shown in Figure 4.

  1. The set of complex-valued parameters $\hat{g}_{i,l}^c$ is estimated by the ELS algorithm for each analysis frame.

  2. By using Eq.(3) and Eq.(4) with the estimated parameters $\hat{g}_{i,l}^c$, the speech power spectrum for each sample $t$ is calculated, and the two peaks of the estimated spectrum are searched for by peak-picking.

The peak-picking is carried out from low frequency to high frequency, as shown in Figure 5. The two estimated peaks correspond to the glottal formant (F0) and the first formant (F1). The formant frequencies are estimated by solving the equation given by the reciprocal of Eq.(4), that is, by finding the roots of its denominator.
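The low-to-high peak search can be sketched as follows. The example evaluates the all-pole power spectrum of a set of complex AR coefficients on a uniform frequency grid and returns the first two local maxima. The grid size, the 5 kHz rate assumed for the decimated analytic signal, the two-pole test filter, and the function names are illustrative assumptions.

```python
import cmath
import math


def ar_power_spectrum(a, n_bins):
    """1 / |1 + sum_i a_i e^{-j w i}|^2 on n_bins points of w in [0, 2*pi)."""
    spec = []
    for k in range(n_bins):
        w = 2.0 * math.pi * k / n_bins
        denom = 1.0 + sum(a_i * cmath.exp(-1j * w * (i + 1)) for i, a_i in enumerate(a))
        spec.append(1.0 / abs(denom) ** 2)
    return spec


def lowest_peaks(spec, count=2):
    """Scan from low to high frequency and return the first `count` local maxima."""
    peaks = []
    for k in range(1, len(spec) - 1):
        if spec[k - 1] < spec[k] >= spec[k + 1]:
            peaks.append(k)
            if len(peaks) == count:
                break
    return peaks


# Test filter with two complex poles placed at 150 Hz and 600 Hz.
fs_c = 5000.0                       # assumed rate of the decimated analytic signal
radius = 0.95
p1 = radius * cmath.exp(2j * math.pi * 150.0 / fs_c)
p2 = radius * cmath.exp(2j * math.pi * 600.0 / fs_c)
a = [-(p1 + p2), p1 * p2]           # (1 - p1 z^-1)(1 - p2 z^-1) expanded

n_bins = 1000
spec = ar_power_spectrum(a, n_bins)
k0, k1 = lowest_peaks(spec)
f0_hz = k0 * fs_c / n_bins          # lowest peak, close to 150 Hz
f1_hz = k1 * fs_c / n_bins          # second peak, close to 600 Hz
```

In the sample-based setting, `a` would be recomputed from Eq.(3) at every sample `t`, so the search yields one F0 and F1 candidate per sample.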

Figure 4.

Flow of F0 and F1 contour estimation

Figure 5.

Peak Picking

3.2. Frame-based final-selection

In frame-based F0 estimation, the autocorrelation or the AMDF is commonly used. In this subsection, the autocorrelation and the AMDF are explained first, and then the adopted weighted autocorrelation is explained.

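The autocorrelation function (AUTOC) is defined by

$$f(\tau) = \frac{1}{N}\sum_{t=0}^{N-1} x(t)\, x(t+\tau) \tag{19}$$

where $x(t)$ is the target signal, such as the speech signal or the LPC residual, $N$ is the frame length, and $\tau$ is the delay. F0 is selected as the frequency corresponding to the peak of Eq.(19) within a certain range of F0.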

The AMDF is defined as follows.

$$p(\tau) = \frac{1}{N}\sum_{t=0}^{N-1} \left| x(t) - x(t+\tau) \right| \tag{20}$$

F0 is selected as the frequency corresponding to the notch of Eq.(20) within a certain range of F0. In the Shimamura method [6], the AUTOC is weighted by the reciprocal of the AMDF, as shown in Eq.(21). Since the weighting suppresses other peaks, the method can estimate F0 more accurately than the AUTOC or the AMDF alone. The value of $m$ is set to 1 in order to avoid a zero denominator.

$$\frac{f(\tau)}{p(\tau) + m} \tag{21}$$

where $f(\tau)$ and $p(\tau)$ are the AUTOC of Eq.(19) and the AMDF of Eq.(20), respectively. In the frame-based method, the Shimamura criterion of Eq.(21) is applied to the complex AR residual extracted by the ELS-based TV-CAR speech analysis. The time-varying complex parameter is estimated, and the complex AR residual is calculated with the estimated complex parameter of Eq.(17). Note that pre-emphasis is applied for speech analysis such as real-valued AR or TV-CAR speech analysis, whereas inverse filtering is applied to the non-pre-emphasized speech signal so as not to eliminate the F0 spectrum from the residual signal. The real part is used to calculate the AUTOC of a complex-valued signal. F0 is estimated within the range corresponding to 50-400 [Hz]. In order to reduce the computational amount, the range is shortened by setting the upper value as follows.


where $F0_s$ and $F1_s$ are the F0 and F1 estimated by the sample-based pre-selection. Setting the upper bound below F1 reduces not only the computational cost but also the estimation error.
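The frame-based search over a restricted range can be sketched as follows. The code computes the AUTOC (real part, so complex residuals are accepted), the AMDF, and their weighted ratio with m = 1, and picks the lag that maximizes the ratio between a lower and an upper F0 bound. The upper bound is passed in as a plain parameter `f_high`; how it is derived from the pre-selected F0 and F1 is not modeled here. The 5 kHz rate of the decimated signal, the 128-sample frame, the synthetic test signal, and the function names are assumptions for illustration.

```python
import math


def autoc(x, n, lag):
    """AUTOC over a frame of n samples; real part taken for complex input."""
    return sum((x[t] * complex(x[t + lag]).conjugate()).real for t in range(n)) / n


def amdf(x, n, lag):
    """AMDF over a frame of n samples (modulus of the difference for complex input)."""
    return sum(abs(x[t] - x[t + lag]) for t in range(n)) / n


def weighted_autoc(x, n, lag, m=1.0):
    """AUTOC weighted by the reciprocal of (AMDF + m)."""
    return autoc(x, n, lag) / (amdf(x, n, lag) + m)


def estimate_f0(x, n, fs, f_low=50.0, f_high=400.0):
    """Return fs / lag for the lag maximizing the weighted criterion in [f_low, f_high]."""
    lag_min = max(1, math.ceil(fs / f_high))
    lag_max = math.floor(fs / f_low)
    best = max(range(lag_min, lag_max + 1), key=lambda lag: weighted_autoc(x, n, lag))
    return fs / best


# Slightly decaying harmonic signal with a 40-sample period (125 Hz at 5 kHz).
fs = 5000.0
x = [
    0.999 ** t * (
        math.cos(2 * math.pi * t / 40)
        + 0.5 * math.cos(4 * math.pi * t / 40)
        + 0.3 * math.cos(6 * math.pi * t / 40)
    )
    for t in range(256)
]

f0_full = estimate_f0(x, 128, fs)                  # full 50-400 Hz search
f0_short = estimate_f0(x, 128, fs, f_high=200.0)   # shortened search range
```

Lowering `f_high` raises the smallest lag that is evaluated, so fewer lags are visited; this is where the computational saving of the two-stage method comes from.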

4. Experiments

The speech signals used in the experiment are 5 long sentences uttered by 5 male speakers and 5 long sentences uttered by 5 female speakers from the Keele pitch database [9]. The speech signals are filtered by an IRS filter [10]. The IRS filter is a band-pass FIR filter whose frequency response corresponds to that of the analog part of the transmitter of telephone equipment. The frequency response is shown in Figure 6. In order to evaluate the proposed method on speech data processed by speech coding, the IRS filter has to be introduced, as in [2]. The experimental conditions are summarized in Table 1. The frame length is 25.6 [msec] and the frame shift length is 10 [msec]. The analysis orders are 14 and 7 for real-valued analysis and complex-valued analysis, respectively. The basis expansion order L is set to 1 (time-invariant) or 2 (time-varying) in the experiments. A first-order polynomial function is adopted as the basis function. White Gaussian noise or pink noise [11] is adopted as additive noise, and the levels are 30, 20, 10, 5, 0, and -5 [dB]. In order to extract F0 more accurately, 3-point Lagrange interpolation is adopted. A commonly used criterion for F0 estimation, Gross Pitch Error (GPE), is adopted for objective evaluation. The F0 estimation error is defined as

$$e_p(n) = F_t(n) - F_e(n) \tag{23}$$

where $F_t(n)$ is the true F0 value and $F_e(n)$ is the estimated one. The true values are taken from the pitch files in the Keele database. In Eq.(23), if $|e_p(n)| > F_t(n) \times THR/100$, then the estimate is regarded as ERROR, and GPE is the proportion of error frames. Otherwise, the estimate is regarded as SUCCESS, and the Fine Pitch Error (FPE) is the standard deviation of the error. Figures 7, 8, 9 and 10 show the experimental results with THR set to 10 [%]. Figures 7 and 9 show the results for male speech. Figures 8 and 10 show the results for female speech. In the figures, (1) shows the GPEs or FPEs for additive white Gaussian noise and (2) shows the GPEs or FPEs for additive pink noise. PROPOSED denotes the GPEs or FPEs for the proposed method with δ being 25. SP denotes the Shimamura method [6], viz., the Shimamura criterion applied to the speech signal. The other lines denote the GPEs or FPEs for the analysis methods shown in Table 2. In all figures, the X-axis is the noise level of 30, 20, 10, 5, 0, −5 [dB], and the Y-axis is GPE [%] or FPE [Hz].
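The two evaluation measures can be computed as in the sketch below. It assumes that the inputs contain voiced frames only, that the error is the true value minus the estimate, that a frame is an error when the absolute error exceeds THR percent of the true value, and that FPE is the population standard deviation of the remaining errors. The function name and the sample values are hypothetical.

```python
import math


def gpe_fpe(f_true, f_est, thr=10.0):
    """Return (GPE in percent, FPE in Hz) for paired true/estimated F0 values."""
    fine_errors = []
    gross = 0
    for ft, fe in zip(f_true, f_est):
        e = ft - fe
        if abs(e) > ft * thr / 100.0:
            gross += 1                   # ERROR frame: counted in GPE
        else:
            fine_errors.append(e)        # SUCCESS frame: contributes to FPE
    gpe = 100.0 * gross / len(f_true)
    if fine_errors:
        mean = sum(fine_errors) / len(fine_errors)
        fpe = math.sqrt(sum((e - mean) ** 2 for e in fine_errors) / len(fine_errors))
    else:
        fpe = 0.0
    return gpe, fpe


# Four frames with a true F0 of 100 Hz; the third estimate is a gross error.
f_true = [100.0, 100.0, 100.0, 100.0]
f_est = [101.0, 99.0, 150.0, 100.0]
gpe, fpe = gpe_fpe(f_true, f_est)
```

Because gross errors are excluded before FPE is computed, a method can score well on FPE while having a high GPE, so the two measures are best read together.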

Figures 7 and 8 demonstrate that the proposed method performs slightly better than the full-search method (TVC_E) for male speech, while it performs equivalently to the full-search method (TVC_E) for female speech. Figures 9 and 10 show that the proposed method does not perform well in terms of FPE, whereas the Shimamura method performs better in terms of FPE.

| Item | Condition |
|---|---|
| Speech data | Keele Pitch database [9]; male: 5 long sentences; female: 5 long sentences |
| IRS filter | 64th-order FIR [10] |
| Target signal | Complex AR residual |
| Sampling | 10 kHz / 16 bit |
| Analysis window | Window length: 25.6 [ms]; shift length: 10.0 [ms] |
| F0 search range | 50 [Hz] to Eq.(22) |
| Complex-valued AR | I=7, L=2 (time-varying) |
| Pre-emphasis | 1 − z^{-1} |
| Criterion | AUTOC/AMDF [6] |
| Noise | (1) white Gaussian noise; (2) pink noise [11] |
| Noise level | 30, 20, 10, 5, 0, -5 [dB] |
| Interpolation | 3-point Lagrange |

Table 1.

Experimental Conditions

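| Method | Line | Real or Complex | Non or TV | MMSE or ELS |
|---|---|---|---|---|
| LPC | Red Dotted | Real | Non | MMSE |
| TVR | Blue Dotted | Real | TV | MMSE |
| LPC_E | Magenta Dotted | Real | Non | ELS |
| TVR_E | Green Dotted | Real | TV | ELS |
| CLPC | Red Solid | Complex | Non | MMSE |
| TVC | Blue Solid | Complex | TV | MMSE |
| CLPC_E | Magenta Solid | Complex | Non | ELS |
| TVC_E | Green Solid | Complex | TV | ELS |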

Table 2.

Analysis methods

Figure 6.

Frequency response of IRS filter

Figure 7.

Experimental Results for Male speech

Figure 8.

Experimental Results for Female speech

Figure 9.

Experimental Results for Male speech

Figure 10.

Experimental Results for Female speech

5. Conclusions

This paper proposed a fast, robust fundamental frequency estimation algorithm based on robust TV-CAR speech analysis. The method uses a two-stage search procedure: pre-selection and final-selection. In the pre-selection, F0 and F1 are estimated by time-varying F0 contour estimation. In the final-selection, F0 is estimated only within the shortened range determined by the pre-selected F0 and F1. The proposed method performs better for male speech in terms of GPE with reduced computation.



This work was supported by Grant-in-Aid for Scientific Research (C), Research Project Number: 20500158.

© 2013 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Keiichi Funaki and Takehito Higa (January 16th 2013). Low Computational Robust F0 Estimation of Speech Based on TV-CAR Analysis, Design and Architectures for Digital Signal Processing, Gustavo Ruiz and Juan A. Michell, IntechOpen, DOI: 10.5772/51694. Available from:
