2. TV-CAR speech analysis
In this section, the ELS-based robust TV-CAR speech analysis method is explained. First, the analytic signal and the TV-CAR model are introduced; the analytic signal is the output of the TV-CAR model. In 2.5, the benefit of the robust TV-CAR analysis is demonstrated by showing the spectra estimated from natural speech.
2.1. Analytic speech signal
The target signal of the time-varying complex AR (TV-CAR) method is an analytic signal, a complex-valued signal that will be modeled by the all-pole model of 2.2. The analytic signal is defined as follows:

$y(t) = x(t) + j\,\hat{x}(t)$  (1)

where $y(t)$ denotes the analytic signal at time $t$, $x(t)$ is the real-valued speech signal, and $\hat{x}(t)$ is its Hilbert transform. Since the spectrum of an analytic signal is one-sided, $y(t)$ can be decimated by a factor of two without loss of information.
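As a concrete illustration, a minimal Python sketch of Eq.(1) could look as follows; the function name and the optional decimation step are our assumptions, not part of the original formulation.

```python
# Minimal sketch of Eq.(1): analytic signal of a real speech frame.
import numpy as np
from scipy.signal import hilbert

def analytic_signal(x, decimate_by_2=True):
    """Return y(t) = x(t) + j*x_hat(t), optionally decimated by 2."""
    y = hilbert(np.asarray(x, dtype=float))  # scipy returns x + j*H[x]
    if decimate_by_2:
        y = y[::2]  # the one-sided spectrum permits half-rate sampling
    return y
```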
2.2. Time-varying complex AR (TV-CAR) model
The conventional LPC model is defined by

$x(t) = -\sum_{i=1}^{I} a_i\, x(t-i) + e(t)$  (2)

where $a_i$ is a time-invariant real-valued AR coefficient and $e(t)$ is the residual. The input-output relation of the TV-CAR model is defined as

$y(t) = -\sum_{i=1}^{I} a_i(t)\, y(t-i) + u(t), \qquad a_i(t) = \sum_{l=0}^{L-1} g_{i,l}\, f_l(t)$  (3)

where $u(t)$ and $y(t)$ are taken to be the complex-valued input and the analytic speech signal shown in Eq.(1), respectively, $I$ is the analysis order, $f_l(t)$ are complex basis functions, and $g_{i,l}$ are the complex expansion coefficients. The corresponding time-varying all-pole spectrum is

$S(z, t) = \dfrac{1}{1 + \sum_{i=1}^{I} a_i(t)\, z^{-i}}$  (4)

In the TV-CAR model, the complex AR coefficient is thus modeled by a finite number of arbitrary complex basis functions, such as a Fourier basis, a wavelet basis, and so on. Note that Eq.(3) parameterizes AR coefficient trajectories that change continuously as functions of time, so the time-varying analysis can estimate a continuously time-varying speech spectrum. In addition, as mentioned above, the complex-valued analysis facilitates accurate spectral estimation at low frequencies; as a result, this feature allows for more accurate F0 estimation.
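To make Eq.(3) and Eq.(4) concrete, here is a small Python sketch that evaluates the coefficient trajectories from basis functions and the resulting time-varying power spectrum; the variable names (`g`, `basis`) and the frequency grid are our choices.

```python
# Sketch of Eq.(3) (basis expansion) and Eq.(4) (time-varying spectrum).
import numpy as np

def ar_coefficients(g, basis, t):
    """a_i(t) = sum_l g[i-1, l] * f_l(t); g has shape (I, L)."""
    f = np.array([f_l(t) for f_l in basis])  # basis values at time t, (L,)
    return g @ f                             # complex AR coefficients, (I,)

def power_spectrum(a_t, n_freq=512):
    """|S(e^{jw}, t)|^2 over [0, 2*pi); a complex model is not symmetric."""
    w = np.linspace(0.0, 2.0 * np.pi, n_freq, endpoint=False)
    i = np.arange(1, len(a_t) + 1)
    denom = 1.0 + np.exp(-1j * np.outer(w, i)) @ a_t  # 1 + sum a_i e^{-jwi}
    return 1.0 / np.abs(denom) ** 2
```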
2.3. MMSE-based algorithm 
There are several algorithms that estimate the TV-CAR model parameters from a complex-valued signal, such as MMSE, WLS (Weighted Least Squares), M-estimation, GLS (Generalized Least Squares), and ELS (Extended Least Squares). The MMSE algorithm is the basic one and is used for the initial estimation of the ELS. Before explaining the ELS, the MMSE algorithm is explained.
The MSE criterion is defined by

$J = \sum_{t} |e(t)|^2 = \mathbf{e}^H \mathbf{e}$  (5)

where

$\hat{\boldsymbol{\theta}} = [\hat{g}_{1,0}, \hat{g}_{1,1}, \ldots, \hat{g}_{I,L-1}]^T$  (6)

is the estimated complex parameter vector,

$\boldsymbol{\varphi}(t) = [f_0(t)\,y(t-1), \ldots, f_{L-1}(t)\,y(t-I)]^T$  (7)

is the regressor vector built from the basis functions and the delayed samples, and

$e(t) = y(t) + \boldsymbol{\varphi}(t)^T \hat{\boldsymbol{\theta}}$  (8)

is an equation error, or complex AR residual. Stacking Eq.(8) over the analysis frame gives

$\mathbf{e} = \mathbf{y} + \boldsymbol{\Phi}\,\hat{\boldsymbol{\theta}}$  (9)

and minimizing Eq.(5) leads to the linear equation

$(\boldsymbol{\Phi}^H \boldsymbol{\Phi})\, \hat{\boldsymbol{\theta}} = -\boldsymbol{\Phi}^H \mathbf{y}$  (10)

Superscript $H$ denotes Hermitian transposition. After solving the linear equation of Eq.(10), we can get the complex AR parameters $a_i(t)$ at time $t$ through Eq.(3).
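A compact least-squares sketch of the MMSE step (Eqs.(5)-(10)) is given below; it solves Eq.(10) with a complex least-squares call rather than forming the normal equations explicitly, and all names are illustrative.

```python
# MMSE sketch: build the regressor matrix Phi and solve Eq.(10).
import numpy as np

def mmse_estimate(y, basis, I):
    """Return g (I, L) minimizing sum_t |y(t) + phi(t)^T theta|^2."""
    L, N = len(basis), len(y)
    rows, rhs = [], []
    for t in range(I, N):
        f = np.array([f_l(t) for f_l in basis])               # (L,)
        lags = np.array([y[t - i] for i in range(1, I + 1)])  # (I,)
        rows.append(np.outer(lags, f).ravel())                # phi(t)
        rhs.append(y[t])
    Phi, yv = np.array(rows), np.array(rhs)
    theta, *_ = np.linalg.lstsq(Phi, -yv, rcond=None)         # Eq.(10)
    return theta.reshape(I, L)
```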
2.4. ELS-based algorithm 
Figure 1 shows the block diagram of the ELS estimation. If the equation error shown in Eq.(8) is white Gaussian, the MMSE estimation is optimal; however, this is rarely the case. As a result, the MMSE estimation suffers from bias. In the ELS method, an AR filter is adopted to whiten the equation error as follows (Figure 1(2)):

$\varepsilon(t) = e(t) + \sum_{m=1}^{M} c_m\, e(t-m)$  (11)

where $\varepsilon(t)$ is the whitened equation error and $c_m$ are the whitening AR coefficients. Substituting Eq.(8) into Eq.(11) yields

$\varepsilon(t) = y(t) + \boldsymbol{\varphi}(t)^T \hat{\boldsymbol{\theta}} + \sum_{m=1}^{M} c_m\, e(t-m)$  (12)
Eq.(12) is the ELS model shown in Figure 1(3). In the ELS algorithm, the parameter is estimated so as to minimize the MSE of the whitened equation error, whereas in the MMSE algorithm the parameter is estimated so as to minimize the MSE of the equation error itself, as shown in Figure 1(1).
Eq.(12) can be expressed in the following vector-matrix notation:

$\boldsymbol{\varepsilon} = \mathbf{y} + [\boldsymbol{\Phi}\;\; \mathbf{E}] \begin{bmatrix} \hat{\boldsymbol{\theta}} \\ \hat{\mathbf{c}} \end{bmatrix} = \mathbf{y} + \boldsymbol{\Psi}\, \hat{\boldsymbol{\xi}}$  (13)

where $\mathbf{E}$ is the matrix of delayed equation errors and $\hat{\boldsymbol{\xi}}$ is the augmented parameter vector. By minimizing the MSE for Eq.(13), one can get the following equation:

$(\boldsymbol{\Psi}^H \boldsymbol{\Psi})\, \hat{\boldsymbol{\xi}} = -\boldsymbol{\Psi}^H \mathbf{y}$  (14)

or, in block form,

$\begin{bmatrix} \boldsymbol{\Phi}^H\boldsymbol{\Phi} & \boldsymbol{\Phi}^H\mathbf{E} \\ \mathbf{E}^H\boldsymbol{\Phi} & \mathbf{E}^H\mathbf{E} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\theta}} \\ \hat{\mathbf{c}} \end{bmatrix} = -\begin{bmatrix} \boldsymbol{\Phi}^H\mathbf{y} \\ \mathbf{E}^H\mathbf{y} \end{bmatrix}$  (15)

By applying the well-known matrix inversion lemma to Eq.(15), one can obtain the following equation:

$\hat{\boldsymbol{\theta}}_{\mathrm{MMSE}} = \hat{\boldsymbol{\theta}} + \Delta\boldsymbol{\theta}, \qquad \Delta\boldsymbol{\theta} = -(\boldsymbol{\Phi}^H\boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^H \mathbf{E}\, \hat{\mathbf{c}}$  (16)

The MMSE-estimated parameter $\hat{\boldsymbol{\theta}}_{\mathrm{MMSE}}$ contains the biased element $\Delta\boldsymbol{\theta}$. The unbiased estimate of $\boldsymbol{\theta}$ is calculated by

$\hat{\boldsymbol{\theta}} = \hat{\boldsymbol{\theta}}_{\mathrm{MMSE}} - \Delta\boldsymbol{\theta}$  (17)

In each iteration the augmented parameter is re-estimated by

$\hat{\boldsymbol{\xi}} = -\mathbf{R}(\hat{e})^{-1}\, \mathbf{r}(\hat{e})$, with $\mathbf{R}(\hat{e}) = \boldsymbol{\Psi}^H\boldsymbol{\Psi}$ and $\mathbf{r}(\hat{e}) = \boldsymbol{\Psi}^H\mathbf{y}$  (18)

The ELS algorithm is equivalent to the GLS (Generalized Least Squares) algorithm, but is a more sophisticated one. Since the equation error $e(t)$ cannot be observed directly, it is estimated iteratively as follows:
1. The initial parameter $\hat{\boldsymbol{\theta}}$ is estimated by MMSE (Eq.(10)).
2. The equation error $e(t)$ is calculated by Eq.(8).
3. The augmented parameter $\hat{\boldsymbol{\xi}}$ is estimated so as to minimize the MSE, using Eq.(18) with the calculated equation error.
4. The bias parameter $\Delta\boldsymbol{\theta}$ is calculated by Eq.(16).
5. The unbiased parameter $\hat{\boldsymbol{\theta}}$ is calculated by Eq.(17).
6. Go to 2.
In Eq.(18), $\mathbf{R}(\hat{e})$ and $\mathbf{r}(\hat{e})$ denote the correlation matrix and cross-correlation vector computed with the current estimate of the equation error; they are updated at every iteration of the above procedure.
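The loop below sketches this iteration in Python under the same illustrative names as before; the augmented regression plays the role of Eq.(18), and keeping the AR part of the augmented estimate stands in for the bias removal of Eqs.(16)-(17). The error-model order `M` and the iteration count are our assumptions.

```python
# ELS sketch: iterate MMSE -> residual -> augmented regression.
import numpy as np

def els_estimate(y, basis, I, M=2, n_iter=5):
    g = mmse_estimate(y, basis, I)          # step 1: MMSE initialization
    L, N = len(basis), len(y)
    for _ in range(n_iter):
        # step 2: equation error e(t) from the current parameters (Eq.(8))
        e = np.zeros(N, dtype=complex)
        for t in range(I, N):
            f = np.array([f_l(t) for f_l in basis])
            a_t = g @ f
            e[t] = y[t] + sum(a_t[i - 1] * y[t - i] for i in range(1, I + 1))
        # step 3: regress on [phi(t), e(t-1), ..., e(t-M)] (role of Eq.(18))
        rows, rhs = [], []
        for t in range(I + M, N):
            f = np.array([f_l(t) for f_l in basis])
            lags = np.array([y[t - i] for i in range(1, I + 1)])
            errs = np.array([e[t - m] for m in range(1, M + 1)])
            rows.append(np.concatenate([np.outer(lags, f).ravel(), errs]))
            rhs.append(y[t])
        xi, *_ = np.linalg.lstsq(np.array(rows), -np.array(rhs), rcond=None)
        # steps 4-5: the AR block of xi is the bias-reduced parameter
        g = xi[: I * L].reshape(I, L)
    return g
```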
2.5. Benefit of robust TV-CAR speech analysis
In this subsection, we explain the benefit of the robust TV-CAR speech analysis by showing the estimated speech spectra, and we explain its effectiveness for F0 estimation.
In Figure 2, the left side shows the estimated spectra: the upper panel is for real-valued LPC analysis and the lower panel is for complex-valued LPC analysis. The blue line is the spectrum estimated by the LPC analysis and the green line is the estimated DFT spectrum. The right side shows the poles of the estimated AR filter. Figure 3 shows the estimated running spectra for the clean natural speech /arayu/ and for the same speech corrupted by white Gaussian noise (10[dB]). In Figure 3, (1) is the speech waveform, while (2), (3), (4), (5), and (6) are the spectra estimated by MMSE-based time-invariant real-valued AR speech analysis, MMSE-based time-invariant complex-valued AR speech analysis (L=1), MMSE-based time-varying complex AR (TV-CAR) speech analysis (L=2), ELS-based time-invariant complex-valued AR speech analysis (L=1), and ELS-based time-varying complex AR (TV-CAR) speech analysis (L=2), respectively. The analysis order is 14 for the real-valued analysis and 7 for the complex-valued analysis.
3. F0 Estimation method
The proposed method employs a two-stage search for F0. In the first stage, the pre-selection, F0 and F1 are estimated using sample-based F0 contour estimation. In the second stage, the final-selection, F0 is estimated using frame-based F0 estimation within a limited range based on the pre-estimated F0 and F1. The two-stage estimation makes it possible to reduce the computation with little degradation. In 3.1, the pre-selection algorithm is explained; in 3.2, the final-selection algorithm is explained.
3.1. Sample-based pre-selection
F0 and F1 are estimated as the two lowest peak frequencies, viz., the glottal and first formant frequencies, by peak-picking on the estimated time-varying speech spectrum. The procedure of the F0 and F1 contour estimation is shown in Figure 4.
The set of complex-valued parameters is estimated by the ELS algorithm for each analysis frame.
By using Eq.(3) and Eq.(4) with the estimated parameters, the speech power spectrum is calculated for each sample t, and the two peaks of the estimated spectrum are searched for by peak-picking.
The peak-picking is carried out from low frequency to high frequency, as shown in Figure 5. The two estimated peaks correspond to the glottal formant (F0) and the first formant (F1). The formant frequencies are estimated by solving for the roots of the reciprocal of Eq.(4), i.e., the denominator polynomial.
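A minimal sketch of this pre-selection, reusing the `power_spectrum` helper above: evaluate the spectrum on a dense grid and take the two lowest-frequency local maxima as the F0 and F1 candidates. The grid size and the simple three-point peak test are our assumptions (the paper instead solves for the roots of the denominator of Eq.(4)).

```python
# Pre-selection sketch: pick the two lowest spectral peaks as F0 and F1.
import numpy as np

def lowest_two_peaks(a_t, fs, n_freq=2048):
    spec = power_spectrum(a_t, n_freq=n_freq)       # Eq.(4) on a grid
    freqs = np.arange(n_freq) * fs / n_freq         # bin -> Hz
    peaks = [k for k in range(1, n_freq - 1)
             if spec[k] > spec[k - 1] and spec[k] > spec[k + 1]]
    lowest = sorted(peaks)[:2]                      # scan low to high (Fig. 5)
    return [freqs[k] for k in lowest]               # [F0_hat, F1_hat]
```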
3.2. Frame-based final-selection
In frame-based F0 estimation, the autocorrelation or the AMDF is commonly used. In this subsection, the autocorrelation and the AMDF are explained, and then the adopted weighted autocorrelation is explained.
The autocorrelation function (AUTOC) is defined by

$f(\tau) = \sum_{t=0}^{N-1-\tau} x(t)\, x(t+\tau)$  (19)

where $x(t)$ is the target signal (e.g., the speech signal or the LPC residual), $N$ is the frame length, and $\tau$ is the delay. F0 is selected as the peak frequency of Eq.(19) within a certain range of F0.
The AMDF is defined as follows:

$p(\tau) = \sum_{t=0}^{N-1-\tau} |x(t) - x(t+\tau)|$  (20)

F0 is selected as the notch frequency of Eq.(20) within a certain range of F0. In the Shimamura method, the AUTOC is weighted by the reciprocal of the AMDF, as shown in Eq.(21). Since the weighting makes it possible to suppress the other peaks, this method can estimate F0 more accurately than the AUTOC or the AMDF alone:

$w(\tau) = \dfrac{f(\tau)}{p(\tau) + m}$  (21)

where $f(\tau)$ and $p(\tau)$ are the AUTOC of Eq.(19) and the AMDF of Eq.(20), respectively. The value of $m$ is set to 1 in order to avoid a value of 0 in the denominator.
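In Python, Eqs.(19)-(21) can be sketched directly as below; the real part is taken so that the same code accepts the complex AR residual discussed next, and the helper names are ours.

```python
# Weighted autocorrelation sketch (Eqs.(19)-(21)).
import numpy as np

def weighted_autocorrelation(x, tau_max, m=1.0):
    x = np.asarray(x)
    N = len(x)
    w = np.zeros(tau_max + 1)
    for tau in range(1, tau_max + 1):
        head, tail = x[: N - tau], x[tau:]
        f = np.real(np.sum(head * np.conj(tail)))  # Eq.(19), real part
        p = np.sum(np.abs(head - tail))            # Eq.(20)
        w[tau] = f / (p + m)                       # Eq.(21), m avoids /0
    return w

def select_f0(x, fs, f0_min=50.0, f0_max=400.0):
    tau_min, tau_max = int(fs / f0_max), int(fs / f0_min)
    w = weighted_autocorrelation(x, tau_max)
    tau_star = tau_min + int(np.argmax(w[tau_min:]))  # best lag in range
    return fs / tau_star
```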
In the frame-based final-selection, the Shimamura criterion of Eq.(21) is applied to the complex AR residual extracted by the ELS-based TV-CAR speech analysis. The time-varying complex parameters are estimated, and the complex AR residual is calculated using the estimated complex parameters of Eq.(17). Note that pre-emphasis is applied for speech analysis such as real-valued AR or TV-CAR speech analysis, while the inverse filtering is applied to the non-pre-emphasized speech signal so as not to eliminate the F0 spectrum from the residual signal. The real part of the AUTOC is used to calculate the AUTOC of the complex-valued signal. F0 is estimated within the range corresponding to 50-400[Hz]. In order to reduce the computational amount, the range is shortened by setting the upper bound according to Eq.(22), in which $\hat{F}_0$ and $\hat{F}_1$ are the F0 and F1 estimated by the sample-based pre-selection. Setting the upper bound below F1 not only reduces the computational cost but also reduces the estimation error.
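Since the precise form of Eq.(22) is not reproduced above, the snippet below is only a hypothetical illustration of such a restriction: it caps the search range using the pre-selected F̂0 and F̂1 with a margin δ, which is our guess at the intent rather than the paper's formula.

```python
# Hypothetical range restriction in the spirit of Eq.(22).
def restricted_f0_range(f0_hat, f1_hat, delta=25.0, f_min=50.0, f_max=400.0):
    # stay below the pre-selected F1 and near the pre-selected F0 (assumption)
    upper = min(f_max, f1_hat - delta, f0_hat + delta)
    return f_min, max(f_min, upper)
```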
4. Experimental evaluations
The speech signals used in the experiments are 5 long sentences uttered by 5 male speakers and 5 long sentences uttered by 5 female speakers from the Keele pitch database. The speech signals are filtered by an IRS filter, a band-pass FIR filter whose frequency response corresponds to that of the analog part of the transmitter of telephone equipment; the frequency response is shown in Figure 6. In order to evaluate the proposed method on speech data processed by speech coding, the IRS filter has to be introduced. The experimental conditions are summarized in Table 1. The frame length is 25.6[msec] and the frame shift is 10[msec]. The analysis orders are 14 and 7 for the real-valued analysis and the complex-valued analysis, respectively. The basis expansion order L is set to 1 (time-invariant) or 2 (time-varying) in the experiments, and a first-order polynomial function is adopted as the basis function. White Gaussian noise or pink noise is adopted as additive noise, at levels of 30, 20, 10, 5, 0, and -5[dB]. In order to extract a more accurate F0, 3-point Lagrange interpolation is adopted. The commonly used criterion for F0 estimation, the Gross Pitch Error (GPE), is adopted for the objective evaluation. The F0 estimation error is defined as

$E(t) = \dfrac{|\hat{F}_0(t) - F_0(t)|}{F_0(t)} \times 100\,[\%]$  (23)
where $F_0(t)$ is the true F0 value and $\hat{F}_0(t)$ is the estimated one. The true values are derived from the pitch files in the Keele database. In Eq.(23), if the error exceeds THR, the estimation is regarded as ERROR, and GPE is the proportion of error frames. Otherwise, the estimation is regarded as SUCCESS, and FPE is the standard deviation of the error. Figures 7, 8, 9, and 10 show the experimental results with THR set to 10[%]. Figures 7 and 9 show the results for male speech; Figures 8 and 10 show the results for female speech. In the figures, (1) shows the GPEs or FPEs for additive white Gaussian noise and (2) shows those for additive pink noise. PROPOSED denotes the GPEs or FPEs for the proposed method with δ set to 25. SP denotes the Shimamura method, viz., the Shimamura criterion applied to the speech signal. The other lines show the GPEs or FPEs for the analysis methods listed in Table 2. In all figures, the X-axis is the noise level (30, 20, 10, 5, 0, -5[dB]) and the Y-axis is the GPE[%] or the FPE[Hz].
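For completeness, here is a small sketch of this GPE/FPE computation; the array names and the voiced-frame mask are our assumptions.

```python
# GPE/FPE sketch following Eq.(23): gross errors exceed THR; FPE is the
# standard deviation of the remaining (SUCCESS) frame errors in Hz.
import numpy as np

def gpe_fpe(f0_true, f0_est, thr=0.10):
    f0_true, f0_est = np.asarray(f0_true, float), np.asarray(f0_est, float)
    voiced = f0_true > 0                      # frames with a reference pitch
    rel_err = np.abs(f0_est[voiced] - f0_true[voiced]) / f0_true[voiced]
    gross = rel_err > thr
    gpe = 100.0 * np.mean(gross)              # GPE [%]
    fine = (f0_est[voiced] - f0_true[voiced])[~gross]
    fpe = float(np.std(fine))                 # FPE [Hz]
    return gpe, fpe
```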
Figures 7 and 8 demonstrate that the proposed method performs slightly better than the full-search method (TVC_E) for male speech, while it performs equivalently to the full-search method (TVC_E) for female speech. Figures 9 and 10 show that the proposed method does not perform as well in terms of FPE, whereas the Shimamura method performs better in terms of FPE.
Table 1. Experimental conditions
Speech data: Keele pitch database; male: 5 long sentences, female: 5 long sentences
IRS filter: 64th-order FIR filter
Target signal: complex AR residual
Window length: 25.6[ms]; shift length: 10.0[ms]
F0 search range: 50[Hz] up to the bound of Eq.(22)
Analysis order: I = 7, L = 2 (time-varying)
Additive noise: (1) white Gaussian noise, (2) pink noise; noise levels: 30, 20, 10, 5, 0, -5[dB]
Interpolation: 3-point Lagrange interpolation
5. Conclusion
This paper proposed a fast and robust fundamental frequency estimation algorithm based on robust TV-CAR speech analysis. The method provides a two-stage search procedure: pre-selection and final-selection. In the pre-selection, F0 and F1 are estimated using time-varying F0 contour estimation. In the final-selection, F0 is estimated only within the shortened range based on the pre-selected F0 and F1. The proposed method performs better for male speech in terms of GPE, with reduced computation.
Acknowledgment
This work was supported by a Grant-in-Aid for Scientific Research (C), Research Project Number 20500158.
References
A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.
K. Funaki et al., "Robust F0 Estimation Based on Complex LPC Analysis for IRS Filtered Noisy Speech," IEICE Trans. on Fundamentals, vol. E90-A, no. 8, Aug. 2007.
K. Funaki, "F0 estimation based on robust ELS complex speech analysis," Proc. EUSIPCO-2008, Lausanne, Switzerland, Aug. 2008.
K. Funaki, Y. Miyanaga, and K. Tochinai, "On a time-varying complex speech analysis," Proc. EUSIPCO-98, Rhodes, Greece, Sep. 1998.
K. Funaki, "A time-varying complex AR speech analysis based on GLS and ELS method," Proc. EUROSPEECH 2001, Aalborg, Denmark, Sep. 2001.
T. Shimamura and H. Kobayashi, "Weighted Autocorrelation for Pitch Extraction of Noisy Speech," IEEE Trans. Speech and Audio Processing, vol. 9, no. 7, pp. 727-730, 2001.
K. Funaki, "On Evaluation of the F0 Estimation Based on Time-Varying Complex Speech Analysis," Proc. INTERSPEECH 2010, Makuhari, Japan, Sep. 2010.
K. Funaki, "F0 Contour Estimation Using ELS-based Robust Time-Varying Complex Speech Analysis," Proc. IEEE DSP/SPE Workshop, Sedona, AZ, USA, Jan. 2011.
Keele Pitch Database, University of Liverpool. http://www.liv.ac.uk/Psychology/hmp/projects/pitch.html
ITU-T Recommendation G.191, "Software tools for speech and audio coding standardization," Nov. 2000.