F0 estimation determines the performance of speech processing applications such as speech coding, tonal speech recognition, speaker recognition, and speech enhancement. The F0 estimator named “YIN” has been proposed and is widely used around the world because of its high performance and open-source policy. Speech processing is commonly applied in realistic noisy environments, where its performance degrades seriously. It is well known that YIN performs best for clean speech but does not perform well for noisy speech. A more robust F0 estimation algorithm is therefore desired, and robust F0 estimation remains a long-standing problem in speech processing. We have already proposed a robust F0 estimation algorithm based on time-varying complex speech analysis of the analytic speech signal. An analytic signal is a complex-valued signal whose real part is the speech signal and whose imaginary part is the Hilbert transform of the real part. Since the analytic signal has a spectrum only on positive frequencies, it can be decimated by a factor of two with no degradation. As a result, the complex analysis offers attractive features, for example, more accurate spectral estimation at low frequencies. In our previous methods, the complex LPC residual is used to calculate a weighted autocorrelation function (AUTOC) criterion, in which the AUTOC is weighted by the reciprocal of the Average Magnitude Difference Function (AMDF). The complex residual is calculated from the analytic speech signal by means of the time-varying complex AR (TV-CAR) speech analysis method. MMSE-based TV-CAR speech analysis was introduced first, and ELS-based TV-CAR speech analysis was introduced later, to calculate the complex LPC residual signal. It has been reported that the method can estimate F0 more accurately for IRS (Intermediate Reference System) filtered speech corrupted by white Gaussian noise. Moreover, it has been reported that the ELS-based complex speech analysis performs better even for additive pink noise.
Furthermore, to investigate the effectiveness of the time-varying analysis, the performance was compared frame by frame with respect to the degree of voicing. Experiments using IRS-filtered speech corrupted by white Gaussian noise or pink noise demonstrate that the ELS-based robust time-varying complex speech analysis performs better for stationary voiced speech, whereas the ELS-based time-invariant speech analysis performs better for ordinary voiced frames. However, introducing time-varying analysis increases the computational cost. In this paper, pre-selection is introduced to reduce the computational cost. The pre-selection is performed by peak picking on the speech spectrum estimated by the TV-CAR analysis. The evaluation is carried out using the Keele Pitch Database. The remainder of the chapter is organized as follows. Section 2 explains TV-CAR speech analysis: the analytic signal, the Time-Varying Complex AR (TV-CAR) model, and two TV-CAR parameter estimation algorithms for an analytic signal, viz., the MMSE and ELS methods. Section 3 explains the F0 estimation algorithm in detail, covering the sample-based pre-selection and the frame-based final-selection. Section 4 presents experimental results that confirm the effectiveness of the proposed method.
2. TV-CAR speech analysis
In this section, the ELS-based robust TV-CAR speech analysis method is explained. Before that, the analytic signal and the TV-CAR model are explained; the analytic signal is the output of the TV-CAR model. In 2.5, the benefit of the robust TV-CAR analysis is explained by showing spectra estimated from natural speech.
2.1. Analytic speech signal
The target signal of the time-varying complex AR (TV-CAR) method is an analytic signal, a complex-valued signal that is modeled as the output of an all-pole model. It is defined as follows.
$$y^c(t) = \frac{y(2t) + j\,y_H(2t)}{\sqrt{2}} \tag{1}$$
where $y^c(t)$, $y(t)$, and $y_H(t)$ denote the analytic signal at time t, the observed signal at time t, and the Hilbert transform of the observed signal, respectively. Notice that superscript c denotes a complex value in this paper. Since analytic signals have spectra only over the range (0, π), they can be decimated by a factor of two; the index 2t expresses this decimation. The factor $1/\sqrt{2}$ is applied to match the power of the analytic signal to that of the observed one.
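As a minimal sketch of the definition above, the analytic signal can be built with an FFT-based Hilbert transform, decimated by two, and scaled by $1/\sqrt{2}$. The function name `analytic_signal_decimated` is hypothetical, and the FFT-based construction is one common choice assumed here, not necessarily the filter used in this chapter.

```python
import numpy as np

def analytic_signal_decimated(y):
    """Return (y(2t) + j*y_H(2t)) / sqrt(2) for a real-valued sequence y.

    Hypothetical helper: the Hilbert transform is obtained by zeroing the
    negative-frequency FFT bins, which is an assumption of this sketch.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    spectrum = np.fft.fft(y)
    # One-sided weighting: keep DC (and Nyquist), double positive bins,
    # zero the negative-frequency bins.
    weight = np.zeros(n)
    weight[0] = 1.0
    if n % 2 == 0:
        weight[n // 2] = 1.0
        weight[1:n // 2] = 2.0
    else:
        weight[1:(n + 1) // 2] = 2.0
    z = np.fft.ifft(spectrum * weight)   # y(t) + j*y_H(t)
    # Decimate by two and scale so the power matches the observed signal.
    return z[::2] / np.sqrt(2.0)
```

For a cosine that fits an integer number of cycles in the frame, the result is a complex exponential at the same frequency with half as many samples and the same mean power as the input.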
2.2. Time-varying complex AR (TV-CAR) model
The conventional LPC model is defined by
$$Y_{LPC}(z^{-1}) = \frac{1}{1 + \sum_{i=1}^{I} a_i z^{-i}} \tag{2}$$
where $a_i$ and I are the i-th order LPC coefficient and the LPC order, respectively. Since the conventional LPC model cannot express a time-varying spectrum, LPC analysis cannot extract time-varying spectral features from the speech signal. To represent the time-varying features, the TV-CAR model employs a complex basis expansion:
$$a_i^c(t) = \sum_{l=0}^{L-1} g_{i,l}^c f_l^c(t) \tag{3}$$
where $a_i^c(t)$, I, L, $g_{i,l}^c$, and $f_l^c(t)$ are the i-th complex AR coefficient at time t, the AR order, the finite order of the complex basis expansion, the complex parameter, and a complex-valued basis function, respectively. Substituting Eq.(3) into Eq.(2) gives the following transfer function, which is the TV-CAR model.
$$Y_{TVCAR}(z^{-1}) = \frac{1}{1 + \sum_{i=1}^{I} a_i^c(t) z^{-i}} = \frac{1}{1 + \sum_{i=1}^{I} \sum_{l=0}^{L-1} g_{i,l}^c f_l^c(t) z^{-i}} \tag{4}$$
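The input-output relation is defined as
$$y^c(t) = -\sum_{i=1}^{I} a_i^c(t)\, y^c(t-i) + u^c(t) = -\sum_{i=1}^{I} \sum_{l=0}^{L-1} g_{i,l}^c f_l^c(t)\, y^c(t-i) + u^c(t) \tag{5}$$
where $u^c(t)$ and $y^c(t)$ are the complex-valued input and the analytic speech signal of Eq.(1), respectively. In the TV-CAR model, the complex AR coefficient is modeled by a finite number of arbitrary complex basis functions, such as a Fourier basis or a wavelet basis. Note that Eq.(3) parameterizes AR coefficient trajectories that change continuously as a function of time, so the time-varying analysis can estimate a continuously time-varying speech spectrum. In addition, as mentioned above, the complex-valued analysis facilitates accurate spectral estimation at low frequencies; as a result, it allows more accurate F0 estimation once the formant structure is removed by inverse filtering. Eq.(5) can be represented in vector-matrix notation as
$$\bar{y}_f = -\bar{\Phi}_f \bar{\theta} + \bar{u}_f \tag{6}$$
$$\begin{aligned}
\bar{\theta}^T &= [\bar{g}_0^T, \bar{g}_1^T, \ldots, \bar{g}_{L-1}^T], \qquad \bar{g}_l^T = [g_{1,l}^c, g_{2,l}^c, \ldots, g_{I,l}^c] \\
\bar{y}_f^T &= [y^c(I), y^c(I+1), \ldots, y^c(N-1)] \\
\bar{u}_f^T &= [u^c(I), u^c(I+1), \ldots, u^c(N-1)] \\
\bar{\Phi}_f &= [\bar{D}_0^f, \bar{D}_1^f, \ldots, \bar{D}_{L-1}^f], \qquad \bar{D}_l^f = [\bar{d}_{1,l}^f, \ldots, \bar{d}_{I,l}^f] \\
\bar{d}_{i,l}^f &= [y^c(I-i) f_l^c(I),\; y^c(I+1-i) f_l^c(I+1),\; \ldots,\; y^c(N-1-i) f_l^c(N-1)]^T
\end{aligned} \tag{7}$$
where N is the analysis interval, $\bar{y}_f$ is an (N − I, 1) column vector whose elements are the analytic speech signal, $\bar{\theta}$ is an (L · I, 1) column vector whose elements are the complex parameters, and $\bar{\Phi}_f$ is an (N − I, L · I) matrix whose elements are the analytic speech signal weighted by the complex basis. Superscript T denotes transposition.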
2.3. MMSE-based algorithm 
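Several algorithms estimate the TV-CAR model parameters from a complex-valued signal, such as MMSE, WLS (Weighted Least Squares), M-estimation, GLS (Generalized Least Squares), and ELS (Extended Least Squares). The MMSE algorithm is the basic one and is used for the initial estimation of the ELS, so it is explained first.
The equation error and the MSE criterion are defined by
$$\bar{r}_f = [r^c(I), r^c(I+1), \ldots, r^c(N-1)]^T = \bar{y}_f + \bar{\Phi}_f \hat{\theta}, \qquad r^c(t) = y^c(t) + \sum_{i=1}^{I} \sum_{l=0}^{L-1} \hat{g}_{i,l}^c f_l^c(t)\, y^c(t-i) \tag{8}$$
$$E = \bar{r}_f^H \bar{r}_f = (\bar{y}_f + \bar{\Phi}_f \hat{\theta})^H (\bar{y}_f + \bar{\Phi}_f \hat{\theta}) \tag{9}$$
where $\hat{\theta}$ is the estimated complex parameter vector, $r^c(t)$ is the equation error, i.e., the complex AR residual, and E is the Mean Squared Error (MSE) of the equation error; $\bar{y}_f$ and $\bar{\Phi}_f$ are the signal vector and the basis-weighted signal matrix of the vector-matrix form of Eq.(5). To obtain the optimal complex AR coefficients, the MSE criterion is minimized. Minimizing Eq.(9) with respect to the complex parameter leads to the following MMSE algorithm.
$$(\bar{\Phi}_f^H \bar{\Phi}_f)\, \hat{\theta} = -\bar{\Phi}_f^H \bar{y}_f \tag{10}$$
Superscript H denotes Hermitian transposition. After solving the linear equation of Eq.(10), the complex AR parameter $a_i^c(t)$ at time t is obtained by evaluating Eq.(3) with the estimated complex parameter.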
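The MMSE procedure above can be sketched as a single least-squares solve. The following is an illustrative sketch, not the authors' implementation: the function name `tvcar_mmse` is hypothetical, a polynomial basis is assumed, and time is normalised by the frame length for numerical conditioning, which is an assumption of this sketch.

```python
import numpy as np

def tvcar_mmse(yc, I, L):
    """Least-squares TV-CAR fit of a complex frame yc (hypothetical helper).

    Assumes the polynomial basis f_l(t) = (t/N)**l; the time normalisation
    is an assumption made here for numerical conditioning.
    Returns g[i-1, l], a_i(t) for t = I..N-1, and the residual r(t).
    """
    yc = np.asarray(yc, dtype=complex)
    N = len(yc)
    t = np.arange(I, N)
    basis = np.stack([(t / N) ** l for l in range(L)], axis=1)   # (N-I, L)
    # Columns ordered as l = 0 (i = 1..I), l = 1 (i = 1..I), ...
    cols = [basis[:, l] * yc[I - i:N - i]
            for l in range(L) for i in range(1, I + 1)]
    Phi = np.column_stack(cols)                                  # (N-I, L*I)
    yf = yc[I:]
    # Minimising |yf + Phi @ theta|^2 is equivalent to the normal equation
    # (Phi^H Phi) theta = -Phi^H yf.
    theta, *_ = np.linalg.lstsq(Phi, -yf, rcond=None)
    g = theta.reshape(L, I).T            # g[i-1, l]
    a_t = basis @ g.T                    # a_i(t), shape (N-I, I)
    residual = yf + Phi @ theta          # equation error r(t)
    return g, a_t, residual
```

With L = 1 the fit reduces to time-invariant complex LPC; with L = 2 each coefficient follows a straight line within the frame.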
2.4. ELS-based algorithm 
Figure 1 shows the block diagram of the ELS estimation. If the equation error of Eq.(8) is white Gaussian, the MMSE estimation is optimal; however, this is rarely the case. As a result, the MMSE estimate is biased. In the ELS method, an AR filter is adopted to whiten the equation error as follows (Figure 1(2)).
$$r^c(t) = -\sum_{k=1}^{K} b_k^c\, r^c(t-k) + e^c(t) \tag{11}$$
where $b_k^c$ is the k-th parameter of the AR filter whose order is K, and $e^c(t)$ is the zero-mean white Gaussian component of the equation error at time t. The inverse filter of Eq.(11) is called a whitening filter. Using Eq.(5) and Eq.(11), the TV-CAR model can be represented as follows.
$$y^c(t) = -\sum_{i=1}^{I} \sum_{l=0}^{L-1} g_{i,l}^c f_l^c(t)\, y^c(t-i) - \sum_{k=1}^{K} b_k^c\, r^c(t-k) + e^c(t) \tag{12}$$
Eq.(12) is the ELS model shown in Figure 1(3). In the ELS algorithm, the parameter is estimated so as to minimize the MSE of the whitened equation error, whereas in the MMSE algorithm it is estimated so as to minimize the MSE of the equation error itself, as shown in Figure 1(1).
Eq.(12) can be expressed by the following vector-matrix notation.
By minimizing the MSE for Eq.(13), one can get the following equation.
By applying the well-known matrix inversion lemma to Eq.(15), one can obtain the following equation.
The MMSE-estimated parameter contains a bias term. The unbiased estimate is calculated by Eq.(17), using the bias term of Eq.(16). The ELS algorithm is equivalent to the GLS (Generalized Least Squares) algorithm and is more sophisticated than the MMSE algorithm. Since the equation error $r^c(t)$ cannot be observed, an iterative algorithm that estimates A(z) and B(z) is required. The iteration procedure is as follows.
1. The initial parameter estimate is obtained by MMSE (Eq.(10)).
2. The equation error is calculated by Eq.(8).
3. The whitening filter parameter is estimated so as to minimize Eq.(18) using $r^c(t)$.
4. The bias parameter is calculated by Eq.(16).
5. The unbiased parameter is calculated by Eq.(17).
6. Go to 2.
In Eq.(18), R(z) is the z-transform of $r^c(t)$ and B(z) is the transfer function of the whitening filter. Steps 2 to 5 are iterated a pre-determined number of times. The ELS algorithm thus estimates two kinds of AR filters, A(z) and B(z), iteratively. Since the ELS algorithm can estimate a speech spectrum that is unbiased and less affected by additive noise, F0 and formant frequencies can be estimated more accurately. As a result, more accurate F0 trajectories can be obtained than with the MMSE estimation.
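To illustrate the idea of alternating between A(z) and B(z), the sketch below uses a classical GLS-style iteration for the time-invariant case (L = 1): fit A(z), fit B(z) to the equation error, pre-whiten the signal with B(z), and refit A(z). This is a swapped-in, simpler technique and not the bias-compensation update of Eq.(16) and Eq.(17); all function names are hypothetical.

```python
import numpy as np

def ar_least_squares(x, order):
    """Fit x(t) = -sum_i a_i x(t-i) + e(t) by least squares."""
    N = len(x)
    # Regressor columns x(t-1), ..., x(t-order) for t = order..N-1
    X = np.column_stack([x[order - i:N - i] for i in range(1, order + 1)])
    a, *_ = np.linalg.lstsq(-X, x[order:], rcond=None)
    return a

def fir_filter(coeffs, x):
    """Apply 1 + sum_k coeffs[k-1] z^-k to x with zero initial state."""
    h = np.concatenate(([1.0], coeffs))
    return np.convolve(x, h)[:len(x)]

def gls_ar(y, order, noise_order, n_iter=3):
    """GLS-style alternating estimate of A(z) and B(z) (illustrative only).

    Assumption: time-invariant coefficients; the chapter's ELS instead
    removes an explicit bias term from the MMSE estimate.
    """
    y = np.asarray(y, dtype=complex)
    a = ar_least_squares(y, order)               # initial estimate of A(z)
    for _ in range(n_iter):
        r = fir_filter(a, y)                     # equation error r = A(z) y
        b = ar_least_squares(r, noise_order)     # whitening filter B(z)
        y_white = fir_filter(b, y)               # pre-whitened signal B(z) y
        a = ar_least_squares(y_white, order)     # refit A(z)
    return a, b
```

When the equation error is already white, B(z) stays close to 1 and the result stays close to the plain least-squares estimate; the iteration matters when the error is colored.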
2.5. Benefit of robust TV-CAR speech analysis
This section explains the benefit of robust TV-CAR speech analysis by showing estimated speech spectra, and explains its effectiveness for F0 estimation of speech. Figure 2 shows an example of the spectra of the natural Japanese vowel /o/ estimated by complex analysis of the analytic signal and by conventional LPC analysis of the speech signal.
In Figure 2, the left side shows the estimated spectra: the upper panel is for real-valued LPC analysis and the lower panel is for complex-valued LPC analysis. The blue line is the spectrum estimated by LPC analysis and the green line is the DFT spectrum. The right side shows the poles of the estimated AR filter. Figure 3 shows the estimated running spectra for the clean natural speech /arayu/ and for the same speech corrupted by white Gaussian noise (10 [dB]). In Figure 3, (1) is the speech waveform, and (2), (3), (4), (5), and (6) are the spectra estimated by MMSE-based time-invariant real-valued AR speech analysis, by MMSE-based time-invariant complex-valued AR speech analysis (L=1), by MMSE-based time-varying complex AR (TV-CAR) speech analysis (L=2), by ELS-based time-invariant complex-valued AR speech analysis (L=1), and by ELS-based time-varying complex AR (TV-CAR) speech analysis (L=2), respectively. The analysis order I is 14 for real analysis and 7 for complex analysis. The basis function is the first-order polynomial (1, t). One can observe that the complex analysis estimates a more accurate spectrum at low frequencies, whereas its estimation accuracy is lower at high frequencies. Since the speech spectrum has much of its energy at low frequencies, the high spectral estimation accuracy at low frequencies is expected to improve the performance of F0 estimation. Furthermore, the ELS analysis estimates a more accurate spectrum than MMSE, so the ELS analysis makes it possible to estimate F0 more accurately. Time-varying analysis can estimate the time-varying spectrum of speech. Since F0 varies within the analysis interval, the time-varying analysis is expected to enable more accurate F0 estimation.
3. F0 Estimation method
The proposed method employs a two-stage search for F0. In the first stage, pre-selection, F0 and F1 are estimated by sample-based F0 contour estimation. In the second stage, final-selection, F0 is estimated by frame-based F0 estimation within a limited range based on the pre-estimated F0 and F1. The two-stage estimation reduces the computation with little degradation. In 3.1, the pre-selection algorithm is explained. In 3.2, the final-selection algorithm is explained.
3.1. Sample-based pre-selection
F0 and F1 are estimated as the two lowest peak frequencies, viz., the glottal and first formant frequencies, by peak-picking on the estimated time-varying speech spectrum. The procedure of F0 and F1 contour estimation is shown in Figure 4.
The set of complex-valued parameters is estimated by the ELS algorithm for each analysis frame.
Peak-picking is carried out from low frequency to high frequency, as shown in Figure 5. The two estimated peaks correspond to the glottal formant (F0) and the first formant (F1). The formant frequencies are estimated by solving the equation given by the reciprocal of Eq.(4), i.e., by finding the roots of its denominator polynomial.
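One way to sketch this peak-picking is to take the roots of the AR polynomial at a given time instant and read the frequencies from the pole angles. The helper name `lowest_pole_frequencies` is hypothetical, and identifying spectral peaks with pole angles, without any bandwidth check, is an assumption of this sketch. Because the analytic signal is decimated by two, pole angles in [0, 2π) map to frequencies in [0, fs/2) of the original signal, so the decimated sampling rate is passed in.

```python
import numpy as np

def lowest_pole_frequencies(a, fs_decimated, n_peaks=2):
    """Return the n_peaks lowest pole frequencies in Hz (hypothetical helper).

    a: complex AR coefficients [a_1, ..., a_I] of A(z) = 1 + sum_i a_i z^-i
       evaluated at one time instant.
    fs_decimated: sampling rate of the decimated analytic signal.
    """
    # Roots of z^I + a_1 z^(I-1) + ... + a_I are the poles of 1/A(z).
    poles = np.roots(np.concatenate(([1.0], a)))
    # Complex AR poles do not come in conjugate pairs, so angles are
    # wrapped into [0, 2*pi) rather than folded.
    angles = np.mod(np.angle(poles), 2.0 * np.pi)
    freqs = angles * fs_decimated / (2.0 * np.pi)
    return np.sort(freqs)[:n_peaks]
```

Applied sample by sample to the time-varying coefficients, the two returned values give candidate F0 and F1 contours.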
3.2. Frame-based final-selection
In frame-based F0 estimation, the autocorrelation function or the AMDF is commonly used. This section first explains the autocorrelation and the AMDF, and then the weighted autocorrelation adopted here.
The autocorrelation function (AUTOC) is defined by
$$f(\tau) = \frac{1}{N} \sum_{t=0}^{N-1} x(t)\, x(t+\tau) \tag{19}$$
where x(t) is the target signal, such as the speech signal or the LPC residual, N is the frame length, and τ is the delay. F0 is selected from the delay at which Eq.(19) peaks within a certain range of F0.
The AMDF is defined as follows.
$$p(\tau) = \frac{1}{N} \sum_{t=0}^{N-1} \left| x(t) - x(t+\tau) \right| \tag{20}$$
F0 is selected from the delay at which Eq.(20) has its notch within a certain range of F0. In the Shimamura method, the AUTOC is weighted by the reciprocal of the AMDF, as shown in Eq.(21). Since the weighting suppresses other peaks, the method can estimate F0 more accurately than the AUTOC or the AMDF alone. The value of m is set to 1 to avoid a zero denominator.
$$\frac{f(\tau)}{p(\tau) + m} \tag{21}$$
where f(τ) and p(τ) are the AUTOC of Eq.(19) and the AMDF of Eq.(20), respectively. In the frame-based method, the Shimamura criterion of Eq.(21) is applied to the complex AR residual extracted by the ELS-based TV-CAR speech analysis. The time-varying complex parameter is estimated, and the complex AR residual is calculated with the complex parameter estimated by Eq.(17). Note that pre-emphasis is applied for the speech analysis (real-valued AR or TV-CAR), whereas inverse filtering is applied to the speech signal without pre-emphasis so that the F0 component is not eliminated from the residual signal. For a complex-valued signal, the real part of the AUTOC is used. F0 is estimated within the range corresponding to 50-400 [Hz]. To reduce the computational cost, the range is shortened by setting its upper limit according to Eq.(22), which is determined from the F0 and F1 estimated by the sample-based pre-selection. Setting the upper bound below F1 not only reduces the computational cost but can also reduce the estimation error.
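The weighted-autocorrelation criterion can be sketched as below. The function name `weighted_autocorr_f0` and its parameters are hypothetical; the search limits `f_lo` and `f_hi` stand in for the 50-400 [Hz] range, and the narrowed upper limit of Eq.(22) would be passed as `f_hi` (Eq.(22) itself is not reproduced here). Using the same number of terms at every delay is a simplification assumed in this sketch.

```python
import numpy as np

def weighted_autocorr_f0(x, fs, f_lo=50.0, f_hi=400.0, m=1.0):
    """Pick F0 by maximising AUTOC / (AMDF + m) over delays (sketch).

    x may be real or complex; for complex input the real part of the
    autocorrelation is used, as stated in the text.
    """
    x = np.asarray(x)
    lag_min = int(np.ceil(fs / f_hi))
    lag_max = int(np.floor(fs / f_lo))
    n = len(x) - lag_max            # terms available at every delay (assumption)
    if n <= 0:
        raise ValueError("frame too short for the requested F0 range")
    lags = np.arange(lag_min, lag_max + 1)
    score = np.empty(len(lags))
    head = x[:n]
    for k, tau in enumerate(lags):
        tail = x[tau:tau + n]
        autoc = np.real(np.mean(np.conj(head) * tail))   # Eq.(19), real part
        amdf = np.mean(np.abs(head - tail))              # Eq.(20)
        score[k] = autoc / (amdf + m)                    # Eq.(21)
    best_lag = lags[np.argmax(score)]
    return fs / best_lag
```

Lowering `f_hi` raises `lag_min` and therefore shortens the loop, which is where the computational saving of the restricted search range comes from.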
4. Experiments
The speech signals used in the experiment are 5 long sentences uttered by 5 male speakers and 5 long sentences uttered by 5 female speakers from the Keele Pitch Database. The speech signals are filtered by an IRS filter. The IRS filter is a band-pass FIR filter whose frequency response corresponds to that of the analog part of the transmitter of telephone equipment. The frequency response is shown in Figure 6. The IRS filter is introduced in order to evaluate the proposed method on the kind of speech data processed by speech coding. The experimental conditions are summarized in Table 1. The frame length is 25.6 [msec] and the frame shift is 10 [msec]. The analysis orders are 14 and 7 for real-valued and complex-valued analysis, respectively. The basis expansion order L is set to 1 (time-invariant) or 2 (time-varying). A first-order polynomial function is adopted as the basis function. White Gaussian noise or pink noise is used as additive noise, at levels of 30, 20, 10, 5, 0, and -5 [dB]. To extract F0 more accurately, 3-point Lagrange interpolation is adopted. Gross Pitch Error (GPE), a commonly used criterion for F0 estimation, is adopted for objective evaluation. The F0 estimation error is defined as the difference between the true F0 value and the estimated one. The true values are taken from the pitch files in the Keele database. If the magnitude of the error relative to the true F0 exceeds the threshold THR, the estimate is regarded as ERROR, and GPE is the rate of error frames. Otherwise, the estimate is regarded as SUCCESS, and the Fine Pitch Error (FPE) is the standard deviation of the error over these frames. Figures 7, 8, 9, and 10 show the experimental results with THR set to 10 [%]. Figures 7 and 9 show the results for male speech, and Figures 8 and 10 show the results for female speech. In each figure, (1) shows the GPEs or FPEs for additive white Gaussian noise and (2) shows the GPEs or FPEs for additive pink noise. PROPOSED denotes the proposed method with δ set to 25. SP denotes the Shimamura method, viz., the Shimamura criterion applied to the speech signal. The other lines show the GPEs or FPEs for the analysis methods listed in Table 2. In all figures, the X-axis is the noise level of 30, 20, 10, 5, 0, −5 [dB], and the Y-axis is GPE [%] or FPE [Hz].
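The two error measures can be computed as in the sketch below. The function name `gpe_fpe` is hypothetical, and marking unvoiced frames by a true F0 of 0 is an assumption of this sketch rather than a statement about the database format.

```python
import numpy as np

def gpe_fpe(f0_true, f0_est, thr_percent=10.0):
    """Return (GPE in percent, FPE in Hz) over voiced frames (sketch).

    Assumption: frames with f0_true == 0 are unvoiced and are skipped.
    """
    f0_true = np.asarray(f0_true, dtype=float)
    f0_est = np.asarray(f0_est, dtype=float)
    voiced = f0_true > 0
    err = f0_true[voiced] - f0_est[voiced]
    # A frame is a gross error when the relative error exceeds THR.
    gross = np.abs(err) > (thr_percent / 100.0) * f0_true[voiced]
    gpe = 100.0 * np.mean(gross)
    fine = err[~gross]
    # FPE: standard deviation of the error over the non-gross frames.
    fpe = float(np.std(fine)) if fine.size else float("nan")
    return gpe, fpe
```

For example, with true values [100, 100, 200, 200] Hz and estimates [101, 99, 200, 260] Hz, only the last frame exceeds the 10 % threshold, so GPE is 25 % and FPE is the standard deviation of the remaining errors.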
Figures 7 and 8 demonstrate that the proposed method performs slightly better than the full-search method (TVC_E) for male speech and equivalently to the full-search method (TVC_E) for female speech. Figures 9 and 10 show that the proposed method does not perform well in terms of FPE, whereas the Shimamura method performs better in terms of FPE.
Table 1. Experimental conditions.
|Speech data||Keele Pitch database: male, 5 long sentences; female, 5 long sentences|
|IRS filter||64-th FIR|
|Target signal||complex AR residual|
|Analysis window||Window length: 25.6 [ms]; shift length: 10.0 [ms]|
|F0 search range||50 [Hz] to the upper limit of Eq.(22)|
|Complex-valued AR||I=7, L=2 (time-varying); pre-emphasis $1 - z^{-1}$|
|Noise||(1) white Gaussian noise; (2) pink noise; noise level 30, 20, 10, 5, 0, -5 [dB]|
|Interpolation||3-point Lagrange|
Table 2. Analysis methods.
|Line||Real or Complex||Non or TV||MMSE or ELS|
5. Conclusion
This paper proposed a fast and robust fundamental frequency estimation algorithm based on robust TV-CAR speech analysis. The method uses a two-stage search procedure consisting of pre-selection and final-selection. In the pre-selection, F0 and F1 are estimated by time-varying F0 contour estimation. In the final-selection, F0 is estimated only within the shortened range based on the pre-selected F0 and F1. The proposed method performs better for male speech in terms of GPE with reduced computation.
This work was supported by a Grant-in-Aid for Scientific Research (C), Research Project Number: 20500158.