Low Computational Robust F0 Estimation of Speech Based on TV-CAR Analysis

Introduction
F0 estimation determines the performance of speech processing applications such as speech coding, tonal speech recognition, speaker recognition, and speech enhancement. The F0 estimator named "YIN" [1] is widely used around the world because of its high performance and open-source policy. Speech processing is commonly applied in realistic noisy environments, where performance degrades seriously. It is well known that YIN does not perform well for noisy speech, although it performs best for clean speech. Accordingly, a more robust F0 estimation algorithm is desired; robust F0 estimation is a long-standing problem in speech processing.

We have already proposed a robust F0 estimation algorithm based on time-varying complex speech analysis of the analytic speech signal [2][3]. An analytic signal is a complex-valued signal whose real part is the speech signal and whose imaginary part is the Hilbert transform of the real part. Since the analytic signal has a spectrum only at positive frequencies, it can be decimated by a factor of two with no degradation. As a result, the complex analysis offers attractive features, for example, more accurate spectral estimation at low frequencies. In [2] and [3], the complex LPC residual is used to calculate the criterion, which is the autocorrelation function (AUTOC) weighted by the reciprocal of the Average Magnitude Difference Function (AMDF) [6]. The complex residual is calculated from the analytic speech signal by means of the time-varying complex AR (TV-CAR) speech analysis method [4][5]. In [2], MMSE-based TV-CAR speech analysis [4] is used to calculate the complex LPC residual signal, and in [3], ELS-based TV-CAR speech analysis [5] is used. It has been reported in [2] that the method can estimate F0 more accurately for IRS (Intermediate Reference System) filtered speech corrupted by white Gauss noise. Moreover, it has been reported in [3] that the ELS-based complex speech analysis performs better even for additive pink noise. Furthermore, in order to investigate the effectiveness of the time-varying analysis, the performance was compared frame by frame with respect to the degree of voiced nature [7]. The experiments using IRS filtered speech corrupted by white Gauss noise or pink noise demonstrate that ELS-based robust time-varying complex speech analysis performs better for stationary voiced speech, and that ELS-based time-invariant speech analysis performs better for ordinary voiced frames. However, introducing time-varying analysis increases the computational cost.

In this paper, pre-selection is introduced in order to reduce the computational cost. The pre-selection is performed by peak-picking of the speech spectrum estimated by the TV-CAR analysis [8]. The evaluation is carried out using the Keele Pitch Database [9]. The remainder of the chapter is organized as follows. In Section 2, TV-CAR speech analysis is explained: the analytic signal, the Time-Varying Complex AR (TV-CAR) model, and two TV-CAR parameter estimation algorithms for an analytic signal, viz., the MMSE and ELS methods. In Section 3, the F0 estimation algorithm is explained in detail, covering the sample-based pre-selection and the frame-based final-selection. In Section 4, experimental results are presented that confirm the effectiveness of the proposed method.

TV-CAR speech analysis
In this section, the ELS-based robust TV-CAR speech analysis method is explained. Before that, the analytic signal and the TV-CAR model, whose output is the analytic signal, are explained. In 2.6, the benefit of the robust TV-CAR analysis is explained by showing the spectra estimated from natural speech.

Analytic speech signal
The target signal of the time-varying complex AR (TV-CAR) method is an analytic signal, i.e., a complex-valued signal regarded as the output of an all-pole model. It is defined as follows.

$$y^c(t) = \frac{y(2t) + j\,y_H(2t)}{\sqrt{2}} \tag{1}$$

where $y^c(t)$, $y(t)$ and $y_H(t)$ denote the analytic signal at time $t$, the observed signal at time $t$, and the Hilbert transform of the observed signal, respectively. Notice that the superscript $c$ denotes a complex value in this paper. Since analytic signals have spectra only over the range $(0, \pi)$, they can be decimated by a factor of two; the argument $2t$ expresses this decimation. The factor $1/\sqrt{2}$ adjusts the power of the analytic signal to that of the observed one.
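The construction of the decimated analytic signal can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Hilbert transform is realized here by zeroing the negative-frequency DFT bins (the chapter does not say how it is computed), a plain O(N^2) DFT is used to stay within the standard library, and the function names `analytic_signal` and `decimated_analytic_signal` are hypothetical.

```python
import cmath
import math


def analytic_signal(y):
    """Return y(t) + j*y_H(t) for a real sequence y (assumed DFT-based Hilbert transform)."""
    n = len(y)
    # Forward DFT of the real input.
    spec = [sum(y[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]
    for k in range(n):
        # DC (and the Nyquist bin for even n) are kept unchanged.
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue
        # Positive frequencies are doubled, negative frequencies are removed.
        spec[k] = 2 * spec[k] if k < (n + 1) // 2 else 0
    # Inverse DFT yields the complex-valued analytic signal.
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def decimated_analytic_signal(y):
    """Eq. (1): keep every second sample and scale by 1/sqrt(2)."""
    z = analytic_signal(y)
    return [z[2 * t] / math.sqrt(2) for t in range(len(z) // 2)]
```

For a cosine input the result is a complex exponential of half the length whose mean power equals that of the input, which is the purpose of the $1/\sqrt{2}$ factor.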

Design and Architectures for Digital Signal Processing

The conventional LPC model is defined by

$$Y_{LPC}(z^{-1}) = \frac{1}{1 + \sum_{i=1}^{I} a_i z^{-i}} \tag{2}$$

where $a_i$ and $I$ are the $i$-th LPC coefficient and the LPC order, respectively. Since the conventional LPC model cannot express a time-varying spectrum, LPC analysis cannot extract time-varying spectral features from the speech signal. In order to represent time-varying features, the TV-CAR model employs the complex basis expansion

$$a_i^c(t) = \sum_{l=0}^{L-1} g_{i,l}^c f_l^c(t) \tag{3}$$

where $a_i^c(t)$, $I$, $L$, $g_{i,l}^c$ and $f_l^c(t)$ are the $i$-th complex AR coefficient at time $t$, the AR order, the finite order of the complex basis expansion, a complex parameter, and a complex-valued basis function, respectively. By substituting Eq.(3) into Eq.(2), one obtains the following transfer function, which is the TV-CAR model.

$$Y_{TVCAR}(z^{-1}) = \frac{1}{1 + \sum_{i=1}^{I} \sum_{l=0}^{L-1} g_{i,l}^c f_l^c(t) z^{-i}} \tag{4}$$

The input-output relation is defined as

$$y^c(t) = -\sum_{i=1}^{I} a_i^c(t)\, y^c(t-i) + u^c(t) = -\sum_{i=1}^{I} \sum_{l=0}^{L-1} g_{i,l}^c f_l^c(t)\, y^c(t-i) + u^c(t) \tag{5}$$

where $u^c(t)$ and $y^c(t)$ are the complex-valued input and the analytic speech signal of Eq.(1), respectively. In the TV-CAR model, the complex AR coefficient is modeled by a finite number of arbitrary complex basis functions, such as a Fourier basis or a wavelet basis. Note that Eq.(3) parameterizes AR coefficient trajectories that change continuously as a function of time, so the time-varying analysis can estimate a continuously time-varying speech spectrum. In addition, as mentioned above, the complex-valued analysis facilitates accurate spectral estimation at low frequencies; this allows more accurate F0 estimation when the formant structure is removed by inverse filtering. Eq.(5) can be represented in vector-matrix notation as

$$y_f = -\Phi_f \bar{\theta} + u_f \tag{6}$$

$$\begin{aligned}
\bar{\theta}^T &= [g_0^T, g_1^T, \ldots, g_{L-1}^T], & g_l^T &= [g_{1,l}^c, g_{2,l}^c, \ldots, g_{I,l}^c], \\
y_f^T &= [y^c(I), y^c(I+1), \ldots, y^c(N-1)], & u_f^T &= [u^c(I), u^c(I+1), \ldots, u^c(N-1)], \\
\Phi_f &= [D_0^f, D_1^f, \ldots, D_{L-1}^f], & D_l^f &= [d_{1,l}^f, d_{2,l}^f, \ldots, d_{I,l}^f], \\
d_{i,l}^f &= [y^c(I-i) f_l^c(I), \ldots, y^c(N-1-i) f_l^c(N-1)]^T
\end{aligned} \tag{7}$$

where $N$ is the analysis interval, $y_f$ is an $(N-I, 1)$ column vector whose elements are the analytic speech signal, $\bar{\theta}$ is an $(L \cdot I, 1)$ column vector whose elements are the complex parameters, and $\Phi_f$ is an $(N-I, L \cdot I)$ matrix whose elements are the analytic speech signal weighted by the complex basis. Superscript $T$ denotes transposition.
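To make the basis expansion concrete, the following minimal sketch evaluates the coefficient trajectory of Eq.(3) and the resulting time-varying power spectrum of Eq.(4) at a given sample. The function names, the nested-list layout `g[i][l]`, and the uniform frequency grid over $[0, 2\pi)$ are assumptions made for illustration; the first-order polynomial basis (1, t) is the one the chapter reports using.

```python
import cmath
import math


def tvcar_coefficients(g, t, basis):
    """Eq. (3): a_i(t) = sum_l g[i][l] * f_l(t), for every AR index i."""
    return [sum(g_il * f(t) for g_il, f in zip(g_i, basis)) for g_i in g]


def tvcar_power_spectrum(g, t, basis, n_freq=64):
    """|Eq. (4)|^2 on the unit circle at time t, on n_freq bins over [0, 2*pi)."""
    a = tvcar_coefficients(g, t, basis)
    spectrum = []
    for k in range(n_freq):
        w = 2 * math.pi * k / n_freq
        # Denominator 1 + sum_i a_i(t) * exp(-j*w*i), with i starting at 1.
        denom = 1 + sum(a_i * cmath.exp(-1j * w * (i + 1)) for i, a_i in enumerate(a))
        spectrum.append(1.0 / abs(denom) ** 2)
    return spectrum


# First-order polynomial basis (1, t); with only the first function the model is time-invariant.
polynomial_basis = [lambda t: 1.0, lambda t: float(t)]
```

Because the model is complex-valued, the spectrum is not symmetric, which is why the sketch evaluates the full circle rather than half of it.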

MMSE-based algorithm [4]
There are several algorithms that estimate the TV-CAR model parameters from a complex-valued signal, such as MMSE, WLS (Weighted Least Squares), M-estimation, GLS (Generalized Least Squares), and ELS (Extended Least Squares). The MMSE algorithm is the basic one and is used for the initial estimation of the ELS, so it is explained before the ELS algorithm.

The MSE criterion is defined by

$$r^c(t) = y^c(t) + \sum_{i=1}^{I} \sum_{l=0}^{L-1} \hat{g}_{i,l}^c f_l^c(t)\, y^c(t-i) \tag{8}$$

$$E = r_f^H r_f = (y_f + \Phi_f \hat{\theta})^H (y_f + \Phi_f \hat{\theta}), \qquad r_f = [r^c(I), r^c(I+1), \ldots, r^c(N-1)]^T \tag{9}$$

where $\hat{g}_{i,l}^c$ is the estimated complex parameter, $r^c(t)$ is the equation error, or complex AR residual, and $E$ is the Mean Squared Error (MSE) of the equation error. To obtain the optimal complex AR coefficients, the MSE criterion is minimized. Minimizing Eq.(9) with respect to the complex parameters leads to the following MMSE algorithm.

$$(\Phi_f^H \Phi_f)\, \hat{\theta} = -\Phi_f^H y_f \tag{10}$$

Superscript $H$ denotes Hermitian transposition. After solving the linear equation of Eq.(10), the complex AR parameter $a_i^c(t)$ at time $t$ is obtained by evaluating Eq.(3) with the estimated complex parameters $\hat{g}_{i,l}^c$.
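The MMSE step amounts to building the data matrix and solving the normal equations. The sketch below does this in pure Python under some assumptions: the helper names `solve_linear` and `mmse_tvcar` are hypothetical, the parameters are ordered with the basis index as the outer index and the AR index as the inner one, and a small Gaussian elimination stands in for whatever solver the original implementation uses.

```python
def solve_linear(a, b):
    """Solve a x = b for a small complex system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        acc = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / m[r][r]
    return x


def mmse_tvcar(yc, order, basis):
    """Solve (Phi^H Phi) theta = -Phi^H y for theta = [g_{1,0}..g_{I,0}, g_{1,1}..g_{I,1}, ...]."""
    n = len(yc)
    # Each row of Phi holds y(t - i) * f_l(t) for t = I .. N-1.
    rows = [[yc[t - i] * f(t) for f in basis for i in range(1, order + 1)]
            for t in range(order, n)]
    target = [yc[t] for t in range(order, n)]
    dim = order * len(basis)
    gram = [[sum(row[p].conjugate() * row[q] for row in rows) for q in range(dim)]
            for p in range(dim)]
    rhs = [-sum(row[p].conjugate() * y for row, y in zip(rows, target))
           for p in range(dim)]
    return solve_linear(gram, rhs)
```

As a usage example, for a noise-free first-order signal `yc[t] = p**t` the call `mmse_tvcar(yc, 1, [lambda t: 1.0])` returns a single parameter close to `-p`, since the model convention is $y^c(t) = -a_1^c\, y^c(t-1)$.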

ELS-based algorithm [5]
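Figure 1 shows the block diagram of the ELS estimation. If the equation error in Eq.(8) is white Gaussian, the MMSE estimation is optimal; however, this is rarely the case, and as a result the MMSE estimate is biased. In the ELS method, an AR filter is adopted to whiten the equation error as follows (Figure 1(2)).

$$r^c(t) = -\sum_{k=1}^{K} b_k^c\, r^c(t-k) + e^c(t) \tag{11}$$

where $b_k^c$ is the $k$-th parameter of the AR filter of order $K$ and $e^c(t)$ is the zero-mean white Gaussian equation error at time $t$. The inverse filter of Eq.(11) is called a whitening filter. Using Eq.(5) and Eq.(11), the TV-CAR model can be represented as follows.

$$y^c(t) = -\sum_{i=1}^{I} \sum_{l=0}^{L-1} g_{i,l}^c f_l^c(t)\, y^c(t-i) - \sum_{k=1}^{K} b_k^c\, r^c(t-k) + e^c(t) \tag{12}$$

Eq.(12) is the ELS model shown in Figure 1(3). In the ELS algorithm the parameters are estimated so as to minimize the MSE of the whitened equation error, whereas in the MMSE algorithm, shown in Figure 1(1), they are estimated so as to minimize the MSE of the equation error itself.
Eq.(12) can be expressed in the following vector-matrix notation.

$$y_f = -\begin{pmatrix} \Phi_f & R_f \end{pmatrix} \begin{pmatrix} \bar{\theta} \\ b \end{pmatrix} + e_f \tag{13}$$

where

$$\begin{aligned}
b &= [b_1^c, b_2^c, \ldots, b_K^c]^T, & R_f &= [r_1^f, r_2^f, \ldots, r_K^f], \\
r_k^f &= [r^c(I-k), \ldots, r^c(N-1-k)]^T, & e_f &= [e^c(I), \ldots, e^c(N-1)]^T
\end{aligned} \tag{14}$$

By minimizing the MSE for Eq.(13), one obtains the following equation.

$$\begin{pmatrix} \hat{\theta} \\ \hat{b} \end{pmatrix} = -\begin{pmatrix} \Phi_f^H \Phi_f & \Phi_f^H R_f \\ R_f^H \Phi_f & R_f^H R_f \end{pmatrix}^{-1} \begin{pmatrix} \Phi_f^H y_f \\ R_f^H y_f \end{pmatrix} \tag{15}$$

By applying the well-known matrix inversion lemma to Eq.(15), one obtains the following equation.

$$\hat{\theta} = \hat{\theta}_0 - \hat{\theta}_{bias}, \qquad \hat{\theta}_0 = -(\Phi_f^H \Phi_f)^{-1} \Phi_f^H y_f, \qquad \hat{\theta}_{bias} = (\Phi_f^H \Phi_f)^{-1} \Phi_f^H R_f\, \hat{b} \tag{16}$$

The MMSE estimate $\hat{\theta}_0$ contains the bias $\hat{\theta}_{bias}$, and the unbiased estimate $\hat{\theta}$ is calculated as $\hat{\theta}_0 - \hat{\theta}_{bias}$. The ELS algorithm is equivalent to, and a more sophisticated form of, the GLS (Generalized Least Squares) algorithm. Since the equation error $r^c(t)$ cannot be observed, an iterative algorithm that estimates $A(z)$ and $B(z)$ in turn is required.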
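The iteration procedure is as follows.

1. The initial estimate $\hat{\theta}_0$ is obtained by the MMSE algorithm of Eq.(10), and $\hat{\theta}$ is set to $\hat{\theta}_0$.

2. The equation error $r^c(t)$ is calculated by Eq.(17) with the current estimate $\hat{\theta}$.

3. $\hat{b}$ is estimated so as to minimize Eq.(18) using $r^c(t)$.

4. The bias parameter $\hat{\theta}_{bias}$ is calculated by Eq.(16).

5. The estimate is updated as $\hat{\theta} = \hat{\theta}_0 - \hat{\theta}_{bias}$.

$$r_f = y_f + \Phi_f \hat{\theta} \tag{17}$$

$$E(z) = B(z)\, R(z), \qquad B(z) = 1 + \sum_{k=1}^{K} b_k^c z^{-k} \tag{18}$$

In Eq.(18), $R(z)$ is the z-transform of $r^c(t)$, $B(z)$ is the transfer function of the whitening filter, and $E(z)$ is the z-transform of the whitened error $e^c(t)$, whose power is minimized.
Steps 2 to 5 are iterated a pre-determined number of times. The ELS algorithm thus estimates two AR filters, $A(z)$ and $B(z)$, iteratively. Since the ELS algorithm yields an unbiased spectral estimate that is less affected by additive noise, more accurate F0 and formant frequencies can be estimated, and hence more accurate F0 trajectories than with the MMSE estimation.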

Benefit of robust TV-CAR speech analysis
In this paragraph, we explain the benefit of robust TV-CAR speech analysis by showing estimated speech spectra, and explain its effectiveness for F0 estimation of speech. Figure 2 shows an example of the estimated spectra of the natural Japanese vowel /o/, obtained by complex LPC analysis of the analytic signal and by conventional LPC analysis of the speech signal. In Figure 2, the left side shows the estimated spectra: the upper panel is for real-valued LPC analysis and the lower panel is for complex-valued LPC analysis. The blue line is the spectrum estimated by LPC analysis and the green line is the DFT spectrum. The right side shows the poles of the estimated AR filter. Figure 3 shows the estimated running spectra for the clean natural speech /arayu/ and for the same speech corrupted by white Gaussian noise (10 [dB]). In Figure 3, (1) is the speech waveform, and (2), (3), (4), (5) and (6) are the spectra estimated by MMSE-based time-invariant real-valued AR speech analysis, by MMSE-based time-invariant complex-valued AR speech analysis (L=1), by MMSE-based time-varying complex AR (TV-CAR) speech analysis (L=2), by ELS-based time-invariant complex-valued AR speech analysis (L=1), and by ELS-based time-varying complex AR (TV-CAR) speech analysis (L=2), respectively. The analysis order I is 14 for real analysis and 7 for complex analysis. The basis function is the first-order polynomial (1, t). One can observe that the complex analysis estimates a more accurate spectrum at low frequencies, whereas its accuracy is lower at high frequencies. Since the speech spectrum has much of its energy at low frequencies, the high spectral estimation accuracy there is expected to improve the performance of F0 estimation. Furthermore, the ELS analysis estimates a more accurate spectrum than MMSE, so it makes more accurate F0 estimation possible. Time-varying analysis can estimate the time-varying spectrum of speech, and it is expected to enable more accurate F0 estimation because F0 varies within the analysis interval.

F0 estimation method
The proposed method employs a two-stage search for F0. In the first stage, pre-selection, F0 and F1 are estimated by sample-based F0 contour estimation [8]. In the second stage, final-selection, F0 is estimated by frame-based F0 estimation [3] within a limited range based on the pre-estimated F0 and F1. The two-stage estimation makes it possible to reduce the computation with little degradation. In 3.1, the pre-selection algorithm is explained. In 3.2, the final-selection algorithm is explained.

Sample-based pre-selection
F0 and F1 are estimated as the two lowest peak frequencies, viz., the glottal formant and first formant frequencies, by peak-picking on the estimated time-varying speech spectrum. The procedure of F0 and F1 contour estimation is shown in Figure 4.

1. The set of complex-valued parameters $\hat{g}_{i,l}^c$ is estimated by the ELS algorithm for each analysis frame.

2. By using Eq.(3) and Eq.(4) with the estimated parameters $\hat{g}_{i,l}^c$, the speech power spectrum for each sample $t$ is calculated, and the two peaks of the estimated spectrum are searched for by peak-picking.

The peak-picking is carried out from low frequency to high frequency, as shown in Figure 5.
The two estimated peaks correspond to the glottal formant (F0) and the first formant (F1). The formant frequencies are estimated by solving the equation of the reciprocal of Eq.(4), i.e., by finding the roots of its denominator polynomial.
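The low-to-high peak search described above can be sketched as follows. This is a simplified illustration, not the chapter's implementation: a peak is taken to be any local maximum of a sampled power spectrum, and the names `pick_lowest_peaks` and `bin_to_hz`, as well as the uniform bin layout, are assumptions.

```python
def pick_lowest_peaks(power_spectrum, n_peaks=2):
    """Scan bins from low to high frequency and return the first n_peaks local maxima."""
    peaks = []
    for k in range(1, len(power_spectrum) - 1):
        if power_spectrum[k - 1] < power_spectrum[k] >= power_spectrum[k + 1]:
            peaks.append(k)
            if len(peaks) == n_peaks:
                break  # Higher-frequency peaks are never examined.
    return peaks


def bin_to_hz(k, n_bins, bandwidth_hz):
    """Map bin k of n_bins uniformly spaced bins covering [0, bandwidth_hz) to Hz."""
    return k * bandwidth_hz / n_bins
```

Stopping at the second peak is what keeps the pre-selection cheap: the two returned bins play the roles of the pre-selected F0 and F1 for that sample.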

Frame-based final-selection
In frame-based F0 estimation, autocorrelation or AMDF is commonly used. In this paragraph, autocorrelation and AMDF are explained first, and then the adopted weighted autocorrelation is explained.
The autocorrelation function (AUTOC) is defined by

$$f(\tau) = \frac{1}{N} \sum_{t=0}^{N-1} x(t)\, x(t+\tau) \tag{19}$$

where $x(t)$ is the target signal, such as the speech signal or the LPC residual, $N$ is the frame length and $\tau$ is the delay. F0 is selected as the peak frequency of Eq.(19) within a certain range of F0.
The AMDF is defined as follows.

$$p(\tau) = \frac{1}{N} \sum_{t=0}^{N-1} \left| x(t) - x(t+\tau) \right| \tag{20}$$

F0 is selected as the notch frequency of Eq.(20) within a certain range of F0. In the Shimamura method [6], the AUTOC is weighted by the reciprocal of the AMDF, as shown in Eq.(21). Since the weighting suppresses the other peaks, the method can estimate F0 more accurately than AUTOC or AMDF alone. The value of $m$ is set to 1 in order to avoid a zero denominator.

$$\frac{f(\tau)}{p(\tau) + m} \tag{21}$$

where $f(\tau)$ and $p(\tau)$ are the AUTOC of Eq.(19) and the AMDF of Eq.(20), respectively. In the frame-based method, the Shimamura criterion of Eq.(21) is applied to the complex AR residual extracted by the ELS-based TV-CAR speech analysis. The time-varying complex parameters are estimated, and the complex AR residual is calculated from them with Eq.(17). Note that pre-emphasis is applied for the speech analysis (real-valued AR or TV-CAR), whereas inverse filtering is applied to the non-pre-emphasized speech signal so as not to eliminate the F0 spectrum from the residual signal.
For a complex-valued signal, the real part of the AUTOC is used. F0 is estimated within the range corresponding to 50-400 [Hz]. In order to reduce the computational amount, this range is shortened by setting its upper value from $F_0^s$ and $F_1^s$, the F0 and F1 estimated by the sample-based pre-selection. Setting the upper bound below F1 not only reduces the computational cost but also reduces the estimation error.
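The frame-based criterion can be illustrated with a short pure-Python sketch of the AUTOC, the AMDF, and the AMDF-weighted AUTOC with m = 1. Several details are assumptions rather than the chapter's exact choices: the sums run only over the samples available inside the frame, the complex case uses a conjugated product before taking the real part, and the function names and the full 50-400 Hz lag search (without the shortened upper bound) are illustrative.

```python
import math


def autoc(x, tau):
    """Autocorrelation at lag tau; the real part is taken so complex input also works."""
    n = len(x) - tau
    return sum((x[t] * x[t + tau].conjugate()).real for t in range(n)) / len(x)


def amdf(x, tau):
    """Average magnitude difference function at lag tau."""
    n = len(x) - tau
    return sum(abs(x[t] - x[t + tau]) for t in range(n)) / len(x)


def weighted_autoc(x, tau, m=1.0):
    """AUTOC weighted by the reciprocal of the AMDF; m avoids a zero denominator."""
    return autoc(x, tau) / (amdf(x, tau) + m)


def estimate_f0(x, fs, f_min=50.0, f_max=400.0, m=1.0):
    """Return fs / tau for the lag tau that maximizes the weighted criterion."""
    tau_min = int(fs / f_max)
    tau_max = int(fs / f_min)
    best = max(range(tau_min, tau_max + 1), key=lambda tau: weighted_autoc(x, tau, m))
    return fs / best
```

Lowering `f_max` shrinks the lag range from below, and raising `f_min` shrinks it from above, so any narrowing of the frequency range directly reduces the number of lags evaluated per frame.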

Experiments
The speech signals used in the experiments are 5 long sentences uttered by 5 male speakers and 5 long sentences uttered by 5 female speakers, taken from the Keele pitch database [9]. The speech signals are filtered by an IRS filter [10]. The IRS filter is a band-pass FIR filter whose frequency response corresponds to that of the analog part of the transmitter of telephone equipment. The frequency response is shown in Figure 6. As in [2], the IRS filter is introduced in order to evaluate the proposed method on speech data of the kind processed by speech coding. The experimental conditions are summarized in Table 1. The frame length is 25.6 [msec] and the frame shift is 10 [msec]. The analysis orders are 14 and 7 for real-valued analysis and complex-valued analysis, respectively. The basis expansion order L is set to 1 (time-invariant) or 2 (time-varying) in the experiments. A first-order polynomial is adopted as the basis function. White Gauss noise or pink noise [11] is used as additive noise, at levels of 30, 20, 10, 5, 0, and -5 [dB]. In order to extract F0 more accurately, 3-point Lagrange interpolation is adopted. The commonly used criterion for F0 estimation, Gross Pitch Error (GPE), is adopted for objective evaluation. The F0 estimation error is defined as

$$e_p(n) = F_t(n) - F_e(n)$$

where $F_t(n)$ is the true F0 value and $F_e(n)$ is the estimated one. The true values are taken from the pitch file in the Keele database. If $|e_p(n)| \geq F_t(n) \times THR / 100$, the estimate is regarded as ERROR, and GPE is the probability of the error frames. Otherwise, the estimate is regarded as SUCCESS, and the Fine Pitch Error (FPE) is the standard deviation of the error over these frames. Figures 7, 8, 9 and 10 show the experimental results with THR set to 10 [%]. Figures 7 and 9 show the results for male speech.
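The two evaluation measures can be computed as in the following sketch. It assumes that only voiced frames with a positive reference F0 are passed in, that the error is the reference minus the estimate, and that FPE is the population standard deviation over the SUCCESS frames; the function name `gpe_fpe` is hypothetical.

```python
import math


def gpe_fpe(true_f0, est_f0, thr_percent=10.0):
    """Return (GPE in percent, FPE in Hz) for paired per-frame F0 values."""
    errors = [ft - fe for ft, fe in zip(true_f0, est_f0)]
    # A frame is a SUCCESS when |e_p(n)| is below THR percent of the true F0.
    fine = [e for e, ft in zip(errors, true_f0) if abs(e) < ft * thr_percent / 100.0]
    gpe = 100.0 * (len(errors) - len(fine)) / len(errors)
    if not fine:
        return gpe, 0.0
    mean = sum(fine) / len(fine)
    fpe = math.sqrt(sum((e - mean) ** 2 for e in fine) / len(fine))
    return gpe, fpe
```

With THR at 10 %, a frame whose reference is 100 Hz counts as a gross error once the estimate deviates by 10 Hz or more, so GPE captures octave-type failures while FPE measures the spread of the remaining small deviations.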

Conclusions
This paper proposed a fast, robust fundamental frequency estimation algorithm based on robust TV-CAR speech analysis. The method uses a two-stage search procedure consisting of pre-selection and final-selection. In the pre-selection, F0 and F1 are estimated by time-varying F0 contour estimation. In the final-selection, F0 is estimated only within the shortened range based on the pre-selected F0 and F1. The proposed method performs better for male speech in terms of GPE, with reduced computation.

Figure 1. Block diagrams of MMSE and ELS estimation.

Figure 2. Estimated spectra of vowel /o/ with complex and conventional LPC analysis.

Figure 4. Flow of F0 and F1 contour estimation.

Figure 6. Frequency response of IRS filter.

Figures 8 and 10 show the results for female speech. In the figures, (1) shows the GPEs or FPEs for additive white Gauss noise, and (2) shows the GPEs or FPEs for additive pink noise. PROPOSED denotes the GPEs or FPEs of the proposed method with δ set to 25. SP denotes the Shimamura method [6], viz., the Shimamura criterion applied to the speech signal. The other lines show the GPEs or FPEs of the analysis methods listed in Table 2. In all figures, the X-axis is the noise level of 30, 20, 10, 5, 0, and -5 [dB], and the Y-axis is GPE [%] or FPE [Hz]. Figures 7 and 8 demonstrate that the proposed method performs slightly better than the full-search method (TVC_E) for male speech, while it performs equivalently to the full-search method (TVC_E) for female speech. Figures 9 and 10 show that the proposed method does not perform well in terms of FPE, whereas the Shimamura method performs better in terms of FPE.

Table 1. Experimental Conditions

Low Computational Robust F0 Estimation of Speech Based on TV-CAR Analysis http://dx.doi.org/10.5772/51694

Table 2. Analysis methods