Open access peer-reviewed chapter

Methods for Speech Signal Structuring and Extracting Features

Written By

Eugene Fedorov, Tetyana Utkina and Tetiana Neskorodieva

Submitted: 11 March 2022 Reviewed: 23 March 2022 Published: 16 June 2022

DOI: 10.5772/intechopen.104634

From the Edited Volume

Computational Semantics

Edited by George Dekoulis and Jainath Yadav


Abstract

The preliminary stage of biometric identification is speech signal structuring and feature extraction. For calculation of the fundamental tone, the following methods are considered and numerically investigated: the autocorrelation function (ACF) method, the average magnitude difference function (AMDF) method, the simplified inverse filter transformation (SIFT) method, the method based on wavelet analysis, the method based on the cepstral analysis, and the harmonic product spectrum (HPS) method. For speech signal feature extraction, the following methods are considered and numerically investigated: the digital bandpass filter bank; spectral analysis; homomorphic processing; and linear predictive coding. These methods make it possible to extract linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), log area ratio (LAR) coefficients, mel-frequency cepstral coefficients (MFCC), bark-frequency cepstral coefficients (BFCC), perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), perceptual log area ratio (PLAR) coefficients, reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC), and reconsidered perceptual log area ratio (RPLAR) coefficients. The largest identification probability (0.98) with the smallest number of coefficients (4) is provided by coding a vocal speech sound from the TIMIT database based on PRC.

Keywords

  • speech recognition
  • speech signal structuring and extracting features
  • the digital bandpass filters bank
  • spectral analysis
  • homomorphic processing
  • linear predictive coding

1. Introduction

Most often, the following features are extracted from a speech signal [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]: power features (energies of spectral bands); cepstrum; linear prediction parameters; fundamental tone and formants; mel-frequency cepstral coefficients (MFCC); bark-frequency cepstral coefficients (BFCC); parameters of perceptual linear prediction; and parameters of the reconsidered perceptual linear prediction.

For feature extraction from a speech signal, the following are usually used [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]: a digital bandpass filter bank; spectral analysis (Fourier transformation, wavelet transformation); homomorphic processing; linear predictive coding; the MFCC method; the BFCC method; perceptual linear prediction; and reconsidered perceptual linear prediction.


2. Calculation methods of the fundamental tone

For calculation of the fundamental tone, methods are used that are based on the analysis of the following signal representations [3]: amplitude-time; spectral (amplitude-frequency); cepstral (amplitude-quefrency); and wavelet-spectral (amplitude-time-frequency).

2.1 ACF method

The autocorrelation function (ACF) method searches for the maximum value of the autocorrelation function [3]; a minimal code sketch is given after the steps below.

1. For the chosen signal frame of length ΔN, the autocorrelation function is calculated

R(k) = \frac{1}{\Delta N} \sum_{n=0}^{\Delta N - 1 - k} x(n)\, x(n+k), \quad k \in \overline{0, \Delta N - 1}.   (E1)

2. The value k* at which the autocorrelation function R(k) is maximal is determined; it corresponds to the extraction of the period of the speech signal

k^* = \arg\max_{k} R(k), \quad k \in \overline{0, \Delta N - 1}.   (E2)

The period of the fundamental tone is defined as

T_{\mathrm{FT}} = \begin{cases} k^*, & n_1 \le k^* \le n_2, \\ 0, & \text{otherwise}, \end{cases}   (E3)

where n_1 is the minimum length of the fundamental tone period, n_1 = inf T_FT, and n_2 is the maximum length of the fundamental tone period, n_2 = sup T_FT.
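A minimal Python sketch of the ACF pitch estimator described by (E1)-(E3) might look as follows; the frame, the sampling frequency, and the period bounds n_1, n_2 used in the usage example are illustrative assumptions, not values prescribed by the chapter.

```python
import numpy as np

def acf_pitch_period(x, n1, n2):
    """Fundamental-tone period of one frame by the ACF method, cf. (E1)-(E3)."""
    dN = len(x)
    # (E1): R(k) = (1/dN) * sum_{n=0}^{dN-1-k} x(n) x(n+k)
    R = np.array([np.dot(x[:dN - k], x[k:]) / dN for k in range(dN)])
    # (E2): lag of the maximum; the search starts at n1 to skip the trivial peak at k = 0
    k_star = n1 + int(np.argmax(R[n1:]))
    # (E3): accept the lag only if it lies inside the admissible range [n1, n2]
    return k_star if k_star <= n2 else 0

# Illustrative usage (hypothetical parameters): a 200 Hz tone sampled at 8 kHz
fs = 8000
frame = np.sin(2 * np.pi * 200 * np.arange(256) / fs)
print(acf_pitch_period(frame, n1=20, n2=200))  # expected close to fs / 200 = 40 samples
```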

2.2 AMDF method

The average magnitude difference function (AMDF) method searches for the minimum value of the average magnitude difference [3], which is faster than searching for the maximum of the autocorrelation function; a code sketch is given after the steps below.

1. For the chosen signal frame of length ΔN, the average magnitude difference function is calculated

v(k) = \frac{1}{\Delta N} \sum_{n=0}^{\Delta N - 1 - k} \left| x(n) - x(n+k) \right|, \quad k \in \overline{0, \Delta N - 1}.   (E4)

2. The value k* at which the average magnitude difference function v(k) is minimal is determined; it corresponds to the extraction of the period of the speech signal

k^* = \arg\min_{k} v(k), \quad k \in \overline{0, \Delta N - 1}.   (E5)

The period of the fundamental tone is defined as

T_{\mathrm{FT}} = \begin{cases} k^*, & n_1 \le k^* \le n_2, \\ 0, & \text{otherwise}, \end{cases}   (E6)

where n_1 is the minimum length of the fundamental tone period, n_1 = inf T_FT, and n_2 is the maximum length of the fundamental tone period, n_2 = sup T_FT.
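An analogous sketch for the AMDF estimator of (E4)-(E6); it differs from the ACF version only in that a minimum of absolute differences is sought instead of a correlation maximum.

```python
import numpy as np

def amdf_pitch_period(x, n1, n2):
    """Fundamental-tone period of one frame by the AMDF method, cf. (E4)-(E6)."""
    dN = len(x)
    # (E4): v(k) = (1/dN) * sum_{n=0}^{dN-1-k} |x(n) - x(n+k)|  (no multiplications needed)
    v = np.array([np.sum(np.abs(x[:dN - k] - x[k:])) / dN for k in range(dN)])
    # (E5): lag of the minimum difference; the search starts at n1 to skip the trivial minimum at k = 0
    k_star = n1 + int(np.argmin(v[n1:]))
    # (E6): accept the lag only if it lies inside [n1, n2]
    return k_star if k_star <= n2 else 0
```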

2.3 SIFT method

The simplified inverse filter transformation (SIFT) method searches for the maximum value of the autocorrelation function of the linear prediction error of the decimated signal [4]; a code sketch is given after the steps below.

1. From the chosen signal frame of length ΔN, the frequency range containing the frequency of the fundamental tone is extracted by means of an elliptic low-pass filter with cut-off frequency f_cut = 1000 Hz. Instead of the elliptic low-pass filter used in [4], the following consecutive calculation is proposed:

    • DFT (discrete Fourier transform)

X(k) = \sum_{n=0}^{\Delta N - 1} x(n)\, e^{-j (2\pi/\Delta N) n k}, \quad k \in \overline{0, \Delta N - 1};   (E7)

• extraction of the lower frequencies

X_{\mathrm{low}}(k) = \begin{cases} X(k), & 0 \le k \le k_{\mathrm{cut}}, \\ 0, & k_{\mathrm{cut}} < k \le \Delta N - 1, \end{cases} \qquad k_{\mathrm{cut}} = \lfloor f_{\mathrm{cut}} \Delta N / f_d \rfloor,   (E8)

where f_d is the sampling frequency;

    • calculation of the inverse DFT

y(n) = \mathrm{Re}\!\left[ \frac{1}{\Delta N} \sum_{k=0}^{\Delta N - 1} X_{\mathrm{low}}(k)\, e^{j (2\pi/\Delta N) n k} \right], \quad n \in \overline{0, \Delta N - 1}.   (E9)

2. The sampling frequency is decreased to f_{1d} = 2000 Hz by decimation of the signal, i.e., intermediate samples of the signal are removed

s(n) = y(n \Delta n), \quad n \in \overline{0, \Delta N/\Delta n - 1},   (E10)

where Δn = f_d / f_{1d} is the decimation coefficient and f_d is the sampling frequency.

3. The differences of two neighbouring samples of the decimated signal are calculated

s_{\Delta}(n) = \begin{cases} s(n), & n = 0, \\ s(n) - s(n-1), & n > 0, \end{cases} \quad n \in \overline{0, \Delta N/\Delta n - 1}.   (E11)

4. The autocorrelation function is calculated

\tilde{s}_{\Delta}(n) = s_{\Delta}(n)\, w(n), \qquad w(n) = 0.54 + 0.46 \cos\frac{2\pi n}{\Delta N},   (E12)
R(k) = \sum_{n=0}^{\Delta N/\Delta n - 1 - k} \tilde{s}_{\Delta}(n)\, \tilde{s}_{\Delta}(n+k), \quad k \in \overline{0, p},   (E13)

where w(n) is Hamming's window, p is the order of linear prediction, ⌈f_{1d}/1000⌉ ≤ p ≤ 5 + ⌈f_{1d}/1000⌉, and ceil(f) is the function that rounds f up to the nearest integer.

5. The LPC coefficients a_j are calculated by the Durbin procedure.

6. The linear prediction error is calculated by means of the LPC coefficients

e(n) = \begin{cases} s_{\Delta}(n), & n < p, \\ s_{\Delta}(n) - \sum_{k=1}^{p} a_k\, s_{\Delta}(n-k), & n \ge p, \end{cases} \quad n \in \overline{0, \Delta N/\Delta n - 1},   (E14)

where e(n) is the prediction error.

7. The autocorrelation function of the linear prediction error is calculated

e_w(n) = e(n)\, w(n), \qquad w(n) = 0.54 + 0.46 \cos\frac{2\pi n}{\Delta N},   (E15)
r(k) = \sum_{n=0}^{\Delta N/\Delta n - 1 - k} e_w(n)\, e_w(n+k), \quad k \in \overline{0, \Delta N/\Delta n - 1},   (E16)

where w(n) is Hamming's window.

8. The value k* at which the autocorrelation function r(k) is maximal is determined; it corresponds to the extraction of the period of the speech signal

k^* = \arg\max_{k} r(k), \qquad r^* = \max_{k} r(k), \qquad k \Delta n \in [n_1, n_2],   (E17)

where n_1 is the minimum length of the fundamental tone period, n_1 = inf T_FT, and n_2 is the maximum length of the fundamental tone period, n_2 = sup T_FT.

Thus, the length of the fundamental tone period is determined as

T_{\mathrm{FT}} = \begin{cases} k^* \Delta n, & r^* \ge \gamma, \\ 0, & r^* < \gamma, \end{cases}   (E18)

where γ is the threshold value.
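The eight SIFT steps above can be sketched in Python roughly as follows. The Durbin recursion, the use of np.hamming as the window, and the normalization of the peak by r(0) in the threshold test are assumptions made where the chapter leaves details open; all numeric defaults are illustrative.

```python
import numpy as np

def levinson_durbin(R, p):
    """Durbin recursion: LPC coefficients a(1..p) from autocorrelations R(0..p)."""
    a, E = np.zeros(p + 1), R[0]
    for i in range(1, p + 1):
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, E = a_new, (1.0 - k * k) * E
    return a[1:], E

def sift_pitch_period(x, fd, n1, n2, gamma=0.3, f_cut=1000.0, f1d=2000.0):
    """Fundamental-tone period of one frame by SIFT, cf. steps 1-8 and (E7)-(E18)."""
    dN = len(x)
    # Step 1, (E7)-(E9): low-pass filtering by zeroing DFT bins above f_cut
    X = np.fft.fft(x)
    k_cut = int(f_cut * dN / fd)
    X[k_cut + 1:] = 0.0
    y = np.fft.ifft(X).real
    # Step 2, (E10): decimation to the sampling frequency f1d
    dn = int(fd / f1d)
    s = y[::dn]
    M = len(s)
    # Step 3, (E11): first difference of the decimated signal
    sd = np.concatenate(([s[0]], np.diff(s)))
    # Step 4, (E12)-(E13): window and autocorrelation up to the LPC order p
    p = 5 + int(np.ceil(f1d / 1000.0))
    w = np.hamming(M)                      # window choice is an assumption
    sw = sd * w
    R = np.array([np.dot(sw[:M - k], sw[k:]) for k in range(p + 1)])
    # Steps 5-6, (E14): LPC coefficients (Durbin) and the prediction error
    a, _ = levinson_durbin(R, p)
    e = np.array([sd[n] if n < p else sd[n] - np.dot(a, sd[n - p:n][::-1])
                  for n in range(M)])
    # Step 7, (E15)-(E16): autocorrelation of the windowed prediction error
    ew = e * w
    r = np.array([np.dot(ew[:M - k], ew[k:]) for k in range(M)])
    # Step 8, (E17)-(E18): peak search for lags k with n1 <= k*dn <= n2, then threshold test
    lo, hi = max(1, n1 // dn), min(M - 1, n2 // dn)
    k_star = lo + int(np.argmax(r[lo:hi + 1]))
    # the peak is normalized by r(0) before comparing with gamma (an assumption)
    return k_star * dn if r[k_star] >= gamma * r[0] else 0
```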

Example 1

Figure 1 presents the source signal, Figure 2 the filtered signal, and Figure 3 the decimated signal; a noisy variant was also considered (additive white Gaussian noise with zero mean and variance 0.001).

Figure 1.

Initial signal.

Figure 2.

The filtered signal.

Figure 3.

The decimated signal.

As the signal, a frame of the sound “A” of length ΔN = 512 is chosen, with sampling frequency f_d = 22050 Hz, 8 bits, mono. Figures 1-6 present the initial signal (Figure 1), the filtered signal (Figure 2), the decimated signal (Figure 3), the signal in the form of the weighted difference (Figure 4), the prediction error (Figure 5), and the autocorrelation function of the prediction error with the found maximum and the admissible boundaries marked (Figure 6).

Figure 4.

A signal in the form of the weighed difference.

Figure 5.

Prediction error.

Figure 6.

Autocorrelated function of prediction error.

2.4 Method based on wavelet analysis

This method calculates the distance between neighbouring maxima of the wavelet coefficients; a code sketch is given at the end of this subsection.

At the first stage, the continuous wavelet transformation, approximated according to the rectangle formula, is calculated as

d_{\mu}(l) = \sum_{n=0}^{N-1} x(n)\, a_0^{-\mu/2}\, \overline{\psi\!\left(a_0^{-\mu} n - b_0 l\right)}\, \Delta t, \quad l \in \overline{0, N - 1}, \qquad \Delta t = 1/f_d,   (E19)

where μ is the decomposition level at which a smooth sinusoid is reached, N is the signal length, and Δt is the quantization step.

For the Morlet wavelet,

\psi(\xi) = (2\pi)^{-1/2} \cos(k_0 \xi)\, e^{-\xi^2/2}, \qquad k_0 = 5, \qquad \xi = a_0^{-\mu} n - b_0 l.   (E20)

Since the sequence d_μ(l) represents a smooth sinusoid, there is no need for the autocorrelation function or the average magnitude difference function, which have considerable computational complexity. Instead of calculating these functions, at the second stage two successive maxima of the sequence d_μ(l) are found and the difference between their positions is calculated as

\left( d_{\mu}(j-1) \le d_{\mu}(j) \ge d_{\mu}(j+1) \right) \wedge \left( d_{\mu}(m-1) < d_{\mu}(m) \ge d_{\mu}(m+1) \right) \wedge \left( d_{\mu}(k-1) \le d_{\mu}(k) < d_{\mu}(k+1),\ j < k < m \right) \;\Rightarrow\; k^* = m - j.   (E21)

The period of the fundamental tone is defined as

T_{\mathrm{FT}} = \begin{cases} k^*, & n_1 \le k^* \le n_2, \\ 0, & \text{otherwise}, \end{cases}   (E22)

where n_1 is the minimum length of the fundamental tone period, n_1 = inf T_FT, and n_2 is the maximum length of the fundamental tone period, n_2 = sup T_FT.
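A rough sketch of the wavelet-based estimator follows. The discretization of (E19)-(E20) is implemented directly, with the scale a_0^μ passed as a single parameter `scale`; the translation is taken in signal samples so that d_μ(l) shares the time axis of x (an interpretation of (E20), since the chapter's discretization can also be read with the translation in the scaled coordinate). All numeric values are illustrative assumptions.

```python
import numpy as np

def morlet_cwt_level(x, fd, scale, b0=1.0, k0=5.0):
    """Wavelet coefficients d_mu(l) of one decomposition level, cf. (E19)-(E20).
    `scale` plays the role of a0**mu; k0 = 5 follows the chapter."""
    N = len(x)
    dt = 1.0 / fd
    n = np.arange(N)
    d = np.empty(N)
    for l in range(N):
        xi = (n - b0 * l) / scale          # translation in signal samples (an assumption)
        # real Morlet wavelet (E20)
        psi = (2.0 * np.pi) ** (-0.5) * np.cos(k0 * xi) * np.exp(-xi ** 2 / 2.0)
        d[l] = np.sum(x * psi) * dt / np.sqrt(scale)
    return d

def wavelet_pitch_period(d, n1, n2):
    """Distance between two successive local maxima of d_mu, cf. (E21)-(E22)."""
    peaks = [i for i in range(1, len(d) - 1) if d[i - 1] <= d[i] > d[i + 1]]
    if len(peaks) < 2:
        return 0
    k = peaks[1] - peaks[0]
    return k if n1 <= k <= n2 else 0
```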

Example 2

Figure 7 shows the sound “A”, and Figure 8 shows the sound “A” at the μ = 50 decomposition level.

Figure 7.

Sound “A” for wavelet analysis.

Figure 8.

A sound “A” at the 50th level of decomposition (frequency range is 51–250 Hz).

2.5 Method based on the cepstral analysis

This method searches for the maximum value of the cepstrum [3]; a code sketch is given after the steps below.

1. For the chosen signal frame of length ΔN, the spectrum is calculated using the DFT

X(k) = \sum_{n=0}^{\Delta N - 1} x(n)\, e^{-j (2\pi/\Delta N) n k}, \quad k \in \overline{0, \Delta N - 1}.   (E23)

2. The cepstrum is calculated using the inverse DFT

s(n) = \frac{1}{\Delta N} \sum_{k=0}^{\Delta N - 1} \lg\left| X(k) \right|^2 e^{j (2\pi/\Delta N) n k}, \quad n \in \overline{0, \Delta N - 1}.   (E24)

3. The value n* at which the cepstrum s(n) is maximal is determined; it corresponds to the extraction of the period of the speech signal

n^* = \arg\max_{n} s(n), \qquad s^* = \max_{n} s(n), \qquad n \in [n_1, n_2],   (E25)

where n_1 is the minimum length of the fundamental tone period, n_1 = inf T_FT, and n_2 is the maximum length of the fundamental tone period, n_2 = sup T_FT.

The period of the fundamental tone is defined as

T_{\mathrm{FT}} = \begin{cases} n^*, & s^* \ge \gamma, \\ 0, & s^* < \gamma, \end{cases}   (E26)

where γ is the threshold value.
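A minimal sketch of the cepstral estimator of (E23)-(E26); the small constant added before the logarithm and the values of n_1, n_2, γ are assumptions for numerical robustness, not chapter prescriptions.

```python
import numpy as np

def cepstrum_pitch_period(x, n1, n2, gamma):
    """Fundamental-tone period of one frame by the cepstral method, cf. (E23)-(E26)."""
    # (E23)-(E24): DFT, log power spectrum, inverse DFT (real cepstrum)
    X = np.fft.fft(x)
    s = np.fft.ifft(np.log10(np.abs(X) ** 2 + 1e-12)).real   # small constant avoids log(0)
    # (E25): quefrency of the cepstral maximum inside the admissible range [n1, n2]
    n_star = n1 + int(np.argmax(s[n1:n2 + 1]))
    # (E26): threshold test on the peak value
    return n_star if s[n_star] >= gamma else 0
```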

Example 3

As the signal, a frame of the sound “A” of length ΔN = 512 is chosen, with sampling frequency f_d = 22050 Hz, 8 bits, mono. Figure 9 shows the initial signal, and Figure 10 shows the cepstrum of the signal.

Figure 9.

Initial signal for cepstrum analysis.

Figure 10.

Cepstrum of a sound “A”.

2.6 HPS method

The harmonic product spectrum (HPS) method searches for the maximum value of the product of harmonics of the decimated power spectrum [3]; a code sketch is given after the steps below.

1. For the chosen signal frame of length ΔN, the spectrum is calculated using the DFT

X(k) = \sum_{n=0}^{\Delta N - 1} x(n)\, e^{-j (2\pi/\Delta N) n k}, \quad k \in \overline{0, \Delta N - 1}.   (E27)

2. The power spectrum of the signal is calculated

W(k) = \left| X(k) \right|^2, \quad k \in \overline{0, \Delta N - 1}.   (E28)

3. The power spectrum of the signal is decimated z times, i.e., intermediate frequencies of the power spectrum are removed

W_z(k) = \left| X(z k) \right|^2, \quad k \in \overline{0, \lfloor \Delta N/z \rfloor - 1}, \quad z \in \overline{1, Z},   (E29)

where ⌊·⌋ denotes the integer part of a number.

4. The product of harmonics of the decimated power spectra is calculated

P(k) = \prod_{z=1}^{Z} W_z(k), \quad k \in \overline{0, \lfloor \Delta N/Z \rfloor - 1}.   (E30)

5. The value k* at which the product of harmonics of the decimated power spectra is maximal is determined; it corresponds to the extraction of the period of the speech signal

k^* = \arg\max_{k} P(k), \quad k \in \overline{0, \lfloor \Delta N/Z \rfloor - 1}.   (E31)

The frequency of the fundamental tone is determined as

F_{\mathrm{FT}} = \begin{cases} k^*, & k_1 \le k^* \le k_2, \\ 0, & \text{otherwise}, \end{cases}   (E32)

where k_1 is the minimum frequency of the fundamental tone, k_1 = inf F_FT, and k_2 is the maximum frequency of the fundamental tone, k_2 = sup F_FT.
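A minimal sketch of the HPS estimator of (E27)-(E32); the signal, the decimation quantity Z, and the bin bounds k_1, k_2 in the usage example are illustrative assumptions.

```python
import numpy as np

def hps_fundamental_bin(x, Z, k1, k2):
    """Index of the fundamental-tone bin by the HPS method, cf. (E27)-(E32)."""
    dN = len(x)
    # (E27)-(E28): power spectrum of the frame
    W = np.abs(np.fft.fft(x)) ** 2
    # (E29)-(E30): product of the z-fold decimated power spectra, z = 1..Z
    L = dN // Z
    P = np.ones(L)
    for z in range(1, Z + 1):
        P *= W[::z][:L]           # W_z(k) = W(z * k)
    # (E31)-(E32): bin of the maximum, accepted only inside [k1, k2]
    k_star = int(np.argmax(P))
    return k_star if k1 <= k_star <= k2 else 0

# Illustrative usage (hypothetical parameters): 200 Hz tone with harmonics, fs = 8 kHz, dN = 512
fs, dN = 8000, 512
t = np.arange(dN) / fs
x = (np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
     + 0.25 * np.sin(2 * np.pi * 600 * t))
k = hps_fundamental_bin(x, Z=3, k1=5, k2=60)
print(k, k * fs / dN)   # bin index and the corresponding frequency in Hz, expected near 200 Hz
```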

The SIFT, ACF, AMDF, and cepstrum-based methods depend on the noise level.

The HPS and wavelet-based methods are resistant to noise.

The SIFT and cepstrum-based methods require setting a threshold.

The wavelet-based method requires setting the decomposition level.

The HPS method requires setting the decimation quantity.


3. Calculation method of linear prediction parameters

The linear predictive coding method uses an amplifier and a digital filter (Figure 11).

Figure 11.

The block diagram of the simplified model of signal formation.

Thus, the speech signal can be represented as the output of a linear system with time-varying parameters excited by quasi-periodic impulses or by random noise.

The transfer function H(z) of the linear system with variable parameters is considered as the ratio of the output signal spectrum S(z) to the input signal spectrum U(z):

H(z) = \frac{S(z)}{U(z)} = \frac{G}{A(z)}, \qquad A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k},   (E33)

where A(z) is the inverse filter for the system H(z), G is the gain coefficient, and p is the prediction order (filter order).

The input signal u(n) is represented by a pulse sequence or noise. The model has the following parameters: the gain coefficient G and the digital filter coefficients a_k. All these parameters change slowly in time and can be estimated frame by frame.

This method uses as features linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), and log area ratio (LAR) coefficients [3]; a code sketch is given after the steps below.

1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum, which has a steep roll-off in the high-frequency region, is balanced by means of a pre-emphasis filter

\tilde{s}_n(m) = s_n(m+1) - \alpha\, s_n(m), \quad m \in \overline{0, \Delta N - 1},   (E34)

where α is the filtration parameter, 0 < α < 1.

2. For the n-th frame, the autocorrelation function R_n(k) is calculated

\hat{s}_n(m) = \tilde{s}_n(m)\, w(m), \qquad w(m) = 0.54 + 0.46 \cos\frac{2\pi m}{\Delta N},   (E35)
R_n(k) = \sum_{m=0}^{\Delta N - 1 - k} \hat{s}_n(m)\, \hat{s}_n(m+k), \quad k \in \overline{0, p},   (E36)

where w(m) is Hamming's window, p is the order of linear prediction, ⌈f_d/1000⌉ ≤ p ≤ 5 + ⌈f_d/1000⌉, and ceil(f) is the function that rounds f up to the nearest integer.

3. For the n-th frame, the linear prediction coefficients (LPC) a_n(j) and the reflection coefficients (RC) k_n(i) are calculated by the Durbin procedure.

4. For the n-th frame, the gain coefficient G_n is calculated

G_n = \sqrt{E_n} = \sqrt{R_n(0) - \sum_{k=1}^{p} a_n(k)\, R_n(k)}.   (E37)

5. For the n-th frame, the linear prediction cepstral coefficients (LPCC) are calculated

\mathrm{LPCC}_n(m) = \begin{cases} \ln G_n, & m = 0, \\ a_n(m), & m = 1, \\ a_n(m) + \sum_{k=1}^{m-1} (k/m)\, \mathrm{LPCC}_n(k)\, a_n(m-k), & 2 \le m \le p, \end{cases} \quad m \in \overline{0, p}.   (E38)

6. For the n-th frame, the log area ratio (LAR) coefficients are calculated

\mathrm{LAR}_n(m) = \ln\frac{1 - k_n(m)}{1 + k_n(m)}, \quad m \in \overline{1, p}.   (E39)
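The six steps of this section can be sketched as follows. The Durbin recursion is written out explicitly so that it also returns the reflection coefficients; the pre-emphasis constant alpha = 0.95 and the np.hamming window are illustrative assumptions.

```python
import numpy as np

def durbin(R, p):
    """Durbin recursion: LPC coefficients a(1..p), reflection coefficients k(1..p)
    and the final prediction error E_p, from autocorrelations R(0..p)."""
    a, rc, E = np.zeros(p + 1), np.zeros(p + 1), R[0]
    for i in range(1, p + 1):
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        rc[i] = k
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, E = a_new, (1.0 - k * k) * E
    return a[1:], rc[1:], E

def lpc_features(frame, p, alpha=0.95):
    """LPC, RC, LPCC and LAR features of one frame, cf. steps 1-6 and (E34)-(E39)."""
    # Step 1, (E34): pre-emphasis
    s = frame[1:] - alpha * frame[:-1]
    # Step 2, (E35)-(E36): windowing and autocorrelation up to lag p
    dN = len(s)
    sw = s * np.hamming(dN)              # window choice is an assumption
    R = np.array([np.dot(sw[:dN - k], sw[k:]) for k in range(p + 1)])
    # Step 3: LPC and RC by the Durbin procedure
    a, rc, _ = durbin(R, p)
    # Step 4, (E37): gain coefficient
    G = np.sqrt(max(R[0] - np.dot(a, R[1:]), 1e-12))
    # Step 5, (E38): LPCC recursion
    lpcc = np.zeros(p + 1)
    lpcc[0], lpcc[1] = np.log(G), a[0]
    for m in range(2, p + 1):
        ks = np.arange(1, m)
        lpcc[m] = a[m - 1] + np.sum(ks / m * lpcc[1:m] * a[m - 1 - ks])
    # Step 6, (E39): log area ratios from the reflection coefficients
    lar = np.log((1.0 - rc) / (1.0 + rc))
    return a, rc, lpcc, lar
```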

4. Calculation method of formants

For the n-th frame, the logarithmic power spectrum is calculated using the gain coefficient and the linear prediction coefficients (LPC) [3, 4]:

10 \lg W_n(k) = 10 \lg \left| \frac{G_n}{A_n(z)} \right|^2 = 10 \lg \frac{G_n^2}{\left( 1 - \sum_{m=1}^{p} a_n(m) \cos\frac{2\pi k m}{\Delta N} \right)^2 + \left( \sum_{m=1}^{p} a_n(m) \sin\frac{2\pi k m}{\Delta N} \right)^2}.   (E40)

For person identification or speech recognition, the analysis of vocalized sounds is limited to the frequency range from 0 to 3 kHz, and the first three formants F_1, F_2, F_3 are used. For speech synthesis, the frequency range from 0 to 4-5 kHz and the first five formants F_1, F_2, F_3, F_4, F_5 are used. A sketch of formant extraction from the LPC spectrum is given below.
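Combining (E40) with a simple peak search gives a rough formant estimator in code; the vectorized evaluation and the peak-picking rule below are assumptions, and the LPC coefficients a and gain G are assumed to come, for example, from the sketch in Section 3.

```python
import numpy as np

def lpc_log_power_spectrum(a, G, dN):
    """10*lg W_n(k) of (E40) from the LPC coefficients a(1..p) and the gain G."""
    k = np.arange(dN)[:, None]
    m = np.arange(1, len(a) + 1)[None, :]
    re = 1.0 - np.sum(a * np.cos(2.0 * np.pi * k * m / dN), axis=1)
    im = np.sum(a * np.sin(2.0 * np.pi * k * m / dN), axis=1)
    return 10.0 * np.log10(G ** 2 / (re ** 2 + im ** 2))

def formant_bins(spectrum_db, n_formants=3):
    """Indices of the first local maxima of the LPC spectrum (formant candidates)."""
    half = spectrum_db[:len(spectrum_db) // 2]          # formants lie below f_d / 2
    peaks = [i for i in range(1, len(half) - 1)
             if half[i - 1] < half[i] >= half[i + 1]]
    return peaks[:n_formants]
```

The first three peaks below f_d/2 then approximate F_1, F_2, F_3; a bin index k converts to Hz as k·f_d/ΔN.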

Example 4

Figure 12 presents the logarithmic power spectrum of the central frame of the sound “A” at different prediction orders; the frame length is N = 512 and the sampling frequency is f_d = 22050 Hz.

Figure 12.

The logarithmic LPC power spectrum of the sound “A” at different prediction orders p.

As Figure 12 shows, formant extraction (maxima in the spectrum) is already possible at p = 30.

Example 5

Figure 13 shows the sound “A”, and Figure 14 shows its logarithmic LPC power spectrum. Figure 15 shows the central frame of the sound “Sh”, and Figure 16 shows its logarithmic LPC power spectrum. The frame length is N = 512, the sampling frequency is f_d = 22050 Hz, 8 bits, mono, and the prediction order is p = 30.

Figure 13.

Sound “A”.

Figure 14.

Logarithmic LPC power spectrum of the sound “A” at prediction order p = 30.

Figure 15.

Sound “Sh”.

Figure 16.

Logarithmic LPC power spectrum of the sound “Sh” at prediction order p = 30.


5. Method of mel-frequency cepstral coefficients calculation

This method is based on homomorphic processing and uses mel-frequency cepstral coefficients (MFCC) as features [5, 6]; a code sketch is given after the steps below.

1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum, which has a steep roll-off in the high-frequency region, is balanced by means of a pre-emphasis filter

\tilde{s}_n(m) = s_n(m+1) - \alpha\, s_n(m), \quad m \in \overline{0, \Delta N - 1},   (E41)

where α is the filtration parameter, 0 < α < 1.

2. For the n-th frame, the spectrum is calculated using the DFT

\hat{s}_n(m) = \tilde{s}_n(m)\, w(m), \qquad w(m) = 0.54 + 0.46 \cos\frac{2\pi m}{\Delta N},   (E42)
S_n(k) = \sum_{m=0}^{\Delta N - 1} \hat{s}_n(m)\, e^{-j (2\pi/\Delta N) k m}, \quad k \in \overline{0, \Delta N - 1},   (E43)

where w(m) is Hamming's window.

3. For the n-th frame, the energy of the m-th mel-frequency band is calculated using the mel-frequency transformation and Bartlett's window

E_n(m) = \sum_{k=0}^{\Delta N/2 - 1} \left| S_n(k) \right|^2 w_m(k), \quad m \in \overline{1, P}, \qquad w_m(k) = \begin{cases} 0, & k < f(m-1)\ \text{or}\ k > f(m+1), \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m), \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1), \end{cases}   (E44)

f(m) = \frac{\Delta N}{f_d}\, B^{-1}\!\left( B(f_{\min}) + m\, \frac{B(f_{\max}) - B(f_{\min})}{P + 1} \right), \quad m \in \overline{0, P + 1},   (E45)

B(f) = 1125 \ln(1 + f/700), \qquad B^{-1}(b) = 700\left( e^{b/1125} - 1 \right),   (E46)

where E_n(m) is the energy of the m-th mel-frequency band, w_m(k) is Bartlett's window for the m-th band, B(f) is the function that converts frequency in Hz to frequency in mel, B^{-1}(b) is the function that converts frequency in mel to frequency in Hz, f(m) is the normalized frequency, f_min and f_max are the minimum and maximum frequencies in Hz (for example, f_min = 0, f_max = f_d/2), f_d is the sampling frequency of the speech signal in Hz, and P is the quantity of mel-frequency bands.

4. For the n-th frame, the mel-frequency cepstral coefficients (MFCC) are calculated using the inverse discrete cosine transformation of DCT-2

\mathrm{MFCC}_n(m) = \sqrt{\frac{2}{P}} \sum_{k=0}^{P-1} \alpha_k \ln E_n(k+1) \cos\frac{\pi k (2m+1)}{2P}, \quad m \in \overline{0, \tilde{P} - 1}, \qquad \alpha_k = \begin{cases} 1/\sqrt{2}, & k = 0, \\ 1, & k > 0, \end{cases}   (E47)

where P̃ is the quantity of mel-frequency cepstral coefficients, 1 ≤ P̃ ≤ P.
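A compact sketch of the four steps above follows; the pre-emphasis constant, the np.hamming window, and the default band and coefficient counts (P = 20, P̃ = 13, which match the experimental settings quoted later in the chapter) are assumptions of this illustration.

```python
import numpy as np

def mel(f):            # (E46): Hz -> mel
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_inv(b):        # (E46): mel -> Hz
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def mfcc(frame, fd, P=20, n_ceps=13, alpha=0.95):
    """MFCC of one frame, cf. steps 1-4 and (E41)-(E47)."""
    # Step 1, (E41): pre-emphasis
    s = frame[1:] - alpha * frame[:-1]
    dN = len(s)
    # Step 2, (E42)-(E43): windowed DFT and half-spectrum power
    S = np.fft.fft(s * np.hamming(dN))
    power = np.abs(S[:dN // 2]) ** 2
    # Step 3, (E44)-(E46): triangular (Bartlett) mel filterbank energies
    f_min, f_max = 0.0, fd / 2.0
    edges = dN / fd * mel_inv(mel(f_min)
                              + np.arange(P + 2) * (mel(f_max) - mel(f_min)) / (P + 1))
    k = np.arange(dN // 2)
    E = np.empty(P)
    for m in range(1, P + 1):
        left, centre, right = edges[m - 1], edges[m], edges[m + 1]
        up = (k - left) / (centre - left)
        down = (right - k) / (right - centre)
        E[m - 1] = np.dot(power, np.clip(np.minimum(up, down), 0.0, 1.0))
    # Step 4, (E47): inverse DCT (DCT-III) of the log band energies
    alpha_k = np.ones(P); alpha_k[0] = 1.0 / np.sqrt(2.0)
    mvec = np.arange(n_ceps)[:, None]
    kvec = np.arange(P)[None, :]
    basis = np.cos(np.pi * kvec * (2 * mvec + 1) / (2.0 * P))
    return np.sqrt(2.0 / P) * (basis * alpha_k * np.log(E + 1e-12)).sum(axis=1)
```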


6. Method of bark-frequency cepstral coefficients calculation

This method is based on homomorphic processing and uses bark-frequency cepstral coefficients (BFCC) as features [7, 8]; a code sketch is given after the steps below.

1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum is calculated using the DFT

\hat{s}_n(m) = s_n(m)\, w(m), \qquad w(m) = 0.54 + 0.46 \cos\frac{2\pi m}{\Delta N},   (E48)
S_n(k) = \sum_{m=0}^{\Delta N - 1} \hat{s}_n(m)\, e^{-j (2\pi/\Delta N) k m}, \quad k \in \overline{0, \Delta N - 1},   (E49)

where w(m) is Hamming's window.

2. The quantity of bark-frequency bands is calculated

P = \left\lceil B(f_d/2) \right\rceil + 1, \qquad B(f) = 6\, \mathrm{arcsinh}(f/600),   (E50)

where ceil(f) is the function that rounds f up to the nearest integer, f_d is the sampling frequency of the speech signal in Hz, and B(f) is the function that converts frequency in Hz to frequency in bark.

3. For the n-th frame, the energy of the bark-frequency bands is calculated

E_n(m) = \sum_{k=0}^{\Delta N/2 - 1} \left| S_n(k) \right|^2 w_m(k), \quad m \in \overline{0, P - 1},   (E51)

b(m) = m\, \frac{B(f_d/2)}{P - 1}, \quad m \in \overline{0, P - 1},   (E52)

\Delta b_m(k) = B\!\left( \frac{k f_d}{\Delta N} \right) - b(m), \quad m \in \overline{0, P - 1}, \quad k \in \overline{0, \Delta N - 1},   (E53)

w_m(k) = \begin{cases} 10^{\Delta b_m(k) + 0.5}, & \Delta b_m(k) \le -0.5, \\ 1, & -0.5 < \Delta b_m(k) < 0.5, \\ 10^{-2.5 (\Delta b_m(k) - 0.5)}, & \Delta b_m(k) \ge 0.5, \end{cases}   (E54)

where E_n(m) is the energy of the m-th bark-frequency band and w_m(k) is the trapezoidal window for the m-th band.

4. For the n-th frame, the equal-loudness weighting is applied to the energies of the bark-frequency bands

\bar{E}_n(m) = v\!\left( B^{-1}(b(m)) \right) E_n(m), \quad m \in \overline{0, P - 1},   (E55)

B^{-1}(b) = 600 \sinh(b/6),   (E56)

v(f) = \begin{cases} \dfrac{(f^2 + 56.8 \cdot 10^6)\, f^4}{(f^2 + 6.3 \cdot 10^6)^2\, (f^2 + 0.38 \cdot 10^9)}, & f_d < 5000, \\[2ex] \dfrac{(f^2 + 56.8 \cdot 10^6)\, f^4}{(f^2 + 6.3 \cdot 10^6)^2\, (f^2 + 0.38 \cdot 10^9)\, (f^6 + 9.58 \cdot 10^{26})}, & f_d \ge 5000, \end{cases}   (E57)

where v(f) is the equal-loudness weighting function (it approximates human auditory perception, since a person has unequal hearing sensitivity at different frequencies) and B^{-1}(b) is the function that converts frequency in bark to frequency in Hz.

5. For the n-th frame, the intensity-loudness power law is applied to the energies of the bark-frequency bands

\tilde{E}_n(m) = \left( \bar{E}_n(m) \right)^{0.33}, \quad m \in \overline{0, P - 1}.   (E58)

6. For the n-th frame, the bark-frequency cepstral coefficients (BFCC) are calculated using the inverse discrete cosine transformation of DCT-2; beforehand, the energies Ẽ_n(0) and Ẽ_n(P-1) are replaced by Ẽ_n(1) and Ẽ_n(P-2), respectively

\mathrm{BFCC}_n(m) = \sqrt{\frac{2}{P-1}} \sum_{k=0}^{P-2} \alpha_k \ln \tilde{E}_n(k+1) \cos\frac{\pi k (2m+1)}{2(P-1)}, \quad m \in \overline{0, \tilde{P} - 1},   (E59)

\tilde{E}_n(0) = \tilde{E}_n(1), \qquad \tilde{E}_n(P-1) = \tilde{E}_n(P-2), \qquad \alpha_k = \begin{cases} 1/\sqrt{2}, & k = 0, \\ 1, & k > 0, \end{cases}   (E60)

where P̃ is the quantity of bark-frequency cepstral coefficients, 1 ≤ P̃ ≤ P.
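A sketch of the BFCC pipeline (E48)-(E60) follows; the np.hamming window and the default number of cepstral coefficients are assumptions, and the equal-loudness curve is coded exactly as written in (E57).

```python
import numpy as np

def bark(f):            # (E50): Hz -> bark
    return 6.0 * np.arcsinh(f / 600.0)

def bark_inv(b):        # (E56): bark -> Hz
    return 600.0 * np.sinh(b / 6.0)

def equal_loudness(f, fd):
    """Equal-loudness weighting v(f) of (E57)."""
    w = ((f ** 2 + 56.8e6) * f ** 4) / ((f ** 2 + 6.3e6) ** 2 * (f ** 2 + 0.38e9))
    return w if fd < 5000 else w / (f ** 6 + 9.58e26)

def bfcc(frame, fd, n_ceps=13):
    """BFCC of one frame, cf. steps 1-6 and (E48)-(E60)."""
    dN = len(frame)
    # Steps 1-2, (E48)-(E50): windowed DFT and number of bark bands
    S = np.fft.fft(frame * np.hamming(dN))
    power = np.abs(S[:dN // 2]) ** 2
    P = int(np.ceil(bark(fd / 2.0))) + 1
    # Step 3, (E51)-(E54): trapezoidal bark filterbank energies
    k = np.arange(dN // 2)
    bm = np.arange(P) * bark(fd / 2.0) / (P - 1)          # band centres in bark (E52)
    E = np.empty(P)
    for m in range(P):
        db = bark(k * fd / dN) - bm[m]                     # (E53)
        w = np.where(db <= -0.5, 10.0 ** (db + 0.5),
                     np.where(db >= 0.5, 10.0 ** (-2.5 * (db - 0.5)), 1.0))   # (E54)
        E[m] = np.dot(power, w)
    # Steps 4-5, (E55)-(E58): equal-loudness weighting and intensity-loudness power law
    E = equal_loudness(bark_inv(bm), fd) * E
    E = E ** 0.33
    # Step 6, (E59)-(E60): replace the edge bands and apply the inverse DCT (DCT-III)
    E[0], E[-1] = E[1], E[-2]
    alpha_k = np.ones(P - 1); alpha_k[0] = 1.0 / np.sqrt(2.0)
    mvec = np.arange(n_ceps)[:, None]
    kvec = np.arange(P - 1)[None, :]
    basis = np.cos(np.pi * kvec * (2 * mvec + 1) / (2.0 * (P - 1)))
    return np.sqrt(2.0 / (P - 1)) * (basis * alpha_k * np.log(E[1:P] + 1e-12)).sum(axis=1)
```

For f_d = 8000 Hz this filterbank yields P = 17 bark bands, which agrees with the experimental settings quoted in Section 9.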


7. Method for calculation of perceptual linear prediction parameters

In this method, perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), and perceptual log area ratio (PLAR) coefficients are used as features [9, 10]; a sketch of the steps that differ from the previous section is given after the list below.

1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum is calculated using the DFT

\hat{s}_n(m) = s_n(m)\, w(m), \qquad w(m) = 0.54 + 0.46 \cos\frac{2\pi m}{\Delta N},   (E61)
S_n(k) = \sum_{m=0}^{\Delta N - 1} \hat{s}_n(m)\, e^{-j (2\pi/\Delta N) k m}, \quad k \in \overline{0, \Delta N - 1},   (E62)

where w(m) is Hamming's window.

2. The quantity of bark-frequency bands is calculated

P = \left\lceil B(f_d/2) \right\rceil + 1, \qquad B(f) = 6\, \mathrm{arcsinh}(f/600),   (E63)

where ceil(f) is the function that rounds f up to the nearest integer, f_d is the sampling frequency of the speech signal in Hz, and B(f) is the function that converts frequency in Hz to frequency in bark.

3. For the n-th frame, the energy of the bark-frequency bands is calculated

E_n(m) = \sum_{k=0}^{\Delta N/2 - 1} \left| S_n(k) \right|^2 w_m(k), \quad m \in \overline{0, P - 1},   (E64)

b(m) = m\, \frac{B(f_d/2)}{P - 1}, \quad m \in \overline{0, P - 1},   (E65)

\Delta b_m(k) = B\!\left( \frac{k f_d}{\Delta N} \right) - b(m), \quad m \in \overline{0, P - 1}, \quad k \in \overline{0, \Delta N - 1},   (E66)

w_m(k) = \begin{cases} 10^{\Delta b_m(k) + 0.5}, & \Delta b_m(k) \le -0.5, \\ 1, & -0.5 < \Delta b_m(k) < 0.5, \\ 10^{-2.5 (\Delta b_m(k) - 0.5)}, & \Delta b_m(k) \ge 0.5, \end{cases}   (E67)

where E_n(m) is the energy of the m-th bark-frequency band and w_m(k) is the trapezoidal window for the m-th band.

4. For the n-th frame, the equal-loudness weighting is applied to the energies of the bark-frequency bands

\bar{E}_n(m) = v\!\left( B^{-1}(b(m)) \right) E_n(m), \quad m \in \overline{0, P - 1},   (E68)

B^{-1}(b) = 600 \sinh(b/6),   (E69)

v(f) = \begin{cases} \dfrac{(f^2 + 56.8 \cdot 10^6)\, f^4}{(f^2 + 6.3 \cdot 10^6)^2\, (f^2 + 0.38 \cdot 10^9)}, & f_d < 5000, \\[2ex] \dfrac{(f^2 + 56.8 \cdot 10^6)\, f^4}{(f^2 + 6.3 \cdot 10^6)^2\, (f^2 + 0.38 \cdot 10^9)\, (f^6 + 9.58 \cdot 10^{26})}, & f_d \ge 5000, \end{cases}   (E70)

where v(f) is the equal-loudness weighting function (it approximates human auditory perception, since a person has unequal hearing sensitivity at different frequencies) and B^{-1}(b) is the function that converts frequency in bark to frequency in Hz.

5. For the n-th frame, the intensity-loudness power law is applied to the energies of the bark-frequency bands

\tilde{E}_n(m) = \left( \bar{E}_n(m) \right)^{0.33}, \quad m \in \overline{0, P - 1}.   (E71)

6. For the n-th frame, the values of the autocorrelation function are calculated using the inverse DFT; beforehand, the energies Ẽ_n(0) and Ẽ_n(P-1) are replaced by Ẽ_n(1) and Ẽ_n(P-2), respectively

R_n(k) = \mathrm{Re}\!\left[ \frac{1}{2P - 2} \sum_{m=0}^{2P-3} \tilde{E}_n(m)\, e^{j \frac{2\pi}{2P-2} k m} \right], \quad k \in \overline{0, p},   (E72)

\tilde{E}_n(0) = \tilde{E}_n(1), \qquad \tilde{E}_n(P-1) = \tilde{E}_n(P-2), \qquad \tilde{E}_n(2P-2-m) = \tilde{E}_n(m), \quad m \in \overline{1, P-2},   (E73)

where p is the order of linear prediction, ⌈f_d/1000⌉ ≤ p ≤ 5 + ⌈f_d/1000⌉, and ceil(f) is the function that rounds f up to the nearest integer.

7. For the n-th frame, the perceptual linear prediction coefficients (PLPC) a_n(j) and the perceptual reflection coefficients (PRC) k_n(i) are calculated by the Durbin procedure.

8. For the n-th frame, the gain coefficient G_n is calculated

G_n = \sqrt{E_n} = \sqrt{R_n(0) - \sum_{k=1}^{p} a_n(k)\, R_n(k)}.   (E74)

9. For the n-th frame, the perceptual linear prediction cepstral coefficients (PLPCC) are calculated

\mathrm{PLPCC}_n(m) = \begin{cases} \ln G_n, & m = 0, \\ a_n(m), & m = 1, \\ a_n(m) + \sum_{k=1}^{m-1} (k/m)\, \mathrm{PLPCC}_n(k)\, a_n(m-k), & 2 \le m \le p, \end{cases} \quad m \in \overline{0, p}.   (E75)

10. For the n-th frame, the perceptual log area ratio (PLAR) coefficients are calculated

\mathrm{PLAR}_n(m) = \ln\frac{1 - k_n(m)}{1 + k_n(m)}, \quad m \in \overline{1, p}.   (E76)
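Steps 1-5 coincide with the BFCC pipeline, so only the back end of steps 6-10 is sketched here: the inverse DFT of the auditory spectrum to autocorrelations, the Durbin recursion, and the cepstral and LAR conversions. The auditory spectrum Ẽ_n(m) could be produced, for instance, by the bark filterbank of the previous section's sketch; the Durbin routine is repeated so that the block stays self-contained.

```python
import numpy as np

def durbin(R, p):
    """Durbin recursion: LP coefficients a(1..p), reflection coefficients k(1..p), error E_p."""
    a, rc, E = np.zeros(p + 1), np.zeros(p + 1), R[0]
    for i in range(1, p + 1):
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        rc[i] = k
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, E = a_new, (1.0 - k * k) * E
    return a[1:], rc[1:], E

def plp_features(E_bark, p):
    """PLPC, PRC, PLPCC and PLAR from the auditory spectrum E_bark(0..P-1), cf. (E72)-(E76)."""
    P = len(E_bark)
    Et = np.array(E_bark, dtype=float)
    # (E73): replace the edge bands and extend symmetrically to length 2P-2
    Et[0], Et[-1] = Et[1], Et[-2]
    sym = np.concatenate((Et, Et[-2:0:-1]))               # E(2P-2-m) = E(m), m = 1..P-2
    # (E72): autocorrelations as the real part of the inverse DFT of the auditory spectrum
    R = np.fft.ifft(sym).real[:p + 1]
    # Steps 7-8, (E74): Durbin recursion and the gain coefficient
    a, rc, _ = durbin(R, p)
    G = np.sqrt(max(R[0] - np.dot(a, R[1:p + 1]), 1e-12))
    # Step 9, (E75): cepstral recursion
    plpcc = np.zeros(p + 1)
    plpcc[0], plpcc[1] = np.log(G), a[0]
    for m in range(2, p + 1):
        ks = np.arange(1, m)
        plpcc[m] = a[m - 1] + np.sum(ks / m * plpcc[1:m] * a[m - 1 - ks])
    # Step 10, (E76): perceptual log area ratios
    plar = np.log((1.0 - rc) / (1.0 + rc))
    return a, rc, plpcc, plar
```

The reconsidered perceptual linear prediction of the next section uses the same back end but feeds it with mel-filterbank energies instead of bark-filterbank energies.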

8. Method for calculation of reconsidered perceptual linear prediction parameters

In this method, reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC), and reconsidered perceptual log area ratio (RPLAR) coefficients are used as features [7, 8].

1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum, which has a steep roll-off in the high-frequency region, is balanced by means of a pre-emphasis filter

\tilde{s}_n(m) = s_n(m+1) - \alpha\, s_n(m), \quad m \in \overline{0, \Delta N - 1},   (E77)

where α is the filtration parameter, 0 < α < 1.

2. For the n-th frame, the spectrum is calculated using the DFT

\hat{s}_n(m) = \tilde{s}_n(m)\, w(m), \qquad w(m) = 0.54 + 0.46 \cos\frac{2\pi m}{\Delta N},   (E78)
S_n(k) = \sum_{m=0}^{\Delta N - 1} \hat{s}_n(m)\, e^{-j (2\pi/\Delta N) k m}, \quad k \in \overline{0, \Delta N - 1},   (E79)

where w(m) is Hamming's window.

3. For the n-th frame, the energy of the m-th mel-frequency band is calculated using the mel-frequency transformation and Bartlett's window

E_n(m) = \sum_{k=0}^{\Delta N/2 - 1} \left| S_n(k) \right|^2 w_m(k), \quad m \in \overline{1, P},   (E80)

w_m(k) = \begin{cases} 0, & k < f(m-1)\ \text{or}\ k > f(m+1), \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m), \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1), \end{cases}   (E81)

f(m) = \frac{\Delta N}{f_d}\, B^{-1}\!\left( B(f_{\min}) + m\, \frac{B(f_{\max}) - B(f_{\min})}{P + 1} \right), \quad m \in \overline{0, P + 1},   (E82)

B(f) = 1125 \ln(1 + f/700), \qquad B^{-1}(b) = 700\left( e^{b/1125} - 1 \right),   (E83)

where E_n(m) is the energy of the m-th mel-frequency band, w_m(k) is Bartlett's window for the m-th band, B(f) is the function that converts frequency in Hz to frequency in mel, B^{-1}(b) is the function that converts frequency in mel to frequency in Hz, f(m) is the normalized frequency, f_min and f_max are the minimum and maximum frequencies in Hz (for example, f_min = 0, f_max = f_d/2), f_d is the sampling frequency of the speech signal in Hz, and P is the quantity of mel-frequency bands.

4. For the n-th frame, the values of the autocorrelation function are calculated using the inverse DFT

R_n(k) = \mathrm{Re}\!\left[ \frac{1}{2P - 2} \sum_{m=0}^{2P-3} E_n(m+1)\, e^{j \frac{2\pi}{2P-2} k m} \right], \quad k \in \overline{0, p},   (E84)

E_n(2P - m) = E_n(m), \quad m \in \overline{2, P - 1},   (E85)

where p is the order of linear prediction, ⌈f_d/1000⌉ ≤ p ≤ 5 + ⌈f_d/1000⌉, and ceil(f) is the function that rounds f up to the nearest integer.

5. For the n-th frame, the reconsidered perceptual linear prediction coefficients (RPLPC) a_n(j) and the reconsidered perceptual reflection coefficients (RPRC) k_n(i) are calculated by the Durbin procedure.

6. For the n-th frame, the gain coefficient G_n is calculated

G_n = \sqrt{E_n} = \sqrt{R_n(0) - \sum_{k=1}^{p} a_n(k)\, R_n(k)}.   (E86)

7. For the n-th frame, the reconsidered perceptual linear prediction cepstral coefficients (RPLPCC) are calculated

\mathrm{RPLPCC}_n(m) = \begin{cases} \ln G_n, & m = 0, \\ a_n(m), & m = 1, \\ a_n(m) + \sum_{k=1}^{m-1} (k/m)\, \mathrm{RPLPCC}_n(k)\, a_n(m-k), & 2 \le m \le p, \end{cases} \quad m \in \overline{0, p}.   (E87)

8. For the n-th frame, the reconsidered perceptual log area ratio (RPLAR) coefficients are calculated

\mathrm{RPLAR}_n(m) = \ln\frac{1 - k_n(m)}{1 + k_n(m)}, \quad m \in \overline{1, p}.   (E88)

9. The performance comparison of various features for person identification

For the speech signals containing vocal sounds, the sampling frequency was set to 8 kHz and the number of quantization levels to 256. The sample length of a vocal speech sound is equal to 256.

The results of a numerical study of the LPC, RC, LPCC, LAR, MFCC, BFCC, PLPC, PRC, PLPCC, PLAR, RPLPC, RPRC, RPLPCC, and RPLAR coefficients, obtained by the coding methods and used for biometric identification of people from the TIMIT database on vocal sounds by means of Gaussian mixture models (GMM), are presented in Table 1; an illustrative sketch of GMM-based identification is given at the end of this section.

Coefficient type        Identification probability    Number of coefficients
LPC                     0.72                          12
RC                      0.96                          12
LPCC                    0.90                          13
LAR coefficients        0.82                          12
MFCC                    0.97                          13
BFCC                    0.98                          13
PLPC                    0.74                          4
PRC                     0.98                          4
PLPCC                   0.92                          5
PLAR coefficients       0.84                          4
RPLPC                   0.73                          12
RPRC                    0.97                          12
RPLPCC                  0.91                          13
RPLAR coefficients      0.83                          12

Table 1.

Numerical research results of the coefficients used for biometric identification of a person.

For the coding methods used to analyze the speech signal, the filter order is equal to 12 in the case of linear prediction, 4 in the case of perceptual linear prediction, and 12 in the case of the reconsidered perceptual linear prediction; the quantity of mel-frequency bands is equal to 20, the quantity of bark-frequency bands is equal to 17, and the number of cepstral parameters based on subbands is equal to 13.

The results presented in Table 1 show that the largest identification probability and the smallest number of coefficients are provided by coding a vocal speech sound based on PRC.
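As an illustration of how such features might be fed to Gaussian mixture models for identification, a minimal sketch with scikit-learn is shown below. The helper names, the data layout (one array of frame features per speaker), the diagonal covariance, and n_components = 8 are assumptions of this illustration, not the chapter's experimental setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_per_speaker, n_components=8):
    """Fit one GMM per speaker; features_per_speaker maps a speaker id to an
    (n_frames, n_features) array of, for example, PRC vectors."""
    models = {}
    for speaker, X in features_per_speaker.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', random_state=0)
        models[speaker] = gmm.fit(X)
    return models

def identify(models, X):
    """Return the speaker whose GMM gives the highest average log-likelihood for the frames X."""
    return max(models, key=lambda speaker: models[speaker].score(X))
```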


10. Conclusion

The preliminary stage of biometric identification is speech signal structuring and feature extraction.

For calculation of the fundamental tone, the following digital signal processing methods were considered and numerically investigated: the ACF (autocorrelation function) method, the AMDF (average magnitude difference function) method, the SIFT (simplified inverse filter transformation) method, the method based on wavelet analysis, the method based on the cepstral analysis, and the HPS (harmonic product spectrum) method. For speech signal feature extraction, the following digital signal processing methods were considered and numerically investigated: the digital bandpass filter bank; spectral analysis (Fourier transformation, wavelet transformation); homomorphic processing; and linear predictive coding. These methods make it possible to extract linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), log area ratio (LAR) coefficients, mel-frequency cepstral coefficients (MFCC), bark-frequency cepstral coefficients (BFCC), perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), perceptual log area ratio (PLAR) coefficients, reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC), and reconsidered perceptual log area ratio (RPLAR) coefficients. Results of a numerical study of the speech signal feature extraction methods for voice signals of people from the TIMIT (Texas Instruments and Massachusetts Institute of Technology) database were obtained. The PRC features proved to be the most effective.

References

  1. Oppenheim AV, Schafer RW. Discrete-Time Signal Processing. Upper Saddle River, NJ: Prentice Hall; 2010. p. 1108
  2. Mallat S. A Wavelet Tour of Signal Processing: The Sparse Way. Burlington, MA: Academic Press; 2008. p. 832. DOI: 10.1016/B978-0-12-374370-1.X0001-8
  3. Rabiner LR, Schafer RW. Theory and Applications of Digital Speech Processing. Upper Saddle River, NJ: Pearson Higher Education; 2011. p. 1042
  4. Markel JD, Gray AH. Linear Prediction of Speech. Berlin: Springer Verlag; 1976. p. 382
  5. Davis SB, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing. 1980;28(4):357-366
  6. Ganchev T, Fakotakis N, Kokkinakis G. Comparative evaluation of various MFCC implementations on the speaker verification task. In: Proceedings of SPECOM 2005. Vol. 1. Patras, Greece; 2005. pp. 191-194
  7. Josef R, Pollak P. Modified feature extraction methods in robust speech recognition. In: Proceedings of the 17th IEEE International Conference Radioelektronika. Brno, Czech Republic: IEEE; 2007. pp. 1-4
  8. Kumar P, Biswas A, Mishra AN, Chandra M. Spoken language identification using hybrid feature extraction methods. Journal of Telecommunications. 2010;1(2):11-15
  9. Huang X, Acero A, Hon H-W. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Upper Saddle River, NJ: Prentice Hall; 2001. p. 980
  10. Hermansky H. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America. 1990;87(4):1738-1752. DOI: 10.1121/1.399423
