Open access peer-reviewed chapter

Methods for Speech Signal Structuring and Extracting Features

Written By

Eugene Fedorov, Tetyana Utkina and Tetiana Neskorodieva

Submitted: 11 March 2022 Reviewed: 23 March 2022 Published: 16 June 2022

DOI: 10.5772/intechopen.104634

From the Edited Volume

Computational Semantics

Edited by George Dekoulis and Jainath Yadav


Abstract

The preliminary stage of biometric identification is speech signal structuring and feature extraction. For calculation of the fundamental tone, the following methods are considered and numerically investigated: the autocorrelation function (ACF) method, the average magnitude difference function (AMDF) method, the simplified inverse filter transformation (SIFT) method, the method based on wavelet analysis, the method based on the cepstral analysis, and the harmonic product spectrum (HPS) method. For speech signal feature extraction, the following methods are considered and numerically investigated: the digital bandpass filter bank; spectral analysis; homomorphic processing; and linear predictive coding. These methods make it possible to extract linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), log area ratio (LAR) coefficients, mel-frequency cepstral coefficients (MFCC), bark-frequency cepstral coefficients (BFCC), perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), perceptual log area ratio (PLAR) coefficients, reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC), and reconsidered perceptual log area ratio (RPLAR) coefficients. The largest identification probability (0.98) with the smallest number of coefficients (4) is provided by coding a vocal speech sound from the TIMIT database based on PRC.

Keywords

  • speech recognition
  • speech signal structuring and extracting features
  • the digital bandpass filters bank
  • spectral analysis
  • homomorphic processing
  • linear predictive coding

1. Introduction

Most often, the following features are extracted from a speech signal [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]: power features (energies of spectral bands); cepstrum; linear prediction parameters; fundamental tone and formants; mel-frequency cepstral coefficients (MFCC); bark-frequency cepstral coefficients (BFCC); parameters of perceptual linear prediction; and parameters of the reconsidered perceptual linear prediction.

For feature extraction from a speech signal, the following are usually used [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]: a digital bandpass filter bank; spectral analysis (Fourier transformation, wavelet transformation); homomorphic processing; linear predictive coding; the MFCC method; the BFCC method; perceptual linear prediction; and reconsidered perceptual linear prediction.


2. Calculation methods of the fundamental tone

For calculation of the fundamental tone, methods are used that are based on the analysis of the following signal representations [3]: amplitude-time; spectral (amplitude-frequency); cepstral (amplitude-quefrency); and wavelet-spectral (amplitude-time-frequency).

2.1 ACF method

The autocorrelation function (ACF) method searches for the maximum value of the autocorrelation function [3]; a minimal code sketch is given after the steps below.

1. For the chosen signal frame of length ΔN, the autocorrelation function is calculated

R(k) = \frac{1}{\Delta N} \sum_{n=0}^{\Delta N - 1 - k} x(n)\, x(n+k), \quad k \in \overline{0, \Delta N - 1}.   (E1)

2. The value k* at which the autocorrelation function R(k) is maximal is determined; it corresponds to the extraction of the period of the speech signal

k^* = \arg\max_{k} R(k), \quad k \in \overline{0, \Delta N - 1}.   (E2)

The period of the fundamental tone is defined as

T_{\mathrm{FT}} = \begin{cases} k^*, & n_1 \le k^* \le n_2, \\ 0, & \text{otherwise}, \end{cases}   (E3)

where n_1 is the minimum length of the fundamental tone period, n_1 = inf T_FT, and n_2 is the maximum length of the fundamental tone period, n_2 = sup T_FT.
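A minimal Python sketch of the ACF pitch estimator described by (E1)-(E3) might look as follows; the frame, the sampling frequency, and the period bounds n_1, n_2 used in the usage example are illustrative assumptions, not values prescribed by the chapter.

```python
import numpy as np

def acf_pitch_period(x, n1, n2):
    """Fundamental-tone period of one frame by the ACF method, cf. (E1)-(E3)."""
    dN = len(x)
    # (E1): R(k) = (1/dN) * sum_{n=0}^{dN-1-k} x(n) x(n+k)
    R = np.array([np.dot(x[:dN - k], x[k:]) / dN for k in range(dN)])
    # (E2): lag of the maximum; the search starts at n1 to skip the trivial peak at k = 0
    k_star = n1 + int(np.argmax(R[n1:]))
    # (E3): accept the lag only if it lies inside the admissible range [n1, n2]
    return k_star if k_star <= n2 else 0

# Illustrative usage (hypothetical parameters): a 200 Hz tone sampled at 8 kHz
fs = 8000
frame = np.sin(2 * np.pi * 200 * np.arange(256) / fs)
print(acf_pitch_period(frame, n1=20, n2=200))  # expected close to fs / 200 = 40 samples
```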

2.2 AMDF method

The average magnitude difference function (AMDF) method searches for the minimum value of the average magnitude difference [3], which is faster than searching for the maximum of the autocorrelation function; a code sketch is given after the steps below.

1. For the chosen signal frame of length ΔN, the average magnitude difference function is calculated

v(k) = \frac{1}{\Delta N} \sum_{n=0}^{\Delta N - 1 - k} \left| x(n) - x(n+k) \right|, \quad k \in \overline{0, \Delta N - 1}.   (E4)

2. The value k* at which the average magnitude difference function v(k) is minimal is determined; it corresponds to the extraction of the period of the speech signal

k^* = \arg\min_{k} v(k), \quad k \in \overline{0, \Delta N - 1}.   (E5)

The period of the fundamental tone is defined as

T_{\mathrm{FT}} = \begin{cases} k^*, & n_1 \le k^* \le n_2, \\ 0, & \text{otherwise}, \end{cases}   (E6)

where n_1 is the minimum length of the fundamental tone period, n_1 = inf T_FT, and n_2 is the maximum length of the fundamental tone period, n_2 = sup T_FT.
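An analogous sketch for the AMDF estimator of (E4)-(E6); it differs from the ACF version only in that a minimum of absolute differences is sought instead of a correlation maximum.

```python
import numpy as np

def amdf_pitch_period(x, n1, n2):
    """Fundamental-tone period of one frame by the AMDF method, cf. (E4)-(E6)."""
    dN = len(x)
    # (E4): v(k) = (1/dN) * sum_{n=0}^{dN-1-k} |x(n) - x(n+k)|  (no multiplications needed)
    v = np.array([np.sum(np.abs(x[:dN - k] - x[k:])) / dN for k in range(dN)])
    # (E5): lag of the minimum difference; the search starts at n1 to skip the trivial minimum at k = 0
    k_star = n1 + int(np.argmin(v[n1:]))
    # (E6): accept the lag only if it lies inside [n1, n2]
    return k_star if k_star <= n2 else 0
```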

2.3 SIFT method

The simplified inverse filter transformation (SIFT) method searches for the maximum value of the autocorrelation function of the linear prediction error of the decimated signal [4]; a code sketch is given after the steps below.

1. From the chosen signal frame of length ΔN, the frequency range containing the frequency of the fundamental tone is extracted by means of an elliptic low-pass filter with cut-off frequency f_cut = 1000 Hz. Instead of the elliptic low-pass filter used in [4], the following consecutive calculation is proposed:

    • DFT (discrete Fourier transform)

X(k) = \sum_{n=0}^{\Delta N - 1} x(n)\, e^{-j (2\pi/\Delta N) n k}, \quad k \in \overline{0, \Delta N - 1};   (E7)

• extraction of the lower frequencies

X_{\mathrm{low}}(k) = \begin{cases} X(k), & 0 \le k \le k_{\mathrm{cut}}, \\ 0, & k_{\mathrm{cut}} < k \le \Delta N - 1, \end{cases} \qquad k_{\mathrm{cut}} = \lfloor f_{\mathrm{cut}} \Delta N / f_d \rfloor,   (E8)

where f_d is the sampling frequency;

    • calculation of the inverse DFT

y(n) = \mathrm{Re}\!\left[ \frac{1}{\Delta N} \sum_{k=0}^{\Delta N - 1} X_{\mathrm{low}}(k)\, e^{j (2\pi/\Delta N) n k} \right], \quad n \in \overline{0, \Delta N - 1}.   (E9)

2. The sampling frequency is decreased to f_{1d} = 2000 Hz by decimation of the signal, i.e., intermediate samples of the signal are removed

s(n) = y(n \Delta n), \quad n \in \overline{0, \Delta N/\Delta n - 1},   (E10)

where Δn = f_d / f_{1d} is the decimation coefficient and f_d is the sampling frequency.

3. The differences of two neighbouring samples of the decimated signal are calculated

s_{\Delta}(n) = \begin{cases} s(n), & n = 0, \\ s(n) - s(n-1), & n > 0, \end{cases} \quad n \in \overline{0, \Delta N/\Delta n - 1}.   (E11)

4. The autocorrelation function is calculated

\tilde{s}_{\Delta}(n) = s_{\Delta}(n)\, w(n), \qquad w(n) = 0.54 + 0.46 \cos\frac{2\pi n}{\Delta N},   (E12)
R(k) = \sum_{n=0}^{\Delta N/\Delta n - 1 - k} \tilde{s}_{\Delta}(n)\, \tilde{s}_{\Delta}(n+k), \quad k \in \overline{0, p},   (E13)

where w(n) is Hamming's window, p is the order of linear prediction, ⌈f_{1d}/1000⌉ ≤ p ≤ 5 + ⌈f_{1d}/1000⌉, and ceil(f) is the function that rounds f up to the nearest integer.

5. The LPC coefficients a_j are calculated by the Durbin procedure.

6. The linear prediction error is calculated by means of the LPC coefficients

e(n) = \begin{cases} s_{\Delta}(n), & n < p, \\ s_{\Delta}(n) - \sum_{k=1}^{p} a_k\, s_{\Delta}(n-k), & n \ge p, \end{cases} \quad n \in \overline{0, \Delta N/\Delta n - 1},   (E14)

where e(n) is the prediction error.

7. The autocorrelation function of the linear prediction error is calculated

e_w(n) = e(n)\, w(n), \qquad w(n) = 0.54 + 0.46 \cos\frac{2\pi n}{\Delta N},   (E15)
r(k) = \sum_{n=0}^{\Delta N/\Delta n - 1 - k} e_w(n)\, e_w(n+k), \quad k \in \overline{0, \Delta N/\Delta n - 1},   (E16)

where w(n) is Hamming's window.

8. The value k* at which the autocorrelation function r(k) is maximal is determined; it corresponds to the extraction of the period of the speech signal

k^* = \arg\max_{k} r(k), \qquad r^* = \max_{k} r(k), \qquad k \Delta n \in [n_1, n_2],   (E17)

where n_1 is the minimum length of the fundamental tone period, n_1 = inf T_FT, and n_2 is the maximum length of the fundamental tone period, n_2 = sup T_FT.

Thus, the length of the fundamental tone period is determined as

T_{\mathrm{FT}} = \begin{cases} k^* \Delta n, & r^* \ge \gamma, \\ 0, & r^* < \gamma, \end{cases}   (E18)

where γ is the threshold value.
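The eight SIFT steps above can be sketched in Python roughly as follows. The Durbin recursion, the use of np.hamming as the window, and the normalization of the peak by r(0) in the threshold test are assumptions made where the chapter leaves details open; all numeric defaults are illustrative.

```python
import numpy as np

def levinson_durbin(R, p):
    """Durbin recursion: LPC coefficients a(1..p) from autocorrelations R(0..p)."""
    a, E = np.zeros(p + 1), R[0]
    for i in range(1, p + 1):
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, E = a_new, (1.0 - k * k) * E
    return a[1:], E

def sift_pitch_period(x, fd, n1, n2, gamma=0.3, f_cut=1000.0, f1d=2000.0):
    """Fundamental-tone period of one frame by SIFT, cf. steps 1-8 and (E7)-(E18)."""
    dN = len(x)
    # Step 1, (E7)-(E9): low-pass filtering by zeroing DFT bins above f_cut
    X = np.fft.fft(x)
    k_cut = int(f_cut * dN / fd)
    X[k_cut + 1:] = 0.0
    y = np.fft.ifft(X).real
    # Step 2, (E10): decimation to the sampling frequency f1d
    dn = int(fd / f1d)
    s = y[::dn]
    M = len(s)
    # Step 3, (E11): first difference of the decimated signal
    sd = np.concatenate(([s[0]], np.diff(s)))
    # Step 4, (E12)-(E13): window and autocorrelation up to the LPC order p
    p = 5 + int(np.ceil(f1d / 1000.0))
    w = np.hamming(M)                      # window choice is an assumption
    sw = sd * w
    R = np.array([np.dot(sw[:M - k], sw[k:]) for k in range(p + 1)])
    # Steps 5-6, (E14): LPC coefficients (Durbin) and the prediction error
    a, _ = levinson_durbin(R, p)
    e = np.array([sd[n] if n < p else sd[n] - np.dot(a, sd[n - p:n][::-1])
                  for n in range(M)])
    # Step 7, (E15)-(E16): autocorrelation of the windowed prediction error
    ew = e * w
    r = np.array([np.dot(ew[:M - k], ew[k:]) for k in range(M)])
    # Step 8, (E17)-(E18): peak search for lags k with n1 <= k*dn <= n2, then threshold test
    lo, hi = max(1, n1 // dn), min(M - 1, n2 // dn)
    k_star = lo + int(np.argmax(r[lo:hi + 1]))
    # the peak is normalized by r(0) before comparing with gamma (an assumption)
    return k_star * dn if r[k_star] >= gamma * r[0] else 0
```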

Example 1

Figure 1 presents the source signal, Figure 2 the filtered signal, and Figure 3 the decimated signal; a noisy variant was also considered (additive white Gaussian noise with zero mean and variance 0.001).

Figure 1.

Initial signal.

Figure 2.

The filtered signal.

Figure 3.

The decimated signal.

As the signal, a frame of the sound “A” of length ΔN = 512 is chosen, with sampling frequency f_d = 22050 Hz, 8 bits, mono. Figures 1-6 present the initial signal (Figure 1), the filtered signal (Figure 2), the decimated signal (Figure 3), the signal in the form of the weighted difference (Figure 4), the prediction error (Figure 5), and the autocorrelation function of the prediction error with the found maximum and the admissible boundaries marked (Figure 6).

Figure 4.

A signal in the form of the weighed difference.

Figure 5.

Prediction error.

Figure 6.

Autocorrelated function of prediction error.

2.4 Method based on wavelet analysis

This method calculates the distance between neighbouring maxima of the wavelet coefficients; a code sketch is given at the end of this subsection.

At the first stage, the continuous wavelet transformation, approximated according to the rectangle formula, is calculated as

d_{\mu}(l) = \sum_{n=0}^{N-1} x(n)\, a_0^{-\mu/2}\, \overline{\psi\!\left(a_0^{-\mu} n - b_0 l\right)}\, \Delta t, \quad l \in \overline{0, N - 1}, \qquad \Delta t = 1/f_d,   (E19)

where μ is the decomposition level at which a smooth sinusoid is reached, N is the signal length, and Δt is the quantization step.

For the Morlet wavelet,

\psi(\xi) = (2\pi)^{-1/2} \cos(k_0 \xi)\, e^{-\xi^2/2}, \qquad k_0 = 5, \qquad \xi = a_0^{-\mu} n - b_0 l.   (E20)

Since the sequence d_μ(l) represents a smooth sinusoid, there is no need for the autocorrelation function or the average magnitude difference function, which have considerable computational complexity. Instead of calculating these functions, at the second stage two successive maxima of the sequence d_μ(l) are found and the difference between their positions is calculated as

\left( d_{\mu}(j-1) \le d_{\mu}(j) \ge d_{\mu}(j+1) \right) \wedge \left( d_{\mu}(m-1) < d_{\mu}(m) \ge d_{\mu}(m+1) \right) \wedge \left( d_{\mu}(k-1) \le d_{\mu}(k) < d_{\mu}(k+1),\ j < k < m \right) \;\Rightarrow\; k^* = m - j.   (E21)

The period of the fundamental tone is defined as

T_{\mathrm{FT}} = \begin{cases} k^*, & n_1 \le k^* \le n_2, \\ 0, & \text{otherwise}, \end{cases}   (E22)

where n_1 is the minimum length of the fundamental tone period, n_1 = inf T_FT, and n_2 is the maximum length of the fundamental tone period, n_2 = sup T_FT.
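A rough sketch of the wavelet-based estimator follows. The discretization of (E19)-(E20) is implemented directly, with the scale a_0^μ passed as a single parameter `scale`; the translation is taken in signal samples so that d_μ(l) shares the time axis of x (an interpretation of (E20), since the chapter's discretization can also be read with the translation in the scaled coordinate). All numeric values are illustrative assumptions.

```python
import numpy as np

def morlet_cwt_level(x, fd, scale, b0=1.0, k0=5.0):
    """Wavelet coefficients d_mu(l) of one decomposition level, cf. (E19)-(E20).
    `scale` plays the role of a0**mu; k0 = 5 follows the chapter."""
    N = len(x)
    dt = 1.0 / fd
    n = np.arange(N)
    d = np.empty(N)
    for l in range(N):
        xi = (n - b0 * l) / scale          # translation in signal samples (an assumption)
        # real Morlet wavelet (E20)
        psi = (2.0 * np.pi) ** (-0.5) * np.cos(k0 * xi) * np.exp(-xi ** 2 / 2.0)
        d[l] = np.sum(x * psi) * dt / np.sqrt(scale)
    return d

def wavelet_pitch_period(d, n1, n2):
    """Distance between two successive local maxima of d_mu, cf. (E21)-(E22)."""
    peaks = [i for i in range(1, len(d) - 1) if d[i - 1] <= d[i] > d[i + 1]]
    if len(peaks) < 2:
        return 0
    k = peaks[1] - peaks[0]
    return k if n1 <= k <= n2 else 0
```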

Example 2

Figure 7 shows the sound “A”, and Figure 8 shows the sound “A” at the μ = 50 decomposition level.

Figure 7.

Sound “A” for wavelet analysis.

Figure 8.

A sound “A” at the 50th level of decomposition (frequency range is 51–250 Hz).

2.5 Method based on the cepstral analysis

This method searches for the maximum value of the cepstrum [3]; a code sketch is given after the steps below.

1. For the chosen signal frame of length ΔN, the spectrum is calculated using the DFT

X(k) = \sum_{n=0}^{\Delta N - 1} x(n)\, e^{-j (2\pi/\Delta N) n k}, \quad k \in \overline{0, \Delta N - 1}.   (E23)

2. The cepstrum is calculated using the inverse DFT

s(n) = \frac{1}{\Delta N} \sum_{k=0}^{\Delta N - 1} \lg\left| X(k) \right|^2 e^{j (2\pi/\Delta N) n k}, \quad n \in \overline{0, \Delta N - 1}.   (E24)

3. The value n* at which the cepstrum s(n) is maximal is determined; it corresponds to the extraction of the period of the speech signal

n^* = \arg\max_{n} s(n), \qquad s^* = \max_{n} s(n), \qquad n \in [n_1, n_2],   (E25)

where n_1 is the minimum length of the fundamental tone period, n_1 = inf T_FT, and n_2 is the maximum length of the fundamental tone period, n_2 = sup T_FT.

The period of the fundamental tone is defined as

T_{\mathrm{FT}} = \begin{cases} n^*, & s^* \ge \gamma, \\ 0, & s^* < \gamma, \end{cases}   (E26)

where γ is the threshold value.
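A minimal sketch of the cepstral estimator of (E23)-(E26); the small constant added before the logarithm and the values of n_1, n_2, γ are assumptions for numerical robustness, not chapter prescriptions.

```python
import numpy as np

def cepstrum_pitch_period(x, n1, n2, gamma):
    """Fundamental-tone period of one frame by the cepstral method, cf. (E23)-(E26)."""
    # (E23)-(E24): DFT, log power spectrum, inverse DFT (real cepstrum)
    X = np.fft.fft(x)
    s = np.fft.ifft(np.log10(np.abs(X) ** 2 + 1e-12)).real   # small constant avoids log(0)
    # (E25): quefrency of the cepstral maximum inside the admissible range [n1, n2]
    n_star = n1 + int(np.argmax(s[n1:n2 + 1]))
    # (E26): threshold test on the peak value
    return n_star if s[n_star] >= gamma else 0
```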

Example 3

As the signal, a frame of the sound “A” of length ΔN = 512 is chosen, with sampling frequency f_d = 22050 Hz, 8 bits, mono. Figure 9 shows the initial signal, and Figure 10 shows the cepstrum of the signal.

Figure 9.

Initial signal for cepstrum analysis.

Figure 10.

Cepstrum of a sound “A”.

2.6 HPS method

The harmonic product spectrum (HPS) method searches for the maximum value of the product of harmonics of the decimated power spectrum [3]; a code sketch is given after the steps below.

1. For the chosen signal frame of length ΔN, the spectrum is calculated using the DFT

X(k) = \sum_{n=0}^{\Delta N - 1} x(n)\, e^{-j (2\pi/\Delta N) n k}, \quad k \in \overline{0, \Delta N - 1}.   (E27)

2. The power spectrum of the signal is calculated

W(k) = \left| X(k) \right|^2, \quad k \in \overline{0, \Delta N - 1}.   (E28)

3. The power spectrum of the signal is decimated z times, i.e., intermediate frequencies of the power spectrum are removed

W_z(k) = \left| X(z k) \right|^2, \quad k \in \overline{0, \lfloor \Delta N/z \rfloor - 1}, \quad z \in \overline{1, Z},   (E29)

where ⌊·⌋ denotes the integer part of a number.

4. The product of harmonics of the decimated power spectra is calculated

P(k) = \prod_{z=1}^{Z} W_z(k), \quad k \in \overline{0, \lfloor \Delta N/Z \rfloor - 1}.   (E30)

5. The value k* at which the product of harmonics of the decimated power spectra is maximal is determined; it corresponds to the extraction of the period of the speech signal

k^* = \arg\max_{k} P(k), \quad k \in \overline{0, \lfloor \Delta N/Z \rfloor - 1}.   (E31)

The frequency of the fundamental tone is determined as

F_{\mathrm{FT}} = \begin{cases} k^*, & k_1 \le k^* \le k_2, \\ 0, & \text{otherwise}, \end{cases}   (E32)

where k_1 is the minimum frequency of the fundamental tone, k_1 = inf F_FT, and k_2 is the maximum frequency of the fundamental tone, k_2 = sup F_FT.
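A minimal sketch of the HPS estimator of (E27)-(E32); the signal, the decimation quantity Z, and the bin bounds k_1, k_2 in the usage example are illustrative assumptions.

```python
import numpy as np

def hps_fundamental_bin(x, Z, k1, k2):
    """Index of the fundamental-tone bin by the HPS method, cf. (E27)-(E32)."""
    dN = len(x)
    # (E27)-(E28): power spectrum of the frame
    W = np.abs(np.fft.fft(x)) ** 2
    # (E29)-(E30): product of the z-fold decimated power spectra, z = 1..Z
    L = dN // Z
    P = np.ones(L)
    for z in range(1, Z + 1):
        P *= W[::z][:L]           # W_z(k) = W(z * k)
    # (E31)-(E32): bin of the maximum, accepted only inside [k1, k2]
    k_star = int(np.argmax(P))
    return k_star if k1 <= k_star <= k2 else 0

# Illustrative usage (hypothetical parameters): 200 Hz tone with harmonics, fs = 8 kHz, dN = 512
fs, dN = 8000, 512
t = np.arange(dN) / fs
x = (np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
     + 0.25 * np.sin(2 * np.pi * 600 * t))
k = hps_fundamental_bin(x, Z=3, k1=5, k2=60)
print(k, k * fs / dN)   # bin index and the corresponding frequency in Hz, expected near 200 Hz
```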

The SIFT, ACF, AMDF, and cepstrum-based methods depend on the noise level.

The HPS and wavelet-based methods are resistant to noise.

The SIFT and cepstrum-based methods require setting a threshold.

The wavelet-based method requires setting the decomposition level.

The HPS method requires setting the decimation quantity.


3. Calculation method of linear prediction parameters

The linear predictive coding method uses an amplifier and a digital filter (Figure 11).

Figure 11.

The block diagram of the simplified model of signal formation.

Thus, the speech signal can be represented as the output of a linear system with time-varying parameters excited by quasi-periodic impulses or by random noise.

The transfer function H(z) of the linear system with variable parameters is considered as the ratio of the output signal spectrum S(z) to the input signal spectrum U(z):

H(z) = \frac{S(z)}{U(z)} = \frac{G}{A(z)}, \qquad A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k},   (E33)

where A(z) is the inverse filter for the system H(z), G is the gain coefficient, and p is the prediction order (filter order).

The input signal u(n) is represented by a pulse sequence or noise. The model has the following parameters: the gain coefficient G and the digital filter coefficients a_k. All these parameters change slowly in time and can be estimated frame by frame.

This method uses as features linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), and log area ratio (LAR) coefficients [3]; a code sketch is given after the steps below.

1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum, which has a steep roll-off in the high-frequency region, is balanced by means of a pre-emphasis filter

\tilde{s}_n(m) = s_n(m+1) - \alpha\, s_n(m), \quad m \in \overline{0, \Delta N - 1},   (E34)

where α is the filtration parameter, 0 < α < 1.

2. For the n-th frame, the autocorrelation function R_n(k) is calculated

\hat{s}_n(m) = \tilde{s}_n(m)\, w(m), \qquad w(m) = 0.54 + 0.46 \cos\frac{2\pi m}{\Delta N},   (E35)
R_n(k) = \sum_{m=0}^{\Delta N - 1 - k} \hat{s}_n(m)\, \hat{s}_n(m+k), \quad k \in \overline{0, p},   (E36)

where w(m) is Hamming's window, p is the order of linear prediction, ⌈f_d/1000⌉ ≤ p ≤ 5 + ⌈f_d/1000⌉, and ceil(f) is the function that rounds f up to the nearest integer.

3. For the n-th frame, the linear prediction coefficients (LPC) a_n(j) and the reflection coefficients (RC) k_n(i) are calculated by the Durbin procedure.

4. For the n-th frame, the gain coefficient G_n is calculated

G_n = \sqrt{E_n} = \sqrt{R_n(0) - \sum_{k=1}^{p} a_n(k)\, R_n(k)}.   (E37)

5. For the n-th frame, the linear prediction cepstral coefficients (LPCC) are calculated

\mathrm{LPCC}_n(m) = \begin{cases} \ln G_n, & m = 0, \\ a_n(m), & m = 1, \\ a_n(m) + \sum_{k=1}^{m-1} (k/m)\, \mathrm{LPCC}_n(k)\, a_n(m-k), & 2 \le m \le p, \end{cases} \quad m \in \overline{0, p}.   (E38)

6. For the n-th frame, the log area ratio (LAR) coefficients are calculated

\mathrm{LAR}_n(m) = \ln\frac{1 - k_n(m)}{1 + k_n(m)}, \quad m \in \overline{1, p}.   (E39)
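The six steps of this section can be sketched as follows. The Durbin recursion is written out explicitly so that it also returns the reflection coefficients; the pre-emphasis constant alpha = 0.95 and the np.hamming window are illustrative assumptions.

```python
import numpy as np

def durbin(R, p):
    """Durbin recursion: LPC coefficients a(1..p), reflection coefficients k(1..p)
    and the final prediction error E_p, from autocorrelations R(0..p)."""
    a, rc, E = np.zeros(p + 1), np.zeros(p + 1), R[0]
    for i in range(1, p + 1):
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        rc[i] = k
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, E = a_new, (1.0 - k * k) * E
    return a[1:], rc[1:], E

def lpc_features(frame, p, alpha=0.95):
    """LPC, RC, LPCC and LAR features of one frame, cf. steps 1-6 and (E34)-(E39)."""
    # Step 1, (E34): pre-emphasis
    s = frame[1:] - alpha * frame[:-1]
    # Step 2, (E35)-(E36): windowing and autocorrelation up to lag p
    dN = len(s)
    sw = s * np.hamming(dN)              # window choice is an assumption
    R = np.array([np.dot(sw[:dN - k], sw[k:]) for k in range(p + 1)])
    # Step 3: LPC and RC by the Durbin procedure
    a, rc, _ = durbin(R, p)
    # Step 4, (E37): gain coefficient
    G = np.sqrt(max(R[0] - np.dot(a, R[1:]), 1e-12))
    # Step 5, (E38): LPCC recursion
    lpcc = np.zeros(p + 1)
    lpcc[0], lpcc[1] = np.log(G), a[0]
    for m in range(2, p + 1):
        ks = np.arange(1, m)
        lpcc[m] = a[m - 1] + np.sum(ks / m * lpcc[1:m] * a[m - 1 - ks])
    # Step 6, (E39): log area ratios from the reflection coefficients
    lar = np.log((1.0 - rc) / (1.0 + rc))
    return a, rc, lpcc, lar
```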

4. Calculation method of formants

For the n-th frame, the logarithmic power spectrum is calculated using the gain coefficient and the linear prediction coefficients (LPC) [3, 4]:

10 \lg W_n(k) = 10 \lg \left| \frac{G_n}{A_n(z)} \right|^2 = 10 \lg \frac{G_n^2}{\left( 1 - \sum_{m=1}^{p} a_n(m) \cos\frac{2\pi k m}{\Delta N} \right)^2 + \left( \sum_{m=1}^{p} a_n(m) \sin\frac{2\pi k m}{\Delta N} \right)^2}.   (E40)

For person identification or speech recognition, the analysis of vocalized sounds is limited to the frequency range from 0 to 3 kHz, and the first three formants F_1, F_2, F_3 are used. For speech synthesis, the frequency range from 0 to 4-5 kHz and the first five formants F_1, F_2, F_3, F_4, F_5 are used. A sketch of formant extraction from the LPC spectrum is given below.
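Combining (E40) with a simple peak search gives a rough formant estimator in code; the vectorized evaluation and the peak-picking rule below are assumptions, and the LPC coefficients a and gain G are assumed to come, for example, from the sketch in Section 3.

```python
import numpy as np

def lpc_log_power_spectrum(a, G, dN):
    """10*lg W_n(k) of (E40) from the LPC coefficients a(1..p) and the gain G."""
    k = np.arange(dN)[:, None]
    m = np.arange(1, len(a) + 1)[None, :]
    re = 1.0 - np.sum(a * np.cos(2.0 * np.pi * k * m / dN), axis=1)
    im = np.sum(a * np.sin(2.0 * np.pi * k * m / dN), axis=1)
    return 10.0 * np.log10(G ** 2 / (re ** 2 + im ** 2))

def formant_bins(spectrum_db, n_formants=3):
    """Indices of the first local maxima of the LPC spectrum (formant candidates)."""
    half = spectrum_db[:len(spectrum_db) // 2]          # formants lie below f_d / 2
    peaks = [i for i in range(1, len(half) - 1)
             if half[i - 1] < half[i] >= half[i + 1]]
    return peaks[:n_formants]
```

The first three peaks below f_d/2 then approximate F_1, F_2, F_3; a bin index k converts to Hz as k·f_d/ΔN.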

Example 4

Figure 12 presents the logarithmic power spectrum of the central frame of the sound “A” at different prediction orders; the frame length is N = 512 and the sampling frequency is f_d = 22050 Hz.

Figure 12.

The logarithmic LPC power spectrum of the sound “A” at different prediction orders p.

As Figure 12 shows, formant extraction (maxima in the spectrum) is already possible at p = 30.

Example 5

Figure 13 shows the sound “A”, and Figure 14 shows its logarithmic LPC power spectrum. Figure 15 shows the central frame of the sound “Sh”, and Figure 16 shows its logarithmic LPC power spectrum. The frame length is N = 512, the sampling frequency is f_d = 22050 Hz, 8 bits, mono, and the prediction order is p = 30.

Figure 13.

Sound “A”.

Figure 14.

Logarithmic LPC power spectrum of the sound “A” at prediction order p = 30.

Figure 15.

Sound “Sh”.

Figure 16.

Logarithmic LPC power spectrum of the sound “Sh” at prediction order p = 30.


5. Method of mel-frequency cepstral coefficients calculation

This method is based on homomorphic processing and uses mel-frequency cepstral coefficients (MFCC) as features [5, 6]; a code sketch is given after the steps below.

1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum, which has a steep roll-off in the high-frequency region, is balanced by means of a pre-emphasis filter

\tilde{s}_n(m) = s_n(m+1) - \alpha\, s_n(m), \quad m \in \overline{0, \Delta N - 1},   (E41)

where α is the filtration parameter, 0 < α < 1.

2. For the n-th frame, the spectrum is calculated using the DFT

\hat{s}_n(m) = \tilde{s}_n(m)\, w(m), \qquad w(m) = 0.54 + 0.46 \cos\frac{2\pi m}{\Delta N},   (E42)
S_n(k) = \sum_{m=0}^{\Delta N - 1} \hat{s}_n(m)\, e^{-j (2\pi/\Delta N) k m}, \quad k \in \overline{0, \Delta N - 1},   (E43)

where w(m) is Hamming's window.

3. For the n-th frame, the energy of the m-th mel-frequency band is calculated using the mel-frequency transformation and Bartlett's window

E_n(m) = \sum_{k=0}^{\Delta N/2 - 1} \left| S_n(k) \right|^2 w_m(k), \quad m \in \overline{1, P}, \qquad w_m(k) = \begin{cases} 0, & k < f(m-1)\ \text{or}\ k > f(m+1), \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m), \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1), \end{cases}   (E44)

f(m) = \frac{\Delta N}{f_d}\, B^{-1}\!\left( B(f_{\min}) + m\, \frac{B(f_{\max}) - B(f_{\min})}{P + 1} \right), \quad m \in \overline{0, P + 1},   (E45)

B(f) = 1125 \ln(1 + f/700), \qquad B^{-1}(b) = 700\left( e^{b/1125} - 1 \right),   (E46)

where E_n(m) is the energy of the m-th mel-frequency band, w_m(k) is Bartlett's window for the m-th band, B(f) is the function that converts frequency in Hz to frequency in mel, B^{-1}(b) is the function that converts frequency in mel to frequency in Hz, f(m) is the normalized frequency, f_min and f_max are the minimum and maximum frequencies in Hz (for example, f_min = 0, f_max = f_d/2), f_d is the sampling frequency of the speech signal in Hz, and P is the quantity of mel-frequency bands.

4. For the n-th frame, the mel-frequency cepstral coefficients (MFCC) are calculated using the inverse discrete cosine transformation of DCT-2

\mathrm{MFCC}_n(m) = \sqrt{\frac{2}{P}} \sum_{k=0}^{P-1} \alpha_k \ln E_n(k+1) \cos\frac{\pi k (2m+1)}{2P}, \quad m \in \overline{0, \tilde{P} - 1}, \qquad \alpha_k = \begin{cases} 1/\sqrt{2}, & k = 0, \\ 1, & k > 0, \end{cases}   (E47)

where P̃ is the quantity of mel-frequency cepstral coefficients, 1 ≤ P̃ ≤ P.
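A compact sketch of the four steps above follows; the pre-emphasis constant, the np.hamming window, and the default band and coefficient counts (P = 20, P̃ = 13, which match the experimental settings quoted later in the chapter) are assumptions of this illustration.

```python
import numpy as np

def mel(f):            # (E46): Hz -> mel
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_inv(b):        # (E46): mel -> Hz
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def mfcc(frame, fd, P=20, n_ceps=13, alpha=0.95):
    """MFCC of one frame, cf. steps 1-4 and (E41)-(E47)."""
    # Step 1, (E41): pre-emphasis
    s = frame[1:] - alpha * frame[:-1]
    dN = len(s)
    # Step 2, (E42)-(E43): windowed DFT and half-spectrum power
    S = np.fft.fft(s * np.hamming(dN))
    power = np.abs(S[:dN // 2]) ** 2
    # Step 3, (E44)-(E46): triangular (Bartlett) mel filterbank energies
    f_min, f_max = 0.0, fd / 2.0
    edges = dN / fd * mel_inv(mel(f_min)
                              + np.arange(P + 2) * (mel(f_max) - mel(f_min)) / (P + 1))
    k = np.arange(dN // 2)
    E = np.empty(P)
    for m in range(1, P + 1):
        left, centre, right = edges[m - 1], edges[m], edges[m + 1]
        up = (k - left) / (centre - left)
        down = (right - k) / (right - centre)
        E[m - 1] = np.dot(power, np.clip(np.minimum(up, down), 0.0, 1.0))
    # Step 4, (E47): inverse DCT (DCT-III) of the log band energies
    alpha_k = np.ones(P); alpha_k[0] = 1.0 / np.sqrt(2.0)
    mvec = np.arange(n_ceps)[:, None]
    kvec = np.arange(P)[None, :]
    basis = np.cos(np.pi * kvec * (2 * mvec + 1) / (2.0 * P))
    return np.sqrt(2.0 / P) * (basis * alpha_k * np.log(E + 1e-12)).sum(axis=1)
```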


6. Method of bark-frequency cepstral coefficients calculation

This method is based on homomorphic processing and uses bark-frequency cepstral coefficients (BFCC) as features [7, 8]; a code sketch is given after the steps below.

1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum is calculated using the DFT

\hat{s}_n(m) = s_n(m)\, w(m), \qquad w(m) = 0.54 + 0.46 \cos\frac{2\pi m}{\Delta N},   (E48)
S_n(k) = \sum_{m=0}^{\Delta N - 1} \hat{s}_n(m)\, e^{-j (2\pi/\Delta N) k m}, \quad k \in \overline{0, \Delta N - 1},   (E49)

where w(m) is Hamming's window.

2. The quantity of bark-frequency bands is calculated

P = \left\lceil B(f_d/2) \right\rceil + 1, \qquad B(f) = 6\, \mathrm{arcsinh}(f/600),   (E50)

where ceil(f) is the function that rounds f up to the nearest integer, f_d is the sampling frequency of the speech signal in Hz, and B(f) is the function that converts frequency in Hz to frequency in bark.

3. For the n-th frame, the energy of the bark-frequency bands is calculated

E_n(m) = \sum_{k=0}^{\Delta N/2 - 1} \left| S_n(k) \right|^2 w_m(k), \quad m \in \overline{0, P - 1},   (E51)

b(m) = m\, \frac{B(f_d/2)}{P - 1}, \quad m \in \overline{0, P - 1},   (E52)

\Delta b_m(k) = B\!\left( \frac{k f_d}{\Delta N} \right) - b(m), \quad m \in \overline{0, P - 1}, \quad k \in \overline{0, \Delta N - 1},   (E53)

w_m(k) = \begin{cases} 10^{\Delta b_m(k) + 0.5}, & \Delta b_m(k) \le -0.5, \\ 1, & -0.5 < \Delta b_m(k) < 0.5, \\ 10^{-2.5 (\Delta b_m(k) - 0.5)}, & \Delta b_m(k) \ge 0.5, \end{cases}   (E54)

where E_n(m) is the energy of the m-th bark-frequency band and w_m(k) is the trapezoidal window for the m-th band.

4. For the n-th frame, the equal-loudness weighting is applied to the energies of the bark-frequency bands

\bar{E}_n(m) = v\!\left( B^{-1}(b(m)) \right) E_n(m), \quad m \in \overline{0, P - 1},   (E55)

B^{-1}(b) = 600 \sinh(b/6),   (E56)

v(f) = \begin{cases} \dfrac{(f^2 + 56.8 \cdot 10^6)\, f^4}{(f^2 + 6.3 \cdot 10^6)^2\, (f^2 + 0.38 \cdot 10^9)}, & f_d < 5000, \\[2ex] \dfrac{(f^2 + 56.8 \cdot 10^6)\, f^4}{(f^2 + 6.3 \cdot 10^6)^2\, (f^2 + 0.38 \cdot 10^9)\, (f^6 + 9.58 \cdot 10^{26})}, & f_d \ge 5000, \end{cases}   (E57)

where v(f) is the equal-loudness weighting function (it approximates human auditory perception, since a person has unequal hearing sensitivity at different frequencies) and B^{-1}(b) is the function that converts frequency in bark to frequency in Hz.

5. For the n-th frame, the intensity-loudness power law is applied to the energies of the bark-frequency bands

\tilde{E}_n(m) = \left( \bar{E}_n(m) \right)^{0.33}, \quad m \in \overline{0, P - 1}.   (E58)

6. For the n-th frame, the bark-frequency cepstral coefficients (BFCC) are calculated using the inverse discrete cosine transformation of DCT-2; beforehand, the energies Ẽ_n(0) and Ẽ_n(P-1) are replaced by Ẽ_n(1) and Ẽ_n(P-2), respectively

\mathrm{BFCC}_n(m) = \sqrt{\frac{2}{P-1}} \sum_{k=0}^{P-2} \alpha_k \ln \tilde{E}_n(k+1) \cos\frac{\pi k (2m+1)}{2(P-1)}, \quad m \in \overline{0, \tilde{P} - 1},   (E59)

\tilde{E}_n(0) = \tilde{E}_n(1), \qquad \tilde{E}_n(P-1) = \tilde{E}_n(P-2), \qquad \alpha_k = \begin{cases} 1/\sqrt{2}, & k = 0, \\ 1, & k > 0, \end{cases}   (E60)

where P̃ is the quantity of bark-frequency cepstral coefficients, 1 ≤ P̃ ≤ P.
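A sketch of the BFCC pipeline (E48)-(E60) follows; the np.hamming window and the default number of cepstral coefficients are assumptions, and the equal-loudness curve is coded exactly as written in (E57).

```python
import numpy as np

def bark(f):            # (E50): Hz -> bark
    return 6.0 * np.arcsinh(f / 600.0)

def bark_inv(b):        # (E56): bark -> Hz
    return 600.0 * np.sinh(b / 6.0)

def equal_loudness(f, fd):
    """Equal-loudness weighting v(f) of (E57)."""
    w = ((f ** 2 + 56.8e6) * f ** 4) / ((f ** 2 + 6.3e6) ** 2 * (f ** 2 + 0.38e9))
    return w if fd < 5000 else w / (f ** 6 + 9.58e26)

def bfcc(frame, fd, n_ceps=13):
    """BFCC of one frame, cf. steps 1-6 and (E48)-(E60)."""
    dN = len(frame)
    # Steps 1-2, (E48)-(E50): windowed DFT and number of bark bands
    S = np.fft.fft(frame * np.hamming(dN))
    power = np.abs(S[:dN // 2]) ** 2
    P = int(np.ceil(bark(fd / 2.0))) + 1
    # Step 3, (E51)-(E54): trapezoidal bark filterbank energies
    k = np.arange(dN // 2)
    bm = np.arange(P) * bark(fd / 2.0) / (P - 1)          # band centres in bark (E52)
    E = np.empty(P)
    for m in range(P):
        db = bark(k * fd / dN) - bm[m]                     # (E53)
        w = np.where(db <= -0.5, 10.0 ** (db + 0.5),
                     np.where(db >= 0.5, 10.0 ** (-2.5 * (db - 0.5)), 1.0))   # (E54)
        E[m] = np.dot(power, w)
    # Steps 4-5, (E55)-(E58): equal-loudness weighting and intensity-loudness power law
    E = equal_loudness(bark_inv(bm), fd) * E
    E = E ** 0.33
    # Step 6, (E59)-(E60): replace the edge bands and apply the inverse DCT (DCT-III)
    E[0], E[-1] = E[1], E[-2]
    alpha_k = np.ones(P - 1); alpha_k[0] = 1.0 / np.sqrt(2.0)
    mvec = np.arange(n_ceps)[:, None]
    kvec = np.arange(P - 1)[None, :]
    basis = np.cos(np.pi * kvec * (2 * mvec + 1) / (2.0 * (P - 1)))
    return np.sqrt(2.0 / (P - 1)) * (basis * alpha_k * np.log(E[1:P] + 1e-12)).sum(axis=1)
```

For f_d = 8000 Hz this filterbank yields P = 17 bark bands, which agrees with the experimental settings quoted in Section 9.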


7. Method for calculation of perceptual linear prediction parameters

In this method, perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), and perceptual log area ratio (PLAR) coefficients are used as features [9, 10]; a sketch of the steps that differ from the previous section is given after the list below.

1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum is calculated using the DFT

\hat{s}_n(m) = s_n(m)\, w(m), \qquad w(m) = 0.54 + 0.46 \cos\frac{2\pi m}{\Delta N},   (E61)
S_n(k) = \sum_{m=0}^{\Delta N - 1} \hat{s}_n(m)\, e^{-j (2\pi/\Delta N) k m}, \quad k \in \overline{0, \Delta N - 1},   (E62)

where w(m) is Hamming's window.

2. The quantity of bark-frequency bands is calculated

P = \left\lceil B(f_d/2) \right\rceil + 1, \qquad B(f) = 6\, \mathrm{arcsinh}(f/600),   (E63)

where ceil(f) is the function that rounds f up to the nearest integer, f_d is the sampling frequency of the speech signal in Hz, and B(f) is the function that converts frequency in Hz to frequency in bark.

3. For the n-th frame, the energy of the bark-frequency bands is calculated

E_n(m) = \sum_{k=0}^{\Delta N/2 - 1} \left| S_n(k) \right|^2 w_m(k), \quad m \in \overline{0, P - 1},   (E64)

b(m) = m\, \frac{B(f_d/2)}{P - 1}, \quad m \in \overline{0, P - 1},   (E65)

\Delta b_m(k) = B\!\left( \frac{k f_d}{\Delta N} \right) - b(m), \quad m \in \overline{0, P - 1}, \quad k \in \overline{0, \Delta N - 1},   (E66)

w_m(k) = \begin{cases} 10^{\Delta b_m(k) + 0.5}, & \Delta b_m(k) \le -0.5, \\ 1, & -0.5 < \Delta b_m(k) < 0.5, \\ 10^{-2.5 (\Delta b_m(k) - 0.5)}, & \Delta b_m(k) \ge 0.5, \end{cases}   (E67)

where E_n(m) is the energy of the m-th bark-frequency band and w_m(k) is the trapezoidal window for the m-th band.

4. For the n-th frame, the equal-loudness weighting is applied to the energies of the bark-frequency bands

\bar{E}_n(m) = v\!\left( B^{-1}(b(m)) \right) E_n(m), \quad m \in \overline{0, P - 1},   (E68)

B^{-1}(b) = 600 \sinh(b/6),   (E69)

v(f) = \begin{cases} \dfrac{(f^2 + 56.8 \cdot 10^6)\, f^4}{(f^2 + 6.3 \cdot 10^6)^2\, (f^2 + 0.38 \cdot 10^9)}, & f_d < 5000, \\[2ex] \dfrac{(f^2 + 56.8 \cdot 10^6)\, f^4}{(f^2 + 6.3 \cdot 10^6)^2\, (f^2 + 0.38 \cdot 10^9)\, (f^6 + 9.58 \cdot 10^{26})}, & f_d \ge 5000, \end{cases}   (E70)

where v(f) is the equal-loudness weighting function (it approximates human auditory perception, since a person has unequal hearing sensitivity at different frequencies) and B^{-1}(b) is the function that converts frequency in bark to frequency in Hz.

5. For the n-th frame, the intensity-loudness power law is applied to the energies of the bark-frequency bands

\tilde{E}_n(m) = \left( \bar{E}_n(m) \right)^{0.33}, \quad m \in \overline{0, P - 1}.   (E71)

6. For the n-th frame, the values of the autocorrelation function are calculated using the inverse DFT; beforehand, the energies Ẽ_n(0) and Ẽ_n(P-1) are replaced by Ẽ_n(1) and Ẽ_n(P-2), respectively

R_n(k) = \mathrm{Re}\!\left[ \frac{1}{2P - 2} \sum_{m=0}^{2P-3} \tilde{E}_n(m)\, e^{j \frac{2\pi}{2P-2} k m} \right], \quad k \in \overline{0, p},   (E72)

\tilde{E}_n(0) = \tilde{E}_n(1), \qquad \tilde{E}_n(P-1) = \tilde{E}_n(P-2), \qquad \tilde{E}_n(2P-2-m) = \tilde{E}_n(m), \quad m \in \overline{1, P-2},   (E73)

where p is the order of linear prediction, ⌈f_d/1000⌉ ≤ p ≤ 5 + ⌈f_d/1000⌉, and ceil(f) is the function that rounds f up to the nearest integer.

7. For the n-th frame, the perceptual linear prediction coefficients (PLPC) a_n(j) and the perceptual reflection coefficients (PRC) k_n(i) are calculated by the Durbin procedure.

8. For the n-th frame, the gain coefficient G_n is calculated

G_n = \sqrt{E_n} = \sqrt{R_n(0) - \sum_{k=1}^{p} a_n(k)\, R_n(k)}.   (E74)

9. For the n-th frame, the perceptual linear prediction cepstral coefficients (PLPCC) are calculated

\mathrm{PLPCC}_n(m) = \begin{cases} \ln G_n, & m = 0, \\ a_n(m), & m = 1, \\ a_n(m) + \sum_{k=1}^{m-1} (k/m)\, \mathrm{PLPCC}_n(k)\, a_n(m-k), & 2 \le m \le p, \end{cases} \quad m \in \overline{0, p}.   (E75)

10. For the n-th frame, the perceptual log area ratio (PLAR) coefficients are calculated

\mathrm{PLAR}_n(m) = \ln\frac{1 - k_n(m)}{1 + k_n(m)}, \quad m \in \overline{1, p}.   (E76)
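Steps 1-5 coincide with the BFCC pipeline, so only the back end of steps 6-10 is sketched here: the inverse DFT of the auditory spectrum to autocorrelations, the Durbin recursion, and the cepstral and LAR conversions. The auditory spectrum Ẽ_n(m) could be produced, for instance, by the bark filterbank of the previous section's sketch; the Durbin routine is repeated so that the block stays self-contained.

```python
import numpy as np

def durbin(R, p):
    """Durbin recursion: LP coefficients a(1..p), reflection coefficients k(1..p), error E_p."""
    a, rc, E = np.zeros(p + 1), np.zeros(p + 1), R[0]
    for i in range(1, p + 1):
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        rc[i] = k
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, E = a_new, (1.0 - k * k) * E
    return a[1:], rc[1:], E

def plp_features(E_bark, p):
    """PLPC, PRC, PLPCC and PLAR from the auditory spectrum E_bark(0..P-1), cf. (E72)-(E76)."""
    P = len(E_bark)
    Et = np.array(E_bark, dtype=float)
    # (E73): replace the edge bands and extend symmetrically to length 2P-2
    Et[0], Et[-1] = Et[1], Et[-2]
    sym = np.concatenate((Et, Et[-2:0:-1]))               # E(2P-2-m) = E(m), m = 1..P-2
    # (E72): autocorrelations as the real part of the inverse DFT of the auditory spectrum
    R = np.fft.ifft(sym).real[:p + 1]
    # Steps 7-8, (E74): Durbin recursion and the gain coefficient
    a, rc, _ = durbin(R, p)
    G = np.sqrt(max(R[0] - np.dot(a, R[1:p + 1]), 1e-12))
    # Step 9, (E75): cepstral recursion
    plpcc = np.zeros(p + 1)
    plpcc[0], plpcc[1] = np.log(G), a[0]
    for m in range(2, p + 1):
        ks = np.arange(1, m)
        plpcc[m] = a[m - 1] + np.sum(ks / m * plpcc[1:m] * a[m - 1 - ks])
    # Step 10, (E76): perceptual log area ratios
    plar = np.log((1.0 - rc) / (1.0 + rc))
    return a, rc, plpcc, plar
```

The reconsidered perceptual linear prediction of the next section uses the same back end but feeds it with mel-filterbank energies instead of bark-filterbank energies.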

8. Method for calculation of reconsidered perceptual linear prediction parameters

In this method, reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC), and reconsidered perceptual log area ratio (RPLAR) coefficients are used as features [7, 8].

1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum, which has a steep roll-off in the high-frequency region, is balanced by means of a pre-emphasis filter

\tilde{s}_n(m) = s_n(m+1) - \alpha\, s_n(m), \quad m \in \overline{0, \Delta N - 1},   (E77)

where α is the filtration parameter, 0 < α < 1.

2. For the n-th frame, the spectrum is calculated using the DFT

\hat{s}_n(m) = \tilde{s}_n(m)\, w(m), \qquad w(m) = 0.54 + 0.46 \cos\frac{2\pi m}{\Delta N},   (E78)
S_n(k) = \sum_{m=0}^{\Delta N - 1} \hat{s}_n(m)\, e^{-j (2\pi/\Delta N) k m}, \quad k \in \overline{0, \Delta N - 1},   (E79)

where w(m) is Hamming's window.

3. For the n-th frame, the energy of the m-th mel-frequency band is calculated using the mel-frequency transformation and Bartlett's window

E_n(m) = \sum_{k=0}^{\Delta N/2 - 1} \left| S_n(k) \right|^2 w_m(k), \quad m \in \overline{1, P},   (E80)

w_m(k) = \begin{cases} 0, & k < f(m-1)\ \text{or}\ k > f(m+1), \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m), \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1), \end{cases}   (E81)

f(m) = \frac{\Delta N}{f_d}\, B^{-1}\!\left( B(f_{\min}) + m\, \frac{B(f_{\max}) - B(f_{\min})}{P + 1} \right), \quad m \in \overline{0, P + 1},   (E82)

B(f) = 1125 \ln(1 + f/700), \qquad B^{-1}(b) = 700\left( e^{b/1125} - 1 \right),   (E83)

where E_n(m) is the energy of the m-th mel-frequency band, w_m(k) is Bartlett's window for the m-th band, B(f) is the function that converts frequency in Hz to frequency in mel, B^{-1}(b) is the function that converts frequency in mel to frequency in Hz, f(m) is the normalized frequency, f_min and f_max are the minimum and maximum frequencies in Hz (for example, f_min = 0, f_max = f_d/2), f_d is the sampling frequency of the speech signal in Hz, and P is the quantity of mel-frequency bands.

4. For the n-th frame, the values of the autocorrelation function are calculated using the inverse DFT

R_n(k) = \mathrm{Re}\!\left[ \frac{1}{2P - 2} \sum_{m=0}^{2P-3} E_n(m+1)\, e^{j \frac{2\pi}{2P-2} k m} \right], \quad k \in \overline{0, p},   (E84)

E_n(2P - m) = E_n(m), \quad m \in \overline{2, P - 1},   (E85)

where p is the order of linear prediction, ⌈f_d/1000⌉ ≤ p ≤ 5 + ⌈f_d/1000⌉, and ceil(f) is the function that rounds f up to the nearest integer.

5. For the n-th frame, the reconsidered perceptual linear prediction coefficients (RPLPC) a_n(j) and the reconsidered perceptual reflection coefficients (RPRC) k_n(i) are calculated by the Durbin procedure.

6. For the n-th frame, the gain coefficient G_n is calculated

G_n = \sqrt{E_n} = \sqrt{R_n(0) - \sum_{k=1}^{p} a_n(k)\, R_n(k)}.   (E86)

7. For the n-th frame, the reconsidered perceptual linear prediction cepstral coefficients (RPLPCC) are calculated

\mathrm{RPLPCC}_n(m) = \begin{cases} \ln G_n, & m = 0, \\ a_n(m), & m = 1, \\ a_n(m) + \sum_{k=1}^{m-1} (k/m)\, \mathrm{RPLPCC}_n(k)\, a_n(m-k), & 2 \le m \le p, \end{cases} \quad m \in \overline{0, p}.   (E87)

8. For the n-th frame, the reconsidered perceptual log area ratio (RPLAR) coefficients are calculated

\mathrm{RPLAR}_n(m) = \ln\frac{1 - k_n(m)}{1 + k_n(m)}, \quad m \in \overline{1, p}.   (E88)

9. The performance comparison of various features for person identification

For the speech signals containing vocal sounds, the sampling frequency was set to 8 kHz and the number of quantization levels to 256. The sample length of a vocal speech sound is equal to 256.

The results of a numerical study of the LPC, RC, LPCC, LAR, MFCC, BFCC, PLPC, PRC, PLPCC, PLAR, RPLPC, RPRC, RPLPCC, and RPLAR coefficients, obtained by the coding methods and used for biometric identification of people from the TIMIT database on vocal sounds by means of Gaussian mixture models (GMM), are presented in Table 1; an illustrative sketch of GMM-based identification is given at the end of this section.

Coefficient type        Identification probability    Number of coefficients
LPC                     0.72                          12
RC                      0.96                          12
LPCC                    0.90                          13
LAR coefficients        0.82                          12
MFCC                    0.97                          13
BFCC                    0.98                          13
PLPC                    0.74                          4
PRC                     0.98                          4
PLPCC                   0.92                          5
PLAR coefficients       0.84                          4
RPLPC                   0.73                          12
RPRC                    0.97                          12
RPLPCC                  0.91                          13
RPLAR coefficients      0.83                          12

Table 1.

Numerical research results of the coefficients used for biometric identification of a person.

For the coding methods used to analyze the speech signal, the filter order is equal to 12 in the case of linear prediction, 4 in the case of perceptual linear prediction, and 12 in the case of the reconsidered perceptual linear prediction; the quantity of mel-frequency bands is equal to 20, the quantity of bark-frequency bands is equal to 17, and the number of cepstral parameters based on subbands is equal to 13.

The results presented in Table 1 show that the largest identification probability and the smallest number of coefficients are provided by coding a vocal speech sound based on PRC.
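As an illustration of how such features might be fed to Gaussian mixture models for identification, a minimal sketch with scikit-learn is shown below. The helper names, the data layout (one array of frame features per speaker), the diagonal covariance, and n_components = 8 are assumptions of this illustration, not the chapter's experimental setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_per_speaker, n_components=8):
    """Fit one GMM per speaker; features_per_speaker maps a speaker id to an
    (n_frames, n_features) array of, for example, PRC vectors."""
    models = {}
    for speaker, X in features_per_speaker.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', random_state=0)
        models[speaker] = gmm.fit(X)
    return models

def identify(models, X):
    """Return the speaker whose GMM gives the highest average log-likelihood for the frames X."""
    return max(models, key=lambda speaker: models[speaker].score(X))
```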


10. Conclusion

The preliminary stage of biometric identification is speech signal structuring and feature extraction.

For calculation of the fundamental tone, the following digital signal processing methods were considered and numerically investigated: the ACF (autocorrelation function) method, the AMDF (average magnitude difference function) method, the SIFT (simplified inverse filter transformation) method, the method based on wavelet analysis, the method based on the cepstral analysis, and the HPS (harmonic product spectrum) method. For speech signal feature extraction, the following digital signal processing methods were considered and numerically investigated: the digital bandpass filter bank; spectral analysis (Fourier transformation, wavelet transformation); homomorphic processing; and linear predictive coding. These methods make it possible to extract linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), log area ratio (LAR) coefficients, mel-frequency cepstral coefficients (MFCC), bark-frequency cepstral coefficients (BFCC), perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), perceptual log area ratio (PLAR) coefficients, reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC), and reconsidered perceptual log area ratio (RPLAR) coefficients. Results of a numerical study of the speech signal feature extraction methods for voice signals of people from the TIMIT (Texas Instruments and Massachusetts Institute of Technology) database were obtained. The PRC features proved to be the most effective.

References

  1. Oppenheim AV, Schafer RW. Discrete-Time Signal Processing. Upper Saddle River, NJ: Prentice Hall; 2010. p. 1108
  2. Mallat S. A Wavelet Tour of Signal Processing: The Sparse Way. Burlington, MA: Academic Press; 2008. p. 832. DOI: 10.1016/B978-0-12-374370-1.X0001-8
  3. Rabiner LR, Schafer RW. Theory and Applications of Digital Speech Processing. Upper Saddle River, NJ: Pearson Higher Education; 2011. p. 1042
  4. Markel JD, Gray AH. Linear Prediction of Speech. Berlin: Springer Verlag; 1976. p. 382
  5. Davis SB, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing. 1980;28(4):357-366
  6. Ganchev T, Fakotakis N, Kokkinakis G. Comparative evaluation of various MFCC implementations on the speaker verification task. In: Proceedings of SPECOM 2005. Vol. 1. Patras, Greece; 2005. pp. 191-194
  7. Josef R, Pollak P. Modified feature extraction methods in robust speech recognition. In: Proceedings of the 17th IEEE International Conference Radioelektronika. Brno, Czech Republic: IEEE; 2007. pp. 1-4
  8. Kumar P, Biswas A, Mishra AN, Chandra M. Spoken language identification using hybrid feature extraction methods. Journal of Telecommunications. 2010;1(2):11-15
  9. Huang X, Acero A, Hon H-W. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Upper Saddle River, NJ: Prentice Hall; 2001. p. 980
  10. Hermansky H. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America. 1990;87(4):1738-1752. DOI: 10.1121/1.399423
