Open access peer-reviewed chapter

A Stereo Acoustic Echo Canceller Using Cross-Channel Correlation

By Shigenobu Minami

Submitted: October 9th 2010. Reviewed: April 24th 2011. Published: September 6th 2011.

DOI: 10.5772/16341


1. Introduction

Stereo acoustic echo cancellers are becoming more and more important as echo cancellers are applied to consumer products such as conversational DTVs. However, it is well known that if there is strong cross-channel correlation between the right and left sounds, a stereo echo canceller cannot converge well and suffers from echo path estimation misalignment. This is a serious problem in a conversational DTV, because the speaker output is a combination of the far-end conversational sound, which is essentially monaural, and the TV program sound, which has a wide variety of characteristics: monaural sound, stereo sound, or a mixture of the two. To cope with this problem, many stereo echo cancellation algorithms have been proposed. The methods can be categorized into two approaches. The first is to de-correlate the stereo sound by introducing independent noise or non-linear post-processing into the right and left speaker outputs. This approach is very effective in the single-source stereo case, which covers most conversational sounds, because the de-correlation prevents the rank drop when solving the normal equation in a multi-channel adaptive filtering algorithm. Moreover, it is simple, since many traditional adaptation algorithms can be used without any modification. Although this approach has many advantages and is therefore widely accepted, it still has an essential problem: the de-correlation changes the sound quality by inserting artificial distortion. Even if the inserted distortion is minimized so as to prevent audible degradation, from the viewpoint of entertainment audio equipment such as a conversational DTV, many users do not accept any distortion of the speaker output sound. The second approach is preferable for entertainment equipment because no modification of the speaker outputs is required. In this approach, the algorithms utilize changes in the cross-channel correlation of the stereo sound.
This second approach is further divided into two, depending on how the cross-channel correlation change is utilized. One widely used method is the affine projection method. If there are small variations in the cross-channel correlation, even in single-source stereo sound, small un-correlated components appear in each channel. The affine projection method can produce the best update direction by excluding the harmful auto-correlation effect in each channel and by utilizing these small un-correlated components. It has a great advantage in that it does not require any modification of the stereo sound; however, if the variation in the cross-channel correlation is very small, the improvement in adaptive filter convergence is also very small. Since the rank-drop problem of the stereo adaptive filter is essentially unsolved, slight inserted distortion may still be needed, which reduces the merit of this method. Another drawback is that the method requires a P×P inverse matrix calculation at each sample. The inverse matrix operation can be relaxed by choosing a small P; however, a small P sometimes cannot attain a sufficient convergence speed improvement. To attain better performance even with a small P, the affine projection method is sometimes realized together with a sub-band method. The other method in the second approach is the "WARP" method. Unlike the affine projection method, which utilizes small changes in the cross-channel correlation, the WARP method utilizes large changes. This approach is based on the nature of typical conversations: even when stereo sound is used, most parts of a conversation are single-talk and effectively monaural. The cross-channel correlation is then usually very high and remains almost stable during a single-talk period; a large change happens when the talker changes or the talker's face moves.
Therefore, the method applies a monaural adaptive filter to single-source stereo sound and a multi-channel (stereo) adaptive filter to non-single-source stereo sound. An important feature of the method is that the two monaural adaptive filter estimation results and the one stereo adaptive filter estimation result can be transformed into each other using projection matrices called WARP matrices. Since a monaural adaptive filter is applied whenever the sound is single-source stereo, we do not suffer from the rank problem.

In this chapter, stereo acoustic echo canceller methods that require no modification of the speaker output sounds (multi-channel least mean square, affine projection, and WARP) are surveyed, targeting conversational DTV applications. The WARP method is then explained in detail.

2. Stereo acoustic echo canceller problem

2.1. Conversational DTV

Since a conversational DTV should maintain smooth speech communication even while receiving a regular TV program, it requires the following functionalities on top of a traditional DTV system, as shown in Fig. 1.

Figure 1.

Audio System Example in a Conversational DTV

  1. Mixing of broadcast sound and communication speech: the two stereo sounds from the DTV audio receiver and the local conversational speech decoder are mixed and sent to the stereo speaker system.

  2. Sampling frequency conversion: the sampling frequency of DTV sound is usually higher than that of the conversational service, e.g. $f_{SH}=48$ kHz for DTV sound versus $f_S=16$ kHz for conversational service sound, so sampling frequency conversion is needed between the DTV and conversational-service audio parts.

  3. Stereo acoustic echo canceller: a stereo acoustic echo canceller is required to prevent howling and speech quality degradation caused by acoustic coupling between the stereo speakers and the microphone.
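As a minimal sketch of the sampling frequency conversion step (assuming the 48 kHz and 16 kHz rates quoted above; the windowed-sinc anti-alias filter and its tap count are illustrative choices, not a specific product's converter), the 3:1 down-conversion can be written as:

```python
import numpy as np

def downsample_48k_to_16k(x, num_taps=151):
    """Anti-alias lowpass (windowed sinc, cutoff fs/6) then keep every 3rd sample."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(n / 3.0) / 3.0          # ideal lowpass at 8 kHz for fs = 48 kHz
    h *= np.hamming(num_taps)           # taper to reduce ripple
    y = np.convolve(x, h, mode="same")  # filter, keeping original length
    return y[::3]                       # 48 kHz -> 16 kHz

fs = 48000
t = np.arange(fs) / fs
x48 = np.sin(2 * np.pi * 440.0 * t)     # 1 s of a 440 Hz tone at 48 kHz
x16 = downsample_48k_to_16k(x48)
print(len(x48), len(x16))               # 48000 16000
```

Any polyphase or FIR-based converter with adequate stop-band attenuation would serve here; the point is only that the two audio parts must be brought to a common rate before mixing.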

Among these functionalities, the echo canceller for the conversational DTV is technically the most challenging, because it must cancel a wide variety of stereo echoes produced by TV programs as well as by stereo speech communication.

2.2. Stereo sound generation model

A stereo acoustic echo canceller system is shown in Fig. 2 together with a typical stereo sound generation model. All signals are assumed to be discrete-time signals at the $k$th sampling instant with sampling frequency $f_S$, and the sound generation model is assumed to be a linear finite impulse response (FIR) system which has a sound source signal $x_{Si}(k)$ as input and the stereo sounds $x_{Ri}(k)$ and $x_{Li}(k)$ as outputs, with additive uncorrelated noises $x_{URi}(k)$ and $x_{ULi}(k)$. Using matrix and array notations of the signals as

$$\begin{aligned}
\mathbf{X}_{Si}(k)&=[\mathbf{x}_{Si}(k),\mathbf{x}_{Si}(k-1),\cdots,\mathbf{x}_{Si}(k-P+1)]\\
\mathbf{x}_{Si}(k)&=[x_{Si}(k),x_{Si}(k-1),\cdots,x_{Si}(k-N+1)]^T\\
\mathbf{x}_{Ri}(k)&=[x_{Ri}(k),x_{Ri}(k-1),\cdots,x_{Ri}(k-N+1)]^T\\
\mathbf{x}_{Li}(k)&=[x_{Li}(k),x_{Li}(k-1),\cdots,x_{Li}(k-N+1)]^T\\
\mathbf{x}_{URi}(k)&=[x_{URi}(k),x_{URi}(k-1),\cdots,x_{URi}(k-N+1)]^T\\
\mathbf{x}_{ULi}(k)&=[x_{ULi}(k),x_{ULi}(k-1),\cdots,x_{ULi}(k-N+1)]^T
\end{aligned}\tag{1}$$

where $P$ and $N$ are the impulse response length of the FIR system and the tap length of the adaptive filter for each channel, respectively.

the FIR system output $\mathbf{x}_i(k)$ is a $2N$-sample array expressed as

$$\mathbf{x}_i(k)=\begin{bmatrix}\mathbf{x}_{Ri}(k)\\\mathbf{x}_{Li}(k)\end{bmatrix}=\begin{bmatrix}\mathbf{X}_{Si}(k)\mathbf{g}_{Ri}(k)+\mathbf{x}_{URi}(k)\\\mathbf{X}_{Si}(k)\mathbf{g}_{Li}(k)+\mathbf{x}_{ULi}(k)\end{bmatrix}\tag{2}$$

where $\mathbf{g}_{Ri}(k)$ and $\mathbf{g}_{Li}(k)$ are $P$-sample impulse responses of the FIR system, defined as

$$\begin{aligned}
\mathbf{g}_{Ri}(k)&=[g_{Ri,0}(k),g_{Ri,1}(k),\cdots,g_{Ri,\nu}(k),\cdots,g_{Ri,P-1}(k)]^T\\
\mathbf{g}_{Li}(k)&=[g_{Li,0}(k),g_{Li,1}(k),\cdots,g_{Li,\nu}(k),\cdots,g_{Li,P-1}(k)]^T
\end{aligned}\tag{3}$$

Figure 2.

Traditional Stereo Acoustic Echo Canceller System Configuration with Typical Stereo Sound Generation Model.

In (2), suppose $\mathbf{g}_{Ri}(k)$ and $\mathbf{g}_{Li}(k)$ are composed of constant arrays $\mathbf{g}_{Ri}$ and $\mathbf{g}_{Li}$ during the $i$th period, plus small time-variant arrays $\Delta\mathbf{g}_{Ri}(k)$ and $\Delta\mathbf{g}_{Li}(k)$, defined as

$$\begin{aligned}
\mathbf{g}_{Ri}&=[g_{Ri,0},g_{Ri,1},\cdots,g_{Ri,\nu},\cdots,g_{Ri,P-1}]^T\\
\mathbf{g}_{Li}&=[g_{Li,0},g_{Li,1},\cdots,g_{Li,\nu},\cdots,g_{Li,P-1}]^T\\
\Delta\mathbf{g}_{Ri}(k)&=[\Delta g_{Ri,0}(k),\Delta g_{Ri,1}(k),\cdots,\Delta g_{Ri,\nu}(k),\cdots,\Delta g_{Ri,P-1}(k)]^T\\
\Delta\mathbf{g}_{Li}(k)&=[\Delta g_{Li,0}(k),\Delta g_{Li,1}(k),\cdots,\Delta g_{Li,\nu}(k),\cdots,\Delta g_{Li,P-1}(k)]^T
\end{aligned}\tag{4}$$

(2) is re-written as

$$\begin{bmatrix}\mathbf{x}_{Ri}(k)\\\mathbf{x}_{Li}(k)\end{bmatrix}=\begin{bmatrix}\mathbf{X}_{Si}(k)\mathbf{g}_{Ri}+\mathbf{X}_{Si}(k)\Delta\mathbf{g}_{Ri}(k)+\mathbf{x}_{URi}(k)\\\mathbf{X}_{Si}(k)\mathbf{g}_{Li}+\mathbf{X}_{Si}(k)\Delta\mathbf{g}_{Li}(k)+\mathbf{x}_{ULi}(k)\end{bmatrix}\tag{5}$$

This situation is usual in the case of far-end single talking, because the transfer functions between the talker and the right and left microphones vary slightly due to the talker's small movements. By assuming these time-variant components are also un-correlated noise, (5) can be regarded as a linear time-invariant (LTI) system with independent noise components $\mathbf{x}'_{URi}(k)$ and $\mathbf{x}'_{ULi}(k)$, as

$$\begin{bmatrix}\mathbf{x}_{Ri}(k)\\\mathbf{x}_{Li}(k)\end{bmatrix}=\begin{bmatrix}\mathbf{X}_{Si}(k)\mathbf{g}_{Ri}+\mathbf{x}'_{URi}(k)\\\mathbf{X}_{Si}(k)\mathbf{g}_{Li}+\mathbf{x}'_{ULi}(k)\end{bmatrix}\tag{6}$$

where

$$\begin{aligned}
\mathbf{x}'_{URi}(k)&=\mathbf{X}_{Si}(k)\Delta\mathbf{g}_{Ri}(k)+\mathbf{x}_{URi}(k)\\
\mathbf{x}'_{ULi}(k)&=\mathbf{X}_{Si}(k)\Delta\mathbf{g}_{Li}(k)+\mathbf{x}_{ULi}(k)
\end{aligned}\tag{7}$$

In (6), if there are no un-correlated noises, we call the situation strict single talking.

In this chapter, the sound source signal $x_{Si}(k)$ and the uncorrelated noises $x_{URi}(k)$ and $x_{ULi}(k)$ are assumed to be independent white Gaussian noises with variances $\sigma_{xi}^2$ and $\sigma_{Ni}^2$, respectively.
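The generation model of (6) can be sketched numerically as follows; the direct-wave path gains and delays are hypothetical illustration values, chosen only to show that two channels driven by one source are almost perfectly cross-correlated once the relative delay is compensated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
x_S = rng.standard_normal(n)            # white Gaussian source, sigma_x = 1
sigma_N = 0.01                          # small uncorrelated noise level

# Direct-wave paths (hypothetical): right mic closer (gain 1.0, no delay),
# left mic farther (gain 0.7, 2-sample delay)
g_R = np.array([1.0])
g_L = np.array([0.0, 0.0, 0.7])

# Equation (6): each channel = source filtered by its path + independent noise
x_R = np.convolve(x_S, g_R)[:n] + sigma_N * rng.standard_normal(n)
x_L = np.convolve(x_S, g_L)[:n] + sigma_N * rng.standard_normal(n)

# Strong cross-channel correlation after compensating the 2-sample delay
rho = np.corrcoef(x_R[:-2], x_L[2:])[0, 1]
print(round(rho, 3))
```

With the noise removed entirely (strict single talking) the correlation becomes exactly 1, which is the condition under which the normal equation of Section 2.3 loses rank.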

2.3. Stereo acoustic echo canceller problem

For simplicity, only the stereo acoustic echo canceller for the right-side microphone output signal $\tilde{y}_i(k)$ is explained, since the echo canceller for the left microphone output is treated in exactly the same way. As shown in Fig. 2, the echo canceller cancels the acoustic echo $y_i(k)$ as

$$e_i(k)=y_i(k)-\hat{y}_i(k)+n_i(k)\tag{8}$$

where $e_i(k)$ is the acoustic echo canceller's residual error, $n_i(k)$ is independent background noise, and $\hat{y}_i(k)$ is the FIR adaptive filter output of the stereo echo canceller, given by

$$\hat{y}_i(k)=\hat{\mathbf{h}}_{Ri}^T(k)\mathbf{x}_{Ri}(k)+\hat{\mathbf{h}}_{Li}^T(k)\mathbf{x}_{Li}(k)\tag{9}$$

where $\hat{\mathbf{h}}_{Ri}(k)$ and $\hat{\mathbf{h}}_{Li}(k)$ are $N$-tap FIR adaptive filter coefficient arrays.

The error power of the echo canceller for the right-channel microphone output, $\sigma_{ei}^2(k)$, is given by

$$\sigma_{ei}^2(k)=\left(y_i(k)-\hat{\mathbf{h}}_{STi}^T(k)\mathbf{x}_i(k)+n_i(k)\right)^2\tag{10}$$

where $\hat{\mathbf{h}}_{STi}(k)$ is the stereo echo path model defined as

$$\hat{\mathbf{h}}_{STi}(k)=\left[\hat{\mathbf{h}}_{Ri}^T(k)\ \hat{\mathbf{h}}_{Li}^T(k)\right]^T\tag{11}$$

The optimum echo path estimate $\hat{\mathbf{h}}_{OPT}$, which minimizes the error power $\sigma_e^2(k)$, is given by solving the least-squares problem

$$\text{Minimize}\left[\sum_{k=0}^{N_{LS}-1}\sigma_{ei}^2(k)\right]\tag{12}$$

where $N_{LS}$ is the number of samples used for the optimization. The optimum echo path estimate for the $i$th LTI period, $\hat{\mathbf{h}}_{OPTi}$, is then easily obtained from the well-known normal equation as

$$\hat{\mathbf{h}}_{OPTi}=\mathbf{X}_{N_{LS}i}^{-1}\left(\sum_{k=0}^{N_{LS}-1}\tilde{y}_i(k)\mathbf{x}_i(k)\right)\tag{13}$$

where $\mathbf{X}_{N_{LS}i}$ is the auto-correlation matrix of the adaptive filter input signal, given by

$$\mathbf{X}_{N_{LS}i}=\sum_{k=0}^{N_{LS}-1}\mathbf{x}_i(k)\mathbf{x}_i^T(k)=\begin{bmatrix}\mathbf{A}_i&\mathbf{B}_i\\\mathbf{C}_i&\mathbf{D}_i\end{bmatrix}=\begin{bmatrix}\sum_{k=0}^{N_{LS}-1}\mathbf{x}_{Ri}(k)\mathbf{x}_{Ri}^T(k)&\sum_{k=0}^{N_{LS}-1}\mathbf{x}_{Ri}(k)\mathbf{x}_{Li}^T(k)\\\sum_{k=0}^{N_{LS}-1}\mathbf{x}_{Li}(k)\mathbf{x}_{Ri}^T(k)&\sum_{k=0}^{N_{LS}-1}\mathbf{x}_{Li}(k)\mathbf{x}_{Li}^T(k)\end{bmatrix}\tag{14}$$

By (14), the determinant of $\mathbf{X}_{N_{LS}i}$ is given by

$$|\mathbf{X}_{N_{LS}i}|=|\mathbf{A}_i|\,|\mathbf{D}_i-\mathbf{C}_i\mathbf{A}_i^{-1}\mathbf{B}_i|\tag{15}$$

In the case of the stereo generation model defined by (2), the sub-matrices in (14) are given by

$$\begin{aligned}
\mathbf{A}_i&=\sum_{k=0}^{N_{LS}-1}\left(\mathbf{X}_{Si}(k)\mathbf{G}_{RRi}\mathbf{X}_{Si}^T(k)+2\mathbf{x}_{URi}(k)\left(\mathbf{X}_{Si}(k)\mathbf{g}_{Ri}\right)^T+\mathbf{x}_{URi}(k)\mathbf{x}_{URi}^T(k)\right)\\
\mathbf{B}_i&=\sum_{k=0}^{N_{LS}-1}\left(\mathbf{X}_{Si}(k)\mathbf{G}_{RLi}\mathbf{X}_{Si}^T(k)+\left(\mathbf{X}_{Si}(k)\mathbf{g}_{Ri}\right)\mathbf{x}_{ULi}^T(k)+\mathbf{x}_{URi}(k)\left(\mathbf{X}_{Si}(k)\mathbf{g}_{Li}\right)^T+\mathbf{x}_{URi}(k)\mathbf{x}_{ULi}^T(k)\right)\\
\mathbf{C}_i&=\sum_{k=0}^{N_{LS}-1}\left(\mathbf{X}_{Si}(k)\mathbf{G}_{LRi}\mathbf{X}_{Si}^T(k)+\left(\mathbf{X}_{Si}(k)\mathbf{g}_{Li}\right)\mathbf{x}_{URi}^T(k)+\mathbf{x}_{ULi}(k)\left(\mathbf{X}_{Si}(k)\mathbf{g}_{Ri}\right)^T+\mathbf{x}_{ULi}(k)\mathbf{x}_{URi}^T(k)\right)\\
\mathbf{D}_i&=\sum_{k=0}^{N_{LS}-1}\left(\mathbf{X}_{Si}(k)\mathbf{G}_{LLi}\mathbf{X}_{Si}^T(k)+2\mathbf{x}_{ULi}(k)\left(\mathbf{X}_{Si}(k)\mathbf{g}_{Li}\right)^T+\mathbf{x}_{ULi}(k)\mathbf{x}_{ULi}^T(k)\right)
\end{aligned}\tag{16}$$

where

$$\mathbf{G}_{RRi}=\mathbf{g}_{Ri}\mathbf{g}_{Ri}^T,\quad\mathbf{G}_{RLi}=\mathbf{g}_{Ri}\mathbf{g}_{Li}^T,\quad\mathbf{G}_{LRi}=\mathbf{g}_{Li}\mathbf{g}_{Ri}^T,\quad\mathbf{G}_{LLi}=\mathbf{g}_{Li}\mathbf{g}_{Li}^T\tag{17}$$

In the case of strict single talking, where $\mathbf{x}_{URi}(k)$ and $\mathbf{x}_{ULi}(k)$ do not exist, (16) becomes very simple:

$$\begin{aligned}
\mathbf{A}_i&=\sum_{k=0}^{N_{LS}-1}\mathbf{X}_{Si}(k)\mathbf{G}_{RRi}\mathbf{X}_{Si}^T(k)&\mathbf{B}_i&=\sum_{k=0}^{N_{LS}-1}\mathbf{X}_{Si}(k)\mathbf{G}_{RLi}\mathbf{X}_{Si}^T(k)\\
\mathbf{C}_i&=\sum_{k=0}^{N_{LS}-1}\mathbf{X}_{Si}(k)\mathbf{G}_{LRi}\mathbf{X}_{Si}^T(k)&\mathbf{D}_i&=\sum_{k=0}^{N_{LS}-1}\mathbf{X}_{Si}(k)\mathbf{G}_{LLi}\mathbf{X}_{Si}^T(k)
\end{aligned}\tag{18}$$

To check the determinant $|\mathbf{X}_{N_{LS}i}|$, we calculate $|\mathbf{X}_{N_{LS}i}||\mathbf{C}_i|$, considering $\mathbf{B}_i=\mathbf{C}_i^T$, as

$$|\mathbf{X}_{N_{LS}i}||\mathbf{C}_i|=|\mathbf{A}_i|\left|\left(\mathbf{D}_i-\mathbf{C}_i\mathbf{A}_i^{-1}\mathbf{B}_i\right)\mathbf{C}_i\right|=|\mathbf{A}_i|\left|\mathbf{D}_i\mathbf{C}_i-\mathbf{C}_i\mathbf{A}_i^{-1}\mathbf{B}_i\mathbf{C}_i\right|\tag{19}$$

Then $|\mathbf{D}_i\mathbf{C}_i-\mathbf{C}_i\mathbf{A}_i^{-1}\mathbf{B}_i\mathbf{C}_i|$ becomes zero:

$$|\mathbf{D}_i\mathbf{C}_i-\mathbf{C}_i\mathbf{A}_i^{-1}\mathbf{B}_i\mathbf{C}_i|=\left|\sum_{k=0}^{N_{LS}-1}\mathbf{X}_{Si}(k)\left(\mathbf{G}_{LLi}-\mathbf{G}_{LRi}\mathbf{G}_{RRi}^{-1}\mathbf{G}_{RLi}\right)\mathbf{X}_{Si}^T(k)\;\sum_{k=0}^{N_{LS}-1}\mathbf{X}_{Si}(k)\mathbf{G}_{LRi}\mathbf{X}_{Si}^T(k)\right|=0\tag{20}$$

since $\mathbf{G}_{LLi}-\mathbf{G}_{LRi}\mathbf{G}_{RRi}^{-1}\mathbf{G}_{RLi}=\mathbf{g}_{Li}\mathbf{g}_{Li}^T-\mathbf{g}_{Li}\mathbf{g}_{Ri}^T\left(\mathbf{g}_{Ri}\mathbf{g}_{Ri}^T\right)^{-1}\mathbf{g}_{Ri}\mathbf{g}_{Li}^T=\mathbf{0}$.

Hence no unique solution can be found by solving the normal equation in the case of strict single talking, where un-correlated components do not exist. This is the well-known cross-channel correlation problem of stereo adaptive filters.
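This rank deficiency is easy to reproduce numerically. In the sketch below (with hypothetical short paths $\mathbf{g}_R$, $\mathbf{g}_L$ and strict single talking, i.e. no un-correlated noise), the $2N\times 2N$ normal-equation matrix of (14) is rank-deficient, so (13) has no unique solution:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 4, 2000                           # taps per channel, number of samples
x_S = rng.standard_normal(n)             # single white Gaussian source
g_R = np.array([1.0, 0.5])               # hypothetical source-to-mic paths
g_L = np.array([0.3, -0.4])
x_R = np.convolve(x_S, g_R)[:n]          # strict single talking: no noise terms
x_L = np.convolve(x_S, g_L)[:n]

def delay_lines(x, N):
    """Rows are [x(k), x(k-1), ..., x(k-N+1)] for successive k."""
    return np.stack([x[N - 1 - j:len(x) - j] for j in range(N)], axis=1)

X = np.hstack([delay_lines(x_R, N), delay_lines(x_L, N)])  # rows = x_i(k)^T
XtX = X.T @ X                            # the normal-equation matrix of (14)
print(np.linalg.matrix_rank(XtX), 2 * N) # rank is strictly less than 2N
```

All $2N$ regressor columns are linear combinations of a handful of shifts of the single source $x_S$, which is exactly why the determinant in (20) vanishes.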

3. Stereo acoustic echo canceller methods

To improve on the problems addressed above, many approaches have been proposed. One widely accepted approach is de-correlation of the stereo sound: to avoid the rank drop in the normal equation (13), a small distortion, such as non-linear processing or phase modification, is added to the stereo sound. This approach is simple and effective for promoting convergence of the multi-channel adaptive filter, but it may degrade the stereo sound through the added distortion. For entertainment applications such as a conversational DTV, this problem can be serious, because customers' requirements for sound quality are usually very high and even a small modification of the speaker output sound is unacceptable. From this viewpoint, approaches which require no modification or artifacts in the speaker output sound are desirable for entertainment use. In this section, the least squares (LS), stereo affine projection (AP), stereo normalized least mean square (NLMS) and WARP methods are reviewed as methods which do not change the stereo sound itself.

3.1. Gradient method

The gradient method is widely used for solving quadratic problems iteratively. As a generalized gradient method, denote the $M$-sample orthogonalized error array $\boldsymbol{\varepsilon}_{Mi}(k)$, based on the original error array $\mathbf{e}_{Mi}(k)$, as

$$\boldsymbol{\varepsilon}_{Mi}(k)=\mathbf{R}_i(k)\mathbf{e}_{Mi}(k)\tag{21}$$

where $\mathbf{e}_{Mi}(k)$ is an $M$-sample error array defined as

$$\mathbf{e}_{Mi}(k)=[e_i(k),e_i(k-1),\cdots,e_i(k-M+1)]^T\tag{22}$$

and $\mathbf{R}_i(k)$ is an $M\times M$ matrix which orthogonalizes the auto-correlation matrix $\mathbf{e}_{Mi}(k)\mathbf{e}_{Mi}^T(k)$. The orthogonalized error array is expressed using the difference between the adaptive filter coefficient array $\hat{\mathbf{h}}_{STi}(k)$ and the target stereo echo path $2N$-sample response $\mathbf{h}_{ST}$ as

$$\boldsymbol{\varepsilon}_{Mi}(k)=\mathbf{R}_i(k)\mathbf{X}_{M2Ni}^T(k)\left(\mathbf{h}_{ST}-\hat{\mathbf{h}}_{STi}(k)\right)\tag{23}$$

where $\mathbf{X}_{M2Ni}(k)$ is a $2N\times M$ matrix composed of the adaptive filter stereo input arrays, defined by

$$\mathbf{X}_{M2Ni}(k)=[\mathbf{x}_i(k),\mathbf{x}_i(k-1),\cdots,\mathbf{x}_i(k-M+1)]\tag{24}$$

By defining the echo path estimation error array

$$\mathbf{d}_{STi}(k)=\mathbf{h}_{ST}-\hat{\mathbf{h}}_{STi}(k)\tag{25}$$

the estimation error power $\sigma_{\varepsilon i}^2(k)$ is obtained as

$$\sigma_{\varepsilon i}^2(k)=\boldsymbol{\varepsilon}_{Mi}^T(k)\boldsymbol{\varepsilon}_{Mi}(k)=\mathbf{d}_{STi}^T(k)\mathbf{Q}_{2N2Ni}(k)\mathbf{d}_{STi}(k)\tag{26}$$

where

$$\mathbf{Q}_{2N2Ni}(k)=\mathbf{X}_{M2Ni}(k)\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{X}_{M2Ni}^T(k)\tag{27}$$

Then (26) is regarded as a quadratic function of $\hat{\mathbf{h}}_{STi}(k)$:

$$f\left(\hat{\mathbf{h}}_{STi}(k)\right)=\frac{1}{2}\hat{\mathbf{h}}_{STi}^T(k)\mathbf{Q}_{2N2Ni}(k)\hat{\mathbf{h}}_{STi}(k)-\hat{\mathbf{h}}_{STi}^T(k)\mathbf{Q}_{2N2Ni}(k)\mathbf{h}_{ST}\tag{28}$$

For this quadratic function, the gradient $\boldsymbol{\Delta}_i(k)$ is given by

$$\boldsymbol{\Delta}_i(k)=-\mathbf{Q}_{2N2Ni}(k)\mathbf{d}_{STi}(k)\tag{29}$$

The iteration of $\hat{\mathbf{h}}_{STi}(k)$ which minimizes $\sigma_{\varepsilon i}^2(k)$ is

$$\hat{\mathbf{h}}_{STi}(k+1)=\hat{\mathbf{h}}_{STi}(k)-\alpha\boldsymbol{\Delta}_i(k)=\hat{\mathbf{h}}_{STi}(k)+\alpha\mathbf{Q}_{2N2Ni}(k)\mathbf{d}_{STi}(k)=\hat{\mathbf{h}}_{STi}(k)+\alpha\mathbf{X}_{M2Ni}(k)\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{e}_{Mi}(k)\tag{30}$$

where $\alpha$ is a constant that determines the step size.

The above is a very generic expression of the gradient method, and the following approaches can be regarded as variations of this iteration.

3.2. Least Square (LS) method (M=2N)

From (30), the estimation error power between the estimated adaptive filter coefficients and the stereo echo path response, $\mathbf{d}_i^T(k)\mathbf{d}_i(k)$, is given by

$$\mathbf{d}_i^T(k+1)\mathbf{d}_i(k+1)=\mathbf{d}_i^T(k)\left(\mathbf{I}_{2N}-\alpha\mathbf{Q}_{2N2Ni}(k)\right)\left(\mathbf{I}_{2N}-\alpha\mathbf{Q}_{2N2Ni}(k)\right)^T\mathbf{d}_i(k)\tag{31}$$

where $\mathbf{I}_{2N}$ is the $2N\times 2N$ identity matrix. The fastest convergence is obtained by finding the $\mathbf{R}_i(k)$ which orthogonalizes $\mathbf{Q}_{2N2Ni}(k)$ and minimizes its eigenvalue spread.

If $M=2N$, $\mathbf{X}_{2N2Ni}(k)$ is a symmetric square matrix:

$$\mathbf{X}_{2N2Ni}(k)=\mathbf{X}_{2N2Ni}^T(k)\tag{32}$$

and if $\mathbf{X}_{2N2Ni}(k)\mathbf{X}_{2N2Ni}^T(k)\ \left(=\mathbf{X}_{2N2Ni}^T(k)\mathbf{X}_{2N2Ni}(k)\right)$ is a regular matrix, so that its inverse exists, the $\mathbf{R}_i^T(k)\mathbf{R}_i(k)$ which orthogonalizes $\mathbf{Q}_{2N2Ni}(k)$ is given by

$$\mathbf{R}_i^T(k)\mathbf{R}_i(k)=\left(\mathbf{X}_{2N2Ni}^T(k)\mathbf{X}_{2N2Ni}(k)\right)^{-1}\tag{33}$$

By substituting (33) into (30),

$$\hat{\mathbf{h}}_{STi}(k+1)=\hat{\mathbf{h}}_{STi}(k)+\alpha\mathbf{X}_{2N2Ni}(k)\left(\mathbf{X}_{2N2Ni}^T(k)\mathbf{X}_{2N2Ni}(k)\right)^{-1}\mathbf{e}_{2Ni}(k)\tag{34}$$

Assuming the initial tap coefficient array is the zero vector, $\alpha=0$ during the 0th to $(2N-1)$th samples, and $\alpha=1$ at the $2N$th sample, (34) can be re-written as

$$\hat{\mathbf{h}}_{STi}(2N)=\mathbf{X}_{2N2Ni}(2N-1)\left(\mathbf{X}_{2N2Ni}^T(2N-1)\mathbf{X}_{2N2Ni}(2N-1)\right)^{-1}\mathbf{y}_i(2N-1)\tag{35}$$

where $\mathbf{y}_i(k)$ is the $2N$-sample echo path output array, defined as

$$\mathbf{y}_i(k)=[y_i(k),y_i(k-1),\cdots,y_i(k-2N+1)]^T\tag{36}$$

This iteration is done only once, at the $(2N-1)$th sample. If $N_{LS}=2N$, the inverse matrix term in (35) is written as

$$\mathbf{X}_{2N2Ni}^T(k)\mathbf{X}_{2N2Ni}(k)=\sum_{k=0}^{N_{LS}-1}\mathbf{x}_i(k)\mathbf{x}_i^T(k)=\mathbf{X}_{N_{LS}i}\tag{37}$$

Comparing (13) and (35) with (37), it is found that the LS method is a special case of the gradient method with $M$ equal to $2N$.
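Conversely, when small independent noises are present in the two channels, the normal-equation matrix becomes regular and the block LS solution of (13) recovers the true stereo echo paths. A minimal sketch under that assumption (all paths are hypothetical illustration values):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 4, 20000
x_S = rng.standard_normal(n)                              # white Gaussian source
g_R, g_L = np.array([1.0, 0.5]), np.array([0.3, -0.4])    # hypothetical generation paths
# small independent noises break the exact cross-channel dependence
x_R = np.convolve(x_S, g_R)[:n] + 0.05 * rng.standard_normal(n)
x_L = np.convolve(x_S, g_L)[:n] + 0.05 * rng.standard_normal(n)

h_R = np.array([0.4, 0.2, -0.1, 0.05])    # true right echo path (hypothetical)
h_L = np.array([0.3, -0.2, 0.1, 0.0])     # true left echo path (hypothetical)
y = np.convolve(x_R, h_R)[:n] + np.convolve(x_L, h_L)[:n]

def delay_lines(x, N):
    return np.stack([x[N - 1 - j:len(x) - j] for j in range(N)], axis=1)

X = np.hstack([delay_lines(x_R, N), delay_lines(x_L, N)])  # rows = x_i(k)^T
d = y[N - 1:]                                              # align echo with regressors
h_hat = np.linalg.solve(X.T @ X, X.T @ d)                  # normal equation (13)
err = np.max(np.abs(h_hat - np.concatenate([h_R, h_L])))
print(err)
```

The smaller the un-correlated components, the worse the conditioning of $\mathbf{X}_{N_{LS}i}$, which is the quantitative face of the rank problem discussed in Section 2.3.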

3.3. Stereo Affine Projection (AP) method (M=PN)

The stereo affine projection method can be regarded as the case where $M$ is chosen as the FIR response length $P$ of the LTI system. This approach is very effective in reducing the $2N\times 2N$ inverse matrix operations of the LS method to $P\times P$ operations when the stereo generation model is assumed to consist of LTI system outputs driven by a single WGN source with independent right and left channel noises, as shown in Fig. 2. For the sake of explanation, we define the stereo sound signal matrix $\mathbf{X}_{P2Ni}(k)$, composed of the right and left signal matrices $\mathbf{X}_{Ri}(k)$ and $\mathbf{X}_{Li}(k)$ for $P$ samples, as

$$\mathbf{X}_{P2Ni}(k)=\left[\mathbf{X}_{Ri}^T(k)\ \mathbf{X}_{Li}^T(k)\right]^T=\begin{bmatrix}\mathbf{X}_{2Si}(k)\mathbf{G}_{Ri}^T+\mathbf{X}_{URi}(k)\\\mathbf{X}_{2Si}(k)\mathbf{G}_{Li}^T+\mathbf{X}_{ULi}(k)\end{bmatrix}\tag{38}$$

where

$$\mathbf{X}_{2Si}(k)=[\mathbf{x}_{Si}(k),\mathbf{x}_{Si}(k-1),\cdots,\mathbf{x}_{Si}(k-2P+2)]\tag{39}$$

where $\mathbf{X}_{URi}(k)$ and $\mathbf{X}_{ULi}(k)$ are un-correlated signal matrices, defined as

$$\begin{aligned}
\mathbf{X}_{URi}(k)&=[\mathbf{x}_{URi}(k),\mathbf{x}_{URi}(k-1),\cdots,\mathbf{x}_{URi}(k-P+1)]\\
\mathbf{X}_{ULi}(k)&=[\mathbf{x}_{ULi}(k),\mathbf{x}_{ULi}(k-1),\cdots,\mathbf{x}_{ULi}(k-P+1)]
\end{aligned}\tag{40}$$

and $\mathbf{G}_{Ri}$ and $\mathbf{G}_{Li}$ are source-to-microphone response $P\times(2P-1)$ matrices, defined as

$$\mathbf{G}_{Ri}=\begin{bmatrix}\mathbf{g}_{2R,0,i}^T\\\mathbf{g}_{2R,1,i}^T\\\vdots\\\mathbf{g}_{2R,P-1,i}^T\end{bmatrix}=\begin{bmatrix}\mathbf{g}_{Ri}^T&0&\cdots&0\\0&\mathbf{g}_{Ri}^T&\cdots&0\\\vdots&&\ddots&\vdots\\0&\cdots&0&\mathbf{g}_{Ri}^T\end{bmatrix},\qquad\mathbf{G}_{Li}=\begin{bmatrix}\mathbf{g}_{2L,0,i}^T\\\mathbf{g}_{2L,1,i}^T\\\vdots\\\mathbf{g}_{2L,P-1,i}^T\end{bmatrix}=\begin{bmatrix}\mathbf{g}_{Li}^T&0&\cdots&0\\0&\mathbf{g}_{Li}^T&\cdots&0\\\vdots&&\ddots&\vdots\\0&\cdots&0&\mathbf{g}_{Li}^T\end{bmatrix}\tag{41}$$

As explained with (31), $\mathbf{Q}_{2N2Ni}(k)$ determines the convergence speed of the gradient method. In this section, we derive the affine projection method by minimizing the eigenvalue spread in $\mathbf{Q}_{2N2Ni}(k)$. Firstly, the auto-correlation matrix is expressed by sub-matrices for each stereo channel as

$$\mathbf{Q}_{2N2Ni}(k)=\begin{bmatrix}\mathbf{Q}_{ANNi}(k)&\mathbf{Q}_{BNNi}(k)\\\mathbf{Q}_{CNNi}(k)&\mathbf{Q}_{DNNi}(k)\end{bmatrix}\tag{42}$$

where $\mathbf{Q}_{ANNi}(k)$ and $\mathbf{Q}_{DNNi}(k)$ are the right and left channel auto-correlation matrices, and $\mathbf{Q}_{BNNi}(k)$ and $\mathbf{Q}_{CNNi}(k)$ are the cross-channel correlation matrices. These sub-matrices are given by

$$\begin{aligned}
\mathbf{Q}_{ANNi}(k)&=\mathbf{X}_{2Si}(k)\mathbf{G}_{Ri}^T\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{G}_{Ri}\mathbf{X}_{2Si}^T(k)+\mathbf{X}_{URi}(k)\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{X}_{URi}^T(k)+2\mathbf{X}_{2Si}(k)\mathbf{G}_{Ri}^T\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{X}_{URi}^T(k)\\
\mathbf{Q}_{BNNi}(k)&=\mathbf{X}_{2Si}(k)\mathbf{G}_{Ri}^T\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{G}_{Li}\mathbf{X}_{2Si}^T(k)+\mathbf{X}_{2Si}(k)\mathbf{G}_{Ri}^T\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{X}_{ULi}^T(k)+\mathbf{X}_{URi}(k)\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{G}_{Li}\mathbf{X}_{2Si}^T(k)+\mathbf{X}_{URi}(k)\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{X}_{ULi}^T(k)\\
\mathbf{Q}_{CNNi}(k)&=\mathbf{X}_{2Si}(k)\mathbf{G}_{Li}^T\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{G}_{Ri}\mathbf{X}_{2Si}^T(k)+\mathbf{X}_{2Si}(k)\mathbf{G}_{Li}^T\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{X}_{URi}^T(k)+\mathbf{X}_{ULi}(k)\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{G}_{Ri}\mathbf{X}_{2Si}^T(k)+\mathbf{X}_{ULi}(k)\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{X}_{URi}^T(k)\\
\mathbf{Q}_{DNNi}(k)&=\mathbf{X}_{2Si}(k)\mathbf{G}_{Li}^T\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{G}_{Li}\mathbf{X}_{2Si}^T(k)+\mathbf{X}_{ULi}(k)\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{X}_{ULi}^T(k)+2\mathbf{X}_{2Si}(k)\mathbf{G}_{Li}^T\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{X}_{ULi}^T(k)
\end{aligned}\tag{43}$$

Since the iteration process in (30) is an averaging process, the auto-correlation matrix $\mathbf{Q}_{2N2Ni}(k)$ can be approximated by its expectation $\tilde{\mathbf{Q}}_{2N2Ni}(k)=E\left[\mathbf{Q}_{2N2Ni}(k)\right]$. The expectations of the sub-matrices in (42) are then simplified, applying the statistical independence between the sound source signal and the noises and the $Tlz$ function defined in the Appendix, as

$$\begin{aligned}
\tilde{\mathbf{Q}}_{ANNi}&=Tlz\left(\tilde{\mathbf{X}}_{2Si}(k)\mathbf{G}_{Ri}^T\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{G}_{Ri}\tilde{\mathbf{X}}_{2Si}^T(k)\right)+Tlz\left(\tilde{\mathbf{X}}_{URi}(k)\mathbf{R}_i^T(k)\mathbf{R}_i(k)\tilde{\mathbf{X}}_{URi}^T(k)\right)\\
\tilde{\mathbf{Q}}_{BNNi}&=Tlz\left(\tilde{\mathbf{X}}_{2Si}(k)\mathbf{G}_{Ri}^T\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{G}_{Li}\tilde{\mathbf{X}}_{2Si}^T(k)\right)\\
\tilde{\mathbf{Q}}_{CNNi}&=Tlz\left(\tilde{\mathbf{X}}_{2Si}(k)\mathbf{G}_{Li}^T\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{G}_{Ri}\tilde{\mathbf{X}}_{2Si}^T(k)\right)\\
\tilde{\mathbf{Q}}_{DNNi}&=Tlz\left(\tilde{\mathbf{X}}_{2Si}(k)\mathbf{G}_{Li}^T\mathbf{R}_i^T(k)\mathbf{R}_i(k)\mathbf{G}_{Li}\tilde{\mathbf{X}}_{2Si}^T(k)\right)+Tlz\left(\tilde{\mathbf{X}}_{ULi}(k)\mathbf{R}_i^T(k)\mathbf{R}_i(k)\tilde{\mathbf{X}}_{ULi}^T(k)\right)
\end{aligned}\tag{44}$$

where

$$\begin{aligned}
\tilde{\mathbf{X}}_{2Si}(k)&=[\tilde{\mathbf{x}}_{2Si}(k),\tilde{\mathbf{x}}_{2Si}(k-1),\cdots,\tilde{\mathbf{x}}_{2Si}(k-P+1)]^T\\
\tilde{\mathbf{X}}_{URi}(k)&=[\tilde{\mathbf{x}}_{URi}(k),\tilde{\mathbf{x}}_{URi}(k-1),\cdots,\tilde{\mathbf{x}}_{URi}(k-P+1)]\\
\tilde{\mathbf{X}}_{ULi}(k)&=[\tilde{\mathbf{x}}_{ULi}(k),\tilde{\mathbf{x}}_{ULi}(k-1),\cdots,\tilde{\mathbf{x}}_{ULi}(k-P+1)]
\end{aligned}\tag{45}$$

with

$$\begin{aligned}
\tilde{\mathbf{x}}_{2Si}(k)&=[x_{Si}(k),x_{Si}(k-1),\cdots,x_{Si}(k-2P+2)]^T\\
\tilde{\mathbf{x}}_{URi}(k)&=[x_{URi}(k),x_{URi}(k-1),\cdots,x_{URi}(k-P+1)]^T\\
\tilde{\mathbf{x}}_{ULi}(k)&=[x_{ULi}(k),x_{ULi}(k-1),\cdots,x_{ULi}(k-P+1)]^T
\end{aligned}\tag{46}$$

Applying matrix operations to $\tilde{\mathbf{Q}}_{2N2Ni}$, a new matrix $\mathbf{Q}'_{2N2Ni}$ with the same determinant as $\tilde{\mathbf{Q}}_{2N2Ni}$ is given by

$$\mathbf{Q}'_{2N2Ni}(k)=\begin{bmatrix}\mathbf{Q}'_{ANNi}(k)&\mathbf{0}\\\mathbf{0}&\mathbf{Q}'_{DNNi}(k)\end{bmatrix}\tag{47}$$

where

$$\mathbf{Q}'_{ANNi}=Tlz\left(\mathbf{Q}_{ANNi}\right),\qquad\mathbf{Q}'_{DNNi}=Tlz\left(\mathbf{Q}_{DNNi}\right)\tag{48}$$

Since both $\tilde{\mathbf{X}}_{2Si}(k)\mathbf{G}_{Ri}^T$ and $\tilde{\mathbf{X}}_{2Si}(k)\mathbf{G}_{Li}^T$ are symmetric $P\times P$ square matrices, $\mathbf{Q}'_{ANNi}$ and $\mathbf{Q}'_{DNNi}$ are re-written as

$$\begin{aligned}
\mathbf{Q}'_{ANNi}&=\tilde{\mathbf{X}}_{2Si}(k)\left(\mathbf{G}_{Ri}^T\mathbf{G}_{Ri}+\mathbf{G}_{Li}^T\mathbf{G}_{Li}\right)\tilde{\mathbf{X}}_{2Si}^T(k)\mathbf{R}_i^T(k)\mathbf{R}_i(k)+\tilde{\mathbf{X}}_{URi}(k)\tilde{\mathbf{X}}_{URi}^T(k)\mathbf{R}_i^T(k)\mathbf{R}_i(k)\\
&=\left(N\sigma_{Xi}^2\left(\mathbf{G}_{Ri}\mathbf{G}_{Ri}^T+\mathbf{G}_{Li}\mathbf{G}_{Li}^T\right)+N\sigma_{Ni}^2\mathbf{I}_P\right)\mathbf{R}_i^T(k)\mathbf{R}_i(k)\\
\mathbf{Q}'_{DNNi}&=N\sigma_{Ni}^2\mathbf{I}_P\,\mathbf{R}_i^T(k)\mathbf{R}_i(k)
\end{aligned}\tag{49}$$

As is evident from (47), (48) and (49), $\mathbf{Q}'_{2N2Ni}(k)$ is composed of the major matrix $\mathbf{Q}'_{ANNi}(k)$ and the noise matrix $\mathbf{Q}'_{DNNi}(k)$. In the case of single talking, where the sound source signal power $\sigma_{Xi}^2$ is much larger than the un-correlated signal power $\sigma_{Ni}^2$, the $\mathbf{R}_i^T(k)\mathbf{R}_i(k)$ which minimizes the eigenvalue spread in $\mathbf{Q}'_{2N2Ni}(k)$, and thus attains the fastest convergence, is obtained by making $\mathbf{Q}'_{ANNi}$ an identity matrix, i.e. by setting

$$\mathbf{R}_i^T(k)\mathbf{R}_i(k)\approx\left(N\sigma_{Xi}^2\left(\mathbf{G}_{Ri}\mathbf{G}_{Ri}^T+\mathbf{G}_{Li}\mathbf{G}_{Li}^T\right)\right)^{-1}\tag{50}$$

In other cases, such as double talking or silence, where we may assume $\sigma_{Xi}^2$ is almost zero, the $\mathbf{R}_i^T(k)\mathbf{R}_i(k)$ which orthogonalizes $\mathbf{Q}'_{ANNi}$ is given by

$$\mathbf{R}_i^T(k)\mathbf{R}_i(k)\approx\left(N\sigma_{Ni}^2\mathbf{I}_P\right)^{-1}\tag{51}$$

Summarizing the above discussion, the fastest convergence is attained by setting

$$\mathbf{R}_i^T(k)\mathbf{R}_i(k)=\left(\mathbf{X}_{P2Ni}^T(k)\mathbf{X}_{P2Ni}(k)\right)^{-1}\tag{52}$$

Since

$$\begin{aligned}
\mathbf{X}_{P2Ni}^T(k)\mathbf{X}_{P2Ni}(k)&=\left[\mathbf{G}_{Ri}\mathbf{X}_{2Si}^T(k)+\mathbf{X}_{URi}^T(k)\quad\mathbf{G}_{Li}\mathbf{X}_{2Si}^T(k)+\mathbf{X}_{ULi}^T(k)\right]\begin{bmatrix}\mathbf{X}_{2Si}(k)\mathbf{G}_{Ri}^T+\mathbf{X}_{URi}(k)\\\mathbf{X}_{2Si}(k)\mathbf{G}_{Li}^T+\mathbf{X}_{ULi}(k)\end{bmatrix}\\
&\approx\mathbf{G}_{Ri}\mathbf{X}_{2Si}^T(k)\mathbf{X}_{2Si}(k)\mathbf{G}_{Ri}^T+\mathbf{G}_{Li}\mathbf{X}_{2Si}^T(k)\mathbf{X}_{2Si}(k)\mathbf{G}_{Li}^T+\mathbf{X}_{URi}^T(k)\mathbf{X}_{URi}(k)+\mathbf{X}_{ULi}^T(k)\mathbf{X}_{ULi}(k)\\
&\approx N\sigma_{Xi}^2\left(\mathbf{G}_{Ri}\mathbf{G}_{Ri}^T+\mathbf{G}_{Li}\mathbf{G}_{Li}^T\right)+2N\sigma_{Ni}^2\mathbf{I}_P
\end{aligned}\tag{53}$$

By substituting (52) into (30), we obtain the following affine projection iteration:

$$\hat{\mathbf{h}}_{STi}(k+1)=\hat{\mathbf{h}}_{STi}(k)+\alpha\mathbf{X}_{P2Ni}(k)\left(\mathbf{X}_{P2Ni}^T(k)\mathbf{X}_{P2Ni}(k)\right)^{-1}\mathbf{e}_{Pi}(k)\tag{54}$$

In an actual implementation, $\alpha$ is combined with the forgetting-factor matrix $\boldsymbol{\mu}$, and $\delta\mathbf{I}$ is added to the matrix being inverted to avoid division by zero, as shown below:

$$\hat{\mathbf{h}}_{ST}(k+1)=\hat{\mathbf{h}}_{ST}(k)+\alpha\mathbf{X}_{P2Ni}(k)\left[\mathbf{X}_{P2Ni}^T(k)\mathbf{X}_{P2Ni}(k)+\delta\mathbf{I}\right]^{-1}\boldsymbol{\mu}\,\mathbf{e}_{Pi}(k)\tag{55}$$

where $\delta\ (\ll 1)$ is a very small positive value and

$$\boldsymbol{\mu}=\mathrm{diag}\left[1,(1-\mu),\cdots,(1-\mu)^{P-1}\right]\tag{56}$$

The method can be understood intuitively using the geometrical explanation in Fig. 3. In the traditional NLMS approach, a new update direction is created from the estimated coefficients on the $(i-1)$th plane by finding the nearest point on the $i$th plane. The affine projection method, on the other hand, creates the best direction, targeting a location included in both the $(i-1)$th and $i$th planes.

Figure 3.

Very Simple Example for Affine Method
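A minimal time-domain sketch of the stereo affine projection iteration (55) is given below. The echo paths, generation-model filters, projection order and step size are hypothetical illustration values; the point is that the regularized $P\times P$ solve replaces the $2N\times 2N$ inversion of the LS method:

```python
import numpy as np

def stereo_ap(x_R, x_L, y, N, P=4, alpha=0.5, delta=1e-4):
    """Sketch of (55): h <- h + alpha * X_P (X_P^T X_P + delta I)^-1 e_P."""
    pad = N + P
    xr = np.concatenate([np.zeros(pad), x_R])
    xl = np.concatenate([np.zeros(pad), x_L])
    yv = np.concatenate([np.zeros(P), y])
    h = np.zeros(2 * N)
    err = np.empty(len(y))

    def reg(k):                       # stacked stereo regressor x_i(k), eq. (2)
        i = k + pad
        return np.concatenate([xr[i - N + 1:i + 1][::-1],
                               xl[i - N + 1:i + 1][::-1]])

    for k in range(len(y)):
        X = np.stack([reg(k - p) for p in range(P)], axis=1)    # 2N x P
        e = yv[P + k - np.arange(P)] - X.T @ h                  # P a-priori errors
        h += alpha * X @ np.linalg.solve(X.T @ X + delta * np.eye(P), e)
        err[k] = e[0]
    return h, err

# Single-source stereo with small independent noises (model of Fig. 2)
rng = np.random.default_rng(3)
n, N = 8000, 6
x_S = rng.standard_normal(n)
x_R = np.convolve(x_S, [1.0, 0.4])[:n] + 0.05 * rng.standard_normal(n)
x_L = np.convolve(x_S, [0.6, -0.3])[:n] + 0.05 * rng.standard_normal(n)
h_R = np.array([0.5, 0.2, -0.1, 0.05, 0.0, 0.02])   # hypothetical echo paths
h_L = np.array([0.3, -0.2, 0.1, 0.0, 0.04, 0.0])
y = np.convolve(x_R, h_R)[:n] + np.convolve(x_L, h_L)[:n]

h_hat, err = stereo_ap(x_R, x_L, y, N)
print(np.mean(err[:500] ** 2), np.mean(err[-500:] ** 2))  # echo power drops
```

With $P=1$ the same routine degenerates to the stereo NLMS update of Section 3.4.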

3.4. Stereo Normalized Least Mean Square (NLMS) method (M=1)

The stereo NLMS method is the $M=1$ case of the gradient method.

For $M=1$, equation (54) is re-written as

$$\hat{\mathbf{h}}_{STi}(k+1)=\hat{\mathbf{h}}_{STi}(k)+\alpha\mathbf{x}_i(k)\left(\mathbf{x}_{Ri}^T(k)\mathbf{x}_{Ri}(k)+\mathbf{x}_{Li}^T(k)\mathbf{x}_{Li}(k)\right)^{-1}e_i(k)\tag{57}$$

It is well known that the convergence speed of (57) depends on the smallest and largest eigenvalues of the matrix $\mathbf{Q}_{2N2Ni}$. In the case of the stereo generation model in Fig. 2 during single talking with small right and left noises, the determinant of $\mathbf{Q}_{2N2Ni}$ for $M=1$ is

$$|\mathbf{Q}_{2N2Ni}(k)|=\left|\mathbf{x}_i(k)\left(\mathbf{x}_i^T(k)\mathbf{x}_i(k)\right)^{-1}\mathbf{x}_i^T(k)\right|\approx\left(\mathbf{g}_{Ri}^T\mathbf{g}_{Ri}+\mathbf{g}_{Li}^T\mathbf{g}_{Li}\right)^{-1}\left|\mathbf{g}_{Ri}\mathbf{g}_{Ri}^T+\mathbf{g}_{Li}\mathbf{g}_{Li}^T\right|\,\left|\sigma_N^2\mathbf{I}_N\right|\tag{58}$$

If the eigenvalues of $\mathbf{g}_{Ri}\mathbf{g}_{Ri}^T+\mathbf{g}_{Li}\mathbf{g}_{Li}^T$ are ordered as

$$\lambda_{\min i}^2\leq\cdots\leq\lambda_{\max i}^2\tag{59}$$

where $\lambda_{\min i}^2$ and $\lambda_{\max i}^2$ are the smallest and largest eigenvalues, respectively,

then $|\mathbf{Q}_{2N2Ni}(k)|$ is given, assuming the un-correlated noise power $\sigma_{Ni}^2$ is very small ($\sigma_{Ni}^2\ll\lambda_{\min i}^2$), as

$$|\mathbf{Q}_{2N2Ni}(k)|\approx\left(\mathbf{g}_{Ri}^T\mathbf{g}_{Ri}+\mathbf{g}_{Li}^T\mathbf{g}_{Li}\right)^{-1}\sigma_{Ni}^2\cdots\sigma_{Ni}^2\,\lambda_{\min i}^2\,\lambda_{\max i}^2\tag{60}$$

Hence the convergence speed of the stereo NLMS echo canceller is largely determined by the ratio between the largest eigenvalue of $\mathbf{g}_{Ri}\mathbf{g}_{Ri}^T+\mathbf{g}_{Li}\mathbf{g}_{Li}^T$ and the un-correlated signal power $\sigma_{Ni}^2$. If the un-correlated sound power is very small during single talking, the stereo NLMS echo canceller's convergence becomes very slow.
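The stereo NLMS update (57) and its behaviour under strict single talking can be sketched as follows. With no un-correlated component, the residual echo converges toward zero while the estimated coefficient pair generally does not converge to the true pair, which is exactly the misalignment problem described above (all filter values are hypothetical):

```python
import numpy as np

def stereo_nlms(x_R, x_L, y, N, alpha=0.5, eps=1e-8):
    """Sketch of the stereo NLMS update (57)."""
    xr = np.concatenate([np.zeros(N), x_R])
    xl = np.concatenate([np.zeros(N), x_L])
    h = np.zeros(2 * N)
    err = np.empty(len(y))
    for k in range(len(y)):
        u = np.concatenate([xr[k + 1:k + N + 1][::-1],   # [x_R(k)..x_R(k-N+1)]
                            xl[k + 1:k + N + 1][::-1]])  # [x_L(k)..x_L(k-N+1)]
        e = y[k] - u @ h                                 # a-priori error
        h += alpha * e * u / (u @ u + eps)               # normalized update
        err[k] = e
    return h, err

rng = np.random.default_rng(4)
n, N = 8000, 6
x_S = rng.standard_normal(n)
x_R = np.convolve(x_S, [1.0, 0.4])[:n]       # strict single talking: no noise
x_L = np.convolve(x_S, [0.6, -0.3])[:n]
h_R = np.array([0.5, 0.2, -0.1, 0.05, 0.0, 0.02])   # hypothetical echo paths
h_L = np.array([0.3, -0.2, 0.1, 0.0, 0.04, 0.0])
y = np.convolve(x_R, h_R)[:n] + np.convolve(x_L, h_L)[:n]

h_hat, err = stereo_nlms(x_R, x_L, y, N)
mis = np.linalg.norm(h_hat - np.concatenate([h_R, h_L]))
print(np.mean(err[-500:] ** 2), mis)   # echo power small; misalignment typically not
```

The filter finds one of the infinitely many solutions of the rank-deficient normal equation: the echo is cancelled for this source position, but the solution breaks down as soon as the cross-channel correlation changes.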

3.5. Double adaptive filters for Rapid Projection (WARP) method

The name WARP comes from the fact that this algorithm projects the optimum solution between a monaural space and a stereo space. Since the algorithm dynamically switches the adaptive filter between monaural and stereo types by observing the sound source characteristics, it does not suffer from the rank-drop problem caused by strong cross-channel correlation in stereo sound. The algorithm was originally developed for the acoustic echo canceller in a pseudo-stereo system, which creates an artificial stereo effect by adding delay and/or loss to a monaural sound. It has since been extended to real stereo sound by introducing the residual signal after removing the cross-channel correlation.

In this section, it is shown that the WARP method can be derived as an extension of the affine projection method presented in 3.3.

By introducing the error matrix $\mathbf{E}_i(k)$, defined by

$$\mathbf{E}_i(k)=\left[\mathbf{e}_{Pi}(k)\ \mathbf{e}_{Pi}(k-1)\ \cdots\ \mathbf{e}_{Pi}(k-P+1)\right]\tag{61}$$

the iteration of the stereo affine projection method in (54) is re-written as

$$\hat{\mathbf{H}}_{STi}(k+1)=\hat{\mathbf{H}}_{STi}(k)+\alpha\mathbf{X}_{P2Ni}(k)\left(\mathbf{X}_{P2Ni}^T(k)\mathbf{X}_{P2Ni}(k)\right)^{-1}\mathbf{E}_i(k)\tag{62}$$

where

$$\hat{\mathbf{H}}_{STi}(k)=\left[\hat{\mathbf{h}}_{STi}(k)\ \hat{\mathbf{h}}_{STi}(k-1)\ \cdots\ \hat{\mathbf{h}}_{STi}(k-P+1)\right]\tag{63}$$

In the case of strict single talking, the following assumption is possible in the $i$th LTI period by (53):

$$\mathbf{X}_{P2Ni}^T(k)\mathbf{X}_{P2Ni}(k)\approx\mathbf{G}_{RRLLi}\tag{64}$$

where $\mathbf{G}_{RRLLi}$ is a $P\times P$ symmetric matrix:

$$\mathbf{G}_{RRLLi}=N\sigma_{Xi}^2\left(\mathbf{G}_{Ri}\mathbf{G}_{Ri}^T+\mathbf{G}_{Li}\mathbf{G}_{Li}^T\right)\tag{65}$$

Assuming $\mathbf{G}_{RRLLi}$ is a regular matrix, (62) can be re-written as

$$\hat{\mathbf{H}}_{STi}(k+1)\mathbf{G}_{RRLLi}=\hat{\mathbf{H}}_{STi}(k)\mathbf{G}_{RRLLi}+\alpha\mathbf{X}_{P2Ni}(k)\mathbf{E}_i(k)\tag{66}$$

Re-defining the echo path estimation matrix $\hat{\mathbf{H}}_{STi}(k)$ by a new matrix $\hat{\mathbf{H}}'_{STi}(k)$, defined by

$$\hat{\mathbf{H}}'_{STi}(k)=\hat{\mathbf{H}}_{STi}(k)\mathbf{G}_{RRLLi}\tag{67}$$

(66) is re-written as

$$\hat{\mathbf{H}}'_{STi}(k+1)=\hat{\mathbf{H}}'_{STi}(k)+\alpha\mathbf{X}_{P2Ni}(k)\mathbf{E}_i(k)\tag{68}$$

The iteration is then expressed using the signal matrix $\mathbf{X}_{2Si}(k)$ as

$$\hat{\mathbf{H}}'_{STi}(k+1)=\hat{\mathbf{H}}'_{STi}(k)+\alpha\begin{bmatrix}\mathbf{X}_{2Si}(k)\mathbf{G}_{Ri}^T+\mathbf{X}_{URi}(k)\\\mathbf{X}_{2Si}(k)\mathbf{G}_{Li}^T+\mathbf{X}_{ULi}(k)\end{bmatrix}\mathbf{E}_i(k)\tag{69}$$

In the case of strict single talking, where no un-correlated signals exist, and if $\mathbf{G}_{Li}$ can be assumed to be the output of an LTI system $\mathbf{G}_{RLi}$, a $P\times P$ symmetric regular matrix, with input $\mathbf{G}_{Ri}$, then (69) becomes

$$\begin{aligned}
\begin{bmatrix}\hat{\mathbf{H}}'_{STRi}(k+1)\\\hat{\mathbf{H}}'_{STLi}(k+1)\end{bmatrix}&=\begin{bmatrix}\hat{\mathbf{H}}'_{STRi}(k)\\\hat{\mathbf{H}}'_{STLi}(k)\end{bmatrix}+\alpha\begin{bmatrix}\mathbf{X}_{2Si}(k)\mathbf{G}_{Ri}^T\mathbf{E}_i(k)\\\mathbf{X}_{2Si}(k)\mathbf{G}_{Ri}^T\mathbf{G}_{RLi}\mathbf{E}_i(k)\end{bmatrix}\\
\begin{bmatrix}\hat{\mathbf{H}}'_{STRi}(k+1)\\\hat{\mathbf{H}}'_{STLi}(k+1)\mathbf{G}_{RLi}^{-1}\end{bmatrix}&=\begin{bmatrix}\hat{\mathbf{H}}'_{STRi}(k)\\\hat{\mathbf{H}}'_{STLi}(k)\mathbf{G}_{RLi}^{-1}\end{bmatrix}+\alpha\begin{bmatrix}\mathbf{X}_{2Si}(k)\mathbf{G}_{Ri}^T\mathbf{E}_i(k)\\\mathbf{X}_{2Si}(k)\mathbf{G}_{Ri}^T\mathbf{E}_i(k)\end{bmatrix}
\end{aligned}\tag{70}$$

It is evident that the rank of the equation in (70) is $N$, not $2N$; therefore, the equation reduces to a monaural one by combining the first row with the second row multiplied by $\mathbf{G}_{RLi}^{-1}$:

$$\hat{\mathbf{H}}'_{MONRLi}(k+1)=\hat{\mathbf{H}}'_{MONRLi}(k)+2\alpha\mathbf{X}_{Ri}(k)\mathbf{E}_i(k)\tag{71}$$

where

$$\hat{\mathbf{H}}'_{MONRLi}(k)=\hat{\mathbf{H}}'_{STRi}(k)+\hat{\mathbf{H}}'_{STLi}(k)\mathbf{G}_{RLi}^{-1}\tag{72}$$

or, assuming $\mathbf{G}_{Ri}=\mathbf{G}_{Li}\mathbf{G}_{LRi}$,

$$\hat{\mathbf{H}}'_{MONLRi}(k+1)=\hat{\mathbf{H}}'_{MONLRi}(k)+2\alpha\mathbf{X}_{Li}(k)\mathbf{E}_i(k)\tag{73}$$

where

$$\hat{\mathbf{H}}'_{MONLRi}(k)=\hat{\mathbf{H}}'_{STLi}(k)+\hat{\mathbf{H}}'_{STRi}(k)\mathbf{G}_{LRi}^{-1}\tag{74}$$

Selection of the iteration depends on the existence of the inverse matrix $\mathbf{G}_{RLi}^{-1}$ or $\mathbf{G}_{LRi}^{-1}$; the details are explained in the next section.

By substituting (67) into (72) and (74), we obtain the following equations:

$$\hat{\mathbf{H}}'_{MONRLi}(k)=\hat{\mathbf{H}}_{STRi}(k)\mathbf{G}_{RRLLi}+\hat{\mathbf{H}}_{STLi}(k)\mathbf{G}_{RRLLi}\mathbf{G}_{RLi}^{-1}\tag{75}$$

or

$$\hat{\mathbf{H}}'_{MONLRi}(k)=\hat{\mathbf{H}}_{STRi}(k)\mathbf{G}_{RRLLi}\mathbf{G}_{LRi}^{-1}+\hat{\mathbf{H}}_{STLi}(k)\mathbf{G}_{RRLLi}\tag{76}$$

From the stereo echo path estimation viewpoint, we can obtain $\hat{\mathbf{H}}'_{MONRLi}(k)$ or $\hat{\mathbf{H}}'_{MONLRi}(k)$; however, we cannot identify the right and left echo paths from the monaural result alone. To cope with this problem, we use two LTI periods to separate the right and left estimation results, as

$$\begin{bmatrix}\hat{\mathbf{H}}'^{T}_{MONRLi}\\\hat{\mathbf{H}}'^{T}_{MONRLi-1}\end{bmatrix}=\begin{bmatrix}\mathbf{G}_{RRLLi}^T&\mathbf{G}_{RRLLi}\mathbf{G}_{RLi}^{-1}\\\mathbf{G}_{RRLLi-1}^T&\mathbf{G}_{RRLLi-1}\mathbf{G}_{RLi-1}^{-1}\end{bmatrix}\begin{bmatrix}\hat{\mathbf{H}}_{STRi}^T\\\hat{\mathbf{H}}_{STLi}^T\end{bmatrix}\quad(\mathbf{G}_{RLi}\text{ and }\mathbf{G}_{RLi-1}\text{ regular})\tag{77}$$

with the three analogous forms when $\mathbf{G}_{LRi}$ and/or $\mathbf{G}_{LRi-1}$ are the regular matrices instead of $\mathbf{G}_{RLi}$ and/or $\mathbf{G}_{RLi-1}$.

where $\hat{\mathbf{H}}'_{MONRLi}$ and $\hat{\mathbf{H}}'_{MONRLi-1}$ are the monaural echo canceller estimation results at the end of each LTI period, and $\hat{\mathbf{H}}_{STRi}$ and $\hat{\mathbf{H}}_{STLi}$ are the right and left estimated stereo echo paths based on the $(i-1)$th and $i$th LTI periods' estimation results.

Equation (77) is written simply as

$$\hat{\mathbf{H}}'_{MONi,i-1}=\mathbf{W}_i^{-1}\hat{\mathbf{H}}_{STi}\tag{78}$$

where $\hat{\mathbf{H}}'_{MONi,i-1}$ is the estimation result matrix for the $(i-1)$th and $i$th LTI periods:

$$\hat{\mathbf{H}}'_{MONi,i-1}=\begin{bmatrix}\hat{\mathbf{H}}'^{T}_{MONRLi}\\\hat{\mathbf{H}}'^{T}_{MONRLi-1}\end{bmatrix}\tag{79}$$

$\hat{\mathbf{H}}_{STi}$ is the stereo echo path estimation result:

$$\hat{\mathbf{H}}_{STi}=\begin{bmatrix}\hat{\mathbf{H}}_{STRi}^T\\\hat{\mathbf{H}}_{STLi}^T\end{bmatrix}\tag{80}$$

and $\mathbf{W}_i^{-1}$ is the matrix which projects the stereo estimation results onto the two monaural estimation results, defined by

$$\mathbf{W}_i^{-1}=\begin{cases}
\begin{bmatrix}\mathbf{G}_{RRLLi}^T&\mathbf{G}_{RRLLi}\mathbf{G}_{RLi}^{-1}\\\mathbf{G}_{RRLLi-1}^T&\mathbf{G}_{RRLLi-1}\mathbf{G}_{RLi-1}^{-1}\end{bmatrix}&\mathbf{G}_{RLi}\text{ and }\mathbf{G}_{RLi-1}\text{ are regular matrices}\\[6pt]
\begin{bmatrix}\mathbf{G}_{RRLLi}^T\mathbf{G}_{LRi}^{-1}&\mathbf{G}_{RRLLi}\\\mathbf{G}_{RRLLi-1}^T\mathbf{G}_{LRi-1}^{-1}&\mathbf{G}_{RRLLi-1}\end{bmatrix}&\mathbf{G}_{LRi}\text{ and }\mathbf{G}_{LRi-1}\text{ are regular matrices}\\[6pt]
\begin{bmatrix}\mathbf{G}_{RRLLi}^T&\mathbf{G}_{RRLLi}\mathbf{G}_{RLi}^{-1}\\\mathbf{G}_{RRLLi-1}^T\mathbf{G}_{LRi-1}^{-1}&\mathbf{G}_{RRLLi-1}\end{bmatrix}&\mathbf{G}_{RLi}\text{ and }\mathbf{G}_{LRi-1}\text{ are regular matrices}\\[6pt]
\begin{bmatrix}\mathbf{G}_{RRLLi}^T\mathbf{G}_{LRi}^{-1}&\mathbf{G}_{RRLLi}\\\mathbf{G}_{RRLLi-1}^T&\mathbf{G}_{RRLLi-1}\mathbf{G}_{RLi-1}^{-1}\end{bmatrix}&\mathbf{G}_{LRi}\text{ and }\mathbf{G}_{RLi-1}\text{ are regular matrices}
\end{cases}\tag{81}$$

By exchanging the two sides of (78), we obtain the right and left stereo echo path estimates from the two monaural echo path estimation results:

$$\hat{\mathbf{H}}_{STi}=\mathbf{W}_i\hat{\mathbf{H}}'_{MONi,i-1}\tag{82}$$

Since $\mathbf{W}_i^{-1}$ and $\mathbf{W}_i$ project the optimum solutions in two monaural spaces to the corresponding optimum solution in the stereo space, and vice versa, we call these matrices WARP functions. The procedure is depicted in Fig. 4. As shown there, the WARP system can be regarded as an acoustic echo canceller which transforms the stereo signal into a correlated component and an un-correlated component and applies a monaural acoustic echo canceller to the correlated signal. To re-construct the stereo signal, a cross-channel correlation recovery matrix is inserted on the echo path side. Therefore, a WARP operation is needed only at each LTI system change.

Figure 4.

Basic Principle for WARP method

In an actual application such as speech communication, the auto-correlation characteristics $\mathbf{G}_{RRLLi}$ vary frequently, following changes in the speech characteristics; the cross-channel characteristics $\mathbf{G}_{RLi}$ or $\mathbf{G}_{LRi}$, on the other hand, change mainly at far-end talker changes. So, in the following discussions, we apply the NLMS method as the simplest affine projection ($P=1$).

The mechanism can also be understood intuitively using the simple vector planes depicted in Fig. 5.

Figure 5.

Very Simple Example for WARP method

As shown there, using the two optimum solutions in the monaural spaces (in this case, on the lines), the optimum solution located in the two-dimensional (stereo) space is calculated directly.
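Anticipating the z-domain realization of Section 4, the core of the WARP projection (recovering the stereo pair from two monaural estimates obtained in two LTI periods) can be sketched per frequency bin as a 2×2 linear solve; the source-to-microphone responses below are hypothetical direct-wave examples:

```python
import numpy as np

L = 64
w = np.exp(-2j * np.pi * np.arange(L) / L)       # z^-1 evaluated on the FFT grid

h_R = np.fft.fft(np.array([0.5, 0.2, -0.1]), L)  # true stereo echo paths (hypothetical)
h_L = np.fft.fft(np.array([0.3, -0.2, 0.1]), L)

# Source-to-mic responses in two adjacent LTI periods (talker moved between them)
g_SR1, g_SL1 = 1.0 * w**0, 0.7 * w**2            # period i-1: right mic closer
g_SR2, g_SL2 = 0.9 * w**1, 0.8 * w**0            # period i:  left mic closer

# Monaural quasi-echo path observed in each period, cf. eq. (91)
h_mono1 = g_SR1 * h_R + g_SL1 * h_L
h_mono2 = g_SR2 * h_R + g_SL2 * h_L

# WARP projection: per-bin solve of the 2x2 system, cf. eqs. (94)-(95)
det = g_SR1 * g_SL2 - g_SL1 * g_SR2
h_R_hat = (g_SL2 * h_mono1 - g_SL1 * h_mono2) / det
h_L_hat = (g_SR1 * h_mono2 - g_SR2 * h_mono1) / det

print(np.max(np.abs(h_R_hat - h_R)), np.max(np.abs(h_L_hat - h_L)))  # both tiny
```

The solve is only well posed when the two periods have genuinely different cross-channel characteristics (non-vanishing determinant), which is why WARP waits for a talker change before projecting.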

4. Realization of WARP

4.1. Simplification by assuming direct-wave stereo sound

Both the stereo affine projection and WARP methods require $P\times P$ inverse matrix operations, which raises computational load and stability concerns. Even though the WARP operation is required only when the LTI system changes (such as at a far-end talker change), and is therefore much cheaper than the inverse matrix operations of affine projection, which must be calculated for every sample, simplification of the WARP operation is still important. This is possible by assuming that the target stereo sound is composed only of the direct-wave sound from a single talker, as shown in Fig. 6.

Figure 6.

Stereo Sound Generation System for Single Talking

In Fig. 6, a single sound source signal at an angular frequency $\omega$ in the $i$th LTI period, $x_{Si}(\omega)$, becomes a stereo sound composed of right and left signals $x_{Ri}(\omega)$ and $x_{Li}(\omega)$ through right and left LTI systems $g_{SRi}(\omega)$ and $g_{SLi}(\omega)$, with additive un-correlated noises $x_{URi}(\omega)$ and $x_{ULi}(\omega)$, as

$$x_{Ri}(\omega)=g_{SRi}(\omega)x_{Si}(\omega)+x_{URi}(\omega),\qquad x_{Li}(\omega)=g_{SLi}(\omega)x_{Si}(\omega)+x_{ULi}(\omega)\tag{83}$$

In the case of simple direct-wave systems, (83) can be re-written as

$$x_{Ri}(\omega)=l_{Ri}e^{-j\omega\tau_{Ri}}x_{Si}(\omega)+x_{URi}(\omega),\qquad x_{Li}(\omega)=l_{Li}e^{-j\omega\tau_{Li}}x_{Si}(\omega)+x_{ULi}(\omega)\tag{84}$$

where $l_{Ri}$ and $l_{Li}$ are the attenuations of the transfer functions, and $\tau_{Ri}$ and $\tau_{Li}$ are analog delay values.

Since the right and left sounds are sampled at $f_S\ (=\omega_S/2\pi)$ Hz and treated as digital signals, we use z-domain notation instead of the $\omega$-domain:

$$z=\exp\left[2\pi\omega j/\omega_S\right]\tag{85}$$

In the z-domain, the system in Fig. 4 is expressed as shown in Fig. 7.

Figure 7.

WARP Method using Z-Function

As shown in Fig. 7, the stereo sound generation model for $\mathbf{x}_i(z)$ is expressed as

$$\mathbf{x}_i(z)=\begin{bmatrix}x_{Ri}(z)\\x_{Li}(z)\end{bmatrix}=\begin{bmatrix}g_{SRi}(z)x_{Si}(z)+x_{URi}(z)\\g_{SLi}(z)x_{Si}(z)+x_{ULi}(z)\end{bmatrix}\tag{86}$$

where $x_{Ri}(z)$, $x_{Li}(z)$, $g_{SRi}(z)$, $g_{SLi}(z)$, $x_{Si}(z)$, $x_{URi}(z)$ and $x_{ULi}(z)$ are the z-domain expressions of the band-limited sampled signals corresponding to $x_{Ri}(\omega)$, $x_{Li}(\omega)$, $g_{SRi}(\omega)$, $g_{SLi}(\omega)$, $x_{Si}(\omega)$, $x_{URi}(\omega)$ and $x_{ULi}(\omega)$, respectively. The adaptive filter output $\hat{y}_i(z)$ and the microphone output $y_i(z)$ at the end of the $i$th LTI period are defined as

$$\hat{y}_i(z)=\hat{\mathbf{h}}_i^T(z)\mathbf{x}_i(z),\qquad y_i(z)=\mathbf{h}^T(z)\mathbf{x}_i(z)+n_i(z)\tag{87}$$

where $n_i(z)$ is room noise, and $\hat{\mathbf{h}}_i(z)$ and $\mathbf{h}(z)$ are the stereo adaptive filter and the stereo echo path characteristics at the end of the $i$th LTI period, respectively, defined as

$$\hat{\mathbf{H}}_{STi}(z)=\begin{bmatrix}\hat{h}_{Ri}(z)\\\hat{h}_{Li}(z)\end{bmatrix},\qquad\mathbf{H}_{ST}(z)=\begin{bmatrix}h_R(z)\\h_L(z)\end{bmatrix}\tag{88}$$

The cancellation error, neglecting near-end noise, is then given by

$$e_i(z)=y_i(z)-\hat{\mathbf{H}}_{STi}^T(z)\mathbf{x}_i(z)\tag{89}$$

In the case of single talking, we can assume both $x_{URi}(z)$ and $x_{ULi}(z)$ are almost zero, and (89) can be re-written as

$$e_i(z)=y_i(z)-\left(g_{SRi}(z)\hat{h}_{Ri}(z)+g_{SLi}(z)\hat{h}_{Li}(z)\right)x_{Si}(z)\tag{90}$$

Since the acoustic echo can also be assumed to be driven by the single sound source $x_{Si}(z)$, we can define a monaural echo path $h_{Monoi}(z)$ as

$$h_{Monoi}(z)=g_{SRi}(z)h_R(z)+g_{SLi}(z)h_L(z)\tag{91}$$

Then (90) is re-written as

$$e_i(z)=\left(h_{Monoi}(z)-\left(g_{SRi}(z)\hat{h}_{Ri}(z)+g_{SLi}(z)\hat{h}_{Li}(z)\right)\right)x_{Si}(z)\tag{92}$$

This equation implies that we can adopt a monaural adaptive filter using a new monaural quasi-echo path $\hat{h}_{Monoi}(z)$:

$$\hat{h}_{Monoi}(z)=g_{SRi}(z)\hat{h}_{Ri}(z)+g_{SLi}(z)\hat{h}_{Li}(z)\tag{93}$$

However, it is also evident that if the LTI system changes, both the echo and quasi-echo paths must be updated to match the new LTI system. This is the same problem faced by the stereo echo canceller with pure single-talk stereo sound input. If we can assume the acoustic echo paths are time-invariant over two adjacent LTI periods, the problem is easily solved by satisfying the required rank for solving the equation:

$$\begin{bmatrix}\hat{h}_{Monoi}(z)\\\hat{h}_{Monoi-1}(z)\end{bmatrix}=\mathbf{W}_i^{-1}(z)\begin{bmatrix}\hat{h}_{Ri}(z)\\\hat{h}_{Li}(z)\end{bmatrix}\tag{94}$$

where

$$\mathbf{W}_i^{-1}(z)=\begin{bmatrix}g_{SRi}(z)&g_{SLi}(z)\\g_{SRi-1}(z)&g_{SLi-1}(z)\end{bmatrix}\tag{95}$$

In other words, using the two echo path estimation results for the corresponding two LTI periods, we can project the monaural-domain quasi-echo path to the stereo-domain quasi-echo path, and vice versa, using the WARP operations

$$\hat{H}_{STi}(z)=W_i(z)\hat{H}_{Mono\,i}(z),\qquad \hat{H}_{Mono\,i}(z)=W_i^{-1}(z)\hat{H}_{STi}(z)\tag{96}$$

where

$$\hat{H}_{Mono\,i}(z)=\begin{bmatrix}\hat{h}_{Mono\,i}(z)\\ \hat{h}_{Mono\,i-1}(z)\end{bmatrix},\qquad \hat{H}_{STi}(z)=\begin{bmatrix}\hat{h}_{Ri}(z)\\ \hat{h}_{Li}(z)\end{bmatrix}\tag{97}$$
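As a numerical sanity check of the projections (94)–(97), one can work in the DFT domain, where each frequency bin gives an independent 2×2 system: knowing the source-to-microphone responses of the two LTI periods and the two monaural quasi-echo paths, the stereo paths are recovered by solving (94) bin by bin. A hedged sketch (all spectra are random illustrative data, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 256  # number of frequency bins (illustrative)

def spec():
    """Random complex spectrum standing in for an unknown transfer function."""
    return rng.standard_normal(K) + 1j * rng.standard_normal(K)

h_R, h_L = spec(), spec()           # true stereo echo paths (assumed time invariant)
g = {0: (spec(), spec()),           # (g_SR, g_SL) for LTI period i-1
     1: (spec(), spec())}           # (g_SR, g_SL) for LTI period i

# Monaural quasi-echo paths for the two periods, per (93)
h_mono = {i: g[i][0] * h_R + g[i][1] * h_L for i in (0, 1)}

# WARP projection: solve the 2x2 system (94)-(95) in every frequency bin
h_R_hat = np.empty(K, dtype=complex)
h_L_hat = np.empty(K, dtype=complex)
for k in range(K):
    W_inv = np.array([[g[1][0][k], g[1][1][k]],     # row for period i,   per (95)
                      [g[0][0][k], g[0][1][k]]])    # row for period i-1
    rhs = np.array([h_mono[1][k], h_mono[0][k]])
    h_R_hat[k], h_L_hat[k] = np.linalg.solve(W_inv, rhs)

ok = np.allclose(h_R_hat, h_R) and np.allclose(h_L_hat, h_L)
```

The solve fails only when the 2×2 matrix is singular in some bin, i.e. when the two periods' cross-channel characteristics coincide — the rank-drop situation the chapter discusses.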

In an actual implementation it is impossible to obtain the true $W_i(z)$, which is composed of unknown transfer functions between the sound source and the right and left microphones, so one of the stereo sounds is used as a single-talk sound source instead. Usually the higher-level sound is chosen as the pseudo-sound source, because the higher-level sound is usually closer to one of the microphones. The approximated WARP function $\tilde{W}_i(z)$ is then defined as

$$\tilde{W}_i(z)=\begin{cases}\begin{bmatrix}1&g_{RLi}(z)\\ 1&g_{RLi-1}(z)\end{bmatrix}&\text{RR transition}\\[6pt]\begin{bmatrix}1&g_{RLi}(z)\\ g_{LRi-1}(z)&1\end{bmatrix}&\text{RL transition}\\[6pt]\begin{bmatrix}g_{LRi}(z)&1\\ 1&g_{RLi-1}(z)\end{bmatrix}&\text{LR transition}\\[6pt]\begin{bmatrix}g_{LRi}(z)&1\\ g_{LRi-1}(z)&1\end{bmatrix}&\text{LL transition}\end{cases}\tag{98}$$

where $g_{RLi}(z)$ and $g_{LRi}(z)$ are the cross-channel transfer functions between the right and left stereo sounds, defined as

$$g_{RLi}(z)=g_{SLi}(z)/g_{SRi}(z),\qquad g_{LRi}(z)=g_{SRi}(z)/g_{SLi}(z)\tag{99}$$

The RR, RL, LR and LL transitions in (98) denote single-talker location changes. If the talker's location change stays within the right-microphone side (the right microphone remains the closest), we call it an RR transition; if it stays within the left-microphone side, an LL transition. If the location changes from the right-microphone side to the left-microphone side, we call it an RL transition, and the opposite change an LR transition. Let us assume an ideal direct-wave single-talk case. Then the $\omega$-domain transfer functions $g_{RLi}(\omega)$ and $g_{LRi}(\omega)$ are expressed in the z-domain as

$$g_{RLi}(z)=l_{RLi}\,\varphi(\delta_{RLi},z)\,z^{-d_{RLi}},\qquad g_{LRi}(z)=l_{LRi}\,\varphi(\delta_{LRi},z)\,z^{-d_{LRi}}\tag{100}$$

where $\delta_{RLi}$ and $\delta_{LRi}$ are fractional delays and $d_{RLi}$ and $d_{LRi}$ are integer delays realizing the analog delays $\tau_{RLi}$ and $\tau_{LRi}$ of the direct wave; these parameters are defined as

$$d_{RLi}=\mathrm{INT}[\tau_{RLi}f_S],\quad d_{LRi}=\mathrm{INT}[\tau_{LRi}f_S],\quad \delta_{RLi}=\mathrm{Mod}[\tau_{RLi}f_S],\quad \delta_{LRi}=\mathrm{Mod}[\tau_{LRi}f_S]\tag{101}$$

$\varphi(\delta,z)$ is a “Sinc interpolation” function that interpolates a value at a timing between two adjacent samples and is given by

$$\varphi(\delta,z)=\sum_{\nu=-\infty}^{\infty}\frac{\sin(\pi(\nu-\delta))}{\pi(\nu-\delta)}\,z^{-\nu}\tag{102}$$
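A truncated, windowed version of the interpolator (102) — in the spirit of the finite-range form introduced later in (116) — can be realized as an FIR fractional-delay filter. A small sketch (tap count, window and test signal are illustrative choices, not from the text):

```python
import numpy as np

def frac_delay_fir(delta, n_taps=32):
    """Truncated sinc-interpolation filter for a fractional delay `delta`
    (0 <= delta < 1); the integer part of the latency is n_taps//2 samples."""
    n = np.arange(n_taps)
    center = n_taps // 2
    h = np.sinc(n - center - delta)          # sin(pi x)/(pi x), per (102)
    h *= np.hamming(n_taps)                  # window to tame truncation error
    return h

# Delay a sinusoid by 0.3 samples and compare with the analytic answer
fs, f0, delta = 8000.0, 440.0, 0.3
t = np.arange(512) / fs
x = np.sin(2 * np.pi * f0 * t)
h = frac_delay_fir(delta)
y = np.convolve(x, h)[len(h)//2 : len(h)//2 + len(x)]    # remove integer latency
ref = np.sin(2 * np.pi * f0 * (t - delta / fs))
err = np.max(np.abs(y[64:-64] - ref[64:-64]))            # ignore edge transients
```

The residual error comes purely from truncating and windowing the infinite sum — exactly the approximation error that the “Quasi-Sinc interpolation” discussion below quantifies.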

4.2. Digital filter realization of WARP functions

Since the LL and LR transitions are symmetrical to the RR and RL transitions respectively, only the RR and RL transition cases are explained in the following discussion. By solving (96) with the WARP function in (98), we obtain the right and left stereo echo path estimation functions as

$$\begin{aligned}\hat{h}_{Ri}(z)&=\frac{\hat{h}_{Mono\,i}(z)-\hat{h}_{Mono\,i-1}(z)}{g_{RLi-1}(z)-g_{RLi}(z)}\\ \hat{h}_{Li}(z)&=\frac{g_{RLi-1}(z)\hat{h}_{Mono\,i}(z)-g_{RLi}(z)\hat{h}_{Mono\,i-1}(z)}{g_{RLi-1}(z)-g_{RLi}(z)}\end{aligned}\qquad\text{(RR transition)}\tag{103}$$

or

$$\begin{aligned}\hat{h}_{Ri}(z)&=\frac{\hat{h}_{Mono\,i}(z)-g_{RLi-1}(z)\hat{h}_{Mono\,i-1}(z)}{1-g_{LRi}(z)g_{RLi-1}(z)}\\ \hat{h}_{Li}(z)&=\frac{\hat{h}_{Mono\,i-1}(z)-g_{LRi}(z)\hat{h}_{Mono\,i}(z)}{1-g_{LRi}(z)g_{RLi-1}(z)}\end{aligned}\qquad\text{(RL transition)}\tag{104}$$

By substituting (100) into (103) and (104), we obtain

$$\begin{aligned}\hat{h}_{Ri}(z)&=\frac{\hat{h}_{Mono\,i}(z)-\hat{h}_{Mono\,i-1}(z)}{l_{RLi-1}\varphi(\delta_{RLi-1},z)z^{-d_{RLi-1}}-l_{RLi}\varphi(\delta_{RLi},z)z^{-d_{RLi}}}\\ \hat{h}_{Li}(z)&=\frac{l_{RLi-1}\varphi(\delta_{RLi-1},z)z^{-d_{RLi-1}}\hat{h}_{Mono\,i}(z)-l_{RLi}\varphi(\delta_{RLi},z)z^{-d_{RLi}}\hat{h}_{Mono\,i-1}(z)}{l_{RLi-1}\varphi(\delta_{RLi-1},z)z^{-d_{RLi-1}}-l_{RLi}\varphi(\delta_{RLi},z)z^{-d_{RLi}}}\end{aligned}\quad\text{(RR transition)}\tag{105}$$

and

$$\begin{aligned}\hat{h}_{Ri}(z)&=\frac{\hat{h}_{Mono\,i}(z)-l_{RLi-1}\varphi(\delta_{RLi-1},z)z^{-d_{RLi-1}}\hat{h}_{Mono\,i-1}(z)}{1-l_{LRi}\varphi(\delta_{LRi},z)\,l_{RLi-1}\varphi(\delta_{RLi-1},z)\,z^{-(d_{RLi-1}+d_{LRi})}}\\ \hat{h}_{Li}(z)&=\frac{\hat{h}_{Mono\,i-1}(z)-l_{LRi}\varphi(\delta_{LRi},z)z^{-d_{LRi}}\hat{h}_{Mono\,i}(z)}{1-l_{LRi}\varphi(\delta_{LRi},z)\,l_{RLi-1}\varphi(\delta_{RLi-1},z)\,z^{-(d_{RLi-1}+d_{LRi})}}\end{aligned}\quad\text{(RL transition)}\tag{106}$$

Since $\varphi(\delta,z)$ is an interpolation function for a delay $\delta$, the delay can be compensated by $\varphi(-\delta,z)$, as

$$\varphi(\delta,z)\,\varphi(-\delta,z)=1\tag{107}$$

Using (107), (105) is rewritten as

$$\begin{aligned}\hat{h}_{Ri}(z)&=\frac{\left(\hat{h}_{Mono\,i}(z)-\hat{h}_{Mono\,i-1}(z)\right)l_{RLi-1}^{-1}\varphi(-\delta_{RLi-1},z)z^{d_{RLi-1}}}{1-\left(l_{RLi}l_{RLi-1}^{-1}\right)\varphi(-\delta_{RLi-1},z)\varphi(\delta_{RLi},z)z^{-(d_{RLi}-d_{RLi-1})}}\\ \hat{h}_{Li}(z)&=\frac{\hat{h}_{Mono\,i}(z)-l_{RLi}l_{RLi-1}^{-1}\varphi(\delta_{RLi},z)\varphi(-\delta_{RLi-1},z)z^{-d_{RLi}+d_{RLi-1}}\hat{h}_{Mono\,i-1}(z)}{1-\left(l_{RLi}l_{RLi-1}^{-1}\right)\varphi(-\delta_{RLi-1},z)\varphi(\delta_{RLi},z)z^{-(d_{RLi}-d_{RLi-1})}}\end{aligned}\quad\text{(RR transition)}\tag{108}$$

These functions can be realized as digital filters applied to the echo path estimation results, as shown in Fig. 8.

Figure 8.

Digital Filter Realization for WARP Functions

4.3. Causality and stability of WARP functions

Stability conditions are obtained by checking the denominators of (108) and (106); their feedback terms $D_{RRi}(z)$ and $D_{RLi}(z)$ must satisfy

$$|D_{RRi}(z)|<1\quad\text{(RR transition)},\qquad |D_{RLi}(z)|<1\quad\text{(RL transition)}\tag{109}$$

where

$$\begin{aligned}D_{RRi}(z)&=l_{RLi}l_{RLi-1}^{-1}\varphi(-\delta_{RLi-1},z)\varphi(\delta_{RLi},z)z^{-(d_{RLi}-d_{RLi-1})}&&\text{(RR transition)}\\ D_{RLi}(z)&=l_{LRi}l_{RLi-1}\varphi(\delta_{LRi},z)\varphi(\delta_{RLi-1},z)z^{-(d_{RLi-1}+d_{LRi})}&&\text{(RL transition)}\end{aligned}\tag{110}$$

From (109), using

$$\begin{aligned}|\varphi(-\delta_{RLi-1},z)\varphi(\delta_{RLi},z)|&\le|\varphi(\delta_{RLi-1},z)||\varphi(\delta_{RLi},z)|&&\text{(RR transition)}\\ |\varphi(\delta_{LRi},z)\varphi(\delta_{RLi-1},z)|&\le|\varphi(\delta_{LRi},z)||\varphi(\delta_{RLi-1},z)|&&\text{(RL transition)}\end{aligned}\tag{111}$$

and the bound obtained by numerical calculation,

$$|\varphi(\delta,z)|\le 1.2\tag{112}$$

substituting (112) into (109) yields

$$l_{RLi}\,l_{RLi-1}^{-1}\le 1/1.44\quad\text{(RR transition)},\qquad l_{LRi}\,l_{RLi-1}\le 1/1.44\quad\text{(RL transition)}\tag{113}$$

Secondly, the conditions for causality are obtained by checking the delays of the feedback components of the denominators $D_{RRi}(z)$ and $D_{RLi}(z)$. Since the convolution of two “Sinc interpolation” functions is also a “Sinc interpolation” function,

$$\varphi(\delta_A,z)\,\varphi(\delta_B,z)=\varphi(\delta_A+\delta_B,z)\tag{114}$$

equation (110) can be rewritten as

$$\begin{aligned}D_{RRi}(z)&=l_{RLi}l_{RLi-1}^{-1}\,\varphi(\delta_{RLi}-\delta_{RLi-1},z)\,z^{-(d_{RLi}-d_{RLi-1})}&&\text{(RR transition)}\\ D_{RLi}(z)&=l_{LRi}l_{RLi-1}\,\varphi(\delta_{LRi}+\delta_{RLi-1},z)\,z^{-(d_{RLi-1}+d_{LRi})}&&\text{(RL transition)}\end{aligned}\tag{115}$$

The “Sinc interpolation” function is an infinite sum over both positive and negative delays; therefore it is essentially impossible to guarantee causality. However, by permitting some error, we can find conditions that maintain causality approximately. To do so, we use a “Quasi-Sinc interpolation” function defined as

$$\tilde{\varphi}(\delta,z)=\sum_{\nu=-N_F+1}^{N_F}\frac{\sin(\pi(\nu-\delta))}{\pi(\nu-\delta)}\,z^{-\nu}\tag{116}$$

where $2N_F$ is the finite impulse response range of the “Quasi-Sinc interpolation” $\tilde{\varphi}(\delta,z)$. The error power of this approximation is given by

$$\oint\left(\varphi(\delta,z)-\tilde{\varphi}(\delta,z)\right)\left(\varphi(\delta,z)-\tilde{\varphi}(\delta,z)\right)^{*}\frac{dz}{z}=\sum_{\nu=-\infty}^{-N_F}\frac{\sin^2(\pi(\nu-\delta))}{(\pi(\nu-\delta))^2}+\sum_{\nu=N_F+1}^{\infty}\frac{\sin^2(\pi(\nu-\delta))}{(\pi(\nu-\delta))^2}\tag{117}$$

Equation (116) is rewritten as

$$\tilde{\varphi}(\delta,z)=\sum_{\nu=0}^{2N_F-1}\frac{\sin(\pi(\nu-N_F+1-\delta))}{\pi(\nu-N_F+1-\delta)}\,z^{-(\nu-N_F+1)}\tag{118}$$

By substituting (118) into (115),

$$\begin{aligned}D_{RRi}(z)&\approx l_{RLi}l_{RLi-1}^{-1}\,\tilde{\varphi}(\delta_{RLi}-\delta_{RLi-1},z)\,z^{-(d_{RLi}-d_{RLi-1}-N_F+1)}&&\text{(RR transition)}\\ D_{RLi}(z)&\approx l_{LRi}l_{RLi-1}\,\tilde{\varphi}(\delta_{LRi}+\delta_{RLi-1},z)\,z^{-(d_{RLi-1}+d_{LRi}-N_F+1)}&&\text{(RL transition)}\end{aligned}\tag{119}$$

Then the conditions for causality are

$$d_{RLi}-d_{RLi-1}\ge N_F-1\quad\text{(RR transition)},\qquad d_{RLi-1}+d_{LRi}\ge N_F-1\quad\text{(RL transition)}\tag{120}$$

The physical meaning of these conditions is that the advance introduced by the “Quasi-Sinc interpolation” $\tilde{\varphi}(\delta,z)$ must be covered by the integer delay difference caused by the talker's location change when the talker stays in the same microphone zone, and by the integer delay sum caused by the talker's location change when the talker changes microphone zones.

4.4. Stereo echo canceller using WARP

The total system using the WARP method is presented in Fig. 9. The system is composed of five components: a far-end stereo sound generation model with a cross-channel transfer function (CCTF) estimation block, a stereo echo path model, a monaural acoustic echo canceller (AEC-I), a stereo acoustic echo canceller (AEC-II) and a WARP block.

Figure 9.

System Configuration for WARP based Stereo Acoustic Echo Canceller

As shown in Fig. 9, the actual echo cancellation is done by the stereo acoustic echo canceller (AEC-II); however, the monaural acoustic echo canceller (AEC-I) is used during far-end single talking. The WARP block is active only when the cross-channel transfer function changes, and it projects the monaural echo canceller's echo path estimation results for two LTI periods to one stereo echo path estimate, or vice versa.

5. Computer simulations

5.1. Stereo sound generation model

Computer simulations are carried out using the stereo sound generation model shown in Fig. 10, for both a white Gaussian noise (WGN) source and an actual voice. The system is composed of cross-channel transfer function (CCTF) estimation blocks, where all signals are assumed to be sampled at $f_S = 8$ kHz after 3.4 kHz cut-off low-pass filtering. The frame length is set to 100 samples. Since the stereo sound generation model is essentially a continuous-time signal system, over-sampling (×6, $f_A = 48$ kHz) is applied to simulate it. In the stereo sound generation model, five far-end talker locations are used, A Loc(1)=(−0.8, 1.0), B Loc(2)=(−0.8, 0.5), C Loc(3)=(−0.8, 0.0), D Loc(4)=(−0.8, −0.5) and E Loc(5)=(−0.8, −1.0), and the R/L microphone locations are set to R-Mic=(0, 0.5) and L-Mic=(0, −0.5), respectively. Delay is calculated assuming a sound wave speed of 300 m/s. In this set-up, the talker's position change for WGN is assumed to be from location A to location B and finally to location E, with each talker stable period set to 80 frames. The position change for voice is from C to A, and the period is set to 133 frames. Both room noise and reverberation components in the far-end terminal are assumed; the S/N is set to 20 dB to 40 dB.

Figure 10.

Stereo Sound Generation Model and Cross-Channel Transfer Function Detector

5.2. Cross-channel transfer function estimation

In the WARP method, it is easy to imagine that the estimation performance of the cross-channel transfer function largely affects the echo cancellation performance. To clarify the transfer function estimation performance, simulations are carried out using the cross-channel transfer function estimators (CCTF). Estimators are prepared for the right-microphone-side sound source case and the left-microphone-side sound source case, respectively. Each estimator has two NLMS adaptive filters, a longer (128-tap) one and a shorter (8-tap) one. The longer adaptive filter (AF1) is used to find the main tap, and the shorter one (AF2) is used to estimate the transfer function precisely as an impulse response.
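The identification performed by each CCTF adaptive filter can be sketched with a minimal NLMS loop; the update rule is the standard NLMS recursion, while the tap counts and signals below are illustrative (here the short 8-tap case, identifying a cross-channel response from WGN):

```python
import numpy as np

def nlms(x, d, n_taps, mu=1.0, eps=1e-6):
    """Minimal NLMS system identification: adapt w so that w * x tracks d."""
    w = np.zeros(n_taps)
    buf = np.zeros(n_taps)           # buf[0] = newest sample
    for xk, dk in zip(x, d):
        buf = np.roll(buf, 1)
        buf[0] = xk
        e = dk - w @ buf             # a-priori error
        w += mu * e * buf / (eps + buf @ buf)   # normalized step
    return w

rng = np.random.default_rng(2)
g_true = rng.standard_normal(8)               # hypothetical 8-tap cross-channel response
x_r = rng.standard_normal(4000)               # right-channel WGN (reference input)
x_l = np.convolve(g_true, x_r)[:len(x_r)]     # left channel = CCTF applied to right

g_hat = nlms(x_r, x_l, n_taps=8)
```

With a noise-free WGN reference the estimate converges essentially exactly; the chapter's simulations add far-end room noise, which limits the achievable accuracy as Fig. 13 shows.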

Figure 11 shows the CCTF estimation results, as the AF1 tap coefficients after convergence, with a single male voice sound source set at locations C, B and A of Fig. 10. The detailed responses obtained by AF2 are shown in Fig. 12. As the results show, the CCTF estimation works correctly in the simulations.

Figure 11.

Impulse Response Estimation Results in CCTF Block

Figure 12.

Estimated Tap Coefficients by Short Tap Adaptive Filter in CCTF Estimation Block

Figure 13.

Cross-Channel Correlation Cancellation Performances

The cancellation performance for the cross-channel correlation under room noise (WGN) is obtained using the adaptive filter (AF2) and is shown in Fig. 13, where the S/N is assumed to be 20 dB, 30 dB or 40 dB. In the figure, $CLR_L$ (dB) is the power reduction in dB observed between the signal power before and after cancellation of the cross-channel correlation by AF2. As shown here, more than 17 dB of cross-channel correlation cancellation is attained.

5.3. Echo canceller performances

To evaluate the echo cancellation performance of the WARP acoustic echo canceller, whose system is shown in Fig. 10, computer simulations are carried out assuming 1000-tap NLMS adaptive filters for both the stereo and monaural echo cancellers. The performance of the acoustic echo canceller is evaluated by two measurements. The first is the echo return loss enhancement, $\mathrm{ERLE}$ (dB), which is applied to the WGN source case and is defined as

$$\mathrm{ERLE}_{L(i-1)+j-1}=\begin{cases}10\log_{10}\left(\displaystyle\sum_{k=0}^{N_F-1}y^2_{i,j,k}\Big/\sum_{k=0}^{N_F-1}e^2_{MON\,i,j,k}\right)&\text{Monaural echo canceller}\\[8pt]10\log_{10}\left(\displaystyle\sum_{k=0}^{N_F-1}y^2_{i,j,k}\Big/\sum_{k=0}^{N_F-1}e^2_{ST\,i,j,k}\right)&\text{Stereo echo canceller}\end{cases}\tag{121}$$

where $e_{MON\,i,j,k}$ and $e_{ST\,i,j,k}$ are the residual echoes of the monaural echo canceller (AEC-I) and the stereo echo canceller (AEC-II) for the $k$th sample in the $j$th frame of the $i$th LTI period, respectively. The second measurement is the normalized misalignment of the estimated echo paths, defined as

$$\mathrm{NORM}_{L(i-1)+j-1}=10\log_{10}\left(\frac{\mathbf{h}_R^T\mathbf{h}_R+\mathbf{h}_L^T\mathbf{h}_L}{(\mathbf{h}_R-\hat{\mathbf{h}}_{Ri,j})^T(\mathbf{h}_R-\hat{\mathbf{h}}_{Ri,j})+(\mathbf{h}_L-\hat{\mathbf{h}}_{Li,j})^T(\mathbf{h}_L-\hat{\mathbf{h}}_{Li,j})}\right)\tag{122}$$

where $\hat{\mathbf{h}}_{Ri,j}$ and $\hat{\mathbf{h}}_{Li,j}$ are the stereo echo canceller's estimated coefficient arrays at the end of the $(i,j)$th frame, and $\mathbf{h}_R$ and $\mathbf{h}_L$ are the target stereo echo path impulse response arrays.
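The two measurements (121) and (122) can be computed per frame as follows; the toy signals at the end are only for checking the dB arithmetic (a residual at 10% amplitude gives 20 dB ERLE, a 1% coefficient error gives about 40 dB NORM):

```python
import numpy as np

def erle_db(y, e):
    """Echo return loss enhancement per (121): microphone power over residual power."""
    return 10 * np.log10(np.sum(y**2) / np.sum(e**2))

def norm_db(h_R, h_L, h_R_hat, h_L_hat):
    """Normalized misalignment per (122): total echo path power over estimation
    error power (higher is better in this convention)."""
    num = h_R @ h_R + h_L @ h_L
    den = (h_R - h_R_hat) @ (h_R - h_R_hat) + (h_L - h_L_hat) @ (h_L - h_L_hat)
    return 10 * np.log10(num / den)

# Toy check with known answers
y = np.ones(100)
e = 0.1 * np.ones(100)
h = np.ones(50)
e20 = erle_db(y, e)                          # 20 dB by construction
n40 = norm_db(h, h, 1.01 * h, 1.01 * h)      # about 40 dB by construction
```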

5.3.1. WARP echo canceller basic performances for WGN

The simulation results for the WARP echo canceller in the case of a WGN sound source, with no far-end double talking and no local noise, are shown in Fig. 14. In the simulations, the talker is assumed to move from A to E every 80 frames (1 s). In Fig. 14, the results (a) and (b) show the ERLEs for the monaural and stereo acoustic echo cancellers (AEC-I and AEC-II), respectively.

Figure 14.

WARP Echo Cancellation Performances for WGN Source

The WARP operations are applied at the boundaries of the three LTI periods for the talkers C, D and E. NORM is shown for the stereo echo canceller (AEC-II). As shown, after two LTI periods (the A and B periods), NORM and ERLE improve quickly through the WARP projection at the WARP timings in the figure. As for ERLE, the stereo acoustic echo canceller shows better performance than the monaural echo canceller. This is because the monaural echo canceller estimates an echo path model that is a combination of the CCTF and the real stereo echo path, so its performance is affected by the CCTF estimation error. The echo path model for the stereo echo canceller, on the other hand, is purely the stereo echo path model, which does not include the CCTF.

Figure 15.

Echo Cancellation Performance Comparison for WGN Source

Secondly, the WARP acoustic echo canceller is compared with a stereo echo canceller based on the affine projection method. In this case, the right and left sounds at the $k$th sample in the $(i,j)$th frame, $x'_{Rijk}$ and $x'_{Lijk}$, are assumed to have independent level shifts relative to the original right and left sounds, $x_{Rijk}$ and $x_{Lijk}$, to simulate small movements of the talker's face, as

$$x'_{Rijk}=\left(1+\alpha_{Level}\sin\!\left(\frac{2\pi k}{f_s T_X}\right)\right)x_{Rijk},\qquad x'_{Lijk}=\left(1+\alpha_{Level}\cos\!\left(\frac{2\pi k}{f_s T_X}\right)\right)x_{Lijk}\tag{123}$$

where $\alpha_{Level}$ and $T_X$ are constants that determine the level shift ratio and cycle. Figure 15 shows the cancellation performance when $\alpha_{Level}$ and $T_X$ are 10% and 500 ms, respectively.
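The modulation (123) is easy to reproduce; a sketch with the constants quoted in the text ($\alpha_{Level}=10\%$, $T_X=500$ ms) and illustrative WGN inputs:

```python
import numpy as np

fs = 8000            # sampling rate (Hz), per the simulation set-up
T_X = 0.5            # level-shift cycle: 500 ms -> 4000 samples per period
alpha = 0.10         # 10% level-shift ratio

k = np.arange(fs)    # one second of sample indices
x_R = np.random.default_rng(3).standard_normal(fs)
x_L = np.random.default_rng(4).standard_normal(fs)

# Independent right/left gain modulation, per (123)
x_R_mod = (1 + alpha * np.sin(2 * np.pi * k / (fs * T_X))) * x_R
x_L_mod = (1 + alpha * np.cos(2 * np.pi * k / (fs * T_X))) * x_L
```

Because the sine and cosine are in quadrature, the two channels never shift in lock-step, which is what perturbs the cross-channel correlation slightly without altering either signal by more than 10%.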

Figure 16.

WARP Echo Canceller Performance Affected by Far-End Background Noise

In Fig. 15, the WARP method shows more than 10 dB better stereo echo path estimation performance (NORM) than affine projection (P=3). The ERLE of the WARP-based stereo echo canceller is also better than that of affine projection (P=3). The ERLE of the WARP-based monaural acoustic echo canceller is similar to that of the affine method (P=3); however, the ERLE improvement after two LTI periods achieved by the WARP-based monaural echo canceller is better than that of the affine-based stereo echo canceller.

Figure 16 shows the echo canceller performance in the case where the CCTF estimation is degraded by room noise in the far-end terminal. The S/N in the far-end terminal is assumed to be 30 dB or 50 dB. Although the results clearly show that a lower S/N degrades ERLE and NORM, more than 15 dB of ERLE and NORM is attained after two LTI periods.

Figure 17 shows the echo canceller performance in the case where an echo path change happens. In this simulation, the echo path change is inserted at the 100th frame. The echo path change level is chosen as 20 dB, 30 dB or 40 dB. It is observed that the echo path change affects the WARP calculation, and therefore the WARP effect degrades at the second and third LTI period boundaries.

Figure 17.

WARP Echo Canceller Cancellation Performance Drop Due to Echo Path Change

Figure 18 summarizes the NORM results for the stereo NLMS method, the affine projection method and the WARP method. In this simulation, as a non-linear function for the affine projection, independent absolute values of the right and left sounds are added, as

$$x'_{Rijk}=x_{Rijk}+0.5\,\alpha_{ABS}\left(x_{Rijk}+|x_{Rijk}|\right),\qquad x'_{Lijk}=x_{Lijk}+0.5\,\alpha_{ABS}\left(x_{Lijk}-|x_{Lijk}|\right)\tag{124}$$

where $\alpha_{ABS}$ is a constant that determines the non-linearity level of the stereo sound and is set to 10%. In this simulation, an experiment is also carried out assuming far-end double talking, where WGN whose power equals that of the far-end single talking is added between frames 100 and 130.
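The non-linearity (124) adds a half-wave-rectified component, the positive half to the right channel and the negative half to the left. A sketch with $\alpha_{ABS}=10\%$ and illustrative WGN inputs:

```python
import numpy as np

alpha_abs = 0.10
x_R = np.random.default_rng(5).standard_normal(1000)
x_L = np.random.default_rng(6).standard_normal(1000)

# Non-linear de-correlation per (124): half-wave rectified components.
# For x_R > 0 this scales the sample by (1 + alpha_abs); for x_R <= 0 it is untouched.
x_R_nl = x_R + 0.5 * alpha_abs * (x_R + np.abs(x_R))
x_L_nl = x_L + 0.5 * alpha_abs * (x_L - np.abs(x_L))
```

Because opposite half-waves are distorted in the two channels, the added components are uncorrelated with each other, which is what restores the rank needed by the affine projection method at the cost of a small audible distortion.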

As evident from the results in Fig. 18, the WARP method shows better stereo echo path estimation performance regardless of the existence of far-end double talking. Even in the case of a 10% far-end signal level shift, the WARP method attains more than 20 dB better NORM than the affine method (P=3) with the 10% absolute-value non-linearity.

Figure 18.

Echo Path Estimation Performance Comparison for NLMS, Affine and WARP Methods

5.3.2. WARP echo canceller basic performances for voice

Figure 19 shows the NORM and the residual echo level (Lres) for an actual male voice sound source. Since the voice sound level changes frequently, we calculate the residual echo level Lres (dB) instead of the ERLE (dB) used in the white Gaussian noise case. Although slower NORM and Lres convergence than for white Gaussian noise is observed, quick improvement in both metrics is observed at the talker B and A border. In this simulation, we applied a 500-tap NLMS adaptive filter. Affine projection may give better convergence speed by eliminating the auto-correlation in the voice; however, this effect is independent of the WARP effect. WARP and affine projection can be used together and may contribute to convergence speed-up independently.

Figure 19.

Residual Echo Level (Lres (dB)) and Normalized Estimated Echo Misalignment (NORM) for the Voice Source at Far-End Terminal S/N = 30 dB (Level Shift 0, 500 Taps, Step Gain = 1.0)

6. Conclusions

In this chapter, stereo acoustic echo canceller methods are studied from the cross-channel correlation viewpoint, aiming at conversational DTV use. Among the many stereo acoustic echo cancellers, we focus on the AP method (including the LS and stereo NLMS methods) and the WARP method, since these approaches cause no modification or artifacts in the speaker-output stereo sound, which would be undesirable in consumer audio-visual products such as a DTV. In this study, the stereo sound generation system is modeled by right and left Pth-order LTI systems with independent noises. The stereo LS method (M=2P) and the stereo NLMS method (M=P=1) are two extreme cases of the general AP method, which requires an M×M inverse matrix operation in each sample period. The stereo AP method (M=P) can produce the best iteration direction, fully exploiting the un-correlated components produced by small fluctuations in the stereo cross-channel correlation, by calculating P×P inverse matrix operations in each sample period. The major problem of the method is that it cannot cope with strict single talking, where no un-correlated signals exist in the right and left channels, so the rank-drop problem occurs. Contrary to the AP method, the WARP method creates a stereo echo path estimation model by applying a monaural adaptive filter over two LTI periods at each far-end talker change. Since it creates the stereo echo path estimate from two monaural echo path models for two LTI periods, it does not suffer from the rank-drop problem even in strict single talking. Moreover, with the WARP method the computational complexity can be reduced drastically, because it requires P×P inverse matrix operations only at LTI characteristics changes, such as far-end talker changes. However, contrary to the AP method, it is clear that the performance of the WARP method may drop if the fluctuation in the cross-channel correlation becomes high.
Considering the above pros and cons of the affine projection and WARP methods, it looks desirable to apply the affine method and the WARP method dynamically, depending on the nature of the stereo sound. In this chapter, an acoustic echo canceller based on the WARP method, which is equipped with both monaural and stereo adaptive filters, is discussed together with other gradient-based stereo adaptive filter methods. The WARP method observes the cross-channel correlation characteristics of the stereo sound using short-tap pre-adaptive filters. The pre-adaptive filter coefficients are used to calculate the WARP functions, which project the monaural adaptive filter estimation results to the stereo adaptive filter initial coefficients, or vice versa.

To clarify the effectiveness of the WARP method, simple computer simulations are carried out using a white Gaussian noise source and a male voice, with a 128-tap NLMS cross-channel correlation estimator, a 1000-tap monaural NLMS adaptive filter for the monaural echo canceller, and a 2×1000-tap (2×500-tap for voice) multi-channel NLMS adaptive filter for the stereo echo canceller. The results are summarized as follows:

  1. Considering the sampling effect for analog delay, a ×6 over-sampling system is assumed for the stereo generation model. Five far-end talker positions are assumed, and the direct-wave sound from each talker is assumed to be picked up by the far-end stereo microphones together with far-end room background noise. The simulation results show that good cross-channel transfer function estimation is attained rapidly using a 128-tap adaptive filter if the far-end noise S/N is reasonable (such as 20-40 dB).

  2. Using the far-end stereo generation model and the cross-channel correlation estimation results, a 1000-tap monaural NLMS adaptive filter and 2×1000-tap stereo NLMS adaptive filters are used to clarify the effectiveness of the WARP method. In the simulation, far-end talker changes are assumed to happen every 80 frames (1 frame = 100 samples). The echo return loss enhancement (ERLE) and the normalized estimation error power (NORM) are used as measurements. It is clarified that both ERLE and NORM are drastically improved at far-end talker changes by applying the WARP operation.

  3. The far-end S/N affects the WARP performance; however, we can still attain around (S/N − 5) dB ERLE or NORM.

  4. We find a slight convergence improvement in the case of the AP method (P=3) with the non-linear operation. However, the improvement is much smaller than that of WARP at the far-end talker change. This is because the sound source in this simulation is white Gaussian noise, so the merit of the AP method is not fully achieved.

  5. Since the WARP method assumes that the stereo echo path characteristics remain stable, stereo echo path changes degrade the WARP effectiveness. The simulation results show that the degradation depends on how much the stereo echo path moved, and that the degradation appears just after the WARP projection.

  6. The WARP method works correctly for an actual voice sound, too. Collaboration with the AP method may improve the total convergence speed further, because the AP method improves the convergence speed for voice independently of the WARP effect.

As for further studies, more experiments in actual environments are necessary. The author would like to continue further research to realize smooth and natural conversations in future conversational DTV.

7. Appendix

If the $N\times N$ matrix $Q'$ is defined as

$$Q'=X_{2S}^T(k)\,G^T G\,X_{2S}(k)\tag{125}$$

where $X_{2S}(k)$ is an array of $(2P-1)$-sample vectors composed of white Gaussian noise samples $x(k)$, defined as

$$X_{2S}(k)=\left[\mathbf{x}(k),\mathbf{x}(k-1),\cdots,\mathbf{x}(k-N+1)\right],\qquad \mathbf{x}(k)=\left[x(k),x(k-1),\cdots,x(k-2P+2)\right]^T\tag{126}$$

$G$ is defined as a $P\times(2P-1)$ matrix whose rows are copies of $\mathbf{g}^T$, each shifted by one sample, as

$$G=\begin{bmatrix}\mathbf{g}^T&0&\cdots&0\\ 0&\mathbf{g}^T&\cdots&0\\ \vdots&&\ddots&\vdots\\ 0&\cdots&0&\mathbf{g}^T\end{bmatrix}\tag{127}$$

where $\mathbf{g}$ is a $P$-sample array defined as

$$\mathbf{g}=\left[g_0,g_1,\cdots,g_\nu,\cdots,g_{P-1}\right]^T\tag{128}$$

Then $Q'$ is a Toeplitz matrix and can be expressed using a $P\times P$ ($P\ll N$) Toeplitz matrix $Q$ as

$$Q'=\mathrm{Tlz}(Q)\tag{129}$$

This is because the $(u,v)$th element of the matrix $Q'$, $a_{Tlz}(u,v)$, is given as

$$a_{Tlz}(u,v)=\mathbf{x}^T(k-u)\,G^T G\,\mathbf{x}(k-v)\tag{130}$$

Considering

$$\mathbf{x}^T(k-u)\,G^T G\,\mathbf{x}(k-v)=0\qquad\text{for all }|u-v|\ge P\tag{131}$$

the element $a_{Tlz}(u,v)$ is given as

$$a_{Tlz}(u,v)=\begin{cases}a(u-v,0)&P-1\ge u-v\ge 0\\ a(0,v-u)&P-1\ge v-u\ge 0\\ 0&|u-v|\ge P\end{cases}\tag{132}$$

By setting the $(u,v)$th element of the $P\times P$ ($P\ll N$) Toeplitz matrix $Q$ to $a_{Tlz}(u,v)$ ($0\le u<P$, $0\le v<P$), we define the function $\mathrm{Tlz}(Q)$, which determines the $N\times N$ Toeplitz matrix $Q'$.

It is noted that if $Q$ is an identity matrix, $Q'$ is also an identity matrix.
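The banded Toeplitz structure claimed in (129)–(132) can be checked numerically; note that for a random noise draw it holds in expectation, so the sketch below averages $Q'$ over many independent draws (all sizes are small illustrative choices) and verifies that entries with $|u-v|\ge P$ vanish:

```python
import numpy as np

P, N, trials = 3, 6, 10000
rng = np.random.default_rng(8)
g = rng.standard_normal(P)

# Convolution-style matrix in the spirit of (127): P rows, each a shifted g^T
G = np.zeros((P, 2 * P - 1))
for j in range(P):
    G[j, j:j + P] = g
K = G.T @ G                     # kernel appearing in (125) and (130)

acc = np.zeros((N, N))
for _ in range(trials):
    x = rng.standard_normal(N + 2 * P - 2)
    # columns play the role of the shifted (2P-1)-sample vectors of (126)
    X = np.array([x[v:v + 2 * P - 1] for v in range(N)]).T
    acc += X.T @ K @ X
Qp = acc / trials               # Monte-Carlo estimate of E[Q']

# On average Q' is banded Toeplitz: entries with |u - v| >= P vanish, per (131)-(132)
off_band = max(abs(Qp[u, v]) for u in range(N) for v in range(N) if abs(u - v) >= P)
band_ok = off_band < 0.1 * Qp[0, 0]
```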

© 2011 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike-3.0 License, which permits use, distribution and reproduction for non-commercial purposes, provided the original is properly cited and derivative works building on this content are distributed under the same license.


Shigenobu Minami (September 6th 2011). A Stereo Acoustic Echo Canceller Using Cross-Channel Correlation. In: Lino Garcia (ed.), Adaptive Filtering, IntechOpen. DOI: 10.5772/16341.
