A Stereo Acoustic Echo Canceller Using Cross-Channel Correlation

Shigenobu Minami

doi:10.5772/16341

Author Information

Show +

Minami Shigenobu
- Toshiba Corporation, Japan

*Address all correspondence to:

1. Introduction

Stereo acoustic echo canceller is becoming more and more important as an echo canceller is applied to consumer products like a conversational DTV. However it is well known that if there is strong cross-channel correlation between right and left sounds, it cannot converge well and results in echo path estimation misalignment. This is a serious problem in a conversational DTV because the speaker output sound is combination of a far-end conversational sound, which is essentially monaural, and TV program sound, which has wide variety of sound characteristics, monaural sound, stereo sound or mixture of them. To cope with this problem, many stereo echo cancellation algorithms have been proposed. The methods can be categorized into two approaches. The first one is to de-correlate the stereo sound by introducing independent noise or non-linear post-processing to right and left speaker outputs. This approach is very effective for single source stereo sound case, which covers most of conversational sounds, because the de-correlation prevents rank drop to solve a normal equation in a multi-channel adaptive filtering algorithm. Moreover, it is simple since many traditional adaptation algorithms can be used without any modification. Although the approach has many advantages and therefore widely accepted, it still has an essential problem caused by the de-correlation which results in sound quality change due to insertion of the artificial distortion. Even though the inserted distortion is minimized so as to prevent sound quality degradation, from an entertainment audio equipment view point, such as a conversational DTV, many users do not accept any distortion to the speaker output sound. The second approach is desirable for the entertainment types of equipments because no modification to the speaker outputs is required. In this approach, the algorithms utilize cross-channel correlation change in a stereo sound. This approach is also divided into two approaches, depending on how to utilize the cross-channel correlation change. One widely used approach is affine projection method. If there are small variations in the cross-channel correlation even in a single sound source stereo sound, small un-correlated component appears in each channel. The affine projection method can produce the best direction by excluding the auto-correlation bad effect in each channel and by utilizing the small un-correlated components. This approach has a great advantage since it does not require any modification to the stereo sound, however if the variation in the cross-channel correlation is very small, improvement of the adaptive filter convergence is very small. Since the rank drop problem of the stereo adaptive filter is essentially not solved, we may need slight inserted distortion which reduces merit of this method. Another headache is that the method requires P by P inverse matrix calculation in an each sample. The inverse matrix operation can be relaxed by choosing P as small number, however small P sometimes cannot attain sufficient convergence speed improvement. To attain better performance even by small P, the affine projection method sometimes realized together with sub-band method. Another method categorized in the second approach is “WARP” method. Unlike to affine projection method which utilizes small change in the cross-channel correlation, the method utilizes large change in the cross-channel correlation. This approach is based on the nature of usual conversations. Even though using stereo sound for conversations, most parts of conversations are single talk monaural sound. The cross-channel correlation is usually very high and it remains almost stable during a single talking. A large change happens when talker change or talker’s face movement happens. Therefore, the method applies a monaural adaptive filter to single sound source stereo sound and multi-channel (stereo) adaptive filter to non-single sound source stereo sound. Important feature of the method is two monaural adaptive filter estimation results and one stereo adaptive filter estimation result is transformed to each other by using projection matrixes, called WARP matrixes. Since a monaural adaptive filter is applied when a sound is single source stereo sound, we do not need to suffer from the rank problem.

In this chapter, stereo acoustic echo canceller methods, multi-channel least mean square, affine projection and WARP methods, all of them do not need any modification to the speaker output sounds, are surveyed targeting conversational DTV applications. Then WARP method is explained in detail.

2. Stereo acoustic echo canceller problem

2.1. Conversational DTV

Since conversational DTV should keep smooth speech communication even when it is receiving a regular TV program, it requires following functionalities together with traditional DTV systems as shown in Fig. 1.

Figure 1.
Audio System Example in a Conversational DTV

Mixing of broadcasting sound and communication speech: Two stereo sounds from the DTV audio receiver and local conversational speech decoder are mixed and sent to the stereo speaker system.
Sampling frequency conversion: Sampling frequency of DTV sound is usually wider than that of conversational service, such as f S H = 48 k H z for DTV sound and f S = 16 k H z for conversational service sound, we need sampling frequency conversion between DTV and conversational service audio parts
Stereo acoustic canceller: A stereo acoustic echo canceller is required to prevent howling and speech quality degradation due to acoustic coupling between stereo speaker and microphone.

Among the above special functionalities, the echo canceller for the conversational DTV is technically very challenging because the echo canceller should cancel wide variety of stereo echoes for TV programs as well as stereo speech communications.

2.2. Stereo sound generation model

A stereo acoustic echo canceller system is shown in Fig. 2 with typical stereo sound generation model, where all signals are assumed to be discrete time signals at the k t h sampling timing by f S sampling frequency and the sound generation model is assumed to be linier finite impulse response (FIR) systems which has a sound source signal x S i ( k ) as an input and stereo sound x R i ( k ) and x L i ( k ) as outputs with additional uncorrelated noises x ′ U R i ( k ) and x ′ U L i ( k ) . By using matrix and array notations of the signals as

X S i ( k ) = [ x S i ( k ) , x S i ( k − 1 ) , ⋯ x S i ( k − P + 1 ) ] x S i ( k ) = [ x S i ( k ) , x S i ( k − 1 ) , ⋯ x S i ( k − N + 1 ) ] T x R i ( k ) = [ x R i ( k ) , x R i ( k − 1 ) , ⋯ x R i ( k − N + 1 ) ] T x L i ( k ) = [ x L i ( k ) , x L i ( k − 1 ) , ⋯ x L i ( k − N + 1 ) ] T x ′ U R i ( k ) = [ x ′ U R i ( k ) , x ′ U R i ( k − 1 ) , ⋯ x ′ U R i ( k − N + 1 ) ] T x ′ U L i ( k ) = [ x ′ U L i ( k ) , x ′ U L i ( k − 1 ) , ⋯ x ′ U L i ( k − N + 1 ) ] T E1

where P and N are impulse response length of the FIR system and tap length of the adaptive filter for each channel, respectively.

Then the FIR system output x i ( k ) is 2N sample array and is expressed as

x i ( k ) = [ x R i ( k ) x L i ( k ) ] = [ X S i ( k ) g R i ( k ) + x ′ U R i ( k ) X S i ( k ) g L i ( k ) + x ′ U L i ( k ) ] E2

where g R i ( k ) and g L i ( k ) are P sample impulse responses of the FIR system defined as

g R i ( k ) = [ g R i , 0 ( k ) , g R i 1 ( k ) , ⋯ , g R i , ν ( k ) , , ⋯ , g R i , P − 1 ( k ) ] T g L i ( k ) = [ g L i , 0 ( k ) , g L i 1 ( k ) , ⋯ , g L i , ν ( k ) , , ⋯ , g L i , P − 1 ( k ) ] T E3

Figure 2.
Traditional Stereo Acoustic Echo Canceller System Configuration with Typical Stereo Sound Generation Model.

In(2), if g R i ( k ) and g L i ( k ) are composed of constant array, g R i and g L i during the i t h period, and small time variant arrays, Δ g R i ( k ) and Δ g L i ( k ) which are defined as

g R i = [ g R i , 0 , g R i 1 , ⋯ , g R i , ν , ⋯ , g R i , P − 1 ] T g L i = [ g L i , 0 , g L i 1 , ⋯ , g L i , ν , ⋯ , g L i , P − 1 ] T Δ g R i ( k ) = [ Δ g R i , 0 ( k ) , Δ g R i 1 ( k ) , ⋯ , Δ g R i , ν ( k ) , ⋯ , Δ g R i , P − 1 ( k ) ] T Δ g L i ( k ) = [ Δ g L i , 0 ( k ) , Δ g L i 1 ( k ) , ⋯ , Δ g L i , ν ( k ) , ⋯ , Δ g L i , P − 1 ( k ) ] T E4

(2) is re-written as

[ x R i ( k ) x L i ( k ) ] = [ X S i ( k ) g R i + X S i ( k ) Δ g R i ( k ) + x ′ U R i ( k ) X S i ( k ) g L i + X S i ( k ) Δ g L i ( k ) + x ′ U L i ( k ) ] E5

This situation is usual in the case of far-end single talking because transfer functions between talker and right and left microphones vary slightly due to talker’s small movement. By assuming the components are also un-correlated noise, (5) can be regarded as a linear time invariant system with independent noise components, x U R i ( k ) and x U L i ( k ) , as

[ x R i ( k ) x L i ( k ) ] = [ X S i ( k ) g R i + x U R i ( k ) X S i ( k ) g L i + x U L i ( k ) ] E6

where

x U R i ( k ) = X S i ( k ) Δ g R i ( k ) + x ′ U R i ( k ) x U L i ( k ) = X S i ( k ) Δ g L i ( k ) + x ′ U L i ( k ) E7

In (6), if there are no un-correlated noises, we call the situation as strict single talking.

In this chapter, sound source signal( x S i ( k ) ), uncorrelated noises ( x ′ U R i ( k ) and x ′ U L i ( k ) ) are assumed as independent white Gaussian noise with variance σ x i and σ N i , respectively.

2.3. Stereo acoustic echo canceller problem

For simplification, only one stereo audio echo canceller for the right side microphone’s output signal y ˜ i ( k ) , is explained. This is because the echo canceller for left microphone output is apparently treated as the same way as the right microphone case. As shown in Fig.2, the echo canceller cancels the acoustic echo y i ( k ) as

e i ( k ) = y i ( k ) − y ^ i ( k ) + n i ( k ) E8

where e i ( k ) is acoustic echo canceller’s residual error, n i ( k ) is a independent background noise, y ^ i ( k ) is an FIR adaptive filter output in the stereo echo canceller, which is given by

y ^ i ( k ) = h ^ R i T ( k ) x R i ( k ) + h ^ L i T ( k ) x L i ( k ) E9

where h ^ R i ( k ) and h ^ L i ( k ) are N tap FIR adaptive filter coefficient arrays.

Error power of the echo canceller for the right channel microphone output, σ e i 2 ( k ) , is given as:

σ e i 2 ( k ) = ( y R i ( k ) - h ^ S T i T ( k ) x i ( k ) + n i ( k ) ) 2 E10

where h ^ S T i ( k ) is a stereo echo path model defined as

h ^ S T i ( k ) = [ h ^ R i T ( k ) h ^ L i T ( k ) ] T E11

Optimum echo path estimation h ^ O P T which minimizes the error power σ e 2 ( k ) is given by solving the linier programming problem as

M i n i m i z e [ ∑ k = 0 N L S − 1 σ e i 2 ( k ) ] E12

where N L S is a number of samples used for optimization. Then the optimum echo path estimation for the i t h LTI period h ^ O P T i is easily obtained by well known normal equation as

h ^ O P T i = ( ∑ k = 0 N L S − 1 ( y ˜ i ( k ) x i ( k ) ) ) X N L S i − 1 E13

where X N L S i is an auto-correlation matrix of the adaptive filter input signal and is given by

X N L S i = ∑ k = 0 N L S − 1 ( x i ( k ) x i T ( k ) ) = [ A i B i C i D i ] = [ ∑ k = 0 N L S − 1 ( x R i ( k ) x R i T ( k ) ) ∑ k = 0 N L S − 1 ( x R i ( k ) x L i T ( k ) ) ∑ k = 0 N L S − 1 ( x L i ( k ) x R i T ( k ) ) ∑ k = 0 N L S − 1 ( x L i ( k ) x L i T ( k ) ) ] E14

By (14), determinant of X N L S i is given by

| X N L S i | = | A i | | D i − C i A i − 1 B i | E15

In the case of the stereo generation model which is defined by(2), the sub-matrixes in (14) are given by

A i = ∑ k = 0 N L S − 1 ( X S i ( k ) G R R i X S i T ( k ) + 2 x U R i ( k ) ( X S i ( k ) g R i ) T + x U R i ( k ) x U R i T ( k ) ) B i = ∑ k = 0 N L S − 1 ( X S i ( k ) G R L i X S i T ( k ) + x U R i ( k ) ( X S i ( k ) g R i ) T + x U L i ( k ) ( X S i ( k ) g R i ) T + x U R i ( k ) x U L i T ( k ) ) C i = ∑ k = 0 N L S − 1 ( X S i ( k ) G L R i X S i T ( k ) + x U L i ( k ) ( X S i ( k ) g R i ) T + x U R i ( k ) ( X S i ( k ) g L i ) T + x U L i ( k ) x U R i T ( k ) ) D i = ∑ k = 0 N L S − 1 ( X S i ( k ) G L L i X S i T ( k ) + 2 x U L i ( k ) ( X S i ( k ) g L i ) T + x U L i ( k ) x U L i T ( k ) ) E16

where

G R R i = g R i g R i T , G R L i = g R i g L i T , G L R i = g L i g R i T , G L L i = g L i g L i T E17

In the cease of strict single talking where x U R i ( k ) and x U L i ( k ) do not exist, (16) becomes very simple as

A i = ∑ k = 0 N L S − 1 ( X S i ( k ) G R R i X S i T ( k ) ) B i = ∑ k = 0 N L S − 1 ( X S i ( k ) G R L i X S i T ( k ) ) C i = ∑ k = 0 N L S − 1 ( X S i ( k ) G L R i X S i T ( k ) ) D i = ∑ k = 0 N L S − 1 ( X S i ( k ) G L L i X S i T ( k ) ) E18

| X N L S i | | C i | = | A i | | ( D i C i − C i A i − 1 B i C i | = | A i | | ( D i C i − C i B i C i A i − 1 | E19

Then | D i C i − C i B i C i A i − 1 | becomes zero as

| D i C i − C i A i − 1 B i C i | = | ∑ k = 0 N L S − 1 ( X S i ( k ) ( G L L i − G L R i G − 1 R R i G R L i ) X S i T ( k ) ) X S i ( k ) G L R i X S i T ( k ) ) | = N σ x i 2 | ∑ k = 0 N L S − 1 ( X S i ( k ) ( g L i T g L i ( g L i g R i T − g L i g R i T ( g R i g R i T ) − 1 g R i g R i T ) X S i T ( k ) ) | = 0 E20

Hence no unique solution can be found by solving the normal equation in the case of strict single talking where un-correlated components do not exist. This is a well known stereo adaptive filter cross-channel correlation problem.

3. Stereo acoustic echo canceller methods

To improve problems addressed above, many approaches have been proposed. One widely accepted approach is de-correlation of stereo sound. To avoid the rank drop of the normal equation(13), small distortion such as non-linear processing or modification of phase is added to stereo sound. This approach is simple and effective to endorse convergence of the multi-channel adaptive filter, however it may degrade the stereo sound by the distortion. In the case of entertainment applications, such as conversational DTV, the problem may be serious because customer’s requirement for sound quality is usually very high and therefore even small modification to the speaker output sound cannot be accepted. From this view point, approaches which do not need to add any modification or artifacts to the speaker output sound are desirable for the entertainment use. In this section, least square (LS), stereo affine projection (AP), stereo normalized least mean square (NLMS) and WARP methods are reviewed as methods which do not need to change stereo sound itself.

3.1. Gradient method

Gradient method is widely used for solving the quadratic problem iteratively. As a generalized gradient method, let denote M sample orthogonalized error array ε M i ( k ) based on original error array e M i ( k ) as

ε M i ( k ) = R i ( k ) e M i ( k ) E21

where e M i ( k ) is an M sample error array which is defined as

e M i ( k ) = [ e i ( k ) , e i ( k − 1 ) , ⋯ e i ( k − M + 1 ) ] T E22

and R i ( k ) is a M × M matrix which orthogonalizes the auto-correlation matrix e M i ( k ) e M i T ( k ) . The orthogonalized error array is expressed using difference between adaptive filter coefficient array h ^ S T i ( k ) and target stereo echo path 2 N sample response h S T as

ε M i ( k ) = R i ( k ) X M 2 N i T ( k ) ( h S T - h ^ S T i ( k ) ) E23

where X M 2 N i ( k ) is a Mx2N matrix which is composed of adaptive filter stereo input array as defined by

X M 2 N i ( k ) = [ x i ( k ) , x i ( k − 1 ) , ⋯ x i ( k − M + 1 ) ] E24

By defining an echo path estimation error array d S T i ( k ) which is defined as

d S T i ( k ) = h S T - h ^ S T i ( k ) E25

estimation error power σ ε i 2 ( k ) is obtained by

σ ε i 2 ( k ) = ε M i T ( k ) ε M i ( k ) = d S T i T ( k ) Q 2 N 2 N i ( k ) d S T i ( k ) E26

where

Q 2 N 2 N i ( k ) = X M 2 N i ( k ) R i T ( k ) R i ( k ) X M 2 N i T ( k ) E27

Then, (26) is regarded as a quadratic function of h ^ S T i ( k ) as

f ( h ^ S T i ( k ) ) = 1 2 h ^ S T i T ( k ) Q 2 N 2 N i ( k ) h ^ S T i T ( k ) − h ^ S T i T ( k ) Q 2 N 2 N i h S T E28

For the quadratic function, gradient Δ i ( k ) is given by

Δ i ( k ) = - Q 2 N 2 N i ( k ) d S T i ( k ) E29

Iteration of h ^ S T i ( k ) which minimizes σ ε i 2 ( k ) is given by

h ^ S T i ( k + 1 ) = h ^ S T i ( k ) − α Δ i ( k ) = h ^ S T i ( k ) + α Q 2 N 2 N i ( k ) d S T i ( k ) = h ^ S T i ( k ) + α X M 2 N i ( k ) R i T ( k ) R i ( k ) e M i ( k ) E30

where α is a constant to determine step size.

Above equation is very generic expression of the gradient method and following approaches are regarded as deviations of this iteration.

3.2. Least Square (LS) method (M=2N)

From(30), the estimation error power between estimated adaptive filter coefficients and stereo echo path response, d i T ( k ) d i ( k ) is given by

d i T ( k + 1 ) d i ( k + 1 ) = d i T ( k ) ( I 2 N − α Q 2 N 2 N i ( k ) ) ( I 2 N − α Q 2 N 2 N i ( k ) ) T d i ( k ) E31

where I 2 N is a 2 N × 2 N identity matrix. Then the fastest convergence is obtained by finding R i ( k ) which orthogonalizes and minimizes eigenvalue variance in Q 2 N 2 N i ( k ) .

If M=2N, X 2 M 2 N i ( k ) is symmetric square matrix as

X M 2 N i ( k ) = X M 2 N i T ( k ) E32

and if X M 2 N i ( k ) ⋅ X M 2 N i T ( k ) ( = X M 2 N i T ( k ) ⋅ X M 2 N i ( k ) ) is a regular matrix so that inverse matrix exists, R i T ( k ) R i ( k ) which orthogonalizes Q 2 N 2 N i ( k ) is given by

R i T ( k ) R i ( k ) = ( X 2 N 2 N i T ( k ) X 2 N 2 N i ( k ) ) − 1 E33

By substituting (33) for (30)

h ^ S T i ( k + 1 ) = h ^ S T i ( k ) + α X 2 N 2 N i ( k ) ( X 2 N 2 N i T ( k ) X 2 N 2 N i ( k ) ) − 1 e N i ( k ) E34

Assuming initial tap coefficient array as zero vector and α = 0 during 0 to 2N-1th samples and α = 1 at 2Nth sample, (34) can be re-written as

h ^ S T i ( 2 N ) = X 2 N 2 N i ( 2 N - 1 ) ( X 2 N 2 N i T ( 2 N - 1 ) X 2 N 2 N i ( 2 N - 1 ) ) − 1 y i ( 2 N - 1 ) E35

where y i ( k ) is 2 N sample echo path output array and is defined as

y i ( k ) = [ y i ( k ) , y i ( k − 1 ) , ⋯ y i ( k − 2 N + 1 ) ] T E36

This iteration is done only once at 2 N − 1 t h sample. If N L S = 2 N , inverse matrix term in (35) is written as

X 2 N 2 N i T ( k ) X 2 N 2 N i ( k ) = ∑ k = 0 N L S − 1 ( x i ( k ) x i T ( k ) ) = X N L S i E37

Comparing (13) and (35) with　 (37), it is found that LS method is a special case of gradient method when M equals to 2N.

3.3. Stereo Affine Projection (AP) method (M=P ≤ N)

Stereo affine projection method is assumed as a case when M is chosen as FIR response length P in the LTI system. This approach is very effective to reduce 2Nx2N inverse matrix operations in LS method to PxP operations when the stereo generation model is assumed to be LTI system outputs from single WGN signal source with right and left channel independent noises as shown in Fig.2. For the sake of explanation, we define stereo sound signal matrix X P 2 N i ( k ) which is composed of right and left signal matrix X R i ( k ) and X L i ( k ) for P samples as

X P 2 N i ( k ) = [ X R i T ( k ) X L i T ( k ) ] T = [ X 2 S i ( k ) G R i T + X U R i ( k ) X 2 S i ( k ) G L i T + X U L i ( k ) ] E38

where

X 2 S i ( k ) = [ x S i ( k ) , x S i ( k − 1 ) , ⋯ x S i ( k − 2 P + 2 ) ] E39

X U R i ( k ) and X U L i ( k ) are un-correlated signal matrix defined as

X U R i ( k ) = [ x U R i ( k ) , x U R i ( k − 1 ) , ⋯ x U R i ( k − P + 1 ) ] X U L i ( k ) = [ x U L i ( k ) , x U L i ( k − 1 ) , ⋯ x U L i ( k − P + 1 ) ] E40

G R i and G L i are source to microphones response (2P-1)xP matrixes and are defined as

G R i = [ g 2 R , 0 , i T g 2 R , 1 , i T ⋯ g 2 R , P − 1 , i T ] = [ g R i T 0 ⋯ 0 0 g R i T ⋱ 0 0 0 ⋯ 0 0 0 ⋯ g R i T ] , G L i = [ g 2 R L , 0 , i T g 2 L , 1 , i T ⋯ g 2 L , P − 1 , i T ] = [ g L i T 0 ⋯ 0 0 g L i T ⋱ 0 0 0 ⋯ 0 0 0 ⋯ g L i T ] E41

As explained by(31), Q 2 N 2 N i ( k ) determines convergence speed of the gradient method. In this section, we derive affine projection method by minimizing the max-min eigenvalue variance in Q 2 N 2 N i ( k ) . Firstly, the auto-correlation matrix is expressed by sub-matrixes for each stereo channel as

Q N 2 N i ( k ) = [ Q A N N i ( k ) Q B N N i ( k ) Q C N N i ( k ) Q D N N i ( k ) ] E42

where Q A N N i ( k ) and Q D N N i ( k ) are right and left channel auto-correlation matrixes, Q B N N i ( k ) and Q C N N i ( k ) are cross channel-correlation matrixes. These sub-matrixes are given by

Q A N N i ( k ) = X 2 S i ( k ) G R i T R i T ( k ) R i ( k ) G R i X 2 S i T ( k ) + X U R i ( k ) R i T ( k ) R i ( k ) X U R i T ( k ) + 2 X 2 S i ( k ) R i T ( k ) R i ( k ) X U R i T ( k ) Q B N N i ( k ) = X 2 S i ( k ) G R i T R i T ( k ) R i ( k ) G L i X 2 S i T ( k ) + X U R i ( k ) R i T ( k ) R i ( k ) X U L i T ( k ) + 2 X U R i ( k ) R i T ( k ) R i ( k ) X U L i T ( k ) Q C N N i ( k ) = X 2 S i ( k ) G L i T R i T ( k ) R i ( k ) G R i X 2 S i T ( k ) + X U L i ( k ) R i T ( k ) R i ( k ) X U R i T ( k ) + 2 X U L i ( k ) R i T ( k ) R i ( k ) X U T i T ( k ) Q D N N i ( k ) = X 2 S i ( k ) G L i T R i T ( k ) R i ( k ) G L i X 2 S i T ( k ) + X U L i ( k ) R i T ( k ) R i ( k ) X U L i T ( k ) + 2 X 2 S i ( k ) R i T ( k ) R i ( k ) X U L i T ( k ) E43

Since the iteration process in (30) is an averaging process, the auto-correlation matrix Q 2 N 2 N i ( k ) is approximated by using expectation value of it, Q ˜ 2 N 2 N i ( k ) = ⟨ Q 2 N 2 N i ( k ) ⟩ . Then expectation values for sub-matrixes in (42) are simplified applying statistical independency between sound source signal and noises and T l z function defined in Appendix as

Q ˜ A N N i = T l z ( ⟨ X ˜ 2 S i ( k ) G R i T R i T ( k ) R i ( k ) G R i X ˜ 2 S i T ( k ) ⟩ ) + T l z ( ⟨ X ˜ U R i ( k ) R i T ( k ) R i ( k ) X ˜ U R i ( k ) ⟩ ) Q ˜ B N N i = T l z ( ⟨ X ˜ 2 S i ( k ) G R i T R i T ( k ) R i ( k ) G L i X ˜ 2 S i T ( k ) ⟩ ) Q ˜ C N N i = T l z ( ⟨ X ˜ 2 S i ( k ) G L i T R i T ( k ) R i ( k ) G R i X ˜ 2 S i T ( k ) ⟩ ) Q ˜ D N N i = T l z ( ⟨ X ˜ 2 S i ( k ) G L i T R i T ( k ) R i ( k ) G L i X ˜ 2 S i ( k ) ⟩ ) + T l z ( ⟨ X ˜ U L i ( k ) R i T ( k ) R i ( k ) X ˜ U L i ( k ) ⟩ ) E44

where

X ˜ 2 S i ( k ) = [ x ˜ 2 S i ( k ) , x ˜ 2 S i ( k − 1 ) , ⋯ x ˜ 2 S i ( k − P + 1 ) ] T X ˜ U R i ( k ) = [ x ˜ U R i ( k ) , x ˜ U R i ( k − 1 ) , ⋯ x ˜ U R i ( k − P + 1 ) ] X ˜ U L i ( k ) = [ x ˜ U L i ( k ) , x ˜ U L i ( k − 1 ) , ⋯ x ˜ U L i ( k − P + 1 ) ] E45

with

x ˜ 2 S i ( k ) = [ x S i ( k ) , x S i ( k − 1 ) , ⋯ x S i ( k − 2 p + 2 ) ] T x ˜ U R i ( k ) = [ x U R i ( k ) , x U R i ( k − 1 ) , ⋯ x U R i ( k − p + 1 ) ] T x ˜ U L i ( k ) = [ x U L i ( k ) , x U L i ( k − 1 ) , ⋯ x U L i ( k − p + 1 ) ] T E46

Applying matrix operations to Q 2 N 2 N i , a new matrix Q ′ 2 N 2 N i which has same determinant as Q 2 N 2 N i is given by

Q ′ 2 N 2 N i ( k ) = [ Q ′ A N N i ( k ) 0 0 Q ′ D N N i ( k ) ] E47

where

Q ′ A N N i = T l z ( Q ′ ′ A N N i ) , Q ′ D N N i = T l z ( Q ′ ′ D N N i ) E48

Since both X ˜ 2 S i ( k ) G R i T and X ˜ 2 S i ( k ) G L i T are symmetric PxP square matrixes, Q ′ ′ A N N i and Q ′ ′ B N N i are re-written as

Q ′ ′ A N N i = ⟨ X ˜ 2 S i ( k ) G R i T R i T ( k ) R i ( k ) G R i X ˜ 2 S i T ( k ) + X ˜ 2 S i ( k ) G L i T R i T ( k ) R i ( k ) G L i X ˜ 2 S i T ( k ) ⟩ + ⟨ X ˜ U R i ( k ) R i T ( k ) R i ( k ) X ˜ U R i T ( k ) ⟩ = ⟨ X ˜ 2 S i ( k ) ( G R i T G R i + G L i T G L i ) X ˜ 2 S i T ( k ) ⟩ R i T ( k ) R i ( k ) + ⟨ X ˜ U R i ( k ) X ˜ U R i T ( k ) ⟩ R i T ( k ) R i ( k ) = ( N σ X i 2 ( G R i G R i T + G L i G L i T ) + N σ N i 2 I P ) R i T ( k ) R i ( k ) Q ′ ′ D N N i = N σ N i 2 I P R i T ( k ) R i ( k ) E49

As evident by(47), (48) and(49), Q ′ 2 N 2 N i ( k ) is composed of major matrix Q ′ A N N i ( k ) and noise matrix Q ′ D N N i ( k ) . In the case of single talking where sound source signal power σ X 2 is much larger than un-correlated signal power σ N i 2 , R i T ( k ) R i ( k ) which minimizes eigenvalue spread in Q 2 N 2 N i ( k ) so as to attain the fastest convergence is given by making Q ′ ′ A N N i as a identity matrix by setting R i T ( k ) R i ( k ) as

R i T ( k ) R i ( k ) ≃ ( N σ X i 2 ( G R i G R i T + G L i G L i T ) ) − 1 E50

In other cases such as double talking or no talk situations, where we assume σ X 2 is almost zero, R i T ( k ) R i ( k ) which orthogonalizes Q ′ ′ A N N i is given by

R i T ( k ) R i ( k ) ≃ ( N σ N i 2 I P ) − 1 E51

Summarizing the above discussions, the fastest convergence is attained by setting R i T ( k ) R i ( k ) as

R i T ( k ) R i ( k ) = ( X P 2 N i T ( k ) X P 2 N i ( k ) ) − 1 E52

Since

⟨ X P 2 N i T ( k ) X P 2 N i ( k ) ⟩ = ⟨ [ G R i X 2 S i T ( k ) + X U R i T ( k ) G L i X 2 S i T ( k ) + X U L i T ( k ) ] [ X 2 S i ( k ) G R i T + X U R i ( k ) X 2 S i ( k ) G L i T + X U L i ( k ) ] ⟩ = ⟨ G R i X 2 S i T ( k ) X 2 S i ( k ) G R i T + G L i X 2 S i T ( k ) X 2 S i ( k ) G L i T + X U R i T ( k ) X U R i ( k ) + X U L i T ( k ) X U L i ( k ) ⟩ ≃ N σ X i 2 ( G R i G R i T + G L i G L i T ) + 2 N σ N i 2 I P E53

By substituting (52) for (30), we obtain following affine projection iteration :

h ^ S T i ( k + 1 ) = h ^ S T i ( k ) + α X i ( k ) ( X P 2 N i T ( k ) X P 2 N i ( k ) ) − 1 e P i ( k ) E54

In an actual implementation α is replaced by μ for forgetting factor and δ I is added to the inverse matrix to avoid zero division as shown bellow.

h ^ S T ( k + 1 ) = h ^ S T ( k ) + α X P 2 N i (k)[ X P 2 N i T (k) X P 2 N i (k) + δ I ] − 1 μ e P i ( k ) E55

where δ ( ≪ 1 ) is very small positive value and

μ = d i a g [ 1 , ( 1 − μ ) , ⋯ , ( 1 − μ ) p − 1 ] E56

The method can be intuitively understood using geometrical explanation in Fig. 3. As seen here, from a estimated coefficients in a k-1th plane a new direction is created by finding the nearest point on the i th plane in the case of traditional NLMS approach. On the other hand, affine projection creates the best direction which targets a location included in the both i-1 and i th plane.

Figure 3.
Very Simple Example for Affine Method

3.4. Stereo Normalized Least Mean Square (NLMS) method (M=1)

Stereo NLMS method is a case when M=1 of the gradient method.

Equation (54) is re-written when M =1 as

h ^ S T i ( k + 1 ) = h ^ S T i ( k ) + α x i ( k ) ( x R i T ( k ) x R i ( k ) + x L i T ( k ) x L i ( k ) ) − 1 e i ( k ) E57

It is well known that convergence speed of (57) depends on the smallest and largest eigen-value of the matrix Q 2 N 2 N i . In the case of the stereo generation model in Fig.2 for single talking with small right and left noises, we obtain following determinant of Q 2 N 2 N i for M=1 as

If eigenvalue of g R i g R i T + g L i g L i T are given as

| ( g R i g R i T + g L i g L i T ) | = λ min i 2 ⋯ λ max i 2 E59

where λ min i 2 and λ max i 2 are the smallest and largest eigenvalues, respectively.

| Q 2 N 2 N i ( k ) | is given by assuming un-correlated noise power σ N i 2 is very small ( σ N i 2 ≪ λ min i 2 ) as

| Q 2 N 2 N i ( k ) | = ( g R i T g R i + g L i T g L i ) − 1 ⋅ σ N i 2 ⋯ σ N i 2 ⋅ λ min i 2 ⋯ λ max i 2 E60

Hence, it is shown that stereo NLMS echo-canceller’s convergence speed is largely affected by the ratio between the largest eigenvalue of g R i g R i T + g L i g L i T and non-correlated signal power σ N i 2 . If the un-correlated sound power is very small in single talking, the stereo NLMS echo canceller’s convergence speed becomes very slow.

3.5. Double adaptive filters for Rapid Projection (WARP) method

Naming of the WARP is that this algorithm projects the optimum solution between monaural space and stereo space. Since this algorithm dynamically changes the types of adaptive filters between monaural and stereo observing sound source characteristics, we do not need to suffer from rank drop problem caused by strong cross-channel correlation in stereo sound. The algorithm was originally developed for the acoustic echo canceller in a pseudo-stereo system which creates artificial stereo effect by adding delay and/or loss to a monaural sound. The algorithm has been extended to real stereo sound by introducing residual signal after removing the cross-channel correlation.

In this section, it is shown that WARP method is derived as an extension of affine projection which has been shown in 3.3.

By introducing error matrix E i ( k ) which is defined by

E i ( k ) = [ e P i ( k ) e P i ( k - 1 ) ⋯ e P i ( k - p + 1 ) ] E61

iteration of the stereo affine projection method in (54) is re-written as

H ^ S T i ( k + 1 ) = H ^ S T i ( k ) + α X P 2 N i ( k ) ( X P 2 N i T ( k ) X P 2 N i ( k ) ) − 1 E i ( k ) E62

where

H ^ S T i ( k ) = [ h ^ S T i ( k ) h ^ S T i ( k - 1 ) ⋯ h ^ S T i ( k - p + 1 ) ] E63

In the case of strict single talking, following assumption is possible in the i t h LTI period by (53)

⟨ X P N i T ( k ) X P N i ( k ) ⟩ ≅ G R R L L i E64

where G R R L L i is a PxP symmetric matrix as

G R R L L i = N σ X i 2 ( G R i G R i T + G L i G L i T ) E65

By assuming G R R L L i as a regular matrix, (62) can be re-written as

H ^ S T i ( k + 1 ) G R R L L i = H ^ S T i ( k ) G R R L L i + α X P 2 N i ( k ) E i ( k ) E66

Re-defining echo path estimation matrix H ^ S T i ( k ) by a new matrix H ′ ^ S T i ( k ) which is defined by

H ^ ′ S T i ( k ) = H ^ S T i ( k ) G R R L L i E67

(66) is re-written as

H ^ ′ S T i ( k + 1 ) = H ^ ′ S T i ( k ) + α X P 2 N i ( k ) E i ( k ) E68

Then the iteration is expressed using signal matrix X 2 S i ( k ) as

H ^ ′ S T i ( k + 1 ) = H ^ ′ S T i ( k ) + α [ X 2 S i ( k ) G R i T + X U R i ( k ) X 2 S i ( k ) G L i T + X U L i ( k ) ] E i ( k ) E69

In the case of strict single talking where no un-correlated signals exist, and if we can assume G L i is assumed to be an output of a LTI system G R L i which is PxP symmetric regular matrix with input G R i , then (69) is given by

[ H ^ ′ S T R i ( k + 1 ) H ^ ′ S T L i ( k + 1 ) ] = [ H ^ ′ S T R i ( k ) H ^ ′ S T L i ( k ) ] + α [ X 2 S i ( k ) G R i E i ( k ) X 2 S i ( k ) G R i G i E i ( k ) ] [ H ^ ′ S T R i ( k + 1 ) H ^ ′ S T L i ( k + 1 ) G R L i − 1 ] = [ H ^ ′ S T R i ( k ) H ^ ′ S T L i ( k ) G R L i − 1 ] + α [ X 2 S i ( k ) G R i E i ( k ) X 2 S i ( k ) G R i E i ( k ) ] E70

It is evident that rank of the equation in (70) is N not 2N, therefore the equation becomes monaural one by subtracting the first law after multiplying ( G R L i ) − 1 from the second low as

H ^ M O N R L i ( k + 1 ) = H ^ M O N R L i ( k ) + 2 α X R i ( k ) E i ( k ) E71

where

H ^ M O N R L i ( k ) = H ^ ′ S T R i ( k ) + H ^ ′ S T L i ( k ) G R L i − 1 E72

or assuming G R i = G L i G L R i

H ^ M O N L R i ( k + 1 ) = H ^ M O N L R i ( k ) + 2 α X L i ( k ) E i ( k ) E73

where

H ^ M O N R L i ( k ) = H ^ ′ S T L i ( k ) + H ^ ′ S T R i ( k ) G L R i − 1 E74

Selection of the iteration depends on existence of the inverse matrix G R L i − 1 or G L R i − 1 and the detail is explained in the next section.

By substituting (67) to (72) and (74), we obtain following equations;

H ^ M O N R L i ( k ) = H ^ S T R i ( k ) G R R L L i + H ^ S T L i ( k ) G R R L L i G R L i − 1 E75

or

H ^ M O N L R i ( k ) = H ^ S T R i ( k ) G R R L L i G L R i − 1 + H ^ S T L i ( k ) G R R L L i E76

From the stereo echo path estimation view point, we can obtain H ^ M O N R L i ( k ) or H ^ M O N L R i ( k ) , however we can’t identify right and left echo path estimation from the monaural one. To cope with this problem, we use two LTI periods for separating the right and left estimation results as

[ H ^ T M O N L R i H ^ T M O N L R i − 1 ] = [ G R R L L i T G R R L L i G R L i − 1 G R R L L i i − 1 T G R R L L i − 1 G R L i − 1 − 1 ] [ H ^ T S T R i H ^ S T L i ] ⋯ G R L i and G R L i − 1 are regular matrix [ H ^ T M O N L R i H ^ T M O N L R i − 1 ] = [ G R L R L i T G L R i − 1 G R R L L i G R R L L i − 1 T G L R i − 1 − 1 G R R L L i − 1 ] [ H ^ T S T R i H ^ S T L i ] ⋯ G L R i and G L R i − 1 are regular matrix [ H ^ T M O N L R i H ^ T M O N L R i − 1 ] = [ G R R L L i T G R R L L i G R L i − 1 G R R L L i − 1 T G L R i − 1 − 1 G R R L L i − 1 ] [ H ^ T S T R i H ^ S T L i ] ⋯ G R L i and G L R i − 1 are regular matrix [ H ^ T M O N L R i H ^ T M O N L R i − 1 ] = [ G R R L L i T G L R i − 1 G R R L L i G R R L L i − 1 T G R R L L i − 1 G R L i − 1 − 1 ] [ H ^ T S T R i H ^ S T L i ] ⋯ G L R i and G R L i − 1 are regular matrix E77

where H ^ M O N L R i and H ^ M O N L R i − 1 are monaural echo canceller estimation results at the end of each LTI period, H ^ S T R i and H ^ S T L i are right and left estimated stereo echo paths based on the i − 1 t h and i t h LTI period’s estimation results.

Equation (77) is written simply as

H ^ M O N i , i − 1 = W i − 1 H ^ S T i E78

where H ^ T M O N R L i j is estimation result matrix for the i − 1 t h and i t h LTI period’s as

H ^ M O N i , i − 1 = [ H ^ T M O N R L i H ^ T M O N R L i − 1 ] E79

H ^ T S T i is stereo echo path estimation result as

H ^ S T i = [ H ^ T S T R i H ^ T S T L i ] E80

W i − 1 is a matrix which projects stereo estimation results to two monaural estimation results and is defined by

W i − 1 = { [ G R R L L i T G R R L L i G R L i − 1 G R R L L i − 1 T G R R L L i − 1 G R L i − 1 − 1 ] ⋯ G R L i and G R L i − 1 are regular matrix [ G R R L L i T G L R i − 1 G R R L L i G R R L L i − 1 T G R L R i − 1 − 1 G R R L L i − 1 ] ⋯ G L R i and G L R i − 1 are regular matrix [ G R R L L i T G R R L L i G R L i − 1 G R L i − 1 T G L R i − 1 − 1 G R R L L i − 1 ] ⋯ G R L i and G L R i − 1 are regular matrix [ G R R L L i T G L R i − 1 G R R L L i G R R L L i − 1 T G R R L L i − 1 G R L i − 1 − 1 ] ⋯ G L R i and G R L i − 1 are regular matrix E81

By swapping right side hand and left side hand in(78), we obtain right and left stereo echo path estimation using two monaural echo path estimation results as

H ^ S T i = W i H ^ M O N i , i − 1 E82

Since W i − 1 and W i are used to project optimum solutions in two monaural spaces to corresponding optimum solution in a stereo space and vice-versa, we call the matrixes as WARP functions. Above procedure is depicted in Fig. 4. As shown here, the WARP system is regarded as an acoustic echo canceller which transforms stereo signal to correlated component and un-correlated component and monaural acoustic echo canceller is applied to the correlated signal. To re-construct stereo signal, cross-channel correlation recovery matrix is inserted to echo path side. Therefore, WARP operation is needed at a LTI system change.

Figure 4.
Basic Principle for WARP method

In an actual application such as speech communication, the auto-correlation characteristics G R R L L i varies frequently corresponding speech characteristics change, on the other hand the cross-channel characteristics G R L i or G L R i changes mainly at a far-end talker change. So, in the following discussions, we apply NLMS method as the simplest affine projection (P=1).

The mechanism is also intuitively understood by using simple vector planes depicted in Fig. 5.

Figure 5.
Very Simple Example for WARP method

As shown here, using two optimum solutions in monaural spaces (in this case on the lines) the optimum solution located in the two dimensional (stereo) space is calculated directly.

4. Realization of WARP

4.1. Simplification by assuming direct-wave stereo sound

Both stereo affine projection and WARP methods require P x P inverse matrix operation which needs to consider its high computation load and stability problem. Even though the WARP operation is required only when the LTI system changes such as far-end talker change and it is much smaller computation than inverse matrix operations for affine projection which requires calculations in each sample, simplification of the WARP operation is still important. This is possible by assuming that target stereo sound is composed of only direct wave sound from a talker (single talker) as shown in Fig. 6.

Figure 6.
Stereo Sound Generation System for Single Talking

In figure 6, a single sound source signal at an angular frequency ω in the i t h LTI period, x S i ( ω ) , becomes a stereo sound composed of right and left signals, x R i ( ω ) and x L i ( ω ) , through out right and left LTI systems, g S R i ( ω ) and g S L i ( ω ) with additional un-correlated noise x U R i ( ω ) and x U L i ( ω ) as

x R i ( ω ) = g S R i ( ω ) x S i ( ω ) + x U R i ( ω ) x L i ( ω ) = g S L i ( ω ) x S i ( ω ) + x U L i ( ω ) E83

In the case of simple direct-wave systems, (83) can be re-written as

x R i ( ω ) = l R i e − j ω τ R i x S i ( ω ) + x U R i ( ω ) x L i ( ω ) = l L i e − j ω τ L i x S i ( ω ) + x U L i ( ω ) E84

where l R i and l L i are attenuation of the transfer functions and τ R i and τ L i are analog delay values.

Since the right and left sounds are sampled by f S ( = ω S / 2 π ) Hz and treated as digital signals, we use z- domain notation instead of ω -domain as

z = exp [ 2 π ω j / ω s ] E85

In z-domain, the system in Fig.4 is expressed as shown in Fig. 7.

As shown in Fig.7, the stereo sound generation model for x i ( z ) is expressed as

x i ( z ) = [ x R i ( z ) x L i ( z ) ] = [ g S R i ( z ) x S i ( z ) + x U R i ( z ) g S L i ( z ) x S i ( z ) + x U L i ( z ) ] E86

where x R i ( z ) , x L i ( z ) , g S R i ( z ) , g S L i ( z ) , x S i ( z ) , x U R i ( z ) and x U L i ( z ) are z-domain expression of the band-limited sampled signals corresponding to x R i ( ω ) , x L i ( ω ) , g S R i ( ω ) , g S L i ( ω ) , x U R i ( ω ) and x U L i ( ω ) , respectively. Adaptive filer output y ^ i ( z ) and microphone output y i ( z ) at the end of i t h LTI period is defined as

y ^ i ( z ) = h ^ i T ( z ) x i ( z ) y i ( z ) = h T ( z ) x i ( z ) + n i ( z ) E87

where n i ( z ) is a room noise, h ^ i ( z ) and h ^ i ( z ) are stereo adaptive filter and stereo echo path characteristics at the end of i t h LTI period respectively and which are defined as

H ^ S T i ( z ) = [ h ^ R i ( z ) h ^ L i ( z ) ] , H S T ( z ) = [ h R ( z ) h L ( z ) ] E88

Then cancellation error is given neglecting near end noise by

e i ( z ) = y i ( z ) − H ^ S T i T ( z ) x i ( z ) E89

In the case of single talking, we can assume both x U R i ( z ) and x U L i ( z ) are almost zero, and (89) can be re-written as

e i ( z ) = y i ( z ) − ( g S R i ( z ) h ^ R i ( z ) + g S L i ( z ) h ^ L i ( z ) ) x S i ( z ) E90

Since the acoustic echo can also be assumed to be driven by single sound source x S i ( z ) , we can assume a monaural echo path h M o n o i ( z ) as

h M o n o i ( z ) = g S R i ( z ) h R ( z ) + g S L i ( z ) h L ( z ) E91

Then (90) is re-written as

e i ( z ) = ( h M o n o i ( z ) − ( g S R i ( z ) h ^ R i ( z ) + g S L i ( z ) h ^ L i ( z ) ) ) x S i ( z ) E92

This equation implies we can adopt monaural adaptive filter by using a new monaural quasi-echo path h ^ M o n o i ( z ) as

h ^ M o n o i ( z ) = g S R i ( z ) h ^ R i ( z ) + g S L i ( z ) h ^ L i ( z ) E93

However, it is also evident that if LTI system changes both echo and quasi-echo paths should be up-dated to meet new LTI system. This is the same reason for the stereo echo canceller in the case of pure single talk stereo sound input. If we can assume the acoustic echo paths is time invariant for two adjacent LTI periods, this problem is easily solved by satisfying require rank for solving the equation as

[ h ^ M o n o i ( z ) h ^ M o n o i − 1 ( z ) ] = W i − 1 ( z ) [ h ^ R i ( z ) h ^ L i ( z ) ] E94

where

W i − 1 ( z ) = [ g S R i ( z ) g S L i ( z ) g S R i − 1 ( z ) g S L i − 1 ( z ) ] E95

In other words, using two echo path estimation results for corresponding two LTI periods, we can project monaural domain quasi-echo path to stereo domain quasi echo path or vice -versa using WARP operations as

H ^ S T i ( z ) = W i ( z ) H ^ M o n o i ( z ) H ^ M o n o i ( z ) = W i − 1 ( z ) H ^ S T i ( z ) E96

where

H ^ M o n o i ( z ) = [ h ^ M o n o i ( z ) h ^ M o n o i − 1 ( z ) ] , H ^ S T i ( z ) = [ h ^ R i ( z ) h ^ L i ( z ) ] E97

In actual implementation, it is impossible to obtain real W i ( z ) , which is composed of unknown transfer functions between a sound source and right and left microphones, so use one of the stereo sounds as a single talk sound source instead of a sound source. Usually, higher level sound is chosen as a pseudo-sound source because higher level sound is usually closer to one of the microphones. Then, the approximated WARP function W ˜ i ( z ) is defined as

W ˜ i ( z ) = { [ 1 g R L i ( z ) 1 g R L i - 1 ( z ) ] ⋯ R R − T r a n s i t i o n [ 1 g R L i ( z ) g L R i - 1 ( z ) 1 ] ⋯ R L − T r a n s i t i o n [ g L R i ( z ) 1 1 g R L i - 1 ( z ) ] ⋯ L R − T r a n s i t i o n [ g L R i ( z ) 1 g L R i - 1 ( z ) 1 ] ⋯ L L − T r a n s i t i o n E98

where g R L i ( z ) and g L R i ( z ) are cross-channel transfer functions between right and left stereo sounds and are defined as

g R L i ( z ) = g S L i ( z ) / g S R i ( z ) , g L R i ( z ) = g S R i ( z ) / g S L i ( z ) E99

The RR, RL, LR and LL transitions in (98) mean a single talker’s location changes. If a talker’ location change is within right microphone side (right microphone is the closest microphone) we call RR-transition and if it is within left-microphone side (left microphone is the closest microphone) we call LL-transition. If the location change is from right-microphone side to left microphone side, we call RL-transition and if the change is opposite we call LR-transition. Let’s assume ideal direct-wave single talk case. Then the ω domain transfer functions, g R L i ( ω ) and g L R i ( ω ) are expressed in z-domain as

g R L i ( z ) = l R L i φ ( δ R L i , z ) z − d R L i , g L R i ( z ) = l L R i φ ( δ L R i , z ) z − d L R i E100

where δ R L i , , and δ L R i , are fractional delays and d R L i and d L R i are integer delays for the direct-wave to realize analog delays τ R L i and τ L R i , these parameters are defined as

d R L i = I N T [ τ R L i f S ] . d L R i = I N T [ τ L R i f S ] , δ R L i , = M o d [ τ R L i f S ] , δ L R i , = M o d [ τ L R i f S ] E101

φ ( δ , z ) is a “Sinc Interpolation” function to interpolate a value at a timing between adjacent two samples and is given by

φ ( δ , z ) = ∑ ν = − ∞ ∞ sin ( π ν − δ ) ( π ν − δ ) z − ν E102

4.2. Digital filter realization of WARP functions

Since LL-transition and LR transition are symmetrical to RR-transition and RL-transition respectively, Only RR and RL transition cases are explained in the following discussions. By solving (96) applying WARP function in(98), we obtain right and left stereo echo path estimation functions as

h ^ R i ( z ) = h ^ M o n o i ( z ) − h ^ M o n o i − 1 ( z ) g R L i − 1 ( z ) − g R L i ( z ) h ^ L i ( z ) = g R L i − 1 ( z ) h ^ M o n o i ( z ) − g R L i ( z ) h ^ M o n o i − 1 ( z ) g R L i − 1 ( z ) − g R L i ( z ) ⋯ R R − T r a n s i t i o n E103

or

h ^ R i ( z ) = h ^ M o n o i ( z ) − g R L i − 1 ( z ) h ^ M o n o i − 1 ( z ) 1 − g L R i ( z ) g R L i − 1 ( z ) h ^ L i ( z ) = h ^ M o n o i − 1 ( z ) − g L R i ( z ) h ^ M o n o i ( z ) 1 − g L R i ( z ) g R L i − 1 ( z ) ⋯ R L − T r a n s i t i o n E104

By substituting (100) for (104), we obtain

h ^ R i ( z ) = h ^ M o n o i ( z ) − h ^ M o n o i − 1 ( z ) l R L i − 1 φ ( δ R L i − 1 , z ) z − d R L i − 1 − l R L i φ ( δ R L i , z ) z − d R L i h ^ L i ( z ) = l R L i − 1 φ ( δ R L i − 1 , z ) z − d R L i − 1 h ^ M o n o i ( z ) − l R L i φ ( δ R L i , z ) z − d R L i h ^ M o n o i − 1 ( z ) l R L i − 1 φ ( δ R L i − 1 , z ) z − d R L i − 1 − l R L i φ ( δ R L i , z ) z − d R L i ⋯ R R − T r a n s i t i o n E105

and

h ^ R i ( z ) = h ^ M o n o i ( z ) − l R L i − 1 φ ( δ R L i − 1 , z ) z − d R L i − 1 h ^ M o n o i − 1 ( z ) 1 − l L R i φ ( δ L R i , z ) l R L i − 1 φ ( δ R L i − 1 , z ) z − ( d R L i − 1 + d L R i ) h ^ L i ( z ) = h ^ M o n o i − 1 ( z ) − l L R i φ ( δ L R i , z ) z − d L R i h ^ M o n o i ( z ) 1 − l L R i φ ( δ L R i , z ) l R L i − 1 φ ( δ R L i − 1 , z ) z − ( d R L i − 1 + d L R i ) ⋯ R L − T r a n s i t i o n E106

Since φ ( δ , z ) is an interpolation function for a delay δ , the delay is compensated by φ ( − δ , z ) as

φ ( − δ , z ) ⋅ φ ( δ , z ) = 1. E107

From(107), (105) is re-written as

h ^ R i ( z ) = ( h ^ M o n o i ( z ) − h ^ M o n o i − 1 ( z ) ) l R L i − 1 − 1 φ ( − δ R L i − 1 , z ) z d R L i − 1 1 − ( l R L i l R L i − 1 − 1 ) φ ( − δ R L i − 1 , z ) φ ( δ R L i , z ) z − ( d R L i − d R L i − 1 ) h ^ L i ( z ) = h ^ M o n o i ( z ) − l R L i l R L i − 1 − 1 φ ( δ R L i , z ) φ ( − δ R L i − 1 , z ) z − d R L i + d R L i − 1 h ^ M o n o i − 1 ( z ) 1 − ( l R L i l R L i − 1 − 1 ) φ ( − δ R L i − 1 , z ) φ ( δ R L i , z ) z − ( d R L i − d R L i − 1 ) ⋯ R R − T r a n s i t i o n E108

These functions are assumed to be digital filters for the echo path estimation results as shown in Fig.8.

Figure 8.
Digital Filter Realization for WARP Functions

4.3. Causality and stability of WARP functions

Stability conditions are obtained by checking denominator of (108) and(106) D R R i ( z ) and D R L i ( z ) which are defined as

| D R R i ( z ) | 1 ⋯ R R − T r a n s i t i o n | D R L i ( z ) | 1 ⋯ R L − T r a n s i t i o n E109

where

D R R i ( z ) = l R L i l R L i − 1 − 1 φ ( − δ R L i − 1 , z ) φ ( δ R L i , z ) z − ( d R L i − d R L i − 1 ) ⋯ R R − T r a n s i t i o n D R L i ( z ) = l L R i l R L i − 1 φ ( δ L R i , z ) φ ( δ R L i − 1 , z ) z − ( d R L i − 1 + d L R i ) ⋯ R L − T r a n s i t i o n E110

From(109),

By using numerical calculations,

| φ ( δ , z ) | 1.2 E112

Substituting (112) for (109),

l R L i l R L i − 1 − 1 1 / 1.44 ⋯ R R − T r a n s i t i o n l L R i l R L i − 1 1 / 1.44 ⋯ R L − T r a n s i t i o n E113

Secondly, conditions for causality are given by checking the delay of the feedback component of the denominators D R R i ( z ) and D R L i ( z ) . Since convolution of a “Sinc Interpolation” function is also a “Sinc Interpolation” function as

φ ( δ A , z ) ⋅ φ ( δ B , z ) = φ ( δ A + δ B , z ) . E114

Equation (110) is re-written as

D R R i ( z ) = l R L i l R L i − 1 − 1 φ ( δ R L i , − δ R L i − 1 , z ) z − ( d R L i − d R L i − 1 ) ⋯ R R − T r a n s i t i o n D R L i ( z ) = l L R i l R L i − 1 φ ( δ L R i + δ R L i − 1 , z ) z − ( d R L i − 1 + d L R i ) ⋯ R L − T r a n s i t i o n E115

The “Sinc Interpolation” function is an infinite sum toward both positive and negative delays. Therefore it is essentially impossible to endorse causality. However, by permitting some errors, we can find conditions to maintain causality with errors. To do so, we use a “Quasi-Sinc Interpolation” function which is defined as

φ ˜ ( δ , z ) = ∑ ν = − N F + 1 N F sin ( π ν − δ ) ( π ν − δ ) z − ν E116

where 2 N F is a finite impulse response range of the “Quasi-Sinc Interpolation” φ ˜ ( δ , z ) . Then the error power by the approximation is given as

∮ φ ˜ ( δ , z ) φ ˜ * ( δ , z ) d z = ∑ ν = − ∞ − N F sin 2 ( π ν − δ ) ( π ν − δ ) 2 z − ν + ∑ ν = N F + 1 ∞ sin 2 ( π ν − δ ) ( π ν − δ ) 2 z − ν E117

Equation (116) is re-written as

φ ˜ ( δ , z ) = ∑ ν = 0 2 N F − 1 sin ( π ν − δ ) ( π ν − δ ) z − ν − N F + 1 E118

By substituting (118) for (115),

D R R i ( z ) ≃ l R L i l R L i − 1 − 1 φ ˜ ( δ R L i , − δ R L i − 1 , z ) z − ( d R L i − d R L i − 1 − N F + 1 ) ⋯ R R − T r a n s i t i o n D R L i ( z ) ≃ l L R i l R L i − 1 φ ˜ ( δ L R i + δ R L i − 1 , z ) z − ( d R L i − 1 + d L R i − N F + 1 ) ⋯ R L − T r a n s i t i o n E119

Then conditions for causality are

d R L i − d R L i − 1 ≥ N F − 1 ⋯ R R − T r a n s i t i o n d R L i − 1 + d L R i ≥ N F − 1 ⋯ R L − T r a n s i t i o n E120

The physical meaning of the conditions are the delay difference due to talker’s location change should be equal or less than cover range of the “Quasi-Sinc Interpolation” φ ˜ ( δ , z ) in the case of staying in the same microphone zone and the delay sun due to talker’s location change should be equal or less than cover range of the “Quasi-Sinc Interpolation” φ ˜ ( δ , z ) in the case of changing the microphone zone.

4.4. Stereo echo canceller using WARP

Total system using WARP method is presented in Fig. 9, where the system composed of five components, far-end stereo sound generation model, cross-channel transfer function (CCTF) estimation block, stereo echo path model, monaural acoustic echo canceller (AEC-I) block, stereo acoustic echo canceller (AEC-II) block and WARP block.

Figure 9.
System Configuration for WARP based Stereo Acoustic Echo Canceller

As shown in Fig.9, actual echo cancellation is done by stereo acoustic echo canceller (AEC-II), however, a monaural acoustic echo canceller (AEC-I) is used for the far-end single talking. The WARP block is active only when the cross-channel transfer function changes and it projects monaural echo chancellor echo path estimation results for two LTI periods to one stereo echo path estimation or vice-versa.

5. Computer simulations

5.1. Stereo sound generation model

Computer simulations are carried out using the stereo generation model shown in Fig.10 for both white Gaussian noise (WGN) and an actual voice. The system is composed of cross-channel transfer function estimation blocks (CCTF), where all signals are assumed to be sampled at f S = 8 K H z after 3.4kHz cut-off low-pass filtering. Frame length is set to 100 samples. Since the stereo sound generation model is essentially a continuous time signal system, over-sampling (x6, f A = 48 K H z ) is applied to simulate it. In the stereo sound generation model, three far-end talker’s locations, A Loc(1)=(-0.8,1.0), B Loc(2)=(-0.8,0.5), C Loc(3)=(-0.8,0.0), D Loc(4)=(-0.8,-0.5) and D Loc(5)=(-0.8,-1.0) are used and R/L microphone locations are set to R-Mic=(0,0.5) and L-Mic=(0,-0.5), respectively. Delay is calculated assuming voice wave speed as 300m/sec. In this set-up, talker’s position change for WGN is assumed to be from location A to location B and finally to location D, in which each talker stable period is set to 80 frames. The position change for voice is from C->A and the period is set to 133 frames. Both room noise and reverberation components in the far-end terminals is assumed, the S/N is set to 20dB ~ 40dB.

Figure 10.
Stereo Sound Generation Model and Cross-Channel Transfer Function Detector

5.2. Cross-channel transfer function estimation

In WARP method, it is easily imagine that the estimation performance of the cross-channel transfer function largely affects the echo canceller cancellation performances. To clarify the transfer function estimation performance, simulations are carried out using the cross-channel transfer function estimators (CCTF). The estimators are prepared for right microphone side sound source case and left microphone side sound source case, respectively. Each estimator has two NLMS adaptive filters, longer (128) tap one and shorter (8) tap one. The longer tap adaptive filter (AF1) is used to find a main tap and shorter one (AF2) is used to estimate the transfer function precisely as an impulse response.

Figure 11 shows CCTF estimation results as the AF1 tap coefficients after convergence setting single male voice sound source to the locations C, B and A in Fig. 11. Detail responses obtained by AF2 are shown in Fig. 12.As shown the results, the CCTF estimation works correctly in the simulations.

Figure 11.
Impulse Response Estimation Results in CCTF Block

Figure 12.
Estimated Tap Coefficients by Short Tap Adaptive Filter in CCTF Estimation Block

Figure 13.
Cross-Channel Correlation Cancellation Performances

Cancellation performances of the cross-channel correlation under room noise (WGN) are obtained using the adaptive filter (AF2) and are shown is Fig. 13, where S/N is assumed to be 20dB, 30dB and 40dB. In the figure C L R L ( d B ) is power reduction in dB which is observed by the signal power before and after cancellation of the cross-channel correlation by AF2. As shown here, more than 17dB cross-channel correlation cancellation is attained.

5.3. Echo canceller performances

To evaluate echo cancellation performances of the WARP acoustic echo canceller which system is shown in Fig. 10, computer simulations are carried out assuming 1000tap NLMS adaptive filters for both stereo and monaural echo cancellers. The performances of the acoustic echo canceller are evaluated by two measurements. The first one is the echo return loss enhancement E R L E i j ( d B ) , which is applied to the WGN source case and is defined as

ERLE L ⋅ ( i − 1 ) + j − 1 = { 10 log 10 ( ∑ k = 0 N F − 1 y 2 i , j , k / ∑ k = 0 N F − 1 e M O N 2 i , j , k ) ⋯ M o n a u r a l E c h o C a n c e l l e r 10 log 10 ( ∑ k = 0 N F − 1 y 2 i , j , k / ∑ k = 0 N F − 1 e S T 2 i , j , k ) ⋯ S t e r e o E c h o C a n c e l l e r E121

where e M O N i , j , k and e M O N i , j , k are residual echo for the monaural echo canceller (AEC-I) and stereo echo canceller (AEC-II) for the k t h sample in the j t h frame in the i t h LTI period, respectively. The second measurement is normalized misalignment of the estimated echo paths and are defined as

NORM L ⋅ ( i − 1 ) + j − 1 = 10 log 10 ( ( h R ) T ( h R ) + ( h L ) T ( h L ) ( h R − h ^ R i , j ) T ( h L − h ^ L i , j ) ) E122

where h ^ R i , j and h ^ L i , j are stereo echo canceller estimated coefficient arrays at the end of ( i , j ) t h frame, respectively. h R and h L are target stereo echo path impulse response arrays, respectively.

5.3.1. WARP echo canceller basic performances for WGN

The simulation results for WARP echo canceller in the case of WGN sound source, no far-end double talking and no local noise, are shown in Fig. 14. In the simulations, talker is assumed to move from A to E every 80 frames (1sec). In Fig.14, the results (a) and (b) show ERLEs for monaural and stereo acoustic echo cancellers (AEC-I and AEC-II), respectively.

Figure 14.
WARP Echo Cancellation Performances for WGN Source

The WARP operations are applied at the boundaries of the three LTI periods for the talkers C, D and E. NORM for the stereo echo canceller (AEC-II). As shown here, after two LTI periods (A, B periods), NORM and ERLE improves quickly by WARP projection at WARP timings in the Fig. 16. As for ERLE, stereo acoustic echo canceller shows better performance than monaural echo canceller. This is because the monaural echo canceller estimates an echo path model which is combination of CCTF and real stereo echo path and therefore the performance is affected by the CCTF estimation error. On the other hand, the echo path model for the stereo echo canceller is purely the stereo echo path model which does not include CCTF.

Figure 15.
Echo Cancellation Performance Comparison for WGN Source

Secondary, the WARP acoustic echo canceller is compared with a stereo echo canceller based on an affine projection method. In this case, the right and left sounds at k t h sample in the ( i , j ) t h frame, x ′ R i j k and x ′ L i j k , are assumed to have independent level shift to the original right and left sounds, x R i j k and x L i j k , for simulating small movement of talker’s face as

x ′ R i j k = ( 1 + α L e v e l ⋅ sin ( 2 π k / ( f s ⋅ T ) X ) ) x R i j k x ′ L i j k = ( 1 + α L e v e l ⋅ cos ( 2 π k / ( f s ⋅ T ) X ) ) x L i j k E123

where α L e v e l and T X are constants which determine the level shift ratio and cycle. Figure 15 shows the cancellation performances when α L e v e l and T X are 10% and 500msec, respectively.

Figure 16.
WARP Echo Canceller Performances Affected by Far-End Back Ground Noise

In Fig. 15, the WARP method shows more than 10dB better stereo echo path estimation performance, NORM, than that of affine projection (P=3). ERLE by stereo echo canceller base on WARP method is also better than affine projection (P=3). ERLE by monaural acoustic echo canceller based on WARP method is somehow similar cancellation performance as affine method (P=3), however ERLE improvement after two LTI periods by the WARP based monaural echo canceller is better than affine based stereo echo canceller.

Figure 16 shows the echo canceller performances in the case of CCTF estimation is degraded by room noise in the far-end terminal. S/N in the far-end terminal is assumed to be 30dB or 50dB. Although the results clearly show that lower S/N degrade ERLR or NORM, more than 15dB ERLE or NORM is attained after two LTI periods.

Figure 17 shows the echo canceller performances in the case of echo path change happens. In this simulation, echo path change is inserted at 100frame. The echo path change is chosen 20dB, 30dB and 40dB. It is observed that echo path change affects the WARP calculation and therefore WARP effect degrades at 2^nd and third LTI period boundary.

Figure 17.
WARP Echo Canceller Cancellation Performance Drops Due to Echo Path Chance

Figure 18 summarizes NORM results for stereo NLMS method, affine projection method as WARP method. In this simulation, as a non-linear function for affine projection, independent absolute values of the right and left sounds are added by

x ′ R i j k = x R i j k + 0.5 ⋅ α A B S ⋅ ( x R i j k + | x R i j k | ) x ′ L i j k = x L i j k + 0.5 ⋅ α A B S ⋅ ( x L i j k − | x L i j k | ) E124

where α A B S is a constant to determine non-liner level of the stereo sound and is set to 10%. In this simulation, an experiment is carried out assuming far-end double talking, where WGN which power is same as far-end single talking is added between 100 and 130 frames.

As evident from the results in Fig. 18, WARP method shows better performances for the stereo echo path estimation regardless far-end double talking existence. Even in the case 10% far end signal level shit, WARP method attains more than 20% NORM comparing affine method (P=3) with 10% absolute non-linear result.

Figure 18.
Echo Path Estimation Performance Comparison for NLMS, Affine and WARP Methods

5.3.2. WARP echo canceller basic performances for voice

Figure 19 shows NORM and residual echo level (Lres) for actual male voice sound source. Since voice sound level changes frequently, we calculate residual echo level Lres (dB) instead of ERLE(dB) for white Gaussian noise case. Although slower NORM and Lres convergence than white Gaussian is observed, quick improvement for the both metrics is observed at the talker B and A border. In this simulation, we applied 500 tap NLMS adaptive filter. Affine projection may give better convergence speed by eliminating auto-correlation in the voice, however it is independent effect from WARP effect. WARP and affine projection can be used together and may contribute to convergence speed up independently.

Figure 19.
Residual Echo (Lres (dB) Level and Normalized Estimated Echo Misalignment.(NORM) for the voice Source at Far-end Terminal S/N=30dB. (Level shift 0, 500tap Step gain=1.0)

6. Conclusions

In this chapter stereo acoustic echo canceller methods are studied from cross-channel correlation view point aiming at conversational DTV use. Among many stereo acoustic echo cancellers, we focused on AP (including LS and stereo NLMS methods) and WARP methods, since these approaches do not cause any modification nor artifacts to speaker output stereo sound which is not desirable consumer audio-visual products such as DTV. In this study, stereo sound generation system is modeled by using right and left Pth order LTI systems with independent noises. Stereo LS method (M=2P) and stereo NLMS method (M=P=1) are two extreme cases of general AP method which requires MxM inverse matrix operation in each sample. Stereo AP method (M=P) can produce the best iteration direction fully adopting un-correlated component produced by small fluctuation in the stereo cross-channel correlation by calculating PxP inverse matrix operations in each sample. Major problem of the method is that it cannot cope with strict single talking where no un-correlated signals exist in right and left channels and therefore rank drop problem happens. Contrary to AP method, WARP method creates a stereo echo path estimation model applying a monaural adaptive filter for two LTI periods at a chance of far-end talker change. Since it creates stereo echo path estimation using two monaural echo path models for two LTI periods, we do not suffer from any rank drop problem even in a strict single talking. Moreover, using WARP method, computational complexity can be reduced drastically because WARP method requires PxP inverse matrix operations only at LTI characteristics change such as far-end talker change. However, contrary to AP method, it is clear that performance of WARP method may drop if fluctuation in cross-channel correlation becomes high. Considering above pros-cons in affine projection and WARP methods, it looks desirable to apply affine method and WARP method dynamically depending on the nature of stereo sound. In this chapter, an acoustic echo canceller based on WARP method which equips both monaural and stereo adaptive filters is discussed together with other gradient base stereo adaptive filter methods. The WARP method observes cross-channel correlation characteristics in stereo sound using short tap pre-adaptive filters. Pre-adaptive filter coefficients are used to calculate WARP functions which project monaural adaptive filter estimation results to stereo adaptive filter initial coefficients or vice-versa.

To clarify effectiveness WARP method, simple computer simulations are carried out using white Gaussian noise source and male voice, using 128tap NLMS cross-channel correlation estimator, 1000tap monaural NLMS adaptive filter for monaural echo canceller and 2x1000tap (2x500tap for voice) multi-channel NLMS adaptive filter for stereo echo canceller. Followings are summary of the results:

Considering sampling effect for analog delay, x6 over sampling system is assumed for stereo generation model. 5 far-end talker positions are assumed and direct wave sound from each talker is assumed to be picked up by far-end stereo microphone with far-end room background noise. The simulation results show we can attain good cross-channel transfer function estimation rapidly using 128tap adaptive filter if far-end noise S/N is reasonable (such as 20-40dB).
Using the far-end stereo generation model and cross-channel correlation estimation results, 1000tap NLMS monaural NLMS adaptive filter and 2-1000 tap stereo NLMS adaptive filters are used to clarify effectiveness of WARP method. In the simulation far-end talker changes are assumed to happen at every 80frames (1frame=100sample). Echo return loss Enhancement (ERLE) MORMalized estimation error power (NORM) are used as measurements. It is clarified that both ERLE and NORM are drastically improved at the far-end talker change by applying WARP operation.
Far-end S/N affects WARP performance, however, we can still attain around SN-5dB ERLE or NORM.
We find slight convergence improvement in the case of AP method (P=3) with non-linear operation. However, the improvement is much smaller than WARP at the far-end talker change. This is because sound source is white Gaussian noise in this simulation and therefore merit of AP method is not archived well.
Since WARP method assumes stereo echo path characteristics remain stable, stereo echo path characteristics change degrade WARP effectiveness. The simulation results show the degradation depends on how much stereo echo path moved and the degradation appears just after WARP projection.
WARP method works correctly actual voice sound too. Collaboration with AP method may improve total convergence speed further more because AP method improves convergence speed for voice independent from WARP effect.

As for further studies, more experiments in actual environments are necessary. The author would like to continue further researches to realize smooth and natural conversations in the future conversational DTV.

7. Appendix

If N × N matrix Q is defined as

Q = X 2 S T ( k ) G T G X 2 S ( k ) E125

where X 2 S ( k ) is a ( 2 P − 1 ) sample array composed of white Gaussian noise sample x ( k ) as

X 2 S ( k ) = [ x ( k ) , x ( k − 1 ) , ⋯ x ( k − N + 1 ) ] x ( k ) = [ x ( k ) , x ( k − 1 ) , ⋯ x ( k − 2 p + 2 ) ] T E126

G is defined as a ( 2 P − 1 ) × P matrix as

G = [ g T 0 ⋯ 0 0 g T ⋱ 0 0 0 ⋯ 0 0 0 ⋯ g T ] E127

where g is P sample array defined as

g = [ g 0 , g 1 , ⋯ , g ν ⋯ g P − 1 ] T E128

Then ⟨ Q ⟩ is a Toepliz matrix and is expressed using P × P ( P ≤ N ) Toepliz matrix ⟨ Q ′ ⟩ as

⟨ Q ⟩ = T l z ( ⟨ Q ′ ⟩ ) E129

This is because ( u , v ) t h element of the matrix ⟨ Q ⟩ , a T l Z ( u , v ) is defined as

a T l Z ( u , v ) = ⟨ x T ( k - u ) G T G x ( k - v ) ⟩ E130

Considering

⟨ x T ( k - u ) G T G x ( k - v ) ⟩ = 0 ⋯ for all | u − v | ≥ P E131

the element a T l Z ( u , v ) is given as

a T l Z ( u , v ) = { a ( u − v , 0 ) ⋯ P − 1 ≥ u − v ≥ 0 a ( 0 , v − u ) ⋯ P − 1 ≥ v − u 0 0 ⋯ | u − v | ≥ P E132

By setting the ( u , v ) th element of P × P ( P ≤ N ) Toepliz matrix ⟨ Q ′ ⟩ as a T l Z ( u , v ) ( ( 0 ≤ u P , 0 ≤ v P ) ), we define a function T l z ( ⟨ Q ′ ⟩ ) which determines N × N Toepliz matrix Q .

It is noted that if Q ′ is a identity matrix Q is also identity matrix.

References

1. Nagumo J. 1967A Learning Identification Method for System Identification”, IEEE Trans. AC. 12 3Jun 282
2. M.M.Sondhi et.al. 1991Acoustic Echo Cancellation for Stereophonic Teleconferencing", Workshop on Applications of Signal Processing to Audio and Acoustics, May.
3. Benesty J. Amand F. Gillorie A. Grenier Y. 1996adaptive filtering algorithm for a stereophonic echo cancellation” Proc. Of ICASSP-96, 5May, 3099-3012..
4. Benesty J. Morgan D. R. Sondhi M. M. 1998A better understanding and an improved solution to the specific problems of stereophonic acoustic echo canceller”, IEEE Trans. Speech Audio Processing, 6 2 156 165Mar.
5. Bershad NJ, 1987 “Behavior of the ε -normalized LMS algorithm with Gaussian inputs”, IEEE Transaction on Acoustic, Speech and Signal Processing, ASSP-35(5): 636-644.
6. Fujii T. Shimada S. 1984A Note on Multi-Cannel Echo Cancelers," technical report of ICICE on CS, 7 14Jan.
7. Sugiyama A. Joncour Y. Hirano A. 2001A stereo echo canceller with correct echo-path identification based on an input-sliding technique”, IEEE Trans. On Signal Processing, 49 11 2577 2587
8. Jun-Mei Yang;Sakai, 2007Stereo acoustic echo cancellation using independent component analysis” IEEE,
9. International Symposium on Intelligent Signal Processing and Communication Systems (USA) P.P.121-4
10. Jacob Benesty. R. Morgan M. M. Sondhi 1998A hybrid Momo/Stereo Acoustic Echo Canceller”, IEEE Transactions on Speech and Audio Processing, 6 5September.
11. Shimauchi S. S. Makino S. 1995Stereo projection echo canceller with true echo path estimation”, IEEE Proc. of ICASSP95, 3662P.P.3059-62 vol.5 PD:
12. Makino S. Strauss K. Shimauchi S. Haneda Y. Nakagawa A. 1997Subband Stereo Echo Canceller using the projection algorithm with fast convergence to the true echo path”, IEEE Proc. of ICASSP 97, 299 302
13. Shimauchi S. Makino S. Haneda Y. Kaneda Y. 1998New configuration for a stereo echo canceller with nonlinier pre-processing”, IEEE Proc. of ICASSP 98, 3685 3688
14. Shimauchi S. Makino S. Haneda Y. Nakagawa A. Sakauchi S. 1999A stereo echo canceller implemented using a stereo shaker and a duo-filter control system”, IEEE ICASSP99 2 857 860
15. Akira Nakagawa and Youichi Haneda, 2002 " A study of an adaptive algorithm for stereo signals with a power difference”, IEEE ICASSP2002, II-1913-162
16. Minami S. 1993An Echo Canceller with Comp. & Decomposition of Estimated Echo Path Characteristics for TV Conference & Multi-Media Terminals”, The 6^th Karuizawa Workshop on Circuits and Sytstems, April 19-20 333 337
17. S.Minami, 1987An Acoustic Echo Canceler for Pseudo Stereophonic Voice", IEEE GLOBCOM’87 35.1 Nov.
18. S.Minami, 1986A stereophonic Voice Coding Method for teleconferencing", IEEE ICCC. 86, 46.6, June
19. Multi-Channel Acoustic Echo Canceller with Microphone/Speaker Array ITC-CSCC’09 2009 397 400
20. WARP-AEC: 2009A Stereo Acoustic Echo Canceller based on W-Adaptive filters for Rapid Projection IEEE ISPACS’09

[1] 1. Nagumo J. 1967A Learning Identification Method for System Identification”, IEEE Trans. AC. 12 3Jun 282

[2] 2. M.M.Sondhi et.al. 1991Acoustic Echo Cancellation for Stereophonic Teleconferencing", Workshop on Applications of Signal Processing to Audio and Acoustics, May.

[3] 3. Benesty J. Amand F. Gillorie A. Grenier Y. 1996adaptive filtering algorithm for a stereophonic echo cancellation” Proc. Of ICASSP-96, 5May, 3099-3012..

[4] 4. Benesty J. Morgan D. R. Sondhi M. M. 1998A better understanding and an improved solution to the specific problems of stereophonic acoustic echo canceller”, IEEE Trans. Speech Audio Processing, 6 2 156 165Mar.

[5] 5. Bershad NJ, 1987 “Behavior of the ε -normalized LMS algorithm with Gaussian inputs”, IEEE Transaction on Acoustic, Speech and Signal Processing, ASSP-35(5): 636-644.

[6] 6. Fujii T. Shimada S. 1984A Note on Multi-Cannel Echo Cancelers," technical report of ICICE on CS, 7 14Jan.

[7] 7. Sugiyama A. Joncour Y. Hirano A. 2001A stereo echo canceller with correct echo-path identification based on an input-sliding technique”, IEEE Trans. On Signal Processing, 49 11 2577 2587

[8] 8. Jun-Mei Yang;Sakai, 2007Stereo acoustic echo cancellation using independent component analysis” IEEE,

[9] 9. International Symposium on Intelligent Signal Processing and Communication Systems (USA) P.P.121-4

[10] 10. Jacob Benesty. R. Morgan M. M. Sondhi 1998A hybrid Momo/Stereo Acoustic Echo Canceller”, IEEE Transactions on Speech and Audio Processing, 6 5September.

[11] 11. Shimauchi S. S. Makino S. 1995Stereo projection echo canceller with true echo path estimation”, IEEE Proc. of ICASSP95, 3662P.P.3059-62 vol.5 PD:

[12] 12. Makino S. Strauss K. Shimauchi S. Haneda Y. Nakagawa A. 1997Subband Stereo Echo Canceller using the projection algorithm with fast convergence to the true echo path”, IEEE Proc. of ICASSP 97, 299 302

[13] 13. Shimauchi S. Makino S. Haneda Y. Kaneda Y. 1998New configuration for a stereo echo canceller with nonlinier pre-processing”, IEEE Proc. of ICASSP 98, 3685 3688

[14] 14. Shimauchi S. Makino S. Haneda Y. Nakagawa A. Sakauchi S. 1999A stereo echo canceller implemented using a stereo shaker and a duo-filter control system”, IEEE ICASSP99 2 857 860

[15] 15. Akira Nakagawa and Youichi Haneda, 2002 " A study of an adaptive algorithm for stereo signals with a power difference”, IEEE ICASSP2002, II-1913-162

[16] 16. Minami S. 1993An Echo Canceller with Comp. & Decomposition of Estimated Echo Path Characteristics for TV Conference & Multi-Media Terminals”, The 6^th Karuizawa Workshop on Circuits and Sytstems, April 19-20 333 337

[17] 17. S.Minami, 1987An Acoustic Echo Canceler for Pseudo Stereophonic Voice", IEEE GLOBCOM’87 35.1 Nov.

[18] 18. S.Minami, 1986A stereophonic Voice Coding Method for teleconferencing", IEEE ICCC. 86, 46.6, June

[19] 19. Multi-Channel Acoustic Echo Canceller with Microphone/Speaker Array ITC-CSCC’09 2009 397 400

[20] 20. WARP-AEC: 2009A Stereo Acoustic Echo Canceller based on W-Adaptive filters for Rapid Projection IEEE ISPACS’09

A Stereo Acoustic Echo Canceller Using Cross-Channel Correlation

Adaptive Filtering

Author Information

Minami Shigenobu

1. Introduction

2. Stereo acoustic echo canceller problem

2.1. Conversational DTV

Figure 1.

2.2. Stereo sound generation model

Figure 2.

2.3. Stereo acoustic echo canceller problem

3. Stereo acoustic echo canceller methods

3.1. Gradient method

3.2. Least Square (LS) method (M=2N)

3.3. Stereo Affine Projection (AP) method (M=P ≤ N)

Figure 3.

3.4. Stereo Normalized Least Mean Square (NLMS) method (M=1)

3.5. Double adaptive filters for Rapid Projection (WARP) method

Figure 4.

Figure 5.

4. Realization of WARP

4.1. Simplification by assuming direct-wave stereo sound

Figure 6.

Figure 7.

4.2. Digital filter realization of WARP functions

Figure 8.

4.3. Causality and stability of WARP functions

4.4. Stereo echo canceller using WARP

Figure 9.

5. Computer simulations

5.1. Stereo sound generation model

Figure 10.

5.2. Cross-channel transfer function estimation

Figure 11.

Figure 12.

Figure 13.

5.3. Echo canceller performances

5.3.1. WARP echo canceller basic performances for WGN

Figure 14.

Figure 15.

Figure 16.

Figure 17.

Figure 18.

5.3.2. WARP echo canceller basic performances for voice

Figure 19.

6. Conclusions

7. Appendix

References

Continue reading from the same book

Adaptive Filtering