Stereo acoustic echo cancellers are becoming increasingly important as echo cancellers are applied to consumer products such as conversational DTVs. However, it is well known that when there is strong cross-channel correlation between the right and left sounds, a stereo echo canceller cannot converge well, which results in echo path estimation misalignment. This is a serious problem in a conversational DTV because the speaker output sound is a combination of far-end conversational sound, which is essentially monaural, and TV program sound, which has a wide variety of sound characteristics: monaural sound, stereo sound or a mixture of them. To cope with this problem, many stereo echo cancellation algorithms have been proposed. The methods can be categorized into two approaches. The first is to de-correlate the stereo sound by introducing independent noise or non-linear post-processing to the right and left speaker outputs. This approach is very effective for the single-source stereo sound case, which covers most conversational sounds, because the de-correlation prevents the rank drop that otherwise occurs when solving the normal equation in a multi-channel adaptive filtering algorithm. Moreover, it is simple, since many traditional adaptation algorithms can be used without any modification. Although the approach has many advantages and is therefore widely accepted, it still has an essential problem: the de-correlation changes the sound quality because artificial distortion is inserted. Even if the inserted distortion is minimized to prevent sound quality degradation, many users of entertainment audio equipment, such as a conversational DTV, do not accept any distortion of the speaker output sound. The second approach is desirable for entertainment equipment because no modification to the speaker outputs is required. In this approach, the algorithms utilize changes in the cross-channel correlation of the stereo sound.
This second approach is itself divided into two, depending on how the cross-channel correlation change is utilized. One widely used approach is the affine projection method. If there are small variations in the cross-channel correlation, even in single-source stereo sound, a small un-correlated component appears in each channel. The affine projection method can produce the best update direction by excluding the adverse effect of auto-correlation in each channel and by utilizing the small un-correlated components. This approach has a great advantage in that it does not require any modification to the stereo sound; however, if the variation in the cross-channel correlation is very small, the improvement in adaptive filter convergence is also very small. Since the rank drop problem of the stereo adaptive filter is essentially not solved, a slight inserted distortion may still be needed, which reduces the merit of this method. Another drawback is that the method requires a P by P inverse matrix calculation at each sample. The cost of the inverse matrix operation can be reduced by choosing a small P; however, a small P sometimes cannot attain sufficient improvement in convergence speed. To attain better performance even with a small P, the affine projection method is sometimes realized together with a sub-band method. Another method categorized in the second approach is the “WARP” method. Unlike the affine projection method, which utilizes small changes in the cross-channel correlation, the WARP method utilizes large changes in the cross-channel correlation. This approach is based on the nature of usual conversations. Even when stereo sound is used for conversations, most parts of a conversation are single-talk monaural sound. The cross-channel correlation is usually very high and remains almost stable during single talking. A large change happens when the talker changes or the talker’s face moves.
Therefore, the method applies a monaural adaptive filter to single-source stereo sound and a multi-channel (stereo) adaptive filter to non-single-source stereo sound. An important feature of the method is that two monaural adaptive filter estimation results and one stereo adaptive filter estimation result are transformed into each other by using projection matrixes, called WARP matrixes. Since a monaural adaptive filter is applied when the sound is single-source stereo sound, the method does not suffer from the rank problem.
In this chapter, stereo acoustic echo canceller methods that do not need any modification to the speaker output sounds, namely the multi-channel least mean square, affine projection and WARP methods, are surveyed with conversational DTV applications as the target. The WARP method is then explained in detail.
2. Stereo acoustic echo canceller problem
2.1. Conversational DTV
Since a conversational DTV should maintain smooth speech communication even while it is receiving a regular TV program, it requires the following functionalities in addition to those of traditional DTV systems, as shown in Fig. 1.
Mixing of broadcasting sound and communication speech: The two stereo sounds from the DTV audio receiver and the local conversational speech decoder are mixed and sent to the stereo speaker system.
Sampling frequency conversion: The sampling frequency of DTV sound is usually higher than that of the conversational service, so sampling frequency conversion is needed between the DTV and conversational service audio parts.
Stereo acoustic echo canceller: A stereo acoustic echo canceller is required to prevent howling and speech quality degradation due to acoustic coupling between the stereo speakers and the microphones.
Among the above special functionalities, the echo canceller for the conversational DTV is technically very challenging because it must cancel a wide variety of stereo echoes from TV programs as well as from stereo speech communication.
2.2. Stereo sound generation model
A stereo acoustic echo canceller system is shown in Fig. 2 together with a typical stereo sound generation model. All signals are assumed to be discrete-time signals sampled at the sampling frequency, and the sound generation model is assumed to consist of linear finite impulse response (FIR) systems that take a single sound source signal as input and produce the right and left stereo sounds as outputs, each with an additional un-correlated noise. The signals are written in matrix and array notation as
where P and N are the impulse response length of the FIR system and the tap length of the adaptive filter for each channel, respectively.
Then the FIR system output is a 2N-sample array and is expressed as
where the right and left responses are the P-sample impulse responses of the FIR system, defined as
In(2), if andare composed of constant array,and during the period, and small time variant arrays,and which are defined as
(2) is re-written as
This situation is usual in the case of far-end single talking, because the transfer functions between the talker and the right and left microphones vary slightly due to small movements of the talker. By treating these time-variant components as un-correlated noise as well, (5) can be regarded as a linear time-invariant (LTI) system with an independent noise component in each channel, as
In (6), if there are no un-correlated noises, we call the situation strict single talking.
In this chapter, the sound source signal and the un-correlated noises are assumed to be mutually independent white Gaussian noise, each with its own variance.
2.3. Stereo acoustic echo canceller problem
For simplicity, only the stereo acoustic echo canceller for the right-side microphone output signal is explained, since the echo canceller for the left microphone output can clearly be treated in the same way. As shown in Fig. 2, the echo canceller cancels the acoustic echo as
Here the equation involves the acoustic echo canceller’s residual error, an independent background noise, and the FIR adaptive filter output of the stereo echo canceller, the last of which is given by
where the right and left adaptive filter coefficient arrays are N-tap FIR coefficient arrays.
Error power of the echo canceller for the right channel microphone output, , is given as:
where is a stereo echo path model defined as
The optimum echo path estimate, which minimizes the error power, is given by solving the quadratic minimization problem
where a fixed number of samples is used for the optimization. The optimum echo path estimate for the LTI period is then easily obtained from the well-known normal equation as
where is an auto-correlation matrix of the adaptive filter input signal and is given by
By (14), determinant of is given by
In the case of strict single talking where and do not exist, (16) becomes very simple as
To check the determinant, we calculate considering as
Then becomes zero as
Hence, no unique solution can be found by solving the normal equation in the case of strict single talking, where un-correlated components do not exist. This is the well-known cross-channel correlation problem of stereo adaptive filters.
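The rank drop described above can be checked numerically. The following is a minimal sketch, not the chapter's own simulation: the two 3-tap source-to-microphone responses, the tap length N = 8, the function name and the rank tolerance are all assumptions chosen for illustration. It builds the 2N x 2N stereo input auto-correlation matrix from a single WGN source and reports its rank with and without un-correlated channel noise.

```python
import numpy as np

def stereo_autocorr_rank(noise_std, n_taps=8, n_samples=2000, seed=0):
    """Rank of the 2N x 2N stereo input auto-correlation matrix (illustrative)."""
    rng = np.random.default_rng(seed)
    # Hypothetical source-to-microphone FIR responses (P = 3); not from the chapter.
    g_r = np.array([1.0, 0.5, 0.25])
    g_l = np.array([0.6, 0.8, 0.3])
    s = rng.standard_normal(n_samples + len(g_r) - 1)  # single WGN sound source
    # Right and left stereo sounds: filtered source plus independent channel noise.
    x_r = np.convolve(s, g_r, mode="valid") + noise_std * rng.standard_normal(n_samples)
    x_l = np.convolve(s, g_l, mode="valid") + noise_std * rng.standard_normal(n_samples)
    # Each row is one stereo adaptive filter input vector (N right taps, N left taps).
    rows = [np.concatenate((x_r[k - n_taps + 1:k + 1][::-1],
                            x_l[k - n_taps + 1:k + 1][::-1]))
            for k in range(n_taps - 1, n_samples)]
    x = np.array(rows)
    r = x.T @ x / len(rows)  # sample auto-correlation matrix
    # The absolute tolerance is an assumption that separates numerical zeros
    # from the noise-floor eigenvalues in this small example.
    return int(np.linalg.matrix_rank(r, tol=1e-8))

if __name__ == "__main__":
    print("strict single talking rank:", stereo_autocorr_rank(0.0))
    print("with un-correlated noise rank:", stereo_autocorr_rank(0.1))
```

In this set-up every stereo input vector is a linear function of only N + P - 1 source samples, so the strict single talking matrix cannot reach rank 2N, whereas a small amount of independent noise restores full rank.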
3. Stereo acoustic echo canceller methods
To address the problems described above, many approaches have been proposed. One widely accepted approach is de-correlation of the stereo sound. To avoid the rank drop of the normal equation (13), a small distortion, such as non-linear processing or phase modification, is added to the stereo sound. This approach is simple and effective in ensuring convergence of the multi-channel adaptive filter; however, the distortion may degrade the stereo sound. In entertainment applications such as conversational DTV, the problem may be serious because customers’ requirements for sound quality are usually very high, and therefore even a small modification to the speaker output sound cannot be accepted. From this viewpoint, approaches which do not add any modification or artifacts to the speaker output sound are desirable for entertainment use. In this section, the least square (LS), stereo affine projection (AP), stereo normalized least mean square (NLMS) and WARP methods are reviewed as methods which do not need to change the stereo sound itself.
3.1. Gradient method
Gradient method is widely used for solving the quadratic problem iteratively. As a generalized gradient method, let denotesample orthogonalized error array based on original error arrayas
where is an sample error array which is defined as
and is a matrix which orthogonalizes the auto-correlation matrix. The orthogonalized error array is expressed using difference between adaptive filter coefficient array and target stereo echo path sample responseas
where is a Mx2N matrix which is composed of adaptive filter stereo input array as defined by
By defining an echo path estimation error array which is defined as
estimation error power is obtained by
Then, (26) is regarded as a quadratic function of as
For the quadratic function, gradient is given by
Iteration of which minimizesis given by
where is a constant to determine step size.
The above equation is a very generic expression of the gradient method, and the following approaches are regarded as variants of this iteration.
3.2. Least Square (LS) method (M=2N)
From(30), the estimation error power between estimated adaptive filter coefficients and stereo echo path response,is given by
whereis a identity matrix. Then the fastest convergence is obtained by findingwhich orthogonalizes and minimizes eigenvalue variance in.
If M=2N, is symmetric square matrix as
and if is a regular matrix so that inverse matrix exists, which orthogonalizes is given by
Assuming initial tap coefficient array as zero vector and during 0 to 2N-1th samples and at 2Nth sample, (34) can be re-written as
where is 2sample echo path output array and is defined as
This iteration is done only once at sample. If, inverse matrix term in (35) is written as
3.3. Stereo Affine Projection (AP) method (M=PN)
The stereo affine projection method corresponds to the case where M is chosen as the FIR response length P of the LTI system. This approach is very effective in reducing the 2Nx2N inverse matrix operation of the LS method to a PxP operation when the stereo generation model is assumed to consist of LTI system outputs driven by a single WGN signal source with independent right and left channel noises, as shown in Fig. 2. For the sake of explanation, we define the stereo sound signal matrix, which is composed of the right and left signal matrixes for P samples, as
and are un-correlated signal matrix defined as
andare source to microphones response (2P-1)xP matrixes and are defined as
As explained by(31), determines convergence speed of the gradient method. In this section, we derive affine projection method by minimizing the max-min eigenvalue variance in. Firstly, the auto-correlation matrix is expressed by sub-matrixes for each stereo channel as
where andare right and left channel auto-correlation matrixes, andare cross channel-correlation matrixes. These sub-matrixes are given by
Since the iteration process in (30) is an averaging process, the auto-correlation matrix is approximated by using expectation value of it,. Then expectation values for sub-matrixes in (42) are simplified applying statistical independency between sound source signal and noises and function defined in Appendix as
Applying matrix operations to, a new matrix which has same determinant as is given by
Since both and are symmetric PxP square matrixes, and are re-written as
As evident by(47), (48) and(49), is composed of major matrixand noise matrix. In the case of single talking where sound source signal power is much larger than un-correlated signal power, which minimizes eigenvalue spread in so as to attain the fastest convergence is given by makingas a identity matrix by setting as
In other cases such as double talking or no talk situations, where we assume is almost zero, which orthogonalizes is given by
Summarizing the above discussions, the fastest convergence is attained by setting as
In an actual implementation, the expectation is replaced by a running estimate with a forgetting factor, and a small term is added to the matrix to be inverted to avoid division by zero, as shown below.
where is very small positive value and
The method can be understood intuitively from the geometrical explanation in Fig. 3. In the traditional NLMS approach, a new direction is created from the estimated coefficients on the (i-1)th plane by finding the nearest point on the ith plane. Affine projection, on the other hand, creates the best direction, which targets a location contained in both the (i-1)th and the ith planes.
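The geometric picture can be made concrete with a small sketch of a regularized affine projection update. This is the standard textbook form of the update, offered as an illustration rather than as the chapter's exact equation (54); the helper names, the 3-tap generation responses, the step size `mu` and the regularization `delta` are all assumptions. With `mu = 1` and a tiny `delta`, one update places the estimate on the intersection of the last P hyperplanes, so the a posteriori error for those P equations is essentially zero.

```python
import numpy as np

def make_stereo_input(n_samples, noise_std=0.1, seed=0):
    """Single WGN source through two hypothetical 3-tap responses plus channel noise."""
    rng = np.random.default_rng(seed)
    g_r = np.array([1.0, 0.5, 0.25])
    g_l = np.array([0.6, 0.8, 0.3])
    s = rng.standard_normal(n_samples + len(g_r) - 1)
    x_r = np.convolve(s, g_r, mode="valid") + noise_std * rng.standard_normal(n_samples)
    x_l = np.convolve(s, g_l, mode="valid") + noise_std * rng.standard_normal(n_samples)
    return x_r, x_l

def input_matrix(x_r, x_l, k, n_taps, order):
    """Stack the last `order` stereo input vectors (each of length 2N) as rows."""
    rows = []
    for j in range(order):
        i = k - j
        rows.append(np.concatenate((x_r[i - n_taps + 1:i + 1][::-1],
                                    x_l[i - n_taps + 1:i + 1][::-1])))
    return np.array(rows)

def ap_update(h_hat, x_mat, d_vec, mu=1.0, delta=1e-6):
    """One regularized affine projection step: h + mu * X^T (X X^T + delta I)^-1 e."""
    e_vec = d_vec - x_mat @ h_hat
    gram = x_mat @ x_mat.T + delta * np.eye(x_mat.shape[0])  # P x P matrix
    return h_hat + mu * (x_mat.T @ np.linalg.solve(gram, e_vec))

def run_ap(order, n_taps=8, n_iter=3000, seed=0):
    """Normalized misalignment in dB after n_iter noise-free AP updates."""
    x_r, x_l = make_stereo_input(n_iter + n_taps + order, seed=seed)
    h_true = np.random.default_rng(seed + 1).standard_normal(2 * n_taps)
    h_hat = np.zeros(2 * n_taps)
    for k in range(n_taps + order - 2, len(x_r)):
        x_mat = input_matrix(x_r, x_l, k, n_taps, order)
        h_hat = ap_update(h_hat, x_mat, x_mat @ h_true)
    return 10 * np.log10(np.sum((h_true - h_hat) ** 2) / np.sum(h_true ** 2))

if __name__ == "__main__":
    for p in (1, 3):
        print(f"P={p}: misalignment {run_ap(p):.1f} dB")
```

Setting `order=1` reduces the update to NLMS, which matches the text's view of NLMS as the simplest affine projection. Solving the P x P system with `np.linalg.solve` avoids forming the inverse explicitly.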
3.4. Stereo Normalized Least Mean Square (NLMS) method (M=1)
Stereo NLMS method is a case when M=1 of the gradient method.
Equation (54) is re-written when M =1 as
It is well known that convergence speed of (57) depends on the smallest and largest eigen-value of the matrix. In the case of the stereo generation model in Fig.2 for single talking with small right and left noises, we obtain following determinant of for M=1 as
If eigenvalue of are given as
where and are the smallest and largest eigenvalues, respectively.
is given by assuming un-correlated noise power is very small () as
Hence, the convergence speed of the stereo NLMS echo canceller is largely determined by the ratio between the largest eigenvalue and the un-correlated signal power. If the un-correlated sound power is very small in single talking, the convergence of the stereo NLMS echo canceller becomes very slow.
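The dependence on the un-correlated signal power can be illustrated with a short stereo NLMS experiment. The sketch below is an assumption-laden illustration, not the chapter's simulation: the 4-tap generation responses, N = 16 taps, the step size, the iteration count and the noise-free microphone signal are all chosen for convenience. It reports the normalized coefficient misalignment for strict single talking and for single talking with independent channel noise.

```python
import numpy as np

def stereo_nlms_misalignment(noise_std, n_taps=16, n_iter=20000, mu=0.5, seed=0):
    """Normalized misalignment (dB) of a stereo NLMS filter after n_iter updates."""
    rng = np.random.default_rng(seed)
    # Hypothetical source-to-microphone responses of the far-end generation model.
    g_r = np.array([1.0, 0.5, 0.25, 0.1])
    g_l = np.array([0.6, 0.8, 0.3, 0.1])
    n = n_iter + n_taps
    s = rng.standard_normal(n + len(g_r) - 1)
    x_r = np.convolve(s, g_r, mode="valid") + noise_std * rng.standard_normal(n)
    x_l = np.convolve(s, g_l, mode="valid") + noise_std * rng.standard_normal(n)
    h_true = rng.standard_normal(2 * n_taps)  # stacked right and left echo paths
    h_hat = np.zeros(2 * n_taps)
    for k in range(n_taps - 1, n):
        u = np.concatenate((x_r[k - n_taps + 1:k + 1][::-1],
                            x_l[k - n_taps + 1:k + 1][::-1]))
        e = (h_true - h_hat) @ u               # noise-free residual echo
        h_hat += mu * e * u / (u @ u + 1e-8)   # NLMS update (M = 1)
    return 10 * np.log10(np.sum((h_true - h_hat) ** 2) / np.sum(h_true ** 2))

if __name__ == "__main__":
    print("strict single talking:", stereo_nlms_misalignment(0.0), "dB")
    print("with un-correlated noise:", stereo_nlms_misalignment(0.3), "dB")
```

In the strict case the input vectors span only part of the 2N-dimensional coefficient space, so the component of the echo path outside that subspace is never estimated and the misalignment stalls, even though the residual echo itself becomes small.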
3.5. Double adaptive filters for Rapid Projection (WARP) method
The name WARP reflects the fact that the algorithm projects the optimum solution between the monaural space and the stereo space. Since the algorithm dynamically switches the type of adaptive filter between monaural and stereo according to the observed sound source characteristics, it does not suffer from the rank drop problem caused by strong cross-channel correlation in the stereo sound. The algorithm was originally developed for the acoustic echo canceller in a pseudo-stereo system, which creates an artificial stereo effect by adding delay and/or loss to a monaural sound. It has since been extended to real stereo sound by introducing the residual signal that remains after the cross-channel correlation is removed.
In this section, the WARP method is derived as an extension of the affine projection method described in Section 3.3.
By introducing error matrix which is defined by
iteration of the stereo affine projection method in (54) is re-written as
In the case of strict single talking, following assumption is possible in theLTI period by (53)
where is a PxP symmetric matrix as
By assuming as a regular matrix, (62) can be re-written as
Re-defining echo path estimation matrix by a new matrix which is defined by
(66) is re-written as
Then the iteration is expressed using signal matrix as
In the case of strict single talking where no un-correlated signals exist, and if we can assume is assumed to be an output of a LTI system which is PxP symmetric regular matrix with input, then (69) is given by
It is evident that the rank of the equation in (70) is N, not 2N. The equation therefore reduces to a monaural one by subtracting the first row, after multiplication, from the second row as
Selection of the iteration depends on existence of the inverse matrixor and the detail is explained in the next section.
From the stereo echo path estimation viewpoint, we can obtain either monaural estimate; however, we cannot identify the right and left echo path estimates from a single monaural one. To cope with this problem, we use two LTI periods to separate the right and left estimation results as
where and are monaural echo canceller estimation results at the end of each LTI period, and are right and left estimated stereo echo paths based on the and LTI period’s estimation results.
Equation (77) is written simply as
where is estimation result matrix for the and LTI period’s as
is stereo echo path estimation result as
is a matrix which projects stereo estimation results to two monaural estimation results and is defined by
By swapping the right-hand side and the left-hand side of (78), we obtain the right and left stereo echo path estimates from the two monaural echo path estimation results as
Since these two matrixes are used to project the optimum solutions in two monaural spaces to the corresponding optimum solution in the stereo space and vice versa, we call them WARP functions. The above procedure is depicted in Fig. 4. As shown there, the WARP system can be regarded as an acoustic echo canceller which transforms the stereo signal into a correlated component and an un-correlated component and applies a monaural acoustic echo canceller to the correlated signal. To reconstruct the stereo signal, a cross-channel correlation recovery matrix is inserted on the echo path side. Therefore, the WARP operation is needed whenever the LTI system changes.
In an actual application such as speech communication, the auto-correlation characteristics vary frequently as the speech characteristics change, whereas the cross-channel characteristics change mainly when the far-end talker changes. In the following discussion, we therefore apply the NLMS method as the simplest affine projection (P=1).
The mechanism is also intuitively understood by using simple vector planes depicted in Fig. 5.
As shown here, using two optimum solutions in monaural spaces (in this case on the lines) the optimum solution located in the two dimensional (stereo) space is calculated directly.
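The projection between two monaural solutions and one stereo solution can be sketched in a deliberately simplified setting. The example below assumes that, in each LTI period, the left sound is just a scaled copy of the right sound (a gain-only cross-channel relation with hypothetical gains `a_1` and `a_2`), so the WARP matrix becomes a 2x2 matrix. The chapter's WARP functions are more general; the function names and values here are illustrative assumptions only.

```python
import numpy as np

def warp_to_mono(h_r, h_l, a):
    """Monaural echo path seen from the right sound when x_L = a * x_R."""
    return h_r + a * h_l

def warp_to_stereo(h_mono_1, h_mono_2, a_1, a_2):
    """Recover (h_R, h_L) from monaural estimates of two LTI periods."""
    w = np.array([[1.0, a_1],
                  [1.0, a_2]])                # gain-only WARP matrix (assumption)
    mono = np.vstack((h_mono_1, h_mono_2))    # 2 x N monaural estimation results
    h_r, h_l = np.linalg.solve(w, mono)       # singular if a_1 == a_2
    return h_r, h_l

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h_r_true = rng.standard_normal(8)
    h_l_true = rng.standard_normal(8)
    # Two LTI periods with different cross-channel gains, e.g. a talker change.
    m_1 = warp_to_mono(h_r_true, h_l_true, 0.5)
    m_2 = warp_to_mono(h_r_true, h_l_true, 0.8)
    h_r_est, h_l_est = warp_to_stereo(m_1, m_2, 0.5, 0.8)
    print(np.allclose(h_r_est, h_r_true), np.allclose(h_l_est, h_l_true))
```

The 2x2 matrix is singular when the two gains are equal, which mirrors the need for two distinct LTI periods before the right and left echo paths can be separated.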
4. Realization of WARP
4.1. Simplification by assuming direct-wave stereo sound
Both the stereo affine projection and WARP methods require a P x P inverse matrix operation, whose high computational load and stability problems must be considered. The WARP operation is required only when the LTI system changes, for example at a far-end talker change, so its computation is much smaller than that of the inverse matrix operations for affine projection, which are required at each sample. Nevertheless, simplification of the WARP operation is still important. This is possible by assuming that the target stereo sound is composed only of the direct-wave sound from a single talker, as shown in Fig. 6.
In figure 6, a single sound source signal at an angular frequency in the LTI period,, becomes a stereo sound composed of right and left signals, and, through out right and left LTI systems, andwith additional un-correlated noiseand as
In the case of simple direct-wave systems, (83) can be re-written as
where and are attenuation of the transfer functions and and are analog delay values.
Since the right and left sounds are sampled by Hz and treated as digital signals, we use z- domain notation instead of -domain as
As shown in Fig.7, the stereo sound generation model for is expressed as
where the z-domain quantities are the expressions of the band-limited sampled signals corresponding to the respective continuous-time quantities. The adaptive filter output and the microphone output at the end of the LTI period are defined as
where is a room noise,andare stereo adaptive filter and stereo echo path characteristics at the end of LTI period respectively and which are defined as
Then cancellation error is given neglecting near end noise by
In the case of single talking, we can assume both and are almost zero, and (89) can be re-written as
Since the acoustic echo can also be assumed to be driven by single sound source, we can assume a monaural echo path as
Then (90) is re-written as
This equation implies we can adopt monaural adaptive filter by using a new monaural quasi-echo path as
However, it is also evident that if the LTI system changes, both the echo and quasi-echo paths must be updated to match the new LTI system. This is the same situation as for the stereo echo canceller with pure single-talk stereo sound input. If we can assume that the acoustic echo paths are time invariant over two adjacent LTI periods, this problem is easily solved, because the rank required for solving the equation is then satisfied, as
In other words, using the two echo path estimation results for the two corresponding LTI periods, we can project the monaural-domain quasi-echo path to the stereo-domain quasi-echo path, or vice versa, using WARP operations as
In an actual implementation, it is impossible to obtain the real WARP function, which is composed of unknown transfer functions between the sound source and the right and left microphones, so we use one of the stereo sounds as the single-talk sound source instead. Usually, the higher-level sound is chosen as the pseudo sound source, because the talker is usually closer to the microphone that picks up the higher-level sound. Then, the approximated WARP function is defined as
where and are cross-channel transfer functions between right and left stereo sounds and are defined as
The RR, RL, LR and LL transitions in (98) describe changes in the location of a single talker. If the talker’s location changes within the right-microphone side (the right microphone remains the closest microphone), we call it an RR-transition, and if it changes within the left-microphone side (the left microphone remains the closest microphone), we call it an LL-transition. If the location changes from the right-microphone side to the left-microphone side, we call it an RL-transition, and if the change is in the opposite direction, we call it an LR-transition. Let us assume the ideal direct-wave single-talk case. Then the transfer functions are expressed in the z-domain as
where, and are fractional delays andand are integer delays for the direct-wave to realize analog delaysand, these parameters are defined as
is a “Sinc Interpolation” function to interpolate a value at a timing between adjacent two samples and is given by
4.2. Digital filter realization of WARP functions
Since the LL-transition and LR-transition are symmetrical to the RR-transition and RL-transition, respectively, only the RR and RL transition cases are explained in the following discussion. By solving (96) with the WARP function in (98), we obtain the right and left stereo echo path estimation functions as
Since is an interpolation function for a delay, the delay is compensated by as
These functions are assumed to be digital filters for the echo path estimation results as shown in Fig.8.
4.3. Causality and stability of WARP functions
By using numerical calculations,
Secondly, conditions for causality are given by checking the delay of the feedback component of the denominators and. Since convolution of a “Sinc Interpolation” function is also a “Sinc Interpolation” function as
Equation (110) is re-written as
The “Sinc Interpolation” function is an infinite sum extending toward both positive and negative delays. Therefore, it is essentially impossible to guarantee causality. However, by permitting some error, we can find conditions under which causality is maintained approximately. To do so, we use a “Quasi-Sinc Interpolation” function which is defined as
where is a finite impulse response range of the “Quasi-Sinc Interpolation”. Then the error power by the approximation is given as
Equation (116) is re-written as
Then conditions for causality are
The physical meaning of the conditions is as follows. When the talker stays in the same microphone zone, the delay difference due to the talker’s location change should be equal to or less than the cover range of the “Quasi-Sinc Interpolation”. When the talker moves to the other microphone zone, the delay sum due to the talker’s location change should be equal to or less than the cover range of the “Quasi-Sinc Interpolation”.
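The trade-off between the cover range and the approximation error can be sketched numerically. The example below treats the quasi-sinc interpolation as a plain truncation of the sinc interpolation to integer offsets from `-half_range` to `+half_range`; this symmetric range, the function names and the example values are assumptions for illustration and may differ from the chapter's exact definition. It uses the identity that the squared samples of a shifted sinc sum to one, so the truncation error power is one minus the retained energy.

```python
import math

def sinc(x):
    """Normalized sinc, sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def quasi_sinc_taps(delay, half_range):
    """Truncated sinc interpolation taps for a (fractional) delay in samples."""
    return [sinc(n - delay) for n in range(-half_range, half_range + 1)]

def truncation_error_power(delay, half_range):
    """Energy discarded by the truncation; the full sinc taps have unit energy."""
    return 1.0 - sum(t * t for t in quasi_sinc_taps(delay, half_range))

if __name__ == "__main__":
    for half_range in (2, 4, 8, 16):
        err = truncation_error_power(0.5, half_range)
        print(f"half_range={half_range:2d}: error power {err:.4f}")
```

A half-sample delay is the worst case for a given range, while an integer delay needs only a single tap, so the error is essentially zero in that case.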
4.4. Stereo echo canceller using WARP
The total system using the WARP method is presented in Fig. 9. The system is composed of six components: the far-end stereo sound generation model, the cross-channel transfer function (CCTF) estimation block, the stereo echo path model, the monaural acoustic echo canceller (AEC-I) block, the stereo acoustic echo canceller (AEC-II) block and the WARP block.
As shown in Fig. 9, the actual echo cancellation is done by the stereo acoustic echo canceller (AEC-II); however, the monaural acoustic echo canceller (AEC-I) is used for far-end single talking. The WARP block is active only when the cross-channel transfer function changes, and it projects the monaural echo canceller’s echo path estimation results for two LTI periods to one stereo echo path estimate, or vice versa.
5. Computer simulations
5.1. Stereo sound generation model
Computer simulations are carried out using the stereo generation model shown in Fig. 10 for both white Gaussian noise (WGN) and an actual voice. The system includes cross-channel transfer function (CCTF) estimation blocks, and all signals are assumed to be sampled after 3.4 kHz cut-off low-pass filtering. The frame length is set to 100 samples. Since the stereo sound generation model is essentially a continuous-time signal system, over-sampling (x6) is applied to simulate it. In the stereo sound generation model, five far-end talker locations, A Loc(1)=(-0.8,1.0), B Loc(2)=(-0.8,0.5), C Loc(3)=(-0.8,0.0), D Loc(4)=(-0.8,-0.5) and E Loc(5)=(-0.8,-1.0), are used, and the R/L microphone locations are set to R-Mic=(0,0.5) and L-Mic=(0,-0.5), respectively. Delay is calculated assuming a sound wave speed of 300 m/sec. In this set-up, the talker’s position for WGN is assumed to change from location A to location B and finally to location D, with each talker stable period set to 80 frames. The position change for voice is from C to A, and the period is set to 133 frames. Both room noise and reverberation components in the far-end terminal are assumed, and the S/N is set to 20 dB to 40 dB.
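The direct-wave delays implied by this geometry can be worked out with a few lines of code. The sketch below uses the talker and microphone coordinates and the 300 m/sec wave speed stated above (the fifth location is labelled E here); the 8 kHz sampling frequency is an assumption inferred from 80 frames of 100 samples corresponding to 1 sec, and the function names are hypothetical. Each delay is split into an integer part and a fractional part, in the spirit of the integer and fractional delays used for the direct-wave model.

```python
import math

SOUND_SPEED = 300.0   # m/sec, as assumed in the simulation set-up
FS = 8000.0           # Hz, assumed sampling frequency (not stated explicitly)
MIC_R = (0.0, 0.5)
MIC_L = (0.0, -0.5)
TALKERS = {
    "A": (-0.8, 1.0),
    "B": (-0.8, 0.5),
    "C": (-0.8, 0.0),
    "D": (-0.8, -0.5),
    "E": (-0.8, -1.0),
}

def direct_wave(talker, mic, fs=FS):
    """Distance in metres and direct-wave delay in samples from talker to mic."""
    dist = math.hypot(talker[0] - mic[0], talker[1] - mic[1])
    return dist, dist / SOUND_SPEED * fs

def split_delay(delay_samples):
    """Split a delay into an integer part and a fractional part in [0, 1)."""
    integer = math.floor(delay_samples)
    return integer, delay_samples - integer

if __name__ == "__main__":
    for name, loc in TALKERS.items():
        _, t_r = direct_wave(loc, MIC_R)
        _, t_l = direct_wave(loc, MIC_L)
        print(name, "R:", split_delay(t_r), "L:", split_delay(t_l),
              "L-R difference:", round(t_l - t_r, 2), "samples")
```

Location C is equidistant from both microphones, so its cross-channel delay difference is zero, while locations A and B lie on the right-microphone side and D and E on the left-microphone side.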
5.2. Cross-channel transfer function estimation
In the WARP method, it is easy to imagine that the estimation performance of the cross-channel transfer function largely affects the cancellation performance of the echo canceller. To clarify the transfer function estimation performance, simulations are carried out using the cross-channel transfer function (CCTF) estimators. One estimator is prepared for the case where the sound source is on the right-microphone side and another for the case where it is on the left-microphone side. Each estimator has two NLMS adaptive filters, a longer one with 128 taps and a shorter one with 8 taps. The longer adaptive filter (AF1) is used to find the main tap, and the shorter one (AF2) is used to estimate the transfer function precisely as an impulse response.
Figure 11 shows the CCTF estimation results as the AF1 tap coefficients after convergence, with a single male voice sound source placed at the locations C, B and A in Fig. 10. The detailed responses obtained by AF2 are shown in Fig. 12. As the results show, the CCTF estimation works correctly in the simulations.
The cancellation performance for the cross-channel correlation under room noise (WGN) is obtained using the adaptive filter AF2 and is shown in Fig. 13, where the S/N is assumed to be 20 dB, 30 dB and 40 dB. The quantity plotted in the figure is the power reduction in dB, obtained from the signal power before and after cancellation of the cross-channel correlation by AF2. As shown there, more than 17 dB of cross-channel correlation cancellation is attained.
5.3. Echo canceller performances
To evaluate the echo cancellation performance of the WARP acoustic echo canceller, whose system is shown in Fig. 10, computer simulations are carried out assuming 1000-tap NLMS adaptive filters for both the stereo and monaural echo cancellers. The performance of the acoustic echo canceller is evaluated by two measures. The first is the echo return loss enhancement (ERLE), which is applied to the WGN source case and is defined as
where and are residual echo for the monaural echo canceller (AEC-I) and stereo echo canceller (AEC-II) for thesample in the frame in the LTI period, respectively. The second measurement is normalized misalignment of the estimated echo paths and are defined as
where and are stereo echo canceller estimated coefficient arrays at the end of frame, respectively. and are target stereo echo path impulse response arrays, respectively.
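Both measures can be sketched in a few lines. The functions below use the commonly used definitions, ERLE as the ratio of microphone echo power to residual echo power over a block of samples and NORM as the ratio of coefficient error energy to echo path energy, both in dB. They are illustrative assumptions and may differ in detail (for example in the per-frame averaging) from the chapter's exact equations; the function names are hypothetical.

```python
import math

def erle_db(echo, residual):
    """Echo return loss enhancement in dB over one block of samples."""
    p_echo = sum(v * v for v in echo)
    p_res = sum(v * v for v in residual)
    return 10.0 * math.log10(p_echo / p_res)

def norm_misalignment_db(h_true, h_hat):
    """Normalized misalignment of an estimated echo path in dB."""
    err = sum((a - b) ** 2 for a, b in zip(h_true, h_hat))
    ref = sum(a * a for a in h_true)
    return 10.0 * math.log10(err / ref)

if __name__ == "__main__":
    echo = [1.0, -2.0, 3.0]
    residual = [0.1 * v for v in echo]          # residual at one tenth amplitude
    print(erle_db(echo, residual))              # about 20 dB
    print(norm_misalignment_db([1.0, 2.0], [0.9, 1.8]))   # about -20 dB
```

With these sign conventions a larger ERLE and a more negative NORM both indicate better performance.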
5.3.1. WARP echo canceller basic performances for WGN
The simulation results for the WARP echo canceller in the case of a WGN sound source, with no far-end double talking and no local noise, are shown in Fig. 14. In the simulations, the talker is assumed to move from A to E every 80 frames (1 sec). In Fig. 14, the results (a) and (b) show the ERLEs for the monaural and stereo acoustic echo cancellers (AEC-I and AEC-II), respectively.
The WARP operations are applied at the boundaries of the three LTI periods for the talkers C, D and E. The figure also shows NORM for the stereo echo canceller (AEC-II). As shown there, after two LTI periods (the A and B periods), NORM and ERLE improve quickly owing to the WARP projection at the WARP timings in Fig. 16. As for ERLE, the stereo acoustic echo canceller shows better performance than the monaural echo canceller. This is because the monaural echo canceller estimates an echo path model which is a combination of the CCTF and the real stereo echo path, and therefore its performance is affected by the CCTF estimation error. On the other hand, the echo path model for the stereo echo canceller is purely the stereo echo path model, which does not include the CCTF.
Secondly, the WARP acoustic echo canceller is compared with a stereo echo canceller based on the affine projection method. In this case, the right and left sounds at each sample in each frame are assumed to have an independent level shift relative to the original right and left sounds, to simulate small movements of the talker’s face, as
where the two constants determine the level shift ratio and cycle. Figure 15 shows the cancellation performance when the level shift ratio and cycle are 10% and 500 msec, respectively.
In Fig. 15, the WARP method shows a stereo echo path estimation performance (NORM) more than 10 dB better than that of affine projection (P=3). The ERLE of the stereo echo canceller based on the WARP method is also better than that of affine projection (P=3). The ERLE of the monaural acoustic echo canceller based on the WARP method is roughly similar to that of the affine method (P=3); however, the ERLE improvement after two LTI periods is better for the WARP-based monaural echo canceller than for the affine-based stereo echo canceller.
Figure 16 shows the echo canceller performance when the CCTF estimation is degraded by room noise in the far-end terminal. The S/N in the far-end terminal is assumed to be 30 dB or 50 dB. Although the results clearly show that a lower S/N degrades ERLE and NORM, more than 15 dB of ERLE or NORM is attained after two LTI periods.
Figure 17 shows the echo canceller performance when an echo path change occurs. In this simulation, the echo path change is inserted at the 100th frame, and its level is set to 20 dB, 30 dB and 40 dB. It is observed that the echo path change affects the WARP calculation, and therefore the WARP effect degrades at the boundary between the second and third LTI periods.
Figure 18 summarizes the NORM results for the stereo NLMS method, the affine projection method and the WARP method. In this simulation, as a non-linear function for affine projection, independent absolute values of the right and left sounds are added by
where is a constant that determines the non-linear level of the stereo sound and is set to 10%. In this simulation, an experiment is carried out assuming far-end double talking, where WGN with the same power as the far-end single talking is added between frames 100 and 130.
As is evident from the results in Fig. 18, the WARP method shows better stereo echo path estimation performance regardless of the presence of far-end double talking. Even in the case of a 10% far-end signal level shift, the WARP method attains more than 20% better NORM than the affine method (P=3) with the 10% absolute non-linear operation.
5.3.2. WARP echo canceller basic performances for voice
Figure 19 shows NORM and the residual echo level (Lres) for an actual male voice sound source. Since the voice sound level changes frequently, we calculate the residual echo level Lres (dB) instead of the ERLE (dB) used in the white Gaussian noise case. Although NORM and Lres converge more slowly than with white Gaussian noise, quick improvement in both metrics is observed at the border between talkers B and A. In this simulation, we applied a 500-tap NLMS adaptive filter. Affine projection may give better convergence speed by eliminating auto-correlation in the voice; however, this effect is independent of the WARP effect. WARP and affine projection can be used together and may each contribute independently to faster convergence.
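The NLMS adaptation referred to above can be sketched as follows. This is a minimal single-channel illustration and not the simulation code used for Fig. 19: the function name `nlms`, the step size `mu`, the regularization `eps` and the short 8-tap toy echo path are assumptions chosen for readability, whereas the voice experiment uses 500 taps per channel.

```python
import random

def nlms(x, d, taps, mu=0.5, eps=1e-8):
    """Identify an FIR echo path from reference x and microphone signal d with NLMS."""
    w = [0.0] * taps      # adaptive filter coefficients
    buf = [0.0] * taps    # most recent reference samples, newest first
    errors = []
    for xk, dk in zip(x, d):
        buf = [xk] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))   # echo replica
        e = dk - y                                    # residual echo
        power = sum(b * b for b in buf) + eps         # input power used for normalization
        g = mu * e / power
        w = [wi + g * bi for wi, bi in zip(w, buf)]   # normalized gradient step
        errors.append(e)
    return w, errors

# Toy run: white Gaussian reference through a hypothetical 8-tap echo path.
random.seed(0)
h = [0.5, -0.3, 0.2, 0.1, -0.05, 0.03, 0.02, -0.01]
x = [random.gauss(0.0, 1.0) for _ in range(4000)]
d = [sum(h[i] * x[k - i] for i in range(len(h)) if k - i >= 0) for k in range(len(x))]
w, errors = nlms(x, d, taps=len(h))
```

With a white reference the coefficients converge quickly. With a voice reference the strong auto-correlation slows this update down, which is why the text notes that affine projection could add a separate speed-up.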
In this chapter, stereo acoustic echo canceller methods are studied from the cross-channel correlation viewpoint, aiming at conversational DTV use. Among the many stereo acoustic echo cancellers, we focused on the AP (including LS and stereo NLMS methods) and WARP methods, since these approaches cause neither modification nor artifacts in the speaker output stereo sound, which would be undesirable in consumer audio-visual products such as DTV. In this study, the stereo sound generation system is modeled using right and left Pth-order LTI systems with independent noises. The stereo LS method (M=2P) and the stereo NLMS method (M=P=1) are two extreme cases of the general AP method, which requires an MxM inverse matrix operation at each sample. The stereo AP method (M=P) can produce the best iteration direction, fully exploiting the un-correlated component produced by small fluctuations in the stereo cross-channel correlation, by calculating a PxP inverse matrix at each sample. The major problem of the method is that it cannot cope with strict single talking, where no un-correlated signals exist in the right and left channels and the rank drop problem therefore occurs. In contrast to the AP method, the WARP method creates a stereo echo path estimation model by applying a monaural adaptive filter over two LTI periods when a far-end talker change occurs. Since it creates the stereo echo path estimate using two monaural echo path models for two LTI periods, it does not suffer from the rank drop problem even in strict single talking. Moreover, with the WARP method the computational complexity can be reduced drastically, because the WARP method requires PxP inverse matrix operations only when the LTI characteristics change, such as at a far-end talker change. However, in contrast to the AP method, the performance of the WARP method may drop if the fluctuation in cross-channel correlation becomes high.
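The rank drop problem under strict single talking can be illustrated with a small numerical sketch. It assumes that both loudspeaker signals are LTI-filtered versions of one talker signal with no independent noise. All names and values (`g_r`, `g_l`, `h_r`, `h_l`, the constant `c`) are hypothetical and chosen only for illustration. Under that assumption, two different pairs of stereo echo paths produce exactly the same echo, so the echo alone cannot identify the true pair.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

random.seed(1)
s = [random.gauss(0.0, 1.0) for _ in range(200)]   # single far-end talker
g_r, g_l = [1.0, 0.4], [0.7, -0.2]                 # hypothetical far-end paths to the R/L microphones
x_r, x_l = convolve(s, g_r), convolve(s, g_l)      # strictly single-talk stereo signal

h_r, h_l = [0.5, 0.1, -0.2], [0.3, -0.1, 0.05]     # hypothetical true echo paths

def echo(hr, hl):
    """Microphone echo produced by a pair of right/left echo paths."""
    yr, yl = convolve(x_r, hr), convolve(x_l, hl)
    return [a + b for a, b in zip(yr, yl)]

# A different pair: add c*g_l to the right path and subtract c*g_r from the left path.
c = 0.8
alt_r = [a + b for a, b in zip(h_r, [c * v for v in g_l] + [0.0])]
alt_l = [a - b for a, b in zip(h_l, [c * v for v in g_r] + [0.0])]

# Largest difference between the two echoes; zero up to rounding error.
diff = max(abs(a - b) for a, b in zip(echo(h_r, h_l), echo(alt_r, alt_l)))
```

The extra terms cancel because convolution commutes: the right channel filtered by `g_l` equals the left channel filtered by `g_r`. Estimating over two LTI periods with different far-end paths removes this ambiguity, which is the situation the WARP method exploits.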
Considering the above pros and cons of the affine projection and WARP methods, it appears desirable to apply the affine method and the WARP method dynamically, depending on the nature of the stereo sound. In this chapter, an acoustic echo canceller based on the WARP method, equipped with both monaural and stereo adaptive filters, is discussed together with other gradient-based stereo adaptive filter methods. The WARP method observes the cross-channel correlation characteristics of the stereo sound using short-tap pre-adaptive filters. The pre-adaptive filter coefficients are used to calculate WARP functions, which project the monaural adaptive filter estimation results onto the stereo adaptive filter initial coefficients or vice versa.
To clarify the effectiveness of the WARP method, simple computer simulations are carried out using a white Gaussian noise source and a male voice, with a 128-tap NLMS cross-channel correlation estimator, a 1000-tap monaural NLMS adaptive filter for the monaural echo canceller and a 2x1000-tap (2x500-tap for voice) multi-channel NLMS adaptive filter for the stereo echo canceller. The results are summarized as follows:
To account for the sampling effect on analog delay, a 6x oversampling system is assumed for the stereo generation model. Five far-end talker positions are assumed, and the direct-wave sound from each talker is assumed to be picked up by the far-end stereo microphone together with far-end room background noise. The simulation results show that a good cross-channel transfer function estimate can be attained rapidly using a 128-tap adaptive filter if the far-end S/N is reasonable (such as 20-40 dB).
Using the far-end stereo generation model and the cross-channel correlation estimation results, a 1000-tap monaural NLMS adaptive filter and 2x1000-tap stereo NLMS adaptive filters are used to clarify the effectiveness of the WARP method. In the simulation, far-end talker changes are assumed to happen every 80 frames (1 frame = 100 samples). Echo Return Loss Enhancement (ERLE) and NORMalized estimation error power (NORM) are used as measurements. It is clarified that both ERLE and NORM are drastically improved at the far-end talker change by applying the WARP operation.
The far-end S/N affects the WARP performance; however, we can still attain an ERLE or NORM of around (S/N - 5) dB.
We find a slight convergence improvement in the case of the AP method (P=3) with the non-linear operation. However, the improvement is much smaller than that of WARP at the far-end talker change. This is because the sound source is white Gaussian noise in this simulation, and the merit of the AP method is therefore not fully achieved.
Since the WARP method assumes that the stereo echo path characteristics remain stable, changes in the stereo echo path characteristics degrade the WARP effectiveness. The simulation results show that the degradation depends on how much the stereo echo path has moved and that the degradation appears just after the WARP projection.
The WARP method also works correctly for actual voice sound. Collaboration with the AP method may further improve the total convergence speed, because the AP method improves the convergence speed for voice independently of the WARP effect.
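The two measurements named in the summary above can be computed as in the following sketch. It uses the common textbook forms of ERLE and normalized estimation error, and it assumes a sign convention in which a larger value in dB is better, which matches how the results are quoted in this chapter. The function names `erle_db` and `norm_db` and the sample values are hypothetical.

```python
import math

def erle_db(echo, residual):
    """ERLE in dB: echo power before cancellation over residual echo power after it."""
    p_echo = sum(v * v for v in echo)
    p_res = sum(v * v for v in residual)
    return 10.0 * math.log10(p_echo / p_res)

def norm_db(h_true, h_est):
    """Normalized estimation error power in dB (assumed convention: larger is better)."""
    p_true = sum(v * v for v in h_true)
    p_err = sum((a - b) ** 2 for a, b in zip(h_true, h_est))
    return 10.0 * math.log10(p_true / p_err)

# Example values: the residual is one tenth of the echo amplitude, and
# the estimated echo path misses one tenth of the true coefficient.
erle = erle_db([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1])
norm = norm_db([1.0, 0.0], [0.9, 0.0])
```

Both example values come out close to 20 dB, since an amplitude ratio of 10 corresponds to a power ratio of 100.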
As for further studies, more experiments in actual environments are necessary. The author would like to continue this research in order to realize smooth and natural conversations in future conversational DTVs.
If matrix is defined as
where is a sample array composed of white Gaussian noise sample as
is defined as a matrix as
where is P sample array defined as
Then is a Toeplitz matrix and is expressed using () Toeplitz matrix as
This is because element of the matrix, is defined as
the element is given as
By setting the th element of () Toeplitz matrix as (), we define a function which determines Toeplitz matrix.
It is noted that if is an identity matrix, is also an identity matrix.