Effect of Delay on Voice Quality

## 1. Introduction

Echo phenomenon has been always existed in telecommunications networks. Generally it has been noticed on long international telephone calls. As technology advances and the data transmission methods tend more to packet-switching concepts, the traditional echo problem remained up-to-date. An important issue in echo analysis is a round-trip delay of the network. This is a time interval required for a signal from speaker’s mouth, across the communication network through the transmit path to the potential source of the echo, and then back across the network again on the receive path to the speaker’s ear. The main problem associated with IP-based networks is that the round-trip delay can be never reduced below its fundamental limit. There is always a delay of at least two to three packet sizes (50 to 80 ms) (Choi et al., 2004) that can make the existing network echo more audible (Gordy & Goubran, 2006). Therefore, all Voice over IP (VoIP) network terminals should employ echo cancellers to reduce the amplitude of returning echoes. A main parameter of each echo canceller is a length of its coverage. The coverage means the length of time that the echo canceller stores its approximation in memory. The adaptive filter should be long enough to model an unknown system properly, especially in case of VoIP applications (Nisar et al., 2009; Youhong et al., 2005). On the other hand, it is known that an active part of the network echo path is usually much smaller compared to the whole echo path that has to be covered by the adaptive filtering algorithm. That is why the knowledge of the echo delay is important for using echo cancellers in packet-switching networks. Today, there is a wide family of adaptive filtering algorithms that can exploit sparseness of the echo path to reduce high computational complexity associated with long echo paths (Dyba, 2008; Hongyang & Dyba, 2008; Khong & Naylor, 2006; Hongyang & Dyba, 2009). In this chapter, we discuss numerous methods used for estimation of echo delay. Algorithms based on cross-correlation function and adaptive filters are used in the art. We will consider both types of them, discuss their advantages and drawbacks. Afterwards, we will pay our attention to the adaptive filtering techniques. We provide a study on different partial, proportionate, sparseness-controlled time- and frequency-domain adaptive filters. The readers will get closer to an issue of echo cancellation, which is relevant in nowadays telecommunications networks. Ones will able to recognize important features and particular areas of implementation of various adaptive algorithms. Further, we are giving a short introduction to the issue of echo control for telecommunications networks. This description emphasises on two most important aspects of perceptual echo control, which are echo loudness and echo delay.

## 1.1. Echo control issue

In the very beginning of the telephone age, all calls were made through an analog pair of copper wires. The technology has progressively moved to digital circuit switched networks over the past several decades. Today most of the phone traffic is handled by the Public Switched Telephone Network (PSTN), which provides end-to-end dedicated circuits. During the last years a move to packet-switched networks has been initiated to support voice traffic over Internet Protocol (IP). The main reason for the move from circuit-switched voice networks to packet-switched networks is to enable convergence between data services and voice services. It is of economical interest to be able to use the same equipment for voice and data traffic. Reduced cost of placing a phone call is another reason, since the voice-packet is treated and routed much in the same way as any other data packet (note that Quality of Service plays a vital role in this process). Thus, conventional long distance tariffs have a tendency to be completely eliminated in Voice over IP (VoIP) networks as well.

Echo issue has long been recognized as a problem on telecommunications networks, though generally it has been noticed mostly on international telephone calls or when using speaker phones. As technology advances and the information transmission methods tend more to packet-switching concepts, the traditional echo problem should be reviewed and updated. Previously unconsidered factors now play an important part in the echo characteristics. This section describes the echo delay problem, which is often encountered in packet-switched networks. This problem is highlighted in relation to VoIP networks. More specific details on the process of locating and eliminating echoes are included in conclusion to the chapter.

Consider a simple voice telephone call, where an echo occurs when you hear your own voice repeated. An echo is the audible leak-through of your own voice into your own receive path. Every voice conversation has always at least two participants. From the perspective of each participant, there are two voice paths in every call:

Transmit path – The transmit path is usually depicted as Tx path. In a conversation, the transmit path is created when any person begins speaking. The sound is transmitted from the mouth of the speaker to the ear of the listener.

Receive path – The receive path is also called the return and depicted as Rx path. In a conversation, the receive path is created when a person hears the conversation coming from the mouth of another speaker.

Fig. 1 illustrates a simple diagram of a voice call between two persons A (Kirill) and B (Kate). From the user A’s perspective, the Tx path carries his voice to the user B’s ear, and the Rx path carries the user B’s voice to the user A’s ear.

There is one significant factor in the echo analysis, and especially for the packet-switching networks. It is a round-trip delay of the voice network. The round-trip delay is the length of time required for an utterance from the user A’s mouth, across the network on the Tx path to the source of the leak, and then back across the network again on the Rx path to the user A’s ear. Let’s define two important statements about echo nature, which are the following:

The louder the echo (echo amplitude), the more annoying it is,

The longer the round-trip delay (the “later” the echo), the more annoying it is.

Table 1 shows how time delay can affect the quality of a voice conversation.

One-Way Delay Range (ms) | Effect on Voice Quality |

0-25 | This is the expected range for national calls. There are no difficulties during conversation. |

25-150 | This is the expected range for international calls using a terrestrial transport link and IP telephony, which includes only one instance of IP voice. This range is acceptable for most users, assuming the use of echo control devices. |

150-400 | This is the expected range for a satellite link. Delays in this range can interrupt the normal flow of a conversation. A high-performance echo canceller must be used and careful network planning is necessary. |

Greater than 400 | This is excessive delay and must be avoided by network planning. |

Fig. 2 shows how the echo disturbance influenced by the two parameters: delay and echo level. The metric called Talker Echo Loudness Rating (TELR) denotes the level difference between the voice and echo signals. The “acceptable” curve represents the limit for acceptable talker echo performance for all digital networks. The fact that the speaker A, in Fig. 1, hears an echo illustrates one of the basic characteristics of echo: perceived echo most likely indicates a problem at the other end of the call. The problem that is producing the echo that A hears, the leakage source, is somewhere on B’s side of the network. If the person B was experiencing echo, the problem would be on the user A’s side.

The perceived echo usually originates in the terminating side of the network for the following two reasons:

Leakage happens only in analog circuits. Voice traffic in the digital portions of the network does not pass from one path to another.

Echo arriving after very short time, about 25 milliseconds, is generally imperceptible, because it is masked by the physical and electrical side-tone signal.

A hybrid transformer is often main source of the electrical signal leakage. The typical analog telephone terminal is 2-wire device: a single pair of conductors is used to carry both the Tx and Rx signals. For analog trunk connections, known as 4-wire transmission, two pairs of conductors carry separate Tx and Rx signals. Digital trunks (T1/E1) can be virtual 4-wire links because they also carry separate Tx and Rx signals. A hybrid is a transformer that is used to interface 4-wire links to 2-wire links. Fig. 4 shows a hybrid transformer in an analog tail circuit. Because a hybrid transformer is a non-ideal physical device, a certain fraction of the 4-wire incoming (Rx) signal will be reflected into 4-wire outgoing (Tx) signal. A typical fraction for a properly terminated hybrid in a PBX is about -25 decibels (dB), meaning that the reflected signal (the echo) will be a version of the Rx signal attenuated by about 25 dB. For a PSTN POTS (Plain Old Telephone Service) termination, the expected value is between 12 and 15 dB. Echo strength is expressed in dB as a measurement called Echo Return Loss (ERL). Therefore, and ERL of 0 dB indicates that the echo is the same amplitude as the original source. A large ERL indicates a negligible echo. Remember that an echo must have both sufficient amplitude and sufficient delay to be perceived. For local calls with one-way delay from 0 to 25 ms, an echo of strength of -25 dB relative to the speech level of the talker is generally quiet enough to not be annoying. For a one-way delay in the range of 25 to 150 ms, the ERL should exceed 55 dB to eliminate the perception of echo from the end-user perspective, as recommended in ITU-T recommendation G.168 on echo cancellation (ITU-T G.168, 2002). In this case echo cancellation is required.

## 2. Echo delay estimation using cross-correlation

The following section presents a study of cross-correlation-based Time Delay Estimation (TDE) algorithms. The main purpose is to analyze a number of methods, in order to find the most suitable one for real-time speech processing. As TDE is an important topic during transmission of voice signals over packet-switching telecommunication systems, it is vital to estimate the true time delay between Tx and Rx speech signals. We consider algorithms processing both in time- and frequency domains. An echo delay problem associated with IP-based transport networks is also included into the discussion. An experimental comparison of the performance of numerous methods based on cross-correlation, normalized cross-correlation and a generalized cross-correlation function is presented.

### 2.1. General scenario of delay estimation using cross-correlation functions

The known problem associated with IP-based networks is that the round-trip delay can be never reduced below its fundamental limit. There is always a delay of at least two to three packet sizes (50 to 80 ms) that can make the existing network echo more audible. Therefore, all Voice over IP (VoIP) network terminals should employ echo cancellers to reduce the amplitude of returning echoes. A main parameter of each echo canceller is a length of coverage. Echo canceller coverage specifies the length of time that the echo canceller stores its approximation in memory. The adaptive filter should be long enough to model an unknown system properly, especially in case of VoIP applications. On the other hand, it is known that the active part of the network echo path is usually much smaller compared to the whole echo path that has to be covered by the adaptive filtering algorithm inside the echo canceller. That is why the knowledge of the echo delay is important for using echo cancellers in packet-switching networks successfully. In general, every communications system includes a communications network and communications terminals on the both sides of the network. The communications terminals could be telephones, soft phones, and wireless voice communication devices. Fig. 3 illustrates how an echo assessment device can be arranged into the defined system. The echo delay estimator has to monitor two parallel channels. An outgoing voice channel transmits an original voice waveform from the first terminal through the communications network to the second terminal. An incoming voice channel receives an echo waveform of the original signal returning from the second terminal through the communications network back. This is a delayed and attenuated version of the original voice signal.

Fig. 4 illustrates a general block diagram of the echo delay estimator. The echo delay estimator computes correlation between two voice channels for different set of delays in parallel manner (Carter, 1976). The delay-shift with the largest cross-correlation coefficient is selected as the delay estimate. Fig. 5 illustrates, in a flowchart form, steps performed when implementing a method of echo delay estimation utilizing cross-correlation algorithms. Once started from block 1, block 2 calculates the cross-correlation function for a buffer of input samples of the Rx and Tx signals. Block 3 utilizes cross-correlation coefficients to compute the similarities between the transmitted signal and the received signal over a range of delays. For each particular delay, the similarity is obtained. Once the similarities have been determined for each delay within the range of delays, block 4 chooses a delay that produces the greatest similarity metric for the given input frames. Consequently, block 5 indicates that the estimation process is completed.

### 2.2. Algorithms proceeding in time-domain

Time domain implementation of Cross-Correlation Function (CCF) and Normalized CCF (NCCF) is presented. The cross-correlation function for a successive par of speech frames can be estimated by (Mueller, 1975)

Here, *x(n)* simply denotes a frame of the outgoing signal, *y(n)* is related to a frame of the incoming signal. According to Fig. 4, the estimation of the CCF is done for a supposed range of delays. The time-shift, τ, which is always in range of [τ_{min}; τ_{max}] and causes the maximal peak value of the CCF is declared as an estimate of the true echo delay T_{D}. Similarly to the CCF, an estimate of the NCCF is done (Buchner et al., 2006)

Here, E_{x} and E_{y} denotes a short-term energy of the outgoing and the incoming frames. These values are calculated using the following equations

Let us further consider generalized cross-correlation algorithms, which operate in the frequency domain (Youn et al., 1983; Zetterberg et al., 2005).

### 2.3. Algorithms proceeding in frequency-domain

More sophisticated way how to provide TDE is to compute the cross-correlation function in the frequency domain. This process in literature is called Generalized Cross-Correlation (GCC) (Hertz, 1986). The idea behind this method is to perform pre-filtering of the input signals before calculating CCF. It makes possible to improve the accuracy of delay estimation. Note that the filtering procedure is performed in the frequency domain. Let us describe this process in more details. It is well known, that the simple cross-correlation function, *R*_{xy}, between signals *x(n)* and *y(n)* is related to the cross-power density function (cross-power spectrum), *G*_{xy}, by the general inverse Fourier transform relationship, as

When *x(n)* and *y(n)* have been filtered with filters having transfer functions *H*_{1}*(f)* and *H*_{2}*(f)*, the cross-power spectrum between the filter out-puts is given by

Consequently, the Generalized Cross-Correlation Function (GCCF) between *x(n)* and *y(n)* is given by (Knapp & Carter, 1976)

Here, *Ψ*_{g}, is a generalized weighting function. Table 2 represents weighting functions that were used for experiments with speech signals (Wilson & Darrell, 2006).

The parameter *γ*_{xy} denotes a complex coherence function. It can be calculated as (Tianshuang & Hongyu, 1996)

Here, *G*_{xx}*(f)* and *G*_{yy}*(f)* are auto-power spectra of the outgoing and the incoming signal; R_{xx}(m) and R_{yy}(m) are auto-correlation functions of the same signals. Fig. 6 illustrates a block diagram of the implemented generalized cross-correlation algorithm, where the Fast Fourier Transform (FFT) is used for auto-spectra and cross-spectrum calculation. After the cross-power spectrum is estimated, it is multiplied by the corresponding GCC weighting function. The inverse FFT is used for obtaining the time domain generalized-cross correlation function. This operation is repeated for the specified range of possible delays. After the whole process has completed, the time shift with maximum corresponding peak value is declared as an estimation of the true delay.

### 2.4. Discussion over experimental results

We used MATLAB software as a simulation environment. The time difference between time when the outgoing signal leaves the voice terminal and moment when the incoming signal containing the echo of the original signal arrives back from the network is referred to as a true echo delay. This value for the first three figures that are presented below equals 6ms (48 samples). For the purpose of TDE it is also necessary to specify time interval through which the value of the true delay is searched. To cover the 6ms delay we choose the interval between 0 and 10ms what corresponds to the maximum delay value of 60 samples. Afterwards we present the estimation results for a group of different delays. It helps to understand better performance of the algorithms. Unfortunately, because of the non-stationary nature of human speech, the CCF is not reliable for all situations. Its performance highly depends on numerous factors, i.e. signal strength, signal-to-noise ratio (SNR), etc (Chen et al., 2006). The NCCF is not so sensitive to the sudden changes in the signal’s amplitude. It outperforms the CCF when we work with low level signals. The advantages of the algorithms proceeding in the frequency domain compared to the algorithms operating in the time domain are accuracy and reduced computational complexity. Fig. 7 illustrates the outputs of the GCC algorithms, which were presented in Table 2.

Table 3 and 4 provides us along with the following results. The joint comparison was done in terms of the estimation accuracy of the algorithms. The group of delays was chosen for this experiment. Delay values are consistent with the ones referenced in the corresponding ITU-T recommendation G.131 (ITU-T G.131, 2003). Once the respective cross-correlation function was calculated, its maximum peak value is detected using the searching procedure described in Fig. 4. SCC is related to the Standard CC function.

[ms] | SCC | ROTH | SCOT | PHAT | CPS-2 | HT | ECKART | HB | WIENER |

5 | 4,9 | 5,1 | 5,2 | 3,7 | 5,2 | 4,4 | 4,2 | 5,2 | 5,2 |

10 | 9,7 | 10,3 | 10,3 | 7,5 | 10,3 | 8,8 | 8,3 | 10,4 | 10,3 |

20 | 19,5 | 20,6 | 20,7 | 15,0 | 20,7 | 17,7 | 16,7 | 20,8 | 20,7 |

30 | 29,2 | 30,9 | 31,0 | 22,4 | 31,0 | 26,5 | 25,0 | 31,2 | 31,0 |

50 | 48,7 | 51,4 | 51,6 | 37,4 | 51,6 | 44,2 | 41,7 | 52,1 | 51,7 |

100 | 97,3 | 102,9 | 103,3 | 74,8 | 103,3 | 88,5 | 83,3 | 104,2 | 103,4 |

200 | 194,6 | 205,7 | 206,6 | 149,6 | 206,6 | 177,0 | 166,6 | 208,3 | 206,8 |

300 | 292,0 | 308,6 | 309,9 | 224,4 | 309,9 | 265,4 | 249,9 | 312,5 | 310,3 |

[ms] | SCC | ROTH | SCOT | PHAT | CPS-2 | HT | ECKART | HB | WIENER |

5 | 4,9 | 5,1 | 5,2 | 3,7 | 5,2 | 4,4 | 4,2 | 5,2 | 5,2 |

10 | 9,7 | 10,3 | 10,3 | 7,5 | 10,3 | 8,8 | 8,3 | 10,4 | 10,3 |

20 | 19,5 | 20,6 | 20,7 | 15,0 | 20,7 | 17,7 | 16,7 | 20,8 | 20,7 |

30 | 29,2 | 30,9 | 31,0 | 22,4 | 31,0 | 26,5 | 25,0 | 31,2 | 31,0 |

50 | 48,7 | 51,4 | 51,6 | 37,4 | 51,6 | 44,2 | 41,7 | 52,1 | 51,7 |

100 | 97,3 | 102,9 | 103,3 | 74,8 | 103,3 | 88,5 | 83,3 | 104,2 | 103,4 |

200 | 194,6 | 205,7 | 206,6 | 149,6 | 206,6 | 177,0 | 166,6 | 208,3 | 206,8 |

300 | 292,0 | 308,6 | 309,9 | 224,4 | 309,9 | 265,4 | 249,9 | 312,5 | 310,3 |

The abscissa of the largest peak value is the estimated delay. Note that 50 trial speech records for each processor were evaluated to obtain the mean value and the Root Mean Square Deviation (RMSD) parameter (Anderson & Woessner, 1992). Not only different speech signals, but various hybrid impulse response models have been used. The results for delays from 5 to 300 ms are presented in the corresponding tables. Table 3 contents the mean values, whether Table 4 illustrates the estimated RMSD values.

## 3. Echo delay estimation using adaptive filters

In this section, we introduce methods for extracting an echo delay between speech signals using adaptive filtering algorithms. We know that time delay estimation is an initial step for many speech processing applications. Conventional techniques that estimate a time difference of arrival between two signals are based on the peak determination of the generalized cross-correlation between these signals. To achieve a good precision and stability in estimation, the input sequences have to be multiplied by an appropriate weighting function. Regularly, the weighting functions are dependent on the signals power spectra. The spectra are generally unknown and have to be estimated in advance. An implementation of the time delay estimation via the adaptive least mean squares is analogous to estimating the Roth generalized cross-correlation weighting function. The estimated parameters using the adaptive filter have a smaller variance, because it avoids the need for the spectrum estimation. In the following, we discuss proportionate and partial-update adaptive techniques and consider their performance in term of delay estimation.

Time delay estimation (TDE) has always been and remains a popular research topic. It is known, that the main problem associated with IP-based networks is the round-trip delay, which can never be reduced below its fundamental limit. A number of efforts were made in order to improve the TDE precision. Various methods based on the Generalized Cross-Correlation (GCC) were proposed. The GCC algorithms arrange a pre-filter to obtain the modified signal spectrum for optimal time delay estimation. To specify the filter’s characteristic, it requires a priori knowledge of the statistics of the received signals. However, the efficiency of the algorithms decreases considerably when little or no prior knowledge about the signal statistics is known. From the time when Widrow proposed an adaptive filtering technique based on Least Mean Squares (LMS) (Widrow, 2005; Haykin, 2001), an adaptive theory also found an application to delay estimation. An adaptive implementation of the time delay estimation via Widrow’s LMS algorithm is usually referred to as TDLMS. Comparing to the GCC algorithms, the adaptive filtering techniques do not require a priori information of the signal statistics, because the estimation of the signal spectrum is no longer needed. The adaptive filtering algorithms determine the time delay in an iterative manner. There are comparative studies, which provide comparison of the LMS versus the generalized cross-correlation (Zetterberg et al., 2005). Generally, the time domain implementation of any adaptive filter is associated with high computational complexity. It directly depends on the length of the adaptive filter. In order to reduce the computational load of the TDLMS (Emadzadeh et al., 2008), we offer using adaptive filtering algorithms with reduced computational complexity. They provide savings comparing to the conventional adaptive algorithms. In the following, we discuss each of the algorithms in greater details. First, a general scenario for the adaptive time delay estimation using a simple Normalized Least Mean Squares (NLMS) adaptive filtering algorithm is presented. Afterwards, we introduce the proportionate and partial-updated algorithms proceeding in the time domain. A new partial-updated proportionate NLMS algorithm is outlined. A comparison between the TDE algorithms is made in context of the network echo delay estimation.

### 3.1. General scenario of delay estimation using adaptive filters

Traditionally, the NLMS algorithm is used for the echo canceller implementation. It applies a Finite Impulse Response (FIR) adaptive filter with adjustable weights for modelling the unknown echo path’s impulse response. The NLMS algorithm minimizes the least mean squares difference between two signals: the reference incoming signal and the filtered original (outgoing) signal (see Fig. 7).

Basically, the NLMS algorithm is a simple extension of the Widrow’s LMS algorithm. According to Fig. 7, the adaptive filter weights are tuned iteratively using a special rule. The important parameter controlling the whole adaptation process is referred to as a step-size parameter. It should be varied in time in order the algorithm to be able to track non-stationary changes in the echo path’s impulse response. Unlike the LMS algorithm, the NLMS algorithm’s step-size is adjusted according to the instantaneous short-time energy of the input signal. The adaptive filtering process using the NLMS algorithm can be described by the following set of equations

where *w* is a vector containing *L* coefficients of the adaptive filter; *x* is a vector consisting of *L* samples of the input signal *x(n)*; *e(n)* is a difference between the reference signal, *y(n)*, and the adaptive filter output during the *n*^{th} iteration; *μ*_{0} is the fixed NLMS step-size parameter from the interval (0;1). Fig. 8 illustrates the basic principle of the NLMS adaptive algorithm. While the signal *x(n)* corresponding to the outgoing voice signal (that notation is used in the previous section) is referred to as the far-end signal, the signal *y(n)* corresponding to the incoming voice signal is referred to as the near-end signal. Basically, the near-end signal *y(n)* is the delayed and attenuated version of the far-end signal *x(n)*.

### 3.2. Time-domain adaptive algorithms

Knowing the adaptive theory, it is trivial that the delay estimation can be achieved by selecting the largest value from the adaptive filter weights vector, *w*. There is only one issue that has to be taken into account. The adaptive filter needs some time in order to converge to the optimal performance. The existing adaptive algorithms differ from each other with different convergence properties and computational memory requirements. The robust fast converging algorithms are primarily used in the acoustical echo cancellation applications. They take a lot of computational resources. In our case, it is not necessary to apply the complex algorithms, because the adaptive filter is not directly used for the purpose of echo cancellation, but for the delay estimation. Therefore, the reduced complexity adaptive filtering algorithms became the subject of our interest.

#### 3.2.1. Proportionate adaptive filtering algorithms

In the following, we provide a reader along with a principal of the proportionate adaptive technique. The proportionate normalized least mean squares (PNLMS) algorithm proposed in (Duttweiler, 2000) has been developed for use especially in the telephone network environment. For hybrid echo cancellers, it is reasonable to assume that the echo path has a sparse character (i.e., many IR’s (Impulse Response) coefficients are close to zero). Although there are studies and research on the multiple reflection echo paths, a typical echo path impulse response in the practical communication networks has only one reflection, which means all the active coefficients are occupied in a continuous area of the whole echo span. Therefore, we do not need to update all the filter weights to achieve a nominal performance. One can adjust and operate only with non-zero active coefficients. Basically, this is a main idea behind the PNLMS algorithm and other subsequently derived proportionate adaptive filters (Hongyang & Doroslovacki, 2006; Paleologu et al., 2006). Proportionate approaches achieve their higher convergence rate by using the fact that the active part of network echo path is usually much smaller (4-8ms) compared to 64-128 ms of the whole echo path that has to be covered by the adaptive filter. In case of voice transmission over the packet-switching network, these numbers may be more considerable. In the PNLMS algorithm, the adaptive step-size parameters are assigned to all the filter coefficients. They are calculated from the last estimate of the filter weights in such a way that a larger coefficient receives a larger increment. As a result, the convergence rate can be increased the fact that the active taps are adjusted faster than non-active coefficients. Therefore for the sparse IR, the PNLMS algorithm converges much faster comparing to the NLMS. This feature is an advantage especially when it is necessary to estimate the long echo delays. The PNLMS algorithm can be described using the following equations:

where **G***(n-1)* is a diagonal matrix adjusting the step-size parameters, *μ*_{0} is an overall step-size parameter. The diagonal elements of *G(n)* are estimated as follows:

Parameters *δ*_{p} and *ρ* are positive numbers with typical values for *δ*_{p} = 0.01 and for *ρ* = 5/*L*. The first term in (17), *ρ*, prevents *w*_{l}*(n)* from stalling when it is much smaller than the largest coefficient and *δ*_{p} regularizes the updating when all coefficients have zero values at initialization. In spite of the sparse system identification, which is a vital requirement for the fast converging adaptive filters, there is another requirement. It is directly addressed to the adaptive filter implementation. The algorithm should have reasonable power concerns. Unfortunately, the PNLMS algorithm has drawbacks. One of them is an increase in the computational complexity by 50% compared to the NLMS algorithm. Furthermore, the PNLMS algorithm shows the slow convergence rate after the fast initial start. It is because of the slow convergence rate dedicated to the small coefficients (Gay, 1998). The increased computational complexity can be reduced by the way of selective partial-updating. In turn, the slow convergence of the PNLMS in the stable state can be improved by switching from the PNLMS to NLMS equations after the fast initial convergence has been achieved (Benesty & Gay, 2002). Recently researchers proposed the partial-update techniques for using along with the adaptive algorithms. The partial-update algorithms reduce computational complexity by updating only a subset of the filter weights per iteration. They can be relevant for applications requiring fast real-time processing. Unfortunately, there is another side to the coin. The fewer coefficients you are going to update per iteration, there is more misalignment presented in the algorithm. Therefore, a certain trade-off should be made when selecting the number of coefficients to update. The following subsection presents several partial-update algorithms along with the proposed modification of the partial-update PNLMS algorithm.

#### 3.2.2. Partial-update adaptive filtering algorithms

The partial-update algorithms can be seen to exploit the sparseness of the echo path in two different ways. It is known that when the unknown system’s impulse response is sparse, many of the adaptive filter’s weights can be approximated to zero. Alternatively, the sparseness may be present in the weight update vector as a consequence of the distribution of the input samples in the (*L*x1) input vector, *x*_{n}=[*x(n)*, *x(n-1)*, …, *x(n-L+1)*]^{T}. In both these cases, exploiting the sparseness properties can reduce complexity and improve performance of the adaptive algorithm (Fevrier & Gelfand, 1999). Some of the first work on the partial-update algorithms was done by Douglas (Douglas, 1997). It presents the periodic and the sequential updating schemes for the Max-NLMS algorithm. However, these partial-update algorithms show slow convergence 2properties compared to the full-update algorithms. The reason is inconsistent updating schemes. More recently, the partial-updating concept was developed by Aboulnasr (Aboulnasr, 1999). It leads to the M-Max NLMS algorithm and supporting convergence analysis (Aboulnasr & Mayyas, 1998). Another block-updating scheme for the NLMS algorithm was studied by Schertler (Schertler, 1998). The latter work was published by Dogancay and Tanrikulu. They consider approaches for more robust Affine Projection Algorithm (APA) (Dogancay & Tarinkulu, 2001, 2002). Further, we give a summary of the general partial-update algorithms along with the proposed one.

**M-Max NLMS**

The algorithm selects a specified number of the coefficients providing the largest reduction in the mean squared error per iteration (Naylor & Sherliker, 2003). Only *M* out of the total *L* filter coefficients are updated. Those *M* coefficients are the ones associated with the *M* largest values within the following vector |*x(n* - *i + 1)*|; *i* = 1;...; *L*. The update equations for this algorithm are

One of the features of the M-MAX-NLMS algorithm is that it reduces the complexity of the adaptive filter by selectively updating the coefficients while maintaining the closest performance to the full-update NLMS algorithm. We present misalignment curves for the algorithm in the up-following section. The misalignment value was calculated using the equation formula

Here, *w*_{j} is the _{j}-th coefficient of the adaptive filter, while *h*_{j} is the _{j}-th coefficient of the simulated echo path’s IR.

**Selective-partial-update NLMS**

This algorithm opposed to the M-Max NLMS has a block structure. An objective behind the latter is the same: it reduces computational costs by updating a subset of the filter coefficients. But first, the vector *x(n)* and the coefficient vector *w(n)* are arranged into *K* blocks of length *M* = *L*/*K*, where *L* is an integer as in

The coefficient vector’s blocks *w*_{1}*(n)*, *w*_{2}*(n)*,…,*w*_{K}*(n)* represent candidate subsets that can be updated during the current iteration. For a single-block updating scheme, the constrained minimization problem, which is solved by the NLMS algorithm, can be written as

The selection of the block that has to be updated is made by determining the block with the smallest squared-Euclidian-norm update. According to Eq. (24), that justification can be described by the following terms

Generalization from the single-block to the multiple-block updating scheme is done through the following. Suppose that only the *B* (*B*<*K*) blocks with the largest magnitudes are selected to be updated. Let a vector *I*_{B} = {*i*_{1}, *i*_{2},…, *i*_{B}} denote the subset of *B* blocks out of {1, 2, …, *K*} to be updated. Thus, the equation for the *B*-block updating scheme is

where *x*_{IB} and *w*_{IB} are defined as follows

The computational and memory requirements of the selective-partial-update NLMS algorithm are almost identical to those of the selective-block-update algorithm proposed in (Aboulnasr & Mayyas, 1998). Nevertheless, simulation results illustrated in the next section shows that this approach does not lead to the reasonable trade-off between performance and simplicity. The algorithm’s efficiency is weaker than the one of the M-Max NLMS algorithm. As an alternative approach, a sparse-partial-update NLMS algorithm applies more relevant selection criterion.

**Sparse-partial-update NLMS**

This algorithm utilize a so-called sparse-partial (SP) weight selection criterion (Jinhong & Doroslovacki, 2008). The adaptive filter weights are updated based on the largest product of the multiplication of *x(n)* and *w(n)*. The SP-NLMS single-block update equations are given by

Hongyang and Dyba recently suggested a generalization for updating *B* blocks out of *K*, i.e.

**Simple-partial-update PNLMS**

This proposed algorithm exploits the sparseness of the communication channel to speed up the initial convergence and employs the partially updating scheme to reduce the computational complexity. A selection procedure is performed in accordance with the estimated magnitude of the channel’s impulse response. The S-PNLMS algorithm for single -block update is defined as follows. Arrange *x(n)* and *w(n)* into *K* blocks of length *M* = *L*/*K* in the same way as it is done in Eq. (22, 23). Then let *G*_{i}*(n)* denote the corresponding *M*x*M* block of the diagonal weighing matrix, *G(n)*. The recursion for updating adaptive filter weights is given by

where the block selection is done according to the following

It is different to

which is used with the SPU-PNLMS algorithm. It is apparently from the simulations that the S-PNLMS has similar performance to the SP-NLMS and outperforms the SPU-PNLMS algorithm. The S-PNLMS for updating *B* blocks out of *M* has these update equations

Further, we provide the comparison results for the presented algorithms and demonstrate their performance while estimating the predefined echo delay. Table 5 illustrates the computational complexity of the full-update algorithms and shows saving achieved by the partially updating schemes.

ALG. | MULT. | ADD. | DIV. |

NLMS | 3L+1 | 3L-1 | 1 |

M-Max-NLMS | 3M+1 | 3M-1 | 1 |

SPU-NLMS | 3M*B+1 | 3M*B -1 | 1 |

SP-NLMS | 3M*B +1 | 3M*B -1 | 1 |

PNLMS | 6L+1 | 4L-2 | L+1 |

M-Max-PNLMS | 6M +1 | 4M -2 | M +1 |

SPU-PNLMS | 6M*B +1 | 4M*B -2 | M*B +1 |

SP-PNLMS | 6M*B +1 | 4M*B -2 | M*B +1 |

S-PNLMS | 6M*B +1 | 4M*B -2 | M*B +1 |

### 3.3. Discussion over experimental results

To evaluate the performance of the adaptive filtering algorithms, we implemented them in MATLAB environment. The working scenario is as follows. The filters have to estimate the predefined echo path’s impulse responses specified in the ITU-T Recommendation G.168 (ITU-T G.168, 2002). They consist of 1024 taps. The overall step-size parameter, *μ*_{0}, is chosen to be 0.1. The control parameters *ρ* and *δ*_{p} are chosen to be 0.001 and 0.01 respectively. In the first part of the experiment, we look at the misalignment curves of the M-Max-, SPU-, SP- and S-PNLMS algorithms. They are illustrated in Fig. 9 below. The SPU-updating scheme produces the worst results. The proposed S-criterion considerably outperforms it especially in terms of the initial convergence speed. The rest of the algorithms have nearly the same convergence and tracking performance.

All the algorithms, except the M-Max-PNLMS, show poor results when the *M* value equals 64. It can be explained by the fact that the active part of the IR is approximately 16ms long. This value corresponds to 128 samples for sampling frequency of 8kHz, therefore, 64 samples are not enough to cover the active region completely. Regarding to the dissimilar selection criterion, the M-Max-PNLMS algorithm can deal relatively well with that problem. The Max-updating formula does not count with the sparse character of the IR. It performs selection according to the distribution of the values of the input vector. Otherwise, its drawback is lower initial convergence speed comparing to the SP-PNLMS algorithm. The second part of our experiment concerns the performance of the adaptive algorithms versus the ones based on the generalized cross-correlation function. They are compared in the context of the time delay estimation. The time difference between time when the outgoing signal leaves the voice terminal and the moment when the incoming signal containing the echo of the original signal arrives back from the network is referred to as a true echo delay. Table 6 concludes the results for the AF and GCC algorithms for different delay values. Fig. 10 shows the estimated echo delay diagram for the echo path that has a 60 ms pure delay. It can be observed that the outputs from the adaptive filter in the steady state have much smaller variance than the results obtained from the algorithm based on the generalized cross-correlation function.

[ms] | ROTH | SCOT | CPS-2 | HB | M-Max | SPU | SP | S | samples |

5 | 5,1 | 5,2 | 5,2 | 5,2 | 5,1 | 5,1 | 5,0 | 5,0 | 40 |

10 | 10,3 | 10,3 | 10,3 | 10,4 | 10,3 | 10,3 | 10,4 | 10,1 | 80 |

20 | 19,6 | 19,7 | 20,0 | 20,0 | 20,5 | 20,5 | 20,1 | 20,3 | 160 |

30 | 29,3 | 29,6 | 30,1 | 30,0 | 30,8 | 30,8 | 30,1 | 30,2 | 240 |

50 | 49,0 | 49,7 | 49,4 | 49,8 | 51,3 | 51,3 | 50,3 | 50,1 | 400 |

100 | 98,1 | 99,4 | 98,8 | 99,6 | 102,5 | 102,5 | 100,5 | 100,6 | 800 |

200 | 196,2 | 198,8 | 197,6 | 199,1 | 205,1 | 205,1 | 201,3 | 201,2 | 1600 |

300 | 294,2 | 298,1 | 296,4 | 298,7 | 307,6 | 307,6 | 301,8 | 301,9 | 2400 |

## 4. Partial, proportionate and sparse control for multi-delay filters

Taking into the account the fact that the generalized cross-correlation algorithms operate in the frequency domain and use advantages of the fast Fourier transform, reasonable computational savings for adaptive filtering algorithms can be achieved as well. The basic operation underlying frequency domain adaptive filters is the transformation of the input signal into a more desirable wave form before its adaptive processing. This is accomplished by the Discrete Fourier Transform (DFT) whereby the input signal is transformed to the frequency domain as shown in Fig. 11.

Frequency Domain (FD) adaptive filters have primarily two advantages compared to time domain implementations (Dohnal, 1995). The first advantage is the potentially large savings in the computational complexity. The Fast Fourier Transform (FFT) is an efficient implementation of the DFT, which provides these savings. A second advantage is that the DFT structures generate signals that are approximately uncorrelated (orthogonal) (Shynk, 1992). As a result, a time-varying step-size parameter is used for each adaptive weight, thereby allowing a more uniform convergence rate across the adaptive filter. Thus, the overall convergence rate of the algorithm may be improved, sometimes approaching that achievable with Recursive Least Squares (RLS) algorithms without a similar increase in the computational complexity (Widrow & Stearns, 1985). Many other attractive features and variation of the FD adaptive filters can be found in the literature (Khong et al., 2007).

### 4.1. Sparse partial-update proportionate multi-delay filter

The multi-delay frequency domain adaptation algorithm generates its individual step-size control factors that are mutually orthogonal. This advantage put the Multi-Delay block Frequency domain (MDF) algorithm closer to the analytical RLS algorithm. It differs from the algorithms processing in the time domain with better convergence properties and lower computational complexity. The MDF algorithm decreases the processing delay associated with frequency domain adaptive filters. The delay value is directly proportional to the number of the filter coefficients. The multi-delay approach solves this problem by partitioning the time-domain filter of length *L* into *K* sub-filters each of length *N* with *L*=*KN*. Consequently, the delay value is reduced by a factor of *L*/*N* compared to the full length approach. The diagram of the MDF algorithm is shown in Fig. 12. This algorithm uses the FFT of size *N*_{FFT}. It equals to the smallest power of two integers larger than or equal to 2*L*/*K*=2*KN*/*K*=2*N*. The MDF algorithm works as follows. Let us first define *m*, as the frame index. The speech signals are operated on the frame-by-frame basis.

The first step of the MDF algorithm is to convert the most recent overlapped input samples to the frequency domain as follows

The input vectors incoming to each of the sub-filter blocks can be obtained via the frame index shifting without invoking any computation as follows:

The described approach suggests that only one FFT is needed per frame iteration in order to transform the input vector into the frequency domain. It implies a significant computation saving. The output and the error frequency-domain vectors can be expressed and calculated as

where *W*_{f}*(k,m)* is the *k*^{th} coefficient vector and *d(m)* is the desired vector. The coefficient update equations to minimize the Mean Square Error (MSE) are the following

where *k*=1,2,…, *K* and μ_{k} is the block step-size control parameter, 0<<*λ*<1 is the forgetting factor. Further, we examine how the partial update technique may be incorporated into the MDF algorithm. For frequency-domain selection, the known approach is to select frequency bins corresponding to the largest magnitude Fourier transform of the tap-input over all the sub-filter blocks *k*=1,2,…, *K* (Khong et al,2008). However, the selection is better to perform on the coefficient-block basis (Deng & Dyba, 2008). It is faster and takes less computation power. Therefore, first we need to calculate metrics for each of the sub-filter in order to perform selection procedure. The number of metrics equals the number of sub-filter blocks *K*. One particular frequency-domain metric, *Q*_{k}, may, for example, represent the sum of the frequency bins in the corresponding sub-filter block

There is a little bit different selection approach, which shows better convergence performance compared to the last one. Define the 2*L* x 1 vector *χ(m)*, which consists of the concatenated Fourier transform of the input signal across all sub-filters, as

Each element in the 2*L* x 2*L* diagonal selection matrix *P(m)*

The tap-selection routine is done by multiplying this matrix with the corresponding input in every *k*^{th} sub-filter, as it is shown below

Otherwise, if it is supposed that the algorithm models a sparse network impulse response, the sparse partial updating scheme should be involved. The control metrics are calculated as follows (*k*=1,2,…, *K*)

Here, we suggest using a new type of metric, in order to choose sub-filter blocks for updating. Note, that it should be calculated in the time-domain as follows

We have to remember that the minimum number, *M*, of the sub-filter blocks to be updated must equal or be greater than two (2≤*M*<*K*). This is because the pure delay of the echo path is never known. Thus, if only one sub-filter is chosen, it can not cover all the possibilities of the variation of echo path. Another important question for frequency domain adaptive filters is the FFT length? – From our point of view, it is reasonable using the FFT of 256 samples long. This is an optimal length, which can provide satisfactory results for the high range of MDF algorithms. We have tried to apply different values, but they did not bring sufficient improvements. For example, if the echo path consists of 1024 taps, the adaptive filter should be divided in 1024/(256/2) sub-filters. The number *K* of sub-filters will equal 8.

Before we present a proposed frequency domain algorithm, let us resume the following. The partial update frequency-domain algorithms may use the values of the calculated block metrics as selection criteria. Each type of metrics has its own features. In case of (13, 18, 19), only *K* (*K*<<2*L*) metrics are sorted in order to select *M* (2≤*M*<*K*) active subfilters. Since *K* is a small number compared to the entire number 2*L* of subfilters coefficients in (14, 15, 16, 17), the sorting overhead may be reduced by factor 2*L*/*N*. We have spoken about the features of sparse echo paths and the possibility of improving their convergence rate by assigning the individual step-size parameters proportionally to the coefficient magnitudes. In the art, there is a big family of various MDF algorithms. Researchers are trying to improve algorithms performance from one hand and to reach computational savings from the other hand. Further we present our approach how to deal with this contradictory problem. First, we describe the basic idea of sparseness measure, and show how to incorporate it into the MDF framework. The concept of sparseness refers to a representational scheme where only a few units (out of a large massive) are effectively used to represent typical data vectors. In effect, this implies most units taking values close to zero while only few take significantly non-zero values. Numerous sparseness measures have been proposed and used in the literature to date (Benesty et al, 2006). Such measures quantify how much energy of a vector is packed into only a few components. On a normalized scale, the sparsest possible vector (only a single component is non-zero) should have a sparseness of one, whereas a vector with all elements equal should have a sparseness of zero. In our approach, we use s sparseness measure based on the relationship between the *L*_{1} norm and the *L*_{2} norm:

Here, *L* is the length of the adaptive filter in the time domain. This function evaluates to unity if and only if *w* contains only a single non-zero component, and takes a value of zero if and only of all components are equal (up to signs), interpolating smoothly between the two extremes. We use this sparseness measure *ξ* of the time domain estimated filter coefficients to adaptively control the number of taps that need to be updated

Note that the total number of corresponding frequency domain coefficients is 2*L* for an *L*-long time domain adaptive filter, therefore

If selection is performed on the block-based basis, then the corresponding number of sub-filters in frequency domain equals

Fig. 13 shows sparseness measure for sparse and dispersive impulse responses. It is also shown the number of adaptive filter coefficients that should be updated, if partial updating scheme is considered.

The convergence process, by itself, can be divided into two stages, i.e. before and after the convergence of the impulse response (or some time, before and after the convergence of the considerable/large coefficients). It has been showed in the time domain adaptive filtering that it is better using PNLMS/IPNLMS at the initial stage for fast convergence, and afterwards switching to NLMS for assuring better misalignment. Here, for recognizing between two stages in adaptation process, we propose to use the following parameter

The quantity in Eq. (56) represents an energy measure in [dBm] within an estimated impulse response. Fig. 14 demonstrates the curve for this parameter for two frequency domain algorithms. It can be observed that after 0.5 seconds estimation of IR energy measure for IPMDF stops fluctuating. Consequently, this fact can be used for switching between different adaptation schemes.

Before representing the proposed MDF scheme, declare the following statements:

I. Utilization of sparseness measure, 0<ξ(m)<1 (for real IR)

ξ(m)<0.7

IR is considered dispersive

M_{t} = L, if fully updated scheme is chosen (mostly during initial period)

L/K ≤ M_{t} < L, if partial updated scheme is chosen

ξ(m)≥0.7

IR is considered sparse

Mt = L.(1- ξ(m))

II. Utilized updating schemas

non-partial

**χ**– based selection (coefficient-based), 2L-vector**μ**- based selection (block-based), K-vector

III. Utilization of IR energy measure, *η*

|η(m) - η(m-1)| ≤ Δ

_{η}, switching to the MDF algorithm|η(m) - η(m-1)| > Δ

_{η}, switching to the IPMDF (SC-IPMDF) algorithm

Finally, our proposed algorithm can be described as:

1^{st} stage: |η(m) - η(m-1)| ≤ Δ_{η}

### 4.2. Discussion over experimental results

To compare the performance of the proposed algorithm with reference MDF, IPMDF, SC-IPMDF adaptive filtering algorithms, we implement all four adaptive filters in MATLAB environment. These MDF-based filters estimate the impulse response of the predefined echo paths, which are specified by the ITU-T Recommendation (ITU-T G.168, 2002). The tested impulse responses are 1024 taps long. All experiments are performed using pre-recorded real speech signals. The number *K* of sub-filters equals 8. The following values for control parameters are used: *μ*_{initial} = 0.1, *α*_{initial} = -0.75, *ε* = 10^{-3}, *Δ*_{η} = 0.05 dBm, *T*_{η} = 8. The performance of each algorithm is studied using the normalized misalignment parameter, which can be estimated as follows

where *h* is a true impulse response of length *L*. Another criterion is Echo Return Loss Enhancement (ERLE), which is used in real-life environment to evaluate performance

where *y(m)* is a desired signal (echo) and *d(m)* is adaptive filter’s output. Note that any reliable adaptive filter with disabled residual echo suppressor has to achieve ERLE of -15dB within 1 second after starting convergence process (ITU-T G.131, 2003). Fig. 15, 16, 17 illustrate an application of *μ*-based selection metric. This kind of metric is used for sub-filter selection, when estimated sparse measure parameter, ξ, equals or larger than 0.7. This value was defined experimentally during multiple trials for numerous types of echo path. We suggest using *μ*-based selection metric as an individual block step-size parameter. It helps accelerating a speed of convergence by allocating larger step-size values for currently updated sub-filters. If you look at the diagram illustrated in Fig. 17 carefully, you will notice that the energy, which is available for adaptation, is concentrated around the sparse region of the echo path. Thus, this fact can be used for selecting sub-filters to be updated along with setting the step-size parameter for these sub-filters. When estimation of sparse measure, ξ, is smaller than 0.7, we suggest switching to the *χ* – based selection metric.

Fig. 18 demonstrates the normalized misalignment and ERLE parameters obtained for real speech signals. The proposed partially updated scheme for MDF shows the similar performance comparing to the other three fully updated frequency domain algorithms. During our future work we are going to enhance the above described algorithm and propose a new class of partial sparse-controlled robust algorithms, which will work reliably, even in double-talk situation. We will apply all the knowledge, which were presented within this particular chapter. Further to conclude the chapter, let us provide summary of material and make several contributions according to the proposed algorithms.

## 5. Conclusion

The first section outlines a basic principle of echo control in packet-based networks. It explains why it is so important to provide monitoring during telephone conversations. When delivering the VoIP service in the packet-switching network, it is important to have the value of the echo delay under control. The increasing transmission delay associated with packet data transmission can make a negligible echo more annoying. Therefore, it is suggested using the echo assessment algorithm. It is purpose is to add an additional attenuation to a particular voice channel (which in terms means to activate an echo canceller), so as to remove the unwanted echo in time. In the second section, we consider an opportunity of using cross-correlation for estimating echo delays. That section provides readers along with up-to-date correlation-based TDE algorithms, which we use to estimate the echo path delays. The problem of long delays taken place in the packet-switching network is considered as a topic of interest. The experiments show that the algorithms precision decreases with increasing transmission delays. The generalized cross-correlation algorithms operating in the frequency domain provide more reliable result comparing to the standard cross-correlation and normalized cross-correlation algorithms. As an alternative to correlation-based methods, techniques, which use adaptive filtering algorithms, can be also applied. Therefore, the third section presents numerous partial-update algorithms and their application to delay estimation. The echo assessment is based on the reduced complexity partial-update adaptive filters. The experiments show a reliable performance of these algorithms. However, their precision suffers during the initial stage of convergence. According to the ITU-T Recommendation G.168, this period should not last more than one second. The Multi-Delay block Frequency domain (MDF) adaptive algorithm can easily outperform all existing time domain algorithms. Moreover, taking into the account the fact that the generalized cross-correlation algorithms operate in the frequency domain and use advantages of the fast Fourier transform, further computational savings for the adaptive filters are achieved in the frequency domain. Therefore, the fourth section deals with partial, proportionate, and sparse-controlled adaptive filtering algorithms working in the frequency domain. What we claimed, within this section, is: a new metric for performing partial updating; a new approach for designating transitions between MDF and IPMDF-based updating schemas; a method for estimating step-size control parameter; a new partially updated sparseness-controlled improved proportionate multi-delay filter; all the approaches are suitable for implementation whether in time or frequency domains. The proposed algorithm has both a performance compared to the IPMDF and SC-MDF algorithms and reduced computational complexity along with the adjustable step-size parameter. Although the preferred embodiments of the proposed algorithm have been described, it will be understood by those skilled in the art that various changes may be made thereto without departing from the main scope of the invention or the appended claims.

## Acknowledgments

This work was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS 10/275/OHK3/3T/13 and by Grant The Ministry of Education, Youth and Sports No. MSM6840770014.