Open access peer-reviewed chapter

Spatial Audio Signal Processing for Speech Telecommunication inside Vehicles

Written By

Amin Saremi

Reviewed: 20 April 2022 Published: 10 June 2022

DOI: 10.5772/intechopen.105002

From the Edited Volume

Advances in Fundamental and Applied Research on Spatial Audio

Edited by Brian F.G. Katz and Piotr Majdak

Abstract

Since the introduction of hands-free telephony applications and speech dialog systems in the automotive industry in the 1990s, microphones have been mounted in car cabins to capture the driver’s speech signals and route them to the corresponding telecommunication networks. A car cabin is a noisy and reverberant environment where engine activity, structural vibrations, road bumps, and cross-talk interference can add substantial amounts of acoustic noise to the captured speech signal. To enhance the speech signal, a variety of real-time signal enhancement methods such as acoustic echo cancelation, noise reduction, de-reverberation, and beamforming are typically applied. Moreover, the recent introduction of AI-driven online voice assistants in the automotive industry has resulted in new requirements on speech signal enhancement methods to facilitate accurate speech recognition. In this chapter, we focus on spatial filtering techniques that are designed to enhance signals that arrive from certain directions while attenuating signals that originate from other locations. The fundamentals of conventional beamforming and echo cancelation are explained and accompanied by some real-world examples. Moreover, more recent techniques (namely blind source separation and neural-network-based adaptive beamforming) are presented in the context of automotive applications. This chapter provides readers with both fundamental and hands-on insights into the fast-growing field of automotive speech signal processing.

Keywords

  • automotive speech signal processing
  • hands-free telephony
  • automotive voice assistant
  • beamforming
  • acoustic echo cancelation

1. Introduction

In the 1990s, the first telephony systems were introduced in vehicles to enable drivers to hold hands-free phone calls through the vehicle’s embedded microphones and loudspeakers while driving [1]. To assure the audio quality during hands-free telecommunication, a number of speech signal processing techniques are widely used. Besides hands-free telephony, speech dialog systems have been developed to enable drivers to control vehicle functions and media content by voice [2, 3]. At the core of a speech dialog system, there is an acoustic model that performs the speech recognition task. Speech dialog systems require high-quality audio input to assure the accuracy of the speech recognition.

In a vehicle audio system, the phone is connected to the infotainment head unit via a Bluetooth communication channel, which allows the driver’s speech (near end) to be routed from the microphones mounted inside the vehicle to the other side of the telecommunication network (far end). Conversely, the speech signal received from the far end is played on the vehicle’s loudspeakers.

A major problem that typically arises in this communication system is that the far-end talker hears a replica of their own voice back from the vehicle with a certain delay (i.e. acoustic echo). The acoustic echo is due to the acoustic feedback from the loudspeakers to the microphones in the vehicle [4, 5, 6]. Various acoustic echo cancelation (AEC) solutions have been developed to address this issue. Most of these AEC solutions use adaptive filters that aim to model the acoustic path between the loudspeakers and the microphones and thereby estimate the echo and subtract it from the received signal [4, 5, 6].

Another major problem is that the signals captured by the microphones are contaminated with ambient noise and reverberation. The ambient noise often consists of stationary noise sources (engine noise, road noise, window vibrations) and non-stationary cross-talk from other car occupants. To address stationary ambient noise, high-pass filters were used, mainly to filter out the engine noise and structural vibration components in the captured signal [1, 2]. To address non-stationary ambient noise, adaptive algorithms have been extensively developed (e.g. [7]).

Moreover, directional microphones have previously been used in vehicles to form a spatial focus toward the driver while attenuating signals arriving from other directions. These directional microphones were often of the cardioid type, usually mounted on the ceiling of the cabin above the driver’s head and directed toward the driver’s mouth. Most common cardioid microphones are electret condenser components that achieve this type of directivity by means of mechanical channels mounted in their membranes. In the 2000s, a new generation of microphones, known as micro-electromechanical system (MEMS) microphones, was introduced to the electronics industry, providing superior performance in coding the sound at a low cost [8]. Since the mobile phone industry started to deploy MEMS microphones extensively in its products, this type of microphone has prevailed in most telecommunication applications, e.g. in tablets, wearable devices, medical systems, and automobiles [8].

A major difference between MEMS and electret microphones is that MEMS microphones, due to their specific design and miniature structure, are omnidirectional, i.e. they treat sounds arriving from all directions equally. Therefore, the desired directivity needs to be implemented by means of external post-processing. To do so, a number of MEMS microphones are placed at a certain distance from each other, forming an ‘array’, and array signal processing techniques are applied to exploit the time differences and relative phase shifts across the signals captured by the microphones in the array. These techniques amplify or attenuate sounds arriving from specific directions and thereby create the desired spatial directivity [9, 10, 11].

2. Spatial filtering: beamforming

2.1 Basic concepts

A beamformer is a signal processing module that performs spatial filtering to separate signals that have overlapping temporal and frequency contents but originate from different spatial locations [9, 10, 11, 12]. A conventional linear beamformer, as shown in Figure 1, is a filter-and-sum system in which a filter is applied to each input array signal and the filter outputs are then summed. The task is to set the complex filter coefficients (Wj in Figure 1) so as to amplify signals from specific directions in the received array signals and suppress signals from other directions.

Figure 1.

A linear beamformer that consists of an array of M microphones. The signal captured by jth microphone passes through the jth finite-impulse-response filter defined by its weights (Wj). The direction of arrival (DOA) is shown by θ.

From another perspective, a beamformer can be viewed as a multiple-input single-output system whose output y[n] is determined by Eq. (1) below.

$$y[n] = \sum_{j=1}^{M} \sum_{p=1}^{L} W_{j,p}\, x_j[n-p] \tag{1}$$

Eq. (1) can also be viewed as a summation of M finite impulse response (FIR) filters, each with L coefficients, applied to the input signals (xj[n]). Eq. (1) can be summarized into Eq. (2), where T denotes the Hermitian (complex conjugate) transpose and W represents an M × L matrix of filter coefficients.

$$y[n] = W^{T} x[n] \tag{2}$$
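The double sum in Eq. (1) can be sketched directly in code. The following is a minimal illustration rather than an optimized implementation: the function name `filter_and_sum`, the use of real-valued coefficients, and the zero-padding before the first sample are all assumptions made for this example.

```python
def filter_and_sum(signals, weights):
    """Filter-and-sum beamformer output following the indexing of Eq. (1).

    signals: list of M lists; signals[j][n] is sample n of microphone j.
    weights: list of M lists of L coefficients; weights[j][p - 1] is W_{j,p}.
    Assumption: samples before the start of the recording are zero.
    """
    num_samples = len(signals[0])
    output = [0.0] * num_samples
    for mic_signal, mic_weights in zip(signals, weights):
        for n in range(num_samples):
            acc = 0.0
            for p, w in enumerate(mic_weights, start=1):
                if n - p >= 0:
                    acc += w * mic_signal[n - p]
            output[n] += acc
    return output


# Two microphones, L = 2 taps each (illustrative values only).
x = [[1.0, 0.0, 0.0, 0.0],   # impulse reaches microphone 1 at n = 0
     [0.0, 1.0, 0.0, 0.0]]   # the same impulse reaches microphone 2 at n = 1
w = [[0.0, 0.5],             # microphone 1: delayed by two samples
     [0.5, 0.0]]             # microphone 2: delayed by one sample
y = filter_and_sum(x, w)
```

With these weights, the two delayed impulses line up at the same output sample and add coherently, which is the time-alignment idea behind the beamformers discussed in the next section.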

2.2 Conventional beamforming

A conventional beamformer assumes that each described filter needs to apply a specific tap delay (τ) on the corresponding array signal to properly align the inputs to achieve the desired directivity in the output. In this sense, each FIR filter has the following frequency response. The first filter (p = 1) has no associated tap delay since the signal from the first microphone is considered the zero-phase reference.

$$r(\omega) = \sum_{p=1}^{L} W_p\, e^{-j\omega\tau(p-1)} \tag{3}$$

Assuming that the propagating sound pressure is a complex plane wave with direction of arrival (DOA) θ and frequency ω, the tap delay at the pth filter (τp) is a function of θ, and Eq. (3) can be rewritten as below.

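$$r(\omega) = W^{T} D(\omega,\theta) \tag{4}$$
$$D(\omega,\theta) = \begin{bmatrix} 1 & e^{-j\omega\tau_2(\theta)} & \cdots & e^{-j\omega\tau_M(\theta)} \end{bmatrix}^{T} \tag{5}$$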

The term D(ω,θ) is known as the array response vector. D(ω,θ) determines the spatial outcome of the beamformer and thus is also called the steering vector or direction vector. The simplest solution is to apply a constant delay per array element, a so-called ‘delay-and-sum’ algorithm. Accordingly, each array signal is delayed by $\tau_p = (p-1)\,\frac{d}{c}\,\sin\theta$, where c is the speed of sound (343 m/s at 20 °C), d is the distance between adjacent array elements, and p extends from 1 to M.
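The delay-and-sum rule can be illustrated with a short sketch that computes the per-element delays and the resulting array gain for a plane wave. The function names, the uniform 1/M weighting, and the sign convention of the phase term are assumptions made for this example; the array parameters in the usage lines are illustrative only.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s at 20 degrees C, as quoted in the text


def steering_delays(num_mics, spacing_m, doa_deg):
    """Delays tau_p = (p - 1) * (d / c) * sin(theta) for p = 1..M."""
    unit = spacing_m / SPEED_OF_SOUND * math.sin(math.radians(doa_deg))
    return [(p - 1) * unit for p in range(1, num_mics + 1)]


def delay_and_sum_gain(num_mics, spacing_m, steer_deg, arrival_deg, freq_hz):
    """Magnitude response of an array steered to steer_deg for a plane wave
    arriving from arrival_deg (assumption: uniform weights 1/M)."""
    omega = 2.0 * math.pi * freq_hz
    steer = steering_delays(num_mics, spacing_m, steer_deg)
    arrive = steering_delays(num_mics, spacing_m, arrival_deg)
    total = sum(cmath.exp(-1j * omega * (a - s)) for a, s in zip(arrive, steer))
    return abs(total) / num_mics


# Hypothetical 4-element array with 2 cm spacing, evaluated at 2 kHz.
on_target = delay_and_sum_gain(4, 0.02, 30.0, 30.0, 2000.0)
off_target = delay_and_sum_gain(4, 0.02, 30.0, -60.0, 2000.0)
```

A wave from the steered direction is summed coherently and passes with unit gain, whereas a wave from another direction is summed with mismatched phases and is attenuated.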

The distance between the array elements, d, is an essential geometric constraint that has a great effect on the performance of such a delay-and-sum configuration. An important limitation that the microphone spacing imposes on beamforming performance is the ‘spatial aliasing frequency’ (fal), calculated as $f_{al} = \frac{c}{2d}$, which gives the upper frequency limit of the delay-and-sum system. This is because, at this frequency, the spacing between the microphones equals half the wavelength (λ/2) of the signal (see Figure 3 and Figure 4 of [12]). Therefore, to avoid spatial aliasing, the distance between the microphones should be chosen carefully for the delay-and-sum beamformer so as to push fal above the frequency range of interest.
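As a quick numerical check of this limit, the sketch below evaluates the spatial aliasing frequency for the two microphone spacings quoted later in this chapter (d = 4.8 cm for the car and d = 23 mm for the truck). The function name is chosen for this example only.

```python
SPEED_OF_SOUND = 343.0  # m/s at 20 degrees C, as quoted in the text


def spatial_aliasing_frequency(spacing_m, speed_of_sound=SPEED_OF_SOUND):
    """Upper frequency limit f_al = c / (2 d) of a delay-and-sum array, in Hz."""
    return speed_of_sound / (2.0 * spacing_m)


f_al_car = spatial_aliasing_frequency(0.048)    # car array, d = 4.8 cm
f_al_truck = spatial_aliasing_frequency(0.023)  # truck array, d = 23 mm
print(f"car:   {f_al_car:.0f} Hz")
print(f"truck: {f_al_truck:.0f} Hz")
```

Under these assumptions the limit is roughly 3.6 kHz for the car array and roughly 7.5 kHz for the truck array, which illustrates why a smaller spacing extends the usable bandwidth of a plain delay-and-sum design.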

In more sophisticated beamformers, the tap delay values (τp) in Eqs. (4) and (5) are set as a function of angular frequency (ω) in a filter-and-sum configuration. The aim is to control the behavior of the system in different frequency ranges and assure a consistent directivity across the entire frequency range of interest. A well-designed filter-and-sum beamformer with tailored frequency-dependent tap delays, τp(ω), can overcome the upper frequency barrier (fal) to a good degree.

If the angles at which the interfering signals arrive are known, it is possible to design the steering vector so that the beamformer minimizes the sound intensity (represented by the statistical variance of the data) arriving from these specific angles. In this configuration, called linearly constrained minimum variance (LCMV) beamforming, the steering vector is designed to place nulls in the given interference directions while amplifying the desired DOA.
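The idea of linear constraints can be sketched for the smallest possible case. With two microphones and two constraints (unit gain toward the target, a null toward one interferer), no degrees of freedom remain for variance minimization, so the weights follow from a 2 × 2 linear system. This is therefore a degenerate null-steering illustration rather than a full LCMV solver; the function names, the phase convention, and the example angles are assumptions.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, value quoted in the text


def phase_term(spacing_m, angle_deg, freq_hz):
    """Relative phase of the second microphone for a plane wave from angle_deg."""
    tau = spacing_m / SPEED_OF_SOUND * math.sin(math.radians(angle_deg))
    return cmath.exp(-1j * 2.0 * math.pi * freq_hz * tau)


def null_steering_weights(spacing_m, target_deg, interferer_deg, freq_hz):
    """Solve c1 + c2 * a(target) = 1 and c1 + c2 * a(interferer) = 0."""
    a_t = phase_term(spacing_m, target_deg, freq_hz)
    a_i = phase_term(spacing_m, interferer_deg, freq_hz)
    if a_t == a_i:
        raise ValueError("target and interferer are indistinguishable")
    c2 = 1.0 / (a_t - a_i)
    c1 = -a_i * c2
    return c1, c2


def response(coeffs, spacing_m, angle_deg, freq_hz):
    """Complex array response for a plane wave arriving from angle_deg."""
    c1, c2 = coeffs
    return c1 + c2 * phase_term(spacing_m, angle_deg, freq_hz)


# Hypothetical scenario: 4.8 cm spacing, 1 kHz, target at 30 and interferer at -60 degrees.
coeffs = null_steering_weights(0.048, 30.0, -60.0, 1000.0)
gain_target = abs(response(coeffs, 0.048, 30.0, 1000.0))
gain_interferer = abs(response(coeffs, 0.048, -60.0, 1000.0))
```

With more microphones than constraints, the remaining freedom is what an LCMV design would use to minimize the output variance.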

Figure 1 presents a one-dimensional beamformer that operates in the xy plane, since the DOA (θ) lies in that plane. However, if necessary, microphones can be added along the z axis, and equations similar to Eqs. (1)–(5) can be written for the xz plane with a DOA in that plane. Accordingly, a two-dimensional beamformer is created that filters the xyz space with regard to one DOA in the xy plane and another DOA in the xz plane.

From another perspective, Figure 1 depicts a ‘broadside’ beamformer, which is designed to form a beam toward a target located in the broadside plane of the microphone array. However, if the target is located along the axis of the array (therefore θ = ±90°), the configuration is called ‘end-fire’ [9]. In an end-fire configuration, the summation in Figure 1 is replaced by subtraction. Consequently, the filter outputs in Eq. (1) are subtracted instead of summed. Thus, an end-fire beamformer is also called a ‘filter-and-subtract’ or ‘differential’ beamformer. This type of beamformer, which can be viewed as a special case of the general beamformer shown in Figure 1, forms a beam either above the array axis (θ = 90°) or below the array axis (θ = −90°).

2.2.1 Fixed beamforming vs. adaptive beamforming

In fixed beamforming, the DOA is known and time-invariant; thus the steering vector, D(ω,θ), can be set for a known fixed geometry. A good example of fixed beamforming is found in the automotive industry, where the target talker (the driver) sits in a fixed location and the DOA toward the microphones is predetermined. Fixed beamforming can be viewed as a ‘data-independent’ algorithm since the steering vector is designed solely based on the known geometry of the sound propagation and is independent of the received data. In contrast, in adaptive beamforming, the DOA varies and the steering vector should adapt to the changes in DOA. For example, an adaptive beamformer is needed if the system is supposed to localize and capture signals from all car occupants (not only the driver), who sit at different locations inside the vehicle. In this case, the system should first find the target talker and then update its steering vector toward that target. Another example of adaptive beamforming is the ‘cocktail party’ problem, wherein the target location can vary in the room; the system should constantly localize the target, and the beamforming algorithm should adapt to the new DOA and other geometrical factors accordingly. From this perspective, adaptive beamformers can be viewed as ‘data-dependent’ systems since their parameters change according to variations in the received data. As a result, adaptive beamformers usually require substantial computational resources [10, 13, 14].

An adaptive beamformer is often accompanied by a pre-processing stage whose task is to localize the target and determine the new DOA. This ‘localization’ stage usually accomplishes its task by examining the data and finding the DOA that maximizes a specific metric such as signal strength or speech intelligibility [10, 13, 14, 15]. Alternatively, some localization algorithms are built on minimizing a specific cost function, such as the amount of noise and reverberation in the signal. When the localization algorithm finds the DOA, the values in the steering vector (D(ω,θ)) should adapt to this new angle.

There are some relatively newer solutions that merge the ‘localization’ and ‘beamforming’ stages. Warsitz and Haeb-Umbach [14] presented an algorithm that optimizes the FIR filter coefficients (denoted by W in Eqs. (1) and (2) above) by iteratively estimating and maximizing the cross power spectral density of the microphone signals. An important feature of this algorithm is that the filter coefficients are optimized directly without localizing the source. In other words, the DOA information is implicitly absorbed in the optimization problem, although it is possible to extract the underlying DOA information from the results afterwards if needed.

2.3 Neural-based adaptive beamforming in speech recognition applications

Speech signal enhancement (SSE) techniques, such as beamforming, have traditionally been performed as an independent pre-processing stage to speech recognition back ends [13, 15]. In this conventional setup, SSE algorithms are performed to improve the signal-to-noise ratio (SNR) by reducing ambient noise and reverberation in the captured signal. The output of the SSE stage is then fed into acoustic models, usually deep neural networks, which perform the automatic speech recognition (ASR) task.

In the last few years, adaptive beamforming algorithms have been designed that are tuned jointly with the speech recognition back end [13, 15, 16]. To do so, the FIR coefficients (shown by W in Figure 1 and in Eqs. (1) and (2)) are trained jointly with the parameters of the ASR model, where the optimization is performed using a gradient learning algorithm. The goal of this optimization process is to find FIR coefficients that result in higher ASR accuracy.

Several neural-network approaches have been developed to address the ASR problem [15], but the most successful ASR models are currently built on the convolutional, long short-term memory, deep neural network (CL-DNN) concept [13, 15]. The input is filtered by a time-domain filterbank pre-processor, usually a Gammatone filterbank together with a nonlinearity function, which is intended to loosely mimic the human auditory periphery (cochlea) in terms of spectral feature extraction and compression [17]. The output is then fed into the CL-DNN model. The first stage in the CL-DNN model is the fconv layer, which convolves the output signals across the filterbank channels; the results are pooled along the frequency axis. The next stage comprises a number of long short-term memory (LSTM) layers. An LSTM network is a specific type of recurrent neural network that is tailored for recognizing sequential time-series data such as audio. The final stage is a single fully-connected DNN that consists of at least 1024 hidden units [13, 15, 16].

Sainath et al. [13, 16] presented a multi-microphone solution to incorporate the data captured by an array of M microphones into the CL-DNN model. They replaced each spectral channel of the Gammatone filterbank pre-processor with FIR filters that are connected to the microphones and are used for beamforming (identical to Eqs. (1) and (2) above). They essentially created a filter-and-sum beamformer per spectral channel. The difference is that the tap delays (τp), and therefore the DOA data, are implicitly absorbed in the FIR coefficients, similar to earlier work by Warsitz and Haeb-Umbach [10]. Sainath et al. [13, 16] trained the beamforming FIR coefficients together with the CL-DNN parameters using a gradient learning algorithm to maximize the ASR accuracy. Sainath et al. [16] showed that, during training, the FIR coefficients become optimized to extract both spectral and spatial features of the incoming speech signals. They showed that the multi-microphone ASR model with joint beamforming achieves an improvement of over 10% in word error rate (WER) compared to its single-microphone counterpart.

Besides excellent ASR accuracy, a major benefit of neural-network-based beamforming is that the model is, to a great extent, independent of the array spacing, whereas conventional beamforming relies on prior knowledge of the distance between microphones (d) to calculate the tap delays. Due to its remarkable success in ASR, neural-based beamforming is becoming prevalent in ASR systems that have access to multiple microphone inputs. Automotive ASR systems are a very good candidate for this technique, and online voice assistants based on it are currently being designed and evaluated.

A potential shortcoming of neural-based beamforming is that the source localization information (i.e. DOA) is implicitly embedded in the model and might not be extractable and interpretable in terms of physical geometry. This could impose a limitation in applications that require explicit knowledge of the source location. Besides, an important distinction is that neural-based beamforming parameters are tuned solely based on ASR objectives and might not necessarily improve the audio quality (e.g. SNR) with regard to human psychoacoustics [13, 15, 16]. Therefore, neural-network-based beamforming is currently considered more applicable to speech recognition tasks than to applications such as telephony, wherein human listeners are involved. The feasibility of neural-network-based beamforming for telephony applications and its relation to human psychoacoustics need further investigation.

2.4 Beamforming applications in automotive industry

Beamforming techniques were introduced into the automotive industry at almost the same time that the first automotive hands-free telephony and speech dialog systems were being devised [1]. Although there have been some studies using multiple microphones [18], it is by far more common to have only dual microphones available in vehicles for beamforming. There are mainly two reasons for this: the first is production cost, and the second is the complications in the vehicle’s interior design and the excess wiring that arise if more microphones are used. Therefore, the following sections regarding automotive applications focus on two-microphone solutions.

Figure 2(A) shows a car with two dedicated microphones (marked M1 and M2) mounted a few centimeters apart (d = 4.8 cm) in the car ceiling. The DOA is ideally around 90 degrees according to the illustrated coordinates. To provide a fixed beamforming solution for this particular geometry, an ‘end-fire’ differential beamformer should be used since the desired source is located along the axis of the array (θ = ±90°). The input signals are filtered according to Eqs. (1)–(5) and then subtracted. The frequency-dependent tap delays for microphone M2 (i.e. τ2(ω)) were chosen to enable the steering vector to enhance sounds that arrive from the driver side (θ = 90°).

Figure 2.

(A) A car cabin geometry with dual microphones mounted in the car ceiling. Beam pattern achieved by the described end-fire (differential) beamformer at (B) 1 kHz, (C) 2 kHz, and (D) 4 kHz.

Figure 2 shows the beam patterns resulting from an end-fire (differential) beamformer in which the tap delays for microphone M2 (i.e. τ2(ω)) have been adjusted as a function of frequency at several frequency channels covering the frequency range from 0.1 to about 7 kHz. Figure 2(B) shows the beam pattern at 1 kHz. This beam pattern demonstrates that sounds from θ = 90° (driver side) pass through the system whereas sounds from θ = −90° (equivalently 270°) are substantially attenuated. A very similar beam pattern is shown in Figure 2(C) at 2 kHz, although the beam pattern at 4 kHz, shown in Figure 2(D), deviates somewhat, with minimal effect on overall performance.
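The front-to-back behavior of such an end-fire pair can be sketched with a simplified delay-and-subtract model. This is not the tuned beamformer behind Figure 2: instead of frequency-dependent tap delays, the sketch assumes a fixed internal delay of d/c applied to M2 before subtraction, and the function name and sign conventions are chosen for this example.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, value quoted in the text


def differential_gain(spacing_m, arrival_deg, freq_hz):
    """Magnitude response of a two-microphone delay-and-subtract pair.

    Assumption: the M2 signal is delayed by the fixed internal delay d/c and
    subtracted from the M1 signal, which places a null at theta = -90 degrees.
    """
    omega = 2.0 * math.pi * freq_hz
    internal_delay = spacing_m / SPEED_OF_SOUND
    acoustic_delay = internal_delay * math.sin(math.radians(arrival_deg))
    return abs(1.0 - cmath.exp(-1j * omega * (internal_delay + acoustic_delay)))


# Spacing d = 4.8 cm as in Figure 2(A); frequencies as in panels (B) to (D).
for freq_hz in (1000.0, 2000.0, 4000.0):
    front = differential_gain(0.048, 90.0, freq_hz)
    back = differential_gain(0.048, -90.0, freq_hz)
    print(f"{freq_hz:.0f} Hz: front gain {front:.3f}, back gain {back:.3f}")
```

In this simplified model the null at θ = −90° holds at every frequency, while the gain toward θ = 90° varies with frequency, which is one motivation for tailoring the tap delays per frequency channel.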

The presented beamformer was tested in-situ by placing a head-and-mouth simulator system at the typical location of the driver’s head and playing standard hearing-in-noise test (HINT) sentences [19] while the engine was running in idle mode creating some stationary background noise. The raw signals captured by the M1 microphone were recorded. The test was repeated while applying the described beamforming on the raw signals. The beamforming results were compared to the raw signal. The results showed a signal-to-noise ratio improvement (SNRI) of 5.7 dB(A) across frequencies between 0.1 and 8 kHz.

Figure 3 shows the beamforming geometry in a large truck cabin wherein a dual microphone array is installed on the overhead compartment. The distance between the array and the driver’s mouth is about 0.4 m and the DOA is approximately 30 degrees (θ = 30° in the zy plane), although these numbers vary depending on the height and other biometrics of the driver. The distance between the two microphones (d) is 23 mm, which yields a higher spatial aliasing frequency than the system shown in Figure 2. In this case, an end-fire beamformer can be designed to form a beam downwards toward the cabin’s floor (θ = 90°). The drawback is that some engine noise and AC fan noise will also leak into the beamformer since these noise signals originate from the dashboard, which is also located below the overhead compartment.

Figure 3.

A truck cabin geometry with a dual microphone installed in an overhead compartment, forming a DOA of approximately 30 degrees (θ = 30°) toward the mouth of a 180-cm-tall male driver. The yellow arrow shows the first-wave propagation from the driver’s mouth whereas the green arrow shows the noise signal propagation (engine noise and AC fan noise).

Alternatively, a broadside beamformer can be used to direct the beam toward the DOA of θ = 30°. As a well-known drawback, the broadside configuration also amplifies sounds from the angle that is 180 degrees behind the DOA (i.e. 30 + 180 = 210 degrees in this case). This is because broadside beamforming, characterized by Eqs. (1)–(4), cannot distinguish between front and back: any sound coming from θ + 180° is treated the same as sound from θ. However, since the overhead console behaves as a mechanical damper for sounds and vibrations coming from the roof and the backward direction, the broadside configuration appears to be the better solution in this practical case.

A broadside filter-and-sum beamformer has been devised with tailored frequency-dependent tap delays to facilitate consistent beamforming toward the driver at frequencies between 0.1 and 8 kHz. The in-situ measurements and beam patterns are not finalized yet. However, preliminary analysis indicates that the system can achieve an SNRI of about 6 dB when the engine is running in idle mode. The described beamformer operates in the xz plane. However, a third microphone can be added on the y axis next to the existing pair of microphones to perform beamforming in the xy plane as well. This new beamformer in the xy plane can be tuned to attenuate sounds arriving from the co-driver’s side, although using more than two microphones is currently uncommon in vehicles.

3. Acoustic echo cancelation

3.1 Basic concepts

Acoustic echoes are generated in speech telecommunication networks due to the acoustic feedback from loudspeakers to microphones. This phenomenon deteriorates the perception of sound by causing the users to hear a delayed replica of their own voice being reflected back from the other side of the network [4, 5, 6]. Figure 4 shows a driver in a truck cabin holding a phone conversation through embedded microphones and loudspeakers (marked red). The near-end speech signal is denoted by s[n], the echo by y[n], and the ambient noise by r[n]. The echo (y[n]) can be considered a copy of the far-end speech signal played by the loudspeaker (x[n]) that has been filtered by the acoustic path between the loudspeaker and the microphone (modeled by an FIR filter with impulse response h[n]). The received signal (d[n]) is the sum of these three signals (d[n] = s[n] + y[n] + r[n]).

Figure 4.

A) an overview of the described AEC algorithm. B) a driver in a truck serving as the near-end party during a hands-free phone conversation.

To remove acoustic echoes from the captured signal, acoustic echo cancelation (AEC) algorithms have been developed that use machine learning methods to adaptively estimate the ‘acoustic echo path’ in real time and subtract its effect from the captured signal so that only the desired near-end speech components remain [4, 5, 6, 20, 21]. Similar adaptive methods are also commonly used for estimating the noise propagation acoustic path in stationary noise reduction applications (e.g. [7]). The most common adaptive method used in AEC tasks is normalized least mean square (NLMS) filtering [5, 6, 20, 21]. Least-mean-square adaptive filters were used even in the earliest generation of AECs [4] and several improved variants of it, such as NLMS, have been developed since then [5, 6, 20, 21].

The true impulse response of the echo path (i.e. h[n]) is unknown and the task of an AEC solution is to identify it. To do so, the NLMS algorithm continuously adapts an estimated impulse response (ĥ[n]) so that it closely matches the true impulse response of the echo path (i.e. ĥ[n] ≈ h[n]); consequently ŷ[n] ≈ y[n] and the error signal, e[n], approaches zero. The length of ĥ[n], denoted by L, plays an important role in the performance of the AEC. The filter should be long enough to realistically model the acoustic path, yet the acoustic path must remain approximately time-invariant over the duration that corresponds to L samples. If the goal of the AEC is to reduce the echo by 30 dB, then L should correspond to the T30 reverberation time [5]. In most modern vehicles, 50 ms appears to be a good estimate of T30.
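As a rough illustration of how the 50 ms estimate translates into a filter length, the sketch below converts a reverberation time into a number of taps. The sample rate is not stated in the text; 16 kHz is assumed here purely for the example, and the function name is hypothetical.

```python
def aec_filter_length(reverb_time_s, sample_rate_hz):
    """Number of FIR taps L needed to span reverb_time_s at sample_rate_hz."""
    return round(reverb_time_s * sample_rate_hz)


# T30 of about 50 ms (from the text) at an assumed sample rate of 16 kHz.
num_taps = aec_filter_length(0.050, 16000)
print(num_taps)
```

Under this assumption the adaptive filter needs several hundred coefficients, each of which is updated at every sample, which gives a feel for the computational load of the adaptive methods described below.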

In case the echo (y[n]) is effectively the only signal present (r[n] and s[n] are absent), the output of the adaptive process (ŷ[n]) is given by Eq. (6) below, where the vector of the L most recent samples of x[n] is transposed (represented by $x_L^{T}[n]$) and multiplied by ĥ[n].

$$d[n] = \hat{y}[n] = x_L^{T}[n] \cdot \hat{h}[n] \tag{6}$$

The adaptive process estimates a new ĥ[n] for each sample of y[n] through a small adjustment Δĥ[n] in each iteration, as expressed in Eq. (7). This adjustment is determined from the error signal and the reference signal according to Eq. (8), where μ[n] is known as the ‘step size’. The step size, which is determined by the parameters α and β according to Eq. (9), is important for the convergence rate and accuracy of the system. These two parameters need to be adjusted according to the specifics of every given NLMS problem. The choice of appropriate values for α and β has been studied comprehensively [5, 20, 21, 22].

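$$\hat{h}[n+1] = \hat{h}[n] + \Delta\hat{h}[n] \tag{7}$$
$$\Delta\hat{h}[n] = \mu[n] \times x_L[n] \times e[n] \tag{8}$$
$$\mu[n] = \frac{\alpha}{\beta + \delta^2_{x_L}[n]}, \qquad \delta^2_{x_L}[n] = \operatorname{var}(x_L[n]) = x_L^{T}[n] \cdot x_L[n] \tag{9}$$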
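The update in Eqs. (6) to (9) can be sketched as a short sample-by-sample loop. This is a minimal illustration under idealized conditions (echo only, no near-end speech or noise); the function name, the values of α and β, the synthetic echo path `true_path`, and the error definition e[n] = d[n] − ŷ[n] are assumptions made for this example.

```python
import random


def nlms_echo_canceller(far_end, mic, num_taps, alpha=0.5, beta=1e-6):
    """Estimate the echo path with NLMS; returns (h_hat, error signal)."""
    h_hat = [0.0] * num_taps
    x_buf = [0.0] * num_taps  # x_L[n], most recent sample first
    errors = []
    for x_n, d_n in zip(far_end, mic):
        x_buf = [x_n] + x_buf[:-1]
        y_hat = sum(h * x for h, x in zip(h_hat, x_buf))          # Eq. (6)
        e_n = d_n - y_hat                                         # error signal
        power = sum(x * x for x in x_buf)                         # x_L^T x_L
        mu = alpha / (beta + power)                               # Eq. (9)
        h_hat = [h + mu * x * e_n for h, x in zip(h_hat, x_buf)]  # Eqs. (7)-(8)
        errors.append(e_n)
    return h_hat, errors


# Synthetic test: white far-end signal filtered by a short hypothetical echo path.
random.seed(0)
true_path = [0.6, -0.3, 0.1, 0.05]
far_end = [random.uniform(-1.0, 1.0) for _ in range(2000)]
echo = [sum(true_path[p] * far_end[n - p]
            for p in range(len(true_path)) if n - p >= 0)
        for n in range(len(far_end))]
h_hat, errors = nlms_echo_canceller(far_end, echo, num_taps=4)
```

In this noise-free setting the estimate approaches `true_path` and the error signal decays toward zero; real cabins involve far longer impulse responses, ambient noise, and near-end speech, which is why the step-size parameters need careful tuning.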

Eqs. (6)–(9) are applicable only when the echo is the only component present in the received signal (i.e. d[n] = y[n]). In other words, the near-end talker is silent and the ambient noise is insignificant (s[n] = 0 and r[n] ≈ 0). However, in natural full-duplex speech communication, both parties (far end and near end) may sometimes talk simultaneously (i.e. a ‘double-talk’ event may occur). If there is significant double talk in d[n] (i.e. a non-zero s[n]), then the adaptive process, formulated by Eqs. (7)–(9), might diverge and fail since the s[n] components cannot be modeled by h[n]. Therefore, every adaptive AEC solution needs to monitor constantly for double-talk events and halt the adaptation as long as double talk is present [23, 24].
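One classic way to flag double talk is the Geigel detector, which compares the microphone level with the recent peak of the far-end signal. It is shown here only as an example of the idea and is not necessarily the method used in the cited works. The threshold of 0.5 assumes that the echo path attenuates the loudspeaker signal by at least about 6 dB, which may not hold in every cabin; the signal values below are made up for the illustration.

```python
def geigel_double_talk(far_end, mic, window, threshold=0.5):
    """Flag sample n as double talk when |d[n]| exceeds threshold times the
    largest far-end magnitude over the most recent `window` samples."""
    flags = []
    for n, d_n in enumerate(mic):
        start = max(0, n - window + 1)
        recent_peak = max(abs(x) for x in far_end[start:n + 1])
        flags.append(abs(d_n) > threshold * recent_peak)
    return flags


# Hypothetical signals: echo = 0.3 * far end delayed by one sample,
# plus a burst of near-end speech at n = 4.
far_end = [1.0, -0.8, 0.9, 1.0, -1.0, 0.7]
near_end = [0.0, 0.0, 0.0, 0.0, 0.9, 0.0]
mic = [near_end[n] + (0.3 * far_end[n - 1] if n > 0 else 0.0)
       for n in range(len(far_end))]
flags = geigel_double_talk(far_end, mic, window=3)
```

An AEC would typically freeze the coefficient update for the flagged samples, often with an additional hold time, and resume adaptation afterwards.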

3.2 Spatial acoustic echo cancelation

All conventional NLMS-based adaptive methods explained in the previous section rely on modeling the acoustic path by an FIR system and aim to find the coefficients of the corresponding filter (i.e. h[n] in Figure 4). A major drawback of the NLMS-based adaptive approach is that the adaptive process, presented by Eqs. (7)–(9), needs to run constantly, which imposes a considerable computational cost. This is because the acoustic path, characterized by h[n], constantly changes due to slight movements of objects in the environment and other factors such as temperature variations, and the adaptive process needs to estimate the new impulse response each time.

Recently, alternative methods based on probabilistic clustering techniques have been successfully used for blind source separation (BSS) of the echo components from the near-end speech signal [25]. The BSS method uses the spatial information from the captured microphone signals to cluster and separate the desired speech signal (s[n]) from the echo (y[n]). Any BSS method, similar to beamforming, requires multiple microphones to be able to extract the location cues.

Every BSS method assumes that the signals captured by the microphones ($d_1, d_2, \ldots, d_M$), where M denotes the number of microphones, originate from N independent source signals ($s_1, s_2, \ldots, s_N$) that have been mixed together. The mixture is then modeled as described by Eq. (10) below, where $h_{jk}$ is the impulse response of length L that describes the acoustic path from the kth source ($s_k$) to the jth microphone ($d_j$).

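$$d_j[n] = \sum_{k=1}^{N} \sum_{p=1}^{L} h_{jk}[p]\, s_k[n-p] \tag{10}$$
$$d_1[n] = \sum_{p=1}^{L} h_{1s}[p]\, s[n-p] + \sum_{p=1}^{L} h_{1x}[p]\, x[n-p] \tag{11}$$
$$d_2[n] = \sum_{p=1}^{L} h_{2s}[p]\, s[n-p] + \sum_{p=1}^{L} h_{2x}[p]\, x[n-p] \tag{12}$$
$$d[n] = \begin{bmatrix} h_{1s} & h_{1x} \\ h_{2s} & h_{2x} \end{bmatrix} \begin{bmatrix} s[n] \\ x[n] \end{bmatrix} = W \begin{bmatrix} s[n] \\ x[n] \end{bmatrix} \tag{13}$$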

The BSS techniques that are extensively used to address the ‘cocktail party’ problem aim to find the mixing impulse responses ($h_{jk}$) and use this information to de-mix the signals and recover the original speech signals [25, 26]. If there are at least as many microphones as sources (M ≥ N), BSS reduces to a ‘determined’ problem and linear filters can be deployed to separate the mixtures effectively. Otherwise, if there are fewer microphones than sources (M < N), the problem is ‘underdetermined’ and linear filters do not work adequately.

In the case depicted by Figure 3, there are two microphones and also two independent sources (M = N = 2), namely: 1) the near-end speech (s[n]), and 2) the far-end signal (x[n]) that leaks from the loudspeaker to the microphones as echo. Eqs. (11) and (12) formulate the mixture model for the signals received by microphone 1 and microphone 2, respectively. Here, h1x and h2x are the impulse responses of the acoustic paths from the loudspeaker to the first and the second microphone, respectively. Eqs. (11) and (12) can be summarized in matrix form as Eq. (13), where the relation between the source signals and the microphone signals is denoted by a Wiener filter. Eq. (13) can be inverted so that $\begin{bmatrix} s[n] \\ x[n] \end{bmatrix} = W^{T} d[n]$.
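The determined two-source, two-microphone case can be illustrated with a deliberately simplified sketch. It assumes an instantaneous mixture (each impulse response reduced to a single gain, i.e. L = 1) and a mixing matrix that is already known, whereas a real BSS method has to estimate the de-mixing filters blindly from the microphone signals. The gains, signal values, and function names are hypothetical.

```python
def invert_2x2(matrix):
    """Inverse of a 2 x 2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular; sources cannot be separated")
    return [[d / det, -b / det], [-c / det, a / det]]


def apply_2x2(matrix, pair):
    """Multiply a 2 x 2 matrix by a 2-element column vector."""
    return [matrix[0][0] * pair[0] + matrix[0][1] * pair[1],
            matrix[1][0] * pair[0] + matrix[1][1] * pair[1]]


# Rows: microphones 1 and 2. Columns: near-end speech s[n], far-end signal x[n].
mixing = [[1.0, 0.5],
          [0.8, 1.0]]
speech = [0.2, -0.4, 0.1]
far_end = [1.0, 0.5, -0.5]

mics = [apply_2x2(mixing, [s, x]) for s, x in zip(speech, far_end)]
demixing = invert_2x2(mixing)
recovered = [apply_2x2(demixing, d) for d in mics]
```

Each entry of `recovered` holds the separated pair (s[n], x[n]); with convolutive mixing the same idea applies per frequency bin or with matrix-valued filters, at correspondingly higher cost.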

A conventional approach to solving Eqs. (10)–(13) is independent component analysis (ICA) [26]. A cost function is defined to estimate the statistical (convolutional) independence of s[n] and x[n]. The coefficients of W (which comprises $h_{1x}$, $h_{1s}$, $h_{2x}$, and $h_{2s}$) are adaptively updated so that the statistical independence of s[n] and x[n] increases. The statistical independence is often increased either by maximizing the non-Gaussianity of the two signals or by minimizing the mutual information between them.
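The idea of de-mixing by maximizing non-Gaussianity can be illustrated with a minimal sketch. It makes several assumptions that are not part of the chapter: the mixture is instantaneous (L = 1) rather than convolutive, the two sources are synthetic, and a FastICA-style fixed-point iteration with a tanh nonlinearity is swapped in as one common way of maximizing non-Gaussianity. All function and variable names are hypothetical.

```python
import numpy as np

def sym_orth(B):
    """Symmetric decorrelation: B <- (B B^T)^(-1/2) B."""
    s, u = np.linalg.eigh(B @ B.T)
    return u @ np.diag(s ** -0.5) @ u.T @ B

def fast_ica(D, n_iter=200, tol=1e-8, seed=0):
    """Estimate independent sources from mixtures D (channels x samples)."""
    D = D - D.mean(axis=1, keepdims=True)
    # Whitening: decorrelate the channels and scale them to unit variance
    eigval, eigvec = np.linalg.eigh(np.cov(D))
    V = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    Z = V @ D
    rng = np.random.default_rng(seed)
    B = sym_orth(rng.standard_normal((Z.shape[0], Z.shape[0])))
    for _ in range(n_iter):
        G = np.tanh(B @ Z)
        # Fixed-point update: E{z g(w^T z)} - E{g'(w^T z)} w for every row w of B
        B_new = (G @ Z.T) / Z.shape[1] - np.diag((1 - G ** 2).mean(axis=1)) @ B
        B_new = sym_orth(B_new)
        converged = np.max(np.abs(np.abs(np.diag(B_new @ B.T)) - 1)) < tol
        B = B_new
        if converged:
            break
    return B @ Z, B @ V  # estimated sources, estimated de-mixing matrix

# Synthetic stand-ins for near-end speech s[n] and echo x[n] (assumed data)
rng = np.random.default_rng(1)
S = np.vstack([rng.laplace(size=20000),          # super-Gaussian, speech-like
               rng.uniform(-1, 1, size=20000)])  # sub-Gaussian interferer
W = np.array([[1.0, 0.6],
              [0.5, 1.0]])                       # assumed 2x2 mixing matrix
D = W @ S                                        # two microphone signals
S_hat, W_demix = fast_ica(D)

# Sources are recovered only up to order, sign and scale, so compare by correlation
C = np.abs(np.corrcoef(np.vstack([S, S_hat]))[:2, 2:])
print(C.round(2))
```

Each row of the printed matrix should contain one entry close to 1, which indicates that every true source has a matching estimate. A convolutive mixture, as in Eq. (10), would typically require applying such an update per frequency bin or using a dedicated convolutive BSS method.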

3.3 The performance of acoustic echo cancelers in vehicles

The performance of an AEC is primarily measured by two metrics: 1) echo return loss enhancement (ERLE), and 2) convergence time. ERLE is a commonly used indicator of how well an AEC solution attenuates echoes [5, 20, 21, 23]. ERLE is calculated according to Eq. (14) below, where $\sigma_{d[n]}^2$ and $\sigma_{e[n]}^2$ represent the variance of the audio captured by the microphone (d[n]) and the variance of the error signal (e[n]), which is the output of the AEC and is ideally echo-free. Since all signals are zero-mean, the variance of a signal is a measure of its power. Therefore, Eq. (14) yields the ERLE as the power of the microphone input signal relative to the power of the AEC output, expressed in dB.

$$\mathrm{ERLE} = 10 \times \log_{10}\left(\frac{\sigma_{d[n]}^2}{\sigma_{e[n]}^2}\right) \tag{14}$$
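As a minimal sketch of Eq. (14), the ERLE can be computed from the variances of the microphone signal and the AEC output. The function name, the small `eps` guard against division by zero, and the synthetic signals are assumptions made for illustration.

```python
import numpy as np

def erle_db(d, e, eps=1e-12):
    """ERLE in dB: 10*log10(var(d) / var(e)), following Eq. (14)."""
    d = np.asarray(d, dtype=float)
    e = np.asarray(e, dtype=float)
    return 10.0 * np.log10((np.var(d) + eps) / (np.var(e) + eps))

# Toy check: a residual with one tenth of the amplitude has 1/100 of the power
rng = np.random.default_rng(0)
d = rng.standard_normal(16000)   # stand-in for the microphone signal d[n]
e = 0.1 * d                      # stand-in for the AEC output e[n]
print(round(erle_db(d, e), 2))   # 20.0
```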

The International Telecommunication Union (ITU) G.168 standard for AECs [27] sets out a number of requirements that should be followed in all speech telecommunication applications. Accordingly, the AEC should yield at least 6 dB of ERLE at the second frame (since each frame is 50 ms in a typical automotive solution, this means at 0.1 seconds). The ERLE should then increase to a minimum of 20 dB at 1 second. Thereafter, the ERLE should reach its steady state by 10 seconds and should not drop below that steady-state value afterwards.

The convergence time is the time it takes for the AEC to reach its steady-state ERLE. ITU G.168 requires that the convergence time be no longer than one second. In the tuning of the adaptive parameters, such as the step size, there is a tradeoff between ERLE and convergence time, since a higher ERLE might come at the cost of a longer convergence time [21, 22].
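The two metrics above can be tracked over time with a short sketch. It assumes non-overlapping 50 ms frames, checks the ERLE levels exactly as they are stated in this section (6 dB at 0.1 s and 20 dB at 1 s), and uses a hypothetical definition of convergence time: the first frame after which the ERLE stays within 1 dB of the median of the last frames. The function names, the tolerance, and the synthetic decaying residual are illustrative assumptions.

```python
import numpy as np

def framewise_erle_db(d, e, fs=16000, frame_ms=50, eps=1e-12):
    """ERLE per non-overlapping frame (Eq. (14) applied frame by frame)."""
    n = int(fs * frame_ms / 1000)
    n_frames = min(len(d), len(e)) // n
    d = np.asarray(d[:n_frames * n], dtype=float).reshape(n_frames, n)
    e = np.asarray(e[:n_frames * n], dtype=float).reshape(n_frames, n)
    return 10.0 * np.log10((d.var(axis=1) + eps) / (e.var(axis=1) + eps))

def meets_thresholds(erle, frame_ms=50, checkpoints=((0.1, 6.0), (1.0, 20.0))):
    """True if ERLE >= level (dB) in the frame ending at each checkpoint time (s)."""
    for t, level in checkpoints:
        idx = int(round(t * 1000 / frame_ms)) - 1
        if idx >= len(erle) or erle[idx] < level:
            return False
    return True

def convergence_time_s(erle, frame_ms=50, tol_db=1.0, tail=10):
    """Time after which ERLE stays within tol_db of its steady-state estimate."""
    steady = float(np.median(erle[-tail:]))
    below = np.flatnonzero(erle < steady - tol_db)
    first_ok = 0 if below.size == 0 else below[-1] + 1
    return first_ok * frame_ms / 1000.0, steady

# Synthetic example: the residual gain decays towards 0.03 (about 30 dB ERLE)
fs = 16000
rng = np.random.default_rng(1)
d = rng.standard_normal(2 * fs)
g = 0.03 + 0.97 * np.exp(-np.arange(2 * fs) / 480.0)
e = g * d
erle = framewise_erle_db(d, e, fs=fs)
ok = meets_thresholds(erle)
t_conv, steady = convergence_time_s(erle)
print(ok, t_conv, round(steady, 1))
```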

We implemented an adaptive NLMS-based AEC, described by Eqs. (6)–(9), on a large Volvo truck model. The length of the Wiener filter (L) was chosen as 800, which corresponds to 50 ms at the sampling rate of 16 kHz and is consistent with T30 in large vehicles. The term α in Eq. (9), which can take a value between 0 and 2, determines the speed of convergence. Higher α values result in quicker adaptation of the NLMS algorithm; however, there is a tradeoff between convergence and the overall success of the echo canceller in terms of ERLE [20]. Here, we chose α = 1.98 to ensure fast convergence of the algorithm. The term β, known as the regularization parameter, is meant to improve the performance of the NLMS in noise, and it has to be adjusted with regard to the characteristics of the ambient noise (r[n] in Figure 1) and the signal-to-noise ratio (SNR) of the microphone hardware [20]. Here, we chose β = 0.1, which corresponds to the SNR of the electret condenser microphones that are commonly used in the automotive industry.
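The roles of L, α, and β can be illustrated with a minimal NLMS sketch. It assumes the textbook form of the update, in which the filter is corrected by α·e[n]·x / (xᵀx + β); this is presumed, not confirmed, to match Eqs. (6)–(9). The code uses zero-based tap indexing, a synthetic 64-tap echo path, and α = 1.0 to keep the toy run short. It is not the implementation used in the truck, and all names are hypothetical.

```python
import numpy as np

def nlms_aec(x, d, L=800, alpha=1.0, beta=0.1):
    """Return (e, w): echo-cancelled output and final FIR echo-path estimate."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    w = np.zeros(L)
    xbuf = np.zeros(L)                 # most recent L far-end samples, newest first
    e = np.zeros(len(d))
    for n in range(len(d)):
        xbuf = np.roll(xbuf, 1)
        xbuf[0] = x[n]
        y_hat = w @ xbuf               # estimated echo
        e[n] = d[n] - y_hat            # error: microphone minus estimated echo
        # Normalized update; beta regularizes the division when x is weak
        w += alpha * e[n] * xbuf / (xbuf @ xbuf + beta)
    return e, w

# Far-end-only toy scenario (no near-end speech, no noise), assumed data
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                             # far-end signal x[n]
h = rng.standard_normal(64) * np.exp(-np.arange(64) / 10)  # assumed echo path
d = np.convolve(x, h)[:len(x)]                             # microphone signal d[n]
e, w = nlms_aec(x, d, L=64, alpha=1.0, beta=0.1)

erle_tail = 10 * np.log10(np.var(d[-4000:]) / (np.var(e[-4000:]) + 1e-20))
print(erle_tail > 30, np.allclose(w, h, atol=1e-6))
```

The values reported in the text (L = 800, α = 1.98, β = 0.1) can be passed to the same function; the per-sample Python loop is then noticeably slower, which is why the sketch uses a shorter filter.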

Furthermore, a statistical double-talk detection (DTD) decision circuit based on the normalized cross-correlation (NCC) between x[n] and d[n] was used. NCC is also called the ‘Pearson correlation coefficient’ in statistics [28]. If the far end is the only talker, there will be a non-zero cross-correlation between x[n] and d[n]. However, when the near end talks too (i.e. double talk (DT) occurs), the cross-correlation between x[n] and d[n] diminishes and approaches zero, since d[n] then conveys s[n] components as well. Accordingly, DT is detected if the NCC drops below a certain threshold. Eq. (15) presents the NCC between x[n] and d[n], where $\sigma_{x_L[n]}$ and $\sigma_{d_L[n]}$ are the standard deviations (square roots of the variances) of L samples of x[n] and d[n], respectively, and $\mathrm{cov}(x_L[n], d_L[n])$ is the covariance between them.

$$\mathrm{NCC}(x_L[n], d_L[n]) = \frac{\mathrm{cov}(x_L[n], d_L[n])}{\sigma_{x_L[n]} \times \sigma_{d_L[n]}} \tag{15}$$

NCC can yield a number in the range [−1, +1], where +1 indicates perfect correlation, −1 indicates perfect anti-correlation, and 0 indicates no correlation between the two inputs. Here, we set the threshold of our DTD decision to 104 using the method discussed in [28] by normalizing the false alarm probability (pf) to about 0.1.
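The decision rule can be sketched as follows. The sketch assumes an idealized echo path (a pure scaling with no delay or reverberation), synthetic signals, and an illustrative threshold of 0.8 that is not the value used in this chapter. The function names are hypothetical.

```python
import numpy as np

def ncc(x_win, d_win, eps=1e-12):
    """Pearson correlation between two equal-length windows, as in Eq. (15)."""
    x_win = np.asarray(x_win, dtype=float)
    d_win = np.asarray(d_win, dtype=float)
    cov = np.mean((x_win - x_win.mean()) * (d_win - d_win.mean()))
    return cov / (x_win.std() * d_win.std() + eps)

def is_double_talk(x_win, d_win, threshold):
    """Declare double talk when the NCC drops below the threshold."""
    return ncc(x_win, d_win) < threshold

rng = np.random.default_rng(0)
L = 800                                  # window length, as the filter length above
x = rng.standard_normal(L)               # far-end window
s = rng.standard_normal(L)               # near-end speech stand-in
d_single = 0.5 * x                       # far end only: d[n] is pure echo
d_double = 0.5 * x + s                   # double talk: echo plus near-end speech

threshold = 0.8                          # illustrative value only (assumption)
print(is_double_talk(x, d_single, threshold))   # False
print(is_double_talk(x, d_double, threshold))   # True
```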

To evaluate the presented AEC solution, the far-end party reads 10 HINT sentences while the driver (near-end party) is silent and the vehicle’s engine is off. The system registers the incoming signal to the speaker (x[n]) while the microphone records y[n]. In this case y[n] = d[n], since the driver is silent (s[n] = 0) and there is no engine noise (r[n] = 0). The presented solution is applied to these signals and, as depicted in Figure 5 below, it attenuates the echo received by the microphone by a total of 25.54 dB according to Eq. (14). Figure 5(B) shows the ERLE for each sentence and how the ERLE increases as the algorithm continues adapting. The results demonstrate compliance with the ITU G.168 standard [27].

Figure 5.

A) Captured microphone data (d[n]) versus the output of the AEC (e[n]) while the far end is reading ten HINT sentences. The sentences are marked by numerical indicators. B) The echo attenuation achieved by the presented AEC solution in terms of ERLE per HINT sentence.

3.4 Post-processing acoustic echo suppression

The minimum acceptable ERLE required by ITU G.168 (i.e. 20 dB) may not suffice in practice, since the echo might still be noticeable and irritating, especially if the loudspeaker volume is set to a high level. As an example, if the loudspeaker is set to generate sounds at about 70 dB SPL, an ERLE of 20 dB implies that an echo of 50 dB SPL (i.e. 70 − 20 = 50) is transmitted back to the far-end party, which can be quite noticeable. Therefore, it is good practice in the automotive industry to achieve much higher echo reduction, typically over 40 dB.
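The arithmetic in this example can be written as a one-line helper. The function name is hypothetical, and the calculation simply subtracts the ERLE from the loudspeaker level, as in the text.

```python
def residual_echo_db_spl(loudspeaker_db_spl, erle_db):
    """Echo level returned to the far end, given the playback level and the ERLE."""
    return loudspeaker_db_spl - erle_db

print(residual_echo_db_spl(70, 20))  # 50
print(residual_echo_db_spl(70, 40))  # 30
```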

The conventional NLMS-based adaptive AEC modules typically achieve a maximum of 30 dB ERLE, as shown in Figure 5. Therefore, to further improve the echo reduction, the remaining echo components (i.e. the ‘residual echo’) are suppressed by means of acoustic echo suppression (AES) post-processing. The simplest AES methods, which have historically been used, attenuate the captured microphone signal (d[n] in Figure 4) whenever the far end is talking (i.e. whenever the magnitude of x[n] is over a reasonable threshold) [5]. A major shortcoming of this method is that, in case of double talk, wherein both the near end and the far end are talking simultaneously, the near-end speech signal is also attenuated. Another issue is that such an approach is nonlinear. Speech recognition models require that the audio signal chain be free of any nonlinearity [29]. Since adaptive AEC algorithms use linear filters to cancel echo, they can legitimately be used as a pre-processing stage for automatic speech recognition (ASR) systems. However, the use of nonlinear AES must be avoided in ASR applications. As a result, linear solutions, such as the BSS techniques explained previously by Eqs. (10)–(13), have been deployed to perform the task of AES on the residual echo, especially in speech recognition applications. A properly designed combination of a conventional adaptive AEC and a post-processing AES should comfortably achieve echo reductions over 40 dB.
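The double-talk shortcoming of the simplest gain-based suppressor can be seen in a short sketch. The frame length, far-end activity threshold, attenuation gain, and synthetic signals are all assumptions, and the function name is hypothetical.

```python
import numpy as np

def gate_suppressor(d, x, fs=16000, frame_ms=50, x_threshold=0.01, gain=0.1):
    """Attenuate microphone frames in which the far-end frame RMS exceeds x_threshold."""
    d = np.asarray(d, dtype=float)
    x = np.asarray(x, dtype=float)
    out = d.copy()
    n = int(fs * frame_ms / 1000)
    for start in range(0, len(d), n):
        seg = x[start:start + n]
        if seg.size and np.sqrt(np.mean(seg ** 2)) > x_threshold:
            out[start:start + n] *= gain   # far end active: attenuate everything
    return out

fs = 16000
rng = np.random.default_rng(0)
# Far end talks during the first half second only
x = np.concatenate([rng.standard_normal(fs // 2), np.zeros(fs // 2)])
s = 0.5 * rng.standard_normal(fs)         # near-end speech stand-in, always active
d = 0.05 * x + s                          # microphone: residual echo plus near end
out = gate_suppressor(d, x, fs=fs)

# During double talk the near-end speech is attenuated together with the echo
print(np.allclose(out[:fs // 2], 0.1 * d[:fs // 2]), np.allclose(out[fs // 2:], d[fs // 2:]))
```

Because the gain depends on the signal level of x[n], the operation is not a fixed linear filter, which is the property that makes it unsuitable ahead of a speech recognizer.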

4. Discussions, conclusions, and prospects

Hands-free telephony has been widely offered in premium cars since the early 2000s, and since then, audio signal processing modules have been deployed to enhance the speech signal quality by addressing issues such as acoustic echo, ambient noise, and reverberation [1]. Besides hands-free telephony, speech dialog systems have been developed to enable drivers to control vehicle functions by voice [2, 3]. At the core of such a speech dialog system, there is a neural-network based acoustic model that performs the speech recognition task. Speech recognition systems also demand high-quality audio input, which makes speech signal enhancement techniques necessary. In particular, online voice assistants rely on specific ‘wake words’ (also called ‘hot words’) to communicate with users. These are ‘Ok Google!’ for Google Assistant, ‘Alexa!’ for Amazon Alexa, and ‘Hey Siri!’ for Siri. The ASR system should constantly listen for these wake words while music or speech signals might be playing on the speakers at the same time. In order to detect the wake words while playing sounds, the system needs a capable echo cancelation module to estimate and cancel the feedback from the speaker(s) to the microphone(s), as well as a noise reduction module (such as beamforming) to minimize the reverberation and ambient noise in the captured signals.

In this chapter, the fundamentals of filter-and-sum beamforming were described and two practical designs of dual-microphone fixed beamforming (end-fire versus broadside) were presented inside a personal car and a truck, respectively. The fundamentals of beamforming were described for the general case, although the applications focused exclusively on dual microphones because that is the most common setup in vehicles. The directivity index, which is the gain of the beamformer in the desired DOA relative to all other directions, is a good measure of a beamformer’s performance. A conventional multi-microphone fixed beamformer can achieve a directivity index of about 25 dB at best [30]. In the real world, the directivity index turns out to be lower. For a dual-microphone solution, the directivity index is modest, i.e. in the range of 10 to 12 dB. Using more microphones can provide a sharper beam and potentially higher SNRIs.

Despite its modest directivity index, a well-designed beamforming system improves the quality of the sound substantially. One important benefit of beamforming, besides the SNRI, is the reduction in the perceived reverberation. Reverberation is the sum of all sound reflections from the walls and surroundings of a given acoustic room and has been shown to have adverse effects on speech intelligibility, especially for hearing-impaired listeners [31]. Beamforming minimizes reverberation in the captured signal by geometrically dampening the sound reflections received from undesired directions and thereby facilitates speech intelligibility. Moreover, beamforming modules are in many cases followed by non-stationary noise reduction modules that adaptively suppress the noise (e.g. [7]). Together with the beamformer, an adaptive noise suppressor can achieve very good results in managing non-stationary noise.

Neural-based beamforming was also described in this chapter. This type of beamforming, wherein the steering filter coefficients are optimized jointly with a neural-network speech model, has emerged in many speech recognition applications and has shown remarkable success [13, 15, 16]. However, since the beamforming coefficients are optimized implicitly as part of a speech recognition task, the success of this method in improving sound quality for a human listener is not entirely known, and further studies are needed to evaluate this method for telephony and hearing-aid applications wherein human listeners are involved.

A large part of this chapter was dedicated to acoustic echo cancelation. The fundamentals of a conventional adaptive method based on NLMS were described. In this method, the acoustic path between the loudspeaker and the microphone is modeled by an FIR filter, and the adaptive process seeks to find the coefficients of this filter and subtract the echo from the captured signal. Adaptive NLMS-based acoustic echo cancelers are relatively easy to implement and are in extensive use. If designed appropriately, this method can comfortably achieve ERLEs of about 30 dB [5, 21, 22, 30]. Although this level is higher than the level required by the ITU guidelines [27], a higher ERLE becomes necessary in most automotive telephony applications. Therefore, acoustic echo suppression algorithms have been developed as post-processing modules to further reduce the residual echo.

The simplest and most common acoustic echo suppressors apply a gain to the microphone signal and reduce this gain whenever the far-end party is talking. However, due to its nonlinear behavior, this approach cannot be used in speech recognition applications, which require linearity of all audio components [29]. Instead, linear approaches such as BSS based on ICA appear to be suitable. The BSS method uses spatial cues to find the mixing coefficients of a linear model and uses this information to de-mix the signals and segregate the source signals (in this case: echo versus near-end speech).

Although beamforming and echo cancelation are well-known problems that have been extensively studied since the early 1960s [4, 5, 9, 30], considerable effort is needed to tailor them to new challenges. Therefore, new statistical optimization approaches and neural-network based solutions are being deployed to strengthen the conventional methods, whenever feasible. The automotive industry is expanding quickly, and manufacturers are competing to provide vehicles that allow occupants to have independent conference calls simultaneously. Another competitive frontier is speech recognition. Automotive manufacturers aim to provide user interfaces that are driven by voice. These interfaces allow drivers to simply talk to their cars and run their daily errands (such as online shopping, scheduling meetings, and listening to audio books) while driving, solely by voice commands. Prototypes of such online automotive voice assistants have recently been introduced as Google [32] and Amazon [33] entered the market, and they have received great attention from the media and the public. These systems open up new scientific and technical challenges in human-machine interfacing, cloud-based and embedded speech recognition, and, last but not least, spatial audio signal processing.

Acknowledgments

Parts of this project have been funded by the innovation office at the Department of Vehicle Connectivity (VeCon) at Volvo Group in Gothenburg, Sweden, and some of the results were published in an M.Sc. thesis by Balaji Ramkumar in collaboration with Linköping University.

References

  1. Oh S, Viswanathan V, Papamichalis P. Hands-free voice communication in an automobile with a microphone array. Proc. ICASSP. 1992:281-284
  2. Heisterkamp P. Linguatronic - product-level speech system for Mercedes-Benz cars. In: Proceedings of the First International Conference on Human Language Technology Research. USA; 2001
  3. Chen F, Jonsson IM, Villing J, Larsson S. Application of speech technology in vehicles. In: Speech Technology: Theory and Applications. UK: Springer; 2010. pp. 195-219
  4. Sondhi MM, Presti AJ. A self-adapting echo canceller. Bell System Technical Journal. 1966;45:1851-1854
  5. Kellermann W. Echo cancellation. In: Handbook of Signal Processing in Acoustics. Vol. 1. USA: Springer; 2008. pp. 883-895
  6. Jung MA, Elshamy S, Fingscheidt T. An automotive wideband stereo acoustic echo canceler using frequency-domain adaptive filtering. 22nd European Signal Processing Conference (EUSIPCO). 2014. pp. 1453-1456
  7. Chen YH, Raun SJ, Qi T. An automotive application of real-time adaptive Wiener filter for non-stationary noise cancellation in a car environment. IEEE International Conference on Signal Processing, Communication, and Computing (ICSPCC). 2012. pp. 597-601
  8. Zawawi SA, Hamzah AA, Majlis BY, Mohd-Yasin F. A review of MEMS capacitive microphones. Micromachines. 2020;11(482):1-28
  9. Van Veen BD, Buckley KM. Beamforming: A versatile approach to spatial filtering. IEEE ASSP Magazine. 1989:740-761
  10. Timofeev S, Bahai ARS, Varaiya P. Adaptive acoustic beamformer with source tracking capabilities. IEEE Transactions on Signal Processing. 2008;56(7):2812-2819
  11. Vu NV, Ye H, Wittington J, Delvin J, Mason M. Small footprint implementation of dual-microphone delay-and-sum beamforming for in-car speech enhancement. IEEE International Conference on Acoustics, Speech, and Signal Processing. 2010. pp. 1482-1485
  12. Cigada A, Lurati M, Ripamonti F, Vanali M. Beamforming method: Suppression of spatial aliasing using moving arrays. Journal of the Acoustical Society of America (JASA). 2008;124(6):3648-3658
  13. Sainath TN, Weiss RJ, Wilson KW, Narayanan A, Bacchiani M, Senior A. Speaker localization and microphone spacing invariant acoustic modeling from raw multichannel waveforms. Google Research. 2015:1-7
  14. Warsitz E, Haeb-Umbach R. Acoustic filter-and-sum beamforming by adaptive principal component analysis. ICASSP. 2005:797-800
  15. Hinton G, Deng L, Yu D, Dahl GE, Mohamed A-r, Jaitly N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine. 2012;29(6):82-97
  16. Sainath TN, Weiss RJ, Wilson KW, Li B, Narayanan A, Variani E, et al. Multichannel signal processing with deep neural networks for automatic speech recognition. Google Research. 2017:1-14
  17. Saremi A, Beutelmann R, Dietz M, Ashida G, Kretzberg J, Verhulst S. A comparative study of seven human cochlear filter models. The Journal of the Acoustical Society of America. 2016;140(3):1618-1634
  18. Qi Z, Moir TJ. Automotive 3-microphone noise canceller in a frequently moving noise source environment. International Journal of Information and Communication Engineering. 2007;3(4):297-304
  19. Hällgren M, Larsby B, Arlinger S. A Swedish version of the Hearing In Noise Test (HINT) for measurement of speech recognition. International Journal of Audiology. 2006;45:227-237
  20. Paleologu C, Ciochină S, Benesty J, Grant SL. An overview on optimized NLMS algorithms for acoustic echo cancellation. EURASIP Journal on Advances in Signal Processing. 2015. DOI: 10.1186/s13634-015-0283-1
  21. Enzner G, Buchner H, Favrot A, Kuech F. Acoustic echo control. In: Academic Press Library in Signal Processing. USA: Academic Press; 2014. pp. 807-877
  22. Hänsler E, Schmidt G. Acoustic Echo and Noise Control: A Practical Approach. Hoboken, NJ, USA: Wiley; 2004
  23. Souden M, Wung J, Biing-Hwang FJ. A probabilistic approach to acoustic echo clustering and suppression. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. 2013
  24. Hussain MS, Hasan MA, Bari MF, Harun-Ur-Rashid ABM. A fast double-talk detection algorithm based on signal envelopes for implementation of acoustic echo cancellation in embedded systems. 4th International Conference on Advances in Electrical Engineering (ICAEE). 2017. DOI: 10.1109/ICAEE.2017.8255353
  25. Makino S, Lee TW, Sawada H. Convolutive blind source separation for audio signals. In: Blind Speech Separation. USA: Springer; 2007. pp. 1-42
  26. Sawada H, Ono N, Kameoka H, Kitamura D, Saruwatari H. A review of blind source separation methods: Two converging routes to ILRMA originating from ICA and NMF. APSIPA Transactions on Signal and Information Processing. 2019;8:1-12
  27. International Telecommunication Union. G.168: 04/2015 Digital network echo canceller. Available online: https://www.itu.int/rec/T-REC-G.168-201504-I/en [Accessed: December 15, 2021]
  28. Benesty J, Morgan DR, Cho JH. A new class of doubletalk detectors based on cross-correlation. IEEE Transactions on Speech and Audio Processing. 2000;8(2):168-172
  29. Google Android team. 5.4.2 Capture for voice recognition. In: Android compatibility definition document. Available online: https://source.android.com/compatibility/10/android-10-cdd [Accessed: December 16, 2021]
  30. Kellermann W. Strategies for combining acoustic echo cancelation and adaptive microphone beamforming array. IEEE. 1997:219-222
  31. Hazrati O, Loizou PC. The combined effects of reverberation and noise on speech intelligibility by cochlear implant listeners. International Journal of Audiology. 2012;51(6):437-443
  32. Volvo Cars Sverige AB. Volvo Cars collaborates with Google on a brand new infotainment system. Available online: https://group.volvocars.com/news/connectivity/2018/volvo-cars-collaborates-with-google-on-a-brand-new-infotainment-system [Accessed: December 15, 2021]
  33. Volvo Trucks Global. Volvo Trucks to deliver Amazon Alexa in new heavy-duty trucks. Available online: https://www.volvotrucks.com/en-en/news-stories/press-releases/2020/dec/volvo-trucks-first-to-deliver-amazon-alexa-in-new-heavy-duty-trucks.html [Accessed: December 15, 2021]
