Open access peer-reviewed chapter - ONLINE FIRST

Speech Enhancement Based on LWT and Artificial Neural Network and Using MMSE Estimate of Spectral Amplitude

By Mourad Talbi, Riadh Baazaoui and Med Salim Bouhlel

Submitted: October 16th 2020. Reviewed: February 2nd 2021. Published: March 29th 2021.

DOI: 10.5772/intechopen.96365


Abstract

In this chapter, we detail a new speech enhancement technique based on the Lifting Wavelet Transform (LWT) and an Artificial Neural Network (ANN). This technique also uses the MMSE Estimate of Spectral Amplitude. The first step consists in applying the LWT to the noisy speech signal in order to obtain two noisy detail coefficients, cD1 and cD2, and one approximation coefficient, cA2. After that, cD1 and cD2 are denoised by soft thresholding, which requires suitable thresholds, thrj, 1 ≤ j ≤ 2. Those thresholds are determined by using an Artificial Neural Network (ANN). The soft thresholding of cD1 and cD2 yields two denoised coefficients, cDd1 and cDd2. Then the denoising technique based on the MMSE Estimate of Spectral Amplitude is applied to the noisy approximation cA2 in order to obtain a denoised coefficient, cAd2. Finally, the enhanced speech signal is obtained by applying the inverse LWT, LWT⁻¹, to cDd1, cDd2 and cAd2. The performance of the proposed speech enhancement technique is justified by computing the Signal to Noise Ratio (SNR), the Segmental SNR (SSNR) and the Perceptual Evaluation of Speech Quality (PESQ).

Keywords

  • ANN
  • Lifting Wavelet Transform
  • Ideal thresholds
  • Soft thresholding
  • MMSE Estimate

1. Introduction

Numerous speech enhancement techniques have been developed in previous years, as speech enhancement is a core target in numerous challenging domains such as speech and speaker recognition, telecommunications, teleconferencing and hands-free telephony [1]. In such applications, the goal is to recover a speech signal from observations degraded by diverse noise components [2]. These noise components can be of various classes that are frequently present in the environment [3]. Many algorithms and approaches have been proposed for resolving the problem of degraded speech signals [4, 5, 6]. Furthermore, single- or multi-microphone methods have been proposed in order to improve the behaviour of speech enhancement approaches and to reduce the acoustic noise components even in very noisy conditions [2]. The most well-known single-channel approach, widely used in speech enhancement applications, is spectral subtraction (SS), which requires only a one-channel signal [7]. It has been embedded in some high-quality mobile phones for the same application. However, the SS approach is only appropriate for stationary noise environments [2]. Furthermore, it introduces the musical noise problem. In fact, the more the noise is reduced, the greater the alteration brought to the speech signal, and accordingly the poorer the intelligibility of the enhanced speech [8, 9]. As a result, ideal enhancement can hardly be attained when the Signal to Noise Ratio (SNR) of the noisy speech is relatively low (below 5 dB), although quite good results are obtained when the noisy speech SNR is relatively high (above 15 dB) [2]. The decision-directed (DD) methods have improved on SS and other methods based on the SS principle in reducing the musical noise components [10, 11, 12, 13]. Numerous algorithms that improve the DD methods were suggested in [14].
In [15], a speech enhancement technique based on high-order cumulant parameter estimation was proposed. In [16, 17], a subspace technique based on Singular Value Decomposition (SVD) approaches was proposed; the signal is enhanced when the noise subspace is eliminated, and accordingly the clean speech signal is estimated from the subspace of the noisy speech [2]. Another technique widely studied in speech enhancement applications is the adaptive noise cancellation (ANC) approach, first suggested in [18, 19]. Moreover, most important speech enhancement methods employ adaptive approaches to obtain the tracking ability required for non-stationary noise properties [20, 21]. Numerous adaptive techniques have been proposed for speech enhancement: time-domain algorithms, frequency-domain adaptive algorithms [22, 23, 24, 25, 26] and adaptive spatial filtering methods [27, 28] that frequently employ adaptive SVD methods in order to separate the speech signal space from the noisy one. Another direction of research combines Blind Source Separation (BSS) methods with adaptive filtering algorithms to enhance the speech signal and effectively cancel the acoustic echo components [29, 30, 31, 32]. This approach employs a configuration of at least two microphones in order to update the adaptive filtering algorithms. Also, a multi-microphone speech enhancement technique was proposed for the same aim, improving on the existing one-channel and two-channel speech enhancement and noise reduction adaptive algorithms [33, 34]. Numerous research works have cast the problem of speech enhancement as a simple problem of mixing and unmixing signals with convolutive and instantaneous noisy observations [35, 36, 37].
In the last decade, a novel research direction has proven the efficacy of the wavelet domain as an effective means of improving speech enhancement approaches, and numerous algorithms and methods have been proposed for this aim [38, 39]. In this chapter, we propose a novel speech enhancement technique based on the Lifting Wavelet Transform (LWT) and an Artificial Neural Network (ANN), which also uses the MMSE Estimate of Spectral Amplitude [40]. The chapter is organized as follows: after the introduction, Section 2 deals with the Lifting Wavelet Transform (LWT). Section 3 describes the proposed speech enhancement technique. Section 4 presents the obtained results and, finally, Section 5 concludes our work.


2. Lifting wavelet transform (LWT)

The Lifting Wavelet Transform (LWT) has become a powerful tool for signal analysis due to its effective and faster implementation compared to the Discrete Wavelet Transform (DWT). In the domains of signal denoising, signal compression and watermarking, the LWT yields better results than the DWT. The LWT saves time and has a better frequency localization feature that overcomes the shortcomings of the DWT. Signal decomposition using the LWT needs three steps: splitting, prediction and update.
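As an illustration only (not the authors' implementation), the three lifting steps can be sketched in Python using the simplest Haar predict and update operators; the function names and the choice of the Haar wavelet are ours:

```python
import numpy as np

def haar_lwt(x):
    """One level of the lifting wavelet transform with Haar predict/update."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]   # 1) splitting: even/odd polyphase samples
    d = odd - even                 # 2) prediction: detail = odd - P(even)
    a = even + 0.5 * d             # 3) update: approximation keeps the local mean
    return a, d

def haar_ilwt(a, d):
    """Inverse transform: undo the lifting steps in reverse order."""
    even = a - 0.5 * d
    odd = even + d
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x

signal = np.array([2.0, 4.0, 6.0, 8.0])
a, d = haar_lwt(signal)
reconstructed = haar_ilwt(a, d)    # perfect reconstruction
```

Because each lifting step is trivially invertible (subtract what was added, in reverse order), the scheme guarantees perfect reconstruction by construction, which is one reason lifting is faster and simpler than the filter-bank DWT.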

3. The proposed technique

In this chapter, we propose a new speech enhancement technique based on the Lifting Wavelet Transform (LWT) and an Artificial Neural Network (ANN). This technique also uses the MMSE Estimate of Spectral Amplitude. The first step consists in applying the LWT to the noisy speech signal in order to obtain two noisy detail coefficients, cD1 and cD2, and one approximation coefficient, cA2. After that, cD1 and cD2 are denoised by soft thresholding, which requires suitable thresholds, thrj, 1 ≤ j ≤ 2. Those thresholds are determined by using an Artificial Neural Network (ANN). The soft thresholding of cD1 and cD2 yields two denoised coefficients, cDd1 and cDd2. Then the denoising technique based on the MMSE Estimate of Spectral Amplitude [40] is applied to the noisy approximation cA2 in order to obtain a denoised coefficient, cAd2. Finally, the enhanced speech signal is obtained by applying the inverse LWT, LWT⁻¹, to cDd1, cDd2 and cAd2. For each coefficient cDj, 1 ≤ j ≤ 2, the corresponding ideal threshold thrj is computed using the following MATLAB function:

function [thr] = Compute_Threshold(cc, cb)
% Exhaustive search: each |cb(i)| is a candidate threshold; keep the one
% minimizing the squared error between the soft-thresholded noisy
% coefficients and the clean ones.
R = [];
for i = 1:length(cb)
    r = 0;
    for j = 1:length(cc)
        r = r + (wthresh(cb(j), 's', abs(cb(i))) - cc(j)).^2;
    end
    R = [R r];
end
[~, ibest] = min(R);
thr = abs(cb(ibest));

The inputs of this function are the clean detail coefficient, cc, and the corresponding noisy coefficient, cb. The output of this function is the ideal threshold, thr. Note that the couple (cc, cb) can be (ccD1, cD1) or (ccD2, cD2), where ccDj and cDj, 1 ≤ j ≤ 2, are respectively the clean detail coefficient and the corresponding noisy detail coefficient at level j.
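For readers without MATLAB, the same exhaustive search can be sketched in NumPy; soft_threshold below reproduces wthresh(·, 's', t), i.e. sign(x) · max(|x| − t, 0), and the function names are ours, not from the chapter's code:

```python
import numpy as np

def soft_threshold(x, t):
    """MATLAB wthresh(x, 's', t): sign(x) * max(|x| - t, 0)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def compute_ideal_threshold(cc, cb):
    """Try every |cb[i]| as a threshold; return the one whose soft-thresholded
    noisy coefficients cb are closest (in squared error) to the clean ones cc."""
    cc, cb = np.asarray(cc, float), np.asarray(cb, float)
    candidates = np.abs(cb)
    errors = [np.sum((soft_threshold(cb, t) - cc) ** 2) for t in candidates]
    return float(candidates[int(np.argmin(errors))])

# Toy example: the noise here is a uniform +/-0.5 offset, so t = 0.5 is ideal.
ideal = compute_ideal_threshold([1.0, -2.0, 0.0], [1.5, -2.5, 0.5])
```

The search is O(n²) in the coefficient length, which is acceptable offline since the ideal thresholds are only needed to build the ANN training targets.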

As previously mentioned, the ideal threshold at level j is thrj and is used for soft thresholding of the noisy detail coefficient cDj at that level. In this work, the ANN is trained on a set of couples (P, T), where P is the input of the neural network and is chosen to be cDj, and T is the target and is chosen to be the corresponding ideal threshold, thrj, at level j. Consequently, for computing a suitable threshold to be used in soft thresholding of cDj, 1 ≤ j ≤ 2, we use one ANN per level, so we have two ANNs and two different suitable thresholds. Each of those ANNs consists of two layers: the first is a hidden layer containing ten neurons with the tansig activation function, and the second is the output layer containing one neuron with the purelin activation function (Figure 1).

Figure 1.

The architecture of the ANN used in this work.

As shown in Figure 1, the input of this ANN is the noisy detail coefficient at level j, P = cDj, 1 ≤ j ≤ 2, and the desired output, or target, is T = thrj, 1 ≤ j ≤ 2. The activation functions tansig and purelin are respectively expressed by Eqs. (1) and (2).

tansig(n) = 2 / (1 + exp(−2n)) − 1    (1)
purelin(n) = n    (2)

Generally, neural networks consist of a minimum of two layers (one hidden layer and one output layer). The input information is connected to the hidden layers through weighted connections, from which the output data is calculated. The number of hidden layers and the number of neurons in each layer control the performance of the network. According to [41], there are no guidelines for selecting the number of neurons or the number of hidden layers that give the best performance for a given problem; it is still a trial-and-error design method [41].

For training each ANN used in this work we employed 50 speech signals, with 10 others used for testing the networks. Therefore, for training each ANN, we used 50 couples of input and target (P, T). Evidently, the noisy speech signals used for testing the ANNs do not belong to the training database. The parameters used for training the ANNs are: the number of epochs, equal to 5000; the momentum, μ (Mu), equal to 0.1; and the gradient minimum, equal to 1e−7. The employed training algorithm is Levenberg-Marquardt.
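The forward pass of the network in Figure 1 can be sketched in NumPy as follows. This is an illustration only: the weights are random placeholders (Levenberg-Marquardt training, as used in the chapter, is not reproduced), and the input length of 64 is an arbitrary illustrative choice, not a value from the chapter:

```python
import numpy as np

def tansig(n):
    """MATLAB's tansig activation: 2/(1+exp(-2n)) - 1, equivalent to tanh(n)."""
    return 2.0 / (1.0 + np.exp(-2.0 * n)) - 1.0

def forward(P, W1, b1, W2, b2):
    """Two-layer network: 10 tansig hidden neurons, one purelin output neuron."""
    hidden = tansig(W1 @ P + b1)        # hidden layer, shape (10,)
    return (W2 @ hidden + b2).item()    # purelin output: identity activation

rng = np.random.default_rng(0)
n_inputs = 64                            # length of the detail-coefficient input (illustrative)
W1, b1 = rng.normal(size=(10, n_inputs)), rng.normal(size=10)
W2, b2 = rng.normal(size=(1, 10)), rng.normal(size=1)
thr_estimate = forward(rng.normal(size=n_inputs), W1, b1, W2, b2)
```

With trained weights, thr_estimate would play the role of the suitable threshold thrj predicted from the noisy detail coefficient cDj.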

In summary, the novelty of the proposed technique consists in applying the denoising technique based on the MMSE Estimate of Spectral Amplitude [40], and in applying ANNs to compute the ideal thresholds used for thresholding the noisy detail coefficients obtained from the application of the LWT to the noisy speech signal.
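For reference, the classical Ephraim-Malah MMSE short-time spectral amplitude gain, on which techniques such as [40] are based, can be sketched per frequency bin as follows. This is a hedged illustration, not the exact implementation of [40]: xi and gamma_k denote the a priori and a posteriori SNRs of a bin, and SciPy's exponentially scaled Bessel functions i0e/i1e (where i0e(x) = exp(−x)·I0(x)) keep the computation stable for large v:

```python
import numpy as np
from scipy.special import i0e, i1e

def mmse_stsa_gain(xi, gamma_k):
    """Ephraim-Malah MMSE spectral-amplitude gain for one frequency bin.
    xi: a priori SNR; gamma_k: a posteriori SNR."""
    v = xi / (1.0 + xi) * gamma_k
    # exp(-v/2)*I0(v/2) = i0e(v/2): the scaled Bessel form avoids overflow.
    return (np.sqrt(np.pi * v) / (2.0 * gamma_k)
            * ((1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0)))

# At high SNR the MMSE gain approaches the Wiener gain xi/(1+xi).
g_high = mmse_stsa_gain(10.0, 10.0)
g_low = mmse_stsa_gain(0.1, 1.0)
```

The gain is applied to each noisy spectral magnitude while the noisy phase is kept, which is the standard structure of MMSE spectral-amplitude denoisers.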

4. Results and discussions

For the evaluation of the proposed technique, we applied it to twenty Arabic speech signals pronounced by a male and a female speaker. Those signals are artificially corrupted, in an additive manner, by two types of noise (white Gaussian and car noise) at different values of SNRi (before denoising). The Arabic speech signals used (Table 1) are phonetically balanced material and are sampled at 16 kHz.

Female speaker | Male speaker
Signal 1: أحفظ من الأرض | Signal 1: لا لن يذيع الخبر
Signal 2: أين المسا فرين | Signal 2: أكمل بالإسلام رسالتك
Signal 3: لا لم يستمتع بثمرها | Signal 3: سقطت إبرة
Signal 4: سيؤذيهم زماننا | Signal 4: من لم ينتفع
Signal 5: كنت قدوة لهم | Signal 5: غفل عن ضحكاتها
Signal 6: ازار صائما | Signal 6: و لماذا نشف مالهم
Signal 7: كال و غبط الكبش | Signal 7: أين زوايانا و قانوننا
Signal 8: هل لذعته بقول | Signal 8: صاد الموروث مدلعا
Signal 9: عرف واليا و قائدا | Signal 9: نبه آبائكم
Signal 10: خالا بالنا منكما | Signal 10: أظهره و قم

Table 1.

The list of the employed Arabic speech sentences.

Also for the evaluation of the proposed speech enhancement approach, we applied the denoising technique based on the MMSE Estimate of Spectral Amplitude [40]. This evaluation is performed in terms of Signal to Noise Ratio (SNR), Segmental SNR (SSNR) and Perceptual Evaluation of Speech Quality (PESQ). Tables 2–7 list the results obtained from computing SNRf (after denoising), SSNR and PESQ, for both the proposed technique and the denoising technique based on the MMSE Estimate of Spectral Amplitude [40].
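The SNR and SSNR measures used in these tables can be sketched as follows (PESQ is the standardized ITU-T P.862 measure and is not reimplemented here). The frame length of 256 samples and the [−10, 35] dB per-frame clipping range are common choices in the SSNR literature, not values stated in the chapter:

```python
import numpy as np

def snr_db(clean, estimate):
    """Global SNR: clean-signal energy over residual-noise energy, in dB."""
    clean, estimate = np.asarray(clean, float), np.asarray(estimate, float)
    noise = clean - estimate
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def ssnr_db(clean, estimate, frame=256, lo=-10.0, hi=35.0):
    """Segmental SNR: mean of per-frame SNRs, each clipped to [lo, hi] dB."""
    clean, estimate = np.asarray(clean, float), np.asarray(estimate, float)
    snrs = []
    for i in range(0, len(clean) - frame + 1, frame):
        c, e = clean[i:i + frame], estimate[i:i + frame]
        s = 10.0 * np.log10(np.sum(c ** 2) / (np.sum((c - e) ** 2) + 1e-12))
        snrs.append(np.clip(s, lo, hi))
    return float(np.mean(snrs))

# Sanity check: a constant residual of 0.1 on a unit signal gives about 20 dB.
clean = np.ones(512)
estimate = clean + 0.1
global_snr = snr_db(clean, estimate)
seg_snr = ssnr_db(clean, estimate)
```

The per-frame clipping is what makes SSNR more perceptually meaningful than global SNR: silent frames cannot dominate the average.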

SNRi (dB) | SNRf (dB), proposed technique | SNRf (dB), MMSE Estimate of Spectral Amplitude [40]
−5 | 8.3650 | 7.1431
0 | 13.0857 | 11.6110
5 | 16.9010 | 15.5721
10 | 19.8933 | 18.8719
15 | 22.3135 | 21.7972

Table 2.

Results in terms of SNR (Signal 4, female voice, corrupted by Gaussian white noise).

SNRi (dB) | SSNR (dB), proposed technique | SSNR (dB), MMSE Estimate of Spectral Amplitude [40]
−5 | −0.0954 | −0.7089
0 | 2.5997 | 1.7725
5 | 4.7373 | 3.9719
10 | 6.8038 | 5.9329
15 | 9.5324 | 8.7158

Table 3.

Results in terms of SSNR (Signal 4, female voice, corrupted by Gaussian white noise).

SNRi (dB) | PESQ, proposed technique | PESQ, MMSE Estimate of Spectral Amplitude [40]
−5 | 1.3225 | 1.3755
0 | 1.5935 | 1.6320
5 | 1.8812 | 1.8977
10 | 2.2016 | 2.2311
15 | 2.5147 | 2.6079

Table 4.

Results in terms of PESQ (Signal 4, female voice, corrupted by Gaussian white noise).

SNRi (dB) | SNRf (dB), proposed technique | SNRf (dB), MMSE Estimate of Spectral Amplitude [40]
−5 | 5.8737 | 4.2192
0 | 9.8414 | 8.3451
5 | 14.1647 | 12.6024
10 | 18.5308 | 17.4120
15 | 22.5102 | 21.4578

Table 5.

Results in terms of SNR (Signal 2, male voice, corrupted by car noise).

SNRi (dB) | SSNR (dB), proposed technique | SSNR (dB), MMSE Estimate of Spectral Amplitude [40]
−5 | 0.2145 | −1.1347
0 | 2.7478 | 1.7861
5 | 5.6644 | 4.7166
10 | 8.8942 | 7.8228
15 | 11.9663 | 10.9850

Table 6.

Results in terms of SSNR (Signal 8, male voice, corrupted by car noise).

SNRi (dB) | PESQ, proposed technique | PESQ, MMSE Estimate of Spectral Amplitude [40]
−5 | 2.2837 | 2.4021
0 | 2.5999 | 2.7163
5 | 2.8709 | 3.0184
10 | 3.1190 | 3.2461
15 | 3.3590 | 3.4789

Table 7.

Results in terms of PESQ (Signal 8, male voice, corrupted by car noise).

According to the results listed in Tables 2–7, in terms of SNRf (after denoising) and SSNR, the best results are those obtained from the application of the proposed speech enhancement technique. However, in terms of PESQ, the denoising technique based on the MMSE Estimate of Spectral Amplitude [40] is slightly better than the proposed technique.

Figures 2–5 illustrate some examples of speech enhancement using the proposed technique.

Figure 2.

An example of speech enhancement applying the proposed technique: Signal 4 (pronounced by a female voice (Table 1)), corrupted by Gaussian white noise with SNRi = 10 dB (before enhancement). After enhancement: SNRf = 19.8933 dB, SSNR = 6.8038 dB and PESQ = 2.2016.

Figure 3.

An example of speech enhancement applying the proposed technique: Signal 1 (pronounced by a male voice (Table 1)), corrupted by Gaussian white noise with SNRi = 5 dB (before enhancement). After enhancement: SNRf = 13.7710 dB, SSNR = 0.7135 dB and PESQ = 2.2350.

Figure 4.

An example of speech enhancement applying the proposed technique: Signal 7 (pronounced by a male voice (Table 1)), corrupted by car noise with SNRi = 5 dB (before enhancement). After enhancement: SNRf = 15.1244 dB, SSNR = 8.7594 dB and PESQ = 3.3304.

Figure 5.

An example of speech enhancement applying the proposed technique: Signal 5 (pronounced by a male voice (Table 1)), corrupted by car noise with SNRi = 10 dB (before enhancement). After enhancement: SNRf = 18.8848 dB, SSNR = 6.4497 dB and PESQ = 3.5469.

These figures show the efficiency of the proposed speech enhancement technique. In fact, it considerably reduces the noise while preserving the original signal, especially when SNRi is higher (5, 10 and 15 dB).

In our future work, in order to improve the proposed speech enhancement technique, we will use a Deep Neural Network (DNN) instead of a simple ANN, as well as other transforms such as the Empirical Mode Decomposition (EMD).

5. Conclusion

In this chapter, we have detailed a new speech enhancement technique based on the Lifting Wavelet Transform (LWT) and an Artificial Neural Network (ANN). This technique also uses the MMSE Estimate of Spectral Amplitude. The first step consists in applying the LWT to the noisy speech signal in order to obtain two noisy detail coefficients, cD1 and cD2, and one approximation coefficient, cA2. After that, cD1 and cD2 are denoised by soft thresholding, which requires suitable thresholds, thrj, 1 ≤ j ≤ 2, determined by using an Artificial Neural Network (ANN). The soft thresholding of cD1 and cD2 yields two denoised coefficients, cDd1 and cDd2. Then the denoising technique based on the MMSE Estimate of Spectral Amplitude is applied to the noisy approximation cA2 in order to obtain a denoised coefficient, cAd2. Finally, the enhanced speech signal is obtained by applying the inverse LWT, LWT⁻¹, to cDd1, cDd2 and cAd2. The performance of the proposed speech enhancement technique was justified by computing the Signal to Noise Ratio (SNR), the Segmental SNR (SSNR) and the Perceptual Evaluation of Speech Quality (PESQ).

© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
