This chapter presents an overview of dictionary learning-based speech enhancement methods. Specifically, we review existing algorithms that employ sparse representation (SR), nonnegative matrix factorization (NMF), and their variants applied to speech enhancement. We emphasize that a speech enhancement system comprises two stages, namely dictionary learning and enhancement. The two scenarios of the dictionary learning process, offline and online, are discussed carefully as well. We finally present some evaluation methods and suggest future lines of work.
- dictionary learning
- nonnegative matrix factorization
- projected gradient descent
- speech enhancement
- sparse representation
Speech is the most important tool of expression and a crucial information carrier in language communication. Speech signals in real-world scenarios are corrupted by disturbing noise such as background noise, reverberation, and babble. The purpose of speech enhancement (SE) is to extract the clean speech signal from the mixture with interferer components as accurately as possible, so as to improve the clarity and intelligibility of the speech signal. Research on speech enhancement technology is therefore both important and difficult. Speech denoising is an important problem with an increasing variety of applications such as hearing aids, speech/speaker recognition, mobile telephone communications, and the Internet. The difficulties arise from the nature of real-world noise, which is often unknown, nonstationary, potentially speech-like, and overlapping with the target speech [1, 2, 3].
Assume that the noisy speech y(t) is an additive mixture of the clean speech s(t) and the noise n(t), that is, y(t) = s(t) + n(t).
The speech enhancement techniques mainly focus on the removal of noise from the speech signal. The various types of noise and techniques for removing them are presented in [5, 6, 7, 8, 9, 10, 11, 12, 13]. The well-known spectral subtraction technique extracts the clean speech spectrum based on the principle that the noise contamination process is additive. The major advantage of the spectral subtraction method is its simplicity: an estimate of the interferer spectrum is subtracted from the observed mixture spectrum [5, 6]. The main problem with magnitude spectral subtraction is that subtraction errors can yield negative magnitude estimates, which must be floored and thus leave residual artifacts.
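As a minimal sketch of the idea (not any specific published variant), magnitude spectral subtraction with a spectral floor can be written as follows; the floor factor `beta` is a hypothetical choice of ours:

```python
import numpy as np

def spectral_subtraction(Y_mag, noise_mag, beta=0.01):
    """Subtract a noise-magnitude estimate from the noisy magnitude
    spectrum, then floor the result so subtraction errors cannot
    produce negative magnitudes (beta sets the spectral floor)."""
    S_mag = Y_mag - noise_mag
    return np.maximum(S_mag, beta * Y_mag)
```

The flooring step is exactly what prevents the negative magnitudes discussed above, at the cost of leaving a small residual ("musical") noise floor.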
Filtering techniques [7, 8], short-time spectral amplitude (STSA) estimators, and estimators based on super-Gaussian prior distributions for the speech DFT coefficients [10, 11, 12, 13] assume statistical models for the speech and noise signals and estimate the clean speech from the noisy observation without any prior information on the noise type or speaker identity. However, when the background noise is nonstationary, these methods have much difficulty in estimating the noise power spectral density (PSD) [14, 15, 16].
Recently, dictionary learning (DL) techniques, which build a dictionary consisting of atoms and represent a class of signals in terms of those atoms, have been shown to be effective in machine learning, neuroscience, and audio processing [17, 18, 19, 20]. In speech enhancement, dictionary models utilize specific types of a priori information about both the speech and noise signals [21, 22, 23, 24, 25]. This class of methods assumes that a target spectrogram can be generated from a set of basis target spectra (a dictionary) through weighted linear combinations. Generally, this approach decomposes the time-frequency representation (the power or magnitude spectrogram) of noisy speech in terms of the elementary atoms of a dictionary. One of the key issues in dictionary-based speech enhancement is how to learn a dictionary precisely. Dictionary learning methods are commonly based on an alternating optimization strategy: the signal representation is fixed while the dictionary elements are learned; then the sparse signal representation is found while the dictionary is fixed. Two popular methods have appeared to determine a dictionary within a matrix decomposition: sparse coding and nonnegative matrix factorization (NMF).
The observation that speech and other structured signals can be well approximated by a few atoms of a suitably trained dictionary lies at the core of sparse representation (SR). In SR, sparse signals can be reconstructed from a few atoms of an overcomplete dictionary. Recently developed SR methods have been shown to be effective in data representation, factorizing a given matrix with regularization methods or regularization terms that constrain the sparsity of the desired representation. Since speech signals are generally sparse in the time-frequency domain while many types of noise are not, the target speech signal can be decomposed and reconstructed from a sparse dictionary driven by the noisy speech [21, 22, 23].
In many real-world applications, such as multispectral data analysis [29, 30], image representation [31, 32], and other important problems [33, 34], nonnegativity of the signals and the dictionary is required, so the so-called nonnegative dictionary learning becomes necessary. Nonnegative matrix factorization is a popular dictionary method that projects a given nonnegative matrix onto the subspace spanned by nonnegative dictionary vectors. Treating speech enhancement as a source separation problem between speech and noise, NMF-based techniques can be used to factorize spectrograms into nonnegative speech and noise dictionaries and their nonnegative activations. A clean speech signal can then be estimated from the product of the speech dictionary and its activations.
In this chapter, we review dictionary learning approaches for speech enhancement. After a brief introduction to the problem and its characterization as a sound source separation task, we present a survey of both the theory and the applications of dictionary-based techniques, the main subject of this chapter. We finally provide an overview of the evaluation methods and suggest some future lines of work.
Dictionary learning performs an approximate matrix factorization of a data matrix into the product of a dictionary matrix and a coding matrix, under some sparsity constraints on the coding matrix. Dictionary learning is a generalization of gain-shape codebook learning. Signal vectors are represented as linear combinations of multiple dictionary atoms, allowing for a lower approximation error at equal dictionary size. Two rather different methods of forming the dictionary from the given data are described: sparse representation (SR) and nonnegative matrix factorization (NMF).
2.1 Sparse representation (SR) and K-SVD algorithm
The sparse dictionary learning problem can be written as

min over D, X of ||Y - DX||_F^2 subject to ||x_i||_0 <= T_0 for every column x_i of X, (1)

where ||.||_F and ||.||_0 denote the Frobenius norm and the l0 pseudo-norm (the number of nonzero entries), respectively. Equivalently, for a single signal y,

y ~ Dx with ||x||_0 <= T_0. (2)

Eq. (2) shows that a signal y can be approximated as a sparse linear combination of a few dictionary atoms; the K-SVD algorithm alternates between sparse coding of the signals with the current dictionary and an SVD-based update of each dictionary atom.
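As an illustration of the sparse coding step (coding a signal against a fixed dictionary, the inner step of K-SVD), here is a minimal orthogonal matching pursuit sketch; the function name and the fixed atom budget `k` are our illustrative choices, not from the chapter:

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: approximate y with at most
    k atoms (columns) of the dictionary D."""
    residual = y.copy()
    support = []
    for _ in range(k):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        support.append(j)
        # least-squares fit on the selected atoms, then update the residual
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x
```

The returned vector x satisfies the sparsity constraint of Eq. (2) by construction, since at most k entries are nonzero.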
2.2 Nonnegative matrix factorization (NMF) theory
Nonnegative matrix factorization (NMF) can be viewed as an approach to dictionary learning. NMF, first introduced by Paatero and Tapper and later popularized by Lee and Seung [23, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37], is known as a parts-based representation model. Unlike other matrix factorization approaches, NMF takes into account the fact that most types of real-world data, particularly sound and video, are nonnegative, and it maintains such nonnegativity constraints in the factorization. Moreover, the nonnegativity constraints in NMF are compatible with the intuitive notion of combining parts to form a whole; that is, they provide a parts-based local representation of the data. A parts-based model not only provides an efficient representation of the data but can potentially aid in the discovery of the causal structure within it and in learning relationships between the parts.
Given a nonnegative matrix V of size F x N and a positive integer K < min(F, N), NMF seeks two nonnegative matrices, a dictionary W of size F x K and an activation matrix H of size K x N, such that

V ~ WH. (5)
Different similarity measures between V and WH lead to different cost functions. The two most widely used are the squared Euclidean distance and the generalized Kullback-Leibler (KL) divergence

D_KL(V || WH) = sum over all entries of ( V log(V / (WH)) - V + WH ). (6)
There exist different optimization models for the approximate factorization (5) [36, 39, 40]. The most popular solution is the alternating multiplicative update rules (MURs), which do not require user-specified optimization parameters. For the KL cost function (6), the iterative update rules are given by

H <- H (*) (W^T (V / (WH))) / (W^T 1),  W <- W (*) ((V / (WH)) H^T) / (1 H^T),

where (*) and / denote element-wise multiplication and division and 1 is an all-ones matrix of the size of V.
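A compact NumPy sketch of the Lee-Seung KL multiplicative updates (the random initialization, iteration count, and small epsilon guard are our own illustrative choices):

```python
import numpy as np

def nmf_kl(V, K, n_iter=200, eps=1e-12, seed=0):
    """Alternating multiplicative updates for the KL divergence.
    Nonnegativity of W and H is preserved automatically because every
    factor in the updates is nonnegative."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H
```

Because the updates only multiply by nonnegative ratios, no explicit projection step is needed, which is why MUR is so simple to implement.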
However, it has been found that the monotonicity guaranteed by the proof of the multiplicative updates does not imply the full Karush-Kuhn-Tucker conditions [39, 40]. MUR is relatively simple and easy to implement, but it converges more slowly than gradient approaches. More efficient algorithms with stronger theoretical convergence properties have been introduced. One popular method is to apply gradient descent algorithms with additive update rules, represented by the projected gradient descent (PGD) method. In the PGD framework, the learning step size is selected by a line search with the Armijo rule, and the new estimate is obtained by first computing the unconstrained steepest-descent update and then zeroing its negative elements. In addition, exploiting the separate convexity, the two-variable optimization problem is converted into nonnegative least squares (NLS) subproblems, which alternate the minimization over either W or H while the other factor is held fixed.
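To illustrate the projected-gradient idea on the NLS subproblem for H, here is a fixed-step sketch; the full PGD method chooses the step with an Armijo line search, which we omit for brevity, so the fixed `step` value is an assumption of ours:

```python
import numpy as np

def pgd_nls(V, W, H, step=0.1, n_iter=500):
    """Projected gradient descent on H for fixed W under the squared
    Euclidean cost 0.5 * ||V - W H||_F^2: take a steepest-descent step,
    then project onto the nonnegative orthant by zeroing negatives."""
    for _ in range(n_iter):
        grad = W.T @ (W @ H - V)              # gradient with respect to H
        H = np.maximum(H - step * grad, 0.0)  # descend, then project
    return H
```

Alternating this update for H with the analogous update for W gives the full alternating NLS scheme described above.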
Because of the random initial condition, the iterations may converge to different local minima.
NMF does not yield a unique solution under the nonnegativity constraint alone. Hence, to remedy this ill-posedness, it is imperative to introduce additional auxiliary constraints on W and/or H, such as sparsity or smoothness, in the form of regularization terms.
A regularized cost function takes the form D(V || WH) + lambda_W J_W(W) + lambda_H J_H(H), where the regularization parameters lambda_W, lambda_H >= 0 control the trade-off between the approximation error and the auxiliary constraints J_W and J_H.
The performance of NMF can be improved by imposing extra constraints and regularizations. For sparseness learning, the sparse term is typically the l1 norm of the activation matrix H, which encourages each signal to be explained by only a few active atoms.
3. Dictionary learning-based speech enhancement
A major outcome of speech enhancement techniques is improved quality and reduced listening effort in the presence of an interfering noise signal. The decomposition of time-frequency representations, such as the power or magnitude spectrogram, in terms of elementary atoms has become a popular tool in speech enhancement because of its success in finding high-quality dictionary atoms that best describe latent features of the underlying data. The dictionary-based techniques utilize specific types of a priori information about speech or noise [21, 23, 46, 47, 48, 49, 50]. A priori information can be typical patterns or statistics obtained from a speech or noise database. Dictionary-based speech enhancement consists of two separate stages: a training stage, in which the model parameters are learned, and a denoising stage, in which the noise reduction task is carried out. In the first stage, dictionaries for speech and, where available, noise are learned from training data.
Speech enhancement herein is implemented in the short-time Fourier transform (STFT) magnitude domain, assuming that the phase of the target speech can be approximated with the phase of the mixture. The number of frequency bins per frame is determined by the length of the time-domain analysis window, for which a Hamming window is chosen. Temporal smoothness between frames is determined by the analysis window overlap, where a minimum amount of overlap is necessary to avoid aliasing.
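For concreteness, the analysis/synthesis framing described above can be carried out with SciPy's STFT routines; the sampling rate, 256-sample frames, and 75% overlap below are illustrative choices of ours, not parameters from the chapter:

```python
import numpy as np
from scipy.signal import stft, istft

# Analysis: Hamming window, 256-sample frames, 75% overlap.
fs = 8000
x = np.random.default_rng(0).standard_normal(fs)   # 1 s placeholder signal
f, t, Z = stft(x, fs=fs, window='hamming', nperseg=256, noverlap=192)
mag, phase = np.abs(Z), np.angle(Z)                # enhancement acts on mag

# Synthesis: inverse STFT with the same parameters, reusing the phase.
_, x_rec = istft(mag * np.exp(1j * phase), fs=fs,
                 window='hamming', nperseg=256, noverlap=192)
```

With an unmodified magnitude, the round trip reconstructs the input signal, confirming that the chosen window and overlap satisfy the invertibility condition.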
3.1 Offline dictionary
Sparse representation has been described as an overcomplete model, wherein the number of bases is greater than the dimensionality of the spectral representation. In sparse representation, sparse signals can be expressed as a linear combination of only a few atoms of an overcomplete dictionary. Since speech signals are generally sparse in the time-frequency domain while many types of noise are not, the part of the noisy speech reconstructed from the speech atoms is taken as the clean speech estimate. A possibly overcomplete dictionary of atoms is trained for both speech and interferer magnitudes, and the two are then concatenated into a composite dictionary. The training process of the updated dictionary is drawn in Figure 1.
When applying the sparse coding technique to speech enhancement, it is desirable to have the clean speech dictionary trained offline on a representative corpus before the enhancement stage.
The clean speech magnitude is estimated by disregarding the contribution from the interferer dictionary, preserving only the linear combination of speech dictionary atoms (and analogously for the interferer).
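A toy sketch of this composite-dictionary decomposition, with plain least-squares coding standing in for the sparse solver purely for brevity (the function name and dictionary shapes are hypothetical):

```python
import numpy as np

def separate_frame(y, Ds, Dn):
    """Decompose a noisy magnitude frame y over the composite dictionary
    D = [Ds | Dn], then keep only the speech-atom contribution; the
    surveyed methods use sparse coding here instead of least squares."""
    D = np.hstack([Ds, Dn])
    x, *_ = np.linalg.lstsq(D, y, rcond=None)
    x_s = x[:Ds.shape[1]]          # activations of the speech atoms
    return Ds @ x_s                # estimated clean-speech magnitude
```

Discarding the interferer activations `x[Ds.shape[1]:]` is exactly the "disregard the interferer contribution" step described above.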
It is known that NMF represents data as a linear combination of a set of basis vectors, in which both the combination coefficients and the basis vectors are nonnegative. Although the basis learned by NMF is sparse, NMF is different from sparse coding. This is because NMF learns a low-rank representation of the data, while sparse coding usually learns a full-rank representation. Treating speech enhancement as a source separation problem (speech versus noise), NMF-based techniques can be used to factorize spectrograms into nonnegative speech and noise dictionaries and their nonnegative activations. Assume that a clean speech spectrogram S is factorized as S ~ W_s H_s and a noise spectrogram N as N ~ W_n H_n, where W_s and W_n are the speech and noise dictionaries and H_s and H_n their activations.
To reduce the noise in the noisy speech, the concatenated dictionary W = [W_s, W_n] is kept fixed and the noisy spectrogram Y is decomposed as Y ~ [W_s, W_n] [H_s; H_n], where the time-varying activation matrix H = [H_s; H_n] is estimated with the NMF update rules.
Discarding the noise coding matrix, the target speech is estimated from the product of the speech dictionary and its activations as S_hat = W_s H_s.
The clean speech waveform is estimated using the noisy phase and the inverse DFT; the general framework of NMF-based speech enhancement is drawn in Figure 2.
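The magnitude/phase recombination mentioned here is a one-liner; spelled out for clarity (the function name is ours):

```python
import numpy as np

def attach_noisy_phase(S_hat_mag, Y):
    """Combine the estimated clean-speech magnitude with the phase of the
    noisy STFT Y; an inverse STFT of the result gives the waveform."""
    return S_hat_mag * np.exp(1j * np.angle(Y))
```

The result has exactly the estimated magnitude and exactly the noisy phase, which is the standard practice when no phase estimate is available.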
3.2 Online dictionary learning
The aforementioned dictionary learning approaches access the whole training set to determine the bases, which is referred to as offline training. These methods were reported to perform well on nonstationary noise types that had been seen during training. For the time-frequency analysis of audio signals, however, the obtained basis may not be adequate to capture the temporal dependency of repeating patterns within the signal, and the success of these methods relies strongly on prior knowledge of the noise or the speech or both, which limits the applicability of the models. Recently, online dictionary learning methods have been proposed, both as an implementation scheme [46, 47, 48, 49, 50] and as a way to circumvent the mismatch problem between the training and testing stages [24, 52].
One drawback of the multiplicative update procedure in offline dictionary learning is that all the training signals must be read into memory and processed in each iteration. This high demand on both computing resources and memory is prohibitive in large-scale tasks. To address this problem, online optimization algorithms were developed in an incremental fashion: they process one sample of the training set at a time, based on stochastic approximation, or only a part of the training data at a time, and update the learned patterns gradually until the whole training corpus has been processed [46, 47, 48, 51]. More specifically, given a stream of training samples x drawn from the data distribution, online dictionary learning minimizes the expected cost

f(D) = E_x[ min over alpha of (1/2) ||x - D alpha||_2^2 + lambda ||alpha||_1 ],

where E_x denotes the expectation over the data distribution. The coefficient vector of the current sample is computed by solving the inner sparse coding problem with the dictionary fixed, after which the dictionary is updated using statistics accumulated from the past coefficients.
For the online NMF framework, at step t the dictionary is refreshed from the newly arrived data element together with sufficient statistics that summarize the subspace spanned by the elements that arrived earlier, so that the coefficients computed in the previous steps do not need to be recomputed.
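The flavor of such an update can be sketched as follows: each incoming column is coded against the current dictionary, folded into two small sufficient-statistics matrices, and the dictionary is refreshed from those statistics alone. This is a simplified squared-Euclidean sketch; all names, the inner iteration count, and the statistics-based refresh rule are our illustrative choices:

```python
import numpy as np

def online_nmf_step(v, W, A, B, n_inner=200, eps=1e-12):
    """One online NMF step: code the new column v against W with a
    multiplicative inner loop, accumulate the sufficient statistics
    A = sum(h h^T) and B = sum(v h^T), then refresh W from A and B
    without revisiting any past sample."""
    h = np.full(W.shape[1], 0.1)
    for _ in range(n_inner):
        h *= (W.T @ v) / (W.T @ (W @ h) + eps)   # nonnegative coding of v
    A += np.outer(h, h)                          # accumulate h h^T
    B += np.outer(v, h)                          # accumulate v h^T
    W *= B / (W @ A + eps)                       # dictionary refresh
    return W, A, B, h
```

Only A and B grow with the stream (and their sizes are fixed), which is precisely why the past coefficients never need to be recomputed.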
In , an online noise basis learning scheme is proposed that uses the temporal dependencies of the speech and noise signals to construct an informative prior distribution. In this model, the noise basis matrix is learned from the noisy observation itself. To update the noise basis, the past noisy DFT magnitude frames are stored in a buffer, and the buffer is then updated, with the speech basis kept fixed, whenever a new noisy frame arrives.
Kwon et al. present a speech enhancement technique combining statistical models and NMF with an online update of the speech and noise bases. A cascaded structure is proposed that combines a statistical model-based enhancement (the first stage) and an NMF approach (the second stage) with a simultaneous update of the speech and noise bases. In this model, the clean speech output at the current frame is fed back as an input to update the speech and noise bases for the following frame. In other words, at each frame the clean speech estimate is obtained, and the speech and noise bases for the NMF analysis of the following frame are updated. This online basis update makes it possible to deal with speech and noise variations that are not covered by the training noise database and is considered a promising way to cope with the nonstationary nature of the signal. The noisy data of each incoming frame are thus decomposed against continuously refreshed bases.
4. Summary and discussion
In the experimental simulations, speech and noise materials were selected from the TIMIT corpus (192 sentences), the NOISEX-92 database (15 types of noise: birds, casino, cicadas, computer keyboard, eating chips, f16, factory1, factory2, frogs, jungle, machineguns, motorcycles, ocean, pink, and volvo), the GRID audiovisual corpus (34 speakers of both genders), and the NOIZEUS speech corpus (30 utterances with clean samples). The noisy speech examples were synthesized by adding clean speech to different types of noise at various input SNRs.
Speech enhancement algorithms aim to improve both speech quality and speech intelligibility. A high-quality speech signal is perceived as natural, pleasant to listen to, and free of distracting artifacts. An effective technique should suppress noise without introducing too much distortion into the enhanced speech. Measuring speech quality is challenging because it is subjective; the measures can be classified into subjective and objective ones. Speech enhancement performance is commonly evaluated in terms of three criteria: the signal-to-noise ratio (SNR) of the enhanced speech, the segmental SNR (segSNR), and the perceptual evaluation of speech quality (PESQ) score [57, 58, 59]. Given the true and estimated speech magnitude spectra S(m, k) and S_hat(m, k) at frame m and frequency bin k, the frequency-weighted segmental SNR is defined as

segSNR = (10/M) sum over frames m of [ sum over k of W(m, k) log10( |S(m, k)|^2 / (|S(m, k)| - |S_hat(m, k)|)^2 ) / sum over k of W(m, k) ],

where M is the number of frames and W(m, k) are the frequency weights.
segSNR is a conceptually simple objective measure, computed on individual signal frames, and the per-frame scores are averaged over time.
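An unweighted segSNR sketch over magnitude-spectrogram frames illustrates this per-frame averaging; the clamp limits of -10 and 35 dB are common conventions, not values taken from the chapter:

```python
import numpy as np

def seg_snr(S, S_hat, floor=-10.0, ceil=35.0):
    """Segmental SNR over spectrogram frames (columns): compute a per-frame
    SNR in dB, clamp each score to [floor, ceil], then average over time."""
    num = np.sum(S ** 2, axis=0)                       # per-frame energy
    den = np.sum((S - S_hat) ** 2, axis=0)             # per-frame error
    frame_snr = 10.0 * np.log10(num / np.maximum(den, 1e-12))
    return float(np.mean(np.clip(frame_snr, floor, ceil)))
```

The clamping keeps silent or perfectly reconstructed frames from dominating the average, which is the usual motivation for segmental rather than global SNR.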
Contrary to spectral subtraction, the dictionary approach does not assume a stationary interferer and optimizes the trade-off between source distortion and source confusion; it thus shows superiority in objective quality measures such as cepstral distance, in both the speaker-dependent and speaker-independent cases, in real-world environments, and under low-SNR conditions. One possible reason lies in the amount of data available to estimate a noise dictionary. At low SNR levels, the total volume of noise is much higher than at high SNR levels, which offers a better chance of obtaining a good noise dictionary or noise model. Under high-SNR conditions, by contrast, much of the noise spectrum is buried in the speech spectrum, which can make learning a noise dictionary difficult. Pretrained speech dictionary models outperform state-of-the-art methods such as multiband spectral subtraction and approaches based on vector quantization [21, 22, 23]. Offline speech dictionary learning has also been carried out in a joint decomposition framework of the noisy speech spectrogram and a primary estimate of the clean speech spectrogram. The online learning approach processes input signals piece by piece, breaking the training data into small pieces and updating the learned patterns gradually using accumulated statistics. With this approach, only a limited segment of the input signal is processed at a time. The online-estimated dictionary spans a basis subspace rich enough to avoid speech distortion, and online approaches tend to give better performance than batch learning.
The computing demand of both offline and online learning consists mainly of updating the coefficient matrix and the dictionary at each iteration; online learning reduces the memory footprint by touching only a small part of the data at a time.
In short, dictionary learning plays an important role in machine learning, where data vectors are modeled as sparse linear combinations of basis factors (i.e., a dictionary). However, how to conduct dictionary learning in noisy environments has not been well studied. In this chapter, we have reviewed speech enhancement techniques based on dictionary learning. Dictionary learning-based algorithms have gained a lot of attention due to their success in finding high-quality dictionary atoms (basis vectors) that best describe latent features of the underlying data. As multivariate data analysis and dimensionality reduction techniques, two relatively novel paradigms for dimensionality reduction and sparse representation, NMF and SR, have been in the ascendant since their inception. They enhance learning and data representation through their parts-based and sparse representations, which arise from the nonnegativity or purely additive constraint. NMF and SR produce high-quality enhancement results when the dictionaries for the different sources are sufficiently distinct. This survey chapter mainly focuses on the theoretical research into dictionary learning-based speech enhancement, in which the principles, basic models, properties, and algorithms of SR and NMF are summarized systematically.
This research is partially supported by the Ministry of Science and Technology under Grant Number 108-2634-F-008-004 through Pervasive Artificial Intelligence Research (PAIR) Labs, Taiwan.