## Abstract

This chapter presents an overview of dictionary learning-based speech enhancement methods. Specifically, we review existing algorithms that employ sparse representation (SR), nonnegative matrix factorization (NMF), and their variants for speech enhancement. We emphasize that a speech enhancement system has two stages, namely dictionary learning and enhancement. The two scenarios for the dictionary learning process, offline and online, are also discussed. Finally, we present some evaluation methods and suggest future lines of work.

### Keywords

- dictionary learning
- nonnegative matrix factorization
- projected gradient descent
- speech enhancement
- sparse representation

## 1. Introduction

Speech is the most important means of expression and a crucial information carrier in language communication. In real-world scenarios, speech signals are corrupted by disturbances such as background noise, reverberation, and babble noise. The purpose of speech enhancement (SE) is to extract the clean speech signal from the mixture with the interfering components as well as possible, so as to improve the clarity and intelligibility of the speech signal. Research on speech enhancement technology is both important and difficult. Speech denoising is an important problem with a growing range of applications, such as hearing aids, speech/speaker recognition, and mobile communications over telephone and the Internet [1]. The difficulties arise from the nature of real-world noise, which is often unknown, nonstationary, potentially speech-like, and overlapping with the speech [1, 2, 3].

Assume that the noisy speech *x* is a linear additive mixture of the clean speech *s* and the interferer *n*, as defined in the following equation:

$$x(t) = s(t) + n(t) \qquad (1)$$

where *x*(*t*) is the time-domain mixture signal at sample *t*, and *s*(*t*) and *n*(*t*) are the time-domain speech and interferer signals, respectively. The speech enhancement algorithm attempts to suppress noise without distorting speech and to obtain the enhanced speech components. Processing is typically carried out on the short-time spectrum raised to an exponent *γ*, where *γ* = 1 gives the magnitude spectrum and *γ* = 2 gives the power spectrum. The inverse Fourier transform is then used to convert the estimated speech to the time domain, assuming that the phase of the interferer can be approximated with the phase of the mixture [4].

Speech enhancement techniques mainly focus on removing noise from the speech signal. Various types of noise and techniques for their removal have been presented [5, 6, 7, 8, 9, 10, 11, 12, 13]. The well-known spectral subtraction technique [5] extracts the clean speech spectrum based on the principle that the noise contamination process is additive. The major advantage of spectral subtraction is its simplicity: an estimate of the interferer spectrum is subtracted from the observed mixture spectrum [5, 6]. The main problems with magnitude spectral subtraction are that it may not attenuate noise sufficiently and that estimation errors in the subtraction can produce negative magnitudes.
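As a minimal sketch of magnitude spectral subtraction, assuming a stationary noise magnitude estimate is already available, the following Python function subtracts the estimate from every frame and clips negative results to a floor. The function name, the `floor` parameter, and the toy values are illustrative assumptions, not taken from [5].

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag_est, floor=0.0):
    """Subtract a per-bin noise magnitude estimate from each frame.

    noisy_mag:     (bins, frames) nonnegative magnitude spectrogram.
    noise_mag_est: (bins,) noise magnitude estimate (assumed stationary).
    floor:         lower bound applied after subtraction; 0 gives
                   half-wave rectification of negative magnitudes.
    """
    diff = noisy_mag - noise_mag_est[:, None]
    return np.maximum(diff, floor)

# Toy example: two bins, two frames.
noisy = np.array([[1.0, 0.2],
                  [0.5, 0.9]])
noise = np.array([0.4, 0.6])
enhanced = spectral_subtraction(noisy, noise)
```

Entries where the noise estimate exceeds the observed magnitude are set to the floor, which is one simple way of handling the negative magnitudes mentioned above.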

Filtering techniques [7, 8], short-time spectral amplitude (STSA) estimators [9], and estimators based on super-Gaussian prior distributions for speech DFT coefficients [10, 11, 12, 13] assume a statistical model for each of the speech and noise signals and estimate the clean speech from the noisy observation without any prior information on the noise type or speaker identity. However, when the background noise is nonstationary, these methods have great difficulty estimating the noise power spectral density (PSD) [14, 15, 16].

Recently, dictionary learning (DL) techniques, which build a dictionary of atoms and represent a class of signals in terms of those atoms, have been shown to be effective in machine learning, neuroscience, and audio processing [17, 18, 19, 20]. In speech enhancement, dictionary models exploit specific types of a priori information about both the speech and noise signals [21, 22, 23, 24, 25]. This class of methods assumes that a target spectrogram can be generated from a set of basis target spectra (a dictionary) through weighted linear combinations. Generally, this approach decomposes the time-frequency representation (the power or magnitude spectrogram) of noisy speech in terms of elementary atoms of a dictionary. One of the key issues in dictionary-based speech enhancement is how to learn a dictionary precisely. Dictionary learning methods are commonly based on an alternating optimization strategy: the dictionary elements are learned while the signal representation is fixed, and then the sparse signal representation is found while the dictionary is fixed. Two popular methods for determining a dictionary within a matrix decomposition are sparse coding [26] and nonnegative matrix factorization (NMF) [27].

The observation that speech and other structured signals can be well approximated by a few atoms of a suitably trained dictionary [28] lies at the core of sparse representation (SR). In SR, sparse signals can be reconstructed with a few atoms of an overcomplete dictionary. SR has recently been shown to be effective in data representation; it factorizes a given matrix with regularization terms that constrain the sparsity of the desired representation. Since speech signals are generally sparse in the time-frequency domain and many types of noise are nonsparse, the target speech signal is decomposed and reconstructed from a sparse dictionary driven by the noisy speech [21, 22, 23].

In many real-world applications, such as multispectral data analysis [29, 30], image representation [31, 32], and other important problems [33, 34], the signals and the dictionary are required to be nonnegative, so nonnegative dictionary learning becomes necessary. Nonnegative matrix factorization is a popular dictionary method that projects a given nonnegative matrix onto the subspace spanned by nonnegative dictionary vectors. Treating speech enhancement as a source separation problem between speech and noise, NMF-based techniques can be used to factorize spectrograms into nonnegative speech and noise dictionaries and their nonnegative activations. A clean speech signal can then be estimated from the product of the speech dictionary and its activations.

In this chapter, we review dictionary learning approaches for speech enhancement. After a brief introduction to the problem and its characterization as a sound source separation task, we survey both the theory and the applications of dictionary-based techniques, the main subject of this chapter. Finally, we provide an overview of evaluation methods and suggest some future lines of work.

## 2. Background

Dictionary learning performs an approximate factorization of a data matrix into the product of a dictionary matrix and a coding matrix, under sparsity constraints on the coding matrix. It generalizes gain-shape codebook learning: signal vectors are represented as linear combinations of multiple dictionary atoms, which allows a lower approximation error for the same dictionary size. Two rather different methods for forming the dictionary from the given data are described here: sparse representation (SR) and nonnegative matrix factorization (NMF).

### 2.1 Sparse representation (SR) and K-SVD algorithm

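Let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_M] \in \mathbb{R}^{N \times M}$ be a matrix of *M* training signals. The goal is to find a dictionary $\mathbf{D} \in \mathbb{R}^{N \times K}$ of *K* unit-norm atoms and a sparse coding matrix $\mathbf{C} = [\mathbf{c}_1, \ldots, \mathbf{c}_M] \in \mathbb{R}^{K \times M}$ such that the distance between **X** and **DC** is sufficiently small. For example, if the exact sparsity level *T*_{0} is known, the problem can be formalized as minimizing the error cost function *O*_{SR}(**D**, **C**) defined as:

$$O_{SR}(\mathbf{D}, \mathbf{C}) = \|\mathbf{X} - \mathbf{D}\mathbf{C}\|_F^2 \quad \text{subject to} \quad \|\mathbf{c}_m\|_0 \le T_0, \quad m = 1, \ldots, M \qquad (2)$$

where $\|\cdot\|_F$ and $\|\cdot\|_0$ denote the Frobenius norm and the *l*_{0} norm, respectively.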

Eq. (2) shows that a signal **x** can be expressed as a linear combination of only a few column vectors of **D**. The matrix factorization problem (2) is difficult, since the joint optimization over **D** and **C** is nonconvex. Many dictionary learning algorithms follow an iterative scheme that alternates between updates of the dictionary **D** and the sparse coding **C** to minimize the cost function (2). K-SVD, one such method, belongs to the category of sparse representation (SR), which stems from the theory of sparse and redundant representation of signals. It was first introduced by Aharon et al. [34]. The K-SVD algorithm starts from an initial overcomplete dictionary matrix and iteratively improves it by alternating between the following two steps.

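**The sparse coding approximation step** derives the columns **c**_{*m*}, *m* = 1, …, *M*, by using the orthogonal matching pursuit (OMP) algorithm with given **X** and **D** to solve the following equation:

$$\mathbf{c}_m = \arg\min_{\mathbf{c}} \|\mathbf{x}_m - \mathbf{D}\mathbf{c}\|_2^2 \quad \text{subject to} \quad \|\mathbf{c}\|_0 \le T_0 \qquad (3)$$

where $\mathbf{x}_m$ is the *m*th column of **X**.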

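**The dictionary update step** minimizes the approximation error (2) with the current coding **C** held fixed. The atoms are updated one at a time in an iterative process. For the *i*th atom $\mathbf{d}_i$, the error can be written as

$$\|\mathbf{X} - \mathbf{D}\mathbf{C}\|_F^2 = \Big\| \mathbf{X} - \sum_{j \ne i} \mathbf{d}_j \mathbf{c}^{[j]} - \mathbf{d}_i \mathbf{c}^{[i]} \Big\|_F^2 = \|\mathbf{E}_i - \mathbf{d}_i \mathbf{c}^{[i]}\|_F^2 \qquad (4)$$

where **c**^{[i]} is the *i*th row of **C** and $\mathbf{E}_i$ is the residual matrix without the contribution of the *i*th atom. The residual norm is minimized by seeking a rank-one approximation of $\mathbf{E}_i$ [35], which is computed with the singular value decomposition (SVD) [23].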
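The two K-SVD steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the reference implementation of [34]: the dictionary is initialized with random unit-norm atoms, OMP is written in its plain greedy form, and the function names `omp` and `ksvd` and all parameter values are hypothetical.

```python
import numpy as np

def omp(D, x, T0):
    """Greedy orthogonal matching pursuit with at most T0 atoms (T0 >= 1)."""
    residual = x.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(T0):
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Least-squares fit of x on the atoms selected so far.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-12:
            break
    c = np.zeros(D.shape[1])
    c[support] = coef
    return c

def ksvd(X, K, T0, n_iter=10, seed=0):
    """Alternate OMP sparse coding and SVD-based atom updates."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    D = rng.standard_normal((N, K))
    D /= np.linalg.norm(D, axis=0, keepdims=True)   # unit-norm atoms
    C = np.zeros((K, M))
    for _ in range(n_iter):
        # Sparse coding step: one OMP problem per training signal.
        C = np.column_stack([omp(D, X[:, m], T0) for m in range(M)])
        # Dictionary update step: one atom at a time.
        for i in range(K):
            users = np.nonzero(C[i, :])[0]          # signals using atom i
            if users.size == 0:
                continue
            C[i, users] = 0.0
            # Residual without atom i, restricted to the signals that use it.
            E = X[:, users] - D @ C[:, users]
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, i] = U[:, 0]                       # rank-one approximation
            C[i, users] = s[0] * Vt[0, :]
    return D, C

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 40))
T0 = 3
D, C = ksvd(X, K=12, T0=T0, n_iter=5)
rel_err = np.linalg.norm(X - D @ C) / np.linalg.norm(X)
```

Restricting the residual to the signals that currently use the atom keeps the support of **C** unchanged during the dictionary update, so the sparsity level is preserved.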

### 2.2 Nonnegative matrix factorization (NMF) theory

Nonnegative matrix factorization (NMF) can be viewed as an approach to dictionary learning. NMF, first introduced by Paatero and Tapper [36] and later popularized by Lee and Seung [23, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37], is known as a parts-based representation model. Unlike other matrix factorization approaches, NMF takes into account the fact that most types of real-world data, particularly sound and video, are nonnegative, and it maintains such nonnegativity constraints in the factorization. Moreover, the nonnegativity constraints in NMF are compatible with the intuitive notion of combining parts to form a whole, that is, they provide a parts-based local representation of the data. A parts-based model not only provides an efficient representation of the data but can also aid in discovering causal structure within it and in learning relationships between the parts.

Given a nonnegative matrix $\mathbf{X} \in \mathbb{R}_{+}^{N \times M}$ and a rank *K* ≪ min{*N*, *M*}, NMF projects **X** onto a space spanned by a set of nonnegative basis vectors **D** = {*d*_{nk}}, *d*_{nk} ≥ 0, that is, **X ≈ DC** where **C** = {*c*_{km}}, *c*_{km} ≥ 0. To find an approximate factorization of the matrix **X**, a cost function that quantifies the quality of the decomposition needs to be defined. Operationally, NMF can be described by the following objective function:

$$\min_{\mathbf{D} \ge 0, \; \mathbf{C} \ge 0} f(\mathbf{X}, \mathbf{D}\mathbf{C}) \qquad (5)$$

where *f* denotes a distance metric.

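Different similarity measures between **X** and the product **DC** lead to different variants of NMF. Common choices include the Euclidean distance [38], the generalized Kullback-Leibler divergence [39], and the Itakura-Saito divergence [40]. For instance, NMF based on the Kullback-Leibler (KL) divergence is formulated as follows:

$$\min_{\mathbf{D} \ge 0, \; \mathbf{C} \ge 0} D_{KL}(\mathbf{X} \,\|\, \mathbf{D}\mathbf{C}) = \sum_{n=1}^{N} \sum_{m=1}^{M} \left( x_{nm} \log \frac{x_{nm}}{(\mathbf{D}\mathbf{C})_{nm}} - x_{nm} + (\mathbf{D}\mathbf{C})_{nm} \right) \qquad (6)$$

where $x_{nm}$ is the (*n*, *m*)th entry of **X**.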

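There exist different optimization models for the approximate factorization (5) [36, 39, 40]. The most popular solution is the alternating multiplicative update rules (MURs) [36], which require no user-specified optimization parameters. For the KL cost function (6), the iterative update rules are given by:

$$\mathbf{D} \leftarrow \mathbf{D} \otimes \frac{\left(\mathbf{X} \oslash (\mathbf{D}\mathbf{C})\right) \mathbf{C}^{T}}{\mathbf{1}\mathbf{C}^{T}}, \qquad \mathbf{C} \leftarrow \mathbf{C} \otimes \frac{\mathbf{D}^{T} \left(\mathbf{X} \oslash (\mathbf{D}\mathbf{C})\right)}{\mathbf{D}^{T}\mathbf{1}} \qquad (7)$$

where ⊗, ⊘, and the fraction bars denote element-wise multiplication and division, and **1** is an *N* × *M* matrix of ones.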
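The multiplicative updates for the KL cost can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used in the cited works; the random initialization, the small constant `eps` that guards against division by zero, and the names `nmf_kl` and `kl_divergence` are assumptions.

```python
import numpy as np

def kl_divergence(X, Y, eps=1e-12):
    """Generalized KL divergence between nonnegative matrices X and Y."""
    return float(np.sum(X * np.log((X + eps) / (Y + eps)) - X + Y))

def nmf_kl(X, K, n_iter=200, eps=1e-12, seed=0):
    """KL-NMF with alternating multiplicative updates."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    D = rng.random((N, K)) + eps
    C = rng.random((K, M)) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        R = X / (D @ C + eps)                  # element-wise ratio X / (DC)
        D *= (R @ C.T) / (ones @ C.T + eps)    # dictionary update
        R = X / (D @ C + eps)
        C *= (D.T @ R) / (D.T @ ones + eps)    # activation update
    return D, C

rng = np.random.default_rng(0)
X = rng.random((12, 3)) @ rng.random((3, 30))  # nonnegative rank-3 data
D1, C1 = nmf_kl(X, K=3, n_iter=1)
D, C = nmf_kl(X, K=3, n_iter=200)
kl_start = kl_divergence(X, D1 @ C1)
kl_end = kl_divergence(X, D @ C)
```

Because each update multiplies by a nonnegative ratio, **D** and **C** stay nonnegative without any explicit projection, and no step size has to be chosen.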

However, the monotonic decrease guaranteed by the convergence proof of the multiplicative updates does not necessarily imply that the full Karush-Kuhn-Tucker conditions are satisfied [39, 40]. MUR is relatively simple and easy to implement, but it converges more slowly than gradient approaches [41]. More efficient algorithms with stronger theoretical convergence properties have been introduced. One popular approach is to apply gradient descent algorithms with additive update rules, represented by the projected gradient descent (PGD) method [42]. In the PGD framework, the learning step size is selected by a line search with the Armijo rule [42], and the new estimate is obtained by first calculating the unconstrained steepest-descent update and then zeroing its negative elements. In addition, exploiting the separate convexity of the problem, the two-variable optimization is converted into nonnegative least squares (NLS) subproblems, which alternate the minimization over either **D** or **C** with the other matrix fixed.
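A minimal sketch of one NLS subproblem solved by projected gradient descent is given below, assuming the Euclidean cost and a fixed dictionary. The backtracking constants `sigma` and `beta`, the initial step `alpha0`, and the function name are illustrative choices, not values prescribed by [42].

```python
import numpy as np

def pgd_update_C(X, D, C, n_iter=50, sigma=0.01, beta=0.5, alpha0=1.0):
    """Minimize 0.5 * ||X - D C||_F^2 over C >= 0 with D held fixed."""
    def cost(C_):
        return 0.5 * np.linalg.norm(X - D @ C_) ** 2

    for _ in range(n_iter):
        grad = D.T @ (D @ C - X)
        f_old = cost(C)
        alpha = alpha0
        while True:
            # Unconstrained steepest-descent step, then zero the negatives.
            C_new = np.maximum(C - alpha * grad, 0.0)
            # Armijo sufficient-decrease condition along the projection arc.
            decrease_ok = cost(C_new) - f_old <= sigma * np.sum(grad * (C_new - C))
            if decrease_ok or alpha < 1e-12:
                break
            alpha *= beta
        C = C_new
    return C

rng = np.random.default_rng(0)
X = rng.random((5, 8))
D = rng.random((5, 3))
C0 = rng.random((3, 8))
C1 = pgd_update_C(X, D, C0)
err_before = np.linalg.norm(X - D @ C0)
err_after = np.linalg.norm(X - D @ C1)
```

The same routine applies to the **D** subproblem by transposing the factorization, which is how the alternating NLS scheme reuses one solver for both factors.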

Because of the initial condition *K* ≪ min{*N*, *M*}, the obtained basis vectors are incomplete over the original vector space. In other words, this NMF approach tries to represent a high-dimensional stochastic pattern with far fewer bases, so a perfect approximation can be achieved only if the intrinsic features are identified in **D**.

NMF does not yield a unique solution under the nonnegativity constraint alone. Hence, to remedy the ill-posedness, it is necessary to introduce auxiliary constraints on **D** and/or **C** as regularization terms, which also incorporate prior knowledge and reflect the characteristics of the problem more comprehensively. The constrained NMF models can be unified under a similar extended objective function:

$$\min_{\mathbf{D} \ge 0, \; \mathbf{C} \ge 0} f(\mathbf{X}, \mathbf{D}\mathbf{C}) + \alpha \, g(\mathbf{D}) + \chi \, h(\mathbf{C}) \qquad (8)$$

where the regularization parameters *α* and *χ* balance the trade-off between the goodness of fit and the constraints *g*(**D**) and *h*(**C**).

The performance of NMF can be improved by imposing extra constraints and regularizations. For sparseness learning, the sparse term *h*(**C**) is intended to constrain the number of nonzero elements in each column of the projection matrix. The *L*_{0} norm can be selected to count the nonzero elements in **C** [43]. One limitation of the *L*_{0} norm is that the solution is not unique because the cost function has many local minima. In this situation, the *L*_{1} norm of the projection matrix is usually used instead, as a relaxation of the *L*_{0} penalty [44, 45].
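To illustrate the effect of an *L*_{1} penalty, the sketch below updates only the activations for a fixed dictionary under the KL cost, with *h*(**C**) taken as the sum of the entries of **C**. The multiplicative rule with the penalty weight added to the denominator is one common choice and is an assumption here, as are the function name and the normalization of the dictionary columns to unit sum.

```python
import numpy as np

def sparse_activations(X, D, chi, n_iter=100, eps=1e-12):
    """KL-cost activations with penalty chi * sum(C), dictionary D fixed."""
    K, M = D.shape[1], X.shape[1]
    C = np.ones((K, M))
    col_sums = D.sum(axis=0)[:, None]          # D^T 1, shape (K, 1)
    for _ in range(n_iter):
        ratio = X / (D @ C + eps)
        # The penalty weight enters the denominator and shrinks C.
        C *= (D.T @ ratio) / (col_sums + chi)
    return C

rng = np.random.default_rng(0)
D = rng.random((6, 3))
D /= D.sum(axis=0, keepdims=True)              # columns sum to one
X = rng.random((6, 10))
chi = 2.0
C_plain = sparse_activations(X, D, chi=0.0)
C_sparse = sparse_activations(X, D, chi=chi)
```

With unit-sum dictionary columns, every update rescales the total activation mass to the sum of **X** divided by (1 + *χ*), so a larger *χ* directly trades fitting accuracy for smaller activations.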

## 3. Dictionary learning-based speech enhancement

A major outcome of speech enhancement techniques is improved quality and reduced listening effort in the presence of an interfering noise signal. The decomposition of time-frequency representations, such as the power or magnitude spectrogram, in terms of elementary atoms has become a popular tool in speech enhancement because of its success in finding high-“quality” dictionary atoms that best describe latent features of the data being processed. Dictionary-based techniques exploit specific types of a priori information about speech or noise [21, 23, 46, 47, 48, 49, 50]. The a priori information can be typical patterns or statistics obtained from a speech or noise database. Dictionary-based speech enhancement consists of two separate stages: a training stage, in which the model parameters are learned, and a denoising stage, in which the noise reduction task is carried out. During training, the dictionary **D** is learned while the coefficient matrix **C** is fixed, and then **C** is computed with the dictionary matrix **D** fixed. This alternating minimization is repeated until a stopping criterion is reached. To learn dictionary atoms capable of revealing the hidden structure in speech, the long temporal context of speech signals must be considered.

Dictionary-based speech enhancement techniques fall into two major classes: offline learning and online learning. Offline algorithms for dictionary learning are second-order iterative batch procedures that access the whole training set at each iteration in order to minimize a cost function under some constraints [21, 22, 23]. In speech enhancement, learning spectrotemporal atoms that span several consecutive frames requires training on large volumes of data, which places unrealistic demands on computing power and memory. In large-scale tasks, online dictionary learning tends to achieve a lower empirical cost than conventional batch learning [46, 47, 48, 49, 50].

Speech enhancement is implemented here in the short-time Fourier transform (STFT) magnitude domain, assuming that the phase of the interferer can be approximated with the phase of the mixture. The number of frequency bins per frame is determined by the length of the time-domain analysis window; a Hamming window is used for the STFT. The temporal smoothness across frames is determined by the overlap of the time-domain analysis windows, where a minimum amount of overlap is necessary to avoid aliasing.
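The analysis and resynthesis chain can be sketched with NumPy alone. The window length of 512 samples, the hop of 128 samples, and the function names are assumptions made for illustration; the chapter does not specify these values. Synthesis uses weighted overlap-add with the same Hamming window, and the enhanced magnitude is combined with the phase of the noisy mixture.

```python
import numpy as np

def stft(x, win_len=512, hop=128):
    """Hamming-windowed STFT; returns (win_len // 2 + 1, n_frames)."""
    w = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop: i * hop + win_len] * w for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)

def istft(S, win_len=512, hop=128):
    """Weighted overlap-add inverse, normalized by the summed squared window."""
    w = np.hamming(win_len)
    n_frames = S.shape[1]
    frames = np.fft.irfft(S, n=win_len, axis=0)
    out = np.zeros(win_len + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop: i * hop + win_len] += frames[:, i] * w
        norm[i * hop: i * hop + win_len] += w ** 2
    return out / np.maximum(norm, 1e-12)

def resynthesize(enhanced_mag, noisy_stft, win_len=512, hop=128):
    """Combine an enhanced magnitude with the phase of the noisy mixture."""
    phase = np.angle(noisy_stft)
    return istft(enhanced_mag * np.exp(1j * phase), win_len, hop)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
S = stft(x)
# Sanity check: an unmodified magnitude should give back the input.
y = resynthesize(np.abs(S), S)
```

In an enhancement system, `np.abs(S)` would be replaced by the magnitude estimated from the speech dictionary.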

### 3.1 Offline dictionary

Sparse representation is an overcomplete model in which the number of bases is greater than the dimensionality of the spectral representation. Sparse signals can then be expressed as a linear combination of only a few atoms of an overcomplete dictionary. Since speech signals are generally sparse in the time-frequency domain and many types of noise are nonsparse, the target speech signal reconstructed from the noisy speech is taken as the clean speech. A possibly overcomplete dictionary of atoms is trained for each of the speech and interferer magnitudes, and the two are then concatenated into a composite dictionary. The training process of the updated dictionary is shown in Figure 1.

When applying the sparse coding technique to speech enhancement, it is desirable that the offline-trained clean speech dictionary **D**_{*speech*} be coherent with the speech signal and incoherent with the background noise signal, and that the noise dictionary **D**_{*noise*} be coherent with the noise. In the enhancement step, the noisy speech is sparsely coded in the composite dictionary [**D**_{*speech*}, **D**_{*noise*}]. As a result, the mixture **x** of speech and interferer is explained by the sum of a linear combination of atoms from the speech dictionary **D**_{*speech*} and a linear combination of atoms from the interferer dictionary **D**_{*noise*}. The noisy **x** is coded using least angle regression (LARS) [51] with a preset threshold *θ* as follows:

$$\mathbf{c} = \begin{bmatrix} \mathbf{c}_{speech} \\ \mathbf{c}_{noise} \end{bmatrix} = \mathrm{LARS}\left(\mathbf{x}, [\mathbf{D}_{speech}, \mathbf{D}_{noise}], \theta\right)$$

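The clean speech magnitude is estimated by disregarding the contribution from the interferer dictionary and preserving only the linear combination of speech dictionary atoms (and analogously for the interferer):

$$\hat{\mathbf{s}} = \mathbf{D}_{speech}\,\mathbf{c}_{speech}, \qquad \hat{\mathbf{n}} = \mathbf{D}_{noise}\,\mathbf{c}_{noise}$$

where $\mathbf{c}_{speech}$ and $\mathbf{c}_{noise}$ are the parts of the sparse code associated with the speech and interferer dictionaries, respectively.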
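The coding and separation steps can be sketched as follows. The chapter codes the mixture with least angle regression; this sketch substitutes the iterative shrinkage-thresholding algorithm (ISTA) with an *L*_{1} weight `lam`, purely because it fits in a few lines. The solver choice, the function names, and the toy dictionaries are all assumptions.

```python
import numpy as np

def ista(D, x, lam, n_iter=500):
    """Sparse code of x in D by iterative shrinkage-thresholding."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = c - D.T @ (D @ c - x) / L          # gradient step
        c = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return c

def separate(x, D_speech, D_noise, lam=0.01):
    """Code x in the composite dictionary and split the reconstruction."""
    D = np.hstack([D_speech, D_noise])
    c = ista(D, x, lam)
    k_s = D_speech.shape[1]
    s_hat = D_speech @ c[:k_s]                 # speech part only
    n_hat = D_noise @ c[k_s:]                  # interferer part only
    return s_hat, n_hat

# Toy dictionaries with disjoint supports, so the split is unambiguous.
I = np.eye(4)
D_speech, D_noise = I[:, :2], I[:, 2:]
x = np.array([1.0, 0.0, 0.0, 2.0])
s_hat, n_hat = separate(x, D_speech, D_noise, lam=0.01)
```

With real dictionaries the split is only as clean as the dictionaries are mutually incoherent, which is why the coherence requirements above matter.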

NMF represents data as a linear combination of a set of basis vectors, in which both the combination coefficients and the basis vectors are nonnegative. Although the basis learned by NMF is sparse, it differs from sparse coding [26], because NMF learns a low-rank representation of the data, while sparse coding usually learns a full-rank representation. Treating speech enhancement as a source separation problem (speech and noise), NMF-based techniques can be used to factorize spectrograms into nonnegative speech and noise dictionaries and their nonnegative activations. Let **X**_{*speech*} denote a clean speech spectrogram and **X**_{*noise*} a noise spectrogram. Consider a supervised denoising approach in which the clean speech basis matrix **D**_{*speech*} and the noise basis matrix **D**_{*noise*} are learned separately by performing NMF on the speech and on the noise. During the training process, the following costs are minimized:

$$\min_{\mathbf{D}_{speech}, \mathbf{C}_{speech} \ge 0} f(\mathbf{X}_{speech}, \mathbf{D}_{speech}\mathbf{C}_{speech}), \qquad \min_{\mathbf{D}_{noise}, \mathbf{C}_{noise} \ge 0} f(\mathbf{X}_{noise}, \mathbf{D}_{noise}\mathbf{C}_{noise})$$

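To reduce the noise in the noisy speech, the concatenated dictionary **D** = [**D**_{*speech*}, **D**_{*noise*}] is fixed and used to decompose the noisy speech **X**_{*noisy*} as

$$\mathbf{X}_{noisy} \approx \mathbf{D}\mathbf{C} = [\mathbf{D}_{speech}, \mathbf{D}_{noise}] \begin{bmatrix} \mathbf{C}_{speech} \\ \mathbf{C}_{noise} \end{bmatrix}$$

where the time-varying activation matrix $\mathbf{C} = [\mathbf{C}_{speech}^{T}, \mathbf{C}_{noise}^{T}]^{T}$ stacks the speech and noise activations.

Discarding the noise coding matrix, the target speech is estimated from the product of the speech dictionary and its activations as

$$\hat{\mathbf{X}}_{speech} = \mathbf{D}_{speech}\mathbf{C}_{speech}$$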

The clean speech waveform is then estimated using the noisy phase and the inverse DFT. The general framework of NMF-based speech enhancement is shown in Figure 2.
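The supervised pipeline can be sketched end to end on magnitude spectrograms. The sketch assumes the KL cost with multiplicative updates for both training and enhancement, and it uses random matrices in place of real spectrograms; the function names, dictionary sizes, and iteration counts are hypothetical.

```python
import numpy as np

EPS = 1e-12

def kl_nmf(X, K=None, D=None, n_iter=200, seed=0):
    """KL-NMF by multiplicative updates; a supplied D is kept fixed."""
    rng = np.random.default_rng(seed)
    learn_D = D is None
    if learn_D:
        D = rng.random((X.shape[0], K)) + EPS
    C = rng.random((D.shape[1], X.shape[1])) + EPS
    ones = np.ones_like(X)
    for _ in range(n_iter):
        if learn_D:
            D = D * ((X / (D @ C + EPS)) @ C.T) / (ones @ C.T + EPS)
        C = C * (D.T @ (X / (D @ C + EPS))) / (D.T @ ones + EPS)
    return D, C

def enhance(X_noisy, D_speech, D_noise, n_iter=200):
    """Decompose noisy speech on the fixed composite dictionary."""
    D = np.hstack([D_speech, D_noise])
    _, C = kl_nmf(X_noisy, D=D, n_iter=n_iter)
    C_speech = C[: D_speech.shape[1]]          # discard the noise activations
    return D_speech @ C_speech

# Stand-ins for training spectrograms (bins x frames).
rng = np.random.default_rng(1)
X_speech = rng.random((16, 40))
X_noise = rng.random((16, 40))

# Training stage: one dictionary per source.
D_speech, _ = kl_nmf(X_speech, K=5)
D_noise, _ = kl_nmf(X_noise, K=3)

# Enhancement stage: only the activations are estimated.
X_noisy = X_speech + X_noise
X_hat = enhance(X_noisy, D_speech, D_noise)
```

The estimate `X_hat` is a magnitude spectrogram; a waveform is obtained by attaching the noisy phase and inverting the transform.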

### 3.2 Online dictionary learning

The dictionary learning approaches described above access the whole training set to determine the bases, which is referred to as an offline training process. These methods have been reported to perform well in modeling nonstationary noise types that were seen during training. For the time-frequency analysis of audio signals, however, the obtained basis may not be adequate to capture the temporal dependency of repeating patterns within the signal, and the success of these methods relies strongly on prior knowledge of the noise, the speech, or both, which limits the applicability of the models. Recently, online dictionary learning methods have been proposed with two aims: improving the implementation scheme [46, 47, 48, 49, 50] and circumventing the mismatch problem between the training and testing stages [24, 52].

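One drawback of the multiplicative update procedure in offline dictionary learning is that all the training signals must be read into memory and processed in each iteration. This high demand on both computing resources and memory is prohibitive in large-scale tasks. To address this problem, online optimization algorithms were developed that work incrementally: they process one sample of the training set at a time, based on stochastic approximations, or only a part of the training data at a time, and update the patterns gradually until the whole training corpus has been processed [46, 47, 48, 51]. More specifically, given *M* samples $\{\mathbf{x}_1, \ldots, \mathbf{x}_M\}$, the dictionary is learned by minimizing the empirical cost

$$f_M(\mathbf{D}) = \frac{1}{M} \sum_{m=1}^{M} \ell(\mathbf{x}_m, \mathbf{D})$$

where $\ell(\mathbf{x}_m, \mathbf{D}) = \min_{\mathbf{c}} f(\mathbf{x}_m, \mathbf{D}\mathbf{c})$ measures how well the dictionary **D** represents the sample $\mathbf{x}_m$ under the constraints of the chosen model.

The coefficient matrix $\mathbf{C} = [\mathbf{c}_1, \ldots, \mathbf{c}_M]$ is computed by

$$\mathbf{c}_m = \arg\min_{\mathbf{c}} f(\mathbf{x}_m, \mathbf{D}\mathbf{c}), \quad m = 1, \ldots, M$$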

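For the online NMF framework, at step *t*, on the arrival of sample **x**^{(t)}, the corresponding coefficient **c**^{(t)} is obtained by

$$\mathbf{c}^{(t)} = \arg\min_{\mathbf{c} \ge 0} f\left(\mathbf{x}^{(t)}, \mathbf{D}^{(t-1)}\mathbf{c}\right)$$

where **D**^{(t−1)} is the previous basis matrix. The matrix **D**^{(t)} is updated by

$$\mathbf{D}^{(t)} = \arg\min_{\mathbf{D} \ge 0} \frac{1}{t} \sum_{i=1}^{t} f\left(\mathbf{x}^{(i)}, \mathbf{D}\mathbf{c}^{(i)}\right)$$

where **c**^{(1)}, …, **c**^{(t)} are the coefficients obtained in the first *t* steps.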
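One way to realize such an incremental scheme is sketched below, assuming the Euclidean cost. Past samples enter the dictionary update only through two accumulated matrices, so no sample has to be revisited. The multiplicative coefficient solver, the column-wise dictionary update with a unit-norm bound, and all names are illustrative assumptions rather than the specific algorithms of the cited works.

```python
import numpy as np

def nonneg_code(x, D, n_iter=100, eps=1e-12):
    """Nonnegative coefficients of x for a fixed D (Euclidean cost)."""
    c = np.ones(D.shape[1])
    DtX = D.T @ x
    DtD = D.T @ D
    for _ in range(n_iter):
        c *= DtX / (DtD @ c + eps)             # multiplicative update
    return c

def online_nmf(samples, K, seed=0, eps=1e-12):
    """Process one column of `samples` at a time and refine D."""
    rng = np.random.default_rng(seed)
    N = samples.shape[0]
    D = rng.random((N, K)) + eps
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    A = np.zeros((K, K))                       # accumulates c c^T
    B = np.zeros((N, K))                       # accumulates x c^T
    for t in range(samples.shape[1]):
        x = samples[:, t]
        c = nonneg_code(x, D)                  # uses the previous dictionary
        A += np.outer(c, c)
        B += np.outer(x, c)
        # Column-wise descent on the accumulated quadratic cost.
        for k in range(K):
            if A[k, k] < eps:
                continue
            u = D[:, k] + (B[:, k] - D @ A[:, k]) / A[k, k]
            u = np.maximum(u, 0.0)             # keep the atom nonnegative
            D[:, k] = u / max(np.linalg.norm(u), 1.0)
    return D

rng = np.random.default_rng(0)
samples = rng.random((10, 200))
D = online_nmf(samples, K=4)
```

Memory use is governed by the sizes of `A` and `B`, which depend on *K* and *N* but not on the number of samples processed.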

In [50], an online noise basis learning scheme is proposed that uses the temporal dependencies of the speech and noise signals to construct an informative prior distribution. In this model, the noise basis matrix is learned from the noisy observation. To update the noise basis, the past noisy DFT magnitude frames are stored in a buffer, and the buffer is updated, with the speech basis held fixed, whenever a new noisy frame arrives.

Kwon et al. [52] present a speech enhancement technique that combines statistical models and NMF with an online update of the speech and noise bases. They propose a cascaded structure that combines statistical model-based enhancement (SE) as the first stage [53] with an NMF approach as the second stage, with simultaneous update of the speech and noise bases. In this model, the clean speech output at the current frame is fed back as an input to update the speech and noise bases for the following frame. In other words, once the clean speech estimate is obtained at each frame, the speech and noise bases for the NMF analysis of the following frame are updated. This online basis update makes it possible to deal with speech and noise variations that are not covered by the training noise database, and it is considered a promising way to cope with the nonstationary nature of the signal. The noisy data **X′**(*t*) used for the online basis update is constructed by concatenating the preenhanced output **X**_{SE}(*t*) of the statistical model-based enhancement with the current frame input **X**(*t*). The dictionary update is learned by adding a regularization term to the original objective function as follows:

where **D′**(*t*) = [**D′**_{*speech*}(*t*), **D′**_{*noise*}(*t*)] denotes the basis matrix in the NMF decomposition of the concatenated noisy data **X′**(*t*), and **D**(*t*) = [**D**_{*speech*}(*t*), **D**_{*noise*}(*t*)] is the basis matrix used to analyze the *t*th frame **X**(*t*) in the second stage.

## 4. Summary and discussion

In the experimental simulations, speech and noise materials were selected from TIMIT [53] (192 sentences), the NOISEX-92 database (15 types of noise: birds, casino, cicadas, computer keyboard, eating chips, f16, factory1, factory2, frogs, jungle, machineguns, motorcycles, ocean, pink, and volvo) [54], the GRID audiovisual corpus (34 speakers of both genders) [55], and the NOIZEUS speech corpus (30 utterances with clean samples) [1]. The noisy speech examples were synthesized by adding different types of noise to clean speech at various input SNRs.

Speech enhancement algorithms aim to improve both speech quality and speech intelligibility. A high-quality speech signal is perceived as natural, pleasant to listen to, and free of distracting artifacts. An effective technique should suppress noise without introducing too much distortion into the enhanced speech. Measuring speech quality is challenging because quality is subjective; quality measures are classified as subjective or objective. Speech enhancement performance is commonly evaluated in terms of three criteria: the signal-to-noise ratio (SNR) of the enhanced speech [56], the segmental SNR (segSNR) [56], and the perceptual evaluation of speech quality (PESQ) score [57, 58, 59]. Given the true and estimated speech magnitude spectra, the frequency-weighted segmental SNR is defined as:

segSNR is a conceptually simple objective measure, computed on individual signal frames, and the per-frame scores are averaged over time.

where *X*_{*b*,*speech*}(*t*) is the frequency-domain representation of the clean speech signal for frequency *b* and time frame *t*.
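A per-frame SNR averaged over time can be computed from magnitude spectra as in the following sketch. It uses uniform frequency weights and no clamping of per-frame values, both of which are simplifying assumptions; the chapter's frequency weighting is not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def seg_snr(X_clean, X_est, eps=1e-12):
    """Average over frames of the per-frame SNR in dB.

    X_clean, X_est: (bins, frames) true and estimated magnitude spectra.
    """
    signal = np.sum(X_clean ** 2, axis=0)              # energy per frame
    error = np.sum((X_clean - X_est) ** 2, axis=0)     # error energy per frame
    per_frame = 10.0 * np.log10((signal + eps) / (error + eps))
    return float(np.mean(per_frame))

# Toy check: a uniform 10 percent magnitude error in every bin.
X_clean = np.ones((4, 3))
X_est = 0.9 * X_clean
score = seg_snr(X_clean, X_est)
```

In the toy check the error energy is one hundredth of the signal energy in every frame, so the score is 20 dB.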

In contrast to spectral subtraction, the dictionary approach does not assume a stationary interferer and optimizes the trade-off between source distortion and source confusion. It thus shows superiority on objective quality measures such as cepstral distance, in both the speaker-dependent and speaker-independent cases, in real-world environments, and under low SNR conditions. One possible explanation lies in the amount of data available for estimating a noise dictionary. At low SNR levels, the total volume of noise is much higher than at high SNR levels, which offers a better chance of obtaining a good noise dictionary or noise model. Under high SNR conditions, however, much of the noise spectrum is buried in the speech spectrum, which can make learning a noise dictionary difficult. The pretrained speech dictionary models outperform state-of-the-art methods such as multiband spectral subtraction and approaches based on vector quantization [21, 22, 23]. Offline speech dictionary learning can be carried out in a joint decomposition framework of the noisy speech spectrogram and a primary estimate of the clean speech spectrogram. The online learning approach processes input signals piece by piece, breaking the training data into small pieces and updating the learned patterns gradually using accumulated statistics. With this approach, only a limited segment of the input signal is processed at a time. The online estimated dictionary spans a sufficient basis subspace to avoid speech distortion. The online approaches tend to give better performance than batch learning [53].

The computing demand for both offline and online learning consists of updating the coefficient matrix **C** and the pattern matrix **D**. The learning task is defined as an optimization problem that aims to minimize an objective cost function *f*(**D**) with respect to the pattern matrix **D**. It is observed that the reconstruction error for both the online and offline methods converges to a similar value after several iterations, although it does not decrease monotonically at the beginning. With unlimited data and unlimited computing resources, both batch and online learning converge to a stationary point of the expected cost function *f*(**D**); this situation is valid only in theory. For small-scale tasks, where data are limited but computing resources are unlimited, batch learning converges to a stationary point of the empirical cost function *f*_{t}(**D**), while online learning fails to converge, resulting in suboptimal patterns. For large-scale tasks, the more common situation is that training data are abundant but computing resources are limited. In this situation, owing to its early learning property, online learning tends to obtain a lower empirical cost than batch learning [49]. For sparse coding, where the pattern matrix is overcomplete (*K* > *N*), online learning is slower than batch learning. Online learning has also been reported to be significantly faster than batch alternating learning, by a factor of the large number of spectrograms reconstructed at each iteration [60].

In short, dictionary learning plays an important role in machine learning, where data vectors are modeled as sparse linear combinations of basis factors (i.e., a dictionary). However, how to conduct dictionary learning in noisy environments has not been well studied. In this chapter, we have reviewed speech enhancement techniques based on dictionary learning. Dictionary learning-based algorithms have gained a lot of attention because of their success in finding high-“quality” dictionary atoms (basis vectors) that best describe latent features of the data being processed. As multivariate data analysis and dimensionality reduction techniques, NMF and SR, two relatively novel paradigms for dimensionality reduction and sparse representation, have been in the ascendant since their inception. They enhance learning and data representation through the parts-based and sparse representations that result from the nonnegativity, or purely additive, constraint. NMF and SR produce high-quality enhancement results when the dictionaries for the different sources are sufficiently distinct. This survey chapter focuses mainly on theoretical research into dictionary learning-based speech enhancement, in which the principles, basic models, properties, algorithms, and applications of SR and NMF are summarized systematically.

## Acknowledgments

This research is partially supported by the Ministry of Science and Technology under Grant Number 108-2634-F-008-004 through Pervasive Artificial Intelligence Research (PAIR) Labs, Taiwan.