Speech Enhancement Using an Iterative Posterior NMF

Over the years, various methods for speech enhancement have been proposed, such as spectral subtraction (SS) and minimum mean square error (MMSE) estimators. These methods require no prior knowledge about the speech and noise signals and no training stage beforehand, so they are highly flexible and can be deployed in many situations. However, they usually assume that the noise is stationary and therefore cope poorly with nonstationary noise types, especially under low signal-to-noise ratio (SNR) conditions. To overcome these drawbacks, nonnegative matrix factorization (NMF) has been introduced; the NMF approach is more robust to nonstationary noise. In this chapter, we are interested in the application of NMF to speech enhancement. A speech enhancement method based on regularized NMF for nonstationary Gaussian noise is proposed. The spectral components of speech and noise are modeled as Gamma and Rayleigh distributions, respectively. We propose to adaptively estimate the sufficient statistics of these distributions to obtain a natural regularization of the NMF criterion.


Introduction
Over the past several decades, there has been a large research interest in the problem of single-channel sound source separation. Such work focuses on the task of separating a single mixture recording into its respective sources and is motivated by the fact that real-world sounds are inherently constructed from many individual sounds (e.g., human speakers, musical instruments, background noise). While source separation is difficult, the topic is motivated by many outstanding problems in audio signal processing and machine learning, including the following:

1. Speech denoising and enhancement: the task of removing background noise (e.g., wind, babble) from recorded speech and improving speech intelligibility for human listeners and/or automatic speech recognizers
2. Content-based analysis and processing: the task of extracting and/or processing audio based on semantic properties of the recording such as tempo, rhythm, and/or pitch
3. Music transcription: the task of notating an audio recording into a musical representation such as a musical score, guitar tablature, or other symbolic notation
4. Audio-based forensics: the task of examining, comparing, and evaluating audio recordings for scientific and/or legal matters
5. Audio restoration: the task of removing imperfections such as noise, hiss, pops, and crackles from (typically old) audio recordings
6. Music remixing and content creation: the task of creating a new musical work by manipulating the content of one or more previously existing recordings


Nonnegative matrix factorization

NMF model
Nonnegative matrix factorization is a process that approximates a single nonnegative matrix as the product of two nonnegative matrices. It is defined by

V ≈ WH,  (1)

where V ∈ R_+^(N_f × N_t) is a nonnegative input data matrix, W ∈ R_+^(N_f × N_z) is a matrix of basis vectors, basis functions, or dictionary elements, and H ∈ R_+^(N_z × N_t) is a matrix of corresponding activations, weights, or gains; N_f is the number of rows of the input matrix, N_t is the number of columns of the input matrix, and N_z is the number of basis vectors [1].

V ∈ R_+^(N_f × N_t) — input data matrix:
• Each column is an N_f-dimensional data sample.
• Each row represents a data feature.

W ∈ R_+^(N_f × N_z) — matrix of basis vectors, basis functions, or dictionary elements:
• Each column represents a basis vector, basis function, or dictionary element.
• The columns are not orthonormal, but are commonly normalized to sum to one.

H ∈ R_+^(N_z × N_t) — matrix of activations, weights, or gains:
• Each row represents the gains of the corresponding basis vector over time.
• The rows are not orthonormal, but are sometimes normalized to sum to one.
When used for audio applications, NMF typically models spectrogram data, that is, the magnitude or power of short-time Fourier transform (STFT) data [2]. We take a single-channel recording, transform it into the time-frequency domain using the STFT, take the magnitude or power V, and then approximate the result as V ≈ WH. In doing so, NMF approximates spectrogram data as a linear combination of prototypical spectra (i.e., basis vectors) over time.
This process can be seen in Figure 1 [3], where a two-measure piano passage of "Mary Had a Little Lamb" is shown alongside a spectrogram and an NMF factorization. Notice how W captures the harmonic content of the three pitches of the passage and H captures the time onsets and gains of the individual notes. Also note that N z is typically chosen manually or using a model selection procedure such as cross-validation and N f and N t are a function of the overall recording length and STFT parameters (transform length, zero-padding size, and hop size).
This leads to two related interpretations of how NMF models spectrogram data. The first interpretation is that each column of V (i.e., a short-time segment of the mixture signal) is approximated as a weighted sum of basis vectors, as shown in Figure 2 and Eq. (2):

v_t ≈ Σ_{z=1}^{N_z} h_{zt} w_z  (2)

The second interpretation is that the entire matrix V is approximated as a sum of matrix "layers," as shown in Figure 3 and Eq. (3):

V ≈ Σ_{z=1}^{N_z} w_z h_z,  (3)

where w_z is the z-th column of W and h_z is the z-th row of H, so that each layer w_z h_z is a rank-one outer product.
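The two interpretations above can be checked numerically. The following sketch (with hypothetical, illustrative dimensions) verifies that a column of V equals a weighted sum of basis vectors and that V equals the sum of rank-one layers:

```python
import numpy as np

# Illustrative sketch: V = W @ H for nonnegative matrices (exact here by construction).
rng = np.random.default_rng(0)
N_f, N_t, N_z = 6, 8, 3             # frequency bins, time frames, basis vectors
W = rng.random((N_f, N_z))          # basis vectors (columns)
H = rng.random((N_z, N_t))          # activations (rows)
V = W @ H

# Interpretation 1: each column v_t is a weighted sum of basis vectors (Eq. 2).
t = 4
v_t = sum(H[z, t] * W[:, z] for z in range(N_z))
assert np.allclose(v_t, V[:, t])

# Interpretation 2: V is a sum of rank-one "layers" w_z h_z (Eq. 3).
layers = sum(np.outer(W[:, z], H[z, :]) for z in range(N_z))
assert np.allclose(layers, V)
```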

Optimization formulation
To estimate the basis matrix W and the activation matrix H for a given input data matrix V, the NMF algorithm is formulated as an optimization problem:

minimize_{W, H} D(V | WH)  subject to W ≥ 0, H ≥ 0,  (4)

where D(V | WH) is an appropriately defined cost function between V and WH and the inequalities ≥ are element-wise. It is also common to add additional equality constraints to require the columns of W to sum to one, which we enforce.
When D(V | WH) is additively separable, the cost function can be reduced to

D(V | WH) = Σ_{f=1}^{N_f} Σ_{t=1}^{N_t} d([V]_ft | [WH]_ft),  (5)

where [·]_ft indicates its argument at row f and column t and D(V | WH) is a scalar cost function measured between V and WH.

Figure 3. NMF interpretation II. The matrix V (i.e., the mixture signal) is approximated as a sum of matrix "layers" [3].
Popular cost functions include the Euclidean distance metric, Kullback-Leibler (KL) divergence, and Itakura-Saito (IS) divergence. Both the KL and IS divergences have been found to be well suited for audio purposes. In this work, we focus on the case where d(q | p) is the generalized (non-normalized) KL divergence:

d(q | p) = q ln(q/p) − q + p,  (6)

where d(q | p) is a scalar cost function measured between q and p.
This results in the following optimization formulation:

minimize_{W, H} Σ_{f} Σ_{t} ( [V]_ft ln([V]_ft / [WH]_ft) − [V]_ft + [WH]_ft )  subject to W ≥ 0, H ≥ 0.  (7)

Given this formulation, we notice that the problem is not jointly convex in W and H, limiting our ability to find a globally optimal solution to Eq. (7). It is, however, biconvex, i.e., convex in W for a fixed value of H and convex in H for a fixed value of W, motivating the use of iterative numerical methods to estimate locally optimal values of W and H.
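The objective of Eq. (7) is easy to evaluate directly. A minimal numpy sketch of the generalized KL cost (function name and epsilon guard are our own additions for numerical safety):

```python
import numpy as np

def gen_kl(V, WH, eps=1e-12):
    """Generalized (non-normalized) KL divergence D(V | WH): the element-wise
    cost d(q|p) = q ln(q/p) - q + p, summed over all (f, t) entries."""
    V = np.asarray(V, dtype=float)
    WH = np.asarray(WH, dtype=float)
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))
```

As expected for a divergence, it is zero when WH matches V exactly and positive otherwise.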

Parameter estimation
To solve Eq. (7), we must use an iterative numerical optimization technique and hope to find a locally optimal solution. Gradient descent methods are the most common and straightforward for this purpose but typically are slow to converge. Other methods such as Newton's method, interior-point methods, conjugate gradient methods, and similar [4] can converge faster but are typically much more complicated to implement, motivating alternative approaches.
The most popular alternative that has been proposed is by Lee and Seung [1,5] and consists of a fast, simple, and efficient multiplicative gradient descent-based optimization procedure. The method works by breaking down the larger optimization problem into two subproblems and iteratively optimizes over W and then H, back and forth, given an initial feasible solution. The approach monotonically decreases the optimization objective for both the KL divergence and Euclidean cost functions and converges to a local stationary point.
The approach is justified using the machinery of majorization-minimization (MM) algorithms [6]. MM algorithms are closely related to expectation-maximization (EM) algorithms. In general, MM algorithms operate by replacing an optimization objective, which is usually difficult to optimize directly, with an auxiliary bound function that is easier to handle: for a maximization problem, a lower bound on the objective is constructed and maximized in place of the original function.
Algorithm 1 shows the complete iterative numerical optimization procedure applied to Eq. (7) with the KL divergence, where the division is element-wise, ⊗ is an element-wise multiplication, and 1 is a vector or matrix of ones with appropriately defined dimensions [5].

Algorithm 1 KL-NMF parameter estimation

input: V ∈ R_+^(N_f × N_t), N_z
initialize: W and H with nonnegative random values
repeat
    H ← H ⊗ (W^T (V / (WH))) / (W^T 1)
    W ← W ⊗ ((V / (WH)) H^T) / (1 H^T)
until convergence
return: W and H

NMF is an optimization technique that uses an EM-style algorithm in terms of matrices, whereas probabilistic latent component analysis (PLCA) is an optimization technique that uses the EM algorithm in terms of probabilities. In PLCA, we incorporate probabilities of time and frequency. In the next section, the development of a PLCA-based algorithm that incorporates time-frequency constraints is discussed.
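The multiplicative updates of Algorithm 1 are straightforward to implement. Below is a sketch (function name, iteration count, and epsilon guards are our own choices; the column normalization of W is compensated in H so the product WH is preserved):

```python
import numpy as np

def kl_nmf(V, n_z, n_iter=200, eps=1e-9, seed=0):
    """Sketch of Algorithm 1: multiplicative updates for KL-NMF [Lee & Seung].

    V   : nonnegative (N_f x N_t) matrix (e.g., a magnitude spectrogram)
    n_z : number of basis vectors
    """
    rng = np.random.default_rng(seed)
    N_f, N_t = V.shape
    W = rng.random((N_f, n_z)) + eps
    H = rng.random((n_z, N_t)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # Element-wise multiplicative updates (division is element-wise).
        H *= W.T @ (V / (W @ H + eps)) / (W.T @ ones + eps)
        W *= (V / (W @ H + eps)) @ H.T / (ones @ H.T + eps)
        # Enforce columns of W summing to one; rescale H so WH is unchanged.
        scale = W.sum(axis=0) + eps
        W /= scale
        H *= scale[:, None]
    return W, H
```

For a matrix that is exactly of nonnegative rank n_z, the reconstruction WH typically becomes very close to V after a few hundred iterations.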

A probabilistic latent variable model with time-frequency constraints
Considering this approach, we now develop a new PLCA-based algorithm to incorporate the time-frequency user-annotations. For clarity, we restate the form of the symmetric two-dimensional PLCA model we use:

p(f, t) = Σ_z p(z) p(f | z) p(t | z),  (10)

where p(f, t) is the normalized observed spectrogram, p(z) is the distribution over the latent components, p(f | z) are the spectral bases, and p(t | z) their activations in time. Compared to a modified NMF formulation, incorporating optimization constraints as a function of time, frequency, and sound source into the factorized PLCA model is particularly interesting and central to our focus.
Incorporating prior information into this model, and PLCA in general, can be done in several ways. The most commonly used methods are direct observations (i.e., setting probabilities to zero, one, etc.) or Bayesian prior probabilities on the model parameters. Direct observations do not give us enough control, so we consider incorporating Bayesian prior probabilities. For the case of Eq. (10), this would result in independently modifying the factor terms p(f | z), p(t | z), or p(z). Common prior probability distributions used for this purpose include Dirichlet priors [7], gamma priors [8], and others.
Given that we would like to incorporate the user-annotations as a function of time, frequency, and sound source, however, we notice that this is not easily accomplished using standard priors. This is because the model is factorized, and each factor is only a function of one variable and (possibly) conditioned by another, making it difficult to construct a set of prior probabilities that, when jointly applied to p(f | z), p(t | z), and/or p(z), would encourage or discourage one source or another to explain a given time-frequency point. We can see this more clearly when we consider PLCA to be the following simplified estimation problem:

X(f, t) ≈ Σ_z φ(z) φ(f, z) φ(t, z),  (11)

where X(f, t) is the observed data that we model as the product of three distinct functions or factors φ(z), φ(f, z), and φ(t, z). Note that each factor has different input arguments and different parameters that we wish to estimate via EM. Also, forget for the moment that the factors must be normalized probabilities.
Given this model, if we wish to incorporate additional information, we could independently modify:

• φ(z) to incorporate past knowledge of the variable z
• φ(f, z) to incorporate past knowledge of the variables f and z
• φ(t, z) to incorporate past knowledge of the variables t and z

This way of manipulation allows us to maintain our factorized form and can be thought of as prior-based regularization. If we would like to incorporate additional information/regularization that is a function of all three variables z, f, and t, then we must do something else. The first option would be to try to simultaneously modify all factors together to impose regularization that is a function of all three variables. This is unfortunately very difficult, both conceptually difficult to construct and practically difficult to solve algorithmically.
This motivates the use of posterior regularization (PR). PR provides us with an algorithmic mechanism via EM to incorporate constraints that are complementary to prior-based regularization. Instead of modifying the individual factors of our model as we saw before, we directly modify the posterior distribution of our model. The posterior distribution, loosely speaking, is a function of all random variables of the model. It is natively computed within each E step of EM and is required to iteratively improve the estimates of the model parameters. In this example, the posterior distribution would be akin to φ(z, f, t), which is a function of t, f, and z, as required. We now formally discuss PR below, beginning with a general discussion and concluding with the specific form of PR we employ within our approach.

Posterior regularization
The framework of posterior regularization, first introduced by Graca, Ganchev, and Taskar [9,10], is a relatively new mechanism for injecting rich, typically data-dependent constraints into latent variable models using the EM algorithm. In contrast to standard Bayesian prior-based regularization, which applies constraints to the model parameters of a latent variable model in the maximization step of EM, posterior regularization applies constraints to the posterior distribution (distribution over the latent variables, conditioned on everything else) computation in the expectation step of EM. The method has found success in many natural language processing tasks, such as statistical word alignment, part-of-speech tagging, and similar tasks that involve latent variable models.
In this case, what we do is constrain the distribution q in some way when we maximize the auxiliary bound F(q, Θ) with respect to q in the expectation step of an EM algorithm, resulting in

q* = argmax_q F(q, Θ) − Ω(q),  (12)

where Ω(q) constrains the possible space of q. Note that the only difference between Eq. (12) and our past discussion of EM is the added term Ω(q). If Ω(q) is set to zero, we recover the original formulation and easily solve the optimization by setting q = p without any computation (except computing the posterior p). Also note that, to denote the use of constraints in this context, the term "weakly supervised" was introduced by Graca [11] and is similarly adopted here.
This method of regularization is in contrast to prior-based regularization, where the modified maximization step would be

Θ* = argmax_Θ F(q, Θ) − Ω(Θ),  (13)

where Ω(Θ) constrains the model parameters Θ.

Linear grouping expectation constraints
Given the general framework of posterior regularization, we need to define a meaningful penalty Ω(q) onto which we map our annotations. We do this by mapping the annotation matrices to linear grouping constraints on the latent variable z. To do so, we first notice that Eq. (12) decouples for each time-frequency point for our specific model. Because of this, we can independently solve Eq. (12) for each time-frequency point, making the optimization much simpler. When we rewrite our E-step optimization using vector notation, we get

q*_ft = argmax_{q_ft} q_ft^T ln p_ft − q_ft^T ln q_ft  subject to q_ft^T 1 = 1, q_ft ≥ 0,  (14)

where q(z | f, t) and p(z | f, t) for a given value of f and t are written as q_ft and p_ft; without any modification, we note q is optimal when equal to p(z | f, t) as before. We then apply our linear grouping constraints independently for each time-frequency point:

q*_ft = argmax_{q_ft} q_ft^T ln p_ft − q_ft^T ln q_ft − q_ft^T λ_ft  subject to q_ft^T 1 = 1, q_ft ≥ 0,  (15)

where we define λ_ft = [Λ_ft1 … Λ_ft1  Λ_ft2 … Λ_ft2]^T ∈ R^(N_z) as the vector of user-defined penalty weights, T is a matrix transpose, ≥ is element-wise greater than or equal to, and 1 is a column vector of ones. In this case, positive-valued penalties are used to decrease the probability of a given source, while negative-valued penalties are used to increase the probability of a given source. Note that the penalty weights imposed on the group of values of z that correspond to a given source s are identical, linear with respect to the z variables, and applied in the E step of EM, hence the name "linear grouping expectation constraints." To solve the above optimization problem for a given time-frequency point, we form the Lagrangian

L(q_ft, γ) = −q_ft^T ln p_ft + q_ft^T ln q_ft + q_ft^T λ_ft + γ(1 − q_ft^T 1),  (16)

with γ being a Lagrange multiplier, take the gradient with respect to q_ft and γ:

∂L/∂q_ft = −ln p_ft + 1 + ln q_ft + λ_ft − γ1,  (17)
∂L/∂γ = 1 − q_ft^T 1,  (18)

set Eqs. (17) and (18) equal to zero, and solve for q_ft, resulting in

q*_ft = (p_ft ⊗ exp{−λ_ft}) / (p_ft^T exp{−λ_ft}),  (19)

where exp{} is an element-wise exponential function.
Notice the result is computed in closed form and does not require any iterative optimization scheme as may be required in the general posterior regularization framework [9], minimizing the computational cost when incorporating the constraints. Also note, however, that this optimization must be solved for each timefrequency point of our spectrogram data for each E step iteration of our final EM parameter estimation algorithm.
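The closed-form solution of Eq. (19) is a one-liner in practice. A minimal sketch for a single time-frequency point (the function name is our own; positive penalties suppress a source, negative ones boost it):

```python
import numpy as np

def regularized_posterior(p_ft, lam_ft):
    """Closed-form posterior-regularized E step for one time-frequency point
    (Eq. 19): reweight the unregularized posterior p(z|f,t) by exp(-lambda)
    and renormalize."""
    w = p_ft * np.exp(-lam_ft)   # element-wise p_ft (x) exp{-lambda_ft}
    return w / w.sum()           # renormalize so the entries sum to one
```

For example, penalizing the group of z values belonging to one source lowers that source's posterior mass while the result remains a valid distribution; with zero penalties, the unmodified posterior is recovered.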

Parameter estimation
Now knowing the posterior-regularized expectation step optimization, we can derive a complete EM algorithm for a posterior-regularized two-dimensional PLCA model (PR-PLCA):

p(z | f, t) ∝ Λ̄_ftz p(z) p(f | z) p(t | z),  (20)

where Λ̄ = exp{−Λ}. The entire algorithm is outlined in Algorithm 2. Notice that we continue to maintain closed-form E and M steps, allowing us to optimize further and draw connections to multiplicative nonnegative matrix factorization algorithms.

Algorithm 2 PR-PLCA with linear grouping expectation constraints

input:
X ∈ R_+^(N_f × N_t) //observed normalized data
N_z //number of basis vectors
N_s //number of sources
Λ ∈ R^(N_f × N_t × N_z) //penalties
initialize: feasible p(z), p(f | z), and p(t | z)

Precompute:
Λ̄ ← exp(−Λ)
repeat
    Expectation step: for all z, f, t do
        p(z | f, t) ∝ Λ̄_ftz p(z) p(f | z) p(t | z)
    end for
    Maximization step: reestimate p(f | z), p(t | z), and p(z) from the posterior-weighted data
until convergence
return: p(f | z), p(t | z), p(z), and p(z | f, t)

• Multiplicative Update Equations
We can rearrange the expressions in Algorithm 2 and convert to a multiplicative form following similar methodology to Smaragdis and Raj [12].
Rearranging the expectation and maximization steps in conjunction with Bayes' rule, and then rearranging further, we obtain the iterative updates of Eqs. (30) and (31), which fully specify the method. Writing Eqs. (30) and (31) in matrix notation yields the multiplicative form of the proposed method in Algorithm 3.

Algorithm 3. PR-PLCA with linear grouping expectation constraints in matrix notation
Procedure PLCA (… for all s do …

Over the past several years, research has been carried out on single-channel sound source separation methods. This problem is motivated by speech denoising and enhancement [13], music transcription [14], audio-based forensics, and music remixing. One of the most effective approaches is nonnegative matrix factorization (NMF) [5]. The user-annotations can be used to obtain the PR terms [15]. If the number of sources is large, it is difficult to identify the sources in the spectrogram; in such cases, user interaction-based constraint approaches are inefficient.
In order to avoid this problem, the proposed method introduces an automatic iterative procedure. The spectral components of speech and noise are modeled as Gamma and Rayleigh distributions, respectively [16].

Notation and basic concepts
Let the noisy speech signal x[n] be the sum of clean speech s[n] and noise d[n], and let their corresponding magnitude spectrograms be represented as

X(f, t) = S(f, t) + D(f, t),

where f represents the frequency bin and t the frame number. The observed time-frequency magnitudes are arranged in a matrix X ∈ R_+^(f × t) of nonnegative elements. Source separation algorithms based on NMF pursue the factorization of X as a product of two nonnegative matrices W = [w_1, w_2, …, w_R] ∈ R_+^(f × R), whose columns collect the basis vectors, and H = [h_1^T, h_2^T, …, h_R^T]^T ∈ R_+^(R × t), which collects their respective weights, i.e.,

X ≈ WH,

where R denotes the number of latent components.
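Building the nonnegative matrix X from a single-channel signal can be sketched as follows (a hypothetical minimal hand-rolled STFT; real code would typically use a library routine, and the FFT size and hop length here are illustrative):

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=512, hop=128):
    """Sketch: magnitude spectrogram X (frequency bins x frames) of signal x.
    Frames are windowed with a Hann window and transformed with a real FFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # |rfft| gives n_fft//2 + 1 nonnegative frequency bins per frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

The resulting X is element-wise nonnegative, as required by the NMF factorization X ≈ WH.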

Proposed regularization
There are several ways to incorporate the user-annotations into latent variable models, for instance, by using suitable regularization functions. For expectation-maximization (EM) algorithms, posterior regularization was introduced in [9,11]. This method is data dependent and imposes rich constraints on the posterior distributions of latent variable models. It has been applied to many natural language processing tasks, such as statistical word alignment and part-of-speech tagging. The main idea is to constrain the posterior distribution when computing the expectation step of the EM algorithm.
The prior distribution for the magnitude of the noise spectral components is modeled as a Rayleigh probability density function (PDF) with scale parameter σ, which is fitted to the observations by a maximum likelihood procedure [16,17], i.e.,

f(d; σ) = (d / σ²) exp(−d² / (2σ²)).  (41)

By applying the negative logarithm to both sides of (41), we get

−ln f(d; σ) = −ln d + 2 ln σ + d² / (2σ²).

Then, the regularization term for the noise is defined as this negative log-likelihood. The spectral components of speech are modeled as a Gamma probability density function [16,18] with shape parameter k > 0 and scale parameter θ > 0:

f(s; k, θ) = s^(k−1) exp(−s/θ) / (Γ(k) θ^k),  (44)

where the auxiliary variable s is the speech spectral magnitude. The regularization term for the speech samples is defined (by applying the negative logarithm to both sides of (44)) as

−ln f(s; k, θ) = −(k − 1) ln s + s/θ + ln Γ(k) + k ln θ.

Special case: when we fix k = 1, the Gamma density simplifies to the exponential density. The proposed multiplicative nonnegative matrix factorization method is summarized in Algorithm 4 [16]. In general, as in the specific case of Algorithm 4, one can only guarantee the monotonic descent of the iteration through a majorization-minimization approach [19] or the convergence to a stationary point [20].
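The two regularization terms are simply the negative log-likelihoods of the Rayleigh and Gamma PDFs above. A small sketch (function names are our own):

```python
import math
import numpy as np

def rayleigh_neglog(d, sigma):
    """Negative log-likelihood of the Rayleigh PDF (noise magnitude prior):
    -ln f(d; sigma) = -ln d + 2 ln sigma + d^2 / (2 sigma^2)."""
    d = np.asarray(d, dtype=float)
    return -(np.log(d) - 2.0 * np.log(sigma) - d**2 / (2.0 * sigma**2))

def gamma_neglog(s, k, theta):
    """Negative log-likelihood of the Gamma PDF (speech magnitude prior):
    -ln f(s; k, theta) = -(k-1) ln s + s/theta + ln Gamma(k) + k ln theta."""
    s = np.asarray(s, dtype=float)
    return -((k - 1.0) * np.log(s) - s / theta - math.lgamma(k)
             - k * np.log(theta))
```

Note that with k = 1 and θ = 1, the Gamma penalty reduces to the exponential-density case, −ln f(s) = s, matching the special case mentioned above.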
The subscript(s) in parentheses represent the corresponding columns or rows of the matrix assigned to a given source. 1 is an appropriately sized matrix of ones, and ⊗ represents element-wise multiplication.

Experimental results
The speech and noise audio samples were taken from NOIZEUS [21]. The sampling frequency is 8 kHz. The algorithm is iterated until convergence [16]. The proposed method was compared with Euclidean NMF (EUC-NMF) [5], Itakura-Saito NMF (IS-NMF) [22], posterior regularization NMF (PR-NMF) [15], Wiener filtering [23], and a constrained version of NMF (CNMF) [24]. All methods were evaluated on nonstationary noise types, namely babble noise and street noise. The performance of the proposed method was evaluated using the perceptual evaluation of speech quality (PESQ) [25] and the source-to-distortion ratio (SDR) [26]. SDR gives the average quality of separation on a dB scale and accounts for both signal distortion and noise distortion. For both PESQ and SDR, higher values indicate better performance. Tables 1 and 2 show the PESQ and SDR values of the different NMF algorithms evaluated. The experimental results show that the proposed method performs better than the other existing methods in terms of the PESQ and SDR indices.
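For reference, the SDR can be sketched as follows. This is a simplified scalar-projection variant of the BSS Eval definition (the function name and this simplification are our own; full BSS Eval also accounts for interference and artifact terms separately):

```python
import numpy as np

def sdr_db(reference, estimate):
    """Source-to-distortion ratio in dB: energy of the component of the
    estimate aligned with the reference, over the energy of the residual."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Project the estimate onto the reference to get the "target" part.
    s_target = (estimate @ reference) / (reference @ reference) * reference
    e = estimate - s_target              # everything not explained by the target
    return 10.0 * np.log10((s_target @ s_target) / (e @ e))
```

A cleaner estimate yields a higher SDR, consistent with the "higher is better" interpretation used in the tables.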

Conclusion
A novel speech enhancement method based on an iterative and regularized NMF algorithm for single-channel source separation is proposed. The clean speech and noise magnitude spectra are modeled as Gamma and Rayleigh distributions, respectively. The corresponding log-likelihood functions are used as penalties to regularize the cost function of the NMF. The basis and excitation matrices are estimated using the proposed regularized multiplicative update rules. The experiments reveal that the proposed speech enhancement method outperforms other existing benchmark methods in terms of SDR and PESQ values.

Table 1. PESQ and SDR for babble noise.

Author details
Sunnydayal Vanambathina
Department of Electronics and Communication Engineering, VIT-AP University, Amaravati, India
*Address all correspondence to: sunny.conference@gmail.com

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.