Open access peer-reviewed chapter

Speech Enhancement Using an Iterative Posterior NMF

By Sunnydayal Vanambathina

Submitted: October 6th, 2018. Reviewed: February 5th, 2019. Published: May 27th, 2019.

DOI: 10.5772/intechopen.84976



Over the years, various methods for speech enhancement have been proposed, such as spectral subtraction (SS) and minimum mean square error (MMSE) estimators. These methods require neither prior knowledge about the speech and noise signals nor a training stage beforehand, so they are highly flexible and can be implemented in various situations. However, these algorithms usually assume that the noise is stationary and are thus not good at dealing with nonstationary noise types, especially under low signal-to-noise ratio (SNR) conditions. To overcome the drawbacks of the above methods, nonnegative matrix factorization (NMF) is introduced; the NMF approach is more robust to nonstationary noise. In this chapter, we are interested in the application of NMF to speech enhancement. A speech enhancement method based on regularized NMF for nonstationary Gaussian noise is proposed. The spectral components of speech and noise are modeled by Gamma and Rayleigh distributions, respectively. We propose to adaptively estimate the sufficient statistics of these distributions to obtain a natural regularization of the NMF criterion.


  • nonnegative matrix factorization (NMF)
  • speech enhancement
  • signal-to-noise ratio (SNR)
  • expectation maximization (EM) algorithms
  • posterior regularization (PR)

1. Introduction

Over the past several decades, there has been a large research interest in the problem of single-channel sound source separation. Such work focuses on the task of separating a single mixture recording into its respective sources and is motivated by the fact that real-world sounds are inherently constructed by many individual sounds (e.g., human speakers, musical instruments, background noise, etc.). While source separation is difficult, the topic is highly motivated by many outstanding problems in audio signal processing and machine learning, including the following:

  1. Speech denoising and enhancement—the task of removing background noise (e.g., wind, babble, etc.) from recorded speech and improving speech intelligibility for human listeners and/or automatic speech recognizers

  2. Content-based analysis and processing—the task of extracting and/or processing audio based on semantic properties of the recording such as tempo, rhythm, and/or pitch

  3. Music transcription—the task of notating an audio recording into a musical representation such as a musical score, guitar tablature, or other symbolic notations

  4. Audio-based forensics—the task of examining, comparing, and evaluating audio recordings for scientific and/or legal matters

  5. Audio restoration—the task of removing imperfections such as noise, hiss, pops, and crackles from (typically old) audio recordings

  6. Music remixing and content creation—the task of creating a new musical work by manipulating the content of one or more previously existing recordings

2. Nonnegative matrix factorization

2.1 NMF model

Nonnegative matrix factorization is a process that approximates a single nonnegative matrix as the product of two nonnegative matrices. It is defined by

V ≈ WH, (1)

where V ∈ R+^(Nf×Nt) is a nonnegative input matrix, W ∈ R+^(Nf×Nz) is a matrix of basis vectors, basis functions, or dictionary elements, and H ∈ R+^(Nz×Nt) is a matrix of corresponding activations, weights, or gains. Nf is the number of rows of the input matrix, Nt is the number of columns of the input matrix, and Nz is the number of basis vectors [1].

V ∈ R+^(Nf×Nt) — original nonnegative input data matrix

  • Each column is an Nf-dimensional data sample.

  • Each row represents a data feature.

W ∈ R+^(Nf×Nz) — matrix of basis vectors, basis functions, or dictionary elements.

  • A column represents a basis vector, basis function, or dictionary element.

  • The columns are not orthonormal, but each is commonly normalized to one.

H ∈ R+^(Nz×Nt) — matrix of activations, weights, encodings, or gains.

  • A row represents the gain of a corresponding basis vector.

  • The rows are not orthonormal, but each is sometimes normalized to one.

When used for audio applications, NMF typically models spectrogram data or the magnitude of STFT data [2]. That is, we take a single-channel recording, transform it into the time-frequency domain using the STFT, take the magnitude or power V, and then approximate the result as V ≈ WH. In doing so, NMF approximates spectrogram data as a linear combination of prototypical frequencies or spectra (i.e., basis vectors) over time.
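As a sketch of this pipeline, the following NumPy-only snippet builds the nonnegative matrix V from a time-domain signal. The Hann window, frame length, and hop size are illustrative STFT parameters, and random noise stands in for a real recording:

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=256, hop=128):
    """Magnitude of the STFT, arranged as a nonnegative matrix V (Nf x Nt)."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    # rfft keeps the Nf = n_fft // 2 + 1 nonnegative-frequency bins
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

rng = np.random.default_rng(0)
x = rng.standard_normal(8000)   # 1 s of noise at 8 kHz stands in for audio
V = magnitude_spectrogram(x)
print(V.shape)                  # (129, 61): Nf x Nt
```

Each column of V is one short-time magnitude spectrum, matching the data-sample interpretation of the columns above.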

This process can be seen in Figure 1 [3], where a two-measure piano passage of “Mary Had a Little Lamb” is shown alongside a spectrogram and an NMF factorization. Notice how W captures the harmonic content of the three pitches of the passage and H captures the time onsets and gains of the individual notes. Also note that Nz is typically chosen manually or using a model selection procedure such as cross-validation, while Nf and Nt are a function of the overall recording length and STFT parameters (transform length, zero-padding size, and hop size).

Figure 1.

NMF of a piano performing “Mary Had a Little Lamb” for two measures with Nz = 3. Notice how matrix W captures the harmonic content of the three pitches of the passage and matrix H captures the time onsets and gains of the individual notes [3].

This leads to two related interpretations of how NMF models spectrogram data. The first interpretation is that the columns of V (i.e., short-time segments of the mixture signal) are approximated as a weighted sum of basis vectors, as shown in Figure 2 and Eq. (2):

v_t ≈ Σ_z w_z H_zt, (2)

where v_t is the t-th column of V and w_z is the z-th basis vector (column of W).
Figure 2.

NMF interpretation I: the columns of V (i.e., short-time segments of the mixture signal) are approximated as a weighted sum or mixture of basis vectors W [3].

The second interpretation is that the entire matrix V is approximated as a sum of matrix “layers,” as shown in Figure 3 and Eq. (3):

V ≈ Σ_z w_z h_z, (3)

where h_z is the z-th row of H, so each outer product w_z h_z forms one rank-one layer.
Figure 3.

NMF interpretation II: the matrix V (i.e., the mixture signal) is approximated as a sum of matrix “layers” [3].

The application of NMF on noisy speech can be seen in Figure 4.

Figure 4.

Applying NMF on noisy speech.

2.2 Optimization formulation

To estimate the basis matrix W and the activation matrix H for a given input data matrix V, the NMF algorithm is formulated as an optimization problem. This is written as

argmin_{W,H} D(V ‖ WH) subject to W ≥ 0, H ≥ 0, (4)

where D(V ‖ WH) is an appropriately defined cost function between V and WH and the inequalities are element-wise. It is also common to add equality constraints requiring the columns of W to sum to one, which we enforce. When D(V ‖ WH) is additively separable, the cost function reduces to

D(V ‖ WH) = Σ_{f=1}^{Nf} Σ_{t=1}^{Nt} d(V_ft ‖ [WH]_ft), (5)

where the subscript ft indicates the element at row f and column t and d(· ‖ ·) is a scalar cost function measured between corresponding elements of V and WH.

Popular cost functions include the Euclidean distance metric, Kullback-Leibler (KL) divergence, and Itakura-Saito (IS) divergence. Both the KL and IS divergences have been found to be well suited for audio purposes. In this work, we focus on the case where d(q ‖ p) is the generalized (non-normalized) KL divergence:

d_KL(q ‖ p) = q ln(q/p) − q + p, (6)

where d(q ‖ p) is a scalar cost function measured between corresponding elements q = V_ft and p = [WH]_ft.
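For reference, the generalized KL divergence of Eq. (6), summed over all elements as in Eq. (5), can be written as a small NumPy helper; the epsilon guard against a log of zero is an implementation detail, not part of the definition:

```python
import numpy as np

def gkl_divergence(Q, P, eps=1e-12):
    """Generalized (non-normalized) KL divergence of Eq. (6), summed element-wise:
    D(Q || P) = sum_ft [ Q ln(Q / P) - Q + P ]."""
    Q = np.asarray(Q, dtype=float)
    P = np.asarray(P, dtype=float)
    return float(np.sum(Q * np.log((Q + eps) / (P + eps)) - Q + P))
```

Calling `gkl_divergence(V, W @ H)` evaluates the separable cost for this choice of d; it is zero when the two arguments match element-wise and positive otherwise.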

This results in the following optimization formulation:

argmin_{W,H} Σ_{f=1}^{Nf} Σ_{t=1}^{Nt} −V_ft ln[WH]_ft + [WH]_ft + const (7)

subject to

W ≥ 0, H ≥ 0, (8)

where the constant collects the terms of Eq. (6) that do not depend on W or H.
Given this formulation, we notice that the problem is not convex in W and H, limiting our ability to find a globally optimal solution to Eq. (7). It is, however, biconvex or independently convex in W for a fixed value of H and convex in H for a fixed value of W, motivating the use of iterative numerical methods to estimate locally optimal values of W and H.

2.3 Parameter estimation

To solve Eq. (7), we must use an iterative numerical optimization technique and hope to find a locally optimal solution. Gradient descent methods are the most common and straightforward for this purpose but typically are slow to converge. Other methods such as Newton’s method, interior-point methods, conjugate gradient methods, and similar [4] can converge faster but are typically much more complicated to implement, motivating alternative approaches.

The most popular alternative that has been proposed is by Lee and Seung [1, 5] and consists of a fast, simple, and efficient multiplicative gradient descent-based optimization procedure. The method works by breaking down the larger optimization problem into two subproblems and iteratively optimizes over W and then H, back and forth, given an initial feasible solution. The approach monotonically decreases the optimization objective for both the KL divergence and Euclidean cost functions and converges to a local stationary point.

The approach is justified using the machinery of majorization-minimization (MM) algorithms [6]. MM algorithms are closely related to expectation maximization (EM) algorithms. In general, MM algorithms operate by approximating an optimization objective with a lower bound auxiliary function. The lower bound is then maximized instead of the original function, which is usually more difficult to optimize.

Algorithm 1 shows the complete iterative numerical optimization procedure applied to Eq. (7) with the KL divergence, where the division is element-wise, ⊙ is an element-wise multiplication, and 1 is a vector or matrix of ones with appropriately defined dimensions [5].

Algorithm 1 KL-NMF parameter estimation

Procedure KL-NMF (V ∈ R+^(Nf×Nt) // input data matrix.

Nz // number of basis vectors.)

Initialize: W ∈ R+^(Nf×Nz), H ∈ R+^(Nz×Nt).

repeat

Optimize over W:

W ← W ⊙ ((V / (WH)) Hᵀ) / (1 Hᵀ)

Optimize over H:

H ← H ⊙ (Wᵀ (V / (WH))) / (Wᵀ 1)
until convergence

return: W and H
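A minimal NumPy sketch of Algorithm 1 follows; the random initialization, iteration count, and the small epsilon guarding against division by zero are implementation choices, not part of the algorithm statement:

```python
import numpy as np

def kl_nmf(V, Nz, n_iter=200, seed=0, eps=1e-12):
    """Lee-Seung multiplicative updates minimizing the generalized KL divergence."""
    rng = np.random.default_rng(seed)
    Nf, Nt = V.shape
    W = rng.random((Nf, Nz)) + eps
    H = rng.random((Nz, Nt)) + eps
    ones = np.ones((Nf, Nt))
    for _ in range(n_iter):
        # Optimize over W, then renormalize its columns to sum to one
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
        scale = W.sum(axis=0, keepdims=True)
        W /= scale
        H *= scale.T                     # rescale H so the product WH is unchanged
        # Optimize over H
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
    return W, H

# Toy usage: factorize a random rank-2 nonnegative matrix
rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 30))
W, H = kl_nmf(V, Nz=2)
```

Each pair of updates monotonically decreases the generalized KL divergence between V and WH, so the approximation improves as the iterations proceed.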

NMF carries out this optimization in terms of matrices via an EM-like algorithm, whereas probabilistic latent component analysis (PLCA) carries out an equivalent EM-based optimization in terms of probabilities, modeling time and frequency as random variables. In the next section, the development of a PLCA-based algorithm that incorporates time-frequency constraints is discussed.

3. A probabilistic latent variable model with time-frequency constraints

Considering this approach, we now develop a new PLCA-based algorithm to incorporate the time-frequency user-annotations. For clarity, we restate the form of the symmetric two-dimensional PLCA model we use:

P(f, t) = Σ_z P(z) P(f|z) P(t|z), (10)

where P(f, t) models the normalized spectrogram, P(z) is the distribution over latent components, and P(f|z) and P(t|z) are the spectral and temporal factors.
Compared to a modified NMF formulation, incorporating optimization constraints as a function of time, frequency, and sound source into the factorized PLCA model is particularly interesting and central to our focus.

Incorporating prior information into this model, and PLCA in general, can be done in several ways. The most commonly used methods are by direct observations (i.e., setting probabilities to zero, one, etc.) or by incorporating Bayesian prior probabilities on model parameters. Direct observations do not give us enough control, so we consider incorporating Bayesian prior probabilities. For the case of Eq. (10), this would result in independently modifying the factor terms P(f|z), P(t|z), or P(z). Common prior probability distributions used for this purpose include Dirichlet priors [7], gamma priors [8], and others.

Given that we would like to incorporate the user-annotations as a function of time, frequency, and sound source, however, we notice that this is not easily accomplished using standard priors. This is because the model is factorized, and each factor is only a function of one variable and (possibly) conditioned by another, making it difficult to construct a set of prior probabilities that, when jointly applied to P(f|z), P(t|z), and/or P(z), would encourage or discourage one source or another to explain a given time-frequency point. We can see this more clearly when we consider PLCA to be the following simplified estimation problem:

X_ft ≈ Σ_z φ(z) φ(f|z) φ(t|z), (11)

where X_ft is the observed data that we model as the product of three distinct functions or factors φ(z), φ(f|z), and φ(t|z). Note, each factor has different input arguments and each factor has different parameters that we wish to estimate via EM. Also, forget for the moment that the factors must be normalized probabilities.

Given this model, if we wish to incorporate additional information, we could independently modify:

  • φ(z) to incorporate past knowledge of the variable z

  • φ(f|z) to incorporate past knowledge of the variables f and z

  • φ(t|z) to incorporate past knowledge of the variables t and z

This way of manipulation allows us to maintain our factorized form and can be thought of as prior-based regularization. If we would like to incorporate additional information/regularization that is a function of all three variables z, f, and t, then we must do something else. The first option would be to try to simultaneously modify all factors together to impose regularization that is a function of all three variables. This is unfortunately very difficult—both conceptually difficult to construct and practically difficult to algorithmically solve.

This motivates the use of posterior regularization (PR). PR provides us with an algorithmic mechanism via EM to incorporate constraints that are complementary to prior-based regularization. Instead of modifying the individual factors of our model as we saw before, we directly modify the posterior distribution of our model. The posterior distribution of our model, very loosely speaking, is a function of all random variables of our model. It is natively computed within each E step of EM and is required to iteratively improve the estimates of our model parameters. In this example, the posterior distribution would be akin to φ(z|f, t), which is a function of t, f, and z, as required. We now formally discuss PR below, beginning with a general discussion and concluding with the specific form of PR we employ within our approach.

3.1 Posterior regularization

The framework of posterior regularization, first introduced by Graca, Ganchev, and Taskar [9, 10], is a relatively new mechanism for injecting rich, typically data-dependent constraints into latent variable models using the EM algorithm. In contrast to standard Bayesian prior-based regularization, which applies constraints to the model parameters of a latent variable model in the maximization step of EM, posterior regularization applies constraints to the posterior distribution (distribution over the latent variables, conditioned on everything else) computation in the expectation step of EM. The method has found success in many natural language processing tasks, such as statistical word alignment, part-of-speech tagging, and similar tasks that involve latent variable models.

In this case, what we do is constrain the distribution q in some way when we maximize the auxiliary bound F(q, Θ) with respect to q in the expectation step of an EM algorithm, resulting in

q ← argmax_q F(q, Θ) − Ω(q), (12)

where Ω(q) constrains the possible space of q.

Note, the only difference between Eq. (12) and our past discussion on EM is the added term Ω(q). If Ω(q) is set to zero, we get back the original formulation and easily solve the optimization by setting q = p without any computation (except computing the posterior p). Also note, to denote the use of constraints in this context, the term “weakly supervised” was introduced by Graca [11] and is similarly adopted here.

This method of regularization is in contrast to prior-based regularization, where the modified maximization step would be

Θ ← argmax_Θ F(q, Θ) − Ω(Θ), (13)

where Ω(Θ) constrains the model parameters Θ.

3.2 Linear grouping expectation constraints

Given the general framework of posterior regularization, we need to define a meaningful penalty Ω(q) onto which we map our annotations. We do this by mapping the annotation matrices to linear grouping constraints on the latent variable z. To do so, we first notice that Eq. (12) decouples for each time-frequency point for our specific model. Because of this, we can independently solve Eq. (12) for each time-frequency point, making the optimization much simpler. When we rewrite our E-step optimization using vector notation, we get

q_ft = argmin_q D_KL(q ‖ p_ft) subject to qᵀ 1 = 1, q ≥ 0, (14)

where q(z|f, t) and p(z|f, t) for a given value of f and t are written as vectors q_ft and p_ft; without any modification, we note q_ft is optimal when equal to p_ft, as before.

We then apply our linear grouping constraints independently for each time-frequency point:

q_ft = argmin_q D_KL(q ‖ p_ft) + λ_ftᵀ q subject to qᵀ 1 = 1, q ≥ 0, (15)

where we define λ_ft = [Λ_ft^(1), …, Λ_ft^(1), Λ_ft^(2), …, Λ_ft^(2)]ᵀ ∈ R^(Nz) as the vector of user-defined penalty weights, ᵀ is a matrix transpose, ≥ is element-wise greater than or equal to, and 1 is a column vector of ones. In this case, positive-valued penalties are used to decrease the probability of a given source, while negative-valued coefficients are used to increase the probability of a given source. Note the penalty weights imposed on the group of values of z that correspond to a given source s are identical, linear with respect to the z variables, and applied in the E step of EM, hence the name “linear grouping expectation constraints.”

To solve the above optimization problem for a given time-frequency point, we form the Lagrangian

L(q_ft, γ) = D_KL(q_ft ‖ p_ft) + λ_ftᵀ q_ft + γ (q_ftᵀ 1 − 1), (16)

with γ being a Lagrange multiplier, take the gradients with respect to q_ft and γ,

∇_q L = ln q_ft − ln p_ft + 1 + λ_ft + γ 1, (17)

∇_γ L = q_ftᵀ 1 − 1, (18)

set Eqs. (17) and (18) equal to zero, and solve for q_ft, resulting in

q_ft = (p_ft ⊙ exp{−λ_ft}) / (p_ftᵀ exp{−λ_ft}), (19)

where exp{·} is an element-wise exponential function.

Notice the result is computed in closed form and does not require any iterative optimization scheme as may be required in the general posterior regularization framework [9], minimizing the computational cost when incorporating the constraints. Also note, however, that this optimization must be solved for each time-frequency point of our spectrogram data for each E step iteration of our final EM parameter estimation algorithm.
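At a single time-frequency point, this closed-form solution amounts to an element-wise exponential reweighting of the posterior followed by renormalization. A small NumPy illustration, where the posterior and penalty values are made-up numbers:

```python
import numpy as np

# Posterior p(z|f,t) over Nz = 4 latent components at one time-frequency point
p_ft = np.array([0.4, 0.3, 0.2, 0.1])
# Hypothetical penalties: positive weights suppress source 1 (first two z's),
# negative weights boost source 2 (last two z's)
lam_ft = np.array([2.0, 2.0, -1.0, -1.0])

q_ft = p_ft * np.exp(-lam_ft)   # element-wise reweighting
q_ft /= q_ft.sum()              # renormalize to a valid distribution
```

With these penalties, probability mass moves from the first two latent components (the penalized source) to the last two (the favored source), while q_ft remains a valid distribution over z.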

3.3 Parameter estimation

Now knowing the posterior-regularized expectation step optimization, we can derive a complete EM algorithm for a posterior-regularized two-dimensional PLCA model (PR-PLCA):


where Λ̄ = exp{Λ}. The entire algorithm is outlined in Algorithm 2. Notice we continue to maintain closed-form E and M steps, allowing us to optimize further and draw connections to multiplicative nonnegative matrix factorization algorithms.

Algorithm 2 PR-PLCA with linear grouping expectation constraints

Procedure PLCA (

V ∈ R+^(Nf×Nt) // observed normalized data

Nz//number of basis vectors

Ns//number of sources



Initialize: feasible P(z), P(f|z), and P(t|z)



Expectation step

for all z, f, t do


end for

Maximization step

for all z, f, t do


end for

until convergence

return: P(f|z), P(t|z), P(z), and P(z|f,t)

  • Multiplicative Update Equations

We can rearrange the expressions in Algorithm 2 and convert them to a multiplicative form, following a methodology similar to that of Smaragdis and Raj [12].

Rearranging the expectation and maximization steps, in conjunction with Bayes’ rule, and


we get


Rearranging further, we get


which fully specifies the iterative updates. By putting Eqs. (30) and (31) in matrix notation, we specify the multiplicative form of the proposed method in Algorithm 3.

Algorithm 3. PR-PLCA with linear grouping expectation constraints in matrix notation

Procedure PLCA (

V ∈ R+^(Nf×Nt) // observed normalized data

Nz//number of basis vectors

Ns//number of sources



Initialize: W ∈ R+^(Nf×Nz), H ∈ R+^(Nz×Nt)


For all s do


End for



For all s do


End for

until convergence

return: W and H

4. An iterative posterior NMF method for speech enhancement in the presence of additive Gaussian noise (proposed algorithm)

Over the past several years, research has been carried out on single-channel sound source separation methods. This problem is motivated by speech denoising, speech enhancement [13], music transcription [14], audio-based forensics, and music remixing. One of the most effective approaches is nonnegative matrix factorization (NMF) [5]. User-annotations can be used to obtain the PR terms [15]. However, when the number of sources is large, it is difficult to identify the sources in the spectrogram; in such cases, user interaction-based constraint approaches are inefficient.

To avoid this problem, the proposed method introduces an automatic iterative procedure. The spectral components of speech and noise are modeled by Gamma and Rayleigh distributions, respectively [16].

4.1 Notation and basic concepts

Let the noisy speech signal x[n] be the sum of clean speech s[n] and noise d[n], and let their corresponding magnitude spectrograms be represented as

X_ft ≈ S_ft + D_ft,

where f represents the frequency bin and t the frame number. The observed magnitudes in time-frequency are arranged in a matrix X ∈ R+^(f×t) of nonnegative elements. The source separation algorithms based on NMF pursue the factorization of X as a product of two nonnegative matrices: W = [w_1 w_2 … w_R] ∈ R+^(f×R), whose columns collect the basis vectors, and H = [h_1ᵀ h_2ᵀ … h_Rᵀ]ᵀ ∈ R+^(R×t), which collects their respective weights, i.e.,

X ≈ WH,
where R denotes the number of latent components.

4.2 Proposed regularization

There are several ways to incorporate the user-annotations into latent variable models, for instance, by using suitable regularization functions. For expectation maximization (EM) algorithms, posterior regularization was introduced in [9, 11]. This data-dependent method imposes rich constraints on the posterior distributions of latent variable models, and it has been applied to many natural language processing tasks such as statistical word alignment and part-of-speech tagging. The main idea is to constrain the posterior distribution when computing the expectation step of the EM algorithm.

The prior distribution for the magnitude of the noise spectral components is modeled as a Rayleigh probability density function (PDF) with scale parameter σ, which is fitted to the observations by a maximum likelihood procedure [16, 17], i.e.,

f(x; σ) = (x/σ²) exp(−x²/(2σ²)), x ≥ 0.
The above equation can be written as


By applying negative logarithm on both sides of (41), we will get


Then, the regularization term for the noise is defined as


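As a side note, the maximum-likelihood fit of the Rayleigh scale parameter mentioned above has a simple closed form, σ̂ = sqrt(Σ_i x_i² / (2N)). A NumPy sketch on synthetic data, where the sample size and true scale are illustrative choices:

```python
import numpy as np

def rayleigh_ml_scale(x):
    """Closed-form maximum-likelihood Rayleigh scale: sigma^2 = sum(x_i^2) / (2N)."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean(x ** 2) / 2.0)

# Sanity check on synthetic Rayleigh draws with true scale sigma = 2
rng = np.random.default_rng(0)
x = rng.rayleigh(scale=2.0, size=100_000)
sigma_hat = rayleigh_ml_scale(x)   # close to the true scale
```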
The spectral components of speech are modeled by a Gamma probability density function [16, 18]

f(x; k, θ) = x^(k−1) exp(−x/θ) / (Γ(k) θ^k), x > 0, (44)

with shape parameter k > 0 and scale parameter θ > 0:

θ = (1/(kN)) Σ_{i=1}^N x_i and k ≈ (3 − s + √((s − 3)² + 24s)) / (12s), (45)

where the auxiliary variable s is defined as s = ln((1/N) Σ_{i=1}^N x_i) − (1/N) Σ_{i=1}^N ln x_i.
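These estimates can be computed directly from data. A NumPy sketch, assuming the samples are i.i.d. Gamma draws (the true shape and scale below are illustrative):

```python
import numpy as np

def gamma_ml_estimates(x):
    """Approximate ML estimates of the Gamma shape k and scale theta:
    s = ln(mean(x)) - mean(ln(x)),
    k ~= (3 - s + sqrt((s - 3)^2 + 24 s)) / (12 s),
    theta = mean(x) / k."""
    x = np.asarray(x, dtype=float)
    s = np.log(x.mean()) - np.mean(np.log(x))
    k = (3.0 - s + np.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    return k, x.mean() / k

# Sanity check on synthetic Gamma draws with shape 2.5 and scale 1.5
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.5, scale=1.5, size=200_000)
k, theta = gamma_ml_estimates(x)   # close to the true parameters
```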

The regularization term for the speech samples is defined by applying the negative logarithm to both sides of (44):

−ln f(x; k, θ) = (1 − k) ln x + x/θ + ln Γ(k) + k ln θ.
Special case: when we fix k = 1, the Gamma density simplifies to the exponential density,

f(x; θ) = (1/θ) exp(−x/θ),

and the corresponding regularization term reduces to −ln f(x; θ) = x/θ + ln θ.
The proposed multiplicative nonnegative matrix factorization method is summarized in Algorithm 4 [16]. In general, as in the specific case of Algorithm 4, one can only guarantee the monotonic descent of the iteration through a majorization-minimization approach [19] or the convergence to a stationary point [20].

The subscript (s) in parentheses denotes the columns or rows of a matrix assigned to a given source, 1 is an appropriately sized matrix of ones, and ⊙ represents element-wise multiplication.

Algorithm 4: Gamma-Rayleigh regularized PLCA method (GR-NMF)


X ∈ R+^(f×t) % Observed normalized data

Λs ∈ R+^(f×t), s ∈ {1, …, NS} % Λs: penalties, NS: number of sources



For all s do

Λs = (1 − μ) Λs_OLD + μ Λs_NEW % Update penalties using LMS (50)

End for


For all s do


End for


For all s do

Ms ← (Ws Hs) / (WH) % Compute filter (57)
X̂s ← Ms ⊙ X % Filter mixture (58)
xs ← ISTFT(X̂s, X; P) % P: STFT parameters (59)

if update k % Gamma model


else % Exponential model

k = 1,


Λs_NEW = exp{Λs_OLD} % Λs_OLD represents both Λ_SP and Λ_N (63)

End for

Until Convergence

Return: Time domain signals xs

5. Experimental results

The speech and noise audio samples were taken from NOIZEUS [21]. The sampling frequency is 8 kHz. The algorithm is iterated until convergence [16]. The proposed method was compared with Euclidean NMF (EUC-NMF) [5], Itakura-Saito NMF (IS-NMF) [22], posterior regularization NMF (PR-NMF) [15], Wiener filtering [23], and a constrained version of NMF (CNMF) [24]. These methods were evaluated on nonstationary noise types, namely babble noise and street noise. The performance of the proposed method was evaluated using the perceptual evaluation of speech quality (PESQ) [25] and the source-to-distortion ratio (SDR) [26]. SDR gives the average quality of separation on a dB scale and accounts for signal distortion as well as noise distortion. For both PESQ and SDR, higher values indicate better performance. Tables 1 and 2 show the PESQ and SDR values of the different NMF algorithms evaluated. The experimental results show that the proposed method performs better than the other existing methods in terms of the PESQ and SDR indices.

Table 1.

PESQ and SDR for babble noise.

Table 2.

PESQ and SDR for street noise.

6. Conclusion

A novel speech enhancement method based on an iterative and regularized NMF algorithm for single-channel source separation is proposed. The clean speech and noise magnitude spectra are modeled as Gamma and Rayleigh distributions, respectively. The corresponding log-likelihood functions are used as penalties to regularize the cost function of the NMF. The basis and excitation matrices are estimated using the proposed regularized multiplicative update rules. The experiments reveal that the proposed speech enhancement method outperforms other existing benchmark methods in terms of SDR and PESQ values.

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Sunnydayal Vanambathina (May 27th, 2019). Speech Enhancement Using an Iterative Posterior NMF. In: Nawaz Mohamudally, Manish Putteeraj and Seyyed Abed Hosseini (Eds.), New Frontiers in Brain-Computer Interfaces. IntechOpen. DOI: 10.5772/intechopen.84976.

