Open access peer-reviewed chapter

Speech Enhancement Using an Iterative Posterior NMF

Written By

Sunnydayal Vanambathina

Submitted: October 6th, 2018 Reviewed: February 5th, 2019 Published: May 27th, 2019

DOI: 10.5772/intechopen.84976

From the Edited Volume

New Frontiers in Brain

Edited by Nawaz Mohamudally, Manish Putteeraj and Seyyed Abed Hosseini

Abstract

Over the years, a variety of methods for speech enhancement have been proposed, such as spectral subtraction (SS) and minimum mean square error (MMSE) estimators. These methods require neither prior knowledge about the speech and noise signals nor a training stage, so they are highly flexible and can be deployed in many situations. However, they usually assume that the noise is stationary and therefore handle nonstationary noise poorly, especially at low signal-to-noise ratio (SNR). Nonnegative matrix factorization (NMF) was introduced to overcome these drawbacks, since the NMF approach is more robust to nonstationary noise. This chapter focuses on the application of NMF to speech enhancement. A speech enhancement method based on regularized NMF for nonstationary Gaussian noise is proposed. The spectral components of speech and noise are modeled as Gamma and Rayleigh distributions, respectively. We propose to adaptively estimate the sufficient statistics of these distributions to obtain a natural regularization of the NMF criterion.

Keywords

  • nonnegative matrix factorization (NMF)
  • speech enhancement
  • signal-to-noise ratio (SNR)
  • expectation maximization (EM) algorithms
  • posterior regularization (PR)

1. Introduction

Over the past several decades, there has been considerable research interest in the problem of single-channel sound source separation. Such work focuses on the task of separating a single mixture recording into its respective sources and is motivated by the fact that real-world sounds are inherently composed of many individual sounds (e.g., human speakers, musical instruments, and background noise). While source separation is difficult, the topic is motivated by many outstanding problems in audio signal processing and machine learning, including the following:

  1. Speech denoising and enhancement—the task of removing background noise (e.g., wind, babble, etc.) from recorded speech and improving speech intelligibility for human listeners and/or automatic speech recognizers

  2. Content-based analysis and processing—the task of extracting and/or processing audio based on semantic properties of the recording such as tempo, rhythm, and/or pitch

  3. Music transcription—the task of notating an audio recording into a musical representation such as a musical score, guitar tablature, or other symbolic notations

  4. Audio-based forensics—the task of examining, comparing, and evaluating audio recordings for scientific and/or legal matters

  5. Audio restoration—the task of removing imperfections such as noise, hiss, pops, and crackles from (typically old) audio recordings

  6. Music remixing and content creation—the task of creating a new musical work by manipulating the content of one or more previously existing recordings

2. Nonnegative matrix factorization

2.1 NMF model

Nonnegative matrix factorization is a process that approximates a single nonnegative matrix as the product of two nonnegative matrices. It is defined by

$$V \approx WH \tag{1}$$

Here $V \in \mathbb{R}_+^{N_f \times N_t}$ is a nonnegative input matrix; $W \in \mathbb{R}_+^{N_f \times N_z}$ is a matrix of basis vectors, basis functions, or dictionary elements; and $H \in \mathbb{R}_+^{N_z \times N_t}$ is a matrix of corresponding activations, weights, or gains. $N_f$ is the number of rows of the input matrix, $N_t$ is the number of columns of the input matrix, and $N_z$ is the number of basis vectors [1].

$V \in \mathbb{R}_+^{N_f \times N_t}$: original nonnegative input data matrix

  • Each column is an $N_f$-dimensional data sample.

  • Each row represents a data feature.

$W \in \mathbb{R}_+^{N_f \times N_z}$: matrix of basis vectors, basis functions, or dictionary elements

  • A column represents a basis vector, basis function, or dictionary element.

  • The columns are not orthonormal but are commonly normalized to one.

$H \in \mathbb{R}_+^{N_z \times N_t}$: matrix of activations, weights, encodings, or gains

  • A row represents the gain of a corresponding basis vector.

  • The rows are not orthonormal but are sometimes normalized to one.

When used for audio applications, NMF is typically used to model spectrogram data or the magnitude of short-time Fourier transform (STFT) data [2]. That is, we take a single-channel recording, transform it into the time-frequency domain using the STFT, take the magnitude or power $V$, and then approximate the result as $V \approx WH$. In doing so, NMF approximates spectrogram data as a linear combination of prototypical frequencies or spectra (i.e., basis vectors) over time.

This process can be seen in Figure 1 [3], where a two-measure piano passage of “Mary Had a Little Lamb” is shown alongside a spectrogram and an NMF factorization. Notice how $W$ captures the harmonic content of the three pitches of the passage and $H$ captures the time onsets and gains of the individual notes. Also note that $N_z$ is typically chosen manually or using a model selection procedure such as cross-validation, while $N_f$ and $N_t$ are functions of the overall recording length and the STFT parameters (transform length, zero-padding size, and hop size).

Figure 1.

NMF of a piano performing “Mary had a little lamb” for two measures with $N_z = 3$. Notice how matrix $W$ captures the harmonic content of the three pitches of the passage and matrix $H$ captures the time onsets and gains of the individual notes [3].
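As a rough illustration of how $N_f$ and $N_t$ follow from the recording length and the STFT parameters, the sketch below computes a magnitude spectrogram with NumPy. The function name `magnitude_spectrogram`, the Hann window, the transform length of 256, the hop size of 128, and the synthetic 440 Hz tone are all assumptions made for this example; the chapter does not specify its STFT settings, and no zero-padding is applied here.

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=256, hop=128):
    """Return |STFT| with shape (N_f, N_t) = (n_fft // 2 + 1, number of frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop        # frames that fit without padding
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )                                             # shape (n_fft, n_frames)
    return np.abs(np.fft.rfft(frames, axis=0))    # keep nonnegative frequencies only

fs = 8000                                         # assumed sampling rate in Hz
t = np.arange(fs) / fs                            # one second of samples
x = np.sin(2 * np.pi * 440.0 * t)                 # synthetic test tone (illustrative)
V = magnitude_spectrogram(x)                      # nonnegative matrix to be factorized
print(V.shape)                                    # (N_f, N_t)
```

The resulting `V` is nonnegative by construction, which is what makes it a valid input for the factorization $V \approx WH$.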

This leads to two related interpretations of how NMF models spectrogram data. The first interpretation is that the columns of V (i.e., short-time segments of the mixture signal) are approximated as a weighted sum of basis vectors as shown in Figure 2 and Eq. (2):

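$$V = \begin{bmatrix} v_1 & v_2 & v_3 & \cdots & v_{N_t} \end{bmatrix} \approx \begin{bmatrix} \sum_{j=1}^{N_z} H_{j1}\, w_j & \sum_{j=1}^{N_z} H_{j2}\, w_j & \cdots & \sum_{j=1}^{N_z} H_{jN_t}\, w_j \end{bmatrix} \tag{2}$$

where $v_t$ denotes the $t$-th column of $V$ and $w_j$ denotes the $j$-th column of $W$.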

Figure 2.

NMF interpretation I. The columns of $V$ (i.e., short-time segments of the mixture signal) are approximated as a weighted sum or mixture of the basis vectors in $W$ [3].

The second interpretation is that the entire matrix V is approximated as a sum of matrix “layers,” as shown in Figure 3 and Eq. (3).

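$$V = \begin{bmatrix} v_1 & v_2 & v_3 & \cdots & v_{N_t} \end{bmatrix} \approx \begin{bmatrix} w_1 & w_2 & w_3 & \cdots & w_{N_z} \end{bmatrix} \begin{bmatrix} h_1^T \\ h_2^T \\ h_3^T \\ \vdots \\ h_{N_z}^T \end{bmatrix}$$

$$V \approx w_1 h_1^T + w_2 h_2^T + w_3 h_3^T + \cdots + w_{N_z} h_{N_z}^T \tag{3}$$

where $w_z$ is the $z$-th column of $W$ and $h_z^T$ is the $z$-th row of $H$.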

Figure 3.

NMF interpretation II. The matrix V (i.e., the mixture signal) is approximated as a sum of matrix “layers” [3].
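The two interpretations describe the same product $WH$, which can be checked numerically. The following is a minimal sketch with small random matrices; the sizes and variable names are illustrative assumptions and do not come from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_f, n_t, n_z = 6, 5, 3              # illustrative sizes only
W = rng.random((n_f, n_z))           # basis vectors stored as columns
H = rng.random((n_z, n_t))           # activations stored as rows

V_hat = W @ H                        # the NMF approximation of V

# Interpretation I: column t of WH is a weighted sum of the basis vectors.
columns = np.stack(
    [sum(H[j, t] * W[:, j] for j in range(n_z)) for t in range(n_t)], axis=1
)

# Interpretation II: WH is a sum of N_z rank-one "layers" w_z h_z^T.
layers = sum(np.outer(W[:, z], H[z, :]) for z in range(n_z))
```

Both `columns` and `layers` reproduce `V_hat` up to floating-point rounding, so the choice between the two views is a matter of convenience rather than of modeling.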

The application of NMF on noisy speech can be seen in Figure 4.

Figure 4.

Applying NMF on noisy speech.

2.2 Optimization formulation

To estimate the basis matrix $W$ and the activation matrix $H$ for a given input data matrix $V$, NMF is formulated as an optimization problem. This is written as:

$$\underset{W,\,H}{\arg\min}\ D(V \mid WH)$$

$$\text{subject to } W \geq 0,\ H \geq 0 \tag{4}$$

where $D(V \mid WH)$ is an appropriately defined cost function between $V$ and $WH$ and the inequalities are element-wise. It is also common to add equality constraints that require the columns of $W$ to sum to one, which we enforce. When $D(V \mid WH)$ is additively separable, the cost function can be reduced to

$$D(V \mid WH) = \sum_{f=1}^{N_f} \sum_{t=1}^{N_t} d\left([V]_{ft} \mid [WH]_{ft}\right) \tag{5}$$

where $[\cdot]_{ft}$ indicates the element of its argument at row $f$ and column $t$ and $d(q \mid p)$ is a scalar cost function measured between the scalars $q$ and $p$.

Popular cost functions include the Euclidean distance metric, Kullback-Leibler (KL) divergence, and Itakura-Saito (IS) divergence. Both the KL and IS divergences have been found to be well suited for audio purposes. In this work, we focus on the case where $d(q \mid p)$ is the generalized (non-normalized) KL divergence:

$$d_{KL}(q \mid p) = q \ln \frac{q}{p} - q + p \tag{6}$$

where $q$ and $p$ stand for the entries $[V]_{ft}$ and $[WH]_{ft}$ of the data matrix and of its approximation, respectively.

This results in the following optimization formulation:

$$\underset{W,\,H}{\arg\min}\ \sum_{f=1}^{N_f} \sum_{t=1}^{N_t} \left( -V_{ft} \ln [WH]_{ft} + [WH]_{ft} \right) + \text{const}$$

$$\text{subject to } W \geq 0,\ H \geq 0 \tag{7}$$

Given this formulation, we notice that the problem is not convex in W and H, limiting our ability to find a globally optimal solution to Eq. (7). It is, however, biconvex or independently convex in W for a fixed value of H and convex in H for a fixed value of W, motivating the use of iterative numerical methods to estimate locally optimal values of W and H.
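The generalized KL cost of Eqs. (5) and (6) can be sketched in a few lines of NumPy. The function name `generalized_kl` and the small constant `eps` that guards the logarithm are assumptions of this example, not part of the chapter.

```python
import numpy as np

def generalized_kl(V, V_hat, eps=1e-12):
    """Sum over all entries of q*ln(q/p) - q + p, with q = V and p = V_hat."""
    V = np.asarray(V, dtype=float)
    V_hat = np.asarray(V_hat, dtype=float)
    # eps keeps the logarithm finite; entries with q = 0 contribute only p.
    log_ratio = np.log((V + eps) / (V_hat + eps))
    return float(np.sum(V * log_ratio - V + V_hat))

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.5, 1.0], [2.0, 5.0]])
print(generalized_kl(A, A), generalized_kl(A, B))
```

The cost is zero when the approximation equals the data and positive otherwise, and it is not symmetric in its two arguments.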

2.3 Parameter estimation

To solve Eq. (7), we must use an iterative numerical optimization technique and hope to find a locally optimal solution. Gradient descent methods are the most common and straightforward for this purpose but typically are slow to converge. Other methods such as Newton’s method, interior-point methods, conjugate gradient methods, and similar [4] can converge faster but are typically much more complicated to implement, motivating alternative approaches.

The most popular alternative that has been proposed is by Lee and Seung [1, 5] and consists of a fast, simple, and efficient multiplicative gradient descent-based optimization procedure. The method works by breaking down the larger optimization problem into two subproblems and iteratively optimizes over W and then H, back and forth, given an initial feasible solution. The approach monotonically decreases the optimization objective for both the KL divergence and Euclidean cost functions and converges to a local stationary point.

The approach is justified using the machinery of majorization-minimization (MM) algorithms [6]. MM algorithms are closely related to expectation maximization (EM) algorithms. In general, MM algorithms operate by approximating an optimization objective with an auxiliary bound function that is easier to optimize than the original: a lower bound that is maximized in a maximization problem or, as for the minimization in Eq. (7), an upper bound that is minimized.

Algorithm 1 shows the complete iterative numerical optimization procedure applied to Eq. (7) with the KL divergence, where the division is element-wise, $\odot$ is an element-wise multiplication, and $\mathbf{1}$ is a vector or matrix of ones with appropriately defined dimensions [5].

Algorithm 1 KL-NMF parameter estimation

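Procedure KL-NMF (

$V \in \mathbb{R}_+^{N_f \times N_t}$ // input data matrix

$N_z$ // number of basis vectors

)

Initialize: $W \in \mathbb{R}_+^{N_f \times N_z}$, $H \in \mathbb{R}_+^{N_z \times N_t}$

repeat

Optimize over $W$

$$W \leftarrow W \odot \frac{\dfrac{V}{WH}\, H^T}{\mathbf{1} H^T} \tag{8}$$

Optimize over $H$

$$H \leftarrow H \odot \frac{W^T \dfrac{V}{WH}}{W^T \mathbf{1}} \tag{9}$$

until convergence

return: $W$ and $H$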
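A minimal NumPy sketch of the multiplicative updates in Eqs. (8) and (9) is given below. The function name `kl_nmf`, the random initialization, the fixed iteration count used in place of a convergence test, the constant `eps`, and the final rescaling that makes the columns of $W$ sum to one are assumptions of this example rather than details specified by the chapter.

```python
import numpy as np

def kl_nmf(V, n_z, n_iter=200, eps=1e-12, seed=0):
    """Factorize a nonnegative matrix V (N_f x N_t) as W @ H under the KL cost."""
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    n_f, n_t = V.shape
    W = rng.random((n_f, n_z)) + eps          # feasible (positive) initialization
    H = rng.random((n_z, n_t)) + eps
    ones = np.ones_like(V)
    history = []                              # KL cost after each iteration
    for _ in range(n_iter):
        # Eq. (8): W <- W * ((V / WH) H^T) / (1 H^T)
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
        # Eq. (9): H <- H * (W^T (V / WH)) / (W^T 1)
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        V_hat = W @ H
        history.append(float(np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat)))
    # Rescale so the columns of W sum to one; the product W @ H is unchanged.
    scale = W.sum(axis=0)
    W = W / scale
    H = H * scale[:, None]
    return W, H, history

rng = np.random.default_rng(1)
V = rng.random((20, 30))                      # stand-in for a magnitude spectrogram
W, H, history = kl_nmf(V, n_z=4, n_iter=100)
```

Tracking the cost in `history` gives a simple way to observe the monotonic decrease described above and to choose a stopping point.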

NMF is an optimization technique that applies EM-style updates in terms of matrices, whereas probabilistic latent component analysis (PLCA) applies the EM algorithm in terms of probabilities. In PLCA, probabilities over time and frequency are incorporated into the model. The next section discusses the development of a PLCA-based algorithm that incorporates time-frequency constraints.

3. A probabilistic latent variable model with time-frequency constraints

Considering this approach, we now develop a new PLCA-based algorithm to incorporate the time-frequency user-annotations. For clarity, we restate the form of the symmetric two-dimensional PLCA model we use:

$$p(f, t) = \sum_z p(z)\, p(f \mid z)\, p(t \mid z) \tag{10}$$

Compared to a modified NMF formulation, incorporating optimization constraints as a function of time, frequency, and sound source into the factorized PLCA model is particularly interesting and motivating to our focus.

Incorporating prior information into this model, and PLCA in general, can be done in several ways. The most commonly used methods are by direct observations (i.e., setting probabilities to zero, one, etc.) or by incorporating Bayesian prior probabilities on model parameters. Direct observations do not give us enough control, so we consider incorporating Bayesian prior probabilities. For the case of Eq. (10), this would result in independently modifying the factor terms $p(f \mid z)$, $p(t \mid z)$, or $p(z)$. Common prior probability distributions used for this purpose include Dirichlet priors [7], gamma priors [8], and others.

Given that we would like to incorporate the user-annotations as a function of time, frequency, and sound source, however, we notice that this is not easily accomplished using standard priors. This is because the model is factorized, and each factor is only a function of one variable and (possibly) conditioned by another, making it difficult to construct a set of prior probabilities that, when jointly applied to $p(f \mid z)$, $p(t \mid z)$, and/or $p(z)$, would encourage or discourage one source or another to explain a given time-frequency point. We can see this more clearly when we consider PLCA to be the following simplified estimation problem:

$$X(f, t) \approx \phi(z)\, \phi(f, z)\, \phi(t, z) \tag{11}$$

where $X(f, t)$ is the observed data that we model as the product of three distinct functions or factors $\phi(z)$, $\phi(f, z)$, and $\phi(t, z)$. Note that each factor has different input arguments and different parameters that we wish to estimate via EM. Also, set aside for the moment that the factors must be normalized probabilities.

Given this model, if we wish to incorporate additional information, we could independently modify:

  • $\phi(z)$ to incorporate prior knowledge of the variable $z$

  • $\phi(f, z)$ to incorporate prior knowledge of the variables $f$ and $z$

  • $\phi(t, z)$ to incorporate prior knowledge of the variables $t$ and $z$

This way of manipulation allows us to maintain our factorized form and can be thought of as prior-based regularization. If we would like to incorporate additional information/regularization that is a function of all three variables z, f, and t, then we must do something else. The first option would be to try to simultaneously modify all factors together to impose regularization that is a function of all three variables. This is unfortunately very difficult—both conceptually difficult to construct and practically difficult to algorithmically solve.

This motivates the use of posterior regularization (PR). PR provides us with an algorithmic mechanism via EM to incorporate constraints that are complementary to prior-based regularization. Instead of modifying the individual factors of our model as we saw before, we directly modify the posterior distribution of our model. The posterior distribution of our model, very loosely speaking, is a function of all random variables of our model. It is natively computed within each E step of EM and is required to iteratively improve the estimates of our model parameters. In this example, the posterior distribution would be akin to $\phi(z, f, t)$, which is a function of $t$, $f$, and $z$, as required. We now formally discuss PR below, beginning with a general discussion and concluding with the specific form of PR we employ within our approach.

3.1 Posterior regularization

The framework of posterior regularization, first introduced by Graca, Ganchev, and Taskar [9, 10], is a relatively new mechanism for injecting rich, typically data-dependent constraints into latent variable models using the EM algorithm. In contrast to standard Bayesian prior-based regularization, which applies constraints to the model parameters of a latent variable model in the maximization step of EM, posterior regularization applies constraints to the posterior distribution (distribution over the latent variables, conditioned on everything else) computation in the expectation step of EM. The method has found success in many natural language processing tasks, such as statistical word alignment, part-of-speech tagging, and similar tasks that involve latent variable models.

In this case, we constrain the distribution $q$ in some way when we maximize the auxiliary bound $F(q, \Theta)$ with respect to $q$ in the expectation step of an EM algorithm, resulting in

$$q^{n+1} = \underset{q}{\arg\min}\ \mathrm{KL}(q \,\|\, p) + \Omega(q) \tag{12}$$

where $\Omega(q)$ constrains the possible space of $q$.

Note that the only difference between Eq. (12) and the standard expectation step of EM is the added term $\Omega(q)$. If $\Omega(q)$ is set to zero, we get back the original formulation and easily solve the optimization by setting $q = p$ without any computation (except computing the posterior $p$). Also note that the term “weakly supervised” was introduced by Graca [11] to denote the use of constraints in this context and is similarly adopted here.

This method of regularization is in contrast to prior-based regularization, where the modified maximization step would be

$$\Theta^{n+1} = \underset{\Theta}{\arg\max}\ F(q^{n+1}, \Theta) + \Omega(\Theta) \tag{13}$$

where $\Omega(\Theta)$ constrains the model parameters $\Theta$.

3.2 Linear grouping expectation constraints

Given the general framework of posterior regularization, we need to define a meaningful penalty $\Omega(q)$ onto which we map our annotations. We do this by mapping the annotation matrices to linear grouping constraints on the latent variable $z$. To do so, we first notice that Eq. (12) decouples for each time-frequency point for our specific model. Because of this, we can independently solve Eq. (12) for each time-frequency point, making the optimization much simpler. When we rewrite our E step optimization using vector notation, we get

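$$\underset{q_{ft}}{\arg\min}\ -q_{ft}^T \ln p_{ft} + q_{ft}^T \ln q_{ft}$$

$$\text{subject to } q_{ft}^T \mathbf{1} = 1,\ q_{ft} \succeq 0 \tag{14}$$

where $q(z \mid f, t)$ and $p(z \mid f, t)$ for a given value of $f$ and $t$ are written as the vectors $q_{ft}$ and $p_{ft}$ without any modification; we note that $q_{ft}$ is optimal when equal to $p_{ft}$, as before.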

We then apply our linear grouping constraints independently for each time-frequency point:

$$\underset{q_{ft}}{\arg\min}\ -q_{ft}^T \ln p_{ft} + q_{ft}^T \ln q_{ft} + q_{ft}^T \lambda_{ft}$$

$$\text{subject to } q_{ft}^T \mathbf{1} = 1,\ q_{ft} \succeq 0 \tag{15}$$

where we define $\lambda_{ft} = \begin{bmatrix} \Lambda_{ft1} & \cdots & \Lambda_{ft1} & \Lambda_{ft2} & \cdots & \Lambda_{ft2} \end{bmatrix}^T \in \mathbb{R}^{N_z}$ as the vector of user-defined penalty weights, $T$ is a matrix transpose, $\succeq$ is element-wise greater than or equal to, and $\mathbf{1}$ is a column vector of ones. In this case, positive-valued penalties are used to decrease the probability of a given source, while negative-valued coefficients are used to increase the probability of a given source. Note that the penalty weights imposed on the group of values of $z$ that correspond to a given source $s$ are identical, linear with respect to the $z$ variables, and applied in the E step of EM, hence the name “linear grouping expectation constraints.”

To solve the above optimization problem for a given time-frequency point, we form the Lagrangian

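$$L(q_{ft}, \gamma) = -q_{ft}^T \ln p_{ft} + q_{ft}^T \ln q_{ft} + q_{ft}^T \lambda_{ft} + \gamma \left( 1 - q_{ft}^T \mathbf{1} \right) \tag{16}$$

With $\gamma$ being a Lagrange multiplier, we take the gradient with respect to $q_{ft}$ and $\gamma$:

$$\nabla_{q_{ft}} L(q_{ft}, \gamma) = -\ln p_{ft} + \mathbf{1} + \ln q_{ft} + \lambda_{ft} - \gamma \mathbf{1} = 0 \tag{17}$$

$$\nabla_{\gamma} L(q_{ft}, \gamma) = 1 - q_{ft}^T \mathbf{1} = 0 \tag{18}$$

Setting Eqs. (17) and (18) equal to zero and solving for $q_{ft}$ results in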

$$q_{ft} = \frac{p_{ft} \odot \exp\{-\lambda_{ft}\}}{p_{ft}^T \exp\{-\lambda_{ft}\}} \tag{19}$$

where exp{} is an element-wise exponential function.
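The algebra between Eqs. (17), (18), and (19) can be written out explicitly. The steps below are a worked sketch of that solution and introduce no quantities beyond $q_{ft}$, $p_{ft}$, $\lambda_{ft}$, and $\gamma$.

```latex
\begin{align*}
% Solve the stationarity condition of Eq. (17) for ln q_ft.
\ln q_{ft} &= \ln p_{ft} - \lambda_{ft} + (\gamma - 1)\,\mathbf{1} \\
% Exponentiate element-wise.
q_{ft} &= e^{\gamma - 1}\; p_{ft} \odot \exp\{-\lambda_{ft}\} \\
% Impose the normalization of Eq. (18), q_ft^T 1 = 1.
e^{\gamma - 1}\; p_{ft}^{T} \exp\{-\lambda_{ft}\} &= 1
\quad\Longrightarrow\quad
e^{\gamma - 1} = \frac{1}{p_{ft}^{T} \exp\{-\lambda_{ft}\}} \\
% Substitute back to recover Eq. (19).
q_{ft} &= \frac{p_{ft} \odot \exp\{-\lambda_{ft}\}}{p_{ft}^{T} \exp\{-\lambda_{ft}\}}
\end{align*}
```

The nonnegativity constraint is satisfied automatically because both $p_{ft}$ and the exponential are nonnegative.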

Notice the result is computed in closed form and does not require any iterative optimization scheme as may be required in the general posterior regularization framework [9], minimizing the computational cost when incorporating the constraints. Also note, however, that this optimization must be solved for each time-frequency point of our spectrogram data for each E step iteration of our final EM parameter estimation algorithm.
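The closed-form solution of Eq. (19) for a single time-frequency point can be sketched as follows. The function name `regularized_posterior` and the toy numbers (four latent components, two per source, with the first source penalized) are assumptions chosen for illustration.

```python
import numpy as np

def regularized_posterior(p_ft, lam_ft):
    """Eq. (19): reweight the posterior by exp(-penalty) and renormalize."""
    weighted = p_ft * np.exp(-lam_ft)     # element-wise product
    return weighted / weighted.sum()      # divide by p_ft^T exp(-lam_ft)

p_ft = np.array([0.25, 0.25, 0.25, 0.25])   # unregularized posterior over z
lam_ft = np.array([1.0, 1.0, 0.0, 0.0])     # same penalty for both components of source 1
q_ft = regularized_posterior(p_ft, lam_ft)
print(q_ft)
```

A positive penalty moves probability mass away from the components of the penalized source, while a zero penalty vector returns the original posterior unchanged.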

3.3 Parameter estimation

Now knowing the posterior-regularized expectation step optimization, we can derive a complete EM algorithm for a posterior-regularized two-dimensional PLCA model (PR-PLCA):

$$p(z \mid f, t) \leftarrow \frac{p(z)\, p(f \mid z)\, p(t \mid z)\, \bar{\Lambda}_{ftz}}{\sum_{z'} p(z')\, p(f \mid z')\, p(t \mid z')\, \bar{\Lambda}_{ftz'}} \tag{20}$$

where $\bar{\Lambda} = \exp\{-\Lambda\}$. The entire algorithm is outlined in Algorithm 2. Notice we continue to maintain closed-form E and M steps, allowing us to optimize further and draw connections to multiplicative nonnegative matrix factorization algorithms.

Algorithm 2 PR-PLCA with linear grouping expectation constraints

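Procedure PLCA (

$V \in \mathbb{R}_+^{N_f \times N_t}$ // observed normalized data

$N_z$ // number of basis vectors

$N_s$ // number of sources

$\Lambda \in \mathbb{R}^{N_f \times N_t \times N_z}$ // penalties

)

Initialize: feasible $p(z)$, $p(f \mid z)$, and $p(t \mid z)$

Precompute:

$$\bar{\Lambda} \leftarrow \exp\{-\Lambda\} \tag{21}$$

repeat

Expectation step

for all z, f, t do

$$p(z \mid f, t) \leftarrow \frac{p(z)\, p(f \mid z)\, p(t \mid z)\, \bar{\Lambda}_{ftz}}{\sum_{z'} p(z')\, p(f \mid z')\, p(t \mid z')\, \bar{\Lambda}_{ftz'}} \tag{22}$$

end for

Maximization step

for all z, f, t do

$$p(f \mid z) = \frac{\sum_{t} V_{ft}\, p(z \mid f, t)}{\sum_{f'} \sum_{t'} V_{f't'}\, p(z \mid f', t')} \tag{23}$$

$$p(t \mid z) = \frac{\sum_{f} V_{ft}\, p(z \mid f, t)}{\sum_{f'} \sum_{t'} V_{f't'}\, p(z \mid f', t')} \tag{24}$$

$$p(z) = \frac{\sum_{f} \sum_{t} V_{ft}\, p(z \mid f, t)}{\sum_{z'} \sum_{f'} \sum_{t'} V_{f't'}\, p(z' \mid f', t')} \tag{25}$$

end for

until convergence

return: $p(f \mid z)$, $p(t \mid z)$, $p(z)$, and $p(z \mid f, t)$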
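The E and M steps of Algorithm 2 can be sketched with NumPy broadcasting, storing the posterior as an array indexed by $(f, t, z)$. The function name `pr_plca`, the random initialization, the fixed iteration count, and the constant `eps` are assumptions of this example; the penalty tensor used in the demonstration is arbitrary.

```python
import numpy as np

def pr_plca(V, Lam, n_iter=100, eps=1e-12, seed=0):
    """Posterior-regularized PLCA. V: (N_f, N_t) nonnegative, Lam: (N_f, N_t, N_z)."""
    rng = np.random.default_rng(seed)
    n_f, n_t = V.shape
    n_z = Lam.shape[2]
    V = V / V.sum()                                   # observed normalized data
    pz = np.full(n_z, 1.0 / n_z)                      # p(z)
    pf_z = rng.random((n_f, n_z))
    pf_z /= pf_z.sum(axis=0, keepdims=True)           # p(f|z), columns sum to one
    pt_z = rng.random((n_t, n_z))
    pt_z /= pt_z.sum(axis=0, keepdims=True)           # p(t|z), columns sum to one
    Lam_bar = np.exp(-Lam)                            # Eq. (21)
    for _ in range(n_iter):
        # E step, Eq. (22): joint[f, t, z] = p(z) p(f|z) p(t|z) Lam_bar[f, t, z]
        joint = pz[None, None, :] * pf_z[:, None, :] * pt_z[None, :, :] * Lam_bar
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # M step, Eqs. (23)-(25)
        weighted = V[:, :, None] * post               # V_ft p(z|f,t)
        mass_z = weighted.sum(axis=(0, 1))            # sum over f and t, shape (N_z,)
        pf_z = weighted.sum(axis=1) / (mass_z + eps)  # Eq. (23)
        pt_z = weighted.sum(axis=0) / (mass_z + eps)  # Eq. (24)
        pz = mass_z / (mass_z.sum() + eps)            # Eq. (25)
    return pz, pf_z, pt_z, post

rng = np.random.default_rng(2)
V = rng.random((6, 5))
Lam = np.zeros((6, 5, 2))
Lam[:3, :, 0] = 2.0                                   # discourage component 0 in the low bins
pz, pf_z, pt_z, post = pr_plca(V, Lam, n_iter=50)
```

With an all-zero penalty tensor, `Lam_bar` is a tensor of ones and the sketch reduces to ordinary PLCA.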

  • Multiplicative Update Equations

We can rearrange the expressions in Algorithm 2 and convert to a multiplicative form following similar methodology to Smaragdis and Raj [12].

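Rearranging the expectation and maximization steps, in conjunction with Bayes’ rule, and defining

$$Z(f, t) = \sum_z p(z)\, p(f \mid z)\, p(t \mid z)\, \bar{\Lambda}_{ftz},$$

we get

$$q(z \mid f, t) = \frac{p(f \mid z)\, p(t, z)\, \bar{\Lambda}_{ftz}}{Z(f, t)} \tag{26}$$

$$p(t, z) = \sum_f V_{ft}\, q(z \mid f, t) \tag{27}$$

$$p(f \mid z) = \frac{\sum_t V_{ft}\, q(z \mid f, t)}{\sum_t p(t, z)} \tag{28}$$

$$p(z) = \sum_t p(t, z) \tag{29}$$

where $q(z \mid f, t)$ denotes the regularized posterior computed in the expectation step and $p(t, z) = p(t \mid z)\, p(z)$ is the joint distribution obtained through Bayes’ rule.

Rearranging further, we get

$$p(f \mid z) \leftarrow p(f \mid z)\, \frac{\sum_t \dfrac{V_{ft}\, \bar{\Lambda}_{ftz}}{Z(f, t)}\, p(t, z)}{\sum_t p(t, z)} \tag{30}$$

$$p(t, z) \leftarrow p(t, z) \sum_f p(f \mid z)\, \frac{V_{ft}\, \bar{\Lambda}_{ftz}}{Z(f, t)} \tag{31}$$

which fully specifies the iterative updates. By putting Eqs. (30) and (31) in matrix notation, we specify the multiplicative form of the proposed method in Algorithm 3.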

Algorithm 3. PR-PLCA with linear grouping expectation constraints in matrix notation

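Procedure PLCA (

$V \in \mathbb{R}_+^{N_f \times N_t}$ // observed normalized data

$N_z$ // number of basis vectors

$N_s$ // number of sources

$\Lambda_s \in \mathbb{R}^{N_f \times N_t}$, $s \in \{1, \ldots, N_s\}$ // penalties

)

Initialize: $W \in \mathbb{R}_+^{N_f \times N_z}$, $H \in \mathbb{R}_+^{N_z \times N_t}$

Precompute:

for all s do

$$\bar{\Lambda}_s \leftarrow \exp\{-\Lambda_s\} \tag{32}$$

$$X_s \leftarrow V \odot \bar{\Lambda}_s \tag{33}$$

end for

repeat

$$\Gamma \leftarrow \sum_s W_{(s)} H_{(s)} \odot \bar{\Lambda}_s \tag{34}$$

for all s do

$$Z_s \leftarrow \frac{X_s}{\Gamma} \tag{35}$$

$$W_{(s)} \leftarrow W_{(s)} \odot \frac{Z_s H_{(s)}^T}{\mathbf{1} H_{(s)}^T} \tag{36}$$

$$H_{(s)} \leftarrow H_{(s)} \odot \left( W_{(s)}^T Z_s \right) \tag{37}$$

end for

until convergence

return: $W$ and $H$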
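The matrix form of Algorithm 3 can be sketched as below, keeping one basis block and one activation block per source in Python lists. The function name `pr_nmf`, the equal number of basis vectors per source, the random initialization, and the fixed iteration count are assumptions of this example. Both updates are computed from the current estimates before being assigned, which is one reading of the listing that mirrors Eqs. (30) and (31).

```python
import numpy as np

def pr_nmf(V, penalties, n_z_per_source, n_iter=100, eps=1e-12, seed=0):
    """Posterior-regularized multiplicative updates; penalties is a list of (N_f, N_t) arrays."""
    rng = np.random.default_rng(seed)
    n_f, n_t = V.shape
    n_s = len(penalties)
    W = [rng.random((n_f, n_z_per_source)) + eps for _ in range(n_s)]
    W = [w / w.sum(axis=0, keepdims=True) for w in W]     # columns start normalized
    H = [rng.random((n_z_per_source, n_t)) + eps for _ in range(n_s)]
    Lam_bar = [np.exp(-lam) for lam in penalties]          # Eq. (32)
    X = [V * lb for lb in Lam_bar]                         # Eq. (33)
    ones = np.ones((n_f, n_t))
    for _ in range(n_iter):
        # Eq. (34): penalty-weighted reconstruction summed over sources
        Gamma = sum((W[s] @ H[s]) * Lam_bar[s] for s in range(n_s)) + eps
        for s in range(n_s):
            Z = X[s] / Gamma                                        # Eq. (35)
            W_new = W[s] * (Z @ H[s].T) / (ones @ H[s].T + eps)     # Eq. (36)
            H_new = H[s] * (W[s].T @ Z)                             # Eq. (37)
            W[s], H[s] = W_new, H_new
    return W, H

rng = np.random.default_rng(3)
V = rng.random((8, 10))
penalties = [np.zeros((8, 10)), np.zeros((8, 10))]
penalties[0][:4, :] = 3.0            # discourage source 0 in the lower half of the bins
W, H = pr_nmf(V, penalties, n_z_per_source=2, n_iter=50)
```

A useful sanity check on this form is that the total activation mass across all sources matches the total mass of `V` after every iteration, which follows directly from the definition of $\Gamma$.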

4. An iterative posterior NMF method for speech enhancement in the presence of additive Gaussian noise (proposed algorithm)

Over the past several years, considerable research has been carried out on single-channel sound source separation methods. This problem is motivated by speech denoising, speech enhancement [13], music transcription [14], audio-based forensics, and music remixing. One of the most effective approaches is nonnegative matrix factorization (NMF) [5]. User annotations can be used to obtain the PR terms [15]. However, when the number of sources is large, it is difficult to identify the sources in the spectrogram, and constraint approaches based on user interaction become inefficient.

To avoid this problem, the proposed method introduces an automatic iterative procedure. The spectral components of speech and noise are modeled as Gamma and Rayleigh distributions, respectively [16].

4.1 Notation and basic concepts

Let the noisy speech signal $x[n]$ be the sum of clean speech $s[n]$ and noise $d[n]$, and let their corresponding magnitude spectrograms be represented as

$$X(f, t) = S(f, t) + D(f, t) \tag{38}$$

where $f$ represents the frequency bin and $t$ the frame number. The observed magnitudes in time-frequency are arranged in a matrix $X \in \mathbb{R}_+^{f \times t}$ of nonnegative elements. The source separation algorithms based on NMF pursue the factorization of $X$ as a product of two nonnegative matrices, $W = \begin{bmatrix} w_1 & w_2 & \cdots & w_R \end{bmatrix} \in \mathbb{R}_+^{f \times R}$, in which the columns collect the basis vectors, and $H = \begin{bmatrix} h_1 & h_2 & \cdots & h_R \end{bmatrix}^T \in \mathbb{R}_+^{R \times t}$, whose rows collect their respective weights, i.e.,

$$X = WH = \sum_{z=1}^{R} w_z h_z^T \tag{39}$$

where $R$ denotes the number of latent components.

4.2 Proposed regularization

There are several ways to incorporate user annotations into latent variable models, for instance, by using suitable regularization functions. For expectation maximization (EM) algorithms, posterior regularization was introduced in [9, 11]. This data-dependent method imposes rich constraints on the posterior distributions of latent variable models. It has been applied to many natural language processing tasks, such as statistical word alignment and part-of-speech tagging. The main idea is to constrain the posterior distribution when computing the expectation step of the EM algorithm.

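The prior distributions for the magnitudes of the noise spectral components are modeled as a Rayleigh probability density function (PDF) with scale parameter $\sigma$, which is fitted to the observations by a maximum likelihood procedure [16, 17], i.e.,

$$f(x; \sigma) = \frac{x}{\sigma^2}\, e^{-x^2 / (2\sigma^2)} \ \text{ for } x \geq 0, \quad \text{with } \sigma^2 = \frac{1}{2N} \sum_{i=1}^{N} x_i^2 \tag{40}$$

The above equation can be written as

$$f(x; \sigma) = e^{\log\left(\frac{x}{\sigma^2}\right)}\, e^{-x^2 / (2\sigma^2)} = e^{\log\left(\frac{x}{\sigma^2}\right) - \frac{x^2}{2\sigma^2}} \tag{41}$$

By applying the negative logarithm to both sides of Eq. (41), we get

$$-\log f(x; \sigma) = -\log e^{\log\left(\frac{x}{\sigma^2}\right) - \frac{x^2}{2\sigma^2}} = \frac{x^2}{2\sigma^2} - \log\left(\frac{x}{\sigma^2}\right) \tag{42}$$

Then, the regularization term for the noise is defined as

$$\Lambda_N \equiv \Lambda_{S_1} = -\log f(x; \sigma) = \frac{x^2}{2\sigma^2} - \log\left(\frac{x}{\sigma^2}\right). \tag{43}$$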
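The Rayleigh penalty of Eq. (43), together with the maximum likelihood estimate of $\sigma^2$ from Eq. (40), can be sketched as follows. The function name `rayleigh_penalty`, the choice to pool all entries of the matrix when estimating $\sigma^2$, and the guard constant `eps` are assumptions of this example.

```python
import numpy as np

def rayleigh_penalty(X, eps=1e-12):
    """Return the element-wise negative Rayleigh log-likelihood and the fitted sigma^2."""
    X = np.asarray(X, dtype=float)
    sigma2 = np.sum(X ** 2) / (2.0 * X.size)                        # ML estimate, Eq. (40)
    penalty = X ** 2 / (2.0 * sigma2) - np.log((X + eps) / sigma2)  # Eq. (43)
    return penalty, float(sigma2)

X = np.array([[1.0, 2.0], [3.0, 4.0]])      # toy magnitudes
Lam_noise, sigma2 = rayleigh_penalty(X)
print(sigma2)
```

Entries that are unlikely under the fitted Rayleigh density receive a large penalty, so the noise source is discouraged from explaining them.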

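The spectral components of speech are modeled as a Gamma probability density function [16, 18]

$$f(x; k, \theta) = \frac{x^{k-1}\, e^{-x/\theta}}{\theta^k\, \Gamma(k)} \tag{44}$$

with shape parameter $k > 0$ and scale parameter $\theta > 0$:

$$\theta = \frac{1}{kN} \sum_{i=1}^{N} x_i \quad \text{and} \quad k \approx \frac{3 - s + \sqrt{(s - 3)^2 + 24 s}}{12 s} \tag{45}$$

where the auxiliary variable $s$ is defined as $s = \ln\left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) - \frac{1}{N} \sum_{i=1}^{N} \ln x_i$.

The regularization term for the speech samples is defined as (by applying the negative logarithm to both sides of Eq. (44))

$$\Lambda_{SP} \equiv \Lambda_{S_2} = -\log f(x; k, \theta) = \frac{x}{\theta} - \log\left( \frac{x^{k-1}}{\theta^k\, \Gamma(k)} \right), \quad x \geq 0 \tag{46}$$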

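Special case: When we fix $k = 1$, the Gamma density simplifies to the exponential density and

$$f(x; 1, \theta) = \frac{1}{\theta}\, e^{-x/\theta}, \quad \Lambda_{SP} \equiv \Lambda_{S_2} = \frac{x}{\theta}, \quad x \geq 0 \tag{47}$$

where the additive term $\log \theta$ of Eq. (46), which does not depend on $x$, is omitted.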
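The Gamma parameter estimates of Eq. (45) and the penalty of Eq. (46) can be sketched as below. The function name `gamma_penalty`, the `update_k` flag, pooling all matrix entries when estimating the parameters, and the guard constant `eps` are assumptions of this example. The log-Gamma function from the Python standard library is used instead of evaluating $\Gamma(k)$ directly.

```python
import math
import numpy as np

def gamma_penalty(X, update_k=True, eps=1e-12):
    """Return the element-wise negative Gamma log-likelihood with the fitted k and theta."""
    X = np.asarray(X, dtype=float) + eps              # keep the logarithms finite
    if update_k:
        # Auxiliary variable s (nonnegative by Jensen's inequality); guard against s = 0.
        s = max(float(np.log(X.mean()) - np.log(X).mean()), eps)
        k = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)  # Eq. (45)
    else:
        k = 1.0                                       # exponential special case
    theta = float(X.mean()) / k                       # Eq. (45): (1 / (k N)) sum x_i
    # Eq. (46): x / theta - log( x^(k-1) / (theta^k Gamma(k)) )
    log_term = (k - 1.0) * np.log(X) - k * math.log(theta) - math.lgamma(k)
    return X / theta - log_term, k, theta

X = np.array([[0.5, 1.0], [2.0, 4.0]])                # toy magnitudes
Lam_speech, k, theta = gamma_penalty(X)
Lam_exp, k_exp, theta_exp = gamma_penalty(X, update_k=False)
```

Working with logarithms of each factor avoids overflow when $k$ is large, which would otherwise make $\theta^k\, \Gamma(k)$ difficult to evaluate.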

The proposed multiplicative nonnegative matrix factorization method is summarized in Algorithm 4 [16]. In general, as in the specific case of Algorithm 4, one can only guarantee the monotonic descent of the iteration through a majorization-minimization approach [19] or the convergence to a stationary point [20].

A subscript $s$ in parentheses denotes the columns or rows of the matrix assigned to a given source, $\mathbf{1}$ is an appropriately sized matrix of ones, and $\odot$ represents element-wise multiplication.

Algorithm 4: Gamma-Rayleigh regularized PLCA method (GR-NMF)

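Procedure

$X \in \mathbb{R}_+^{f \times t}$ % observed normalized data

$\Lambda_s \in \mathbb{R}_+^{f \times t}$, $s \in \{1, \ldots, N_S\}$ % $\Lambda_s$: penalties, $N_S$: number of sources

$$\Lambda_s^{NEW} = 0$$

$$\Lambda_{S_1} = \Lambda_N^{OLD} = \frac{X^2}{2\sigma^2} - \log\left(\frac{X}{\sigma^2}\right) \quad \text{and} \quad \Lambda_{S_2} = \Lambda_{SP}^{OLD} = \frac{X}{\theta} - \log\left(\frac{X^{k-1}}{\theta^k\, \Gamma(k)}\right) \tag{48}$$

$$\Lambda_s^{OLD} \leftarrow \exp\{-\Lambda_s\} \tag{49}$$

Repeat

For all s do

$$\Lambda_s = (1 - \mu)\, \Lambda_s^{OLD} + \mu\, \Lambda_s^{NEW} \quad \text{% update penalties using LMS} \tag{50}$$

$$\Lambda_s^{OLD} = \Lambda_s^{NEW} \tag{51}$$

$$X_s \leftarrow X \odot \Lambda_s \tag{52}$$

End for

$$\Gamma \leftarrow \sum_s W_{(s)} H_{(s)} \odot \Lambda_s \tag{53}$$

$$Z_s \leftarrow \frac{X_s}{\Gamma} \tag{54}$$

For all s do

$$W_{(s)} \leftarrow W_{(s)} \odot \frac{Z_s H_{(s)}^T}{\mathbf{1} H_{(s)}^T} \tag{55}$$

$$H_{(s)} \leftarrow H_{(s)} \odot \left( W_{(s)}^T Z_s \right) \tag{56}$$

End for

Reconstruction

For all s do

$$M_s \leftarrow \frac{W_{(s)} H_{(s)}}{WH} \quad \text{% compute filter} \tag{57}$$

$$\hat{X}_s \leftarrow M_s \odot X \quad \text{% filter mixture} \tag{58}$$

$$x_s \leftarrow \mathrm{ISTFT}\left(\hat{X}_s, \angle X, P\right) \quad \text{% } P\text{: STFT parameters} \tag{59}$$

if update k % Gamma model

$$s = \ln\left( \frac{1}{N} \sum \hat{X}_s \right) - \frac{1}{N} \sum \ln \hat{X}_s, \quad k \approx \frac{3 - s + \sqrt{(s - 3)^2 + 24 s}}{12 s} \tag{60}$$

else % Exponential model

$k = 1$

end

$$\theta = \frac{1}{kN} \sum \hat{X}_s \tag{61}$$

$$\Lambda_{S_1} = \Lambda_N^{OLD} = \frac{\hat{X}_{s_1}^2}{2\sigma^2} - \log\left(\frac{\hat{X}_{s_1}}{\sigma^2}\right)$$

$$\Lambda_{S_2} = \Lambda_{SP}^{OLD} = \frac{\hat{X}_{s_2}}{\theta} - \log\left(\frac{\hat{X}_{s_2}^{k-1}}{\theta^k\, \Gamma(k)}\right) \tag{62}$$

$$\Lambda_s^{NEW} = \exp\{-\Lambda_s^{OLD}\} \quad \text{% } \Lambda_s^{OLD} \text{ represents both } \Lambda_{SP} \text{ and } \Lambda_N \tag{63}$$

End for

Until convergence

Return: time-domain signals $x_s$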
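The reconstruction stage of Algorithm 4, Eqs. (57) and (58), can be sketched in isolation. The function name `soft_mask_separate`, the list-based storage of per-source factors, and the random toy matrices are assumptions of this example. The inverse STFT of Eq. (59), which would reuse the phase of the mixture, is left out.

```python
import numpy as np

def soft_mask_separate(X, W_list, H_list, eps=1e-12):
    """Split the mixture magnitude X into per-source estimates with ratio masks."""
    layers = [W @ H for W, H in zip(W_list, H_list)]   # W_(s) H_(s) for each source
    total = sum(layers) + eps                          # full model WH
    masks = [layer / total for layer in layers]        # Eq. (57): compute filter
    estimates = [M * X for M in masks]                 # Eq. (58): filter mixture
    return masks, estimates

rng = np.random.default_rng(4)
n_f, n_t, n_z = 8, 10, 2
W_list = [rng.random((n_f, n_z)) for _ in range(2)]   # e.g., noise and speech bases
H_list = [rng.random((n_z, n_t)) for _ in range(2)]
X = rng.random((n_f, n_t))                            # stand-in mixture magnitudes
masks, estimates = soft_mask_separate(X, W_list, H_list)
```

Because the masks sum to one at every time-frequency point, the source estimates add back up to the mixture magnitude, so no energy is created or lost by the filtering step.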

5. Experimental results

The speech and noise audio samples were taken from NOIZEUS [21]. The sampling frequency is 8 kHz. The algorithm is iterated until convergence [16]. The proposed method was compared with Euclidean NMF (EUC-NMF) [5], Itakura-Saito NMF (IS-NMF) [22], posterior regularization NMF (PR-NMF) [15], Wiener filtering [23], and a constrained version of NMF (CNMF) [24]. All methods were evaluated under nonstationary noise conditions, namely babble noise and street noise. The performance of the proposed method was evaluated using the perceptual evaluation of speech quality (PESQ) [25] and the source-to-distortion ratio (SDR) [26]. SDR gives the average quality of separation on a dB scale and accounts for both signal distortion and noise distortion. For both PESQ and SDR, a higher value indicates better performance. Tables 1 and 2 show the PESQ and SDR values of the different NMF algorithms evaluated. The experimental results show that the proposed method performs better than the other methods in terms of the PESQ and SDR indices.

Table 1.

PESQ and SDR for babble noise.

Table 2.

PESQ and SDR for street noise.

6. Conclusion

A novel speech enhancement method based on an iterative and regularized NMF algorithm for single-channel source separation is proposed. The clean speech and noise magnitude spectra are modeled as Gamma and Rayleigh distributions, respectively. The corresponding negative log-likelihood functions are used as penalties to regularize the cost function of the NMF. The basis matrices and excitation matrices are estimated using the proposed regularized multiplicative update rules. The experiments show that the proposed speech enhancement method outperforms the benchmark methods considered in terms of SDR and PESQ values.

References

  1. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401:788-791
  2. Smaragdis P. Non-negative matrix factorization for polyphonic music transcription. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics; 19–22 October 2003. Mohonk Mountain; 2003. pp. 177-180
  3. Bryan NJ, Mysore GJ. An efficient posterior regularized latent variable model for interactive sound source separation. In: International Conference on Machine Learning (ICML); June 2013
  4. Boyd S, Vandenberghe L. Convex Optimization. New York, NY, USA: Cambridge University Press; 2004
  5. Lee DD, Seung HS. Algorithms for Non-negative Matrix Factorization. NIPS Proceedings. 2001
  6. Hunter DR, Lange K. A tutorial on MM algorithms. The American Statistician. 2004;58:30-37
  7. Paltz N. Separation by ‘Humming’: user-guided sound extraction from monophonic mixtures. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA); 2009. pp. 69–72
  8. Fitzgerald D. User assisted separation using tensor factorisations. In: European Signal Processing Conference (EUSIPCO). 2012. pp. 2412–2416
  9. Graca J, Ganchev K, Taskar B. Expectation maximization and posterior constraints. Advances in Neural Information Processing Systems. 2008;20:1-8
  10. Ganchev K, Gillenwater J. Posterior regularization for structured latent variable models. Journal of Machine Learning Research. 2010;11:2001-2049
  11. Graça J, Ganchev K, Taskar B, Pereira F. Posterior vs. parameter sparsity in latent variable models. NIPS–Advances in Neural Information Processing Systems. 2009:664-672
  12. Smaragdis P, Raj B. Shift-invariant probabilistic latent component analysis. Journal of Machine Learning Research. Technical Report TR2007009, MERL; December, 2007:5
  13. Mysore GJ, Smaragdis P. A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2011. pp. 17–20
  14. Bertin N, Badeau R, Vincent E. Enforcing harmonicity and smoothness in Bayesian non-negative matrix factorization applied to polyphonic music transcription. IEEE Transactions on Audio, Speech and Language Processing. 2010;18:538-549
  15. Bryan NJ, Mysore GJ. An efficient posterior regularized latent variable model for interactive sound source separation. In: International Conference on Machine Learning (ICML); 2013
  16. Sunnydayal K k, Cruces-Alvarez SA. An iterative posterior NMF method for speech enhancement in the presence of additive Gaussian noise. Neurocomputing. 2017;230:312-315
  17. Cruces-Alvarez SA, Cichocki A, Amari S. From blind signal extraction to blind instantaneous signal separation: Criteria, algorithms, and stability. IEEE Transactions on Neural Networks. 2004;15:859-873
  18. Erkelens JS, Hendriks RC, Heusdens R, Jensen J. Minimum mean-square error estimation of discrete Fourier coefficients with generalized gamma priors. IEEE Transactions on Audio, Speech and Language Processing. 2007;15(6):1741-1752
  19. Cichocki A, Cruces S, Amari S. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy;13:134-170
  20. Lin C-J. On the convergence of multiplicative update for nonnegative matrix factorization. IEEE Transactions on Neural Networks. 2007;18:1589-1596
  21. https://ecs.utdallas.edu/loizou/speech/noizeus/ [Online]
  22. Févotte C, Bertin N, Durrieu J-L. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation. 2009;21:793-830
  23. Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics. 1984;32:1109-1121
  24. Berry MW, Browne M, Langville AN, Pauca VP, Plemmons RJ. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis. 2007;52(1):155-173
  25. Hu Y, Loizou PC. Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Acoustics, Speech, and Signal Processing. 2008;16(1):229-238
  26. Vincent E, Gribonval R, Fevotte C. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech and Language Processing. 2006;14:1462-1469
