Closed form solutions for MMSE & MAP estimators for speech enhancement in additive Gaussian noise

Abstract:

In this paper, we derive closed-form solutions for the minimum mean square error (MMSE) and maximum a posteriori (MAP) estimators for speech enhancement in additive Gaussian noise, assuming a t-location-scale probability density function (PDF) as the clean speech prior. Fitting a t-location-scale PDF to the real and imaginary parts of the DFT coefficients of clean speech signals yields a lower Jensen-Shannon divergence (JSD) than other heavy-tailed distributions such as the Laplacian and Gamma. We use the two proposed estimators, along with the Wiener filter and MMSE estimators based on Laplacian, Gamma, and generalized Gamma prior PDFs, to enhance noisy signals from the NOIZEUS database. All estimators are compared in terms of both signal and noise distortion. The results show that the proposed MMSE estimator gives the minimum squared error and signal distortion when estimating the complex-valued DFT coefficients of speech. The quality of the enhanced signals is also assessed in terms of PESQ, segmental SNR, and general SNR.

1 Introduction

A speech signal recorded by a single microphone can be contaminated by various environmental disturbances, such as additive noise, echo, or reverberation. Different techniques have been proposed to enhance a distorted speech signal, improving its quality and reducing listening effort in the presence of interfering signals. These methods are of interest for various applications, including hearing aids, speech recognition, and speech communication over telephony and the internet. Over the last three decades, different strategies have been proposed for speech enhancement in the additive noise case [1]. Most of these strategies employ enhancement algorithms in the frequency domain, which achieve notable quality improvements despite their low computational complexity [2]. The well-known spectral subtraction method [3], which has a simple mathematical expression, and Wiener filtering and its variations, such as iterative Wiener filtering [4], all belong to the category of frequency-domain enhancement methods. Another class of frequency-domain methods estimates the enhanced signal using Bayesian techniques such as the minimum mean square error (MMSE) estimator [1], the log-MMSE-based estimator [5], the maximum likelihood (ML) estimator [6], and maximum a posteriori (MAP) estimators [7]. The MMSE and MAP estimators are predominantly used in the DFT domain to infer the amplitude or the complex DFT coefficients of speech from the observed noisy DFT coefficients. In the MMSE estimators, a standard mathematical cost function [1] or a perceptually relevant criterion [8-10] is optimized to obtain a nonlinear gain function for modifying the DFT coefficients of noisy speech. Deriving the MMSE and MAP estimators requires two PDFs: the prior PDF (the clean speech PDF) and the PDF of the noise signal [1].
In pioneering work, a closed-form solution for the MMSE estimator was derived in [1] under the assumption that the DFT coefficients of both clean speech and noise are Gaussian. Later research, which studied the PDF of clean speech in both the time and frequency domains, showed that the real and imaginary parts of the DFT coefficients of clean speech are super-Gaussian (with a sharper peak and heavier tails than a Gaussian) [7][11]. To date, different MMSE and MAP estimators have been derived by assuming Gaussian noise and more advanced clean speech priors, for example Laplacian [11], Gamma [7][12], and generalized Gamma [13]. In general, the MMSE/MAP estimator has no closed-form solution when a more complicated but better-fitting prior PDF is employed. However, in MMSE-based speech enhancement methods this problem can be overcome by numerically approximating the integrals and constructing a look-up table of gain functions [13]. It should be noted that Bayesian estimators (MMSE or MAP) that operate on the amplitudes of noisy DFT coefficients combine the estimated amplitudes with the corresponding noisy phase to reconstruct the complex DFT coefficients [1][13]. Because the noisy phase spectrum can degrade perceptual information [14], the performance of these enhancement algorithms can be reduced, especially at low SNR and for rapidly changing noise. Therefore, some recent studies have presented phase-aware speech enhancement methods [14-15]. As another modification that improves the performance of all statistical speech enhancement algorithms, the speech presence (or absence) probability can be estimated and used to multiply the basic estimates [1][16]. Further developments of the MMSE estimators have been obtained by fitting a more suitable PDF to the noise DFT coefficients.
It has been shown that most noise signals follow a Gaussian PDF; however, this does not hold for particular noise types such as babble and fan noise [17]. In [18], the MMSE estimator was extended using a Gaussian mixture model (GMM) for the noise DFT coefficients. Moreover, speech produced by humans can be described by a deterministic (sinusoidal/exponential) mathematical model. This contrasts with the assumption in statistical enhancement methods that the speech signal is generated entirely by a stochastic process. In [19] and [20], two different strategies are presented for deriving a stochastic-deterministic MMSE estimator. In [19], the zero-mean assumption of the prior PDF in a basic MMSE estimator is replaced by the obtained deterministic model. In [20], speech enhancement is carried out with a two-state model in which a transition probability for switching between the stochastic and deterministic models is defined.
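The numerical route mentioned above for priors without a closed-form estimator can be illustrated with a minimal sketch. It is not the procedure of [13]: it treats a single real-valued DFT component under the additive model Y = S + N with Gaussian noise, and approximates E[S | Y = y] by a Riemann sum. The function names, the integration grid, and the variances used in the example are all assumptions made for illustration.

```python
import math


def gaussian_pdf(x, var):
    """Zero-mean Gaussian density with variance var."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def laplacian_pdf(x, var):
    """Zero-mean Laplacian density with variance var (scale b = sqrt(var / 2))."""
    b = math.sqrt(var / 2.0)
    return math.exp(-abs(x) / b) / (2.0 * b)


def mmse_estimate(y, prior_pdf, speech_var, noise_var, half_width=12.0, n=4001):
    """Approximate E[S | Y = y] for Y = S + N, N ~ N(0, noise_var).

    The posterior mean is the ratio of two integrals over s; both are
    approximated on the same uniform grid, so the step size cancels.
    """
    s_max = half_width * math.sqrt(speech_var)  # assumed integration range
    ds = 2.0 * s_max / (n - 1)
    num = 0.0
    den = 0.0
    for i in range(n):
        s = -s_max + i * ds
        weight = gaussian_pdf(y - s, noise_var) * prior_pdf(s, speech_var)
        num += s * weight
        den += weight
    return num / den


def gain_table(prior_pdf, speech_var, noise_var, y_values):
    """Tabulate the gain G(y) = E[S | Y = y] / y for nonzero observations y."""
    return [mmse_estimate(y, prior_pdf, speech_var, noise_var) / y
            for y in y_values]


if __name__ == "__main__":
    ys = [0.25, 0.5, 1.0, 2.0, 4.0]
    for y, g in zip(ys, gain_table(laplacian_pdf, 1.0, 0.5, ys)):
        print(f"y = {y:4.2f}  gain = {g:.4f}")
```

With a Gaussian prior the tabulated gain reduces to the constant Wiener gain, speech_var / (speech_var + noise_var), which gives a convenient sanity check before swapping in a heavier-tailed prior.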

Besides the aforementioned speech enhancement approaches (MMSE, Wiener filtering, etc.), which are unsupervised, a supervised enhancement method based on a deep neural network (DNN) has recently been proposed that uses a training data set to learn a mapping between noisy and clean speech data [21]. In this paper, we experimentally demonstrate that a t-location-scale PDF fits the empirical PDF of the real and imaginary parts of the clean speech DFT coefficients adequately. Assuming a t-location-scale prior PDF, we develop the MMSE and MAP estimators of the complex-valued speech DFT coefficients. Alongside our proposed estimators, previously published speech enhancement algorithms are tested, and objective instrumental measures are used to evaluate the enhancement results.

This paper is organized as follows. In the next section, the experimental results of fitting different distributions to clean speech signals are reported. Section 3 explains the basic assumptions of the single-microphone MMSE estimator for speech enhancement and then extends the MMSE and MAP estimators to a t-location-scale clean speech prior and Gaussian noise. The simulations and experimental results are presented in Section 4. Finally, Section 5 concludes the paper.

IET Research Journals, pp. 1-12, © The Institution of Engineering and Technology 2015

2 Fitting a t-location-scale PDF to clean speech data

The t-distribution is one of the heavy-tailed distributions. The probability density function of a t-random variable (S) with degrees of freedom is given by [22]
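As a hedged illustration of the kind of comparison this section describes, the sketch below evaluates the textbook t-location-scale density and computes a Jensen-Shannon divergence between two discretized densities. The parameter names nu (degrees of freedom), mu (location), and sigma (scale), the Laplacian scale b, the bin range, and the bin count are assumptions chosen for the example and are not taken from the paper; no speech data is involved.

```python
import math


def t_location_scale_pdf(x, nu, mu=0.0, sigma=1.0):
    """Standard t-location-scale density.

    f(x) = Gamma((nu+1)/2) / (sigma * sqrt(nu*pi) * Gamma(nu/2))
           * (1 + ((x - mu)/sigma)^2 / nu)^(-(nu+1)/2)
    Evaluated in the log domain for numerical stability.
    """
    z = (x - mu) / sigma
    log_c = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return math.exp(log_c - (nu + 1.0) / 2.0 * math.log1p(z * z / nu))


def laplacian_pdf(x, b=1.0):
    """Zero-mean Laplacian density with scale b."""
    return math.exp(-abs(x) / b) / (2.0 * b)


def discretize(pdf, lo, hi, n_bins):
    """Probability masses at bin centres on [lo, hi], normalised to sum to 1."""
    width = (hi - lo) / n_bins
    masses = [pdf(lo + (k + 0.5) * width) * width for k in range(n_bins)]
    total = sum(masses)
    return [m / total for m in masses]


def jensen_shannon_divergence(p, q):
    """JSD (in nats) between two discrete distributions of equal length."""
    def kl(a, b):
        # Terms with a_i = 0 contribute nothing to the sum.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    # Illustrative only: compare a t-location-scale shape with a Laplacian.
    p = discretize(lambda x: t_location_scale_pdf(x, nu=3.0), -10.0, 10.0, 400)
    q = discretize(lambda x: laplacian_pdf(x, b=1.0), -10.0, 10.0, 400)
    print("JSD (nats):", jensen_shannon_divergence(p, q))
```

In a fitting experiment, one of the two inputs would be the normalized histogram of the real or imaginary DFT parts and the other a candidate density discretized on the same bins; the candidate with the smaller divergence fits better.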
