FORWARD MASKING THRESHOLD ESTIMATION USING NEURAL NETWORKS AND ITS APPLICATION TO PARALLEL SPEECH ENHANCEMENT

Forward masking models have been used successfully in speech enhancement and audio coding. Presently, forward masking thresholds in these applications are estimated using simplified masking models. In this paper, an accurate approximation of the forward masking threshold using neural networks is proposed. A performance comparison with other existing masking models in a speech enhancement application is presented. Objective measures using PESQ demonstrate that our proposed forward masking model provides significant improvements (5-15%) over four existing models when tested with speech signals corrupted by various noises at very low signal-to-noise ratios. Moreover, a parallel implementation of the speech enhancement algorithm was developed using the Matlab parallel computing toolbox.


INTRODUCTION
Forward masking is a time domain phenomenon in which a masker precedes the signal in time. Forward masking psychoacoustic data depend on four dimensions: frequency, masker level, time difference between masker and maskee, and masker signal duration [1]. The current forward masking models do not fully take into account all four dimensions of forward masking data.
Functional models of the forward masking effect of the human auditory system have recently been used with success in speech and audio coding to provide more efficient signal compression [2,3]. Furthermore, forward masking has been used for speech enhancement [4] using the speech boosting technique [5]. Instead of focusing on suppressing the noise, the speech boosting technique increases the relative power of the speech, thus acting as a speech booster. It is only active when speech is present, and remains idle when only noise is present.
Jesteadt's forward masking model [6] provides a reasonable approximation to the forward masking effect. Strope et al. [7] extended the Jesteadt experiment to 120 ms. In Jesteadt's and Najafzadeh's forward masking models [6,8], only masker level and delay are taken into account. In [9], Gunawan and Ambikairajah refined the model to reflect forward masking data more accurately by averaging several parameters across frequencies. Currently, the majority of these works focus on formulating mathematical models of forward masking. Such models are often too general, and refining them further requires software that can curve-fit multi-dimensional data. For this purpose, we utilise a neural network to better approximate the forward masking threshold.
To evaluate the performance of our forward masking model, five speech enhancement algorithms were implemented: spectral subtraction [10], spectral subtraction with minimum statistics [11], speech boosting [5], speech boosting using forward masking model 1 [4] and forward masking model 2 [9]. The Perceptual Evaluation of Speech Quality (PESQ, ITU-T P.862) measure was used to benchmark the various methods.
A speech enhancement algorithm that exploits the temporal masking properties of the human auditory system has a very high computational requirement, especially when the noisy speech signal is long or the number of subbands is high. Recent advances in multi-core systems make them a natural and viable option for meeting this requirement. Therefore, the objective of this paper is twofold: to evaluate the performance of our forward masking model in terms of enhanced speech quality, and to implement and evaluate a parallel speech enhancement algorithm on a multi-core system. The rest of the paper is organized as follows. Section 2 discusses the development of forward masking models using neural networks. Section 3 describes the sequential speech enhancement algorithm, while Section 4 discusses the parallel implementation of the speech enhancement algorithm. Experimental results and analyses for the sequential and parallel algorithms are discussed in Section 5. Finally, Section 6 concludes this paper.

FORWARD MASKING MODELS USING NEURAL NETWORKS
Neural networks have been applied to various applications within the following broad categories: function approximation (or regression analysis), classification, and data processing (filtering, clustering, blind source separation, etc.). Brown et al. [12] applied non-recurrent neural networks to simultaneous masking modelling. In this paper, a neural network is employed to approximate the forward masking threshold from three input parameters: frequency, masker level, and delay.
By taking into account the threshold in quiet (TIQ), the absolute threshold of forward masking (FM) can be calculated using the equation we have developed below:
As stated in [13], the threshold in quiet is a function of frequency and signal duration. By curve-fitting a set of 120 data points compiled from [13], we approximated the threshold in quiet to be as follows: (3)
The amount of forward masking can be approximated using a feed-forward neural network as shown in Fig. 1. A network with one hidden layer, as in Fig. 1, can approximate any continuous function to arbitrary accuracy given enough hidden units [14]. To avoid over-fitting to the training data, Bayesian regularization as proposed in [15] was used. Figure 2 shows the amount of forward masking against masker level $L_m$ and delay $\Delta t$ at a frequency of 500 Hz using the neural network. Similar plots can be obtained for other frequencies, thus providing a more accurate estimation of forward masking data.
Fig. 2: Amount of forward masking estimation at 500 Hz.
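The network structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the trained model from this paper: the hidden-layer size, the tanh activation, the random weights, the input scaling, and the names `init_network`, `forward`, and `regularised_loss` are all hypothetical. The fixed L2 weight penalty is a simple stand-in for Bayesian regularization [15], which instead infers the penalty strength from the data.

```python
import math
import random


def init_network(n_inputs=3, n_hidden=8, seed=0):
    """Random small weights for a 3-input, one-hidden-layer, one-output network.

    The hidden size of 8 is an arbitrary choice for illustration.
    """
    rng = random.Random(seed)
    return {
        "W1": [[rng.gauss(0.0, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.gauss(0.0, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(params, x):
    """Map x = (frequency, masker level, delay) to an amount of forward masking.

    Inputs are assumed to be pre-scaled to roughly [0, 1].
    """
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(params["W1"], params["b1"])
    ]
    return sum(w * h for w, h in zip(params["W2"], hidden)) + params["b2"]


def regularised_loss(params, samples, targets, weight_decay=0.01):
    """Mean squared error plus a fixed L2 weight penalty.

    Assumption: this fixed penalty only imitates the effect of Bayesian
    regularization; it does not estimate the penalty strength.
    """
    mse = sum((forward(params, x) - t) ** 2 for x, t in zip(samples, targets)) / len(samples)
    penalty = sum(w * w for row in params["W1"] for w in row)
    penalty += sum(w * w for w in params["W2"])
    return mse + weight_decay * penalty
```

Calling `forward(init_network(), (0.5, 0.7, 0.1))` returns a single scalar; in practice the weights would be fitted to the psychoacoustic data by minimising a loss such as `regularised_loss`.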

SPEECH ENHANCEMENT
This section presents the incorporation of our model into the speech enhancement algorithm developed in [4]. Speech that has been contaminated by noise can be expressed as $x(n) = s(n) + v(n)$, where $x(n)$ is the noisy speech, $s(n)$ is the clean speech signal and $v(n)$ is the additive noise, all of which are in the discrete time domain. The objective in speech enhancement is to suppress the noise, thus resulting in an output signal $y(n)$ that has a higher signal-to-noise ratio (SNR).
The speech enhancement algorithm that incorporates forward masking [4] is shown in Fig. 3. The analysis filtering of the input signal can be described in the time domain as $x_m(n) = x(n) * h_m(n)$, where $*$ denotes convolution and $h_m(n)$ is the impulse response of the $m$-th filter. The global forward masking threshold (GFM) and the forward masking threshold in each subband then drive the gain calculation.
Our objective is now to find a gain function that weights the input signal subbands, $x_m(n)$, based on the forward masking threshold to noise ratio (MNR). The MNR in each subband can be calculated as the ratio of a short-term average forward masking threshold, $P_m(n)$, and an estimate of the noise floor level, $Q_m(n)$, as given in Eqn (8). The short-term average temporal masking threshold in subband $m$ is calculated using a small positive smoothing constant. The short-term average and the noise floor estimate are combined in a novel manner in order to calculate the gain function, and a limit (in dB) is imposed on the gain.
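The subband filtering step can be illustrated with a direct time-domain convolution. The sketch below assumes FIR analysis filters given as lists of coefficients; the function names `fir_filter` and `analysis_filter_bank` are hypothetical, and the filter design itself is not taken from this paper.

```python
def fir_filter(x, h):
    """Causal FIR filtering: y(n) = sum_k h(k) * x(n - k), same length as x."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:  # samples before the start of x are treated as zero
                acc += hk * x[n - k]
        y.append(acc)
    return y


def analysis_filter_bank(x, filters):
    """Split x into len(filters) subband signals, one per impulse response."""
    return [fir_filter(x, h) for h in filters]
```

For example, `analysis_filter_bank(x, [h_0, h_1])` returns two subband signals of the same length as `x`. A practical implementation would more likely use an FFT-based or polyphase structure for efficiency.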

PARALLEL SPEECH ENHANCEMENT ALGORITHM
The design of an efficient parallel speech enhancement algorithm can be a challenging task. The first step in the parallelization of any sequential code is to identify which part of the code takes the longest execution time. Using the Matlab profiling tool, it was identified that the calculation of the forward masking threshold and the gain calculation for each subband (see Eqn 5 to 9) took the longest execution time. In this paper, we utilize the Matlab parallel computing toolbox. The hardware used was an AMD quad-core 2.5 GHz system with 2 GBytes of memory.
A master-slave paradigm is used in the parallelization. To achieve a scalable parallel implementation of the speech enhancement algorithm, we used the data-parallel, or single program multiple data (SPMD), programming model. A single program was written for both master and slave processes, which execute asynchronously on each node, with each process working on a different piece of data. Two data partition schemes are available in the speech enhancement algorithm: time partition and frequency (subband) partition. In time partition, a long noisy speech file is split into shorter time segments that are processed individually. In frequency (subband) partition, the subbands are divided among the slaves. Because the calculation of temporal masking requires information from previous frames, subband partition is more appropriate for parallelization, and it is used in our implementation.
Figure 4 shows the flowchart of the parallel speech enhancement algorithm, which can be implemented on a multi-core system and/or a cluster system. Initially, the parallel program starts with initialization at every node. Of the two communication schemes available in the Matlab parallel computing toolbox, distributed arrays and message passing, we use message passing as it is the more flexible. Then, the noisy speech signal is partitioned (using subband partition) and distributed to the N cores, configured in master-slave fashion. Each slave then filters the noisy speech signal, calculates the forward masking threshold, determines the gain for each subband, and applies the gain to each subband signal. After obtaining the denoised speech signal for each subband, each slave sends its results to the master node. Finally, speech reconstruction and PESQ evaluation are performed at the master node.
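The subband partition and the gather step at the master can be sketched as follows. This is a Python illustration rather than the Matlab message-passing code used in this work; `partition_subbands`, `enhance_subband`, and `enhance_parallel` are hypothetical names, and `enhance_subband` is an identity placeholder for the real per-subband processing (filtering, masking threshold, gain). Threads are used only to keep the sketch short; CPU-bound work in Python would normally need separate processes to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_subbands(n_subbands, n_workers):
    """Split subband indices 0..n_subbands-1 into contiguous, near-equal blocks."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    base, extra = divmod(n_subbands, n_workers)
    blocks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # first `extra` workers get one more
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def enhance_subband(m, signal):
    """Placeholder for the per-subband work; returns the signal unchanged."""
    return list(signal)


def enhance_parallel(signal, n_subbands, n_workers):
    """Each worker processes its own block of subbands; the caller gathers results."""
    blocks = partition_subbands(n_subbands, n_workers)

    def worker(block):
        return {m: enhance_subband(m, signal) for m in block}

    results = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(worker, blocks):
            results.update(partial)  # the "master" collects every subband
    return [results[m] for m in range(n_subbands)]
```

With 10 subbands and 4 workers, `partition_subbands(10, 4)` assigns 3, 3, 2, and 2 subbands respectively, so no worker carries more than one subband above the average load.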

PERFORMANCE EVALUATION
In this section, the performance of the sequential code is evaluated in terms of the subjective and objective quality of the enhanced speech. Furthermore, the performance of the parallel code is presented in terms of speedup for various numbers of cores.

Subjective and Objective Quality
In order to assess the performance of the new forward masking model in enhancing speech signals, a large number of simulations were performed. Six speech files were taken from the EBU SQAM data set, including English female and male speakers, French female and male speakers, and German female and male speakers. The length of the files was between 17 and 20 seconds.
Different types of background noise from the NOISEX-92 and AURORA databases were used, including car, white noise, pink noise, F16, factory, babble, airport, exhibition, restaurant, street, subway and train noise. The variance of the noise was adjusted to obtain SNRs of -5 dB, 0 dB, 5 dB, and 10 dB.
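Adjusting the noise variance to reach a target SNR amounts to scaling the noise before adding it to the clean speech. A minimal sketch, assuming the SNR is computed over the whole file from average powers (the paper does not state whether a global or segmental SNR was used); the names `mix_at_snr` and `measure_snr_db` are hypothetical.

```python
import math


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def measure_snr_db(speech, noise):
    """Global SNR in dB between a speech signal and a noise signal."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


def mix_at_snr(speech, noise, snr_db):
    """Return x(n) = s(n) + g * v(n), with g chosen to give the requested SNR."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech, p_noise = mean_power(speech), mean_power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("speech and noise must both have non-zero power")
    # Solve 10*log10(p_speech / (g^2 * p_noise)) = snr_db for g.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * v for s, v in zip(speech, noise)]
```

Calling `mix_at_snr(speech, noise, -5.0)` yields a noisy signal whose noise component is 5 dB stronger than the speech.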
The PESQ (Perceptual Evaluation of Speech Quality, ITU-T P.862) measure [16] was utilised for the objective evaluation. Note that PESQ has a 93.5% correlation with subjective tests [16]. To evaluate the performance of the speech enhancement algorithms, we developed a new measure of the improvement achieved, based on the PESQ scores before and after enhancement. A total of 288 data sets, from six speech files, twelve noises, and four SNRs, were simulated for each method. The average quality improvement achieved by the various speech enhancement methods is shown in Figure 4. Note that the results for the various speech files and noises were averaged at each of the -5, 0, 5, and 10 dB SNRs. From these results, the speech boosting technique incorporating the neural network forward masking model outperforms the other methods at all SNRs. In order to analyse the performance of our proposed method in more detail, the average quality improvement at -5, 0, 5, and 10 dB SNRs for the various noises is shown in Table 1. The best result for each type of noise condition is shown in bold, from which it can be seen that our method using the neural network forward masking model provides a better PESQ improvement than the five other methods tested.
Table 2 shows the average quality improvement at -5, 0, 5 and 10 dB SNRs for the various speech files. The best result for each individual speech file is shown in bold. The table shows that a more accurate forward masking threshold calculation leads to better enhanced speech quality. Furthermore, informal listening tests confirm that the speech processed with the proposed algorithm sounds more pleasant to a human listener than that obtained by the other algorithms.
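The bookkeeping behind such an evaluation can be sketched as follows. The improvement formula used here, the relative PESQ gain in percent, is an assumption made for illustration and is not confirmed by the text; the function names and the layout of `scores` are likewise hypothetical. The PESQ scores themselves would come from an ITU-T P.862 implementation, which is not shown.

```python
from itertools import product


def pesq_improvement_percent(pesq_ref, pesq_proc):
    """Assumed measure: relative PESQ gain of enhanced over noisy speech, in percent."""
    if pesq_ref <= 0.0:
        raise ValueError("reference PESQ score must be positive")
    return 100.0 * (pesq_proc - pesq_ref) / pesq_ref


def average_improvement_by_snr(scores):
    """Average the improvement over speech files and noises for each SNR.

    scores maps (speech_file, noise_type, snr_db) -> (pesq_ref, pesq_proc).
    """
    grouped = {}
    for (_speech, _noise, snr_db), (ref, proc) in scores.items():
        grouped.setdefault(snr_db, []).append(pesq_improvement_percent(ref, proc))
    return {snr: sum(vals) / len(vals) for snr, vals in grouped.items()}


def test_conditions(n_speech=6, n_noise=12, snrs=(-5, 0, 5, 10)):
    """Enumerate every (speech, noise, SNR) combination: 6 * 12 * 4 = 288."""
    return list(product(range(n_speech), range(n_noise), snrs))
```

Under this assumed definition, a noisy-speech score of 2.0 and an enhanced-speech score of 2.2 would count as a 10% improvement.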

Parallel Performance
The computing environment used in this research was an AMD Phenom quad-core 2.5 GHz system with 2 GBytes of memory. This section analyses the parallel performance of the speech enhancement algorithm in terms of parallel execution time and speedup. Table 3 shows the performance of the parallel speech enhancement algorithm for 1, 2, 3, and 4 processors. For evaluation purposes, we used a female speech signal with various noises and various SNRs and took the average of the parallel execution time and speedup. The parallel speech enhancement algorithm achieves almost linear speedup, indicating highly efficient parallelization. This is likely due to the fast communication between cores on a single machine, which adds little overhead. When the number of nodes is high, the communication time will affect the speedup, especially in a cluster system. Therefore, it would be interesting to evaluate our parallel speech enhancement algorithm on a cluster system with a higher number of nodes.
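Speedup and parallel efficiency follow directly from the measured execution times. The sketch below uses the standard definitions, speedup $S_p = T_1 / T_p$ and efficiency $E_p = S_p / p$; the function name `speedup_table` is hypothetical, and the timings in the usage note are made-up placeholders, not the values reported in Table 3.

```python
def speedup_table(times_by_cores):
    """Compute speedup and efficiency from {number_of_cores: execution_time}.

    The single-core time (key 1) is the baseline T_1.
    """
    if 1 not in times_by_cores:
        raise ValueError("a single-core timing (key 1) is required as baseline")
    t1 = times_by_cores[1]
    table = {}
    for p, tp in sorted(times_by_cores.items()):
        speedup = t1 / tp
        table[p] = {"speedup": speedup, "efficiency": speedup / p}
    return table
```

For placeholder timings `{1: 100.0, 2: 50.0, 4: 25.0}`, the speedup on four cores is 4.0 and the efficiency is 1.0, which is what perfectly linear speedup looks like; real measurements fall below this as communication overhead grows.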

CONCLUSIONS
In this paper, a new forward masking model using neural networks has been proposed and incorporated into a speech enhancement algorithm. The performance of our speech enhancement algorithm employing the new forward masking model was compared with five other speech enhancement methods (including two other functional models of forward masking) over twelve different noise types and four SNRs. PESQ results reveal that the proposed algorithm outperforms the other algorithms by 5-15%, depending on the SNR. Hence, it appears that the proposed forward masking model has good potential for speech enhancement applications across many types and intensities of environmental noise. On a quad-core system, the parallel speech enhancement algorithm was very efficient, achieving almost linear speedup.

Fig. 3. By filtering the input signal $x(n)$ using a bank of $M$ analysis filters, the signal is divided into $M$ subbands, each denoted by $x_m(n)$, where $m$ is the subband index.

FM are used to calculate the gain in each subband. The gain is a weighting function that amplifies the signal in band $m$ during speech activity.
sensitivity of the algorithm to changes in forward masking threshold, and acts as a smoothing factor.The slowly varying noise floor estimate for the m-th subband, fast the noise floor level estimate in the m-th subband adapts to changes in the noise environment.The variables


positive constant controlling the contribution of the forward masking threshold ratio and the short-term MNR. Since the calculation of the gain involves a division, care must be taken to ensure that the quotient does not become excessively large due to a small $Q_m(n)$. In a situation with a very high MNR, the gain will become very large if no limit is imposed on this function. Therefore, a limiter can be applied on
$\mathrm{PESQ}_{ref}$, which is the PESQ score for the reference clean speech, $s(n)$, and the corrupted speech, $x(n)$. The PESQ score of the enhanced speech, $y(n)$, was also measured and denoted as $\mathrm{PESQ}_{proc}$. Therefore, we can derive a new value, which measures the PESQ improvement achieved by the algorithm as follows:

Table 3: Parallel execution time and speedup for various numbers of processors.