
Essay: Implementation of Sub band Coding and Pitch Extraction Using Cumulative Impulse Strength

Published: 15 October 2019. Last modified: 22 July 2024.

ABSTRACT: Pitch extraction is a challenging task in speech signal analysis because of the time-varying characteristics of the source. The impulses in a quasi-periodic sequence corrupted by noise are located by a temporal measure called the cumulative impulse strength (CIS), which is subsequently used to detect glottal closure instants (GCIs) with methods such as the Dynamic Programming Phase Slope Algorithm (DYPSA) and Speech Event Detection using the Residual Excitation and the Mean-Based Signal (SEDREAMS). Sub-band coding (SBC) is the technique of breaking the source signal into constituent parts and coding the parts separately: the speech signal is divided into two or more frequency bands, and each sub-band signal is coded individually. The sub-bands are processed and recombined to form the output signal, whose bandwidth covers the whole frequency spectrum. After the signal is decomposed into low- and high-frequency components, decimation and interpolation are performed in the frequency domain. Pitch extraction is then implemented using the cumulative impulse strength for the DYPSA and SEDREAMS algorithms. The proposed structure significantly reduces the error rate, is robust to additive noise, and is able to locate the GCIs.


INDEX TERMS—SEDREAMS, GCI, GOI, SBC, DYPSA

Sub-band coding is a technique of decomposing the source signal into constituent parts and coding the parts separately. A filter that isolates the low-frequency components is called a low-pass filter [24]; similarly, we have high-pass and band-pass filters. In general, a filter that isolates a number of bands simultaneously is called a sub-band filter. By applying optimal FIR filters to each sub-band signal, these filters reduce additive noise components with less speech distortion than conventional approaches [24], [27].

Section II of this paper briefly reviews the implementation of sub-band coding, which decomposes the signal into high- and low-frequency components, and its use in data compression by performing decimation and interpolation in the frequency domain. In Section III, the cumulative impulse strength is implemented to find the pitch in a speech signal, and the GCIs are analyzed for the DYPSA and SEDREAMS algorithms.

I. INTRODUCTION

A common approach to locating discontinuities in the linear production model of speech [25] is to deconvolve the excitation signal and the vocal tract filter with linear predictive coding (LPC) [5]. Preliminary efforts are documented in [25]; more recent algorithms use known features of speech to achieve more reliable detection [13], [14], [15]. The identifiability of GCIs from reverberant speech using DYPSA, and a new extension to the multi-microphone case, was assessed in [17]. DYPSA implements the phase-slope function for estimating GCI candidates from the speech signal [27].

Accurate estimation of the frequency response of the supra-laryngeal vocal-tract system in the closed-phase region has been analyzed in [1], [2]. Knowledge of the epochs makes it possible to determine the characteristics of the voice source by careful analysis of the signal within a glottal pulse. The epochs can be used as pitch markers for prosody manipulation, which is useful in applications such as text-to-speech synthesis, voice conversion and speech-rate conversion [3], [4]. Knowing the epoch locations, we can estimate the time delay between speech signals collected over a pair of spatially distributed microphones [5]. The segmental signal-to-noise ratio (SNR) of the speech signal is high in the regions around epochs, and hence it is possible to enhance the speech by exploiting the characteristics of speech signals around the epochs [6]. It has been shown that excitation features derived from the regions around the epoch locations provide speaker-specific information complementary to the existing spectral features [7], [8].

II. DECIMATION AND INTERPOLATION

2.1 Decimation

The reduction of the sampling rate of a signal x(n) by a factor D is known as decimation, or down-sampling. Suppose we down-sample the signal x(n), with spectrum X(ω), by an integer factor D, and assume X(ω) is non-zero only in the range 0 to π, which is equivalent to F ≤ F_x/2. If the sampling rate is reduced by simply selecting every D-th value, the resulting signal is aliased about the folding frequency F_x/2D. To avoid this, the bandwidth of x(n) must first be reduced to Fmax = F_x/2D or, equivalently, to ωmax = π/D; the signal x(n) can then be down-sampled without aliasing [1-3], [6]. To eliminate the spectrum in the range π/D to π, the input is passed through a low-pass filter, which also leaves the spectrum below π/D available for further processing. The impulse response h(n) and the frequency response HD(ω) together constitute the low-pass filter design [24], [27].
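The band-limit-then-discard procedure described above can be sketched as follows. This is a minimal illustration rather than code from the paper: the windowed-sinc filter design, the tap count and the function name are our own assumptions.

```python
import numpy as np

def decimate(x, D, num_taps=63):
    """Reduce the sampling rate of x by an integer factor D.

    A windowed-sinc low-pass FIR with cutoff pi/D band-limits the
    signal first, so that keeping every D-th sample does not alias.
    """
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(n / D) / D * np.hamming(num_taps)   # cutoff pi/D, unit DC gain
    y = np.convolve(x, h, mode="same")              # anti-aliasing filter
    return y[::D]                                   # keep every D-th sample

fs, D = 8000, 4
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t)                     # in-band tone: survives
y = decimate(x, D)
z = decimate(np.sin(2 * np.pi * 1500 * t), D)       # above Fs/2D: suppressed
assert len(y) == len(x) // D
assert np.max(np.abs(z[100:-100])) < 0.2
```

Without the low-pass stage, the 1500 Hz tone would alias to 500 Hz at full amplitude after down-sampling; with it, only a strongly attenuated residue remains.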

2.2 Interpolation

As the complement of decimation, increasing the sampling rate of a signal x(n) by a factor I is known as interpolation, or up-sampling, where I is the integer interpolation factor. Assume that the signal x(n), with spectrum X(ω) non-zero in the range 0 to π, is up-sampled by I. The sampling rate is increased by the integer factor I by interpolating I−1 new samples between successive values of the signal. The interpolation can be done in many ways; one approach is to maintain the spectral shape of the signal sequence x(n) [8], [9], [27].
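One standard realization of this idea, given here as an illustrative sketch rather than the paper's code, inserts I−1 zeros between successive samples and removes the resulting spectral images with a low-pass filter of cutoff π/I and gain I, which preserves the spectral shape of x(n):

```python
import numpy as np

def interpolate(x, I, num_taps=63):
    """Increase the sampling rate of x by an integer factor I."""
    up = np.zeros(len(x) * I)
    up[::I] = x                                   # insert I-1 zeros per sample
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(n / I) * np.hamming(num_taps)     # cutoff pi/I, gain I
    return np.convolve(up, h, mode="same")        # anti-imaging filter

fs, I = 2000, 4
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t)
y = interpolate(x, I)
# Away from the edges, y matches the same tone sampled at I*fs.
ref = np.sin(2 * np.pi * 100 * np.arange(I * fs) / (I * fs))
assert len(y) == I * len(x)
assert np.max(np.abs(y[200:-200] - ref[200:-200])) < 0.05
```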

2.3  Sub-Band Coding using Decimation and Interpolation

Fig. 1 shows the block diagram of the sub-band speech encoder. At the first stage, the sampled input speech signal is divided with respect to bandwidth into a low-pass signal (0 ≤ F ≤ Fs/4) and a high-pass signal (Fs/4 < F ≤ Fs/2). At the second stage, the low-pass signal is again divided into two equal bands, a low-pass signal (0 ≤ F ≤ Fs/8) and a high-pass signal (Fs/8 < F < Fs/4). A third stage of frequency subdivision then splits the low-pass signal from the second stage into two equal-bandwidth signals [27]. Thus the signal is subdivided into four frequency bands covering three octaves, as shown in Fig. 1. Decimation by a factor of 2 is performed after each frequency division. A different number of bits per sample is allocated to each of the four bands, so that the bit rate of the digitized speech signal can be reduced.

In sub-band coding, good performance can be achieved by proper filter design; the aliasing that results from decimation can then be neglected [16], [27]. Decoding of the sub-band signal is the reverse of the encoding process: as shown in Fig. 2, the adjacent high-pass and low-pass frequency bands are interpolated, filtered and then combined to form the final output [27].

Fig 1: Block diagram of the sub-band speech encoder (cascaded low-pass/high-pass filter pairs, each followed by a decimator; each band output is encoded and sent to the channel)

Fig 2: Block diagram of the sub-band speech decoder (each band is decoded, interpolated, filtered and summed to form the output)

In each filtering section, a pair of quadrature mirror filters (QMF), i.e. a high-pass and a low-pass filter, is used, as shown in Fig. 3. Bandwidth compression of the signal can be achieved in sub-band coding when the signal energy is concentrated in a particular frequency band. Sub-band coding is effectively implemented by multirate signal processing [27].

The two-channel QMF bank shown in Fig. 3 is the basic building block in speech signal encoding. It consists of decimators in the encoding part and interpolators in the decoding part. Here h0(n) and h1(n) are the impulse responses of the low-pass and high-pass filters in the encoding part, and g0(n) and g1(n) are the impulse responses of the low-pass and high-pass filters in the decoding part [27].

Fig 3: Two-channel quadrature mirror filter bank (analysis filters H0(ω) and H1(ω) followed by ↓2 decimators; ↑2 interpolators followed by synthesis filters G0(ω) and G1(ω), whose outputs are summed to form x^(n))
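The paper does not give specific QMF coefficients, so as an illustration the sketch below uses the two-tap Haar pair, the simplest filters satisfying the QMF condition h1(n) = (−1)^n h0(n). With the matching synthesis filters, the aliasing introduced by the ↓2 decimators cancels between the two bands and the input is reconstructed exactly.

```python
import numpy as np

s = 1.0 / np.sqrt(2.0)   # two-tap Haar QMF pair (illustrative choice)

def qmf_analysis(x):
    """H0/H1 filtering followed by decimation by 2 (x must have even length)."""
    x = np.asarray(x, dtype=float)
    low = s * (x[0::2] + x[1::2])     # low band: h0 = [s, s]
    high = s * (x[0::2] - x[1::2])    # high band: h1 = [s, -s]
    return low, high

def qmf_synthesis(low, high):
    """Interpolation by 2, G0/G1 filtering, and summation of the bands."""
    x = np.empty(2 * len(low))
    x[0::2] = s * (low + high)        # even-indexed output samples
    x[1::2] = s * (low - high)        # odd-indexed output samples
    return x

x = np.random.default_rng(0).standard_normal(64)
low, high = qmf_analysis(x)
x_hat = qmf_synthesis(low, high)
assert np.allclose(x_hat, x)          # perfect reconstruction
```

Longer QMF pairs trade this exact reconstruction for sharper band separation; the cancellation mechanism is the same.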

III. CUMULATIVE IMPULSE STRENGTH

Let r[n] be an amplitude-perturbed, quasi-periodic impulse train of length N, represented as follows:

r[n] = ∑_(k=1)^N A_k δ[n − n_k],  (1)

n_k = n_(k−1) + N_0 + Δ_k,  2 ≤ k ≤ N,  (2)

where n_k is the location of the k-th impulse with amplitude A_k, δ[n − n_k] denotes the Kronecker delta function, N_0 is the average period of r[n], and Δ_k is the deviation of n_k − n_(k−1) from N_0. The CIS measure is defined recursively at each location n by combining the effect of the signal r and the CIS C around the previous impulse location. That is, if ρ = max_k |Δ_k|, the CIS C[n] at the n-th sample is defined as follows [29]:

C[n] = max_(n−N_0−ρ ≤ m ≤ n−N_0+ρ) (C[m] + r[m]).  (3)

In order to locate the impulses from C[n], we define one more sequence V[n] as follows:

V[n] = argmax_(n−N_0−ρ ≤ m ≤ n−N_0+ρ) (C[m] + r[m]).  (4)

That is, at each sample n, V[n] stores the location that maximizes C[m] + r[m] within the search interval defined in Eq. (3). Once the location of the last impulse is known, a back-tracking procedure is employed to locate all the impulses from V[n] as follows: if n_k corresponds to the k-th impulse location, the (k−1)-th impulse location is given by V[n_k]. The location of the final impulse is defined to be that which maximizes r[m], N−1−N_0+ρ ≤ m ≤ N−1, because the maximum of r[m] within the last periodic interval corresponds to the final impulse [29].
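Eqs. (3) and (4) and the back-tracking step translate almost directly into code. The sketch below is our own minimal implementation (the function name and test signal are assumptions, not taken from [29]); a clean jittered impulse train is used so the recovered locations can be checked exactly, although the measure is designed to tolerate additive noise as well.

```python
import numpy as np

def cis_epochs(r, N0, rho):
    """Locate the impulses of a quasi-periodic train r using the CIS."""
    N = len(r)
    C = np.zeros(N)                      # cumulative impulse strength, Eq. (3)
    V = np.full(N, -1)                   # maximizing locations, Eq. (4)
    for n in range(N):
        lo, hi = n - N0 - rho, n - N0 + rho
        if hi < 0:                       # no previous period to accumulate yet
            continue
        lo = max(lo, 0)
        m = lo + int(np.argmax(C[lo:hi + 1] + r[lo:hi + 1]))
        C[n] = C[m] + r[m]
        V[n] = m
    # Final impulse: maximum of r within the last periodic interval.
    start = max(N - 1 - N0 + rho, 0)
    n = start + int(np.argmax(r[start:]))
    epochs = [n]
    while V[n] >= 0:                     # back-track through V
        n = int(V[n])
        epochs.append(n)
    return epochs[::-1]

# Impulse train with average period N0 = 50 and |Delta_k| <= 2.
r = np.zeros(250)
true_locs = [10, 60, 112, 160, 211]
r[true_locs] = [1.0, 0.8, 1.2, 0.9, 1.1]
assert cis_epochs(r, N0=50, rho=3) == true_locs
```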

3.1 SEDREAMS Algorithm

Glottal-synchronous speech processing is a field of speech science in which the pseudo-periodicity of voiced speech is exploited. Much research has gone into determining pitch contours, and speech quality analysis has been promising in the field of phonetics [28]. More recent efforts in the detection of glottal closure instants (GCIs) enable the estimation of both pitch contours and, additionally, the boundaries of individual cycles of speech. Such information has been put to practical use in applications including prosodic speech modification [28], speech dereverberation [21], glottal flow estimation, speech synthesis [18], data-driven voice source modeling and causal-anticausal deconvolution of speech signals [28]. The minimum and maximum pitch frequencies are taken as 80 Hz and 240 Hz. Here we hypothesize that the peaks of the sub-band signals correspond to GCIs.

The Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) algorithm is a technique used to determine the GCI locations in speech automatically. It also determines the GOIs, and it is quite consistent [28]. Most recent research focuses only on determining the GCIs, while GOI determination is ignored. The SEDREAMS algorithm involves two steps: the determination of short intervals where GCIs are expected to occur, and the refinement of the GCI locations within these intervals [28].
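The first SEDREAMS step relies on a mean-based signal: a windowed running mean of the speech whose window is roughly 1.75 times the average pitch period, so that it oscillates once per glottal cycle [28]. A sketch of this step follows (the Blackman window choice and the function names are our assumptions), applied to a synthetic excitation train:

```python
import numpy as np

def mean_based_signal(x, T0, ratio=1.75):
    """Windowed running mean of x; window about 1.75 pitch periods long."""
    L = int(ratio * T0) | 1              # force an odd window length
    w = np.blackman(L)
    return np.convolve(x, w / w.sum(), mode="same")

def local_minima(y):
    """Indices n with y[n] < y[n-1] and y[n] < y[n+1]."""
    return np.flatnonzero((y[1:-1] < y[:-2]) & (y[1:-1] < y[2:])) + 1

T0 = 160                                 # 100 Hz pitch at fs = 16 kHz
x = np.zeros(2000)
x[np.arange(100, 2000, T0)] = 1.0        # one excitation impulse per cycle
y = mean_based_signal(x, T0)
interior = local_minima(y)
interior = interior[(interior > 200) & (interior < 1800)]
assert np.all(np.diff(interior) == T0)   # one minimum per glottal cycle
```

In SEDREAMS proper, the interval between a minimum of this signal and the following maximum is where a GCI is expected; the exact instant is then refined from the LP residual.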

3.2 DYPSA

The Dynamic Programming Phase Slope Algorithm (DYPSA) [14] identifies the peak instants and estimates the GCIs from the linear prediction residual of the speech signal. It consists of two main components, defined below: estimation of GCI candidates with the group delay function of the LP residual, and dynamic programming [14].

Group Delay Function:

The group delay function is the average slope of the unwrapped phase spectrum of the short-time Fourier transform of the LP residual [28], [29]. It can be shown to identify impulsive features in a signal accurately, provided their minimum separation is known. GCI candidates are selected at the negative-going zero crossings of the group delay function.
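The phase-slope function has a convenient time-domain form: the energy-weighted average group delay of a windowed frame equals the frame's centre of energy, measured from the window centre. The sketch below is our own illustration (the frame length and function names are assumptions); it shows that the negative-going zero crossings land on isolated impulses:

```python
import numpy as np

def phase_slope(r, M=21):
    """Energy-weighted average group delay of each length-M frame of r.

    Equals the frame's temporal centre of energy minus the window
    centre; it decreases through zero as an impulse passes the
    centre of the sliding window.
    """
    m = np.arange(M)
    half = (M - 1) / 2
    d = np.empty(len(r) - M)
    for n in range(len(d)):
        frame = r[n:n + M] ** 2
        e = frame.sum()
        d[n] = (m @ frame) / e - half if e > 0 else -half
    return d

def gci_candidates(r, M=21):
    """GCI candidates at negative-going zero crossings of the phase slope."""
    d = phase_slope(r, M)
    idx = np.flatnonzero((d[:-1] >= 0) & (d[1:] < 0))
    return (idx + (M - 1) // 2).tolist()

# LP-residual-like test signal: isolated impulses at known instants.
r = np.zeros(200)
r[[50, 120]] = 1.0
assert gci_candidates(r) == [50, 120]
```

On a real LP residual, these crossings yield candidates; the dynamic-programming stage then discards the spurious ones.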

Dynamic Programming:

In order to select the subset of candidates corresponding to the true GCIs, erroneous candidates are eliminated by minimizing a cost function based on the known characteristics of the speech signal [14].

IV. PROPOSED METHOD

First, a wave signal is taken and decimated into four frequency bands by two stages of convolution. The signal is filtered into lower and upper speech bands by taking the FFT of the time-domain response of the filter. The resulting frequency-domain representations of the four decimated bands are then reconstructed by again convolving the four bands and taking the FFT of the resulting signal.

The synthesized signal is written to a wave file and given as input for finding the GCI locations. The first step is estimating the fundamental frequency, known as pitch tracking. The epoch locations are determined by taking the average pitch period of the samples in the speech signal.
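The average-pitch-period step can be sketched with a plain autocorrelation search restricted to the 80-240 Hz range stated earlier. This is an illustrative implementation; the function name and the synthetic test tone are our assumptions:

```python
import numpy as np

def average_pitch_period(x, fs, fmin=80.0, fmax=240.0):
    """Average pitch period of x in samples, searched over 80-240 Hz."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation, lag >= 0
    lmin = int(fs / fmax)                              # shortest admissible lag
    lmax = int(fs / fmin)                              # longest admissible lag
    return lmin + int(np.argmax(r[lmin:lmax + 1]))

fs = 16000                                             # APLAWD sampling rate
t = np.arange(int(0.3 * fs)) / fs
x = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
T0 = average_pitch_period(x, fs)
assert T0 == 160                                       # 16000 samples/s / 100 Hz
```

The lag of the autocorrelation peak gives the average period in samples; dividing the sampling rate by it gives the fundamental frequency.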

In the second step, the polarity of the speech signal is estimated, and the GCI locations are obtained with the SEDREAMS algorithm from the maxima of the LPC residual.

Then the mean-based signal is determined and its low-frequency content is removed. The minima and maxima are detected, and the median positions of the GCIs are refined within each cycle. Finally, complex-cepstrum causal/anticausal decomposition is used to find the glottal flow estimate when specific window criteria are met. The speech signal is taken from the APLAWD database; the sampling frequency is 16 kHz. The pitch contours are extracted from the library, and the GCIs are calculated directly from the speech waveform.

Fig 4: Output of sub-band coding (comparison of the magnitude spectra of the original band, |X|, and the synthesized band)

Fig 5: GCI detection with the SEDREAMS algorithm

The GCIs are found near the local maxima of the LP residual signal; the maximum of the LP residual within each specified interval is the final estimated GCI.

The above graphs show the GCI detection of DYPSA and SEDREAMS using sub-band coding without considering the average pitch period. If the cumulative impulse strength is considered, the GCI locations can be detected more precisely and with greater robustness to noise.

Fig 6: GCI detection with the DYPSA algorithm (residual excitation, GCI locations and V-UV decisions plotted against sample index, ×10^4)


Conclusion

The proposed method addresses the problem of decomposing a signal into low- and high-frequency components. The sub-band coding of the speech signal is performed using low-pass filters, high-pass filters, decimators and interpolators. This system improves efficiency, and the error rate is reduced compared with delta-modulation encoding systems. The proposed structure is used with the SEDREAMS and DYPSA algorithms to analyze the glottal closure instant (GCI) locations. With sub-band coding as a pre-processing unit and the cumulative impulse strength for pitch extraction, both algorithms detect GCIs better than without sub-band coding. In future work, different filter characteristics can be explored.

References

[1] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, Prentice Hall, New Jersey, 2008.

[2] R. A. Roberts and C. T. Mullis, Digital Signal Processing, Addison-Wesley, Reading, Mass., 2006.

[3] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Prentice Hall, Englewood Cliffs, New Jersey, 2007.

[4] R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing, Prentice Hall, Englewood Cliffs, New Jersey, 1983.

[5] R. W. Schafer and L. R. Rabiner, "A Digital Signal Processing Approach to Interpolation," Proc. IEEE, vol. 61, pp. 692-702, June 1973.

[6] C. D. McGillem and G. R. Cooper, Continuous and Discrete Signal and System Analysis, 2nd ed., Holt, Rinehart and Winston, New York, 1984.

[7] R. E. Crochiere and L. R. Rabiner, "Optimum FIR Digital Filter Implementations for Decimation, Interpolation, and Narrowband Filtering," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-23, pp. 444-456, Oct. 1975.

[8] R. E. Crochiere and L. R. Rabiner, "Further Considerations in the Design of Decimators and Interpolators," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-24, pp. 296-311, Aug. 1976.

[9] R. E. Crochiere and L. R. Rabiner, "Interpolation and Decimation of Digital Signals - A Tutorial Review," Proc. IEEE, vol. 69, pp. 300-331, March 1981.

[10] A. I. Koutrouvelis, G. P. Kafentzis, N. D. Gaubitch, and R. Heusdens, "A Fast Method for High-Resolution Voiced/Unvoiced Detection and Glottal Closure/Opening Instant Estimation of Speech," IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 24, no. 2, February 2016.

[11] T. Drugman, P. Alku, A. Alwan, and B. Yegnanarayana, "Glottal source processing: From analysis to applications," Comput. Speech Lang., vol. 28, no. 5, pp. 1117-1138, 2014.

[12] J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, vol. 63, no. 4, pp. 561-580, Apr. 1975.

[13] S. Gonzalez and M. Brookes, "PEFAC - A pitch estimation algorithm robust to high levels of noise," IEEE Trans. Audio, Speech, Lang. Process., vol. 22, no. 2, pp. 518-530, Feb. 2014.

[14] T. Drugman, M. Thomas, J. Gudnason, P. Naylor, and T. Dutoit, "Detection of Glottal Closure Instants from Speech Signals: A Quantitative Review," IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 3, March 2012.

[15] T. Drugman, B. Bozkurt, and T. Dutoit, "Causal-Anticausal Decomposition of Speech using Complex Cepstrum for Glottal Source Estimation," Speech Communication, Elsevier, February 2011.

[16] T. Drugman, B. Bozkurt, and T. Dutoit, "A Comparative Study of Glottal Source Estimation Techniques," Computer Speech and Language, Elsevier, September 2011.

[17] T. Drugman, B. Bozkurt, and T. Dutoit, "Glottal Source Estimation Using an Automatic Chirp Decomposition," Lecture Notes in Computer Science: Advances in Non-Linear Speech Processing, vol. 5933, pp. 35-42, 2010.

[18] T. Drugman and T. Dutoit, "Glottal closure and opening instant detection from speech signals," in Proc. Interspeech, 2009.

[19] K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 8, pp. 1602-1613, Nov. 2008.

[20] R. E. Crochiere, "On the Design of Sub-band Coders for Low Bit Rate Speech Communication," Bell Syst. Tech. J., vol. 56, pp. 747-770, May-June 1977.

[21] R. E. Blahut, Fast Algorithms for Digital Signal Processing, Addison-Wesley, Reading, Mass., 1985.

[22] A. H. Gray, Source Coding Theory, Kluwer, Boston, MA, 1990.

[23] R. E. Crochiere, "Sub-band Coding," Bell Syst. Tech. J., vol. 60, pp. 1633-1654, Sept. 1981.

[24] M. Vetterli, "Multi-dimensional Sub-band Coding: Some Theory and Algorithms," Signal Processing, vol. 6, pp. 97-112, April 1984.

[25] V. K. Jain and R. E. Crochiere, "Quadrature Mirror Filter Design in the Time Domain," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-32, pp. 353-361, April 1984.

[26] L. W. Couch, Digital and Analog Communication Systems, Prentice Hall, New Jersey, 1993.

[27] A. M. Aziz, "Subband Coding of Speech Signals Using Decimation and Interpolation," Aerospace Sciences & Aviation Technology, ASAT-13, May 26-28, 2009.

[28] T. Drugman, M. Thomas, J. Gudnason, P. Naylor, and T. Dutoit, "Detection of Glottal Closure Instants from Speech Signals: A Quantitative Review," IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 3, March 2012.

[29] A. P. Prathosh, P. Sujith, A. G. Ramakrishnan, and P. K. Ghosh, "Cumulative Impulse Strength for Epoch Extraction," IEEE Signal Processing Letters, 2016.

 
