In this age of modern electronic devices, it is well accepted that people interact with such devices through natural language, whether English or any other language. Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique uses the speaker's voice to verify their identity and to control access to services such as voice dialling, database access services, security control for confidential information areas and several other fields where security is the main concern. Speakers can also command their devices via speech, as in Apple's Siri (iPhone software for speech recognition) and Microsoft's Kinect (a gaming device for the Xbox 360 and Windows-based platforms), which allow users to operate and interface with their devices in a natural way through spoken commands. However, for that to become commonplace, it is essential to improve the current accuracy of these applications even in the most ordinary tasks, such as determining who is speaking and what is being said.
1.2 Sound & Human Speech
Sound is a form of energy that travels as a wave through air or any other medium; when it reaches a person's ear it causes them to hear. Sound is generated by the vibration of particles in a medium, such as the vibration of air atoms or molecules, producing pressure changes in the air. Such a wave consists of two kinds of layers that interleave and travel together through the medium: high-pressure layers (compressions) and low-pressure layers (rarefactions). Sound can therefore be seen as a signal, the amplitude of which corresponds to the pressure change and the wavelength of which corresponds to the distance between two consecutive high-pressure (or two consecutive low-pressure) layers.
Speech is the natural way of human communication; it is a common and efficient form of communication for people to interact with each other. It is a multi-layered spectral variation that conveys information such as words, speaker identity, accent, feeling, gender and age. Human speech is a form of sound. The source of sound in humans is the vocal cords: two thin elastic bands of tissue that vibrate to produce sound. The vocal cords produce sound when air from the lungs flows through the windpipe into the voice box (where the vocal cords are located); the air pushes against the vocal cords, making them vibrate. These vibrations create a series of sound waves that exit through the mouth. When the vocal cords vibrate, they vibrate periodically at a base (fundamental) frequency. Sounds that excite the vocal cords are called voiced sounds; if the vocal cords are too relaxed or too stiff to vibrate, the resulting sound is said to be unvoiced. The smallest unit of spoken language is the phoneme. Different phonemes have different characteristics that can be used to recognise them; the most observable feature of a phoneme is whether it is voiced or unvoiced, which follows from the presence or absence of the base frequency. Real-time behaviour is one of the important features of human speech interaction; therefore, effective, highly accurate and time-efficient methods are necessary to deal with large amounts of speech information.
1.2.1 Signal
A signal is a function that describes the variation of a physical quantity with respect to an independent variable such as time, space or temperature. An example of a signal is the measured voltage at a certain point in an electric circuit. When a signal repeats itself at a regular interval or period T, it is called a periodic signal; examples of periodic signals are the sinusoid, triangle and square waves. T is the time taken by the signal to repeat itself, and the number of times the signal repeats itself in one second is called the frequency (i.e. the inverse of the period). When a signal changes over a short period it has a high frequency, and when it changes over a long period it has a low frequency. A signal that does not change with time has zero frequency, and one that changes instantaneously would have infinite frequency.
According to Coleman (2005, p. 72)[32], Jean Fourier provided the means by which any real-world signal can be represented as the sum of a (possibly infinite) number of sinusoids with different frequencies, phases and magnitudes. This technique is called the Fourier Transformation. It has only one condition: the original signal must be finite in length and amplitude at any time. The transformation generates a new representation of the signal in the frequency domain, where each frequency is associated with a value representing how much the sinusoid of that frequency contributes to the original signal. It is common to study the frequency spectrum of a signal, especially a speech signal (which is composed of many mixed frequencies, caused by the shape of the human vocal tract). Figure 1.1 (a) shows the time series for the sum of two sinusoids with frequencies 100 and 220 Hertz, with some artificially added noise. It is very hard to observe the frequencies in the time domain.
Figure 1.1 (b) is the result of Fast Fourier Transformation for the signal in (a). In (b) it is easier to locate the base frequencies of the mixed signal.
Fig 1.1(a) The time series and (b) frequency spectrum of a signal
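The idea behind Figure 1.1 can be reproduced with a minimal sketch (assuming NumPy is available; the sampling rate of 1000 Hz and the noise level are hypothetical choices): the two sinusoids at 100 Hz and 220 Hz are hard to distinguish in the time series but appear as clear peaks in the FFT magnitude spectrum.

import numpy as np

fs = 1000                          # sampling frequency in Hz (assumed)
t = np.arange(0, 1.0, 1.0 / fs)    # one second of sample instants
signal = (np.sin(2 * np.pi * 100 * t)
          + 0.5 * np.sin(2 * np.pi * 220 * t)
          + 0.2 * np.random.randn(t.size))     # artificially added noise

spectrum = np.fft.rfft(signal)                 # FFT of the real-valued signal
freqs = np.fft.rfftfreq(signal.size, 1.0 / fs) # frequency axis in Hz

# The two base frequencies dominate the magnitude spectrum.
peaks = freqs[np.argsort(np.abs(spectrum))[-2:]]
print(sorted(peaks))               # approximately [100.0, 220.0]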
1.2.2 Sampling
In signal processing, when an analog signal is to be processed by a computer it is essential to convert it into digital form. An analog signal is continuous in both amplitude and time, whereas a digital signal is discrete in both, so a conversion from a continuous to a discrete signal is needed. The process by which a continuous signal is transformed into a discrete signal is called sampling. The value of the signal is measured at regular instants, and each measurement is known as a sample. The time between two consecutive measurements is called the sampling interval T, and the number of samples taken in one second is called the sampling frequency Fs. A common example is the conversion of a sound wave (a continuous signal) to a sequence of samples (a discrete-time signal). A sample refers to a value or set of values at a point in time and/or space. A sampler is a subsystem or operation that extracts samples from a continuous signal.
Sampling can be done for functions varying in space, time, or any other dimension, and similar results are obtained in two or more dimensions. For functions that vary with time, let s(t) be a continuous function (or "signal") to be sampled, and let sampling be performed by measuring the value of the continuous function every T seconds, which is called the sampling interval. The sampled function is then given by the sequence s(nT), for integer values of n. The sampling frequency or sampling rate, fs, is defined as the number of samples obtained in one second (samples per second), thus fs = 1/T. Reconstructing a continuous function from samples is done by interpolation algorithms.
Fig 1.2 Signal sampling representation.
In Figure 1.2 the continuous signal s(t) is represented by the green line, while the discrete samples are indicated by the blue vertical lines. This mathematical abstraction is sometimes referred to as impulse sampling. Most sampled signals are not simply stored and reconstructed, but the fidelity of a theoretical reconstruction is a customary measure of the effectiveness of sampling. That fidelity is reduced when s(t) contains frequency components higher than fs/2 Hz, which is known as the Nyquist frequency of the sampler. The Nyquist frequency is the highest frequency of a signal that can be faithfully preserved when sampled at a given frequency Fs; above it, the original signal may be distorted, i.e. aliased. It can be shown that the Nyquist frequency is in fact ½Fs (Jurafsky and Martin, 2009) [1].
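As a small illustration of these definitions, the sketch below (assuming NumPy; the 5 Hz test signal and fs = 100 Hz are hypothetical values, safely above the Nyquist limit) measures a continuous function every T = 1/fs seconds to produce the sequence s(nT):

import numpy as np

def sample(s, fs, duration):
    # Measure the continuous function s(t) every T = 1/fs seconds.
    T = 1.0 / fs
    n = np.arange(int(duration * fs))    # integer sample indices
    return n * T, s(n * T)               # sample instants and values s(nT)

# Example: a 5 Hz sinusoid sampled at fs = 100 Hz for one second.
times, values = sample(lambda t: np.sin(2 * np.pi * 5 * t), fs=100, duration=1.0)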
1.2.3 Nyquist Frequency
The Nyquist frequency, named after the electronic engineer Harry Nyquist and also called the Nyquist limit, is the highest frequency that can be coded at a given sampling rate so that the signal can be fully reconstructed; i.e. the Nyquist frequency is half the sampling rate of a discrete signal processing system.[1][2] It is sometimes known as the folding frequency of a sampling system.[3] An example of folding is depicted in Figure 1.3, where fs is the sampling rate and 0.5 fs is the corresponding Nyquist frequency. The black dot plotted at 0.6 fs represents the amplitude and frequency of a sinusoidal function whose frequency is 60% of the sample rate fs. The other three dots indicate the frequencies and amplitudes of three other sinusoids that would produce the same set of samples as the actual sinusoid that was sampled. The symmetry about 0.5 fs is referred to as folding. The Nyquist frequency should not be confused with the Nyquist rate, which is the minimum sampling rate that satisfies the Nyquist sampling criterion for a given signal or family of signals. The Nyquist rate is twice the maximum component frequency of the function being sampled. For example, the Nyquist rate for the sinusoid at 0.6 fs is 1.2 fs, which means that at the fs rate it is being undersampled. Thus, the Nyquist rate is a property of a continuous-time signal, whereas the Nyquist frequency is a property of a discrete-time system. When the function domain is time, sample rates are usually expressed in samples/second, and the unit of Nyquist frequency is cycles/second (hertz). When the function domain is distance, as in an image sampling system, the sample rate might be dots per inch and the corresponding Nyquist frequency would be in cycles/inch.
Fig.1.3 Example of Nyquist frequency signal
The black dots are aliases of each other. The solid red line is an example of adjusting amplitude vs. frequency. The dashed red lines are the corresponding paths of the aliases.
1.2.4 Aliasing
When the signal is converted back into a continuous-time signal, it will exhibit a phenomenon called aliasing. Referring again to Figure 1.3, undersampling of the sinusoid at 0.6 fs is what allows there to be a lower-frequency alias, i.e. a different function that produces the same set of samples. The mathematical algorithms that are typically used to recreate a continuous function from its samples will misinterpret the contributions of undersampled frequency components, which causes distortion. Samples of a pure 0.6 fs sinusoid would produce a 0.4 fs sinusoid instead. If the true frequency were 0.4 fs, there would still be aliases at 0.6 fs, 1.4 fs, 1.6 fs, etc., but the reconstructed frequency would be correct. In a typical application of sampling, one first chooses the highest frequency to be preserved and recreated, based on the expected content (voice, music, etc.) and desired fidelity. Then one inserts an anti-aliasing filter ahead of the sampler, whose job is to attenuate the frequencies above that limit. Finally, based on the characteristics of the filter, one chooses a sample rate (and corresponding Nyquist frequency) that will provide an acceptably small amount of aliasing. In applications where the sample rate is pre-determined, the filter is chosen based on the Nyquist frequency, rather than vice versa. For example, audio CDs have a sampling rate of 44100 samples/sec, so the Nyquist frequency is 22050 Hz. The anti-aliasing filter must adequately suppress any higher frequencies but negligibly affect the frequencies within the human hearing range; a filter that preserves 0–20 kHz is more than adequate for that.
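The aliasing described above can be verified numerically; in this minimal sketch (assuming NumPy, with a hypothetical fs = 100 Hz) a cosine at 0.6 fs produces exactly the same samples as one at 0.4 fs:

import numpy as np

fs = 100.0                  # sampling rate (assumed)
n = np.arange(32)           # sample indices
t = n / fs                  # sample instants

x_true = np.cos(2 * np.pi * 0.6 * fs * t)    # 60 Hz: above the Nyquist frequency
x_alias = np.cos(2 * np.pi * 0.4 * fs * t)   # 40 Hz alias

print(np.allclose(x_true, x_alias))          # True: the two sample sets are identical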
1.2.5 Quantization
In mathematics and digital signal processing, quantization is the process of mapping a large set of input values to a (countably) smaller set, such as integer values. A device or algorithmic function that performs quantization is called a quantizer. A common use of quantization is the conversion of a discrete-time (sampled) signal into a digital signal. In analog-to-digital conversion, the difference between the actual analog value and the quantized digital value is called the quantization error or quantization distortion; this error arises from either rounding or truncation. The error signal is sometimes modelled as an additional random signal called quantization noise because of its stochastic behaviour. Quantization is involved to some degree in nearly all digital signal processing, since representing a signal in digital form ordinarily involves rounding. Quantization also forms the core of essentially all lossy compression algorithms.
Fig 1.4 Quantization of signal
The simplest way to quantize a signal is to choose the digital amplitude value closest to the original analog amplitude. The quantization error that results from this simple quantization scheme is a deterministic function of the input signal.
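A minimal sketch of this nearest-level scheme (assuming NumPy; the 4-bit resolution and the [-1, 1] range are hypothetical choices): each sample is mapped to the closest of 2^bits evenly spaced levels, and the quantization error stays within half a step.

import numpy as np

def quantize_uniform(x, bits, x_min=-1.0, x_max=1.0):
    levels = 2 ** bits
    step = (x_max - x_min) / (levels - 1)               # spacing between levels
    return np.round((x - x_min) / step) * step + x_min  # closest level

x = np.sin(2 * np.pi * np.linspace(0.0, 1.0, 50))   # a sampled test signal
q = quantize_uniform(x, bits=4)
error = x - q                                       # quantization error
print(np.max(np.abs(error)))                        # bounded by half the step size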
1.2.6 Windowing
In signal processing, speech is a non-stationary signal whose characteristics can change rapidly with time. For most phonemes, however, the characteristics of speech remain the same over a short interval of time. Window functions are used as a temporary bound on the original signal to limit processing to the range of interest. A window converts the original signal into a time-bounded signal that is identical to the original inside the window range and is set to zero outside of it; this helps further processing to attend only to the bounded part. Window functions other than the rectangular one give more weight to the central values of the window than to the boundary values; examples include the Hamming, Hanning and Triangular window functions. It is common to use overlapping windows over the same signal, i.e. one window starts before the previous one ends in the time domain. Each window will then hold an extract of the original signal without any interference among the resulting windows.
In DSP, windowing is also seen as a function applied to the samples of the signal to produce a new series of samples, i.e. a new signal. Figure 1.5 shows the effects of different windowing functions on a flat digital signal.
Fig 1.5 The effects of different window functions on a digital signal
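The window functions named above are available directly in NumPy; this short sketch (the length N = 256 is an assumed value, chosen to match the frame size used later in this chapter) generates their shapes, analogous to Figure 1.5:

import numpy as np

N = 256                      # window length in samples (assumed)
rect = np.ones(N)            # rectangular: flat inside the range, zero outside
hamming = np.hamming(N)      # raised cosine; does not quite reach zero at the edges
hanning = np.hanning(N)      # raised cosine; tapers to zero at the edges
triangular = np.bartlett(N)  # linear taper toward the boundaries

# All but the rectangular window emphasise the centre of the frame
# over its boundary values.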
1.2.7 Filtering
A filter affects the frequency domain of a signal, i.e. it changes the components of the signal in specific frequency ranges. For example, an ideal low-pass filter keeps only low-frequency components and removes any component above a particular frequency (called the cut-off frequency). As with windowing, filtering a signal produces a new signal whose sample values are affected by the filter. In practice, the simplest way to build a digital filter is the difference equation, which determines the value of the output sample as a linear combination of previous input (and/or output) sample values, as presented in Equation 1.1. An example is the averager, i.e. averaging the past N input samples, which results in a signal less influenced by quick changes (or high frequencies) of the original signal, mimicking a low-pass filter. Another example is the differentiator, i.e. taking the difference of consecutive input samples, which results in a signal less influenced by slow changes (or low frequencies) of the original signal, mimicking a high-pass filter. In general the following equation is used (x is the input signal and y is the output signal):
y(n) = Σ(k=0..M) bk x(n−k) − Σ(k=1..L) ak y(n−k)    Eqn 1.1
Using different values for the coefficients (the a's and b's) results in different filter behaviours; a classic filter designed this way is the Butterworth filter. In practice, coefficient values are calculated from lookup tables according to the ratio of the cut-off frequency to the sampling frequency of the input signal. Furthermore, combining two filters consecutively allows the construction of a band-pass filter (which only passes a certain range of frequencies).
Filters are generally described by their impulse responses. An FIR (finite impulse response) filter has an impulse response of finite duration, because it settles to zero in finite time; an IIR (infinite impulse response) filter has an impulse response that never becomes exactly zero after a certain point but continues indefinitely. A filter bank is a collection of filters that covers the entire range of frequencies found in the original signal. Each filter emits a different signal, which can be further processed individually.
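As a sketch of Equation 1.1 in code (assuming SciPy; the filter lengths and the test signal are hypothetical choices), scipy.signal.lfilter evaluates the general difference equation, shown here for the averager and differentiator examples above:

import numpy as np
from scipy.signal import lfilter

# Noisy low-frequency test signal (assumed for illustration).
x = np.sin(2 * np.pi * 0.02 * np.arange(200)) + 0.3 * np.random.randn(200)

# Averager: y(n) = (x(n) + ... + x(n-4)) / 5, attenuating fast changes.
y_low = lfilter(np.ones(5) / 5, [1.0], x)

# Differentiator: y(n) = x(n) - x(n-1), attenuating slow changes.
y_high = lfilter([1.0, -1.0], [1.0], x)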
1.2.8 Causes for Differences in Speech Signal
When the vocal cords vibrate, the frequency of that vibration (the fundamental frequency) is the source of most differences between voices. It supports some general classifications such as gender (male voices usually have a lower fundamental frequency than female voices) and age (the fundamental frequency drops with age). The ability of the vocal tract to change its shape is another factor in the differences between voices. In a phoneme-based study we find differences in class and dialect between different people; those differences affect the places of stress and the syllables used (Campbell, 1997)[4]. Another source of difference is the number of uttered syllables in a period of time, or rate of speech (Jurafsky and Martin, 2009)[5]. This also affects the speaking style of the person, such as the deletion of the last syllable of a word, the reduction of stress in some cases, and merging words for convenience (Huang, Hon and Reddy, 2001)[6]. Moreover, the same person's speech may be affected by mood, for example through repetition, whispering or yelling.
1.3 Speaker and Speech Recognition
Speaker Recognition (SR) is a major topic which includes many different speaker-specific tasks. According to Reynolds (2002)[7], the tasks can be sub-categorized into text-dependent (where speakers are expected to utter a certain piece of text) and text-independent (where the speaker may speak anything they wish) tasks. Similarly, depending on the information that the method is allowed to use and the output expected from the process, speaker recognition generally comprises the following tasks:
Speaker Identification: A closed set of speakers is introduced to the system along with the testing data. The system determines who is speaking from this set of known speakers. This is often referred to as Closed-Set Identification, or more conveniently speaker identification, to avoid confusion with the verification task.
Speaker Verification: A test utterance is presented to the system together with an identity claim, and the system must decide whether to accept or reject the claim; speakers from outside the registered set must be rejected. This is often referred to as Open-Set Identification. Campbell (1997)[4] adds the following task under the SR umbrella.
Speaker Detection: One speaker's data (often called the target speaker) is offered to the system along with many test utterances. The system is expected to correctly flag the utterances of the target speaker. Other tasks are also related to Speaker Recognition, as they are considered part of the same family of research (Kotti et al, 2008)[8]:
Speaker Segmentation: A large input stream, with more than one speaker present, is offered to the system. The system is expected to find the points where the speaker changes; i.e. turn points. If knowledge about the speakers is available beforehand, then the system can build models for each speaker. Then the task is called model-based speaker segmentation. Otherwise, it is called blind speaker segmentation, or metric-based speaker segmentation.
Speaker Clustering: A large number of test inputs are presented to the system. The system must correctly cluster them according to the speaker. This task is often done online, alongside another task, as to group segments of the same speaker together.
Speaker Diarization: A stream is presented to the system. The system is expected to decide who is speaking at each period of the stream. This task is often thought of as segmentation of the stream followed by clustering. Similar to the segmentation task, if knowledge is available a priori to the system then models can be built (which helps in the online clustering as well) and the task is called model-based speaker diarization.
Fig 1.6 Speaker Recognition Tasks
1.3.1 Automatic Speaker Recognition
Automatic Speaker Recognition is the technique of identifying or verifying a person using speech features extracted from an utterance. The goal of such a system is to analyse, extract and recognize information about the speaker's identity. A speaker recognition system consists of four stages. First, a speech analysis stage determines a suitable frame size for segmenting the speech signal for further analysis; this is followed by a feature extractor, then a robust speaker modelling technique for a generalized representation of the extracted features, and finally a classification stage that verifies or identifies the feature vectors against linguistic classes. In the extraction stage of an ASR system, the input speech signal is converted into a series of low-dimensional vectors; each vector summarizes the necessary temporal and spectral behaviour of a short segment of the acoustic speech input (Reynolds, 2002) [7].
Speaker recognition systems involve two phases, known as training and testing. In the training phase the system is familiarized with the voice characteristics of the registering speakers, and in the testing phase the actual recognition of the speaker is done. Figure 1.7 shows the block diagram of the training phase. Feature vectors representing the voice characteristics of the speaker are extracted from the training utterances and used to build the reference models. During testing, similar feature vectors are extracted from the test utterance, and the degree of their match with the reference is obtained using some matching technique; the level of match is used to arrive at the decision.
Fig 1.7 Block diagram of the training phase (a) and testing phase (b)
A speaker recognition system is composed of the following stages:
Front-end processing- The speech signal is converted into feature vectors that capture the properties of speech that differ between speakers.
Speaker modeling- The feature data is reduced by modeling the distribution of the feature vectors.
Speaker database- The stored speaker models are collected into a speaker database.
Decision logic- The final decision about the identity of the speaker is made by comparing the unknown vector with the known vectors in the database and selecting the best (closest) matching model.
The outcomes of ASR, recognition and device control, permit an individual to control access to services such as voice call dialling, banking by telephone, telephone shopping, telemedicine, database access services, information services, voice mail and security control for confidential areas (Sadaoki Furui) [9]. The performance of a speaker recognition system depends on the techniques used in its various stages (Chakraborty and Ahmed, 2007; Rabiner and Juang, 1993) [10].
1.4 Classification Of Speaker Recognition
Speaker recognition is a major research area today. It can be classified into two main classes according to the kind of utterance the system can recognize. Recognizing a speaker is a complex and very challenging task because of the variability in the signal. The SR approach classes are:
1. Conventional.
a. Speaker identification
b. Speaker verification
2. Text dependence.
a. Text independent recognition
b. Text dependent recognition
1.4.1 Speaker Identification
Speaker identification is the process of determining which registered speaker produced a given utterance; speakers' utterances are added to a database during enrolment and may be used later during the identification process. In simple terms, the system determines who is speaking. Figure 1.8 shows the process of speaker identification: features are extracted from the source speech, a measure of similarity to the available speaker utterances is computed, and a final step identifies the speaker based on the closest match between the utterance and the data saved in the system.
Fig 1.8 Speaker Identification
1.4.2 Speaker Verification
Speaker verification is the process of accepting or rejecting the identity claim of a speaker, i.e. the task of determining whether a particular person is speaking or not. The speaker verification process is shown in Figure 1.9, and includes feature extraction from the source speech, comparison with speech utterances already stored in the database for the speaker whose identity is being claimed, and a final decision step in which the system gives a positive or negative response to the claim.
Fig 1.9 Speaker verification
1.4.3 Text-independent recognition
Another category of classification of speaker recognition systems is based upon the text uttered by the speaker during the identification process.
In Figure 1.10, a text-independent SR system is shown; the key feature of the system is speaker identification using arbitrary input utterances (Chakraborty and Ahmed, 2007)[10].
Fig 1.10 Text-independent Speaker Recognition
1.4.4 Text-dependent recognition
In this case, the test utterance is the same as the text used in the training phase. In Figure 1.11, a text-dependent SR system is shown, where recognition of the speaker's identity is based on a match with utterances made by the speaker previously and stored for later comparison. Phrases like passwords, card numbers and PIN codes may be used (Chakraborty and Ahmed, 2007)[10].
Fig 1.11 Text dependent Speaker Recognition
1.5 Modules Of Speaker Recognition
The main modules of a speaker recognition system are:
1. Feature Extraction:
The purpose of this module is to convert the speech waveform into a set of features or rather feature vectors used for further analysis.
2. Feature Matching:
In this module the features extracted from the input speech are matched with the stored template (reference model) and a recognition decision is made. In this work, MFCC is used for feature extraction, and for feature matching we use the VQ-LBG algorithm.
1.6 Speech Feature Extraction
Feature extraction is the process of retaining the useful information in the signal while discarding the unwanted parts. The purpose of this module is to convert the speech waveform to some type of parametric representation for further analysis and processing; this is often referred to as the signal-processing front end. The speech signal is a slowly time-varying signal. An example of a speech signal is shown in Figure 1.12. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary; however, over longer periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal [12].
Fig 1.12 An example of speech signal
Historically, the following spectrum-related speech features have dominated the speech and SR areas: Real Cepstral Coefficients (RCC) introduced by Oppenheim (1969)[13], LPC proposed by Atal and Hanauer (1971)[14], LPCC derived by Atal (1974, Sambur, 1976)[15], and MFCC by Davis and Mermelstein (1980)[16]. Other speech features such as, PLP coefficients by Hermansky (1990), Adaptive Component Weighting (ACW) cepstral coefficients by Assaleh and Mammone (1994) and various wavelet-based features, although presenting reasonable solutions for the same tasks, did not gain widespread practical use.
The reasons these approaches have not been widely adopted include their more demanding computational requirements and the fact that they do not provide significant advantages over the well-known MFCC (Ganchev, 2005; Plumpe, 1999) [17].
1.6.1 Mel-Frequency Cepstrum Coefficients
The LPC [14] features were very popular in the early speaker-identification and speaker-verification systems. However, comparison of two LPC feature vectors requires the use of computationally expensive similarity measures such as the Itakura-Saito distance, and hence LPC features are unsuitable for use in real-time systems. Furui suggested the use of the cepstrum, defined as the inverse Fourier transform of the logarithm of the magnitude spectrum, in speech-recognition applications. The use of the cepstrum allows the similarity between two cepstral feature vectors to be computed as a simple Euclidean distance. Furthermore, Atal demonstrated that the cepstrum derived from mel-frequency analysis (MFCC features) rather than from LPC features yields the best performance in terms of FAR (False Acceptance Rate) and FRR (False Rejection Rate) for a speaker recognition system.
MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The mel scale translates physical frequencies to a scale that is more appropriate for speech, since the human ear perceives sound in a nonlinear manner. Feature extraction is done using an MFCC processor. On the mel scale, the subjective pitch of a tone of frequency F in Hz is measured in units known as mels. The reference point between this scale and normal frequency measurement is defined by equating a 1000 Hz tone, 40 dB above the listener's threshold, with a pitch of 1000 mels (Ganchev, 2005) [17]. The approximate formula shown in Equation 1.2 can therefore be used to compute the mels for a given frequency F in Hz.
Fmel = 2595 log10 (1 + F/700)    Eqn 1.2
The frequency versus mel-frequency scale is shown in Figure 1.13. The scale is linear below 1000 Hz and logarithmic above 1000 Hz.
Fig 1.13 Frequency (linear) vs Mel frequency
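Equation 1.2 translates directly into code; this minimal sketch (assuming NumPy) also checks the 1000 Hz calibration point of the scale:

import numpy as np

def hz_to_mel(f_hz):
    # Eqn 1.2: Fmel = 2595 log10(1 + F/700)
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))    # ~1000 mels: the reference point of the scale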
1.6.1.1 Mel-Frequency Cepstrum Coefficients Processor
A block diagram of the structure of an MFCC processor is given in Figure 1.14. The speech input is typically recorded at a sampling rate above 12500 Hz; this sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion.
Fig 1.14 Block diagram of the MFCC processor
A. Frame Blocking
In this step, the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame and overlaps it by N − M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N − 2M samples. This process continues until all the speech is accounted for within one or more frames [18]. The values for N and M are taken as N = 256 (equivalent to ~30 msec of windowing, and facilitating the fast radix-2 FFT) and M = 100. Frame blocking of the speech signal is done because, when examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary, whereas over longer periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Overlapping frames are taken to avoid information loss and to maintain correlation between adjacent frames. The value N = 256 is a compromise between time resolution and frequency resolution; one can observe these resolutions by viewing the corresponding power spectra of speech files.
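A minimal sketch of this step with the values above, N = 256 and M = 100 (assuming NumPy; the one-second random signal is only a stand-in for real speech):

import numpy as np

def frame_blocking(signal, N=256, M=100):
    # Split the signal into overlapping frames of N samples, shifted by M.
    num_frames = 1 + (len(signal) - N) // M
    return np.stack([signal[i * M : i * M + N] for i in range(num_frames)])

speech = np.random.randn(16000)   # stand-in for one second of speech
frames = frame_blocking(speech)
print(frames.shape)               # (num_frames, 256); adjacent frames overlap by N - M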
B. Windowing
The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N − 1, where N is the number of samples in each frame, then the result of windowing is the signal

y(n) = x(n) w(n), 0 ≤ n ≤ N − 1    Eqn 1.3
Typically the Hamming window is used, which has the form

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1    Eqn 1.4

and whose plot is given in Figure 1.15.
Fig 1.15 Hamming window
C. Fast Fourier Transform (FFT)
The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. These algorithms were popularized by Cooley and Tukey and are based on decomposing the transform into smaller transforms and combining them to give the total transform. The FFT reduces the computation time required to compute a discrete Fourier transform, improving performance by a factor of 100 or more over direct evaluation of the DFT: it reduces the number of complex multiplications from N² to (N/2) log2 N, giving a speed improvement factor of N² / ((N/2) log2 N). In other words, the FFT is a fast algorithm for implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples {xk} as follows:
Xn = Σ(k=0..N−1) xk e^(−j2πkn/N),  n = 0, 1, 2, …, N − 1    Eqn 1.5
In general the Xn are complex numbers. The resulting sequence {Xn} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 1 ≤ n ≤ N/2 − 1, while negative frequencies −Fs/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N − 1. Here Fs denotes the sampling frequency. The result of this step is often referred to as the spectrum or periodogram.
Fig 1.16 Power spectrums of speech files for different M and N values
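Continuing the frame-blocking sketch, the short example below (assuming NumPy; the frames array is re-created here as a stand-in) applies the Hamming window of Eqn 1.4 to every frame and evaluates the DFT of Eqn 1.5 with the FFT, keeping the non-negative frequencies up to Fs/2:

import numpy as np

frames = np.random.randn(158, 256)    # stand-in; see the frame-blocking sketch
N = frames.shape[1]

windowed = frames * np.hamming(N)     # Eqn 1.3 with the Hamming window (Eqn 1.4)
spectrum = np.fft.fft(windowed, n=N)  # Eqn 1.5 for every frame at once
power = np.abs(spectrum[:, : N // 2 + 1]) ** 2   # periodogram for 0 <= f <= Fs/2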
1.6.1.2 Mel-frequency Wrapping
As mentioned above, psychophysical studies have shown that human perception of the frequency contents of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the ‘mel’ scale. The mel-frequency scale is linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels [1][2]. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz:
mel( f ) = 2595*log10(1+ f / 700) Eqn 1.6
One approach to simulating the subjective spectrum is to use a filter bank spaced uniformly on the mel scale, as seen in Figure 1.17. Each filter has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20. This filter bank is applied in the frequency domain, so it simply amounts to applying those triangle-shaped windows of Figure 1.17 to the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins overlap) in the frequency domain.
Fig 1.17 An example of mel-spaced filterbank for 20 filters
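A hedged sketch of such a filter bank (assuming NumPy; K = 20 filters as above, with N = 256 and a sampling rate of fs = 12500 Hz as assumed values): the filter centres are uniform on the mel scale, and each triangular filter is applied to the frame periodograms from the previous sketch.

import numpy as np

def mel_filterbank(K=20, N=256, fs=12500.0):
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)        # Eqn 1.6
    mel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # its inverse
    # K + 2 points uniform in mel, mapped back to FFT bin indices.
    pts = np.floor((N + 1) * mel_inv(np.linspace(0.0, mel(fs / 2), K + 2)) / fs).astype(int)
    fbank = np.zeros((K, N // 2 + 1))
    for k in range(1, K + 1):
        left, centre, right = pts[k - 1], pts[k], pts[k + 1]
        for i in range(left, centre):      # rising edge of the triangle
            fbank[k - 1, i] = (i - left) / max(centre - left, 1)
        for i in range(centre, right):     # falling edge of the triangle
            fbank[k - 1, i] = (right - i) / max(right - centre, 1)
    return fbank

power = np.random.rand(158, 129)           # stand-in; see the FFT sketch
mel_energies = power @ mel_filterbank().T  # K filter outputs per frame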
1.6.1.3 Cepstrum
The cepstrum is defined as the inverse Fourier transform of the logarithm of the power spectrum of a signal. It is useful for determining periodicities in the spectrum.
Additions in the Cepstrum domain correspond to multiplication in the frequency domain and convolution in the time domain. The Cepstrum is the Forward Fourier Transform of a spectrum. It is thus the spectrum of a spectrum, and has certain properties that make it useful in many types of signal analysis [3]. One of its most powerful attributes is the fact that any periodicities, or repeated patterns, in a spectrum will be sensed as one or two specific components in the Cepstrum. If a spectrum contains several sets of sidebands or harmonic series, they can be confusing because of overlap. But in the Cepstrum, they will be separated in a way similar to the way the spectrum separates repetitive time patterns in the waveform. The Cepstrum is closely related to the auto correlation function. The Cepstrum separates the glottal frequency from the vocal tract resonances. The Cepstrum is obtained in two steps. A logarithmic power spectrum is calculated and declared to be the new analysis window. On that an inverse FFT is performed. The result is a signal with a time axis. The word Cepstrum is a play on spectrum, and it denotes mathematically:
c(n) = ifft(log|fft(s(n))|), Eqn 1.7
Where s(n) is the sampled speech signal, and c(n) is the signal in the cepstral domain. Cepstral analysis is used in speaker identification because the speech signal has the convolutional form described above, and the cepstral transform makes the analysis remarkably simple. If the speech signal s(n) is considered as the convolution of the pitch excitation p(n) and the vocal tract response h(n), then c(n), the cepstrum of the speech signal, can be represented as:
c(n) = ifft(log( fft( h(n)*p(n) ) ) ) Eqn 1.8
c(n) = ifft(log( H(jw)P(jw) ) ) Eqn 1.9
c(n) = ifft(log(H(jw))) + ifft(log(P(jw))) Eqn 1.10
The key is that the logarithm, though nonlinear, turns the multiplication of the two spectra into an addition, so the vocal-tract and pitch contributions become additively separable in the cepstrum. For human speakers, the pitch frequency Fp can take on values between 80 Hz and 300 Hz, so we are able to narrow down the portion of the cepstrum where we look for pitch. In the cepstrum, which is basically a time domain, we look for an impulse train whose pulses are separated by the pitch period, i.e. 1/Fp.
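Eqn 1.7 can be written directly in code; a minimal sketch (assuming NumPy; the small epsilon is added only to avoid log(0) and is not part of the formula):

import numpy as np

def real_cepstrum(frame):
    # c(n) = ifft(log|fft(s(n))|), Eqn 1.7
    spectrum = np.fft.fft(frame)
    return np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)).real

# For a voiced frame sampled at fs, the pitch shows up as a peak at
# quefrency fs/Fp samples, i.e. between fs/300 and fs/80 for Fp in 80-300 Hz.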
In this final step, we convert the log mel spectrum back to time. The result is called the mel-frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that result from the last step by Sk, k = 1, 2, …, K, we can calculate the MFCCs cn as

cn = Σ(k=1..K) (log Sk) cos[n (k − 1/2) π / K],  n = 1, 2, …, K    Eqn 1.11

Note that we exclude the first component, c0, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
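A sketch of this final step (assuming NumPy and SciPy; the mel energies are re-created as a stand-in, and keeping 12 coefficients is an assumed choice): SciPy's type-II DCT matches the cosine sum above up to normalization, and c0 is discarded.

import numpy as np
from scipy.fftpack import dct

mel_energies = np.random.rand(158, 20)   # stand-in; see the filter-bank sketch
log_mel = np.log(mel_energies + 1e-12)   # log mel power spectrum
mfcc = dct(log_mel, type=2, axis=1, norm='ortho')[:, 1:13]  # drop c0, keep 12 coefficients
print(mfcc.shape)                        # (num_frames, 12)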
1.6.2 Feature Matching Techniques
The problem of speaker recognition belongs to a much wider topic in engineering called pattern recognition. The aim of pattern recognition is to classify objects of interest into a number of categories or classes. The objects of interest are called patterns, and in our case they are the sequences of feature vectors extracted from the input speech by feature extraction. Each class here corresponds to an individual speaker. Since we are only dealing with a classification procedure based on extracted features, this stage can also be called feature matching.
Moreover, if there exists a set of patterns for which the corresponding classes are already known, the problem reduces to supervised pattern recognition. These patterns are used as the training set, and a classification algorithm is determined for each class. The remaining patterns are then used to test whether the classification algorithm works properly; this collection of patterns is referred to as the test set. If the test set contains a pattern for which no classification can be derived, the pattern is treated as an unregistered user in the speaker identification process. In a real-time environment, the robustness of the algorithm can be determined by checking how many registered users are identified correctly and how reliably it rejects unknown users.
The feature matching problem has been addressed with many efficient state-of-the-art algorithms such as VQ-LBG and DTW, and stochastic models such as GMM and HMM. In our study we focus on the VQ-LBG algorithm, owing to its simplicity.
1.6.2.1 Vector Quantization
Vector quantization is the generalization of scalar quantization to the quantization of a vector. It is used in sophisticated digital signal processing where, in most cases, the input signal already has some form of digital representation and the desired output is a compressed version of the original signal. Basically, vector quantization is an efficient data reduction technique: VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space [18]. Each region is called a cluster and can be represented by its centre, called a codeword; the collection of all codewords is called a codebook. Vector quantization is a form of pattern recognition where input information is matched against an already stored set of codewords. VQ is based on the design of a codebook and can be seen as clustering: it divides the input pattern space into encoded clusters.
A VQ technique includes two fundamental tasks: [18].
An encoding process which involves a nearest-neighbour (NN) search, assigning the closest codeword to a given vector.
Codebook generation.
The mean vector of a cluster is selected as the representation of the entire cluster. Each centroid is considered the codevector associated with a given cluster, and this codevector is added as an entry in the codebook. Given a codebook, the distortion over a training set of patterns is then measured.
The best-known VQ codebook generation algorithms used in speaker verification/recognition tasks include the K-means algorithm [19], the Linde-Buzo-Gray (LBG) algorithm, Kohonen's self-organizing map (KSOM) and Fuzzy C-means. In these algorithms the process of finding an optimal codebook is guided by minimization of the average distortion function (objective or cost function), representing the average total sum of distances between the original vectors and the codewords; it is also called the quantization error.
An ideal codebook should contain a set of uncorrelated (linearly independent) centroid vectors. In practice, a certain amount of correlation between centroids always remains (Memon, 2010)[20].
1.6.2.1.1 Clustering
The goal of clustering is the classification of objects according to the similarities among them, organizing data into groups. Clustering techniques are unsupervised methods: they do not use prior class identifiers. The main aim of clustering is to detect the underlying structure in data, not only for classification and pattern recognition, but also for model reduction and optimization. Different classifications can be related to the algorithmic approach of the clustering techniques: partitioning, hierarchical, graph-theoretic methods and methods based on an objective function can be distinguished. In the following subsections the K-means, Linde-Buzo-Gray (LBG), information-theoretic and Fuzzy C-means techniques are described (Chakraborty and Ahmed, 2007).[10]
K-means clustering
The K-means algorithm is one of the simplest unsupervised algorithms for solving the clustering problem. It classifies a given data set into a certain number of clusters (K clusters) fixed a priori. The K-means algorithm (Memon, 2009) was developed for vector quantization. First, K points are placed into the space represented by the objects being clustered; these points represent the initial group centroids. Each object is then assigned to the group with the closest centroid. When all objects have been assigned, the positions of the K centroids are recalculated, and the process repeats until the centroids no longer move. The main objective of this algorithm is to minimize an objective function.
Linde-Buzo-Gray Clustering Technique
The LBG algorithm is a finite sequence of steps in which, at every step, a new quantizer, with an average distortion less or equal to the previous one, is produced. The LBG algorithm includes two phases: (1) the codebook initialization, and (2) codebook optimization. The codebook optimization starts from an initial codebook and, after some iterations, generates a final codebook with a distortion corresponding to a local minimum.
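A hedged sketch of LBG codebook generation (assuming NumPy; the splitting factor eps and the stopping tolerance are assumed values): starting from a single codeword, every codeword is split in two, and each enlarged codebook is refined with nearest-neighbour/centroid iterations until the average distortion stops improving, i.e. a local minimum is reached.

import numpy as np

def lbg(vectors, codebook_size, eps=0.01, tol=1e-4):
    codebook = vectors.mean(axis=0, keepdims=True)       # initial one-word codebook
    while len(codebook) < codebook_size:
        codebook = np.vstack([codebook * (1 + eps),      # split every codeword
                              codebook * (1 - eps)])
        prev = np.inf
        while True:                                      # codebook optimization
            d = ((vectors[:, None, :] - codebook[None]) ** 2).sum(-1)
            nearest = d.argmin(axis=1)                   # nearest-neighbour search
            distortion = d[np.arange(len(vectors)), nearest].mean()
            if prev - distortion < tol * prev:           # converged to a local minimum
                break
            prev = distortion
            for k in range(len(codebook)):               # move codewords to centroids
                if np.any(nearest == k):
                    codebook[k] = vectors[nearest == k].mean(axis=0)
    return codebook

# e.g. lbg(mfcc_vectors, codebook_size=16) builds a 16-codeword speaker model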
Information theoretic based clustering
A set of concepts from information theory provides a computationally efficient technique which eliminates many disadvantages of classical VQ algorithms. Unlike LBG, this algorithm relies on the minimization of a well-defined cost function. The cost function used in the LBG and K-means algorithms is defined as an average distortion (or distance), and as such it is complex and may contain discontinuities, making the application of traditional optimization procedures very difficult (Memon, 2009)[20]. According to information theory, distance minimization is equivalent to minimization of the divergence between the distribution of the data and the distribution of the code vectors; both distributions can be estimated using the Parzen density estimator. The Information Theoretic Vector Quantization (ITVQ) algorithm is based on the principle of minimizing the divergence between a Parzen estimate of the code vector distribution and a Parzen estimate of the data distribution.
Fuzzy C-means Clustering
Fuzzy C-means (FCM) is a clustering method which allows a data point to belong to two or more clusters with a certain degree of membership, rather than to just one cluster. This method is used in pattern recognition, and the main objective of the algorithm is to minimize an objective function. Since clusters can formally be seen as subsets of the data, one possible classification of clustering methods is according to whether the subsets are fuzzy or crisp (hard). Hard clustering methods are based on classical set theory and require that an object either does or does not belong to a cluster; hard clustering of a data set X is the partitioning of the data into a specified number c of mutually exclusive subsets of X. In fuzzy clustering, the data set X is instead partitioned into c fuzzy subsets. In many real situations, fuzzy clustering is more natural than hard clustering, as objects on the boundaries between several classes are not forced to fully belong to one of the classes, but rather are assigned membership degrees between 0 and 1 indicating their partial memberships.
1.6.2.2 Gaussian Mixture Model
The GMM is a feature modelling and classification algorithm widely used in speech-based pattern recognition, since it can smoothly approximate a wide variety of density distributions. Adapted GMMs, known as UBM-GMM and MAP-GMM, have further improved speaker verification outcomes; the introduction of the adapted GMM algorithms has increased computational efficiency and strengthened the speaker verification optimization process. The Expectation Maximization (EM) algorithm is most commonly used to iteratively derive the class models: it is initialized with a speaker model and produces a refined model at the end of its iterations.
1.6.2.3 Hidden Markov Model
Hidden Markov models are generative models based on stochastic finite-state networks, and are currently among the most popular and successful acoustic models for automatic speech recognition. The Hidden Markov Model (HMM) is created using the continuous probability measures of a GMM. HMMs are used for text-dependent speaker recognition (Rosenberg and Sambur, 1975; Naik, 1990; Matsui and Furui, 1992; Memon, 2010).[54] In an HMM, the time-dependent parameters are observation symbols created from VQ codebook labels. The main assumption of the HMM is that the current state depends on the previous state. In the training phase, the state transition probability distribution, the observation symbol probability distribution and the initial state probabilities are estimated for each speaker as a speaker model; for recognition, the probability of the observations given a speaker model is calculated. The use of HMMs for text-independent speaker recognition under the constraint of limited data and mismatched channel conditions is demonstrated by Kimball, Schmidt, Gish and Waterman (1997).[43]
1.6.2.4 Neural Networks
Neural networks have been widely used for pattern recognition problems; their strength in discriminating between patterns of different classes is exploited for SR. A neural network is organized in layers made of interconnected nodes, each containing an activation function. Neural networks have an input layer, one or more hidden layers and an output layer. The input layer is connected to the hidden layers, where the main processing is done through the connections, and the hidden layers are connected to the output layer, from which the output is produced. Each layer consists of processing units, where each unit represents a model of an artificial neuron, and each interconnection between two units has a weight associated with it.
1.6.2.5 Probabilistic Neural Network
The probabilistic neural network (PNN) is a multilayered feed-forward neural network derived from a Bayes decision methodology, which estimates the probability density function for each test vector. It consists of three layers: the input layer, the hidden layer and the output layer. The input layer receives the test vectors and is connected to the hidden layer, which has a node for each training vector. Each hidden node calculates the dot product between the input vector and its training vector, subtracts 1 from it, and divides the result by the squared standard deviation. The output layer has a node for each class; the sums from the hidden nodes are forwarded to the output layer, and the output node with the highest value determines the class of the input test vector. Although it requires large memory and shows high sensitivity to noisy data, the PNN has given better results than other methods such as band-limited and sparse-spike inversion techniques (Specht, 1990; Memon, 2010)[54].
1.6.2.6 Support Vector Machines(SVMs)
In recent years, support vector machines (SVMs) have become a widely used tool for pattern recognition. They take a discriminative approach, using linear and non-linear separating hyperplanes for the classification of data. However, an SVM can only classify fixed-length data, so variable-length data has to be converted into fixed-length vectors before SVMs can be used. An SVM constructs a hyperplane in a multidimensional vector space which separates the vectors belonging to two different classes; the separating hyperplane is chosen to have the greatest distance to the closest training vectors of each class (Wan and Renals, 2005; Memon, 2010).[20]