


An Extensive Review on Automatic Speech Recognition System for Real time Application

Rajkumar Bhosale1, Narendra Chaudhari2

1 Assistant Professor, Amrutvahani College of Engineering, Sangamner,

422605 Maharashtra, India, bhos_raj@rediffmail.com

 2 Professor and Director, VNIT Nagpur,

 Maharashtra, India, nsc183@gmail.com

Abstract. This paper surveys recent developments in real-time speech recognition research. It introduces the sources of knowledge used in recognition and discusses how that knowledge is used to generate and verify hypotheses. Automatic speech recognition (ASR), once regarded as science fiction and long hampered by numerous performance-degrading factors, is now an important part of information and communication technology. Refinements of the basic approaches and the development of new ones have advanced ASR from systems that merely reacted to a fixed set of sounds to systems that respond to fluently spoken natural language. Nonetheless, technological barriers to flexible solutions and to user satisfaction remain under some circumstances. In this article, we review recent ASR techniques with respect to several factors, including pre-processing, feature extraction, classification and speech recognition techniques.

Keywords: Automatic Speech Recognition, feature extraction, pre-processing, classification.

1   Introduction

In speech recognition area, there are several key areas of research for the current development of spoken language systems [1]. These key areas are automatic speech recognition, robust speech recognition, spontaneous speech, etc. [2]. Many consumer and industrial applications require fast and lightweight real-time speech recognition with limited vocabulary, such as hands-free control for portable music players, car audio systems, cordless phones and domestic appliances. Speech recognition is a pattern classification problem and speech recognition systems employ isolated word recognition [3].

Mobile, embedded and hands-free speech applications fundamentally require continuous, real-time speech recognition. Many current applications, such as speech control of GPS navigation systems and speech-controlled song selection for portable music players and car stereos, also require a reliable and flexible speech interface. Finally, sophisticated natural language applications such as handheld speech-to-speech translation [4] require fast and lightweight speech recognition. ASR technology has advanced rapidly in the past decade. While many ASR applications employ powerful computers to handle complex recognition algorithms, there is clearly a demand for effective solutions on embedded systems such as portable communication devices and various low-cost consumer electronics [5]. The literature describes many successful implementations of ASR on low-cost embedded systems [6], [7].

Current speech recognition systems use a pattern-matching approach in which the classifier is commonly based on hidden Markov models (HMMs) [8]. HMM-based speech recognition technologies have developed considerably and can now achieve high recognition performance. The development of speech input interfaces embedded in mobile terminals requires recognition accuracy, miniaturization and low power consumption [9]. Previous research on custom hardware described implementations of the HMM algorithm using application-specific integrated circuits (ASICs) [10] and field-programmable gate arrays (FPGAs) [11], [12]. A real-time continuous speech recognition system using MFCC feature input and hidden Markov models is reported in the preliminary case study in [13]. Real-time speech recognition for psychology experiments using MFCCs is reported in [14]. Speech recognition systems based on the MFCC-derived cepstrum work well in clean environments, but their performance degrades severely in noisy environments [15].
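Since MFCC features recur throughout this review, a minimal sketch of the standard MFCC front-end may be useful. This is an illustrative implementation, not the pipeline of any cited system; the parameter choices (25 ms frames, 26 mel filters, 13 coefficients) are common defaults rather than values taken from the literature above.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13,
         frame_len=0.025, frame_step=0.010):
    """Minimal MFCC front-end: pre-emphasis, framing, power spectrum,
    mel filterbank, log compression, DCT."""
    # Pre-emphasis boosts high frequencies attenuated in speech production.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Overlapping frames with a Hamming window.
    flen, fstep = int(frame_len * sr), int(frame_step * sr)
    n_frames = 1 + (len(emphasized) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)

    # Power spectrum of each windowed frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular filters spaced evenly on the mel scale.
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)

    # Log mel energies, then DCT-II to decorrelate into cepstral coefficients.
    log_mel = np.log(power @ fbank.T + 1e-10)
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_mels))
    return log_mel @ dct.T
```

The delta and acceleration features mentioned later in Table 1 would be computed as first and second differences of these coefficients over time.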

2  Survey on Automatic Speech Recognition System

Four main approaches are followed in ASR systems: the acoustic-phonetic approach, the pattern recognition approach, the statistics-based approach and the artificial intelligence approach. Automatic speech recognition is the process of converting a speech signal into a sequence of words by means of an algorithm implemented as a computer program. The most common processing stages of an ASR system are shown in Fig. 1.

Fig. 1. Typical architecture of ASR system

2.1   Review based on Pattern Recognition Approach

David Rybach et al. [20] announced the public availability of the RWTH Aachen University speech recognition toolkit, which includes state-of-the-art speech recognition technology for acoustic model training and decoding.

Fig. 2. Block diagram for Pattern-recognition based speech recognition

Maria Schuster et al. [34] proposed a new method that was applied to recordings of a standard test for evaluating articulation disorders (the psycholinguistic analysis of children's speech disorders, PLAKSS) of 31 children aged 10.1 ± 3.8 years.
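The template-matching idea behind the pattern recognition approach, used in classic isolated-word recognizers, can be sketched with dynamic time warping (DTW): each word is stored as a reference feature sequence, and an utterance is assigned to the word whose template it matches most cheaply under time warping. This is an illustrative sketch, not the method of any system reviewed above.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (shape: frames x dims). Allows stretching/compressing time so that
    slow and fast renditions of the same word still align."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(utterance, templates):
    """Pick the word whose template has the smallest DTW distance."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))
```

In practice the feature sequences would be MFCC frames rather than the toy vectors used in tests, and HMMs (Section 2.2) have largely superseded raw DTW templates.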

2.2   Review based on Statistics based Approach

At run time, statistical approaches search the space of all possible solutions and pick the statistically most likely one. Naoki Hirayama et al. [21] presented an automatic speech recognition (ASR) system that accepts a mixture of various dialects. The system recognizes dialect utterances on the basis of statistical simulation of vocabulary transformation and combinations of several dialect models.

2.3   Review based on Artificial Intelligence Approach

The artificial intelligence approach, or knowledge-based approach, attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualizing, analysing and finally making a decision on the measured acoustic features. Expert systems are widely used in this approach.

In one study, speech features were classified with a k-nearest-neighbour classifier and its accuracy was compared with linear discriminant analysis; the experiments yielded an overall average recognition accuracy of 76.8%. Arun Narayanan and DeLiang Wang [22] presented a supervised speech separation system that improves automatic speech recognition (ASR) performance in realistic noise conditions. The system performs separation via ratio time-frequency masking; the ideal ratio mask (IRM) is estimated using deep neural networks (DNNs).
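The ratio-masking idea used by Narayanan and Wang can be sketched directly from its definition: each time-frequency bin of the noisy spectrum is attenuated according to the local speech-to-total energy ratio. The sketch below computes the oracle IRM from known speech and noise spectra; in the actual system [22] a DNN estimates this mask from the noisy input.

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """Ideal ratio mask over a time-frequency grid:
    IRM = (S^2 / (S^2 + N^2))^beta, where S and N are the speech and
    noise magnitude spectra. beta=0.5 is a common compression exponent."""
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return (s2 / (s2 + n2 + 1e-10)) ** beta

def apply_mask(noisy_stft, mask):
    """Separation is per-bin attenuation of the noisy spectrum."""
    return noisy_stft * mask
```

Bins dominated by speech get a mask near 1 (kept), bins dominated by noise get a mask near 0 (suppressed), which is why masking-based separation helps downstream ASR in noise.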

2.4   Review based on Other Approach

Noise robustness has long been one of the most important goals in speech recognition. While the performance of automatic speech recognition (ASR) deteriorates in noisy situations, the human auditory system is relatively adept at handling noise. Vikas Joshi et al. [23] described a novel framework for sub-band based histogram equalization (HEQ) applied to robust speech recognition. They proposed frequency-band-specific equalization to compensate for the noise distortion on individual frequency bands.
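The core of histogram equalization for robust ASR is mapping each feature dimension's empirical distribution onto a fixed reference distribution (typically a standard Gaussian), which cancels monotonic distortions introduced by noise. The sketch below shows plain per-dimension HEQ; the sub-band refinements of Joshi et al. [23] are not reproduced here.

```python
import numpy as np
from statistics import NormalDist

def histogram_equalize(feats):
    """Per-dimension histogram equalization of a feature matrix
    (shape: frames x dims): replace each value by the standard-normal
    quantile at its empirical CDF position, so every dimension ends up
    with an (approximately) standard Gaussian distribution."""
    out = np.empty_like(feats, dtype=float)
    T = len(feats)
    ref = NormalDist()                       # reference distribution N(0, 1)
    for d in range(feats.shape[1]):
        ranks = np.argsort(np.argsort(feats[:, d]))
        cdf = (ranks + 0.5) / T              # empirical CDF values in (0, 1)
        out[:, d] = [ref.inv_cdf(p) for p in cdf]
    return out
```

Because the mapping is monotone, the ordering of feature values within each dimension is preserved; only their distribution is normalized.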

Andre Coy and Jon Barker [24] presented a speech recognition system that combines primitive and schema-driven processes: first, a set of coherent spectro-temporal fragments is generated by primitive segmentation techniques; then a decoder based on statistical ASR techniques performs a simultaneous search for the correct background/foreground segmentation and the word-sequence hypothesis.

3   Performance Analysis

ASR is one of the most useful techniques for the man-machine interface. In this paper we have reviewed several research methodologies based on different approaches; in this section their performance is compared. Table 1 summarizes the performance of the various methodologies.

Table 1. Performance analysis.

Author | Pre-processing | Features | Classifier | SNR | WER | Accuracy | Real time
Jesper Jensen and Zheng-Hua Tan [16] | – | MFCC with delta and acceleration features | Short-time Fourier transform | ✓ | ✓ | ✓ | ✓
Umit H. Yapanel and John H. L. Hansen [17] | Perceptual MVDR | MFCC and mel-scaled filterbank | Hidden Markov model | ✓ | ✓ | ✓ | ✓
Javier Gonzalez-Dominguez et al. [18] | Multi-recognizer module | Filterbank energies | Deep neural network | ✓ | ✓ | ✓ | ✓
Octavian Cheng et al. [19] | Gaussian mixture model | MFCC | Adaptive pruning algorithm | ✓ | ✓ | ✓ | ✓
Naoki Hirayama et al. [21] | Maximization of recognition | Acoustic and linguistic features | Dialect language model | ✓ | ✓ | ✓ | ✓
Arun Narayanan et al. [22] | Denoising | MFCC, ratio masking | Deep neural network | ✓ | ✓ | ✓ | ✓
Engin Avci and Zuhtu Hakan Akpolat [2] | Data acquisition, filtering, white de-noising | Wavelet packet decomposition | Adaptive network-based fuzzy inference system | ✓ | ✓ | ✓ | ✓
Vikas Joshi et al. [23] | Histogram analysis | MFCC | Deep neural network | ✓ | ✓ | ✓ | ✓
Mark Gales and Steve Young [25] | – | MFCC | Hidden Markov model | ✓ | ✓ | ✓ | ✓
Table 1 gives a performance analysis of past research in the field of automatic speech recognition. The analysis is organized by the different stages of ASR and the techniques used at each stage; the performance measures considered in the past work, such as SNR, WER and accuracy, are also noted.

4   Database Description

SI84 database: The SI84 data (7077 utterances, or 15 hours of speech from 84 speakers) are used during the training phase. The training material is separated into a 6877-sentence training set and a 200-sentence validation set. The testing phase uses the Nov92 evaluation data, which contains 330 utterances from 8 speakers. The number of context-independent phonemes is 40.

EPPS Spanish Database: The EPPS (European Parliament Plenary Sessions) Spanish database has a training set of 21127 sentences grouped into 1802 speaker turns. The development set consists of 2402 sentences grouped into 106 speaker turns.

AURORA-4 database: The AURORA-4 continuous speech recognition corpus, derived from the Wall Street Journal (WSJ0) corpus, has 14 test sets grouped as follows: (a) Test set A: clean/multi-style speech in training and clean speech in test, same channel (set 1); (b) Test set B: clean/multi-style speech in training and noisy speech in test, same channel (sets 2-7); (c) Test set C: clean/multi-style speech in training and clean speech in test, different channel (set 8); (d) Test set D: clean/multi-style speech in training and noisy speech in test, different channel (sets 9-14).

TIMIT database: The TIMIT database contains 6300 sentences, divided into two non-overlapping sets for training and testing. The training set contains 4620 utterances spoken by 462 speakers, and the testing set contains 1680 utterances spoken by the other 168 speakers. The speech data were sampled at 16 kHz with 16-bit quantization. All of the TIMIT utterances were processed with the aforementioned low-pass filters. The NTIMIT database was collected by transmitting all of the TIMIT utterances through various telephone channels and re-digitizing them. The bandwidth of the NTIMIT data is 0.3-3.4 kHz.

SPEECON database: SPEECON is an extensive speech database which was collected to support the development of speech-driven consumer applications. The database consists of about 20 languages each represented by 600 speakers. Several recordings were done under real conditions and environments. For example, recordings for home, office, public places and in-car environments are available.

MASS EFFECT 3 video game: The MASS EFFECT 3 video game corpus consists of 20,000 speech recordings, around 500 roles, around 50 professional speakers and around 20 hours of speech. A subset of 4,000 recordings was used for the annotation of speech classes. Each recording was made in professional conditions and encoded in a 48 kHz, 16-bit format. The duration of the recordings varies from 0.1 s to 15 s.

5   Future Direction

ASR is still an active research area: better technologies are sought so that system accuracy can be improved. A system with better pre-processing, feature extraction and classification techniques is therefore a motivating direction for future research. The pre-processing stage should perform denoising, word separation and related tasks. In feature extraction, the selection of feature parameters is important: as many informative features as possible should be extracted, and the feature selection process should be guided by optimization techniques to reduce complexity. Finally, classification is the most important process in recognition, so a novel or hybrid classifier for ASR is encouraged in future work.


Conclusion

In this paper, we presented a detailed review of recent techniques for automatic speech recognition. We first analysed ASR techniques by approach: the acoustic-phonetic, pattern recognition, statistics-based and artificial intelligence approaches. We then compared recent ASR techniques with respect to pre-processing, classification and feature extraction techniques and performance measures. The benchmark datasets used in the literature were described, and finally we outlined the future scope of ASR based on the literature analysed in this review.

References

1. M. Siafarikas, T. Ganchev and N. Fakotakis, "Wavelet packet based speaker verification", In ODYSSEY04-The Speaker and Language Recognition Workshop, pp. 257-264, 2004

2. E. Avci and Z.H Akpolat, "Speech recognition using a wavelet packet adaptive network based fuzzy inference system", Expert Systems with Applications, Vol. 31, No. 3, pp. 495-503, 2006

3. J. Manikandan, B. Venkataramani, K. Girish, H. Karthic and V. Siddharth, "Hardware implementation of real-time speech recognition system using TMS320C6713 DSP", In Proceedings of IEEE International Conference on VLSI Design, pp. 250-255, 2011.

4. A. Waibel, A. Badran, A.W. Black, R. Frederking, D. Gates, A. Lavie, L. Levin, K. Lenzo, L. Mayfield Tomokiyo, J. Reichert, T. Schultz, D. Wallace, M. Woszczyna and J. Zhang, "Speechalator: Two-way speech-to-speech translation in your hand", In Proceedings of Conference of North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 4, pp. 29-30, 2003

5. Yuan, T. Lee, P.C. Ching and Y. Zhu, "Speech recognition on DSP: issues on computational efficiency and performance analysis", Microprocessors and Microsystems, Vol. 30, No. 3, pp. 155-164, 2006

6. B. Delaney, N. Jayant, M. Hans, T. Simunic and A. Acquaviva, "A low-power, fixed-point, front-end feature extraction for a distributed speech recognition system", In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, pp. I-793 – I-796, 2002

7. W. Han, K.W. Hon, C.F. Chan, T. Lee, C.S. Choy, K.P. Pun and P.C. Ching, "An HMM-based speech recognition IC", In Proceedings of IEEE International Symposium on Circuits and Systems, Vol. 2, pp. 744-747, 2003

8. C. Nadeu, D. Macho and J. Hernando, "Time and frequency filtering of filter-bank energies for robust HMM speech recognition", Speech Communication, Vol. 34, No. 1, pp. 93-114, 2001

9. S. Yoshizawa, N. Wada, N. Hayasaka and Y. Miyanaga, "Scalable architecture for word HMM-based speech recognition and VLSI implementation in complete system", IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 53, No. 1, pp. 70-77, 2006

10. W. Han, K.W. Hon, C.F. Chan, T. Lee, C.S. Choy, K.P. Pun and P.C. Ching, "An HMM-based speech recognition IC", In Proceedings of IEEE International Symposium on Circuits and Systems, ISCAS'03, Vol. 2, pp. 744-747, 2003

11. S.J. Melnikoff, S.F. Quigley and M.J. Russell, "Implementing a simple continuous speech recognition system on an FPGA", In Proceedings of IEEE Annual Symposium on Field-Programmable Custom Computing Machines, pp. 275-276, 2002

12. J.J. Rodríguez-Andina, R.D.R. Fagundes and D.B. Júnior, "A FPGA-based Viterbi algorithm implementation for speech recognition systems", In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 1217-1220, 2001

13. D. Huggins-Daines, M. Kumar, A. Chan, A.W. Black, M. Ravishankar and A.I. Rudnicky, "Pocket sphinx: A free, real-time continuous speech recognition system for hand-held devices", In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 185-188, 2006

14. C. Donkin, S.D. Brown and A. Heathcote, "Choice Key: A real-time speech recognition program for psychology experiments with a small response set", Behavior research methods, Vol. 41, No. 1, pp. 154-162, 2009

15. V. Tyagi and C. Wellekens, "On desensitizing the Mel-Cepstrum to spurious spectral components for Robust Speech Recognition", In Proceedings of IEEE International Conference on ICASSP, Vol. 1, pp. 529-532, 2005

16. Jesper Jensen and Zheng-Hua Tan, "Minimum Mean-Square Error Estimation of Mel-Frequency Cepstral Features-A Theoretically Consistent Approach", IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 23, No. 1, pp. 186-197, 2015

17. Umit H. Yapanel and John H. L. Hansen, "A new perceptually motivated MVDR-based acoustic front-end (PMVDR) for robust automatic speech recognition", Speech Communication, Vol. 50, pp. 142-152, 2008

18. Javier Gonzalez-Dominguez, David Eustis, Ignacio Lopez-Moreno, Andrew Senior, Françoise Beaufays, and Pedro J. Moreno, "A Real-Time End-to-End Multilingual Speech Recognition Architecture", IEEE Journal of Selected Topics in Signal Processing, Vol. 9, No. 4, pp. 749-759, 2015

19. Octavian Cheng, Waleed Abdulla, and Zoran Salcic, "Hardware-Software Codesign of Automatic Speech Recognition System for Embedded Real-Time Applications", IEEE Transactions on Industrial Electronics, Vol. 58, No. 3, pp. 850-859, 2011

20. David Rybach, Christian Gollan, Georg Heigold, Bjorn Hoffmeister, Jonas Loof, Ralf Schluter and Hermann Ney, "The RWTH Aachen University Open Source Speech Recognition System", In Interspeech, pp. 2111-2114, 2009

21. Naoki Hirayama, Koichiro Yoshino, Katsutoshi Itoyama, Shinsuke Mori, and Hiroshi G. Okuno, "Automatic Speech Recognition for Mixed Dialect Utterances by Mixing Dialect Language Models", IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 23, No. 2, pp. 373-382, 2015

22. Arun Narayanan and DeLiang Wang, "Improving Robustness of Deep Neural Network Acoustic Models via Speech Separation and Joint Adaptive Training", IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 23, No. 1, pp. 92-101, 2015

23. Vikas Joshi, Raghvendra Bilgi, S. Umesh, Luz Garcia and Carmen Benitez, "Sub-band based histogram equalization in cepstral domain for speech recognition", Speech Communication, Vol. 69, pp. 46-65, 2015

24. Andre Coy and Jon Barker, "An automatic speech recognition system based on the scene analysis account of auditory perception", Speech Communication, Vol. 49, No. 5, pp. 384-401, 2007

25. Mark Gales and Steve Young, "The Application of Hidden Markov Models in Speech Recognition", Foundations and Trends in Signal Processing, Vol. 1, No. 3, pp. 195-304, 2007
