The 11th International Symposium on Communications & Information Technologies (ISCIT 2011)


Realization of Embedded Speech Recognition Module Based on STM32
Qinglin Qu
School of Electrical and Information Engineering
Anhui University of Science and Technology
Huainan, Anhui, China
E-mail: 294025193@qq.com
Abstract—Speech recognition is key to realizing man-machine interface technology. To improve the accuracy of speech recognition and implement the module on an embedded system, an embedded speaker-independent isolated-word speech recognition system based on ARM is designed after analyzing speech recognition theory. The system extracts characteristic parameters, uses the DTW algorithm for template matching, and improves the algorithm with a parallelogram constraint. To perform speech recognition independently, the system uses an STM32-series chip combined with external circuitry. The recognition rate in tests reaches 90%, which meets the real-time requirements of recognition.

Keywords—speech recognition; embedded system; STM32; DTW

Liangguang Li
School of Electrical and Information Engineering
Anhui University of Science and Technology
Huainan, Anhui, China
E-mail: lgli@aust.edu.cn

I. INTRODUCTION

It is a long-held dream that humans could communicate with machines and have them understand what is said. Speech recognition is the technology that lets a machine translate voice signals into text or commands through recognition and understanding. It is a cross-disciplinary subject that draws on signal processing, pattern recognition, probability theory and information theory, the mechanisms of voice production and hearing, and artificial intelligence [1]. With the development of very-large-scale integration in recent years, embedded speech recognition systems have become a new direction in voice recognition.

II. PRINCIPLE OF SPEECH RECOGNITION

Speech recognition is a branch of pattern recognition that comprises two processes: speech training and speech recognition. The first stage, training, is also known as the modeling stage: the system learns and summarizes human speech, and the learned knowledge is stored to establish a language reference model. The second stage, identification, is also known as the testing stage: the system matches incoming voice input against the reference models in the library and returns the nearest meaning or semantic recognition result [2]. Typical speech recognition methods include dynamic time warping (DTW), vector quantization (VQ), hidden Markov models (HMM), artificial neural networks (ANN) and mixed pattern recognition techniques.

III. SYSTEM DESIGN

A. Hardware design of the system

The system uses the STM32F103VET6 processor, produced by ST and based on the Cortex-M3 core, as its heart. Together with the speech-input amplifier and filter circuit, the storage circuit, the LCD, the keyboard, the audio DAC, and a PC connected through a JLINK8 adapter over the JTAG interface, the processes of modeling and testing are completed. The system hardware diagram is shown in figure 1.



Fig. 1 Structure of the system hardware

The STM32F103VET6 is a high-performance 32-bit processor based on the Cortex-M3 core with an optimized level of power dissipation [3]. Its maximum working frequency is 72 MHz, and its single-cycle multiplication and hardware division are very favorable for digital signal processing. It also includes 512 Kbytes of flash memory, 64 Kbytes of SRAM, a 12-channel DMA controller, 12-bit A/D converters, 11 timers, 112 fast I/O ports, an LCD parallel interface, and a flexible static memory controller (FSMC) with 4 chip selects. It has three low-power modes: sleep, stop and standby. Using the computing power of the STM32 together with the external amplification and filter circuits, the system can complete speech recognition on its own. The structure of the system is described below.

1) Control section
The control section is composed of the STM32F103VET6 and the keyboard. The STM32F103VET6 performs speech signal collection, feature extraction, training and recognition. The keyboard mainly handles system reset and some simple control functions.

2) Input and output
This section contains a microphone, the amplification and filter circuit, the LCD, the audio DAC and the speaker. Speech signals enter the controller from the microphone after amplification and filtering. The conversion voltage range of the STM32F103VET6 ADC is [4] VREF− ≤ VIN ≤ VREF+, with VREF− = 0 V and VREF+ = 3.3 V, so the component parameters of the filter circuit were chosen to keep the signal between 0 V and 3.3 V. The speech signals are converted by the ADC and the results are stored in RAM. For output, audio samples pass through the on-chip I2S interface to the external audio DAC chip and then to the speaker. The LCD parallel interface makes it convenient to display the results.

3) Storage part
The STM32F103VET6 supports CompactFlash, PSRAM, NOR and NAND memories through the FSMC, which makes it convenient to store the sampled data in external memory.

B. Design of software

The system chooses linear prediction coefficients (LPC) as the characteristic vectors and Dynamic Time Warping (DTW) as the algorithm to complete voice recognition. Sampling, preprocessing, feature extraction and recognition are all performed in the processor, and the system matches the processed results against the templates in the library to realize speech recognition. To meet the real-time requirement, the algorithm was improved before preprocessing to raise the operating speed.

1) Preprocessing
Preprocessing includes pre-emphasis, windowing and endpoint detection.

Pre-emphasis mainly boosts the high-frequency band of the signal for spectrum analysis. The coefficient used in this experiment is 0.93 and the formula is [5]:

    data[i] = original[i] − 0.93 · original[i − 1]    (1)

where original[] stores the sampled data and data[] is the pre-emphasized output. For convenience of calculation, 0.93 is shifted left seven bits and the integer part, 119, is kept. The result is therefore scaled by 2^7 and the formula becomes:

    data[i] = (original[i] << 7) − 119 · original[i − 1]    (2)

Windowing preserves the short-time behavior of the speech signal. The algorithm selects a Hamming window:

    h(n) = 0.54 − 0.46 cos(2πn/N)    (3)

where N is the number of sampling points in a frame, and the implementing statement is data[n] *= h(n). Because the window function requires many cosine values, each less than 1, h(n) is stored as an array whose values are shifted left seven bits to keep the integer part (N is the length of one frame). After pre-emphasis and windowing the sampled data are therefore scaled by 2^14 in total; since the maximum value after calculation is 2^26, the data are shifted right 10 bits.

For endpoint detection, according to the characteristics of Chinese pronunciation, the short-term energy and zero-crossing rate (ZCR) of the speech signal are chosen as the characteristic parameters [6], and a threshold method is used to judge the starting and ending points of the speech signal. The short-term energy of s(n) is defined as:

    En = Σ over m of [s(m) w(n − m)]^2    (4)

where w(n − m) is the Hamming window h(n). After the treatments above, the processing result for the signal "1" is shown in figure 2.

Fig. 2 Processing result of signal "1" (waveform, short-term energy and ZCR)

2) Extraction of feature vectors
The system uses linear prediction coefficients (LPC) as features and solves for them with the Durbin algorithm. The cepstrum operation on LPC exploits the minimum-phase characteristic of the vocal-tract system function and has low computational complexity, so it describes the signal well. In linear prediction analysis the choice of the order p must be careful: a larger p causes oscillations that make the inherent characteristics of the speech signal appear random. Taking 256 points of the speech signal "1" and calculating the 12-order and 10-order LPC coefficients separately gives the results shown in figure 3.

Fig. 3 (a) 12-order LPC coefficients of speech signal "1"; (b) 10-order LPC coefficients of speech signal "1"

To prevent overflow of the covariance while calculating the linear prediction coefficients, the products are scaled down during accumulation. The covariance program is shown below; r[i] stores the covariance data, and the data keep four effective binary digits:

    for (j = 0; j <= p; j++) {
        r[j] = 0;
        for (i = 0; i < n - j; i++)
            r[j] += (data[i] * data[i + j]) >> 4;
    }

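The software design above names DTW as the template-matching algorithm. A minimal C sketch of accumulated-distance matching with the slope-constrained recurrence D(i,j) = d(i,j) + min[D(i−1,j), D(i−1,j−1), D(i−1,j−2)] and frame distance |R[j] − T[i]| is given below; scalar frames stand in for LPC vectors, and the sizes and names are illustrative, not from the paper.

```c
#define BIG (1 << 28)   /* stands in for "infinity" on unreachable cells */

static int imin3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/*
 * Accumulated DTW distance between a template T (tn frames) and a
 * reference R (rn frames).  Each template frame i advances exactly once,
 * while the reference index j may advance by 0, 1 or 2 frames per step,
 * matching the min[D(i-1,j), D(i-1,j-1), D(i-1,j-2)] recurrence.
 */
static int dtw(const int *T, int tn, const int *R, int rn)
{
    int D[32][32];                      /* assumes tn, rn < 32 for the sketch */
    for (int j = 0; j <= rn; j++)
        D[0][j] = (j == 0) ? 0 : BIG;   /* only (0,0) is a valid start */
    for (int i = 1; i <= tn; i++) {
        D[i][0] = BIG;
        for (int j = 1; j <= rn; j++) {
            int d = R[j - 1] - T[i - 1];       /* local distance |R[j] - T[i]| */
            if (d < 0) d = -d;
            int p1 = D[i - 1][j];
            int p2 = D[i - 1][j - 1];
            int p3 = (j >= 2) ? D[i - 1][j - 2] : BIG;
            int best = imin3(p1, p2, p3);
            D[i][j] = (best >= BIG) ? BIG : d + best;  /* cap unreachable cells */
        }
    }
    return D[tn][rn];                   /* distance under the best alignment */
}
```

For example, matching T = {1, 2, 3} against R = {1, 2, 2, 3} gives 0, because the repeated reference frame is absorbed by a D(i−1, j−2) step, while T = {1, 1, 3} against the same R gives 1.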
After windowing, with a speech signal sampling frequency of 10.4895 kHz, the length of one frame is 23.8 ms (250 sampling points) and the frame shift is 100 points. From figure 3 it can be concluded that the oscillation of the 10-order LPC coefficients is smaller than that of the 12-order coefficients, so 10 is chosen as the experimental order p.

3) Improved DTW algorithm for speech recognition
The simplest and most effective method for isolated-word speech recognition is the DTW (Dynamic Time Warping) algorithm, a classic early algorithm based on dynamic programming (DP) that solves the template-matching problem for utterances of different lengths. Basic DTW searches for an optimal path that minimizes the sum of the point-wise differences along the path; this accumulated value D is the distance between the two vectors under the optimal time alignment. The distance between two frames is evaluated as:

    d[{T[i], R[j]}] = |R[j] − T[i]|    (5)

and the accumulated distance D satisfies:

    D(i, j) = d(i, j) + min[D(i − 1, j), D(i − 1, j − 1), D(i − 1, j − 2)]    (6)

The derivation of the DTW algorithm [7] shows that the traditional algorithm must calculate every d{T[i], R[j]} and D(i, j). A parallelogram can be used to limit the scope of the dynamic alignment so that only the diamond-shaped region of d[{T[i], R[j]}] and D(i, j) is calculated, which saves a great deal of storage space. At the same time, because the algorithm depends strongly on the endpoints, a dynamic starting point is used: the minimum among d{T[1], R[1]}, d{T[1], R[2]} and d{T[2], R[1]} is taken as the starting point of the search, which increases the recognition rate. In practice this all-path-constraint (ADTW) algorithm is combined with the hardware improvements to make the speed fast enough to meet the system's real-time requirements.

Fig. 4 Flow diagram of the software process (initialization → start recognition → receive voice signal → signal preprocessing → extract LPC coefficients for each frame → match against templates → on success output the control word, otherwise report an error)

IV. EXPERIMENTAL RESULTS

The system was tested on the digits 0-9, with 200 voice recordings used as speech reference templates; the results are shown in Table I.

TABLE I. RESULTS OF SPEECH RECOGNITION

    Templates of a same single word    Recognition numbers    Recognition rate
    10                                 5                      91.5%
    20                                 5                      93.3%
    30                                 5                      95.8%

Table I shows that the recognition rate for a single word increases with the number of templates and exceeds 90%, which meets the practical requirements of speech recognition. Under the same conditions the recognition results of the DTW and HMM algorithms are more or less the same, but the HMM algorithm is much more complicated.

V. CONCLUSION

The system uses low-power STM32-series chips; the hardware circuit and the algorithm are analyzed, and the algorithm is improved in combination with the hardware to raise its speed and meet the real-time requirements. The system is highly versatile: it can be applied to many embedded systems and has good prospects in voice control of home appliances, toys, PDAs, mobile phones, intelligent devices and other areas.

REFERENCES
[1] Yi Ke-chu, Tian Bin, Fu Qiang. Speech Signal Processing[M]. Beijing: National Defense Industry Press, 2000.
[2] Zhao Li. Speech Signal Processing[M]. Beijing: Mechanical Industry Press, 2003.
[3] ST. STM32F103xCDE Data Manual. March 2009.
[4] WEN Han, HUANG Guo-shun. A research on improving DTW in speech recognition[J]. Microcomputer Information, 2010, 26(7): 195-197.
[5] Hermansky H. Perceptual linear predictive (PLP) analysis of speech[J]. J. Acoustical Society of America, 1990, 87(4): 1738-1752.
[6] G. White, R. B. Neely. Speech recognition experiments with linear prediction, bandpass filtering, and dynamic programming[J]. IEEE Trans. Acoustics, Speech, and Signal Processing, 1976.
[7] S. J. Young, P. C. Woodland. State clustering in HMM-based continuous speech recognition[J]. Computer Speech and Language, 1994, 8(4): 369-384.
