
CHAPTER 1

INTRODUCTION

1.1 DESIGN INTRODUCTION

For the past several decades, designers have processed speech for a wide variety of applications ranging from mobile communications to automatic reading machines. Speech recognition reduces the overhead caused by alternate communication methods. Speech has seen limited use in electronics and computing because of the complexity and variety of speech signals and sounds. However, with modern processes, algorithms, and methods, speech signals can be processed and converted to text.

1.2 INTRODUCTION

Our project aimed at developing a real-time speech recognition engine on an FPGA using the Altera DE2 board. The system was designed to recognize the word being spoken into the microphone. Both industry and academia have spent considerable effort developing software and hardware in search of a robust solution. However, because of the large number of accents spoken around the world, the problem remains an active area of research. Speech recognition finds numerous applications, including health care, artificial intelligence, human-computer interaction, interactive voice response systems, military, and avionics. Another important application is helping physically challenged people interact with the world in a better way.

We implemented a real-time speech recognition engine that takes as input the time domain signal from a microphone and performs frequency domain feature extraction on the sample to identify the word being spoken. Our design exploits the fact that most words, spoken across various accents around the world, have some common frequency domain features that can be used to identify the word. While various methodologies have been developed for speech recognition, it remains an unsolved yet intriguing problem.

CHAPTER 2

LITERATURE SURVEY

2.1 EXISTING SYSTEM

The existing system can only recognise speech; it cannot display the recognised text. It can use either a discrete or a continuous Hidden Markov Model (HMM). Using SRAM, it can store and recognise only 500 words, and its recognition speed is slow. The existing system can be applied to various practical purposes, but it has several drawbacks, which is why we use speech-to-text conversion.

2.2 PROPOSED SYSTEM

The proposed system offers more advantages and makes better use of the HMM. It uses both the discrete and the continuous form of the hidden Markov model. It not only recognises speech but also displays the text on a liquid crystal display. With proper training and in a closed environment, higher accuracy can be achieved, and the Viterbi algorithm finds the most likely text for the speech. The system has a practical application for deaf people; further improvement could help educate many deaf people, and speech-to-visual conversion may become possible in the near future.

CHAPTER 3

BACKGROUND THEORY

3.1 SPEECH RECOGNITION PRINCIPLE

Figure 3.1 Speech waves

Speech recognition systems can be classified into several models according to the types of utterances to be recognized. These classes take into consideration the ability to determine the instant when the speaker starts and finishes the utterance. In our project we aimed to implement an isolated word recognition system, which usually applies a rectangular window over the word being spoken. These types of systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances.

A desktop microphone is not appropriate for this project, since it tends to pick up more ambient noise, which hinders accurate detection of speech. A headset-style microphone minimizes the ambient noise. Since speech recognition depends heavily on processing speed because of the large amount of signal processing involved, implementing it on an FPGA was a good choice and a motivation behind this project. Also, the memory available on the Altera DE2 development board was enough to implement the design for a word of length nearly 1 second.

Speech recognition engines are broadly classified into two types, namely pattern recognition and acoustic-phonetic systems. While the former use known (trained) patterns to determine a match, the latter use attributes of the human body to compare speech features (phonetics such as vowel sounds). Pattern recognition systems combine well with current computing techniques and tend to have higher accuracy.

3.1.1 Flow chart

Figure 3.1.1 speech recognition principle

The system recognizes the spoken digit using a maximum likelihood estimate, i.e., a Viterbi decoder. The input speech sample is preprocessed to extract the feature vector. Then, the nearest codebook vector index for each frame is sent to the digit models. The system chooses the model that has the maximum probability of a match.

3.2 DATA ACQUISITION

The speech signal is essentially analog in nature. Hence, it must be converted to digital data in order to be read and processed. We used the built-in ADC of the Wolfson codec to sample our signal at 8 kHz, producing a 16-bit signed digital output. Once a word was detected, we computed the FFT over the next 32 blocks of data and read their power coefficients in the Nios II processor.

Figure 3.2 ADC Wave Form.

3.3 DETECTION

The system must know when a spoken word is input. Thus, a detection algorithm has been devised. This is done by continually computing the difference of the absolute average of two adjacent sound windows (sets of consecutive sound data), and comparing it to a predefined threshold.

The detector algorithm can be broken down as follows:

1. The absolute average w_1 of a sound window of length W is computed from the sound samples s_i starting at s_a and ending at s_b, as shown in Eq. 1.

   $$ w_1 = \frac{1}{W} \sum_{i=a}^{b} |s_i| \qquad (1) $$

2. The average of the second window, w_2, is computed from the sound samples s_i starting at s_b and ending at s_c, as shown in Eq. 2.

   $$ w_2 = \frac{1}{W} \sum_{i=b}^{c} |s_i| \qquad (2) $$

3. The difference between w_2 and w_1 is compared to the threshold value Th. If it is larger, the spoken word is considered to start at s_c. Else, the algorithm goes on to step 4.

4. The average of the oldest window (w_1) is discarded, and replaced by w_2. Then, the algorithm goes back to step 2.

Note that the value of Th has been experimentally determined in the MATLAB implementation (see appendix A). Nevertheless, it may vary depending on the sound acquisition setup (i.e. position of the microphone, noise level, etc.). Finally, the length of the word is fixed to 1.024 s for convenience.

3.4 FREQUENCY CONTENT

Once the word is detected, it is mapped to the frequency domain by computing its Discrete Fourier Transform (DFT) using the Fast Fourier Transform (FFT) algorithm. Since the length of a word is 1.024 s and the sound is sampled at 5 kHz, a word spans 5120 samples, so five 1024-point FFTs are required to fully characterize a single word. In the MATLAB implementation, these are stored in each row of a 1024 x 5 matrix. This matrix constitutes the fingerprint. Note that, for the sake of simplicity, only the real part of the DFT is kept.

In the training mode, the user defines how many times a word is trained. The frequency content of each utterance is averaged by adding the fingerprints together and dividing the final sum by the number of times the word has been trained. This generates the reference fingerprint.

3.5 DISTANCE

The comparison between a word's fingerprint and the reference fingerprint is done by taking the Euclidean distance between them. To do this, they are considered as five 1024-dimensional vectors (one for each matrix row), and the average of their respective Euclidean distances is computed. This is shown in Eq. 3, where D is the distance, and a_{ni} and b_{ni} are the ith components of the fingerprints. The n index points to each of the five vector pairs.
$$ D = \frac{1}{5} \sum_{n=1}^{5} \sqrt{\sum_{i=1}^{1024} (a_{ni} - b_{ni})^2} \qquad (3) $$

If the distance is less than a preset maximum (maxDis), then the analyzed word is considered to match the reference word. Note that maxDis is experimentally set to 140 in the MATLAB implementation (see appendix B). Similarly to the Th parameter, this value depends on the sound acquisition setup and may need to be varied in order to achieve accurate speech recognition.
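As an illustration, the training average and the distance test of Eq. 3 can be sketched in Python. This is a minimal sketch under stated assumptions, not the MATLAB code of appendix B: the function names `average_fingerprints`, `fingerprint_distance` and `matches` are hypothetical, and a fingerprint is assumed to be a list of equally long rows (five rows of 1024 values in the report).

```python
import math


def average_fingerprints(fingerprints):
    """Element-wise mean of several fingerprints (the reference fingerprint)."""
    count = len(fingerprints)
    rows = len(fingerprints[0])
    cols = len(fingerprints[0][0])
    return [
        [sum(fp[r][c] for fp in fingerprints) / count for c in range(cols)]
        for r in range(rows)
    ]


def fingerprint_distance(a, b):
    """Average of the row-wise Euclidean distances between two fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same number of rows")
    total = 0.0
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(row_a, row_b)))
    return total / len(a)


def matches(word, reference, max_dis=140.0):
    """True when the distance is below maxDis (140 in the report's MATLAB setup)."""
    return fingerprint_distance(word, reference) < max_dis
```

The default of 140 only mirrors the value quoted above; as the text notes, it depends on the acquisition setup and would need to be tuned.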


CHAPTER 4

HARDWARE IMPLEMENTATION

In order to implement the speech recognition algorithm on the Altera DE2 board, it is broken down into modules. These are then mapped to combinational logic and finite-state machines (FSM), using the Quartus II software package.

4.1 WOLFSON INTERFACE

The board has a Wolfson WM8731 Coder-Decoder (CODEC), which acts as the ADC. This audio chip has a microphone jack, and is connected in a master-slave configuration with the FPGA (the latter being the master). In order for the master to control the CODEC and acquire the digital data, three modules have been created: the I2C bus controller, the clock module, and a sound fetcher.

4.1.1 I2C Bus Controller

Three tasks need to be performed on the CODEC to modify its internal settings: de-mute the microphone input, boost the microphone volume, and change the default sound path (so that the microphone is given priority over other inputs). To do this, the FPGA communicates with the Wolfson via the I2C (Inter-Integrated Circuit) protocol using two pins: 'SDIN' (the data line), and 'SCLK' (the bus clock), as seen in Fig. 4.1.1.


Figure 4.1.1 Two line I2C bus protocol for the wolfson WM8731

The contents of the data line are sent in the same order as seen above (after a start condition): 'RADDR', 'R/W', 'ACK', 'DATAB[15-9]', and 'DATAB[8-0]', which stand respectively for base address, Read/Write, acknowledge, control address, and control data. The last block modifies the settings. For instance, if 'DATAB[0]' is '1', the volume is boosted. The base and control addresses are used to specify which internal CODEC registers need to be accessed. Read/Write will always be set to zero (i.e. write), since the Wolfson is write-only. To signify a start condition, 'SDIN' goes from high to low while the clock is maintained high. The same applies for a stop condition, except the transition is low-to-high. Finally, the 'ACK' signal is sent from the CODEC to the FPGA, as opposed to all the other data line contents. This introduces the need for 'SDIN' to be implemented as a bi-directional pin, which requires the use of a tri-state buffer. An FSM is created to implement the bus interface between the FPGA and the Wolfson. Note that, because 'SCLK' must be between 0 Hz and 400 kHz, 'ADCLRC' (48.83 kHz) is used (see section 2.1.3). For start and stop conditions, 'ADCLRC' is overridden by the FSM, so that 'SCLK at 1


4.1.2 Sound Fetcher

After the Wolfson digitizes the input, it presents the data ('ADCDAT') serially as seen in Fig. 4.1.2a. This is the Integrated Interchip Sound (I2S) standard. Two clocks are needed: 'ADCLRC' (the left-right clock for ADC data), and 'BCLK' (the bit-stream clock). The CODEC will place the most significant bit (MSB) on the 'ADCDAT' line so that it can be fetched on the second rising 'BCLK' edge following a high-to-low transition of 'ADCLRC'. The left and right channel distinction is used for stereo sound. Since this project deals with mono sound, the data is fetched when 'ADCLRC' is low (left channel).

Figure 4.1.2a ADCDAT output convention used by the Wolfson WM8731 (I2S)


Figure 4.1.2b Circuit schematic of the overall ADCDAT fetcher

The FSM in Fig. 4.1.2b ('ADCDAT_fetcher_FSM') is used to keep track of the events on the clocks (e.g. rising edges) in order to know the exact moment one can start and stop to fetch. Because the data is presented serially, the FSM communicates with a serial-to-parallel register ('LPM_SHIFTREG'), which outputs this data in parallel form.
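The serial-to-parallel step can be modelled in software as follows. This is only a behavioural sketch in Python, not the 'LPM_SHIFTREG' hardware: the function name `shift_in_bits` is hypothetical, the bits are assumed to arrive MSB first, and the word is assumed to be 24 bits wide in two's complement form.

```python
def shift_in_bits(bits, width=24):
    """Assemble serially received bits (MSB first) into a signed integer."""
    if len(bits) != width:
        raise ValueError("expected exactly %d bits" % width)
    value = 0
    for bit in bits:
        # Shift the register left and insert the new bit at the LSB position.
        value = (value << 1) | (bit & 1)
    # Interpret the assembled word as two's complement.
    if value & (1 << (width - 1)):
        value -= 1 << width
    return value
```

In hardware the same shifting happens once per 'BCLK' edge; the sign interpretation is implicit in how the downstream logic treats the register contents.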


Decimal number   Binary (2s comp.)   Quantized decimal   Quantized binary (2s comp.)
      3                011                   1                      01
      2                010                   1                      01
      1                001                   0                      00
      0                000                   0                      00
     -1                111                  -1                      11
     -2                110                  -1                      11
     -3                101                  -2                      10
     -4                100                  -2                      10

Table 4.1.2 Two's complement quantization from 3 bits to 2 bits

The next step is to quantize. The ADCDAT word length is 24 bits in two's complement form. As said in section 1.2, the objective is to reduce the length to 8 bits. In order to see how signed binary numbers can be quantized, Table 4.1.2 illustrates a quantization from 3 bits to 2 bits. A closer look at the second and fourth columns reveals that, in order to quantize, it is only necessary to keep the two MSBs. Note that this is possible because the two's complement scheme is used. Consequently, when going from 24 bits to 8 bits, only the eight most significant bits need to be kept.
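The quantization rule above (keep only the most significant bits) corresponds to an arithmetic right shift. The following Python sketch illustrates it; the function name `quantize` and its parameters are hypothetical and only serve the example.

```python
def quantize(value, in_bits=24, out_bits=8):
    """Keep the out_bits most significant bits of a signed in_bits sample."""
    low = -(1 << (in_bits - 1))
    high = (1 << (in_bits - 1)) - 1
    if not low <= value <= high:
        raise ValueError("value does not fit in %d signed bits" % in_bits)
    # Python's >> on integers is an arithmetic shift, so the sign is preserved,
    # which matches dropping the low-order bits of a two's complement word.
    return value >> (in_bits - out_bits)
```

Calling `quantize(v, in_bits=3, out_bits=2)` for v from 3 down to -4 gives the quantized decimal column of the 3-bit to 2-bit table.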


The last D-type flip-flop ('LPM_DFF/downsampler_ff') reduces the output data rate from 48 kHz to 5 kHz. In order to do that, it is controlled by the two modules (a counter and an FSM) in the top right corner of Fig. 4.1.2b, which generate two pulses. Both pulses occur at a 5 kHz frequency. The first instructs the flip-flop to fetch the data. The second pulse is an output 'READY' signal that happens half a period after the first. Its purpose is to make sure that the rest of the circuit will fetch the data after it has been properly latched.

4.1.3 Clock Module

The FPGA is clocked at 50 MHz [1]. Because it acts as the Wolfson's master, it must feed the latter with various clocks: the main audio chip clock ('XCK'), 'ADCLRC', and 'BCLK'. According to the Wolfson data sheets, both 'ADCLRC' and 'XCK' are dependent on the sampling frequency. Since the latter is 48 kHz, 'ADCLRC' must also be 48 kHz. 'XCK' is 12.288 MHz [4]. 'BCLK' must be at least 2.4 MHz, because it needs to yield 25 rising clock edges (1 to wait for the MSB and 24 to fetch each 'ADCDAT' bit) within half the period of 'ADCLRC' (i.e. within 10.42 µs).
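The 'BCLK' lower bound quoted above follows from simple arithmetic, which the short Python sketch below reproduces. The function names are hypothetical; the only inputs are the 48 kHz sampling frequency and the 25 rising edges stated in the text.

```python
def half_period_us(sample_rate_hz=48_000):
    """Half of one 'ADCLRC' period, in microseconds."""
    return 1_000_000 / (2 * sample_rate_hz)


def min_bclk_hz(sample_rate_hz=48_000, edges=25):
    """Lowest 'BCLK' frequency that fits `edges` rising edges in half a period."""
    # edges / (1 / (2 * sample_rate)) simplifies to edges * 2 * sample_rate.
    return edges * 2 * sample_rate_hz
```

With the defaults, the half period is about 10.42 µs and the minimum 'BCLK' frequency is 2.4 MHz, in agreement with the figures in the text.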

Figure 4.1.3 Block diagram of clock module


To implement all three clocks, a single clock module was devised. As seen in Fig. 4.1.3, it takes the 50 MHz clock as an input. Using a 2-bit counter, it then proceeds to divide it by 2^2, yielding a 12.5 MHz 'XCK' signal. Similarly, 'ADCLRC' and 'BCLK' are output using respectively 10-bit and 3-bit counters (to divide by 2^10 and 2^3). This produces 48.83 kHz and 6.25 MHz signals (the latter being greater than 2.4 MHz). Even though those values are approximations of the ideal ones specified in the data sheets, they are close enough for practical purposes [3].

4.3 FFT

The discrete Fourier transform (DFT) plays an important role in the analysis, design, and implementation of discrete-time signal processing algorithms and systems because efficient algorithms exist for the computation of the DFT. These efficient algorithms are called Fast Fourier Transform (FFT) algorithms. In terms of multiplications and additions, the FFT algorithms can be orders of magnitude more efficient than competing algorithms. It is well known that the DFT takes N^2 complex multiplications and N^2 complex additions for a complex N-point transform. Thus, direct computation of the DFT is inefficient. The basic idea of the FFT algorithm is to break up an N-point DFT into successively smaller transforms known as butterflies (basic computational elements). The small transforms used can be 2-point DFTs known as Radix-2, 4-point DFTs known as Radix-4, or other sizes. A two-point butterfly requires 1


complex multiplication and 2 complex additions, and a 4-point butterfly requires 3 complex multiplications and 8 complex additions. Therefore, the Radix-2 FFT reduces the complexity of an N-point DFT down to (N/2) log_2 N complex multiplications and N log_2 N complex additions, since there are log_2 N stages and each stage has N/2 2-point butterflies. For the Radix-4 FFT, there are log_4 N stages and each stage has N/4 4-point butterflies. Thus, the total number of complex multiplications is (3N/4) log_4 N = (3N/8) log_2 N and the number of required complex additions is 8(N/4) log_4 N = N log_2 N. The Radix-4 FFT therefore requires only 75% as many complex multiplications as the Radix-2 FFT, although it uses the same number of complex additions. These savings make it a widely used FFT algorithm. Thus, we would like to use the Radix-4 FFT if the number of points is a power of 4. However, if the number of points is a power of 2 but not a power of 4, the Radix-2 algorithm must be used to complete the whole FFT process. In this report, only the Radix-4 FFT algorithm is discussed.

Now, let us consider an example to demonstrate how FFTs are used in real applications. In 3GPP-LTE (Long Term Evolution), the M-point DFT and inverse DFT (IDFT) are used to convert the signal between the frequency domain and the time domain. 3GPP-LTE aims to provide an uplink speed of up to 50 Mbps and a downlink speed of up to 100 Mbps. For this purpose, the 3GPP-LTE physical layer uses Orthogonal Frequency Division Multiple Access (OFDMA) on the downlink and Single Carrier Frequency Division Multiple Access (SC-FDMA) on the uplink.
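The operation counts given in the FFT discussion above can be checked numerically. The Python sketch below simply evaluates those formulas; the function names are hypothetical, and it counts operations only (it does not compute a transform).

```python
def _log2_exact(n):
    """Return log2(n) for a power of two, otherwise raise ValueError."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1


def direct_dft_counts(n):
    """(complex multiplications, complex additions) for the direct DFT."""
    return n * n, n * n


def radix2_counts(n):
    """log2(n) stages of n/2 butterflies: 1 multiplication, 2 additions each."""
    stages = _log2_exact(n)
    return (n // 2) * stages, n * stages


def radix4_counts(n):
    """log4(n) stages of n/4 butterflies: 3 multiplications, 8 additions each."""
    log2n = _log2_exact(n)
    if log2n % 2:
        raise ValueError("n must be a power of four")
    stages = log2n // 2
    butterflies = n // 4
    return 3 * butterflies * stages, 8 * butterflies * stages
```

For the 1024-point transform used in this project, the Radix-2 count is 5120 multiplications and the Radix-4 count is 3840, which is the 75% ratio stated above; both need 10240 additions.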


In order to map the sound data from the time domain to the frequency domain, the Altera IP Megacore FFT module is used. The module is configured so as to produce a 1024-point FFT. It is not only capable of taking a streaming data input in natural order, but it can also output the transformed data in natural order, with a maximum latency of 1024 clock cycles once all the data (1024 data samples) has been received.

4.2 DETECTOR

The absolute values of the 1024 samples that constitute a window are accumulated (summed together). The sum is then shifted right by 10 in order to divide by 1024 (since 2^10 = 1024), thus producing the average value of the window. The difference between that average and the one from the previous window (stored in 'Register 1') is then computed. 'Register 2' is used to control the comparator's input in order to ensure that the comparison with the user-defined 9-bit threshold takes place when all the samples of the window have been processed. Once done, the contents of 'Register 1' are replaced by the newer window average.


Figure 4.3 Detector

An FSM is needed in order to control when to do this average swapping, when to enable 'Register 2', when to determine if a count of 1024 samples has been reached, and when to clear the accumulator to restart the summation. It also accepts a 'RESET' signal that asynchronously clears the accumulator.
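The behaviour of the detector (accumulate absolute values, shift to divide, compare the difference of adjacent window averages with a threshold) can be sketched in software. This Python model is an assumption-laden illustration rather than the hardware design: the function name `detect_word_start` is hypothetical, windows are taken as adjacent and non-overlapping, and the returned index is the first sample after the second window.

```python
def detect_word_start(samples, threshold, window=1024):
    """Return the sample index where a word is deemed to start, or None."""
    if window < 1 or window & (window - 1):
        raise ValueError("window must be a power of two")
    shift = window.bit_length() - 1  # 10 for a 1024-sample window
    previous_avg = None  # plays the role of 'Register 1'
    for start in range(0, len(samples) - window + 1, window):
        accumulator = sum(abs(s) for s in samples[start:start + window])
        average = accumulator >> shift  # divide by the window length
        if previous_avg is not None and average - previous_avg > threshold:
            return start + window
        previous_avg = average  # the newer average replaces the older one
    return None
```

For example, a silent window followed by a window of amplitude 100 triggers detection for any threshold below 100, while a constant signal never does.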

4.4 MEMORY MANAGEMENT In order to store the reference fingerprint, the 512 kB SRAM module built in the board is used. There are three memory modules on the Altera DE2: a 4 MB Flash memory chip, an 8 MB SDRAM chip and a 512 kB SRAM chip. While the Flash module provides a vast amount of non volatile storage, it is very slow with respect to the main system clock. It also requires a controller capable of dealing with its timing constraints. The SDRAM chip is very fast and has a very large storage capacity, but it


requires a very sophisticated controller to be operated. This makes the SRAM chip an obvious choice. Even though it is not the fastest nor the largest, it has ten times the required storage capacity needed for this project, and it is fast enough (since it can perform a read or write operation in less than 20 ns, i.e. a system clock period) so as to avoid any timing issues. Moreover, it is a fairly simple device and can be easily controlled.

Figure 4.4 512 kB SRAM chip block diagram

The SRAM memory module is depicted in Fig. 4.4 with its inputs and outputs. Note that the 'Data' pins are bidirectional and require a tri-state buffer to be properly driven. The chip storage is divided into 2^18 16-bit blocks which can be directly addressed through the 18 'Address' lines. This is not convenient for the implementation since the data stored is 8 bits wide.

4.4.1 Memory Controller

The Memory Controller shown in Fig. 4.4.1a has four user inputs ('ADDR', 'DATA_IN', 'MODE', and 'ENABLE'), one user output ('DATA_OUT') and seven inputs/outputs (depicted in green) that connect


directly to the SRAM chip ('Low Byte Mask', 'High Byte Mask', 'Output Enable', 'Write Enable', 'Chip Enable', 'Address', and 'Data'). The controller simplifies the communication to the SRAM chip by splitting the bidirectional pins and allowing each 8-bit memory block to be directly accessed (see its detailed schematics in 4.4.1b).

Figure 4.4.1a Memory controller block diagram

The pins are split by using Altera's bustri (tri-state buffer) and each 8-bit block can be accessed using the 'High Byte Mask' and the 'Low Byte Mask' according to the least significant bit of 'ADDR'. As a result, the user sees an 8-bit data input ('DATA_IN'), a separate 8-bit data output ('DATA_OUT') and 19 address lines ('ADDR') which double the original address space.
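The address mapping performed by the memory controller can be illustrated with a small software model. This Python sketch is not the controller's schematic: the names `split_byte_address` and `SramModel` are hypothetical, and it assumes that an odd 'ADDR' (least significant bit set) selects the high byte of the 16-bit word, which the text does not state explicitly.

```python
def split_byte_address(addr):
    """Map a 19-bit byte address to (18-bit word address, high-byte flag)."""
    if not 0 <= addr < (1 << 19):
        raise ValueError("address must fit in 19 bits")
    word_addr = addr >> 1          # upper 18 bits drive the SRAM address lines
    use_high = bool(addr & 1)      # assumption: LSB = 1 selects the high byte
    return word_addr, use_high


class SramModel:
    """Sparse model of 2**18 words of 16 bits, accessed one byte at a time."""

    def __init__(self):
        self.words = {}

    def write(self, addr, byte):
        word_addr, use_high = split_byte_address(addr)
        old = self.words.get(word_addr, 0)
        if use_high:
            new = (old & 0x00FF) | ((byte & 0xFF) << 8)  # low byte masked off
        else:
            new = (old & 0xFF00) | (byte & 0xFF)         # high byte masked off
        self.words[word_addr] = new

    def read(self, addr):
        word_addr, use_high = split_byte_address(addr)
        word = self.words.get(word_addr, 0)
        return (word >> 8) & 0xFF if use_high else word & 0xFF
```

The masking in `write` plays the role of the 'High Byte Mask' and 'Low Byte Mask' signals: only the selected half of the word changes.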


Figure 4.4.1b Schematic diagram of memory controller


4.4.2 Memory Batch Operator

In order to sequentially access the memory, a 'Memory Batch Operator' module was devised. As shown in Fig. 4.4.2, it takes 6 inputs ('START_ADDR', 'END_ADDR', 'DATA_IN', 'MODE', 'DATA_READY', and 'ENABLE') and has 5 outputs ('DATA_OUT', 'ADDR', 'MEM_MODE', 'MEM_ENABLE', and 'DONE'). It operates on the rising edge of a clock signal ('CLK').

Figure 4.4.2 Memory batch operator block diagram

The module works as follows:

1. Whenever the 'ENABLE' input goes high, it fetches the starting and ending addresses as specified in the 'START_ADDR' and 'END_ADDR' inputs, and readies to start writing or reading (according to the 'MODE' input) at the starting address. This takes two clock cycles.

2. Whenever the 'DATA_READY' signal is asserted, the module goes to the next address and reads (the data can be read from the 'DATA_OUT' lines of the memory controller) or writes (the data from the 'DATA_IN' input lines).

3. If the module reaches the ending address, then it signals 'DONE' until the 'ENABLE' input is low and goes back to step 1. Else, it goes back to step 2.

Note that on each step, the module takes care of sending the appropriate signals to the memory controller in order to perform the desired action.

4.5 DISTANCE(HMM)

The distance module illustrated in Fig. 4.5a has four inputs ('A', 'B', 'ENABLE', and 'RST') and one output ('Distance'). It computes the distance between two arbitrarily sized vectors by adding and accumulating the squared difference of the 'A' and 'B' inputs on each rising edge of a clock signal 'CLK' while the 'ENABLE' input is high. In order to clear the accumulated distance, the asynchronous 'RST' signal must be asserted.

Figure 4.5a distance block diagram


Figure 4.5b Schematic diagram of distance

4.6 HMM TRAINING An important part of speech-to-text conversion using pattern recognition is training. Training involves creating a pattern representative of the features of a class using one or more test patterns that correspond to speech sounds of the same class. The resulting pattern (generally called a reference pattern) is an example or template, derived from some type of averaging technique. It can also be a model that characterizes the reference pattern statistics. Our system uses speech samples from three individuals during training. A model commonly used for speech recognition is the HMM, which is a statistical model used for modeling an unknown system using an observed output sequence. The system trains the HMM for each digit in the vocabulary using the Baum-Welch algorithm. The codebook index created during preprocessing is the observation vector for the HMM model.


After preprocessing the input speech samples to extract feature vectors, the system builds the codebook. The codebook is the reference code space that we can use to compare input feature vectors. The weighted cepstrum matrices for various users and digits are compared with the codebook. The nearest corresponding codebook vector indices are sent to the Baum-Welch algorithm for training an HMM model. The HMM characterizes the system using three matrices:

- A: the state transition probability distribution.
- B: the observation symbol probability distribution.
- π: the initial state distribution.

Any digit is completely characterized by its corresponding A, B, and π matrices. The A, B, and π matrices are estimated using the Baum-Welch algorithm, which is an iterative procedure (we limit the iterations to 20). The Baum-Welch algorithm gives 3 sets of matrices for each digit, corresponding to the 3 users with whom we created the vocabulary set. The A, B, and π matrices are averaged over the users to generalize them for user-independent recognition. For the design to recognize the same digit uttered by a user for which the design has not been trained, the zero probabilities in the B matrix are replaced with a low value so that the model gives a non-zero probability on recognition. To some extent, this arrangement overcomes the problem of limited training data.
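The two post-training steps described above (averaging the per-user matrices and replacing zero probabilities in B) can be sketched as follows. This is a minimal Python illustration, not the project's C code: the function names are hypothetical, matrices are assumed to be lists of rows, and the floor value of 1e-6 is an arbitrary assumption because the text only says "a low value".

```python
def average_matrices(matrices):
    """Element-wise mean of equally shaped matrices, one per user."""
    count = len(matrices)
    rows = len(matrices[0])
    cols = len(matrices[0][0])
    return [
        [sum(m[r][c] for m in matrices) / count for c in range(cols)]
        for r in range(rows)
    ]


def floor_zeros(matrix, floor=1e-6):
    """Replace exact zeros by a small positive value (assumed floor)."""
    return [[floor if value == 0.0 else value for value in row] for row in matrix]
```

Note that after flooring, the rows of B no longer sum exactly to one; whether the original program renormalizes them is not stated in the report, so this sketch leaves them as they are.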


Training is a one-time process. Due to the complexity and resource requirements, it is performed using standalone PC application software that we created by compiling our C program into an executable. For recognition, we compile the same C program but target it to run on the Nios II processor instead. We were able to accomplish this crosscompilation because of the wide support for the C language in the Nios II processor IDE. 4.6.1 HMM-Based Recognition Recognition or pattern classification is the process of comparing the unknown test pattern with each sound class reference pattern and computing a measure of similarity (distance) between the test pattern and each reference pattern. The digit is recognized using a maximum likelihood estimate, such as the Viterbi decoding algorithm, which implies that the digit whose model has the maximum probability is the spoken digit. Preprocessing, feature vector extraction, and codebook generation are same as in HMM training. The input speech sample is preprocessed and the feature vector is extracted. Then, the index of the nearest codebook vector for each frame is sent to all digit models. The model with the maximum probability is chosen as the recognized digit. After preprocessing in the Nios II processor, the required data is passed to the hardware for Viterbi decoding. Viterbi decoding is computationally intensive so we implemented it in the FPGA for better execution speed, taking advantage of hardware/software co-design. We wrote the Viterbi decoder in Verilog HDL and included it as a custom


instruction in the Nios II processor. Data passes through the dataa and datab ports and the prefix port is used for control operations. The custom instruction copies or adds two floating-point numbers from dataa and datab, depending on the prefix input. The output (result) is sent back to the Nios II processor for further maximum likelihood estimation.

4.6.2 Flowchart Of HMM

The system trains the HMM for each digit in the vocabulary. The same weighted cepstrum matrices for various users and digits are compared with the codebook, and their corresponding nearest codebook vector indices are sent to the Baum-Welch algorithm to train a model for the input index sequence. The codebook index is the observation vector for the HMM model.
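The recognition step of section 4.6.1 (score the codebook index sequence against every digit model with Viterbi decoding and pick the best) can be sketched in Python. This is an illustrative sketch, not the Verilog decoder or the Nios II program: the function names and the `models` dictionary layout are hypothetical, and working with log probabilities, so that products become additions, is an assumption of this example.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    """Natural log that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi_log_score(obs, A, B, pi):
    """Log probability of the most likely state path for a discrete HMM."""
    n_states = len(pi)
    delta = [_log(pi[s]) + _log(B[s][obs[0]]) for s in range(n_states)]
    for symbol in obs[1:]:
        delta = [
            max(delta[p] + _log(A[p][s]) for p in range(n_states))
            + _log(B[s][symbol])
            for s in range(n_states)
        ]
    return max(delta)


def recognize(obs, models):
    """Return the label whose (A, B, pi) model gives the highest score."""
    return max(models, key=lambda label: viterbi_log_score(obs, *models[label]))
```

Here `obs` is the sequence of codebook indices, one per frame, and `models` maps each digit label to its A, B and π matrices.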

Figure 4.6.2 flowchart of HMM


The Baum-Welch algorithm is an iterative procedure and our system limits the iterations to 20. After training, we have three models for each digit that correspond to the three users in our vocabulary set. We find the average of the A, B, and π matrices over the users to generalize the models.

4.7 LCD

Figure 4.7 LCD

A liquid crystal display (LCD) is a thin, flat electronic visual display that uses the light modulating properties of liquid crystals (LCs). LCs do not emit light directly. They are used in a wide range of applications, including computer monitors, television, instrument panels, aircraft cockpit displays, signage, etc. They are common in consumer devices such as video players, gaming devices,clocks, watches, calculators, and telephones. LCDs have displaced cathode ray tube (CRT) displays in most applications. They are usually more compact, lightweight, portable, less expensive, more reliable, and easier on the eyes. They are available in a wider range of screen sizes than CRT and plasma displays, and since they do not use phosphors, they cannot suffer image


burn-in. LCDs are more energy efficient and offer safer disposal than CRTs. Its low electrical power consumption enables it to be used in battery-powered electronic equipment. It is an electronically-modulated optical device made up of any number of pixels filled with liquid crystals and arrayed in front of a light source (backlight) or reflector to produce images in colour or monochrome. 4.8 LED

Figure 4.8 LED

A light-emitting diode (LED) is a semiconductor light source. LEDs are used as indicator lamps in many devices, and are increasingly used for lighting. Introduced as a practical electronic component in 1962, early LEDs emitted low-intensity red light, but modern versions are available across the visible, ultraviolet and infrared wavelengths, with very high brightness.


When a light-emitting diode is forward biased (switched on), electrons are able to recombine with electron holes within the device, releasing energy in the form of photons. This effect is called electroluminescence, and the color of the light (corresponding to the energy of the photon) is determined by the energy gap of the semiconductor. An LED is often small in area (less than 1 mm^2), and integrated optical components may be used to shape its radiation pattern. LEDs present many advantages over incandescent light sources including lower energy consumption, longer lifetime, improved robustness, smaller size, faster switching, and greater durability and reliability. LEDs powerful enough for room lighting are relatively expensive and require more precise current and heat management than compact fluorescent lamp sources of comparable output.

Light-emitting diodes are used in applications as diverse as replacements for aviation lighting, automotive lighting (particularly brake lamps, turn signals and indicators) as well as in traffic signals. The compact size, the possibility of narrow bandwidth, switching speed, and extreme reliability of LEDs have allowed new text and video displays and sensors to be developed, while their high switching rates are also useful in advanced communications technology. Infrared LEDs are also used in the remote control units of many commercial products including televisions, DVD players, and other domestic appliances.


CHAPTER 5

ARCHITECTURE

5.1 SYSTEM CONTROLLER

Figure 5.1 Overall Diagram


Fig. 5.1 shows how the modules discussed above interact with each other. Most of the signals pass through the System Controller module. It controls the datapath by coordinating the modules so that the data can flow. It deals primarily with the training phase of the algorithm, since it is much more complex than the sound recognition phase. For instance, once a sound has been detected, the system controller is notified. Then, it waits for the FFT to output the data before notifying the 'Average' module that it should start operating. Finally, it instructs the memory controller to store the averaged data.

5.2 NIOS II PROCESSOR

We used the Nios II Software Build Tools for Eclipse to write our C program, which executes our speech recognition algorithm. We have included the complete code in the code listing section. The overall operation of the code can be described as follows. The code executes an infinite loop, as it is always either expecting input or processing it. It starts by asserting the fftstart signal, which starts the memory loading and the FFT operation. It keeps checking the fftcomplete signal to detect the end of the FFT operation. Once the FFT is complete, it drives the fftstart signal low so that the FFT values stored in the FFT memory do not change before it copies the values to SDRAM. It then checks the fftlevel signal to determine whether a significant level of input is present at the MIC input, which indicates the start of a voice command. We found experimentally that a value of fftlevel greater

34

than 60 corresponds to an actual voice command, while the value below this represent either silence or the noise. Once we detected the start of the command, we continuously stored the FFT output of next 32 chunks of voice sample which each chunk being 32ms. We store these values in a large array named fftcoeff of size 8192. This is performed through a for loop iterating for 32 cycles and performing the above operation of initiating the FFT module and then storing the FFT output into the fftcoeff array at appropriate location. Now, we have got the power spectrum of the word which has been spoken. We will now do the feature extraction and determine the word spoken. First step is to convert the spectrum to the mel scale. We defined the melcepstrum_conversion function which converts the input power spectrum to the mel scale. We extracted 12 coefficients from this spectrum. We pass the as input the fftcoeff array, and output the mel array. The C module does the shifting as described in the theory section. We did the mel shifting for the entire 1 sec speech instead of 32ms chunks of speech. It would have been more efficient to do separate mel shifting for each of the 32ms but would have required a sophisticated synchronization and nevertheless wasnt required for the operation which we are trying to achieve. Next step is to compute the discrete cosine transform of these spectral points and obtain the MFCCs. We defined the dct function which

35

takes the input as the mel_array (12 coefficients) and outputs the mfcc_array(12 coefficients). Next we identify the spoken word based on the DCT coefficients. Since the first two coefficients contain the maximum information we took the sum of first two coefficients of the dct output and store it in the variable named sum_mel. Since in our implementation we are differentiating between the words Yes and No, we experimentally noticed that this value was always above 59 for the word Yes and was in between 50 to 58 for the word No. The program compares sum_mel variables with these values in order to determine whether the spoken word is a Yes or No. It then accordingly glows the appropriate LEDs and the hardware displays the word Yes or No on the 7-segment display.
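The decision logic described in this section can be sketched in C as follows. This is only a minimal illustration and not the project's actual listing: the helper names is_voice_command, store_chunk and classify_word are hypothetical, and the chunk length of 256 values is an assumption derived by dividing the fftcoeff size (8192) by the 32 chunks. The thresholds (fftlevel greater than 60, sum_mel above 59 for Yes, 50 to 58 for No) are the experimentally found values quoted above.

```c
#include <assert.h>
#include <string.h>

#define FFT_LEVEL_THRESHOLD 60    /* from the text: fftlevel > 60 means a voice command */
#define NUM_CHUNKS          32    /* 32 chunks of 32 ms each */
#define FFTCOEFF_SIZE       8192  /* size of the fftcoeff array given in the text */
#define CHUNK_LEN (FFTCOEFF_SIZE / NUM_CHUNKS)  /* assumption: 256 values per chunk */

typedef enum { WORD_UNKNOWN, WORD_NO, WORD_YES } word_t;

/* Returns 1 when the level reported by the FFT hardware indicates speech,
 * 0 for silence or noise. */
static int is_voice_command(int fftlevel)
{
    return fftlevel > FFT_LEVEL_THRESHOLD;
}

/* Copies the FFT output of one 32 ms chunk into its slot in fftcoeff. */
static void store_chunk(int fftcoeff[FFTCOEFF_SIZE], int chunk,
                        const int fft_out[CHUNK_LEN])
{
    assert(chunk >= 0 && chunk < NUM_CHUNKS);
    memcpy(&fftcoeff[chunk * CHUNK_LEN], fft_out, CHUNK_LEN * sizeof(int));
}

/* Classifies the word from the first two DCT coefficients (c0, c1). */
static word_t classify_word(double c0, double c1)
{
    double sum_mel = c0 + c1;

    if (sum_mel > 59.0)
        return WORD_YES;
    if (sum_mel >= 50.0 && sum_mel <= 58.0)
        return WORD_NO;
    return WORD_UNKNOWN;  /* outside both ranges reported in the text */
}
```

The text does not say how values outside the two observed ranges are handled, so this sketch reports them as WORD_UNKNOWN rather than guessing.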


CHAPTER 6

FIELD PROGRAMMABLE GATE ARRAY (FPGA)

6.1 FPGA

A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by the customer or designer after manufacturing, hence "field-programmable". The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC) (circuit diagrams were previously used to specify the configuration, as they were for ASICs, but this is increasingly rare). FPGAs can be used to implement any logical function that an ASIC could perform. The ability to update the functionality after shipping, partial re-configuration of a portion of the design, and the low non-recurring engineering costs relative to an ASIC design (notwithstanding the generally higher unit cost) offer advantages for many applications.

FPGAs contain programmable logic components called "logic blocks", and a hierarchy of reconfigurable interconnects that allow the blocks to be "wired together", somewhat like a one-chip programmable breadboard. Logic blocks can be configured to perform complex combinational functions, or merely simple logic gates like AND and XOR. In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memory.


6.1.1 Introduction

The area of field programmable gate array (FPGA) design is evolving at a rapid pace. The increase in the complexity of the FPGA's architecture means that it can now be used in far more applications than before. The newer FPGAs are steering away from the plain vanilla "logic only" architecture to one with embedded dedicated blocks for specialized applications. With so many choices available, the designer not only has to become familiar with the various architectures and their strengths, but also needs a way to quickly estimate the performance of a design when targeted to the different technologies. This paper briefly outlines the latest offerings from the key FPGA vendors and in its latter half discusses the importance of using the right synthesis tool in order to target the same design to these various technologies.

Definitions of relevant terminology are:

Field-programmable Device (FPD): a general term that refers to any type of integrated circuit used for implementing digital hardware, where the chip can be configured by the end user to realize different designs. Programming of such a device often involves placing the chip into a special programming unit, but some chips can also be configured in-system. Another name for FPDs is programmable logic devices (PLDs); although PLDs encompass the same types of chips as FPDs, we prefer the term FPD because historically the word PLD has referred to relatively simple types of devices.

PLA: a Programmable Logic Array (PLA) is a relatively small FPD that contains two levels of logic, an AND-plane and an OR-plane, where both levels are programmable (note: although PLA structures are sometimes embedded into full-custom chips, we refer here only to those PLAs that are provided as separate integrated circuits and are user-programmable).

PAL: a Programmable Array Logic (PAL) is a relatively small FPD that has a programmable AND-plane followed by a fixed OR-plane.

SPLD: refers to any type of Simple PLD, usually either a PLA or PAL.

CPLD: a more Complex PLD that consists of an arrangement of multiple SPLD-like blocks on a single chip. Alternative names (that will not be used in this paper) sometimes adopted for this style of chip are Enhanced PLD (EPLD), Super PAL, Mega PAL, and others.

FPGA: a Field-Programmable Gate Array is an FPD featuring a general structure that allows very high logic capacity. Whereas CPLDs feature logic resources with a wide number of inputs (AND planes), FPGAs offer narrower logic resources. FPGAs also offer a higher ratio of flip-flops to logic resources than do CPLDs.

6.1.2 The FPGA Landscape

In the semiconductor industry, the programmable logic segment is the best indicator of the progress of technology. No other segment has such varied offerings as field programmable gate arrays. It is no wonder that FPGAs were among the first semiconductor products to move to the 0.13 µm technology, and again recently to 90 nm technology. This rapidly changing technology means that more complex functionality is being designed.


Figure 6.1.2 Structure Of An FPGA

The players in the current programmable logic market are Altera, Atmel, Actel, Cypress, Lattice, QuickLogic and Xilinx. Some of the larger and more popular device families are Stratix from Altera, Axcelerator from Actel, ispXPGA from Lattice and Virtex from Xilinx. Between them, these FPGA devices cover many major electronics applications such as communications, video, image and digital signal processing, storage area networks and aerospace. While the architecture of each FPGA is unique, the basic combination of the functional block remains the same: LUTs + registers + carry-chain + wide MUX. It is important to be aware of the resources required by a design and to cross-reference them with what is available. Sometimes, however, it is the supported configuration that matters for a design's requirements. For example, the capability of a dedicated RAM to function in a particular mode might not be supported by all vendors.
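To make the LUT part of this functional block more concrete, the following C sketch models a 4-input look-up table (LUT4) as a 16-bit truth table. It is a software illustration under stated assumptions, not a description of any vendor's logic cell: the type lut4_t and the functions lut4_eval and lut4_program are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* A LUT4 is a 16-entry truth table: bit i holds the output
 * for the input pattern whose binary value is i. */
typedef uint16_t lut4_t;

/* Evaluate the LUT for inputs a (bit 0) through d (bit 3). */
static int lut4_eval(lut4_t table, int a, int b, int c, int d)
{
    unsigned index = (unsigned)(a & 1) | ((unsigned)(b & 1) << 1) |
                     ((unsigned)(c & 1) << 2) | ((unsigned)(d & 1) << 3);
    return (int)((table >> index) & 1u);
}

/* "Program" a LUT by enumerating all 16 input patterns of a function. */
static lut4_t lut4_program(int (*fn)(int, int, int, int))
{
    lut4_t table = 0;
    for (unsigned i = 0; i < 16; i++) {
        if (fn((int)(i & 1), (int)((i >> 1) & 1),
               (int)((i >> 2) & 1), (int)((i >> 3) & 1)))
            table |= (lut4_t)(1u << i);
    }
    return table;
}

/* Two example functions: 4-input AND and 4-input XOR. */
static int and4(int a, int b, int c, int d) { return a & b & c & d; }
static int xor4(int a, int b, int c, int d) { return a ^ b ^ c ^ d; }
```

In this model, programming the same table with and4 or xor4 only changes the stored 16 bits, which is why one physical block can implement any function of four inputs.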


6.1.3 FPGA Synthesis: The Vendor-Independent Approach

Present-day FPGAs offer the features needed to complete most complex designs. Table 6.1.3 highlights the key resources available in the largest device offered by each FPGA vendor. Clock management forms a very important part of any digital design, and this functionality is provided by on-chip phase-locked loop (PLL) or delay-locked loop (DLL) circuitry. Dedicated memory blocks offer data storage and can be configured as basic single-port RAMs, ROMs (read only memory), FIFOs (first in first out), or CAMs (content addressable memory). The data processing, or logic fabric, of these FPGAs varies widely in size, with the biggest Xilinx Virtex-II Pro offering up to 100K LUT4s. Support for various single-ended and differential I/O standards makes it possible to interface the FPGA with backplanes, high-speed buses, and memories.

Many of the major electronics applications such as communications, video, image and digital signal processing, storage area networks and aerospace are covered by the above-mentioned FPGA devices. Although all of these FPGAs can perform the key functions required by these applications, each of them is individually better suited to certain target segments. For example, although the Virtex-II and the Stratix both offer dedicated multiplier blocks, the adders in the dedicated DSP block may enable the Stratix device to target DSP applications more effectively, due to its ability to create efficient MAC (multiply-accumulate) blocks. In a similar manner, for programmable systems applications requiring embedded processors, the Virtex-II Pro with its 32-bit RISC processor (PowerPC 405) would be an ideal choice.
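For readers unfamiliar with the term, the multiply-accumulate (MAC) operation mentioned above can be illustrated in C. This is a software sketch only: the function names mac and dot_product are hypothetical, and the 32-bit operands with a 64-bit accumulator are assumptions chosen to avoid overflow in the example, not the widths of any particular DSP block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One multiply-accumulate step: acc + a * b.
 * A multiplier followed by an adder, as in a hardware MAC block. */
static int64_t mac(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * (int64_t)b;
}

/* A dot product is a chain of MAC steps; one output sample of an
 * FIR filter is computed in exactly this way. */
static int64_t dot_product(const int32_t *x, const int32_t *h, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = mac(acc, x[i], h[i]);
    return acc;
}
```

A device with the multiplier and the adder in one dedicated block can perform each loop iteration without routing the product through general logic, which is the advantage the paragraph above attributes to dedicated DSP blocks.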
Features | Xilinx Virtex-II Pro | Altera Stratix | Actel Axcelerator | Lattice ispXPGA
Clock management | DCM, up to 12 | PLL, up to 12 | PLL, up to 8 | sysCLOCK PLL, up to 8
Embedded memory blocks | Block RAM, up to 10 Mbit | TriMatrix Memory, up to 10 Mbit | Embedded RAM, up to 338K | sysMEM blocks, up to 414K
Data processing | CLB and 18-bit x 18-bit multipliers | LEs and embedded multipliers | Logic modules (C-cell & R-cell) | PFU based
Programmable I/Os | SelectIO | Advanced IO support | Advanced IO support | sysIO
Special features | Embedded PowerPC 405 cores | DSP blocks | Per-pin FIFOs for bus applications | sysHSI for high-speed serial interface

Table 6.1.3 Features Offered In FPGA


6.1.4 Applications of FPGAs

FPGAs have gained rapid acceptance and growth over the past decade because they can be applied to a very wide range of applications. A list of typical applications includes: random logic, integrating multiple SPLDs, device controllers, communication encoding and filtering, small to medium sized systems with SRAM blocks, and many more.

Other interesting applications of FPGAs are prototyping of designs later to be implemented in gate arrays, and also emulation of entire large hardware systems. The former of these applications might be possible using only a single large FPGA (which corresponds to a small Gate Array in terms of capacity), and the latter would entail many FPGAs connected by some sort of interconnect; for emulation of hardware, QuickTurn [Wolff90] (and others) has developed products that comprise many FPGAs and the necessary software to partition and map circuits. Another promising area for FPGA application, which is only beginning to be developed, is the usage of FPGAs as custom computing machines. This involves using the programmable parts to execute software, rather than compiling the software for execution on a regular CPU.


CHAPTER 7

ALTERA CYCLONE II DE2 KIT

7.1 LAYOUT AND COMPONENTS

A photograph of the DE2-70 board is shown in Figure 7.1. It depicts the layout of the board and indicates the location of the connectors and key components.

Figure 7.1 Altera DE2 kit


The DE2-70 board has many features that allow the user to implement a wide range of designed circuits, from simple circuits to various multimedia projects. The following hardware is provided on the DE2-70 board:

• Altera Cyclone II 2C70 FPGA device
• Altera Serial Configuration device - EPCS16
• USB Blaster (on board) for programming and user API control; both JTAG and Active Serial (AS) programming modes are supported
• 2-Mbyte SSRAM
• Two 32-Mbyte SDRAM
• 8-Mbyte Flash memory
• SD Card socket
• 4 pushbutton switches
• 18 toggle switches
• 18 red user LEDs
• 9 green user LEDs
• 50-MHz oscillator and 28.63-MHz oscillator for clock sources
• 24-bit CD-quality audio CODEC with line-in, line-out, and microphone-in jacks
• VGA DAC (10-bit high-speed triple DACs) with VGA-out connector
• 2 TV Decoders (NTSC/PAL/SECAM) and TV-in connector
• 10/100 Ethernet Controller with a connector
• USB Host/Slave Controller with USB type A and type B connectors
• RS-232 transceiver and 9-pin connector
• PS/2 mouse/keyboard connector
• IrDA transceiver
• 1 SMA connector
• Two 40-pin Expansion Headers with diode protection

In addition to these hardware features, the DE2-70 board has software support for standard I/O interfaces and a control panel facility for accessing various components. Software is also provided for a number of demonstrations that illustrate the advanced capabilities of the DE2-70 board. In order to use the DE2-70 board, the user has to be familiar with the Quartus II software. The necessary knowledge can be acquired by reading the tutorials "Getting Started with Altera's DE2-70 Board" and "Quartus II Introduction" (which exists in three versions based on the design entry method used, namely Verilog, VHDL or schematic entry). These tutorials are provided in the directory DE2_70_tutorials on the DE2-70 System CD-ROM that accompanies the DE2-70 board and can also be found on Altera's DE2-70 web pages.

7.2 BLOCK DIAGRAM OF THE DE2-70 BOARD

Figure 7.2.1 gives the block diagram of the DE2-70 board. To provide maximum flexibility for the user, all connections are made through the Cyclone II FPGA device. Thus, the user can configure the FPGA to implement any system design.


Following is more detailed information about the blocks in Figure 7.2.1.

7.2.1 Cyclone II 2C70 FPGA

• 68,416 LEs.
• 250 M4K RAM blocks.
• 1,152,000 total RAM bits.
• 150 embedded multipliers.
• 4 PLLs.
• 622 user I/O pins.
• FineLine BGA 896-pin package.

Figure 7.2.1. Block Diagram Of The DE2-70 Board.


7.2.2 Serial Configuration Device And USB Blaster Circuit

• Altera's EPCS16 Serial Configuration device.
• On-board USB Blaster for programming and user API control.
• JTAG and AS programming modes are supported.

7.2.3 SSRAM

• 2-Mbyte standard synchronous SRAM.
• Organized as 512K x 36 bits.
• Accessible as memory for the Nios II processor and by the DE2-70 Control Panel.

7.2.4 SDRAM

• Two 32-Mbyte Single Data Rate Synchronous Dynamic RAM memory chips.
• Organized as 4M x 16 bits x 4 banks.
• Accessible as memory for the Nios II processor and by the DE2-70 Control Panel.

7.2.5 Flash Memory

• 8-Mbyte NOR Flash memory.
• Supports both byte and word mode access.
• Accessible as memory for the Nios II processor and by the DE2-70 Control Panel.

7.2.6 SD Card Socket

• Provides SPI and 1-bit SD mode for SD Card access.
• Accessible as memory for the Nios II processor with the DE2-70 SD Card Driver.


7.2.7 Pushbutton Switches

• 4 pushbutton switches.
• Debounced by a Schmitt trigger circuit.
• Normally high; generates one active-low pulse when the switch is pressed.

7.2.8 Toggle Switches

• 18 toggle switches for user inputs.
• A switch causes logic 0 when in the DOWN position (closest to the edge of the DE2-70 board) and logic 1 when in the UP position.

7.2.9 Clock Inputs

• 50-MHz oscillator.
• 28.63-MHz oscillator.
• SMA external clock input.

7.2.10 Audio CODEC

• Wolfson WM8731 24-bit sigma-delta audio CODEC.
• Line-level input, line-level output, and microphone input jacks.
• Sampling frequency: 8 to 96 kHz.
• Applications for MP3 players and recorders, PDAs, smart phones, voice recorders, etc.

7.2.11 VGA Output

• Uses the ADV7123 240-MHz triple 10-bit high-speed video DAC.
• With 15-pin high-density D-sub connector.
• Supports up to 1600 x 1200 at 100-Hz refresh rate.


7.2.12 NTSC/PAL/SECAM TV Decoder Circuit

• Uses two ADV7180 Multi-format SDTV Video Decoders.
• Supports worldwide NTSC/PAL/SECAM color demodulation.
• One 10-bit ADC, 4X over-sampling for CVBS.
• Supports Composite Video (CVBS) RCA jack input.
• Supports digital output formats: 8-bit ITU-R BT.656 YCrCb 4:2:2 output + HS, VS, and FIELD.
• Applications: DVD recorders, LCD TV, set-top boxes, digital TV, portable video devices, and TV PIP (picture in picture) display.

7.2.13 10/100 Ethernet Controller

• Integrated MAC and PHY with a general processor interface.
• Supports 100Base-T and 10Base-T applications.
• Supports full-duplex operation at 10 Mb/s and 100 Mb/s, with auto-MDIX.
• Fully compliant with the IEEE 802.3u Specification.
• Supports IP/TCP/UDP checksum generation and checking.
• Supports back-pressure mode for half-duplex mode flow control.

7.2.14 USB Host/Slave Controller

• Complies fully with Universal Serial Bus Specification Rev. 2.0.
• Supports data transfer at full-speed and low-speed.
• Supports both USB host and device.
• Two USB ports (one type A for a host and one type B for a device).
• Provides a high-speed parallel interface to most available processors; supports Nios II with a Terasic driver.


7.2.15 Serial Ports

• One RS-232 port.
• One PS/2 port.
• DB-9 serial connector for the RS-232 port.
• PS/2 connector for connecting a PS/2 mouse or keyboard to the DE2-70 board.

7.2.16 IrDA Transceiver

• Contains a 115.2-kb/s infrared transceiver.
• 32 mA LED drive current.
• Integrated EMI shield.
• IEC825-1 Class 1 eye safe.
• Edge detection input.

7.2.17 Two 40-pin Expansion Headers

• 72 Cyclone II I/O pins, as well as 8 power and ground lines, are brought out to two 40-pin expansion connectors.
• Each 40-pin header is designed to accept a standard 40-pin ribbon cable used for IDE hard drives.
• Diode and resistor protection is provided.


CHAPTER 8

EXPERIMENTAL RESULTS

The machine is trained three times with the word by the first person. The word help is then recognized 90.9% of the time, whereas held is correctly ignored (100% correct) when the first person speaks. However, these percentages are 45.5% and 0%, respectively, when the second person speaks. If, during the training phase, the first person inputs the word twice and the second person once, their percentages (when saying help) become 72.7% and 45.5%, respectively. When they say held, the machine correctly determines in all cases that they are not saying help. This data was collected by saying help 11 times and held two times.
Word    Verdict      Correct?
help    Same         Yes
help    Different    No
help    Same         Yes
help    Same         Yes
help    Same         Yes
help    Same         Yes
help    Same         Yes
help    Same         Yes
help    Same         Yes
help    Same         Yes
help    Same         Yes
held    Different    Yes
held    Different    Yes

Correctness: help 90.9 % (10 of 11), held 100 % (2 of 2), overall 92.3 % (12 of 13)

Table 8 Experimental Results
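The correctness figures in Table 8 follow directly from the Correct? column. The short C sketch below recomputes them; the array and function names are hypothetical, and the 1/0 values simply encode the Yes/No entries of the table in row order.

```c
#include <assert.h>
#include <stddef.h>

/* Correct? column of Table 8, encoded as 1 = Yes, 0 = No. */
static const int help_correct[11] = {1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1};
static const int held_correct[2]  = {1, 1};

/* Counts the trials whose verdict was correct. */
static size_t count_correct(const int *correct, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += (correct[i] != 0);
    return hits;
}

/* Converts a hit count into a percentage; returns 0 for an empty set. */
static double percent(size_t hits, size_t total)
{
    assert(hits <= total);
    return total ? 100.0 * (double)hits / (double)total : 0.0;
}
```

Here 10 of the 11 help trials are correct (about 90.9 %), both held trials are correct (100 %), and 12 of the 13 trials overall are correct (about 92.3 %).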


This indicates that the training works properly, because the first person's correctness decreases when his participation in the training decreases (from three times to two). On the other hand, the second person's correctness increases when he participates in the training. Since the fingerprints are analyzed in the time domain, the system is much more sensitive to the speed, the intonation and the surrounding noise when a word is input. Thus, the above results should be taken with caution, because the words were spoken very close to the microphone, and in a somewhat similar way each time. Nonetheless, the results seem conclusive: despite a potential lack of accuracy, the machine is functional.


CHAPTER 9

ADVANTAGES

• The user can talk and write freely. The system understands, analyzes and creates all the elements that are presented.
• The user leads and controls the dialogue. He or she can interact by canceling or substituting previous functions and sentences.
• The technology understands, analyzes and creates a representation of all the elements using a grammatical analysis strategy, assuring the right interpretation and management of all the semantic capacity of natural language.
• The platform offers real-time interaction in massive environments using acute memory management strategies.
• The solution completely adapts to the user profile and previous dialogue history with the aim of customizing all interactive processes.


CHAPTER 10

APPLICATIONS

• Interactive voice response system (IVRS).
• Voice-dialing in mobile phones and telephones.
• Hands-free dialing in wireless Bluetooth headsets.
• PIN and numeric password entry modules.
• Value added service (VAS) providers.
• Automated teller machines (ATMs).

If we increased the system's vocabulary using phoneme-based recognition for a particular language, e.g., English, the system could be used to replace standard input devices such as keyboards, touch pads, etc. in IT- and electronics-based applications in various fields. The design has a wide market opportunity ranging from mobile service providers to ATM makers. Some of the targeted users for the system include:

• Mobile operators.
• Home and office security device providers.
• ATM manufacturers.
• Mobile phone and Bluetooth headset manufacturers.
• Telephone service providers.
• Manufacturers of instruments for disabled persons.
• PC users.


CHAPTER 11

FUTURE IMPROVEMENTS

Given more time, we would have liked to implement a more robust system with a larger dictionary of words. Also, using the mel scale followed by a DCT is a comparatively weak approach to the speech recognition problem. Algorithms built upon Hidden Markov Models, which are predominantly statistical techniques that treat a speech signal as piecewise stationary over a window of 10 ms, are preferred. Using these algorithms, we could enhance the accuracy of the system and make it more robust.


CHAPTER 12

CONCLUSION

After applying background theory and scripting a VLSI prototype, we conclude that a speech recognition system can indeed be successfully implemented using FPGA technology. The experimental and theoretical results show that the algorithm is accurate and fast enough for consumer product applications. Despite only a partial hardware implementation due to technical difficulties, the system remains functional. Besides producing a full implementation (by including an FFT module and thus being able to analyze words in the frequency spectrum), other improvements can be made to the system. For instance, allowing a variable length for the input sounds would drastically improve its performance on very short or very long words. Also, adding support for training several words would be rather simple and would increase the system's flexibility.


