FPGA-BASED SPEECH
RECOGNITION SYSTEM
Done by:
IDOUGHI Achour
BOUTOUFIK Assia
Supervisor: Dr. Benzekri
Nomenclature
CODEC: Encoder/Decoder
General Introduction
Making a machine understand natural language is a very challenging area of research, a fact that has attracted many researchers in recent years to develop speech recognition tools.
It is of great importance not only in practical applications but also in scientific research. Research on speech recognition technology mainly concentrates on two aspects: one is software running on a computer, the other is embedded systems. Embedded systems offer high performance, convenience, and low cost, and they have huge potential for development.
The classical front-end analysis in speech recognition is a spectral analysis which parameterizes the speech signal into feature vectors. The most popular set of feature vectors used in recognition systems is the Mel Frequency Cepstral Coefficients (MFCC) [25]. They are based on a standard power spectrum estimate which is first subjected to a log-based transform of the frequency axis; this results in a spectral representation on a perceptual frequency scale, based on the response of the human auditory system. Afterwards, the coefficients are decorrelated using a modified discrete cosine transform, which compacts the energy into the lower coefficients [25].
In the hardware part, we implement an audio codec controller that is responsible for the pre-emphasis and sampling, an FFT module for transforming the time-domain signal into the frequency domain, and finally a NIOS soft processor that executes the software program, which consists mainly of the MFCC and recognition algorithms.
The MFCC technique is used to extract the feature vectors of the spoken word; it is composed of the FFT, the Mel filter bank, the Mel spectrum, and the DCT. The first component is done in the hardware part; the remaining ones are achieved in the software part, because their complexity made us decide to develop them in software.
In the software part, a C program running on the NIOS soft processor performs the MFCC stages (Mel filter bank, Mel spectrum, and DCT) for feature extraction and the matching algorithm for training and recognition.
Our system design is based on an FPGA because of its advantages: short development cycle, low design cost, and low risk. In recent years, FPGAs have become key components in high-performance digital signal processing systems used in digital communication, networking, video, and image processing [14].
Chapter I is an introduction to the field of speech recognition, where we cover the theoretical background of our project, from speech theory to the hardware used in our realization.
Chapter II ‘Hardware and Software Implementation’ deals with the hardware and software implementation of our system. The hardware part is divided into three main blocks: the audio codec, the FFT module, and the NIOS system. This part explains the different hardware blocks and how they are related to each other to form the overall hardware system. The last part of this chapter deals with the software description of the implemented feature extraction, training, and recognition.
Chapter III ‘Result and Discussion’ presents the obtained results and their analysis.
The report ends with a general conclusion and some suggestions for future work.
Chapter I: Theoretical Background
Introduction
1. Speech Recognition
Speech recognition can be defined as the process by which a computer (or another type of machine) identifies spoken words (a speech signal) using different algorithms implemented in a computer system. The goal of speech recognition is to interact with different machines. Generally speaking, an Automatic Speech Recognition (ASR) system can be organized in two blocks: the feature extraction and the modeling stage (or feature matching) [9], as described in figure 1.1.
A speech signal is an analog representation of the vocalized form of communication, based upon the syntactic combination of lexical items and names drawn from very large vocabularies.
Isolated Word
Isolated word recognizers usually require each utterance to have quiet on both sides of the sample window. They accept single words or single utterances at a time, operating with “Listen” and “Non-Listen” states. “Isolated utterance” might be a better name for this class [21].
Connected Word
Connected word systems are similar to isolated word systems, but allow separate utterances to be “run together” with a minimum pause between them [21].
Continuous Speech
Continuous speech recognizers allow users to speak almost naturally, while the computer determines the content. Recognizers with continuous speech capabilities are some of the most difficult to create, since they must utilize special methods to determine utterance boundaries [21].
Spontaneous Speech
At a basic level, spontaneous speech can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features, such as complete sentences [21].
An utterance is the vocalization (speaking) of a word or words that represent a single meaning
to the computer. Utterances can be a single word, a few words, a sentence, or even multiple
sentences [26].
Speaker Dependence
Speaker dependent systems are designed around a specific speaker. They generally are more
accurate for the correct speaker, but much less accurate for other speakers. They assume the
speaker will speak in a consistent voice and tempo. Speaker independent systems are designed
for a variety of speakers. Adaptive systems usually start as speaker independent systems and
utilize training techniques to adapt to the speaker to increase their recognition accuracy [26].
Vocabularies
Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the
SR system. Generally, smaller vocabularies are easier for a computer to recognize, while
larger vocabularies are more difficult [26].
Accuracy
The ability of a recognizer can be examined by measuring its accuracy, i.e., how well it recognizes utterances. This includes not only correctly identifying an utterance but also identifying whether a spoken utterance is not in its vocabulary. Good ASR systems have an accuracy of 98% or more; the acceptable accuracy of a system really depends on the application [26].
Training
Some speech recognizers have the ability to adapt to a speaker. When the system has this
ability, it may allow training to take place. An ASR system is trained by having the speaker
repeat standard or common phrases and adjusting its comparison algorithms to match that
particular speaker. Training a recognizer usually improves its accuracy [26].
Training can also be used by speakers who have difficulty speaking or pronouncing certain words. As long as the speaker can consistently repeat an utterance, ASR systems with training should be able to adapt [26].
Feature extraction is a process that extracts valid feature parameters from an input speech signal. Even if the same word is spoken, no two speech instances produce the same speech waveform. The reason for this phenomenon is that the speech waveform includes not only speech information but also the emotional state and tone of the speaker. Therefore, the goal of speech feature extraction is to extract feature parameters that represent the speech information. Further, this process is a part of compressing speech signals and modeling the human vocal tract. The feature parameters are devised to represent the phonemes accurately for speech recognition [22]. MFCC is commonly used for this feature extraction.
The most popular spectral-based parameters used in recognition approaches are the Mel Frequency Cepstral Coefficients, called MFCC [13]. MFCCs are coefficients that represent audio based on the perception of the human auditory system. MFCC is based on human hearing perception, which does not perceive frequencies above 1 kHz linearly. In other words, MFCC is based on the known variation of the human ear’s critical bandwidth with frequency: the filters are spaced linearly at low frequencies, below 1000 Hz, and logarithmically above 1000 Hz. A subjective pitch is defined on the Mel frequency scale to capture the important phonetic characteristics of speech [13].
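On the Mel scale, a frequency f in hertz is commonly mapped to Mels by the following well-known relation, consistent with [13]:

$$\mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$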
The overall process of the MFCC is shown in Figure 1.2.
Pre-emphasis
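In our system the pre-emphasis stage is handled by the audio front end in hardware. Purely as a behavioral illustration, a first-order pre-emphasis filter can be sketched in C as follows; the coefficient 0.97 is a typical textbook value, assumed here rather than taken from our implementation:

```c
/* Behavioral sketch of first-order pre-emphasis: y[n] = x[n] - a*x[n-1].
 * It boosts the high-frequency content before framing; a = 0.97 is an
 * assumed, typical value, not a parameter of our hardware. */
void pre_emphasis(const short *x, float *y, int len)
{
    const float a = 0.97f;
    y[0] = (float)x[0];
    for (int n = 1; n < len; n++)
        y[n] = (float)x[n] - a * (float)x[n - 1];
}
```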
Framing
Framing is the process of segmenting the speech samples obtained from analog-to-digital conversion (ADC) into small frames with lengths in the range of 20 to 40 ms. The voice signal is divided into frames of N samples; for example, at our 8 kHz sampling rate, a 32 ms frame contains 256 samples [13].
Windowing
At the edges of each frame there are discontinuities in the input signal that carry unwanted information. In order to minimize the discontinuities at the edges of the frames, each frame is multiplied by the window coefficients [13].
The Hamming window is used as the window shape, chosen by considering the next block in the feature extraction processing chain, since it integrates all the closest frequency lines [13].
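For a frame of N samples, the Hamming window has the standard definition below, each windowed sample being y(n) = x(n)·w(n):

$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$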
In order to extract the feature parameters of the input speech, the FFT algorithm is applied to convert the signal from the time domain into the frequency domain and obtain the frequency characteristics of the input. In the time domain, a speech frame is a discrete, non-periodic sequence; the FFT treats it as one period of a periodic signal and yields its discrete spectrum [13].
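Concretely, the FFT efficiently computes the discrete Fourier transform (DFT) of each N-sample frame x(n):

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1$$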
The frequency range of the FFT spectrum is very wide, and a voice signal does not follow a linear scale: the human ear responds non-linearly to a speech signal. When the speech recognition system applies a corresponding non-linear process, it improves the recognition performance. By applying a Mel filter bank, we can obtain a non-linear frequency resolution. The Mel filter bank method is widely used in the speech recognition process [13].
Figure 1.3 shows a set of triangular filters that are used to compute a weighted sum of spectral components, so that the output of the process approximates a Mel scale. Each filter’s magnitude frequency response is triangular in shape, equal to unity at its center frequency and decreasing linearly to zero at the center frequencies of the two adjacent filters. Each filter output is then the sum of its filtered spectral components.
The logarithm and discrete cosine transform (DCT) of the Mel-Filter bank energy are
computed to extract the required minimum information.
This process converts the log Mel spectrum back into the time domain using the Discrete Cosine Transform (DCT). The results of the conversion are called Mel Frequency Cepstral Coefficients, and the set of coefficients is called an acoustic vector. Therefore, each input utterance is transformed into a sequence of acoustic vectors [13].
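With K filter bank channels, a standard form of this step (the one assumed in the sketches of this report) computes the n-th cepstral coefficient from the log filter bank energies E_k as:

$$c_n = \sum_{k=1}^{K} \log(E_k)\,\cos\!\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right], \qquad n = 1, \dots, L$$

where L is the number of coefficients kept (12 in our design).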
The matching technique is used to compare a spoken word to reference words in a database. These words have to be in vector representation form with a limited number of entries, say N. The matching algorithm compares the entries of the spoken word’s vector to those of the reference vector, element by element up to N, by calculating the difference between them. It then takes the average of the differences and compares it to a reference threshold value. Whenever it finds that the average is less than the threshold value, it decides that it has found the spoken word.
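A minimal C sketch of this matching rule is given below; the vector length, the threshold, and all names are illustrative assumptions rather than the exact code of our implementation:

```c
#include <math.h>

#define N_COEFF 12   /* entries per feature vector (12 MFCCs in our design) */

/* Returns 1 when the average element-wise difference between the spoken
 * word's vector and one reference vector falls below the threshold. */
int match_word(const float spoken[N_COEFF],
               const float reference[N_COEFF],
               float threshold)
{
    float sum = 0.0f;
    for (int i = 0; i < N_COEFF; i++)
        sum += fabsf(spoken[i] - reference[i]);  /* element-wise difference */
    return (sum / N_COEFF) < threshold;          /* average vs. threshold   */
}
```

The recognizer would call such a routine once per dictionary entry and report the entry whose average difference falls below the threshold.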
The DE2 board provides high-quality 24-bit audio via the Wolfson WM8731 audio CODEC (Encoder/Decoder) [4]. This chip supports microphone-in, line-in, and line-out ports, with a sample rate adjustable from 8 kHz to 96 kHz. The WM8731 is controlled by a serial I2C bus interface, which is connected to pins on the Cyclone II FPGA [20].
This chip is used to sample and digitize the audio signal using its analog-to-digital converter (ADC).
The digital audio interface takes the data from the internal ADC and places it on the ADCDAT output. ADCDAT is the formatted digital audio data stream output from the ADC digital filters, with the left and right channels multiplexed together. ADCLRC is an alignment clock that controls whether left or right channel data is present on the ADCDAT line. ADCDAT and ADCLRC are synchronized with the BCLK signal, with each data bit transition signified by a BCLK high-to-low transition. BCLK may be an input or an output depending on whether the device is in master or slave mode [20].
There are four digital audio interface formats accommodated by the WM8731/L; refer to the electrical characteristics section of the datasheet for timing information [20].
• Left justified mode: the MSB is available on the first rising edge of BCLK following an ADCLRC transition [20].
• I2S mode: the MSB is available on the second rising edge of BCLK following an ADCLRC transition [20].
• Right justified mode: the LSB is available on the rising edge of BCLK preceding an ADCLRC transition, yet the MSB is still transmitted first [20].
• DSP mode: the left channel MSB is available on either the first or the second rising edge of BCLK following an LRC transition high; the right channel data immediately follows the left channel data [20].
All these modes are MSB first and operate with data widths of 16 to 32 bits, except that the right justified mode does not support 32 bits [20].
The Audio Codec has 11 registers with 16 bits per register (7 bit address + 9 bits of data).
These can be controlled using either the 2 wire or 3 wire MPU interface. The complete
register map is shown in table 1.1.
The I2C (Inter-Integrated Circuit) bus is a simple bi-directional serial bus that supports multiple masters and slaves. It consists of only two lines: a serial bi-directional data line (SDA) and a serial clock line (SCL) [23].
The I2C bus is idle when both SCL and SDA are at logic 1. The master initiates a data transfer by issuing a START condition, which is a high-to-low transition on the SDA line while the SCL line is high, as shown in figure 1.6. The bus is considered busy after the START condition. After the START condition, a slave address is sent out on the bus by the master. The address is 7 bits long, followed by an eighth bit which is a data direction bit (R/W) [23].
The master, which controls the SCL line, sends out the bits on the SDA line, one bit per clock pulse; the SDA line may change only while the SCL line is low.
The Altera IP Library provides many useful IP core functions for production use without purchasing an additional license. Some Altera MegaCore® IP functions require that you purchase a separate license for production use; however, the OpenCore® feature allows evaluation of any Altera IP core in simulation and compilation in the software [2].
The most important IP core we have used in our project is the FFT IP core.
OpenCore Plus evaluation supports the following two operation modes:
• Untethered: run the design containing the licensed IP for a limited time.
• Tethered: run the design containing the licensed IP for a longer time or indefinitely.
For IP cores, the untethered time-out is 1 hour; the tethered time-out value is indefinite. The
design stops working after the hardware evaluation time expires [2].
The FFT IP core is a high-performance, highly parameterizable Fast Fourier Transform (FFT) processor. The FFT IP core implements a complex FFT or inverse FFT (IFFT) for high-performance applications [2].
The FFT IP core I/O is illustrated in the figure 1.8.
In the I/O data flow of the FFT IP core, we distinguish four options [2]:
• Streaming FFT: allows continuous processing of input data and outputs a continuous complex data stream, without the need to halt the data flow into or out of the FFT IP core.
• Input order: allows you to select the order in which you feed the samples to the FFT (natural order, bit-reverse order, digit-reverse order, or -N/2 to N/2 order).
• Buffered burst: the buffered burst I/O data flow FFT requires fewer memory resources than the streaming I/O data flow FFT, but the tradeoff is a reduction in average block throughput.
• Burst: the burst I/O data flow FFT operates similarly to the buffered burst FFT, except that it requires even fewer memory resources for a given parameterization, at the expense of reduced average throughput.
Note
The FFT IP core operates in both tethered and untethered modes [2].
Chapter II: Hardware and Software Implementation
Introduction
In this project we implemented a hardware system that recognizes speech by means of a headset microphone and the Altera DE2 board. We used the VHDL language to describe the hardware of the different blocks used in our speech recognition system.
This chapter provides the reader with the different parts and components used in our hardware
design followed by a description of their functionalities.
Figure 2.1 shows the general block diagram of our system. As mentioned there, our design is composed of three main components: the audio codec, the FFT module, and the NIOS system.
1. Audio Codec
The audio codec is the first block of our system; it is in charge of receiving the audio signal from the microphone. The analog signal produced by the microphone must be converted to digital before being processed by our system. To perform this, an analog-to-digital converter (ADC) is needed. Fortunately, the DE2 board is equipped with an audio chip that does this task: the Wolfson WM8731 audio codec.
The WM8731 on the DE2 board is the small chip labeled U1 near the row of colored audio connectors; it has a flexible digital interface that we access using dedicated pins on the Cyclone II FPGA device. It has internal analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) that allow the device to convert at sampling rates from 8 kHz to 96 kHz, supporting digital word lengths from 16 to 32 bits [20].
This chip can be configured by writing to 11 registers, which can be accessed using the I2C (Inter-Integrated Circuit) protocol. For the audio data path, the WM8731 may be operated in one of the following four audio interface modes:
• Right justified
• Left justified
• I2S
• DSP mode
In this part we developed a VHDL design to get the audio data from the MIC line-in of the DE2 through the Wolfson chip and save it in a memory block that we created using the Altera MegaFunction wizard tool of the Quartus II software. This VHDL design is composed of two main parts: configuration control and data flow control. To deal with synchronization aspects, we used two phase-locked loops (PLLs) for dividing and buffering the 27 MHz and 50 MHz DE2 board clocks. The block symbol generated with Quartus II for this component is shown in figure 2.2.
This symbol shows the different inputs and outputs of the audio codec. This block will be
connected later to other blocks for further processing of the audio signal.
In our design, the I2C bus drives only one peripheral device to be interfaced: the audio codec chip. We designed a dedicated FSM (finite state machine) that has six states (reset state, start condition, send data, acknowledge, prepare for stop, stop condition).
Following reset, the FSM enters the start condition, in which the I2C clock line is held high while the I2C data line is pulled low. The FSM then enters the send data state, where each of the 24 bits of the current packet is sent. After each 8 bits, the FSM enters the acknowledge state, where the I2C data line is set to high impedance. Following transmission of the full 24-bit packet, the FSM enters the stop condition, where the I2C clock line is held high while the I2C data line is pulled high.
In this configuration we set the audio codec chip to be the slave device and the FPGA to be the master device. Figure 2.3 shows the FSM diagram used for this configuration.
We created a small ROM to store these configuration packets, which are retrieved by the controller and sent via the two-wire interface to the audio device. This process happens only once and must be the first thing done in our system in order to get correct data at the right moment. For this purpose, we created a delay counter module that waits 42 ms until the configuration finishes, and then enables the data transmission carried out by the ADCDAC controller.
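Each packet follows the WM8731 register format of a 7-bit register address plus 9 data bits, preceded by the chip address byte. The C fragment below only illustrates how such a 24-bit packet is assembled; the helper name is hypothetical, and the real packets live in the VHDL ROM:

```c
#include <stdint.h>

/* Illustrative packing of one 24-bit WM8731 configuration packet.
 * Top byte (sent first): 7-bit chip address '0011010' + write bit 0 -> 0x34.
 * Lower 16 bits: 7-bit register address followed by 9 bits of data [20]. */
static uint32_t wm8731_packet(uint8_t reg_addr, uint16_t data9)
{
    uint16_t reg = ((uint16_t)(reg_addr & 0x7F) << 9) | (data9 & 0x1FF);
    return ((uint32_t)0x34 << 16) | reg;  /* 24-bit packet, MSB sent first */
}
```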
This ROM contains our specific configuration, which differs from the default one. In fact, we modified some of the registers and left the others at their default values. Among the modified ones, the most important are:
• Enable microphone input boost level
• Disable line-in input to ADC
• Sampling frequency set to 8 kHz
• Left justified mode for digital audio
Figure 2.4 represents the symbol generated by Quartus II for configuring the audio codec chip
block. The different I/O ports of this block are mentioned in the symbol.
As shown in the previous figure, the audio codec configuration controller has two inputs and three outputs. The two inputs are the DE2 board’s 50 MHz clock and a reset, which is connected to one of the push buttons of the board. For the outputs, we have the serial clock (I2C_SCLK) and the serial data line (I2C_SDAT), which we connect to the specified pins to control the chip serially under the I2C protocol [7].
The remaining output line (SDA_Control) is used by the ADCDAC controller that we will
introduce in the following paragraphs.
The bit clock signal runs at 3.07 MHz, derived from the 18.42105 MHz master clock (XCK) signal produced by a PLL. The second clock, the left/right select clock, runs at 192 kHz; it is used to switch channels within the codec [20].
We set the digital audio interface format in the configuration part by accessing register 7 [20]. We used the following parameters:
• Left justified mode
• 16-bit data precision
• Right channel ADC/DAC data available when DACLRC/ADCLRC is low
• MSB available on the first BCLK rising edge after an ADCLRC/DACLRC rising edge
• Right channel ADC/DAC data on the right channel (no left/right swap)
• Slave mode enabled
• BCLK not inverted
In left justified mode, the MSB of each sample is thus available on the first rising edge of BCLK following an ADCLRC/DACLRC transition [20].
The type of communication used is master-slave, where our data controller module acts as the master and the Wolfson chip as the slave. We used five lines to control the data transaction: three of them carry the different clocks and the remaining two carry data. The following figure shows how this connection is made.
For testing purposes, we created a loopback of the sound from the MIC line-in to the line-out by writing the data obtained from the ADC (ADCDAT) to the DAC using the DACDAT line. With this technique we can adjust our configuration so as to get the right data at the right moment and see the effect of the different configurations. Figure 2.7 shows the block generated with Quartus II for the data flow controller, which we named data_controller; it has three inputs and seven outputs.
1.5.1 ROM
A read-only memory of size 32×24 bits, used to store the ten 24-bit words of the audio codec configuration. We created this memory using the MegaFunction wizard embedded in the Quartus II CAD software. The resulting schematic block of the created ROM is shown in figure 2.10.
1.5.2 RAM
A random access memory is used in our system to store a voice chunk of 32 ms duration. Note that the voice is stored before its FFT is performed. Since the sampling frequency used is 8 kHz, a total of 8,000 × 0.032 = 256 samples are taken in a time frame of 32 ms; therefore the required size of the RAM is 256×16 bits. The resulting schematic block of the created RAM is shown in figure 2.11.
As noticed, the RAM in the above figure is a 2-port RAM that can be accessed for read and write actions simultaneously. In fact, we used this type of memory because we are implementing a real-time speech recognition system, which must provide the possibility to write and read data at the same time. This RAM stores the output data from the ADC of the audio codec in the form of 16-bit words, which can then be read by the FFT module as the audio data to be transformed.
[Figure: Quartus II schematic of the complete audio codec block, connecting the codec configuration module, the delay counter, the 27 MHz PLL (ratio 92/135, producing the 18.42 MHz clock), the ADC/DAC data controller, and the 256-word × 16-bit RAM, together with the AUD_* board signals.]
The best way to handle the inputs of the FFT is to implement a state machine. This FSM controls the operation of the FFT module using four states. We use ‘reset_n’ to put the controller in the SOP (start of packet) state initially. In the SOP state, ‘sink_sop’ is set to indicate the start of an incoming FFT frame, the counter is set to one, and the input signal is assigned to ‘sink_real’ for processing by the FFT; the FSM then goes to the Transforming state. This state is where the FSM does its job of streaming the time-domain signal for transformation into the frequency domain: it releases ‘sink_sop’ and increments ‘count’ to indicate that it is in that state. Once ‘count’ reaches (255-2), the FSM leaves this state and goes to the EOP (end of packet) state, where it handles the last sample and signals the end by setting ‘sink_eop’; it then enters the IDLE state, where it waits for the next incoming FFT frame. The signal ‘go’ indicates the presence of data ready for processing, so in the IDLE state the FSM keeps checking this signal; when it is set, the FSM goes back to the start state (SOP) and repeats the same process. The following figure represents this FSM.
The data processed by the FFT is saved in a memory block before being passed to the NIOS II.
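As a reading aid only, the C fragment below models the behavior of this four-state controller; the real implementation is in VHDL, and the struct and function names here are hypothetical:

```c
/* Behavioral C model of the FFT input control FSM (illustrative only).
 * One call corresponds to one clock edge; reset puts the FSM in SOP. */
enum fft_state { SOP, TRANSFORMING, EOP, IDLE };

struct fft_ctrl {
    enum fft_state state;
    int count;                  /* samples streamed in the current frame */
    int sink_sop, sink_eop;     /* frame delimiters toward the FFT core  */
};

void fft_ctrl_step(struct fft_ctrl *c, int go)
{
    switch (c->state) {
    case SOP:                        /* flag the start of a new frame    */
        c->sink_sop = 1;
        c->count = 1;
        c->state = TRANSFORMING;
        break;
    case TRANSFORMING:               /* stream the middle samples        */
        c->sink_sop = 0;
        if (++c->count >= 255 - 2)   /* threshold given in the text      */
            c->state = EOP;
        break;
    case EOP:                        /* last sample: signal end of packet */
        c->sink_eop = 1;
        c->state = IDLE;
        break;
    case IDLE:                       /* wait for the next data-ready 'go' */
        c->sink_eop = 0;
        if (go)
            c->state = SOP;
        break;
    }
}
```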
The completion of the FFT operation is indicated to the NIOS through a PIO. When the NIOS senses that signal, it checks the ‘source_exp’ value of the FFT to know whether there is speech. If speech is present, the NIOS II reads the next 31 blocks of that memory.
[Figure: the 256-word × 16-bit dual-port RAM (ramfft) used to buffer the FFT output before it is read by the NIOS II.]
3. NIOS II System
The last hardware part of our implementation is a NIOS II soft processor, created in order to process the data produced by the FFT module, since the remaining processing is very difficult to implement in hardware. This is why we decided to complete the feature extraction in software, i.e., with a C program executed by the soft processor.
Using the SOPC builder tool of the Quartus II, we created a NIOS-II system consisting of:
• NIOS processor
• On-chip memory
• SDRAM
• I/O ports for the LEDs and switches
• I/O ports to interface the system to the FFT module
• LCD
After creating this system, we generated a schematic block to instantiate it and synchronize it with the other parts of the system in the top level, which we called speech_recognition. The following figure shows the block diagram representation of this NIOS system.
[Figure: nios_system block symbol, with clk/reset_n inputs; the FFT interface ports (in_port_to_the_iFFTCoeff[15..0], in_port_to_the_iFFTComplete, in_port_to_the_iFFTLevel[5..0], out_port_from_the_oFFTAddress[7..0], out_port_from_the_oFFTStart); the key, switch, and LED PIOs; the LCD interface; and the SDRAM controller signals.]
4. Top Level
The schematic top-level module connects and synchronizes the audio codec, the FFT module, and the NIOS II system. In order to properly generate the required clock signals for the NIOS, the SDRAM, and the audio codec, it uses two PLLs built into the Cyclone II FPGA. The general block diagram of our hardware implementation is represented on the next page.
[Figure: top-level schematic (speech_recognition), connecting the audio_codec block, the FFT_Module, the nios_system, and the sdram_PLL (50 MHz input, phase-shifted output for DRAM_CLK), together with the I2C_SCLK/I2C_SDAT lines, the audio signals, the LCD port, the switches, and the green LEDs.]
5. Software Implementation
As mentioned previously, the first step in any ASR system is to extract from the audio signal the features that will be used in the recognition stage.
The first parts of the MFCC algorithm, from the pre-processing up to the FFT phase, were implemented in hardware in the previous section. The remaining parts are implemented using the SOPC Builder and the NIOS II processor. The matching algorithm is also executed on the NIOS processor.
The following section goes through the description of the software part used in our speech
recognition system.
As shown in figure 2.19, the system first has to make sure that it receives a speech signal which is loud enough to be processed; the level of the sound is given by the FFT module.
Once sound is detected, the FFT coefficients for each sample are read from the FFT module and used as arguments to the required functions.
After getting all the FFT coefficients, the MFC coefficients are obtained by applying the filter bank first, then computing the Mel transform, and finally applying the DCT. The result is the 12 MFCC coefficients required for the matching, as sketched below.
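A condensed C sketch of these software steps follows. The filter weight table, array names, and helper layout are illustrative assumptions; the 26 filter bank channels and the 12 retained coefficients match the numbers used in our design:

```c
#include <math.h>

#define N_FFT     256    /* FFT points per 32 ms frame                */
#define N_FILTERS  26    /* triangular Mel filters (26 DCT inputs)    */
#define N_MFCC     12    /* cepstral coefficients kept for matching   */
#define PI_F 3.14159265f

/* mel_weights[m][k]: precomputed triangular Mel filter gains
 * (hypothetical table, built from the Mel scale mapping). */
extern const float mel_weights[N_FILTERS][N_FFT / 2];

void compute_mfcc(const float power[N_FFT / 2], float mfcc[N_MFCC])
{
    float log_energy[N_FILTERS];

    /* 1. Mel filter bank: weighted sums of the power spectrum,
     * 2. followed by the log of each filter energy.              */
    for (int m = 0; m < N_FILTERS; m++) {
        float e = 0.0f;
        for (int k = 0; k < N_FFT / 2; k++)
            e += mel_weights[m][k] * power[k];
        log_energy[m] = logf(e + 1e-10f);   /* guard against log(0) */
    }

    /* 3. DCT of the log energies; keep only the first 12 coefficients. */
    for (int n = 0; n < N_MFCC; n++) {
        float c = 0.0f;
        for (int m = 0; m < N_FILTERS; m++)
            c += log_energy[m] *
                 cosf(PI_F * (n + 1) * (m + 0.5f) / N_FILTERS);
        mfcc[n] = c;
    }
}
```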
Figure 2.19: MFCC flow chart (start the FFT; wait until the FFT is completed; get the FFT address and increment the sample count; once the sample size is reached, compute the MFCC; end).
5.2 Database
In our project we built our database using the MFCC coefficients resulting from the processing of the speech signals. The database contains four words, each one represented by 12 coefficients stored in memory.
In our design we have taken only 12 of the 26 DCT coefficients into consideration, because the higher DCT coefficients represent fast changes in the filter bank energies, and these fast changes degrade the SR performance.
Figure 2.20 shows how the data is stored into the memory.
Figure 2.20: Flow chart for storing the MFCC coefficients into the database memory (after the library declarations, a loop stores each MFCC coefficient and increments the index until done).
After getting the MFCC coefficients of the spoken word, the matching steps shown in the flow chart below are performed.
[Figure: flow chart of the matching loop (compare the MFCC coefficients against each database entry until the last address is reached; when a word is matched, display it).]
Chapter III: Result and Discussion
Introduction
This chapter discusses the simulation results obtained with the Quartus II CAD software. Some implementation results are also shown, testing the functionality of our system on the DE2 board.
The simulator tool included in the Quartus II CAD software was used to simulate the behavior of the audio codec block of our system. The time interval chosen for that simulation is 9.5 ms. The obtained results are shown in figure 3.1.
The above simulation waveforms demonstrate the correct functionality of the audio codec circuit. These results cover the two major phases of the audio codec block, the audio codec chip configuration and the audio data flow. Both phases are discussed below.
Figure 3.2 clearly shows the ten activations of the serial clock line (I2C_SCLK) for the transmission of ten frames of data over the serial data line (I2C_SDAT). The size of each frame is 24 bits; on each activation, one Wolfson register is configured accordingly.
Figure 3.2: Transmission of Ten Frames of Data to the Wolfson Chip Using I2C Protocol
The first two frames hold the first two corresponding Wolfson chip register configurations. The Wolfson chip can be accessed using two addresses, selected by the control interface address selection pin (CSB) as shown in the table below.
Since CSB is tied to ground (referring to the schematic in the DE2 User Manual), the audio codec address is ‘0011010’. The eighth bit is zero because we only write to the Wolfson chip, which is configured in slave mode.
The first 8 bits of each frame correspond to the address of the audio codec chip (bits [7:1]) plus the write bit (bit 0); the corresponding value is ‘00110100’.
The next 8 bits hold the configuration register’s address (bits [7:1]) followed by the most significant data bit (bit 0), and the last 8 bits carry the remaining configuration data, as shown in figure 3.3.
The most important register, the one our work is most concerned with, is the analog path control register. The zoomed-in view of the simulation results for this phase is illustrated in figure 3.5.
For the sampling issue, which is very important for time and memory considerations, we generated a master clock of about 18 MHz.
Figure 3.8: Master Clock (XCK) Generation Simulation.
The figure above shows the simulation result of the master clock. The difference between the two highlighted values gives an XCK period of 0.00005423 ms, which is equivalent to 18.4346918 MHz.
2. Hardware Result
2.1 Compilation Report
Using the Quartus II CAD software, we compiled the project named speech_recognition and obtained the report shown in figure 3.9.
As shown in the compilation report, we used about 25% of the total logic elements and about 21% of the total pins of our FPGA device.
Using the programmer tool in the Quartus software, we downloaded the project to the FPGA. We then received two messages: the first indicates that the MegaFunctions used will work only for a limited time (1 hour, as mentioned in the Altera user manual), as illustrated in figure 3.10; the second indicates that the system must stay connected to the computer to function properly, as shown in figure 3.11. To avoid these limitations, our Quartus software would have to be licensed.
Figure 3.10: Hardware Limited Time Message
Figure 3.11: Connecting to Computer Message
2.3 Results
Using the NIOS II 9.1 IDE, we ran the C program on the NIOS soft core. We plugged the headset microphone into the MIC line-in of the DE2 board.
[Figures 3.12 and 3.13: recognition results, with the words ‘Right’ and ‘Left’ displayed on the LCD.]
Discussion
We notice that the red LEDs are ON, which indicates the presence of a speech sound. The green LEDs are also ON, which indicates that a word in the dictionary has been recognized. The recognized word is displayed on the LCD: ‘Right’ in figure 3.12 and ‘Left’ in figure 3.13.
Discussion
In figure 3.14 the red LEDs are turned ON, which signifies the presence of a speech sound. However, the green LEDs are OFF, meaning that the spoken word is not in the database. When there is no speech, i.e., silence, the red LEDs as well as the LCD are OFF, as illustrated in figure 3.15.
General Conclusion
Developing a human-machine interaction medium that meets real-time criteria and is convenient for both the user and the machine is what motivated our project. Speech recognition systems are a perfect solution for this interaction. Many software applications perform the recognition; however, performing the same task in hardware results in better performance, meeting the real-time criteria while keeping good recognition accuracy.
In this report, the design and implementation of an FPGA-based real-time speech recognition system is presented. This system was developed using both hardware and software implementations. In the beginning we wanted to realize the whole system in hardware, but implementing the MFCC algorithm in hardware is very difficult and would have taken much of our limited time. This is why we used the NIOS II soft processor to execute a C program performing the MFCC algorithm. The soft processor is also used to perform the recognition using the matching approach.
We have successfully designed a real-time speech-to-text engine using FPGA technology, taking advantage of the parallel processing and very high speed resources offered by the Altera DE2 board.
A real-time isolated word recognizer is implemented in this project using a matching approach. As an improvement of this work, we suggest advancing this system into a continuous speech recognition engine and replacing the matching technique with the HMM (Hidden Markov Model) algorithm.
References
[1] Aakash Jain, Krishna Teja Panchagnula, Nitish Paliwal, (2010). Real Time Speech
Recognition Engine, Cornell University.
[2] Altera Corporation, (2015). FFT IP Core User Guide.
[3] Altera Corporation, (2014). Embedded Memory (RAM: 1-PORT, RAM: 2-PORT, ROM: 1-
PORT, and ROM: 2-PORT) User Guide.
[4] Altera Corporation, (October 2006). Audio / Video Configuration Core for DE2/DE1
Boards.
[5] Altera Corporation, (May 2011). NIOS II Hardware Development, Tutorial.
[6] Altera Corporation, (2008). Introduction to the Altera SOPC Builder Using VHDL Design.
[7] Altera Corporation, Altera DE2 Board Pin Table.
[8] Ankita Goel, Hamid Mahmoodi, (Spring 2012). Embedded Systems Design Flow Using
Altera’s FPGA Development Board (DE2-115 T-Pad).
[9] Carlos Asmat, David Lopez Sanzo, Kanwen Wu, (June 2007). Speech Recognition Using
FPGA Technology.
[10] José Ignacio Mateos Albiach, (2006). Interfacing a Processor Core in FPGA to an
Audio System, Master Thesis Performed in Electronics Systems.
[11] Huan Fang, Xin Liu. SOPC-based Voice Print Identification System.
[12] Lattice Semiconductor Corp., (2015). I2C Master Controller.
[13] Lindasalwa Muda, Mumtaj Begam, I. Elamvazuthi, (March 2010). Voice Recognition
Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping
(DTW) Techniques.
[14] Haitao Zhou, Xiaojun Han, (August 2009). Design and Implementation of Speech Recognition System Based on Field Programmable Gate Array, Tianjin Polytechnic University, China.
[15] M.T. Bala Murugan and M. Balaji, Instructor: Dr. B. Venkataramani, (2006). SOPC-
Based Speech-to-Text Conversion.
[16] Pong P. Chu, Cleveland State University (2008). FPGA Prototyping By VHDL Examples.
[17] Ricardo da Silva, Professor Steve Gunn, (April, 2013). Speech Recognition on Embedded
Hardware.
[18] Shivanker Dev Dhingra, Geeta Nijhawan, Poonam Pandit, (August 2013). ISOLATED
SPEECH RECOGNITION USING MFCC AND DTW.
[19] Volnei A. Pedroni, MIT Press, (2004). Circuit Design with VHDL.
[20] Wolfson Microelectronics, (April 2004). Data sheet, WM8731/WM8731L, Portable
Internet Audio Codec with Headphone Driver and Programmable Sample Rates.