
S.A.R.

A
Minor Project Report
submitted
in partial fulfillment
for the award of the Degree of
Bachelor of Technology
in Department of Computer Engineering
(with specialization in Computer Science & Engineering )

Department of Computer Science & Engineering

International Institute of Management, Engineering & Technology
Rajasthan Technical University
November, 2014

S.A.R.
A
Minor Project Report
submitted
in partial fulfillment
for the award of the Degree of
Bachelor of Technology
in Department of Computer Engineering
(with specialization in Computer Science & Engineering )

Supervisor:
Ms. Swati Saxena
(Associate Professor, CSE)

Submitted by:
Hitesh Khandelwal - 11E1IRCSM4XP010
Himanshi Gupta - 11E1IRCSF4XP009
Ipseema Ved - 11E1IRCSF4XP011
Ojasvita Sharma - 11E1IRCSM4XP019

Department of Computer Science & Engineering

International Institute of Management, Engineering & Technology
Rajasthan Technical University
November, 2014

Candidates' Declaration
We hereby declare that the Report of the U.G. Project Work entitled S.A.R.
which is being submitted to the International Institute of Management, Engineering

& Technology, in the partial fulfillment of the requirements for the award of the
Degree of Bachelor of Engineering in COMPUTER ENGINEERING in the
Department of Computer Engineering, is a bonafide report of the work
carried out by us. The material contained in this Report has not been submitted
to any University or Institution for the award of any degree.

Hitesh Khandelwal - 11E1IRCSM4XP010


Himanshi Gupta - 11E1IRCSF4XP009
Ipseema Ved - 11E1IRCSF4XP011
Ojasvita Sharma - 11E1IRCSF4XP019

ACKNOWLEDGEMENT
We take this opportunity to express our deepest and sincere gratitude to our supervisor Ms.
Swati Saxena for her insightful advice, motivating suggestions, invaluable guidance, help and
support in the successful completion of this project, and also for her constant encouragement
and advice throughout our Bachelor's program.
We express our deep gratitude to Mr. Rishikant Shukla & Mr. Pankaj Jain of Computer
Science Department for their regular support, co-operation, and co-ordination.
The timely facilities provided by the department throughout the Bachelor's program are also
gratefully acknowledged.
We would like to convey our thanks to the teaching and non-teaching staff of the Department
of Computer Engineering, for their invaluable help and support throughout the period of
Bachelors Degree. We are also grateful to all our classmates for their help, encouragement
and invaluable suggestions.

Hitesh Khandelwal - 11E1IRCSM4XP010

CERTIFICATE OF APPROVAL
The undersigned certify that the final year project entitled S.A.R., submitted by Hitesh
Khandelwal, Ojasvita Sharma, Ipseema Ved and Himanshi Gupta to the Department of Computer
Engineering in partial fulfillment of the requirement for the degree of Bachelor of Engineering
in Computer Engineering, was carried out under special supervision and within the time frame
prescribed by the syllabus.
We found the students to be hardworking, skilled, bonafide and ready to undertake any
commercial and industrial work related to their field of study.

1. ______________________
Ms. Swati Saxena
(Project Supervisor)

2. ______________________
Mr. Rishikant Shukla
(External Examiner)

3. ______________________
Mr. Pankaj Jain
(H.O.D, CSE)

Table of Contents

Topics                                                          Page No

Candidates' Declaration                                         i
Acknowledgement                                                 ii
Certificate of Approval                                         iii
Table of Contents                                               iv
List of Figures                                                 vii
List of Tables                                                  viii
List of Abbreviations
Abstract                                                        ix

Chapter 1: Introduction
1.1 Background Introduction                                     3
    1.1.1 Book Reader                                           3
    1.1.2 Speech Recognizer                                     3
1.2 Other Related Topics                                        4
    1.2.1 Java Speech API                                       4
    1.2.2 JSGF                                                  4
    1.2.3 MBROLA                                                5
1.3 Motivation
1.4 Problem Definition
1.5 Goals and Objectives
1.6 Scope and Application

Chapter 2: Requirement Analysis
2.1 Project Requirements                                        10
    2.1.1 Software Requirement                                  10
    2.1.2 Hardware Requirement                                  10
    2.1.3 Technologies to be Used                               10
2.2 Feasibility Study                                           10

Chapter 3: System Design and Architecture
3.1 Block Diagram                                               12
    3.1.1 Book Reader                                           12
    3.1.2 Speech Recognizer                                     15
3.2 Use Case Diagram                                            22
3.3 Sequence Diagram                                            24

Chapter 4: Methodology, Implementation Details and Result
4.1 Speech Recognition Process                                  25
4.2 Speech Synthesis Process                                    26
4.3 Types of Speech Recognition                                 27
4.4 Speech Recognition Algorithm                                29
    4.4.1 Dynamic Time Warping                                  29
    4.4.2 Hidden Markov Model                                   30
    4.4.3 Neural Networks                                       30
4.5 Components of CMU Sphinx                                    31
    4.5.1 Front End                                             32
    4.5.2 Knowledge Base                                        32
    4.5.3 Decoder                                               33
4.6 Implementation of Java Speech API                           33
4.7 Snapshots                                                   34

Chapter 5: Conclusion and Future Work
5.1 Conclusion                                                  36
5.2 Limitations                                                 36
5.3 Future Scope                                                37

List of Figures

Figure 1.1   Speech Synthesizer
Figure 1.2   Speech Recognizer
Figure 3.1   Block Diagram of Speech Synthesizer                                          12
Figure 3.2   Text-to-Speech Synthesis Cycle                                               14
Figure 3.3   Basic Diagram of Speech Recognizer                                           15
Figure 3.4   Typical Speech Recognition System                                            17
Figure 3.5   Signal analysis converts raw speech to speech frames                         18
Figure 3.6   Acoustic models: template and state representations for the word cat         19
Figure 3.7   The alignment path with the best total score identifies the word sequence
             and segmentation                                                             20
Figure 3.8   Speech Synthesizer (Use Case Diagram)                                        22
Figure 3.9   Speech Recognizer (Use Case Diagram)                                         23
Figure 3.10  Sequence Diagram                                                             24
Figure 4.1   Sphinx Architecture                                                          31
Figure 4.2   Book Reader                                                                  34
Figure 4.3   Browse Option                                                                34
Figure 4.4   File Path for Synthesizer                                                    35
Figure 4.5   Command Prompt for Speech Recognizer                                         35

List of Tables

Table 1   Five steps of speech recognition                                                16

List of Abbreviations

Abbreviation    Full Form
OS              Operating System
JSGF            Java Speech Grammar Format
JSAPI           Java Speech API
SAR             Synthesizer and Recognizer
TTS             Text to Speech

ABSTRACT
Speech recognition technology is one of the fastest growing engineering technologies. It has a
number of applications in different areas and provides potential benefits. Nearly 20% of the
world's people suffer from various disabilities; many of them are blind or unable to use their
hands effectively. Speech recognition systems in those particular cases provide significant help
to them, so that they can share information with people by operating a computer through voice
input. This project is designed and developed keeping that factor in mind, and a little effort is
made to achieve this aim. Our project will identify the words that a person speaks into a
microphone, recognize the command given and perform the operation as per the requirement of the
user. This can be used in Hands-free Computing, Car-Based Systems and Health-Care Systems. The
speech synthesizer converts normal language text into speech irrespective of the file format on
which the operation has to be performed. It is the artificial production of human speech.

Chapter 1

Introduction
Language is man's most important means of communication and speech its primary medium. Spoken
interaction, both between human interlocutors and between humans and machines, is inescapably
embedded in the laws and conditions of communication, which comprise the encoding and decoding
of meaning as well as the mere transmission of messages over an acoustical channel. Here we deal
with this interaction between man and machine through synthesis and recognition applications.
Speech recognition involves capturing and digitizing the sound waves, converting them to basic
language units or phonemes, constructing words from phonemes, and contextually analysing the
words to ensure correct spelling for words that sound alike. Speech recognition is the ability of
a computer to recognize general, naturally flowing utterances from a wide variety of users. It
recognizes the caller's answers to move along the flow of the call. Emphasis is given to the
modelling of speech units and grammar on the basis of the CMU Sphinx model. Speech recognition
allows you to provide input to an application with your voice. The applications and limitations
of this subject highlight the impact of speech processing in our modern technical field. While
there is still much room for improvement, current speech recognition systems have remarkable
performance. As we develop this technology and build on it, remarkable improvements continue to
be achieved. Rather than asking what is still deficient, we ask instead what should be done to
make it efficient.

1.1 Background Introduction
1.1.1. Book Reader (Speech Synthesis):
Human speech is the most natural form of communication between people. Therefore, it would be
very beneficial to use natural speech for the interaction between human and computer. The
artificial production of speech by a computer is called speech synthesis; speech synthesis from
a text is called Text-to-Speech (TTS) synthesis.
Many applications of speech synthesis exist; we mention only some examples. In
telecommunications services, speech synthesis can replace the operator's voice, and short
messages can simply be pronounced by the synthesizer. In public transport, synthesized speech is
used instead of human speech to inform passengers about arrivals and departures or to provide
other important information. Speech synthesis could also be used for language education but, to
the best of our knowledge, this is not done yet due to the low quality of existing systems.
The main goal of this work is to propose and implement a TTS system based on the MBROLA
(Multi-Band Resynthesis OverLap-Add) project.
1.1.2. Speech Recognizer:
Speech recognition technology is one of the fastest growing engineering technologies. It has a
number of applications in different areas and provides potential benefits. Nearly 20% of the
world's people suffer from various disabilities; many of them are blind or unable to use their
hands effectively. Speech recognition systems in those particular cases provide significant help
to them, so that they can share information with people by operating a computer through voice
input.
Unlike older software, this software can operate with dictated continuous speech. A large
vocabulary of different words has been provided to increase the probability of correct
recognition. A constrained syntax is used to help recognize words by disambiguating similar
sounds.

Chapter 1
Since this is a data-centric product, a gram file is required to store the entire vocabulary in
a form which is understandable by the system. The speech recognizer communicates with this gram
file in order to interpret the commands given by the user. The user speaks instructions into the
microphone; these are decoded by the decoder and converted into a string. Here the gram file
comes into use: the strings in the instruction are compared with those present in the gram file
and the corresponding operation is performed accordingly.

1.2 Other Related Topics


1.2.1 Java Speech API (JSAPI):

The Java Speech API defines a standard, easy-to-use, cross platform software interface to
state-of-the-art speech technology. Two core speech technologies are supported through
the Java Speech API: speech recognition and speech synthesis. Speech recognition
provides computers with the ability to listen to spoken language and to determine what
has been said. In other words, it processes audio input containing speech by converting it
to text.
The Java Speech API was developed through an open development process. With the
active involvement of leading speech technology companies, with input from application
developers and with months of public review and comment, the specification has
achieved a high degree of technical excellence. As a specification for a rapidly evolving
technology, Sun will support and enhance the Java Speech API to maintain its leading
capabilities.
The Java Speech API is an extension to the Java platform. Extensions are packages of
classes written in the Java programming language (and any associated native code) that
application developers can use to extend the functionality of the core part of the Java
platform.
1.2.2 Java Speech Grammar Format (JSGF):

The JavaTM Speech Grammar Format (JSGF) is a platform and vendor independent way
of describing a rule grammar (also known as a command and control grammar or regular
grammar).
A rule grammar specifies the types of utterances a user might say. For example, a service
control grammar might include "Service" and "Action" commands.
A voice application can be based on a set of scenarios. Each scenario knows the context
and provides appropriate grammar rules for the context.
Grammar rules can be provided in a multi-lingual manner.
The grammar body defines rules as a rule name followed by its definition-token.
The definition can include several alternatives separated by | characters.
For example:
public <greet> = (Good morning | Hello) (Bhiksha | Evandro | Paul | Philip);
The system returns the words Bhiksha, Evandro, Paul or Philip only if "Good morning" or "Hello"
was spoken.
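As an illustration of how such a rule grammar might be used from Java, the following sketch
loads a JSGF file through the javax.speech.recognition API and prints whatever the recognizer
accepts. This is only a minimal sketch; the file name commands.gram is a placeholder and is not
taken from this project.

import java.io.FileReader;
import java.util.Locale;
import javax.speech.Central;
import javax.speech.EngineModeDesc;
import javax.speech.recognition.FinalRuleResult;
import javax.speech.recognition.Recognizer;
import javax.speech.recognition.ResultAdapter;
import javax.speech.recognition.ResultEvent;
import javax.speech.recognition.ResultToken;
import javax.speech.recognition.RuleGrammar;

public class GrammarDemo {
    public static void main(String[] args) throws Exception {
        // Create and allocate a recognizer for US English.
        Recognizer recognizer = Central.createRecognizer(new EngineModeDesc(Locale.US));
        recognizer.allocate();

        // Load the JSGF rule grammar (the file name is a placeholder).
        RuleGrammar grammar = recognizer.loadJSGF(new FileReader("commands.gram"));
        grammar.setEnabled(true);

        // Print the best token sequence of every accepted result.
        recognizer.addResultListener(new ResultAdapter() {
            public void resultAccepted(ResultEvent e) {
                FinalRuleResult result = (FinalRuleResult) e.getSource();
                StringBuilder heard = new StringBuilder();
                for (ResultToken token : result.getBestTokens()) {
                    heard.append(token.getSpokenText()).append(' ');
                }
                System.out.println("Heard: " + heard.toString().trim());
            }
        });

        recognizer.commitChanges();   // apply the grammar changes
        recognizer.requestFocus();    // obtain recognition focus
        recognizer.resume();          // start listening
    }
}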
1.2.3 MBROLA:

MBROLA is a speech synthesizer based on concatenative synthesis of diphones. Therefore, a
database of diphones adapted to the MBROLA format is needed to run the synthesizer. This
database has been created for about twenty languages, for example English, Czech and French.
The input of the system is a list of phonemes with prosodic information such as the duration of
the phonemes and a linear description of pitch. It is thus not a Text-to-Speech (TTS)
synthesizer by itself, because it does not accept raw text as an input. The output of MBROLA is
16-bit speech samples at the sampling frequency of the diphone database.

1.3 Motivation
The keyboard, although a popular medium, is not very convenient as it requires a certain amount
of skill for effective usage. A mouse, on the other hand, requires good hand-eye co-ordination.
It is also cumbersome for entering non-trivial amounts of text data and hence requires the use
of an additional medium such as a keyboard. Physically challenged people find computers
difficult to use. Partially blind people find reading from a monitor difficult.
So the two parts of this project, SAR (Synthesizer and Recognizer), together form a speech
interface. A speech synthesiser converts text into speech; thus it can read out the textual
contents from the screen.

A speech recogniser has the ability to understand spoken words and convert them into text.

1.4 Problem Definition
The aim of this project is to provide a better interface for the synthesis of files so that they
can easily be read out to the users, and to let the user operate the computer through voice
commands. It will explain the purpose and features of the system, the interfaces of the system,
what the system will do, the constraints under which it must operate and how the system will
react to external stimuli.

1.5 Goals and Objectives


This product has two functions, those of a speech synthesizer and a speech recognizer. In this
software, a speech recognition module transcribes the user's speech into a word stream. The
character stream is then processed by a language engine dealing with syntax and semantics, and
finally by the back-end application program. A speech synthesizer converts the resulting answers
(strings of characters) into speech for the user.

Figure 1.1 Speech Synthesizer


A speech synthesizer is a speech engine that converts text to speech. The
javax.speech.synthesis package defines the Synthesizer interface to support Speech
Synthesis plus a set of supporting classes and interfaces. As a type of speech engine,
much of the functionality of a Synthesizer is inherited from the Engine interface in the
javax.speech package and from other classes and interfaces in that package.
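A minimal sketch of how the Synthesizer interface described above might be used through
javax.speech; the spoken sentence here is only an example and is not taken from the project
sources.

import java.util.Locale;
import javax.speech.Central;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerModeDesc;

public class SpeakDemo {
    public static void main(String[] args) throws Exception {
        // Ask the JSAPI Central class for a synthesizer that speaks US English.
        Synthesizer synthesizer = Central.createSynthesizer(new SynthesizerModeDesc(Locale.US));
        synthesizer.allocate();
        synthesizer.resume();

        // Queue plain text for speaking and wait until the output queue is empty.
        synthesizer.speakPlainText("Welcome to the S A R book reader.", null);
        synthesizer.waitEngineState(Synthesizer.QUEUE_EMPTY);

        synthesizer.deallocate();
    }
}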

Figure 1.2 Speech Recognizer


The speech recognizer is a speaker-independent application which can recognize speech from any
native user. Speaker independence is obtained by pre-training recognition systems with a large
number of speakers, so when a new speaker talks to the system, he or she can expect to fall
within the already trained or modelled voice patterns. Unlike older software, this software can
operate with dictated continuous speech. A large vocabulary of different words has been provided
to increase the probability of correct recognition. A constrained syntax is used to help
recognize words by disambiguating similar sounds.
So the sole objective of this project is to build these two functions and make the best use of
the software.

1.6 Scope and Application


The "Speech Synthesizer and Recognizer" is a desktop application which provides a very
natural way to the users to interact with the computer without any training requirements.
It almost eliminates the use of the keyboard, mouse or any other input device.
There are two purposes served by this software:

i. Speech Synthesizer: This part of the application converts normal language text into speech
irrespective of the file format on which the operation has to be performed. It is the artificial
production of human speech. This application can be used as a screen reader for people with
visual impairment; other than that, it can also be used by people with dyslexia and other
reading difficulties, as well as by pre-literate children.

ii. Speech Recognizer: This is a speech analyzer which can be defined as an independent,
computer-driven transcription of spoken language into computer-based text. This part of the
application allows a computer to identify the words that a person speaks into a microphone,
recognizes the command given and performs the operation as per the requirement of the user. This
can be used in Hands-free Computing, Car-Based Systems and Health-Care Systems.

This software is platform dependent and requires the Windows operating system for its
functioning. The synthesizer part of the software makes use of the MBROLA system. This software
needs a microphone to provide clarity in the words being spoken by the user for the purpose of
recognition. The application maintains a JSGF file, which is a grammar file: it stores the
grammar of spoken commands for proper recognition and synthesis of normal human language, which
is assumed to be US English. The software includes an XML file which gets loaded whenever the
program runs. This is a configuration file which includes the packages for recognition of the
commands. The application has the ability to read text from any kind of file, which could be a
Word file, a PDF file or a txt file. The recognizer has the ability to open or close any
application through voice recognition.
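The report does not show how a recognized command string is mapped to an action such as opening
an application. The sketch below is one hedged illustration of how such a dispatch could look in
plain Java on Windows; the command phrases and program names are examples only and are not taken
from the project.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CommandDispatcher {
    // Example mapping from recognized phrases to programs; the entries are illustrative only.
    private static final Map<String, String> COMMANDS = new HashMap<>();
    static {
        COMMANDS.put("open notepad", "notepad.exe");
        COMMANDS.put("open calculator", "calc.exe");
        COMMANDS.put("open paint", "mspaint.exe");
    }

    /** Launches the program associated with the recognized command, if any. */
    public static void dispatch(String recognizedText) {
        String program = COMMANDS.get(recognizedText.toLowerCase().trim());
        if (program == null) {
            System.out.println("Unknown command: " + recognizedText);
            return;
        }
        try {
            new ProcessBuilder(program).start();
        } catch (IOException e) {
            System.err.println("Could not start " + program + ": " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        dispatch("open notepad");   // example invocation
    }
}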

Chapter 2

Requirement Analysis

2.1 Project Requirements
2.1.1 Hardware Requirements:
 Client Side: Microphone

2.1.2 Software Requirements:
 Front End Client: Java 1.6 and above
 Operating System: Windows 7, 8, 8.1

2.1.3 Technologies to be used:
 JAVA: Application architecture
 XML: Extensible Markup Language
 Localization: 1 language, English (U.S.)

2.2 Feasibility Study
The main objective of this study is to determine whether SAR (Synthesizer and Recognizer) is
feasible or not. There are mainly three types of feasibility study to which the developed system
is subjected, as described below. The key considerations involved in the feasibility study are:

1. Technical feasibility
2. Economic feasibility
3. Operational feasibility

The developed system must be evaluated from a technical viewpoint first, and its impact on the
organization must be assessed. If it is compatible, a behavioural system can be
devised. Then it must be tested for feasibility. The three keys above are explained below:
Technical Feasibility
SAR satisfies technical feasibility because this service can be implemented as a standalone application. It is Microsoft Windows OS compatible.
Economic feasibility
Our project entitled SAR (Synthesizer and Recognizer) is economically feasible because it is
developed using very few economic resources. It is free.
Operational feasibility
Operational feasibility is assessed after the software is developed, to check that it can cope
with the defined objectives.

 The application is user friendly with its GUI and handy to use.
 The application is affordable because the requirements are just a normal computer and a
microphone.
 Since this application is developed in Java, it runs on multiple platforms.


Chapter 3

System Design and Architecture

3.1 Block Diagram
3.1.1 Book Reader (Speech Synthesizer) :
The TTS systems first convert the input text into its corresponding linguistic or phonetic
representations and then produce the sounds corresponding to those representations. With the
input being plain text, the generated phonetic representations also need to be augmented with
information about the intonation and rhythm that the synthesized speech should have. This task
is done by a text analysis module in most speech synthesizers. The transcription from the text
analysis module is then given to a digital signal processing (DSP) module that produces
synthetic speech. Figure 3.1 shows the block diagram of a TTS system.

Figure 3.1: Block Diagram of Speech Synthesizer

Figure 3.2 shows the text-to-speech synthesis cycle in most TTS systems. In this cycle,
the text preprocessing transforms the input text into a regularized format that can be
processed by the rest of the system. This includes breaking the input text into sentences,
tokenizing them into words and expanding the numerals. The prosodic phrasing component then
divides the preprocessed text into meaningful chunks of information based on language models and
constructs. The pronunciation generation component is responsible for generating the acoustic
sequence needed to synthesize the input text by finding the pronunciation of individual words in
the input text. The duration value for each segment of speech is determined by the segmental
duration generation component. The function of the intonation generation component is to
generate a fundamental frequency (F0) contour for the input text to be synthesized. The waveform
generation component takes as input the phonetic and prosodic information generated by the
various components described above, and generates the speech output.
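Reading the cycle as a chain of stages, one purely illustrative way to organize these modules in
code is a record of intermediate results that each stage fills in. The class and field names
below simply mirror the components named in the text and are not taken from any particular
library.

import java.util.Arrays;
import java.util.List;

/** Illustrative skeleton of the TTS cycle; each stage fills in one field of the utterance. */
public class TtsCycle {

    static class Utterance {
        String rawText;
        List<String> words;       // after text preprocessing
        List<String> phonemes;    // after pronunciation generation
        double[] durationsMs;     // after segmental duration generation
        double[] f0ContourHz;     // after intonation generation (F0 contour)
        byte[] waveform;          // after waveform generation
    }

    /** Text preprocessing: split into words (sentence breaking and numeral expansion omitted). */
    static Utterance preprocess(Utterance u) {
        u.words = Arrays.asList(u.rawText.trim().split("\\s+"));
        return u;
    }

    public static void main(String[] args) {
        Utterance u = new Utterance();
        u.rawText = "Page one of the book";
        u = preprocess(u);
        System.out.println(u.words);
        // The remaining stages (phrasing, pronunciation, duration, intonation,
        // waveform generation) would each fill in the corresponding field.
    }
}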


Figure 3.2: Text-to-Speech Synthesis cycle

3.1.2 Speech Recognizer :

Speech recognition involves capturing the user's utterance, digitizing the utterance into a
digital signal, then converting it into basic units of utterance (phonemes) and contextually
analyzing the words to ensure correct spelling for words that sound alike (such as write and
right). Figure 3.3 illustrates these processes.

Figure 3.3: Basic Diagram of Speech Recognizer


We can roughly divide the process flow of speech recognition into five steps. They are User
Input, Digitization, Phonetic Breakdown, Statistical Modeling and Matching, as Table 1 shows.

Table 1: Five steps of speech recognition

Step  Process Name          Description
1.    User Input            The system catches the user's voice in the form of an analog
                            acoustic signal.
2.    Digitization          Digitizes the analog acoustic signal.
3.    Phonetic Breakdown    Breaks the signals into phonemes.
4.    Statistical Modeling  Maps phonemes to their phonetic representation using a statistical
                            model.
5.    Matching              According to the grammar, phonetic representation and dictionary,
                            the system returns an n-best list (i.e. a word plus a confidence
                            score).

Grammar here means the set of words or phrases used to constrain the range of input or output in
the voice application, and Dictionary means the mapping table between phonetic representations
and words; for example, "thu" and "thee" both map to "the".


Figure 3.4: Typical Speech Recognition System


The structure of a standard speech recognition system is illustrated in Figure 3.4. The elements
are as follows:

Raw speech - Speech is typically sampled at a high frequency, e.g., 16 kHz over a microphone or
8 kHz over a telephone. This yields a sequence of amplitude values over time.
Signal analysis - Raw speech should be initially transformed and compressed, in order to
simplify subsequent processing. Many signal analysis techniques are available which can

extract useful features and compress the data by a factor of ten without losing any
important information.

Figure 3.5 - Signal analysis converts raw speech to speech frames.

Speech frames - The result of signal analysis is a sequence of speech frames, typically at
10-millisecond intervals, with about 16 coefficients per frame. These frames may be augmented by
their own first and/or second derivatives, providing explicit information about speech dynamics;
this typically leads to improved performance. The speech frames are used for acoustic analysis.
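A small sketch of the framing idea just described: raw samples are cut into 10 ms frames, a
feature is computed per frame, and a first derivative (delta) is appended. The feature here is
simply the frame energy, chosen only to keep the example short; real systems use richer
coefficients.

/** Illustrative framing: 10 ms frames at a given sample rate, one energy feature plus its delta. */
public class Framer {
    public static double[][] frames(short[] samples, int sampleRate) {
        int frameLen = sampleRate / 100;                 // 10 ms worth of samples
        int nFrames = samples.length / frameLen;
        double[][] features = new double[nFrames][2];    // [energy, delta-energy]
        for (int f = 0; f < nFrames; f++) {
            double energy = 0;
            for (int i = 0; i < frameLen; i++) {
                double s = samples[f * frameLen + i];
                energy += s * s;
            }
            features[f][0] = energy / frameLen;
        }
        for (int f = 1; f < nFrames; f++) {              // simple first derivative
            features[f][1] = features[f][0] - features[f - 1][0];
        }
        return features;
    }
}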


Acoustic models - In order to analyze the speech frames for their acoustic content, we
need a set of acoustic models. There are many kinds of acoustic models, varying in their
representation, granularity, context dependence, and other properties. During training, the
acoustic models are incrementally modified in order to optimize the overall performance
of the system. During testing, the acoustic models are left unchanged.


Figure 3.6 - Acoustic models: template and state representations for the word cat.

Acoustic analysis and frame scores - Acoustic analysis is performed by applying each acoustic
model over each frame of speech, yielding a matrix of frame scores, as shown in Figure 3.6.
Scores are computed according to the type of acoustic model that is being used. For
template-based acoustic models, a score is typically the Euclidean distance between a template's
frame and an unknown frame. For state-based acoustic models, a score represents an emission
probability, i.e., the likelihood of the current state generating the current frame, as
determined by the state's parametric or non-parametric function.
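For the template-based case, the frame score mentioned above is just the Euclidean distance
between two feature vectors, as in this small sketch (a lower score means a better match).

/** Euclidean distance between a template frame and an unknown frame (template-based score). */
public class FrameScore {
    public static double euclidean(double[] templateFrame, double[] unknownFrame) {
        double sum = 0;
        for (int i = 0; i < templateFrame.length; i++) {
            double d = templateFrame[i] - unknownFrame[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}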


Figure 3.7 - The alignment path with the best total score identifies the word sequence and
segmentation.

Time alignment - Frame scores are converted to a word sequence by identifying a


sequence of acoustic models, representing a valid word sequence, which gives the best
total score along an alignment path through the matrix. The process of searching for the
best alignment path is called time alignment.
An alignment path must obey certain sequential constraints which reflect the fact that
speech always goes forward, never backwards. These constraints are manifested both
within and between words. Within a word, sequential constraints are implied by the
sequence of frames (for template-based models), or by the sequence of states (for state-based models) that comprise the word, as dictated by the phonetic pronunciations in a
dictionary, for example. Between words, sequential constraints are given by a grammar,
indicating what words may follow what other words.

Time alignment can be performed efficiently by dynamic programming, a general
algorithm which uses only local path constraints, and which has linear time and space
requirements.
(This general algorithm has two main variants, known as Dynamic Time Warping (DTW) and Viterbi
search, which differ slightly in their local computations and in their optimality criteria.)
In a state-based system, the optimal alignment path induces segmentation on the word
sequence, as it indicates which frames are associated with each state. This segmentation
can be used to generate labels for recursively training the acoustic models on

corresponding frames.
Word sequence - The end result of time alignment is a word sequence - the sentence
hypothesis for the utterance. Actually it is common to return several such sequences,
namely the ones with the highest scores, using a variation of time alignment called N-best
search. This allows a recognition system to make two passes through the unknown
utterance: the first pass can use simplified models in order to quickly generate an N-best
list, and the second pass can use more complex models in order to carefully rescore each
of the N hypotheses, and return the single best hypothesis.

3.2 Use Case Diagram:

Figure 3.8 Speech Synthesizer


Figure 3.9 Speech Recognizer


Speech Synthesizer:
User: User is supposed to feed the text in a file. The file could be of any format.

System: The system will analyze the text provided by the user, then perform linguistic analysis
and generate a waveform which the user will then listen to through speakers.

Speech recognizer:
User: User is supposed to give commands to the system through a microphone.

System: The system will recognize the commands and will perform the task accordingly.


3.3 Sequence Diagram:

Figure 3.10 Sequence Diagram


Chapter 4

Methodology, Implementation Details and Result

4.1 SPEECH RECOGNITION PROCESS


In humans, speech or acoustic signals are received by the ears and then transmitted to the brain
for understanding, extracting the meaning out of the speech and then reacting to it
appropriately.
Speech-recognition-enabled computers or devices work on the same principle. They receive the
acoustic signal through a microphone; these signals are in analog form and need to be digitized
to be understood by the system. The signals are then digitized and sent to the processing unit
for extracting the meaning out of the signals and giving the desired output to the user.
Any speech recognition system involves the following five major steps:
1. Signal Processing
The sound is received through the microphone in the form of analog electrical signals. These
signals consist of the voice of the user and the noise from the surroundings. The noise is then
removed and the signals are converted into a digital signal. These digital signals are converted
into a sequence of feature vectors.
(Feature Vector - If you have a set of numbers representing certain features of an object
you want to describe, it is useful for further processing to construct a vector out of these
numbers by assigning each measured value to one component of the vector.)
2. Speech Recognition
This is the most important part of the process; here the actual recognition is done. The
sequence of feature vectors is decoded into a sequence of words. This decoding is done on the
basis of algorithms such as the Hidden Markov Model, Neural Networks or Dynamic Time Warping.
The program has a big dictionary of popular words that exist in the language. Each feature
vector is matched against the sound and converted into an appropriate character group. It checks
and compares words that are similar in sound with the formed character groups. All these similar
words are then collected.
3. Semantic Interpretation

Here it checks whether the language allows a particular syllable to appear after another. After
that, there is a grammar check. It tries to find out whether or not the combination of words
makes any sense.
4. Dialog Management
An attempt is made to correct the errors encountered. Then the meaning of the combined words is
extracted and the required task is performed.
5. Response Generation
After the task is performed, the response or the result of that task is generated. The response
is either in the form of speech or text. Which words to use so as to maximize the user's
understanding is decided here. If the response is to be given in the form of speech, then the
text-to-speech conversion process is used.

4.2 Speech Synthesis Process


We can divide methods of speech synthesis into the three following categories:
1. Articulatory synthesis: direct modeling of the human speech production system;
2. Formant synthesis: modeling the pole frequencies of the speech signal or the transfer
function of the vocal tract based on the source-filter model;
3. Concatenative synthesis: use of pre-recorded samples derived from natural speech.
Articulatory synthesis is usually not used in current synthesizers, because it is too
complicated for high quality implementations. However, it may be a favourable method
in the future.
In current systems, formant and concatenative methods are mostly used. The formant approach was
dominant for a long time, but today the concatenative method is becoming more and more popular.
Therefore, we present the details of these methods.
The main issue in concatenative synthesis is to find the correct length of the speech units that
will be concatenated. Word units are practical when they are pronounced in isolation.
However, the sound of a continuous sentence is then not natural. Current synthesizers are thus
mostly based on shorter units such as phonemes, diphones and demisyllables, or on combinations
of these.
Several methods of concatenative synthesis exist; we mention here only the interesting ones:

Micro-phonemic method: units of variable length derived from natural speech are used.

Linear Prediction (LP) based methods: based on the source-filter model, with filter coefficients
estimated automatically from the speech.

Sinusoidal models: assumption that the speech signal can be represented as a sum of
sine waves with time- varying amplitudes and frequencies.

PSOLA methods: a very popular method which allows pre-recorded speech samples to be smoothly
concatenated and provides good control of pitch and duration; used in some synthesis systems,
for example ProVerbe, HADIFIX and MBROLA (the system that is used in this work).

4.3 Types of Speech Recognition System


Speech recognition systems can be separated in several different classes by describing
what types of utterances they have the ability to recognize. These classes are based on the
fact that one of the difficulties of SR is the ability to determine when a speaker starts and
finishes an utterance. Most packages can fit into more than one class, depending on
which mode they're using.

Isolated Word
Isolated word recognizers usually require each utterance to have quiet (lack of an audio
signal) on BOTH sides of the sample window. This doesn't mean that it accepts only single words,
but it does require a single utterance at a time. Often, these systems have

"Listen/NotListen" states, where they require the speaker to wait between utterances
(usually doing processing during the pauses).
Connected Word
Connected word systems (or more correctly 'connected utterances') are similar to isolated word
systems, but allow separate utterances to be 'run together' with a minimal pause between them.
Continuous Speech
Recognizers with continuous speech capabilities are some of the most difficult to create
because they must utilize special methods to determine utterance boundaries. Continuous
speech recognizers allow users to speak almost naturally, while the computer determines
the content. Basically, it's computer dictation.
Spontaneous Speech
At a basic level, it can be thought of as speech that is natural sounding and not rehearsed.
An ASR system with spontaneous speech ability should be able to handle a variety of
natural speech features such as words being run together, "ums" and "ahs", and even
slight stutters.
Voice Verification/Identification
Some ASR systems have the ability to identify specific users by characteristics of their
voices (voice biometrics). If the speaker claims to be of a certain identity and the voice is
used to verify this claim, this is called verification or authentication. On the other hand,
identification is the task of determining an unknown speaker's identity. In a sense speaker
verification is a 1:1 match where one speaker's voice is matched to one template (also
called a "voice print" or "voice model") whereas speaker identification is a 1: N match
where the voice is compared against N templates.
There are two types of voice verification/identification system, which are as follows:

Text-Dependent:

If the text must be the same for enrolment and verification this is called text dependent
recognition. In a text-dependent system, prompts can either be common across all
speakers (e.g.: a common pass phrase) or unique. In addition, the use of shared-secrets

(e.g.: passwords and PINs) or knowledge-based information can be employed in order to
create a multi-factor authentication scenario.

Text-Independent:

Text-independent systems are most often used for speaker identification as they require
very little if any cooperation by the speaker. In this case the text during enrolment and
test is different. In fact, the enrolment may happen without the user's knowledge, as in the
case for many forensic applications. As text-independent technologies do not compare
what was said at enrolment and verification, verification applications tend to also employ
speech recognition to determine what the user is saying at the point of authentication.
In text independent systems both acoustics and speech analysis techniques are used.

4.4 SPEECH RECOGNITION ALGORITHMS


4.4.1 Dynamic Time Warping

Dynamic Time Warping algorithm is one of the oldest and most important algorithms in
speech recognition. The simplest way to recognize an isolated word sample is to compare
it against a number of stored word templates and determine the best match. This goal
depends upon a number of factors. First, different samples of a given word will have
somewhat different durations. This problem can be eliminated by simply normalizing the
templates and the unknown speech so that they all have an equal duration. However,
another problem is that the rate of speech may not be constant throughout the word; in
other words, the optimal alignment between a template and the speech sample may be
nonlinear. Dynamic Time Warping (DTW) is an efficient method for finding this optimal
nonlinear alignment.
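A compact sketch of DTW over two sequences of feature frames, using the standard
dynamic-programming recurrence described above; the local distance is the Euclidean distance
between frames. This is an illustrative implementation, not code from the project.

/** Illustrative Dynamic Time Warping between two sequences of feature frames. */
public class Dtw {
    public static double distance(double[][] a, double[][] b) {
        int n = a.length, m = b.length;
        double[][] cost = new double[n + 1][m + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= m; j++)
                cost[i][j] = Double.POSITIVE_INFINITY;
        cost[0][0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double d = local(a[i - 1], b[j - 1]);
                // best of: diagonal match, step in a, step in b (the allowed local path moves)
                cost[i][j] = d + Math.min(cost[i - 1][j - 1],
                                 Math.min(cost[i - 1][j], cost[i][j - 1]));
            }
        }
        return cost[n][m];
    }

    /** Euclidean distance between two frames, used as the local cost. */
    private static double local(double[] x, double[] y) {
        double sum = 0;
        for (int k = 0; k < x.length; k++) {
            double diff = x[k] - y[k];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}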

4.4.2 Hidden Markov Model

The most flexible and successful approach to speech recognition so far has been Hidden Markov
Models (HMMs). A Hidden Markov Model is a collection of states connected by
transitions. It begins with a designated initial state. In each discrete time step, a transition
is taken up to a new state, and then one output symbol is generated in that state. The
choice of transition and output symbol are both random, governed by probability
distributions.
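The description above can be made concrete with a tiny forward-algorithm sketch: given the
initial, transition and emission probabilities of a discrete HMM, it computes the likelihood of
an observed symbol sequence under the model. The two-state model and its numbers are invented
purely for illustration.

/** Forward algorithm for a small discrete HMM: P(observations | model). */
public class HmmForward {
    public static double likelihood(double[] initial, double[][] trans,
                                    double[][] emit, int[] observations) {
        int nStates = initial.length;
        double[] alpha = new double[nStates];
        for (int s = 0; s < nStates; s++) {
            alpha[s] = initial[s] * emit[s][observations[0]];
        }
        for (int t = 1; t < observations.length; t++) {
            double[] next = new double[nStates];
            for (int j = 0; j < nStates; j++) {
                double sum = 0;
                for (int i = 0; i < nStates; i++) {
                    sum += alpha[i] * trans[i][j];   // probability mass flowing into state j
                }
                next[j] = sum * emit[j][observations[t]];
            }
            alpha = next;
        }
        double total = 0;
        for (double v : alpha) total += v;
        return total;
    }

    public static void main(String[] args) {
        // Invented two-state, two-symbol model, used only as an example.
        double[] initial = {0.6, 0.4};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit  = {{0.9, 0.1}, {0.2, 0.8}};
        System.out.println(likelihood(initial, trans, emit, new int[]{0, 1, 0}));
    }
}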

4.4.3 Neural Networks

A neural network consists of many simple processing units (artificial neurons) each of
which is connected to many other units. Each unit has a numerical activation level
(analogous to the firing rate of real neurons). The only computation that an individual
unit can do is to compute a new activation level based on the activations of the units it is
connected to. The connections between units are weighted and the new activation is
usually calculated as a function of the sum of the weighted inputs from other units.
Some units in a network are usually designated as input units, which means that their
activations are set by the external environment. Other units are output units; their values
are set by the activation within the network and they are read as the result of a
computation.
Those units which are neither input nor output units are called hidden units.
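As a concrete counterpart to this description, the sketch below computes a unit's new activation
as a function of the weighted sum of its inputs, using a sigmoid squashing function; the weights
and inputs are arbitrary example values.

/** One artificial neuron: activation = sigmoid(weighted sum of inputs plus a bias). */
public class Neuron {
    public static double activation(double[] inputs, double[] weights, double bias) {
        double sum = bias;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return 1.0 / (1.0 + Math.exp(-sum));   // sigmoid squashing function
    }

    public static void main(String[] args) {
        double[] inputs  = {0.5, 0.1, 0.9};    // example activations of connected units
        double[] weights = {0.4, -0.6, 0.2};   // example connection weights
        System.out.println(activation(inputs, weights, 0.1));
    }
}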

4.5 Components of CMU Sphinx


The high-level architecture of Sphinx is straightforward. As shown in Figure 4.1, the
architecture consists of the front end, the decoder, a knowledge base, and the application.


Figure 4.1: Sphinx Architecture


The front end is responsible for gathering, annotating and processing the input data. In
addition, the front end extracts features from the input data to be read by the decoder. The
annotations provided by the front end include the beginning and ending of a data segment.
The knowledge base provides the information the decoder needs to do its job.
This information includes the acoustic model and the language model. The knowledge
base can also receive feedback from the decoder, permitting the knowledge base to
dynamically modify itself based upon successive search results.
The decoder is the main component of SR. It selects the next set of likely states, scores
incoming features against these states, drops low-scoring states and finally generates results.

Details of each component are discussed in the following sections.


4.5.1 Front End
The function of the Front-End API is straightforward: it provides a low-level audio access API
for recording the user's utterance input and playing the output voice files.
4.5.2 Knowledge Base
The knowledge base is composed of three parts: the Acoustic Model, the Language Model and the
Dictionary.
Acoustic models characterize how sound changes over time. Each phoneme or speech
sound is modeled by a sequence of states and signal observation probability distributions
of sounds that you might hear (observe) in that state.
Sphinx-4 is implemented using a 5-state phonetic model; each phone model has exactly five
states. At run time, frames of the input audio are compared to the distributions in the states
to see which states might be likely producers of the observed audio.
Acoustic models that are matched to the conditions they will be used in perform best.
That is to say, English acoustic models work best for English, and telephone models work
best on the telephone. With SphinxTrain, we can train acoustic models for any language,
task, or channel condition.
An LM file (often with a .lm extension) is a Language model. The Language model
describes the likelihood, probability, or penalty taken when a sequence or collection of
words is seen. Sphinx4 uses N-gram models, and usually N is 3, so they are tri-gram
models, and these are sequences of three words. All the sequences of three words, two
words, and one word are combined together using back-off weights in order to assign
probabilities to sequences of words.
Finally, the decoder needs to know the pronunciations of words, and the Dictionary file
(often with a .dic extension) is a list of words with a sequence of phones.

4.5.3 Decoder
The decoder is the main component of SR. It reads features from the front end,
couples this with data from the knowledge base and feedback from the application, and
performs a search to determine the most likely sequences of words that could be
represented by a series of features. The term "search space" is used to describe the most
likely sequences of words, and is dynamically updated by the decoder during the
decoding process.
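To show how the front end, knowledge base and decoder come together in practice, a typical
Sphinx-4 recognition loop looks roughly like the sketch below. The configuration file name is a
placeholder; it stands for the XML configuration file described earlier and would have to be
adapted to this project's actual setup.

import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class SphinxDemo {
    public static void main(String[] args) {
        // The XML file wires up the front end, acoustic model, language model and decoder.
        ConfigurationManager cm =
                new ConfigurationManager(SphinxDemo.class.getResource("sar.config.xml"));

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        Microphone microphone = (Microphone) cm.lookup("microphone");
        if (!microphone.startRecording()) {
            System.err.println("Cannot start the microphone.");
            recognizer.deallocate();
            return;
        }

        // Simple demo loop: each call to recognize() lets the decoder search for the best
        // word sequence for the next utterance.
        while (true) {
            Result result = recognizer.recognize();
            if (result != null) {
                System.out.println("You said: " + result.getBestFinalResultNoFiller());
            }
        }
    }
}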

4.6 Implementation of Java Speech API


Although the Java Speech API was released several years ago, it seems only a few vendors are
interested in this specification; this is probably due to the lack of performance and low-level
control on the Java platform. The following Java Speech API implementations are known to exist.
FreeTTS and CMU Sphinx-4 are open-source implementations of the Java Speech API. They will
become reference implementations once the Sphinx-4 project is complete.
FreeTTS is a speech synthesis system written entirely in the Java programming language. It is
based upon Flite 1.1, a small run-time speech synthesis engine developed at CMU. Flite is
derived from the Festival Speech Synthesis System from the University of Edinburgh and the
FestVox project from CMU. The goal of FreeTTS is to be quick and small, because most Java
applications run on the Internet. This requirement leads to a trade-off in its voice quality.
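For comparison, a minimal FreeTTS usage sketch is shown below. The "kevin16" voice is the one
commonly bundled with the FreeTTS distribution and is used here only as an example.

import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

public class FreeTtsDemo {
    public static void main(String[] args) {
        // Ask the VoiceManager for a bundled voice ("kevin16" ships with FreeTTS).
        VoiceManager voiceManager = VoiceManager.getInstance();
        Voice voice = voiceManager.getVoice("kevin16");
        if (voice == null) {
            System.err.println("Voice not found; check the FreeTTS voice jars on the classpath.");
            return;
        }
        voice.allocate();
        voice.speak("Hello from Free T T S.");
        voice.deallocate();
    }
}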

4.7 Snapshots

Figure 4.2: Book Reader (Speech Synthesizer)


Figure 4.3: Browse option

Figure 4.4: File Path for Synthesizer


Figure 4.5: Command Prompt for Speech Recognizer


Chapter 5

Conclusion and Future Work


5.1 Conclusion

Speech recognition will revolutionize the way people interact with smart devices and will,
ultimately, differentiate the upcoming technologies. Almost all the smart devices coming to
market today are capable of recognizing speech. Many areas can benefit from this technology.
Speech recognition can be used for intuitive operation of computer-based systems in daily life.


This technology will spawn revolutionary changes in the modern world and become a
pivotal technology. Within five years, speech recognition technology will become so
pervasive in our daily lives that service environments lacking this technology will be

considered inferior.
In this work, we propose and implement Book Reader, a TTS synthesizer that uses MBROLA for
speech synthesis. The main requirements are language independence and understandability of the
pronounced speech.

5.2 Limitations
Out-of-Vocabulary (OOV) Words
Systems have to maintain a huge vocabulary of words of different languages, sometimes tuned to
the user's phonetics as well. They are not capable of adjusting their vocabulary according to a
change in users. Systems must have some method of detecting OOV words and dealing with them in a
sensible way.

Spontaneous Speech
Systems are unable to recognize the speech properly when it contains disfluencies (filled
pauses, false starts, hesitations, ungrammatical constructions etc.). Spontaneous speech
remains a problem.

Adaptability

Speech recognition software is not capable of adapting to various changing conditions, which
include a different microphone, background noise, a new speaker, a new task domain or even a new
language. Under such conditions the efficiency of the software degrades drastically.

Accent
Most systems are built according to the common accent of a particular language, but people's
accents vary over a wide range. This application supports the US accent only.

5.3 Future Scope
This work can be taken into more detail and more work can be done on the project in order to
bring modifications and additional features. The current software does not support a large
vocabulary; work will be done to accumulate a larger number of samples and increase the
efficiency of the software. The current version of the software supports only the US accent, but
more accents can be covered and effort will be made in this regard.

