
Expert System Voice Assistant

Submitted for the Partial Fulfilment of the Requirement
for the Award of the Degree of
Bachelor of Engineering in Computer Science & Engineering

Submitted by:
1. Aakash Shrivastava(0101CS101001)
2. Ashish Kumar Namdeo(0101CS101024)
3. Avinash Dongre(0101CS101026)
4. Chitransh Surheley(0101CS101031)

Guided By:
Prof. Shikha Agarwal


2013- 2014




This is to certify that Aakash Shrivastava, Ashish Kumar Namdeo, Avinash Dongre, and Chitransh
Surheley of B.E. fourth year, Computer Science & Engineering, have completed their major project
Expert System Voice Assistant during the academic year 2013-14 under our guidance. We approve
the project for submission in partial fulfillment of the requirement for the award of the degree
in Computer Science & Engineering.

Prof. Shikha Agarwal

Project Guide

Dr. Sanjay Silakari

(Head, CSE Dept.)


Dr. V.K.Sethi
(Director, UIT-RGPV)

We hereby declare that the work being presented in the major project Expert System
Voice Assistant, submitted in partial fulfillment of the requirement for the award of the Bachelor
degree in Computer Science & Engineering and carried out at University Institute of Technology,
RGPV, Bhopal, is an authentic record of our work, carried out under the guidance of
Prof. Shikha Agrawal, Department of Computer Science & Engineering, UIT-RGPV, Bhopal.
The matter written in this project has not been submitted by us for the award of any other
degree.

Aakash Shrivastava(0101CS101001)
Ashish Kumar Namdeo(0101CS101024)
Avinash Dongre(0101CS101026)
Chitransh Surheley(0101CS101031)


We take the opportunity to express our cordial gratitude and deep sense of indebtedness to our
guide Prof. Shikha Agrawal, Department of Computer Science and Engineering, for her valuable
guidance and inspiration throughout the project duration. We feel thankful to her for the
innovative ideas which led to the successful completion of this project work. She has always
welcomed our problems and helped us clear our doubts. We will always be grateful to her for
providing us moral support and sufficient time.
We owe our sincere thanks to Dr. Sanjay Silakari (HOD, CSE), who helped us duly in time
during our project work in the Department.
At the same time, we would like to thank all other faculty members and all non-teaching staff of
the Computer Science and Engineering Department for their valuable co-operation.

Aakash Shrivastava(0101CS101001)
Ashish Kumar Namdeo(0101CS101024)
Avinash Dongre(0101CS101026)
Chitransh Surheley(0101CS101031)


A speech interface to the computer is the next big step that computer science needs to take for
general users, and speech recognition will play an important role in taking technology to them.
Our goal is to create speech recognition software that can recognise spoken words. This
report takes a brief look at the basic building blocks of speech recognition, speech synthesis,
and overall human-computer interaction. The most important purpose of this project is
to understand the interface between a person and a computer. Traditional ways of
interaction use the keyboard, mouse, or other input devices, but computing has since become
a more sophisticated and complex operation. With these properties we have the
advantage and resources to build a more modern interface which allows a more
natural-looking interaction. So in this project, we have tried to develop an
application which makes human-computer interaction more interesting and user
friendly. It is called the Expert System Voice Assistant; the main feature of this project is
that it takes human voice as input, processes it accordingly, performs the given task, and
responds at the end. This project is a digital life assistant which mainly uses human
communication means such as Twitter, instant messaging, and voice to create a two-way connection
between a human and his computer, controlling power, documents, social media, and much
more. In our project we mainly use voice as communication, so it is basically a speech
recognition application. The concept of speech technology really encompasses two
technologies: the synthesizer and the recognizer. A speech synthesizer takes text as input and produces
an audio stream as output. A speech recognizer, on the other hand, does the opposite: it takes an
audio stream as input and turns it into a text transcription. The voice is a signal of infinite
information, and directly analysing and synthesizing the complex voice signal is difficult because
of the large amount of information contained in it. Therefore digital signal processes such as feature
extraction and feature matching are introduced to represent the voice signal. In this project
we directly use a speech engine whose feature extraction technique is the mel-scaled frequency
cepstrum. The mel-scaled frequency cepstral coefficients (MFCCs), derived from Fourier
transform and filter bank analysis, are perhaps the most widely used front ends in state-of-the-art
speech recognition systems. Our aim is to create more and more functionalities which can
assist humans in their daily life and also reduce their efforts.

Table of Contents


INTRODUCTION
    EXISTING SYSTEMS
    SPEECH RECOGNITION
    SPEECH SYNTHESIS
RELATED WORK
    MICROSOFT VISUAL STUDIO
    AVAILABILITY OF RESOURCES
PROPOSED WORK
    PROBLEM DESCRIPTION
    ARCHITECTURE OF THE PROJECT
    WORKING OF THE PROJECT
    POST QUERY DESIGN
    PROTOTYPE AND INCEPTION
    DEFAULT COMMANDS.TXT
RESULTS
    SNAPSHOT OF THE GUI
    FLOWCHARTS
REFERENCES



Chapter 1
1. Introduction
Speech is an effective and natural way for people to interact with applications, complementing
or even replacing the use of mice, keyboards, controllers, and gestures. A hands-free, yet
accurate way to communicate with applications, speech lets people be productive and stay
informed in a variety of situations where other interfaces will not. Speech recognition is a
topic that is very useful in many applications and environments in our daily life. Generally
speech recognizer is a machine which understands humans and their spoken word in some
way and can act thereafter. A different aspect of speech recognition is to facilitate for people
with functional disabilities or other kinds of handicap. To make their daily chores easier, voice
control could be helpful: with their voice they could operate the system. This leads to the
discussion about intelligent homes, where these operations can be made available for the
common man as well as for the handicapped. Voice-activated systems and gesture control systems
have taken the experience of naive end-users to the next level. Present day users are able
to access or control the system without making a physical interaction with the computer. The
proposed model presents a new approach to voice activated control systems which enhances
the response time and user experience by looking beyond the steps of speech recognition and
focus on the post processing step of natural language processing. The proposed method
conceives the system as a Deterministic Finite State Automata, where each state is allowed a
finite set of keywords, which will be listened to by the speech recognition system. This is
achieved by the introduction of a new system to handle Finite Automata called Switch State
Mechanism. The natural language processing is used to regularly update the state keywords
and give the user a life like interaction with the computer.
With the input functionality of speech recognition, your application can monitor the state,
level, and format of the input signal, and receive notification about problems that might
interfere with successful recognition. You can create grammars programmatically using
constructors and methods on the GrammarBuilder and Choices classes, and your application can
dynamically modify programmatically created grammars while it is running. The structure of
grammars authored using these classes is independent of the Speech Recognition Grammar
Specification (SRGS).
Voice recognition fundamentally functions as a pipeline that converts PCM (Pulse Code
Modulation) digital audio from a sound card into recognized speech. The elements of the
pipeline are:
1. Transform the PCM digital audio into a better acoustic representation
2. Apply a "grammar" so the speech recognizer knows what phonemes to expect. A
grammar could be anything from a context-free grammar to full-blown English.
3. Figure out which phonemes are spoken.
4. Convert the phonemes into words.
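The four stages above can be sketched as a toy pipeline. The feature labels, phoneme dictionary, and lexicon below are invented purely for illustration; a real front end would compute acoustic features (such as MFCCs) from the PCM samples.

```python
# Toy sketch of the four-stage recognition pipeline described above.
# The "acoustic" front end is faked with a lookup table.

TOY_ACOUSTICS = {  # hypothetical mapping: feature label -> phoneme
    "f1": "HH", "f2": "EH", "f3": "L", "f4": "OW",
}
TOY_LEXICON = {  # hypothetical phoneme sequence -> word
    ("HH", "EH", "L", "OW"): "hello",
}

def extract_features(pcm_frames):
    # Stage 1: transform PCM audio into an acoustic representation.
    return [f for f in pcm_frames]  # identity stand-in

def recognize_phonemes(features, grammar=None):
    # Stages 2-3: apply a "grammar" (here: an allowed phoneme set)
    # and decide which phoneme each feature most likely represents.
    phones = [TOY_ACOUSTICS[f] for f in features]
    if grammar is not None:
        phones = [p for p in phones if p in grammar]
    return phones

def phonemes_to_words(phones):
    # Stage 4: convert the phoneme sequence into a word via the lexicon.
    return TOY_LEXICON.get(tuple(phones), "<unknown>")

features = extract_features(["f1", "f2", "f3", "f4"])
print(phonemes_to_words(recognize_phonemes(features)))  # hello
```

Each stage is deliberately trivial here; the point is only the shape of the data flowing through the pipeline, from audio frames to features to phonemes to words.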

1.1 Existing Systems

Although some promising solutions are available for speech synthesis and recognition, most
of them are tuned to English: the acoustic and language models for these systems are built for
the English language, and most of them require a lot of configuration before they can be used.
ISIP and Sphinx are two well-known open-source speech recognition systems, and comparisons
of public-domain software tools for speech recognition have been published. Some commercial
software, like IBM's ViaVoice, is also available.

1.1.1 SIRI
SIRI is an intelligent personal assistant and knowledge navigator which works as an
application for Apple Inc.'s iOS. The application uses a natural language user interface to
answer questions, make recommendations, and perform actions by delegating requests to a set
of Web services. Apple claims that the software adapts to the user's individual preferences
over time and personalizes results. The name Siri is Norwegian, meaning "beautiful woman
who leads you to victory", and comes from the intended name for the original developer's first
child.
Siri was originally introduced as an iOS application available in the App Store by Siri, Inc.,
which was acquired by Apple on April 28, 2010. Siri, Inc. had announced that their software

would be available for BlackBerry and for phones running Android, but all development
efforts for non-Apple platforms were cancelled after the acquisition by Apple.
Siri has been an integral part of iOS since iOS 5 and was introduced as a feature of the iPhone
4S on October 14, 2011. Siri was added to the third generation iPad with the release of iOS 6
in September 2012, and has been included on all iOS devices released during or after October
2012. Siri has several fascinating features where you can call or text someone, search
anything, open any app etc with your voice which is very helpful indeed.

1.1.2 S-VOICE
S Voice is an intelligent personal assistant and knowledge navigator which is only available as
a built-in application for the Samsung Galaxy smartphones. The application uses a natural
language user interface to answer questions, make recommendations, and perform actions by
delegating requests to a set of Web services. It is based on the Vlingo personal assistant.
Some of the capabilities of S Voice include making appointments, opening apps, setting
alarms, updating social network websites such as Facebook or Twitter and navigation. S Voice
also offers efficient multitasking as well as automatic activation features, for example when
the car engine is started.
S Voice possesses much the same feature set as Siri.


1.1.3 GOOGLE NOW
Google Now is an intelligent personal assistant developed by Google. It is available within the
Google Search mobile application for the Android and iOS operating systems, as well as the
Google Chrome web browser on personal computers. Google Now uses a natural language
user interface to answer questions, make recommendations, and perform actions by delegating
requests to a set of web services. Along with answering user-initiated queries, Google Now
passively delivers information to the user that it predicts they will want, based on their search
habits. It was first included in Android 4.1 ("Jelly Bean"), which launched on July 9, 2012,
and was first supported on the Galaxy Nexus smartphone. The service was made available for
iOS on April 29, 2013 in an update to the Google Search app, and later for Google Chrome on
March 24, 2014.

The expert system voice assistant is based on the combination of three major operations:
Speech Recognition
Intermediate Operations and result creation
Speech Synthesis

1.2 Speech Recognition

Speech recognition refers to the ability to listen to (input in audio format) spoken words,
identify the various sounds present in them, and recognise them as words of some known language.
Speech recognition in the computer systems domain may then be defined as the ability of computer
systems to accept spoken words in audio format - such as wav or raw - and then generate their
content in text format. Speech recognition in the computer domain involves various steps, each
with issues attached to it. The steps required to make computers perform speech recognition
are: voice recording, word boundary detection, feature extraction, and recognition with the
help of knowledge models. Word boundary detection is the process of identifying the start and
the end of a spoken word in the given sound signal. While analysing the sound signal, at times
it becomes difficult to identify the word boundary. This can be attributed to the various
accents people have, like the duration of the pause they give between words while speaking.
Feature extraction refers to the process of conversion of the sound signal to a form suitable for the
following stages to use. Feature extraction may include extracting parameters such as the
amplitude of the signal, energy of frequencies, etc. Recognition involves mapping the given
input (in the form of various features) to one of the known sounds. This may involve the use of
various knowledge models for precise identification and ambiguity removal. Knowledge
models refers to models such as the phone acoustic model, language models, etc. which help the
recognition system. To generate the knowledge model one needs to train the system: during
the training period one shows the system a set of inputs and what outputs they should
map to. This is often called supervised learning.

Structure of a standard speech recognition system.

How Speech Recognition Works

A speech recognition engine (or speech recognizer) takes an audio stream as input and turns it
into a text transcription. The speech recognition process can be thought of as having a front end
and a back end.

Convert Audio Input

The front end processes the audio stream, isolating segments of sound that are probably speech
and converting them into a series of numeric values that characterize the vocal sounds in the
signal.

Match Input to Speech Models

The back end is a specialized search engine that takes the output produced by the front end and
searches across three databases: an acoustic model, a lexicon, and a language model.

The acoustic model represents the acoustic sounds of a language, and can be trained to
recognize the characteristics of a particular user's speech patterns and acoustic environment.

The lexicon lists a large number of the words in the language, and provides information
on how to pronounce each word.

The language model represents the ways in which the words of a language are combined.

For any given segment of sound, there are many things the speaker could potentially be saying.
The quality of a recognizer is determined by how good it is at refining its search, eliminating the
poor matches, and selecting the more likely matches. This depends in large part on the quality of
its language and acoustic models and the effectiveness of its algorithms, both for processing
sound and for searching across the models.
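The back end's search over the acoustic and language models can be viewed as choosing the word sequence W that maximizes P(A|W)·P(W), i.e. the product of the acoustic score and the language-model score. The sketch below illustrates this decision rule with invented probabilities for two classic competing hypotheses.

```python
import math

# Toy back-end search: pick the candidate word sequence W that
# maximizes log P(A|W) + log P(W).  All probabilities below are
# invented for illustration.

acoustic_log_prob = {          # log P(A|W): how well the audio fits W
    "recognize speech": math.log(0.40),
    "wreck a nice beach": math.log(0.45),
}
language_log_prob = {          # log P(W): how plausible W is as text
    "recognize speech": math.log(0.30),
    "wreck a nice beach": math.log(0.02),
}

def best_hypothesis(candidates):
    return max(candidates,
               key=lambda w: acoustic_log_prob[w] + language_log_prob[w])

print(best_hypothesis(list(acoustic_log_prob)))  # recognize speech
```

Note how the language model overrules the slightly better acoustic match: this is exactly the "refining its search" behaviour the paragraph above describes.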

While the built-in language model of a recognizer is intended to represent a comprehensive
language domain (such as everyday spoken English), a speech application will often need to
process only certain utterances that have particular semantic meaning to that application. Rather
than using the general purpose language model, an application should use a grammar that
constrains the recognizer to listen only for speech that is meaningful to the application. This
provides the following benefits:

Increases the accuracy of recognition


Guarantees that all recognition results are meaningful to the application

Enables the recognition engine to specify the semantic values inherent in the recognized text.

1.2.1 Algorithms
Both acoustic modeling and language modeling are important parts of modern statistically based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in
many systems. Language modeling is also used in many other natural language processing
applications such as document classification or statistical machine translation.
Hidden Markov models
Modern general-purpose speech recognition systems are based on Hidden Markov Models.
These are statistical models that output a sequence of symbols or quantities. HMMs are used
in speech recognition because a speech signal can be viewed as a piecewise stationary signal
or a short-time stationary signal. In a short time-scale (e.g., 10 milliseconds), speech can be
approximated as a stationary process. Speech can be thought of as a Markov model for many
stochastic purposes.
Another reason why HMMs are popular is because they can be trained automatically and are
simple and computationally feasible to use. In speech recognition, the hidden Markov model
would output a sequence of n-dimensional real-valued vectors (with n being a small integer,
such as 10), outputting one of these every 10 milliseconds. The vectors would consist of
cepstral coefficients, which are obtained by taking a Fourier transform of a short time window
of speech and decorrelating the spectrum using a cosine transform, then taking the first (most
significant) coefficients. The hidden Markov model will tend to have in each state a statistical
distribution that is a mixture of diagonal covariance Gaussians, which will give a likelihood
for each observed vector. Each word, or (for more general speech recognition systems), each
phoneme, will have a different output distribution; a hidden Markov model for a sequence of
words or phonemes is made by concatenating the individual trained hidden Markov models
for the separate words and phonemes.

Described above are the core elements of the most common, HMM-based approach to speech
recognition. Modern speech recognition systems use various combinations of a number of
standard techniques in order to improve results over the basic approach described above. A
typical large-vocabulary system would need context dependency for the phonemes (so
phonemes with different left and right context have different realizations as HMM states); it
would use cepstral normalization to normalize for different speaker and recording conditions;
for further speaker normalization it might use vocal tract length normalization (VTLN) for
male-female normalization and maximum likelihood linear regression (MLLR) for more
general speaker adaptation. The features would have so-called delta and delta-delta
coefficients to capture speech dynamics and in addition might use heteroscedastic linear
discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use
splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant
analysis or a global semi-tied covariance transform (also known as maximum likelihood linear
transform, or MLLT). Many systems use so-called discriminative training techniques that
dispense with a purely statistical approach to HMM parameter estimation and instead optimize
some classification-related measure of the training data. Examples are maximum mutual
information (MMI), minimum classification error (MCE) and minimum phone error (MPE).
Decoding of the speech (the term for what happens when the system is presented with a new
utterance and must compute the most likely source sentence) would probably use the Viterbi
algorithm to find the best path, and here there is a choice between dynamically creating a
combination hidden Markov model, which includes both the acoustic and language model
information, and combining it statically beforehand (the finite state transducer, or FST,
approach).
A possible improvement to decoding is to keep a set of good candidates instead of just
keeping the best candidate, and to use a better scoring function (rescoring) to rate these good
candidates so that we may pick the best one according to this refined score. The set of
candidates can be kept either as a list (the N-best list approach) or as a subset of the models (a
lattice). Rescoring is usually done by trying to minimize the Bayes risk (or an approximation
thereof): Instead of taking the source sentence with maximal probability, we try to take the
sentence that minimizes the expectation of a given loss function with regards to all possible

transcriptions (i.e., we take the sentence that minimizes the average distance to other possible
sentences weighted by their estimated probability). The loss function is usually the
Levenshtein distance, though it can be different distances for specific tasks; the set of possible
transcriptions is, of course, pruned to maintain tractability. Efficient algorithms have been
devised to rescore lattices represented as weighted finite state transducers with edit distances
represented themselves as a finite state transducer verifying certain assumptions.
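The Viterbi decoding mentioned above can be illustrated on a toy two-state HMM. All states, observations, and probabilities below are invented; the recurrence, keeping the single best path into each state at each time step, is the standard one.

```python
# Viterbi decoding over a toy two-state HMM; all probabilities
# are invented for illustration.

states = ("S1", "S2")
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3},
           "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"a": 0.5, "b": 0.5},
          "S2": {"a": 0.1, "b": 0.9}}

def viterbi(observations):
    """Return the most likely state path for the observations."""
    # Probability of the best path ending in each state, plus that path.
    best = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    paths = {s: [s] for s in states}
    for obs in observations[1:]:
        new_best, new_paths = {}, {}
        for s in states:
            # Best predecessor for state s at this time step.
            prev = max(states, key=lambda p: best[p] * trans_p[p][s])
            new_best[s] = best[prev] * trans_p[prev][s] * emit_p[s][obs]
            new_paths[s] = paths[prev] + [s]
        best, paths = new_best, new_paths
    final = max(states, key=lambda s: best[s])
    return paths[final]

print(viterbi(["a", "a", "b"]))
```

In a real recognizer the states would be context-dependent phoneme HMM states and the emissions Gaussian-mixture likelihoods over cepstral feature vectors, but the dynamic-programming structure is the same.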
Dynamic time warping (DTW)-based speech recognition
Dynamic time warping is an approach that was historically used for speech recognition but has
now largely been displaced by the more successful HMM-based approach.
Dynamic time warping is an algorithm for measuring similarity between two sequences that
may vary in time or speed. For instance, similarities in walking patterns would be detected,
even if in one video the person was walking slowly and if in another he or she were walking
more quickly, or even if there were accelerations and decelerations during the course of one
observation. DTW has been applied to video, audio, and graphics; indeed, any data that can
be turned into a linear representation can be analyzed with DTW.
A well-known application has been automatic speech recognition, to cope with different
speaking speeds. In general, it is a method that allows a computer to find an optimal match
between two given sequences (e.g., time series) with certain restrictions. That is, the
sequences are "warped" non-linearly to match each other. This sequence alignment method is
often used in the context of hidden Markov models.
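The DTW idea is compact enough to show directly. The sketch below computes the classic cumulative-cost recurrence over two sequences; the absolute difference is used as the local distance, a common but illustrative choice.

```python
# Dynamic time warping distance between two sequences, using the
# standard dynamic-programming recurrence.

def dtw_distance(a, b):
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])          # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[len(a)][len(b)]

slow = [1, 1, 2, 2, 3, 3]      # same shape, "spoken" slowly
fast = [1, 2, 3]
print(dtw_distance(slow, fast))  # 0.0: DTW warps time to align them
```

The two sequences have the same shape at different speeds, so their DTW distance is zero even though a sample-by-sample comparison would fail; this is exactly why DTW was attractive for coping with different speaking speeds.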
Neural networks
Neural networks emerged as an attractive acoustic modeling approach in ASR in the late
1980s. Since then, neural networks have been used in many aspects of speech recognition such
as phoneme classification, isolated word recognition, and speaker adaptation.
In contrast to HMMs, neural networks make no assumptions about feature statistical
properties and have several qualities making them attractive recognition models for speech
recognition. When used to estimate the probabilities of a speech feature segment, neural
networks allow discriminative training in a natural and efficient manner. Few assumptions on

the statistics of input features are made with neural networks. However, in spite of their
effectiveness in classifying short-time units such as individual phones and isolated words,
neural networks are rarely successful for continuous recognition tasks, largely because of their
lack of ability to model temporal dependencies. Thus, one alternative approach is to use neural
networks as a pre-processing step, e.g. feature transformation or dimensionality reduction, for
HMM-based recognition.

1.3 Speech Synthesis

Speech synthesis is the artificial production of human speech. A computer system used for
this purpose is called a speech synthesizer, and can be implemented in software or hardware
products. A text-to-speech (TTS) system converts normal language text into speech; other
systems render symbolic linguistic representations like phonetic transcriptions into speech.
Synthesized speech can be created by concatenating pieces of recorded speech that are stored
in a database. Systems differ in the size of the stored speech units; a system that stores phones
or diphones provides the largest output range, but may lack clarity. For specific usage
domains, the storage of entire words or sentences allows for high-quality output.
Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice
characteristics to create a completely "synthetic" voice output.
The quality of a speech synthesizer is judged by its similarity to the human voice and by its
ability to be understood clearly. An intelligible text-to-speech program allows people with
visual impairments or reading disabilities to listen to written works on a home computer.
Many computer operating systems have included speech synthesizers since the early 1990s.


A typical TTS system


A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end.
The front-end has two major tasks. First, it converts raw text containing symbols like numbers
and abbreviations into the equivalent of written-out words. This process is often called text
normalization, pre-processing, or tokenization. The front-end then assigns phonetic
transcriptions to each word, and divides and marks the text into prosodic units, like phrases,
clauses, and sentences. The process of assigning phonetic transcriptions to words is called
text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody
information together make up the symbolic linguistic representation that is output by the
front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic
representation into sound. In certain systems, this part includes the computation of the target
prosody (pitch contour, phoneme durations), which is then imposed on the output speech.
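The text normalization step of the front end can be sketched as follows. The digit and abbreviation tables here are tiny illustrative stand-ins; a real TTS front end uses far larger, context-sensitive rules (e.g. reading "42" as "forty-two" rather than digit by digit).

```python
# Minimal text-normalization pass of a TTS front end: expand
# digits and a few abbreviations into written-out words, then
# tokenize.

DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three",
          "4": "four", "5": "five", "6": "six", "7": "seven",
          "8": "eight", "9": "nine"}
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "mr.": "mister"}

def normalize(text):
    tokens = []
    for raw in text.lower().split():
        if raw in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[raw])
        elif raw.isdigit():
            tokens.extend(DIGITS[d] for d in raw)  # read digit by digit
        else:
            tokens.append(raw.strip(".,!?"))
    return tokens

print(normalize("Dr. Smith lives at 42 Elm St."))
# ['doctor', 'smith', 'lives', 'at', 'four', 'two', 'elm', 'street']
```

The output of this stage would then be passed to grapheme-to-phoneme conversion and prosody assignment, producing the symbolic linguistic representation the back-end renders as sound.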

1.4 Intermediate Operations

After the computer recognizes the speech, it is able to convert the spoken words into the
corresponding text. That text can then be used as a command: whatever we speak is converted
into a command, and that command is handled by various system references.
We can operate, manage, and manipulate any system attribute or element using these
commands, and we can use various RSS feeds to create weather, email, and other social media
feeds. Results are created as products of the intermediate operations. These results are fed into
the speech synthesis engine, which is responsible for responding to all the events, so we can get
better feedback from the computer.
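The command-handling stage described above amounts to a dispatch table from recognized text to actions. The sketch below shows the shape of such a stage; the command phrases and handlers are hypothetical, not taken from the actual project (which is implemented in C#).

```python
import datetime

# Sketch of the intermediate-operations stage: recognized text is
# matched against a command table and dispatched to a handler.

def tell_time():
    return "The time is " + datetime.datetime.now().strftime("%H:%M")

def open_notepad():
    return "Opening notepad"  # a real handler would launch the app

COMMANDS = {
    "what is the time": tell_time,
    "open notepad": open_notepad,
}

def handle(recognized_text):
    handler = COMMANDS.get(recognized_text.lower().strip())
    if handler is None:
        return "Sorry, I did not understand that."
    return handler()  # the result is then sent to the speech synthesizer

print(handle("Open Notepad"))  # Opening notepad
```

The string each handler returns plays the role of the "result" in the text above: it is what the speech synthesis engine would speak back to the user.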


1.5 Architecture of the project


Chapter 2

2. Literature Survey and Related Work

2.1 Microsoft Speech recognition Engine

Windows Speech Recognition is a speech recognition application included in Windows
Vista, Windows 7 and Windows 8. Windows Speech Recognition allows the user to control the
computer by giving specific voice commands. The program can also be used for the dictation
of text so that the user can enter text using their voice on their Vista or Windows 7 computer.
Applications that do not present obvious "commands" can still be controlled by asking the
system to overlay numbers on top of interface elements; the number can subsequently be
spoken to activate that function. Programs needing mouse clicks in arbitrary locations can also
be controlled through speech; when asked to do so, a "mousegrid" of nine zones is displayed,
with numbers inside each. The user speaks the number, and another grid of nine zones is
placed inside the chosen zone. This continues until the interface element to be clicked is
within the chosen zone.
Windows Speech Recognition has a fairly high recognition accuracy and provides a set of
commands that assists in dictation. A brief speech-driven tutorial is included to help
familiarize a user with speech recognition commands. Training could also be completed to
improve the accuracy of speech recognition.
Currently, the application supports several languages, including English (U.S. and British),
Spanish, German, French, Japanese and Chinese (traditional and simplified).
Windows speech recognition plays an important role in the development of expert system
voice assistant. The speech recognition phase is carried out with the help of windows speech
recognition engine.


2.2 Collected Information and History

With the information presented so far one question comes naturally: how is speech recognition
done? To get knowledge of how speech recognition problems can be approached today, a
review of some research highlights will be presented. The earliest attempts to devise systems
for automatic speech recognition by machine were made in the 1950s, when various
researchers tried to exploit the fundamental ideas of acoustic-phonetics. In 1952, at Bell
Laboratories, Davis, Biddulph, and Balashek built a system for isolated digit recognition for a
single speaker. The system relied heavily on measuring spectral resonances during the vowel
region of each digit. In 1959 another attempt was made by Forgie at MIT Lincoln
Laboratories: ten vowels embedded in a /b/-vowel-/t/ format were recognized in a
speaker-independent manner. In the 1970s speech recognition research achieved a number of
significant milestones. First, the area of isolated word or discrete utterance recognition became
a viable and usable technology based on the fundamental studies by Velichko and Zagoruyko
in Russia, Sakoe and Chiba in Japan, and Itakura in the United States. The Russian studies
helped advance the use of pattern recognition ideas in speech recognition; the Japanese
research showed how dynamic programming methods could be successfully applied; and
Itakura's research showed how the ideas of linear predictive coding (LPC) could be applied.
At AT&T Bell Labs, researchers began a series of experiments aimed at making speech
recognition systems that were truly speaker independent. They used a wide range of
sophisticated clustering algorithms to determine the number of distinct patterns required to
represent all variations of different words across a wide user population.
In the 1980s there was a shift in technology from template-based approaches to statistical
modeling methods, especially the hidden Markov model (HMM) approach. The purpose of this
report is to gain a deeper theoretical and practical understanding of a speech recognizer. The
work started by examining a currently existing state-of-the-art feature extraction method,
MFCC. Applying this knowledge of MFCC in a practical manner, the speech recognizer is
implemented in .NET technology in the C# language developed by Microsoft. In our project we
use the Speech Application Programming Interface (SAPI), an API developed by Microsoft
to allow the use of speech recognition and speech synthesis within Windows applications.
Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech
Server. In general, all versions of the API have been designed such that a software developer
can write an application to perform speech recognition and synthesis by using a standard set of
interfaces, accessible from a variety of programming languages. In addition, it is possible for a
third-party company to produce their own speech recognition and text-to-speech engines or
adapt existing engines to work with SAPI. Basically, the speech platform consists of an
application runtime that provides speech functionality, an Application Program Interface (API)
for managing the runtime, and runtime languages that enable speech recognition and speech
synthesis (text-to-speech or TTS) in specific languages.

2.3 Availability of resources

2.3.1 Microsoft Speech API

The Speech Application Programming Interface, or SAPI, is an API developed by Microsoft
to allow the use of speech recognition and speech synthesis within Windows applications. To
date, a number of versions of the API have been released, which have shipped either as part of
a Speech SDK or as part of the Windows OS itself. Applications that use SAPI include
Microsoft Office, Microsoft Agent and Microsoft Speech Server.
In general, all versions of the API have been designed such that a software developer can write
an application to perform speech recognition and synthesis by using a standard set of
interfaces, accessible from a variety of programming languages. In addition, it is possible for a
third-party company to produce its own speech recognition and text-to-speech engines, or
adapt existing engines to work with SAPI. In principle, as long as these engines conform to
the defined interfaces, they can be used instead of the Microsoft-supplied engines.
In general the Speech API is a freely redistributable component which can be shipped with
any Windows application that wishes to use speech technology. Many versions (although not
all) of the speech recognition and synthesis engines are also freely redistributable.

There have been two main 'families' of the Microsoft Speech API. SAPI versions 1 through 4
are all similar to each other, with extra features in each newer version. SAPI 5 however was a
completely new interface, released in 2000. Since then several sub-versions of this API have
been released.
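As a concrete sketch of how the managed System.Speech wrapper over SAPI is used, the following minimal program wires the shared Windows recognizer to the default synthesizer. It assumes Windows with the .NET Framework and an installed recognition engine; the phrase and the echo behavior are illustrative only, not the project's actual grammar.

```csharp
using System;
using System.Speech.Recognition;
using System.Speech.Synthesis;

class SapiSketch
{
    static void Main()
    {
        // Managed wrappers over the SAPI recognition and synthesis engines.
        var recognizer = new SpeechRecognizer();     // shared Windows recognizer
        var synthesizer = new SpeechSynthesizer();   // default TTS voice

        // Restrict the recognizer to a single illustrative phrase.
        recognizer.LoadGrammar(new Grammar(new GrammarBuilder("hello computer")));

        // Echo every recognized phrase back through the synthesizer.
        recognizer.SpeechRecognized += (s, e) =>
            synthesizer.SpeakAsync("You said " + e.Result.Text);

        Console.WriteLine("Listening. Press any key to exit.");
        Console.ReadKey();
    }
}
```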

2.3.2 .NET Application Architecture


2.4 Related Work

It would be inappropriate not to mention Siri or Google Now when discussing voice-activated
systems, even though they target mobile devices. Siri relies on web services and hence facilitates
the learning of user preferences over time. However, all the intelligent personal assistants,
including Samsung's S Voice, Iris and others, apply natural language processing after
speech recognition; their use of state machines is limited to context storage and evaluation. One
of the best attempts to create an expert system voice assistant, Chad Barraford's Project Jarvis,
used a speech recognition system followed by natural language processing. From the videos posted of Project Jarvis we can
conclude that the response time of the system is not real time. However, the project was able to
capture the entirety of a digital life assistant. Individual projects such as Project Alpha and
others have tried to utilize state systems through the use of Windows Speech Recognition
Macros. Further, other small projects such as Project Rita rely on a state system for concocting
responses to a command spoken by the user. The scope of these projects, however, is limited
due to improper management of macros or keywords.


Chapter 3
3. Problem Description
A voice assistant is not a very traditional or orthodox application, and such applications are
not generally available in a very broad context. Another point is that not all people can interact
with the computer via orthodox input methods like the keyboard or mouse. People with a
physical disability, or those who are not able to see, may find it very difficult to interact with
the computer, but with the help of this application they can operate the computer as smoothly
as anyone else. The problem is that we have to combine the features of speech recognition,
interpretation, system manipulation, command generation and speech synthesis: we want the
computer to recognize our spoken words, we want the spoken operation to be performed, and
after that we want the application to respond with text-to-speech or other synthetic voice
feedback. We have to make sure that the application understands every command and provides
the results with feedback.


Chapter 4
4. Proposed Work
Here are a few of the proposed functions that are included in the project:
1. Weather - Gives the local weather for the current day. You can set a specific location and,
when connected to the internet, easily ask for the local weather update and forecast. The
assistant will vocally describe the current conditions.
2. Forecast - Gives the local weather for the next few days. You can speak the word
"forecast" and get a glimpse of the coming conditions vocally.
3. News - Reads the latest news headlines from the BBC. You just have to make a statement
about news, or anything similar, and it will either read the news to you or show it on the
internet.
4. Alarm - Starts the alarm chain command for waking up. You have to specify the time and it is
done; it will notify you when that time arrives.
5. Time - Displays the time.
6. Date - Displays the date. Saying "date and time" displays both the date and time.
7. Mute - Mutes system volume.
8. Unmute - Unmutes system volume.
9. Radio - Streams radio from the internet, instantly.
10. Introduce - Gives a general introduction to the expert system voice assistant.
11. Speak - Plays a sample of the application's current TTS voice. For more variety, various
modern TTS voices can be embedded into the project, which gives you the option of changing
the voice of the application according to your preference.


12. Killtask - Kills a specified task. You have to specify vocally which running task is to be
killed.
13. CMD - Starts a new command prompt window.
14. Start or close any program or directory - You can start any program by saying its
name, open or close any directory by speaking its name, and switch from one to another by
voice as well. The confirmation of the start and termination can be vocal.
15. Tasklist - Views currently running processes.
16. Lock - Locks the workstation.
17. Screen off - Turns off the monitor; it can also dim the brightness of the screen.
18. System-specific tasks - You can control your computer's regular operations via voice
commands: for example, you can turn the computer off by saying "turn off", put it to sleep by
saying "sleep", or open and close the disk tray by voice command.
19. Open any website - You can open a specific website by calling its name. This includes
many famous websites.
20. What is there to offer - The first thing is to know the potential and capabilities of
the project, so if the user says "what can you do" or "commands", the application will show
the list of commands and operations it can perform.
21. Print this page - This command is used to print a specific page. The application takes
the spoken word "print" as input, and the status of the task is provided as output via voice.
22. Screenshot anything - You can take a screenshot of any page or window by speaking the
screenshot command.


23. Play music or video locally - You can simply instruct the assistant to play a local
music or video file, on the basis of name, artist, genre, etc.
24. Multimedia control - You can control the volume, select the playlist, and go to the next or
previous track via voice commands.
25. Manage your email - You can manage and check for new emails by saying something like
"check mail". The system will respond vocally to the command and can read your emails to
you.
26. Presentation control - You can start a presentation, go to the previous or next slide, and
end the presentation.
27. Delete file - You can delete any selected file by saying this command.
28. Cut/Copy/Paste - You can perform these operations on any selected file or text.
29. Select all - Say it and it will select the whole document or all the files.
Program Options
Start Automatically - If checked, this program will be added to your start-up folder so that it
will start automatically each time you start Windows.
Show Progress Bars - The program can monitor your usage of the mouse and keyboard and
show you the progress you are making at using your voice instead of the mouse and keyboard.
Progress is measured along several dimensions, including mouse clicks, mouse movement,
keyboard letters, and navigation/function keys.
General options:
1. Open and Close Programs
2. Navigate Programs/Folders
3. Switch or Minimize Windows
4. Change Settings


Chapter 5
5. Design and Development
5.1 Required:
Hardware: Pentium Processor, 512MB of RAM, 10GB HDD.
OS: Windows.

Language: C#.
Tools: .NET Framework 4.5, Microsoft Visual Studio 2010, voice macros.
The speech signal and all its characteristics can be represented in two different domains: the
time domain and the frequency domain. A speech signal is a slowly time-varying signal in the
sense that, when examined over a short period of time (between 5 and 100 ms), its
characteristics are short-time stationary. This is not the case if we look at a speech signal over
a longer time perspective (approximately T > 0.5 s). In that case the signal's characteristics are
non-stationary, meaning that they change to reflect the different sounds spoken by the talker.
To be able to use a speech signal and interpret its characteristics in a proper manner, some
kind of representation of the speech signal is preferred.
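A common first step toward such a representation is to cut the signal into overlapping short-time frames, within which the signal can be treated as stationary. The sketch below shows the idea; the frame and hop sizes are typical values (25 ms window, 10 ms hop at 16 kHz), not values mandated by the project.

```csharp
using System;

static class Framing
{
    // Split a signal into overlapping short-time frames. At 16 kHz, a
    // 25 ms window with a 10 ms hop gives 400-sample frames every 160 samples,
    // short enough for the signal to be considered stationary within a frame.
    public static double[][] Frame(double[] signal, int frameLen, int hop)
    {
        int count = signal.Length < frameLen
            ? 0
            : 1 + (signal.Length - frameLen) / hop;
        var frames = new double[count][];
        for (int i = 0; i < count; i++)
        {
            frames[i] = new double[frameLen];
            Array.Copy(signal, i * hop, frames[i], 0, frameLen);
        }
        return frames;
    }

    public static void Main()
    {
        var frames = Frame(new double[16000], 400, 160);  // 1 second of audio
        Console.WriteLine(frames.Length);                  // number of frames
    }
}
```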

5.2 Microsoft Visual Studio

Microsoft Visual Studio is an integrated development environment (IDE) from Microsoft. It is
used to develop computer programs for the Microsoft Windows superfamily of operating
systems, as well as web sites, web applications and web services. Visual Studio uses Microsoft
software development platforms such as Windows API, Windows Forms, Windows
Presentation Foundation, Windows Store and Microsoft Silverlight. It can produce both native
code and managed code.
Visual Studio includes a code editor supporting IntelliSense as well as code refactoring. The
integrated debugger works both as a source-level debugger and a machine-level debugger.
Other built-in tools include a forms designer for building GUI applications, web designer,
class designer, and database schema designer. It accepts plug-ins that enhance the
functionality at almost every level, including adding support for source-control systems (like
Subversion) and adding new toolsets like editors and visual designers for domain-specific
languages, or toolsets for other aspects of the software development lifecycle (like the Team
Foundation Server client: Team Explorer).
Visual Studio supports different programming languages and allows the code editor and
debugger to support (to varying degrees) nearly any programming language, provided a
language-specific service exists. Built-in languages include C, C++ and C++/CLI (via Visual
C++), VB.NET (via Visual Basic .NET), C# (via Visual C#), and F# (as of Visual Studio
2010). Support for other languages such as M, Python, and Ruby among others is available via
language services installed separately. It also supports XML/XSLT, HTML/XHTML,
JavaScript and CSS.
Microsoft provides "Express" editions of Visual Studio at no cost. Commercial versions of
Visual Studio, along with select past versions, are available for free to students via Microsoft's
DreamSpark program.

5.3 Speech Synthesis

The most important qualities of a speech synthesis system are naturalness and intelligibility.
Naturalness describes how closely the output sounds like human speech, while intelligibility is
the ease with which the output is understood. The ideal speech synthesizer is both natural and
intelligible. Speech synthesis systems usually try to maximize both characteristics.
The two primary technologies generating synthetic speech waveforms are concatenative
synthesis and formant synthesis. Each technology has strengths and weaknesses, and the
intended uses of a synthesis system will typically determine which approach is used.
Create TTS Content
The content that a TTS engine speaks is called a prompt. Creating a prompt can be as simple
as typing a string. See Speak the Contents of a String.
For greater control over speech output, you can create prompts programmatically using the
methods of the PromptBuilder class to assemble content for prompts from text, Speech
Synthesis Markup Language (SSML), files containing text or SSML markup, and prerecorded
audio files. PromptBuilder also allows you to select a speaking voice and to control attributes
of the voice such as rate and volume. See Construct and Speak a Simple Prompt and Construct
a Complex Prompt for more information and examples.
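A minimal sketch of a programmatically built prompt follows; the prompt text and the pause length are illustrative, and a Windows machine with the .NET Framework is assumed.

```csharp
using System;
using System.Speech.Synthesis;

class PromptDemo
{
    static void Main()
    {
        // Assemble prompt content piece by piece with PromptBuilder.
        var builder = new PromptBuilder();
        builder.AppendText("The forecast for tomorrow is");
        builder.AppendBreak(TimeSpan.FromMilliseconds(250));  // short pause
        builder.AppendText("partly cloudy.");

        var synth = new SpeechSynthesizer();
        synth.Speak(builder);   // blocks until the prompt finishes
    }
}
```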
Initialize and Manage the Speech Synthesizer
The SpeechSynthesizer class provides access to the functionality of a TTS engine in Windows
Vista, Windows 7, and Windows Server 2008. Using the SpeechSynthesizer class, you can
select a speaking voice, specify the output for generated speech, create handlers for events that
the speech synthesizer generates, and start, pause, and resume speech generation.
Generate Speech
Using methods on the SpeechSynthesizer class, you can generate speech as either a
synchronous or an asynchronous operation from text, SSML markup, files containing text or
SSML markup, and prerecorded audio files.
Respond to Events
When generating synthesized speech, the SpeechSynthesizer raises events that inform a
speech application about the beginning and end of the speaking of a prompt, the progress of a
speak operation, and details about specific features encountered in a prompt. EventArgs
classes provide notification and information about the events raised and allow you to write
handlers that respond to events as they occur.
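These events can be hooked up as in the following sketch; the console messages are illustrative, and the process is kept alive so the asynchronous events can fire.

```csharp
using System;
using System.Speech.Synthesis;

class SynthEvents
{
    static void Main()
    {
        var synth = new SpeechSynthesizer();

        // Notification at the start and end of the prompt, plus per-word progress.
        synth.SpeakStarted   += (s, e) => Console.WriteLine("Speaking started");
        synth.SpeakProgress  += (s, e) => Console.WriteLine("Spoke: " + e.Text);
        synth.SpeakCompleted += (s, e) => Console.WriteLine("Speaking finished");

        synth.SpeakAsync("Good morning");
        Console.ReadKey();   // keep the process alive for the async events
    }
}
```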
Control Voice Characteristics
To control the characteristics of speech output, you can select a voice with specific attributes
such as language or gender, modify properties of the SpeechSynthesizer such as rate and
volume, or add instructions, either in prompt content or in separate lexicon files, that guide
the pronunciation of specified words or phrases.
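A short sketch of these adjustments; the specific voice hints, rate, and volume values are illustrative.

```csharp
using System.Speech.Synthesis;

class VoiceControl
{
    static void Main()
    {
        var synth = new SpeechSynthesizer();

        // Pick a voice by attributes rather than by name.
        synth.SelectVoiceByHints(VoiceGender.Female, VoiceAge.Adult);

        synth.Rate = 2;     // speaking rate, -10 (slowest) to 10 (fastest)
        synth.Volume = 80;  // output volume, 0 to 100

        synth.Speak("The rate and volume have been adjusted.");
    }
}
```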
Apart from this analysis, some manual scripts can help in answering the most common
questions without the trouble of creating a process.


Chapter 6
6. Implementation and Coding
6.1 Post Query Design
In Visual C#, you can use either the Windows Forms Designer or the Windows Presentation
Foundation (WPF) Designer to quickly and conveniently create user interfaces. Working with
a designer involves:

Adding controls to the design surface.

Setting initial properties for the controls.

Writing handlers for specified events.

Although you can also create your UI by manually writing your own code, designers enable
you to do this work much faster.
Adding Controls
In either designer, you use the mouse to drag controls, which are components with a visual
representation such as buttons and text boxes, onto a design surface. As you work visually, the
Windows Forms Designer translates your actions into C# source code and writes them into a
project file named name.designer.cs, where name is the name that you gave to the form.
Similarly, the WPF Designer translates actions on the design surface into Extensible
Application Markup Language (XAML) code and writes it into a project file named
Window.xaml. When your application runs, that source code (Windows Forms) or XAML
(WPF) will position and size your UI elements so that they appear just as they do on the
design surface.
Setting Properties
After you add a control to the design surface, you can use the Properties window to set its
properties, such as background color and default text.

In the Windows Form designer, the values that you specify in the Properties window are the
initial values that will be assigned to that property when the control is created at run time. In
the WPF designer, the values that you specify in the Properties window are stored as attributes
in the window's XAML file.
In many cases, those values can be accessed or changed programmatically at run time by
getting or setting the property on the instance of the control class in your application. The
Properties window is useful at design time because it enables you to browse all the properties,
events, and methods supported on a control.

Handling Events
Programs with graphical user interfaces are primarily event-driven. They wait until a user does
something such as typing text into a text box, clicking a button, or changing a selection in a
listbox. When that occurs, the control, which is just an instance of a .NET Framework class,
sends an event to your application. You have the option of handling an event by writing a
special method in your application that will be called when the event is received.
You can use the Properties window to specify which events you want to handle in your code.
Select a control in the designer and click the Events button, with the lightning bolt icon, on the
Properties window toolbar to see its events.
When you add an event handler through the Properties window, the designer automatically
writes the empty method body. You must write the code to make the method do something
useful. Most controls generate many events, but frequently an application will only have to
handle some of them, or even only one. For example, you probably have to handle a button's
Click event, but you do not have to handle its Size Changed event unless you want to do
something when the size of the button changes.
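As a minimal hand-written equivalent of what the designer generates, the sketch below creates a form, registers a Click handler on a button, and responds when the event fires; the control and handler names are illustrative.

```csharp
using System;
using System.Windows.Forms;

class MainForm : Form
{
    private readonly Button button = new Button { Text = "Click me" };

    public MainForm()
    {
        // The equivalent of the wiring the designer writes into name.designer.cs:
        button.Click += OnButtonClick;   // register the event handler
        Controls.Add(button);
    }

    // Developer-written handler body, called when the Click event is received.
    private void OnButtonClick(object sender, EventArgs e)
    {
        MessageBox.Show("Clicked at " + DateTime.Now.ToLongTimeString());
    }

    [STAThread]
    static void Main() => Application.Run(new MainForm());
}
```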

6.2 Prototype And Inception

The project is coded in C#. Speech recognition is the very first step in
this process, so we start with that.


Initialize the Speech Recognizer

To initialize an instance of the shared recognizer in Windows, we use:
SpeechRecognizer sr = new SpeechRecognizer();

Create a Speech Recognition Grammar

One way to create a speech recognition grammar is to use the constructors and methods of the
GrammarBuilder class.

Load the Grammar into the Speech Recognizer

After the grammar is created, it must be loaded into the speech recognizer. This is done by
calling the LoadGrammar(Grammar) method, passing the grammar created in the previous
operation.
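One way this might look in practice is the following sketch; the command phrases are illustrative, not the project's actual grammar, and a Windows machine with an installed recognizer is assumed.

```csharp
using System;
using System.Speech.Recognition;

class GrammarDemo
{
    static void Main()
    {
        // "open|close" followed by one of three targets, e.g. "open notepad".
        var action = new Choices("open", "close");
        var target = new Choices("notepad", "calculator", "browser");

        var builder = new GrammarBuilder();
        builder.Append(action);
        builder.Append(target);

        // Load the finished grammar into the shared recognizer.
        var sr = new SpeechRecognizer();
        sr.LoadGrammar(new Grammar(builder));

        sr.SpeechRecognized += (s, e) =>
            Console.WriteLine("Heard: " + e.Result.Text);
        Console.ReadKey();
    }
}
```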
Register for Speech Recognition Event Notification
The speech recognizer raises a number of events during its operation, including the
SpeechRecognized event. For more information, see Use Speech Recognition Events. The
speech recognizer raises the SpeechRecognized event when it matches a user utterance with a
grammar. An application registers for notification of this event by attaching an
EventHandler instance, as shown below. The argument to the EventHandler constructor,
sr_SpeechRecognized, is the name of the developer-written event handler:
sr.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(sr_SpeechRecognized);
Create a Speech Recognition Event Handler
When you register a handler for a particular event, the IntelliSense feature in Microsoft Visual
Studio creates a skeleton event handler if you press the TAB key. This process ensures that
parameters of the correct type are used. The handler for the SpeechRecognized event shown in
the following example displays the text of the recognized word or phrase using the Result
property on the SpeechRecognizedEventArgs parameter, e:
void sr_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    Console.WriteLine(e.Result.Text);
}

The System.Speech.Synthesis namespace is used to synthesize speech, as in the following
console sample:

using System;
using System.Speech.Synthesis;

namespace SampleSynthesis
{
    class Program
    {
        static void Main(string[] args)
        {
            SpeechSynthesizer synth = new SpeechSynthesizer();
            synth.Speak("This example demonstrates a basic use of Speech Synthesizer");
            Console.WriteLine("Press any key to exit...");
            Console.ReadKey();
        }
    }
}
System.Diagnostics.Process.Start(name) can be used to execute the commanded text. One of
its overloads is:
public static Process Start(
    string fileName,
    string arguments,
    string userName,
    SecureString password,
    string domain
)
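A sketch of how a recognized phrase might be mapped to Process.Start follows; the mapping table is illustrative, not the project's actual command list.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class Launcher
{
    // Illustrative mapping from spoken program names to executables.
    static readonly Dictionary<string, string> Apps = new Dictionary<string, string>
    {
        ["notepad"] = "notepad.exe",
        ["command prompt"] = "cmd.exe",
        ["calculator"] = "calc.exe"
    };

    // Resolve a spoken name to an executable, or null if unknown.
    public static string Resolve(string spoken) =>
        Apps.TryGetValue(spoken.ToLowerInvariant(), out var exe) ? exe : null;

    public static void Run(string spoken)
    {
        string exe = Resolve(spoken);
        if (exe != null)
            Process.Start(exe);   // launch the requested program
    }
}
```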

6.4 Default Commands.TXT:

Hello Jarvis
Goodbye Jarvis
Close Jarvis
Stop talking
What's my name?
What time is it
What day is it
Whats todays date
Whats the date
Hows the weather
Whats the weather like
Whats it like outside
What will tomorrow be like
Whats tomorrows forecast
Whats tomorrow like
Whats the temperature

Whats the temperature outside

Play music
Play a random song
You decide
Turn Shuffle On
Turn Shuffle Off
Next Song
Previous Song
Fast Forward
Stop Music
Turn Up
Turn Down
What song is playing
Exit Fullscreen
Play video
next window
select all
print this page
Close window
Out of the way
Come back
Show default commands
Show shell commands
Show web commands

Show social commands

Show Music Library
Show Video Library
Show Email List
Show listbox
Hide listbox
Log off
I want to add custom commands
I want to add a custom command
I want to add a command
Update commands
Set the alarm
What time is the alarm
Clear the alarm
Stop listening
JARVIS Come Back Online
Refresh libraries
Change video directory
Change music directory
Check for new emails
Read the email
Open the email
Next email
Previous email
Clear email list
Change Language
Check for new updates

new folder
take screenshot
go up
go down
save as
start presentation
next slide
previous slide
end presentation
zoom in
hold control

6.5 RSS_Reader
using System;
using System.Linq;
using System.Text;
using CustomizeableJarvis.Properties;
using System.Xml;
using System.Xml.Linq;
using System.Net;

namespace CustomizeableJarvis
{
    class RSSReader
    {
        public static void CheckForEmails()
        {
            try
            {
                string GmailAtomUrl = "";
                XmlUrlResolver xmlResolver = new XmlUrlResolver();
                XmlTextReader xmlReader = new XmlTextReader(GmailAtomUrl);
                xmlReader.XmlResolver = xmlResolver;
                XNamespace ns = XNamespace.Get("");
                XDocument xmlFeed = XDocument.Load(xmlReader);

                var emailItems = from item in xmlFeed.Descendants(ns + "entry")
                                 select new
                                 {
                                     Author = item.Element(ns + "author").Element(ns + "name").Value,
                                     Title = item.Element(ns + "title").Value,
                                     Link = item.Element(ns + "link").Attribute("href").Value,
                                     Summary = item.Element(ns + "summary").Value
                                 };

                frmMain.MsgList.Clear();
                frmMain.MsgLink.Clear();
                foreach (var item in emailItems)
                {
                    if (item.Title == String.Empty)
                        frmMain.MsgList.Add("Message from " + item.Author +
                            ", There is no subject and the summary reads, " + item.Summary);
                    else
                        frmMain.MsgList.Add("Message from " + item.Author +
                            ", The subject is " + item.Title +
                            " and the summary reads, " + item.Summary);
                }

                if (emailItems.Count() > 0)
                {
                    if (emailItems.Count() == 1)
                        frmMain.Jarvis.SpeakAsync("You have 1 new email");
                    else
                        frmMain.Jarvis.SpeakAsync("You have " + emailItems.Count() + " new emails");
                }
                else if (frmMain.QEvent == "Checkfornewemails" && emailItems.Count() == 0)
                {
                    frmMain.Jarvis.SpeakAsync("You have no new emails");
                    frmMain.QEvent = String.Empty;
                }
            }
            catch
            {
                frmMain.Jarvis.SpeakAsync("You have submitted invalid log in information");
            }
        }
        public static void GetWeather()
        {
            try
            {
                string query = String.Format("" +
                    Settings.Default.WOEID.ToString() + "&u=" + Settings.Default.Temperature);
                XmlDocument wData = new XmlDocument();
                XmlNamespaceManager man = new XmlNamespaceManager(wData.NameTable);
                man.AddNamespace("yweather", "");
                XmlNode channel = wData.SelectSingleNode("rss").SelectSingleNode("channel");
                XmlNodeList nodes = wData.SelectNodes("/rss/channel/item/yweather:forecast", man);

                frmMain.QEvent = "connected";
            }
            catch { frmMain.QEvent = "failed"; }
        }
        public static void CheckBloggerForUpdates()
        {
            if (frmMain.QEvent == "UpdateYesNo")
                frmMain.Jarvis.SpeakAsync("There is a new update available. Shall I start the download?");

            String UpdateMessage;
            String UpdateDownloadLink;
            string AtomFeedURL = "";
            XmlUrlResolver xmlResolver = new XmlUrlResolver();
            XmlTextReader xmlReader = new XmlTextReader(AtomFeedURL);
            xmlReader.XmlResolver = xmlResolver;
            XNamespace ns = XNamespace.Get("");
            XDocument xmlFeed = XDocument.Load(xmlReader);

            var blogPosts = from item in xmlFeed.Descendants(ns + "entry")
                            select new { Post = item.Element(ns + "content").Value };

            foreach (var item in blogPosts)
            {
                string[] separator = new string[] { "<br />" };
                string[] data = item.Post.Split(separator, StringSplitOptions.None);
                UpdateMessage = data[0];
                UpdateDownloadLink = data[1];
                if (UpdateDownloadLink == Properties.Settings.Default.RecentUpdate)
                {
                    frmMain.QEvent = String.Empty;
                    frmMain.Jarvis.SpeakAsync("No new updates have been posted");
                }
                else
                {
                    frmMain.Jarvis.SpeakAsync("A new update has been posted. The description says, " + UpdateMessage + ".");
                    frmMain.Jarvis.SpeakAsync("Would you like me to download the update?");
                    frmMain.QEvent = "UpdateYesNo";
                    Properties.Settings.Default.RecentUpdate = UpdateDownloadLink;
                }
            }
        }
    }
}

Chapter 7
7.1 Conclusion and Future work:
In this project a simple mechanism is presented that can eliminate the excess use of natural
language processing. This takes us another step closer to the ideal expert voice assistant.
However, there is still a lot of scope for research on this topic, and the switch-state mechanism
offers only a partial solution, one that addresses the responsiveness issue, that is, the
computation time for understanding a command.
The expert voice assistant in this project mainly uses human communication means such as
Twitter, instant messaging and voice to create a two-way connection between a human and his
computer: controlling it and its applications, notifying him of breaking news, Facebook
notifications and much more. In our project we mainly use voice as the communication means,
so the ESVA is basically a speech recognition application. The concept of speech technology
really encompasses two technologies: synthesis and recognition. A speech synthesizer takes
text as input and produces an audio stream as output. A speech recognizer, on the other hand,
does the opposite: it takes an audio stream as input and turns it into a text transcription. The
voice is a signal carrying a great deal of information, and direct analysis and synthesis of the
complex voice signal is difficult because of how much information the signal contains.
Therefore digital signal processes such as feature extraction and feature matching are
introduced to represent the voice signal. In this project we directly use a speech engine whose
feature extraction technique is the mel-scaled frequency cepstrum. The mel-scaled frequency
cepstral coefficients (MFCCs), derived from Fourier transform and filter bank analysis, are
perhaps the most widely used front ends in state-of-the-art speech recognition systems. Our
aim is to create more and more functionality that can assist people in their daily lives and
reduce their effort. In our tests we checked that all of this functionality works properly.
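For reference, the mel scale mentioned above maps frequency in hertz onto a perceptual scale. The small sketch below implements the commonly used formula mel = 2595 log10(1 + f/700); it is a standard textbook relation, not code from the project itself.

```csharp
using System;

static class MelScale
{
    // Standard mel-scale mapping used in MFCC front ends:
    // mel(f) = 2595 * log10(1 + f / 700), with f in hertz.
    public static double HzToMel(double hz) =>
        2595.0 * Math.Log10(1.0 + hz / 700.0);

    // Inverse mapping, mels back to hertz.
    public static double MelToHz(double mel) =>
        700.0 * (Math.Pow(10.0, mel / 2595.0) - 1.0);

    public static void Main()
    {
        // By construction of the scale, 1000 Hz maps to roughly 1000 mel.
        Console.WriteLine(HzToMel(1000.0));
    }
}
```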
In the future this is going to be one of the most prominent technologies evolving in the
technical world. This application might not fulfill every command a user wants it to have, but
in the future the commands can come in various ranges and forms, and language support can
be extended as well.


This project delivers most of the things that were promised, and it works with very good
efficiency as well. ESVA helped us learn a lot about speech recognition, synthesis, and
system processes and operations. There are still many possibilities in the field of speech and
artificial intelligence: it can go beyond the expected human-machine interaction and deliver
what we see in science fiction.


Chapter 8

8.1 Snapshot of the GUI


References

[1] Siri Intelligent Personal Assistant for iOS. Available at http:// www.









[3] Project Jarvis by Chad Barraford. Available at
[4] Project Alpha. Available at
[5] Bahl, L. R.; Brown, P. F.; de Souza, P. V.; Mercer, R. L.; "Speech recognition with
continuous-parameter hidden Markov models"; in Proceedings of ICASSP-88, the 1988
International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 40-43,
11-14 April 1988.
[6] Christopher D. Manning and Hinrich Schütze; Foundations of Statistical Natural Language
Processing; MIT Press, 1999.
[7] E. J. O'Neil, P. E. O'Neil, and G. Weikum; The LRU-K page replacement algorithm for
database disk buffering; in Proceedings of the 1993 ACM SIGMOD International Conference
on Management of Data, pages 297-306, 1993.
[8] Ciprian Chelba, Dan Bikel, Maria Shugrina, Patrick Nguyen, Shankar Kumar; Large









[9] Hopcroft, John E.; Motwani, Rajeev; Ullman, Jeffrey D. (2001); Introduction to Automata
Theory, Languages, and Computation (2nd ed.); Addison-Wesley. (Chapter 2)
[10] Project Rita by Mike Leslie. Available at