Submitted in partial fulfillment of the requirement for the award of the degree of
BACHELOR OF ENGINEERING
Submitted by
INTERNAL GUIDE:
Ms. Kushalatha M R
(Assistant Professor)
CERTIFICATE
Certified that the Project Work entitled “Providing voice-enabled gadget assistance to
inmates of an old age home (vriddhashrama), including physically disabled people”,
guided by IISc, was carried out by Abhirami Balaraman (1NT12EC004) and
Akshatha P (1NT12EC013), bona fide students of Nitte Meenakshi Institute of Technology, in
partial fulfillment for the award of Bachelor of Engineering in Electronics and
Communication of the Visvesvaraya Technological University, Belgaum, during the academic
year 2015-2016. The project report has been approved as it satisfies the academic
requirement in respect of project work for completion of the autonomous scheme of Nitte
Meenakshi Institute of Technology for the above said degree.
External Viva
Name of the Examiners Signature with Date
……………………………… …............................................
ACKNOWLEDGEMENT
We express our deepest thanks to our Principal, Dr. H. C. Nagaraj, and to Dr. N. R. Shetty,
Director, Nitte Meenakshi Institute of Technology, Bangalore, for allowing us to carry
out the industrial training and supporting us throughout.
We also thank the Indian Institute of Science for giving us the opportunity to carry out our
internship project in their esteemed institution and for giving us all the support we needed
to carry the idea forward as our final year project.
We express our deepest thanks to Dr. Rathna G N for her guidance, useful decisions
and the necessary equipment for the project, and for helping us progress it into our final
year project. We take this moment to acknowledge her contribution gratefully.
We also express our deepest thanks to our HOD, Dr. S. Sandya, for allowing us to carry
out our industrial training and for helping us in every way so that we could gain practical
experience of the industry.
We also take this opportunity to thank Ms. Kushalatha M R [Asst. Prof., ECE Dept.] for
guiding us on the right path and being of immense help to us.
Finally, we thank all the others, unnamed, who helped us in various ways to gain knowledge
and have a good training.
ABSTRACT
Speech recognition is one of the most rapidly developing fields of research at both the
industrial and scientific levels. Until recently, the idea of holding a conversation with a
computer seemed pure science fiction. If you asked a computer to “open the pod bay doors”,
well, that happened only in movies. But things are changing, and quickly. A growing number
of people now talk to their smartphones, asking them to send e-mail and text messages,
search for directions, or find information on the Web. Our project aims at one such
application. The project was designed keeping in mind the various categories of people who
suffer from loneliness due to the absence of others to care for them, especially those under
cancer treatment and the aged. The system will provide interaction and entertainment, and
will control appliances such as the television on voice commands.
LITERATURE SURVEY
Books are available to read and learn about speech recognition; these enabled us to
see what happens beyond the code.
Claudio Becchetti and Lucio Prina Ricotti, “Speech Recognition: Theory and C++ Implementation”,
2008 edition. From this book we learned how to write and implement the C++ code.
4. What are the technological facilities provided at your organization for entertainment?
Ans. We are not aware of how to use all of them. Also, it is expensive.
6. What are the changes you would like to have in your daily routine?
Ans. The routine is monotonous, so we would like means to pass time, like playing games.
Ans. Yes. It provides us entertainment, keeps us engaged and stops us feeling bored.
Speech activation is very helpful for us as it is easy to use, especially for the TV.
8. Any suggestions?
Ans. Add books and scriptures, since our eyes get weak with age. Add quiz games so
that we can improve our knowledge. We need something that can train us in
learning new languages or anything based on our interest, without using the internet.
CONTENTS
1.INTRODUCTION ..................................................................................................................................... 8
LIST OF FIGURES
1.Block diagram of WATSON recognition system ..................................................................... 2
2.Raspberry Pi model B .............................................................................................................................. 13
3.Sound Card (Quantum)............................................................................................................................ 14
4.Collar Mic ........................................................................................................................................................ 15
5.IR LED ............................................................................................................................................................... 18
6.IR Receiver ...................................................................................................................................................... 18
7.PN2222 .............................................................................................................................................................. 19
8.The Raspbian Desktop ............................................................................................................................. 20
9.Jasper client .................................................................................................................................................. 21
10.Schematic..................................................................................................................................................... 48
11.Flowchart ...................................................................................................................................................... 51
12.Block Diagram of System ................................................................................................................... 52
13.GSM Quadband 800A ............................................................................................................................ 53
14.Home automation possibilities ....................................................................................................... 54
15.Car automation ......................................................................................................................................... 56
CHAPTER 1
Some SR systems use "training" (also called "enrolment") where an individual speaker
reads text or isolated vocabulary into the system. The system analyzes the person's
specific voice and uses it to fine-tune the recognition of that person's speech, resulting
in increased accuracy. Systems that do not use training are called "speaker
independent"[1] systems. Systems that use training are called "speaker dependent".
Speech recognition applications include voice user interfaces such as voice dialling
(e.g. "Call home"), call routing (e.g. "I would like to make a collect call"), domotic
appliance control, search (e.g. find a podcast where particular words were spoken),
simple data entry (e.g., entering a credit card number), preparation of structured
documents (e.g. a radiology report), speech-to-text processing (e.g., word processors or
emails), and aircraft (usually termed Direct Voice Input).
From the technology perspective, speech recognition has a long history with several waves of major innovations.
Now the rapid rise of powerful mobile devices is making voice interfaces even more
useful and pervasive.
Jim Glass, a senior research scientist at MIT who has been working on speech interfaces
since the 1980s, says today’s smart phones pack as much processing power as the
laboratory machines he worked with in the ’90s. Smart phones also have high-bandwidth
data connections to the cloud, where servers can do the heavy lifting involved with both
voice recognition and understanding spoken queries. “The combination of more data and
more computing power means you can do things today that you just couldn’t do before,”
says Glass. “You can use more sophisticated statistical models.”
The most prominent example of a mobile voice interface is, of course, Siri, the voice-
activated personal assistant that comes built into the latest iPhone. But voice functionality is
built into Android, the Windows Phone platform, and most other mobile systems, as well as
many apps. While these interfaces still have considerable limitations (see Social
Intelligence), we are inching closer to machine interfaces we can actually talk to.
In 1971, DARPA funded five years of speech recognition research through its Speech
Understanding Research program with ambitious end goals including a minimum
vocabulary size of 1,000 words. BBN, IBM, Carnegie Mellon and Stanford Research
Institute all participated in the program.[11] The government funding revived speech
recognition research that had been largely abandoned in the United States after John
Pierce's letter. Despite the fact that CMU's Harpy system met the goals established at the
outset of the program, many of the predictions turned out to be nothing more than hype,
disappointing DARPA administrators. This disappointment led to DARPA not continuing the
funding.[12] Several innovations happened during this time, such as the invention of beam
search for use in CMU's Harpy system.[13] The field also benefited from the discovery of
several algorithms in other fields such as linear predictive coding and cepstral analysis.
CHAPTER 2
OUR OBJECTIVE
Providing information and entertainment to otherwise solitary people, the system acts as a
personal assistant.
People with disabilities can benefit from speech recognition programs. For individuals
that are Deaf or Hard of Hearing, speech recognition software is used to automatically
generate a closed-captioning of conversations such as discussions in conference
rooms, classroom lectures, and/or religious services.[4]
Speech recognition is also very useful for people who have difficulty using their hands, ranging
from mild repetitive stress injuries to involved disabilities that preclude using conventional
computer input devices. In fact, people who used the keyboard a lot and developed RSI became
an urgent early market for speech recognition.[6] Speech recognition is used in deaf telephony,
such as voicemail to text, relay services, and captioned telephone. Individuals with learning
disabilities who have problems with thought-to-paper communication (essentially they think of
an idea but it is processed incorrectly causing it to end up differently on paper) can possibly
benefit from the software, but the technology is not bug proof.[7] Also, the whole idea of
speech-to-text can be hard for intellectually disabled persons, due to the fact that it is rare
that anyone tries to learn the technology to teach the person with the disability.[8]
Being bedridden can be very difficult for many patients to adjust to and it can also cause other
health problems as well. It is important for family caregivers to know what to expect so that they
can manage or avoid the health risks that bedridden patients are prone to. In this article we
would like to offer some information about common health risks of the bedridden patient and
some tips for family caregivers to follow in order to try and prevent those health risks.
Depression is also a very common health risk for those that are bedridden because they are
unable to care for themselves and maintain the social life that they used to have. Many seniors
begin to feel hopeless when they become bedridden but this can be prevented with proper care.
Family caregivers should make sure that they are caring for their loved one’s social and
emotional needs as well as their physical needs. Many family caregivers focus only on the
physical needs of their loved ones and forget that they have emotional and social needs as well.
Family caregivers can help their loved ones by providing them with regular social activities and
arranging times for friends and other family members to come over so that they will not feel
lonely and forgotten. Family caregivers can also remind their loved ones that being bedridden
does not necessarily mean that they have to give up everything they used to enjoy.[10]
But since family members won't always be available at home, the above-mentioned
problems are still prevalent in these patients; hence our interactive system will provide
them with entertainment (music, movies) and voice responses to general questions.
It therefore behaves as an electronic companion.
CHAPTER 3
SYSTEM REQUIREMENTS
The project needs both hardware and software components. The hardware components
include the Raspberry Pi Model B, keyboard, mouse, earphones, microphone with sound
card, Ethernet cable, HDMI screen and HDMI cable. The software components are the
Raspbian OS on an SD card, a C++ compiler, and the online resources Google Speech API
and Wolfram Alpha. They are described in detail below.
1. RASPBERRY PI MODEL B
Specifications include:
Video outputs: HDMI (rev 1.3 & 1.4), 14 HDMI resolutions from 640×350 to 1920×1200
plus various PAL and NTSC standards; composite video (PAL and NTSC) via RCA jack.
Audio outputs: Analog via 3.5 mm phone jack; digital via HDMI and, as of revision 2 boards, I²S.
On-board network:[11] 10/100 Mbit/s Ethernet (8P8C) USB adapter on the third/fifth port
of the USB hub (SMSC LAN9514-JZX)[42]
Low-level peripherals: 8× GPIO plus the following, which can also be used as GPIO:
UART, I²C bus, SPI bus with two chip selects, I²S audio, +3.3 V, +5 V, ground.
Power rating: 700 mA (3.5 W)
Power source: 5 V via MicroUSB or GPIO header
A sound card (also known as an audio card) is an internal computer expansion card
that facilitates economical input and output of audio signals to and from a computer
under control of computer programs. The term sound card is also applied to external
audio interfaces that use software to generate sound, as opposed to using hardware
inside the PC. Typical uses of sound cards include providing the audio component for
multimedia applications such as music composition, editing video or audio,
presentation, education and entertainment (games) and video projection.
Sound functionality can also be integrated onto the motherboard, using components similar
to plug-in cards. The best plug-in cards, which use better and more expensive components,
can achieve higher quality than integrated sound. The integrated sound system is often still
referred to as a "sound card". Sound processing hardware is also present on modern video
cards with HDMI to output sound along with the video using that connector; previously they
used a SPDIF connection to the motherboard or sound card.
It is a photo detector and preamplifier in one package, with high photo sensitivity, improved
inner shielding against electrical field disturbance, low power consumption, suitable burst
length ≥ 10 cycles/burst, TTL and CMOS compatibility, improved immunity against ambient
light, and an internal filter for the PCM frequency. Bi-CMOS manufactured IC; ESD HBM > 4000 V; MM > 250 V.
It is a miniaturized receiver for infrared remote control systems, with a high-speed PIN
phototransistor and a full wave band preamplifier. Some of its applications are: infrared
applied systems, the light-detecting portion of remote controls, AV instruments such as
audio, TV, VCR, CD and MD players, CATV set-top boxes, other equipment with wireless
remote control, home appliances such as air-conditioners and fans, and multimedia equipment.
Fig 6. IR Receiver
6. PN2222 TRANSISTOR - The transistor here is used to help drive the IR LED. Each
transistor is a general purpose amplifier, model PN2222, with a standard EBC pin-out.
They can switch up to 40 V at peak currents of 1 A, with a DC gain of about 100.
A similar transistor with the same current rating, the KSP2222, can also be used.
7. 10k Ohm RESISTOR - Resistor that goes between the RPi GPIO pin and the PN2222 transistor.
8. BREADBOARD.
1. RASPBIAN OS
Although the Raspberry Pi’s operating system is closer to the Mac than to Windows,
it is the latter that the desktop most closely resembles.
It might seem a little alien at first glance, but using Raspbian is hardly any different to
using Windows (barring Windows 8 of course). There’s a menu bar, a web browser,
a file manager and no shortage of desktop shortcuts of pre-installed applications.
Raspbian is an unofficial port of Debian Wheezy armhf with compilation settings
adjusted to produce optimized "hard float" code that will run on the Raspberry Pi. This
provides significantly faster performance for applications that make heavy use of floating
point arithmetic operations. All other applications will also gain some performance
through the use of advanced instructions of the ARMv6 CPU in Raspberry Pi.
Although Raspbian is primarily the efforts of Mike Thompson (mpthompson) and Peter Green
(plugwash), it has also benefited greatly from the enthusiastic support of Raspberry Pi
community members who wish to get the maximum performance from their device.
2.JASPER CLIENT
Jasper is an open source platform for developing always-on, voice-controlled applications.
Use your voice to ask for information, update social networks, control your home, and more.
Jasper is always on, always listening for commands, and you can speak from meters away.
Build it yourself with off-the-shelf hardware, and use its documentation to write your own modules.
3. CMU Sphinx
CMUSphinx (http://cmusphinx.sourceforge.net) collects over 20 years of CMU
research. All advantages are hard to list, but to name a few:
Flexible design
Commercial support
Active community (more than 400 users in the LinkedIn CMUSphinx group)
Wide range of tools
CMU Sphinx, also called Sphinx in short, is the general term to describe a group of speech
recognition systems developed at Carnegie Mellon University. These include a series of
speech recognizers (Sphinx 2 - 4) and an acoustic model trainer (SphinxTrain).
In 2000, the Sphinx group at Carnegie Mellon committed to open source several speech
recognizer components, including Sphinx 2 and later Sphinx 3 (in 2001). The speech
decoders come with acoustic models and sample applications. The available resources
include, in addition, software for acoustic model training, language model compilation and a
public-domain pronunciation dictionary, cmudict.
PocketSphinx is a version of Sphinx that can be used in embedded systems (e.g., based on
an ARM processor). PocketSphinx is under active development and incorporates features
such as fixed-point arithmetic and efficient algorithms for GMM computation.
4. WinSCP
WinSCP (Windows Secure Copy) is a free and open-source SFTP, FTP, WebDAV and
SCP client for Microsoft Windows. Its main function is secure file transfer between a local
and a remote computer. Beyond this, WinSCP offers basic file manager and file
synchronization functionality. For secure transfers, it uses Secure Shell (SSH) and supports
the SCP protocol in addition to SFTP.[3]
Development of WinSCP started around March 2000 and continues. Originally it was
hosted by the University of Economics in Prague, where its author worked at the time.
Since July 16, 2003, it is licensed under the GNU GPL and hosted on SourceForge.net.[4]
WinSCP is based on the implementation of the SSH protocol from PuTTY and FTP
protocol from FileZilla.[5] It is also available as a plugin for Altap Salamander file
manager,[6] and there exists a third-party plugin for the FAR file manager.[7]
5.PUTTY
PuTTY is a free and open-source terminal emulator, serial console and network file transfer
application. It supports several network protocols, including SCP, SSH, Telnet, rlogin, and
raw socket connection. It can also connect to a serial port (since version 0.59). The name
"PuTTY" has no definitive meaning.[3]
PuTTY was written and is maintained primarily by Simon Tatham and is currently beta
software.
6. LIRC:
LIRC (Linux Infrared remote control) is an open source package that allows users to
receive and send infrared signals with a Linux-based computer system. There is a
Microsoft Windows equivalent of LIRC called WinLIRC. With LIRC and an IR receiver
the user can control their computer with almost any infrared remote control (e.g. a TV
remote control). The user may for instance control DVD or music playback with their
remote control. One GUI frontend is KDELirc, built on the KDE libraries.
7.Python 2.7
Python interpreters are available for installation on many operating systems, allowing Python
code execution on a wide variety of systems. Using third-party tools, such as Py2exe or
Pyinstaller,[29] Python code can be packaged into stand-alone executable programs for some
of the most popular operating systems, allowing the distribution of Python-based software for
use on those environments without requiring the installation of a Python interpreter.
CPython, the reference implementation of Python, is free and open-source software and
has a community-based development model, as do nearly all of its alternative
implementations. CPython is managed by the non-profit Python Software Foundation.
If you can do exactly what you want with Python 3.x, great! There are a few minor downsides,
such as slightly worse library support and the fact that most current Linux distributions and
Macs still use 2.x as the default, but as a language Python 3.x is definitely ready. As long as
Python 3.x is installed on your users' computers (which ought to be easy, since many people
reading this may only be developing something for themselves or an environment they control)
and you're writing things where you know none of the Python 2.x modules are needed, it is an
excellent choice. Also, most Linux distributions have Python 3.x already installed, and all have it
available for end-users. Some are phasing out Python 2 as the preinstalled default.
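As a small illustration of the portability point above, the sketch below runs unchanged on Python 2.7 (the Raspbian default at the time) and on Python 3.x; the `__future__` imports paper over the division and print differences between the two. The function is invented for illustration.

```python
from __future__ import division, print_function

def average(values):
    # True division on both 2.x and 3.x thanks to the __future__ import;
    # on plain Python 2, 10 / 4 would have been integer division.
    return sum(values) / len(values)

print(average([1, 2, 3, 4]))  # 2.5 on both interpreters
```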
CHAPTER 4
IMPLEMENTATION
Both acoustic modeling and language modeling are important parts of modern statistically-
based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in
many systems. Language modeling is also used in many other natural language processing
applications such as document classification or statistical machine translation.
4.1 ALGORITHMS
HMM
Modern general-purpose speech recognition systems are based on Hidden Markov
Models. These are statistical models that output a sequence of symbols or quantities.
HMMs are used in speech recognition because a speech signal can be viewed as a
piecewise stationary signal or a short-time stationary signal. In a short time-scale (e.g.,
10 milliseconds), speech can be approximated as a stationary process. Speech can be
thought of as a Markov model for many stochastic purposes.
Another reason why HMMs are popular is because they can be trained automatically and
are simple and computationally feasible to use. In speech recognition, the hidden Markov
model would output a sequence of n-dimensional real-valued vectors (with n being a small
integer, such as 10), outputting one of these every 10 milliseconds. The vectors would
consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short
time window of speech and decorrelating the spectrum using a cosine transform, then
taking the first (most significant) coefficients. The hidden Markov model will tend to have in
each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which
will give a likelihood for each observed vector. Each word, or (for more general speech
recognition systems), each phoneme, will have a different output distribution; a hidden
Markov model for a sequence of words or phonemes is made by concatenating the
individual trained hidden Markov models for the separate words and phonemes.
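The per-state output distribution described above, a mixture of diagonal-covariance Gaussians scoring each cepstral vector, can be sketched in a few lines of Python. The dimensionality, weights, means and variances below are invented for illustration; a real recognizer uses 13-plus-dimensional cepstral vectors and many mixture components per state.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log-density of vector x under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_gmm(x, weights, means, variances):
    """Log-likelihood of x under a Gaussian mixture.

    The log-sum-exp trick keeps the sum numerically stable even when the
    component log-densities are very negative.
    """
    terms = [math.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Made-up 3-dimensional "cepstral" vectors and a 2-component mixture
# standing in for one HMM state's output distribution.
weights = [0.6, 0.4]
means = [[0.0, 1.0, -1.0], [2.0, 0.0, 0.5]]
variances = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
print(log_gmm([0.1, 0.9, -0.8], weights, means, variances))
```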
Decoding of the speech (the term for what happens when the system is presented with
a new utterance and must compute the most likely source sentence) would probably
use the Viterbi algorithm to find the best path, and here there is a choice between
dynamically creating a combination hidden Markov model, which includes both the
acoustic and language model information, and combining it statically beforehand (the
finite state transducer, or FST, approach).
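The Viterbi decoding step above can be illustrated with a toy two-state model. The states ("silence"/"speech") and every probability below are invented for illustration; a real recognizer runs the same dynamic program over thousands of concatenated phoneme states.

```python
import math

# Toy HMM: is each frame silence or speech, given its observed energy level?
states = ("silence", "speech")
start_p = {"silence": 0.6, "speech": 0.4}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.4, "speech": 0.6},
}
emit_p = {  # probability of observing "low" or "high" frame energy
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}

def viterbi(observations):
    """Return the most likely state sequence for the observations."""
    # trellis[t][s] = (best log-prob of any path ending in state s at time t,
    #                  backpointer to the previous state on that path)
    trellis = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
                for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prev = max(states,
                       key=lambda p: trellis[-1][p][0] + math.log(trans_p[p][s]))
            row[s] = (trellis[-1][prev][0] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s][obs]), prev)
        trellis.append(row)
    # Trace back from the best final state.
    state = max(states, key=lambda s: trellis[-1][s][0])
    path = [state]
    for row in reversed(trellis[1:]):
        state = row[state][1]
        path.append(state)
    return list(reversed(path))

print(viterbi(["low", "low", "high", "high", "low"]))
```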
The loss function is usually the Levenshtein distance, though it can be different
distances for specific tasks; the set of possible transcriptions is, of course, pruned to
maintain tractability. Efficient algorithms have been devised to rescore lattices
represented as weighted finite state transducers with edit distances represented
themselves as a finite state transducer verifying certain assumptions.[8]
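The Levenshtein loss mentioned above is the classic edit-distance dynamic program; dividing the distance by the reference length gives the familiar word error rate. A minimal sketch over word sequences:

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions needed
    to turn the hypothesis word sequence into the reference."""
    prev = list(range(len(hyp) + 1))  # distance from empty prefix of ref
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (0 if equal)
        prev = cur
    return prev[-1]

ref = "call home now".split()
hyp = "call them now".split()
print(levenshtein(ref, hyp))  # 1: one substitution ("home" -> "them")
```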
The success of DNNs in large-vocabulary speech recognition came in 2010, when industrial
researchers, in collaboration with academic researchers, adopted large DNN output layers
based on context-dependent HMM states constructed by decision trees.[7][8][9]
Since the initial successful debut of DNNs for speech recognition around 2009-2011,
there has been huge new progress. This progress (as well as future
directions) has been summarized into the following eight major areas:[8]
Feature processing by deep models with solid understanding of the underlying mechanisms;
Convolution neural networks and how to design them to best exploit domain knowledge
of speech;
Other types of deep models including tensor-based models and integrated deep
generative/discriminative models.
Large-scale automatic speech recognition is the first and most convincing successful case of
deep learning in recent history, embraced by both industry and academia across the board.
Between 2010 and 2014, the two major conferences on signal processing and speech recognition,
IEEE-ICASSP and Interspeech, have seen near exponential growth in the numbers of accepted
papers in their respective annual conference papers on the topic of deep learning for speech
recognition. More importantly, all major commercial speech recognition systems (e.g., Microsoft
Cortana, Xbox, Skype Translator, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a
range of Nuance speech products, etc.) nowadays are based on deep learning methods.[5]
1.4. Display
There are two main connection options for the RPi display, HDMI (High Definition) and
Composite (Standard Definition).
• HD TVs and many LCD monitors can be connected using a full-size 'male' HDMI
cable, and with an inexpensive adaptor if DVI is used. HDMI versions 1.3 and 1.4 are
supported and a version 1.4 cable is recommended. The RPi outputs audio and video
via HDMI, but does not support HDMI input.
• Older TVs can be connected using Composite video (a yellow-to-yellow RCA cable) or
via SCART (using a Composite video to SCART adaptor). Both PAL and NTSC format
TVs are supported.
When using a composite video connection, audio is available from the 3.5mm jack socket,
and can be sent to your TV, headphones or an amplifier. To send audio to your TV, you will
need a cable which adapts from 3.5mm to double (red and white) RCA connectors.
Note: There is no analogue VGA output available. This is the connection required by
many computer monitors, apart from the latest ones. If you have a monitor with only a
D-shaped plug containing 15 pins, then it is unsuitable.
1.5. Power Supply
The unit is powered via the microUSB connector (only the power pins are connected, so
it will not transfer data over this connection). A standard modern phone charger with a
microUSB connector will do, provided it can supply at least 700mA at +5Vdc. Check
your power supply's ratings carefully. Suitable mains adaptors will be available from the
RPi Shop and are recommended if you are unsure what to use.
Note: The individual USB ports on a powered hub or a PC are usually rated to provide
500mA maximum. If you wish to use either of these as a power source then you will need a
special cable which plugs into two ports providing a combined current capability of 1000mA.
1.6. Cables
You will need one or more cables to connect up your RPi system.
• Video cable alternatives:
  o HDMI-A cable
  o HDMI-A cable + DVI adapter
  o Composite video cable
  o Composite video cable + SCART adaptor
• Audio cable (not needed if you use the HDMI video connection to a TV)
• Ethernet/LAN cable (Model B only)
5. Copy the extracted files onto the SD card that you just formatted.
6. Insert the SD card into your Pi and connect the power supply.
7. Alternatively, you can download the Raspbian image from https://raspberrypi.org
Your Pi will now boot into NOOBS and should display a list of operating systems that you can
choose to install. If your display remains blank, you should select the correct output mode for
your display by pressing one of the following number keys on your keyboard:
1. HDMI mode - this is the default display mode.
2. HDMI safe mode - select this mode if you are using the HDMI connector and
cannot see anything on screen when the Pi has booted.
3. Composite PAL mode - select either this mode or composite NTSC mode if you
are using the composite RCA video connector.
4. Composite NTSC mode.
To build PocketSphinx in a Unix-like environment (such as Linux, Solaris, FreeBSD etc.) you
need to make sure you have the following dependencies installed: gcc, automake, autoconf,
libtool, bison, swig (at least version 2.0), the Python development package and the PulseAudio
development package. If you want to build without dependencies you can use the appropriate
configure options, like --without-swig-python, but for beginners it is recommended to install all
dependencies.
You need to download both the sphinxbase and pocketsphinx packages and unpack them. Please
note that you cannot use sphinxbase and pocketsphinx of different versions, so please make sure
that the versions are in sync. After unpacking you should see the following two main folders:
sphinxbase-X.X
pocketsphinx-X.X
On step one, build and install SphinxBase. Change current directory to sphinxbase
folder. If you downloaded directly from the repository, you need to do this at least once
to generate the configure file:
% ./autogen.sh
if you downloaded the release version, or ran autogen.sh at least once, then compile and install:
% ./configure
% make
% make install
The last step might require root permissions, so it might be sudo make install. If you
want to use fixed-point arithmetic, you must configure SphinxBase with the
--enable-fixed option. You can also set the installation prefix with --prefix, and
configure with or without SWIG Python support.
SphinxBase will be installed in the /usr/local/ folder by default. Not every system loads libraries
from this folder automatically. To load them you need to configure the path to look for shared
libraries. It can be done either in the file /etc/ld.so.conf or by exporting environment variables:
export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
Keyword lists
Pocketsphinx supports a keyword spotting mode where you can specify the keyword list to look
for. The advantage of this mode is that you can specify a threshold for each keyword, so that the
keyword can be detected in continuous speech. All other modes will try to detect the words from
the grammar even if you used words which are not in the grammar. The keyword list looks like this:
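The example list itself is missing from the report at this point. The fragment below follows the keyword-list convention described in the CMUSphinx documentation, one phrase per line with its detection threshold between slashes; the phrases themselves are invented for illustration:

```
oh mighty computer /1e-40/
hello world /1e-30/
switch on the tv /1e-20/
```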
Take a long recording with a few occurrences of your keywords and some other sounds. You can
take a movie soundtrack or something else. The length of the audio should be approximately 1 hour.
Run keyword spotting on that file with different thresholds for every keyword, using the
following command:
pocketsphinx_continuous -infile <your_file.wav> -keyphrase <"your keyphrase"> -kws_threshold <your_threshold> -time yes
From the keyword spotting results, count how many false alarms and missed detections
you encountered, then select the threshold with the smallest number of false alarms and
missed detections.
For the best accuracy it is better to have a keyphrase with 3-4 syllables; too short
phrases are easily confused.
Grammars
Grammars describe a very simple type of language for command and control, and they are
usually written by hand or generated automatically within the code. Grammars usually do not
have probabilities for word sequences, but some elements might be weighted. Grammars are
created in JSGF format and usually have an extension like .gram or .jsgf.
Grammars allow you to specify the possible inputs very precisely, for example that a certain
word may be repeated only two or three times. However, this strictness can be harmful if
your user accidentally skips a word that the grammar requires; in that case the whole
recognition will fail. For that reason it is better to make grammars more relaxed: instead
of listing phrases, just list the bag of words, allowing arbitrary order. Avoid very complex
grammars with many rules and cases; they only slow down the recognizer, and you can use
simple rules instead. In the past, grammars required a lot of effort to tune, to assign
variants properly, and so on; the big VXML consulting industry was about exactly that.
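A minimal JSGF grammar for a small command-and-control task might look like this (the grammar and rule names are illustrative):

```
#JSGF V1.0;
grammar commands;
public <command> = (open | close) (browser | music player) |
                   (next | last) window;
```

Each alternative in the public rule is one utterance the recognizer will accept.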
Language models
Statistical language models describe a more complex language. They contain probabilities of
words and word combinations, estimated from sample data, and they automatically have some
flexibility. For example, every combination of words from the vocabulary is possible, though
the probability of each combination varies. If you create a statistical language model from
a list of words, it will still allow word combinations to be decoded even though that might
not be your intent. Overall, statistical language models are recommended for free-form input,
where the user could say anything in natural language, and they require far less engineering
effort than grammars: you just list the possible sentences. For example, you might list
numbers like “twenty one” and “thirty three”, and the statistical language model will also
allow “thirty one” with a certain probability.
Overall, modern speech recognition interfaces tend to be more natural and avoid the command-
and-control style of the previous generation. For that reason most interface designers prefer
natural language recognition with a statistical language model over old-fashioned VXML grammars.
On the topic of design of the VUI interfaces you might be interested in the following
book: It's Better to Be a Good Machine Than a Bad Person: Speech Recognition and
Other Exotic User Interfaces at the Twilight of the Jetsonian Age by Bruce Balentine
There are many ways to build statistical language models. When your data set is
large, it makes sense to use the CMU language modeling toolkit. When a model is small,
you can use an online quick web service. When you need specific options, or you just
want to use your favorite toolkit that builds ARPA models, you can use that.
A language model can be stored and loaded in three different formats: text ARPA format,
binary BIN format, and binary DMP format. The ARPA format takes more space, but it is
possible to edit it; ARPA files have the .lm extension. The binary format takes significantly
less space and loads faster; binary files have the .lm.bin extension. It is also possible to
convert between the formats. The DMP format is obsolete and not recommended.
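A sketch of what the editable text ARPA format looks like; the counts and log-probabilities below are illustrative, not from a real model:

```
\data\
ngram 1=4
ngram 2=2

\1-grams:
-0.7782 </s>
-0.7782 <s>     -0.3010
-1.0792 open    -0.3010
-1.0792 browser -0.3010

\2-grams:
-0.3010 <s> open
-0.3010 open browser

\end\
```

Each entry is a base-10 log probability, the n-gram itself, and an optional backoff weight.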
Building a language model
Text preparation
First of all you need to prepare a large collection of clean texts: expand abbreviations,
convert numbers to words, and clean out non-word items. For example, to clean a Wikipedia
XML dump you can use a special Python script such as https://github.com/attardi/wikiextractor.
To clean HTML pages you can try http://code.google.com/p/boilerpipe, a nice package
specifically created to extract text from HTML.
For an example of how to create a language model from Wikipedia texts, please see
http://trulymadlywordly.blogspot.ru/2011/03/creating-text-corpus-from-wikipedia.html
Once you have gone through the language model process, please submit your language
model to the CMUSphinx project; we'd be glad to share it!
Language modeling for many languages, like Mandarin, is largely the same as in English,
with one additional consideration: the input text must be word-segmented.
A segmentation tool and an associated word list are provided to accomplish this.
There are many toolkits that create ARPA n-gram language model from text files.
IRSTLM
MITLM
SRILM
If you are training a large vocabulary speech recognition system, the language model
training is outlined in a separate page, Building a large scale language model for
domain-specific transcription.
Once you have created the ARPA file, you can convert the model to binary format if needed.
Training with SRILM is easy; that's why we recommend it. Moreover, SRILM is the most
advanced toolkit to date. To train the model you can use the following command:
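Assuming SRILM is installed and your normalized corpus is in train-text.txt (filenames are illustrative), a typical training invocation is:

```shell
ngram-count -kndiscount -interpolate -text train-text.txt -lm your.lm
```

Here -kndiscount selects modified Kneser-Ney discounting and -interpolate interpolates the discounted estimates with lower-order ones.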
You can prune the model afterwards to reduce the size of the model
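Pruning can be done along these lines (the threshold value is illustrative and trades model size against accuracy):

```shell
ngram -lm your.lm -prune 1e-8 -write-lm your-pruned.lm
```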
You need to download and install cmuclmtk. See CMU Sphinx Downloads for details.
1) Prepare a reference text that will be used to generate the language model. The language model
toolkit expects its input to be in the form of normalized text files, with utterances delimited by
<s> and </s> tags, for example:
<s> generally cloudy today with scattered outbreaks of rain and drizzle persistent and
heavy at times </s>
<s> some dry intervals also with hazy sunshine especially in eastern parts in the morning </s>
<s> highest temperatures nine to thirteen Celsius in a light or moderate mainly east
south east breeze </s>
<s> cloudy damp and misty today with spells of rain and drizzle in most places much of
this rain will be light and patchy but heavier rain may develop in the west later </s>
More data will generate better language models. The weather.txt file from sphinx4 (used
to generate the weather language model) contains nearly 100,000 sentences.
2) Generate the vocabulary file. This is a list of all the words in the corpus file:
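With cmuclmtk installed, the vocabulary file can be generated along these lines (weather.txt is the corpus file from step 1):

```shell
text2wfreq < weather.txt | wfreq2vocab > weather.vocab
```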
3) You may want to edit the vocabulary file to remove words (numbers, misspellings,
names). If you find misspellings, it is a good idea to fix them in the input transcript.
4) If you want a closed vocabulary language model (a language model that has no
provisions for unknown words), then you should remove sentences from your input
transcript that contain words that are not in your vocabulary file.
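The normalization described in step 1 can be sketched in Python; the cleaning rules below are an assumption for illustration, not the toolkit's own code:

```python
import re

def normalize(line: str) -> str:
    """Lowercase an utterance, strip punctuation, and wrap it in <s> ... </s>."""
    line = line.lower()
    # Keep letters, digits, apostrophes and spaces; replace other characters
    line = re.sub(r"[^a-z0-9' ]+", " ", line)
    words = line.split()
    return "<s> " + " ".join(words) + " </s>"
```

Running each corpus line through such a function produces the delimited, normalized text the toolkit expects.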
If your language is English and the text is small, it is sometimes more convenient to use a
web service to build the model. Language models built in this way are quite functional for
simple command and control tasks. First of all you need to create a corpus.
The “corpus” is just a list of sentences that you will use to train the language model. As an
example, we will use a hypothetical voice control task for a mobile Internet device. We'd like to
tell it things like “open browser”, “new e-mail”, “forward”, “backward”, “next window”, “last
window”, “open music player”, and so forth. So, we'll start by creating a file called corpus.txt:
open browser
new e-mail
forward
backward
next window
last window
Upload corpus.txt to the online LM Tool. You should see a page with some status messages,
followed by a page entitled “Sphinx knowledge base”. This page will contain links entitled
“Dictionary” and “Language Model”. Download these files and make a note of their names
(they should consist of a 4-digit number followed by the extensions .dic and .lm). You can
now test your newly created language model with PocketSphinx.
To load large models quickly, you will probably want to convert them to binary format,
which saves decoder initialization time. That is not necessary for small models.
Pocketsphinx and sphinx3 can handle both formats with the -lm option. Sphinx4
automatically detects the format by the extension of the LM file.
The ARPA and binary formats are mutually convertible. You can produce one from the other
with the sphinx_lm_convert command from sphinxbase:
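For example, to convert an ARPA model to the binary format (filenames are illustrative):

```shell
sphinx_lm_convert -i model.lm -o model.lm.bin
```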
This section will show you how to use, test, and improve the language model you created.
In Sphinx4 high-level API you need to specify the location of the language model in
Configuration:
configuration.setLanguageModelPath("file:8754.lm");
If the model is in resources, you can reference it with a resource: URL:
configuration.setLanguageModelPath("resource:/com/example/8754.lm");
There are various tools to help you extend an existing dictionary with new words or to build
a new dictionary from scratch. If your language already has a dictionary, it is recommended
to use it, since it is carefully tuned for best performance. If you are starting a new
language, you need to account for various reduction and coarticulation effects, which make
it very hard to create accurate rules to convert text to sounds. However, practice shows
that even naive conversion can produce good results for speech recognition. For example,
many developers have successfully created ASR systems with a simple grapheme-based
dictionary where each letter is just mapped to itself, not to the corresponding phone.
For most languages you need to use a specialized grapheme-to-phoneme (g2p) tool to do the
conversion, using machine learning methods and an existing small database. Nowadays the
most accurate g2p tools are Phonetisaurus:
http://code.google.com/p/phonetisaurus
and sequitur-g2p:
http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html
You can also reuse the dictionaries of TTS engines, for example FreeTTS or MaryTTS for Java:
http://cmusphinx.sourceforge.net/projects/freetts
http://mary.dfki.de/
or eSpeak for C:
http://espeak.sourceforge.net
Please note that if you use a TTS engine you often need to do phoneset conversion: TTS
phonesets are usually more extensive than required for ASR. However, TTS tools have a great
advantage in that they usually contain more of the required functionality than a simple G2P
tool. For example, they do tokenization, converting numbers and abbreviations to spoken form.
For English you can use a simpler option, the online web service:
http://www.speech.cs.cmu.edu/tools/lmtool.html
The online LM Tool produces a dictionary that matches its language model. It uses the latest
CMU dictionary as a base, and is programmed to guess at pronunciations of words not in the
existing dictionary. You can look at the log file to find which words were guessed, and make
your own corrections if necessary. With the advanced option, LM Tool can use a hand-made
dictionary that you specify for your specialized vocabulary, or for your own pronunciation
corrections. The hand dictionary must be in the same format as the main dictionary.
If you want to run LM Tool offline, you can check it out from Subversion.
2.TEXT TO SPEECH
eSpeak is a compact open-source speech synthesizer for many platforms. Speech
synthesis is done offline, but most voices sound quite “robotic”.
Festival uses the Festival Speech Synthesis System, an open-source speech
synthesizer developed by the Centre for Speech Technology Research at the University
of Edinburgh. Like eSpeak, it synthesizes speech offline.
The initial voice engine was eSpeak; it was later changed to Festival:
sudo apt-get update
sudo apt-get install festival festvox-don
4.4.SETTING UP LIRC
First, we’ll need to install and configure LIRC to run on the Raspberry Pi:
sudo apt-get install lirc
Second, you have to modify two files before you can start testing the receiver and
IR LED. Add this to your /etc/modules file:
lirc_dev
lirc_rpi gpio_in_pin=23 gpio_out_pin=22
Next, edit /etc/lirc/hardware.conf:
########################################################
# /etc/lirc/hardware.conf
#
# Arguments which will be used when launching lircd
LIRCD_ARGS="--uinput"
Finally, add the following line to /boot/config.txt to enable the IR overlay:
dtoverlay=lirc-rpi,gpio_in_pin=23,gpio_out_pin=22
Fig 11. Schematic
Run these two commands to stop lircd and start outputting raw data from the IR receiver:
sudo /etc/init.d/lirc stop
mode2 -d /dev/lirc0
Point a remote control at your IR receiver and press some buttons. You should see
something like this:
space 16300
pulse 95
space 28794
space 19395
When using irrecord, it will ask you to name the buttons you’re programming as you program
them. Be sure to run irrecord --list-namespace to see the valid names before you begin.
Here are the commands that we ran to generate a remote configuration file:
# Create a new remote control configuration file (using /dev/lirc0) and save the output to ~/lircd.conf
irrecord -d /dev/lirc0 ~/lircd.conf
The emitter is simply an IR LED (Light Emitting Diode), and the detector is an IR
photodiode that is sensitive to IR light of the same wavelength as that emitted by the
IR LED. When IR light falls on the photodiode, its resistance and, correspondingly, its
output voltage change in proportion to the magnitude of the IR light received. This is
the underlying working principle of the IR sensor.
Fig 12 Flowchart
The flowchart of the Python script is shown below. The voice input is first checked
against the keyword; the system then sends a high beep through the audio out to indicate
that the microphone is actively listening. The voice input given next is compared with
the configured commands, and the corresponding function is called.
Here we are using CMU Sphinx with the jasper-client brain, which implements a deep learning
algorithm. First the configured keyword is spoken; we then hear a high beep, which means
Jasper is listening. Now the command is given, which is decoded and searched in the
pocketsphinx dictionary by HMM computation. A match is found against the words mentioned in
the modules, and the appropriate function is executed. This can be playing a song or video,
reading a book, changing the TV channel, or playing a quiz game.
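The dispatch step described above can be sketched in Python; the command phrases and handler functions here are hypothetical stand-ins, not the actual module code:

```python
# Hypothetical handlers for the actions mentioned above.
def play_song():
    return "playing song"

def read_book():
    return "reading book"

# Map decoded transcripts to their handler functions.
COMMANDS = {
    "play music": play_song,
    "read book": read_book,
}

def handle(transcript: str) -> str:
    """Look up the decoded transcript and run the matching action."""
    action = COMMANDS.get(transcript.strip().lower())
    if action is None:
        return "command not recognized"
    return action()
```

In the real system the transcript comes from the pocketsphinx decoder, and the handlers launch the media player, book reader, or TV control.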
The song and video database can contain songs in any regional language as well.
CHAPTER 5
FURTHER ENHANCEMENTS
In such cases this project may not function; we are therefore enhancing it to work
even without the Internet, using offline recognition toolkits such as CMU Sphinx.
3. HOME AUTOMATION
With the right level of ingenuity, the sky's the limit on things you can automate in your
home, but here are a few basic categories of tasks that you can pursue:
Automate your lights to turn on and off on a schedule, remotely, or when certain conditions
are triggered.
Set your air conditioner to keep the house temperate when you're home and save energy while
you're away.
CHAPTER 6
APPLICATIONS
Usage in education and daily life
For language learning, speech recognition can be useful for learning a second
language. It can teach proper pronunciation, in addition to helping a person develop
fluency with their speaking skills.[6]
Students who are blind (see Blindness and education) or have very low vision can
benefit from using the technology to convey words and then hear the computer recite
them, as well as use a computer by commanding with their voice, instead of having to
look at the screen and keyboard. [6]
Aerospace (e.g. space exploration, spacecraft, etc.): NASA’s Mars Polar Lander used
speech recognition technology from Sensory, Inc. in the Mars Microphone on the lander.[7]
Automatic translation
The improvement of mobile processor speeds made speech-enabled Symbian and Windows Mobile
smartphones feasible. Speech is used mostly as part of a user interface, for creating
predefined or custom speech commands.
In Car systems
Typically a manual control input, for example by means of a finger control on the
steering-wheel, enables the speech recognition system and this is signalled to the driver
by an audio prompt. Following the audio prompt, the system has a "listening window"
during which it may accept a speech input for recognition.
Simple voice commands may be used to initiate phone calls, select radio stations or play
music from a compatible smartphone, MP3 player or music-loaded flash drive. Voice
recognition capabilities vary between car make and model. Some of the most recent car
models offer natural-language speech recognition in place of a fixed set of commands,
allowing the driver to use full sentences and common phrases. With such systems there is,
therefore, no need for the user to memorize a set of fixed command words.
Helicopters
The problems of achieving high recognition accuracy under stress and noise pertain strongly to
the helicopter environment as well as to the jet fighter environment. The acoustic noise problem
is actually more severe in the helicopter environment, not only because of the high noise levels
but also because the helicopter pilot, in general, does not wear a facemask, which would reduce
acoustic noise in the microphone. Substantial test and evaluation programs have been carried
out in the past decade in speech recognition systems applications in helicopters, notably by the
U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal
Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in
the Puma helicopter. There has also been much useful work in Canada. Results have been
encouraging, and voice applications have included: control of communication radios, setting of
navigation systems, and control of an automated target handover system.
As in fighter applications, the overriding issue for voice in helicopters is the impact on
pilot effectiveness. Encouraging results are reported for the AVRADA tests, although
these represent only a feasibility demonstration in a test environment. Much remains to
be done both in speech recognition and in overall speech technology in order to
consistently achieve performance improvements in operational settings.
[4] Waibel, Hanazawa, Hinton, Shikano, Lang (1989). "Phoneme Recognition Using Time-Delay
Neural Networks". IEEE Transactions on Acoustics, Speech and Signal Processing.
[6] "Low Cost Home Automation Using Offline Speech Recognition", International Journal of
Signal Processing Systems, vol. 2, no. 2, pp. 96-101, 2014.
[8] Deng, L.; Li, Xiao (2013). "Machine Learning Paradigms for Speech Recognition: An
Overview". IEEE Transactions on Audio, Speech, and Language Processing.
[9] P. V. Hajar and A. Andurkar, "Review Paper on System for Voice and Facial Recognition
using Raspberry Pi", International Journal of Advanced Research in Computer and
Communication Engineering, vol. 4, no. 4, pp. 232-234, 2015.
[10] "Common Health Risks of the Bedridden Patient", Carefect Blog Team, October 24, 2013.