
PROJECT REPORT 2015-2016

NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY

(An Autonomous Institute, Affiliated to Visvesvaraya Technological University, Belgaum, Approved by AICTE & Govt. of Karnataka)
ACADEMIC YEAR 2015-16

Final Year Project Report on

“Providing voice enabled gadget assistance to inmates of old age home (vriddhashrama) including physically disabled people.”

Submitted in partial fulfillment of the requirement for the award of the degree of

BACHELOR OF ENGINEERING

Submitted by

ABHIRAMI BALARAMAN (1NT12EC004)
AKSHATHA P (1NT12EC013)

INTERNAL GUIDE:
MS. KUSHALATHA M R
(Assistant Professor)

Department of Electronics and Communication Engineering
NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY
Yelahanka, Bangalore-560064

NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY

(An Autonomous Institute, Affiliated to VTU, Belgaum, Approved by AICTE & State Govt. of Karnataka), Yelahanka, Bangalore-560064

Department Of Electronics And Communication Engineering

CERTIFICATE

Certified that the Project Work entitled “Providing voice enabled gadget assistance to inmates of old age home (vriddhashrama) including physically disabled people”, guided by IISc, was carried out by Abhirami Balaraman (1NT12EC004) and Akshatha P (1NT12EC013), bonafide students of Nitte Meenakshi Institute of Technology, in partial fulfillment for the award of Bachelor of Engineering in Electronics and Communication of the Visvesvaraya Technological University, Belgaum, during the academic year 2015-2016. The project report has been approved as it satisfies the academic requirement in respect of project work for completion of the autonomous scheme of Nitte Meenakshi Institute of Technology for the above said degree.

Signature of the Guide (Ms. Kushalatha M R)
Signature of the HOD (Dr. S. Sandya)

External Viva
Name of the Examiners Signature with Date
……………………………… …............................................


ACKNOWLEDGEMENT

We express our deepest thanks to our principal Dr. H. C. Nagaraj and to Dr. N. R. Shetty, Director, Nitte Meenakshi Institute of Technology, Bangalore, for allowing us to carry out the industrial training and supporting us throughout.

We also thank the Indian Institute of Science for giving us the opportunity to carry out our internship project in their esteemed institution and for giving us all the support we needed to carry the idea forward as our final year project.

We express our deepest thanks to Dr. Rathna G N for her useful decisions, guidance and the necessary equipment for the project, and for helping it progress into our final year project. We take this moment to gratefully acknowledge her contribution.

We also express our deepest thanks to our HOD, Dr. S. Sandya, for allowing us to carry out our industrial training and helping us in every way so that we could gain practical experience of the industry.
We also take this opportunity to thank Ms. Kushalatha M R [Asst. Prof, ECE Dept.] for
guiding us in the right path and being of immense help to us.

Finally, we thank everyone else, unnamed here, who helped us in various ways to gain knowledge and have good training.


ABSTRACT
Speech recognition is one of the most rapidly developing fields of research at both the industrial and scientific levels. Until recently, the idea of holding a conversation with a computer seemed like pure science fiction. If you asked a computer to “open the pod bay doors”—well, that was only in movies. But things are changing, and quickly. A growing number of people now talk to their mobile smart phones, asking them to send e-mail and text messages, search for directions, or find information on the Web. Our project aims at one such application. The project was designed keeping in mind the various categories of people who suffer from loneliness due to the absence of others to care for them, especially those who are under cancer treatment and old aged people. The system provides interaction and entertainment and controls appliances such as the television through voice commands.


LITERATURE SURVEY
Books are available to read and learn about speech recognition; these enabled us to see what happens beyond the code.
Claudio Becchetti, Lucio Prina Ricotti, “Speech Recognition: Theory and C++ Implementation”, 2008 edition. From this book we learned how to write and implement the C++ code.

A recent comprehensive textbook, "Fundamentals of Speaker Recognition" by Homayoon Beigi, is an in-depth source for up-to-date details on the theory and practice. A good insight into the techniques used in the best modern systems can be gained by paying attention to government-sponsored evaluations such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).

"Automatic Speech Recognition: A Deep Learning Approach" (Publisher: Springer)


written by D. Yu and L. Deng published near the end of 2014, with highly
mathematically-oriented technical detail on how deep learning methods are derived and
implemented in modern speech recognition systems based on DNNs and related deep
learning methods.This gave us an insight into the conversion algorithm used by Google.
Here are some IEEE and other articles we referred to:
Waibel, Hanazawa, Hinton, Shikano, Lang (1989). "Phoneme recognition using time-delay neural networks." IEEE Transactions on Acoustics, Speech and Signal Processing.
Reynolds, Douglas; Rose, Richard (January 1995). "Robust text-independent speaker identification using Gaussian mixture speaker models" (PDF). IEEE Transactions on Speech and Audio Processing (IEEE) 3 (1): 72–83. doi:10.1109/89.365379. ISSN 1063-6676. OCLC 26108901. Retrieved 21 February 2014.


SURVEY QUESTIONNAIRE CONDUCTED IN THE OLD AGE HOME:

1. What is the total number of people in this old age home?
Ans. There are 22 people, all aged above 70.

2. What are the facilities available to you?
Ans. All basic needs are provided.

3. Is 24/7 medical assistance available for someone who is bedridden?
Ans. No, there is no 24/7 nursing.

4. What are the technological facilities provided at your organization for entertainment purposes?
Ans. There was only a television in each room.

5. Do you have access to computers, internet and mobile phones at your organization?
Ans. No, we are not aware of how to use all of it. Also, it is expensive.

6. What are the changes you would like to have in your daily routine?
Ans. The routine is monotonous, so we would like means to pass time, like playing games or learning anything based on our interest.

7. Do you think our project is helpful to you?
Ans. Yes. It provides us entertainment and keeps us engaged so we do not feel bored. Speech activation is very helpful for us as it is easy to use, especially for the TV.

8. Any suggestions?
Ans. Add books and scriptures, since our eyes get weak with age. Add quiz games so that we can improve our knowledge. We need something that can train us in learning new languages or anything based on our interest, without using the internet.


CONTENTS
1. INTRODUCTION ............................................................................ 8
2. OUR OBJECTIVE ......................................................................... 11
3. SYSTEM REQUIREMENTS .......................................................... 13
   3.1 HARDWARE COMPONENTS ................................................. 13
   3.2 SOFTWARE REQUIRED ........................................................ 16
4. IMPLEMENTATION ....................................................................... 19
   4.1 ALGORITHMS ........................................................................ 19
   4.2 SETTING UP RASPBERRY PI ............................................... 22
   4.3 DOWNLOADING OTHER SOFTWARE .................................. 43
   4.4 SETTING UP LIRC ................................................................. 46
   4.5 WORKING OF IR LED ........................................................... 50
   4.6 FLOWCHART ......................................................................... 51
   4.7 BLOCK DIAGRAM .................................................................. 52
5. FURTHER ENHANCEMENTS ...................................................... 53
6. APPLICATIONS ............................................................................ 55
7. REFERENCES .............................................................................. 59


LIST OF FIGURES
1. Block diagram of WATSON recognition system ......................... 2
2. Raspberry Pi Model B ................................................................. 13
3. Sound Card (Quantum) ............................................................... 14
4. Collar Mic ..................................................................................... 15
5. IR LED ......................................................................................... 18
6. IR Receiver .................................................................................. 18
7. PN2222 ........................................................................................ 19
8. The Raspbian Desktop ................................................................ 20
9. Jasper client ................................................................................ 21
10. Schematic .................................................................................. 48
11. Flowchart ................................................................................... 51
12. Block Diagram of System ......................................................... 52
13. GSM Quadband 800A ............................................................... 53
14. Home automation possibilities .................................................. 54
15. Car automation .......................................................................... 56


CHAPTER 1

INTRODUCTION TO SPEECH RECOGNITION

In computer science and electrical engineering, speech recognition (SR) is the translation of spoken words into text. It is also known as "automatic speech recognition" (ASR), "computer speech recognition", or just "speech to text" (STT).

Some SR systems use "training" (also called "enrolment") where an individual speaker
reads text or isolated vocabulary into the system. The system analyzes the person's
specific voice and uses it to fine-tune the recognition of that person's speech, resulting
in increased accuracy. Systems that do not use training are called "speaker
independent"[1] systems. Systems that use training are called "speaker dependent".

Speech recognition applications include voice user interfaces such as voice dialling
(e.g. "Call home"), call routing (e.g. "I would like to make a collect call"), domotic
appliance control, search (e.g. find a podcast where particular words were spoken),
simple data entry (e.g., entering a credit card number), preparation of structured
documents (e.g. a radiology report), speech-to-text processing (e.g., word processors or
emails), and aircraft (usually termed Direct Voice Input).

The term voice recognition[2][3][4] or speaker identification[5][6] refers to identifying the


speaker, rather than what they are saying. Recognizing the speaker can simplify the task of
translating speech in systems that have been trained on a specific person's voice or it can
be used to authenticate or verify the identity of a speaker as part of a security process.

From the technology perspective, speech recognition has a long history with several waves of

major innovations. Most recently, the field has benefited from advances in deep learning
and big data. The advances are evidenced not only by the surge of academic papers
published in the field, but more importantly by the world-wide industry adoption of a
variety of deep learning methods in designing and deploying speech recognition
systems. These speech industry players include Microsoft, Google, IBM, Baidu (China), Apple, Amazon, Nuance and iFlyTek (China), many of which have publicized that the core technology in their speech recognition systems is based on deep learning.

Fig 1.WATSON block diagram

Now the rapid rise of powerful mobile devices is making voice interfaces even more
useful and pervasive.
Jim Glass, a senior research scientist at MIT who has been working on speech interfaces
since the 1980s, says today’s smart phones pack as much processing power as the
laboratory machines he worked with in the ’90s. Smart phones also have high-bandwidth
data connections to the cloud, where servers can do the heavy lifting involved with both
voice recognition and understanding spoken queries. “The combination of more data and
more computing power means you can do things today that you just couldn’t do before,”
says Glass. “You can use more sophisticated statistical models.”

The most prominent example of a mobile voice interface is, of course, Siri, the voice-
activated personal assistant that comes built into the latest iPhone. But voice functionality is
built into Android, the Windows Phone platform, and most other mobile systems, as well as
many apps. While these interfaces still have considerable limitations (see Social
Intelligence), we are inching closer to machine interfaces we can actually talk to.


In 1971, DARPA funded five years of speech recognition research through its Speech
Understanding Research program with ambitious end goals including a minimum
vocabulary size of 1,000 words. BBN, IBM, Carnegie Mellon and Stanford Research
Institute all participated in the program.[11] The government funding revived speech
recognition research that had been largely abandoned in the United States after John
Pierce's letter. Despite the fact that CMU's Harpy system met the goals established at the
outset of the program, many of the predictions turned out to be nothing more than hype,
disappointing DARPA administrators. This disappointment led to DARPA not continuing the
funding.[12] Several innovations happened during this time, such as the invention of beam
search for use in CMU's Harpy system.[13] The field also benefited from the discovery of
several algorithms in other fields such as linear predictive coding and cepstral analysis.


CHAPTER 2

OUR OBJECTIVE
The system provides information and entertainment to otherwise solitary people, and hence acts as a personal assistant.

People with disabilities can benefit from speech recognition programs. For individuals
that are Deaf or Hard of Hearing, speech recognition software is used to automatically
generate a closed-captioning of conversations such as discussions in conference
rooms, classroom lectures, and/or religious services.[4]
Speech recognition is also very useful for people who have difficulty using their hands, ranging
from mild repetitive stress injuries to involved disabilities that preclude using conventional
computer input devices. In fact, people who used the keyboard a lot and developed RSI became
an urgent early market for speech recognition.[6] Speech recognition is used in deaf telephony,
such as voicemail to text, relay services, and captioned telephone. Individuals with learning
disabilities who have problems with thought-to-paper communication (essentially they think of
an idea but it is processed incorrectly causing it to end up differently on paper) can possibly
benefit from the software, but the technology is not bug proof.[7] Also, the whole idea of speech to text can be hard for intellectually disabled persons, due to the fact that it is rare that anyone tries to learn the technology to teach the person with the disability.[8]

Being bedridden can be very difficult for many patients to adjust to and it can also cause other
health problems as well. It is important for family caregivers to know what to expect so that they
can manage or avoid the health risks that bedridden patients are prone to. In this article we
would like to offer some information about common health risks of the bedridden patient and
some tips for family caregivers to follow in order to try and prevent those health risks.

Depression is also a very common health risk for those that are bedridden because they are
unable to care for themselves and maintain the social life that they used to have. Many seniors
begin to feel hopeless when they become bedridden but this can be prevented with proper care.
Family caregivers should make sure that they are caring for their loved one’s social and
emotional needs as well as their physical needs. Many family caregivers focus only on the
physical needs of their loved ones and forget that they have emotional and social needs as well.
Family caregivers can help their loved ones by providing them with regular social activities and
arranging times for friends and other family members to come over so that they will not feel
lonely and forgotten. Family caregivers can also remind their loved ones that being bedridden
does not necessarily mean that they have to give up everything they used to enjoy.[10]

But since family members won't always be available at home, the above-mentioned problems are still prevalent in these patients; hence our interactive system will provide them with entertainment (music, movies) and voice responses to general questions. It therefore behaves as an electronic companion.


CHAPTER 3

SYSTEM REQUIREMENTS
The project needs both hardware and software components. The hardware components include the Raspberry Pi Model B, keyboard, mouse, earphones, microphone with sound card, Ethernet cable, HDMI screen and HDMI cable. The software components are the Raspbian OS on an SD card, a C++ compiler, and the online resources Google Speech API and Wolfram Alpha. They are described in detail below.

3.1 HARDWARE COMPONENTS

1. RASPBERRY PI MODEL B

The Raspberry Pi is a series of credit card–sized single-board computers developed in the UK by the Raspberry Pi Foundation with the intention of promoting the teaching of basic computer science in schools.[5][6][7]
The system is developed around an ARM microprocessor (ARM is a registered trademark of ARM Limited). Linux now provides support for the ARM-11 family of processors; it gives consumer device manufacturers a commercial-quality Linux implementation along with tools to reduce time-to-market and development costs. The Raspberry Pi is a credit card sized computer development platform based on a BCM2835 system on chip, sporting an ARM11 processor, developed in the UK by the Raspberry Pi Foundation. The Raspberry Pi functions as a regular desktop computer when it is connected to a keyboard and monitor. The Raspberry Pi is very cheap and reliable enough that boards can even be clustered into a Raspberry Pi supercomputer. It runs a Linux kernel-based operating system.
The Foundation provides Debian and Arch Linux ARM distributions for download. Tools are available for Python as the main programming language, with support for BBC BASIC (via the RISC OS image or the Brandy Basic clone for Linux), C, C++, Java, Perl and Ruby.

Fig 2.Raspberry Pi Model B


Specifications include:

SoC: Broadcom BCM2835 (CPU, GPU, DSP, SDRAM, one USB port)
CPU: 700 MHz single-core ARM1176JZF-S
GPU: Broadcom VideoCore IV @ 250 MHz, OpenGL ES 2.0 (24 GFLOPS), MPEG-2 and VC-1 (with license), 1080p30 H.264/MPEG-4 AVC high-profile decoder and encoder
Memory (SDRAM): 512 MB (shared with GPU) as of 15 October 2012
USB 2.0 ports: 2 (via the on-board 3-port USB hub)
Video outputs: HDMI (rev 1.3 & 1.4), 14 HDMI resolutions from 640×350 to 1920×1200 plus various PAL and NTSC standards, composite video (PAL and NTSC) via RCA jack
Audio outputs: Analog via 3.5 mm phone jack; digital via HDMI and, as of revision 2 boards, I²S
On-board storage: SD / MMC / SDIO card slot
On-board network:[11] 10/100 Mbit/s Ethernet (8P8C) USB adapter on the third/fifth port of the USB hub (SMSC LAN9514-JZX)[42]
Low-level peripherals: 8× GPIO plus the following, which can also be used as GPIO: UART, I²C bus, SPI bus with two chip selects, I²S audio, +3.3 V, +5 V, ground
Power rating: 700 mA (3.5 W)
Power source: 5 V via MicroUSB or GPIO header
Size: 85.60 mm × 56.5 mm (3.370 in × 2.224 in) – not including protruding connectors
Weight: 45 g (1.6 oz)


The main differences between the two flavours of Pi are the RAM, the number of USB 2.0 ports and the fact that the Model A doesn't have an Ethernet port (meaning a USB Wi-Fi adapter is required to access the internet). While that results in a lower price for the Model A, it means that a user will have to buy a powered USB hub in order to get it to work for many projects. The Model A is aimed more at those creating electronics projects that require programming and control directly from the command line interface. Both Pi models use the Broadcom BCM2835 CPU, which is an ARM11-based processor running at 700MHz. There are overclocking modes built in for users to
increase the speed as long as the core doesn't get too hot, at which point it is throttled back. Also included is the Broadcom VideoCore IV GPU with support for OpenGL ES 2.0, which can perform 24 GFLOPS and decode and play H.264 video at 1080p resolution. Originally the Model A was due to use 128MB RAM, but this was upgraded to 256MB RAM, with the Model B going from 256MB to 512MB. The power supply to the Pi is via the 5V microUSB socket. As the Model A has fewer powered interfaces it only requires 300mA, compared to the 700mA that the Model B needs. The standard way of connecting the Pi models is to use the HDMI port to connect to an HDMI socket on a TV or a DVI port on a monitor. Both HDMI-HDMI and HDMI-DVI cables work well, delivering 1080p (1920x1080) video. Sound is also sent through the HDMI connection, but if using a monitor without speakers then there's the standard 3.5mm jack socket for audio. The RCA composite video connection was designed for use in countries where the level of technology is lower and more basic displays such as older TVs are used.

2. SOUND CARD WITH MICROPHONE

A sound card is used since the Raspberry Pi has no on-board ADC.

A sound card (also known as an audio card) is an internal computer expansion card
that facilitates economical input and output of audio signals to and from a computer
under control of computer programs. The term sound card is also applied to external
audio interfaces that use software to generate sound, as opposed to using hardware
inside the PC. Typical uses of sound cards include providing the audio component for
multimedia applications such as music composition, editing video or audio,
presentation, education and entertainment (games) and video projection.

Sound functionality can also be integrated onto the motherboard, using components similar
to plug-in cards. The best plug-in cards, which use better and more expensive components,
can achieve higher quality than integrated sound. The integrated sound system is often still
referred to as a "sound card". Sound processing hardware is also present on modern video
cards with HDMI to output sound along with the video using that connector; previously they
used a SPDIF connection to the motherboard or sound card.

We are using a Quantum sound card and a Huawei collar mic.
A microphone, colloquially nicknamed mic or mike (/ˈmaɪk/),[1] is an acoustic-to-electric transducer
or sensor that converts sound into an electrical signal. Electromagnetic transducers facilitate the
conversion of acoustic signals into electrical signals.[2] Microphones are used in many applications
such as telephones, hearing aids, public address systems for concert halls and public events, motion
picture production, live and recorded audio engineering, two-way radios, megaphones, radio and
television broadcasting, and in computers for recording voice, speech recognition, VoIP, and for non-
acoustic purposes such as ultrasonic checking or knock sensors.

Most microphones today use electromagnetic induction (dynamic microphones),


capacitance change (condenser microphones) or piezoelectricity (piezoelectric
microphones) to produce an electrical signal from air pressure variations. Microphones
typically need to be connected to a preamplifier before the signal can be amplified with
an audio power amplifier and a speaker or recorded.

Fig 3.Sound Card Fig 4. Collar Mic

3. KEYBOARD, MOUSE AND HDMI SCREEN are the other peripherals.

4. 940nm IR LEDs - one with a 20 degree viewing angle and one with a 40 degree viewing angle, both bright and tuned to the 940nm wavelength.

Fig 5. IR Transmitter LED

5. 38 kHz IR RECEIVER - Receives IR signals at remote control frequencies.

It is a photo detector and preamplifier in one package, with high photo sensitivity, improved inner shielding against electrical field disturbance, low power consumption, a suitable burst length of ≥10 cycles/burst, TTL and CMOS compatibility, improved immunity against ambient light and an internal filter for the PCM frequency. It uses a Bi-CMOS IC; ESD HBM > 4000V; MM > 250V.

It is a miniaturized receiver for infrared remote control systems, with a high speed PIN phototransistor and a full wave band preamplifier. Some of its applications are: infrared applied systems, the light detecting portion of remote controls, AV instruments such as audio systems, TVs, VCRs, CD and MD players, CATV set top boxes, other equipment with wireless remote control, home appliances such as air-conditioners and fans, and multimedia equipment.

Fig 6. IR Receiver


6. PN2222 TRANSISTOR - The transistor here is used to help drive the IR LED. It is a general purpose amplifier, model PN2222, with a standard EBC pin-out. It can switch up to 40V at peak currents of 1A, with a DC gain of about 100. A similar transistor with the same current rating, the KSP2222, can also be used.

Fig 8.PN2222 Pinout

7. 10k Ohm RESISTOR - Resistor that goes between the RPi GPIO pin and the base of the PN2222 transistor.

8. BREADBOARD - for assembling the circuit.
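
As a simple illustration of how the Pi's GPIO exercises the transistor/LED stage described above, the minimal Python sketch below (using the RPi.GPIO library) switches a GPIO pin high and low. The pin number is only an assumption for illustration; the actual 38 kHz modulation of the IR signal is generated by LIRC (described later), not by this code.

import time
import RPi.GPIO as GPIO

IR_PIN = 22  # hypothetical BCM pin wired through the 10k resistor to the PN2222 base

GPIO.setmode(GPIO.BCM)
GPIO.setup(IR_PIN, GPIO.OUT)

# Crude continuity test: turn the transistor (and hence the IR LED) on briefly.
GPIO.output(IR_PIN, GPIO.HIGH)
time.sleep(0.5)
GPIO.output(IR_PIN, GPIO.LOW)

GPIO.cleanup()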


3.2 SOFTWARE REQUIRED

1. RASPBIAN OS

Although the Raspberry Pi's operating system is closer to the Mac than to Windows, it is the latter that the desktop most closely resembles. It might seem a little alien at first glance, but using Raspbian is hardly any different from using Windows (barring Windows 8, of course). There's a menu bar, a web browser, a file manager and no shortage of desktop shortcuts for pre-installed applications.
Raspbian is an unofficial port of Debian Wheezy armhf with compilation settings
adjusted to produce optimized "hard float" code that will run on the Raspberry Pi. This
provides significantly faster performance for applications that make heavy use of floating
point arithmetic operations. All other applications will also gain some performance
through the use of advanced instructions of the ARMv6 CPU in Raspberry Pi.
Although Raspbian is primarily the efforts of Mike Thompson (mpthompson) and Peter Green
(plugwash), it has also benefited greatly from the enthusiastic support of Raspberry Pi
community members who wish to get the maximum performance from their device.

Fig 9.The Raspbian Desktop


2.JASPER CLIENT
Jasper is an open source platform for developing always-on, voice-controlled applications. Use your voice to ask for information, update social networks, control your home, and more. Jasper is always on, always listening for commands, and you can speak from meters away. Build it yourself with off-the-shelf hardware, and use our documentation to write your own modules.

Fig 10.Jasper client
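
As an illustration of how Jasper modules are written, the sketch below follows the module structure described in the Jasper documentation: a WORDS list, an isValid() check and a handle() function. The exact trigger phrase and spoken response here are our own assumptions, not part of the official examples.

# A minimal Jasper-style module sketch that tells the current time.
import re
import datetime

WORDS = ["TIME"]

def isValid(text):
    # Jasper calls this to decide whether this module should handle the phrase.
    return bool(re.search(r"\btime\b", text, re.IGNORECASE))

def handle(text, mic, profile):
    # mic.say() speaks the response back through the text-to-speech engine.
    now = datetime.datetime.now()
    mic.say("It is %s right now." % now.strftime("%I:%M %p"))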

3. CMU Sphinx

CMUSphinx (http://cmusphinx.sourceforge.net) collects over 20 years of CMU research. All its advantages are hard to list, but just to name a few:

State of the art speech recognition algorithms for efficient speech recognition


CMUSphinx tools are designed specifically for low-resource platforms

Flexible design

Focus on practical application development and not on research

Support for several languages like US English, UK English, French, Mandarin, German, Dutch and Russian, and the ability to build models for others

BSD-like license which allows commercial distribution

Commercial support

Active development and release schedule

Active community (more than 400 users in the LinkedIn CMUSphinx group)

Wide range of tools for many speech-recognition related purposes (keyword spotting, alignment, pronunciation evaluation)

CMU Sphinx, also called Sphinx in short, is the general term to describe a group of speech
recognition systems developed at Carnegie Mellon University. These include a series of
speech recognizers (Sphinx 2 - 4) and an acoustic model trainer (SphinxTrain).

In 2000, the Sphinx group at Carnegie Mellon committed to open source several speech
recognizer components, including Sphinx 2 and later Sphinx 3 (in 2001). The speech
decoders come with acoustic models and sample applications. The available resources
include in addition software for acoustic model training, Language model compilation and a
public-domain pronunciation dictionary, cmudict.

Here, we use the PocketSphinx tool: a version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor). PocketSphinx is under active development and incorporates features such as fixed-point arithmetic and efficient algorithms for GMM computation.
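
A minimal sketch of using the PocketSphinx Python bindings to decode a pre-recorded utterance is shown below. The model, dictionary and audio file paths are assumptions and must be replaced with the paths used on the actual system; the raw file is expected to be 16 kHz, 16-bit mono PCM.

from pocketsphinx.pocketsphinx import Decoder

# Placeholder paths; point them at the installed acoustic model,
# language model and pronunciation dictionary.
config = Decoder.default_config()
config.set_string('-hmm', '/usr/local/share/pocketsphinx/model/en-us/en-us')
config.set_string('-lm', '/usr/local/share/pocketsphinx/model/en-us/en-us.lm.bin')
config.set_string('-dict', '/usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict')

decoder = Decoder(config)
decoder.start_utt()
with open('command.raw', 'rb') as f:          # 16 kHz, 16-bit mono PCM
    while True:
        buf = f.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

if decoder.hyp() is not None:
    print(decoder.hyp().hypstr)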


4. WinSCP

WinSCP (Windows Secure Copy) is a free and open-source SFTP, FTP, WebDAV and SCP client for Microsoft Windows. Its main function is secure file transfer between a local and a remote computer. Beyond this, WinSCP offers basic file manager and file synchronization functionality. For secure transfers, it uses Secure Shell (SSH) and supports the SCP protocol in addition to SFTP.[3]

Development of WinSCP started around March 2000 and continues. Originally it was
hosted by the University of Economics in Prague, where its author worked at the time.
Since July 16, 2003, it is licensed under the GNU GPL and hosted on SourceForge.net.[4]

WinSCP is based on the implementation of the SSH protocol from PuTTY and FTP
protocol from FileZilla.[5] It is also available as a plugin for Altap Salamander file
manager,[6] and there exists a third-party plugin for the FAR file manager.[7]

5.PUTTY
PuTTY is a free and open-source terminal emulator, serial console and network file transfer
application. It supports several network protocols, including SCP, SSH, Telnet, rlogin, and
raw socket connection. It can also connect to a serial port (since version 0.59). The name
"PuTTY" has no definitive meaning.[3]

PuTTY was originally written for Microsoft Windows, but it has been ported to various other operating systems. Official ports are available for some Unix-like platforms, with work-in-progress ports to Classic Mac OS and Mac OS X, and unofficial ports have been contributed to platforms such as Symbian,[4][5] Windows Mobile and Windows Phone.

PuTTY was written and is maintained primarily by Simon Tatham and is currently beta
software.

6. LIRC:

LIRC (Linux Infrared remote control) is an open source package that allows users to
receive and send infrared signals with a Linux-based computer system. There is a
Microsoft Windows equivalent of LIRC called WinLIRC. With LIRC and an IR receiver
the user can control their computer with almost any infrared remote control (e.g. a TV
remote control). The user may for instance control DVD or music playback with their
remote control. One GUI frontend is KDELirc, built on the KDE libraries.
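
Once LIRC has been configured with the codes of the target remote (see Section 4.4), a recognized voice command can be mapped to an IR transmission. The sketch below shows one way to do this from Python; the remote and key names are assumptions and must match whatever was recorded into /etc/lirc/lircd.conf.

import subprocess

def send_ir(remote, key):
    # irsend ships with LIRC; SEND_ONCE transmits the named code one time.
    subprocess.check_call(["irsend", "SEND_ONCE", remote, key])

# Example: switch the television on or off after the command
# "turn on the TV" has been recognized. "samsung_tv" is a placeholder name.
send_ir("samsung_tv", "KEY_POWER")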

7.Python 2.7

Python is a widely used high-level, general-purpose, interpreted, dynamic


programming language.[3][4] Its design philosophy emphasizes code readability, and its
syntax allows programmers to express concepts in fewer lines of code than would be
possible in languages such as C++ or Java.[5][6] The language provides constructs
intended to enable clear programs on both a small and large scale.[7]

Python supports multiple programming paradigms, including object-oriented, imperative and


functional programming or procedural styles. It features a dynamic type system and
automatic memory management and has a large and comprehensive standard library.[8]


Python interpreters are available for installation on many operating systems, allowing Python
code execution on a wide variety of systems. Using third-party tools, such as Py2exe or
Pyinstaller,[29] Python code can be packaged into stand-alone executable programs for some
of the most popular operating systems, allowing the distribution of Python-based software for
use on those environments without requiring the installation of a Python interpreter.

CPython, the reference implementation of Python, is free and open-source software and
has a community-based development model, as do nearly all of its alternative
implementations. CPython is managed by the non-profit Python Software Foundation.

Why python 2.7?

If you can do exactly what you want with Python 3.x, great! There are a few minor downsides, such as slightly worse library support and the fact that most current Linux distributions and Macs still use 2.x as the default, but as a language Python 3.x is definitely ready. As long as Python 3.x is installed on your users' computers (which ought to be easy, since many people reading this may only be developing something for themselves or an environment they control) and you're writing things where you know none of the Python 2.x modules are needed, it is an excellent choice. Also, most Linux distributions have Python 3.x already installed, and all have it available for end-users. Some are phasing out Python 2 as the preinstalled default.


CHAPTER 4

IMPLEMENTATION
Both acoustic modeling and language modeling are important parts of modern statistically-
based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in
many systems. Language modeling is also used in many other natural language processing
applications such as document classification or statistical machine translation.

4.1 ALGORITHMS

HMM
Modern general-purpose speech recognition systems are based on Hidden Markov
Models. These are statistical models that output a sequence of symbols or quantities.
HMMs are used in speech recognition because a speech signal can be viewed as a
piecewise stationary signal or a short-time stationary signal. In a short time-scale (e.g.,
10 milliseconds), speech can be approximated as a stationary process. Speech can be
thought of as a Markov model for many stochastic purposes.

Another reason why HMMs are popular is because they can be trained automatically and
are simple and computationally feasible to use. In speech recognition, the hidden Markov
model would output a sequence of n-dimensional real-valued vectors (with n being a small
integer, such as 10), outputting one of these every 10 milliseconds. The vectors would
consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short
time window of speech and decorrelating the spectrum using a cosine transform, then
taking the first (most significant) coefficients. The hidden Markov model will tend to have in
each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which
will give a likelihood for each observed vector. Each word, or (for more general speech
recognition systems), each phoneme, will have a different output distribution; a hidden
Markov model for a sequence of words or phonemes is made by concatenating the
individual trained hidden Markov models for the separate words and phonemes.

Described above are the core elements of the most common, HMM-based approach to speech
recognition. Modern speech recognition systems use various combinations of a number of
standard techniques in order to improve results over the basic approach described above. A
typical large-vocabulary system would need context dependency for the phonemes (so
phonemes with different left and right context have different realizations as HMM states); it
would use cepstral normalization to normalize for different speaker and recording conditions; for
further speaker normalization it might use vocal tract length normalization (VTLN) for male-
female normalization and maximum likelihood linear regression (MLLR) for more general
speaker adaptation. The features would have so-called delta and delta-delta coefficients to
capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis
(HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based
projection followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied
co variance transform (also known as maximum likelihood linear transform, or MLLT). Many
systems use so-called discriminative training techniques that dispense with a purely statistical
approach to HMM parameter estimation and instead optimize some classification-related
measure of the training data. Examples are maximum mutual information (MMI), minimum
classification error (MCE) and minimum phone error (MPE).

Decoding of the speech (the term for what happens when the system is presented with
a new utterance and must compute the most likely source sentence) would probably
use the Viterbi algorithm to find the best path, and here there is a choice between
dynamically creating a combination hidden Markov model, which includes both the
acoustic and language model information, and combining it statically beforehand (the
finite state transducer, or FST, approach).
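
To make the decoding step concrete, the following self-contained sketch implements the Viterbi algorithm over log-probabilities with numpy. It is only a toy illustration of the dynamic programming described above; a real recognizer searches over a composed HMM of phones and words rather than a small dense matrix.

import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Return the most likely state sequence.
    log_pi: (S,) initial log-probabilities
    log_A:  (S, S) transition log-probabilities
    log_B:  (T, S) per-frame emission log-likelihoods"""
    T, S = log_B.shape
    delta = np.full((T, S), -np.inf)    # best score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # previous state -> current state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]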

A possible improvement to decoding is to keep a set of good candidates instead of just


keeping the best candidate, and to use a better scoring function (re scoring) to rate these
good candidates so that we may pick the best one according to this refined score. The set
of candidates can be kept either as a list (the N-best list approach) or as a subset of the
models (a lattice). Re scoring is usually done by trying to minimize the Bayes risk[7] (or an
approximation thereof): Instead of taking the source sentence with maximal probability, we
try to take the sentence that minimizes the expectancy of a given loss function with regards
to all possible transcriptions (i.e., we take the sentence that minimizes the average distance
to other possible sentences weighted by their estimated probability).


The loss function is usually the Levenshtein distance, though it can be different
distances for specific tasks; the set of possible transcriptions is, of course, pruned to
maintain tractability. Efficient algorithms have been devised to re score lattices
represented as weighted finite state transducers with edit distances represented
themselves as a finite state transducer verifying certain assumptions.[8]
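
For reference, a word-level Levenshtein (edit) distance can be computed with a few lines of dynamic programming; this is the same quantity that drives word error rate scoring. The example sentences below are ours, purely for illustration.

def levenshtein(ref, hyp):
    """Word-level edit distance between a reference and a hypothesis string."""
    ref, hyp = ref.split(), hyp.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

print(levenshtein("open the pod bay doors", "open a pod bay door"))  # prints 2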

DEEP NEURAL NETWORK


A deep neural network (DNN) is an artificial neural network with multiple hidden layers of
units between the input and output layers.[6] Similar to shallow neural networks, DNNs can
model complex non-linear relationships. DNN architectures generate compositional models,
where extra layers enable composition of features from lower layers, giving a huge learning
capacity and thus the potential of modeling complex patterns of speech data.[6] The DNN is
the most popular type of deep learning architecture successfully used as an acoustic
model for speech recognition since 2010.
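
As a purely illustrative sketch (not the project's implementation), the forward pass of such a feed-forward acoustic model can be written with numpy as below: several hidden layers of rectified linear units followed by a softmax over the output classes. All dimensions and weights here are toy values.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def dnn_forward(frame, weights, biases):
    """Forward pass: hidden layers with ReLU, softmax output layer."""
    h = frame
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(np.dot(h, W) + b)
    return softmax(np.dot(h, weights[-1]) + biases[-1])

# Toy dimensions: a 440-dim spliced feature vector, 3 hidden layers, 2000 outputs.
dims = [440, 1024, 1024, 1024, 2000]
weights = [0.01 * np.random.randn(a, b) for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
posteriors = dnn_forward(np.random.randn(440), weights, biases)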

The success of DNNs in large vocabulary speech recognition came in 2010 from industrial researchers, in collaboration with academic researchers, where large output layers of the DNN based on context dependent HMM states constructed by decision trees were adopted.[7][8][9]

One fundamental principle of deep learning is to do away with hand-crafted feature


engineering and to use raw features. This principle was first explored successfully in the
architecture of deep autoencoder on the "raw" spectrogram or linear filter-bank features,[2]
showing its superiority over the Mel-Cepstral features which contain a few stages of fixed
transformation from spectrograms. The true "raw" features of speech, waveforms, have
more recently been shown to produce excellent larger-scale speech recognition results.[3]

Since the initial successful debut of DNNs for speech recognition around 2009-2011, huge new progress has been made. This progress (as well as future directions) has been summarized into the following eight major areas:[8]


Scaling up/out and speedup DNN training and decoding;

Sequence discriminative training of DNNs;

Feature processing by deep models with solid understanding of the underlying mechanisms;

Adaptation of DNNs and of related deep models;

Multi-task and transfer learning by DNNs and related deep models;

Convolutional neural networks and how to design them to best exploit domain knowledge of speech;

Recurrent neural network and its rich LSTM variants;

Other types of deep models including tensor-based models and integrated deep
generative/discriminative models.
Large-scale automatic speech recognition is the first and most convincing successful case of deep learning in recent history, embraced by both industry and academia across the board. Between 2010 and 2014, the two major conferences on signal processing and speech recognition, IEEE-ICASSP and Interspeech, saw near exponential growth in the number of accepted papers on the topic of deep learning for speech recognition. More importantly, all major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products, etc.) are nowadays based on deep learning methods.[5]


4.2 SETTING UP RASPBERRY PI


1.1. Connecting Everything Together
1. Plug the preloaded SD Card into the RPi.
2. Plug the USB keyboard and mouse into the RPi, perhaps via a USB hub. Connect the hub to power, if necessary.
3. Plug a video cable into the screen (TV or monitor) and into the RPi.
4. Plug your extras into the RPi (USB WiFi, Ethernet cable, external hard drive etc.). This is where you may really need a USB hub.
5. Ensure that your USB hub (if any) and screen are working.
6. Plug the power supply into the mains socket.
7. With your screen on, plug the power supply into the RPi microUSB socket.
8. The RPi should boot up and display messages on the screen.
It is always recommended to connect the MicroUSB power to the unit last (while most connections can be made live, it is best practice to connect items such as displays with the power turned off).

1.2. Operating System SD Card

As the RPi has no internal mass storage or built-in operating system, it requires an SD card preloaded with a version of the Linux Operating System.
• You can create your own preloaded card using any suitable SD card (4GBytes or above) you have to hand. We suggest you use a new blank card to avoid arguments over lost pictures.
• Preloaded SD cards will be available from the RPi Shop.
1.3. Keyboard & Mouse
Most standard USB keyboards and mice will work with the RPi. Wireless keyboard/mice should
also function, and only require a single USB port for an RF dongle. In order to use a Bluetooth
keyboard or mouse you will need a Bluetooth USB dongle, which again uses a single port.
Remember that the Model A has a single USB port and the Model B has two (typically a
keyboard and mouse will use a USB port each).


1.4. Display
There are two main connection options for the RPi display, HDMI (High Definition) and
Composite (Standard Definition).
• HD TVs and many LCD monitors can be connected using a full-size 'male' HDMI
cable, and with an inexpensive adaptor if DVI is used. HDMI versions 1.3 and 1.4 are
supported and a version 1.4 cable is recommended. The RPi outputs audio and video
via HDMI, but does not support HDMI input.
• Older TVs can be connected using Composite video (a yellow-to-yellow RCA cable) or
via SCART (using a Composite video to SCART adaptor). Both PAL and NTSC format
TVs are supported.
When using a composite video connection, audio is available from the 3.5mm jack socket,
and can be sent to your TV, headphones or an amplifier. To send audio to your TV, you will
need a cable which adapts from 3.5mm to double (red and white) RCA connectors.
Note: There is no analogue VGA output available. This is the connection required by
many computer monitors, apart from the latest ones. If you have a monitor with only a
D-shaped plug containing 15 pins, then it is unsuitable.
1.5. Power Supply
The unit is powered via the microUSB connector (only the power pins are connected, so
it will not transfer data over this connection). A standard modern phone charger with a
microUSB connector will do, providing it can supply at least 700mA at +5Vdc. Check
your power supply's ratings carefully. Suitable mains adaptors will be available from the
RPi Shop and are recommended if you are unsure what to use.
Note: The individual USB ports on a powered hub or a PC are usually rated to provide
500mA maximum. If you wish to use either of these as a power source then you will need a
special cable which plugs into two ports providing a combined current capability of 1000mA.

1.6. Cables
You will need one or more cables to connect up your RPi system.
• Video cable alternatives:
  o HDMI-A cable
  o HDMI-A cable + DVI adapter
  o Composite video cable
  o Composite video cable + SCART adaptor
• Audio cable (not needed if you use the HDMI video connection to a TV)
• Ethernet/LAN cable (Model B only)


1.7. Preparing your SD card for the Raspberry Pi

In order to use your Raspberry Pi, you will need to install an Operating System (OS) onto an SD card. An Operating System is the set of basic programs and utilities that allow your computer to run; examples include Windows on a PC or OSX on a Mac. These instructions will guide you through installing a recovery program on your SD card that will allow you to easily install different OS's and to recover your card if you break it.
1. Insert an SD card that is 4GB or greater in size into your computer.
2. Format the SD card so that the Pi can read it.
   a. Windows
      i. Download the SD Association's Formatting Tool from https://www.sdcard.org/downloads/formatter_4/eula_windows/
      ii. Install and run the Formatting Tool on your machine
      iii. Set the "FORMAT SIZE ADJUSTMENT" option to "ON" in the "Options" menu
      iv. Check that the SD card you inserted matches the one selected by the Tool
      v. Click the "Format" button
   b. Mac
      i. Download the SD Association's Formatting Tool from https://www.sdcard.org/downloads/formatter_4/eula_mac/
      ii. Install and run the Formatting Tool on your machine
      iii. Select "Overwrite Format"
      iv. Check that the SD card you inserted matches the one selected by the Tool
      v. Click the "Format" button
   c. Linux
      i. We recommend using gparted (or the command line version parted)
      ii. Format the entire disk as FAT
3. Download the New Out Of Box Software (NOOBS) from: downloads.raspberrypi.org/noobs
4. Unzip the downloaded file
   a. Windows: Right click on the file and choose "Extract all"
   b. Mac: Double tap on the file

   c. Linux: Run unzip [downloaded filename]
5. Copy the extracted files onto the SD card that you just formatted.
6. Insert the SD card into your Pi and connect the power supply.
7. You can also alternatively download the Raspbian image from https://raspberrypi.org

Your Pi will now boot into NOOBS and should display a list of operating systems that you can choose to install. If your display remains blank, you should select the correct output mode for your display by pressing one of the following number keys on your keyboard:
1. HDMI mode - this is the default display mode.
2. HDMI safe mode - select this mode if you are using the HDMI connector and cannot see anything on screen when the Pi has booted.
3. Composite PAL mode - select either this mode or composite NTSC mode if you are using the composite RCA video connector.
4. Composite NTSC mode.


4.3.DOWNLOADING OTHER SOFTWARE


1.CMU SPHINX
The CMU Sphinx toolkit has a number of packages for different tasks and applications. It is sometimes confusing what to choose. To clear things up, here is the list:

Pocketsphinx — recognizer library written in C.


Sphinxtrain — acoustic model training tools
Sphinxbase — support library required by Pocketsphinx and Sphinxtrain
Sphinx4 — adjustable, modifiable recognizer written in Java

We have chosen pocketsphinx.

To build PocketSphinx in a unix-like environment (such as Linux, Solaris, FreeBSD etc.) you need to make sure you have the following dependencies installed: gcc, automake, autoconf, libtool, bison, swig (at least version 2.0), the Python development package and the pulseaudio development package. If you want to build without dependencies you can use the proper configure options, like --without-swig-python, but for beginners it is recommended to install all dependencies.

You need to download both the sphinxbase and pocketsphinx packages and unpack them. Please note that you cannot use sphinxbase and pocketsphinx of different versions, so make sure that the versions are in sync. After unpacking you should see the following two main folders:

sphinxbase-X.X
pocketsphinx-X.x
On step one, build and install SphinxBase. Change current directory to sphinxbase
folder. If you downloaded directly from the repository, you need to do this at least once
to generate the configure file:


% ./autogen.sh
if you downloaded the release version, or ran autogen.sh at least once, then compile and install:

% ./configure
% make
% make install
The last step might require root permissions, so it might be sudo make install. If you want to use fixed-point arithmetic, you must configure SphinxBase with the --enable-fixed option. You can also set the installation prefix with --prefix. You can also configure with or without SWIG Python support.

The sphinxbase library will be installed in the /usr/local/ folder by default. Not every system loads libraries from this folder automatically. To load them you need to configure the path to look for shared libraries. It can be done either in the file /etc/ld.so.conf or by exporting environment variables:

export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

BUILDING LANGUAGE MODEL


There are several types of models that describe the language to recognize - keyword lists, grammars, statistical language models and phonetic statistical language models. You can choose any decoding mode according to your needs, and you can even switch between modes at runtime.

Keyword lists

Pocketsphinx supports a keyword spotting mode where you can specify the keyword list to look for. The advantage of this mode is that you can specify a threshold for each keyword so that the keyword can be detected in continuous speech. All other modes will try to detect the words from the grammar even if you used words which are not in the grammar. The keyword list looks like this:

oh mighty computer /1e-40/
hello world /1e-30/
other phrase /1e-20/
A threshold must be specified for every keyphrase. For a shorter keyphrase you can use a smaller threshold like 1e-1; for a longer keyphrase the threshold must be bigger. The threshold must be tuned to balance between false alarms and missed detections; the best way to tune the threshold is to use a prerecorded audio file. The tuning process is the following:

Take a long recording with a few occurrences of your keywords and some other sounds. You can take a movie soundtrack or something else. The length of the audio should be approximately 1 hour.
Run keyword spotting on that file with different thresholds for every keyword, using the following command:
pocketsphinx_continuous -infile <your_file.wav> -keyphrase <"your keyphrase"> -kws_threshold <your_threshold> -time yes
From the keyword spotting results, count how many false alarms and missed detections you have encountered.
Select the threshold with the smallest number of false alarms and missed detections.
For the best accuracy it is better to have a keyphrase with 3-4 syllables. Too short phrases are easily confused.

Keyword lists are supported by pocketsphinx only, not by sphinx4.
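
The threshold sweep described above can be scripted. The sketch below simply wraps the pocketsphinx_continuous command shown earlier; the file name and keyphrase are placeholders.

import subprocess

def run_kws(wav_file, keyphrase, threshold):
    # Wraps the tuning command from the procedure above.
    cmd = ["pocketsphinx_continuous",
           "-infile", wav_file,
           "-keyphrase", keyphrase,
           "-kws_threshold", str(threshold),
           "-time", "yes"]
    return subprocess.check_output(cmd)

# Sweep a few thresholds for one keyphrase and inspect the detections by hand.
for threshold in ("1e-20", "1e-30", "1e-40"):
    print("----- threshold %s -----" % threshold)
    print(run_kws("recording.wav", "hello world", threshold))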

Grammars

Grammars describe a very simple type of language for command and control, and they are usually written by hand or generated automatically within the code. Grammars usually do not have probabilities for word sequences, but some elements might be weighted. Grammars can be created in JSGF format and usually have an extension like .gram or .jsgf.

Grammars allow you to specify possible inputs very precisely, for example, that a certain word might be repeated only two or three times. However, this strictness might be harmful if your user accidentally skips the words which the grammar requires. In that case the whole recognition will fail. For that reason it is better to make grammars more relaxed: instead of listing phrases, just list the bag of words, allowing arbitrary order. Avoid very complex grammars with many rules and cases; they just slow down the recognizer, and you can use simple rules instead. In the past, grammars required a lot of effort to tune them, to assign variants properly and so on. The big VXML consulting industry was about that.


Language models

Statistical language models describe more complex language. They contain probabilities of words and word combinations. Those probabilities are estimated from sample data and automatically have some flexibility. For example, every combination from the vocabulary is possible, though the probability of each combination might vary. If you create a statistical language model from a list of words it will still allow you to decode word combinations, though that might not be your intent. Overall, statistical language models are recommended for free-form input where the user could say anything in a natural language, and they require far less engineering effort than grammars: you just list the possible sentences. For example, you might list numbers like “twenty one” and “thirty three” and the statistical language model will allow “thirty one” with a certain probability as well.

Overall, modern speech recognition interfaces tend to be more natural and avoid command-
and-control style of previous generation. For that reason most interface designers prefer natural
language recognition with statistical language model than old-fashioned VXML grammars.

On the topic of VUI interface design you might be interested in the following book: It's Better to Be a Good Machine Than a Bad Person: Speech Recognition and Other Exotic User Interfaces at the Twilight of the Jetsonian Age by Bruce Balentine.

There are many ways to build a statistical language model. When your data set is large, it makes sense to use the CMU language modeling toolkit. When the model is small, you can use a quick online web service. When you need specific options, or simply want to use your favourite toolkit that builds ARPA models, you can do that instead.

A language model can be stored and loaded in three different formats: text ARPA format, binary BIN format and binary DMP format. The ARPA format takes more space but can be edited; ARPA files have the .lm extension. The binary format takes significantly less space and loads faster; binary files have the .lm.bin extension. It is also possible to convert between formats. The DMP format is obsolete and not recommended.

Building a grammar

Grammars are usually written manually in JSGF format:

#JSGF V1.0;

/**
 * JSGF Grammar for Hello World example
 */
grammar hello;

public <greet> = ( good morning | hello ) ( bhiksha | evandro | paul | philip | rita | will );
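
For completeness, a minimal sketch of decoding against such a grammar through the pocketsphinx Python bindings is shown below; the file names and model paths are assumptions for illustration, not part of this project.

# Sketch: decoding with a JSGF grammar via the pocketsphinx Python bindings.
from pocketsphinx.pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'model/en-us/en-us')               # assumed acoustic-model path
config.set_string('-dict', 'model/en-us/cmudict-en-us.dict') # assumed dictionary path
config.set_string('-jsgf', 'hello.gram')                     # the grammar shown above
decoder = Decoder(config)

decoder.start_utt()
with open('goforward.raw', 'rb') as f:                       # 16 kHz, 16-bit mono raw audio
    decoder.process_raw(f.read(), False, True)               # whole utterance in one call
decoder.end_utt()

if decoder.hyp() is not None:
    print('Recognized:', decoder.hyp().hypstr)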

Building a Statistical Language Model

Text preparation

First of all you need to prepare a large collection of clean texts: expand abbreviations, convert numbers to words and remove non-word items. For example, to clean a Wikipedia XML dump you can use Python scripts such as https://github.com/attardi/wikiextractor. To clean HTML pages you can try http://code.google.com/p/boilerpipe, a nice package specifically created to extract text from HTML.

For an example of how to create a language model from Wikipedia texts, please see

http://trulymadlywordly.blogspot.ru/2011/03/creating-text-corpus-from-wikipedia.html

Once you have been through the language-model process, please submit your language model to the CMUSphinx project; we would be glad to share it!

Movie subtitles are a good source of spoken language.

Language modeling for many languages, like Mandarin, is largely the same as in English, with one additional consideration: the input text must be word-segmented. A segmentation tool and associated word list are provided to accomplish this.
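
As a toy illustration of the kind of cleaning meant here (not the exact scripts used for this project), a naive normalization routine in Python could look like this:

# Minimal text-normalization sketch: lowercase, expand single digits to words, and
# strip non-word characters before language-model training. Illustrative only.
import re

NUMBERS = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
           '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}

def normalize(line: str) -> str:
    line = line.lower()
    line = re.sub(r'\d', lambda m: ' ' + NUMBERS[m.group(0)] + ' ', line)  # digit -> word
    line = re.sub(r"[^a-z' ]+", ' ', line)    # drop punctuation and other non-word items
    return ' '.join(line.split())             # collapse whitespace

if __name__ == '__main__':
    print(normalize('Open browser at 9 AM!'))   # -> 'open browser at nine am'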

Using other Language Model Toolkits

There are many toolkits that create ARPA n-gram language model from text files.

Some toolkits you can try:

IRSTLM
MITLM
SRILM
If you are training a large-vocabulary speech recognition system, language-model training is outlined on a separate page, Building a large scale language model for domain-specific transcription.

Once you have created the ARPA file, you can convert the model to binary format if needed.

ARPA model training with SRILM

Training with SRILM is easy, which is why we recommend it. Moreover, SRILM is the most advanced toolkit to date. To train the model you can use the following command:

ngram-count -kndiscount -interpolate -text train-text.txt -lm your.lm

You can prune the model afterwards to reduce its size:

ngram -lm your.lm -prune 1e-8 -write-lm your-pruned.lm


After training, it is worth testing the perplexity of the model on held-out test data:

ngram -lm your.lm -ppl test-text.txt


ARPA model training with CMUCLMTK

You need to download and install cmuclmtk. See CMU Sphinx Downloads for details.

The process for creating a language model is as follows:

1) Prepare a reference text that will be used to generate the language model. The language-model toolkit expects its input to be in the form of normalized text files, with utterances delimited by <s> and </s> tags. A number of input filters are available for specific corpora such as Switchboard, ISL and NIST meetings, and HUB5 transcripts. The result should be a set of sentences bounded by the start and end sentence markers <s> and </s>. Here is an example:

<s> generally cloudy today with scattered outbreaks of rain and drizzle persistent and heavy at times </s>
<s> some dry intervals also with hazy sunshine especially in eastern parts in the morning </s>
<s> highest temperatures nine to thirteen Celsius in a light or moderate mainly east south east breeze </s>
<s> cloudy damp and misty today with spells of rain and drizzle in most places much of this rain will be light and patchy but heavier rain may develop in the west later </s>

More data will generate better language models. The weather.txt file from sphinx4 (used to generate the weather language model) contains nearly 100,000 sentences.
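
A small helper can add these markers automatically; the sketch below is only illustrative and assumes the input and output file names.

# Sketch: wrap each normalized sentence with the <s> ... </s> markers that the
# CMUCLMTK tools expect. File names are assumptions.
def add_sentence_markers(infile: str = 'corpus.txt', outfile: str = 'weather.closed.txt') -> None:
    with open(infile) as src, open(outfile, 'w') as dst:
        for line in src:
            line = line.strip()
            if line:                                # skip empty lines
                dst.write('<s> ' + line + ' </s>\n')

if __name__ == '__main__':
    add_sentence_markers()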

2) Generate the vocabulary file. This is a list of all the words in the file:

text2wfreq < weather.txt | wfreq2vocab > weather.tmp.vocab

3) You may want to edit the vocabulary file to remove words (numbers, misspellings,
names). If you find misspellings, it is a good idea to fix them in the input transcript.

4) If you want a closed vocabulary language model (a language model that has no
provisions for unknown words), then you should remove sentences from your input
transcript that contain words that are not in your vocabulary file.

5) Generate the arpa format language model with the commands:

% text2idngram -vocab weather.vocab -idngram weather.idngram < weather.closed.txt


% idngram2lm -vocab_type 0 -idngram weather.idngram -vocab \
weather.vocab -arpa weather.lm
6) Generate the CMU binary form (BIN):

sphinx_lm_convert -i weather.lm -o weather.lm.bin


The CMUCLMTK tools and commands are documented on The CMU-Cambridge Language Modeling Toolkit page.

Building a simple language model using web service

If your language is English and the corpus is small, it is sometimes more convenient to use a web service to build the model. Language models built this way are quite functional for simple command-and-control tasks. First of all you need to create a corpus.

The “corpus” is just a list of sentences that you will use to train the language model. As an
example, we will use a hypothetical voice control task for a mobile Internet device. We'd like to
tell it things like “open browser”, “new e-mail”, “forward”, “backward”, “next window”, “last
window”, “open music player”, and so forth. So, we'll start by creating a file called corpus.txt:

open browser
new e-mail
forward
backward
next window
last window

open music player


Then go to the page http://www.speech.cs.cmu.edu/tools/lmtool-new.html. Simply click
on the “Browse…” button, select the corpus.txt file you created, then click “COMPILE
KNOWLEDGE BASE”.

The legacy version is also still available online, here:


http://www.speech.cs.cmu.edu/tools/lmtool.html

You should see a page with some status messages, followed by a page entitled “Sphinx
knowledge base”. This page will contain links entitled “Dictionary” and “Language
Model”. Download these files and make a note of their names (they should consist of a
4-digit number followed by the extensions .dic and .lm). You can now test your newly
created language model with PocketSphinx.

Converting model into binary format



To load large models quickly you will probably want to convert them to binary format, which saves decoder initialization time; this is not necessary for small models. Pocketsphinx and sphinx3 can handle both formats with the -lm option. Sphinx4 automatically detects the format from the extension of the LM file.

The ARPA and binary formats are mutually convertible. You can produce one from the other with the sphinx_lm_convert command from sphinxbase:

sphinx_lm_convert -i model.lm -o model.lm.bin
sphinx_lm_convert -i model.lm.bin -ifmt bin -o model.lm -ofmt arpa

You can also convert old DMP models to the bin format this way.

Using your language model

This section will show you how to use, test, and improve the language model you created.

Using your language model with PocketSphinx

If you have installed PocketSphinx, you will have a program called pocketsphinx_continuous which can be run from the command line to recognize speech. Assuming it is installed under /usr/local, and your language model and dictionary are called 8521.dic and 8521.lm and placed in the current folder, try running the following command:

pocketsphinx_continuous -inmic yes -lm 8521.lm -dict 8521.dic


This will use your new language model and dictionary together with the default acoustic model. On Windows you also have to specify the acoustic-model folder with the -hmm option:

bin/Release/pocketsphinx_continuous.exe -inmic yes -lm 8521.lm -dict 8521.dic -hmm model/en-us/en-us
You will see a lot of diagnostic messages, followed by a pause, then “READY…”. Now
you can try speaking some of the commands. It should be able to recognize them with
complete accuracy. If not, you may have problems with your microphone or sound card.
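
The same model can also be loaded from Python. A minimal sketch, assuming the pocketsphinx-python package and the 8521.lm / 8521.dic files generated above:

# Sketch: microphone decoding with the custom language model (assumed file names).
from pocketsphinx import LiveSpeech

speech = LiveSpeech(lm='8521.lm', dic='8521.dic')  # the default acoustic model is used
for phrase in speech:                              # blocks, yielding one utterance at a time
    print('Heard:', phrase)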


Using your language model with Sphinx4

In the Sphinx4 high-level API you specify the location of the language model in the Configuration:

configuration.setLanguageModelPath("file:8754.lm");
If the model is packaged in the application resources, you can reference it with a resource: URL:

configuration.setLanguageModelPath("resource:/com/example/8754.lm");

GENERATING THE DICTIONARY

There are various tools to help you extend an existing dictionary with new words or to build a new dictionary from scratch. If your language already has a dictionary, it is recommended to use it, since it will have been carefully tuned for best performance. If you are starting a new language you need to account for various reduction and coarticulation effects, which make it very hard to create accurate rules for converting text to sounds. However, practice shows that even naive conversion can produce good results for speech recognition; for example, many developers have successfully created ASR systems with simple grapheme-based dictionaries where each letter is just mapped to itself, not to the corresponding phone.

For most languages you need specialized grapheme-to-phoneme (g2p) code that learns the conversion from an existing small database using machine-learning methods. Nowadays the most accurate g2p tools are Phonetisaurus:

http://code.google.com/p/phonetisaurus

And sequitur-g2p:

http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html

Also note that almost every TTS package includes G2P code. For example, you can use the g2p code from FreeTTS, written in Java:

http://cmusphinx.sourceforge.net/projects/freetts

See FreeTTS example in Sphinx4 here

OpenMary Java TTS:

http://mary.dfki.de/

or espeak for C:

http://espeak.sourceforge.net

Please note that if you use TTS tools you often need to do a phoneset conversion, since TTS phonesets are usually more extensive than what ASR requires. On the other hand, TTS tools have a great advantage in that they usually contain more of the required functionality than simple G2P; for example, they perform tokenization by converting numbers and abbreviations to their spoken form.

For English you can use a simpler option, the online web service:

http://www.speech.cs.cmu.edu/tools/lmtool.html

The online LM Tool produces a dictionary that matches its language model. It uses the latest CMU dictionary as a base and is programmed to guess the pronunciations of words not in the existing dictionary. You can look at the log file to find which words were guessed, and make your own corrections if necessary. With the advanced option, LM Tool can use a hand-made dictionary that you supply for your specialized vocabulary or for your own pronunciation corrections; the hand dictionary must be in the same format as the main dictionary.

If you want to run lmtool offline, you can check it out from subversion:

http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/trunk/logios

2. TEXT TO SPEECH

eSpeak is a compact open-source speech synthesizer available for many platforms. Speech synthesis is done offline, but most voices sound rather “robotic”.

Festival is based on the Festival Speech Synthesis System, an open-source speech synthesizer developed by the Centre for Speech Technology Research at the University of Edinburgh. Like eSpeak, it also synthesizes speech offline.

The initial voice in our system was eSpeak; it was later changed to Festival, installed with:

sudo apt-get update
sudo apt-get install festival festvox-don
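
In the Python script, synthesized speech can be produced by piping text to Festival. The snippet below is a minimal sketch of this idea, not the exact code of the project:

# Sketch: speak a response by piping text to the festival command-line tool.
import subprocess

def speak(text: str) -> None:
    # 'festival --tts' reads text from stdin and plays it through the default audio device
    subprocess.run(['festival', '--tts'], input=text.encode('utf-8'), check=True)

if __name__ == '__main__':
    speak('Hello, how can I help you?')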


4.4. SETTING UP LIRC

First, we need to install and configure LIRC to run on the Raspberry Pi:

sudo apt-get install lirc

Second, you have to modify two files before you can start testing the receiver and the IR LED. Add this to your /etc/modules file:

lirc_dev
lirc_rpi gpio_in_pin=23 gpio_out_pin=22

Change your /etc/lirc/hardware.conf file to:

########################################################
# /etc/lirc/hardware.conf
#
# Arguments which will be used when launching lircd
LIRCD_ARGS="--uinput"

# Don't start lircmd even if there seems to be a good config file
# START_LIRCMD=false

# Don't start irexec, even if a good config file seems to exist.
# START_IREXEC=false

# Try to load appropriate kernel modules
LOAD_MODULES=true

# Run "lircd --driver=help" for a list of supported drivers.
DRIVER="default"

# usually /dev/lirc0 is the correct setting for systems using udev
DEVICE="/dev/lirc0"
MODULES="lirc_rpi"

# Default configuration files for your hardware if any
LIRCD_CONF=""
LIRCMD_CONF=""
########################################################

Now restart lircd so it picks up these changes:

sudo /etc/init.d/lirc stop

sudo /etc/init.d/lirc start

Edit your /boot/config.txt file and add:

dtoverlay=lirc-rpi,gpio_in_pin=23,gpio_out_pin=22


Now, connect the circuit.

Fig 11. Schematic

Testing the IR receiver


Testing the IR receiver is relatively straightforward.

Run these two commands to stop lircd and start outputting raw data from the IR receiver:

sudo /etc/init.d/lirc stop

mode2 -d /dev/lirc0

Point a remote control at your IR receiver and press some buttons. You should see
something like this:

space 16300

pulse 95

space 28794

pulse 80

space 19395

When you use irrecord, it will ask you to name the buttons you are programming as you program them. Be sure to run irrecord --list-namespace to see the valid names before you begin.

Here are the commands we ran to generate a remote configuration file:

# Stop lirc to free up /dev/lirc0

sudo /etc/init.d/lirc stop

# Create a new remote control configuration file (using /dev/lirc0) and save the output to
~/lircd.conf
irrecord -d /dev/lirc0 ~/lircd.conf

# Make a backup of the original lircd.conf file


sudo mv /etc/lirc/lircd.conf /etc/lirc/lircd_original.conf

# Copy over your new configuration file

sudo cp ~/lircd.conf /etc/lirc/lircd.conf

# Start up lirc again


sudo /etc/init.d/lirc start
Once you have completed a remote configuration file and added it to /etc/lirc/lircd.conf, you can try testing the IR LED. We use the irsend application that comes with LIRC to send commands; check its documentation to learn more about the options irsend offers.

Here are the commands we ran to test the IR LED (using the “tatasky” remote configuration file we created):

# List all of the commands that LIRC knows for 'tatasky'
irsend LIST tatasky ""

# Send the KEY_POWER command once

irsend SEND_ONCE tatasky KEY_POWER

# Send the KEY_VOLUMEDOWN command once

irsend SEND_ONCE tatasky KEY_VOLUMEDOWN

The last step is to connect this module to the Python program.
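
A thin wrapper around irsend is enough for that. The sketch below is illustrative only; the remote name 'tatasky' and the key names come from the configuration created above, while the function name is our own:

# Sketch: calling irsend from the Python program to control the set-top box.
import subprocess

def send_ir(command: str, remote: str = 'tatasky') -> None:
    # irsend SEND_ONCE <remote> <command>, exactly as on the command line
    subprocess.run(['irsend', 'SEND_ONCE', remote, command], check=True)

# Example usage inside a voice-command handler:
# send_ir('KEY_POWER')        # toggle power
# send_ir('KEY_VOLUMEDOWN')   # lower volume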

4.5 WORKING OF IR RECEIVER AND TRANSMITTER


An IR LED, also known as an IR transmitter, is a special-purpose LED that emits infrared light at wavelengths around 760 nm. Such LEDs are usually made of gallium arsenide or aluminium gallium arsenide. Together with IR receivers, they are commonly used as sensors.

The emitter is simply an IR LED (Light Emitting Diode) and the detector is simply an IR photodiode that is sensitive to IR light of the same wavelength as that emitted by the IR LED. When IR light falls on the photodiode, its resistance and, correspondingly, its output voltage change in proportion to the magnitude of the IR light received. This is the underlying working principle of the IR sensor.


4.6 FLOW CHART OF PROGRAM

Fig 12 Flowchart

The flowchart of the Python script is shown in Fig 12. The voice input is first checked to see whether it is the keyword; the system then sends a high beep through the audio output to indicate that the microphone is actively listening. The next voice input is compared with the configured commands and the corresponding function is called.


4.7 BLOCK DIAGRAM

Here we use CMU Sphinx together with the jasper-client “brain”, which implements the command-interpretation logic.

Python modules are written for the various functions.

First the configured keyword is spoken; a high beep is heard, which means Jasper is listening.

The command is then given; it is decoded against the pocketsphinx dictionary using HMM computation.

When a match is found with the words listed in a module, the appropriate function is executed. This can be playing a song or a video, reading a book, changing the TV channel or playing a quiz game.

The song and video database can also hold songs in any regional language. The output of the system is heard through the speakers or earphones.
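
As an illustration of how such a module can look (assuming the usual WORDS / isValid / handle convention of jasper-client; the media-player call and file path are placeholders, not the project's actual code):

# Sketch of a Jasper-style module that plays a song on request.
import re
import subprocess

WORDS = ["SONG"]          # words added to the vocabulary for this module

def isValid(text):
    # Jasper calls this to ask whether this module should handle the utterance
    return bool(re.search(r'\bsong\b', text, re.IGNORECASE))

def handle(text, mic, profile):
    # mic.say() speaks the response through the configured TTS engine
    mic.say("Playing your song now")
    subprocess.Popen(['omxplayer', '/home/pi/music/song1.mp3'])  # assumed media path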

Fig 13.Block Diagram of System


CHAPTER 5

FURTHER ENHANCEMENTS

1.RECOGNITION WITHOUT INTERNET ACCESS


We are well aware that internet access is not available throughout our country. Currently, India is nowhere near meeting the target for a service that is considered almost a basic necessity in many developed countries.

In such cases this project may not function, so we are enhancing it to work even without internet access by using offline recognition toolkits such as CMU Sphinx.

2. GSM Module for voice activated calling

The Raspberry Pi SIM800 GSM/GPRS Add-on V2.0 is customized for the Raspberry Pi interface and is based on the SIM800 quad-band GSM/GPRS/BT module. AT commands can be sent via the serial port of the Raspberry Pi, so functions such as dialling and answering calls, sending and receiving messages, and surfing online can be realized. Moreover, the module supports powering on and resetting via software.
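
For example, a voice-dialling function could send the ATD command over the serial port with pyserial. This is only a sketch under assumed settings (serial device /dev/ttyAMA0, 115200 baud, placeholder phone number):

# Sketch: dialling a voice call on the SIM800 add-on by sending AT commands with pyserial.
import time
import serial

def dial(number: str, port: str = '/dev/ttyAMA0') -> None:
    with serial.Serial(port, baudrate=115200, timeout=1) as ser:
        ser.write(b'AT\r\n')                            # check that the modem responds
        time.sleep(0.5)
        ser.write(('ATD' + number + ';\r\n').encode())  # ATD<number>; starts a voice call
        time.sleep(0.5)
        print(ser.read(ser.in_waiting or 64).decode(errors='ignore'))

# dial('+911234567890')   # example usage with a placeholder number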

Fig.14 GSM Quadband 800A


3. HOME AUTOMATION
With the right level of ingenuity, the sky is the limit on what you can automate in your home, but here are a few basic categories of tasks you can pursue:

Automate your lights to turn on and off on a schedule, remotely, or when certain conditions are triggered.

Set your air conditioner to keep the house temperate when you are home and to save energy while you are away.

Fig.15 Home automation possibilities


CHAPTER 6

APPLICATIONS
Usage in education and daily life
Speech recognition can be useful for learning a second language: it can teach proper pronunciation, in addition to helping a person develop fluency in their speaking skills.[6]

Students who are blind (see Blindness and education) or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as use a computer by commanding it with their voice, instead of having to look at the screen and keyboard.[6]

Aerospace (e.g. space exploration, spacecraft, etc.): NASA's Mars Polar Lander used speech recognition technology from Sensory, Inc. in the Mars Microphone on the lander.[7]

Automatic subtitling with speech recognition[7]

Automatic translation

Court reporting (Realtime Speech Writing)

Telephony and other domains


ASR is now commonplace in the field of telephony and is becoming more widespread in computer gaming and simulation. However, despite its high level of integration with word processing in general personal computing, ASR in the field of document production has not seen the expected increase in use.

The improvement in mobile processor speeds has made speech-enabled Symbian and Windows Mobile smartphones feasible. Speech is used mostly as part of the user interface, for creating

predefined or custom speech commands. Leading software vendors in this field are:
Google, Microsoft Corporation (Microsoft Voice Command), Digital Syphon (Sonic
Extractor), LumenVox, Nuance Communications (Nuance Voice Control), VoiceBox
Technology, Speech Technology Center, Vito Technologies (VITO Voice2Go), Speereo
Software (Speereo Voice Translator), Verbyx VRX and SVOX.

In Car systems
Typically a manual control input, for example by means of a finger control on the
steering-wheel, enables the speech recognition system and this is signalled to the driver
by an audio prompt. Following the audio prompt, the system has a "listening window"
during which it may accept a speech input for recognition.

Simple voice commands may be used to initiate phone calls, select radio stations or play
music from a compatible smartphone, MP3 player or music-loaded flash drive. Voice
recognition capabilities vary between car make and model. Some of the most recent car
models offer natural-language speech recognition in place of a fixed set of commands, allowing the driver to use full sentences and common phrases. With such systems there is,
therefore, no need for the user to memorize a set of fixed command words.

Fig 16.Car automation


Helicopters
The problems of achieving high recognition accuracy under stress and noise pertain strongly to
the helicopter environment as well as to the jet fighter environment. The acoustic noise problem
is actually more severe in the helicopter environment, not only because of the high noise levels
but also because the helicopter pilot, in general, does not wear a facemask, which would reduce
acoustic noise in the microphone. Substantial test and evaluation programs have been carried
out in the past decade in speech recognition systems applications in helicopters, notably by the
U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal
Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in
the Puma helicopter. There has also been much useful work in Canada. Results have been
encouraging, and voice applications have included: control of communication radios, setting of
navigation systems, and control of an automated target handover system.

As in fighter applications, the overriding issue for voice in helicopters is the impact on
pilot effectiveness. Encouraging results are reported for the AVRADA tests, although
these represent only a feasibility demonstration in a test environment. Much remains to
be done both in speech recognition and in overall speech technology in order to
consistently achieve performance improvements in operational settings.

High-performance fighter aircraft


Substantial efforts have been devoted in the last decade to the test and evaluation of
speech recognition in fighter aircraft. Of particular note is the U.S. program in speech
recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16
VISTA), and a program in France installing speech recognition systems on Mirage aircraft,
and also programs in the UK dealing with a variety of aircraft platforms. In these programs,
speech recognizers have been operated successfully in fighter aircraft, with applications
including: setting radio frequencies, commanding an autopilot system, setting steer-point
coordinates and weapons release parameters, and controlling flight display.

REFERENCES
[1] D. Yu and L. Deng, "Automatic Speech Recognition: A Deep Learning Approach", Springer, 2014.

[2] Claudio Becchetti and Lucio Prina Ricotti, "Speech Recognition: Theory and C++ Implementation", 2008 edition.

[3] Reynolds, Douglas; Rose, Richard (January 1995). "Robust text-independent speaker identification using Gaussian mixture speaker models" (PDF). IEEE Transactions on Speech and Audio Processing (IEEE) 3 (1): 72-83. doi:10.1109/89.365379. ISSN 1063-6676. OCLC 26108901. Retrieved 21 February 2014.

[4] Waibel, Hanazawa, Hinton, Shikano, Lang (1989). "Phoneme recognition using time-delay neural networks". IEEE Transactions on Acoustics, Speech and Signal Processing.

[5] Microsoft Research. "Speaker Identification (WhisperID)". Microsoft. Retrieved 21 February 2014.

[6] "Low Cost Home Automation Using Offline Speech Recognition", International Journal of Signal Processing Systems, vol. 2, no. 2, pp. 96-101, 2014.

[7] Juang, B. H.; Rabiner, Lawrence R. "Automatic speech recognition: a brief history of the technology development" (PDF). p. 6. Retrieved 17 January 2015.

[8] Deng, L.; Li, Xiao (2013). "Machine Learning Paradigms for Speech Recognition: An Overview". IEEE Transactions on Audio, Speech, and Language Processing.

[9] P. V. Hajar and A. Andurkar, "Review Paper on System for Voice and Facial Recognition using Raspberry Pi", International Journal of Advanced Research in Computer and Communication Engineering, vol. 4, no. 4, pp. 232-234, 2015.

[10] "Common Health Risks of the Bedridden Patient", Carefect Blog Team, posted 24 October 2013.
