
Human-Robot Communication

Supervisor: Prof. Nejat

Biomechatronics Lab

Progress Report
Ray Zhao

996229059

November 18, 2010

Motivation
Since the turn of the millennium, the number of elderly people in need of care has been increasing dramatically as the baby-boomer generation ages. Furthermore, the nation is facing an explosion of costs in the health care sector. The dramatic increase of the elderly population, along with these rising costs, poses an extreme challenge to society. Current practices for providing care to the elderly population are already insufficient, and the situation is likely to worsen over the next few decades [1].

Fortunately, developing robotic technology can help address this problem. Socially assistive robots will be able to remind seniors to take their medicine at the right time (as many of them may suffer from dementia), safeguard patients with physical impairments, and so on. They can also communicate with seniors who are forced to live alone, in order to relieve their stress, as social engagement can significantly delay deterioration and health-related problems [1].
Background
Human-Robot Interaction
The field of human-robot interaction (HRI) addresses the design, understanding, and evaluation of robotic systems in which humans and robots interact through communication [1].

Three laws capture the idea of safe interaction:
1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
2. A robot must obey any orders given to it by human beings, except where such orders would conflict with the First Law.
3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law [2].

Although robots have been used for industrial purposes for decades, the design of autonomous robots has only recently begun. An autonomous robot can identify and track a user's position, respond to spoken questions, display text or spatial information, and travel on command while avoiding obstacles.

Autonomous robots today are mainly built for search and rescue (SAR) tasks; however, other applications include entertainment, education, field robotics, home and companion robotics, hospitality, rehabilitation, and elder care. Extensive research to increase the capabilities and performance of such robots is still ongoing.

Socially Assistive Robotics


Socially assistive robots are a new generation of robots capable of moving and acting in human-centered environments, interacting with people, and participating in our daily lives. Their emergence has introduced the need for robotic systems that can learn to use their bodies to communicate and to react to their users in a socially engaging way.

Speech Recognition
Speech recognition is the process of converting an acoustic signal, captured by a microphone or telephone, into text. The recognized words can be the final result, as in applications involving command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing to achieve speech understanding and to provide output responses.

The term “voice recognition” is sometimes used to refer to recognition systems that must be trained for a particular speaker, which simplifies the task of translating speech. However, the uses of such systems are limited by this very requirement, and they will therefore not be considered further.

As alluded to above, speech recognition is a complicated problem, largely because of the many sources of variability associated with the signal. First, the acoustic realization of phonemes is highly dependent on the context in which they appear. Such phonetic variability is exemplified by the acoustic differences between the phoneme /k/ in “kit” and in “skill” in American English.

Second, acoustic variability can result from changes in the environment as well as in
the position and characteristics of the transducer.

Third, within-speaker variability can result from the speaker's physical and emotional state, speaking rate, or voice quality.

Finally, differences in sociolinguistic background, dialect, and vocal tract size and
shape can contribute to across-speaker variability [4].

How It Works
1. The engine loads a list of words to be recognized; this list of words is called a grammar.
2. Audio from a speaker is captured by a microphone or telephone. This audio is turned into a waveform, a mathematical representation of sound.
3. The engine looks at features (distinct characteristics of the sound) derived from the waveform and compares them with its own acoustic model. The engine searches its acoustic space, using the grammar to guide this search.
4. It then determines which word in the grammar most closely matches the input and returns the result [6].
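
As a rough schematic of these four steps (a toy sketch only, not the project's actual engine), the C++ fragment below treats the grammar as a plain word list and the acoustic model as one feature template per word; real recognizers use HMM-based search, so every type and value here is an assumption made for illustration.

// Toy illustration of steps 1-4: a "grammar" is a word list, and each
// word has one feature template standing in for a real acoustic model.
#include <iostream>
#include <map>
#include <string>
#include <vector>

typedef std::vector<double> Features; // stand-in for real acoustic features

// Placeholder for acoustic-model scoring: higher means a closer match.
double score(const Features& input, const Features& model) {
    double s = 0.0;
    for (size_t i = 0; i < input.size() && i < model.size(); ++i)
        s -= (input[i] - model[i]) * (input[i] - model[i]);
    return s;
}

// Step 4: pick the grammar word whose model best matches the input.
std::string recognize(const Features& input,
                      const std::map<std::string, Features>& grammar) {
    std::string best;
    double bestScore = -1e300;
    for (std::map<std::string, Features>::const_iterator it = grammar.begin();
         it != grammar.end(); ++it) {
        double s = score(input, it->second);
        if (s > bestScore) { bestScore = s; best = it->first; }
    }
    return best;
}

int main() {
    std::map<std::string, Features> grammar;             // step 1: load grammar
    grammar["YES"] = Features(3, 1.0);
    grammar["NO"]  = Features(3, -1.0);
    Features input(3, 0.9);                              // steps 2-3: features
    std::cout << recognize(input, grammar) << std::endl; // prints "YES"
    return 0;
}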

Acoustic Model
An acoustic model is created by taking audio recordings of speech and their transcriptions, and compiling them into a statistical representation of the sounds that make up each word, through a process called “training”.
Language Model
A language model tries to capture the properties of a language and to predict the next word. For example, a noun usually follows a verb in a structured phrase (e.g., “eat dinner”); the language model will therefore predict that, after a verb, the next word is most likely a noun, and will try to “recognize” the input phrase accordingly.
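
One standard way such prediction is made precise (a general textbook formulation, not something specific to this project) is the bigram model, which estimates the probability of each word from the preceding word:

P(w2 | w1) = count(w1 w2) / count(w1)

For example, if “eat” occurs 1,000 times in the training text and “eat dinner” occurs 50 times, then P(dinner | eat) = 50/1000 = 0.05, so the recognizer can prefer “eat dinner” over an acoustically similar but less probable word sequence.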

Robot Speech Synthesis


Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer. In this project, the Microsoft Speech API (SAPI) will be used to produce the robot's spoken responses [5].
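
A minimal sketch of how SAPI can be driven from C++ is shown below, based on the standard SAPI “hello world” pattern; the spoken phrase and the rate/volume values are placeholders, not the project's actual settings.

// Minimal SAPI sketch: speak one phrase through the default voice.
// Requires the Windows SDK; error handling is kept to a minimum.
#include <windows.h>
#include <sapi.h>

int main() {
    ISpVoice* pVoice = NULL;
    if (FAILED(::CoInitialize(NULL))) return 1;
    HRESULT hr = CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                                  IID_ISpVoice, (void**)&pVoice);
    if (SUCCEEDED(hr)) {
        pVoice->SetRate(0);       // speaking rate, range -10 to 10
        pVoice->SetVolume(100);   // volume, range 0 to 100
        hr = pVoice->Speak(L"Hello, my name is Brian.", 0, NULL);
        pVoice->Release();
    }
    ::CoUninitialize();
    return SUCCEEDED(hr) ? 0 : 1;
}

The same interface exposes the rate and volume controls needed for the “vary voice in pitch, speed and volume” objective listed later; pitch is controlled through SAPI XML tags embedded in the spoken string.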

Performance
The performance of speech recognition systems is usually specified in terms of
accuracy and speed. Accuracy is usually rated with word error rate (WER), whereas
speed is measured with the real time factor [3].
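
Concretely, WER is computed by aligning the recognizer's output with a reference transcript:

WER = (S + D + I) / N

where S, D, and I are the numbers of substituted, deleted, and inserted words, and N is the number of words in the reference. For example, a 10-word sentence recognized with one substitution and one deletion (and no insertions) gives WER = (1 + 1 + 0) / 10 = 20%. The real time factor is the processing time divided by the duration of the input audio, so a factor at or below 1 means the recognizer keeps up with live speech.
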
Current Issues
Robustness
In a robust system, performance degrades gracefully (rather than catastrophically) as
conditions become more different from those under which it was trained. Differences
in channel characteristics and acoustic environment should receive particular
attention.

Portability
Portability refers to the goal of rapidly designing, developing and deploying systems
for new applications. At present, systems tend to suffer significant degradation when
moved to a new task. In order to return to peak performance, they must be trained on
examples specific to the new task, which is time consuming and expensive.

Adaptation
How can systems continuously adapt to changing conditions (new speakers, microphones, tasks, etc.) and improve through use? Such adaptation can occur at many levels in a system: subword models, word pronunciations, language models, etc.

Language Modelling
Current systems use statistical language models to help reduce the search space and
resolve acoustic ambiguity. As vocabulary size grows and other constraints are
relaxed to create more habitable systems, it will be increasingly important to get as
much constraint as possible from language models; perhaps incorporating
syntactic and semantic constraints that cannot be captured by purely statistical
models.

Confidence Measures
Most speech recognition systems assign scores to hypotheses for the purpose of rank
ordering them. These scores do not provide a good indication of whether a hypothesis
is correct or not, just that it is better than the other hypotheses. As we move to tasks
that require actions, we need better methods to evaluate the absolute correctness of
hypotheses.

Out-of-Vocabulary-Words
Systems are designed for use with a particular set of words, but system users may not
know exactly which words are in the system vocabulary. This leads to a certain
percentage of out-of-vocabulary words in natural conditions. Systems must have some
method of detecting such out-of-vocabulary words, or they will end up mapping a
word from the vocabulary onto the unknown word, causing an error.

Prosody
Prosody refers to acoustic structure that extends over several segments or words.
Stress, intonation, and rhythm convey important information for word recognition and
the user's intentions (e.g., sarcasm, anger). Current systems do not capture prosodic
structure. How to integrate prosodic information into the recognition architecture is a
critical question that has not yet been answered.

Modelling Dynamics
Systems assume a sequence of input frames which are treated as if they were
independent. But it is known that perceptual cues for words and phonemes require the
integration of features that reflect the movements of the articulators, which are
dynamic in nature. How to model dynamics and incorporate this information into
recognition systems is an unsolved problem [4].

Software
Julius
The software that will be implemented for this project is Julius. Julius is a high-performance, two-pass, large-vocabulary continuous speech recognition decoder for speech-related researchers and developers. Based on word 3-gram and context-dependent HMM, it can perform real-time decoding of a 60k-word dictation task on most current PCs. It is also speaker-independent, meaning the software can decode input phrases regardless of who the speaker is [3].

Online RSS Information Feeds


In this project, RSS feed resources will be used to extract information from the
internet, and output as a response to a given question. RSS (Really Simple
Syndication) is a family of web feed formats used to publish frequently updated
works—such as blog entries, news headlines, audio, and video—in a standardized
format. An RSS document includes full or summarized text, plus metadata such as
publishing dates and authorship. Web feeds benefit publishers by letting them
syndicate content automatically. They benefit readers who want to subscribe to timely
updates from favored websites or to aggregate feeds from many sites into one place.
RSS feeds can be read using software called an "RSS reader", "feed reader", or
"aggregator", which can be web-based, desktop-based, or mobile-device-based [4]. The
recommended RSS reader would be Google reader, which is the most popular of its
competitors [7].

Objective
In this project, the focus will be on developing natural conversational capabilities for
a socially assistive robot in the following scenarios:
(i) Social Conversation
Example:
Human: Hello.
Robot: Hello, how are you doing?
Human: I’m doing great.
Robot: Nice to hear that. My name is Brian; what is your name?
Human: My name is John.
Robot: Nice to meet you, John.

(ii) Information extraction


Examples:
Human: Brian, what is the time?
Robot: It is 9 p.m.
Human: What is the date today?
Robot: Today is October 13, 2010.
Human: What is the current weather like?
Robot: Overcast skies with a chance of showers.

Information extraction includes retrieving information from a preloaded database or from a dynamic source. RSS feed extraction will be the main focus of the two: useful RSS feed websites will be used as reliable resources, and the information they display will be used as the responses to given input phrases.

To further define the scope of the project, the following tasks (including resolving the current issues listed in the Background section) need to be considered:
- Create a context-specific database of recognized words and sentence structures
- Decipher “meaning” from recognized words and/or phrases.
- Filter out noise and “out of vocabulary” words
- Detect uncertainty in speech recognition
- Communicate in real-time with main robot controller

Also, for voice synthesis, it is desirable to achieve the functionalities listed below:


- Have the ability to confirm input phrases when there is uncertainty, by repeating the spoken phrase and asking if it is correct
- Have the ability to ask the person to repeat what they said if words/phrases are not fully recognized
- Vary the voice in pitch, speed and volume
- Have an expandable database of responses
- Have the ability to pick a phrase from the database and say it
- Coordinate robot mouth movements (and maybe body movements and facial
expressions) with speech.

By the end of the working term, the robot should be able to accurately understand certain sets of context-specific sentences and respond appropriately when there is uncertainty.

Methodology
Obtain a phrase database
The phrases that will be programmed into the system are listed in Appendix I and fall into six categories:
1) General phrases such as “yes” or “no”.
2) Greetings
3) Pre-programmed database
The database contains information about a specific user, including the person's birthday, the name of his doctor, etc.
4) Dynamic database tracking
This will allow the robot to search the World Wide Web (RSS feed websites) and output the most up-to-date results to the user. Questions such as “What is the current temperature outside?” and “What is today's date?” can be answered. The motivation for creating such a database is to help the elderly remember things, and the questions are inspired by the MMSE test (Appendix II).
5) Encyclopaedia-type questions
This is a proposed idea that would enable the robot to answer common-sense questions or more sophisticated ones, depending on how much effort is put into the design.
6) Commands
This will allow the robot to perform actions for a given input, and is closely tied to the physical capabilities of the robot.

Creating a sentence database


Creating a context-specific database of recognized words and sentence structures is one of the main goals of the project, and can be achieved by using the Julius software as described in Appendix III.

In Julian (the grammar-based recognition mode of Julius), the recognition grammar is given in two separate files: the “.grammar” file and the “.voca” file. The “.grammar” file defines the category-level syntax, i.e., the allowed connections of words by their category names. The “.voca” file defines the word candidates in each category, together with their pronunciation information.

The allowed connections between words should be defined in the “.grammar” file, using word category names as terminal symbols.

An example grammar for the question “What year was I born?” is given in Appendix III. The initial sentence symbol should be “S”. Each rewrite rule should be defined on its own line, using “:” as the delimiter. ASCII letters, numbers, and underscores are allowed in symbol names, and names are case sensitive.

Terminal symbols, i.e., symbols that do not appear on the left-hand side of any rule, are treated as “word category” names, and the words in each category should be defined in the .voca file.

In the example, NS_B, NS_E, WHAT, YEAR, WAS, I, and BORN are word categories, and their contents should be specified in the .voca file. NS_B and NS_E correspond to the silence at the head and tail of the input speech, and should be defined in every grammar for Julian.

The .voca file contains the word definitions for each word category defined in the .grammar file.

After creating the .grammar and .voca files, they need to be converted into .dfa and .dict format using “mkdfa.pl”, a grammar compiler [8].
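
For example, assuming the two files are named “sample.grammar” and “sample.voca” (the base name is arbitrary), the compiler is run as:

% mkdfa.pl sample

which generates “sample.dfa” and “sample.dict”. The compiled grammar can then be loaded into the recognizer (for example with the “-gram sample” option in recent Julius/Julian versions; see [8] for the exact invocation).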

Information extraction from an updatable offline database


C++ will be used to implement this. The database for a specific user will be populated in advance with information such as the user's birthday, his daughter's name, or his schedule. After an input is recognized by the robot, it will search the database and output the corresponding result.
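
A minimal sketch of such a lookup is given below, assuming the database is a plain text file (“user_db.txt” and the “question|answer” line format are assumptions made for illustration):

// Minimal sketch: load "question|answer" pairs and answer a recognized phrase.
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<std::string, std::string> db;
    std::ifstream in("user_db.txt");           // assumed file name
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type sep = line.find('|');
        if (sep != std::string::npos)
            db[line.substr(0, sep)] = line.substr(sep + 1);
    }

    // The phrase recognized by Julius is used directly as the lookup key.
    std::string question = "WHAT YEAR WAS I BORN";
    std::map<std::string, std::string>::iterator it = db.find(question);
    if (it != db.end())
        std::cout << it->second << std::endl;  // e.g. "You were born in 1935."
    else
        std::cout << "I'm sorry, I don't know that yet." << std::endl;
    return 0;
}

Keeping the database in a plain text file makes it “updatable” in the sense used above: new facts can be added without recompiling the program.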

Information extraction from Online RSS Feed


C++ will also be used here. In order to track up-to-date information such as the current weather or news headlines, the robot needs to search the internet for the required information. To make this simpler for the robot (and for the programmer), RSS feed information will be extracted and output as responses. An example of an RSS document is provided in Appendix IV, which contains information such as the temperature for a sample city.
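
The sketch below shows one way this could be done, assuming the libcurl library is used for the HTTP request and that the feed follows the Yahoo! Weather format of Appendix IV; the URL and the naive string search are placeholders for illustration.

// Minimal sketch: download an RSS feed and extract the current temperature.
#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl callback: append received bytes to a std::string buffer.
static size_t write_cb(char* data, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    std::string xml;
    CURL* curl = curl_easy_init();
    if (!curl) return 1;
    curl_easy_setopt(curl, CURLOPT_URL,
        "http://weather.yahooapis.com/forecastrss?p=USCA1116"); // example feed
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &xml);
    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    if (res != CURLE_OK) return 1;

    // Naive extraction: find the temp="..." attribute of the current
    // conditions element. A real implementation should use an XML parser.
    std::string::size_type pos = xml.find("temp=\"");
    if (pos != std::string::npos) {
        pos += 6;
        std::string temp = xml.substr(pos, xml.find('"', pos) - pos);
        std::cout << "The current temperature is " << temp << " degrees." << std::endl;
    }
    return 0;
}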

Improving accuracy and handling uncertainty


Adjustments to the original C++ code of the system are needed so that the robot can respond to unidentified phrases appropriately (e.g., “Can you please repeat what you have said?”). Also, the confidence score* for voice recognition may be adjusted so that the robot makes more accurate recognitions. At this stage of the project, the main code file has not yet been examined; therefore, the exact method of optimizing the system's performance is still under investigation.

*confidence score: when the system evaluates an input phrase, it assigns each phoneme a confidence score (out of 100%). When the score exceeds a threshold set by the programmer, the system accepts the recognition, and vice versa.
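
A sketch of how the confidence score could drive the robot's response is given below; the two threshold values and the helper function are assumptions for illustration, not measured or prescribed values.

// Sketch: choose a response based on the recognizer's confidence score.
#include <iostream>
#include <string>

// Stand-in for the database search described earlier.
static std::string lookup_response(const std::string& phrase) {
    return "It is 9 p.m."; // placeholder answer
}

static std::string handle_hypothesis(const std::string& phrase, double confidence) {
    if (confidence < 0.40)           // too uncertain: ask for a repeat
        return "I'm sorry, I did not get that. Can you repeat what you said, please?";
    if (confidence < 0.70)           // borderline: confirm before answering
        return "Did you say: \"" + phrase + "\"?";
    return lookup_response(phrase);  // confident enough: answer directly
}

int main() {
    std::cout << handle_hypothesis("WHAT TIME IS IT", 0.55) << std::endl;
    // prints: Did you say: "WHAT TIME IS IT"?
    return 0;
}

This also covers the first two voice-synthesis requirements listed in the Objective section (confirming uncertain phrases and asking for repetition).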

Gestures, facial expressions and visual display


This is also a proposed idea: the robot would make gestures and facial expressions according to the input phrases and output responses. This has not been thought through in detail; however, the people working on the movements of the robot and I will meet in the future to examine whether it is viable.

Also, to assist understanding of the robot's output responses, visual aids may be added to the output information (e.g., displaying a sun when the robot answers “it is sunny outside”). However, this idea is still tentative, and will likely be considered only once everything else is done.

Work Completed
At this stage of the project, most of the work done has been research and practice. The following are the tasks I have completed so far:

1) Obtained a basic understanding of the scope of the project; completed the thesis plan and progress report.

This included researching topics such as socially assistive robots, human-robot communication, and the motivation for their development. Technical readings such as the JuliusBook, Nelson's report, and John Alder's report were also reviewed.

2) Obtained a list of phrases that will be programmed into the robot.


3) Tested Julius and inserted some phrases into the system; the results work fine. In order for the software to work on a PC, Visual Studio needs to be downloaded and installed. Also, installing ActivePerl and setting the global environment variable during its installation is necessary to make mkdfa.pl work.

4) Obtained a list of websites that provide useful RSS feed information.

The full list will be ready within 1 to 2 days; the current RSS feed list includes the Weather Network, local news headlines, Lotto 6/49, etc.

Work Remaining
Creating a phrase database
After all phrases are approved by the supervisor, they will be programmed into the system. This is the easier part of the project in terms of difficulty, but it will be quite time consuming.

Information extraction from an offline database and the internet


This is the next task in line: a program needs to be written to take information from a database or the internet and output it as a response from the robot. Some C++ needs to be learned to finish the job. However, as mentioned above, the list of useful RSS feed websites will be available in a couple of days, and development of this part of the project will be under way soon.

Improving accuracy and handling uncertainty; gestures, facial expressions and visual display
As mentioned above, these problems will not be examined until later on. However, they are important aspects of the project and cannot be overlooked.
Creating responses
This will be the last part of the project. After everything else is set, responses will be programmed into the system according to the input phrases, giving the robot the ability to actually communicate verbally with a human.

Working Schedule
Date: Event
September 7-13, 2010: N/A
September 13-20, 2010: N/A
September 20-27, 2010: Research
September 27-October 4, 2010: Research
October 4-11, 2010: Research
October 11-18, 2010: October 14, deadline for submission of thesis plan
October 18-25, 2010: Coming up with lists of phrases
October 25-November 1, 2010: Midterms
November 1-8, 2010: Testing software, inserting phrases; results are good
November 8-15, 2010: RSS feed research, testing software
November 15-22, 2010: Working on progress report; November 18, deadline for submission of progress report
November 22-29, 2010: Obtain a complete list of useful RSS feed websites; learn C++
November 29-December 6, 2010: Start writing code for extracting information from RSS feed websites
December 6-13, 2010: Final exams*
December 13-20, 2010: Final exams*
December 20-27, 2010: Write code for extracting information from the internet and the database
December 27, 2010-January 3, 2011: Write code for the visual display of a given output response
January 3-10, 2011: Insert phrases into the system
January 10-17, 2011: Insert phrases into the system
January 17-24, 2011: Insert phrases into the system
January 24-31, 2011: Work on output responses
January 31-February 7, 2011: Work on output responses
February 7-14, 2011: Work on uncertainty and inaccuracy
February 14-21, 2011: Work on uncertainty and inaccuracy
February 21-28, 2011: Collaborate with the people working on gestures and facial expressions
February 28-March 7, 2011: Debug, finish off
March 7-14, 2011: Work on final thesis and final thesis presentation
March 14-21, 2011: Work on final thesis and final thesis presentation
March 21, 2011 to end of term: Final thesis presentations March 22-April 11; March 24, deadline for submission of final thesis report

* Will keep in touch with the supervisor and complete any tasks on request.
Appendix I – Examples of Questions, Commands and Answers

General (Robot)

Yes
No
Pardon
I'm sorry, I did not get that. Can you repeat what you said, please?

Greeting (Robot)

Hello, my name is Brian


How are you?
Good morning
Nice to meet you
How may I help you?
Goodbye

Specific Questions

If the database is big enough and the robot can track a database, then the number of questions is unlimited.

Pre-programmed database
What year was I born?
What is the name of my daughter?
Which university did I attend?
When do I have a doctor appointment?
Etc

Dynamic database tracking

What time is it?


What is today’s date?
What is the weather like?
What is the top headline of the day?
What is the winning number for Lotto 6/49?
When is sunrise/sunset?
Which TV channel is Discovery on?
Who are the candidates in the upcoming mayoral election?
Etc

Encyclopaedia type questions

What year was Martin Luther King born?


Give me a synonym for happiness
Spell microprocessor
Etc

Command

Read a book for me


Start
Pause
Stop
Play a song
Appendix II – MMSE test
Appendix III – Grammar/Voca Example
Grammar File
S : NS_B QUESTION NS_E
S : NS_B ANSWER NS_E

QUESTION: WHAT YEAR WAS I BORN

Voca File
% NS_B
<s> sil

% NS_E
</s> sil

% WHAT
WHAT w ah t
WHAT w ah
% YEAR
YEAR y ih r

% WAS
WAS w aa z

% I
I ay

% BORN
BORN b ao r n

Appendix IV – RSS Feed Example

[The XML markup of this example was lost in document conversion. The example is the Yahoo! Weather RSS feed for Sunnyvale, CA (http://weather.yahoo.com/forecast/USCA1116_f.html), retrieved Wed, 17 Nov 2010, 7:56 pm PST; the feed carries the channel title, publication date, location coordinates (37.37, -122.04), and the current weather conditions for the city.]
References

1. Murphy, R. (Texas A&M University), Nomura, T., Billard, A., and Burke, J. IEEE Robotics & Automation Magazine, vol. 17, no. 2, pp. 85-89, June 2010.

2. Kiesler, S. (Carnegie Mellon University) and Hinds, P. (Stanford University). Introduction to this special issue on human-robot interaction.

3. Wikipedia. Julius (software). Retrieved November 2, 2010, from http://en.wikipedia.org/wiki/Julius_(software)

4. Cole, R. (ed.). Survey of the State of the Art in Human Language Technology. The Press Syndicate of the University of Cambridge.

5. Microsoft. SAPI Overview. Retrieved October 13, 2010, from http://www.microsoft.com/speech/technology.aspx

6. LumenVox Speech Recognition. Retrieved October 15, 2010, from http://www.lumenvox.com/

7. Baidu Baike. RSS. Retrieved October 13, 2010, from http://baike.baidu.com/view/1644.htm

8. How to write a recognition grammar for Julian. Retrieved November 8, 2010, from http://julius.sourceforge.jp/en_index.php?q=en_grammar.html
