
Autonomous Multi-Modal Anomaly Detection

Caleb Little, Clemson University
Dr. Charlie Kemp, Georgia Tech
July 31, 2014
Abstract
The objective of my summer project was to develop a multi-modal anomaly detection system. Although the
development of a generalizable model that can be employed for any generic task was the primary objective, an
anomaly detecting yogurt feeding robot was put forth as a tangible test case. In order to facilitate achieving this
goal, two separate systems were developed to enable the robot to safely perform the task. First, a means of detecting
the presence of yogurt on a spoon was developed by using a histogram to detect changes in the spoon’s color.
Secondly, a series of monitors watching the force, torque, and acceleration measured from the PR2’s gripper, which
grasped the spoon during feedings, were developed. This monitor system then compares the current state of the
robot to Gaussian models developed from the analysis of previous iterations of the robot’s actions in order to locate
outliers of the expected behavior, which are recognized as anomalies. After incorporating these two systems, the
robot successfully fed yogurt to a volunteer, and detected multiple artificial anomalies. In addition, cross-validation
testing demonstrated that the monitor functions are reasonably accurate. Given these results, the Gaussian model
approach for multi-modal anomaly detection appears to be applicable for any robotic task, provided the task is
divided properly into sections with usable examples of the expected operation in each section.
I. INTRODUCTION
The field of robotics has recently undergone a very interesting shift in focus. Robots are no longer solely
being developed for use in controlled situations, such as laboratories and factories, but are also moving
into the unpredictable open world. As a result of this push, many interesting opportunities have opened
up in which robotics can make a significant difference in people’s lives, through assistance in everyday
tasks. One problem, however, remains ever present while robotics attempts to grow into its new role:
robots lack common sense. Although humanity takes common sense for granted, providing robots with
this skill requires much more effort than it would initially appear, as labs around the world have spent
countless hours attempting to find a way in which robots can loosely define a task in their environment
and carry it out, while monitoring for potential issues and identifying when things go wrong.
This task of identifying and reacting to problems as they arise can be broken into several steps.
The first, relatively simple step is finding a way to monitor a robot's environment, preferably using multiple
modalities, or senses, so that issues detectable by only one modality are still noticed. With such a system
implemented, a robot would be capable of noticing changes in its environment, though the tasks of identifying
and reacting to anomalies would still remain. This first step can be achieved with a fairly basic observation
system, but the second portion, identifying anomalies, proves more difficult. My focus over this summer
has been to create a means of
incorporating functional versions of these ideas into a practical system, or more specifically, to develop a
multi-modal anomaly detection system that can be utilized for any general task. In order to focus efforts,
however, I designed this system to be employed in a Willow Garage PR2 robot programmed for feeding
yogurt.
There are many instances in which labs have developed robots for the task of feeding in the past,
though very few of these had any means of detecting unexpected issues. For the intended user of this
PR2 program, however, this is a necessity. Henry Evans, a resident of California, suffered a stroke at the
age of 40 in 2002, leaving him mute and a quadriplegic. As such, his interactions with his environment
are very limited. So far, he has had to rely almost exclusively on his wife, as hiring an outside caretaker
carries too much risk of accidental harm. Previously, Dr. Kemp's lab
worked with Henry Evans to return some of his independence by using the PR2 robot to perform various
tasks, such as opening doors and retrieving items, as well as shaving [1]. As such, the idea that the PR2
could be used to feed Henry came up fairly quickly, but this requires a good deal of care. Specifically,
how can we ensure that the robot can safely perform the feeding task, without bringing any potential harm
to Henry, who would be unable to protect himself? The ultimate objective of my research for this summer
was to find a way to provide the PR2 with an anomaly detection system, so that it can detect when things
have gone wrong.
A. Related Work
The idea of using a robot to feed people is not new. Commercial models of feeding robots exist, such
as the MySpoon, a Japanese model that feeds a person using a spoon and is controlled by voice. However,
these commercial models can have issues with human interaction and certain impractical requirements for
some users [2]. Most notably, this device requires vocalization, which would prove an issue for Henry,
who no longer has the capacity for independent speech. Similar devices have been built in research, as
previously mentioned, but these have either required more interaction from the user than Henry is capable
of, or in the case of a previous attempt by Dr. Kemp’s lab, lack a means of detecting anomalies [3].
Additionally, these devices tend to be designed specifically for the purpose of feeding, whereas it would
be considerably more beneficial to have a general purpose robot that could carry out the task of feeding,
preferably the PR2, which already performs various tasks for Henry thanks to the previous research done
by Dr. Kemp’s lab [1].
Due to Henry’s disabilities, some means of monitoring the environment is necessary in order to prevent
the robot from accidentally hurting him. In order to achieve this state of awareness, the robot will require
anomaly detection in some form. Work has been done in this area before, but the ideal solution to this
problem takes the form of a general solution, not a specific one. Dr. Kemp’s lab has done some basic
work in this area before, while attempting to create a robot that could open doors safely [4]. Given the
time frame for this research, the multi-modal model needed to be simple by necessity, but the use of the
additional modalities should enable the robot to detect more error types, making it practical in the long
run.
B. Acknowledgements
I would like to thank the Atlantic Coast Conference Inter-Institutional Academic Collaborative, or
ACCIAC, along with the NSF, specifically grant number 1263049, for providing the support necessary to
conduct this research.
II. METHODOLOGY
I worked on two separate systems this summer, one for detecting the presence of yogurt on a spoon
grasped in the PR2’s gripper, and another, more general system that allows the robot to know when its
actions or the nearby environment have deviated from expectations.
A. Yogurt Analysis
I was able to develop a fairly straightforward system that fulfills the first requirement, that of detecting
the presence of yogurt. The PR2 is equipped with multiple head cameras, each of which publishes an
image stream. I designed the first portion of the yogurt detecting system to acquire a single image from
this stream, by un-subscribing itself from the active stream as soon as it has received an image, and then
saving said image as one of two JPEG images, a choice dependent on whether the node has been called
an even or odd number of times. Presumably, this first node is called to take the first image prior to the
robot attempting to retrieve yogurt, and then called again once the robot needs to check for the presence
of yogurt, producing the two pictures needed for a comparison. In order to actually detect the presence
of any yogurt, I developed a second node which runs a color histogram, a process that determines the
components of an image with respect to a color system such as RGB, on the two images produced by the
first node. In the event that the color histogram notices a change in intensity in any of the three colors
of red, green, or blue, it will return a 0 to its caller. This was designed to indicate that yogurt had been
found on the spoon, while a return value of 1 would imply that no yogurt had been found. I tested this
system by experimenting with several trials where no yogurt was present on the spoon in both images,
several where yogurt was present on the spoon in both images, and finally several where the spoon was
empty in the first image, but different yogurts were used in the second image during each trial.
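The comparison logic can be sketched as follows. This is a minimal illustration in plain NumPy, not the node's actual code; the function names and bin count are my own assumptions, and the threshold of 20 is the value mentioned in the Discussion.

```python
import numpy as np

def channel_histograms(image, bins=16):
    """Compute per-channel (R, G, B) intensity histograms for an image.

    `image` is an H x W x 3 array of uint8 pixel values, e.g. the
    window cropped around the spoon.
    """
    return [np.histogram(image[:, :, c], bins=bins, range=(0, 256))[0]
            for c in range(3)]

def yogurt_present(before, after, threshold=20):
    """Return 0 if a histogram change suggests yogurt is present, else 1.

    Mirrors the report's convention: 0 means "yogurt found". The change
    score per channel is the summed absolute difference between the two
    histograms; `threshold` separates sensor noise from a real color shift.
    """
    hists_before = channel_histograms(before)
    hists_after = channel_histograms(after)
    changes = [int(np.abs(hb - ha).sum()) // 2  # each moved pixel counts twice
               for hb, ha in zip(hists_before, hists_after)]
    return 0 if max(changes) > threshold else 1
```

With two identical images the function reports no yogurt; a large color shift in any one channel trips the threshold.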
B. Multi-Modal Anomaly Detection System
The anomaly detection system is significantly more complex than the yogurt detection system. After
some investigation, the team I worked with decided to use audio in conjunction with the acceleration,
force, and torque experienced by the PR2’s left gripper, which holds the spoon, as the selected modalities
to observe. Once these had been chosen, the most straightforward implementation of a detection system
was decided to be a Gaussian probability distribution model for a random variable, which has the general
form shown below.
P(x) = (1 / (σ√(2π))) e^(−(x−µ)² / (2σ²))
A Gaussian random variable is a variable which changes based on real world information or events, and has
a probability distribution in the shape of a bell curve. Effectively, a Gaussian probability equation can be
used to compare how similar a specific instance of a variable is to the mean value of a properly distributed
set. The result of this comparison is the particular instance’s z-score, which is used in the modality specific
anomaly detection systems. Additionally, if the values of µ and σ are 0 and 1 respectively, the model
becomes a standard Gaussian distribution.
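As a minimal illustration, the probability density and the z-score comparison described above can be written as follows (the function names are my own):

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density: (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean of the trained model."""
    return (x - mu) / sigma
```

For example, with the force model for task section 3 (µ = 16.41, σ = 0.11 from Table II), a reading of 16.52 N sits exactly one standard deviation above the mean.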
However, due to the multiple motions involved with the task of feeding yogurt, the team decided that
individual models for each motion, rather than a single, overarching one, would allow the system to more
accurately make a distinction between acceptable behavior and anomalies. Given this, the team developed
the general system layout seen in Figure 1. In the case of the nodes associated with the PR2’s physical
gripper, those of acceleration, force, and torque, active input from the robot feeds into the respective node
where the magnitude is compared to a previously generated model, one chosen relative to what stage
of the overall task the robot is currently involved with. Specifically, the nodes compare by calculating
the equivalent of the z-score value of the current sensor data and then subtracting a threshold value.
They then pass this to the master node, which listens to all three values in order to make a decision
about whether the current situation is an anomaly. In the event that the master node’s threshold is passed,
the system sends a warning to the robot’s control, letting it know that it needs to stop. In order to
first develop the models, however, statistical analysis of expected behavior was required, leading to the
collection and analysis of 20 proper performance examples. Once this was completed, the thresholds
required further calibration, as these thresholds affect the accuracy of the system. I ran several Receiver
Operator Characteristic tests in order to determine acceptable values for these thresholds, using the data
collected to create the models along with several recorded examples of artificial anomalies. Please note,
however, that the audio node, due both to less variation during expected operation and to its separation
from the robot’s embedded systems, was designed to run using a single Gaussian equation, and directly
alerts the robot that something unexpected has occurred, rather than interfacing with the robot-linked
master node. Because the microphone uses a completely different message format from those of
the PR2 systems, we were unable to acquire the examples needed to tweak its settings using Receiver
Operator Characteristic analysis. This prompted the decision to simply use a threshold of one, based
on the shape of the standard Gaussian distribution. In the end, the team elected to test
the audio node’s functionality in the implementation testing stage.
C. Implementation Testing
Once testing had concluded on the individual sections, we performed several tests with multiple
volunteers, testing both the ability of the system to properly feed yogurt to a human, requiring the proper
functionality of the yogurt detection module, and the ability of the system to respond to unexpected
physical interactions and sounds, verifying the real world functionality of the multi-modal anomaly
detection system. These tests were conducted using a Willow Garage PR2, a wired lapel microphone,
and a flexible silicone spoon. The PR2's program, along with all the systems I described previously, was
coded in the Robot Operating System, or ROS. During these trials, the robot performed autonomously,
and human interaction was as close to the ideal form requested by Mr. Evans as we could achieve, with
the exception of the artificial anomalies that were inserted into various tests in order to verify the anomaly
detection system. Figure 2 below shows the general setup and layout of the room in which we performed
the tests.
Figure 1: Multi-Modal Anomaly Detection System
Overview
Figure 2: Physical Experimental Setup
III. RESULTS
A. Yogurt Detection
Table I contains the results of the yogurt detection tests. During the Difference trials, different colors
of yogurt were used in the second image in order to verify the system’s ability to detect multiple colors
of yogurt. Figures 3 and 4 present a graphical representation of test 8. Please note that the histogram only
covers a 100 by 75 frame around the spoon, rather than the entire view of the camera seen in Figure 3.
Table I: Yogurt Detection Tests
Trial Description Results [Red, Green, Blue]
1 No Yogurt Either Time [0, 1, 0]
2 No Yogurt Either Time [2, 0, 1]
3 No Yogurt Either Time [7, 1, 4]
4 Yogurt Both Times [3, 2, 5]
5 Yogurt Both Times [4, 2, 1]
6 Yogurt Both Times [4, 8, 0]
7 Difference [33, 4, 143]
8 Difference [57, 18, 49]
9 Difference [43, 30, 80]
(a) Camera View without Yogurt (b) Camera View with Yogurt
Figure 3: Camera Input for Trial 7
B. Gaussian Model Variables
Table II details the µ and σ values for the Gaussian models for the acceleration, torque, and force nodes
in the following format [Acceleration magnitude, Force magnitude, Torque magnitude]. Task section 0 is
ignored, due to its random nature, as it represents the robot moving from any position to a fixed home
position (*).
C. Multi-Modal Anomaly Receiver Operator Characteristic Testing
Figure 5(a) shows the Receiver Operator Characteristic (ROC) curve for the force node. Figure 5(b)
shows the ROC curve for the torque node, while Figure 5(c) shows the ROC curve for the acceleration
node. Figure 5(d) shows the ROC curve for the main anomaly node. Each test had an opportunity for 250
true positives, cases where anomalous data was recognized as anomalous, and 250 false positives, where
acceptable values were incorrectly deemed anomalous. Table III shows the chosen threshold values for
each node.
(a) Histogram without Yogurt (b) Histogram with Yogurt
Figure 4: Histogram Graphical Output for Trial 7
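The counts behind each curve in Figure 5 can be generated by sweeping candidate thresholds over recorded scores; a minimal sketch (the function and variable names are my own):

```python
import numpy as np

def roc_points(nominal_scores, anomaly_scores, thresholds):
    """Count (false positives, true positives) at each candidate threshold.

    A sample is flagged anomalous when its score exceeds the threshold.
    Nominal samples flagged this way are false positives; anomalous ones
    are true positives. Sweeping thresholds traces an ROC curve.
    """
    points = []
    for t in thresholds:
        fp = int(np.sum(np.asarray(nominal_scores) > t))
        tp = int(np.sum(np.asarray(anomaly_scores) > t))
        points.append((fp, tp))
    return points
```

Picking the threshold then amounts to choosing the point on the curve with the best trade-off between the two counts.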
Figure 5: Receiver Operator Characteristics for (a) the Force Node, (b) the Torque Node, (c) the
Acceleration Node, and (d) the Master Node. Each panel plots True Positives (0 to 250) against
False Positives (0 to 140).
Table II: Gaussian Model Variables
Task Section µ σ Description
0 [9.85, 11.1, 1.53] [0.052, 1.4, 0.009] Moving to Start (*)
1 [9.7, 12.1, 1.58] [0.11, 3.59, 0.003] Move Spoon to Above Bowl
2 [9.16, 15.78, 1.6] [0.039, 1.0, 0.002] Tilt Spoon Above Bowl
3 [8.99, 16.41, 1.08] [0.021, 0.11, 0.038] Move Spoon to Bottom of Bowl
4 [8.97, 16.29, 1.23] [0.021, 0.81, 0.021] Move Spoon Forward Along Bottom
5 [8.98, 17.07, 1.09] [0.030, 1.49, 0.0056] Move Spoon to Top Edge (wiping)
6 [9.05, 16.83, 1.54] [0.058, 2.01, 0.026] Tilt Spoon to Horizontal Position
7 [9.72, 12.04, 1.58] [0.14, 4.82, 0.0041] Move Spoon to Camera
8 [10.13, 9.52, 1.5] [0.040, 0.80, 0.0021] Move from Camera to Head
9 [9.85, 11.46, 1.56] [0.054, 1.27, 0.0069] Move Closer to Mouth
10 [9.66, 12.67, 1.66] [0.022, 0.09, 0.0035] Move Away from Mouth
11 [9.34, 14.99, 1.59] [0.14, 5.18, 0.0018] Move Back to Start
D. Implementation Tests
Table IV contains the results of the artificial anomaly tests performed on the live system while the
robot attempted to feed a subject.
Table III: Chosen Threshold Values
Node Threshold Value
Force 0.4
Torque 2.5
Acceleration 1.1
Master 2.65
Table IV: Live Implementation Tests
Subject Trial Description Results
1 1 Shout Detected
1 2 Shove Detected
2 3 Talking Detected
2 4 Tap Detected
3 5 Talking Earlier Detected*
3 6 Push Detected
IV. DISCUSSION
A. Yogurt Detection
The yogurt detection system tests appear to indicate that, provided the camera is properly set up, the
system can detect the presence of yogurt, as shown in Table I. There is some fluctuation in the values,
but given a proper threshold, in this case 20, the shift in color due to the presence of yogurt can be
successfully distinguished from the raw fluctuation caused by the PR2’s somewhat low-resolution head
cameras. Originally, the plan had been to utilize the Kinect mounted to the top of the robot for a much
clearer picture with less fluctuation and better defined colors. However, certain default settings on the
Kinect's camera, such as the auto-gain balance, could not be disabled, and reflections at some locations
in the camera's field of view caused glare, so this plan had to be scrapped.
This glare made it difficult to distinguish yogurt from the spoon in certain configurations, presenting a
problem for the color histogram method of detecting the presence of yogurt. Fortunately, using the PR2’s
head cameras instead of the Kinect’s color camera, although causing a decrease in quality, allowed me to
tweak the camera settings, enabling me to remove the debilitating glare by reducing the gain. As shown
by the results of the trials shown in Table I, the yogurt detection system functions reasonably well as an
independent entity, provided the camera is properly adjusted.
B. Gaussian Model Variables
Table II simply lists the sections into which the task was divided, and the results of running a training
set of 20 trials through the statistics package of SciPy, a scientific Python library. These were not tests, but
rather statistical analysis to find the values that correspond to the Gaussian distribution equation, shown
below, for each particular section of the task and modality.
P(x) = (1 / (σ√(2π))) e^(−(x−µ)² / (2σ²))
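The fit itself amounts to computing a mean and standard deviation per task section and modality. A minimal NumPy sketch (the report used SciPy's statistics package; the function name and the array layout below are assumptions):

```python
import numpy as np

def fit_section_models(trials):
    """Fit (mu, sigma) per modality for one task section.

    `trials` is an N x 3 array: one row per training trial, columns in the
    report's order [acceleration, force, torque] magnitudes. Returns the
    (mu, sigma) pairs that parameterize each modality's Gaussian model.
    """
    trials = np.asarray(trials, dtype=float)
    mu = trials.mean(axis=0)
    sigma = trials.std(axis=0, ddof=1)  # sample standard deviation
    return mu, sigma
```

Running this over the 20 training trials for each of the twelve task sections yields the values in Table II.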
C. Multi-Modal Anomaly Detection
The purpose of these tests was oriented more toward tweaking the threshold values of the four nodes
based on the available data set than toward verifying proper functionality, but they do provide some clues
as to the accuracy of the equations. Although the results of the individual nodes were a little disappointing,
with rather high false positive rates coupled with less than stellar true positive rates, the master node has
a somewhat more reasonable spread when using a larger threshold, as a result of being able to poll the
multiple modalities before making a decision about whether or not an anomaly was occurring. Due to
the still relatively significant false positive rate, around 28 percent, the team decided to add an additional
threshold to the main anomaly node. This threshold prevents spikes from setting off the anomaly alarm,
while letting longer-lasting true anomalies through to stop the PR2. This succession threshold is harder
to test in simulation, so we verified it during the implementation tests on the system. Because the audio
node runs separately from the robot's sensors and uses a different message format, we were unable to
tune it through testing; instead we used one standard deviation, i.e. a threshold of 1, as dictated by the
form of the standardized Gaussian distribution model.
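The succession check described above might be sketched as follows; the class name and the number of required consecutive triggers are assumptions, not values from the report.

```python
class SuccessionCheck:
    """Filter out single-sample spikes: only a run of consecutive
    over-threshold decisions from the master node counts as an anomaly."""

    def __init__(self, required=5):
        self.required = required  # consecutive triggers needed
        self.count = 0

    def update(self, over_threshold):
        """Feed one master-node decision; return True once the run is long enough."""
        self.count = self.count + 1 if over_threshold else 0
        return self.count >= self.required
```

A brief spike resets as soon as one in-bounds sample arrives, while a sustained anomaly eventually raises the alarm.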
D. Implemented System Performance
These tests proved quite informative, pointing out several potential issues with the system. Before I
discuss the actual results, I would like to mention two issues not shown by the test data which came up
while debugging the system. The first of these is update speed. Due to the fact that the PR2’s sensors
publish information at different rates, it is quite possible for the messages to arrive at their respective
Gaussian nodes at different times, meaning the nodes can end up passing data from slightly different instants
to the master node. Fortunately, the publishing rate is fast enough that this loss of synchronization does
not appear to affect performance in any visible way. The second, and more notable issue, is the heavy
dependence of the system on good expected performance examples and the existence of well-defined
anomaly examples for the purposes of developing the models and thresholds. While using the original
value of 2.65 in the master node, the team found that the system’s physical sensors were sensitive to the
more forceful anomalies that had been recorded for calibration purposes, but they were unable to detect
the more realistic, lighter pushes. As a result, we made the decision to lower the threshold in the master
node to 1.65, while raising the number of successive triggers needed to pass the succession check in order
to filter out the additional false positives. We also found that, due to some variation in certain
portions of the trial, the predicted values from the Gaussian nodes tended to behave rather oddly at certain
points, as shown by the variations in values present in Table II. Fortunately, the training set, coupled with
the succession check, provided a base that was accurate enough to the desired proper operation that we
were able to carry out the trials without false positives setting the system off.
Once the debugging had concluded, the implemented systems worked fairly well. In each of the trials,
the artificial anomaly was introduced after the robot had concluded that the yogurt had been acquired,
which allowed us to verify the real world functionality of the yogurt detection nodes on the side. In almost
all cases, the artificial anomalies set off the anomaly detection systems immediately. The exception to
this is one incident with the audio system, where the robot had difficulty hearing sounds prior to passing
the yogurt check. Additionally, it is worth noting that the audio would occasionally pick up background
noise originating from near the subject during tests to ensure that the robot could complete the task
on its own. These issues are likely a result of the type of microphone we used during the testing, as
a cardioid microphone has a fairly specific but wide zone from which it can hear sounds, as well as
the lack of a means of filtering out background noise, which could be added to the system in a fairly
straightforward manner given the time. As such, the system appears to function reasonably well in the
real world, demonstrating that this method for multi-modal anomaly detection is quite possible.
E. Future Work
One issue that appeared during my work on the yogurt detection system was accurately distinguishing
the location of the spoon from the background. This was eventually fixed using a window around the
general position of the spoon, but a more accurate solution should be possible. For example, image
segmentation could be used to separate the spoon from the background, allowing the histogram to process
the spoon by itself. This would improve the accuracy of the system, but would require some time to
implement.
There are many things I wish I could try in regard to this project that have the potential to improve
the accuracy of the anomaly detectors and possibly offset the reliance on accurate training data, but one
in particular stands out. During the initial stages of the project, the team ran across a paper on the use of
manifold learning as part of a system to identify objects based on the results from five tests [5].
Manifold learning is, at its core, a means of reducing higher dimensional information to a smaller set
of dimensions, and is a name given to a class of algorithms, such as self-organizing maps and isomaps;
each with certain advantages and weaknesses. These manifold systems are usually used for computer
recognition projects, as they can reduce the number of comparisons needed to identify an object.
We came up with the idea that the detection of anomalies, an instance of an action rather than the result
of a test, could be achieved utilizing comparisons on manifold system maps rather than Gaussian
distribution equations. Manifold systems do not inherently allow for multiple object recognition, but it is
quite possible to train a system, and then tag the points used to train it in order to categorize the results,
as the team behind the object identification project did [5]. I was able to obtain a set of older data
that the team acquired during early attempts at developing the robot’s program, and generated the maps
and results shown in Figure 6 by running the same Receiver Operator Characteristic tests I ran on the
Gaussian nodes. Each of the 8 segments present in this early data set corresponds to a color in the maps
in Figure 6.
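As a rough illustration of the idea, the sketch below embeds nominal training data with PCA (one of the Figure 6 systems) and scores a new reading by its distance to the nearest embedded training point. This is my own minimal reconstruction under those assumptions, not the code used to generate Figure 6.

```python
import numpy as np

def pca_embed(train, k=2):
    """Reduce training data to k dimensions via PCA.

    Returns the embedded training points plus the mean and components
    needed to project new sensor readings into the same low-dimensional map.
    """
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    components = vt[:k]
    return (train - mean) @ components.T, mean, components

def anomaly_score(x, embedded_train, mean, components):
    """Distance from a new point to its nearest nominal neighbor in the map.

    Large distances suggest the reading resembles no trained example;
    thresholding this score gives an ROC-style anomaly test.
    """
    z = (np.asarray(x) - mean) @ components.T
    return np.min(np.linalg.norm(embedded_train - z, axis=1))
```

The nearest-neighbor search here is the brute-force comparison mentioned below, which is what makes the manifold approach slower than evaluating a single Gaussian equation.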
As Figure 6 shows, some manifold systems might actually prove to be a fairly decent alternative to the
Gaussian distribution method. I would need to investigate this area considerably more before deciding
to use manifold systems over the simple Gaussian models, but a few things are clear. First, given the
number of comparisons a brute-force approach needs to determine the acceptability of a data point, the
Gaussian method is still significantly faster than the manifold systems approach, at least in this primitive
test, though this may be due to the simple brute-force search I utilized during these
tests. Second, some of the manifold systems seem to have a better true positive vs. false positive rate than
the Gaussian models did, supporting my thoughts that this may warrant further research. As such, I hope
that there will be further investigation of this rather unusual use for manifold systems, which are usually
used for the recognition of objects rather than actions [5].
V. CONCLUSION
This project was challenging, and although it’s not quite complete yet, it is in a state where I feel that
it could soon be used for the practical goal of safely feeding Henry Evans his favorite food of yogurt. I do
have a few concerns that I wish could be addressed, such as getting a better data set for proper training,
as well as the issue of the glare present in the Kinect’s camera, but generally speaking, the robot, and by
extension, the yogurt detection and autonomous multi-modal anomaly detection systems, work fairly well.
In the end, the general setup appears to be generalizable to any task, provided that the task is properly
divided up and examples can be obtained for calibration, which could prove quite useful in other projects.
Figure 6: Manifold Systems: (a) Self-Organizing Map, (b) Self-Organizing Map ROC Curve, (c) Locally
Linear Embedding, (d) Locally Linear Embedding ROC Curve, (e) Principal Component Analysis,
(f) Principal Component Analysis ROC Curve, (g) Isomap, (h) Isomap ROC Curve. Each ROC panel
plots True Positives (0 to 250) against False Positives (0 to 200).
VI. REFERENCES
[1] T. L. Chen, M. Ciocarlie, S. Cousins, P. Grice, K. Hawkins, K. Hsiao, C. C. Kemp, C.-H. King, D. A. Lazewatsky, A. E. Leeper et al.,
“Robots for humanity: Using assistive robots to empower people with disabilities,” 2013.
[2] W.-K. Song, W.-J. Song, Y. Kim, and J. Kim, “Usability test of KNRC self-feeding robot,” in Rehabilitation Robotics (ICORR), 2013
IEEE International Conference on. IEEE, 2013, pp. 1–5.
[3] T. L. Chen, A. L. Thomaz, and C. C. Kemp, “Enabling the PR2 to assist with activities of daily living.”
[4] A. Jain and C. C. Kemp, “Improving robot manipulation with data-driven object-centric models of everyday forces,” Autonomous Robots,
vol. 35, no. 2-3, pp. 143–159, 2013.
[5] J. Sinapov, T. Bergquist, C. Schenck, U. Ohiri, S. Griffith, and A. Stoytchev, “Interactive object recognition using proprioceptive and
auditory feedback,” The International Journal of Robotics Research, vol. 30, no. 10, pp. 1250–1262, 2011.