
UNIVERSITY OF MALTA

Faculty of Engineering
Department of Systems and Control Engineering
FINAL YEAR PROJECT
B.ENG.(Hons.)
Virtual Object Manipulation
in an
Augmented Reality Setup
by
Daniel Camilleri
A dissertation submitted in partial fulfilment
of the requirements for the award of
Bachelor of Engineering (Hons.) of the University of Malta
Copyright Notice
1. Copyright in text of this dissertation rests with the Author. Copies (by any pro-
cess) either in full, or of extracts may be made only in accordance with regulations
held by the Library of the University of Malta. Details may be obtained from the
Librarian. This page must form part of any such copies made. Further copies (by
any process) made in accordance with such instructions may not be made without
the permission (in writing) of the Author.
2. Ownership of the right over any original intellectual property which may be con-
tained in or derived from this dissertation is vested in the University of Malta and
may not be made available for use by third parties without the written permis-
sion of the University, which will prescribe the terms and conditions of any such
agreement.
Authenticity Form
DECLARATION
Student's Code: 384592M
Student's Name & Surname: Daniel Camilleri
Course:
Bachelor of Engineering (Honours) in Electrical and Electronics Engineering
Title of Long Essay/Dissertation/Thesis:
Virtual Object Manipulation in an Augmented Reality Setup
I hereby declare that I am the legitimate author of this Long Essay/Dissertation/Thesis.
I further confirm that this work is original and unpublished.
DANIEL CAMILLERI
Signature of Student Name of Student(in Caps)
15th May 2014
Date
Abstract
Acknowledgements
Contents
List of Figures ix
List of Tables x
List of Abbreviations xi
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Project Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Project Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
I Literature Review 4
2 Augmented Reality Review 5
2.1 Defining Augmented Reality . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Pixel Based Display AR . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Heads Up Display AR . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Marker Based AR . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Markerless AR . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.3 Virtual Reality Based AR . . . . . . . . . . . . . . . . . . . . 9
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Pose Classification Review 11
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Gesture Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.1 Appearance Based . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.2 3D Model Based . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Gesture Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.1 Feature Detection . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.2 Parameter Computation and Extraction . . . . . . . . . . . . . 16
3.4 Gesture Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4.1 Feature Extraction Matching . . . . . . . . . . . . . . . . . . . 17
3.4.2 Single Hypothesis Tracking . . . . . . . . . . . . . . . . . . . 18
3.4.3 Multiple Hypothesis Tracking . . . . . . . . . . . . . . . . . . 18
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
II Methods and Methodology 20
4 Augmented Reality 21
4.1 AR Hardware Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.1 Visual Hardware . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.2 Tracking Hardware . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1.3 Required Camera Parameters . . . . . . . . . . . . . . . . . . . 22
4.1.4 Camera Choice . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 AR Software Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.1 Unity 3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.2 Oculus Unity Libraries . . . . . . . . . . . . . . . . . . . . . . 26
4.2.3 VR to AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.4 Virtual Hand . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Unity-MATLAB Communication . . . . . . . . . . . . . . . . . . . . . 29
4.4 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5 Pose Classification 32
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2 Skin Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2.1 Log Transform Illumination Compensation Method . . . . . . . 33
5.2.2 Mean Luminance Based Illumination Compensation Method . . 34
5.2.3 Pixel Luminance Based Illumination Compensation Method . . 34
5.2.4 No Illumination Compensation Method . . . . . . . . . . . . . 36
5.2.5 Skin Segmentation Testing . . . . . . . . . . . . . . . . . . . . 36
5.3 Pose Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3.1 Skeletonisation Approach . . . . . . . . . . . . . . . . . . . . 38
5.3.2 Modified Approach . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3.3 Manipulation Actions . . . . . . . . . . . . . . . . . . . . . . 40
5.3.4 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
III Results and Discussion 42
6 Augmented Reality Results 43
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 Snellen Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3 Display Contrast Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.4 View Registration Test . . . . . . . . . . . . . . . . . . . . . . . . . . 44
7 Skin Segmentation Results 45
8 Pose Classification Results 46
9 Conclusion and Future Work 48
9.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9.2 Future Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
References 48
A Matlab Code 54
B Unity Code 57
List of Figures
2.1 Augmented Reality Application in a mobile phone [4] . . . . . . . . . . 6
2.2 Augmented Reality Glasses [5] . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Marker Based AR [11] . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Augmented Reality for Maintenance and Repair [11] . . . . . . . . . . 9
3.1 Gestural Mathematical Modelling [14] . . . . . . . . . . . . . . . . . . 12
3.2 Labelled Hand Skeleton [19] . . . . . . . . . . . . . . . . . . . . . . . 14
4.1 Oculus Rift Development Kit 1 [6] . . . . . . . . . . . . . . . . . . . . 21
4.2 Logitech C905 webcam [10] . . . . . . . . . . . . . . . . . . . . . . . 23
List of Tables
5.1 Ellipse Parameters for [46] . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Illumination Compensation Parameters for [45] . . . . . . . . . . . . . 35
5.3 Ellipse Parameters for [45] . . . . . . . . . . . . . . . . . . . . . . . . 36
6.1 Snellen Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
A.1 Parameter Extraction after Skin Segmentation . . . . . . . . . . . . . . 54
A.2 Parameter Extraction after Skin Segmentation . . . . . . . . . . . . . . 55
List of Abbreviations
HMI Human Machine Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
HCI Human Computer Interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
AR Augmented Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
VR Virtual Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
FOV Field of View. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
HDR High Dynamic Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
SBS3D Side by Side 3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
HMD Head Mounted Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
ARMAR Augmented Reality for Maintenance And Repair . . . . . . . . . . . . . . . . . . . . . . . . 9
EMG ElectroMyoGraph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
DOF Degrees Of Freedom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
SIFT Scale Invariant Feature Transform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
SURF Speeded Up Robust Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Chapter 1 - Introduction
1.1 Introduction
The widespread proliferation of computers and their incredible importance in everyday life has given rise to many completely new branches of scientific exploration that rely on these machines. The problem is that, from their invention until recent years, mainstream interaction with computers has been limited to input devices which provide only artificial ways of communication. The way people interact with these devices does not exist in the natural realm of day-to-day interaction between humans and as such is both unnatural and limiting in scope. One particularly good example of limited interactivity with computers is the manipulation of 3D virtual objects, where interaction has to be carried out using two dimensional input devices, generally a mouse and a keyboard.
Ever since the advent of image processing, one of the main applications envisioned has been its use for a novel and natural Human Machine Interface (HMI) where the user is free to express the desired actions through hand gestures. The main problem with an image processing approach over the years has been the sheer computational complexity required in implementing various degrees of hand pose and hand gesture recognition. Apart from the innovation required in the input devices for a natural HMI, there is also the task of innovating the main output device of the computer: the display.
Just as in the case of the input devices, users live in a 3D world but current interaction is with a screen that normally projects 2D images and, with recent technology, 3D images limited to the screen area. However, a truly natural way of interaction would be through the use of the 3D space around the user. In this method of interaction, the user is not interacting with programs and other objects whilst sitting down in front of a 2D computer display; instead, the user is able to move freely around his surrounding environment whilst interacting with virtual windows placed in 3D space at varying depths and positions, all whilst still being able to observe and interact with the real world in the most natural and unrestrictive way possible. This is the general aim of this dissertation: investigating the technologies required to provide a method of augmenting the environment of a user whilst allowing him to interact using unaided hands.
1.2 Project Objectives
The objectives to be achieved in this project are divided into two sections. The first section deals with the construction of an Augmented Reality (AR) setup, including both hardware and software. This section's objectives include, first of all, being able to view the environment in as unrestricted a way as possible while at the same time allowing for the observation of virtual objects in the context of the real world. The second objective is the natural interaction with these objects by having them placed at their respective depths with respect to the user and his environment. Finally, the last objective is to build this system in the cheapest way possible so as to also demonstrate the financial feasibility of the project for widespread use.
The second section's objectives deal with the accuracy, performance and user experience of the object manipulation component of the dissertation title. In this section, the first objective is object manipulation while keeping the hands free of gloves or markers of any kind, thus keeping user interaction as natural as possible. The second objective is to approach the real-time algorithmic performance required from a user interface while replicating as many degrees of freedom as possible. Finally, the last objective is accuracy in the parameter determination of the algorithm so that the actions carried out by the user are faithfully replicated.
1.3 Project Approach
This section outlines the various steps carried out in order to achieve said objectives.
Research Different AR Setups
Analysis of different setup configurations with respect to the objectives
Set Up the AR Environment
Setting up of an environment to render virtual objects
Communicate Between this Environment and MATLAB
Establishing a communication channel between MATLAB and the renderer
Hand Pose Recognition Literature Review
Research of the various methods available for estimating the hand pose
Algorithm Implementation
Implementation of an algorithm which satisfies the objectives' criteria
System Integration
Integration and finalisation of both systems on a single platform
1.4 Dissertation Outline
The following chapter introduces the theory behind building an Augmented Reality hardware setup and the associated hardware choices, while Chapter 3 reviews the literature on hand pose classification. Chapter 4 then reports the work related to the chosen hardware and the software interface that integrates it, providing a platform for rendering the required virtual objects, and also details the communication methodology used between MATLAB and the rendering platform; this concludes the Augmented Reality part of the dissertation. Chapter 5 provides a description of the image processing theory followed by the hand pose algorithm. Finally, Chapters 6 to 8 report the results obtained and Chapter 9 concludes and discusses the successes, shortcomings and future improvements to the system.
Part I
Literature Review
Chapter 2 - Augmented Reality Review
2.1 Defining Augmented Reality
Augmented Reality currently has multiple definitions because of the vast range of applications of the technology. [1] defines AR as a technology that allows computer generated virtual imagery information to be overlaid onto a live, direct or indirect, real-world environment in real time. On the other hand, another definition, which comes from [2], explains AR in five clear statements. According to this source, an AR system should:
1. Combine the real world with computer graphics
2. Provide interaction with objects in real time
3. Track objects in real time
4. Provide recognition of images or objects
5. Provide real time context or data
This identifies the various component features of an AR system, whilst the definition in [3] is concerned more with the user's impression and defines AR as the use of real time digital computers and other special hardware and software to generate a simulation of an alternate world. Combining all of these definitions provides the ideal AR interface. Nevertheless, integration in current hardware setups has been limited to a subset of the features described because of various limitations. This chapter looks into the different AR setups that are currently mainstream and breaks down the various components of each system while looking into how these components, both hardware and software, contribute to the AR experience.
2.2 Hardware
The general hardware requirements of an AR setup are, first of all, a display which provides the images to the user; this is the component that defines the price range of the setup. Following that, the next component is a method for capturing the surrounding environment, and finally each basic AR system needs a method for superimposing virtual objects on the captured environment. More advanced AR interfaces also include a method of interaction with virtual objects, which is tackled in later chapters.
Figure 2.1: Augmented Reality Application in a mobile phone [4]
2.2.1 Pixel Based Display AR
The first component of an AR system, and the one that defines the type of setup, is generally the display. The common display type used is a pixel based monitor, which has resulted in AR systems based on mobile phones, computer monitors and TV screens. These AR devices are typically relatively cheap because they are not dedicated devices but have AR applications built upon their general purpose hardware. Applications include augmenting the environment with restaurant and landmark data in the case of mobile phones, as depicted in Figure 2.1, or the augmentation of one's environment through a monitor based game that displays the user together with the addition of virtual objects around him. This forms the basis of many AR games and applications found on the market. In these hardware setups, viewing the surrounding environment is achieved by the use of cameras and the computer graphics are overlaid on this camera feed using a graphics processor. The camera feed is also analysed by the AR software to process specific artificial or environmental markers that trigger the augmentation content. The disadvantage with these devices is that they are typically non-immersive, in the sense that the user views the augmented world only through the screen. Also in some cases, especially frequently in the case of mobiles, they occupy the user's hand, thus not leaving him free to interact.
2.2.2 Heads Up Display AR
On the other end of the spectrum, dedicated AR devices typically consist of optical screens mounted to glasses which are worn by the user. These represent the ideal interface for an AR application because they are, first of all, non-intrusive devices and, secondly, highly immersive. In this case, the lenses and optic elements mounted to the worn frame allow light to pass from outside the device while a projector mounted at the side of the eye projects the virtual image to the user. In this way, the user views the surroundings naturally through his own eyes instead of through a device such as a camera. The methods by which the real world and the virtual imagery are fused are various.
Figure 2.2: Augmented Reality Glasses [5]
In the simplest case, it is a matter of projecting the image onto a prism such that it reflects off the angled face while at the same time allowing external light to pass through, just like the hardware in Figure 2.2. A more thorough listing of optical constructions can be found in [8]. This means that the user can wear the AR glasses and experience augmentation all around him because wherever the user looks, the glasses follow, giving a good simulation of an alternate, augmented world. These are the processes which provide the augmentation and, in this case, the graphics processor solely renders the virtual objects instead of also incorporating camera images as in the previous devices. However, an external camera is still required in most applications so that the AR software can process markers that provide the augmented content to the user. The downside of this type of hardware setup is that it is expensive to manufacture and therefore inaccessible to most users.
Therefore, the sweet spot in terms of price and functionality comes in the form of combining the wearability and immersivity of glasses with the cost effectiveness of a pixel based screen. This results in what is called a Head Mounted Device (HMD). Using a pixel based screen also means that external cameras are required to provide the immediate surroundings to the user but, notwithstanding this extra cost, the overall cost of this type of system is still lower when compared to a HUD based system containing expensive optics.
2.3 Software
Next in the construction of an AR interface is the software. The software is responsible
for providing the augmentation of the real environment to the user through the superpo-
sition of virtual content on the camera feed. The requirements of the software depend
heavily on the type of AR application but there are three general approaches that can be
used to trigger the augmentation.
Figure 2.3: Marker Based AR [11]
2.3.1 Marker Based AR
In Marker Based AR systems, the focus is on using black and white markers to provide
the trigger for the augmentation of the live video. This method rst identies the pat-
tern displayed in the marker, and this in turn provides the required information which
links the program to the appropriate model associated with that particular marker. That
is how the system is initialised. After initialisation, the pose of the camera is estimated
relative to the marker using corner features and other appropriate methods together with
a previously calibrated camera. This camera pose with respect to the marker, provides
the azimuth and elevation which must be used in order to render the augmented model
in the appropriate orientation with respect to the real world. Then after the rst initiali-
sation routine, the marker features are matched between successive frames to modify the
visible orientation of the model which is referred to as tracking. An example of such a
system can be seen in Figure 2.3
In the context of a practical application related to a HMD, having markers spaced around the user is a viable option but limits the augmented interactivity to these markers [51]. Apart from this, it requires having to set up a room with physical markers to provide the augmented content. Another disadvantage is that the content location is limited to the positioning of these markers. On the flip side, a significant advantage of marker based tracking is that it is robust in environments which are challenging for the feature tracking employed in markerless systems. Such scenarios include both cluttered and plain environments, where features are excessive or scarce respectively.
2.3.2 Markerless AR
The second software approach is similar to the first but instead of having custom printed markers that provide the trigger for displaying augmented content, the triggers are now the surrounding objects. In this type of AR, object feature comparison is carried out with a database of feature points for different objects, where each object corresponds to a virtual model.
Figure 2.4: Augmented Reality for Maintenance and Repair [11]
This initialisation is the only technical difference between Marker and Markerless AR because Markerless AR still has to identify the camera pose from the extracted features to evaluate the viewing angle of the model. In this case, instead of requiring the physical placement of custom markers, preliminary work consists of constructing the feature database for the surrounding objects. An example of such an application is in AR learning or Augmented Reality for Maintenance And Repair (ARMAR), shown in Figure 2.4.
2.3.3 Virtual Reality Based AR
The third approach combines a Virtual Reality (VR) environment with images of the real world in order to provide the augmentation. This is achieved in two steps. First of all, the user has to be able to view objects in the virtual environment, which is achieved by constructing a virtual environment with augmented objects that respond to tracking data describing the point of view of the user. The orientation data can originate from a simple accelerometer or from a complete head tracking system in the case of a HMD. Then, after being able to implement a VR environment complete with head tracking, the next step is to introduce the context of the immediate user environment. This is done by superimposing the virtual objects from the virtual environment over the image from the cameras. In the case of monocular HMDs, only a single virtual object view is required for the system. Alternatively, when implementing a stereo HMD, two separate views of the virtual object are required to be superimposed on the individual camera feeds.

Basing AR on a virtual world has the advantage of making it more immersive than the marker and markerless systems and requires less image processing of the environment, thus making it less taxing on system resources. It does however have a prerequisite, which is the construction of the virtual world. Even so, it is possible for virtual content to be created on the fly with a suitable user interface, either through conventional input devices or through other methods such as hand pose estimation using image processing.
The main disadvantage with this type of system is that, in contrast with the other systems, it requires more hardware to track the user's motion.
The minimum requirement here in terms of extra hardware is a head tracker, which allows for the estimation of head orientation in 3 DOF, although the best system would also include a translational tracker for the full 6 DOF required to navigate the virtual world in tandem with the real world. This of course adds cost to the system and has to be taken into consideration when choosing the software approach. When choosing trackers, apart from the DOF, another important parameter is the responsiveness. This is particularly important because if the latency between a head movement or action in the real world and that in the virtual world is discernible by the user, it breaks down the illusion of augmentation and can even result in motion sickness.
2.4 Conclusion
In brief, this chapter detailed the hardware setup approaches currently available for building an AR interface, together with the technological background required to build this hardware, and also dealt with the software aspect of AR that interfaces the hardware and augments the user's vision. The various methods employed for augmenting a user's reality were explored together with details on the further algorithmic requirements and prerequisites of each approach. Virtual Reality Based AR can provide very good results with the least preliminary work on the user's side whilst also providing a very good illusion and allowing easy creation of content through game design programs. This lends much flexibility in content creation and even adjustable levels of realism according to the level of rendering detail and the computational power available to the user. Thus, this chapter ends the literature review on AR and the next chapter deals with the foundation for the image processing required to provide a user input.
Chapter 3 - Pose Classification Review
3.1 Introduction
Hand pose classification in general is the process of estimating the pose of the hand, either in 2D or 3D, in order to extract specific gestures. These specific gestures then act as triggers for the AR software and result in the execution of a particular action. From now on, a pose refers to the immediate position and orientation of the palm, fingers and arm, whilst a gesture refers to the spatio-temporal interpolation of successive hand poses. The specific application of hand pose recognition here is focused on the interpretation of hand poses and gestures for a Human Computer Interface (HCI). HCI is currently the limitation when it comes to effectively manipulating and controlling computers in a more natural way.
A brief history of HCI up to recent years has its roots in speech recognition, which was the first natural HCI to be implemented. This was then followed by hand pose recognition based on sensors such as gloves, but the disadvantage here was that gloves are very expensive, fragile and sometimes cumbersome to use. Following this, a growing interest started in using visual interpretation of images to decipher the hand pose in the presence of occlusions and varying environments, with research still ongoing in this area. Also, in more recent years, ElectroMyoGraph (EMG) based sensor interfaces are being explored for commercial HCI applications. The developers of this technology, such as [13], postulate that it is possible to extract fine hand gestures and poses based on EMG signals, thus providing the same information required by an image processing approach by using a different signal space.
This dissertation focuses on the application of image processing to HCI and as such the construction of the final system has to go through the four stages of HCI as described in [14]:
1. Gesture Modeling
2. Gesture Analysis
3. Gesture Recognition
4. Application Integration
Of these stages, the first three will be described within this chapter whilst the last stage is discussed together with the implementation in Part II.
Figure 3.1: Gestural Mathematical Modelling [14]
3.2 Gesture Modelling
The modelling of gestures can be both temporal and/or spatial in nature, with separate sections dealing with different sets of parameters. A generic description of gesture modelling deals with the mathematical modelling of both spatial and temporal hand gestures, with the approach modified depending on the required final accuracy of the system. The approach taken in mathematically modelling a gesture is described in Figure 3.1. It shows that everything originates from the mental concept of the gesture within the user, who attaches significance to that particular gesture. This is then transformed into hand movement, which is the physical realisation of the mental idea, originating from the transformation occurring between the cognition and motor areas of the gesturer's brain. The last transformation describes the conversion from hand movement to visual images and describes mathematically what is viewable from a chosen point of view. This is the mathematical description of what occurs in the observer's process of deciphering the hand gestures of the gesturer.
Temporal modelling describes gestures as having preparation, nucleus and retraction stages. The nucleus contains the data originating from the actual gesture, while the preparation and retraction stages are concerned with the actions preceding and succeeding the nucleus respectively. Spatial modelling, on the other hand, classifies gestures as either appearance based or 3D model based.
3.2.1 Appearance Based
Appearance based spatial modelling depends on the inference of the hand position from images, with variety in the inference method used. One such method of inference is through the use of deformable 2D templates such as those found in [15] and [16]. Another appearance based method uses 2D hand image sequences as gesture templates, just like in [15] and [17]. Appearance based modelling is also carried out using
motion history images where the image consists of accumulated pixels over a temporal
window. This results in an image that can be compared to identify a limited set of
gestures as described in [18]. Overall, these methods are mostly used for the recognition
of gestures rather than hand poses.
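To make the motion history idea concrete, the following is a minimal sketch assuming a per-pixel motion mask has already been computed (for example by frame differencing); the decay step and the value tau are illustrative choices rather than parameters taken from [18].

using System;

// Minimal motion-history-image update over a flattened image buffer. Pixels that
// moved in the current frame are set to 'tau'; all other pixels decay towards zero,
// so recent motion appears brighter than older motion in the accumulated image.
public static class MotionHistory
{
    public static void Update(float[] history, bool[] motionMask, float tau, float decay)
    {
        for (int i = 0; i < history.Length; i++)
        {
            history[i] = motionMask[i]
                ? tau                                   // pixel moved in this frame
                : Math.Max(0f, history[i] - decay);     // otherwise fade out older motion
        }
    }
}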
3.2.2 3D Model Based
On the contrary, 3D model based spatial modelling is focused mainly on inferring the
accurate 3D pose of the hand. There are two main approaches that deserve mention in
this section and these are Volumetric Models and Skeletal Models.
Volumetric Models
Volumetric models describe the 3D appearance of the hand, and a common way of determining pose is by comparison through an analysis-by-synthesis approach as described in [20] and [22]. In this approach, the idea is to synthesise a 3D volumetric model and then, through cross referencing with the original image, vary the parameters of the volumetric model until its appearance matches that of the image. This is in effect a brute force search and thus computationally expensive, apart from the complexity of generating an accurate enough 3D model to do the comparison with. Moreover, the parameter space that needs to be varied is too big, hence real time operation is not currently feasible, but there are ways of optimising the process.

One such optimisation, described in [24], centres around the use of generalised cylinders in the generation of the volumetric model instead of complex surface meshes, which results in less graphical computation in the generation phase of the algorithm but still leaves a big parameter space to search through. Thus, in an effort to reduce this parameter space, these generalised cylinders, each of which has three parameters (length, joint angle and radius), are reduced to two parameters by removing the radius. This results in a skeletal model which has a 19 dimension decrease in the search space.
Skeletal Models
The skeletal model, as its name implies, models the hand upon its skeleton. The skeleton is made up of eight carpals (wrist bones), five metacarpals (palm bones) and 14 phalanges (finger bones) for a total of 27 bones. Each finger, as seen in Figure 3.2, is split into the Proximal, Middle and Distal Phalanges. All of these phalanges are capable of flexion/extension, which refers to the curling of the fingers, meaning they have two Degrees Of Freedom (DOF). Proximal phalanges are also capable of adduction/abduction, which refers to movement in the plane of the palm when spreading the fingers, thus they have three DOF. Therefore the fingers have a total of 43 DOF to estimate in order to obtain the best fit. These DOF form what is called the local hand pose, which describes just the fingers. The six DOF indicating the position and rotation of the palm together with the local hand pose then form the global hand pose, having a total of 49 DOF.
Figure 3.2: Labelled Hand Skeleton [19]
Fortunately, the joints are not independently actuated and this results in further simplification of the skeletal model. This characteristic of the human hand is exploited in [21] to reduce these degrees of freedom to just 26 by relating the flexion angles of the phalangeal joints to one another. This reduction in the search space is also obtained by not taking into consideration the possibility of adduction and abduction of the proximal phalanges. All of these optimisations result in a much faster implementation of hand pose recognition using a skeletal model.
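To illustrate how such a coupling removes degrees of freedom, the short sketch below models a single finger with the distal joint angle derived from the proximal interphalangeal angle; the 2/3 ratio is a commonly cited approximation in the hand modelling literature and is used here as an assumption, not necessarily the exact relation adopted in [21].

// Reduced finger model: instead of estimating every phalangeal flexion angle
// independently, the distal joint is coupled to the proximal interphalangeal joint,
// removing one degree of freedom per finger from the search space.
public struct FingerPose
{
    public double McpFlexion;   // metacarpophalangeal flexion (radians)
    public double PipFlexion;   // proximal interphalangeal flexion (radians)

    // Distal interphalangeal flexion inferred from the PIP angle.
    // The 2/3 ratio is a commonly used approximation (an assumption here).
    public double DipFlexion => (2.0 / 3.0) * PipFlexion;

    // Total curl of the finger, usable as a single scalar feature.
    public double TotalFlexion => McpFlexion + PipFlexion + DipFlexion;
}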
3.3 Gesture Analysis
Gesture Analysis is the computation of model parameters from input video. These can
include features such as hand localisation and also hand tracking. Gesture Analysis is
subdivided into two sequential processes starting from detection and extraction of image
features followed by the actual computation of the model parameters from these features.
3.3.1 Feature Detection
Feature detection is a vast and significantly important area of image processing because it is the fundamental step in every image processing application: it allows the processing effort to be concentrated on specific objects or regions of interest. The following headings detail various possible feature extraction methods, which can be used separately but also benefit greatly from being combined, since using multiple extraction methods at once overcomes the limitations present in using just one feature.
Colour Cues
Colour cues distinguish an object of interest in the scene that has a particular colour or colour region. This is particularly important in the case of hand pose because it can easily be used to extract the hand by using the characteristic colour footprint of skin. The most direct approach would be to classify skin pixels as those lying within a specific RGB range, but this approach suffers severely in the presence of varying illumination conditions. Therefore, the approach is valid but that specific colour space is not suitable for consistent results. Consequently, colour spaces that are less sensitive to illumination changes are chosen, such as the YCbCr or HSV spaces. These spaces are distinct from the RGB space in that they have a separate illumination component that can be left aside, with thresholding applied to the other dimensions of the image, such as the work implemented in [40] and [41]. This type of feature extraction benefits from scale and positional filtering to further improve and refine the segmentation results.
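As a concrete illustration of this kind of chrominance thresholding, the sketch below converts an RGB pixel to YCbCr using the standard BT.601 conversion and classifies it as skin if its Cb and Cr values fall inside a fixed rectangular range; the threshold values are illustrative assumptions, not the ellipse parameters used later in this work.

// Chrominance-based skin classification: the luminance component (Y) is ignored so
// that the decision is less sensitive to illumination changes than a raw RGB threshold.
public static class SkinSegmentation
{
    // Illustrative Cb/Cr bounds; a real system tunes these (or fits an ellipse) from training data.
    const double CbMin = 77, CbMax = 127;
    const double CrMin = 133, CrMax = 173;

    public static bool IsSkin(byte r, byte g, byte b)
    {
        // BT.601 RGB -> YCbCr chrominance components (full-range form).
        double cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b;
        double cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b;

        return cb >= CbMin && cb <= CbMax && cr >= CrMin && cr <= CrMax;
    }
}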
Motion Cues
Motion cues deal with the segmentation of moving objects from non-moving objects. In the case of hand detection, this feature extraction method is useful in conjunction with a fixed camera and a stationary background, where the only moving feature to extract is the hand.
Silhouette
Silhouette feature extraction is the easiest and most frequently used, but it has the disadvantage that it results in a loss of information. It especially affects the performance of 3D pose estimators because no depth information is available in a silhouette [23].
Contours
Contours, on the other hand, are different from silhouettes in that they do not focus solely on extracting lines appertaining to the perimeter of the object of interest but also provide information on lines within the object's boundary. This makes contour extraction useful in 3D model based analyses, and it is also used in algorithms which perform image-contour to model-contour matching as a means of registering and identifying the current pose. Examples of this work can be seen in [24] and [53]. In the latter, points on the boundary of an observed image are matched with those on an object model found in a database of object models through the comparison of their feature parameter arc lengths. Then, comparing the relative locations of adjacent matched points in the image with their positions in the model provides a geometric transformation that describes the rotation and position of the object in the image with respect to the model in the database.
Fingertip Locations
Fingertip location is another frequently used feature for extracting the hand pose from an image. In this case, the detection of fingertips is not simple to implement unless external aids such as coloured gloves or markers are used, like the work featured in [21]. The main disadvantage here is that this method is very susceptible to occlusions, but this is solved in [25] by estimating the occluded fingers through the use of a 3D model of the gesture.
3.3.2 Parameter Computation and Extraction
Subsequently, after successfully extracting the required features through the use of one or multiple feature extraction methods, the next step in building an HCI is the estimation of the 3D model parameters. This entails the estimation of the joint angles together with the lengths of the phalanges and palm so as to populate the values for all the DOFs of the system. Parameter extraction is divided into two steps.
Initial Parameter Estimation
There are multiple approaches for estimating the values of the initial parameters of the hand, which can be made simpler by providing the finger lengths a priori, thus further reducing the state space search. This leaves only the joint angles to be estimated, which can be solved through an inverse kinematics approach to the hand. The problem is that inverse kinematics allows for multiple solutions, which might not be the true solution, and is also computationally expensive. Another way of reducing the search space is by further constraining the model parameters, such as by removing moving joints that are unnecessary for the application.
Parameter Update
This computational block within parameter computation and extraction takes care of tracking after the first initialisation step. One way of doing this is through smoothing using Kalman filtering and through prediction of the spatial location of the required features of interest. The downside of this approach is that it only works in the presence of small motion displacements, hence good initialisation of the parameters is crucial.
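To make the smoothing and prediction step concrete, the following is a minimal constant-velocity Kalman filter for a single tracked coordinate (for example the x position of a fingertip); the noise values are placeholders and the sketch is an illustration of the technique rather than the filter used in this project.

// Minimal 1-D constant-velocity Kalman filter: it tracks the position and velocity of a
// single feature coordinate, predicting where the feature will be in the next frame and
// smoothing the noisy measurements produced by feature detection.
public sealed class ConstantVelocityKalman
{
    double p, v;                                  // state: position and velocity
    double P00 = 1, P01 = 0, P10 = 0, P11 = 1;    // state covariance

    readonly double q;                            // process noise (placeholder value)
    readonly double r;                            // measurement noise (placeholder value)

    public ConstantVelocityKalman(double initialPosition, double processNoise = 1e-2, double measurementNoise = 1.0)
    {
        p = initialPosition;
        q = processNoise;
        r = measurementNoise;
    }

    // Predict the position after a time step dt under the constant-velocity motion model.
    public double Predict(double dt)
    {
        p += v * dt;

        // Covariance prediction P = F P F^T + Q (Q added to the diagonal here).
        double n00 = P00 + dt * (P10 + P01) + dt * dt * P11 + q;
        double n01 = P01 + dt * P11;
        double n10 = P10 + dt * P11;
        double n11 = P11 + q;
        P00 = n00; P01 = n01; P10 = n10; P11 = n11;
        return p;
    }

    // Correct the prediction with a measured position (measurement matrix H = [1 0]).
    public void Update(double measuredPosition)
    {
        double y = measuredPosition - p;          // innovation
        double s = P00 + r;                       // innovation covariance
        double k0 = P00 / s, k1 = P10 / s;        // Kalman gain

        p += k0 * y;
        v += k1 * y;

        // Covariance update P = (I - K H) P.
        double n00 = (1 - k0) * P00, n01 = (1 - k0) * P01;
        double n10 = P10 - k1 * P00, n11 = P11 - k1 * P01;
        P00 = n00; P01 = n01; P10 = n10; P11 = n11;
    }
}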
3.4 Gesture Recognition
After combining gesture modelling and gesture analysis, the next step in an HCI is gesture recognition, where information from the previous steps is compared in order to extract the hand pose. Gesture recognition therefore relates the choice of gestural models and their parameters with the ones extracted during the gesture analysis stage. Focusing on model based mathematical modelling, gesture recognition branches into three main sections as described by [26].
3.4.1 Feature Extraction Matching
This branch of gesture recognition focuses on comparing the features extracted from the captured image with the features obtained by varying the parameters of a 3D model. There are differing feature categories that can be used for gesture recognition.
High Level Features
High level features used for gesture recognition range from marker based methods to fingertip detection, with the aim of comparing models using a high level approach which has a much smaller search space compared to more low level approaches. One such feature comparison method, using fingertip positions as the matching parameter, can be found in [35], where Gabor filters are used to extract the fingertip locations and comparison is done by varying a skeletal model.
Low Level Features
Low level features such as silhouettes and contours, as described earlier, are also used in matching feature parameters as in [33], but are mainly used to quantify the error in the model fitting section of different gesture recognition methods, such as the single and multiple hypothesis methods mentioned further on.
3D Features
Another approach to feature matching uses 3D features directly and is approached with the use of a stereo camera setup or a range sensor. The disadvantage of this approach is frequently the additional computational complexity required to handle 3D models; nonetheless, 3D information is very valuable because it contains information that allows for easier handling of occlusions when they occur. An example of this can be seen in [37].
3.4.2 Single Hypothesis Tracking
Single hypothesis tracking approaches to gesture recognition are concerned with the
prediction of a single future hypothesis and can be subdivided into optimisation based
methods or physical force models.
Optimisation Based Methods
Single hypothesis tracking makes use of standard optimisation techniques like Unscented Kalman filtering [28] and genetic algorithms [34], together with other logical approaches such as a divide and conquer approach mentioned in [31]. In the divide and conquer method, the global position of the hand is first estimated, followed by the iterative estimation of the successive joint angles until convergence is achieved.
Physical Force Models
Single hypothesis tracking also makes use of physical force models that attract the model towards the observed image through the application of a force proportional to the separation of the two. This type of approach can be seen in the Iterative Closest Point (ICP) algorithm. One such implementation of the ICP algorithm is found in [37], where the forces were used for the registration of an articulated 3D model with the observed image.
3.4.3 Multiple Hypothesis Tracking
Multiple hypothesis tracking, in contrast with single hypothesis tracking algorithms, generates all the possible future poses given the current pose and thus creates a search space to sift through for registration of the next hand pose based on Bayesian probability. Such gesture recognition algorithms make use of Particle Filters, Tree Based Filters and template database searches, amongst others.
Particle Filters
Particle filters are implemented through the use of a recursive Bayesian filter using Monte Carlo simulations [36]. Their implementation is based on importance sampling, which is the calculation of the relevance of each particle in the filter with respect to the required pose, followed by the removal of low probability occurrences as per [38]. The main disadvantage of this approach comes from the number of particles implemented, because both accuracy and computational complexity are proportional to this number. Hence a trade-off between accuracy and computation time is required in order to effectively implement the system in a real time environment [32].
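As a sketch of what sequential importance sampling and resampling involve, the code below operates on an abstract pose parameter vector; the likelihood is left as a delegate because it depends on the chosen image features, and systematic resampling is used here as one common choice rather than the specific scheme of [36] or [38].

using System;
using System.Linq;

// Minimal sequential importance resampling particle filter over an abstract pose
// vector. The likelihood delegate scores how well a hypothesised pose explains the
// observed image features; low-weight hypotheses are removed during resampling.
public sealed class ParticleFilter
{
    readonly Random rng = new Random();
    double[][] particles;   // each particle is one hypothesised pose parameter vector
    double[] weights;

    public ParticleFilter(double[][] initialParticles)
    {
        particles = initialParticles;
        weights = Enumerable.Repeat(1.0 / initialParticles.Length, initialParticles.Length).ToArray();
    }

    public void Step(Func<double[], double[]> motionModel, Func<double[], double> likelihood)
    {
        int n = particles.Length;

        // 1. Propagate each hypothesis through the motion model (prediction).
        for (int i = 0; i < n; i++) particles[i] = motionModel(particles[i]);

        // 2. Importance weighting: score each hypothesis against the observation.
        double total = 0;
        for (int i = 0; i < n; i++) { weights[i] *= likelihood(particles[i]); total += weights[i]; }
        for (int i = 0; i < n; i++) weights[i] /= total;

        // 3. Systematic resampling: duplicate likely hypotheses, drop unlikely ones.
        var resampled = new double[n][];
        double step = 1.0 / n, u = rng.NextDouble() * step, cumulative = weights[0];
        for (int i = 0, j = 0; i < n; i++)
        {
            while (u > cumulative && j < n - 1) cumulative += weights[++j];
            resampled[i] = (double[])particles[j].Clone();
            u += step;
        }
        particles = resampled;
        weights = Enumerable.Repeat(1.0 / n, n).ToArray();
    }

    // Mean of the (equally weighted) particle set, usable as the estimated pose.
    public double[] Estimate()
    {
        var mean = new double[particles[0].Length];
        foreach (var particle in particles)
            for (int k = 0; k < mean.Length; k++) mean[k] += particle[k] / particles.Length;
        return mean;
    }
}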
Tree-Based Filters
Tree based filters, described in [29], aim to segment the range of possible poses through a grid based approach where high level grids partition all the possible poses into coarse cells, and each coarse cell further divides into finer and finer pose definitions as tree nodes are traversed, until the best fit pose described by the observed image is found. This approach is computationally efficient compared to a database-wide search because of the speedup obtained, which is exponentially proportional to the number of branches per level. Compared with other methods, this approach requires the precursory building of the tree from a database of poses through different tree construction methods such as those described in [30].
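The coarse-to-fine idea can be sketched as a greedy descent through such a tree; practical implementations like [29] keep multiple branches alive with associated probabilities, which is omitted here for brevity, and the match error function is assumed to be supplied by the feature matching stage.

using System;
using System.Collections.Generic;

// Coarse-to-fine pose lookup: every node covers a range of poses and stores one
// representative pose. Descending into the child whose representative best matches
// the observation narrows the search at every level of the tree.
public sealed class PoseNode
{
    public double[] RepresentativePose;
    public List<PoseNode> Children = new List<PoseNode>();

    public double[] FindBestPose(Func<double[], double> matchError)
    {
        PoseNode current = this;
        while (current.Children.Count > 0)
        {
            PoseNode best = null;
            double bestError = double.MaxValue;
            foreach (var child in current.Children)
            {
                double e = matchError(child.RepresentativePose);
                if (e < bestError) { bestError = e; best = child; }
            }
            current = best;                     // refine into the closest coarse cell
        }
        return current.RepresentativePose;      // leaf: finest available pose definition
    }
}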
Template Database Search
This database search is similar to the tree based filter approach described earlier, but instead of having multiple consecutively positioned layers as in the case of the Tree-Based Filters, it consists of a single layer flat database. Each pose has connections that link it with other poses, where these connections describe the various hypotheses related to the next possible pose. Thus, instead of searching through a tree, consecutive poses can be found by continuously traversing these connections. The results are also further improved through a feature extraction matching algorithm, mentioned previously, which provides faster and better matching, as can be seen in [27].
3.5 Conclusion
This chapter presented the main approaches related to the various components and stages present in a gesture recognition application. It provides the reader with an overview of the vast spectrum of possibilities in the construction of such an application. The next chapters deal with the implementation of the theory found in Chapters 2 and 3.
Part II
Methods and Methodology
Chapter 4 - Augmented Reality
4.1 AR Hardware Design
Chapter 2, detailing the theory and requirements behind the implementation of an AR setup, ended by concluding that the sweet spot in terms of price and functionality comes in the form of a head mounted device containing a pixel based display and external cameras to provide the surrounding environment to the user. After sifting through different out-of-the-box implementations of AR hardware, the Oculus Rift [6] was chosen as the final base hardware upon which the rest of the system is built.
4.1.1 Visual Hardware
The Oculus Rift Development Kit 1 (Oculus Rift DK1), shown in Figure 4.1, has a screen with a 1280 by 800 resolution positioned a few centimetres away from the user. This screen is partitioned in the middle such that half a screen is assigned to each eye. This means that the resolution is also divided between the eyes horizontally, resulting in a resolution of 640 by 800 per eye at a 60Hz refresh rate. This resolution is slightly low, but the device makes up for it with a wide Field of View (FOV). In general, the extremities of the screen are well within the eye's FOV and this results in a very limited viewing range, which has been the problem with most virtual reality headsets. The Rift takes care of this problem by introducing custom lenses which expand this small field of view to a 110 degree vertical FOV and a 90 degree horizontal FOV. This is an important parameter when choosing an AR device because it dictates the immersivity of the user. A wider FOV means that the user is more immersed because, upon wearing the headset, the screen fills his field of vision.
Figure 4.1: Oculus Rift Development Kit 1 [6]
For comparison, according to [7] the human eye has a total horizontal FOV of about 200 degrees; however, visual acuity is present in varying degrees over this field of view, with the majority focused within 120 degrees. This means that the Rift provides a FOV comparable with what is required by the human eye, but at the expense of providing a warped image to the user, which therefore adds some degree of computational complexity in neutralising the warping.
4.1.2 Tracking Hardware
Since the application will be based in a virtual world, according to Chapter 2 a method for tracking the head's orientation is required. This head tracker is packaged with the Oculus Rift out of the box. The Rift includes a 1000Hz inertial head tracker using a gyroscope, an accelerometer and a magnetometer for head tracking over the full three rotational degrees of freedom. This tracking is also crucial for the user experience. It is used within the software to track the head so as to provide the visuals related to the user's viewing direction in the virtual world. An important parameter in the tracking section is the latency between the head movement and the shifting of the view in the software. If this is discernible to the user, the virtual experience suffers. With a sampling frequency of 1000Hz, the Oculus Rift achieves an average latency of 2ms, which covers even rapid head movements.
4.1.3 Required Camera Parameters
The Oculus Rift by itself is a virtual reality device, meaning that it is engineered to envelop the user in a completely artificial world, but it can be converted to an augmented reality device through the addition of a camera as described in Chapter 2. The aim of the camera in this hardware is to provide the user with an image of his current surroundings. In this particular case, the camera is to be a complete substitute for the eye and hence there is a requirement not for one but for two cameras, with each camera providing the perspective relevant to each eye. Thus the camera choice is crucial if a realistic impression of the surroundings is to be provided to the user. In order to do this, according to [9], the specifics of the eye with regard to the capabilities of the Rift have to be analysed first, and a camera that fits most of these ideal specifics chosen to complete the AR hardware setup.
Starting from the resolution, the camera needs to have a resolution greater than the Rift's 800 by 600 while keeping the aspect ratio of 1.33:1. Secondly, it is ideally required to have a FOV comparable to the Oculus display; in this respect, by taking into account the aspect ratio and the Oculus FOV, a horizontal FOV of 120 degrees gives a vertical one of 90 degrees, which is a good fit for this setup. The next specification is the frame rate, which is ideally a minimum of 60 frames per second (fps) to match the refresh rate of the Rift. These three specifications cover the most important parameters with respect to the cameras. Beyond these, further considerations for the ideal interface would include High Dynamic Range (HDR) and synchronisation of both cameras. HDR operates by taking pictures at high, low and intermediate exposure levels and then blending these together to provide better looking images which have better contrast, because bright and dark objects can be viewed together with equal detail. This imitates the eye better because the eye's exposure varies quickly depending on the focused region, thus providing better quality images in difficult lighting scenarios compared to conventional cameras.
4.1.4 Camera Choice
The final choice was the Logitech C905, shown in Figure 4.2. It has a maximum resolution of 1600 by 1200, which allows a maximum frame rate of 30fps to be reached at lower resolutions that are better suited to the Rift. At 30fps it is lower than the specified requirements, but this is the current limit allowed by the bandwidth of the USB 2.0 standard. Higher frame rate cameras at high resolutions are for now still in development and are based on the USB 3.0 communication interface, which has the required bandwidth for high frame rate, high resolution video streaming. The field of view of the camera is also low, between 60 and 75 degrees, but this is enhanced with the addition of wide angle lenses that boost the FOV to a maximum of 130 degrees in the horizontal direction. Finally, it has no HDR and no camera synchronisation features, but this is compensated for by its availability at a low price for a decent system. Furthermore, since the resolution of the Oculus is still low, these features would only have a minimal effect on the final result in this case.
Figure 4.2: Logitech C905 webcam [10]
4.2 AR Software Design
The software in this section acts as the intermediary between the hardware and the image processing part of the project. Thus this software has the responsibility of providing the augmented reality visuals to the user whilst communicating with the image processing software to carry out manipulation actions. Specifically, it has seven main functions:
1. Render the virtual world together with all of its virtual objects and their parameters
2. Take care of the physics of interaction between separate virtual objects
3. Provide an image to the user that takes into consideration the optics of the Oculus
together with the screen layout in order to provide Side by Side 3D (SBS3D).
4. Fuse head tracking data originating from the different sources present in the Rift
to provide accurate head tracking within the virtual world
5. Interface the real world imagery with the computer generated imagery
6. Provide a virtual imitation of a user's hand for interaction with objects
7. Communicate with MATLAB which provides the input interaction data extracted
through image processing.
4.2.1 Unity 3D
The first step in implementing the software is achieved by setting up the virtual environment required, as described by the first item in the list. To this end, the program of choice is Unity 3D. Unity 3D [42] is a full featured game development environment that has the tools required for the construction of a virtual world containing the objects that are used to augment the user's world. Any kind of object can be constructed with the tools provided in the Unity environment but in this case augmentation, as a proof of concept, is provided by the use of cubes and spheres.
Unity Object Properties Overview
Unity objects in general have a multitude of properties to control in order to provide both realistic visualisations and complex interaction rules with other objects. Listed below are some of the most important properties together with a description of their function with respect to the object:
Transform - The Transform of an object stores all the information related to the po-
sition of the object in 3D space together with its size. It contains values for the [x,y,z]
position, the [x,y,z] angles and the scale of the object in each separate dimension.
Collider - The Collider of an object defines the invisible boundary around an object that is used to detect collisions with the object. This boundary is separate from the visual boundary of an object that defines how it looks, as shown in Figure ??. The cube is the object boundary and the green sphere around it is the collision boundary attributed to the cube. The collision boundary can be set to either one of the standard Colliders or even to an arbitrary user defined mesh. These Colliders can be further enabled and disabled by turning physics on or off for the particular object. Apart from defining the collision boundary, the Collider also defines the simulated material, be it metal, wood, rubber or a user defined material, in order to provide different collision behaviours.
Scripts - Scripts attached to an object are used to define complex custom rules of inter-
action with other objects and also provide a way of manipulating the general behaviour
and movement of an object. These scripts can use either the JavaScript or the C# lan-
guage, providing a very wide range of functionality and flexibility to the object creator.
Unity Physics Engine
The Unity game engine also takes care of the physics of interaction between objects
through the use of a physics engine. This provides realistic collision simulations even
in complex situations involving multiple simultaneous collisions from different angles.
The physics engine, in terms of the project's scope, takes care of checking if there are
any collisions in the current frame. A collision is flagged if the Collider of one object
intersects with the Collider of another. When this occurs, and if both objects have physics
interactions enabled, a force results that depends on the simulated material of the objects
and their momentum, just like in the real world. Further features of the physics engine
are the selective addition of gravity to the objects and the manipulation of gravity by
increasing it, decreasing it and also applying it in different directions, allowing the user
to experience scenarios that are not usually possible in the real world, which makes for
interesting augmentation scenarios.
Unity Camera
The Unity camera is a special object inside of Unity because it provides the viewpoint
from which the scene is rendered. Before the user can view the image on the screen
during the normal execution of the game, first the position and visual parameters of the
different objects in the game are updated. Then, an image is taken from the viewpoint of
the Unity camera, also called the virtual camera from here on. This image is then
processed to apply different effects before being presented to the user's eyes on the
display.
4.2.2 Oculus Unity Libraries
The third item on the list of requirements is achieved through the use of a library provided
by Oculus VR that integrates the hardware with the Unity game engine. This integration
provides a prefabricated (prefab) object that contains two virtual cameras mounted side
by side that mimic the user's eyes in the virtual world. The prefab is called the
OVRCameraController and its purpose is fundamental to the project. First of all, the
prefab takes care of registering two different views of the virtual world, which is essential
in order to view the virtual objects in 3D. This is achieved, as mentioned previously, by
having two virtual cameras attached side by side at a user specified distance from each
other, called the Inter Pupillary Distance (IPD), and also at a specified user height to
provide the illusion of real world scale as well.
Secondly, as mentioned in the hardware section, the Oculus headset optics provide a
wider FOV at the expense of warping. This prefab takes into consideration this warping
effect and applies the inverse warping function to the image before presenting it to the
user. In theory this is all that is required in order for the user to view the original image
as it was generated. However, in practice there is another distortion effect resulting from
imperfect lenses: chromatic aberration.
Chromatic aberration is the failure of a lens to focus all the colours to the same con-
vergence point [56]. This arises because the warping of the image to increase the FOV is
carried out by having different refraction indices in the lens depending on the distance
from the lens centre. Therefore chromatic aberration is the direct result of different colour
wavelengths being refracted through different paths of the eyepiece lens. This translates
to the same colour pixel having a different appearance according to whether it is found
near the centre of the eye's vision or closer to the fringes. Consequently, the prefab also takes
care of this chromatic aberration by applying a texture shader to the whole image. This
texture shader modies the colour of each pixel depending on its distance from the cen-
tre of the screen in order to counteract the distortion effect.
After applying the inverse distortions to the images originating from each virtual camera
individually, the warped and chromatically corrected images are mounted side by side,
which is the final step before sending the image to the user wearing the headset, thus
satisfying the third item in the list. The fourth requirement of the software is head
tracking, which is also provided by the OVRCameraController prefab. The scripts
attached to this object process the data from the accelerometer, gyroscope and magne-
tometer present in the Oculus in order to extract the absolute head direction in terms of
the rotational parameters: pitch, yaw and roll.
4.2.3 VR to AR
The previous items in the list are required to be able to simulate a virtual world to the
user. This section now details how to convert from a VR to an AR application. First of
all, the software must allow the user to view his environment through the use of physical
cameras as described before.
Therefore, a list of all available physical cameras is obtained and the appropriate camera
is chosen to represent each eye. The image from each camera is projected onto separate
plane objects inside of Unity resulting in a plane that has the video feed for the right eye
and another with the video feed of the left eye. The OVRCameraController prefab as de-
scribed previously imitates the eyes in the virtual world by having two virtual cameras.
Therefore, to route the image of the user's surroundings to the required virtual camera, the
plane with the video feed for the right eye is made visible exclusively to the right virtual
camera and the same is done for the left video feed.
The separate planes are structured so as to fill the field of vision of the virtual cam-
eras attributed to them. They must also follow the user's head such that they are always
directly in front of the virtual cameras of the prefab. This is achieved by setting the
Transform of each plane to be identical to that of the camera with an added shift in one
dimension such that the screen is directly in front of the eye at all times, as shown
in Figure ??. In the general case, all objects are visible to all the virtual cameras im-
plemented in Unity, but if this setting were kept, each eye would see two overlapping
screens.
Next, the cameras are set to render in an orthographic mode instead of the default per-
spective mode. In this mode, the view of the camera is rendered as is in 2D which
removes all perception of depth. If the perspective mode is used, the planes with the
camera feeds would be observed simply as displays in the context of the virtual world
but with 2D rendering, the effect is reversed. This way, the virtual objects are viewed in
context of the real world and not vice versa.
Finally, different users require different screen spacings depending on the convergence
point of the eyes, hence a method of offsetting the plane alignments is integrated, allowing
the user to bring both screens into convergence in both the horizontal and the vertical
direction through the use of the keyboard. This allows the user to carry out the alignment
according to his or her personal preference. This is a one
time user calibration that can be saved and loaded each time the particular person uses
the program in a real application.
This procedure so far details the implementation required in order to view the surround-
ing environment within the virtual world. In order to be able to view virtual objects
in the context of this environment, a further step is necessary. As described before,
the two cameras that come with the OVRCameraController prefab are set to render in
orthographic mode for the camera feeds. This makes them unsuitable for satisfactorily
rendering the virtual 3D objects that provide the augmentation. These objects must be
rendered in perspective mode. Thus the prefab is modified by adding two more virtual
cameras that solely render the virtual objects in perspective mode.
The Transform of these additional virtual cameras is set to be equal to the Transform of
the orthographic cameras so that the viewing direction is identical for both sets. Finally
in order to join the images of the separate sets of cameras together, the depth parameter
of the camera sets is modied. The depth parameter setting determines the positioning
of the images during the overlaying process. The depth for the perspective cameras is
thus set to one and that for the orthographic cameras to zero so that the virtual objects
are always overlaid on top of the camera feed.
4.2.4 Virtual Hand
The next item in the list is the implementation of a virtual hand within Unity that is able
to imitate the real hands and as such has all the required degrees of freedom of a real
hand. The hand is obtained from [?]. It has a realistic deformable skin mesh and allows
control of all joint angles together as required by the image processing algorithm that
will extract the hand pose. It does not, however, come equipped with a Collider, which is
essential for the physics interaction inside of Unity. The first approach one could
take in providing a Collider is to use a custom mesh that deforms together with the
hand. In this way, the physics interactions are as realistic as possible.
This is, however, not feasible, both due to the implementation complexity and due to
the resulting computational cost of having a deformable Collider. Therefore a
suitable approximation is taken and each phalanx is represented by a separate capsule
Collider fitted to the phalanx's width and length, a capsule being described geometrically
as a cylinder with spherical caps. The palm is also modelled by a flattened capsule Collider.
This provides the hand depicted in Figure ?? with all the Colliders represented by the
green outlines.
In preparation for the interface that is to be constructed between the image processing
software and Unity, the hand is also fitted with a script that applies the angles sent from
the image processing algorithm. This is achieved through programming in a JavaScript
environment in order to access the Transform parameters of the hand and modify them at
runtime. The received angles are not applied directly to the hand but are first thresholded
to a certain allowable range of movement so as not to produce unnatural poses that can
break the illusion. In the case of an AR application, the virtual hand is invisible to the
user while still providing the physics interaction between objects during manipulation,
but the hand could also find application in mimicking the human hand in a virtual
environment, thus solving the problem of not having one's limbs present in a virtual
reality.
4.3 Unity-MATLAB Communication
The final item in the required list of functions for an AR system is the Matlab-Unity com-
munication interface. This is required in order to have a flexible testbed, namely Matlab,
in which to prototype and test the image processing algorithms; at the same time, having
the two programs communicate with each other allows the algorithms to be refined not only
in terms of quantitative data but also in terms of the user experience. The communication
channel found to be common to both software packages was ethernet communication,
hence a method of linking the two programs was required, and the solution was
the use of a local network server through the Lidgren Network Library [43].
The Lidgren Networking Library is a networking library for the .NET framework which
uses a single UDP socket to deliver an API for connecting a client to a server and for
reading and sending messages. It provides example code for the execution of a client-server
message exchange interface. The stock server code that comes with the example files is used,
but the source files for the client are reverse engineered to extract the core logic of the
interface and strip away the parts pertaining to creating Windows Forms and callbacks. This
core logic is easily implemented in Unity because of the facility of writing object
oriented C# code, but an implementation in Matlab is trickier.
Fortunately, Matlab offers limited support for .NET libraries as long as they are com-
piled with .NET Framework v4 or later. Hence, by recompiling the source files, the re-
quired compatible binaries were built such that the various .NET methods and classes
can be loaded into Matlab using the NET.addAssembly command. This allows Mat-
lab to directly access the methods, and therefore the extracted logic can be ported
from C# to Matlab m-files. This provides for message based communication between
the two programs through the server, but before being able to transmit hand poses, a mes-
sage sequence has to be established.
Since the programs operate separately, a robust approach is to have one of them act as
the master and the other as the client. Since Unity is the final interface to the user,
Unity acts as the master and Matlab as the client, with message passing carried out on a
polling basis. Unity script files have two separate sections: an initialisation function
called Start() that is executed once, and an Update() function that is called every frame
in order to update the state of the object it is attached to. In the Update() function,
Unity therefore sends a request message, "Send", to Matlab. Upon receiving and parsing
this message, Matlab sends back the joint angle data in a comma separated format. This
packet of information is read by Unity in the next Update() call and parsed into an array
to be applied to the hand.
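A minimal sketch of the Matlab side of this exchange is given below. It is illustrative only: the assembly path, application identifier, port number and the helper getCurrentJointAngles are assumptions, and the Lidgren calls shown (NetPeerConfiguration, NetClient, ReadMessage, CreateMessage, SendMessage) are taken from the publicly documented gen3 API rather than from the project sources.

% Minimal sketch of the Matlab client (assumed paths, identifiers and helper functions)
NET.addAssembly('C:\libs\Lidgren.Network.dll');        % .NET v4 build of the library
import Lidgren.Network.*

config = NetPeerConfiguration('HandPoseApp');           % must match the Unity server identifier
client = NetClient(config);
client.Start();
client.Connect('127.0.0.1', 14242);                     % local server hosted alongside Unity

while true
    msg = client.ReadMessage();                         % poll for the request sent by Update()
    if ~isempty(msg)                                    % a full client would also filter on msg.MessageType
        request = char(msg.ReadString());
        if strcmp(request, 'Send')
            angles = getCurrentJointAngles();           % hypothetical: output of the pose algorithm
            reply  = client.CreateMessage();
            reply.Write(sprintf('%.2f,', angles));      % comma separated angle string
            client.SendMessage(reply, NetDeliveryMethod.ReliableOrdered);
        end
    end
    pause(0.005);                                       % avoid busy waiting
end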
4.4 Testing
The guiding principle in testing the AR hardware is that the device should provide as close
an imitation of the eye as possible whilst providing a realistic illusion of additional
virtual objects. The following is a group of tests to determine the performance of the
hardware component of the system.
1. Snellen Test - The Snellen visual acuity chart is one of the standard tests used to
assess the health of the eyes. This qualifies it as an important test for the quantitative
measurement of the performance of an AR headset whose aim is to stand in for the eyes
in the real world. The Snellen chart consists of letters arranged in rows, with each
successive row having smaller and smaller letters. Each row also has attributed to it a
viewing distance which indicates the minimum distance at which the letters should be
viewable and recognisable for a person with average visual acuity. The average visual
acuity standard is 20/20. The viewing distance indicated for average eyesight is defined
by Snellen as the distance in metres at which the letter subtends 5 minutes of arc [48].
According to the procedure described in [44] and using the test found at [49], the user
first takes the test without the Oculus to get a baseline result. Then the Oculus Rift is
worn and the same test is taken again. The ratio between the distances at which the same
letter can just be distinguished indicates the multiplication factor, or magnification
requirement (MAR). The MAR value describes the magnification needed so that the user
can view the letters through the HMD as well as with the eyes directly.
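As a worked illustration of the MAR calculation, with assumed rather than measured values: if a letter that is just readable at 4 m with the naked eye is only readable at 0.4 m through the HMD, the required magnification is ten, and the same figure follows from the Snellen fractions.

% Worked example with assumed values, not measured data
d_unaided = 4.0;        % metres at which a letter is just resolvable with the naked eye
d_hmd     = 0.4;        % metres at which the same letter is just resolvable through the HMD
MAR = d_unaided / d_hmd          % = 10

% Equivalent calculation from Snellen fractions, e.g. 20/20 unaided against 20/200 through the HMD
MAR_snellen = 200 / 20           % = 10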
2. Display Contrast Test - The next HMD parameter related to the visual comparability
of the HMD display with real life is the contrast of the display. This is defined according
to the Michelson contrast equation [50], given in Equation 4.4.1 as the ratio of the
difference to the sum of the maximum and minimum luminance values of an image.
C = \frac{L_{max} - L_{min}}{L_{max} + L_{min}}    (4.4.1)
3. View Registration Test - The last item which is relevant to an AR display based on
a screen and external cameras, is the actual registration of the virtual objects with the real
world objects. This refers to the maintenance of proper alignment of virtual objects with
real world objects through the accurate head tracking of the user. It is therefore centred
around the performance of the head tracker, which has three aspects: the accuracy, the
drift and the noise present in the measured angles. These are tested by positioning the
HMD at different angles while at the same time recording a number of samples at each
angle. The samples are then analysed to calculate the parameters for the drift, noise and
accuracy of the head tracker in order to get a measurement for the performance of the
head tracking.
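A sketch of how the recorded samples could be reduced to these three figures of merit is shown below; the variable names (yawSamples, fs, refAngle) and the data layout are assumptions, not the project's actual logging format.

% Head tracker analysis for one axis (assumed data layout)
% yawSamples : N x 1 vector of yaw readings in degrees, logged with the HMD held at refAngle
% fs         : sampling rate in Hz
refAngle = 30;                                    % known physical angle of the test position
t = (0:numel(yawSamples)-1).' / fs;

accuracyErr = mean(yawSamples) - refAngle;        % accuracy: mean offset from the true angle
noiseStd    = std(detrend(yawSamples));           % noise: spread after removing slow drift
p           = polyfit(t, yawSamples, 1);
driftRate   = p(1);                               % drift: slope of a linear fit, in deg/s

fprintf('accuracy %.2f deg, noise %.3f deg RMS, drift %.4f deg/s\n', ...
        accuracyErr, noiseStd, driftRate);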
4.5 Conclusion
This ends the AR implementation chapter, which detailed the requirements of an AR
implementation and how to tackle it both in hardware and in software. The chapter also
details the tests required to quantitatively and qualitatively measure the performance
of the system. The next chapter deals with the implementation of the image processing
required for the manipulation of the virtual objects created in this chapter.
Chapter 5 - Pose Classification
5.1 Introduction
This chapter deals with the implementation of the pose classification algorithm required
for the manipulation of the virtual objects. The work carried out is along the lines of
the work in [39]. That paper approaches pose recognition by dividing it into three steps.
The first step is segmenting the hand from the image, followed by fitting a 2D model of
the hand, which provides an initial search space for the last step that uses an annealed
particle filter to extract the full 3D hand pose. Due to time limitations, the focus of
this project was concentrated on efficiently carrying out the first two steps to provide
a robust basis for the subsequent particle filtering.
5.2 Skin Segmentation
Skin segmentation is the process of identifying and extracting a skin coloured blob from
an image, which in this case is subdivided into three processes. Firstly, illumination
compensation is carried out to normalise the RGB values of the image. Secondly, the skin
coloured segments are parametrically identified, and lastly, the largest contiguous skin
coloured blob is extracted from the image.
Most skin segmentation algorithms rely primarily on the characteristic colour of the skin,
as shown in [54], but various illumination sources and temperatures result in a differently
perceived skin colour. Therefore, to reduce errors in segmentation and make it more
robust, illumination compensation is introduced as the preliminary
processing of the image. Illumination compensation can be approached in different man-
ners. The most accurate way of doing this is by algorithmically negating the effect of
illumination through calibration with a known light source design as in [55]. This, while
accurate, is not applicable in the general scenario and is hence superseded by other ap-
proaches. The following algorithms represent the various scenarios tested to find the one
with the best results.
After the completion of illumination compensation, a thresholding operation is applied
to the pixel values to identify skin pixels. This thresholding operation is frequently
dependent on the illumination compensation technique used and thus one cannot separate
the two processes. The last step after skin segmentation is finding the largest blob and
returning that blob as the hand for further analysis. This uses the assumption that since
the hand is the nearest skin coloured object to the camera, it will also be the biggest.
The following sections describe the various skin segmentation algorithms that were
considered.
5.2.1 Log Transform Illumination Compensation Method
This first algorithm uses the Log Transform to compensate for illumination and then
applies elliptical fitting on the result in order to extract the skin coloured pixels. The
Log Transform aims to increase the dynamic range of the image by compressing high
values and expanding lower values of the image. The application of the log transform
for illumination compensation is seen in [46], where the transform is applied to an RGB
image converted to the YC_bC_r space. Then, a further processing step is applied such
that pixels having a luminance 1.05 times greater than the average value in the image
are removed and replaced with the average value. The general log transform equation is
shown in 5.2.1.
g(x, y) = a + \frac{\ln(f(x, y) + 1)}{b \ln(c)}    (5.2.1)

where a = 0, b = \frac{1}{255 \ln(1.2)} and c = 255, in accordance with [46].
After compensation using Equation 5.2.1 with the mentioned values for a, b and c, the
plot of C_r vs C_b is generated and skin coloured pixels are extracted through the use of
an ellipse clustering technique. The general equation of the ellipse is shown in Equation
5.2.2 and the parameters, along with their values according to [46], are given in Table
5.1. Points that lie within the ellipse are classified as skin coloured and those that are
outside are discarded. Thus the LHS of the equation is applied to all the points in an
image and then a simple search for values less than one returns the skin pixels.
\frac{(x - d_a)^2}{a^2}\cos\theta + \frac{(y - d_b)^2}{b^2}\sin\theta = 1    (5.2.2)
Parameter    Description           Value
a            major radius          30.4
b            minor radius          10.6
d_a          x-axis offset         148
d_b          y-axis offset         106
\theta       ellipse orientation   2.44
Table 5.1: Ellipse Parameters for [46]
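A compact sketch of this method is given below. It is illustrative only: it assumes the reconstructed form of Equations 5.2.1 and 5.2.2, assumes the compensated channels are rescaled back to the 0-255 range before the ellipse test, and uses the parameter values of Table 5.1; the axis assignment of C_r and C_b should be checked against [46].

% Sketch of the Log Transform method (Section 5.2.1); assumptions noted above
function mask = logTransformSkin(Irgb)
    ycc = double(im2uint8(rgb2ycbcr(Irgb)));            % YCbCr in the 0-255 range

    a = 0;  b = 1/(255*log(1.2));  c = 255;
    g = a + log(ycc + 1) ./ (b * log(c));               % Equation 5.2.1, applied channel-wise
    g = 255 * mat2gray(g);                              % assumed rescaling back to 0-255

    Y = g(:,:,1);                                       % suppress overly bright pixels
    Y(Y > 1.05*mean(Y(:))) = mean(Y(:));
    Cb = g(:,:,2);  Cr = g(:,:,3);

    % Ellipse clustering in the Cr-Cb plane, Equation 5.2.2 with the Table 5.1 parameters
    ea = 30.4;  eb = 10.6;  da = 148;  db = 106;  th = 2.44;
    lhs = ((Cr - da).^2 ./ ea^2) .* cos(th) + ((Cb - db).^2 ./ eb^2) .* sin(th);
    mask = lhs < 1;                                     % points inside the ellipse are taken as skin
end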
5.2.2 Mean Luminance Based Illumination Compensation Method
This approach follows that described in [47]. The illumination compensation used here
is not applied to the image in all frames. Rather it takes into consideration the value
of the average luminance component of the image computed using Equation 5.2.3 and
modifies the R and G channels of the image only if the value lies beyond a certain
range.
Y_{avg} = \frac{1}{N} \sum_{i,j} Y_{i,j}    (5.2.3)

where Y_{i,j} = 0.3R + 0.6G + 0.1B and N is the number of pixels in the image. Now,
according to the range in which the luminance value Y_{avg} is found, a particular value
of \tau results, which is used to calculate the transformed values R' and G' for R and G
respectively by using Equation 5.2.4.

R'_{i,j} = (R_{i,j})^{\tau}, \quad G'_{i,j} = (G_{i,j})^{\tau}, \quad \text{where } \tau = \begin{cases} 1.4 & Y_{avg} < 64 \\ 0.6 & Y_{avg} > 192 \\ 1 & \text{otherwise} \end{cases}    (5.2.4)

After finding the illumination compensated values, skin segmentation is applied by using
solely the transformed C_r values according to Equation 5.2.5.

S_{ij} = \begin{cases} 0 & 10 < C'_r < 45 \\ 1 & \text{otherwise} \end{cases}    (5.2.5)

where C'_r = 0.5R' - 0.419G' - 0.081B.
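A short sketch of this method follows, written directly from the reconstructed equations above; it is illustrative only. Note that Equation 5.2.5 assigns 0 to in-range pixels, whereas the sketch marks them as logical true so that the result can feed the largest-blob step directly.

% Sketch of the mean-luminance method (Section 5.2.2)
function mask = meanLuminanceSkin(Irgb)
    I = double(Irgb);
    R = I(:,:,1);  G = I(:,:,2);  B = I(:,:,3);

    Y    = 0.3*R + 0.6*G + 0.1*B;                 % per-pixel luminance, Equation 5.2.3
    Yavg = mean(Y(:));

    if Yavg < 64                                  % Equation 5.2.4
        tau = 1.4;
    elseif Yavg > 192
        tau = 0.6;
    else
        tau = 1.0;
    end
    Rc = R.^tau;  Gc = G.^tau;                    % compensated R and G channels

    Cr = 0.5*Rc - 0.419*Gc - 0.081*B;             % transformed Cr used in Equation 5.2.5
    mask = (Cr > 10) & (Cr < 45);                 % in-range pixels treated as skin here
end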
5.2.3 Pixel Luminance Based Illumination Compensation Method
This algorithm is based upon work done by Hsu et al. in [45]. In this case, the illumi-
nation compensation algorithm, instead of operating on the whole image, only operates
on pixels that are found outside of the luminance limits. This approach is described as a
nonlinear transformation of the chroma according to the skin model, and it operates on the
basis of the luminance component of the individual pixels. Therefore, the RGB image is
first transformed to the YC_bC_r space. Then, considering C_r and C_b as functions of the
luminance Y, let the transformed chromatic values be C'_r(Y) and C'_b(Y). The skin
colour model can thus be expressed using the centres of the chroma, denoted by
\bar{C}_r(Y) and \bar{C}_b(Y), and the spread of the cluster, denoted by W_{C_r}(Y) and W_{C_b}(Y), as
shown in Equation 5.2.6, with C'_i representing C'_r and C'_b, and with K_h and K_l representing the
higher and lower luminance thresholds respectively.
C'_i(Y) = \begin{cases} \left(C_i(Y) - \bar{C}_i(Y)\right) \dfrac{W_{C_i}}{W_{C_i}(Y)} + \bar{C}_i(K_h) & \text{if } Y < K_l \text{ or } Y > K_h \\ C_i(Y) & \text{if } Y \in [K_l, K_h] \end{cases}    (5.2.6)
The centres of the transformed chroma values are calculated based on their luminance
component value as depicted in Equations 5.2.7 and 5.2.8.

\bar{C}_b(Y) = \begin{cases} 108 + \dfrac{(K_l - Y)(118 - 108)}{K_l - Y_{min}} & \text{if } Y < K_l \\ 108 + \dfrac{(Y - K_h)(118 - 108)}{Y_{max} - K_l} & \text{if } K_h < Y \end{cases}    (5.2.7)

\bar{C}_r(Y) = \begin{cases} 154 + \dfrac{(K_l - Y)(154 - 144)}{K_l - Y_{min}} & \text{if } Y < K_l \\ 154 + \dfrac{(Y - K_h)(154 - 132)}{Y_{max} - K_l} & \text{if } K_h < Y \end{cases}    (5.2.8)

The spread of the cluster, on the other hand, is described by Equation 5.2.9 and is depen-
dent on the cluster high and low limits WH_{C_i} and WL_{C_i}.

W_{C_i}(Y) = \begin{cases} WL_{C_i} + \dfrac{(Y - Y_{min})(W_{C_i} - WL_{C_i})}{K_l - Y_{min}} & \text{if } Y < K_l \\ WH_{C_i} + \dfrac{(Y_{max} - Y)(W_{C_i} - WH_{C_i})}{Y_{max} - K_l} & \text{if } K_h < Y \end{cases}    (5.2.9)
The parameters obtained through the training carried out in [45] are found in Table 5.2.
Parameter    Description                          Value
W_{C_b}      C_b cluster spread range             46.97
WL_{C_b}     C_b cluster spread low parameter     23
WH_{C_b}     C_b cluster spread high parameter    14
W_{C_r}      C_r cluster spread range             38.76
WL_{C_r}     C_r cluster spread low parameter     20
WH_{C_r}     C_r cluster spread high parameter    10
K_l          Lower Y threshold                    125
K_h          Upper Y threshold                    188
Y_{min}      Minimum Y of image database          16
Y_{max}      Maximum Y of image database          235
Table 5.2: Illumination Compensation Parameters for [45]
Parameter    Description                   Value
ec_x         transformed centre x value    1.6
ec_y         transformed centre y value    2.41
c_x          x-axis offset                 109.38
c_y          y-axis offset                 152.02
a            major radius                  25.39
b            minor radius                  14.03
\theta       ellipse orientation           2.53
Table 5.3: Ellipse Parameters for [45]
After illumination compensation, similar to the Log Transform based method, ellipse
fitting is applied to a plot of C'_b vs C'_r. The ellipse fitting in the case of [45] is applied
as a two step process where first, the values are shifted to the origin and rotated using
Equation 5.2.10 and then, the simple ellipse equation 5.2.11 is applied to each point
with the parameters from Table 5.3. After this, a search returning values less than or
equal to one provides the skin coloured pixels.

\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} C'_b - c_x \\ C'_r - c_y \end{bmatrix}    (5.2.10)

\frac{(x - ec_x)^2}{a^2} + \frac{(y - ec_y)^2}{b^2} \le 1    (5.2.11)
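The sketch below strings the reconstructed Equations 5.2.6 to 5.2.11 and the parameters of Tables 5.2 and 5.3 together. It is an illustration assembled from those reconstructions rather than the project code, and should be checked against [45] before being relied on.

% Sketch of the pixel-luminance (Hsu et al.) method of Section 5.2.3 (illustrative only)
function mask = pixelLuminanceSkin(Irgb)
    ycc = double(im2uint8(rgb2ycbcr(Irgb)));
    Y = ycc(:,:,1);  Cb = ycc(:,:,2);  Cr = ycc(:,:,3);

    % Parameters from Table 5.2
    Kl = 125; Kh = 188; Ymin = 16; Ymax = 235;
    WCb = 46.97; WLCb = 23; WHCb = 14;
    WCr = 38.76; WLCr = 20; WHCr = 10;

    lo = Y < Kl;  hi = Y > Kh;  out = lo | hi;

    % Chroma centres, Equations 5.2.7 and 5.2.8 (as reconstructed above)
    CbBar = 108*ones(size(Y));  CrBar = 154*ones(size(Y));
    CbBar(lo) = 108 + (Kl - Y(lo))*(118-108)/(Kl - Ymin);
    CbBar(hi) = 108 + (Y(hi) - Kh)*(118-108)/(Ymax - Kl);
    CrBar(lo) = 154 + (Kl - Y(lo))*(154-144)/(Kl - Ymin);
    CrBar(hi) = 154 + (Y(hi) - Kh)*(154-132)/(Ymax - Kl);

    % Cluster spreads, Equation 5.2.9
    WCbY = WCb*ones(size(Y));  WCrY = WCr*ones(size(Y));
    WCbY(lo) = WLCb + (Y(lo) - Ymin)*(WCb - WLCb)/(Kl - Ymin);
    WCbY(hi) = WHCb + (Ymax - Y(hi))*(WCb - WHCb)/(Ymax - Kl);
    WCrY(lo) = WLCr + (Y(lo) - Ymin)*(WCr - WLCr)/(Kl - Ymin);
    WCrY(hi) = WHCr + (Ymax - Y(hi))*(WCr - WHCr)/(Ymax - Kl);

    % Nonlinear chroma transform, Equation 5.2.6: only out-of-range pixels are modified
    CbT = Cb;  CrT = Cr;
    CbT(out) = (Cb(out) - CbBar(out)).*WCb./WCbY(out) + 108;   % centre evaluated at Y = Kh
    CrT(out) = (Cr(out) - CrBar(out)).*WCr./WCrY(out) + 154;

    % Shift, rotate and test against the ellipse, Equations 5.2.10, 5.2.11 and Table 5.3
    cx = 109.38; cy = 152.02; th = 2.53; ecx = 1.6; ecy = 2.41; a = 25.39; b = 14.03;
    x =  cos(th)*(CbT - cx) + sin(th)*(CrT - cy);
    y = -sin(th)*(CbT - cx) + cos(th)*(CrT - cy);
    mask = ((x - ecx).^2)/a^2 + ((y - ecy).^2)/b^2 <= 1;
end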
5.2.4 No Illumination Compensation Method
The final algorithm considered, described in [52], does not employ an illumination com-
pensation step but bases the segmentation of skin coloured pixels directly on the image.
First the image is converted from the RGB to the YC_bC_r space, then the Y component
is disregarded, providing some level of illumination compensation. The skin segmentation
is then carried out by thresholding the C_b and C_r values of the image using the limits
described by Equation 5.2.12.
S_{ij} = \begin{cases} 0 & \text{if } (128 \le C_b \le 170) \wedge (73 \le C_r \le 158) \\ 1 & \text{otherwise} \end{cases}    (5.2.12)
5.2.5 Skin Segmentation Testing
The methodology for the testing of these algorithms starts with the separate implemen-
tation of each algorithm, followed by their testing using images originating from the
respective papers. Then a small library of images is built in order to test the algorithms in
different conditions. The performance of these algorithms is then tested on each image
whilst recording the parameters required to judge performance. The variables taken into
consideration when building the library are:
Cluttered/Uncluttered images
Dark/Light Illumination
One skin coloured object/Multiple skin coloured objects (such as furniture)
Furthermore, the list of performance parameters taken into consideration is:
1. Computational Time - The time taken for segmentation to be carried out is an impor-
tant parameter considering that the final system needs to operate as close to real time as
possible. This carries a higher weighting compared to the other parameters.
2. Ratio of False Positives - This ratio gives a quantitative description of the amount
of falsely classified pixels by calculating the ratio of the number of correctly segmented
skin pixels to the total number of segmented pixels. The closer the ratio is to one, the
better, while values less than one imply more false positives.
3. Ratio of True Positives - This ratio gives a quantitative description of the amount of
falsely rejected pixels by calculating the ratio of the number of correctly segmented
skin pixels to the total number of actual skin pixels, picked manually. The closer the ratio
is to one, the better, while values less than one imply more false negatives. A short sketch
of how both ratios are computed is given after this list.
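Both ratios reduce to simple pixel counts once a manually labelled ground-truth mask is available. The sketch below is illustrative; the variable names seg and gt are assumptions.

% seg : logical mask produced by a segmentation algorithm (true = skin)
% gt  : logical mask labelled manually (true = skin)
correct = seg & gt;                         % correctly segmented skin pixels

ratioFP = nnz(correct) / nnz(seg);          % ratio of false positives metric (precision-like)
ratioTP = nnz(correct) / nnz(gt);           % ratio of true positives metric (recall-like)

% Both ratios approach one for a perfect segmentation: a low ratioFP indicates many
% false positives, a low ratioTP indicates many missed skin pixels (false negatives).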
5.3 Pose Classification
After segmenting skin regions from the image, the result is a binary image with the
skin area specified as the foreground, denoted by a logical one, and the rest of the image,
the background, denoted by a logical zero. The foreground is not always a contiguous block
because it sometimes has holes that result from misclassified pixels. To fill these in,
a flood fill operation can be applied, but this operates on the background of an image.
Therefore, according to [39], a morphological eight-connected flood-fill operation is ap-
plied on the inverted binary image, thus completely filling in any voids within the blobs
that may result from the segmentation. After filling in the holes, the largest connected
component having an area larger than the minimal hand region is classified as the hand.
This assumes that the largest blob is the hand, which is a reasonable assumption con-
sidering that, since the cameras are positioned in an ego-centric view, the closest skin
coloured object of relevance to the cameras is the hand. The next step in the algorithm is
the fitting of a 2D hand model on the segmented hand region. The 2D hand model to be
fitted, shown in Figure ??, is made up of three lines representing the palm, fingers and the
thumb and has a total of 8 DOF. The first two are the x and y position of the palm centre
within the frame whilst the other six represent the angle and length of the palm, the
fingers and the thumb. The next sections detail the two methods implemented for
fitting the 2D model in order to find the best performing approach.
5.3.1 Skeletonisation Approach
According to the work carried out in [39], after segmenting the hand and choosing a
blob, the next step is to further segment the hand image into three separate regions as
described by the 2D model. This is achieved by first applying skeletonisation to the seg-
mented hand. The rough edges that occur in segmentation result in a skeleton that has
branches that are not required for the fitting and that actually give rise to errors in the
following steps. Therefore pruning is applied to the skeleton in order to keep the three
principal branches that represent the palm, fingers and thumb. The contour of the pruned
skeleton is then subdivided into small branches and, using a least squares line fitting method,
each branch is represented as a straight line. These straight lines are then used to fit the 2D
model by taking two assumptions.
The first is that, considering an ego-centric view, meaning that the camera is capturing
images from the perspective of the user, the thumb of the right hand is always on the left,
the palm is at the bottom and the fingers are at the top. The second assumption is
that the lines representing the hand regions should be long enough for them to be con-
sidered a good representative, and therefore no fitting is applied if the lines are shorter
than a threshold. Apart from this, the extracted feature lines do not always extend to
the edge of the hand, by nature of the skeletonisation and pruning algorithms, which
results in a loss of accuracy for length estimation. Hence an added step is required that
takes the current lines and extends their length to the edge of the image boundary. Then
an intersection is performed between each extended line and the segmented skin region,
which returns a line with the correct length parameter. This procedure initialises the 2D model
together with all its parameters.
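A brief sketch of the skeletonise-and-prune initialisation is given below; the pruning length and the single least squares fit shown are illustrative assumptions (in [39] each branch of the subdivided contour is fitted separately).

% Sketch of skeletonisation, pruning and line fitting (illustrative thresholds)
Imbw = imfill(logical(SkinSegmenter(I,'RGB')), 'holes');   % segmented hand blob (Table A.2)

skel   = bwmorph(Imbw, 'skel', Inf);                       % morphological skeleton of the blob
pruned = bwmorph(skel, 'spur', 15);                        % remove short spurious branches (15 px assumed)

% Least squares straight-line fit to one set of branch pixels (xb, yb)
[yb, xb] = find(pruned);
p = polyfit(xb, yb, 1);                                    % slope and intercept of the fitted line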
After preliminary fitting, the model is tracked by a frame by frame feature tracking ap-
proach. This is achieved according to [39] with the use of the Scale Invariant Feature
Transform (SIFT), which extracts highly distinctive scale invariant features from images.
By extracting these features from the initial frame, and then extracting the same feature
types in the following frame, one can compare and allocate matching pairs between the
two frames and hence calculate the geometric transformation applied to the features,
which describes the hand movement between frames. Applying this estimated geometric
transformation to the 2D model in the old frame by multiplying
this transformation matrix with the points in the old 2D model according to [39], results
in the new model whose parameters correspond to the image in the new frame.
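The sketch below illustrates the frame-to-frame transform estimation with Computer Vision Toolbox functions. SURF is used in place of SIFT purely because it is readily available there (the trade-off is discussed in Chapter 8), and a single transform is estimated for brevity, whereas the method in [39] estimates one per hand section.

% Frame-to-frame tracking sketch using SURF features (single transform for brevity)
grayPrev = rgb2gray(framePrev);   grayCurr = rgb2gray(frameCurr);

ptsPrev = detectSURFFeatures(grayPrev);
ptsCurr = detectSURFFeatures(grayCurr);
[fPrev, vPrev] = extractFeatures(grayPrev, ptsPrev);
[fCurr, vCurr] = extractFeatures(grayCurr, ptsCurr);

pairs = matchFeatures(fPrev, fCurr);                   % matching pairs between the two frames
mPrev = vPrev(pairs(:,1));   mCurr = vCurr(pairs(:,2));

% Robust estimate of the geometric transform describing the hand movement between frames
tform = estimateGeometricTransform(mPrev, mCurr, 'similarity');

% Apply the transform to the N x 2 [x y] points of the 2D model from the previous frame
modelCurr = transformPointsForward(tform, modelPrev);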
5.3.2 Modified Approach
The final output of the previous method is a set of points representing the tip of the fingers,
the base of the palm, the tip of the thumb and a centre point positioned at the centre of
the palm. Thus, to compare performance in terms of accuracy and computational time,
a modified approach is proposed that uses simple computational techniques instead of the
skeletonisation approach. In this approach, using the same assumptions that the fingers
are always at the top, the palm base at the bottom and the thumb on the left, the points
are extracted by taking into consideration the physical x,y coordinates of the segmented
skin pixels of the image. Thus the respective coordinates of the segmented pixels are
extracted into an array and the pixel with the maximum y value is assigned as the tip of
the fingers, the minimum y value is assigned as the base of the palm and the minimum x
value is assigned as the thumb. The last point is the centre of the palm and, for this point,
the assumption is made that the point which is farthest away from the border of the
segmented region is the palm centre. Therefore, to find this point, the distance transform
of the image is computed and the pixel with the highest value is classified as the palm
centre.
Since orientation is key for the correct operation of this algorithm, before the classifi-
cation of points is carried out, the orientation of the hand must be made upright in order to
safeguard against conditions that could result in the violation of the stipulated assump-
tions. This is done by first calculating the orientation of the image and then applying
a rotation equal to this angle in the opposite direction. The classification of points as
described in the previous paragraph is then carried out and finally the extracted points
are rotated back in order to register with the original image.
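A possible realisation of the orientation step, using regionprops and imrotate, is sketched below; the sign convention of the rotation and the bookkeeping needed to map the points back through the padding added by imrotate are assumptions to be verified.

% Orientation normalisation sketch (sign convention to be verified)
stats   = regionprops(Imbw, 'Orientation');          % angle of the blob's major axis, in degrees
theta   = stats(1).Orientation - 90;                 % rotation that should bring the hand upright
upright = imrotate(Imbw, -theta, 'nearest', 'loose');

% Classify finger tip, palm base, thumb and palm centre on 'upright' as described above,
% then rotate the extracted points by +theta about the image centre (compensating for the
% padding introduced by the 'loose' option) to register them with the original frame.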
Furthermore, some intelligence is incorporated into this implementation by checking for
conditions where the length of a segment has changed by a relatively large factor in be-
tween frames. This is an indication of an error and therefore the data for the current
frame is discarded. The algorithm in general therefore relies on a simple search for the
positional maxima and minima, and on the distance transform for the section identifica-
tion. This means that the algorithm should be fast and implementable in real time while
also being robust.
5.3.3 Manipulation Actions
The final aim of the system is the ability to manipulate virtual objects, and as such a
dictionary of recognisable actions needs to be constructed. Because of time constraints,
this dictionary operates on just the 2D model, so that the relevant data can be extracted
from the previous algorithms in order to trigger these commands. Since the objects are
virtual, the preferred manipulation space is 3D, hence the dictionary should consist of
commands for varying the x, y and z position of the object. Another parameter that can
be extracted and applied is the rotation around the z axis, also called the roll angle. One
of the more important features is the implementation of a grab gesture in order to delimit
the boundary between manipulating an object and a free moving hand.
The x and y parameters of the manipulated object can be applied from the x and y coor-
dinates of the palm centre. The rotation angle is obtained from the angle of the line
joining the palm centre and the finger tips, which also forms part of the degrees of free-
dom mentioned previously. This line is also used in the detection of grabbing gestures
through the classification of distinct hand lengths as being either open or closed fingers.
Finally, the z parameter is modified by using the area of the blob. This is carried out by
first calibrating using the area of the user's hand. A minimum of four calibration points
under specific conditions are required in order to fully calibrate the system.
The first two are obtained by recording the area of the hand as it is held at arm's length
away, both when it is fully open with the palm exposed and also when making a fist, which
signifies the grabbing gesture. The same procedure is applied for the next two calibration
points, but this time with the hand positioned as close to the user as possible with all hand
extremities still visible within the image. These points are then used to calculate the
current depth of the hand through a quadratic interpolation of the area values
between those of an outstretched arm and those of a hand close to the camera.
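The depth mapping can be sketched as a polynomial fit from blob area to hand depth; the calibration numbers below are invented placeholders, and at least three area-depth pairs per hand state are assumed so that the quadratic is determined.

% Area-to-depth calibration sketch (placeholder calibration values)
calDepth    = [0.70 0.45 0.20];            % hand depths in metres at calibration time (assumed)
calAreaOpen = [9e3  2.2e4 9.5e4];          % corresponding open-hand blob areas in pixels (assumed)

pz = polyfit(calAreaOpen, calDepth, 2);    % quadratic mapping from blob area to depth

% At run time the z parameter of the manipulated object follows from the current blob area
zHand = polyval(pz, nnz(Imbw));

% The same calibration, repeated for the closed fist, gives the expected open and closed
% areas at each depth, so a grab can be flagged when the measured area drops towards the
% fist value expected at zHand.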
5.3.4 Testing
The performance requirements of this system are similar to those of the skin segmen-
tation section, in that they consist of the computational time and the accuracy of the final
result.
1. Computational Time - The time taken in order to fit the 2D model carries the most
weight in terms of performance. This test is carried out by calculating the mean time
taken to process all the frames in a test video.
2. Model Fitting Accuracy - The accuracy of the algorithm is determined by com-
paring the points extracted through the algorithm for the finger tip, thumb tip, palm base
and palm centre with the points extracted manually.
3. Manipulation Accuracy - As regards the manipulation actions, testing involves
determining how accurate the measured parameters are with respect to those of the real
world. The testing methodology of this section includes first setting up predetermined
positions in front of the user, where the user will position the hand. Then the
manipulation parameters are extracted by the algorithm and compared with the physi-
cally measured parameters in order to test the accuracy.
5.4 Conclusion
This chapter concludes the methods and methodology part of this dissertation, together
with the approaches taken in the algorithms as well as the testing routines that are
undertaken in Part III of the project, which highlights the results obtained for each
facet of the system.
Part III
Results and Discussion
Chapter 6 - Augmented Reality Results
6.1 Introduction
This chapter deals with the testing results for the performance parameters of the AR
hardware and software.
6.2 Snellen Test
The first test is the Snellen visual acuity test. First, the baseline visual acuity of the test
subjects is recorded by using the chart found at [49]. This test is designed to be
carried out on a screen, so the users were placed at a four metre distance from a monitor
and the letter size was reduced until the letters could not be distinguished anymore. The
baseline results found in the first column of Table 6.1 indicate the visual acuity of the
test subjects with the unaided eye. After that, each subject wears the AR system and the
test is repeated, yielding the second column of results. When wearing the headset,
the reduction in the field of vision results in a zooming-in effect, so the letters should
appear even closer than they normally are and hence be distinguished even better. However,
due to the low resolution of the display in the Oculus Rift, the resolution becomes the
dominant factor and results in the inability to distinguish the smaller letters. During
testing it was observed that one possible reason for the failure to read small letters is the
screen backlight. Therefore the same test is also repeated with the use of a printed chart
from [57], which in fact yielded better results, depicted in the third column of Table 6.1.
6.3 Display Contrast Test
Test Subject    Normal    Online Test    Paper Test
1               20/20     20/200         20/100
2               20/25     20/200         20/100
3               20/20     20/200         20/100
Table 6.1: Snellen Test Results
For the next test, in order to evaluate the contrast offered by the Oculus Rift's display, a
video camera is used with its aperture positioned in front of the user's eyepiece in order
to record a video of the image provided by the display. This video is then processed
into its YC_bC_r equivalent and the maximum and minimum luminance components of
each frame are extracted and then averaged to provide an average of the maximum and
minimum luminance over a period of n frames. The maximum luminance was found to
be 231 whilst the minimum was 16. These values are then substituted into the Michelson
contrast equation, Equation 6.3.1, to obtain a score for the Rift. The closer the obtained
value is to one, the better the quality of the display, which in the case of the Rift is a good
result.
C = \frac{L_{max} - L_{min}}{L_{max} + L_{min}} = \frac{215}{247} = 0.870    (6.3.1)
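The measurement reduces to a few lines of MATLAB; the sketch below is illustrative and the file name is an assumption.

% Sketch of the contrast measurement from the recorded eyepiece video
v = VideoReader('rift_eyepiece.avi');       % assumed file name
Lmax = [];  Lmin = [];

while hasFrame(v)
    ycc = rgb2ycbcr(readFrame(v));
    Y   = ycc(:,:,1);                       % luminance component of the frame
    Lmax(end+1,1) = double(max(Y(:)));
    Lmin(end+1,1) = double(min(Y(:)));
end

Lmax = mean(Lmax);   Lmin = mean(Lmin);     % averaged over the n recorded frames
C = (Lmax - Lmin) / (Lmax + Lmin);          % Michelson contrast, Equation 6.3.1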
6.4 View Registration Test
Chapter 7 - Skin Segmentation Results
Chapter 8 - Pose Classification Results
This approach, however, has many factors that can seriously affect the accuracy and the
applicability of the algorithm for 2D model hand tracking. First of all, accurately es-
timating the geometric transform between frames relies on having as many features to
compare as possible. Since the model has three independently moving links, the SIFT
features extracted need to be first classified as belonging to either the palm, the fingers or
the thumb. Then this subset of features can be compared with the ones in the next frame,
with matching pairs being assigned to the section they belong to. Using these matched
features, an estimate for the separate geometric transform of each section is calculated.
After applying the transformation to the model in the old frame, the features in the new
frame are again segmented to keep the tracked SIFT features as recent as possible, thus
offering a better probability of matching features between consecutive frames.
Using a subset of the features in order to estimate three separate geometric transforma-
tions can result in a poor estimation because of the limited number of features
attributed to each group. Since the estimated geometric transforms are multiplied
with the points in the previous frame, small errors in the estimated transformation re-
sult in bigger errors on the transformed points. This process is made even worse by
the successive frame to frame transformations that continue building on the error of the
first transformation; small errors thus quickly accumulate, resulting in considerable errors
between the position of the hand model and the actual hand. One solution to this prob-
lem is to use feature tracking for a fixed length of time before refreshing the data by
reinitialising using the skeletonisation approach described earlier. This way, the amount
of accumulated drift can be controlled by varying the time period between successive
refreshes carried out by the initialisation algorithm.
Taking this approach to the extreme, another solution could be to exclude the tracking
based on SIFT features altogether and fit the model in each frame using the skeletonisation
approach. This, however, could create another problem because skeletonisation and
pruning are computationally expensive processes which affect the real time performance
of the algorithm. Furthermore, SIFT features are also computationally expensive to
calculate and extract from an image. In this case, a possible solution would be the use of
Speeded Up Robust Features (SURF) which are, as their name describes, faster to compute.
After considering the various possible problems with the previous algorithm, which has
computationally expensive steps in both the initialisation and the tracking stages, the
proposed approach aims to provide a simpler way of extracting the same information
using the same assumptions. The final output of the previous system, as described, is a
set of points representing the tip of the fingers, the base of the palm, the tip of the thumb
and a centre point positioned at the centre of the palm.
Chapter 9 - Conclusion and Future Work
9.1 Conclusion
9.2 Future Improvements
References
[1] K. Lee, in Augmented Reality in Education and Training, 2012.
[2] V. Geroimenko, Augmented reality technology and art: The analysis and visu-
alization of evolving conceptual models, in Information Visualisation (IV), 2012
16th International Conference on, July 2012, pp. 445453.
[3] M. Hincapie, A. Caponio, H. Rios, and E. Mendivil, An introduction to augmented
reality with applications in aeronautical maintenance, in Transparent Optical Net-
works (ICTON), 2011 13th International Conference on, June 2011, pp. 14.
[4] A. R. Company. Mobilear. http://augmentedrealitycompany.com/images/ar3.jpg
[Acccessed: 3 May 2014].
[5] Vuzix. Star-1200xld. http://www.vuzix.com/wp-content/uploads/
augmented-reality/ images/1200xld/STAR-1200XLD w-controller.png [Acc-
cessed: 3 May 2014].
[6] O. VR. Oculus rift development kit 1. https://dbvc4uanumi2d.cloudfront.net/cdn/
3.4.70/wp-content/themes/oculus/img/order/dk1-product.jpg [Acccessed: 3 May
2014].
[7] C. Humphrey, S. Motter, J. Adams, and M. Gonyea, A human eye like perspec-
tive for remote vision, in Systems, Man and Cybernetics, 2009. SMC 2009. IEEE
International Conference on, Oct 2009, pp. 16501655.
[8] B. Kress and T. Starner, A review of head-mounted displays (hmd) technologies
and applications for consumer electronics, in Proc. SPIE 8720, Photonic Applica-
tions for Aerospace, Commercial, and Harsh Environments IV, 87200A, May 2013.
[9] S. William. (2013, December) Ar rift. http://willsteptoe.com/post/67399683294/
ar-rift-camera-selection-part-2 [Acccessed: 3 May 2014].
[10] Logitech. C905. http://www.logitech.com/en-sg/product/5868 [Acccessed: 3 May
2014].
[11] A. Studios. Portfolio. http://www.alife-studios.com/portfolio [Acccessed: 6 May
2014].
[12] S. Henderson and S. Feiner, Augmented reality for maintenance and repair
(armar), United States Air Force Research Lab, Tech. Rep. Technical Report
86500526647, July 2007.
[13] T. Labs. Myo gesture control armband. https://www.thalmic.com/en/myo/ [Acc-
cessed: 7 May 2014].
[14] V. Pavlovic, R. Sharma, and T. Huang, Visual interpretation of hand gestures for
human-computer interaction: a review, Pattern Analysis and Machine Intelligence,
IEEE Transactions on, vol. 19, no. 7, pp. 677695, Jul 1997.
[15] Y. Kuno, K. Hayashi, K. Jo, and Y. Shirai, Human-robot interface using uncali-
brated stereo vision, in Intelligent Robots and Systems 95. Human Robot Interac-
tion and Cooperative Robots, Proceedings. 1995 IEEE/RSJ International Confer-
ence on, vol. 1, Aug 1995, pp. 525530 vol.1.
[16] T. Cootes, C. Taylor, A. Lanitis, D. H. Cooper, and J. Graham, Building and using
flexible models incorporating grey-level information, in Computer Vision, 1993.
Proceedings., Fourth International Conference on, May 1993, pp. 242246.
[17] C. Kervrann and F. Heitz, Learning structure and deformation modes of nonrigid
objects in long image sequences, 1995.
[18] A. Bobick and J. W. Davis, Real-time recognition of activity using temporal tem-
plates, 1996, pp. 3942.
[19] M. R. Villarreal. Labelled hand skeleton image. http://upload.wikimedia.
org/wikipedia/commons/thumb/a/ab/Scheme human hand bones-en.svg/
500px-Scheme human hand bones-en.svg.png [Acccessed: 8 May 2014].
[20] R. Koch, Dynamic 3d scene analysis through synthesis feedback control, 1993.
[21] J. Lee and T. Kunii, Model-based analysis of hand posture, Computer Graphics
and Applications, IEEE, vol. 15, no. 5, pp. 7786, Sep 1995.
[22] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, Pfinder: real-time tracking
of the human body, Pattern Analysis and Machine Intelligence, IEEE Transactions
on, vol. 19, no. 7, pp. 780785, Jul 1997.
[23] V. Pavlovic, R. Sharma, and T. S. Huang, Gestural interface to a visual computing
environment for molecular biologists, in Automatic Face and Gesture Recognition,
1996., Proceedings of the Second International Conference on, Oct 1996, pp. 30
35.
[24] D. M. Gavrila and L. S. Davis, Towards 3-d model-based tracking and recognition
of human movement: a multi-view approach, in In International Workshop on
Automatic Face- and Gesture-Recognition. IEEE Computer Society, 1995, pp. 272
277.
[25] J. Rehg and T. Kanade, Model-based tracking of self-occluding articulated ob-
jects, in Computer Vision, 1995. Proceedings., Fifth International Conference on,
Jun 1995, pp. 612617.
[26] A. Erol, G. Bebis, M. Nicolescu, R. Boyle, and X. Twombly, A review on vision-
based full dof hand motion estimation, in Computer Vision and Pattern Recogni-
tion - Workshops, 2005. CVPR Workshops. IEEE Computer Society Conference on,
June 2005, pp. 7575.
[27] N. Shimada, K. Kimura, and Y. Shirai, Real-time 3d hand posture estimation
based on 2d appearance retrieval using monocular camera, in Recognition, Analy-
sis, and Tracking of Faces and Gestures in Real-Time Systems, 2001. Proceedings.
IEEE ICCV Workshop on, 2001, pp. 2330.
[28] B. Stenger, P. Mendonca, and R. Cipolla, Model-based 3d tracking of an articu-
lated hand, in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Pro-
ceedings of the 2001 IEEE Computer Society Conference on, vol. 2, 2001, pp.
II310II315 vol.2.
[29] B. Stenger, A. Thayananthan, P. H. S. Torr, and R. Cipolla, Filtering using a
tree-based estimator, in Computer Vision, 2003. Proceedings. Ninth IEEE Inter-
national Conference on, Oct 2003, pp. 10631070 vol.2.
[30] A. Thayananthan, B. Stenger, P. H. S. Torr, and R. Cipolla, Learning a kinematic
prior for tree-based ltering, 2003.
[31] Y. Wu and T. Huang, Capturing articulated human hand motion: a divide-and-
conquer approach, in Computer Vision, 1999. The Proceedings of the Seventh
IEEE International Conference on, vol. 1, 1999, pp. 606611 vol.1.
[32] J. Lin, Y. Wu, and T. Huang, Capturing human hand motion in image sequences,
in Motion and Video Computing, 2002. Proceedings. Workshop on, Dec 2002, pp.
99104.
[33] D. Lowe, Fitting parameterized three-dimensional models to images, Pattern
Analysis and Machine Intelligence, IEEE Transactions on, vol. 13, no. 5, pp. 441
450, May 1991.
[34] K. Nirei, H. Saito, M. Mochimaru, and S. Ozawa, Human hand tracking from
binocular image sequences, in Industrial Electronics, Control, and Instrumenta-
tion, 1996., Proceedings of the 1996 IEEE IECON 22nd International Conference
on, vol. 1, Aug 1996, pp. 297302 vol.1.
[35] C. Nolker and H. Ritter, Visual recognition of continuous hand postures, Neural
Networks, IEEE Transactions on, vol. 13, no. 4, pp. 983994, Jul 2002.
[36] S. Maskell and N. Gordon, A tutorial on particle filters for on-line nonlinear/non-
Gaussian Bayesian tracking, in Target Tracking: Algorithms and Applications (Ref.
No. 2001/174), IEE, vol. Workshop, Oct 2001, pp. 2/12/15 vol.2.
[37] Q. Delamarre and O. Faugeras, 3d articulated models and multi-view tracking
with physical forces.
[38] M. Isard and A. Blake, Contour tracking by stochastic propagation of conditional
density, 1996, pp. 343356.
[39] L. Sun, U. Klank, and M. Beetz, EyeWatchMe: 3D hand and object tracking
for inside out activity analysis, in Computer Vision and Pattern Recognition Work-
shops, 2009. CVPR Workshops 2009. IEEE Computer Society Conference on, June
2009, pp. 916.
[40] R.-L. Hsu, M. Abdel-Mottaleb, and A. Jain, Face detection in color images, Pat-
tern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 5, pp.
696706, May 2002.
[41] S. Lee and J. Chun, A stereo-vision approach for a natural 3d hand interaction
with an ar object, in Advanced Communication Technology (ICACT), 2014 16th
International Conference on, Feb 2014, pp. 315321.
[42] U. Technologies. Game engine software. https://unity3d.com/ [Acccessed: 9 May
2014].
[43] Lidgren networking library generation 3. https://code.google.com/p/
lidgren-network-gen3/ [Acccessed: 9 May 2014].
[44] M. Livingston, C. Zanbaka, J. Swan, and H. Smallman, Objective measures for
the effectiveness of augmented reality, in Virtual Reality, 2005. Proceedings. VR
2005. IEEE, March 2005, pp. 287288.
[45] R.-L. Hsu, M. Abdel-Mottaleb, and A. Jain, Face detection in color images, Pat-
tern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 5, pp.
696706, May 2002.
[46] H.-K. Tang and Z.-Q. Feng, Hand's skin detection based on ellipse clustering,
in Computer Science and Computational Technology, 2008. ISCSCT 08. Interna-
tional Symposium on, vol. 2, Dec 2008, pp. 758761.
[47] Y.-T. Pai, S.-J. Ruan, M.-C. Shie, and Y.-C. Liu, A simple and accurate color face
detection algorithm in complex background, in Multimedia and Expo, 2006 IEEE
International Conference on, July 2006, pp. 15451548.
[48] P. Vision. Snellen eye test charts interpretation. http://precision-vision.com/
Articles/snelleneyetestchartsinterpretation.html#.U3HhCvmSz1I [Acccessed: 13
May 2014].
[49] M. University of Buffalo, Scott Olitsky. Ivac snellen eye test. http://www.smbs.
buffalo.edu/oph/ped/IVAC/IVAC.html [Acccessed: 13 May 2014].
[50] A. Michelson and H. Lemon, Studies in Optics, ser. University of Chicago
Science Series. University of Chicago Press, 1927. [Online]. Available:
http://books.google.com.mt/books?id=FXazQgAACAAJ
[51] H. Kato and M. Billinghurst, Marker tracking and hmd calibration for a video-
based augmented reality conferencing system, in Augmented Reality, 1999. (IWAR
99) Proceedings. 2nd IEEE and ACM International Workshop on, 1999, pp. 85
94.
[52] S. Lee and J. Chun, A stereo-vision approach for a natural 3d hand interaction
with an ar object, in Advanced Communication Technology (ICACT), 2014 16th
International Conference on, Feb 2014, pp. 315321.
[53] V. Shantaram and M. Hanmandlu, Contour based matching technique for 3d ob-
ject recognition, in Information Technology: Coding and Computing, 2002. Pro-
ceedings. International Conference on, April 2002, pp. 274279.
[54] J.-C. Terrillon, M. Shirazi, H. Fukamachi, and S. Akamatsu, Comparative per-
formance of different skin chrominance models and chrominance spaces for the
automatic detection of human faces in color images, in Automatic Face and Ges-
ture Recognition, 2000. Proceedings. Fourth IEEE International Conference on,
2000, pp. 5461.
[55] Y.-C. Chang and J. Reid, Rgb calibration for color image analysis in machine
vision, Image Processing, IEEE Transactions on, vol. 5, no. 10, pp. 14141422,
Oct 1996.
[56] D. H. Marimont and B. A. Wandell, Matching color images: The effects of axial chromatic
aberration, Journal of the Optical Society of America A, vol. 11, no. 12, p. 3113,
1994.
[57] J. Dahl. Snellen chart. http://upload.wikimedia.org/wikipedia/commons/thumb/9/
9f/Snellen chart.svg/1000px-Snellen chart.svg.png [Acccessed: 14 May 2014].
Appendix A - Matlab Code
function [Parameters] = fitSkel2dumb(I)
numThresh = 300;
Parameters = [];                              % returned empty if the blob is too small

% Skin Segmentation, Table A.2
Imbw = SkinSegmenter(I,'RGB');

% fill any holes in the segmentation
Imbw = im2uint8(imfill(logical(Imbw),'holes'));

n = nnz(Imbw);

if(n > numThresh)
    %%%%%%%%%% find the pixel with the largest distance from the blob edge %%%%%%%%%%
    [pix(:,2),pix(:,1)] = find(Imbw);
    Idist = bwdist(~Imbw);                    % distance of each blob pixel from the background
    [vals,ypos] = max(Idist);
    [~,xpos] = max(vals);
    ypos = ypos(xpos);
    branchPt = [xpos,ypos];                   % palm centre

    %%%%%%%%%% max and min y and min x to get the image extremities %%%%%%%%%%
    [~,indmaxY] = max(pix(:,2));
    Palm_endPt = pix(indmaxY,:);
    [~,indminY] = min(pix(:,2));
    Fingers_endPt = pix(indminY,:);
    [~,indminX] = min(pix(:,1));
    Thumb_endPt = pix(indminX,:);

    %%%%%%%%%% format the data into an array %%%%%%%%%%
    Parameters = zeros(3,4);
    Parameters(:,1:2) = repmat(branchPt,[3,1]);
    Parameters(:,3:4) = [Fingers_endPt; Palm_endPt; Thumb_endPt];
end
end
Table A.1: Parameter Extraction after Skin Segmentation
function [IMseg] = SkinSegmenter(IMin,type)

%%%%%%%%%% initialise parameters for skin segmentation %%%%%%%%%%
Crmin = 73;
Crmax = 158;
Cbmin = 128;
Cbmax = 170;
IMseg = zeros(size(IMin(:,:,1)));

%%%%%%%%%% convert to the required format %%%%%%%%%%
switch type
    case 'YCbCr'
        temp = IMin;
    case 'RGB'
        temp = rgb2ycbcr(IMin);
    otherwise
        error('Specify Input Type');
end
img_ycbcr = im2uint8(temp);

%%%%%%%%%% threshold the image %%%%%%%%%%
Cbimg = img_ycbcr(:,:,2);
Crimg = img_ycbcr(:,:,3);
IMseg(Cbimg>=Cbmin & Cbimg<=Cbmax & Crimg>=Crmin & Crimg<=Crmax) = 1;

%%%%%%%%%% select the largest blob %%%%%%%%%%
if(any(IMseg(:)))
    [imlabel, totalLabels] = bwlabel(IMseg,8);
    sizeBlob = zeros(1,totalLabels);
    for i = 1:totalLabels
        sizeBlob(i) = nnz(imlabel==i);
    end
    [~, largestBlobNo] = max(sizeBlob);
    IMseg = zeros(size(imlabel));
    IMseg(imlabel==largestBlobNo) = 1;
end
end
Table A.2: Skin Segmentation Function
Appendix B - Unity Code