Special Issue of Media Solutions That Improve Accessibility to Disabled Users


UBICC, the Ubiquitous Computing and Communication Journal [ISSN 1992-8424], is an international scientific and educational organization dedicated to advancing the arts, sciences, and applications of information technology. With a world-wide membership, UBICC is a leading resource for computing professionals and students working in the various fields of Information Technology, and for interpreting the impact of information technology on society.


Copyright:Attribution Non-commercial







UBICC Journal

Ubiquitous Computing and Communication Journal 2010 Volume 5 . 2010-03-10 . ISSN 1992-8424

Special Issue of Media Solutions that Improve Accessibility to Disabled Users
Unconstrained walking plan to virtual environment for spatial learning by visually impaired
Application of virtual reality technologies in rapid development and assessment of ambient assisted living environments
PIXAR animation studios and disabled personages case study: Finding NEMO
Web accessible design centered on user experience
First steps towards determining the role of visual information in music communication
Examining the feasibility of face gesture detection for monitoring users of autonomous wheel chairs
Personal localization in wearable camera platform towards assistive technology for social interactions






UBICC Publishers © 2010 Ubiquitous Computing and Communication Journal

Managing Editor Dr. David Fonseca

Ubiquitous Computing and Communication Journal
Book: 2010 Volume 5 Publishing Date: 2010-03-10 Proceedings ISSN 1994-4608
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the copyright law of 1965, in its current version, and permission for use must always be obtained from UBICC Publishers. Violations are liable to prosecution under the copyright law.

UBICC Journal is a part of UBICC Publishers
www.ubicc.org
© UBICC Journal
Printed in South Korea
Typesetting: Camera-ready by author, data conversion by UBICC Publishing Services, South Korea

Working to grow libraries in developing countries UbiCC Journal | www.ubicc.org

Kanubhai K. Patel¹, Dr. Sanjay Kumar Vij²
¹ School of ICT, Ahmedabad University, Ahmedabad, India, kkpatel7@gmail.com
² Dept. of CE-IT-MCA, SVIT, Vasad, India, vijsanjay@gmail.com


ABSTRACT
Treadmill-style locomotion interfaces for walking in virtual environments typically have two problems that impact their usability: a bulky or complex drive mechanism and poor stability. The bulky or complex drive mechanism restricts the practical use of such an interface, and the stability problem induces fear in the user. This paper describes a novel, simple treadmill-style locomotion interface that uses a manual treadmill with handles to provide need-based support, thus allowing walking with assured stability. Its simplicity of design, coupled with a supervised multi-modal training facility, makes it an effective device for spatial learning and thereby for enhancing the mobility skills of visually impaired people. It helps visually impaired people develop cognitive maps of new and unfamiliar places through virtual environment exploration, so that they can navigate such places with ease and confidence in reality. In this paper, we describe the structure and control mechanism of the device, along with the system architecture and experimental results on the general usability of the system.
Keywords: assistive technology, blindness, cognitive maps, locomotion interface, virtual learning environment.



Unlike sighted people, visually impaired and blind people do not have full access to spatial information, which makes mobility in new or unfamiliar locations difficult. This constraint can be overcome by providing mental maps of spaces and of the possible paths through them, which are essential for developing efficient orientation and mobility skills. Orientation refers to the ability to situate oneself relative to a frame of reference, and mobility is defined as “the ability to travel safely, comfortably, gracefully, and independently” [7, 18]. Most of the information required for mental mapping is gathered through the visual channel [15]. Because visually impaired people cannot gather this crucial information visually, they face great difficulties in generating efficient mental maps of spaces and, therefore, in navigating efficiently within new or unfamiliar spaces. Consequently, many visually impaired people become passive, depending on others for assistance. More than 30% of the blind do not ambulate independently outdoors [2, 16]. Such assistance might not be required after a reasonable number of repeated visits to the new space, as these visits allow a mental map of the space to form subconsciously. Thus, a good number of researchers have focused on using technology to simulate visits to a new space for building cognitive maps. Although isolated solutions have been attempted, to the best of our knowledge no integrated solution for spatial learning by visually impaired people is available. Moreover, most simulated environments are far from reality, and the challenge in this approach is to create a near real-life experience.

Advanced computer technology offers new possibilities for supporting visually impaired people's acquisition of orientation and mobility skills by compensating for the deficiencies of the impaired channel. Newer technologies, including speech processing, computer haptics and virtual reality (VR), provide various options for the design and implementation of a wide variety of multimodal applications. Even for sighted people, such technologies can be used (a) to enhance the visual information available to a person so that important features of a scene are presented visibly, (b) to train them through a virtual environment so that they create cognitive maps of unfamiliar areas, or (c) to get a feel of an object (using haptics) [16]. Virtual Reality provides for the creation of simulated objects and events with which people can interact. The definitions of Virtual Reality (VR), although wide and varied, include a common statement that VR creates the illusion of participation in a synthetic environment rather than

UbiCC Journal, Volume 5, March 2010


going through external observation of such an environment [5]. Essentially, virtual reality allows users to interact with a simulated environment. Users can interact with a virtual environment either through standard input devices such as a keyboard and mouse, or through multimodal devices such as a wired glove, the Polhemus boom arm, or an omni-directional treadmill. Even though the visual channel is missing when virtual reality is used by a visually impaired person, the other sensory channels can still bring benefits, as users engage in a range of activities in a simulator relatively free from the limitations imposed by their disability. In our proposed design, they can do so in a safe manner.

We describe the design of a locomotion interface to the virtual environment for acquiring spatial knowledge and thereby structuring spatial cognitive maps of an area. The virtual environment is used to provide spatial information to visually impaired people and prepare them for independent travel. The locomotion interface is used to simulate walking from one location to another. The device needs to be of limited size, allow a user to walk on it, and provide the sensation of walking on an unconstrained plane. The advantages of our proposed device are as follows:
• It solves the instability problem during walking by providing supporting rods. The limited width of the treadmill, along with the side supports, gives a feeling of safety and eliminates any fear of falling off the device.
• No special training is required to walk on it.
• The device's acceptability is expected to be high due to the feeling of safety while walking on it. This allows mental maps to form without hindrance.
• It is simple to operate and maintain, and it has low weight.

The remainder of the paper is structured as follows: Section 2 presents the related work.
Section 3 describes the structure of the locomotion interface used for virtual navigation of computer-simulated environments for acquisition of spatial knowledge and formation of cognitive maps; Section 4 describes the control principle of the locomotion device; Section 5 illustrates the system architecture; Section 6 describes the experiment for usability evaluation; and Section 7 concludes the paper and outlines future work.

2 RELATED WORK

We have categorized the most common virtual reality (VR) locomotion approaches as follows:
• Omni-directional treadmills (ODT) [3, 8, 14, 4],
• The motion foot pad [10],
• Walking-in-place devices [19],
• Actuated shoes [11], and
• The string walker [12].

The basic idea used in these approaches is that a locomotion interface should cancel the user's self-motion in place to allow the user to move in a large virtual space. For example, a treadmill can cancel the user's motion by moving its belt in the opposite direction. Its main advantage is that it does not require the user to wear any kind of device, as some other locomotion devices do. However, it is difficult to control the belt speed so as to keep the user from falling off. Some treadmills can adjust the belt speed based on the user's motion. There are two main challenges in using treadmills: the first is the user's stability, and the second is sensing and changing the direction of walking. The belt in a passive treadmill is driven by the backward push generated while walking. This process effectively balances the user and keeps him from falling off. The problem of changing the walking direction is addressed by [1, 6], who employed a handle to change the walking direction. Iwata & Yoshida [13] developed a 2D infinite plate that can be driven in any direction, and Darken [3] proposed an omni-directional treadmill using a mechanical belt. Noma & Miyasato [17] used a treadmill that could turn on a platform to change the walking direction. Iwata & Fujii [9] used a different approach by developing a series of sliding interfaces. The user was required to wear special shoes with a low-friction film in the middle of the soles. Since the user was supported by a harness or rounded handrail, the foot motion was canceled passively when the user walked. The method using an active footpad could simulate various terrains without requiring the user to wear any kind of device.

3 STRUCTURE OF LOCOMOTION INTERFACE

Figure 1: Mechanical structure of locomotion interface. There are three major parts in the figure: (a) A motor-less treadmill, (b) mechanical rotating base, and (c) block containing Servo motor and gearbox to rotate the mechanical base.




Figure 2: Locomotion interface.

As shown in Figures 1 and 2, our device consists of a motor-less treadmill resting on a mechanical rotating base. In terms of its physical characteristics, the device's upper platform (treadmill) is 54” long and 30” wide, with an active surface of 48” × 24”. The belt of the treadmill carries a mat with 24 stripes running along the direction of motion, spaced 1” apart. Below each stripe there are force sensors that sense the position of the feet. A typical manual treadmill rotates passively as the user moves on its surface, so the belt moves backward as the user moves forward. The advantages of this passive (i.e. non-motorized) movement are: (a) the device is almost silent, with negligible noise during straight movement; (b) the backward movement of the belt is synchronized with the forward movement of the user, giving jerk-free motion; and (c) when the trainee stops walking, as detected by the belt not moving, our system assists and guides the user toward further movement. The side handle support gives the person a feeling of safety and stability, which results in efficient and effective formation of cognitive maps.

Human beings subconsciously place their feet at an angle whenever they intend to take a turn. Therefore, the angular positions of the feet on the treadmill are monitored to determine not only the user's intention to turn, but also the direction and desired angle at a granularity of 15°. The rotation control system determines the angle through which the platform should be turned, and turns the whole treadmill, with the user standing on it, on the mechanical rotating base, so that the user can place the next footstep on the treadmill's belt. The platform is rotated by a servo motor. The servo motor and gearbox are placed in the lower block, which lies under the mechanical rotating base.
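The 15° granularity mentioned above can be illustrated with a small sketch. This is only an illustration under stated assumptions: the function name `quantize_turn_angle`, the sign convention (positive for right, negative for left) and the choice of rounding to the nearest multiple are hypothetical and are not taken from the paper, which does not say how the angle is derived from the sensor data.

```python
import math

STEP_DEG = 15  # granularity stated in the text


def quantize_turn_angle(foot_angle_deg: float, step: int = STEP_DEG) -> int:
    """Snap a measured foot angle to the nearest multiple of `step` degrees.

    Assumed convention (not from the paper): positive angles mean a turn
    to the right, negative angles a turn to the left.
    """
    # floor(x + 0.5) rounds halves upward and avoids Python's
    # round-half-to-even behaviour in the built-in round().
    return int(math.floor(foot_angle_deg / step + 0.5)) * step
```

With this convention, a measured foot angle of 40° would be treated as a 45° turn, and a small deviation of a few degrees would be treated as straight walking.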
Our device also provides a safety mechanism through a kill switch, which can be triggered to halt the device immediately in case the user loses control or loses

4 CONTROL PRINCIPLE OF LOCOMOTION DEVICE

The belt of the treadmill moves backward or forward as the user walks forward or backward, respectively. This is a passive, non-motorized movement. The backward movement of the belt is synchronized with the forward movement of the user, which gives jerk-free motion and solves the stability problem. For maneuvering, which involves turning or side-stepping, our rotation control system rotates the whole treadmill in the required direction on the mechanical rotating base. In the case of turning, shown in Figure 3, a foot resting on more than three strips indicates that the user wants to turn, and the treadmill is rotated. If the middle strip of the new footstep lies to the left of the middle strip of the previous footstep, the rotation is to the left; if it lies to the right, the rotation is to the right.
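The turn rule described above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the helper names, the force threshold of 0.5, and the indexing of strips from 0 (leftmost) to 23 (rightmost) are assumptions made for illustration.

```python
from typing import List, Optional, Sequence

NUM_STRIPS = 24  # number of sensing strips stated in the text


def pressed_strips(readings: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Indices of strips whose force reading exceeds `threshold`.

    Assumption: index 0 is the leftmost strip, NUM_STRIPS - 1 the rightmost.
    """
    return [i for i, value in enumerate(readings) if value > threshold]


def middle_strip(strips: Sequence[int]) -> Optional[float]:
    """Centre of the span of pressed strips, or None if nothing is pressed."""
    if not strips:
        return None
    return (min(strips) + max(strips)) / 2


def rotation_direction(prev_strips: Sequence[int],
                       new_strips: Sequence[int]) -> Optional[str]:
    """Return 'left' or 'right' if the new footstep signals a turn, else None."""
    if len(new_strips) <= 3:
        return None  # foot covers three strips or fewer: no turn intended
    prev_mid = middle_strip(prev_strips)
    new_mid = middle_strip(new_strips)
    if prev_mid is None or new_mid is None or new_mid == prev_mid:
        return None
    return "left" if new_mid < prev_mid else "right"
```

For example, if the previous footstep pressed strips 10 to 12 and the new footstep presses strips 5 to 9, the new foot covers five strips and its middle strip lies to the left, so the sketch reports a left rotation.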

Figure 3: Rotation of treadmill for veer left turn (i.e. 45°): (a) position of treadmill before turning and (b) after turning.

Figure 4: Rotation of treadmill for side-stepping (i.e. 15°): (a) before side-stepping and (b) after side-stepping.

In the case of side-stepping, shown in Figure 4, when both feet are on three strips, we compare the distance between the current and the previous foot positions to determine whether side-stepping has taken place. If the distance is more than a threshold value, side-stepping has taken place; otherwise there is no side-stepping. If it is equal to or less than the maximum gap distance, the step is a forward step, so no rotation is performed. After determining the direction and angle of rotation, our software sends the appropriate signals to the servo motor to rotate in the desired direction by the given angle, and the platform rotates accordingly. This process ensures that the user places the next footstep on the treadmill itself and does not step off the belt.

The algorithm to find the direction and angle of turning is based on (a) the number of strips pressed by the left foot (nl), (b) the number of strips pressed by the right foot (nr), (c) the distance between the middle strips of the two feet (dist), and (d) the threshold for the distance between the middle strips of the two feet (d). The outputs are the direction (Left Turn - lt, Right Turn - rt, Left Side stepping - ls, or Right Side stepping - rs) and the angle to turn. The different possible cases of turning and side-stepping are shown in Figure 5.

ALGORITHM
if (nl > 3) && (dist > d) then          // Case 1
    find θ
    left_turn = true                    // i.e. return lt
elseif (nl == 3) && (dist > d) then     // Case 2
    θ = 15°
    left_side_stepping = true           // i.e. return ls
elseif (nl > 3) && (dist < d) then      // Case 3, in rare case
    find θ
    right_turn = true                   // i.e. return rt
elseif (nr > 3) && (dist > d) then      // Case 4
    find θ
    right_turn = true                   // i.e. return rt
elseif (nr == 3) && (dist > d) then     // Case 5
    θ = 15°
    right_side_stepping = true          // i.e. return rs
elseif (nr > 3) && (dist < d) then      // Case 6, in rare case
    find θ
    left_turn = true                    // i.e. return lt
end if
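The case analysis above translates almost line for line into runnable code. The following is a sketch under stated assumptions, not the authors' software: the names `Maneuver` and `classify_step` are hypothetical, and because the paper does not give the formula behind "find θ", the turn angle is left as `None` for the turning cases. Only the 15° side-stepping angle is taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

SIDE_STEP_ANGLE_DEG = 15  # side-stepping angle given in the algorithm


@dataclass(frozen=True)
class Maneuver:
    kind: str                  # "lt", "rt", "ls" or "rs", as in the text
    angle_deg: Optional[int]   # None where the text only says "find θ"


def classify_step(nl: int, nr: int, dist: float, d: float) -> Optional[Maneuver]:
    """Mirror the six cases in the order listed; None means normal walking.

    nl, nr: strips pressed by the left and right foot.
    dist:   distance between the middle strips of the two feet.
    d:      threshold for that distance.
    """
    if nl > 3 and dist > d:        # Case 1
        return Maneuver("lt", None)
    if nl == 3 and dist > d:       # Case 2
        return Maneuver("ls", SIDE_STEP_ANGLE_DEG)
    if nl > 3 and dist < d:        # Case 3 (rare)
        return Maneuver("rt", None)
    if nr > 3 and dist > d:        # Case 4
        return Maneuver("rt", None)
    if nr == 3 and dist > d:       # Case 5
        return Maneuver("rs", SIDE_STEP_ANGLE_DEG)
    if nr > 3 and dist < d:        # Case 6 (rare)
        return Maneuver("lt", None)
    return None                    # no case matched: forward step
```

Because the cases are tested in the listed order, a left-foot condition takes precedence over a right-foot condition when both hold; this follows from the ordering in the algorithm rather than from any additional rule.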

Figure 5: Various cases of turning and side stepping. Panels: Case 1 – Left turn; Case 2 – Left side-stepping; Case 3 – Right turn; Case 4 – Right turn; Case 5 – Right side-stepping; Case 6 – Left turn; Normal walking.

5 SYSTEM ARCHITECTURE

Our system allows visually impaired persons to navigate virtually using a locomotion interface. It is not only closer to real-life navigation than a tactile map, but it also simulates distances and directions more accurately than tactile maps do. The functioning of the locomotion interface for navigating through the virtual environment has been explained in the previous sections. A computer-simulated virtual environment showing a few major pathways of a college is shown in Figure 6. The user (trainee) chooses a starting location and a destination, and navigates by physically standing and walking on our locomotion interface. The current position indicator (referred to as the cursor in this section) moves according to the movement of the user on the locomotion interface. There are two modes of navigation: guided navigation, that is, navigation with system help and environment cues for creating a cognitive map, and unguided navigation, that is, navigation without system help and with environment cues only. During unguided navigation, the data on the path traversed by the user (i.e. trainee) are collected and assessed to determine the quality of the cognitive map created by the user as a result of training. In the first mode of navigation, the Instruction Modulator guides visually impaired people through speech by describing surroundings, giving directions, and giving early information about turnings, crossings, etc.
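How the position indicator (cursor) could follow the user's walking and the platform's rotations can be sketched as a simple dead-reckoning update. This is an illustrative assumption rather than a description of the authors' system: the `Pose` class, the heading convention (0° along the +y axis, clockwise positive) and the use of belt displacement as walking distance are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    x: float = 0.0
    y: float = 0.0
    heading_deg: float = 0.0  # assumed: 0 = +y axis, clockwise positive


def rotate(pose: Pose, delta_deg: float) -> Pose:
    """Apply a platform rotation (e.g. a multiple of 15 degrees) to the heading."""
    return Pose(pose.x, pose.y, (pose.heading_deg + delta_deg) % 360)


def walk(pose: Pose, belt_displacement: float) -> Pose:
    """Advance the cursor by the distance the belt moved under the user.

    Assumption: the belt moves backward by the same distance the user
    steps forward, so belt displacement equals virtual distance walked.
    """
    heading = math.radians(pose.heading_deg)
    return Pose(pose.x + belt_displacement * math.sin(heading),
                pose.y + belt_displacement * math.cos(heading),
                pose.heading_deg)
```

Walking 2 units, turning 90° to the right and walking 1 more unit would place the cursor at approximately (1, 2) under these conventions.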



Figure 6: Screen shot of computer-simulated environments.

Additionally, occurrences of various events such as (i) arrival at a junction, (ii) arrival at object(s) of interest, etc. are signaled by sound through speakers or headphones. Whenever the cursor moves near an object, its sound features are activated, and a corresponding specific sound or pre-recorded message is heard by the participant. The participant can also get information regarding orientation and nearby objects, whenever needed, through help keys. The Simulator also generates an audible alert when the participant is approaching any obstacle. During training, the Simulator continuously checks and records the participant's navigating style (i.e. normal walk or drunkard/random walk) and the path followed by the user when obstacles are encountered. Once the user is confident and has memorized the path and landmarks between source and destination, he navigates in the second mode, that is, without the system's help, and tries to reach the destination. The Simulator records the participant's navigation performance, such as path traversed, time taken, distance traveled and number of steps taken to complete the task. It also records the sequence of objects encountered on the traversed path and the positions where he seemed to have some confusion (and hence took relatively longer time). The Data Collection module keeps receiving data from the force sensors, which are sent to the VR system for monitoring and guiding the navigation. Feet position data are also used for sensing the user's intention to take a turn, which is passed to the motor planning (rotation) module to rotate the treadmill.

6 EXPERIMENT FOR USABILITY EVALUATION

The evaluation consists of an analysis of the time required and number of steps taken to train to competence with our locomotion interface (LI), as compared to another navigation method, the keyboard (KB), and of comments from users that suggest areas for improvement. The experimental tasks were to travel two kinds of routes: an easy path (with 2 turns) and a complex path (with 5 turns).

6.1 Participants

Sixteen blind male students, aged 17 to 21 and unfamiliar with the place, were divided equally into two groups and learned to form cognitive maps through virtual environment exploration. Participants in the first group used our locomotion interface (LI) and participants in the second group used the keyboard (KB) to explore the virtual environment. Each repeated the task 8 times, taking a maximum of 5 minutes for each trial.

6.2 Apparatus

Using Virtual Environment Creator, we designed a virtual environment based on the ground floor of our institute, AESICS (as shown in Figure 6), which has three corridors and eight landmarks/objects. It has one main entrance. Our system lets the participant form cognitive maps of unknown areas by exploring virtual environments. It can be considered an application of the “learning-by-exploring” principle for acquisition of spatial knowledge and thereby formation of cognitive maps using a computer-simulated environment. The computer-simulated virtual environment guides the blind through speech by describing surroundings, giving directions, and giving early information about turnings, crossings, etc. Additionally, occurrences of various events (e.g. arrival at a junction, arrival at object(s) of interest, etc.) are signaled by sound through speakers or headphones.

6.3 Method

The following two tasks were given to participants:
Task 1: Go to the Faculty Room starting from Class Room G5.
Task 2: Go to the Computer Laboratory starting from Main Entrance.
Task 1 is somewhat easier than Task 2: the first path is simple, with only two turns, and the second is a little more complex, with five turns. Before participants began their 8 trials, they spent a few minutes using the system in a simple virtual environment. The duration of the practice session (determined by the participant) was typically about 3 minutes. This gave the participants enough training to familiarize themselves with the controls, but not enough time to train to competence, before the trials began.

6.4 Result

Tables 1 and 2 show that participants performed reasonably well while navigating using the locomotion interface on both paths.
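The improvement across trials can be quantified directly from the average step counts reported in Table 1. The sketch below is illustrative only: the dictionary transcribes the Table 1 values (LI = locomotion interface, KB = keyboard, EP = easy path, CP = complex path), and the helper name `reduction_pct` is hypothetical. The percentage reduction from trial 1 to trial 8 is a summary statistic chosen here for illustration and is not one reported in the paper.

```python
# Average number of steps per trial (trials 1..8), transcribed from Table 1.
steps = {
    "LI EP": [54, 52, 51, 48, 45, 43, 42, 41],
    "LI CP": [90, 86, 83, 76, 72, 70, 70, 65],
    "KB EP": [58, 57, 55, 54, 52, 50, 51, 49],
    "KB CP": [93, 91, 90, 88, 85, 83, 82, 80],
}


def reduction_pct(series):
    """Percentage drop from the first trial to the last trial."""
    return 100.0 * (series[0] - series[-1]) / series[0]


if __name__ == "__main__":
    for condition, series in steps.items():
        print(f"{condition}: {reduction_pct(series):.1f}% fewer steps by trial 8")
```

Under this measure the locomotion interface shows a larger relative reduction in steps than the keyboard on both paths, which is consistent with the trend the authors describe.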
On the easy path, the task was completed with fewer than 41 steps on average, while on the complex path it was completed with fewer than 65 steps on average. The average time was less than 1.2 minutes for the easy path and 2.3 minutes for the complex path. Participants performed less well when navigating with the keyboard on both paths. On the easy path, the task was completed with 49 steps on average, while on the complex path it was completed with 80 steps on average. The average time was less than 2.1 minutes for the easy path and 3.6 minutes for the complex path.

Table 1: Avg. Number of Steps Taken for Each Trial
(LI = locomotion interface, KB = keyboard, EP = easy path, CP = complex path)

Trial     1    2    3    4    5    6    7    8
LI EP    54   52   51   48   45   43   42   41
LI CP    90   86   83   76   72   70   70   65
KB EP    58   57   55   54   52   50   51   49
KB CP    93   91   90   88   85   83   82   80

Table 2: Avg. Time (in Minutes) Taken for Each Trial
(a dash marks a value that is missing from the source text)

Trial     1    2    3    4    5    6    7    8
LI EP    2.4  2.2  2.1  1.8  1.7  1.5  1.4  1.2
LI CP    4.2  4.1  3.9   -    -    -    -    -
KB EP    2.8  2.7  2.5   -    -    -    -    -
KB CP    4.6  4.5  4.3   -    -    -    -    -

Figure 7: Avg. Number of Steps taken for two different paths using LI and KB (x-axis: trial number; y-axis: average number of steps).

Figure 8: Avg. Time (in Minutes) for two different paths using LI and KB (x-axis: trial number; y-axis: average time in minutes).

The figures above show that locomotion interface users steadily improved their performance (time and number of steps taken) over the course of the 8 trials. The time required dropped noticeably after the first 3 trials, and users may need 4 or more trials for their performance to stabilize. User comments support this understanding:
“The foot movements did not become natural until 4-5 trials with LI”.
“The exploration got easier each time”.
“I found it somewhat difficult to move with the LI. As I explored, I got better”.
Even after the 8 trials of practice, LI users still reported some difficulty moving and maneuvering. These comments point us to elements of the interface that still need improvement:
“I had difficulty making immediate turns in the virtual environment”.
“Walking on LI needs more efforts than real walking”.

7 CONCLUSION AND FUTURE WORK

This paper presents a new concept for a locomotion interface that consists of a one-dimensional passive treadmill mounted on a mechanical rotating base. As a result, the user can move on an unconstrained plane. The novel aspect is the sensing of rotations by measuring the angle of foot placement. Measured rotations are then converted into rotations of the entire treadmill on the rotary base. Although the proposed device is of limited size, it gives the user the sensation of walking on an unconstrained plane. Its simplicity of design, coupled with a supervised multi-modal training facility, makes it an effective device for virtual walking simulation. Experimental results indicate the superiority of the locomotion interface over the keyboard for virtual environment exploration. These results have implications for using the locomotion interface to help visually impaired people structure cognitive maps of unknown places and thereby enhance their mobility skills.



We tried to make a simple yet effective, noiseless, non-motorized locomotion device that lets the user hear the audio guidance and feedback, including the contextual help of the virtual environment. In fact, the absence of mechanical noise reduces distraction during training, thereby minimizing obstructions to the formation of mental maps. The specifications and detailing of the design were based on a series of interactions with selected blind people. The authors do not intend to claim that the proposed device is the ultimate one. However, locomotion interfaces have the advantage of providing a physical component and stimulation of the proprioceptive system that resembles the feeling of real walking. We expect the experimental results to guide improvements that make the device more effective. One known limitation of our device is its inability to simulate movement on slopes. We plan to take up this enhancement in our future work.

ACKNOWLEDGMENT

We acknowledge Prof. H. B. Dave's suggestions at various stages during our studies and the work leading to this research paper.

8 REFERENCES

[1] Brooks, F. P. Jr., (1986). Walk Through - a Dynamic Graphics System for Simulating Virtual Buildings. Proc. of 1986 Workshop on Interactive 3D Graphics, pp. 9-21.
[2] Clark-Carter, D., Heyes, A. & Howarth, C., (1986). The effect of non-visual preview upon the walking speed of visually impaired people. Ergonomics, 29 (12), pp. 1575-81.
[3] Darken, R. P., Cockayne, W. R., & Carmein, D., (1997). The Omni-Directional Treadmill: A Locomotion Device for Virtual Worlds. Proc. of UIST'97, pp. 213-221.
[4] De Luca, A., Mattone, R., & Giordano, P. R., (2007). Acceleration-level control of the CyberCarpet. 2007 IEEE International Conference on Robotics and Automation, Roma, I, pp. 2330-2335.
[5] Earnshaw, R. A., Gigante, M. A., & Jones, H., editors (1993). Virtual Reality Systems. Academic Press, 1993.
[6] Hirose, M. & Yokoyama, K., (1997). Synthesis and transmission of realistic sensation using virtual reality technology. Transactions of the Society of Instrument and Control Engineers, vol. 33, no. 7, pp. 716-722.
[7] Hollins, M., (1989). Understanding Blindness: An Integrative Approach, chapter Blindness and Cognition. Lawrence Erlbaum Associates, 1989.
[8] Hollerbach, J. M., Xu, Y., Christensen, R., & Jacobsen, S. C., (2000). Design specifications for the second generation Sarcos Treadport locomotion interface. Haptics Symposium, Proc. ASME Dynamic Systems and Control Division, DSC-Vol. 69-2, Orlando, Nov. 5-10, 2000, pp. 1293-1298.
[9] Iwata, H. & Fujii, T., (1996). Virtual Perambulator: A Novel Interface Device for Locomotion in Virtual Environment. Proc. of IEEE VRAIS'96, pp. 60-65.
[10] Iwata, H., Yano, H., Fukushima, H., & Noma, H., (2005). CirculaFloor. IEEE Computer Graphics and Applications, Vol. 25, No. 1, pp. 64-67.
[11] Iwata, H., Yano, H., & Tomioka, H., (2006). Powered Shoes. SIGGRAPH 2006 Conference DVD (2006).
[12] Iwata, H., Yano, H., & Tomiyoshi, M., (2007). String walker. Paper presented at SIGGRAPH 2007.
[13] Iwata, H. & Yoshida, Y., (1997). Virtual walk through simulator with infinite plane. Proc. of 2nd VRSJ Annual Conference, pp. 254-257.
[14] Iwata, H. & Yoshida, Y., (1999). Path Reproduction Tests Using a Torus Treadmill. PRESENCE, 8(6), pp. 587-597.
[15] Lynch, K., (1960). The image of the city. Cambridge, MA, MIT Press.
[16] Lahav, O. & Mioduser, D., (2003). A blind person's cognitive mapping of new spaces using a haptic virtual environment. Journal of Research in Special Educational Needs, v3 i3, pp. 172-177.
[17] Noma, H. & Miyasato, T., (1998). Design for Locomotion Interface in a Large Scale Virtual Environment, ATLAS: ATR Locomotion Interface for Active Self Motion. 7th Annual Symposium on Haptic Interfaces for Virtual Environment and Teleoperator Systems. The Winter Annual Meeting of the ASME. Anaheim, USA.
[18] Shingledecker, C. A. & Foulke, E., (1978). A human factors approach to the assessment of mobility of blind pedestrians. Human Factors, vol. 20, pp. 273-286.
[19] Whitton, M. C., Feasel, J., & Wendt, J. D., (2008). LLCM-WIP: Low-latency, continuous-motion walking-in-place. In Proceedings of the 3D User Interfaces (3DUI '08), pp. 97-104.



Viveca Jimenez-Mixco¹, Antonella Arca¹, Jose Antonio Diaz-Nicolas¹, Juan Luis Villalar¹, Maria Fernanda Cabrera-Umpierrez¹, Maria Teresa Arredondo¹, Pablo Manchado², Maria Garcia-Robledo²
¹ Life Supporting Technologies, Technical University of Madrid, Spain, vjimenez@lst.tfo.upm.es
² SIEMENS S.A., Spain

ABSTRACT
In today's society, where the group of elderly people and people with disabilities is constantly growing, mainly due to increased life expectancy, ICT developers must provide systems that meet the accessibility and usability needs of this community and thereby enhance their quality of life. Ambient Assisted Living, intended to help people live independently, with autonomy and security, is one of the most promising solutions emerging to address this technological challenge. This paper presents the approach proposed in the context of the European-funded VAALID project to enable rapid prototyping of accessible and usable Ambient Intelligence solutions by integrating Virtual Reality simulation tools, as well as appropriate user interfaces, into the development cycle. The first functional prototype is planned for March 2010 and will be evaluated for six months at three pilot sites with up to 50 users, starting in May 2010.
Keywords: Virtual reality, ambient assisted living, rapid application development, assessment, accessibility, usability.



1 INTRODUCTION

Society is undergoing a process in which life expectancy is gradually but constantly increasing. As a result, the group of elderly people is growing to become one of the most significant in the entire population [1]. This also means that the prevalence of physical and cognitive impairments is increasing in proportion. Elderly people usually suffer from vision deficiencies (yellowish and blurred image), hearing limitations (especially at high frequencies), motor impairments (for selection, execution and feedback) and slight deterioration of their cognitive skills [2]. In this context, providing the elderly and people with disabilities with accessible systems and services that improve their level of independence, and thus enhance their quality of life, has become a must for ICT developers such as usability engineers and interaction designers.

Ambient Assisted Living (AAL) is one of the solutions that are beginning to address this technological challenge. The concept of Ambient Assisted Living represents a specific, user-oriented type of Ambient Intelligence (AmI). It comprises technological and organisational-institutional solutions that can help people to live longer at the place they like most, ensuring a high quality of life, autonomy and security [3]. AAL solutions are sensitive and responsive to the presence of people and provide assistive propositions for maintaining an independent lifestyle [4].

Within this complex and continuously evolving framework, it is very challenging to meet all users’ needs and requirements regarding accessibility and usability throughout the development process. Accessibility is a prerequisite for basic use of products by as many users as possible, in particular elderly persons and persons with sensory, physical or cognitive disabilities. Usability denotes the ease with which these products or services can be used to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use [5]. These aspects should be taken into account during product design, ideally from the early stages, following a more interactive and iterative design-development-testing procedure. The major problem lies in the global cost of the design and development process, which can increase critically, since AmI solutions involve complex features such as ubiquity, context awareness, smartness, adaptiveness and computing embedded in everyday goods.

Life Supporting Technologies, the research group responsible for this paper, has for years been addressing the convergence of domotics and accessibility. As a result of this process, the group is exploring the



application of Virtual Reality (VR) technologies in the process of design and development of accessible solutions for elderly people and people with any kind of disability. One of the achievements in this area was the establishment of a living lab at the Technical University of Madrid that allowed the assessment of the user experience of people with disabilities in smart homes using two key technologies: virtual reality and domotics [6]. The living lab integrated a VR application into a real smart home installation. It was configurable for different settings and user profiles, and capable of supporting multimodal interaction through a set of VR and other commonly used devices and displays. The design and implementation process followed the Design-For-All principles, taking into account concepts such as usability, adaptability, multimodality and standardisation.

The living lab proved to be a useful tool for interaction designers and usability engineers to immerse users in a virtual environment and assess, through the application, their experience in terms of interaction devices, modalities and reactions within smart home environments. Based on this assessment, designers would be able to develop new concepts with users, improve existing solutions, and explore, for instance, the possibilities of innovative AAL products and services. The encouraging preliminary results suggested multiple possibilities for VR in providing people with disabilities with better-adapted access to domotics-related applications. However, this solution had important limitations, especially as it required a significant implementation effort to assess the user experience in just one environment integrating a pre-defined set of products and services.
This paper presents an approach proposed in the context of the European-funded project VAALID that extends the key concepts applied in this living lab, providing an easier method to create virtual environments and implement interactivity, enabling dynamic changes of environment conditions and characteristics, and allowing a thorough evaluation of users and real-time interaction techniques. An authoring tool will be developed to enable truly rapid prototyping and validation of accessible and usable AmI solutions by integrating Virtual Reality (VR) tools and appropriate user interfaces. This approach will bridge the gap between planning AmI scenarios and building and assessing them in reality from the very beginning of the development process, reducing the global design and development effort.

2 VAALID CONCEPT

VAALID is a European research project that aims to develop advanced computer-aided engineering tools that will allow ICT developers, especially those who design AAL products and services, to optimise the whole process of user interaction design, make it more efficient, and validate usability and accessibility at all development stages, following a User Centred Design (UCD) process. The VAALID platform will use VR technologies to provide an immersive 3D virtual environment, specifically created for each possible use scenario, where AAL users can interactively experience new interaction concepts and techno-elements. The use of VAALID tools will make the Universal Design of AAL solutions feasible, both economically and technically; such solutions have the potential to be acceptable to most people, since their needs are taken into account proactively during the development phases.

The methodology proposed to address AAL solutions is based on a UCD approach, drawing together the practical, emotional and social aspects of people's experience and bringing about the innovation needed to deliver real user benefit. For that reason, UCD is particularly useful when a new product or service is to be introduced, as is the case with AAL solutions. The methodology consists of four iterative phases of design, development and evaluation, in which both usability engineers and interaction designers must participate, involving AAL users (i.e. elderly people and people with disabilities) throughout the process [7]:
• Concept. First, the AAL solution requirements must be extracted, including the functions that the proposed solution provides and how it reacts and behaves, as well as the constraints that should be considered in the design process.
• Design. Once the requirements are well identified, developers define the specifications of the AAL solution, taking into account all significant facets that may influence the development process. Low-fidelity virtual prototypes of the AAL solution, including 3D virtual AAL-enabled spaces, are built to reflect all aspects of the conceptual design, and are further evaluated by users. Design iterations are driven by users’ feedback in terms of acceptance and accessibility issues until the requirements are met.
• Implementation. This phase involves the creation of real and fully functional high-fidelity AAL solution prototypes, with the aim of transforming the validated conceptual design into a concrete and detailed solution. The components developed at this stage must be tested against their accessibility features, and improvements or corrective actions must be addressed accordingly.
• Validation. Finally, the implementation of the AAL solution prototypes is evaluated and assessed, detecting usability issues both automatically and with potential end users.
This methodology allows virtually simulating



each aspect of an AAL product/service and validating it before the real implementation. The whole process involves both virtual and mixed reality elements. The simulation in the design phase mainly requires 3D virtual environments to reproduce the conceptual design of the solution; the implementation phase goes a step further and adds the possibility of using mixed reality elements, so that real functional prototypes can be tested within virtual environments as well.

To let developers apply this methodology across all the stages of the design cycle, and thus make possible the rapid development of AAL solutions and their further assessment with users, the VAALID platform will be structured in two parts: the Authoring Framework and the Simulation Framework. The Authoring Framework will provide the ICT designer with the appropriate components to deal with the three main pillars of an AAL solution, including the creation of user profiles, the modification of AAL-enabled 3D spaces (including sensors, communication networks and interaction devices and functions), the creation of virtual user-interaction devices (which may be embedded in everyday objects) and new concepts for devices and products. These individual components will afterwards be validated as an integrated environment in the Simulation Framework.

The VAALID project started in May 2008 and the first functional prototype of the VAALID platform is planned for March 2010. This prototype will be evaluated over six months at three pilot sites (Germany, Italy, Spain) with up to 50 users, starting in May 2010.

2.1 Target Users

VAALID target users can be divided into three main groups:
• Primary users: Designers of AAL solutions that will use VAALID as a professional instrument. This group includes Interaction Engineers, who design the structure of the simulation, building the seniors’ profile and defining the interaction modes with the environment, and Usability Engineers, who plan the interface between AAL services and senior citizens through the study of their interactions with the VAALID system.
• Beneficiaries: The main target group of users who will benefit from the results of using VAALID tools. They will be: elderly people over 60 years old who may have mild hearing/sight problems, mobility impairments, or the normal age-related decline in cognitive and physical abilities; young people with hearing/sight/mobility problems; or any other group of users that may profit from accessible AAL solutions.
• Secondary users: All those users that may benefit indirectly from VAALID, using it as a consultancy service. They are: architects, construction planners, care centres, suppliers of interaction devices, public administration, interior designers and other stakeholders who work for companies that buy and develop AAL services; and system designers, who implement AAL solutions and validate the usability and accessibility of their products, such as sensors, actuators or control software.

2.2 Sample Scenario

The potential use of VAALID can be illustrated through the following simplified scenario: A small company specialised in AAL wants to develop a service for detecting falls of elderly people when they are alone at home; if a fall is detected, an alarm is generated and automatically sent to an emergency centre. Following the VAALID approach (see Fig. 1), an interaction designer first creates a new project in the Authoring Framework.

Figure 1: Development cycle proposed in VAALID.

He selects the user profile of a person over 80 with moderate hearing problems, and VAALID automatically limits the possible elements and features consistent with that profile. He imports an AutoCAD model of a house, previously created in an architect studio for the company, and adds to the 3D model all the sensors and objects that will be involved. He also



selects from the libraries the service “Fall down” and redesigns this model, adding all the elements needed for the service to work properly. In this case, he decides to embed the sensors in a carpet in each room of the house. By running the simulation in the Simulation Framework he can check whether the service has been correctly defined: the service workflow is coherent, the sensors involved are placed correctly around the house, all the features are defined in accordance with the user profile, etc.

Now, the designer needs a real user to test the service in a realistic environment and give his opinion. In the simulation room, which has been equipped with specialised VR technologies, the user wears dedicated glasses to become immersed in the virtual scene of the house. Among the different options available in the simulation room, the designer decides that the easiest way for the user to simulate movements is body gesture. After a short training, the user is capable of moving around and interacting with the house. He lies down on the floor of the simulation room to simulate a fall, and can therefore experience what would happen if he had really fallen down, and how the alarm service would react. He asks the designer to change the dimensions and the position of the carpet, and to reduce the time that the system should wait before launching the alarm. The designer sets the new preferences of the user in real time.

At this point, the service is being simulated in a 3D environment with virtual elements; afterwards, once the concept is fully defined and the prototype of the smart carpet is created, it can be assessed more realistically through a mixed reality environment. This means that the carpet can be taken out of the virtual scene and the real prototype tested instead by the user in the same simulation room. Thus, by enabling the simultaneous use of virtual and real elements, the service can be validated before the construction of a real living lab.
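The fall-detection service in this scenario, with its user-adjustable waiting time before the alarm, can be pictured with a small sketch. This is only an illustration under stated assumptions and not the VAALID implementation: the class name `FallAlarmService`, the method `on_carpet_reading`, the boolean `lying_detected` reading and the 30-second default are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FallAlarmService:
    """Hypothetical rule: fire an alarm once the carpet sensors have
    reported a lying posture for at least `wait_seconds`."""
    wait_seconds: float = 30.0            # delay before the alarm (adjustable)
    _down_since: Optional[float] = None   # time the lying posture was first seen
    alarm_sent: bool = False

    def on_carpet_reading(self, timestamp: float, lying_detected: bool) -> bool:
        """Feed one sensor reading; return True only when the alarm fires."""
        if not lying_detected:
            # The user is up again: forget the episode and re-arm the alarm.
            self._down_since = None
            self.alarm_sent = False
            return False
        if self._down_since is None:
            self._down_since = timestamp
        elapsed = timestamp - self._down_since
        if not self.alarm_sent and elapsed >= self.wait_seconds:
            self.alarm_sent = True
            return True
        return False


service = FallAlarmService(wait_seconds=30.0)
readings = [(0, True), (10, True), (31, True), (40, True)]
fired = [service.on_carpet_reading(t, lying) for t, lying in readings]
print(fired)  # [False, False, True, False]
```

Shortening the delay, as the user requests in the scenario, would amount to changing `wait_seconds` on the running object before the next readings arrive.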
Several scenarios describing similar possible situations were examined by experts with different profiles, including interaction designers and usability engineers. Their impressions and recommendations regarding the main aspects of the VAALID concept, such as working with elderly people, 3D and virtual reality technologies, have been taken into account for the final definition of the characteristics and functionalities of the Authoring and Simulation Frameworks.

2.3 Authoring Framework

The Authoring Framework [8] is a tool created for interaction designers and usability engineers. Its main objective is to help them build the core elements that compose an AAL service simulation context. The appearance of the Authoring Tool is based on the look and feel of Eclipse (centre stage, properties tab, project browser, etc.) so that an intuitive interface helps the developer to rapidly create the virtual environment where the user moves for tests. It can be personalised and configured to fit the needs of each designer, and also provides a help section. Following the RAD (Rapid Application Development) methodology [9], this tool allows the designer to create a model containing all those templates that will be integrated and then executed inside the Simulation Framework.

The AAL simulation is created from a conjunction of templates stored in a project, the basic component of the Authoring Framework. Every simulation is stored as a single project that is composed of three elements: User Model, Environment, and AAL Service. Each of these elements is created by editing pre-existing characteristics described as properties and behaviour. Properties are defined through ontologies that represent static features of a single model; behaviours are described as workflows of the element in relation to other elements by means of interaction. With this kind of information the designer can rapidly build models that follow user needs.

2.3.1 Authoring Toolkit

The Authoring Framework workspace is divided into three editors, one for each model (Fig. 2):
• User Model Builder. The term “User” here refers to the beneficiaries of VAALID, i.e. elderly people or people with disabilities. This user editor defines the user profile, including physical, sensory and cognitive abilities. This kind of information is collected during the design and testing phases when creating AAL services. The functions provided by this builder are: creating a new User Model from scratch; importing or exporting an existing User Model, by exchanging profiles between the current Project and the Library (or Repository); and removing the User Model associated with the current Project. The same actions are available for the Behaviour of a User Model, which can be imported, removed, exported or associated with another User Model.
• Environment Model Builder. The Environment Model reproduces a standard real place with a series of properties. This editor allows developing the 3D simulation environment where users can be immersed, as in a real assisted world, and try new interaction modes and new (virtual) interaction devices. Pre-existing 3D models can be used to compose an Environment Model: common objects (including rooms, furniture or architectural elements in general), interaction devices (like sensors and actuators) and complex devices (a combination of the previous ones). Objects are characterised by their properties; interaction devices also have a behaviour. Complex devices have the same characteristics as an interaction device but are represented by a set of related sensors and actuators, targeted to a unique and



specific function, as a single composite device. An innovative feature is the possibility to browse existing objects within the VAALID Library, allowing components to be refined and reused, starting from a CAD program or a 3D animation tool that exports objects as VRML or X3D files. To make the simulation more realistic, the designer can make minor modifications to the dimensions and positions of the objects inside a scene. As with the User Model, environments and objects are composed of properties and behaviour, and can be imported, exported, retrieved from the Library and edited through graphical metaphors.
• AAL Service Compositor. This tool is an editor for the creation of an AAL Service Model, which is mainly described as a workflow providing links between the user and the objects of the scene. It essentially acts as a controller that processes information coming from sensors, triggered by explicit or implicit user actions, and consequently activates the relevant actuators (e.g. security systems, lighting, heating/air conditioning), consistently with the service specifications.
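The three models described above, and the controller role of the AAL Service, can be sketched as plain data structures. This is a minimal illustration and not the VAALID data model: the class names, the `sensor:location` event strings and the rule table are assumptions made for the example, and the workflow-based behaviour is reduced here to a simple lookup from sensor events to actuator commands.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Model:
    """A user or environment model, reduced to its static properties."""
    name: str
    properties: Dict[str, object] = field(default_factory=dict)


@dataclass
class AALService:
    """Controller that maps sensor events to actuator commands."""
    name: str
    rules: Dict[str, List[str]] = field(default_factory=dict)

    def handle(self, sensor_event: str) -> List[str]:
        # Events without a rule activate no actuator.
        return list(self.rules.get(sensor_event, []))


@dataclass
class Project:
    """One simulation: a user model, an environment and an AAL service."""
    user: Model
    environment: Model
    service: AALService

    def run(self, sensor_events: List[str]) -> List[Tuple[str, List[str]]]:
        return [(event, self.service.handle(event)) for event in sensor_events]


project = Project(
    user=Model("senior", {"age": 82, "hearing": "moderate loss"}),
    environment=Model("house", {"rooms": ["hall", "living_room"]}),
    service=AALService("comfort_and_safety", {
        "presence:hall": ["light:hall:on"],
        "fall:living_room": ["alarm:emergency_centre"],
    }),
)
trace = project.run(["presence:hall", "fall:living_room", "door:open"])
for event, actions in trace:
    print(event, "->", actions)
```

Keeping the three models as separate objects mirrors the idea that each can be imported, exported or swapped independently before a simulation is run.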

Figure 2: Authoring Framework scheme (diagram labels: Project Editor, User Model Builder, Environment Model Builder, AAL Service Compositor).

Three data layers are handled and exchanged for most of the modelled elements:
• Representation: graphical components that permit visualisation of each element and interaction with the designer.
• Instance: structure of classes that holds the actual element model and allows its management by Java modules.
• File: raw data that keep the element description when stored in a drive or the library.
Once created, every model can be exported to the VAALID Repository for reuse in further projects. In this way the Authoring Framework makes it possible to build up a growing collection of models to use in different simulations or to execute many variants of the same simulation. Finally, the Project Editor integrates the three tools for editing models of user, environment and AAL service in a common framework in order to manage a single simulation.

2.3.2 Authoring Implementation Facts

According to the software architecture defined, each tool works using collaborative modules, managing and sharing pieces of software. The use of Eclipse RCP (Rich Client Platform) is a step forward towards the implementation phase. This particular distribution includes the subset of components that are natively used to build the Eclipse framework itself. In this sense, client applications developed under Eclipse RCP share the same software infrastructure as Eclipse, benefiting from advanced built-in functionalities such as:
• Native visual elements of the Eclipse deployment platform.
• Perspective management, enabling different software views sharing the same data model.
• Plugin-based architecture, facilitating version control and modular development.
• Auto-update functionality that facilitates software maintenance.
• Integrated high-quality help files.

Figure 3: Authoring Framework modules diagram (diagram labels: 3D Model Manager, Ontology Manager, Workflow Manager, VRML Parser, Ontology Parser, Workflow Parser, File Manager).

These features enable certain advanced capabilities of the VAALID user interface concept, like the use of perspectives to facilitate seamless transition between the Authoring and Simulation Frameworks as well as to access content through different views and levels of detail (e.g. object browsers, flexible lists, 2D/3D floor plans), depending on user preferences and expertise. Individualisation of the screen layout is also possible because RCP exploits the native potential of the same visual components as Eclipse. Regarding the multi-developer condition of the VAALID software, the RCP architecture based on plugins allows modular independence among implementation teams, considering each plugin as an



additional element of the final software framework. The auto-update feature will help in this modular approach, assisting in the adoption of updated plugin versions as soon as they are released. The integrated help infrastructure will provide additional low-effort support for VAALID designers.

As shown in Fig. 3, each editor in the Authoring Framework is composed of two main parts: the Element Manager (Ontology Manager, Workflow Manager and 3D Model Manager) and the Element Parser (Ontology Parser, Workflow Parser and VRML Parser). The Element Manager performs the translation between instances and graphical representations, while keeping in memory the actual model of the elements. In particular, the Ontology Manager plays a more prominent role, since it acts as a kind of overall controller, calling the other element managers when required. The Element Parser is responsible for converting instances to files and vice versa, verifying that each element maintains a suitable format. Finally, it is worth noting that, according to the overall VAALID architecture, the Authoring Framework shares the same instance structure and memory with the Simulation Environment, in particular with the Simulation Control Panel. This ensures a seamless transition and permanent data consistency between both frameworks.

2.3.3 Viewing 2D/3D Spaces

One of the most innovative features of VAALID is the integration of 3D technologies in the Authoring Framework to streamline the design and evaluation of AAL services. In addition to the 3D view of the floor plan, the Authoring Framework also provides a 2D view in which it is possible to select objects and get a clearer idea of the distances and orientation of all the elements present in the scene. Selections are synchronised so that the system automatically applies the changes in both views. The Eclipse RCP platform provides some functionalities to facilitate 3D management. The use of perspectives and views permits immediate switching between 2D and 3D floor plans sharing the same data model imported from the original VRML file. The actions/views mechanisms enable direct manipulation of objects from the environment, taking into account different selection sources (browser, flexible list, floor plans, workflow editor, history lists, etc.). The 3D Model Manager supports 3D rendering and navigation, allowing rotation, zoom and tilt within the user view, while detecting object collision.

2.4 Simulation Framework

Once the individual elements are defined, the process of creating experimental AAL environments needs a testing and assessment phase. Thus, apart from a core set of technologies and software building components, there is a need [10] for appropriate facilities that offer the possibility of:
• Testing different technical solutions from the point of view of their overall usefulness to users.
• Providing a common environment for testing cooperative activities and virtual spaces.
Usually, testing ambient behaviour and interaction is only possible in real laboratories. The innovation of this approach is that it will be possible to test and assess AAL scenarios, products and services throughout the development process in virtual environments, before experimenting in real contexts. The models (service, user and environment), previously defined in the Authoring Framework, are put together and run in the Simulation Framework during the different stages of the development. Simulations provide feedback to developers about the accessibility, usability and user acceptance of the human-environment interaction.

The Simulation Framework is composed of two main tools (Fig. 4): the Simulation Control Panel, which allows developers to configure and run the simulations, and the 3D engine or AAL Services interaction simulator, which is a renderer for the 3D scenes, based on the Instant Reality system. Both of them communicate with a workflow engine, which is in charge of executing all the workflows related to a simulation.

Figure 4: Simulation Framework scheme.

There are two types of simulation-validation tests that engineers can perform:
• A first type is done with virtual users. These are models of users defined within the Authoring Framework and characterised by behaviour models. This phase of assessment is important for the integration of the different interaction modalities, since it allows definition and refinement of the behaviour model at any stage of the design process. Engineers can check constraints that state incompatible values for specific properties of the different elements defined in the AAL scenario.
• The second type involves real users in an immersive environment (3D virtual environment). Users will be allowed to experience real-time



interaction with an AAL environment using both virtual and tangible interaction devices. A virtual interaction device can be a sensor or an actuator represented in the 3D virtual environment; a tangible device (or simulation control) is physical equipment that enables interaction between the user and the virtual environment. The feedback from real users to designers will be critical in the process of meeting their specific needs and requirements.

At the moment, the project is exploring the feasibility of integrating several simulation controls into the platform, such as Nintendo Wii Remote, Intersense Head Tracking, LED-based Gloves, Visual Hand Control or Android Mobile Phone. These controls will be extensively assessed during the pilot tests, with the aim of finding the best-adapted solution for each user.

The possibility of performing these assessment phases during the design process of AAL solutions, before building real living labs, has key benefits such as savings in time and cost. In addition, users can participate in a controlled environment, since VR technologies ensure safe and secure interaction. This does not mean that evaluation in a real living lab has to be avoided, but that any further interaction experiment will be enriched by the results obtained in the preliminary design process.

2.4.1 Study Case: Using an Android Mobile Phone

As stated before, VAALID aims at providing VR-based tools that ease the process of designing accessible solutions for ambient intelligence environments. The objective is to allow engineers to pre-validate innovative services with final users in a realistic setting using virtual scenarios, as a first filter before the actual validation in living labs. One important step in the investigation is the testing of different interaction devices in order to assess how immersed users feel when joining the simulation.
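Since several interchangeable simulation controls are being considered, one way to picture how they could plug into the same virtual environment is a common interface that turns raw device events into navigation commands. The sketch below is purely illustrative: `SimulationControl`, `MappedControl`, the command set and the raw event names are hypothetical and do not describe the VAALID platform or any device API.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Optional

# Navigation commands understood by a (hypothetical) 3D engine adapter.
COMMANDS = {"advance", "retreat", "turn_left", "turn_right", "select"}


class SimulationControl(ABC):
    """A tangible device that turns raw input into navigation commands."""

    @abstractmethod
    def translate(self, raw_event: str) -> Optional[str]:
        """Return a command for this raw event, or None if it is not mapped."""


class MappedControl(SimulationControl):
    """Control driven by a lookup table; the event names are made up."""

    def __init__(self, name: str, mapping: Dict[str, str]) -> None:
        unknown = set(mapping.values()) - COMMANDS
        if unknown:
            raise ValueError(f"unsupported commands: {sorted(unknown)}")
        self.name = name
        self.mapping = dict(mapping)

    def translate(self, raw_event: str) -> Optional[str]:
        return self.mapping.get(raw_event)


def replay(control: SimulationControl, raw_events: Iterable[str]) -> List[str]:
    """Translate recorded raw events, skipping those the control ignores."""
    commands = (control.translate(event) for event in raw_events)
    return [c for c in commands if c is not None]


remote = MappedControl("remote", {"stick_up": "advance", "button_a": "select"})
tracker = MappedControl("head_tracker", {"yaw_left": "turn_left", "yaw_right": "turn_right"})
session = ["stick_up", "yaw_left", "button_a"]
print(replay(remote, session))   # ['advance', 'select']
print(replay(tracker, session))  # ['turn_left']
```

With such an interface, comparing controls for a given user would only require replaying the same task with a different `SimulationControl` object.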

Taking advantage of the flexibility of Instant Reality and the multimodal characteristics of the new generation of smart phones, a special setting was prepared to perform some technical and usability tests [11]. Several engineers were asked to explore and interact with a 3D scene using an Android-based mobile device (an HTC Magic smart phone), and the execution of some pre-defined tasks, such as moving around, finding objects or grabbing a book, was analysed. After considering different approaches, multimodal user interaction was defined using the handheld device as follows, focusing on haptic interfaces (Fig. 5):
• Device rotation (i.e. forwards, backwards, clockwise and counter-clockwise): performs 3D movements within the virtual environment (respectively: advance, retreat, turn right and turn left).
• Finger dragging over touchscreen: performs horizontal movements of the virtual pointer.
• Trackball rotation: performs vertical movements of the virtual pointer.
• Trackball click: sequentially picks up/releases a particular virtual object.
• Vibrator: provides vibration feedback to the user when the virtual pointer collides with the virtual object.
Considering the collected data, preliminary results show that users felt comfortable using the device and described the experience as realistic, although there were valuable suggestions to improve the interaction (e.g. allowing sensitivity calibration). From a technical point of view, this can be taken as a good starting point for future work with VR-based applications, although further research is required concerning its suitability for elderly users.

Figure 5: Testing VR using a smart phone.

3 DISCUSSION AND CONCLUSION

Accessibility and usability concepts are currently considered within a limited range of ICT applications and services, mostly confined to research and development activities, with significant reservations when it comes to production and deployment phases. Although the seven principles of universal design or Design for All [12] are well known and applicable to a wide variety of domains, business stakeholders are still highly reluctant to apply them in practice. This lack of commitment to the elderly and disabled community, in particular when designing AAL solutions, is mainly due to the high costs involved in the iterative design-development-testing procedure and the considerable time needed to meet users’ needs. On the other hand, the adoption of VR technologies seems to conflict with the purpose of designing services for people with disabilities, as few initiatives have been carried out in this field regarding



accessibility requirements. Most of them deal with people with cognitive disabilities (dementia, autism, schizophrenia, Down's syndrome, etc.), proposing simple virtual worlds where users are immersed in order to learn some tasks, acquire some habits or recover some capabilities under a controlled scenario. Nevertheless, VR has been proven to offer significant advantages for persons with all kinds of disabilities. It can present virtual worlds where users can be trained or learn in a controlled environment, and then apply the skills acquired to a real context. VR technologies can be adapted to a wide range of users and needs, and at the same time, users’ abilities and experience can be assessed in order to reach an optimal adaptation.

The work proposed in this paper brings together all these issues into a technological approach that will have a beneficial impact for all the parties involved: the ICT designer will be able to evaluate the suitability of the proposed solutions with a significant reduction of the global design and development effort; business stakeholders will have a cost-effective solution and therefore new market opportunities; and finally, end users will be provided with new services to improve their quality of life and, even better, will be able to participate actively and critically in the process of creating these services.

ACKNOWLEDGMENTS

This work has been partially funded by the European Union in the context of the VAALID project (ICT-2007-224309), coordinated by SIEMENS S.A. The project started on 1 May 2008 and will finish on 31 October 2010. The VAALID consortium is composed of the following partners: SIEMENS S.A., ITACA, Fh-IGD, UNIPR, VOLTA, UID, SPIRIT and UPM.

4 REFERENCES

[1] K. Giannakouris: Ageing characterises the demographic perspectives of the European societies. Eurostat, EUROPOP2008 convergence scenario.
[2] J. Lillo, H. Moreira: Envejecimiento y diseño universal. Anuario de Psicología, 35, 4 (Tema monográfico: Psicología y ergonomía), (2004).
[3] H. Steg, H. Strese, C. Loroff, J. Hull, S. Schmidt: Europe Is Facing a Demographic Challenge Ambient Assisted Living Offers Solutions. Ambient Assisted Living – European Overview Report. (2006).
[4] B. de Ruyter, E. Pelgrim: Ambient Assisted-Living Research in CareLab. ACM-interactions. SPECIAL ISSUE: Designing for seniors. New York (2007).
[5] K. Wegge, D. Zimmermann: Accessibility, Usability, Safety, Ergonomics: Concepts, Models, and Differences. Universal Access in HCI, Part I, HCII 2007, LNCS 4554, pp. 294–301. Springer-Verlag Berlin Heidelberg (2007).
[6] V. Jimenez-Mixco, R. de las Heras, J.L. Villalar, M.T. Arredondo: A New Approach for Accessible Interaction within Smart Homes through Virtual Reality. Universal Access in HCI, Part II, HCII 2009, LNCS 5615, pp. 75–81. Springer-Verlag Berlin Heidelberg (2009).
[7] J.C. Naranjo, C. Fernandez, P. Sala, M. Hellenschmidt, F. Mercalli: A modelling framework for Ambient Assisted Living validation. Universal Access in HCI, Part II, HCII 2009, LNCS 5615, pp. 228–237. Springer-Verlag Berlin Heidelberg (2009).
[8] VAALID Deliverable 3.1: Authoring Environment Functional Specification. May 2009. http://www.vaalid-project.org/ Contract number: ICT-2007-224309.
[9] H. Mackay, C. Carne, P. Beynon-Davies, D. Tudhope: Reconfiguring the User: Using Rapid Application Development. Social Studies of Science, Vol. 30, No. 5 (Oct. 2000), pp. 737-757.
[10] P. L. Emiliani, C. Stephanidis: Universal access to ambient intelligence environments: Opportunities and challenges for people with disabilities. IBM Systems Journal, Vol. 44, No. 3 (2005).
[11] Arca, J. Villal, J. Diaz, M. T. Arredondo: Haptic Interaction in Virtual Reality Environments through Android-based Handheld Terminals. 3rd European Conference on Ambient Intelligence, AmI09, pp. 259-263. M. Tscheligi et al. (Eds.): AmI09, Salzburg, Austria, 2009, ICT&S Center, University of Salzburg, ISBN: 978-3-902737-00-7.
[12] M. F. Story: Maximizing Usability: The Principles of Universal Design. Assistive Technology 10, No. 1, pp. 4–12 (1998).



Jaume Duran
UNIVERSITAT DE BARCELONA, Barcelona, Spain
jaumeduran@ub.edu

David Fonseca
GTM / LA SALLE - UNIVERSITAT RAMON LLULL, Barcelona, Spain
fonsi@salle.url.edu

ABSTRACT

Many of the different characters that appear in the computer-animated movies of Pixar Animation Studios are personages. From a dramaturgical point of view, these can be linked with the concept of Archetype. The same types of personages appear in all times and in all cultures. These universal patterns make it possible for the experience to be shared across different stories. The patterns do not identify concrete idiosyncrasies; rather, they function as a temporary development within a story for the purpose of enriching it. Another way of interpreting the personages of a narrative is to consider them as complementary facets of the main character (the hero). As the story develops, the characteristics of these personages modify the personality of the future hero. The aim of this work is to analyze the influence of these complementary personages on the transformation of the main character and to examine whether the presence of a disability is used to obtain this transformation. As we will see, we find not only personages whose disability affects the development of the protagonist, but also others that simply fulfill secondary functions. The study is based on the first seven full-length films produced by Pixar, but we will center on the particular case of Finding Nemo.

Keywords: Computer Animation, Pixar Animation Studios, Dramaturgy



Christopher Vogler [33] related mythical structures and their mechanisms to the art of writing narrative works and scripts, after studying the proposals of Joseph Campbell [2] and Carl Gustav Jung [13, 14, 15]. To do so, he divided the theoretical trip of the fictional hero into twelve stages and enumerated seven archetypes. According to Vogler, most stories are composed of a few structural elements that we also find in universal myths, in tales, in movies, and even in dreams. In them, the hero, generally the protagonist, leaves their daily environment to embark on a journey that will lead them through a world full of challenges. It can be a real trip, with a clear destination and a definite purpose, or it can be an interior trip, which takes place in the mind, heart or spirit. In any case, the hero ends up undergoing changes and growing throughout. There are twelve stages that compose this trip:

• The Ordinary World: the first stage, when the hero appears in their daily environment and their ordinary world.
• The Call to Adventure: the second stage, when the hero generally faces a problem and an adventure appears before them.
• The Rejection of the Adventure: the third stage, when the hero frequently refuses the call to action.
• The Meeting with the Mentor: the fourth stage, when the personage of the mentor appears.
• The Passage of the First Threshold: the fifth stage, when the hero begins the adventure.
• Tests, Allied Forces, Enemies: the sixth stage, when new challenges are revealed and at the same time the hero is presented with new allies and hostile enemies.
• The Approach to the Deepest Cavern: the seventh stage, when the hero prepares a strategy for the definitive moment and gets rid of the last impediments before continuing.
• The Odyssey or the Calvary: the eighth stage, when the hero directly faces what they are most afraid of and begins a tough battle that could result in their own death.
• The Reward (Obtaining the Sword): the ninth stage, when, having survived the battle against death, the hero takes possession of the reward, for example the sought-after sword or treasure.
• The Way of Comeback: the tenth stage, when the hero suffers the consequences of their clash with the forces of evil and of obtaining the reward.
• The Resurrection: the eleventh stage, when the hero faces the second big moment of difficulty, where they again risk losing their life and must overcome the danger once more.
• The Comeback with the Elixir: the twelfth and last stage, when the hero returns to the ordinary world with the obtained treasure. This ends the trip of the hero.

All these stages are parts of a scheme whose particular details vary from story to story and whose order need not be followed rigorously. Some stages can even be suppressed without affecting the story. The stages can be divided into three dramatic acts (so that the story develops in three parts, the first of which occurs before the spectator knows the target of the protagonist):

• First act: the first five stages (1 to 5).
• Second act: the next four stages (6 to 9).
• Third act: the last three stages (10 to 12).

During the hero’s trip, different personages can appear. Their mission can be linked with the concept of Archetype, which Carl Gustav Jung [13] uses to refer to the models of personality that are repeated from remote times and that suppose a heredity shared by every human being. The same author sums this up under the concept of the collective unconscious. The universality of the patterns and personages makes it possible for the experience to be shared in different stories, but these are not necessarily concrete idiosyncrasies that have to be sustained from beginning to end. Rather, they are functions that develop temporarily inside a story for the purpose of enriching it. We can also interpret these complementary patterns as facets of the hero’s personality, which may affect what he or she learns and what their values are. There are seven common archetypes:

• The Hero is someone capable of sacrificing their own needs for the sake of others. The word hero comes from a Greek root that means to protect and serve. Generally, we tend to identify with the hero because he or she tends to have a combination of qualities and skills. The hero is framed within a story, and it is inside this narrative that the personage learns and grows.
• The Mentor is the personage who helps or instructs the hero. “Mentor” comes to us from Homer [12]: in the Odyssey, the personage called Mentor helps Telemachus in the course of his trip. Joseph Campbell [2] defines this figure as the wise elder, in reference to the personage who teaches, protects and provides certain gifts to the hero. Vladimir Propp [29] defines this type of personage as the donor, in relation to the act of providing a gift or offering something to the hero.
• The Threshold Guardian is one of the first obstacles the hero finds in their adventure. Generally, they are neither the antagonist of the story nor the principal malefactor, although they constitute a threat that the hero, if he or she interprets it well, can overcome.
• The Herald, in a strict sense, is the person who carries a message. In Greece and Rome, heralds were in charge of conveying the orders of the ruling classes, making proclamations and declaring war.
• The Changeable Figure is a personage who is difficult to identify because they live up to their name. Their appearance and characteristics change when we examine them closely. In fact, the hero may find them a changeable and variable personage who possesses two faces. The changeable figure serves the function of introducing doubt and suspense into the story. Often, this figure is the love interest of the hero.
• The Shade is the antagonist, the enemy, the malefactor. The shade challenges the hero and is a worthy opponent to fight.
• The Trickster is the personage who embodies the energies of mischief and



desire for change. A buffoon, a clown, or a comical sidekick are all clear examples, and they serve the function of comic relief.

2 THE TRIP OF THE HERO IN THE FULL-LENGTH FILMS OF PIXAR (1995-2006)

Leaving aside short films and advertising productions, seven computer-animated full-length films were produced between 1995 and 2006 by Pixar Animation Studios (an independent producer before being acquired by the Walt Disney Company in 2006 [8, 27, 28]):

• Toy Story (1995), by John Lasseter [9, 20, 31].
• A Bug’s Life (1998), by John Lasseter and Andrew Stanton [1, 17].
• Toy Story 2 (1999), by John Lasseter, Ash Brannon and Lee Unkrich [32].
• Monsters, Inc. (2001), by Pete Docter, David Silverman and Lee Unkrich [21, 25].
• Finding Nemo (2003), by Andrew Stanton and Lee Unkrich [5, 10].
• The Incredibles (2004), by Brad Bird [4, 30].
• Cars (2006), by John Lasseter and Joe Ranft [3, 34].

All these scripts undoubtedly follow the closed method of the trip of the hero, despite particular exceptions or the absence of certain stages and archetypes. To summarize:

In Toy Story, in the room of a child called Andy (the first stage, the ordinary world), a pull-string rag doll, a cowboy called Woody (the hero), is the favorite toy. The arrival of a new plastic space ranger doll with many gadgets, Buzz Lightyear (the herald and the changeable figure at the same time), makes Woody mistrust him, although in the beginning this is not important. After a fight, the two toys get lost at a petrol station and end up in the hands of Sid (the shade), Andy’s evil neighbor. Sid’s mutant toys help Woody and Buzz Lightyear avoid a fatal ending (eighth stage, the odyssey), and both manage to return to their owner (twelfth and last stage, the comeback with the elixir) after having overcome new troubles (tenth and eleventh stages, the way of comeback and the resurrection).

In A Bug’s Life, in an anthill (the first stage, the ordinary world), the threat of a few grasshoppers led by the perverse Hopper (the shade) forces the ant Flik (the hero, but also the culprit of the above-mentioned threat, after having lost the food that the ants were giving the grasshoppers) to go on a journey in search of help (fifth stage, the passage of the first threshold). After finding a small metropolis created out of human garbage (sixth stage, tests, allied forces, enemies), the protagonist meets a few artist insects (the slickers), whom he mistakes for potential warriors. They do not notice the confusion either and go to the anthill. Despite the misunderstanding, they help Flik, Princess Atta (the changeable figure) and the other ants defeat the tyrants.

In Toy Story 2, Woody (the hero once again) is kidnapped by a collector called Al (the herald) after trying to rescue the toy penguin Wheezy from a home-made flea market where Andy’s mother had left him. At Al’s home, he meets other dolls: Pete, the horse Bullseye and Jessie (the changeable figure). Buzz Lightyear, Mr. Potato Head, Slinky, Hamm and Rex (the friends of Woody) go out in search of him, but once they find him Woody decides to remain with his new relatives, even though he soon finds himself deceived and manipulated by one of them, Pete. Al takes them to the airport to travel to Japan. Nevertheless, Woody, again with the help of his friends, manages to escape from the plane (eighth stage, the odyssey) and they all return to Andy’s room, this time also in the company of Bullseye and Jessie.

In Monsters, Inc., the monster Sulley (the hero) and his best friend, the monster Mike (the slicker), are employed at a factory that scares children in the real world in order to gather their screams, which are used for energy. One day, a girl called Boo (the herald) crosses one of the many doors that connect these two realities and ends up inside the monsters’ world. The girl is discovered by Sulley, who calls on Mike to help him return her to her home. As both try to resolve the situation, the monster Randall (the shade) puts many impediments in their way. Finally, the girl is returned to her world (ninth stage, the reward) and Sulley comes up with the idea of gathering the laughter of children for energy instead of their screams of fear.

In Finding Nemo, Marlin (the hero) is a clown fish who lives with his son Nemo in a coral reef (the first stage, the ordinary world). One day, Nemo is captured by a scuba-diving dentist, and Marlin must go on a long journey to find him and bring him home. He is accompanied part of the way by a blue fish called Dory (the slicker). Meanwhile, Nemo meets a few new friends in the fishbowl where he has been placed. After Marlin finds Nemo and they set off for home (tenth stage, the way of comeback), a fishing ship catches Dory along with other fish in its nets (eleventh stage, the resurrection). Nemo decides to help them and is successful, despite his father’s doubts.

In The Incredibles, Bob Parr/Mr. Incredible (the hero) is a superhero who does not adapt himself well to a new reality (the first stage, the



ordinary world) in which resolving the problems of humanity is completely prohibited. After receiving a request to fight a dangerous machine on a faraway island, he goes to see Edna Mode (perhaps the mentor), who repairs his old supersuit and makes a new one for him. On the island, it turns out that Buddy Pine/Syndrome (the shade) has set a trap for Mr. Incredible. But with the help of his wife, Helen Parr/Elastigirl, and two of his children, Dashiell and Violet, who also have superpowers, he foils the plan. Back at home, Mr. Incredible sees Syndrome kidnapping his youngest son Jack-Jack (eleventh stage, the resurrection), but Jack-Jack uses his superpowers to escape and the evil Syndrome is finally defeated.

In Cars, Lightning McQueen (the hero) is a very ambitious race car that tries to win the Piston Cup. After a tie with two of his opponents, Chick Hicks and The King, a new race is necessary to decide the Piston Cup winner. But on the way to the next circuit, the protagonist becomes lost in a village called Radiator Springs, where he is forced to remain. There, he meets a few very particular cars (sixth stage, tests, allied forces, enemies). After living with them and experiencing many vicissitudes, he gains some new values and changes his perception of the competition. Finally, in the tiebreak race, he allows Chick Hicks to win and helps The King to the finish line.



Disability can be seen essentially as a limitation provoked by a physical or mental impediment that prevents certain activities from being carried out. According to the World Health Organization [36], this concept involves the following components:

• The physiological functions of the systems of the body, including psychological functions.
• The structures of the body, including anatomical parts such as organs and other components.
• Damages or problems in the function or structure of the body, such as significant deviations or loss.
• Activity, including the execution of a task or an action on the part of an individual.
• The participation in a certain situation.
• Limitations to activity.
• Restrictions of participation in an activity.
• Environmental (exogenous) factors that make up the physical, social and attitudinal environment in which people live their lives.

Starting from the WHO point of view, and understanding that many of the personages in the computer-animated full-length films of Pixar Animation Studios are animals or objects with human behaviors, attributes or qualities, we find characters with all kinds of disabilities. But these do not always coincide with the archetype personages that are necessary for the quest of the hero; we mark those that do not with “Ø”.

In Toy Story, Woody (the hero) is a cowboy doll with only a voice box and a missing gun, while Buzz Lightyear (the herald and the changeable figure at the same time) is a plastic space ranger with a multiphrase voice simulator, a laser light, and wings with light indicators, as well as many other gadgets. Although Buzz eventually realizes he is a toy and cannot fly, his characterization at first highlights what Woody lacks. When Buzz discovers what he really is, he loses an arm after launching himself into the void from the top of a staircase. The arm is reattached to the body by the mutant toys (the threshold guardians) of evil Sid, who often experimented on them and caused them to suffer all kinds of disorders. The concept of disability is also tackled through other characters such as Mr. Potato Head (Ø or perhaps a slicker), who is constantly losing his extra pieces and becomes a bad-tempered toy; Rex (Ø or perhaps also a slicker), the plastic dinosaur with extremely short arms; R.C. (Ø), the remote control car who runs out of batteries; and the green plastic soldier (Ø) that is accidentally stepped on by Andy’s mother.

In A Bug’s Life, the queen of the ants (Ø) is elderly and has to make way for Princess Atta. The perverse grasshopper Hopper (the shade) has a scar on the right side of his face that crosses over his eye. His brother (Ø), perhaps because of Hopper’s attitude, seems to have some psychological shortcomings. Also, some of the other insects seem to have had mishaps or end up with some disability. For example, the ladybug Francis (one of the slickers) breaks a foot, while the caterpillar Heimlich (another of the slickers) grows wings that are far too small.

In Toy Story 2, Woody (the hero) breaks his arm, and Andy’s mother relegates him to a high shelf of the room, where he is forgotten along with the toy penguin Wheezy (Ø), who has a broken squeaker. Nevertheless, Woody recovers his limb thanks to the restorer hired by Al, and Wheezy recovers his voice. Other personages who appeared in Toy Story receive similar treatment.

In Monsters, Inc. there is an interesting paradox: certain monsters are not good at their job because they are not fierce enough. Also, what would be considered a physical shortcoming for a human being becomes a virtue. An example is the monster Mike (the slicker): he has only one eye but he is



considered physically very fortunate. In the same way, the Yeti (Ø) is presented as an exiled and almost ridiculous monster. On the other hand, the malignant monster Randall (the shade) is a simple lizard.

Finding Nemo is the movie from Pixar Animation Studios that most directly tackles the topic of disability. The small clown fish Nemo (perhaps the herald, and also the hero when in the fishbowl, as we will see later) has a fin that has not developed, and Dory (the slicker) forgets everything after a few minutes. In parallel, we can find other fish with different disabilities: the small fish at the school, Pearl (Ø) and Sheldon (Ø), and some fish in the fishbowl, Gill (Ø) and Deb (Ø). We can also find some humans, such as the niece of the dentist, Darla (Ø), who have nothing in common with Marlin (the hero) but are related to Nemo. Additionally, there are “extras” with certain disabilities and no relevance to the story, such as Mr. Johannsen (Ø) and Blenny (Ø).

In The Incredibles there is a paradox similar to that of Monsters, Inc., in that the powers of the superheroes are viewed poorly by society and are prohibited. The same people whom Mr. Incredible had saved now denounce him. His children, Dashiell (Ø) and Violet (Ø), have the same problem and, in turn, Syndrome (the shade) lacks superpowers and must design robot prototypes in their place. The movie is the first one that bases its story on human beings, and it presents other disabilities, for example the glasses that Edna Mode (perhaps the mentor) must use to be able to see well.

In Cars, the inhabitants of Radiator Springs present the most obvious features. The most paradigmatic case is perhaps Mater (one of the slickers), whose suspension is a little rusty and whose battered cab has seen better days. Lizzie (Ø), an old four-cylinder, four-stroke car, also deserves to be mentioned because of her increasingly failing memory.
With all the compiled information, and without claiming to be exhaustive, we can draw the following graph (Figure 1) to show the number of disabled personages not justified by the story in the full-length films of Pixar.

[Bar chart, one group of bars per film: Toy Story, A Bug's Life, Toy Story 2, Monsters, Inc., Finding Nemo, The Incredibles, Cars. Series: disabled personage that is not an archetype; possible disabled personage that is not an archetype; archetype personage with a disability.]


Figure 1: Quantity of disabled personages in the full-length films of Pixar (1995-2006)



As we have pointed out, in Finding Nemo disability is very present and is related to some personages who work as archetypes in the principal story: for example, Nemo (perhaps the herald) and Dory (the slicker). But in the movie, the story starring Nemo works in parallel: because Nemo plays the role of hero in it, many other disabled personages who would have no function in his father’s story (the principal plot) acquire one in this parallel plot. This way, we meet his school friends Pearl and Sheldon. The first is an octopus who has one smaller tentacle and who squirts black ink when scared; the second is a seahorse who likes to race around the reef, but who is always the last one to arrive because his constant sneezes throw him backwards. In another setting, the fishbowl, we find Gill and Deb. The first is a Moorish idol fish who has a few scars on his face and on his right fin after landing on the dentist’s set of instruments in a frustrated attempt to escape; the second is a black-and-white damselfish who thinks that her own reflection in the glass is a twin sister who always accompanies her. In the same way, we also have Darla, the niece of the dentist, who wears corrective devices on her teeth. Some of these disabled personages work as archetypes in the Nemo story although, once again, they do not all have such a function: Pearl and Sheldon, for example, do not. Gill, Deb and Darla would work as the mentor, trickster and shade respectively. We can also find other disabled personages, such as Mr. Johannsen, the grumbling turbot, who detests the children of the reef because they play in his sandy yard, but who can never manage to



catch them because he only has eyes on one side, or Blenny, the small fish who does not manage to overcome his fears, especially of sharks, although after the final credits of the movie we see him swallow a deep-sea anglerfish. With this information, and without claiming to be exhaustive, we can draw the following graph (Figure 2) to show the number of disabled personages not justified by the Nemo story.

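[Bar chart with two series per story.]
Finding Nemo, Marlin story: 7 disabled personages that are not archetypes; 2 archetype personages with a disability.
Finding Nemo, Nemo story: 4 disabled personages that are not archetypes; 3 archetype personages with a disability.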

Figure 2: Quantity of disabled personages in the stories of Finding Nemo
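
The counts behind Figure 2 can be checked with a short tally. The following is only a sketch in Python, not part of the original study: the role assignments are our reading of the text above (None stands for the paper's "Ø"), the variable and function names are hypothetical, and we assume that Nemo and Dory are not counted again in the Nemo story.

```python
from collections import Counter

# Archetype role of each disabled personage in each plot, as read from the text.
# None stands for the paper's "Ø" (no archetype function in that plot).
MARLIN_PLOT = {
    "Nemo": "herald", "Dory": "slicker",
    "Pearl": None, "Sheldon": None, "Gill": None, "Deb": None,
    "Darla": None, "Mr. Johannsen": None, "Blenny": None,
}
# Assumption: Nemo (the hero of this plot) and Dory are left out of this tally.
NEMO_PLOT = {
    "Gill": "mentor", "Deb": "trickster", "Darla": "shade",
    "Pearl": None, "Sheldon": None, "Mr. Johannsen": None, "Blenny": None,
}


def tally(plot):
    """Count disabled personages with and without an archetype role."""
    counts = Counter(
        "archetype" if role else "non_archetype" for role in plot.values()
    )
    return {
        "archetype": counts["archetype"],
        "non_archetype": counts["non_archetype"],
    }


if __name__ == "__main__":
    print("Marlin story:", tally(MARLIN_PLOT))
    print("Nemo story:", tally(NEMO_PLOT))
```

Under these assumptions the same set of personages yields two archetypes in the Marlin story and three in the Nemo story, which is the shift between plots that the figure illustrates.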



In all the full-length films of Pixar Animation Studios between 1995 and 2006, the presence of disabled personages does not affect the personal development of the hero, but in a great number of cases it influences changes in the hero’s behavior. Even in Finding Nemo, where there is a clear parallel plot starring Nemo (the herald archetype for Marlin, the protagonist or hero of the principal plot), this parallel plot also makes use of this type of personage. Their presence may also be the result of other intentions, such as an effort to be politically correct, the inclusion of personages of different races, or a wish to avoid stereotyped personages or “clichés”. This study of the direct impact on the spectator of personages that reflect human disabilities can be extended to other periods of production by Pixar & Disney, for example Ratatouille (2007) by Brad Bird and Jan Pinkava, WALL·E (2008) by Andrew Stanton, and Up (2009) by Pete Docter and Bob Peterson, or to full-length films by other production companies such as Pacific Data Images & DreamWorks SKG or Blue Sky Studios & Twentieth Century Fox.

6 REFERENCES

[1] “A Bug’s Life” in The Internet Movie Database [http://www.imdb.com/title/tt0120623/].
[2] CAMPBELL, Joseph, El héroe de las mil caras. Psicoanálisis del mito, México: Fondo de Cultura Económica, 1959.
[3] “Cars” in The Internet Movie Database [http://www.imdb.com/title/tt0317219/].
[4] COTTA VAZ, Mark, LASSETER, John, & BIRD, Brad, The Art of The Incredibles, San Francisco: Chronicle, 2004.
[5] COTTA VAZ, Mark, LASSETER, John, & STANTON, Andrew, The Art of Finding Nemo, San Francisco: Chronicle, 2003.
[6] COTTE, Olivier, Il était une fois le dessin animé… et le cinema d’animation, Paris: Dreamland, 2001.
[7] DARLEY, Andrew, Cultura visual digital. Espectáculo y nuevos géneros en los medios de comunicación, Barcelona: Paidós, 2002.
[8] DURAN, Jaume, El cinema d’animació nordamericà, Barcelona: Editorial UOC, 2008.
[9] DURAN, Jaume, Guía para ver y analizar Toy Story, Valencia - Barcelona: Nau Llibres Octaedro, 2008.
[10] “Finding Nemo” in The Internet Movie Database [http://www.imdb.com/title/tt0266543/].
[11] FONTE, Jorge, Walt Disney. El universo animado de los largometrajes, 1970-2001, Madrid: T & B, 2001.
[12] HOMERO, Odisea, Madrid: Cátedra, 1988.
[13] JUNG, Carl Gustav, Arquetipos e inconsciente colectivo, Barcelona: Paidós, 1970.
[14] JUNG, Carl Gustav, Tipos psicológicos, Barcelona: Edhasa, 1994.
[15] JUNG, Carl Gustav, El hombre y sus símbolos, Barcelona: Paidós, 1995.
[16] KERLOW, Isaac V., The Art of 3D Computer Animation and Effects, New Jersey: John Wiley & Sons, 2004.
[17] KURTTI, Jeff, A Bug’s Life. The Art and Making of an Epic of Miniature Proportions, New York: Hyperion, 1998.
[18] LASSETER, John, “Principles of Traditional Animation Applied to 3D Computer Animation” in: SIGGRAPH’87, Computer Graphics, pp. 35-44, 21 : 4, 1987.
[19] LASSETER, John, “Tricks to Animating Characters with a Computer” in: SIGGRAPH’94, Animation Tricks, notes of the Course 1, 1994.
[20] LASSETER, John, & DALY, Steve, Toy Story. The Art and Making of the Animated Film, New York: Hyperion, 1996.
[21] LASSETER, John, DOCTER, Pete, & Disney & PIXAR, The Art of Monsters, Inc., San Francisco: Chronicle, 2001.
[22] LAVANDIER, Yves, La dramaturgia. Los mecanismos del relato: cine, teatro, ópera, radio, televisión, cómic, Madrid: Ediciones Internacionales Universitarias, 2003.
[23] MAESTRI, George, Creación digital de personajes animados, Madrid: Anaya, 2000.
[24] McGRATH, Declan, & MACDERMOTT, Felim, “Andrew Stanton” in: Guionistas, Barcelona: Océano, 2003.
[25] “Monsters” in The Internet Movie Database [http://www.imdb.com/title/tt0198781/].
[26] National Library of Medicine - National Institutes of Health, Medical Dictionary, U. S.: 2003 [http://www.nlm.nih.gov/medlineplus/mplusdictionary.html].
[27] PAIK, Karen, To Infinity and Beyond! The Story of Pixar Animation Studios, San Francisco: Chronicle, 2007.
[28] “Pixar Animation Studios” [http://www.pixar.com/].
[29] PROPP, Vladimir, Morfología del cuento, Madrid: Fundamentos, 1972.
[30] “The Incredibles” in The Internet Movie Database [http://www.imdb.com/title/tt0317705/].
[31] “Toy Story” in The Internet Movie Database [http://www.imdb.com/title/tt0114709/].
[32] “Toy Story 2” in The Internet Movie Database [http://www.imdb.com/title/tt0120363/].
[33] VOGLER, Christopher, El viaje del escritor. Las estructuras míticas para escritores, guionistas, dramaturgos y novelistas, Barcelona: Robinbook, 2002.
[34] WALLIS, Michael, WALLIS, Suzanne Fitzgerald, & LASSETER, John, The Art of Cars, San Francisco: Chronicle, 2006.
[35] WEISHAR, Peter, Moving Pixels. Blockbuster Animation, Digital Art and 3D Modelling Today, U. K.: Thames & Hudson, 2004.
[36] World Health Organization, Towards a Common Language for Functioning, Disability and Health: International Classification for Functioning, Disability and Health, Geneva: 2002 [http://www.who.int/classifications/icf/site/icftemplate.cfm].



Marc Pifarré, Eva Villegas, David Fonseca
GTM-Grup de Recerca en Tecnologies Mèdia
LA SALLE - UNIVERSITAT RAMON LLULL, Barcelona, Spain
{mpifarre, evillegas, fonsi}@salle.url.edu

ABSTRACT

An accessible web page needs to follow the rules marked by the W3C (World Wide Web Consortium) and WCAG 2.0 (Web Content Accessibility Guidelines). The problem with the level AA rules of web accessibility is that they are centered on programming requisites more than on user needs regarding graphic design, functionalities or content. An important factor to take into account in accessibility standards is the lack of distinction between different user profiles: since every type of disability has particular requisites, it is difficult for standards to adapt to the needs of the final user. With the aim of improving the reliability of the information obtained in accessibility studies of web pages, a project based on the integration of different methodologies has been carried out. The methodological design applied in this study centers on the participation of users as the principal means of obtaining significant results. Using methods centered on users rather than on accessibility standards makes it possible to obtain reliable information about the real needs of the users. Starting from this basis, it is possible to obtain a website design properly adapted to the users’ needs.

Keywords: Accessibility, User Experience, WAI, User-centred-design, WCAG, Web Design.



The rules published by the World Wide Web Consortium (W3C) and the Web Accessibility Initiative (WAI) are considered to be a standard that marks the requisites for creating pages that are accessible to all. Accessibility is understood to mean that a web page is designed and programmed so that its content is available to any user, regardless of their profile. The target of this project is to create a base for achieving a web design adapted to the needs of every type of user. To achieve this main objective it has been necessary to bear in mind the peculiarities of every disability; to define, with reliability, the spaces of the web that can be common to all users; and to define those that must be individualized or customized to concrete user requisites. To establish a list of initial needs with a trustworthy method, a user study was conducted on a web page with conformance level Double A (AA). This study was created by means of a combination of methodologies that made it possible to obtain concrete information about the needs of the users. It would be very difficult to obtain all of this information by means of the accessibility rules, because the standards do not provide concrete information about the needs of the user. As soon as the information was obtained, a web page was created bearing the results in mind, and a second test was conducted to verify whether the experience of the users improved with regard to the page tested in the initial phase.

2 METHODOLOGY

2.1 Phase 1. Objectives
The target of this project is to create a virtual community for user groups with different disabilities. Different techniques were applied to evaluate the user's experience by integrating accessibility methodologies, classic usability methods, and new qualitative methods from the user experience field.

2.2 Phase 1. Test Design
We designed the test to analyze a web page with a Double A (AA) level, emphasizing

UbiCC Journal, Volume 5, March 2010


what methodology is the most suitable. Different factors were borne in mind; one of the most decisive was the type of user. To decide on the most suitable user sample, we analyzed the users according to the World Health Organization classification [1], which defines six types of difficulties:
• Difficulties derived from mobility problems.
• Difficulties derived from sight problems.
• Difficulties derived from hearing problems.
• Difficulties derived from language, speech, and voice impairments.
• Learning difficulties.
• Difficulties derived from mental illnesses or disorders.

We analyzed all the profiles to choose the most decisive sample for beginning the web test. The result was the following groups:
• Group 1: Twelve persons with difficulties derived from physical and cognitive problems.
• Group 2: An expert in persons with difficulties derived from physical and cognitive problems.
• Group 3: Twelve persons with difficulties derived from visual problems: six users with entire blindness and six users with poor or partial vision.
• Group 4: An expert in persons with difficulties derived from visual problems.
• Group 5: Twelve persons with difficulties derived from hearing problems: six deaf users who use sign language and six deaf users who do not use sign language.
• Group 6: An expert in persons with difficulties derived from auditory problems.
• Group 7: A control group of users without any type of difficulty accessing Internet information at the time.

The analysis allows us to adapt the results to other disabilities, for which we have other secondary profiles in mind. For example, profiles could be adapted for users with dichromatopsy (a visual disability that affects perception of red, green, blue, and yellow colors) or for third-age persons [2], or they could be based on other determinants that affect the above-mentioned groups, such as slow connections or minimal use of electronic commerce. (The main use of the Internet in these profiles is for e-mail (66%), accessing information on administrative pages (49%), and mass media (43%).)

For the groups of persons with difficulties, the following methodologies were applied:
• Methodology of Classic Usability:
  o Questionnaire of Previous Profile.
  o Tasks Test.
  o Satisfaction User Survey (SUS [3]).
• New techniques of user's experience:
  o Bipolar Laddering (BLA [4]) interview (limited version).

For the group of experts in the different disciplines, a particular methodology was applied:
• Questionnaire of Previous Profile.
• Bipolar Laddering interview (full version).

The questionnaire of previous profile gives detailed knowledge of the user's profile and level of Internet use: what type of tasks the user accomplishes or wants to accomplish, and what type of information they wish to receive.

The tasks test is used to observe the behavior of the user in terms of Internet use (by means of navigation through an AA web page), not to validate the usability of the page. Quantitative information was gathered according to four outcomes: successful task (well-finished task), failed task (unfinished task), false success (unfinished task that the user perceives as correct), and false failure (finished task that the user perceives as not accomplished). During the test, we used the Protocol of Clear Thought (thinking aloud): on the one hand, the user expresses his or her considerations while navigating the web; on the other hand, by means of the question-answer protocol, the user's reactions are prompted by direct questions about his or her interaction with the application.

The Satisfaction User Survey (SUS) was used to measure the degree of satisfaction of the user. Ten statements are presented to the user, who rates each on a scale from 1 to 5 according to how strongly he or she agrees with it, thereby yielding numerical values for satisfaction.

The Bipolar Laddering (BLA) technique is a methodology for conducting a qualitative field study that obtains the perceived strong and weak points of a product or service based on the user's experience. It is conducted as an interview, during which the user explores the product and relates their experience. From this interview model, the user generates lists of significant elements and defines them by means of the laddering technique. The levels of satisfaction and relevance of every element are then represented on a numerical scale from 0 to 10, on which the user assigns a score depending on the emotional or functional implication of the element. This interview method follows a Socratic model, so the user always freely chooses the elements that he or she is going to evaluate. As a result, once the results of the sample are obtained, we can establish connections between spontaneously given pieces of information, which significantly increases the reliability of the information obtained.

2.3 Phase 1. Results
The codification used in the results, based on the disability of every user, is:
PP: Difficulties derived from physical problems.
TB: Difficulties derived from visual problems (entire blindness).
PV: Difficulties derived from visual problems (poor vision).
DSL: Difficulties derived from auditory problems (deaf users who use sign language).
DNSL: Difficulties derived from auditory problems (deaf users who do not use sign language).
CG: Control group (persons without difficulties navigating the Internet).
ED: Group of Experts in Disabilities.

Figure 1 shows the heterogeneity of the different disabilities in using the Internet, indicated by the difficulty level in navigating and the type of autonomy that the users have in using the computer.
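Returning to the Satisfaction User Survey described in Section 2.2: the paper reports only per-group averages and does not say how the ten answers were combined into one value. Assuming the standard scoring of the scale cited as [3] (Brooke's usability scale, in which odd-numbered statements are positively worded and even-numbered ones negatively worded), a score on a 0 to 100 range could be computed as in the following minimal sketch. The function names and the sample answers are hypothetical and are not taken from the study.

```python
def sus_score(answers):
    """Return a 0-100 score from ten answers on a 1-5 agreement scale.

    Assumption: standard scoring of the scale cited as [3]; the paper
    itself does not state which scoring rule was applied.
    """
    if len(answers) != 10:
        raise ValueError("expected exactly ten answers")
    if any(a not in (1, 2, 3, 4, 5) for a in answers):
        raise ValueError("answers must be integers from 1 to 5")
    total = 0
    for position, answer in enumerate(answers, start=1):
        if position % 2 == 1:
            total += answer - 1   # odd items: higher agreement is better
        else:
            total += 5 - answer   # even items: lower agreement is better
    return total * 2.5


def group_average(scores):
    """Average of the individual scores of one user group."""
    return sum(scores) / len(scores)


# Hypothetical answers of two users from the same group.
user_a = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
user_b = [5, 1, 5, 1, 5, 1, 5, 1, 5, 1]
group_scores = [sus_score(user_a), sus_score(user_b)]
print(group_average(group_scores))
```

Averaging the individual scores per group, as in `group_average`, would give one value per group that can be compared across profiles.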

Figure 1: Definition of the disabilities according to the type of Internet navigation

2.3.1 Previous questionnaire results
We obtained information about the tools that the users use, do not use, want to use, or do not want to use, and about the types of requisites and needs that they would want fulfilled by the creation of a virtual community. The following list of predefined items was analyzed, and the information obtained is indicated by the following items and figures:
• R1: Customized navigator
• R2: Chat
• R3: Access to forums
• R4: E-mail
• R5: Information about health services
• R6: To request information or services from the public administration
• R7: Consultancy search
• R8: Job search
• R9: Search of contacts and friends
• R10: Files download
• R11: Consult news pages
• R12: Buy show tickets
• R13: Buy flight tickets
• R14: Supermarket shopping
• R15: On-line training
• R16: Electronic banking
• R17: Technical help
• R18: To consult on subsidies or economic aids
• R19: Consulting my rights
• R20: To denounce (file a complaint)

The following figures show the percentage of users who chose each item, according to the type of stated disability.

2.3.2 Tools used usually
Users of all types emphasized the use of e-mail, news pages, file downloads, and customizing the navigator.

Figure 2: Previous questionnaire results

2.3.3 Tools that users want to use but are not using at present
In this case, we found notable differences between the users. We emphasize the demand for information about health services raised by groups PP and TB. The users from the PV and TB groups are more interested in access to forums, and the PP and TB groups are more interested in the use of chat capabilities. On the other hand, content regarding electronic banking or job searching is raised only by the CG group.



Figure 3: Previous questionnaire results

2.3.4 Tools that users do not want to use
The groups CG, PP, and PV are the ones least interested in using specific options, especially the customized navigator, the chat, and the search for information about health procedures.

Figure 4: Previous questionnaire results

2.3.5 Satisfaction User Survey (SUS)
This survey gave us an indicator of how users rated navigation on an AA page, which helps to raise requisites to be borne in mind for the graphic and functional design. The average obtained by each user group for the navigation was as follows:

Figure 5: SUS Results

We emphasize the low score given by the TB group, because it does not correspond to the accessibility level of the page (AA), which principally covers this type of group, and the score given by the PV group.

2.3.6 BLA interview results
The results shown next are extracted from the analysis of the BLA. This analysis emphasizes the spontaneous creation of elements by the users, and each element was analyzed according to its similarity with those of other users and other groups. In the column "Description," we find the elements or sections created. The code C# indicates that it is a common element, that is to say, one mentioned spontaneously by several users. The percentage that appears for every element indicates how often the element was repeated. The results of the BLA are divided into positive and negative elements. The positive elements are those that the user considers a strong point of the web; the negative elements are the opposite. Next we show the common positive elements obtained.

Table 1: Mention index for the different user groups
Code  Description                                     GC     ED     DF     DV BV  DV CT  DA SN  DA SS  TOTAL %
C1    Color                                           3,9    0,0    2,0    2,0    0,0    0,0    0,0    7,84
C2    Images                                          9,8    2,0    9,8    3,9    0,0    2,0    5,9    33,33
C3    Easy contact                                    7,8    0,0    3,9    0,0    0,0    3,9    0,0    15,69
C4    General index                                   5,9    2,0    5,9    0,0    0,0    9,8    0,0    23,53
C5    Disability design concept                       5,9    0,0    9,8    3,9    3,9    0,0    3,9    27,45
C6    Search                                          3,9    2,0    0,0    0,0    0,0    2,0    0,0    7,84
C7    A lot of information                            0,0    3,9    3,9    3,9    2,0    2,0    0,0    15,69
C8    Links                                           0,0    3,9    0,0    0,0    2,0    0,0    0,0    5,88
C9    All information in the 1st page                 2,0    0,0    2,0    0,0    0,0    0,0    0,0    3,92
C10   Customized information in function of the size
C11   Possibility to change color and size font
C12   Good design

For C10 to C12, the source preserves only two rows of values and does not show which elements they belong to:
                                                      GC     ED     DF     DV BV  DV CT  DA SN  DA SS  TOTAL %
                                                      0,0    2,0    0,0    2,0    0,0    0,0    0,0    3,92
                                                      2,0    0,0    0,0    0,0    0,0    0,0    0,0    3,92

Figure 6: Mention index for groups of every element

The CG group does not mention the elements C10 and C11, which cover the accessibility functionalities of the web page, because these users do not value them and do not use them; therefore, they do not need them. The group of Experts centers on the elements that somehow facilitate navigation for users with disabilities: Icons / images (C2), General Index of the web (C4), Links (C8), Restructuring of the information according to the size of the screen (C10), and Option to change colors and fonts (C11). The PP group does not value the elements that facilitate navigation for blind persons or those with poor vision; for the most part, its elements coincide with those that the control group mentions. The PV group appears in the evaluations of the visual elements C1, C2, C10, and C11, because these users are very interested in being able to adapt colors or size according to their needs. The TB group does not comment on the visual elements and centers on the content: Page concept for disabled (C5) and A lot of information (C7). This group does not comment on the format or the structure either. The DNSL group does not value the accessibility elements for blind persons and only mentions that they like the presence of images. The majority allude to the general index of the web, since it allows them to go straight to what they are looking for without the need to read the whole content. For the DSL group, as for DNSL, the most interesting thing is to find images (C2) rather than text.

Next we show the common negative elements that the users named:

Table 2: Mention index for the different user groups
Code  Description                  GC      ED      DF      DV BV   DV CT   DA SN   DA SS   TOTAL %
C1    Long scroll                  11,76   0,00    9,80    1,96    0,00    0,00    1,96    25,49
C2    A lot of text                3,92    3,92    0,00    0,00    1,96    0,00    0,00    9,80
C3    Resolution                   3,92    1,96    1,96    0,00    0,00    0,00    1,96    9,80
C4    Text size and contrast       9,80    3,92    7,84    1,96    0,00    0,00    0,00    23,53
C5    Color                        1,96    0,00    5,88    3,92    0,00    5,88    0,00    17,65
C6    Untidy information           7,84    5,88    0,00    3,92    0,00    5,88    1,96    25,49
C7    Poor design                  5,88    0,00    3,92    0,00    0,00    0,00    0,00    9,80
C8    First negative impression    15,69   0,00    0,00    0,00    0,00    0,00    0,00    15,69
C9    Few images                   0,00    0,00    0,00    0,00    0,00    0,00    1,96    1,96
C10   (see note)                   0,00    3,92    0,00    0,00    0,00    0,00    0,00    3,92
C11   (see note)                   0,00    0,00    3,92    0,00    0,00    0,00    0,00    3,92
C12   (see note)                   0,00    0,00    0,00    3,92    0,00    0,00    3,92    7,84
C13   (see note)                   1,96    0,00    1,96    0,00    0,00    0,00    0,00    3,92

Note: for C10 to C13 the source lists five descriptions (Short keys; Low data contrast; Problem with search; Long and complex main page; Untidy design) for four codes, so their assignment to individual codes is not recoverable.

Figure 7: Mention index for groups of every element

The CG is the only group that mentions a negative first impression (C8), and it does not mention the elements C9, C10, and C12, which refer to the accessibility elements. The group of experts centers on the elements that somehow facilitate navigation for users with disabilities. The PP group has difficulties moving around the page: the page is very long and it is difficult for them to move with the scroll. They talk about visual elements such as the Resolution / aspect of the icons (C3), Letter size and contrast (C4), Colors (C5), and Archaic design (C7). The PV group comments on the elements of contrast, colors, and font size. In the TB group, there is only one element, which alludes to the excess of text. The DSL group mentions the visual elements C3, C9, and C12. They like the images very much and believe there should be more (C9) because, according to them, images save text. They are the only users who comment on this. The particular elements show that persons with auditory disability find the vocabulary complex and do not understand Anglicisms. Finally, many groups cited the element referring to badly structured information (C6).

2.4 Phase 1. Analysis
The analysis was conducted bearing in mind all the results extracted by the different techniques used in this study. Previously, another type of questionnaire was used, with predefined items that allowed us to obtain:
• Quantitative information on the profile of the users: level of studies, work experience, and where and how they learned to navigate the Internet.
• Information on the type of content to be searched for on the web: type of knowledge of the legislation; type of information that the user looks for.
• Information to help decide on a functional design: type of tools that they use or that they want to use.
• Information on needs for devices, hardware, or software for comfortable navigation.
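The mention indices reported in Tables 1 and 2 are all consistent with a denominator of 51 (for example, 1/51 ≈ 1,96% and 13/51 ≈ 25,49%). The paper does not state this denominator, so it is treated here as an assumption, and the raw counts below are back-calculated from the published percentages rather than taken from the study data. Under that assumption, the index for one element could be reproduced as follows; the function and variable names are hypothetical.

```python
DENOMINATOR = 51  # assumption inferred from the tables, not stated in the paper


def mention_index(count, denominator=DENOMINATOR):
    """Percentage of the sample that spontaneously named an element."""
    return round(100 * count / denominator, 2)


# Back-calculated counts for element C1 ("Long scroll") of Table 2,
# keyed by the column codes used in the table.
long_scroll_counts = {
    "GC": 6, "ED": 0, "DF": 5, "DV BV": 1,
    "DV CT": 0, "DA SN": 0, "DA SS": 1,
}

per_group = {group: mention_index(n) for group, n in long_scroll_counts.items()}
total = mention_index(sum(long_scroll_counts.values()))

print(per_group)
print(total)
```

With these counts the per-group values and the total match the C1 row of Table 2, which is what suggests the common denominator.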



The analysis of the data allowed us to assess the reliability of the information gathered in the previous questionnaire by means of observation and notes on the users' interaction with an accessible AA page. The behavior and reactions provoked by carrying out the tasks fed into a satisfaction user survey (SUS), from which we obtained quantitative statistics in the form of a numerical average for an initially accessible page. In contrast to elements predefined for the user, we recommend the BLA interview as a generative tool that brings out the strong and weak points identified by the users themselves and relates them to one another across all the results. This reveals the existing integration between the disabilities. The integration of several techniques allows us to capture several dimensions of the user's experience.

2.5 Phase 1. Conclusions
The main conclusions of the study are:
• The AA rules of web accessibility only bear in mind programming requisites, whereas the users demonstrate needs regarding graphic design, functional design, and content. The heterogeneous needs of the disabilities are not borne in mind, and the satisfaction reported by users with visual disability is low.
• The integration of different methodologies allows us to draw conclusions from the questionnaires and from the results raised and created by the users, giving greater weight to the subjective experience during the test.
• The structure of the page must allow access to and personalization of its content, depending on the profile of the user.

The functional design is not based on the accessibility rules; its working basis is the information provided by the analysis. This line of investigation allows us to assess accessibility from the user's experience and not only from the technical requisites established by the WAI rules.

Next, we list some of the points that were changed according to the results of the user tests and the accessibility criteria evaluated in the first phase:
• Decrease the scroll. (Remarked on by the group with physical and cognitive disability.)
• Improve the quality of the images. (Remarked on by deaf users who use sign language.)
• Eliminate images used only as metaphors. (Remarked on by deaf users who use sign language.)
• Change the quantity of text: short and long versions of the text. (Remarked on by all the groups.)
• Segment the information. (Remarked on by all the groups.)
• Incorporate direct access to the sections and access to specific content for every section. (Remarked on by all the groups.)
• Incorporate dynamic content. (Remarked on by all the groups.)
• Improve the contrast between the background color and the color of the text. (Remarked on principally by the group with poor vision.)
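The last point, contrast between background and text color, is one of the few items in this list that can be checked numerically. The paper does not say how contrast was measured; as an illustration only, the sketch below uses the contrast ratio defined in WCAG 2.0, for which level AA requires at least 4.5:1 for normal text and 3:1 for large text. The function names and the sample colors are hypothetical.

```python
def _linearize(channel_8bit):
    """Convert one 8-bit sRGB channel to its linear value (WCAG 2.0 definition)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color with 8-bit channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """WCAG 2.0 contrast ratio, from 1 (no contrast) to 21 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa(foreground, background, large_text=False):
    """True if the color pair reaches the AA threshold for the given text size."""
    threshold = 3.0 if large_text else 4.5
    return contrast_ratio(foreground, background) >= threshold


# Hypothetical color pairs: black text and light grey text on a white page.
white = (255, 255, 255)
print(contrast_ratio((0, 0, 0), white))
print(meets_aa((200, 200, 200), white))
```

A check of this kind only covers the technical requirement; as the study argues, whether a given contrast is comfortable for users with poor vision still has to be confirmed with those users.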

3.1 Phase 2. Test
The target of the test in this phase is to evaluate the web page of the virtual community by means of a tasks test, in order to assess the experience of use. The results were obtained from 12 users with disabilities. For this second phase of the study, we chose to reduce the number of user types (groups), since the behavior of users who were previously separated can be grouped. For example, deaf users who use sign language and those who do not showed no differences, or only minimal ones, in their web navigation experience, so they can be treated as one group.

3.1.1 User Profiles
Three user profiles were considered in this second phase:
• PP: Four users with physical and cognitive disability.
• DV: Four users with visual disability (entire blindness and poor vision).
• DSL: Four deaf users who use sign language.



3.1.2 Task test
The tasks test was designed to observe the different behaviors of the users when using the page of the virtual community, bearing in mind the following points:
• Obtained information: Success, failure, false success, or false failure; and time of completion.
• Remarks: Notes on difficulties, unusual behaviors, or illogical errors.
• Behavior: Actions taken by the user that help explain how the task was carried out.
• Literal: Subjective opinions about the experience and the interface, as expressed by the users.

All the tasks were read aloud by the facilitator to help the users understand the questionnaire.

3.2 Phase 2. Results
The users who took part in the second phase had to complete seven tasks of different difficulty levels to verify the degree of adaptation of the new design. Most of the tasks were solved successfully for every profile. We found an especially good adaptation of the page to the PP profile.

Table 3: Phase 2 data
Tasks     Success %  Failure %   Success %  Failure %   Success %  Failure %
Task 1    100,00     0,00        50,00      50,00       100,00     0,00
Task 2    100,00     0,00        100,00     0,00        100,00     0,00
Task 3    100,00     0,00        100,00     0,00        50,00      50,00
Task 4    100,00     0,00        100,00     0,00        100,00     0,00
Task 5    100,00     0,00        100,00     0,00        100,00     0,00
Task 6    50,00      50,00       0,00       100,00      0,00       100,00
Task 7    50,00      50,00       100,00     0,00        0,00       100,00
Average   85,71      14,29       78,57      21,43       64,29      35,71
(The source header of this table reads "Fase 1" and does not preserve the profile labels of the three column pairs.)

Table 4: Phase 1 data
Phase 1         PC       DSL      TB       Average
Success         48,33%   46,67%   43,33%   46,11%
Failure         40,00%   53,33%   60,00%   51,11%
False Success   13,33%   0,00%    0,00%    4,44%
False Failure   0,00%    0,00%    0,00%    0,00%

In the first phase of this project, the users had to complete a tasks test with the same target: to verify the autonomy of the user and the correct adaptation of the page. If we compare the success results of the two phases, we obtain:

Figure 8: Success Tasks. Phase 1 vs. Phase 2

When we compare the task success results by profile for both phases, it is clear that success in the tasks of the second phase is much higher for every profile.

3.3 Phase 2. Analysis of results
From the information obtained, we can extract three main concepts that help define the parameters that lead to a satisfactory navigation experience:
• Use of images and icons in the navigation, representation of paragraphs, and information.
• Concept of the design adapted for disabled people.
• A page with a clear general index that is easy to use and return to.
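The four task outcomes used in both phases (success, failure, false success, false failure) and the averages in Tables 3 and 4 can be sketched in a few lines. This is only an illustration of the definitions given in Section 2.2, not the authors' analysis script; the function names and the sample observations are hypothetical, while the per-task percentages are those of the first column pair of Table 3.

```python
from collections import Counter

OUTCOMES = ("success", "failure", "false success", "false failure")


def classify(completed, perceived_completed):
    """Map what happened and what the user believed to one of four outcomes."""
    if completed and perceived_completed:
        return "success"
    if not completed and not perceived_completed:
        return "failure"
    if not completed and perceived_completed:
        return "false success"   # unfinished, but the user thinks it is done
    return "false failure"       # finished, but the user thinks it is not


def outcome_rates(observations):
    """Percentage of each outcome over a list of (completed, perceived) pairs."""
    counts = Counter(classify(c, p) for c, p in observations)
    n = len(observations)
    return {name: round(100 * counts[name] / n, 2) for name in OUTCOMES}


def average(values):
    """Mean of per-task percentages, rounded as in the tables."""
    return round(sum(values) / len(values), 2)


# Hypothetical observations for one task and four users.
print(outcome_rates([(True, True), (False, False), (False, True), (True, True)]))

# Success percentages of Tasks 1 to 7, first column pair of Table 3.
print(average([100, 100, 100, 100, 100, 50, 50]))
```

Applying `average` to the three success columns of Table 3 reproduces the reported averages, which can then be set against the Phase 1 success rates in Table 4.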

The obtained results confirm the need for a design model that contemplates multiple test iterations to improve the accessibility of the web. The other result is a better perception of the page. This perception was analyzed by means of observation, the protocol of clear thought (explaining thoughts during the navigation), and the spontaneous comments of the users. In addition to the increase in the success index, the users perceived the second web page more positively than the first one, some even rating the first one as negative. The perception of the page is radically different between the first and second phases because, in the first phase, the user's experience was not taken into account. Once the second web page was designed by means of accessibility patterns based on the user's experience, the perception and the efficiency of the page changed radically. The application of the results of the first phase in the design of the second web page is clearly positive. Designs that focus on fulfilling the technical accessibility specifications do not manage to offer users a satisfactory, problem-free navigation experience; the result is low success indexes in the tasks and a negative perception on the part of the users. By introducing accessible user experience patterns into the design of an accessible web page, we create a space in which the users can navigate more simply and effectively. These designs generate highly successful results in the proposed tasks, a highly positive perception of the page, and a desire on the part of the users to use it again to fulfill their needs. In conclusion, the analysis of the results shows that bearing in mind criteria of accessible user experience gives very positive results in the use of the web. It is also necessary to emphasize that automatic validators are not tools with enough guarantees to ensure that a web page is accessible for the final users.

4 METHODOLOGICAL PROPOSAL
According to the information obtained in the two types of test, and knowing the established type of validation of an accessible web page, we propose the following validation cycle:

Automatic validation: The code of the page is validated with current evaluation tools. The result is a list of code lines to be fixed and points to be taken into consideration. This cycle is repeated until the page meets the established requirements, thus obtaining at least an AA level of accessibility.

Manual validation: Validation is carried out by expert consultants who evaluate all aspects of the layout of a web page: structure, functional design, graphic design, and adaptation to different user types. In this validation, usability is as much a consideration as accessibility. The evaluation produces a set of findings that indicate each point to focus on, together with a categorized sample of users for whom the web page is highly accessible.

User validation: In order to test user experience, a task-based test is carried out. In addition to the results from the automatic and manual validation, the user tests provide concrete information about potentially controversial aspects of the web page, findings regarding the problems of each user profile, and possible solutions to carry out.

Establishing evaluation guidelines: Once the web page has been analyzed, the norms of the guidelines to be followed can be established with the aim of creating an accessible web page with a satisfactory level of user experience. All of the norms are created from the suggestions or proposals made and aim to provide global solutions for the end product. That is to say, the potential or future users of the web page are those who have actually contributed to the definition of the page itself.

5 CONCLUSIONS

The most relevant conclusions of the study are the following:
• The objective of an accessible web page is to provide a satisfactory level of user experience for those who use it.
• An experience deemed satisfactory by the user is mainly based on the user's autonomy in navigating the Internet.
• The definition of a user profile is crucial when designing the test, given that deficiencies are very heterogeneous.
• In order to draw significant data from the study, different lines of methodology have to be implemented: those stipulated by the W3C (World Wide Web Consortium), those found in the automatic validation process, and those based on new subjective techniques that permit user expression.
• It is important that the perception of the page be positive in order for the user to be able to evaluate the accessibility of the web page.

Accessibility is not based on a requirement to obtain an A, AA, or AAA classification, but rather on a requirement to provide the user with a satisfactory experience and to enable them to work autonomously (with or without deficiencies). To achieve this, testing carried out with disabled, non-disabled, and elderly (with age-related deficiencies) users must be considered in the analysis of the study.

6 FUTURE LINES

In order for web pages to be created with their future users in mind, content requirements, software architecture, graphic design, and structure must be considered. To do so, we are in the process of creating a standardized methodology that takes into account the sample of users and a combination of the techniques used in usability and user experience testing. This methodology would allow us to give to any design team a tool to ensure that its final design will create a better experience on the accessibility framework for the users.



Also, it will be a good reference for other groups and ourselves to continue improving the way we evaluate accessibility and improve websites.

7 REFERENCES

[1] Villegas, E., Pifarré, M., Fonseca, D., Garcia, O.: Requisitos de integración en una comunidad virtual web para usuarios discapacitados utilizando la combinación de diferentes líneas metodológicas, 7ª Conferencia Iberoamericana en Sistema, Cibernética e Informática, Vol. 3, pp. 45-50, Orlando, USA (2008).
[2] World Health Organization: Towards a Common Language for Functioning, Disability and Health: International Classification of Functioning, Disability and Health, Geneva (2002). http://www.who.int/classifications/icf/site/icftemplate.cfm
[3] Brooke, J.: SUS: A Quick and Dirty Usability Scale. In: Jordan, P.W., Thomas, B., Weerdmeester, B.A., McClelland, I.L. (eds.): Usability Evaluation in Industry. London, Taylor & Francis (1996).
[4] Pifarré, M.: Bipolar Laddering (BLA), a Participatory Subjective Exploration Method on User Experience, Dux 07: Conference on designing for user experience, Chicago, USA (2008).
[5] Mahoney, M.J.: Participatory epistemology and the psychology of science. In: Gholston, B., Shadish, W.R., Neimeyer, R.A., Houts, A.C. (eds.): Psychology of science. Cambridge, Cambridge University Press (1989).



Rumi Hiraga, Nobuko Kato Tsukuba University of Technology, 4-3-15 Amakubo, Tsukuba 305-8520, Japan rhiraga@a.tsukuba-tech.ac.jp

ABSTRACT
We reviewed the results of our previous experiments comparing the emotions recognized by recipients for several types of stimuli. Until recently, we had assumed that visual information provided simultaneously with a musical performance was useful in helping hearing-impaired people to recognize the emotions conveyed by music. After our review, we became unsure whether multichannel information that supposedly helps deaf and hard-of-hearing people to recognize emotions actually works well. In this paper, on the basis of a new question that we pose about multichannel music information, we describe an experiment comparing three types of stimuli: music only, music with video sequences, and video sequences only. The results show that visual information is less effective for hearing-impaired people, though it does have a role in supporting the recognition of emotions conveyed by music.

Keywords: music communication, hearing-impairment, multimedia, emotion.

1 INTRODUCTION
We previously believed that visual information played a role in supplementing sound information, especially for deaf and hard-of-hearing people. We also believed that this would apply to the case of deaf and hard-of-hearing people listening to music. These assumptions and our experiences with students in the Department of Industrial Technology at National University Corporation, Tsukuba University of Technology (NTUT), all of whom have hearing impairments, gave us the idea of building a music performance assistance system that uses visual information in a supplementary manner to enable people to communicate through music. After reexamining a series of experiments we previously performed to ascertain whether recipients can recognize the emotion that a piece of music is intended to convey, we became unsure as to whether visual information can be an effective supplementary tool for enhancing musical performance appreciation.
Thus, we conducted another experiment and came to conclude that visual information certainly does have a supplementary role in the recognition of emotion in music. The background of building an assistance system goes back to a computer music class that we conducted for the abovementioned NTUT students for six years from 1997. All the students had been hearing impaired before starting elementary school and had quite limited experience of music, in terms of both playing and listening. Nevertheless, they enjoyed the class and the experience of playing new types of electronic instruments. In particular, they enjoyed playing music together to arouse mutual sympathy that gave them a certain satisfaction in the music activity. In developing our system, we assumed that drums would be suitable instruments because they produce strong vibrations and some of these students had had the experience of playing large Japanese drums called wadaiko. Thus, we conducted several experiments on recognizing the emotions conveyed by drum performances and other stimuli that were possible musical accompaniment candidates. We used the four basic emotions—joy, fear, anger, and sadness—because they are often used in experiments on music cognition [9]. As we conducted our experiments, one concern we had was that the visual information we assumed would play a supplementary role might not support music information but would instead eliminate or replace it. In this paper, we touch upon the background to our idea of building a musical performance assistance system, namely a computer music class for hearing-impaired college students, review previous experiments to see how effective visual information can be as a tool for recognizing emotion, explain the question we faced, and describe

Ubiquitous Computing and Communication Journal UbiCC Journal, Volume 5, March 2010



experiments we conducted to compare three types of stimuli (music only, music with video sequences, and video sequences only) as a means of recognizing emotion conveyed by several types of media.

2 RELATED WORK
Our work has been evolving from the idea of building a system to assist performances by deaf and hard-of-hearing people playing in ensembles. For this purpose, we needed to understand how deaf and hard-of-hearing people listen to music, especially how they recognize emotion in improvised drumming, because we had percussion ensemble music in mind for the system. We conducted experiments with deaf and hard-of-hearing people to understand the possibility of music communication with visual assistance. This research can be interpreted as an interdisciplinary approach involving musical activities by hearing-impaired people, aesthetics, multimedia, cognition, and computer systems. The studies mentioned below are related to our research from different viewpoints. Apart from the many active music classes and music therapies for deaf and hard-of-hearing people, there are many music activities carried out by deaf and hard-of-hearing people themselves: there are deaf professional musicians, e.g., Dame Evelyn Glennie, a percussion soloist, and Dr. Paul Whittaker OBE [18], an accomplished pianist, as well as the participants of Deaf Rave in London. In her research on music understanding by deaf and hard-of-hearing people, Darrow studied the referential meaning of pictures vis-à-vis music for hearing-impaired children [3]. That study used musical performances that can be associated with concrete objects such as specific animals, while our study aims to pursue a more general association between audio and visual information, including abstract nuances such as the abovementioned four emotions. Ota investigated the musical abilities of young deaf and hard-of-hearing students from the viewpoint of special education [13]. 
The relationship between sound and visual information is an interesting research area and researchers from several fields have worked on it, e.g., Yamasaki as a psychologist [19] and Parke et al. as computer scientists [14]. They have analyzed how visual information supports the understanding of music (Yamasaki) and how music supports movies (Parke et al.). Levin and Lieberman [10], as media artists, tried to generate sound from pictures. Emotion in musical research is a significant area of interest for both performers and listeners. Juslin introduced research methods for analyzing emotion in music [9]. Senju, a professional violinist, played the violin by herself conveying some emotions and investigated the possibility of emotional

communication in music [16]. Schubert and Fabian analyzed performances of piano music using Chernoff faces [15]. Bresin and Friberg’s system automatically generated music performances with emotions [2]. Our focus in the past was on determining whether there is a difference between people with and without hearing disabilities in recognizing emotion in several types of media (e.g., [5][6]). Our results demonstrated that, in most cases, there were no significant differences at the 5% level between deaf and hard-of-hearing people and people with hearing abilities in recognizing emotion when the emotion was conveyed through musical performances, several types of visual media, or musical performances accompanied by visual information. The most difficult emotion to recognize was fear, regardless of the medium.

3 COMPUTER MUSIC CLASS
During the period 1997–2002, we conducted a computer music class for hearing-impaired students at NTUT¹. Most of the attending students were majoring in electronics. All of them were hearing impaired, though to different degrees. Their hearing difficulties were discovered before they began attending school, and their educational backgrounds were either special schools or general schools. In either case, their musical experience was generally more limited than that of others in their age group who had no hearing problems. Some of those with lesser impairment enjoyed listening to music via MP3 players, but we were not sure whether they enjoyed listening to music in other cases, e.g., when playing Nintendo video games. In terms of sound, speaking and listening to a language is more relevant to them than playing and listening to music. Thus, it was difficult for us to choose topics that would both maintain their interest and cater to their individual skills, needs, and desires. 
If we had organized the same class for students with hearing abilities, we could have taught them how to use desktop music software systems, such as those for sequencing or music notation, or software systems for digital signal processing such as Max/MSP [1]. Since these topics were unsuitable for the students we actually had, we chose to give our students the chance to use new types of instruments in an attempt to let them experience the joy of music. In the class we used the wearable instrument “Miburi” and the electronic drum set DD55 (developed and sold by Yamaha Corp). Both instruments generate audio data along with MIDI (musical instrument digital interface) data. The image movement of the visualization software product Visisounder (developed and sold by NEC) is controlled by MIDI. We demonstrated these instruments and software to the students and showed them how to use them. Instead of using any published musical scores, we allowed the students to decide for themselves what they would play and how they would play it, including the chosen sound color. At the end of the semester they gave performances to show what they had learned. A photograph of a student playing the Miburi as if she were playing the wadaiko is shown in Figure 1. In one year, we suggested three rhythm patterns to students with which they could play batucada, a kind of samba music played with only percussion instruments. They found this interesting; in particular, playing together enabled them to gain an appreciation of music performance. In the final year, we conducted our first experiment [7], which was an attempt to see how visual information could be useful in playing a simple rhythm. We offered four students three types of visual information that would guide a rhythm, as well as guidance by a patting sound only. The results showed that none of the visual information was helpful to them in this respect; only the guiding sound helped them improve their ability to follow the rhythm pattern. These results did not particularly surprise us because all four of the students had been accustomed to playing instruments before the start of the class.

¹ NTUT was then Tsukuba College of Technology, a three-year institute.

Figure 1: A student playing Miburi instrument.

4 MUSIC RECOGNITION
Since the last year of the class, we have conducted several experiments to investigate how deaf and hard-of-hearing students listen to music, focusing on the possibility of music communication with emotions. Since we believed that visual information can assist one in listening to music, we used several types of stimuli to understand the recognition of emotions that music can elicit. Besides single-medium stimuli of “music only” and “drawings only”, we used the following as multimedia stimuli.
• music and drawings,
• music and its performance scene,
• music and video sequences intended to convey no specific emotion,
• music and video sequences intended to convey the same emotion as the musical performances.
A probability of less than 5% was considered a significant difference in all the experiments. In these experiments, our interest was mainly in understanding whether there were more differences or similarities in recognizing emotions elicited from several types of stimuli between deaf and hard-of-hearing people and people with hearing abilities.

4.1 Single-medium stimuli
We started with experiments using two types of single-medium stimuli: “music only” and “drawings only”.

4.1.1 Music only
We used performances given by (1) NTUT students, (2) amateurs with hearing abilities, and (3) professional percussionists. The number of performances in each set and the number of subjects who listened to each set varied. The numbers are given in Table 7 in the Appendix, with the number of performances given in parentheses. The two subject groups (deaf and hard-of-hearing subjects and subjects with hearing abilities) showed no significant differences in recognizing emotion in music when they listened to performances played by hearing-impaired students and amateurs. On the other hand, there was a large difference when they listened to performances played by professional percussionists. Subjects with hearing abilities recognized emotion in this music significantly more than subjects with hearing impairments [8]. One possible reason is the difference in sound source. Professional percussionists played acoustic instruments, while other groups played the MIDI drum set and the performances were replayed from the MIDI data.

4.1.2 Drawings only
We then conducted an experiment where we used drawings only as stimuli. Experiments on emotional cognition from drawings have been conducted in the area of psychology [17]. We used three sets of drawings that were intended to








Figure 2: Drawing samples.

elicit emotions from the subjects, drawn respectively by students with hearing abilities who were design majors, hearing-impaired students who were electronics majors, and hearing-impaired students who were design majors. The number of subjects differed for each set of drawings. The numbers are given in Table 8, with the number of drawings given in parentheses.
Differences between subject groups were obtained when subjects looked at two of the three drawing sets. The exception was the set drawn by hearing-impaired students who were design majors [6].

4.2 Multimedia Stimuli
Below, we present our results for investigating whether there were significant differences in recognizing emotion in two types of stimuli: (1) music only and (2) music with one of the following types of visual information: drawings, performance scenes, video sequences intended to convey no emotion, and video sequences intended to convey the same emotion as the musical performance. We present results for both deaf and hard-of-hearing subjects and for subjects with hearing abilities. For all experiments with stimuli, whatever the type, we used drum improvisation performances.

4.2.1 Music and drawings
For the first multimedia stimulus type, we provided drawings intended to convey the same emotion as the musical performance during the first half of the performance. We used drawings based on our experiment on recognizing emotion conveyed by drawings (Section 4.1.2). They were provided along with the musical performances used in our previous experiments (Section 4.1.1). Some of the drawings we used are shown in Figure 2 [4]. Eleven hearing-impaired subjects (three males and eight females, aged 18–22) and 15 subjects with hearing abilities (13 males and two females, aged 20–24) participated in the experiment. The results showed that there were significant differences between the stimulus types of music only

Figure 3: A professional drummer drumming.
and music accompanied by a drawing intended to convey the same emotion for both subject groups.

4.2.2 Music and video sequences
In attempting to provide a visual assistance tool, we expected a video sequence accompanying a musical performance to be helpful in enabling deaf and hard-of-hearing people to recognize emotion. We used the following three types of video sequences.
• Performance scene. Since it was difficult to gather subjects at a live performance, we used replays of musical performances from our past experiments. We thought that if showing videos of the performance scenes improved the recognition rate, then we could assume that players in an ensemble performance would recognize emotion for better music communication.
• Video sequences not intended to convey any particular emotion. Since it seemed preferable for deaf and hard-of-hearing students to obtain added information from visual media, we thought that video sequences, even those not intended to convey any particular emotion, might be helpful in enabling them to recognize emotion in musical performance.
• Video sequences intended to convey the same emotion as the musical performance. We assumed that these sequences would improve the emotion recognition rate the most for deaf and hard-of-hearing subjects.
In experiments using these three stimulus types, we used different sets of musical performances and different subject groups.

4.2.3 Performance scene
We asked a professional percussionist to play a drum set in an improvisational style to convey emotions. He played four sets of performances (i.e.,




a total of 16 performances), using a bass drum, snare, set of concert toms consisting of five different sizes, Chinese gong, and suspended cymbal with drumsticks, mallets, brushes, and other items (Figure 3). Eleven deaf and hard-of-hearing subjects (eight males and three females, aged 19–24) and 10 subjects with hearing abilities (three males and seven females, aged 19–47) viewed a 19-inch display screen while listening to the musical performances from a speaker set [5]. The results showed that there were no significant differences between the two stimulus types of music only and music with its performance scene for both subject groups.

4.2.4 Video sequences intended to convey no emotion
We used the visual effects “Amoeba” and “Fountain” in Windows Media Player 10 [4]. These effects were controlled by sound data and the resulting animated scenes were not intended to convey any emotion by themselves. We chose Amoeba because its figure looked a little like some of the drawings we used in the drawings category. We chose Fountain because it looked quite different in shape and movement from Amoeba and used fewer colors than other effects. The number of deaf and hard-of-hearing subjects who watched Amoeba was the same as in the experiment using musical performance and drawings (Section 4.2.1), while eight subjects (two males and six females, aged 18–22) watched the Fountain scene. Fifteen subjects with hearing abilities (13 males and two females, aged 20–24) participated in the experiment. In this experiment too, the results showed that there were no significant differences between the two stimulus types of music only and music with video sequences for both subject groups.

4.2.5 Video sequences intended to convey emotion
We wrote a software program called “Music with Motion Picture” (MPM) that generates a motion picture scene from sound data and a drawing by using Max/MSP and DIPS (digital image processing with sound) [12] on Mac OS X. 
This program provides a set of effects for modifying the shapes of drawings to make video sequences. Each effect is given a set of preset parameter values for the shaders of GLSL (OpenGL Shading Language) and those that Apple Core Image supports. The most important parameters for modifying image shapes are strength, depth, and speed, which determine, respectively, how much, how smoothly, and how quickly a shape varies. The set of preset effects provided by MPM is given in Table 1.

Table 1: Effects of MPM.
Effect name      Traits
Soft             Gentle movement; shallow depth
Wave             Wave-like movement; fine movement with more depth and less strength
Notch            More strength; sharper notch with greater speed and depth values
Collapsing       More strength and large speed values
Jell-o           Tremulous movement; less strength, more speed, and greater depth values
Bump             Bold shivering movement; using bump distortion
Torus            Sight through lens; using torus distortion
Blur             Smoothing movement, creating afterimages, and some other blurring
Disappearance    Presentation or deletion of objects with amplitude
Vortex           Shape modification obtained by calculating coordinates

Sixteen deaf and hard-of-hearing students (five males and 11 females, aged 20–21) and twelve subjects with hearing abilities (three males and nine females, aged 22–48) participated in the experiment. We used one of the four sets of performances used in the experiment described in Section 4.2.3. The set in which the professional percussionist played a snare drum was the one that elicited the lowest emotion recognition rate among the four sets. For each musical performance, we used MPM to create two video sequences comprising a musical performance and a seed drawing intended to convey the same emotion as the performance. Thus, we used a total of eight video sequences accompanying the musical performances. The seed drawing, its effect on generating a motion picture scene, and a sample frame from each scene are given in Table 3. Although we provided two media with the same intended emotion, some subjects with hearing abilities pointed out a mismatch between the two media. No deaf and hard-of-hearing subjects mentioned this. The results we obtained in this experiment were unexpected. The recognition rates for subjects with hearing abilities improved more with the addition of video sequences than those for the deaf and hard-of-hearing subjects did. Moreover, there were significant differences between the two types of stimuli (music only and music with video sequences) for subjects with hearing abilities, but none for deaf and hard-of-hearing subjects.




4.3 Comparison of results
The results of experiments on multimedia stimuli (Sections 4.2.1, 4.2.3–4.2.5) are summarized in Table 2. We conducted three experiments to obtain the results. In the first experiment we used “Drawings”, “Amoeba”, and “Fountain”² and in the other two we used “Performance Scene” and “Video Sequence with Emotion”. Since the musical performance sets and subject groups were different, we thought that comparing the numbers between them would be a valid way to ascertain certain tendencies. Our findings in this respect are summarized below.
• Our results show that providing still-image drawings along with a musical performance can improve the emotion recognition rate. In spite of this, we feel that still images are inappropriate as a means of supplementing music because music is dynamic.
• Except for the “Performance Scene” experiment, we can see from the p-values in Table 2 that visual information is more likely to benefit subjects with hearing abilities than it is to benefit deaf and hard-of-hearing subjects.
The results of the experiment described in Section 4.2.5 and the abovementioned tendencies led us to question whether deaf and hard-of-hearing people could make use of visual information in listening to music. Namely, we wondered if music from multiple channels might not necessarily improve the ability of deaf and hard-of-hearing people to recognize the emotion conveyed by musical performances.

5 VISUAL INFORMATION
To determine whether deaf and hard-of-hearing people could make use of visual information, we conducted an experiment that compared subjects’ ability to recognize emotion conveyed by music accompanied by visual information with their ability to recognize emotion from visual information only. This experiment added the “video sequence only” stimulus to the experiment described in Section 4.2.5. 
Prior to the experiment, we thought that if any differences were observed between the recognition rates for “music only” and “music with video sequences”, then we could conclude that video sequences could be a valuable tool for enabling subjects to recognize emotions. In that case, though, if the recognition rates for music with video

sequences and video sequences only were the same, then we would have to ask whether subjects were making use of sound information at all.

5.1 Subjects
Seven deaf and hard-of-hearing subjects (aged 20–21, three males, four females) participated in the experiment. Their auditory capacity was over 100 dB, except for one subject (80 dB). They wore their hearing aids (turned on), except for one whose auditory capacity was 110 dB, though they did not know their own auditory capacities with the hearing aids. Since the number of stimuli was small, we asked them to participate in the experiment three times, once every week. The order in which the stimuli were given to them differed each time. In each session, they were given three kinds of stimuli: four drum performances to listen to, drum performances accompanied by video sequences to hear and watch, and video sequences without music to watch. We used two-way analysis of variance (ANOVA) where the effects were sessions in the experiment and types of stimuli. The p-value for the three experiment sessions was 0.40, where the respective mean values were 0.66, 0.61, and 0.55. Thus we used all data (seven subjects participating in the same experiment three times). As a control group, ten subjects with hearing abilities (aged 21–50, three males, seven females) participated in the experiment.

5.2 Material
Three types of stimuli were used in this experiment: (1) music only, (2) musical performances with video sequences where the emotions conveyed by both stimulus types coincided, and (3) video sequences only. They were the same as we used in the experiment described in Section 4.2.5: four musical performances and eight video sequences. The video sequences used in this experiment are summarized in Table 3.

5.3 Results
The recognition rates for all stimuli by subject groups are shown in Table 4. Drawings were shown to subjects who were different from those participating in this experiment. A total of 21 subjects (three males, 18 females, aged 20–21) were asked to recognize emotions from drawings. 
For both deaf and hard-of-hearing subjects and subjects with hearing abilities, the recognition rates for music paired with video sequences increased from that for only musical performance. The exception was the recognition of sadness (both S1 and S2) by deaf and hard-of-hearing and anger (A1) by subjects with hearing abilities.
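The recognition rates reported in this section can be read as the proportion of responses that match the emotion the stimulus was intended to convey. The paper does not spell out the formula, so the following is only a minimal sketch under that assumption; the function name `recognition_rate` and the response data are hypothetical. It also assumes that the three sessions of the seven deaf and hard-of-hearing subjects are pooled into 21 judgments per stimulus, which is how we read "we used all data" in Section 5.1.

```python
EMOTIONS = ("joy", "fear", "anger", "sadness")


def recognition_rate(intended, responses):
    """Fraction of responses that name the intended emotion.

    Assumption: each response is one subject's single forced choice
    among the four basic emotions for one stimulus.
    """
    if intended not in EMOTIONS:
        raise ValueError(f"unknown emotion: {intended!r}")
    if not responses:
        raise ValueError("at least one response is required")
    correct = sum(1 for response in responses if response == intended)
    return correct / len(responses)


if __name__ == "__main__":
    # Hypothetical data: 7 subjects x 3 sessions = 21 judgments of one
    # stimulus intended to convey joy, 18 of which were "joy".
    pooled = ["joy"] * 18 + ["anger"] * 3
    print(f"{recognition_rate('joy', pooled):.2f}")  # prints 0.86
```

With ten subjects in the control group and a single session, a rate computed this way can only take values in steps of 0.10, whereas 21 pooled judgments allow steps of roughly 0.05.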

² This means that the recognition rates for music for “Drawings”, “Amoeba”, and “Fountain” were the same.




Table 2: P-values obtained by comparing the recognition rates for music stimuli and for stimuli of music accompanied by visual information. * denotes visual information (one out of Drawings, Amoeba, Fountain, Performance Scene, and Video sequences with emotion).
                                Hearing-impaired subjects               Subjects with hearing abilities
                                p-value     Mean         Mean           p-value     Mean         Mean
                                            Music only   Music with *               Music only   Music with *
Drawings                        1.2e-004    0.46         0.66           9.0e-005    0.47         0.67
Amoeba                          0.93        0.46         0.46           0.26        0.47         0.52
Fountain                        1.00        0.46         0.45           0.53        0.47         0.49
Performance Scenes              0.11        0.54         0.65           0.83        0.74         0.73
Video sequences with emotion    0.18        0.55         0.68           0.01        0.50         0.77
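The comparison that Table 2 supports can be made explicit with a short calculation. The sketch below transcribes the table values, computes the gain in mean recognition rate when visual information is added, and flags the comparisons that are significant at the 5% level used throughout the experiments. The dictionary layout, the group keys, and the function name `summarize` are illustrative choices and not part of the original study.

```python
# Values transcribed from Table 2 as
# (p-value, mean for music only, mean for music with visual information).
TABLE2 = {
    "Drawings": {
        "hearing_impaired": (1.2e-4, 0.46, 0.66),
        "hearing": (9.0e-5, 0.47, 0.67),
    },
    "Amoeba": {
        "hearing_impaired": (0.93, 0.46, 0.46),
        "hearing": (0.26, 0.47, 0.52),
    },
    "Fountain": {
        "hearing_impaired": (1.00, 0.46, 0.45),
        "hearing": (0.53, 0.47, 0.49),
    },
    "Performance Scenes": {
        "hearing_impaired": (0.11, 0.54, 0.65),
        "hearing": (0.83, 0.74, 0.73),
    },
    "Video sequences with emotion": {
        "hearing_impaired": (0.18, 0.55, 0.68),
        "hearing": (0.01, 0.50, 0.77),
    },
}

ALPHA = 0.05  # significance level stated in Section 4


def summarize(table, alpha=ALPHA):
    """Return (visual type, group, gain in mean rate, significant?) rows."""
    rows = []
    for visual, groups in table.items():
        for group, (p_value, music_only, music_with) in groups.items():
            gain = round(music_with - music_only, 2)
            rows.append((visual, group, gain, p_value < alpha))
    return rows


if __name__ == "__main__":
    for visual, group, gain, significant in summarize(TABLE2):
        flag = "significant" if significant else "not significant"
        print(f"{visual:30s} {group:17s} gain={gain:+.2f} ({flag})")
```

Read this way, only the drawings give a significant gain for both groups, and the video sequences with emotion give a significant gain only for subjects with hearing abilities.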

Table 3: Seed drawings, effects in generating video sequences, and sample frames from scenes.
Emotion    Name   Effect
Joy        J1     Jell-o
Joy        J2     Bump
Fear       F1     Wave
Fear       F2     Jell-o
Anger      A1     Collapse
Anger      A2     Bump
Sadness    S1     Wave
Sadness    S2     Jell-o
(The "Drawings" and "Sample frames" columns contain images.)

Table 4: Recognition rates for three types of stimuli. DHH: deaf and hard-of-hearing subjects, HA: subjects with hearing abilities. VS: video sequence, M&VS: music with video sequences, Music: music only.
Name Drawings VS DHH M&VS Music VS HA M&VS Music 0.90 0.80 0.40 J1 1.00 0.86 0.71 0.43 0.80 0.80 0.90 0.50 0.10 J2 1.00 0.52 0.52 F1 0.95 0.43 0.48 0.48 0.90 0.60. 1.00 0.60 0.70 F2 0.76 0.62 0.52 A1 0.38 1.00 0.81 0.76 1.00 0.90 0.60 0.60 0.50 A2 0.43 0.67 0.95 S1 0.43 0.47 0.52 0.62 0.90 1.00 S2 0.43 0.62 0.33




Table 5: P-values obtained by comparing three types of stimuli (music only, music accompanied by video sequences, and video sequences only) and their mean values.
        p-value     Mean (Music)   Mean (M&VS)   Mean (VS)
DHH     0.62        0.57           0.61          0.65
HA      2.64e-04    0.42           0.73          0.88

Table 6: P-values obtained by comparing four emotions and their mean values.
         p-value   Joy    Fear   Anger   Sadness
VS       0.24      0.78   0.71   0.92    0.65
M&VS     0.14      0.72   0.53   0.82    0.63
Music    0.10      0.40   0.28   0.74    0.57
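Tables 5 and 6 summarize the same recognition rates along two different axes, by subject group and by emotion, so their means can be cross-checked. The sketch below is only a consistency check on the published figures, assuming that the values in both tables are read column by column and that the Table 6 means weight the two subject groups equally; neither assumption is stated in the paper. The names `TABLE5_MEANS`, `TABLE6_MEANS`, and `cross_check` are illustrative.

```python
# Mean recognition rates per subject group, transcribed from Table 5.
TABLE5_MEANS = {
    "DHH": {"Music": 0.57, "M&VS": 0.61, "VS": 0.65},
    "HA": {"Music": 0.42, "M&VS": 0.73, "VS": 0.88},
}

# Mean recognition rates per emotion, transcribed from Table 6.
TABLE6_MEANS = {
    "VS": {"Joy": 0.78, "Fear": 0.71, "Anger": 0.92, "Sadness": 0.65},
    "M&VS": {"Joy": 0.72, "Fear": 0.53, "Anger": 0.82, "Sadness": 0.63},
    "Music": {"Joy": 0.40, "Fear": 0.28, "Anger": 0.74, "Sadness": 0.57},
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def cross_check(by_group, by_emotion):
    """For each stimulus type, pair the mean over subject groups
    (Table 5) with the mean over emotions (Table 6)."""
    result = {}
    for stimulus, emotion_rates in by_emotion.items():
        group_mean = mean([rates[stimulus] for rates in by_group.values()])
        emotion_mean = mean(list(emotion_rates.values()))
        result[stimulus] = (group_mean, emotion_mean)
    return result


if __name__ == "__main__":
    for stimulus, (group_mean, emotion_mean) in cross_check(
        TABLE5_MEANS, TABLE6_MEANS
    ).items():
        print(f"{stimulus:6s} by group={group_mean:.3f} by emotion={emotion_mean:.3f}")
```

Under these assumptions the two means agree to within rounding for all three stimulus types, which supports this reading of the two tables.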

We used two-way ANOVA to analyze recognition rates, where the effects were subject groups and stimulus types. The p-values showed that there were significant differences between stimulus types (p=6.0e-004) but no differences between subject groups (p=0.25). Note that the differences between stimulus types were caused mainly by the subject group with hearing abilities. The p-values and mean values for the three types of stimuli used in recognizing emotion are shown in Table 5. There were no significant differences between stimulus types for deaf and hard-of-hearing subjects, while there were differences for subjects with hearing abilities. A multiple comparison for the results of the subject group with hearing abilities showed that there were significant differences between all pairs of stimulus types except between video sequences only and music with video sequences. We also analyzed the recognition rates for each emotion for the three types of stimuli using one-way ANOVA. The p-values showed that there were no significant differences in recognition rate of any of the emotions for any stimulus type. The p-values and mean recognition rates for all the emotions are shown in Table 6.

6 DISCUSSION
6.1 Role of visual information
The experiments showed that there were no significant differences in recognizing emotions between the two subject groups. This corresponds to the results of past experiments and suggests to us that musical communication may be possible between deaf and hard-of-hearing people and people with hearing abilities. From Table 5 and the p-value obtained by comparing three types of stimuli, we can see that providing video sequences did not significantly improve the recognition rates for deaf and hard-of-hearing people. On the other hand, Table 5 shows the same recognition tendency in the two subject groups: providing subjects with the stimuli of video sequences accompanying musical performances helped their recognition of emotions. 
Table 6 also supports the increase in recognition rate for each emotion when musical performances were accompanied by video sequences.

In these experiments, our concern was the difference in recognition of different emotions. In most of the past experiments, fear was the least recognized emotion, and the present experiments showed similar results (Table 4). Table 6 also shows a similar tendency for fear to have the lowest recognition. Although there were no significant differences in recognizing emotions in terms of the p-values, the spontaneous remarks made by subjects revealed the difficulties in differentiating joy from anger and fear from sadness in listening to musical performances only. Though the improvement was insufficient, if a video sequence accompanies a musical performance, then it will be effective for the recognition of fear.

6.2 Video sequences as supplementary information for musical performances
The increase in recognition rates with video sequences encouraged us to use video sequences in our musical performance assistance system. On the other hand, the improvement in recognition with video sequences from musical performances only was small and also insufficient for differentiating emotions, so we must seek better video sequences for augmenting musical recognition. In seeking such video sequences, we should be careful not to forget that they are intended for recognizing musical performance and not for use by themselves. Table 4 reveals several interesting results. For deaf and hard-of-hearing subjects, in the case of A2, a multimedia stimulus increased the recognition rates for both cases of a single medium (musical performance only and video sequences only). On the other hand, multimedia stimulus S2 decreased the recognition for both a music-only stimulus and a video-sequence-only stimulus. These results indicate that multimedia stimuli can be both useful and detrimental, though these examples do not apply to hearing subjects. 
By comparing recognition rates for multimedia stimuli where the recognition rates of the two video sequence stimuli were the same, we found that the multimedia effect can describe the difference in video effects. This happens in the case of fear and




anger for hearing subjects. Recognition rates for video sequences only for both F1 and F2 were 0.90 and those for music only were 0.10. Similarly, the rates for video sequences for A1 and A2 were 1.00 and those for music only were 0.70. While the recognition rates for multimedia stimuli for F1 and F2 do not differ much, those for A1 and A2 were 0.60 and 0.90, respectively. From this, we can predict that the Bump effect might be the better effect to use in generating video sequences of anger for our purpose. Since the recognition rates for multimedia for A1 and A2 for deaf and hard-of-hearing subjects were 0.81 and 0.95, respectively, this prediction is likely to be true.

6.3 Music beyond hearing impairment
Among the seven deaf and hard-of-hearing subjects, three were from schools for special education. Regardless of the school type, they had some experience of musical instruments at school. One of them currently belongs to a street-dance circle, one is a member of a band, and four have their own favorite music. We feel that auditory capacity and music recognition are not necessarily related. In our past experiments, one subject enjoyed playing wadaiko despite having auditory capacity off the scale (over 120 dB). We found by enquiry that, in general, those with greater hearing losses listen to music less than those with less hearing impairment. On the other hand, we found that the recognition rate for music only for the subject who did not wear his hearing aid was the second best in this experiment.

6.4 Future work
Our next step in building a performance assistance system will be to find better video sequences as “visual assistance”. To build better video sequences for multimedia use, we should be careful in using MPM. Namely, we should not necessarily make better video sequences that can convey emotions by themselves, but should seek video sequences that supplement music information. 
To get several other video scenes for each emotion, we should first analyze the relationship between the movement (effect) in video sequences and emotions, in order to avoid arbitrariness in generating video sequences. When we regenerate them, we could then repeat the experiment described in the previous section. We will analyze musical sound data and image data quantitatively so that our system will be able to judge emotion in music and generate video sequence candidates automatically. The question of how valuable visual information is for music includes how to activate residual hearing abilities to enable better use to be made of information from several media types, as well as how to obtain better visual information for supplementary

use in recognizing emotions in musical performances. This work will include developing other types of system besides our musical performance assistant system. Another, as yet untouched, issue is the search for a third medium that can convey the information contained in music, such as haptic information, which is treated in Miura's work on vibration devices [11].

7 CONCLUSION

We described the background of our plan to build a musical performance assistance system for deaf and hard-of-hearing people to enable them to share in the joys of music. Although we previously assumed that the supplementary visual information provided by the system would act as an effective tool for this purpose, our experimental results raised doubts as to how useful visual information actually is in facilitating musical communication. Thus, we reviewed our past experiments and compared the emotion recognition rates obtained through musical performance only and through musical performance accompanied by several types of visual information. Since we were not persuaded that visual information is useful for music recognition, we conducted an experiment that compared the recognition of emotion in music accompanied by video sequences and in video sequences only. Although the evidence is not strong, the results do support the role of visual information in music communication.

ACKNOWLEDGMENTS

This research was supported by the Kayamori Foundation of Information Science Advancement and the Special Research Expenses by NTUT.

8 REFERENCES

[1] F. Blum: Digital Interactive Installations: Programming interactive installations using the software package Max/MSP/Jitter, VDM Verlag (2000).
[2] R. Bresin and A. Friberg: Emotional coloring of computer-controlled music performances, Computer Music Journal, MIT Press, 24(4), pp. 44-63 (2000).
[3] A.-A. Darrow and J. Novak: The effect of vision and hearing loss on listeners' perception of referential meaning in music, Journal of Music Therapy, American Music Therapy Association, XLIV(1), pp. 57-73 (2007).
[4] R. Hiraga and N. Kato: Understanding emotion through multimedia: comparison between hearing-impaired people and people with hearing abilities, Proc. of ASSETS, ACM, pp. 141-148 (2006).
[5] R. Hiraga, N. Kato, and N. Matsuda: Effect of visual representation in recognizing emotion

Ubiquitous Computing and Communication Journal UbiCC Journal, Volume 5, March 2010



expressed in a musical performance, Proc. of ICSMC, IEEE, pp. 131-136 (2008).
[6] R. Hiraga, N. Kato, and T. Yamasaki: Understanding emotion through drawings: comparison between hearing-impaired people and people with hearing abilities, Proc. of ICSMC, IEEE, pp. 103-108 (2006).
[7] R. Hiraga and M. Kawashima: Performance visualization for hearing-impaired students, Journal of Systemics, Cybernetics and Informatics, Int'l Institute of Informatics and Cybernetics, 3(5), pp. 24-32 (2006).
[8] R. Hiraga, T. Yamasaki, and N. Kato: The cognition of intended emotions for a drum performance: Differences and similarities between hearing-impaired people and people with hearing abilities, Proc. of ICMPC, pp. 219-224 (2006).
[9] P. N. Juslin: Communicating emotion in music performance: a review and a theoretical framework, In P. N. Juslin and J. A. Sloboda, editors, Music and Emotion, theory and research, Oxford University Press, pp. 309-340 (2004).
[10] G. Levin and Z. Lieberman: Sounds from shapes: Audiovisual performance with hand silhouette contours in the manual input sessions, Proc. of NIME, pp. 115-120 (2005).
[11] S. Miura and M. Sugimoto: Supporting children's rhythm learning using vibration devices, Proc. of CHI, ACM, pp. 1127-1132 (2006).
[12] C. Miyama, T. Rai, S. Matsuda, and D. Ando: Introduction of DIPS programming technique, Proc. of ICMC, ICMA, pp. 459-462 (2003).
[13] Y. Ota and Y. Kato: Musical education at elementary school department for the deaf, Education for the Deaf in Japan, 43(2) (2001).
[14] R. Parke, E. Chew, and C. Kyriakakis: Quantitative and visual analysis of the impact of music on perceived emotion of film, ACM Computers in Entertainment, 5(3) (2007).
[15] E. Schubert and D. Fabian: An experimental investigation of emotional character portrayed by piano versus harpsichord performances of a J.S. Bach excerpt, In E. Mackinlay, editor, Aesthetics and Experience in Music Performance, Cambridge Scholars Press, pp. 77-94 (2005).
[16] M. Senju and K. Ogushi: How are the player's ideas conveyed to the audience? Music Perception, University of California Press, 4(4), pp. 311-323 (1987).
[17] Y. Wada, K. Tsuzuki, H. Yamada, and T. Oyama: Perceptual attributes determining affective meanings of abstract form drawings, Proc. 17th Congress of International Association of Empirical Aesthetic, pp. 131-134 (2002).
[18] P. Whittaker: Musical potential in the profoundly deaf, Music and the Deaf (1986).
[19] T. Yamasaki: Emotional communication mediated by two different expression forms: Drawings and music performances, Proc. of ICMPC, pp. 153-154 (2006).

9 APPENDIX

We varied the number of stimuli and the number of subjects participating in experiments with single-medium stimuli, for both the music-only and drawings-only categories. Table 7 shows the number of improvisational style drum performances used in the experiment referred to in Section 4.1.1. Table 8 shows the number of drawings used in the experiment referred to in Section 4.1.2.

HI: hearing-impaired. Aged 20–22, 12 males, 3 females.
HA: hearing abilities. Aged 21–26, 20 males, 13 females.

            Performances by
Subjects    HI (11)    HA Amateurs (5)    HA Professionals (2)
HI          10         15                 15
HA          33         33                 33

Table 7: Number of performances (in parentheses) and number of subjects for the experiment on recognizing emotion conveyed by musical performances.

HI EM: hearing-impaired, electronics majors. Aged 20–22, 16 males, 3 females.
HI DM: hearing-impaired, design majors. Aged 20–22, 2 males, 8 females.
HA DM: hearing abilities, design majors. Aged 21–26, 26 males, 8 females.

            Drawings by
Subjects    HI EM (14)    HI DM (11)    HA DM (7)
HI EM       19            10            19
HI DM       10            9             10
HA          34            11            34

Table 8: Number of drawings (in parentheses) and number of subjects for the experiment on recognizing emotion conveyed by drawings.




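Examining the feasibility of face gesture detection for monitoring users of autonomous wheel chairs

Gregory Fine, John K. Tsotsos
Department of Computer Science and Engineering, York University, Toronto, Canada
fineg74@yahoo.com, tsotsos@cse.yorku.ca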

ABSTRACT

The user interface of existing autonomous wheelchairs concentrates on direct control of the wheelchair by the user using mechanical devices or various hand, head or face gestures. However, it is important to monitor the user to ensure the safety and comfort of the person who operates the autonomous wheelchair. In addition, such monitoring greatly improves the usability of an autonomous wheelchair due to the improved communication between the user and the wheelchair. This paper proposes a user monitoring system for an autonomous wheelchair. The feedback of the user and the information about the actions of the user, obtained by such a system, will be used by the autonomous wheelchair for planning its future actions. As a first step towards the creation of the monitoring system, this work proposes and examines the feasibility of a system that is capable of recognizing static facial gestures of the user using a camera mounted on a wheelchair. The prototype of such a system has been implemented and tested, achieving a 90% recognition rate with 6% false positive and 4% false negative rates.

Keywords: Autonomous wheelchair, Vision Based Interface, Gesture Recognition



1.1 Motivation

In 2002, 2.7 million people aged fifteen and older used a wheelchair in the USA [1]. This number is greater than the number of people who are unable to see or hear [1]. The majority of these people have serious difficulties in performing routine tasks and depend on their caregivers. The problem of providing disabled people with greater independence has attracted the attention of researchers in the area of assistive technology. As a result, modern intelligent wheelchairs are able to autonomously navigate indoors and outdoors, and avoid collisions during movement without intervention of the user. However, controlling such a wheelchair and ensuring its safe operation may be challenging for disabled people. Generally, the form of the control has the greatest impact on the convenience of using the wheelchair. Ideally, the user should not be involved in the low-level direct control of the wheelchair. For example, if the user wishes to move from the bedroom to the bathroom, the wheelchair should receive an instruction to move to the bathroom and navigate there autonomously without any assistance from the user. During the execution of the task, the wheelchair will monitor the user in order to detect whether the user is satisfied with the decisions taken by the wheelchair, requires some type of assistance, or wishes to give new instructions. Hence, obtaining feedback from the user and taking

independent decisions based on this feedback is one of the important components of an intelligent wheelchair. Such a wheelchair requires some form of feedback to obtain information about the intentions of the user. It is desirable to obtain the feedback in an unconstrained and non-intrusive way, and the use of a video camera is one of the most popular methods to achieve this goal. Generally, the task of monitoring the user may be difficult. This work explores the feasibility of a system capable of obtaining visual feedback from the user for use by an autonomous wheelchair. In particular, this work considers one form of visual feedback, namely facial gestures.

1.2 Related Research

Autonomous wheelchairs attract much attention from researchers (see e.g. [32, 36, 16] for general reviews). However, most research in the area of autonomous wheelchairs focuses on automatic route planning, navigation and obstacle avoidance. Relatively little attention has been paid to the issue of the interface with the user. Most, if not all, existing research in the area of user interfaces concentrates on the issue of controlling the autonomous wheelchair by the user [32]. The methods that control the autonomous wheelchair include mechanical devices, such as joysticks, touch pads, etc. (e.g. [9]); voice recognition systems (e.g. [22]); electrooculographic (e.g. [4]), electromyographic (e.g. [18]) and electroencephalographic (e.g. [34]) devices; and machine vision systems (e.g. [27]). The machine



vision approaches usually rely on head (e.g. [20, 38, 36, 27, 7, 6]), hand (e.g. [25, 21]) or facial (e.g. [9, 7, 6]) gestures to control the autonomous wheelchair. A combination of joystick, touch screen and facial gestures was used in [9] to control an autonomous wheelchair. The facial gestures are used to control the motion of the wheelchair. The authors proposed the use of Active Appearance Models (AAMs) [33] to detect and interpret facial gestures, using the concept of Action Units (AUs) introduced by [13]. To improve the performance of the algorithm, an AAM is trained using an artificial 3D model of a human head, onto which a frontal image of the human face is projected. The model of the head can be manipulated in order to model variations of a human face due to head rotations or illumination changes. Such an approach allows one to build an AAM that is insensitive to different lighting conditions and head rotations. The authors do not specify the number of facial gestures recognizable by the proposed system or the performance of the proposed approach. In [30, 2, 29] the authors proposed the use of the face direction of a wheelchair user to control the wheelchair. The system uses face direction to set the direction of the movement of the wheelchair. However, a straightforward implementation of such an approach produces poor results because unintentional head movements may lead to false recognition. To deal with this problem, the authors ignored quick movements and took into account the environment around the wheelchair [30]. Such an approach improves the performance of the algorithm by ignoring likely unintentional head movements. The algorithms operated on images obtained by a camera tilted by 15 degrees, which is much less than the angles in this work. To ignore quick head movements, both algorithms performed smoothing on a sequence of angles obtained from a sequence of input images.
While this technique effectively filters out fast and small head movements, it does not allow fast and temporally accurate control of the wheelchair. Unfortunately, only subjective data about the performance of these approaches have been provided. In [25] the use of hand gestures to control an autonomous wheelchair was suggested. The most distinctive features of this approach are the ability to distinguish between intentional and unintentional hand gestures and "guessing" of the meaning of unrecognized intentional hand gestures. The system assumed that a person who makes an intentional gesture would continue to do so until the system recognizes it. Once the system established the meaning of the gesture, the person continued to produce the same gesture. Hence, to distinguish between intentional and unintentional gestures, repetitive patterns in hand movement are detected. Once a repetitive hand movement is detected, it is

considered an intentional gesture. In the next stage, the system tried to find the meaning of the detected gesture by trying all possible actions until the user confirmed the correct action by repeating the gesture. The authors reported that the proposed wheelchair supports four commands, but they do not provide any data about the performance of the system. The use of a combination of head gestures and gaze direction to control an autonomous wheelchair was suggested in [27]. The system obtained images of the head of a wheelchair user with a stereo camera. The camera of the wheelchair was tilted upward by 15 degrees, so that the images obtained by the camera were almost frontal. The use of a stereo camera permits a fast and accurate estimate of the head posture as well as the gaze direction. The authors used the head direction to set the direction of wheelchair movement. To control the speed of the wheelchair, the authors used a combination of face orientation and gaze direction. If the face orientation coincided with the gaze direction, the wheelchair moved faster. To start or stop the wheelchair, the authors used head shaking and nodding. These gestures were defined as consecutive movements of the head of some amplitude in opposite directions. The authors do not provide data on the performance of the proposed approach. While the approaches presented in this section mainly deal with controlling the wheelchair, some of them may be useful for the monitoring system. The approach proposed in [9] is extremely versatile and can be adapted to recognize facial gestures of a user. The approaches presented in [30, 2] and especially in [27] may be used to detect the area of interest of the user. The approach presented in [25] may be useful to distinguish between intentional and unintentional gestures. However, more research is required to determine whether this approach is applicable to head or facial gestures.
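The trade-off noted above for angle smoothing, where brief head movements are suppressed but a deliberate turn is reported late, can be illustrated with a minimal sketch. It assumes a plain causal moving average; the filters actually used in [30, 2, 29] are not specified here, and the function name, the window length, the threshold and the sample angles are all hypothetical. Angle wrap-around is ignored for simplicity.

```python
from collections import deque


def smooth_angles(angles, window=5):
    """Causal moving average over the last `window` angle estimates (degrees)."""
    recent = deque(maxlen=window)
    smoothed = []
    for angle in angles:
        recent.append(angle)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Hypothetical per-frame face-direction estimates in degrees.
# Frame 3 is a one-frame glance; from frame 6 on the user turns deliberately.
raw = [0, 0, 0, 30, 0, 0, 30, 30, 30, 30, 30]
smoothed = smooth_angles(raw, window=5)

threshold = 20  # hypothetical angle above which the wheelchair would react
first_raw = next(i for i, a in enumerate(raw[6:], start=6) if a > threshold)
first_smoothed = next(i for i, a in enumerate(smoothed) if a > threshold)
delay_frames = first_smoothed - first_raw
```

In this toy sequence the glance never pushes the smoothed angle above the threshold, while the deliberate turn is only detected a few frames after it starts. A longer window rejects longer unintentional movements at the cost of a longer delay.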
1.3 Contributions

The research described in this paper works towards the development of an autonomous wheelchair user monitoring system. This work presents a system that is capable of monitoring static facial gestures of a user of an autonomous wheelchair in a non-intrusive way. The system obtains the images using a standard camera, which is installed in the area above the knee of the user as illustrated in Figure 2. Such a design does not obstruct the field of view of the user and obtains input in a non-intrusive and unconstrained way. Previous research in the area of interfaces of autonomous wheelchairs with humans concentrates on the issue of controlling the wheelchair by a user. The majority of proposed approaches are suitable for controlling the wheelchair only. One of the major contributions of this work is that it examines the feasibility of creating a monitoring system for users



of autonomous wheelchairs and proposes a general-purpose static facial gesture recognition algorithm that can be adapted for a variety of applications that require feedback from the user. In addition, unlike other approaches, the proposed approach relies solely on facial gestures, which is a significant advantage for users with severe mobility limitations. Moreover, the majority of similar approaches require the camera to be placed directly in front of the user, obstructing his/her field of view. The proposed approach is capable of handling non-frontal facial images and therefore does not obstruct the field of view. The proposed approach has been implemented in software and evaluated on a set of 9140 images of ten volunteers producing ten facial gestures. Overall, the implementation achieves a recognition rate of 90%.

1.4 Outline of Paper

This paper consists of five sections. The first section provides motivation for the research and discusses previous related work. Section 2 describes the entire monitoring system in general. Section 3 provides technical and algorithmic details of the proposed approach. Section 4 details the experimental evaluation of a software implementation of the proposed approach. Finally, Section 5 provides a summary and conclusion of this work.

2 AN APPROACH TO WHEELCHAIR USER MONITORING

2.1 Overview

As intelligent wheelchairs become more and more sophisticated, the task of controlling them becomes increasingly important in order to utilize their full potential. The direct control that is customary for non-intelligent wheelchairs cannot fully utilize the capabilities of an autonomous wheelchair. Moreover, the task of directly controlling the wheelchair may be too complex for some patients. To overcome this drawback, this work proposes to add a monitoring system to the control system of an autonomous wheelchair. The purpose of such a system is to provide the wheelchair with timely and accurate feedback from the user on the actions performed by the wheelchair or about the intentions of the user. The wheelchair will use this information for planning its future actions or correcting the actions that are currently being performed. The response of the wheelchair to feedback from the user depends on the context in which this feedback was obtained. In other words, the wheelchair may react differently to, or even ignore, feedback from the user in different situations. Because it is difficult to infer the intentions of the user from his/her facial expressions, the monitoring system will complement the regular control system of a

wheelchair instead of replacing it entirely. Such an approach facilitates the task of controlling an autonomous wheelchair and makes a wheelchair friendlier to the user. The most appropriate way to obtain feedback from the user is to monitor the user constantly using some sort of input device and classify the observations into categories that can be understood by the autonomous wheelchair. To be truly user friendly, the monitoring system should neither distract the user from his/her activities nor limit the user in any way. Wearable devices, such as gloves, cameras or electrodes, usually distract the user and therefore are unacceptable for the purposes of monitoring. Microphones and similar voice input devices are not suitable for passive monitoring, because their usage requires explicit involvement of the user. In other words, the user has to talk so that the wheelchair may respond appropriately. Vision-based approaches are the most suitable for the purposes of monitoring the user. Video cameras do not distract the user and, if they are installed properly, they do not limit the field of view. The vision-based approach is versatile and capable of capturing a wide range of forms of user feedback. For example, it may capture facial, head and various hand gestures as well as the face orientation and gaze direction of the user. As a result, the monitoring system may determine, for example, where the user is looking, whether the user is pointing at anything, and whether the user is happy or distressed. Moreover, the vision-based system is the only system that is capable of both passive and active monitoring of the user. In other words, a vision-based system is the only system that can obtain the feedback of the user by detecting intentional actions or by inferring the meaning of unintentional actions. The wheelchair has a variety of ways to use this information.
For example, if the user looks in a certain direction, which may differ significantly from the direction of movement, the wheelchair may slow down or even stop to let the user look at the area of interest. If the user is pointing at something, the wheelchair may identify the object of interest and move in that direction, or bring the object over if the wheelchair is equipped with a robot manipulator. If there is a notification that should be brought to the attention of the user, the wheelchair may use only a visual notification if the user is looking at the screen, or a combination of visual and auditory notifications if the user is looking away from the screen. The fact that the user is happy may serve as confirmation of the wheelchair's actions, while distress may indicate an incorrect action or a need for help. As a general problem, inferring intent from action is very difficult.

2.2 General Design

The monitoring system performs constant monitoring of the user, but it is not controlled by the user and therefore does not require any user



interface. From the viewpoint of the autonomous wheelchair, the monitoring system is a software component that runs in the background and notifies the wheelchair system about detected user feedback events. To make the monitoring system more flexible, it should be configurable in terms of the events it recognizes. For example, one user may express distress using some sort of face gesture while another may do the same by using a head or hand gesture. The monitoring system should be able to detect both kinds of distress correctly, depending on the user being observed. Moreover, due to the high variability of the gestures performed by different people, and because of the natural variability of disorders, the monitoring system requires training for each specific user. The training should be performed by trained personnel at the home of the person for whom the wheelchair is designed. Such training may also be required for the navigation system of an intelligent wheelchair, so the requirement to train the monitoring system is not excessive. The training includes collecting training images of the user, manual processing of the collected images by personnel, and training the monitoring system. During training, the monitoring system learns head, face and hand gestures as they are produced by the specific user, together with their meanings for the wheelchair. In addition, various images that do not have any special meaning for the system are collected and used to train the system to reject spurious images. Such an approach produces a monitoring system with maximal accuracy and convenience for the specific user. It may take a long time to train the monitoring system to recognize emotions of the user, such as distress, because a sufficient number of images of genuine facial expressions of the user must be collected.
As a result, the full training of the monitoring system may consist of two stages: in the first stage, the system is trained to recognize hand gestures and the face of the user, and in the next stage, the system is trained to recognize the emotions of the user. To provide the wheelchair system with timely feedback, the system should have good performance that allows real-time processing of input images. Such performance is sufficient to recognize both static and dynamic gestures performed by the user. To avoid obstructing the field of view of the user, the camera should be mounted outside the user’s field of view. However, the camera should be also capable of taking images of the face and hands of the user. Moreover, it is desirable to keep the external dimensions of the wheelchair as small as possible, because a compact wheelchair has a clear advantage when navigating indoors or in crowded areas. To satisfy these requirements one of the places to mount the camera is on an extension of the side handrail of the wheelchair. This does not enlarge the overall

external dimensions of the wheelchair or limit the field of view of the user, and it allows tracking of the face and hands of the user. However, this requires that the monitoring system deal with non-frontal images of the user, taken from underneath the face of the user. Such images are prone to distortions and therefore the processing of such images is challenging. To the best of our knowledge, there is no research that deals with facial images taken from underneath the user's face at such large angles as required in this work. In addition, the location of the head and hands is not fixed, so the monitoring system should deal with distortions due to changes in the distance to the camera and in the viewing angle. The block diagram of the proposed monitoring system is presented in Fig. 1. The block diagram illustrates the general structure of the monitoring system and its integration into the control system of an intelligent wheelchair.
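The background-component design described in Section 2.2 can be pictured with a minimal sketch: a per-user table maps recognized gestures to feedback events, and the wheelchair subscribes to those events. This is only an illustration under stated assumptions, not the authors' implementation; the class names, gesture labels and event names below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FeedbackEvent:
    name: str      # hypothetical event name, e.g. "distress"
    gesture: str   # gesture label that triggered the event


@dataclass
class MonitoringSystem:
    # Per-user mapping from gesture labels to event names,
    # assumed to be filled in during the training session for that user.
    gesture_to_event: Dict[str, str]
    listeners: List[Callable[[FeedbackEvent], None]] = field(default_factory=list)

    def subscribe(self, listener: Callable[[FeedbackEvent], None]) -> None:
        self.listeners.append(listener)

    def on_gesture(self, gesture: str) -> bool:
        """Called by the recognizer; notifies listeners if the gesture has a meaning."""
        event_name = self.gesture_to_event.get(gesture)
        if event_name is None:   # gesture has no meaning for this user: ignore it
            return False
        event = FeedbackEvent(event_name, gesture)
        for listener in self.listeners:
            listener(event)
        return True


received = []
user_a = MonitoringSystem({"frown": "distress", "smile": "confirm"})
user_b = MonitoringSystem({"head_shake": "distress"})
for system in (user_a, user_b):
    system.subscribe(lambda event: received.append(event.name))

user_a.on_gesture("frown")          # user A expresses distress with a facial gesture
user_b.on_gesture("head_shake")     # user B expresses distress with a head gesture
ignored = user_b.on_gesture("frown")  # not in user B's mapping, so no event is sent
```

Both users produce the same "distress" event from different gestures, which is the kind of per-user configuration the text calls for. How the wheelchair reacts to an event in a given context is left to the subscriber.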

Figure 1: The block diagram of the monitoring system

3 TECHNICAL APPROACH TO FACIAL GESTURE RECOGNITION

3.1 System Overview

The facial gesture recognition system is part of an existing autonomous wheelchair, and this fact has some implications for the system. It takes an image of the face as input, using a standard video camera, and produces the classification of the facial gesture as an output. The software for the monitoring system may run on a computer that controls the wheelchair. However, the input for the monitoring system cannot be obtained using the existing design of the wheelchair and requires the installation of additional hardware. Due to the fact that the system is intended



for autonomous wheelchair users, the hardware should neither limit the user nor obstruct his or her field of view. The wheelchair handrail is one of the best possible locations to mount the camera for monitoring the user because it will neither limit the user nor obstruct the field of view. This approach has one serious drawback: the camera mounted in such a manner produces non-frontal images of the face of the user who is sitting in the wheelchair. Non-frontal images are distorted and some parts of the face may even be invisible. These facts make detection of facial gestures extremely difficult. Dealing with non-frontal facial images taken from underneath a person is very uncommon and rarely addressed. The autonomous wheelchair with an installed camera for the monitoring system and a sample of the picture that is taken by the camera are shown in Figure 2.

3.2 Facial Gestures

Generally, facial gestures are caused by the action of one or several facial muscles. This fact, along with the great natural variability of the human face, makes the general task of classifying facial gestures difficult. The Facial Action Coding System (FACS), a comprehensive system that classifies facial gestures, was proposed in [13]. The approach is based on classifying clearly visible changes on a face and ignoring invisible or subtly visible changes. It classifies a facial gesture using the concept of an Action Unit (AU), which represents a visible change in the appearance of some area of the face. Over 7000 possible facial gestures were classified by [12]. It is beyond the scope of this work to deal with this full spectrum of facial gestures. In this work, a facial gesture is defined as a consistent and unique facial expression that has some meaning in the context of the application. The human face is represented as a set of contours of various distinguishable facial features that can be detected in the image of the face.
Naturally, as the face changes its expression, contours of some facial features may change their shapes, some facial features may disappear, and some new facial features may appear on the face. Hence, in the context of the monitoring system, the facial gesture is defined as a set of contours of facial features, which uniquely identify a consistent and unique facial expression that has some meaning for the application. It is desirable to use a constant set of facial features to identify the facial gesture. Obviously, there are many possibilities in selecting the facial features whose contours define the facial gesture. However, the selected facial features should be easily and consistently detectable. Taking into consideration the fact that the most prominent and noticeable facial features are the eyes and mouth, the facial gestures produced by the eyes and mouth are most suitable for use in the system. Therefore, only contours of the eyes and mouth are considered

in this research. Facial gestures formed using only the eyes and mouth are a small subset of all facial gestures that can be produced by a human. Hence, many gestures cannot be classified using this approach. However, it is assumed that the facial gestures that have some meaning for the monitoring system differ in the contours of the eyes and mouth. Hence, this subset is enough for the purpose of this research, namely a feasibility study.

3.3 System Design

Conceptually, the algorithm behind the facial gesture detection has three stages: (1) detection of the eyes and mouth in the image and obtaining their contours; (2) conversion of contours of facial features to a compact representation that describes the shapes of contours; and (3) classification of contour shapes into categories representing facial gestures. This section briefly describes these stages; the rest of Section 3 discusses them in more detail. In the first stage, the algorithm of the monitoring system detects the eyes and mouth in the input image and obtains their contours. In this work, the modified AAM algorithm, first proposed in [35] and later modified in [33], is used. The AAM algorithm is a statistical, deformable model-based algorithm, typically used to fit a previously trained model to an input image. One of the advantages of the AAM and similar algorithms is their ability to handle variability in the shape and the appearance of the modeled object due to prior knowledge. In this work, the AAM algorithm successfully obtains contours of the eyes and mouth in non-frontal images of individuals of different gender, race, facial expression, and head pose. Some of these individuals wore eyeglasses. In the second stage, contours of facial features obtained in the first stage are converted to a representation suitable for classification into categories by a classification algorithm.
Due to movements of the head, the contours obtained in the first stage are at different locations in the image, have different sizes and are usually rotated at different angles. Moreover, due to imperfect detection, a smooth original contour becomes rough after detection. These factors make classification of contours using homography difficult. In order to perform robust classification of contours, a post-processing stage is needed. The post-processing should produce a contour representation that is invariant to rotation, scaling and translation. To cope with imperfect detection, such a representation should be insensitive to small, local changes of a contour. In addition, to improve the robustness of the classification, the representation should capture the major shape information only and ignore fine contour details that are irrelevant for the classification. In this work, Fourier descriptors,



Figure 2: (a) The autonomous wheelchair [left]. (b) Sample of picture taken by face camera [right].

first proposed in [39], are used. Several comparisons [41, 26, 28, 23] show that Fourier descriptors outperform many other methods of shape representation in terms of accuracy, computational efficiency and compactness of representation. Fourier descriptors are based on an algorithm that performs shape analysis in the frequency domain. The major drawback of Fourier descriptors is their inability to capture all contour details with a representation of a finite size. To overcome imperfect detection by the AAM algorithm, the detected contour is first smoothed and then Fourier descriptors are calculated. Therefore, a representation of the finest details of the contour that would not be well captured by the method is removed. Moreover, the level of detail that can be represented using this method is easily controlled. In the third stage, contours are classified into categories. A classification algorithm is an algorithm that selects a hypothesis from a set of alternatives. The algorithm may be based on different strategies. One is to base the decision on a set of previous observations. Such a set is generally referred to in the literature as a training set. In this research, the k-Nearest Neighbors classifier [15] was used.

3.4 Active Appearance Models (AAMs)

This section presents the main ideas behind AAMs, first proposed by Taylor et al. [35]. AAM is a combined model-based approach to image understanding. In particular, it learns the variability in shape and texture of an object that is expected to be in the image, and then uses the learned information to find a match in the new image. The learned object model is allowed to vary; the degree to which the model is allowed to change is controlled by a set of parameters.
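The second and third stages of Section 3.3 (Fourier descriptors followed by k-Nearest Neighbors classification) can be sketched as follows. This is a minimal illustration, assuming the common complex-coordinate variant of Fourier descriptors, which may differ from the variant of [39] used in the paper. Explicit contour smoothing is omitted because truncating to the low-frequency coefficients already discards fine detail. The ellipses stand in for mouth contours, and the labels "open" and "closed", the function names and all parameter values are hypothetical.

```python
import numpy as np


def fourier_descriptor(contour, n_coeffs=8):
    """Map an (N, 2) closed contour to 2 * n_coeffs invariant magnitudes."""
    pts = np.asarray(contour, dtype=float)
    z = pts[:, 0] + 1j * pts[:, 1]        # contour as a complex signal
    mags = np.abs(np.fft.fft(z))          # dropping phase removes rotation and
                                          # starting-point dependence
    scale = max(mags[1], mags[-1])        # dominant first harmonic -> scale invariance
    low = np.concatenate([mags[1:n_coeffs + 1], mags[-n_coeffs:]])
    return low / scale                    # index 0 (translation) is never used


def knn_classify(query, train_x, train_y, k=3):
    """Majority vote among the k training descriptors closest to `query`."""
    dists = np.linalg.norm(np.asarray(train_x) - query, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.asarray(train_y)[nearest], return_counts=True)
    return labels[np.argmax(counts)]


def ellipse(a, b, n=64, angle=0.0, shift=(0.0, 0.0), scale=1.0):
    """Synthetic closed contour: an ellipse under a similarity transform."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    x, y = a * np.cos(t), b * np.sin(t)
    c, s = np.cos(angle), np.sin(angle)
    return np.stack([scale * (c * x - s * y) + shift[0],
                     scale * (s * x + c * y) + shift[1]], axis=1)


# Hypothetical training set: flat ellipses as "closed", rounder ones as "open".
train_contours = [ellipse(2.0, b) for b in (0.4, 0.5, 0.6, 1.4, 1.5, 1.6)]
train_y = ["closed"] * 3 + ["open"] * 3
train_x = np.array([fourier_descriptor(c) for c in train_contours])

# Queries are rotated, scaled and shifted versions of unseen contours.
query_open = ellipse(2.0, 1.45, angle=0.7, shift=(5.0, -3.0), scale=2.5)
query_closed = ellipse(2.0, 0.55, angle=-1.2, shift=(-4.0, 1.0), scale=0.3)
label_open = knn_classify(fourier_descriptor(query_open), train_x, train_y)
label_closed = knn_classify(fourier_descriptor(query_closed), train_x, train_y)
```

The value of `n_coeffs` controls the level of contour detail that is kept, which corresponds to the statement that the level of detail is easily controlled. The contour must have more than `2 * n_coeffs` points for the two slices not to overlap.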
Hence, the task of finding the model match in the image becomes the task of finding a set of model parameters that maximize the match between the image and the modified model. The resulting model parameters are used for contour analysis in the next stages. The learned model contains enough information to generate images of the learned object. This property is actively used in the process of matching. The shape in an AAM is defined as a triangulated mesh that may vary linearly. In other words, any shape $s$ can be expressed as a base shape $s_0$ plus a linear combination of $m$ basis shapes $s_i$:

$$s = s_0 + \sum_{i=1}^{m} p_i s_i \tag{1}$$

The texture of an AAM is the pattern of intensities or colors across an image patch, which may also vary linearly, i.e. the appearance $A$ can be expressed as a base appearance $A_0$ plus a linear combination of $l$ basis appearance images $A_i$:

$$A(x) = A_0(x) + \sum_{i=1}^{l} \lambda_i A_i(x) \tag{2}$$

The fitting of an AAM to an input image $I$ can be expressed as minimization of the function:

$$\sum_{x \in s_0} F\big(A(x) - I(W(x; p))\big) \tag{3}$$

simultaneously with respect to the shape and appearance parameters $p$ and $\lambda$, where $A$ is of the form described in Equation 2, $F$ is an error norm function, and $W$ is a piecewise affine warp from a shape $s$ to $s_0$. The resulting set of shape parameters defines the contours of the eyes and mouth that were matched to the input image. In general, the problem of optimization of the function presented in Equation 3 is non-linear in
terms of shape and appearance parameters and can be solved using any available method of numeric optimization. Cootes et al. [10] proposed an iterative optimization algorithm and suggested multi-resolution models to improve the robustness and speed of model matching. According to this idea, in order to build a multi-resolution AAM of an object with k levels, a set of k images is built by successively scaling down the original image. For each image in this set, a separate AAM is created. This set of AAMs is a multi-resolution AAM with k levels. The matching of the multi-resolution AAM with k levels to an image is performed as follows: first, the image is scaled down k times, and the smallest model in the multi-resolution AAM is matched to this scaled-down image. The result of the matching is scaled up and matched to the next model in the AAM. This procedure is performed k times until the largest model in the multi-resolution AAM is matched to the image of the original size. This approach is faster and more robust than matching the AAM to the input image directly.

The main purpose of building an AAM is to learn the possible variations of object shape and appearance. However, it is impractical to take into account all of the possible variations of shape and appearance of an object. Therefore, all observed variations of shape and appearance in training images are processed statistically in order to learn the statistics of variations that explain some percentage of all observed variation. The best way to achieve this is to collect a set of images of the object and manually mark the boundary of the object in each image. Marked contours are first aligned using Procrustes analysis [17], and then processed using PCA [19] to obtain the base shape $s_0$ and the set of m shapes that can explain a certain percentage of shape variation.
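The step of keeping only as many PCA modes as are needed to explain a chosen percentage of variation can be illustrated with a small sketch. It assumes the eigenvalues of the shape covariance matrix have already been computed. The function name `modes_for_variance` and the example eigenvalues are hypothetical and are not taken from the implementation used in this work.

```python
def modes_for_variance(eigenvalues, fraction=0.95):
    """Return the smallest number of leading PCA modes whose eigenvalues
    sum to at least `fraction` of the total variance."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    if total <= 0.0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return count
    return len(ordered)


# Hypothetical eigenvalues, for illustration only.
example_eigenvalues = [8.0, 1.0, 0.6, 0.4]
print(modes_for_variance(example_eigenvalues, fraction=0.95))
```

The returned count plays the role of m for the shape model; the same selection can be applied to the appearance eigenvalues to obtain l.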
Similarly, to obtain the information about appearance variation, training images are first normalized by warping the training shape to the base shape $s_0$, and then PCA is performed in order to obtain l images that can explain a certain percentage of variation in the appearance. For a more detailed description of AAMs, the reader is referred to [10, 11, 35]. In this work, the modified version of AAM proposed by Stegmann [33] is used. The modifications of the original AAMs that were used in the current work are summarized in the following subsections.

3.4.1 Increased Texture Specificity

As described above, the accuracy of AAM matching is greatly affected by the texture of the object. If the texture of the object is uniform, the AAM tends to produce contours that lie inside the real object. This happens because the original AAM
algorithm is trained on the appearance inside the training shapes; it has no way to discover the boundaries of an object with a uniform texture. To overcome this drawback, Stegmann [33] suggested the inclusion of a small region outside the object. Assuming that there is a difference between the texture of the object and the background, this makes it possible for the algorithm to accurately detect the boundaries of the real object in the image. Because the object may be placed on different backgrounds, a large outside region included in the model may badly affect the performance of the algorithm. In this work, a strip that is one pixel wide around the original boundary of the object, as suggested in [33], is used.

3.4.2 Robust Similarity Measure

According to Equation 3, the performance of the AAM optimization is greatly affected by the measure, or more formally the error norm, by which texture similarity is evaluated, denoted as $F$ in the equation. The quadratic error norm, also known as the least squares norm or $L_2$ norm, is one of the most popular among the many possible choices of error norm. It is defined as:

$$F(e) = e^2 \tag{4}$$

where e is the difference between the image and the reconstructed model. Due to the fast growth of the function $e^2$, the quadratic error norm is very sensitive to outliers, and thus can affect the performance of the algorithm. Stegmann [33] suggested the usage of the Lorentzian estimator, which was first proposed by Black and Rangarajan [8], and is defined as:

$$F(e) = \log\left(1 + \frac{e^2}{2\sigma^2}\right) \tag{5}$$

where e is the difference between the textures of the image and the reconstructed AAM model, and $\sigma$ is a parameter that defines the values considered as outliers. The Lorentzian estimator grows much more slowly than a quadratic function, and thus is less sensitive to outliers; hence it is used in this research. According to Stegmann [33], the value of $\sigma$ is taken equal to the standard deviation of appearance variation.

3.4.3 Initialization

The performance of the AAM algorithm depends highly on the initial placement, scaling and rotation of the model in the image. If the model is placed too far from the true position of the object, it may not find the object, or it may mistakenly match the background as an object. Thus, finding a good initial placement of the model in the image is a critical part of the algorithm. Generally, initial placement
or initialization depends on the application, and may require different techniques for different applications to achieve good results. Stegmann [33] proposed a technique to find the initial placement of a model that does not depend on the application. The idea is to test every possible placement of the model and build a set of the most probable candidates for the true initial placement. Then, the algorithm tries to match the model to the image at every initial placement from the candidate set using a small number of optimization iterations. The placement that produces the best match is selected as the true initial placement. After the initialization, the model at the true initial placement is optimized using a large number of optimization iterations. This technique produces good results at the expense of a high computational cost. In this research, a grid with a constant step is placed over the input image. At each grid location, the model is matched with the image at different scales. To improve the speed of the initialization, only a small number of optimization iterations is performed at this stage. The pairs of location and scale where the best matches are achieved are selected as the candidate set. In the next stage, a normal model match is performed at each location and scale from the candidate set, and the best match is selected as the final output of the algorithm. This technique is independent of the application and produces good results in this research. However, the high computational cost makes it inapplicable in applications requiring real-time response. In this research, the fitting of a single model may take more than a second in the worst cases, which is unacceptable for real-time monitoring of the user.
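The two-stage grid initialization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work. The callback `fit_error(x, y, scale, iterations)` is a hypothetical stand-in for an AAM fit that returns a residual error where smaller is better. The iteration counts, the number of candidates, the placement of grid nodes at cell centres, and the toy error function are all assumptions.

```python
from itertools import product


def grid_initialization(fit_error, width, height, steps=20, scales=(1.0,),
                        coarse_iters=3, fine_iters=30, n_candidates=5):
    """Cheap fit at every grid node and scale, then a full fit from the
    best few candidates. Returns (error, x, y, scale) of the best fit."""
    # Grid nodes at the centres of a steps-by-steps partition of the image.
    xs = [(i + 0.5) * width / steps for i in range(steps)]
    ys = [(j + 0.5) * height / steps for j in range(steps)]

    # Stage 1: a small number of iterations at every location and scale.
    coarse = sorted(
        (fit_error(x, y, s, coarse_iters), x, y, s)
        for x, y, s in product(xs, ys, scales)
    )

    # Stage 2: a normal (longer) fit only at the most promising candidates.
    refined = [
        (fit_error(x, y, s, fine_iters), x, y, s)
        for _, x, y, s in coarse[:n_candidates]
    ]
    return min(refined)


def toy_fit_error(x, y, scale, iterations, target=(500.0, 300.0, 1.0)):
    """Hypothetical error surface: distance to a known target placement,
    reduced as more iterations are spent."""
    tx, ty, ts = target
    return ((x - tx) ** 2 + (y - ty) ** 2 + 1000.0 * (scale - ts) ** 2) / iterations


error, x, y, scale = grid_initialization(
    toy_fit_error, 1024, 768, scales=(0.8, 1.0, 1.2))
print(f"best placement: x={x:.1f}, y={y:.1f}, scale={scale}, error={error:.3f}")
```

The cost of the first stage grows with the number of grid nodes times the number of scales, which is why only a few iterations are spent per node.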
3.4.4 Fine Tuning The Model Fit

The usage of prior knowledge when matching the model to the image does not always lead to an optimal result, because the variations of the shape and the texture in the image may not be strictly the same as those observed during training [33]. However, it is reasonable to assume that the result produced during the matching of the model to the image is close to the optimum [33]. Therefore, to improve the matching of the model, Stegmann [33] suggested applying a general-purpose optimization to the result produced by the regular AAM matching algorithm. However, it is unreasonable to assume that there are no local minima around the optimum, and the optimization algorithm may become stuck at a local minimum instead of the optimum. To avoid local minima near the optimum, Stegmann [33] suggested the usage of simulated annealing, a random-sampling optimization method first proposed by Kirkpatrick et al. [24] that is more likely to avoid local minima; hence it is used in this research. Due to space considerations, the detailed
description of the application of the algorithm in this work has been omitted; the reader is referred to [14] for more details.

3.5 Fourier Descriptors

The contours produced by the AAM algorithm at the previous stage are not suitable for classification, because it is difficult to define a robust and reliable similarity measure between two contours, especially when neither the centers, nor the sizes, nor the orientations of these contours coincide. Therefore, there is a need to obtain some sort of shape descriptor for these contours. Shape descriptors represent the shape in a way that allows robust classification, which means that the shape representation is invariant under translation, scaling and rotation, and is insensitive to noise due to imperfect model matching. There are many shape descriptors available. In this work, Fourier descriptors, first proposed by Zahn and Roskies [39], are used. Fourier descriptors provide a compact shape representation and outperform many other descriptors in terms of accuracy and efficiency [23, 26, 28, 41]. Moreover, Fourier descriptors are not computationally expensive and can be computed in real time. The Fourier descriptors algorithm performs well because it processes contours in the frequency domain, where it is much easier to obtain invariance to rotation, scaling, and translation than in the spatial domain. This fact and the simplicity and low computational cost of the algorithm are the main reasons for selecting it for this research. The Fourier descriptor of a contour is a description of the contour in the frequency domain that is obtained by applying the discrete Fourier transform to a shape signature and normalizing the resulting coefficients. The shape signature is a one-dimensional function representing the two-dimensional coordinates of the contour points. The choice of the shape signature has a great impact on the performance of Fourier descriptors.
Zhang and Lu [40] recommended the use of a centroid distance shape signature, which can be expressed as the Euclidean distance of the contour points from the contour centroid. This shape signature is translation invariant due to the subtraction of the shape centroid; therefore, Fourier descriptors produced using this shape signature are translation invariant. The landmarks of the contours produced by the first stage are not placed equidistantly, due to deformation of the model shape during the matching of the model to the image. In order to obtain a better description of the contour, the contour should be normalized. The main purpose of normalization is to ensure that all parts of the contour are taken into consideration, and to improve the efficiency and noise insensitivity of Fourier descriptors by smoothing the shape. Zhang and Lu [40] compared
several methods of contour normalization and suggested that equal arc length sampling produces the best result among them. According to this method, landmarks should be placed equidistantly on the contour; in other words, the contour is divided into arcs of equal length, and the end points of these arcs form the normalized contour. Then, the shape signature function is applied to the normalized contour, and the discrete Fourier transform is calculated on the result. Note that rotation of the boundary causes the shape signature used in this research to shift. According to the time shift property of the Fourier transform, this causes a phase shift of the Fourier coefficients. Thus, taking only the magnitudes of the Fourier coefficients and ignoring the phase provides invariance to rotation. In addition, the outputs of the shape signature are real numbers, and the Fourier coefficients of a real-valued function are conjugate symmetric. Because only the magnitudes of the Fourier coefficients are taken into consideration, only half of the coefficients have distinct values. The first Fourier coefficient represents only the scale of the contour, so the remaining coefficients can be normalized by dividing them by the first coefficient in order to achieve invariance to scaling. The fact that only the first few Fourier coefficients are taken into consideration allows Fourier descriptors to capture the most important shape information and ignore fine shape details and boundary noise. As a result, a compact shape representation is produced, which is invariant under translation, rotation and scaling, and insensitive to noise. Such a representation is appropriate for classification by various classification algorithms.
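The pipeline described in this subsection (equal arc length sampling, centroid distance signature, discrete Fourier transform, magnitude, division by the first coefficient) can be sketched in a few lines. This is an illustrative sketch only, using the Python standard library. The function names and the defaults `n_samples=64` and `n_coeffs=8` are assumptions and are not taken from the implementation used in this work. The centroid is approximated by the mean of the resampled points.

```python
import cmath
import math


def resample_equal_arc_length(points, n):
    """Resample a closed polygonal contour to n points spaced equally
    along its perimeter, starting at the first vertex."""
    closed = list(points) + [points[0]]
    seg = [math.dist(a, b) for a, b in zip(closed, closed[1:])]
    perimeter = sum(seg)
    out, i, walked = [], 0, 0.0
    for k in range(n):
        target = k * perimeter / n  # arc length of the k-th sample
        while i < len(seg) - 1 and walked + seg[i] < target:
            walked += seg[i]
            i += 1
        t = (target - walked) / seg[i] if seg[i] > 0 else 0.0
        (x0, y0), (x1, y1) = closed[i], closed[i + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


def centroid_distance_signature(points):
    """Distance of every contour point from the contour centroid."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [math.hypot(x - cx, y - cy) for x, y in points]


def fourier_descriptor(points, n_samples=64, n_coeffs=8):
    """Magnitudes of DFT coefficients 1..n_coeffs of the centroid distance
    signature, divided by the magnitude of coefficient 0."""
    contour = resample_equal_arc_length(points, n_samples)
    r = centroid_distance_signature(contour)
    n = len(r)
    mags = [
        abs(sum(r[t] * cmath.exp(-2j * math.pi * u * t / n) for t in range(n)))
        for u in range(n_coeffs + 1)
    ]
    return [m / mags[0] for m in mags[1:]]


# Hypothetical contour, for illustration only.
contour = [(0, 0), (4, 0), (5, 2), (2, 4), (-1, 2)]
print([round(v, 4) for v in fourier_descriptor(contour)])
```

Keeping only a few low-order coefficients is what discards fine boundary detail; a larger `n_coeffs` retains more of it, up to half the number of samples.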
3.6 K-Nearest Neighbors classification

The third stage classifies the facial features obtained in the previous stage into categories; in other words, it determines which facial gesture is represented by the detected boundaries of the eyes and mouth. This stage is essential because boundaries are numerical data, whereas the system is required to produce categorical output, namely the facial gestures corresponding to the boundaries. The task of classifying items into categories attracts much research, and numerous classification algorithms have been proposed. For this research, a group of algorithms that learn categories from training data and predict the category for an input image is suitable. In the literature, these algorithms are called supervised learning algorithms. Generally, no algorithm performs equally well in all applications, and it is impossible to analytically predict which algorithm will have the best performance in a given application. In
the case of Fourier descriptors, Zhang and Lu [40] recommended classification according to the nearest neighbor; in other words, the Fourier descriptor of the input shape is classified according to the nearest Fourier descriptor of the training set, in terms of Euclidean distance. In this research, the generalization of this method, known as k-Nearest Neighbors and first proposed by Fix and Hodges [15], is used. The general idea of the method is to classify the input sample by a majority vote of its k nearest neighbors from the training set, according to some distance metric. Specifically, the distances from an input sample to all stored training samples are calculated and the k closest samples are selected. The input sample is classified by majority vote of the k selected training samples. A major drawback of such an approach is that classes with more training samples tend to dominate the classification of an input sample. The distance between two samples can be defined in many ways. In this research, Euclidean distance is used as the distance measure. Training a k-Nearest Neighbors classifier is simply caching the training samples in internal data structures. Such an approach is also called lazy learning in the literature [3]. To optimize the search for nearest neighbors, more sophisticated data structures, e.g. Kd-trees [5], might be used. Classification is simply finding the k nearest cached training samples and deciding the category of the input sample. The value of k has a significant impact on the performance of the classification. Low values of k may produce a better result, but are very vulnerable to noise. Large values of k are less susceptible to noise, but in some cases the performance may degrade. The result of the classification produced by this stage is the final result of the static facial gesture recognition system.
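The majority-vote rule described above can be written as a brute-force sketch. It is illustrative only and is not the OpenCV implementation used in the experiments. The function name `knn_classify`, the toy training set, and the tie-breaking rule (ties go to the label encountered first among the nearest neighbors) are assumptions.

```python
import math
from collections import Counter


def knn_classify(sample, training, k=1):
    """Classify `sample` by majority vote of its k nearest training vectors.

    `training` is a list of (vector, label) pairs. Returns the winning label
    and the Euclidean distance to the single nearest training vector.
    """
    if not training:
        raise ValueError("training set is empty")
    # "Training" is just storing the samples; all work happens at query time.
    ranked = sorted((math.dist(sample, vec), label) for vec, label in training)
    nearest = ranked[:k]
    votes = Counter(label for _, label in nearest)
    label = votes.most_common(1)[0][0]
    return label, nearest[0][0]


# Toy training set (hypothetical): class "b" has more samples than class "a".
training_set = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
]
for k in (1, 3, 5):
    print(k, knn_classify((1.0, 1.0), training_set, k=k))
```

With k equal to the size of this toy set, the larger class wins regardless of where the query lies, which illustrates the drawback mentioned above. The returned nearest distance is the quantity that a later thresholding step can use as a confidence measure.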
3.7 Selection Of Optimal Configuration

The purpose of selecting the optimal configuration is to find the values of the various algorithm parameters that ensure the best recognition rate with the lowest false positive recognition rate. Because several parameters affect the recognition rate and the false positive recognition rate (e.g. initialization step of the AAM algorithm, choice of classifier, number of samples used to train the classifier, number of neighbors for the k-Nearest Neighbors classifier), testing all possible combinations of parameters is impractical. To simplify the process of finding the optimal configuration for the algorithm, the optimal initialization step of the AAM algorithm, together with the optimal number of training images and neighbors for the k-Nearest Neighbors classifier, is obtained first. The obtained configuration is used to compare the performance of several classifiers and check the influence of adding shape elongation of eyes and
mouth on the performance of the whole algorithm. In addition, this configuration is used to tune the spurious images classifier to improve the false positive recognition rate of the algorithm. This approach works under the assumption that the configuration that provides the best results without the classifier of the spurious images will still produce the best results when the classifier is engaged. Neither the AAM nor the k-Nearest Neighbors algorithm has the ability to reject spurious samples automatically. However, the algorithm proposed in this work should be able to reject facial gestures that are not considered as having special meaning and were therefore not trained. To reject such samples, the confidence measures (the similarity measure for the AAM algorithm; the shortest distance to a training sample for the k-Nearest Neighbors algorithm) should be evaluated to determine whether the sample is likely to contain a valid gesture. The performance of such classification has a great impact on the performance of the whole algorithm. It is clear that any classifier will inevitably reject some valid images and classify some of the spurious images as valid. The classifier used in this work consists of two parts: the first part classifies the matches obtained by the AAM algorithm; the second part classifies the results obtained by the k-Nearest Neighbors classifier. These parts are independent of each other and are trained separately. In this work, the problem of classifying spurious images is solved by analyzing the distribution of the values of the confidence measures of valid images and classifying the images using simple thresholding. First, the part of the classifier that deals with the results of the AAM algorithm is tuned. The results produced by the first part of the classifier are used to tune the second part of the classifier. While such an approach does not always provide the best results, it is extremely simple and computationally efficient.
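A two-part threshold test of this kind might look like the sketch below. The text does not state how the thresholds are derived from the distribution of confidence values, so the rule used here (mean plus a multiple of the standard deviation of the values observed on valid images) is purely an assumption. It is likewise an assumption that both confidence measures are error-like, so that smaller values indicate a better match. All names and numbers are hypothetical.

```python
import statistics


def threshold_from_valid(values, n_sigmas=2.0):
    """Assumed rule: accept values up to mean + n_sigmas * std of the
    confidence values measured on valid training images."""
    return statistics.mean(values) + n_sigmas * statistics.pstdev(values)


def is_valid_gesture(aam_error, nn_distance, aam_threshold, nn_threshold):
    """First part: reject poor AAM matches. Second part: reject shapes that
    are too far from every training sample."""
    if aam_error > aam_threshold:
        return False
    return nn_distance <= nn_threshold


# Hypothetical confidence values collected on valid images.
aam_errors_valid = [0.8, 1.0, 1.1, 0.9, 1.2]
nn_distances_valid = [0.05, 0.07, 0.04, 0.06, 0.08]

aam_t = threshold_from_valid(aam_errors_valid)
nn_t = threshold_from_valid(nn_distances_valid)
print(is_valid_gesture(1.0, 0.06, aam_t, nn_t))
print(is_valid_gesture(3.0, 0.06, aam_t, nn_t))
```

Tuning the second threshold only on samples that passed the first part mirrors the order of tuning described above.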
Some ideas to improve the classifier are described in Section 5. For details on the tuning of the spurious image classifier, the reader is referred to Section 4. Section 4 describes the process of selecting the optimal values of the parameters that influence the performance of the algorithm. Due to the great number of such parameters and the range of their values, testing all possible combinations of parameter values goes beyond the scope of this research. In this research, the initialization step for the AAM algorithm, the number of images for training the shape classifier, the type of the shape classifier, and the usage of shape elongation have been tested. It was found that an initialization step of 20×20, the usage of shape elongations along with Fourier descriptors, the k-Nearest Neighbors classifier as the shape classifier with k equal to 1, and 2748 shapes to train the shape classifier provide the best classification results. For details on obtaining the values of these parameters, the reader is referred to Section 4.

4 EXPERIMENTAL RESULTS

4.1 Experimental Design

In order to test the proposed approach, the software implementation of the system was tested on a set of images that depicted human volunteers producing facial gestures. The goal of the experiment was to test the ability of the system to recognize facial gestures, irrespective of the volunteer, and to measure the overall performance of the system. Due to the great variety of facial gestures that can be produced by humans using their eyes and mouth, testing all possible facial gestures is not feasible. Instead, the system was tested on a set of ten facial gestures that were produced by volunteers. The participation of volunteers in this research is essential due to the specificity of the system. The system is designed for wheelchair users, and to test such a system, images of people sitting in a wheelchair are required. Moreover, the current mechanical design of the wheelchair does not allow frontal images of a person sitting in the wheelchair, so the images should be acquired from the same angle as in a real wheelchair. Unfortunately, there is no publicly available image database that contains such images. All volunteers involved in this research have normal face muscle control. This fact limits the validity of the results of the experiment to people with normal control of facial muscles. The experiment was conducted in a laboratory with a combination of overhead fluorescent lighting and natural lighting from the windows of the laboratory. The lighting was not controlled during the experiment and remained more or less constant. To make the experiment closer to the real application, volunteers sat in the autonomous wheelchair, and their images were taken by the camera mounted on the wheelchair handrail as described in Section 3. The mechanical design of the wheelchair does not allow fixing the location of the camera relative to the face of a person sitting in the wheelchair.
In addition, volunteers were allowed to move during the experiment in order to provide a greater variety of facial gesture views. Each of the ten volunteers produced ten facial gestures. Five volunteers wore glasses during the experiment; two were female and eight were male; two were of Asian origin and the others of Caucasian origin. This allows testing the robustness of the proposed approach to the variability of facial gestures among volunteers of different gender and origin. To make the testing process easier for volunteers, they were presented with samples of facial gestures and asked to reproduce each gesture as closely as possible to the sample. The samples of facial gestures are
presented in Figure 3. The task of selecting proper facial gestures for the facial gesture recognition algorithm of a monitoring system is very complex, because many samples of facial expressions of disabled people expressing genuine emotions need to be collected. Such work is beyond the scope of this research. The purpose of the experiments described in this section is to show that the algorithm is capable of classifying facial expressions by testing it on a set of various facial gestures. In addition, five volunteers produced various gestures to measure the false positive rate of the algorithm. The volunteers were urged to produce as many gestures as possible. However, to avoid testing the algorithm only on artificial and highly improbable gestures, some of the volunteers were encouraged to talk. The algorithm is very likely to deal with facial expressions produced during talking, so it is critical to ensure that the algorithm is robust enough to reject such facial expressions. Such an approach ensured that the algorithm was tested on a great variety of facial gestures. Each gesture was captured as a color image at a resolution of 1024×768 pixels. For each volunteer and each facial gesture, 100 images were captured. However, not every image in the resulting set is acceptable for further processing. Blinking, for example, confuses the system because closed eyes are part of a separate gesture. In addition, due to the limited field of view of the camera, accidental movements may cause the eyes or mouth to be occluded. Such images cannot be processed by the system, because the system requires both eyes and the entire mouth to be clearly visible in order to recognize the facial gesture. These limitations are not an inherent drawback of the system. Blinking, for instance, can be overcome by careful selection of facial gestures. Out of the resulting set of 10000 images, 9140 images were manually selected for training and testing of the algorithm.
Similarly, to test the algorithm for false positive rate, each of five volunteers produced 100 facial gestures. Out of a resulting set of 500 images, 440 images were selected manually for testing of the algorithm. The images that were used in this work are available at http://www.cse.yorku.ca/LAAV/datasets/index.html

Figure 3: Facial gestures recognized by the system.

4.2 Training Of The System

The task of training the system consists of two parts. First, the system is trained to detect the contours of the eyes and mouth of a person sitting in the wheelchair. Then, the system is trained to classify the contours of the eyes and mouth into facial gestures. Generally, training of both parts can be performed independently, using manually marked images. However, in order to speed up the training and achieve better results, the training of the second part is performed using the results obtained by the first part. In other words, the first stage is trained using manually marked images; the second stage is trained using contours that are produced by processing the input set of images with the first part. This approach produces better final results because the second stage is trained on real examples of contours. Training on real examples that may be encountered as input generally produces better results than training on manually or synthetically produced examples, because it is impossible to accurately predict the variability of input samples and reproduce it in training samples. In addition, such an approach facilitates and accelerates the process of training the system, especially when the system is retrained for a new person. In this work, the best results are obtained using 100 images to train the first part of the system and 2748 contours to train the second part of the system.

4.3 Training of AAM

The performance of AAMs has a crucial influence on the performance of the whole system. Therefore, the training of AAMs becomes crucial for the performance of the system. AAMs learn the variability of training images to build a model of the eyes and mouth, and then try to fit the model to an
input image. To provide greater reliability of the results of these experiments, several volunteers participated in the research. However, a model built from training samples of all participants leads to poor detection and overall results. This phenomenon is due to the great variability among images of all volunteers that can not be described accurately by a single model. To improve the performance of the algorithm, several models are trained. Models are trained independently, and each model is trained on its own set of training samples. The fitting to the input image is also performed independently for each model, and the result of the algorithm is a model that produces the best fit to the input image. Generally, the algorithm that uses more trained models tends to produce better results due to more accurate modeling of possible image variability. However, due to the high computational cost of fitting an AAM to the input image, such an approach is impractical in terms of processing time. Selecting the optimal number of models is not an easy task. There are techniques that allow selecting the number of models automatically. In this work, a simple approach has been taken: each model represents all facial gestures, produced by a single volunteer. While this approach is probably not optimal in terms of accuracy of modeling, the variability and number of models, it has clear advantage in terms of simplicity and ease of use. This technique does not require a great number of images in a training set: one image for each facial gesture and volunteer is enough to produce acceptable results. To build the training set from each set of 100 images representing a volunteer producing a facial gesture, one image is selected randomly. As a result, the training set for AAM consists of only 100 images. To train an AAM model, the eyes and mouth are manually marked on these images. 
The marking is performed using custom software, which allows the user to draw and store the contours of the eyes and mouth over the training image. These contours are then normalized to have 64 landmarks that are placed equidistantly on the drawn contour. The images and contours of every volunteer are grouped together, and a separate AAM model is trained for each volunteer. Such an approach has a clear advantage when the wheelchair has only a single user. In fact, this represents the target application. Each AAM is built as a five-level multi-resolution model. The percentage of shape and texture variation that can be explained using the model is selected to be 95%. In addition to building the AAM, the location of the volunteer's face in each image is noted. These locations are used to optimize the fitting of an AAM to an input image by limiting the search for the best fit to a small region where the face is likely to be located. The performance of the AAM fitting depends on the initial placement of the model. In this
research, a grid is placed over the input image and the model is fitted at each grid location. The location where the best fit is obtained is considered the true location of the model in the image. Therefore, the size of the grid has a great impact on the performance of the fitting of the model. A fine grid obtains excellent fitting results, but has a prohibitively high computational cost, whereas a coarse grid has a low computational cost, but leads to poor fitting results. In this research, the optimal size of the grid was empirically determined to be 20×20. In other words, the initialization grid placed on the input image has 20 locations in width and 20 locations in height. Therefore, the AAM algorithm tests 400 locations during the initialization phase of the fitting. The size of the grid was chosen after a series of experiments to select the optimal value. As mentioned in Section 3.7, the AAM algorithm cannot reject spurious images. To reject the spurious images, statistics on the similarity measures of valid images and spurious images are collected. The spurious images are detected using simple thresholding.

4.4 Training Of Shape Classifier

The shape classifier is the final stage of the whole algorithm, so its performance influences the performance of the entire system. The task of the shape classifier is to classify the shapes of the eyes and mouth, represented as a vector, into categories representing facial gestures. To accomplish this task, this research uses supervised learning. According to this technique, in the training stage the classifier is presented with labeled samples of the input shapes. The classifier learns the training samples and tries to predict the category of input samples using the learned information. In this research, the k-Nearest Neighbors classifier is used for shape classification.
This classifier classifies input samples according to the closest k samples from the training set. Naturally, a large training set tends to produce better classification results at the cost of higher memory consumption and slower classification, and it may be impractical to collect a large number of training samples for the classifier. However, a small training set may produce poor classification results. The number of neighbors k, according to which the shape is classified, also affects the performance of the classification. Large values of k are less susceptible to noise, but may miss some input samples. Small values of k usually produce better classification, but are more vulnerable to noise. To train the classifier, the input images are first processed by the AAM algorithm to obtain the contours of the eyes and mouth. Then, Fourier descriptors of each contour are obtained and combined into a single vector, representing a facial

UbiCC Journal, Volume 5, March 2010


gesture. As a result, a set of 9140 vectors, representing the facial gestures of the volunteers, is built. Out of these vectors, some are randomly selected to train the classifier. The remaining vectors are used to test the performance of the classifier. The k-Nearest Neighbors classifier cannot reject shapes obtained from spurious images. To reject the spurious shapes, statistics on the closest distance of the input sample to the training set are collected for valid images and spurious images. The spurious shapes are detected using simple thresholding.

4.5 Results
The testing was performed on a computer with 512 megabytes of RAM and a 1.5 GHz Pentium 4 processor running Windows XP. To detect the contours of the eyes and mouth, a slightly modified C++ implementation of AAMs, proposed in [33], was used. To classify the shapes, the k-Nearest Neighbors classifier implementation of the OpenCV Library [31] was used. The input images were first processed by the AAM algorithm to obtain the contours of the eyes and mouth. Samples of detected contours in input images are presented in Figure 4. Then, Fourier descriptors of each contour were obtained and combined into a single vector, representing a facial gesture. In the last stage, the vectors were classified by the shape classifier. The performance of the algorithm was measured according to the results produced by the shape classifier. In the conducted experiments, the algorithm successfully recognized 5703 out of 6300 valid images, which is a 90% success rate. The algorithm recognized 27 out of 440 spurious images, which is a 6% false positive rate. The shape classifier rejected 266 valid images and the AAM algorithm rejected 129 valid images. Therefore, in total the algorithm rejected 395 valid images, which is a 4% false negative rate. Detailed results, showing the performance of the algorithm on each particular facial gesture, are shown in Table 1. Facial gestures are denoted by the letters a, b, c, ..., j.
The axes of the table represent the actual facial gesture (vertical) versus the classification result. Each cell (i,j) in the table holds the number of cases that were actually i, but classified as j. The diagonal represents the count of correctly classified facial gestures. Table 2 summarizes the performance of the algorithm on a set of spurious images. The details about rejected images are presented in Table 3.

4.6 Summary Of Implementation
The monitoring of facial gestures in the context of this research is complicated by the fact that, due to the peculiarity of the mechanical design of the autonomous wheelchair, it is impossible to obtain

Table 1: Facial gesture classification results (rows: actual gesture; columns: classification result).
         a    b    c    d    e    f    g    h    i    j
    a  659    0    3    6    0    0    0    0    8    2
    b    0  509    1    0    0    2    1    0    0    1
    c    8   68  601    2    2    6    6    5    6    0
    d    2    0    0  432    7    0    1    1    1   13
    e    1    0    1    0  425    0    3    1    0    4
    f    0   16    2    0    2  628    0   10    9    0
    g    1    1    4    3    2    2  635    0    5    2
    h    0    4    8    0    1    3    2  642    1    1
    i    8    1    2    1    0    2    1    0  528    2
    j    2    2    3   11    4    1    3    0   47  644

Table 2: Spurious images classification results.
    Gesture    a    b    c    d    e    f    g    h    i    j
    Images     0    0    2    3    4   12    2    4    0    0

Table 3: Images rejected by the algorithm.
    Gesture    a    b    c    d    e    f    g    h    i    j
    Images     9   27   30   95  109   41   20   15   30   19
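The counts in Table 1 can be summarized with a few lines of Python. The sketch below is only an illustration: it assumes that each lettered group in Table 1 is one row of the matrix (the actual gesture) and that the ten values in a group are the classification results a to j in order. The names `CONFUSION` and `summarize` are hypothetical. Under that reading, the diagonal sums to the 5703 correctly recognized images reported in the text.

```python
GESTURES = "abcdefghij"

# Table 1, read row by row. ASSUMPTION: each lettered group is one row, so
# CONFUSION[i][j] = number of images of actual gesture i classified as gesture j.
CONFUSION = [
    [659, 0, 3, 6, 0, 0, 0, 0, 8, 2],
    [0, 509, 1, 0, 0, 2, 1, 0, 0, 1],
    [8, 68, 601, 2, 2, 6, 6, 5, 6, 0],
    [2, 0, 0, 432, 7, 0, 1, 1, 1, 13],
    [1, 0, 1, 0, 425, 0, 3, 1, 0, 4],
    [0, 16, 2, 0, 2, 628, 0, 10, 9, 0],
    [1, 1, 4, 3, 2, 2, 635, 0, 5, 2],
    [0, 4, 8, 0, 1, 3, 2, 642, 1, 1],
    [8, 1, 2, 1, 0, 2, 1, 0, 528, 2],
    [2, 2, 3, 11, 4, 1, 3, 0, 47, 644],
]


def summarize(matrix):
    """Return (correct count, total count, per-gesture fraction classified correctly)."""
    size = len(matrix)
    correct = sum(matrix[i][i] for i in range(size))      # diagonal entries
    classified = sum(sum(row) for row in matrix)          # every entry in the table
    per_gesture = {
        GESTURES[i]: matrix[i][i] / sum(matrix[i]) for i in range(size)
    }
    return correct, classified, per_gesture


correct, classified, per_gesture = summarize(CONFUSION)
print("correctly classified:", correct)
print("entries in the table:", classified)
for gesture, rate in per_gesture.items():
    print(gesture, f"{rate:.3f}")
```

The per-gesture fractions are computed only over the images that appear in Table 1, so they do not account for the rejected images listed in Table 3.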

Figure 4: Sample images produced by the AAM algorithm (cropped and enlarged).

frontal images of the face of a person sitting in the wheelchair. Using a set of ten facial gestures as a test bed application, it is demonstrated that the proposed approach is capable of robust and reliable monitoring of the facial gestures of a person sitting in a wheelchair. The approach presented in this work can be summarized as follows. First, the input image, which is taken by a camera installed on the wheelchair, is processed by the AAM algorithm in



order to obtain the contours of the eyes and mouth of a person sitting in the wheelchair. Then, Fourier descriptors of the detected contours are calculated to obtain a compact representation of the shapes of the eyes and mouth. Finally, the obtained Fourier descriptors are classified into facial gestures using the k-Nearest Neighbors classifier. In the experiments conducted in this work, the system implementing this approach correctly recognized 90% of the facial gestures produced by ten volunteers. The implementation demonstrated a low false positive rate of 6% and a low false negative rate of 4%. The approach proved to be robust to natural variations of facial gestures produced by several volunteers, as well as to variations due to an inconstant camera point of view and perspective. The results suggest the applicability of this approach to recognizing facial gestures in autonomous wheelchair applications.

4.7 Discussion
The experiment was conducted on data consisting of images of ten facial gestures, produced by ten volunteers. The images were typical indoor images of a human sitting in a wheelchair. The volunteers were of different origin and gender; some of them wore glasses. The location of the volunteer’s face relative to the camera could not be fixed due to the mechanical design of the wheelchair. Moreover, the volunteers were allowed to move during the experiment. The experiment was conducted according to the following procedure. First, the pictures of the volunteers were taken and stored. Next, a number of images were selected to train the first stage of the algorithm, which detects the contours of the eyes and mouth. After training, all images were run through the first stages of the algorithm to obtain the compact representations of the facial gestures detected in the images. Some of these representations were used to train the last stage of the algorithm. The rest were used to test the last stage of the algorithm.
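The compact representation mentioned above (Fourier descriptors of the eye and mouth contours, concatenated into one vector) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the normalization (dropping the DC term, dividing by the largest harmonic, keeping magnitudes only), the number of retained coefficients `keep`, and the names `resample_closed_contour`, `fourier_descriptors` and `gesture_vector` are all choices made for the example. The sketch assumes non-degenerate contours (not all points identical).

```python
import cmath
import math


def resample_closed_contour(points, n=64):
    """Place n landmarks equidistantly (by arc length) on a closed polygon."""
    closed = list(points) + [points[0]]
    seg = [math.dist(closed[i], closed[i + 1]) for i in range(len(points))]
    total = sum(seg)
    out, i, walked = [], 0, 0.0
    for j in range(n):
        target = total * j / n
        while walked + seg[i] < target:      # advance to the segment holding target
            walked += seg[i]
            i += 1
        t = (target - walked) / seg[i] if seg[i] else 0.0
        (x0, y0), (x1, y1) = closed[i], closed[i + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


def fourier_descriptors(points, n=64, keep=10):
    """Magnitudes of the lowest harmonics of the contour seen as a complex signal."""
    z = [complex(x, y) for x, y in resample_closed_contour(points, n)]
    coeffs = [
        sum(z[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n)) / n
        for k in range(n)
    ]
    mags = [abs(c) for c in coeffs[1:]]      # drop the DC term: translation invariance
    scale = max(mags)                        # largest harmonic: scale invariance
    low = mags[:keep] + mags[-keep:]         # lowest positive and negative frequencies
    return [m / scale for m in low]


def gesture_vector(contours, n=64, keep=10):
    """Concatenate the descriptors of several contours (e.g. two eyes and a mouth)."""
    vec = []
    for contour in contours:
        vec.extend(fourier_descriptors(contour, n, keep))
    return vec


# A unit square and a scaled, shifted copy yield the same descriptors.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
moved = [(5 + 3 * x, -2 + 3 * y) for x, y in square]
print(fourier_descriptors(square)[:3])
print(fourier_descriptors(moved)[:3])
```

Using magnitudes only also makes the descriptors insensitive to rotation and to the choice of starting point on the contour, at the price of discarding phase information.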
The results of this test are presented in this chapter. In addition, multiple facial gestures, produced by five volunteers, were collected to test the ability of the algorithm to reject spurious images. Naturally, misclassification of a facial gesture by the system can occur due to the failure to accurately detect the contours of the eyes and mouth in the input image or misclassification of the detected contours to facial gestures. The reasons for the failure to detect the contours of the eyes and mouth include a large variation in the appearance of the face and insufficient training of AAMs. The great variation in the appearances can be explained by excessive distortion, caused by movements of the volunteers during the experiment, as well as natural variation in the facial appearance of the volunteer when producing a facial gesture. The reasons for inaccurate classification of the detected

contours into facial gestures include inaccurate reproduction of the gestures by volunteers, insufficient discriminative ability of the Fourier descriptors used in this work, and non-optimal training of the classifier. Overall, the results demonstrate the ability of the system to correctly recognize the facial gestures of different persons and suggest that the proposed approach can be used in autonomous wheelchairs to obtain feedback from a user.

5 CONCLUSION

This work presented a new approach to monitoring the user of an autonomous wheelchair and performed a feasibility analysis of this approach. Many approaches have been proposed to monitor the user of an autonomous wheelchair. However, few approaches focus on monitoring the user in order to provide the user with greater safety and comfort. The approach proposed in this work suggests monitoring the user to obtain information about intentions and then using this information to make decisions automatically about the future actions of the wheelchair. The approach has a clear advantage over other approaches in terms of flexibility and convenience to the user. The work examined the feasibility and suggested an implementation of a component of such a system that monitors the facial gestures of the user. The results of the evaluation suggest the applicability of this approach to monitoring the user of an autonomous wheelchair.

6 REFERENCES

[1] Facts for features: Americans with disabilities act: July 26, May 2008.
[2] Y. Adachi, Y. Kuno, N. Shimada, and Y. Shirai. Intelligent wheelchair using visual information on human faces. Intelligent Robots and Systems, 1998. Proceedings., 1998 IEEE/RSJ International Conference on, 1:354–359 vol.1, Oct 1998.
[3] David W. Aha. Editorial. Artificial Intelligence Review, 11(1-5):7–10, 1997. ISSN 0269-2821.
[4] R. Barea, L. Boquete, M. Mazo, and E. López. Wheelchair guidance strategies using eog. J. Intell. Robotics Syst., 34(3):279–299, 2002.
[5] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer-Verlag, January 2000.
[6] L. Bergasa, M. Mazo, A. Gardel, R. Barea, and L. Boquete. Commands generation by face movements applied to the guidance of a wheelchair for handicapped people. Pattern Recognition, 2000. Proceedings. 15th International Conference on, 4:660–663 vol.4, 2000.
[7] L. Bergasa, M. Mazo, A. Gardel, J. Garcia, A. Ortuno, and A. Mendez. Guidance of a wheelchair for handicapped people by face tracking. Emerging Technologies and Factory Automation, 1999. Proceedings. ETFA ’99. 1999 7th IEEE International Conference on, 1:105–111 vol.1, 1999.
[8] Michael J. Black and Anand Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. Int. J. Comput. Vision, 19(1):57–91, 1996. ISSN 09205691.
[9] F. Bley, M. Rous, U. Canzler, and K.-F. Kraiss. Supervised navigation and manipulation for impaired wheelchair users. Systems, Man and Cybernetics, 2004 IEEE International Conference on, 3:2790–2796 vol.3, Oct. 2004.
[10] T.F. Cootes, G.J. Edwards, and C.J. Taylor. Active appearance models. PAMI, 23(6):681–685, June 2001.
[11] G. J. Edwards, C. J. Taylor, and T. F. Cootes. Interpreting face images using active appearance models. In FG ’98: Proceedings of the 3rd. International Conference on Face & Gesture Recognition, page 300, Washington, DC, USA, 1998. IEEE Computer Society. ISBN 0-8186-83449.
[12] P. Ekman. Methods for measuring facial action. Handbook of Methods in Nonverbal Behavioral Research, pages 445–90, 1982.
[13] P. Ekman and W. Friesen. The facial action coding system: A technique for the measurement of facial movement. In Consulting Psychologists, 1978.
[14] G. Fine and J. Tsotsos. Examining the feasibility of face gesture detection using a wheelchair mounted camera. Technical Report CSE-2009-04, York University, Toronto, Canada, 2009.
[15] E. Fix and J. Hodges. Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, USA, 1951.
[16] T. Gomi and A. Griffith. Developing intelligent wheelchairs for the handicapped. In Assistive Technology and Artificial Intelligence, Applications in Robotics, User Interfaces and Natural Language Processing, pages 150–178, London, UK, 1998. Springer-Verlag.
[17] Colin Goodall. Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society. Series B (Methodological), 53(2):285–339, 1991. ISSN 00359246.
[18] J.-S. Han, Z. Zenn Bien, D.-J. Kim, H.-E. Lee, and J.-S. Kim. Human-machine interface for wheelchair control with emg and its evaluation. Engineering in Medicine and Biology Society, 2003. Proceedings of the 25th Annual International Conference of the IEEE, 2:1602–1605 Vol.2, Sept. 2003.
[19] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 27:417–441, 1933.
[20] H. Hu, P. Jia, T. Lu, and K. Yuan. Head gesture recognition for hands-free control of an intelligent wheelchair. Industrial Robot: An International Journal, 34(1):60–68, 2007.
[21] S. P. Kang, G. Rodnay, M. Tordon, and J. Katupitiya. A hand gesture based virtual interface for wheelchair control. In IEEE/ASME International Conference on Advanced Intelligent Mechatronics, volume 2, pages 778–783, 2003.
[22] N. Katevas, N. Sgouros, S. Tzafestas, G. Papakonstantinou, P. Beattie, J. Bishop, P. Tsanakas, and D. Koutsouris. The autonomous mobile robot scenario: a sensor aided intelligent navigation system for powered wheelchairs. Robotics and Automation Magazine, IEEE, 4(4):60–70, Dec 1997.
[23] H. Kauppinen, T. Seppanen, and M. Pietikainen. An experimental comparison of autoregressive and Fourier-based descriptors in 2d shape classification. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(2):201–207, 1995.
[24] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, Number 4598, 13 May 1983, 220, 4598:671–680, 1983.
[25] Y. Kuno, T. Murashima, N. Shimada, and Y. Shirai. Interactive gesture interface for intelligent wheelchairs. In IEEE International Conference on Multimedia and Expo (II), pages 789–792, 2000.
[26] I. Kunttu, L. Lepisto, J. Rauhamaa, and A. Visa. Multiscale fourier descriptor for shape-based image retrieval. Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, 2:765–768 Vol.2, Aug. 2004.
[27] Y. Matsumoto, T. Ino, and T. Ogasawara. Development of intelligent wheelchair system with face and gaze based interface. Robot and Human Interactive Communication, 2001. Proceedings. 10th IEEE International Workshop on, pages 262–267, 2001.
[28] B. M. Mehtre, M. S. Kankanhalli, and W. F. Lee. Shape measures for content based image retrieval: A comparison. Information Processing & Management, 33(3):319–337, May 1997.
[29] I. Moon, M. Lee, J. Ryu, and M. Mun. Intelligent robotic wheelchair with emg-, gesture-, and voice-based interfaces. Intelligent Robots and Systems, 2003. (IROS 2003). Proceedings. 2003 IEEE/RSJ International Conference on, 4:3453–3458 vol.3, Oct. 2003.
[30] S. Nakanishi, Y. Kuno, N. Shimada, and Y. Shirai. Robotic wheelchair based on observations of both user and environment. Intelligent Robots and Systems, 1999. IROS ’99. Proceedings. 1999 IEEE/RSJ International Conference on, 2:912–917 vol.2, 1999.
[31] OpenCV. Opencv library, 2006.
[32] R. C. Simpson. Smart wheelchairs: A literature review. Journal of Rehabilitation Research and Development, 42(4):423–436, 2005.
[33] M. B. Stegmann. Active appearance models: Theory, extensions and cases. Master’s thesis, Informatics and Mathematical Modelling, Technical University of Denmark, DTU, Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby, Aug. 2000.
[34] K. Tanaka, K. Matsunaga, and H. Wang. Electroencephalogram-based control of an electric wheelchair. Robotics, IEEE Transactions on, 21(4):762–766, Aug. 2005.
[35] C. Taylor, G. Edwards, and T. Cootes. Active appearance models. In ECCV98, volume 2, pages 484–498, 1998.
[36] H. A. Yanco. Integrating robotic research: a survey of robotic wheelchair development. In AAAI Spring Symposium on Integrating Robotic Research, 1998.
[37] I. Yoda, K. Sakaue, and T. Inoue. Development of head gesture interface for electric wheelchair. In i-CREATe ’07: Proceedings of the 1st international convention on Rehabilitation engineering & assistive technology, pages 77–80, New York, NY, USA, 2007. ACM.
[38] I. Yoda, J. Tanaka, B. Raytchev, K. Sakaue, and T. Inoue. Stereo camera based non-contact non-constraining head gesture interface for electric wheelchairs. ICPR, 4:740–745, 2006.
[39] C. Zahn and R. Roskies. Fourier descriptors for plane closed curves. IEEE Trans. Computers, 21(3):269–281, March 1972.
[40] D. S. Zhang and G. Lu. A comparative study of fourier descriptors for shape representation and retrieval. In Proceedings of the Fifth Asian Conference on Computer Vision, pages 646–651, 2002.
[41] D. Zhang and G. Lu. A comparative study of curvature scale space and fourier descriptors for shape-based image retrieval. Journal Visual Communication and Image Representation, 14(1):39–57, 2003.



Lakshmi Gade, Sreekar Krishna and Sethuraman Panchanathan
Center for Cognitive Ubiquitous Computing (CUbiC)
Arizona State University, Tempe AZ 85281
Lakshmi.Gade@asu.edu, Sreekar.Krishna@asu.edu & Panch@asu.edu
http://cubic.asu.edu

ABSTRACT
Social interactions are a vital aspect of everyone’s daily living. Individuals with visual impairments are at a disadvantage in social interactions, as the majority (nearly 65%) of these interactions happen through visual non-verbal cues. Recently, efforts have been made towards the development of an assistive technology, called the Social Interaction Assistant (SIA) [1], which enables access to non-verbal cues for individuals who are blind or visually impaired. Self-report feedback from these individuals about their own social interactions, together with behavioral psychology studies, indicates that individuals with visual impairment will benefit in their social learning and social feedback by gaining access to the non-verbal cues of their interaction partners. As part of this larger SIA project, in this paper we discuss the importance of person localization in building a human-centric assistive technology that addresses the essential needs of visually impaired users. We describe the challenges that arise when a wearable camera platform is used as a sensor for picking up non-verbal social cues, especially the problem of person localization in a real-world application. Finally, we present a computer vision based algorithm adapted to handle the various challenges associated with the problem of person localization in videos and demonstrate its performance on three exemplar video sequences.

Keywords: Social Interactions, Wearable Camera, Person Tracking, Particle Filtering, Chamfer Matching, Person Localization



Human-Centered Multimedia Computing (HCMC) [2], an emerging area under Human Centered Computing (HCC), focuses on the creation of multimedia solutions that enrich the everyday lives of individuals through the effective use of multimedia technologies. As explained in [2], HCMC derives inspiration from human disabilities and deficits in order to develop novel multimedia computing solutions. An important example, discussed in detail in [1], is the concept of a Social Interaction Assistant (SIA), an assistive technology aid for enhancing social interactions between individuals, especially those who are visually impaired or blind. Developed primarily with an assistive technology focus, the SIA uses state-of-the-art pervasive and ubiquitous computing elements, ranging from miniature on-body sensors to high fidelity haptic actuators. The evolution of this project can be traced through the publications [1-4] and [28-30], in chronological order. This paper attempts to provide a solution to one persistent problem of the SIA: tracking people through its primary sensing element, a wearable camera. Following this section, we provide a brief overview of the SIA before turning to the particular issue of person localization, which is the primary focus of this article.

1.1 Social Interaction Assistant (SIA)
Social interactions are highly influenced by non-verbal communication cues such as eye contact, facial expressions, hand gestures and body posture, which are mostly visual in nature. The lack of access to such informative visual cues often inhibits individuals with visual impairments and blindness from effectively participating in day-to-day social interactions. The purpose of the SIA is to bridge this communication gap between users who are visually impaired and their sighted counterparts [1]. As shown in Figure 1, the SIA makes use of an inconspicuous camera mounted on the nose bridge of a pair of glasses as the primary visual sensor,



while an accelerometer mounted on a cap acts as an egocentric motion sensor. The camera captures the scene in front of the user, allowing various levels of computer vision processing. The delivery of information is actuated through a single behind-the-ear speaker and a novel vibrotactile interface called the Haptic Belt. The video stream captured from the camera is processed for important social cues using a portable computing element. Any social information that is extracted from the video is delivered to the user through audio and haptic cues. Since social cues (such as facial expressions, body mannerisms, proxemics, etc.) are very high bandwidth data, care is taken to encode these signals in such a way that the user is not cognitively overloaded with information.

Figure 1. The Social Interaction Assistant

In [1] we introduce a systematic requirements analysis for an effective SIA. Through an online survey (with inputs from 27 people, of whom 16 were blind, 9 had low vision, and 2 were sighted specialists in the area of visual impairment) we rank-ordered a list of visual cues related to social interaction that are considered important by the target population. Most of the needs identified through this survey point to the importance of extracting the following characteristics of individuals in the scene: a) number and location of the interaction partners, b) facial expressions, c) identity, d) appearance, e) eye gaze direction, f) pose and g) gestures. A brief glance through this list reveals the commonality of these issues with some of the important research questions being tackled by the face research group of the computer vision and pattern recognition community. In this regard, many advances have been made in extracting information related to humans from images and videos. But when the mobile setup of the SIA is considered, with real-world data captured in unconstrained settings, a new dimension of complexity is added to these problems. As most of these cues are related to people in the surroundings of the user, it is essential to localize the individuals in the input video stream prior to processing for social interaction cues.

The problem of person localization in general is very broad in scope, and a wide variety of challenges, such as variations in articulation, scale, clothing, partial appearances and occlusions, make it a complex problem. Narrowing the focus, this paper targets person localization in real-world video sequences captured from the wearable camera of the SIA. Specifically, we focus on the task of localizing a person who is approaching the user to initiate a social interaction or just a conversation. In this context, the problem of person localization can be constrained to the cases where the person of interest is facing the user.

Figure 2. Person of interest at a short distance from camera

Figure 3. Person of interest at a large distance from camera

When such a person of interest is in close proximity, his/her presence can be detected by analyzing the incoming video stream for facial features (Figure 2). But when such a person is approaching the user from a distance, the facial region in the video is extremely small. In this case, relying on facial features alone does not suffice, and there is a need to analyze the data for full body features (Figure 3). In this work, we have concentrated on improving the effectiveness of the SIA by applying computer vision techniques to robustly localize people using full body features. The following section discusses some of the critical issues that are evident when performing person localization from the wearable camera setup of the SIA.

1.2 Challenges in Person Localization from a wearable camera platform
A number of factors associated with the background, object, camera/object motion, etc. determine the complexity of the problem of person localization from a wearable camera platform. Following is a descriptive discussion of the main challenges that we encountered while processing the data using the SIA.



1.2.1 Background Properties
When the Social Interaction Assistant is used in natural settings, it is highly likely that there are objects in the background which move, causing the background to be dynamic. Also, there are bound to be regions in the background whose image features are highly similar to those of the person, leading to a cluttered background. Due to these factors, the problem of distinguishing the person of interest from the background becomes highly challenging in this context. Figure 4 (a) and (b) illustrate the contrast in the data due to the nature of the background.

Figure 4. Background Properties: (a) Simple Background, (b) Complex Background

1.2.2 Object Properties

Figure 5. Object Properties: (a) Rigid, Homogeneous Object, (b) Non-Rigid, Deformable, Homogeneous Object

As we are interested in person localization, it can be clearly seen that the object is non-rigid in nature, as appearance changes occur throughout the sequence of images. Further, significant scale changes and deformations in the structure can also be observed. Also, when analyzing video frames of persons approaching the user, the basic image features in various sub-regions of the object vary vastly. For example, the image features from the facial region are considerably different from those of the torso region. Tracking detected persons from one frame to another will require individualized tracking of each region to maintain confidence. This non-homogeneity of the object poses a major hurdle while applying localization algorithms and has not been studied much in the literature. Figure 5(a) shows the simplicity of the data when these problems are not present, while Figure 5(b) highlights complex data formulations in a typical interaction scenario.

1.2.3 Object/Camera Motion

Figure 6. Object/Camera Motion: (a) Static Camera, (b) Mobile Camera

Traditionally, most computer vision applications use a static camera, where strong assumptions of motion continuity and temporal redundancy can be made. But in our problem, as it is very natural for users to move their head continuously, the mobile nature of the platform causes abrupt motion in the image space (Figure 6). This is similar to the problem of working with low frame rate videos or with cases where the object exhibits abrupt movements. Recently, there has been an increase of interest in dealing with this issue in computer vision research [5], [6-8]. Applications that are required to meet real-time constraints, such as teleconferencing over low bandwidth networks and cameras on low-power embedded systems, along with those that deal with abrupt object and camera motion, such as sports applications, are becoming commonplace [8]. Though solutions have been suggested, person localization through low frame rate moving cameras still remains an active research topic.

1.2.4 Other Important Factors Affecting Effective Person Localization

Figure 7. Changing Illumination, Pose Change and Blur

As the SIA is intended to be used in uncontrolled environments, changing illumination conditions need to be taken into account. Further, partial occlusions, self occlusions, in-plane and out-of-plane rotations, pose changes, blur and various



other factors can complicate the nature of the data. See Figure 7 for example situations where various factors can affect the video quality. Given the nature of this problem, in this paper we focus on the robust localization of a single person approaching a user of the SIA using full-body features. Issues arising from cluttered backgrounds and from object and camera motion are handled in order to provide robustness. In the following section we discuss some of the important related work in the computer vision literature. The conceptual framework used in person localization is presented in Section 3. The details of the proposed algorithm are discussed in Section 4. Section 5 presents the results and a discussion of the performance of our algorithm on videos collected from the wearable SIA. Finally, some possible directions for future work are outlined, followed by the conclusion.

2 RELATED WORK

Historically, two distinct approaches have been used for searching and localizing objects in videos. On one hand, there are detection algorithms which focus on locating an object in every frame using specific spatial features which are fine-tuned for the object of interest. For example, Haar-based rectangular features [9] and histograms of oriented gradients [10] can be used to develop detectors that are very specific to objects in videos. On the other hand, there are tracking algorithms which trail an object using generic image features, once it is located, by exploiting the temporal redundancy in videos. Examples of features used by tracking algorithms include color histograms [11] and edge orientation histograms [12].

2.1 Detection Algorithms
As mentioned previously, detection algorithms exploit the specific, distinctive features of an object and apply learning algorithms to detect a general class of objects. They use information related to the relative feature positions, invariant structural features, characteristic patterns and appearances to locate objects within the gallery image. But when the object is complex, like a person, it becomes difficult for these algorithms to achieve generality, and they fail even under minute non-rigidity. A number of human factors such as variations in articulation, pose, clothing, scale and partial occlusions make this problem very challenging. When assumptions about the background cannot be made, learning algorithms which take advantage of the relative positions of body parts are used to build classifiers. The low-level features generally used in this context are gradient strengths and gradient orientations [13,10], entropy and Haar-like features. Some of the well-known higher level descriptors are histograms of oriented gradients [10] and covariance features [14]. Efforts have been made to make these descriptors scale invariant as well. In order to make these algorithms real-time, researchers have commonly resorted to two kinds of approaches. One category includes part-based approaches such as Implicit Shape Models [5] and constellation models [15], which place emphasis on detecting parts of the object before integrating them, while the other category of algorithms tries to search for relevant descriptors for the whole object in a cascaded manner [16]. Shape-based Chamfer matching [25] is a popular technique used in multiple ways for person detection, as the silhouette gives a strong indication of the presence of a person. In recent times, Chamfer matching has been used extensively by the person detection and localization community. It has been applied with hierarchically arranged templates to obtain the initial candidate detection blocks so that they can be analyzed further by techniques such as segmentation, neural networks, etc. It has also been used as a validation tool to overcome ambiguities in detection results obtained by the Implicit Shape Model technique [18].

2.2 Tracking Algorithms
Assuming that there is temporal object redundancy in the incoming videos, many algorithms have been proposed to track objects over frames and build confidence as they go. Generally they make the simplifying assumption that the properties of the object depend only on its properties in the previous frame, i.e. the evolution of the object is a first-order Markovian process. Based on these assumptions, a number of deterministic as well as stochastic algorithms have been developed. Deterministic algorithms usually apply iterative approaches to find the best estimate of the object in a particular image in the video sequence [16]. Optimal solutions based on various similarity measures between the object template and regions in the current image, such as sum of squared differences (SSD), histogram-based distances, distances in eigenspace and other low dimensional projected spaces, and conformity to particular object models, have been explored [16]. Mean Shift is a popular, efficient optimization-based tracking algorithm which has been widely used. Stochastic algorithms use the state space approach of modeling dynamic systems and formulate tracking as a problem of probabilistic state estimation using noisy measurements [20]. In the context of visual object tracking, it is the problem

UbiCC Journal, Volume 5, March 2010


of probabilistically estimating the object’s properties such as its location, scale and orientation by efficiently looking for appropriate image features of the object. Most of these stochastic algorithms perform Bayesian filtering at each step for tracking, i.e. they predict the probable state distribution based on all the available information and then update their estimate according to the new observations. Kalman filtering is one such algorithm; it assumes the underlying system to be linear with Gaussian noise distributions and analytically gives an optimal estimate under this assumption. As most tracking scenarios do not fit this linear-Gaussian model, and as analytic solutions for nonlinear, non-Gaussian systems are not feasible, approximations to the underlying distribution are widely used, from both parametric and nonparametric perspectives. Sequential Monte Carlo based particle filtering techniques have gained a lot of attention recently. These techniques approximate the state distribution of the tracked object using a finite set of weighted samples, computed from various features of the system. For visual object tracking, a number of features have been used to build different kinds of observation models, each of which has its own advantages and disadvantages: color histograms [11], contours [21], appearance models, intensity gradients [22], region covariance, texture, edge-orientation histograms and Haar-like rectangular features [16], to name a few. Apart from the kind of observation model used, this technique allows for variations in the filtering process itself. A lot of work has gone into adapting this algorithm to perform better in the context of visual object tracking. While both detection and tracking have been explored extensively, there is a pressing need to address some of the issues faced by low frame rate visual tracking of objects.
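The predict, update and resample cycle described above can be illustrated with a minimal bootstrap particle filter. This is only a sketch under stated assumptions: the state is a single scalar (for example a horizontal position), the motion model is a random walk, and the measurement model is Gaussian. The function names and all parameter values are hypothetical and are not taken from the paper.

```python
import math
import random


def gaussian_likelihood(z, x, sigma):
    """Unnormalized likelihood p(z | x) under an assumed Gaussian measurement model."""
    return math.exp(-0.5 * ((z - x) / sigma) ** 2)


def particle_filter_step(particles, z, motion_sigma, meas_sigma, rng):
    """One predict/update/resample cycle of a bootstrap particle filter.

    particles: list of scalar state hypotheses.
    z: the new noisy measurement.
    Returns (resampled_particles, estimate).
    """
    # Predict: diffuse every particle with a random-walk motion model.
    predicted = [x + rng.gauss(0.0, motion_sigma) for x in particles]

    # Update: weight each particle by how well it explains the measurement.
    weights = [gaussian_likelihood(z, x, meas_sigma) for x in predicted]
    total = sum(weights)
    if total == 0.0:
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]

    # Estimate: weighted mean of the particle set.
    estimate = sum(w * x for w, x in zip(weights, predicted))

    # Resample: draw particles in proportion to their weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, estimate


if __name__ == "__main__":
    rng = random.Random(0)
    particles = [rng.uniform(0.0, 100.0) for _ in range(500)]
    true_x = 40.0
    for _ in range(10):
        true_x += 1.0
        z = true_x + rng.gauss(0.0, 2.0)
        particles, estimate = particle_filter_step(particles, z, 3.0, 2.0, rng)
    print("estimate:", round(estimate, 1), "true position:", true_x)
```

In a visual tracker the scalar state would be replaced by the position and scale of the image region, and the Gaussian likelihood by an observation model built from image features such as those listed above.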
Especially in the case of SIA, person localization in low frame rate video is of utmost importance. In this paper, we have attempted to modify the color histogram comparison based particle filtering algorithm to handle the complexities that occur with the mobile camera on the Social Interaction Assistant.

3 CONCEPTUAL FRAMEWORK

As discussed in the previous section, detection and tracking offer distinctive advantages and disadvantages when it comes to localizing objects. In the case of SIA, thorough object detection is not possible in every frame due to the lack of computational power (on a wearable computing platform), and tracking is not always efficient due to the movement of the camera and the object’s (interaction partner’s) independent motion. Though there are clear advantages in applying these techniques individually, the strengths of both these approaches need to be combined in order to tackle the challenges posed by the complex setting of the SIA. In the past, a few researchers have approached the problem of tracking in low frame rate or abrupt videos by combining a standard particle filtering algorithm with independent object detectors [23]. In our experience, the Social Interaction Assistant offers a weak temporal redundancy in most cases. We exploit this information trickle between frames to get an approximate estimate of the object location by incorporating a deterministic object search while avoiding the explicit use of pre-trained detectors.

Due to the flexibility in their design, particle filtering algorithms provide a good platform to address the issues arising from complex data. These algorithms give an estimate of an object’s position by discretely building the underlying distribution which determines the object’s properties. But real-time constraints impose limits on the number of particles and the strength of the observation models that can be used. This generally causes the final estimate to be noisy when conventional particle filtering approaches are applied. Unless the choice of the particles and the observation models fits the underlying data well, the estimate is likely to drift away as the tracking progresses. To mitigate these problems faced in the use of the SIA, we propose a new particle filtering framework that gets an initial estimate of the person’s location by spreading particles over a reasonably large area and then successively corrects the position through a deterministic search in a reduced search space. Termed the Structured Mode Searching Particle Filter (SMSPF), the algorithm uses color histogram comparison in the particle filtering framework at each step to get an initial estimate, which is then corrected by applying a structured search based on gradient features and chamfer matching.

4 STRUCTURED MODE SEARCHING PARTICLE FILTER

Assuming that an independent person detection algorithm can initialize this tracking algorithm with the initial estimate of the person location, this particle filtering framework focuses on tracking a single person under the following circumstances:
• Image region with the person is non-rigid and non-homogeneous
• Image region with the person exhibits significant scale changes
• Image region with the person exhibits abrupt motions of small magnitude in the image space due to the movement of the camera
• Background is cluttered



The algorithm progresses by implementing two steps on each frame of the incoming video stream. In the first step (Figure 8), an approximate estimate of the person region is obtained by applying a color histogram based particle filtering step over a large search space. This is followed by a refining second step (Figure 9) where the estimate is corrected by applying a structured search based on gradient features and Chamfer matching. These two steps have been described in detail below.
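The first step can be sketched as follows, using the details given in Section 4.1: particles drawn from a large-variance Gaussian centered on the previous estimate, an 8x8x4 HSV histogram at each particle, and the Bhattacharyya coefficient as the similarity measure. Several things here are assumptions made for illustration only: the image is a nested list of RGB tuples in [0, 1], the window size is kept fixed (scale is not sampled), and the similarity is turned into a weight with exp(-lam * (1 - rho)), where `lam` is a hypothetical tuning constant. All function names are hypothetical.

```python
import colorsys
import math
import random

H_BINS, S_BINS, V_BINS = 8, 8, 4  # 8x8x4 binning, coarser in V


def hsv_histogram(pixels):
    """Normalized HSV histogram of an iterable of (r, g, b) values in [0, 1]."""
    hist = [0.0] * (H_BINS * S_BINS * V_BINS)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        hi = min(int(h * H_BINS), H_BINS - 1)
        si = min(int(s * S_BINS), S_BINS - 1)
        vi = min(int(v * V_BINS), V_BINS - 1)
        hist[(hi * S_BINS + si) * V_BINS + vi] += 1.0
    total = sum(hist)
    return [c / total for c in hist] if total else hist


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalized histograms (1 = identical)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def region_pixels(image, cx, cy, w, h):
    """Pixels of the w x h box centered at (cx, cy), clipped to the image."""
    rows, cols = len(image), len(image[0])
    x0, x1 = max(0, int(cx - w / 2)), min(cols, int(cx + w / 2))
    y0, y1 = max(0, int(cy - h / 2)), min(rows, int(cy + h / 2))
    return [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]


def step1_estimate(image, prev, template_hist, n_particles, sigma, lam, rng):
    """Approximate center (cx, cy) of the person; prev = (cx, cy, w, h)."""
    cx, cy, w, h = prev
    total = ex = ey = 0.0
    for _ in range(n_particles):
        # Proposal: large-variance Gaussian centered on the previous estimate.
        px, py = rng.gauss(cx, sigma), rng.gauss(cy, sigma)
        rho = bhattacharyya(hsv_histogram(region_pixels(image, px, py, w, h)),
                            template_hist)
        weight = math.exp(-lam * (1.0 - rho))  # assumed likelihood form
        total += weight
        ex += weight * px
        ey += weight * py
    return ex / total, ey / total


if __name__ == "__main__":
    red, blue = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)
    # Synthetic frame: a red 10x10 "person" on a blue background.
    frame = [[red if 30 <= x < 40 and 25 <= y < 35 else blue
              for x in range(60)] for y in range(60)]
    template = hsv_histogram([red] * 100)
    print(step1_estimate(frame, (30.0, 30.0, 10, 10), template,
                         500, 8.0, 20.0, random.Random(0)))
```

The estimate returned here is only approximate, which is why a refining second step follows it.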

4.1 Step 1: Particle filtering step

In the context of SIA, as the person of interest can exhibit abrupt motion changes in the image space, it is extremely difficult to model the placement of the person in the current image based on the previous frame’s information alone. When such data is modeled in the Bayesian filtering based particle filtering framework, the state of each particle’s position becomes independent of its state in the previous step. Thus, the prior distribution can be considered to be a uniform random distribution over the support region of the image:

$$p(x_t^i \mid x_{t-1}^i) = p(x_t^i)$$

As it is essential for a particle filtering algorithm to choose a good set of particles, it would be useful to pick a good portion of them near the estimate in the previous step. By approximating this previous estimate to be equivalent to a measurement of the image region with the person in the current step, the proposal distribution of each particle can be chosen to be dependent only on the current measurement:

$$q(x_t^i \mid x_{t-1}^i, z_t) = q(x_t^i \mid z_t)$$

Figure 8. SMSPF – Step 1

Though the propagation of information through particles is lost by making such an assumption, it gives a better sampling of the underlying system. We employ a large variance Gaussian with its mean centered at the previous estimate for successive frame particle propagation. By using such a set of particles, a larger area is covered, thus accounting for abrupt motion changes, and a good portion of them are picked near the previous estimate, thus exploiting the weak temporal redundancy. As in [11], we have employed this technique using HSV color histogram comparison to get likelihoods at each of the particle locations. Since intensity is separated from chrominance in this color space, it is reasonably insensitive to illumination changes. We use an 8x8x4 HSV binning, thereby allowing lower sensitivity to changes in V when compared to chrominance. The histograms are compared using the well-known Bhattacharyya Similarity Coefficient which guarantees near optimality and scale invariance.

Figure 9. SMSPF – Step 2

With the above step alone, due to the small number of particles which are spread widely across the image, we can get an approximate location of the person. When such an estimate partially overlaps with the desired person region, the best match occurs at the intersection of the estimate and the actual person region as shown in Figure 10. But it is not trivial to detect this partial presence due to the existence of background clutter. To handle this problem, we introduce a second step which uses efficient image feature representations of the desired person object and employs an efficient search around the estimate to accurately localize the person object.

Figure 10. Structured Search

4.2 Step 2: Structured Search

As the estimate obtained using widely spread particles gives the approximate location of the object, the search for the image block with a person in it can be restricted to a region around it. We have employed a grid-based approach to discretely search for the object of interest (a person) instead of checking at every pixel. By dividing the estimate into an m x n grid and sliding a window along the bins of the grid as shown in Figure 11, the search space can be restricted to a region close to the estimate. By finding the location which gives the best match with the person template, we can localize the person in the video sequence with better accuracy.
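The grid-based search can be sketched as a generator of candidate windows that are shifted by whole grid bins around the estimate. The sketch also includes an optional sweep over scale, which this section goes on to describe, by building windows from (m + d) x (n + d) bins of the same bin size. The function names, the top-left anchoring of scaled windows and the use of a single offset `d` for both dimensions are assumptions made for illustration; the paper does not specify them.

```python
def grid_candidates(estimate, m, n, sweep, scale_steps=0):
    """Candidate windows for a grid-based search around an estimate.

    estimate: (x, y, w, h) with (x, y) the top-left corner.
    m, n: number of grid columns and rows the estimate is divided into.
    sweep: maximum shift, in whole bins, in each spatial direction.
    scale_steps: windows use (m + d) x (n + d) bins for d in
        [-scale_steps, scale_steps]; d < 0 probes a shrunken object and
        d > 0 a dilated one.
    Returns a list of (x, y, w, h) windows.
    """
    x, y, w, h = estimate
    bin_w, bin_h = w / m, h / n  # size of one grid bin
    windows = []
    for d in range(-scale_steps, scale_steps + 1):
        cols, rows = m + d, n + d
        if cols < 1 or rows < 1:
            continue
        win_w, win_h = cols * bin_w, rows * bin_h
        for dy in range(-sweep, sweep + 1):
            for dx in range(-sweep, sweep + 1):
                windows.append((x + dx * bin_w, y + dy * bin_h, win_w, win_h))
    return windows


def best_window(windows, score):
    """Return the window with the highest match score."""
    return max(windows, key=score)


if __name__ == "__main__":
    candidates = grid_candidates((100, 50, 40, 80), m=4, n=4, sweep=1)
    print(len(candidates), "candidate windows")
    # A stand-in score that prefers windows whose left edge is near x = 110.
    print(best_window(candidates, lambda win: -abs(win[0] - 110)))
```

Because shifts are multiples of the bin size, the number of windows to score grows with the sweep and not with the pixel resolution of the image.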

Figure 11. Sliding window of the Structured Search (Green: Estimate; Red: Sliding window).

If this search is performed based on scale-invariant features, then it can be extended to identify scale changes as well. In order to achieve search over scale, the estimate and the sliding window need to be divided into different numbers of bins. If the search is performed using a smaller number of bins as compared to the estimate, then shrinking of the object can be identified, while searching with a higher number of bins can account for dilation of the object. For example, if an (m-1) x (n-1) grid is used with the sliding window while an m x n grid is used with the estimate, then the best match will find a shrink in the object size. Similarly, if an m x n grid sliding window is used with an (m-1) x (n-1) estimate grid, then dilations can be detected. It can be seen that this search is characterized by the number of bins m x n into which the sliding window and the estimate are divided. Based on the nature of the problem, the number of bins and the amount of sweep across scale and space can be adjusted. Currently, these parameters are being set manually, but the structured search framework can be extended to include online algorithms which can adapt the number of grid bins based on the evolution of the object.

If the object of interest were simple, then the best match across space and scale could be obtained by using simple feature matching techniques. But due to the complex nature of the data, strong confidence is required while searching for the person region across scale. To this end, we propose to perform the structured search by analyzing the internal features of the person region as well as the external boundary/silhouette features and aggregating the confidence obtained from these two measures to refine the person location estimate in the image (Figure 12).

Figure 12. Structured Search Matching Technique

In the literature, gradient based features have been widely used for person detection and tracking problems, and their applicability has been strongly established by various algorithms like Histogram of Oriented Gradients (HoGs) [10]. Following this principle, we have used the Edge Orientation Histogram (EOH) features [12] in order to obtain the internal content information measure. For this purpose, a gradient histogram template (GHT) is initially built using a generic template image of a walking/standing person. This GHT is then compared with the gradient histogram of each structured search block using the Bhattacharyya histogram comparison as in [11] in order to find the block with the best internal confidence. In our implementation, orientations are computed using the Sobel operator and the gradients are then binned into 9 discrete bins. These features were extracted using the integral histogram concept [27] to facilitate computationally efficient searching.

Similarly, in order to obtain the boundary confidence measure, a generic person silhouette template (GPT) (as shown in Figure 13) is used to perform a modified Chamfer match on each of the search blocks. In general, Chamfer matching is used to search for a particular contour model in an edge map by building a distance transformed image of the edge map. Each pixel value in a distance transformed image is proportional to the distance to its nearest edge pixel. In order to compare the edge map to the contour map, we convolve the distance transformed image with the contour map. If the contour completely overlaps with the matching edge region, we get a chamfer match value of zero. Depending on how different the edge map is from the template contour, the chamfer match score will increase and move towards 1. A chamfer match score of 1 implies a very bad match. While the theory of chamfer matching offers an elegant search score, in reality, especially with clutter within the object’s silhouette, it is very difficult to get an exact match score. In SIA, since the data is very noisy and complex, certain modifications need to be made to the Chamfer matching algorithm in order to achieve good performance. The following section details a modified Chamfer match algorithm introduced in this work.
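The internal confidence measure described above (Sobel orientations binned into 9 bins, compared with the Bhattacharyya coefficient) might look like the following sketch. It assumes a grayscale block given as a nested list, unsigned orientations in [0, pi), and magnitude-weighted voting; these choices, the function names, and the omission of the integral histogram speed-up are illustrative assumptions, not details confirmed by the paper.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def edge_orientation_histogram(gray, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    gray: 2-D list of intensities. Border pixels are skipped.
    """
    hist = [0.0] * bins
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            gx = gy = 0.0
            for j in range(3):
                for i in range(3):
                    value = gray[y + j - 1][x + i - 1]
                    gx += SOBEL_X[j][i] * value
                    gy += SOBEL_Y[j][i] * value
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(gy, gx) % math.pi  # unsigned, in [0, pi)
            index = min(int(angle / math.pi * bins), bins - 1)
            hist[index] += magnitude
    total = sum(hist)
    return [v / total for v in hist] if total else hist


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalized histograms."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


if __name__ == "__main__":
    vertical_edge = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
    horizontal_edge = [[0] * 6 for _ in range(3)] + [[1] * 6 for _ in range(3)]
    template = edge_orientation_histogram(vertical_edge)
    for block in (vertical_edge, horizontal_edge):
        print(bhattacharyya(template, edge_orientation_histogram(block)))
```

A block whose edges share the template's orientations scores close to 1, while a block dominated by perpendicular edges scores close to 0.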
4.3 Chamfer Matching in Structured Search

As discussed above, Chamfer matching gives a measure of confidence on the presence of the person within an image based on silhouette information. We have incorporated this confidence into the structured search in order to detect the precise location of the person around the particle filter estimate. An edge map of the image under consideration is first obtained, which is then divided into (m x n) windows in accordance with the structured search, and an elliptical ring mask is then applied to each of these windows as shown in Figure 13. This mask is applied so as to eliminate the edges that arise due to clothing and background, thereby emphasizing the silhouette edges which are likely to appear in the ring region if a window is precisely placed on the object perimeter. A distance transformed image of the window is then obtained using the masked edges.
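The masking and distance transform steps can be sketched as below. The ring geometry (`inner` and `outer` as fractions of the window's half-axes) is a purely illustrative assumption, since the paper does not give the mask dimensions, and the distance transform is a brute-force Euclidean version chosen for clarity over speed. All names are hypothetical.

```python
import math


def elliptical_ring_mask(width, height, inner=0.6, outer=1.0):
    """Boolean mask that is True between two concentric ellipses.

    inner and outer are fractions of the window's half-axes; both are
    illustrative values, not taken from the paper.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    ax, ay = width / 2.0, height / 2.0
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            r = math.hypot((x - cx) / ax, (y - cy) / ay)
            row.append(inner <= r <= outer)
        mask.append(row)
    return mask


def masked_edges(edges, mask):
    """Keep only the edge pixels that fall inside the ring."""
    return [[bool(e) and m for e, m in zip(edge_row, mask_row)]
            for edge_row, mask_row in zip(edges, mask)]


def distance_transform(edges):
    """Brute-force Euclidean distance from every pixel to its nearest edge."""
    points = [(x, y) for y, row in enumerate(edges)
              for x, e in enumerate(row) if e]
    if not points:
        return [[math.inf] * len(row) for row in edges]
    return [[min(math.hypot(x - px, y - py) for px, py in points)
             for x in range(len(row))]
            for y, row in enumerate(edges)]


if __name__ == "__main__":
    mask = elliptical_ring_mask(9, 9)
    for row in mask:
        print("".join("#" if m else "." for m in row))
```

Edges in the center of the window (clothing) and in its corners (background) are discarded, so only edges near the expected silhouette contribute to the distance map.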

Figure 13. Incorporating Chamfer Matching into Structured Search

By applying the modified chamfer matching (with a generic person contour resized to the current particle filter estimate), a confidence number in locating the desired object within the image region can be obtained. As with the standard Chamfer matching described earlier, a value close to 0 indicates a strong confidence in the presence of a person, and a value close to 1 indicates the opposite. As 1 is the maximum value that can be obtained by the chamfer match, this measure can be incorporated into the match score of the structured search using the following equation.
$$\mathrm{BoundaryConf} = 1 - \mathrm{ChamferMatch}$$
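A chamfer score bounded by 1, and the boundary confidence derived from it, could be computed as in the sketch below. The paper does not state how the score is normalized, so the truncation threshold `max_dist` is an assumption, as are the function names and the convention that contour points falling outside the image receive the maximum distance.

```python
def chamfer_match(dist, contour, offset=(0, 0), max_dist=10.0):
    """Mean truncated distance under the contour points, scaled to [0, 1].

    dist: distance-transformed edge map (2-D list).
    contour: list of (x, y) template contour points.
    offset: top-left placement of the template in the image.
    max_dist: truncation threshold (an assumed normalization).
    """
    ox, oy = offset
    total = 0.0
    for x, y in contour:
        px, py = x + ox, y + oy
        inside = 0 <= py < len(dist) and 0 <= px < len(dist[0])
        d = dist[py][px] if inside else max_dist
        total += min(d, max_dist)
    return total / (max_dist * len(contour))


def boundary_conf(chamfer_score):
    """BoundaryConf = 1 - ChamferMatch."""
    return 1.0 - chamfer_score


if __name__ == "__main__":
    # Hand-made distance map with a single edge pixel at the top-left corner.
    dist = [[0.0, 1.0, 2.0],
            [1.0, 1.4, 2.2],
            [2.0, 2.2, 2.8]]
    print(boundary_conf(chamfer_match(dist, [(0, 0)], max_dist=2.0)))
    print(boundary_conf(chamfer_match(dist, [(0, 0), (2, 0)], max_dist=2.0)))
```

A contour that lies exactly on edge pixels gives a chamfer score of 0 and therefore a boundary confidence of 1.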


The standard form of Chamfer Matching gives a continuous measure of confidence in locating an object in an edge map. But, in our case, when the elliptical ring mask is used to filter out the noisy edges in each search block, this nature of Chamfer match is lost. Since the primary goal of the structured search is to find a single best matching location of the person, it is more advantageous to use the filter mask at the cost of losing this continuous nature of the chamfer match. Further, as it is very likely that the person region is close to the approximate estimate obtained from the first step, one of the search windows of the structured search is bound to capture the entire person object thus resulting in a good match score. From the above discussion, it can be seen that combining the knowledge about the internal structure of the person region with the silhouette information results in a greater confidence in the SMSPF algorithm. Further, using such complementary features in the structured search robustly corrects the approximate estimate obtained from the particle filtering step while handling various problems associated with search across scale.





5.1 DataSets

The performance of the structured mode searching particle filter (SMSPF) has been tested using three datasets where a single person faces the camera while approaching it. There are significant scale changes in each of these sequences. Further, non-rigidity and deformability of the person region can also be clearly observed. Different scenarios with varying degrees of complexity of the background and camera movement have been considered. Following is a brief description of these datasets.
(a) DataSet 1 (Collected at CUbiC¹): Plain Background; Static Camera; 320x240 resolution
(b) DataSet 2 (CASIA² Gait Dataset B with subject approaching the camera [4]): Slightly cluttered Background; Static Camera; 320x240 resolution
(c) DataSet 3 (Collected at CUbiC³): Cluttered Background; Mobile Camera; 320x240 resolution
Figure 14 shows the sample results on each of the datasets used.

(a) SMSPF Results on a sequence from Dataset 1
(b) SMSPF Results on a sequence from Dataset 2
(c) SMSPF Results on a sequence from Dataset 3
Figure 14. SMSPF Results

5.2 Evaluation Metrics

In order to test the robustness of this algorithm and its applicability in complex situations, its performance has been compared with the popular Color Particle Filtering algorithm [11]. The following two criteria have been used to evaluate their performance [24].
• Area Overlap (AO)
• Distance between Centroids (DC)
Manually labeled rectangular regions around the person in the image have been used as the ground truth. Suppose gTruth_i is the ground truth in the i-th frame and track_i is the rectangular region output by a tracking algorithm; then the area overlap criterion is defined as follows:

$$AO(gTruth_i, track_i) = \frac{Area(gTruth_i \cap track_i)}{Area(gTruth_i \cup track_i)}$$

The average area overlap can be computed for each data sequence as

$$AvgAOR = \frac{1}{N} \sum_{i=1}^{N} AO(gTruth_i, track_i)$$

An AvgAOR value closer to 1 indicates a better match, whereas a value of 0 implies no overlap. Similar to [24], we use the Object Tracking Error (OTE), which is the average distance between the centroid of the ground truth bounding box and the centroid of the result given by a tracking algorithm:

$$OTE = \frac{1}{N} \sum_{i=1}^{N} \lVert Centroid_{gTruth_i} - Centroid_{track_i} \rVert \qquad (7)$$

An OTE value closer to 0 implies better tracking, while a value away from 0 implies a larger distance between the prediction and the ground truth.

In order to evaluate the performance of these algorithms using a single metric which encodes information from both area overlap and the distance between centroids, we have used a measure termed the Tracking Evaluation Measure (TEM), which is the harmonic mean of the average area overlap fraction (AvgAOR) and an exponent mapping of the Object Tracking Error (OTE).

$$TEM = 2 \cdot \frac{AvgAOR \cdot e^{-k \cdot OTE}}{AvgAOR + e^{-k \cdot OTE}}$$

where k is a constant which exponentially penalizes the cases where the distance between centroids is large.
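The evaluation measures defined in this section can be computed as in the following sketch. Boxes are assumed to be axis-aligned rectangles given as (x0, y0, x1, y1), the distance between centroids is taken to be Euclidean, and the value of k used in the example is hypothetical since the paper does not report it.

```python
import math


def area(box):
    """Area of an (x0, y0, x1, y1) box; degenerate boxes have zero area."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def area_overlap(gt, tr):
    """AO: intersection area divided by union area of two boxes."""
    inter = area((max(gt[0], tr[0]), max(gt[1], tr[1]),
                  min(gt[2], tr[2]), min(gt[3], tr[3])))
    union = area(gt) + area(tr) - inter
    return inter / union if union > 0 else 0.0


def centroid(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def avg_aor(gts, trs):
    """Average area overlap over a sequence of frames."""
    return sum(area_overlap(g, t) for g, t in zip(gts, trs)) / len(gts)


def ote(gts, trs):
    """Object Tracking Error: mean Euclidean distance between centroids."""
    total = 0.0
    for g, t in zip(gts, trs):
        (gx, gy), (tx, ty) = centroid(g), centroid(t)
        total += math.hypot(gx - tx, gy - ty)
    return total / len(gts)


def tem(gts, trs, k):
    """Harmonic mean of AvgAOR and exp(-k * OTE)."""
    a = avg_aor(gts, trs)
    e = math.exp(-k * ote(gts, trs))
    return 2.0 * a * e / (a + e) if a + e > 0 else 0.0


if __name__ == "__main__":
    ground_truth = [(0, 0, 10, 10), (2, 0, 12, 10)]
    tracked = [(0, 0, 10, 10), (7, 0, 17, 10)]
    print("AvgAOR:", avg_aor(ground_truth, tracked))
    print("OTE:", ote(ground_truth, tracked))
    print("TEM:", tem(ground_truth, tracked, k=0.1))
```

Because TEM is a harmonic mean, a tracker must do well on both overlap and centroid distance to score highly; a poor value on either term pulls the combined score down.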

¹ Center for Cognitive Ubiquitous Computing, ASU.
² Portions of the research in this paper use the CASIA Gait Database collected by Institute of Automation, Chinese Academy of Sciences.
³ Center for Cognitive Ubiquitous Computing, ASU.


5.3 Results

As mentioned in [7], in order to handle abrupt motion changes, it is essential that the particles are widely spread while tracking. Following this principle, we have compared the performance of the color particle filter (PF) [11] and the structured mode searching particle filter (SMSPF) by using a 2-D Gaussian with large variance as the system model. The position of the person and its scale have been included in the state vector. In order to compensate for the computational cost of the structured search, only 50 particles were used for the SMSPF algorithm while 100 particles were used for the PF algorithm. A 10x10 grid with a sweep of 8 steps along the spatial dimension and 3 steps along the scale dimension was used in the structured search.

Figure 15. AO (Dotted Line: Color PF; Solid Line: SMSPF)

Figure 16. DC (Dotted Line: Color PF; Solid Line: SMSPF)

Figure 15 and Figure 16 illustrate the comparison of the area overlap ratio and the distance between centroids at each frame of an example sequence from Dataset 3. The sample frames are shown beside the tracking results. From Figure 15(a), it is evident that the SMSPF algorithm (red) shows a significant improvement over the color particle filter algorithm (green). Here, the area overlap ratio using SMSPF is much closer to 1 in most of the frames, while the color particle filter drifts away, causing this measure to be closer to 0. The distance between centroids measure also indicates a greater precision of the SMSPF algorithm, as seen in Figure 16(a), where the distance between centroids using the color particle filter is much higher than that with SMSPF (≈0).

Figure 17. Evaluation Measure for DataSet 1

Figure 18. Evaluation Measure for DataSet 2

Figure 19. Evaluation Measure for DataSet 3

Figure 17, Figure 18 and Figure 19 show the Tracking Evaluation Measure (TEM) for Datasets 1, 2 and 3. In the majority of cases, the SMSPF algorithm outperforms the color based particle filtering algorithm with a higher TEM score.

The results presented as a comparison between Color PF and SMSPF show that incorporating a deterministic structured search into the stochastic particle filtering framework improves the person tracking performance in complex scenarios. The SMSPF algorithm strikes a balance between the specificity and the generality offered by detection and tracking algorithms as discussed in Section 2. It uses specific structure-aware features in the search in order to handle non-homogeneity of the object and the cluttered nature of the background. On the other hand, generality is maintained by using simple, global features in the particle filtering framework so as to handle non-rigidity and deformability of the object. The clear advantage of using the structured search can be observed on the complex Dataset 3, which encompasses most of the challenges generally encountered while using the Social Interaction Assistant.






As a first step towards achieving robust person localization in the Social Interaction Assistant platform, we have currently considered the cases where the movement of the camera is small. The generic structured search proposed in this work can be adapted to handle drastic abrupt motions of the camera as well. One way to handle such cases is to use a very small set of particles spread over a large region in conjunction with the structured search at each particle region. Also, improving the efficiency of the observation models would computationally ease such near-exhaustive searches. Further, in this work, we used a generic person silhouette in our chamfer matching step to validate the positions in the structured search. Better validation can be obtained by using person dependent silhouettes and better boundary masks which accurately capture the relevant structure of the person’s body. The current implementation focuses only on people facing the camera. This can be readily extended to handle other cases by effectively selecting the relevant silhouettes based on the application.

7 CONCLUSION

Person localization in videos captured from a wearable camera involves tracking non-rigid, deformable, non-homogeneous image regions which exhibit random motion patterns in cluttered backgrounds. By incorporating ideas of specificity associated with deterministic detection algorithms along with the generality of stochastic tracking algorithms, we have presented a particle filtering technique which effectively localizes individuals across a range of space and scale once a person is detected. This technique is useful in achieving person localization in videos captured using any mobile camera platform where there is low temporal redundancy between frames. Our immediate application being the wearable Social Interaction Assistant, which aims to enhance the everyday social interaction experience of the visually impaired, we have been able to achieve near real-time person localization.

8

S. Panchanathan, N.C. Krishnan, S. Krishna, T. McDaniel, and V.N. Balasubramanian, “Enriched human-centered multimedia computing through inspirations from disabilities and deficit-centered computing solutions,” Proceeding of the 3rd ACM international workshop on Human-centered computing, Vancouver, British Columbia, Canada: ACM, 2008, pp. 35-42.
S. Panchanathan, S. Krishna, J. Black, and V. Balasubramanian, “Human Centered Multimedia Computing: A New Paradigm for the Design of Assistive and Rehabilitative Environments,” Signal Processing, Communications and Networking, 2008. ICSCN '08. International Conference on, 2008, pp. 1-7.
L. Gade, S. Krishna, and S. Panchanathan, “Person localization using a wearable camera towards enhancing social interactions for individuals with visual impairment,” Proceedings of the 1st ACM SIGMM international workshop on Media studies and implementations that help improving access to disabled users, Beijing, China: ACM, 2009, pp. 53-62.
B. Leibe, A. Leonardis, and B. Schiele, “Combined Object Categorization and Segmentation With An Implicit Shape Model,” In ECCV Workshop on Statistical Learning in Computer Vision, 2004, pp. 17-32.
F. Porikli and O. Tuzel, “Object Tracking in Low-Frame-Rate Video,” SPIE Image and Video Communications and Processing, vol. 5685, 2005, pp. 72-79.
Yuan Li, Haizhou Ai, T. Yamashita, Shihong Lao, and M. Kawade, “Tracking in Low Frame Rate Video: A Cascade Particle Filter with Discriminative Observers of Different Lifespans,” Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, 2007, pp. 1-8.
J. Kwon and K.M. Lee, “Tracking of Abrupt Motion Using Wang-Landau Monte Carlo Estimation,” Proceedings of the 10th European Conference on Computer Vision: Part I, Marseille, France: Springer-Verlag, 2008, pp. 387-400.
P. Viola and M.J. Jones, “Robust Real-Time Face Detection,” Int. J. Comput. Vision, vol. 57, 2004, pp. 137-154.
N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, IEEE Computer Society, 2005, pp. 886-893.
K. Nummiaro, E. Koller-Meier, and L. Van Gool, “An adaptive color-based particle filter,” Image and Vision Computing, vol. 21, Jan. 2003, pp. 99-110.
F. Porikli, “Integral histogram: A fast way to extract histograms in cartesian spaces,” In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 829-836.
S. Krishna, D. Colbry, J. Black, V. Balasubramanian, and S. Panchanathan, “A Systematic Requirements Analysis and Development of an Assistive Device to Enhance the Social Interaction of People Who are Blind or Visually Impaired,” Workshop on Computer Vision Applications for the Visually Impaired (CVAVI 08), European Conference on Computer Vision ECCV 2008, Marseille, France: 2008.
Q. Zhu, M. Yeh, K. Cheng, and S. Avidan, “Fast Human Detection Using a Cascade of Histograms of Oriented Gradients,” Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, IEEE Computer Society, 2006, pp. 1491-1498.
O. Tuzel, F. Porikli, and P. Meer, “Human Detection via Classification on Riemannian Manifolds,” Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, 2007, pp. 1-8.
R. Fergus, P. Perona, and A. Zisserman, “Object class recognition by unsupervised scale-invariant learning,” Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2003, pp. 264-271.
Changjiang Yang, R. Duraiswami, and L. Davis, “Fast multiple object tracking via a hierarchical particle filter,” Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, 2005, pp. 212-219 Vol. 1.
V. Philomin, R. Duraiswami, and L.S. Davis, “Quasi-Random Sampling for Condensation,” Proceedings of the 6th European Conference on Computer Vision-Part II, Springer-Verlag, 2000, pp. 134-149.
B. Leibe, E. Seemann, and B. Schiele, “Pedestrian detection in crowded scenes,” Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 2005, pp. 878-885 vol. 1.
M. Bertozzi, A. Broggi, R. Chapuis, F. Chausse, A. Fascioli, and A. Tibaldi, “Shape-based pedestrian detection and localization,” Intelligent Transportation Systems, 2003. Proceedings. 2003 IEEE, 2003, pp. 328-333 vol.1.
M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” Signal Processing, IEEE Transactions on, vol. 50, 2002, pp. 174-188.
M. Isard and A. Blake, “CONDENSATION conditional density propagation for visual tracking,” International Journal of Computer Vision, vol. 29, 1998, pp. 5-28.
S. Birchfield, “Elliptical Head Tracking Using Intensity Gradients and Color Histograms,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 1998, p. 232.
K. Okuma, A. Taleghani, N. De Freitas, O. De Freitas, J.J. Little, and D.G. Lowe, “A Boosted Particle Filter: Multitarget Detection and Tracking,” In ECCV, vol. 1, 2004, pp. 28-39.
V. Manohar, P. Soundararajan, H. Raju, D. Goldgof, R. Kasturi, and J. Garofolo, “Performance Evaluation of Object Detection and Tracking in Video,” Computer Vision – ACCV 2006, 2006, pp. 151-161.
H.G. Barrow, J.M. Tenenbaum, R.C. Bolles, and H.C. Wolf, “Parametric correspondence and chamfer matching: Two new techniques for image matching,” Proceedings of the 5th International Joint Conference on Artificial Intelligence, Cambridge, MA, 1977, pp. 659-663.
CASIA, CASIA Gait Database, http://www.sinobiometrics.com
F.C. Crow, “Summed-area tables for texture mapping,” Proceedings of the 11th annual conference on Computer graphics and interactive techniques, ACM, 1984, pp. 207-212.
T.L. McDaniel, S. Krishna, D. Colbry, and S. Panchanathan, “Using tactile rhythm to convey interpersonal distances to individuals who are blind,” Proceedings of the 27th international conference extended abstracts on Human factors in computing systems, Boston, MA, USA: ACM, 2009, pp. 4669-4674.
S. Krishna, T. McDaniel, and S. Panchanathan, “Haptic Belt for Delivering Nonverbal Communication Cues to People who are Blind or Visually Impaired,” 25th Annual International Technology & Persons with Disabilities, Los Angeles, CA: 25, 2009.
S. Krishna, N.C. Krishnan, and S. Panchanathan, “Detecting Stereotype Body Rocking Behavior through Embodied Motion Sensors,” Annual Conference of the Rehabilitation Engineering and Assistive Technology Society of North America, New Orleans, LA: 2009.












