
Machine Vision: Human Recognition using Kinect

Pablo Ambrosio Royo Rico, Universidad Autónoma de Chihuahua, Facultad de Ingeniería, Circuito No. 1 Nuevo, Maestría en Redes Móviles (MSc in Mobile Networks), Chihuahua, Chihuahua, 31125, México. royo1987@gmail.com - November 2011

Abstract
This paper presents a software implementation that detects a human body and lets the user interact with the computer through movements that can be customized by training the application; those movements open and control a variety of Windows programs, such as Windows Media Player. At the same time, it is intended to allow people with disabilities to control some features of the computer.

Keywords: Ubiquitous Computing, Kinect, Visual Studio, Human-Computer Interaction.

1. Introduction
Detecting humans in images or videos is a challenging problem due to variations in pose, clothing, lighting conditions and the complexity of backgrounds. There has been much research on human detection in the past few years, and various methods have been proposed. Most of this research is based on images taken by visible-light cameras, which is a natural approach, just as human eyes perform. Although many reports have shown that these methods can provide highly accurate human detection results, RGB-image-based methods encounter difficulties in perceiving the shapes of human subjects with articulated poses or when the background is cluttered. [1] Interactive human body recognition has applications including gaming, human-computer interaction, security, telepresence and healthcare. The task has recently been greatly simplified
Figure 1. From a single input depth image, a per-pixel body part distribution is inferred. (Colors indicate the most likely part labels at each pixel, and correspond in the joint proposals.) Local modes of this signal are estimated to give high-quality proposals for the 3D locations of body joints, even for multiple users. [2]

by the introduction of real-time depth cameras like the Kinect. [2]

Perceiving the motion of the human body is difficult. First of all, the human body is richly articulated: even a simple stick model describing the pose of arms, legs, torso and head requires more than 20 degrees of freedom. The body moves in 3D, which makes the estimation of these degrees of freedom a challenge in a monocular setting. Image processing is also a challenge: humans typically wear clothing which may be loose and textured, and part of the body is typically self-occluded. This makes it difficult to identify limb boundaries, and even more so to segment the main parts of the body. [3, 4]

2. Literature Review
The concept of human-computer interaction (HCI) was first presented by a group of professionals at the Association for Computing Machinery's Special Interest Group on Computer-Human Interaction Conference in 1992. The concept of HCI was adopted in the present study to define the domains of computer access for people with disabilities in terms of human factors (level of comfort and satisfaction with the overall operation) and system factors (movement time and accuracy). Human limitations related to computer interaction can be grouped into five categories:
1. Resource limitations refer to the inability of people to access the education and infrastructure that would improve their quality of life.
2. Learning limitations describe the lack of processing abilities among certain people, which interferes with their learning process. Such persons typically suffer from dyslexia and attention deficit disorders, among other limitations, and may require individualized course presentations.
3. Hearing limitations mean that people may experience varying degrees of auditory loss, ranging from slight hearing loss to deafness.
4. Visual limitations include low vision, colour blindness and blindness. People with these impairments have to rely heavily on other senses such as touch and sound.
5. Mobility limitations affect people stricken by certain illnesses or affected by accidents that deny them the full use of their limbs, who therefore have difficulty in holding and reaching for objects or moving around. [5]
There are two gaps in implementing computer-access treatment for students with multiple and

severe physical impairments. From the point of view of ergonomics, such students are often too physically impaired to activate mechanical input devices. Most of these students also have speech impairments, which further restrains them from accessing computers or enjoying information technology through sound-activated systems. Students with multiple impairments need direct-access, non-mechanical or non-handheld pointer interfaces (dialogue architecture) to use language-free applications (design approaches) for their learning and literacy needs. There is also a lack of comparative clinical studies evaluating the performance of students with severe disabilities who use non-handheld pointer interfaces. Ultimately, students with special needs will benefit from an evaluation of the efficacy of computer-access solutions, because this will affect their academic, communication, and recreational needs. [6]

3. Implementation
All the software was developed in Visual Studio 2010 with the Kinect SDK v1.0 Beta for a Windows environment. The first problem that arises when developing the application is the detection of a human body in the depth image of the Kinect [7]; for this part we need to get the video provided by the 3D depth sensors of the Kinect, using commands from the Microsoft.Research.Kinect.Nui library such as ConvertDepthFrame or depthImage.Source.
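In the beta SDK, each 16-bit value in the depth-with-player stream packs a 3-bit player index in its low bits and the depth in the remaining bits. The following Python sketch (a hypothetical reimplementation, not the SDK's actual C# code) illustrates the kind of unpacking and coloring a ConvertDepthFrame-style routine performs:

```python
# Sketch of a ConvertDepthFrame-style routine: each 16-bit value is assumed
# to carry a 3-bit player index in bits 0-2 and the depth in millimetres in
# bits 3-15, and is mapped to a 32-bit BGRA pixel coloured by player index.

# one BGR colour per player index (0 = no player detected at that pixel)
PLAYER_COLORS = [
    (64, 64, 64),    # 0: background
    (255, 0, 0),     # 1: blue
    (0, 255, 0),     # 2: green
    (0, 0, 255),     # 3: red
    (255, 255, 0),   # 4: cyan
    (255, 0, 255),   # 5: magenta
    (0, 255, 255),   # 6: yellow
    (255, 255, 255), # 7: white
]

def convert_depth_frame(depth16):
    """Turn a list of 16-bit depth+player values into 32-bit BGRA pixels."""
    out = []
    for value in depth16:
        player = value & 0x07          # low 3 bits: player index
        depth = value >> 3             # remaining bits: depth in mm
        # scale depth to a grey intensity so nearer pixels look brighter
        intensity = 255 - min(255, depth * 255 // 4000)
        b, g, r = PLAYER_COLORS[player]
        if player == 0:
            b = g = r = intensity      # no player: plain grey depth image
        out.append((b, g, r, 255))     # BGRA, fully opaque
    return out

# a tiny two-pixel "frame": one background pixel at 2 m, one player-1 pixel at 1 m
frame = [(2000 << 3) | 0, (1000 << 3) | 1]
pixels = convert_depth_frame(frame)
```

This is why the function's output can show each detected player in a distinct color while rendering the rest of the scene as a grey depth map.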

Figure 2. Diagram of the technologies in Kinect.

Once this point is covered, the need arises to show the structure (skeleton) of the body or bodies previously detected by the Kinect [8]. Here we work only with the images already obtained and processed, to identify each body part and show them on screen with functions such as GetBodySegment, GetDisplayPosition or NuiSkeleton2DataCoord (see Section 6 of this document for a description of these functions). All of these use a database of body parts and joints, so the functions only look in the depth images for body parts and assign a point to each joint of the body, as shown in Figure 3.a. Another task for this work was to save the entire sequence of movements of a specific activity, then identify it and assign a Windows command [9]. The idea is to first record a few movements; this can be done by saving a series of different body positions into a text file or a variable so they can be compared when the movement is performed, in three steps (initial, middle and final position), using functions such as DtwCaptureClick, Recognize, GestureRecognizer or AddOrUpdate. When a movement is recognized, the application launches a Windows program of your choice, such as Windows Media Player or Microsoft Word.

List of commands to execute when movement X is performed:
Movement 1. Media Player: Mplayer2.exe, then 2 Taps and an Enter (to play the first song).
Movement 2. Next: 13 Taps and an Enter.
Movement 3. Back: 11 Taps and an Enter.
Movement 4. Pause: 12 Taps and an Enter; if the player is paused, the same movement plays again.
Movement 5. Close Media Player: Mplayer2.exe /play /close.
Movement 6. Microsoft Word: winword.
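The record-then-compare scheme described above can be sketched with a minimal dynamic time warping (DTW) matcher in Python. All names here are illustrative stand-ins for the SDK sample's DtwCaptureClick/AddOrUpdate/Recognize functions, not the actual API:

```python
# Minimal DTW gesture-matching sketch: a recorded gesture is a sequence of
# poses (each pose a tuple of joint coordinates); a live sequence is
# recognized when its DTW distance to a stored gesture falls below a
# threshold. Hypothetical names, not the actual SDK/sample API.

def pose_distance(a, b):
    """Euclidean distance between two flattened pose vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw_distance(seq_a, seq_b):
    """Classic O(n*m) dynamic-time-warping distance between pose sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = pose_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

class GestureRecognizer:
    """Stores labelled gesture sequences; recognizes the nearest one."""
    def __init__(self, threshold=1.0):
        self.known = {}            # label -> recorded sequence of poses
        self.threshold = threshold

    def add_or_update(self, label, sequence):
        self.known[label] = sequence

    def recognize(self, sequence):
        best_label, best_dist = None, float("inf")
        for label, stored in self.known.items():
            d = dtw_distance(sequence, stored)
            if d < best_dist:
                best_label, best_dist = label, d
        return best_label if best_dist <= self.threshold else None

# record a "raise right hand" gesture as three 1-D hand heights:
# initial, middle and final position (cf. the three steps above)
rec = GestureRecognizer(threshold=0.5)
rec.add_or_update("raise_hand", [(0.0,), (0.5,), (1.0,)])
result = rec.recognize([(0.05,), (0.45,), (1.0,)])   # slightly noisy replay
```

Once `recognize` returns a label, dispatching the corresponding Windows command (e.g. launching Mplayer2.exe) is a simple lookup in a label-to-command table.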
Figure 3. a) The video provided by the Kinect, like a digital camera; b) the human skeleton obtained from depth-image processing; c) the depth image provided by the Kinect, programmed to detect bodies.
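Drawing the joints on screen, as in Figure 3.b, requires mapping each skeleton joint into display pixel coordinates, which is the job GetDisplayPosition performs. A minimal Python sketch of that mapping, assuming (hypothetically) that joints have already been converted to normalized depth-image coordinates in [0, 1] on each axis:

```python
# Hypothetical sketch of the kind of mapping GetDisplayPosition performs:
# a joint arrives in normalized (0..1) depth-image coordinates (assumed
# here; the real function uses SDK helper calls) and is scaled to integer
# pixel coordinates in the display image.

DISPLAY_WIDTH, DISPLAY_HEIGHT = 640, 480

def get_display_position(joint_xy, width=DISPLAY_WIDTH, height=DISPLAY_HEIGHT):
    """Map a normalized (x, y) joint position to integer pixel coordinates."""
    x, y = joint_xy
    # clamp so joints at the edge of the sensor's view still map inside
    x = min(max(x, 0.0), 1.0)
    y = min(max(y, 0.0), 1.0)
    return int(x * (width - 1)), int(y * (height - 1))

center = get_display_position((0.5, 0.5))   # a joint in the middle of the frame
```

With one such pixel position per joint, GetBodySegment then only has to draw line segments between the joint pairs that form each bone.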

4. Results
In the end, the application can recognize one or two human bodies, record their movements and identify when they are performed; depending on the movement, the software runs a program or action in Windows.

There are some limitations that must be considered when using the software. If you record a movement, such as moving your right arm up and down while one leg is extended, you have to be in the same position at recognition time. If two human bodies are recognized, only the joint movement of both will perform the action, and if one person moves out of range of the camera, the software will not be able to recognize the movement until he/she comes back. See Figure 4.

Figure 5. The learned connectivity map of actions, poses, and objects using the sports dataset. Thicker lines indicate stronger connections, while thinner lines indicate weaker ones. [10, 11]

The Kinect, as a cost-effective scanning solution, could make 3D scanning technology more accessible to everyday users and turn 3D shape models into a much more widely used asset for many new applications, for instance in community web platforms or online shopping. [12]
Figure 4. This is the finished system screen.

Another feature is the ability to save, load and rewrite movements; that is, if you are a new user, you can save your own movements and load and use them at any time, and this can be done for any number of users.

5. Conclusions and future research
The most difficult part was understanding the library that Microsoft provides for managing the Kinect, because the SDK is in a beta state and there are not many books or much online support for this tool. The challenge for this research is that the Kinect is not yet able to differentiate between the human body and large objects that interact with it (such as books, balls, accessories and chairs). Future work will address the detection of a human body and the objects it is interacting with.

6. Functions Explanations
ConvertDepthFrame: Converts a 16-bit grayscale depth frame, which includes player indexes, into a 32-bit frame that displays different players in different colors.
GetBodySegment: Works out how to draw a bone line for the given Joints.
GetDisplayPosition: Gets the display position (i.e. where in the display image) of a Joint.
NuiSkeleton2DataCoord: Runs every time the 2D coordinates are ready.
DtwCaptureClick: Starts a countdown timer to enable the player to get in position to record gestures.
AddOrUpdate: Adds a sequence with a label to the known-sequences library. The gesture MUST start on the first observation of the sequence and end on the last one. Sequences may have different

lengths, or updates a sequence that has been previously added.
Recognize: Recognizes a gesture in the given sequence. It will always assume that the gesture ends on the last observation of that sequence. If the distance between the last observations of each sequence is too great, or if the overall DTW distance between the two sequences is too great, no gesture will be recognized.

7. References
[1] Lu Xia, Chia-Chih Chen and J. K. Aggarwal, "Human Detection Using Depth Information by Kinect", Austin, TX, USA.
[2] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman and Andrew Blake, "Real-Time Human Pose Recognition in Parts from Single Depth Images", Cambridge, UK.
[3] Yang Song, Xiaolin Feng and Pietro Perona, "Towards Detection of Human Motion", Pasadena, CA, USA.
[4] Yang Song, Luis Gonzales and Pietro Perona, "Unsupervised Learning of Human Motion Models", Pasadena, CA, USA.
[5] Paula Kotzé, Mariki Eloff, Ayodele Adesina-Ojo and Jan Eloff, "Accessible Computer Interaction for People with Disabilities: The Case of Quadriplegics", Pretoria, South Africa.
[6] David W. K. Man and Mei-Sheung Louisa Wong, "Evaluation of Computer-Access Solutions for Students With Quadriplegic Athetoid Cerebral Palsy", Kowloon, Hong Kong.
[7] Christian Plagemann, Varun Ganapathi, Daphne Koller and Sebastian Thrun, "Real-time Identification and Localization of Body Parts from Depth Images", Stanford, CA, USA.

[8] Grégory Rogez, Jonathan Rihan, Srikumar Ramalingam, Carlos Orrite and Philip H. S. Torr, "Randomized Trees for Human Pose Detection", Zaragoza, Spain and Oxford, UK.
[9] Domitilla Del Vecchio, Richard M. Murray and Pietro Perona, "Decomposition of Human Motion into Dynamics-Based Primitives with Application to Drawing Tasks", Pasadena, CA, USA.
[10] Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas and Li Fei-Fei, "Human Action Recognition by Learning Bases of Action Attributes and Parts", Stanford, CA, USA.
[11] Bangpeng Yao, Aditya Khosla and Li Fei-Fei, "Classifying Actions and Measuring Action Similarity by Modeling the Mutual Context of Objects and Human Poses", Stanford, CA, USA.
[12] Yan Cui and Didier Stricker, "3D Shape Scanning with a Kinect", Kaiserslautern, Germany.