
NATIONAL COLLEGE OF ENGINEERING

(Affiliated to Tribhuvan University)


Talchhikhel, Lalitpur

[Subject Code: CT…..]


A Major Project Proposal On

“Hand Gesture Based Natural User Interface”

Submitted by

Anusha K.C. 074/BCT/102


Maharashi Rajbhandari 074/BCT/118
Pratik Shrestha 074/BCT/126
Yagyan Munankarmi 074/BCT/147

Submitted to
Department of Computer and Electronics Engineering
26th May 2021
Acknowledgement

We have put substantial effort into this project. However, it would not have been possible
without the kind support and help of many individuals and organizations. We would like to
extend our sincere thanks to all of them.
We are highly indebted to National College of Engineering for their guidance and constant
supervision as well as for providing necessary information regarding the project & also for
their support during the project.
We would like to express our gratitude towards our parents & members of NCE for their
kind co-operation and encouragement which helped us in the completion of this project.
We would like to express our special gratitude and thanks to our Head of Department for
giving us ample attention and time.
Our thanks and appreciations also go to our colleagues in developing the project and people
who have willingly helped us out with their abilities.

Abstract

Gesture recognition has become an emerging topic in today's technology. A gesture is a
movement of the body, or a part of the body, that conveys meaning to another person.
Gestures are thus a natural and intuitive way to interact and communicate. The main focus
of this paper is to interpret human gestures and develop a natural user interface system.
Since humans communicate not only through speech but also through the gestures they
make, enabling gesture input with computer vision offers a more natural way for a human
to interact with a system while requiring next to no physical contact with devices. This is
particularly useful because the user does not lose focus searching for system controls
during tasks that demand full attention. The proposed system is real-time software that
takes input from a web camera, obtains hand landmarks using a detection model, and tracks
their positions. The landmarks associated with each image are then used to interpret the
user's gesture, which the system translates into control actions appropriate to the scenario
and the user's intent. Because the system is operated using gestures alone, no physical
contact is required and there is no steep learning curve to overcome. The system encourages
the use of the hands as an input device, creating a natural human-computer interaction that
remains user-friendly and makes the device easier to navigate. Thus, in this paper, we
propose a system able to control system configuration events such as mouse movement,
brightness control and volume control.

Keywords: Natural User Interface, computer vision, hand landmark, gestures

Contents
List of Tables ...................................................................................................................... iv
List of Figures ...................................................................................................................... v
List of Abbreviations .......................................................................................................... vi
1 Introduction .................................................................................................................. 1
1.1 Background ........................................................................................................... 1
1.1.1 Evolution of Topic & Algorithm ................................................................... 2
1.1.2 Existing system with their weakness ............................................................. 5
1.1.3 Customer’s Perspective about existing system .............................................. 7
1.1.4 Comparison of existing system ...................................................................... 8
1.2 Problem statement ................................................................................................. 9
1.3 Aim ...................................................................................................................... 11
1.4 Objective ............................................................................................................. 11
2 Literature Review....................................................................................................... 12
3 Methodology .............................................................................................................. 20
3.1 Conceptual model................................................................................................ 21
4 Epilogue ..................................................................................................................... 22
4.1 Expected Output .................................................................................................. 22
4.2 Gantt chart ........................................................................................................... 23
5 References .................................................................................................................. 24

List of Tables
Table 1.1.1 Comparison of existing systems ........................................................... 8

List of Figures
Figure 2.1 Data flow diagram of the algorithm ................................................................. 12
Figure 2.2 Block diagram of the system ............................................................................ 15
Figure 3.1 General Overview of the System...................................................................... 21
Figure 3.2 Video Processing Module................................................................................. 21
Figure 4.1 Gantt chart ........................................................................................................ 23

List of Abbreviations
CNN Convolutional Neural Network
EMG Electromyography
HCI Human Computer Interface
HSV Hue-Saturation-value
MATLAB Matrix Laboratory
NUI Natural User Interface
OpenCV Open Source Computer Vision Library
POV Point of View
RFID Radio Frequency Identification
RGB Red Green Blue
RGB-d Red Green Blue-depth
ROI Region of Interest

1 Introduction
1.1 Background
Although touch-based input has widespread use today as a mode of interaction with
computer-based systems, there are situations where it fails to meet the need for interaction
without physical contact with an input device. A user may need to interact with devices
without any physical contact, for example when hygiene must be maintained, or when the
user wishes to operate the device without losing focus while handling delicate equipment
in critical scenarios (not wishing to redirect their sight towards the controls).

As a general trend, the way users interact with devices today is also shifting towards more
effective and intuitive input methods, such as voice commands driven by natural language
processing and touchless controls that require no physical contact. The industry likewise
appears to be moving towards research and development of devices that offer fewer
physical interfaces and more interaction-based interfaces built around sensors and other
visual inputs.

Computer vision is an interdisciplinary field that deals with how computers can be made to
gain high-level understanding from digital images or videos. As an alternative to existing
touch and voice-command-based input, gesture-based input provides a more dynamic way
of communicating with a device. Gestures are expressive, meaningful body motions that
convey information or allow interaction with the environment. Enabling gesture input with
computer vision ensures next to no physical interaction with devices without compromising
real-time usability. In addition, it requires very little hardware (usually a single camera
unit), which detects the user's commands regardless of their position with respect to the
device, as long as the user remains within the POV of the input sensor. This also removes
the complications that arise with physical input devices, which sometimes demand a
learning curve before the user becomes comfortable with them. The system encourages the
use of the hands as an input device, creating a natural human-computer interaction that
remains user-friendly and makes the device easier to navigate.

Gesture recognition pertains to recognizing meaningful expressions of motion by a human,
involving the hands, arms, face, head and/or body. Its applications are manifold, including
sign-language recognition, robot control and virtual object interaction. Among these, the
use of computer vision for human-computer interaction allows interaction that is more
natural, optimal, contact-free and non-intrusive. Gesture recognition can thus be an
immersive approach that uses computer vision to understand the user's hand gestures and
perform the applications the user desires.

In our proposed system, the process flow is as follows: a video feed is captured through a
webcam and processed to detect the user's hands. The detected hand region is then passed
to a model that locates hand landmarks, and the gestures captured in the feed are interpreted
from those landmarks. Hand landmarks are the features extracted from the images that
mark important parts of the hand, such as the fingers, the finger joints and the palm. The
system thus creates a virtual interface that is as interactive as other input units, and
sometimes even more user-friendly. Further, applications can be customized for even more
immersive and intuitive interaction.

1.1.1 Evolution of Topic & Algorithm


Latest status of the topic and algorithms

One of the easiest and most effective methods of interaction between humans and machines
is the use of gestures. Gesture recognition allows people to learn and interact with devices
in an interactive manner, so the field is constantly evolving and improving. In the early
1960s, researchers focused on the capability of using touchscreens and special pens to
capture and digitize handwriting. Later, in 1969, engineer Myron Krueger started research
in which people would be able to interact with devices through their environment without
any need for screens. This was later termed the Natural User Interface (NUI).

In the late 1980s, motion-capture technology based on gloves fitted with sensors was
invented. The gloves carried multiple sensors to detect the motion and position of the
fingers and the palm of the hand. In 1983, the first glove able to identify hand positions,
aimed at creating alphanumeric characters, was patented, and four years later, in 1987, the
first virtual reality gloves were introduced to the market. After this invention, in the 1990s
people started using the technology to detect and identify hand gestures. These gloves are
known as data gloves and are divided into two types: active and passive. Active data gloves
use sensors or accelerometers to detect motion and position and are connected to the
computer by wires or cables. Passive data gloves have no electronics attached; instead, the
glove marks regions of the hand with specific colors, and these colors are used with image
identification to detect the motion and position of the hand.

Electromyography (EMG) is a technique that evaluates and records the electrical activity
produced by skeletal muscles. EMG sensors measure the electrical potential produced by
muscle movement, which can be used to detect muscle motion for gesture recognition. By
the late 1980s the electrodes required for EMG had advanced to the point where mass
production was possible and the hardware was lightweight enough for practical use.
Advances in EMG have focused on its implementation in the medical field rather than on
gesture recognition, and have also allowed it to sense isometric muscular activity.
Researchers are now trying to use EMG to control prostheses or to generate control signals
for electronic devices.

Ultrasonic waves have also been considered for detecting motion and recording hand
gestures. Two techniques have been implemented: the first uses ultrasound images to detect
gestures, while the second is based on the Doppler effect to detect the motion and position
of the hands. Studies using ultrasound to measure muscle contraction were carried out by
Hodges in 2003. In 2013, Mujibiya published a paper that used a set of ultrasonic
transducers fitted around the forearm and fingers to detect their position and motion. In
2015, Hettiarachchi was able to detect six different gestures using ultrasonic transducers,
and in 2017 McIntosh produced a system that successfully detected and monitored a set of
finger gestures using probes worn snugly on the forearm.

Studies on the use of Radio Frequency Identification (RFID) for gesture recognition played
a vital role in developing and advancing gesture-based device control. RFID systems
operate at ultra-high frequencies and are capable of detecting tags within a certain
boundary; these identification tags can be used to detect the motion and position of the
hands. The benefit of RFID is that the identification tags are passive, meaning they require
no power and are inexpensive to produce.

After major advancements in small RGB and RGB-d cameras, these devices are now widely
used for gesture capture and recognition. Modern cameras capture light on three spectra:
red (R), green (G) and blue (B). An RGB-d camera is an advancement of the RGB camera
that can also sense depth, using a fourth channel to measure the distance of objects in the
frame. Using RGB-d cameras, or only RGB cameras combined with computer vision, it is
possible to recognize gestures and use them as input to a system. RGB-d cameras can
produce 3D images using three distinct techniques: stereoscopic vision, structured light and
time of flight. Various research efforts are under way that combine images from these
cameras with computer vision to detect hand landmarks without any other sensors. Using
computer vision, researchers are able to detect not only hand position but also human body
posture and classify it. In 2003, Cohen and Li were able to classify body postures with a
support vector machine technique applied to 3D visual hulls. In 2011, Raptis et al. presented
a paper on real-time classification of dance gestures from skeleton animation using
computer vision.

How it benefits the research area and society

• Implementing hand gestures for controlling devices makes interaction between devices
and humans more intuitive and efficient.
• The system can be implemented in fields where physical contact with the devices is not
possible, for example during surgery in the medical field, on construction sites, or while
repairing equipment.
• Since gestures are a primitive form of communication, they are easy for new users to
learn and adopt, and allow users to interact with systems without any additional hardware.
• Gesture recognition and control systems can be used for demonstrations in interactive
learning.
• They can be used in the entertainment sector, for example in games, to simulate realistic
interactions.
• Gesture recognition and control systems can be further improved to detect and translate
sign language.
• They can be implemented in virtual environments to mediate interaction between the user
and the computer.

1.1.2 Existing system with their weakness
Gesture recognition has several fields of application, among them virtual object interaction,
sign language recognition and interpretation, robot control and device interaction. Other
applications, as per Wachs et al. [1], are medical assistive systems, crisis management and
human-robot interaction. Multiple such systems have been deployed and serve their purpose
well.

Focusing on device interaction in particular, various implementations of the concept exist;
for example, the use of self-growing and self-organized neural networks for hand gesture
recognition by Stergiopoulou et al. [2]. Another implementation is an input device called
UbiHand [3] that uses a miniature wrist-worn camera to track finger position. An interactive
screen developed by The Alternative Agency, the Orange screen, allows interaction by
moving the hands in front of a window without the need to touch it. Hand-gesture
recognition has also been applied to navigation tasks such as viewing photographs on a
television and interacting with slideshow presentations in PowerPoint. SixthSense is a
system that turns any surface into an interactive one, using hand gesture recognition to
interact with the system; it relies on colored markers worn on the fingers to detect gestures.
Other implementations propose a gesture recognition system capable of providing a
contactless controller via depth-based head tracking. Similarly, systems that recognize
gestures through motion modeling, motion analysis, pattern recognition and machine
learning for handling a mouse pointer are available, and implementations that focus on
simple recognition algorithms using shape-based features for gesture identification have
also been used.

All of the mentioned implementations, and many others, are quite impressive and work as
well as any other input system would; in simpler words, these systems meet their targets
and function as intended, each providing some form of interaction with a system. However,
the need remains for a system that is more subtle and simply exists, without appearing as
yet another physical extension of a modular system or ecosystem. Most, if not all, current
implementations of gesture recognition share the same set of problems: they require an
extra set of hardware components or other sophisticated parts to work. Some
implementations are also economically costly, and the systems are designed in a system-
centered way, making them difficult to implement in different settings. They therefore fail
to achieve the goal of providing intuitive, more natural interaction with a device.

Weaknesses of current systems are:

• Costly to implement.
• Require additional hardware beyond what a normal user already owns.
• Difficult to manufacture due to the inclusion of sophisticated components.
• Not flexible across different scenarios.
• Require physical control of the intended device.
• May require the user to follow a learning curve for effective use.

1.1.3 Customer’s Perspective about existing system

• Although there have been advances in hand gesture recognition, many of these systems
require additional hardware and sensors to implement.
• Existing systems are expensive due to the need for additional hardware.
• Existing systems using computer vision recognize gestures based on labeled images rather
than hand landmarks.
• Existing systems are not intuitive or easy to use in daily life.

1.1.4 Comparison of existing system

Table 1.1.1 Comparison of existing systems

Gloves: Gloves fitted with sensors detect the motion and position of the fingers and palm
of the hand.
Ultrasonic: Ultrasonic waves are used to detect motion and hand gestures, with ultrasonic
transducers fitted to the hands to capture gestures and hand movements.
RFID: Uses ultra-high frequency readers capable of detecting tags within a certain
boundary.

Gloves: Categorized as active and passive data gloves.
Ultrasonic: Implemented using two techniques: ultrasonic imaging and the Doppler effect.
RFID: Requires tracking RFID tags and measuring received signal strength to trace the
user's gesture.

Gloves: Movement is limited, as active gloves need to stay connected.
Ultrasonic: Wide range of movement, as no sensors are attached to the user.
RFID: Uses inexpensive, passive identification tags.

Gloves: Range depends on the length of the connectors.
Ultrasonic: The user has to be within a certain range of the device.
RFID: Range depends on the configuration of the antennas in the system.

Gloves: Glove operation is immune to wireless interference.
Ultrasonic: Ultrasonic waves may be interfered with by metallic objects.
RFID: Radio frequencies may be interfered with by metallic objects.

1.2 Problem statement
Effective communication between humans and computers has always been a challenge.
People have spent more than half of the past century experimenting with ways to interact
with computers in order to achieve more efficient and intuitive interfaces. With the
development of ubiquitous computing, current interaction approaches based on the
keyboard, mouse, touch surfaces and pen are no longer sufficient. Because of the limitations
of these devices, interaction with the system is restricted to a handful of commands a user
can perform. There are further limitations in the current interaction systems: new users must
learn a certain set of instructions, or already have a general idea of how to operate the
devices, and the systems require physical contact, which also prevents use by handicapped
people. An interaction system is therefore needed that provides a more natural way to
communicate with the system and is intuitive, efficient and easy to use.

Although touch-based interfaces are in widespread use today, there are various situations
in which the user needs to interact with devices without any physical contact, such as when
hygiene must be maintained, or when the user wishes to operate the device without losing
focus while handling delicate equipment in critical scenarios. A system that can directly
interpret hand gestures as input should therefore provide a natural interaction platform.

Touchless interfaces, together with gesture controls, are becoming widely popular as they
make it possible to interact with devices without physically touching them. There are many
applications where hand gestures can be used to interact with systems, such as video games
and the control of medical equipment. One example of a touchless interface is the use of a
smartphone's Bluetooth connectivity to activate a company's visitor management system,
which allows devices to be used without touching an interface while working in a
contagious environment. Hand gestures can also be used by handicapped people to interact
with systems. However, currently used hand gesture systems have limitations: they cannot
be deployed easily because the initial investment is high due to the cost of additional sensors
and other expensive hardware. Even when the requirements for a gesture recognition
system are met, it remains challenging for users to become familiar with these systems, and
the high maintenance cost remains a major drawback.

The main objective of our research is to overcome the major drawbacks discussed above
by using the device's pre-existing webcam for gesture capture, cutting out the cost of extra
hardware and reducing the complexity of the system. This study specifically proposes a
real-time natural user interface based on hand gesture recognition to control events such as
mouse cursor movement and clicking, volume control and brightness control.

1.3 Aim
The aim of this project is to build a gesture recognition system that can control the mouse
pointer by tracking hand movements and can adjust features such as volume and brightness
using hand landmarks.

1.4 Objective
The main objectives of the project are:

• To develop an intuitive and interactive gesture-based control system.


• To develop a system that is able to detect hand landmarks using computer vision in
real-time.
• To use computer vision for gesture recognition and movement tracking of hands.
• To design a Natural User Interface that only utilizes a camera for input.

2 Literature Review
Rachit Puri [4] implemented maneuvering of the mouse pointer and triggering of mouse
operations, such as right click and left click, using gesture recognition techniques. The
author implemented the system in MATLAB. The proposed approach requires the fingers
being used to wear colored caps. The author detects the number of target colors (regions of
interest) and triggers mouse events based on the gesture formed by the fingers wearing the
target color.

Initially, a snapshot of the hand in front of the camera is taken. The user then selects the
colored cap on the finger that will be tracked during gesture formation; any available cap
color can be used on any finger, since the tracking color is selected from the snapshot. The
system uses the YCbCr color model instead of RGB because of its luminance-independent
property: Cb and Cr are the blue-difference and red-difference chroma components, and Y
is the luminance, which allows light intensity to be encoded non-linearly using gamma.
After the snapshot is taken, the system divides the process into several steps; Figure 2.1
shows the data flow.

Figure 2.1 Data flow diagram of the algorithm

Selection of RGB is the first step, in which the system determines the RGB value of the
color cap selected by the user from the snapshot, to be used for tracking and gesture
recognition. YCbCr conversion is the next step, in which the obtained RGB value of the
cap is converted into YCbCr. Region of Interest (ROI) determination is then performed:
the system detects the ROIs in the real-time video and determines their relative positions
to trigger mouse events. To determine an ROI, the system converts each pixel of the frame
into Cb and Cr and compares these values with the value obtained from the snapshot. After
detecting the ROIs, their relative positions are stored as pixel values. Each ROI is reduced
to a single pixel value by taking the middle pixel of the ROI, which allows smooth cursor
movement and gesture recognition. Scale conversion is performed because the camera
resolution and the screen resolution differ. After determining the X and Y values of the
ROI, the X value is mirrored, since the direction of cursor movement seen by the webcam
is opposite to the hand movement; before mirroring, the axes of the camera and the screen
are aligned. Finally, mouse events are triggered based on the number of detected ROIs and
their relative positions. The system allows multiple mouse events to be triggered by
tracking the relative positions of different ROIs, and the cursor is moved by tracking the
position of the ROI only when a single ROI is present in front of the camera.
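
As an illustration only, not the author's MATLAB code, the following Python/OpenCV sketch shows how the chroma-based cap tracking and screen-scale conversion described above could look; the reference chroma values, the tolerance and the screen resolution are assumptions:

import cv2
import numpy as np

TARGET_CR_CB = (170, 100)            # assumed Cr/Cb of the selected cap color (taken from the snapshot)
TOLERANCE = 15                       # assumed chroma tolerance
SCREEN_W, SCREEN_H = 1920, 1080      # assumed screen resolution

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    cr = ycrcb[:, :, 1].astype(int)
    cb = ycrcb[:, :, 2].astype(int)
    # Pixels whose chroma lies close to the stored snapshot value form the ROI
    mask = (np.abs(cr - TARGET_CR_CB[0]) < TOLERANCE) & (np.abs(cb - TARGET_CR_CB[1]) < TOLERANCE)
    ys, xs = np.nonzero(mask)
    if len(xs) > 0:
        cx, cy = int(xs.mean()), int(ys.mean())        # reduce the ROI to a single mid pixel
        h, w = frame.shape[:2]
        screen_x = SCREEN_W - int(cx * SCREEN_W / w)   # mirror X, then scale to screen resolution
        screen_y = int(cy * SCREEN_H / h)
        print(screen_x, screen_y)                      # a real system would move the cursor here
    cv2.imshow("mask", mask.astype(np.uint8) * 255)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()

The mirroring of the X coordinate reproduces the observation above that cursor movement in the webcam image is opposite to the hand movement.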

David J. Rios-Soria and team [5] performed a study on hand gesture recognition with the
help of computer vision and successfully detected six different gestures. The paper uses
real-time images from an RGB camera for hand detection and gesture recognition and
implements the system with OpenCV and Python. The authors favor computer vision over
sensors for gesture recognition because of its simplicity and ease of use. The first task is to
separate the hand from its background; the paper focuses only on detecting the hand against
a background that does not contain the user's entire body. The authors take an augmented
reality setting in which the camera is placed in a headset and sees the hand from the user's
POV. They suggest two approaches to separating the hands from the background: the
system can either compare subsequent video frames to detect object movement and
separate the moving hand from the background, or use a skin-color filter to differentiate
between the hand and the background. The authors prefer the latter solution, which provides
better separation of the hands and is only susceptible to errors when the skin is too pale or
too dark. For hand separation, the frames are passed through multiple filters so that the
system can recover parts of the hand that are overexposed or covered by shadow.

The paper uses skin-color filtering, which has proved to be a useful and robust technique
for face detection and tracking. The filter relies on a metric that measures the distance
between a colored pixel and a defined value representing the skin tone. The filtering is
implemented in the YCbCr color space instead of RGB because YCbCr is luma-
independent, which results in better performance. After the hand is separated from the
background, an edge detection algorithm is applied to the image, providing a distinct
boundary between the hand and the background. The paper suggests two methods, template
matching and the differential gradient method, each of which determines the magnitude of
the intensity gradient (g). Using the value of g, a set of contour pixels is determined, and
these points are connected to form the edge of the hand. The paper detects only gestures
corresponding to the numbers zero to five, which makes it insensitive to whether the left or
the right hand is used. The system detects these gestures by identifying the finger peaks in
the frame through the convex hull of the hand edge. The convex hull is a shape descriptor,
the smallest convex set that contains the edge, and is used to simplify the complex shapes
the hand can form. The module iteratively finds and eliminates concave regions by
examining pixel values along arbitrary straight segments whose end points lie on the hand
edge. The convex hull is then compared with the hand edge to detect the defects. From the
defect depths an average depth is calculated, and these values together with the total length
of the hand are used to count the number of raised fingers and recognize the gesture.
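
A minimal sketch (not the paper's exact code) of counting raised fingers from a skin-color mask and convexity defects, in the spirit of the approach above, is given below; the YCrCb skin range and the defect-depth threshold are assumed values:

import cv2
import numpy as np

LOWER_SKIN = np.array([0, 135, 85], dtype=np.uint8)     # assumed lower Y, Cr, Cb bound
UPPER_SKIN = np.array([255, 180, 135], dtype=np.uint8)  # assumed upper bound

def count_fingers(frame_bgr):
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, LOWER_SKIN, UPPER_SKIN)   # skin-color filter
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0
    hand = max(contours, key=cv2.contourArea)           # assume the largest blob is the hand
    hull = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull)
    if defects is None:
        return 0
    deep = 0
    for start, end, far, depth in defects[:, 0]:
        if depth / 256.0 > 30:                          # keep only deep defects (gaps between fingers)
            deep += 1
    return deep + 1 if deep > 0 else 0                  # number of raised fingers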

Grif, H.-S., & Turc, T. [6] in their paper propose to improve the recognition of human hand
postures in a human-computer interaction application while simultaneously reducing the
computing time and taking into account the comfort of the hand gestures used by the end
user. The authors' intent of combining a proposed algorithm with other hand features, such
as hand pad color, appears to yield promising behavior in terms of computing time. In
addition, the system is reported to work under low illuminance levels as well as under high
illuminance levels.

According to the authors, the process flow of the system consists of three blocks: an image
acquisition block for image capture, a hand tracking and hand posture/gesture recognition
block, and a mouse cursor control block, which suggests the system relies on the proposed
application rather than on external hardware. The system uses an external webcam for
image acquisition: the image of the user's hand pad area is acquired and provided to an
application developed in Visual C++ 2008 using the OpenCV library, which processes the
acquired image to calculate the hand position, identify the gestures presented to it, and
translate them into cursor movements and mouse events (clicks).

Figure 2.2 Block diagram of the system

In the proposed system, a set of hand postures/gestures is predefined, and a gesture is
considered to be a succession of two hand postures. After image acquisition, hand tracking
and hand posture/gesture recognition begin: frames obtained from the webcam are
converted from the RGB model to the HSV model, and a threshold is applied to the HSV
image, considering mainly the H component. The thresholded image is black and white and
serves to separate the hand from its surroundings, where white pixels correspond to the
hand and black pixels to the background. Two hue values are predefined by default,
Hmin = 0 and Hmax = 30, and the hue value of each pixel is used to decide whether it
belongs to the background or lies between the mentioned constants; a blue sheet of paper
is used as the hand pad for the background. Thresholds for the S and V components are also
present and can be adjusted. Two morphological operations are then performed, a closing
operation followed by an opening operation, to filter out unwanted black and white pixels.
A hand-angle feature is then used to recognize the hand postures, with a specific interval of
angle values corresponding to each posture. For this, the algorithm locates three pixel
coordinates, the highest hand pixel and the extreme left and right pixels of the hand, and
measures the angle between the line segments whose intersection point is the highest hand
pixel. Finally, the pixel coordinate of the middle finger is used to determine the hand
position. However, due to the presence of noise, the application suffers from instability of
the mouse position, i.e., mouse shaking when the hand movement stops, especially in low-
light conditions. To reduce the shake, the position of the mouse cursor is taken as the mean
of five positions: the last four positions of the mouse and the currently detected position of
the hand.
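
The sketch below illustrates, under assumptions, the posture segmentation and cursor smoothing steps described above: the hue bounds mirror the defaults mentioned in the paper (Hmin = 0, Hmax = 30), while the S/V bounds and the kernel size are assumptions of ours:

import cv2
import numpy as np
from collections import deque

history = deque(maxlen=5)                                   # recent cursor positions for smoothing

def segment_hand(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 40, 40), (30, 255, 255))    # threshold mainly on the Hue channel
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # closing fills small holes
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # opening removes isolated specks
    return mask

def smoothed_cursor(detected_xy):
    # Average the last few cursor positions with the newly detected hand position
    # to damp the "mouse shaking" noted in the paper.
    history.append(detected_xy)
    xs = [p[0] for p in history]
    ys = [p[1] for p in history]
    return sum(xs) // len(xs), sum(ys) // len(ys)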

Thakur, S., Mehra, R., & Prakash, B. [7] in their paper describe a vision-based interface
for operating a computer mouse via 2D hand gestures that relies on a camera-based color
detection technique. The authors intend to use a web camera to build a Human Computer
Interface (HCI) system that is cost-effective while remaining reliable and efficient. After
acquiring images of the hand gestures, the proposed system processes the images in
MATLAB to find the centroid of the hand in each image, and the position of the centroid
is then used for cursor movement. A still image of the fully opened hand is used as a
reference to compare finger lengths against frames in which the index or middle finger is
folded, which implements the left- and right-click functionality. To improve the efficiency
of hand tracking, red and blue colored caps are worn on the fingers.

Initially, real-time video acquisition is performed, and image frames are extracted from the
video and processed individually; each frame is represented as an (m×n) matrix of the
defined resolution, where each element is a (1×3) matrix of RGB channel values. Each
frame is then flipped horizontally using MATLAB's flipping function. Extraction of the red
and blue components is then performed: since the image is initially in RGB, a subtraction
method is used in which a grayscale image is prepared and subtracted from the red-band
image and the blue-band image individually, producing red and blue components of the
image in a grayscale representation. A median filter is then applied to remove noise from
the components. The grayscale images are converted to binary images by setting a
threshold, because MATLAB's region-property functions operate on binary images.
Unwanted small objects detected during video acquisition are then removed, so that a single
object (the finger) remains in each component image. The system then proceeds to centroid
detection, in which a labeled matrix of all objects is computed and several properties are
calculated: the centroid of the red object and the major-axis length of both the red and blue
object regions. A built-in MATLAB function is applied to the red and blue object regions
in the binary image, which surrounds the main object with a rectangular bounding box in
the output of the current frame. The centroid from the bounding box of the red component
is mapped to the cursor, which is done by accessing the mouse driver through Java
integrated with MATLAB. Blue and red colors are used to implement the left and right
clicks, respectively. The user's hand is initially held fully open to acquire the maximum
finger length (major-axis length) during the first frame, which is later used for comparison;
threshold lengths are then set and compared with the folded finger lengths in later frames
to detect a click.
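
The paper implements these steps in MATLAB; purely for illustration, the sketch below reproduces the red-component extraction, filtering and centroid computation in Python with OpenCV, with assumed threshold and minimum-area values:

import cv2
import numpy as np

def red_centroid(frame_bgr):
    frame_bgr = cv2.flip(frame_bgr, 1)                      # mirror the frame horizontally
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    red = frame_bgr[:, :, 2]
    diff = cv2.subtract(red, gray)                          # red band minus grayscale image
    diff = cv2.medianBlur(diff, 5)                          # median filter removes noise
    _, binary = cv2.threshold(diff, 40, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = [c for c in contours if cv2.contourArea(c) > 300]   # drop small unwanted objects
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    m = cv2.moments(largest)
    return int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])      # centroid (x, y) of the red cap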

Abhishek B. and group [8] have presented a paper on hand gesture recognition using
machine learning algorithms. The paper describes a system whose aim is a vision-based
hand gesture recognition approach with a high correct-detection rate and a high
performance criterion, able to work in a real-time human-computer interaction system
without imposing limitations (gloves, uniform background, etc.) on the user environment.
The system is defined by a flowchart containing three main steps: learning, detection and
recognition. Learning involves the training dataset and feature extraction; detection
involves capturing the scene through a web camera, preprocessing and hand detection; and
recognition involves gesture recognition and performing the action. The implementation is
divided into four main steps, image enhancement and segmentation, orientation detection,
feature extraction and classification, but it is limited by changes in color and lighting
conditions. The hand gesture recognition itself involves three main steps: segmentation,
feature representation and recognition techniques. Recognition is based on modeling the
hand in the spatial domain, using various 2D and 3D geometric and non-geometric models.
The main drawback of this approach is that it does not consider gesture recognition in
temporal space, changes in illumination, rotation and orientation, or scaling problems, and
it relies on special hardware that is quite costly.

The implementation described is divided into three phases: hand gesture recognition using
a Kinect camera, algorithms for hand detection and recognition, and hand gesture
recognition itself; its limitation is that the detection and segmentation algorithms used are
not very efficient compared to neural networks, and the dataset considered is very small
and can detect only a few sign gestures. The system architecture consists of image
acquisition, segmentation of the hand region and a distance-transform method for gesture
recognition, with limitations such as the small number of recognized gestures, which were
not used to control any application. Three main algorithms are used in this implementation:
the Viola-Jones algorithm, the convex hull algorithm and AdaBoost-based learning. The
work was accomplished by training on a feature set consisting of local contour sequences.
A limitation of this system is that it requires two sets of images for classification: a positive
set containing the required images and a negative set containing contradicting images. The
design comprises a human-computer interaction system that uses hand gestures as the input
for communication. Input to the system comes from the web camera or a prerecorded video
sequence. The skin color is detected using an adaptive algorithm at the beginning of the
frames; for the current user the skin color has to be fixed based on the lighting, camera
parameters and conditions. Once it has been fixed, the hand is localized with a histogram
clustering method, and a machine learning algorithm is then used to detect the hand gestures
in consecutive frames and distinguish the current gesture. These gestures are used as input
to a computer application. The system is divided into three subsystems: hand and motion
detection, the dataset and a 3D CNN. The web camera captures the hand movement and
provides it as input to OpenCV and a TensorFlow object detector, while the dataset is used
to train the 3D CNN. CNNs are a class of deep learning neural networks used for analyzing
videos and images; they consist of several layers and are trained with back-propagation,
and the recognized gestures drive actions such as turning pages and zooming in and out.
The interactions with the computer take place with the help of PyAutoGUI or system calls.
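
As a hedged illustration of how recognized gesture labels might be mapped to system actions with PyAutoGUI, as mentioned above, consider the following sketch; the gesture names are hypothetical:

import pyautogui

def perform_action(gesture, x=None, y=None):
    if gesture == "point" and x is not None:
        pyautogui.moveTo(x, y)              # move the cursor to the tracked position
    elif gesture == "left_click":
        pyautogui.click()
    elif gesture == "right_click":
        pyautogui.click(button="right")
    elif gesture == "scroll_up":
        pyautogui.scroll(100)               # positive values scroll up
    elif gesture == "scroll_down":
        pyautogui.scroll(-100)              # negative values scroll down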

Shivanku Mahna and team [9] have presented a paper named Controlling Mouse using
Hand Gesture Recognition, which identifies specific human gestures and uses each gesture
to control the device in the manner specified for it in the gesture recognition system. To
understand this technology, we first need to understand what a gesture is: a gesture is a
movement of the body, or a part of the body, that conveys meaning to another person. The
movement can accompany a verbal message or can express disagreement with what is
being spoken. In this paper, a relation between hand gestures and a hardware system is
established using a set of programs implemented in MATLAB. A webcam is integrated
with MATLAB to read hand signals, and detection code is then applied frame by frame to
process those signals and control hardware such as a mouse or even traffic lights.

The webcam is the hardware used in this system and is the most crucial element of the
gesture recognition system, because it interfaces the hand signals with the MATLAB
software. The gesture recognition system works by having the webcam take hand gestures
as input and capture an image every second; the captured image is processed in MATLAB,
and a controlling hardware unit is used to command different functions. MATLAB R2012b
(version 8.0) is the software used, in which a program is written that can read hand gestures.
The first step is to store the video in a variable, and the second step is to run an infinite loop
that detects the gesture in every frame, thereby detecting the gestures throughout the video.
The next step is controlling the actual hardware by integrating a microcontroller with
MATLAB and writing a series of conditional commands to operate it. Color detection and
mouse control are performed with these commands, and different commands and programs
written in MATLAB are run to detect objects. Since the algorithm detects the color red, the
user should wrap red bands around three fingers to control the mouse. The functioning is
as follows: when one finger is detected, the cursor follows the finger; when two fingers are
detected, the left key (left click) is pressed; and when three fingers are detected, the right
key (right click) is pressed. In this way, a mouse can be controlled using pattern recognition
code and can be used by handicapped as well as non-handicapped people. The method is
very efficient and requires nothing besides the red bands and a webcam.

3 Methodology
The block diagram shown in Figure 3.1 represents the working principle of the hand gesture
based Natural User Interface. The proposed system is based on detecting and localizing
hand landmarks; the gestures are then detected from the relative positions of those
landmarks. In the proposed system, video is captured using the device's webcam and passed
to the video processing module, which outputs individual resized frames in the YCbCr
color space. These frames are passed to the hand segmentation module. Segmentation is
the process of dividing an image into multiple regions based on pixel properties, with the
main purpose of identifying and separating an object from its background. The hand
segmentation module separates the hand from the background using a skin-color filtering
technique, which applies an adaptive thresholding process to decide whether each pixel
matches the skin color. The hand region segmented from the background is passed to a
hand landmark detection model that detects and tracks the finger joints and the palm region.
The relative positions of the landmarks are used by the gesture recognition module. Based
on the detected gesture, the proposed system either tracks the movement of the hand to
maneuver the mouse pointer or triggers certain events.
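
The proposal does not name a specific landmark model; the sketch below uses MediaPipe Hands only as one possible stand-in to illustrate the capture, landmark detection, gesture recognition and event-triggering flow of Figure 3.1. The pinch threshold is an assumption:

import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.7)
cap = cv2.VideoCapture(0)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    result = hands.process(rgb)                          # hand landmark detection step
    if result.multi_hand_landmarks:
        lm = result.multi_hand_landmarks[0].landmark     # 21 normalized landmark points
        index_tip, thumb_tip = lm[8], lm[4]
        pinch = abs(index_tip.x - thumb_tip.x) + abs(index_tip.y - thumb_tip.y)
        gesture = "click" if pinch < 0.05 else "move"    # assumed pinch threshold
        print(gesture, index_tip.x, index_tip.y)         # hand tracking / trigger events step
    cv2.imshow("frame", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()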

Figure 3.2 shows the general working of the video processing module. The captured video
is first divided into individual frames, and each frame is resized to a predefined size. The
frames are then passed to the color space conversion module, which converts them from
RGB to the YCbCr color space; YCbCr is used because its chroma components are largely
independent of luma. The conversion from RGB to YCbCr is done using the following
formula (for R, G, B normalized to [0, 1]):

\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix} =
\begin{bmatrix} 16 \\ 128 \\ 128 \end{bmatrix} +
\begin{bmatrix}
 65.481 & 128.553 &  24.966 \\
-37.797 & -74.203 & 112     \\
112     & -93.786 & -18.214
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}

The frames in the YCbCr color space are then passed to the hand segmentation module,
which separates the hand from the background using skin-color filtering.
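
A small sketch of this video-processing step is shown below, under the assumption of a 640x480 working resolution: each frame is resized and converted from RGB to YCbCr by applying the matrix above (OpenCV's built-in conversion would give an equivalent result with the channels ordered Y, Cr, Cb):

import cv2
import numpy as np

OFFSET = np.array([16.0, 128.0, 128.0])
M = np.array([[ 65.481, 128.553,  24.966],
              [-37.797, -74.203, 112.0  ],
              [112.0,   -93.786, -18.214]])

def to_ycbcr(frame_bgr, size=(640, 480)):
    frame_bgr = cv2.resize(frame_bgr, size)                              # frame breaking and resizing
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB).astype(np.float64) / 255.0
    ycbcr = OFFSET + rgb.reshape(-1, 3) @ M.T                            # apply the conversion matrix
    return ycbcr.reshape(frame_bgr.shape).astype(np.uint8)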

3.1 Conceptual model

(Block diagram: Web camera → Video Processing (resized frames in YCbCr) → Hand
Segmentation (foreground separation) → Hand Landmark Detection (position of
landmarks) → Gesture Recognition → Hand Tracking / Trigger Events)

Figure 3.1 General Overview of the System

(Block diagram: Web camera (captured video) → Frame Breaking and Resizing (individual
resized frames) → Color Space Conversion (frames in YCbCr color space) → Hand
Segmentation)

Figure 3.2 Video Processing Module

4 Epilogue

4.1 Expected Output


This project aims to create a natural user interface that will allow the user to interact with
the system using hand gestures. The system will take video of the hand from the webcam
as input and, using the landmark model, track and place landmarks on the hand. Based on
the relative positions of the landmarks, the system will predict the gesture and allow the
user to interact with the system. The user will be able to maneuver the cursor, trigger mouse
events, control the volume and control the brightness of the screen.
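
As a simple illustration of how the expected volume and brightness control could be driven, the sketch below maps the normalized distance between the thumb tip and the index fingertip to a 0-100 level; the distance bounds are assumptions, and the actual operating-system calls are left out:

def distance_to_level(thumb_xy, index_xy, d_min=0.03, d_max=0.30):
    dx = thumb_xy[0] - index_xy[0]
    dy = thumb_xy[1] - index_xy[1]
    d = (dx * dx + dy * dy) ** 0.5             # Euclidean distance in normalized image units
    d = min(max(d, d_min), d_max)              # clamp to the expected pinch range
    return round(100 * (d - d_min) / (d_max - d_min))

# Example: a nearly closed pinch gives a low level, a wide pinch a high one.
print(distance_to_level((0.50, 0.50), (0.52, 0.51)))   # small value
print(distance_to_level((0.40, 0.40), (0.62, 0.58)))   # large value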

4.2 Gantt chart
The planned schedule runs from May 2021 to February 2022.

ID   Task Name        Duration
1    Planning         5 w
2    Design           12 w
3    Coding           24 w
4    Testing          15 w
5    Documentation    42 w
6    Implementation   6.5 w

Figure 4.1 Gantt chart

5 References
• [1] Juan Pablo Wachs, Mathias Kölsch, Helman Stern, and Yael Edan. "Vision-based
hand-gesture applications". Communications of the ACM, 54:60-71, Feb. 2011.
• [2] E. Stergiopoulou and N. Papamarkos. Hand gesture recognition using a neural
network shape fitting technique. Engineering Applications of Artificial Intelligence,
22(8):1141–1158, 2009.
• [3] Farooq Ahmad and Petr Musilek. "UbiHand: a wearable input device for 3D
interaction". In ACM International Conference and Exhibition on Computer Graphics
and Interactive Techniques, page 159, New York, NY, USA, 2006. ACM.
• [4] Puri, Rachit. (2014). “Gesture Recognition Based Mouse Events”. International
Journal of Computer Science and Information Technology. 5.
10.5121/ijcsit.2013.5608.
• [5] Rios-Soria, D.J., Schaeffer, S.E. & Garza-Villarreal, S.E. (2013). "Hand-gesture
recognition using computer-vision techniques".
• [6] Grif, Horatiu & Turc, Traian. (2018). “Human hand gesture based system for
mouse cursor control”. Procedia Manufacturing. 22. 1038-1042.
10.1016/j.promfg.2018.03.147.
• [7] S. Thakur, R. Mehra and B. Prakash, "Vision based computer mouse control
using hand gestures," 2015 International Conference on Soft Computing
Techniques and Implementations (ICSCTI), 2015, pp. 85-89, doi:
10.1109/ICSCTI.2015.7489570.
• [8] B., Abhishek & Krishi, Kanya & M., Meghana & Daaniyaal, Mohammed & S.,
Anupama. (2020). “Hand gesture recognition using machine learning algorithms.”
Computer Science and Information Technologies. 1. 116-120.
10.11591/csit.v1i3.p116-120.
• [9] Mahna, Shivanku & Sethi, Ketan & Ch, Sravan. (2015). “Controlling Mouse
using Hand Gesture Recognition.” International Journal of Computer Applications.
113. 1-4. 10.5120/19819-1652.

