ST ND RD: Ntroduction

1st Lufiya George C, Computer Science Dept., Jyothi Engineering College, Cheruthuruthy, Thrissur, lufiyageorgec@jecc.ac.in
2nd Thomas George, Computer Science Dept., Jyothi Engineering College, Cheruthuruthy, Thrissur, thomas@jecc.ac.in
3rd Ninu Francis, Computer Science Dept., Jyothi Engineering College, Cheruthuruthy, Thrissur, ninufrancis@jecc.ac.in
4th Reshma K V, Computer Science Dept., Jyothi Engineering College, Cheruthuruthy, Thrissur, reshmakv@jecc.ac.in
5th Lidiya George C, Little Flower College Guruvayur, Thrissur, thandhootty@gmail.com

Abstract: Hand gesture detection finds extensive application in the fields of computer science and language technology. Gesture detection is a reliable method and is essential in some situations. Traditional gesture detection methods relied on colour, shape, template matching, and feature extraction. Developments in deep learning have been proposed to overcome the problems of these traditional architectures and have made gesture detection an emerging real-world technology. Hand gestures communicate meaning through motion and shape; however, a lay person may not be familiar with these gestures and so may miss the real essence of the story. This paper proposes a novel generic hand gesture detection method based on deep learning, focusing on neural networks and their modifications, which optimizes detection performance in real-life scenarios in a fast and effective way.

Keywords: Deep learning, neural network (NN), RPN, regression, CNN, softmax

I. INTRODUCTION

Deep learning is a machine learning technique, and a subset of artificial intelligence, that has become part of our everyday lives. The new era of deep learning has already begun: large volumes of data are learned through a mesh of artificial neural networks inspired by the human brain. Gesture identification is the process of interpreting an image with the help of mathematical algorithms. Humans identify a gesture in an image with little effort, but object recognition remains a challenge for computer-aided systems. Object detection technology aims to detect target objects using image processing methods, determine the semantic categories of these objects, and mark the specific position of each target object in the image. In practical applications, using computer technology to detect objects automatically is a crucial task.

Gesture recognition refers to a set of tasks for recognizing gestures in an input digital image. At present, approaches to gesture detection fall into two categories: traditional machine learning methods and deep learning methods. Deep learning clears the way in the object detection problem. Compared with traditional machine-learning-based gesture recognition, deep learning does not depend heavily on features. Instead of features, deep learning uses certain machine learning techniques to detect the object. Among them, the convolutional neural network (CNN) model leads the field in object detection, because it can learn and extract features hierarchically. Gesture detection with deep learning is typically classified into two types: generic and salient. Bounding boxes and regression are the two techniques used to achieve generic gesture detection. Salient gesture detection is accomplished with pixel-level segmentation. Region-based Convolutional Neural Networks, or R-CNN, are a family of techniques for the object recognition task, designed for model performance. You Only Look Once, or YOLO, is a further object recognition algorithm.

In this paper, a novel architecture using the CNN method is proposed for accurate gesture detection. In the proposed system, the input image is captured using a webcam. The captured image is passed to the contour detection phase, which detects the edges. Training and testing form the next stage. The contour images are given to the training algorithms [1]. In this stage, the features of the contour output are assigned weights, which are used to train on the images. The next stage uses the softmax classification method to classify each image according to its actual meaning. We use our own dataset for the gesture recognition method.

II. RELATED WORKS

Combining region proposals with CNN features gave rise to the R-CNN method (Regions with CNN features), which is used to recognize objects in an input image. In that work, the object detection system comprises three modules. The first module generates the region proposals, which define the candidate detections available to the recognizer. In the next module, a large convolutional neural network extracts a fixed-length feature vector from each region. The last module is a set of class-specific classifiers such as linear SVMs. This architecture takes an input image from the dataset and extracts 2000 bottom-up region proposals [2]. Multiple regions are created through an initial sub-segmentation. To compute the features for a region proposal, the image data in that region is first converted into a form compatible with the CNN, and the features are then computed with the CNN.
XXX-X-XXXX-XXXX-X/XX/$XX.00 ©20XX IEEE

The features are extracted when the regions are passed through the convolutional network after the reshaping procedure. The CNN was pre-trained on a large dataset using the open-source Caffe CNN library. In the third module, each region is classified using class-specific linear SVMs; the support vector machine divides the regions into different classes. That paper presents a region proposal architecture with bounding boxes to predict the location of the target object.

A new approach to object detection, which emphasizes the use of SPP-net, identifies an object from the input image. The convolution layers must first take a fixed-size input [3]. R-CNN chooses to warp or crop each region proposal to the same size. However, cropping can lose content when the object is not fully contained in the cropped region, and warping introduces unwanted geometric distortion. These content losses and distortions reduce recognition accuracy. To solve this issue, the authors adopted the idea of spatial pyramid matching in an architecture named SPP-net. Here the feature maps are computed from the whole image, and features are pooled in arbitrary regions to produce fixed-length representations for training the detector. These feature maps can be reused because a feature map encodes not only the strength of the local responses but also their spatial positions. The layer after the final convolution layer is referred to as the spatial pyramid pooling layer (SPP layer). This strategy achieves high speed with accurate performance. With networks pre-trained on ImageNet, representations are extracted from the images in a dataset and re-trained by Naive [...]. The SPP-net method surpasses the R-CNN method.

Another method for object recognition uses a neural-network shape fitting technique. First, the hand region is detected by colour segmentation. From the extracted hand shape characteristics, the number of fingers can be identified [4]. From this output the gesture type can be recognized. That work also uses a rule classifier to predict the hand gesture according to the number of fingers.

Another work describes a feature extraction method for the gesture identification problem, based on local brightness. In the first phase, the image is divided into 25 x 25 blocks. The local brightness of each block is calculated after applying a colour segmentation operation [5]. After the segmentation, a black-and-white image is created that represents the hand pose. From this hand pose the gesture type can be identified.

Another architecture uses a frequency-modulated continuous-wave (FMCW) radar system for feature extraction from an image. A range-Doppler map (RDM) can be captured from the raw signals of the FMCW radar system [6]. The RDM is helpful for determining gestures, since it yields various features of an image. These features help to determine the particular gesture type. A wrapper-based algorithm is used to select the gesture features. The authors state that the feature analysis method is the best method for the gesture recognition system.

III. PROPOSED SYSTEM ARCHITECTURE

The proposed system presents a strong platform for gesture identification, built on mathematical foundations and deep learning frameworks, which will be helpful to the common person in an unfamiliar setting. The true meaning of a particular dance mudra is determined from the captured image and displayed for better understanding. The boundary values of every image are calculated by the CNN. These annotated outputs are used to classify the images into their corresponding classes. The basic steps [8] of the proposed system are shown below.
1. Input image
2. Extract the feature
3. Classification and prediction

A. Convolutional Neural Network

Object detection is considered a much more difficult task than image classification, because of the challenge of extracting significant features that are suitable for classifying images into the desired classes. The main issue is selecting high-quality features, which requires choosing different classification strategies whenever new classes are added to the system. Researchers have put considerable effort into overcoming these difficulties. As a result, in the 1990s the LeNet design was presented, which is built on a convolutional architecture. Our architecture includes the following layers:
1. Convolutional layers
2. Pooling layers
3. ReLU (Rectified Linear Unit)
4. Fully connected layer
5. Softmax layer

The first of the convolutional layers is the input layer, which holds the region of interest (ROI) of the given input image frame. The input layer size is 128 x 128 x 3. Convolution is applied to the input frame using a 3 x 3 x 3 kernel with 32 filters. Next, 2 x 2 pooling is applied to the convolved image, reducing its size to 64 x 64. Pooling down-samples the detected features in the feature maps. The main aim of the pooling layer is to reduce the spatial size of the representation and the number of parameters in the network, which in turn reduces overfitting. After passing through the 3 convolutional layers and 3 pooling layers, the size is reduced to 16 x 16. Between the convolutional and pooling layers, a non-linear activation function called ReLU is used. The ReLU function, which stands for Rectified Linear Unit, has gained much attention in deep learning. In the convolutional layer, the whole image is treated as a multidimensional array, and the convolution operation [8] is applied using a convolution matrix or kernel. A convolution
operation is a sum of neighbouring elements multiplied by their weights. In the proposed architecture, a filter size of 3 x 3 is used in every convolution layer. Depending on the pool size, a single pixel is chosen from each selected mask of the image.

Fig 1: Proposed System Working

Here we use a pooling size of 2 x 2 with stride 2, which reduces each spatial dimension of the input to exactly half. Pooling is performed by the max-pooling layers of the network. Max pooling takes the maximum value from each cluster of neurons in the previous layer; here it takes the greatest value in each 2 x 2 window. To pass only the positive values, we use an activation function called ReLU. The function returns 0 for any negative input, and for any positive input x it returns x unchanged. It is called the positive part function because it passes positive values and returns zero for negative values, and it is given as [8]

f(x) = x^+ = max(0, x)

After passing through all the convolution and pooling operations [9], the processed image is flattened into a linear array, whose elements become the nodes of the following layer. A layer in which each node is connected to the next layer with corresponding weights is known as a fully connected layer or dense layer. The fully connected layer flattens the output of the previous layers into a single vector that is the input to the next stage. The output of the dense layer is known as the scores, and these scores are given to the classification layer. Here we use a softmax layer as the classification layer. It is the last layer of a neural-network-based classifier and has the same number of nodes as the output layer. The classification accuracy and the loss in each epoch are calculated from the softmax output. In this system we use categorical cross-entropy as the loss function. Next, the aggregate error is calculated from the difference between the network output and the true value. According to this error, the weights are updated in proportion to the learning rate. This procedure is repeated in every epoch, with the error back-propagated through the network. In the last stage, the learned network model is saved to disk. Using this saved model, the system predicts the class of new input images.

IV. EXPERIMENTAL RESULT

In our framework the input is an RGB image. The system was trained using 250 images of each classical dance mudra, captured with an RGB camera. Training was run on a GPU system. It takes 20 min to train a model using 250 images of dance mudras. The system was trained with a batch size of 16 and an initial learning rate of 0.001, and attained 99.32% accuracy in 8 epochs. The implementation is in Python and uses the Keras API with TensorFlow as the backend. The proposed model was checked with 100 images from the test dataset and showed good results. The LeNet model architecture, which comprises convolutional, activation, and pooling layers, is used here. These layers are followed by a fully connected layer and finally a softmax classifier, which gave better results.

With the selective search algorithm, an increasing number of bounding boxes are generated that do not correspond to the object, which is not useful in this situation. Marking bounding boxes in each image is also a demanding task that requires a lot of time and labour. The softmax classifier gets its name from the softmax function, which squashes the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In these respects our system has advantages over the issues above.

Fig 2: Real time system for gesture recognition (panel labels: Input Image, Feature Extraction)
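
The layer sizes quoted in Section III-A can be checked with a little arithmetic. The following is a minimal sketch, not the authors' code. It assumes that every convolution uses stride 1 with zero padding that preserves height and width (the text only states that pooling reduces 128 to 64), and that all three convolution layers use 32 filters (the text gives 32 only for the first). The function names are illustrative.

```python
def conv_same_shape(h, w, filters):
    # Assumption: stride 1 and zero padding, so height and width are unchanged
    # and the channel count becomes the number of filters.
    return h, w, filters


def max_pool_shape(h, w, c, pool=2, stride=2):
    # Output size of a pooling window of size `pool` moved with step `stride`.
    return (h - pool) // stride + 1, (w - pool) // stride + 1, c


def trace_shapes(input_shape=(128, 128, 3), stages=3, filters=32):
    """Return the tensor shape after every convolution and pooling layer."""
    shapes = [input_shape]
    h, w, c = input_shape
    for _ in range(stages):
        h, w, c = conv_same_shape(h, w, filters)
        shapes.append((h, w, c))
        h, w, c = max_pool_shape(h, w, c)
        shapes.append((h, w, c))
    return shapes


shapes = trace_shapes()
final_h, final_w, final_c = shapes[-1]
# Length of the vector handed to the fully connected layer after flattening.
flattened_length = final_h * final_w * final_c
```

Under these assumptions the trace passes through 64 x 64 after the first pooling layer and ends at 16 x 16 x 32, which agrees with the 16 x 16 size stated in the text.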
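
The convolution, ReLU, and max-pooling operations described in Section III-A can be illustrated on a single-channel 4 x 4 array. This is a hedged sketch in plain Python rather than the Keras layers used in the paper. The kernel weights and the input values are made up for illustration, and zero padding is an assumption.

```python
def conv2d_same(image, kernel):
    """3 x 3 convolution as used in CNN layers: each output value is the sum
    of a pixel and its neighbours multiplied by the kernel weights.
    Pixels outside the image are treated as zero (assumed padding)."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += image[rr][cc] * kernel[dr + 1][dc + 1]
            out[r][c] = total
    return out


def relu_map(feature_map):
    # f(x) = max(0, x) applied to every element.
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    # 2 x 2 window with stride 2: keep the largest value in each window.
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Hypothetical input and kernel weights, chosen only for the example.
image = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 0, 1, 0],
    [1, 1, 0, 2],
]
kernel = [
    [0, -1, 0],
    [-1, 4, -1],
    [0, -1, 0],
]

convolved = conv2d_same(image, kernel)
activated = relu_map(convolved)
pooled = max_pool_2x2(activated)
```

The 4 x 4 input becomes a 2 x 2 output, so each spatial dimension is halved, as the text describes for a 2 x 2 pool with stride 2.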
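
The classification and weight-update steps of Section III-A (softmax scores, categorical cross-entropy, and an update scaled by the learning rate) can be sketched for a single dense layer. This is an illustrative example under stated assumptions, not the authors' training code. It uses plain gradient descent on one sample, and the layer size, input values, and the step size in the demonstration are hypothetical (the paper reports an initial learning rate of 0.001).

```python
import math


def softmax(scores):
    # Subtract the maximum score for numerical stability; the result is
    # a list of positive values that sum to one.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, one_hot):
    # Loss for one sample: minus the log-probability of the true class.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


def dense(weights, x):
    # Scores of a fully connected layer without bias: one dot product per class.
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def train_step(weights, x, one_hot, learning_rate=0.001):
    """One gradient-descent update of the dense layer on a single sample.
    For softmax with cross-entropy, the gradient of the loss with respect
    to weight (i, j) is (probs[i] - one_hot[i]) * x[j]."""
    probs = softmax(dense(weights, x))
    loss = categorical_cross_entropy(probs, one_hot)
    new_weights = [
        [w - learning_rate * (p - t) * v for w, v in zip(row, x)]
        for row, p, t in zip(weights, probs, one_hot)
    ]
    return new_weights, loss


# Hypothetical example: 3 classes, 2 input features, true class is index 1.
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]
target = [0, 1, 0]

# A larger step than 0.001 is used here only to make the change visible.
weights_after, loss_before = train_step(weights, x, target, learning_rate=0.1)
_, loss_after = train_step(weights_after, x, target, learning_rate=0.1)
```

With all-zero weights the three classes start with equal probability, and the loss for the same sample is lower after the update.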
CONCLUSION

This paper proposes the recognition of mudras, the expressive gestures in dance that act as a visual mode of interaction with the audience. A novel architecture capable of identifying dance mudras in an image or scene is constructed. CNN technology is used for gesture detection and identification. Classification of an object may be considered the most important aspect of an object detection system. The methods demonstrated here are more cost-effective than traditional methods. Gesture recognition is widely used since it has great potential in real-world scenarios. A hand gesture carries information such as a number or an estimate, which is identified through the mathematical layers of the CNN. Hand gestures and their shapes can be determined by different methods. The system was trained using 200 images, and the model was created by implementing a deep learning system using a region-based convolutional neural network. The system achieved an accuracy of 99.56% for a similar subject during testing, and the accuracy dropped to 97.26% in low-light conditions. Future research is needed to find efficient algorithms that reduce the computational cost, shorten the time needed to detect gestures in videos with distinct characteristics, and raise the accuracy rate.

REFERENCES

[1] [...]
[2] [...] object [...]
[3] [...] technology, Vol-3, Issue-4, 2014
[4] [...] Hand Gesture [...] Engineering Research, Vol 5, Issue 7, July 2015
[5] Si-Jung Ryu, Jun-[...], [...]-based Hand Gesture Recognition [...], IEEE, 1558-1748
[6] Kaiming He, Xiangyu Zhang, [...], Spatial Pyramid Pooling in Deep [...], [...] International Publishing Switzerland, 2014
[7] [...] Method Based on Multi-[...], International Workshop on Information and Electronics Engineering, 2011
[8] Sajanraj T D, Indian Sign Language Numeral Recognition Using Region of Interest Convolutional Neural Network, IEEE, 2018
[9] K. Deepa Merlin Dixon, Effect of Denoising on Vectorized Convolutional Neural Network for Hyperspectral Image Classification, Springer, 3 April 2018
