
2023 IEEE 14th Control and System Graduate Research Colloquium (ICSGRC), 5 Aug 2023, Shah Alam, Selangor, Malaysia

Hand Gesture Recognition based on Convolution Neural Network (CNN) and Support Vector Machine (SVM)
Muhammad Afiq Abdull Razak
School of Electrical Engineering, College of Engineering
Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
afiqre1998@gmail.com

Farah Yasmin Abdul Rahman
Wireless High Speed Network Research Group (WHiSNet)
School of Electrical Engineering, College of Engineering
Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
farahy@uitm.edu.my*

Roslina Mohamad
Wireless High Speed Network Research Group (WHiSNet)
School of Electrical Engineering, College of Engineering
Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
roslina780@uitm.edu.my

Shahrani Shahbuddin
School of Electrical Engineering, College of Engineering
Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
shahranis@uitm.edu.my

Yuslinda Wati Mohamad Yusof
School of Electrical Engineering, College of Engineering
Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
yuslinda@uitm.edu.my

Saiful Izwan Suliman
School of Electrical Engineering, College of Engineering
Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
saifulizwan@uitm.edu.my

2023 IEEE 14th Control and System Graduate Research Colloquium (ICSGRC) | 979-8-3503-4623-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICSGRC57744.2023.10215427

Abstract— Gestural communication is a type of nonverbal communication in which visible body gestures are used to convey vital messages, either in place of speech or in conjunction with it. The problem of gesture division is presented as a first step toward visual hand gesture recognition, i.e., the detection, analysis, and recognition of gestures through real-time hand sequences. Visual hand recognition and motion tracking are quite challenging to solve due to their inconvenient nature. This research seeks to address the issue by determining which classification technique, Convolutional Neural Network (CNN) or Support Vector Machine (SVM), is superior in recognising hand motions. The hand skeleton was used as the feature to represent the hand gestures. Both classification methods used the same sample dataset and camera input to achieve a fair comparison. The performance was then analysed in terms of accuracy and processing time. The results indicate that the CNN excels in recognising hand gestures with an accuracy of 97.78% compared to the SVM with 96.30%. In terms of processing time to train/process the datasets, the SVM has the upper hand, taking 5 minutes and 16 seconds, while the CNN took 8 minutes and 24 seconds.

Keywords—Hand Gesture Recognition, Hand Skeletal, CNN, SVM, Accuracy, Processing Time

I. INTRODUCTION

Hand gesture recognition is critical to the effectiveness of human-computer interaction (HCI) technology, which is used as a helpful interface in a variety of challenging scenarios [1]. The system allows for nonverbal communication that is natural, inventive, and modern. It can also be used in a variety of settings [3]. As a result, fostering natural HCI is critical to bridging the gap between humans and computers [2]. Further, HCI technology can be used to create a smart environment [4].

Hand gestures have become a popular way of communicating simple thoughts, which are then translated into events by a gesture detection system [3]. However, analysing the complete set of features takes a long time [1]. Every stage of development has aimed to improve the computer's intelligence and make it easier for people to communicate with computers in more intricate ways [4]. Technology needs to be able to recognize, classify, and interpret many simple hand gestures and use them in a wide range of situations [3].

The main purpose of this research is to recognize nine (9) hand gestures, namely the "Call", "Fist", "Live Long", "Okay", "Peace", "Rock", "Stop", "Thumbs Up" and "Thumbs Down" hand gestures. The recognition system was developed using the Python programming language. A hand skeletal-based feature extraction technique was used to represent the hand gestures. Then, CNN and SVM were used as recognition algorithms. Finally, we analyse the recognition techniques based on their accuracy and processing time.

II. LITERATURE REVIEW

Hand gesture recognition has become a research topic that has gained the attention of many researchers, including the works presented in [1] – [10]. Many approaches have been introduced, but the one we are interested in investigating is the work by Md Abdur Rahim et al. [1]. They proposed Skeleton Distance Measurement (SDM) as a feature extraction technique to represent hand gestures. This approach is very straightforward: prominent points on the hand are detected, and the distances among those points are calculated and used as the input to the classifier. CNN is one of the classifiers that gives a high recognition rate in hand gesture recognition research. For example, the work presented by Md Rashedul Islam et al. [9] obtained a classification accuracy of 98.41%. Besides CNN, SVM is also able to give high accuracy in recognising hand gestures. Chin-Pan Huang et al. [10] used the SVM algorithm as their recognition algorithm and obtained a recognition rate of 97.51%.

Based on our review, both CNN and SVM are able to give very good accuracy, more than 90%, in hand gesture recognition studies. Therefore, we chose to use both


techniques and investigate their performance in terms of accuracy and processing time.

III. METHODOLOGY

Figure 1: Flowchart of the Project

Figure 1 above shows the flowchart of this project. It consists of data acquisition, feature extraction, classification and, lastly, analysis of the findings based on the recognition results.

A. Data Acquisition

Figure 2 shows examples from the dataset of hand gesture images used in this research. The dataset was taken from our own collection, where the images were obtained using a digital camera with a resolution of 0.9MP and a frame size of 224x224. Nine types of gestures were studied in this project. They were classified as the "Call", "Fist", "Live Long", "Okay", "Peace", "Rock", "Stop", "Thumbs Up" and "Thumbs Down" gestures.

Figure 2: Dataset of Hand Gesture

During the training and validation process, a total of 2700 images were used. Each gesture has 300 photo samples. Meanwhile, testing used images taken in real time of 5 different people performing all 9 gestures.

B. Feature Extraction Technique

Figure 3: Flow of Data Extraction

Figure 3 above shows the flow of data extraction in this project. The key points that represent the hand skeleton were detected by importing the MediaPipe library in the Python programming language. The MediaPipe library provides a ready-made skeletal-based hand detection model that can be used to detect the coordinates of each hand key point.

Figure 4: Skeletal-based detection of the hand

Figure 4 shows an example of skeletal-based hand key point detection. This hand landmark model performs precise localization of 21 key points on the hand.

Then, the data from those 21 key points were stored in a CSV file using the Pandas library in Python. Pandas is a Python library used for data analysis. The finished CSV file, which contains all the extracted data, is then ready to be used to train both classification algorithms, which in this research are the CNN and SVM.

C. Classification Algorithm

1) CNN

Figure 5: The creation of CNN model

Figure 5 above shows the flow of the CNN algorithm used in this project. The data extracted via the skeletal-based


hand detection in the CSV file was then used to train the CNN model. This project implemented the CNN training with the TensorFlow library in the Python programming language, which provides ready-made building blocks for the CNN architecture.

The model consists of a linear stack of layers. This project used the Rectified Linear Unit (ReLU) as the activation function for the input layer, which was created with 512 nodes (neurons). Next, the output layer consists of 9 nodes, as the project aims to classify the 9 given gestures. All 512 input nodes were fully connected to the output layer. A softmax activation function was used to classify the output into 9 distinct units. After the CNN model was successfully created, it was loaded into the recognition system and tested using real-time images.

2) SVM

Figure 6: Support Vector Classifier with RBF kernel

Figure 6 shows the flow of the SVM algorithm in this research. The input data were extracted via the skeletal-based hand detection and stored in the CSV file. This project implemented the SVM training with the scikit-learn library in the Python programming language, which provides a ready-made SVM implementation. An RBF kernel was used to separate the classes with the SVM hyperplane. After the SVM model was successfully trained, real-time images taken from a digital camera were used to test the accuracy of the developed recognition system.

IV. RESULT AND DISCUSSION

This section describes the results obtained from hand gesture recognition using the CNN and SVM algorithms.

Table 1 shows the confusion matrix of hand gesture recognition using the CNN technique. The confusion matrix consists of 9 classes of hand gesture. Each gesture was tested 150 times and the recognition results were recorded in the confusion matrix. Table 1 shows that the CNN algorithm achieved 100% accuracy on 7 hand gestures, namely the "Okay", "Peace", "Thumbs Up", "Thumbs Down", "Rock", "Live Long" and "Fist" gestures. The "Stop" gesture has the lowest accuracy among the hand gestures classified by the CNN algorithm, at 86%, because it tends to be misjudged as the "Live Long" and "Rock" gestures. This is caused by the similarity of the skeletal profiles, where most of the fingers are kept straight in all 3 gestures. Nonetheless, the recognition rate is still good, at more than 80%. Next, the "Call" gesture has an accuracy of 94%. The overall accuracy of the CNN algorithm is 97.78%.

Table 1: Confusion Matrix for CNN technique

Table 2: Confusion Matrix for SVM technique

Meanwhile, Table 2 shows the confusion matrix for the recognition of hand gestures using the SVM technique. The SVM algorithm classified 7 hand gestures perfectly, with a 100% recognition rate, namely the "Okay", "Peace", "Thumbs Down", "Call", "Stop", "Rock" and "Fist" gestures. The "Live Long" gesture has the lowest accuracy of all hand gestures under the SVM algorithm, at 74.67%, because it tends to be misjudged as the "Stop" gesture. The reason for this poor performance is that all fingers are kept straight in both gestures, so their skeletal shape profiles are almost identical. This makes it difficult for the SVM to set the hyperplane boundary between those 2 gestures. Further, the "Thumbs Up" gesture has an accuracy of 92%. Overall, the accuracy of the SVM algorithm is 96.30%.
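
The overall accuracies reported for the two classifiers are consistent with a plain average of the per-gesture recognition rates, since every gesture was tested the same number of times. The following is a minimal sketch of that check and not the authors' code. The per-gesture values are read from the Results text (seven gestures at 100% plus the stated exceptions); taking "Call" as the only CNN gesture at 94% is an assumption, and the helper names `overall_accuracy` and `perfect_in_both` are hypothetical.

```python
# Per-gesture recognition rates in percent, as read from the Results text.
# Assumption: for the CNN, "Call" is the only gesture at 94%.
GESTURES = ["Call", "Fist", "Live Long", "Okay", "Peace",
            "Rock", "Stop", "Thumbs Up", "Thumbs Down"]

cnn_rates = {g: 100.0 for g in GESTURES}
cnn_rates.update({"Stop": 86.0, "Call": 94.0})

svm_rates = {g: 100.0 for g in GESTURES}
svm_rates.update({"Live Long": 74.67, "Thumbs Up": 92.0})


def overall_accuracy(rates):
    """Mean of per-class rates; valid because every class has the same test count."""
    return sum(rates.values()) / len(rates)


def perfect_in_both(a, b):
    """Gestures recognised at 100% by both classifiers, in alphabetical order."""
    return sorted(g for g in a if a[g] == 100.0 and b[g] == 100.0)


print(f"CNN overall: {overall_accuracy(cnn_rates):.2f}%")
print(f"SVM overall: {overall_accuracy(svm_rates):.2f}%")
print(perfect_in_both(cnn_rates, svm_rates))
```

Under these assumptions the averages round to 97.78% and 96.30%, and five gestures reach 100% with both classifiers.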

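The feature extraction step of Section III-B can be pictured as flattening the 21 detected key points into one row of numbers per image. The sketch below is illustrative only. It uses Python's standard `csv` module rather than Pandas so that it is self-contained, and it takes the key points as a plain list of (x, y) pairs instead of calling MediaPipe. The function names, the use of two coordinates per point and the column layout are assumptions, not details given in the paper.

```python
import csv
import io

NUM_KEYPOINTS = 21  # key points per hand, as stated in the text


def landmarks_to_row(label, landmarks):
    """Flatten 21 (x, y) key points into [label, x0, y0, ..., x20, y20]."""
    if len(landmarks) != NUM_KEYPOINTS:
        raise ValueError(f"expected {NUM_KEYPOINTS} key points, got {len(landmarks)}")
    row = [label]
    for x, y in landmarks:
        row.extend([x, y])
    return row


def write_rows(rows, handle):
    """Write a header line plus one line per sample to an open text handle."""
    header = ["label"]
    for i in range(NUM_KEYPOINTS):
        header.extend([f"x{i}", f"y{i}"])
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)


# Hypothetical key points standing in for one hand detection result.
fake_hand = [(i / 100, i / 50) for i in range(NUM_KEYPOINTS)]
buffer = io.StringIO()
write_rows([landmarks_to_row("Peace", fake_hand)], buffer)
lines = buffer.getvalue().splitlines()
print(len(lines), len(lines[0].split(",")))
```

Each sample then becomes one labelled row of 42 coordinate values, which is the kind of table both classifiers can be trained on.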

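Two ingredients named in Section III-C, the softmax output over the 9 gesture classes and the RBF kernel used by the SVM, can be written out in a few lines. This is a plain-Python illustration of the standard formulas and not the TensorFlow or scikit-learn implementation used in the project. The `gamma` value, the example scores and the function names are assumptions.

```python
import math


def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance between u and v)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


# Nine hypothetical output scores, one per gesture class.
probs = softmax([0.1, 2.0, -1.0, 0.3, 0.0, 1.2, -0.5, 0.7, 0.2])
predicted_class = probs.index(max(probs))

print(predicted_class, round(sum(probs), 6))
print(rbf_kernel([0.0, 0.0], [0.0, 0.0]), rbf_kernel([0.0, 0.0], [3.0, 4.0]))
```

The kernel value is 1 for identical feature vectors and decays toward 0 as they move apart, which is why two gestures with nearly identical skeletal profiles are hard for the SVM to separate.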
Figure 7: Performance of each Hand Gesture Class based on CNN and SVM

Figure 7 shows a bar graph of the accuracy of each hand gesture for each classification algorithm. It shows that 5 hand gestures achieved a 100% recognition rate with both algorithms, namely the "Fist", "Rock", "Thumbs Down", "Peace" and "Okay" gestures. Both algorithms have trouble separating the "Live Long" and "Stop" gestures, which are sometimes misinterpreted as one another. Figure 7 also shows that the CNN successfully recognised all 9 studied hand gestures with an accuracy above 85%. In terms of overall accuracy, both classification algorithms achieved a satisfactory recognition rate. The CNN excels in recognising hand gestures with an accuracy of 97.78% compared to the SVM with 96.30%.

Figure 8: Processing Time for Hand Gesture Recognition based on CNN and SVM

Next, Figure 8 shows the processing time of hand gesture recognition using CNN and SVM. It shows that the SVM needs less processing time than the CNN. This is because the SVM does not need to validate the trained dataset. Its hyperplane was able to separate the classes in 5 minutes and 8 seconds. Meanwhile, the Convolutional Neural Network (CNN) took 8 minutes and 23 seconds to train and validate the dataset.

V. CONCLUSION AND RECOMMENDATION

In conclusion, this research successfully developed a hand gesture recognition system using CNN and SVM. The skeletal-based hand detection method is suitable for use as a feature extraction technique to represent a hand gesture. Both CNN and SVM were able to recognise all 9 hand gestures. CNN has better performance in terms of accuracy. However, in terms of processing time, SVM works much faster than CNN.

REFERENCES

[1] M. A. Rahim, A. S. M. Miah, A. Sayeed and J. Shin, "Hand Gesture Recognition Based on Optimal Segmentation in Human-Computer Interaction," 2020 3rd IEEE International Conference on Knowledge Innovation and Invention (ICKII), 2020, pp. 163-166, doi: 10.1109/ICKII50300.2020.9318870.
[2] O. Köpüklü, T. Ledwon, Y. Rong, N. Kose and G. Rigoll, "DriverMHG: A Multi-Modal Dataset for Dynamic Recognition of Driver Micro Hand Gestures and a Real-Time Recognition Framework," 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020, pp. 77-84, doi: 10.1109/FG47880.2020.00041.
[3] M. Panwar and P. Singh Mehra, "Hand gesture recognition for human computer interaction," 2011 International Conference on Image Information Processing, 2011, pp. 1-7, doi: 10.1109/ICIIP.2011.6108940.
[4] S. Veluchamy, L. R. Karlmarx and J. J. Sudha, "Vision based gesturally controllable human computer interaction system," 2015 International Conference on Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM), 2015, pp. 8-15, doi: 10.1109/ICSTM.2015.7225383.
[5] Lae-Kyoung Lee, Su-Yong An and Se-Young Oh, "Robust fingertip extraction with improved skin color segmentation for finger gesture recognition in Human-robot interaction," 2012 IEEE Congress on Evolutionary Computation, 2012, pp. 1-7, doi: 10.1109/CEC.2012.6256140.
[6] P. Choudhary and S. N. Tazi, "An Adaptive System of Yogic Gesture Recognition for Human Computer Interaction," 2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS), 2020, pp. 399-402, doi: 10.1109/ICIIS51140.2020.9342678.
[7] Y. F. A. Gaus and F. Wong, "Hidden Markov Model-Based Gesture Recognition with Overlapping Hand-Head/Hand-Hand Estimated Using Kalman Filter," 2012 Third International Conference on Intelligent Systems Modelling and Simulation, 2012, pp. 262-267, doi: 10.1109/ISMS.2012.67.
[8] Z. Yang, Y. Li, W. Chen and Y. Zheng, "Dynamic hand gesture recognition using hidden Markov models," 2012 7th International Conference on Computer Science & Education (ICCSE), 2012, pp. 360-365, doi: 10.1109/ICCSE.2012.6295092.
[9] M. R. Islam, U. K. Mitu, R. A. Bhuiyan and J. Shin, "Hand Gesture Feature Extraction Using Deep Convolutional Neural Network for Recognizing American Sign Language," 2018 4th International Conference on Frontiers of Signal Processing (ICFSP), 2018, pp. 115-119, doi: 10.1109/ICFSP.2018.8552044.
[10] C.-P. Huang, C.-H. Hsieh, K.-T. Lai and W.-Y. Huang, "Human Action Recognition Using Histogram of Oriented Gradient of Motion History Image," 2011 First International Conference on Instrumentation, Measurement, Computer, Communication and Control, 2011, pp. 353-356, doi: 10.1109/IMCCC.2011.9
