Abstract: People who do not know sign language often face difficulty communicating with those who are deaf or hard of hearing; in such cases, hand gesture recognition technology can provide an easy-to-use alternative channel for computer-mediated communication. This research concentrates on developing a real-time hand gesture recognition system that uses physical traits common to all human hands as its basis for identifying sign gestures, automating their recognition in webcam footage with Convolutional Neural Networks (CNN) alongside additional algorithms integrated into the framework. Communication takes place through natural hand gestures, and the system prioritizes segmentation of these movements. Its automated recognition is highly beneficial for people with hearing disabilities, as it can help eliminate long-standing communication barriers, and the system also has potential applications in areas such as human-machine interfaces and immersive gaming, so all parties involved can benefit from the ease that real-time hand gesture recognition brings.
Keywords: Convolutional neural network, Deep learning, Gesture recognition, Sign language recognition, Hearing disability.
[Figure: system pipeline: input data from webcam → image pre-processing → feature extraction → classification → evaluation]
In the HSV color space, colors are defined using three parameters: Hue, Saturation, and Value, while color temperature refers to how warm (yellow) or cool (blueish) a shade appears. The Value component indicates how bright a color is, and the Saturation component how intense or pure it is. One reason developers often use the HSV color space in image processing and computer vision applications is its ability to separate brightness from chrominance data, which simplifies isolating specific colors.
• HSV (Hue, Saturation, Value) is a color space that represents colors based on three parameters: hue, saturation, and value. Hue is the actual color of the pixel, saturation is the intensity or purity of the color, and value represents the brightness or darkness of the color.
• To isolate the color of the skin, a range of values for the Hue, Saturation, and Value components is defined. This range is used to identify the pixels that represent skin color. The range is defined based on the average skin color of the dataset being used.
International Journal of Intelligent Systems and Applications in Engineering IJISAE, 2023, 11(6s), 23–37 | 27
For example, if we want to isolate pixels that represent skin color, we might define the following ranges:
Hue range: 0 to 50
Saturation range: 0.23 to 0.68
Value range: 0.35 to 0.95
This means that any pixel with a hue value between 0 and 50, a saturation value between 0.23 and 0.68, and a value between 0.35 and 0.95 is likely to represent skin color.
Mathematically, this can be represented as follows: let I be the input image, and H, S, and V be the Hue, Saturation, and Value components of the HSV color space, respectively. We can define a range of values for each component as follows:
Hue range: H_min ≤ H ≤ H_max
Saturation range: S_min ≤ S ≤ S_max
Value range: V_min ≤ V ≤ V_max
Any pixel (i, j) in the input image I that satisfies these conditions can be classified as representing skin color:
H_min ≤ H(i, j) ≤ H_max, S_min ≤ S(i, j) ≤ S_max, V_min ≤ V(i, j) ≤ V_max
This process of defining a range of values for the Hue, Saturation, and Value components is called color segmentation or color thresholding. It is a common technique used in computer vision and image processing for object detection, tracking, and recognition.
Apply a mask to the image using the range of values defined in step 2 to remove all other colors except for the skin tone.
In this step, a binary mask is applied to the image in order to remove all colors except for the skin tone. The binary mask is generated by thresholding the image with the range of values defined in step 2 (i.e., the range of values for Hue, Saturation, and Value that correspond to the skin tone). Pixels within this range are set to 1 in the mask, while pixels outside the range are set to 0.
Mathematically, let I be the original RGB image and let I_HSV be the same image converted to the HSV color space. Then, let H, S, and V be the individual Hue, Saturation, and Value components of I_HSV, respectively. The range of values for skin tone is defined as a set of upper and lower bounds for H, S, and V, denoted as H_u, H_l, S_u, S_l, V_u, and V_l, respectively.
The binary mask, M, is generated by thresholding the individual components of I_HSV with the respective upper and lower bounds as follows:
M(i, j) = 1 if H_u > H > H_l, S_u > S > S_l, and V_u > V > V_l
M(i, j) = 0 otherwise
Here, the notation "A > B > C" means that B is between A and C, i.e., A is the upper bound and C is the lower bound.
Once the binary mask is generated, it is applied to the original image I to produce the final skin tone image, I_skin:
I_skin = I ∗ M
where ∗ denotes element-wise multiplication. This operation sets all pixels in I that correspond to non-skin-tone regions to 0, effectively removing all other colors except for the skin tone.
Apply morphological operations such as dilation and erosion using elliptical kernels to smooth and refine the edges of the skin tone.
The background elimination process is important because it helps to reduce the amount of noise in the image and isolate the hand gestures, making it easier for the classification model to identify and recognize them.
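The thresholding and masking steps above can be sketched in pure Python (a minimal illustration, not the paper's implementation; a real system would use OpenCV or NumPy). The bounds below are the example ranges quoted earlier, with hue expressed as a fraction of 360°:

```python
import colorsys

# Example skin-tone bounds from the text: hue 0-50 (here as a fraction
# of 360 degrees), saturation 0.23-0.68, value 0.35-0.95.
H_LO, H_HI = 0.0, 50.0 / 360.0
S_LO, S_HI = 0.23, 0.68
V_LO, V_HI = 0.35, 0.95

def skin_mask(image):
    """Binary mask M: 1 where a pixel falls inside all three ranges.

    `image` is a list of rows of (r, g, b) tuples with channels in [0, 1]."""
    mask = []
    for row in image:
        out = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            inside = H_LO <= h <= H_HI and S_LO <= s <= S_HI and V_LO <= v <= V_HI
            out.append(1 if inside else 0)
        mask.append(out)
    return mask

def apply_mask(image, mask):
    """I_skin = I * M: element-wise multiply, zeroing non-skin pixels."""
    return [[(r * m, g * m, b * m) for (r, g, b), m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]
```

Applied to a two-pixel image containing one skin-like tone and one blue pixel, the mask keeps the first and zeroes the second, exactly as I_skin = I ∗ M prescribes.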
3.3 Segmentation

After that, the first image is converted to grayscale. The colour of the skin gesture is lost in this procedure, but the system's resistance to changes in brightness or illumination is increased as a result. The pixels in the modified image that are not black are binarised, while the pixels in the original image that remain intact are therefore black. The hand gesture is segmented in two ways: first, by removing all of the small connected components in the image, and second, by keeping only the part of the image that is strongly connected, which in this case is the hand gesture. The frame is scaled to a 64 x 64 pixel resolution. At the conclusion of the segmentation process, binary pictures of size 64 x 64 are created, in which the white area represents the hand gesture and the black coloured area represents the rest of the image (see figure 3(a) and (b)).

Fig. 4(a) Binary Image 4(b) Resized Image

Let's denote the original image as I, and the grayscale version of I as I_gray. To convert I to grayscale, we can use the following formula:
I_gray(i, j) = 0.2989 ∗ I(i, j, 1) + 0.5870 ∗ I(i, j, 2) + 0.1140 ∗ I(i, j, 3)
where i and j are the pixel coordinates, and I(i, j, k) is the k-th channel value of the pixel at (i, j) in the original image.
To increase the system's resistance to changes in brightness or illumination, we can use a technique called histogram equalization. Let's denote the histogram-equalized grayscale image as I_eq. The mathematical formula for histogram equalization is as follows:
I_eq(i, j) = round( (CDF(I_gray(i, j)) − CDF_min) / (M ∗ N − CDF_min) ∗ (L − 1) )
where round is the rounding function, CDF is the cumulative distribution function of I_gray, CDF_min is the minimum value of CDF, M and N are the dimensions of I_gray, L is the number of possible intensity levels (usually 256), and I_eq(i, j) is the intensity value of the pixel at (i, j) in the histogram-equalized grayscale image.
To binarize the modified image, we can use a thresholding technique. Let's denote the binary image as B. The mathematical formula for thresholding is as follows:
B(i, j) = 1 if I_eq(i, j) > T else 0
where T is the threshold value.
To remove all the small connected components in the image, we can use a morphological operation called opening. Let's denote the opened image as O. The mathematical formula for opening is as follows:
O = open(B, SE)
where SE is a structuring element that defines the shape and size of the operation.
To keep only the part of the image that is strongly connected, we can use a morphological operation called closing. Let's denote the closed image as C. The mathematical formula for closing is as follows:
C = close(B, SE)
where SE is the same structuring element used for opening.
To scale the frame to a 64 x 64 pixel resolution, we can use interpolation. Let's denote the final segmented image as S. The mathematical formula for interpolation is as follows:
S(i, j) = BilinearInterpolation(C, i / k, j / k)
where BilinearInterpolation is the bilinear interpolation function, k is the scaling factor (calculated as the ratio of the original image size to the desired size, in this case 64), and S(i, j) is the intensity value of the pixel at (i, j) in the final segmented image.

3.4 Feature Extraction

In image processing, selecting and extracting key features from an image is one of the most critical parts of the process. In most cases, the volume of data contained in an image dataset demands a lot of storage capacity. Data reduction through feature extraction can help us tackle this issue; on top of that, it keeps the classifier's accuracy high while simplifying its complexity. We identified the binary pixels in the images to be the most important attributes in this scenario. Images of American Sign Language gestures have been classified effectively after scaling them to 64 x 64 pixels, which gives 64 × 64 = 4096 features.
Let I be an image dataset with N images, each containing M pixels. Let X be an N x 4096 matrix, where each row corresponds to an image and each column corresponds to a binary pixel in the image (0 or 1).
To extract the key features from I, we identify the binary pixels in each image and create X accordingly.
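The segmentation formulas above (grayscale conversion, histogram equalization, and thresholding) can be sketched in pure Python; this is illustrative only, as production code would use OpenCV or NumPy arrays:

```python
L = 256  # number of intensity levels

def to_gray(img):
    """I_gray(i,j) = 0.2989*R + 0.5870*G + 0.1140*B, for rows of 8-bit (R, G, B)."""
    return [[round(0.2989 * r + 0.5870 * g + 0.1140 * b) for (r, g, b) in row]
            for row in img]

def equalize(gray):
    """I_eq(i,j) = round((CDF(v) - CDF_min) / (M*N - CDF_min) * (L - 1))."""
    pixels = [v for row in gray for v in row]
    n = len(pixels)
    hist = [0] * L
    for v in pixels:
        hist[v] += 1
    cdf, running = [0] * L, 0
    for v in range(L):
        running += hist[v]
        cdf[v] = running
    cdf_min = min(c for c in cdf if c > 0)
    if n == cdf_min:  # constant image: nothing to equalize
        return [row[:] for row in gray]
    return [[round((cdf[v] - cdf_min) / (n - cdf_min) * (L - 1)) for v in row]
            for row in gray]

def binarize(gray, t):
    """B(i,j) = 1 if I_eq(i,j) > T else 0."""
    return [[1 if v > t else 0 for v in row] for row in gray]
```

Each function mirrors one formula from the text; in practice the three steps are chained, followed by morphological opening/closing and the resize to 64 x 64.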
Each row in X represents a unique combination of binary pixel values for an image in I.
To reduce the volume of data in I, we perform feature extraction using X. This involves analyzing X to identify patterns and relationships between the binary pixels that are most useful for classification. Let Y be an N x K matrix, where each row corresponds to an image and each column corresponds to a selected feature (e.g., a group of related binary pixels). We can choose K to be smaller than 4096 to reduce the storage requirement for Y.
To classify images of American Sign Language gestures effectively, we scale each image to 64 x 64 pixels. This means that M = 4096, and X has 4096 columns. We can then use machine learning algorithms (e.g., neural networks) to train a classifier on Y and use it to predict the sign language gesture in new images. By reducing the number of features in Y, we simplify the complexity of the classifier while maintaining its accuracy.

3.5 Classification

To use a CNN model, it is necessary to extract characteristics from the frames and predict the hand gestures. Image recognition is the primary use of this multi-layered feed-forward neural network. The CNN architecture contains several convolution layers, each followed by a pooling layer, an activation function and, optionally, batch normalization, together with a set of fully connected layers. Each time an image travels through the network it gets smaller and smaller; this is caused by max pooling. With the final layer, we can forecast the probability of the different classes.

3.6 Implementation

Firstly, the program examines each letter in the current string and verifies whether its count is greater than or equal to a specific threshold value (in this case, the count threshold is 50 and the difference threshold is 20). If it is, the program outputs that letter and adds it to the current string; otherwise, the count of images in the current dictionary is deleted. Next, the output of the first convolutional layer, reshaped into a flat array of 30 x 30 x 32 = 28,800 values, is fed to a densely connected layer comprising 128 neurons. The output of this first densely connected layer is used as the input to a second, fully connected layer comprising 96 neurons; this second layer is also densely connected. Finally, the last layer, whose number of neurons equals the number of classes to be identified (alphabets and a blank sign), takes its input from the second densely connected layer. To avoid overfitting, a dropout layer with a value of 0.5 is employed: it randomly removes some activations from the previous layer and sets them to 0, which prevents the network from becoming too dependent on the training instances and improves its performance on new examples.
Every layer (convolutional and fully connected) uses ReLU (Rectified Linear Unit) activation, which computes max(x, 0) for each input. This helps in learning complex features, prevents the vanishing gradient problem, and speeds up the training process. A pooling layer is generated by applying max pooling with a pool size of (2, 2) to the input image; this reduces overfitting, since fewer parameters lower the overall computational cost. The system architecture for the sign language system is displayed in Figure 4. The testing procedure flowchart is shown in Figure 5: each letter is tested independently, with 50 iterations given to each letter, and the frequency of recognition for each letter is recorded. However, the recognition may be incorrect if the letter gesture is not recognized.
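The per-letter thresholding described in the implementation above can be sketched as follows; commit_letter is a hypothetical helper (not the paper's code), and the two constants are the values stated in the text (a count threshold of 50 and a difference threshold of 20):

```python
COUNT_THRESHOLD = 50  # frames in which a letter must be recognized
DIFF_THRESHOLD = 20   # required margin over the runner-up letter

def commit_letter(counts):
    """Return the letter to append to the output string, or None.

    `counts` maps each candidate letter to the number of frames in which
    the classifier predicted it for the current gesture."""
    if not counts:
        return None
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    best, best_n = ranked[0]
    runner_up_n = ranked[1][1] if len(ranked) > 1 else 0
    if best_n >= COUNT_THRESHOLD and best_n - runner_up_n >= DIFF_THRESHOLD:
        return best
    return None
```

The margin test is what makes the output stable: a letter is only committed once it clearly dominates the other candidates over many frames, rather than on a single noisy prediction.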
After training, the network weights can become so finely tuned to the training instances that the network does not perform well when given fresh examples; this problem is known as overfitting. To solve it, the dropout layer removes a random subset of activations from the previous layer and sets them to 0. This ensures that the network can still classify correctly even if part of the activations is dropped.
In summary, the program applies a threshold to each letter in the current string and outputs the letters whose count is above the threshold. It then feeds the images from the first convolutional layer into densely connected layers, reshaping the output of the first layer to an array of 28,800 values. A fully connected layer with 96 neurons is then used, followed by a densely connected layer whose number of neurons is equal to the number of classes. To avoid overfitting, a dropout layer is employed. ReLU is used for activation in each layer, with max pooling and ReLU activation applied to the input image. The architecture of the sign language system is depicted in Figure 4, and the testing process flowchart is illustrated in Figure 5.
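The two operations named above, ReLU activation and (2, 2) max pooling, reduce to a few lines each; this is an illustrative sketch, not the paper's code:

```python
def relu(feature_map):
    """ReLU: max(x, 0) applied element-wise to a 2D feature map."""
    return [[max(x, 0) for x in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling; halves each spatial dimension."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(feature_map[i][j], feature_map[i][j + 1],
                           feature_map[i + 1][j], feature_map[i + 1][j + 1]))
        pooled.append(row)
    return pooled
```

Chaining these after each convolution is what shrinks the image as it travels through the network: every pooling stage keeps only the strongest response in each 2x2 window, quartering the number of values passed on.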
Fig. 6: The testing procedure flowchart
4. Results and Discussion

The "Sign Language Recognition System using CNN and Neural Networks" aims to improve the independence of people with disabilities by recognizing sign language through a vision-based approach. The system uses hand gestures to represent signs and eliminates the need for any artificial devices for interaction. In creating a dataset that matched the project requirements, the team found it difficult to find pre-made datasets in the form of raw images; therefore, they decided to create their own dataset by capturing each frame from the machine's webcam using OpenCV. In each frame, they defined a region of interest (ROI) using a green bounded square. They then applied a Gaussian blur filter to the image to extract various features, resulting in an image that looks like the figure shown below.
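The ROI step above can be sketched as a simple crop (the coordinates here are illustrative; the paper does not give the exact placement of the green square):

```python
def crop_roi(frame, top, left, size):
    """Return a size x size square region of interest from a 2D frame,
    represented as a list of rows of pixel values."""
    return [row[left:left + size] for row in frame[top:top + size]]
```

Only this cropped square is passed on to blurring and segmentation, so the classifier never sees the cluttered background outside the ROI.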
Accuracy and Error Calculation: Once we put together the hardware and uploaded our code onto the chip, we ran several tests to make sure everything was working as it should. We ran the tests multiple times and made adjustments as needed until we were confident that we had met all of our requirements. To determine how accurate our results were, we used the following equations to calculate our error rates:

Accuracy% = (Detected right / No. of iterations) ∗ 100

Precision = True Positives / Predicted Positives

Recall = True Positives / Actual Positives

F1 score = 2 ∗ (Precision ∗ Recall) / (Precision + Recall)
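The metrics above can be computed from raw true-positive, false-positive, and false-negative counts; this generic sketch uses the standard definitions rather than any code from the paper:

```python
def metrics(tp, fp, fn):
    """Precision, recall, and F1 from prediction counts.

    Precision = TP / predicted positives; recall = TP / actual positives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def accuracy_pct(detected_right, iterations):
    """Accuracy% = (detected right / number of iterations) * 100."""
    return detected_right / iterations * 100
```

The zero-denominator guards matter in practice: a class with no predicted or no actual positives would otherwise raise a division error during per-class evaluation.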
Table I: Accuracy and Error Values.
Class type    Precision    Recall    F1 score    Support
0 0.99 0.99 0.99 195
1 1.00 1.00 1.00 205
2 0.99 0.99 0.99 198
3 0.99 0.99 0.99 231
4 0.99 0.99 0.99 193
5 0.99 0.99 0.99 193
6 0.99 1.00 0.99 214
7 1.00 1.00 1.00 209
8 0.99 0.99 0.99 185
9 1.00 1.00 1.00 212
10 0.99 1.00 0.99 199
11 0.99 0.99 0.99 183
12 0.99 0.97 0.98 188
13 0.97 0.99 0.98 195
14 1.00 1.00 1.00 203
15 0.99 1.00 1.00 226
16 1.00 1.00 1.00 187
17 0.98 1.00 0.99 191
18 1.00 0.99 0.99 191
19 0.98 0.99 0.99 172
20 0.99 0.99 0.99 190
21 1.00 0.98 0.99 222
22 1.00 1.00 1.00 210
23 0.99 1.00 1.00 198
24 1.00 1.00 1.00 210
25 1.00 0.95 0.98 211
26 1.00 0.99 0.99 204
27 0.99 0.99 0.99 200
28 0.99 1.00 0.99 198
29 1.00 0.99 0.99 188
30 0.99 0.99 0.99 197
31 0.99 1.00 0.99 184
32 0.99 1.00 0.99 198
33 1.00 0.99 1.00 226
34 0.99 0.99 0.99 200
35 1.00 1.00 1.00 201
36 1.00 1.00 1.00 189
37 0.98 1.00 0.99 201
38 1.00 1.00 1.00 189
39 0.99 0.98 0.99 201
40 1.00 1.00 1.00 217
41 1.00 1.00 1.00 211
Accuracy 0.99 8400
Macro Avg 0.99 0.99 0.99 8400
Weighted Avg    0.99    0.99    0.99    8400
Accuracy and error values over five epochs are displayed on a line graph, with epochs on the x-axis and accuracy and error on the y-axis; accuracy is plotted in blue and error in orange. Accuracy readings remain high and error readings remain low throughout, indicating consistently strong model performance across these epochs, and the inclusion of axis labels and a legend makes the graph easy to interpret.

5. Conclusion and Future Scope

The focus of our study was on building a real-time system that identifies hand motions to improve communication for individuals with hearing disabilities. The automatic recognition of sign gestures by our system, using advanced algorithms combined with convolutional neural network technology, allows for more intuitive and natural communication between individuals. Recognizing commonalities in the shape of human hands is essential to the system's ability to identify different types of gestures. Using our system as a tool to improve accessibility and streamline communication saves time for all parties involved. Our target moving forward is to improve the system's precision and speed and to make it more accessible to users, and we aim to enhance the user experience by developing more advanced applications integrated with modern technologies such as virtual and augmented reality. Furthermore, continued research and development of the hand gesture recognition system has the potential to significantly change communication methods for people with hearing difficulties, as well as in other relevant fields.