Hurroo, Mehreen, and Mohammad Elham. "Sign language recognition system using convolutional neural network and computer vision." International Journal of Engineering Research and Technology (IJERT) 9.12 (2020): 59-64.

In this paper, the data is collected with a web camera that captures the hand gestures as RGB images. The images undergo a series of processing operations in which the background is detected and eliminated using colour extraction in the HSV (Hue, Saturation, Value) colourspace. Using morphological operations, a mask is applied to the images, and a series of dilations and erosions with an elliptical kernel is executed. With OpenCV, the images are resized to a uniform size so that there is no size difference between images of different gestures. The dataset contains 2,000 American Sign Language gesture images, of which 1,600 are used for training and the remaining 400 for testing, an 80:20 split. Binary pixels are extracted from each frame, and a Convolutional Neural Network is used for training and classification.
DATA PROCESSING: Data processing is done in the HSV colourspace, a model that splits the colour of an image into three components: hue, saturation, and value. HSV is a powerful tool for improving the stability of the images because it separates brightness from chromaticity. A track-bar with H ranging from 0 to 179, S from 0 to 255, and V from 0 to 255 is used to detect the hand gesture and set the background to black. The hand-gesture region then undergoes dilation and erosion with an elliptical kernel. At the end of the segmentation process, binary images of size 64 by 64 are obtained in which the white area represents the hand gesture and the black area is the background.
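A minimal OpenCV sketch of this segmentation pipeline, with fixed HSV thresholds standing in for the paper's interactive track-bar values (the threshold numbers and kernel size below are illustrative assumptions):

```python
import cv2
import numpy as np

def segment_hand(frame_bgr, lower=(0, 40, 60), upper=(35, 255, 255)):
    """Return a 64x64 binary mask: white = hand region, black = background."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # H lies in [0, 179], S and V in [0, 255]; the bounds here are placeholders
    mask = cv2.inRange(hsv, np.array(lower), np.array(upper))
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.dilate(mask, kernel, iterations=2)   # close small gaps in the hand region
    mask = cv2.erode(mask, kernel, iterations=2)    # remove speckle noise
    return cv2.resize(mask, (64, 64))
```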

In this case, the crucial features are the binary pixels of the images. Scaling the images to 64 by 64 pixels provides enough features to classify the American Sign Language gestures effectively. In total there are 4,096 features, obtained by multiplying 64 by 64 pixels.
The input layer of the convolutional neural network has 32 feature maps of size 3 by 3 with a Rectified Linear Unit (ReLU) activation. The max-pooling layer has a size of 2x2. Dropout is set to 50 percent and the output is flattened. The last layer of the network is a fully connected output layer with ten units and a Softmax activation. The model is then compiled with categorical cross-entropy as the loss function and Adam as the optimiser.
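A minimal Keras sketch of the network as described, assuming a single-channel 64x64 input; no layers beyond those stated above are implied:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 1)),  # 32 feature maps, 3x3
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),   # one output unit per sign class
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```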
The model is evaluated on 10 American Sign Language alphabet signs: A, B, C, D, H, K, N, O, T, and Y. A total of 2,000 images is used to train the Convolutional Neural Network, with the dataset split 80:20 between training and testing. The results reported in this paper give an accuracy of over 90.0%.
Ansari, Zafar Ahmed, and Gaurav Harit. "Nearest neighbour classification of Indian sign language gestures using kinect camera." Sadhana 41.2 (2016): 161-182. DOI: https://doi.org/10.1007/s12046-015-0405-3

The ISL dataset has been categorised into fingerspelling (alphabets), common words (ideas related to objects and ideas related to people), numbers, and technical words. The dataset was selected from the ISL general, technical, and banking dictionaries. Apart from complete words, the dataset also has signs for manual fingerspelling (signs for alphabets). A word may have different variants, particularly single-handed and two-handed variants. The system ran Ubuntu 12.04 (32-bit) and was implemented in C++. Matlab R2011b was used for analysis and dataset archiving, and 3D modelling tools such as MeshLab were used for visualising the data. Median filtering is used to remove the spiky noise while preserving edge information.
They assume that there are no objects between the user and the Kinect, and that the hands mostly lie closer to the camera than the rest of the body. Once the depth image has been segmented into cogent segments, the segment closest to the camera can be chosen to identify the hands. They use a segmentation algorithm based on a graph-theoretic treatment of the perceptual segmentation problem, along with K-means clustering for faster and better hand segmentation.
The algorithm is summarised in three steps:
1. Find simple local maxima using the histogram, governed by PEAK_MIN_DEPTH and SEARCH_RANGE.
2. Grow regions around these local maxima, governed by the tuning parameter PEAK_MIN_BRACKET.
3. Select the largest depth in these regions as the final initialisation seed points.
After initialisation with the local maxima, K-means becomes faster and more consistent in average runtime. The closest cluster is then selected on the basis of the depth of each cluster's mean point.
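An illustrative Python sketch of this seeded depth clustering follows (the region-growing step is omitted); the histogram bin count, number of seeds, and the simple local-maxima test stand in for the paper's PEAK_MIN_DEPTH, SEARCH_RANGE and PEAK_MIN_BRACKET parameters:

```python
import numpy as np
from sklearn.cluster import KMeans

def nearest_cluster(depth_img, n_seeds=3):
    """Return a boolean mask of the depth cluster closest to the camera."""
    d = depth_img[depth_img > 0].reshape(-1, 1).astype(float)     # drop invalid zero depths
    hist, edges = np.histogram(d, bins=64)
    # crude local-maxima search over the depth histogram (step 1 of the algorithm)
    peaks = [i for i in range(1, 63) if hist[i] >= hist[i - 1] and hist[i] >= hist[i + 1]]
    if not peaks:
        peaks = [int(np.argmax(hist))]
    seeds = edges[peaks][:n_seeds].reshape(-1, 1)                 # seed points from the peaks
    km = KMeans(n_clusters=len(seeds), init=seeds, n_init=1).fit(d)
    closest = int(np.argmin(km.cluster_centers_))                 # cluster with smallest mean depth
    mask = np.zeros_like(depth_img, dtype=bool)
    mask[depth_img > 0] = (km.labels_ == closest)
    return mask
```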
Features extracted from the training images are indexed in a k-d tree, where k is the dimensionality of the feature vector. When the system runs, the test image received from the user is preprocessed and segmented, and features are extracted from it. A nearest-neighbour search is then run to find the point(s) in the index that, according to a recognition heuristic, is/are chosen as the output.
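A minimal sketch of this index-and-query step using SciPy's k-d tree (the paper's own system is in C++; the majority-vote heuristic and the placeholder data below are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

train_features = np.random.rand(500, 32)       # placeholder training feature vectors
train_labels = np.random.randint(0, 20, 500)   # placeholder class labels

index = cKDTree(train_features)                # k = dimensionality of the feature vector

def classify(test_feature, k=5):
    """Label a query by majority vote over its k nearest training points."""
    _, idx = index.query(test_feature, k=k)
    votes = train_labels[idx]
    return np.bincount(votes).argmax()
```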
The test and training datasets are divided such that, on average, 5% of the total samples (across all classes) are in the test set and the rest are in the training set. The results are quite poor: the best recognition rates for 2, 10, and 20 classes are 90.937%, 59.26%, and 37.27% respectively, and the recognition rate drops as the number of classes increases.
NB, Mahesh Kumar. "Conversion of sign language into text." International Journal of Applied
Engineering Research 13.9 (2018): 7154-7161.
Sign language recognition has two different approaches: glove-based approaches and vision-based approaches. Wearing a glove simplifies the segmentation task. In the vision-based technique, image processing algorithms are used to detect and track hand signs and the facial expressions of the signer. There are again two different approaches in vision-based sign language recognition: 3D-model based and appearance based. 3D-model-based methods make use of 3D information about key body parts; from this information, several important parameters, such as palm position and joint angles, can be obtained. Appearance-based systems use images as inputs and interpret directly from these videos/images, without using a spatial representation of the body.
They use an LDA algorithm. Linear Discriminant Analysis (LDA) is a generalisation of Fisher's linear discriminant (FLD), used mainly in statistics, pattern recognition, and machine learning to find a linear combination of features that characterises or separates two or more classes of objects or events. LDA explicitly models the difference between the classes of data.
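As a small illustration (not the authors' code), LDA-based dimensionality reduction and classification might look as follows with scikit-learn, using placeholder data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.rand(200, 4096)        # placeholder: 200 flattened gesture images
y = np.random.randint(0, 10, 200)    # placeholder: 10 gesture classes

lda = LinearDiscriminantAnalysis(n_components=9)   # at most (classes - 1) discriminant directions
X_reduced = lda.fit_transform(X, y)                # reduced-dimensional representation
predictions = lda.predict(X)                       # LDA also acts directly as a classifier
```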
Pre-processing consists of image acquisition, segmentation, and morphological filtering. Image acquisition is the first step of pre-processing, in which an image is sensed under illumination; it also involves pre-processing such as scaling, and here the image is taken from a database. Segmentation is the process of dividing the image into small segments so that more accurate image attributes can be extracted. The image components are extracted with morphological filtering tools, which are useful for representing and describing shape.
Feature extraction is the reduction of data dimensionality by encoding related information in a compressed representation and removing less discriminative data. In the training phase, each gesture is represented as a column vector, and these gesture vectors are normalised with respect to the average gesture. In the recognition phase, a subject gesture is normalised with respect to the average gesture and then projected onto the gesture space using the eigenvector matrix. Lastly, the Euclidean distance is computed between this projection and all known projections.
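A compact NumPy sketch of this recognition step, with the eigenvector matrix and data passed in as arrays (the names are illustrative, not taken from the paper):

```python
import numpy as np

def recognise(test_vec, train_vecs, train_labels, eigvecs):
    """Project onto the gesture space and return the label of the nearest known projection."""
    mean_gesture = train_vecs.mean(axis=0)
    train_proj = (train_vecs - mean_gesture) @ eigvecs      # known projections
    test_proj = (test_vec - mean_gesture) @ eigvecs         # projection of the subject gesture
    dists = np.linalg.norm(train_proj - test_proj, axis=1)  # Euclidean distances
    return train_labels[np.argmin(dists)]
```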
By using the LDA algorithm for sign recognition, the dimensionality is reduced; this dimensionality reduction also reduces noise and yields high accuracy.

Jain, Sanil, KV Sameer Raja, and Mentor-Prof Amitabha Mukerjee. "Indian sign language
character recognition." Indian Institute of Technology, Kanpur Course Project-CS365A (2016).
In this paper, instead of using high-end technology like gloves or Kinect, they aim to solve this problem for Indian Sign Language using state-of-the-art computer vision and machine learning algorithms.
They chose 8 students with different skin tones and recorded around 6-11 seconds of video per alphabet per person with a 30 fps camera, summing up to about 1 minute of video for every alphabet. The videos were converted into frames, giving around 1,800 images per alphabet.
They divided their approach to the classification problem into three stages.
• The first stage is to segment the skin part from the image, as the remaining part can be regarded as noise with respect to the character classification problem.
• The second stage is to extract relevant features from the skin-segmented images which can prove significant for the next stage, i.e. learning and classification.
• The third stage is to use the extracted features as input to various supervised learning models for training, and then finally use the trained models for classification.
Unlike the RGB model, the HSV model separates the colour and intensity components, which makes it more robust to lighting and illumination changes. So in this approach they transformed the image from RGB space to HSV space and retained the pixels having H and S values in the ranges 25 < H < 230 and 25 < S < 230. However, the HSV model captures a lot of noise and requires a lot of tuning.
The other, final approach uses the YIQ and YUV models: they obtain I and θ = tan⁻¹(U/V) and retain pixels with 30 < I < 100 and 105° < θ < 150°.
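A minimal NumPy sketch of this skin filter, using the standard RGB-to-YIQ/YUV conversion coefficients (an assumption, since the paper does not list them) and the thresholds quoted above:

```python
import numpy as np

def skin_mask(rgb):
    """rgb: HxWx3 image in [0, 255]. Returns a boolean skin mask."""
    r, g, b = [rgb[..., i].astype(float) for i in range(3)]
    i_chan = 0.596 * r - 0.274 * g - 0.322 * b     # I channel of YIQ
    u = -0.147 * r - 0.289 * g + 0.436 * b         # U channel of YUV
    v = 0.615 * r - 0.515 * g - 0.100 * b          # V channel of YUV
    theta = np.degrees(np.arctan2(u, v))           # theta = arctan(U / V)
    return (i_chan > 30) & (i_chan < 100) & (theta > 105) & (theta < 150)
```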
After the skin-segmented images were obtained using the YUV-YIQ model, the following approaches were used for extracting feature vectors: bag of visual words, Histogram of Oriented Gradients (HOG) features with dimensionality reduction, and HOG features without dimensionality reduction.
The machine learning models applied to the feature vectors are Support Vector Machines, where the best accuracies are observed; Random Forests, which fall a little short of the multiclass SVM with 46.45% 4-fold CV accuracy; and hierarchical classification.
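As a small illustration of one such pairing, HOG features feeding a linear-kernel multiclass SVM might be sketched as follows (scikit-image and scikit-learn assumed; the image size, cell/block settings, and placeholder data are not from the paper):

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

# Placeholder data: 100 grayscale 128x128 skin-segmented frames over 26 classes.
images = np.random.rand(100, 128, 128)
labels = np.random.randint(0, 26, 100)

# One HOG descriptor per fixed-size frame gives equal-length feature vectors.
X = np.array([hog(img, pixels_per_cell=(8, 8), cells_per_block=(2, 2)) for img in images])

clf = SVC(kernel="linear")     # linear-kernel multiclass SVM (one-vs-one internally)
clf.fit(X, labels)
predictions = clf.predict(X)
```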
The reported results are four-fold cross-validated: about 25 images per alphabet per person were taken from 4 people (the images of the remaining 4 were in very bad lighting and hence not trained upon), the images of one person were held out as the validation set, the model was trained on the images of the remaining 3, accuracy was measured on the 4th, and finally the average accuracy is reported. Bag of visual words records an accuracy of 32.74%, Gaussian random projection 51.43%, and HOG features with (Random Forests, RBF-kernel SVM, linear-kernel SVM, hierarchical classifier) 46.46%, 4.63%, 54.63%, and 53.23% respectively.
Li, Dongxu, et al. "Word-level deep sign language recognition from video: A new large-scale
dataset and methods comparison." Proceedings of the IEEE/CVF winter conference on
applications of computer vision. 2020.
There are three publicly released word-level ASL datasets: the Purdue RVL-SLLL ASL Database, Boston ASLLVD, and RWTH-BOSTON-50. The Purdue RVL-SLLL ASL Database contains 39 motion primitives with different hand shapes that are commonly encountered in ASL, each produced by 14 native signers. Boston ASLLVD has 2,742 words (i.e., glosses) with 9,794 examples (3.6 examples per gloss on average). RWTH-BOSTON-50 contains 483 samples of 50 different glosses performed by 2 signers.
Sign recognition approaches mainly consist of three steps: feature extraction, temporal-dependency modelling, and classification. Hidden Markov Models (HMMs) are employed to model the temporal relationships in video sequences, and classification algorithms, such as the Support Vector Machine (SVM), are used to label the signs with the corresponding words.
For the ASL dataset, they select videos whose titles clearly describe the gloss of the sign; in total, 68,129 videos of 20,863 ASL glosses are collected from 20 different websites. A temporal boundary is used to indicate the start and end frames of a sign; when a video does not contain repetitions of signs, the boundaries are labelled as the first and last frames of the sign. In order to reduce side effects caused by backgrounds and let the models focus on the signers, YOLOv3 is used as a person detection tool to identify the body bounding boxes of signers in the videos.
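As a rough illustration of this cropping step only, the sketch below uses an off-the-shelf torchvision Faster R-CNN detector as a stand-in for the YOLOv3 model the paper actually uses; the confidence threshold and fallback behaviour are assumptions:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Pre-trained COCO detector used purely as a person detector (class id 1 = person).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

def crop_to_signer(frame_rgb):
    """Return the frame cropped to the highest-scoring person box, or the full frame."""
    with torch.no_grad():
        out = detector([to_tensor(frame_rgb)])[0]
    keep = (out["labels"] == 1) & (out["scores"] > 0.8)
    if not keep.any():
        return frame_rgb                              # fall back to the uncropped frame
    x1, y1, x2, y2 = out["boxes"][keep][0].round().int().tolist()
    return frame_rgb[y1:y2, x1:x2]
```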
They employ two image-based baselines that model the temporal and spatial information of videos in different manners. In 2D convolution with Recurrent Neural Networks, 2D CNNs extract spatial features of the input frames while Recurrent Neural Networks (RNNs) capture the long-term temporal dependencies among them. 3D convolutional networks are able to establish not only the holistic representation of each frame but also the temporal relationship between frames in a hierarchical fashion. Pose-based approaches mainly utilise RNNs to model pose sequences for analysing human motion, and the paper proposes a novel pose-based approach to ISLR using a Temporal Graph Convolutional Network (TGCN). The models, i.e., VGG-GRU, Pose-GRU, Pose-TGCN and I3D, are implemented in PyTorch.
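A minimal PyTorch sketch of what a VGG-GRU style baseline could look like is given below; the hidden size, pooling, and single-layer GRU are assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn
import torchvision

class VGGGRU(nn.Module):
    """Per-frame VGG features fed to a GRU; the last hidden state is classified into glosses."""
    def __init__(self, num_classes, hidden=512):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=True)
        self.backbone = nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.gru = nn.GRU(input_size=512, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clips):                        # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))   # (batch*time, 512) spatial features
        _, h = self.gru(feats.view(b, t, -1))        # temporal modelling over the frame sequence
        return self.classifier(h[-1])                # logits per gloss
```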
Although I3D is larger than the TGCN, Pose-TGCN still achieves results comparable to I3D at top-5 and top-10 accuracy on the large-scale subset WLASL2000. The baseline methods achieve relatively high classification accuracy on the small-size subsets, i.e., WLASL100 and WLASL300.
Saleh, Yaser, and Ghassan Issa. "Arabic sign language recognition through deep neural networks
fine-tuning." (2020): 71-83.

The procedure comprises data collection, under-sampling, augmentation, and fine-tuning. For the Arabic Sign Language dataset they use the VGGNet model architecture, which consists of relatively few layers yet provides good results with an overall error of 7.3%, and the ResNet model, which offers improvements for larger datasets and is able to achieve a 3.57% error on ImageNet.
The dataset originally contained 54,049 images distributed across 32 classes of Arabic signs; the image dimensions are unified at 64 x 64, and many variations were introduced through different lighting and backgrounds. In under-sampling, re-sampling is applied to the data to fix class imbalance and reduce bias; under-sampling is a re-sampling technique used to reduce the size of the majority class. Data augmentation is a process in which new data are created from existing data, increasing data diversity for training. Data augmentation can help build a more robust classification network and can boost its accuracy; the authors show how different types of augmentation affect the training of deep convolutional neural networks.
Random operations are applied to the existing images, and the resulting dataset is then fed into the training algorithm through an image augmentation object; on every iteration of training, this object applies the following augmentations to the images (a code sketch follows the list):
• Rescaling with a 1/255 factor
• Random horizontal flipping
• Random rotation
• Random height and width shifts
• Random zoom
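A minimal sketch of such an augmentation object, assuming the Keras ImageDataGenerator API; the exact rotation, shift, and zoom ranges are illustrative, as the paper does not quote them:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rescale=1.0 / 255,        # rescaling with a 1/255 factor
    horizontal_flip=True,     # random horizontal flipping
    rotation_range=20,        # random rotation (degrees, assumed range)
    width_shift_range=0.1,    # random width shift (assumed range)
    height_shift_range=0.1,   # random height shift (assumed range)
    zoom_range=0.1,           # random zoom (assumed range)
)
# Hypothetical usage with an assumed directory layout:
# train_gen = augmenter.flow_from_directory("arsl_dataset/train", target_size=(64, 64))
```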
The VGG16 and ResNet152 models are chosen for their well-known high performance. Furthermore, fine-tuning the networks makes it possible to use a smaller dataset, such as the one used in this methodology, and requires a smaller number of epochs, each epoch representing a pass through the network with the new data.


The training process ran for 100 epochs, each a pass through the network using the whole dataset. In each epoch, multiple batches of the image dataset were passed through the network while the loss was computed with categorical cross-entropy between predictions and targets and the weights were updated with Stochastic Gradient Descent, using a learning rate of 0.0001, which controlled the size of the update steps, and a momentum of 0.9.
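A hedged Keras sketch of this fine-tuning setup is shown below; the frozen convolutional base, the 256-unit dense layer, and the use of data generators are assumptions beyond what the paper states, while the 32-class output, loss, and SGD settings follow the description above:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16
from tensorflow.keras.optimizers import SGD

base = VGG16(weights="imagenet", include_top=False, input_shape=(64, 64, 3))
base.trainable = False                       # freeze the pre-trained convolutional layers (assumed)

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),    # assumed size of the new dense layer
    layers.Dense(32, activation="softmax"),  # 32 Arabic sign classes
])
model.compile(loss="categorical_crossentropy",
              optimizer=SGD(learning_rate=1e-4, momentum=0.9),
              metrics=["accuracy"])
# model.fit(train_gen, epochs=100, validation_data=val_gen)   # generators as sketched above
```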
The model was able to reach a 99% validation accuracy at the 40th epoch, while the highest
accuracy of 99.45% was reached at the 92nd epoch.

Stoll, Stephanie, et al. "Text2Sign: towards sign language production using neural machine
translation and generative adversarial networks." International Journal of Computer
Vision 128.4 (2020): 891-908.
The text-to-sign-language (text2sign) translation system consists of two stages: first, an NMT network is trained to obtain a sequence of gloss probabilities, which is used to solve a Motion Graph (MG) that generates human pose sequences; then a pose-conditioned sign generation network with an encoder-decoder-discriminator architecture produces the output sign video.
For text-to-pose translation they employ RNN-based machine translation methods, namely attention-based NMT approaches, to translate spoken language sentences into sign language gloss sequences, using an encoder-decoder architecture with Luong attention. The motion primitives need to be extracted from a larger set of motion-capture data; this can be done by identifying key frames in the motion data at the transition points between motions, e.g. the left foot impacting the floor in walking sequences. The pose-to-video translation network combines a convolutional image encoder and a Generative Adversarial Network (GAN): a generator G that creates new data instances, and a discriminator D that evaluates whether these belong to the same data distribution as the training data. G is an encoder-decoder, conditioned on human pose and appearance.
The spoken-language-to-sign-pose network is trained on PHOENIX14T. However, due to the limited number of signers in that dataset, another large-scale dataset is used to train the multi-signer (MS) generation network, namely the SMILE Sign Language Assessment Dataset, which contains 42 signers performing 100 isolated signs for three repetitions in Swiss German Sign Language (DSGS). In data preprocessing, the continuous samples of the PHOENIX14T dataset are split by gloss using a forced-alignment approach.

Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative Adversarial Text to Image Synthesis." arXiv:1605.05396.
The approach is to train a deep convolutional generative adversarial network conditioned on text features encoded by a hybrid character-level convolutional recurrent neural network. Both the generator network G and the discriminator network D perform feed-forward inference conditioned on the text feature. In the generator G, a noise vector is first sampled from the prior z ∈ R^Z ~ N(0, 1) and the text query t is encoded using the text encoder ϕ. The description embedding ϕ(t) is first compressed to a small dimension by a fully connected layer followed by a leaky ReLU, and then concatenated with the noise vector z. CUB has 150 train+val classes and 50 test classes, while Oxford-102 has 82 train+val and 20 test classes; for both datasets, 5 captions per image are used. During mini-batch selection for training, an image view (e.g. crop, flip) and one of the captions are picked at random. For the text features, a deep convolutional recurrent text encoder is first pre-trained on a structured joint embedding of text captions with 1,024-dimensional GoogLeNet image embeddings; for both Oxford-102 and CUB a hybrid of a character-level ConvNet and a recurrent neural network is used. The training image size was set to 64 x 64 x 3. The text encoder produced 1,024-dimensional embeddings that were projected to 128 dimensions in both the generator and the discriminator before depth concatenation into convolutional feature maps. GAN-CLS generates sharper and higher-resolution samples that roughly correspond to the query, but AlignDRAW samples more noticeably reflect single-word changes in the selected queries from that work.
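A minimal PyTorch sketch of the generator-side conditioning described above (compress ϕ(t) with a fully connected layer and leaky ReLU, then concatenate with z); the 100-dimensional noise vector and the 0.2 leak slope are assumptions:

```python
import torch
import torch.nn as nn

class TextConditioner(nn.Module):
    """Builds the conditioned latent input fed to the deconvolutional generator."""
    def __init__(self, embed_dim=1024, proj_dim=128, z_dim=100):
        super().__init__()
        self.compress = nn.Sequential(nn.Linear(embed_dim, proj_dim), nn.LeakyReLU(0.2))
        self.z_dim = z_dim

    def forward(self, phi_t):                             # phi_t: (batch, 1024) text embeddings
        z = torch.randn(phi_t.size(0), self.z_dim)        # z ~ N(0, 1)
        return torch.cat([self.compress(phi_t), z], dim=1)
```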
