
LITERATURE REVIEW

Dimitri Palaz et al. [1] note that automatic speech recognition systems model the relationship between the acoustic speech signal and phone classes in two stages: extraction of spectral-based features based on prior knowledge, followed by training of an acoustic model, typically an artificial neural network (ANN). It has been shown that Convolutional Neural Networks (CNNs) can model phone classes directly from the raw acoustic speech signal, reaching performance on par with existing feature-based approaches. The paper extends the CNN-based approach to a large-vocabulary speech recognition task. More precisely, the proposed method compares the CNN-based approach against the conventional ANN-based approach on the Wall Street Journal corpus. The studies show that the CNN-based approach achieves better performance than the conventional ANN-based approach with the same number of parameters. The authors also show that the features learned from raw speech by the CNN-based approach can generalize across different databases.
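
To make the idea of learning phone classes directly from the raw waveform concrete, the following is a minimal, hypothetical Keras sketch of a 1D CNN over raw audio samples. It is not the authors' architecture; the window length, filter sizes, and the assumed number of phone classes (NUM_PHONES) are illustrative assumptions only.

# Hypothetical sketch: a small 1D CNN mapping raw waveform windows to phone
# classes, in the spirit of the CNN-based approach described above (not the
# authors' exact architecture; all sizes below are assumptions).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_PHONES = 40          # assumed number of phone classes
WINDOW_SAMPLES = 4000    # assumed 250 ms window at 16 kHz

model = models.Sequential([
    layers.Input(shape=(WINDOW_SAMPLES, 1)),        # raw samples, no hand-crafted features
    layers.Conv1D(80, kernel_size=30, strides=10, activation="relu"),
    layers.MaxPooling1D(pool_size=3),
    layers.Conv1D(60, kernel_size=7, activation="relu"),
    layers.MaxPooling1D(pool_size=3),
    layers.Flatten(),
    layers.Dense(100, activation="relu"),
    layers.Dense(NUM_PHONES, activation="softmax"),  # per-window phone posteriors
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()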

Ossama Abdel-Hamid et al. [2] show that an error rate reduction can be obtained by using convolutional neural networks (CNNs). They first present a concise description of the basic CNN and explain how it can be used for speech recognition, and then propose a limited-weight-sharing scheme that can better model speech features. The special structure of CNNs, such as local connectivity, weight sharing, and pooling, exhibits some degree of invariance to small shifts of speech features along the frequency axis, which is important for dealing with speaker and environment variations. Experimental results show that CNNs reduce the error rate by 6%-10% compared with DNNs on the TIMIT phone recognition and voice search large-vocabulary speech recognition tasks.

Jui-Ting Huang et al. [3] provide a detailed analysis of CNNs by visualizing the localized filters learned in the convolutional layer, showing that edge detectors in varying directions can be learned automatically. They then identify four domains in which CNNs can consistently provide advantages over fully-connected deep neural networks (DNNs): channel-mismatched training-test conditions, noise robustness, distant speech recognition, and low-footprint models. For distant speech recognition, a CNN trained on 1000 hours of Kinect distant speech data obtains a relative 4% word error rate reduction (WERR) over a DNN of similar size; this is the largest corpus reported in the literature so far for demonstrating the effectiveness of CNNs. Lastly, they establish that the CNN structure combined with maxout units is the most effective model under small-size constraints for deploying small-footprint models to devices; this setup gives a relative 9.3% WERR over DNNs with sigmoid units.

Sunchan Park et al. [4] observe that convolutional neural network (CNN) acoustic models show a lower word error rate (WER) in distant speech recognition than fully-connected DNN acoustic models. To improve the performance of reverberant speech recognition with CNN acoustic models, the proposed method uses a multiresolution CNN with two separate streams: a wideband feature stream with a wide-context window and a narrowband feature stream with a narrow-context window. Experiments on the ASR task of the REVERB challenge 2014 showed that the proposed multiresolution CNN-based approach reduced the WER by 8.79% and 8.83% on the simulated and real-condition test data, respectively, compared with the conventional CNN-based method.

[1] Used a low-cost 3D motion sensor, the Leap Motion sensor, to extract the direction of motion, position, and velocity of the hand; k-nearest neighbour and support vector machine classifiers were then applied to these features for sign language recognition. Four separate datasets were considered, and in each iteration three datasets were used for training while the fourth was used for testing. The distances between the tips of different fingers were calculated to extract gestures such as pinch and grab.
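
The following is an illustrative sketch, not the cited system, of how such hand-feature vectors could be classified with k-NN and an SVM in scikit-learn. The synthetic data stand in for Leap Motion measurements (fingertip positions, motion direction and velocity); the feature dimension and class count are assumptions.

# Illustrative sketch: k-NN and SVM classification of hand-feature vectors.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 12))        # 12 assumed per-frame hand features
y = rng.integers(0, 5, size=400)      # 5 assumed gesture classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
svm = SVC(kernel="rbf").fit(X_train, y_train)

print("k-NN accuracy:", knn.score(X_test, y_test))
print("SVM accuracy:", svm.score(X_test, y_test))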

[2] Proposed a CNN architecture for a selfie-based sign language recognition system. The dataset contained 5000 images of 200 signs performed by 5 sign language users in different orientations. In the proposed CNN architecture, the feature-extraction layers consisted of four convolutional layers, four ReLU (rectified linear unit) layers, and two stochastic pooling layers, while the classification layers were a dense layer, a ReLU layer, and a softmax layer. Of the four convolutional layers, the first two extracted low-level features while the last two extracted high-level features. The model gave better results than some commonly used algorithms such as AdaBoost and conventional ANNs.
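
A rough Keras sketch of the layer layout described above is given below: four convolutional layers with ReLU and two pooling stages for feature extraction, followed by a dense layer and a softmax classifier for 200 signs. Stochastic pooling is not a built-in Keras layer, so max pooling is used as a stand-in; the input size and filter counts are assumptions.

# Rough sketch of the described architecture (sizes are assumptions).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu", padding="same"),   # low-level features
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),                                      # stand-in for stochastic pooling
    layers.Conv2D(64, 3, activation="relu", padding="same"),    # high-level features
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(200, activation="softmax"),                    # 200 sign classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])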

[3] Proposed a selfie-based sign language recognition system. The two major implementation problems were that only one hand of the user was available to make gestures, because the second hand held the selfie stick, and that background disturbances were created by the shaking of the selfie stick. To extract the hand sign made by the user, a Gaussian filter, a Sobel gradient, and region filling were used to extract the hand and head regions, and morphological subtraction was applied to the output to obtain the hand and head contours. Since the distances between fingers and the differences between various hand gestures in sign language are minute, Euclidean distance and normalized Euclidean distance failed to give good results, and Mahalanobis distance was used for classification.
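
The sketch below illustrates, with synthetic placeholder feature vectors rather than the cited system's data, why Mahalanobis distance can separate gestures whose raw feature differences are small: it scales distances by the class covariance instead of treating all feature directions equally.

# Minimal Mahalanobis-distance classification sketch (synthetic data).
import numpy as np

def mahalanobis(x, mean, cov_inv):
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(1)
class_a = rng.normal(loc=0.00, scale=0.05, size=(50, 4))  # gesture A feature vectors
class_b = rng.normal(loc=0.08, scale=0.05, size=(50, 4))  # gesture B, only slightly shifted

stats = {}
for name, data in {"A": class_a, "B": class_b}.items():
    stats[name] = (data.mean(axis=0), np.linalg.inv(np.cov(data, rowvar=False)))

query = rng.normal(loc=0.08, scale=0.05, size=4)
pred = min(stats, key=lambda c: mahalanobis(query, *stats[c]))
print("predicted gesture:", pred)
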
[4] Proposed a system consisting of an Artificial Neural Network and a HOG (Histogram of Oriented Gradients) feature descriptor. The HOG descriptor finds the gradient of intensity and edge direction in the input image, and sudden changes in the gradient are used to find the edges and contours of the ROI (Region of Interest). Once the ROI is extracted, it is given as input to the neural network, which uses it for learning and classification through feature vector generation; the resulting neural model classifies user input images. The approach eliminates the need for sensor-based systems, which commonly use sensor gloves or specially coloured gloves for reliable identification of the hand ROI, and makes the sign language recognition system more accessible to people.
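
An illustrative sketch of this HOG-plus-neural-network pipeline follows: HOG encodes local gradient orientations of the hand ROI, and the resulting feature vector is fed to a small neural network classifier. The images and labels are synthetic placeholders, and the use of scikit-image and scikit-learn here is an assumption for illustration.

# HOG features + small neural network classifier (synthetic data).
import numpy as np
from skimage.feature import hog
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
images = rng.random((60, 64, 64))                 # stand-ins for cropped hand ROIs
labels = rng.integers(0, 3, size=60)              # 3 assumed sign classes

features = np.array([
    hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    for img in images
])

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(features, labels)
print("training accuracy:", clf.score(features, labels))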

[5] Proposed Statistical Dynamic Time Warping (SDTW) for time alignment, along with novel classification techniques, Combined Discriminative Feature Detectors and Quadratic Classification on DF Fisher Mapping, which performed better than conventional Hidden Markov Models combined with SDTW. Dimensionality was reduced using Fisher mapping.
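
For context, the following is a worked sketch of plain dynamic time warping, the alignment idea that SDTW extends statistically: two gesture feature sequences of different lengths are aligned by dynamic programming. The sequences here are 1-D toy examples, not data from the cited work.

# Plain DTW distance between two 1-D sequences (toy example).
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

seq1 = np.array([0.0, 0.2, 0.9, 1.0, 0.4])
seq2 = np.array([0.0, 0.1, 0.3, 0.8, 1.0, 0.9, 0.3])
print("DTW distance:", dtw_distance(seq1, seq2))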

[6] Used a Kalman filter and improved histogram backprojection for hand and face extraction based on skin colour. Motion-difference images were calculated and streak features were extracted for pattern recognition. The signer was required to wear coloured gloves, and one limitation was that the signer's head had to be kept still.
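
Below is a hedged OpenCV sketch of skin-colour histogram backprojection (assuming the cited "histogram backpropagation" refers to backprojection): a hue-saturation histogram built from a skin sample is backprojected onto a frame to highlight hand and face regions. The file paths are placeholders.

# Skin-colour histogram backprojection with OpenCV (placeholder file paths).
import cv2
import numpy as np

skin_sample = cv2.imread("skin_patch.png")        # placeholder: small skin-only crop
frame = cv2.imread("frame.png")                   # placeholder: current video frame

skin_hsv = cv2.cvtColor(skin_sample, cv2.COLOR_BGR2HSV)
frame_hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

# Hue-saturation histogram of the skin sample, normalised to [0, 255].
hist = cv2.calcHist([skin_hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

# Backprojection: each pixel gets the likelihood of being skin-coloured.
backproj = cv2.calcBackProject([frame_hsv], [0, 1], hist, [0, 180, 0, 256], 1)
_, mask = cv2.threshold(backproj, 50, 255, cv2.THRESH_BINARY)
cv2.imwrite("skin_mask.png", mask)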

[7] Proposed the use of Multi-class Support Vector Machines (MCSVM) on features extracted by a Convolutional Neural Network. The system used a non-linear MCSVM because a standard SVM can only distinguish between two classes with the help of a hyperplane; to build a classifier for a non-linearly separable dataset, non-linear kernel functions become necessary, and the proposed system uses the Gaussian Radial Basis Function kernel for this purpose.
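
A minimal sketch of that idea follows: feature vectors (standing in for CNN activations, synthetic here) are classified with a multi-class SVM using the Gaussian RBF kernel; scikit-learn's SVC handles the multi-class case internally. The feature dimension and class count are assumptions.

# CNN-style features classified by a multi-class RBF SVM (synthetic data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
cnn_features = rng.normal(size=(300, 128))   # assumed 128-d CNN feature vectors
labels = rng.integers(0, 10, size=300)       # 10 assumed sign classes

clf = SVC(kernel="rbf", gamma="scale", C=1.0)  # Gaussian radial basis kernel
clf.fit(cnn_features, labels)
print("training accuracy:", clf.score(cnn_features, labels))
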
[8] Used kurtosis position and principal component analysis (PCA) as descriptors, motion chain coding for hand movement representation, and a Hidden Markov Model (HMM) for classification of user input images. The HMM classifier was used to test the contribution of the proposed feature combinations. When only one feature was used, PCA proved to be the best feature, with an error rate of 13.63%; when two features were used, the combination of PCA and kurtosis position improved the error rate to 11.82%, a decrease of 1.81%. When a combination of three features was employed, the error rate improved to 10.90%, a decrease of 2.73%.

A. Related Work

There is a significant amount of work already done on gesture recognition. The existing approaches can be divided into two categories: one uses a special device, and the other uses deep learning to detect the hands and recognize the gesture they represent.

1) Device-based recognition: In the Kinect sensor approach [7], the proposed method uses a hierarchical CRF to detect sign segments from hand motions and then a BoostMap embedding method to verify the handshapes of the segmented signs. It uses a Microsoft Kinect sensor to capture 3D depth information about the hand motion. In [8] a sensor glove is used to capture the gesture, and an Artificial Neural Network (ANN) is used to recognize and classify it. [9] uses flex sensors to capture finger and hand movements and, after capturing the posture, determines whether it belongs to any predefined category; based on that category, a matching algorithm is used to recognize the exact value of the posture. This model is used to recognize Vietnamese sign language. [10] proposed a recognition system using a Microsoft Kinect, a Convolutional Neural Network (CNN), and GPU acceleration; the system was able to recognize 20 Italian gestures with high accuracy. In general, device-based recognition works in two steps: first the machine captures the sensory signal from the gesture, and then it recognizes the sign from that captured signal. This method generally gives accurate results, but it is impractical for daily use and the devices are still quite expensive.

2) Deep learning-based recognition: Computer vision and deep learning-based recognition extracts useful feature maps from an input image using a neural network with hidden layers. Several deep learning networks give very good results on object detection, such as AlexNet [11], VGGNet [12], ResNet [13], etc. There are also region-based models such as R-CNN [14], Fast R-CNN [15], and Faster R-CNN [16], which first find the regions of an image that are likely to contain an object and then determine whether the object is indeed present in each region; these methods are called two-stage object detectors. Several approaches apply these state-of-the-art CNN models to recognize hand gestures. In [17] various image processing methods are applied, and KNN and SVM are then used for classification. [18] uses a pre-trained GoogLeNet architecture to recognize ASL and convert it into the English alphabet; the accuracy of this method was about 72%, and the model can recognize the letters "a" to "e". There is also a ResNet-based model for recognizing ASL [19], which proposes a two-level ResNet50-based neural network architecture and transfer learning [20] to recognize the gestures. In [21] a human keypoint extraction system is used: key points are extracted from the face, hands, and body, and the extracted key points are fed as input to a Recurrent Neural Network (RNN); the accuracy was about 89.7%. [22] proposes a custom deep neural network with 3 convolutional layers and 3 max-pooling layers; the accuracy was about 82%, evaluated on English Sign Language (ESL). [23] recognizes gestures using an SSD deep learning algorithm built as a 19-layer neural network. In [24] YOLOv3 is used to localize the hand gesture, after which the localized hand region is fed to a CNN to detect the gesture. At a high level, deep learning methods apply various filters to extract feature maps; pooling layers then reduce the feature map dimensions, and the result is flattened into a vector and fed to the neural network. This approach does not need any equipment, but it is computationally expensive, since each filter moves over the input image in a sliding-window fashion to extract the features useful for recognizing the image, and both the design and the computation are costly.

B. YOLO

YOLO [2] is based on a Convolutional Neural Network (CNN) and can produce fast and effective object detection. In the YOLO (You Only Look Once) method, the input image is passed through the neural network only once, and the detected objects are predicted directly. It works by dividing the input image into grids of a predefined size and predicting the probability of the desired object in each grid cell. It predicts all classes and all object bounding boxes in the image in one run of the algorithm, which makes it a very fast end-to-end object detector suitable for real-time detection. There have also been continuous improvements to the YOLO algorithm [25] [26] [27] [6] in terms of accuracy, speed, and model size.
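
The sketch below is a hedged illustration of YOLO-style single-pass inference using OpenCV's DNN module. The cfg/weights/image file names and the thresholds are placeholders, not the specific models used in the cited works; each prediction row contains the normalized box center, size, objectness, and class scores described above.

# YOLO-style inference with OpenCV DNN (placeholder model and image files).
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")  # placeholder files
image = cv2.imread("hand_sign.jpg")                               # placeholder input
h, w = image.shape[:2]

blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())  # one pass, all predictions

boxes, confidences, class_ids = [], [], []
for output in outputs:
    for det in output:
        scores = det[5:]
        class_id = int(np.argmax(scores))
        conf = float(scores[class_id])
        if conf > 0.5:
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(conf)
            class_ids.append(class_id)

keep = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)  # non-maximum suppression
print("detections kept:", len(keep))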

[3] Research on sign language systems has followed two well-known approaches: image processing and data gloves. The image processing technique [4] [5] uses a camera to capture images or video, analyses the data against static images, recognizes the image using algorithms, and produces sentences on a display. Vision-based sign language recognition systems mainly use algorithms such as Hidden Markov Models (HMM) [6], Artificial Neural Networks (ANN), and the Sum of Absolute Differences (SAD) algorithm to extract the image and eliminate unwanted background noise. The main drawback of vision-based sign language recognition is that the image acquisition process has many environmental constraints, such as camera placement, background conditions, and lighting sensitivity. The camera must be placed so that it captures the maximum achievable range of hand movements, and a higher-resolution camera takes more computation time and occupies more memory. The user always needs a camera available, and such systems are hard to deploy in public places. Another research approach is a sign language recognition system using a data glove [7] [8]. The user needs to wear a glove consisting of flex sensors and a motion tracker. Data are obtained directly from each sensor depending on finger flexure, and the computer compares the sensor data with static data to produce sentences; a neural network is used to improve system performance. The main advantages of this approach are lower computational time and fast response in real-time applications; the device is portable and its cost is low. Another approach uses a portable accelerometer (ACC) and surface electromyogram (sEMG) sensors [9] to measure hand gestures. The ACC captures movement information of the hands and arms, while the placed sEMG sensors produce different signals for different gestures. The sensor output signals are fed to a computer, which recognizes the hand gesture and produces speech/text. However, none of the above methods provides users with natural interaction. The proposed system will be capable of carrying out a conversation without any wearable device, instead relying on human motion and gesture recognition.

II. LITERATURE SURVEY

Tanuj Bohra et al. proposed a two-way real-time sign language conversion system based on image processing and deep learning using computer vision. Procedures such as hand detection, skin colour separation, median blur, and frame detection are performed on the database images for best results. The CNN model was trained on a large database of 40 classes and was able to predict 17600 test images in 14 seconds with 99% accuracy.
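
The following is an illustrative OpenCV sketch of the preprocessing steps mentioned above: skin colour separation in HSV space, a median blur to remove noise, and edge/contour detection on the resulting mask. The HSV range and the file path are assumptions, not values from the cited work.

# Skin-colour separation, median blur, and contour detection (placeholder image).
import cv2
import numpy as np

frame = cv2.imread("sign_frame.jpg")                       # placeholder input image
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

lower_skin = np.array([0, 40, 60], dtype=np.uint8)         # rough skin-tone range
upper_skin = np.array([25, 180, 255], dtype=np.uint8)
mask = cv2.inRange(hsv, lower_skin, upper_skin)            # skin colour separation

mask = cv2.medianBlur(mask, 5)                             # median blur
edges = cv2.Canny(mask, 100, 200)                          # edge map of the hand region
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
print("hand candidate contours:", len(contours))
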
Joyeeta Singha and Karen Das proposed an Indian Sign Language recognition system for live video. The system consists of three stages: pre-processing, which involves skin filtering and histogram matching; feature extraction using eigenvalues and eigenvectors; and classification using an eigenvalue-weighted Euclidean distance. The dataset contained 480 images of 24 ISL symbols signed by 20 people. The system was tested on 20 videos and achieved 96.25% accuracy.

Muthu Mariappan H. and Gomathi V. designed a real-time sign language recognition system as a portable unit that uses contour detection and the fuzzy c-means clustering algorithm. Contours are used to detect the face, left hand, and right hand, while the fuzzy c-means algorithm is used to divide the input data into a specified number of clusters. The system was evaluated on a database containing video recordings of 10 signers for a few words and sentences and achieved 75% accuracy.

Salma Hayani et al. proposed a CNN-based Arabic sign language recognition system inspired by LeNet-5. The database contained 7869 images of Arabic numerals and letters. Various tests were performed by varying the training split from 50% to 80%, and 90% accuracy was achieved with an 80% training split. The authors also compared the results with machine learning algorithms such as KNN (k-nearest neighbours) and SVM (support vector machine) to demonstrate the system's performance. This model was image-based only and could be extended to video-based recognition.

Kshitij Bantupalli and Ying Xie built an American Sign Language video recognition system based on Convolutional Neural Networks, LSTM (Long Short-Term Memory), and RNN (Recurrent Neural Network). A CNN model called Inception was used to extract spatial features from frames, while the LSTM captured long-term dependencies and the RNN extracted temporal features. Various tests were performed for different sample sizes; the database contains 100 different signs performed by 5 signers, and a best accuracy of 93% was achieved. Frame sequences are fed to the LSTM for longer durations, and the outputs of the softmax layer and the max-pooling layer are provided to the RNN architecture to extract temporal features.
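
The Keras sketch below is a rough, hedged rendering of that pipeline: a pretrained-style Inception network extracts one feature vector per video frame, and an LSTM models the frame sequence before a softmax layer predicts the sign. The frame data are random placeholders, the sequence length and class count are assumptions, and random weights are used here for brevity.

# Inception feature extraction per frame + LSTM over the sequence (toy data).
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

NUM_FRAMES, NUM_CLASSES = 20, 100

# Per-frame feature extractor (use weights="imagenet" for pretrained features).
cnn = InceptionV3(weights=None, include_top=False, pooling="avg")

frames = np.random.rand(NUM_FRAMES, 299, 299, 3).astype("float32")  # one dummy clip
frame_features = cnn.predict(frames, verbose=0)                      # shape (20, 2048)

# Temporal model over the sequence of CNN features.
temporal = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, frame_features.shape[1])),
    layers.LSTM(256),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
temporal.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
print(temporal.predict(frame_features[np.newaxis], verbose=0).shape)  # (1, 100)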

Mahesh Kumar put forward a system that can identify 26 Indian Sign Language gestures using Linear Discriminant Analysis (LDA). Pre-processing steps such as skin segmentation and background processing are applied to the database, with skin segmentation performed using the Otsu algorithm, and linear discriminant analysis is used for feature extraction. Each gesture is represented as a column vector in the training phase and normalized with respect to the mean gesture. The algorithm computes the eigenvectors of the covariance matrix of the mean-adjusted gestures. In the recognition phase, the test vector is likewise normalized relative to the mean gesture and projected into the gesture space using the eigenvector matrix. The Euclidean distance is calculated between this projection and all known projections, and the closest matches are selected.
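
A worked numpy sketch of the eigen-projection and Euclidean-distance matching described above follows. It uses a PCA-style projection for illustration (the cited work uses LDA for feature extraction), and the data are random placeholders for flattened gesture images.

# Eigen-projection of gestures and nearest-neighbour matching (toy data).
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((26, 400))            # 26 gestures, each flattened to 400 values
labels = np.arange(26)

mean_gesture = train.mean(axis=0)
centred = train - mean_gesture

# Eigenvectors of the covariance matrix via SVD (columns of vt.T).
_, _, vt = np.linalg.svd(centred, full_matrices=False)
eigenvectors = vt[:10].T                 # keep the 10 leading components

train_proj = centred @ eigenvectors      # projections of the known gestures

test = train[7] + rng.normal(scale=0.05, size=400)   # noisy copy of gesture 7
test_proj = (test - mean_gesture) @ eigenvectors

distances = np.linalg.norm(train_proj - test_proj, axis=1)
print("recognised gesture:", labels[np.argmin(distances)])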

Suharjito et al. attempted sign language recognition with the i3d Inception model using transfer learning. The public LSA64 dataset was used, with 10 words and 500 videos. For training, the dataset was split in a 6:2:2 ratio: 300 training videos, 100 validation videos, and 100 test videos. The model achieved good training accuracy but very low validation accuracy.

Juan Zamora-Mora et al. introduced a hybrid CNN-HMM for sign language recognition. They experimented on three databases: RWTH-PHOENIX-Weather 2012, RWTH-PHOENIX-Weather Multi-Signer 2014, and the single-signer SIGNUM corpus. The training and validation sets were split in a 10:1 ratio. After CNN training, a softmax layer is added and its outputs are used by the HMM as observation probabilities.

Mengyi Xie and Xin Ma put forward an end-to-end approach using a residual neural network for American Sign Language recognition. The dataset contains 2524 images of 36 classes, and data augmentation was used to expand it to 17640 images. The images are converted to CSV format, one-hot encoded, and fed into a ResNet50 network for training. The model achieves 96.02% accuracy without data augmentation, improving to 99.4% with augmentation.
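
The following is a hedged Keras sketch of ResNet50-based sign classification in the spirit described above: a ResNet50 backbone with a softmax head for 36 classes and a loss matching one-hot encoded labels. The input size, the use of random weights, and the absence of fine-tuning details are assumptions for brevity (pretrained ImageNet weights would normally be loaded).

# ResNet50 backbone with a 36-class softmax head (illustrative only).
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

NUM_CLASSES = 36   # letters and digits, per the dataset description

backbone = ResNet50(weights=None, include_top=False, pooling="avg",
                    input_shape=(224, 224, 3))

model = models.Sequential([
    backbone,
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",   # matches one-hot encoded labels
              metrics=["accuracy"])
model.summary()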

G. Anantha Rao et al. propose Indian Sign Language gesture recognition using a convolutional neural network, applied to videos taken with a front-facing mobile camera. The database was created from 200 ISL (Indian Sign Language) signs. CNN training is done on three different configurations: the first uses a single training set as input, the second uses two training sets, and the third uses three training sets. The average recognition accuracy of this CNN model is 92.88%.

Aditya Das et al. trained a convolutional neural network using the Inception v3 model for American Sign Language. Data augmentation was applied to the images before training to avoid overfitting. The model achieves more than 90% accuracy on the Sreehari Sreejith database of 24 class labels with 100 images per class.


Many research attempts have been made for object detection and recognition utilizing deep
learning algorithms such as CNN, RCNN, YOLO, and others. This study includes a literature review to help
comprehend some of these algorithms. 

The authors, Aleksa Ćorović et al. (2018), developed a system for recognizing traffic participants using YOLOv3 and the Berkeley DeepDrive dataset. The system can recognize five object classes (trucks, cars, traffic signs, pedestrians, and traffic lights) in various driving circumstances (snow, overcast and bright sky, fog, and night). The accuracy was 63%.

Omkar Masurekar et al. (2020) developed an object detection model to help visually impaired people. They utilized YOLOv3 and a custom dataset with three classes (bottle, bus, and mobile), and Google Text To Speech (GTTS) was employed for sound generation. The authors found that the time required for detecting the items in each frame was eight seconds, with 98% accuracy.

Sunit Vaidya et al. (2020) developed an object detection web application and an Android application. These systems used YOLOv3 and the COCO dataset. The authors found that the highest accuracy of the web application was 89% on desktop computers and 85.5% on mobile phones. The time needed to detect the objects was two seconds, and this time rose as the number of objects increased.

S. Mahmoud et al. (2020) developed a model for object detection in optical remote sensing images using Mask R-CNN and the NWPU VHR-10 dataset. The model can detect ten different classes of objects; the best detection accuracy was 95% and the detection time was 7.1 seconds. Deep learning methods are also used to implement other types of applications, such as monitoring systems, sign language translation, and so on.

Azher Atallah et al. (2020) developed a sign language translation system using a CNN with TensorFlow and a custom dataset. The system translated sign language into voice, can distinguish 40 different hand gestures, and achieved a 98% accuracy rate.

http://ijdri.com/me/wp-content/uploads/2021/06/17.pdf
The authors, Steve Daniels et al., developed a sign language recognition system that transforms real-time video input into hand signs using a YOLOv3 pretrained model further trained on the desired configuration. Image preprocessing is performed on the input images. To compare the accuracy of the results obtained, the model was run on both image and video data.

https://www.ijraset.com/best-journal/realtime-telugu-sign-language-translator-with-computer-vision

2. Literature Survey

The author of [5] developed a framework entitled "Indian sign language translator using gesture recognition algorithm". The framework translates gestures made in ISL into English. The gesture recognition system captures gestural data, and the vision-based strategy incorporates image pre-processing. The database for this framework was built from recorded videos of deaf and mute signers, which makes the included signs authentic. Among the various algorithms for pre-processing, feature extraction, and vector quantization, the most effective combination was shortlisted: a combined output algorithm for pre-processing, 2D FFT Fourier descriptors for feature extraction, and a 4-vector codebook LBG for vector quantization. The authors of [6] proposed a convolutional neural network (CNN) strategy for recognizing hand gestures from camera captures of human task activities. To obtain the CNN's training and testing data, skin detection and estimation of hand location and orientation are combined. Since lighting conditions have a major impact on skin colour, the authors use a Gaussian Mixture Model (GMM) to model skin colour; the model is employed to effectively filter out non-skin colours in an image. The system also achieved satisfactory results on characteristic gestures in continuous motion using the proposed approach. The authors of [7] discussed a sign language recognition system using a Back Propagation Neural Network algorithm based on American Sign Language. The proposed framework uses images from the local system, or frames captured from a webcam, as input. The framework uses two classifiers: one uses raw image features and the other uses thresholded features. The backpropagation algorithm was used for learning, and the Marcel Static Hand Posture dataset was used as the database.
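
To illustrate the GMM-based skin model mentioned for [6], the sketch below fits a Gaussian Mixture Model to skin-pixel colour samples and scores whether new pixels are skin-coloured. The sample data, colour space, and threshold are synthetic placeholders, and the use of scikit-learn's GaussianMixture is an assumption for illustration.

# GMM skin-colour model: fit on skin samples, score candidate pixels (toy data).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in skin samples in a chrominance-like 2-D colour space.
skin_pixels = rng.normal(loc=[140, 110], scale=8, size=(1000, 2))

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(skin_pixels)

candidates = np.array([[142, 111], [60, 200]])       # skin-like vs. background-like
log_likelihood = gmm.score_samples(candidates)
is_skin = log_likelihood > -10                        # threshold chosen for this toy data
print(dict(zip(["pixel_0", "pixel_1"], is_skin)))
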
Sign Language Translator Using YOLO Algorithm. Available from: https://www.researchgate.net/publication/356080179_Sign_Language_Translator_Using_YOLO_Algorithm [accessed Jan 24 2023].

[5] Badhe PC, Kulkarni V. Indian sign language translator using gesture recognition algorithm. In 2015 IEEE International Conference on Computer Graphics, Vision and Information Security (CGVIS) 2015 Nov 2 (pp. 195-200). IEEE.


[7] Karayılan T, Kılıç Ö. Sign language recognition. In 2017 International Conference on Computer Science and Engineering (UBMK) 2017 Oct 5 (pp. 1122-1126). IEEE.


https://www.irjet.net/archives/V9/i5/IRJET-V9I5586.pdf
