
Designing a Talking System for Sign Language Recognition of People with Autism Spectrum Disorder

INTRODUCTION:
Autism Spectrum Disorder (ASD), commonly known as autism, refers to a wide range of disorders or conditions, which include repetitive (sometimes restricted) actions, challenges and difficulties with social norms, and hindrance in verbal as well as non-verbal communication [6]. Each child with ASD comes with his or her own specific needs and a collection of habits and behaviors which can hinder day-to-day tasks. Since ASD is classified as a “developmental disorder”, the symptoms usually start to appear in the early stages of life (the first two or three years) [3] [6]. It is an intricate neurobehavioral condition which makes social interaction problematic for affected individuals. Not all the disorders in this spectrum are equally severe: some amount to a minor handicap, while others are devastating and can practically disrupt the whole lifestyle of the person concerned. Children with ASD struggle to communicate with others. Reading visual emotions is tricky for them, and they usually find it hard to understand what other people feel and think. This makes it quite troublesome for them to convey their own thoughts, whether with gestures, words or facial expressions. Autistic individuals often develop unusual behaviors, and in a few scenarios these can be dangerous to themselves and to the people around them. Due to their impaired speaking abilities, their parents sometimes underestimate the physical capabilities of these children, and that may lead to a hazard [4]. An autistic child is at risk of running away, wandering without supervision, or indulging in activities that may injure the child or even harm nearby family members.
According to the CDC (Centers for Disease Control and Prevention), Atlanta, 1 in every 59 children in the USA and 1 in every 68 children worldwide is affected by ASD [2]. Among all the children in the US under 17 years of age, more than 1 million have been diagnosed with ASD [1]. These disorders are more commonly found in boys than in girls, the ratio being approximately 4 to 1 [5]. Their research indicates an increasing number of affected individuals every half-decade. In 2011-2012, in the USA alone, around 1 in 50 children aged 6-17 (2.00%) were reported by their parents as having ASD, a substantial increase from the 1.16% recorded in 2007. Quite a few factors may be responsible for this increase in the number of children with ASD, such as environmental factors, climate change, malnutrition, etc.
To communicate effectively and establish social interactions with people from an early age, a child must be capable of interpreting both verbal and non-verbal messages [7] [8]. A universal and most important component of our communication is gesture [9]. Co-verbal gestures accompany our voice and consist of hand and arm movements. They complement the language content, emphasize our point of view, regulate the flow of speech and keep the audience's focus on the speaker. Although there is no hard and fast rule about the categories of gestures, conventional gestures (CG) have well-established premises [10]. These are communicative and intentional actions which possess a direct and accurate verbal translation, hence they can be easily comprehended even without any spoken aid.
Autistic children face difficulty in conveying their thoughts to others. They use fewer communicative gestures compared to typically developing (TD) children [10]. Over the course of time, their parents can get used to their gestures and understand what is going on in their child's mind, but it remains quite hard for other people. Their sign language is normally different from that of deaf and mute people, which makes it even more difficult to grasp their thoughts. In this paper we propose a system to understand their thoughts and the things they are trying to convey to us by monitoring and recording their speech and gestures (limb movements). In this way we can enable an ordinary person to easily interpret the gestures or fumbled words of autistic children and understand them. This process comprises three major steps:
1. Data acquisition (voice and gestures)
2. Feature extraction
3. Data recognition

Data Acquisition:

In this scenario, the data comprises two things: vocal input and gesture monitoring. We selected 24 specific activities to be monitored, and our test subjects were 15 autistic individuals.

Feature Extraction:

Data Recognition:
Proposed Architecture:
We acquire speech (voice) data from an individual through a microphone and simultaneously obtain data from sensors placed on different parts of that person's body for the purpose of gesture recognition. The body sensors comprise a gyroscope and an accelerometer, which can record any kind of limb movement (rotational or linear). The data from the sensors is already in digital form (courtesy of a built-in ADC), but the speech is recorded in analog form, so it has to go through certain processes before the recorded data is ready for analysis. These processes include pre-emphasis filtering, framing, windowing, etc.
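As an illustration of this data flow, the sketch below shows one possible way of bundling a recording session in Python; the class and field names here are assumptions made for the example, not part of the hardware setup described above.

    import numpy as np
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SensorSample:
        accel: np.ndarray  # 3-axis accelerometer reading (linear limb movement)
        gyro: np.ndarray   # 3-axis gyroscope reading (rotational limb movement)

    @dataclass
    class Recording:
        voice: np.ndarray                                              # digitized microphone signal
        limb_motion: List[SensorSample] = field(default_factory=list)  # body-sensor stream (already digital)
        fs_voice: int = 16000                                          # voice sampling rate (see Sampling below)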

Speech (Voice) pre-processing:


This is the stage where the acquired data is preprocessed to make it ready for further analysis. Our ultimate aim is to extract features from the acquired voice, and preprocessing improves the efficiency of feature extraction as well as of the subsequent classification methods. The foremost steps in preprocessing the acquired voice are, in order, sampling, quantization and pre-emphasis.

Sampling:

To make our data set suitable for digital processing, the analog data must first be digitized both in time and in amplitude. Sampling is the very first step in converting an analog signal (voice) into a digital one. As we are changing the nature of the signal from continuous to discrete, we have to collect enough samples so that the signal can be reconstructed without losing any part of it. The criterion for that is known as the Nyquist criterion. The Nyquist sampling theorem states that an analog (continuous-time) signal can be sampled in such a way that it can later be reconstructed perfectly, and this can only be achieved when the sampling frequency is at least twice the frequency of the highest component present [12].
Fs ≥ 2 Fmax

Although human beings have a hearing range of 20 Hz - 20 kHz, typical speech frequencies lie in the 100 Hz - 8 kHz range. Hence, in this scenario, we can safely take the sampling frequency to be around 16 kHz.
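As a minimal sketch of this choice (using NumPy and a synthetic tone as a stand-in for a real recording), the snippet below fixes a 16 kHz sampling rate and checks it against the Nyquist criterion for an 8 kHz maximum speech component:

    import numpy as np

    fs = 16000                                   # chosen sampling frequency (Hz)
    f_max = 8000                                 # highest speech component of interest (Hz)
    assert fs >= 2 * f_max                       # Nyquist criterion: Fs >= 2 * Fmax

    duration = 1.0                               # one second of signal
    t = np.arange(0, duration, 1.0 / fs)         # sampling instants
    voice = np.sin(2 * np.pi * 440.0 * t)        # synthetic stand-in for a recorded voice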
Quantization:
It is the next step in the conversion of an analog signal into a discrete one. It involves mapping a range of continuous values onto discrete values which are rounded off to an agreed standard for the purpose of simplicity. The parameter “bit depth” (the available number of bits) defines the quality and accuracy of the quantized result [13]. We can also achieve higher accuracy by increasing the number of quantization levels.
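A minimal sketch of uniform quantization is given below, assuming a signal already normalized to the range [-1, 1]; the 16-bit depth is an example value, not one specified above:

    import numpy as np

    def quantize(signal, bit_depth=16):
        # Map continuous amplitudes onto 2**bit_depth evenly spaced levels.
        levels = 2 ** bit_depth
        step = 2.0 / levels                              # width of one quantization step
        codes = np.round(signal / step)                  # nearest integer level
        codes = np.clip(codes, -levels // 2, levels // 2 - 1)
        return codes * step                              # back to (quantized) amplitudes

    voice_q = quantize(voice, bit_depth=16)              # higher bit depth -> finer levels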

Pre-Emphasis:
A pre-emphasis filter is usually applied to a signal when we want to amplify its higher-frequency components. This filter is useful in the following ways [14]:
1. It balances the frequency spectrum, which is otherwise skewed because higher-frequency components of speech normally have smaller amplitudes than lower-frequency ones.
2. It can help avoid numerical problems during the Fourier transform operation.
3. It can also contribute to increasing the signal-to-noise ratio (SNR).
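A minimal sketch of such a filter is shown below, using the common first-order form y[n] = x[n] - alpha * x[n-1]; the coefficient 0.97 is a typical choice assumed for the example rather than a value stated above:

    import numpy as np

    def pre_emphasis(signal, alpha=0.97):
        # y[n] = x[n] - alpha * x[n-1]; boosts the weaker high-frequency content.
        return np.append(signal[0], signal[1:] - alpha * signal[:-1])

    voice_pe = pre_emphasis(voice_q)             # quantized signal from the previous step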

Framing:
A speech (voice) signal is analog in nature and hence varies continuously with time. It is quite hard to apply any sort of processing technique to a dynamically varying signal. However, the voice signal does remain approximately invariant over very short intervals, and those intervals (5 ms - 100 ms) correspond to phonemes. Here we apply a process called “framing”, in which we split the signal into frames covering those short invariant intervals. Typically such a frame has a length of 25 ms:
Frame length = 0.025 * (sampling frequency)
The total number of frames is then given by
Number of frames = Signal length / Frame length
To prevent any kind of data loss, we make the frames overlap each other slightly; the shift between the starting points of consecutive frames is known as the frame step. Here we use a 10 ms frame step, so consecutive 25 ms frames overlap by 15 ms. The number of frames can now be calculated as
Number of frames = (Signal length - Frame length) / Frame step
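The framing step can be sketched as follows, assuming NumPy and the 25 ms / 10 ms values given above; the last, possibly incomplete frame is zero-padded so that no samples are discarded:

    import numpy as np

    def frame_signal(signal, fs=16000, frame_len_s=0.025, frame_step_s=0.010):
        frame_len = int(round(frame_len_s * fs))     # 0.025 * 16000 = 400 samples
        frame_step = int(round(frame_step_s * fs))   # 0.010 * 16000 = 160 samples
        num_frames = 1 + int(np.ceil((len(signal) - frame_len) / float(frame_step)))
        # Zero-pad so the final frame is complete, then gather the sample indices.
        pad_len = (num_frames - 1) * frame_step + frame_len
        padded = np.append(signal, np.zeros(pad_len - len(signal)))
        idx = (np.arange(frame_len)[None, :]
               + np.arange(num_frames)[:, None] * frame_step)
        return padded[idx]                           # shape: (num_frames, frame_len)

    frames = frame_signal(voice_pe)                  # pre-emphasized signal from above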

Windowing:
This step is used to window every individual frame. It is required to curtail the signal discontinuities at the beginning and at the end of each frame [15].
When a signal is cut off abruptly, sharp edges are produced at the boundaries, and these introduce ambiguity into the recognition process. To minimize this phenomenon, we use a window to taper the edges and reduce spectral leakage. The preferred choice for this process is usually the Hamming window, as it avoids sharp edges [16].
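A minimal sketch of this step, applying NumPy's built-in Hamming window to the frames produced by the framing sketch above:

    import numpy as np

    window = np.hamming(frames.shape[1])   # one Hamming window, as long as a frame
    frames_win = frames * window           # taper both ends of every frame at once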

Feature Extraction:
Voice features can be extracted through multiple feature extraction techniques. The most widely preferred one is MFCC (Mel-Frequency Cepstral Coefficients) [16]. The reason for its importance is that, after preprocessing, it discards the redundant portions of the voice (speech) signal and retains the crucial ones. After this step, a time-domain voice signal is converted into a corresponding feature vector, which emphasizes the important information and neglects the raw data. There is also a technique known as cepstral analysis, used for speaker-independent speech [17].
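As a minimal sketch, assuming the librosa library is available and reusing the 16 kHz rate and the 25 ms / 10 ms framing described above, MFCC features could be extracted as follows (13 coefficients is a common choice, not a value specified here):

    import librosa
    import numpy as np

    fs = 16000
    # voice_pe is the pre-emphasized signal from the earlier sketches; librosa
    # performs framing and windowing internally using n_fft and hop_length.
    mfccs = librosa.feature.mfcc(y=voice_pe.astype(np.float32), sr=fs, n_mfcc=13,
                                 n_fft=int(0.025 * fs), hop_length=int(0.010 * fs))
    print(mfccs.shape)   # (13, number_of_frames): one feature vector per frame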
References:
1. Peter Washington, Jena Daniels, Catalin Voss, Carl Feinstein, Nick Haber, Terry
Winograd, Serena Tanaka, Dennis Wall “A Wearable Social Interaction Aid for Children
with Autism.” Late-Breaking Work: Interaction in Specific Domains (2016).
2. Wingate, Martha, et al. "Prevalence of autism spectrum disorder among children aged 8 years - Autism and Developmental Disabilities Monitoring Network, 11 Sites, United States, 2010." MMWR Surveillance Summaries 63.2 (2014).
3. Amir Mohammad Amiri, Nicholas Peltier, Cody Goldberg, Yan Sun, Anoo Nathan,
Shivayogi V. Hiremath and Kunal Mankodiya “WearSense: Detecting Autism
Stereotypic Behaviors through Smartwatches.” Healthcare 2017, 5, 11; doi:10.3390/healthcare5010011 (2017).
4. Sami S. Alwakeel, Bassem Alhalabi, Hadi Aggoune, Mohammad Alwakeel "A Machine Learning Based WSN System for Autism Activity Recognition." IEEE 14th International Conference on Machine Learning and Applications (2015).
5. https://www.psychiatry.org/patients-families/autism/what-is-autism-spectrum-disorder
6. https://www.nimh.nih.gov/health/topics/autism-spectrum-disorders-asd/index.shtml
7. Bernicot J. Introduction. De l'usage des gestes et des mots chez l'enfant. Paris: Colin; 1998. p. 5-25.
8. Capirci O, Iverson JM, Pizzuto E, Volterra V. Gestures and words during the transition to two-word speech. J Child Lang 1996;23:645-73.
9. Goldin-Meadow S, Alibali MW. Gesture's role in speaking, learning, and creating language. Annu Rev Psychol 2013;64:257-83.
10. Perrault A, et al. “Comprehension of conventional gestures in typical children, children
with autism spectrum disorders and children with language disorders.” Neuropsychiatr
Enfance Adolesc (2017), https://doi.org/10.1016/j.neurenf.2018.03.002
11. Gulmira K. Berdibaeva, Oleg N. Bodin, Valery V. Kozlov, Dmitry I. Nefed’ev, Kasymbek A. Ozhikenov, Yaroslav A. Pizhonkov "Pre-processing Voice Signals for Voice Recognition Systems." 18th International Conference on Micro/Nanotechnologies and Electron Devices EDM (2015).
12. Doucet, Arnaud, Simon Godsill, and Christophe Andrieu. “On sequential Monte Carlo
sampling methods for Bayesian filtering.” Statistics and computing 10, no. 3 (2000): 197-
208.
13. https://www.mediacollege.com/glossary/q/quantization.html
14. https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
15. Yashpalsing Chavhan, Pallavi Yesaware, M. L. Dhore “Speech Emotion Recognition
Using Support Vector Machine” International Journal of Computer Applications 0975 –
8887 (2010).
16. Gupta, Shikha, Jafreezal Jaafar, Wan Fatimah Wan Ahmad, and Arpit Bansal. "Feature
extraction using MFCC." Signal & Image Processing 4, no. 4 (2013): 101
17. Kesarkar, Manish P. "Feature extraction for speech recognition." Electronic Systems, EE.
Dept., IIT Bombay (2003).
