
RECOGNITION OF ATYPICAL BEHAVIOR IN AUTISM DIAGNOSIS FROM VIDEO USING POSE ESTIMATION OVER TIME

Kathan Vyas1, Rui Ma1, Behnaz Rezaei1, Shuangjun Liu1, Michael Neubauer2, Thomas Ploetz2, Ronald Oberleitner2, Sarah Ostadabbas1
1 Augmented Cognition Lab (ACLab), Northeastern University, Boston, MA, USA
2 Behavior Imaging Solutions, Boise, Idaho, USA

ABSTRACT

Autism spectrum disorder (ASD), similar to many other developmental or behavioral conditions, is difficult to diagnose precisely. This difficulty increases when the subjects are young children, due to the large overlap between ASD symptoms and the typical behaviors of young children. It is therefore important to develop reliable methods that can help distinguish ASD from the normal behaviors of children. In this paper, we implemented a computer vision based automatic ASD prediction approach to detect autistic characteristics in a video dataset recorded from a mix of children with and without ASD. Our target dataset contains 555 videos, from which 8349 episodes (each approximately 10 seconds long) are derived. Each episode is labeled as an atypical or typical behavior by medical experts. We first estimate children's pose in each video frame by re-training a state-of-the-art human pose estimator on our manually annotated children pose dataset. Particle filter interpolation is then applied to the output of the pose estimator to predict the locations of missing body keypoints. For each episode, we calculate the children's motion pattern, defined as the trajectory of their keypoints over time, by temporally encoding the estimated pose maps. Finally, a binary classification network is trained on the pose motion representations to discriminate between typical and atypical behaviors. We were able to achieve a classification accuracy of 72.4% (precision = 0.72 and recall = 0.92) on our test dataset.

1. INTRODUCTION

Autism spectrum disorder (ASD) is a type of developmental disorder that affects a person's communication and typical behaviors from a very early age. According to the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) [1], a guide created by the American Psychiatric Association and used to diagnose mental disorders, people with ASD have: (1) difficulty in communication and interaction; (2) restricted interests and repetitive behaviors; and (3) symptoms that hurt people's ability to function properly in school, work, and other areas of life. Since ASD is developmental, a patient's condition tends to deteriorate with increasing age; early detection is therefore important to improve the person's quality of life. Detecting ASD is difficult in its preliminary stages, primarily because ASD symptoms and the natural behavior of a child overlap on many fronts. This is evident in the prevalence of ASD among children across the United States: the rate of ASD diagnosis varies significantly between states, from 5.7 to 21.9 per 1000 eight-year-old children [2]. Given such alarming numbers, there is a pressing need for predictive algorithms that can detect ASD behaviors in young children based on their abnormal behavioral patterns, as described in the DSM-5. Moreover, for any of these algorithms to hold practical value in predicting ASD in a larger population, they have to be both cost-efficient and unobtrusive so that they can be employed in the child's natural living environment for long-term behavioral monitoring.

1.1. Related Work

In the last couple of decades, research in the field of ASD behavior recognition using machine learning and computer vision has been growing. These works have focused on understanding the behavior of patients, especially children of various ages, in a number of ways, including studying their behavior through their home videos. In [3], the authors studied infants by extracting what they described as the Autism Observation Scale for Infants (AOSI). In [4], the authors quantified the facial expressions of ASD patients and compared them with IQ-matched healthy controls, showing how ASD patients fail to use certain facial muscles when expressing their emotions.

In another study, the authors analyzed behavioral interactions between ASD children and their caregivers/peers using automatically captured video data [5]. Children with ASD have shown impairment in their ability to produce and perceive dynamic facial expressions, which may contribute to their social deficits [6]. Researchers have taken this particular fact into account and focused on various parts of the face, including the eyes. In [7], the authors proposed an algorithm for fine-grained classification of developmental disorders via measurements of individuals' eye-movements using multimodal (including RGB and eye tracking images) visual data.

Fig. 1. An overview of our ASD behavior detection approach. The first step is to apply the re-trained pose estimation algorithm on each frame of our video dataset. The resultant pose data may contain some missing keypoints. In order to fill these missing values, a non-linear state estimation technique is applied on the pose data. The output of the interpolation is then fed to the PoTion generator, which uses ≈300 frames to provide one PoTion image. Finally, the PoTion images are used as the inputs to our typical vs. atypical (i.e., ASD) behavior classification network.


All of the above-mentioned works used different computer vision techniques; however, none of them focused on studying full-body pose movement over time as an indicator of atypical behaviors. Most of this research analyzes RGB frames and high-level information from the video, and focuses on specific body parts of the children (e.g., the face). In contrast, our focus in this paper is the extraction of the body pose, which is interpretable, appearance-agnostic, yet low-level latent information from the video frames, and the tracking of temporal changes in the pose to understand the behavior of the child.

1.2. Our Contributions

In this paper, we present an automatic ASD behavior prediction approach using computer vision and signal processing algorithms that can classify various behaviors of children based on the display of ASD symptoms. This work is based on a collaboration with the Behavior Imaging company [8]. We applied our proposed algorithm to 555 of their videos collected from children with and without ASD, in which each video was annotated by several expert physicians.

The major contributions of this paper are as follows: (1) re-training a pre-trained pose estimation network to make it more robust in identifying children's pose; (2) using a non-linear state estimation technique to predict the locations of missing keypoints; (3) extracting behavioral features from keypoint trajectories over a short time frame; and (4) developing a binary classifier that uses the behavioral features to classify atypical vs. typical characteristics. The code for our proposed method will be released on our website.

2. MATERIALS AND METHODS

Our primary goal in this work is to distinguish between typical and atypical behaviors of children based on their motion feature representation. The dataset used in our study contains videos of the children in their home environment; details on our dataset are presented in Section 2.1. As shown in Fig. 1, we implemented a 4-step approach that takes video frames as input and labels them as either typical or atypical behavior. We started with splitting videos into frames and then used a re-trained pose estimation network to identify human keypoints on each frame; 15 keypoints were identified across the body, as explained in Section 2.2. After applying a non-linear interpolation algorithm (i.e., a particle filter) to fill in the missing keypoints, we transformed the keypoint information over a fixed period of time (≈300 frames) into the "PoTion Representation" [9], which is described in Section 2.3. In Section 2.4, we trained a classification network on the PoTion representations to classify them as either typical (i.e., normal) or atypical (i.e., ASD) behavior.
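Put together, the pipeline can be summarized by the minimal sketch below. All function and parameter names here are illustrative placeholders rather than our released code; the individual steps are detailed in the following sections.

```python
def classify_clip(frames, estimate_pose, interpolate_keypoints, build_potion, classifier):
    """Illustrative sketch of the 4-step pipeline in Fig. 1 (all callables are placeholders)."""
    poses = [estimate_pose(frame) for frame in frames]   # step 1: per-frame keypoints (may have gaps)
    poses = interpolate_keypoints(poses)                 # step 2: particle filter fills missing keypoints
    potion = build_potion(poses)                         # step 3: ~300 frames -> one 256x256x3 PoTion image
    return classifier(potion)                            # step 4: typical vs. atypical prediction
```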
Fig. 2. Behavior Imaging company's Remote Autism assessment platform, called NODA [8].

2.1. Our Video Dataset

The video dataset used in our experiments comes from the NODA program of the Behavior Imaging company [8]. NODA is an evidence-based autism diagnostic service platform that has been brought to market to address some of the challenges with early ASD diagnosis. Fig. 2 presents the preliminary process of developing the video dataset. Videos of the child's daily activities at home were recorded by parents and then sent to the Behavior Imaging company to be studied by expert physicians. Common categories of daily activities include "play alone", "play with others", "meal time", and "parents' concerns". The annotated dataset provided to us consists of four sets of videos, denoted set#1 to set#4, respectively. The length of the videos varies from 2 minutes to 10 minutes. For each video, physicians tagged several time stamps where they observed typical or atypical behavior of the child. Details of our dataset are shown in Table 1.

Table 1. Statistics of our NODA video dataset.

                            Set#1   Set#2   Set#3   Set#4   Total
No. Video Clips               235     101     177      42     555
Total No. of Time Stamps     4465    2048    1545     291    8349
No. of Typical Stamps        1749     681     529     125    3084
No. of Atypical Stamps       2716    1367    1016     166    5265

Fig. 3. Architecture of our pose estimator, adopted from the 2D Mask R-CNN network [10]. It can perform multi-person pose estimation by identifying 15 keypoints on each human body in the input image.

2.2. Body Pose Estimation in Children

For each typical or atypical time stamp in a video (provided by ASD experts), we take 8 seconds before and 2 seconds after that time stamp to construct a 10-second episode. The reason is that physicians usually labeled a typical or atypical time stamp after they had observed the behavior for a while, so a 10-second window can approximately record the behavior in full. All of our videos are 30 frames per second, so the length of an episode is at most 301 frames (including the frame at the time stamp), or fewer if a time stamp is labeled at the very start or end of the video.
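As a concrete illustration, the episode boundaries can be computed as follows (a minimal sketch; the helper itself is ours and not part of the released code, but the 8 s/2 s window and 30 fps values follow the description above):

```python
FPS = 30

def episode_frame_range(stamp_frame, n_frames, before_s=8, after_s=2, fps=FPS):
    """Frame indices (inclusive) of one episode: 8 s before and 2 s after the tagged
    frame, clipped at the video boundaries (at most 301 frames at 30 fps)."""
    start = max(0, stamp_frame - before_s * fps)
    end = min(n_frames - 1, stamp_frame + after_s * fps)
    return start, end

# Example: a time stamp at frame 900 of a 3000-frame video -> frames 660 to 960 (301 frames).
```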
We adopted a state-of-the-art human pose estimation network, the 2D Mask R-CNN [10], and re-trained it on our manually annotated children pose dataset. We then ran the re-trained pose estimator on each frame of every 10-second episode to identify the human keypoints in it. The pose estimator is capable of detecting multiple people in a single image and identifying their 15 body keypoints, including upper-head, nose, upper-neck, left/right shoulders, left/right elbows, left/right wrists, left/right hips, left/right knees, and left/right ankles, for every detected human body. The pose estimator extends the standard architecture of Mask R-CNN [11], as shown in Fig. 3.

ResNet is used as the base network to extract low-level feature maps from an input image. Next, a region proposal network (RPN) generates a number of regions of interest (RoIs) from these feature maps. RoI Align operations are then applied to the RoIs in order to refine them. The refined RoIs are fed into classification and regression networks that perform bounding box detection. For every input RoI, the regression network computes the coordinates of the upper-left and lower-right corners of one bounding box, while the classification network classifies the bounding box as "human" or "background". Departing from Mask R-CNN's basic architecture, the pose estimator replaces the instance segmentation network with a keypoint estimation network that runs in parallel to the classification and regression networks. The keypoint estimation network predicts the coordinates of the 15 keypoints for every detected human, and the pose of a human can be visualized by connecting these 15 keypoints together, as shown in the output of the keypoint estimation network. After the pose estimation step, in order to fill in the keypoints missed due to estimation failures, we apply a non-linear state estimation technique. Details of the re-training and state estimation steps are provided in Section 3.
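For readers who want to experiment with a comparable off-the-shelf detector, the sketch below uses torchvision's Keypoint R-CNN as a stand-in. Note that this is not the Detect-and-Track model [10] that we re-train: the torchvision model predicts the 17 COCO keypoints rather than our 15, and the score threshold is an arbitrary choice.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Stand-in detector only (17 COCO keypoints), not the re-trained 15-keypoint model.
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True).eval()

def detect_keypoints(frame_rgb, score_thresh=0.7):
    """Return an array of shape (num_people, 17, 2) with pixel coordinates for one frame."""
    with torch.no_grad():
        pred = model([to_tensor(frame_rgb)])[0]
    keep = pred["scores"] > score_thresh          # keep confident "person" detections only
    return pred["keypoints"][keep][..., :2].numpy()
```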
Fig. 4. An illustration of the updated version of the PoTion Representation, first introduced in [9]. We used only three colors, whose intensity varies with time, each denoting a part of the body. Red represents all keypoints above the shoulders, including the nose, the top of the head, and the neck. The shoulders, elbows, and wrists are represented in green. The keypoints below the abdomen, including the hips, knees, and ankles, are represented in blue.

2.3. PoTion Representation

After filling in the missing keypoints, we transform the pose information in each episode into the Pose Motion (PoTion) Representation, a novel representation that gracefully encodes the movements of human keypoints [9]. In this work, we changed the colorization step of the original PoTion Representation so that the final output is one RGB image for all keypoints instead of one RGB image per keypoint. As illustrated in Fig. 4, a Gaussian kernel is first placed around each keypoint to obtain a keypoint heatmap. The Gaussian kernel has a fixed size of 12 × 12 with a variance of 1.0. Next, we colorize every heatmap according to the body part it belongs to and the relative time of the frame within the episode. The main idea is to colorize the heatmaps of the upper-head, nose, and upper-neck in red, the left/right shoulders, elbows, and wrists in green, and the left/right hips, knees, and ankles in blue, while the color intensity goes from weak to strong from the first frame to the last frame of a given episode. The last step is to aggregate these colorized heatmaps over time to obtain the trajectories of all keypoints as one RGB image, which is warped to a size of 256×256×3.

One of the markers that physicians paid attention to while tagging a video is the child's interaction with surrounding people. In order to properly measure the interaction between the child and the other people around them, we decided to include the pose information of all detected people in a frame in the PoTion Representation, instead of only the pose of the child.
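A minimal sketch of this modified colorization is given below. The keypoint-to-channel grouping follows Fig. 4, but the keypoint index order, the border handling, and the final normalization are our own assumptions rather than the exact implementation.

```python
import numpy as np
import cv2  # used only for the Gaussian kernel and the final 256x256 warp

# Keypoint index -> color channel (0 = red head group, 1 = green arm group, 2 = blue leg group).
# The index order (0..14) is illustrative.
CHANNEL = {0: 0, 1: 0, 2: 0,                          # upper-head, nose, upper-neck
           3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1,        # shoulders, elbows, wrists
           9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2}   # hips, knees, ankles

def potion_image(episode_kpts, frame_hw, ksize=12, sigma=1.0):
    """episode_kpts: (T, 15, 2) array of keypoint coordinates for one episode.
    Returns one 256x256x3 PoTion-style image."""
    H, W = frame_hw
    canvas = np.zeros((H, W, 3), dtype=np.float32)
    g = cv2.getGaussianKernel(ksize, sigma)
    kernel = (g @ g.T).astype(np.float32)              # 12x12 Gaussian bump
    T = len(episode_kpts)
    half = ksize // 2
    for t, kpts in enumerate(episode_kpts):
        intensity = (t + 1) / T                        # weak -> strong over the episode
        for k, (x, y) in enumerate(kpts):
            x, y = int(round(x)), int(round(y))
            if x - half < 0 or y - half < 0 or x + half > W or y + half > H:
                continue                               # skip keypoints too close to the border
            canvas[y - half:y + half, x - half:x + half, CHANNEL[k]] += kernel * intensity
    canvas /= max(float(canvas.max()), 1e-6)           # normalize to [0, 1]
    return cv2.resize(canvas, (256, 256))
```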

2.4. Behavior Classification Network

After we obtain the PoTion Representations, we train a simple convolutional neural network (CNN) to classify them as either typical or atypical behavior. The architecture of the employed CNN is shown in Fig. 5. The CNN consists of a convolution layer and a max-pooling layer, followed by a flatten layer and a fully-connected layer, and then a sigmoid function to obtain the classes. We split the whole dataset into 80% training and 20% test. We used the Adam optimizer with a fixed learning rate of 0.01. In order to prevent our model from over-fitting, we adopted early stopping as well as the well-known 5-fold cross-validation during training. Multiple performance metrics, including accuracy, precision, recall, and AUC, were used to evaluate the classification performance. AUC is an important metric in binary classification tasks. It is the area under the Receiver Operating Characteristics (ROC) curve, which plots the true positive rate against the false positive rate. This curve reflects the ability of a model to distinguish the positive and negative classes. The value of AUC is between 0 and 1, and the higher the AUC, the better the model is at the classification task. All modeling results are evaluated in Section 4.

Fig. 5. Architecture of the CNN classifier. It consists of a convolution layer of 64 filters with size 3 × 3, stride of 1, and same padding, a max-pooling layer with 2 × 2 filters and stride of 2, followed by a flatten layer and a fully-connected layer with 64 neurons, and the sigmoid operation to get the labels.
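The classifier is small enough to write down directly. The sketch below follows the description above and Fig. 5; the hidden activations, loss function, and early-stopping patience are our assumptions, since they are not specified in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(256, 256, 3)),                     # one PoTion image per episode
    layers.Conv2D(64, 3, strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2, strides=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                 # typical (0) vs. atypical (1)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.Precision(),
             tf.keras.metrics.Recall(), tf.keras.metrics.AUC()],
)
early_stop = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
# model.fit(train_potions, train_labels, validation_split=0.2, callbacks=[early_stop])
```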
3. DETAILS ON ALGORITHM IMPROVEMENTS

The PoTion Representation encodes the movements of human keypoints into colorized trajectories, visualizing the behavior of a child within a short time period. However, these trajectories may be non-consecutive if the pose estimator fails to detect the keypoints in some of the input frames. In this section, we explore two methods to address this issue: re-training the human pose estimator network on the target dataset, and applying interpolation techniques to predict the locations of missing keypoints.

3.1. Re-training Pose Estimation Network

The Mask R-CNN pose estimator is initialized using ImageNet, pre-trained on the COCO keypoint detection task, and fine-tuned on the PoseTrack dataset [10]. Although Mask R-CNN has achieved state-of-the-art pose estimation performance on these datasets, it provides poor results on our video dataset, where the subjects are children rather than adult humans. The reason is that children have different body proportions and skeleton structures compared to adults, while very few children are present in the training sets of the Mask R-CNN. As a result, the pose estimator tends to ignore the presence of children by classifying the bounding boxes containing children as "background".

In order to improve the performance of the pose estimator, we took 985 sample frames containing children from our video datasets and manually annotated the human keypoints in these frames. We developed an improved version of the semi-automatic annotation toolbox provided in [12]. With the help of this toolbox, we could obtain the coordinates of the 15 keypoints for each person in an image by clicking on the corresponding body parts. The toolbox also supports bounding box annotation, from which we obtained the upper-left and lower-right coordinates of the head and body bounding boxes. We converted the keypoint and bounding box coordinates to the JSON format of the PoseTrack dataset [13]. We then re-trained the keypoint estimation network of the pose estimator using these annotated frames. The effectiveness of the re-training step is evaluated in Section 4.

3.2. Non-linear Interpolation using Particle Filter

Once we re-trained the pose estimation network, it became more robust in identifying children and their poses. However, there might still be several frames in an episode that were not annotated with a pose, due to various reasons including blurring, shading, and other image quality issues. In order to obtain a complete set of keypoint trajectories, we decided to implement a particle filter as a non-linear interpolation approach to predict the coordinates of the missing keypoints in these frames [14]. The particle filter is a sequential Monte Carlo method based on point mass (or "particle") representations of probability densities. It generalizes the classical Kalman filtering methods and can be applied to any state-space model [14]. Consider a single episode as an instance. We count the number of frames that are not annotated (meaning no people are detected in these frames) and denote it as M, while the total number of frames in the episode is denoted as N. If M is more than 90% of N, we discard the episode completely. For the remaining cases, we apply the particle filter interpolation.
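The episode filtering rule is simple enough to state in code (a small illustrative helper, not released code):

```python
def keep_episode(frame_poses, max_missing_frac=0.9):
    """frame_poses: one entry per frame, None when no person was detected.
    An episode is discarded when more than 90% of its frames have no pose."""
    n_missing = sum(pose is None for pose in frame_poses)
    return n_missing <= max_missing_frac * len(frame_poses)
```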

The particle filter interpolation is initialized by generating a set of Gaussian-distributed particles over the whole frame. In our case, we considered 300 particles, a value chosen considering the trade-off between speed and accuracy. We assigned four landmarks at the four corners of the frame. The distances of a keypoint i from these landmarks are taken as the measurements Z for each iteration, such that Z = \sqrt{(y_L - y_i)^2 + (x_L - x_i)^2}, where x_L and y_L are the coordinates of one of the landmarks and x_i and y_i are the coordinates of the keypoint i.
We use the combination of the distance D and orientation O of the keypoint i between one frame j and the following frame j + 1 as the state variables during each iteration:

D = \sqrt{(y_i^{j+1} - y_i^j)^2 + (x_i^{j+1} - x_i^j)^2}    (1)

O = \arctan\left(\frac{y_i^{j+1} - y_i^j}{x_i^{j+1} - x_i^j}\right)    (2)

The whole particle filter process consists of prediction and update steps for the particles, repeated over N iterations, and we update Z, D, and O after every iteration. After the initialization step, we update the positions of the particles according to the state variables (D and O). We continue by comparing the distance between each particle and the landmarks with Z. Based on this comparison, we assign a weight to each particle, such that more weight is given to the particles that are closer to the keypoint. This iterative process continues until we obtain the positions of the particles in the last frame. At the end, we take the mean of each particle distribution to find the expected value of the corresponding keypoint.
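For illustration, a much-simplified particle filter for one keypoint trajectory is sketched below. It keeps the ingredients described above (300 particles, corner landmarks as measurements, weighted resampling, and the particle mean as the estimate) but uses a plain Gaussian motion model instead of the exact (D, O) propagation, so it should be read as a schematic rather than our implementation.

```python
import numpy as np

def particle_filter_track(obs, frame_size, n_particles=300, motion_std=5.0, meas_std=10.0, seed=0):
    """obs: (N, 2) array of [x, y] per frame, with NaN rows where the detection is missing.
    frame_size: (width, height). Returns an (N, 2) trajectory with every frame filled in."""
    rng = np.random.default_rng(seed)
    w, h = frame_size
    landmarks = np.array([[0, 0], [w, 0], [0, h], [w, h]], dtype=float)  # four frame corners

    # Initialization: Gaussian-distributed particles over the frame.
    particles = rng.normal(loc=[w / 2, h / 2], scale=[w / 4, h / 4], size=(n_particles, 2))
    weights = np.full(n_particles, 1.0 / n_particles)
    track = np.empty_like(obs, dtype=float)

    for t in range(len(obs)):
        # Prediction: diffuse particles with Gaussian motion noise.
        particles = particles + rng.normal(scale=motion_std, size=particles.shape)

        if not np.isnan(obs[t]).any():
            # Update: compare each particle's corner distances with the measured distances Z.
            z_meas = np.linalg.norm(landmarks - obs[t], axis=1)                 # (4,)
            z_part = np.linalg.norm(particles[:, None, :] - landmarks, axis=2)  # (n, 4)
            err = np.linalg.norm(z_part - z_meas, axis=1)
            weights = np.exp(-0.5 * (err / meas_std) ** 2) + 1e-12
            weights /= weights.sum()
            idx = rng.choice(n_particles, size=n_particles, p=weights)          # resample
            particles, weights = particles[idx], np.full(n_particles, 1.0 / n_particles)

        # The expected keypoint location is the (weighted) particle mean.
        track[t] = np.average(particles, axis=0, weights=weights)
    return track
```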
The output of the particle filter interpolation step is an episode without missing keypoints, ready to be used as the input to the PoTion step. The effect of the particle filter interpolation is shown in Fig. 6, where we compare the keypoint trajectories under three different methods, namely no interpolation, linear interpolation, and our particle filter interpolation, against the groundtruth trajectory. It can be seen that the trajectory imputed by the particle filter interpolation is the closest to the groundtruth. We also compare the performance of the behavior classification model under these three interpolation methods in Section 4.

Fig. 6. The implementation of our interpolation step: (a) groundtruth for the trajectory of one keypoint across 301 frames, (b) the output trajectory of the keypoint from the pose estimator, (c) the output of linear interpolation applied to fill the missing keypoints, and (d) the output after the particle filter interpolation.

4. EXPERIMENTAL RESULTS

In this section, we evaluate the performance of our typical vs. atypical behavior detection method on our video dataset under different model configurations. We first evaluated the effectiveness of the re-training step by running the Mask R-CNN pose estimation, with and without re-training, on the set#1 to set#4 videos, and then compared the percentage of frames that were successfully annotated with a human pose. As can be seen from Table 2, re-training the pose estimator results in a 9% to 14% increase in the number of annotated frames.

Table 2. Percentage (%) of annotated frames after applying pose estimation, before and after re-training.

          Set#1   Set#2   Set#3   Set#4
Before     80.6    83.6    80.0    87.5
After      94.6    96.7    94.6    96.3

We evaluated the effect of the particle filter interpolation by comparing it with two other imputation methods: no keypoint interpolation (None) and linear interpolation (Linear). Table 3 compares the performance metrics of our classification model on the test data under these pre-processing conditions. All metrics in the table are derived after the pose estimator is re-trained. It can be seen that the classifier benefits considerably from interpolation, and the particle filter interpolation method achieves the best overall performance.

Table 3. Performance evaluation of the behavior classification network under pre-processing conditions of no missing pose imputation (None), linear interpolation (Linear), and particle filter imputation.

Interpolation/Metrics    Accuracy   Precision   Recall   AUC
None                         0.72        0.72     0.93   0.64
Linear                       0.72        0.72     0.92   0.66
Particle Filter              0.72        0.72     0.93   0.66

We also compared the accuracy of ASD vs. normal behavior classification between our proposed pose estimation-based approach and a conventional video classification approach that takes in RGB images instead of pose information. We implemented a modified version of the single-frame approach proposed in [15]. We split each episode into frames and labeled these frames as typical or atypical according to the class of the episode. Then, we fine-tuned an Inception V3 network as a classifier using these labeled frames. Next, for each episode we predicted the classes of all the frames in it using this classifier and then performed a majority vote to obtain the class of the episode. The performance of this modified single-frame approach and of our proposed pose estimation-based approach are compared in Table 4. As can be seen, our approach outperforms the conventional video classification approach.

Table 4. Comparison of the modified single-frame approach and our proposed pose estimation-based approach.

Approach/Metrics             Accuracy   Precision   Recall   AUC
Modified single-frame [15]       0.66        0.70     0.77   0.62
Our pose estimation-based        0.72        0.72     0.93   0.66

5. CONCLUSION AND FUTURE WORK

In this paper, we presented a novel behavior classification approach that uses low-level latent information, in the form of body pose over time, to categorize a given video clip as a typical (normal) or atypical (ASD) behavior. We started by re-training a state-of-the-art pose estimator on our video dataset to obtain accurate keypoint estimation for children. This was done using a semi-automatic annotation tool, which enables researchers to create their own annotated datasets. We then used particle filter interpolation to fill in the keypoints missed by the pose estimation. Using the extracted keypoints, we transformed the approximately 301 frames of each episode into a PoTion representation, forming a single RGB image. These images capture the changes in 15 whole-body keypoints across 10 seconds (the length of one episode). We used the PoTion images as inputs to our binary classifier for the final behavior classification, and achieved better performance compared to a conventional video classification method.

Our primary objective in developing this approach was to use the movements of the body keypoints as an indicator of a person's behavior. The present approach can be improved by combining RGB and audio modalities in order to obtain more accurate results. As future work, we plan to develop a new type of PoTion Representation that works with 3D trajectories and could add more spatial information to the classification pipeline.

6. REFERENCES

[1] American Psychiatric Association, Diagnostic and statistical manual of mental disorders (5th ed.), American Psychiatric Association, 2013.

[2] Jon Baio, "Morbidity and mortality weekly report. Surveillance summaries," Epidemiology Program Office, Centers for Disease Control and Prevention, vol. 63, no. 2, 2014.

[3] Jordan Hashemi, Thiago Vallin Spina, Mariano Tepper, Amy Esler, Vassilios Morellas, Nikolaos Papanikolopoulos, and Guillermo Sapiro, "A computer vision approach for the assessment of autism-related behavioral markers," 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL), pp. 1–7, 2012.

[4] Michael L Spezio, Ralph Adolphs, Robert SE Hurley, and Joseph Piven, "Abnormal use of facial information in high-functioning autism," Journal of Autism and Developmental Disorders, vol. 37, pp. 929–939, 2007.

[5] James Rehg, "Behavior imaging: Using computer vision to study autism," MVA, vol. 11, pp. 14–21, 2011.

[6] David Deriso, Joshua Susskind, Lauren Krieger, and Marian Bartlett, "Emotion mirror: a novel intervention for autism based on real-time expression recognition," European Conference on Computer Vision, pp. 671–674, 2012.

[7] Guido Pusiol, Andre Esteva, Scott S Hall, Michael Frank, Arnold Milstein, and Li Fei-Fei, "Vision-based classification of developmental disorders using eye-movements," International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 317–325, 2016.

[8] "Behavior Imaging – Health & Education Assessment Technology," https://behaviorimaging.com/, Accessed: 2019.

[9] Vasileios Choutas, Philippe Weinzaepfel, Jérôme Revaud, and Cordelia Schmid, "PoTion: Pose motion representation for action recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7024–7033, 2018.

[10] Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran, "Detect-and-track: Efficient pose estimation in videos," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 350–359, 2018.

[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick, "Mask R-CNN," 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, 2017.

[12] Shuangjun Liu and Sarah Ostadabbas, "A semi-supervised data augmentation approach using 3D graphical engines," European Conference on Computer Vision, pp. 395–408, 2018.

[13] Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele, "PoseTrack: A benchmark for human pose estimation and tracking," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5167–5176, 2018.

[14] M Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[15] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei, "Large-scale video classification with convolutional neural networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.

