JOURNAL OF EXPERIMENTAL & THEORETICAL ARTIFICIAL INTELLIGENCE
https://doi.org/10.1080/0952813X.2018.1563636

ARTICLE

A deep learning approach for diagnosing schizophrenic patients


Srivathsan Srinivasagopalan (a), Justin Barry (b), Varadraj Gurupur (c) and Sharma Thankachan (b)

(a) Alumni, Department of Computer Science, Louisiana State University, Baton Rouge, LA, USA; (b) Department of Computer Science, University of Central Florida, Orlando, FL, USA; (c) Department of Health Management and Informatics, University of Central Florida, Orlando, FL, USA

ABSTRACT
In this article, the investigators present a new method using a deep learning approach to diagnose schizophrenia. In the experiment presented, the investigators have used a secondary dataset provided by the National Institutes of Health. The aforementioned experimentation involves analyzing this dataset for the existence of schizophrenia using traditional machine learning approaches such as logistic regression, support vector machine, and random forest. This is followed by the application of deep learning techniques using three hidden layers in the model. The results obtained indicate that the new deep learning technique formulated by the investigators provides a higher accuracy in diagnosing schizophrenia. These results suggest that deep learning may provide a paradigm shift in diagnosing schizophrenia.

ARTICLE HISTORY
Received 15 May 2018; Accepted 20 December 2018

KEYWORDS
Schizophrenia; fMRI; deep learning; random forest; SVM

Introduction
Machine learning has been applied to a variety of applications, including computer vision
(Belhumeur, 1997) and medical diagnostics (Kuenen, Mischi, & Wijkstra, 2011; Martis, Lin, Gurupur,
& Fernandes, 2017). Typically, all machine learning algorithms need to be provided with significant
features from data to learn the patterns and perform classification (Fernandes & Bala, 2016;
Mikolajczyk & Schmid, 2005). Deep learning is a subfield of machine learning which can extract
features and perform classification on its own (Amin, Sharif, Yasmin, & Fernandes, 2018; Yang, Lin, &
Chen, 2018). Recently, deep learning has gained a lot of interest in the diagnosis of schizophrenia (SCZ).
Schizophrenia is a mental disorder primarily affecting people between the ages of 16 and 30. Its
positive features include hallucinations, delusions, psychosis; its negative features include impaired
motivation and social withdrawal (Owen, Sawa, & Mortensen, 2016). SCZ usually manifests in
adolescence or early twenties and continues to worsen into adulthood. An at-risk phase, called
the prodromal phase, often precedes the full-blown disorder, though there have been cases of
sudden onset in previously healthy individuals. SCZ's effects are severe. Left untreated, SCZ places
the burden of care on the individual's family. Thus, there is a social incentive to efficiently and
accurately diagnose SCZ. Additionally, research has shown that early diagnosis of the disease can
reduce treatment costs (Mihalopoulos, Harris, Henry, Harrigan, & McGorry, 2009), thus increasing
the need for a reliable mechanism for diagnosis. Currently, diagnosing SCZ involves subjective
analysis of a patient's test results and mental history, though symptom overlap with other mental
disorders (Owen et al., 2016) can occur, increasing the risk of misdiagnosis. An automated and


efficient physiology-based diagnosis of SCZ would be beneficial. Currently, SCZ has no well-established, conventional biomarker, though studies (Han, Huang, Zhang, Zhao, & Chen, 2017; Hsieh, Sun, & Liang, 2014) have shown that MRI may be an effective biomarker for SCZ. This
paper shows how multimodal MRI data can be used to classify SCZ.
Magnetic Resonance Imaging (MRI) is a non-invasive imaging modality that returns valuable
information about the physiology of the human brain, including size, shape, and tissue structure
(Bois, Whalley, McIntosh, & Lawrie, 2015). MRI captures either structural or functional information.
Functional MRI (fMRI) utilizes blood-oxygen-level-dependent (BOLD) signals to capture an approximate measurement of activity between remote regions in the brain (Demirci & Calhoun, 2009).
Structural MRI (sMRI) provides information on varying characteristics of brain tissue such as gray
matter, white matter, and cerebrospinal fluid (Vemuri & Jack, 2010). The challenge with using sMRI
data to diagnose based on structural changes brought on by SCZ is the overlap in structural
change brought on by factors closely linked with SCZ, such as alcoholism and antipsychotic
medication (Bois et al., 2015). Previous studies have shown that the combination of fMRI and
sMRI data can be used in conjunction with a deep learning autoencoder to classify mental
disorders including SCZ (Patel, Aggarwal, & Gupta, 2016; Silva et al., 2014; Zeng et al., 2018). In
one such study (Patel et al., 2016), researchers used an autoencoder, four layers deep in encoding
and decoding, to learn the features of the input data, then used SVM to classify the data with 92%
accuracy.

Problem statement
From the multimodal features derived from the brain MRI scans, we aim to automatically diagnose
subjects with Schizophrenia. Two modalities of MRI scans are used to obtain these features: functional
and structural MRI. Given a set of training data with these features and corresponding labels (either
Schizophrenic or not), the goal is to build a classifier that is specific enough to accurately diagnose
with as few false positives and false negatives [1] as possible, and generic enough that it is robust to small perturbations in future (test) data.

Dataset
Data on 144 test subjects came from a 3T Siemens Trio MRI scanner (12-channel head coil). There
were 75 controls and 69 patients with SCZ (Silva et al., 2014). All data were collected at the Mind
Research Network [2]. Data were preprocessed to distil independent components using the independent component analysis results from earlier studies (Allen et al., 2011; Segall et al., 2012), leveraging spatio-temporal regression to mitigate the potential for low bias or high variance (Erhardt et al., 2011).
The dataset contains the following:

● Training FNC: FNC features for the training set. These are correlation values. They describe
the connection level between pairs of brain maps over time.
● Training SBM: SBM features for the training set. These are standardized weights. They
describe the expression level of ICA brain maps derived from grey-matter concentration.
● Training Labels: Labels for the training set. The labels are indicated in the “Class” column.
0 = ‘Healthy Control’, 1 = ‘Schizophrenic Patient’.
● Testing FNC: FNC features for the test set. Test subject labels have been removed. The task is
to predict these unknown labels from the provided features.
● Testing SBM: SBM features for the test set. Test subject labels have been removed.
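
As a concrete illustration, a minimal sketch (in Python with pandas) of assembling the combined training matrix follows. The file names and the 'Id' join column are assumptions based on the public release of this dataset, not details stated in the paper; adjust them to the actual download.

    import pandas as pd

    # Hypothetical file names for the FNC/SBM/label tables described above.
    fnc = pd.read_csv("train_FNC.csv")        # functional network connectivity correlations
    sbm = pd.read_csv("train_SBM.csv")        # ICA source-based morphometry loadings
    labels = pd.read_csv("train_labels.csv")  # 'Class': 0 = healthy control, 1 = SCZ

    # Join the two modalities per subject and attach the labels.
    train = fnc.merge(sbm, on="Id").merge(labels, on="Id")
    X = train.drop(columns=["Id", "Class"]).values  # 411 features per subject
    y = train["Class"].values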

This is a classic (probabilistic) binary classification problem. Here, we want to find a prediction model $\hat{f}: X \to \hat{y}$, where $\hat{y} = \hat{f}(x)$ is a class prediction for any given observation $x$ such that $\hat{y} \in [0.0, 1.0]$. These observations are $d$-dimensional vectors where each dimension represents a particular feature of the MRI dataset. The resulting prediction probability for a given observation indicates the severity of a subject's SCZ: the greater the value of $\hat{y}$, the more likely the subject has SCZ.

Contribution
The contribution of our work is three-fold. First, we apply a series of traditional (and popular) machine
learning algorithms to our problem that not only perform well, but also provide insight into the
classification mechanisms. This lays the groundwork for future experiments and presents a good yardstick
to measure the efficacy of any new algorithms working with this dataset. Second, the features selected
were significant in deciding whether a given subject was Schizophrenic or not. To validate our proposed
method and our deep network’s internal architecture, we ran multiple, automated trials of our model
using different sets of hyperparameters. Third, the proposed architecture handily outperforms the
traditional methods' accuracies while maintaining an acceptable level of generality. The related results
further confirm the effectiveness of the proposed model for Schizophrenic subject classification.

Traditional machine learning approaches


We explore a few well-known machine learning models to predict if a test subject has schizophrenia or
not. We restrict ourselves to three basic algorithms, namely, logistic regression, support vector
machine (SVM), and random forest (an ensemble technique). These (and their variants) are some of
the most commonly used algorithms in both industry and academia. A brief overview of each algorithm is given below, along with the detailed parameters used in our experimental setup.

Logistic regression
Logistic regression is a variation of linear regression where the output (dependent) variable is
binary (categorical variable with two values such as ‘yes’ or ‘no’) rather than a continuous variable
(Hosmer & Lemeshow, 2000).
Multivariable problems are frequently encountered in medical research. Researchers are typically faced with questions such as: 'What is the relationship of one or more exposure variables (x's) to a disease or illness outcome (y)?' If we consider a dichotomous disease outcome with 0 representing not diseased and 1 representing diseased, and if the illness/disease is coronary heart disease (CHD) status, then the subjects would be classified as either 0 ('without CHD') or 1 ('with CHD'). Suppose that the researcher is interested in a single dichotomous exposure variable, for instance, obesity status, classified as 'yes' or 'no' (or it could be a continuous variable, such as BMI, bucketed into predefined ranges such as '0-5', '5-10' or '25-35'). In such situations, the research question translates to finding the extent to which obesity is associated with CHD status.
The factors that could potentially contribute to an illness/disease form a collection called the independent variables, and the variable that needs to be predicted (the outcome) is called the dependent variable. More generally, the independent variables are denoted as $X = \{x_1, x_2, \ldots, x_n\}$, where $n$ is the number of variables/factors being considered and each $x$ is an independent variable or feature. When $y$, the outcome, is dependent on (or related to) a number of x's, then it is a multivariable problem.
Logistic regression is a mathematical modelling approach that can be used to describe the relationship of several x's to a dichotomous dependent outcome. Like linear regression, it assumes a linear relationship between the predictor and output variables. It is used when one wants to know the probability of an event occurring, such as the probability of a bank loan becoming delinquent or a soccer team winning a game. The input variables need not be continuous.

The logistic function, which describes the mathematical form on which the logistic model is based, is given as:

$$f(z) = \frac{1}{1 + e^{-z}} \qquad (1)$$

As the value of $z$ varies from $-\infty$ to $+\infty$, the value of $f(z)$ takes on values from 0 to 1. The fact that the logistic function $f(z)$ ranges between 0 and 1 is the primary reason the logistic model is so popular. This model is designed to describe a probability, which is always some number between 0 and 1. In medical terms, such a probability gives the risk of an individual getting a disease.
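
A minimal numerical check of equation (1), written in Python with NumPy (an illustrative sketch, not part of the original paper):

    import numpy as np

    def logistic(z):
        # Equation (1): maps any real z into the open interval (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    print(logistic(np.array([-10.0, 0.0, 10.0])))
    # [4.53978687e-05 5.00000000e-01 9.99954602e-01]
    # f(z) approaches 0 and 1 at the extremes, as described above.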
More formally, the output of logistic regression is a predictor variable that gives the probability of an event happening. The model is given by:

$$\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + x\beta_1 \qquad (2)$$

Solving for $p(x)$, this gives:

$$p(x) = \frac{e^{\beta_0 + x\beta_1}}{1 + e^{\beta_0 + x\beta_1}} = \frac{1}{1 + e^{-(\beta_0 + x\beta_1)}} \qquad (3)$$

where $\beta_0$ is the bias or intercept term and $\beta_1$ is the coefficient for the single input value $x$.
The ability to model the odds has made the logistic regression model a popular method of
statistical analysis. The logistic regression model can be used for prospective, retrospective, or
cross-sectional data. More details on its origins, applications and advantages can be seen in
(Cramer, n.d.).
Some key parameters used to run logistic regression are given below:

(1) C = 25.0. This can be tweaked for optimal regularization; larger values give more freedom to the model. C is the inverse of the regularization strength, used here to contain large swings in the coefficient values caused by small perturbations in the data.
(2) For multinomial loss, the 'sag' [3] solver is used.
(3) L2 regularization is used to improve generalization performance, i.e., the performance on new, unseen data. L2 was chosen (as opposed to L1 or others) because the dataset is not sparse and L2 is better at controlling the model's complexity here.
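
As a hedged illustration, a scikit-learn sketch using the parameters reported above and in Table 2 might look as follows. X and y are the feature matrix and labels assembled from the dataset; the cross-validation setup is an assumption, since the paper does not state its fold count.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Parameters as reported in Table 2.
    clf = LogisticRegression(C=25.0, penalty="l2", solver="sag",
                             max_iter=5000, tol=1e-4)

    # 5-fold cross-validation (fold count assumed); the paper reports ~0.8277.
    print(cross_val_score(clf, X, y, cv=5).mean())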

Support vector machine


Support Vector Machines (SVMs) are a class of statistical models first developed in the mid-1960s by Vladimir Vapnik; they became popular after being reintroduced by Boser, Guyon and Vapnik in 1992 (Boser, Guyon, & Vapnik, 1992). Over the years, the SVM has evolved considerably into one of the most flexible and effective machine learning tools available (Cortes & Vapnik, 1995; Vapnik, 1998), and (Vapnik, 1995) provides a comprehensive treatment of SVMs.
SVMs are used for both classification and regression. An SVM operates by maximizing the margin (the separation between the hyperplane and the data on either side). This boils down to a quadratic optimization problem where one minimizes the following equation:

$$\Phi(w) = \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \qquad (4)$$

subject to $y_i(w^T x_i + b) \geq 1 - \xi_i$, with slack variables $\xi_i \geq 0$. $C$ is the trade-off parameter between error and margin. The general idea is that the original feature space can always be mapped to some higher-dimensional feature space where the training data set is separable by a hyperplane. $C > 0$ is the penalty parameter of the error term. There are different kernels that can be applied here. Some of
the basic kernels are the following:

● linear
● polynomial
● radial basis function (RBF)
● sigmoid

In general, the RBF kernel is a reasonable first choice. It non-linearly maps samples into a higher-dimensional space, so it can handle cases where the relation between the predictor variables and the class labels is nonlinear.
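
A hedged scikit-learn sketch of the SVM configuration listed in Table 3 (the cross-validation fold count is again an assumption):

    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # RBF kernel with gamma = 0.1, as listed in Table 3; C is left at its
    # default, since the paper notes that (C, gamma) were not tuned.
    svm = SVC(kernel="rbf", gamma=0.1, tol=1e-3)
    print(cross_val_score(svm, X, y, cv=5).mean())  # the paper reports ~0.8268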

Random forest
Random Forests (RF) are a popular ensemble technique used to build predictive models for both
classification and regression problems. They are among the most successful classifiers not only
because of their accuracy, but also, due to their robustness and ability to work with skewed
datasets across a variety of domains. A study by Fernández-Delgado et al. (Fernández-Delgado, Cernadas, Barro, & Amorim, 2014) evaluated 179 classifiers from 17 different families and concluded that random forests were the best performing classifier among these families. Furthermore, their performance was significantly better than that of the others.
Ensemble methods use multiple 'base' learning models to make the final prediction. The random forest algorithm randomly generates many unpruned, uncorrelated decision trees [4] and uses them to arrive at the best possible answer. RFs use bagging to generate bootstrap samples of the training dataset and use the CART method to build trees. Each tree casts a vote for the classification of a new sample, and the proportion of votes in each class across the set of trees becomes the predicted probability vector.
Assume that the training set is

$$D = \{(x_1, y_1), \ldots, (x_n, y_n)\} \qquad (5)$$

drawn randomly from a (possibly unknown) probability distribution $(x_i, y_i) \sim (X, Y)$. The goal is to build a classifier which predicts $y$ from $x$ based on the data set of examples $D$.

Given an ensemble of (possibly weak) classifiers $h = \{h_1(x), \ldots, h_k(x)\}$, if each $h_k(x)$ is a decision tree, then the ensemble is a random forest. We define the parameters of the decision tree for classifier $h_k(x)$ to be $\Phi_k = (\phi_{k1}, \phi_{k2}, \ldots, \phi_{kp})$. Typically, this is written as

$$h_k(x) = h(x \mid \Phi_k) \qquad (6)$$

which means that decision tree $k$ leads to the classifier $h_k(x) = h(x \mid \Phi_k)$. Here, it is important to note that the feature appearing in a node of the $j$th tree is chosen randomly from the model variables $\Phi$. For the final classification $f(x)$ (which combines the classifiers $h_k(x)$), each tree casts a vote for the most popular class at input $x$, and the class with the most votes wins. More specifically, given the dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, we train a family of classifiers. Each classifier $h_k(x) = h(x \mid \Phi_k)$ is, in our case, a predictor. The outcome is $y \in \{0, 1\}$, representing a pair of unique decisions (in our case, non-schizophrenic and schizophrenic). For generalizations and a more detailed treatment of RFs, refer to Breiman (2001).
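
A hedged scikit-learn sketch matching the parameters in Table 4 (fold count again assumed):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # 5000 unpruned trees with Gini impurity, as in Table 4.
    rf = RandomForestClassifier(n_estimators=5000, criterion="gini",
                                random_state=42, n_jobs=-1)
    print(cross_val_score(rf, X, y, cv=5).mean())  # the paper reports ~0.8333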

Feature selection
An important step in machine learning is feature selection. For any type of prediction or classifica-
tion model, only the relevant and most useful features should be used. Feature selection improves

model accuracy while requiring less data (by removing irrelevant and redundant features). If
feature selection is not done, the resulting model will likely become more complex and less accurate. We used Recursive Feature Elimination (RFE) and random forests to determine feature importances, and used the mean of the scores from these two techniques to set a threshold cut-off.
Figure 1 shows the rankings of all 411 features. A cut-off rank of 0.35 was decided manually by inspecting this figure. This enabled us to choose all features with importance greater than 0.35, granting the classifier improved accuracy and speed. The first 10 features sorted according to their importance are given in Table 1. The selected features (with rank-threshold = 0.35) are listed below; a sketch of the selection procedure follows the list.
['FNC295', 'FNC244', 'SBM_map67', 'FNC302', 'SBM_map61', 'FNC226', 'FNC289', 'FNC220', 'FNC33', 'FNC243', 'SBM_map36', 'FNC183', 'FNC37', 'FNC48', 'FNC78', 'FNC61', 'FNC297', 'FNC333', 'SBM_map7', 'FNC290', 'SBM_map75', 'FNC208', 'FNC292', 'FNC40', 'SBM_map17', 'FNC265', 'FNC171', 'FNC189', 'FNC353', 'FNC62', 'FNC185', 'FNC13', 'FNC337', 'FNC5', 'FNC30', 'FNC68', 'FNC150', 'FNC211', 'FNC293', 'FNC328', 'FNC89', 'FNC106', 'FNC165', 'FNC221', 'SBM_map64', 'FNC75', 'FNC83', 'FNC29', 'FNC102', 'FNC142', 'FNC194', 'FNC200', 'FNC210', 'FNC219', 'FNC256', 'FNC279', 'FNC304', 'SBM_map72']
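
The paper does not spell out how the RFE ranks and random-forest importances were normalized before averaging, so the rescaling in the sketch below is an assumption; the overall shape of the procedure follows the description above.

    import numpy as np
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # RFE ranking, rescaled so that the best-ranked feature scores 1.0.
    rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=1).fit(X, y)
    rfe_score = 1.0 - (rfe.ranking_ - 1) / (rfe.ranking_.max() - 1)

    # Random-forest importances, rescaled to [0, 1].
    rf = RandomForestClassifier(n_estimators=500, random_state=42).fit(X, y)
    rf_score = rf.feature_importances_ / rf.feature_importances_.max()

    # Mean of the two scores; keep features above the 0.35 rank threshold.
    mean_score = (rfe_score + rf_score) / 2.0
    selected = np.where(mean_score > 0.35)[0]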

Application of deep learning binary classifier


Deep learning (Lecun, Bengio, & Hinton, 2015) is a set of machine learning algorithms that model
high-level abstractions in data using architectures consisting of multiple nonlinear transformations.
Based on artificial neural networks, deep neural networks aim to learn a given domain as a nested
hierarchy of concepts, where each concept is defined based on simpler concepts. In a deep neural
network, each neuron in each layer represents one aspect of the domain it tries to learn, and in
totality, it aims to learn a full representation of the input. Each node also has a weight that
represents the strength of its relationship with the output. As the model is fed more and more data, the strengths of these relationships are either reinforced or weakened, and this adjustment continues until training stops.

Figure 1. Feature rankings. (Only 80 features are shown on the x-axis for readability purposes.)

Table 1. The first 10 feature rankings from the feature selection step.

Feature      RFE   RForest   Mean
FNC244       1     1         0.667
FNC295       1     0.95      0.65
SBM_map67    1     0.83      0.61
FNC302       1     0.75      0.583
SBM_map61    1     0.65      0.55
FNC226       1     0.57      0.523
FNC33        1     0.52      0.507
FNC289       1     0.51      0.503
FNC183       1     0.41      0.47

A major advantage of deep learning is that one neither needs to identify the significant features nor do feature engineering; the architecture learns the features incrementally on its own. This eliminates the need for domain expertise and explicit feature extraction and feature engineering [5]. Another advantage is that it tends to solve a problem end-to-end, whereas in a traditional machine learning approach, the problem has to be broken down into smaller, more manageable pieces and solved with a variety of statistical and probabilistic approaches.
A clear disadvantage of deep learning (as of this writing) is the inability of the architecture to clearly explain itself. It provides very little justification as to the importance of features and what type of feature engineering happened behind the scenes. This lack of interpretability is a big reason why many business sectors have not yet adopted deep learning. In contrast, traditional machine learning approaches like linear regression, logistic regression, SVMs, and random forests all provide a clear explanation of why and how a given decision was made. In addition, deep learning requires a very large amount of data to train and is computationally very intensive; hence the need for expensive GPUs and large infrastructure to support them. Furthermore, it has yet to establish a robust theoretical foundation, which leads to its next disadvantage: the difficulty of determining the topology, training mechanism, and hyperparameters of a model. With little theory for guidance (as of this date), building a good deep learning architecture is less an engineering feat and more an art, possibly resulting in an uninterpretable model.
Keras (Chollet et al., 2015) is a high-level neural-networks API, written in Python and capable of running on top of TensorFlow, CNTK or Theano. It was developed with a focus on enabling fast experimentation. We used Keras to build our model.

Architecture
The architecture used was rather simple and can be seen in Figure 2. The input layer has 411 neurons (dimensions), corresponding to the 411 features of the data set. Note that we do not use the curated 55 features obtained through feature selection. Instead, we pass the full dimensionality of the data set to the model and allow it to decide which features are the most relevant. The first hidden layer has 512 neurons, each with a Rectified Linear Unit [6] activation function. The output layer consists of just one neuron with a sigmoid activation function. Sigmoid is used to ensure that the resulting prediction probabilities fall between 0.0 and 1.0. The loss function used was binary cross-entropy, and the optimizer used was 'adam' (Adaptive Moment Estimation), a stochastic, first-order gradient-based optimiser that can be used instead of the classical stochastic gradient descent procedure to update network weights. Finally, the model was compiled using the 'accuracy' metric, which is used to judge the performance of the model.
Before the training data was fed to the deep neural network model, it was scaled appropriately using a standard scaler, which standardizes the feature values by removing the mean and scaling to unit variance. We chose a set of initial values for the weights to help the model achieve better and faster convergence; this was achieved with the Xavier-Glorot method (Glorot & Bengio, 2010), which helps mitigate the well-known back-propagation problems of vanishing and exploding gradients (the latter known to occur very rarely).

Figure 2. Neural network architecture diagram.
Deep neural networks usually contain millions of parameters and hence, it is imperative to attempt to
reduce the parameter space as much as possible. In this context, we make such attempts by regularizing
the weights. It is well-known that the quality of the regularization method significantly affects both the
discriminative power and generalization ability of the trained models. To achieve an acceptable level of
generalization, we apply a dropout technique that randomly sets certain neurons’ responses to zero with
a given probability. In our case, we use a 50% dropout rate, which means that half of the neurons in any given layer output a zero. This technique, though simple in theory and implementation, has far-reaching benefits, not only helping to generalize the model but also vastly improving the speed of learning, particularly for very deep networks. This is because each training batch updates only a subset of the neurons at a time, which conveniently avoids co-adaptation of the learned feature representations (Krizhevsky, Sutskever, & Hinton, 2012).
We augment the deep architecture, at every layer, with an application of an aggressive dropout
regularization. This boosts the capability of the network to learn in a generalized way and avoids
over-fitting (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014).
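
A minimal Keras sketch of the network as described above and in Table 6 is given below. It is an illustrative reconstruction, not the authors' released code: the dropout placement follows Table 6 (after the first two hidden layers), and the fit settings use the batch size, epoch count, and 20% validation split reported in the results section.

    from sklearn.preprocessing import StandardScaler
    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    # Standardize features to zero mean and unit variance, as described above.
    X_scaled = StandardScaler().fit_transform(X)

    # 411 inputs -> 512 -> 256 -> 128 -> 1, ReLU hidden units, sigmoid output,
    # Xavier-Glorot weight initialization, 50% dropout (placement per Table 6).
    model = Sequential([
        Dense(512, activation="relu", input_dim=411,
              kernel_initializer="glorot_uniform"),
        Dropout(0.5),
        Dense(256, activation="relu", kernel_initializer="glorot_uniform"),
        Dropout(0.5),
        Dense(128, activation="relu", kernel_initializer="glorot_uniform"),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

    # Final settings from the results section: batch size 45, 19 epochs,
    # 20% of the training data held out for validation.
    model.fit(X_scaled, y, epochs=19, batch_size=45, validation_split=0.2)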

Results and discussion


The dataset used to train the model was obtained from combining the correlation values between
the pairs of brain maps (the FNC dataset) and the standardized weights that describe the ICA brain
maps (SBM dataset). This combined feature vector was concatenated with their corresponding
training labels to create a full set of feature vectors. This was used to train various ML and deep
learning models.
We first discuss the results obtained from running the traditional machine learning approaches. Logistic regression, when run with 5000 iterations, L2 penalization, and the 'sag' solver, produced an accuracy score of 0.8277. A more detailed parameter listing is given in Table 2.
SVM, when run with un-optimised parameters, achieved a lower accuracy of 0.7988. We did not perform hyper-parameter optimization here; it would not surprise us if SVM, run with proper hyper-parameter settings, gave better results. SVM was used in our context to do binary classification with an RBF kernel, and it produced a good model with a cross-validation score of 0.8268. We chose the RBF kernel because it has far fewer hyperparameters than the polynomial kernel. A detailed list of parameters is given in Table 3.

A few aspects we did not experiment with are the values of $C$ and $\gamma$. Ideally, a good $(C, \gamma)$ pair has to be identified to produce a good classifier. $C$ (the soft-margin cost parameter) helps control the influence of each individual support vector, and $\gamma$ controls the variance. A large $\gamma$ may lead to high-bias, low-variance models, and vice versa.

Experiments with random forest with 5000 trees yielded a very good model with a cross-validation score of 0.8333; its parameter list is given in Table 4.
In our deep learning experiments, the total number of trainable parameters, weights and biases combined, was 375,297, as given in Table 6. This number is equal to the sum of the numbers of trainable parameters in all layers; the calculation is given below, after Tables 2-4.

Table 2. Logistic regression parameters.


Parameter Value
C 25.0
class_weight None
dual False
fit_intercept True
intercept_scaling 1
max_iter 5000
multi_class ovr
n_jobs 1
penalty l2
random_state None
solver sag
tol 0.0001
verbose 0
warm_start False

Table 3. SVM parameters.


Parameter Value
cache_size 200
class_weight None
coef0 0.0
decision_function_shape ovr
degree 3
gamma 0.1
kernel rbf
max_iter −1
probability False
random_state None
shrinking True
tol 0.001
verbose False

Table 4. Random forest parameters.


Parameter Value
bootstrap True
class_weight None
criterion gini
max_depth None
max_features auto
max_leaf_nodes None
min_impurity_decrease 0.0
min_impurity_split None
min_samples_leaf 1
min_samples_split 2
min_weight_fraction_leaf 0.0
n_estimators 5000
n_jobs −1
oob_score False
random_state 42
warm_start False

Let $n^{(L)}$ be the number of neurons in layer $L$ of the network, where $n^{(0)}$ is the number of input features. The number of trainable parameters in layer $L$ is $n^{(L)} \times (n^{(L-1)} + 1)$, where the additional 1 accounts for the bias unit in layer $L-1$. Then, the number of trainable parameters for layer 1 is $n^{(1)} \times (n^{(0)} + 1) = 512 \times (411 + 1) = 210{,}944$, as shown in Table 6. The total number of trainable parameters for the network is the sum over all layers: $210{,}944 + 131{,}328 + 32{,}896 + 129 = 375{,}297$.
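
A quick check of this arithmetic in Python (layer widths taken from Table 6):

    # Each Dense layer contributes n_out * (n_in + 1) trainable parameters.
    widths = [411, 512, 256, 128, 1]
    per_layer = [widths[i + 1] * (widths[i] + 1) for i in range(len(widths) - 1)]
    print(per_layer)       # [210944, 131328, 32896, 129]
    print(sum(per_layer))  # 375297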
The dropout mechanisms applied in each layer significantly reduced the computation time and improved generalization. We trained several deep learning models with different numbers of hidden layers, dropout rates, and activation functions. After a considerable number of experiments, we settled on a depth of three hidden layers, with 512, 256 and 128 nodes respectively, and a dropout rate of 50%. We then ran this general framework on the training dataset with different batch sizes and epoch counts. Each run on the training dataset was also evaluated against a validation set (of 20%) to validate the effectiveness of the classifier.
The final model's results are visualized in Figure 3. The accuracy for the training set stays more or less at 100%, while that of the validation set is close behind, reaching a steady state at 94.444%. This result was achieved with a batch size of 45 at the 19th epoch. The best loss (computed by binary cross-entropy) for the training dataset was 4.5488e-06, while that of the validation set was less than 0.28. During the other stages of training, the accuracy and loss, though promising, did not reach this level.
During experiments with various epoch counts and batch sizes, as seen in Figures 4 and 5, although the accuracy tends to be close to 1.0, the validation loss tends to hover above 0.25. It should be noted that there is a fine balance between the number of epochs and the batch size. Typically, a larger batch size is preferred for good normalization: the greater the mini-batch size, the smaller the variance between mini-batches. Smaller batch sizes typically result in finer gradient descent steps, leading to slower convergence. Larger batch sizes lead to executing fewer gradient descent steps and hence the need to train for more epochs. In most cases, though, the
accuracy will not differ drastically. A comparison of accuracy between all machine learning
methods discussed above can be seen in Table 5.

Limitations
This work has limitations on several fronts. An immediately noticeable one is the size of the training dataset, consisting of only 86 observations, as compared to the test dataset, which has 119,748 observations. This deficiency in training data will directly impact any classifier's performance.

Figure 3. Accuracy and loss plots for batch-size = 45 and number-of-epochs = 19.

Figure 4. Accuracy and loss plots for batch-size = 35 and number-of-epochs = 19.

Figure 5. Accuracy and loss plots for batch-size = 25 and number-of-epochs = 19.

However, we did not find class imbalance [7]. For the given training dataset, we are not sure about the distribution of the kinds of observations (i.e., the distribution of the relationships among the features). It would have been nice to have orthogonal feature values among the observations and across the types of relationships, which would result in a rich set of data despite the low cardinality of the training set.

Table 5. Cross-validation accuracy scores from various models.


ML algorithm Accuracy
Logistic regression 0.8277
SVM 0.8268
Random forest 0.8333
Deep learning 0.9444

Table 6. Dimensions at each layer of deep learning architecture.


Layer (type) Output shape Param #
dense_1 (Dense) (None, 512) 210,944
dropout_1 (Dropout) (None, 512) 0
dense_2 (Dense) (None, 256) 131,328
dropout_2 (Dropout) (None, 256) 0
dense_3 (Dense) (None, 128) 32,896
dense_4 (Dense) (None, 1) 129
Total params: 375,297.0
Trainable params: 375,297.0
Non-trainable params: 0.0

In addition to these data issues, the lack of labels for the test dataset prevented us from performing several further statistical analyses of the model's performance (such as ROC curves, confusion matrices, F1 scores, etc.).
On the modelling front, particularly when building the deep learning architecture, the small size of the training dataset again prevented us from exploring many options in hyper-parameter tuning. For experiments with deep neural networks, one typically needs a very large set of training data. This not only allows the investigator to search the parameter space in a fairly comprehensive way [8], but also enables the investigator to generalize the model well (otherwise, there is a risk of overfitting). Because of the small training data set, we had to settle for rather uncomfortable values for the epoch count and batch size. Recent research on training deep neural networks with very small training sets (called 'one-shot learning') looks promising (Lake, Salakhutdinov, & Tenenbaum, 2015; Vinyals, Blundell, Lillicrap, Kavukcuoglu, & Wierstra, 2016).

Conclusions and future work


In this experimental paper, we use deep learning techniques to classify patients as schizophrenic or not based on fMRI data. We used a deep learning architecture with three hidden layers and achieved reasonably good classification accuracy, as observed in the validation accuracy and loss results. We observed consistently that the features selected (in other words, the ranking of features) were indeed significant in deciding whether a given subject is schizophrenic. Our experiments with completely different sets of features (those below the threshold of 0.35) did not achieve a level of accuracy even close to what we attained through rigorous feature selection. In some cases, we observed up to a 28% increase in validation accuracy when we explicitly chose the features above the 0.35 threshold.
Our next research objective is to discover new features and use more powerful techniques of
deep-learning to gain higher accuracy. Some areas we intend to cover are unsupervised feature
learning for feature-engineering and pretrained initial weights while acknowledging that this
exercise would be complex and not easily verifiable. We shall also explore converting MRI data
into 2D images and perform feature learning before applying convolutional methods on it. We plan
to use autoencoders on top of CNNs to better learn features.
In the context of the deep-learning architecture itself, we plan to treat certain neurons in
a preferential way. Currently, in our experiments, we use standard dropout, rather than assigning

dropout rates for specific sets of neurons in each layer. This could be done in either a deterministic or a stochastic way. We believe that training the network with such new methods of regularization has the potential to improve performance after several epochs.
We also plan to apply deep architectures with optimized kernel machines, to enable dealing with problems (and datasets) that have limited data but prior knowledge of the relationships in the data. Equipping architectures with data-specific kernels or kernel compositions (like multiple kernel learning), we believe, would lead to new and powerful architectures and inference strategies.
Finally, as mentioned in the Limitations section, the lack of very large amounts of training data encourages us to experiment with one-shot approaches.

Notes
1. Here, in a probabilistic classifier, a false positive is seen as the classifier generating a 'high' value (say > 0.7) for a subject who, in reality, does not have SCZ. The threshold can typically be determined by empirical analysis. We chose a probabilistic output instead of a hard binary output to enable the end-user/radiologist to make an informed and collective decision, rather than have the algorithm make a firm decision.
2. Funded by a Center of Biomedical Research Excellence (COBRE) grant 5P20RR021938/P20GM103472 from the NIH to Dr Vince Calhoun (Cetin et al., 2014).
3. Stochastic Average Gradient descent: it allows a model to be trained much faster than other solvers when the data are very large. To use it effectively, the features must be scaled.
4. Also called weak learners; these have a minimum accuracy of > 0.5.
5. We say this with caution. Researchers still do not have a clear idea of how it learns features, what (good and bad) things it learns, how fast it learns, and whether it forgets anything.
6. ReLU is a non-linear function, essentially a half-wave rectifier $f(z) = \max(z, 0)$. ReLU typically learns faster than $\tanh(z)$ or $1/(1 + e^{-z})$ when there are many layers in the network.
7. Forty observations out of 86 had a class of '1', which is 46.51% of the training set.
8. Randomized search, as opposed to grid search and exhaustive search.

Disclosure statement
No potential conflict of interest was reported by the authors.

References
Allen, E. A., Erhardt, E. B., Damaraju, E., Gruner, W., Segall, J. M., Silva, R. F., . . . Michael, A. M. (2011). A baseline for the
multivariate comparison of resting-state networks. Frontiers in Systems Neuroscience, 5, 2.
Amin, J., Sharif, M., Yasmin, M., & Fernandes, S. L. (2018). Big data analysis for brain tumor detection: Deep
convolutional neural networks. Future Generation Computer Systems, 87, 290-297.
Belhumeur, P. N. (1997). Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 19(7), 711–720.
Bois, C., Whalley, H., McIntosh, A., & Lawrie, S. (2015). Structural magnetic resonance imaging markers of susceptibility
and transition to schizophrenia: A review of familial and clinical high risk population studies. Journal of
Psychopharmacology, 29(2), 144–154.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the
fifth annual workshop on computational learning theory (pp. 144–152). New York, NY: ACM.
Breiman, L. (2001, Oct 1). Random forests. Machine Learning, 45(1), 5–32.
Cetin, M., Christensen, F., Abbott, C., Stephen, J., Mayer, A., Canive, J., . . . Calhoun, V. D. (2014). Thalamus and posterior
temporal lobe show greater inter-network connectivity at rest and across varying sensory loads in schizophrenia.
Neuroimage, 97, 117–126.
Chollet, F. (2015). Keras. Available from: https://keras.io.
Cortes, C., & Vapnik, V. (1995, Sept 1). Support-vector networks. Machine Learning, 20(3), 273–297.
Cramer, J. (n.d.). The origins of logistic regression. Tinbergen Institute Working Paper No. 2002-119/4, Tinbergen
Institute.
Demirci, O., & Calhoun, V. D. (2009). Functional magnetic resonance imaging - implications for detection of schizophrenia. European Neurological Review, 4(2), 103.
Erhardt, E. B., Rachakonda, S., Bedrick, E. J., Allen, E. A., Adali, T., & Calhoun, V. D. (2011). Comparison of multi-subject ICA methods for analysis of fMRI data. Human Brain Mapping, 32(12), 2075-2095.

Fernandes, S. L., & Bala, G. J. (2016). Fusion of sparse representation and dictionary matching for identification of
humans in uncontrolled environment. Computers in Biology and Medicine, 76, 215–237.
Fernández-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real
world classification problems? Journal of Machine Learning Research, 15, 3133–3181.
Glorot, X., & Bengio, Y. (2010, May). Understanding the difficulty of training deep feedforward neural networks. In
JMLR W&CP: Proceedings of the thirteenth international conference on artificial intelligence and statistics (aistats 2010)
(Vol. 9, pp. 249–256). Chia Laguna Resort, Sardinia, Italy: PMLR.
Han, S., Huang, W., Zhang, Y., Zhao, J., & Chen, H. (2017, Dec 6). Recognition of early-onset schizophrenia using
deep-learning method. Applied Informatics, 4(1), 16.
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (Wiley series in probability and statistics) (2nd ed.).
New York, NY: Wiley-Interscience Publication.
Hsieh, T.-H., Sun, M.-J., & Liang, S.-F. (2014). Diagnosis of schizophrenia patients based on brain network complexity
analysis of resting-state fMRI. In The 15th international conference on biomedical engineering (pp. 203–206). New
York, NY: Springer, Cham, 2014.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th international conference on neural information processing systems - volume 1 (pp. 1097-1105). Red Hook, NY: Curran Associates Inc.
Kuenen, M. P., Mischi, M., & Wijkstra, H. (2011). Contrast-ultrasound diffusion imaging for localization of prostate
cancer. IEEE Transactions on Medical Imaging, 30(8), 1493–1502.
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015, Dec 11). Human-level concept learning through probabilistic
program induction. Science, 350, 1332–1338.
Lecun, Y., Bengio, Y., & Hinton, G. (2015, May 27). Deep learning. Nature, 521(7553), 436–444.
Martis, R. J., Lin, H., Gurupur, V. P., & Fernandes, S. L. (2017). Editorial: Frontiers in development of intelligent
applications for medical imaging processing and computer vision. Computers in Biology and Medicine, 89, 549–550.
Mihalopoulos, C., Harris, M., Henry, L., Harrigan, S., & McGorry, P. (2009). Is early intervention in psychosis cost-effective
over the long term? Schizophrenia Bulletin, 35(5), 909–918.
Mikolajczyk, K., & Schmid, C. (2005). A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 27(10), 1615–1630.
Owen, M. J., Sawa, A., & Mortensen, P. B. (2016, July 2). Schizophrenia. The Lancet, 388(10039), 86-97.
Patel, P., Aggarwal, P., & Gupta, A. (2016). Classification of schizophrenia versus normal subjects using deep learning.
Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing, 28, 1–6. New York,
NY: ACM.
Segall, J., Allen, E., Jung, R., Erhardt, E., Arja, S., Kiehl, K., & Calhoun, V. (2012). Correspondence between structure and
function in the human brain at rest. Frontiers in Neuroinformatics, 6, 10.
Silva, R., Castro, E., Gupta, C., Cetin, M., Arbabshirani, M., Potluru, V., . . . Calhoun, V. (2014). The tenth annual MLSP competition: Schizophrenia classification challenge. IEEE Computer Society.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent
neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
Vapnik, V. N. (1995). The nature of statistical learning theory. Berlin, Heidelberg: Springer-Verlag.
Vapnik, V. N. (1998). Statistical learning theory. New York, NY: Wiley-Interscience.
Vemuri, P., & Jack, C. R. (2010, Aug 31). Role of structural MRI in Alzheimer’s disease. Alzheimer’s Research & Therapy, 2
(4), 23.
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching networks for one shot learning. In
D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing
systems 29 (pp. 3630–3638). Red Hook, NY: Curran Associates, Inc.
Yang, H.-F., Lin, K., & Chen, C.-S. (2018). Supervised learning of semantics-preserving hash via deep convolutional
neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2), 437–451.
Zeng, L.-L., Wang, H., Hu, P., Yang, B., Pu, W., Shen, H., . . . Wang, K. (2018). Multi-site diagnostic classification of schizophrenia using discriminant deep learning with functional connectivity MRI. EBioMedicine, 30, 74-85.
