Deep Learning Schizophrenic
All content following this page was uploaded by Varadraj P Gurupur on 13 August 2022.
To cite this article: Srivathsan Srinivasagopalan, Justin Barry, Varadraj Gurupur & Sharma
Thankachan (2019): A deep learning approach for diagnosing schizophrenic patients, Journal of
Experimental & Theoretical Artificial Intelligence, DOI: 10.1080/0952813X.2018.1563636
Introduction
Machine learning has been applied in a variety of domains, including computer vision
(Belhumeur, 1997) and medical diagnostics (Kuenen, Mischi, & Wijkstra, 2011; Martis, Lin, Gurupur,
& Fernandes, 2017). Typically, all machine learning algorithms need to be provided with significant
features from data to learn the patterns and perform classification (Fernandes & Bala, 2016;
Mikolajczyk & Schmid, 2005). Deep learning is a subfield of machine learning which can extract
features and perform classification on its own (Amin, Sharif, Yasmin, & Fernandes, 2018; Yang, Lin, &
Chen, 2018). Recently deep learning has gained a lot of interest in the diagnosis of
Schizophrenia (SCZ).
Schizophrenia is a mental disorder primarily affecting people between the ages of 16 and 30. Its
positive features include hallucinations, delusions, psychosis; its negative features include impaired
motivation and social withdrawal (Owen, Sawa, & Mortensen, 2016). SCZ usually manifests in
adolescence or early twenties and continues to worsen into adulthood. An at-risk phase, called
the prodromal phase, often precedes the full-blown disorder, though there have been cases of
sudden onset in previously healthy individuals. SCZ’s effects are severe. Left untreated, SCZ places
the burden of care on the individual's family. Thus, there is a social incentive to efficiently and
accurately diagnose SCZ. Additionally, research has shown that early diagnosis of the disease can
reduce treatment costs (Mihalopoulos, Harris, Henry, Harrigan, & McGorry, 2009), thus increasing
the need for a reliable mechanism for diagnosis. Currently, diagnosing SCZ involves subjective
analysis of a patient's test results and mental history, though symptom overlap with other mental
disorders (Owen et al., 2016) can occur, increasing the risk of misdiagnosis. An automated and
efficient physiology-based diagnosis of SCZ would be beneficial. Currently, SCZ has no
well-established, conventional biomarker, though studies have shown (Han, Huang, Zhang, Zhao,
& Chen, 2017; Hsieh, Sun, & Liang, 2014) that MRI may be an effective biomarker for SCZ. This
paper shows how multimodal MRI data can be used to classify SCZ.
CONTACT Justin Barry justin.barry.e@knights.ucf.edu Department of Computer Science, University of Central Florida,
Orlando, FL, USA
© 2019 Informa UK Limited, trading as Taylor & Francis Group
2 S. SRINIVASAGOPALAN ET AL.
Magnetic Resonance Imaging (MRI) is a non-invasive imaging modality that returns valuable
information about the physiology of the human brain, including size, shape, and tissue structure
(Bois, Whalley, McIntosh, & Lawrie, 2015). MRI captures either structural or functional information.
Functional MRI (fMRI) utilizes Blood-oxygen-level-dependent (BOLD) signals to capture an approx-
imate measurement of activity between remote regions in the brain (Demirci & Calhoun, 2009).
Structural MRI (sMRI) provides information on varying characteristics of brain tissue such as gray
matter, white matter, and cerebrospinal fluid (Vemuri & Jack, 2010). The challenge with using sMRI
data for diagnosis is that the structural changes brought on by SCZ overlap with those
brought on by factors closely linked with SCZ, such as alcoholism and antipsychotic
medication (Bois et al., 2015). Previous studies have shown that the combination of fMRI and
sMRI data can be used in conjunction with a deep learning autoencoder to classify mental
disorders including SCZ (Patel, Aggarwal, & Gupta, 2016; Silva et al., 2014; Zeng et al., 2018). In
one such study (Patel et al., 2016), researchers used an autoencoder, four layers deep in encoding
and decoding, to learn the features of the input data, then used an SVM to classify the data with 92%
accuracy.
Problem statement
From the multimodal features derived from the brain MRI scans, we aim to automatically diagnose
subjects with Schizophrenia. Two modalities of MRI scans are used to obtain these features: functional
and structural MRI. Given a set of training data with these features and corresponding labels (either
Schizophrenic or not), the goal is to build a classifier that is specific enough to diagnose accurately,
with as few false positives and false negatives (Note 1) as possible, and generic enough that it is robust
to small perturbations in future (test) data.
Dataset
Data on 144 test subjects came from a 3T Siemens Trio MRI scanner (12-channel head coil). There
were 75 controls and 69 patients with SCZ (Silva et al., 2014). All data were collected at the Mind
Research Network (Note 2). Data were preprocessed to distil independent components using the
independent component analysis results from the studies in Allen et al. (2011) and Segall et al. (2012),
and by leveraging spatio-temporal regression to mitigate the potential of low bias or high variance
(Erhardt et al., 2011).
The dataset contains the following:
● Training FNC: FNC features for the training set. These are correlation values. They describe
the connection level between pairs of brain maps over time.
● Training SBM: SBM features for the training set. These are standardized weights. They
describe the expression level of ICA brain maps derived from grey-matter concentration.
● Training Labels: Labels for the training set. The labels are indicated in the “Class” column.
0 = ‘Healthy Control’, 1 = ‘Schizophrenic Patient’.
● Testing FNC: FNC features for the test set. Test subject labels have been removed. The task is
to predict these unknown labels from the provided features.
● Testing SBM: SBM features for the test set. Test subject labels have been removed.
JOURNAL OF EXPERIMENTAL & THEORETICAL ARTIFICIAL INTELLIGENCE 3
This is a classic (probabilistic) binary classification problem. Here, we want to find a prediction
model $\hat{f}: X \to \hat{y}$, where $\hat{y} = \hat{f}(x)$ is a class prediction for any given observation $x$ such that
$\hat{y} \in [0.0, 1.0]$. These observations are $d$-dimensional vectors where each dimension represents
a particular feature of the MRI dataset. The resulting prediction probability for a given observation
gives the severity of SCZ a subject has. The greater the value of $\hat{y}$, the more likely the subject
has SCZ.
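To make the probabilistic output concrete, the following minimal sketch converts predicted probabilities into labels with a cut-off and counts the two kinds of error. The function names and the five example predictions are hypothetical; the 0.7 cut-off is the illustrative value mentioned in Note 1, not a tuned threshold.

```python
def classify(probabilities, threshold=0.7):
    """Map predicted probabilities to labels: 1 = patient with SCZ, 0 = control."""
    return [1 if p > threshold else 0 for p in probabilities]


def count_errors(predicted, actual):
    """Return (false positives, false negatives) for paired label lists."""
    false_pos = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    false_neg = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return false_pos, false_neg


# Hypothetical model outputs for five subjects and their true labels.
y_hat = [0.92, 0.15, 0.71, 0.40, 0.08]
y_true = [1, 0, 0, 1, 0]

labels = classify(y_hat, threshold=0.7)
errors = count_errors(labels, y_true)
print(labels, errors)
```

Lowering the threshold trades false negatives for false positives, which is why the choice is left to empirical analysis.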
Contribution
The contribution of our work is three-fold. First, we apply a series of traditional (and popular) machine
learning algorithms to our problem that perform not only well, but also provide insight into the
classification mechanisms. This lays the groundwork for future experiments and presents a good yardstick
to measure the efficacy of any new algorithms working with this dataset. Second, the features selected
were significant in deciding whether a given subject was Schizophrenic or not. To validate our proposed
method and our deep network’s internal architecture, we ran multiple, automated trials of our model
using different sets of hyperparameters. Third, the proposed architecture handily outperforms the
traditional methods accuracies while maintaining an acceptable level of generality. The related results
further confirm the effectiveness of the proposed model for Schizophrenic subject classification.
Logistic regression
Logistic regression is a variation of linear regression where the output (dependent) variable is
binary (categorical variable with two values such as ‘yes’ or ‘no’) rather than a continuous variable
(Hosmer & Lemeshow, 2000).
Multivariable problems are frequently encountered in medical research. Researchers are typically
faced with questions such as “What is the relationship of one or more exposure variables (x’s)
to a disease or illness outcome (y)?” If we consider a dichotomous disease outcome with 0
representing not diseased and 1 representing diseased, and if the illness/disease is coronary heart
disease (CHD) status, then the subjects would be classified as either 0 (‘without CHD’) or 1 (‘with
CHD’). Suppose that the researcher is interested in a single dichotomous exposure variable, for
instance, obesity status, classified as ‘yes’ or ‘no’ (or it could be a continuous variable such as ‘obese
value (BMI)’ that takes on some predefined bucketed values such as ‘0–5’, ‘5–10’ or ‘25–35’). In
such situations, the research question translates to finding the extent to which obesity is associated
with CHD status.
The factors that could potentially contribute to an illness/disease form a collection called the
independent variables, and the variable that needs to be predicted (the outcome) is called the dependent
variable. More generally, the independent variables are denoted as $X = \{x_1, x_2, \ldots, x_n\}$, where
$n$ is the number of variables/factors being considered and each $x_i$ is an independent variable or
feature. When $y$, the outcome, is dependent on (or related to) a number of x’s, then it is
a multivariable problem.
Logistic regression is a mathematical modelling approach that can be used to describe the
relationship of several x’s to a dichotomous dependent outcome. Like linear regression, it assumes
a linear relationship, here between the predictor variables and the log-odds of the outcome. It is used
when one wants to know the probability of an event occurring, such as the probability of a bank loan
being delinquent or a soccer team winning a game. The input variables need not be continuous.
The logistic function, which describes the mathematical form on which the logistic model is
based, is given as:
$$f(z) = \frac{1}{1 + e^{-z}} \qquad (1)$$
As the value of $z$ varies from $-\infty$ to $+\infty$, the value of $f(z)$ takes on values from 0 to 1. The fact
that the logistic function $f(z)$ ranges between 0 and 1 is the primary reason the logistic model is so
popular. This model is designed to describe a probability, which is always some number between 0
and 1. In medical terms, such a probability gives the risk of an individual getting a disease.
More formally, the output of logistic regression is a predictor variable that gives the probability
of an event happening. The model is given by:
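$$\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + x\beta_1 \qquad (2)$$
Solving for $p$, this gives:
$$p(x) = \frac{e^{\beta_0 + x\beta_1}}{1 + e^{\beta_0 + x\beta_1}} = \frac{1}{1 + e^{-(\beta_0 + x\beta_1)}} \qquad (3)$$
where $\beta_0$ is the bias or intercept term and $\beta_1$ is the coefficient for the single input value $x$.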
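As a numerical check of how Equations (1) to (3) fit together, the short sketch below evaluates the logistic function and confirms that taking the log-odds of $p(x)$ recovers the linear term $\beta_0 + x\beta_1$. The coefficient and input values are arbitrary illustrations, not fitted values from this study.

```python
import math


def logistic(z):
    """Equation (1): f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def probability(x, beta0, beta1):
    """Equation (3): p(x) for a single input value x."""
    return logistic(beta0 + x * beta1)


def log_odds(p):
    """Left-hand side of Equation (2): log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# Illustrative (assumed) coefficients and input.
beta0, beta1 = -1.0, 0.5
x = 3.0

p = probability(x, beta0, beta1)
print(p, log_odds(p), beta0 + x * beta1)
```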
The ability to model the odds has made the logistic regression model a popular method of
statistical analysis. The logistic regression model can be used for prospective, retrospective, or
cross-sectional data. More details on its origins, applications and advantages can be seen in
(Cramer, n.d.).
Some key parameters used to run logistic regression are given below:
(1) C = 25.0. C is the inverse of the regularization strength, so larger values give more
freedom to the model. It can be tuned to contain large increases or decreases in the
coefficient values caused by small perturbations in the data.
(2) For multinomial loss, the ‘sag’ solver (Note 3) is used.
(3) L2 regularization is used to improve generalization performance, i.e., the performance on
new, unseen data. L2 was chosen (rather than L1 or others) because the dataset is not
sparse and L2 is better at controlling model complexity.
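The role of C can be illustrated by writing out an L2-regularized logistic loss by hand. The sketch below assumes one common convention, in which C multiplies the data-fit term and the penalty is half the squared norm of the weights; the function name and the toy data are hypothetical, and this is not the solver used in the study.

```python
import math


def l2_logistic_objective(weights, bias, X, y, C=25.0):
    """C * (sum of log-losses) + 0.5 * ||weights||^2 (assumed convention)."""
    data_loss = 0.0
    for xi, yi in zip(X, y):
        z = bias + sum(w * v for w, v in zip(weights, xi))
        p = 1.0 / (1.0 + math.exp(-z))
        data_loss += -(yi * math.log(p) + (1 - yi) * math.log(1.0 - p))
    penalty = 0.5 * sum(w * w for w in weights)
    return C * data_loss + penalty


# Toy data: three observations with two features each (made up).
X = [[0.5, 1.0], [-1.0, 0.3], [1.5, -0.7]]
y = [1, 0, 1]

# The same weights cost the same penalty, but a larger C makes the
# data-fit term dominate, i.e. regularization becomes weaker.
weak_reg = l2_logistic_objective([2.0, -1.0], 0.0, X, y, C=25.0)
strong_reg = l2_logistic_objective([2.0, -1.0], 0.0, X, y, C=0.1)
print(weak_reg, strong_reg)
```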
subject to $y_i(w^T x_i + b) \ge 1 - \xi_i$. $C$ is the trade-off parameter between error and margin, and $\xi_i \ge 0$.
The general idea is that the original feature space can always be mapped to some higher-
dimensional feature space where the training data set is separable by a hyperplane. C > 0 is the
penalty parameter of the error term. There are different kernels that can be applied here. Some of
the basic kernels are the following:
● linear
● polynomial
● radial basis function (RBF)
● sigmoid
In general, the RBF kernel is a reasonable first choice. It non-linearly maps samples into a higher
dimensional space and so, it can handle cases where the relation between the predictor variables
and the class labels is nonlinear.
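For reference, the four basic kernels listed above can be written as plain functions of two feature vectors. This is a sketch using widely used textbook forms; the parameter names gamma, coef0 and degree and their default values are assumptions, since the text does not state the kernel settings.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """exp(-gamma * ||u - v||^2): equals 1 for identical points, decays with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), polynomial_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```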
Random forest
Random Forests (RF) are a popular ensemble technique used to build predictive models for both
classification and regression problems. They are among the most successful classifiers not only
because of their accuracy, but also, due to their robustness and ability to work with skewed
datasets across a variety of domains. A study by Fernández-Delgado, Cernadas, Barro, and Amorim
(2014) evaluated 179 classifiers from 17 different families and concluded
that random forests were the best performing classifier among these families. Furthermore,
their performance was significantly better than that of others.
Ensemble methods use multiple ‘base’ learning models to make the final prediction. The
random forest algorithm randomly generates many unpruned, uncorrelated decision trees (Note 4) and
uses them to arrive at the best possible answer. RFs use bagging to generate bootstrap samples
of the training dataset and use the CART method to build trees. Each tree casts a vote for the
classification of a new sample, and the proportion of the votes in each class across the set of trees
is taken as the predicted probability vector.
Assume that the training set
$$D = \{(x_1, y_1), \ldots, (x_n, y_n)\} \qquad (5)$$
is drawn randomly from a (possibly unknown) probability distribution, $(x_i, y_i) \sim (X, Y)$. The goal is to
build a classifier which predicts $y$ from $x$ based on the data set of examples $D$.
Given an ensemble of (possibly weak) classifiers $h = \{h_1(x), \ldots, h_K(x)\}$, if each $h_k(x)$ is a decision
tree, then the ensemble is a random forest. We define the parameters of the decision tree for
classifier $h_k(x)$ to be $\Phi_k = (\phi_{k1}, \phi_{k2}, \ldots, \phi_{kp})$. Typically, this is written as
$$h_k(x) = h(x \mid \Phi_k) \qquad (6)$$
which implies that the decision tree $k$ leads to a classifier $h_k(x) = h(x \mid \Phi_k)$. Here, it is important to
note that the feature $f_i$ that appears in node $k$ of the $j$th tree is randomly chosen from a model variable $\Phi$.
For the final classification $f(x)$ (which combines the classifiers $h_k(x)$), each tree casts a vote for the
most popular class at input $x$, and the class with the most votes wins. More specifically, given
dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, we train a family of classifiers. Each classifier $h_k(x) \equiv h(x \mid \Phi_k)$ is, in our case,
a predictor. The outcome is $y \in \{0, 1\}$, representing a pair of unique decisions (in our case,
non-Schizophrenic and Schizophrenic patient). For generalizations and a more detailed treatment of RFs,
refer to Breiman (2001).
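The voting step that combines the classifiers $h_k(x)$ into $f(x)$ can be sketched in a few lines. The three "trees" below are hypothetical one-split stumps with made-up thresholds, standing in for trained CART trees; bagging and tree construction are deliberately left out.

```python
def forest_predict_proba(trees, x):
    """Proportion of tree votes per class, returned as [P(class 0), P(class 1)]."""
    votes = [tree(x) for tree in trees]
    share_of_ones = sum(votes) / len(votes)
    return [1.0 - share_of_ones, share_of_ones]


def forest_predict(trees, x):
    """Final classification f(x): the class with the most votes."""
    proba = forest_predict_proba(trees, x)
    return 1 if proba[1] > proba[0] else 0


# Hypothetical stumps: each looks at one feature and votes 0 or 1.
trees = [
    lambda x: 1 if x[0] > 0.2 else 0,
    lambda x: 1 if x[1] > 0.0 else 0,
    lambda x: 1 if x[2] > 0.1 else 0,
]

sample = [0.5, -0.1, 0.3]
proba = forest_predict_proba(trees, sample)
label = forest_predict(trees, sample)
print(proba, label)
```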
Feature selection
An important step in machine learning is feature selection. For any type of prediction or classifica-
tion model, only the relevant and most useful features should be used. Feature selection improves
model accuracy while requiring less data (by removing irrelevant and redundant features). If
feature selection is not done, the resulting model will likely become more complex and less
accurate. We used Recursive Feature Elimination and Random Forests to determine feature impor-
tances and used the mean of the scores from these two techniques to proceed with threshold
cut-off.
Figure 1 shows the rankings of all 411 features. A cut-off rank (0.35) was chosen manually by inspecting
this figure. This enabled us to choose all features with importance
greater than 0.35, granting the classifier improved accuracy and speed. The first 10 features sorted
according to their importance are given in Table 1. The selected features (with rank-threshold
= 0.35) are:
[‘FNC295’, ‘FNC244’, ‘SBM_map67’, ‘FNC302’, ‘SBM_map61’, ‘FNC226’, ‘FNC289’, ‘FNC220’,
‘FNC33’, ‘FNC243’, ‘SBM_map36’, ‘FNC183’, ‘FNC37’, ‘FNC48’, ‘FNC78’, ‘FNC61’, ‘FNC297’, ‘FNC333’,
‘SBM_map7’, ‘FNC290’, ‘SBM_map75’, ‘FNC208’, ‘FNC292’, ‘FNC40’, ‘SBM_map17’, ‘FNC265’,
‘FNC171’, ‘FNC189’, ‘FNC353’, ‘FNC62’, ‘FNC185’, ‘FNC13’, ‘FNC337’, ‘FNC5’, ‘FNC30’, ‘FNC68’,
‘FNC150’, ‘FNC211’, ‘FNC293’, ‘FNC328’, ‘FNC89’, ‘FNC106’, ‘FNC165’, ‘FNC221’, ‘SBM_map64’,
‘FNC75’, ‘FNC83’, ‘FNC29’, ‘FNC102’, ‘FNC142’, ‘FNC194’, ‘FNC200’, ‘FNC210’, ‘FNC219’, ‘FNC256’,
‘FNC279’, ‘FNC304’, ‘SBM_map72’]
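The selection rule described in this section (average the two importance scores, then keep everything above the cut-off) can be sketched as follows. The feature names and scores are invented for illustration and are assumed to be on a common 0 to 1 scale; they are not the values behind Figure 1 or Table 1.

```python
def select_features(rfe_scores, rf_scores, threshold=0.35):
    """Average two importance scores per feature and keep those above the threshold."""
    mean_scores = {
        name: (rfe_scores[name] + rf_scores[name]) / 2.0 for name in rfe_scores
    }
    ranked = sorted(mean_scores.items(), key=lambda item: item[1], reverse=True)
    selected = [name for name, score in ranked if score > threshold]
    return selected, mean_scores


# Hypothetical importance scores from the two techniques.
rfe_scores = {"feature_a": 0.9, "feature_b": 0.6, "feature_c": 0.5, "feature_d": 0.1}
rf_scores = {"feature_a": 1.0, "feature_b": 0.7, "feature_c": 0.3, "feature_d": 0.2}

selected, mean_scores = select_features(rfe_scores, rf_scores, threshold=0.35)
print(selected)
```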
A major advantage of deep learning is that one needs neither to identify the significant features
nor to do feature engineering. The architecture learns the features incrementally on its own.
This eliminates the need for domain expertise and explicit feature extraction and feature
engineering (Note 5). Another advantage is that it tends to solve a problem end-to-end whereas, in
a traditional machine learning approach, the problem has to be broken down into smaller, more
manageable pieces and solved with a variety of statistical and probabilistic approaches.
A clear disadvantage of deep learning (as of this writing) is the inability of the architecture to clearly
explain itself. It provides very little justification as to the importance of features and what type of feature
engineering happened behind the scenes. This lack of interpretability is a big reason why many sectors in
business have not yet adopted deep learning. In contrast, traditional machine learning approaches like
linear regression, logistic regression, SVMs, and random forests all provide a clear explanation of why and
how they made a given decision. In addition to this disadvantage, deep learning requires a very large amount of
data to train and is computationally very intensive, hence the need for expensive GPUs and large
infrastructure to support them. Furthermore, it has yet to establish a robust theoretical foundation,
which leads to its next disadvantage: the difficulty of determining the topology, training mechanism, and
hyperparameters of a model. As there is little theory for guidance (as of this date), building a good deep
learning model is less of an engineering feat and more of an artistic endeavour, possibly resulting in
an uninterpretable model.
Keras (Chollet et al., 2015) is a high-level neural-networks API, written in Python and capable of
running on top of TensorFlow, CNTK or Theano. It was developed with a focus on enabling fast
experimentation. We used Keras to build our model.
Architecture
The architecture used was rather simple and can be seen in Figure 2. The input layer has 411
neurons, denoted as dimensions, which are the 411 features of the data set. Note that we do not
use the curated 55 features obtained through feature selection. Instead, we pass the full dimen-
sionality of the data set to the model and allow it to decide which features are the most relevant.
The first hidden layer has 512 neurons, each with a Rectified Linear Unit (Note 6) activation function. The
output layer (the second layer) consisted of just one neuron with a sigmoid activation function.
Sigmoid is used to ensure that the resulting prediction probabilities fall between 0.0 and 1.0. The
loss function used was binary cross-entropy and the optimizer used was ‘adam’ (Adaptive Moment
Estimation), a stochastic, first-order gradient-based optimiser that can be used instead of the
classical stochastic gradient descent procedure to update network weights. Finally, the model was
compiled using the ‘accuracy’ metric. This function is used to judge the performance of the model.
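The loss and the metric named above are simple enough to write out directly. The following is a plain-Python sketch, not the Keras implementation: the function names, the clipping constant eps, the 0.5 decision threshold and the example values are all assumptions made for illustration.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    correct = sum(
        1 for y, p in zip(y_true, y_prob) if (p > threshold) == (y == 1)
    )
    return correct / len(y_true)


# Hypothetical sigmoid outputs for three subjects.
y_true = [1, 0, 1]
y_prob = [0.9, 0.2, 0.4]

loss = binary_cross_entropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(loss, acc)
```

Note that the loss keeps penalizing a correct but unconfident prediction, which is one reason accuracy can sit near 1.0 while the loss still moves.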
Before the training data was fed to the deep neural network model, it was scaled
using a standard scaler. This standardizes the feature values by removing the mean and scaling to
unit variance. We chose a set of initial values for the weights to help the model achieve better and
faster convergence. This was achieved by the Xavier–Glorot method (Glorot & Bengio, 2010). This
helps in mitigating two well-known problems in back-propagation algorithms, namely vanishing
gradients and exploding gradients (though known to occur very rarely).
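Both preprocessing steps can be sketched without any library. The code below standardizes one feature column and draws weights from the uniform variant of Glorot initialization, with limit $\sqrt{6/(n_{in} + n_{out})}$. Whether the uniform or the normal variant was used is not stated in the text, so the uniform form here is an assumption, as are the function names and the small example sizes.

```python
import math
import random


def standardize(column):
    """Remove the mean and scale to unit variance (population variance)."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((v - mean) ** 2 for v in column) / n
    std = math.sqrt(variance) or 1.0  # leave constant columns unscaled
    return [(v - mean) / std for v in column]


def glorot_uniform(fan_in, fan_out, rng):
    """Weights drawn uniformly from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


scaled = standardize([1.0, 2.0, 3.0])
weights = glorot_uniform(4, 3, random.Random(0))
print(scaled)
print(len(weights), len(weights[0]))
```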
Deep neural networks usually contain millions of parameters and hence, it is imperative to attempt to
reduce the parameter space as much as possible. In this context, we make such attempts by regularizing
the weights. It is well known that the quality of the regularization method significantly affects both the
discriminative power and generalization ability of the trained models. To achieve an acceptable level of
generalization, we apply a dropout technique that randomly sets certain neurons’ responses to zero with
a given probability. In our case, we use a 50% dropout rate, which means that, on average, half of the
neurons in any given layer output a zero. This technique, though simple in theory and implementation, has
far-reaching benefits not only in helping to generalize the model, but also in vastly improving the speed
of learning, particularly for very deep networks. This is because each training batch updates only
a subset of the neurons at a time, which conveniently avoids co-adaptation of the learned feature
representations (Krizhevsky, Sutskever, & Hinton, 2012).
We augment the deep architecture, at every layer, with an application of an aggressive dropout
regularization. This boosts the capability of the network to learn in a generalized way and avoids
over-fitting (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014).
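A minimal sketch of what a dropout layer does to a vector of activations is given below. It uses the "inverted" formulation, in which surviving activations are rescaled by 1/(1 - rate) during training so that nothing needs to change at prediction time; that choice, the function name and the example input are assumptions, not details taken from the model described here.

```python
import random


def dropout(activations, rate=0.5, rng=None, training=True):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep_probability = 1.0 - rate
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]


# With a 50% rate, roughly half of the outputs are zero on any given pass.
rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print(zero_fraction)
```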
parameters in all layers. Let $n^{(L)}$ be the number of neurons in layer $L$ of the network, where $n^{(0)}$ is the
number of input features. We can calculate the number of trainable parameters in layer $L$ as
$n^{(L)} (n^{(L-1)} + 1)$, where the additional 1 is for the bias unit in layer $L - 1$. Then, the number of
trainable parameters for layer 1 is $n^{(1)} (n^{(0)} + 1) = 512 \times (411 + 1) = 210{,}944$, as shown in Table 6.
The total number of trainable parameters for the network is the summation of the number of trainable
parameters per layer and is given as $210{,}944 + 131{,}328 + 32{,}896 + 129 = 375{,}297$.
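The arithmetic above can be reproduced with a few lines of code. The layer sizes used here (411 inputs, hidden layers of 512, 256 and 128 neurons, and one output neuron) are the ones reported in this paper; the function name is illustrative.

```python
def trainable_parameters(layer_sizes):
    """Parameters per layer: n_out * (n_in + 1), the +1 accounting for the bias."""
    return [
        n_out * (n_in + 1)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


layer_sizes = [411, 512, 256, 128, 1]
per_layer = trainable_parameters(layer_sizes)
total = sum(per_layer)
print(per_layer, total)
```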
The dropout mechanisms applied in each layer significantly reduced the computation time and
improved generalization. We trained several deep learning models with different numbers of
hidden layers, dropout rates, and activation functions. After a considerable number of experiments,
we settled on a depth of three hidden layers, the number of nodes in each hidden layer (512,
256 and 128) and a dropout rate of 50%. We then ran this general framework on the training
dataset with different batch sizes and epochs. Each run on the training dataset was also evaluated
by a validation set (of 20%) to validate the effectiveness of the classifier.
The final model’s results are visualized in Figure 3. The accuracy for the training set stays more
or less at 100% while that of the testing set is close, reaching a steady state at 94.444%. This
result was achieved at the 19th epoch with a batch size of 45. The best loss (computed by binary
cross-entropy) for the training dataset was 4.5488e-06, while that of the validation set was less than
0.28. During the other stages of training, the accuracy and loss, though promising, did not reach
this level.
During experiments with various epochs and batch sizes, as seen in Figures 4 and 5, although the
accuracy tends to be close to 1.0, the validation loss tends to vary at values rather higher than 0.25. It is to be
noted that there is a fine balance between the number of epochs and batch sizes. Typically,
a larger batch size is preferred for good normalization. The greater the mini-batch size, the lower
the variance between mini-batches. Smaller batch sizes typically result in finer gradient
descent steps, leading to higher latency in convergence. Larger batch sizes lead to executing
fewer gradient descent steps, hence the need to train for more epochs. In most cases though, the
accuracy will not differ drastically. A comparison of accuracy between all machine learning
methods discussed above can be seen in Table 5.
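The trade-off between batch size and the number of gradient descent steps is easy to quantify. The sketch below counts weight updates for the three batch sizes shown in Figures 3 to 5 at 19 epochs. The figure of 68 training observations is an assumption (86 observations with 20% held out for validation, rounded); the function name is illustrative.

```python
import math


def gradient_steps(n_samples, batch_size, epochs):
    """Total weight updates: mini-batches per epoch times number of epochs."""
    return math.ceil(n_samples / batch_size) * epochs


n_train = 68  # assumed: 86 observations minus a 20% validation hold-out
steps = {
    batch_size: gradient_steps(n_train, batch_size, epochs=19)
    for batch_size in (45, 35, 25)
}
print(steps)
```

With a dataset this small, the step counts differ only modestly across these batch sizes, which is consistent with the observation that accuracy does not change drastically.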
Limitations
This work has limitations from several fronts. An immediately noticeable one is the size of the
training dataset, consisting of only 86 observations as compared to the test dataset that has 119,748
observations. This deficiency in training data will directly impact any classifier’s performance.
Figure 3. Accuracy and loss plots for batch-size = 45 and number-of-epochs = 19.
Figure 4. Accuracy and loss plots for batch-size = 35 and number-of-epochs = 19.
Figure 5. Accuracy and loss plots for batch-size = 25 and number-of-epochs = 19.
However, we did not find class imbalance (Note 7). Within the given training dataset, we are not sure about
the distribution of the kinds of observations (i.e., the distribution of the relationships among the
features). It would have been preferable to have orthogonal feature values among the observations and
the types of relationships, which would result in a rich set of data despite the low cardinality
of the training set. In addition to these data issues, the lack of labels for the test dataset prevented
us from performing several further statistical analyses of the model’s performance (such as ROC curve,
confusion matrix, F-1 score, etc.).
On the modelling front, particularly when building a deep learning architecture, again, the small size
of the training dataset prevented us from exploring many options in hyper-parameter tuning. For
experiments with deep neural networks, one typically needs a very large set of training data. This not
only allows the investigator to search the parameter space in a fairly comprehensive way (Note 8), but also
enables the investigator to generalize the model well (otherwise, there is a risk of overfitting). Because of the
small training data set, we had to settle for rather uncomfortable numbers for epochs and batch size.
Recent research in training deep neural networks with very small training sets (called ‘One-Shot
Learning’) looks promising (Lake, Salakhutdinov, & Tenenbaum, 2015; Vinyals, Blundell, Lillicrap,
Kavukcuoglu, & Wierstra, 2016).
dropout rates for specific sets of neurons in each layer. This could be done in either a deterministic
or a stochastic way. We believe that training the network with such new methods of regularization
has the potential to improve the performance after several epochs.
We also plan to apply deep architectures with a particular use of optimized kernel machines to
enable dealing with problems (and datasets) with limited data and prior knowledge of the relationships
in data. Enabling architectures with data-specific kernels or kernel compositions (like multiple kernel
learning), we believe, would lead to new and powerful architectures and inference strategies.
Finally, as mentioned in Section 4.1, the lack of very large amounts of training data encourages
us to experiment with One-Shot approaches.
Notes
1. Here, in a probabilistic classifier, a false-positive is seen as the classifier generating a ‘high’ value (say > 0.7) for
a subject, whereas, in reality, the subject does not have SCZ. The threshold can typically be determined by
empirical analysis. We chose a probabilistic output instead of a clear binary output to enable the
end-user/radiologist to make an informed and collective decision, rather than have the algorithm make a firm decision.
2. Funded by a Center of Biomedical Research Excellence (COBRE) grant 5P20RR021938/P20GM103472 from the
NIH to Dr Vince Calhoun (Cetin et al., 2014).
3. Stochastic Average Gradient descent: it allows a model to be trained much faster than with other solvers when the data
are very large. To use it effectively, the features must be scaled.
4. Also called weak learners, which have a minimum accuracy of > 0.5.
5. We say this with caution. Researchers still do not have a clear idea how it learns features, what (good and bad
things) it learns, how fast it learns and if it forgets anything.
6. ReLU is a non-linear function which is essentially a half-wave rectifier, $f(z) = \max(z, 0)$. ReLU typically learns
faster than $\tanh(z)$ or $1/(1 + e^{-z})$ when there are many layers present in the network.
7. Forty observations out of 86 had a class of ‘1’, which is a good 46.51% of the training set.
8. Randomized search, as opposed to grid search and exhaustive search.
Disclosure statement
No potential conflict of interest was reported by the authors.
References
Allen, E. A., Erhardt, E. B., Damaraju, E., Gruner, W., Segall, J. M., Silva, R. F., . . . Michael, A. M. (2011). A baseline for the
multivariate comparison of resting-state networks. Frontiers in Systems Neuroscience, 5, 2.
Amin, J., Sharif, M., Yasmin, M., & Fernandes, S. L. (2018). Big data analysis for brain tumor detection: Deep
convolutional neural networks. Future Generation Computer Systems, 87, 290-297.
Belhumeur, P. N. (1997). Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 19(7), 711–720.
Bois, C., Whalley, H., McIntosh, A., & Lawrie, S. (2015). Structural magnetic resonance imaging markers of susceptibility
and transition to schizophrenia: A review of familial and clinical high risk population studies. Journal of
Psychopharmacology, 29(2), 144–154.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the
fifth annual workshop on computational learning theory (pp. 144–152). New York, NY: ACM.
Breiman, L. (2001, Oct 1). Random forests. Machine Learning, 45(1), 5–32.
Cetin, M., Christensen, F., Abbott, C., Stephen, J., Mayer, A., Canive, J., . . . Calhoun, V. D. (2014). Thalamus and posterior
temporal lobe show greater inter-network connectivity at rest and across varying sensory loads in schizophrenia.
Neuroimage, 97, 117–126.
Chollet, F. (2015). Keras. Available from: https://keras.io.
Cortes, C., & Vapnik, V. (1995, Sept 1). Support-vector networks. Machine Learning, 20(3), 273–297.
Cramer, J. (n.d.). The origins of logistic regression. Tinbergen Institute Working Paper No. 2002-119/4, Tinbergen
Institute.
Demirci, O., & Calhoun, V. D. (2009). Functional magnetic resonance imaging–Implications for detection of schizo-
phrenia. European Neurological Review, 4(2), 103.
Erhardt, E. B., Rachakonda, S., Bedrick, E. J., Allen, E. A., Adali, T., & Calhoun, V. D. (2011). Comparison of multi-subject
ICA methods for analysis of fMRI data. Human Brain Mapping, 32(12), 2075–2095.
Fernandes, S. L., & Bala, G. J. (2016). Fusion of sparse representation and dictionary matching for identification of
humans in uncontrolled environment. Computers in Biology and Medicine, 76, 215–237.
Fernández-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real
world classification problems? Journal of Machine Learning Research, 15, 3133–3181.
Glorot, X., & Bengio, Y. (2010, May). Understanding the difficulty of training deep feedforward neural networks. In
JMLR W&CP: Proceedings of the thirteenth international conference on artificial intelligence and statistics (aistats 2010)
(Vol. 9, pp. 249–256). Chia Laguna Resort, Sardinia, Italy: PMLR.
Han, S., Huang, W., Zhang, Y., Zhao, J., & Chen, H. (2017, Dec 6). Recognition of early-onset schizophrenia using
deep-learning method. Applied Informatics, 4(1), 16.
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (Wiley series in probability and statistics) (2nd ed.).
New York, NY: Wiley-Interscience Publication.
Hsieh, T.-H., Sun, M.-J., & Liang, S.-F. (2014). Diagnosis of schizophrenia patients based on brain network complexity
analysis of resting-state fMRI. In The 15th international conference on biomedical engineering (pp. 203–206). New
York, NY: Springer, Cham, 2014.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In
Proceedings of the 25th international conference on neural information processing systems - volume 1 (pp. 1097–1105).
Red Hook, NY: Curran Associates Inc.
Kuenen, M. P., Mischi, M., & Wijkstra, H. (2011). Contrast-ultrasound diffusion imaging for localization of prostate
cancer. IEEE Transactions on Medical Imaging, 30(8), 1493–1502.
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015, Dec 11). Human-level concept learning through probabilistic
program induction. Science, 350, 1332–1338.
Lecun, Y., Bengio, Y., & Hinton, G. (2015, May 27). Deep learning. Nature, 521(7553), 436–444.
Martis, R. J., Lin, H., Gurupur, V. P., & Fernandes, S. L. (2017). Editorial: Frontiers in development of intelligent
applications for medical imaging processing and computer vision. Computers in Biology and Medicine, 89, 549–550.
Mihalopoulos, C., Harris, M., Henry, L., Harrigan, S., & McGorry, P. (2009). Is early intervention in psychosis cost-effective
over the long term? Schizophrenia Bulletin, 35(5), 909–918.
Mikolajczyk, K., & Schmid, C. (2005). A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 27(10), 1615–1630.
Owen, M. J., Sawa, A., & Mortensen, P. B. (2016, July 2). Schizophrenia. The Lancet, 388(10039), 86–97.
Patel, P., Aggarwal, P., & Gupta, A. (2016). Classification of schizophrenia versus normal subjects using deep learning.
Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing, 28, 1–6. New York,
NY: ACM.
Segall, J., Allen, E., Jung, R., Erhardt, E., Arja, S., Kiehl, K., & Calhoun, V. (2012). Correspondence between structure and
function in the human brain at rest. Frontiers in Neuroinformatics, 6, 10.
Silva, R., Castro, E., Gupta, C., Cetin, M., Arbabshirani, M., Potluru, V., . . . Calhoun, V. (2014). The tenth annual MLSP
competition: Schizophrenia classification challenge. IEEE Computer Society.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent
neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
Vapnik, V. N. (1995). The nature of statistical learning theory. Berlin, Heidelberg: Springer-Verlag.
Vapnik, V. N. (1998). Statistical learning theory. New York, NY: Wiley-Interscience.
Vemuri, P., & Jack, C. R. (2010, Aug 31). Role of structural MRI in Alzheimer’s disease. Alzheimer’s Research & Therapy,
2(4), 23.
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching networks for one shot learning. In
D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing
systems 29 (pp. 3630–3638). Red Hook, NY: Curran Associates, Inc.
Yang, H.-F., Lin, K., & Chen, C.-S. (2018). Supervised learning of semantics-preserving hash via deep convolutional
neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2), 437–451.
Zeng, L.-L., Wang, H., Hu, P., Yang, B., Pu, W., Shen, H., . . . Wang, K. (2018). Multi-site diagnostic classification of
schizophrenia using discriminant deep learning with functional connectivity MRI. EBioMedicine, 30, 74–85.