OPENING PRAYER
May God the Father bless us. May God the Son heal us. May God the Holy Spirit enlighten
us, and give us eyes to see with, ears to hear with, hands to do the work of God with, feet
to walk with, a mouth to preach the word of salvation with, and the angel of peace to watch
over us and lead us at last, by our Lord's gift, to the Kingdom. Amen.
Understanding Neural Networks
o From biological to artificial neurons
o Activation functions
o Network topology
o The direction of information travel
o Training neural networks with back propagation
(1) describe Neural Networks as analogues of biological neurons,
(2) develop hands-on a neural net that can be trained to compute the square-root function,
(3) describe support vector machine (SVM) classification, and
(4) complete several case-studies, including optical character recognition (OCR), the Iris
flowers, Google Trends and the Stock Market, and Quality of Life in chronic disease.
MODULE INTRODUCTION AND FOCUS QUESTION(S):
In this chapter, we will discuss strategies to import data and export results. Also, we are going to learn the
basic tricks we need to know about processing different types of data. Specifically, we will illustrate
common R data structures and strategies for loading (ingesting) and saving (regurgitating) data.
Pretest
To further gauge your level of understanding and where you currently stand in this topic, please answer the
following pre-test questions honestly. Take note of the items that you were not able to correctly answer and
look for the right answer as you go through this module.
1. Understanding Regression
Does the formula y = a + bx appear familiar to you? In this slope-intercept formula, a is our intercept
while b is the slope. That is the expression for simple linear regression. If we know a and b, for any
given x we can calculate y via the above formula. If we plot x and y in a coordinate system, we will
have a straight line.
However, this is the ideal case. When we plot real-world data, the pattern is harder to recognize.
Let’s look at the scatter plot (recall Chapter 2) and simple linear regression line of two variables:
“hospital charges” or CHARGES (dependent variable) and length of stay in the hospital or LOS
(predictor). The data can be found in our class files, CaseStudy12_AdultsHeartAttack_Data. We
removed two observations that have missing data using the command
heart_attack<-heart_attack[complete.cases(heart_attack), ].
It seems to be common sense that the longer you stay in the hospital, the higher the medical costs
will be. However, on the scatter plot, we have only a bunch of dots showing a slight sign of an
increasing pattern.
It is simple to make predictions with this regression line. Assume we have a patient who spent 10
days in hospital, so LOS=10. The predicted charge is
$4582.70+$212.29×10=$6705.60. Plugging x into the regression equation directly gives us an
estimated value of the outcome y. This chapter of the Probability and statistics EBook provides an
introduction to linear modeling.
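The prediction above can be reproduced in a few lines of R. This is a minimal sketch that hard-codes the intercept and slope quoted in the text; the helper name predict_charge is hypothetical and not part of the original analysis.

```r
# Coefficients quoted in the text for the fitted line CHARGES = a + b * LOS
a <- 4582.70   # intercept (dollars)
b <- 212.29    # slope (dollars per additional day of stay)

# Hypothetical helper: predicted charge for a given length of stay (in days)
predict_charge <- function(los) a + b * los

# Predicted charge for a 10-day stay: 4582.70 + 212.29 * 10
predict_charge(10)
```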
2.2 Ordinary least squares estimation
How did we get the estimated expression? The most common estimating method in statistics is
ordinary least squares (OLS). OLS estimators are obtained by minimizing the sum of the squared
errors, that is, the sum of squared vertical distances from each dot on the scatter plot to the
regression line.
OLS is minimizing the following formula:
After some statistical calculations, our value b with the minimum squared error is:
While the optimal a is:
Combining the above, we get an estimate of the slope coefficient (effect-size of LOS on Charge):
We can see that this is exactly the same as previously stated expression.
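As a hedged illustration of the OLS estimates, the sketch below uses the standard closed-form results b = Cov(x, y) / Var(x) and a = mean(y) - b * mean(x) and checks them against lm(). The built-in cars dataset is used here only as a stand-in (an assumption), since the heart attack data file is not bundled with this module.

```r
# Stand-in data (assumption): built-in 'cars' instead of the heart attack data
x <- cars$speed
y <- cars$dist

# Closed-form OLS estimates for simple linear regression
b <- cov(x, y) / var(x)        # slope
a <- mean(y) - b * mean(x)     # intercept

# The same estimates from R's linear model function
fit <- lm(dist ~ speed, data = cars)

c(intercept = a, slope = b)
coef(fit)
```

Both calls should report the same intercept and slope, which is the agreement the text refers to.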
Linear relationship
Multivariate normality,
No or little multicollinearity,
No auto-correlation, independence,
Homoscedasticity
2.4 Correlations
Note: The SOCR Interactive Scatterplot Game (requires Java enabled browser) provides a dynamic
interface demonstrating linear models, trends, correlations, slopes and residuals.
Based on covariance we can calculate correlation, which indicates how closely the relationship
between two variables follows a straight line.
In R, correlation is given by cor() while square root of variance or standard deviation is given by sd().
Same outputs are obtained. This correlation is a positive number that is relatively small. We can say
there is a weak positive linear association between these two variables. If we have a negative number
then it is a negative linear association. We have a weak association when 0.1≤Cor<0.3, a moderate
association for 0.3≤Cor<0.5, and a strong association for 0.5≤Cor≤1.0. If the correlation is below 0.1
then it suggests little to no linear relation between the variables.
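The relationship between covariance, standard deviation and correlation can be checked directly. The sketch below again uses the built-in cars data as a stand-in (an assumption), and the helper strength() is a hypothetical function that simply encodes the cut-offs listed above.

```r
x <- cars$speed
y <- cars$dist

# Correlation computed by hand from covariance and standard deviations
r_manual  <- cov(x, y) / (sd(x) * sd(y))
# Correlation computed by the built-in function
r_builtin <- cor(x, y)

# Hypothetical helper encoding the rule-of-thumb cut-offs from the text
strength <- function(r) {
  r <- abs(r)
  if (r < 0.1) "little to none"
  else if (r < 0.3) "weak"
  else if (r < 0.5) "moderate"
  else "strong"
}

c(manual = r_manual, builtin = r_builtin)
strength(r_builtin)
```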
In practice, we usually have more situations with multiple predictors and one dependent variable,
which may follow a multiple linear model. That is:
or equivalently
We usually use the second notation method in statistics. This equation shows the linear relationship
between k predictors and a dependent variable. In total we have k+1 coefficients to estimate.
The matrix notation for the above equation is:
Where
And
Similar to simple linear regression, our goal is to minimize the sum of squared errors. After solving
for β, we get:
The solution is in matrix form, where the superscript −1 denotes the inverse of a matrix and XT is
the transpose of the matrix X.
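A minimal sketch of such a function, assuming the standard least squares solution β = (XT X)−1 XT y, is shown below. The function name ols_beta is hypothetical, and the built-in mtcars data stand in for the heart attack dataset.

```r
# Hypothetical helper: least squares coefficients via the matrix formula
ols_beta <- function(y, X) {
  X <- cbind(Intercept = 1, as.matrix(X))   # prepend a column of ones
  solve(t(X) %*% X) %*% t(X) %*% y          # (X'X)^(-1) X'y
}

# Stand-in example (assumption): predict mpg from weight and horsepower
beta <- ols_beta(mtcars$mpg, mtcars[, c("wt", "hp")])
fit  <- lm(mpg ~ wt + hp, data = mtcars)

beta
coef(fit)
```

In practice lm() is preferable because it uses a numerically more stable decomposition; the explicit inverse is used here only to mirror the formula.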
Next, we will apply this function to our heart attack dataset. To begin with, let’s check if the simple
linear regression output is the same as we calculated earlier.
It works! Then, we can include more variables as predictors. As an example, we just add age into the
model.
3 Case Study 1: Baseball Players
We utilize the mlb data “01a_data.txt”. The dataset contains 1034 records of heights and weights for
some current and recent Major League Baseball (MLB) Players. These data were obtained from
different resources (e.g., IBM Many Eyes).
Variables:
Let’s load this dataset first. We use as.is=T to make non-numerical vectors into characters. Also, we
delete the Name variable because we don’t need players’ names in this case study.
By looking at the str() output we notice that the variables TEAM and Position are misspecified as
characters. To fix this we can use the function as.factor(), which converts numerical or character
vectors to factors.
Applying ggpairs() to obtain a compact dataset summary, we can mark heavy weight and light weight
players (according to light<median<heavy) by different colors in the plot:
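The conversion can be sketched on a tiny made-up table. The rows and column names below are hypothetical placeholders, not records from the actual MLB file.

```r
# Hypothetical miniature table; values are made up for illustration only
mlb <- data.frame(
  Team     = c("AAA", "AAA", "BBB"),
  Position = c("Catcher", "Outfielder", "Catcher"),
  Height   = c(74, 72, 75),
  Weight   = c(180, 215, 210),
  stringsAsFactors = FALSE
)

str(mlb)   # Team and Position are reported as chr

# Convert the character columns to factors
mlb$Team     <- as.factor(mlb$Team)
mlb$Position <- as.factor(mlb$Position)

str(mlb)   # Team and Position are now factors with 2 levels each
```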
Next, we may also mark player positions by different colors in the plot.
What about our potential predictors?
Here we have two numerical predictors, two categorical predictors and 1034 observations. Let’s see
how R treats these three different classes of variables.
1 Understanding Neural Networks
An Artificial Neural Network (ANN) model mimics the biological brain response to multisource
(sensory-motor) stimuli (inputs). An ANN simulates the brain using a network of interconnected
neuron cells to create a massive parallel processor. Of course, it uses a network of artificial nodes,
not brain cells, to learn from the training data.
When we have three signals (or inputs) x1, x2 and x3, the first step is weighting the features (w’s)
according to their importance. Then, the weighted signals are summed by the “neuron cell” and this
sum is passed on according to an activation function denoted by f. The last step is generating an
output y at the end of the process. A typical output will have the following mathematical relationship to
the inputs.
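The steps above can be sketched as a single function, assuming the usual form y = f(w1·x1 + w2·x2 + w3·x3). The function name neuron and the example weights are hypothetical.

```r
# Hypothetical single artificial neuron: weight the inputs, sum them,
# then pass the sum through an activation function f
neuron <- function(x, w, f = identity) f(sum(w * x))

x <- c(1, 2, 3)           # three input signals
w <- c(0.5, -0.25, 0.1)   # their weights (made-up values)

# With the identity activation the output is the weighted sum:
# 0.5 * 1 - 0.25 * 2 + 0.1 * 3 = 0.3
neuron(x, w)
```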
One of the functions is known as threshold activation function that results in an output signal once a
specified input threshold has been attained.
Other activation functions might also be useful:
Basically, we can choose a proper activation function based on the corresponding codomain of the
function. For example, with the hyperbolic tangent activation function, the outputs can only range
from -1 to 1 regardless of the input. With a linear function we can go from −∞ to +∞. A
Gaussian activation function will give us a model called a Radial Basis Function network.
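The output ranges of these activation functions can be checked numerically. In the sketch below the threshold is placed at 0 and the Gaussian uses unit width; both are assumptions made for illustration.

```r
# Common activation functions (threshold at 0 and unit width are assumptions)
threshold <- function(x) ifelse(x >= 0, 1, 0)
sigmoid   <- function(x) 1 / (1 + exp(-x))
linear    <- function(x) x
gaussian  <- function(x) exp(-x^2 / 2)
# tanh() is already built into R

z <- seq(-10, 10, by = 0.5)

# Minimum and maximum output of each function over the grid z
sapply(
  list(threshold = threshold, sigmoid = sigmoid, tanh = tanh,
       linear = linear, gaussian = gaussian),
  function(f) range(f(z))
)
```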
1.3 Network topology
The x’s or features in the dataset are called input nodes, while the predicted values are called the
output nodes. Multilayer networks include multiple hidden layers. The following graph represents a
two-layer neural network:
The arrows in the last graph (with multiple layers) suggest a feed-forward network. In such a network,
we can also have multiple outcomes modeled simultaneously.
Alternatively, in a recurrent network (feedback network), information can also travel backwards in
loops (or delay). This is illustrated in the following graph.
This short-term memory increases the power of recurrent networks dramatically. However, in
practice, recurrent networks are seldom used.
1.5 The number of nodes in each layer
The numbers of input nodes and output nodes are predetermined by the dataset and the predictive
variables. The number we can change is the number of hidden nodes in the model. Our goal is to use
as few hidden nodes as possible, keeping the model simple as long as its performance is acceptable.
This algorithm determines the weights in the model using a strategy of back-propagating errors.
First, we assign random numbers for weights (but all weights must be non-trivial, i.e., ≠0).
For example, we can use a normal distribution, or any other random process, to assign initial weights.
Then we adjust the weights iteratively, repeating the process until a convergence or
stopping criterion is met. Each iteration contains two phases.
Forward phase: from input layer to output layer using current weights. Outputs are produced at
the end of this phase, and
Backward phase: compare the outputs and true target values. If the difference is significant,
we change the weights and go through the forward phase, again.
In the end, we pick a set of weights, corresponding to the least total error, to be the weights in our
network.
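The forward and backward phases can be sketched for the simplest possible case. The example below is a deliberate simplification, not full multilayer back propagation: it trains a single sigmoid neuron by gradient descent to reproduce the logical OR of two inputs. The learning rate, tolerance and iteration cap are arbitrary choices made for illustration.

```r
set.seed(1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Inputs (with a bias column of ones) and targets for logical OR
X <- cbind(bias = 1, x1 = c(0, 0, 1, 1), x2 = c(0, 1, 0, 1))
y <- c(0, 1, 1, 1)

w    <- rnorm(ncol(X), sd = 0.5)   # random, non-zero starting weights
rate <- 0.5                        # learning rate (arbitrary choice)
sse  <- numeric(0)                 # error recorded at each iteration

for (iter in 1:5000) {
  out <- sigmoid(X %*% w)                    # forward phase: current outputs
  err <- out - y                             # compare outputs with targets
  sse <- c(sse, sum(err^2))
  if (sum(err^2) < 0.01) break               # stopping criterion
  grad <- t(X) %*% (err * out * (1 - out))   # backward phase: error gradient
  w <- w - rate * grad                       # adjust the weights
}

pred <- round(sigmoid(X %*% w))
cbind(X[, c("x1", "x2")], target = y, predicted = as.vector(pred))
```

The vector sse records the total squared error at each iteration, so plotting it shows how the error shrinks as the weights are adjusted.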
In this case study, we are going to use the Google trends and stock market dataset. A doc file with
the meta-data and the CSV data are available on the Case-Studies Canvas Site. These daily data
(between 2008 and 2009) can be used to examine the associations between Google search trends
and the daily market index, the Dow Jones Industrial Average.
2.1.1 Variables
Here we use the RealEstate as our dependent variable. Let’s see if Google Real Estate Index could
be predicted by other variables in the dataset.
Let’s delete the first two columns, since the only goal is to predict Google Real Estate Index with
other indexes and DJI.
As we can see from the structure of the data, these indexes and DJI have different ranges. We
should rescale the data. In Chapter 6, we learned that normalizing these features using our own
normalize() function could fix the problem. We can use lapply() to apply the normalize() function to
each column.
The last step clearly normalizes all feature vectors into the 0 to 1 range.
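A minimal sketch of this rescaling step is shown below, assuming the usual min-max definition of normalize(). The small data frame holds made-up numbers and only borrows two column names from the case study.

```r
# Min-max normalization: maps the smallest value to 0 and the largest to 1
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Made-up values on very different scales (for illustration only)
google <- data.frame(
  Unemployment = c(1.5, 1.2, 0.9, 1.1),
  DJI_Index    = c(12000, 9000, 8000, 10500)
)

# lapply() applies normalize() to every column; rebuild a data frame afterwards
google_norm <- as.data.frame(lapply(google, normalize))
summary(google_norm)
```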
The next step would be separating our google dataset into training and test subsets. This time we will
use the sample() and floor() functions to separate the training and test datasets. sample() is a
function that creates a set of indicators for row numbers. We can subset the original dataset with
random rows using these indicators. floor() takes a number x and returns the largest integer that is
not greater than x.
sample(row, size)
row: rows in the dataset that you want to select from. If you want to select all the rows, you can
use nrow(data) (a single number) or 1:nrow(data) (a vector).
size: the number of rows to draw for the subset.
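The split can be sketched as follows. The 75% training fraction and the simulated data frame dat are assumptions made for illustration.

```r
set.seed(2017)

# Simulated stand-in for the normalized dataset
dat <- data.frame(id = 1:100, value = rnorm(100))

# Number of training rows: floor() rounds down to a whole number
train_size <- floor(0.75 * nrow(dat))

# Randomly chosen row numbers for the training set
train_rows <- sample(1:nrow(dat), size = train_size)

train <- dat[train_rows, ]    # selected rows
test  <- dat[-train_rows, ]   # all remaining rows

c(train = nrow(train), test = nrow(test))
```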
Here, we use the function neuralnet() in package neuralnet. neuralnet returns a NN object containing:
Similar to the predict() function that we have mentioned in previous chapters, compute() is an
alternative method that could help us to generate the model predictions.
p<-compute(m, test)
In our model we picked Unemployment, Rental, Mortgage, Jobs, Investing, DJI_Index, StdDJI as our
predictors. So we need to find these corresponding column numbers in the test dataset (1, 2, 4, 5, 6,
7, 8 respectively).
As mentioned in Chapter 9, we can still use the correlation between predicted results and observed
Real Estate Index to evaluate the algorithm. A correlation over 0.9 is very good for real world
datasets. Could this be improved further?
This time we managed to include 4 hidden nodes in the model. Let’s see what results we can get
from this more complicated model.
Although the graph looks very complicated, we have a smaller Error, or sum of squared errors.
Neural networks can be used for both classification and regression, which we will see in the next
part. Let’s first try regression.
2.6 Step 6 - adding additional layers
We observe an even lower Error by using three hidden layers with 4, 3, and 3 nodes, respectively.
This simple example demonstrates the foundation of the neural network prediction of a basic
mathematical function, the square root, √ : R+ ⟶ R+.
We observe that the NN, net.sqrt, actually learns the square root function and predicts its values
quite closely. Of course, everyone’s results may vary, as we randomly generate the training data
(rand_data) and the NN construction (net.sqrt) is also stochastic.
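A hedged sketch of the square-root experiment follows. It assumes the neuralnet package is installed and skips the example otherwise; the training range [0, 1], the 10 hidden nodes and the threshold of 0.1 are illustrative choices rather than settings confirmed by this module.

```r
if (requireNamespace("neuralnet", quietly = TRUE)) {
  library(neuralnet)
  set.seed(1234)

  # Random training inputs in [0, 1] and their true square roots
  rand_data <- abs(runif(1000))
  sqrt_df   <- data.frame(rand_data, sqrt_data = sqrt(rand_data))

  # One hidden layer with 10 nodes (illustrative settings)
  net.sqrt <- neuralnet(sqrt_data ~ rand_data, data = sqrt_df,
                        hidden = 10, threshold = 0.1)

  # Compare network predictions with the true square roots on a grid
  test_data <- data.frame(rand_data = seq(0, 1, by = 0.1))
  pred_sqrt <- compute(net.sqrt, test_data)$net.result
  print(cbind(test_data,
              truth     = sqrt(test_data$rand_data),
              predicted = as.vector(pred_sqrt)))
}
```

Because both the data and the fitting are random, the printed predictions differ between runs unless the seed is fixed.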
In practice, NN may be more useful as a classifier. Let’s demonstrate this by using again the Stock
Market data. We mark the samples according to their RealEstate values. Those above the 75th
percentile get label 0; those below the 25th percentile get label 2; all others get label 1.
Even in the classification setting, the response must still be numeric.
Here, we divide the data into training and testing sets. We need 3 more column indicators, which
correspond to the 3 outcome labels.
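The labelling rule and the three indicator columns can be sketched as below. The simulated RealEstate values and the column names cluster, High, Median and Low are hypothetical stand-ins.

```r
set.seed(1)

# Simulated stand-in for the normalized RealEstate index
google <- data.frame(RealEstate = runif(200))

# 25th and 75th percentiles
q <- quantile(google$RealEstate, c(0.25, 0.75))

# Label 0 above the 75th percentile, 2 below the 25th, otherwise 1
google$cluster <- ifelse(google$RealEstate > q[2], 0,
                  ifelse(google$RealEstate < q[1], 2, 1))

# One numeric indicator column per outcome label
google$High   <- as.numeric(google$cluster == 0)
google$Median <- as.numeric(google$cluster == 1)
google$Low    <- as.numeric(google$cluster == 2)

table(google$cluster)
```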
We use non-linear output and display every 2,000 iterations.
Below is the prediction function translating this model to the forecasting results.
Now let’s inspect the structure of the Neural Network.
Similarly, we can change hidden to utilize multiple hidden layers. However, a more complicated model
won’t necessarily guarantee improved performance.
The easiest shape would be a plane. Support Vector Machine (SVM) can use a hyperplane to
separate data into several groups or classes. This is used for datasets that are linearly separable.
Assume that we have only two features, will you use A or B hyperplane to separate the data? Or even
another plane C?
Finding the maximum margin
To answer the above question, we need to search for the Maximum Margin Hyperplane (MMH).
That is the hyperplane that creates the greatest separation between the two closest observations.
We define support vectors as the points from each class that are closest to the MMH. Each class must
have at least one observation as a support vector.
Using support vectors alone is not sufficient for finding the MMH. Although tricky mathematical
calculations are involved, the principle of the process is fairly simple. Let’s look at linearly separable
data and non-linearly separable data individually.
If the dataset is linearly separable, we can find the outer boundaries of our two groups of data points.
These boundaries are called convex hulls (red lines in the following graph). The MMH (black solid line)
is just the line that is perpendicular to the shortest line between the two convex hulls and bisects it.
An alternative way would be picking two parallel planes that separate the data into two groups
while the distance between the two planes is as large as possible.
We can use vector notation to mathematically define planes. In n-dimensional space, a plane could
be expressed by the following equation:
where the vectors w⃗ (weights) and x⃗ (unknowns) both have n coordinates and b is a (scalar)
constant. To clarify this notation let’s look at the situation in a 3D space where we can express
(embed) 2D Euclidean planes using a point (xo, yo, zo) and normal-vector (a, b, c) form. This is just a
linear equation, where d=−(axo+byo+czo):
or equivalently
We can see that it is equivalent to the vector notation
And
We require that all of the observations in the first class fall above the first plane and all observations
in the other class fall below the second plane.
where ∥w⃗ ∥ is the Euclidean norm. To maximize the distance we need to minimize the Euclidean
norm.
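For reference, the separable case is commonly written as follows. This is the standard textbook formulation, assuming class labels y_i in {+1, -1} and the sign convention w·x + b; the module's own notation may differ.

```latex
% Separating hyperplane and the two parallel margin planes
\vec{w}\cdot\vec{x} + b = 0, \qquad
\vec{w}\cdot\vec{x} + b \ge +1 \ \text{(first class)}, \qquad
\vec{w}\cdot\vec{x} + b \le -1 \ \text{(second class)}

% Distance between the two margin planes
\text{margin} = \frac{2}{\lVert \vec{w} \rVert}

% Maximizing the margin is equivalent to
\min_{\vec{w},\, b} \ \frac{1}{2}\lVert \vec{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\left(\vec{w}\cdot\vec{x}_i + b\right) \ge 1, \quad \forall i
```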
For each nonlinear programming problem, the primal problem, there is a related nonlinear
programming problem, the Lagrangian dual problem. Under certain assumptions for convexity and
suitable constraints, the primal and dual problems have equal optimal objective values. Primal
optimization problems are typically described as:
Then the Lagrangian dual problem is defined as a parallel nonlinear programming problem:
To optimize that objective function, we can set the partial derivatives equal to zero:
For non-linearly separable data, we need to use a small trick. Still, we use a plane but allow some of
the points to be misclassified into the wrong class. To penalize for that, we add a cost term after the
Euclidean norm function that we need to minimize.
Therefore, the solution will optimize the following objective (cost) function:
where C is the cost and ξi is the distance between the misclassified observation i and the plane.
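In the usual soft-margin notation this objective reads as below. It is the standard textbook form, again assuming labels y_i in {+1, -1} and the sign convention w·x + b, so it may differ in detail from the module's original expression.

```latex
\min_{\vec{w},\, b,\, \xi} \ \frac{1}{2}\lVert \vec{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\left(\vec{w}\cdot\vec{x}_i + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad \forall i
```

A larger C penalizes misclassified points more heavily and therefore yields a narrower margin.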
Similar to what we did above for the separable case, we can use the derivatives of the primal problem
to solve the dual problem.
Notice the inner product in the final expression. We can replace this inner product with a kernel
function that maps the feature space into a higher dimensional space (e.g., using a polynomial kernel)
or an infinite dimensional space (e.g., using a Gaussian kernel).
An alternative way to handle non-linearly separable data is called the kernel trick. The idea is to add
some dimensions (or features) so that the data become separable in a higher
dimensional space.
How can we do that? We transform our data using kernel functions. A general form for kernel
functions would be:
where ϕ is a mapping of the data into another space.
The linear kernel would be the simplest one that is just the dot product of the features.
The polynomial kernel of degree d transforms the data by adding a simple non-linear transformation of
the data.
The sigmoid kernel is very similar to a neural network. It uses a sigmoid activation function.
The Gaussian RBF kernel is similar to an RBF neural network and is a good place to start
investigating a dataset.
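The four kernels can be written as short R functions. The parameterizations below (the +1 offset in the polynomial kernel, kappa and delta in the sigmoid kernel, sigma in the Gaussian kernel) are common textbook choices and are assumptions here; the function names are hypothetical.

```r
# Linear kernel: plain dot product
linear_kernel  <- function(x, y) sum(x * y)

# Polynomial kernel of degree d
poly_kernel    <- function(x, y, d = 2) (sum(x * y) + 1)^d

# Sigmoid kernel, built on the hyperbolic tangent
sigmoid_kernel <- function(x, y, kappa = 1, delta = 0) tanh(kappa * sum(x * y) - delta)

# Gaussian RBF kernel: depends only on the distance between x and y
rbf_kernel     <- function(x, y, sigma = 1) exp(-sum((x - y)^2) / (2 * sigma^2))

x <- c(1, 2)
y <- c(3, 4)

linear_kernel(x, y)    # 1 * 3 + 2 * 4 = 11
poly_kernel(x, y)      # (11 + 1)^2 = 144
sigmoid_kernel(x, y)
rbf_kernel(x, y)
```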
This example illustrates management and transferring of handwritten notes (text) and converting it to
typeset or printed text representing the characters in the original notes (unstructured image data).
Protocol:
Divide the image (typically optical image of handwritten notes on paper) into a fine grid where
each cell contains 1 glyph (symbol, letter, number).
Match the glyph in each cell to 1 of the possible characters in a dictionary.
Combine individual characters together into words to reconstitute the digital representation of
the optical image of the handwritten notes.
In this example, we use an optical document image (data) that has already been pre-partitioned into
rectangular grid cells containing 1 character of the 26 English letters, A through Z.
The resulting gridded dataset is distributed by the UCI Machine Learning Data Repository. The
dataset contains 20,000 examples of 26 English capital letters printed using 20 different randomly
reshaped and morphed fonts.
Replacing the vanilladot linear kernel with rbfdot Radial Basis Function kernel, i.e., “Gaussian” kernel
may improve the OCR prediction.
Note the improvement of automated (SVM) classification accuracy (0.928) for rbfdot compared to the
previous (vanilladot) result (0.844).
Case Study 3: Iris Flowers
Let’s have another look at the iris data that we saw in Chapter 2.
SVMs require all features to be numeric, and each feature has to be scaled into a relatively small
interval. We are using Edgar Anderson’s Iris Data in R for this case study. This dataset measures
the length and width of sepals and petals from three Iris flower species.
Let’s load the data first. In this case study we want to explore the variable Species.
The data looks good. However, recall that we need fairly normalized data. We could normalize the
data by hand. Luckily, the R package we are going to use will normalize the dataset automatically.
Now we can separate the training and test dataset using 75%-25% rule.
Let’s first try a toy (iris data) example.
Step 3 - training a model on the data
We are going to use kernlab for this case study. However other packages like e1071 and klaR are
available if you are quite familiar with C++.
Here, we used all the variables other than the Species in the dataset as predictors. We also used
kernel vanilladot that is the linear kernel in this model. We get a training error less than 0.02.
Our old friend, the predict() function, is used again to make predictions. Here we have a factor
outcome, so we need the command table() to show us how well the predictions and actual data match.
We can see that only 1-2 cases may be misclassified as Iris versicolor. The species of the majority
of the flowers are correctly identified.
To see the results more clearly, we can use the proportional table to show the agreements of the
categories.
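The whole workflow for this case study can be sketched as follows, assuming the kernlab package is installed (the example is skipped otherwise). The seed and the object names iris_train, iris_test, iris_clas and iris_pred are hypothetical, so the exact counts will differ from those discussed in the text.

```r
if (requireNamespace("kernlab", quietly = TRUE)) {
  library(kernlab)
  set.seed(2017)

  # 75%-25% split of the built-in iris data
  train_rows <- sample(nrow(iris), floor(0.75 * nrow(iris)))
  iris_train <- iris[train_rows, ]
  iris_test  <- iris[-train_rows, ]

  # Linear-kernel SVM using all other variables as predictors
  iris_clas <- ksvm(Species ~ ., data = iris_train, kernel = "vanilladot")

  # Predicted species for the test set, cross-tabulated with the truth
  iris_pred <- predict(iris_clas, iris_test)
  print(table(iris_pred, iris_test$Species))

  # Proportion of test cases where prediction and truth agree
  agreement <- iris_pred == iris_test$Species
  print(prop.table(table(agreement)))
}
```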
Here == means “equal to”. Over 90% of predictions are correct. Nevertheless, is there any chance
that we can improve the outcome? What if we try a Gaussian kernel?
The linear kernel is the simplest one but usually not the best one. Let’s try the RBF (Radial Basis
Function, or “Gaussian”) kernel instead.
Unfortunately, the model performance is actually worse than the previous one (you might get slightly
different results). This is likely because the classes in this Iris dataset are close to linearly
separable. In practice, we could try alternative kernel functions and see which one fits the dataset best.
Parameter Tuning
We can tune the SVM using the tune.svm function in the package e1071.
Let’s load the data first. In this case study, we want to use the variable CHARLSONSCORE as our
target variable.
Delete the first two columns (we don’t need ID variables) and rows that have missing values in
CHARLSONSCORE (where CHARLSONSCORE equals “-9”). The expression !qol$CHARLSONSCORE==-9
means we want all the rows that have CHARLSONSCORE not equal to -9. The exclamation sign is the
logical negation (“not”) operator. Also, we need to convert our categorical variable CHARLSONSCORE
into a factor.
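These cleaning steps can be sketched on a small made-up table. Only the column name CHARLSONSCORE and the missing-value code -9 come from the text; the other column names and all values are hypothetical.

```r
# Made-up miniature of the quality-of-life table (illustration only)
qol <- data.frame(
  ID            = 1:6,
  INTERVIEWDATE = 0,
  AGE           = c(61, 70, 55, 48, 66, 73),
  CHARLSONSCORE = c(0, 1, -9, 2, 1, -9)
)

# Keep rows where CHARLSONSCORE is not -9 and drop the first two columns.
# In R, ! is applied after ==, so this reads as !(qol$CHARLSONSCORE == -9)
qol <- qol[!qol$CHARLSONSCORE == -9, -c(1, 2)]

# Treat the score as a categorical outcome
qol$CHARLSONSCORE <- as.factor(qol$CHARLSONSCORE)

str(qol)
```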
Now the dataset is ready. First, separate the dataset into training and test datasets using the
75%-25% rule. Then, build an SVM model using all other variables in the dataset as predictor
variables. Try adding different costs of misclassification to the model. Rather than the default C=1,
we use C=2 and C=3. See how the model behaves. Here we utilize the radial basis kernel.
9 Appendix
Below is some additional R code demonstrating various results reported in this Chapter.
Research:
Analysis
Choose at least 2 values each from the inputs you listed and perform the variable information and
conversion.
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________.
Action:
RUBRICS
Programming Rubrics
Mathematics Rubrics
Category and score descriptors (4, 3, 2, 1)
Neatness and organization
4: The work is presented in a neat, clear, organized fashion that is easy to read.
3: The work is presented in a neat and organized fashion that is usually easy to read.
2: The work is presented in an organized fashion but may be hard to read at times.
1: The work appears sloppy and unorganized. It is hard to know what information goes together.
Understanding
4: I got it!! I did it in new ways and showed you how it worked. I can tell you what math concepts are used.
3: I got it. I understood the problem and have an appropriate solution. All parts of the problem are addressed.
2: I understood parts of the problem. I got started, but I couldn’t finish.
1: I did not understand the problem.
Strategy & Procedures
4: Typically, uses an efficient and effective strategy to solve the problem(s).
3: Typically, uses an effective strategy to solve the problem(s).
2: Sometimes uses an effective strategy to solve problems, but does not do it consistently.
1: Rarely uses an effective strategy to solve problems.
Mathematical Errors
4: 90-100% of the steps and solutions have no mathematical errors.
3: Almost all (85-89%) of the steps and solutions have no mathematical errors.
2: Most (75-84%) of the steps and solutions have no mathematical errors.
1: More than 75% of the steps and solutions have mathematical errors.
Completion
4: All problems are completed.
3: All but one of the problems are completed.
2: All but two of the problems are completed.
1: Several of the problems are not completed.
1. Dinov, ID. (2018) Data Science and Predictive Analytics: Biomedical and Health Applications using R,
Springer (ISBN 978-3-319-72346-4).
2. DSPA Book downloads (5M, as of May 2020).