Forecasting numeric data using regression models

CS ELEC 4 - Analytics Techniques
& Tools/Machine Learning

MODULE NO.: 3 (Finals)
Module Title: Forecasting, Neural Networks and Support Vector Machines
WRITER: Dr. Richard N. Monreal
To do well in this module, you need to remember the following:
1. Pause and pray before starting this module.

2. Read and go through the module at your own time and pace.
3. You may open suggested references for supplemental activities and exercises.
4. Honestly answer the activities and sample exercises. The answers are provided on the succeeding
pages.
OPENING PRAYER
May God the Father bless us. May God the Son heal us. May God the Holy Spirit enlighten
us, and give us eyes to see with, ears to hear with, hands to do the work of God with, feet
to walk with, a mouth to preach the word of salvation with, and the angel of peace to watch
over us and lead us at last, by our Lord's gift, to the Kingdom. Amen.
Learning Outcomes

Module 3. Estimated time: 25 Hrs.

Subtopic Title: Understanding regression
o Simple linear regression
o Ordinary least squares estimation
o Model assumptions
o Correlation
o Multiple linear regressions
Subtopic Title: Understanding regression trees and model trees
o Adding regression to trees
“I SHOULD BE ABLE TO”…
(1) demonstrate the predictive power of multiple linear regression,
(2) show the foundation of regression trees and model trees, and
(3) examine two complementary case-studies (Baseball Players and Heart Attack).

Subtopic Title: Understanding Neural networks
o From biological to artificial neurons
o Activation functions
o Network topology
o The direction of information travel
o Training neural networks with back propagation
“I SHOULD BE ABLE TO”…
(1) describe Neural Networks as analogues of biological neurons,
(2) develop hands-on a neural net that can be trained to compute the square-root function,
(3) describe support vector machine (SVM) classification, and
(4) complete several case-studies, including optical character recognition (OCR), the Iris
flowers, Google Trends and the Stock Market, and Quality of Life in chronic disease.
MODULE INTRODUCTION AND FOCUS QUESTION(S):
In this chapter, we will discuss strategies to import data and export results. Also, we are going to learn the
basic tricks we need to know about processing different types of data. Specifically, we will illustrate
common R data structures and strategies for loading (ingesting) and saving (regurgitating) data.
Pretest
To further gauge your level of understanding and where you currently stand in this topic, please answer the
following pre-test questions honestly. Take note of the items that you were not able to correctly answer and
look for the right answer as you go through this module.
A. Enumeration (10 points)
1. Give the steps in solving linear regressions.

2. Give the steps in solving multiple regressions.
B. Lab/ Actual Activities (400 points)

i. Perform the following activities in your machine (create your own dataset):
1. Week 1
i. Give the steps in solving linear regressions.
ii. Using the R language, provide an example that demonstrates solving a regression.
2. Week 2
i. Give the steps in solving multiple regressions.
3. Week 3
i. Topic: Neural networks
i. Using R language, perform the case study number 1: google trends
and stock market.
ii. Provide screenshots of your output
4. Week 4
i. Presentation of project
i. Choose 1 among the topics, forecasting, neural networks and
regressions. Apply 1 actual case study and provide detailed
explanation.
ii. Provide a screenshot or video recording for your project
Study Time
Forecasting Numeric Data using Regression Models
1. Understanding Regression
Regression measures the relationship between a dependent variable (the value to be predicted)
and a group of independent variables (predictors, similar to the features discussed in Chapter 6). We
assume that the relationship between the dependent variable and the independent variables follows a
straight line.
2.1 Simple linear regression
The simplest case of regression has only one predictor:
$$y = a + bx$$
Does this formula appear familiar to you? In this slope-intercept formula, a is our intercept while b is
the slope. That is the expression for simple linear regression. If we know a and b, for any given x we
can calculate y via the above formula. If we plot x and y in a coordinate system, we will have a
straight line.
However, this is the ideal case. With real-world data, the pattern is harder to recognize. Let’s look
at the scatter plot (recall Chapter 2) and the simple linear regression line for two variables:
hospital charges, CHARGES (the dependent variable), and length of stay in the hospital, LOS (the
predictor). The data can be found in our class files as CaseStudy12_AdultsHeartAttack_Data. We removed
two observations that have missing data using
the command heart_attack<-heart_attack[complete.cases(heart_attack), ].
It seems to be common sense that the longer you stay in the hospital, the higher the medical costs
will be. However, the scatter plot shows only a cloud of dots with a slight sign of an increasing
pattern.
The estimated expression for this regression line is:
$$\widehat{\text{CHARGES}} = 4582.70 + 212.29 \times \text{LOS}$$
It is simple to make predictions with this regression line. Assume we have a patient who spent 10
days in hospital, so LOS=10. The predicted charge is
$4582.70+$212.29×10=$6705.60. Plugging x into the regression equation gives us an
estimated value of the outcome y. This chapter of the Probability and statistics EBook provides an
introduction to linear modeling.
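The prediction step can be sketched in R as follows. The coefficients are the ones quoted in the text; the helper name predict_charge() is hypothetical and is introduced only for illustration.

```r
# Coefficients taken from the estimated regression line quoted in the text
intercept <- 4582.70   # estimated charge at LOS = 0
slope     <- 212.29    # estimated extra charge per additional day in hospital

# Hypothetical helper (not part of the module): predicted charge for a given LOS
predict_charge <- function(los) {
  intercept + slope * los
}

predict_charge(10)            # the 10-day patient from the text
predict_charge(c(1, 5, 10))   # the function is vectorized over LOS
```

Because the model is a straight line, each extra day adds the same amount (the slope) to the predicted charge.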
2.2 Ordinary least squares estimation
How did we get the estimated expression? The most common estimation method in statistics is
ordinary least squares (OLS). OLS estimators are obtained by minimizing the sum of the squared errors,
that is, the sum of squared vertical distances from each dot on the scatter plot to the regression line.
OLS minimizes the following quantity:
$$\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=\sum_{i=1}^{n}\left(y_i-(a+bx_i)\right)^2$$
After some statistical calculations, the value of b with the minimum squared error is:
$$b=\frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i}(x_i-\bar{x})^2}$$
while the optimal a is:
$$a=\bar{y}-b\bar{x}$$
If we utilize the sample averages (x¯, y¯), we have:
$$Var(x)=\frac{\sum_{i}(x_i-\bar{x})^2}{n-1},\qquad Cov(x,y)=\frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{n-1}$$
Combining the above, we get an estimate of the slope coefficient (effect-size of LOS on Charge):
$$b=\frac{Cov(x,y)}{Var(x)}$$
Let’s examine these using the heart attack data.
We can see that this is exactly the same as the previously stated expression.
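A minimal sketch of the covariance/variance calculation is given below. It uses synthetic data as a stand-in (the variables x and y are assumptions, not the heart attack file) and checks the hand-computed coefficients against R's built-in lm().

```r
set.seed(42)

# Synthetic stand-in for LOS (x) and CHARGES (y); NOT the real heart attack data
x <- round(runif(100, min = 1, max = 20))
y <- 4500 + 200 * x + rnorm(100, sd = 1500)

b <- cov(x, y) / var(x)        # slope: Cov(x, y) / Var(x)
a <- mean(y) - b * mean(x)     # intercept: mean of y minus b times mean of x

fit <- lm(y ~ x)               # built-in OLS fit for comparison

c(intercept = a, slope = b)    # hand-computed estimates
coef(fit)                      # lm() estimates, which should agree
```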
2.3 Model Assumptions
Regression modeling has five key assumptions:
• Linear relationship,
• Multivariate normality,
• No or little multicollinearity,
• No auto-correlation (independence),
• Homoscedasticity.
2.4 Correlations
Note: The SOCR Interactive Scatterplot Game (requires Java enabled browser) provides a dynamic
interface demonstrating linear models, trends, correlations, slopes and residuals.
Based on covariance we can calculate correlation, which indicates how closely the relationship
between two variables follows a straight line:
$$\rho_{x,y}=Corr(x,y)=\frac{Cov(x,y)}{\sigma_x\sigma_y}=\frac{Cov(x,y)}{\sqrt{Var(x)\,Var(y)}}$$
In R, correlation is given by cor(), while the standard deviation (the square root of the variance) is
given by sd().
The same outputs are obtained. This correlation is a positive number that is relatively small. We can
say there is a weak positive linear association between these two variables. A negative number
indicates a negative linear association. Judging by the absolute value of the correlation, we have a
weak association when 0.1≤|Cor|<0.3, a moderate association for 0.3≤|Cor|<0.5, and a strong
association for 0.5≤|Cor|≤1.0. If the absolute correlation is below 0.1
then it suggests little to no linear relation between the variables.
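As an illustration, the sketch below computes the correlation both by hand and with cor() on synthetic data (an assumption, not the module's dataset). The helper association_strength() is hypothetical and simply encodes the thresholds quoted above.

```r
set.seed(7)

# Synthetic data with a weak positive linear relationship
x <- rnorm(200)
y <- 0.25 * x + rnorm(200)

r_manual  <- cov(x, y) / (sd(x) * sd(y))   # covariance scaled by both standard deviations
r_builtin <- cor(x, y)                     # built-in Pearson correlation

# Hypothetical helper: label the strength of association using |r|
association_strength <- function(r) {
  r <- abs(r)
  if (r < 0.1) {
    "little to none"
  } else if (r < 0.3) {
    "weak"
  } else if (r < 0.5) {
    "moderate"
  } else {
    "strong"
  }
}

c(manual = r_manual, builtin = r_builtin)
association_strength(r_builtin)
```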
2.5 Multiple Linear Regression
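In practice, we usually have multiple predictors and one dependent variable,
which may follow a multiple linear model. That is:
$$y=\alpha+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k+\epsilon$$
or equivalently
$$y=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k+\epsilon$$
We usually use the second notation in statistics. This equation shows the linear relationship
between k predictors and a dependent variable. In total we have k+1 coefficients to estimate.
The matrix notation for the above equation is:
$$Y=X\beta+\epsilon$$
where Y is the n×1 vector of outcomes, X is the n×(k+1) matrix of predictors whose first column is
all ones (for the intercept), β is the (k+1)×1 vector of coefficients,
and ϵ is the error term.
Similar to simple linear regression, our goal is to minimize the sum of squared errors. Solving for β,
we get:
$$\hat{\beta}=(X^TX)^{-1}X^TY$$
The solution is in matrix form: the superscript −1 denotes the matrix inverse (here, of X^T X) and
X^T is the transpose of X.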
Let’s make a function of our own using this matrix formula.

solve() performs matrix inversion and %*% is matrix multiplication.
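A minimal sketch of such a function is shown below. The name reg() and its argument names are assumptions made for illustration, and the data are synthetic rather than the heart attack dataset. The result is compared with lm().

```r
# Hypothetical helper: OLS coefficients from the matrix formula (X'X)^{-1} X'y
reg <- function(y, x) {
  x <- as.matrix(x)
  x <- cbind(Intercept = 1, x)            # prepend a column of ones for the intercept
  solve(t(x) %*% x) %*% t(x) %*% y        # invert X'X, then multiply by X'y
}

set.seed(1)

# Synthetic data with two predictors (not the heart attack data)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 2 + 1.5 * d$x1 - 0.5 * d$x2 + rnorm(50, sd = 0.1)

beta_hat <- reg(y = d$y, x = d[, c("x1", "x2")])
beta_hat                                  # matrix-formula estimates
coef(lm(y ~ x1 + x2, data = d))           # lm() estimates for comparison
```

Explicit inversion with solve() mirrors the formula and is fine for teaching; lm() uses a numerically more stable decomposition internally.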
Next, we will apply this function to our heart attack dataset. To begin with, let’s check if the simple
linear regression output is the same as we calculated earlier.
It works! Then, we can include more variables as predictors. As an example, we just add age into the
model.
3 Case Study 1: Baseball Players
3.1 Step 1 - collecting data
We utilize the mlb data “01a_data.txt”. The dataset contains 1034 records of heights and weights for
some current and recent Major League Baseball (MLB) Players. These data were obtained from
different resources (e.g., IBM Many Eyes).
Variables:
• Name: MLB Player Name
• Team: The baseball team the player was a member of at the time the data was acquired
• Position: Player field position
• Height: Player height in inches
• Weight: Player weight in pounds
• Age: Player age at time of record.
3.2 Step 2 - exploring and preparing the data
Let’s load this dataset first. We use as.is=T to keep non-numerical vectors as characters. Also, we
delete the Name variable because we don’t need players’ names in this case study.
By looking at the str() output we notice that the variables Team and Position are misspecified as
characters. To fix this we can use the function as.factor(), which converts numerical or character
vectors to factors.
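The loading and conversion steps can be sketched as follows. The rows below are made-up placeholders laid out like the MLB file (they are not real records), so the example runs without the data file.

```r
# A few made-up rows in the same column layout as the MLB file (illustrative only)
mlb <- read.table(text = "
Name Team Position Height Weight Age
Player_A BAL Catcher 74 180 22.99
Player_B BAL Catcher 74 215 34.69
Player_C NYY Outfielder 72 210 30.78
Player_D NYY Shortstop 71 188 27.50
", header = TRUE, as.is = TRUE)

mlb <- mlb[, -1]     # drop the Name column; it is not needed for modeling
str(mlb)             # Team and Position are read as character vectors

# Convert the categorical variables to factors
mlb$Team     <- as.factor(mlb$Team)
mlb$Position <- as.factor(mlb$Position)
str(mlb)             # Team and Position are now factors
```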
The above plot illustrates our dependent variable Weight.
Applying ggpairs() (from the GGally package) to obtain a compact dataset summary, we can mark
heavyweight and lightweight players (according to light<median<heavy) by different colors in the plot:
Next, we may also mark player positions by different colors in the plot.
What about our potential predictors?
Here we have two numerical predictors, two categorical predictors and 1034 observations. Let’s see
how R treats these three different classes of variables.
1 Understanding Neural Networks
1.1 From biological to artificial neurons
An Artificial Neural Network (ANN) model mimics the biological brain response to multisource
(sensory-motor) stimuli (inputs). ANNs simulate the brain using a network of interconnected neuron
cells to create a massive parallel processor. Of course, an ANN uses a network of artificial nodes,
not brain cells, to learn from data.
When we have three signals (or inputs) x1, x2 and x3, the first step is weighting the features (w’s)
according to their importance. Then, the weighted signals are summed by the “neuron cell” and this
sum is passed on according to an activation function denoted by f. The last step is generating an
output y at the end of the process. A typical output will have the following mathematical relationship to
the inputs:
$$y(x)=f\left(\sum_{i=1}^{n}w_ix_i\right)$$
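A single artificial neuron of this kind can be sketched in a few lines of R. The function name neuron(), the input values, and the weights are all illustrative assumptions.

```r
# Hypothetical single neuron: weighted sum of the inputs passed through activation f
neuron <- function(x, w, f) {
  f(sum(w * x))
}

sigmoid <- function(z) 1 / (1 + exp(-z))   # one common choice of activation

x <- c(1.0, 0.5, -1.0)   # three input signals (illustrative values)
w <- c(0.4, 0.2, 0.1)    # their weights (illustrative values)

weighted_sum <- neuron(x, w, identity)     # 0.4*1.0 + 0.2*0.5 + 0.1*(-1.0)
y <- neuron(x, w, sigmoid)                 # output after the sigmoid activation
```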
There are three important components for building a neural network:
• Activation function: transforms weighted and summed inputs to output.
• Network topology: describes the number of “neuron cells”, the number of layers and the manner
in which the cells are connected.
• Training algorithm: how to determine the weights wi.
Let’s look at each of these components one by one.

1.2 Activation functions
One of these functions is known as the threshold activation function, which produces an output signal
only once a specified input threshold has been attained.
Other activation functions might also be useful:
Basically, we can choose a proper activation function based on the codomain of the
function. For example, with the hyperbolic tangent activation function, we can only have outputs
ranging from -1 to 1 regardless of the input. With a linear function we can go from −∞ to +∞. Our
Gaussian activation function will give us a model called a Radial Basis Function network.
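To make the differences in output range concrete, here is a sketch of several activation functions in R. The function names and the exact parameterizations (for example, a unit-width Gaussian and a threshold at 0) are assumptions chosen for illustration.

```r
# Common activation functions (illustrative parameterizations)
threshold_act <- function(z, threshold = 0) as.numeric(z >= threshold)  # outputs 0 or 1
sigmoid_act   <- function(z) 1 / (1 + exp(-z))                          # outputs in (0, 1)
tanh_act      <- function(z) tanh(z)                                    # outputs in (-1, 1)
linear_act    <- function(z) z                                          # unbounded outputs
gaussian_act  <- function(z) exp(-z^2 / 2)                              # outputs in (0, 1]

z <- seq(-5, 5, by = 0.5)

# Observed output range of each function on the grid z
activation_ranges <- rbind(
  threshold = range(threshold_act(z)),
  sigmoid   = range(sigmoid_act(z)),
  tanh      = range(tanh_act(z)),
  linear    = range(linear_act(z)),
  gaussian  = range(gaussian_act(z))
)
activation_ranges
```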
1.3 Network topology
The number of layers: The x’s, or features, in the dataset are called input nodes, while the predicted
values are called output nodes. Multilayer networks include multiple hidden layers. The following
graph represents a two-layer neural network:
When we have multiple layers, the information flow could be complicated.

1.4 The direction of information travel
The arrows in the last graph (with multiple layers) suggest a feed-forward network. In such a network,
we can also have multiple outcomes modeled simultaneously.
Alternatively, in a recurrent network (feedback network), information can also travel backwards in
loops (or with a delay). This is illustrated in the following graph.
This short-term memory increases the power of recurrent networks dramatically. However, in
practice, recurrent networks are seldom used.
1.5 The number of nodes in each layer
The numbers of input nodes and output nodes are predetermined by the dataset and the variables being
predicted. The number we can adjust is the number of hidden nodes in the model. Our goal is to use as
few hidden nodes as possible, to keep the model simple, as long as the model performance remains
satisfactory.
1.6 Training neural networks with backpropagation
This algorithm determines the weights in the model using a strategy of back-propagating errors.
First, we assign random numbers to the weights (all weights must be non-trivial, i.e., ≠0).
For example, we can use a normal distribution, or any other random process, to assign initial weights.
Then we adjust the weights iteratively, repeating the process until a convergence or
stopping criterion is met. Each iteration contains two phases.
• Forward phase: go from the input layer to the output layer using the current weights. Outputs are
produced at the end of this phase, and
• Backward phase: compare the outputs with the true target values. If the difference is significant,
we change the weights and go through the forward phase again.
In the end, we pick the set of weights corresponding to the least total error to be the weights in our
network.
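The two phases can be illustrated with a small hand-written network in base R. This is a minimal sketch under stated assumptions, not the algorithm used by any particular package: one hidden layer with 3 sigmoid nodes, a linear output, plain gradient descent with a fixed learning rate, a fixed number of iterations as the stopping criterion, and a toy target (y = x^2) chosen only for illustration.

```r
set.seed(123)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Toy regression task (illustrative): learn y = x^2 on [0, 1]
x <- matrix(seq(0, 1, length.out = 50), ncol = 1)    # 50 x 1 inputs
y <- x^2                                             # 50 x 1 targets
n <- nrow(x)
n_hidden <- 3

# Random, non-zero initial weights drawn from a normal distribution
W1 <- matrix(rnorm(n_hidden, sd = 0.5), nrow = 1)    # input -> hidden weights (1 x 3)
b1 <- rnorm(n_hidden, sd = 0.5)                      # hidden-node biases
W2 <- matrix(rnorm(n_hidden, sd = 0.5), ncol = 1)    # hidden -> output weights (3 x 1)
b2 <- rnorm(1, sd = 0.5)                             # output bias

learning_rate <- 0.1
loss_history  <- numeric(5000)

for (iter in seq_along(loss_history)) {
  # Forward phase: inputs -> hidden layer -> output, using the current weights
  hidden <- sigmoid(sweep(x %*% W1, 2, b1, "+"))     # 50 x 3 hidden activations
  output <- hidden %*% W2 + b2                       # 50 x 1 predictions

  # Compare outputs with the true targets
  err <- output - y
  loss_history[iter] <- mean(err^2) / 2              # half mean squared error

  # Backward phase: propagate the error back to obtain gradients
  grad_W2 <- t(hidden) %*% err / n
  grad_b2 <- mean(err)
  delta_hidden <- (err %*% t(W2)) * hidden * (1 - hidden)   # uses sigmoid derivative
  grad_W1 <- t(x) %*% delta_hidden / n
  grad_b1 <- colMeans(delta_hidden)

  # Adjust the weights against the gradient
  W1 <- W1 - learning_rate * grad_W1
  b1 <- b1 - learning_rate * grad_b1
  W2 <- W2 - learning_rate * grad_W2
  b2 <- b2 - learning_rate * grad_b2
}

c(first = loss_history[1], last = loss_history[length(loss_history)])
```

Plotting loss_history against the iteration number shows how the error shrinks as the weights are adjusted.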
2 Case Study 1: Google Trends and the Stock Market - Regression

2.1 Step 1 - collecting data
In this case study, we are going to use the Google trends and stock market dataset. A doc file with
the meta-data and the CSV data are available on the Case-Studies Canvas Site. These daily data
(between 2008 and 2009) can be used to examine the associations between Google search trends
and the daily market index, the Dow Jones Industrial Average.
2.1.1 Variables
• Index: Time Index of the Observation
• Date: Date of the observation (Format: YYYY-MM-DD)
• Unemployment: The Google Unemployment Index tracks queries related to “unemployment,
social, social security, unemployment benefits” and so on.
• Rental: The Google Rental Index tracks queries related to “rent, apartments, for rent, rentals”,
etc.
• RealEstate: The Google Real Estate Index tracks queries related to “real estate,
mortgage, rent, apartments” and so on.
• Mortgage: The Google Mortgage Index tracks queries related to “mortgage, calculator,
mortgage calculator, mortgage rates”.
• Jobs: The Google Jobs Index tracks queries related to “jobs, city, job, resume, career,
monster” and so forth.
• Investing: The Google Investing Index tracks queries related to “stock, finance, capital, yahoo
finance, stocks”, etc.
• DJI_Index: The Dow Jones Industrial (DJI) index. These data are interpolated from 5 records
per week (Dow Jones stocks are traded on week-days only) to 7 days per week to match the
constant 7-day records of the Google-Trends data.
• StdDJI: The standardized-DJI Index computed by: StdDJI = 3+(DJI-11091)/1501, where
m=11091 and s=1501 are the approximate mean and standard-deviation of the DJI for the
period (2005-2011).
• 30-Day Moving Average Data Columns: The 8 variables below are the 30-day moving
averages of the 8 corresponding (raw) variables above: Unemployment30MA, Rental30MA,
RealEstate30MA, Mortgage30MA, Jobs30MA, Investing30MA, DJI_Index30MA, and
StdDJI_30MA.
• 180-Day Moving Average Data Columns: The 8 variables below are the 180-day moving
averages of the 8 corresponding (raw) variables: Unemployment180MA, Rental180MA,
RealEstate180MA, Mortgage180MA, Jobs180MA, Investing180MA, DJI_Index180MA, and
StdDJI_180MA.
Here we use the RealEstate as our dependent variable. Let’s see if Google Real Estate Index could
be predicted by other variables in the dataset.
2.2 Step 2 - exploring and preparing the data
First things first: we need to load the dataset into R.
Let’s delete the first two columns, since the only goal is to predict Google Real Estate Index with
other indexes and DJI.
As we can see from the structure of the data, these indexes and DJI have different ranges. We
should rescale the data. In Chapter 6, we learned that normalizing these features using our own
normalize() function could fix the problem. We can use lapply() to apply the normalize() function to
each column.
The last step clearly normalizes all feature vectors into the 0 to 1 range.
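A sketch of this rescaling step is shown below. The normalize() body is the usual min-max definition (an assumption, since the module's own code is not reproduced here), and the small data frame is synthetic.

```r
# Min-max normalization: maps the smallest value to 0 and the largest to 1
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Synthetic columns on very different scales (placeholder values, not the real data)
google <- data.frame(
  Unemployment = c(1.5, 1.2, 1.8, 2.1),
  DJI_Index    = c(13000, 12500, 9000, 8500)
)

# lapply() applies normalize() to each column; as.data.frame() restores the table
google_norm <- as.data.frame(lapply(google, normalize))
summary(google_norm)   # every column now ranges from 0 to 1
```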
The next step is to separate our google dataset into training and test subsets. This time we will
use the sample() and floor() functions. sample() creates a set of randomly chosen row indices, which
we can use to subset the original dataset. floor() takes a number x and returns the largest integer
that is not greater than x.
sample(row, size)
• row: the rows in the dataset that you want to select from. To select from all the rows, you can
use nrow(data) (a single number) or 1:nrow(data) (a vector).
• size: how many rows you want for your subset.
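The split can be sketched as follows. The 75% training fraction and the placeholder data frame are assumptions made for illustration.

```r
set.seed(2024)

# Placeholder data frame standing in for the normalized google data
google_norm <- data.frame(id = 1:731, value = runif(731))

# Randomly choose 75% of the row indices; floor() turns 548.25 into 548
sub <- sample(nrow(google_norm), floor(nrow(google_norm) * 0.75))

google_train <- google_norm[sub, ]    # rows selected by sample()
google_test  <- google_norm[-sub, ]   # all remaining rows

c(train = nrow(google_train), test = nrow(google_test))
```

Negative indexing with -sub keeps exactly the rows that were not sampled, so the two subsets never overlap.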
We are good to go! Let’s move forward to training phase.
2.3 Step 3 - training a model on the data
Here, we use the function neuralnet() in the package neuralnet. neuralnet() returns a NN object
containing:
• call: the matched call.
• response: extracted from the data argument.
• covariate: the variables extracted from the data argument.
• model.list: a list containing the covariates and the response variables extracted from the
formula argument.
• err.fct and act.fct: the error and activation functions.
• net.result: a list containing the overall result of the neural network for every repetition.
• weights: a list containing the fitted weights of the neural network for every repetition.
• result.matrix: a matrix containing the reached threshold, needed steps, error, AIC, BIC, and
weights for every repetition. Each column represents one repetition.
m<-neuralnet(target~predictors, data=mydata, hidden=1), where:
• target: the variable we want to predict.
• predictors: the predictors we want to use. Note that we cannot use “.” to denote all the variables in
this function. We have to add all predictors one by one to the model.
• data: the training dataset.
• hidden: the number of hidden nodes that we want to use in the model. By default, it is set to one.
The above graph shows that we have only one hidden node. Error stands for the sum of squared
errors and Steps is how many iterations the model has gone through. Note that these outputs could
be different when you run exactly the same code twice because the initial weights are randomly
generated.
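Since the predictors have to be listed one by one, it can be convenient to build the model formula from a vector of column names. The sketch below uses only base R; the predictor names are the ones this case study uses, and the resulting formula could then be supplied as the first argument to neuralnet().

```r
# Predictor columns used in this case study
predictors <- c("Unemployment", "Rental", "Mortgage", "Jobs",
                "Investing", "DJI_Index", "StdDJI")

# Paste the names together with "+" and convert the string into a formula
nn_formula <- as.formula(
  paste("RealEstate ~", paste(predictors, collapse = " + "))
)
nn_formula
```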
2.4 Step 4 - evaluating model performance
Similar to the predict() function that we have mentioned in previous chapters, compute() is an
alternative method that generates the model predictions.
p<-compute(m, test)
• m: a trained neural network model.
• test: the test dataset. This dataset should contain only the predictors used in the neural
network model.
In our model we picked Unemployment, Rental, Mortgage, Jobs, Investing, DJI_Index, StdDJI as our
predictors. So we need to find these corresponding column numbers in the test dataset (1, 2, 4, 5, 6,
7, 8 respectively).
As mentioned in Chapter 9, we can still use the correlation between predicted results and observed
Real Estate Index to evaluate the algorithm. A correlation over 0.9 is very good for real world
datasets. Could this be improved further?
2.5 Step 5 - improving model performance
This time we include 4 hidden nodes in the model. Let’s see what results we can get
from this more complicated model.
Although the graph looks very complicated, we have a smaller Error (sum of squared errors). Neural
networks can be used for both classification and regression, as we will see in the next part. Let’s
first try regression.
2.6 Step 6 - adding additional layers
We observe an even lower Error by using three hidden layers with 4, 3, and 3 nodes, respectively.
3 Simple NN demo - learning to compute √
This simple example demonstrates the foundation of the neural network prediction of a basic
mathematical function: √ : R+ ⟶ R+.
We observe that the NN, net.sqrt, actually learns the square root function and predicts values quite
close to it. Of course, everyone’s results may vary as we randomly generate the training data
(rand_data) and the NN construction (net.sqrt) is also stochastic.
Case Study 2: Google Trends and the Stock Market - Classification
In practice, NN may be more useful as a classifier. Let’s demonstrate this by using the Stock
Market data again. We label the samples according to their RealEstate values: those above the 75th
percentile get label 0, those below the 25th percentile get label 2, and the rest get
label 1. Even in the classification setting, the response must still be numeric.
Here, we divide the data into training and testing sets. We need 3 more indicator columns, which
correspond to the 3 outcome labels.
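The labeling and the three indicator columns can be sketched as follows. The data are synthetic, and the indicator column names High, Mid, and Low are hypothetical names chosen for illustration.

```r
set.seed(10)

# Synthetic stand-in for the normalized RealEstate column
google <- data.frame(RealEstate = runif(200))

# 25th and 75th percentiles of RealEstate
q <- unname(quantile(google$RealEstate, probs = c(0.25, 0.75)))

# Label 0 above the 75th percentile, 2 below the 25th percentile, 1 otherwise
google$label <- ifelse(google$RealEstate > q[2], 0,
                       ifelse(google$RealEstate < q[1], 2, 1))

# One numeric 0/1 indicator column per outcome label
google$High <- as.numeric(google$label == 0)
google$Mid  <- as.numeric(google$label == 1)
google$Low  <- as.numeric(google$label == 2)

table(google$label)
```

Each row has exactly one indicator equal to 1, which gives the network one numeric output per class.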
We use non-linear output and display every 2,000 iterations.
Below is the prediction function translating this model to the forecasting results.
Now let’s inspect the structure of the Neural Network.
Similarly, we can change hidden to utilize multiple hidden layers; however, a more complicated model
won’t necessarily guarantee improved performance.
Support Vector Machines (SVM)

Classification with hyperplanes
The easiest shape would be a plane. Support Vector Machine (SVM) can use a hyperplane to
separate data into several groups or classes. This is used for datasets that are linearly separable.
Assume that we have only two features, will you use A or B hyperplane to separate the data? Or even
another plane C?
Finding the maximum margin
To answer the above question, we need to search for the Maximum Margin Hyperplane (MMH).
That is the hyperplane that creates the greatest separation between the two closest observations.
We define support vectors as the points from each class that are closest to the MMH. Each class must
have at least one observation as a support vector.
Using support vectors alone is not sufficient for finding the MMH. Although tricky mathematical
calculations are involved, the principle of the process is fairly simple. Let’s look at linearly separable
data and non-linearly separable data individually.
Linearly separable data
If the dataset is linearly separable, we can find the outer boundaries of our two groups of data points.
These boundaries are called convex hulls (red lines in the following graph). The MMH (black solid line)
is the line that perpendicularly bisects the shortest line between the two convex hulls.
An alternative way would be picking two parallel planes that can separate the data into two groups
while the distance between two planes is as far as possible.
We can use vector notation to mathematically define planes. In n-dimensional space, a plane can
be expressed by the following equation:
$$\vec{w}\cdot\vec{x}+b=0$$
where the vectors w⃗ (weights) and x⃗ (unknowns) both have n coordinates and b is a (scalar)
constant. To clarify this notation, let’s look at the situation in 3D space, where we can express
(embed) 2D Euclidean planes using a point (x0,y0,z0) and normal-vector (a,b,c) form. This is just a
linear equation, where d=−(ax0+by0+cz0):
$$ax+by+cz+d=0$$
or equivalently
$$w_1x_1+w_2x_2+w_3x_3+b=0$$
We can see that it is equivalent to the vector notation.
Using the vector notation, we can specify two hyperplanes as follows:
$$\vec{w}\cdot\vec{x}+b\ge +1$$
and
$$\vec{w}\cdot\vec{x}+b\le -1$$
We require that all of the observations in the first class fall above the first plane and all observations
in the other class fall below the second plane.
The distance between the two planes is calculated as:
$$\frac{2}{\|\vec{w}\|}$$
where ∥w⃗ ∥ is the Euclidean norm. To maximize the distance we need to minimize the Euclidean
norm.
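As a small numeric illustration, the sketch below computes the margin width 2/∥w⃗∥ and checks on which side of the two planes a couple of points fall. The weight vector, constant, and points are made-up values.

```r
# Illustrative weight vector and constant defining the planes w.x + b = +1 and w.x + b = -1
w <- c(3, 4)
b <- -2

w_norm <- sqrt(sum(w^2))   # Euclidean norm of w
margin <- 2 / w_norm       # distance between the two planes

# Two illustrative points, one per row
pts <- rbind(c(1, 1),
             c(0, 0))

scores <- as.numeric(pts %*% w + b)   # value of w.x + b for each point
side   <- ifelse(scores >= 1, "above first plane",
                 ifelse(scores <= -1, "below second plane", "inside margin"))

data.frame(score = scores, side = side)
```

A larger norm of w gives a narrower margin, which is why maximizing the margin amounts to minimizing the norm.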
For each nonlinear programming problem, called the primal problem, there is a related nonlinear
programming problem, called the Lagrangian dual problem. Under certain assumptions of convexity and
suitable constraints, the primal and dual problems have equal optimal objective values. Primal
optimization problems are typically described as:
Then the Lagrangian dual problem is defined as a parallel nonlinear programming problem:
Suppose the Lagrange primal is
To optimize that objective function, we can set the partial derivatives equal to zero:
Substituting into the Lagrange primal, we obtain the Lagrange dual:
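For reference, one standard textbook form of this derivation is sketched below. It assumes class labels y_i ∈ {−1, +1}, multipliers α_i ≥ 0, and that maximizing 2/∥w⃗∥ is recast as minimizing ½∥w⃗∥²; the symbols y_i and α_i are introduced here for illustration and may differ from the notation originally intended.

```latex
\begin{aligned}
&\text{Primal problem:} &&
  \min_{\vec{w},\,b}\ \tfrac{1}{2}\|\vec{w}\|^2
  \quad \text{subject to } y_i(\vec{w}\cdot\vec{x}_i + b) \ge 1,\quad i = 1,\dots,n \\[4pt]
&\text{Lagrange primal:} &&
  L_P = \tfrac{1}{2}\|\vec{w}\|^2
        - \sum_{i=1}^{n} \alpha_i \left[\, y_i(\vec{w}\cdot\vec{x}_i + b) - 1 \,\right] \\[4pt]
&\text{Zero partial derivatives:} &&
  \frac{\partial L_P}{\partial \vec{w}} = 0 \ \Rightarrow\ \vec{w} = \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i,
  \qquad
  \frac{\partial L_P}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{n} \alpha_i y_i = 0 \\[4pt]
&\text{Lagrange dual:} &&
  L_D = \sum_{i=1}^{n} \alpha_i
        - \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, \vec{x}_i \cdot \vec{x}_j
\end{aligned}
```

The dual depends on the observations only through the inner products of pairs of observations.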

Non-linearly separable data
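For non-linearly separable data, we need to use a small trick. We still use a plane, but allow some of
the points to be misclassified into the wrong class. To penalize that, we add a cost term after the
Euclidean norm function that we need to minimize.
Therefore, the solution will optimize the following objective (cost) function, written here in the
usual squared-norm form with class labels y_i ∈ {−1, +1}:
$$\min_{\vec{w},\,b,\,\xi}\left(\frac{1}{2}\|\vec{w}\|^2+C\sum_{i=1}^{n}\xi_i\right)\quad\text{subject to}\quad y_i(\vec{w}\cdot\vec{x}_i+b)\ge 1-\xi_i,\ \ \xi_i\ge 0$$
where C is the cost and ξi is the distance between the misclassified observation i and the plane.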
We have Lagrange primal problem:
Similar to what we did above for the separable case, we can use the derivatives of the primal problem
to solve the dual problem.
Notice the inner product in the final expression. We can replace this inner product with a kernel
function that maps the feature space into a higher dimensional space (e.g., using a polynomial kernel)
or an infinite dimensional space (e.g., using a Gaussian kernel).
Using kernels for non-linear spaces
An alternative way to handle non-linearly separable data is called the kernel trick: add some
dimensions (or features) so that the non-linearly separable data become separable in a higher
dimensional space.
How can we do that? We transform our data using kernel functions. A general form for kernel
functions is:
$$K(\vec{x}_i,\vec{x}_j)=\phi(\vec{x}_i)\cdot\phi(\vec{x}_j)$$
where ϕ is a mapping of the data into another space.
The linear kernel is the simplest one; it is just the dot product of the features.
The polynomial kernel of degree d transforms the data by adding a simple non-linear transformation of
the data.
The sigmoid kernel is very similar to a neural network. It uses a sigmoid activation function.
The Gaussian RBF kernel is similar to an RBF neural network and is a good place to start investigating
a dataset.
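The four kernels can be sketched as plain R functions. The parameterizations below (the +1 offset in the polynomial kernel, kappa and delta in the sigmoid kernel, and sigma in the Gaussian kernel) are common textbook forms and are assumptions; individual libraries may parameterize them differently.

```r
# Kernel functions for two numeric vectors x and y (common textbook forms)
linear_kernel  <- function(x, y) sum(x * y)
poly_kernel    <- function(x, y, d = 2) (sum(x * y) + 1)^d
sigmoid_kernel <- function(x, y, kappa = 1, delta = 0) tanh(kappa * sum(x * y) - delta)
rbf_kernel     <- function(x, y, sigma = 1) exp(-sum((x - y)^2) / (2 * sigma^2))

x1 <- c(1, 2)   # illustrative feature vectors
x2 <- c(3, 0)

k_lin  <- linear_kernel(x1, x2)    # dot product: 1*3 + 2*0
k_poly <- poly_kernel(x1, x2)      # (dot product + 1)^2
k_sig  <- sigmoid_kernel(x1, x2)   # tanh of the dot product
k_rbf  <- rbf_kernel(x1, x2)       # decays with squared distance between x1 and x2
```

The Gaussian kernel equals 1 when the two vectors are identical and approaches 0 as they move apart.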
Case Study 2: Optical Character Recognition (OCR)
This example illustrates the management and transfer of handwritten notes (text) by converting them to
typeset or printed text representing the characters in the original notes (unstructured image data).
Protocol:
• Divide the image (typically an optical image of handwritten notes on paper) into a fine grid where
each cell contains 1 glyph (symbol, letter, number).
• Match the glyph in each cell to 1 of the possible characters in a dictionary.
• Combine individual characters together into words to reconstitute the digital representation of
the optical image of the handwritten notes.
In this example, we use an optical document image (data) that has already been pre-partitioned into
rectangular grid cells, each containing 1 character of the 26 English letters, A through Z.
The resulting gridded dataset is distributed by the UCI Machine Learning Data Repository. The
dataset contains 20,000 examples of 26 English capital letters printed using 20 different randomly
reshaped and morphed fonts.
Example of the preprocessed gridded handwritten letters

Step 1: Prepare and explore the data
Step 3: Training an SVM model
We can specify vanilladot as a linear kernel, or alternatively:
• rbfdot: Radial Basis kernel, i.e., “Gaussian”
• polydot: Polynomial kernel
• tanhdot: Hyperbolic tangent kernel
• laplacedot: Laplacian kernel
• besseldot: Bessel kernel
• anovadot: ANOVA RBF kernel
• splinedot: Spline kernel
• stringdot: String kernel
Step 4: Evaluating model performance
Step 5: Improving model performance
Replacing the vanilladot linear kernel with rbfdot Radial Basis Function kernel, i.e., “Gaussian” kernel
may improve the OCR prediction.
Note the improvement of automated (SVM) classification accuracy (0.928) for rbfdot compared to the
previous (vanilladot) result (0.844).
Case Study 3: Iris Flowers
Let’s have another look at the iris data that we saw in Chapter 2.
Step 1 - collecting data
SVMs require all features to be numeric, and each feature has to be scaled into a relatively small
interval. We are using Edgar Anderson’s Iris Data in R for this case study. This dataset measures
the length and width of sepals and petals from three Iris flower species.
Step 2 - exploring and preparing the data
Let’s load the data first. In this case study we want to explore the variable Species.
The data look good. However, recall that we need fairly normalized data. We could normalize the
data by hand. Luckily, the R package we are going to use will normalize the dataset automatically.
Now we can separate the training and test datasets using the 75%-25% rule.
Let’s first try a toy (iris data) example.
Step 3 - training a model on the data
We are going to use kernlab for this case study. However, other packages like e1071 and klaR are
available if you are quite familiar with C++.
Let’s break down the function ksvm():
m<-ksvm(target~predictors, data=mydata, kernel="rbfdot", C=1)
• target: the outcome variable that we want to predict.
• predictors: the features that the prediction is based on. In this function we can use “.” to
represent all the variables in the dataset again.
• data: the training dataset in which the target and predictors can be found.
• kernel: the kernel mapping we want to use. By default it is the radial basis function (rbfdot).
• C: a number that specifies the cost of misclassification.
Let’s install the package and play with the data now.
Here, we used all the variables other than the Species in the dataset as predictors. We also used
kernel vanilladot that is the linear kernel in this model. We get a training error less than 0.02.
Step 4 - evaluating model performance
Our old friend, the predict() function, is used again to make predictions. Here we have a factor
outcome, so we need the command table() to show us how well the predictions and the actual data match.
We can see that only 1-2 cases may be misclassified as Iris versicolor. The species of the majority
of the flowers are correctly identified.
To see the results more clearly, we can use a proportional table to show the agreement of the
categories.
Here == means “equal to”. Over 90% of predictions are correct. Nevertheless, is there any chance
that we can improve the outcome? What if we try a Gaussian kernel?
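The whole workflow for the linear kernel can be sketched as follows. It assumes the kernlab package is installed (the modeling part is skipped otherwise); the object names iris_train, iris_test, iris_clas, and iris_pred are chosen for illustration, and the exact counts depend on the random split.

```r
set.seed(12)

# 75%-25% split of the built-in iris data
n   <- nrow(iris)
sub <- sample(n, floor(n * 0.75))
iris_train <- iris[sub, ]
iris_test  <- iris[-sub, ]

# The modeling part assumes kernlab is installed; it is skipped otherwise
if (requireNamespace("kernlab", quietly = TRUE)) {
  library(kernlab)

  # Linear-kernel SVM using all other variables as predictors
  iris_clas <- ksvm(Species ~ ., data = iris_train, kernel = "vanilladot")
  iris_pred <- predict(iris_clas, iris_test)

  # Confusion table: predicted species versus actual species
  print(table(iris_pred, iris_test$Species))

  # Proportion of test cases where prediction and truth agree
  agreement <- iris_pred == iris_test$Species
  print(prop.table(table(agreement)))
}
```

Swapping kernel = "vanilladot" for kernel = "rbfdot" in the same call fits the Gaussian-kernel model for comparison.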
Step 5 - RBF kernel function
The linear kernel is the simplest one, but usually not the best one. Let’s try the RBF (Radial Basis
Function, or “Gaussian”) kernel instead.
Unfortunately, the model performance is actually worse than the previous one (you might get slightly
different results). This is because the Iris data are close to linearly separable, so the linear kernel
already fits them well. In practice, we could try
alternative kernel functions and see which one fits the dataset the best.
Parameter Tuning
We can tune the SVM using the tune.svm function in the package e1071.
Further, we can draw a cv plot to gauge the model performance:

Improving the performance of Gaussian kernels
Now, let’s attempt to improve the performance of a Gaussian kernel by tuning:
We observe that the model achieves a better prediction now.
Problem 2: Quality of Life and Chronic Disease
Let’s load the data first. In this case study, we want to use the variable CHARLSONSCORE as our
target variable.
Delete the first two columns (we don’t need ID variables) and the rows that have missing values in
CHARLSONSCORE (where CHARLSONSCORE equals “-9”). The expression !qol$CHARLSONSCORE==-9 selects
all the rows that have CHARLSONSCORE not equal to -9; the exclamation sign means “not”, so matching
rows are excluded. Also, we need to convert our categorical variable CHARLSONSCORE into a factor.
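The cleaning step can be sketched on a tiny made-up data frame. Only CHARLSONSCORE is taken from the text; the other column names and all values are placeholders, not the real quality-of-life data.

```r
# Made-up stand-in for the qol data; the first two columns play the role of ID variables
qol <- data.frame(
  ID            = 1:6,
  INTERVIEWDATE = 0,
  AGE           = c(61, 70, 55, 48, 66, 73),
  CHARLSONSCORE = c(0, 1, -9, 2, 0, -9)
)

# Keep rows where CHARLSONSCORE is not -9 and drop the first two columns
qol <- qol[!qol$CHARLSONSCORE == -9, -c(1, 2)]

# Treat the score as a categorical outcome
qol$CHARLSONSCORE <- as.factor(qol$CHARLSONSCORE)

str(qol)
```

In R, == is evaluated before !, so !qol$CHARLSONSCORE == -9 means "not (score equals -9)".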
Now the dataset is ready. First, separate the dataset into training and test datasets using a 75%-25%
split. Then, build an SVM model using all other variables in the dataset as predictor variables. Try
different costs of misclassification: rather than the default C=1, we use C=2 and C=3.
See how the model behaves. Here we utilize the radial basis kernel.
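The split-and-compare workflow can be sketched as follows. Because the qol file is not reproduced here, the sketch runs on the built-in iris data as a stand-in; with the real data, the formula would be CHARLSONSCORE ~ . and the data frames would come from qol. The kernlab package, the seed, and the helper svm_accuracy() are assumptions for illustration.

```r
library(kernlab)

set.seed(2017)  # arbitrary seed
# 75%-25% split (iris is a stand-in for the cleaned qol data)
train_idx <- sample(nrow(iris), floor(0.75 * nrow(iris)))
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# Hypothetical helper: fit an RBF-kernel SVM with misclassification cost C
# and return the share of correct predictions on the test set
svm_accuracy <- function(cost) {
  fit  <- ksvm(Species ~ ., data = train_set, kernel = "rbfdot", C = cost)
  pred <- predict(fit, test_set)
  mean(as.character(pred) == as.character(test_set$Species))
}

# Compare the default cost (C = 1) with C = 2 and C = 3
acc_by_cost <- sapply(c(C1 = 1, C2 = 2, C3 = 3), svm_accuracy)
print(acc_by_cost)
```

A larger C penalizes misclassified training points more heavily, which narrows the margin; whether that helps on the test set has to be checked empirically.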
9 Appendix
Below is some additional R code demonstrating various results reported in this Chapter.
Research:
List down new examples of the following:
 data sets using input and output

 csv file
Analysis
Choose at least 2 values from each of the inputs you listed, then write out the variable information
and perform the conversion.
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________.
Action:
Design your own data selection and manipulation.

POST TEST
A. Enumeration (10 points)
1. Give the steps in solving linear regressions.

2. Give the steps in solving multiple regressions.
B. Lab/Actual Activities (400 points)

i. Perform the following activities on your machine (create your own dataset):
   1. Week 1
      i. Give the steps in solving linear regressions.
   2. Week 2
      i. Give the steps in solving multiple regressions.
   3. Week 3
      i. Topic: Neural networks
         i. Using the R language, perform case study number 1: Google Trends and the Stock Market.
         ii. Provide screenshots of your output.
   4. Week 4
      i. Presentation of project
         i. Choose one of the topics (forecasting, neural networks, or regression). Apply it to
            one actual case study and provide a detailed explanation.
         ii. Provide a screenshot or video recording of your project.
CLOSING PRAYER
May God the Father bless us.

May God the Son heal us.
May God the Holy Spirit enlighten us,
and give us eyes to see with,
ears to hear with,
hands to do the work of God with,
feet to walk with,
a mouth to preach the word of salvation with,
and the angel of peace to watch over us and lead us at last,
by our Lord's gift,
to the Kingdom.
Amen.
RUBRICS
Programming Rubrics
Mathematics Rubrics
Each category is scored from 4 (highest) to 1 (lowest).

Category: Neatness and organization
4: The work is presented in a neat, clear, organized fashion that is easy to read.
3: The work is presented in a neat and organized fashion that is usually easy to read.
2: The work is presented in an organized fashion but may be hard to read at times.
1: The work appears sloppy and unorganized. It is hard to know what information goes together.

Category: Understanding
4: I got it!! I did it in new ways and showed you how it worked. I can tell you what math concepts are used.
3: I got it. I understood the problem and have an appropriate solution. All parts of the problem are addressed.
2: I understood parts of the problem. I got started, but I couldn’t finish.
1: I did not understand the problem.

Category: Strategy & Procedures
4: Typically, uses an efficient and effective strategy to solve the problem(s).
3: Typically, uses an effective strategy to solve the problem(s).
2: Sometimes uses an effective strategy to solve problems, but does not do it consistently.
1: Rarely uses an effective strategy to solve problems.

Category: Mathematical Errors
4: 90-100% of the steps and solutions have no mathematical errors.
3: Almost all (85-89%) of the steps and solutions have no mathematical errors.
2: Most (75-84%) of the steps and solutions have no mathematical errors.
1: More than 75% of the steps and solutions have mathematical errors.

Category: Completion
4: All problems are completed.
3: All but one of the problems are completed.
2: All but two of the problems are completed.
1: Several of the problems are not completed.
This module was developed based on the following references:
1. Dinov, ID. (2018) Data Science and Predictive Analytics: Biomedical and Health Applications using R,
Springer (ISBN 978-3-319-72346-4).
2. DSPA Book downloads (5M, as of May 2020).
