
Overfitting - explaining some variation in the data that was nothing more than chance variation.
We mislabeled the noise in the data as if it were a signal.
Evans' Rule (conservative): n/p > 10 (at least 10 observations per predictor)
Doane's Rule (relaxed): n/p > 5 (at least 5 observations per predictor)
Standardize the data
Training Partition (typically the largest partition) contains the data used to build the various models we are examining. The
same training partition is generally used to develop multiple models.
Validation Partition (sometimes called the test partition) is used to assess the performance of each model so that you can
compare models and pick the best one.
Test Partition (sometimes called the holdout or evaluation partition) is used if we need to assess the performance of the chosen
model with new data.
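A minimal sketch of the three-way partition described above. The 50/30/20 split, the fixed seed, and the function name `partition` are illustrative assumptions, not values from these notes.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Shuffle the records and cut them into training, validation and test partitions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]                    # used to build the models
    valid = shuffled[n_train:n_train + n_valid]   # used to compare models
    test = shuffled[n_train + n_valid:]           # used to assess the chosen model
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Every record lands in exactly one partition, so the test partition stays unseen until the final model is chosen.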
Simple linear regression: Regression analysis involving one independent variable (X) & one dependent variable (Y) in which
the relationship between the variables is approximated by a straight line.

Mean square error (MSE) = SSE/(n-k-1). Residual Standard Error = sqrt(MSE) (#8 on table on next page). MSR = SSR/k (k = # of predictors)



R^2 can be interpreted as the proportion of the variation in the response variable that is explained by the estimated
regression equation.
To test for a significant regression relationship, conduct a hypothesis test to determine whether the value of $\beta_1$ is 0.
H0: $\beta_1 = 0$    H1: $\beta_1 \neq 0$
Reject H0 if p-value < $\alpha$.
Test statistic for hypothesis tests about the slope: $t = b_1 / se(b_1)$, where $se(b_1)$ is the standard error of $b_1$.
Note: degrees of freedom = n-2.
The coded variables are called dummy variables. If a categorical variable has m levels then we have to introduce m-1 dummy
variables in model.
Interpretation of $\beta$s (levels A, B, C with B as the base level): $\beta_0 = \mu_B$ (mean of base level); $\beta_1 = \mu_A - \mu_B$; $\beta_2 = \mu_C - \mu_B$
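A small sketch of dummy coding, assuming a three-level variable (A, B, C) with B chosen as the base level. The helper name `dummy_code` and the `is_` column prefix are hypothetical.

```python
def dummy_code(values, base):
    """Create m-1 dummy variables for a categorical variable with m levels.

    The base level gets no column of its own: it is the row where all dummies are 0.
    """
    levels = sorted(set(values) - {base})
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]

rows = dummy_code(["A", "B", "C", "B"], base="B")
for row in rows:
    print(row)
```

With m = 3 levels only 2 columns are created. A record at the base level B has 0 in both columns.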
1. Backward elimination (Backward stepwise)

2. All-possible-regressions selection procedure (subset selection)

Variable selection can lead to Type I errors (including some unimportant independent variables in the model) or Type II errors
(eliminating some important independent variables).
Adjusted R^2 ($R_a^2$) - the adjusted coefficient of determination penalizes the inclusion of useless predictors.
Regression selection criteria: $R^2$, $R_a^2$, $C_p$. Want the smallest $C_p$.
$R^2 = SSR/SST = 1 - SSE/SST = 1 - (n-k-1)\,MSE/SST$
$R_a^2 = 1 - (n-1)(MSE/SST) = 1 - [(n-1)/(n-k-1)](1 - R^2)$
$C_p = SSE_k / MSE_L + 2(k+1) - n$
$R^2 \geq R_a^2$, and for poor-fitting models $R_a^2$ may be negative. Choose the highest adjusted R^2.
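The selection criteria can be checked numerically with a short sketch. It assumes the standard forms R^2 = 1 - SSE/SST, adjusted R^2 = 1 - [(n-1)/(n-k-1)](1 - R^2) and Mallows' Cp = SSE_k/MSE_L + 2(k+1) - n, reading MSE_L as the MSE of the largest candidate model (an assumption). The input numbers are made up for illustration.

```python
def r_squared(sse, sst):
    """Proportion of variation explained."""
    return 1 - sse / sst

def adjusted_r_squared(sse, sst, n, k):
    """R^2 penalized for the number of predictors k (n = observations)."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r_squared(sse, sst))

def mallows_cp(sse_k, mse_largest, n, k):
    """Mallows' Cp for a k-predictor model, relative to the largest model's MSE."""
    return sse_k / mse_largest + 2 * (k + 1) - n

# Illustrative values only: n = 25 observations, k = 4 predictors.
sse, sst, n, k = 20.0, 100.0, 25, 4
r2 = r_squared(sse, sst)
r2_adj = adjusted_r_squared(sse, sst, n, k)
cp_largest = mallows_cp(sse, sse / (n - k - 1), n, k)
print(r2, r2_adj, cp_largest)
```

When the k-predictor model is itself the largest model, Cp reduces to k + 1, which is a quick sanity check on the formula.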

If we reject H0, we conclude the model is statistically significant.

If we reject the null hypothesis, there is enough evidence to support that at least one of the coefficients is nonzero. The overall
model appears to be statistically useful for predicting y.
If we cannot reject the null hypothesis, there is not enough evidence to support that at least one of the coefficients is nonzero.
The overall model does not appear to be statistically useful for predicting y.

The advantage of k-fold cross-validation is that all the examples in the dataset are eventually used for both training and testing,
and each observation is used for validation exactly once.
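A minimal sketch of how k-fold cross-validation assigns records to folds. The generator name `k_fold_indices` and the simple round-robin assignment (no shuffling) are assumptions made for illustration.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin fold assignment
    for i, valid in enumerate(folds):
        # Training indices are every fold except the one held out for validation.
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, valid

splits = list(k_fold_indices(10, 5))
for train_idx, valid_idx in splits:
    print("train:", train_idx, "validate:", valid_idx)
```

Each index shows up in exactly one validation fold and in the training set of the other k-1 folds.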

k-NN Classifier (Categorical Outcome)

The idea in k-nearest neighbor methods is to identify k records in the training dataset that are similar to the new record that we
wish to classify. We then use these similar (neighboring) records to classify the new record into a class, assigning the new
record to the predominant class among these neighbors.
If we choose k too low, we may be fitting to the noise of the data.
If we choose k too high, we will miss out on the method's ability to capture the local structure in the data.
If k = n, we simply assign all records to the majority class in the training data.
Typically, values of k fall in the range of 1-20.
An odd number is normally chosen to avoid ties.
We partition the data into training data and validation data.
For example, 18 data points for the training data and 6 data
points for the validation data.
Use the training data to classify the records in the validation data, then compute the error rates for various choices of k.
Perform k-fold cross-validation and record the error rate for various choices of k.
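The procedure above (classify validation records from the training data, then compare error rates across k) can be sketched as follows. Euclidean distance, majority vote, and the toy data are assumptions. The function names are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, new_x, k):
    """train is a list of (features, label) pairs; returns the predominant label
    among the k training records closest to new_x."""
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], new_x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def validation_error(train, valid, k):
    """Share of validation records that are misclassified for a given k."""
    wrong = sum(knn_classify(train, x, k) != y for x, y in valid)
    return wrong / len(valid)

# Toy data for illustration only.
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
valid = [((1.5, 1.5), "a"), ((5.5, 5.5), "b")]

for k in (1, 3, 5):
    print(k, validation_error(train, valid, k))
```

In practice the predictors would be standardized first, since the distance is dominated by whichever variable has the largest scale.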
Notes - The default of the cutoff value is 0.5 but it can be set differently. It can be used to classify the response with more than
two classes. It can be applied with a numerical response.
The first step of determining neighbors by computing distances remains unchanged.
The second step is modified such that we take the average response value of the k nearest neighbors to determine the prediction.
The best k can be determined by a measure other than the misclassification rate.
Advantages - It is simple. It does not require parametric assumptions. It performs well with a large enough training set.
Shortcomings - The time to find the nearest neighbors in a large training set can be prohibitive. The number of records
required in the training set to qualify as large increases exponentially with the number of predictors p.
k-NN only works with quantitative variables.
Misclassification rate = wrongly classified records divided by all records: error = (b+c)/(a+b+c+d)
Accuracy = 1 - error = (a+d)/(a+b+c+d)
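A quick arithmetic check of the two formulas, assuming a and d are the correctly classified cells of the confusion matrix and b and c are the misclassified cells. The counts are invented for illustration.

```python
def error_rate(a, b, c, d):
    """Misclassification rate: wrongly classified records over all records."""
    return (b + c) / (a + b + c + d)

def accuracy(a, b, c, d):
    """Share of records classified correctly."""
    return (a + d) / (a + b + c + d)

# Illustrative confusion-matrix counts.
a, b, c, d = 50, 10, 5, 35
err = error_rate(a, b, c, d)
acc = accuracy(a, b, c, d)
print(err, acc)
```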
If the dataset is too small to partition, the results may be unstable.
In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.
If the p-value is greater than alpha (the level of significance) we fail to reject H0; if it is less than alpha, we reject H0.
1-Residuals = The residuals are the difference between the actual values of the variable you're predicting and the predicted values.
2-Significance Stars = Shorthand for significance levels.
3-Estimated Coefficient = The estimated coefficient is the value of the slope calculated by the regression.
4-Standard Error of the Coefficient Estimate = Measure of the variability in the estimate for the coefficient. Lower
means better, but this number is relative to the value of the coefficient.
5-T value of Coefficient Estimate = Score that measures whether or not the coefficient for this variable is meaningful
for the model. You probably won't use this value itself, but know that it is used to calculate the p-value and the
significance levels.
6-Variable P Value = Probability of seeing a t value this extreme if the variable were NOT relevant (true coefficient of 0).
You want this number to be as small as possible.
8-Residual Std Error/Degrees of Freedom = The Residual Std Error is essentially the standard deviation of your residuals.
The Degrees of Freedom is the difference between the number of observations included in your training sample and the
number of variables used in your model (intercept counts as a variable).
9-R^2 = Metric for evaluating the goodness of fit of your model. Higher is better, with 1 being the best. Corresponds with the amount of
variability in what you're predicting that is explained by the model.
10-F-Statistic & resulting p-value = Performs an F-test on the model. This takes the parameters of our model (in our case we only have 1) and
compares it to a model that has fewer parameters. In theory the model with more parameters should fit better. If the model with more
parameters (your model) doesn't perform better than the model with fewer parameters, the F-test will have a high p-value (the boost is
probably NOT significant). If the model with more parameters is better than the model with fewer parameters, you will have a lower p-value.

Root node error is used to calculate the cross validation error.

N=number of records
CP is the complexity parameter. Any split that does not decrease the overall lack of fit by a factor of CP is not attempted (default is 0.01).
Relative Error is found from the whole dataset and it is always the same for different runs.
Xerror can change for different runs

Error Rate (whole dataset) = the root node error times the relative error
= (sum of the smaller number in all square decision nodes) / (Total Records)
(the smaller number in a node is the misclassified count)
10-fold CV error rate = root node error times the xerror
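The conversions from relative to absolute error rates are simple multiplications, sketched below. It assumes the root node error is the number of records misclassified at the root divided by N. The counts and the rel error / xerror values are invented, not taken from any real tree output.

```python
def tree_error_rates(n_records, n_root_misclassified, rel_error, xerror):
    """Convert the relative errors of a pruning table row into absolute error rates."""
    root_node_error = n_root_misclassified / n_records
    return {
        "root_node_error": root_node_error,
        "whole_data_error": root_node_error * rel_error,  # error rate on the whole dataset
        "cv_error": root_node_error * xerror,             # cross-validation error rate
    }

# Illustrative values: 200 records, 40 misclassified at the root node.
rates = tree_error_rates(200, 40, rel_error=0.5, xerror=0.6)
print(rates)
```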
Prune Tree
Step 1: Find the best subtree of each size (1, 2, 3, ...).
Step 2: Pick the tree in the sequence that gives the smallest misclassification error in the validation set.
The idea behind pruning is to recognize that a very large tree is likely to be overfitting the training data and that the weakest branches,
which hardly reduce the error rate, should be removed.
The tree method can also be used for numerical response variables (regression tree).
- Both the principle and the procedure are the same.
- There are three details that are different from the
classification tree.
(i) Prediction
(ii) Impurity measures
(iii) Evaluating performance
Logistic Regression

Logistic regression model explains a relationship between a binary response and predictors
using a logit link function.
Y is used to represent the binary response.
P(Y = 1) or p is the probability of belonging to class 1

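$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$

$p = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}$

$\text{Odds} = \frac{p}{1-p} = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}$, so $p = \frac{\text{odds}}{1 + \text{odds}}$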


If $x_j$ increases by 1 unit, then the odds change by $(e^{\beta_j} - 1)(100)\%$ (holding all other predictors constant).

Odds ratio

$\frac{e^{\beta_0 + \beta_{CD}(1) + \dots + \beta_k x_k}}{e^{\beta_0 + \beta_{CD}(0) + \dots + \beta_k x_k}} = e^{\beta_{CD}}$
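The logistic relationships can be verified numerically with a short sketch. It assumes the usual forms p = e^z / (1 + e^z) with z = β0 + β1 x1 + ... + βk xk, odds = p/(1-p), and an odds ratio of e^β for a 0/1 predictor. The coefficient values are made up.

```python
import math

def probability(beta0, betas, xs):
    """P(Y = 1) from the logistic model."""
    z = beta0 + sum(b * x for b, x in zip(betas, xs))
    return math.exp(z) / (1 + math.exp(z))

def odds(p):
    """Odds of belonging to class 1."""
    return p / (1 - p)

def pct_change_in_odds(beta_j):
    """Percent change in the odds when x_j increases by 1 unit, others held constant."""
    return (math.exp(beta_j) - 1) * 100

# Illustrative coefficients: intercept -1.0 and one 0/1 predictor with beta = 0.7.
p1 = probability(-1.0, [0.7], [1])
p0 = probability(-1.0, [0.7], [0])
odds_ratio = odds(p1) / odds(p0)
print(odds_ratio, math.exp(0.7), pct_change_in_odds(0.7))
```

The odds ratio computed from the two probabilities matches e^0.7, which is why a single coefficient can be read as a multiplicative change in the odds.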
Association Rules


Support = (no. transactions that include both condition and result item sets) / (the total number of records)
        = P(condition and result)

Confidence = (no. transactions that include both condition and result item sets) / (no. transactions with condition item sets)
           = P(condition and result) / P(condition)
           = P(result | condition)

A high value of confidence suggests a strong association rule.

Lift ratio = confidence / P(result)
           = P(condition and result) / [P(condition) P(result)]

A lift ratio greater than 1.0 suggests that there is some usefulness to the rule - the level of
association between the condition and result item sets is higher than would be expected if they
were independent. The larger the lift ratio, the greater the strength of the association.
The support indicates its impact in terms of overall size. If only a small number of transactions are
affected, the rule may be of little use (unless the consequent is very valuable and/or the rule is
very efficient in finding it).
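A small sketch that computes the three rule measures from raw transactions, assuming support = P(condition and result), confidence = P(condition and result) / P(condition), and lift ratio = confidence / P(result). The transactions and the function name `rule_metrics` are invented for illustration.

```python
def rule_metrics(transactions, condition, result):
    """Return (support, confidence, lift ratio) for the rule condition -> result.

    transactions is a list of item sets; condition and result are item sets too.
    """
    n = len(transactions)
    n_cond = sum(condition <= t for t in transactions)             # contain the condition
    n_result = sum(result <= t for t in transactions)              # contain the result
    n_both = sum((condition | result) <= t for t in transactions)  # contain both
    support = n_both / n
    confidence = n_both / n_cond
    lift = confidence / (n_result / n)
    return support, confidence, lift

# Toy transactions for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"butter"},
]
support, confidence, lift = rule_metrics(transactions, {"bread"}, {"milk"})
print(support, confidence, lift)
```

Here the lift ratio comes out slightly above 1, so bread buyers are a little more likely than average to buy milk in this toy data.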
Cluster Analysis
$d_{ij}$ is a distance metric, or dissimilarity measure, between records i and j.
$(x_{i1}, x_{i2}, \dots, x_{ip})$ is the vector of p measurements for record i.
$(x_{j1}, x_{j2}, \dots, x_{jp})$ is the vector of p measurements for record j.
The following properties are required.
$d_{ij} \geq 0$.
$d_{ii} = 0$.
$d_{ij} = d_{ji}$.
Triangle Inequality: $d_{ij} \leq d_{ik} + d_{kj}$.

Euclidean distance: $d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ip} - x_{jp})^2}$

Standardized value: $z = \frac{\text{observed value} - \text{mean}}{\text{std deviation}}$
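A minimal sketch of the two calculations used before clustering: standardizing each measurement as (observed value - mean) / std deviation, and computing the Euclidean distance between two records. Using the sample standard deviation is an assumption, and the numbers are illustrative.

```python
import math
import statistics

def standardize(column):
    """Convert one measurement column to z-scores."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)   # sample standard deviation (assumed)
    return [(value - mean) / sd for value in column]

def euclidean(record_i, record_j):
    """Euclidean distance between two records with p measurements each."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(record_i, record_j)))

z = standardize([1, 2, 3])
d = euclidean((0, 0), (3, 4))
print(z, d)
```

The distance function satisfies the required properties: it is never negative, it is zero from a record to itself, and it is symmetric.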

We explore the characteristics of each cluster
a. Obtaining summary statistics from each
cluster on each measurement that was
used in the cluster analysis
b. Examining the clusters for the presence
of some common feature (variable) that
was not used in the cluster analysis
c. Cluster labeling: based on the
interpretation, trying to assign a name
or label to each cluster.
AIC = L + 2k, where L = residual deviance (-2 x log-likelihood) and k = # of parameters.
The residual deviance is usually given directly, so use L, not 2L.
Lower AIC and BIC are better.
AIC indicates how good the estimates are that maximize the chance of obtaining the data.
AIC gives a penalty to a higher # of predictors.
-Holding the other variables constant, the (response variable) is more/less likely to be in class 1 if
the variable is 1
Root node error x xerror = cross-validation error rate

Combine the cities, then readjust the column, take the function of the combined set, then take the lowest of the new
numbers after the function.
Splitting values: order lowest to highest, then half of the pairs going down in a row?
(Different Note)
Rel. Error = relative error, or misclassification for the tree at that stage. To convert to absolute error,
multiply by root node error.
Yes, take left number add takes right

for adjusted R^2