
MODEL EVALUATION

D. Deva Hema (AP/CSE)
OVERVIEW

• Choosing and evaluating models
• Validating models
• Memorization methods
• Linear regression
• Logistic regression
• Unsupervised methods: Cluster analysis
Schematic model construction and evaluation
Mapping problems to machine learning tasks

• Predicting what customers might buy, based on past transactions
• Identifying fraudulent transactions
• Determining price elasticity of various
products or product classes
• Determining the best way to present product
listings when a customer searches for an item
Mapping problems to machine learning tasks

• Customer segmentation: grouping customers with similar purchasing behavior
• AdWord valuation: how much the company
should spend to buy certain AdWords on
search engines
• Evaluating marketing campaigns
• Organizing new products into a product
catalog
Mapping problems to machine learning
tasks
• Solving classification problems
• Solving scoring problems
• Working without known targets
• Problem to method mapping
Solving classification problems
Some common classification methods
• Naïve Bayes
• Decision Trees
• Logistic Regression
• Support vector machine
Solving scoring problems
Solving scoring problems
• Linear regression
• Logistic regression
WORKING WITHOUT KNOWN TARGETS

• To find patterns or relationships in the data
• Unsupervised learning
UNSUPERVISED METHODS
• K-means clustering
• Apriori algorithm for finding association rules
• Nearest Neighbor
Problem to Method Mapping
Evaluating the Models
Evaluating the Models
• Classification
• Scoring
• Probability estimation
• Ranking
• Clustering
Evaluating the classification Models
Confusion Matrix:

• The confusion matrix is the technique we use to measure the performance of classification models.
• The confusion matrix is a table in which each row represents an actual class and each column represents a predicted class.
Model evaluation
Confusion Matrix (rows = actual, columns = predicted):

                      Predicted Positive    Predicted Negative
Actual Positive              TP                     FN
Actual Negative              FP                     TN

Example:

                      Predicted Positive    Predicted Negative
Actual Positive               5                      0
Actual Negative               1                      9
Model evaluation
Accuracy
 Accuracy is defined as the ratio of the total number of correct predictions to the total number of samples.

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
Model evaluation
Precision
 Precision is the fraction of predicted positives that are actually positive.

Precision = True Positives / (True Positives + False Positives)

Recall
 Recall, or the true positive rate, is defined as the ratio of true positives to the total number of true positives and false negatives.

Recall = True Positives / (True Positives + False Negatives)


Model evaluation
 F1 Score

The F1 score is a single measure that combines precision and recall (their harmonic mean).

F1 Score = (2 * Precision * Recall) / (Precision + Recall)

Misclassification Rate or Error Rate

 The error rate is defined as the ratio of the total number of misclassified samples to the total number of samples (equivalently, 1 - Accuracy).

Error rate = (False Positives + False Negatives) / Total number of samples


Model evaluation
Sensitivity (true positive rate) = TP / (TP + FN)

Specificity (true negative rate) = TN / (TN + FP)

False positive rate = FP / (FP + TN)
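
A minimal R sketch of these formulas, using the example counts from the confusion matrix above (TP = 5, FN = 0, FP = 1, TN = 9); the variable names are illustrative.

# Example counts read from the confusion matrix above
tp <- 5; fn <- 0; fp <- 1; tn <- 9

accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # 14/15, about 0.933
precision   <- tp / (tp + fp)                    # 5/6, about 0.833
recall      <- tp / (tp + fn)                    # 1.0 (sensitivity / TPR)
f1          <- 2 * precision * recall / (precision + recall)
error_rate  <- (fp + fn) / (tp + tn + fp + fn)   # 1 - accuracy
specificity <- tn / (tn + fp)                    # true negative rate
fpr         <- fp / (fp + tn)                    # 1 - specificity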
EVALUATING THE SCORING MODEL

• Scoring models are evaluated through the residuals: the differences between the predictions f(x[i]) and the actual outcomes y[i].
EVALUATING THE SCORING MODEL
# Fit a simple linear model and plot predictions, actuals, and residuals
d <- data.frame(y = (1:10)^2, x = 1:10)
model <- lm(y ~ x, data = d)
d$prediction <- predict(model, newdata = d)
library('ggplot2')
ggplot(data = d) +
  geom_point(aes(x = x, y = y)) +
  geom_line(aes(x = x, y = prediction), color = 'blue') +
  geom_segment(aes(x = x, y = prediction, xend = x, yend = y)) +
  scale_y_continuous('')
EVALUATING THE SCORING MODEL
ROOT MEAN SQUARE ERROR (RMSE)
• RMSE is the square root of the average squared difference between the predictions and the actual values.
• The RMSE is in the same units as the y-values.

sqrt(mean((d$prediction - d$y)^2))
R-SQUARED (coefficient of determination)

1 - sum((d$prediction - d$y)^2) / sum((mean(d$y) - d$y)^2)

• R-squared can be thought of as a normalized version of RMSE.
• Under certain circumstances, R-squared is equal to the square of another measure called the correlation.
CORRELATION

• Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people.
• Common correlation methods: Pearson, Spearman, and Kendall
CORRELATION
• Pearson's correlation coefficient (r) is a
measure of the strength of the association
between the two variables.
• The first step in studying the relationship
between two continuous variables is to draw a
scatter plot of the variables to check for
linearity.
CORRELATION
• The Spearman correlation method computes the correlation between the rank of x and the rank of y variables.
• The Kendall rank correlation coefficient is a statistic used to measure the ordinal association between two measured quantities.
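
A short sketch comparing the three methods on the scoring data frame d from the earlier slides (cor() selects the method via its method argument):

# Continuing with the data frame d (x, y, prediction) from the scoring example
cor(d$prediction, d$y, method = "pearson")   # linear association
cor(d$prediction, d$y, method = "spearman")  # rank-based association
cor(d$prediction, d$y, method = "kendall")   # ordinal association

# For this linear model, squaring the Pearson correlation between the
# prediction and the outcome reproduces R-squared
cor(d$prediction, d$y)^2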
ABSOLUTE ERROR

• Absolute error: sum(abs(d$prediction - d$y))
• Mean absolute error: sum(abs(d$prediction - d$y)) / length(d$y)
• Relative absolute error: sum(abs(d$prediction - d$y)) / sum(abs(d$y))
Evaluating probability models

• Probability models are useful for both classification and scoring tasks.
• Probability models are models that both decide if an item is in a given class and return an estimated probability (or confidence) of the item being in the class.
• The modeling techniques of logistic regression and decision trees are fairly famous for being able to return good probability estimates.
THE RECEIVER OPERATING CHARACTERISTIC
CURVE
• The ROC curve plots the true positive rate against the false positive rate as the classification threshold is varied.
• The curve represents every possible trade-off between sensitivity and specificity that is available for this classifier.
ROC Curve
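
A hedged sketch of how the curve can be traced by hand: sweep a probability threshold and record the resulting TPR and FPR. It assumes a scored data frame with a logical truth column y and a predicted-probability column pred (illustrative names, not from the slides).

# scored: data frame with a logical truth column y and a probability column pred
roc_points <- function(scored, thresholds = seq(0, 1, by = 0.01)) {
  t(sapply(thresholds, function(th) {
    predPos <- scored$pred >= th
    c(threshold = th,
      tpr = sum(predPos & scored$y)  / sum(scored$y),    # sensitivity
      fpr = sum(predPos & !scored$y) / sum(!scored$y))   # 1 - specificity
  }))
}

# Example with small synthetic scores
scored <- data.frame(y    = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
                     pred = c(0.9, 0.7, 0.6, 0.4, 0.3, 0.1))
curve <- as.data.frame(roc_points(scored))
plot(curve$fpr, curve$tpr, type = "l",
     xlab = "false positive rate", ylab = "true positive rate")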
LOG LIKELIHOOD

• The log likelihood is the logarithm of the product of the probabilities the model assigned to each example (equivalently, the sum of the log probabilities).
• For a spam email with an estimated probability of 0.9 of being spam, the contribution to the log likelihood is log(0.9).
• For a non-spam email, the same score of 0.9 contributes log(1 - 0.9).
LOG LIKELIHOOD

# spamTest is assumed to be a scored data frame with the true label (spam)
# and the predicted probability of spam (pred) for each email
sum(ifelse(spamTest$spam == 'spam',
           log(spamTest$pred),
           log(1 - spamTest$pred)))
Deviance
D = -2 * (logLikelihood - S)
• S is the log likelihood of the saturated model.
• The saturated model is a perfect model that returns probability 1 for items in the class and probability 0 for items not in the class.
Akaike Information Criterion (AIC)

AIC = deviance + 2 * numberOfParameters
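
In R, a fitted glm() object reports these quantities directly. A minimal sketch with a stand-in logistic model (the spam model is not available in these slides, so mtcars is used purely for illustration):

# Stand-in logistic model: transmission type (am) as a function of weight (wt)
glmModel <- glm(am ~ wt, data = mtcars, family = binomial)

logLik(glmModel)      # log likelihood of the fitted model
glmModel$deviance     # deviance: -2 * (logLikelihood - S); S = 0 for 0/1 outcomes
AIC(glmModel)         # deviance + 2 * numberOfParameters (2 parameters here)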


Entropy
• Entropy is measured in a unit called bits.

E = sum(-p * log(p, 2))

• p is a vector containing the probability of each possible outcome.
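
A small sketch of the entropy formula, assuming p is a vector of outcome probabilities that sum to 1:

entropy <- function(p) { sum(-p * log(p, 2)) }

entropy(c(0.5, 0.5))   # 1 bit: a fair coin
entropy(c(0.9, 0.1))   # about 0.47 bits: a very predictable outcome
entropy(rep(0.25, 4))  # 2 bits: four equally likely outcomes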
Bayesian Information Criterion (BIC)

BIC = deviance + log(numberOfDataPoints) * numberOfParameters
Evaluating ranking models

• Spearman’s rank correlation coefficient
• The data mining concept of lift
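
A minimal sketch of the rank-correlation check; the scores and relevance values below are illustrative assumptions, not from the slides:

# Illustrative ranking data: true relevance vs. the model's ranking score
truth <- c(5, 4, 4, 3, 2, 1)
score <- c(0.9, 0.8, 0.5, 0.6, 0.3, 0.2)

# Spearman's rank correlation: 1 means the model orders items exactly as the
# truth does, -1 means it orders them in reverse
cor(truth, score, method = "spearman")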
Evaluating clustering models
Clustering random data in the plane:
set.seed(32297)
d <- data.frame(x=runif(100),y=runif(100))
clus <- kmeans(d,centers=5)
d$cluster <- clus$cluster
Calculating the size of each cluster

> table(d$cluster)
1 2 3 4 5
10 27 18 17 28
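
The kmeans object also carries within- and between-cluster sums of squares, which are common ways to judge how tight and well separated the clusters are; a short sketch continuing the example above:

# Continuing with clus and d from the k-means example above
clus$withinss      # within-cluster sum of squares, one value per cluster
clus$tot.withinss  # total within-cluster sum of squares (lower = tighter)
clus$betweenss     # between-cluster sum of squares (higher = better separated)

# Simple plot of the clustering
library(ggplot2)
ggplot(d, aes(x = x, y = y, color = factor(cluster))) + geom_point()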
Validating the models
Common model problems
• Bias
• Overfit
• Variance
• Non-significance
Memorization Models
• The simplest methods in data science are what we call memorization methods.
 – Building single-variable models
 – Cross-validated variable selection
 – Building basic multivariable models
 – Starting with decision trees, nearest neighbor, and naive Bayes models
Building single-variable model

Linear Regression:
• Regression analysis is a very widely used statistical tool to establish a relationship model between two variables.
• One of these variables is called the predictor variable, whose values are gathered through experiments.
• The other variable is called the response variable, whose values are derived from the predictor variable.
Building single-variable model
• The general mathematical equation for a linear regression is:

y = ax + b

Following is the description of the parameters used:
• y is the response variable.
• x is the predictor variable.
• a and b are constants which are called the coefficients.

Building single-variable model
lm() Function
This function creates the relationship model between the predictor and the response variable.

Syntax
The basic syntax for the lm() function in linear regression is:

lm(formula, data)

Following is the description of the parameters used:
• formula describes the relationship between x and y.
• data is the data frame on which the formula will be applied.
Building single-variable model
# Predictor (height) and response (weight) vectors
H <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
W <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(W ~ H)

print(relation)
Building single-variable model

OUTPUT

Call:
lm(formula = W ~ H)

Coefficients:
(Intercept)            H
   -38.4551       0.6746
Building single-variable model
H <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
W <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function and print a full summary.
relation <- lm(W ~ H)

print(relation)
print(summary(relation))
Building single-variable model
OUTPUT
Call:
lm(formula = W ~ H)

Residuals:
    Min      1Q  Median      3Q     Max
-6.3002 -1.6629  0.0412  1.8944  3.9775

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509    8.04901  -4.778  0.00139 **
H             0.67461    0.05191  12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
Building single-variable model
predict() Function
Syntax
The basic syntax for predict() in linear regression is:

predict(object, newdata)

Following is the description of the parameters used:
• object is the model that has already been created using the lm() function.
• newdata is a data frame containing the new values for the predictor variable.
Building single-variable model
Predict the weight of new persons

# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y ~ x)

# Find the weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
Building single-variable model
OUTPUT:
1
76.22869
Logistic Regression
• A relationship between a predictor variable and a categorical response variable can be determined by a technique known as logistic regression.
Types:
1. Binary logistic regression: binary response
2. Nominal logistic regression: three or more categories with no natural ordering to the levels
3. Ordinal logistic regression: three or more categories with a natural ordering to the levels
Logistic Regression
• Here are some examples of binary classification
problems:
• Spam Detection : Predicting if an email is Spam or not
• Credit Card Fraud : Predicting if a given credit card
transaction is fraud or not
• Health : Predicting if HIV is positive or negative
• Marketing : Predicting if a given user will buy an
insurance product or not
• Banking : Predicting if a customer will default on a loan.
Logistic Regression
Syntax
The basic syntax for the glm() function in logistic regression is:

glm(formula, data, family)

Following is the description of the parameters used:
• formula is the symbol presenting the relationship between the variables.
• data is the data set giving the values of these variables.
• family is the R object used to specify the details of the model. Its value is binomial for logistic regression.
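
The slides do not include a worked glm() example, so here is a hedged sketch using the built-in mtcars data, predicting the transmission type (am, 0 or 1) from weight and horsepower; the choice of variables is illustrative.

# Binary logistic regression: P(am = 1) as a function of wt and hp
logit_input <- mtcars[, c("am", "wt", "hp")]
logit_model <- glm(am ~ wt + hp, data = logit_input, family = binomial)
print(summary(logit_model))

# Predicted probability of a manual transmission for a hypothetical new car
newcar <- data.frame(wt = 2.5, hp = 110)
predict(logit_model, newdata = newcar, type = "response")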
Building multiple-variable model
• Multiple regression is an extension of linear regression to relationships between more than two variables.
• In a simple linear relation we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.
Building multiple-variable model
The general mathematical equation for multiple regression is:

y = a + b1*x1 + b2*x2 + ... + bn*xn

Following is the description of the parameters used:
• y is the response variable.
• a, b1, b2, ..., bn are the coefficients.
• x1, x2, ..., xn are the predictor variables.
Building multiple-variable model
Syntax
The basic syntax for the lm() function in multiple regression is:

lm(y ~ x1 + x2 + x3 ..., data)

Following is the description of the parameters used:
• The formula is a symbol presenting the relation between the response variable and the predictor variables.
• data is the data frame on which the formula will be applied.
Building multiple-variable model
• Consider the data set "mtcars" available in the R environment. It gives a comparison between different car models in terms of mileage per gallon ("mpg"), number of cylinders ("cyl"), displacement ("disp"), horsepower ("hp"), weight of the car ("wt") and some more parameters.

• The goal of the model is to establish the relationship between "mpg" as a response variable and "disp", "hp" and "wt" as predictor variables. We create a subset of these variables from the mtcars data set for this purpose.
Building multiple-variable model
input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))
When we execute the above code, it produces the following
result −
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460
Building multiple-variable model
# Create the relationship model.
model <- lm(mpg~disp+hp+wt, data = input)
# Show the model.
print(model)
 
# Get the Intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ","\n")
 
a <- coef(model)[1]
print(a)
Building multiple-variable model
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
 
print(Xdisp)
print(Xhp)
print(Xwt)
 
Building multiple-variable model
OUTPUT
Call:
lm(formula = mpg ~ disp + hp + wt, data = input)
 
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
 
# # # # The Coefficient Values # # #
(Intercept)
37.10551
disp
-0.0009370091
hp
-0.03115655
wt
-3.800891
Building multiple-variable model
• Create Equation for Regression Model
• Based on the above intercept and coefficient values, we create the mathematical equation:

Y = 37.1055 + (-0.000937)*x1 + (-0.0312)*x2 + (-3.8009)*x3

For a car with disp = 221, hp = 102 and wt = 2.91, the predicted mileage is:

Y = 37.1055 + (-0.000937)*221 + (-0.0312)*102 + (-3.8009)*2.91 ≈ 22.66
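
The same prediction can be checked directly with predict(), reusing the model object fitted on the earlier slide; small differences from the hand calculation are only due to rounding.

# Predicted mpg for disp = 221, hp = 102, wt = 2.91 using the fitted model
newcar <- data.frame(disp = 221, hp = 102, wt = 2.91)
predict(model, newdata = newcar)   # about 22.66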
UNSUPERVISED LEARNING

OVERVIEW
• Using R’s clustering functions to explore data and
look for similarities
• Choosing the right number of clusters
• Evaluating a clustering
• Using R’s association rules functions to find
patterns of co-occurrence in data
• Evaluating a set of association rules
CLUSTER ANALYSIS
• The goal is to group the observations in your
data into clusters.
Two types of clustering:
1. Hierarchical clustering
2. K-means clustering
Distances
• In order to cluster, you need the notions of
similarity and dissimilarity.
CLUSTER ANALYSIS
• Euclidean distance
• Hamming distance
• Manhattan (city block) distance
• Cosine similarity
CLUSTER ANALYSIS
EUCLIDEAN DISTANCE
• The most common distance is Euclidean
distance. The Euclidean distance between two
vectors x and y is defined as
edist(x, y) <- sqrt((x[1]-y[1])^2 + (x[2]-y[2])^2
+ ...)
CLUSTER ANALYSIS
HAMMING DISTANCE
• For categorical variables (male/female, or small/medium/large), you can define the distance as 0 if two points are in the same category, and 1 otherwise. If all the variables are categorical, then you can use Hamming distance, which counts the number of mismatches:

hdist(x, y) <- sum((x[1] != y[1]) + (x[2] != y[2]) + ...)

• Here, a != b is defined to have a value of 1 if the expression is true, and a value of 0 if the expression is false.
CLUSTER ANALYSIS
MANHATTAN (CITY BLOCK) DISTANCE
• Manhattan distance measures distance as the number of horizontal and vertical units it takes to get from one (real-valued) point to the other (no diagonal moves):

mdist(x, y) <- sum(abs(x[1]-y[1]) + abs(x[2]-y[2]) + ...)
CLUSTER ANALYSIS
COSINE SIMILARITY
• Cosine similarity is a common similarity metric in text analysis. It measures the smallest angle between two vectors (the angle theta between two vectors is assumed to be between 0 and 90 degrees).
• Two perpendicular vectors (theta = 90 degrees) are the most dissimilar; the cosine of 90 degrees is 0.
• The cosine of 0 degrees (two vectors pointing in the same direction) is 1.
CLUSTER ANALYSIS
dot(x, y) <- sum( x[1]*y[1] + x[2]*y[2] + ... )
cossim(x, y) <- dot(x, y)/(sqrt(dot(x,x)*dot(y,y)))
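
A small runnable version of the two pseudocode definitions above, with illustrative vectors:

dot <- function(x, y) sum(x * y)
cossim <- function(x, y) dot(x, y) / sqrt(dot(x, x) * dot(y, y))

cossim(c(1, 0), c(0, 1))   # perpendicular vectors: 0
cossim(c(1, 2), c(2, 4))   # parallel vectors: 1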
HIERARCHICAL CLUSTERING
• Hierarchical clustering is an alternative approach which builds a hierarchy from the bottom up, and doesn't require us to specify the number of clusters beforehand.
The algorithm works as follows:
• Put each data point in its own cluster.
• Identify the closest two clusters and combine them into one cluster.
• Repeat the above step until all the data points are in a single cluster.
HIERARCHICAL CLUSTERING EXAMPLE

set.seed(123)
data <- data.frame(x = sample(1:10000, 7),
                   y = sample(1:10000, 7),
                   z = sample(1:10000, 7))
data
HIERARCHICAL CLUSTERING EXAMPLE

     x    y    z
1 2876 8925 1030
2 7883 5514 8998
3 4089 4566 2461
4 8828 9566  421
5 9401 4532 3278
6  456 6773 9541
7 5278 5723 8891
HIERARCHICAL CLUSTERING EXAMPLE
# Create our own function according to the Euclidean distance formula
euclidean_distance <- function(p, q) {
  sqrt(sum((p - q)^2))
}

# Check points 4 and 6
euclidean_distance(data[4,], data[6,])  # using our own function
Output:
[1] 12691.16
HIERARCHICAL CLUSTERING EXAMPLE
dist(data, method = "euclidean")
Output:
##          1         2         3         4         5         6
## 2 10009.695
## 3  4745.525  7617.448
## 4  6017.314  9532.925  7184.687
## 5  8180.928  5998.921  5374.569  5816.523
## 6  9106.296  7552.500  8258.083 12691.160 11147.209
## 7  8821.436  2615.560  6640.578  9955.503  7065.648  4977.618
HIERARCHICAL CLUSTERING EXAMPLE

hclust(dist(data, method = "euclidean"), method = "single")
hclust(dist(data, method = "euclidean"), method = "complete")
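
The hclust() result can be inspected as a dendrogram and cut into a chosen number of clusters; a short sketch continuing the example above:

# Continuing with the small data frame from the example above
hc <- hclust(dist(data, method = "euclidean"), method = "complete")
plot(hc)             # draw the dendrogram
cutree(hc, k = 3)    # assign each of the 7 points to one of 3 clusters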
