
MODEL EVALUATION

D. Deva Hema (AP/CSE)
OVERVIEW

• Choosing and evaluating models
• Validating models
• Memorization methods
• Linear regression
• Logistic regression
• Unsupervised methods: Cluster analysis
Schematic model construction and evaluation
Mapping problems to machine learning tasks

• Predicting what customers might buy, based on past transactions
• Identifying fraudulent transactions
• Determining price elasticity of various
products or product classes
• Determining the best way to present product
listings when a customer searches for an item
Mapping problems to machine learning tasks

• Customer segmentation: grouping customers with similar purchasing behavior
• AdWord valuation: how much the company
should spend to buy certain AdWords on
search engines
• Evaluating marketing campaigns
• Organizing new products into a product
catalog
Mapping problems to machine learning
tasks
• Solving classification problems
• Solving scoring problems
• Working without known targets
• Problem to method mapping
Solving classification problems
Some common classification methods
• Naïve Bayes
• Decision Trees
• Logistic Regression
• Support vector machine
Solving scoring problems
Solving scoring problems
• Linear regression
• Logistic regression
WORKING WITHOUT KNOWN TARGETS

• To find patterns or relationships in the data
• Unsupervised learning
UNSUPERVISED METHODS
• K-means clustering
• Apriori algorithm for finding association rules
• Nearest Neighbor
Problem to Method Mapping
Evaluating the Models
Evaluating the Models
• Classification
• Scoring
• Probability estimation
• Ranking
• Clustering
Evaluating the classification Models
Confusion Matrix:

• The confusion matrix is the technique we use to measure the performance of classification models.
• The confusion matrix is a table in which each row represents an actual class and each column represents a predicted class.
Model evaluation
Confusion Matrix (rows = actual, columns = predicted):

                      Predicted Positive    Predicted Negative
Actual Positive              TP                     FN
Actual Negative              FP                     TN

Example:

                      Predicted Positive    Predicted Negative
Actual Positive               5                      0
Actual Negative               1                      9
Model evaluation
Accuracy
 Accuracy is defined as the ratio of the total number of correct predictions to the total number of samples.

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
Model evaluation
Precision
 Precision is the fraction of predicted positives that are actually positive.

Precision = True Positives / (True Positives + False Positives)

Recall
 Recall, or the true positive rate, is defined as the ratio of true positives to the total number of true positives and false negatives.

Recall = True Positives / (True Positives + False Negatives)


Model evaluation
 F1 Score

The F1 score is a single measure that combines precision and recall (their harmonic mean).

F1 Score = (2 * Precision * Recall) / (Precision + Recall)

Misclassification Rate or Error Rate

 The error rate is defined as the ratio of the total number of misclassified samples to the total number of samples (equivalently, 1 - Accuracy).

Error rate = (False Positives + False Negatives) / Total number of samples


Model evaluation
Sensitivity (true positive rate) = TP / (TP + FN)

Specificity (true negative rate) = TN / (TN + FP)

False positive rate = FP / (FP + TN)
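
A minimal R sketch of these formulas, using the example counts from the confusion matrix above (TP = 5, FN = 0, FP = 1, TN = 9); the variable names are illustrative.

# Example counts read from the confusion matrix above
tp <- 5; fn <- 0; fp <- 1; tn <- 9

accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # 14/15, about 0.933
precision   <- tp / (tp + fp)                    # 5/6, about 0.833
recall      <- tp / (tp + fn)                    # 1.0 (sensitivity / TPR)
f1          <- 2 * precision * recall / (precision + recall)
error_rate  <- (fp + fn) / (tp + tn + fp + fn)   # 1 - accuracy
specificity <- tn / (tn + fp)                    # true negative rate
fpr         <- fp / (fp + tn)                    # 1 - specificity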
EVALUATING THE SCORING MODEL

• Scoring models are evaluated through the residuals: the differences between the predictions f(x[i]) and the actual outcomes y[i].
EVALUATING THE SCORING MODEL
# Fit a simple linear model and plot predictions, actuals, and residuals
d <- data.frame(y = (1:10)^2, x = 1:10)
model <- lm(y ~ x, data = d)
d$prediction <- predict(model, newdata = d)
library('ggplot2')
ggplot(data = d) +
  geom_point(aes(x = x, y = y)) +
  geom_line(aes(x = x, y = prediction), color = 'blue') +
  geom_segment(aes(x = x, y = prediction, xend = x, yend = y)) +
  scale_y_continuous('')
EVALUATING THE SCORING MODEL
ROOT MEAN SQUARE ERROR (RMSE)
• RMSE is the square root of the average squared difference between the predictions and the actual values.
• The RMSE is in the same units as the y-values.

sqrt(mean((d$prediction - d$y)^2))
R-SQUARED (coefficient of determination)

1 - sum((d$prediction - d$y)^2) / sum((mean(d$y) - d$y)^2)

• R-squared can be thought of as a normalized version of RMSE.
• Under certain circumstances, R-squared is equal to the square of another measure called the correlation.
CORRELATION

• Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people.
• Common correlation methods: Pearson, Spearman, and Kendall
CORRELATION
• Pearson's correlation coefficient (r) is a
measure of the strength of the association
between the two variables.
• The first step in studying the relationship
between two continuous variables is to draw a
scatter plot of the variables to check for
linearity.
CORRELATION
• The Spearman correlation method computes the correlation between the rank of x and the rank of y variables.
• The Kendall rank correlation coefficient is a statistic used to measure the ordinal association between two measured quantities.
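
A short sketch comparing the three methods on the scoring data frame d from the earlier slides (cor() selects the method via its method argument):

# Continuing with the data frame d (x, y, prediction) from the scoring example
cor(d$prediction, d$y, method = "pearson")   # linear association
cor(d$prediction, d$y, method = "spearman")  # rank-based association
cor(d$prediction, d$y, method = "kendall")   # ordinal association

# For this linear model, squaring the Pearson correlation between the
# prediction and the outcome reproduces R-squared
cor(d$prediction, d$y)^2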
ABSOLUTE ERROR

• Absolute error: sum(abs(d$prediction - d$y))
• Mean absolute error: sum(abs(d$prediction - d$y)) / length(d$y)
• Relative absolute error: sum(abs(d$prediction - d$y)) / sum(abs(d$y))
Evaluating probability models

• Probability models are useful for both classification and scoring tasks.
• Probability models are models that both decide if an item is in a given class and return an estimated probability (or confidence) of the item being in the class.
• The modeling techniques of logistic regression and decision trees are fairly famous for being able to return good probability estimates.
THE RECEIVER OPERATING CHARACTERISTIC
CURVE
• The ROC curve plots the true positive rate against the false positive rate as the classification threshold is varied.
• The curve represents every possible trade-off between sensitivity and specificity that is available for this classifier.
ROC Curve
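
A hedged sketch of how the curve can be traced by hand: sweep a probability threshold and record the resulting TPR and FPR. It assumes a scored data frame with a logical truth column y and a predicted-probability column pred (illustrative names, not from the slides).

# scored: data frame with a logical truth column y and a probability column pred
roc_points <- function(scored, thresholds = seq(0, 1, by = 0.01)) {
  t(sapply(thresholds, function(th) {
    predPos <- scored$pred >= th
    c(threshold = th,
      tpr = sum(predPos & scored$y)  / sum(scored$y),    # sensitivity
      fpr = sum(predPos & !scored$y) / sum(!scored$y))   # 1 - specificity
  }))
}

# Example with small synthetic scores
scored <- data.frame(y    = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
                     pred = c(0.9, 0.7, 0.6, 0.4, 0.3, 0.1))
curve <- as.data.frame(roc_points(scored))
plot(curve$fpr, curve$tpr, type = "l",
     xlab = "false positive rate", ylab = "true positive rate")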
LOG LIKELIHOOD

• The log likelihood is the logarithm of the product of the probabilities the model assigned to each example (equivalently, the sum of the log probabilities).
• For a spam email with an estimated probability of 0.9 of being spam, the contribution to the log likelihood is log(0.9).
• For a non-spam email, the same score of 0.9 contributes log(1 - 0.9).
LOG LIKELIHOOD

# spamTest is assumed to be a scored data frame with the true label (spam)
# and the predicted probability of spam (pred) for each email
sum(ifelse(spamTest$spam == 'spam',
           log(spamTest$pred),
           log(1 - spamTest$pred)))
Deviance
D = -2 * (logLikelihood - S)
• S is the log likelihood of the saturated model.
• The saturated model is a perfect model that returns probability 1 for items in the class and probability 0 for items not in the class.
Akaike Information Criterion (AIC)

AIC = deviance + 2 * numberOfParameters
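
In R, a fitted glm() object reports these quantities directly. A minimal sketch with a stand-in logistic model (the spam model is not available in these slides, so mtcars is used purely for illustration):

# Stand-in logistic model: transmission type (am) as a function of weight (wt)
glmModel <- glm(am ~ wt, data = mtcars, family = binomial)

logLik(glmModel)      # log likelihood of the fitted model
glmModel$deviance     # deviance: -2 * (logLikelihood - S); S = 0 for 0/1 outcomes
AIC(glmModel)         # deviance + 2 * numberOfParameters (2 parameters here)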


Entropy
• Entropy is measured in a unit called bits.

E = sum(-p * log(p, 2))

• p is a vector containing the probability of each possible outcome.
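
A small sketch of the entropy formula, assuming p is a vector of outcome probabilities that sum to 1:

entropy <- function(p) { sum(-p * log(p, 2)) }

entropy(c(0.5, 0.5))   # 1 bit: a fair coin
entropy(c(0.9, 0.1))   # about 0.47 bits: a very predictable outcome
entropy(rep(0.25, 4))  # 2 bits: four equally likely outcomes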
Bayesian Information Criterion (BIC)

BIC = deviance + log(numberOfDataPoints) * numberOfParameters
Evaluating ranking models

• Spearman’s rank correlation coefficient
• The data mining concept of lift
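
A minimal sketch of the rank-correlation check; the scores and relevance values below are illustrative assumptions, not from the slides:

# Illustrative ranking data: true relevance vs. the model's ranking score
truth <- c(5, 4, 4, 3, 2, 1)
score <- c(0.9, 0.8, 0.5, 0.6, 0.3, 0.2)

# Spearman's rank correlation: 1 means the model orders items exactly as the
# truth does, -1 means it orders them in reverse
cor(truth, score, method = "spearman")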
Evaluating clustering models
Clustering random data in the plane:
set.seed(32297)
d <- data.frame(x=runif(100),y=runif(100))
clus <- kmeans(d,centers=5)
d$cluster <- clus$cluster
Calculating the size of each cluster

> table(d$cluster)
1 2 3 4 5
10 27 18 17 28
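
The kmeans object also carries within- and between-cluster sums of squares, which are common ways to judge how tight and well separated the clusters are; a short sketch continuing the example above:

# Continuing with clus and d from the k-means example above
clus$withinss      # within-cluster sum of squares, one value per cluster
clus$tot.withinss  # total within-cluster sum of squares (lower = tighter)
clus$betweenss     # between-cluster sum of squares (higher = better separated)

# Simple plot of the clustering
library(ggplot2)
ggplot(d, aes(x = x, y = y, color = factor(cluster))) + geom_point()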
Validating the models
Common model problems
• Bias
• Overfit
• Variance
• Non-significance
Memorization Models
• The simplest methods in data science are what we call memorization methods.
 – Building single-variable models
 – Cross-validated variable selection
 – Building basic multivariable models
 – Starting with decision trees, nearest neighbor, and naive Bayes models
Building single-variable model

Linear Regression:
• Regression analysis is a very widely used statistical tool to establish a relationship model between two variables.
• One of these variables is called the predictor variable, whose values are gathered through experiments.
• The other variable is called the response variable, whose values are derived from the predictor variable.
Building single-variable model
• The general mathematical equation for a linear regression is:

y = ax + b

Following is the description of the parameters used:
• y is the response variable.
• x is the predictor variable.
• a and b are constants which are called the coefficients.

Building single-variable model
lm() Function
This function creates the relationship model between the predictor and the response variable.

Syntax
The basic syntax for the lm() function in linear regression is:

lm(formula, data)

Following is the description of the parameters used:
• formula describes the relationship between x and y.
• data is the data frame on which the formula will be applied.
Building single-variable model
# Predictor (height) and response (weight) vectors
H <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
W <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(W ~ H)

print(relation)
Building single-variable model

OUTPUT

Call:
lm(formula = W ~ H)

Coefficients:
(Intercept)            H
   -38.4551       0.6746
Building single-variable model
H <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
W <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function and print a full summary.
relation <- lm(W ~ H)

print(relation)
print(summary(relation))
Building single-variable model
OUTPUT
Call:
lm(formula = W ~ H)

Residuals:
    Min      1Q  Median      3Q     Max
-6.3002 -1.6629  0.0412  1.8944  3.9775

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509    8.04901  -4.778  0.00139 **
H             0.67461    0.05191  12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
Building single-variable model
predict() Function
Syntax
The basic syntax for predict() in linear regression is:

predict(object, newdata)

Following is the description of the parameters used:
• object is the model that has already been created using the lm() function.
• newdata is a data frame containing the new values for the predictor variable.
Building single-variable model
Predict the weight of new persons

# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y ~ x)

# Find the weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
Building single-variable model
OUTPUT:
1
76.22869
Logistic Regression
• A relationship between a predictor variable and a categorical response variable can be determined by a technique known as logistic regression.
Types:
1. Binary logistic regression: binary response
2. Nominal logistic regression: three or more categories with no natural ordering to the levels
3. Ordinal logistic regression: three or more categories with a natural ordering to the levels
Logistic Regression
• Here are some examples of binary classification
problems:
• Spam Detection : Predicting if an email is Spam or not
• Credit Card Fraud : Predicting if a given credit card
transaction is fraud or not
• Health : Predicting if HIV is positive or negative
• Marketing : Predicting if a given user will buy an
insurance product or not
• Banking : Predicting if a customer will default on a loan.
Logistic Regression
Syntax
The basic syntax for the glm() function in logistic regression is:

glm(formula, data, family)

Following is the description of the parameters used:
• formula is the symbol presenting the relationship between the variables.
• data is the data set giving the values of these variables.
• family is the R object used to specify the details of the model. Its value is binomial for logistic regression.
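
The slides do not include a worked glm() example, so here is a hedged sketch using the built-in mtcars data, predicting the transmission type (am, 0 or 1) from weight and horsepower; the choice of variables is illustrative.

# Binary logistic regression: P(am = 1) as a function of wt and hp
logit_input <- mtcars[, c("am", "wt", "hp")]
logit_model <- glm(am ~ wt + hp, data = logit_input, family = binomial)
print(summary(logit_model))

# Predicted probability of a manual transmission for a hypothetical new car
newcar <- data.frame(wt = 2.5, hp = 110)
predict(logit_model, newdata = newcar, type = "response")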
Building multiple-variable model
• Multiple regression is an extension of linear regression to relationships between more than two variables.
• In a simple linear relation we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.
Building multiple-variable model
The general mathematical equation for multiple regression is:

y = a + b1*x1 + b2*x2 + ... + bn*xn

Following is the description of the parameters used:
• y is the response variable.
• a, b1, b2, ..., bn are the coefficients.
• x1, x2, ..., xn are the predictor variables.
Building multiple-variable model
Syntax
The basic syntax for the lm() function in multiple regression is:

lm(y ~ x1 + x2 + x3 ..., data)

Following is the description of the parameters used:
• The formula is a symbol presenting the relation between the response variable and the predictor variables.
• data is the data frame on which the formula will be applied.
Building multiple-variable model
• Consider the data set "mtcars" available in the R environment. It gives a comparison between different car models in terms of mileage per gallon ("mpg"), number of cylinders ("cyl"), displacement ("disp"), horsepower ("hp"), weight of the car ("wt") and some more parameters.

• The goal of the model is to establish the relationship between "mpg" as a response variable and "disp", "hp" and "wt" as predictor variables. We create a subset of these variables from the mtcars data set for this purpose.
Building multiple-variable model
input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))
When we execute the above code, it produces the following
result −
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460
Building multiple-variable model
# Create the relationship model.
model <- lm(mpg~disp+hp+wt, data = input)
# Show the model.
print(model)
 
# Get the Intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ","\n")
 
a <- coef(model)[1]
print(a)
Building multiple-variable model
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
 
print(Xdisp)
print(Xhp)
print(Xwt)
 
Building multiple-variable model
OUTPUT
Call:
lm(formula = mpg ~ disp + hp + wt, data = input)
 
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
 
# # # # The Coefficient Values # # #
(Intercept)
37.10551
disp
-0.0009370091
hp
-0.03115655
wt
-3.800891
Building multiple-variable model
• Create Equation for Regression Model
• Based on the above intercept and coefficient values, we create the mathematical equation:

Y = 37.1055 + (-0.000937)*x1 + (-0.0312)*x2 + (-3.8009)*x3

For a car with disp = 221, hp = 102 and wt = 2.91, the predicted mileage is:

Y = 37.1055 + (-0.000937)*221 + (-0.0312)*102 + (-3.8009)*2.91 ≈ 22.66
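
The same prediction can be checked directly with predict(), reusing the model object fitted on the earlier slide; small differences from the hand calculation are only due to rounding.

# Predicted mpg for disp = 221, hp = 102, wt = 2.91 using the fitted model
newcar <- data.frame(disp = 221, hp = 102, wt = 2.91)
predict(model, newdata = newcar)   # about 22.66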
UNSUPERVISED LEARNING

OVERVIEW
• Using R’s clustering functions to explore data and
look for similarities
• Choosing the right number of clusters
• Evaluating a clustering
• Using R’s association rules functions to find
patterns of co-occurrence in data
• Evaluating a set of association rules
CLUSTER ANALYSIS
• The goal is to group the observations in your
data into clusters.
Two types of clustering:
1. Hierarchical clustering
2. K-means clustering
Distances
• In order to cluster, you need the notions of
similarity and dissimilarity.
CLUSTER ANALYSIS
• Euclidean distance
• Hamming distance
• Manhattan (city block) distance
• Cosine similarity
CLUSTER ANALYSIS
EUCLIDEAN DISTANCE
• The most common distance is Euclidean
distance. The Euclidean distance between two
vectors x and y is defined as
edist(x, y) <- sqrt((x[1]-y[1])^2 + (x[2]-y[2])^2
+ ...)
CLUSTER ANALYSIS
HAMMING DISTANCE
• For categorical variables (male/female, or small/medium/large), you can define the distance as 0 if two points are in the same category, and 1 otherwise. If all the variables are categorical, then you can use Hamming distance, which counts the number of mismatches:

hdist(x, y) <- sum((x[1] != y[1]) + (x[2] != y[2]) + ...)

• Here, a != b is defined to have a value of 1 if the expression is true, and a value of 0 if the expression is false.
CLUSTER ANALYSIS
MANHATTAN (CITY BLOCK) DISTANCE
• Manhattan distance measures distance as the number of horizontal and vertical units it takes to get from one (real-valued) point to the other (no diagonal moves):

mdist(x, y) <- sum(abs(x[1]-y[1]) + abs(x[2]-y[2]) + ...)
CLUSTER ANALYSIS
COSINE SIMILARITY
• Cosine similarity is a common similarity metric in text analysis. It measures the smallest angle between two vectors (the angle theta between two vectors is assumed to be between 0 and 90 degrees).
• Two perpendicular vectors (theta = 90 degrees) are the most dissimilar; the cosine of 90 degrees is 0.
• The cosine of 0 degrees (two vectors pointing in the same direction) is 1.
CLUSTER ANALYSIS
dot(x, y) <- sum( x[1]*y[1] + x[2]*y[2] + ... )
cossim(x, y) <- dot(x, y)/(sqrt(dot(x,x)*dot(y,y)))
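
A small runnable version of the two pseudocode definitions above, with illustrative vectors:

dot <- function(x, y) sum(x * y)
cossim <- function(x, y) dot(x, y) / sqrt(dot(x, x) * dot(y, y))

cossim(c(1, 0), c(0, 1))   # perpendicular vectors: 0
cossim(c(1, 2), c(2, 4))   # parallel vectors: 1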
HIERARCHICAL CLUSTERING
• Hierarchical clustering is an alternative approach which builds a hierarchy from the bottom up, and doesn't require us to specify the number of clusters beforehand.
The algorithm works as follows:
• Put each data point in its own cluster.
• Identify the closest two clusters and combine them into one cluster.
• Repeat the above step until all the data points are in a single cluster.
HIERARCHICAL CLUSTERING EXAMPLE

set.seed(123)
data <- data.frame(x = sample(1:10000, 7),
                   y = sample(1:10000, 7),
                   z = sample(1:10000, 7))
data
HIERARCHICAL CLUSTERING EXAMPLE

     x    y    z
1 2876 8925 1030
2 7883 5514 8998
3 4089 4566 2461
4 8828 9566  421
5 9401 4532 3278
6  456 6773 9541
7 5278 5723 8891
HIERARCHICAL CLUSTERING EXAMPLE
# Create our own function according to the Euclidean distance formula
euclidean_distance <- function(p, q) {
  sqrt(sum((p - q)^2))
}

# Check points 4 and 6
euclidean_distance(data[4,], data[6,])  # using our own function
Output:
[1] 12691.16
HIERARCHICAL CLUSTERING EXAMPLE
dist(data, method = "euclidean")
Output:
##          1         2         3         4         5         6
## 2 10009.695
## 3  4745.525  7617.448
## 4  6017.314  9532.925  7184.687
## 5  8180.928  5998.921  5374.569  5816.523
## 6  9106.296  7552.500  8258.083 12691.160 11147.209
## 7  8821.436  2615.560  6640.578  9955.503  7065.648  4977.618
HIERARCHICAL CLUSTERING EXAMPLE

hclust(dist(data, method = "euclidean"), method = "single")
hclust(dist(data, method = "euclidean"), method = "complete")
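
The hclust() result can be inspected as a dendrogram and cut into a chosen number of clusters; a short sketch continuing the example above:

# Continuing with the small data frame from the example above
hc <- hclust(dist(data, method = "euclidean"), method = "complete")
plot(hc)             # draw the dendrogram
cutree(hc, k = 3)    # assign each of the 7 points to one of 3 clusters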
