D. Deva Hema (AP/CSE)
OVERVIEW
Confusion matrix (columns: Actual Values; rows: Predicted Values):

                     Actual positive   Actual negative
Predicted positive         TP                FP
Predicted negative         FN                TN

Example:
Predicted positive          5                 0
Predicted negative          1                 9
Model evaluation
Accuracy
Accuracy is defined as the ratio of the total number of correct predictions to the total number of samples:
Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)
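A minimal check in R, using the counts from the example confusion matrix above:
# counts from the example confusion matrix
TP <- 5; FP <- 0; FN <- 1; TN <- 9
accuracy <- (TP + TN) / (TP + TN + FP + FN)   # 14/15, about 0.933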
Model evaluation
Precision
Precision is the fraction of predicted positives that are actually positive: TP / (TP + FP).
Recall
Recall is the fraction of actual positives that are correctly identified: TP / (TP + FN).
Specificity
Specificity is the fraction of actual negatives that are correctly identified: TN / (TN + FP).
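A quick check in R, reusing the TP/FP/FN/TN counts from the accuracy example above:
precision   <- TP / (TP + FP)   # 5/5 = 1.0
recall      <- TP / (TP + FN)   # 5/6, about 0.833
specificity <- TN / (TN + FP)   # 9/9 = 1.0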
# log likelihood of the spam model on holdout data: sum the log of the
# predicted probability assigned to each item's actual class
sum(ifelse(spamTest$spam == 'spam',
           log(spamTest$pred),
           log(1 - spamTest$pred)))
Deviance
D = -2 * (logLikelihood - S)
• S is the log likelihood of the saturated model.
• The saturated model is a perfect model that returns probability 1 for items in the class and probability 0 for items not in the class.
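A sketch of this in R, assuming the spamTest data from the log-likelihood example above; the saturated model predicts every observed outcome with probability 1, so its log likelihood S is 0:
loglikelihood <- sum(ifelse(spamTest$spam == 'spam',
                            log(spamTest$pred),
                            log(1 - spamTest$pred)))
S <- 0                          # saturated model: log(1) = 0 for every item
D <- -2 * (loglikelihood - S)   # deviance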
Akaike information criterion (AIC)
AIC = deviance + 2 * numberOfParameters
Bayesian information criterion (BIC)
BIC = deviance + log(numberOfDataPoints) * numberOfParameters
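Continuing the sketch (the parameter count is illustrative and would come from the fitted model):
numberOfParameters <- 2                               # illustrative count
aic <- D + 2 * numberOfParameters                     # AIC
bic <- D + log(nrow(spamTest)) * numberOfParameters   # BIC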
Evaluating clustering models
> table(d$cluster)
1 2 3 4 5
10 27 18 17 28
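A hypothetical sketch of where such a cluster column might come from (the data frame d and the choice of five clusters are assumptions here):
# assign each row of a numeric data frame d to one of five k-means clusters
d$cluster <- kmeans(d, centers = 5)$cluster
table(d$cluster)   # per-cluster counts, as in the output above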
Validating the models
Common model problems:
• Bias
• Overfit
• Variance
• Non-significance
Memorization Methods
• The simplest methods in data science are
what we call memorization methods.
–Building single-variable models
–Cross-validated variable selection
–Building basic multivariable models
–Starting with decision trees, nearest
neighbor, and naive Bayes models
Building single-variable model
Linear Regression:
• Regression analysis is a very widely used
statistical tool to establish a relationship model
between two variables.
• One of these variables is called the predictor
variable, whose value is gathered through
experiments.
• The other variable is called the response
variable, whose value is derived from the
predictor variable.
Building single-variable model
• The general mathematical equation for a simple linear regression is:
y = ax + b
• Following is the description of the parameters used:
–y is the response variable.
–x is the predictor variable.
–a and b are constants called the coefficients.
Syntax
The basic syntax for the lm() function in linear regression is:
lm(formula, data)
Following is the description of the parameters used:
–formula is a symbol presenting the relation between x and y.
–data is the vector on which the formula will be applied.
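Putting these together, a minimal fit using the height/weight data from the example later in this section:
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)   # heights (cm)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)             # weights (kg)
relation <- lm(y ~ x)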
print(relation)
Building single-variable model
OUTPUT
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
Building single-variable model
# height (cm) and weight (kg) observations for ten people
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
print(relation)
print(summary(relation))
Building single-variable model
OUTPUT
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
predict(object, newdata)
Following is the description of the parameters used:
–object is the model which has already been created using the lm() function.
–newdata is the data frame containing the new value for the predictor variable.
Building single-variable model
Predict the weight of new persons
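A minimal sketch of that prediction, reusing the x/y fit from above (the height of 170 cm is an illustrative value):
# predict the weight of a new person who is 170 cm tall
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)   # about 76.23, i.e. -38.4551 + 0.6746 * 170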
glm(formula, data, family)
Following is the description of the parameters used:
–formula is the symbol presenting the relationship between the variables.
–data is the data set giving the values of these variables.
–family is an R object to specify the details of the model; its value is binomial for logistic regression.
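A minimal sketch of glm() in use, with R's built-in mtcars data (the choice of model, am as a function of wt, is illustrative):
# logistic regression: probability of a manual transmission (am) from weight (wt)
model <- glm(am ~ wt, data = mtcars, family = binomial)
print(summary(model))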
Applying a fitted multiple-regression equation (an intercept of 37.15 and three predictor coefficients) to new predictor values:
Y = 37.15 + (-0.000937)*221 + (-0.0311)*102 + (-3.8008)*2.91 = 22.7104
UNSUPERVISED LEARNING
OVERVIEW
• Using R’s clustering functions to explore data and
look for similarities
• Choosing the right number of clusters
• Evaluating a clustering
• Using R’s association rules functions to find
patterns of co-occurrence in data
• Evaluating a set of association rules
CLUSTER ANALYSIS
• The goal is to group the observations in your
data into clusters.
Two types of clustering:
1. Hierarchical Clustering
2. K-means Clustering
Distances
• In order to cluster, you need the notions of
similarity and dissimilarity.
CLUSTER ANALYSIS
• Euclidean distance
• Hamming distance
• Manhattan (city block) distance
• Cosine similarity
CLUSTER ANALYSIS
EUCLIDEAN DISTANCE
• The most common distance is Euclidean
distance. The Euclidean distance between two
vectors x and y is defined as
edist(x, y) <- sqrt((x[1]-y[1])^2 + (x[2]-y[2])^2 + ...)
CLUSTER ANALYSIS
HAMMING DISTANCE
• For categorical variables (male/female, or
small/medium/large), you can define the distance as 0 if
two points are in the same category, and 1 otherwise. If all
the variables are categorical, then you can use Hamming
distance, which counts the number of mismatches:
• hdist(x, y) <- sum((x[1] != y[1]) + (x[2] != y[2]) + ...)
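A minimal R sketch of this mismatch count (the function name and example vectors are illustrative):
# Hamming distance: number of positions where two categorical vectors differ
hamming_distance <- function(x, y) { sum(x != y) }
hamming_distance(c("male", "small"), c("male", "large"))   # returns 1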
set.seed(123)
data <- data.frame(x = sample(1:10000, 7),
                   y = sample(1:10000, 7),
                   z = sample(1:10000, 7))
data
HIERARCHICAL CLUSTERING EXAMPLE
x y z
1 2876 8925 1030
2 7883 5514 8998
3 4089 4566 2461
4 8828 9566 421
5 9401 4532 3278
6  456 6773 9541
7 5278 5723 8891
HIERARCHICAL CLUSTERING EXAMPLE
# create our own function according to the Euclidean distance formula
euclidean_distance <- function(p, q) { sqrt(sum((p - q)^2)) }
# check points 4 and 6
euclidean_distance(data[4,], data[6,])   # our own function
Output:
[1] 12691.16
HIERARCHICAL CLUSTERING EXAMPLE
dist(data, method="euclidean")
Output:
1 2 3 4 5 6
## 2 10009.695
## 3 4745.525 7617.448
## 4 6017.314 9532.925 7184.687
## 5 8180.928 5998.921 5374.569 5816.523
## 6 9106.296 7552.500 8258.083 12691.16 11147.209
## 7 8821.436 2615.560 6640.578 9955.503 7065.648 4977.618
HIERARCHICAL CLUSTERING EXAMPLE
# single linkage: cluster distance is the closest pair of points
hclust(dist(data, method="euclidean"), method="single")
# complete linkage: cluster distance is the farthest pair of points
hclust(dist(data, method="euclidean"), method="complete")
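To inspect the result, a short sketch (the choice of k = 3 clusters is arbitrary here):
# draw the dendrogram and cut it into three clusters
hc <- hclust(dist(data, method = "euclidean"), method = "complete")
plot(hc)
cutree(hc, k = 3)   # cluster assignment for each of the seven points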