“Ah, but a man’s reach should exceed his grasp, / Or what’s a heaven for?” – Robert Browning
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Data mining tasks — Clustering, Predictive Modeling, Anomaly Detection, and Association Rules — illustrated around a sample data set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
[Figure: predictive modeling — a decision tree with splits such as "Number of years" (Yes/No branches); a Training Set is fed to a learning step that produces a Model (Classifier).]
Use of K-means to partition sea surface net primary production (NPP) into clusters that reflect the Northern and Southern Hemispheres.
[Figure: latitude bands from 60 to −60 with regions labeled Land Cluster 2, Ice or No NPP, Sea Cluster 1, and Sea Cluster 2.]
• Medical Informatics
– Rules are used to find combinations of patient symptoms and test results associated with certain diseases
Document-term matrix (term frequencies per document):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0     5     0      2     6     0     2      0        2
Document 2    0     7     0     2      1     0     0     3      0        0
Document 3    0     1     0     0      1     2     2     0      3        0
Transaction Data
• Each record (transaction) is a set of items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
• Data Quality
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
Data Quality: Why Preprocess the Data?
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission errors
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
• Causes?
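As a concrete illustration of these cleaning steps, here is a minimal pandas sketch; the column names, sentinel values, and fill strategies are assumptions for this example, not part of the slides:

```python
import numpy as np
import pandas as pd

# Toy "dirty" records echoing the slide's examples (hypothetical data)
df = pd.DataFrame({
    "Occupation": ["Engineer", "", "Teacher"],   # "" = missing, as in Occupation=" "
    "Salary":     [52000, -10, 61000],           # Salary = -10 is an error
    "Age":        [42, 35, np.nan],
})

df["Occupation"] = df["Occupation"].replace("", np.nan)    # flag disguised missing data
df.loc[df["Salary"] < 0, "Salary"] = np.nan                # treat negative salary as noise
df["Salary"] = df["Salary"].fillna(df["Salary"].median())  # fill missing with a central value
df = df.dropna(subset=["Age"])                             # or drop incomplete records
print(df)
```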
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Data Integration
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
– Object identification: The same attribute or object may
have different names in different databases
– Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Data Integration
Correlation Analysis (Nominal Data)
• χ² (chi-square) test:
  $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population
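A small sketch of the χ² test using scipy; the contingency counts below are made up for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)  # a large chi2 (small p-value) suggests the variables are related
# Per-cell contributions: cells whose actual count differs most from the expected count
print((observed - expected) ** 2 / expected)
```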
• Correlation coefficient for numeric data (Pearson's product-moment coefficient):
  $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
[Figure: scatter plots showing correlations ranging from –1 to 1.]
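The same coefficient computed directly from the formula, checked against numpy's built-in; the attribute values are illustrative:

```python
import numpy as np

# Two hypothetical numeric attributes A and B
a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

n = len(a)
r = ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))
print(r)                        # positive r: A and B tend to rise together
print(np.corrcoef(a, b)[0, 1])  # same value from numpy
```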
Data Reduction Strategies
[Figure: data in the x1–x2 plane projected onto its principal directions.]
Principal Component Analysis
(Steps)
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing
“significance” or strength
– Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
• Works for numeric data only
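A minimal numpy sketch of these steps (eigendecomposition of the covariance matrix; function and variable names are mine, not from the slides):

```python
import numpy as np

def pca(X, k):
    """Project n-dimensional numeric data onto the k strongest principal components."""
    X = X - X.mean(axis=0)                  # normalize: center each attribute
    cov = np.cov(X, rowvar=False)           # covariance between attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # orthonormal (unit) eigenvectors
    order = np.argsort(eigvals)[::-1]       # sort by decreasing "significance"
    return X @ eigvecs[:, order[:k]]        # keep only the k strongest components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # 100 data vectors from 5 dimensions
print(pca(X, 2).shape)                      # (100, 2): reduced representation
```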
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
– Duplicate much or all of the information contained in
one or more other attributes
– E.g., purchase price of a product and the amount of
sales tax paid
• Irrelevant attributes
– Contain no information that is useful for the data
mining task at hand
– E.g., a student's ID is often irrelevant to the task of predicting the student's GPA
Attribute Creation (Feature
Generation)
• Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
• Three general methodologies
– Attribute extraction
• Domain-specific
– Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet
transformation, manifold approaches (not covered)
– Attribute construction
• Combining features
• Data discretization
Data Reduction 2: Numerosity
Reduction
• Reduce data volume by choosing alternative, smaller
forms of data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
– Ex.: Log-linear models—obtain the value at a point in m-dimensional space as a product over appropriate marginal subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …
Parametric Data Reduction: Regression
and Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
• Log-linear model
– Approximates discrete multidimensional probability
distributions
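A least-squares fit of a straight line with numpy; the data points are invented for illustration:

```python
import numpy as np

# Fit y = w0 + w1*x by the least-squares method
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

A = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
w, *_ = np.linalg.lstsq(A, y, rcond=None)  # parameters that minimize squared error
print(w)                                   # [w0, w1] ~ [0.3, 1.94]
```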
Clustering
• Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
• Can be very effective if data is clustered but not if data is
“smeared”
• Hierarchical clustering is possible, and the cluster representations can be stored in multi-dimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms
• Cluster analysis will be studied in depth later on.
[Figure: sampling from Raw Data — SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement).]
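Both sampling schemes in a few lines of Python; the data and sample size are arbitrary:

```python
import random

data = list(range(100))   # a toy raw data set
random.seed(0)

srswor = random.sample(data, 10)                  # SRS without replacement (SRSWOR)
srswr = [random.choice(data) for _ in range(10)]  # SRS with replacement (SRSWR)
print(srswor)
print(srswr)
```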
Sampling: Cluster or Stratified
Sampling
Raw Data Cluster/Stratified Sample
[Figure: lossy data reduction — the Original Data and its Approximated reconstruction.]
Data Transformation
• A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization.
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization
Normalization
• Min-max normalization: to [new_min_A, new_max_A]
  $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
  – Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
• Z-score normalization (μ: mean, σ: standard deviation):
  $v' = \frac{v - \mu_A}{\sigma_A}$
  – Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
• Normalization by decimal scaling:
  $v' = \frac{v}{10^j}$, where j is the smallest integer such that Max(|v′|) < 1
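The three normalization rules as small Python functions, reproducing the slide's worked examples (the decimal-scaling input 986 is an extra illustration of mine):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(v, j):
    # j is the smallest integer such that max(|v'|) < 1
    return v / 10 ** j

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
print(decimal_scaling(986, 3))                    # 0.986
```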
Discretization
• Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic
rank
– Numeric—integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!
For d items, the total number of possible rules is
  $R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1$
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
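A short sketch that computes support and confidence over the five transactions above, confirming that rules from the same itemset share a support value:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"Milk", "Diaper", "Beer"}))       # 0.4 for every rule from this itemset
print(confidence({"Milk", "Diaper"}, {"Beer"}))  # 0.67
print(confidence({"Beer"}, {"Milk", "Diaper"}))  # 0.67
```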
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ³ minsup
2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
[Figure: itemset lattice over {A, B, C, D, E}, from single items to pairs and triples.]
"X , Y : ( X Í Y ) Þ s( X ) ³ s(Y )
– Support of an itemset never exceeds the support of
its subsets
– This is known as the anti-monotone property of
support
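A compact sketch of level-wise frequent itemset generation that uses this anti-monotone property to prune candidates (the names and structure are mine; this is only the itemset-generation half of Apriori, without rule generation):

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Generate all itemsets with support count >= minsup, level by level."""
    count = lambda c: sum(c <= t for t in transactions)
    level = [frozenset([i]) for i in {i for t in transactions for i in t}]
    frequent = {}
    while level:
        level = [c for c in level if count(c) >= minsup]
        frequent.update({c: count(c) for c in level})
        # join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate with an infrequent k-subset (anti-monotone property)
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))]
    return frequent

tx = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]
print(frequent_itemsets(tx, minsup=3))
```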
Illustrating Apriori Principle
[Figure: itemset lattice over {A, B, C, D, E} — once an itemset is found to be infrequent, all of its supersets are pruned from the search.]
- Supervised learning
- Unsupervised learning
- Others:
- Reinforcement learning
- Recommender systems.
[Figure: labeled data (Supervised Learning) vs. unlabeled data (Unsupervised Learning) in the x1–x2 plane.]
Supervised Learning: the "right answer" is given for each example in the data.
Regression Problem: predict a real-valued output.
[Figure: housing price vs. size in feet².]
Training set of housing prices:

Size in feet² (x)   Price ($) in 1000's (y)
2104                460
1416                232
1534                315
852                 178
…                   …
Notation:
  m = number of training examples
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
  $\theta_0, \theta_1$: parameters — how to choose the $\theta$'s?
Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Question: how to minimize J?
Gradient Descent
Outline:
• Start with some $\theta_0, \theta_1$
• Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum
[Figure: the surface of $J(\theta_0, \theta_1)$ and the path taken from the current value of $(\theta_0, \theta_1)$.]
Gradient descent:
Repeat {
  $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
} (simultaneously update $\theta_0$ and $\theta_1$: compute both new values before overwriting either)
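A minimal sketch of batch gradient descent for the single-feature hypothesis above; the learning rate, iteration count, and feature rescaling (sizes in thousands of square feet, for numerical stability) are assumptions of this example:

```python
def gradient_descent(x, y, alpha=0.1, iters=10_000):
    """Minimize J(theta0, theta1) for h(x) = theta0 + theta1 * x."""
    theta0 = theta1 = 0.0
    m = len(x)
    for _ in range(iters):
        h = [theta0 + theta1 * xi for xi in x]
        grad0 = sum(h[i] - y[i] for i in range(m)) / m
        grad1 = sum((h[i] - y[i]) * x[i] for i in range(m)) / m
        # simultaneous update: compute both gradients before changing either theta
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

sizes = [2.104, 1.416, 1.534, 0.852]   # the slide's sizes, in 1000s of feet^2
prices = [460, 232, 315, 178]          # prices in $1000's
print(gradient_descent(sizes, prices))
```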
The same data with multiple features, written as a design matrix with a leading $x_0 = 1$ column for the intercept (the column labels follow the standard multivariate housing example):

x0   Size (feet²)  Bedrooms  Floors  Age  Price ($1000's)
1    2104          5         1       45   460
1    1416          3         2       40   232
1    1534          3         2       30   315
1    852           2         1       36   178
Perceptron learning example (λ = 0.1, bias input fixed at 1):

Training data:
X1  X2  X3   Y
1   0   0   -1
1   0   1    1
1   1   0    1
1   1   1    1
0   0   1   -1
0   1   0   -1
0   1   1    1
0   0   0   -1

Weights after each update during the first epoch:
Update  w0    w1    w2   w3
0       0     0     0    0
1      -0.2  -0.2   0    0
2       0     0     0    0.2
3       0     0     0    0.2
4       0     0     0    0.2
5      -0.2   0     0    0
6      -0.2   0     0    0
7       0     0     0.2  0.2
8      -0.2   0     0.2  0.2

Weights at the end of each epoch:
Epoch  w0    w1   w2   w3
0      0     0    0    0
1     -0.2   0    0.2  0.2
2     -0.2   0    0.4  0.2
3     -0.4   0    0.4  0.2
4     -0.4   0.2  0.4  0.4
5     -0.6   0.2  0.4  0.2
6     -0.6   0.4  0.4  0.2
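A short sketch that reproduces the weight trajectory above; it assumes the standard perceptron rule $w \leftarrow w + \lambda (y - \hat{y}) x$ with sign(0) treated as +1, which matches the numbers in both tables:

```python
def train_perceptron(data, lam=0.1, epochs=6):
    """Perceptron learning with a bias input fixed at 1."""
    w = [0.0, 0.0, 0.0, 0.0]                       # [w0 (bias), w1, w2, w3]
    sign = lambda a: 1 if a >= 0 else -1
    for _ in range(epochs):
        for (x1, x2, x3), y in data:
            x = [1, x1, x2, x3]                    # prepend the bias input
            y_hat = sign(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi + lam * (y - y_hat) * xi for wi, xi in zip(w, x)]
    return w

data = [((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1),
        ((0, 0, 1), -1), ((0, 1, 0), -1), ((0, 1, 1), 1), ((0, 0, 0), -1)]
print(train_perceptron(data))   # [-0.6, 0.4, 0.4, 0.2], the epoch-6 weights
```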
• Bayes theorem:
  $P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$
• For a record with attributes $X_1, X_2, \ldots, X_d$:
  $P(Y \mid X_1 X_2 \cdots X_d) = \frac{P(X_1 X_2 \cdots X_d \mid Y)\,P(Y)}{P(X_1 X_2 \cdots X_d)}$
• Normal distribution (for continuous attributes):
  $P(X_i \mid Y_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \, e^{-\frac{(X_i - \mu_{ij})^2}{2\sigma_{ij}^2}}$
  – One for each $(X_i, Y_j)$ pair

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• For (Income, Class = No):
  – If Class = No: sample mean = 110, sample variance = 2975
  $P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi \cdot 2975}} \, e^{-\frac{(120 - 110)^2}{2 \cdot 2975}} = 0.0072$
• Note: if one of the conditional probabilities is zero, the entire product becomes zero, e.g., P(Refund = Yes | Yes) = 0
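The worked likelihood as a small Python function:

```python
import math

def gaussian_likelihood(x, mu, var):
    """P(X_i | Y_j) under the normal-distribution assumption."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# The slide's example: Income = 120 given Class = No (mean 110, variance 2975)
print(gaussian_likelihood(120, 110, 2975))   # ~0.0072
```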
Consider the table with Tid = 7 deleted. The Naïve Bayes classifier learned from the remaining records:

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

P(Refund = Yes | No) = 2/6
P(Refund = No | No) = 4/6
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/6
P(Marital Status = Divorced | No) = 0
P(Marital Status = Married | No) = 4/6
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0/3
For Taxable Income:
  If class = No: sample mean = 91, sample variance = 685
  If class = Yes: sample mean = 90, sample variance = 25
Applying the model to a test record by walking the tree from the root:

Owner?
├─ Yes → NO
└─ No → Marital Status?
   ├─ Single, Divorced → Income?
   │  ├─ < 80K → NO
   │  └─ > 80K → YES
   └─ Married → NO

The test record reaches the Married branch: assign Defaulted to "No".
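The traversal above as a plain function; the record keys are names I chose for this sketch:

```python
def classify(record):
    """Walk the decision tree above for one test record."""
    if record["Owner"] == "Yes":
        return "NO"
    if record["MaritalStatus"] == "Married":
        return "NO"
    return "YES" if record["TaxableIncome"] > 80_000 else "NO"

# The test record traced on the slides ends in the Married branch:
print(classify({"Owner": "No", "MaritalStatus": "Married", "TaxableIncome": 80_000}))  # NO
```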
[Figure: applying the model — a decision tree learned from the Training Set (records with attributes Attrib1, Attrib2, Attrib3 and a known Class, e.g., Tid 6: No, Medium, 60K, No) is applied to a Test Set of unlabeled records, e.g., Tid 11 (No, Small, 55K, ?) and Tid 15 (No, Large, 67K, ?).]
[Figure: Hunt's algorithm growing the tree — first a single node (Defaulted = No, counts (3,0)); then a split on Marital Status into Single/Divorced vs. Married (Defaulted = No); finally a split of the Single/Divorced branch on Annual Income, with < 80K → Defaulted = No and >= 80K → Defaulted = Yes.]
• Binary split:
  – Divides values into two subsets
  – For an ordinal attribute it should preserve the order property among attribute values; e.g., for Shirt Size, the grouping {Small, Large} vs. {Medium, Extra Large} violates the order property
Node impurity: a node with class counts C0: 5, C1: 5 is maximally impure, while one with C0: 9, C1: 1 is nearly pure.
• Entropy
  $Entropy(t) = -\sum_j p(j \mid t) \log p(j \mid t)$
• Misclassification error
  $Error(t) = 1 - \max_i P(i \mid t)$
[Figure: choosing a split — candidate tests A? and B? (each with Yes/No branches) yield weighted child impurities M1 and M2; compare Gain = P − M1 vs. P − M2.]
Measure of Impurity: GINI
• Gini Index for a given node t:
  $GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
Examples:
  C1: 0, C2: 6 → Gini = 0.000
  C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444
  C1: 3, C2: 3 → Gini = 0.500
For a continuous attribute, sort the values and compute the Gini index at each candidate split position. With the class-No counts (≤ split, > split) running (0,7) (1,6) (2,5) (3,4) (3,4) (3,4) (3,4) (4,3) (5,2) (6,1) (7,0), the Gini values are
  0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420
and the position with the minimum Gini (0.300) is the best split.
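The three impurity measures side by side, reproducing the Gini values in the table above:

```python
import math

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def error(counts):
    return 1 - max(counts) / sum(counts)

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5
```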
• Gain Ratio — adjusts information gain by the entropy of the partitioning, penalizing splits into many small partitions:
  $GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}$, where $SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log \frac{n_i}{n}$
  (the parent node with n records is split into k partitions, and $n_i$ is the number of records in partition i)
• Misclassification error: $Error(t) = 1 - \max_i P(i \mid t)$
Example — a split produces children N1 and N2:

      N1   N2
C1    3    4
C2    0    3

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Gini improves, but the misclassification error remains the same!

Compare with an alternative split:

      N1   N2              N1   N2
C1    3    4          C1   3    4
C2    0    3          C2   1    2
Gini = 0.342          Gini = 0.416
[Figure: points in the unit square classified by a decision tree — a root test x < 0.43 followed by tests on y (e.g., y < 0.33), so that each leaf region contains points of a single class.]
• The border line between two neighboring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time
[Figure: cluster analysis — intra-cluster distances are minimized while inter-cluster distances are maximized.]
• Clustering stocks by price fluctuations:
  Cluster 3 (Financial-DOWN): MBNA-Corp-DOWN, Morgan-Stanley-DOWN
  Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
• Summarization
– Reduce the size of large
data sets
[Figure: clustering precipitation in Australia.]
• Results of a query
– Groupings are a result of an external specification
– Clustering is a grouping of objects based on the data
• Supervised classification
– Have class label information
• Association Analysis
– Local vs. global connections
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
[Figure: points p1–p4 shown as nested clusters (Traditional Hierarchical Clustering) and as the corresponding Traditional Dendrogram.]
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or Conceptual
• Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.
3 well-separated clusters
• Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster
4 center-based clusters
8 contiguous clusters
• Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
2 Overlapping Circles
• Hierarchical clustering
• Density-based clustering
[Figures: K-means on a 2-D point set — the Original Points and the cluster assignments and centroid positions over successive iterations (Iteration 1, Iteration 2, …), showing how the centroids move until the clustering converges and how the outcome depends on the initial centroids.]
10 Clusters Example
[Figure: K-means iterations 1–4 when starting with two initial centroids in one cluster of each pair of clusters.]
[Figure: K-means iterations 1–4 when starting with some pairs of clusters having three initial centroids, while others have only one.]
Handling Empty Clusters
• Several strategies
– Choose the point that contributes most to SSE
– Choose a point from the cluster with the
highest SSE
– If there are several empty clusters, the above
can be repeated several times.
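A basic K-means sketch that applies the first strategy (re-seed an empty cluster with the point contributing most to the SSE); the toy data, seed, and k are arbitrary choices for this example:

```python
import random

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)
    dist2 = lambda p, c: (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assign each point to nearest centroid
            clusters[min(range(k), key=lambda j: dist2(p, centroids[j]))].append(p)
        for j, cl in enumerate(clusters):     # recompute centroids
            if cl:
                centroids[j] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
            else:                             # empty cluster: take the point with
                centroids[j] = max(points,    # the largest contribution to SSE
                                   key=lambda p: min(dist2(p, c) for c in centroids))
    return centroids

random.seed(1)
pts = ([(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)] +
       [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(50)])
print(kmeans(pts, 2))
```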
CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
Confusion matrix:

                     Predicted class
                     Pos   Neg
Actual class   Pos   TP    FN     (P)
               Neg   FP    TN     (N)

Recall/Sensitivity (True Positive Rate) = TP / (TP + FN)
Specificity (True Negative Rate) = TN / (TN + FP)
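Deriving the rates from the four counts; the counts are hypothetical:

```python
def metrics(tp, fn, fp, tn):
    recall = tp / (tp + fn)        # sensitivity / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return recall, specificity, accuracy

print(metrics(tp=40, fn=10, fp=5, tn=45))   # (0.8, 0.9, 0.85)
```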
• We can use:
– Training data;
– Independent test data;
– Hold-out method;
– k-fold cross-validation method;
– Leave-one-out method;
– Bootstrap method;
– And many more…
– Q: Why?
– A: Because new data will probably not be exactly the
same as the training data!
• The accuracy/error estimates on the training data measure the degree to which the classifier overfits the training data.
Estimation with Independent Test Data
[Figure: the available data is split; a classifier is built on the training portion and evaluated on the independent test portion.]
• The hold-out method is usually used when we have
thousands of instances, including several hundred
instances from each class.
• Then we have