
Basics of Machine

Learning & Python

Dr. Hazem Shatila



Key Issues to be Covered
• Statistics, Linear Algebra & Probability
• Introduction to ML & DM
• Machine Learning
• Supervised and unsupervised learning
• Types of Data
• Data Preprocessing
• Frequent Itemsets
• Association Rules & Apriori algorithm
• Regression
• Base Classifiers
  – Logistic Regression
  – Decision Tree based Methods
  – Nearest-neighbor
  – Neural Networks
  – Naïve Bayes
  – Support Vector Machines
  – Kernel Trick in SVM
• Clustering
  – K-means Clustering
  – Hierarchical Clustering
  – Cluster Evaluation
• Evaluation of learning models
• ROC & lift curves
• Data Mining Practical
• Introduction to Python, visualization
• Use cases, assignments and practical examples for machine learning

References:
• "Introduction to Machine Learning", Alex Smola and S.V.N. Vishwanathan.
• "Introduction to Data Mining", Tan, Steinbach & Kumar.

Hazem Shatila, PhD.
CEO, Markov Analytics

Dr. Hazem Shatila


Jobs in 2022

Dr. Hazem Shatila


What is Data?

Dr. Hazem Shatila


Big Data Definition

• Big Data is the amount of data just beyond technology's capability to store, manage and process efficiently.

"Ah, but a man's reach should exceed his grasp, Or what's a heaven for?" – Robert Browning

Dr. Hazem Shatila


Evolution of Analytics

Dr. Hazem Shatila


Data Science

Dr. Hazem Shatila


What is Data Mining?
• Many Definitions
– Non-trivial extraction of implicit, previously unknown
and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns

Dr. Hazem Shatila


What is (not) Data Mining?

● What is not Data Mining?
  – Look up a phone number in a phone directory
  – Query a Web search engine for information about "Amazon"

● What is Data Mining?
  – Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly… in the Boston area)
  – Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
Dr. Hazem Shatila
Machine Learning is not Data
Mining
• Machine Learning designs systems that can learn from the data they process
• A checkers program designed by one of the early researchers eventually learned to play better than its own designer
• Data Mining incorporates Machine Learning methods but also benefits from the methods of other disciplines such as databases and statistics
• Machine Learning is a field of data science that focuses on designing algorithms that can learn from and make predictions on data
• Data Mining is the field that focuses on discovering properties of data sets; it can use ML to do so.
Dr. Hazem Shatila
Machine Learning Tasks
• Prediction Methods (Supervised – labels)
  – Use some variables to predict unknown or future values of other variables.

• Description Methods (Unsupervised – no labels)
  – Find human-interpretable patterns that describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Dr. Hazem Shatila


Machine Learning Tasks …

Figure: the four core machine learning tasks (Clustering, Predictive Modeling, Association Rules, Anomaly Detection) illustrated around a tabular data set with attributes Tid, Refund, Marital Status, Taxable Income, and Cheat.
Dr. Hazem Shatila


Predictive Modeling: Classification

• Find a model for the class attribute as a function of the values of the other attributes

Training data:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                            Yes
2    Yes       High School         2                            No
3    No        Undergrad           1                            No
4    Yes       High School         10                           Yes
…    …         …                   …                            …

Model for predicting credit worthiness (a decision tree):
  Employed?
    No  → not credit worthy
    Yes → split on Education (Graduate vs. {High school, Undergrad});
          each education branch then splits on the number of years at the present address
          (thresholds of 3 and 7 years), giving Yes / No leaves.
Dr. Hazem Shatila


Classification Example

Training Set (categorical, categorical, quantitative attributes; class label):

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                            Yes
2    Yes       High School         2                            No
3    No        Undergrad           1                            No
4    Yes       High School         10                           Yes
…    …         …                   …                            …

Test Set (class to be predicted):

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                            ?
2    No        Graduate            3                            ?
3    Yes       High School         2                            ?
…    …         …                   …                            …

Training Set → Learn Model → Classifier → applied to the Test Set
Dr. Hazem Shatila


Examples of Classification Task
• Classifying credit card transactions
as legitimate or fraudulent

• Classifying land covers (water bodies, urban areas,


forests, etc.) using satellite data

• Categorizing news stories as finance,


weather, entertainment, sports, etc

• Identifying intruders in the cyberspace

• Predicting tumor cells as benign or malignant

• Classifying secondary structures of protein


as alpha-helix, beta-sheet, or random coil

Dr. Hazem Shatila


Regression
• Predict a value of a given continuous valued variable
based on the values of other variables, assuming a
linear or nonlinear model of dependency.
• Extensively studied in statistics, neural network fields.
• Examples:
– Predicting sales amounts of new product based on
advertising expenditure.
– Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.

Dr. Hazem Shatila


Clustering

• Finding groups of objects such that the objects in a


group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
  – Intra-cluster distances are minimized
  – Inter-cluster distances are maximized

Dr. Hazem Shatila


Applications of Cluster Analysis
• Understanding
  – Customer profiling for targeted marketing
– Group related documents for
browsing
– Group genes and proteins that
have similar functionality
– Group stocks with similar price
fluctuations
• Summarization
– Reduce the size of large data
sets
Courtesy: Michael Eisen

Clusters for Raw SST and Raw NPP

Figure: use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters (Land Cluster 1 and 2, Sea Cluster 1 and 2, Ice or No NPP) that reflect the Northern and Southern Hemispheres, plotted over latitude and longitude.

Dr. Hazem Shatila
Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items from a given collection:
  – Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Dr. Hazem Shatila


Association Analysis:
Applications
• Market-basket analysis
– Rules are used for sales promotion, shelf management,
and inventory management

• Telecommunication alarm diagnosis


– Rules are used to find combination of alarms that occur
together frequently in the same time period

• Medical Informatics
– Rules are used to find combination of patient symptoms
and test results associated with certain diseases

Dr. Hazem Shatila


Deviation/Anomaly/Change Detection
• Detect significant deviations from
normal behavior
• Applications:
– Credit Card Fraud Detection
– Network Intrusion
Detection
– Identify anomalous behavior from
sensor networks for monitoring and
surveillance.
– Detecting changes in the global forest
cover.

Dr. Hazem Shatila


What is Data?

• Collection of data objects and their attributes

• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Attribute is also known as variable, field, characteristic, or feature

• A collection of attributes describes an object
  – Object is also known as record, point, case, sample, entity, or instance

Attributes are the columns and objects are the rows of the table:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Size: number of objects
Dimensionality: number of attributes
Sparsity: number of populated object–attribute pairs

Dr. Hazem Shatila


Types of Attributes
• There are different types of attributes
– Categorical
• Examples: eye color, zip codes, words, rankings (e.g., good, fair, bad), height in {tall, medium, short}
• Nominal (no order or comparison) vs. Ordinal (ordered, but differences between values are not meaningful)
– Numeric
• Examples: dates, temperature, time, length, value,
count.
• Discrete (counts) vs Continuous (temperature)
• Special case: Binary attributes (yes/no, exists/not
exists)

Dr. Hazem Shatila


Numeric Record Data
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute

• Such data set can be represented by an n-by-d data


matrix, where there are n rows, one for each object, and
d columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Dr. Hazem Shatila
Categorical Data
• Data that consists of a collection of
records, each of which consists of a fixed
set of categorical attributes
Tid  Refund  Marital Status  Taxable Income  Cheat

1 Yes Single High No


2 No Married Medium No
3 No Single Low No
4 Yes Married High No
5 No Divorced Medium Yes
6 No Married Low No
7 Yes Divorced High No
8 No Single Medium Yes
9 No Married Medium No
10 No Single Medium Yes
Dr. Hazem Shatila
Document Data
• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times
the corresponding term occurs in the document.
– Bag-of-words representation – no ordering

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
Dr. Hazem Shatila
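As a practical illustration of the bag-of-words idea above, the sketch below builds a document-term count matrix with scikit-learn's CountVectorizer (a minimal sketch, assuming scikit-learn is installed; the three short documents are made up for the example).

# Bag-of-words: each document becomes a term-count vector
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "team play team game score game game lost",
    "coach coach ball score lost",
    "timeout season win game",
]

vectorizer = CountVectorizer()              # tokenizes and counts term occurrences
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the terms (columns)
print(X.toarray())                          # counts per document (rows)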
Transaction Data
• Each record (transaction) is a set of items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

• A set of items can also be represented as a binary


vector, where each attribute is an item.
• A document can also be represented as a set of
words (no counts)

Dr. Hazem Shatila


Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
• Data is a long ordered string

Dr. Hazem Shatila


Ordered Data
• Time series
– Sequence of ordered (over “time”) numeric
values.

Dr. Hazem Shatila


Discrete and Continuous
Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.

Dr. Hazem Shatila


Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Dr. Hazem Shatila
Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some records modified but others not, dangling references, …
– Timeliness: is the data updated in a timely manner?
– Believability: how much can the data be trusted to be correct?
– Interpretability: how easily can the data be understood?
Dr. Hazem Shatila
Major Tasks in Data Preprocessing

• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation

Dr. Hazem Shatila


Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Dr. Hazem Shatila
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data,
  e.g., faulty instruments, human or computer error, transmission errors
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?

Dr. Hazem Shatila


Data Cleaning
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the
time of entry
– not register history or changes of the data
• Missing data may need to be inferred
Dr. Hazem Shatila
Data Cleaning
How to Handle Missing Data?

• Ignore the tuple: usually done when class label is missing


(when doing classification)—not effective when the % of
missing values per attribute varies considerably
• Fill in the missing value manually: tedious, and often infeasible
• Fill it in automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same
class: smarter
– the most probable value: inference-based such as
Bayesian formula or decision tree
Dr. Hazem Shatila
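A minimal pandas sketch of the automatic fill-in strategies listed above (the DataFrame, column names, and values are illustrative, not taken from the slides):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [125, 100, np.nan, 120, np.nan, 60],
    "class":  ["No", "No", "No", "No", "Yes", "Yes"],
})

df["income_const"] = df["income"].fillna(-1)                    # a global constant
df["income_mean"] = df["income"].fillna(df["income"].mean())    # the attribute mean
# the attribute mean for all samples belonging to the same class (smarter)
df["income_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)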
Data Cleaning
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
Dr. Hazem Shatila
Data Cleaning
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g.,
deal with possible outliers)

Dr. Hazem Shatila


Data Cleaning
Outliers

• Outliers are data objects with characteristics


that are considerably different than most of
the other data objects in the data set
– Case 1: Outliers are
noise that interferes
with data analysis
– Case 2: Outliers are
the goal of our analysis
• Credit card fraud
• Intrusion detection

• Causes?

Dr. Hazem Shatila


Data Cleaning
Outliers
Box Plots

Dr. Hazem Shatila


Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Dr. Hazem Shatila
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Dr. Hazem Shatila
Data Integration
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
– Object identification: The same attribute or object may
have different names in different databases
– Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
• Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Dr. Hazem Shatila
Data Integration
Correlation Analysis (Nominal Data)
• χ² (chi-square) test:

  χ² = Σ (Observed − Expected)² / Expected
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population

Dr. Hazem Shatila


Data Integration
Chi-Square Calculation: An Example
Play chess Not play chess Sum (row)
Like science fiction 250(90) 200(360) 450

Not like science fiction 50(210) 1000(840) 1050

Sum(col.) 300 1200 1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

  χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
• It shows that like_science_fiction and play_chess are
correlated in the group
Dr. Hazem Shatila
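The same chi-square calculation can be reproduced in Python; a short sketch with SciPy is shown below (assuming SciPy is available).

import numpy as np
from scipy.stats import chi2_contingency

# rows: like / not like science fiction, columns: play / not play chess
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93, matching the manual calculation above
print(expected)  # [[90, 360], [210, 840]], the expected counts in parentheses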
Data Integration
Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson's product-moment coefficient):

  r_A,B = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σ_A σ_B) = (Σᵢ aᵢbᵢ − n·Ā·B̄) / ((n − 1) σ_A σ_B)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(aᵢbᵢ) is the sum of the AB cross-products.
• If rA,B > 0, A and B are positively correlated (A’s values
increase as B’s). The higher, the stronger correlation.
• rA,B = 0: independent; rAB < 0: negatively correlated

Dr. Hazem Shatila
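A small NumPy sketch of Pearson's correlation coefficient, computed both from the formula above and with np.corrcoef (the two attribute vectors are invented for illustration):

import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

n = len(A)
# formula: sum of mean-centered cross-products over (n-1) * sigma_A * sigma_B
r_manual = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
r_numpy = np.corrcoef(A, B)[0, 1]    # NumPy's built-in gives the same value

print(r_manual, r_numpy)   # positive and close to 1: A and B increase together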


Data Integration
Visually Evaluating Correlation

Scatter plots showing correlations ranging from –1 to 1.

Dr. Hazem Shatila


Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Dr. Hazem Shatila
Data Reduction Strategies

• Data reduction: Obtain a reduced representation of the data set that is


much smaller in volume but yet produces the same (or almost the
same) analytical results
• Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
• Data reduction strategies
– Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
– Data compression
Dr. Hazem Shatila
Data Reduction 1: Dimensionality
Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
– The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)

Dr. Hazem Shatila


Mapping Data to a New Space
• Fourier transform
• Wavelet transform

Figure: two sine waves, the two sine waves + noise, and the corresponding frequency-domain representation.

Dr. Hazem Shatila


What Is Wavelet Transform?
• Decomposes a signal into
different frequency subbands
– Applicable to n-
dimensional signals
• Data are transformed to
preserve relative distance
between objects at different
levels of resolution
• Allow natural clusters to
become more distinguishable
• Used for image compression

Dr. Hazem Shatila


Principal Component Analysis (PCA)

• Find a projection that captures the largest amount of variation in data


• The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space

Figure: data points in the (x1, x2) plane with the principal component directions overlaid.
Dr. Hazem Shatila
Principal Component Analysis
(Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing
“significance” or strength
– Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
• Works for numeric data only
Dr. Hazem Shatila
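A minimal sketch of these PCA steps using scikit-learn (the small two-dimensional data set is invented for illustration):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

X_std = StandardScaler().fit_transform(X)   # normalize input data to a common scale
pca = PCA(n_components=1)                   # keep only the strongest component
X_reduced = pca.fit_transform(X_std)        # project onto the principal component

print(pca.components_)                 # eigenvector defining the new space
print(pca.explained_variance_ratio_)   # fraction of the variance retained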
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
– Duplicate much or all of the information contained in
one or more other attributes
– E.g., purchase price of a product and the amount of
sales tax paid
• Irrelevant attributes
– Contain no information that is useful for the data
mining task at hand
– E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Dr. Hazem Shatila
Attribute Creation (Feature
Generation)
• Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
• Three general methodologies
– Attribute extraction
• Domain-specific
– Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet
transformation, manifold approaches (not covered)
– Attribute construction
• Combining features
• Data discretization
Dr. Hazem Shatila
Data Reduction 2: Numerosity
Reduction
• Reduce data volume by choosing alternative, smaller
forms of data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
– Ex.: Log-linear models—obtain value at a point in m-
D space as the product on appropriate marginal
subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …
Dr. Hazem Shatila
Parametric Data Reduction: Regression
and Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
• Log-linear model
– Approximates discrete multidimensional probability
distributions

Dr. Hazem Shatila


Regression Analysis

• Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka explanatory variables or predictors)
• The parameters are estimated so as to give a "best fit" of the data
• Most commonly the best fit is evaluated by using the least-squares method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships

Figure: a fitted line y = x + 1 through data points, with the fitted value Y1' for the input X1.

Dr. Hazem Shatila


Regress Analysis and Log-Linear
Models
• Linear regression: Y = w X + b
– Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
– Using the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above
• Log-linear models:
– Approximate discrete multidimensional probability distributions
– Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
– Useful for dimensionality reduction and data smoothing
Dr. Hazem Shatila
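A short NumPy sketch of fitting the linear regression Y = wX + b by the least-squares method (the data points are invented for illustration):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

w, b = np.polyfit(X, Y, deg=1)   # least-squares fit of a straight line
print(w, b)                      # the two estimated regression coefficients
print(w * 6.0 + b)               # predicted Y for a new value X = 6.0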
Histogram Analysis
• Divide data into buckets and store the average (or sum) for each bucket
• Partitioning rules:
  – Equal-width: equal bucket range
  – Equal-frequency (or equal-depth): each bucket holds roughly the same number of values

Figure: histogram of values bucketed from 10,000 to 100,000.
Dr. Hazem Shatila
Clustering
• Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms
• Cluster analysis will be studied in depth later on.

Dr. Hazem Shatila


Sampling
• Sampling: obtaining a small sample s to represent the
whole data set N
• Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
• Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor
performance in the presence of skew
– Develop adaptive sampling methods, e.g., stratified
sampling:
• Note: Sampling may not reduce database I/Os (page at a
time)
Dr. Hazem Shatila
Types of Sampling
• Simple random sampling
– There is an equal probability of selecting any particular
item
• Sampling without replacement
– Once an object is selected, it is removed from the
population
• Sampling with replacement
– A selected object is not removed from the population
• Stratified sampling:
– Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
– Used in conjunction with skewed data
Dr. Hazem Shatila
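A small pandas sketch of the sampling types above: simple random sampling without and with replacement, and stratified sampling on a skewed data set (the data set and stratum labels are invented for illustration):

import pandas as pd

df = pd.DataFrame({"value": range(1000),
                   "stratum": ["A"] * 900 + ["B"] * 100})   # skewed strata

srs   = df.sample(n=50, replace=False, random_state=0)  # simple random, without replacement
srswr = df.sample(n=50, replace=True, random_state=0)   # simple random, with replacement
# stratified: draw ~5% from each stratum so the rare group "B" is still represented
strat = df.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.05, random_state=0)
)

print(len(srs), len(srswr))
print(strat["stratum"].value_counts())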
Sampling: With or without Replacement

Figure: raw data sampled by SRS (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement).
Dr. Hazem Shatila
Sampling: Cluster or Stratified
Sampling
Raw Data Cluster/Stratified Sample

Dr. Hazem Shatila


Data Reduction 3: Data
Compression
• String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless, but only limited manipulation is
possible without expansion
• Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
• Time sequence is not audio
– Typically short and vary slowly with time
• Dimensionality and numerosity reduction may also be
considered as forms of data compression
Dr. Hazem Shatila
Data Compression

Figure: original data reduced to compressed data; lossless compression recovers the original data exactly, while lossy compression only approximates it.
Dr. Hazem Shatila


Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Dr. Hazem Shatila
Data Transformation
• A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization.
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization
Dr. Hazem Shatila
Normalization
• Min-max normalization: to [new_min_A, new_max_A]

  v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

  – Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
    (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

• Z-score normalization (μ: mean, σ: standard deviation):

  v' = (v − μ_A) / σ_A

  – Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Dr. Hazem Shatila
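The three normalization schemes can be reproduced with a few lines of NumPy; the sketch below reuses the income example from the slide:

import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# min-max normalization to [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())
print(min_max)                 # 73,600 maps to ~0.716, as in the example

# z-score normalization with the slide's mean and standard deviation
mu, sigma = 54_000, 16_000
print((income - mu) / sigma)   # 73,600 maps to 1.225

# decimal scaling: divide by 10^j so that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max())))
print(income / 10**j)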
Discretization
• Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic
rank
– Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification

Dr. Hazem Shatila


Data Discretization Methods
• Typical methods: All the methods can be applied
recursively
– Binning
– Histogram analysis
– Clustering analysis
– Decision-tree analysis
– Correlation (e.g., χ²) analysis

Dr. Hazem Shatila


Simple Discretization: Binning

• Equal-width (distance) partitioning


– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well

• Equal-depth (frequency) partitioning


– Divides the range into N intervals, each containing approximately
same number of samples
– Good data scaling
– Managing categorical attributes can be tricky
Dr. Hazem Shatila
Binning Methods for Data
Smoothing
q Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Dr. Hazem Shatila
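A pandas sketch of equal-frequency binning and smoothing by bin means on the same price data (pd.qcut gives the equi-depth partition; the exact bin means are 9, 22.75, and 29.25, which the slide rounds to 9, 23, and 29):

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bins = pd.qcut(prices, q=3, labels=False)           # equal-frequency (equi-depth) bins
smoothed_by_mean = prices.groupby(bins).transform("mean")

print(bins.values)               # bin index per value: 0,0,0,0, 1,1,1,1, 2,2,2,2
print(smoothed_by_mean.values)   # each value replaced by its bin mean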
Association Analysis: Basic Concepts
Frequent Itemsets
Apriori Algorithm

Dr. Hazem Shatila


Association Rule Mining
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction

Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Dr. Hazem Shatila


Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Dr. Hazem Shatila


Definition: Association Rule
● Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}

● Rule Evaluation Metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X
  – Lift (L): lift = confidence / support(Y)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} ⇒ {Beer}
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

Dr. Hazem Shatila
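A small Python sketch computing support, confidence, and lift for the rule {Milk, Diaper} → {Beer} on the transactions above:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

X, Y = {"Milk", "Diaper"}, {"Beer"}
n = len(transactions)

def count(itemset):                       # support count sigma(itemset)
    return sum(itemset <= t for t in transactions)

support = count(X | Y) / n                # 2/5 = 0.4
confidence = count(X | Y) / count(X)      # 2/3 = 0.67
lift = confidence / (count(Y) / n)        # 0.67 / (3/5) ≈ 1.11

print(support, confidence, lift)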


Association Rule Mining Task
• Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!

Dr. Hazem Shatila



Computational Complexity
Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

    R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules

Dr. Hazem Shatila


Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Dr. Hazem Shatila
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset

• Frequent itemset generation is still


computationally expensive
Dr. Hazem Shatila
Frequent Itemset Generation null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE


Given d items, there
are 2d possible
ABCDE candidate itemsets

Dr. Hazem Shatila


Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent
itemset
– Count the support of each candidate by scanning the
database
Figure: N transactions (the TID/Items table below) matched against a list of M candidate itemsets, where w is the maximum transaction width.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!
  (M candidates and N transactions)
Dr. Hazem Shatila
Frequent Itemset Generation
Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

• Reduce the number of transactions (N)


– Reduce size of N as the size of itemset
increases
• Reduce the number of comparisons (NM)
– Use efficient data structures to store the
candidates or transactions
– No need to match every candidate against
every transaction
Dr. Hazem Shatila
Reducing Number of
Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must
also be frequent

• Apriori principle holds due to the following


property of the support measure:

"X , Y : ( X Í Y ) Þ s( X ) ³ s(Y )
– Support of an itemset never exceeds the support of
its subsets
– This is known as the anti-monotone property of
support
Dr. Hazem Shatila
Illustrating Apriori Principle
null

A B C D E

AB AC AD AE BC BD BE CD CE DE

Found to be
Infrequent
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

Pruned
supersets ABCDE

Dr. Hazem Shatila


Illustrating Apriori Principle

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 3 (60%)

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 4 = 16 candidates

Dr. Hazem Shatila




Illustrating Apriori Principle
Item Count Items (1-itemsets)
Bread 4
Coke 2
Milk 4 Itemset Pairs (2-itemsets)
Beer 3 {Bread,Milk}
Diaper 4 {Bread, Beer } (No need to generate
Eggs 1 {Bread,Diaper}
{Beer, Milk}
candidates involving Coke
{Diaper, Milk} or Eggs)
{Beer,Diaper}

Minimum Support = 3

If every subset is considered,


6C + 6C + 6C
1 2 3
6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

Dr. Hazem Shatila


Illustrating Apriori Principle
Item Count Items (1-itemsets)
Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Beer, Bread} 2 (No need to generate
Eggs 1 {Bread,Diaper} 3
candidates involving Coke
{Beer,Milk} 2
{Diaper,Milk} 3 or Eggs)
{Beer,Diaper} 3
Minimum Support = 3

If every subset is considered,


6C + 6C + 6C
1 2 3
6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

Dr. Hazem Shatila


Illustrating Apriori Principle
Item Count Items (1-itemsets)
Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Bread,Beer} 2 (No need to generate
Eggs 1
{Bread,Diaper} 3 candidates involving Coke
{Milk,Beer} 2 or Eggs)
{Milk,Diaper} 3
{Beer,Diaper} 3
Minimum Support = 3
Triplets (3-itemsets)
If every subset is considered, Itemset
6C + 6C + 6C
1 2 3 { Beer, Diaper, Milk}
6 + 15 + 20 = 41 { Beer,Bread,Diaper}
With support-based pruning, {Bread, Diaper, Milk}
6 + 6 + 4 = 16 { Beer, Bread, Milk}

Dr. Hazem Shatila


Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                 Count
{Beer, Diaper, Milk}    2
{Beer, Bread, Diaper}   2
{Bread, Diaper, Milk}   2
{Beer, Bread, Milk}     1

Minimum Support = 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 4 = 16 candidates

Dr. Hazem Shatila


Apriori Algorithm
– Fk: frequent k-itemsets
– Lk: candidate k-itemsets
• Algorithm
– Let k=1
– Generate F1 = {frequent 1-itemsets}
– Repeat until Fk is empty
• Candidate Generation: Generate Lk+1 from Fk
• Candidate Pruning: Prune candidate itemsets in Lk+1
containing subsets of length k that are infrequent
• Support Counting: Count the support of each
candidate in Lk+1 by scanning the DB
• Candidate Elimination: Eliminate candidates in Lk+1
that are infrequent, leaving only those that are frequent
=> Fk+1

Dr. Hazem Shatila
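A compact, illustrative Python sketch of the frequent-itemset loop above (candidate generation, pruning, support counting, and elimination). This is a teaching sketch, not an optimized Apriori implementation:

from itertools import combinations

def apriori(transactions, minsup_count):
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    # F1: frequent 1-itemsets
    Fk = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup_count]
    frequent = list(Fk)
    k = 1
    while Fk:
        # Candidate generation: union pairs of frequent k-itemsets into (k+1)-itemsets
        candidates = {a | b for a in Fk for b in Fk if len(a | b) == k + 1}
        # Candidate pruning: every k-subset of a candidate must already be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in set(Fk) for s in combinations(c, k))}
        # Support counting and candidate elimination
        Fk = [c for c in candidates if support(c) >= minsup_count]
        frequent.extend(Fk)
        k += 1
    return frequent

transactions = [{"Bread", "Milk"},
                {"Beer", "Bread", "Diaper", "Eggs"},
                {"Beer", "Coke", "Diaper", "Milk"},
                {"Beer", "Bread", "Diaper", "Milk"},
                {"Bread", "Coke", "Diaper", "Milk"}]
print(apriori(transactions, minsup_count=3))   # frequent 1- and 2-itemsets from the example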


Candidate Generation: Brute-force method

Dr. Hazem Shatila


Candidate Generation: Merge Fk-1 and F1 itemsets

Dr. Hazem Shatila


Candidate Generation: Fk-1 x Fk-1 Method

Dr. Hazem Shatila


Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Diaper, Milk}  2

Minimum Support = 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates

Use of the Fk-1 x Fk-1 method for candidate generation results in only one 3-itemset. This is eliminated after the support counting step.

Dr. Hazem Shatila


Example

Minimum Support = 2

Dr. Hazem Shatila


Rule Generation
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, the candidate rules are:
    ABC→D, ABD→C, ACD→B, BCD→A,
    A→BCD, B→ACD, C→ABD, D→ABC,
    AB→CD, AC→BD, AD→BC, BC→AD,
    BD→AC, CD→AB

• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
Dr. Hazem Shatila
Rule Generation
• In general, confidence does not have an anti-monotone property
  – c(ABC → D) can be larger or smaller than c(AB → D)

• But the confidence of rules generated from the same itemset has an anti-monotone property
  – E.g., suppose {A,B,C,D} is a frequent 4-itemset:
    c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
  – Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

Dr. Hazem Shatila


Rule Generation for Apriori Algorithm
Lattice of rules:
ABCD => { }
Low
Confidence
Rule
BCD=>A ACD=>B ABD=>C ABC=>D

CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD

D=>ABC C=>ABD B=>ACD A=>BCD


Pruned
Rules

Dr. Hazem Shatila


Try it out…Assignment

Please draw the rules tree

Dr. Hazem Shatila


Linear Regression

Dr. Hazem Shatila


Machine Learning Algorithms

- Supervised learning
- Unsupervised learning

- Others:
- Reinforcement learning
- Recommender systems.

Dr. Hazem Shatila


Supervised Learning & Unsupervised Learning

Figure: labeled data separated by a decision boundary (Supervised Learning) vs. unlabeled data grouped into clusters (Unsupervised Learning), plotted in the (x1, x2) plane.

Dr. Hazem Shatila


Linear Regression with one
Variable
Housing Prices (Portland, OR)
Figure: price (in 1000s of dollars) vs. size (feet²).

Supervised Learning: given the "right answer" for each example in the data.
Regression Problem: predict real-valued output.
Dr. Hazem Shatila
Training set of housing prices:

Size in feet² (x)   Price ($) in 1000's (y)
2104                460
1416                232
1534                315
852                 178
…                   …

Notation:
  m = number of training examples
  x's = "input" variable / features
  y's = "output" variable / "target" variable

Flow: Training Set → Learning Algorithm → hypothesis h; h maps the size of a house to an estimated price.

Question: How to describe h?


Dr. Hazem Shatila
Training Set:

Size in feet² (x)   Price ($) in 1000's (y)
2104                460
1416                232
1534                315
852                 178
…                   …

Hypothesis: h_θ(x) = θ₀ + θ₁x
θ's: Parameters
How to choose the θ's?

Dr. Hazem Shatila


Figure: the training data with a candidate hypothesis line.

Idea: Choose θ₀, θ₁ so that h_θ(x) is close to y for our training examples (x, y).

Dr. Hazem Shatila


Cost Function

Hypothesis: h_θ(x) = θ₀ + θ₁x
(Simplified version for intuition: set θ₀ = 0, so h_θ(x) = θ₁x.)

Parameters: θ₀, θ₁

Cost Function: J(θ₀, θ₁) = (1/2m) Σᵢ ( h_θ(x(i)) − y(i) )²

Goal: minimize J(θ₀, θ₁) over θ₀, θ₁

Dr. Hazem Shatila


Figure: price ($ in 1000's) vs. size in feet² (x), with the hypothesis line plotted through the training data.

Question: How to minimize J?
Dr. Hazem Shatila
Gradient Descent

Have some function J(θ₀, θ₁)
Want: min over θ₀, θ₁ of J(θ₀, θ₁)

Outline:
• Start with some θ₀, θ₁
• Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁) until we hopefully end up at a minimum

Dr. Hazem Shatila


Figure: surface plot of J(θ₀, θ₁) over the parameters θ₀ and θ₁.

Dr. Hazem Shatila


Gradient descent algorithm:

repeat until convergence {
  θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)    (for j = 0 and j = 1)
}

Correct (simultaneous update): compute the new values of θ₀ and θ₁ from the current parameters, then assign both at once.
Incorrect: updating θ₀ first and then using the updated θ₀ when computing θ₁.

Dr. Hazem Shatila


Gradient descent algorithm

Notice : α is the learning rate.

Dr. Hazem Shatila


If α is too small, gradient descent
can be slow.

If α is too large, gradient descent


can overshoot the minimum. It may
fail to converge, or even diverge.

Dr. Hazem Shatila


At a local optimum the derivative term is zero, so the current value of θ₁ is unchanged by the update.

Gradient descent can converge to a local minimum, even with the learning rate α fixed.
As we approach a local minimum, gradient descent will automatically take smaller steps, so there is no need to decrease α over time.

Dr. Hazem Shatila


Gradient Descent for
Linear Regression
Gradient descent algorithm Linear Regression Model

Dr. Hazem Shatila


Gradient descent algorithm for linear regression:

repeat until convergence {
  θ₀ := θ₀ − α · (1/m) Σᵢ ( h_θ(x(i)) − y(i) )
  θ₁ := θ₁ − α · (1/m) Σᵢ ( h_θ(x(i)) − y(i) ) · x(i)
}
update θ₀ and θ₁ simultaneously
Dr. Hazem Shatila
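A NumPy sketch of batch gradient descent for h_θ(x) = θ₀ + θ₁x on housing-style data (the four size/price examples come from the slides; the feature scaling, learning rate, and iteration count are choices made for this sketch):

import numpy as np

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
x = (x - x.mean()) / x.std()        # feature scaling so a fixed alpha converges quickly

theta0, theta1 = 0.0, 0.0
alpha, m = 0.1, len(x)

def cost(t0, t1):
    return ((t0 + t1 * x - y) ** 2).sum() / (2 * m)   # J(theta0, theta1)

for _ in range(500):
    err = theta0 + theta1 * x - y               # h_theta(x) - y for every example
    grad0 = err.sum() / m
    grad1 = (err * x).sum() / m
    # simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1, cost(theta0, theta1))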


(for fixed θ₀, θ₁, h_θ(x) is a function of x)    (J(θ₀, θ₁) is a function of the parameters)

A sequence of plots shows the hypothesis h_θ(x) on the training data alongside the corresponding point on the contour plot of J(θ₀, θ₁) as the parameters change during gradient descent.

Dr. Hazem Shatila


Linear Regression with multiple variables

Hypothesis: h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ   (with x₀ = 1, i.e. h_θ(x) = θᵀx)
Parameters: θ₀, θ₁, …, θₙ
Cost function: J(θ) = (1/2m) Σᵢ ( h_θ(x(i)) − y(i) )²

Gradient descent:
Repeat {
  θⱼ := θⱼ − α · ∂/∂θⱼ J(θ)
}
(simultaneously update θⱼ for every j = 0, …, n)

Dr. Hazem Shatila


New algorithm (n ≥ 1):
Repeat {
  θⱼ := θⱼ − α · (1/m) Σᵢ ( h_θ(x(i)) − y(i) ) · xⱼ(i)
}
(simultaneously update θⱼ for j = 0, …, n)

Previously (n = 1):
Repeat {
  θ₀ := θ₀ − α · (1/m) Σᵢ ( h_θ(x(i)) − y(i) )
  θ₁ := θ₁ − α · (1/m) Σᵢ ( h_θ(x(i)) − y(i) ) · x(i)
}
(simultaneously update θ₀, θ₁)

Dr. Hazem Shatila


Example (each row includes x₀ = 1):

x₀  Size (feet²)  Number of bedrooms  Number of floors  Age of home (years)  Price ($1000)
1   2104          5                   1                 45                   460
1   1416          3                   2                 40                   232
1   1534          3                   2                 30                   315
1   852           2                   1                 36                   178

(simultaneously update all θⱼ)

Dr. Hazem Shatila


Logistic Regression

Dr. Hazem Shatila


Neural Network

Dr. Hazem Shatila


Example of Perceptron Learning

Weight update: w(k+1) = w(k) + λ [ yᵢ − f(w(k), xᵢ) ] xᵢ
Output:        Y = sign( Σᵢ₌₀..d wᵢ Xᵢ )

λ = 0.1, bias input = 1 (its weight is w0)

Training data (X1, X2, X3, Y):
1 0 0 -1;  1 0 1 +1;  1 1 0 +1;  1 1 1 +1;  0 0 1 -1;  0 1 0 -1;  0 1 1 +1;  0 0 0 -1

Weights (w0, w1, w2, w3) after each update step in the first epoch:
0: (0, 0, 0, 0)    1: (-0.2, -0.2, 0, 0)    2: (0, 0, 0, 0.2)    3: (0, 0, 0, 0.2)    4: (0, 0, 0, 0.2)
5: (-0.2, 0, 0, 0)    6: (-0.2, 0, 0, 0)    7: (0, 0, 0.2, 0.2)    8: (-0.2, 0, 0.2, 0.2)

Weights at the end of each epoch:
Epoch  w0    w1   w2   w3
0      0     0    0    0
1      -0.2  0    0.2  0.2
2      -0.2  0    0.4  0.2
3      -0.4  0    0.4  0.2
4      -0.4  0.2  0.4  0.4
5      -0.6  0.2  0.4  0.2
6      -0.6  0.4  0.4  0.2

Dr. Hazem Shatila
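A short NumPy sketch that reproduces the perceptron update above on the same training data; printing the weights at the end of each epoch can be compared with the epoch table (sign(0) is treated as +1 here, an assumption consistent with the table):

import numpy as np

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1],
              [0, 0, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0]], dtype=float)
y = np.array([-1, 1, 1, 1, -1, -1, 1, -1], dtype=float)

X = np.hstack([np.ones((len(X), 1)), X])   # prepend the bias input (column for w0)
w = np.zeros(X.shape[1])                   # w0, w1, w2, w3 all start at 0
lam = 0.1                                  # the learning rate lambda

for epoch in range(6):
    for xi, yi in zip(X, y):
        y_hat = 1.0 if np.dot(w, xi) >= 0 else -1.0    # sign(w . x), with sign(0) = +1
        w = w + lam * (yi - y_hat) * xi                # perceptron weight update
    print(epoch + 1, np.round(w, 1))       # weights at the end of each epoch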


Naïve Bayes

Dr. Hazem Shatila


Bayes Classifier
• A probabilistic framework for solving classification problems
• Conditional probability:

    P(Y | X) = P(X, Y) / P(X)
    P(X | Y) = P(X, Y) / P(Y)

• Bayes theorem:

    P(Y | X) = P(X | Y) P(Y) / P(X)

Dr. Hazem Shatila


Example of Bayes Theorem
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20

• If a patient has stiff neck, what’s the probability


he/she has meningitis?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

Dr. Hazem Shatila


Using Bayes Theorem for Classification
• Consider each attribute and class label as random
variables

• Given a record with attributes (X1, X2,…, Xd)


– Goal is to predict class Y
– Specifically, we want to find the value of Y that
maximizes P(Y| X1, X2,…, Xd )

• Can we estimate P(Y| X1, X2,…, Xd ) directly from


data?
Dr. Hazem Shatila
Example Data

Given a test record: X = (Refund = No, Marital Status = Divorced, Income = 120K)

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

● Can we estimate P(Evade = Yes | X) and P(Evade = No | X)?

In the following we will replace Evade = Yes by Yes, and Evade = No by No.

Dr. Hazem Shatila


Using Bayes Theorem for Classification
• Approach:
  – compute the posterior probability P(Y | X1, X2, …, Xd) using the Bayes theorem:

    P(Y | X1 X2 … Xd) = P(X1 X2 … Xd | Y) P(Y) / P(X1 X2 … Xd)

  – Maximum a-posteriori: choose the Y that maximizes P(Y | X1, X2, …, Xd)

  – Equivalent to choosing the value of Y that maximizes P(X1, X2, …, Xd | Y) P(Y)

• How to estimate P(X1, X2, …, Xd | Y)?

Dr. Hazem Shatila


Example Data
Given a Test Record:
X = (Refund = No, Marital Status = Divorced, Income = 120K)

Tid  Refund  Marital Status  Taxable Income  Evade

1 Yes Single 125K No


2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

Dr. Hazem Shatila


Naïve Bayes Classifier
• Assume independence among attributes Xi when class
is given:
– P(X1, X2, …, Xd |Yj) = P(X1| Yj) P(X2| Yj)… P(Xd| Yj)

– Now we can estimate P(Xi| Yj) for all Xi and Yj


combinations from the training data

– A new point is classified to Yj if P(Yj) ∏ P(Xi | Yj) is maximal.

Dr. Hazem Shatila


Conditional Independence
• X and Y are conditionally independent given Z if
P(X|YZ) = P(X|Z)

• Example: Arm length and reading skills


– Young child has shorter arm length and
limited reading skills, compared to adults
– If age is fixed, no apparent relationship
between arm length and reading skills
– Arm length and reading skills are conditionally
independent given age

Dr. Hazem Shatila


Naïve Bayes on Example Data

Given a test record: X = (Refund = No, Marital Status = Divorced, Income = 120K)

(Training data: the same ten-record table of Refund, Marital Status, Taxable Income, Evade shown above.)

● P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes)

● P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No)

Dr. Hazem Shatila


Estimate Probabilities from Data

(Training data: the same ten-record table of Refund, Marital Status, Taxable Income, Evade used above.)

• Class prior: P(Y) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10

• For categorical attributes:
  P(Xi | Yk) = |Xik| / Nk
  – where |Xik| is the number of instances having attribute value Xi and belonging to class Yk
  – Examples:
    P(Status = Married | No) = 4/7
    P(Refund = Yes | Yes) = 0

Dr. Hazem Shatila


Estimate Probabilities from Data

• For continuous attributes:


– Discretization: Partition the range into bins:
k
• Replace continuous value with bin value
– Attribute changed from continuous to ordinal

– Probability density estimation:


• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once probability distribution is known, use it to
estimate the conditional probability P(Xi|Y)

Dr. Hazem Shatila


Estimate Probabilities from Data

(Training data: the same ten-record table of Refund, Marital Status, Taxable Income, Evade used above.)

• Normal distribution:

  P(Xi | Yj) = 1 / sqrt(2π σij²) · exp( −(Xi − μij)² / (2 σij²) )

  – one distribution for each (Xi, Yj) pair

• For (Income, Class = No):
  – If Class = No:
    • sample mean = 110
    • sample variance = 2975

  P(Income = 120 | No) = 1 / (sqrt(2π) × 54.54) · exp( −(120 − 110)² / (2 × 2975) ) = 0.0072
Dr. Hazem Shatila
Example of Naïve Bayes Classifier
Given a Test Record:
X = (Refund = No, Divorced, Income = 120K)
Naïve Bayes Classifier (estimates from the training data):

P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0

For Taxable Income:
  If class = No:  sample mean = 110, sample variance = 2975
  If class = Yes: sample mean = 90,  sample variance = 25

● P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No)
            = 4/7 × 1/7 × 0.0072 = 0.0006

● P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes)
             = 1 × 1/3 × 1.2 × 10⁻⁹ = 4 × 10⁻¹⁰

Since P(X | No) P(No) > P(X | Yes) P(Yes), therefore P(No | X) > P(Yes | X)
⇒ Class = No

Dr. Hazem Shatila
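A small Python sketch of the same hybrid Naïve Bayes computation: frequency estimates for the categorical attributes and a normal density for Taxable Income, using the values from the slide:

import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

p_no, p_yes = 7 / 10, 3 / 10        # class priors from the training table

# Test record X = (Refund = No, Divorced, Income = 120K)
p_x_no = (4 / 7) * (1 / 7) * normal_pdf(120, 110, 2975)    # ~0.0006
p_x_yes = 1.0 * (1 / 3) * normal_pdf(120, 90, 25)          # ~4e-10

print(p_x_no * p_no, p_x_yes * p_yes)
print("Class =", "No" if p_x_no * p_no > p_x_yes * p_yes else "Yes")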


Issues with Naïve Bayes Classifier

Consider the same training table with record Tid = 7 deleted. The estimates become:

Naïve Bayes Classifier:
P(Refund = Yes | No) = 2/6
P(Refund = No | No) = 4/6
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/6
P(Marital Status = Divorced | No) = 0
P(Marital Status = Married | No) = 4/6
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0/3

For Taxable Income:
  If class = No:  sample mean = 91, sample variance = 685
  If class = Yes: sample mean = 90, sample variance = 25

Given X = (Refund = Yes, Divorced, 120K):
P(X | No)  = 2/6 × 0 × 0.0083 = 0
P(X | Yes) = 0 × 1/3 × 1.2 × 10⁻⁹ = 0

Naïve Bayes will not be able to classify X as Yes or No!

Dr. Hazem Shatila


Issues with Naïve Bayes
Classifier
• If one of the conditional probabilities is zero, then the entire expression
becomes zero
• Need to use other estimates of conditional probabilities than simple
fractions
• Probability estimation:

  Original:   P(Ai | C) = Nic / Nc
  Laplace:    P(Ai | C) = (Nic + 1) / (Nc + c)
  m-estimate: P(Ai | C) = (Nic + m·p) / (Nc + m)

  where:
    c: number of classes
    p: prior probability of the class
    m: parameter
    Nc: number of instances in the class
    Nic: number of instances having attribute value Ai in class c
Dr. Hazem Shatila
Example of Naïve Bayes Classifier … try it

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record:
Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?

A: attributes, M: mammals, N: non-mammals

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.004 × 13/20 = 0.0027

P(A | M) P(M) > P(A | N) P(N)  ⇒  Mammals

Dr. Hazem Shatila




Naïve Bayes (Summary)
• Robust to isolated noise points

• Handle missing values by ignoring the instance during


probability estimate calculations

• Robust to irrelevant attributes

• Independence assumption may not hold for some


attributes
– Use other techniques such as Bayesian Belief
Networks (BBN)

Dr. Hazem Shatila


Nearest Neighbor

Dr. Hazem Shatila


Nearest Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then
it’s probably a duck
[Figure: a test record is compared against the training records — compute the distance to each training record, then choose the k "nearest" ones.]

Dr. Hazem Shatila


Nearest-Neighbor Classifiers
● Requires three things
– The set of labeled records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve

● To classify an unknown record:


– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)

Dr. Hazem Shatila


Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

Dr. Hazem Shatila


1 nearest-neighbor
Voronoi Diagram

Dr. Hazem Shatila


Nearest Neighbor Classification
• Compute distance between two points:
– Euclidean distance
    d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )

• Determine the class from nearest neighbor


list
– Take the majority vote of class labels among the
k-nearest neighbors
– Weigh the vote according to distance
• weight factor, w = 1/d²
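A minimal sketch of such a classifier, assuming numeric feature vectors; the function name and the toy data are illustrative only:

    import math
    from collections import defaultdict

    def knn_predict(train, query, k=3):
        # train: list of (feature_vector, label); distance-weighted majority vote, w = 1/d^2
        def euclidean(p, q):
            return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

        neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
        votes = defaultdict(float)
        for features, label in neighbors:
            d = euclidean(features, query)
            votes[label] += 1.0 / (d ** 2 + 1e-12)   # small constant guards against d = 0
        return max(votes, key=votes.get)

    data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.0), "B"), ((4.2, 3.9), "B")]
    print(knn_predict(data, (1.1, 0.9), k=3))   # "A"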

Dr. Hazem Shatila


Nearest Neighbor
Classification…
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points
from other classes

Dr. Hazem Shatila


Nearest Neighbor
Classification…
• Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by
one of the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M

Dr. Hazem Shatila


Nearest Neighbor
Classification…
• Selection of the right similarity measure is
critical:
    111111111110  vs  011111111111
    000000000001  vs  100000000000

Euclidean distance = 1.4142 for both pairs
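A small sketch contrasting Euclidean distance with cosine similarity on these vectors (the pairing follows the example above):

    import math

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def cosine(p, q):
        dot = sum(a * b for a, b in zip(p, q))
        return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

    a = [int(ch) for ch in "111111111110"]
    b = [int(ch) for ch in "011111111111"]
    c = [int(ch) for ch in "000000000001"]
    d = [int(ch) for ch in "100000000000"]

    print(euclidean(a, b), euclidean(c, d))   # 1.4142 for both pairs
    print(cosine(a, b), cosine(c, d))         # ~0.909 vs 0.0 -- very different similarity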

Dr. Hazem Shatila


Nearest neighbor
Classification…
• k-NN classifiers are lazy learners since they do not build
models explicitly
• Classifying unknown records is relatively expensive
• Can produce arbitrarily shaped decision boundaries
• Easy to handle variable interactions since the decisions
are based on local information
• Selection of right proximity measure is essential
• Superfluous or redundant attributes can create problems
• Missing attributes are hard to handle

Dr. Hazem Shatila


Improving KNN Efficiency
• Avoid having to compute distance to all
objects in the training set
– Multi-dimensional access methods (k-d trees)
– Fast approximate similarity search
– Locality Sensitive Hashing (LSH)
• Condensing
– Determine a smaller set of objects that give the
same performance
• Editing
– Remove objects to improve efficiency

Dr. Hazem Shatila


Decision tree

Dr. Hazem Shatila


Example of a Decision Tree
Training Data:

ID   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes

Model: Decision Tree (splitting attributes)

Home Owner?
├─ Yes → NO
└─ No  → MarSt?
         ├─ Single, Divorced → Income?
         │                      ├─ < 80K → NO
         │                      └─ > 80K → YES
         └─ Married → NO

Dr. Hazem Shatila


Another Example of Decision
Tree
Using the same training data, a different tree also fits:

MarSt?
├─ Married → NO
└─ Single, Divorced → Home Owner?
                      ├─ Yes → NO
                      └─ No  → Income?
                               ├─ < 80K → NO
                               └─ > 80K → YES

There could be more than one tree that fits the same data!

Dr. Hazem Shatila


Apply Model to Test Data
Test Data
Start from the root of tree.
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10

Yes Owner No

NO MarSt
Single, Divorced Married

Income NO
< 80K > 80K

NO YES

Dr. Hazem Shatila


Apply Model to Test Data
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10

Yes Owner No

NO MarSt
Single, Divorced Married Assign Defaulted to
“No”
Income NO
< 80K > 80K

NO YES

Dr. Hazem Shatila


Decision Tree Classification
Task
Tid Attrib1 Attrib2 Attrib3 Class
Tree
1 Yes Large 125K No Induction
2 No Medium 100K No algorithm
3 No Small 70K No

4 Yes Medium 120K No


Induction
5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No Learn


8 No Small 85K Yes Model
9 No Medium 75K No

10 No Small 90K Yes


Model
10

Training Set
Apply Decision
Tid Attrib1 Attrib2 Attrib3 Class
Model Tree
11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?


Deduction
14 No Small 95K ?

15 No Large 67K ?
10

Test Set

Dr. Hazem Shatila


Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT

Dr. Hazem Shatila


General Structure of Hunt’s
Algorithm
• Let Dt be the set of training records that reach a node t
  (the Defaulted Borrower training data above).

• General Procedure:
  – If Dt contains records that belong to the same class yt,
    then t is a leaf node labeled as yt.
  – If Dt contains records that belong to more than one class,
    use an attribute test to split the data into smaller subsets.
    Recursively apply the procedure to each subset (see the sketch below).
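A minimal recursive sketch of this procedure, assuming records are dictionaries and candidate tests are supplied as (name, predicate) pairs; a real implementation would pick the best test with one of the impurity measures discussed in the following slides:

    from collections import Counter

    def hunts_algorithm(records, labels, attribute_tests):
        # Case 1: all records belong to the same class -> leaf node
        if len(set(labels)) == 1:
            return {"leaf": labels[0]}
        # No usable test left (or identical attribute values) -> majority-class leaf
        if not attribute_tests:
            return {"leaf": Counter(labels).most_common(1)[0][0]}

        name, predicate = attribute_tests[0]       # a real implementation picks the best test
        remaining = attribute_tests[1:]
        left  = [(r, y) for r, y in zip(records, labels) if predicate(r)]
        right = [(r, y) for r, y in zip(records, labels) if not predicate(r)]
        if not left or not right:                  # degenerate split -> majority-class leaf
            return {"leaf": Counter(labels).most_common(1)[0][0]}
        return {
            "test": name,
            "yes": hunts_algorithm([r for r, _ in left],  [y for _, y in left],  remaining),
            "no":  hunts_algorithm([r for r, _ in right], [y for _, y in right], remaining),
        }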

Dr. Hazem Shatila


Hunt's Algorithm applied to the Defaulted Borrower data grows the tree in stages
(class counts shown as (No, Yes)):

(a) A single leaf: Defaulted = No                                    (7, 3)

(b) Split on Home Owner:
      Yes → Defaulted = No   (3, 0)
      No  → Defaulted = No   (4, 3)

(c) Split the Home Owner = No child on Marital Status:
      Home Owner = Yes                   → Defaulted = No    (3, 0)
      Home Owner = No, Married           → Defaulted = No    (3, 0)
      Home Owner = No, Single/Divorced   → Defaulted = Yes   (1, 3)

(d) Split the Single/Divorced node on Annual Income:
      Annual Income < 80K    → Defaulted = No    (1, 0)
      Annual Income >= 80K   → Defaulted = Yes   (0, 3)

Dr. Hazem Shatila


Design Issues of Decision Tree
Induction
• How should training records be split?
– Method for specifying test condition
• depending on attribute types
– Measure for evaluating the goodness of a test
condition

• How should the splitting procedure stop?


– Stop splitting if all the records belong to the same
class or have identical attribute values
– Early termination

Dr. Hazem Shatila


Methods for Expressing Test
Conditions
• Depends on attribute types
– Binary
– Nominal
– Ordinal
– Continuous

• Depends on number of ways to split


– 2-way split
– Multi-way split

Dr. Hazem Shatila


Test Condition for Nominal
Attributes
• Multi-way split:
  – Use as many partitions as distinct values.
      Marital Status → {Single}, {Divorced}, {Married}

• Binary split:
  – Divides values into two subsets, e.g.
      {Married} vs {Single, Divorced}
      {Single} vs {Married, Divorced}
      {Single, Married} vs {Divorced}

Dr. Hazem Shatila


Test Condition for Ordinal Attributes
• Multi-way split:
  – Use as many partitions as distinct values.
      Shirt Size → {Small}, {Medium}, {Large}, {Extra Large}

• Binary split:
  – Divides values into two subsets and preserves the order property
    among attribute values, e.g.
      {Small, Medium} vs {Large, Extra Large}
      {Small} vs {Medium, Large, Extra Large}
  – The grouping {Small, Large} vs {Medium, Extra Large} violates the
    order property.

Dr. Hazem Shatila


Test Condition for Continuous
Attributes
(i) Binary split:      Annual Income > 80K?  → Yes / No
(ii) Multi-way split:  Annual Income?  → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

Dr. Hazem Shatila


Splitting Based on Continuous
Attributes
• Different ways of handling
– Discretization to form an ordinal categorical
attribute
Ranges can be found by equal interval bucketing, equal
frequency bucketing (percentiles), or clustering.
• Static – discretize once at the beginning
• Dynamic – repeat at each node

– Binary Decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut
  • can be more compute intensive

Dr. Hazem Shatila


How to determine the Best Split

Before Splitting: 10 records of class 0,


10 records of class 1

Three candidate test conditions:

  Gender:       Yes → C0: 6, C1: 4   |   No → C0: 4, C1: 6
  Car Type:     Family → C0: 1, C1: 3   |   Sports → C0: 8, C1: 0   |   Luxury → C0: 1, C1: 7
  Customer ID:  c1 … c20 — one record per child node (each child is pure)

Which test condition is the best?

Dr. Hazem Shatila


How to determine the Best Split
• Greedy approach:
– Nodes with purer class distribution are
preferred

• Need a measure of node impurity:

  C0: 5, C1: 5 → high degree of impurity          C0: 9, C1: 1 → low degree of impurity

Dr. Hazem Shatila


Measures of Node Impurity
• Gini Index:
    GINI(t) = 1 − Σⱼ [p(j | t)]²

• Entropy:
    Entropy(t) = − Σⱼ p(j | t) log p(j | t)

• Misclassification error:
    Error(t) = 1 − maxᵢ P(i | t)

• Regression trees use the standard deviation.


Dr. Hazem Shatila
Finding the Best Split
1. Compute impurity measure (P) before
splitting
2. Compute impurity measure (M) after splitting
● Compute impurity measure of each child node
● M is the weighted impurity of children
3. Choose the attribute test condition that
produces the highest gain
Gain = P – M
or equivalently, lowest impurity measure
after splitting (M)

Dr. Hazem Shatila


Finding the Best Split
C0 N00
Before Splitting: P
C1 N01

A? B?
Yes No Yes No

Node N1 Node N2 Node N3 Node N4

C0 N10 C0 N20 C0 N30 C0 N40


C1 N11 C1 N21 C1 N31 C1 N41

M11 M12 M21 M22

M1 M2
Gain = P – M1 vs P – M2
Dr. Hazem Shatila
Measure of Impurity: GINI
• Gini Index for a given node t :

    GINI(t) = 1 − Σⱼ [p(j | t)]²

(NOTE: p( j | t) is the relative frequency of class j at node t).

– Maximum (1 - 1/nc) when records are equally


distributed among all classes, implying least
interesting information
– Minimum (0.0) when all records belong to one class,
implying most interesting information

Dr. Hazem Shatila


Measure of Impurity: GINI
• Gini Index for a given node t :

    GINI(t) = 1 − Σⱼ [p(j | t)]²

(NOTE: p( j | t) is the relative frequency of class j at node t).

– For 2-class problem (p, 1 – p):


• GINI = 1 − p² − (1 − p)² = 2p(1 − p)

C1 0 C1 1 C1 2 C1 3
C2 6 C2 5 C2 4 C2 3
Gini=0.000 Gini=0.278 Gini=0.444 Gini=0.500

Dr. Hazem Shatila


Computing Gini Index of a Single
Node
    GINI(t) = 1 − Σⱼ [p(j | t)]²

C1 = 0, C2 = 6:   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                  Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1 = 1, C2 = 5:   P(C1) = 1/6, P(C2) = 5/6
                  Gini = 1 − (1/6)² − (5/6)² = 0.278

C1 = 2, C2 = 4:   P(C1) = 2/6, P(C2) = 4/6
                  Gini = 1 − (2/6)² − (4/6)² = 0.444

Dr. Hazem Shatila


Computing Gini Index for a
Collection of Nodes
• When a node p is split into k partitions (children)
    GINI_split = Σᵢ₌₁ᵏ (nᵢ / n) · GINI(i)

where, ni = number of records at child i,


n = number of records at parent node p.

• Choose the attribute that minimizes weighted average


Gini index of the children

• Gini index is used in decision tree algorithms such as


CART, SLIQ, SPRINT

Dr. Hazem Shatila


Binary Attributes: Computing GINI Index

● Splits into two partitions


● Effect of weighting partitions:
  – Larger and purer partitions are sought.
Parent
B? C1 7
Yes No C2 5
Gini = 0.486
Node N1 Node N2
Gini(N1) = 1 − (5/6)² − (1/6)² = 0.278            N1: C1 = 5, C2 = 1
Gini(N2) = 1 − (2/6)² − (4/6)² = 0.444            N2: C1 = 2, C2 = 4

Weighted Gini of the split = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
Gain = 0.486 − 0.361 = 0.125
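A short computation reproducing the numbers above:

    def gini(counts):
        # GINI(t) = 1 - sum_j p(j|t)^2 for a list of class counts at a node
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    parent = [7, 5]                     # C1 = 7, C2 = 5
    n1, n2 = [5, 1], [2, 4]             # children produced by the binary split B
    weighted = (sum(n1) / sum(parent)) * gini(n1) + (sum(n2) / sum(parent)) * gini(n2)

    print(round(gini(parent), 3))                    # 0.486
    print(round(gini(n1), 3), round(gini(n2), 3))    # 0.278 0.444
    print(round(weighted, 3))                        # 0.361
    print(round(gini(parent) - weighted, 3))         # gain = 0.125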

Dr. Hazem Shatila


Categorical Attributes: Computing Gini Index

● For each distinct value, gather counts for each class in


the dataset
● Use the count matrix to make decisions
Multi-way split Two-way split
(find best partition of values)

CarType CarType CarType


{Sports, {Family,
Family Sports Luxury {Family} {Sports}
Luxury} Luxury}
C1 1 8 1 C1 9 1 C1 8 2
C2 3 0 7 C2 7 3 C2 0 10
Gini 0.163 Gini 0.468 Gini 0.167

Which of these is the best?

Dr. Hazem Shatila


Continuous Attributes: Computing Gini Index

● Use binary decisions based on one splitting value v
  (illustrated on the Defaulted Borrower training data above).

● Several choices for the splitting value
  – Number of possible splitting values = number of distinct values

● Each splitting value v has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A ≥ v

● Simple method to choose the best v
  – For each v, scan the database to gather the count matrix and compute its Gini index
  – Computationally inefficient! Repetition of work.

  Example count matrix for Annual Income, v = 80:

                     ≤ 80    > 80
  Defaulted = Yes     0       3
  Defaulted = No      3       4

Dr. Hazem Shatila


Continuous Attributes: Computing Gini Index...

● For efficient computation: for each attribute,


– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix
and computing gini index
– Choose the split position that has the least gini index

Cheat No No No Yes Yes Yes No No No No


Annual Income
Sorted Values 60 70 75 85 90 95 100 120 125 220
Split Positions 55 65 72 80 87 92 97 110 122 172 230
<= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= >
Yes 0 3 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0 3 0

No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0

Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
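A minimal sketch of this sort-and-scan procedure on the Annual Income attribute; the function name is illustrative:

    def best_gini_split(values, labels):
        # Sort the continuous attribute once, then scan candidate split positions
        # (midpoints between consecutive values), keeping running class counts.
        def gini(counts):
            n = sum(counts)
            return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

        data = sorted(zip(values, labels))
        classes = sorted(set(labels))
        left = {c: 0 for c in classes}
        right = {c: sum(1 for y in labels if y == c) for c in classes}
        n = len(labels)
        best = (None, float("inf"))

        for i in range(len(data) - 1):
            v, y = data[i]
            left[y] += 1
            right[y] -= 1
            split = (v + data[i + 1][0]) / 2.0
            k = i + 1
            w = (k / n) * gini(list(left.values())) + ((n - k) / n) * gini(list(right.values()))
            if w < best[1]:
                best = (split, w)
        return best

    income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
    cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
    print(best_gini_split(income, cheat))   # ~(97.5, 0.30) -- the 0.300 split at position 97 above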

Dr. Hazem Shatila




Measure of Impurity: Entropy
• Entropy at a given node t:
    Entropy(t) = − Σⱼ p(j | t) log p(j | t)

(NOTE: p( j | t) is the relative frequency of class j at node t).

• Maximum (log nc) when records are equally distributed


among all classes implying least information
• Minimum (0.0) when all records belong to one class,
implying most information

– Entropy based computations are quite similar to


the GINI index computations
Dr. Hazem Shatila
Computing Entropy of a Single
Node
    Entropy(t) = − Σⱼ p(j | t) log₂ p(j | t)

C1 = 0, C2 = 6:   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                  Entropy = − 0 log 0 − 1 log 1 = − 0 − 0 = 0

C1 = 1, C2 = 5:   P(C1) = 1/6, P(C2) = 5/6
                  Entropy = − (1/6) log₂ (1/6) − (5/6) log₂ (5/6) = 0.65

C1 = 2, C2 = 4:   P(C1) = 2/6, P(C2) = 4/6
                  Entropy = − (2/6) log₂ (2/6) − (4/6) log₂ (4/6) = 0.92
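A short helper reproducing the entropy values above:

    import math

    def entropy(counts):
        # Entropy(t) = - sum_j p(j|t) log2 p(j|t), with 0 log 0 taken as 0
        n = sum(counts)
        ps = [c / n for c in counts if c > 0]
        return -sum(p * math.log2(p) for p in ps)

    print(round(entropy([0, 6]), 2))   # 0.0
    print(round(entropy([1, 5]), 2))   # 0.65
    print(round(entropy([2, 4]), 2))   # 0.92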

Dr. Hazem Shatila


Computing Information Gain After Splitting
• Information Gain:
    GAIN_split = Entropy(p) − Σᵢ₌₁ᵏ (nᵢ / n) · Entropy(i)

Parent Node, p is split into k partitions;


ni is number of records in partition i

– Choose the split that achieves most reduction


(maximizes GAIN)

– Used in the ID3 and C4.5 decision tree algorithms

Dr. Hazem Shatila


Problem with large number of partitions
• Node impurity measures tend to prefer
splits that result in large number of
partitions, each being small but pure
  Gender:       Yes → C0: 6, C1: 4   |   No → C0: 4, C1: 6
  Car Type:     Family → C0: 1, C1: 3   |   Sports → C0: 8, C1: 0   |   Luxury → C0: 1, C1: 7
  Customer ID:  c1 … c20 — one record per child node (each child is pure)

– Customer ID has highest information gain


because entropy for all the children is zero

Dr. Hazem Shatila


Gain Ratio
• Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = − Σᵢ₌₁ᵏ (nᵢ / n) log (nᵢ / n)

Parent Node, p is split into k partitions


ni is the number of records in partition i

– Adjusts Information Gain by the entropy of the partitioning


(SplitINFO).
• Higher entropy partitioning (large number of small partitions) is
penalized!
– Used in C4.5 algorithm
– Designed to overcome the disadvantage of Information Gain

Dr. Hazem Shatila


Gain Ratio
• Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = − Σᵢ₌₁ᵏ (nᵢ / n) log (nᵢ / n)

Parent Node, p is split into k partitions


ni is the number of records in partition i

CarType CarType CarType


{Sports, {Family,
Family Sports Luxury {Family} {Sports}
Luxury} Luxury}
C1 1 8 1 C1 9 1 C1 8 2
C2 3 0 7 C2 7 3 C2 0 10
Gini 0.163 Gini 0.468 Gini 0.167

SplitINFO = 1.52 SplitINFO = 0.72 SplitINFO = 0.97

Dr. Hazem Shatila


Measure of Impurity: Classification Error

• Classification error at a node t :

    Error(t) = 1 − maxᵢ P(i | t)

– Maximum (1 - 1/nc) when records are equally


distributed among all classes, implying least
interesting information
– Minimum (0) when all records belong to one class,
implying most interesting information

Dr. Hazem Shatila


Computing Error of a Single
Node
    Error(t) = 1 − maxᵢ P(i | t)

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Error = 1 – max (0, 1) = 1 – 1 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

Dr. Hazem Shatila


Comparison among Impurity
Measures
For a 2-class problem:

Dr. Hazem Shatila


Misclassification Error vs Gini Index
A? Parent
C1 7
Yes No
C2 3
Node N1 Node N2 Gini = 0.42

Gini(N1) = 1 − (3/3)² − (0/3)² = 0                N1: C1 = 3, C2 = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489            N2: C1 = 4, C2 = 3

Gini(children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves, but the misclassification error remains the same!

Dr. Hazem Shatila


Misclassification Error vs Gini Index
A? Parent
C1 7
Yes No
C2 3
Node N1 Node N2 Gini = 0.42

N1 N2 N1 N2
C1 3 4 C1 3 4
C2 0 3 C2 1 2
Gini=0.342 Gini=0.416

Misclassification error for all three cases = 0.3 !

Dr. Hazem Shatila


Decision Boundary
[Figure: points in the unit square partitioned by the tree
    x < 0.43?  — Yes: y < 0.33?  (Yes: 4/0, No: 0/4)  — No: y < 0.47?  (Yes: 0/3, No: 4/0)]
• Border line between two neighboring regions of different classes is known
as decision boundary
• Decision boundary is parallel to axes because test condition involves a
single attribute at-a-time

Dr. Hazem Shatila


Decision Tree Based Classification
● Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid
overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless
the attributes are interacting)
● Disadvantages:
– Space of possible decision trees is exponentially large.
Greedy approaches are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute

Dr. Hazem Shatila


Cluster Analysis

Dr. Hazem Shatila


What is Cluster Analysis?
• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups

Intra-cluster distances are minimized; inter-cluster distances are maximized.

Dr. Hazem Shatila


Applications of Cluster Analysis
• Understanding
  – Group related documents for browsing, group genes and proteins that have
    similar functionality, or group stocks with similar price fluctuations.

    Discovered Clusters (stocks)                                               Industry Group
    1  Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN,
       CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN,
       Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down,
       Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN                   Technology1-DOWN
    2  Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN,
       Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN,
       Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN,
       Microsoft-DOWN, Scientific-Atl-DOWN                                     Technology2-DOWN
    3  Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN,
       Morgan-Stanley-DOWN                                                     Financial-DOWN
    4  Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP,
       Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP        Oil-UP

• Summarization
  – Reduce the size of large data sets.
    [Figure: clustering precipitation in Australia]

Dr. Hazem Shatila


What is not Cluster Analysis?
• Simple segmentation
– Dividing students into different registration groups
alphabetically, by last name

• Results of a query
– Groupings are a result of an external specification
– Clustering is a grouping of objects based on the data

• Supervised classification
– Have class label information

• Association Analysis
– Local vs. global connections

Dr. Hazem Shatila


Notion of a Cluster can be
Ambiguous

How many clusters? Six Clusters

Two Clusters Four Clusters

Dr. Hazem Shatila


Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical
and partitional sets of clusters
• Partitional Clustering
– A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one subset

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree

Dr. Hazem Shatila


Partitional Clustering

Original Points A Partitional Clustering

Dr. Hazem Shatila


Hierarchical Clustering

[Figure: points p1–p4 — a traditional hierarchical clustering with its dendrogram,
 and a non-traditional hierarchical clustering with its dendrogram]

Dr. Hazem Shatila


Other Distinctions Between Sets of Clusters

• Exclusive versus non-exclusive


– In non-exclusive clusterings, points may belong to multiple
clusters.
– Can represent multiple classes or ‘border’ points
• Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some
weight between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
• Partial versus complete
– In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities

Dr. Hazem Shatila


Types of Clusters
• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density-based clusters

• Property or Conceptual

• Described by an Objective Function


Dr. Hazem Shatila
Types of Clusters: Well-Separated

• Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.

3 well-separated clusters

Dr. Hazem Shatila


Types of Clusters: Center-Based

• Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster

4 center-based clusters

Dr. Hazem Shatila


Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or


Transitive)
– A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.

8 contiguous clusters

Dr. Hazem Shatila


Types of Clusters: Density-Based

• Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.

6 density-based clusters

Dr. Hazem Shatila


Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters


– Finds clusters that share some common property or represent
a particular concept.
.

2 Overlapping Circles

Dr. Hazem Shatila


Types of Clusters: Objective Function

• Clusters Defined by an Objective Function


– Finds clusters that minimize or maximize an objective function.
– Enumerate all possible ways of dividing the points into clusters and
evaluate the `goodness' of each potential set of clusters by using
the given objective function.
– Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
– A variation of the global objective function approach is to fit the
data to a parameterized model.
• Parameters for the model are determined from the data.
• Mixture models assume that the data is a ‘mixture' of a number of
statistical distributions.

Dr. Hazem Shatila


Map Clustering Problem to a Different Problem

• Map the clustering problem to a different


domain and solve a related problem in that
domain
– Proximity matrix defines a weighted graph,
where the nodes are the points being
clustered, and the weighted edges represent
the proximities between points

– Clustering is equivalent to breaking the graph


into connected components, one for each
cluster.
Dr. Hazem Shatila
Characteristics of the Input Data Are Important

• Type of proximity or density measure


– Central to clustering
– Depends on data and application

• Data characteristics that affect proximity and/or density are


– Dimensionality
• Sparseness
– Attribute type
– Special relationships in the data
• For example, autocorrelation
– Distribution of the data

• Noise and Outliers


– Often interfere with the operation of the clustering algorithm

Dr. Hazem Shatila


Clustering Algorithms
• K-means and its variants

• Hierarchical clustering

• Density-based clustering

Dr. Hazem Shatila


K-means Clustering

• Partitional clustering approach


• Number of clusters, K, must be specified
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest
centroid
• The basic algorithm is very simple
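A minimal sketch of the basic algorithm on 2-D points, assuming Euclidean distance; empty clusters simply keep their old centroid here (better strategies are discussed later):

    import random

    def kmeans(points, k, iters=100):
        # Basic K-means: random initial centroids, assign each point to the closest
        # centroid, recompute centroids as cluster means, repeat until assignments stop changing.
        def dist2(p, q):
            return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

        centroids = random.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: dist2(p, centroids[j]))
                clusters[i].append(p)
            new_centroids = [
                (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
                for i, c in enumerate(clusters)
            ]
            if new_centroids == centroids:        # converged
                break
            centroids = new_centroids
        sse = sum(dist2(p, centroids[min(range(k), key=lambda j: dist2(p, centroids[j]))])
                  for p in points)
        return centroids, clusters, sse

    pts = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(50)] + \
          [(random.gauss(3, 0.3), random.gauss(3, 0.3)) for _ in range(50)]
    centers, clusters, sse = kmeans(pts, k=2)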

Dr. Hazem Shatila


Example of K-means Clustering
[Figure: K-means clustering of a two-dimensional data set — final assignment after 6 iterations]

Dr. Hazem Shatila
Example of K-means Clustering
[Figure: K-means iterations 1–6 — centroids and cluster assignments converging]

Dr. Hazem Shatila


K-means Clustering – Details

• Initial centroids are often chosen randomly.


– Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the
cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
• K-means will converge for common similarity measures
mentioned above.
• Most of the convergence happens in the first few
iterations.
– Often the stopping condition is changed to ‘Until relatively few
points change clusters’
• Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes

Dr. Hazem Shatila


Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster
  – To get SSE, we square these errors and sum them:

        SSE = Σᵢ₌₁ᴷ Σ_{x ∈ Cᵢ} dist²(mᵢ, x)

– x is a data point in cluster Ci and mi is the representative point for


cluster Ci
• can show that mi corresponds to the center (mean) of the cluster
– Given two sets of clusters, we prefer the one with the smallest
error
– One easy way to reduce SSE is to increase K, the number of
clusters
• A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
Dr. Hazem Shatila
Two different K-means Clusterings
[Figure: the same original points clustered two ways by K-means — an optimal clustering and a sub-optimal clustering]

Dr. Hazem Shatila


Limitations of K-means
• K-means has problems when clusters are
of differing
– Sizes
– Densities
– Non-globular shapes

• K-means has problems when the data


contains outliers.

Dr. Hazem Shatila


Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

Dr. Hazem Shatila


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

Dr. Hazem Shatila


Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)

Dr. Hazem Shatila


Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters.


Find parts of clusters, but need to put together.

Dr. Hazem Shatila


Overcoming K-means Limitations

Original Points K-means Clusters

Dr. Hazem Shatila


Overcoming K-means Limitations

Original Points K-means Clusters

Dr. Hazem Shatila


Importance of Choosing Initial Centroids
[Figure: K-means with a good choice of initial centroids — the iterations converge to the natural clusters after 6 iterations]

Dr. Hazem Shatila
Importance of Choosing Initial Centroids
[Figure: iterations 1–6 for the good initialization]

Dr. Hazem Shatila


Importance of Choosing Initial Centroids …
[Figure: K-means with a poor choice of initial centroids — after 5 iterations the clustering is sub-optimal]

Dr. Hazem Shatila
Importance of Choosing Initial Centroids …

[Figure: iterations 1–5 for the poor initialization]

Dr. Hazem Shatila


Problems with Selecting Initial Points

• If there are K ‘real’ clusters then the chance of selecting


one centroid from each cluster is small.
– Chance is relatively small when K is large
– If clusters are the same size, n, then the probability of selecting one
  centroid from each cluster is

      P = K! nᴷ / (Kn)ᴷ = K! / Kᴷ

– For example, if K = 10, then probability = 10!/10¹⁰ = 0.00036


– Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t
– Consider an example of five pairs of clusters

Dr. Hazem Shatila


10 Clusters Example

[Figure: five pairs of clusters] Starting with two initial centroids in one cluster of each pair of clusters.

Dr. Hazem Shatila


10 Clusters Example
[Figure: iterations 1–4 of K-means on the ten-cluster data]

Starting with two initial centroids in one cluster of each pair of clusters

Dr. Hazem Shatila


10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one.

[Figure: five pairs of clusters with an uneven initial centroid allocation]

Dr. Hazem Shatila
10 Clusters Example
[Figure: iterations 1–4 of K-means for the uneven initialization]

Starting with some pairs of clusters having three initial centroids, while others have only one.

Dr. Hazem Shatila


Solutions to Initial Centroids
Problem
• Multiple runs
– Helps, but probability is not on your side
• Sample and use hierarchical clustering to
determine initial centroids
• Select more than k initial centroids and then
select among these initial centroids
– Select most widely separated
• Postprocessing
• Generate a larger number of clusters and
then perform a hierarchical clustering
• Bisecting K-means
– Not as susceptible to initialization issues

Dr. Hazem Shatila


Empty Clusters
• K-means can yield empty clusters
  Example (one-dimensional data 6.5, 9, 10, 15, 16, 18.5):
    Initial centroids: 6.8, 13, 18.
    After reassignment and update the centroids become 7.75, 12.5, 17.25;
    on the next assignment the middle centroid receives no points — an empty cluster.

Dr. Hazem Shatila


Handling Empty Clusters
• Basic K-means algorithm can yield empty
clusters

• Several strategies
– Choose the point that contributes most to SSE
– Choose a point from the cluster with the
highest SSE
– If there are several empty clusters, the above
can be repeated several times.

Dr. Hazem Shatila


Updating Centers Incrementally
• In the basic K-means algorithm, centroids
are updated after all points are assigned to
a centroid

• An alternative is to update the centroids


after each assignment (incremental
approach)
– More expensive
– Introduces an order dependency
– Never get an empty cluster
Dr. Hazem Shatila
Pre-processing and Post-
processing
• Pre-processing
– Normalize the data
– Eliminate outliers
• Post-processing
– Eliminate small clusters that may represent
outliers
– Split ‘loose’ clusters, i.e., clusters with
relatively high SSE
– Merge clusters that are ‘close’ and that have relatively low SSE

Dr. Hazem Shatila
Bisecting K-means

• Bisecting K-means algorithm


– Variant of K-means that can produce a partitional or a
hierarchical clustering

CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview

Dr. Hazem Shatila


Bisecting K-means Example

Dr. Hazem Shatila


Evaluation of Learning Models

Dr. Hazem Shatila


How to evaluate the Classifier’s
Generalization Performance?
• Assume that we test a classifier on some
test set and we derive at the end the
following confusion matrix:

                     Predicted class
                     Pos     Neg
Actual class   Pos   TP      FN       P
               Neg   FP      TN       N

Dr. Hazem Shatila


Confusion Matrix

Using the confusion matrix entries TP, FN, FP, TN:

  Recall / Sensitivity (True Positive Rate)    = TP / (TP + FN)
  Specificity (True Negative Rate)             = TN / (TN + FP)
  Precision (Positive Predictive Value, PPV)   = TP / (TP + FP)
  Negative Predictive Value (NPV)              = TN / (TN + FN)
  Accuracy                                     = (TP + TN) / (TP + FN + FP + TN)
  F1 Score (best at 1)                         = 2 × Precision × Recall / (Precision + Recall)
  Geometric Mean                               = √(Precision × Recall)
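A small helper computing these metrics from the four confusion-matrix counts (the example counts are illustrative):

    def confusion_metrics(tp, fn, fp, tn):
        # Metrics derived from a binary confusion matrix
        recall      = tp / (tp + fn)          # sensitivity, true positive rate
        specificity = tn / (tn + fp)          # true negative rate
        precision   = tp / (tp + fp)          # positive predictive value
        npv         = tn / (tn + fn)          # negative predictive value
        accuracy    = (tp + tn) / (tp + fn + fp + tn)
        f1          = 2 * precision * recall / (precision + recall)
        g_mean      = (precision * recall) ** 0.5
        return {"recall": recall, "specificity": specificity, "precision": precision,
                "npv": npv, "accuracy": accuracy, "f1": f1, "g_mean": g_mean}

    print(confusion_metrics(tp=40, fn=10, fp=5, tn=45))   # illustrative counts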

Dr. Hazem Shatila


Example

Dr. Hazem Shatila


Example


Dr. Hazem Shatila


How to Estimate the Metrics?

• We can use:
– Training data;
– Independent test data;
– Hold-out method;
– k-fold cross-validation method;
– Leave-one-out method;
– Bootstrap method;
– And many more…

Dr. Hazem Shatila


Estimation with Training Data

• The accuracy/error estimates on the training data


are not good indicators of performance on future
data.
Classifier

Training set Training set

– Q: Why?
– A: Because new data will probably not be exactly the
same as the training data!
• The accuracy/error estimates on the training data
measure the degree of classifier’s overfitting.
Dr. Hazem Shatila
Estimation with Independent Test Data

• Estimation with independent test data is used when


we have plenty of data and there is a natural way to
forming training and test data.

Classifier

Training set Test set


• For example: Quinlan in 1987 reported experiments in
a medical domain for which the classifiers were
trained on data from 1985 and tested on data from
1986.

Dr. Hazem Shatila


Hold-out Method

• The hold-out method splits the data into training data


and test data (usually 2/3 for train, 1/3 for test). Then we
build a classifier using the train data and test it using the
test data.
Classifier

Training set Test set

Data
• The hold-out method is usually used when we have
thousands of instances, including several hundred
instances from each class.

Dr. Hazem Shatila


Making the Most of the Data

• Once evaluation is complete, all the data


can be used to build the final classifier.
• Generally, the larger the training data the
better the classifier (but returns diminish).
• The larger the test data the more accurate
the error estimate.

Dr. Hazem Shatila


Stratification
• The holdout method reserves a certain
amount for testing and uses the remainder
for training.
– Usually: one third for testing, the rest for training.
• For “unbalanced” datasets, samples might
not be representative.
– Few or no instances of some classes.
• Stratified sample: advanced version of
balancing the data.
– Make sure that each class is represented with
approximately equal proportions in both subsets.

Dr. Hazem Shatila


Repeated Holdout Method

• Holdout estimate can be made more


reliable by repeating the process with
different subsamples.
– In each iteration, a certain proportion is
randomly selected for training (possibly with
stratification).
– The error rates on the different iterations are
averaged to yield an overall error rate.
• This is called the repeated holdout
method.
Dr. Hazem Shatila
Repeated Holdout Method, 2

• Still not optimum: the different test sets


overlap, but we would like all our instances
from the data to be tested at least once.
• Can we prevent overlapping?

Dr. Hazem Shatila


k-Fold Cross-Validation
• k-fold cross-validation avoids overlapping test
sets:
– First step: data is split into k subsets of equal size;
– Second step: each subset in turn is used for testing and
the remainder for training.
• The subsets are stratified before the cross-validation.
• The estimates are averaged to
yield an overall estimate.
train train test

Data train test train

test train train
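A minimal sketch of the k-fold split (round-robin, without stratification; a stratified split would additionally keep the class proportions in each fold). The commented usage assumes a model object with fit/predict, which is illustrative:

    def k_fold_indices(n, k):
        # Split indices 0..n-1 into k folds; each fold is used as the test set once
        folds = [list(range(i, n, k)) for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test

    # illustrative use with any classifier exposing fit/predict and numpy arrays X, y:
    # accuracies = []
    # for train_idx, test_idx in k_fold_indices(len(X), k=10):
    #     model.fit(X[train_idx], y[train_idx])
    #     accuracies.append((model.predict(X[test_idx]) == y[test_idx]).mean())
    # print(sum(accuracies) / len(accuracies))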

Dr. Hazem Shatila


More on Cross-Validation
• Standard method for evaluation: stratified 10-fold
cross-validation.
• Why 10? Extensive experiments have shown that this
is the best choice to get an accurate estimate.
• Stratification reduces the estimate’s variance.
• Even better: repeated stratified cross-validation:
– E.g. ten-fold cross-validation is repeated ten times
and results are averaged (reduces the variance).

Dr. Hazem Shatila


Receiver Operating Characteristic
(ROC) Curve

• It is commonly called the ROC curve.


• It is a plot of the true positive rate (TPR)
against the false positive rate (FPR).
• True positive rate:  TPR = TP / (TP + FN)

• False positive rate: FPR = FP / (FP + TN)
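A minimal sketch that produces (FPR, TPR) points by sweeping a threshold over classifier scores; the scores and labels are illustrative:

    def roc_points(scores, labels):
        # labels are 1 (positive) / 0 (negative); a higher score means "more positive"
        P = sum(labels)
        N = len(labels) - P
        points = [(0.0, 0.0)]
        for threshold in sorted(set(scores), reverse=True):
            tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
            fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
            points.append((fp / N, tp / P))     # (FPR, TPR)
        return points

    print(roc_points([0.9, 0.8, 0.7, 0.6, 0.55], [1, 1, 0, 1, 0]))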

Dr. Hazem Shatila


Sensitivity and Specificity

• In statistics, there are two other evaluation


measures:
– Sensitivity: Same as TPR
– Specificity: Also called True Negative Rate
(TNR)

• Then we have: sensitivity = TPR and specificity = 1 − FPR (i.e., FPR = 1 − specificity)

Dr. Hazem Shatila


Example ROC curves

Dr. Hazem Shatila


Area under the curve (AUC)

• Which classifier is better, C1 or C2?


– It depends on which region you talk about.
• Can we have one measure?
– Yes, we compute the area under the curve (AUC)
• If AUC for Ci is greater than that of Cj, it is
said that Ci is better than Cj.
– If a classifier is perfect, its AUC value is 1
– If a classifier makes all random guesses, its AUC
value is 0.5.

Dr. Hazem Shatila


Lower left point (0,0)

Dr. Hazem Shatila


Upper Right Corner (1,1)

Dr. Hazem Shatila


Point D (0,1)

Dr. Hazem Shatila


Several Points in ROC space

Dr. Hazem Shatila


Point C (random experiment)

Dr. Hazem Shatila


Example

Dr. Hazem Shatila


Example (Accuracy)

Dr. Hazem Shatila


Example

Dr. Hazem Shatila
