
Basics of Machine

Learning & Python

Dr. Hazem Shatila



Key Issues to be Covered
• Statistics, Linear Algebra & Probability
• Introduction to ML & DM
• Machine Learning
• Supervised and unsupervised learning
• Types of Data
• Data Preprocessing
• Frequent Itemsets
• Association Rules & Apriori algorithm
• Regression
• Base Classifiers
  – Logistic Regression
  – Decision Tree based Methods
  – Nearest-neighbor
  – Neural Networks
  – Naïve Bayes
  – Support Vector Machines
  – Kernel Trick in SVM
• Clustering
  – K-means Clustering
  – Hierarchical Clustering
  – Cluster Evaluation
• Evaluation of learning models
• ROC & lift curves
• Data Mining Practical
• Introduction to Python, visualization
• Use cases, assignments and practical examples for machine learning

References:
• "Introduction to Machine Learning", Alex Smola and S.V.N. Vishwanathan.
• "Introduction to Data Mining", Tan, Steinbach & Kumar.

Hazem Shatila, PhD.
CEO, Markov Analytics

Dr. Hazem Shatila


Jobs in 2022

Dr. Hazem Shatila


What is Data?

Dr. Hazem Shatila


Big Data Definition

• Big Data is the amount of data just beyond technology's capability to store, manage and process efficiently.

"Ah, but a man's reach should exceed his grasp, Or what's a heaven for?" – Robert Browning

Dr. Hazem Shatila


Evolution of Analytics

Dr. Hazem Shatila


Data Science

Dr. Hazem Shatila


What is Data Mining?
• Many Definitions
– Non-trivial extraction of implicit, previously unknown
and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns

Dr. Hazem Shatila


What is (not) Data Mining?

● What is not Data Mining?
  – Look up a phone number in a phone directory
  – Query a Web search engine for information about "Amazon"

● What is Data Mining?
  – Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly… in the Boston area)
  – Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
Dr. Hazem Shatila
Machine Learning is not Data
Mining
• Machine Learning designs systems that can learn from the data they process
• A checkers program designed by one of the early researchers eventually learned to play better than its own designer
• Data Mining incorporates Machine Learning methods but also benefits from the methods of other disciplines such as databases and statistics
• Machine Learning is a field of data science that focuses on designing algorithms that can learn from and make predictions on data
• Data Mining is the field that focuses on discovering properties of data sets; it can use ML to do so.
Dr. Hazem Shatila
Machine Learning Tasks
• Prediction Methods (Supervised – labels)
  – Use some variables to predict unknown or future values of other variables.

• Description Methods (Unsupervised – no labels)
  – Find human-interpretable patterns that describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Dr. Hazem Shatila


Machine Learning Tasks …

Figure: the four core machine learning tasks (Clustering, Predictive Modeling, Association Rules, Anomaly Detection) illustrated around a tabular data set with attributes Tid, Refund, Marital Status, Taxable Income, and Cheat.
Dr. Hazem Shatila


Predictive Modeling: Classification

• Find a model for the class attribute as a function of the values of the other attributes

Training data:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                            Yes
2    Yes       High School         2                            No
3    No        Undergrad           1                            No
4    Yes       High School         10                           Yes
…    …         …                   …                            …

Model for predicting credit worthiness (a decision tree):
  Employed?
    No  → not credit worthy
    Yes → split on Education (Graduate vs. {High school, Undergrad});
          each education branch then splits on the number of years at the present address
          (thresholds of 3 and 7 years), giving Yes / No leaves.
Dr. Hazem Shatila


Classification Example

Training Set (categorical, categorical, quantitative attributes; class label):

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                            Yes
2    Yes       High School         2                            No
3    No        Undergrad           1                            No
4    Yes       High School         10                           Yes
…    …         …                   …                            …

Test Set (class to be predicted):

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                            ?
2    No        Graduate            3                            ?
3    Yes       High School         2                            ?
…    …         …                   …                            …

Training Set → Learn Model → Classifier → applied to the Test Set
Dr. Hazem Shatila


Examples of Classification Task
• Classifying credit card transactions
as legitimate or fraudulent

• Classifying land covers (water bodies, urban areas,


forests, etc.) using satellite data

• Categorizing news stories as finance,


weather, entertainment, sports, etc

• Identifying intruders in the cyberspace

• Predicting tumor cells as benign or malignant

• Classifying secondary structures of protein


as alpha-helix, beta-sheet, or random coil

Dr. Hazem Shatila


Regression
• Predict a value of a given continuous valued variable
based on the values of other variables, assuming a
linear or nonlinear model of dependency.
• Extensively studied in statistics, neural network fields.
• Examples:
– Predicting sales amounts of new product based on
advertising expenditure.
– Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.

Dr. Hazem Shatila


Clustering

• Finding groups of objects such that the objects in a


group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
  – Intra-cluster distances are minimized
  – Inter-cluster distances are maximized

Dr. Hazem Shatila


Applications of Cluster Analysis
• Understanding
  – Customer profiling for targeted marketing
– Group related documents for
browsing
– Group genes and proteins that
have similar functionality
– Group stocks with similar price
fluctuations
• Summarization
– Reduce the size of large data
sets
Courtesy: Michael Eisen

Clusters for Raw SST and Raw NPP

Figure: use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters (Land Cluster 1 and 2, Sea Cluster 1 and 2, Ice or No NPP) that reflect the Northern and Southern Hemispheres, plotted over latitude and longitude.

Dr. Hazem Shatila
Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items from a given collection:
  – Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Dr. Hazem Shatila


Association Analysis:
Applications
• Market-basket analysis
– Rules are used for sales promotion, shelf management,
and inventory management

• Telecommunication alarm diagnosis


– Rules are used to find combination of alarms that occur
together frequently in the same time period

• Medical Informatics
– Rules are used to find combination of patient symptoms
and test results associated with certain diseases

Dr. Hazem Shatila


Deviation/Anomaly/Change Detection
• Detect significant deviations from
normal behavior
• Applications:
– Credit Card Fraud Detection
– Network Intrusion
Detection
– Identify anomalous behavior from
sensor networks for monitoring and
surveillance.
– Detecting changes in the global forest
cover.

Dr. Hazem Shatila


What is Data?

• Collection of data objects and their attributes

• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Attribute is also known as variable, field, characteristic, or feature

• A collection of attributes describes an object
  – Object is also known as record, point, case, sample, entity, or instance

Attributes are the columns and objects are the rows of the table:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Size: number of objects
Dimensionality: number of attributes
Sparsity: number of populated object–attribute pairs

Dr. Hazem Shatila


Types of Attributes
• There are different types of attributes
– Categorical
• Examples: eye color, zip codes, words, rankings (e.g., good, fair, bad), height in {tall, medium, short}
• Nominal (no order or comparison) vs. Ordinal (ordered, but differences between values are not meaningful)
– Numeric
• Examples: dates, temperature, time, length, value,
count.
• Discrete (counts) vs Continuous (temperature)
• Special case: Binary attributes (yes/no, exists/not
exists)

Dr. Hazem Shatila


Numeric Record Data
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute

• Such data set can be represented by an n-by-d data


matrix, where there are n rows, one for each object, and
d columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Dr. Hazem Shatila
Categorical Data
• Data that consists of a collection of
records, each of which consists of a fixed
set of categorical attributes
Tid  Refund  Marital Status  Taxable Income  Cheat

1 Yes Single High No


2 No Married Medium No
3 No Single Low No
4 Yes Married High No
5 No Divorced Medium Yes
6 No Married Low No
7 Yes Divorced High No
8 No Single Medium Yes
9 No Married Medium No
10 No Single Medium Yes
Dr. Hazem Shatila
Document Data
• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times
the corresponding term occurs in the document.
– Bag-of-words representation – no ordering

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
Dr. Hazem Shatila
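As a practical illustration of the bag-of-words idea above, the sketch below builds a document-term count matrix with scikit-learn's CountVectorizer (a minimal sketch, assuming scikit-learn is installed; the three short documents are made up for the example).

# Bag-of-words: each document becomes a term-count vector
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "team play team game score game game lost",
    "coach coach ball score lost",
    "timeout season win game",
]

vectorizer = CountVectorizer()              # tokenizes and counts term occurrences
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the terms (columns)
print(X.toarray())                          # counts per document (rows)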
Transaction Data
• Each record (transaction) is a set of items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

• A set of items can also be represented as a binary


vector, where each attribute is an item.
• A document can also be represented as a set of
words (no counts)

Dr. Hazem Shatila


Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
• Data is a long ordered string

Dr. Hazem Shatila


Ordered Data
• Time series
– Sequence of ordered (over “time”) numeric
values.

Dr. Hazem Shatila


Discrete and Continuous
Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.

Dr. Hazem Shatila


Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Dr. Hazem Shatila
Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some records modified but others not, dangling references, …
– Timeliness: is the data updated in a timely manner?
– Believability: how much can the data be trusted to be correct?
– Interpretability: how easily can the data be understood?
Dr. Hazem Shatila
Major Tasks in Data Preprocessing

• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation

Dr. Hazem Shatila


Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Dr. Hazem Shatila
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data,
  e.g., faulty instruments, human or computer error, transmission errors
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?

Dr. Hazem Shatila


Data Cleaning
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the
time of entry
– not register history or changes of the data
• Missing data may need to be inferred
Dr. Hazem Shatila
Data Cleaning
How to Handle Missing Data?

• Ignore the tuple: usually done when class label is missing


(when doing classification)—not effective when the % of
missing values per attribute varies considerably
• Fill in the missing value manually: tedious, and often infeasible
• Fill it in automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same
class: smarter
– the most probable value: inference-based such as
Bayesian formula or decision tree
Dr. Hazem Shatila
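A minimal pandas sketch of the automatic fill-in strategies listed above (the DataFrame, column names, and values are illustrative, not taken from the slides):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [125, 100, np.nan, 120, np.nan, 60],
    "class":  ["No", "No", "No", "No", "Yes", "Yes"],
})

df["income_const"] = df["income"].fillna(-1)                    # a global constant
df["income_mean"] = df["income"].fillna(df["income"].mean())    # the attribute mean
# the attribute mean for all samples belonging to the same class (smarter)
df["income_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)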
Data Cleaning
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
Dr. Hazem Shatila
Data Cleaning
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g.,
deal with possible outliers)

Dr. Hazem Shatila


Data Cleaning
Outliers

• Outliers are data objects with characteristics


that are considerably different than most of
the other data objects in the data set
– Case 1: Outliers are
noise that interferes
with data analysis
– Case 2: Outliers are
the goal of our analysis
• Credit card fraud
• Intrusion detection

• Causes?

Dr. Hazem Shatila


Data Cleaning
Outliers
Box Plots

Dr. Hazem Shatila


Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Dr. Hazem Shatila
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Dr. Hazem Shatila
Data Integration
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
– Object identification: The same attribute or object may
have different names in different databases
– Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
• Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Dr. Hazem Shatila
Data Integration
Correlation Analysis (Nominal Data)
• χ² (chi-square) test:

  χ² = Σ (Observed − Expected)² / Expected
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population

Dr. Hazem Shatila


Data Integration
Chi-Square Calculation: An Example
Play chess Not play chess Sum (row)
Like science fiction 250(90) 200(360) 450

Not like science fiction 50(210) 1000(840) 1050

Sum(col.) 300 1200 1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

  χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
• It shows that like_science_fiction and play_chess are
correlated in the group
Dr. Hazem Shatila
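The same chi-square calculation can be reproduced in Python; a short sketch with SciPy is shown below (assuming SciPy is available).

import numpy as np
from scipy.stats import chi2_contingency

# rows: like / not like science fiction, columns: play / not play chess
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93, matching the manual calculation above
print(expected)  # [[90, 360], [210, 840]], the expected counts in parentheses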
Data Integration
Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson's product-moment coefficient):

  r_A,B = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σ_A σ_B) = (Σᵢ aᵢbᵢ − n·Ā·B̄) / ((n − 1) σ_A σ_B)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(aᵢbᵢ) is the sum of the AB cross-products.
• If rA,B > 0, A and B are positively correlated (A’s values
increase as B’s). The higher, the stronger correlation.
• rA,B = 0: independent; rAB < 0: negatively correlated

Dr. Hazem Shatila
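A small NumPy sketch of Pearson's correlation coefficient, computed both from the formula above and with np.corrcoef (the two attribute vectors are invented for illustration):

import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

n = len(A)
# formula: sum of mean-centered cross-products over (n-1) * sigma_A * sigma_B
r_manual = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
r_numpy = np.corrcoef(A, B)[0, 1]    # NumPy's built-in gives the same value

print(r_manual, r_numpy)   # positive and close to 1: A and B increase together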


Data Integration
Visually Evaluating Correlation

Scatter plots showing correlations ranging from –1 to 1.

Dr. Hazem Shatila


Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Dr. Hazem Shatila
Data Reduction Strategies

• Data reduction: Obtain a reduced representation of the data set that is


much smaller in volume but yet produces the same (or almost the
same) analytical results
• Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
• Data reduction strategies
– Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
– Data compression
Dr. Hazem Shatila
Data Reduction 1: Dimensionality
Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
– The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)

Dr. Hazem Shatila


Mapping Data to a New Space
• Fourier transform
• Wavelet transform

Figure: two sine waves, the two sine waves + noise, and the corresponding frequency-domain representation.

Dr. Hazem Shatila


What Is Wavelet Transform?
• Decomposes a signal into
different frequency subbands
– Applicable to n-
dimensional signals
• Data are transformed to
preserve relative distance
between objects at different
levels of resolution
• Allow natural clusters to
become more distinguishable
• Used for image compression

Dr. Hazem Shatila


Principal Component Analysis (PCA)

• Find a projection that captures the largest amount of variation in data


• The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space

Figure: data points in the (x1, x2) plane with the principal component directions overlaid.
Dr. Hazem Shatila
Principal Component Analysis
(Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing
“significance” or strength
– Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
• Works for numeric data only
Dr. Hazem Shatila
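A minimal sketch of these PCA steps using scikit-learn (the small two-dimensional data set is invented for illustration):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

X_std = StandardScaler().fit_transform(X)   # normalize input data to a common scale
pca = PCA(n_components=1)                   # keep only the strongest component
X_reduced = pca.fit_transform(X_std)        # project onto the principal component

print(pca.components_)                 # eigenvector defining the new space
print(pca.explained_variance_ratio_)   # fraction of the variance retained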
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
– Duplicate much or all of the information contained in
one or more other attributes
– E.g., purchase price of a product and the amount of
sales tax paid
• Irrelevant attributes
– Contain no information that is useful for the data
mining task at hand
– E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Dr. Hazem Shatila
Attribute Creation (Feature
Generation)
• Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
• Three general methodologies
– Attribute extraction
• Domain-specific
– Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet
transformation, manifold approaches (not covered)
– Attribute construction
• Combining features
• Data discretization
Dr. Hazem Shatila
Data Reduction 2: Numerosity
Reduction
• Reduce data volume by choosing alternative, smaller
forms of data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
– Ex.: Log-linear models—obtain value at a point in m-
D space as the product on appropriate marginal
subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …
Dr. Hazem Shatila
Parametric Data Reduction: Regression
and Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
• Log-linear model
– Approximates discrete multidimensional probability
distributions

Dr. Hazem Shatila


Regression Analysis

• Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka explanatory variables or predictors)
• The parameters are estimated so as to give a "best fit" of the data
• Most commonly the best fit is evaluated by using the least-squares method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships

Figure: a fitted line y = x + 1 through data points, with the fitted value Y1' for the input X1.

Dr. Hazem Shatila


Regress Analysis and Log-Linear
Models
• Linear regression: Y = w X + b
– Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
– Using the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above
• Log-linear models:
– Approximate discrete multidimensional probability distributions
– Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
– Useful for dimensionality reduction and data smoothing
Dr. Hazem Shatila
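A short NumPy sketch of fitting the linear regression Y = wX + b by the least-squares method (the data points are invented for illustration):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

w, b = np.polyfit(X, Y, deg=1)   # least-squares fit of a straight line
print(w, b)                      # the two estimated regression coefficients
print(w * 6.0 + b)               # predicted Y for a new value X = 6.0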
Histogram Analysis
• Divide data into buckets and store the average (or sum) for each bucket
• Partitioning rules:
  – Equal-width: equal bucket range
  – Equal-frequency (or equal-depth): each bucket holds roughly the same number of values

Figure: histogram of values bucketed from 10,000 to 100,000.
Dr. Hazem Shatila
Clustering
• Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms
• Cluster analysis will be studied in depth later on.

Dr. Hazem Shatila


Sampling
• Sampling: obtaining a small sample s to represent the
whole data set N
• Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
• Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor
performance in the presence of skew
– Develop adaptive sampling methods, e.g., stratified
sampling:
• Note: Sampling may not reduce database I/Os (page at a
time)
Dr. Hazem Shatila
Types of Sampling
• Simple random sampling
– There is an equal probability of selecting any particular
item
• Sampling without replacement
– Once an object is selected, it is removed from the
population
• Sampling with replacement
– A selected object is not removed from the population
• Stratified sampling:
– Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
– Used in conjunction with skewed data
Dr. Hazem Shatila
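A small pandas sketch of the sampling types above: simple random sampling without and with replacement, and stratified sampling on a skewed data set (the data set and stratum labels are invented for illustration):

import pandas as pd

df = pd.DataFrame({"value": range(1000),
                   "stratum": ["A"] * 900 + ["B"] * 100})   # skewed strata

srs   = df.sample(n=50, replace=False, random_state=0)  # simple random, without replacement
srswr = df.sample(n=50, replace=True, random_state=0)   # simple random, with replacement
# stratified: draw ~5% from each stratum so the rare group "B" is still represented
strat = df.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.05, random_state=0)
)

print(len(srs), len(srswr))
print(strat["stratum"].value_counts())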
Sampling: With or without Replacement

Figure: raw data sampled by SRS (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement).
Dr. Hazem Shatila
Sampling: Cluster or Stratified
Sampling
Raw Data Cluster/Stratified Sample

Dr. Hazem Shatila


Data Reduction 3: Data
Compression
• String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless, but only limited manipulation is
possible without expansion
• Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
• Time sequence is not audio
– Typically short and vary slowly with time
• Dimensionality and numerosity reduction may also be
considered as forms of data compression
Dr. Hazem Shatila
Data Compression

Figure: original data reduced to compressed data; lossless compression recovers the original data exactly, while lossy compression only approximates it.
Dr. Hazem Shatila


Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Dr. Hazem Shatila
Data Transformation
• A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization.
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization
Dr. Hazem Shatila
Normalization
• Min-max normalization: to [new_min_A, new_max_A]

  v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

  – Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
    (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

• Z-score normalization (μ: mean, σ: standard deviation):

  v' = (v − μ_A) / σ_A

  – Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Dr. Hazem Shatila
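The three normalization schemes can be reproduced with a few lines of NumPy; the sketch below reuses the income example from the slide:

import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# min-max normalization to [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())
print(min_max)                 # 73,600 maps to ~0.716, as in the example

# z-score normalization with the slide's mean and standard deviation
mu, sigma = 54_000, 16_000
print((income - mu) / sigma)   # 73,600 maps to 1.225

# decimal scaling: divide by 10^j so that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max())))
print(income / 10**j)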
Discretization
• Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic
rank
– Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification

Dr. Hazem Shatila


Data Discretization Methods
• Typical methods: All the methods can be applied
recursively
– Binning
– Histogram analysis
– Clustering analysis
– Decision-tree analysis
– Correlation (e.g., χ²) analysis

Dr. Hazem Shatila


Simple Discretization: Binning

• Equal-width (distance) partitioning


– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well

• Equal-depth (frequency) partitioning


– Divides the range into N intervals, each containing approximately
same number of samples
– Good data scaling
– Managing categorical attributes can be tricky
Dr. Hazem Shatila
Binning Methods for Data
Smoothing
q Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Dr. Hazem Shatila
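A pandas sketch of equal-frequency binning and smoothing by bin means on the same price data (pd.qcut gives the equi-depth partition; the exact bin means are 9, 22.75, and 29.25, which the slide rounds to 9, 23, and 29):

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bins = pd.qcut(prices, q=3, labels=False)           # equal-frequency (equi-depth) bins
smoothed_by_mean = prices.groupby(bins).transform("mean")

print(bins.values)               # bin index per value: 0,0,0,0, 1,1,1,1, 2,2,2,2
print(smoothed_by_mean.values)   # each value replaced by its bin mean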
Association Analysis: Basic Concepts
Frequent Itemsets
Apriori Algorithm

Dr. Hazem Shatila


Association Rule Mining
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction

Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Dr. Hazem Shatila


Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Dr. Hazem Shatila


Definition: Association Rule
● Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}

● Rule Evaluation Metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X
  – Lift (L): lift = confidence / support(Y)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} ⇒ {Beer}
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

Dr. Hazem Shatila
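A small Python sketch computing support, confidence, and lift for the rule {Milk, Diaper} → {Beer} on the transactions above:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

X, Y = {"Milk", "Diaper"}, {"Beer"}
n = len(transactions)

def count(itemset):                       # support count sigma(itemset)
    return sum(itemset <= t for t in transactions)

support = count(X | Y) / n                # 2/5 = 0.4
confidence = count(X | Y) / count(X)      # 2/3 = 0.67
lift = confidence / (count(Y) / n)        # 0.67 / (3/5) ≈ 1.11

print(support, confidence, lift)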


Association Rule Mining Task
• Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!

Dr. Hazem Shatila



Computational Complexity
Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

    R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules

Dr. Hazem Shatila


Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Dr. Hazem Shatila
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset

• Frequent itemset generation is still


computationally expensive
Dr. Hazem Shatila
Frequent Itemset Generation null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE


Given d items, there
are 2d possible
ABCDE candidate itemsets

Dr. Hazem Shatila


Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent
itemset
– Count the support of each candidate by scanning the
database
Figure: N transactions (the TID/Items table below) matched against a list of M candidate itemsets, where w is the maximum transaction width.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!
  (M candidates and N transactions)
Dr. Hazem Shatila
Frequent Itemset Generation
Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

• Reduce the number of transactions (N)


– Reduce size of N as the size of itemset
increases
• Reduce the number of comparisons (NM)
– Use efficient data structures to store the
candidates or transactions
– No need to match every candidate against
every transaction
Dr. Hazem Shatila
Reducing Number of
Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must
also be frequent

• Apriori principle holds due to the following


property of the support measure:

"X , Y : ( X Í Y ) Þ s( X ) ³ s(Y )
– Support of an itemset never exceeds the support of
its subsets
– This is known as the anti-monotone property of
support
Dr. Hazem Shatila
Illustrating Apriori Principle
null

A B C D E

AB AC AD AE BC BD BE CD CE DE

Found to be
Infrequent
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

Pruned
supersets ABCDE

Dr. Hazem Shatila


Illustrating Apriori Principle

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 3 (60%)

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 4 = 16 candidates

Dr. Hazem Shatila




Illustrating Apriori Principle
Item Count Items (1-itemsets)
Bread 4
Coke 2
Milk 4 Itemset Pairs (2-itemsets)
Beer 3 {Bread,Milk}
Diaper 4 {Bread, Beer } (No need to generate
Eggs 1 {Bread,Diaper}
{Beer, Milk}
candidates involving Coke
{Diaper, Milk} or Eggs)
{Beer,Diaper}

Minimum Support = 3

If every subset is considered,


6C + 6C + 6C
1 2 3
6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

Dr. Hazem Shatila


Illustrating Apriori Principle
Item Count Items (1-itemsets)
Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Beer, Bread} 2 (No need to generate
Eggs 1 {Bread,Diaper} 3
candidates involving Coke
{Beer,Milk} 2
{Diaper,Milk} 3 or Eggs)
{Beer,Diaper} 3
Minimum Support = 3

If every subset is considered,


6C + 6C + 6C
1 2 3
6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

Dr. Hazem Shatila


Illustrating Apriori Principle
Item Count Items (1-itemsets)
Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Bread,Beer} 2 (No need to generate
Eggs 1
{Bread,Diaper} 3 candidates involving Coke
{Milk,Beer} 2 or Eggs)
{Milk,Diaper} 3
{Beer,Diaper} 3
Minimum Support = 3
Triplets (3-itemsets)
If every subset is considered, Itemset
6C + 6C + 6C
1 2 3 { Beer, Diaper, Milk}
6 + 15 + 20 = 41 { Beer,Bread,Diaper}
With support-based pruning, {Bread, Diaper, Milk}
6 + 6 + 4 = 16 { Beer, Bread, Milk}

Dr. Hazem Shatila


Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                 Count
{Beer, Diaper, Milk}    2
{Beer, Bread, Diaper}   2
{Bread, Diaper, Milk}   2
{Beer, Bread, Milk}     1

Minimum Support = 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 4 = 16 candidates

Dr. Hazem Shatila


Apriori Algorithm
– Fk: frequent k-itemsets
– Lk: candidate k-itemsets
• Algorithm
– Let k=1
– Generate F1 = {frequent 1-itemsets}
– Repeat until Fk is empty
• Candidate Generation: Generate Lk+1 from Fk
• Candidate Pruning: Prune candidate itemsets in Lk+1
containing subsets of length k that are infrequent
• Support Counting: Count the support of each
candidate in Lk+1 by scanning the DB
• Candidate Elimination: Eliminate candidates in Lk+1
that are infrequent, leaving only those that are frequent
=> Fk+1

Dr. Hazem Shatila
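A compact, illustrative Python sketch of the frequent-itemset loop above (candidate generation, pruning, support counting, and elimination). This is a teaching sketch, not an optimized Apriori implementation:

from itertools import combinations

def apriori(transactions, minsup_count):
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    # F1: frequent 1-itemsets
    Fk = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup_count]
    frequent = list(Fk)
    k = 1
    while Fk:
        # Candidate generation: union pairs of frequent k-itemsets into (k+1)-itemsets
        candidates = {a | b for a in Fk for b in Fk if len(a | b) == k + 1}
        # Candidate pruning: every k-subset of a candidate must already be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in set(Fk) for s in combinations(c, k))}
        # Support counting and candidate elimination
        Fk = [c for c in candidates if support(c) >= minsup_count]
        frequent.extend(Fk)
        k += 1
    return frequent

transactions = [{"Bread", "Milk"},
                {"Beer", "Bread", "Diaper", "Eggs"},
                {"Beer", "Coke", "Diaper", "Milk"},
                {"Beer", "Bread", "Diaper", "Milk"},
                {"Bread", "Coke", "Diaper", "Milk"}]
print(apriori(transactions, minsup_count=3))   # frequent 1- and 2-itemsets from the example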


Candidate Generation: Brute-force method

Dr. Hazem Shatila


Candidate Generation: Merge Fk-1 and F1 itemsets

Dr. Hazem Shatila


Candidate Generation: Fk-1 x Fk-1 Method

Dr. Hazem Shatila


Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Diaper, Milk}  2

Minimum Support = 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates

Use of the Fk-1 x Fk-1 method for candidate generation results in only one 3-itemset. This is eliminated after the support counting step.

Dr. Hazem Shatila


Example

Minimum Support = 2

Dr. Hazem Shatila


Rule Generation
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, the candidate rules are:
    ABC→D, ABD→C, ACD→B, BCD→A,
    A→BCD, B→ACD, C→ABD, D→ABC,
    AB→CD, AC→BD, AD→BC, BC→AD,
    BD→AC, CD→AB

• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
Dr. Hazem Shatila
Rule Generation
• In general, confidence does not have an anti-monotone property
  – c(ABC → D) can be larger or smaller than c(AB → D)

• But the confidence of rules generated from the same itemset has an anti-monotone property
  – E.g., suppose {A,B,C,D} is a frequent 4-itemset:
    c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
  – Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

Dr. Hazem Shatila


Rule Generation for Apriori Algorithm
Lattice of rules:
ABCD => { }
Low
Confidence
Rule
BCD=>A ACD=>B ABD=>C ABC=>D

CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD

D=>ABC C=>ABD B=>ACD A=>BCD


Pruned
Rules

Dr. Hazem Shatila


Try it out…Assignment

Please draw the rules tree

Dr. Hazem Shatila


Linear Regression

Dr. Hazem Shatila


Machine Learning Algorithms

- Supervised learning
- Unsupervised learning

- Others:
- Reinforcement learning
- Recommender systems.

Dr. Hazem Shatila


Supervised Learning & Unsupervised Learning

Figure: labeled data separated by a decision boundary (Supervised Learning) vs. unlabeled data grouped into clusters (Unsupervised Learning), plotted in the (x1, x2) plane.

Dr. Hazem Shatila


Linear Regression with one
Variable
Housing Prices (Portland, OR)
Figure: price (in 1000s of dollars) vs. size (feet²).

Supervised Learning: given the "right answer" for each example in the data.
Regression Problem: predict real-valued output.
Dr. Hazem Shatila
Training set of housing prices:

Size in feet² (x)   Price ($) in 1000's (y)
2104                460
1416                232
1534                315
852                 178
…                   …

Notation:
  m = number of training examples
  x's = "input" variable / features
  y's = "output" variable / "target" variable

Flow: Training Set → Learning Algorithm → hypothesis h; h maps the size of a house to an estimated price.

Question: How to describe h?


Dr. Hazem Shatila
Training Set:

Size in feet² (x)   Price ($) in 1000's (y)
2104                460
1416                232
1534                315
852                 178
…                   …

Hypothesis: h_θ(x) = θ₀ + θ₁x
θ's: Parameters
How to choose the θ's?

Dr. Hazem Shatila


Figure: the training data with a candidate hypothesis line.

Idea: Choose θ₀, θ₁ so that h_θ(x) is close to y for our training examples (x, y).

Dr. Hazem Shatila


Cost Function

Hypothesis: h_θ(x) = θ₀ + θ₁x
(Simplified version for intuition: set θ₀ = 0, so h_θ(x) = θ₁x.)

Parameters: θ₀, θ₁

Cost Function: J(θ₀, θ₁) = (1/2m) Σᵢ ( h_θ(x(i)) − y(i) )²

Goal: minimize J(θ₀, θ₁) over θ₀, θ₁

Dr. Hazem Shatila


Figure: price ($ in 1000's) vs. size in feet² (x), with the hypothesis line plotted through the training data.

Question: How to minimize J?
Dr. Hazem Shatila
Gradient Descent

Have some function J(θ₀, θ₁)
Want: min over θ₀, θ₁ of J(θ₀, θ₁)

Outline:
• Start with some θ₀, θ₁
• Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁) until we hopefully end up at a minimum

Dr. Hazem Shatila


Figure: surface plot of J(θ₀, θ₁) over the parameters θ₀ and θ₁.

Dr. Hazem Shatila


Gradient descent algorithm:

repeat until convergence {
  θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)    (for j = 0 and j = 1)
}

Correct (simultaneous update): compute the new values of θ₀ and θ₁ from the current parameters, then assign both at once.
Incorrect: updating θ₀ first and then using the updated θ₀ when computing θ₁.

Dr. Hazem Shatila


Gradient descent algorithm

Notice : α is the learning rate.

Dr. Hazem Shatila


If α is too small, gradient descent
can be slow.

If α is too large, gradient descent


can overshoot the minimum. It may
fail to converge, or even diverge.

Dr. Hazem Shatila


At a local optimum the derivative term is zero, so the current value of θ₁ is unchanged by the update.

Gradient descent can converge to a local minimum, even with the learning rate α fixed.
As we approach a local minimum, gradient descent will automatically take smaller steps, so there is no need to decrease α over time.

Dr. Hazem Shatila


Gradient Descent for
Linear Regression
Gradient descent algorithm Linear Regression Model

Dr. Hazem Shatila


Gradient descent algorithm for linear regression:

repeat until convergence {
  θ₀ := θ₀ − α · (1/m) Σᵢ ( h_θ(x(i)) − y(i) )
  θ₁ := θ₁ − α · (1/m) Σᵢ ( h_θ(x(i)) − y(i) ) · x(i)
}
update θ₀ and θ₁ simultaneously
Dr. Hazem Shatila
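A NumPy sketch of batch gradient descent for h_θ(x) = θ₀ + θ₁x on housing-style data (the four size/price examples come from the slides; the feature scaling, learning rate, and iteration count are choices made for this sketch):

import numpy as np

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
x = (x - x.mean()) / x.std()        # feature scaling so a fixed alpha converges quickly

theta0, theta1 = 0.0, 0.0
alpha, m = 0.1, len(x)

def cost(t0, t1):
    return ((t0 + t1 * x - y) ** 2).sum() / (2 * m)   # J(theta0, theta1)

for _ in range(500):
    err = theta0 + theta1 * x - y               # h_theta(x) - y for every example
    grad0 = err.sum() / m
    grad1 = (err * x).sum() / m
    # simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1, cost(theta0, theta1))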


(for fixed θ₀, θ₁, h_θ(x) is a function of x)    (J(θ₀, θ₁) is a function of the parameters)

A sequence of plots shows the hypothesis h_θ(x) on the training data alongside the corresponding point on the contour plot of J(θ₀, θ₁) as the parameters change during gradient descent.

Dr. Hazem Shatila


Linear Regression with multiple variables

Hypothesis: h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ   (with x₀ = 1, i.e. h_θ(x) = θᵀx)
Parameters: θ₀, θ₁, …, θₙ
Cost function: J(θ) = (1/2m) Σᵢ ( h_θ(x(i)) − y(i) )²

Gradient descent:
Repeat {
  θⱼ := θⱼ − α · ∂/∂θⱼ J(θ)
}
(simultaneously update θⱼ for every j = 0, …, n)

Dr. Hazem Shatila


New algorithm (n ≥ 1):
Repeat {
  θⱼ := θⱼ − α · (1/m) Σᵢ ( h_θ(x(i)) − y(i) ) · xⱼ(i)
}
(simultaneously update θⱼ for j = 0, …, n)

Previously (n = 1):
Repeat {
  θ₀ := θ₀ − α · (1/m) Σᵢ ( h_θ(x(i)) − y(i) )
  θ₁ := θ₁ − α · (1/m) Σᵢ ( h_θ(x(i)) − y(i) ) · x(i)
}
(simultaneously update θ₀, θ₁)

Dr. Hazem Shatila


Example (each row includes x₀ = 1):

x₀  Size (feet²)  Number of bedrooms  Number of floors  Age of home (years)  Price ($1000)
1   2104          5                   1                 45                   460
1   1416          3                   2                 40                   232
1   1534          3                   2                 30                   315
1   852           2                   1                 36                   178

(simultaneously update all θⱼ)

Dr. Hazem Shatila


Logistic Regression

Dr. Hazem Shatila


Neural Network

Dr. Hazem Shatila


Example of Perceptron Learning

Weight update: w(k+1) = w(k) + λ [ yᵢ − f(w(k), xᵢ) ] xᵢ
Output:        Y = sign( Σᵢ₌₀..d wᵢ Xᵢ )

λ = 0.1, bias input = 1 (its weight is w0)

Training data (X1, X2, X3, Y):
1 0 0 -1;  1 0 1 +1;  1 1 0 +1;  1 1 1 +1;  0 0 1 -1;  0 1 0 -1;  0 1 1 +1;  0 0 0 -1

Weights (w0, w1, w2, w3) after each update step in the first epoch:
0: (0, 0, 0, 0)    1: (-0.2, -0.2, 0, 0)    2: (0, 0, 0, 0.2)    3: (0, 0, 0, 0.2)    4: (0, 0, 0, 0.2)
5: (-0.2, 0, 0, 0)    6: (-0.2, 0, 0, 0)    7: (0, 0, 0.2, 0.2)    8: (-0.2, 0, 0.2, 0.2)

Weights at the end of each epoch:
Epoch  w0    w1   w2   w3
0      0     0    0    0
1      -0.2  0    0.2  0.2
2      -0.2  0    0.4  0.2
3      -0.4  0    0.4  0.2
4      -0.4  0.2  0.4  0.4
5      -0.6  0.2  0.4  0.2
6      -0.6  0.4  0.4  0.2

Dr. Hazem Shatila
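A short NumPy sketch that reproduces the perceptron update above on the same training data; printing the weights at the end of each epoch can be compared with the epoch table (sign(0) is treated as +1 here, an assumption consistent with the table):

import numpy as np

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1],
              [0, 0, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0]], dtype=float)
y = np.array([-1, 1, 1, 1, -1, -1, 1, -1], dtype=float)

X = np.hstack([np.ones((len(X), 1)), X])   # prepend the bias input (column for w0)
w = np.zeros(X.shape[1])                   # w0, w1, w2, w3 all start at 0
lam = 0.1                                  # the learning rate lambda

for epoch in range(6):
    for xi, yi in zip(X, y):
        y_hat = 1.0 if np.dot(w, xi) >= 0 else -1.0    # sign(w . x), with sign(0) = +1
        w = w + lam * (yi - y_hat) * xi                # perceptron weight update
    print(epoch + 1, np.round(w, 1))       # weights at the end of each epoch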


Naïve Bayes

Dr. Hazem Shatila


Bayes Classifier
• A probabilistic framework for solving classification problems
• Conditional probability:

    P(Y | X) = P(X, Y) / P(X)
    P(X | Y) = P(X, Y) / P(Y)

• Bayes theorem:

    P(Y | X) = P(X | Y) P(Y) / P(X)

Dr. Hazem Shatila


Example of Bayes Theorem
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20

• If a patient has stiff neck, what’s the probability


he/she has meningitis?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

Dr. Hazem Shatila


Using Bayes Theorem for Classification
• Consider each attribute and class label as random
variables

• Given a record with attributes (X1, X2,…, Xd)


– Goal is to predict class Y
– Specifically, we want to find the value of Y that
maximizes P(Y| X1, X2,…, Xd )

• Can we estimate P(Y| X1, X2,…, Xd ) directly from


data?
Dr. Hazem Shatila
Example Data

Given a test record: X = (Refund = No, Marital Status = Divorced, Income = 120K)

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

● Can we estimate P(Evade = Yes | X) and P(Evade = No | X)?

In the following we will replace Evade = Yes by Yes, and Evade = No by No.

Dr. Hazem Shatila


Using Bayes Theorem for Classification
• Approach:
  – compute the posterior probability P(Y | X1, X2, …, Xd) using the Bayes theorem:

    P(Y | X1 X2 … Xd) = P(X1 X2 … Xd | Y) P(Y) / P(X1 X2 … Xd)

  – Maximum a-posteriori: choose the Y that maximizes P(Y | X1, X2, …, Xd)

  – Equivalent to choosing the value of Y that maximizes P(X1, X2, …, Xd | Y) P(Y)

• How to estimate P(X1, X2, …, Xd | Y)?

Dr. Hazem Shatila


Example Data
Given a Test Record:
X = (Refund = No, Marital Status = Divorced, Income = 120K)

Tid  Refund  Marital Status  Taxable Income  Evade

1 Yes Single 125K No


2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

Dr. Hazem Shatila


Naïve Bayes Classifier
• Assume independence among attributes Xi when class
is given:
– P(X1, X2, …, Xd |Yj) = P(X1| Yj) P(X2| Yj)… P(Xd| Yj)

– Now we can estimate P(Xi| Yj) for all Xi and Yj


combinations from the training data

– A new point is classified to Yj if P(Yj) ∏ P(Xi | Yj) is maximal.

Dr. Hazem Shatila


Conditional Independence
• X and Y are conditionally independent given Z if
P(X|YZ) = P(X|Z)

• Example: Arm length and reading skills


– Young child has shorter arm length and
limited reading skills, compared to adults
– If age is fixed, no apparent relationship
between arm length and reading skills
– Arm length and reading skills are conditionally
independent given age

Dr. Hazem Shatila


Naïve Bayes on Example Data

Given a test record: X = (Refund = No, Marital Status = Divorced, Income = 120K)

(Training data: the same ten-record table of Refund, Marital Status, Taxable Income, Evade shown above.)

● P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes)

● P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No)

Dr. Hazem Shatila


Estimate Probabilities from Data

(Training data: the same ten-record table of Refund, Marital Status, Taxable Income, Evade used above.)

• Class prior: P(Y) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10

• For categorical attributes:
  P(Xi | Yk) = |Xik| / Nk
  – where |Xik| is the number of instances having attribute value Xi and belonging to class Yk
  – Examples:
    P(Status = Married | No) = 4/7
    P(Refund = Yes | Yes) = 0

Dr. Hazem Shatila


Estimate Probabilities from Data

• For continuous attributes:


– Discretization: Partition the range into bins:
k
• Replace continuous value with bin value
– Attribute changed from continuous to ordinal

– Probability density estimation:


• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once probability distribution is known, use it to
estimate the conditional probability P(Xi|Y)

Dr. Hazem Shatila


Estimate Probabilities from Data

(Training data: the same ten-record table of Refund, Marital Status, Taxable Income, Evade used above.)

• Normal distribution:

  P(Xi | Yj) = 1 / sqrt(2π σij²) · exp( −(Xi − μij)² / (2 σij²) )

  – one distribution for each (Xi, Yj) pair

• For (Income, Class = No):
  – If Class = No:
    • sample mean = 110
    • sample variance = 2975

  P(Income = 120 | No) = 1 / (sqrt(2π) × 54.54) · exp( −(120 − 110)² / (2 × 2975) ) = 0.0072
Dr. Hazem Shatila
Example of Naïve Bayes Classifier
Given a Test Record:
X = (Refund = No, Divorced, Income = 120K)
Naïve Bayes Classifier (estimates from the training data):

P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0

For Taxable Income:
  If class = No:  sample mean = 110, sample variance = 2975
  If class = Yes: sample mean = 90,  sample variance = 25

● P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No)
            = 4/7 × 1/7 × 0.0072 = 0.0006

● P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes)
             = 1 × 1/3 × 1.2 × 10⁻⁹ = 4 × 10⁻¹⁰

Since P(X | No) P(No) > P(X | Yes) P(Yes), therefore P(No | X) > P(Yes | X)
⇒ Class = No

Dr. Hazem Shatila
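A small Python sketch of the same hybrid Naïve Bayes computation: frequency estimates for the categorical attributes and a normal density for Taxable Income, using the values from the slide:

import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

p_no, p_yes = 7 / 10, 3 / 10        # class priors from the training table

# Test record X = (Refund = No, Divorced, Income = 120K)
p_x_no = (4 / 7) * (1 / 7) * normal_pdf(120, 110, 2975)    # ~0.0006
p_x_yes = 1.0 * (1 / 3) * normal_pdf(120, 90, 25)          # ~4e-10

print(p_x_no * p_no, p_x_yes * p_yes)
print("Class =", "No" if p_x_no * p_no > p_x_yes * p_yes else "Yes")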


Issues with Naïve Bayes Classifier

Consider the same training table with record Tid = 7 deleted. The estimates become:

Naïve Bayes Classifier:
P(Refund = Yes | No) = 2/6
P(Refund = No | No) = 4/6
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/6
P(Marital Status = Divorced | No) = 0
P(Marital Status = Married | No) = 4/6
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0/3

For Taxable Income:
  If class = No:  sample mean = 91, sample variance = 685
  If class = Yes: sample mean = 90, sample variance = 25

Given X = (Refund = Yes, Divorced, 120K):
P(X | No)  = 2/6 × 0 × 0.0083 = 0
P(X | Yes) = 0 × 1/3 × 1.2 × 10⁻⁹ = 0

Naïve Bayes will not be able to classify X as Yes or No!

Dr. Hazem Shatila


Issues with Naïve Bayes
Classifier
• If one of the conditional probabilities is zero, then the entire expression
becomes zero
• Need to use other estimates of conditional probabilities than simple
fractions
• Probability estimation:

  Original:   P(Ai | C) = Nic / Nc
  Laplace:    P(Ai | C) = (Nic + 1) / (Nc + c)
  m-estimate: P(Ai | C) = (Nic + m·p) / (Nc + m)

  where:
    c: number of classes
    p: prior probability of the class
    m: parameter
    Nc: number of instances in the class
    Nic: number of instances having attribute value Ai in class c
Dr. Hazem Shatila
Example of Naïve Bayes Classifier … try it

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record:
Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?

A: attributes, M: mammals, N: non-mammals

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.004 × 13/20 = 0.0027

P(A | M) P(M) > P(A | N) P(N)  ⇒  Mammals

Dr. Hazem Shatila




Naïve Bayes (Summary)
• Robust to isolated noise points

• Handle missing values by ignoring the instance during


probability estimate calculations

• Robust to irrelevant attributes

• Independence assumption may not hold for some


attributes
– Use other techniques such as Bayesian Belief
Networks (BBN)

Dr. Hazem Shatila


Nearest Neighbor

Dr. Hazem Shatila


Nearest Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then
it’s probably a duck
[Figure: a test record is compared against the training records — compute the distance to each training record, then choose the k "nearest" ones.]

Dr. Hazem Shatila


Nearest-Neighbor Classifiers
● Requires three things
– The set of labeled records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve

● To classify an unknown record:


– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)

Dr. Hazem Shatila


Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

Dr. Hazem Shatila


1 nearest-neighbor
Voronoi Diagram

Dr. Hazem Shatila


Nearest Neighbor Classification
• Compute distance between two points:
– Euclidean distance
    d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )

• Determine the class from nearest neighbor


list
– Take the majority vote of class labels among the
k-nearest neighbors
– Weigh the vote according to distance
• weight factor, w = 1/d²
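A minimal sketch of such a classifier, assuming numeric feature vectors; the function name and the toy data are illustrative only:

    import math
    from collections import defaultdict

    def knn_predict(train, query, k=3):
        # train: list of (feature_vector, label); distance-weighted majority vote, w = 1/d^2
        def euclidean(p, q):
            return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

        neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
        votes = defaultdict(float)
        for features, label in neighbors:
            d = euclidean(features, query)
            votes[label] += 1.0 / (d ** 2 + 1e-12)   # small constant guards against d = 0
        return max(votes, key=votes.get)

    data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.0), "B"), ((4.2, 3.9), "B")]
    print(knn_predict(data, (1.1, 0.9), k=3))   # "A"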

Dr. Hazem Shatila


Nearest Neighbor
Classification…
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points
from other classes

Dr. Hazem Shatila


Nearest Neighbor
Classification…
• Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by
one of the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M

Dr. Hazem Shatila


Nearest Neighbor
Classification…
• Selection of the right similarity measure is
critical:
    111111111110  vs  011111111111
    000000000001  vs  100000000000

Euclidean distance = 1.4142 for both pairs
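A small sketch contrasting Euclidean distance with cosine similarity on these vectors (the pairing follows the example above):

    import math

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def cosine(p, q):
        dot = sum(a * b for a, b in zip(p, q))
        return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

    a = [int(ch) for ch in "111111111110"]
    b = [int(ch) for ch in "011111111111"]
    c = [int(ch) for ch in "000000000001"]
    d = [int(ch) for ch in "100000000000"]

    print(euclidean(a, b), euclidean(c, d))   # 1.4142 for both pairs
    print(cosine(a, b), cosine(c, d))         # ~0.909 vs 0.0 -- very different similarity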

Dr. Hazem Shatila


Nearest neighbor
Classification…
• k-NN classifiers are lazy learners since they do not build
models explicitly
• Classifying unknown records is relatively expensive
• Can produce arbitrarily shaped decision boundaries
• Easy to handle variable interactions since the decisions
are based on local information
• Selection of right proximity measure is essential
• Superfluous or redundant attributes can create problems
• Missing attributes are hard to handle

Dr. Hazem Shatila


Improving KNN Efficiency
• Avoid having to compute distance to all
objects in the training set
– Multi-dimensional access methods (k-d trees)
– Fast approximate similarity search
– Locality Sensitive Hashing (LSH)
• Condensing
– Determine a smaller set of objects that give the
same performance
• Editing
– Remove objects to improve efficiency

Dr. Hazem Shatila


Decision tree

Dr. Hazem Shatila


Example of a Decision Tree
Training Data:

ID   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes

Model: Decision Tree (splitting attributes)

Home Owner?
├─ Yes → NO
└─ No  → MarSt?
         ├─ Single, Divorced → Income?
         │                      ├─ < 80K → NO
         │                      └─ > 80K → YES
         └─ Married → NO

Dr. Hazem Shatila


Another Example of Decision
Tree
Using the same training data, a different tree also fits:

MarSt?
├─ Married → NO
└─ Single, Divorced → Home Owner?
                      ├─ Yes → NO
                      └─ No  → Income?
                               ├─ < 80K → NO
                               └─ > 80K → YES

There could be more than one tree that fits the same data!

Dr. Hazem Shatila


Apply Model to Test Data
Test Data
Start from the root of tree.
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10

Yes Owner No

NO MarSt
Single, Divorced Married

Income NO
< 80K > 80K

NO YES

Dr. Hazem Shatila


Apply Model to Test Data
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10

Yes Owner No

NO MarSt
Single, Divorced Married Assign Defaulted to
“No”
Income NO
< 80K > 80K

NO YES

Dr. Hazem Shatila


Decision Tree Classification
Task
Tid Attrib1 Attrib2 Attrib3 Class
Tree
1 Yes Large 125K No Induction
2 No Medium 100K No algorithm
3 No Small 70K No

4 Yes Medium 120K No


Induction
5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No Learn


8 No Small 85K Yes Model
9 No Medium 75K No

10 No Small 90K Yes


Model
10

Training Set
Apply Decision
Tid Attrib1 Attrib2 Attrib3 Class
Model Tree
11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?


Deduction
14 No Small 95K ?

15 No Large 67K ?
10

Test Set

Dr. Hazem Shatila


Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT

Dr. Hazem Shatila


General Structure of Hunt’s
Algorithm
• Let Dt be the set of training records that reach a node t
  (the Defaulted Borrower training data above).

• General Procedure:
  – If Dt contains records that belong to the same class yt,
    then t is a leaf node labeled as yt.
  – If Dt contains records that belong to more than one class,
    use an attribute test to split the data into smaller subsets.
    Recursively apply the procedure to each subset (see the sketch below).
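A minimal recursive sketch of this procedure, assuming records are dictionaries and candidate tests are supplied as (name, predicate) pairs; a real implementation would pick the best test with one of the impurity measures discussed in the following slides:

    from collections import Counter

    def hunts_algorithm(records, labels, attribute_tests):
        # Case 1: all records belong to the same class -> leaf node
        if len(set(labels)) == 1:
            return {"leaf": labels[0]}
        # No usable test left (or identical attribute values) -> majority-class leaf
        if not attribute_tests:
            return {"leaf": Counter(labels).most_common(1)[0][0]}

        name, predicate = attribute_tests[0]       # a real implementation picks the best test
        remaining = attribute_tests[1:]
        left  = [(r, y) for r, y in zip(records, labels) if predicate(r)]
        right = [(r, y) for r, y in zip(records, labels) if not predicate(r)]
        if not left or not right:                  # degenerate split -> majority-class leaf
            return {"leaf": Counter(labels).most_common(1)[0][0]}
        return {
            "test": name,
            "yes": hunts_algorithm([r for r, _ in left],  [y for _, y in left],  remaining),
            "no":  hunts_algorithm([r for r, _ in right], [y for _, y in right], remaining),
        }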

Dr. Hazem Shatila


Hunt's Algorithm applied to the Defaulted Borrower data grows the tree in stages
(class counts shown as (No, Yes)):

(a) A single leaf: Defaulted = No                                    (7, 3)

(b) Split on Home Owner:
      Yes → Defaulted = No   (3, 0)
      No  → Defaulted = No   (4, 3)

(c) Split the Home Owner = No child on Marital Status:
      Home Owner = Yes                   → Defaulted = No    (3, 0)
      Home Owner = No, Married           → Defaulted = No    (3, 0)
      Home Owner = No, Single/Divorced   → Defaulted = Yes   (1, 3)

(d) Split the Single/Divorced node on Annual Income:
      Annual Income < 80K    → Defaulted = No    (1, 0)
      Annual Income >= 80K   → Defaulted = Yes   (0, 3)

Dr. Hazem Shatila


Design Issues of Decision Tree
Induction
• How should training records be split?
– Method for specifying test condition
• depending on attribute types
– Measure for evaluating the goodness of a test
condition

• How should the splitting procedure stop?


– Stop splitting if all the records belong to the same
class or have identical attribute values
– Early termination

Dr. Hazem Shatila


Methods for Expressing Test
Conditions
• Depends on attribute types
– Binary
– Nominal
– Ordinal
– Continuous

• Depends on number of ways to split


– 2-way split
– Multi-way split

Dr. Hazem Shatila


Test Condition for Nominal
Attributes
• Multi-way split:
  – Use as many partitions as distinct values.
      Marital Status → {Single}, {Divorced}, {Married}

• Binary split:
  – Divides values into two subsets, e.g.
      {Married} vs {Single, Divorced}
      {Single} vs {Married, Divorced}
      {Single, Married} vs {Divorced}

Dr. Hazem Shatila


Test Condition for Ordinal Attributes
• Multi-way split:
  – Use as many partitions as distinct values.
      Shirt Size → {Small}, {Medium}, {Large}, {Extra Large}

• Binary split:
  – Divides values into two subsets and preserves the order property
    among attribute values, e.g.
      {Small, Medium} vs {Large, Extra Large}
      {Small} vs {Medium, Large, Extra Large}
  – The grouping {Small, Large} vs {Medium, Extra Large} violates the
    order property.

Dr. Hazem Shatila


Test Condition for Continuous
Attributes
(i) Binary split:      Annual Income > 80K?  → Yes / No
(ii) Multi-way split:  Annual Income?  → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

Dr. Hazem Shatila


Splitting Based on Continuous
Attributes
• Different ways of handling
– Discretization to form an ordinal categorical
attribute
Ranges can be found by equal interval bucketing, equal
frequency bucketing (percentiles), or clustering.
• Static – discretize once at the beginning
• Dynamic – repeat at each node

– Binary Decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut
  • can be more compute intensive

Dr. Hazem Shatila


How to determine the Best Split

Before Splitting: 10 records of class 0,


10 records of class 1

Three candidate test conditions:

  Gender:       Yes → C0: 6, C1: 4   |   No → C0: 4, C1: 6
  Car Type:     Family → C0: 1, C1: 3   |   Sports → C0: 8, C1: 0   |   Luxury → C0: 1, C1: 7
  Customer ID:  c1 … c20 — one record per child node (each child is pure)

Which test condition is the best?

Dr. Hazem Shatila


How to determine the Best Split
• Greedy approach:
– Nodes with purer class distribution are
preferred

• Need a measure of node impurity:

  C0: 5, C1: 5 → high degree of impurity          C0: 9, C1: 1 → low degree of impurity

Dr. Hazem Shatila


Measures of Node Impurity
• Gini Index:
    GINI(t) = 1 − Σⱼ [p(j | t)]²

• Entropy:
    Entropy(t) = − Σⱼ p(j | t) log p(j | t)

• Misclassification error:
    Error(t) = 1 − maxᵢ P(i | t)

• Regression trees use the standard deviation.


Dr. Hazem Shatila
Finding the Best Split
1. Compute impurity measure (P) before
splitting
2. Compute impurity measure (M) after splitting
● Compute impurity measure of each child node
● M is the weighted impurity of children
3. Choose the attribute test condition that
produces the highest gain
Gain = P – M
or equivalently, lowest impurity measure
after splitting (M)

Dr. Hazem Shatila


Finding the Best Split
C0 N00
Before Splitting: P
C1 N01

A? B?
Yes No Yes No

Node N1 Node N2 Node N3 Node N4

C0 N10 C0 N20 C0 N30 C0 N40


C1 N11 C1 N21 C1 N31 C1 N41

M11 M12 M21 M22

M1 M2
Gain = P – M1 vs P – M2
Dr. Hazem Shatila
Measure of Impurity: GINI
• Gini Index for a given node t :

    GINI(t) = 1 − Σⱼ [p(j | t)]²

(NOTE: p( j | t) is the relative frequency of class j at node t).

– Maximum (1 - 1/nc) when records are equally


distributed among all classes, implying least
interesting information
– Minimum (0.0) when all records belong to one class,
implying most interesting information

Dr. Hazem Shatila


Measure of Impurity: GINI
• Gini Index for a given node t :

    GINI(t) = 1 − Σⱼ [p(j | t)]²

(NOTE: p( j | t) is the relative frequency of class j at node t).

– For 2-class problem (p, 1 – p):


• GINI = 1 − p² − (1 − p)² = 2p(1 − p)

C1 0 C1 1 C1 2 C1 3
C2 6 C2 5 C2 4 C2 3
Gini=0.000 Gini=0.278 Gini=0.444 Gini=0.500

Dr. Hazem Shatila


Computing Gini Index of a Single
Node
    GINI(t) = 1 − Σⱼ [p(j | t)]²

C1 = 0, C2 = 6:   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                  Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1 = 1, C2 = 5:   P(C1) = 1/6, P(C2) = 5/6
                  Gini = 1 − (1/6)² − (5/6)² = 0.278

C1 = 2, C2 = 4:   P(C1) = 2/6, P(C2) = 4/6
                  Gini = 1 − (2/6)² − (4/6)² = 0.444

Dr. Hazem Shatila


Computing Gini Index for a
Collection of Nodes
• When a node p is split into k partitions (children)
    GINI_split = Σᵢ₌₁ᵏ (nᵢ / n) · GINI(i)

where, ni = number of records at child i,


n = number of records at parent node p.

• Choose the attribute that minimizes weighted average


Gini index of the children

• Gini index is used in decision tree algorithms such as


CART, SLIQ, SPRINT

Dr. Hazem Shatila


Binary Attributes: Computing GINI Index

● Splits into two partitions


● Effect of weighting partitions:
  – Larger and purer partitions are sought.
Parent
B? C1 7
Yes No C2 5
Gini = 0.486
Node N1 Node N2
Gini(N1) = 1 − (5/6)² − (1/6)² = 0.278            N1: C1 = 5, C2 = 1
Gini(N2) = 1 − (2/6)² − (4/6)² = 0.444            N2: C1 = 2, C2 = 4

Weighted Gini of the split = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
Gain = 0.486 − 0.361 = 0.125
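A short computation reproducing the numbers above:

    def gini(counts):
        # GINI(t) = 1 - sum_j p(j|t)^2 for a list of class counts at a node
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    parent = [7, 5]                     # C1 = 7, C2 = 5
    n1, n2 = [5, 1], [2, 4]             # children produced by the binary split B
    weighted = (sum(n1) / sum(parent)) * gini(n1) + (sum(n2) / sum(parent)) * gini(n2)

    print(round(gini(parent), 3))                    # 0.486
    print(round(gini(n1), 3), round(gini(n2), 3))    # 0.278 0.444
    print(round(weighted, 3))                        # 0.361
    print(round(gini(parent) - weighted, 3))         # gain = 0.125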

Dr. Hazem Shatila


Categorical Attributes: Computing Gini Index

● For each distinct value, gather counts for each class in


the dataset
● Use the count matrix to make decisions
Multi-way split Two-way split
(find best partition of values)

CarType CarType CarType


{Sports, {Family,
Family Sports Luxury {Family} {Sports}
Luxury} Luxury}
C1 1 8 1 C1 9 1 C1 8 2
C2 3 0 7 C2 7 3 C2 0 10
Gini 0.163 Gini 0.468 Gini 0.167

Which of these is the best?

Dr. Hazem Shatila


Continuous Attributes: Computing Gini Index

● Use binary decisions based on one splitting value v
  (illustrated on the Defaulted Borrower training data above).

● Several choices for the splitting value
  – Number of possible splitting values = number of distinct values

● Each splitting value v has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A ≥ v

● Simple method to choose the best v
  – For each v, scan the database to gather the count matrix and compute its Gini index
  – Computationally inefficient! Repetition of work.

  Example count matrix for Annual Income, v = 80:

                     ≤ 80    > 80
  Defaulted = Yes     0       3
  Defaulted = No      3       4

Dr. Hazem Shatila


Continuous Attributes: Computing Gini Index...

● For efficient computation: for each attribute,


– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix
and computing gini index
– Choose the split position that has the least gini index

Cheat No No No Yes Yes Yes No No No No


Annual Income
Sorted Values 60 70 75 85 90 95 100 120 125 220
Split Positions 55 65 72 80 87 92 97 110 122 172 230
<= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= >
Yes 0 3 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0 3 0

No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0

Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
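A minimal sketch of this sort-and-scan procedure on the Annual Income attribute; the function name is illustrative:

    def best_gini_split(values, labels):
        # Sort the continuous attribute once, then scan candidate split positions
        # (midpoints between consecutive values), keeping running class counts.
        def gini(counts):
            n = sum(counts)
            return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

        data = sorted(zip(values, labels))
        classes = sorted(set(labels))
        left = {c: 0 for c in classes}
        right = {c: sum(1 for y in labels if y == c) for c in classes}
        n = len(labels)
        best = (None, float("inf"))

        for i in range(len(data) - 1):
            v, y = data[i]
            left[y] += 1
            right[y] -= 1
            split = (v + data[i + 1][0]) / 2.0
            k = i + 1
            w = (k / n) * gini(list(left.values())) + ((n - k) / n) * gini(list(right.values()))
            if w < best[1]:
                best = (split, w)
        return best

    income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
    cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
    print(best_gini_split(income, cheat))   # ~(97.5, 0.30) -- the 0.300 split at position 97 above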

Dr. Hazem Shatila




Measure of Impurity: Entropy
• Entropy at a given node t:
    Entropy(t) = − Σⱼ p(j | t) log p(j | t)

(NOTE: p( j | t) is the relative frequency of class j at node t).

• Maximum (log nc) when records are equally distributed


among all classes implying least information
• Minimum (0.0) when all records belong to one class,
implying most information

– Entropy based computations are quite similar to


the GINI index computations
Dr. Hazem Shatila
Computing Entropy of a Single
Node
    Entropy(t) = − Σⱼ p(j | t) log₂ p(j | t)

C1 = 0, C2 = 6:   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                  Entropy = − 0 log 0 − 1 log 1 = − 0 − 0 = 0

C1 = 1, C2 = 5:   P(C1) = 1/6, P(C2) = 5/6
                  Entropy = − (1/6) log₂ (1/6) − (5/6) log₂ (5/6) = 0.65

C1 = 2, C2 = 4:   P(C1) = 2/6, P(C2) = 4/6
                  Entropy = − (2/6) log₂ (2/6) − (4/6) log₂ (4/6) = 0.92
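A short helper reproducing the entropy values above:

    import math

    def entropy(counts):
        # Entropy(t) = - sum_j p(j|t) log2 p(j|t), with 0 log 0 taken as 0
        n = sum(counts)
        ps = [c / n for c in counts if c > 0]
        return -sum(p * math.log2(p) for p in ps)

    print(round(entropy([0, 6]), 2))   # 0.0
    print(round(entropy([1, 5]), 2))   # 0.65
    print(round(entropy([2, 4]), 2))   # 0.92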

Dr. Hazem Shatila


Computing Information Gain After Splitting
• Information Gain:
    GAIN_split = Entropy(p) − Σᵢ₌₁ᵏ (nᵢ / n) · Entropy(i)

Parent Node, p is split into k partitions;


ni is number of records in partition i

– Choose the split that achieves most reduction


(maximizes GAIN)

– Used in the ID3 and C4.5 decision tree algorithms

Dr. Hazem Shatila


Problem with large number of partitions
• Node impurity measures tend to prefer
splits that result in large number of
partitions, each being small but pure
  Gender:       Yes → C0: 6, C1: 4   |   No → C0: 4, C1: 6
  Car Type:     Family → C0: 1, C1: 3   |   Sports → C0: 8, C1: 0   |   Luxury → C0: 1, C1: 7
  Customer ID:  c1 … c20 — one record per child node (each child is pure)

– Customer ID has highest information gain


because entropy for all the children is zero

Dr. Hazem Shatila


Gain Ratio
• Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = − Σᵢ₌₁ᵏ (nᵢ / n) log (nᵢ / n)

Parent Node, p is split into k partitions


ni is the number of records in partition i

– Adjusts Information Gain by the entropy of the partitioning


(SplitINFO).
• Higher entropy partitioning (large number of small partitions) is
penalized!
– Used in C4.5 algorithm
– Designed to overcome the disadvantage of Information Gain

Dr. Hazem Shatila


Gain Ratio
• Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = − Σᵢ₌₁ᵏ (nᵢ / n) log (nᵢ / n)

Parent Node, p is split into k partitions


ni is the number of records in partition i

CarType CarType CarType


{Sports, {Family,
Family Sports Luxury {Family} {Sports}
Luxury} Luxury}
C1 1 8 1 C1 9 1 C1 8 2
C2 3 0 7 C2 7 3 C2 0 10
Gini 0.163 Gini 0.468 Gini 0.167

SplitINFO = 1.52 SplitINFO = 0.72 SplitINFO = 0.97

Dr. Hazem Shatila


Measure of Impurity: Classification Error

• Classification error at a node t :

    Error(t) = 1 − maxᵢ P(i | t)

– Maximum (1 - 1/nc) when records are equally


distributed among all classes, implying least
interesting information
– Minimum (0) when all records belong to one class,
implying most interesting information

Dr. Hazem Shatila


Computing Error of a Single
Node
    Error(t) = 1 − maxᵢ P(i | t)

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Error = 1 – max (0, 1) = 1 – 1 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

Dr. Hazem Shatila


Comparison among Impurity
Measures
For a 2-class problem:

Dr. Hazem Shatila


Misclassification Error vs Gini Index
A? Parent
C1 7
Yes No
C2 3
Node N1 Node N2 Gini = 0.42

Gini(N1) = 1 − (3/3)² − (0/3)² = 0                N1: C1 = 3, C2 = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489            N2: C1 = 4, C2 = 3

Gini(children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves, but the misclassification error remains the same!

Dr. Hazem Shatila


Misclassification Error vs Gini Index
A? Parent
C1 7
Yes No
C2 3
Node N1 Node N2 Gini = 0.42

N1 N2 N1 N2
C1 3 4 C1 3 4
C2 0 3 C2 1 2
Gini=0.342 Gini=0.416

Misclassification error for all three cases = 0.3 !

Dr. Hazem Shatila


Decision Boundary
[Figure: points in the unit square partitioned by the tree
    x < 0.43?  — Yes: y < 0.33?  (Yes: 4/0, No: 0/4)  — No: y < 0.47?  (Yes: 0/3, No: 4/0)]
• Border line between two neighboring regions of different classes is known
as decision boundary
• Decision boundary is parallel to axes because test condition involves a
single attribute at-a-time

Dr. Hazem Shatila


Decision Tree Based Classification
● Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid
overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless
the attributes are interacting)
● Disadvantages:
– Space of possible decision trees is exponentially large.
Greedy approaches are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute

Dr. Hazem Shatila


Cluster Analysis

Dr. Hazem Shatila


What is Cluster Analysis?
• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups

Intra-cluster distances are minimized; inter-cluster distances are maximized.

Dr. Hazem Shatila


Applications of Cluster Analysis
• Understanding
  – Group related documents for browsing, group genes and proteins that have
    similar functionality, or group stocks with similar price fluctuations.

    Discovered Clusters (stocks)                                               Industry Group
    1  Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN,
       CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN,
       Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down,
       Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN                   Technology1-DOWN
    2  Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN,
       Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN,
       Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN,
       Microsoft-DOWN, Scientific-Atl-DOWN                                     Technology2-DOWN
    3  Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN,
       Morgan-Stanley-DOWN                                                     Financial-DOWN
    4  Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP,
       Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP        Oil-UP

• Summarization
  – Reduce the size of large data sets.
    [Figure: clustering precipitation in Australia]

Dr. Hazem Shatila


What is not Cluster Analysis?
• Simple segmentation
– Dividing students into different registration groups
alphabetically, by last name

• Results of a query
– Groupings are a result of an external specification
– Clustering is a grouping of objects based on the data

• Supervised classification
– Have class label information

• Association Analysis
– Local vs. global connections

Dr. Hazem Shatila


Notion of a Cluster can be
Ambiguous

How many clusters? Six Clusters

Two Clusters Four Clusters

Dr. Hazem Shatila


Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical
and partitional sets of clusters
• Partitional Clustering
– A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one subset

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree

Dr. Hazem Shatila


Partitional Clustering

Original Points A Partitional Clustering

Dr. Hazem Shatila


Hierarchical Clustering

[Figure: points p1–p4 — a traditional hierarchical clustering with its dendrogram,
 and a non-traditional hierarchical clustering with its dendrogram]

Dr. Hazem Shatila


Other Distinctions Between Sets of Clusters

• Exclusive versus non-exclusive


– In non-exclusive clusterings, points may belong to multiple
clusters.
– Can represent multiple classes or ‘border’ points
• Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some
weight between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
• Partial versus complete
– In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities

Dr. Hazem Shatila


Types of Clusters
• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density-based clusters

• Property or Conceptual

• Described by an Objective Function


Dr. Hazem Shatila
Types of Clusters: Well-Separated

• Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.

3 well-separated clusters

Dr. Hazem Shatila


Types of Clusters: Center-Based

• Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster

4 center-based clusters

Dr. Hazem Shatila


Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or


Transitive)
– A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.

8 contiguous clusters

Dr. Hazem Shatila


Types of Clusters: Density-Based

• Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.

6 density-based clusters

Dr. Hazem Shatila


Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters


– Finds clusters that share some common property or represent
a particular concept.
.

2 Overlapping Circles

Dr. Hazem Shatila


Types of Clusters: Objective Function

• Clusters Defined by an Objective Function


– Finds clusters that minimize or maximize an objective function.
– Enumerate all possible ways of dividing the points into clusters and
evaluate the `goodness' of each potential set of clusters by using
the given objective function.
– Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
– A variation of the global objective function approach is to fit the
data to a parameterized model.
• Parameters for the model are determined from the data.
• Mixture models assume that the data is a ‘mixture' of a number of
statistical distributions.

Dr. Hazem Shatila


Map Clustering Problem to a Different Problem

• Map the clustering problem to a different


domain and solve a related problem in that
domain
– Proximity matrix defines a weighted graph,
where the nodes are the points being
clustered, and the weighted edges represent
the proximities between points

– Clustering is equivalent to breaking the graph


into connected components, one for each
cluster.
Dr. Hazem Shatila
Characteristics of the Input Data Are Important

• Type of proximity or density measure


– Central to clustering
– Depends on data and application

• Data characteristics that affect proximity and/or density are


– Dimensionality
• Sparseness
– Attribute type
– Special relationships in the data
• For example, autocorrelation
– Distribution of the data

• Noise and Outliers


– Often interfere with the operation of the clustering algorithm

Dr. Hazem Shatila


Clustering Algorithms
• K-means and its variants

• Hierarchical clustering

• Density-based clustering

Dr. Hazem Shatila


K-means Clustering

• Partitional clustering approach


• Number of clusters, K, must be specified
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest
centroid
• The basic algorithm is very simple
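A minimal sketch of the basic algorithm on 2-D points, assuming Euclidean distance; empty clusters simply keep their old centroid here (better strategies are discussed later):

    import random

    def kmeans(points, k, iters=100):
        # Basic K-means: random initial centroids, assign each point to the closest
        # centroid, recompute centroids as cluster means, repeat until assignments stop changing.
        def dist2(p, q):
            return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

        centroids = random.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: dist2(p, centroids[j]))
                clusters[i].append(p)
            new_centroids = [
                (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
                for i, c in enumerate(clusters)
            ]
            if new_centroids == centroids:        # converged
                break
            centroids = new_centroids
        sse = sum(dist2(p, centroids[min(range(k), key=lambda j: dist2(p, centroids[j]))])
                  for p in points)
        return centroids, clusters, sse

    pts = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(50)] + \
          [(random.gauss(3, 0.3), random.gauss(3, 0.3)) for _ in range(50)]
    centers, clusters, sse = kmeans(pts, k=2)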

Dr. Hazem Shatila


Example of K-means Clustering
[Figure: K-means clustering of a two-dimensional data set — final assignment after 6 iterations]

Dr. Hazem Shatila
Example of K-means Clustering
[Figure: K-means iterations 1–6 — centroids and cluster assignments converging]

Dr. Hazem Shatila


K-means Clustering – Details

• Initial centroids are often chosen randomly.


– Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the
cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
• K-means will converge for common similarity measures
mentioned above.
• Most of the convergence happens in the first few
iterations.
– Often the stopping condition is changed to ‘Until relatively few
points change clusters’
• Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes

Dr. Hazem Shatila


Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster
  – To get SSE, we square these errors and sum them:

        SSE = Σᵢ₌₁ᴷ Σ_{x ∈ Cᵢ} dist²(mᵢ, x)

– x is a data point in cluster Ci and mi is the representative point for


cluster Ci
• can show that mi corresponds to the center (mean) of the cluster
– Given two sets of clusters, we prefer the one with the smallest
error
– One easy way to reduce SSE is to increase K, the number of
clusters
• A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
Dr. Hazem Shatila
Two different K-means Clusterings
[Figure: the same original points clustered two ways by K-means — an optimal clustering and a sub-optimal clustering]

Dr. Hazem Shatila


Limitations of K-means
• K-means has problems when clusters are
of differing
– Sizes
– Densities
– Non-globular shapes

• K-means has problems when the data


contains outliers.

Dr. Hazem Shatila


Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

Dr. Hazem Shatila


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

Dr. Hazem Shatila


Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)

Dr. Hazem Shatila


Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters.


Find parts of clusters, but need to put together.

Dr. Hazem Shatila


Overcoming K-means Limitations

Original Points K-means Clusters

Dr. Hazem Shatila


Overcoming K-means Limitations

Original Points K-means Clusters

Dr. Hazem Shatila


Importance of Choosing Initial Centroids
[Figure: K-means with a good choice of initial centroids — the iterations converge to the natural clusters after 6 iterations]

Dr. Hazem Shatila
Importance of Choosing Initial Centroids
[Figure: iterations 1–6 for the good initialization]

Dr. Hazem Shatila


Importance of Choosing Initial Centroids …
[Figure: K-means with a poor choice of initial centroids — after 5 iterations the clustering is sub-optimal]

Dr. Hazem Shatila
Importance of Choosing Initial Centroids …

[Figure: iterations 1–5 for the poor initialization]

Dr. Hazem Shatila


Problems with Selecting Initial Points

• If there are K ‘real’ clusters then the chance of selecting


one centroid from each cluster is small.
– Chance is relatively small when K is large
– If clusters are the same size, n, then the probability of selecting one
  centroid from each cluster is

      P = K! nᴷ / (Kn)ᴷ = K! / Kᴷ

– For example, if K = 10, then probability = 10!/10¹⁰ = 0.00036


– Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t
– Consider an example of five pairs of clusters

Dr. Hazem Shatila


10 Clusters Example

[Figure: five pairs of clusters] Starting with two initial centroids in one cluster of each pair of clusters.

Dr. Hazem Shatila


10 Clusters Example
[Figure: iterations 1–4 of K-means on the ten-cluster data]

Starting with two initial centroids in one cluster of each pair of clusters

Dr. Hazem Shatila


10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one.

[Figure: five pairs of clusters with an uneven initial centroid allocation]

Dr. Hazem Shatila
10 Clusters Example
[Figure: iterations 1–4 of K-means for the uneven initialization]

Starting with some pairs of clusters having three initial centroids, while others have only one.

Dr. Hazem Shatila


Solutions to Initial Centroids
Problem
• Multiple runs
– Helps, but probability is not on your side
• Sample and use hierarchical clustering to
determine initial centroids
• Select more than k initial centroids and then
select among these initial centroids
– Select most widely separated
• Postprocessing
• Generate a larger number of clusters and
then perform a hierarchical clustering
• Bisecting K-means
– Not as susceptible to initialization issues

Dr. Hazem Shatila


Empty Clusters
• K-means can yield empty clusters
  Example (one-dimensional data 6.5, 9, 10, 15, 16, 18.5):
    Initial centroids: 6.8, 13, 18.
    After reassignment and update the centroids become 7.75, 12.5, 17.25;
    on the next assignment the middle centroid receives no points — an empty cluster.

Dr. Hazem Shatila


Handling Empty Clusters
• Basic K-means algorithm can yield empty
clusters

• Several strategies
– Choose the point that contributes most to SSE
– Choose a point from the cluster with the
highest SSE
– If there are several empty clusters, the above
can be repeated several times.

Dr. Hazem Shatila


Updating Centers Incrementally
• In the basic K-means algorithm, centroids
are updated after all points are assigned to
a centroid

• An alternative is to update the centroids


after each assignment (incremental
approach)
– More expensive
– Introduces an order dependency
– Never get an empty cluster
Dr. Hazem Shatila
Pre-processing and Post-
processing
• Pre-processing
– Normalize the data
– Eliminate outliers
• Post-processing
– Eliminate small clusters that may represent
outliers
– Split ‘loose’ clusters, i.e., clusters with
relatively high SSE
– Merge clusters that are ‘close’ and that have relatively low SSE

Dr. Hazem Shatila
Bisecting K-means

• Bisecting K-means algorithm


– Variant of K-means that can produce a partitional or a
hierarchical clustering

CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview

Dr. Hazem Shatila


Bisecting K-means Example

Dr. Hazem Shatila


Evaluation of Learning Models

Dr. Hazem Shatila


How to evaluate the Classifier’s
Generalization Performance?
• Assume that we test a classifier on some
test set and we derive at the end the
following confusion matrix:

                     Predicted class
                     Pos     Neg
Actual class   Pos   TP      FN       P
               Neg   FP      TN       N

Dr. Hazem Shatila


Confusion Matrix

Using the confusion matrix entries TP, FN, FP, TN:

  Recall / Sensitivity (True Positive Rate)    = TP / (TP + FN)
  Specificity (True Negative Rate)             = TN / (TN + FP)
  Precision (Positive Predictive Value, PPV)   = TP / (TP + FP)
  Negative Predictive Value (NPV)              = TN / (TN + FN)
  Accuracy                                     = (TP + TN) / (TP + FN + FP + TN)
  F1 Score (best at 1)                         = 2 × Precision × Recall / (Precision + Recall)
  Geometric Mean                               = √(Precision × Recall)
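A small helper computing these metrics from the four confusion-matrix counts (the example counts are illustrative):

    def confusion_metrics(tp, fn, fp, tn):
        # Metrics derived from a binary confusion matrix
        recall      = tp / (tp + fn)          # sensitivity, true positive rate
        specificity = tn / (tn + fp)          # true negative rate
        precision   = tp / (tp + fp)          # positive predictive value
        npv         = tn / (tn + fn)          # negative predictive value
        accuracy    = (tp + tn) / (tp + fn + fp + tn)
        f1          = 2 * precision * recall / (precision + recall)
        g_mean      = (precision * recall) ** 0.5
        return {"recall": recall, "specificity": specificity, "precision": precision,
                "npv": npv, "accuracy": accuracy, "f1": f1, "g_mean": g_mean}

    print(confusion_metrics(tp=40, fn=10, fp=5, tn=45))   # illustrative counts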

Dr. Hazem Shatila


Example

Dr. Hazem Shatila


Example


Dr. Hazem Shatila


How to Estimate the Metrics?

• We can use:
– Training data;
– Independent test data;
– Hold-out method;
– k-fold cross-validation method;
– Leave-one-out method;
– Bootstrap method;
– And many more…

Dr. Hazem Shatila


Estimation with Training Data

• The accuracy/error estimates on the training data


are not good indicators of performance on future
data.
Classifier

Training set Training set

– Q: Why?
– A: Because new data will probably not be exactly the
same as the training data!
• The accuracy/error estimates on the training data
measure the degree of classifier’s overfitting.
Dr. Hazem Shatila
Estimation with Independent Test Data

• Estimation with independent test data is used when


we have plenty of data and there is a natural way to
forming training and test data.

Classifier

Training set Test set


• For example: Quinlan in 1987 reported experiments in
a medical domain for which the classifiers were
trained on data from 1985 and tested on data from
1986.

Dr. Hazem Shatila


Hold-out Method

• The hold-out method splits the data into training data


and test data (usually 2/3 for train, 1/3 for test). Then we
build a classifier using the train data and test it using the
test data.
Classifier

Training set Test set

Data
• The hold-out method is usually used when we have
thousands of instances, including several hundred
instances from each class.

Dr. Hazem Shatila


Making the Most of the Data

• Once evaluation is complete, all the data


can be used to build the final classifier.
• Generally, the larger the training data the
better the classifier (but returns diminish).
• The larger the test data the more accurate
the error estimate.

Dr. Hazem Shatila


Stratification
• The holdout method reserves a certain
amount for testing and uses the remainder
for training.
– Usually: one third for testing, the rest for training.
• For “unbalanced” datasets, samples might
not be representative.
– Few or no instances of some classes.
• Stratified sample: advanced version of
balancing the data.
– Make sure that each class is represented with
approximately equal proportions in both subsets.

Dr. Hazem Shatila


Repeated Holdout Method

• Holdout estimate can be made more


reliable by repeating the process with
different subsamples.
– In each iteration, a certain proportion is
randomly selected for training (possibly with
stratification).
– The error rates on the different iterations are
averaged to yield an overall error rate.
• This is called the repeated holdout
method.
Dr. Hazem Shatila
Repeated Holdout Method, 2

• Still not optimum: the different test sets


overlap, but we would like all our instances
from the data to be tested at least once.
• Can we prevent overlapping?

Dr. Hazem Shatila


k-Fold Cross-Validation
• k-fold cross-validation avoids overlapping test
sets:
– First step: data is split into k subsets of equal size;
– Second step: each subset in turn is used for testing and
the remainder for training.
• The subsets are stratified before the cross-validation.
• The estimates are averaged to
yield an overall estimate.
train train test

Data train test train

test train train
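A minimal sketch of the k-fold split (round-robin, without stratification; a stratified split would additionally keep the class proportions in each fold). The commented usage assumes a model object with fit/predict, which is illustrative:

    def k_fold_indices(n, k):
        # Split indices 0..n-1 into k folds; each fold is used as the test set once
        folds = [list(range(i, n, k)) for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test

    # illustrative use with any classifier exposing fit/predict and numpy arrays X, y:
    # accuracies = []
    # for train_idx, test_idx in k_fold_indices(len(X), k=10):
    #     model.fit(X[train_idx], y[train_idx])
    #     accuracies.append((model.predict(X[test_idx]) == y[test_idx]).mean())
    # print(sum(accuracies) / len(accuracies))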

Dr. Hazem Shatila


More on Cross-Validation
• Standard method for evaluation: stratified 10-fold
cross-validation.
• Why 10? Extensive experiments have shown that this
is the best choice to get an accurate estimate.
• Stratification reduces the estimate’s variance.
• Even better: repeated stratified cross-validation:
– E.g. ten-fold cross-validation is repeated ten times
and results are averaged (reduces the variance).

Dr. Hazem Shatila


Receiver Operating Characteristic
(ROC) Curve

• It is commonly called the ROC curve.


• It is a plot of the true positive rate (TPR)
against the false positive rate (FPR).
• True positive rate:  TPR = TP / (TP + FN)

• False positive rate: FPR = FP / (FP + TN)
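A minimal sketch that produces (FPR, TPR) points by sweeping a threshold over classifier scores; the scores and labels are illustrative:

    def roc_points(scores, labels):
        # labels are 1 (positive) / 0 (negative); a higher score means "more positive"
        P = sum(labels)
        N = len(labels) - P
        points = [(0.0, 0.0)]
        for threshold in sorted(set(scores), reverse=True):
            tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
            fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
            points.append((fp / N, tp / P))     # (FPR, TPR)
        return points

    print(roc_points([0.9, 0.8, 0.7, 0.6, 0.55], [1, 1, 0, 1, 0]))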

Dr. Hazem Shatila


Sensitivity and Specificity

• In statistics, there are two other evaluation


measures:
– Sensitivity: Same as TPR
– Specificity: Also called True Negative Rate
(TNR)

• Then we have: sensitivity = TPR and specificity = 1 − FPR (i.e., FPR = 1 − specificity)

Dr. Hazem Shatila


Example ROC curves

Dr. Hazem Shatila


Area under the curve (AUC)

• Which classifier is better, C1 or C2?


– It depends on which region you talk about.
• Can we have one measure?
– Yes, we compute the area under the curve (AUC)
• If AUC for Ci is greater than that of Cj, it is
said that Ci is better than Cj.
– If a classifier is perfect, its AUC value is 1
– If a classifier makes all random guesses, its AUC
value is 0.5.

Dr. Hazem Shatila


Lower left point (0,0)

Dr. Hazem Shatila


Upper Right Corner (1,1)

Dr. Hazem Shatila


Point D (0,1)

Dr. Hazem Shatila


Several Points in ROC space

Dr. Hazem Shatila


Point C (random experiment)

Dr. Hazem Shatila


Example

Dr. Hazem Shatila


Example (Accuracy)

Dr. Hazem Shatila


Example

Dr. Hazem Shatila
