
1. Introduction
Models and Patterns
• Model = abstract representation of the given training data
– e.g., a very simple linear model structure: Y = aX + b
– a and b are parameters determined from the data
– Y = aX + b is the model structure
– Y = 0.9X + 0.3 is a particular model

x       f(x)
1       1
2       4
3       9
4       16
1259    ?

• Example: Given a finite sample of <x, f(x)> pairs, can we create a model that holds for future values?
To guess the true function f, find some pattern (called a hypothesis) in the training examples, and assume that the pattern will hold for future examples too.
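As a rough illustration of how the choice of model structure shapes the guess for an unseen x, the sketch below fits the linear structure Y = aX + b to the four sample pairs by ordinary least squares and compares its guess at x = 1259 with the guess of a quadratic hypothesis f(x) = x². The function name `fit_line` and the choice of least squares are assumptions made for this example; the slides do not say how a and b are determined, and f(x) = x² is only one hypothesis that happens to match the sample.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b for the model structure Y = aX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# The finite sample of <x, f(x)> pairs from the table.
xs = [1, 2, 3, 4]
ys = [1, 4, 9, 16]

a, b = fit_line(xs, ys)            # one particular linear model
linear_guess = a * 1259 + b        # what the linear hypothesis predicts
quadratic_guess = 1259 ** 2        # what the hypothesis f(x) = x^2 predicts

print(f"linear model: Y = {a}X + {b}, guess at 1259: {linear_guess}")
print(f"quadratic hypothesis guess at 1259: {quadratic_guess}")
```

Both hypotheses are reasonably close on the sample itself, yet they disagree enormously far from it, which is why the assumed pattern matters as much as the fitted parameters.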
Why Classification? A motivating application

• Credit approval
– A bank wants to classify its customers based on whether they are expected to pay back their approved loans
– The history of past customers is used to train the classifier
– The classifier provides rules, which identify potentially reliable future customers
• Classification rule:
– If age = “31...40” and income = high then credit_rating = excellent
• Future customers
– Paul: age = 35, income = high → excellent credit rating
– John: age = 20, income = medium → fair credit rating

DM Task: Descriptive Modeling
• Goal is to build a “descriptive” model that models
the underlying observation
– e.g., a model that could simulate the data if needed

Descriptive model identifies patterns or relationships in data
Unlike the predictive model, a descriptive model serves as a way to explore the properties of the data examined, not to predict new properties

[Figure: EM iteration 25. x-axis: Red Blood Cell Volume; y-axis: Red Blood Cell Hemoglobin Concentration]

• Description Methods find human-interpretable patterns that describe and find natural groupings of the data.
• Methods used in descriptive modeling are: clustering, summarization, association rule discovery, etc.
Pattern (Association Rule) Discovery
 Goal is to discover interesting “local” patterns (sequential
patterns) in the data rather than to characterize the data globally
 Also called link analysis (uncovers relationships among data)
 Given market basket data we might discover that
 If customers buy wine and bread then they buy cheese with probability
0.9
 Methods used in pattern discovery include:
 Association rules, Sequence discovery, etc.
 Example in retail: Customer transactions to consumer behavior:
 People who bought “Da Vinci Code” also bought “The Five
People You Meet in Heaven” (www.amazon.com)
 Example: football player behavior
 If player A is in the game, player B’s scoring rate increases
from 25% chance per game to 95% chance per game
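The "probability 0.9" attached to a rule such as {wine, bread} → cheese is usually called the rule's confidence. A minimal sketch of how it could be computed from market basket data follows; the five baskets are invented for illustration and the helper names `support` and `confidence` are assumptions, not taken from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0

# Hypothetical market basket data (each basket is a set of items).
baskets = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "cheese"},
]

rule_support = support(baskets, {"wine", "bread", "cheese"})
rule_confidence = confidence(baskets, {"wine", "bread"}, {"cheese"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

In this toy data three baskets contain wine and bread, and two of those also contain cheese, so the rule holds in two out of three cases.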
Supervised vs. Unsupervised Learning

• Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set

• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Classification
 Classification is a data mining (machine learning) technique
used to predict group membership for data instances.
 Given a collection of records (training set), each record contains
a set of attributes, one of the attributes is the class.
 Find a model for class attribute as a function of the values of other
attributes.
 Goal: previously unseen records should be assigned a class as
accurately as possible. A test set is used to determine the accuracy
of the model.
 Usually, the given data set is divided into training and test sets,
with training set used to build the model and test set used to
validate it.
 For example, one may use classification to predict whether the
weather on a particular day will be “sunny”, “rainy” or “cloudy”.

CLASSIFICATION: A TWO-STEP PROCESS
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
  – The known label of each test sample is compared with the classified result from the model
  – Accuracy rate is the percentage of test set samples that are correctly classified by the model
  – Test set is independent of training set
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
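The two steps can be sketched in a few lines. The example below is only an illustration under stated assumptions: the records are invented, the 70/30 split is an arbitrary choice, and the "model" is a deliberately trivial majority-class predictor standing in for rules, a tree, or a formula. The names `train_test_split`, `train_majority_classifier` and `accuracy` are hypothetical helpers written for this sketch.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the labelled records and hold out a fraction as an independent test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_majority_classifier(training_set):
    """Step 1 (model construction): a trivial model that always predicts the most common class."""
    labels = [label for _, label in training_set]
    majority = max(set(labels), key=labels.count)
    return lambda attributes: majority

def accuracy(model, test_set):
    """Step 2 (model usage): fraction of test samples whose known label matches the prediction."""
    correct = sum(model(attributes) == label for attributes, label in test_set)
    return correct / len(test_set)

# Hypothetical labelled records: (attributes, class label).
records = [((age,), "reliable" if age >= 30 else "risky") for age in range(20, 40)]

training_set, test_set = train_test_split(records)
model = train_majority_classifier(training_set)
print(f"accuracy on the held-out test set: {accuracy(model, test_set):.2f}")
```

Only if this estimated accuracy is acceptable would the model be applied to tuples whose class labels are unknown.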
Classification methods
 Goal: Predict class Ci = f(x1, x2, .. xn)
 There are various classification methods.
 Popular classification techniques include the following:
Decision tree classifier: divide decision space into
piecewise constant regions.
K-nearest neighbor
Rule-based
Neural networks: partition by non-linear boundaries
Bayesian network: a probabilistic model
Support vector machine

Decision tree classifier
 Decision tree performs classification by constructing a
tree based on training instances with leaves having class
labels.
 The tree is traversed for each test instance to find a leaf,
and the class of the leaf is the predicted class. This is a
directed knowledge discovery in the sense that there is a
specific field whose value we want to predict.
 Widely used learning method as it has been applied to:
 classify medical patients based on the disease,
 equipment malfunction by cause,
 loan applicant by likelihood of payment.

Choosing the Splitting Attribute
 At each node, the best attribute is selected for splitting the
training examples using a Goodness function
The best attribute is the one that separates the classes of the training examples fastest, so that it results in the smallest tree
• Typical goodness functions:
information gain, information gain ratio, and GINI index
Information Gain
Select the attribute with the highest information gain, i.e., the one that creates the smallest average disorder
• First, compute the disorder using Entropy: the expected information needed to classify objects into classes
• Second, measure the Information Gain: by how much the disorder of a set would be reduced by knowing the value of a particular attribute.
Attribute Selection Measure: Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Let $p_i$ be the probability that an arbitrary tuple in $D$ belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$
• Expected information (entropy) needed to classify a tuple in $D$:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
• Information needed (after using $A$ to split $D$ into $v$ partitions) to classify $D$:
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$
• Information gained by branching on attribute $A$:
$$Gain(A) = Info(D) - Info_A(D)$$
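The three quantities above translate almost directly into code. The following is a minimal sketch, assuming each tuple is a Python dict and the class label is stored under the key "class"; the four-row data set and the attribute names "a" and "b" are invented so that one attribute separates the classes perfectly and the other not at all.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def info_after_split(rows, attribute, label_key="class"):
    """Info_A(D): weighted entropy of the partitions produced by splitting on `attribute`."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[label_key])
    n = len(rows)
    return sum(len(part) / n * info(part) for part in partitions.values())

def gain(rows, attribute, label_key="class"):
    """Gain(A) = Info(D) - Info_A(D)."""
    all_labels = [row[label_key] for row in rows]
    return info(all_labels) - info_after_split(rows, attribute, label_key)

# Hypothetical training tuples.
rows = [
    {"a": "x", "b": "u", "class": "yes"},
    {"a": "x", "b": "v", "class": "yes"},
    {"a": "y", "b": "u", "class": "no"},
    {"a": "y", "b": "v", "class": "no"},
]

best_attribute = max(["a", "b"], key=lambda attr: gain(rows, attr))
print(f"Gain(a) = {gain(rows, 'a'):.3f}, Gain(b) = {gain(rows, 'b'):.3f}, split on {best_attribute}")
```

Attribute "a" removes all disorder, so it would be selected as the splitting attribute; attribute "b" leaves every partition as mixed as the original set.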
Entropy
• The Entropy measures the disorder of a set S containing a total of n examples, of which n+ are positive and n- are negative. It is given by:
$$D(n^+, n^-) = -\frac{n^+}{n}\log_2\frac{n^+}{n} - \frac{n^-}{n}\log_2\frac{n^-}{n} = Entropy(S)$$
• Some useful properties of the Entropy:
– D(n,m) = D(m,n)
– D(0,m) = D(m,0) = 0
D(S)=0 means that all the examples in S have
the same class
– D(m,m) = 1
D(S)=1 means that half the examples in S are
of one class and half are in the opposite class
Attribute Selection by Information
Gain to construct the optimal
decision tree
• Entropy: The Disorder of Sunburned

D({“Sarah”, “Dana”, “Alex”, “Annie”, “Emily”, “Pete”, “John”, “Katie”})
$$= D(3^+, 5^-) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} = 0.954$$
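The value 0.954 and the listed properties of the two-class disorder can be checked numerically. The sketch below assumes the usual convention that a term with zero count contributes nothing (0 · log2 0 is taken as 0); the function name `disorder` is chosen here for illustration.

```python
from math import log2

def disorder(n_pos, n_neg):
    """Two-class entropy D(n+, n-) of a set with n_pos positive and n_neg negative examples."""
    n = n_pos + n_neg
    total = 0.0
    for count in (n_pos, n_neg):
        if count:  # a class with no examples contributes 0 (convention: 0 * log2(0) = 0)
            p = count / n
            total -= p * log2(p)
    return total

sunburn = disorder(3, 5)
print(f"D(3+, 5-) = {sunburn:.3f}")
```

Calling `disorder(0, m)` gives 0 and `disorder(m, m)` gives 1 for any positive m, matching the properties stated for the entropy.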

Gain Ratio for Attribute Selection (C4.5)
 Information gain measure is biased towards attributes with a
large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
 GainRatio(A) = Gain(A)/SplitInfo(A)
 Ex.

 gain_ratio(income) = 0.029/1.557 = 0.019


 The attribute with the maximum gain ratio is selected as the
splitting attribute
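The figures in the income example can be reproduced with a short sketch. The partition sizes are not given on the slide; the code assumes, purely for illustration, that income splits 14 tuples into partitions of 4, 6 and 4, which is one split that yields SplitInfo = 1.557. The gain of 0.029 is taken from the example as given.

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) computed from the sizes |D_j| of the partitions produced by attribute A."""
    total = sum(partition_sizes)
    return -sum((size / total) * log2(size / total) for size in partition_sizes if size)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(partition_sizes)

# Assumed partition sizes for the attribute "income" (not stated on the slide).
income_partitions = [4, 6, 4]

income_split_info = split_info(income_partitions)
income_gain_ratio = gain_ratio(0.029, income_partitions)
print(f"SplitInfo = {income_split_info:.3f}, GainRatio = {income_gain_ratio:.3f}")
```

An attribute with many small partitions has a large SplitInfo, so dividing by it penalizes exactly the many-valued attributes that plain information gain favors.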

Comparing Attribute Selection
Measures
 The three measures, in general, return good results but
 Information gain:
 biased towards multivalued attributes
 Gain ratio:
 tends to prefer unbalanced splits in which one partition is
much smaller than the others
 Gini index:
 biased to multivalued attributes
 has difficulty when # of classes is large
 tends to favor tests that result in equal-sized partitions and
purity in both partitions
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or outliers
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
  – Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
  – Use a set of data different from the training data to decide which is the “best pruned tree”
K-Nearest Neighbors
K-nearest neighbor is a supervised learning algorithm in which a new query instance is classified according to the majority category among its K nearest neighbors.
The purpose of this algorithm is to classify a new object based on its attributes and the training samples: (xi, f(xi)), i=1..N.
Given a query point, we find the K objects (training points) closest to the query point.
The classification is decided by majority vote among the classes of those K objects.
The K-nearest neighbor algorithm uses this neighborhood classification as the prediction value for the new query instance.
The K-nearest neighbor algorithm is very simple. It uses the distance from the query instance to the training samples to determine the K nearest neighbors.
K Nearest Neighbors: Key issues
The key issues involved in training a KNN model include
Setting the variable K (number of nearest neighbors)
The number of nearest neighbors (K) should be chosen by cross-validation over a number of K settings.
K = 1 is a good baseline model to benchmark against.
A good rule of thumb is that K should be less than or equal to the square root of the total number of training patterns: $K \le \sqrt{N}$
Setting the type of distance metric
We need a measure of distance in order to know who the neighbours are
Assume that we have T attributes for the learning problem. Then one example point x has elements $x_t \in \mathbb{R}$, t=1,…,T.
The distance between two points X and Y is often defined as the Euclidean distance:
$$Dist(X, Y) = \sqrt{\sum_{t=1}^{T} (X_t - Y_t)^2}$$
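Putting the Euclidean distance and the majority vote together gives a very small classifier. This is a minimal sketch, assuming numeric attribute tuples and a brute-force search over all training points; the six training points and the names `euclidean` and `knn_classify` are made up for the example.

```python
from collections import Counter
from math import sqrt

def euclidean(x, y):
    """Euclidean distance between two points with the same number of attributes."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(training, query, k):
    """Majority vote among the k training points closest to `query`."""
    neighbours = sorted(training, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training samples (x_i, f(x_i)).
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]

print(knn_classify(training, (2, 2), k=3))
print(knn_classify(training, (6, 5), k=3))
```

Note that there is no training step at all: every query recomputes the distance to all stored examples, which is where the memory and speed costs of KNN come from.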
KNNs: advantages & Disadvantages
 Advantage
 Simple
 Powerful
 Requires no training time
 Nonparametric architecture
 Disadvantage: Difficulties with k-nearest neighbour
algorithms
 Memory intensive: just store the training examples
 when a test example is given then find the closest matches
 Classification/estimation is slow
 Have to calculate the distance of the test case from all training
cases
 There may be irrelevant attributes amongst the attributes –
curse of dimensionality
Rule-Based Classification
 The learned model is represented as a set of IF-THEN
rules.
 A rule-based classifier uses a set of IF-THEN rules for
classification
 Rule: If Condition(LHS) then Conclusion(RHS)
where
• Condition is a conjunction of attribute tests
• Conclusion is the class label
LHS: rule antecedent or condition
RHS: rule consequent
Examples of classification rules:
• (Blood Type=Warm) ∧ (Lay Eggs=Yes) => Birds
• (age=youth) ∧ (student=yes) => (buys-computer=yes)
Rule Coverage and Accuracy
• Coverage of a rule:
– Fraction of records that satisfy the antecedent of a rule.
• Accuracy of a rule:
– Fraction of records that satisfy the antecedent that also satisfy the consequent of a rule
• Assessment of a rule: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Example: (Status=Single) → No
Coverage = 40%, Accuracy = 50%

coverage(R) = ncovers / |D|   /* D: training data set */
accuracy(R) = ncorrect / ncovers
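The 40% and 50% figures for the rule (Status=Single) → No can be recomputed from the ten records in the table. A small sketch follows; the tuple layout and the helper name `coverage_and_accuracy` are choices made for this example.

```python
# The ten records from the table: (Tid, Refund, Marital Status, Taxable Income in K, Class).
records = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

def coverage_and_accuracy(records, antecedent, consequent):
    """Return (ncovers / |D|, ncorrect / ncovers) for the rule antecedent -> consequent."""
    covered = [r for r in records if antecedent(r)]
    correct = [r for r in covered if r[4] == consequent]
    coverage = len(covered) / len(records)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# Rule R: (Status=Single) -> No
coverage, accuracy = coverage_and_accuracy(records, lambda r: r[2] == "Single", "No")
print(f"coverage = {coverage:.0%}, accuracy = {accuracy:.0%}")
```

Four of the ten records are Single (Tid 1, 3, 8, 10), and two of those four have class No, which gives the stated coverage and accuracy.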
Characteristics of Rule Sets: Strategy 1
 Mutually exclusive rules
Classifier contains mutually exclusive rules if the rules are independent of each other
Every record is covered by at most one rule

• Exhaustive rules
Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
Each record is covered by at least one rule
Characteristics of Rule Sets: Strategy 2
 Rules are not mutually exclusive
A record may trigger more than one rule
Solution?
 Ordered rule set
 Unordered rule set – use voting schemes
 Rules are not exhaustive
A record may not trigger any rules
Solution?
 Use a default class
Rule Ordering Schemes
 Rule-based ordering
 Individual rules are ranked based on their quality
 Class-based ordering
 Rules that belong to the same class appear together

Rule-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Class-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
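An ordered rule set can be applied by scanning the rules top to bottom and returning the class of the first rule that fires, falling back to a default class when none does. The sketch below encodes the four rules in rule-based order; the dict keys, the helper name `classify` and the choice of "No" as the default class are assumptions made for illustration.

```python
def classify(record, ordered_rules, default_class):
    """Return the class of the first rule whose condition holds, else the default class."""
    for condition, label in ordered_rules:
        if condition(record):
            return label
    return default_class

# The four rules in rule-based order (income in thousands).
single_or_divorced = {"Single", "Divorced"}
rules = [
    (lambda r: r["refund"] == "Yes", "No"),
    (lambda r: r["refund"] == "No" and r["marital"] in single_or_divorced and r["income"] < 80, "No"),
    (lambda r: r["refund"] == "No" and r["marital"] in single_or_divorced and r["income"] > 80, "Yes"),
    (lambda r: r["refund"] == "No" and r["marital"] == "Married", "No"),
]

# Hypothetical records to classify.
applicant = {"refund": "No", "marital": "Single", "income": 85}
edge_case = {"refund": "No", "marital": "Divorced", "income": 80}

print(classify(applicant, rules, default_class="No"))
print(classify(edge_case, rules, default_class="No"))
```

The second record has an income of exactly 80K, which none of the four rules covers, so it shows why a rule set that is not exhaustive needs a default class.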
Building Classification Rules
 Direct Method:
 Extract rules directly from data
 Examples: RIPPER, CN2, Holte’s 1R

 Indirect Method:
 Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
 Examples: C4.5rules
Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Step (2) and (3) until stopping criterion is
met
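The four steps can be sketched as follows. This is a simplified illustration, not the procedure of any particular system: Learn-One-Rule is implemented here as a greedy search that keeps adding the attribute test with the highest accuracy for the target class, rules are plain dicts of attribute → value, and the stopping criterion is assumed to be "no positive examples remain". The five records are invented.

```python
def covers(rule, record):
    """A rule is a dict of attribute -> required value (a conjunction of tests)."""
    return all(record[attr] == value for attr, value in rule.items())

def learn_one_rule(records, target, label_key="class"):
    """Grow one rule greedily, adding the test with the highest accuracy for `target`."""
    rule, covered = {}, records
    while any(r[label_key] != target for r in covered):
        best = None
        for r in covered:
            for attr, value in r.items():
                if attr == label_key or attr in rule:
                    continue
                subset = [c for c in covered if c[attr] == value]
                acc = sum(c[label_key] == target for c in subset) / len(subset)
                if best is None or (acc, len(subset)) > best[:2]:
                    best = (acc, len(subset), attr, value)
        if best is None:  # no attribute left to test
            break
        rule[best[2]] = best[3]
        covered = [c for c in covered if c[best[2]] == best[3]]
    return rule

def sequential_covering(records, target, label_key="class"):
    rules, remaining = [], list(records)
    # Assumed stopping criterion: every positive example is covered by some rule.
    while any(r[label_key] == target for r in remaining):
        rule = learn_one_rule(remaining, target, label_key)               # step 2
        rules.append(rule)
        remaining = [r for r in remaining if not covers(rule, r)]         # step 3
    return rules

# Hypothetical training records.
records = [
    {"refund": "No", "marital": "Single", "class": "Yes"},
    {"refund": "No", "marital": "Divorced", "class": "Yes"},
    {"refund": "No", "marital": "Married", "class": "No"},
    {"refund": "Yes", "marital": "Single", "class": "No"},
    {"refund": "Yes", "marital": "Married", "class": "No"},
]

rules = sequential_covering(records, target="Yes")
print(rules)
```

Each pass learns one rule and removes what it covers, so later rules only have to explain the positive examples that earlier rules missed.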
Direct Method: RIPPER
 For 2-class problem, choose one of the classes as positive
class, and the other as negative class
Learn rules for positive class
Negative class will be default class
 For multi-class problem
Order the classes according to increasing class
prevalence (fraction of instances that belong to a
particular class)
Learn the rule set for smallest class first, treat the rest
as negative class
Repeat with next smallest class as positive class
Direct Method: RIPPER
 Growing a rule:
 Start from empty rule
 Add conjuncts as long as they improve FOIL’s information gain
 Stop when rule no longer covers negative examples
 Prune the rule immediately using incremental reduced error
pruning
 Measure for pruning: v = (p-n)/(p+n)
 p: number of positive examples covered by the rule in
the validation set
 n: number of negative examples covered by the rule in
the validation set
 Pruning method: delete any final sequence of conditions that
maximizes v
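The pruning step can be illustrated with a small sketch: compute v on a validation set for the full rule and for every shorter prefix obtained by deleting a final sequence of conditions, and keep the version with the highest v. The representation of a rule as a list of (attribute, value) pairs, the validation records, and the treatment of a rule that covers nothing (v taken as 0.0) are all assumptions of this example.

```python
def prune_metric(p, n):
    """v = (p - n) / (p + n); taken as 0.0 when the rule covers nothing (an assumption)."""
    return (p - n) / (p + n) if p + n else 0.0

def count_covered(conjuncts, validation_set):
    """Count positive (p) and negative (n) validation examples covered by the conjunction."""
    p = n = 0
    for record, is_positive in validation_set:
        if all(record[attr] == value for attr, value in conjuncts):
            if is_positive:
                p += 1
            else:
                n += 1
    return p, n

def prune_rule(conjuncts, validation_set):
    """Delete the final sequence of conditions that maximizes v (keep the rule if none helps)."""
    best, best_v = conjuncts, prune_metric(*count_covered(conjuncts, validation_set))
    for end in range(len(conjuncts) - 1, 0, -1):
        prefix = conjuncts[:end]
        v = prune_metric(*count_covered(prefix, validation_set))
        if v > best_v:
            best, best_v = prefix, v
    return best

# Hypothetical grown rule and validation set: (record, is_positive).
rule = [("a", 1), ("b", 1), ("c", 1)]
validation_set = [
    ({"a": 1, "b": 1, "c": 1}, True),
    ({"a": 1, "b": 1, "c": 0}, True),
    ({"a": 1, "b": 1, "c": 0}, True),
    ({"a": 1, "b": 1, "c": 1}, False),
    ({"a": 1, "b": 0, "c": 1}, False),
    ({"a": 1, "b": 0, "c": 0}, False),
]

pruned = prune_rule(rule, validation_set)
print(pruned)
```

Here the full rule scores v = 0 (one positive, one negative), dropping the last condition raises v to 0.5, and dropping two conditions brings it back to 0, so only the final condition is deleted.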
Direct Method: RIPPER
 Building a Rule Set:
 Use sequential covering algorithm
 Finds the best rule that covers the current set of positive
examples
 Eliminate both positive and negative examples covered by
the rule
 Each time a rule is added to the rule set, compute the new
description length
 Stop adding new rules when the new description length is d
bits longer than the smallest description length obtained so
far
Direct Method: RIPPER
 Optimize the rule set:
 For each rule r in the rule set R
 Consider 2 alternative rules:
– Replacement rule (r*): grow new rule from scratch
– Revised rule(r′): add conjuncts to extend the rule r
 Compare the rule set for r against the rule set for r*
and r′
 Choose rule set that minimizes MDL principle
 Repeat rule generation and rule optimization for the remaining
positive examples
Indirect Method: C4.5rules
• Extract rules from an unpruned decision tree
• For each rule, r: A → y,
– consider an alternative rule r′: A′ → y where A′ is obtained by removing one of the conjuncts in A
– Compare the pessimistic error rate for r against all r′s
– Prune if one of the alternative rules has lower pessimistic error rate
– Repeat until we can no longer improve generalization error
Indirect Method: C4.5rules
 Instead of ordering the rules, order subsets of rules (class ordering)
 Each subset is a collection of rules with the same rule
consequent (class)
 Compute description length of each subset
 Description length = L(error) + g L(model)
 g is a parameter that takes into account the presence of
redundant attributes in a rule set
(default value = 0.5)
Advantages of Rule-Based Classifiers
 Has characteristics quite similar to decision trees
As highly expressive as decision trees
Easy to interpret
Performance comparable to decision trees
Can handle redundant attributes

 Better suited for handling imbalanced classes

 Harder to handle missing values in the test set


Bayesian Classification
Why Bayesian Classification?
Provides practical learning algorithms
 Probabilistic learning: Calculate explicit probabilities for
hypothesis. E.g. Naïve Bayes
Prior knowledge and observed data can be combined
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct.
It is a generative (model based) approach, which offers a
useful conceptual framework
 Probabilistic prediction: Predict multiple hypotheses, weighted
by their probabilities. E.g. sequences could also be classified,
based on a probabilistic model specification
 Any kind of objects can be classified, based on a probabilistic
model specification
Naive Bayesian Classifier
 Advantages
Easy to implement
Good results obtained in most of the cases
Robust to isolated noise points
Handle missing values by ignoring the instance during probability
estimate calculations
Robust to irrelevant attributes
 Disadvantages
Class conditional independence assumption may not hold for some
attributes, therefore loss of accuracy
Practically dependencies exist among variables
 E.g. hospitals: patients: profile: age, family history, etc.
symptoms: fever, cough etc. Disease: lung cancer, diabetes, etc.
 Dependencies among these cannot be modeled by Naïve
Bayesian classifier
How to deal with these dependencies? Bayesian Belief Networks
The Power of Brain vs. Machine
• The Brain
– Pattern Recognition
– Association
– Complexity
– Noise Tolerance

• The Machine
– Calculation
– Precision
– Logic

Features of the Brain
• Ten billion (10^10) neurons
• Neuron switching time >10^-3 secs
• Face Recognition ~0.1 secs
• On average, each neuron has several thousand
connections
• Hundreds of operations per second
• High degree of parallel computation
• Distributed representations
• Die off frequently (never replaced)
• Compensated for problems by massive parallelism
Neural Network classifier
• It is represented as a layered set of interconnected processors. These processor nodes are analogous to the neurons of the brain.
– Each node has a weighted connection to several other nodes in adjacent layers. Individual nodes take the input received from connected nodes and use the weights together to compute output values.
• The inputs are fed simultaneously into the input layer.
• The weighted outputs of these units are fed into the hidden layer.
• The weighted outputs of the last hidden layer are inputs to units making up the output layer.
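Artificial Neural Networks (ANN)
• Model is an assembly of inter-connected nodes and weighted links
• Output node sums up each of its input values according to the weights of its links
• Compare output node against some threshold t

[Figure: a black box with input nodes X1, X2, X3, link weights w1, w2, w3, threshold t, and output node Y]

Perceptron Model
$$Y = \operatorname{sign}\left(\sum_{i=1}^{d} w_i X_i - t\right) = \operatorname{sign}\left(\sum_{i=0}^{d} w_i X_i\right)$$
where the second form folds the threshold into the sum by setting $w_0 = -t$ and $X_0 = 1$.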
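The perceptron output can be computed in either of its two equivalent forms: subtracting the threshold t explicitly, or treating it as an extra weight w0 = -t on a constant input X0 = 1. The sketch below shows both. The weights (0.3 each) and threshold (0.4) are hypothetical values chosen so that the node fires when at least two of the three inputs are 1, and sign(0) is taken as +1 by assumption.

```python
def sign(z):
    """Sign function; sign(0) is taken as +1 here (an assumption)."""
    return 1 if z >= 0 else -1

def perceptron_output(weights, inputs, t):
    """Y = sign(sum_{i=1..d} w_i * X_i - t)."""
    return sign(sum(w * x for w, x in zip(weights, inputs)) - t)

def perceptron_output_folded(weights, inputs, t):
    """Same output with the threshold folded in as w_0 = -t on a constant input X_0 = 1."""
    all_weights = [-t] + list(weights)
    all_inputs = [1] + list(inputs)
    return sign(sum(w * x for w, x in zip(all_weights, all_inputs)))

# Hypothetical parameters.
weights, t = [0.3, 0.3, 0.3], 0.4

for inputs in [(1, 0, 1), (1, 0, 0), (0, 0, 0), (1, 1, 1)]:
    print(inputs, perceptron_output(weights, inputs, t))
```

Folding the threshold into the weight vector is convenient for training, because the threshold can then be updated by the same rule as every other weight.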
Architecture of Neural network
 Neural networks are used to look for patterns in data, learn these
patterns, & then classify new patterns & make forecasts
 A network with the input and output layer only is called single-
layered neural network. Whereas, a multilayer neural network is a
generalized one with one or more hidden layer.
 A network containing two hidden layers is called a three-layer
neural network, and so on.

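[Figure: a single-layered NN (inputs x1, x2, x3 with weights w1, w2, w3) and a multilayer NN (input nodes, hidden nodes, output nodes)]

Output of a single-layered NN:
$$o = \sigma\left(\sum_{i=1}^{n} w_i x_i\right), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$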
A Multilayer Neural Network
• Input Layer – corresponds to the non-class attributes, with normalized attribute values.
– There are as many nodes as non-class attributes, X = {x1, x2, …, xm}, where m is the number of attributes.
• Hidden Layer
– neither its input nor its output can be observed from outside.
– The number of nodes in the hidden layer & the number of hidden layers depend on the implementation.
• Output Layer – corresponds to the class attribute. There are as many nodes as classes (values of the class attribute).
– Ok, where k = 1, 2, …, n, where n is the number of classes
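A forward pass through such a network can be sketched in a few lines. The example assumes one hidden layer, sigmoid units throughout, and no bias terms; the weights are arbitrary hypothetical numbers, not trained values, and the names `layer_output` and `forward` are chosen for illustration.

```python
from math import exp

def sigmoid(y):
    """Sigmoid activation: 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + exp(-y))

def layer_output(inputs, weights):
    """Output of one layer; weights[k][i] is the weight from input i to unit k."""
    return [sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)))
            for unit_weights in weights]

def forward(x, hidden_weights, output_weights):
    """Feed the normalized attribute values through the hidden layer to the output layer."""
    hidden = layer_output(x, hidden_weights)
    return layer_output(hidden, output_weights)

# Hypothetical network: 3 input attributes, 2 hidden units, 2 output nodes (one per class).
hidden_weights = [[0.5, -0.4, 0.1], [-0.3, 0.8, 0.2]]
output_weights = [[1.0, -1.0], [-1.0, 1.0]]

outputs = forward([0.2, 0.7, 0.1], hidden_weights, output_weights)
predicted_class = outputs.index(max(outputs))  # node with the largest activation
print(outputs, predicted_class)
```

With one output node per class, the predicted class is commonly read off as the node with the largest activation, which is the convention assumed in the last step.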
Two Topologies of neural network
• NN can be designed in a feed forward or recurrent manner
• In a feed forward neural network, connections between the units do not form a directed cycle.
– In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) & to the output nodes. There are no cycles, loops, or feedback connections in the network, that is, no connections extending from outputs of units to inputs of units in the same layer or previous layers.
• In recurrent networks, data circulates back & forth until the activation of the units is stabilized
– Recurrent networks have a feedback loop where data can be fed back into the input at some point before it is fed forward again for further processing and final output.
Training the neural network
The purpose is to learn to generalize using a set of sample
patterns where the desired output is known.
Back Propagation is the most commonly used method for
training multilayer feed forward NN.
Back propagation learns by iteratively processing a set of training
data (samples).
For each sample, weights are modified to minimize the error
between the desired output and the actual output.
After propagating an input through the network, the error
is calculated and the error is propagated back through the
network while the weights are adjusted in order to make
the error smaller.
Training Algorithm
The learning algorithm is as follows
Initialize the weights and threshold to small random numbers.
Present a vector x to the neuron inputs and calculate the output using the adder function:
$$y = \sum_{j=1}^{m} w_j x_j$$
Apply the activation function (in this case the step function) such that
$$y = \begin{cases} 0 & \text{if } y < 0 \\ 1 & \text{if } y \ge 0 \end{cases}$$
Update the weights according to the error:
$$W_j = W_j + \eta \, (y_T - y) \, x_j$$
where $y_T$ is the desired (target) output and $\eta$ is the learning rate.
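The update rule can be tried on a tiny problem. The sketch below trains a single neuron to compute logical AND. Several choices here are assumptions made for the example: the threshold is handled as a weight on a constant input x0 = 1, the weights start at zero instead of small random numbers so that the run is reproducible, the learning rate is 0.5, the step function outputs 1 when the sum is 0 or more, and training simply runs for a fixed number of passes.

```python
def step(y):
    """Step activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if y >= 0 else 0

def predict(w, x):
    """Adder function followed by the step activation."""
    return step(sum(wj * xj for wj, xj in zip(w, x)))

def train(samples, eta=0.5, epochs=20):
    """Apply W_j = W_j + eta * (target - output) * x_j for every sample, for several passes."""
    w = [0.0] * len(samples[0][0])  # zeros instead of small random numbers (for reproducibility)
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(w, x)
            w = [wj + eta * error * xj for wj, xj in zip(w, x)]
    return w

# Logical AND; the first component of each input is the constant x0 = 1 for the threshold.
samples = [
    ((1, 0, 0), 0),
    ((1, 0, 1), 0),
    ((1, 1, 0), 0),
    ((1, 1, 1), 1),
]

w = train(samples)
print("weights:", w)
print("outputs:", [predict(w, x) for x, _ in samples])
```

When a sample is classified correctly the error is zero and the weights are left alone; only misclassified samples move the weights, which is why training stops changing anything once all four patterns come out right.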
Pros and Cons of Neural Network
• Useful for learning complex data like handwriting, speech and image recognition
Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle large number of features
Cons
- Slow training time
- Hard to interpret & understand the learned function (weights)
- Hard to implement: trial & error for choosing number of nodes

Neural Network needs long time for training.
Neural Network has a high tolerance to noisy and incomplete data
Conclusion: Use neural nets only if decision-trees fail.
SVM—Support Vector Machines
 A relatively new classification method for both linear and
nonlinear data
 It uses a nonlinear mapping to transform the original training
data into a higher dimension
 With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
 With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane
 SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors)

SVM—History and Applications
• Vapnik and colleagues (1992): groundwork from Vapnik & Chervonenkis’ statistical learning theory in 1960s
 Features: training can be slow but accuracy is high owing to
their ability to model complex nonlinear decision boundaries
(margin maximization)
 Used for: classification and numeric prediction
 Applications:
 handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests

SVM—Linearly Separable
 A separating hyperplane can be written as
W●X+b=0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
 For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
 The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
 This becomes a constrained (convex) quadratic optimization
problem: Quadratic objective function and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
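Solving the quadratic program is beyond a short example, but the constraints themselves are easy to check for a given hyperplane. The sketch below takes a candidate W and b, verifies that every training tuple satisfies y(W·X + b) ≥ 1, lists the tuples that lie exactly on H1 or H2 (the support vectors), and computes the margin width 2/||W||, a standard result that the slide does not state. The four points and the candidate hyperplane are hypothetical; the code does not search for the optimal hyperplane.

```python
from math import sqrt

def decision_value(w, b, x):
    """W . X + b for one tuple."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def satisfies_margin(w, b, data):
    """True if every tuple lies on or outside its side of the margin: y * (W . X + b) >= 1."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in data)

def support_vectors(w, b, data, tol=1e-9):
    """Tuples that fall exactly on H1 or H2, i.e. y * (W . X + b) = 1."""
    return [x for x, y in data if abs(y * decision_value(w, b, x) - 1) <= tol]

def margin_width(w):
    """Distance between H1 and H2: 2 / ||W|| (standard result, not stated on the slide)."""
    return 2 / sqrt(sum(wi * wi for wi in w))

# Hypothetical 2-D training tuples (x, y) with y in {+1, -1} and a candidate hyperplane.
data = [((2, 2), +1), ((3, 3), +1), ((0, 0), -1), ((-1, 0), -1)]
w, b = (0.5, 0.5), -1.0

print(satisfies_margin(w, b, data))
print(support_vectors(w, b, data))
print(round(margin_width(w), 3))
```

In this toy setting only (2, 2) and (0, 0) touch the margin; the other two tuples could be removed without changing which hyperplane satisfies the constraints with this margin.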
Why Is SVM Effective on High Dimensional Data?

• The complexity of trained classifier is characterized by the # of support vectors rather than the dimensionality of the data
 The support vectors are the essential or critical training examples —
they lie closest to the decision boundary (MMH)
 If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
 The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier, which
is independent of the data dimensionality
 Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high

CB-SVM Algorithm: Outline
 Construct two CF-trees from positive and negative data sets
independently
 Need one scan of the data set
 Train an SVM from the centroids of the root entries
 De-cluster the entries near the boundary into the next level
 The children entries de-clustered from the parent entries are
accumulated into the training set with the non-declustered
parent entries
 Train an SVM again from the centroids of the entries in the
training set
 Repeat until nothing is accumulated
