
MACHINE LEARNING

Unit-1

Dr. KOYEL DATTA GUPTA


DEPT. OF CSE
MSIT
Outline

Machine Learning

Applications

Learning System Model

Learning Techniques

Classification Families
What is Machine Learning?

Machine Learning
◦ Study of algorithms that
◦ improve their performance
◦ at some task
◦ with experience
Optimize a performance criterion using example data or
past experience.
Role of Statistics: Inference from a sample
Role of Computer science: Efficient algorithms to
◦ Solve the optimization problem
◦ Representing and evaluating the model for inference
Applications
 Retail: Market basket analysis, Customer relationship
management (CRM)
 Finance: Credit scoring, fraud detection
 Manufacturing: Optimization, troubleshooting
 Medicine: Medical diagnosis
 Telecommunications: Quality of service optimization
 Bioinformatics: Motifs, alignment
 Web mining: Search engines
Learning system model

[Diagram: input samples are fed to a learning method during training; the resulting system is then applied to test inputs.]
Designing a Learning System:

1. Problem Description
2. Choosing the Training Experience
3. Choosing the Target Function
4. Choosing a Representation for the Target
Function(Concept Representation)
5. Choosing a Function Approximation
Algorithm
6. Final Design

1. Problem Description: A
Checker Learning Problem
Task T: Playing Checkers
Performance Measure P:
Percent of games won against
opponents
Training Experience E: To be
selected ==> Games Played
against itself
2. Choosing the Training Experience
 Direct versus Indirect Experience [Indirect
Experience gives rise to the credit assignment problem and
is thus more difficult]
 Teacher versus Learner Controlled Experience
[the teacher might provide training examples; the learner
might suggest interesting examples and ask the teacher for
their outcome; or the learner can be completely on its own
with no access to correct outcomes]
 How Representative is the Experience? [Is the
training experience representative of the task the system
will actually have to solve? It is best if it is, but such a
situation cannot systematically be achieved]

3. Choosing the Target Function
 Given a set of legal moves, we want to learn how to
choose the best move [since the best move is not necessarily
known, this is an optimization problem]
 ChooseMove: B --> M is called a Target Function
[ChooseMove, however, is difficult to learn. An easier and related
target function to learn is V: B --> R, which assigns a numerical
score to each board. The better the board, the higher the score.]
 Operational versus Non-Operational Description of a
Target Function [An operational description must be given]
 Function Approximation [The actual function can often not
be learned and must be approximated]

4. Choosing a Representation for the Target
Function (Concept Representation)

Infer the general definition of some concept, given examples labelled as members or non-members of the concept.
4. Choosing a Representation for the
Target Function (Concept Representation)
 Expressiveness versus Training set size [The
more expressive the representation of the target function,
the closer to the “truth” we can get. However, the more
expressive the representation, the more training examples
are necessary to choose among the large number of
“representable” possibilities.]
Example of a representation:
◦ x1/x2 = # of black/red pieces on the board
◦ x3/x4 = # of black/red kings on the board
◦ x5/x6 = # of black/red pieces threatened by red/black
V(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6
The w's are adjustable or "learnable" coefficients
5. Choosing a Function Approximation
Algorithm
 Generating Training Examples of the form
<b, Vtrain(b)> [e.g. <(x1=3, x2=0, x3=1, x4=0, x5=0, x6=0), +100> (= black won)]
◦ Useful and easy approach: Vtrain(b) ← V̂(Successor(b))
 Training the System
◦ Defining a criterion for success [What is the error that
needs to be minimized?]
◦ Choose an algorithm capable of finding weights of a
linear function that minimize that error [e.g. the Least
Mean Square (LMS) training rule].

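To make the LMS idea concrete, here is a minimal Python sketch of the weight-update rule applied to the linear evaluation function V(b) = w0 + w1·x1 + … + w6·x6 defined earlier. The board features and the training value Vtrain(b) below are invented placeholders; a real system would extract them from self-play traces as described above.

```python
# Minimal sketch of the LMS training rule for the linear checkers
# evaluation function V(b) = w0 + w1*x1 + ... + w6*x6 (hypothetical data).

def v_hat(weights, features):
    """Estimate V(b) for a board described by features (x1..x6)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, lr=0.01):
    """One LMS step: nudge each weight toward reducing (Vtrain - Vhat)."""
    error = v_train - v_hat(weights, features)
    new_weights = [weights[0] + lr * error]          # bias term (x0 = 1)
    new_weights += [w + lr * error * x for w, x in zip(weights[1:], features)]
    return new_weights

# Example: one training pair <b, Vtrain(b)> from a hypothetical game trace.
weights = [0.0] * 7                       # w0..w6, start at zero
board_features = [3, 0, 1, 0, 0, 0]       # x1..x6 as in the slide example
weights = lms_update(weights, board_features, v_train=100.0)
print(weights)
```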
6. Final Design for Checkers Learning

The Performance Module: Takes as input a new board and outputs a trace of the game it played against itself.
 The Critic: Takes as input the trace of a game and
outputs a set of training examples of the target function
 The Generalizer: Takes as input training examples
and outputs a hypothesis which estimates the target
function. Good generalization to new cases is crucial.
 The Experiment Generator: Takes as input the
current hypothesis (currently learned function) and
outputs a new problem (an initial board state) for the
performance system to explore
Issues in Machine Learning (i.e.,
Generalization)
What algorithms are available for learning a concept?
How well do they perform?
How much training data is sufficient to learn a concept
with high confidence?
When is it useful to use prior knowledge?
Are some training examples more useful than others?
What are the best tasks for a system to learn?
What is the best way for a system to represent its
knowledge?

Learning Techniques

Association Analysis
Supervised Learning
◦ Classification
◦ Regression/Prediction
Unsupervised Learning
Reinforcement Learning
Learning Associations
Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products/services.
Example: P(chips | beer) = 0.7

Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
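As a small illustration, the conditional probability P(Y | X) (the confidence of the rule X ⇒ Y) can be estimated directly from the five transactions in the table above. This is only a sketch; the helper name `confidence` and the hard-coded transaction list are for illustration.

```python
# Estimate P(Y | X) -- the confidence of rule "X => Y" -- from the
# market-basket transactions in the table above.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def confidence(x, y):
    """P(y | x): fraction of transactions containing x that also contain y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(y in t for t in with_x) / len(with_x)

print(confidence("Beer", "Diaper"))   # 3/3 = 1.0
print(confidence("Bread", "Milk"))    # 3/4 = 0.75
```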
Classification
Example: Credit
scoring
Differentiating
between low-risk
and high-risk
customers from their
income and savings

Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
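A minimal sketch of the two-threshold discriminant above; the threshold values θ1 and θ2 used here are hypothetical and would normally be fitted to labelled (income, savings) examples.

```python
# The two-threshold credit-scoring discriminant from the slide, as a function.
# THETA1 and THETA2 are hypothetical values; in practice they are fitted
# to labelled (income, savings, risk) examples.
THETA1 = 30_000   # income threshold (assumed units)
THETA2 = 10_000   # savings threshold

def credit_risk(income, savings):
    """IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk."""
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

print(credit_risk(45_000, 20_000))   # low-risk
print(credit_risk(25_000, 50_000))   # high-risk
```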
Classification: Applications
Also known as pattern recognition
Face recognition: Pose, lighting, occlusion
(glasses, beard), make-up, hair style
Character recognition: Different handwriting
styles.
Speech recognition: Temporal dependency.
◦ Use of a dictionary or the syntax of the language.
◦ Sensor fusion: Combine multiple modalities; eg, visual
(lip image) and acoustic for speech
Medical diagnosis: From symptoms to illnesses
Web advertising: Predict if a user clicks on an ad
on the Internet.
Prediction: Regression

Example: Price of a used car
y = w·x + w0
x: car attributes
y: price
More generally y = g(x | θ), where g(·) is the model and θ its parameters

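The model y = w·x + w0 can be fitted by ordinary least squares. Below is a small sketch using the closed-form solution for a single attribute; the (mileage, price) numbers are invented purely for illustration.

```python
# Fit y = w*x + w0 by ordinary least squares on made-up (mileage, price) data.
# Closed-form solution for one attribute: w = cov(x, y) / var(x).

xs = [20, 40, 60, 80, 100]          # car attribute, e.g. mileage in 1000 km
ys = [9.0, 7.5, 6.2, 4.8, 3.5]      # price in 1000s (illustrative values)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
w0 = mean_y - w * mean_x

print(f"y = {w:.3f}*x + {w0:.3f}")
print("predicted price at x=50:", w * 50 + w0)
```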
Regression Applications

Navigating a car: Angle of the steering wheel (CMU NavLab)

Kinematics of a robot arm: given the position (x, y) of the end effector, predict the joint angles
α1 = g1(x, y)
α2 = g2(x, y)
Supervised Learning: Uses
Example: decision-tree tools that create rules
Prediction of future cases: Use the rule to
predict the output for future inputs
Knowledge extraction: The rule is easy to
understand
Compression: The rule is simpler than the
data it explains
Outlier detection: Exceptions that are not
covered by the rule, e.g., fraud

Unsupervised Learning
Learning “what normally happens”
No output
Clustering: Grouping similar instances
Other applications: Summarization,
Association Analysis
Example applications
◦ Customer segmentation in CRM
◦ Image compression: Color quantization
◦ Bioinformatics: Learning motifs

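As one concrete example of clustering, here is a compact k-means sketch that could be used for, say, customer segmentation or colour quantization. The 2-D points, the choice of k, and the iteration count are arbitrary illustrative values.

```python
# A compact k-means sketch (one common clustering algorithm) on 2-D points,
# e.g. customers described by (annual spend, visits). Data are made up.
import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                            (p[1] - centers[c][1]) ** 2)
            clusters[j].append(p)
        # move each center to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

data = [(1, 2), (1.5, 1.8), (5, 8), (8, 8), (1, 0.6), (9, 11)]
print(kmeans(data, k=2))
```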
Reinforcement Learning
Topics:
◦ Policies: what actions should an agent take in a particular
situation
◦ Utility estimation: how good is a state (used by policy)
No supervised output but delayed reward
Credit assignment problem (what was responsible
for the outcome)
Applications:
◦ Game playing
◦ Robot in a maze
◦ Multiple agents, partial observability, ...

Classification Families

Decision trees
Nearest Neighbour
Linear discriminative
Non-linear discriminative
Probabilistic (conditional and generative)
Decision Tree C4.5 (J48 classifier)

When building the tree from a training set T, three cases arise:
• T contains one or more samples, all belonging to a single class Cj.
• T contains no samples.
• T contains samples that belong to a mixture of classes.
C4.5
Calculate the information gain for the tree and for each attribute.
For each attribute a, find the normalized information gain ratio from splitting on a.
Let a_best be the attribute with the highest normalized information gain.
Create a decision node that splits on a_best.
Recur on the sublists obtained by splitting on a_best, and add those nodes as children of the node.
Example of C4.5 algorithm

Info(T) = −9/14·log2(9/14) − 5/14·log2(5/14) = 0.940 bits

Info_x1(T) = 5/14·(−2/5·log2(2/5) − 3/5·log2(3/5))
           + 4/14·(−4/4·log2(4/4) − 0/4·log2(0/4))
           + 5/14·(−3/5·log2(3/5) − 2/5·log2(2/5))
           = 0.694 bits

Gain(x1) = 0.940 − 0.694 = 0.246 bits
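The numbers above can be reproduced with a few lines of Python. The class counts (9 and 5) and the per-branch counts for attribute x1 are taken from the slide; the exact gain comes out as about 0.247 before rounding the intermediate terms.

```python
# Reproduce the information-gain computation above (counts from the slide).
from math import log2

def info(counts):
    """Entropy in bits of a class-count distribution, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Info(T): 9 samples of one class, 5 of the other.
info_T = info([9, 5])                                  # ~0.940 bits

# Info_x1(T): weighted entropy after splitting on attribute x1
# into branches with class counts [2,3], [4,0], [3,2].
branches = [[2, 3], [4, 0], [3, 2]]
n = sum(sum(b) for b in branches)
info_x1 = sum(sum(b) / n * info(b) for b in branches)  # ~0.694 bits

print(round(info_T, 3), round(info_x1, 3), round(info_T - info_x1, 3))
# 0.94 0.694 0.247  (the slide's 0.246 comes from subtracting the rounded terms)
```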
K-Nearest Neighbor
Features
◦ All instances correspond to points in an n-
dimensional Euclidean space
◦ Classification is delayed till a new instance
arrives
◦ Classification done by comparing feature
vectors of the different points
◦ Target function may be discrete or real-valued
1-Nearest Neighbor
3-Nearest Neighbor
K-Nearest Neighbor

An arbitrary instance x is represented by the feature vector (a1(x), a2(x), ..., an(x))
◦ ai(x) denotes the i-th feature of x

Euclidean distance between two instances:
d(xi, xj) = sqrt( Σr=1..n (ar(xi) − ar(xj))² )

Continuous-valued target function:
◦ predict the mean value of the k nearest training examples
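A minimal k-nearest-neighbour sketch using the Euclidean distance above, with a majority vote for a discrete-valued target. The training points and labels are invented; a distance-weighted vote (as discussed a few slides later) could replace the plain majority.

```python
# Minimal k-nearest-neighbour classifier with the Euclidean distance above.
# Feature vectors and labels are made-up placeholders.
from math import sqrt
from collections import Counter

def euclidean(a, b):
    return sqrt(sum((ar - br) ** 2 for ar, br in zip(a, b)))

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label). Majority vote of k nearest."""
    neighbours = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((4.5, 4.0), "B"), ((0.9, 1.3), "A")]
print(knn_predict(train, (1.1, 1.0), k=3))   # "A"
print(knn_predict(train, (4.2, 4.1), k=3))   # "B"
```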
Voronoi Diagram
Decision surface formed by the training
examples
Distance-Weighted Nearest Neighbor
Algorithm
Assign weights to the neighbors based on
their ‘distance’ from the query point
◦ Weight ‘may’ be inverse square of the
distances
All training points may influence a
particular instance
◦ Shepard’s method
Linear Classification

What is meant by linear classification?
◦ The decision boundaries in the feature (input) space are linear
Should the regions be contiguous?

R1
R2
X2
R3
R4

X1
Piecewise linear decision boundaries in 2D input space
35
Linear Classification…
There is a discriminant function δk(x) for each class k
Classification rule: Rk = {x : k = arg maxj δj(x)}
In a higher-dimensional space the decision boundaries are piecewise hyperplanar
Remember that the 0-1 loss function led to the classification rule: Rk = {x : k = arg maxj Pr(G = j | X = x)}
So Pr(G = k | X = x) can serve as δk(x)
Linear Classification…
All we require here is that the class boundaries {x : δk(x) = δj(x)} be linear for every (k, j) pair
One can achieve this if the δk(x) themselves are linear, or if any monotone transform of δk(x) is linear
◦ An example:

Pr(G = 1 | X = x) = exp(β0 + βᵀx) / (1 + exp(β0 + βᵀx))
Pr(G = 2 | X = x) = 1 / (1 + exp(β0 + βᵀx))

so that log[ Pr(G = 1 | X = x) / Pr(G = 2 | X = x) ] = β0 + βᵀx

Linear
Linear Classification as a Linear
Regression
2D input space: X = (X1, X2)
Number of classes/categories K = 3, so the output is Y = (Y1, Y2, Y3)
Training sample of size N = 5:

X = [ 1 x11 x12 ]      Y = [ y11 y12 y13 ]
    [ 1 x21 x22 ]          [ y21 y22 y23 ]
    [ 1 x31 x32 ]          [ y31 y32 y33 ]
    [ 1 x41 x42 ]          [ y41 y42 y43 ]
    [ 1 x51 x52 ]          [ y51 y52 y53 ]

Y is the indicator matrix: each row has exactly one 1, indicating the category/class.

Regression output: Ŷ(x1, x2) = (1 x1 x2)(XᵀX)⁻¹XᵀY = (xᵀβ̂1  xᵀβ̂2  xᵀβ̂3), i.e.
Ŷ1(x1, x2) = (1 x1 x2) β̂1
Ŷ2(x1, x2) = (1 x1 x2) β̂2
Ŷ3(x1, x2) = (1 x1 x2) β̂3

Classification rule: Ĝ(x1, x2) = arg maxk Ŷk(x1, x2)
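A NumPy sketch of the procedure above: build the indicator matrix Y, solve the least-squares problem for the coefficients, and classify a new point by the arg max of the fitted outputs. The toy data and class layout are invented.

```python
# Linear regression on an indicator matrix, as sketched above.
import numpy as np

# Toy training data: N=6 points in 2-D, K=3 classes (labels 0, 1, 2).
X_raw = np.array([[0.1, 0.2], [0.2, 0.1], [2.0, 2.1],
                  [2.1, 1.9], [4.0, 0.1], [4.2, 0.2]])
labels = np.array([0, 0, 1, 1, 2, 2])

N, K = len(labels), 3
X = np.hstack([np.ones((N, 1)), X_raw])        # prepend the constant column
Y = np.zeros((N, K))
Y[np.arange(N), labels] = 1.0                  # indicator matrix: one 1 per row

# Least-squares coefficients: B = (X^T X)^-1 X^T Y  (one column per class)
B, *_ = np.linalg.lstsq(X, Y, rcond=None)

def classify(x1, x2):
    scores = np.array([1.0, x1, x2]) @ B       # fitted Y_k for each class k
    return int(np.argmax(scores))

print(classify(0.0, 0.0), classify(2.0, 2.0), classify(4.1, 0.0))  # expected: 0 1 2
```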
LINEAR CLASSIFICATION-
PERCEPTRON
The Masking
Linear regression of the indicator matrix can lead to masking.

[Figure: three classes in a 2D input space with the fitted outputs Ŷ1 = (1 x1 x2)β̂1, Ŷ2 = (1 x1 x2)β̂2 and Ŷ3 = (1 x1 x2)β̂3 viewed along the fitted planes; the middle class is masked, i.e. its fitted output is never the maximum, so that class is never predicted.]

LDA can avoid this masking.


Linear Discriminant Analysis
Essentially the minimum-error Bayes classifier
Assumes that the conditional class densities are (multivariate) Gaussian
Assumes equal covariance for every class

Posterior probability (an application of Bayes' rule):
Pr(G = k | X = x) = πk fk(x) / Σl=1..K πl fl(x)

πk is the prior probability for class k
fk(x) is the class-conditional density (likelihood):
fk(x) = 1 / ((2π)^(p/2) |Σ|^(1/2)) · exp( −½ (x − μk)ᵀ Σ⁻¹ (x − μk) )
LDA…

log[ Pr(G = k | X = x) / Pr(G = l | X = x) ]
  = log(πk/πl) + log(fk(x)/fl(x))
  = (log πk + xᵀΣ⁻¹μk − ½ μkᵀΣ⁻¹μk) − (log πl + xᵀΣ⁻¹μl − ½ μlᵀΣ⁻¹μl)
  = δk(x) − δl(x)

Classification rule: Ĝ(x) = arg maxk δk(x)

which is equivalent to: Ĝ(x) = arg maxk Pr(G = k | X = x)

The good old Bayes classifier!
LDA…
When are we going to use the training data?
Training data: (gi, xi), i = 1..N — in total, N input-output pairs
Nk = number of pairs in class k; total number of classes: K

The training data are used to estimate:
Prior probabilities: π̂k = Nk / N
Means: μ̂k = Σ{gi = k} xi / Nk
Covariance matrix: Σ̂ = Σk=1..K Σ{gi = k} (xi − μ̂k)(xi − μ̂k)ᵀ / (N − K)
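A NumPy sketch of the plug-in estimates above (priors, class means, pooled covariance) and the resulting linear discriminant functions δk(x). The two-class toy data are invented.

```python
# LDA with the plug-in estimates above (priors, means, pooled covariance).
import numpy as np

# Toy training data: two classes in 2-D (invented numbers).
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [3.0, 0.5], [3.2, 0.7], [2.8, 0.4]])
g = np.array([0, 0, 0, 1, 1, 1])
N, K = len(g), 2

priors = np.array([(g == k).sum() / N for k in range(K)])
means  = np.array([X[g == k].mean(axis=0) for k in range(K)])

# Pooled covariance: sum of within-class scatters divided by (N - K).
Sigma = sum((X[g == k] - means[k]).T @ (X[g == k] - means[k])
            for k in range(K)) / (N - K)
Sigma_inv = np.linalg.inv(Sigma)

def delta(x, k):
    """Linear discriminant: log pi_k + x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k."""
    return (np.log(priors[k]) + x @ Sigma_inv @ means[k]
            - 0.5 * means[k] @ Sigma_inv @ means[k])

x_new = np.array([1.1, 1.9])
print(int(np.argmax([delta(x_new, k) for k in range(K)])))   # expected: 0
```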
Fisher’s Linear Discriminant [DHS]

From the training set we want to find a direction along which the separation between the class means is high and the overlap between the classes is small
Fisher’s LD…

Projection of a vector x on a unit vector w: wᵀx

[Figure: geometric interpretation — the projection wᵀx of the vector x onto the direction w]

From the training set we want to find a direction w along which the separation between the projections of the class means is high and the overlap of the projected classes is small
Fisher’s LD…
Class means:
m1 = (1/N1) Σ{xi ∈ R1} xi ,   m2 = (1/N2) Σ{xi ∈ R2} xi

Projected class means:
m̃1 = (1/N1) Σ{xi ∈ R1} wᵀxi = wᵀm1 ,   m̃2 = (1/N2) Σ{xi ∈ R2} wᵀxi = wᵀm2

Difference between the projected class means: m̃2 − m̃1 = wᵀ(m2 − m1)

Scatter of the projected data (this indicates the overlap between the classes):
s̃1² = Σ{yi: xi ∈ R1} (yi − m̃1)² = Σ{xi ∈ R1} (wᵀxi − wᵀm1)² = wᵀ[ Σ{xi ∈ R1} (xi − m1)(xi − m1)ᵀ ] w = wᵀS1w
s̃2² = Σ{yi: xi ∈ R2} (yi − m̃2)² = Σ{xi ∈ R2} (wᵀxi − wᵀm2)² = wᵀ[ Σ{xi ∈ R2} (xi − m2)(xi − m2)ᵀ ] w = wᵀS2w
Fisher’s LD…

Ratio of the difference of the projected means over the total scatter:

r(w) = (m̃2 − m̃1)² / (s̃1² + s̃2²) = wᵀSB w / wᵀSW w      (a Rayleigh quotient)

where SW = S1 + S2 and SB = (m2 − m1)(m2 − m1)ᵀ

We want to maximize r(w). The solution is w = SW⁻¹(m2 − m1)
Fisher’s LD: Classifier
So far so good. However, how do we get the classifier?

All we know at this point is that the direction w = SW⁻¹(m2 − m1) separates the projected data very well.

Since the projected class means are well separated, we can choose the average of the two projected means as a threshold for classification.

Classification rule: x is in R2 if y(x) > 0, else x is in R1, where

y(x) = wᵀx − ½(m̃1 + m̃2) = wᵀx − ½ wᵀ(m1 + m2) = (SW⁻¹(m2 − m1))ᵀ( x − ½(m1 + m2) )
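A NumPy sketch computing the Fisher direction w = SW⁻¹(m2 − m1) and applying the mid-point threshold rule y(x) above; the two point clouds are invented.

```python
# Fisher's linear discriminant: direction w = Sw^-1 (m2 - m1) and the
# mid-point threshold rule y(x) from the slide. Data are invented.
import numpy as np

R1 = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.3], [1.2, 2.1]])   # class 1
R2 = np.array([[4.0, 0.5], [4.5, 0.8], [3.8, 0.3], [4.2, 0.6]])   # class 2

m1, m2 = R1.mean(axis=0), R2.mean(axis=0)
S1 = (R1 - m1).T @ (R1 - m1)          # within-class scatter of class 1
S2 = (R2 - m2).T @ (R2 - m2)
Sw = S1 + S2

w = np.linalg.solve(Sw, m2 - m1)      # w = Sw^-1 (m2 - m1)

def y(x):
    # y(x) = w^T (x - (m1 + m2) / 2); positive => class 2, else class 1
    return w @ (x - 0.5 * (m1 + m2))

print("class 2" if y(np.array([4.1, 0.4])) > 0 else "class 1")   # class 2
print("class 2" if y(np.array([1.1, 2.0])) > 0 else "class 1")   # class 1
```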
Non-linear Discriminant
Analysis
The idea of nonlinear discriminant
analysis is to transform the input data
nonlinearly, before standard linear
discriminant analysis is applied.
NLDA
Multilayer Perceptrons (MLPs) are probably the most common type of neural network, and are used for classification and regression.
They usually consist of at least three layers of neurons.
Nonlinearity is introduced by nonlinear neuron activation functions, such as sigmoids.
Typical MLPs are trained by minimizing the mean square error (MSE) using first- or second-order methods and the backpropagation algorithm.
The adaptive behaviour of MLPs is considered one of their greatest advantages over other machine learning techniques.
Multi-Layer Networks

[Figure: a multi-layer network with an input layer, a hidden layer, and an output layer]
Training-Rule for Weights to the Output
Layer
For pattern p, the error at the output layer is
Ep[wji] = ½ Σj (tjp − yjp)²

∂Ep/∂wji = ∂/∂wji ½ Σj (tjp − yjp)²
         = …
         = −yjp(1 − yjp)(tjp − yjp) xip

so the weight update is
Δwji = α yjp(1 − yjp)(tjp − yjp) xip = α δjp xip

with δjp := yjp(1 − yjp)(tjp − yjp)

(yj: output of unit j; xi: activation of input unit i; tj: target; α: learning rate)
Training-Rule for Weights to the Hidden
Layer
Credit assignment problem: there are no target values t for the hidden-layer units.

Error for a hidden unit k (output xk, connected to the output units j by weights wjk):
δk = xk(1 − xk) Σj wjk δj

Weight update for the weights into the hidden layer: Δwki = α δkp xip
Training-Rule for Weights to the Hidden
Layer
Ep[wki] = ½ Σj (tjp − yjp)²

∂Ep/∂wki = ∂/∂wki ½ Σj (tjp − yjp)²
         = ∂/∂wki ½ Σj (tjp − σ(Σk wjk xkp))²
         = ∂/∂wki ½ Σj (tjp − σ(Σk wjk σ(Σi wki xip)))²
         = −Σj (tjp − yjp) σ′j wjk σ′k xip
         = −Σj δj wjk σ′k xip
         = −Σj δj wjk xk(1 − xk) xip

Δwki = α δkp xip

with δkp = xk(1 − xk) Σj δj wjk
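A self-contained sketch of the two delta rules derived above for a one-hidden-layer sigmoid network, trained here on XOR. The layer sizes, learning rate, epoch count, and random seed are arbitrary choices, and convergence depends on the initialization.

```python
# One-hidden-layer sigmoid MLP trained with the delta rules derived above:
#   output:  delta_j = y_j (1 - y_j) (t_j - y_j)
#   hidden:  delta_k = x_k (1 - x_k) * sum_j w_jk delta_j
# Trained here on XOR; sizes, learning rate and epoch count are arbitrary.
import math, random

random.seed(0)
n_in, n_hid, n_out, lr = 2, 3, 1, 0.5
W1 = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]
W2 = [[random.uniform(-1, 1) for _ in range(n_hid + 1)] for _ in range(n_out)]
sig = lambda a: 1.0 / (1.0 + math.exp(-a))

data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]

for epoch in range(5000):
    for inp, target in data:
        x = inp + [1.0]                                        # input + bias
        h = [sig(sum(w * v for w, v in zip(row, x))) for row in W1]
        hb = h + [1.0]                                         # hidden + bias
        y = [sig(sum(w * v for w, v in zip(row, hb))) for row in W2]

        d_out = [yj * (1 - yj) * (tj - yj) for yj, tj in zip(y, target)]
        d_hid = [hk * (1 - hk) * sum(W2[j][k] * d_out[j] for j in range(n_out))
                 for k, hk in enumerate(h)]

        for j in range(n_out):                                 # output weights
            for k in range(n_hid + 1):
                W2[j][k] += lr * d_out[j] * hb[k]
        for k in range(n_hid):                                 # hidden weights
            for i in range(n_in + 1):
                W1[k][i] += lr * d_hid[k] * x[i]

for inp, _ in data:
    x = inp + [1.0]
    h = [sig(sum(w * v for w, v in zip(row, x))) for row in W1] + [1.0]
    y = sig(sum(w * v for w, v in zip(W2[0], h)))
    print(inp, round(y, 2))      # outputs typically approach 0, 1, 1, 0
```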


MLP
 1. Propagate samples to the last hidden layer.
 2. Set weights of output layer by performing linear
Fisher Discriminant Analysis.
 3. Terminate, if some stopping criterion is
satisfied.
 4. Compute derivatives of J(w) with respect to the
last hidden layer weights and use
Backpropagation to compute derivatives for all
remaining weights.
 5. Adapt weights.
 6. Goto Step 2.
Background
There are three methods to establish a classifier:
a) Model a classification rule directly
   Examples: k-NN, decision trees, perceptron, SVM
b) Model the probability of class memberships given input data
   Example: perceptron with the cross-entropy cost
c) Make a probabilistic model of the data within each class
   Examples: naive Bayes, model-based classifiers

a) and b) are examples of discriminative classification
c) is an example of generative classification
b) and c) are both examples of probabilistic classification
