
MACHINE LEARNING

UNIT - 1

Machine learning

 Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
 Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
Machine learning

 The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide.
 The primary aim is to allow computers to learn automatically, without human intervention or assistance, and adjust actions accordingly.
Well-posed learning problems
Definition:
A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E.
For example, a computer program that learns to play
checkers might improve its performance as measured by
its ability to win at the class of tasks involving playing
checkers games, through experience obtained by
playing games against itself.
Related Fields
Machine learning draws on many related fields: data mining, control theory, statistics, decision theory, information theory, cognitive science, databases, psychological models, evolutionary models, and neuroscience.
Machine learning is primarily concerned with the accuracy and effectiveness of the computer system.
Some more examples of tasks that are
best solved by using a learning algorithm
 Recognizing patterns:
 Facial identities or facial expressions
 Handwritten or spoken words
 Medical images
 Generating patterns:
 Generating images or motion sequences
 Recognizing anomalies:
 Unusual sequences of credit card transactions
 Unusual patterns of sensor readings in a nuclear power plant, or an unusual sound in your car engine
 Prediction:
 Future stock prices or currency exchange rates
Some web-based examples of machine learning
 The web contains a lot of data. Tasks with very big datasets often use machine learning, especially if the data is noisy or non-stationary.
 Spam filtering, fraud detection:
 The enemy adapts, so we must adapt too.
 Recommendation systems:
 Lots of noisy data. Million dollar prize!
 Information retrieval:
 Find documents or images with similar content.
 Data visualization:
 Display a huge database in a revealing way.
Types of learning task
 Supervised learning
 Learn to predict an output when given an input vector
 Reinforcement learning
 Learn actions to maximize payoff
 Not much information in a payoff signal
 Unsupervised learning
 Create an internal representation of the input, e.g. form clusters; extract features
 This is the new frontier of machine learning, because most big datasets do not come with labels.

Some successful applications of
machine learning

1. A checkers learning problem:
Task T: playing checkers
Performance measure P: percent of games won against opponents
Training experience E: playing practice games against itself
We can specify many learning problems in this fashion, such as learning to recognize handwritten words, or learning to drive a robotic automobile autonomously.
Some successful applications of
machine learning

2. A handwriting recognition learning problem:
Task T: recognizing and classifying handwritten words within images
Performance measure P: percent of words correctly classified
Training experience E: a database of handwritten words with given classifications
 Learning symbolic representations of concepts.
 Machine learning as a search problem.
 Learning as an approach to improving problem solving.
 Using prior knowledge together with training data to guide learning.
DESIGNING A LEARNING SYSTEM

Choosing the Training Experience

 The first design choice we face is to choose the type of training experience from which our system will learn.
 The type of training experience available can have a significant impact on success or failure of the learner.
 One key attribute is whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
DESIGNING A LEARNING SYSTEM

 For example, in learning to play checkers, the system might learn from direct training examples consisting of individual checkers board states and the correct move for each.
 Alternatively, it might have available only indirect information consisting of the move sequences and final outcomes of various games played.
DESIGNING A LEARNING SYSTEM

 A second important attribute of the training experience is the degree to which the learner controls the sequence of training examples.
 For example, the learner might rely on the teacher to select informative board states and to provide the correct move for each.
 Alternatively, the learner may itself experiment with novel board states that it has not yet considered.
DESIGNING A LEARNING SYSTEM

 A third important attribute of the training experience is how well it represents the distribution of examples over which the final system performance P must be measured.
 In general, learning is most reliable when the training examples follow a distribution similar to that of future test examples.
DESIGNING A LEARNING SYSTEM

 In our checkers learning scenario, the performance metric P is the percent of games the system wins in the world tournament.
 If its training experience E consists only of games played against itself, there is an obvious danger that this training experience might not be fully representative of the distribution of situations over which it will later be tested.
DESIGNING A LEARNING SYSTEM

A checkers learning problem:
Task T: playing checkers
Performance measure P: percent of games won in the world tournament
Training experience E: games played against itself
In order to complete the design of the learning system, we must now choose:
1. the exact type of knowledge to be learned
2. a representation for this target knowledge
3. a learning mechanism
DESIGNING A LEARNING SYSTEM

 Example: in a driverless car, training data is fed to the algorithm showing how to drive the car on highways and on busy and narrow streets, with factors like speed limits, parking, and stopping at signals.
 A logical and mathematical model is then created on the basis of that data, and afterwards the car will work according to this model.
 Also, the more data that is fed in, the more efficient the output produced.
Steps for Designing a Learning System
Step 1) Choosing the Training Experience

 The first and most important task is to choose the training data or training experience which will be fed to the machine learning algorithm.
 It is important to note that the data or experience we feed to the algorithm can have a significant impact on the success or failure of the model.
 So training data or experience should be chosen wisely.
Choosing the Training Experience

1. Below are the attributes which will impact the success or failure of the model:
 The training experience should be able to provide direct or indirect feedback regarding choices.
 For example: while playing chess, the training data will provide feedback to itself, such as: instead of this move, if this one is chosen, the chances of success increase.
Choosing the Training Experience

2. The second important attribute is the degree to which the learner will control the sequence of training examples.
For example: when training data is fed to the machine, the machine gains experience by playing again and again with itself.
Based on the opponent, the machine algorithm will get feedback and control the chess game accordingly.
Choosing the Training Experience

3. The third important attribute is how well the training experience represents the distribution of examples over which performance will be measured.
 For example, a machine learning algorithm gains experience by going through a number of different cases and different examples.
 Thus, the algorithm will get more and more experience by passing through more and more examples, and hence its performance will increase.
Step 2- Choosing target function

 The next important step is choosing the target function.
 This means that, according to the knowledge fed to the algorithm, the machine learning system will choose a NextMove function which will describe what type of legal moves should be taken.
 For example: while playing chess with the opponent, when the opponent plays, the machine learning algorithm will decide the number of possible legal moves that can be taken in order to succeed.
Step 3- Choosing Representation for
Target function

When the machine algorithm knows all the possible legal moves, the next step is to choose the optimized move using some representation, i.e. linear equations, hierarchical graph representation, tabular form, etc.
The NextMove function chosen is the one which will provide the higher success rate.
For example: while playing chess, if the machine has 4 possible moves, it will choose the optimized move which will lead it to success.
Step 4- Choosing Function
Approximation Algorithm

 An optimized move cannot be chosen just from the training data.
 The training data has to go through a set of examples, which approximates which steps should be chosen; after that, the machine will provide feedback on it.
 For example: when training data for playing chess is fed to the algorithm, it will fail or succeed at first; from that failure or success it will learn, for the next move, which step should be chosen and what its success rate is.
Step 5- Final Design
 The final design is created at last, when the system has gone through a number of examples, failures and successes, correct and incorrect decisions, and knows what the next step will be.
Example: Deep Blue is an intelligent chess-playing computer that won a match against the chess expert Garry Kasparov, becoming the first computer to beat a reigning world chess champion.
Choosing the Target Function
Let us therefore define the target value V(b) for an arbitrary board state b in B, as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game (assuming the opponent plays optimally, as well).
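The case analysis above can be sketched as a recursive function. This is a toy sketch, not real checkers: final board states are represented by the strings "win", "loss", and "draw", a non-final state is just a list of successor states, and optimal play is collapsed into taking the maximum achievable value.

```python
# Toy sketch of the target value V(b). Final states are the strings
# "win", "loss", or "draw"; a non-final state is a list of successors.
# Optimal play is simplified here to taking the maximum achievable value
# (a real checkers program would alternate max and min over turns).
def V(b):
    if b == "win":
        return 100
    if b == "loss":
        return -100
    if b == "draw":
        return 0
    # non-final state: value of the best final board state reachable from b
    return max(V(successor) for successor in b)

print(V("draw"))                       # 0
print(V([["loss", "draw"], ["win"]]))  # 100
```

Note how case 4 of the definition makes V recursive: the value of a non-final state is defined in terms of the best final state reachable from it.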
Choosing the Target Function
 Let us choose a simple representation: for any given board state, the function V̂ will be calculated as a linear combination of the following board features:
x1: the number of black pieces on the board
x2: the number of red pieces on the board
x3: the number of black kings on the board
x4: the number of red kings on the board
x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
x6: the number of red pieces threatened by black
Choosing the Target Function
 Thus, our learning program will represent V̂(b) as a linear function of the form:

V̂(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6

 where w0 through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm.
 Learned values for the weights w1 through w6 will determine the relative importance of the various board features.

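Evaluating such a linear function is a one-liner. In this sketch the weights and board-feature values are invented for illustration, not learned:

```python
# Linear evaluation function: V_hat(b) = w0 + w1*x1 + ... + w6*x6,
# where x1..x6 are the six board features listed earlier.
def v_hat(weights, features):
    """weights: [w0, w1, ..., w6]; features: [x1, ..., x6]."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

# Illustrative (made-up) weights, and a board with 12 black pieces,
# 11 red pieces, 1 black king, 0 red kings, 2 black pieces threatened,
# and 3 red pieces threatened:
w = [0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]
x = [12, 11, 1, 0, 2, 3]
print(v_hat(w, x))  # 4.0
```

The learning algorithm's job is then only to adjust the seven numbers in w, not the structure of the function.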
PERSPECTIVES AND ISSUES IN
MACHINE LEARNING

Our checkers example raises a number of generic questions about machine learning:
 What algorithms exist for learning general target functions from specific training examples?
 In what settings will particular algorithms converge to the desired function, given sufficient training data?
 Which algorithms perform best for which types of problems and representations?
 How much training data is sufficient?
PERSPECTIVES AND ISSUES IN
MACHINE LEARNING

 When and how can prior knowledge held by the learner guide the process of generalizing from examples?
 Can prior knowledge be helpful even when it is only approximately correct?
 What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
PERSPECTIVES AND ISSUES IN
MACHINE LEARNING

 What is the best way to reduce the learning task to one or more function approximation problems?
 Put another way, what specific functions should the system attempt to learn?
 Can this process itself be automated?
 How can the learner automatically alter its representation to improve its ability to represent and learn the target function?
Decision Tree Learning
 A decision tree is a flowchart-like structure in which each internal node represents a test on a feature (e.g. whether a coin flip comes up heads or tails).
 Each leaf node represents a class label (the decision taken after computing all features), and branches represent conjunctions of features that lead to those class labels.
 The paths from root to leaf represent classification rules. A basic example is a decision tree for deciding whether it will rain, with class labels Rain (Yes) and No Rain (No).
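As a sketch, the flowchart idea can be written as a nested dictionary. The attribute names and values below are illustrative only, mirroring the rain / no-rain example:

```python
# A minimal decision tree as nested dicts. Internal nodes test an
# attribute; leaves carry a class label. These attributes and values
# are hypothetical, chosen only to mirror the rain example.
TREE = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"label": "No Rain"},
        "Cloudy": {
            "attribute": "Humidity",
            "branches": {
                "High": {"label": "Rain"},
                "Normal": {"label": "No Rain"},
            },
        },
    },
}

def classify(tree, instance):
    """Follow attribute tests from the root until a leaf label is reached."""
    while "label" not in tree:
        value = instance[tree["attribute"]]
        tree = tree["branches"][value]
    return tree["label"]

print(classify(TREE, {"Outlook": "Cloudy", "Humidity": "High"}))  # Rain
```

Each root-to-leaf path in TREE corresponds to one classification rule, e.g. "Outlook = Cloudy AND Humidity = High → Rain".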
A decision tree
 A decision tree for the concept buys_computer, indicating whether a customer at AllElectronics is likely to purchase a computer.
 Each internal (nonleaf) node represents a test on an attribute.
 Each leaf node represents a class (either buys_computer = yes or buys_computer = no).
Decision Tree
 Decision trees classify instances by sorting them down
the tree from the root to some leaf node, which
provides the classification of the instance.
 Each node in the tree specifies a test of some attribute
of the instance.
 Each branch descending from a node corresponds to
one of the possible values for the attribute.

Decision Tree
 Each leaf node assigns a classification.
 The instance (Outlook=Sunny, Temperature=Hot,
Humidity=High, Wind=Strong) is classified as a
negative instance.

When to Consider Decision Trees
 Instances are represented by attribute-value pairs.
 Fixed set of attributes, and the attributes take a small
number of disjoint possible values.
 The target function has discrete output values.
 Decision tree learning is appropriate for a boolean
classification, but it easily extends to learning
functions with more than two possible output values.

42
When to Consider Decision Trees
 Disjunctive descriptions may be required
 decision trees naturally represent disjunctive
expressions.
 The training data may contain errors.
 Decision tree learning methods are robust to errors,
both errors in classifications of the training examples
and errors in the attribute values that describe these
examples

When to Consider Decision Trees
 The training data may contain missing attribute values.
 Decision tree methods can be used even when some training examples have unknown values.
 Decision tree learning has been applied to problems such as learning to classify:
 medical patients by their disease
 equipment malfunctions by their cause
 loan applicants by their likelihood of defaulting on payments
Which Attribute is ”best”?
 We would like to select the attribute that is most useful
for classifying examples.
 Information gain measures how well a given attribute
separates the training examples according to their
target classification.
 ID3 uses this information gain measure to select among
the candidate attributes at each step while growing the
tree.

Which Attribute is ”best”?
 In order to define information gain precisely, we use a
measure commonly used in information theory, called
entropy
 Entropy characterizes the (im)purity of an arbitrary
collection of examples

Appropriate Problems for Decision
Tree Learning

1. Instances are represented by attribute-value pairs.
 Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot).
 The easiest situation for decision tree learning is when each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild, Cold).
Appropriate Problems for Decision
Tree Learning

2. The target function has discrete output values.
 The decision tree assigns a boolean classification (e.g., yes or no) to each example.
 Decision tree methods easily extend to learning functions with more than two possible output values.
 A more substantial extension allows learning target functions with real-valued outputs, though the application of decision trees in this setting is less common.
Appropriate Problems for Decision
Tree Learning

3. Disjunctive descriptions may be required. As noted above, decision trees naturally represent disjunctive expressions.

4. The training data may contain errors. Decision tree learning methods are robust to errors, both errors in classifications of the training examples and errors in the attribute values that describe these examples.
Appropriate Problems for Decision
Tree Learning

5. The training data may contain missing attribute values. Decision tree methods can be used even when some training examples have unknown values (e.g., if the Humidity of the day is known for only some of the training examples).
THE BASIC DECISION TREE LEARNING
ALGORITHM

 Most algorithms that have been developed for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees.
 This approach is exemplified by the ID3 algorithm (Quinlan 1986) and its successor C4.5 (Quinlan 1993), which form the primary focus of our discussion here.
THE BASIC DECISION TREE
LEARNING ALGORITHM

 Our basic algorithm, ID3, learns decision trees by constructing them top-down, beginning with the question: "which attribute should be tested at the root of the tree?"
 To answer this question, each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples.
 The best attribute is selected and used as the test at the root node of the tree.
THE BASIC DECISION TREE
LEARNING ALGORITHM

 A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's value for this attribute).
 The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree.
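The top-down procedure just described can be sketched as a short recursive function. This is a simplified sketch of ID3, assuming examples are given as (attribute-dict, label) pairs; it omits refinements such as handling missing values or empty branches:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())

def id3(examples, attrs):
    """examples: list of (attribute_dict, label); attrs: attribute names."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:      # all examples agree -> leaf
        return labels[0]
    if not attrs:                  # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        # information gain of splitting `examples` on attribute `a`
        remainder = 0.0
        for v in set(x[a] for x, _ in examples):
            subset = [label for x, label in examples if x[a] == v]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attrs, key=gain)    # greedy choice; never reconsidered
    branches = {}
    for v in set(x[best] for x, _ in examples):
        subset = [(x, label) for x, label in examples if x[best] == v]
        branches[v] = id3(subset, [a for a in attrs if a != best])
    return {best: branches}

toy = [({"Wind": "Weak"}, "Yes"), ({"Wind": "Strong"}, "No")]
print(id3(toy, ["Wind"]))  # a one-level tree splitting on Wind
```

Note that `best` is chosen greedily at each node and the recursion never revisits that choice, which is exactly the "no backtracking" property discussed on the next slides.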
Training Examples for the PlayTennis Game
Day Outlook Temp. Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Weak Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Strong Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
THE BASIC DECISION TREE
LEARNING ALGORITHM

 This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices.
ENTROPY MEASURES
HOMOGENEITY OF EXAMPLES

 Information gain is the reduction in entropy or surprise


by transforming a dataset and is often used in training
decision trees.
 Information gain is calculated by comparing the
entropy of the dataset before and after a transformation.
 Mutual information calculates the statistical
dependence between two variables and is the name
given to information gain when applied to variable
selection.

Defining Information Gain
 We want to determine which attribute in a given set of
training feature vectors is most useful for
discriminating between the classes to be learned.
 Information gain tells us how important a given
attribute of the feature vectors is.
 We will use it to decide the ordering of attributes in
the nodes of a decision tree



Definition of Entropy

Definition: Entropy
The entropy of a set of m distinct values is the minimum number of yes/no questions needed to determine an unknown value from these m possibilities.
Entropy
 S is a sample of training examples
 p+ is the proportion of positive examples
 p- is the proportion of negative examples
 Entropy measures the impurity of S:

Entropy(S) = -p+ log2 p+ - p- log2 p-
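A minimal sketch of this entropy measure for a two-class sample:

```python
import math

def entropy(pos, neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p- for a two-class sample."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:              # treat 0 * log2(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(f"{entropy(9, 5):.3f}")   # 0.940
print(f"{entropy(7, 7):.3f}")   # 1.000
print(f"{entropy(14, 0):.3f}")  # 0.000
```

The three printed values illustrate the extremes discussed on the following slides: a mixed [9+, 5-] sample, a perfectly balanced sample (entropy 1), and a pure sample (entropy 0).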
Entropy
 To illustrate, suppose S is a collection of 14 examples of some boolean concept, including 9 positive and 5 negative examples (we adopt the notation [9+, 5-] to summarize such a sample of data). Then the entropy of S relative to this boolean classification is

Entropy(S) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940
Entropy
 Notice that the entropy is 0 if all members of S belong to the same class.
 For example, if all members are positive (p+ = 1), then p- is 0, and Entropy(S) = -1 · log2(1) - 0 · log2(0) = 0 (defining 0 log2 0 = 0).
 Note the entropy is 1 when the collection contains an equal number of positive and negative examples.
ID3
Evaluating Humidity and Wind on S = [9+, 5-], Entropy(S) = 0.940:

Humidity: High -> [3+, 4-], E = 0.985; Normal -> [6+, 1-], E = 0.592
Wind: Weak -> [6+, 2-], E = 0.811; Strong -> [3+, 3-], E = 1.0

Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048
ID3
Evaluating Outlook on S = [9+, 5-], Entropy(S) = 0.940:

Outlook: Sunny -> [2+, 3-], E = 0.971; Overcast -> [4+, 0-], E = 0.0; Rain -> [3+, 2-], E = 0.971

Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971 = 0.247
ID3

The information gain values for the 4 attributes are:
• Gain(S, Outlook) = 0.247
• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temperature) = 0.029

where S denotes the collection of training examples.
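These four gain values can be reproduced directly from the training table shown earlier. A minimal sketch, with the 14 examples hard-coded (D9's temperature taken as Cool; note the slides round intermediate entropies, so Humidity prints as 0.152 here rather than 0.151):

```python
import math
from collections import Counter

# The 14 PlayTennis examples: (Outlook, Temp, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Weak","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Strong","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]
ATTRS = ["Outlook", "Temp", "Humidity", "Wind"]

def entropy(rows):
    """Entropy of the class labels (last column) of a list of rows."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def gain(rows, attr_index):
    """Entropy(S) minus the weighted entropy of each value partition."""
    total = len(rows)
    remainder = 0.0
    for v in set(r[attr_index] for r in rows):
        subset = [r for r in rows if r[attr_index] == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

for i, name in enumerate(ATTRS):
    print(f"Gain(S,{name}) = {gain(DATA, i):.3f}")
```

Running this confirms that Outlook has the highest gain, which is why ID3 puts it at the root in the next slides.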
ID3

According to the information gain measure, the Outlook attribute provides the best prediction of the target attribute, PlayTennis, over the training examples.

Therefore, Outlook is selected as the decision attribute for the root node, and branches are created below the root for each of its possible values (i.e., Sunny, Overcast, and Rain).
Result for ID3
Note that every example for which Outlook = Overcast is also a positive example of PlayTennis.

Therefore, this node of the tree becomes a leaf node with the classification PlayTennis = Yes.

In contrast, the descendants corresponding to Outlook = Sunny and Outlook = Rain still have nonzero entropy, and the decision tree will be further elaborated below these nodes.
