
MACHINE LEARNING

UNIT - 1

Machine learning

 Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
 Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
Machine learning

 The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide.
 The primary aim is to allow computers to learn automatically, without human intervention or assistance, and adjust actions accordingly.
Well-posed learning problems
Definition:
A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E.
For example, a computer program that learns to play
checkers might improve its performance as measured by
its ability to win at the class of tasks involving playing
checkers games, through experience obtained by
playing games against itself.
Related Fields
Machine learning draws on many related fields: data mining, control theory, statistics, decision theory, information theory, cognitive science, databases, psychological models, evolutionary models, and neuroscience.
Machine learning is primarily concerned with the accuracy and effectiveness of the computer system.
Some more examples of tasks that are
best solved by using a learning algorithm
 Recognizing patterns:
 Facial identities or facial expressions
 Handwritten or spoken words
 Medical images
 Generating patterns:
 Generating images or motion sequences
 Recognizing anomalies:
 Unusual sequences of credit card transactions
 Unusual patterns of sensor readings in a nuclear power plant, or an unusual sound in your car engine
 Prediction:
 Future stock prices or currency exchange rates
Some web-based examples of machine learning
 The web contains a lot of data. Tasks with very big datasets often use machine learning, especially if the data is noisy or non-stationary.
 Spam filtering, fraud detection:
 The enemy adapts, so we must adapt too.
 Recommendation systems:
 Lots of noisy data. Million dollar prize!
 Information retrieval:
 Find documents or images with similar content.
 Data visualization:
 Display a huge database in a revealing way.
Types of learning task
 Supervised learning
 Learn to predict an output when given an input vector
 Reinforcement learning
 Learn actions to maximize payoff
 Not much information in a payoff signal
 Unsupervised learning
 Create an internal representation of the input, e.g. form clusters; extract features
 This is the new frontier of machine learning, because most big datasets do not come with labels.

Some successful applications of
machine learning

1. A checkers learning problem:
Task T: playing checkers
Performance measure P: percent of games won against opponents
Training experience E: playing practice games against itself
We can specify many learning problems in this fashion, such as learning to recognize handwritten words, or learning to drive a robotic automobile autonomously.
Some successful applications of
machine learning

2. A handwriting recognition learning problem:
Task T: recognizing and classifying handwritten words within images
Performance measure P: percent of words correctly classified
Training experience E: a database of handwritten words with given classifications
 Learning symbolic representations of concepts.
 Machine learning as a search problem.
 Learning as an approach to improving problem solving.
 Using prior knowledge together with training data to guide learning.
DESIGNING A LEARNING SYSTEM

Choosing the Training Experience

 The first design choice we face is to choose the type of training experience from which our system will learn.
 The type of training experience available can have a significant impact on success or failure of the learner.
 One key attribute is whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
DESIGNING A LEARNING SYSTEM

 For example, in learning to play checkers, the system might learn from direct training examples consisting of individual checkers board states and the correct move for each.
 Alternatively, it might have available only indirect information consisting of the move sequences and final outcomes of various games played.
DESIGNING A LEARNING SYSTEM

 A second important attribute of the training experience is the degree to which the learner controls the sequence of training examples.
 For example, the learner might rely on the teacher to select informative board states and to provide the correct move for each.
 Alternatively, the learner may itself experiment with novel board states that it has not yet considered.
DESIGNING A LEARNING SYSTEM

 A third important attribute of the training experience is how well it represents the distribution of examples over which the final system performance P must be measured.
 In general, learning is most reliable when the training examples follow a distribution similar to that of future test examples.
DESIGNING A LEARNING SYSTEM

 In our checkers learning scenario, the performance metric P is the percent of games the system wins in the world tournament.
 If its training experience E consists only of games played against itself, there is an obvious danger that this training experience might not be fully representative of the distribution of situations over which it will later be tested.
DESIGNING A LEARNING SYSTEM

A checkers learning problem:
Task T: playing checkers
Performance measure P: percent of games won in the world tournament
Training experience E: games played against itself
In order to complete the design of the learning system, we must now choose:
1. the exact type of knowledge to be learned
2. a representation for this target knowledge
3. a learning mechanism
DESIGNING A LEARNING SYSTEM

 Example: in a driverless car, training data is fed to the algorithm showing how to drive the car on highways and on busy and narrow streets, with factors like speed limits, parking, and stopping at signals.
 A logical and mathematical model is then created on the basis of that data, and afterwards the car will work according to this model.
 Also, the more data that is fed in, the more efficient the output produced.
Steps for Designing a Learning System
Step 1) Choosing the Training Experience

 The first and most important task is to choose the training data or training experience which will be fed to the machine learning algorithm.
 It is important to note that the data or experience we feed to the algorithm can have a significant impact on the success or failure of the model.
 So training data or experience should be chosen wisely.
Choosing the Training Experience

1. Below are the attributes which will impact the success or failure of the model:
 The training experience should be able to provide direct or indirect feedback regarding choices.
 For example: while playing chess, the training data will provide feedback to itself, such as: instead of this move, if this one is chosen, the chances of success increase.
Choosing the Training Experience

2. The second important attribute is the degree to which the learner will control the sequence of training examples.
For example: when training data is fed to the machine, the machine gains experience by playing again and again with itself.
Based on the opponent, the machine algorithm will get feedback and control the chess game accordingly.
Choosing the Training Experience

3. The third important attribute is how well the training experience represents the distribution of examples over which performance will be measured.
 For example, a machine learning algorithm gains experience by going through a number of different cases and different examples.
 Thus, the algorithm will get more and more experience by passing through more and more examples, and hence its performance will increase.
Step 2- Choosing target function

 The next important step is choosing the target function.
 This means that, according to the knowledge fed to the algorithm, the machine learning system will choose a NextMove function which will describe what type of legal moves should be taken.
 For example: while playing chess with the opponent, when the opponent plays, the machine learning algorithm will decide the number of possible legal moves that can be taken in order to succeed.
Step 3- Choosing Representation for
Target function

When the machine algorithm knows all the possible legal moves, the next step is to choose the optimized move using some representation, i.e. linear equations, hierarchical graph representation, tabular form, etc.
The NextMove function chosen is the one which will provide the higher success rate.
For example: while playing chess, if the machine has 4 possible moves, it will choose the optimized move which will lead it to success.
Step 4- Choosing Function
Approximation Algorithm

 An optimized move cannot be chosen just from the training data.
 The training data has to go through a set of examples, which approximates which steps should be chosen; after that, the machine will provide feedback on it.
 For example: when training data for playing chess is fed to the algorithm, it will fail or succeed at first; from that failure or success it will learn, for the next move, which step should be chosen and what its success rate is.
Step 5- Final Design
 The final design is created at last, when the system has gone through a number of examples, failures and successes, correct and incorrect decisions, and knows what the next step will be.
Example: Deep Blue is an intelligent chess-playing computer that won a match against the chess expert Garry Kasparov, becoming the first computer to beat a reigning world chess champion.
Choosing the Target Function
Let us therefore define the target value V(b) for an arbitrary board state b in B, as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game (assuming the opponent plays optimally, as well).
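The case analysis above can be sketched as a recursive function. This is a toy sketch, not real checkers: final board states are represented by the strings "win", "loss", and "draw", a non-final state is just a list of successor states, and optimal play is collapsed into taking the maximum achievable value.

```python
# Toy sketch of the target value V(b). Final states are the strings
# "win", "loss", or "draw"; a non-final state is a list of successors.
# Optimal play is simplified here to taking the maximum achievable value
# (a real checkers program would alternate max and min over turns).
def V(b):
    if b == "win":
        return 100
    if b == "loss":
        return -100
    if b == "draw":
        return 0
    # non-final state: value of the best final board state reachable from b
    return max(V(successor) for successor in b)

print(V("draw"))                       # 0
print(V([["loss", "draw"], ["win"]]))  # 100
```

Note how case 4 of the definition makes V recursive: the value of a non-final state is defined in terms of the best final state reachable from it.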
Choosing the Target Function
 Let us choose a simple representation: for any given board state, the function V̂ will be calculated as a linear combination of the following board features:
x1: the number of black pieces on the board
x2: the number of red pieces on the board
x3: the number of black kings on the board
x4: the number of red kings on the board
x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
x6: the number of red pieces threatened by black
Choosing the Target Function
 Thus, our learning program will represent V̂(b) as a linear function of the form:

V̂(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6

 where w0 through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm.
 Learned values for the weights w1 through w6 will determine the relative importance of the various board features.

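Evaluating such a linear function is a one-liner. In this sketch the weights and board-feature values are invented for illustration, not learned:

```python
# Linear evaluation function: V_hat(b) = w0 + w1*x1 + ... + w6*x6,
# where x1..x6 are the six board features listed earlier.
def v_hat(weights, features):
    """weights: [w0, w1, ..., w6]; features: [x1, ..., x6]."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

# Illustrative (made-up) weights, and a board with 12 black pieces,
# 11 red pieces, 1 black king, 0 red kings, 2 black pieces threatened,
# and 3 red pieces threatened:
w = [0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]
x = [12, 11, 1, 0, 2, 3]
print(v_hat(w, x))  # 4.0
```

The learning algorithm's job is then only to adjust the seven numbers in w, not the structure of the function.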
PERSPECTIVES AND ISSUES IN
MACHINE LEARNING

Our checkers example raises a number of generic questions about machine learning:
 What algorithms exist for learning general target functions from specific training examples?
 In what settings will particular algorithms converge to the desired function, given sufficient training data?
 Which algorithms perform best for which types of problems and representations?
 How much training data is sufficient?
PERSPECTIVES AND ISSUES IN
MACHINE LEARNING

 When and how can prior knowledge held by the learner guide the process of generalizing from examples?
 Can prior knowledge be helpful even when it is only approximately correct?
 What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
PERSPECTIVES AND ISSUES IN
MACHINE LEARNING

 What is the best way to reduce the learning task to one or more function approximation problems?
 Put another way, what specific functions should the system attempt to learn?
 Can this process itself be automated?
 How can the learner automatically alter its representation to improve its ability to represent and learn the target function?
Decision Tree Learning
 A decision tree is a flowchart-like structure in which each internal node represents a test on a feature (e.g. whether a coin flip comes up heads or tails).
 Each leaf node represents a class label (the decision taken after computing all features), and branches represent conjunctions of features that lead to those class labels.
 The paths from root to leaf represent classification rules. A basic example is a decision tree for deciding whether it will rain, with class labels Rain (Yes) and No Rain (No).
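As a sketch, the flowchart idea can be written as a nested dictionary. The attribute names and values below are illustrative only, mirroring the rain / no-rain example:

```python
# A minimal decision tree as nested dicts. Internal nodes test an
# attribute; leaves carry a class label. These attributes and values
# are hypothetical, chosen only to mirror the rain example.
TREE = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"label": "No Rain"},
        "Cloudy": {
            "attribute": "Humidity",
            "branches": {
                "High": {"label": "Rain"},
                "Normal": {"label": "No Rain"},
            },
        },
    },
}

def classify(tree, instance):
    """Follow attribute tests from the root until a leaf label is reached."""
    while "label" not in tree:
        value = instance[tree["attribute"]]
        tree = tree["branches"][value]
    return tree["label"]

print(classify(TREE, {"Outlook": "Cloudy", "Humidity": "High"}))  # Rain
```

Each root-to-leaf path in TREE corresponds to one classification rule, e.g. "Outlook = Cloudy AND Humidity = High → Rain".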
A decision tree
 A decision tree for the concept buys_computer, indicating whether a customer at AllElectronics is likely to purchase a computer.
 Each internal (nonleaf) node represents a test on an attribute.
 Each leaf node represents a class (either buys_computer = yes or buys_computer = no).
Decision Tree
 Decision trees classify instances by sorting them down
the tree from the root to some leaf node, which
provides the classification of the instance.
 Each node in the tree specifies a test of some attribute
of the instance.
 Each branch descending from a node corresponds to
one of the possible values for the attribute.

Decision Tree
 Each leaf node assigns a classification.
 The instance (Outlook=Sunny, Temperature=Hot,
Humidity=High, Wind=Strong) is classified as a
negative instance.

When to Consider Decision Trees
 Instances are represented by attribute-value pairs.
 Fixed set of attributes, and the attributes take a small
number of disjoint possible values.
 The target function has discrete output values.
 Decision tree learning is appropriate for a boolean
classification, but it easily extends to learning
functions with more than two possible output values.

42
When to Consider Decision Trees
 Disjunctive descriptions may be required
 decision trees naturally represent disjunctive
expressions.
 The training data may contain errors.
 Decision tree learning methods are robust to errors,
both errors in classifications of the training examples
and errors in the attribute values that describe these
examples

When to Consider Decision Trees
 The training data may contain missing attribute values.
 Decision tree methods can be used even when some training examples have unknown values.
 Decision tree learning has been applied to problems such as learning to classify:
 medical patients by their disease
 equipment malfunctions by their cause
 loan applicants by their likelihood of defaulting on payments
Which Attribute is ”best”?
 We would like to select the attribute that is most useful
for classifying examples.
 Information gain measures how well a given attribute
separates the training examples according to their
target classification.
 ID3 uses this information gain measure to select among
the candidate attributes at each step while growing the
tree.

Which Attribute is ”best”?
 In order to define information gain precisely, we use a
measure commonly used in information theory, called
entropy
 Entropy characterizes the (im)purity of an arbitrary
collection of examples

Appropriate Problems for Decision
Tree Learning

1. Instances are represented by attribute-value pairs.
 Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot).
 The easiest situation for decision tree learning is when each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild, Cold).
Appropriate Problems for Decision
Tree Learning

2. The target function has discrete output values.
 The decision tree assigns a boolean classification (e.g., yes or no) to each example.
 Decision tree methods easily extend to learning functions with more than two possible output values.
 A more substantial extension allows learning target functions with real-valued outputs, though the application of decision trees in this setting is less common.
Appropriate Problems for Decision
Tree Learning

3. Disjunctive descriptions may be required. As noted above, decision trees naturally represent disjunctive expressions.

4. The training data may contain errors. Decision tree learning methods are robust to errors, both errors in classifications of the training examples and errors in the attribute values that describe these examples.
Appropriate Problems for Decision
Tree Learning

5. The training data may contain missing attribute values. Decision tree methods can be used even when some training examples have unknown values (e.g., if the Humidity of the day is known for only some of the training examples).
THE BASIC DECISION TREE LEARNING
ALGORITHM

 Most algorithms that have been developed for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees.
 This approach is exemplified by the ID3 algorithm (Quinlan 1986) and its successor C4.5 (Quinlan 1993), which form the primary focus of our discussion here.
THE BASIC DECISION TREE
LEARNING ALGORITHM

 Our basic algorithm, ID3, learns decision trees by constructing them top-down, beginning with the question: "which attribute should be tested at the root of the tree?"
 To answer this question, each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples.
 The best attribute is selected and used as the test at the root node of the tree.
THE BASIC DECISION TREE
LEARNING ALGORITHM

 A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's value for this attribute).
 The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree.
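The top-down procedure just described can be sketched as a short recursive function. This is a simplified sketch of ID3, assuming examples are given as (attribute-dict, label) pairs; it omits refinements such as handling missing values or empty branches:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())

def id3(examples, attrs):
    """examples: list of (attribute_dict, label); attrs: attribute names."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:      # all examples agree -> leaf
        return labels[0]
    if not attrs:                  # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        # information gain of splitting `examples` on attribute `a`
        remainder = 0.0
        for v in set(x[a] for x, _ in examples):
            subset = [label for x, label in examples if x[a] == v]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attrs, key=gain)    # greedy choice; never reconsidered
    branches = {}
    for v in set(x[best] for x, _ in examples):
        subset = [(x, label) for x, label in examples if x[best] == v]
        branches[v] = id3(subset, [a for a in attrs if a != best])
    return {best: branches}

toy = [({"Wind": "Weak"}, "Yes"), ({"Wind": "Strong"}, "No")]
print(id3(toy, ["Wind"]))  # a one-level tree splitting on Wind
```

Note that `best` is chosen greedily at each node and the recursion never revisits that choice, which is exactly the "no backtracking" property discussed on the next slides.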
Training Examples for the PlayTennis Game
Day Outlook Temp. Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Weak Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Strong Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
THE BASIC DECISION TREE
LEARNING ALGORITHM

 This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices.
ENTROPY MEASURES
HOMOGENEITY OF EXAMPLES

 Information gain is the reduction in entropy or surprise


by transforming a dataset and is often used in training
decision trees.
 Information gain is calculated by comparing the
entropy of the dataset before and after a transformation.
 Mutual information calculates the statistical
dependence between two variables and is the name
given to information gain when applied to variable
selection.

Defining Information Gain
 We want to determine which attribute in a given set of
training feature vectors is most useful for
discriminating between the classes to be learned.
 Information gain tells us how important a given
attribute of the feature vectors is.
 We will use it to decide the ordering of attributes in
the nodes of a decision tree



Definition of Entropy

Definition: Entropy
The entropy of a set of m distinct values is the minimum number of yes/no questions needed to determine an unknown value from these m possibilities.
Entropy
 S is a sample of training examples
 p+ is the proportion of positive examples
 p- is the proportion of negative examples
 Entropy measures the impurity of S:

Entropy(S) = -p+ log2 p+ - p- log2 p-
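A minimal sketch of this entropy measure for a two-class sample:

```python
import math

def entropy(pos, neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p- for a two-class sample."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:              # treat 0 * log2(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(f"{entropy(9, 5):.3f}")   # 0.940
print(f"{entropy(7, 7):.3f}")   # 1.000
print(f"{entropy(14, 0):.3f}")  # 0.000
```

The three printed values illustrate the extremes discussed on the following slides: a mixed [9+, 5-] sample, a perfectly balanced sample (entropy 1), and a pure sample (entropy 0).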
Entropy
 To illustrate, suppose S is a collection of 14 examples of some boolean concept, including 9 positive and 5 negative examples (we adopt the notation [9+, 5-] to summarize such a sample of data). Then the entropy of S relative to this boolean classification is

Entropy(S) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940
Entropy
 Notice that the entropy is 0 if all members of S belong to the same class.
 For example, if all members are positive (p+ = 1), then p- is 0, and Entropy(S) = -1 · log2(1) - 0 · log2(0) = 0 (defining 0 log2 0 = 0).
 Note the entropy is 1 when the collection contains an equal number of positive and negative examples.
ID3
Evaluating Humidity and Wind on S = [9+, 5-], Entropy(S) = 0.940:

Humidity: High -> [3+, 4-], E = 0.985; Normal -> [6+, 1-], E = 0.592
Wind: Weak -> [6+, 2-], E = 0.811; Strong -> [3+, 3-], E = 1.0

Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048
ID3
Evaluating Outlook on S = [9+, 5-], Entropy(S) = 0.940:

Outlook: Sunny -> [2+, 3-], E = 0.971; Overcast -> [4+, 0-], E = 0.0; Rain -> [3+, 2-], E = 0.971

Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971 = 0.247
ID3

The information gain values for the 4 attributes are:
• Gain(S, Outlook) = 0.247
• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temperature) = 0.029

where S denotes the collection of training examples.
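These four gain values can be reproduced directly from the training table shown earlier. A minimal sketch, with the 14 examples hard-coded (D9's temperature taken as Cool; note the slides round intermediate entropies, so Humidity prints as 0.152 here rather than 0.151):

```python
import math
from collections import Counter

# The 14 PlayTennis examples: (Outlook, Temp, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Weak","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Strong","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]
ATTRS = ["Outlook", "Temp", "Humidity", "Wind"]

def entropy(rows):
    """Entropy of the class labels (last column) of a list of rows."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def gain(rows, attr_index):
    """Entropy(S) minus the weighted entropy of each value partition."""
    total = len(rows)
    remainder = 0.0
    for v in set(r[attr_index] for r in rows):
        subset = [r for r in rows if r[attr_index] == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

for i, name in enumerate(ATTRS):
    print(f"Gain(S,{name}) = {gain(DATA, i):.3f}")
```

Running this confirms that Outlook has the highest gain, which is why ID3 puts it at the root in the next slides.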
ID3

According to the information gain measure, the Outlook attribute provides the best prediction of the target attribute, PlayTennis, over the training examples.

Therefore, Outlook is selected as the decision attribute for the root node, and branches are created below the root for each of its possible values (i.e., Sunny, Overcast, and Rain).
Result for ID3
Note that every example for which Outlook = Overcast is also a positive example of PlayTennis.

Therefore, this node of the tree becomes a leaf node with the classification PlayTennis = Yes.

In contrast, the descendants corresponding to Outlook = Sunny and Outlook = Rain still have nonzero entropy, and the decision tree will be further elaborated below these nodes.
