
AI: Machine Learning and its Applications

DR. SOURAV MANDAL
What is Learning?

• “Learning denotes changes in a system that ... enable a system to do the same task … more efficiently the next time.” - Herbert Simon
• “Learning is constructing or modifying representations of what is being experienced.” - Ryszard Michalski
• “Learning is making useful changes in our minds.” - Marvin Minsky

“Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge.”

2
Why Machine Learning?

• No human experts
• industrial/manufacturing control
• mass spectrometer analysis, drug design, astronomic discovery
• Black-box human expertise
• face/handwriting/speech recognition
• driving a car, flying a plane
• Rapidly changing phenomena
• credit scoring, financial modeling
• diagnosis, fraud detection
• Need for customization/personalization
• personalized news reader
• movie/book recommendation

3
Related Fields
[Diagram: machine learning at the center, overlapping with related fields: data mining, control theory, statistics, decision theory, information theory, cognitive science, databases, psychological models, evolutionary models, neuroscience]

Machine learning is primarily concerned with the accuracy and effectiveness of the computer system.

4
Machine Learning Paradigms
• rote learning (memorization technique)
• learning by being told (advice-taking)
• learning from examples (induction)
• learning by analogy
• speed-up learning
• concept learning
• clustering
• discovery
5
Architecture of a Learning System
[Diagram: the learning agent in its ENVIRONMENT. A critic compares percepts against a performance standard and sends feedback to the learning element. The learning element makes changes to the performance element (which maps percepts to actions), draws knowledge from it, and sets learning goals for a problem generator.]
6
Forms of Learning

• Any component of an agent can be improved by learning from data. The improvement depends on four major factors:
• Which component is to be improved.
• What prior knowledge the agent already has.
• What representation is used for the data and the component.
• What feedback is available to learn from.
• E.g., an automated taxi-driving agent might learn a condition-action rule for when to brake (or slow down!) from all previous instances of braking. Seeing many camera images, it can learn the different vehicles and objects in front of the taxi, and can therefore make various decisions itself.

7
Learning Element

Design affected by:


• performance element used
• e.g., utility-based agent, reactive agent, logical agent
• functional component to be learned
• e.g., classifier, evaluation function, perception-action function,
• representation of functional component
• e.g., weighted linear function, logical theory, HMM
• feedback available
• e.g., correct action, reward, relative preferences

8
Dimensions of Learning Systems

type of feedback

• supervised (labeled examples) - like a teacher
• unsupervised (unlabeled examples) - no explicit feedback
• reinforcement (reward) - like scoring good marks in exams

representation

• attribute-based (feature vector)


• relational (first-order logic)

use of knowledge

• empirical (knowledge-free)
• analytical (knowledge-guided)

9
What is machine learning?
• A branch of artificial intelligence, concerned with
the design and development of algorithms that
allow computers to evolve behaviors based on
empirical data.

• As intelligence requires knowledge, it is necessary for the computers to acquire knowledge.

10
What is machine learning?

11
Learning system model

[Diagram: during training, input samples are fed to the learning method to build the system; during testing, the system is applied to new inputs.]

12
Learning system model

13
Training and testing

[Diagram: data acquisition draws a training set (observed) from the universal set (unobserved); practical usage is evaluated on a testing set (unobserved).]
14
Training and testing
• Training is the process of making the system able to
learn.
• No free lunch rule:
• Training set and testing set come from the same distribution
• Need to make some assumptions or bias

15
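To make the training/testing split concrete, here is a minimal sketch in Python (assuming scikit-learn; the iris dataset and the 70/30 split ratio are illustrative choices, not from the slides):

    # Hold out part of the labeled data as an unseen test set; training and
    # test sets are drawn from the same distribution, per the slide's note.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)
    print(len(X_train), "training instances,", len(X_test), "test instances")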
Performance
• There are several factors affecting the performance:
• Types of training provided
• The form and extent of any initial background knowledge
• The type of feedback provided
• The learning algorithms used

• Two important factors:


• Modeling
• Optimization

16
Algorithms
• The success of machine learning system also
depends on the algorithms.

• The algorithms control the search to find and build


the knowledge structures.

• The learning algorithms should extract useful


information from training examples.

17
Algorithms
• Supervised learning
• Prediction
• Classification (discrete labels), Regression (real values)
• Unsupervised learning
• Clustering
• Probability distribution estimation
• Finding association (in features)
• Dimension reduction
• Semi-supervised learning
• Reinforcement learning
• Decision making (robot, chess machine)

18
Algorithms

[Diagram: supervised learning and unsupervised learning, with semi-supervised learning between them]
19
Machine learning structure
• Supervised learning

20
Inductive (Supervised) Learning
Basic Problem: Given a training set of N example input-output pairs
(x1, y1), (x2, y2), ..., (xN, yN),
where each yi was generated by an unknown function y = f(x),

• Discover a function h (hypothesis) that approximates the true function f.
• Learning is a search through the space of possible hypotheses for one that will perform well, even on new examples beyond the training set.

21
Inductive (Supervised) Learning

Basic Problem: Induce a representation of a function (a systematic relationship between inputs and outputs) from examples.

• target function f: X → Y
• example (x, f(x))
• hypothesis g: X → Y such that g(x) = f(x)

x = set of attribute values (attribute-value representation)
x = set of logical sentences (first-order representation)

Y = set of discrete labels (classification)
Y = ℝ (regression)

22
Machine learning structure
• Unsupervised learning

23
Predicting housing price

24
Classifying Iris Plants

• Iris flowers have different sepal and petal shapes:
• Iris Setosa
• Iris Versicolour
• Iris Virginica
• Suppose you are shown lots of examples of each type. Given a new iris flower, what type is it?

https://en.wikipedia.org/wiki/Iris_setosa
https://en.wikipedia.org/wiki/Iris_versicolor 25
https://en.wikipedia.org/wiki/Iris_virginica
Supervised Learning

26
Supervised Learning:
Regression vs. Classification

Regression

• Covers situations where Y is continuous (quantitative)
• E.g., predicting the value of the Dow (stock index) in 6 months, predicting the value of a given house based on various inputs, etc.

Classification

• Covers situations where Y is categorical (qualitative)
• E.g., Will the Dow be up or down in 6 months? Is this email spam or not?

27
Supervised Learning: Examples
• Email Spam:
• predict whether an email is a junk email (i.e. spam)

28
Supervised Learning: Examples

• Handwritten Digit Recognition:


• Identify single digits 0~9 based on images

29
Supervised Learning: Examples

• Face Detection/Recognition:
• Identify human faces

30
Supervised Learning: Examples

• Speech Recognition:
• Identify words spoken according to speech signals
• Automatic voice recognition systems used by airline companies,
automatic stock price reporting, etc.

31
Supervised Learning:
Linear Regression

32
What are we seeking?
• Supervised: low E-out, or maximize probabilistic terms
• E-in: error on the training set
• E-out: error on the testing set

• Unsupervised: minimum quantization error, minimum distance, MAP, MLE (maximum likelihood estimation)

33
What are we seeking?
Under-fitting vs. over-fitting (fixed N)

[Plot: error versus model complexity; training error keeps falling while test error rises again once the model over-fits (model = hypothesis + loss functions)]

34
Learning techniques

• Supervised learning categories and techniques


• Linear classifier (numerical functions)
• Parametric (Probabilistic functions)
• Naïve Bayes, Gaussian discriminant analysis (GDA), Hidden Markov models
(HMM), Probabilistic graphical models
• Non-parametric (Instance-based functions)
• K-nearest neighbors, Kernel regression, Kernel density estimation, Local
regression
• Non-metric (Symbolic functions)
• Classification and regression tree (CART), decision tree
• Aggregation
• Bagging (bootstrap + aggregation), Adaboost, Random forest
35
Learning techniques
• Linear classifier: f(x) = sign(w·x + b), where w is a d-dim vector (learned)

• Techniques:
• Perceptron
• Logistic regression
• Support vector machine (SVM)
• Ada-line
• Multi-layer perceptron (MLP)

36
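As a concrete illustration of the linear form above, here is a minimal perceptron learning algorithm (PLA) sketch in Python (NumPy assumed; the toy data and the fixed pass count are illustrative, not from the slides):

    import numpy as np

    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 1.0]])  # inputs
    y = np.array([1, 1, -1, -1])                                       # labels in {-1, +1}

    w = np.zeros(2)  # the d-dim weight vector to be learned
    b = 0.0          # bias

    for _ in range(100):                        # fixed number of passes over the data
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # point is misclassified
                w += yi * xi                    # PLA update: move boundary toward xi
                b += yi

    print("learned w:", w, "b:", b)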
Learning techniques
Using the perceptron learning algorithm (PLA)

[Plots: the learned linear boundary on training and testing data]
Training error rate: 0.10; Testing error rate: 0.156
37
Learning techniques
Using logistic regression

[Plots: the learned boundary on training and testing data]
Training error rate: 0.11; Testing error rate: 0.145
38
Learning techniques
• Non-linear case

• Support vector machine (SVM):


• Linear to nonlinear: Feature transform and kernel function

39
Learning techniques
• Unsupervised learning categories and techniques
• Clustering
• K-means clustering
• Spectral clustering
• Density Estimation
• Gaussian mixture model (GMM)
• Graphical models
• Dimensionality reduction
• Principal component analysis (PCA)
• Factor analysis

40
Supervised Learning

41
Road Map

• Basic concepts
• Decision tree induction

42
An example application
• An emergency room in a hospital measures 17 variables
(e.g., blood pressure, age, etc) of newly admitted
patients.
• A decision is needed: whether to put a new patient in an
intensive-care unit.
• Due to the high cost of ICU, those patients who may
survive less than a month are given higher priority.
• Problem: to predict high-risk patients and discriminate
them from low-risk patients.

43
Another application

• A credit card company receives thousands of applications for


new cards. Each application contains information about an
applicant,
• age
• marital status
• annual salary
• outstanding debts
• credit rating
• etc.
• Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.
44
Machine learning and our focus
• Like human learning from past experiences.
• A computer does not have “experiences”.
• A computer system learns from data, which represent some
“past experiences” of an application domain.
• Our focus: learn a target function that can be used to
predict the values of a discrete class attribute, e.g., approve
or not-approved, and high-risk or low risk.
• The task is commonly called: Supervised learning,
classification, or inductive learning.

45
The data and the goal
• Data: A set of data records (also called examples, instances
or cases) described by
• k attributes: A1, A2, … Ak.
• a class: Each example is labelled with a pre-defined class.
• Goal: To learn a classification model from the data that can
be used to predict the classes of new (future, or test)
cases/instances.

46
An example: data (loan application)

• Approved or not

47
An example: the learning task
• Learn a classification model from the data
• Use the model to classify future loan
applications into
• Yes (approved) and
• No (not approved)
• What is the class for following case/instance?

48
Learning Approaches

• Supervised Learning: the training data is annotated with information to help the learning system
• Unsupervised Learning: the training data is not annotated with any extra information to help the learning system
• Semi-Supervised Learning
49
Supervised vs. unsupervised Learning
• Supervised learning: classification is seen as supervised
learning from examples.
• Supervision: The data (observations, measurements, etc.)
are labeled with pre-defined classes. It is like that a
“teacher” gives the classes (supervision).
• Test data are classified into these classes too.
• Unsupervised learning (clustering)
• Class labels of the data are unknown
• Given a set of data, the task is to establish the existence of
classes or clusters in the data

50
Supervised learning process: two steps
 Learning (training): Learn a model using the training data
 Testing: Test the model using unseen test data to assess the model accuracy

Accuracy = Number of correct classifications / Total number of test cases

51
What do we mean by learning?
• Given
• a data set D,
• a task T, and
• a performance measure M,
a computer system is said to learn from D to perform the task
T if after learning the system’s performance on T improves as
measured by M.
• In other words, the learned model helps the system to
perform T better as compared to no learning.

52
An example

• Data: Loan application data


• Task: Predict whether a loan should be approved or not.
• Performance measure: accuracy.

No learning: classify all future applications (test data) to the


majority class (i.e., Yes):
Accuracy = 9/15 = 60%.
• We can do better than 60% with learning.

53
Fundamental assumption of learning

Assumption: The distribution of training examples is identical to


the distribution of test examples (including future unseen
examples).

• In practice, this assumption is often violated to certain degree.


• Strong violations will clearly result in poor classification
accuracy.
• To achieve good accuracy on the test data, training examples
must be sufficiently representative of the test data.

54
Different Data Analysis Tasks

• Classification: assign a category (i.e., a class) to a new instance
• Clustering: form clusters (i.e., groups) from a set of instances
• Pattern detection: identify regularities (i.e., patterns) in temporal or spatial data
• Simulation: define mathematical formulas that can generate data like the observations collected

55
Different Data Analysis Tasks
• Classification
• Clustering
• Pattern detection
• Causal discovery
• Simulation

• Each type of task is characterized by the kinds of data they require and the kinds of output they generate
• Each type of task uses different algorithms

56
General Approaches are Adapted to
Specific Kinds of Data
Treat Programs as “Black Boxes”

• You don’t have to understand


complex mathematics and
programming in order to use
software
• Therefore, we often refer to
software as a “black box”
• You only need to understand
inputs and outputs and the
program’s function in order to
use it correctly

58
Programs as Functions: Inputs, Outputs,
and Parameters

Shift key: 3
Original: HELLO
Cipher: KHOOR

59
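The shift cipher above can be written directly as a function with an input (plaintext), a parameter (the shift key), and an output (ciphertext); a minimal Python sketch:

    def caesar(text: str, shift: int) -> str:
        result = []
        for ch in text:
            if ch.isalpha():
                base = ord('A') if ch.isupper() else ord('a')
                # rotate the letter by `shift` positions, wrapping around the alphabet
                result.append(chr((ord(ch) - base + shift) % 26 + base))
            else:
                result.append(ch)  # leave non-letters unchanged
        return "".join(result)

    print(caesar("HELLO", 3))  # KHOOR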
Workflow as a Composition of Functions
PART II:
CLASSIFICATION
Part II: Classification
Topics

Classification tasks

Building a classifier

Evaluating a classifier

62
Classifying Mushrooms
• What mushrooms are edible, i.e., not poisonous?
• A book lists many kinds of mushrooms identified as either edible, poisonous, or of unknown edibility
• Given a new kind of mushroom not listed in the book, is it edible?

https://archive.ics.uci.edu/ml/datasets/Mushroom

63
Classifying Iris Plants

• Iris flowers have different sepal and petal shapes:
• Iris Setosa
• Iris Versicolour
• Iris Virginica
• Suppose you are shown lots of examples of each type. Given a new iris flower, what type is it?
https://en.wikipedia.org/wiki/Iris_setosa 64
https://en.wikipedia.org/wiki/Iris_versicolor
https://en.wikipedia.org/wiki/Iris_virginica
1. Classification
Tasks

65
Classification Tasks

• Given:
• A set of classes
• Instances (examples) of each class
• Generate: A method (aka model) that when given a new instance
it will determine its class

66
http://www.business-insight.com/html/intelligence/bi_overfitting.html
Classification Tasks

• Given:
• A set of classes
• Instances of each class
• Generate: A method that when given a new instance it will determine its class

• Instances are described as a set of features or attributes and their values
• The class that the instance belongs to is also called its “label”
• Input is a set of “labeled instances”

67
Possible Features

1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s


2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
68
https://commons.wikimedia.org/wiki/File:Twelve_edible_mushrooms_of_the_United_States.jpg
Describing an Instance

p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
• Class: poisonous - p
• Cap shape: convex – x
• Cap surface: smooth – s
• Cap color: brown – n
• Bruises: true – t
• Odor: pungent – p
•…
69
https://en.wikipedia.org/wiki/Edible_mushroom#/media/File:Lepista_nuda.jpg
Iris Classification:
“Continuous” Feature Values

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
70
Describing Many Instances

p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
e,b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m
e,b,y,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,s,m
p,x,y,w,t,p,f,c,n,p,e,e,s,s,w,w,p,w,o,p,k,v,g
e,b,s,y,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,s,m
e,x,y,y,t,l,f,c,b,g,e,c,s,s,w,w,p,w,o,p,n,n,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,s,m
71
https://commons.wikimedia.org/wiki/File:Twelve_edible_mushrooms_of_the_United_States.jpg
Example of a Model:
A Decision Tree
• Nodes: attribute-based
decisions
• Branches: alternative values of
the attributes
• Leaves: each leaf is a class

72
https://www.quora.com/What-are-the-disadvantages-of-using-a-decision-tree-for-classification
Using a Decision Tree

• Given a new instance, take a path through the tree based on its attributes
• When a leaf is reached, that is the class assigned to the instance

73
https://www.quora.com/What-are-the-disadvantages-of-using-a-decision-tree-for-classification
High-Level Algorithm to
Learn a Decision Tree
• Start with the set of all instances in the root node
• Select the attribute that splits the set best (e.g., most evenly into subsets) and create child nodes
• When a node has all instances in the same class, make it a leaf node
• Iterate until all nodes are leaves

74
https://www.quora.com/What-are-the-disadvantages-of-using-a-decision-tree-for-classification
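A minimal sketch of learning and using a decision tree in Python (scikit-learn assumed; note scikit-learn implements a CART-style learner rather than C4.5, and the iris dataset is an illustrative choice):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Build the tree top-down from all instances, choosing splits by entropy
    tree = DecisionTreeClassifier(criterion="entropy")
    tree.fit(X, y)

    # Classifying an instance follows a root-to-leaf path through the tree
    print(tree.predict(X[:1]))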
Classifying a New Instance
[Diagram: a set of instances is fed to the modeler, which produces a model; the classifier applies the model to a new instance and outputs its class.]
75
Classifying New Instances
[Same diagram: the classifier assigns a class to each of several new instances.]
76
Training and Test Sets
[Diagram: labeled instances are split into training instances (training set) for the modeler and test instances (test set) for evaluating the classifier.]
77
Contamination
[Same diagram. Contamination: when training and test sets overlap - this should NEVER happen.]
78
About Classification Tasks

• Classes must be disjoint, i.e., each instance belongs to only one class
• Classification tasks are “binary” if there are only two classes
• The classification method will rarely be perfect; it will make mistakes in its classification of new instances

79
2. Building a
Classifier

80
What is a Modeler?
[Diagram as before: instances → modeler → model → classifier]

• A mathematical/algorithmic approach to generalize from instances so it can make predictions about instances that it has not seen before
• Its output is called a model
81
Types of Modelers/Models
[Diagram as before]

• Logistic regression
• Naïve Bayes classifiers
• Support vector machines (SVMs)
• Decision trees
• Random forests
• Kernel methods
• Genetic algorithms
• Neural networks
82
Explanations

• Decision trees can be explained and visualized directly
• The other models (logistic regression, naïve Bayes classifiers, SVMs, random forests, kernel methods, genetic algorithms, neural networks) are mathematical models that are hard to explain and visualize
83
[Slides 84-88: worked figures from http://tjo-en.hatenablog.com/entry/2014/01/06/234155]
What Modeler to Choose?
• Data scientists try different modelers, with different parameters, and check the accuracy to figure out which one works best for the data at hand:
• Logistic regression
• Naïve Bayes classifiers
• Support vector machines (SVMs)
• Decision trees
• Random forests
• Kernel methods
• Genetic algorithms (GAs)
• Neural networks: perceptrons
89
Ensembles
[Diagram: instances are fed to modelers A, B, and C; their models are joined by a combination function into a final model.]

• An ensemble method uses several algorithms that do the same task, and combines their results
• “Ensemble learning”
• A combination function joins the results
• Majority vote: each algorithm gets a vote
• Weighted voting: each algorithm’s vote has a weight
• Other complex combination functions
90
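A minimal majority-vote ensemble sketch in Python (scikit-learn assumed; the three base modelers and the dataset are illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # "hard" voting = majority vote; passing weights=... would give weighted voting
    ensemble = VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("nb", GaussianNB()),
                    ("dt", DecisionTreeClassifier())],
        voting="hard")
    ensemble.fit(X, y)
    print(ensemble.predict(X[:1]))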
91
http://magizbox.com/index.php/machine-learning/ds-model-building/ensemble/
Road Map

• Basic concepts
• Decision tree induction [on loan data]

92
Introduction

• Decision tree learning is one of the most widely used


techniques for classification.
• Its classification accuracy is competitive with other methods, and
• it is very efficient.
• The classification model is a tree, called a decision tree.
• C4.5 by Ross Quinlan is perhaps the best-known system. It
can be downloaded from the Web.

93
The loan data (reproduced)
Approved or not

94
A decision tree from the loan data
 Decision nodes and leaf nodes (classes)

95
Use the decision tree

No

96
Is the decision tree unique?
 No. Here is a simpler tree.
 We want a smaller and more accurate tree: easier to understand, and it performs better.
 All current tree-building algorithms are heuristic algorithms

[Figure: a simpler decision tree, with leaf class ratios 3/9 and 6/9]
97
From a decision tree to a set of rules
 A decision tree can be converted to a set of rules
 Each path from the root to a leaf is a rule.

[Figure: the tree with leaf ratios (3/9) and (6/9) and the corresponding rules]

98
Algorithm for decision tree learning

• Basic algorithm (a greedy divide-and-conquer algorithm)


• Assume attributes are categorical now (continuous attributes can be handled
too)
• Tree is constructed in a top-down recursive manner
• At start, all the training examples are at the root
• Examples are partitioned recursively based on selected attributes
• Attributes are selected based on an impurity function (e.g., information gain)
• Conditions for stopping partitioning
• All examples for a given node belong to the same class
• There are no remaining attributes for further partitioning – majority class is the leaf
• There are no examples left

99
Decision tree learning algorithm

100
Choose an attribute to partition data

• The key to building a decision tree - which attribute to


choose in order to branch.
• The objective is to reduce impurity or uncertainty in data
as much as possible.
• A subset of data is pure if all instances belong to the same class.
• The heuristic in C4.5 is to choose the attribute with the
maximum Information Gain or Gain Ratio based on
information theory.

101
The loan data (reproduced)
Approved or not

102
Two possible roots, which is better?

 Fig. (B) seems to be better.

103
Information theory

• Information theory provides a mathematical basis for


measuring the information content.
• To understand the notion of information, think about it as
providing the answer to a question, for example, whether
a coin will come up heads.
• If one already has a good guess about the answer, then the
actual answer is less informative.
• If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).

104
Information theory (cont …)

• For a fair (honest) coin, you have no information, and you are willing to pay more (say, in dollars) for advance information - the less you know, the more valuable the information.

• Information theory uses this same intuition, but instead of measuring


the value for information in dollars, it measures information contents
in bits.

• One bit of information is enough to answer a yes/no question about


which one has no idea, such as the flip of a fair coin.

105
Information theory: Entropy measure
• The entropy formula:

entropy(D) = - sum_{j=1..|C|} Pr(c_j) * log2(Pr(c_j)),  where sum_{j=1..|C|} Pr(c_j) = 1

• Pr(cj) is the probability of class cj in data set D


• We use entropy as a measure of impurity or
disorder of data set D. (Or a measure of
information in a tree)
https://www.youtube.com/watch?v=YtebGVx-
Fxw&ab_channel=StatQuestwithJoshStarmer 106
Entropy measure: let us get a feeling

 As the data become purer and purer, the entropy value becomes smaller and smaller. This is useful to us!
107
Information gain
• Given a set of examples D, we first compute its entropy, entropy(D).
• If we make attribute Ai, with v values, the root of the current tree, this will partition D into v subsets D1, D2, ..., Dv. The expected entropy if Ai is used as the current root:

entropy_Ai(D) = sum_{j=1..v} (|Dj| / |D|) * entropy(Dj)

108
Information gain (cont …)
• Information gained by selecting attribute Ai to branch or to partition the data is:

gain(D, Ai) = entropy(D) - entropy_Ai(D)

• We choose the attribute with the highest gain to branch/split the current tree.

109
An example
entropy(D) = -(6/15) log2(6/15) - (9/15) log2(9/15) = 0.971

entropy_Own_house(D) = (6/15) * entropy(D1) + (9/15) * entropy(D2)
                     = (6/15) * 0 + (9/15) * 0.918 = 0.551

entropy_Age(D) = (5/15) * entropy(D1) + (5/15) * entropy(D2) + (5/15) * entropy(D3)
               = (5/15) * 0.971 + (5/15) * 0.971 + (5/15) * 0.722 = 0.888

Age    | Yes | No | entropy(Di)
young  |  2  |  3 | 0.971
middle |  3  |  2 | 0.971
old    |  4  |  1 | 0.722

 Own_house is the best choice for the root.

110
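The slide's numbers can be verified with a few lines of Python (a sketch; the class counts are those of the loan data above):

    from math import log2

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    print(entropy([6, 9]))  # entropy(D) ~ 0.971

    # Expected entropy after splitting on Own_house (subsets of 6 and 9 examples)
    e_own = (6/15) * entropy([6, 0]) + (9/15) * entropy([3, 6])
    # Expected entropy after splitting on Age (young/middle/old, 5 examples each)
    e_age = (5/15) * entropy([2, 3]) + (5/15) * entropy([3, 2]) + (5/15) * entropy([4, 1])

    print(e_own, e_age)                  # ~0.551 and ~0.888
    print(0.971 - e_own, 0.971 - e_age)  # gains: Own_house has the higher gain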
We build the final tree

[Figure: the final tree, with leaf ratios (3/9) and (6/9)]
 We can use the information gain ratio to evaluate impurity as well

111
Decision Trees
Should I wait at this
restaurant?

112
3. Evaluating a
Classifier

113
Classification Accuracy

• Accuracy: percentage of correct classifications

Accuracy = Total test instances classified correctly / Total number of test instances

114
Evaluating a Classifier:
n-fold Cross Validation
• Suppose m labeled instances
• Divide into n subsets (“folds”) of equal size
• Run classifier n times, with each of the subsets as the test set
• The rest (n-1) for training
• Each run gives an accuracy result

Translated from image by Joan.domenech91 (Own work) [CC BY-SA 3.0


(http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons 115
(https://commons.wikimedia.org/wiki/File:K-fold_cross_validation.jpg)
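A minimal n-fold cross-validation sketch in Python (scikit-learn assumed; n=5, the modeler, and the dataset are illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Runs the classifier 5 times; each fold serves once as the test set
    # while the remaining 4 folds are used for training.
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
    print(scores, scores.mean())  # one accuracy result per run, and their average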
Evaluating a Classifier:
Confusion Matrix

Classified positive Classified negative

Actual positive True positive False negative

Actual negative False positive True negative

TP: number of positive examples classified correctly


FN: number of positive examples classified incorrectly
FP: number of negative examples classified incorrectly
TN: number of negative examples classified correctly 116
Evaluating a Classifier:
Precision and Recall

TP: number of positive examples classified correctly


FN: number of positive examples classified incorrectly
FP: number of negative examples classified incorrectly
TN: number of negative examples classified correctly

Precision = TP / (TP + FP)        Recall = TP / (TP + FN)

Note that the focus is on the positive class 117
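Computing these metrics from raw confusion-matrix counts is direct; a sketch (the counts below are made-up numbers for illustration):

    TP, FN, FP, TN = 40, 10, 5, 45  # hypothetical confusion-matrix counts

    accuracy  = (TP + TN) / (TP + FN + FP + TN)
    precision = TP / (TP + FP)  # of everything classified positive, how much was right
    recall    = TP / (TP + FN)  # of all actual positives, how many were found

    print(accuracy, precision, recall)  # 0.85 0.888... 0.8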


Evaluating a Classifier:
Other Metrics
• There are many other accuracy metrics
• F1-score
• Receiver Operating Characteristic (ROC) curve
• Area Under the Curve (AUC)

118
Evaluating a Classifier:
Other Metrics

• Other accuracy metrics
• F1-score
• Receiver Operating Characteristic (ROC) curve
• Area Under the Curve (AUC)

• Other concerns
• Explainability of classifier results
• Cost of examples
• Cost of feature values
• Labeling

119
Evaluating a Classifier:
What Affects the Performance
• Complexity of the task
• Large numbers of features (high dimensionality)
• Features that appear very few times (sparse data)
• Few instances for a complex classification task
• Missing feature values for instances
• Errors in attribute values for instances
• Errors in the labels of training instances
• Uneven availability of instances in classes
120
Overfitting
• A model overfits the training data when it is very accurate on that data, but may not do so well on new test data

[Figure: Model 1 and Model 2 compared on training data vs. test data]

121
Induction

• Induction requires inferring general rules about examples


seen in the past
• Contrast with deduction: inferring things that are a
logical consequence of what we have seen in the past
• Classifiers use induction: they generate general rules
about the target classes
• The rules are used to make predictions about new data
• These predictions can be wrong

122
When Facing a Classification Task

• What features to choose
• Try defining different features
• For some problems, hundreds and maybe thousands of features may be possible
• Sometimes the features are not directly observable (i.e., there are “latent” variables)
• What classes to choose
• Edible / poisonous?
• Edible / poisonous / unknown?
• How many labeled examples
• May require a lot of work
• What modeler to choose
• Better to try different ones
123
Part II: Classification

Summary of Major Concepts

• Instances, features, values
• Classes, disjoint classes
• Labels, binary tasks
• Learning
• Decision trees
• Modeler
• Ensembles, combination function
• Majority vote, weighted vote
• Induction
• Training and test sets
• Evaluation
• Accuracy, confusion matrix, precision & recall
• N-fold cross validation
• Overfitting
• About the data
• High dimensionality
• Sparse data
• Continuous/discrete values
• Latent variables
124
(Artificial) Neural Networks
• Motivation: human brain
• massively parallel (10^11 neurons, ~20 types)
• small computational units with simple low-bandwidth communication (10^14 synapses, 1-10 ms cycle time)

• Realization: neural network
• units (~neurons) connected by directed weighted links
• activation function from inputs to output

125
Neural Networks (continued)

• neural network = parameterized family of nonlinear functions


• types
• feed-forward (acyclic): single-layer perceptrons, multi-layer
networks
• recurrent (cyclic): Hopfield networks, Boltzmann machines

[also known as connectionism, parallel distributed processing]


126
Neural Network Learning

Key Idea: Adjusting the weights changes the function represented by the
neural network (learning = optimization in weight space).

Iteratively adjust weights to reduce error (difference between network


output and target output).

• Weight Update
• perceptron training rule
• linear programming
• delta rule
• backpropagation

127
Neural Network Learning: Decision Boundary

single-layer perceptron multi-layer network

128
Additional
Study

129
PART III:
PATTERN LEARNING AND
CLUSTERING
Part III: Pattern Learning and Clustering
Topics

PATTERN DETECTION PATTERN LEARNING AND CLUSTERING


PATTERN DISCOVERY

131
1. Pattern
Detection

132
Network Patterns

Subgroups
Strength of ties

Central entities

Patterns of activity over time


133
Spatial Patterns

Patterns

http://bama.ua.edu/~mbonizzoni/research.html 134
Temporal Patterns

[Diagram: a pattern detector finds recurring temporal patterns P1 and P2 in a stream of events.]
http://epthinking.blogspot.com/2009/01/on-event-pattern-detection-vs-event.html 135
Detecting Patterns in a Text String

• ababababab

• abcabcabcabc

• abcccccccabcccabccccccccccabcabccc

136
A Pattern Language

• ababababab
• (ab)*
• abcabcabcabc
• (abc)*
• abcccccccabcccabccccccccccabcabccc
• ((ab)(c)*)*

137
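Patterns in this language map directly onto regular expressions; a minimal Python sketch using the slide's strings:

    import re

    print(re.fullmatch(r"(ab)*", "ababababab") is not None)     # True
    print(re.fullmatch(r"(abc)*", "abcabcabcabc") is not None)  # True
    print(re.fullmatch(r"((ab)(c)*)*",
                       "abcccccccabcccabccccccccccabcabccc") is not None)  # True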
Detecting Patterns in Streaming Data

• (ab)*x*
• abababthsrthwababyertueyrtyertheabsgd
• abcabcabcabc
• abcabcrgkskhgsnrhnabcabcabcabcrjgjsrn

138
Concept Drift

• Over time, the data source changes and the concepts


that were learned in the past have now changed

139
2. Pattern
Learning and
Pattern
Discovery

140
Pattern Detection vs Pattern
Learning
Pattern Detection
• Inputs: data; a set of patterns
• Output: matches of the patterns to the data

Pattern Learning
• Inputs: data annotated with a set of patterns
• Output: a set of patterns that appear in the data with some frequency
141
Pattern Detection vs Pattern Learning

Pattern Learning
• Inputs: data annotated with a set of patterns
• Output: a set of patterns that appear in the data with some frequency

Pattern Discovery
• Inputs: data
• Output: a set of patterns that appear in the data with some frequency
142
3. Clustering

143
Clustering

• Find patterns based on features of instances


• Given:
• A set of instances (datapoints), with feature
values
• Feature vectors
• A target number of clusters (k)
• Find:
• The “best” assignment of instances
(datapoints) to clusters
• “Best”: satisfies some optimization criteria
• “clusters” represent similar instances

144
https://commons.wikimedia.org/wiki/File:DBSCAN-Gaussian-data.svg
K-Means Clustering Algorithm

• User specifies a target number of clusters (k)
• Randomly place k cluster centers
• For each datapoint, attach it to the nearest cluster center
• For each center, find the centroid of all the datapoints attached to it
• Turn the centroids into the new cluster centers
• Repeat until the sum of all the datapoint distances to the cluster centers is minimized
centers is minimized

145
K-Means Clustering (1)

https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png 146
K-Means Clustering (2)

https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png 147
K-Means Clustering (3)

https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png 148
K-Means Clustering (4)

https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png 149
K-Means Clustering (5)

https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png 150
K-Means Clustering (6)

https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png 151
Clustering Methods
• K-Means clustering
• Centroid-based
• Hierarchical clustering
• Attach datapoints to
root points
• Density-based
methods
• Clusters contain a
minimal number of
datapoints
•…
152
https://commons.wikimedia.org/wiki/File:DBSCAN-Gaussian-data.svg
PART IV:
CAUSAL DISCOVERY
Today’s Topics

1. Correlation and causation


2. Causal models
• Bayesian networks
• Markov networks

154
1. Correlation
and Causation

155
Correlation

• Two variables are correlated (associated) when their values are not independent
• Probabilistically speaking

• Examples:
• When people buy chips they are very likely to buy beer
• When people have yellow fingers, they are very likely to smoke
156
Predictive Variables
[Diagram: smoking and cough linked to respiratory disease]

• Some variables are predictive variables because they are correlated with a target variable
• Smoking and coughing are predictive variables for respiratory disease
• BUT: Do predictive variables indicate the causes?
157
Cause and Effect

[Diagram: Smoking (cause) → Respiratory disease → Cough (effect)]

• A variable v1 is a cause for variable v2 if changing v1 changes v2
• Smoking is a cause for respiratory disease
• A variable v2 is an effect of variable v1 if changing v1 changes v2 but changing v2 does not change v1
• Cough is an effect of respiratory disease
158
Latent Variables
[Diagram: Smoking → DNA damage, Carbon monoxide → Respiratory disease → Cough]

• Latent variables are variables that cannot be directly observed, only inferred through a model
• E.g., DNA damage
• E.g., Carbon monoxide inhalation
• Latent variables can be hard to identify, even harder to learn automatically from data
159
Correlation vs Causation

Correlation
• Knowledge of v1 provides information for v2
• E.g.: yellow fingers, cough, smoking, lung cancer
• Can use any data collected (i.e., by simple observation) and do statistical analysis

Causation
• Requires being able to collect specific data that helps show causality (i.e., do experiments)
• Randomized controlled trial
• Select 1000 people, split evenly
• 500 (control), e.g., forced to smoke
• 500 (treatment), e.g., forced not to smoke
• Collect data
• Association persists only when there is a causal relation
160
2. Causal
Models

161
(Probabilistic) Graphical
Model

• Graph that captures


dependencies among
variables
• Nodes are variables
• Links indicate
dependencies
• Probabilities that
represent how the
dependencies work

162
http://www.eecs.berkeley.edu/~wainwrig/icml08/tutorial_icml08.html
Graphical Models

Bayesian Networks
• Graph links have a direction
• Cycles not allowed

Markov Networks
• Graph links do not have a direction
• Cycles are allowed

[Diagram: Smoking and Exposure → Respiratory disease → Cough]
163
http://gordam.themillimetertomylens.com/
Bayesian Networks

• A Bayesian network is a graph


• Directed edges show how
variables influence others
• No cycles allowed
• Conditional probability
distribution (tables or functions)
show the probability of the
value of a variable given the
values of its parent variables
• A variable is only dependent
on its parent variables, not on
its earlier ancestors

164
https://en.wikipedia.org/wiki/Bayesian_network#/media/File:SimpleBayesNet.svg
Bayesian Inference

• Bayesian inference is used


to reason over a Bayesian
network to determine the
probabilities of some
variables given some
observed variables
• Eg: Given that the
grass is wet, what is
the probability that it
is raining?

165
https://en.wikipedia.org/wiki/Bayesian_network#/media/File:SimpleBayesNet.svg
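A sketch of this inference by direct enumeration, using the classic rain/sprinkler/grass-wet numbers from the Wikipedia example the slide references (pure Python; the conditional probability tables are those illustrative values):

    P_rain = {True: 0.2, False: 0.8}
    P_sprinkler = {True: {True: 0.01, False: 0.99},   # P(Sprinkler | Rain)
                   False: {True: 0.4,  False: 0.6}}   # P(Sprinkler | no Rain)
    P_wet = {(True, True): 0.99, (True, False): 0.9,  # P(Wet | Sprinkler, Rain)
             (False, True): 0.8, (False, False): 0.0}

    def joint(rain, sprinkler):
        # P(Rain, Sprinkler, GrassWet=True), reading each factor off its table
        return P_rain[rain] * P_sprinkler[rain][sprinkler] * P_wet[(sprinkler, rain)]

    num = sum(joint(True, s) for s in (True, False))         # P(Rain, GrassWet)
    den = num + sum(joint(False, s) for s in (True, False))  # P(GrassWet)
    print(num / den)  # ~0.36: probability it is raining given the grass is wet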
Markov Networks
• A Markov network is an
undirected graphical
model that includes a
potential function for
each clique of
interconnected nodes

166
http://gordam.themillimetertomylens.com/
Causal Models

• A causal model is a Bayesian network where all


the relationships among variables are causal
• Causal models represent how independent
variables have an effect on dependent
variables
• Causal reasoning uses the probabilities in the
causal model to make inferences about the
value of variables given the values of others
• Eg: Given that the grass is wet, what is the
probability that it rained?
167
Learning Causal Models

Parameter Learning
• Learning the parameters (probabilities) of the model

Structure Learning
• Learning the structure of the model
• Usually more challenging

168
Part IV: Causal Discovery
Summary of Topics Covered

1. Correlation and causation


2. Causal models
• Bayesian networks
• Markov networks

169
Part IV: Causal Discovery
Summary of Major Concepts

• Predictive variables
• Cause and effect
• Latent variables
• Correlation vs causation
• Randomized Controlled Trials
• Probabilistic graphical models
• Bayesian networks
• Markov networks
• Causal models
• Parameter learning
• Structure learning

170
PART V:
SIMULATION AND
MODELING
Simulation
• Simulation is an approach to data analysis that uses a mathematical or formal model of a phenomenon to run different scenarios to make predictions
• E.g., by simulating people in a city and where they drive every day, we can analyze scenarios where there is a flu epidemic and predict people’s behavior changes
• Simulation models can be improved to make predictions that correspond to the observed data

[Images: pedestrian/traffic simulation; air flow over an engine]

https://en.wikipedia.org/wiki/Traffic_simulation#/media/File:WTC_Pedestrian_Modeling.png 172
https://en.wikipedia.org/wiki/Simulation#/media/File:Ugs-nx-5-engine-airflow-simulation.jpg
Example: Landscape Evolution
Work by Chris Duffy, Yu Zhang, and Rudy Slingerland of Penn State University
Example: Landscape Evolution
Simulated evolution of an initially uniform landscape to a complex terrain and river network over 10^8 years.
Example: Analyzing Water Quality
From T. Harmon (UC Merced/CENS)

McConnell SP

SJR confluence
An Example Workflow Sketch for Analyzing
Environmental Data [Gil et al 2011]

California’s Central
Valley:
• Farming, pesticides,
waste
• Water releases
• Restoration efforts
Workflow
Sketch

Data
preparation

Feature
extraction

Models of how
water mixes
with air
(“reaeration”)
and what
chemical
reactions occur
(“metabolism”)
From a Workflow Sketch to a
Computational Workflow
PART VI:
PRACTICAL USE OF
MACHINE LEARNING AND
DATA ANALYSIS
RECAP:
Different Data Analysis Tasks

• Classification: assign a label (i.e., a class) to a new instance given many labeled instances
• Clustering: form clusters (i.e., groups) from a set of instances
• Pattern learning/detection: learn patterns (i.e., regularities) in data
• Causal modeling: learn causal (probabilistic) dependencies among variables
• Simulation modeling: define mathematical formulas that can generate data that is close to observations collected
180
RECAP:
Different Data Analysis Tasks

• Classification
• Clustering
• Pattern learning
• Causal modeling
• Simulation modeling
• ...

• Each type of task is characterized by the kinds of data they require and the kinds of output they generate
• Each type of task uses different algorithms
181
When Facing a Learning Task
• Supervised, unsupervised, or semi-supervised: cost of labels
• Setting up the learning task
• Classification: what classes to choose
• Clustering: how many target clusters
• Causality: what observables
• What data is available
• Collecting data
• Buying data
• What features to choose
• Try defining different features
• For some problems, hundreds and maybe thousands of features may be possible
• Sometimes the features are not directly observable (i.e., there are “latent” variables)
• What learning method
• Better to try different ones
• Scalability: processing
182
Recent Trends: Neural Networks and “Deep Learning”

http://theanalyticsstore.ie/deep-learning/ 183
Trends: Deep Learning in AlphaGo

184
Introduction to Machine Learning and Data Analytics: Topics Covered

I. Machine learning and data analysis tasks
II. Classification
• Classification tasks
• Building a classifier
• Evaluating a classifier
III. Pattern learning and clustering
• Pattern detection
• Pattern learning and pattern discovery
• Clustering
• K-means clustering
IV. Causal discovery
• Correlation
• Causation
• Causal models
• Bayesian networks
• Markov networks
V. Simulation and modeling
VI. Practical use of machine learning and data analysis
185
