
AI: Machine Learning and its Applications

DR. SOURAV MANDAL
What is Learning?

• “Learning denotes changes in a system that ... enable a system to do the same task … more efficiently the next time.” - Herbert Simon
• “Learning is constructing or modifying representations of what is being experienced.” - Ryszard Michalski
• “Learning is making useful changes in our minds.” - Marvin Minsky

“Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge.”

2
Why Machine Learning?

• No human experts
• industrial/manufacturing control
• mass spectrometer analysis, drug design, astronomic discovery
• Black-box human expertise
• face/handwriting/speech recognition
• driving a car, flying a plane
• Rapidly changing phenomena
• credit scoring, financial modeling
• diagnosis, fraud detection
• Need for customization/personalization
• personalized news reader
• movie/book recommendation

3
Related Fields
[Diagram: machine learning at the center, overlapping with related fields: data mining, control theory, statistics, decision theory, information theory, cognitive science, databases, psychological models, evolutionary models, neuroscience]

Machine learning is primarily concerned with the accuracy and effectiveness of the computer system.

4
Machine Learning Paradigms
• rote learning (memorization technique)
• learning by being told (advice-taking)
• learning from examples (induction)
• learning by analogy
• speed-up learning
• concept learning
• clustering
• discovery
5
Architecture of a Learning System
[Diagram: the learning agent in its ENVIRONMENT. A critic compares percepts against a performance standard and sends feedback to the learning element. The learning element makes changes to the performance element (which maps percepts to actions), draws knowledge from it, and sets learning goals for a problem generator.]
6
Forms of Learning

• Any component of an agent can be improved by learning from data. The improvement depends on four major factors:
• Which component is to be improved.
• What prior knowledge the agent already has.
• What representation is used for the data and the component.
• What feedback is available to learn from.
• E.g., an automated taxi-driving agent might learn a condition-action rule for when to brake (or slow down!) from all previous instances of braking. Seeing many camera images, it can learn the different vehicles and objects in front of the taxi, and can therefore make various decisions itself.

7
Learning Element

Design affected by:


• performance element used
• e.g., utility-based agent, reactive agent, logical agent
• functional component to be learned
• e.g., classifier, evaluation function, perception-action function,
• representation of functional component
• e.g., weighted linear function, logical theory, HMM
• feedback available
• e.g., correct action, reward, relative preferences

8
Dimensions of Learning Systems

type of feedback

• supervised (labeled examples) - like a teacher
• unsupervised (unlabeled examples) - no explicit feedback
• reinforcement (reward) - like scoring good marks in exams

representation

• attribute-based (feature vector)


• relational (first-order logic)

use of knowledge

• empirical (knowledge-free)
• analytical (knowledge-guided)

9
What is machine learning?
• A branch of artificial intelligence, concerned with
the design and development of algorithms that
allow computers to evolve behaviors based on
empirical data.

• As intelligence requires knowledge, it is necessary for the computers to acquire knowledge.

10
What is machine learning?

11
Learning system model

[Diagram: during training, input samples are fed to the learning method to build the system; during testing, the system is applied to new inputs.]

12
Learning system model

13
Training and testing

[Diagram: data acquisition draws a training set (observed) from the universal set (unobserved); practical usage is evaluated on a testing set (unobserved).]
14
Training and testing
• Training is the process of making the system able to
learn.
• No free lunch rule:
• Training set and testing set come from the same distribution
• Need to make some assumptions or bias

15
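To make the training/testing split concrete, here is a minimal sketch in Python (assuming scikit-learn; the iris dataset and the 70/30 split ratio are illustrative choices, not from the slides):

    # Hold out part of the labeled data as an unseen test set; training and
    # test sets are drawn from the same distribution, per the slide's note.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)
    print(len(X_train), "training instances,", len(X_test), "test instances")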
Performance
• There are several factors affecting the performance:
• Types of training provided
• The form and extent of any initial background knowledge
• The type of feedback provided
• The learning algorithms used

• Two important factors:


• Modeling
• Optimization

16
Algorithms
• The success of machine learning system also
depends on the algorithms.

• The algorithms control the search to find and build


the knowledge structures.

• The learning algorithms should extract useful


information from training examples.

17
Algorithms
• Supervised learning
• Prediction
• Classification (discrete labels), Regression (real values)
• Unsupervised learning
• Clustering
• Probability distribution estimation
• Finding association (in features)
• Dimension reduction
• Semi-supervised learning
• Reinforcement learning
• Decision making (robot, chess machine)

18
Algorithms

[Diagram: supervised learning and unsupervised learning, with semi-supervised learning between them]
19
Machine learning structure
• Supervised learning

20
Inductive (Supervised) Learning
Basic Problem: Given a training set of N example input-output pairs
(x1, y1), (x2, y2), ..., (xN, yN),
where each yi was generated by an unknown function y = f(x),

• Discover a function h (hypothesis) that approximates the true function f.
• Learning is a search through the space of possible hypotheses for one that will perform well, even on new examples beyond the training set.

21
Inductive (Supervised) Learning

Basic Problem: Induce a representation of a function (a systematic relationship between inputs and outputs) from examples.

• target function f: X → Y
• example (x, f(x))
• hypothesis g: X → Y such that g(x) = f(x)

x = set of attribute values (attribute-value representation)
x = set of logical sentences (first-order representation)

Y = set of discrete labels (classification)
Y = ℝ (regression)

22
Machine learning structure
• Unsupervised learning

23
Predicting housing price

24
Classifying Iris Plants

• Iris flowers have different sepal and petal shapes:
• Iris Setosa
• Iris Versicolour
• Iris Virginica
• Suppose you are shown lots of examples of each type. Given a new iris flower, what type is it?

https://en.wikipedia.org/wiki/Iris_setosa
https://en.wikipedia.org/wiki/Iris_versicolor 25
https://en.wikipedia.org/wiki/Iris_virginica
Supervised Learning

26
Supervised Learning:
Regression vs. Classification

Regression

• Covers situations where Y is continuous (quantitative)
• E.g., predicting the value of the Dow (stock index) in 6 months, predicting the value of a given house based on various inputs, etc.

Classification

• Covers situations where Y is categorical (qualitative)
• E.g., Will the Dow be up or down in 6 months? Is this email spam or not?

27
Supervised Learning: Examples
• Email Spam:
• predict whether an email is a junk email (i.e. spam)

28
Supervised Learning: Examples

• Handwritten Digit Recognition:


• Identify single digits 0~9 based on images

29
Supervised Learning: Examples

• Face Detection/Recognition:
• Identify human faces

30
Supervised Learning: Examples

• Speech Recognition:
• Identify words spoken according to speech signals
• Automatic voice recognition systems used by airline companies,
automatic stock price reporting, etc.

31
Supervised Learning:
Linear Regression

32
What are we seeking?
• Supervised: low E-out, or maximize probabilistic terms
• E-in: error on the training set
• E-out: error on the testing set

• Unsupervised: minimum quantization error, minimum distance, MAP, MLE (maximum likelihood estimation)

33
What are we seeking?
Under-fitting vs. over-fitting (fixed N)

[Plot: error versus model complexity; training error keeps falling while test error rises again once the model over-fits (model = hypothesis + loss functions)]

34
Learning techniques

• Supervised learning categories and techniques


• Linear classifier (numerical functions)
• Parametric (Probabilistic functions)
• Naïve Bayes, Gaussian discriminant analysis (GDA), Hidden Markov models
(HMM), Probabilistic graphical models
• Non-parametric (Instance-based functions)
• K-nearest neighbors, Kernel regression, Kernel density estimation, Local
regression
• Non-metric (Symbolic functions)
• Classification and regression tree (CART), decision tree
• Aggregation
• Bagging (bootstrap + aggregation), Adaboost, Random forest
35
Learning techniques
• Linear classifier: f(x) = sign(w·x + b), where w is a d-dim vector (learned)

• Techniques:
• Perceptron
• Logistic regression
• Support vector machine (SVM)
• Ada-line
• Multi-layer perceptron (MLP)

36
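As a concrete illustration of the linear form above, here is a minimal perceptron learning algorithm (PLA) sketch in Python (NumPy assumed; the toy data and the fixed pass count are illustrative, not from the slides):

    import numpy as np

    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 1.0]])  # inputs
    y = np.array([1, 1, -1, -1])                                       # labels in {-1, +1}

    w = np.zeros(2)  # the d-dim weight vector to be learned
    b = 0.0          # bias

    for _ in range(100):                        # fixed number of passes over the data
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # point is misclassified
                w += yi * xi                    # PLA update: move boundary toward xi
                b += yi

    print("learned w:", w, "b:", b)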
Learning techniques
Using the perceptron learning algorithm (PLA)

[Plots: the learned linear boundary on training and testing data]
Training error rate: 0.10; Testing error rate: 0.156
37
Learning techniques
Using logistic regression

[Plots: the learned boundary on training and testing data]
Training error rate: 0.11; Testing error rate: 0.145
38
Learning techniques
• Non-linear case

• Support vector machine (SVM):


• Linear to nonlinear: Feature transform and kernel function

39
Learning techniques
• Unsupervised learning categories and techniques
• Clustering
• K-means clustering
• Spectral clustering
• Density Estimation
• Gaussian mixture model (GMM)
• Graphical models
• Dimensionality reduction
• Principal component analysis (PCA)
• Factor analysis

40
Supervised Learning

41
Road Map

• Basic concepts
• Decision tree induction

42
An example application
• An emergency room in a hospital measures 17 variables
(e.g., blood pressure, age, etc) of newly admitted
patients.
• A decision is needed: whether to put a new patient in an
intensive-care unit.
• Due to the high cost of ICU, those patients who may
survive less than a month are given higher priority.
• Problem: to predict high-risk patients and discriminate
them from low-risk patients.

43
Another application

• A credit card company receives thousands of applications for


new cards. Each application contains information about an
applicant,
• age
• marital status
• annual salary
• outstanding debts
• credit rating
• etc.
• Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.
44
Machine learning and our focus
• Like human learning from past experiences.
• A computer does not have “experiences”.
• A computer system learns from data, which represent some
“past experiences” of an application domain.
• Our focus: learn a target function that can be used to
predict the values of a discrete class attribute, e.g., approve
or not-approved, and high-risk or low risk.
• The task is commonly called: Supervised learning,
classification, or inductive learning.

45
The data and the goal
• Data: A set of data records (also called examples, instances
or cases) described by
• k attributes: A1, A2, … Ak.
• a class: Each example is labelled with a pre-defined class.
• Goal: To learn a classification model from the data that can
be used to predict the classes of new (future, or test)
cases/instances.

46
An example: data (loan application)

• Approved or not

47
An example: the learning task
• Learn a classification model from the data
• Use the model to classify future loan
applications into
• Yes (approved) and
• No (not approved)
• What is the class for following case/instance?

48
Learning Approaches

• Supervised Learning: the training data is annotated with information to help the learning system
• Unsupervised Learning: the training data is not annotated with any extra information to help the learning system
• Semi-Supervised Learning
49
Supervised vs. unsupervised Learning
• Supervised learning: classification is seen as supervised
learning from examples.
• Supervision: The data (observations, measurements, etc.)
are labeled with pre-defined classes. It is like that a
“teacher” gives the classes (supervision).
• Test data are classified into these classes too.
• Unsupervised learning (clustering)
• Class labels of the data are unknown
• Given a set of data, the task is to establish the existence of
classes or clusters in the data

50
Supervised learning process: two steps
 Learning (training): Learn a model using the training data
 Testing: Test the model using unseen test data to assess the model accuracy

Accuracy = Number of correct classifications / Total number of test cases

51
What do we mean by learning?
• Given
• a data set D,
• a task T, and
• a performance measure M,
a computer system is said to learn from D to perform the task
T if after learning the system’s performance on T improves as
measured by M.
• In other words, the learned model helps the system to
perform T better as compared to no learning.

52
An example

• Data: Loan application data


• Task: Predict whether a loan should be approved or not.
• Performance measure: accuracy.

No learning: classify all future applications (test data) to the


majority class (i.e., Yes):
Accuracy = 9/15 = 60%.
• We can do better than 60% with learning.

53
Fundamental assumption of learning

Assumption: The distribution of training examples is identical to


the distribution of test examples (including future unseen
examples).

• In practice, this assumption is often violated to certain degree.


• Strong violations will clearly result in poor classification
accuracy.
• To achieve good accuracy on the test data, training examples
must be sufficiently representative of the test data.

54
Different Data Analysis Tasks

• Classification: assign a category (i.e., a class) to a new instance
• Clustering: form clusters (i.e., groups) from a set of instances
• Pattern detection: identify regularities (i.e., patterns) in temporal or spatial data
• Simulation: define mathematical formulas that can generate data like the observations collected

55
Different Data Analysis Tasks
• Classification
• Clustering
• Pattern detection
• Causal discovery
• Simulation

• Each type of task is characterized by the kinds of data they require and the kinds of output they generate
• Each type of task uses different algorithms

56
General Approaches are Adapted to
Specific Kinds of Data
Treat Programs as “Black Boxes”

• You don’t have to understand


complex mathematics and
programming in order to use
software
• Therefore, we often refer to
software as a “black box”
• You only need to understand
inputs and outputs and the
program’s function in order to
use it correctly

58
Programs as Functions: Inputs, Outputs,
and Parameters

Shift key: 3
Original: HELLO
Cipher: KHOOR

59
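The shift cipher above can be written directly as a function with an input (plaintext), a parameter (the shift key), and an output (ciphertext); a minimal Python sketch:

    def caesar(text: str, shift: int) -> str:
        result = []
        for ch in text:
            if ch.isalpha():
                base = ord('A') if ch.isupper() else ord('a')
                # rotate the letter by `shift` positions, wrapping around the alphabet
                result.append(chr((ord(ch) - base + shift) % 26 + base))
            else:
                result.append(ch)  # leave non-letters unchanged
        return "".join(result)

    print(caesar("HELLO", 3))  # KHOOR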
Workflow as a Composition of Functions
PART II:
CLASSIFICATION
Part II: Classification
Topics

Classification tasks

Building a classifier

Evaluating a classifier

62
Classifying Mushrooms
• What mushrooms are edible, i.e., not poisonous?
• A book lists many kinds of mushrooms identified as either edible, poisonous, or of unknown edibility
• Given a new kind of mushroom not listed in the book, is it edible?

https://archive.ics.uci.edu/ml/datasets/Mushroom

63
Classifying Iris Plants

• Iris flowers have different sepal and petal shapes:
• Iris Setosa
• Iris Versicolour
• Iris Virginica
• Suppose you are shown lots of examples of each type. Given a new iris flower, what type is it?
https://en.wikipedia.org/wiki/Iris_setosa 64
https://en.wikipedia.org/wiki/Iris_versicolor
https://en.wikipedia.org/wiki/Iris_virginica
1. Classification
Tasks

65
Classification Tasks

• Given:
• A set of classes
• Instances (examples) of each class
• Generate: A method (aka model) that when given a new instance
it will determine its class

66
http://www.business-insight.com/html/intelligence/bi_overfitting.html
Classification Tasks

• Given:
• A set of classes
• Instances of each class
• Generate: A method that when given a new instance it will determine its class

• Instances are described as a set of features or attributes and their values
• The class that the instance belongs to is also called its “label”
• Input is a set of “labeled instances”

67
Possible Features

1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s


2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
68
https://commons.wikimedia.org/wiki/File:Twelve_edible_mushrooms_of_the_United_States.jpg
Describing an Instance

p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
• Class: poisonous - p
• Cap shape: convex – x
• Cap surface: smooth – s
• Cap color: brown – n
• Bruises: true – t
• Odor: pungent – p
•…
69
https://en.wikipedia.org/wiki/Edible_mushroom#/media/File:Lepista_nuda.jpg
Iris Classification:
“Continuous” Feature Values

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
70
Describing Many Instances

p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
e,b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m
e,b,y,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,s,m
p,x,y,w,t,p,f,c,n,p,e,e,s,s,w,w,p,w,o,p,k,v,g
e,b,s,y,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,s,m
e,x,y,y,t,l,f,c,b,g,e,c,s,s,w,w,p,w,o,p,n,n,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,s,m
71
https://commons.wikimedia.org/wiki/File:Twelve_edible_mushrooms_of_the_United_States.jpg
Example of a Model:
A Decision Tree
• Nodes: attribute-based
decisions
• Branches: alternative values of
the attributes
• Leaves: each leaf is a class

72
https://www.quora.com/What-are-the-disadvantages-of-using-a-decision-tree-for-classification
Using a Decision Tree

• Given a new instance, take a path through the tree based on its attributes
• When a leaf is reached, that is the class assigned to the instance

73
https://www.quora.com/What-are-the-disadvantages-of-using-a-decision-tree-for-classification
High-Level Algorithm to
Learn a Decision Tree
• Start with the set of all instances in the root node
• Select the attribute that splits the set best (e.g., most evenly into subsets) and create child nodes
• When a node has all instances in the same class, make it a leaf node
• Iterate until all nodes are leaves

74
https://www.quora.com/What-are-the-disadvantages-of-using-a-decision-tree-for-classification
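A minimal sketch of learning and using a decision tree in Python (scikit-learn assumed; note scikit-learn implements a CART-style learner rather than C4.5, and the iris dataset is an illustrative choice):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Build the tree top-down from all instances, choosing splits by entropy
    tree = DecisionTreeClassifier(criterion="entropy")
    tree.fit(X, y)

    # Classifying an instance follows a root-to-leaf path through the tree
    print(tree.predict(X[:1]))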
Classifying a New Instance
[Diagram: a set of instances is fed to the modeler, which produces a model; the classifier applies the model to a new instance and outputs its class.]
75
Classifying New Instances
[Same diagram: the classifier assigns a class to each of several new instances.]
76
Training and Test Sets
[Diagram: labeled instances are split into training instances (training set) for the modeler and test instances (test set) for evaluating the classifier.]
77
Contamination
[Same diagram. Contamination: when training and test sets overlap - this should NEVER happen.]
78
About Classification Tasks

• Classes must be disjoint, i.e., each instance belongs to only one class
• Classification tasks are “binary” if there are only two classes
• The classification method will rarely be perfect; it will make mistakes in its classification of new instances

79
2. Building a
Classifier

80
What is a Modeler?
[Diagram as before: instances → modeler → model → classifier]

• A mathematical/algorithmic approach to generalize from instances so it can make predictions about instances that it has not seen before
• Its output is called a model
81
Types of Modelers/Models
[Diagram as before]

• Logistic regression
• Naïve Bayes classifiers
• Support vector machines (SVMs)
• Decision trees
• Random forests
• Kernel methods
• Genetic algorithms
• Neural networks
82
Explanations

• Decision trees can be explained and visualized directly
• The other models (logistic regression, naïve Bayes classifiers, SVMs, random forests, kernel methods, genetic algorithms, neural networks) are mathematical models that are hard to explain and visualize
83
[Slides 84-88: worked figures from http://tjo-en.hatenablog.com/entry/2014/01/06/234155]
What Modeler to Choose?
• Data scientists try different modelers, with different parameters, and check the accuracy to figure out which one works best for the data at hand:
• Logistic regression
• Naïve Bayes classifiers
• Support vector machines (SVMs)
• Decision trees
• Random forests
• Kernel methods
• Genetic algorithms (GAs)
• Neural networks: perceptrons
89
Ensembles
[Diagram: instances are fed to modelers A, B, and C; their models are joined by a combination function into a final model.]

• An ensemble method uses several algorithms that do the same task, and combines their results
• “Ensemble learning”
• A combination function joins the results
• Majority vote: each algorithm gets a vote
• Weighted voting: each algorithm’s vote has a weight
• Other complex combination functions
90
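A minimal majority-vote ensemble sketch in Python (scikit-learn assumed; the three base modelers and the dataset are illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # "hard" voting = majority vote; passing weights=... would give weighted voting
    ensemble = VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("nb", GaussianNB()),
                    ("dt", DecisionTreeClassifier())],
        voting="hard")
    ensemble.fit(X, y)
    print(ensemble.predict(X[:1]))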
91
http://magizbox.com/index.php/machine-learning/ds-model-building/ensemble/
Road Map

• Basic concepts
• Decision tree induction [on loan data]

92
Introduction

• Decision tree learning is one of the most widely used


techniques for classification.
• Its classification accuracy is competitive with other methods, and
• it is very efficient.
• The classification model is a tree, called a decision tree.
• C4.5 by Ross Quinlan is perhaps the best-known system. It
can be downloaded from the Web.

93
The loan data (reproduced)
Approved or not

94
A decision tree from the loan data
 Decision nodes and leaf nodes (classes)

95
Use the decision tree

No

96
Is the decision tree unique?
 No. Here is a simpler tree.
 We want a smaller and more accurate tree: easier to understand, and it performs better.
 All current tree-building algorithms are heuristic algorithms

[Figure: a simpler decision tree, with leaf class ratios 3/9 and 6/9]
97
From a decision tree to a set of rules
 A decision tree can be converted to a set of rules
 Each path from the root to a leaf is a rule.

[Figure: the tree with leaf ratios (3/9) and (6/9) and the corresponding rules]

98
Algorithm for decision tree learning

• Basic algorithm (a greedy divide-and-conquer algorithm)


• Assume attributes are categorical now (continuous attributes can be handled
too)
• Tree is constructed in a top-down recursive manner
• At start, all the training examples are at the root
• Examples are partitioned recursively based on selected attributes
• Attributes are selected based on an impurity function (e.g., information gain)
• Conditions for stopping partitioning
• All examples for a given node belong to the same class
• There are no remaining attributes for further partitioning – majority class is the leaf
• There are no examples left

99
Decision tree learning algorithm

100
Choose an attribute to partition data

• The key to building a decision tree - which attribute to


choose in order to branch.
• The objective is to reduce impurity or uncertainty in data
as much as possible.
• A subset of data is pure if all instances belong to the same class.
• The heuristic in C4.5 is to choose the attribute with the
maximum Information Gain or Gain Ratio based on
information theory.

101
The loan data (reproduced)
Approved or not

102
Two possible roots, which is better?

 Fig. (B) seems to be better.

103
Information theory

• Information theory provides a mathematical basis for


measuring the information content.
• To understand the notion of information, think about it as
providing the answer to a question, for example, whether
a coin will come up heads.
• If one already has a good guess about the answer, then the
actual answer is less informative.
• If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).

104
Information theory (cont …)

• For a fair (honest) coin, you have no information, and you are willing to pay more (say, in dollars) for advance information - the less you know, the more valuable the information.

• Information theory uses this same intuition, but instead of measuring


the value for information in dollars, it measures information contents
in bits.

• One bit of information is enough to answer a yes/no question about


which one has no idea, such as the flip of a fair coin.

105
Information theory: Entropy measure
• The entropy formula:

entropy(D) = - sum_{j=1..|C|} Pr(c_j) * log2(Pr(c_j)),  where sum_{j=1..|C|} Pr(c_j) = 1

• Pr(cj) is the probability of class cj in data set D


• We use entropy as a measure of impurity or
disorder of data set D. (Or a measure of
information in a tree)
https://www.youtube.com/watch?v=YtebGVx-
Fxw&ab_channel=StatQuestwithJoshStarmer 106
Entropy measure: let us get a feeling

 As the data become purer and purer, the entropy value becomes smaller and smaller. This is useful to us!
107
Information gain
• Given a set of examples D, we first compute its entropy, entropy(D).
• If we make attribute Ai, with v values, the root of the current tree, this will partition D into v subsets D1, D2, ..., Dv. The expected entropy if Ai is used as the current root:

entropy_Ai(D) = sum_{j=1..v} (|Dj| / |D|) * entropy(Dj)

108
Information gain (cont …)
• Information gained by selecting attribute Ai to branch or to partition the data is:

gain(D, Ai) = entropy(D) - entropy_Ai(D)

• We choose the attribute with the highest gain to branch/split the current tree.

109
An example
entropy(D) = -(6/15) log2(6/15) - (9/15) log2(9/15) = 0.971

entropy_Own_house(D) = (6/15) * entropy(D1) + (9/15) * entropy(D2)
                     = (6/15) * 0 + (9/15) * 0.918 = 0.551

entropy_Age(D) = (5/15) * entropy(D1) + (5/15) * entropy(D2) + (5/15) * entropy(D3)
               = (5/15) * 0.971 + (5/15) * 0.971 + (5/15) * 0.722 = 0.888

Age    | Yes | No | entropy(Di)
young  |  2  |  3 | 0.971
middle |  3  |  2 | 0.971
old    |  4  |  1 | 0.722

 Own_house is the best choice for the root.

110
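The slide's numbers can be verified with a few lines of Python (a sketch; the class counts are those of the loan data above):

    from math import log2

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    print(entropy([6, 9]))  # entropy(D) ~ 0.971

    # Expected entropy after splitting on Own_house (subsets of 6 and 9 examples)
    e_own = (6/15) * entropy([6, 0]) + (9/15) * entropy([3, 6])
    # Expected entropy after splitting on Age (young/middle/old, 5 examples each)
    e_age = (5/15) * entropy([2, 3]) + (5/15) * entropy([3, 2]) + (5/15) * entropy([4, 1])

    print(e_own, e_age)                  # ~0.551 and ~0.888
    print(0.971 - e_own, 0.971 - e_age)  # gains: Own_house has the higher gain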
We build the final tree

[Figure: the final tree, with leaf ratios (3/9) and (6/9)]
 We can use the information gain ratio to evaluate impurity as well

111
Decision Trees
Should I wait at this
restaurant?

112
3. Evaluating a
Classifier

113
Classification Accuracy

• Accuracy: percentage of correct classifications

Accuracy = Total test instances classified correctly / Total number of test instances

114
Evaluating a Classifier:
n-fold Cross Validation
• Suppose m labeled instances
• Divide into n subsets (“folds”) of equal size
• Run classifier n times, with each of the subsets as the test set
• The rest (n-1) for training
• Each run gives an accuracy result

Translated from image by Joan.domenech91 (Own work) [CC BY-SA 3.0


(http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons 115
(https://commons.wikimedia.org/wiki/File:K-fold_cross_validation.jpg)
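A minimal n-fold cross-validation sketch in Python (scikit-learn assumed; n=5, the modeler, and the dataset are illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Runs the classifier 5 times; each fold serves once as the test set
    # while the remaining 4 folds are used for training.
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
    print(scores, scores.mean())  # one accuracy result per run, and their average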
Evaluating a Classifier:
Confusion Matrix

Classified positive Classified negative

Actual positive True positive False negative

Actual negative False positive True negative

TP: number of positive examples classified correctly


FN: number of positive examples classified incorrectly
FP: number of negative examples classified incorrectly
TN: number of negative examples classified correctly 116
Evaluating a Classifier:
Precision and Recall

TP: number of positive examples classified correctly


FN: number of positive examples classified incorrectly
FP: number of negative examples classified incorrectly
TN: number of negative examples classified correctly

Precision = TP / (TP + FP)        Recall = TP / (TP + FN)

Note that the focus is on the positive class 117
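Computing these metrics from raw confusion-matrix counts is direct; a sketch (the counts below are made-up numbers for illustration):

    TP, FN, FP, TN = 40, 10, 5, 45  # hypothetical confusion-matrix counts

    accuracy  = (TP + TN) / (TP + FN + FP + TN)
    precision = TP / (TP + FP)  # of everything classified positive, how much was right
    recall    = TP / (TP + FN)  # of all actual positives, how many were found

    print(accuracy, precision, recall)  # 0.85 0.888... 0.8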


Evaluating a Classifier:
Other Metrics
• There are many other accuracy metrics
• F1-score
• Receiver Operating Characteristic (ROC) curve
• Area Under the Curve (AUC)

118
Evaluating a Classifier:
Other Metrics

• Other accuracy metrics
• F1-score
• Receiver Operating Characteristic (ROC) curve
• Area Under the Curve (AUC)

• Other concerns
• Explainability of classifier results
• Cost of examples
• Cost of feature values
• Labeling

119
Evaluating a Classifier:
What Affects the Performance
• Complexity of the task
• Large numbers of features (high dimensionality)
• Features that appear very few times (sparse data)
• Few instances for a complex classification task
• Missing feature values for instances
• Errors in attribute values for instances
• Errors in the labels of training instances
• Uneven availability of instances in classes
120
Overfitting
• A model overfits the training data when it is very accurate on that data, but may not do so well on new test data

[Figure: Model 1 and Model 2 compared on training data vs. test data]

121
Induction

• Induction requires inferring general rules about examples


seen in the past
• Contrast with deduction: inferring things that are a
logical consequence of what we have seen in the past
• Classifiers use induction: they generate general rules
about the target classes
• The rules are used to make predictions about new data
• These predictions can be wrong

122
When Facing a Classification Task

• What features to choose
• Try defining different features
• For some problems, hundreds and maybe thousands of features may be possible
• Sometimes the features are not directly observable (i.e., there are “latent” variables)
• What classes to choose
• Edible / poisonous?
• Edible / poisonous / unknown?
• How many labeled examples
• May require a lot of work
• What modeler to choose
• Better to try different ones
123
Part II: Classification

Summary of Major Concepts

• Instances, features, values
• Classes, disjoint classes
• Labels, binary tasks
• Learning
• Decision trees
• Modeler
• Ensembles, combination function
• Majority vote, weighted vote
• Induction
• Training and test sets
• Evaluation
• Accuracy, confusion matrix, precision & recall
• N-fold cross validation
• Overfitting
• About the data
• High dimensionality
• Sparse data
• Continuous/discrete values
• Latent variables
124
(Artificial) Neural Networks
• Motivation: human brain
• massively parallel (10^11 neurons, ~20 types)
• small computational units with simple low-bandwidth communication (10^14 synapses, 1-10 ms cycle time)

• Realization: neural network
• units (~neurons) connected by directed weighted links
• activation function from inputs to output

125
Neural Networks (continued)

• neural network = parameterized family of nonlinear functions


• types
• feed-forward (acyclic): single-layer perceptrons, multi-layer
networks
• recurrent (cyclic): Hopfield networks, Boltzmann machines

[also known as connectionism, parallel distributed processing]


126
Neural Network Learning

Key Idea: Adjusting the weights changes the function represented by the
neural network (learning = optimization in weight space).

Iteratively adjust weights to reduce error (difference between network


output and target output).

• Weight Update
• perceptron training rule
• linear programming
• delta rule
• backpropagation

127
Neural Network Learning: Decision Boundary

single-layer perceptron multi-layer network

128
Additional
Study

129
PART III:
PATTERN LEARNING AND
CLUSTERING
Part III: Pattern Learning and Clustering
Topics

PATTERN DETECTION PATTERN LEARNING AND CLUSTERING


PATTERN DISCOVERY

131
1. Pattern
Detection

132
Network Patterns

Subgroups
Strength of ties

Central entities

Patterns of activity over time


133
Spatial Patterns

Patterns

http://bama.ua.edu/~mbonizzoni/research.html 134
Temporal Patterns

[Diagram: a pattern detector finds recurring temporal patterns P1 and P2 in a stream of events.]
http://epthinking.blogspot.com/2009/01/on-event-pattern-detection-vs-event.html 135
Detecting Patterns in a Text String

• ababababab

• abcabcabcabc

• abcccccccabcccabccccccccccabcabccc

136
A Pattern Language

• ababababab
• (ab)*
• abcabcabcabc
• (abc)*
• abcccccccabcccabccccccccccabcabccc
• ((ab)(c)*)*

137
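Patterns in this language map directly onto regular expressions; a minimal Python sketch using the slide's strings:

    import re

    print(re.fullmatch(r"(ab)*", "ababababab") is not None)     # True
    print(re.fullmatch(r"(abc)*", "abcabcabcabc") is not None)  # True
    print(re.fullmatch(r"((ab)(c)*)*",
                       "abcccccccabcccabccccccccccabcabccc") is not None)  # True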
Detecting Patterns in Streaming Data

• (ab)*x*
• abababthsrthwababyertueyrtyertheabsgd
• abcabcabcabc
• abcabcrgkskhgsnrhnabcabcabcabcrjgjsrn

138
Concept Drift

• Over time, the data source changes and the concepts


that were learned in the past have now changed

139
2. Pattern
Learning and
Pattern
Discovery

140
Pattern Detection vs Pattern
Learning
Pattern Detection
• Inputs: data; a set of patterns
• Output: matches of the patterns to the data

Pattern Learning
• Inputs: data annotated with a set of patterns
• Output: a set of patterns that appear in the data with some frequency
141
Pattern Detection vs Pattern Learning

Pattern Learning
• Inputs: data annotated with a set of patterns
• Output: a set of patterns that appear in the data with some frequency

Pattern Discovery
• Inputs: data
• Output: a set of patterns that appear in the data with some frequency
142
3. Clustering

143
Clustering

• Find patterns based on features of instances


• Given:
• A set of instances (datapoints), with feature
values
• Feature vectors
• A target number of clusters (k)
• Find:
• The “best” assignment of instances
(datapoints) to clusters
• “Best”: satisfies some optimization criteria
• “clusters” represent similar instances

144
https://commons.wikimedia.org/wiki/File:DBSCAN-Gaussian-data.svg
K-Means Clustering Algorithm

• User specifies a target number of clusters (k)
• Randomly place k cluster centers
• For each datapoint, attach it to the nearest cluster center
• For each center, find the centroid of all the datapoints attached to it
• Turn the centroids into the new cluster centers
• Repeat until the sum of all the datapoint distances to the cluster centers is minimized
centers is minimized

145
K-Means Clustering (1)

https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png 146
K-Means Clustering (2)

https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png 147
K-Means Clustering (3)

https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png 148
K-Means Clustering (4)

https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png 149
K-Means Clustering (5)

https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png 150
K-Means Clustering (6)

https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png 151
Clustering Methods
• K-Means clustering
• Centroid-based
• Hierarchical clustering
• Attach datapoints to
root points
• Density-based
methods
• Clusters contain a
minimal number of
datapoints
•…
152
https://commons.wikimedia.org/wiki/File:DBSCAN-Gaussian-data.svg
PART IV:
CAUSAL DISCOVERY
Today’s Topics

1. Correlation and causation


2. Causal models
• Bayesian networks
• Markov networks

154
1. Correlation
and Causation

155
Correlation

• Two variables are correlated (associated) when their values are not independent
• Probabilistically speaking

• Examples:
• When people buy chips they are very likely to buy beer
• When people have yellow fingers, they are very likely to smoke
156
Predictive Variables
[Diagram: smoking and cough linked to respiratory disease]

• Some variables are predictive variables because they are correlated with a target variable
• Smoking and coughing are predictive variables for respiratory disease
• BUT: Do predictive variables indicate the causes?
157
Cause and Effect

[Diagram: Smoking (cause) → Respiratory disease → Cough (effect)]

• A variable v1 is a cause for variable v2 if changing v1 changes v2
• Smoking is a cause for respiratory disease
• A variable v2 is an effect of variable v1 if changing v1 changes v2 but changing v2 does not change v1
• Cough is an effect of respiratory disease
158
Latent Variables
[Diagram: Smoking → DNA damage, Carbon monoxide → Respiratory disease → Cough]

• Latent variables are variables that cannot be directly observed, only inferred through a model
• E.g., DNA damage
• E.g., Carbon monoxide inhalation
• Latent variables can be hard to identify, even harder to learn automatically from data
159
Correlation vs Causation

Correlation
• Knowledge of v1 provides information for v2
• E.g.: yellow fingers, cough, smoking, lung cancer
• Can use any data collected (i.e., by simple observation) and do statistical analysis

Causation
• Requires being able to collect specific data that helps show causality (i.e., do experiments)
• Randomized controlled trial
• Select 1000 people, split evenly
• 500 (control), e.g., forced to smoke
• 500 (treatment), e.g., forced not to smoke
• Collect data
• Association persists only when there is a causal relation
160
2. Causal
Models

161
(Probabilistic) Graphical
Model

• Graph that captures


dependencies among
variables
• Nodes are variables
• Links indicate
dependencies
• Probabilities that
represent how the
dependencies work

162
http://www.eecs.berkeley.edu/~wainwrig/icml08/tutorial_icml08.html
Graphical Models

Bayesian Networks
• Graph links have a direction
• Cycles not allowed

Markov Networks
• Graph links do not have a direction
• Cycles are allowed

[Diagram: Smoking and Exposure → Respiratory disease → Cough]
163
http://gordam.themillimetertomylens.com/
Bayesian Networks

• A Bayesian network is a graph


• Directed edges show how
variables influence others
• No cycles allowed
• Conditional probability
distribution (tables or functions)
show the probability of the
value of a variable given the
values of its parent variables
• A variable is only dependent
on its parent variables, not on
its earlier ancestors

164
https://en.wikipedia.org/wiki/Bayesian_network#/media/File:SimpleBayesNet.svg
Bayesian Inference

• Bayesian inference is used


to reason over a Bayesian
network to determine the
probabilities of some
variables given some
observed variables
• Eg: Given that the
grass is wet, what is
the probability that it
is raining?

165
https://en.wikipedia.org/wiki/Bayesian_network#/media/File:SimpleBayesNet.svg
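A sketch of this inference by direct enumeration, using the classic rain/sprinkler/grass-wet numbers from the Wikipedia example the slide references (pure Python; the conditional probability tables are those illustrative values):

    P_rain = {True: 0.2, False: 0.8}
    P_sprinkler = {True: {True: 0.01, False: 0.99},   # P(Sprinkler | Rain)
                   False: {True: 0.4,  False: 0.6}}   # P(Sprinkler | no Rain)
    P_wet = {(True, True): 0.99, (True, False): 0.9,  # P(Wet | Sprinkler, Rain)
             (False, True): 0.8, (False, False): 0.0}

    def joint(rain, sprinkler):
        # P(Rain, Sprinkler, GrassWet=True), reading each factor off its table
        return P_rain[rain] * P_sprinkler[rain][sprinkler] * P_wet[(sprinkler, rain)]

    num = sum(joint(True, s) for s in (True, False))         # P(Rain, GrassWet)
    den = num + sum(joint(False, s) for s in (True, False))  # P(GrassWet)
    print(num / den)  # ~0.36: probability it is raining given the grass is wet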
Markov Networks
• A Markov network is an
undirected graphical
model that includes a
potential function for
each clique of
interconnected nodes

166
http://gordam.themillimetertomylens.com/
Causal Models

• A causal model is a Bayesian network where all


the relationships among variables are causal
• Causal models represent how independent
variables have an effect on dependent
variables
• Causal reasoning uses the probabilities in the
causal model to make inferences about the
value of variables given the values of others
• Eg: Given that the grass is wet, what is the
probability that it rained?
167
Learning Causal Models

Parameter Learning
• Learning the parameters (probabilities) of the model

Structure Learning
• Learning the structure of the model
• Usually more challenging

168
Part IV: Causal Discovery
Summary of Topics Covered

1. Correlation and causation


2. Causal models
• Bayesian networks
• Markov networks

169
Part IV: Causal Discovery
Summary of Major Concepts

• Predictive variables
• Cause and effect
• Latent variables
• Correlation vs causation
• Randomized Controlled Trials
• Probabilistic graphical models
• Bayesian networks
• Markov networks
• Causal models
• Parameter learning
• Structure learning

170
PART V:
SIMULATION AND
MODELING
Simulation
• Simulation is an approach to data analysis that uses a mathematical or formal model of a phenomenon to run different scenarios to make predictions
• E.g., by simulating people in a city and where they drive every day, we can analyze scenarios where there is a flu epidemic and predict people’s behavior changes
• Simulation models can be improved to make predictions that correspond to the observed data

[Images: pedestrian/traffic simulation; air flow over an engine]

https://en.wikipedia.org/wiki/Traffic_simulation#/media/File:WTC_Pedestrian_Modeling.png 172
https://en.wikipedia.org/wiki/Simulation#/media/File:Ugs-nx-5-engine-airflow-simulation.jpg
Example: Landscape Evolution
Work by Chris Duffy, Yu Zhang, and Rudy Slingerland of Penn State University
Example: Landscape Evolution
Simulated evolution of an initially uniform landscape to a complex terrain and river network over 10^8 years.
Example: Analyzing Water Quality
From T. Harmon (UC Merced/CENS)

McConnell SP

SJR confluence
An Example Workflow Sketch for Analyzing
Environmental Data [Gil et al 2011]

California’s Central
Valley:
• Farming, pesticides,
waste
• Water releases
• Restoration efforts
Workflow
Sketch

Data
preparation

Feature
extraction

Models of how
water mixes
with air
(“reaeration”)
and what
chemical
reactions occur
(“metabolism”)
From a Workflow Sketch to a
Computational Workflow
PART VI:
PRACTICAL USE OF
MACHINE LEARNING AND
DATA ANALYSIS
RECAP:
Different Data Analysis Tasks

• Classification: assign a label (i.e., a class) to a new instance given many labeled instances
• Clustering: form clusters (i.e., groups) from a set of instances
• Pattern learning/detection: learn patterns (i.e., regularities) in data
• Causal modeling: learn causal (probabilistic) dependencies among variables
• Simulation modeling: define mathematical formulas that can generate data that is close to observations collected
180
RECAP:
Different Data Analysis Tasks

• Classification
• Clustering
• Pattern learning
• Causal modeling
• Simulation modeling
• ...

• Each type of task is characterized by the kinds of data they require and the kinds of output they generate
• Each type of task uses different algorithms
181
When Facing a Learning Task
• Supervised, unsupervised, or semi-supervised: cost of labels
• Setting up the learning task
• Classification: what classes to choose
• Clustering: how many target clusters
• Causality: what observables
• What data is available
• Collecting data
• Buying data
• What features to choose
• Try defining different features
• For some problems, hundreds and maybe thousands of features may be possible
• Sometimes the features are not directly observable (i.e., there are “latent” variables)
• What learning method
• Better to try different ones
• Scalability: processing
182
Recent Trends: Neural Networks and “Deep Learning”

http://theanalyticsstore.ie/deep-learning/ 183
Trends: Deep Learning in AlphaGo

184
Introduction to Machine Learning and Data Analytics: Topics Covered

I. Machine learning and data analysis tasks
II. Classification
• Classification tasks
• Building a classifier
• Evaluating a classifier
III. Pattern learning and clustering
• Pattern detection
• Pattern learning and pattern discovery
• Clustering
• K-means clustering
IV. Causal discovery
• Correlation
• Causation
• Causal models
• Bayesian networks
• Markov networks
V. Simulation and modeling
VI. Practical use of machine learning and data analysis
185
