Learning and its Applications
Dr. Sourav Mandal
What is Learning?
Why Machine Learning?
• No human experts
• industrial/manufacturing control
• mass spectrometer analysis, drug design, astronomic discovery
• Black-box human expertise
• face/handwriting/speech recognition
• driving a car, flying a plane
• Rapidly changing phenomena
• credit scoring, financial modeling
• diagnosis, fraud detection
• Need for customization/personalization
• personalized news reader
• movie/book recommendation
Related Fields
Machine learning sits at the intersection of many fields:
• data mining
• control theory
• statistics
• decision theory
• information theory
• cognitive science
• databases
• psychological models
• evolutionary models
• neuroscience
rote learning (memorization technique)
learning by being told (advice-taking)
learning from examples (induction)
Architecture of a Learning System
(Learning-agent architecture: a critic compares percepts against a performance standard and provides feedback; the learning element proposes changes to the performance element; a problem generator suggests exploratory actions; the agent senses and acts in the ENVIRONMENT.)
Forms of Learning
Learning Element
Dimensions of Learning Systems
type of feedback
representation
use of knowledge
• empirical (knowledge-free)
• analytical (knowledge-guided)
What is machine learning?
• A branch of artificial intelligence, concerned with
the design and development of algorithms that
allow computers to evolve behaviors based on
empirical data.
Learning system model
(Input samples are fed to a learning method, which builds the system; the system is built during a training phase and evaluated during a testing phase.)
Training and testing
Universal set
(unobserved)
Performance
• There are several factors affecting the performance:
• Types of training provided
• The form and extent of any initial background knowledge
• The type of feedback provided
• The learning algorithms used
Algorithms
• The success of a machine learning system also depends on the algorithms used.
Algorithms
• Supervised learning
• Prediction
• Classification (discrete labels), Regression (real values)
• Unsupervised learning
• Clustering
• Probability distribution estimation
• Finding associations (in features)
• Dimension reduction
• Semi-supervised learning
• Reinforcement learning
• Decision making (robot, chess machine)
Algorithms
• Supervised learning
• Unsupervised learning
• Semi-supervised learning
Machine learning structure
• Supervised learning
Inductive (Supervised) Learning
Basic Problem: Given a training set of N example input-output pairs
(x1,y1), (x2,y2), …, (xN,yN),
where each yi was generated by an unknown function y = f(x), discover a function that approximates the true function f.
Inductive (Supervised) Learning
• target function f: X → Y
• example (x, f(x))
• hypothesis g: X → Y such that g(x) ≈ f(x)
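The idea above can be sketched in a few lines: recover a hypothesis g that approximates an unknown target f from (x, f(x)) examples. The target function and the linear hypothesis class here are illustrative assumptions, not part of the slides.

```python
# A minimal sketch of inductive learning: fit a hypothesis g from
# noise-free samples of an "unknown" target function f.
import random

def f(x):                      # target the learner only sees through examples
    return 3.0 * x + 2.0

# Training set of N example input-output pairs (x1,y1), ..., (xN,yN)
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(50)]
ys = [f(x) for x in xs]

# Fit g(x) = a*x + b by ordinary least squares (closed form)
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def g(x):                      # learned hypothesis g: X -> Y
    return a * x + b

print(round(a, 3), round(b, 3))   # recovers the true 3.0 and 2.0
```

With noise-free data the least-squares fit recovers f exactly; with noisy data g only approximates f, which is why the slides write g(x) ≈ f(x).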
Machine learning structure
• Unsupervised learning
Predicting housing price
Classifying Iris Plants
https://en.wikipedia.org/wiki/Iris_setosa
https://en.wikipedia.org/wiki/Iris_versicolor
https://en.wikipedia.org/wiki/Iris_virginica
Supervised Learning
Supervised Learning:
Regression vs. Classification
Regression
Classification
Supervised Learning: Examples
• Email Spam:
• predict whether an email is a junk email (i.e. spam)
Supervised Learning: Examples
• Face Detection/Recognition:
• Identify human faces
Supervised Learning: Examples
• Speech Recognition:
• Identify words spoken according to speech signals
• Automatic voice recognition systems used by airline companies,
automatic stock price reporting, etc.
Supervised Learning:
Linear Regression
What are we seeking?
• Supervised: a low out-of-sample error (E_out), or maximization of the corresponding probabilistic objective
What are we seeking?
Under-fitting vs. over-fitting (fixed N): error as a function of model complexity.
Learning techniques
• Techniques:
• Perceptron
• Logistic regression
• Support vector machine (SVM)
• Adaline
• Multi-layer perceptron (MLP)
Learning techniques
Using the perceptron learning algorithm (PLA):
Training error rate: 0.10; testing error rate: 0.156
Learning techniques
Using logistic regression:
Training error rate: 0.11; testing error rate: 0.145
Learning techniques
• Non-linear case
Learning techniques
• Unsupervised learning categories and techniques
• Clustering
• K-means clustering
• Spectral clustering
• Density Estimation
• Gaussian mixture model (GMM)
• Graphical models
• Dimensionality reduction
• Principal component analysis (PCA)
• Factor analysis
Supervised Learning
Road Map
An example application
• An emergency room in a hospital measures 17 variables
(e.g., blood pressure, age, etc) of newly admitted
patients.
• A decision is needed: whether to put a new patient in an
intensive-care unit.
• Due to the high cost of ICU, those patients who may
survive less than a month are given higher priority.
• Problem: to predict high-risk patients and discriminate
them from low-risk patients.
Another application
The data and the goal
• Data: A set of data records (also called examples, instances
or cases) described by
• k attributes: A1, A2, … Ak.
• a class: Each example is labelled with a pre-defined class.
• Goal: To learn a classification model from the data that can
be used to predict the classes of new (future, or test)
cases/instances.
An example: data (loan application)
• Approved or not
An example: the learning task
• Learn a classification model from the data
• Use the model to classify future loan
applications into
• Yes (approved) and
• No (not approved)
• What is the class for the following case/instance?
Learning Approaches
Semi-Supervised Learning
Supervised vs. unsupervised Learning
• Supervised learning: classification is seen as supervised
learning from examples.
• Supervision: The data (observations, measurements, etc.) are labeled with pre-defined classes, as if a “teacher” gives the classes (supervision).
• Test data are classified into these classes too.
• Unsupervised learning (clustering)
• Class labels of the data are unknown
• Given a set of data, the task is to establish the existence of
classes or clusters in the data
Supervised learning process: two steps
Learning (training): Learn a model using the
training data
Testing: Test the model using unseen test data
to assess the model accuracy
What do we mean by learning?
• Given
• a data set D,
• a task T, and
• a performance measure M,
a computer system is said to learn from D to perform the task
T if after learning the system’s performance on T improves as
measured by M.
• In other words, the learned model helps the system to
perform T better as compared to no learning.
An example
Fundamental assumption of learning
Different Data Analysis Tasks
Different Data Analysis Tasks
• Classification
• Clustering
• Pattern detection
• Causal discovery
• Simulation
• Each type of task is characterized by the kinds of data they require and the kinds of output they generate
• Each type of task uses different algorithms
General Approaches are Adapted to
Specific Kinds of Data
Treat Programs as “Black Boxes”
Programs as Functions: Inputs, Outputs,
and Parameters
Shift key: 3
Original: HELLO
Cipher: KHOOR
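Treating the cipher program as a function makes the inputs/outputs/parameters split concrete. A minimal sketch (note that HELLO → KHOOR corresponds to a Caesar shift of 3):

```python
# A program as a function: input (plaintext), parameter (shift key),
# output (ciphertext). Non-letters pass through unchanged.
def caesar(text, shift):
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

print(caesar("HELLO", 3))   # KHOOR
```

Running the same function with a negative shift inverts it: `caesar("KHOOR", -3)` gives back `HELLO`.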
Workflow as a Composition of Functions
PART II:
CLASSIFICATION
Part II: Classification
Topics
Classification tasks
Building a classifier
Evaluating a classifier
Classifying Mushrooms
• What mushrooms are edible, i.e., not poisonous?
• A book lists many kinds of mushrooms identified as either edible, poisonous, or of unknown edibility
• Given a new kind of mushroom not listed in the book, is it edible?
https://archive.ics.uci.edu/ml/datasets/Mushroom
Classifying Iris Plants
Classification Tasks
• Given:
• A set of classes
• Instances (examples) of each class
• Generate: A method (aka model) that, when given a new instance, will determine its class
http://www.business-insight.com/html/intelligence/bi_overfitting.html
Possible Features
p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
• Class: poisonous - p
• Cap shape: convex – x
• Cap surface: smooth – s
• Cap color: brown – n
• Bruises: true – t
• Odor: pungent – p
•…
https://en.wikipedia.org/wiki/Edible_mushroom#/media/File:Lepista_nuda.jpg
Iris Classification:
“Continuous” Feature Values
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
Describing Many Instances
p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
e,b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m
e,b,y,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,s,m
p,x,y,w,t,p,f,c,n,p,e,e,s,s,w,w,p,w,o,p,k,v,g
e,b,s,y,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,s,m
e,x,y,y,t,l,f,c,b,g,e,c,s,s,w,w,p,w,o,p,n,n,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,s,m
https://commons.wikimedia.org/wiki/File:Twelve_edible_mushrooms_of_the_United_States.jpg
Example of a Model:
A Decision Tree
• Nodes: attribute-based
decisions
• Branches: alternative values of
the attributes
• Leaves: each leaf is a class
https://www.quora.com/What-are-the-disadvantages-of-using-a-decision-tree-for-classification
Using a Decision Tree
High-Level Algorithm to Learn a Decision Tree
• Start with the set of all instances in the root node
• Select the attribute that splits the set best (e.g., more evenly into subsets) and create child nodes
• When a node has all instances in the same class, make it a leaf node
• Iterate until all nodes are leaves
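The high-level steps above can be sketched as a recursive procedure. This is an ID3-style illustration, not the slides' exact algorithm; the data layout (a list of dicts with a 'class' key) is an assumption.

```python
# Sketch of decision-tree learning: recursively split on the attribute
# with the lowest remaining entropy (i.e., the highest information gain).
import math
from collections import Counter

def entropy(rows):
    counts = Counter(r['class'] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(rows, attributes):
    def remainder(attr):
        groups = {}
        for r in rows:
            groups.setdefault(r[attr], []).append(r)
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return min(attributes, key=remainder)     # min remainder = max gain

def build_tree(rows, attributes):
    classes = {r['class'] for r in rows}
    if len(classes) == 1:                     # pure node -> leaf
        return classes.pop()
    if not attributes:                        # no attributes left -> majority leaf
        return Counter(r['class'] for r in rows).most_common(1)[0][0]
    attr = best_attribute(rows, attributes)
    rest = [a for a in attributes if a != attr]
    children = {}
    for r in rows:
        children.setdefault(r[attr], []).append(r)
    return (attr, {v: build_tree(g, rest) for v, g in children.items()})

# Tiny demo: one attribute perfectly separates the classes
demo = [{'windy': 'y', 'class': 'stay'}, {'windy': 'n', 'class': 'go'}]
print(build_tree(demo, ['windy']))   # ('windy', {'y': 'stay', 'n': 'go'})
```

Nodes are (attribute, {value: subtree}) pairs and leaves are class labels, matching the node/branch/leaf description above.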
Classifying a New Instance
(Pipeline: a set of instances goes to a modeler, which produces a model; a classifier applies the model to a new instance and outputs its class.)
Classifying New Instances
(Same pipeline: the classifier applies the model to each new instance, outputting one class per instance.)
Training and Test Sets
(Training instances — the training set — go to the modeler to build the model; test instances — the test set — go to the classifier, which assigns each a class.)
Contamination
(Contamination occurs when test instances leak into the training set; the test set must remain unseen by the modeler.)
About Classification Tasks
2. Building a Classifier
What is a Modeler?
• A mathematical/algorithmic approach to generalize from instances so it can make predictions about instances that it has not seen before
• Its output is called a model
Types of Modelers/Models
• Logistic regression
• Naïve Bayes classifiers
• Support vector machines (SVMs)
• Decision trees
• Random forests
• Kernel methods
• Genetic algorithms
• Neural networks
Explanations
• Decision trees
• Other models (logistic regression, naïve Bayes classifiers, support vector machines (SVMs), random forests, kernel methods, genetic algorithms, neural networks) are mathematical models that are hard to explain and visualize
http://tjo-en.hatenablog.com/entry/2014/01/06/234155
What Modeler to Choose?
• Data scientists try different modelers (e.g., logistic regression, kernel methods) and compare the results
Ensembles
• An ensemble method uses several algorithms that do the same task, and combines their results (“ensemble learning”)
• A combination function joins the results
• Majority vote: each algorithm gets a vote
• Weighted voting: each algorithm’s vote has a weight
• Other complex combination functions
(Diagram: instances go to modelers A, B, and C; their models are joined by the combination function into the final model.)
http://magizbox.com/index.php/machine-learning/ds-model-building/ensemble/
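The majority-vote combination function described above can be sketched in a few lines. The three "models" here are stand-in functions chosen for illustration; a real ensemble would combine trained classifiers.

```python
# Minimal majority-vote ensemble: each model gets one vote, and the
# most common prediction wins.
from collections import Counter

def majority_vote(models, instance):
    votes = [m(instance) for m in models]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical models that classify a number as 'pos' or 'neg'
model_a = lambda x: 'pos' if x > 0 else 'neg'
model_b = lambda x: 'pos' if x > -1 else 'neg'
model_c = lambda x: 'pos' if x > 1 else 'neg'

print(majority_vote([model_a, model_b, model_c], 0.5))  # 'pos' (2 of 3 votes)
```

Weighted voting replaces the vote count with a sum of per-model weights; more complex combination functions can, for example, learn how to weight each model from data.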
Road Map
• Basic concepts
• Decision tree induction [on loan data]
Introduction
The loan data (reproduced)
Approved or not
A decision tree from the loan data
Decision nodes and leaf nodes (classes)
Use the decision tree
• The tree classifies this case as No
Is the decision tree unique?
• No. Here is a simpler tree.
• We want a smaller tree that is still accurate: it is easier to understand and performs better.
(Leaf annotations: 3/9, 6/9)
From a decision tree to a set of rules
• A decision tree can be converted to a set of rules
• Each path from the root to a leaf is a rule
(Leaf annotations: 3/9, 6/9)
Algorithm for decision tree learning
Decision tree learning algorithm
Choose an attribute to partition data
The loan data (reproduced)
Approved or not
Two possible roots, which is better?
Information theory
Information theory (cont …)
• For a fair (honest) coin, you have no information, and you are willing to pay more (say, in terms of $) for advance information: the less you know, the more valuable the information.
Information theory: Entropy measure
• The entropy formula:
entropy(D) = − Σ_j Pr(c_j) · log2 Pr(c_j), summing over the |C| classes, with Σ_j Pr(c_j) = 1
• The entropy after partitioning D with attribute Ai into subsets D_1, …, D_v:
entropy_Ai(D) = Σ_j (|D_j| / |D|) · entropy(D_j)
Information gain (cont …)
• The information gained by selecting attribute Ai to branch or to partition the data is:
gain(D, Ai) = entropy(D) − entropy_Ai(D)
An example
entropy(D) = −(6/15)·log2(6/15) − (9/15)·log2(9/15) = 0.971
entropy_Own_house(D) = (6/15)·entropy(D1) + (9/15)·entropy(D2)
                     = (6/15)·0 + (9/15)·0.918
                     = 0.551
entropy_Age(D) = (5/15)·entropy(D1) + (5/15)·entropy(D2) + (5/15)·entropy(D3)
               = (5/15)·0.971 + (5/15)·0.971 + (5/15)·0.722
               = 0.888

Age    | Yes | No | entropy(Di)
young  |  2  |  3 | 0.971
middle |  3  |  2 | 0.971
old    |  4  |  1 | 0.722
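The worked example above can be checked in code. The sketch below uses only the class counts from the loan data (9 Yes / 6 No overall; the Own_house and Age splits as in the table).

```python
# Reproduce the entropy and information-gain computations for the
# 15-example loan data.
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

e_D = entropy([6, 9])                                   # 0.971

# Own_house splits the data into 6 examples (all one class) and 9 (a 3/6 split)
e_own = 6/15 * entropy([6]) + 9/15 * entropy([3, 6])    # 0.551

# Age splits into young (2 Yes, 3 No), middle (3, 2), old (4, 1)
e_age = 5/15 * entropy([2, 3]) + 5/15 * entropy([3, 2]) + 5/15 * entropy([4, 1])

print(round(e_D, 3), round(e_own, 3), round(e_age, 3))  # 0.971 0.551 0.888
print(round(e_D - e_own, 3), round(e_D - e_age, 3))     # gains: 0.42 0.083
```

Own_house has the larger information gain (0.42 vs. 0.083), so it is the better root attribute, matching the tree built next.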
We build the final tree
• We can use the information gain ratio to evaluate the impurity as well
Decision Trees
Should I wait at this
restaurant?
3. Evaluating a Classifier
Classification Accuracy
Evaluating a Classifier:
n-fold Cross Validation
• Suppose there are m labeled instances
• Divide them into n subsets (“folds”) of equal size
• Run the classifier n times, with each of the subsets as the test set
• The rest (n−1 folds) are used for training
• Each run gives an accuracy result

Precision = TP / (TP + FP)        Recall = TP / (TP + FN)
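The fold bookkeeping and the two formulas above can be sketched directly; the classifier itself is omitted here, since the splitting logic and metrics are the point.

```python
# Sketch of n-fold cross validation plus the precision/recall formulas.
def n_fold_indices(m, n):
    """Split m instance indices into n roughly equal folds."""
    folds = [[] for _ in range(n)]
    for i in range(m):
        folds[i % n].append(i)
    return folds

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

folds = n_fold_indices(10, 5)
assert len(folds) == 5 and all(len(f) == 2 for f in folds)

# Each run uses one fold for testing and the remaining n-1 for training:
for test_fold in folds:
    train = [i for f in folds if f is not test_fold for i in f]
    assert len(train) == 8 and not set(train) & set(test_fold)

print(precision(tp=8, fp=2), recall(tp=8, fn=4))
```

Averaging the n per-run accuracies gives the cross-validated estimate of model accuracy.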
Evaluating a Classifier:
Other Metrics
Evaluating a Classifier:
What Affects the Performance
• Complexity of the task
• Large numbers of features (high dimensionality)
• Features that appear very few times (sparse data)
• Few instances for a complex classification task
• Missing feature values for instances
• Errors in attribute values for instances
• Errors in the labels of training instances
• Uneven availability of instances across classes
Overfitting
• A model overfits the training data when it is very accurate on that data, but may not do as well on new test data
Induction
When Facing a Classification Task
Neural Networks (continued)
Key Idea: Adjusting the weights changes the function represented by the
neural network (learning = optimization in weight space).
• Weight Update
• perceptron training rule
• linear programming
• delta rule
• backpropagation
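The perceptron training rule listed above can be sketched in a few lines: nudge the weights whenever an example is misclassified, so that learning is a search in weight space. The toy data and learning rate are illustrative assumptions.

```python
# Minimal perceptron training rule: w <- w + lr * (target - prediction) * x
def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train_perceptron(data, epochs=20, lr=0.1):
    w = [0.0, 0.0, 0.0]                       # bias weight + two input weights
    for _ in range(epochs):
        for x, target in data:
            x = [1.0] + x                     # prepend constant bias input
            err = target - predict(w, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable toy problem: logical OR
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w = train_perceptron(data)
print([predict(w, [1.0] + x) for x, _ in data])   # [0, 1, 1, 1]
```

The delta rule and backpropagation generalize this same idea: compute an error signal, then move each weight a small step in the direction that reduces it.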
Neural Network Learning: Decision Boundary
Additional Study
PART III:
PATTERN LEARNING AND
CLUSTERING
Part III: Pattern Learning and Clustering
Topics
1. Pattern Detection
Network Patterns
• Subgroups
• Strength of ties
• Central entities
http://bama.ua.edu/~mbonizzoni/research.html
Temporal Patterns
(Figure: a pattern detector finds temporal patterns P1 and P2 in a stream of events.)
http://epthinking.blogspot.com/2009/01/on-event-pattern-detection-vs-event.html
Detecting Patterns in a Text String
• ababababab
• abcabcabcabc
• abcccccccabcccabccccccccccabcabccc
A Pattern Language
• ababababab
• (ab)*
• abcabcabcabc
• (abc)*
• abcccccccabcccabccccccccccabcabccc
• ((ab)(c)*)*
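This toy pattern language maps directly onto regular expressions (the last pattern, ((ab)(c)*)*, is equivalent to (abc*)*), so the matches can be verified with Python's `re` module:

```python
# Check the example strings against their patterns; re.fullmatch requires
# the whole string to fit the pattern.
import re

assert re.fullmatch(r'(ab)*', 'ababababab')
assert re.fullmatch(r'(abc)*', 'abcabcabcabc')
assert re.fullmatch(r'(abc*)*', 'abcccccccabcccabccccccccccabcabccc')
assert not re.fullmatch(r'(ab)*', 'abcabc')   # a non-matching string
print("all patterns match as expected")
```

For the streaming-data case on the next slide, `re.finditer` would locate pattern occurrences embedded in noise rather than requiring a full-string match.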
Detecting Patterns in Streaming Data
• (ab)*x*
• abababthsrthwababyertueyrtyertheabsgd
• abcabcabcabc
• abcabcrgkskhgsnrhnabcabcabcabcrjgjsrn
Concept Drift
2. Pattern Learning and Pattern Discovery
Pattern Detection vs Pattern Learning
Pattern Detection
• Inputs: data, and a set of patterns
• Output: matches of the patterns to the data
Pattern Learning
• Inputs: data annotated with a set of patterns
• Output: a set of patterns that appear in the data with some frequency
Pattern Learning vs Pattern Discovery
Pattern Learning
• Inputs: data annotated with a set of patterns
• Output: a set of patterns that appear in the data with some frequency
Pattern Discovery
• Inputs: data
• Output: a set of patterns that appear in the data with some frequency
3. Clustering
Clustering
https://commons.wikimedia.org/wiki/File:DBSCAN-Gaussian-data.svg
K-Means Clustering Algorithm
K-Means Clustering (steps 1–6)
https://commons.wikimedia.org/wiki/File:K-means_convergence_to_a_local_minimum.png
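The iteration shown in the figure can be sketched for 1-D data: assign points to the nearest centroid, recompute centroids as cluster means, and repeat until assignments stop changing. The data and starting centroids are illustrative assumptions.

```python
# Bare-bones 1-D K-means: alternate assignment and update steps until
# the centroids stop moving (possibly at a local minimum, as the figure shows).
def kmeans(points, centroids, iterations=100):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:                      # assignment step
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:        # converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
centroids, clusters = kmeans(points, [0.0, 20.0])
print(centroids)   # [2.0, 12.0]
```

Different initial centroids can converge to different (locally optimal) clusterings, which is exactly the local-minimum behavior the linked figure illustrates.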
Clustering Methods
• K-means clustering
• Centroid-based
• Hierarchical clustering
• Attach datapoints to root points
• Density-based methods
• Clusters contain a minimal number of datapoints
• …
PART IV:
CAUSAL DISCOVERY
Today’s Topics
1. Correlation
and Causation
Correlation
Cause and Effect
• A variable v1 is a cause for variable v2 if changing v1 changes v2
• Smoking is a cause for respiratory disease
• A variable v3 is an effect of variable v2 if changing v2 changes v3 but changing v3 does not change v2
• Cough is an effect of respiratory disease
(Causal chain: Smoking → Respiratory disease → Cough)
Latent Variables
• Latent variables are variables that cannot be directly observed, only inferred through a model
• E.g., DNA damage
• E.g., carbon monoxide
• Latent variables can be hard to identify, and even harder to learn automatically from data
(Causal diagram: Smoking → DNA damage and carbon monoxide → Respiratory disease → Cough)
Correlation vs Causation
Correlation
• Knowledge of v1 provides information about v2
• E.g.: yellow fingers, cough, smoking, lung cancer
• Can use any data collected (i.e., by simple observation) and do statistical analysis
Causation
• Requires being able to collect specific data that helps show causality (i.e., do experiments)
• Randomized controlled trial:
• Select 1000 people, split evenly
• 500 (control), e.g., forced to smoke
• 500 (treatment), e.g., forced not to smoke
• Collect data
• Association persists only when there is a causal relation
2. Causal Models
(Probabilistic) Graphical Model
http://www.eecs.berkeley.edu/~wainwrig/icml08/tutorial_icml08.html
Graphical Models
(Example: a graph with a directed edge from Respiratory disease to Cough.)
http://gordam.themillimetertomylens.com/
Bayesian Networks
https://en.wikipedia.org/wiki/Bayesian_network#/media/File:SimpleBayesNet.svg
Bayesian Inference
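Inference in a Bayesian network can be done by enumeration. The sketch below uses the rain/sprinkler/wet-grass network from the linked Wikipedia figure; the conditional probability values are the ones given in that example, so treat them as illustrative.

```python
# Inference by enumeration in the rain/sprinkler/grass-wet network:
# compute P(Rain | GrassWet = true).
P_rain = 0.2
P_sprinkler = {True: 0.01, False: 0.4}            # P(Sprinkler | Rain)
P_wet = {(True, True): 0.99, (True, False): 0.9,  # P(GrassWet | Sprinkler, Rain)
         (False, True): 0.8, (False, False): 0.0}

def p_wet_and_rain(rain):
    """Sum out the Sprinkler variable to get P(GrassWet, Rain=rain)."""
    pr = P_rain if rain else 1 - P_rain
    return pr * sum(
        (P_sprinkler[rain] if s else 1 - P_sprinkler[rain]) * P_wet[(s, rain)]
        for s in (True, False))

p_rain_given_wet = p_wet_and_rain(True) / (p_wet_and_rain(True) + p_wet_and_rain(False))
print(round(p_rain_given_wet, 4))   # 0.3577
```

Observing wet grass raises the probability of rain from the prior 0.2 to about 0.36, which is the kind of query a Bayesian network is built to answer.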
Markov Networks
• A Markov network is an undirected graphical model that includes a potential function for each clique of interconnected nodes
Causal Models
Parameter Learning
• Learning the parameters (probabilities) of the model
Structure Learning
• Learning the structure of the model
• Usually more challenging
Part IV: Causal Discovery
Summary of Topics Covered
Part IV: Causal Discovery
Summary of Major Concepts
PART V:
SIMULATION AND
MODELING
Simulation
• Simulation is an approach to data analysis that uses a mathematical or formal model of a phenomenon to run different scenarios and make predictions
• E.g., by simulating people in a city and where they drive every day, we can analyze scenarios where there is a flu epidemic and predict changes in people’s behavior
• Simulation models can be improved to make predictions that correspond to the observed data
(Figures: pedestrian/traffic simulation; air flow over an engine)
https://en.wikipedia.org/wiki/Traffic_simulation#/media/File:WTC_Pedestrian_Modeling.png
https://en.wikipedia.org/wiki/Simulation#/media/File:Ugs-nx-5-engine-airflow-simulation.jpg
Example: Landscape Evolution
Work by Chris Duffy, Yu Zhang, and Rudy Slingerland of Penn State University
Example: Landscape Evolution
Simulated evolution of an initially uniform landscape to a complex terrain and river network over 10^8 years.
Example: Analyzing Water Quality
From T. Harmon (UC Merced/CENS)
(Study sites: McConnell SP, SJR confluence)
An Example Workflow Sketch for Analyzing Environmental Data [Gil et al 2011]
California’s Central Valley:
• Farming, pesticides, waste
• Water releases
• Restoration efforts
Workflow sketch:
• Data preparation
• Feature extraction
• Models of how water mixes with air (“reaeration”) and what chemical reactions occur (“metabolism”)
From a Workflow Sketch to a
Computational Workflow
PART VI:
PRACTICAL USE OF
MACHINE LEARNING AND
DATA ANALYSIS
RECAP:
Different Data Analysis Tasks
• Classification
• Clustering
• Pattern learning
• Causal modeling
• Simulation modeling
• …
• Each type of task is characterized by the kinds of data they require and the kinds of output they generate
• Each type of task uses different algorithms
When Facing a Learning Task
• Supervised, unsupervised, or semi-supervised: cost of labels
• Setting up the learning task
• Classification: what classes to choose
• Clustering: how many target clusters
• Causality: what observables
• What data is available
• Collecting data
• Buying data
• What features to choose
• Try defining different features
• For some problems, hundreds and maybe thousands of features may be possible
• Sometimes the features are not directly observable (i.e., there are “latent” variables)
• What learning method
• Better to try different ones
• Scalability: processing
Recent Trends: Neural Networks and “Deep Learning”
http://theanalyticsstore.ie/deep-learning/
Trends: Deep Learning in AlphaGo
Introduction to Machine Learning and Data Analytics:
Topics Covered
I. Machine learning and data analysis tasks
II. Classification
• Classification tasks
• Building a classifier
• Evaluating a classifier
III. Pattern learning and clustering
• Pattern detection
• Pattern learning and pattern discovery
• Clustering
• K-means clustering
IV. Causal discovery
• Correlation
• Causation
• Causal models
• Bayesian networks
• Markov networks
V. Simulation and modeling
VI. Practical use of machine learning and data analysis