Two Flavors of Machine Learning
A Workflow for ML: Unsupervised and Supervised Learning
Supervised Learning Starts with Feature Extraction

Start with a matrix X of extracted features.
Input data X (features), one row per example:
• Example 1: ✔ Beak, ? Webbed feet, ✔ Quacks
• Example 2: ✔ Beak, ✔ Webbed feet, ✔ Quacks
• Example 3: ✗ Beak, ✗ Webbed feet, ✗ Quacks
Supervised Learning Starts with Feature Extraction, Then Training

Train using a matrix X and a label or class for each row, in vector y.
Input X (features) and y (class):
• Example 1: ✔ Beak, ? Webbed feet, ✔ Quacks → ✔ Is a duck
• Example 2: ✔ Beak, ✔ Webbed feet, ✔ Quacks → ✔ Is a duck
• Example 3: ✗ Beak, ✗ Webbed feet, ✗ Quacks → ✗ Is NOT a duck
Under the Covers…
[Diagram: a feature vector (✔ Beak, ✔ Webbed feet, ✔ Quacks) passes through an unknown function ("?") that outputs the label ✔ Is a duck.]
Then: Prediction
X_new (features) → predicted class:
• New item: ✔ Beak, ? Webbed feet, ✔ Quacks → ✔ Is a duck

How well the classifier does on its training data may be different from how it does on completely new data!
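As a minimal sketch of this workflow (the choice of classifier and the 0/1 encoding of the check marks are illustrative assumptions, not part of the slides), the duck example can be written in scikit-learn as:

from sklearn.tree import DecisionTreeClassifier

# X: one row per example, columns = [beak, webbed_feet, quacks]; 1 = ✔, 0 = ✗ or unknown
X = [[1, 0, 1],
     [1, 1, 1],
     [0, 0, 0]]
y = [True, True, False]           # y: is it a duck?

clf = DecisionTreeClassifier().fit(X, y)

# Prediction on a new, previously unseen feature vector
X_new = [[1, 0, 1]]
print(clf.predict(X_new))         # [ True]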
Types of Supervised Learning
• Classification: y is categorical – each value is from a finite set, e.g., nationality, page will be clicked on, item is a duck
• Regression: y is continuous – each value is numeric within a continuous range, e.g., age, dollars spent
Classification
The Process of Building and Evaluating a Classifier

Train → Validate (optional) → Test
• Train: find the function from X to y; fit parameters
• Validate: tune parameters, choose from multiple models
• Test: assess performance
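A minimal sketch of setting up these stages with scikit-learn, assuming a feature matrix X and label vector y are already in hand (the split proportions are illustrative):

from sklearn.model_selection import train_test_split

# Hold out a test set for the final assessment
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Optionally split the remainder again into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)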
Why the Three Stages?
Challenges Faced in Building a Classifier
• Goal: learn from training data – but generalize!
• Risk of overfitting on the training data
• Need roughly as many observations as features
• Test data helps us to evaluate the final classifier by comparing its predictions vs. the "gold standard"
# Create classifier
clf = Classifier(my_params)
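Continuing the sketch above with a concrete (illustrative) choice of classifier and parameters; DecisionTreeClassifier and accuracy_score are standard scikit-learn calls:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Create classifier with chosen parameters (tuned on the validation set)
clf = DecisionTreeClassifier(max_depth=3)

# Train: fit the function from X to y on the training data only
clf.fit(X_train, y_train)

# Test: assess performance against the "gold standard" test labels
print(accuracy_score(y_test, clf.predict(X_test)))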
An Example
[Hand-drawn decision tree: nodes test "== small?" and "== black?" with True/False branches; one leaf is Result = True (2 instances).]

tree.plot_tree(clf.fit(X, y))

[Plotted tree: one node splits on small <= 0.5 (entropy = 0.918, samples = 3, value = [2, 1]); another node is pure (entropy = 0.0, samples = 2, value = [0, 2]).]
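A runnable sketch of this example (the three rows of toy data are illustrative guesses; the slides do not show the exact training set):

import matplotlib.pyplot as plt
from sklearn import tree

# Toy data: columns = [black, small]; label = whether the result is True
X = [[1, 0],
     [1, 1],
     [0, 1]]
y = [False, True, True]

clf = tree.DecisionTreeClassifier(criterion='entropy')
tree.plot_tree(clf.fit(X, y), feature_names=['black', 'small'])
plt.show()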
Summary: Decision Tree Basics
• Easy to use in SciKit-Learn, although we will want to know more about how they work!
How a Decision Tree Is Built
From the Basics to an Algorithm
The Basic Approach

[Diagram: a node tests a condition c(f) on a feature f, branching on T / F.]

At each point, find a feature (and condition) to split the decision tree:
• Do this at the root, and then divide the training set into subsets depending on the condition
• Repeat recursively on each of the data subsets
A Greedy Heuristic to Build Decision Trees
Split(dataset):
    If the items in dataset are not all of the same class:
        Determine a predicate over a feature that maximizes some notion of information gain (to be defined)
        Partition the dataset according to the predicate
        Split each sub-dataset
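A compact Python sketch of this recursion, using the entropy-based information gain defined on the following slides and assuming binary (0/1) features; this is an illustration of the idea, not scikit-learn's actual implementation:

from collections import Counter
from math import log2

def entropy(labels):
    # H(y) = -sum over classes c of p_c * log2(p_c)
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def split(rows, labels, features):
    # Stop when all items in the dataset are of the same class
    if len(set(labels)) == 1:
        return {'leaf': labels[0]}

    # Greedily pick the feature whose split maximizes information gain
    def gain(f):
        yes = [y for x, y in zip(rows, labels) if x[f]]
        no = [y for x, y in zip(rows, labels) if not x[f]]
        after = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        return entropy(labels) - after

    best = max(features, key=gain)
    yes = [(x, y) for x, y in zip(rows, labels) if x[best]]
    no = [(x, y) for x, y in zip(rows, labels) if not x[best]]
    if not yes or not no:   # no useful split left: return a majority-class leaf
        return {'leaf': Counter(labels).most_common(1)[0][0]}

    # Partition the dataset and recursively split each sub-dataset
    return {'feature': best,
            'true':  split([x for x, _ in yes], [y for _, y in yes], features),
            'false': split([x for x, _ in no],  [y for _, y in no],  features)}

Calling split(X, y, features=[0, 1, 2]) on the duck data above returns a nested dictionary of split features and leaves.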
[Figure: binary entropy H(y_i) as a function of Pr(y_i = 1); the entropy is 0 when the probability is 0 or 1 and peaks at 1 bit when Pr(y_i = 1) = 0.5.]
https://en.wikipedia.org/wiki/Entropy_(information_theory)#/media/File:Binary_entropy_plot.svg
An Example: Entropy
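For the likes_stats column used on the next slide, 4 of the 8 students like stats, so the entropy is 1 bit; an illustrative computation (assuming base-2 logarithms):

from math import log2

# likes_stats over all 8 rows: 4 True and 4 False
p_true, p_false = 4 / 8, 4 / 8

H = -(p_true * log2(p_true) + p_false * log2(p_false))
print(H)   # 1.0 bit: a 50/50 split is maximally uncertain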
An Example: Conditional Entropy
Conditional entropy of input_data['likes_stats'] given input_data['major']:

   major  likes_stats
0  math   False
1  math   False
2  math   True
3  math   True
4  engl   False
5  stat   True
6  stat   True
7  engl   False

def get_subset_likes_stats(term):
    subset = input_data[input_data['major'] == term]['likes_stats']
    return [sum(subset == False) / len(subset),
            sum(subset == True) / len(subset)]

subsets = {}
for major in majors:
    subsets[major] = get_subset_likes_stats(major)
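A sketch of finishing the computation from these per-major subsets (majors is assumed to hold the distinct values of input_data['major']); the conditional entropy weights each subset's entropy by the fraction of rows in that subset:

from math import log2

def entropy(probs):
    # H = -sum of p * log2(p), skipping zero-probability outcomes
    return -sum(p * log2(p) for p in probs if p > 0)

n = len(input_data)
cond_H = sum((input_data['major'] == m).sum() / n * entropy(subsets[m])
             for m in majors)
print(cond_H)   # 0.5 for this data: math is 50/50, engl and stat are pure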
Entropy Isn’t the Only Possible Metric for Information Gain
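For instance, Gini impurity is a common alternative (and the default criterion in scikit-learn's decision trees); a small sketch with illustrative class probabilities:

# Gini impurity of a node with class probabilities p: 1 - sum of p_c^2
def gini(probs):
    return 1 - sum(p * p for p in probs)

print(gini([0.5, 0.5]))   # 0.5  (maximally impure two-class node)
print(gini([1.0, 0.0]))   # 0.0  (pure node)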
Summary: Decision Trees and Split Points

• Choosing a split point (predicate for an intermediate node) is heuristics-based – it is NP-hard to find an optimal split
Major Danger of Decision Trees

Decision trees are prone to overfitting the training data. Some responses:
• Apply PCA in advance to remove correlated features
• Balance positive and negative examples
• Limit the depth of the tree
• Don’t split if below a minimum number of samples
• Prune lower levels of the tree
(The last three are tunable hyperparameters of the decision tree; see the sketch after this list.)
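A minimal sketch of setting these knobs in scikit-learn (the particular values are illustrative, not recommendations; ccp_alpha turns on cost-complexity pruning):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=4,            # limit the depth of the tree
    min_samples_split=10,   # don't split a node with fewer samples than this
    ccp_alpha=0.01)         # prune: larger values remove more of the lower levels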
Another Approach to Make Classifiers More Accurate: Ensembles

Train classifiers over different subsets of the input.

[Diagram: the training data (X_train, y_train) is divided into subsets X1, X2, X3, X4; an ensemble of classifiers c1, c2, c3, c4 is trained separately, one per subset; a combining classifier c_ens takes the votes of the ensemble members.]
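A sketch of this diagram in code, assuming X_train, y_train, X_test are NumPy arrays and the class labels are small non-negative integers (the subset size, number of members, and choice of base classifier are all illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
ensemble = []

# Train one classifier per random subset of the training data (c1 ... c4)
for _ in range(4):
    idx = rng.choice(len(X_train), size=len(X_train) // 2, replace=False)
    ensemble.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# c_ens: combine the votes of the ensemble members by majority
votes = np.array([c.predict(X_test) for c in ensemble])
prediction = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)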
Ensembles of Decision Trees – Random Forests

1. Draw a random bootstrap data sample of size n, with replacement (called “bagging”)
2. Build (“grow”) a tiny decision tree (or “stump”)
   • At each node, randomly select d candidate features without replacement
   • Split the node using the feature with the best split according to the objective function (e.g., information gain)
3. Repeat to produce k decision trees (a forest!)
4. For prediction, use the majority vote of the trees to predict a class for new data
Random Forest Hyperparameters

• Tree depth
• Number of trees in the forest (k)
• Number of candidate features considered at each split (d)
Random Forests in SciKit-Learn
import sklearn.metrics
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=20, max_depth=2)
clf.fit(X_train, y_train)

prediction = clf.predict(X_test)
accuracy = sklearn.metrics.accuracy_score(prediction, y_test)
print("Accuracy: %.1f%%" % (accuracy * 100))
Accuracy: 98.1%
Summary of Decision Trees and Random Forests
Decision trees are:
• Fast
• Scale-invariant
• Can handle categorical and continuous data (CART)
• Understandable (“explainable AI”)
• Prone to overfitting
Random forests:
• Use a form of ensembles called “bagging”
• Highly accurate, parallelizable, and used in practice
• Don’t require hyperparameter search
Random forests are one of the most popular and accurate classification methods for big data!
Except where otherwise noted, this work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.