Machine Learning with Python

Sebastian Raschka, Ph.D.


MSU Data Science workshop
East Lansing, Michigan State University • Feb 21, 2018
Today’s focus: machine learning with scikit-learn, and if we have time, a quick
overview of deep learning tools (see the bonus slides).
Tutorial Material on GitHub:
https://github.com/rasbt/msu-datascience-ml-tutorial-2018

Contact:
o E-mail: mail@sebastianraschka.com
o Website: http://sebastianraschka.com
o Twitter: @rasbt
o GitHub: rasbt

Machine learning is used & useful (almost) anywhere

3 Types of Learning

• Supervised
• Unsupervised
• Reinforcement
Working with Labeled Data

Supervised Learning

• Regression: fit a function to predict a continuous y (“output”) from x (“input”);
  the “?” marks the prediction for a new input.
• Classification: separate the classes in the (x1, x2) feature space and predict the
  class label of a new point “?”.
Working with Unlabeled Data

Unsupervised Learning

• Clustering: group similar, unlabeled data points.
• Compression: represent the data using fewer dimensions.
Topics

1. Introduction to Machine Learning


2. Linear Regression
3. Introduction to Classification
4. Feature Preprocessing & scikit-learn Pipelines
5. Dimensionality Reduction: Feature Selection & Extraction
6. Model Evaluation & Hyperparameter Tuning

Simple Linear Regression

ŷ = w0 + w1x

• w0: intercept
• w1: slope, w1 = Δy/Δx
• vertical offset of a training point (xi, yi) from the fitted line: |ŷ − y|
• y: response variable; x: explanatory variable
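Not on the original slide — a minimal sketch of fitting such a line with
scikit-learn's LinearRegression on made-up data:

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: y = 2x + 1 plus a little noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))          # explanatory variable
y = 2.0 * X.ravel() + 1.0 + rng.randn(50)     # response variable

reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_)    # estimates of w0 (intercept) and w1 (slope)
print(reg.predict([[5.0]]))         # y-hat for x = 5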
Data Representation
Columns: features (explanatory variables, independent variables, covariates,
predictors, variables, inputs, attributes)

Rows: training examples (observations, records, instances, samples)

        ⎡ x0,0   x0,1   …   x0,m ⎤          ⎡ y0 ⎤
        ⎢ x1,0   x1,1   …   x1,m ⎥          ⎢ y1 ⎥
X =     ⎢ x2,0   x2,1   …   x2,m ⎥     y =  ⎢ y2 ⎥
        ⎢ x3,0   x3,1   …   x3,m ⎥          ⎢ y3 ⎥
        ⎢   ⋮      ⋮           ⋮ ⎥          ⎢  ⋮ ⎥
        ⎣ xn,0   xn,1   …   xn,m ⎦          ⎣ yn ⎦

y: targets (target variable, response variable, dependent variable, labels, ground truth)
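In scikit-learn, X and y are plain NumPy arrays with exactly this layout — for
example, for the Iris data used later:

from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)   # (150, 4): 150 training examples, 4 features
print(y.shape)   # (150,): one target per training example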
“Basic” Supervised Learning Workflow
1. Split the dataset (data + labels) into a training set and a test set.
2. Fit the model: feed the training data and training labels, together with fixed
   hyperparameter values, to the learning algorithm.
3. Evaluate: let the model predict on the test data and compare the predictions
   against the test labels to estimate performance.
4. Retrain on the full dataset (data + labels) with the same hyperparameter values
   to obtain the final model.
Jupyter Notebook

Topics

1. Introduction to Machine Learning


2. Linear Regression
3. Introduction to Classification
4. Feature Preprocessing & scikit-learn Pipelines
5. Dimensionality Reduction: Feature Selection & Extraction
6. Model Evaluation & Hyperparameter Tuning

Scikit-learn API
class SupervisedEstimator(...):
    def __init__(self, hyperparam, ...):
        ...
    def fit(self, X, y):
        ...
        return self
    def predict(self, X):
        ...
        return y_pred
    def score(self, X, y):
        ...
        return score
    ...
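All supervised scikit-learn estimators follow this interface — a quick sketch
(using LogisticRegression, introduced a few slides below):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

clf = LogisticRegression()   # hyperparameters are set in __init__
clf.fit(X, y)                # fit returns self, so calls can be chained
y_pred = clf.predict(X)      # predictions for (here) the training data
print(clf.score(X, y))       # mean accuracy for classifiers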
Iris Dataset

Iris-Setosa Iris-Versicolor Iris-Virginica

Iris Dataset
Features (columns): sepal length [cm], sepal width [cm], petal length [cm],
petal width [cm]. Samples (rows): 150 flowers.

         sepal    sepal    petal    petal
         length   width    length   width
         [cm]     [cm]     [cm]     [cm]            y (species)
     1    5.1      3.5      1.4      0.2            setosa
     2    4.9      3.0      1.4      0.2            setosa
X =  ⋮     ⋮        ⋮        ⋮        ⋮        y =      ⋮
    50    6.4      3.5      4.5      1.2            versicolor
     ⋮     ⋮        ⋮        ⋮        ⋮                 ⋮
   150    5.9      3.0      5.0      1.8            virginica
Note about Non-Stratified Splits

A random (non-stratified) split can distort the 50/50/50 class proportions, e.g.:

§ training set → 38 × Setosa, 28 × Versicolor, 34 × Virginica
§ test set → 12 × Setosa, 22 × Versicolor, 16 × Virginica
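Passing stratify=y to train_test_split preserves the class proportions in both
subsets — a minimal sketch:

from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

print(Counter(y_train))   # 35 examples of each class
print(Counter(y_test))    # 15 examples of each class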
Linear Regression Recap

(Diagram: linear regression drawn as a single-layer "neuron")

• Input values x1 … xm enter with weight coefficients w1 … wm; a bias unit
  (constant 1) carries the weight w0.
• Net input function: z = w0 + w1x1 + … + wmxm
• Activation function: a = activation(z), which yields the predicted output y.
Linear Regression Recap

(Same diagram as above.) For linear regression, the activation function is simply
the identity function, so the predicted output is ŷ = z = w0 + w1x1 + … + wmxm.
Logistic Regression, a Generalized Linear Model
(a Classifier)

Activation
Bias 1 function
unit w0
x1 w1 z a Predicted
Σ y
class label
x2 w2
..
. wm Net input Unit step
xm function function
Weight
Predicted
Input coefficients
probability
values

21
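A sketch of this probability-then-threshold view with scikit-learn (reduced to a
binary problem for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X, y = X[y != 2], y[y != 2]   # binary: setosa (0) vs versicolor (1)

lr = LogisticRegression().fit(X, y)
proba = lr.predict_proba(X[:3])           # sigmoid output: predicted probabilities
print(proba)
print((proba[:, 1] > 0.5).astype(int))    # unit step at 0.5 -> class labels
print(lr.predict(X[:3]))                  # the same thresholding, done internally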
A “Lazy Learner:” K-Nearest Neighbors Classifier




(Diagram: a new point “?” in the (x1, x2) plane is assigned the majority class
among its k nearest neighbors.)
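A minimal sketch with KNeighborsClassifier on a toy 2D dataset (all numbers made up):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],    # class 0
              [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)   # "lazy": fit just stores the training data

print(knn.predict([[2.0, 2.0]]))   # majority vote among the 3 neighbors -> [0]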
Jupyter Notebook

There are many, many more classification
and regression algorithms ...

http://scikit-learn.org/stable/supervised_learning.html

Topics

1. Introduction to Machine Learning


2. Linear Regression
3. Introduction to Classification
4. Feature Preprocessing & scikit-learn Pipelines
5. Dimensionality Reduction: Feature Selection & Extraction
6. Model Evaluation & Hyperparameter Tuning

Categorical Variables

color   size   price    class label
red     M      $10.49   0
blue    XL     $15.00   1
green   L      $12.99   1
Encoding Categorical Variables (Ordinal vs Nominal)
color   size   price    class label
red     M      $10.49   0
blue    XL     $15.00   1
green   L      $12.99   1

Nominal (one-hot encoding of color):    Ordinal (mapping for size, M < L < XL):
red   blue  green                       size
 1     0     0                           0
 0     1     0                           2
 0     0     1                           1
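A sketch of both encodings with pandas (column names taken from the table above):

import pandas as pd

df = pd.DataFrame([['red',   'M',  10.49, 0],
                   ['blue',  'XL', 15.00, 1],
                   ['green', 'L',  12.99, 1]],
                  columns=['color', 'size', 'price', 'classlabel'])

# ordinal feature: encode the known order M < L < XL
df['size'] = df['size'].map({'M': 0, 'L': 1, 'XL': 2})

# nominal feature: one-hot encode, since colors have no order
df = pd.get_dummies(df, columns=['color'])
print(df)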
Feature Normalization

Min-max scaling: x' = (x − xmin) / (xmax − xmin)
Z-score standardization: z = (x − μ) / σ

feature   minmax   z-score
1.0       0.0      -1.46385
2.0       0.2      -0.87831
3.0       0.4      -0.29277
4.0       0.6       0.29277
5.0       0.8       0.87831
6.0       1.0       1.46385
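The same numbers via scikit-learn (StandardScaler uses the population standard
deviation, which matches the z-scores above):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.arange(1.0, 7.0).reshape(-1, 1)   # the feature column from the table

print(MinMaxScaler().fit_transform(x).ravel())
# [ 0.   0.2  0.4  0.6  0.8  1. ]
print(StandardScaler().fit_transform(x).ravel())
# [-1.46385 -0.87831 -0.29277  0.29277  0.87831  1.46385]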
Scikit-learn API
class UnsupervisedEstimator(...):
    def __init__(self, ...):
        ...
    def fit(self, X):
        ...
        return self
    def transform(self, X):
        ...
        return X_transf
    def predict(self, X):
        ...
        return pred
Scikit-learn Pipelines
Fitting (training data → Pipeline.fit):
  1. Scaling: fit & transform
  2. Dimensionality Reduction: fit & transform
  3. Learning Algorithm: fit → Model

Predicting (test data → Pipeline.predict):
  1. Scaling: transform
  2. Dimensionality Reduction: transform
  3. Model: predict → Class labels
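A sketch of exactly such a pipeline (scaling → dimensionality reduction → classifier):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=2),
                     KNeighborsClassifier(n_neighbors=3))

pipe.fit(X_train, y_train)          # fit & transform each step, then fit the model
print(pipe.score(X_test, y_test))   # transform each step, then predict & score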
Jupyter Notebook

Topics

1. Introduction to Machine Learning


2. Linear Regression
3. Introduction to Classification
4. Feature Preprocessing & scikit-learn Pipelines
5. Dimensionality Reduction: Feature Selection & Extraction
6. Model Evaluation & Hyperparameter Tuning

Dimensionality Reduction – why?

(Figure: pairwise scatterplot matrix of the four Iris features, all in cm; even with
only four dimensions, the raw feature space is hard to inspect visually.)
Dimensionality Reduction – why?
• visualization & interpretability
• predictive performance (can go up when noise is removed, or down when informative
  features are discarded)
• storage & speed
Recursive Feature Elimination
available features: [ f1 f2 f3 f4 ]

[ w1 w2 w3 w4 ]
fit model, remove lowest weight, repeat

[ w1 w2 w4 ]
fit model, remove lowest weight, repeat

[ w1 w4 ]
fit model, remove lowest weight, repeat

[ w4 ]

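scikit-learn's RFE implements this loop — a minimal sketch:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# repeatedly fit the model and drop the feature with the smallest weight
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=1)
rfe.fit(X, y)
print(rfe.ranking_)   # 1 = kept until the end; larger = eliminated earlier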
Sequential Feature Selection
available features: [ f1 f2 f3 f4 ]

[ f1 ] [ f2 ] [ f3 ] [ f4 ] fit model, pick best, repeat

[ f1 f3 ] [ f1 f2 ] [ f1 f4 ]
fit model, pick best, repeat

[ f1 f3 f4 ] [ f1 f3 f2 ]

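A sketch using the SequentialFeatureSelector from mlxtend (the presenter's own
library); newer scikit-learn releases ship a similar class:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# forward selection: start empty, greedily add the best feature, repeat
sfs = SFS(KNeighborsClassifier(n_neighbors=3),
          k_features=3, forward=True, scoring='accuracy', cv=5)
sfs.fit(X, y)
print(sfs.k_feature_idx_)   # indices of the selected feature subset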
Principal Component Analysis

(Figure: a 2D point cloud with its principal axes; PC1 points in the direction of
maximum variance, PC2 is orthogonal to it.)
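A sketch of PCA as a feature extractor (standardizing first, as in the pipeline above):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)        # project onto PC1 and PC2

print(X_pca.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)    # variance captured by each component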
Jupyter Notebook

Topics

1. Introduction to Machine Learning


2. Linear Regression
3. Introduction to Classification
4. Feature Preprocessing & scikit-learn Pipelines
5. Dimensionality Reduction: Feature Selection & Extraction
6. Model Evaluation & Hyperparameter Tuning

“Basic” Supervised Learning Workflow
1. Split the dataset (data + labels) into a training set and a test set.
2. Fit the model: feed the training data and training labels, together with fixed
   hyperparameter values, to the learning algorithm.
3. Evaluate: let the model predict on the test data and compare the predictions
   against the test labels to estimate performance.
4. Retrain on the full dataset (data + labels) with the same hyperparameter values
   to obtain the final model.
Holdout Method and Hyperparameter Tuning 1-3
1. Split the dataset (data + labels) into three parts: a training set, a validation
   set, and a test set.
2. For each candidate set of hyperparameter values, feed the training data and
   training labels to the learning algorithm to fit a model.
3. Let each model predict on the validation data and compare the predictions against
   the validation labels; the model with the best validation performance determines
   the best hyperparameter values.
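A sketch of the three-way split via two train_test_split calls:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# carve off the test set first, then split the rest into training/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=123, stratify=y_tmp)

print(len(X_train), len(X_valid), len(X_test))   # 90 30 30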
Holdout Method and Hyperparameter Tuning 4-6
4. Refit the model with the best hyperparameter values on the combined training +
   validation data and labels.
5. Evaluate: let the model predict on the test data and compare against the test
   labels to estimate its performance.
6. Final model: retrain on the full dataset (data + labels) using the best
   hyperparameter values.
K-fold Cross-Validation
The training set is split into k folds. In each of the k iterations, one fold serves
as the validation fold and the remaining k−1 folds as the training fold:

1st iteration → Performance_1
2nd iteration → Performance_2
3rd iteration → Performance_3
4th iteration → Performance_4
5th iteration → Performance_5
...

Performance = (1/10) · Σ_{i=1}^{10} Performance_i      (for k = 10 folds)

Within one iteration: the learning algorithm (given the hyperparameter values) fits
a model on the training-fold data and labels; the model then predicts on the
validation-fold data, and the predictions are scored against the validation-fold
labels to give Performance_i.
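A minimal sketch with cross_val_score:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=10)
print(scores)            # Performance_1 ... Performance_10
print(np.mean(scores))   # the averaged performance estimate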
K-fold Cross-Validation Workflow 1-3
1. Split the dataset (data + labels) into a training set and a test set.
2. For each candidate set of hyperparameter values, run k-fold cross-validation on
   the training data and labels to estimate the performance of that setting.
3. Refit: train the model with the best hyperparameter values on the whole
   training set.
K-fold Cross-Validation Workflow 4-5

4. Evaluate: let that model predict on the test data and compare against the test
   labels.
5. Final model: retrain on the full dataset (data + labels) using the best
   hyperparameter values.
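GridSearchCV bundles steps 2 and 3 (refit=True, the default, refits the best setting
on the whole training set) — a sketch:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

gs = GridSearchCV(KNeighborsClassifier(),
                  param_grid={'n_neighbors': [1, 3, 5, 7, 9]},
                  cv=10)                  # step 2: 10-fold CV per setting
gs.fit(X_train, y_train)                  # step 3: refit best on training set

print(gs.best_params_)
print(gs.score(X_test, y_test))           # step 4: test-set evaluation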
More info about model evaluation (one of the most
important topics in ML):

https://sebastianraschka.com/blog/index.html

• Model evaluation, model selection, and algorithm selection in machine learning, Part I - The basics
• Model evaluation, model selection, and algorithm selection in machine learning, Part II - Bootstrapping and uncertainties
• Model evaluation, model selection, and algorithm selection in machine learning, Part III - Cross-validation and hyperparameter tuning
Jupyter Notebook

BONUS SLIDES

https://www.tensorflow.org

(Slide: the first page of the TensorFlow white paper, including its Figure 1, an
example TensorFlow code fragment and the corresponding computation graph with
MatMul, Add, and ReLU nodes.)

"TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems"
(Preliminary White Paper, November 9, 2015)
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat,
Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia,
Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané,
Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens,
Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke,
Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg,
Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Google Research.

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf

From the abstract: TensorFlow is an interface for expressing machine learning
algorithms, and an implementation for executing such algorithms. A computation
expressed using TensorFlow can be executed with little or no change on a wide
variety of heterogeneous systems, ranging from mobile devices such as phones and
tablets up to large-scale distributed systems of hundreds of machines and thousands
of computational devices such as GPU cards.
TensorFlow is great at performing highly parallelized numerical computations. In
addition, TensorFlow also supports distributed systems as well as mobile computing
platforms, including Android and Apple's iOS.

Tensors?

But what is a tensor? In simplifying terms, we can think of tensors as
multidimensional arrays of numbers, as a generalization of scalars, vectors, and
matrices.

1. Scalar: ℝ
2. Vector: ℝⁿ
3. Matrix: ℝⁿ × ℝᵐ
4. 3-Tensor: ℝⁿ × ℝᵐ × ℝᵖ
5. …

(Figure: a rank-0 tensor (scalar, dimensions []), a rank-1 tensor (vector,
dimensions [5]), a rank-2 tensor (matrix, dimensions [5, 3]), and a rank-3 tensor
(dimensions [4, 4, 2]), with example element indices such as [2], [0, 0], and
[0, 2, 1].)

When we describe tensors, we refer to their "dimensions" as the rank (or order) of a
tensor, which is not to be confused with the dimensions of a matrix. For instance,
an m × n matrix, where m is the number of rows and n is the number of columns, would
be a special case of a rank-2 tensor.

https://sebastianraschka.com/pdf/books/dlb/appendix_g_tensorflow.pdf
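In NumPy terms (a quick illustration; TensorFlow tensors behave the same way):

import numpy as np

scalar = np.array(5.0)           # rank 0, dimensions ()
vector = np.ones(5)              # rank 1, dimensions (5,)
matrix = np.ones((5, 3))         # rank 2, dimensions (5, 3)
tensor3 = np.ones((4, 4, 2))     # rank 3, dimensions (4, 4, 2)

for t in (scalar, vector, matrix, tensor3):
    print(t.ndim, t.shape)       # ndim is the rank
print(tensor3[0, 2, 1])          # one index per axis, e.g. [0, 2, 1]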
GPUs

Vectorization
import numpy as np

X = np.random.random((num_train_examples, num_features))
W = np.random.random((num_features, num_hidden))

(Figure: the layer computation written as a single matrix product instead of
per-example loops.)
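Not on the slide — a sketch of why vectorization matters, comparing a Python loop
with one matrix-matrix product (sizes are made up):

import numpy as np

num_train_examples, num_features, num_hidden = 1000, 100, 50
X = np.random.random((num_train_examples, num_features))
W = np.random.random((num_features, num_hidden))

# loop version: one training example at a time
out_loop = np.empty((num_train_examples, num_hidden))
for i in range(num_train_examples):
    out_loop[i] = X[i].dot(W)

# vectorized version: a single matrix-matrix product
out_vec = X.dot(W)

print(np.allclose(out_loop, out_vec))   # True; same result, far fewer Python steps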
Computation Graphs
a(x, w, b) = relu(w*x + b)

(Graph: x and w feed a "*" node producing u = wx; u and b feed a "+" node producing
v = u + b; finally a = relu(v).)
Computation Graphs
import tensorflow as tf

g = tf.Graph()
with g.as_default() as g:

    x = tf.placeholder(dtype=tf.float32, shape=None, name='x')

    w = tf.Variable(initial_value=2, dtype=tf.float32, name='w')
    b = tf.Variable(initial_value=1, dtype=tf.float32, name='b')

    u = x * w
    v = u + b
    a = tf.nn.relu(v)

print(x, w, b, u, v, a)

Tensor("x:0", dtype=float32) <tf.Variable 'w:0' shape=() dtype=float32_ref>
<tf.Variable 'b:0' shape=() dtype=float32_ref> Tensor("mul:0", dtype=float32)
Tensor("add:0", dtype=float32) Tensor("Relu:0", dtype=float32)
Computation Graphs
(The graph from before, now with concrete values: w = 2 and b = 1.)

with tf.Session(graph=g) as sess:
    sess.run(init_op)   # assumes init_op = tf.global_variables_initializer()
                        # was created inside the graph context above
    b_res = sess.run('b:0')

print(b_res)

1.0
Backpropagation is the chain rule applied to the computation graph. Forward pass
with x = 3, w = 2, b = 1:  u = wx = 6,  v = u + b = 7,  a = relu(v) = 7.

Local derivatives:  ∂a/∂v = 1 (relu is active),  ∂v/∂b = 1,  ∂v/∂u = 1,  ∂u/∂w = x = 3

∂a/∂b = (∂a/∂v) · (∂v/∂b) = 1 · 1 = 1
∂a/∂w = (∂a/∂v) · (∂v/∂u) · (∂u/∂w) = 1 · 1 · 3 = 3

https://github.com/rasbt/pydata-annarbor2017-dl-tutorial
g = tf.Graph()
with g.as_default() as g:

    x = tf.placeholder(dtype=tf.float32, shape=None, name='x')

    w = tf.Variable(initial_value=2, dtype=tf.float32, name='w')
    b = tf.Variable(initial_value=1, dtype=tf.float32, name='b')

    u = x * w
    v = u + b
    a = tf.nn.relu(v)

    d_a_w = tf.gradients(a, w)
    d_a_b = tf.gradients(a, b)

with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer())
    res = sess.run([d_a_w, d_a_b], feed_dict={'x:0': 3})

[3.0] [1.0]
http://pytorch.org

import torch
import torch.nn.functional as F
from torch.autograd import Variable
from torch.autograd import grad

x = Variable(torch.Tensor([3]))
w = Variable(torch.Tensor([2]), requires_grad=True)
b = Variable(torch.Tensor([1]), requires_grad=True)

u = x * w
v = u + b
a = F.relu(v)

partial_derivatives = grad(a, (w, b))

# "g" instead of "grad" so the imported grad function is not shadowed
for name, g in zip("wb", partial_derivatives):
    print('d_a_%s:' % name, g)

d_a_w: Variable containing:
 3
[torch.FloatTensor of size 1]

d_a_b: Variable containing:
 1
[torch.FloatTensor of size 1]
Multilayer Perceptron

https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/code/ch12/images/12_02.png

TensorFlow version:

g = tf.Graph()
with g.as_default():

    # Input data
    tf_x = tf.placeholder(tf.float32, [None, n_input], name='features')
    tf_y = tf.placeholder(tf.float32, [None, n_classes], name='targets')

    # Model parameters
    weights = {
        'h1': tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1)),
        'out': tf.Variable(tf.truncated_normal([n_hidden_2, n_classes], stddev=0.1))
    }
    biases = {
        'b1': tf.Variable(tf.zeros([n_hidden_1])),
        'out': tf.Variable(tf.zeros([n_classes]))
    }

    # Multilayer perceptron
    layer_1 = tf.add(tf.matmul(tf_x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    out_layer = tf.matmul(layer_1, weights['out']) + biases['out']

    # Loss and optimizer
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=out_layer, labels=tf_y)
    cost = tf.reduce_mean(loss, name='cost')
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    train = optimizer.minimize(cost, name='train')

    # Prediction
    correct_prediction = tf.equal(tf.argmax(tf_y, 1), tf.argmax(out_layer, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy')

with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = mnist.train.num_examples // batch_size

        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            _, c = sess.run(['train', 'cost:0'],
                            feed_dict={'features:0': batch_x,
                                       'targets:0': batch_y})

PyTorch version:

class MultilayerPerceptron(torch.nn.Module):

    def __init__(self, num_features, num_classes):
        super(MultilayerPerceptron, self).__init__()

        ### 1st hidden layer
        self.linear_1 = torch.nn.Linear(num_features, num_hidden_1)

        ### Output layer
        self.linear_out = torch.nn.Linear(num_hidden_2, num_classes)

    def forward(self, x):
        out = self.linear_1(x)
        out = F.relu(out)
        logits = self.linear_out(out)
        probas = F.softmax(logits, dim=1)
        return logits, probas

model = MultilayerPerceptron(num_features=num_features,
                             num_classes=num_classes)

if torch.cuda.is_available():
    model.cuda()

for epoch in range(num_epochs):
    for batch_idx, (features, targets) in enumerate(train_loader):

        features = Variable(features.view(-1, 28*28))
        targets = Variable(targets)

        if torch.cuda.is_available():
            features, targets = features.cuda(), targets.cuda()

        ### FORWARD AND BACK PROP
        logits, probas = model(features)
        cost = cost_fn(logits, targets)
        optimizer.zero_grad()
        cost.backward()

        ### UPDATE MODEL PARAMETERS
        optimizer.step()
Further Resources

(Four book recommendations, covers shown on the slide:)

• Math-heavy
• Math-free
• scikit-learn intro
• Mix of code & math (~60% scikit-learn)
Thanks for attending!

Tutorial Material on GitHub:


https://github.com/rasbt/msu-datascience-ml-tutorial-2018

Contact:
o E-mail: mail@sebastianraschka.com
o Website: http://sebastianraschka.com
o Twitter: @rasbt
o GitHub: rasbt
