05 Preprocessing and Sklearn - Slides

Lecture 05
Data Preprocessing and

Machine Learning with Scikit-Learn
(Computational Foundations Part 3/3)
STAT 451: Intro to Machine Learning, Fall 2020
Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat451-fs2020/
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 1

Lecture 5 (Data Preprocessing and ML with Scikit-Learn)
Topics
1. Reading a Dataset from a Tabular Text File
2. Basic Data Handling
3. Object Oriented Programming (OOP) & Python Classes
4. Machine Learning with Scikit-learn
5. Preparing Training Data & Transformer API
6. Scikit-learn Pipelines
Code notebook: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/

L05/code/05-preprocessing-and-sklearn__notes.ipynb
Where we currently are in this course ...

Machine Learning Workflow
Feature Extraction and Scaling
Feature Selection
Dimensionality Reduction
Sampling
Labels
Training Dataset
Learning
Final Model New Data
Labels Algorithm
Raw Test Dataset

Data
Labels
Preprocessing Learning Evaluation Prediction
Model Selection
Cross-Validation
Performance Metrics
Hyperparameter Optimization

Reading a Dataset from a
Tabular Text File

The Iris Dataset
Iris-Setosa Iris-Versicolor Iris-Virginica
Dataset paper: Fisher, R.A. "The use of multiple measurements in taxonomic problems" Annual Eugenics, 7, Part II, 179-188 (1936);
also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).

Sometimes Useful: Executing "Bash" Terminal
Commands Via "!"

A DataFrame Library for Data Wrangling
https://pandas.pydata.org
pandas is short for "PANel DAta S"
Pandas Paper: McKinney, Wes. "Data structures for statistical computing in

python." Proceedings of the 9th Python in Science Conference. Vol. 445.
2010.

https://pandas.pydata.org

Many additional options exist ...

E.g., processing a large file iteratively ...
Source: https://github.com/rasbt/python_reference/blob/master/useful_scripts/large_csv_to_sqlite.py

For scaling Pandas, also check out
Modin: https://github.com/modin-project/modin
and Dask: https://github.com/dask/dask 

 

Topics

Python Function

Regular Function vs Lambda Function

Column-based Data Processing via Lambda
Functions and ".apply"

Column-based Data Processing via
Dictionaries and ".map"

Quick Inspections via "head" and "tail"

Accessing the Underlying NumPy Array(s) via
the ".values" Attribute

"Creating*" the Label Vector "y" and Design
Matrix "X"
* why did I put "Creating"

in quotation marks?

A Library with Additional Data Science  
& Machine Learning-related Functions
MLXTEND http://rasbt.github.io/mlxtend/
Raschka, Sebastian. "MLxtend: Providing machine learning and data science utilities and
extensions to Python’s scientific computing stack." 
The Journal of Open Source Software 3.24 (2018).

Exploratory Data Analysis (EDA)

(Later, we will see how to do this more conveniently)

Topics
3. Object Oriented Programming (OOP)  

& Python Classes
Python Classes
To get a better understanding of the scikit-learn API, we need to

understand the main concepts behind Object Oriented
Programming (OOP) & classes in Python

Python Classes

Python Classes

Python Classes

Python Classes

Python Classes

Topics
3. Object Oriented Programming (OOP) & Python

Classes
The "Main" Machine Learning Library for
Python
http://scikit-learn.org
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Weiss, R., Dubourg, V. and Vanderplas, J., 2011. Scikit-learn: Machine learning in Python. the Journal of
Machine Learning Research, 12, pp.2825-2830.

The Scikit-learn Estimator API (an OOP Paradigm)

The Scikit-learn Estimator API
Training Training
Data Labels
① est.fit(X_train, y_train)
Model Test Data
② est.predict(X_test)
Predicted
labels

A 3-Nearest Neighbor Classifier & 2 Iris Features

Topics

a sample is inversely proportional to the size of the sample. Let us have a look at an example using
Issues with Random Subsampling ...

the Iris dataset 1 , which we randomly divide into 2/3 training data and 1/3 test data as illustrated in
Figure 1. (The source code for generating this graphic is available on GitHub2 .)
All samples (n = 150)
Training samples (n = 100) Test samples (n = 50)
Figure 1: Distribution of Iris flower classes upon random subsampling into training and test sets.
Sebastian Raschka
1
STAT 451: Intro to ML
https://archive.ics.uci.edu/ml/datasets/iris Lecture 5: Scikit-learn 44
Stratified Splits

Normalization: Min-Max Scaling
[i]
[i]
x − xmin
xnorm =
xmax − xmin

Normalization: Min-Max Scaling
[i]
[i]
x − xmin
xnorm =
xmax − xmin

Normalization: Standardization
[i]
[i]
x − μx
xstd =
σx

[i]
[i]
x − μx
xstd =
σx


Sample vs Population Standard Deviation
1 i=1 [i] 2
n−1∑
sx = (x − x̄)
n
i=1
1
(x [i] − μx)2
n∑
σx =
n

Sample vs Population Standard Deviation
i=1
1
(x [i] − x̄)2
n−1∑
sx =
n
1 i=1 [i]
(x − μx)2
n∑
σx =
n

Scaling Validation and Test Sets

Given 3 training examples:
- example1: 10 cm -> class 2
Estimate:
mean: 20 cm
standard deviation: 8.2 cm

Estimate:
mean: 20 cm
Standardize:
- example1: -1.21 -> class 2
- example2: 0.00 -> class 2

Estimate:
mean: 20 cm
Assume you have the classification rule:

Standardize (z scores): (
class 2 if z  0.6
h(z) =
class 1 otherwise
<latexit sha1_base64="HAx2k5DyXbF25+3p1q+TLdp/cAo=">AAACaHicbVHBTttAEF27tIVAW1NAqOIyImpFL5GdosKlEoILRyo1gBRH0XozTlas19buGBqsiH/sjQ/g0q/oxliIAk9a6enNvJndt0mhpKUwvPX8Vwuv37xdXGotr7x7/yFY/Xhq89II7Ilc5eY84RaV1NgjSQrPC4M8SxSeJRdH8/rZJRorc/2LpgUOMj7WMpWCk5OGwc1k5/or/IA4wbHUlXCj7AxaUCMm/E2VUNxamEEX4EsjgUydcA2xQgg73yGOX3RED4acJmiupMXZvDFGPWpWDYN22AlrwHMSNaTNGpwMgz/xKBdlhprqNf0oLGhQcUNSKDc+Li0WXFzwMfYd1TxDO6jqoGbw2SkjSHPjjiao1ceOimfWTrPEdWacJvZpbS6+VOuXlO4PKqmLklCL+0VpqYBymKcOI2lQkJo6woWR7q4gJtxwQe5vWi6E6OmTn5PTbif61un+3G0fHDZxLLItts12WMT22AE7ZiesxwS785a9dW/D++sH/qb/6b7V9xrPGvsP/vY/S5OzNw==</latexit>

(
Given 3 training examples: class 2 if z  0.6
h(z) =
class 1 otherwise

Given 3 NEW examples:
Estimate: - example4: 5 cm -> class ?
- example5: 6 cm -> class ?
mean: 20 cm
Estimate "new" mean and std.:

Standardize (z scores):
(
Given 3 training examples: class 2 if z  0.6
h(z) =
class 1 otherwise
Estimate:
Estimate "new" mean and std.:
mean: 20 cm

Standardize (z scores):
- example5: -18.37

- example6: -17.15
- example3: 1.21 -> class 1 - example7: -15.92

The Scikit-Learn Transformer API
Training Test
Data Data
①
est.fit(X_train)
② est.transform(X_train) Model est.transform(X_test)③
Transformed Transformed
Training Data Test Data

The Scikit-Learn Transformer API

Working with Categorical Data

Categorical Data -> Ordinal Data

Categorical Data -> Nominal Data
(Class Labels)

One-hot Encoding for Categorical
(Nominal) Features

One-hot Encoding for Categorical
(Nominal) Features

Dealing with Missing Data



Topics

Scikit-Learn Pipelines
(Step 1) (Step 2)
Class labels
Training set Test set
pipeline.fit(…) pipeline.predict(…)
Pipeline
.fit(…) &
Scaling
.transform(…)
.transform(…)
Dimensionality
Reduction
.fit(…) &
.transform(…)
Learning Algorithm .transform(…)
.fit(…)
Predictive Model
Class labels
.predict(…)



(Step 1) (Step 2)
Class labels
pipeline.fit(…) pipeline.predict(…)
Pipeline
.fit(…) &
Scaling
.transform(…)
.transform(…)
Dimensionality
Reduction
.fit(…) &
.transform(…)
Learning Algorithm .transform(…)
.fit(…)
Predictive Model
Class labels
.predict(…)

Model Selection:
Simple Holdout Method
Original dataset
Training set Validation set Test set
Change hyperparameters
and repeat
Machine learning
algorithm
Evaluate
Fit
Predictive model
Final performance estimate

Model Selection:

Model Selection:

Model Selection: Simple Holdout Method

Model Selection:

Randomized Search
https://github.com/scikit-learn/scikit-learn/pull/13900

Successive Halving

Lecture Notes
This time in interactive Jupyter Notebook form:
https://github.com/rasbt/stat451-machine-learning-fs20/blob/
master/L05/code/05-preprocessing-and-sklearn__notes.ipynb
Bonus: Column Transformers for

Heterogenous Data
https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/
L05/code/05-bonus-column-transformer.ipynb

05 Preprocessing and Sklearn - Slides

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

05 Preprocessing and Sklearn - Slides

Uploaded by

Copyright:

Available Formats

Lecture 05

Data Preprocessing and

STAT 451: Intro to Machine Learning, Fall 2020

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 1

2. Basic Data Handling

3. Object Oriented Programming (OOP) & Python Classes

4. Machine Learning with Scikit-learn

5. Preparing Training Data & Transformer API

Code notebook: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 3

Raw Test Dataset

Preprocessing Learning Evaluation Prediction

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 4

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 5

Iris-Setosa Iris-Versicolor Iris-Virginica

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 6

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 7

pandas is short for "PANel DAta S"

Pandas Paper: McKinney, Wes. "Data structures for statistical computing in

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 8

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 9

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 10

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 11

and Dask: https://github.com/dask/dask

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 12

2. Basic Data Handling

3. Object Oriented Programming (OOP) & Python Classes

4. Machine Learning with Scikit-learn

5. Preparing Training Data & Transformer API

Code notebook: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 14

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 15

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 17

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 18

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 19

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 20

* why did I put "Creating"

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 21

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 22

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 23

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 27

2. Basic Data Handling

3. Object Oriented Programming (OOP)

4. Machine Learning with Scikit-learn

5. Preparing Training Data & Transformer API

To get a better understanding of the scikit-learn API, we need to

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 29

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 30

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 31

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 32

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 34

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 35

2. Basic Data Handling

3. Object Oriented Programming (OOP) & Python

4. Machine Learning with Scikit-learn

5. Preparing Training Data & Transformer API

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 39

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 40

Model Test Data

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 41

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 42

2. Basic Data Handling

and Dask: https://github.com/dask/dask 

3. Object Oriented Programming (OOP)