You are on page 1of 81

Lecture 05

Data Preprocessing and


Machine Learning with Scikit-Learn
(Computational Foundations Part 3/3)

STAT 451: Intro to Machine Learning, Fall 2020

Sebastian Raschka

http://stat.wisc.edu/~sraschka/teaching/stat451-fs2020/

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 1


Lecture 5 (Data Preprocessing and ML with Scikit-Learn)
Topics
1. Reading a Dataset from a Tabular Text File

2. Basic Data Handling

3. Object Oriented Programming (OOP) & Python Classes

4. Machine Learning with Scikit-learn

5. Preparing Training Data & Transformer API

6. Scikit-learn Pipelines

Code notebook: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/


L05/code/05-preprocessing-and-sklearn__notes.ipynb
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 2
Where we currently are in this course ...

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 3


Machine Learning Workflow
Feature Extraction and Scaling
Feature Selection
Dimensionality Reduction
Sampling

Labels

Training Dataset
Learning
Final Model New Data
Labels Algorithm

Raw Test Dataset


Data
Labels

Preprocessing Learning Evaluation Prediction

Model Selection
Cross-Validation
Performance Metrics
Hyperparameter Optimization

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 4


Reading a Dataset from a
Tabular Text File

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 5


The Iris Dataset

Iris-Setosa Iris-Versicolor Iris-Virginica

Dataset paper: Fisher, R.A. "The use of multiple measurements in taxonomic problems" Annual Eugenics, 7, Part II, 179-188 (1936);
also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 6


Sometimes Useful: Executing "Bash" Terminal
Commands Via "!"

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 7


A DataFrame Library for Data Wrangling

https://pandas.pydata.org

pandas is short for "PANel DAta S"

Pandas Paper: McKinney, Wes. "Data structures for statistical computing in


python." Proceedings of the 9th Python in Science Conference. Vol. 445.
2010.

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 8


https://pandas.pydata.org

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 9


Many additional options exist ...

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 10


E.g., processing a large file iteratively ...

Source: https://github.com/rasbt/python_reference/blob/master/useful_scripts/large_csv_to_sqlite.py

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 11


For scaling Pandas, also check out

Modin: https://github.com/modin-project/modin

and Dask: https://github.com/dask/dask



Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 12


Lecture 5 (Data Preprocessing and ML with Scikit-Learn)
Topics
1. Reading a Dataset from a Tabular Text File

2. Basic Data Handling

3. Object Oriented Programming (OOP) & Python Classes

4. Machine Learning with Scikit-learn

5. Preparing Training Data & Transformer API

6. Scikit-learn Pipelines

Code notebook: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/


L05/code/05-preprocessing-and-sklearn__notes.ipynb
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 13
Python Function

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 14


Regular Function vs Lambda Function

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 15


Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 16
Column-based Data Processing via Lambda
Functions and ".apply"

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 17


Column-based Data Processing via
Dictionaries and ".map"

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 18


Quick Inspections via "head" and "tail"

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 19


Accessing the Underlying NumPy Array(s) via
the ".values" Attribute

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 20


"Creating*" the Label Vector "y" and Design
Matrix "X"

* why did I put "Creating"


in quotation marks?

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 21


A Library with Additional Data Science 

& Machine Learning-related Functions

MLXTEND http://rasbt.github.io/mlxtend/

Raschka, Sebastian. "MLxtend: Providing machine learning and data science utilities and
extensions to Python’s scientific computing stack."

The Journal of Open Source Software 3.24 (2018).

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 22


Exploratory Data Analysis (EDA)

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 23


Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 24
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 25
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 26
(Later, we will see how to do this more conveniently)

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 27


Lecture 5 (Data Preprocessing and ML with Scikit-Learn)
Topics
1. Reading a Dataset from a Tabular Text File

2. Basic Data Handling

3. Object Oriented Programming (OOP) 



& Python Classes

4. Machine Learning with Scikit-learn

5. Preparing Training Data & Transformer API

6. Scikit-learn Pipelines
Code notebook: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/
L05/code/05-preprocessing-and-sklearn__notes.ipynb
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 28
Python Classes

To get a better understanding of the scikit-learn API, we need to


understand the main concepts behind Object Oriented
Programming (OOP) & classes in Python

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 29


Python Classes

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 30


Python Classes

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 31


Python Classes

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 32


Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 33
Python Classes

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 34


Python Classes

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 35


Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 36
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 37
Lecture 5 (Data Preprocessing and ML with Scikit-Learn)
Topics
1. Reading a Dataset from a Tabular Text File

2. Basic Data Handling

3. Object Oriented Programming (OOP) & Python


Classes

4. Machine Learning with Scikit-learn

5. Preparing Training Data & Transformer API

6. Scikit-learn Pipelines
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 38
The "Main" Machine Learning Library for
Python

http://scikit-learn.org

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Weiss, R., Dubourg, V. and Vanderplas, J., 2011. Scikit-learn: Machine learning in Python. the Journal of
Machine Learning Research, 12, pp.2825-2830.

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 39


The Scikit-learn Estimator API (an OOP Paradigm)

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 40


The Scikit-learn Estimator API

Training Training
Data Labels

① est.fit(X_train, y_train)

Model Test Data

② est.predict(X_test)

Predicted
labels

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 41


A 3-Nearest Neighbor Classifier & 2 Iris Features

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 42


Lecture 5 (Data Preprocessing and ML with Scikit-Learn)
Topics
1. Reading a Dataset from a Tabular Text File

2. Basic Data Handling

3. Object Oriented Programming (OOP) & Python Classes

4. Machine Learning with Scikit-learn

5. Preparing Training Data & Transformer API

6. Scikit-learn Pipelines

Code notebook: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/


L05/code/05-preprocessing-and-sklearn__notes.ipynb
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 43
a sample is inversely proportional to the size of the sample. Let us have a look at an example using

Issues with Random Subsampling ...


the Iris dataset 1 , which we randomly divide into 2/3 training data and 1/3 test data as illustrated in
Figure 1. (The source code for generating this graphic is available on GitHub2 .)

All samples (n = 150)

Training samples (n = 100) Test samples (n = 50)

Figure 1: Distribution of Iris flower classes upon random subsampling into training and test sets.

Sebastian Raschka
1
STAT 451: Intro to ML
https://archive.ics.uci.edu/ml/datasets/iris Lecture 5: Scikit-learn 44
Stratified Splits

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 45


Normalization: Min-Max Scaling

[i]
[i]
x − xmin
xnorm =
xmax − xmin

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 46


Normalization: Min-Max Scaling

[i]
[i]
x − xmin
xnorm =
xmax − xmin

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 47


Normalization: Standardization

[i]
[i]
x − μx
xstd =
σx

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 48


Normalization: Standardization
[i]
[i]
x − μx
xstd =
σx

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 49


Normalization: Standardization

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 50


Sample vs Population Standard Deviation

1 i=1 [i] 2
n−1∑
sx = (x − x̄)
n

i=1
1
(x [i] − μx)2
n∑
σx =
n

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 51


Sample vs Population Standard Deviation

i=1
1
(x [i] − x̄)2
n−1∑
sx =
n

1 i=1 [i]
(x − μx)2
n∑
σx =
n

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 52


Scaling Validation and Test Sets

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 53


Scaling Validation and Test Sets

Given 3 training examples:

- example1: 10 cm -> class 2

- example2: 20 cm -> class 2

- example3: 30 cm -> class 1

Estimate:

mean: 20 cm

standard deviation: 8.2 cm

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 54


Scaling Validation and Test Sets
Given 3 training examples:

- example1: 10 cm -> class 2

- example2: 20 cm -> class 2

- example3: 30 cm -> class 1

Estimate:

mean: 20 cm

standard deviation: 8.2 cm

Standardize:

- example1: -1.21 -> class 2

- example2: 0.00 -> class 2

- example3: 1.21 -> class 1

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 55


Scaling Validation and Test Sets
Given 3 training examples:

- example1: 10 cm -> class 2

- example2: 20 cm -> class 2

- example3: 30 cm -> class 1

Estimate:

mean: 20 cm

standard deviation: 8.2 cm

Assume you have the classification rule:


Standardize (z scores): (
class 2 if z  0.6
- example1: -1.21 -> class 2
h(z) =
- example2: 0.00 -> class 2
class 1 otherwise
- example3: 1.21 -> class 1
<latexit sha1_base64="HAx2k5DyXbF25+3p1q+TLdp/cAo=">AAACaHicbVHBTttAEF27tIVAW1NAqOIyImpFL5GdosKlEoILRyo1gBRH0XozTlas19buGBqsiH/sjQ/g0q/oxliIAk9a6enNvJndt0mhpKUwvPX8Vwuv37xdXGotr7x7/yFY/Xhq89II7Ilc5eY84RaV1NgjSQrPC4M8SxSeJRdH8/rZJRorc/2LpgUOMj7WMpWCk5OGwc1k5/or/IA4wbHUlXCj7AxaUCMm/E2VUNxamEEX4EsjgUydcA2xQgg73yGOX3RED4acJmiupMXZvDFGPWpWDYN22AlrwHMSNaTNGpwMgz/xKBdlhprqNf0oLGhQcUNSKDc+Li0WXFzwMfYd1TxDO6jqoGbw2SkjSHPjjiao1ceOimfWTrPEdWacJvZpbS6+VOuXlO4PKqmLklCL+0VpqYBymKcOI2lQkJo6woWR7q4gJtxwQe5vWi6E6OmTn5PTbif61un+3G0fHDZxLLItts12WMT22AE7ZiesxwS785a9dW/D++sH/qb/6b7V9xrPGvsP/vY/S5OzNw==</latexit>

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 56


Scaling Validation and Test Sets
(
Given 3 training examples: class 2 if z  0.6
h(z) =
- example1: 10 cm -> class 2
class 1 otherwise
<latexit sha1_base64="HAx2k5DyXbF25+3p1q+TLdp/cAo=">AAACaHicbVHBTttAEF27tIVAW1NAqOIyImpFL5GdosKlEoILRyo1gBRH0XozTlas19buGBqsiH/sjQ/g0q/oxliIAk9a6enNvJndt0mhpKUwvPX8Vwuv37xdXGotr7x7/yFY/Xhq89II7Ilc5eY84RaV1NgjSQrPC4M8SxSeJRdH8/rZJRorc/2LpgUOMj7WMpWCk5OGwc1k5/or/IA4wbHUlXCj7AxaUCMm/E2VUNxamEEX4EsjgUydcA2xQgg73yGOX3RED4acJmiupMXZvDFGPWpWDYN22AlrwHMSNaTNGpwMgz/xKBdlhprqNf0oLGhQcUNSKDc+Li0WXFzwMfYd1TxDO6jqoGbw2SkjSHPjjiao1ceOimfWTrPEdWacJvZpbS6+VOuXlO4PKqmLklCL+0VpqYBymKcOI2lQkJo6woWR7q4gJtxwQe5vWi6E6OmTn5PTbif61un+3G0fHDZxLLItts12WMT22AE7ZiesxwS785a9dW/D++sH/qb/6b7V9xrPGvsP/vY/S5OzNw==</latexit>

- example2: 20 cm -> class 2

- example3: 30 cm -> class 1


Given 3 NEW examples:

Estimate: - example4: 5 cm -> class ?

- example5: 6 cm -> class ?

mean: 20 cm
- example6: 7 cm -> class ?

standard deviation: 8.2 cm

Estimate "new" mean and std.:


Standardize (z scores):
- example5: -1.21 -> class 2
- example1: -1.21 -> class 2
- example6: 0.00 -> class 2
- example2: 0.00 -> class 2
- example7: 1.21 -> class 1
- example3: 1.21 -> class 1
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 57
Scaling Validation and Test Sets
(
Given 3 training examples: class 2 if z  0.6
h(z) =
- example1: 10 cm -> class 2
class 1 otherwise
- example2: 20 cm -> class 2

<latexit sha1_base64="HAx2k5DyXbF25+3p1q+TLdp/cAo=">AAACaHicbVHBTttAEF27tIVAW1NAqOIyImpFL5GdosKlEoILRyo1gBRH0XozTlas19buGBqsiH/sjQ/g0q/oxliIAk9a6enNvJndt0mhpKUwvPX8Vwuv37xdXGotr7x7/yFY/Xhq89II7Ilc5eY84RaV1NgjSQrPC4M8SxSeJRdH8/rZJRorc/2LpgUOMj7WMpWCk5OGwc1k5/or/IA4wbHUlXCj7AxaUCMm/E2VUNxamEEX4EsjgUydcA2xQgg73yGOX3RED4acJmiupMXZvDFGPWpWDYN22AlrwHMSNaTNGpwMgz/xKBdlhprqNf0oLGhQcUNSKDc+Li0WXFzwMfYd1TxDO6jqoGbw2SkjSHPjjiao1ceOimfWTrPEdWacJvZpbS6+VOuXlO4PKqmLklCL+0VpqYBymKcOI2lQkJo6woWR7q4gJtxwQe5vWi6E6OmTn5PTbif61un+3G0fHDZxLLItts12WMT22AE7ZiesxwS785a9dW/D++sH/qb/6b7V9xrPGvsP/vY/S5OzNw==</latexit>

- example4: 5 cm -> class ?

- example3: 30 cm -> class 1

- example5: 6 cm -> class ?

- example6: 7 cm -> class ?

Estimate:
Estimate "new" mean and std.:
mean: 20 cm

standard deviation: 8.2 cm

- example5: -1.21 -> class 2


- example6: 0.00 -> class 2
Standardize (z scores):
- example7: 1.21 -> class 1
- example1: -1.21 -> class 2
- example5: -18.37

- example2: 0.00 -> class 2


- example6: -17.15

- example3: 1.21 -> class 1 - example7: -15.92


Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 58
The Scikit-Learn Transformer API

Training Test
Data Data


est.fit(X_train)

② est.transform(X_train) Model est.transform(X_test)③

Transformed Transformed
Training Data Test Data

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 59


The Scikit-Learn Transformer API

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 60


Working with Categorical Data

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 61


Categorical Data -> Ordinal Data

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 62


Categorical Data -> Nominal Data
(Class Labels)

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 63


One-hot Encoding for Categorical
(Nominal) Features

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 64


One-hot Encoding for Categorical
(Nominal) Features

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 65


Dealing with Missing Data

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 66


Dealing with Missing Data

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 67


Dealing with Missing Data

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 68


Lecture 5 (Data Preprocessing and ML with Scikit-Learn)
Topics
1. Reading a Dataset from a Tabular Text File

2. Basic Data Handling

3. Object Oriented Programming (OOP) & Python Classes

4. Machine Learning with Scikit-learn

5. Preparing Training Data & Transformer API

6. Scikit-learn Pipelines

Code notebook: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/


L05/code/05-preprocessing-and-sklearn__notes.ipynb
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 69
Scikit-Learn Pipelines

(Step 1) (Step 2)
Class labels
Training set Test set

pipeline.fit(…) pipeline.predict(…)

Pipeline
.fit(…) &
Scaling
.transform(…)
.transform(…)
Dimensionality
Reduction
.fit(…) &
.transform(…)
Learning Algorithm .transform(…)

.fit(…)
Predictive Model
Class labels
.predict(…)

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 70


Scikit-Learn Pipelines

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 71


Scikit-Learn Pipelines

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 72


Scikit-Learn Pipelines

(Step 1) (Step 2)
Class labels
Training set Test set

pipeline.fit(…) pipeline.predict(…)

Pipeline
.fit(…) &
Scaling
.transform(…)
.transform(…)
Dimensionality
Reduction
.fit(…) &
.transform(…)
Learning Algorithm .transform(…)

.fit(…)
Predictive Model
Class labels
.predict(…)

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 73


Model Selection:
Simple Holdout Method
Original dataset

Training set Test set

Training set Validation set Test set

Change hyperparameters
and repeat

Machine learning
algorithm
Evaluate
Fit

Predictive model
Final performance estimate

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 74


Model Selection:
Simple Holdout Method

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 75


Model Selection:
Simple Holdout Method

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 76


Model Selection: Simple Holdout Method

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 77


Model Selection:
Simple Holdout Method

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 78


Randomized Search

https://github.com/scikit-learn/scikit-learn/pull/13900

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 79


Successive Halving

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 80


Lecture Notes

This time in interactive Jupyter Notebook form:

https://github.com/rasbt/stat451-machine-learning-fs20/blob/
master/L05/code/05-preprocessing-and-sklearn__notes.ipynb

Bonus: Column Transformers for


Heterogenous Data
https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/
L05/code/05-bonus-column-transformer.ipynb

Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 81

You might also like