
What is Machine Learning?

Machine learning gives computers the ability to learn and infer from data.

Machine Learning in Our Daily Lives

 Spam filtering
 Web search
 Postal mail routing
 Fraud detection
 Movie recommendations
 Vehicle driver assistance
 Web advertisements
 Social networks
 Speech recognition

Types of Machine Learning

 Supervised: data points have a known outcome

 Unsupervised: data points have an unknown outcome

Types of Supervised Learning

 Regression: outcome is continuous (numerical)

 Classification: outcome is a category

Supervised Learning Overview

Data with answers + Model → Fit → Fitted model

Data without answers + Fitted model → Predict → Predicted answers

Regression: Numerical Answers

Movie data with revenue + Model → Fit → Fitted model

Movie data (unknown revenue) + Fitted model → Predict → Predicted revenue

Classification: Categorical Answers

Labeled data + Model → Fit → Fitted model

Unlabeled data + Fitted model → Predict → Labels

Classification: Categorical Answers

Emails labeled as spam / not spam + Model → Fit → Fitted model

Unlabeled emails + Fitted model → Predict → Spam or not spam

Machine Learning Vocabulary
 Target: the category or value to be predicted (the column to predict)

 Features: the properties of the data used for prediction (the non-target columns)

 Example: a single data point within the data (one row)

 Label: the target value for a single data point

Sepal length  Sepal width  Petal length  Petal width  Species
6.7           3.0          5.2           2.3          Virginica
6.4           2.8          5.6           2.1          Virginica
4.6           3.4          1.4           0.3          Setosa
6.9           3.1          4.9           1.5          Versicolor
4.4           2.9          1.4           0.2          Setosa
4.8           3.0          1.4           0.1          Setosa
5.9           3.0          5.1           1.8          Virginica
5.4           3.9          1.3           0.4          Setosa
4.9           3.0          1.4           0.2          Setosa
5.4           3.4          1.7           0.2          Setosa

In this iris table, Species is the target column; each row is an example; each Species entry is that row's label; and the four measurement columns are the features.

What is Classification?

Which flower is a customer most likely to purchase, based on similarity to a previous purchase?
What is Needed for Classification?
 Model data with:
 Features that can be quantified
 Labels that are known

 A method to measure similarity

K Nearest Neighbors Classification

[Scatter plot: Age vs. Number of Malignant Nodes, with training points labeled "Survived" and "Did not survive." A new point is to be predicted from its K nearest neighbors: as K grows from 1 to 4, the counts of each class among the neighbors change (at K = 4, three neighbors survived and one did not), and the majority class among the K neighbors becomes the prediction.]
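The counting rule illustrated above — find the K closest points, tally their classes, and predict the majority — can be written out directly. A minimal sketch in plain Python, with made-up (nodes, age) training data:

```python
from collections import Counter
import math

# Hypothetical training data, made up for illustration:
# ((number of malignant nodes, age), outcome)
train = [
    ((2, 55), "survived"),
    ((3, 48), "survived"),
    ((5, 60), "survived"),
    ((12, 35), "did not survive"),
    ((15, 50), "did not survive"),
]

def knn_predict(point, train, k):
    """Predict the majority class among the k nearest training points."""
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(point, item[0]))
    # Tally the classes of the k closest points and take the majority.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

query = (4, 52)
for k in (1, 2, 3, 4):
    print(k, knn_predict(query, train, k))
```

scikit-learn's KNeighborsClassifier, shown later in this lesson, implements the same rule with efficient data structures for the neighbor search.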

40
What is Needed to Select a KNN Model?

41
What is Needed to Select a KNN Model?
 Correct value for 'K' 60

 How to measure
closeness of
neighbors?
40
Age

20

0 10 20
Number of Malignant Nodes

K Nearest Neighbors Decision Boundary

[Decision-boundary plots of Age vs. Number of Malignant Nodes for K = 1 and K = all. With K = 1 the boundary is highly irregular, tracing every training point; with K equal to all points, every query receives the overall majority class, so there is effectively no boundary. The value of K therefore controls the complexity of the decision boundary.]
Methods for determining 'K' will be discussed in the next lesson.
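The two extremes of K can be checked with scikit-learn directly; the (nodes, age) data here are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up (nodes, age) data: four "survived" (1), two "did not survive" (0).
X = np.array([[2, 55], [3, 48], [5, 60], [4, 40], [14, 35], [16, 50]])
y = np.array([1, 1, 1, 1, 0, 0])

# K = 1: each training point is its own nearest neighbor,
# so the model reproduces the training labels exactly.
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn1.predict(X))

# K = all points: every query receives the overall majority class (here 1),
# regardless of where it lies.
knn_all = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y)
print(knn_all.predict([[15, 45]]))
```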


Measurement of Distance in KNN

[Scatter plot: Age vs. Number of Malignant Nodes, highlighting the distance between two points.]
Euclidean Distance (L2 Distance)

[Plot: the distance d between two points decomposes into a horizontal ΔNodes and a vertical ΔAge.]

d = √(ΔNodes² + ΔAge²)

Manhattan Distance (L1 or City Block Distance)

[Plot: the same two points, with the distance measured along the axes.]

d = |ΔNodes| + |ΔAge|
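Both formulas can be evaluated with Python's standard library; the two points here are made up:

```python
import math

# Two made-up points: (number of malignant nodes, age).
p = (3, 25)
q = (7, 28)

d_nodes = abs(q[0] - p[0])  # ΔNodes = 4
d_age = abs(q[1] - p[1])    # ΔAge = 3

# Euclidean (L2): straight-line distance.
euclidean = math.hypot(d_nodes, d_age)  # sqrt(4**2 + 3**2) = 5.0
# Manhattan (L1): distance traveled along the axes.
manhattan = d_nodes + d_age             # 4 + 3 = 7
print(euclidean, manhattan)
```

In scikit-learn's KNN classes, this choice corresponds to the Minkowski metric's p parameter: p=2 (the default) gives Euclidean distance and p=1 gives Manhattan distance.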

Scale is Important for Distance Measurement

[Plots: Age (roughly 18 to 60) vs. Number of Surgeries (1 to 5). Because the age axis spans a much wider range than the surgeries axis, raw distances are dominated by age, and a point's nearest neighbors are determined almost entirely by age. After "feature scaling," both features contribute comparably to the distance, and the nearest neighbors change.]
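The effect can be checked numerically. A small sketch with made-up (surgeries, age) points, using scikit-learn's StandardScaler for the scaling step:

```python
import math
from sklearn.preprocessing import StandardScaler

# Made-up points: (number of surgeries, age).
a = [1, 20]
b = [2, 40]  # close to a in surgeries, far in age
c = [5, 22]  # far from a in surgeries, close in age

# Raw distances: the age axis dominates, so c looks much nearer to a than b.
print(math.dist(a, b), math.dist(a, c))

# After standardizing each feature, both contribute comparably,
# and b becomes the nearer neighbor of a.
scaler = StandardScaler().fit([a, b, c])
sa, sb, sc = scaler.transform([a, b, c])
print(math.dist(sa, sb), math.dist(sa, sc))
```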
Comparison of Feature Scaling Methods
 Standard Scaler: mean-center the data and scale it to unit variance
 Minimum-Maximum Scaler: scale the data to a fixed range (usually 0 to 1)
 Maximum Absolute Value Scaler: divide by the maximum absolute value, so the data falls in [-1, 1]
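A quick sketch of how the three scalers differ on the same made-up column (the class names are the actual scikit-learn ones):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

# One made-up feature column.
X = np.array([[-2.0], [0.0], [2.0], [6.0]])

# StandardScaler: zero mean, unit variance.
print(StandardScaler().fit_transform(X).ravel())

# MinMaxScaler: squeezed into [0, 1].
print(MinMaxScaler().fit_transform(X).ravel())

# MaxAbsScaler: divided by max |x|, so the result lies in [-1, 1].
print(MaxAbsScaler().fit_transform(X).ravel())
```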

Feature Scaling: The Syntax
Import the class containing the scaling method
from sklearn.preprocessing import StandardScaler

Create an instance of the class


StdSc = StandardScaler()

Fit the scaling parameters and then transform the data


StdSc = StdSc.fit(X_data)

X_scaled = StdSc.transform(X_data)

Other scaling methods exist: MinMaxScaler, MaxAbsScaler.

Multiclass KNN Decision Boundary

[Decision-boundary plot (K = 5): Age vs. Number of Malignant Nodes with three classes: full remission, partial remission, and did not survive.]

Regression with KNN

[Plots of KNN regression fits for K = 20, K = 3, and K = 1: smaller values of K follow the training data more closely.]
Characteristics of a KNN Model
 Fast to create, because the model simply stores the data
 Slow to predict, because many distance calculations are required
 Can require a lot of memory if the data set is large

K Nearest Neighbors: The Syntax
Import the class containing the classification method
from sklearn.neighbors import KNeighborsClassifier

Create an instance of the class


KNN = KNeighborsClassifier(n_neighbors=3)

Fit the instance on the data and then predict the expected value
KNN = KNN.fit(X_data, y_data)

y_predict = KNN.predict(X_data)

The fit and predict/transform syntax will show up throughout the course. Regression can be done with KNeighborsRegressor.
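A minimal sketch of the regression variant on made-up one-dimensional data: KNeighborsRegressor predicts by averaging the targets of the K nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up 1-D training data: y is roughly 2 * x.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Predict by averaging the targets of the 2 nearest neighbors.
knr = KNeighborsRegressor(n_neighbors=2).fit(X, y)
print(knr.predict([[2.5]]))  # average of y at x = 2 and x = 3
```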
