A brief overview of ML
Key tasks in ML
Why we need ML
Why Python is so great for ML
K-nearest neighbors algorithm
kNN Classification
kNN Regression
Some Issues in KNN
Python Modules for Working with ML Algorithms
11/07/22 2
Machine Learning
Machine learning is the process of turning data into information and knowledge.
ML lies at the intersection of computer science, engineering, and statistics
and often appears in other disciplines.
What is Machine Learning?
Traditional vs. ML Systems
In ML, once the system is provided with the right data and algorithms, it can "fish for itself."
Sensors and the Data Deluge
The two trends of mobile computing and sensor-generated data mean that we'll be getting more and more data in the future.
Key Terminology
To train the ML algorithm we need to feed it quality data, known as a training set.
In the above example each training example (instance) has four features and one target variable.
In a training set the target variable is known.
The machine learns by finding some relationship between the features and the target variable.
In a classification problem the target variables are called classes, and there is assumed to be a finite number of classes.
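For instance, a training set can be represented as a list of (features, target) pairs; the values below are invented purely for illustration:

```python
# A tiny illustrative training set: each training example (instance)
# has four features and one known target variable (its class).
training_set = [
    # (feature1, feature2, feature3, feature4), target class
    ((5.1, 3.5, 1.4, 0.2), "class_a"),
    ((6.3, 2.9, 5.6, 1.8), "class_b"),
    ((4.9, 3.0, 1.4, 0.2), "class_a"),
]

for features, target in training_set:
    print(features, "->", target)
```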
Key Terminology Cont…
To test machine learning algorithms, a separate dataset called a test set is used.
The target variable for each example in the test set isn't given to the program.
The program (model) decides which class each example belongs to.
The predicted value is then compared with the actual target variable.
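As a sketch of this comparison (the class labels below are assumptions for illustration), accuracy is the fraction of test examples the model predicts correctly:

```python
# Held-out target variables vs. the model's predictions (illustrative values).
test_targets = ["spam", "ham", "spam", "ham"]
predictions  = ["spam", "ham", "ham",  "ham"]

# Accuracy: fraction of test examples where prediction matches the target.
correct = sum(p == t for p, t in zip(predictions, test_targets))
accuracy = correct / len(test_targets)
print(f"accuracy = {accuracy:.2f}")  # 3 of 4 correct -> 0.75
```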
Key Tasks of Machine Learning
In classification, our job is to predict what class an instance of data should fall into.
Regression is the prediction of a numeric value.
Key Tasks of Machine Learning
Common algorithms are used to perform classification, regression, clustering, and density-estimation tasks.
Balancing generalization and memorization (overfitting) is a common problem for many ML algorithms.
Regularization techniques are used to reduce overfitting.
Key Tasks of Machine Learning
There are two fundamental causes of prediction error: a model's bias and its variance.
A model with high variance overfits the training data, while a model with high bias underfits the training data.
High bias, low variance (underfitting)
Low bias, high variance (overfitting)
High bias, high variance (under- and overfitting)
Low bias, low variance (good model)
The predictive power of many ML algorithms improves as the amount of training data increases.
The quality of the data is also important.
Key Tasks of Machine Learning
Ideally, a model will have both low bias and low variance, but efforts to reduce one will frequently increase the other. This is known as the bias-variance trade-off.
How to Choose the Right Algorithm
Spend some time getting to know the data; the more we know about it, the more successful the application we can build. (This takes 70-80% of the time.)
Things to know about the data:
Are the features nominal or continuous?
Are there missing values in the features?
If there are missing values, why are there missing values?
Are there outliers in the data?
All of these facts about your data can help you narrow the algorithm selection process.
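The checklist above can be run mechanically. A minimal sketch with pandas, on a hypothetical dataset:

```python
import pandas as pd

# Hypothetical dataset used to illustrate the data-inspection checklist.
df = pd.DataFrame({
    "color":  ["red", "blue", None, "red"],   # nominal feature, one missing value
    "weight": [1.2, 0.9, 1.1, 120.0],         # continuous feature; 120.0 looks like an outlier
})

print(df.dtypes)                 # nominal vs. continuous features
print(df.isna().sum())           # missing values per column
print(df["weight"].describe())   # summary statistics help spot outliers
```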
Problem Solving Framework
Machine Learning Systems and Data
Stages of ML Process
Data Collection and Preparation
The validation set is used to select and tune the final ML model.
Python
Classifying with k-Nearest Neighbors
K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN)
Lastly, we take a majority vote among the k most similar pieces of data, and the majority class is assigned to the new data point we were asked to classify.
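A minimal sketch of this majority-vote step (the training points are invented for illustration):

```python
from collections import Counter
import math

def knn_classify(query, training_data, k=3):
    """Classify query by majority vote among its k nearest training points.

    training_data is a list of (features, label) pairs.
    """
    # Sort the training data by Euclidean distance to the query point.
    by_distance = sorted(training_data, key=lambda item: math.dist(query, item[0]))
    # Majority vote among the k nearest labels.
    k_labels = [label for _, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]

train = [((1, 1), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 7), "B"), ((2, 2), "A")]
print(knn_classify((1, 2), train))  # the three nearest neighbours are all "A"
```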
K-Nearest Neighbors (KNN)
KNN and other non-parametric models can be useful when training data is abundant and you have little prior knowledge about the relationship between the response and explanatory variables.
KNN makes only one assumption: instances that are near each other are likely to have similar values of the response variable.
A model that makes assumptions about the relationship can be useful if training data is scarce or if you already know about the relationship.
KNN Classification
Now, you find a movie you haven’t seen yet and want to know if it’s a romance movie or an action movie.
To determine this, we’ll use the kNN algorithm.
KNN Classification
We find the movie in question and see how many kicks and kisses it has.
KNN Classification
We don’t know what type of movie the question mark movie is.
First, we calculate the distance to all the other movies.
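A sketch of that first step in plain Python; the movie titles and kick/kiss counts below are illustrative assumptions:

```python
import math

# Each movie is represented as (number of kicks, number of kisses).
# These counts are made up for illustration.
movies = {
    "California Man": (3, 104),
    "Kevin Longblade": (101, 10),
    "Amped II": (99, 5),
}
unknown = (18, 90)  # the question-mark movie (assumed counts)

# First step of kNN: Euclidean distance from the unknown movie to every other movie.
for title, counts in movies.items():
    print(title, round(math.dist(unknown, counts), 1))
```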
General Approach to KNN
K-Nearest Neighbors (KNN)
Advantages:
It remembers (instance-based learning)
Fast (no learning time)
Simple and straightforward
Downsides:
No generalization
Overfitting (sensitive to noise)
Computationally expensive for large datasets
K-Nearest Neighbors (KNN)
Given:
Training data D = {(xi, yi)}
Distance metric d(q, x) (domain knowledge is important)
Number of neighbors K (domain knowledge is important)
Query point q
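These ingredients map directly to code; a sketch with a toy Manhattan metric and illustrative data:

```python
def k_nearest(D, d, K, q):
    """Return the K training pairs (x, y) nearest to query q under metric d."""
    return sorted(D, key=lambda pair: d(q, pair[0]))[:K]

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(u - v) for u, v in zip(a, b))

# Illustrative training data D = {(xi, yi)}.
D = [((1, 6), 7), ((2, 4), 8), ((7, 1), 50)]
print(k_nearest(D, manhattan, K=2, q=(4, 2)))
```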
KNN- Regression Problem
X1  X2    y
 1   6    7
 2   4    8
 3   7   16
 6   8   44
 7   1   50
 8   4   68

Query q = (4, 2), y = ???
Euclidean regression: 1-NN _______  3-NN _______
Manhattan regression: 1-NN _______  3-NN _______
KNN- Regression Problem
Query q = (4, 2). Squared Euclidean distances d(q, x):

X1  X2    y   d(q, x)^2
 1   6    7   25
 2   4    8    8
 3   7   16   26
 6   8   44   40
 7   1   50   10
 8   4   68   20

Regression with Euclidean distance: 1-NN -> y = 8;  3-NN -> y = (8 + 50 + 68) / 3 = 42
KNN- Regression Problem
Query q = (4, 2). Manhattan distances d(q, x):

X1  X2    y   d(q, x)
 1   6    7   7
 2   4    8   4
 3   7   16   6
 6   8   44   8
 7   1   50   4
 8   4   68   6

Regression with Manhattan distance: 1-NN _______  3-NN _______
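The Euclidean answers (1-NN gives 8, 3-NN averages to 42) can be checked with a short script:

```python
import math

# Training points (X1, X2) with response y, and the query point q = (4, 2).
data = [((1, 6), 7), ((2, 4), 8), ((3, 7), 16),
        ((6, 8), 44), ((7, 1), 50), ((8, 4), 68)]
q = (4, 2)

def knn_regress(k, dist):
    # kNN regression: average the y values of the k nearest training points.
    nearest = sorted(data, key=lambda p: dist(q, p[0]))[:k]
    return sum(y for _, y in nearest) / k

euclidean = math.dist
print(knn_regress(1, euclidean))  # 8.0  (nearest point is (2, 4))
print(knn_regress(3, euclidean))  # 42.0 ((8 + 50 + 68) / 3)
```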
K-Nearest Neighbors Bias
Preference Bias?
Our belief about what makes a good hypothesis:
Locality: near points are similar (distance function / domain)
Smoothness: averaging
All features matter equally
Best practices for data preparation:
Rescale data: normalizing the data to the range [0, 1] is a good idea.
Address missing data: exclude or impute missing values.
Lower the dimensionality: KNN is better suited to lower-dimensional data.
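A sketch of the rescaling step (min-max normalization to [0, 1]), so no single feature dominates the distance calculation:

```python
def rescale(values):
    """Min-max rescale a list of feature values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]   # illustrative feature values
print(rescale(ages))      # [0.0, 0.25, 0.5, 1.0]
```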
KNN and Curse of Dimensionality
As the number of features (dimensions) grows, the amount of data we need to generalize accurately grows exponentially.
Exponential means "bad": O(2^d).
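One way to see this: the edge length of a neighbourhood that captures a fixed fraction (say 10%) of uniformly distributed points in the unit hypercube is 0.1 ** (1/d), which approaches 1 as d grows, so "nearest" neighbours stop being near:

```python
# Edge length of a sub-cube containing 10% of the volume of the
# d-dimensional unit hypercube: 0.1 ** (1 / d).
for d in (1, 2, 10, 100):
    print(d, round(0.1 ** (1 / d), 3))
# At d = 100 the "local" neighbourhood spans ~98% of each axis.
```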
Some Other Issues
Summary
KNN is positioned in the algorithm list of scikit-learn.
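A minimal usage sketch with scikit-learn's KNeighborsClassifier (toy data, assuming scikit-learn is installed):

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 1], [8, 8], [9, 9]]   # illustrative training features
y = [0, 0, 1, 1]                        # class labels

# Fit a 3-nearest-neighbours classifier and predict two new points.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
print(model.predict([[1, 0], [8, 9]]))  # -> [0 1]
```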
Question & Answer
Thank You !!!
Python Programming
Assignment One - Python Programming
NumPy
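A few NumPy basics that ML code leans on (illustrative values): array creation, vectorised arithmetic, and aggregation.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = a * 2                 # elementwise multiplication, no explicit loop
print(b)                  # [2. 4. 6. 8.]
print(a.mean(), a.sum())  # 2.5 10.0

# Euclidean distance between two vectors via the vector norm.
print(np.linalg.norm(a - np.array([1.0, 2.0, 3.0, 0.0])))  # 4.0
```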
Python Programming
Matplotlib
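A minimal Matplotlib sketch (the Agg backend is chosen so it runs without a display; the data is illustrative):

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend, works headless
import matplotlib.pyplot as plt

xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

# A labelled line plot saved to a file.
fig, ax = plt.subplots()
ax.plot(xs, ys, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("plot.png")
```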
Python Programming
SciPy
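As a small illustration (assuming SciPy is installed), the scipy.spatial.distance module provides the metrics kNN needs:

```python
from scipy.spatial import distance

p, q = (4, 2), (2, 4)
print(distance.euclidean(p, q))   # sqrt(8) ~ 2.828
print(distance.cityblock(p, q))   # Manhattan distance -> 4
```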
Tool Set
Jupyter notebooks
Interactive coding and visualization of output
NumPy, SciPy, Pandas
Numerical computation
Matplotlib, Seaborn
Data visualization
Scikit-learn
Machine learning
Jupyter Cell
%matplotlib inline: display plots inline in a Jupyter notebook.
Jupyter Cell
%%timeit: time how long a cell takes to execute.
Introduction to Pandas: Series
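A minimal sketch of a pandas Series, a one-dimensional labelled array (values are illustrative):

```python
import pandas as pd

# A Series pairs values with an index of labels.
s = pd.Series([7, 8, 16], index=["a", "b", "c"])
print(s["b"])      # label-based access -> 8
print(s.mean())    # vectorised statistics
print(s[s > 7])    # boolean filtering keeps "b" and "c"
```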
Introduction to Pandas
Library for computation with tabular data.
Introduction to Pandas: DataFrame
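A minimal sketch of a pandas DataFrame, a 2-D table in which each column is a Series sharing one index (the movie-style values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "kicks":  [3, 101, 99],
    "kisses": [104, 10, 5],
    "label":  ["romance", "action", "action"],
})
print(df.head())
print(df["kicks"].max())                  # column access -> 101
print(df[df["label"] == "action"].shape)  # row filtering -> (2, 3)
```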