Classification: Machine Learning Basics and kNN


Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2020)
Outline

 A brief overview of ML
 Key tasks in ML
 Why we need ML
 Why Python is so great for ML
 K-nearest neighbors algorithm
 kNN Classification
 kNN Regression
 Some Issues in KNN
 Python Modules to work on the ML Algorithms

Machine Learning

 With machine learning we can gain insight from a dataset.


 We’re going to ask the computer to make some sense of the data.
 This is what we mean by learning.

 Machine learning is the process of turning data into information and knowledge.
 ML lies at the intersection of computer science, engineering, and statistics
and often appears in other disciplines.

What is Machine Learning?

 It’s a tool that can be applied to many problems.


 Any field that needs to interpret and act on data can benefit
from ML techniques.

 There are many problems where the solution isn’t deterministic.


 That is, we don’t know enough about the problem or don’t have
enough computing power to properly model the problem.

Traditional vs. ML Systems

 In ML, once the system is provided with the right data and algorithms, it can "fish for itself".

Traditional vs. ML Systems

 A key aspect of ML that makes it particularly appealing in terms of business value is that it does not require as much explicit programming in advance.

Sensors and the Data Deluge

 We have a tremendous amount of human-created data from the WWW, but recently more non-human sources of data have been coming online.
 Sensors connected to the web.
 Sensors account for roughly 20% of non-video internet traffic.
 Data collected from mobile phones (three-axis accelerometers, temperature sensors, and GPS receivers).

 The two trends of mobile computing and sensor-generated data mean that we’ll be getting more and more data in the future.

Key Terminology

 Weight, Wingspan, Webbed feet, and Back color are features or attributes.
 An instance is made up of features. (also called an example)
 Species is the target variable. (response, outcome, output, etc.)
 Attributes can be numeric, binary, or nominal.

Key Terminology

 To train the ML algorithm we need to feed it quality data known as a training set.
 In the above example each training example (instance) has four features and one target variable.
 In a training set the target variable is known.

 The machine learns by finding some relationship between the features and the target variable.
 In the classification problem the target variables are called classes, and there is assumed to be a finite number of classes.

Key Terminology Cont…

 To test machine learning algorithms, a separate dataset is used, called a test set.
 The target variable for each example from the test set isn’t given to the program.

 The program (model) decides which class each example belongs to.
 Then we compare the predicted value with the actual target variable.

Key Tasks of Machine Learning

 In classification, our job is to predict what class an instance of data should fall into.
 Regression is the prediction of a numeric value.

 Classification and regression are examples of supervised learning.


 This set of problems is known as supervised because we’re telling the algorithm what to predict.

Key Tasks of Machine Learning

 The opposite of supervised learning is a set of tasks known as unsupervised learning.


 In unsupervised learning, there’s no label or target value given for the data; grouping similar items together is known as clustering.
 In unsupervised learning, we may also want to find statistical values that describe the data. This is known as density estimation.
 Another task of unsupervised learning may be reducing the data from many features to a small number so that we can properly
visualize the dimensions.

Key Tasks of Machine Learning

 There are common algorithms used to perform classification, regression, clustering, and density estimation tasks.
 Balancing generalization and memorization (overfitting) is a common problem for many ML algorithms.
 Regularization techniques are used to reduce overfitting.

Key Tasks of Machine Learning

 There are two fundamental causes of prediction error: a model’s bias and its variance.
 A model with high variance over-fits the training data, while a model with high bias under-fits the training data.
 High bias, low variance (under-fitting)
 Low bias, high variance (over-fitting)
 High bias, high variance (under- and over-fitting)
 Low bias, low variance (a good model)
 The predictive power of many ML algorithms improves as the amount of training data increases.
 Quality of data is also important.

Key Tasks of Machine Learning

 Ideally, a model will have both low bias and low variance, but efforts to reduce one will frequently increase the other. This is known as the bias-variance trade-off.

 Common measurements of performance:

 Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
 Precision (P) = TP / (TP + FP)
 Recall (R) = TP / (TP + FN)
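 As a quick sketch (not from the slides): the three measures computed with scikit-learn’s metrics on made-up labels.

    # A minimal sketch: accuracy, precision, and recall with scikit-learn.
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual classes (illustrative values)
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions (illustrative values)

    print(accuracy_score(y_true, y_pred))   # (TP + TN) / (TP + TN + FP + FN) -> 0.75
    print(precision_score(y_true, y_pred))  # TP / (TP + FP) -> 0.75
    print(recall_score(y_true, y_pred))     # TP / (TP + FN) -> 0.75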

How to Choose the Right Algorithm

 First, you need to consider your goal.


 If you’re trying to predict or forecast a target value, then you need to look into supervised learning.
 If not, then unsupervised learning is the place you want to be.

 If you’ve chosen supervised learning, what’s your target value?


 Discrete values (y/n, 1/2/3, Red/Yellow/Black): classification
 Continuous values (0.00 to 100.00, etc.): regression

How to Choose the Right Algorithm

 Spend some time getting to know the data; the more we know about it, the more successful an application we can build. (this is 70-80% of the time)
 Things to know about the data are these:
 Are the features nominal or continuous?
 Are there missing values in the features?
 If there are missing values, why are there missing values?
 Are there outliers in the data? (e.g., 80, 81, 82, 83, 245)

 All of these facts about your data can help you narrow the algorithm selection process.

How to Choose the Right Algorithm

 Finding the best algorithm is an iterative process of trial and error.

 Steps in developing a machine learning application:
 Collect data: scraping a website, an RSS feed, an API, etc.
 Prepare the input data: make sure the data is in a usable, consistent format.
 Analyze the input data: look at the data.
 Understand the data.
 Train the algorithm: this is where the ML takes place. (does not apply to unsupervised learning)
 Test the algorithm: if results are unsatisfactory, go back to the training step.
 Use it. (implement the ML application)

Problem Solving Framework

 Problem-solving framework for an ML application:

 Business issue understanding
 Data understanding
 Data preparation
 Analysis / Modeling
 Validation
 Presentation / Visualization

Machine Learning Systems and Data

 In AI (ML), instead of writing a program by hand for each specific task, we collect lots of examples that specify the correct output for a given input.
 The most important factor in ML is not the algorithm or the software system.
 The quality of the data is the soul of an ML system.

Machine Learning Systems and Data

 Invalid training data:
 Garbage In ------ Garbage Out.

 An invalid dataset leads to invalid results.

 This is not to say that the training data needs to be perfect.
 Out of a million examples, a few inaccurate labels are acceptable.

 The quality of the data is the soul of an ML system.

Machine Learning Systems and Data

 “Garbage” can be several things:

 Wrong labels (Dog – Cat, Cat – Dog)
 Inaccurate and missing values
 A biased dataset, etc.
 Handling missing data: (sketched in code below)
 Rows and columns with a small portion of missing values – discard them
 Data imputation (time-series data) – carry forward the last valid value
 Substitute with the mean or median
 Predict the missing values from the available data
 A missing value can have a meaning on its own (missingness)
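 To make those imputation options concrete, here is a minimal pandas sketch (column names and values are made up for illustration):

    # A minimal sketch of the missing-data options above, using pandas.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [22, np.nan, 35, 41],
                       "salary": [6000, 9500, np.nan, 20000]})

    df.dropna()                            # discard rows with missing values
    df.ffill()                             # time series: carry the last valid value forward
    df.fillna(df.mean(numeric_only=True))  # substitute with the column mean
    df["salary"].isna().astype(int)        # keep "missingness" itself as a feature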
Machine Learning Systems and Data

 Having a clean dataset is not always enough.

 Features with large magnitudes can dominate features with small magnitudes during training.
 Example: age [0-100] vs. salary [6,000-20,000] – apply scaling and standardization.
 Data imbalance:
 Leave it as it is.
 Under-sampling (if all classes are equally important) [5000 → 25]
 Over-sampling (if all classes are equally important) [25 → 5000]

   No | Class | Number
   1  | Cat   | 5000
   2  | Dog   | 5000
   3  | Tiger | 150
   4  | Cow   | 25
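 A minimal sketch of both remedies, assuming the class counts from the table above (the feature values are fabricated):

    # A minimal sketch: feature scaling, then naive under-sampling with pandas.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    features = pd.DataFrame({"age": [23, 45, 31, 60],
                             "salary": [6000, 12000, 9000, 20000]})
    scaled = StandardScaler().fit_transform(features)  # each column: mean 0, std 1

    data = pd.DataFrame({"label": ["Cat"] * 5000 + ["Dog"] * 5000
                                  + ["Tiger"] * 150 + ["Cow"] * 25})
    n = data["label"].value_counts().min()             # size of the smallest class: 25
    balanced = data.groupby("label").sample(n=n, random_state=0)  # 25 of each class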
Challenges in Machine Learning

 It requires considerable data and compute power.


 It requires knowledgeable data science specialists or teams.
 It adds complexity to the organization's data integration
strategy. (data-driven culture)

 Learning AI (ML) algorithms is challenging without an advanced math background.
 The context of data often changes. (private data vs. public data)
 Algorithmic bias, privacy, and ethical concerns may be overlooked.
Stages of ML Process

 The first key step in preparing to explore and exploit AI(ML) is to


understand the basic stages involved.

Stages of ML Process

 Machine Learning Tasks and Subtasks:

Data Collection and Preparation

 Data collection is the process of gathering and measuring information from countless different sources.
 Data is being generated at an unprecedented rate. These data can be:
 Numeric (temperature, loan amount, customer retention rate),
 Categorical (gender, color, highest degree earned), or
 Even free text (think doctor’s notes or opinion surveys).

 In order to use the data we collect to develop practical solutions, it must be collected and stored in a way that makes sense for the business problem at hand.
Data Collection and Preparation

 During AI development, we always rely on data.

 From training, tuning, and model selection to testing, we use three different data sets: the training set, the validation set, and the testing set.

 The validation set is used to select and tune the final ML model.
 The test data set is used to evaluate how well your algorithm was trained with the training data set.
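 One common way to carve out the three sets is two successive splits; a minimal scikit-learn sketch (the 60/20/20 proportions are illustrative, not mandated by the slides):

    # A minimal sketch: train / validation / test split with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    # First split off 20% as the test set ...
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    # ... then split 25% of the remainder off as validation (0.25 * 0.8 = 0.2).
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=0)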

Data Collection and Preparation

 Testing sets typically represent 20% or 30% of the data. (cross-validation)

 The test set must consist of input data grouped together with verified correct outputs, generally by human verification.

Data Collection and Preparation

 The most successful AI projects are those that integrate a data collection strategy into the service/product life-cycle.
 It must be built into the core product itself.
 Basically, every time a user engages with the product/service, you want to collect data from the interaction.
 The goal is to use this constant flow of new data to improve your product/service.

Data Collection and Preparation

 Solving the right problem:


 Understand the purpose for a model.
 Ask about who, what, when, where and why?
 Is the problem viable for machine learning (AI)?

Data Collection and Preparation

 Data preparation is a set of procedures that makes your dataset more suitable for ML:
 Articulate the problem early
 Establish data collection mechanisms (data-driven culture)
 Format data to make it consistent
 Reduce data (attribute sampling)
 Complete data cleaning
 Decompose data (complex data sets)
 Rescale data (data normalization)
 Discretize data (numerical → categorical values; see the sketch below)
 Private datasets capture the specifics of your unique business and potentially have all relevant attributes.
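 A minimal sketch of the last two steps, rescaling and discretization (the column name and bin edges are illustrative):

    # A minimal sketch: rescaling to [0, 1], then discretizing into categories.
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    ages = pd.DataFrame({"age": [18, 25, 40, 67, 90]})
    scaled = MinMaxScaler().fit_transform(ages)  # rescale to the range [0, 1]
    buckets = pd.cut(ages["age"], bins=[0, 30, 60, 100],
                     labels=["young", "middle", "senior"])  # numeric -> categorical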
Data Collection, Preparation and Delivery

Python

 Python is a great language for ML.

 It has clear syntax:
 High-level data types (lists, tuples, dictionaries, sets, etc.)
 Program in any style (OO, procedural, functional, and so on)
 Makes text manipulation extremely easy

 There are a number of useful libraries:
 SciPy and NumPy: for vector and matrix operations.
 Matplotlib: can plot 2D and 3D plots.
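 A tiny taste of those libraries (a generic sketch, nothing deck-specific):

    # A minimal sketch: a NumPy matrix-vector product and a Matplotlib plot.
    import matplotlib.pyplot as plt
    import numpy as np

    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    x = np.array([1.0, -1.0])
    print(A @ x)              # matrix-vector product -> [-1. -1.]

    xs = np.linspace(0, 2 * np.pi, 100)
    plt.plot(xs, np.sin(xs))  # a simple 2D line plot
    plt.show()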

Classifying with k-Nearest Neighbors

K-Nearest Neighbors (KNN)

 kNN is easy to grasp (understand and implement) and very effective (a powerful tool).
 The model for kNN is the entire training dataset.

 Pros: high accuracy, insensitive to outliers, no assumptions about the data.
 Cons: computationally expensive, requires a lot of memory.
 Works with: numeric values, nominal values. (classification and regression)

K-Nearest Neighbors (KNN)

 We have an existing set of example data (training set).


 We know what class each piece of the data should fall into.

 When we’re given a new piece of data without a label.


 We compare that new piece of data to the existing data, every piece of existing data.
 We then take the most similar pieces of data (the nearest neighbors) and look at their
labels.

K-Nearest Neighbors (KNN)

 We have an existing set of example data (training set).


 We look at the top k most similar pieces of data from our known dataset. (usually less than 20)
 The K is often set to an odd number to prevent ties.

 Lastly, we take a majority vote from the k most similar pieces of data, and the majority is the new
class we assign to the data we were asked to classify.

K-Nearest Neighbors (KNN)

 kNN and other non-parametric models can be useful when training data is abundant and you have little prior knowledge about the relationship between the response and explanatory variables.
 kNN makes only one assumption: instances that are near each other are likely to have similar values of the response variable.

 A model that makes assumptions about the relationship can be useful if training data is scarce or if you already know about the relationship.

KNN Classification

 Classifying movies into romance or action movies.


 The number of kisses and kicks in each movie (features)

 Now, you find a movie you haven’t seen yet and want to know if it’s a romance movie or an action movie.
 To determine this, we’ll use the kNN algorithm.

KNN Classification
 We find the movie in question and see how many kicks and kisses it has.

Classifying movies by plotting the # kicks and kisses in each movie


KNN Classification

Movies with the # of kicks, # of kisses along with their class

KNN Classification

 We don’t know what type of movie the question mark movie is.
 First, we calculate the distance to all the other movies.

Distance between each movie and the unknown movie


KNN Classification

Euclidean distance, where the distance between two vectors A and B is d(A, B) = ((A1 - B1)² + (A2 - B2)²)^(1/2)

KNN Classification

 Let’s assume k=3.


 Then, the three closest movies are He’s Not Really into Dudes, Beautiful Woman, and California Man.
 Because all three movies are romances, we forecast that the mystery movie is a romance movie. (majority vote)
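 The whole procedure takes a few lines with scikit-learn; the sketch below re-creates the movie idea with illustrative (kicks, kisses) counts, since the slide’s table isn’t reproduced here:

    # A minimal sketch: kNN movie classification with scikit-learn.
    from sklearn.neighbors import KNeighborsClassifier

    X = [[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]]  # (kicks, kisses)
    y = ["romance", "romance", "romance", "action", "action", "action"]

    model = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # "training" just stores the data
    print(model.predict([[18, 90]]))                       # majority vote -> ['romance']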

General Approach to KNN

 General approach to kNN:


 Collect: Any method
 Prepare: Numeric values are needed for a distance calculation.
 Analyze: Any method (plotting).
 Train: Does not apply to the kNN algorithm.
 Test: Calculate the error rate.
 Use: This application needs to get some input data and output structured numeric values.

K-Nearest Neighbors (KNN)

 kNN is an instance-based learning algorithm.

 Non-instance (model-based) supervised learning: the training pairs <x, y>1 … <x, y>n are compressed into a function, e.g. F(x) = wx + b.
 Instance-based supervised learning: the training pairs are stored in a database, and prediction is a lookup: F(x) = lookup(x).

K-Nearest Neighbors (KNN)

 Advantages:
 It remembers
 Fast (no learning time)
 Simple and straightforward

 Downsides:
 No generalization
 Over-fitting (noise)
 Computationally expensive for large datasets

K-Nearest Neighbors (KNN)

 Given:
 Training data D = {(xi, yi)}
 Distance metric d(q, x): domain knowledge is important
 Number of neighbors k: domain knowledge is important
 Query point q

 kNN = { xi : d(q, xi) is among the k smallest distances }

 Return:
 Classification: majority vote of the yi.
 Regression: mean of the yi.
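 The definition above translates almost line for line into code; a minimal from-scratch sketch of both variants:

    # A minimal sketch of kNN, following the definition above.
    from collections import Counter

    def knn(D, q, k, d):
        # D: list of (x, y) pairs; q: query point; k: neighbors; d: distance metric.
        nearest = sorted(D, key=lambda xy: d(q, xy[0]))[:k]  # k smallest d(q, xi)
        return [y for _, y in nearest]

    def classify(D, q, k, d):
        return Counter(knn(D, q, k, d)).most_common(1)[0][0]  # majority vote of the yi

    def regress(D, q, k, d):
        ys = knn(D, q, k, d)
        return sum(ys) / len(ys)                              # mean of the yi

    euclid = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5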

KNN- Regression Problem

 The similarity measure depends on the type of the data:

 Real-valued data: Euclidean distance
 Categorical or binary data: Hamming distance (the p-norm with p = 0)

 Regression exercise:

   X1, X2 | y
   1, 6   | 7
   2, 4   | 8
   3, 7   | 16
   6, 8   | 44
   7, 1   | 50
   8, 4   | 68

 Query Q = (4, 2), y = ?
 Euclidean: 1-NN ___, 3-NN ___
 Manhattan: 1-NN ___, 3-NN ___
KNN- Regression Problem

 Squared Euclidean distance: ED = (X1i - q1)² + (X2i - q2)²  (ranking by ED is the same as ranking by the Euclidean distance ED^(1/2))

   X1, X2 | y  | ED
   1, 6   | 7  | 25
   2, 4   | 8  | 8
   3, 7   | 16 | 26
   6, 8   | 44 | 40
   7, 1   | 50 | 10
   8, 4   | 68 | 20

 Query Q = (4, 2), y = ?
 Euclidean 1-NN: nearest point is (2, 4), so y = 8
 Euclidean 3-NN: mean of y over (2, 4), (7, 1), (8, 4) = (8 + 50 + 68) / 3 = 42

KNN- Regression Problem

 Manhattan distance: MD = |X1i - q1| + |X2i - q2|

   X1, X2 | y  | MD
   1, 6   | 7  | 7
   2, 4   | 8  | 4
   3, 7   | 16 | 6
   6, 8   | 44 | 8
   7, 1   | 50 | 4
   8, 4   | 68 | 6

 Query Q = (4, 2), y = ?
 Manhattan 1-NN: (2, 4) and (7, 1) tie at distance 4, so average their y: (8 + 50) / 2 = 29
 Manhattan 3-NN: the tie at distance 6 brings in both (3, 7) and (8, 4), so average all four: (8 + 50 + 16 + 68) / 4 = 35.5
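 The arithmetic above is easy to double-check with NumPy (a sketch; the ties are averaged by hand exactly as the slide does):

    # A minimal sketch verifying the worked example with NumPy.
    import numpy as np

    X = np.array([[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]])
    y = np.array([7, 8, 16, 44, 50, 68])
    q = np.array([4, 2])

    ed = ((X - q) ** 2).sum(axis=1)  # squared Euclidean: [25  8 26 40 10 20]
    md = np.abs(X - q).sum(axis=1)   # Manhattan:         [ 7  4  6  8  4  6]

    print(y[ed.argmin()])                  # Euclidean 1-NN -> 8
    print(y[np.argsort(ed)[:3]].mean())    # Euclidean 3-NN -> 42.0
    print(y[md == md.min()].mean())        # Manhattan 1-NN (tie averaged) -> 29.0
    print(y[md <= np.sort(md)[2]].mean())  # Manhattan 3-NN (ties included) -> 35.5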

K-Nearest Neighbors Bias

 Preference bias:
 Our belief about what makes a good hypothesis.
 Locality: near points are similar (distance function / domain)
 Smoothness: averaging
 All features matter equally
 Best practices for data preparation:
 Rescale data: normalizing the data to the range [0, 1] is a good idea.
 Address missing data: exclude or impute the missing values.
 Lower the dimensionality: kNN is suited to lower-dimensional data.

KNN and Curse of Dimensionality

 As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.
 Exponential means “bad”: O(2^d).

Some Other Issues

 What is needed to select a kNN model?

 How to measure closeness of neighbors.
 The correct value for k.

 d(x, q): Euclidean, Manhattan, weighted, etc.

 The choice of the distance function matters.
 The value of k:
 k = n: the average of all the data (no need for a query)
 k = n with a weighted average [locally weighted regression]
Summary

 kNN is an example of instance-based learning.


 The algorithm has to carry around the full dataset; for large datasets, this implies a large amount of storage.
 Need to calculate the distance measurement for every piece of data in the database, and this can be cumbersome.
 kNN doesn’t give you any idea of the underlying structure of the data.
 kNN is an example of lazy learning, which is the opposite of eager learning.
 kNN can handle both classification and regression.

Summary
 kNN is positioned in the algorithm list of scikit-learn.

Question & Answer

Thank You !!!

Python Programming

 Python: the programming language (python tutorial)
 IPython: an advanced Python shell. (Anaconda - Jupyter)
 NumPy: to manipulate numeric data (Numerical Python)
 SciPy: high-level scientific computation (Scientific Python) - optimization, regression, interpolation.
 Matplotlib: 2-D visualization, “publication-ready” plots.
 Scikit-learn: the ML algorithms in Python.

Assignment One - Python Programming
 NumPy (the original slides show NumPy code examples as images)
Python Programming
 Matplotlib (the original slides show Matplotlib plotting examples as images)
Python Programming
 SciPy (the original slides show SciPy code examples as images)
Tool Set

 Jupyter notebooks
 Interactive coding and Visualization of output
 NumPy, SciPy, Pandas
 Numerical computation
 Matplotlib, Seaborn
 Data visualization
 Scikit-learn
 Machine learning

Jupyter Cell
 %matplotlib inline: display plots inline in Jupyter notebook.

Jupyter Cell
 %%timeit: time how long a cell takes to execute.

 %run filename.ipynb: execute code from another notebook or Python file.

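 For instance (a trivial sketch; any cell body works):

    %%timeit
    sum(range(1000))  # the cell magic reports this statement's average run time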
Introduction to Pandas: Series

 Library for computation with tabular data.


 Mixed types of data allowed in a single table.
 Columns and rows of data can be named.
 Advanced data aggregation and statistical functions.

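 A minimal pandas Series sketch (values are illustrative):

    # A minimal sketch: a labeled pandas Series.
    import pandas as pd

    s = pd.Series([7, 8, 16], index=["a", "b", "c"])  # named rows
    print(s["b"])    # label-based access -> 8
    print(s.mean())  # built-in statistics -> 10.333...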
Introduction to Pandas
 Library for computation with tabular data. (the original slides show pandas code examples as images)
Introduction to Pandas: Dataframe
 Library for computation with tabular data. (the original slides show DataFrame code examples as images)
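 A minimal DataFrame sketch in the same spirit (column names and values are illustrative):

    # A minimal sketch: a pandas DataFrame with mixed column types.
    import pandas as pd

    df = pd.DataFrame({"movie": ["California Man", "Kevin Longblade"],
                       "kicks": [3, 101],
                       "kisses": [104, 10]})
    print(df["kicks"].mean())        # column statistics -> 52.0
    print(df.sort_values("kisses"))  # rows stay aligned when sorting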
