Classification: Machine Learning Basics and kNN


Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2020)
Outline

 A brief overview of ML
 Key tasks in ML
 Why we need ML
 Why Python is so great for ML
 K-nearest neighbors algorithm
 kNN Classification
 kNN Regression
 Some Issues in KNN
 Python Modules to work on the ML Algorithms

Machine Learning

 With machine learning we can gain insight from a dataset.


 We’re going to ask the computer to make some sense of the data.
 This is what we mean by learning.

 Machine learning is the process of turning data into information and knowledge.
 ML lies at the intersection of computer science, engineering, and statistics
and often appears in other disciplines.

What is Machine Learning?

 It’s a tool that can be applied to many problems.


 Any field that needs to interpret and act on data can benefit
from ML techniques.

 There are many problems where the solution isn’t deterministic.


 That is, we don’t know enough about the problem or don’t have
enough computing power to properly model the problem.

Traditional vs. ML Systems

 In ML, once the system is provided with the right data and algorithms, it can "fish for itself".

Traditional vs. ML Systems

 A key aspect of ML that makes it particularly appealing in terms of business value is that it does not require as much explicit programming in advance.

Sensors and the Data Deluge

 We have a tremendous amount of human-created data from the WWW, but recently more non-human sources of data have been coming online.
 Sensors connected to the web.
 Sensors account for roughly 20% of non-video internet traffic.
 Data collected from mobile phones (three-axis accelerometers, temperature sensors, and GPS receivers).

 The two trends of mobile computing and sensor-generated data mean that we’ll be getting more and more data in the future.

Key Terminology

 Weight, Wingspan, Webbed feet, and Back color are features or attributes.
 An instance is made up of features. (also called an example)
 Species is the target variable. (response, outcome, output, etc.)
 Attributes can be numeric, binary, or nominal.

Key Terminology

 To train the ML algorithm we need to feed it quality data known as a training set.
 In the above example each training example (instance) has four features and one target variable.
 In a training set the target variable is known.

 The machine learns by finding some relationship between the features and the target variable.
 In the classification problem the target variables are called classes, and there is assumed to be a finite number of classes.

Key Terminology Cont…

 To test machine learning algorithms, a separate dataset is used, called a test set.
 The target variable for each example from the test set isn’t given to the program.

 The program (model) decides which class each example belongs to.
 Then we compare the predicted value with the actual target variable.

Key Tasks of Machine Learning

 In classification, our job is to predict what class an instance of data should fall into.
 Regression is the prediction of a numeric value.

 Classification and regression are examples of supervised learning.


 This set of problems is known as supervised because we’re telling the algorithm what to predict.

Key Tasks of Machine Learning

 The opposite of supervised learning is a set of tasks known as unsupervised learning.


 In unsupervised learning, there’s no label or target value given for the data; grouping similar items together is known as clustering.
 In unsupervised learning, we may also want to find statistical values that describe the data. This is known as density estimation.
 Another task of unsupervised learning may be reducing the data from many features to a small number so that we can properly
visualize the dimensions.

Key Tasks of Machine Learning

 There are common algorithms used to perform classification, regression, clustering, and density estimation tasks.
 Balancing generalization and memorization (overfitting) is a common problem for many ML algorithms.
 Regularization techniques are used to reduce overfitting.

Key Tasks of Machine Learning

 There are two fundamental causes of prediction error: a model’s bias and its variance.
 A model with high variance over-fits the training data, while a model with high bias under-fits the training data.
 High bias, low variance (under-fitting)
 Low bias, high variance (over-fitting)
 High bias, high variance (under- and over-fitting)
 Low bias, low variance (a good model)
 The predictive power of many ML algorithms improves as the amount of training data increases.
 Quality of data is also important.

Key Tasks of Machine Learning

 Ideally, a model will have both low bias and low variance, but efforts to reduce one will frequently increase the other. This is known as the bias-variance trade-off.

 Common measurements of performance:

 Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
 Precision (P) = TP / (TP + FP)
 Recall (R) = TP / (TP + FN)
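 As a quick sketch (not from the slides): the three measures computed with scikit-learn’s metrics on made-up labels.

    # A minimal sketch: accuracy, precision, and recall with scikit-learn.
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual classes (illustrative values)
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions (illustrative values)

    print(accuracy_score(y_true, y_pred))   # (TP + TN) / (TP + TN + FP + FN) -> 0.75
    print(precision_score(y_true, y_pred))  # TP / (TP + FP) -> 0.75
    print(recall_score(y_true, y_pred))     # TP / (TP + FN) -> 0.75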

How to Choose the Right Algorithm

 First, you need to consider your goal.


 If you’re trying to predict or forecast a target value, then you need to look into supervised learning.
 If not, then unsupervised learning is the place you want to be.

 If you’ve chosen supervised learning, what’s your target value?


 Discrete values (y/n, 1/2/3, Red/Yellow/Black): classification
 Continuous values (0.00 to 100.00, etc.): regression

How to Choose the Right Algorithm

 Spend some time getting to know the data; the more we know about it, the more successful an application we can build. (this is 70-80% of the time)
 Things to know about the data are these:
 Are the features nominal or continuous?
 Are there missing values in the features?
 If there are missing values, why are there missing values?
 Are there outliers in the data? (e.g., 80, 81, 82, 83, 245)

 All of these facts about your data can help you narrow the algorithm selection process.

How to Choose the Right Algorithm

 Finding the best algorithm is an iterative process of trial and error.

 Steps in developing a machine learning application:
 Collect data: scraping a website, an RSS feed, an API, etc.
 Prepare the input data: make sure the data is in a usable, consistent format.
 Analyze the input data: look at the data.
 Understand the data.
 Train the algorithm: this is where the ML takes place. (does not apply to unsupervised learning)
 Test the algorithm: if results are unsatisfactory, go back to the training step.
 Use it. (implement the ML application)

Problem Solving Framework

 Problem-solving framework for an ML application:

 Business issue understanding
 Data understanding
 Data preparation
 Analysis / Modeling
 Validation
 Presentation / Visualization

Machine Learning Systems and Data

 In AI (ML), instead of writing a program by hand for each specific task, we collect lots of examples that specify the correct output for a given input.
 The most important factor in ML is not the algorithm or the software system.
 The quality of the data is the soul of an ML system.

Machine Learning Systems and Data

 Invalid training data:
 Garbage In ------ Garbage Out.

 An invalid dataset leads to invalid results.

 This is not to say that the training data needs to be perfect.
 Out of a million examples, a few inaccurate labels are acceptable.

 The quality of the data is the soul of an ML system.

Machine Learning Systems and Data

 “Garbage” can be several things:

 Wrong labels (Dog – Cat, Cat – Dog)
 Inaccurate and missing values
 A biased dataset, etc.
 Handling missing data: (sketched in code below)
 Rows and columns with a small portion of missing values – discard them
 Data imputation (time-series data) – carry forward the last valid value
 Substitute with the mean or median
 Predict the missing values from the available data
 A missing value can have a meaning on its own (missingness)
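 To make those imputation options concrete, here is a minimal pandas sketch (column names and values are made up for illustration):

    # A minimal sketch of the missing-data options above, using pandas.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [22, np.nan, 35, 41],
                       "salary": [6000, 9500, np.nan, 20000]})

    df.dropna()                            # discard rows with missing values
    df.ffill()                             # time series: carry the last valid value forward
    df.fillna(df.mean(numeric_only=True))  # substitute with the column mean
    df["salary"].isna().astype(int)        # keep "missingness" itself as a feature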
Machine Learning Systems and Data

 Having a clean dataset is not always enough.

 Features with large magnitudes can dominate features with small magnitudes during training.
 Example: age [0-100] vs. salary [6,000-20,000] – apply scaling and standardization.
 Data imbalance:
 Leave it as it is.
 Under-sampling (if all classes are equally important) [5000 → 25]
 Over-sampling (if all classes are equally important) [25 → 5000]

   No | Class | Number
   1  | Cat   | 5000
   2  | Dog   | 5000
   3  | Tiger | 150
   4  | Cow   | 25
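 A minimal sketch of both remedies, assuming the class counts from the table above (the feature values are fabricated):

    # A minimal sketch: feature scaling, then naive under-sampling with pandas.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    features = pd.DataFrame({"age": [23, 45, 31, 60],
                             "salary": [6000, 12000, 9000, 20000]})
    scaled = StandardScaler().fit_transform(features)  # each column: mean 0, std 1

    data = pd.DataFrame({"label": ["Cat"] * 5000 + ["Dog"] * 5000
                                  + ["Tiger"] * 150 + ["Cow"] * 25})
    n = data["label"].value_counts().min()             # size of the smallest class: 25
    balanced = data.groupby("label").sample(n=n, random_state=0)  # 25 of each class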
Challenges in Machine Learning

 It requires considerable data and compute power.


 It requires knowledgeable data science specialists or teams.
 It adds complexity to the organization's data integration
strategy. (data-driven culture)

 Learning AI (ML) algorithms is challenging without an advanced math background.
 The context of data often changes. (private data vs. public data)
 Algorithmic bias, privacy, and ethical concerns may be overlooked.
Stages of ML Process

 The first key step in preparing to explore and exploit AI(ML) is to


understand the basic stages involved.

Stages of ML Process

 Machine Learning Tasks and Subtasks:

Data Collection and Preparation

 Data collection is the process of gathering and measuring information from countless different sources.
 Data is being generated at an unprecedented rate. These data can be:
 Numeric (temperature, loan amount, customer retention rate),
 Categorical (gender, color, highest degree earned), or
 Even free text (think doctor’s notes or opinion surveys).

 In order to use the data we collect to develop practical solutions, it must be collected and stored in a way that makes sense for the business problem at hand.
Data Collection and Preparation

 During AI development, we always rely on data.

 From training, tuning, and model selection to testing, we use three different data sets: the training set, the validation set, and the testing set.

 The validation set is used to select and tune the final ML model.
 The test data set is used to evaluate how well your algorithm was trained with the training data set.
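 One common way to carve out the three sets is two successive splits; a minimal scikit-learn sketch (the 60/20/20 proportions are illustrative, not mandated by the slides):

    # A minimal sketch: train / validation / test split with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    # First split off 20% as the test set ...
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    # ... then split 25% of the remainder off as validation (0.25 * 0.8 = 0.2).
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=0)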

Data Collection and Preparation

 Testing sets typically represent 20% or 30% of the data. (cross-validation)

 The test set must consist of input data grouped together with verified correct outputs, generally by human verification.

Data Collection and Preparation

 The most successful AI projects are those that integrate a data collection strategy into the service/product life-cycle.
 It must be built into the core product itself.
 Basically, every time a user engages with the product/service, you want to collect data from the interaction.
 The goal is to use this constant flow of new data to improve your product/service.

Data Collection and Preparation

 Solving the right problem:


 Understand the purpose for a model.
 Ask about who, what, when, where and why?
 Is the problem viable for machine learning (AI)?

Data Collection and Preparation

 Data preparation is a set of procedures that makes your dataset more suitable for ML:
 Articulate the problem early
 Establish data collection mechanisms (data-driven culture)
 Format data to make it consistent
 Reduce data (attribute sampling)
 Complete data cleaning
 Decompose data (complex data sets)
 Rescale data (data normalization)
 Discretize data (numerical → categorical values; see the sketch below)
 Private datasets capture the specifics of your unique business and potentially have all relevant attributes.
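 A minimal sketch of the last two steps, rescaling and discretization (the column name and bin edges are illustrative):

    # A minimal sketch: rescaling to [0, 1], then discretizing into categories.
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    ages = pd.DataFrame({"age": [18, 25, 40, 67, 90]})
    scaled = MinMaxScaler().fit_transform(ages)  # rescale to the range [0, 1]
    buckets = pd.cut(ages["age"], bins=[0, 30, 60, 100],
                     labels=["young", "middle", "senior"])  # numeric -> categorical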
Data Collection, Preparation and Delivery

Python

 Python is a great language for ML.

 It has clear syntax:
 High-level data types (lists, tuples, dictionaries, sets, etc.)
 Program in any style (OO, procedural, functional, and so on)
 Makes text manipulation extremely easy

 There are a number of useful libraries:
 SciPy and NumPy: for vector and matrix operations.
 Matplotlib: can plot 2D and 3D plots.
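 A tiny taste of those libraries (a generic sketch, nothing deck-specific):

    # A minimal sketch: a NumPy matrix-vector product and a Matplotlib plot.
    import matplotlib.pyplot as plt
    import numpy as np

    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    x = np.array([1.0, -1.0])
    print(A @ x)              # matrix-vector product -> [-1. -1.]

    xs = np.linspace(0, 2 * np.pi, 100)
    plt.plot(xs, np.sin(xs))  # a simple 2D line plot
    plt.show()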

Classifying with k-Nearest Neighbors

K-Nearest Neighbors (KNN)

 kNN is easy to grasp (understand and implement) and very effective (a powerful tool).
 The model for kNN is the entire training dataset.

 Pros: high accuracy, insensitive to outliers, no assumptions about the data.
 Cons: computationally expensive, requires a lot of memory.
 Works with: numeric values, nominal values. (classification and regression)

K-Nearest Neighbors (KNN)

 We have an existing set of example data (training set).


 We know what class each piece of the data should fall into.

 When we’re given a new piece of data without a label.


 We compare that new piece of data to the existing data, every piece of existing data.
 We then take the most similar pieces of data (the nearest neighbors) and look at their
labels.

K-Nearest Neighbors (KNN)

 We have an existing set of example data (training set).


 We look at the top k most similar pieces of data from our known dataset. (usually less than 20)
 The K is often set to an odd number to prevent ties.

 Lastly, we take a majority vote from the k most similar pieces of data, and the majority is the new
class we assign to the data we were asked to classify.

K-Nearest Neighbors (KNN)

 kNN and other non-parametric models can be useful when training data is abundant and you have little prior knowledge about the relationship between the response and explanatory variables.
 kNN makes only one assumption: instances that are near each other are likely to have similar values of the response variable.

 A model that makes assumptions about the relationship can be useful if training data is scarce or if you already know about the relationship.

KNN Classification

 Classifying movies into romance or action movies.


 The number of kisses and kicks in each movie (features)

 Now, you find a movie you haven’t seen yet and want to know if it’s a romance movie or an action movie.
 To determine this, we’ll use the kNN algorithm.

KNN Classification
 We find the movie in question and see how many kicks and kisses it has.

Classifying movies by plotting the # kicks and kisses in each movie


KNN Classification

Movies with the # of kicks, # of kisses along with their class

KNN Classification

 We don’t know what type of movie the question mark movie is.
 First, we calculate the distance to all the other movies.

Distance between each movie and the unknown movie


KNN Classification

Euclidean distance, where the distance between two vectors A and B is d(A, B) = ((A1 - B1)² + (A2 - B2)²)^(1/2)

KNN Classification

 Let’s assume k=3.


 Then, the three closest movies are He’s Not Really into Dudes, Beautiful Woman, and California Man.
 Because all three movies are romances, we forecast that the mystery movie is a romance movie. (majority vote)
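 The whole procedure takes a few lines with scikit-learn; the sketch below re-creates the movie idea with illustrative (kicks, kisses) counts, since the slide’s table isn’t reproduced here:

    # A minimal sketch: kNN movie classification with scikit-learn.
    from sklearn.neighbors import KNeighborsClassifier

    X = [[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]]  # (kicks, kisses)
    y = ["romance", "romance", "romance", "action", "action", "action"]

    model = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # "training" just stores the data
    print(model.predict([[18, 90]]))                       # majority vote -> ['romance']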

General Approach to KNN

 General approach to kNN:


 Collect: Any method
 Prepare: Numeric values are needed for a distance calculation.
 Analyze: Any method (plotting).
 Train: Does not apply to the kNN algorithm.
 Test: Calculate the error rate.
 Use: This application needs to get some input data and output structured numeric values.

K-Nearest Neighbors (KNN)

 kNN is an instance-based learning algorithm.

 Non-instance (model-based) supervised learning: the training pairs <x, y>1 … <x, y>n are compressed into a function, e.g. F(x) = wx + b.
 Instance-based supervised learning: the training pairs are stored in a database, and prediction is a lookup: F(x) = lookup(x).

K-Nearest Neighbors (KNN)

 Advantages:
 It remembers
 Fast (no learning time)
 Simple and straightforward

 Downsides:
 No generalization
 Over-fitting (noise)
 Computationally expensive for large datasets

K-Nearest Neighbors (KNN)

 Given:
 Training data D = {(xi, yi)}
 Distance metric d(q, x): domain knowledge is important
 Number of neighbors k: domain knowledge is important
 Query point q

 kNN = { xi : d(q, xi) is among the k smallest distances }

 Return:
 Classification: majority vote of the yi.
 Regression: mean of the yi.
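 The definition above translates almost line for line into code; a minimal from-scratch sketch of both variants:

    # A minimal sketch of kNN, following the definition above.
    from collections import Counter

    def knn(D, q, k, d):
        # D: list of (x, y) pairs; q: query point; k: neighbors; d: distance metric.
        nearest = sorted(D, key=lambda xy: d(q, xy[0]))[:k]  # k smallest d(q, xi)
        return [y for _, y in nearest]

    def classify(D, q, k, d):
        return Counter(knn(D, q, k, d)).most_common(1)[0][0]  # majority vote of the yi

    def regress(D, q, k, d):
        ys = knn(D, q, k, d)
        return sum(ys) / len(ys)                              # mean of the yi

    euclid = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5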

KNN- Regression Problem

 The similarity measure depends on the type of the data:

 Real-valued data: Euclidean distance
 Categorical or binary data: Hamming distance (the p-norm with p = 0)

 Regression exercise:

   X1, X2 | y
   1, 6   | 7
   2, 4   | 8
   3, 7   | 16
   6, 8   | 44
   7, 1   | 50
   8, 4   | 68

 Query Q = (4, 2), y = ?
 Euclidean: 1-NN ___, 3-NN ___
 Manhattan: 1-NN ___, 3-NN ___
KNN- Regression Problem

 Squared Euclidean distance: ED = (X1i - q1)² + (X2i - q2)²  (ranking by ED is the same as ranking by the Euclidean distance ED^(1/2))

   X1, X2 | y  | ED
   1, 6   | 7  | 25
   2, 4   | 8  | 8
   3, 7   | 16 | 26
   6, 8   | 44 | 40
   7, 1   | 50 | 10
   8, 4   | 68 | 20

 Query Q = (4, 2), y = ?
 Euclidean 1-NN: nearest point is (2, 4), so y = 8
 Euclidean 3-NN: mean of y over (2, 4), (7, 1), (8, 4) = (8 + 50 + 68) / 3 = 42

KNN- Regression Problem

 Manhattan distance: MD = |X1i - q1| + |X2i - q2|

   X1, X2 | y  | MD
   1, 6   | 7  | 7
   2, 4   | 8  | 4
   3, 7   | 16 | 6
   6, 8   | 44 | 8
   7, 1   | 50 | 4
   8, 4   | 68 | 6

 Query Q = (4, 2), y = ?
 Manhattan 1-NN: (2, 4) and (7, 1) tie at distance 4, so average their y: (8 + 50) / 2 = 29
 Manhattan 3-NN: the tie at distance 6 brings in both (3, 7) and (8, 4), so average all four: (8 + 50 + 16 + 68) / 4 = 35.5
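 The arithmetic above is easy to double-check with NumPy (a sketch; the ties are averaged by hand exactly as the slide does):

    # A minimal sketch verifying the worked example with NumPy.
    import numpy as np

    X = np.array([[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]])
    y = np.array([7, 8, 16, 44, 50, 68])
    q = np.array([4, 2])

    ed = ((X - q) ** 2).sum(axis=1)  # squared Euclidean: [25  8 26 40 10 20]
    md = np.abs(X - q).sum(axis=1)   # Manhattan:         [ 7  4  6  8  4  6]

    print(y[ed.argmin()])                  # Euclidean 1-NN -> 8
    print(y[np.argsort(ed)[:3]].mean())    # Euclidean 3-NN -> 42.0
    print(y[md == md.min()].mean())        # Manhattan 1-NN (tie averaged) -> 29.0
    print(y[md <= np.sort(md)[2]].mean())  # Manhattan 3-NN (ties included) -> 35.5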

K-Nearest Neighbors Bias

 Preference bias:
 Our belief about what makes a good hypothesis.
 Locality: near points are similar (distance function / domain)
 Smoothness: averaging
 All features matter equally
 Best practices for data preparation:
 Rescale data: normalizing the data to the range [0, 1] is a good idea.
 Address missing data: exclude or impute the missing values.
 Lower the dimensionality: kNN is suited to lower-dimensional data.

KNN and Curse of Dimensionality

 As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.
 Exponential means “bad”: O(2^d).

Some Other Issues

 What is needed to select a kNN model?

 How to measure closeness of neighbors.
 The correct value for k.

 d(x, q): Euclidean, Manhattan, weighted, etc.

 The choice of the distance function matters.
 The value of k:
 k = n: the average of all the data (no need for a query)
 k = n with a weighted average [locally weighted regression]
Summary

 kNN is an example of instance-based learning.


 The algorithm has to carry around the full dataset; for large datasets, this implies a large amount of storage.
 Need to calculate the distance measurement for every piece of data in the database, and this can be cumbersome.
 kNN doesn’t give you any idea of the underlying structure of the data.
 kNN is an example of lazy learning, which is the opposite of eager learning.
 kNN can handle both classification and regression.

Summary
 kNN is positioned in the algorithm list of scikit-learn.

Question & Answer

Thank You !!!

Python Programming

 Python: the programming language (python tutorial)
 IPython: an advanced Python shell. (Anaconda - Jupyter)
 NumPy: to manipulate numeric data (Numerical Python)
 SciPy: high-level scientific computation (Scientific Python) - optimization, regression, interpolation.
 Matplotlib: 2-D visualization, “publication-ready” plots.
 Scikit-learn: the ML algorithms in Python.

Assignment One - Python Programming
 NumPy (the original slides show NumPy code examples as images)
Python Programming
 Matplotlib (the original slides show Matplotlib plotting examples as images)
Python Programming
 SciPy (the original slides show SciPy code examples as images)
Tool Set

 Jupyter notebooks
 Interactive coding and Visualization of output
 NumPy, SciPy, Pandas
 Numerical computation
 Matplotlib, Seaborn
 Data visualization
 Scikit-learn
 Machine learning

Jupyter Cell
 %matplotlib inline: display plots inline in Jupyter notebook.

Jupyter Cell
 %%timeit: time how long a cell takes to execute.

 %run filename.ipynb: execute code from another notebook or Python file.

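 For instance (a trivial sketch; any cell body works):

    %%timeit
    sum(range(1000))  # the cell magic reports this statement's average run time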
Introduction to Pandas: Series

 Library for computation with tabular data.


 Mixed types of data allowed in a single table.
 Columns and rows of data can be named.
 Advanced data aggregation and statistical functions.

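 A minimal pandas Series sketch (values are illustrative):

    # A minimal sketch: a labeled pandas Series.
    import pandas as pd

    s = pd.Series([7, 8, 16], index=["a", "b", "c"])  # named rows
    print(s["b"])    # label-based access -> 8
    print(s.mean())  # built-in statistics -> 10.333...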
Introduction to Pandas
 Library for computation with tabular data. (the original slides show pandas code examples as images)
Introduction to Pandas: Dataframe
 Library for computation with tabular data. (the original slides show DataFrame code examples as images)
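 A minimal DataFrame sketch in the same spirit (column names and values are illustrative):

    # A minimal sketch: a pandas DataFrame with mixed column types.
    import pandas as pd

    df = pd.DataFrame({"movie": ["California Man", "Kevin Longblade"],
                       "kicks": [3, 101],
                       "kisses": [104, 10]})
    print(df["kicks"].mean())        # column statistics -> 52.0
    print(df.sort_values("kisses"))  # rows stay aligned when sorting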
