
MACHINE LEARNING

Professional Core (CET3006B)


T. Y. B.Tech CSE

SCHOOL OF COMPUTER ENGINEERING & TECHNOLOGY


MACHINE LEARNING
• Credits : 3+1 (Four)
• Examination scheme: Total Marks-100
- 30 Marks CCA
- 30 Marks LCA
- 40 Marks End Term Examination
MACHINE LEARNING
Course Objectives:
1.Knowledge:
i. To learn data preparation techniques for Machine Learning methods
ii. To understand advanced supervised and unsupervised learning methods
2.Skills:
i. To apply suitable pre-processing techniques on various datasets for Machine Learning applications.
ii. To design and implement various advanced supervised and unsupervised learning methods
3.Attitude:
i. To be able to choose and apply suitable ML techniques to solve the problem
ii. To compare & analyze various advanced supervised and unsupervised learning methods.
ADVANCES IN MACHINE LEARNING

Course Outcomes:
After completion of the course the students will be able to:

1. Analyze and apply different data preparation techniques for Machine Learning applications

2. Identify, Analyze and compare appropriate supervised learning algorithm for given problem

3. Identify, Analyze and Compare Unsupervised and semi supervised algorithms

4. Design and implement Machine Learning techniques for real-time applications


Course Contents:

Unit 1. Introduction to ML

Unit 2. Supervised Learning: Classification

Unit 3. Unsupervised Learning: Clustering

Unit 4. Performance Analysis and Model Evaluation

Unit 5. Trends in ML

5
Course Contents:
Laboratory Exercises:
1. Demonstrate feature selection techniques for a dataset from UCI ML library.
2. Implement exploratory data analysis for IRIS dataset.
3. Implementation of Tree based Classifiers.
4. Implementation of SVM, Comparison with Tree Based Classifier
5. Implementation of Ensemble, Random Forests. Analyze the Performance.
6. Implementation and Comparison of various clustering techniques such as Spectral & DBSCAN.
7. Demonstrate performance analysis on a given dataset to find accuracy, precision, recall and the confusion matrix for supervised learning algorithms.
8. Mini-Project based on suitable Machine Learning dataset
6
Course Contents:
Text Books:
1. E. Alpaydin, Introduction to Machine Learning, PHI, 2004.
2. Peter Flach: Machine Learning: The Art and Science of Algorithms that Make Sense of
Data, Cambridge University Press, Edition 2012.
3. T. Mitchell, Machine Learning, McGraw-Hill, 1997.
4. Josh Patterson, Adam Gibson, “Deep Learning: A Practitioner's Approach”, O'Reilly, SPD, ISBN: 978-93-5213-604-9, 1st Edition, 2017
Reference Books:
1. C. M. Bishop: Pattern Recognition and Machine Learning, Springer 1st Edition-2013.
2. Ian H Witten, Eibe Frank, Mark A Hall: Data Mining, Practical Machine Learning Tools
and Techniques, Elsevier, 3rd Edition.
3. Shai Shalev-Shwartz, Shai Ben-David: Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, ISBN 978-1-107-51282-5, 2014.
7
Course Contents:
Supplementary Reading:
1. Aurélien Géron, “Hands-On Machine Learning with Scikit-Learn and TensorFlow”, O'Reilly Media
Web Resources:
1. Popular dataset resource for ML beginners: http://archive.ics.uci.edu/ml/index.php
Web links:
1. https://www.kaggle.com/datasets
2. http://deeplearning.net/datasets/
MOOCs:
1. https://swayam.gov.in/nd1_noc20_cs29/preview
2. https://swayam.gov.in/nd1_noc20_cs44/preview
8
Syllabus-Unit 1
Introduction to ML:
Introduction, Data Preparation
Data Encoding Techniques
Data Pre-processing techniques for ML applications.
Feature Engineering:
Dimensionality Reduction using PCA
Exploratory Data Analysis
Feature Selection

10
AI Vs. ML

11
INTRODUCTION

12
Cntd..
To solve a problem on a computer, we need an algorithm.
An algorithm is a sequence of instructions that should be carried out to transform the input to the output.
Ex. Sorting — input: a set of numbers; output: an ordered list.
For some tasks, however, we do not have an algorithm.
Ex. telling spam emails from legitimate emails.
Input: an email document (a file of characters); output: a yes/no output indicating whether the message is spam or not.
Here we want the computer (machine) to extract the algorithm for this task automatically — this is where machine learning comes in.
Cntd..

(source: https://medium.com/analytics-vidhya/introduction-to-machine-learning-e1b9c055039c)
• Machine learning is a “field of study that gives computers the ability to learn without being explicitly programmed.” — Arthur Samuel (1959)
• In other words, it is concerned with the question of how to construct computer programs that automatically improve with experience.
14
Cntd..
• A computer program is said to learn from experience ‘E’ with respect to some class of tasks ‘T’ and performance measure ‘P’ if its performance at tasks in ‘T’, as measured by ‘P’, improves with experience ‘E’. – Tom M. Mitchell

• Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

• Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
15
Cntd..
Example 1
Classify Email as spam or not spam
• Task (T): Classify email as spam or not spam
• Experience(E): watching the user to mark/label the email as spam or
not spam
• Performance (P): The number or fraction of emails correctly classified as spam or not spam

16
Cntd..
Example 2
Recognizing hand written digits/ characters
• Task(T): Recognizing hand written digit
• Experience (E): watching the user mark/label the hand-written digits into 10 classes (0-9) & identifying the underlying pattern
• Performance (P): The number or fraction of hand-written digits correctly classified

17
Why is Machine Learning Important?
• Human expertise does not exist
Navigating on Mars
industrial/manufacturing control
mass spectrometer analysis, drug design, astronomic discovery
• Black-box human expertise OR Some tasks cannot be defined well, except by
examples
face/handwriting/speech recognition/ recognizing people
driving a car, flying a plane

• Relationships and correlations can be hidden within large amounts of data


(e.g., stock market analysis)
• Environments change over time.
(e.g., routing on a computer network)
18
Cntd..
• The amount of knowledge available about certain tasks might be too large for explicit encoding by
humans
(e.g., medical diagnosis).
• New knowledge about tasks is constantly being discovered by humans. It may be difficult to
continuously re-design systems “by hand”.
• Rapidly changing phenomena
credit scoring, financial modeling
diagnosis, fraud detection
• Need for customization/personalization
personalized news reader
movie/book recommendation

19
How does machine learning help us in daily life?
Social networking :

• Use of appropriate emotions, suggestions about friend tags on Facebook, filters on Instagram, content recommendations and suggested followers on social media platforms, etc., are examples of how machine learning helps us in social networking.
Personal finance and banking solutions

• Whether it’s fraud prevention, credit decisions, or checking deposits on our smartphones, machine learning does it all.
Commute estimation

• Identification of the route to our selected destination, estimation of the time required to reach that destination using different transportation modes, calculating traffic time, and so on are all made possible by machine learning.
Applications of Machine Learning
• Face detection
• Speech recognition
• Stock prediction
• Hand-written digit recognition
• Spam Email Detection
• Computational Biology
• Machine Translation
• Recommender Systems
• Self-parking Cars
• Guiding robots
• Airplane Navigation Systems
• Space Exploration
• Medicine
• Supermarket Chain
• Data Mining

21
Examples…
Example 1: hand-written digit recognition
Input training data: e.g. 500 samples
Output: learn a classifier f(x) such that f : x → {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Example 2: Face detection
Input: an image; the classes are people to be recognized [non-face, frontal-face, profile-face], and the learning program should learn to associate the face images to identities.
This problem is more difficult because there are more classes, the input image is larger, and a face is 3-dimensional; differences in pose and lighting cause significant changes in the image. There may also be occlusion (blockage) of certain inputs; e.g. glasses may hide the eyes and eyebrows, and a beard may hide the chin.
Example 3: Spam detection
(Gmail screenshot: a mail from “Watch Fest <info@crm.foolishpost.com>”, with the notice “Why is this message in Spam? It's similar to messages that were detected by our spam filters.”)

• This is a classification problem


• Task is to classify email into spam/non-spam
• Requires a learning system as “enemy” keeps innovating
Example 4: Stock price prediction

• Task is to predict stock price at future date


• This is a regression task, as the output is continuous
Example : Weather prediction
Example : Medical Diagnosis
❖ Inputs are the relevant information about the patient and the classes are the illnesses.
❖ The inputs contain the patient’s age, gender, past medical history, and current symptoms.
❖ Some tests may not have been applied to the patient, and thus these inputs would be missing.
❖ Tests take time, may be costly, and may inconvenience the patient, so we do not want to apply them unless we believe that they will give us valuable information.
❖ In the case of a medical diagnosis, a wrong decision may lead to a wrong or no treatment, and
in cases of doubt it is preferable that the classifier reject and defer decision to a human expert.
Example : Agriculture

A Crop Yield Prediction App in Senegal Using Satellite Imagery (Video Link)
https://www.youtube.com/watch?v=4OnBGkhA4jc&t=160s
Data Preparation

Data Preparation Pipeline


Data Preparation
Why is Data Preparation important?

Sometimes, data sets have missing or incomplete information, which leads to less accurate or incorrect predictions.
Further, sometimes data sets are clean but not adequately shaped, e.g. aggregated or pivoted, and some lack business context.
Hence, after collecting data from various data sources, data preparation is needed to transform the raw data.
Significant advantages of data preparation in machine learning as follows:
• It helps to provide reliable prediction outcomes in various analytics operations.
• It helps identify data issues or errors and significantly reduces the chances of errors.
• It increases decision-making capability.
• It reduces overall project cost (data management and analytic cost).
• It helps to remove duplicate content to make it worthwhile for different applications.
• It increases model performance.
Steps in Data Preparation Process

1.Understand the problem:


• understand the actual problem and try to solve it.
2.Data collection:
• collect data from various potential sources. These data sources may be either within
enterprise or third parties vendors.
• Data collection is beneficial to reduce and mitigate biasing in the ML model;
• hence before collecting data, always analyze it and also ensure that the data set was
collected from diverse people, geographical areas, and perspectives.
3.Profiling and Data Exploration:
• explore data such as trends, outliers, exceptions, incorrect, inconsistent, missing, or
skewed information, etc.
• Data exploration helps to determine problems such as collinearity (strongly correlated variables), and situations where standardization of data sets and other data transformations are necessary.
Steps in Data Preparation Process

4.Data Cleaning and Validation:


• Data cleaning and validation techniques help determine and solve inconsistencies, outliers,
anomalies, incomplete data, etc.
• Clean data helps to find valuable patterns and information in data and ignores irrelevant data
in the datasets.
5.Data Formatting:
• After cleaning and validating the data, the next step is to ensure that the data is correctly formatted.
Steps in Data Preparation Process

6.Feature engineering and selection:


• Feature engineering is defined as the study of selecting, manipulating, and transforming raw
data into valuable features
There are various feature engineering techniques used in machine learning as follows:
Imputation:
• Feature imputation is the technique to fill incomplete fields in the datasets.
• It is essential because most machine learning models don't work when there are missing data
in the dataset.
• Although, the missing values problem can be reduced by using techniques such as single value
imputation, multiple value imputation, K-Nearest neighbor, deleting the row, etc.
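A minimal sketch of single-value (mean) imputation with scikit-learn's SimpleImputer (the small DataFrame below is a made-up example):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical data with missing entries
df = pd.DataFrame({"age": [25, np.nan, 47, 31], "salary": [50000, 62000, np.nan, 58000]})

# single-value (mean) imputation; "median" or "most_frequent" are other strategies
imputer = SimpleImputer(strategy="mean")
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_filled)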
Encoding:
• Feature encoding is defined as the method to convert string values into numeric form.
• This is important, as most ML models require all values to be in numeric format.
• Feature encoding includes label encoding and One Hot Encoding (also known as
get_dummies).
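A short illustration of both encodings with pandas and scikit-learn (the 'city' column is a made-up example):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Pune", "Mumbai", "Pune", "Delhi"]})

# label encoding: each category becomes an integer code
df["city_label"] = LabelEncoder().fit_transform(df["city"])

# one hot encoding: one binary column per category (pandas calls this get_dummies)
df_onehot = pd.get_dummies(df["city"], prefix="city")
print(df.join(df_onehot))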
Data Pre-Processing
Data preprocessing steps:
1. Data cleaning
2. Data integration
3. Data transformation
4. Data reduction
5. Data discretization
Cntd..
• Data preparation is also known as “data pre-processing,” “data wrangling,” “data cleaning,” and “feature engineering.”
• It is the stage of the machine learning lifecycle that comes immediately after data collection.
The data preparation process can be complicated by issues such as:
1. Missing or incomplete records: Missing data sometimes appears as empty
cells, values (e.g., NULL or N/A), or a particular character, such as a question
mark

40
Cntd..
2. Outliers or anomalies: Unexpected values
• ML algorithms are sensitive to the range and distribution of values when data
comes from unknown sources.
• These values can spoil the entire machine learning training system and the
performance of the model.
• Hence, it is essential to detect these outliers or anomalies through techniques such as visualization.

46
Cntd..
3. Unstructured data format :
• Data comes from various sources and needs to be extracted into a different
format.
• Hence, before deploying an ML project, always consult with domain experts or
import data from known sources.
4. Limited or sparse features / attributes :
• Whenever data comes from a single source, it contains limited features,
• so it is necessary to import data from various sources for feature enrichment
or build multiple features in datasets.
5. Understanding feature engineering:
• Feature engineering helps develop additional features for ML models, increasing model performance and prediction accuracy.
48
Feature Engineering
Feature engineering is the pre-processing step of machine learning, which
is used to transform raw data into features that can be used for creating a
predictive model using Machine learning or statistical Modelling.

In short, feature engineering extracts features from raw data.
56
Feature Engineering
What is a feature?
• Generally, all machine learning algorithms take input data to generate the
output.
• The input data remains in a tabular form consisting of rows (instances or
observations) and columns (variable or attributes), and these attributes are often
known as features.
Feature Engineering processes:
1. Feature Creation: Feature creation is finding the most useful variables to
be used in a predictive model.
2. Transformations: The transformation step of feature engineering involves
adjusting the predictor variable to improve the accuracy and performance of
the model.

57
Feature Engineering
3.Feature Extraction: Feature extraction is an automated feature
engineering process that generates new variables by extracting them
from the raw data.
The main aim of this step is to reduce the volume of data so that it can be
easily used and managed for data modelling.
Feature extraction methods include cluster analysis, text analytics, edge
detection algorithms, and principal components analysis (PCA).
4. Feature Selection: Feature selection is a way of selecting the subset of
the most relevant features from the original features set by removing the
redundant, irrelevant, or noisy features."

58
Feature Engineering
Steps in Feature Engineering
Data Preparation:

• In this step, raw data acquired from different resources are prepared to make it in a suitable format so that it can be
used in the ML model.

• The data preparation may contain cleaning of data, delivery, data augmentation, fusion, ingestion, or loading.
Exploratory Analysis:
• This step involves analysis and investigation of the data set, and summarization of the main characteristics of the data.

• Different data visualization techniques are used to better understand the manipulation of data sources, to find the
most appropriate statistical technique for data analysis & to select the best features for the data.
Benchmark:
• Benchmarking is a process of setting a standard baseline for accuracy to compare all the variables from this
baseline.
• The benchmarking process is used to improve the predictability of the model and reduce the error rate.
59
Feature Engineering
Feature Engineering Techniques:
1. Imputation: Imputation is responsible for handling irregularities within the
dataset.
• For numerical data imputation, a default value can be imputed in a column, and
missing values can be filled with means or medians of the columns.
• For categorical data imputation, missing values can be replaced with the most frequently occurring value in the column.
2. Handling Outliers: This technique first identifies the outliers and then removes them.
• Standard deviation can be used to identify the outliers
• Z-score can also be used to detect outliers.
60
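A small sketch of z-score-based outlier detection on a made-up series (the 2-standard-deviation cut-off is only an illustrative choice):

import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 9, 95])   # one extreme value

z = (s - s.mean()) / s.std()             # z-score of each observation
outliers = s[np.abs(z) > 2]              # flag points more than 2 std devs from the mean
print(outliers)                          # only the value 95 is flagged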
Feature Engineering
Feature Engineering Techniques:
3. Log transform: Log transform helps in handling the skewed data, and it makes
the distribution more approximate to normal after transformation.
4. Binning: can be used to normalize the noisy data. This process involves
segmenting different features into bins.
5. Feature Split: the process of splitting a feature into two or more parts to make new features.
6. One hot encoding: It is a technique that converts the categorical data in a form so
that they can be easily understood by machine learning algorithms and hence can
make a good prediction.

61
Types of Learning
• Supervised learning
Types of Learning
Supervised Learning :
Aim is to learn a mapping from the input to an output whose correct values are provided
by a supervisor.
1. Classification : Data is labelled …meaning it is assigned a class,
for example spam/non-spam or fraud/non-fraud
e.g. for a financial institution, the input to the classifier is savings and income, and the output is one of the classes (high risk or low risk) based on a classification rule such as:
❑ if income > δ1 and savings > δ2 then low risk, else high risk
2. Regression : Data is labelled with a real value rather than a class label.
e.g. price of a stock over time.
e.g predict the price of used car ….
Input : brand, year, engine capacity, mileage & other information …
output: Price of car
Types of Learning
Supervised Learning :
Unsupervised Learning
• Unsupervised learning
Example of Unsupervised learning
• Clustering
• Association

69
Example of Unsupervised learning

71
Example of Semi-supervised learning

72
Reinforcement Learning
• Learning from mistakes.
• Place a reinforcement learning algorithm into any environment and it will make a lot of mistakes in the beginning.
• As we provide a signal to the algorithm that associates good behaviors with a positive signal and bad behaviors with a negative one,
• we can reinforce our algorithm to prefer good behaviors over bad ones.
• Over time, our learning algorithm learns to make fewer mistakes than it used to.
Reinforcement Learning
Reinforcement Learning
Where is reinforcement learning in the real world?
• Video Games
• Industrial Simulation:
• Resource Management
76
Key Elements of Machine Learning
• There are tens of thousands of machine learning algorithms and hundreds
of new algorithms are developed every year.
• Every machine learning algorithm has three components:
1. Representation: how to represent knowledge.
Examples include decision trees, sets of rules, instances, graphical models,
neural networks, support vector machines, model ensembles and others.
2. Evaluation: the way to evaluate candidate programs (hypotheses).
Examples include accuracy, precision and recall, squared error, likelihood, posterior probability, cost, margin, entropy, K-L divergence and others.

77
3. Optimization: the way candidate programs are generated known
as the search process.
For example combinatorial optimization, convex optimization,
constrained optimization.
• All machine learning algorithms are combinations of these three
components.
• A framework for understanding all algorithms.

78
Aspects of developing a learning system:
training data, concept representation, function approximation
• For training and testing our model we need to split the dataset into three distinct sets: a training set, a validation set and a test set
• Training set:-
• A set of data used to train the model
• It is used to fit the model
• The model sees and learn from this data
• Later on the trained model can be deployed and used to accurately predict
on new data that it has not seen before
• Labeled data is used

79
Validation set
• Validation set is the set of data separate from the training data
• It is used to validate our model during training
• It gives information which is used for tuning model hyperparameters
• It ensures that our model is not overfitting to the data in the training set
• Labeled data is used

80
Test Set
• A set of data used to test the model
• The test set is separate from both the training set and the validation set
• Once the model is trained and validated using the training and validation sets, the model is used to predict the output for the data in the test set
• Unlabeled data is used

81
Data Split

Train | Validation | Test

• Rules for performing the data split operation:
• To avoid correlation, the original dataset must be randomly shuffled before applying the split.
• All the splits must represent the original distribution.
• The percentage of splitting is mostly 60% for training, 20% for validation and 20% for testing.
• With scikit-learn this can be done using the train_test_split() function, as shown below.

82
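A minimal sketch of the 60/20/20 split with train_test_split, using the iris data as a stand-in:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# first hold out 20% for testing; shuffling is on by default
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# then take 25% of the remaining 80% for validation (0.25 * 0.8 = 0.2 of the whole)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 90, 30, 30 -> roughly 60/20/20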
Exploratory Data Analysis

Exploratory Data Analysis refers to the critical process of performing


initial investigations on data so as to discover patterns, to spot
anomalies, to test hypothesis and to check assumptions with the help of
summary statistics and graphical representations.

83
Exploratory Data Analysis
Typical graphical techniques used in EDA are:
• Box plot
• Histogram
• Multi-vari chart
• Run chart
• Pareto chart
• Scatter plot
• Stem-and-leaf plot
• Parallel coordinates
• Odds ratio
• Targeted projection pursuit
84
EDA Example
• Wine quality data set from UCI ML repository
• imported necessary libraries (for this example pandas, numpy,
matplotlib and seaborn) and loaded the data set.

85
EDA

• The original data is separated by the delimiter “;” in the given data set.
• To take a closer look at the data, we take the help of the “.head()” function of the pandas library, which returns the first five observations of the data set. Similarly, “.tail()” returns the last five observations of the data set.

86
EDA Techniques

• Find the total number of rows and columns in the data set using “.shape”.
• The dataset comprises 4898 observations and 12 characteristics,
• of which one is the dependent variable and the remaining 11 are independent variables — physico-chemical characteristics.
• It is also good practice to know the columns and their corresponding data types, along with finding whether they contain null values or not.
87
EDA: Exploratory Data Analysis

Plotting using matplotlib and seaborn:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

iris = pd.read_csv("iris.csv")
iris.head(5)

# 2-D scatter plot
iris.plot(kind='scatter', x='sepal_length', y='sepal_width')
plt.show()

# 2-D scatter plot with color-coding for each flower type/class.
# Here 'sns' corresponds to seaborn.
sns.set_style("whitegrid")
sns.FacetGrid(iris, hue="species", size=4) \
   .map(plt.scatter, "sepal_length", "sepal_width") \
   .add_legend()
plt.show()
EDA: Exploratory Data Analysis
What about 4-D, 5-D or n-D scatter plots?

# 3D scatter plot (https://plot.ly/pandas/3d-scatter-plots/)
import plotly
import plotly.express as px

iris = px.data.iris()
fig = px.scatter_3d(iris, x='sepal_length', y='sepal_width',
                    z='petal_width', color='species')
fig.show()

# Pair-plot: only possible to view 2-D patterns.
plt.close()
sns.set_style("whitegrid")
sns.pairplot(iris, hue="species", size=3)
plt.show()

# Violin plot
sns.violinplot(x="species", y="petal_length", data=iris, size=8)
plt.show()
Progressive Data Analysis
• Data has only float and integer values.
• No variable column has null/missing values.

95
PDA Techniques

• The describe() function in pandas is very handy in getting various


summary statistics.
• This function returns the count, mean, standard deviation,
minimum and maximum values and the quantiles of the data.
96
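A short illustration on the wine-quality data discussed here (the UCI download URL is an assumption; the file can equally be loaded from a local copy):

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
wine = pd.read_csv(url, sep=";")      # fields are ';' separated

print(wine.shape)                     # (4898, 12)
print(wine.describe())                # count, mean, std, min, quartiles, max per column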
Data Preparation: Types of Data

• Here as you can notice mean value is larger than median value of each
column which is represented by 50%(50th percentile) in index column.

• There is notably a large difference between the 75th percentile and max values of the predictors “residual sugar”, “free sulfur dioxide” and “total sulfur dioxide”.

• Thus observations 1 and 2 suggests that there are extreme values-


Outliers in our data set.

97
Graph Visualisation Techniques

• Let’s now explore the data with graphs. Python has a visualization library, Seaborn, which is built on top of matplotlib.
• It provides very attractive statistical graphs for performing both univariate and multivariate analysis.
• To use linear regression for modelling, it’s necessary to remove correlated variables to improve your model.
• One can find correlations using pandas’ “.corr()” function and visualize the correlation matrix using a heatmap in seaborn, as sketched below.

98
99
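A possible heatmap sketch for the same wine-quality data (the URL, colour map and figure size are illustrative assumptions):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
wine = pd.read_csv(url, sep=";")

plt.figure(figsize=(10, 8))
sns.heatmap(wine.corr(), annot=True, cmap="coolwarm")   # annot=True prints the values in the cells
plt.show()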
Data Pre-processing techniques for ML applications

• Dark shades represent positive correlation while lighter shades represent negative correlation.
• If you set annot=True, you’ll get the values by which features are correlated to each other in the grid cells.

100
101
Box Plot

• A box plot (or box-and-whisker plot) shows the distribution of


quantitative data in a way that facilitates comparisons between
variables.
• The box shows the quartiles of the dataset while the whiskers extend
to show the rest of the distribution.
• The box plot (a.k.a. box and whisker diagram) is a standardized way
of displaying the distribution of data based on the five number
summary

102
• Minimum
• First quartile
• Median
• Third quartile
• Maximum.
• In the simplest box plot the central rectangle spans the first quartile to
the third quartile (the interquartile range or IQR).

103
104
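An illustrative box plot on the wine-quality data (the choice of the 'alcohol' and 'quality' columns is just an example):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
wine = pd.read_csv(url, sep=";")

# one box per quality level; whiskers show the spread, points beyond them are outliers
sns.boxplot(x="quality", y="alcohol", data=wine)
plt.show()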
Sparse Matrix
• In numerical analysis and scientific computing, a sparse matrix or sparse
array is a matrix in which most of the elements are zero.
• By contrast, if most of the elements are nonzero, then the matrix is
considered dense.
• The number of zero-valued elements divided by the total number of elements
(e.g., m × n for an m × n matrix) is called the sparsity of the matrix (which is
equal to 1 minus the density of the matrix).
• Using those definitions, a matrix will be sparse when its Sparsity is greater
than 0.5.
• Conceptually, Sparsity corresponds to systems with few pairwise interactions

105
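A small sketch of the sparsity calculation and a compressed sparse representation with SciPy (the example matrix is made up):

import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 0]])

sparsity = 1.0 - np.count_nonzero(dense) / dense.size   # zero elements / total elements
print("sparsity:", sparsity)                             # 7/9 ≈ 0.78 > 0.5, so the matrix is sparse

sparse = csr_matrix(dense)    # stores only the nonzero entries
print(sparse)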
Feature Engineering: Feature selection

• In machine learning and statistics, feature selection, also known as


variable selection, attribute selection or variable subset selection,
• It is the process of selecting a subset of relevant features (variables,
predictors) for use in model construction.
• Feature selection is the process of reducing the number of input variables
when developing a predictive model.
• It is desirable to reduce the number of input variables to both reduce the
computational cost of modeling and, in some cases, to improve the
performance of the model.

106
Feature Engineering: Feature selection

Feature selection is primarily focused on removing non-informative or


redundant predictors from the model.
Feature selection techniques are used for several reasons:
• simplification of models to make them easier to interpret by
researchers/users
• shorter training times
• to avoid the curse of dimensionality
• enhanced generalization by reducing overfitting

107
Feature Engineering: Feature selection

• There are two main types of feature selection techniques: supervised and
unsupervised
• Supervised methods may be divided into wrapper, filter and intrinsic

108
Feature Engineering: Feature selection

• Unsupervised feature selection techniques ignore the target variable, such as methods that remove redundant variables using correlation.
• Supervised feature selection techniques use the target variable, such as methods that remove irrelevant variables.

Supervised Feature Selection Techniques:


• Filter: Select subsets of features based on their relationship with the target.
• Statistical Methods
• Feature Importance Methods
• Wrapper: Search for well-performing subsets of features.
• RFE (Recursive Feature Elimination)
• Intrinsic: Algorithms that perform automatic feature selection during training.
• Decision Trees

109
Feature Engineering: Feature selection

Filter:

• Statistical-based feature selection methods involve evaluating the relationship

between each input variable and the target variable using statistics and selecting those

input variables that have the strongest relationship with the target variable.

• These methods can be fast and effective, although the choice of statistical measures

depends on the data type of both the input and output variables.

110
Feature Engineering: Feature selection

Wrapper:
• Wrapper feature selection methods create many models with different subsets of
input features and select those features that result in the best performing model
according to a performance metric

Intrinsic
• Machine learning algorithms that perform feature selection automatically as part of learning the model are considered intrinsic feature selection methods.
• E.g. decision trees

111
Feature Engineering: Feature selection

Statistics for Filter-Based Feature Selection Methods


It is common to use correlation type statistical measures between input and output
variables as the basis for filter feature selection.
Common input variable data types:
• Numerical Variables
• Integer Variables.
• Floating Point Variables.
• Categorical Variables.
• Boolean Variables
• Ordinal Variables.
• Nominal Variables.

112
Feature Engineering: Feature selection

filter-based feature selection method.


• consider two broad categories of variable types: numerical and categorical
• Also the two main groups of variables to consider: input and output.
• Input variables are those that are provided as input to a model.
• In feature selection, it is this group of variables that we wish to reduce in size.
• Output variables are those for which a model is intended to predict, often called the
response variable.
Response variable typically indicates the type of predictive modeling problem
• Numerical Output: Regression predictive modeling problem.
• Categorical Output: Classification predictive modeling problem

113
Feature Engineering: Feature selection

114
Feature Engineering: Feature selection

1. Numerical Input, Numerical Output


• This is a regression predictive modeling problem with numerical input variables.
• The most common techniques are to use a correlation coefficient, such as Pearson’s
for a linear correlation, or rank-based methods for a nonlinear correlation.
• Pearson’s correlation coefficient (linear).
• Spearman’s rank coefficient (nonlinear)
2. Numerical Input, Categorical Output
• This is a classification predictive modeling problem with numerical input variables.
• This might be the most common example of a classification problem,
• ANOVA correlation coefficient (linear).
• Kendall’s rank coefficient (nonlinear).
115
Feature Engineering: Feature selection

3. Categorical Input, Numerical Output


• This is a regression predictive modeling problem with categorical input variables.
• Nevertheless, you can use the same “Numerical Input, Categorical Output” but in
reverse.
4. Categorical Input, Categorical Output
• This is a classification predictive modeling problem with categorical input
variables.
• The most common correlation measure for categorical data is the chi-squared test.
• Chi-Squared test (contingency tables).
• Mutual Information.

116
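A minimal filter-style sketch with scikit-learn's SelectKBest, shown here on the iris data (numerical inputs, categorical output); the choice of the chi-squared score and k=2 is purely illustrative:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# keep the 2 features with the strongest relationship to the class label
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)    # per-feature scores
print(X_selected.shape)    # (150, 2)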
Feature Engineering: Feature selection

Recursive Feature Elimination


• Recursive Feature Elimination, or RFE for short, is a feature selection algorithm.
• A machine learning dataset for classification or regression is comprised of rows and columns, like an Excel spreadsheet.
• Rows are often referred to as samples and columns are referred to as features.
• Feature selection refers to techniques that select a subset of the most relevant
features (columns) for a dataset.
• Fewer features can allow machine learning algorithms to run more efficiently (less
space or time complexity) and be more effective.
• Some machine learning algorithms can be misled by irrelevant input features,
resulting in worse predictive performance.

117
Feature Engineering: Feature selection

• RFE is a wrapper-type feature selection algorithm.


• This means that a different machine learning algorithm is given and used in the core
of the method, is wrapped by RFE, and used to help select features.
• This is in contrast to filter-based feature selections that score each feature and select
those features with the largest (or smallest) score.
• RFE works by searching for a subset of features, starting with all features in the training dataset and successively removing features until the desired number remains (see the sketch below).

118
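A minimal RFE sketch with scikit-learn; the wrapped estimator (logistic regression) and the synthetic data are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=1)

# logistic regression is wrapped by RFE; features are removed until 4 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)

print(rfe.support_)    # True for the features that were kept
print(rfe.ranking_)    # 1 = selected; higher numbers were eliminated earlier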
Dimensionality Reduction: PCA
• We can see/visualize 2-D and 3-D data with a scatter plot.
• For 4-D, 5-D, 6-D data we can use a pair plot (nC2 pairs of scatter plots).
• But what about 10-D, 100-D, 1000-D data?
• How do we visualize high-dimensional (n-dim) data?
n-D → reduce to 2-D or 3-D
• Map high-dimensional data into low dimensions while preserving as much of the structure as possible.
• Use dimensionality reduction techniques like PCA and t-SNE to visualize high-dimensional data.
• PCA tries to preserve linear structure, MDS tries to preserve global geometry, and t-SNE tries to preserve topology.
119
Principal Component Analysis (PCA)
Why PCA ?

• For dimensionality reduction, i.e. d-dim → d′-dim. E.g. reducing the MNIST dataset from 784 dimensions to 2 dimensions.
# MNIST dataset downloaded from Kaggle:
# https://www.kaggle.com/c/digit-recognizer/data

Applications:
• Visualization of high-dimensional data using scatter plots, pair plots, etc.
• As a pre-processing step for ML models that must solve problems in high dimensions.

120
Principal Component Analysis (PCA)
PCA Steps for dimensionality reduction:

1. Column standardization of data

2. Find Covariance matrix

3. Find eigen values and eigen vectors

4. Find principal components

5. Reducing dimensions of dataset

121
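The five steps above are what scikit-learn's PCA carries out internally; a minimal sketch on the 64-dimensional digits data (used here only as a small stand-in for MNIST):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)            # 8x8 digit images, 64 features

X_std = StandardScaler().fit_transform(X)      # step 1: column standardization
pca = PCA(n_components=2)                      # steps 2-4 happen inside fit()
X_2d = pca.fit_transform(X_std)                # step 5: project onto the top 2 principal components

print(X_2d.shape)                              # (1797, 2), ready for a scatter plot
print(pca.explained_variance_ratio_)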
Principal Component Analysis (PCA)
1.Standardization of the data
• missing out on standardization will probably result in a biased outcome.
• Standardization is all about scaling your data in such a way that all the variables and their values lie
within a similar range
• E.g let’s say that we have 2 variables in our data set, one has values ranging between 10-100 and the
other has values between 1000-5000.
• In such a scenario, it is obvious that the output calculated by using these predictor variables is going to
be biased
• standardizing the data into a comparable range is very important.

122
Principal Component Analysis (PCA)
2 Computing the covariance matrix
• A covariance matrix expresses the correlation between the different variables in the data set.
• It is essential to identify heavily dependent variables because they contain biased and redundant
information which reduces the overall performance of the model.
• a covariance matrix is a p × p matrix, where p represents the dimensions of the data set.
• 2-Dimensional data set with variables a and b, the covariance matrix is a 2×2 matrix as shown below

• If the covariance value is negative, the respective variables are inversely related (one decreases as the other increases)
• A positive covariance denotes that the respective variables are directly related to each other
123
Principal Component Analysis (PCA)
3. Calculating the Eigenvectors and Eigenvalues
Eigenvectors and eigenvalues are computed from the covariance matrix in order to determine the
principal components of the data set.
What are Principal Components?
• Principal components are the new set of variables that are obtained from the initial set of variables.
• The principal components are computed in such a manner that newly obtained variables are highly
significant and independent of each other.
• The principal components compress and possess most of the useful information that was scattered among
the initial variables.
• E.g data set is of 5 dimensions, then 5 principal components are computed, such that, the first principal
component stores the maximum possible information and the second one stores the remaining maximum
info and so on

124
Principal Component Analysis (PCA)
4. Computing the Principal Components
• Eigenvectors and eigenvalues placed in the descending order
• where the eigenvector with the highest eigenvalue is the most significant and thus forms the
first principal component.
• The principal components of lesser significances can thus be removed in order to reduce the
dimensions of the data.
• The final step in computing the Principal Components is to form a matrix known as the feature
matrix that contains all the significant data variables that possess maximum information about
the data.

125
Principal Component Analysis (PCA)
5. Reducing the dimensions of the data set

• The final step in performing PCA is to re-arrange the original data along the final principal components, which represent the maximum and most significant information of the data set.

• In order to replace the original data axis with the newly formed Principal Components, you
simply multiply the transpose of the original data set by the transpose of the obtained feature
vector.

• For iris https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

126
T-SNE
https://distill.pub/2016/misread-tsne/
https://colah.github.io/posts/2014-10-Visualizing-MNIST/
t-SNE is t- distributed stochastic neighborhood embedding
• Used dimensionality reduction

• Best technique for visualization

• PCA and t-SNE used in industry

• PCA preserve global structure whereas t-SNE preserve local structure

127
T-SNE.
Neighborhood and Embedding
• Neighborhood: points that are geometrically close together.
• Embedding: for every point in the high-dimensional space, find a corresponding point in the low-dimensional space.
• Stochastic: probabilistic.
• Geometric intuition: preserve the distances of points within a neighborhood.
128
T-SNE

Crowding problem:
E.g. embedding 2-dim data into 1-dim.
Sometimes it is impossible to preserve the distances of all neighborhood points; such a problem is called the crowding problem.

129
T-SNE
https://distill.pub/2016/misread-tsne/
• Run t-SNE on a simple dataset.
• Perplexity: roughly the number of neighbors considered for each point.
• Epsilon: the learning rate.
• Steps: the number of iterations.
t-SNE is an iterative algorithm: run it until the points stop moving (a stable configuration whose shape no longer changes).
1. Always run t-SNE until the shape stops changing.
2. Always run t-SNE with multiple perplexity values.
3. Keep perplexity in the range 2 <= p <= N (the number of points).
130
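A minimal scikit-learn sketch (the perplexity value is chosen only for illustration; re-run with several values as advised above):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# perplexity is roughly the effective number of neighbors per point
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)    # (1797, 2) - plot and color by y to inspect the clusters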
Exploratory Data Analysis

In statistics, exploratory data analysis (EDA) is an approach to analyzing


data sets to summarize their main characteristics, often with visual
methods.
• Understanding dataset by using tools from statistics or simple plotting
tools
• Understand data ?
• E.G iris dataset….flowers are visually similar
• https://en.wikipedia.org/wiki/Iris_flower_data_set
131
132
ML-UNIT II

SCHOOL OF COMPUTER ENGINEERING & TECHNOLOGY


Syllabus
Supervised Learning Techniques
• Regression Analysis : Residual Analysis , Multiple Linear Regression,
Logistic regression
• Distance Based Models: Nearest Neighbor classification
• Tree Based Models: Decision Trees
• Probabilistic Model: Naive Bayes Classifier
• Support Vector Machine: Maximum Margin Classifier, Kernels, SVM
Case Study



Decision Tree
Step 1: Model Construction
Training Data → Classification Algorithm → Classifier (Model)
Example learned rule:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Step 2: Model Usage
Testing / Unseen Data → Classifier → Prediction
e.g. (Jeff, Professor, 4) → Tenured?
Decision Tree Classification
Decision Tree Terminologies

▪ Decision tree may be n-ary, n ≥ 2.


▪ There is a special node called root node.
▪ Internal nodes are test attribute/decision attribute
▪ Leaf nodes are class labels
▪ Edges of a node represent the outcome for a value of the test node.
▪ In a path, a node with same label is never repeated.
▪ Decision tree is not unique, as different ordering of internal nodes
can give different decision tree
Rule Extraction From decision Tree
■ Rules are easier to understand than large trees
■ One rule is created for each path from the root to
a leaf
■ Each attribute-value pair along a path forms a
conjunction: the leaf holds the class prediction
■ Rules are mutually exclusive and exhaustive

Example: Rule extraction from our buys_computer decision-tree


IF age = youth AND student = no THEN buys_computer = no
IF age = youth AND student = yes THEN buys_computer = yes
IF age = middle_aged THEN buys_computer = yes
IF age = senior AND credit_rating = excellent THEN buys_computer = no
IF age = senior AND credit_rating = fair THEN buys_computer = yes
Attribute Selection : Information gain
Class P: buys_computer = “yes”
Class N: buys_computer = “no”

1 : Calculate Entropy for Class Labels

2 : Calculate Information of Each Attribute (one by one)

3 : Calculate Gain of Each Attribute (one by one)

4: Select Attribute with max gain as a root attribute

Gain(age) is greater than the gain of any other attribute, so age is selected as the root node.
Intermediate DT of the buys_computer dataset

Age is the splitting attribute at the root node of DT. Repeat the
procedure to determine the splitting attribute(except age) along
each of the branches from root node till stopping condition is
reached to generate the final DT
Final DT of the buys_computer dataset
Example : DT Creation
Example : Usage of Information Gain and
Entropy in DT Creation
Decision Tree
▪ A classifier (tree structure): used in classification and regression
▪ Classification is the most common use of decision trees
▪ The decision tree model acts as a classifier: a tree-structured classifier
▪ A new (unlabeled, unknown) input is fed to the model and the model classifies it into a particular class
▪ A decision tree has two kinds of nodes:
▪ Decision node (branch: a test is conducted, e.g. yes/no) [corresponds to an attribute]
▪ Leaf node (no branch) [corresponds to a class label]
▪ Finally, a leaf node assigns a class to the incoming sample



Classification & Regression Trees (CART)
▪ DT creates a model that predicts the value of a target (or dependent variable)
based on the values of several input (or independent variables)
▪ CART was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard
Olshen and Charles Stone.
▪ The main elements of CART (and any decision tree algorithm) are:
• Rules for splitting data at a node based on the value of one variable
• Stopping rules for deciding when a branch is terminal and can be split no more
• Finally, a prediction for the target variable in each terminal node.

Decision Tree
▪ Ex.: Let’s say you want to predict whether a person is fit given their
information like age, eating habit, and physical activity, etc.

▪ The decision nodes here are questions like ‘What’s the age?’, ‘Does he
exercise?’, ‘Does he eat a lot of pizzas’? And the leaves, which are outcomes
like either ‘fit’, or ‘unfit’.

▪ Binary classification problem (yes /no type)



Decision Tree Types
There are two main types of Decision Trees:
▪ Classification trees (Yes/No types)
▪ the outcome variable (like ‘fit’ or ‘unfit’) or decision variable is
Categorical. Tree is used to identify the "class"

▪ Regression trees (Continuous data types)


▪ the decision or the outcome variable is Continuous (like 123). tree is
used to predict target variable value.
Entropy
▪ Entropy, also called as Shannon Entropy is denoted by H(S) for a finite set S,
▪ It’s the measure of the amount of uncertainty or randomness in data.
▪ Intuitively, it tells us about the predictability of a certain event.
▪ Example, consider a coin toss whose probability of heads is 0.5 and probability of tails is
0.5. Here the entropy is the highest possible, since there’s no way of determining what the
outcome might be.
▪ Alternatively, consider a coin which has heads on both sides; the outcome of such an event can be predicted perfectly since we know beforehand that it will always be heads. In other words, this event has no randomness, hence its entropy is zero.
▪ In particular, lower values imply less uncertainty while higher values imply high
uncertainty.

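A small sketch of the entropy and information-gain arithmetic used in the buys_computer example (class counts taken from the standard 14-tuple data set):

import math

def entropy(counts):
    """Shannon entropy H(S) for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))    # whole data set (9 yes / 5 no) ≈ 0.940 bits
print(entropy([1, 1]))    # fair coin toss: maximum uncertainty = 1.0
print(entropy([2, 0]))    # two-headed coin: no uncertainty = 0.0

# information gain of 'age' (youth: 2 yes / 3 no, middle_aged: 4/0, senior: 3/2)
info_age = 5/14 * entropy([2, 3]) + 4/14 * entropy([4, 0]) + 5/14 * entropy([3, 2])
print(entropy([9, 5]) - info_age)    # ≈ 0.246, the largest gain, so age becomes the root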


Support Vector Machines
(SVM)
UNIT II
SVM
• “Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges.
• However, it is mostly used in classification problems.
• In the SVM algorithm, we plot each data item as a point in n-
dimensional space (where n is number of features you have) with the
value of each feature being the value of a particular coordinate.
• Then, we perform classification by finding the hyper-plane that
differentiates the two classes very well (look at the below snapshot).
How does it work?

A thumb rule to identify the right hyper-plane: “Select the hyper-plane which segregates the two classes
better”. In this scenario, hyper-plane “B” has excellently performed this job.
How does it work?

maximizing the distances between nearest data


point (either class) and hyper-plane will help us
to decide the right hyper-plane. This distance is
called as Margin.
Margin in SVM
Margin for hyper-plane C is high as compared to both A
and B. Hence, we name the right hyper-plane as C.

If we select a hyper-plane having a low margin then there is a high chance of misclassification.
How does it Work?

Hyper-plane B as it has higher margin


compared to A. But, here is the catch,
SVM selects the hyper-plane which
classifies the classes accurately prior
to maximizing margin.

Here, hyper-plane B has a classification


error and A has classified all correctly.
Therefore, the right hyper-plane is A.
How does it work?

The SVM algorithm has a feature to ignore


outliers and find the hyper-plane that has
the maximum margin.
Hence, we can say, SVM classification is
robust to outliers.
How does it work?

SVM can solve this problem. Easily! It solves


this problem by introducing additional feature.
Here, we will add a new feature z=x^2+y^2.
Now, let’s plot the data points on axis x and z:
kernel trick
• The SVM kernel is a function that takes low dimensional input space
and transforms it to a higher dimensional space i.e. it converts not
separable problem to separable problem.
• it does some extremely complex data transformations, then finds out
the process to separate the data based on the labels or outputs
you’ve defined.
Different Kernel Functions
SVM in Python
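The slide's original code listing is not reproduced in this text; the following is a minimal scikit-learn sketch of an SVM classifier on the iris data (kernel and parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# the rbf kernel applies the kernel trick; kernel='linear' or 'poly' are other options
model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(X_train, y_train)

print(accuracy_score(y_test, model.predict(X_test)))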
Exercise
• https://www.youtube.com/watch?v=LXGaYVXkGtg&t=189s
Bayesian Classification

▪ A statistical classifier: performs probabilistic prediction, i.e.,


predicts class membership probabilities
▪ Foundation: Based on Bayes’ Theorem.
▪ Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
▪ Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
▪ Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayesian Classification - Example
Bayesian Classification - Probability

▪ Probability: How likely something is to happen

▪ Probability of an event happening = (Number of times it can happen) / (Total number of outcomes)
Bayesian Classification Theorem

▪ Let X be a data sample (“evidence”): class label is unknown


▪ Let H be a hypothesis that X belongs to class C
▪ Classification is to determine P(H|X), the probability that the
hypothesis holds given the observed data sample X
▪ P(H) (prior probability), the initial probability
▪ E.g., X will buy computer, regardless of age, income, …
▪ P(X): probability that sample data is observed
▪ P(X|H) (posteriori probability), the probability of observing the
sample X, given that the hypothesis holds
▪ E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
Bayesian Classification Theorem

▪ Given training data X, the posterior probability of a hypothesis H, P(H|X), follows from Bayes’ theorem:
P(H|X) = P(X|H) P(H) / P(X)
▪ Predicts X belongs to Ci iff the probability P(Ci|X) is the highest


among all the P(Ck|X) for all the k classes
▪ Practical difficulty: require initial knowledge of many probabilities,
significant computational cost
Naïve Bayesian Classification Theorem

▪ Let D be a training set of tuples and their associated class labels,


and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
▪ Suppose there are m classes C1, C2, …, Cm.
▪ Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)

▪ This can be derived from Bayes’ theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
▪ Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
Naïve Bayesian Classification - Example

Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’

Data sample
X = (age = youth,
income = medium,
student = yes
credit_rating = fair)
Naïve Bayesian Classification - Example
Test for X = (age = youth , income = medium, student = yes, credit_rating =
fair)
• Prior probability P(Ci): P(buys_computer = yes) = 9/14 = 0.643 P(buys_computer = no) =
5/14= 0.357
• To compute P(X|Ci) for each class, compute following conditional probabilities
P(age = youth| buys_computer = yes) = 2/9 = 0.222
P(age = youth | buys_computer = no) = 3/5 = 0.6
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.4
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
• P(X|Ci): P(X|buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X|buys_computer = no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
• P(X|Ci)*P(Ci): P(X|buys_computer = yes) * P(buys_computer = yes) = 0.028
  P(X|buys_computer = no) * P(buys_computer = no) = 0.007
• Since 0.028 > 0.007, X belongs to the class (“buys_computer = yes”)
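The same arithmetic can be checked in a few lines of Python, with the probabilities copied from this slide:

# prior probabilities
p_yes, p_no = 9/14, 5/14

# class-conditional probabilities for X = (youth, medium, yes, fair)
px_yes = (2/9) * (4/9) * (6/9) * (6/9)    # ≈ 0.044
px_no  = (3/5) * (2/5) * (1/5) * (2/5)    # ≈ 0.019

print(px_yes * p_yes, px_no * p_no)       # ≈ 0.028 vs 0.007 -> predict buys_computer = yes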
Comment on Naive Bayes Classification

▪ Advantages
▪ Easy to implement
▪ Good results obtained in most of the cases
▪ Disadvantages
▪ Assumption: class conditional independence i.e. effect of an attribute
value on a given class is independent of the values of other attributes,
therefore loss of accuracy
https://www.kaggle.com/code/skalskip/iris-data-visualization-
and-knn-classification

68
Feature Scaling
• Feature scaling is the method to limit the range of
variables so that they can be compared on common
grounds.
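A minimal sketch with scikit-learn's two common scalers (the small age/salary matrix is made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25, 50000], [32, 62000], [47, 91000], [51, 58000]], dtype=float)

print(MinMaxScaler().fit_transform(X))     # rescales every column to the [0, 1] range
print(StandardScaler().fit_transform(X))   # rescales every column to mean 0, std 1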
What is Regression
Types of Regression
Cntd..
Regression analysis is a statistical technique that attempts to explore
and model the relationship between two or more variables.
Understanding Regression
• Dependent variable (The value to be predicted)
• Independent Variable (The predictors)

• Assume relationship between independent and dependent variable


is straight line.
• y = a+ bx
• a – intercept on y axis(indicates value of y when x = 0)
• b – slope (how much line rises for each increase in x)

Cntd..
• Linear Regression (Straight line model)
i. Simple linear Regression (Single independent variable and dependent variable is
continuous)
ii. Multiple regression (More than one independent variable and dependent
variable is continuous)
• Logistic Regression (Binary categorical outcome)
The logistic model is used to model the probability of a certain class or event existing, such as pass/fail, win/lose, alive/dead or healthy/sick.
This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc.

• For example, the weight, height, and age represent continuous variable.
• a person's gender, occupation, or marital status are categorical or discrete variables
Cntd..
Example: Advertising through TV, Radio, Newspaper to increase sales
• How accurately can we predict future sales?
• Is the relationship linear?
• Is there synergy among the advertising media?

• linear regression can be used to answer each of these questions

Cntd..
• Straightforward simple linear regression approach for predicting a
quantitative response Y on the basis of a single predictor variable X

• Mathematically regressing Y on X (or Y onto X)


• X may represent TV advertising and Y may represent sales.

Simple Linear Regression
Analyzing data: identify the independent variable (IV) and the dependent variable (DV).
Simple Linear Regression
Plotting the equation with salary (₹) against experience:
y = b0 + b1 * x1
SALARY = b0 + b1 * EXPERIENCE
How much will salary increase for +1 year of experience? The slope b1 answers this (e.g. +10K per +1 yr in the plot).
Simple Linear Regression
y = b0 + b1 * x1
where y is the dependent variable (DV), x1 is the independent variable (IV), b0 is the constant (intercept) and b1 is the coefficient (slope).
Simple Linear Regression — Ordinary Least Squares
• How SLR finds the best-fitting line for our data:
• For each observation (e.g. Mr. ABC's salary at a given experience), yi is the observed value and ŷi is the modelled value on the line.
• Ordinary least squares picks the line that makes SUM (yi − ŷi)² minimum.
Least Square Method
• Finds the line of best fit for a dataset, providing a visual demonstration of the relationship between the data points.
• The differences between the actual and estimated function values on the training examples are called residuals.
• The least-squares method consists in finding f̂ such that the sum of squared residuals, Σ (yi − f̂(xi))², is minimised.
Ex: Least Square method using Univariate Regression

x | y | x − x̄ | y − ȳ | (x − x̄)² | (x − x̄)(y − ȳ)
1 | 2 | −2 | −2 | 4 | 4
2 | 4 | −1 | 0 | 1 | 0
3 | 5 | 0 | 1 | 0 | 0
4 | 4 | 1 | 0 | 1 | 0
5 | 5 | 2 | 1 | 4 | 2

Here x̄ = 3 and ȳ = 4, so Σ(x − x̄)² = 10 and Σ(x − x̄)(y − ȳ) = 6, giving b1 = 6/10 = 0.6 and b0 = ȳ − b1·x̄ = 4 − 1.8 = 2.2.
Problem Statement

• Last year, five randomly selected students took a


math aptitude test before they began their statistics
course. The Statistics Department has three
questions.
• What linear regression equation best predicts
statistics performance, based on math aptitude
scores?
• If a student made an 80 on the aptitude test, what
grade would we expect her to make in statistics?
• How well does the regression equation fit the data?
Student | x | y | x − x̄ | y − ȳ | (x − x̄)² | (x − x̄)(y − ȳ)
1 | 95 | 85 | 17 | 8 | 289 | 136
2 | 85 | 95 | 7 | 18 | 49 | 126
3 | 80 | 70 | 2 | −7 | 4 | −14
4 | 70 | 65 | −8 | −12 | 64 | 96
5 | 60 | 70 | −18 | −7 | 324 | 126
Sum | 390 | 385 | | | 730 | 470
Mean | 78 | 77 | | | |
How to Find the Regression Equation

• The regression equation is a linear equation of the form: ŷ = b0


+ b1x
• First, we solve for the regression coefficient (b1):
• b1 = Σ [ (xi - x)(yi - y) ] / Σ [ (xi - x)2]
• b1 = 470/730
• b1 = 0.644
• Once we know the value of the regression coefficient (b1), we can solve for the intercept (b0):
• b0 = y - b1 * x
• b0 = 77 - (0.644)(78)
• b0 = 26.768
• Therefore, the regression equation is: ŷ = 26.768 + 0.644x .
How to Use the Regression Equation

• In our example, the independent variable is the student's score on the aptitude test.
• The dependent variable is the student's statistics grade.
• If a student made an 80 on the aptitude test, the
estimated statistics grade (ŷ) would be:
• ŷ = b0 + b1x
• ŷ = 26.768 + 0.644x = 26.768 + 0.644 * 80
• ŷ = 26.768 + 51.52 = 78.288
How to Find the Coefficient of
Determination

• Whenever you use a regression equation, you should ask how well the equation fits the data.
• One way to assess fit is to check the coefficient of determination, which can be computed from the following formula:
• R² = { ( 1 / N ) * Σ [ (xi − x̄) * (yi − ȳ) ] / (σx * σy) }²
• where N is the number of observations used to fit the model,
• Σ is the summation symbol, xi is the x value for observation i,
• x̄ is the mean x value,
• yi is the y value for observation i,
• ȳ is the mean y value,
• σx is the standard deviation of x, and
• σy is the standard deviation of y.
• σx = sqrt [ Σ ( xi − x̄ )² / N ] = sqrt( 730/5 ) = sqrt(146) = 12.083
• σy = sqrt [ Σ ( yi − ȳ )² / N ] = sqrt( 630/5 ) = sqrt(126) = 11.225
• And finally, we compute the coefficient of determination (R²):
• R² = [ ( 1/5 ) * 470 / ( 12.083 * 11.225 ) ]²
• R² = ( 94 / 135.632 )² = ( 0.693 )² = 0.48
• A coefficient of determination equal to 0.48 indicates that about 48% of
the variation in statistics grades (the dependent variable) can be
explained by the relationship to math aptitude scores (the independent
variable).
• This would be considered a good fit to the data, in the sense that it would
substantially improve an educator's ability to predict student performance
in statistics class.
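For a quick numerical check, R² in simple linear regression equals the squared Pearson correlation between x and y. A minimal sketch (assuming NumPy; same five students as above):

import numpy as np

x = np.array([95, 85, 80, 70, 60])   # aptitude scores
y = np.array([85, 95, 70, 65, 70])   # statistics grades

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(r ** 2)                 # ≈ 0.48, matching the hand calculation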
Cntd..
x | y
1 | 1
2 | 3
4 | 3
3 | 2
5 | 5

When we have a single input attribute (x) and we want to use linear regression, this is called simple linear regression.

If we had multiple input attributes (e.g. x1, x2, x3, etc.), this would be called multiple linear regression.

The procedure for simple linear regression is simpler than that for multiple linear regression.

Cntd..
With simple linear regression we want to model our data as follows: y = B0 + B1 * x

This is a line where y is the output variable we want to predict, x is the input variable we know, and B0 and B1 are coefficients that we need to estimate; they move the line around.

Calculating the sum of these squared values gives us a denominator of 10. Now we can calculate the value of our slope: B1 = 8 / 10 = 0.8. The intercept is then B0 = ȳ − B1·x̄ = 2.8 − 0.8·3 = 0.4.
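As a quick check of these values, a minimal sketch (assuming NumPy; the (x, y) pairs are the ones from the small table above):

import numpy as np

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

# np.polyfit with degree 1 performs a least-squares straight-line fit
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)   # expected: 0.8 and 0.4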

Linear Regression
▪ Simple linear regression: models the relationship between a dependent
variable and one independent variable using a linear function



What is “Linear”?
▪ Remember the LINE equation : Y = mX + B
▪ Expected value of (y) at a given level of (x): E(yi | xi) = α + β·xi



Predicted value for an individual…

yi = α + β·xi + random errorᵢ

where α + β·xi is fixed (falls exactly on the line) and the random error follows a normal distribution.


Multiple Linear Regression
▪ Multiple linear regression: two or more (multiple) explanatory variables
are used to predict the dependent variable

▪ Ex. Demand of a product varies directly with the change in demographic
characteristics (age, income) of a market area.



Regression Modeling

▪ A simple regression model (one independent variable) fits a regression
line in 2-dimensional space

▪ A multiple regression model with two explanatory variables fits a
regression plane in 3-dimensional space
Multiple Linear Regression



Simple and Multiple Regression



Multiple Linear Regression
Suppose we have the following dataset with one response variable y and two
predictor variables X1 and X2:

Use the following steps to fit a multiple linear regression model to this dataset.

Multiple Linear Regression
Step 1: Calculate X1², X2², X1y, X2y and X1X2.

Multiple Linear Regression
Step 2: Calculate Regression Sums.
Next, make the following regression sum calculations:
Σx1² = ΣX1² – (ΣX1)² / n = 38,767 – (555)² / 8 = 263.875
Σx2² = ΣX2² – (ΣX2)² / n = 2,823 – (145)² / 8 = 194.875
Σx1y = ΣX1y – (ΣX1 · Σy) / n = 101,895 – (555 × 1,452) / 8 = 1,162.5
Σx2y = ΣX2y – (ΣX2 · Σy) / n = 25,364 – (145 × 1,452) / 8 = -953.5
Σx1x2 = ΣX1X2 – (ΣX1 · ΣX2) / n = 9,859 – (555 × 145) / 8 = -200.375

Multiple Linear Regression
Step 3: Calculate b0, b1, and b2.

The formula to calculate b1 is: [(Σx2²)(Σx1y) – (Σx1x2)(Σx2y)] / [(Σx1²)(Σx2²) – (Σx1x2)²]

Thus, b1 = [(194.875)(1162.5) – (-200.375)(-953.5)] / [(263.875)(194.875) – (-200.375)²] = 3.148

The formula to calculate b2 is: [(Σx1²)(Σx2y) – (Σx1x2)(Σx1y)] / [(Σx1²)(Σx2²) – (Σx1x2)²]

Thus, b2 = [(263.875)(-953.5) – (-200.375)(1162.5)] / [(263.875)(194.875) – (-200.375)²] = -1.656

The formula to calculate b0 is: b0 = ȳ – b1·x̄1 – b2·x̄2

Thus, b0 = 181.5 – 3.148(69.375) – (-1.656)(18.125) = -6.867

Multiple Linear Regression
Step 4: Place b0, b1, and b2 in the estimated linear regression equation.

The estimated linear regression equation is: ŷ = b0 + b1*x1 + b2*x2

In our example, it is ŷ = -6.867 + 3.148x1 – 1.656x2
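These steps can be reproduced directly from the regression sums computed in Step 2. A minimal sketch (using the sums and sample means listed in the slides above, n = 8):

# Regression sums from Step 2 and the sample means
Sx1x1, Sx2x2 = 263.875, 194.875
Sx1y, Sx2y, Sx1x2 = 1162.5, -953.5, -200.375
x1_mean, x2_mean, y_mean = 69.375, 18.125, 181.5

denom = Sx1x1 * Sx2x2 - Sx1x2 ** 2
b1 = (Sx2x2 * Sx1y - Sx1x2 * Sx2y) / denom   # ≈ 3.148
b2 = (Sx1x1 * Sx2y - Sx1x2 * Sx1y) / denom   # ≈ -1.656
b0 = y_mean - b1 * x1_mean - b2 * x2_mean    # ≈ -6.87
print(b0, b1, b2)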

Residual
▪ Residual: degree of discrepancy between the assumed model and the observed data
▪ Residuals (e): difference between the observed value of the dependent
variable (y) and the predicted value (ŷ)

  ei = yi − ŷi

▪ Each data point has one residual
▪ Residual = [Observed value] – [Predicted value]
▪ For a least-squares fit, the residuals sum (and therefore average) to zero, as the small check below shows
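A minimal sketch of computing residuals (assuming NumPy and the earlier univariate example with fitted line ŷ = 2.2 + 0.6x):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

y_hat = 2.2 + 0.6 * x          # predicted values from the fitted line
residuals = y - y_hat          # observed minus predicted
print(residuals)               # [-0.8  0.6  1.  -0.6 -0.2]
print(residuals.sum())         # ≈ 0 for a least-squares fit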



Residual

Fig.1: Plotting the residual (Random Error)


Residual Analysis
Ex.

Fig.2: Plotting the four data points for given equation



Residual Analysis

▪ Residual analysis helps in evaluating the goodness of fit of a model
▪ Ideally, the deviations should be close to zero
Fig.3: Residual analysis plot



Thinking Challenge

How would you draw a line through the points? How do you
determine which line ‘fits best’?

[Figure: scatter plot of Y (0–60) against X (0–60)]
Thinking Challenge

How would you draw a line through the points? How do you
determine which line ‘fits best’?

[Figure: same scatter plot with a candidate line – slope changed, intercept unchanged]
Thinking Challenge

How would you draw a line through the points? How do you
determine which line ‘fits best’?

[Figure: same scatter plot with a candidate line – slope unchanged, intercept changed]
Thinking Challenge

How would you draw a line through the points? How do you
determine which line ‘fits best’?

[Figure: same scatter plot with a candidate line – slope changed, intercept changed]
Least Squares (LS)
▪ ‘Best Fit’ means the differences between actual Y-values and predicted Y-values
are a minimum.
▪ But positive differences offset negative ones. So square the errors!

▪ LS minimizes the Sum of the Squared Differences (errors) (SSE):

  SSE = Σi=1..n (Yi − Ŷi)² = Σi=1..n ε̂i²



Least Squares Graphically

LS minimizes Σi=1..n ε̂i² = ε̂1² + ε̂2² + ε̂3² + ε̂4²

[Figure: four data points around the fitted line Ŷi = β̂0 + β̂1·Xi; for example Y2 = β0 + β1·X2 + ε̂2, and ε̂1 … ε̂4 are the vertical deviations from the line]
Coefficient Equations
• Prediction equation:  ŷi = β̂0 + β̂1·xi
• Sample slope:  β̂1 = SSxy / SSxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
• Sample Y-intercept:  β̂0 = ȳ − β̂1·x̄
Linear classification
Nonlinearity
Logistic Regression

• Logistic Regression is one of the most commonly used Machine Learning
algorithms; it is used to model a binary variable that takes only 2 values – 0 and 1.
• The objective of Logistic Regression is to develop a mathematical equation
that can give us a score in the range of 0 to 1. This score gives us the
probability of the variable taking the value 1.
Logistic regression

• Although it is called regression, this is a classification method, based on
the probability of a sample belonging to a class.
• As our probabilities must be continuous in R and bounded between (0, 1),
it is necessary to introduce a threshold function to filter the term z.
• The name logistic comes from the decision to use the sigmoid (or logistic)
function:

  σ(z) = 1 / (1 + e^(−z))
Logistic regression

• If ‘z’ goes to infinity, Y(predicted) will become 1, and if ‘z’ goes to
negative infinity, Y(predicted) will become 0.
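A minimal sketch of this behaviour (assuming NumPy):

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-10))   # ≈ 0.000045 -> probability near 0
print(sigmoid(0))     # 0.5
print(sigmoid(10))    # ≈ 0.99995  -> probability near 1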
Applications

• How does the probability of getting lung cancer (yes vs. no) change for
every additional pound a person is overweight and for every pack of
cigarettes smoked per day?
• Do body weight, calorie intake, fat intake, and age have an influence on
the probability of having a heart attack (yes vs. no)?
• Spam Detection
• Tumour Prediction
• Credit Card Fraud
• Marketing
The Logistic Regression Model
▪ The logistic distribution constrains the estimated probabilities to lie
between 0 and 1.
▪ The estimated probability is:

  p = 1 / [1 + exp(−α − β·X)]

▪ if you let α + β·X = 0, then p = 0.50
▪ as α + β·X gets really big, p approaches 1
▪ as α + β·X gets really small (very negative), p approaches 0



The Logistic Regression Model



Linear Vs Logistic



Regularization
Overfitting is one of the most serious kinds of problems related to machine
learning. It occurs when a model learns the training data too well. The model then
learns not only the relationships among data but also the noise in the dataset.
Overfitted models tend to have good performance with the data used to fit them
(the training data), but they behave poorly with unseen data (or test data, which is
data not used to fit the model).
Overfitting usually occurs with complex models. Regularization normally tries to
reduce or penalize the complexity of the model. Regularization techniques applied
with logistic regression mostly tend to penalize large coefficients 𝑏₀, 𝑏₁, …, 𝑏ᵣ:
•L1 regularization penalizes the log-likelihood function (LLF) with the scaled sum of the absolute values of
the weights: |𝑏₀|+|𝑏₁|+⋯+|𝑏ᵣ|.
•L2 regularization penalizes the LLF with the scaled sum of the squares of the
weights: 𝑏₀²+𝑏₁²+⋯+𝑏ᵣ².
•Elastic-net regularization is a linear combination of L1 and L2 regularization.
Regularization can significantly improve model performance on unseen data.
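A minimal sketch of how these penalties are selected in scikit-learn (synthetic data for illustration; C is the inverse of the regularization strength, so a smaller C means a stronger penalty):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

l2_model = LogisticRegression(penalty='l2', C=1.0)                      # ridge-style penalty (default)
l1_model = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')  # lasso-style penalty
enet_model = LogisticRegression(penalty='elasticnet', C=1.0,
                                solver='saga', l1_ratio=0.5)            # mix of L1 and L2

for m in (l2_model, l1_model, enet_model):
    m.fit(X, y)
    print(m.penalty, m.score(X, y))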
Overfitting
▪ The natural end of the tree-growing process in a decision tree (DT) is 100%
purity in each leaf
▪ This overfits the data, and the model ends up fitting
noise in the data
▪ Overfitting leads to low predictive
accuracy on new data
▪ Past a certain point, the error rate for the
validation data starts to increase



Overfitting & Underfitting
▪ Training a Model:
▪ Overfit: the model performs much better on the training set but does not
perform well on the cross-validation set
▪ Overfitting is called a “high variance” problem

▪ Underfit: the model does not perform well even on the training set

▪ Underfitting is considered a “high bias” problem (a quick way to diagnose
the two cases is sketched below)
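A minimal sketch of that diagnosis (assuming scikit-learn; comparing train and validation accuracy of a deep versus a shallow decision tree on synthetic data):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (None, 3):   # None lets the tree grow to pure leaves; 3 restricts it
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_val, y_val))
# A large gap between train and validation accuracy suggests overfitting (high variance);
# low accuracy on both suggests underfitting (high bias).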


