Chapter 4 Statistical Classification Methods

FEM 2063 - Data Analytics
CHAPTER 4: Classifications
Overview
At the end of this chapter, students should be able to understand:
➢Supervised and unsupervised learning
➢Logistic Regression
➢Naïve Bayesian
➢Discriminant Analysis
➢Linear Discriminant Analysis
➢Quadratic Discriminant Analysis
Machine Learning
Supervised Learning
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, a dataset is given in which each element has a set of
features X1, X2, …, Xp as well as a response or outcome variable Y. The goal
is to build a model that predicts Y using X1, X2, …, Xp.
Example: regression and classification, where prior information is available.
Classification
Supervised learning or classification: attribution of a class or label to an
observation by exploiting the availability of a training set (labeled data). In other
words, classification is a subcategory of supervised learning where the goal is
to predict the categorical class labels (discrete, unordered values, group
membership) of new instances based on past observations.
Unsupervised Learning
What is unsupervised machine learning?
Unsupervised learning is a machine learning technique in which models are not
supervised using a labeled training dataset. Instead, the models themselves find
hidden patterns and insights in the given data. It can be compared to the learning
that takes place in the human brain while learning new things.
Example: Clustering
Supervised vs Unsupervised
Classification Performance - Confusion Matrix
THE TOOLS
What is confusion matrix?
A confusion matrix is a table that is often used to describe the performance of
a classification model (or "classifier") on a set of test data for which the true
values are known.
TP : True Positive, TN : True Negative
FP : False Positive, FN : False Negative
Example: In medical diagnosis,
test sensitivity is the ability of a test to
correctly identify those with the
disease (true positive rate), whereas test
specificity is the ability of the test to
correctly identify those without the
disease (true negative rate).
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
$\text{Sensitivity} = \text{recall} = r = \frac{TP}{TP + FN}$
$\text{Specificity} = \frac{TN}{TN + FP}$
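The three measures above can be computed directly from the four cells of a confusion matrix. The following is a minimal sketch; the function name and the counts (40, 45, 5, 10) are hypothetical values chosen only for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity (recall) and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical counts for a diagnostic test evaluated on 100 patients.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Reporting sensitivity and specificity next to accuracy is useful when the classes are unbalanced, because accuracy alone can look high even if one class is rarely identified correctly.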
Overview
➢Logistic Regression
G. James, D. Witten, T. Hastie, R. Tibshirani, “An Introduction to Statistical Learning with Applications in R”, Springer,
ISBN 978-1-4614-7137-0, ISBN 978-1-4614-7138-7 (eBook)
Overview – Logistic Regression
Logistic regression is a statistical model that uses a logistic function to model a
binary dependent variable. In regression analysis, logistic regression means
estimating the parameters of a logistic model, which is a form of binary regression.
What is logistic regression in simple terms?
Logistic Regression, also known as Logit Regression or Logit Model, is a
mathematical model used in statistics to estimate (guess) the probability of an event
occurring having been given some previous data. Logistic Regression works with
binary data, where either the event happens (1), or the event does not happen (0).
Overview – Logistic Regression
What is difference between logistic regression and linear regression?
• Linear regression is used for predicting a continuous dependent
variable using a given set of independent features, whereas logistic
regression is used to predict a categorical dependent variable.
Logistic Regression
Example: Credit Card Fraud
When a credit card transaction happens, the bank
makes note of several factors. For instance, the
date of the transaction, amount, place, type of
purchase, etc. Based on these factors, they
develop a Logistic Regression model of whether
the transaction is a fraud or not.
Logistic Regression
Why is logistic regression better?
Good accuracy for many simple data sets and it performs well when the dataset
is linearly separable.
Logistic Regression
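Formula for one variable X: $p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$
Formula for multiple variables Xi: $p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}$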
Logistic Regression - Example
Use software (e.g. Python, R) to fit the logistic regression model.
Example of output
Making predictions: predicting a good or bad creditor based on account balance
• What is our estimated probability of default for someone with a balance of $1000? Please try using the logistic regression formula for one variable.
• With a balance of $2000?
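The prediction step can be sketched in Python as follows. The fitted coefficients from the output slide did not survive in this text, so `BETA0` and `BETA1` below are hypothetical values used purely for illustration; substitute the values reported by your own software.

```python
import math


def logistic_probability(x, beta0, beta1):
    """Return p(X) = e^(beta0 + beta1*x) / (1 + e^(beta0 + beta1*x))."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients (assumption, not taken from the slide output).
BETA0 = -10.65
BETA1 = 0.0055

p_1000 = logistic_probability(1000, BETA0, BETA1)
p_2000 = logistic_probability(2000, BETA0, BETA1)
print(f"P(default | balance = 1000) = {p_1000:.4f}")
print(f"P(default | balance = 2000) = {p_2000:.4f}")
```

With a positive slope, the estimated probability of default rises with the balance, and the increase is not linear because the logistic function is S-shaped.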
A group of 20 students spends between 0 and 6 hours studying for an exam.
How does the number of hours spent studying affect the probability of the student
passing the exam?
The reason for using logistic regression for this problem is that the values of the
dependent variable, pass and fail, while represented by "1" and "0", are
not cardinal numbers.
RESULTS
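A minimal sketch of how such a model could be fitted is given below. The data for the 20 students are illustrative values (an assumption, since the slide's own data table is not reproduced here), and the fit uses plain Newton-Raphson maximum likelihood instead of a library call, so the numbers may differ slightly from what a regularized routine reports.

```python
import math

# Illustrative data (assumption): hours studied and pass (1) / fail (0) for 20 students.
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
          1, 0, 1, 0, 1, 1, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, iterations=25):
    """Maximum-likelihood fit of p = sigmoid(b0 + b1*x) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Entries of the (negated) Hessian, a symmetric 2x2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_logistic(hours, passed)
p_1h = sigmoid(b0 + b1 * 1.0)
p_5h = sigmoid(b0 + b1 * 5.0)
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
print(f"P(pass | 1 hour) = {p_1h:.2f}, P(pass | 5 hours) = {p_5h:.2f}")
```

A positive slope means each additional hour of study multiplies the odds of passing by e raised to the slope.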
Logistic Regression
More than 2 independent variables
Example of outputs
Example: more than 2 variables
A sample of 1000 people was selected to identify how their age, daily internet
usage and time spent on site affect their tendency to click on an
advertisement. Use the first 700 observations as the training dataset and the
remaining 300 as the testing dataset.
Number of observations: 1000
Variable   Description
Y          “Clicked on Ad”: indicates clicking on the ad, 0 = NO, 1 = YES
X1         “Daily time spent on Site”: consumer time spent on the site, in minutes
X2         “Age”: consumer age
X3         “Daily Internet Usage”: average time in minutes per day the consumer is on the internet (online)
Logistic Regression – Python Codes
#Logistic Regression
#commands below are used to import the important commands needed for the codes
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
#import Logistic Regression command as it will be used in the coding
from sklearn.linear_model import LogisticRegression
#import data from computer
from google.colab import files
uploaded = files.upload()
#read the file
df = pd.read_csv('XXX.csv')
#the data in the file imported will be displayed in the coding
print(df)
#set the independent variables and dependent variable as x and y respectively
#(assumption: the CSV column headers match the variable names listed in the table)
x = df[['Daily time spent on Site', 'Age', 'Daily Internet Usage']]
y = df['Clicked on Ad']
#set the values that will be used to train the model which is about 70% and test 30%
x_for_train = x[:700]
y_for_train = y[:700]
#the test set starts at index 700 so that no observation is skipped
x_for_test = x[700:1000]
y_for_test = y[700:1000]
#define model to be used
model = LogisticRegression()
#fit training data command
model.fit(x_for_train, y_for_train)
#define and print values of intercept and coefficient of the data for the train
print('beta0 :', model.intercept_)
print('beta1 :', model.coef_)
#calculate accuracy
print('accuracy :', model.score(x_for_test, y_for_test))
These codes are not complete, please attend tutorial/lab classes for full details of the codes
Logistic Regression – Python Output
beta0 : [19.66511428]
beta1 : [[-0.17887355 0.1258386 -0.06456599]]
Results - Interpretation:
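One way to read these numbers is to plug them into the logistic formula for a single person. The sketch below assumes that the three coefficients are listed in the order X1 (daily time spent on site), X2 (age), X3 (daily internet usage), and the visitor values are hypothetical.

```python
import math

# Fitted values as printed in the output above.
beta0 = 19.66511428
beta = [-0.17887355, 0.1258386, -0.06456599]  # assumed order: X1, X2, X3


def predict_probability(features):
    """Estimated probability of clicking on the ad for one observation."""
    z = beta0 + sum(b * v for b, v in zip(beta, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical visitor: 70 minutes on site, 30 years old, 200 minutes online per day.
visitor = [70.0, 30.0, 200.0]
p_click = predict_probability(visitor)
predicted_class = 1 if p_click >= 0.5 else 0
print(f"P(click) = {p_click:.3f}, predicted class = {predicted_class}")
```

Under the assumed ordering, the negative coefficients for time on site and internet usage mean that heavier users are less likely to click, while the positive coefficient for age means that the estimated probability of clicking increases with age.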
Classifying your daily productivity
Lately you’ve been interested in gauging your productivity. You’ve been asking yourself, at
the end of each day, if the day was indeed productive. But that’s just a potentially biased,
qualitative data point. You want to find a more scientific way to go about it. You’ve
observed the natural flows of your day, and realized that what impacts it the most is:
• Sleep: you know that sleep, or lack thereof, has a big impact on your day.
• Coffee: doesn’t the day start after coffee?
• Focus time: it’s not always possible, but you try to have 3–4 h of intently focused time to dive into projects.
• Lunch: you’ve noticed the day flows smoothly when you have time for a proper lunch, not just snacks.
• Walks: you’ve been taking short walks to get your steps in, relax a bit and muse about your projects.
https://towardsdatascience.com/logistic-regression-in-real-life-building-a-daily-productivity-classification-model-a0fc2c70584e
To classify your day as productive or not with a Logistic Regression model, the
first step is to pick an arbitrary threshold x and assign observations to each
class based on a simple criterion:
• Class Non-Productive: all outcomes that are less than or equal to x.
• Class Productive: otherwise, i.e., all outcomes greater than x.
Observed Data for 20 days
•Outcomes less than or equal to zero are assigned to Class 0, i.e., a nonproductive day.
•Positive outcomes are assigned to Class 1, i.e., a productive day.
Logistic Regression
One of the most significant advantages of the logistic regression model is that it doesn't
just classify but also gives probabilities.
The following are some of the advantages of the logistic
regression algorithm.
•Simple to understand, easy to implement, and efficient to train
•Performs well when the dataset is linearly separable
•Good accuracy for smaller datasets
•Doesn't make any assumptions about the distribution of classes
•Useful to find relationships between features
•Provides well-calibrated probabilities
•Less prone to overfitting in low dimensional datasets
•Can be extended to multi-class classification
The following are some of the disadvantages of the logistic regression algorithm:
•Constructs linear decision boundaries
•Can lead to overfitting if the number of features is larger than the number of
observations
•Predictors should have no multicollinearity
•Struggles to capture complex relationships; algorithms like neural networks are
more suitable and powerful for these
•Can be used only to predict discrete (categorical) outcomes
•Can't solve non-linear problems unless the features are transformed
•Sensitive to outliers
Overview
➢Naïve Bayes
Naïve Bayes
Learning objectives:
- Introduction: Deterministic vs Stochastic
- Laws of Probability
- Understand the Naïve Bayes Classifier
Some references
http://www3.cs.stonybrook.edu/~cse634/ch6book.pdf
https://www3.cs.stonybrook.edu/~cse634/T14.pdf
Introduction- Stochastic vs Deterministic
A deterministic system is a system in which no randomness is involved in the
development of future states of the system. A deterministic model will thus
always produce the same output from a given starting condition or initial state.
A stochastic model is a tool for estimating probability distributions of potential
outcomes by allowing for random variation in one or more inputs over time. The
random variation is usually based on fluctuations observed in historical data for
a selected period using standard time-series techniques.
Example - Stochastic vs Deterministic
Laws of Probability
Conditional probability is the likelihood of an outcome occurring, based on a
previous outcome having occurred: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$.
Bayes' theorem provides a way to revise existing predictions or theories
(update probabilities) given new or additional evidence.
© 2019 Petroliam Nasional Berhad (PETRONAS)
What is Bayes’ Theorem?
Prior probability represents what is originally believed before new evidence is
introduced, and posterior probability takes this new information into account.
A posterior probability can subsequently become a prior for a new
updated posterior probability as new information arises and is incorporated
into the analysis. The likelihood is the probability of observing the evidence
given that the hypothesis is true.
Example of Bayes’ Theorem
Example
• A doctor knows that meningitis causes stiff neck 50% of the time (likelihood).
• The prior probability of any patient having meningitis is 1/50,000 (prior).
• The prior probability of any patient having stiff neck is 1/20 (prior).
Question: If a patient has stiff neck, what is the probability he/she has meningitis?
Solution: with M = meningitis and S = stiff neck,
$P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$
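The arithmetic for the meningitis example can be checked with exact fractions. This is a small sketch; the function name `bayes_posterior` is introduced here only for illustration.

```python
from fractions import Fraction


def bayes_posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood * prior / evidence


p_stiff_given_meningitis = Fraction(1, 2)   # likelihood
p_meningitis = Fraction(1, 50000)           # prior for the hypothesis
p_stiff_neck = Fraction(1, 20)              # prior for the evidence

p_meningitis_given_stiff = bayes_posterior(
    p_stiff_given_meningitis, p_meningitis, p_stiff_neck
)
print(p_meningitis_given_stiff, float(p_meningitis_given_stiff))
```

Even though half of meningitis patients have a stiff neck, the posterior stays small because the disease itself is very rare compared with the symptom.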
Naïve Bayes - Example
$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$
Given a small database of names and sexes (1 attribute: name), we can apply
Bayes' theorem to predict whether an unseen test sample, Officer Drew, is male
or female.
Result: Officer Drew is a female!
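The calculation behind this prediction can be sketched as follows. The database table from the slide is not reproduced in this text, so the eight records below are hypothetical and serve only to show the mechanics of applying Bayes' theorem with a single attribute.

```python
from collections import Counter

# Hypothetical database of (name, sex) records (assumption, for illustration only).
records = [
    ("Drew", "male"), ("Claudia", "female"), ("Drew", "female"), ("Drew", "female"),
    ("Alberto", "male"), ("Karin", "female"), ("Nina", "female"), ("Sergio", "male"),
]


def posterior_by_class(name, data):
    """Return P(class | name) for every class using Bayes' theorem."""
    n = len(data)
    class_counts = Counter(sex for _, sex in data)
    evidence = sum(1 for nm, _ in data if nm == name) / n          # P(name)
    result = {}
    for sex, count in class_counts.items():
        prior = count / n                                           # P(class)
        matches = sum(1 for nm, s in data if nm == name and s == sex)
        likelihood = matches / count                                # P(name | class)
        result[sex] = likelihood * prior / evidence
    return result


posteriors = posterior_by_class("Drew", records)
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

With these hypothetical records the posterior for female is larger than for male, which matches the conclusion stated on the slide.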
Bayes’ Theorem – Python codes
#Naive Bayes
#commands below are used to import the important commands needed for the codes
import numpy as np
import pandas as pd
#import Naïve Bayes command as it will be used in the coding
from sklearn.naive_bayes import GaussianNB
#import data from computer
from google.colab import files
uploaded = files.upload()
#read the file
df = pd.read_csv('XXX.csv')
#the data in the file imported will be displayed in the coding
print(df)
#set the independent variables and dependent variable as x and y respectively
#(assumption: the last column holds the class label, the other columns are features)
x = df.iloc[:, :-1]
y = df.iloc[:, -1]
#set the values that will be used to train the model which is about 70% and test 30%
x_for_train = x[:700]
y_for_train = y[:700]
#the test set starts at index 700 so that no observation is skipped
x_for_test = x[700:1000]
y_for_test = y[700:1000]
#define model to be used
model = GaussianNB()
#fit training data command
model.fit(x_for_train, y_for_train)
#calculate accuracy
print('accuracy :', model.score(x_for_test, y_for_test))
These codes are not complete, please attend tutorial/lab classes for full details of the codes
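For readers without access to the course data file, the same workflow can be tried on synthetic data. The sketch below is self-contained; the two Gaussian classes, the random seed and the 700/300 split are assumptions made for illustration, not the course dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)

# Synthetic data (assumption): two classes with three independent Gaussian features.
class0 = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
class1 = rng.normal(loc=2.0, scale=1.0, size=(500, 3))
x = np.vstack([class0, class1])
y = np.array([0] * 500 + [1] * 500)

# Shuffle so that both classes appear in the training and the test part.
order = rng.permutation(len(y))
x, y = x[order], y[order]

# 70% training, 30% testing, as in the slides.
x_for_train, y_for_train = x[:700], y[:700]
x_for_test, y_for_test = x[700:], y[700:]

model = GaussianNB()
model.fit(x_for_train, y_for_train)
accuracy = model.score(x_for_test, y_for_test)
print(f"test accuracy: {accuracy:.3f}")
```

Because the synthetic features really are independent within each class, the Naïve Bayes assumption holds here and the classifier performs well; on real data with correlated features the accuracy may be lower.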
Bayes’ Theorem – Python outputs
Naïve Bayes
• Advantages:
– Fast to train. Fast to classify

– Not sensitive to irrelevant features
– Handles real and discrete data
– Handles streaming data well
• Disadvantage:
Assumes independence of features
Naïve Bayes vs Logistic Regression – major differences
1. Purpose: what class of machine learning problem does it solve?
Both algorithms can be used for classification. For example, you could predict whether a bank should offer a loan
to a customer or not, or identify whether a given mail is spam or not.
2. Learning mechanism
Naïve Bayes: For the given features (x) and the label y, it estimates a joint probability P(x, y) from the training data;
hence it is a generative model.
Logistic regression: Estimates the probability P(y|x) directly from the training data by minimizing error; hence it is
a discriminative model.
3. Model assumptions
Naïve Bayes: The model assumes all the features are conditionally independent given the class. If some of the
features depend on each other (likely in a large feature space), the prediction might be poor.
Logistic regression: It splits the feature space linearly; it works acceptably even if some of the variables are correlated.
4. Model limitations
Naïve Bayes: Works well even with little training data, as the estimates are based on the joint density function.
Logistic regression: With little training data, the model estimates may overfit the data.
5. Approach to improve the results
Naïve Bayes: When the training data size is small relative to the number of features, information on prior
probabilities helps improve the results.
Logistic regression: When the training data size is small relative to the number of features, Lasso and Ridge
regularization help improve the results.
https://www.quora.com/What-is-the-difference-between-logistic-regression-and-Naive-Bayes
Overview
➢Discriminant Analysis
What is linear discriminant analysis
Linear discriminant analysis is a technique used by researchers to analyze
research data when the criterion or dependent variable is categorical and the
predictor or independent variable is interval in nature.
Discriminant analysis is a versatile statistical method often used by market
researchers to classify observations into two or more groups or categories. In
other words, discriminant analysis is used to assign objects to one group among
several known groups.
Discriminant Analysis
• LDA makes predictions by
estimating the probability that a new
set of inputs belongs to each class.
The class that gets the highest
probability is the output class and
a prediction is made.
• Model the distribution of X in each of the classes separately, and then use
Bayes' theorem to obtain Pr(Y = k | X = x).
• Using normal (Gaussian) distributions for each class leads to linear or
quadratic discriminant analysis.
• Remark: it could be done with other distributions.
Linear Discriminant Analysis when there is only 1 predictor (p=1)
• The Gaussian (normal) density has the form
$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)$
where $f_k(x) = \Pr(X = x \mid Y = k)$ is the (normal) density for X in class k.
• Here $\mu_k$ is the mean, and $\sigma_k^2$ the variance (in class k).
• We will assume that all the $\sigma_k = \sigma$ are the same.
• $\pi_k = \Pr(Y = k)$ is the marginal or prior probability for class k.
[Figure: example of decision boundaries. Classify to the class with the highest density.]
Discriminant functions
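• To classify the value X = x, we need to find the k which gives the largest $p_k(x) = \Pr(Y = k \mid X = x)$.
• After simplification, this is equivalent to finding the largest discriminant score:
$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$
Note that $\delta_k(x)$ is a linear function of x.
• The parameters are estimated from the training data:
$\hat{\pi}_k = \frac{n_k}{n}$, $\hat{\mu}_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i$, $\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)^2$
where n = size of the training set, $n_k$ = size of class k in the training set, and K = number of classes.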
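Example Discriminant Analysis
Default   Balance
0         580
0         245
0         1970
1         7390
1         2845
?         900
When x = 900, what is the default?
n = 5, K = 2
$\hat{\mu}_0 = (580 + 245 + 1970)/3 = 931.67$, $\hat{\pi}_0 = 3/5$
$\hat{\mu}_1 = (7390 + 2845)/2 = 5117.50$, $\hat{\pi}_1 = 2/5$
$\hat{\sigma}^2 = ((580 - 931.67)^2 + \dots + (2845 - 5117.50)^2)/3 \approx 4.0 \times 10^6$
$\delta_0 = -0.12$, $\delta_1 = -2.52$ (these values correspond to a base-10 logarithm of the priors; with the natural logarithm the scores are about -0.41 and -3.04, and the classification is the same)
Since $\delta_0 > \delta_1$, the predicted default for x = 900 is 0.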
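The worked example can be reproduced with a few lines of Python. This sketch uses the five training balances from the example and the one-predictor discriminant score with the natural logarithm of the prior, so the scores come out near -0.41 and -3.04 instead of the -0.12 and -2.52 shown above (which match a base-10 logarithm); the predicted class is the same. The function names are introduced here only for illustration.

```python
import math

# Training data from the example: balances grouped by default status.
balances = {0: [580, 245, 1970], 1: [7390, 2845]}


def fit_lda_1d(samples_by_class):
    """Estimate class means, priors and the pooled variance for one predictor."""
    n = sum(len(values) for values in samples_by_class.values())
    k = len(samples_by_class)
    means = {c: sum(v) / len(v) for c, v in samples_by_class.items()}
    priors = {c: len(v) / n for c, v in samples_by_class.items()}
    squared_deviations = sum(
        (x - means[c]) ** 2 for c, v in samples_by_class.items() for x in v
    )
    pooled_variance = squared_deviations / (n - k)
    return means, priors, pooled_variance


def discriminant_scores(x, means, priors, variance):
    """delta_k(x) = x*mu_k/sigma^2 - mu_k^2/(2*sigma^2) + ln(pi_k)."""
    return {
        c: x * means[c] / variance - means[c] ** 2 / (2 * variance) + math.log(priors[c])
        for c in means
    }


means, priors, variance = fit_lda_1d(balances)
scores = discriminant_scores(900, means, priors, variance)
predicted = max(scores, key=scores.get)
print(means, priors, round(variance))
print(scores, "predicted default:", predicted)
```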

LDA – Python codes
##Mount the drive
from google.colab import drive
drive.mount('/content/drive')
#Import Data
import pandas as pd
df = pd.read_csv('/content/drive/My Drive/XXX.csv')
print('\nData', df)
#Import additional items
from sklearn import linear_model
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import accuracy_score
#Define x and y variables
#Split data into training and testing
#set the values that will be used to train the model which is about 70% and test 30%
#Fit variables into model
#Find predicted classes of test sets
#Determine the Accuracy
#Visualize the results by plotting figures
These codes are not complete, please attend tutorial/lab classes for full details of the codes
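The steps that the slide leaves as comments (define x and y, split, fit, predict, score) can be sketched on synthetic data as follows. The two Gaussian classes, the random seed and the use of `train_test_split` are assumptions made for illustration; they are not the course dataset or necessarily the approach used in the lab.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (assumption): two classes with the same covariance, different means.
class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(150, 2))
class1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(150, 2))
x = np.vstack([class0, class1])
y = np.array([0] * 150 + [1] * 150)

# Split data into training (70%) and testing (30%).
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0, stratify=y
)

# Fit variables into model.
model = LDA()
model.fit(x_train, y_train)

# Find predicted classes of the test set and determine the accuracy.
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {accuracy:.3f}")
```

A random, stratified split is used here instead of taking the first 70% of rows, which avoids problems when the file happens to be sorted by class.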

LDA - Results
Example: The dataset chosen contains 3 attributes, diastolic blood pressure
(diaBP), systolic blood pressure (sysBP) and the age of the patient, with
more than 150 observations. These variables are used to predict the 10-year
risk of coronary heart disease.
LDA - Interpretation
Example:
Plots of the LDA results show the visual relationship between the dependent
variable, the risk of getting coronary heart disease, and the independent
variables: age, diastolic blood pressure (diaBP) and systolic blood pressure
(sysBP). The blue dots on the scatter plots represent no coronary heart disease
risk in the next ten years, while the orange dots represent a risk of getting
coronary heart disease in the next 10 years. Based on the graphs, most patients
fall in the no-risk group across all three independent variables, while only a
very small number of patients are at risk of coronary heart disease. It is
concluded that the prediction error is small, as the accuracy is 85.4%.
Other forms of Discriminant Analysis
• When $f_k(x)$ are Gaussian densities with the same covariance matrix in each
class, this leads to linear discriminant analysis.
• With Gaussians but a different covariance matrix $\Sigma_k$ in each class, we get
quadratic discriminant analysis (QDA).
QDA vs LDA
A major difference between the two is that LDA
assumes the feature covariance matrices of both
classes are the same, which results in a linear
decision boundary. In contrast, QDA is less strict
and allows different feature covariance matrices for
different classes, which leads to a quadratic
decision boundary.
LDA Assumptions:
•LDA assumes normally distributed data and a class-
specific mean vector.
•LDA assumes a common covariance matrix, that is
common to all classes in a data set.
When these assumptions hold, then LDA approximates the Bayes classifier very
closely and the discriminant function produces a linear decision boundary.
QDA vs LDA
QDA Assumptions:
•Observation of each class is drawn from a normal distribution (same as LDA).
•QDA assumes that each class has its own covariance matrix (different from LDA).
When these assumptions hold, QDA approximates the Bayes classifier very closely
and the discriminant function produces a quadratic decision boundary.
In conclusion, LDA is less flexible than QDA because it estimates fewer
parameters. This can be an advantage when there are only a few observations in
the training dataset, since it lowers the variance. On the other hand, when the K
classes have very different covariance matrices, LDA suffers from high bias and
QDA might be a better choice. What it comes down to is the bias-variance
trade-off. Therefore, it is crucial to test the underlying assumptions of LDA and
QDA on the data set and then try both methods to decide which one is more appropriate.
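The effect of unequal covariance matrices can be seen in a small experiment. The sketch below uses synthetic data in which both classes share the same mean but have very different spreads, an extreme case chosen (as an assumption, for illustration) so that a linear boundary cannot separate them while a quadratic one can.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(1)
n_per_class = 500

# Synthetic data (assumption): same mean, very different covariance matrices.
tight_class = rng.normal(loc=0.0, scale=0.5, size=(n_per_class, 2))
wide_class = rng.normal(loc=0.0, scale=3.0, size=(n_per_class, 2))
x = np.vstack([tight_class, wide_class])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, then use 70% for training and 30% for testing.
order = rng.permutation(len(y))
x, y = x[order], y[order]
split = int(0.7 * len(y))
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]

lda_accuracy = LinearDiscriminantAnalysis().fit(x_train, y_train).score(x_test, y_test)
qda_accuracy = QuadraticDiscriminantAnalysis().fit(x_train, y_train).score(x_test, y_test)
print(f"LDA accuracy: {lda_accuracy:.3f}")
print(f"QDA accuracy: {qda_accuracy:.3f}")
```

In this setting LDA performs close to random guessing because its common-covariance assumption is badly violated, while QDA can place a curved boundary around the tight class. With few observations or similar covariance matrices the comparison can go the other way, which is why trying both methods is recommended.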
Prediction Models in Healthcare
V. K. Verma, S. Verma, Machine learning applications in healthcare sector: An overview,
Materials Today: Proceedings 57 (2022) 2144–2147
Machine learning (ML) applications are everywhere and are used in many
real-world applications. It is essential in several areas, such as healthcare and
medical data protection. ML is applied to analyze medical records and disease
forecasts. In this study, the authors review several ML algorithms, applications,
techniques, opportunities, and challenges for the healthcare sector. This paper
fills a research gap for efficient use of ML algorithms and applications in the
healthcare sector.
Machine learning (ML) is essential in the healthcare sector in areas such as
medical imaging diagnostics, improved radiotherapy, personalized treatment,
crowdsourced data gathering, smart health records, ML based behavioral
modification, clinical trials, and research. Healthcare is becoming more
problematic and costly. It uses several ML techniques to fix it. Various ML
techniques and applications for disease prediction are presented in this paper.
Using ML algorithms and techniques, we hope to improve the accuracy of many
disease predictions in the future.
S. P. Chatrati, G. Hossain, A. Goyal et al., Smart home health monitoring system for predicting type 2 diabetes and hypertension,
Journal of King Saud University – Computer and Information Sciences, https://doi.org/10.1016/j.jksuci.2020.01.010
This work proposes a smart home health monitoring system that helps to analyze the patient’s
blood pressure and glucose readings at home and notifies the healthcare provider in case of any
abnormality detected. The goal is to predict the hypertension and diabetes status using the
patient’s glucose and blood pressure readings using supervised machine learning classification
algorithms.
Proposed Block Diagram and Workflow

Summary
Logistic Regression (LR): Logistic function, Maximum likelihood
Naïve Bayes: Independence of attributes
Linear Discriminant Analysis (LDA): Normal distribution, Same covariance matrices
Quadratic Discriminant Analysis (QDA): Normal distribution, Different covariance matrices
