Professional Documents
Culture Documents
DSC-7 Introduction To Business Analytics
Continuing Education
University of Delhi
Editors
Dr. Rishi Rajan Sahay
Assistant Professor, Shaheed Sukhdev College of Business
Studies, University of Delhi
Dr. Sanjay Kumar
Assistant Professor, Delhi Technological University
Content Writers
Dr. Abhishek Kumar Singh, Dr. Satish Goel,
Mr. Anurag Goel, Dr. Sanjay Kumar
Academic Coordinator
Mr. Deekshant Awasthi
Published by:
Department of Distance and Continuing Education under
the aegis of Campus of Open Learning/School of Open Learning,
University of Delhi, Delhi-110007
Printed by:
School of Open Learning, University of Delhi
LESSON 1
INTRODUCTION TO BUSINESS ANALYTICS AND
DESCRIPTIVE ANALYTICS
Dr. Abhishek Kumar Singh
Assistant Professor
University of Delhi-19
abhishekbhu008@gmail.com
STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 Introduction to Business Analytics
1.4 Role of Analytics for Data-Driven Decision Making
1.5 Types of Business Analytics
1.6 Introduction to the concepts of Big Data Analytics
1.7 Overview of Machine Learning Algorithms
1.8 Introduction to relevant statistical software packages
1.9 Summary
1.10 Glossary
1.11 Answers to In-Text Questions
1.12 Self-Assessment Questions
1.13 References
1.14 Suggested Reading
1.2 INTRODUCTION
Business analytics (BA) consists of using data to gain valuable insights and make informed
decisions in a business setting. It involves analysing and interpreting data to uncover patterns,
trends, and correlations that can help organizations improve their operations, better
understand their customers, and make strategic decisions. Business analytics (BA) places a
focus on statistical analysis. In addition to statistical analysis, business analytics also focuses
on various other aspects, such as data mining, predictive modelling, data visualization,
machine learning, and data-driven decision making.
Companies committed to making data-driven decisions employ business analytics. The study
of data through statistical and operational analysis, the creation of predictive models, the use
of optimisation techniques, and the communication of these results to clients, business
partners, and company executives are all considered to be components of business analytics. It
relies on quantitative methodologies, and the data used to build business models and reach
profitable conclusions must be supported by evidence. As a result, Business Analytics
heavily relies on and utilises Big Data. Business analytics is the process of analysing data,
in the light of past outcomes and problems, in order to create an effective future plan.
Big Data, meaning very large volumes of data, is utilised to generate answers. The economy and the sectors
that prosper inside it depend on this way of conducting business and this outlook on creating
and maintaining a business. Over the past ten or so years, the word analytics has gained
popularity. Analytics is now incredibly important due to the growth of the internet and
information technology. In this lesson we are going to learn about Business Analytics. The
field of analytics integrates data, information technology, statistical analysis, and
quantitative techniques with computer-based models. All of these factors work together to
give decision-makers every possibility that can arise, allowing them to make well-informed
choices. Computer-based models also allow decision-makers to examine how their
choices would perform in various scenarios.
Fig. 1
1.3 INTRODUCTION TO BUSINESS ANALYTICS
1.3.1 Meaning: Business analytics (BA) utilizes data analysis, statistical models, and various
quantitative techniques as a comprehensive discipline and technological approach. It involves
a systematic and iterative examination of organizational data, with a specific emphasis on
statistical analysis, to facilitate informed decision-making.
Business analytics primarily entails a combination of the following: discovering novel
patterns and relationships using data mining; developing business models using quantitative
and statistical analysis; conducting A/B and multi-variable testing based on findings;
forecasting future business needs, performance, and industry trends using predictive
modelling; and reporting your findings to co-workers, management, and clients in simple-to-
understand reports.
1.3.2 Definition
Business analytics (BA) involves utilizing knowledge, tools, and procedures to analyse past
business performance in order to gain insight and inform present and future business strategy.
Business analytics is the process of transforming data into insights to improve business
choices. It is based on data and statistical approaches to provide new insights and
understanding of business performance. Some of the methods used to extract insights from
data include data management, data visualisation, predictive modelling, data mining,
forecasting simulation, and optimisation.
1.3.3 Business analytics evolution
Business analytics has been around for a very long time and has developed as more and better
technology has become available. It has its roots in operations research, which was widely
applied during World War II.
Operations research was initially designed as a methodical strategy to analyse data in military
operations. Over time, this strategy began to be applied in business domain as well.
Gradually, the study of operations evolved into management science. Furthermore, the
fundamental elements such as decision-making models, and other foundations of
management science were the same as those of operation research.
Ever since Frederick Winslow Taylor implemented management exercises in the late 19th
century, analytics has been employed in business. Henry Ford, for example, timed each
component of his newly constructed assembly line.
However, when computers were deployed in decision support systems in the late 1960s,
analytics started to garner greater attention. Since then, enterprise resource planning (ERP)
systems, data warehouses, and a huge range of other software tools and procedures have all
modified and shaped analytics.
With the advent of computers, business analytics has grown rapidly in recent years. This
development has elevated analytics to entirely new heights and opened up a world of
opportunity. Given how far analytics has come and what the discipline is now, many people
would never guess that it began in the early 1900s with Ford himself.
Business intelligence, decision support systems, and PC software all developed from
management science.
IN-TEXT QUESTIONS
1. Define Business Analytics.
2. What do you understand by the term Business Analytics evolution?
3. State two reasons why Business Analytics is important.
Business analytics can be divided into four primary categories, each of which gets more
complex. They bring us one step closer to implementing scenario insight applications for the
present and the future. Below is a description of each of these business analytics categories.
1. Descriptive analytics,
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
1. Descriptive analytics: In order to understand what has occurred in the past or is
happening right now, it summarises the data that an organisation currently has. The
simplest type of analytics is descriptive analytics, which uses data aggregation and
mining techniques. It increases the availability of data to an organization's
stakeholders, including shareholders, marketing executives, and sales managers. It can
aid in discovering strengths and weaknesses and give information about customer
behaviour. This aids in the development of strategies for the field of focused
marketing.
2. Diagnostic Analytics: This kind of analytics aids in refocusing attention from past
performance to present occurrences and identifies the variables impacting trends.
Drill-down, data mining, and other techniques are used to find the underlying cause of
occurrences. Probabilities and likelihoods are used in diagnostic analytics to
comprehend the potential causes of events. For classification and regression, methods
like sensitivity analysis and training algorithms are used.
3. Predictive Analytics: With the aid of statistical models and ML approaches, this type
of analytics is used to predict the likelihood of a future event. The outcome of
descriptive analytics is built upon to create models that estimate the likelihood of particular
outcomes. Predictive analyses are typically conducted by machine learning specialists and
can be more accurate than analyses based on business intelligence alone. Sentiment analysis is
among its most popular uses. Here, social media data already in existence is used to
construct a complete picture of a user's viewpoint. To forecast their attitude (positive,
neutral, or negative), this data is evaluated.
4. Prescriptive Analytics: It offers suggestions for the next best course of action, going
beyond predictive analytics. It makes all beneficial predictions in accordance with a
particular course of action and also provides the precise steps required to produce the
most desirable outcome. It primarily depends on a robust feedback system and
ongoing iterative analysis. It gains knowledge of the connection between acts and
their results. The development of recommendation systems is a typical use of this kind
of analytics.
Fig. No. 2
Some definitions of big data management contain two additional features in addition to
the three Vs (volume, velocity, and variety), namely:
Veracity: The level of dependability and truth that big data can provide in terms of its
applicability, rigour, and correctness.
Value: This feature examines whether information and analytics will eventually be
beneficial or detrimental as the main goal of big data collection and analysis is to
uncover insights that can guide decision-making and other activities.
Fig: 3
1.6.2 Services for Big Data Management
Organisations can pick from a wide range of big data management options when it comes to
technology. Big data management solutions can be standalone or multi-featured, and many
businesses employ several of them. The following are some of the most popular kinds of big
data management capabilities:
Data cleansing is the process of finding and resolving problems in data sets.
Data integration is the process of merging data from several sources.
Data preparation is the process of preparing data for use in analytics or other applications.
Data enrichment is the process of enhancing data by adding new data sets, fixing minor
errors, or extrapolating new information from raw data.
Data migration is the process of moving data from one environment to another, such as
from internal data centres to the cloud.
Data analytics is the process of analysing data using a variety of techniques in order to
gain insights.
algorithm will iteratively evaluate and optimise, updating weights on its own each
time.
1.7.2 Machine learning methods: Machine learning classifiers fall into three primary
categories
1. Supervised machine learning: The definition of supervised learning, commonly
referred to as supervised machine learning, is the use of labelled datasets to train
algorithms that can reliably classify data or predict outcomes. As the model receives
input data and modifies its weights until the model is properly fitted. This happens as
part of the cross-validation process to make sure the model doesn't fit too well or too
poorly. Supervised learning assists organisations in finding saleable solutions to a
range of real-world issues, such as classifying spam in a different folder from your
email. Neural networks, naive Bayes, linear regression, logistic regression, random
forest, support vector machine (SVM), and other techniques are used in supervised
learning.
2. Unsupervised Machine learning: Unsupervised learning, commonly referred to as
unsupervised machine learning, analyses and groups un-labelled datasets using
machine learning algorithms. These algorithms identify hidden patterns or data
clusters without the assistance of a human. It is the appropriate solution for
exploratory data analysis, cross-selling tactics, consumer segmentation, and picture
and pattern recognition because of its capacity to find similarities and differences in
information. Through the process of dimensionality reduction, it is also used to lower
the number of features in a model; principal component analysis (PCA) and singular
value decomposition (SVD) are two popular methods for this. The use of neural
networks, k-means clustering, probabilistic clustering techniques, and other
algorithms is also common in unsupervised learning.
3. Semi-supervised learning: A satisfying middle ground between supervised and
unsupervised learning is provided by semi-supervised learning. It employs a smaller,
labelled data set during training to direct feature extraction and classification from a
larger, unlabelled data set. If you don't have enough labelled data—or can't pay to
label enough data—to train a supervised learning system, semi-supervised learning
can help.
1.7.3 Reinforcement machine learning: Although the algorithm is not trained on
sample data, reinforcement machine learning is a behavioural machine learning model that is
similar to supervised learning. The model learns by trial and error as it goes. The
optimal suggestion or strategy will be created for a specific problem by reinforcing a string of
successful outcomes.
A subset of artificial intelligence called "machine learning" employs computer algorithms to
enable autonomous learning from data and knowledge. In machine learning, computers can
change and enhance their algorithms without needing to be explicitly programmed.
Computers can now interact with people, drive themselves, write and publish sport match
reports, and even identify terrorism suspects thanks to machine learning algorithms.
8. Epi Info
For those who might not have a background in information technology, it offers
simple data entry forms, database development, and data analytics including
epidemiology statistics, maps, and graphs.
Investigations into disease outbreaks, the creation of small to medium-sized
disease monitoring systems, and the analysis, visualisation, and reporting (AVR)
elements of bigger systems all make use of it.
It is utilised for the analysis of numerical data.
9. NVivo
It is a piece of software that enables the organisation and archiving of qualitative
data for analysis.
The analysis of unstructured text, audio, video, and image data, such as that from
interviews, focus groups (FGD), surveys, social media, and journal articles, is
done using NVivo.
You can import Word documents, PDFs, audio, video, and photos here.
It facilitates users' more effective organisation, analysis, and discovery of insights
from structured or qualitative data.
The user-friendly layout makes it instantly familiar and intuitive for the user. It
contains a free version as well as automated transcribing and auto coding.
Research using mixed methods and qualitative data is conducted using NVivo.
10. Minitab
Minitab provides both fundamental and moderately sophisticated statistical
analysis capabilities.
It has the ability to analyse a variety of data sets, automate statistical calculations,
and produce clear visualisations.
Minitab allows users to concentrate more on data analysis by
letting them examine both current and historical data to spot trends and
patterns as well as hidden links between variables.
It makes it easier to understand the data's insights.
For the examination of quantitative data, Minitab is employed.
11. Dedoose
Dedoose, a tool for qualitative and quantitative data analysis, is entirely web-
based.
1.9 SUMMARY
The disciplines of management, business, and computer science are all combined in business
analytics. The commercial component requires knowledge of the industry at a high level as
well as awareness of current practical constraints. An understanding of data, statistics, and
computer science is required for the analytical portion. Business analysts can close the gap
between management and technology thanks to this confluence of disciplines. Business
analytics also includes effective problem-solving and communication to translate data
insights into information that is understandable to executives. A related field called business
intelligence likewise uses data to better understand and inform businesses. What distinguishes
business analytics from business intelligence in terms of objectives? Although both
areas rely on data to provide answers, the goal of business intelligence is to understand how
an organisation arrived at its current position. Measurement and monitoring of key
performance indicators (KPIs) are part of this. The goal of business analytics, on the other
hand, is to support business improvements by utilizing predictive models that offer insight
into the results of suggested adjustments. Big data, statistical analysis, and data visualization
are all used in business analytics to implement organizational changes. This work includes
predictive analytics, which is crucial since it uses data that is already accessible to build
statistical models. These models can be applied to decision-making and result prediction.
Business analytics can provide specific recommendations to fix issues and enhance
enterprises by learning from the data already available.
1.10 GLOSSARY
1.13 REFERENCES
Evans, J.R. (2021), Business Analytics: Methods, Models and Decisions, Pearson
India
Kumar, U. D. (2021), Business Analytics: The Science of Data-Driven Decision
Making, Wiley India.
Larose, D. T. (2022), Data Mining and Predictive Analytics, Wiley India
Shmueli, G. (2021), Data Mining and Business Analytics, Wiley India
Cadle, J., Paul, D. and Turner, P. (2014), Business Analysis Techniques: 99 Essential
Tools for Success, BCS, Swindon.
Hass, K. B., Vander Horst, R. and Ziemski, K. (2008), Elevating the Role of the
Business Analyst, Management Concepts, ISBN 1-56726-213-9, p. 94: "As business
analysis becomes a more professionalised discipline".
LESSON 2
PREDICTIVE ANALYTICS
Dr. Satish Kumar Goel
Assistant Professor
Shaheed Sukhdev College of Business Studies
(University of Delhi)
satish@sscbsdu.ac.in
STRUCTURE
2.2 INTRODUCTION
In this chapter, we will explore the field of predictive analytics, focusing on two fundamental
techniques: Simple Linear Regression and Multiple Linear Regression. Predictive analytics is
a powerful tool for analysing data and making predictions about future outcomes. We will
cover various aspects of regression models, including parameter estimation, model validation,
coefficient of determination, significance tests, residual analysis, and confidence and
prediction intervals. Additionally, we will provide practical exercises to reinforce your
understanding of these concepts, using R or Python for implementation.
2.3.1. Introduction
Predictive analytics is the use of statistical techniques, machine learning algorithms, and
other tools to identify patterns and relationships in historical data and use them to make
predictions about future events. These predictions can be used to inform decision-making in a
wide variety of areas, such as business, marketing, healthcare, and finance.
Linear regression is the traditional statistical technique used to model the relationship
between one or more independent variables and a dependent variable.
Linear regression involving only two variables is called simple linear regression. Let us
consider two variables as ‘x’ and ‘y’. Here ‘x’ represents independent variable or explanatory
variable and ‘y’ represents the dependent variable or response variable. The dependent variable must
be a ratio (continuous) variable, whereas an independent variable can be a ratio or a categorical variable. A
regression model can be built for cross-sectional data or for time-series data. In a time-series
regression model, time is taken as the independent variable, which makes it very useful for predicting the
future. Before we develop a regression model, it is a good exercise to ensure that the two
variables are linearly related. For this, plotting the scatter diagram is really helpful. A linear
pattern can easily be identified in the data.
The Classical Linear Regression Model (CLRM) is a statistical framework used to analyse
the relationship between a dependent variable and one or more independent variables. It is a
widely used method in econometrics and other fields to study and understand the nature of
this relationship, make predictions, and test hypotheses.
Regression analysis aims to examine how changes in the independent variable(s) affect the
dependent variable. The CLRM assumes a linear relationship between the dependent variable
(Y) and the independent variable(s) (X), allowing us to estimate the parameters of this
relationship and make predictions.
The regression equation in the CLRM is expressed as:
Yi = α + βxi + μi
Homoscedasticity: the error terms have constant variance, i.e. Var(µi) = σ² for all i.
No autocorrelation: the error terms of any two different observations are uncorrelated, i.e.
Cov(ui, uj) = 0 for i ≠ j. When this assumption fails:
Cov(ui, uj) ≠ 0 (cross-sectional data): spatial autocorrelation
Cov(ut, ut+1) ≠ 0 (time-series data): autocorrelation
In cross-sectional data, if two error terms do not have zero covariance, then it is a
situation of SPATIAL CORRELATION. In time-series data, if two error terms for
consecutive time periods do not have zero covariance, then it is a situation of
AUTOCORRELATION OR SERIAL CORRELATION.
No Multicollinearity: Multicollinearity occurs when there is a high degree of correlation
between two or more independent variables in the regression model. This can pose a
problem because it becomes challenging to separate the individual effects of the
Residual Analysis: Residuals are the differences between the observed values and the
predicted values of the dependent variable. By analysing the residuals, you can evaluate the
model's performance. Some key aspects to consider are:
Checking for randomness: Plotting the residuals against the predicted values or the
independent variable can help identify any patterns or non-random behaviour.
Assessing normality: Plotting a histogram or a Q-Q plot of the residuals can indicate
whether they follow a normal distribution. Departures from normality might suggest
violations of the assumptions.
Checking for homoscedasticity: Plotting the residuals against the predicted values or the
independent variable can reveal any patterns indicating non-constant variance. The spread
of the residuals should be consistent across all levels of the independent variable.
R-squared (Coefficient of Determination): R-squared measures the proportion of the total
variation in the dependent variable that is explained by the linear regression model. A higher
R-squared value indicates a better fit. However, R-squared alone does not provide a complete
picture of model performance and should be interpreted along with other validation metrics.
Adjusted R-squared: Adjusted R-squared takes into account the number of independent
variables in the model. It penalizes the addition of irrelevant variables and provides a more
reliable measure of model fit when comparing models with different numbers of predictors.
F-statistic: The F-statistic assesses the overall significance of the linear regression model. It
compares the fit of the model with a null model (no predictors) and provides a p-value
indicating whether the model significantly improves upon the null model.
Outlier Analysis: Identify potential outliers in the data that may have a substantial impact on
the model's fit. Outliers can skew the regression line and affect the estimated coefficients. It
is important to investigate and understand the reasons behind any outliers and assess their
influence on the model.
Cross-Validation: Splitting the dataset into training and testing subsets allows you to assess
the model's performance on unseen data. The model is trained on the training set and then
evaluated on the testing set. Metrics such as mean squared error (MSE), or root mean squared
error (RMSE) can be calculated to quantify the model's predictive accuracy.
By employing these validation techniques, you can gain insights into the model's
performance, evaluate its assumptions, and make informed decisions about its reliability and
usefulness for predicting the dependent variable.
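As a rough illustration of this train/test workflow, a minimal sketch using scikit-learn is shown below; the data, the DataFrame name df, and its columns X and Y are illustrative assumptions, not taken from the text.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical data for illustration
rng = np.random.default_rng(1)
x = rng.normal(size=60)
df = pd.DataFrame({"X": x, "Y": 3 + 2 * x + rng.normal(0, 1, 60)})

# Split the data into training and testing subsets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(
    df[["X"]], df["Y"], test_size=0.3, random_state=42)

# Fit the model on the training set only
model = LinearRegression().fit(X_train, y_train)

# Evaluate predictive accuracy on the unseen testing set
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Test RMSE: {rmse:.3f}")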
ε is the error term or residual, representing the unexplained variation in the dependent
variable.
Impact of multicollinearity:
Unbiasedness: The Ordinary Least Squares (OLS) estimators remain unbiased.
Precision: OLS estimators have large variances and covariances, making precise estimation
difficult and leading to wider confidence intervals. Statistically insignificant coefficients may
be observed.
High R-squared: The R-squared value can still be high, even with statistically insignificant
coefficients.
Sensitivity: OLS estimators and their standard errors are sensitive to small changes in the
data.
Efficiency: Despite increased variance, OLS estimators are still efficient, meaning they have
minimum variance among all linear unbiased estimators.
In summary, multicollinearity undermines the precision of coefficient estimates and can lead
to unreliable statistical inference. While the OLS estimators remain unbiased, they become
imprecise, resulting in wider confidence intervals and potential insignificance of coefficients.
We will learn how to detect multicollinearity using the Variance Inflation Factor (VIF) and
explore strategies to address this issue, ensuring the accuracy and interpretability of the
regression model.
VIF stands for Variance Inflation Factor, which is a measure used to assess
multicollinearity in multiple regression model. VIF quantifies how much the variance of the
estimated regression coefficient is increased due to multicollinearity. It measures how much
the variance of one independent variable's estimated coefficient is inflated by the presence of
other independent variables in the model.
The formula for calculating the VIF for an independent variable Xj is:
VIF(Xj) = 1 / (1 − Rj²)
where Rj² is the coefficient of determination (R-squared) from a regression that
regresses Xj on all the other independent variables.
The interpretation of VIF is as follows:
If VIF(Xj) is equal to 1, it indicates that there is no correlation between Xj and the other
independent variables.
If VIF(Xj) is greater than 1 but less than 5, it suggests moderate multicollinearity.
If VIF(Xj) is greater than 5, it indicates a high degree of multicollinearity, and it is generally
considered problematic.
When assessing multicollinearity, it is common to examine the VIF values for all independent
variables in the model. If any variables have high VIF values, it indicates that they are highly
correlated with the other variables, which may affect the reliability and interpretation of the
regression coefficients.
If high multicollinearity is detected (e.g., VIF greater than 5), some steps can be taken to
address it:
Remove one or more of the highly correlated independent variables from the model.
Combine or transform the correlated variables into a single variable.
Obtain more data to reduce the correlation among the independent variables.
By addressing multicollinearity, the stability and interpretability of the regression model can
be improved, allowing for more reliable inferences about the relationships between the
independent variables and the dependent variable.
HOW TO DETECT MULTICOLLINEARITY
To detect multicollinearity in your regression model, you can use several methods:
Pairwise Correlation: Calculate the pairwise correlation coefficients between each pair of
explanatory variables. If the correlation coefficient is very high (typically greater than 0.8), it
indicates potential multicollinearity. However, low pairwise correlations do not guarantee the
absence of multicollinearity.
Variance Inflation Factor (VIF) and Tolerance: VIF measures the extent to which the
variance of the estimated regression coefficient is increased due to multicollinearity. High
VIF values (greater than 10) suggest multicollinearity. Tolerance, which is the reciprocal of
VIF, measures the proportion of variance in the predictor variable that is not explained by
other predictors. Low tolerance values (close to zero) indicate high multicollinearity.
Insignificance of Individual Variables: If many of the explanatory variables in the model are
individually insignificant (i.e., their t-statistics are statistically insignificant) despite a high R-
squared value, it suggests the presence of multicollinearity.
Auxiliary Regressions: Conduct auxiliary regressions where each independent variable is
regressed against the remaining independent variables. Check the overall significance of
these regressions using the F-test. If any of the auxiliary regressions show significant F-
values, it indicates collinearity with other variables in the model.
HOW TO FIX MULTICOLLINEARITY
To address multicollinearity, you can consider the following approaches:
Increase Sample Size: By collecting a larger sample, you can potentially reduce the
severity of multicollinearity. With a larger sample, you can include individuals with
different characteristics, reducing the correlation between variables. Increasing the
sample size leads to more efficient estimators and mitigates the multicollinearity
problem.
Drop Non-Essential Variables: If you have variables that are highly correlated with
each other, consider excluding non-essential variables from the model. For example,
if both father's and mother's education are highly correlated, you can choose to
include only one of them. However, be cautious when dropping variables as it may
result in model misspecification if the excluded variable is theoretically important.
Detecting and addressing multicollinearity is crucial for obtaining reliable regression results.
By understanding the signs of multicollinearity and applying appropriate remedies, you can
improve the accuracy and interpretability of your regression model.
Graphical Method
Durbin Watson test
Breusch-Godfrey test
1. Graphical Method
Autocorrelation can be detected using graphical methods. Here are a few graphical
techniques to identify autocorrelation:
Residual Plot: Plot the residuals of the regression model against the corresponding time or
observation index. If there is no autocorrelation, the residuals should appear random and
evenly scattered around zero. However, if autocorrelation is present, you may observe
patterns or clustering of residuals above or below zero, indicating a systematic relationship.
Partial Autocorrelation Function (PACF) Plot: The PACF plot displays the correlation
between the residuals at different lags, while accounting for the intermediate lags. In the
absence of autocorrelation, the PACF values should be close to zero for all lags beyond the
first. If there is significant autocorrelation, you may observe spikes or significant values
beyond the first lag.
Autocorrelation Function (ACF) Plot: The ACF plot shows the correlation between the
residuals at different lags, without accounting for the intermediate lags. Similar to the PACF
plot, significant values beyond the first lag in the ACF plot indicate the presence of
autocorrelation.
Figure 1.2
Autocorrelation and partial autocorrelation function (ACF and PACF) plots, prior to
differencing (A and B) and after differencing (C and D)
In both the PACF and ACF plots, significance can be determined by comparing the
correlation values against the confidence intervals. If the correlation values fall outside the
confidence intervals, it suggests the presence of autocorrelation.
It's important to note that these graphical methods provide indications of autocorrelation, but
further statistical tests, such as the Durbin-Watson test or Ljung-Box test, should be
conducted to confirm and quantify the autocorrelation in the model.
2. Durbin Watson D Test
The Durbin-Watson test is a statistical test used to detect autocorrelation in the residuals of a
regression model. It is specifically designed for detecting first-order autocorrelation, which is
the correlation between adjacent observations.
The Durbin-Watson test statistic is computed using the following formula:
d = Σ (e_i − e_(i−1))² / Σ e_i²
where:
· e_i is the residual for observation i.
· e_i-1 is the residual for the previous observation (i-1).
The test statistic is then compared to critical values to determine the presence of
autocorrelation. The critical values depend on the sample size, the number of independent
variables in the regression model, and the desired level of significance.
The Durbin-Watson test statistic, denoted as d, ranges from 0 to 4. The test statistic is
calculated based on the residuals of the regression model and is interpreted as follows:
A value of d close to 2 indicates no significant autocorrelation. It suggests that the residuals
are independent and do not exhibit a systematic relationship.
A value of d less than 2 indicates positive autocorrelation. It suggests that there is a positive
relationship between adjacent residuals, meaning that if one residual is high, the next one is
likely to be high as well.
A value of d greater than 2 indicates negative autocorrelation. It suggests that there is a
negative relationship between adjacent residuals, meaning that if one residual is high, the
next one is likely to be low.
The closer it is to zero, the greater is the evidence of positive autocorrelation, and the closer it
is to 4, the greater is the evidence of negative autocorrelation. If d is about 2, there is no
evidence of positive or negative (first-) order autocorrelation.
To reinforce the concepts covered in this chapter, practical exercises using R/Python
programming are shown below. These exercises involve implementing simple OLS
regression using R or Python, interpreting the results obtained, and conducting assumption
tests such as checking for multicollinearity, autocorrelation, and normality. Furthermore,
regression analysis with categorical/dummy/qualitative variables will be performed to
understand their impact on the dependent variable.
Exercise 1: Perform simple OLS regression on R/Python and interpret the results obtained.
Sol. Certainly! Here's an example of how you can perform a simple Ordinary Least Squares
(OLS) regression in both R and Python, along with results interpretation.
Let's assume you have a dataset with a dependent variable (Y) and an independent variable
(X). We will use this dataset to demonstrate the OLS regression.
Using R:
# Load the necessary libraries
library(dplyr)
# Read the dataset
data <- read.csv("your_dataset.csv")
# Perform the OLS regression
model <- lm(Y ~ X, data = data)
# Print the summary of the regression results
summary(model)
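The text mentions both R and Python; a corresponding minimal Python sketch using statsmodels, under the same assumptions (a hypothetical CSV file with columns Y and X), might look like this:

import pandas as pd
import statsmodels.api as sm

# Read the dataset (same hypothetical file as in the R example)
data = pd.read_csv("your_dataset.csv")

# Perform the OLS regression (the constant/intercept is added explicitly)
X = sm.add_constant(data["X"])
model = sm.OLS(data["Y"], X).fit()

# Print the summary of the regression results
print(model.summary())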
p-values: The regression results also provide p-values for the coefficients. These p-values
indicate the statistical significance of the coefficients. Generally, a p-value less than a
significance level (e.g., 0.05) suggests that the coefficient is statistically significant, implying
a relationship between the independent variable and the dependent variable.
R-squared: The R-squared value (R-squared or R2) measures the proportion of the variance
in the dependent variable that can be explained by the independent variable(s). It ranges from
0 to 1, with higher values indicating a better fit of the regression model to the data. R-squared
can be interpreted as the percentage of the dependent variable's variation explained by the
independent variable(s).
Residuals: The regression results also include information about the residuals, which are the
differences between the observed values of the dependent variable and the predicted values
from the regression model. Residuals should ideally follow a normal distribution with a mean
of zero, and their distribution can provide insights into the model's goodness of fit and
potential violations of the regression assumptions.
It's important to note that interpretation may vary depending on the specific context and
dataset. Therefore, it's essential to consider the characteristics of your data and the objectives
of your analysis while interpreting the results of an OLS regression.
Exercise 2. Test the assumptions of OLS (multicollinearity, autocorrelation, normality etc.)
on R/Python.
Sol. To test the assumptions of OLS, including multicollinearity, autocorrelation, and
normality, you can use various diagnostic tests in R or Python. Here are the steps and some
commonly used tests for each assumption:
Multicollinearity:
Step 1: Calculate the pairwise correlation matrix between the independent variables using the
cor() function in R or the corrcoef() function in Python (numpy).
Step 2: Calculate the Variance Inflation Factor (VIF) for each independent variable using the
vif() function from the "car" package in R or the variance_inflation_factor() function from
the "statsmodels" library in Python. VIF values greater than 10 indicate high
multicollinearity.
Step 3: Perform auxiliary regressions by regressing each independent variable against the
remaining independent variables to identify highly collinear variables.
Autocorrelation:
Step 1: Plot the residuals against the predicted values (fitted values) from the regression
model. In R, you can use the plot() function with the residuals() and fitted() functions. In
Python, you can use the scatter() function from matplotlib.
Step 2: Conduct the Durbin-Watson test using the dwtest() function from the "lmtest"
package in R or the durbin_watson() function from the "statsmodels.stats.stattools" module in
Python. A value close to 2 indicates no autocorrelation, while values significantly smaller or
greater than 2 suggest positive or negative autocorrelation, respectively.
Normality of Residuals:
Step 1: Plot a histogram or a kernel density plot of the residuals. In R, you can use the hist()
or density() functions. In Python, you can use the histplot() or kdeplot() functions from the
seaborn library.
Step 2: Perform a normality test such as the Shapiro-Wilk test using the shapiro.test()
function in R or the shapiro() function from the "scipy.stats" module in Python. A p-value
greater than 0.05 means we cannot reject the hypothesis that the residuals are normally distributed.
It's important to note that these tests provide diagnostic information, but they may not be
definitive. It's also advisable to consider the context and assumptions of the specific
regression model being used.
Here is the random data set to perform the regression code in either R or Python.
This dataset consists of three columns: y represents the dependent variable, and x1 and x2 are
the independent variables. Each row corresponds to an observation in the dataset.
We can use this dataset to run the provided code and perform diagnostic tests on the OLS
regression model.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
# Additional imports needed by the diagnostics below
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy import stats
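# Assumed illustrative random data (the original random dataset is not reproduced here,
# so these values stand in for it and let the snippet run end to end)
np.random.seed(42)
x1 = np.random.normal(0, 1, 100)
x2 = 0.5 * x1 + np.random.normal(0, 1, 100)   # mildly correlated with x1
y = 2 + 1.5 * x1 - 0.8 * x2 + np.random.normal(0, 1, 100)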
# Create a DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2})
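# Assumed step: build the design matrix and fit the OLS model
# (defines X and results, which the diagnostics below rely on)
X = sm.add_constant(data[['x1', 'x2']])
results = sm.OLS(data['y'], X).fit()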
# Diagnostic tests
print("Multicollinearity:")
vif = pd.DataFrame()
vif["Variable"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
print("\nAutocorrelation:")
residuals = results.resid
fig, ax = plt.subplots()
ax.scatter(results.fittedvalues, residuals)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
plt.show()
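# Assumed, to match the description below: Durbin-Watson statistic on the residuals
print(f"Durbin-Watson statistic: {durbin_watson(residuals):.3f}")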
print("\nNormality of Residuals:")
sns.histplot(residuals, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()
shapiro_test = stats.shapiro(residuals)
print(f"Shapiro-Wilk test p-value: {shapiro_test[1]}")
In this example, we generated a random dataset with two independent variables (x1 and x2)
and a dependent variable (y). We fit an OLS regression model using the statsmodels library.
Then, we perform diagnostic tests for multicollinearity, autocorrelation, and normality of
residuals.
The code calculates the VIF for each independent variable, plots the residuals against the
fitted values, performs the Durbin-Watson test for autocorrelation, and plots a histogram of
the residuals. Additionally, the Shapiro-Wilk test is conducted to check the normality of
residuals.
We can run this code in a Python environment to see the results and interpretations for each
diagnostic test based on the random dataset provided.
Exercise 3: Perform regression analysis with categorical/dummy/qualitative variables on R/Python.
import pandas as pd
import statsmodels.api as sm
# Create a DataFrame with the data
data = {
'y': [3.3723, 5.5593, 8.1878, -2.4581, 3.8578, 5.4747, 6.4135, 8.1032,
5.56, 5.3514, 5.8457],
df = pd.DataFrame(data)
In this example, we have created a DataFrame df with the y, x1, x2, and category variables.
The category variable is converted into dummy variables using the get_dummies function,
and the category A column is dropped to avoid multicollinearity. We then define the
dependent variable y and the independent variables X, including the dummy variable
category_B. A constant term is added to the independent variables using sm.add_constant.
Finally, we fit the OLS model using sm.OLS and print the summary of the regression results
using model.summary(). The regression analysis provides the estimated coefficients, standard
errors, t-statistics, and p-values for each independent variable, including the dummy variable
category B.
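For reference, a self-contained sketch of this workflow is given below; the x1, x2 and category values are hypothetical placeholders, and only the y values appear in the snippet above:

import pandas as pd
import statsmodels.api as sm

# Hypothetical data; only the y values come from the snippet above
df = pd.DataFrame({
    'y': [3.3723, 5.5593, 8.1878, -2.4581, 3.8578, 5.4747, 6.4135, 8.1032, 5.56, 5.3514, 5.8457],
    'x1': [1.2, 2.3, 3.1, 0.5, 1.8, 2.0, 2.7, 3.3, 2.1, 1.9, 2.4],
    'x2': [4, 6, 8, 2, 5, 5, 7, 9, 6, 5, 6],
    'category': ['A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'B', 'A', 'B']
})

# Convert the categorical variable into dummy variables and drop category_A
# to avoid the dummy-variable trap (perfect multicollinearity)
dummies = pd.get_dummies(df['category'], prefix='category', drop_first=True).astype(float)

# Dependent variable y and independent variables X, including the dummy category_B
y = df['y']
X = sm.add_constant(pd.concat([df[['x1', 'x2']], dummies], axis=1))

# Fit the OLS model and print the regression results
model = sm.OLS(y, X).fit()
print(model.summary())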
2.6 SUMMARY
2.8 REFERENCES
1. Kumar, U. D. (2017), Business Analytics: The Science of Data-Driven Decision
Making, First Edition, Wiley India.
2. Mueller, A. C. and Guido, S., Introduction to Machine Learning with Python,
O'Reilly Media, Inc.
3. Shmueli, G., Bruce, P. C., Gedeck, P. and Patel, N. R., Data Mining for Business
Analytics: Concepts, Techniques, and Applications in Python, Wiley.
LESSON 3
LOGISTIC AND MULTINOMIAL REGRESSION
Anurag Goel
Assistant Professor, CSE Dept.
Delhi Technological University, New Delhi
Email-Id: anurag@dtu.ac.in
STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Logistic Function
3.4 Omnibus Test
3.5 Wald Test
3.6 Hosmer-Lemeshow Test
3.7 Pseudo R Square
3.8 Classification Table
3.9 Gini Coefficient
3.10 ROC
3.11 AUC
3.12 Summary
3.13 Glossary
3.14 Answers to In-Text Questions
3.15 Self-Assessment Questions
3.16 References
3.17 Suggested Readings
3.2 INTRODUCTION
In machine learning, we are often required to determine whether a particular observation belongs to a
given class. In such cases, one can use logistic regression. Logistic Regression, a popular
supervised learning technique, is commonly employed when the desired outcome is a
categorical variable such as binary decisions (e.g., 0 or 1, yes or no, true or false). It finds
extensive applications in various domains, including fake news detection and cancerous cell
identification.
Some examples of logistic regression applications are as follows:
To detect whether a given news item is fake or not.
To detect whether a given cell is cancerous or not.
In essence, logistic regression can be understood as the probability of belonging to a class
given a particular input variable. Since it’s probabilistic in nature, the logistic regression
output values lie in the range of 0 and 1.
Generally, when we think about regression from a strictly statistical perspective, the output
value is not restricted to a particular interval. To achieve this restriction in logistic
regression, we utilise the logistic function. Intuitively, logistic regression can be understood
as a simple regression model on top of whose output value we apply a logistic
function, so that the final output becomes restricted to the range defined above.
Generally, logistic regression results work well when the output is of binary type, that is, it
either belongs to a specific category or it does not. This, however, is not always the case in
real-life problem statements. We may encounter a lot of scenarios where we have a
dependent variable having multiple classes or categories. In such cases, Multinomial
Regression emerges as a valuable extension of logistic regression, specifically designed to
handle multiclass problems. Multinomial Regression is the generalization of logistic
regression to multiclass problems. For example, based on the results of some analysis,
predicting the engineering branch students will choose for their graduation is a multinomial
regression problem since the output categories of engineering branches are multiple. In this
multinomial regression problem, the engineering branch will be the dependent variable
predicted by the multinomial regression model while the independent variables are student’s
marks in XII board examination, student’s score in engineering entrance exam, student’s
interest areas/courses etc. These independent variables are used by the multinomial regression
model to predict the outcome i.e. engineering branch the student may opt for.
It is a mathematical function that assigns values between 0 and 1 based on the input variable.
It is characterized by its S-shaped curve and is commonly used in statistics, machine learning,
and neural networks to model non-linear relationships and provide probabilistic
interpretations.
3.3.2 Estimation of probability using logistic function
The logistic function is often used for estimating probabilities in various fields. By applying
the logistic function to a linear combination of input variables, such as in logistic regression,
it transforms the output into a probability value between 0 and 1. This allows for the
prediction and classification of events based on their likelihoods.
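As a concrete illustration of this transformation, a minimal sketch is shown below; the coefficients beta0 and beta1 and the input x are placeholders, not values from the text. The underlying formula is the standard one, p = 1 / (1 + e^(−(β0 + β1·x))).

import numpy as np

def logistic(z):
    # Standard logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Probability estimate from a linear combination of inputs
beta0, beta1 = -1.0, 0.8        # placeholder coefficients
x = 2.5                         # a single input value
p = logistic(beta0 + beta1 * x)
print(f"Estimated probability: {p:.3f}")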
3.4 OMNIBUS TEST
The Omnibus test assesses whether the predictor variables in a model are jointly significant.
Its test statistic is the difference in deviance between the reduced and full models,
Omnibus χ² = Dr − Df
where Dr represents the deviance of the reduced model (without predictors) and Df represents
the deviance of the full model (with predictors).
The Omnibus test statistic approximately follows chi-square distribution with degrees of
freedom given by the difference in the number of predictors between the full and reduced
models. By comparing the test statistic to the chi-square distribution and calculating the
associated p-value, we can calculate the collective statistical significance of the predictor
variables.
When the calculated p-value is lower than a predefined significance level (e.g., 0.05), we
reject the null hypothesis, indicating that the group of predictor variables collectively has a
statistically significant influence on the dependent variable. On the other hand, if the p-value
exceeds the significance level, we fail to reject the null hypothesis, suggesting that the
predictors may not have a significant collective effect.
The Omnibus test provides a comprehensive assessment of the overall significance of the
predictor variables within a regression model, aiding in the understanding of how these
predictors jointly contribute to explaining the variation in the dependent variable.
Let's consider an example where we have a regression model with three predictor variables
(X1, X2, X3) and a continuous dependent variable (Y). We want to assess the overall
significance of these predictors using the Omnibus test.
Here is a sample dataset with the predictor variables and the dependent variable:
X1 X2 X3 Y
2.5 6 8 10.2
3.2 4 7 12.1
1.8 5 6 9.5
2.9 7 9 11.3
3.5 5 8 13.2
2.1 6 7 10.8
2.7 7 6 9.7
3.9 4 9 12.9
2.4 5 8 10.1
2.8 6 7 11.5
By using statistical software, we obtain the estimated coefficients and the deviance of the full
model:
Deviance_reduced = 15.924
By referring to the chi-square distribution table or using statistical software, we determine the
p-value associated with the Omnibus test statistic. Let's assume the p-value is 0.022.
3.5 WALD TEST
The Wald test assesses the significance of an individual regression coefficient. Its test statistic is
W = (β − β₀)² / Var(β)
where β is the estimated coefficient for the predictor variable of interest, β₀ is the
hypothesized value of the coefficient under the null hypothesis (typically 0 for testing if the
coefficient is zero) and Var(β) is the estimated variance of the coefficient.
The Wald test statistic is compared to the chi-square distribution, where the degrees of
freedom are set to 1 (since we are testing a single parameter) to obtain the associated p-value.
Rejecting the null hypothesis occurs when the calculated p-value falls below a predetermined
significance level (e.g., 0.05), indicating that the predictor variable has a statistically
significant impact on the dependent variable.
The Wald test allows us to determine the individual significance of predictor variables by
testing whether their coefficients significantly deviate from zero. It is a valuable tool for
identifying which variables have a meaningful impact on the outcome of interest in a
regression model.
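In practice, the z-statistics and p-values that standard logistic-regression software reports for each coefficient are Wald tests. A minimal sketch with statsmodels, using simulated data rather than the textbook example, might look like this:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated binary-outcome data (illustrative only)
rng = np.random.default_rng(0)
X = pd.DataFrame({'X1': rng.normal(size=200), 'X2': rng.normal(size=200)})
p = 1 / (1 + np.exp(-(0.5 + 1.2 * X['X1'] - 0.7 * X['X2'])))
y = rng.binomial(1, p)

# Fit the logistic regression; the 'z' column of the summary holds Wald z-statistics
model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(model.summary())
print(model.pvalues)   # Wald p-values for each coefficient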
Let's consider an example where we have a logistic regression model with two predictor
variables (X1 and X2) and a binary outcome variable (Y). We want to assess the significance
of the coefficient for each predictor using the Wald test.
Here is a sample dataset with the predictor variables and the binary outcome variable:
X1 X2 Y
2.5 6 0
3.2 4 1
1.8 5 0
2.9 7 1
3.5 5 1
2.1 6 0
2.7 7 1
3.9 4 0
2.4 5 0
2.8 6 1
significantly different from zero, indicating that X2 may not have a significant effect on the
binary outcome variable Y.
In summary, based on the Wald tests, we do not have sufficient evidence to conclude that
either X1 or X2 has a significant impact on the binary outcome variable in the logistic
regression model.
IN-TEXT QUESTIONS
1. What does the Wald test statistic compare to obtain the associated p-value?
a) The F-distribution
b) The t-distribution
c) The normal distribution
d) The chi-square distribution
3.6 HOSMER-LEMESHOW TEST
The Hosmer-Lemeshow test assesses the goodness-of-fit of a logistic regression model by
grouping observations into bins of predicted probability and comparing observed and
expected outcomes in each bin. Its test statistic is
H = Σ (Oij − Eij)² / Eij
where Oij is the observed number of outcomes (events or non-events) in the ith bin and jth
outcome category, and Eij is the expected number of outcomes (events or non-events) in the ith
bin and jth outcome category, calculated as the sum of predicted probabilities in the bin for
the jth outcome category.
The test statistic H follows an approximate chi-square distribution with degrees of freedom
equal to the number of bins minus the number of model parameters. A smaller p-value
obtained by comparing the test statistic to the chi-square distribution suggests a poorer fit of
the model to the data, indicating a lack of goodness-of-fit.
By conducting the Hosmer-Lemeshow test, we can determine whether the logistic regression
model adequately fits the observed data. A non-significant result (p > 0.05) indicates that the
model fits well, suggesting that the predicted probabilities align closely with the observed
outcomes. Conversely, a significant result (p < 0.05) suggests a lack of fit, indicating that the
model may not accurately represent the data.
The Hosmer-Lemeshow test is a valuable tool in assessing the goodness-of-fit of logistic
regression models, allowing us to evaluate the model's performance in predicting outcomes
based on observed and predicted probabilities.
Let's consider the example again with the logistic regression model predicting the probability
of a disease (Y) based on a single predictor variable (X). We will divide the predicted
probabilities into three bins and calculate the observed and expected frequencies in each bin.
Y X Predicted Probability
0 2.5 0.25
1 3.2 0.40
0 1.8 0.15
1 2.9 0.35
1 3.5 0.45
0 2.1 0.20
1 2.7 0.30
0 3.9 0.60
0 2.4 0.18
1 2.8 0.28
Bin: [0.1-0.3]
Total cases in bin: 3
Observed cases (Y = 1): 1
Expected cases (sum of predicted probabilities): 0.25 + 0.20 + 0.28 = 0.73
Bin: [0.3-0.5]
Total cases in bin: 4
Observed cases (Y = 1): 2
Expected cases (sum of predicted probabilities): 0.40 + 0.35 + 0.30 + 0.28 = 1.33
Bin: [0.5-0.7]
Total cases in bin: 3
Observed cases (Y = 1): 2
Expected cases (sum of predicted probabilities): 0.45 + 0.60 = 1.05
where ℒmodel is the log-likelihood of the full model, ℒnull is the log-likelihood of the null
model (a model with only an intercept term) and ℒmax is the log-likelihood of a model with
perfect prediction (a hypothetical model that perfectly predicts all outcomes).
Nagelkerke's R-squared ranges from 0 to 1, with 0 indicating that the predictors have no
explanatory power, and 1 suggesting a perfect fit of the model. However, it is important to
note that Nagelkerke's R-squared is an adjusted measure and should not be interpreted in the
same way as R-squared in linear regression.
Pseudo R-squared provides an indication of how well the predictor variables explain the
variance in the dependent variable in logistic regression. While it does not have a direct
interpretation as the proportion of variance explained, it serves as a relative measure to
compare the goodness-of-fit of different models or assess the improvement of a model
compared to a null model.
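As a practical aside, most logistic-regression software reports one of these measures directly; for instance, statsmodels exposes McFadden's pseudo R-squared on a fitted Logit result. A short sketch, assuming the fitted model object from the earlier Wald-test example:

# Assuming `model` is a fitted statsmodels Logit result, as in the earlier sketch
print(f"McFadden's pseudo R-squared: {model.prsquared:.3f}")
print(f"Log-likelihood of the full model: {model.llf:.3f}")
print(f"Log-likelihood of the null model: {model.llnull:.3f}")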
One commonly used pseudo R-squared measure is the Cox and Snell R-squared. Let's
calculate the Cox and Snell R-squared using the given example of a logistic regression model
with two predictor variables.
X1 X2 Y
2.5 6 0
3.2 4 1
1.8 5 0
2.9 7 1
3.5 5 1
2.1 6 0
2.7 7 1
3.9 4 0
2.4 5 0
2.8 6 1
outputs 20 input cells as cancerous and rest 80 as non-cancerous cells. Out of the total
predicted cancerous cells, only 5 input cells are actually cancerous as per the ground truth
while the rest 15 cells are non-cancerous. On the other hand, out of the total predicted non-
cancerous cells, 75 cells are also non-cancerous cells in the ground truth but 5 cells are
cancerous. Here, cancerous cell is considered as positive class while non-cancerous cell is
considered as negative class for the given classification problem. Now, we define the four
primary building blocks of the various evaluation metrics of classification models as follows:
True Positive (TP): The number of input cells for which the classification model X correctly
predicts that they are cancerous cells is referred to as True Positive. For example, for the
model X, TP = 5.
True Negative (TN): The number of input cells for which the classification model X
correctly predicts that they are non-cancerous cells is referred to as True Negative. For
example, for the model X, TN = 75.
False Positive (FP): The number of input cells for which the classification model X
incorrectly predicts that they are cancerous cells is referred to as False Positive. For example,
for the model X, FP = 15.
False Negative (FN): The number of input cells for which the classification model X
incorrectly predicts that they are non-cancerous cells is referred to as False Negative.
For example, for the model X, FN = 5.
                                    Actual
                          Cancerous        Non-Cancerous
Predicted   Cancerous     TP = 5           FP = 15
            Non-Cancerous FN = 5           TN = 75
3.8.1 Sensitivity
Sensitivity, also referred to as True Positive Rate or Recall, is calculated as the ratio of
correctly predicted cancerous cells to the total number of cancerous cells in the ground truth.
To compute sensitivity, you can use the following formula:
Sensitivity = TP / (TP + FN)
For model X, Sensitivity = 5 / (5 + 5) = 0.5.
3.8.2 Specificity
Specificity is defined as the ratio of number of input cells that are correctly predicted as non-
cancerous to the total number of non-cancerous cells in the ground truth. Specificity is also
known as True Negative Rate. To compute specificity, we can use the following formula:
Specificity = TN / (TN + FP)
For model X, Specificity = 75 / (75 + 15) ≈ 0.83.
3.8.3 Accuracy
Accuracy is calculated as the ratio of correctly classified cells to the total number of cells. To
compute accuracy, you can use the following formula:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
For model X, Accuracy = (5 + 75) / 100 = 0.8.
3.8.4 Precision
Precision is calculated as the ratio of the correctly predicted cancerous cells to the total
number of cells predicted as cancerous by the model. To compute precision, you can use the
following formula:
Precision = TP / (TP + FP)
For model X, Precision = 5 / (5 + 15) = 0.25.
3.8.5 F score
The F1-score is calculated as the harmonic mean of Precision and Recall. To compute the F1-
score, you can use the following formula:
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
For model X, F1-score = 2 × (0.25 × 0.5) / (0.25 + 0.5) ≈ 0.33.
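A short sketch that reproduces these metrics from the confusion-matrix counts of model X given above:

# Confusion-matrix counts for model X from the text
TP, FP, FN, TN = 5, 15, 5, 75

sensitivity = TP / (TP + FN)                      # recall / true positive rate
specificity = TN / (TN + FP)                      # true negative rate
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(sensitivity, specificity, accuracy, precision, f1)
# 0.5  0.8333...  0.8  0.25  0.3333...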
IN-TEXT QUESTIONS
3. For the model X results on the given dataset of 100 cells, the precision of model is
a) 0 b) 0.25
c) 0.5 d) 1
4. For the model X results on the given dataset of 100 cells, the recall of model is
a) 0 b) 0.25
c) 0.5 d) 1
3.10 ROC
The performance of a binary classification model, particularly in logistic regression and other
machine learning techniques, is assessed using a graphical representation called the Receiver
Operating Characteristic (ROC) curve. It illustrates the trade-off between the true positive rate
(sensitivity) and the false positive rate (1 minus specificity) for various classification
thresholds.
Plotting the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds produces the ROC curve. The formulas for TPR and FPR are as follows:
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
The ROC curve lets us evaluate the model's capacity to distinguish between positive and negative examples at various classification thresholds. With a TPR of 1 and an FPR of 0, a perfect classifier would have a ROC curve that reaches the top left corner of the plot. The closer the ROC curve is to the top left corner, the greater the model's discriminatory power.
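The sketch below shows how such a curve is typically traced out in practice with scikit-learn, by sweeping the threshold over a set of predicted probabilities; the labels and scores are made-up illustrative values, not results from the lesson:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])                        # ground-truth labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7, 0.6, 0.3])  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # FPR and TPR at each threshold

plt.plot(fpr, tpr, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend()
plt.show()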
3.11 AUC
The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve is a statistic used to assess the effectiveness of a binary classification model. It represents the likelihood that a randomly selected positive instance will receive a higher predicted probability than a randomly selected negative instance.
The AUC is calculated by integrating the ROC curve. However, it is important to note that
the AUC does not have a specific formula since it involves calculating the area under a curve.
Instead, it is commonly calculated using numerical methods or software.
The AUC value ranges between 0 and 1. A model with an AUC of 0.5 indicates a random
classifier, where the model's predictive power is no better than chance. An AUC value that is
nearer 1 indicates a classifier that is more accurate and is better able to distinguish between
positive and negative situations. Conversely, an AUC value closer to 0 suggests poor
performance, with the model performing worse than random guessing.
In binary classification tasks, the AUC is a commonly used statistic because it offers a succinct assessment of the model's performance across different classification thresholds. It is especially useful when the dataset is imbalanced, i.e., when the numbers of positive and negative instances differ significantly.
In conclusion, the AUC measure evaluates a binary classification model's total discriminatory
power by delivering a single value that encapsulates the model's capacity to rank cases
properly. Better classification performance is shown by higher AUC values, whilst worse
performance is indicated by lower values.
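In practice the AUC is usually obtained numerically from the same labels and predicted scores used for the ROC curve; a brief sketch with the illustrative values from the previous example:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7, 0.6, 0.3])

auc = roc_auc_score(y_true, y_score)
print(round(auc, 3))    # values near 1 indicate strong separation; 0.5 is chance level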
IN-TEXT QUESTIONS
5. Which of the following illustrates the trade-off between the True Positive Rate and the
False Positive Rate?
a) Gini Coefficient b) F1-score
c) ROC d) AUC
6. Which of the following AUC values indicates a more accurate classifier?
a) 0.01 b) 0.25
c) 0.5 d) 0.99
7. What is the range of values for the Gini coefficient?
a) -1 to 1
b) 0 to 1
c) 0 to infinity
d) -infinity to infinity
8. How can the Gini coefficient be computed?
a) By calculating the area under the precision-recall curve
b) By calculating the area under the receiver operating characteristic (ROC)
curve
c) By calculating the ratio of true positives to true negatives.
d) By calculating the ratio of false positives to false negatives.
3.12 SUMMARY
Logistic regression is used to solve the classification problems by producing the probabilistic
values within the range of 0 and 1. Logistic regression uses Logistic function i.e. sigmoid
function. Multinomial Regression is the generalization of logistic regression to multiclass
problems. Omnibus test is a statistical test utilized to test the significance of several model
parameters at once. Wald test is a statistical test used to assess the significance of individual
predictor variables in a regression model. Hosmer-Lemeshow test is a statistical test
employed to assess the adequacy of a logistic regression model. Pseudo R-square is a
measure to assess the proportion of variance in the dependent variable explained by the
predictor variables. There are various classification metrics namely Sensitivity, Specificity,
Accuracy, Precision, F-score, Gini Coefficient, ROC and AUC, which are utilized to evaluate
the performance of a classifier model.
3.13 GLOSSARY
Terms                    Definition
Omnibus test             A statistical test used to test the significance of multiple model parameters simultaneously.
Wald test                A statistical test used to evaluate the significance of each individual predictor variable within a regression model.
Hosmer-Lemeshow test     A statistical test utilized to assess the adequacy of fit for a logistic regression model.
ROC curve                A curve that demonstrates the trade-off between the true positive rate and the false positive rate across various classification thresholds.
3.16 REFERENCES
LaValley, M. P. (2008). Logistic regression. Circulation, 117(18), 2395-2399.
Wright, R. E. (1995). Logistic regression.
Chatterjee, S., & Simonoff, J. S. (2013). Handbook of regression analysis. John Wiley & Sons.
Kleinbaum, D. G., & Klein, M. (2002). Logistic regression. Springer-Verlag.
DeMaris, A. (1995). A tutorial in logistic regression. Journal of Marriage and the Family, 956-968.
Osborne, J. W. (2014). Best practices in logistic regression. Sage Publications.
Bonaccorso, G. (2017). Machine learning algorithms. Packt Publishing Ltd.
LESSON 4
DECISION TREE AND CLUSTERING
Dr. Sanjay Kumar
Dept. of Computer Science and Engineering,
Delhi Technological University,
Email-Id: sanjay.kumar@dtu.ac.in
STRUCTURE
4.2 INTRODUCTION
Decision Tree is a popular machine learning approach for classification and regression tasks.
Its structure is similar to a flowchart, where internal nodes represent features or attributes,
branches depict decision rules, and leaf nodes signify outcomes or predicted values. The data
are divided recursively according to feature values by the decision tree algorithm to create the
tree. It chooses the best feature for data partitioning at each stage by analysing parameters
such as information gain or Gini impurity. The goal is to divide the data into homogeneous
subsets within each branch to increase the tree's capacity for prediction.
By choosing a path through the tree based on feature values, the decision tree can be used to generate predictions on new, unseen data after it has been constructed. Figure 4.1 shows a decision tree that helps classify an animal based on a series
of questions. The flowchart begins with the question, "Is it a mammal?" If the answer is
"Yes," we follow the branch on the left. The next question asks, "Does it have spots?" If the
answer is "Yes," we conclude that it is a leopard. If the answer is "No," we determine it is a
cheetah.
If the answer to the initial question, "Is it a mammal?" is "No," we follow the branch on the
right, which asks, "Is it a bird?" If the answer is "Yes," we classify it as a parrot. If the answer
is "No," we classify it as a fish.
Thus, the decision tree demonstrates a classification scenario in which we aim to determine the type of animal based on specific attributes. By following the flowchart, we can systematically navigate through the questions to reach a final classification.
4.3 Classification and Regression Tree
A popular machine learning approach for classification and regression tasks is called the
Classification and Regression Tree (CART). It is a decision tree-based model that divides the
data into subsets according to the values of the input features and then predicts the target
variable using the tree structure.
CART is especially popular because of how easy it is to interpret. Each internal node represents a test on a specific feature, and each leaf node represents a class label or a predicted value, forming a binary tree structure. The method divides the data iteratively according to the features, with the goal of producing subsets that are homogeneous with respect to the target variable.
In classification tasks, CART measures the impurity or disorder within each node using a
criterion like Gini impurity or entropy. Selecting the best feature and split point at each node
aims to reduce this impurity. The outcome is a tree that correctly categorises new instances
according to their feature values. In regression problems, CART measures the quality of each
split using a metric called mean squared error (MSE). In order to build a tree that can forecast
the continuous target variable, it searches for the feature and split point that minimises the
MSE.
Example: Let us suppose we have a dataset of patients and we want to predict whether they
have a heart disease based on their age and cholesterol level. The dataset contains the
following information:
Age Cholesterol Disease
45 180 Yes
50 210 No
55 190 Yes
60 220 No
65 230 Yes
70 200 No
Using the CART algorithm, we can build a decision tree to make predictions. The decision
tree may look like this:
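Since the tree diagram itself is not reproduced here, the following hedged sketch shows how a comparable CART-style tree could be fitted to the small patient table above with scikit-learn; the depth limit and the new patient being classified are illustrative choices, not part of the original example:

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[45, 180], [50, 210], [55, 190], [60, 220], [65, 230], [70, 200]]  # [Age, Cholesterol]
y = ["Yes", "No", "Yes", "No", "Yes", "No"]                             # heart disease label

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["Age", "Cholesterol"]))  # text view of the learned splits
print(tree.predict([[58, 195]]))                                # classify a new, hypothetical patient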
4.3.1 CHAID
CHAID (Chi-squared Automatic Interaction Detection) is a decision tree technique that is particularly useful when working with data that involves categorical variables, which represent different groups or
categories. The CHAID algorithm aims to identify meaningful patterns by dividing the data
into groups based on various categories of variables. This is achieved through the application
of statistical tests, particularly the chi-square test. The chi-square test helps determine if there
is a significant relationship between the categories of a variable and the outcome of interest.
Based on these tests, CHAID divides the data into smaller groups and repeats this procedure for each of the smaller groups in order to find other categories that might be significantly related to the outcome. The
leaves on the tree indicate the expected outcomes, and each branch represents a distinct
category.
Calculate the Chi-Square statistic (χ²):
χ² = Σ (O − E)² / E      (1.1)
where:
O represents the observed frequencies in each category or cell of a contingency table.
E represents the expected frequencies under the assumption of independence between
variables.
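A small sketch of this calculation in Python is shown below; the 2 × 2 contingency table of observed counts is hypothetical and only illustrates how O and E enter the statistic:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed frequencies, e.g. satisfaction (columns) by age group (rows)
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)     # compare the p-value with the chosen significance level
print(expected)          # the E values implied by independence of the variables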
This flowchart shows how CHAID gradually divides the dataset into subsets according to the most important predictor variables, resulting in a hierarchical structure. It enables us to visualise, in a clear and orderly way, the relationships between the variables and their effects on the target variable (Customer Satisfaction).
Age Group is the first variable on the flowchart, and it has two branches: "Young" and
"Middle-aged." We further examine the Gender variable within the "Young" branch, resulting
in branches for "Male" and "Female." The Purchase Frequency variable is next examined for
each gender subgroup, yielding three branches: "Low," "Medium," and "High." We arrive at
the leaf nodes, which represent the customer satisfaction outcome and are either "Satisfied"
or "Not Satisfied."
4.3.2 Bonferroni Correction
The Bonferroni correction is a statistical method used to adjust the significance levels (p-
values) when conducting multiple hypothesis tests at the same time. It helps control the
overall chance of falsely claiming a significant result by making the criteria for significance
more strict.
To apply the Bonferroni correction, we divide the desired significance level (usually denoted
as α) by the number of tests being performed (denoted as m). This adjusted significance level,
denoted as α' or α_B, becomes the new threshold for determining statistical significance.
Mathematically, the Bonferroni correction can be represented as:
α' = α / m      (1.2)
For example, suppose we are conducting 10 hypothesis tests, and we want a significance
level of 0.05 (α = 0.05). By applying the Bonferroni correction, we divide α by 10, resulting
in an adjusted significance level of:
α' = 0.05 / 10 = 0.005      (1.3)
Now, when we assess the p-values obtained from each test, we compare them against the
adjusted significance level (α') instead of the original α. If a p-value is less than or equal to α',
we consider the result to be statistically significant.
Let's consider an example. Suppose we have conducted 10 independent hypothesis tests, and
we obtain p-values of 0.02, 0.07, 0.01, 0.03, 0.04, 0.09, 0.06, 0.08, 0.05, and 0.02. Using the
Bonferroni correction with α of 0.05 and m = 10, the adjusted significance level becomes α' =
0.05 / 10 = 0.005.
Comparing the p-values to the adjusted significance level (α' = 0.005), we find that every one of the ten p-values (the smallest being 0.01) is greater than 0.005. Therefore, none of the ten tests is statistically significant after the Bonferroni correction. Note that without the correction, several tests (those with p-values less than or equal to 0.05) would have been declared significant; the stricter threshold is exactly how the Bonferroni correction controls the overall chance of falsely claiming a significant result when many tests are performed at once.
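The following short sketch applies the correction to the ten p-values of the example and confirms that none of them clears the adjusted threshold of 0.005:

p_values = [0.02, 0.07, 0.01, 0.03, 0.04, 0.09, 0.06, 0.08, 0.05, 0.02]
alpha = 0.05
alpha_adjusted = alpha / len(p_values)      # Bonferroni: 0.05 / 10 = 0.005

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p <= alpha_adjusted else "not significant"
    print(f"Test {i}: p = {p:.2f} -> {verdict}")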
4.4.1 Gini Impurity
The Gini impurity index measures how mixed the class labels are within a node of a decision tree. It is calculated as:
Gini = 1 − Σᵢ pᵢ²      (1.4)
where pᵢ represents the probability (proportion) of class i among the data points in the node.
By using the Gini impurity index, decision tree algorithms can make decisions on how to split
the data by selecting the feature and threshold that minimize the impurity after the split. A
lower Gini impurity index indicates a more homogeneous distribution of class labels, which
helps in creating pure and informative branches in the decision tree.
Example:
Suppose we have a dataset with 50 samples and two classes, "A" and "B". The table below
shows the distribution of class labels for a particular node in a decision tree:
Sample Class
20 A
10 B
15 A
5 B
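Reading the table as a node that contains 35 samples of class A and 15 samples of class B in total (one interpretation of the counts above), the Gini impurity can be evaluated with a few lines of Python:

counts = {"A": 35, "B": 15}                 # class counts assumed for the node
n = sum(counts.values())

gini = 1 - sum((c / n) ** 2 for c in counts.values())
print(round(gini, 2))                       # 1 - (0.7**2 + 0.3**2) = 0.42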
4.4.2 Entropy
Entropy is a concept used in information theory and decision tree algorithms to measure the
level of uncertainty or disorder within a set of class labels. It helps us understand how mixed
or impure the class distribution is in a given node. The entropy value is calculated based on
the probabilities of each class label within the node.
To compute entropy, we start by determining the probability of each class label. This is done
by dividing the count of elements belonging to a particular class by the total number of
elements. Next, we apply the logarithm (typically base 2) to each probability, multiply it by
the probability itself, and sum up these values. Finally, we take the negative of the sum to
obtain the entropy value.
Mathematically, the formula for entropy is as follows:
Entropy = − Σᵢ pᵢ log₂(pᵢ)      (1.5)
where pᵢ is the proportion of data points in the node that belong to class i.
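Using the same assumed node as in the Gini example (35 samples of class A and 15 of class B), the entropy can be computed as follows:

import math

counts = {"A": 35, "B": 15}
n = sum(counts.values())

entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
print(round(entropy, 3))                    # -(0.7*log2(0.7) + 0.3*log2(0.3)) ≈ 0.881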
4.4.3 Cost-Based Splitting Criteria
The goal of cost-based splitting criteria is to minimize the overall cost or expenses related to
misclassification by selecting the best feature and split point at each node. Instead of solely
maximizing information gain or reducing impurity, the algorithm assesses the cost associated
with potential misclassifications. The specific cost-based measure used depends on the
problem domain and the assigned costs for different types of misclassifications. For instance,
in a medical diagnosis scenario, misclassifying a severe condition as a less severe one might
incur a higher cost compared to the opposite error.
Example:
Let's consider a dataset of 30 fruits, where each fruit has two features: color (red, green, or
orange) and diameter (small or large). The target variable is the type of fruit, which can be
"Apple" or "Orange". We also have costs associated with misclassifications: $10 for each
false positive (classifying an orange as an apple) and $5 for each false negative (classifying
an apple as an orange).
When using cost-based splitting criteria, the decision tree algorithm considers the features
(colour and diameter) to find the optimal split that minimizes the overall cost. For simplicity,
let's assume the first split is based on the colour feature. The algorithm assesses the costs
associated with misclassification for each colour category and chooses the colour that results
in the lowest expected cost. For instance, if the algorithm determines that splitting the data
based on colour between "Red/Green" and "Orange" fruits minimizes the expected cost, it
proceeds to evaluate the diameter feature for each branch. The algorithm continues this
recursive process of splitting the data until it constructs a complete decision tree.
The resulting decision tree may look like this:
The decision tree in the above picture displays splits depending on colour and diameter,
resulting in the labelling of fruits as "Apple" or "Orange" at the leaf nodes. Now we can use
the decision tree to estimate the type of a new fruit when its colour and diameter are
displayed. The model determines the fruit's expected class (apple or orange) by tracing the
path down the tree based on the provided attributes.
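Standard scikit-learn trees do not accept an explicit misclassification-cost matrix, but a similar effect can be approximated with class weights; the sketch below is only an approximation of the idea, with a hypothetical encoding of the fruit data and weights loosely reflecting the $10 and $5 costs in the example:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoding: colour (0 = red, 1 = green, 2 = orange), diameter (0 = small, 1 = large)
X = [[0, 0], [0, 1], [1, 0], [2, 1], [2, 0], [1, 1]]
y = ["Apple", "Apple", "Apple", "Orange", "Orange", "Orange"]

# Weighting "Orange" more heavily penalises classifying an orange as an apple
# (the costlier error in the example) more strongly during splitting.
tree = DecisionTreeClassifier(class_weight={"Apple": 5, "Orange": 10}, random_state=0)
tree.fit(X, y)
print(tree.predict([[2, 0]]))   # predict a hypothetical small, orange-coloured fruit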
In such an ensemble, each tree makes a forecast when a new email is received, and the final prediction is based on the consensus of all the decision trees.
4.5.2 Random Forest
In machine learning, Random Forest is a widely used ensemble learning technique for both classification and regression tasks. It builds many decision trees and aggregates the outcomes of the individual trees to produce a forecast. The Random Forest method constructs a collection of decision trees, each based on a distinct subset of the training data. Each subset is produced using a technique known as bootstrap sampling, which randomly selects data points with replacement. Further randomness is added by considering a random subset of features for splitting at each node of each decision tree.
First, we create a bunch of decision trees, each using a different set of data. We randomly
pick some of the data for each tree, which helps add variety to the predictions. Next, we train
each decision tree by dividing the data into smaller groups based on different features. We
want the trees to be different from each other, so we use random subsets of features to make
the divisions. For example, if we're trying to classify something, each tree votes for the class
it thinks is correct. The final prediction is based on the majority vote.
Step-by-step explanation of how the Random Forest algorithm works:
Random Sampling: The algorithm starts by randomly selecting subsets of the training data
from the original dataset. Each subset is constructed by randomly selecting data points with
replacement. These subsets are used to build individual decision trees.
Tree Construction: Recursive partitioning is a technique used to build a decision tree for
each subset of the training data. A random subset of features is taken into account for
splitting at each node of the tree. Each tree is unique thanks to this randomness, which also reduces the correlation between the trees.
Voting and Aggregation: Each tree in the Random Forest identifies the target variable
separately (for classification tasks) or predicts its value independently (for regression tasks)
while making predictions. For classification, the final prediction is chosen by a majority vote;
for regression, the predictions are averaged. The overall forecast accuracy is enhanced by the
voting and aggregation procedure.
Random Forest has several key features and advantages:
Robustness against overfitting: The Random Forest is more resilient to noise or outliers in
the data thanks to the integration of many decision trees, which also helps to avoid
overfitting.
Feature importance estimation: The features that have the greatest impact on the
predictions are identified by Random Forest using a measure of feature importance. With this
knowledge, features may be chosen and underlying relationships in the data can be
understood.
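A compact sketch of these ideas with scikit-learn is given below; the dataset is synthetic and generated only for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=4, n_informative=2, random_state=0)

# 100 trees, each grown on a bootstrap sample with random feature subsets at each split
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:3]))         # majority vote across the individual trees
print(forest.feature_importances_)   # estimated importance of each feature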
4.6 CLUSTERING
The above diagrams (Fig. 4.5) show that the different fruits are divided into different clusters
or groups with similar properties.
A) Partitioning Clustering:
Partitioning clustering is a clustering approach that seeks to divide a dataset into separate, non-overlapping clusters. In this type of clustering, the dataset is split into a predetermined number of groups, denoted as K. The cluster centres are positioned in a manner that minimizes the distance between the data points within a cluster and their own cluster centroid. Figure 4.6 illustrates the resulting partition of clusters.
The most well-known partitioning algorithm is K-means clustering. Here are the advantages
and disadvantages of partitioning clustering:
Advantages of partitioning clustering:
Scalability
Ease of implementation
Interpretability
Applicability to various data types
Disadvantages of partitioning clustering:
Sensitivity to the initial centroid selection
Dependence on a pre-specified number of clusters
Limited ability to handle complex cluster shapes
K-means Clustering
This is one of the most popular and widely used clustering algorithms. It aims to partition the data into a predetermined number of clusters (K) by minimizing the within-cluster sum of squares, i.e., the total squared distance between data points and the centroid of their assigned cluster. K-means is an iterative algorithm that assigns data points to clusters and updates the cluster centres until convergence. Here is a detailed description of the K-means clustering algorithm:
Initialization:
Specify the number of clusters K that you want to identify in the dataset. Initialize K
cluster centres randomly or using a predefined strategy, such as randomly selecting K
data points as initial centres.
Assignment Step:
For each data point in the dataset, calculate the distance (e.g., Euclidean distance) to
each of the K cluster centres.
Assign the data point to the cluster with the nearest cluster centre, forming K initial
clusters.
Update Step:
Calculate the new cluster centres by computing the mean (centroid) of all data points
assigned to each cluster. The cluster centre is the average of the feature values of all
data points in that cluster.
Iteration:
Repeat the assignment and update steps until convergence or until a predefined
stopping criterion is met. In each iteration, reassign data points to clusters based on
the updated cluster centres and recalculate the cluster centres.
Convergence:
The algorithm converges when the cluster assignments no longer change significantly
between iterations or when the maximum number of iterations is reached.
Final Result:
Once the algorithm converges, the final result is a set of K clusters, where each data
point is assigned to one of the clusters based on its nearest cluster centre.
It is worth noting that K-means clustering is sensitive to the initial placement of cluster
centres. Different initializations can lead to different clustering results. To mitigate this, it is
common to run the algorithm multiple times with different random initializations and choose
the solution with the lowest within-cluster sum of squares as the final result. K-means
clustering has several advantages, including its simplicity, scalability to large datasets, and
efficiency.
Example:
Cluster the following 4 points in two-dimensional space using K = 2:
X1 X2
A 2 3
B 6 1
C 1 2
D 3 0
Solution:
Select two initial centroids AB and CD, where AB is the average of points A and B, and CD is the average of points C and D:
X1 X2
AB 4 2
CD 2 1
Now calculate the squared Euclidean distance between each point and the two centroids and assign each point to the closest centroid:
              A      B      C      D
AB (4, 2)     5      5      9      5
CD (2, 1)     4     16      2      2
We can observe from the table that the distance between A and CD is 4, which is smaller than the distance between A and AB, which is 5, so A is moved to the CD cluster. Similarly, C and D are closest to CD, while B remains closest to AB. Two clusters are therefore formed: ACD and B. We now recompute the centroids of the clusters ACD and B.
We repeat the process by calculating the distance between each point and the updated
centroids and reassigning the points to the closest centroid. We continue this iteration until
the centroids no longer change significantly.
After a few iterations, the algorithm converges, and the final cluster assignments are:
Cluster 1: B
Cluster 2: ACD
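The same example can be checked with scikit-learn's KMeans implementation; starting it from the centroids AB = (4, 2) and CD = (2, 1) used above reproduces the final clusters {A, C, D} and {B}:

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[2, 3],    # A
                   [6, 1],    # B
                   [1, 2],    # C
                   [3, 0]])   # D

initial_centroids = np.array([[4.0, 2.0],   # AB
                              [2.0, 1.0]])  # CD

kmeans = KMeans(n_clusters=2, init=initial_centroids, n_init=1)
kmeans.fit(points)
print(kmeans.labels_)            # A, C and D share one label; B gets the other
print(kmeans.cluster_centers_)   # final centroids after convergence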
B) Density-Based Clustering:
Density-based algorithms, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), identify clusters as regions of high data-point density. Data points that are close to each other are grouped together, and regions of low density separate the clusters. Density-based clustering does not assume any specific shape for the clusters: it can detect clusters of arbitrary shapes, including non-linear and irregular ones. Such clustering techniques also handle noise and outlier points appropriately.
C) Hierarchical Clustering:
Hierarchical clustering can be used instead of partitioning clustering because there is no need to specify the number of clusters to be produced in advance. Hierarchical clustering constructs a
hierarchical structure of clusters, represented by a dendrogram, which resembles a tree-like
formation. This clustering method can be categorized into two primary approaches:
agglomerative, where individual data points begin as separate clusters and are progressively
merged, and divisive, where all data points commence in a single cluster and are recursively
divided.
The agglomerative approach typically proceeds through the following steps:
Start with individual clusters: Each data point initially forms its own cluster.
Compute dissimilarities: Calculate the distance or dissimilarity between every pair of clusters using a chosen measure.
Merge the closest clusters: Identify the pair of clusters with the smallest dissimilarity
or highest similarity measure and merge them into a single cluster. The dissimilarity
or similarity between the new merged cluster and the remaining clusters is updated.
Repeat the merging process: Repeat steps 2 and 3 until all the data points are part of a
single cluster or until a predefined stopping criterion is met.
Hierarchical representation: The merging process forms a hierarchy of clusters, often
represented as a dendrogram. The dendrogram illustrates the sequence of merging and
allows for different levels of granularity in cluster interpretation.
The advantages of agglomerative hierarchical algorithms are the hierarchical structure they produce and the fact that, unlike partitioning algorithms, there is no need to specify the number of clusters in advance. The drawbacks of this approach are its high computational complexity and lack of stability.
4.6.3 Distance and Dissimilarity Measures in Clustering:
In clustering, distance and dissimilarity measures play a crucial role in determining the
similarity or dissimilarity between data points. These measures quantify the proximity
between objects and are used by clustering algorithms to assign data points to clusters or
determine the cluster centres. Here are some commonly used distance and dissimilarity
measures in clustering [8].
1. Euclidean Distance: This is one of the most widely used distance measures in
clustering. It calculates the straight-line distance between two data points in a
Euclidean space. For two points, P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn), the
Euclidean distance is given by:
d(P, Q) = √[(p1 − q1)² + (p2 − q2)² + … + (pn − qn)²]      (1.6)
2. Manhattan Distance: Also known as the City Block distance or L1 norm, it calculates
the sum of absolute differences between the coordinates of two points. For two points,
P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn), the Manhattan distance is given by:
d(P, Q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|      (1.7)
3. Cosine Similarity: Cosine similarity measures the cosine of the angle between two
vectors, indicating the similarity in their directions. It is commonly used in text
mining or when dealing with high-dimensional data. For two vectors, P = (p1, p2, ...,
pn) and Q = (q1, q2, ..., qn), the cosine similarity is given by:
cos(P, Q) = (p1q1 + p2q2 + … + pnqn) / (√(p1² + p2² + … + pn²) × √(q1² + q2² + … + qn²))      (1.8)
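The sketch below evaluates the three measures for two small illustrative vectors:

import numpy as np

P = np.array([1.0, 2.0, 3.0])
Q = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((P - Q) ** 2))     # straight-line distance
manhattan = np.sum(np.abs(P - Q))             # city-block distance
cosine_similarity = np.dot(P, Q) / (np.linalg.norm(P) * np.linalg.norm(Q))

print(euclidean, manhattan, cosine_similarity)   # cosine is 1.0 because Q = 2 * P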
Clustering quality and determining the optimal number of clusters are important considerations in clustering analysis. Let's explore each of these aspects:
The quality of clustering refers to how well the clustering algorithm captures the inherent
structure and patterns in the data. Several factors contribute to the assessment of clustering
quality:
Compactness: Compactness refers to how close the data points are within each
cluster. A good clustering result should have data points tightly clustered together
within their assigned clusters.
Separability: Separability refers to the distance between different clusters. A high-
quality clustering result should exhibit distinct separation between clusters, indicating
that the clusters are well-separated from each other.
4.7 SUMMARY
4.8 GLOSSARY
Terms              Definition
Classification     The classification algorithm is a supervised learning technique that is used to categorize new observations on the basis of the training data.
4.10 SELF-ASSESSMENT QUESTIONS
1. What are the different decision tree algorithms used in machine learning?
2. What is entropy?
3. Which metric, entropy or Gini impurity, is better for node selection in a decision tree?
4. Write some advantages and disadvantages of decision trees.
5. How are decision trees used for classification and regression tasks?
4.11 REFERENCES
Han, J., Kamber, M., & Pei, J. (2011). Data mining: concepts and techniques. Morgan
Kaufmann.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning:
data mining, inference, and prediction. Springer Science & Business Media.
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT Press.
Mann, A. K., & Kaur, N. (2013). Review paper on clustering techniques. Global Journal
of Computer Science and Technology.
Rai, P., & Singh, S. (2010). A survey of clustering techniques. International Journal of
Computer Applications, 7(12), 1-5.
Cheng, Y. M., & Leu, S. S. (2009). Constraint-based clustering and its applications in
construction management. Expert Systems with Applications, 36(3), 5761-5767.
Bijuraj, L. V. (2013). Clustering and its Applications. In Proceedings of National
Conference on New Horizons in IT-NCNHIT (Vol. 169, p. 172).
Kameshwaran, K., & Malarvizhi, K. (2014). Survey on clustering techniques in data
mining. International Journal of Computer Science and Information Technologies, 5(2),
2272-2276.