
Department of Distance and

Continuing Education
University of Delhi

Bachelor of Management Studies


Course Credit - 4
Semester-III
Discipline Specific Core (DSC-7)

As per the UGCF - 2022 and National Education Policy 2020


DSC-7: Introduction to Business Analytics

Editors
Dr. Rishi Rajan Sahay
Assistant Professor, Shaheed Sukhdev College of Business
Studies, University of Delhi
Dr. Sanjay Kumar
Assistant Professor, Delhi Technological University

Content Writers
Dr. Abhishek Kumar Singh, Dr. Satish Goel,
Mr. Anurag Goel, Dr. Sanjay Kumar

Academic Coordinator
Mr. Deekshant Awasthi

© Department of Distance and Continuing Education


ISBN: 978-81-19169-84-9
1st edition: 2023
E-mail: ddceprinting@col.du.ac.in
management@col.du.ac.in

Published by:
Department of Distance and Continuing Education under
the aegis of Campus of Open Learning/School of Open Learning,
University of Delhi, Delhi-110007

Printed by:
School of Open Learning, University of Delhi

Attention

Corrections/Modifications/Suggestions proposed by Statutory Body,


DU/ Stakeholder/s in the Self Learning Material (SLM) will be
incorporated in the next edition. However, these
corrections/modifications/ suggestions will be uploaded on the website
https://sol.du.ac.in. Any feedback or suggestions can be sent to the
email- feedbackslm@col.du.ac.in

DSC 7: Introduction to Business Analytics

INDEX

Lesson 1: Introduction to Business Analytics and Descriptive Analytics


1.1 Learning objectives
1.2 Introduction
1.3 Introduction to Business Analytics
1.4 Role of Analytics for Data-Driven Decision making
1.5 Types of Business Analytics
1.6 Introduction to the Concepts of Big Data Analytics
1.7 Overview of Machine Learning Algorithms
1.8 Introduction to relevant statistical software packages
1.9 Summary

Lesson 2: Predictive Analytics


2.1 Learning objectives
2.2 Introduction
2.3 Classical Linear Regression Model
2.4 Multiple Linear Regression Models
2.5 Practical Exercise using R/Python Programming:
2.6 Summary

Lesson 3: Logistic and Multinomial Regression


3.1 Learning Objectives
3.2 Introduction
3.3 Logistic Function
3.4 Omnibus Test
3.5 Wald Test
3.6 Hosmer-Lemeshow Test
3.7 Pseudo R Square
3.8 Classification Table
3.9 Gini Coefficient
3.10 ROC

3.11 AUC
3.12 Summary

Lesson 4: Decision Tree and Clustering


4.1 Learning Objectives
4.2 Introduction
4.3 Classification and Regression Tree
4.4 CHAID
4.5 Impurity Measures
4.6 Ensemble Methods
4.7 Clustering
4.8 Summary


LESSON 1
INTRODUCTION TO BUSINESS ANALYTICS AND
DESCRIPTIVE ANALYTICS
Dr. Abhishek Kumar Singh
Assistant Professor
University of Delhi-19
abhishekbhu008@gmail.com

STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 Introduction to Business Analytics
1.4 Role of Analytics for Data-Driven Decision Making
1.5 Types of Business Analytics
1.6 Introduction to the concepts of Big Data Analytics
1.7 Overview of Machine Learning Algorithms
1.8 Introduction to relevant statistical software packages
1.9 Summary
1.10 Glossary
1.11 Answers to In-Text Questions
1.12 Self-Assessment Questions
1.13 References
1.14 Suggested Reading

1.1 LEARNING OBJECTIVES

After studying the lesson, you will be able to:


 Define Business Analytics
 State the Role of Analytics for Data-Driven Decision Making
 Mention the types of Business Analytics
 Classify the concepts of Big Data Analytics
 Describe Machine Learning Algorithms
 Identify relevant statistical software packages.

1.2 INTRODUCTION

Business analytics (BA) consists of using data to gain valuable insights and make informed
decisions in a business setting. It involves analysing and interpreting data to uncover patterns,
trends, and correlations that can help organizations improve their operations, better
understand their customers, and make strategic decisions. Business analytics places a strong emphasis on statistical analysis, but it also covers related areas such as data mining, predictive modelling, data visualization, machine learning, and data-driven decision making.
Companies committed to making data-driven decisions employ business analytics. The study of data through statistical and operational analysis, the creation of predictive models, the use of optimisation techniques, and the communication of these results to clients, business partners, and company executives are all considered components of business analytics. It relies on quantitative methodologies, and the data needed to build specific business models and reach profitable conclusions must be supported by evidence. As a result, business analytics relies heavily on Big Data. Business analytics is the process of analysing data, in light of past outcomes and problems, in order to create an effective plan for the future.
Big Data, that is, very large volumes of data, is used to generate these answers. The economy, and the sectors that prosper within it, increasingly depend on this data-driven way of conducting and sustaining a business. Over the past ten or so years, the word analytics has gained popularity, and the growth of the internet and information technology has made analytics more important than ever. In this lesson we will learn about business analytics, an area that integrates data, information technology, statistical analysis, and quantitative techniques with computer-based models. Together, these elements present decision-makers with the range of possibilities that can arise, allowing them to make well-informed choices. Computer-based models also let decision-makers examine how their choices would play out under various scenarios.


Fig. 1

1.3 BUSINESS ANALYTICS

1.3.1 Meaning: Business analytics (BA) utilizes data analysis, statistical models, and various
quantitative techniques as a comprehensive discipline and technological approach. It involves
a systematic and iterative examination of organizational data, with a specific emphasis on
statistical analysis, to facilitate informed decision-making.
Business analytics primarily entails a combination of the following: discovering novel
patterns and relationships using data mining; developing business models using quantitative
and statistical analysis; conducting A/B and multi-variable testing based on findings;
forecasting future business needs, performance, and industry trends using predictive
modelling; and reporting your findings to co-workers, management, and clients in simple-to-
understand reports.
1.3.2 Definition
Business analytics (BA) involves utilizing knowledge, tools, and procedures to analyse past
business performance in order to gain insight and inform present and future business strategy.
Business analytics is the process of transforming data into insights to improve business
choices. It is based on data and statistical approaches to provide new insights and
understanding of business performance. Some of the methods used to extract insights from


data include data management, data visualisation, predictive modelling, data mining,
forecasting simulation, and optimisation.
1.3.3 Business analytics evolution
Business analytics has been around for a very long time and has developed as better technology has become available. It has its roots in operations research, which was widely applied during World War II.
Operations research was initially designed as a methodical strategy to analyse data in military
operations. Over time, this strategy began to be applied in the business domain as well, and the study of operations gradually evolved into management science. The fundamental elements of management science, such as decision-making models, remained the same as those of operations research.
Ever since Frederick Winslow Taylor implemented management exercises in the late 19th
century, analytics have been employed in business. Henry Ford's freshly constructed
assembly line involved timing of each component.
However, when computers were deployed in decision support systems in the late 1960s,
analytics started to garner greater attention. Since then, enterprise resource planning (ERP)
systems, data warehouses, and a huge range of other software tools and procedures have all
modified and shaped analytics.
With the advent of computers, business analytics has grown rapidly in recent years. This development has elevated analytics to entirely new heights and opened up a world of opportunity. Given how far the discipline has come, many people would never guess that analytics began in the early 1900s with Ford himself.
Business intelligence, decision support systems, and PC software all developed from
management science.

1.4 ROLE OF ANALYTICS FOR DATA-DRIVEN DECISION


MAKING
1.4.1 Applications and uses for business analytics are numerous. It can be applied to
descriptive analysis, which makes use of facts to comprehend the past and present. This form
of descriptive analysis is employed to evaluate the company's present position in the market
and the success of earlier business decisions.
Predictive analysis, which uses past business performance to estimate the likelihood of future outcomes, is used alongside it. Prescriptive analysis, which is used to develop optimisation strategies for better


corporate performance, is another application of business analytics. Business analytics, for


instance, is used to base price decisions for various products in a department store on
historical and current data.
1.4.2 Workings of business analytics: Several fundamental procedures are first carried out
by BA before any data analysis is done:
 Identify the analysis's corporate objective.
 Choose an analytical strategy.
 Gather business data often from multiple systems and sources to support the study.
 Cleanse and incorporate all data into one location, such as a data warehouse or data mart.
1.4.3 Need/Importance of Business Analytics
Business analytics serves as an approach to help in making informed business decisions.
As a result, it affects how the entire organisation functions. Business analytics can therefore
help a company become more profitable, grow its market share and revenue, and give
shareholders a higher return. It entails improved primary and secondary data interpretation,
which again affects the operational effectiveness of several departments. Moreover, it
provides a competitive advantage to the organization. The flow of information is nearly equal
among all actors in this digital age. The competitiveness of the company is determined by
how this information is used. Corporate analytics improves corporate decisions by combining
readily available data with numerous carefully considered models.
1.4.4 Transforms data into insightful knowledge.
Business analytics serves as a resource for a firm to make informed decisions. These
choices will probably have an effect on your entire business because they will help you
expand market share, boost profitability, and give potential investors a higher return.
While some businesses struggle with how to use massive volumes of data, business analytics
aims to combine this data with useful insights to enhance the decisions your organisation
makes.
In essence, business analytics is significant across all industries for the following four
reasons:
 Enhances performance by providing your company with a clear picture of what is and
what isn't working


 Facilitates quicker and more precise decision-making


 Reduces risks by assisting a company in making wise decisions on consumer
behaviour, trends, and performance.
 By providing information on the consumer, it encourages innovation and change.

IN-TEXT QUESTIONS 1.1
1. Define Business Analytics.
2. What do you understand by the evolution of business analytics?
3. State two reasons why Business Analytics is important.

1.5 TYPES OF BUSINESS ANALYTICS

Business analytics can be divided into four primary categories, each of which is progressively more complex and brings us one step closer to applying insight to present and future scenarios. Below is a description of each of these business analytics categories.
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
1. Descriptive analytics: In order to understand what has occurred in the past or is
happening right now, it summarises the data that an organisation currently has. The
simplest type of analytics is descriptive analytics, which uses data aggregation and
mining techniques. It increases the availability of data to an organization's
stakeholders, including shareholders, marketing executives, and sales managers. It can
aid in discovering strengths and weaknesses and give information about customer
behaviour. This aids in the development of targeted marketing strategies (a short data-aggregation sketch in Python follows this list).
2. Diagnostic Analytics: This kind of analytics aids in refocusing attention from past
performance to present occurrences and identifies the variables impacting trends.
Drill-down, data mining, and other techniques are used to find the underlying cause of
occurrences. Probabilities and likelihoods are used in diagnostic analytics to


comprehend the potential causes of events. For classification and regression, methods
like sensitivity analysis and training algorithms are used.
3. Predictive Analytics: With the aid of statistical models and ML approaches, this type
of analytics is used to predict the likelihood of a future event. It builds on the output of descriptive analytics to create models that estimate the likelihood of particular outcomes. Predictive analyses are typically carried out by machine learning specialists and can be more accurate than estimates based on business intelligence alone. Sentiment analysis is
among its most popular uses. Here, social media data already in existence is used to
construct a complete picture of a user's viewpoint. To forecast their attitude (positive,
neutral, or negative), this data is evaluated.
4. Prescriptive Analytics: It offers suggestions for the next best course of action, going
beyond predictive analytics. It makes all beneficial predictions in accordance with a
particular course of action and also provides the precise steps required to produce the
most desirable outcome. It primarily depends on a robust feedback system and
ongoing iterative analysis. It gains knowledge of the connection between acts and
their results. The development of recommendation systems is a typical use of this kind
of analytics.
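To make the first category concrete, here is a minimal sketch in Python, assuming the pandas library and a small made-up table of sales transactions (the column names and figures are purely illustrative), of the kind of aggregation on which descriptive analytics relies:

import pandas as pd

# Hypothetical transaction data (illustrative values only)
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "revenue": [1200, 850, 990, 1400, 760],
})

# Descriptive analytics: summarise what has already happened
summary = sales.groupby("region")["revenue"].agg(["count", "sum", "mean"])
print(summary)                       # revenue count, total and average per region
print(sales["revenue"].describe())   # overall descriptive statistics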

Fig. No. 2

1.6 INTRODUCTION TO THE CONCEPTS OF BIG DATA


ANALYTICS
Big Data analytics deals with enormous volumes of information that cannot be processed or stored using conventional data processing or storage methods. The data typically comes in three distinct forms.
 Structured data, as the name implies, has a clear structure and follows a regular
sequence. A person or machine may readily access and utilise this type of information
since it has been intended to be user-friendly. Structured data is typically kept in
databases, especially relational database management systems, or RDBMS, and tables
with clearly defined rows and columns, such as spreadsheets.
 While semi-structured data displays some of the same characteristics as structured
data, for the most part it lacks a clear structure and cannot adhere to the formal
specifications of data models like an RDBMS.
 Unstructured data does not adhere to the formal structural norms of traditional data
models and lacks a consistent structure across all of its different forms. In a very small
number of cases, it might contain information on the date and time.
1.6.1 Large-scale Data Management Traits
According to traditional definitions of the term, big data is typically linked to three essential
traits:
 Volume: The massive amounts of information produced every second by social
media, mobile devices, automobiles, transactions, connected sensors, photos, video,
and text are referred to by this characteristic. Only big data technologies can handle
enormous volumes, which come in petabyte, terabyte, or even zettabyte sizes.
 Diversity (often called variety): Information in the form of images, audio streams, video, and many other formats now adds a diversity of data types, around 80% of which are completely unstructured, to the existing landscape of transactional and demographic data such as phone numbers and addresses.
 Velocity: This attribute relates to the velocity of data accumulation and refers to the
phenomenal rate at which information is flooding into data repositories. It also
describes how quickly massive data can be analysed and processed to draw out the
insights and patterns it contains. Now, that speed is frequently real-time. Current


definitions of big data management also include two additional characteristics beyond "the three Vs", namely:
 Veracity: The level of dependability and truth that big data can provide in terms of its
applicability, rigour, and correctness.
 Value: This feature examines whether information and analytics will eventually be
beneficial or detrimental as the main goal of big data collection and analysis is to
uncover insights that can guide decision-making and other activities.

Fig: 3
1.6.2 Services for Big Data Management
Organisations can pick from a wide range of big data management options when it comes to
technology. Big data management solutions can be standalone or multi-featured, and many
businesses employ several of them. The following are some of the most popular kinds of big
data management capabilities:
 Data cleansing: finding and resolving problems in data sets.
 Data integration: merging data from several sources.
 Data preparation: getting data ready for use in analytics or other applications.
 Data enrichment: enhancing data by adding new data sets, fixing minor errors, or extrapolating new information from raw data.
 Data migration: moving data from one environment to another, such as from internal data centres to the cloud.

 Data analytics: analysing data using a variety of techniques in order to gain insights (a short pandas sketch below illustrates a few of these capabilities).
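The sketch below, assuming the pandas library and two small made-up tables (customers and orders), illustrates a few of these capabilities, namely cleansing, integration, and a simple form of enrichment; it is not tied to any particular big data platform:

import pandas as pd

# Hypothetical customer and order tables (illustrative values only)
customers = pd.DataFrame({"cust_id": [1, 2, 3], "city": ["Delhi", "Mumbai", None]})
orders = pd.DataFrame({"cust_id": [1, 1, 2, 4], "amount": [250.0, 300.0, None, 120.0]})

# Data cleansing: resolve missing values
orders["amount"] = orders["amount"].fillna(orders["amount"].median())
customers["city"] = customers["city"].fillna("Unknown")

# Data integration: merge data from two sources on a common key
merged = orders.merge(customers, on="cust_id", how="left")

# Data enrichment: derive a new field from the raw data
merged["high_value"] = merged["amount"] > 200

print(merged)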

1.7 OVERVIEW OF MACHINE LEARNING ALGORITHMS

1.7.1 Machine Learning:


Machine Learning (ML) is the study of computer algorithms that can get better on their own
over time and with the help of data. It is thought to be a component of artificial intelligence.
Without being expressly taught to do so, machine learning algorithms create a model using
sample data, also referred to as training data, in order to make predictions or judgements. In a
wide range of fields where it is challenging or impractical to design traditional algorithms,
such as medicine, email filtering, speech recognition, and computer vision, machine learning
algorithms are applied. Computational statistics, which focuses on making predictions with
computers, is closely related to a subset of machine learning, but not all machine learning is
statistical learning. The field of machine learning benefits from the tools, theory, and
application fields that come from the study of mathematical optimisation. Data mining is a
related area of study that focuses on unsupervised learning for exploratory data analysis.
Some machine learning applications employ data and neural networks in a way that closely resembles how a biological brain functions. When applied to solving business problems, machine learning is also known as predictive analytics.
How does machine learning operate? The three steps below outline the process; a short code sketch after the list ties them together.
1. The Making of a Decision: Typically, machine learning algorithms are employed to
produce a forecast or classify something. Your algorithm will generate an estimate
about a pattern in the supplied data, which may be tagged or unlabelled.
2. An Error Function: A model's prediction can be assessed using an error function. In
order to evaluate the model's correctness when there are known examples, an error
function can compare the results.
3. A Process for Model Optimisation: So that the model can more closely match the data points in the training set, weights are adjusted to reduce the difference between the known example and the model prediction. Until an acceptable accuracy level is reached, the


algorithm will iteratively evaluate and optimise, updating weights on its own each
time.
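The toy sketch below, written in Python with made-up numbers, maps onto these three steps for a single-feature linear model: a prediction is made, an error function (mean squared error) scores it, and the weights are then adjusted repeatedly (here by simple gradient descent, one of several possible optimisation procedures):

import numpy as np

# Made-up training data: y is roughly 2*x + 1 plus noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

w, b, lr = 0.0, 0.0, 0.01            # initial weights and learning rate

for step in range(2000):
    y_hat = w * x + b                # 1. making a decision: predict
    error = y_hat - y
    mse = np.mean(error ** 2)        # 2. error function: assess the prediction
    # 3. model optimisation: adjust weights to reduce the difference
    w -= lr * np.mean(2 * error * x)
    b -= lr * np.mean(2 * error)

print(f"learned weight = {w:.2f}, intercept = {b:.2f}, final MSE = {mse:.3f}")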
1.7.2 Machine learning methods: Machine learning approaches fall into three primary categories:
1. Supervised machine learning: Supervised learning, commonly referred to as supervised machine learning, is the use of labelled datasets to train algorithms that can reliably classify data or predict outcomes. The model receives input data and adjusts its weights until it is properly fitted; this happens as part of a cross-validation process that guards against overfitting and underfitting. Supervised learning assists organisations in finding scalable solutions to a range of real-world issues, such as filtering spam into a separate folder from your email (see the short code sketch after this list). Neural networks, naive Bayes, linear regression, logistic regression, random forest, support vector machines (SVM), and other techniques are used in supervised learning.
2. Unsupervised Machine learning: Unsupervised learning, commonly referred to as
unsupervised machine learning, analyses and groups un-labelled datasets using
machine learning algorithms. These algorithms identify hidden patterns or data
clusters without the assistance of a human. It is the appropriate solution for
exploratory data analysis, cross-selling tactics, consumer segmentation, and picture
and pattern recognition because of its capacity to find similarities and differences in
information. Through the process of dimensionality reduction, it is also used to lower
the number of features in a model; principal component analysis (PCA) and singular
value decomposition (SVD) are two popular methods for this. The use of neural
networks, k-means clustering, probabilistic clustering techniques, and other
algorithms is also common in unsupervised learning.
3. Semi-supervised learning: A satisfying middle ground between supervised and
unsupervised learning is provided by semi-supervised learning. It employs a smaller,
labelled data set during training to direct feature extraction and classification from a
larger, unlabelled data set. If you don't have enough labelled data—or can't pay to
label enough data—to train a supervised learning system, semi-supervised learning
can help.
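A minimal contrast between the first two categories, assuming the scikit-learn package and a tiny made-up data set: the supervised classifier is given labels to learn from, while the unsupervised algorithm has to discover the groups on its own.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [8, 8], [9, 8], [2, 1], [8, 9]])

# Supervised learning: labels are provided for training
y = np.array([0, 0, 1, 1, 0, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2, 2], [9, 9]]))           # predicted class labels

# Unsupervised learning: no labels, the algorithm groups the data itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                               # cluster assignment for each point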
1.7.3 Reinforcement learning: Reinforcement learning is a behavioural machine learning model that is similar to supervised learning, except that the algorithm is not trained on sample data. Instead, the model learns as it goes through trial and error. The

optimal suggestion or strategy will be created for a specific problem by reinforcing a string of
successful outcomes.
A subset of artificial intelligence called "machine learning" employs computer algorithms to
enable autonomous learning from data and knowledge. In machine learning, computers can
change and enhance their algorithms without needing to be explicitly programmed.
Computers can now interact with people, drive themselves, write and publish sport match
reports, and even identify terrorism suspects thanks to machine learning algorithms.

IN-TEXT QUESTIONS 1.2

1. What are the types of Business Analytics?


2. What is Big Data Analytics?
3. Name the three essential traits of big data.

1.8 INTRODUCTION TO RELEVANT STATISTICAL SOFTWARE


PACKAGES
A statistical package is essentially a group of software programmes that share a common user
interface and were created to make it easier to do statistical analysis and related duties like
data management.
What is Statistical Software?
Statistical software is software for performing complex statistical analysis. Such packages serve as tools for organising, interpreting, and presenting particular data sets in order to provide scientific insight into patterns and trends. To support data science work, statistical software applies statistical theorems and procedures such as regression analysis and time series analysis.
Benefits of Statistical Software:
 Increases productivity and accuracy in data management and analysis.
 Requires less time.
 Allows simple personalisation.
 Provides access to sizable databases, which reduces sampling error and enables data-driven decision making.


Relevant statistical software packages:


1. SPSS (Statistical Package for Social Sciences)
 The most popular and effective programme for analysing complex statistical data
is called SPSS.
 To make the results easy to discuss, it quickly generates descriptive statistics,
parametric and non-parametric analysis, and delivers graphs and presentation-
ready reports.
 Here, estimation and the identification of missing values in the data sets lead to more accurate reports.
 For the analysis of quantitative data, SPSS is utilised.
2. Stata
 Stata is another commonly used programme that makes it possible to manage,
save, produce, and visualise data graphically. It does not require any coding
expertise to use.
 Its use is more intuitive because it has both a command line and a graphical user
interface.
3. R
 Free statistical software known as "R" offers graphical and statistical tools,
including linear and non-linear modelling.
 Effective plugins (packages) are available for a wide range of applications. Coding expertise is necessary here.
 It offers interactive reports and apps, makes extensive use of data, and complies
with security guidelines.
 R is used to analyse quantitative data.
4. Python
 Another freely available software
 Extensive libraries and frameworks


 A popular choice for machine learning tasks.


 Simplicity and Readability
5. SAS (Statistical Analysis Software)
 It is a cloud-based platform that offers ready-to-use applications for manipulating
data, storing information, and retrieving it.
 Its processes employ several threads, executing several tasks at once.
 Business analysts, statisticians, data scientists, researchers, and engineers utilise it
largely for statistical modelling, spotting trends and patterns in data, and assisting
in decision-making.
 For someone unfamiliar with this method, coding can be challenging.
 It is utilised for the analysis of numerical data.
6. MATLAB (MATrix LABoratory)
 The initials MATLAB stand for Matrix Laboratory.
 Software called MATLAB offers both an analytical platform and a programming
language.
 It expresses matrix and array mathematics, function and data charting, algorithm
implementation, and user interface development.
 A script that combines code, output, and formatted text into an executable
notebook is produced by Live Editor, which is also provided.
 Engineers and scientists utilise it a lot.
 For the analysis of quantitative data, MATLAB is employed.
7. Epi-data
 Epi-data is a widely used, free data programme created to help epidemiologists,
public health researchers, and others enter, organise, and analyse data while
working on the ground.
 It manages all of the data and produces graphs and elementary statistical analysis.
 Here, users can design their own databases and forms.
 Epi-data is a tool for analysing quantitative data.
8. Epi-info
 It is a public domain software suite created by the Centers for Disease Control and Prevention (CDC) for researchers and public health professionals worldwide.

 For those who might not have a background in information technology, it offers
simple data entry forms, database development, and data analytics including
epidemiology statistics, maps, and graphs.
 Investigations into disease outbreaks, the creation of small to medium-sized
disease monitoring systems, and the analysis, visualisation, and reporting (AVR)
elements of bigger systems all make use of it.
 It is utilised for the analysis of numerical data.
9. NVivo
 It is a piece of software that enables the organisation and archiving of qualitative
data for analysis.
 The analysis of unstructured text, audio, video, and image data, such as that from
interviews, focus groups (FGD), surveys, social media, and journal articles, is
done using NVivo.
 You can import Word documents, PDFs, audio, video, and photos here.
 It helps users organise, analyse, and discover insights from unstructured or qualitative data more effectively.
 The user-friendly layout makes it instantly familiar and intuitive for the user. It
contains a free version as well as automated transcribing and auto coding.
 Research using mixed methods and qualitative data is conducted using NVivo.
10. Minitab
 Minitab provides both fundamental and moderately sophisticated statistical analysis capabilities.
 It can analyse a variety of data sets, automate statistical calculations, and produce attractive visualisations.
 Minitab lets users concentrate on data analysis, examining both current and historical data to spot trends, patterns, and hidden relationships between variables.
 It makes the insights in the data easier to understand.
 Minitab is used for the analysis of quantitative data.
11. Dedoose
 Dedoose, a tool for qualitative and quantitative data analysis, is entirely web-
based.

 This low-cost programme is user-friendly and team-oriented, and it makes it


simple to import both text and visual data.
 It has access to cutting-edge data security equipment.
12. ATLAS.ti
 It is a pioneer in qualitative analysis software and has incorporated AI as it has
developed.
 It is best suited to research organisations, businesses, and academic institutions, since the cost makes it less practical for individual studies.
 With sentiment analysis and auto coding, it is more potent.
 It gives users the option to use any language or character set.
13. MAXQDA 12
 It is expert software for analysing data using quantitative, qualitative, and mixed
methods.
 It imports the data, reviews it in a single spot, and categorises any unstructured
data with ease.
 With this software, a literature review may also be created.
 It costs money and is not always easy to collaborate with others in a team.
IN-TEXT QUESTIONS 1.3

1. Name three relevant statistical software packages.


2. Name the machine learning methods.

1.9 SUMMARY
The disciplines of management, business, and computer science are all combined in business
analytics. The commercial component requires knowledge of the industry at a high level as
well as awareness of current practical constraints. An understanding of data, statistics, and
computer science is required for the analytical portion. Business analysts can close the gap
between management and technology thanks to this confluence of disciplines. Business
analytics also includes effective problem-solving and communication to translate data
insights into information that is understandable to executives. A related field called business
intelligence likewise uses data to better understand and inform businesses. What distinguishes


business analytics from business intelligence in terms of objectives? Although both areas rely on data to provide answers, the goal of business intelligence is to understand how the organisation arrived at where it is today; measurement and monitoring of key performance indicators (KPIs) are part of this. The goal of business analytics, on the other
hand, is to support business improvements by utilizing predictive models that offer insight
into the results of suggested adjustments. Big data, statistical analysis, and data visualization
are all used in business analytics to implement organizational changes. This work includes
predictive analytics, which is crucial since it uses data that is already accessible to build
statistical models. These models can be applied to decision-making and result prediction.
Business analytics can provide specific recommendations to fix issues and enhance
enterprises by learning from the data already available.

1.10 GLOSSARY

Business Analytics: Business analytics consists of using data analysis and statistical methods to gain insights, make informed decisions, and drive strategic actions in a business or organizational context.

Big Data: Big data refers to large and complex datasets. It is characterized by the volume, velocity, and variety of data, often generated from various sources such as social media, sensors, devices, and business transactions.

1.11 ANSWERS TO IN-TEXT QUESTIONS

INTEXT QUESTIONS 1.1


1. Business analytics (BA) refers to the knowledge, tools, and procedures used for
ongoing, iterative analysis and investigation of previous business performance in
order to generate knowledge and inform future business strategy.
2. The evolution of business analytics refers to how the discipline has developed over a long period as better technology has become available, with its roots in operations research during World War II.


3. Two important benefits of business analytics are:


1. Gives businesses a competitive advantage
2. Transforms accessible data into insightful knowledge
INTEXT QUESTIONS 1.2
1. There are four types of Business analytics.
 Descriptive analytics,
 Diagnostic Analytics
 Predictive Analytics
 Prescriptive Analytics
2. Big data is made up of enormous volumes of information that cannot be processed or
stored using conventional data processing or storage methods.
3. The three essential traits are volume, diversity (variety), and velocity.
INTEXT QUESTIONS 1.3
1. The Three relevant statistical software packages are SPSS, STATA and SAS.
2. The machine learning methods are supervised machine learning, unsupervised machine learning, and semi-supervised learning.

1.12 SELF-ASSESSMENT QUESTIONS

1. What is Business Analysis?


2. Why is a Business Analyst needed in an organization?
3. What is SAS (Statistical Analysis Software)?
4. What are considered to be the four types of Business analytics? Explain them in your
own words.
5. Explain the importance of Business Analytics.
6. Explain any three relevant statistical software packages.


7. How does a machine learning method work?


8. Explain the difference between any two statistical software packages.

1.13 REFERENCES

 Evans, J.R. (2021), Business Analytics: Methods, Models and Decisions, Pearson
India
 Kumar, U. D. (2021), Business Analytics: The Science of Data-Driven Decision
Making, Wiley India.
 Larose, D. T. (2022), Data Mining and Predictive Analytics, Wiley India
 Shmueli, G. (2021), Data Mining and Business Analytics, Wiley India

1.14 SUGGESTED READING

 Cadle, J., Paul, D., & Turner, P. (2014). Business Analysis Techniques: 99 Essential Tools for Success. BCS, Swindon.
 Ziemski, K., Vander Horst, R., & Hass, K. B. (2008). Elevating the Role of the Business Analyst. Management Concepts. ISBN 1-56726-213-9, p. 94: "As business analysis becomes a more professionalised discipline".


LESSON 2
PREDICTIVE ANALYTICS
Dr. Satish Kumar Goel
Assistant Professor
Shaheed Sukhdev College of Business Studies
(University of Delhi)
satish@sscbsdu.ac.in

STRUCTURE

2.1 Learning Objectives


2.2 Introduction
2.3 Classical Linear Regression Model (CLRM)
2.4 Multiple Linear Regression Model
2.5 Practical Exercises Using R/Python Programming
2.6 Summary
2.7 Self-Assessment Questions
2.8 References
2.9 Suggested Readings

2.1 LEARNING OBJECTIVES

● To understand the basic concept of linear regression and where to apply it.


● To develop a linear relationship between two or more variables.
● To predict the value of dependent variable given the value of independent variable
using regression line.
● To be familiar with the different metrics used in regression.
● Use of R and Python for regression implementation.

2.2 INTRODUCTION

In this chapter, we will explore the field of predictive analytics, focusing on two fundamental
techniques: Simple Linear Regression and Multiple Linear Regression. Predictive analytics is
a powerful tool for analysing data and making predictions about future outcomes. We will


cover various aspects of regression models, including parameter estimation, model validation,
coefficient of determination, significance tests, residual analysis, and confidence and
prediction intervals. Additionally, we will provide practical exercises to reinforce your
understanding of these concepts, using R or Python for implementation.

2.3 CLASSICAL LINEAR REGRESSION MODEL (CLRM)

2.3.1. Introduction
Predictive analytics is the use of statistical techniques, machine learning algorithms, and
other tools to identify patterns and relationships in historical data and use them to make
predictions about future events. These predictions can be used to inform decision-making in a
wide variety of areas, such as business, marketing, healthcare, and finance.
Linear regression is the traditional statistical technique used to model the relationship
between one or more independent variables and a dependent variable.
Linear regression involving only two variables is called simple linear regression. Let us
consider two variables as ‘x’ and ‘y’. Here ‘x’ represents independent variable or explanatory
variable and ‘y’ represents the dependent variable or response variable. The dependent variable must be a ratio variable, whereas the independent variable can be a ratio or a categorical variable. We can
talk about regression model for cross-sectional data or for time series data. In time series
regression model, time is taken as independent variable and is very useful for predicting
future. Before we develop a regression model, it is a good exercise to ensure that two
variables are linearly related. For this, plotting the scatter diagram is really helpful. A linear
pattern can easily be identified in the data.
The Classical Linear Regression Model (CLRM) is a statistical framework used to analyse
the relationship between a dependent variable and one or more independent variables. It is a
widely used method in econometrics and other fields to study and understand the nature of
this relationship, make predictions, and test hypotheses.
Regression analysis aims to examine how changes in the independent variable(s) affect the
dependent variable. The CLRM assumes a linear relationship between the dependent variable
(Y) and the independent variable(s) (X), allowing us to estimate the parameters of this
relationship and make predictions.
The regression equation in the CLRM is expressed as:
Yi = α + βxi + μi


Here, Yi represents the dependent variable,


xi represents the independent variable,
α represents the intercept,
β represents the coefficient or slope that quantifies the effect of xi on Yi, and μi
represents the error term or residual.
The error term captures the unobserved factors and random variations that affect the
dependent variable but are not explicitly included in the model.
The CLRM considers the population regression function (PRF), which is the true underlying relationship between the variables in the population. In its stochastic form, the PRF is expressed as:
Yi = α + βxi + μi
The systematic part of the PRF, E(Yi | xi) = α + βxi, gives the expected value of Yi for a given xi; the error term (μi) represents the discrepancy between the observed value of Yi and this expected value.
In practice, we estimate the parameters of the PRF using sample data and derive the sample regression function (SRF), which is an approximation of the PRF. The SRF is represented as:
Yi = α̂ + β̂xi + ûi
In the SRF, α̂ and β̂ are the estimated intercept and coefficient, respectively, obtained through statistical methods such as ordinary least squares (OLS). The estimated error term ûi captures the residuals, or discrepancies between the observed and predicted values based on the estimated parameters.
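As a first practical illustration, here is a minimal sketch, assuming the statsmodels package and simulated data (the true values of α and β below are made up for the example), of estimating α̂, β̂ and the residuals ûi by ordinary least squares:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)               # independent variable
mu = rng.normal(0, 1.5, 100)              # unobserved error term
y = 2.0 + 0.8 * x + mu                    # simulated PRF with alpha = 2.0, beta = 0.8

X = sm.add_constant(x)                    # adds the intercept term
results = sm.OLS(y, X).fit()              # ordinary least squares estimation

print(results.params)                     # estimated alpha-hat and beta-hat
print(results.resid[:5])                  # first few estimated residuals u-hat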
2.3.2. Assumptions
To ensure reliable and meaningful results, the CLRM relies on several key assumptions. Let's
discuss these assumptions one by one:
 Linearity: The regression model must be linear in its parameters. Linearity refers to the
linearity of the parameters (α and β), not necessarily the linearity of the variables
themselves. For example, even if the variable xi is not linear, the model can still be
considered linear if the parameters (α and β) are linear.
 Variation in Independent Variables: There should be sufficient variation in the
independent variable(s) to be qualified as an explanatory variable. In other words, if there


is little or no variation in the independent variable, it cannot effectively explain the


differences in the dependent variable.
For example, suppose we want to model the consumption level taking income as the
independent variable. If everyone in the sample has an income of Rs 10,000, then there is
no variation in Xi. Hence, the difference in their consumption levels cannot be explained
by Xi.
Hence, we assume that there is enough variation in Xi. Otherwise, we cannot include it as
an explanatory variable in the model.
 Zero Mean and Normal Distribution of Error Term: The error term (μi) should have a
mean of zero. This means that, on average, the errors do not systematically overestimate
or underestimate the dependent variable. Additionally, the error term is assumed to follow
a normal distribution, allowing for statistical inference and hypothesis testing.
 Fixed Values of Independent Variables: The values of the independent variable(s) are
considered fixed over repeated sampling. This assumption implies that the independent
variables are not subject to random fluctuations or changes during the sampling process.
 No Endogeneity: Endogeneity refers to the situation where there is a correlation between
the independent variables and the error term. In other words, the independent variables
are not independent of the error term. To ensure valid results, it is crucial to address
endogeneity issues, as violating this assumption can lead to biased and inconsistent
parameter estimates.
 Number of Parameters vs. Sample Size: The number of parameters to be estimated (k)
from the model should be significantly smaller than the total number of observations in
the sample (n). In general, it is recommended that the sample size (n) should be at least 20
times greater than the number of parameters (k) to obtain reliable and stable estimates.
 Correct Model Specification: The econometric model should be correctly specified,
meaning that it reflects the true relationship between the variables in the population.
Model misspecification can occur in two ways: improper functional form and
inclusion/exclusion of relevant variables. Improper functional form refers to using a linear
model when the true relationship is nonlinear, leading to biased parameter estimates. The
inclusion of irrelevant variables or exclusion of relevant variables can also lead to biased
and inefficient estimates.


 Homoskedasticity: Homoskedasticity assumes that the variance of the error term is


constant across all levels of the independent variables. It means that the spread or
dispersion of the errors does not change systematically with the values of the independent
variable(s). This assumption is important for obtaining efficient and unbiased estimates of
the parameters.
To understand homoskedasticity visually, let's consider a scatter plot with a regression
line. In a homoskedastic scenario, the spread of the residuals around the regression line
will be relatively constant across different values of the independent variable(s).
Homoskedasticity means that the variance of the error term is constant:
Yi = α + βxi + µi
Var(µi) = σ² (a constant) for all i

Fig 2.1: Scatter plot illustrating homoskedasticity


Even at higher levels of Xi, the variance of the error term remains constant.
In a homoskedastic scenario, the spread of the residuals around the regression line remains relatively
constant across different values of the independent variable. This means that the variability of
the dependent variable is consistent across the range of the independent variable.


Homoskedasticity is an important assumption in the CLRM because violations of this


assumption can lead to biased and inefficient estimators, affecting the reliability of the
regression analysis. If heteroskedasticity is present (where the spread of the residuals varies
across the range of the independent variable), it can indicate that the model is not adequately
capturing the relationship between the variables, leading to unreliable inference and
misleading results.
To detect heteroskedasticity, you can visually inspect a scatter plot of the residuals or employ statistical tests specifically designed to assess the presence of heteroskedasticity, such as the Breusch-Pagan test or the White test (a short code sketch after this list of assumptions illustrates the Breusch-Pagan test).
If heteroskedasticity is detected, various techniques can be employed to address it, such as
transforming the variables, using weighted least squares (WLS) regression, or employing
heteroskedasticity-consistent standard errors.
 No Autocorrelation: Autocorrelation, also known as serial correlation, refers to the
correlation between error terms of different observations. In the case of cross-sectional
data, autocorrelation occurs when the error terms of different individuals or units are
correlated. In time series data, autocorrelation occurs when the error terms of consecutive
time periods are correlated. Autocorrelation violates the assumption of independent and
identically distributed errors, and it can lead to biased and inefficient estimates.
This means that the covariance between any two error terms should be zero. If that is not the case, then it is a situation of autocorrelation.

Yi = α + βxi + µi
Yj = α + βxj + µj
Cov(µi, µj) ≠ 0: spatial autocorrelation
Cov(µt, µt+1) ≠ 0: autocorrelation (serial correlation)
In cross sectional data, if two error terms do not have zero covariance, then it is a
situation of SPATIAL CORRELATION. In time series data, if two error terms for
consecutive time periods do not have zero covariance, then it is a situation of
AUTOCORRELATION OR SERIAL CORRELATION.
 No Multicollinearity: Multicollinearity occurs when there is a high degree of correlation
between two or more independent variables in the regression model. This can pose a
problem because it becomes challenging to separate the individual effects of the


correlated variables. Multicollinearity can lead to imprecise and unstable parameter


estimates.
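Returning to the homoskedasticity assumption above, the following is a minimal sketch, assuming the statsmodels package and deliberately simulated heteroskedastic data, of running the Breusch-Pagan test mentioned there; a small p-value suggests the constant-variance assumption is violated:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
# The error variance grows with x here, so heteroskedasticity is built in on purpose
y = 2.0 + 0.8 * x + rng.normal(0, 0.5 * x)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")   # small value suggests heteroskedasticity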
By adhering to these assumptions, the CLRM exhibits desirable properties such as efficiency,
unbiasedness, and consistency. Efficiency refers to obtaining parameter estimates with the
minimum possible variance, allowing for precise estimation. Unbiasedness means that, on
average, the estimated parameters are not systematically over or underestimating the true
population parameters. Consistency implies that as the sample size increases, the estimated
parameters converge to the true population parameters.
In conclusion, the Classical Linear Regression Model (CLRM) is a widely used statistical
framework for analysing the relationship between a dependent variable and one or more
independent variables. By estimating the parameters of the regression equation, we can make
predictions, test hypotheses, and gain insights into the factors influencing the dependent
variable. However, it is crucial to ensure that the assumptions of the CLRM are met to obtain
reliable and meaningful results. Violating these assumptions can lead to biased and
inconsistent parameter estimates, compromising the validity of the analysis.
2.3.3 Simple Linear Regression
2.3.3.1. Estimation of Parameters
Simple Linear Regression involves estimating the parameters of a linear equation that best
fits the relationship between a single independent variable and a dependent variable. We will
discuss the methods used to estimate these parameters and interpret their meaning in the
context of the problem at hand using R/Python programming.
2.3.3.2 Model Validation
Validating the simple linear regression model is crucial to ensure its reliability. We will cover
various techniques, such as hypothesis testing, to assess the significance of the model and
evaluate its performance. Additionally, we will examine residual analysis to understand the
differences between the observed and predicted values and identify potential issues with the
model.
Validation of a simple linear regression model involves assessing the model's performance
and determining how well it fits the data. Here are some common techniques for validating a
simple linear regression model:


Residual Analysis: Residuals are the differences between the observed values and the
predicted values of the dependent variable. By analysing the residuals, you can evaluate the
model's performance. Some key aspects to consider are:
 Checking for randomness: Plotting the residuals against the predicted values or the
independent variable can help identify any patterns or non-random behaviour.
 Assessing normality: Plotting a histogram or a Q-Q plot of the residuals can indicate
whether they follow a normal distribution. Departures from normality might suggest
violations of the assumptions.
 Checking for homoscedasticity: Plotting the residuals against the predicted values or the
independent variable can reveal any patterns indicating non-constant variance. The spread
of the residuals should be consistent across all levels of the independent variable.
R-squared (Coefficient of Determination): R-squared measures the proportion of the total
variation in the dependent variable that is explained by the linear regression model. A higher
R-squared value indicates a better fit. However, R-squared alone does not provide a complete
picture of model performance and should be interpreted along with other validation metrics.
Adjusted R-squared: Adjusted R-squared takes into account the number of independent
variables in the model. It penalizes the addition of irrelevant variables and provides a more
reliable measure of model fit when comparing models with different numbers of predictors.
F-statistic: The F-statistic assesses the overall significance of the linear regression model. It
compares the fit of the model with a null model (no predictors) and provides a p-value
indicating whether the model significantly improves upon the null model.
Outlier Analysis: Identify potential outliers in the data that may have a substantial impact on
the model's fit. Outliers can skew the regression line and affect the estimated coefficients. It
is important to investigate and understand the reasons behind any outliers and assess their
influence on the model.
Cross-Validation: Splitting the dataset into training and testing subsets allows you to assess
the model's performance on unseen data. The model is trained on the training set and then
evaluated on the testing set. Metrics such as mean squared error (MSE), or root mean squared
error (RMSE) can be calculated to quantify the model's predictive accuracy.
By employing these validation techniques, you can gain insights into the model's
performance, evaluate its assumptions, and make informed decisions about its reliability and
usefulness for predicting the dependent variable.
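As a small illustration of the last technique, the sketch below assumes the scikit-learn package and simulated data; it holds out 30% of the observations and reports the RMSE of a simple linear regression on the unseen part:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, (150, 1))
y = 5.0 + 1.2 * X[:, 0] + rng.normal(0, 2.0, 150)

# Hold out 30% of the observations for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))   # predictive accuracy on unseen data
print(f"Test RMSE: {rmse:.2f}")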

2.3.4. Coefficient of Determination:


The coefficient of determination, commonly known as R-squared, quantifies the proportion
of variance in the dependent variable that can be explained by the independent variable in a
simple linear regression model. We will delve into the calculation and interpretation of this
important metric.
Introduction:
The overall goodness of fit of the regression model is measured by the coefficient of determination, r². It tells what proportion of the variation in the dependent variable, or regressand, is explained by the explanatory variable, or regressor. This r² lies between 0 and 1; the closer it is to 1, the better the fit.
Let TSS denote the TOTAL SUM OF SQUARES, which is the total variation of the actual Y values about their sample mean:
TSS = Σ (yᵢ - ȳ)²
TSS can further be split into two variations; explained sum of square (ESS) and residual sum
of squares (RSS).
Explained sum of square (ESS) or Regression sum of squares or Model sum of squares is a
statistical quantity used in modelling of a process. ESS gives an estimate of how well a model
explains the observed data for the process.
ESS = Σ (ŷᵢ - ȳ)²
The residual sum of squares (RSS) is a statistical quantity used to measure the amount of variation in a data set that is not explained by the regression model; it is the variation in the residuals, or error term:
RSS = Σ (yᵢ - ŷᵢ)²
Since TSS = ESS + RSS,
or 1 = ESS/TSS + RSS/TSS.
Since ESS/TSS is the proportion of variability in Y explained by the regression model, we have
r2 = ESS/TSS
Alternatively, from the above, r2 = 1 - RSS/TSS
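As a small illustration, the following Python sketch (with made-up observed and fitted values) computes TSS, ESS, RSS and r2 directly from these definitions:

import numpy as np

# Hypothetical observed values and fitted values from some regression
y = np.array([10.0, 12.0, 9.0, 14.0, 11.0])
y_hat = np.array([10.5, 11.5, 9.5, 13.0, 11.5])

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained (regression) sum of squares
rss = np.sum((y - y_hat) ** 2)         # residual sum of squares

r2 = ess / tss           # proportion of variation explained
r2_alt = 1 - rss / tss   # equals ess/tss only when TSS = ESS + RSS (an OLS fit with an intercept)
print(tss, ess, rss, r2, r2_alt)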
2.3.5 Significance Tests:


To determine the significance of the simple linear regression model and its coefficients, we
will explore statistical tests such as t-tests and p-values in the practical exercise. These tests
help assess the statistical significance of the relationships between variables and make
informed conclusions.
2.3.6 Residual Analysis
Residual analysis is a critical step in evaluating the adequacy of a simple linear regression
model. Using practical examples, we will discuss how to interpret and analyse residuals,
which represent the differences between the observed and predicted values. Residual analysis
provides insights into the model's assumptions and potential areas for improvement.
2.3.7 Confidence and Prediction Intervals
Confidence and prediction intervals are essential in understanding the uncertainty associated
with the predictions made by a simple linear regression model. We will cover the calculation
and interpretation of these intervals, allowing us to estimate the range within which future
observations are expected to fall in the practical exercises.
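As a quick sketch on simulated data (assuming the get_prediction() and summary_frame() methods available for fitted OLS models in statsmodels), a 95% confidence interval for the mean response and a 95% prediction interval for a new observation can be obtained as follows:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: Y depends linearly on X plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 5 + 2 * x + rng.normal(0, 2, 50)

X = sm.add_constant(pd.DataFrame({'X': x}))
results = sm.OLS(y, X).fit()

# Intervals at a new value X = 4 (has_constant='add' forces the constant column)
new_X = sm.add_constant(pd.DataFrame({'X': [4.0]}), has_constant='add')
frame = results.get_prediction(new_X).summary_frame(alpha=0.05)
print(frame[['mean', 'mean_ci_lower', 'mean_ci_upper',
             'obs_ci_lower', 'obs_ci_upper']])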

2.4 MULTIPLE LINEAR REGRESSION MODEL

Multiple regression is a statistical analysis technique used to examine the relationship


between a dependent variable and two or more independent variables. It builds upon the
concept of simple linear regression, which analyses the relationship between a dependent
variable and a single independent variable.
The multiple regression model equation looks like this:
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε
In this equation:
Y represents the dependent variable that we want to predict or explain.
X1, X2, ..., Xn are the independent variables.
β0 is the y-intercept or constant term.
β1, β2, ..., βn are the coefficients or regression weights that represent the change in the
dependent variable associated with a one-unit change in the corresponding independent
variable.
ε is the error term or residual, representing the unexplained variation in the dependent
variable.

2.4.1 Interpretation of Partial Regression Coefficients:


Multiple Linear Regression extends the simple linear regression framework to include
multiple independent variables. We will explore the interpretation of partial regression
coefficients, which quantify the relationship between each independent variable and the
dependent variable while holding other variables constant.
2.4.2 Working with Categorical Variables:
Categorical variables require special treatment in regression analysis. We will discuss how to
handle categorical variables by creating dummy variables or qualitative variables. The
interpretation of these coefficients will be explained to understand the impact of categorical
variables on the dependent variable.
2.4.3 Multicollinearity and VIF:
Multicollinearity refers to the presence of a high correlation between independent variables in
a multiple linear regression model. One of the assumptions of the CLRM is that there is no
exact linear relationship among the independent variables (regressors). If there are one or
more such relationships among the regressors, we call it multicollinearity.
There are two types of multicollinearity:
1. Perfect collinearity
2. Imperfect collinearity
Perfect multicollinearity occurs when two or more independent variables in a regression model exhibit a deterministic (perfectly predictable, containing no randomness) linear relationship. With imperfect multicollinearity, an independent variable is a strong, but not perfect, linear function of one or more of the other independent variables. In other words, factors outside this linear relationship also affect that independent variable.
Multicollinearity occurs when there is a high correlation between independent variables in a
regression model. It can cause issues with the estimation of coefficients and affect the
reliability of statistical inference.

The causes of multicollinearity are as follows:

1) Data collection method: If we sample over a limited range of values taken by the regressors in the population, it can lead to multicollinearity.
2) Model specification: If we introduce polynomial terms into the model, especially when the values of the explanatory variables span a small range, it can lead to multicollinearity.
3) Constraint on the model or in the population: For example, if we try to regress electricity expenditure on house size and income, the model may suffer from multicollinearity because there is a constraint in the population: people with higher incomes typically have bigger houses.
4) Overdetermined model: If we have more explanatory variables than the number of observations, it can lead to multicollinearity. This often happens in medical research, where a large amount of information is collected about a limited number of patients.

Impact of multicollinearity:
Unbiasedness: The Ordinary Least Squares (OLS) estimators remain unbiased.
Precision: OLS estimators have large variances and covariances, making precise estimation
difficult and leading to wider confidence intervals. Statistically insignificant coefficients may
be observed.
High R-squared: The R-squared value can still be high, even with statistically insignificant
coefficients.
Sensitivity: OLS estimators and their standard errors are sensitive to small changes in the
data.
Efficiency: Despite increased variance, OLS estimators are still efficient, meaning they have
minimum variance among all linear unbiased estimators.
In summary, multicollinearity undermines the precision of coefficient estimates and can lead
to unreliable statistical inference. While the OLS estimators remain unbiased, they become
imprecise, resulting in wider confidence intervals and potential insignificance of coefficients.
We will learn how to detect multicollinearity using the Variance Inflation Factor (VIF) and
explore strategies to address this issue, ensuring the accuracy and interpretability of the
regression model.

VIF stands for Variance Inflation Factor, which is a measure used to assess multicollinearity in a multiple regression model. VIF quantifies how much the variance of the
estimated regression coefficient is increased due to multicollinearity. It measures how much
the variance of one independent variable's estimated coefficient is inflated by the presence of
other independent variables in the model.
The formula for calculating the VIF for an independent variable Xj is:
VIF(Xj) = 1 / (1 – rj2)
where rj2 represents the coefficient of determination (R-squared) from a regression model that
regresses Xj on all other independent variables.
The interpretation of VIF is as follows:
If VIF(Xj) is equal to 1, it indicates that there is no correlation between Xj and the other
independent variables.
If VIF(Xj) is greater than 1 but less than 5, it suggests moderate multicollinearity.
If VIF(Xj) is greater than 5, it indicates a high degree of multicollinearity and is generally considered problematic (some texts use a stricter cutoff of 10).
When assessing multicollinearity, it is common to examine the VIF values for all independent
variables in the model. If any variables have high VIF values, it indicates that they are highly
correlated with the other variables, which may affect the reliability and interpretation of the
regression coefficients.
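The following short Python sketch (on simulated data, where x2 is deliberately constructed to be highly correlated with x1) computes each VIF directly from the auxiliary-regression R-squared, exactly as in the formula above:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical predictors: x2 is deliberately correlated with x1
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

# VIF for each Xj: regress Xj on the remaining predictors and use 1 / (1 - R^2_j)
for col in X.columns:
    others = sm.add_constant(X.drop(columns=col))
    r2_j = sm.OLS(X[col], others).fit().rsquared
    print(f"VIF({col}) = {1 / (1 - r2_j):.2f}")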
If high multicollinearity is detected (e.g., VIF greater than 5), some steps can be taken to
address it:
 Remove one or more of the highly correlated independent variables from the model.
 Combine or transform the correlated variables into a single variable.
 Obtain more data to reduce the correlation among the independent variables.
By addressing multicollinearity, the stability and interpretability of the regression model can
be improved, allowing for more reliable inferences about the relationships between the
independent variables and the dependent variable.
HOW TO DETECT MULTICOLLINEARITY
To detect multicollinearity in your regression model, you can use several methods:

Pairwise Correlation: Calculate the pairwise correlation coefficients between each pair of
explanatory variables. If the correlation coefficient is very high (typically greater than 0.8), it
indicates potential multicollinearity. However, low pairwise correlations do not guarantee the
absence of multicollinearity.
Variance Inflation Factor (VIF) and Tolerance: VIF measures the extent to which the
variance of the estimated regression coefficient is increased due to multicollinearity. High
VIF values (greater than 10) suggest multicollinearity. Tolerance, which is the reciprocal of
VIF, measures the proportion of variance in the predictor variable that is not explained by
other predictors. Low tolerance values (close to zero) indicate high multicollinearity.
Insignificance of Individual Variables: If many of the explanatory variables in the model are
individually insignificant (i.e., their t-statistics are statistically insignificant) despite a high R-
squared value, it suggests the presence of multicollinearity.
Auxiliary Regressions: Conduct auxiliary regressions where each independent variable is
regressed against the remaining independent variables. Check the overall significance of
these regressions using the F-test. If any of the auxiliary regressions show significant F-
values, it indicates collinearity with other variables in the model.
HOW TO FIX MULTICOLLINEARITY
To address multicollinearity, you can consider the following approaches:
 Increase Sample Size: By collecting a larger sample, you can potentially reduce the
severity of multicollinearity. With a larger sample, you can include individuals with
different characteristics, reducing the correlation between variables. Increasing the
sample size leads to more efficient estimators and mitigates the multicollinearity
problem.
 Drop Non-Essential Variables: If you have variables that are highly correlated with
each other, consider excluding non-essential variables from the model. For example,
if both father's and mother's education are highly correlated, you can choose to
include only one of them. However, be cautious when dropping variables as it may
result in model misspecification if the excluded variable is theoretically important.
Detecting and addressing multicollinearity is crucial for obtaining reliable regression results.
By understanding the signs of multicollinearity and applying appropriate remedies, you can
improve the accuracy and interpretability of your regression model.

2.4.4 Outlier Analysis


Outliers can significantly influence the results of a regression model. We will discuss
techniques for identifying and handling outliers effectively, enabling us to build more robust
and reliable models.
2.4.5 Autocorrelation
Autocorrelation, also known as serial correlation, refers to the correlation between observations in a time-series data set or within a regression model. It arises when there is a systematic relationship between the current observation and one or more past observations.
Autocorrelation occurs when the residuals of a regression model exhibit a pattern, indicating
a potential violation of the model's assumptions. We will cover methods for detecting and
addressing autocorrelation, ensuring the independence of residuals and the validity of our
model.
Consequences of Autocorrelation
I. OLS estimators are still unbiased and consistent.
II. They are still normally distributed in large samples.
III. But they are no longer efficient; that is, they are no longer BLUE (best linear unbiased estimators). In the case of positive autocorrelation, standard errors are UNDERESTIMATED, which means that the t-values are OVERESTIMATED. Hence, variables that are not statistically significant may erroneously appear to be statistically significant, with high t-values.
IV. The hypothesis-testing procedure is not reliable because the standard errors are erroneous, even with large samples. Therefore, the F and t tests may not be valid.
Autocorrelation can be detected by the following methods:

 Graphical Method
 Durbin Watson test
 Breusch-Godfrey test
1. Graphical Method
Autocorrelation can be detected using graphical methods. Here are a few graphical
techniques to identify autocorrelation:

Residual Plot: Plot the residuals of the regression model against the corresponding time or
observation index. If there is no autocorrelation, the residuals should appear random and
evenly scattered around zero. However, if autocorrelation is present, you may observe
patterns or clustering of residuals above or below zero, indicating a systematic relationship.
Partial Autocorrelation Function (PACF) Plot: The PACF plot displays the correlation
between the residuals at different lags, while accounting for the intermediate lags. In the
absence of autocorrelation, the PACF values should be close to zero for all lags beyond the
first. If there is significant autocorrelation, you may observe spikes or significant values
beyond the first lag.
Autocorrelation Function (ACF) Plot: The ACF plot shows the correlation between the
residuals at different lags, without accounting for the intermediate lags. Similar to the PACF
plot, significant values beyond the first lag in the ACF plot indicate the presence of
autocorrelation.

Figure 1.2
Autocorrelation and partial autocorrelation function (ACF and PACF) plots, prior to
differencing (A and B) and after differencing (C and D)
In both the PACF and ACF plots, significance can be determined by comparing the
correlation values against the confidence intervals. If the correlation values fall outside the
confidence intervals, it suggests the presence of autocorrelation.
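A minimal sketch of producing such plots, assuming the plot_acf() and plot_pacf() helpers from statsmodels and a simulated residual series with built-in first-order autocorrelation, is given below:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Hypothetical residual series from a regression on time-series data
rng = np.random.default_rng(11)
resid = np.zeros(150)
for t in range(1, 150):                  # build in some first-order autocorrelation
    resid[t] = 0.5 * resid[t - 1] + rng.normal()

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(resid, lags=20, ax=axes[0])     # spikes outside the shaded band suggest autocorrelation
plot_pacf(resid, lags=20, ax=axes[1])
plt.show()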

It's important to note that these graphical methods provide indications of autocorrelation, but
further statistical tests, such as the Durbin-Watson test or Ljung-Box test, should be
conducted to confirm and quantify the autocorrelation in the model.
2. Durbin Watson D Test
The Durbin-Watson test is a statistical test used to detect autocorrelation in the residuals of a
regression model. It is specifically designed for detecting first-order autocorrelation, which is
the correlation between adjacent observations.
The Durbin-Watson test statistic is computed using the following formula:
d = Σ (eᵢ - eᵢ₋₁)² / Σ eᵢ²
where:
· eᵢ is the residual for observation i.
· eᵢ₋₁ is the residual for the previous observation (i - 1).
The test statistic is then compared to critical values to determine the presence of
autocorrelation. The critical values depend on the sample size, the number of independent
variables in the regression model, and the desired level of significance.
The Durbin-Watson test statistic, denoted as d, ranges from 0 to 4. The test statistic is
calculated based on the residuals of the regression model and is interpreted as follows:
A value of d close to 2 indicates no significant autocorrelation. It suggests that the residuals
are independent and do not exhibit a systematic relationship.
A value of d less than 2 indicates positive autocorrelation. It suggests that there is a positive
relationship between adjacent residuals, meaning that if one residual is high, the next one is
likely to be high as well.
A value of d greater than 2 indicates negative autocorrelation. It suggests that there is a
negative relationship between adjacent residuals, meaning that if one residual is high, the
next one is likely to be low.
The closer it is to zero, the greater is the evidence of positive autocorrelation, and the closer it
is to 4, the greater is the evidence of negative autocorrelation. If d is about 2, there is no
evidence of positive or negative (first-) order autocorrelation.
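The statistic is easy to compute directly from the residuals; the sketch below (with a made-up residual series) applies the formula and, for comparison, the durbin_watson() helper from statsmodels:

import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Hypothetical residual series (e.g. results.resid from a fitted OLS model)
e = np.array([0.5, 0.6, 0.4, -0.2, -0.3, -0.1, 0.2, 0.3, -0.4, 0.1])

# Durbin-Watson statistic: sum of squared successive differences over sum of squares
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(f"d computed from the formula: {d:.3f}")
print(f"d from statsmodels:          {durbin_watson(e):.3f}")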

3. The Breusch-Godfrey Test


The Breusch-Godfrey test, also known as the LM test for autocorrelation, is a statistical test
used to detect autocorrelation in the residuals of a regression model. Unlike the Durbin-
Watson test, which is primarily designed for detecting first-order autocorrelation, the
Breusch-Godfrey test can detect higher-order autocorrelation.
The test is based on the idea of regressing the residuals of the original regression model on
lagged values of the residuals. It tests whether the lagged residuals are statistically significant
in explaining the current residuals, indicating the presence of autocorrelation.
The general steps for performing the Breusch-Godfrey test are as follows:
1. Estimate the initial regression model and obtain the residuals.
2. Extend the initial regression model by including lagged values of the residuals as
additional independent variables.
3. Estimate the extended regression model and obtain the residuals from this model.
4. Perform a hypothesis test on whether the lagged residuals are jointly significant in
explaining the current residuals.
The test statistic for the Breusch-Godfrey test follows a chi-square distribution and is
calculated based on the residual sum of squares (RSS) from the extended regression model.
The test statistic is compared to the critical values from the chi-square distribution to
determine the presence of autocorrelation.
The interpretation of the Breusch-Godfrey test involves the following steps:
1. Set up the null hypothesis (H0): There is no autocorrelation in the residuals
(autocorrelation is absent).
2. Set up the alternative hypothesis (Ha): There is autocorrelation in the residuals
(autocorrelation is present).
3. Conduct the Breusch-Godfrey test and calculate the test statistic.
4. Compare the test statistic to the critical value(s) from the chi-square distribution.
5. If the test statistic is greater than the critical value, reject the null hypothesis and conclude that there is evidence of autocorrelation. If the test statistic is less than the critical value, fail to reject the null hypothesis and conclude that there is no significant evidence of autocorrelation.
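In practice these steps are automated; the sketch below (on simulated data with AR(1) errors) uses the acorr_breusch_godfrey() function from statsmodels to carry out the test with two lags:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# Hypothetical time-series style data with autocorrelated errors
rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):                      # AR(1) errors: u_t = 0.6 * u_{t-1} + noise
    u[t] = 0.6 * u[t - 1] + rng.normal()
y = 1 + 2 * x + u

results = sm.OLS(y, sm.add_constant(x)).fit()

# Breusch-Godfrey LM test with 2 lags of the residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(results, nlags=2)
print(f"LM statistic: {lm_stat:.3f}, p-value: {lm_pvalue:.4f}")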

2.4.6 Transformation of Variables


Transforming variables can enhance the fit and performance of a regression model. We will explore techniques such as logarithmic and power transformations in practical examples, which can help satisfy the linearity, normality, and homoscedasticity assumptions.
2.4.7 Variable Selection in Regression Model Building:
Building an optimal regression model involves selecting the most relevant independent
variables. We will discuss various techniques for variable selection, including stepwise
regression and regularization methods like Lasso and Ridge regression.
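As a brief illustration of regularization-based variable selection, the sketch below (on simulated data, and assuming the scikit-learn library, which is not otherwise used in this chapter) shows how Lasso shrinks the coefficients of irrelevant predictors towards zero:

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Hypothetical data: only the first two predictors truly matter; the rest are noise
rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 4))
y = 4 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)   # larger alpha gives stronger shrinkage

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))  # irrelevant ones pushed to ~0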

2.5 PRACTICAL EXERCISES USING R/PYTHON PROGRAMMING

To reinforce the concepts covered in this chapter, practical exercises using R/Python programming have been provided. These exercises involve implementing simple OLS regression using R or Python, interpreting the results obtained, and conducting assumption
tests such as checking for multicollinearity, autocorrelation, and normality. Furthermore,
regression analysis with categorical/dummy/qualitative variables will be performed to
understand their impact on the dependent variable.
Exercise 1: Perform simple OLS regression in R/Python and interpret the results obtained.
Sol. Here's an example of how you can perform a simple Ordinary Least Squares (OLS) regression in both R and Python, along with an interpretation of the results.
Let's assume you have a dataset with a dependent variable (Y) and an independent variable
(X). We will use this dataset to demonstrate the OLS regression.

Using R:
# Load the necessary libraries
library(dplyr)

# Read the dataset
data <- read.csv("your_dataset.csv")

# Perform the OLS regression
model <- lm(Y ~ X, data = data)

# Print the summary of the regression results
summary(model)
Using Python (using the statsmodels library):


# Import the necessary libraries
import pandas as pd
import statsmodels.api as sm

# Read the dataset
data = pd.read_csv("your_dataset.csv")

# Perform the OLS regression
model = sm.OLS(data['Y'], sm.add_constant(data['X']))

# Fit the model
results = model.fit()

# Print the summary of the regression results
print(results.summary())
In both R and Python, we first load the necessary libraries (e.g., dplyr in R and pandas and
statsmodels in Python). Then, we read the dataset containing the variables Y and X.
Next, we perform the OLS regression by specifying the formula in R (Y ~ X) and using the
lm function. In Python, we create an OLS model object using sm.OLS and provide the
dependent variable (Y) and independent variable (X) as arguments. We also add a constant
term using sm.add_constant to account for the intercept in the regression.
After fitting the model, we can print the summary of the regression results using summary(model) in R and print(results.summary()) in Python. The summary provides various statistical measures and information about the regression model.
Interpreting the results:
Coefficients: The regression results will include the estimated coefficients for the intercept
and the independent variable. These coefficients represent the average change in the
dependent variable for a one-unit increase in the independent variable. For example, if the
coefficient for X is 0.5, it suggests that, on average, Y increases by 0.5 units for every one-
unit increase in X.

p-values: The regression results also provide p-values for the coefficients. These p-values
indicate the statistical significance of the coefficients. Generally, a p-value less than a
significance level (e.g., 0.05) suggests that the coefficient is statistically significant, implying
a relationship between the independent variable and the dependent variable.
R-squared: The R-squared value (R-squared or R2) measures the proportion of the variance
in the dependent variable that can be explained by the independent variable(s). It ranges from
0 to 1, with higher values indicating a better fit of the regression model to the data. R-squared
can be interpreted as the percentage of the dependent variable's variation explained by the
independent variable(s).
Residuals: The regression results also include information about the residuals, which are the
differences between the observed values of the dependent variable and the predicted values
from the regression model. Residuals should ideally follow a normal distribution with a mean
of zero, and their distribution can provide insights into the model's goodness of fit and
potential violations of the regression assumptions.
It's important to note that interpretation may vary depending on the specific context and
dataset. Therefore, it's essential to consider the characteristics of your data and the objectives
of your analysis while interpreting the results of an OLS regression.
Exercise 2. Test the assumptions of OLS (multicollinearity, autocorrelation, normality etc.)
on R/Python.
Sol. To test the assumptions of OLS, including multicollinearity, autocorrelation, and
normality, you can use various diagnostic tests in R or Python. Here are the steps and some
commonly used tests for each assumption:
Multicollinearity:
Step 1: Calculate the pairwise correlation matrix between the independent variables using the cor() function in R or the corrcoef() function in Python (numpy).
Step 2: Calculate the Variance Inflation Factor (VIF) for each independent variable using the vif() function from the "car" package in R or the variance_inflation_factor() function from the "statsmodels" library in Python. VIF values greater than 10 indicate high
multicollinearity.
Step 3: Perform auxiliary regressions by regressing each independent variable against the
remaining independent variables to identify highly collinear variables.
Autocorrelation:

Step 1: Plot the residuals against the predicted values (fitted values) from the regression model. In R, you can use the plot() function with the residuals() and fitted() functions. In Python, you can use the scatter() function from matplotlib.
Step 2: Conduct the Durbin-Watson test using the dwtest() function from the "lmtest" package in R or the durbin_watson() function from the "statsmodels.stats.stattools" module in Python. A value close to 2 indicates no autocorrelation, while values significantly smaller or greater than 2 suggest positive or negative autocorrelation, respectively.
Normality of Residuals:
Step 1: Plot a histogram or a kernel density plot of the residuals. In R, you can use the hist() or density() functions. In Python, you can use the histplot() or kdeplot() functions from the seaborn library.
Step 2: Perform a normality test such as the Shapiro-Wilk test using the shapiro.test() function in R or the shapiro() function from the "scipy.stats" module in Python. A p-value greater than 0.05 suggests that there is no significant evidence against the normality of the residuals.
It's important to note that these tests provide diagnostic information, but they may not be
definitive. It's also advisable to consider the context and assumptions of the specific
regression model being used.
Here is the random data set to perform the regression code in either R or Python.

This dataset consists of three columns: y represents the dependent variable, and x1 and x2 are
the independent variables. Each row corresponds to an observation in the dataset.
We can use this dataset to run the provided code and perform diagnostic tests on the OLS
regression model.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

# Set random seed for reproducibility
np.random.seed(123)

# Generate random data
n = 100  # Number of observations
x1 = np.random.normal(0, 1, n)  # Independent variable 1
x2 = np.random.normal(0, 1, n)  # Independent variable 2
epsilon = np.random.normal(0, 1, n)  # Error term

# Generate dependent variable
y = 1 + 2*x1 + 3*x2 + epsilon

# Create a DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2})

# Fit OLS regression model
X = sm.add_constant(data[['x1', 'x2']])  # Add constant term
model = sm.OLS(data['y'], X)
results = model.fit()

# Diagnostic tests
print("Multicollinearity:")
vif = pd.DataFrame()
vif["Variable"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)

print("\nAutocorrelation:")
residuals = results.resid
fig, ax = plt.subplots()
ax.scatter(results.fittedvalues, residuals)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
plt.show()

print("Durbin-Watson test:")
dw_statistic = durbin_watson(residuals)

print (f"Durbin-Watson statistic: {dw_statistic}")

print("\nNormality of Residuals:")
sns.histplot(residuals, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()

shapiro_test = sm.stats.shapiro(residuals)
print(f"Shapiro-Wilk test p-value: {shapiro_test[1]}")

In this example, we generated a random dataset with two independent variables (x1 and x2)
and a dependent variable (y). We fit an OLS regression model using the statsmodels library.
Then, we perform diagnostic tests for multicollinearity, autocorrelation, and normality of
residuals.
The code calculates the VIF for each independent variable, plots the residuals against the
fitted values, performs the Durbin-Watson test for autocorrelation, and plots a histogram of
the residuals. Additionally, the Shapiro-Wilk test is conducted to check the normality of
residuals.
We can run this code in a Python environment to see the results and interpretations for each
diagnostic test based on the random dataset provided.
Exercise 3: Perform regression analysis with categorical/dummy/qualitative variables in R/Python.
import pandas as pd
import statsmodels.api as sm
# Create a DataFrame with the data
data = {
    'y': [3.3723, 5.5593, 8.1878, -2.4581, 3.8578, 5.4747, 6.4135, 8.1032,
          5.56, 5.3514, 5.8457],
    'x1': [-1.085631, 0.997345, 0.282978, -1.506295, -0.5786, 1.651437,
           -2.426679, -0.428913, -0.86674, 0.742045, 2.312265],
    'x2': [-0.076047, 0.352978, -2.242685, 1.487477, 1.058969, -0.37557,
           -0.600516, 0.955434, -0.151318, -0.10322, 0.410598],
    'category': ['A', 'B', 'A', 'B', 'B', 'A', 'B', 'A', 'A', 'B', 'B']
}

df = pd.DataFrame(data)

# Convert the categorical variable to dummy variables


df = pd.get_dummies(df, columns=['category'], drop_first=True)

# Define the dependent and independent variables


X = df[['x1', 'x2', 'category_B']]
y = df['y']

# Add a constant term to the independent variables


X = sm.add_constant(X)

# Fit the OLS model


model = sm.OLS(y, X).fit()

# Print the summary of the regression results


print(model.summary())

In this example, we have created a DataFrame df with the y, x1, x2, and category variables. The category variable is converted into dummy variables using the get_dummies function, and the category_A dummy column is dropped to avoid multicollinearity. We then define the dependent variable y and the independent variables X, including the dummy variable category_B. A constant term is added to the independent variables using sm.add_constant. Finally, we fit the OLS model using sm.OLS and print the summary of the regression results using model.summary(). The regression analysis provides the estimated coefficients, standard errors, t-statistics, and p-values for each independent variable, including the dummy variable category_B.
IN-TEXT QUESTIONS AND ANSWERS

1. What is the main objective of simple linear regression?


Answer: The main objective of simple linear regression is to establish a linear
relationship between a dependent variable and a single independent variable and
use it to predict the value of the dependent variable based on the value of the
independent variable.

2. What are the key assumptions of multiple linear regression?


Answer: The key assumptions of multiple linear regression are linearity,
independence of errors, homoscedasticity, absence of multicollinearity, and
normality of residuals.

3. What is the interpretation of the coefficient of determination (R-squared)?


Answer: The coefficient of determination (R-squared) represents the proportion
of the variance in the dependent variable that can be explained by the
independent variables in the regression model. It ranges from 0 to 1, where 0
indicates no explanatory power, and 1 indicates that all the variability in the
dependent variable is accounted for by the independent variables.

4. How is multicollinearity detected in multiple linear regression?


Answer: Multicollinearity in multiple linear regression can be detected through
methods such as examining pairwise correlations among the independent
variables, calculating variance inflation factor (VIF) values, and performing
auxiliary regressions.

2.6 SUMMARY

This chapter develops a comprehensive understanding of predictive analytics techniques,


with a specific focus on simple linear regression and multiple linear regression. It provides
the knowledge and practical skills necessary to apply these techniques using R or Python,
enabling one to make informed predictions and interpretations in the context of the regression
analysis.
2.7 SELF-ASSESSMENT QUESTIONS

1. What is the purpose of residual analysis in regression?


2. How do you interpret the p-value in regression analysis?
3. What is the purpose of stepwise regression?
4. What is the difference between simple linear regression and multiple linear
regression?
5. What is the purpose of interaction terms in multiple linear regression?
6. How can you assess the goodness of fit in regression analysis?

2.8 REFERENCES

1. Business Analytics: The Science of Data Driven Decision Making, First Edition
(2017), U Dinesh Kumar, Wiley, India.

2.9 SUGGESTED READINGS

1. Introduction to Machine Learning with Python, Andreas C. Mueller and Sarah Guido,
O'Reilly Media, Inc.
2. Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python. Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel. Wiley.

LESSON 3
LOGISTIC AND MULTINOMIAL REGRESSION
Anurag Goel
Assistant Professor, CSE Dept.
Delhi Technological University, New Delhi
Email-Id: anurag@dtu.ac.in

STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Logistic Function
3.4 Omnibus Test
3.5 Wald Test
3.6 Hosmer Lemeshow Test
3.7 Pseudo R Square
3.8 Classification Table
3.9 Gini Coefficient
3.10 ROC
3.11 AUC
3.12 Summary
3.13 Glossary
3.14 Answers to In-Text Questions
3.15 Self-Assessment Questions
3.16 References
3.17 Suggested Readings

3.1 LEARNING OBJECTIVES


At the end of the chapter, the students will be able to:
● become familiar with the concepts of logistic regression and multinomial logistic regression.
● understand the various evaluation metrics used to evaluate a logistic regression model.
● analyse scenarios in which a logistic regression model is appropriate.
● apply the logistic regression model to nominal and ordinal outcomes.
3.2 INTRODUCTION
In machine learning, we are often required to determine whether a particular observation belongs to a given class. In such cases, one can use logistic regression. Logistic Regression, a popular
supervised learning technique, is commonly employed when the desired outcome is a
categorical variable such as binary decisions (e.g., 0 or 1, yes or no, true or false). It finds
extensive applications in various domains, including fake news detection and cancerous cell
identification.
Some examples of logistic regression applications are as follows:
 To detect whether a given news item is fake or not.
 To detect whether a given cell is a cancerous cell or not.
In essence, logistic regression can be understood as modelling the probability of belonging to a class given a particular input variable. Since it is probabilistic in nature, the logistic regression output values lie in the range of 0 to 1.
When we think about regression from a strictly statistical perspective, the output value is generally not restricted to a particular interval. To achieve this restriction in logistic regression, we utilise the logistic function. An intuitive way to see the use of the logistic function is to view logistic regression as a simple regression model on top of whose output value we apply a logistic function, so that the final output becomes restricted to the range defined above.
Generally, logistic regression results work well when the output is of binary type, that is, it
either belongs to a specific category or it does not. This, however, is not always the case in
real-life problem statements. We may encounter a lot of scenarios where we have a
dependent variable having multiple classes or categories. In such cases, Multinomial
Regression emerges as a valuable extension of logistic regression, specifically designed to
handle multiclass problems. Multinomial Regression is the generalization of logistic
regression to multiclass problems. For example, predicting, on the basis of some analysis, the engineering branch that a student will choose for graduation is a multinomial regression problem, since the output categories (the engineering branches) are multiple. In this
multinomial regression problem, the engineering branch will be the dependent variable
predicted by the multinomial regression model while the independent variables are student’s
marks in XII board examination, student’s score in engineering entrance exam, student’s
interest areas/courses etc. These independent variables are used by the multinomial regression
model to predict the outcome i.e. engineering branch the student may opt for.

To better understand the application of multinomial regression, consider the example of


predicting a person's blood group based on the results of various diagnostic tests. Unlike
binary classification problems that involve two categories, blood group prediction involves
multiple possible outcomes. In this case, the output categories are the different blood groups,
and predicting the correct blood group for an individual becomes a multinomial regression
problem. The multinomial regression model aims to estimate the probabilities associated with
each class or category, allowing us to assign an input sample to the most likely category.
Now, let us understand this better by doing a simple walkthrough of how a multinomial
logistic regression model might work on the above example. For simplicity, let us assume we
have a well-balanced, cleaned, pre-processed and labelled dataset available with us which has
an input variable (or feature) and a corresponding output blood group. During training, our
multinomial logistic regression model will try to learn the underlying patterns and
relationships between the input features and the corresponding class labels (from training
data). Once trained, the model can utilise these learned patterns and relationships on a new (or novel) input variable to assign a probability of the input belonging to each output class using the logistic function. The model can then simply select the class with the highest probability as the predicted output of our overall model.
Thus, multinomial regression serves as a powerful extension of logistic regression, enabling
the handling of multiclass classification problems. By estimating the probabilities associated
with each class using the logistic function, it provides a practical and effective approach for
assigning input samples to their most likely categories. Applications of multinomial
regression encompass a wide range of domains, including medical diagnosis, sentiment
analysis, and object recognition, where classification tasks involve more than two possible
outcomes.

3.3 LOGISTIC FUNCTION


3.3.1 Logistic function (Sigmoid function)
The sigmoid function is represented as follows:

f(x) = 1 / (1 + e^(-x))
It is a mathematical function that assigns values between 0 and 1 based on the input variable.
It is characterized by its S-shaped curve and is commonly used in statistics, machine learning,
and neural networks to model non-linear relationships and provide probabilistic
interpretations.
3.3.2 Estimation of probability using logistic function
The logistic function is often used for estimating probabilities in various fields. By applying
the logistic function to a linear combination of input variables, such as in logistic regression,
it transforms the output into a probability value between 0 and 1. This allows for the
prediction and classification of events based on their likelihoods.
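A tiny Python sketch (with hypothetical, made-up coefficients) shows how the logistic function turns a linear combination of inputs into a probability:

import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted coefficients of a logistic regression with two inputs
b0, b1, b2 = -1.5, 0.8, 0.4

# Estimated probability of the positive class for an input (x1, x2) = (2.0, 3.0)
x1, x2 = 2.0, 3.0
p = sigmoid(b0 + b1 * x1 + b2 * x2)
print(f"Estimated probability: {p:.3f}")   # classify as 1 if p >= 0.5, else 0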

3.4 OMNIBUS TEST


The Omnibus test is a statistical test used to test the significance of several model parameters at
once. It examines whether the combined effect of the predictors is statistically significant.
The Omnibus statistic is calculated by examining the difference in deviance between the full
model (with predictors) and the reduced model (without predictors) to derive its formula:

Omnibus = (Dr - Df) / Dr
where Dr represents the deviance of the reduced model (without predictors) and Df represents
the deviance of the full model (with predictors).
The Omnibus test statistic approximately follows chi-square distribution with degrees of
freedom given by the difference in the number of predictors between the full and reduced
models. By comparing the test statistic to the chi-square distribution and calculating the
associated p-value, we can calculate the collective statistical significance of the predictor
variables.
When the calculated p-value is lower than a predefined significance level (e.g., 0.05), we
reject the null hypothesis, indicating that the group of predictor variables collectively has a
statistically significant influence on the dependent variable. On the other hand, if the p-value
exceeds the significance level, we fail to reject the null hypothesis, suggesting that the
predictors may not have a significant collective effect.
The Omnibus test provides a comprehensive assessment of the overall significance of the
predictor variables within a regression model, aiding in the understanding of how these
predictors jointly contribute to explaining the variation in the dependent variable.

Let's consider an example where we have a regression model with three predictor variables
(X1, X2, X3) and a continuous dependent variable (Y). We want to assess the overall
significance of these predictors using the Omnibus test.
Here is a sample dataset with the predictor variables and the dependent variable:
X1 X2 X3 Y
2.5 6 8 10.2
3.2 4 7 12.1
1.8 5 6 9.5
2.9 7 9 11.3
3.5 5 8 13.2
2.1 6 7 10.8
2.7 7 6 9.7
3.9 4 9 12.9
2.4 5 8 10.1
2.8 6 7 11.5

Step 1: Fit the Full Model


We start by fitting the full regression model that includes all three predictor variables:

Y = β₀ + β₁*X₁ + β₂*X₂ + β₃*X₃

By using statistical software, we obtain the estimated coefficients and the deviance of the full
model:

β₀ = 8.463, β₁ = 0.643, β₂ = 0.245, β₃ = 0.812


Deviance_full = 5.274

Step 2: Fit the Reduced Model


Next, we fit the reduced model, which only includes the intercept term:
Y = β₀
Similarly, we obtain the deviance of the reduced model:

Deviance_reduced = 15.924

Step 3: Calculate the Omnibus Test Statistic


Using the deviance values obtained from the full and reduced models, we can calculate the
Omnibus test statistic:

Omnibus = (Deviance_reduced - Deviance_full) / Deviance_reduced


= (15.924 - 5.274) / 15.924
= 0.668

Step 4: Conduct the Hypothesis Test


To assess the statistical significance of the predictors, we compare the Omnibus test statistic
to the chi-square distribution with degrees of freedom equal to the difference in the number
of predictors between the full and reduced models. In this case, the difference is 3 (since we
have 3 predictor variables).

By referring to the chi-square distribution table or using statistical software, we determine the
p-value associated with the Omnibus test statistic. Let's assume the p-value is 0.022.

Step 5: Interpret the Results


Since the p-value (0.022) is smaller than the predetermined significance level (e.g., 0.05), we
reject the null hypothesis. This indicates that the set of predictor variables (X1, X2, X3)
collectively has a statistically significant impact on the dependent variable (Y). In other
words, the predictors significantly contribute to explaining the variation in the dependent variable.

3.5 WALD TEST


The Wald test is a statistical test utilized to assess the significance of individual predictor
variables in a regression model. It examines whether the estimated coefficient for a specific
predictor differs significantly different from zero, indicating its importance in predicting the
dependent variable.

The formula for the Wald test statistic is as follows:

W = (β - β₀)² / Var(β)
where β is the estimated coefficient for the predictor variable of interest, β₀ is the
hypothesized value of the coefficient under the null hypothesis (typically 0 for testing if the
coefficient is zero) and Var(β) is the estimated variance of the coefficient.
The Wald test statistic is compared to the chi-square distribution, where the degrees of
freedom are set to 1 (since we are testing a single parameter) to obtain the associated p-value.
Rejecting the null hypothesis occurs when the calculated p-value falls below a predetermined
significance level (e.g., 0.05), indicating that the predictor variable has a statistically
significant impact on the dependent variable.
The Wald test allows us to determine the individual significance of predictor variables by
testing whether their coefficients significantly deviate from zero. It is a valuable tool for
identifying which variables have a meaningful impact on the outcome of interest in a
regression model.
Let's consider an example where we have a logistic regression model with two predictor
variables (X1 and X2) and a binary outcome variable (Y). We want to assess the significance
of the coefficient for each predictor using the Wald test.
Here is a sample dataset with the predictor variables and the binary outcome variable:
X1 X2 Y
2.5 6 0
3.2 4 1
1.8 5 0
2.9 7 1
3.5 5 1
2.1 6 0
2.7 7 1
3.9 4 0
2.4 5 0
2.8 6 1

Step 1: Fit the Logistic Regression Model


We start by fitting the logistic regression model with the predictor variables X1 and X2:
logit(p) = β₀ + β₁*X₁ + β₂*X₂
By using statistical software, we obtain the estimated coefficients and their standard errors:
β₀ = -1.613, β₁ = 0.921, β₂ = 0.372
SE(β₀) = 0.833, SE(β₁) = 0.512, SE(β₂) = 0.295
Step 2: Calculate the Wald Test Statistic
Next, we calculate the Wald test statistic for each predictor variable using the formula:
W = (β - β₀)² / Var(β)
For X1:
W₁ = (0.921 - 0)² / (0.512)² = 3.236
For X2:
W₂ = (0.372 - 0)² / (0.295)² = 1.590
Step 3: Conduct the Hypothesis Test
To assess the statistical significance of each predictor, we compare the Wald test statistic for
each variable to the chi-square distribution with 1 degree of freedom (since we are testing a
single parameter).
By referring to the chi-square distribution table or using statistical software, we determine the
p-value associated with each Wald test statistic. For these values, the p-value for X1 is approximately 0.072 and the p-value for X2 is approximately 0.207.
Step 4: Interpret the Results
For X1, since the p-value (0.072) is larger than the predetermined significance level (e.g.,
0.05), we fail to reject the null hypothesis. This suggests that the coefficient for X1 is not
statistically significantly different from zero, indicating that X1 may not have a significant
effect on the binary outcome variable Y.
Similarly, for X2, since the p-value (0.207) is larger than the significance level, we fail to
reject the null hypothesis. This suggests that the coefficient for X2 is not statistically
significantly different from zero, indicating that X2 may not have a significant effect on the
binary outcome variable Y.
In summary, based on the Wald tests, we do not have sufficient evidence to conclude that
either X1 or X2 has a significant impact on the binary outcome variable in the logistic
regression model.
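The same calculation can be scripted; the sketch below re-uses the coefficients and standard errors from the worked example and obtains the p-values from the chi-square distribution using scipy:

from scipy.stats import chi2

# Coefficients and standard errors from the worked example above
betas = {'X1': (0.921, 0.512), 'X2': (0.372, 0.295)}

for name, (beta, se) in betas.items():
    w = (beta - 0.0) ** 2 / se ** 2          # Wald statistic against H0: beta = 0
    p = chi2.sf(w, 1)                        # upper-tail chi-square probability, 1 df
    print(f"{name}: W = {w:.3f}, p-value = {p:.3f}")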

IN-TEXT QUESTIONS
1. What does the Wald test statistic compare to obtain the associated p-value?
a) The F-distribution
b) The t-distribution
c) The normal distribution
d) The chi-square distribution

2. What does the Omnibus test assess in a regression model?


a) The individual significance of predictor variables
b) The collinearity between predictor variables
c) The overall significance of predictor variables collectively
d) The goodness-of-fit of the regression model

3.6 HOSMER LEMESHOW TEST


The Hosmer-Lemeshow test is a statistical test used to evaluate the goodness-of-fit of a
logistic regression model. It assesses how well the predicted probabilities from the model
align with the observed outcomes.
The Hosmer-Lemeshow test is based on dividing the observations into groups or "bins" based
on the predicted probabilities of the logistic regression model. The formula for the Hosmer-
Lemeshow test statistic is as follows:

H = Σi Σj (Oij - Eij)² / Eij
Where Oij is the observed number of outcomes (events or non-events) in the ith bin and jth
outcome category, Eij is the expected number of outcomes (events or non-events) in the ith
bin and jth outcome category, calculated as the sum of predicted probabilities in the bin for
the jth outcome category.
The test statistic H follows an approximate chi-square distribution with degrees of freedom
equal to the number of bins minus the number of model parameters. A smaller p-value
obtained by comparing the test statistic to the chi-square distribution suggests a poorer fit of
the model to the data, indicating a lack of goodness-of-fit.
By conducting the Hosmer-Lemeshow test, we can determine whether the logistic regression
model adequately fits the observed data. A non-significant result (p > 0.05) indicates that the
model fits well, suggesting that the predicted probabilities align closely with the observed
outcomes. Conversely, a significant result (p < 0.05) suggests a lack of fit, indicating that the
model may not accurately represent the data.
The Hosmer-Lemeshow test is a valuable tool in assessing the goodness-of-fit of logistic
regression models, allowing us to evaluate the model's performance in predicting outcomes
based on observed and predicted probabilities.
Let's consider the example again with the logistic regression model predicting the probability
of a disease (Y) based on a single predictor variable (X). We will divide the predicted
probabilities into three bins and calculate the observed and expected frequencies in each bin.
Y X Predicted Probability
0 2.5 0.25
1 3.2 0.40
0 1.8 0.15
1 2.9 0.35
1 3.5 0.45
0 2.1 0.20
1 2.7 0.30
0 3.9 0.60
0 2.4 0.18
1 2.8 0.28

Step 1: Fit the Logistic Regression Model


By fitting the logistic regression model, we obtain the predicted probabilities for each
observation based on the predictor variable X.

Step 2: Divide the Predicted Probabilities into Bins


Let's divide the predicted probabilities into three bins: [0.1-0.3], [0.3-0.5], and [0.5-0.7].

Step 3: Calculate Observed and Expected Frequencies in Each Bin


Now, we calculate the observed and expected frequencies in each bin.

Bin: [0.1-0.3]
Total cases in bin: 3
Observed cases (Y = 1): 1
Expected cases: (0.25 + 0.20 + 0.28) * 3 = 1.23

Bin: [0.3-0.5]
Total cases in bin: 4
Observed cases (Y = 1): 2
Expected cases: (0.40 + 0.35 + 0.30 + 0.28) * 4 = 3.52

Bin: [0.5-0.7]
Total cases in bin: 3
Observed cases (Y = 1): 2
Expected cases: (0.45 + 0.60) * 3 = 3.15

Step 4: Calculate the Hosmer-Lemeshow Test Statistic


We calculate the Hosmer-Lemeshow test statistic by summing the contributions from each
bin:

HL = ((O₁ - E₁)² / E₁) + ((O₂ - E₂)² / E₂) + ((O₃ - E₃)² / E₃)

HL = ((1 - 1.23)² / 1.23) + ((2 - 3.52)² / 3.52) + ((2 - 3.15)² / 3.15)


= (0.032) + (0.670) + (0.224)
= 0.926

Step 5: Conduct the Hypothesis Test


We compare the Hosmer-Lemeshow test statistic (HL) to the chi-square distribution with 1
degree of freedom (number of bins - 2).
By referring to the chi-square distribution table or using statistical software, let's assume that
the critical value for a significance level of 0.05 is 3.841.
Since the calculated test statistic (0.926) is less than the critical value (3.841), we fail to reject
the null hypothesis. This suggests that the logistic regression model fits the data well.
Step 6: Interpret the Results
Based on the Hosmer-Lemeshow test, there is no evidence to suggest lack of fit for the
logistic regression model. The calculated test statistic (0.926) is below the critical value,
indicating good fit between the observed and expected frequencies in the different bins.
In summary, the Hosmer-Lemeshow test assesses the goodness of fit of a logistic regression
model by comparing the observed and expected frequencies in different bins of predicted
probabilities. In this example, the test result indicates that the model fits the data well.
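For larger datasets the grouping and the statistic are usually computed programmatically. The sketch below uses a hypothetical hosmer_lemeshow() helper written for illustration, on simulated outcomes and predicted probabilities; it groups observations by predicted probability and compares observed with expected counts of both events and non-events (the full form of the statistic):

import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y_true, p_hat, n_groups=10):
    """Sketch of the Hosmer-Lemeshow statistic: observed vs expected
    events and non-events within groups of predicted probability."""
    df = pd.DataFrame({'y': y_true, 'p': p_hat})
    df['group'] = pd.qcut(df['p'], q=n_groups, duplicates='drop')
    hl = 0.0
    for _, g in df.groupby('group', observed=True):
        obs1, exp1 = g['y'].sum(), g['p'].sum()              # events
        obs0, exp0 = (1 - g['y']).sum(), (1 - g['p']).sum()  # non-events
        hl += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    dof = df['group'].nunique() - 2
    return hl, chi2.sf(hl, dof)

# Hypothetical outcomes and model-predicted probabilities
rng = np.random.default_rng(5)
p_hat = rng.uniform(0.05, 0.95, 200)
y_true = rng.binomial(1, p_hat)          # outcomes generated from those probabilities
stat, p_value = hosmer_lemeshow(y_true, p_hat, n_groups=5)
print(f"HL statistic = {stat:.3f}, p-value = {p_value:.3f}")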

3.7 PSEUDO R SQUARE


Pseudo R-square is a measure used in regression analysis, particularly in logistic regression,
to assess the proportion of variance in the dependent variable explained by the predictor
variables. It is called "pseudo" because it is not directly comparable to the R-squared used in
linear regression.
There are various methods to calculate Pseudo R-squared, and one commonly used method is
Nagelkerke's R-squared. Expressed in terms of the model likelihoods, Nagelkerke's R-squared is:

R²_Nagelkerke = [1 - (ℒnull / ℒmodel)^(2/n)] / [1 - (ℒnull / ℒmax)^(2/n)]

where ℒmodel is the likelihood of the full model, ℒnull is the likelihood of the null
model (a model with only an intercept term), ℒmax is the likelihood of a model with
perfect prediction (a hypothetical model that perfectly predicts all outcomes, so ℒmax = 1), and n is the number of observations.
Nagelkerke's R-squared ranges from 0 to 1, with 0 indicating that the predictors have no
explanatory power, and 1 suggesting a perfect fit of the model. However, it is important to
note that Nagelkerke's R-squared is an adjusted measure and should not be interpreted in the
same way as R-squared in linear regression.


Pseudo R-squared provides an indication of how well the predictor variables explain the
variance in the dependent variable in logistic regression. While it does not have a direct
interpretation as the proportion of variance explained, it serves as a relative measure to
compare the goodness-of-fit of different models or assess the improvement of a model
compared to a null model.
One commonly used pseudo R-squared measure is the Cox and Snell R-squared. Let's
calculate the Cox and Snell R-squared using the given example of a logistic regression model
with two predictor variables.
X1 X2 Y

2.5 6 0

3.2 4 1

1.8 5 0

2.9 7 1

3.5 5 1

2.1 6 0

2.7 7 1

3.9 4 0

2.4 5 0

2.8 6 1

Step 1: Fit the Logistic Regression Model


By fitting the logistic regression model using the predictor variables X1 and X2, we obtain
the estimated coefficients for each predictor.
Step 2: Calculate the Null Log-Likelihood (LL0)
To calculate the null log-likelihood, we fit a null model with only an intercept term. Since the
data contain 5 observations with Y = 1 and 5 with Y = 0, the intercept-only model predicts a
probability of 0.5 for every case, so the null log-likelihood is LL0 = 10 × ln(0.5) ≈ -6.931.


Step 3: Calculate the Full Log-Likelihood (LLF)


The full log-likelihood represents the maximized log-likelihood of the fitted logistic
regression model. Let's assume that the full log-likelihood (LLF) is -2.721.

Step 4: Calculate the Cox and Snell R-Squared


Using the formula R²_CS = 1 - exp[(2 / n) × (LL0 - LLF)], which expresses the Cox and Snell
R-squared in terms of the log-likelihoods, we can calculate the statistic.
Given:
LL0 = -6.931
LLF = -2.721
n = 10 (number of observations)

R²_CS = 1 - exp[(2 / 10) × (-6.931 - (-2.721))]

      = 1 - exp(-0.842)
      = 1 - 0.4309
      = 0.5691

Step 5: Interpret the Results


The calculated Cox and Snell R-squared is approximately 0.5691. Although a pseudo R-squared
should not be read as an exact percentage of variance explained (and the Cox and Snell measure
cannot even reach 1), this value indicates that the predictors add substantial explanatory power
over the null model.
In summary, based on the calculations, the Cox and Snell R-squared for the logistic
regression model with X1 and X2 as predictors is approximately 0.5691, suggesting a
moderate-to-good improvement over the null model.
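For reference, the calculation can be reproduced in a few lines of Python. This is a minimal sketch, using the same partly assumed log-likelihood values as the example, that also shows the related Nagelkerke rescaling.

import math

ll_null = -6.931      # log-likelihood of the intercept-only model (10 x ln 0.5)
ll_full = -2.721      # assumed log-likelihood of the fitted model
n = 10                # number of observations

r2_cox_snell = 1 - math.exp((2 / n) * (ll_null - ll_full))
r2_max = 1 - math.exp((2 / n) * ll_null)      # maximum attainable Cox and Snell value
r2_nagelkerke = r2_cox_snell / r2_max         # rescaled so that a perfect model gives 1

print(round(r2_cox_snell, 4), round(r2_nagelkerke, 4))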

3.8 CLASSIFICATION TABLE


To understand the classification table, let’s consider a binary classification problem of
detecting whether an input cell is a cancerous cell or not. Consider a logistic regression model
X implemented for this classification problem on a dataset of 100 random cells, of which
10 are cancerous and 90 are non-cancerous. Suppose the model X
outputs 20 input cells as cancerous and the remaining 80 as non-cancerous. Out of the
predicted cancerous cells, only 5 are actually cancerous according to the ground truth, while
the other 15 are non-cancerous. On the other hand, out of the predicted non-cancerous cells,
75 are indeed non-cancerous in the ground truth, but 5 are cancerous. Here, a cancerous cell is
treated as the positive class and a non-cancerous cell as the negative class for the given
classification problem. Now, we define the four
primary building blocks of the various evaluation metrics of classification models as follows:
True Positive (TP): The number of input cells for which the classification model X correctly
predicts that they are cancerous cells is referred to as True Positive. For example, for the
model X, TP = 5.
True Negative (TN): The number of input cells for which the classification model X
correctly predicts that they are non-cancerous cells is referred to as True Negative. For
example, for the model X, TN = 75.
False Positive (FP): The number of input cells for which the classification model X
incorrectly predicts that they are cancerous cells is referred to as False Positive. For example,
for the model X, FP = 15.
False Negative (FN): The number of input cells for which the classification model X
incorrectly predicts that they are non-cancerous cells is referred to as False Negative.
For example, for the model X, FN = 5.

                              Actual
                              Cancerous      Non-Cancerous
Predicted   Cancerous         TP = 5         FP = 15
            Non-Cancerous     FN = 5         TN = 75

Fig 3.2: Classification Matrix

3.8.1 Sensitivity
Sensitivity, also referred to as True Positive Rate or Recall, is calculated as the ratio of
correctly predicted cancerous cells to the total number of cancerous cells in the ground truth.
To compute sensitivity, you can use the following formula:
Sensitivity = TP / (TP + FN)
For model X, sensitivity = 5 / (5 + 5) = 0.5.


3.8.2 Specificity
Specificity is defined as the ratio of the number of input cells that are correctly predicted as non-
cancerous to the total number of non-cancerous cells in the ground truth. Specificity is also
known as True Negative Rate. To compute specificity, we can use the following formula:
Specificity = TN / (TN + FP)
For model X, specificity = 75 / (75 + 15) ≈ 0.83.

3.8.3 Accuracy
Accuracy is calculated as the ratio of correctly classified cells to the total number of cells. To
compute accuracy, you can use the following formula:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
For model X, accuracy = (5 + 75) / 100 = 0.8.

3.8.4 Precision
Precision is calculated as the ratio of the correctly predicted cancerous cells to the total
number of cells predicted as cancerous by the model. To compute precision, you can use the
following formula:
Precision = TP / (TP + FP)
For model X, precision = 5 / (5 + 15) = 0.25.

3.8.5 F1-score
The F1-score is calculated as the harmonic mean of Precision and Recall. To compute the F1-
score, you can use the following formula:
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
For model X, F1-score = 2 × (0.25 × 0.5) / (0.25 + 0.5) ≈ 0.33.
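The metrics defined above can be verified with a short Python sketch (an illustration, not part of the original text) that works directly from the counts TP = 5, FP = 15, FN = 5 and TN = 75 of model X:

TP, FP, FN, TN = 5, 15, 5, 75

sensitivity = TP / (TP + FN)                      # recall / true positive rate = 0.50
specificity = TN / (TN + FP)                      # true negative rate, about 0.83
accuracy = (TP + TN) / (TP + TN + FP + FN)        # 0.80
precision = TP / (TP + FP)                        # 0.25
f1_score = 2 * precision * sensitivity / (precision + sensitivity)   # about 0.33

print(sensitivity, specificity, accuracy, precision, round(f1_score, 3))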

IN-TEXT QUESTIONS

3. For the model X results on the given dataset of 100 cells, the precision of model is
a) 0 b) 0.25
c) 0.5 d) 1
4. For the model X results on the given dataset of 100 cells, the recall of model is
a) 0 b) 0.25
c) 0.5 d) 1


3.9 GINI COEFFICIENT


A metric used to assess inequality is the Gini coefficient, also referred to as the Gini index.
The Gini coefficient has a value between 0 and 1. The performance of the model improves
with increasing Gini coefficient values. The Gini coefficient can be computed from the AUC of
the ROC curve using the formula:
Gini coefficient = (2 × AUC) - 1

3.10 ROC
In logistic regression and other machine learning techniques in particular, the performance of a
binary classification model is assessed using a graphical representation called the Receiver
Operating Characteristic (ROC) curve. It illustrates the trade-off between the true positive rate
(sensitivity) and the false positive rate (1 minus specificity) at various classification
thresholds.

Plotting the true positive rate (TPR) against the false positive rate (FPR) at various
classification thresholds results in the ROC curve. The formulas for TPR and FPR are as
follows:
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)


We may evaluate the model's capacity to distinguish between positive and negative examples
at various classification levels using the ROC curve. With a TPR of 1 and an FPR of 0, a
perfect classifier would have a ROC curve that reaches the top left corner of the plot. The
model's discriminatory power increases as the ROC curve moves closer to the top left corner
(that is, the further it lies above the diagonal of a random classifier).

3.11 AUC
When employing a Receiver Operating Characteristic (ROC) curve, the Area Under the
Curve (AUC) is a statistic used to assess the effectiveness of a binary classification model.
The AUC represents the probability that a randomly selected positive instance receives a
higher predicted probability than a randomly selected negative instance.
The AUC is calculated by integrating the ROC curve. However, it is important to note that
the AUC does not have a specific formula since it involves calculating the area under a curve.
Instead, it is commonly calculated using numerical methods or software.
The AUC value ranges between 0 and 1. A model with an AUC of 0.5 indicates a random
classifier, where the model's predictive power is no better than chance. An AUC value that is
nearer 1 indicates a classifier that is more accurate and is better able to distinguish between
positive and negative situations. Conversely, an AUC value closer to 0 suggests poor
performance, with the model performing worse than random guessing.
In binary classification tasks, the AUC is a commonly used statistic since it offers a
succinct summary of the model's performance across different classification thresholds. It is
especially useful when the dataset is imbalanced, i.e., when the numbers of positive and
negative instances differ significantly.
In conclusion, the AUC measure evaluates a binary classification model's total discriminatory
power by delivering a single value that encapsulates the model's capacity to rank cases
properly. Better classification performance is shown by higher AUC values, whilst worse
performance is indicated by lower values.
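As an illustration, the ROC curve, the AUC and the Gini coefficient can be obtained with scikit-learn. The sketch below is an assumption, not part of the original text, and reuses the ten observations from the Hosmer-Lemeshow example.

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
y_score = [0.25, 0.40, 0.15, 0.35, 0.45, 0.20, 0.30, 0.60, 0.18, 0.28]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points on the ROC curve
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve
gini = 2 * auc - 1                                  # Gini coefficient derived from AUC

print("AUC:", auc, "Gini:", gini)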


IN-TEXT QUESTIONS
5. Which of the following illustrates trade-off between True Positive Rate and
False Positive Rate?
a) Gini Coefficient b) F1-score
c) ROC d) AUC
6. Which of the following value of AUC indicates a more accurate classifier?
a) 0.01 b) 0.25
c) 0.5 d) 0.99
7. What is the range of values for the Gini coefficient?
a) -1 to 1
b) 0 to 1
c) 0 to infinity
d) -infinity to infinity
8. How can the Gini coefficient be computed?
a) By calculating the area under the precision-recall curve
b) By calculating the area under the receiver operating characteristic (ROC)
curve
c) By calculating the ratio of true positives to true negatives.
d) By calculating the ratio of false positives to false negatives.

3.12 SUMMARY
Logistic regression is used to solve classification problems by producing probability values
in the range 0 to 1. It uses the logistic (sigmoid) function. Multinomial regression is the
generalization of logistic regression to multiclass
problems. Omnibus test is a statistical test utilized to test the significance of several model
parameters at once. Wald test is a statistical test used to assess the significance of individual
predictor variables in a regression model. Hosmer-Lemeshow test is a statistical test
employed to assess the adequacy of a logistic regression model. Pseudo R-square is a
measure to assess the proportion of variance in the dependent variable explained by the
predictor variables. There are various classification metrics namely Sensitivity, Specificity,
Accuracy, Precision, F-score, Gini Coefficient, ROC and AUC, which are utilized to evaluate
the performance of a classifier model.


3.13 GLOSSARY
Terms                   Definition

Omnibus test            A statistical test used to test the significance of multiple model
                        parameters simultaneously.

Wald test               A statistical test used to evaluate the significance of each individual
                        predictor variable within a regression model.

Hosmer-Lemeshow test    A statistical test utilized to assess the adequacy of fit for a logistic
                        regression model.

Pseudo R-square         A metric used to evaluate the portion of variability in the dependent
                        variable that can be accounted for by the predictor variables.

F1-score                The harmonic mean of Precision and Recall.

ROC curve               A curve that demonstrates the balance between the true positive rate
                        and the false positive rate across various classification thresholds.

Gini Coefficient        A metric used to measure inequality.

3.14 ANSWERS TO INTEXT QUESTIONS

1. (d) The chi-square distribution
2. (c) The overall significance of predictor variables collectively
3. (b) 0.25
4. (c) 0.5
5. (c) ROC
6. (d) 0.99
7. (b) 0 to 1
8. (b) By calculating the area under the receiver operating characteristic (ROC) curve

3.15 SELF-ASSESSMENT QUESTIONS


1. Differentiate between Linear Regression and Logistic Regression.
2. Differentiate between Sensitivity and Specificity.

3. Define True Positive Rate and False Positive Rate.


4. Consider a logistic regression model X that is applied on a problem of classifying a
statement is hateful or not. Consider a dataset D of 100 statements containing equal
number of hateful statements and non-hateful statements. Suppose the model X is
classifying all the input statements as hateful. Comment on the precision and recall
values of the model X.
5. Define F-score and Gini Index.
6. Explain the use of ROC curve and AUC of a ROC curve.

3.16 REFERENCES
 LaValley, M. P. (2008). Logistic regression. Circulation, 117(18), 2395-2399.
 Wright, R. E. (1995). Logistic regression.
 Chatterjee, S., & Simonoff, J. S. (2013). Handbook of regression analysis. John Wiley & Sons.
 Kleinbaum, D. G., Dietz, K., Gail, M., & Klein, M. (2002). Logistic regression. Springer-Verlag.
 DeMaris, A. (1995). A tutorial in logistic regression. Journal of Marriage and the Family, 956-968.
 Osborne, J. W. (2014). Best practices in logistic regression. Sage Publications.
 Bonaccorso, G. (2017). Machine learning algorithms. Packt Publishing.

3.17 SUGGESTED READINGS


 Huang, F. L. (2022). Alternatives to logistic regression models in experimental
studies. The Journal of Experimental Education, 90(1), 213-228.
 https://towardsdatascience.com/logistic-regression-in-real-life-building-a-daily-productivity-classification-model-a0fc2c70584e


LESSON 4
DECISION TREE AND CLUSTERING
Dr. Sanjay Kumar
Dept. of Computer Science and Engineering,
Delhi Technological University,
Email-Id: sanjay.kumar@dtu.ac.in

STRUCTURE

4.1 Learning Objectives


4.2 Introduction
4.3 Classification and Regression Tree
4.4 CHAID
4.4 Impurity Measures
4.5 Ensemble Methods
4.6 Clustering
4.7 Summary
4.8 Glossary
4.9 Answers to In-Text Questions
4.10 Self-Assessment Questions
4.11 References
4.12 Suggested Readings

4.1 LEARNING OBJECTIVES

At the end of the chapter, the students will be able to:


● Explore the concept of decision trees and their components
● Evaluate attribute selection measures
● Understand ensemble methods
● Comprehend the random forest algorithm
● Explore the concept of clustering and its types
● Comprehend distance and similarity measures
● Evaluate cluster quality

4.2 INTRODUCTION

Decision Tree is a popular machine learning approach for classification and regression tasks.
Its structure is similar to a flowchart, where internal nodes represent features or attributes,
branches depict decision rules, and leaf nodes signify outcomes or predicted values. The data
are divided recursively according to feature values by the decision tree algorithm to create the
tree. It chooses the best feature for data partitioning at each stage by analysing parameters
such as information gain or Gini impurity. The goal is to divide the data into homogeneous
subsets within each branch to increase the tree's capacity for prediction.

Fig 4.1: Decision Tree for classification scenario of a mammal

Once the tree has been constructed, it can be used to generate predictions on fresh, unseen
data by choosing a path through the tree based on feature values. Figure 4.1 shows a decision
tree that helps classify an animal based on a series of questions. The flowchart begins with
the question, "Is it a mammal?" If the answer is
"Yes," we follow the branch on the left. The next question asks, "Does it have spots?" If the
answer is "Yes," we conclude that it is a leopard. If the answer is "No," we determine it is a
cheetah.


If the answer to the initial question, "Is it a mammal?" is "No," we follow the branch on the
right, which asks, "Is it a bird?" If the answer is "Yes," we classify it as a parrot. If the answer
is "No," we classify it as a fish.
Thus, the decision tree demonstrates a classification scenario where we aim to determine the
type of animal based on specific attributes. By following the flowchart, we can systematically
navigate through the questions to reach a final classification.
4.3 CLASSIFICATION AND REGRESSION TREE
A popular machine learning approach for classification and regression tasks is called the
Classification and Regression Tree (CART). It is a decision tree-based model that divides the
data into subsets according to the values of the input features and then predicts the target
variable using the tree structure.
CART is especially popular because of how easy it is to interpret. Each internal node
represents a test on a specific feature, and each leaf node represents a class label or a
predicted value, forming a binary tree structure. The method divides the data iteratively
according to the features with the goal of producing homogeneous subsets with regard to the
target variable.
In classification tasks, CART measures the impurity or disorder within each node using a
criterion like Gini impurity or entropy. Selecting the best feature and split point at each node
aims to reduce this impurity. The outcome is a tree that correctly categorises new instances
according to their feature values. In regression problems, CART measures the quality of each
split using a metric called mean squared error (MSE). In order to build a tree that can forecast
the continuous target variable, it searches for the feature and split point that minimises the
MSE.
Example: Let's suppose we have a dataset of patients and we want to predict whether they
have heart disease based on their age and cholesterol level. The dataset contains the
following information:
Age Cholesterol Disease
45 180 Yes

50 210 No

55 190 Yes


60 220 No

65 230 Yes
70 200 No

Using the CART algorithm, we can build a decision tree to make predictions. The decision
tree may look like this:

Fig 4.2: Predicting Disease based on Age and Cholesterol Levels


The decision tree in this illustration begins at the node at the top, which evaluates the
condition "Age ≤ 55." If the patient's age is at most 55, we proceed to the left branch and
examine the "Cholesterol ≤ 200" condition. The prediction is "Yes Disease" if the patient's
cholesterol level is less than or equal to 200 (as for the patients aged 45 and 55 in the table),
and "No Disease" if the cholesterol level is more than 200.
However, if the patient is older than 55, we switch to the right branch, where "No Disease" is
predicted regardless of the cholesterol level.
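A tree of this kind can be fitted with scikit-learn. The sketch below is a minimal illustration, not taken from the original text, using the six patients from the table above.

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[45, 180], [50, 210], [55, 190], [60, 220], [65, 230], [70, 200]]   # age, cholesterol
y = ["Yes", "No", "Yes", "No", "Yes", "No"]                               # heart disease label

cart = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
cart.fit(X, y)

print(export_text(cart, feature_names=["Age", "Cholesterol"]))   # show the learned splits
print(cart.predict([[52, 195]]))                                  # classify a new patient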

4.3 CHAID

4.3.1 Chi-Square Automatic Interaction Detection


CHAID (Chi-Square Automatic Interaction Detection) is a statistical method used to analyze
the interaction between different categories of variables. It is particularly useful when
working with data that involves categorical variables, which represent different groups or
categories. The CHAID algorithm aims to identify meaningful patterns by dividing the data
into groups based on various categories of variables. This is achieved through the application
of statistical tests, particularly the chi-square test. The chi-square test helps determine if there
is a significant relationship between the categories of a variable and the outcome of interest.
It divides the data into smaller groups. It repeats this procedure for each of these smaller
groups in order to find other categories that might be significantly related to the result. The
leaves on the tree indicate the expected outcomes, and each branch represents a distinct
category.
Calculate the Chi-Square statistic (χ^2):

χ² = Σ [(O - E)² / E]        (1.1)
O represents the observed frequencies in each category or cell of a contingency table.
E represents the expected frequencies under the assumption of independence between
variables.

Fig 4.3: Determining Satisfaction Levels of customer


This flowchart shows how CHAID gradually divides the dataset into subsets according to the
most important predictor factors, resulting in a hierarchical structure. It enables us to clearly
and orderly visualise the links between the variables and their effects on the target variable
(Customer Satisfaction).
Age Group is the first variable on the flowchart, and it has two branches: "Young" and
"Middle-aged." We further examine the Gender variable within the "Young" branch, resulting
in branches for "Male" and "Female." The Purchase Frequency variable is next examined for
each gender subgroup, yielding three branches: "Low," "Medium," and "High." We arrive at
the leaf nodes, which represent the customer satisfaction outcome and are either "Satisfied"
or "Not Satisfied."
4.3.2 Bonferroni Correction
The Bonferroni correction is a statistical method used to adjust the significance levels (p-
values) when conducting multiple hypothesis tests at the same time. It helps control the
overall chance of falsely claiming a significant result by making the criteria for significance
more strict.
To apply the Bonferroni correction, we divide the desired significance level (usually denoted
as α) by the number of tests being performed (denoted as m). This adjusted significance level,
denoted as α' or α_B, becomes the new threshold for determining statistical significance.
Mathematically, the Bonferroni correction can be represented as:

α' = α / m        (1.2)
For example, suppose we are conducting 10 hypothesis tests, and we want a significance
level of 0.05 (α = 0.05). By applying the Bonferroni correction, we divide α by 10, resulting
in an adjusted significance level of:

α' = 0.05 / 10 = 0.005        (1.3)
Now, when we assess the p-values obtained from each test, we compare them against the
adjusted significance level (α') instead of the original α. If a p-value is less than or equal to α',
we consider the result to be statistically significant.
Let's consider an example. Suppose we have conducted 10 independent hypothesis tests, and
we obtain p-values of 0.02, 0.07, 0.01, 0.03, 0.04, 0.09, 0.06, 0.08, 0.05, and 0.02. Using the
Bonferroni correction with α of 0.05 and m = 10, the adjusted significance level becomes α' =
0.05 / 10 = 0.005.


Comparing each p-value to the adjusted significance level (α' = 0.005), we find:
- Hypothesis test 1: p-value (0.02) > α' (0.005) - Not statistically significant
- Hypothesis test 2: p-value (0.07) > α' (0.005) - Not statistically significant
- Hypothesis test 3: p-value (0.01) > α' (0.005) - Not statistically significant
- Hypothesis test 4: p-value (0.03) > α' (0.005) - Not statistically significant
- Hypothesis test 5: p-value (0.04) > α' (0.005) - Not statistically significant
- Hypothesis test 6: p-value (0.09) > α' (0.005) - Not statistically significant
- Hypothesis test 7: p-value (0.06) > α' (0.005) - Not statistically significant
- Hypothesis test 8: p-value (0.08) > α' (0.005) - Not statistically significant
- Hypothesis test 9: p-value (0.05) > α' (0.005) - Not statistically significant
- Hypothesis test 10: p-value (0.02) > α' (0.005) - Not statistically significant
Based on the Bonferroni correction, none of the ten tests is statistically significant, because
even the smallest p-value (0.01) exceeds the adjusted significance level of 0.005. Although
several of these tests would have been declared significant at the unadjusted level of 0.05, the
correction makes the criterion for significance much stricter in order to control the overall
chance of a false positive.
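A minimal Python sketch of this adjustment, not part of the original text, reproduces the comparisons for the ten p-values listed above.

p_values = [0.02, 0.07, 0.01, 0.03, 0.04, 0.09, 0.06, 0.08, 0.05, 0.02]
alpha = 0.05
alpha_adj = alpha / len(p_values)          # Bonferroni-adjusted threshold: 0.005

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p <= alpha_adj else "not significant"
    print(f"Test {i}: p = {p:.2f} -> {verdict}")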

4.4 IMPURITY MEASURES

4.4.1 Gini Impurity Index


Gini impurity index is a measure used in decision tree algorithms to evaluate the impurity or
disorder within a set of class labels. It quantifies the likelihood of a randomly selected
element being misclassified based on the distribution of class labels in a given node. The Gini
impurity index ranges from 0 to 1, where 0 represents a perfectly pure node with all elements
belonging to the same class, and 1 represents a completely impure node with an equal
distribution of elements across different classes.
To calculate the Gini impurity index, we first compute the probability of each class label
within the node by dividing the count of elements belonging to that class by the total number
of elements. Then, we square each probability and sum up the squared probabilities for all
classes. Finally, we subtract the sum from 1 to obtain the Gini impurity index.
Mathematically, the formula for Gini impurity index is as follows:

Gini = 1 - Σᵢ pᵢ²        (1.4)


where pᵢ represents the probability of class label i in the node.
By using the Gini impurity index, decision tree algorithms can make decisions on how to split
the data by selecting the feature and threshold that minimize the impurity after the split. A
lower Gini impurity index indicates a more homogeneous distribution of class labels, which
helps in creating pure and informative branches in the decision tree.
Example:
Suppose we have a dataset with 50 samples and two classes, "A" and "B". The table below
shows the distribution of class labels for a particular node in a decision tree:
Number of samples      Class

20 A

10 B

15 A

5 B

To calculate the Gini impurity index, we follow these steps:


Calculate the probability of each class label:
Probability of class A = (20 + 15) / 50 = 35 / 50 = 0.7
Probability of class B = (10 + 5) / 50 = 15 / 50 = 0.3
Square each probability:
Square of 0.7 = 0.49
Square of 0.3 = 0.09
Sum up the squared probabilities:
0.49 + 0.09 = 0.58
Subtract the sum from 1 to obtain the Gini impurity index:
Gini Index = 1 - 0.58 = 0.42
So, the Gini impurity index for this particular node is 0.42. This value represents the impurity
or disorder within the node, with a lower Gini impurity index indicating a more homogeneous
distribution of class labels.
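The same value can be obtained with a few lines of Python; the following is a minimal sketch, not part of the original text.

def gini_impurity(class_counts):
    """Gini impurity for a node, given the count of samples in each class."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)

print(gini_impurity([35, 15]))   # 35 samples of class A, 15 of class B -> 0.42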


4.4.2 Entropy
Entropy is a concept used in information theory and decision tree algorithms to measure the
level of uncertainty or disorder within a set of class labels. It helps us understand how mixed
or impure the class distribution is in a given node. The entropy value is calculated based on
the probabilities of each class label within the node.
To compute entropy, we start by determining the probability of each class label. This is done
by dividing the count of elements belonging to a particular class by the total number of
elements. Next, we apply the logarithm (typically base 2) to each probability, multiply it by
the probability itself, and sum up these values. Finally, we take the negative of the sum to
obtain the entropy value.
Mathematically, the formula for entropy is as follows:

Entropy = - Σᵢ pᵢ log₂(pᵢ)        (1.5)

where pᵢ represents the probability of class label i in the node.


By using entropy, decision tree algorithms can assess the impurity within a node and
determine the feature and threshold that minimize the entropy after the split. A lower entropy
value indicates a more homogeneous distribution of class labels, leading to more informative
and accurate splits in the decision tree.
Example:
Let's consider a node with 80 samples, where 60 samples belong to class A and 20 samples
belong to class B. The probability of class A is 60/80 = 0.75, and the probability of class B is
20/80 = 0.25. Applying the base-2 logarithm to these probabilities, we get -0.415 and -2.000,
respectively. Multiplying these values by their probabilities and summing them up, we obtain
(0.75 × -0.415) + (0.25 × -2.000) = -0.811. Taking the negative of this sum, the entropy for
this node is 0.811.
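A minimal sketch of the same calculation in Python, not part of the original text:

import math

def entropy(class_counts):
    """Entropy (base 2) for a node, given the count of samples in each class."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(round(entropy([60, 20]), 3))   # 0.811 for the 60/20 node above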
4.4.3 Cost-based splitting criteria
Cost-based splitting criteria in decision trees involve considering cost-related measures when
determining how to split the data at each node of the tree. Cost-based criteria consider the
associated costs or penalties of misclassification, whereas standard decision tree algorithms
concentrate on metrics like information gain or Gini impurity. Misclassifying instances of one
class might occasionally have a bigger effect or cost than misclassifying instances of another
class.


The goal of cost-based splitting criteria is to minimize the overall cost or expenses related to
misclassification by selecting the best feature and split point at each node. Instead of solely
maximizing information gain or reducing impurity, the algorithm assesses the cost associated
with potential misclassifications. The specific cost-based measure used depends on the
problem domain and the assigned costs for different types of misclassifications. For instance,
in a medical diagnosis scenario, misclassifying a severe condition as a less severe one might
incur a higher cost compared to the opposite error.
Example:
Let's consider a dataset of 30 fruits, where each fruit has two features: colour (red, green, or
orange) and diameter (small or large). The target variable is the type of fruit, which can be
"Apple" or "Orange". We also have costs associated with misclassifications: $10 for each
false positive (classifying an orange as an apple) and $5 for each false negative (classifying
an apple as an orange).
When using cost-based splitting criteria, the decision tree algorithm considers the features
(colour and diameter) to find the optimal split that minimizes the overall cost. For simplicity,
let's assume the first split is based on the colour feature. The algorithm assesses the costs
associated with misclassification for each colour category and chooses the colour that results
in the lowest expected cost. For instance, if the algorithm determines that splitting the data
based on colour between "Red/Green" and "Orange" fruits minimizes the expected cost, it
proceeds to evaluate the diameter feature for each branch. The algorithm continues this
recursive process of splitting the data until it constructs a complete decision tree.
The resulting decision tree may look like this:

Fig 4.4: Split based on colour and diameter



The decision tree in the above picture displays splits depending on colour and diameter,
resulting in the labelling of fruits as "Apple" or "Orange" at the leaf nodes. Now we can use
the decision tree to predict the type of a new fruit when its colour and diameter are
known. The model determines the fruit's expected class (apple or orange) by tracing the
path down the tree based on the provided attributes.

4.5 ENSEMBLE METHODS

4.5.1 Introduction to Ensemble Methods


Ensemble methods are used in machine learning to increase prediction accuracy and
generalisation. Rather than relying on a single model, ensemble methods construct an
ensemble from several models, each trained on a distinct subset of the data or with a different
algorithm. Two common approaches for combining the models in an ensemble are as follows:
Voting: According to this method, each model in the ensemble provides a prediction, and the
outcome is decided by considering the weighted majority of all the individual models. For
instance, when performing classification tasks, the ensemble selects the class that receives the
most model votes as its final prediction.
Averaging: The predicted outcomes of each individual model are integrated using this
method by averaging their results. This method is frequently employed in regression tasks
because it enables the ensemble to provide a final prediction by averaging the values
predicted by each model.
Some popular ensemble methods include Random Forest, AdaBoost, Gradient Boosting, and
Bagging. These methods have been widely adopted in various domains and have shown
significant improvements in prediction performance compared to using a single model alone.
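As an illustration of the voting approach, the sketch below (an assumption, not part of the original text) combines three different classifiers with scikit-learn's VotingClassifier on a small synthetic dataset.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("tree", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB())],
    voting="hard")                       # majority vote of the three models
ensemble.fit(X, y)

print("Training accuracy:", ensemble.score(X, y))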
Example:
Suppose our goal is to determine whether a specific email is spam or not. We have a dataset
that includes details on email sender, subject, and content. Using bagging, we can assemble
several decision tree models. A random subset of the data is used to train each decision tree,
and their predictions are aggregated via majority voting.
Each decision tree in this ensemble will identify various spam email patterns and
characteristics. While some trees may concentrate on certain words or phrases, others may
take sender information into consideration. Each decision tree in the ensemble will make a

forecast when a new email is received, and the final prediction will be based on the consensus
of all the decision trees.
4.5.2 Random Forest
Random Forest is a widely used ensemble learning technique in machine learning, applied to
both classification and regression tasks. It builds many decision trees and aggregates the
outcomes of the individual trees to make predictions. Each tree is grown on a distinct subset
of the training data, produced using a technique known as bootstrap sampling, in which data
points are drawn at random with replacement. Further randomness is added by considering a
random subset of features for splitting at each node of the decision tree.
First, we create a bunch of decision trees, each using a different set of data. We randomly
pick some of the data for each tree, which helps add variety to the predictions. Next, we train
each decision tree by dividing the data into smaller groups based on different features. We
want the trees to be different from each other, so we use random subsets of features to make
the divisions. For example, if we're trying to classify something, each tree votes for the class
it thinks is correct. The final prediction is based on the majority vote.
Step-by-step explanation of how the Random Forest algorithm works:
Random Sampling: The algorithm starts by randomly selecting subsets of the training data
from the original dataset. Each subset is constructed by randomly selecting data points with
replacement. These subsets are used to build individual decision trees.
Tree Construction: Recursive partitioning is a technique used to build a decision tree for
each subset of the training data. A random subset of features is taken into account for
splitting at each node of the tree. Each tree is unique thanks to the randomness, which also
lessens the correlation between the trees.
Voting and Aggregation: When making predictions, each tree in the Random Forest
independently predicts the class label (for classification tasks) or the target value (for
regression tasks). For classification, the final prediction is chosen by a majority vote; for
regression, the predictions are averaged. This voting and aggregation procedure enhances the
overall prediction accuracy.
Random Forest has several key features and advantages:


Robustness against overfitting: The Random Forest is more resilient to noise or outliers in
the data thanks to the integration of many decision trees, which also helps to avoid
overfitting.
Feature importance estimation: The features that have the greatest impact on the
predictions are identified by Random Forest using a measure of feature importance. With this
knowledge, features may be chosen and underlying relationships in the data can be
understood.
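A minimal scikit-learn sketch, not part of the original text, of training a random forest and reading its feature-importance scores on a small synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)   # 100 bootstrapped trees
forest.fit(X, y)

print("Feature importances:", forest.feature_importances_)
print("Prediction for the first sample:", forest.predict(X[:1]))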

4.6 CLUSTERING

Cluster analysis, commonly referred to as clustering, is a machine learning technique that


categorizes datasets without labels. In this method, objects that share similarities are grouped
together while maintaining minimal or no similarities with other groups. By relying solely on
the inherent characteristics of the data, clustering algorithms unveil patterns and associations.
Their objective is to divide a collection of data points into distinct clusters or groups, where
objects within the same cluster exhibit greater similarity to each other compared to those in
different clusters. The primary aim is to maximize the similarity within clusters while
minimizing the similarity between clusters.
Let's look at the clustering technique in action using the real-world example of a shopping
mall: when we go to a shopping centre, we notice that items with comparable uses are placed
together. T-shirts, for example, are arranged in one section and pants in another; similarly, in
the fruit and vegetable section, apples, bananas, mangoes, and so on are grouped in distinct
sections so that we can easily find what we're looking for.
Fig 4.5 shows images of uncategorized and categorized data in the form of three types of
fruits mixed together. The left side shows the uncategorized data, while the right side shows
the categorized data, i.e., groups of the same fruit.
 Points in the same cluster are similar

 Points in the different clusters are dissimilar


Fig 4.5: Clustering from mixed input

The above diagrams (Fig. 4.5) show that the different fruits are divided into different clusters
or groups with similar properties.

4.6.1 Characteristics of Clustering


Clustering analysis possesses several distinct characteristics that make it a powerful tool in
data analysis. Firstly, it is an unsupervised learning technique, meaning that it does not
require prior knowledge or labelled data to guide the clustering process. Instead, clustering
algorithms discover patterns and groupings solely based on the inherent characteristics of the
data. Secondly, clustering analysis is exploratory in nature, allowing researchers to uncover
hidden structures and relationships that may not be immediately apparent. This exploratory
aspect of clustering is data-driven and does not impose assumptions or constraints on the
structure of the data, making it applicable to a wide range of domains and data types.
4.6.2 Types of Clustering
Clustering methods are broadly categorized into two types, namely hard clustering (each data
point belongs to just one group) and soft clustering (a data point might belong to more than
one group). Beyond this distinction, several clustering techniques are available. The
following are the most common clustering approaches used in machine learning.
A) Partitioning Clustering
B) Density-Based Clustering
C) Hierarchical clustering

A) Partitioning Clustering:
Partitioning clustering is a clustering approach that seeks to divide a dataset into separate,
non-overlapping clusters. In this type of clustering, the dataset is split into a predetermined
number of groups, denoted as K. The cluster centres are positioned so as to minimize the
distance between the data points in a cluster and their own cluster centroid. Figure 4.6
illustrates the resulting partition of clusters.

Fig 4.6 Partitioning Clustering

The most well-known partitioning algorithm is K-means clustering. Here are the advantages
and disadvantages of partitioning clustering:
Advantages of Partitioning Clustering are:
 Scalability
 Ease of implementation
 Interpretability
 Applicability to various data types
Disadvantages of partitioning clustering are:
 Sensitivity to initial centroid selection
 Dependence on the number of clusters
 Limited ability to handle complex cluster shapes
K-means Clustering
This is one of the most popular clustering algorithms. It aims to partition the data into a
predetermined number of clusters (K) by minimizing the sum of squared distances between
data points and the centroid of their assigned cluster. K-means clustering is a popular and
widely used algorithm for partitioning a dataset into a predefined number of clusters. It is an
iterative algorithm that aims to minimize the within-cluster sum of squares, also known as the
total squared distance between data points and their cluster centres. The algorithm assigns
data points to clusters by iteratively updating the cluster centres until convergence. Here is a
detailed description of the K-means clustering algorithm:
 Initialization:
Specify the number of clusters K that you want to identify in the dataset. Initialize K
cluster centres randomly or using a predefined strategy, such as randomly selecting K
data points as initial centres.
 Assignment Step:
For each data point in the dataset, calculate the distance (e.g., Euclidean distance) to
each of the K cluster centres.
Assign the data point to the cluster with the nearest cluster centre, forming K initial
clusters.
 Update Step:
Calculate the new cluster centres by computing the mean (centroid) of all data points
assigned to each cluster. The cluster centre is the average of the feature values of all
data points in that cluster.
 Iteration:
Repeat the assignment and update steps until convergence or until a predefined
stopping criterion is met. In each iteration, reassign data points to clusters based on
the updated cluster centres and recalculate the cluster centres.
 Convergence:
The algorithm converges when the cluster assignments no longer change significantly
between iterations or when the maximum number of iterations is reached.
 Final Result:
Once the algorithm converges, the final result is a set of K clusters, where each data
point is assigned to one of the clusters based on its nearest cluster centre.
It is worth noting that K-means clustering is sensitive to the initial placement of cluster
centres. Different initializations can lead to different clustering results. To mitigate this, it is
common to run the algorithm multiple times with different random initializations and choose
the solution with the lowest within-cluster sum of squares as the final result. K-means
clustering has several advantages, including its simplicity, scalability to large datasets, and
efficiency.

Example:
Cluster the following 4 points in two-dimensional space using K = 2.

X1 X2

A 2 3

B 6 1

C 1 2

D 3 0

Solution:
Take the initial centroids as AB and CD, computed as:
AB = average of A and B
CD = average of C and D

X1 X2

AB 4 2

CD 2 1

Now calculate the squared Euclidean distance between each point and the two centroids and
assign each point to the closest centroid:

A B C D

AB 5 5 9 5

CD 4 16 2 2

We can observe in the table above that the squared distance between A and CD (4) is smaller
than that between A and AB (5), so A is assigned to the CD cluster. Likewise, C and D are
closest to CD, while B remains closest to AB. Two clusters are formed: ACD and B. Now
recompute the centroids of B and ACD.


We repeat the process by calculating the distance between each point and the updated
centroids and reassigning the points to the closest centroid. We continue this iteration until
the centroids no longer change significantly.
After a few iterations, the algorithm converges, and the final cluster assignments are:
Cluster 1: B
Cluster 2: ACD
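The same four points can be clustered with scikit-learn's KMeans. The sketch below is a minimal illustration, not part of the original text, and the resulting clusters can depend on the initial centroids.

from sklearn.cluster import KMeans

points = [[2, 3], [6, 1], [1, 2], [3, 0]]            # A, B, C, D

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("Cluster labels for A, B, C, D:", labels)
print("Cluster centres:", kmeans.cluster_centers_)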
B) Density-Based Clustering:
Density-based algorithms, such as DBSCAN (Density-Based Spatial Clustering of
Applications with Noise), identify clusters based on regions of high data point density. Data
points that are close to each other are grouped together, and regions of low density separate
the clusters. Density-based clustering does not assume any specific shape for clusters: it can
detect clusters of arbitrary shapes, including non-linear and irregular clusters. Such clustering
techniques also handle noise and outlier points appropriately.

Fig 4.7 Density based Clustering

C) Hierarchical Clustering:
Hierarchical clustering can be used instead of partitioning clustering because there is no need
to indicate the number of clusters to be produced. Hierarchical clustering constructs a
hierarchical structure of clusters, represented by a dendrogram, which resembles a tree-like
formation. This clustering method can be categorized into two primary approaches:
agglomerative, where individual data points begin as separate clusters and are progressively
merged, and divisive, where all data points commence in a single cluster and are recursively
divided.


Fig 4.8: Hierarchical Clustering

In this example, we have nine data points labelled A, B, C, D, E, F, G, H, and I. The


dendrogram represents the hierarchical relationships between these data points. Data points B
and C are combined into a single cluster at the first level of the dendrogram because they are
the closest to one another. The dissimilarity between B and C is reflected in the height of the
branch that connects them.
At the next level, the nearest data points D, E, and F group together to create a cluster, and
the data points G, H, and I also form a cluster. The branches joining these clusters to the
combined cluster of B and C represent the merging of these clusters into progressively bigger
clusters.
Agglomerative hierarchical algorithm
The agglomerative hierarchical algorithm is a popular clustering algorithm that follows a
bottom-up approach to create a hierarchical structure of clusters. It starts with each data point
assigned to its own individual cluster and progressively merges the closest pairs of clusters
until a single cluster, containing all the data points, is formed. This algorithm is also known
as agglomerative clustering or bottom-up clustering. Here are the key steps and
characteristics of the agglomerative hierarchical algorithm [7].
 Initialization: Assign each data point to its own initial cluster, resulting in N clusters
for N data points.
 Compute the proximity or dissimilarity matrix: Calculate the dissimilarity or
similarity measure between each pair of clusters. The choice of distance or
dissimilarity measure depends on the specific application and the nature of the data.

 Merge the closest clusters: Identify the pair of clusters with the smallest dissimilarity
or highest similarity measure and merge them into a single cluster. The dissimilarity
or similarity between the new merged cluster and the remaining clusters is updated.
 Repeat the merging process: Repeat steps 2 and 3 until all the data points are part of a
single cluster or until a predefined stopping criterion is met.
 Hierarchical representation: The merging process forms a hierarchy of clusters, often
represented as a dendrogram. The dendrogram illustrates the sequence of merging and
allows for different levels of granularity in cluster interpretation.
The advantages of the agglomerative hierarchical algorithm are its hierarchical structure and
the fact that, unlike partitioning algorithms, the number of clusters does not need to be
specified in advance. Its drawbacks are high computational complexity and a lack of stability.
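A minimal sketch, not part of the original text, of agglomerative clustering with SciPy, which builds the linkage bottom-up and draws the corresponding dendrogram for a few 2-D points:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

points = [[2, 3], [6, 1], [1, 2], [3, 0], [5, 4]]    # five 2-D data points

Z = linkage(points, method="average")                # repeatedly merge the closest clusters
dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.ylabel("Dissimilarity")
plt.show()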
4.6.3 Distance and Dissimilarity Measures in Clustering:

In clustering, distance and dissimilarity measures play a crucial role in determining the
similarity or dissimilarity between data points. These measures quantify the proximity
between objects and are used by clustering algorithms to assign data points to clusters or
determine the cluster centres. Here are some commonly used distance and dissimilarity
measures in clustering [8]; a short Python sketch after the list shows how they can be computed.
1. Euclidean Distance: This is one of the most widely used distance measures in
clustering. It calculates the straight-line distance between two data points in a
Euclidean space. For two points, P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn), the
Euclidean distance is given by:

d(P, Q) = √[(p1 - q1)² + (p2 - q2)² + ... + (pn - qn)²]        (1.6)

2. Manhattan Distance: Also known as the City Block distance or L1 norm, it calculates
the sum of absolute differences between the coordinates of two points. For two points,
P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn), the Manhattan distance is given by:

d(P, Q) = |p1 - q1| + |p2 - q2| + ... + |pn - qn|        (1.7)


3. Cosine Similarity: Cosine similarity measures the cosine of the angle between two
vectors, indicating the similarity in their directions. It is commonly used in text
mining or when dealing with high-dimensional data. For two vectors, P = (p1, p2, ...,
pn) and Q = (q1, q2, ..., qn), the cosine similarity is given by:

cosine similarity(P, Q) = (p1q1 + p2q2 + ... + pnqn) / [√(p1² + p2² + ... + pn²) × √(q1² + q2² + ... + qn²)]        (1.8)

4. Minkowski Distance: This is a generalized distance measure that includes Euclidean


and Manhattan distances as special cases. The Minkowski distance between two
points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) is given by:

d(P, Q) = (|p1 - q1|^r + |p2 - q2|^r + ... + |pn - qn|^r)^(1/r), where r ≥ 1 (r = 1 gives the Manhattan distance and r = 2 the Euclidean distance)        (1.9)
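The sketch below, not part of the original text, computes these four measures for two example vectors using SciPy.

from scipy.spatial import distance

P = [1.0, 2.0, 3.0]
Q = [2.0, 4.0, 6.0]

print("Euclidean distance:", distance.euclidean(P, Q))
print("Manhattan distance:", distance.cityblock(P, Q))
print("Cosine similarity:", 1 - distance.cosine(P, Q))   # SciPy returns cosine *distance*
print("Minkowski distance (r = 3):", distance.minkowski(P, Q, p=3))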

4.6.4 Quality and Optimal Number of Clusters:

Quality and determining the optimal number of clusters are important considerations in
clustering analysis. Let's explore each of these aspects:

(A) Quality of Clustering:

The quality of clustering refers to how well the clustering algorithm captures the inherent
structure and patterns in the data. Several factors contribute to the assessment of clustering
quality:
 Compactness: Compactness refers to how close the data points are within each
cluster. A good clustering result should have data points tightly clustered together
within their assigned clusters.
 Separability: Separability refers to the distance between different clusters. A high-
quality clustering result should exhibit distinct separation between clusters, indicating
that the clusters are well-separated from each other.


 Stability: Stability measures the consistency of clustering results under different


conditions, such as different initializations or subsets of the data. A stable clustering
result is less prone to variations and demonstrates robustness.
 Domain-specific Measures: Depending on the application domain, additional
measures specific to the problem can be used to assess clustering quality. For
example, in customer segmentation, metrics like homogeneity, completeness, and
silhouette coefficient can be used to evaluate the effectiveness of clustering in
capturing meaningful customer groups.
(B) Determining the Optimal Number of Clusters in K-means clustering:
Determining the optimal number of clusters, K, in K-means clustering is a crucial step in
clustering analysis. Selecting the appropriate number of clusters is important for interpreting
and extracting meaningful information from the data. Several methods are commonly used to
determine the optimal number of clusters:
 Elbow Method: “The elbow method involves plotting the within-cluster sum of
squares (WCSS) as a function of the number of clusters”. The plot resembles an arm,
and the optimal number of clusters is often identified at the "elbow" point, where the
rate of decrease in WCSS slows down significantly. In the Fig. below, it is clear that
k=3 is the optimal number of clusters.

Fig 4.9: Elbow Method
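A minimal sketch, not part of the original text, of the elbow method with scikit-learn: compute the within-cluster sum of squares (inertia) for several values of K and look for the point where the decrease levels off.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # synthetic data with 3 groups

wcss = []
k_values = range(1, 9)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)            # within-cluster sum of squares for this K

plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()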

4.7 SUMMARY

 Decision Tree is a popular machine learning approach for classification and


regression tasks.
 The CHAID algorithm looks for meaningful patterns by splitting the data into groups
based on different categories of variables.
 The Bonferroni correction is a statistical method used to adjust the significance levels
(p-values).
 Gini impurity index is a measure used in decision tree algorithms to evaluate the
impurity or disorder within a set of class labels.
 Entropy is a concept used in information theory and decision tree algorithms to
measure the impurity.
 Cost-based splitting criteria aim to minimize the overall cost or misclassification
expenses by selecting the optimal feature and split point at each node.
 CART measures the impurity or disorder within each node using a criterion like Gini
impurity or entropy.
 Random forest combines multiple decision trees to make predictions by aggregating
the results from each individual tree
 Clustering algorithms discover patterns and groupings solely based on the inherent
characteristics of the data.
 K-means clustering is a popular and widely used algorithm for partitioning a dataset
into a predefined number of clusters.
 Distance measures quantify the proximity between objects and are used by clustering
algorithms to assign data points to clusters or determine the cluster centers.

4.8 GLOSSARY

Terms Definition
Classification          The classification algorithm is a supervised learning technique that is
                        used to categorize new observations on the basis of the training data.


Clustering              Clustering is a machine learning technique that groups unlabelled data
                        points into different clusters.
Gini index              The Gini index, also known as Gini impurity, is used to measure the
                        degree or probability of a particular element being wrongly classified.
Entropy                 Entropy is used to measure the impurity or disorder in a dataset. It is
                        commonly used in decision trees.
Distance Measures       Distance measures quantify the similarity or dissimilarity between data
                        points. They are widely used in clustering, classification and
                        nearest-neighbour search.

4.9 ANSWERS TO INTEXT QUESTIONS

1. (d) All of these
2. (c) Both
3. (a) The length of the longest path from a root to a leaf
4. (b) Random forests are difficult to interpret but often very accurate
5. (b) Homogeneity
6. (b) Tree showing how close things are to each other
7. (d) All of the mentioned
8. (d) None of the mentioned
9. (c) Bad initialization can produce good clustering
10. (b) Hierarchical
11. (a) The different learners in boosting can be trained in parallel
12. (a) K-means clustering algorithm

4.10 SELF-ASSESSMENT QUESTIONS

1. What are the different decision tree algorithms used in machine learning?
2. What is entropy?
3. Which metrics is best entropy or gini impurity for node selection in decision tree?
4. Write some advantages and disadvantages of decision tree.
5. How decision trees are used for classification and regression tasks?

6. Can a random forest handle categorical features and missing values?


7. What is the purpose of using random subsets of data and features in a random forest?
8. What are the main types of clustering algorithms?
9. What is the key parameter in k-means clustering?
10. What are the limitations of clustering?

4.11 REFERENCES

 Han, J., Kamber, M., & Pei, J. (2011). Data mining: concepts and techniques. Morgan
Kaufmann.
 Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning:
data mining, inference, and prediction. Springer Science & Business Media.
 Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
 Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT Press.
 Mann, A. K., & Kaur, N. (2013). Review paper on clustering techniques. Global Journal
of Computer Science and Technology.
 Rai, P., & Singh, S. (2010). A survey of clustering techniques. International Journal of
Computer Applications, 7(12), 1-5.
 Cheng, Y. M., & Leu, S. S. (2009). Constraint-based clustering and its applications in
construction management. Expert Systems with Applications, 36(3), 5761-5767.
 Bijuraj, L. V. (2013). Clustering and its Applications. In Proceedings of National
Conference on New Horizons in IT-NCNHIT (Vol. 169, p. 172).
 Kameshwaran, K., & Malarvizhi, K. (2014). Survey on clustering techniques in data
mining. International Journal of Computer Science and Information Technologies, 5(2),
2272-2276.

4.12 SUGGESTED READINGS

Han, J., Kamber, M., & Pei, J. (2011). Data mining: concepts and techniques. Morgan Kaufmann.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data
mining, inference, and prediction. Springer Science & Business Media.
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
