
Indian Institute of Management Kashipur

presents

PI-MANAGEMENT KIT 2023


Preparation Material
MBA (Analytics)
PI KIT 2023

IT Industry:
The Information Technology & Information Technology Enabled Services
(IT-ITeS) sector is undergoing rapid evolution and is reshaping the
standards of Indian business.
This sector includes:
software development
consultancies
software management
online services and business process outsourcing (BPO)

According to IBEF (India Brand Equity Foundation), the IT and BPM
sector has become one of the most significant growth catalysts for the
Indian economy, contributing significantly to the country’s GDP and
public welfare. The IT industry accounted for 7.4% of India’s GDP in
FY22, and it is expected to contribute 10% to India’s GDP by 2025.
The growing demand, global footprint, competitive advantage, and policy
support of the IT industry are explained below.

MARKET SIZE:

SECTOR COMPOSITION:

GOVERNMENT INITIATIVES:

Analytics, data science and big data industry in India:

The analytics industry recorded a substantial increase of 34.5% on a
year-on-year basis in 2022, with the market value reaching USD 61.1
billion. The overall analytics industry is projected to reach a size of
USD 201.0 billion by 2027 at a CAGR of 26.9%. BFSI remained the top
contributor to the analytics market among non-IT sectors for a
consecutive year, with a share of 34.1% of the total market.

The Indian analytics industry is predicted to grow to USD 98 billion in
2025 and nearly USD 119 billion in 2026.
75% of big data is helping government departments to improve the
quality of life of their citizens.
The data analytics industry is projected to create over 11 million jobs by
2026 and increase investments in AI and machine learning by 33.49% in
2022 alone.

DATA ANALYTICS:
Data analytics is the science of analyzing raw data to make conclusions
about that information.
Data analytics helps a business optimize its performance, perform more
efficiently, maximize profit, and make more strategically guided decisions.
Many of the techniques and processes of data analytics have been
automated into mechanical processes and algorithms that work over raw
data for human consumption.

Types of Data Analytics


There are four types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics

Predictive Analytics: Using predictive analytics, the data are
transformed into useful knowledge. Predictive analytics uses data to
estimate the chance of a condition arising or the likely course of an
occurrence.

In order to anticipate future events, predictive analytics uses a number
of statistical techniques from modelling, machine learning, data mining,
and game theory. These techniques examine both current and past data.
Techniques that are used for predictive analytics are:

Linear Regression
Time series analysis and forecasting
Data Mining
There are three basic cornerstones of predictive analytics:
Predictive modelling
Decision Analysis and optimization
Transaction profiling

Descriptive Analytics: In order to understand how to approach future
events, descriptive analytics examines data and analyses prior events. By
analysing historical data, it examines prior performance to determine
what caused past success or failure. This kind of analysis is used in
almost all management reporting, including that for sales, marketing,
operations, and finance.

In order to categorize consumers or prospects into groups, the
descriptive model quantifies relationships in data. Descriptive analytics
uncovers a variety of interactions between the client and the product, in
contrast to predictive models that concentrate on forecasting the
behaviour of a specific customer.

Common examples of descriptive analytics are company reports that
provide historic reviews like:
Data Queries
Reports
Descriptive Statistics
Data dashboard

Prescriptive Analytics: Prescriptive analytics automatically combines big
data, mathematical models, business rules, and machine learning to
produce a prediction, and then proposes decision alternatives to
capitalize on that prediction.

Prescriptive analytics goes beyond forecasting outcomes by additionally
recommending actions that will benefit from the forecasts and outlining
the implications of each decision option for the decision maker. In
addition to predicting what will happen and when, prescriptive analytics
also considers why it will happen. Moreover, prescriptive analytics can
recommend options on how to seize a future opportunity or lessen a
future risk, and it can also explain the implications of each option.

Prescriptive analytics, for instance, can help strategic planning in the
healthcare industry by leveraging operational and consumption data
mixed with data from outside elements like the economy and population
demographics.

Diagnostic Analytics: In this type of analysis, historical data is typically
preferred over other data when attempting to provide an answer or
resolve a query. We look for any dependencies and patterns in the past
data related to the specific issue.
Companies utilise this analysis because it provides significant insight
into a problem. They also retain extensive records at their disposal,
since without them data collection would become an individual and
time-consuming exercise for each problem.

Common techniques used for Diagnostic Analytics are:
Data discovery
Data mining
Correlations

Data analytics relies on a variety of software tools, ranging from
spreadsheets, data visualization and reporting tools, and data mining
programs to open-source languages for more advanced data manipulation.
Data analytics techniques can reveal trends and metrics that would
otherwise be lost in the mass of information. This information can then
be used to optimize processes to increase the overall efficiency of a
business or system.

Examples:
Manufacturing companies often record the runtime,
downtime, and work queue for various machines and then analyse the
data to better plan the workloads so the machines operate closer to peak
capacity.
Gaming companies use data analytics to set reward schedules for players
that keep most players active in the game.
Content companies use many of the same data analytics to keep you
clicking, watching, or re-organizing content to get another view or
another click.

ANALYTICS INDUSTRY:
Traditional data analytics refers to the process of analyzing massive
amounts of collected data to get insights and predictions. Business data
analytics (sometimes called business analytics) takes that idea, but puts
it in the context of business insight, often with prebuilt business content
and tools that expedite the analysis process.

Specifically, business analytics refers to:
Taking in and processing historical business data
Analysing that data to identify trends, patterns, and root causes
Making data-driven business decisions based on those insights

Using cloud analytics tools, organizations can consolidate data from
different departments—sales, marketing, HR, and finance—for a unified
view that shows how one department’s numbers can influence the
others. Further, tools such as visualization, predictive insights, and
scenario modelling deliver all kinds of unique insights across an entire
organization.

Business data analytics has many individual components that work
together to provide insights. While business analytics tools handle the
elements of crunching data and creating insights through reports and
visualization, the process actually starts with the infrastructure for
bringing that data in. A standard workflow for the business analytics
process is as follows:
Data collection: Wherever data comes from, be it IoT devices, apps,
spreadsheets, or social media, all of that data needs to get pooled and
centralized for access. Using a cloud database makes the collection
process significantly easier.

Data mining: Once data arrives and is stored (usually in a data lake), it
must be sorted and processed. Machine learning algorithms can
accelerate this by recognizing patterns and repeatable actions, such as
establishing metadata for data from specific sources, allowing data
scientists to focus more on deriving insights rather than manual
logistical tasks.

Descriptive analytics: What is happening and why is it happening?
Descriptive data analytics answers these questions to build a greater
understanding of the story behind the data.
Predictive analytics: With enough data—and enough processing of
descriptive analytics—business analytics tools can start to build
predictive models based on trends and historical context. These models
can thus be used to inform future decisions regarding business and
organizational choices.

Visualization and reporting: Visualization and reporting tools can help
break down the numbers and models so that the human eye can easily
grasp what is being presented. Not only does this make presentations
easier, but these types of tools can also help anyone from experienced
data scientists to business users quickly uncover new insights.

Benefits of business analytics
Data-driven decisions
Easy visualization
Modelling the what-if scenario
Go augmented

Business analytics use cases

Marketing: Analytics to identify success and impact
Which customers are more likely to respond to an email campaign?
What was the last campaign’s ROI? More and more marketing
departments are trying to better understand how their programs affect
the business at large. With AI and machine learning powering analysis,
it’s possible to use data to drive strategic marketing decisions.

Human Resources: Analytics to find and share talent insights
What actually drives employee decisions regarding their career? More
and more HR leaders are trying to better understand how their programs
affect the business at large. With the right analytical capabilities, HR
leaders are able to quantify and predict outcomes, understand
recruitment channels, and review employee decisions.

Sales: Analytics to optimize your sales
What is the critical moment that converts a lead to a sale? In-depth
analytics can break down the sales cycle, taking in all the different
variables that lead to a purchase. Price, availability, geography, season,
and other factors can be the turning point on the customer journey—and
analytics offer the tool to decipher that key moment.

Finance: Analytics to power predictive organizational budgets
How can you increase your profit margins? Finance works with every
department, be it HR or sales. That means that innovation is always key,
especially as finance departments face larger volumes of data. With
analytics, it’s possible to bring finance into the future for predictive
modelling, detailed analysis, and insights from machine learning.

An example:
Western Digital can access data 25X faster across their
mission-critical business applications—including ERP, EPM, and SCM—
enabling their business to focus on strategic insights, innovation, and
improved customer experience instead of how to integrate point
systems to analyze data.

CONTINUOUS AND DISCRETE VARIABLES:

DISCRETE VARIABLE
A discrete variable is a type of statistical variable that can assume only a
fixed number of distinct values and lacks an inherent order.
Also known as a categorical variable, it has separate, indivisible categories.
No values can exist in between two categories, i.e., it does not
attain all the values within the limits of the variable. For example:
The number of printing mistakes in a book.
The number of road accidents in New Delhi.
The number of siblings of an individual.

CONTINUOUS VARIABLE
A continuous variable, as the name suggests, is a random variable that
assumes all the possible values in a continuum. Simply put, it can take
any value within the given range.
A continuous variable is defined over a range of values, meaning it can
assume any value between the minimum and maximum values. For
example:
Height of a person
Age of a person
Profit earned by the company

Distribution
A distribution describes how the data is spread around the centre. There
are 2 types of distributions:
1. Continuous distribution - E.g., Normal Distribution, Chi-square
Distribution.
2. Discrete distribution - E.g., Binomial Distribution, Poisson
Distribution, Bernoulli Distribution.
Quantitative attributes are summarized with the mean, median,
maximum, minimum, and standard deviation.
Categorical/qualitative data are summarized with the mode and count.
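To make the distinction concrete, here is a minimal sketch (assuming
numpy is installed, with made-up parameters) that draws samples from one
continuous and one discrete distribution and summarizes them.

# Continuous vs. discrete distributions: draw samples and summarize them.
import numpy as np

rng = np.random.default_rng(42)

# Continuous: normal distribution (values can fall anywhere on the real line)
normal_sample = rng.normal(loc=50, scale=10, size=1000)

# Discrete: binomial distribution (values are whole counts between 0 and n)
binomial_sample = rng.binomial(n=10, p=0.3, size=1000)

print("Normal   -> mean:", normal_sample.mean(), "std:", normal_sample.std())
print("Binomial -> mean:", binomial_sample.mean(),
      "mode:", np.bincount(binomial_sample).argmax())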

Let’s have an overview of these statistical terms:

Mean: Average of all the values available. It can be used to get an average
value of a particular attribute for a group of samples. Mean is used in
parametric statistical tests.

Median: It’s the statistical measure that gives the middle value of the
data when sorted in ascending order. The median can be used to check
whether data is symmetrically distributed: when the mean of all values is
approximately equal to the median, the data is roughly symmetric, as in a
normal distribution. The distribution of data always plays an important
role in statistical analysis and model formulation.

Mode: Mode is the most frequently occurring value of a particular
attribute in a dataset. Mode is used explicitly with categorical data in a
non-parametric statistical test such as the chi-square test.
For example, in a survey of the most used social media, if most
respondents chose Instagram over other channels, Instagram is said to be
the mode of the categorical data set. Here, the mode is simply the
response with the highest count.

Standard Deviation: Standard deviation is a measure of the dispersion of
data: it quantifies how far values lie from their mean.

For example, relative grading systems such as GPA are based on standard
deviation, measuring how far a student’s mark in a particular subject lies
below or above the average score of the class.
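As a small illustration, here is a minimal sketch using Python’s standard
library to compute these summary statistics for a made-up sample of exam
scores.

# Summary statistics on a hypothetical sample of exam scores.
import statistics

scores = [72, 85, 85, 90, 64, 78, 85, 70]

print("Mean:   ", statistics.mean(scores))    # average value
print("Median: ", statistics.median(scores))  # middle value when sorted
print("Mode:   ", statistics.mode(scores))    # most frequent value
print("Std dev:", statistics.stdev(scores))   # sample standard deviation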


HYPOTHESIS TESTING:
Hypothesis testing is the statistical concept used to compare sample
parameters of 2 or more samples using various tests such as z-test, t-
test, ANOVA, and chi-square. Hypothesis in simple language means
educated guess. The assumption is made that the means of the two
sample groups, or of the sample and population, are either the same or
different. The assumption that must be proven true with evidence is
considered the alternate hypothesis.

IMPORTANT POINTS:
1. The hypothesis is a claim, and the objective of hypothesis testing is to
either reject or retain a null hypothesis with data
2. Hypothesis testing consists of two complementary statements called
"Null Hypothesis" and "Alternate Hypothesis".
3. A null hypothesis is an existing belief and an alternate hypothesis is
what we intend to establish with new evidence.
4. Hypothesis tests can be broadly classified into parametric and non-
parametric tests. Parametric tests are about population parameters
of a distribution such as mean, proportion, standard deviation etc.
Non-parametric tests are about the independence of events. The
non-parametric test can be referred to as a distribution-free test.

Steps in Hypothesis Testing

1. Define the null and alternate hypotheses.
2. Identify the test statistic (hypothesis test) to be used.
3. Decide the criteria (α value) for rejection or retention of the null
hypothesis. The α value is usually 0.05.
4. Calculate the p-value.
5. Take the decision to reject or retain the null hypothesis based on the
p-value and the significance level α.

Type I error:
A type 1 error is also known as a false positive and occurs when a
researcher incorrectly rejects a true null hypothesis. This means that
you report that your findings are significant (accepting the alternate
hypothesis) when in fact they have occurred by chance.
The probability of making a type I error is represented by your alpha
level (α), which is the p-value below which you reject the null hypothesis.
A p-value of 0.05 indicates that you are willing to accept a 5% chance
that you are wrong when you reject the null hypothesis.
You can reduce your risk of committing a type I error by using a lower
significance level. For example, an alpha of 0.01 would mean there is a 1%
chance of committing a Type I error. However, using a lower value for
alpha means that you will be less likely to detect a true difference if one
really exists (thus risking a type II error).

Type II error
A type II error is also known as a false negative and occurs when a
researcher fails to reject a null hypothesis which is false. Here a
researcher concludes there is not a significant effect when there really
is.
The probability of making a type II error is called Beta (β), and this is
related to the power of the statistical test (power = 1- β). You can
decrease your risk of committing a type II error by ensuring your test
has enough power.
You can do this by ensuring your sample size is large enough to detect a
practical difference when one truly exists.

For example, H0 (Null Hypothesis): The return from stock A is higher
than the return from stock B.
H1 (Alternate): The return from stock B is higher than the return from
stock A.
Here, the hypothesis that the statistician wants to prove is ‘The return
from stock B is higher than the return from stock A’ based on available
historical data of stock prices.
The null hypothesis is either accepted or rejected based on the P-value
obtained by performing statistical tests.

Now let’s see what the P-value is.
The P-value is the probability of obtaining a result at least as extreme as
the observed one, given that the null hypothesis is true. The significance
level associated with the hypothesis test is known as alpha, and it is
related to the confidence interval (confidence level = 1 − alpha).

Alpha: Alpha is nothing but the probability of rejecting the null
hypothesis when it is true. The lower the alpha, the better, since it is the
probability of committing an error.
Beta is essentially the opposite of alpha.

Beta: It is the probability of accepting the null hypothesis when it is not
true. 1 − beta, i.e., the probability of not making a type 2 error, is known
as the power of the test.
Ideally, you would like to keep both errors as low as possible, which is
practically not possible as the two errors trade off against each other.
Hence, commonly used values of alpha are 0.01, 0.05, and 0.10, which give
a good balance between alpha and beta.

So how is this P-value used to test the hypothesis?

When P-value > alpha, the observed result is not unlikely enough under
the null hypothesis to rule it out at the chosen significance level, so the
null hypothesis is retained.
When P-value < alpha, the observed result would be very unlikely if the
null hypothesis were true, so the null hypothesis is rejected.

Directional Hypothesis Tests

A directional hypothesis is a prediction made by a researcher regarding a
positive or negative change, relationship, or difference between two
variables of a population. This prediction is typically based on past
research, accepted theory, extensive experience, or literature on the
topic. Keywords that distinguish a directional hypothesis are higher,
lower, more, less, increase, decrease, positive, and negative.

Example:
The salaries of postgraduates are higher than the salaries of graduates.

Non-Directional Hypothesis Tests


A nondirectional hypothesis differs from a directional hypothesis in that
it predicts a change, relationship, or difference between two variables
but does not specifically designate the change, relationship, or difference
as being positive or negative. Another difference is the type of statistical
test that is used.

Example: Salaries of postgraduates are significantly different from the
salaries of graduates.

Different Statistical Process

Z-TEST
Z-test is a statistical procedure used to test an alternative hypothesis
against a null hypothesis. A z-test is used to determine whether two
samples’ means are different when the variances are known and the
sample size is large (n ≥ 30). It is a comparison of the means of two
independent groups of samples, taken from one population with known
variance.
Null: The sample mean is the same as the population mean
Alternate: The sample mean is not the same as the population mean

If the test statistic is lower than the critical value, retain (fail to reject)
the null hypothesis; otherwise, reject the null hypothesis.

Understanding a One-Sample Z-Test

A teacher claims that the mean score of students in his class is greater
than 82 with a standard deviation of 20. If a sample of 81 students was
selected with a mean score of 90 then check if there is enough evidence
to support this claim at a 0.05 significance level.
As the sample size is 81 and population standard deviation is known, this
is an example of a right-tailed one-sample z-test.
Sample mean = 90
Sample size n = 81
Population mean μ = 82
Population standard deviation σ = 20

H0: μ = 82
H1: μ > 82
From the z-table, the critical value at α = 0.05 is 1.645.

x̄ = 90, μ = 82, n = 81, σ = 20
z = (x̄ − μ) / (σ/√n) = (90 − 82) / (20/√81) = 8 / 2.22 ≈ 3.6
As 3.6 > 1.645, the null hypothesis is rejected, and it is concluded that
there is enough evidence to support the teacher's claim.
Answer: Reject the null hypothesis
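The same calculation can be reproduced in a few lines of Python; this is a
minimal sketch assuming scipy is installed.

# One-sample, right-tailed z-test using the figures from the example above.
from math import sqrt
from scipy.stats import norm

x_bar, mu, sigma, n, alpha = 90, 82, 20, 81, 0.05

z = (x_bar - mu) / (sigma / sqrt(n))   # test statistic, 3.6
z_critical = norm.ppf(1 - alpha)       # right-tailed critical value, about 1.645
p_value = norm.sf(z)                   # P(Z > z)

print(f"z = {z:.2f}, critical value = {z_critical:.3f}, p-value = {p_value:.5f}")
print("Reject H0" if z > z_critical else "Fail to reject H0")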

Understanding a Two-Sample Z-Test

Here, let’s say we want to know if girls on average score 10 marks more
than boys. We have the information that the standard deviation for girls’
scores is 100 and for boys’ scores is 90. Then we collect the data of 20
girls and 20 boys by using random samples and recording their marks.
Finally, we also set our α value (significance level) to be 0.05.

In this example:
Mean Score for Girls (SampleMean) is 641
Mean Score for Boys (SampleMean) is 613.3
Standard Deviation for the Population of Girls is 100
Standard deviation for the Population of Boys is 90
Sample Size is 20 for both Girls and Boys
Hypothesized difference between the population means is 10
Putting these values into the two-sample z formula,
z = ((x̄₁ − x̄₂) − 10) / √(σ₁²/n₁ + σ₂²/n₂), we get a z-score of about 0.59
and a corresponding p-value of 0.278. Since this is greater than 0.05, we
fail to reject the null hypothesis.
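Again, a minimal sketch of this two-sample z-test in Python (scipy
assumed), using the summary figures quoted above.

# Two-sample, right-tailed z-test with known population standard deviations.
from math import sqrt
from scipy.stats import norm

mean_girls, mean_boys = 641, 613.3
sigma_girls, sigma_boys = 100, 90
n_girls = n_boys = 20
hypothesized_diff = 10   # H0: girls score at most 10 marks more than boys
alpha = 0.05

se = sqrt(sigma_girls**2 / n_girls + sigma_boys**2 / n_boys)
z = ((mean_girls - mean_boys) - hypothesized_diff) / se
p_value = norm.sf(z)     # right-tailed p-value, about 0.278

print(f"z = {z:.3f}, p-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")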

T-TEST
If we have a sample size of less than 30 and do not know the population
variance, then we must use a t-test.

One-sample and Two-sample Hypothesis Tests

The one-sample t-test is a statistical hypothesis test used to determine
whether an unknown population parameter is different from a specific
value.

In statistical hypothesis testing, a two-sample test is a test performed on
the data of two random samples, each of which is independently
obtained. The purpose of the test is to determine whether the difference
between these two populations is statistically significant.

Understanding a One-Sample t-Test

Let’s say we want to determine if on average girls score more than 600 in
the exam. We do not have the information related to variance (or
standard deviation) for girls’ scores. To perform a t-test, we randomly
collect the data of 10 girls with their marks and choose our α value
(significance level) to be 0.05 for hypothesis testing.
In this example:
Mean Score for Girls is 606.8
The size of the sample is 10
The population mean is 600
Standard Deviation for the sample is 13.14

Putting these values into the t formula, t = (x̄ − μ) / (s/√n), we get a
t-score of about 1.64 and thereby compute a p-value of 0.06 from the
t-score, which is greater than 0.05. Hence we fail to reject the null
hypothesis and do not have enough evidence to support the hypothesis
that, on average, girls score more than 600 in the exam.
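A minimal sketch of this one-sample t-test from the quoted summary
statistics (scipy assumed):

# One-sample, right-tailed t-test computed from summary statistics.
from math import sqrt
from scipy.stats import t

x_bar, mu0, s, n, alpha = 606.8, 600, 13.14, 10, 0.05

t_stat = (x_bar - mu0) / (s / sqrt(n))   # about 1.64
p_value = t.sf(t_stat, df=n - 1)         # right-tailed p-value, greater than 0.05

print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")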

Understanding a Two-Sample t-Test

Here, let’s say we want to determine if on average, boys score 15 marks
more than girls in the exam. We do not have the information related to
variance (or standard deviation) for girls’ scores or boys’ scores. To
perform a t-test, we randomly collect the data of 10 girls and boys with
their marks. We choose our α value (significance level) to be 0.05 as the
criteria for hypothesis testing.
In this example:
Mean Score for Boys is 630.1
Mean Score for Girls is 606.8
Hypothesized difference between the population means is 15
Standard Deviation for Boys’ score is 13.42
Standard Deviation for Girls’ score is 13.14
Putting these values into the two-sample t formula, we get a t-score, and
thereby we compute the p-value as 0.019 from the t-score, which is less
than 0.05. Hence we reject the null hypothesis and conclude that on
average boys score 15 marks more than girls in the exam.
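For the mechanics, here is a minimal sketch (scipy assumed) of a
two-sample t-test computed from the quoted summary statistics. For
simplicity it tests the plain null of equal means rather than the 15-mark
difference above; the 0.019 p-value quoted in the example was obtained
from the underlying raw scores, which are not reproduced here.

# Two-sample t-test from summary statistics (pooled variance, H0: equal means).
from scipy.stats import ttest_ind_from_stats

t_stat, p_two_sided = ttest_ind_from_stats(
    mean1=630.1, std1=13.42, nobs1=10,   # boys
    mean2=606.8, std2=13.14, nobs2=10,   # girls
    equal_var=True,
)
p_one_sided = p_two_sided / 2            # H1: boys score higher on average

print(f"t = {t_stat:.2f}, one-sided p-value = {p_one_sided:.4f}")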

Normal Probability Plot of Residuals:

The effectiveness of a regression can be evaluated using residual plots.
The underlying statistical assumptions about residuals, such as
constant variance, variable independence, and normality of the
distribution, can be looked at. The residuals would need to be randomly
distributed around zero in order for these assumptions to be valid for a
certain regression model.

The accuracy of these assumptions can be verified using several residual
plot types, which can also offer guidance on how to enhance the model.
For instance, if the regression is successful, the residuals' scatter plot
will be disorganized. There should be no pattern in the residuals. If there
was a pattern, the residuals might not have been independent. On the
other hand, the normality assumption is more likely to be accurate if a
histogram plot of the residuals shows a symmetric bell-shaped
distribution.

Checking the error variance:

An increasing trend in the residuals plot suggests that the error variance
increases with the independent variable, whereas
a falling trend in the distribution suggests that the error variance
reduces with the independent variable. These distributions don't both
have the same variance. As a result, they suggest that the regression is
poor and that the assumption of constant variance is unlikely to be valid.
A horizontal-band pattern, on the other hand, shows that the variance of
the residuals is constant.

Checking normality of residuals

To determine whether the residuals are normally distributed, use a
histogram of the residuals. The normality assumption is likely to be true
if the histogram is a symmetric, bell-shaped distribution centred about
zero. If the histogram shows that the random error is not normally
distributed, the model's underlying assumptions may have been violated.

Histogram of the residuals showing that the deviations are normally
distributed.

It is possible to determine whether the error terms are normally
distributed by looking at a normal probability plot of the residuals. If the
resulting plot is roughly linear, we continue assuming that the error
terms are normally distributed. The plot is constructed from estimated
percentiles versus the ordered residuals, where the percentile of the i-th
ordered residual is estimated from i and the total number of data points
n (commonly as (i − 0.5)/n). The normal probability plot of the residuals
looks like this:

Normal Probability Plot of the Residuals
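In practice, such a plot can be produced in a couple of lines; this is a
minimal sketch (numpy, scipy, and matplotlib assumed), using simulated
residuals purely for illustration.

# Normal probability (Q-Q) plot; simulated residuals stand in for the
# residuals of a fitted regression model.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(loc=0, scale=1, size=100)

stats.probplot(residuals, dist="norm", plot=plt)  # ordered residuals vs. theoretical quantiles
plt.title("Normal Probability Plot of the Residuals")
plt.show()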

Normally distributed residuals:

Histogram

The following histogram of residuals suggests that the residuals (and
hence the error terms) are normally distributed:

Normal Probability Plot

The normal probability plot of the residuals is approximately linear,
supporting the condition that the error terms are normally distributed.

Normal residuals but with one outlier:

Histogram

The following histogram of residuals suggests that the residuals (and
hence the error terms) are normally distributed. But there is one
extreme outlier (with a value larger than 4):

Normal Probability Plot

Here's the corresponding normal probability plot of the residuals:

This is a famous illustration of a normal probability plot with a single
outlier and normally distributed residuals. Other than one data point, the
relationship is essentially linear. After eliminating the outlier from the
data set, we might proceed under the assumption that the error terms
are normally distributed.

Homoscedasticity:

The residuals for each predicted DV (dependent variable) score are
assumed to be roughly equal under the homoscedasticity assumption.
Another way to look at it is that the variability in your IV (independent
variable) scores is constant across all DV values. The same residuals plot
discussed in the linearity and normality sections can be used to assess
homoscedasticity.

If the residuals plot is the same width for all projected DV values, the
data are homoscedastic. Typically, heteroscedasticity is shown by a
cluster of points that becomes broader as the expected DV values
increase. Alternatively, you can examine a scatterplot between each IV
and the DV to see if there is homoscedasticity. The point cluster should
be around the same width throughout, just like the residual plot.

The residuals graphic that follows displays data that are largely
homoscedastic. Because the residual plot is rectangular and has a
concentration of points along the centre, the data in this residual plot
really support the hypotheses of homoscedasticity, linearity, and
normality.

Heteroscedasticity:

Heteroscedasticity is the opposite situation: the spread of the residuals
changes across the range of predicted DV values rather than staying
roughly the same width throughout. In a residuals plot it typically
appears as a cluster of points that becomes broader (a fan or funnel
shape) as the predicted DV values increase, indicating that the
assumption of constant error variance does not hold.

CRISP-DM Overview:

CRISP-DM, which stands for Cross-Industry Standard Process for Data
Mining, is an industry-proven way to guide your data mining efforts.

As a methodology, it includes descriptions of the typical phases of a
project, the tasks involved with each phase, and an explanation of the
relationships between these tasks.
As a process model, CRISP-DM provides an overview of the data
mining life cycle.

The CRISP-DM model is flexible and can be customized easily. For
example, if your organization aims to detect money laundering, it is
likely that you will sift through large amounts of data without a specific
modelling goal. Instead of modelling, your work will focus on data
exploration and visualization to uncover suspicious patterns in financial
data. CRISP-DM allows you to create a data mining model that fits your
particular needs.

The Data Mining Life Cycle:

BUSINESS UNDERSTANDING:
This phase refers to the business case and the data: go through them
and understand the complexities of the business captured in the data
and the problem at hand. Using exploratory analysis, we can further
understand the data, with descriptive, visual, and inferential analytics
aiding the data understanding in a business context.

DATA UNDERSTANDING:
Inferential Analytics: Inferential analytics draw valid inferences about a
population based on an analysis of a representative sample of that
population. The results of such an analysis are generalized to the larger
population from which the sample originates, to make assumptions or
predictions about the population in general.

Predictive Analytics: Modelling techniques used in predictive analytics,
such as regression, are discussed below.

What is regression?
Regression is one of the main types of supervised learning in machine
learning. In regression, the training set contains observations (also called
features) and their associated continuous target values. The process of
regression has two phases:

The first phase is exploring the relationships between the
observations and the targets. This is the training phase.
The second phase is using the patterns from the first phase to
generate the target for future observation. This is the prediction
phase.

Stages of Regression Model

SUPERVISED LEARNING TECHNIQUES:

Classification Model:

The major difference between regression and classification is that the
output values in regression are continuous, while in classification they
are discrete. This leads to different application areas for these two
supervised learning methods. Classification is basically used to
determine desired memberships or characteristics, such as an email
being spam or not, news topics, or ad click-through. On the other hand,
regression mainly involves estimating an outcome or forecasting a
response.

A simple linear regression has a single variable, and it can be described
using the following formula:
y = A + Bx

Here, y is the dependent variable, x is the independent variable, A is the
intercept or constant (the value of y when x is zero), and B is the
coefficient (slope).
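As a small illustration, here is a minimal sketch (numpy assumed) that
fits y = A + Bx by least squares to a made-up dataset.

# Fit a simple linear regression y = A + Bx with ordinary least squares.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

B, A = np.polyfit(x, y, deg=1)   # polyfit returns [slope, intercept] for deg=1
print(f"Intercept A = {A:.2f}, coefficient B = {B:.2f}")
print("Prediction at x = 6:", A + B * 6)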

Regression Properties (Diagnostics)

1. Linearity - The relationship between y and x is linear
2. Independence - The errors are independent of each other (not
correlated, especially in a time series)
3. Normality - The errors (aka residuals) εi follow a normal distribution
with mean(ε) = 0
4. Homoscedasticity - The variance var(ε) is constant for different
values of X. When var(ε) is constant, it is known as homoscedasticity.
When var(ε) is not constant, it is heteroscedasticity

Model Diagnostics of Linear Regression:

1. To check if the relationship is linear: We try to fit a line against a set
of two-dimensional data points. In other words, we regress a straight
line for the data points.

2. To check the independence of residuals: If the data is naturally
chronological, the plot of residuals over time may be observed for any
specific pattern. Usually, in time-series data, the Durbin-Watson test
may be used to assess the independence of the residuals. Time-series
data will be dealt with in courses in the second year.

3. Normality of Residuals: To check for normality, obtain the
standardized residuals and check for normality using a distribution or
histogram plot.

4. Homoscedasticity of residuals: Homoscedasticity can be observed by
drawing a residual plot, which is a plot between the standardized
residual value and the standardized predicted value.
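The four diagnostics above can be run in a few lines; this is a minimal
sketch (numpy, matplotlib, and statsmodels assumed) on simulated data.

# Regression diagnostics: residuals-vs-fitted plot (linearity/homoscedasticity),
# histogram of residuals (normality), and Durbin-Watson statistic (independence).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3 + 2 * x + rng.normal(scale=2, size=100)   # linear relationship plus noise

model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

plt.scatter(model.fittedvalues, residuals)       # should show no pattern
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

plt.hist(residuals, bins=15)                     # should look roughly bell-shaped
plt.title("Histogram of residuals")
plt.show()

print("Durbin-Watson:", durbin_watson(residuals))  # values near 2 suggest independence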

Feature Engineering in analysing data:

This helps in carrying out the following (a short sketch follows the list):
Imputing missing values in the dataset. Eg: with median values.
Encoding categorical variables. Eg: one-hot encoding.
Transforming variables. Eg: logarithmic or reciprocal
transformation.
Handling outliers. Eg: removing outliers.
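A minimal sketch of these steps (pandas and numpy assumed), on a small
made-up dataframe:

# Basic feature-engineering steps on hypothetical data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [35000, 42000, np.nan, 58000, 1000000],   # a missing value and an outlier
    "city": ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"],
})

# Impute missing values with the median
df["income"] = df["income"].fillna(df["income"].median())

# Encode the categorical variable with one-hot encoding
df = pd.get_dummies(df, columns=["city"])

# Transform a skewed variable with a logarithm
df["log_income"] = np.log(df["income"])

# Handle outliers, e.g. drop rows beyond the 99th percentile
df = df[df["income"] <= df["income"].quantile(0.99)]

print(df)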

To reduce the number of features there can be three approaches:

1. Feature construction: Feature construction attempts to increase the
expressive power of the original features.
2. Feature selection: Select features which make context-specific sense
and are driven by logic/facts.
3. Feature extraction: Combining multiple features with mathematical
or statistical operations - Example: Profit = Revenue - Cost;
minimum_time_logistic = min(time_by_train, time_by_truck) - or
carrying out Principal Component Analysis (PCA) or other feature
reduction techniques (see the PCA sketch below).
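A minimal sketch of feature extraction with PCA (numpy and scikit-learn
assumed), compressing four correlated features into two principal
components:

# PCA on standardized features; the data is simulated and deliberately
# contains correlated columns.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
base = rng.normal(size=(200, 2))
X = np.hstack([base, base + rng.normal(scale=0.1, size=(200, 2))])   # 4 correlated features

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)   # (200, 2)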

MBA (ANALYTICS) PI QUESTIONS

Q1. Tell me something about yourself.
Q2. Tell me something which is not included in your CV.
Q3. Questions related to work experience:
How did you use analytics in the marketing department of your
company?
What projects did you work on in your company and which tools
did you use?
Explain the clustering project which you worked on related to
student performance?
Q4. Subject-specific questions/subject of interest (Statistical Quality
Control, ML, and supply chain management)

What is a P chart?

A p-chart is an attributes control chart that is used with data collected
in subgroups, which may be of varying sizes. Because the size of the
subgroup can fluctuate, it displays a proportion of nonconforming items
rather than a count. P-charts depict the evolution of a process over time.
The process attribute (or characteristic) is always expressed as a yes/no,
pass/fail, or go/no-go situation. For example, use a p-chart to plot the
proportion of incomplete insurance claim forms received weekly.
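A minimal sketch (numpy assumed) of the p-chart calculation, i.e. the
centre line p-bar and the 3-sigma control limits, using made-up weekly
counts of incomplete claim forms:

# p-chart: centre line p_bar and limits p_bar +/- 3*sqrt(p_bar*(1-p_bar)/n_i).
import numpy as np

inspected = np.array([120, 135, 110, 140, 125])   # forms received each week
incomplete = np.array([6, 9, 4, 10, 7])           # incomplete forms each week

p_bar = incomplete.sum() / inspected.sum()        # overall proportion nonconforming
sigma = np.sqrt(p_bar * (1 - p_bar) / inspected)  # varies with each subgroup size
ucl = p_bar + 3 * sigma
lcl = np.clip(p_bar - 3 * sigma, 0, None)         # proportions cannot go below zero

for week, (p, l, u) in enumerate(zip(incomplete / inspected, lcl, ucl), start=1):
    status = "out of control" if (p > u or p < l) else "in control"
    print(f"Week {week}: p = {p:.3f}, LCL = {l:.3f}, UCL = {u:.3f} -> {status}")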

What is a Control Chart?

Control charts, also known as Shewhart charts or process-behaviour
charts, are a statistical process control tool that can be used to identify
whether a manufacturing or business process is under control. Control
charts are more accurately described as the graphical device for
Statistical Process Monitoring.

In an X bar-R chart, which chart will be plotted first and why?

First, we plot the R chart to ensure that the process variation (range) is
in control. If the range is in control, we then proceed to the X-bar chart,
since the control limits of the X-bar chart are calculated from the
average range.

What are the various assumptions of linear regression?
-Linearity
-Homoscedasticity
-Normality
-No autocorrelation of errors
-Independent variables are not correlated among themselves (no
multicollinearity)

Tell me in one word, the difference between clustering and linear
regression.
Clustering belongs to unsupervised learning and linear regression
belongs to supervised learning.

Why don’t we import all packages at once in Python?

Packages contain many functions, including user-defined ones. If we
load all packages at once, function names may overlap (one definition
can silently shadow another), and memory and CPU usage also increase.
To avoid this complexity, we generally load only the packages that are
required rather than loading everything.
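A minimal sketch of the name-clash point, using two standard-library
modules that both define a function called sqrt:

# Importing the same name from two modules: the later import shadows the earlier one.
from math import sqrt    # works on real numbers only
from cmath import sqrt   # this sqrt now shadows math's sqrt

print(sqrt(-4))          # 2j -- math.sqrt(-4) would have raised a ValueError

# Importing modules explicitly keeps both versions available without conflict.
import math
import cmath
print(math.sqrt(4), cmath.sqrt(-4))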

Q5. Related to General Knowledge
Who is the chief minister of Uttarakhand?
What is your view of Mamata Banerjee’s victory in West Bengal?
Total states in India?
List a few states that recently got converted into Union Territories.

We wish all the candidates the Best of Luck!

In case of any queries reach out to us on
