IT Industry:
The Information Technology & Information Technology Enabled Services (IT-ITeS) sector is undergoing rapid evolution and is reshaping Indian business standards.
This sector includes:
software development
consultancies
software management
online services and business process outsourcing (BPO)
MARKET SIZE:
SECTOR COMPOSITION:
GOVERNMENT INITIATIVES:
Billion by 2027 at a CAGR of 26.9%. BFSI emerged as a top contributor to the analytics market share of non-IT sectors for a consecutive year, with a share of 34.1% of the total market.
DATA ANALYTICS:
Data analytics is the science of analyzing raw data to draw conclusions from that information.
Data analytics helps a business optimize its performance, operate more efficiently, maximize profit, and make more strategically guided decisions.
Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption.
In order to anticipate future events, predictive analytics uses a number
of statistical techniques from modelling, machine learning, data mining,
and game theory. These techniques examine both current and past data.
Techniques that are used for predictive analytics are:
Linear Regression
Time series analysis and forecasting
Data Mining
There are three basic cornerstones of predictive analytics:
Predictive modelling
Decision Analysis and optimization
Transaction profiling
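One of the techniques listed above, time series forecasting, can be sketched with a simple moving average. This is a minimal illustration, not any specific tool's method; the monthly sales figures are hypothetical.

```python
# A minimal sketch of time series forecasting using a simple moving
# average: the next value is predicted as the mean of the most recent
# `window` observations. The monthly sales data are hypothetical.

def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` points."""
    if len(series) < window:
        raise ValueError("series shorter than window")
    recent = series[-window:]
    return sum(recent) / window

monthly_sales = [120, 132, 128, 141, 150, 158]  # hypothetical data
forecast = moving_average_forecast(monthly_sales, window=3)
print(round(forecast, 1))  # mean of the last 3 months: (141+150+158)/3
```

Real predictive models (regression, data mining) refine this idea by weighting observations and modelling trend and seasonality.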
Prescriptive Analytics: To produce a prediction, prescriptive analytics automatically combines big data, mathematical science, business rules, and machine learning. It then proposes decision alternatives to capitalize on the prediction.
Data analytics relies on a variety of software tools, ranging from spreadsheets, data visualization and reporting tools, and data mining programs to open-source languages for more advanced data manipulation.
Data analytics techniques can reveal trends and metrics that would
otherwise be lost in the mass of information. This information can then
be used to optimize processes to increase the overall efficiency of a
business or system.
Examples:
Manufacturing companies often record the runtime, downtime, and work queue for various machines and then analyze the data to better plan workloads so the machines operate closer to peak capacity.
Gaming companies use data analytics to set reward schedules for players
that keep most players active in the game.
Content companies use many of the same data analytics to keep you clicking and watching, or to reorganize content to get another view or another click.
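The manufacturing example above can be sketched in Python; the machine names and hour figures are hypothetical.

```python
# A sketch of the manufacturing example: computing machine utilization
# from recorded runtime and downtime so workloads can be rebalanced.
# The machine names and hour figures are hypothetical.

machines = {
    "press_1": {"runtime_h": 130, "downtime_h": 38},
    "press_2": {"runtime_h": 155, "downtime_h": 13},
    "lathe_1": {"runtime_h": 90,  "downtime_h": 78},
}

def utilization(record):
    """Fraction of total recorded hours the machine was actually running."""
    total = record["runtime_h"] + record["downtime_h"]
    return record["runtime_h"] / total

# Rank machines so workloads can be shifted toward under-used capacity.
ranked = sorted(machines, key=lambda m: utilization(machines[m]))
for name in ranked:
    print(f"{name}: {utilization(machines[name]):.0%} utilized")
```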
ANALYTICS INDUSTRY:
Traditional data analytics refers to the process of analyzing massive
amounts of collected data to get insights and predictions. Business data
analytics (sometimes called business
analytics) takes that idea, but puts it in the context of business insight,
often with prebuilt business content and tools that expedite the analysis
process.
Data mining: Once data arrives and is stored (usually in a data lake), it
must be sorted and processed. Machine learning algorithms can
accelerate this by recognizing patterns and repeatable actions, such as
establishing metadata for data from specific sources, allowing data
scientists to focus more on deriving insights rather than manual
logistical tasks.
Descriptive analytics: What is happening and why is it happening?
Descriptive data analytics answers these questions to build a greater
understanding of the story behind the data.
Predictive analytics: With enough data, and enough processing of descriptive analytics, business analytics tools can start to build predictive models based on trends and historical context. These models can then be used to inform future business and organizational decisions.
Business analytics use cases
A few examples:
Western Digital, for example, can access data 25X faster across their
mission-critical business applications—including ERP, EPM, and SCM—
enabling their business to focus on strategic insights, innovation, and
improved customer experience instead of how to integrate point
systems to analyze data.
DISCRETE VARIABLE
A discrete variable is a type of statistical variable that can assume only a
fixed number of distinct values and lacks an inherent order.
Also known as a categorical variable, it has separate, indivisible categories.
However, no values can exist between two categories, i.e., it does not attain all the values within the limits of the variable. For example:
The number of printing mistakes in a book.
The number of road accidents in New Delhi.
The number of siblings of an individual.
CONTINUOUS VARIABLE
A continuous variable, as the name suggests, is a random variable that can assume all possible values in a continuum. Simply put, it can take any value within the given range, meaning it can assume any value between the minimum and maximum. For example:
Height of a person
Age of a person
Profit earned by the company
Distribution
Distribution tells how the data is spread around the centre. There are 2 types of distributions:
1. Continuous distributions - e.g., Normal distribution, Chi-square distribution.
2. Discrete distributions - e.g., Binomial distribution, Poisson distribution, Bernoulli distribution.
Quantitative attributes: Mean, median, maximum, minimum, and standard deviation.
Categorical/qualitative data: Mode, count.
Mean: Average of all the values available. It can be used to get an average
value of a particular attribute for a group of samples. Mean is used in
parametric statistical tests.
Median: It's the statistical measure that gives the middle value of the data when sorted in ascending order. The median can be used as a quick check of distribution shape: when the mean of all values is equal to the median, the distribution is symmetric, which is consistent with (though not proof of) normality. The distribution of data always plays an important role in statistical analysis and model formulation.
Mode: The most frequently occurring value. For example, in a survey of the most-used social media, if most respondents chose Instagram over other channels, Instagram is said to be the mode of the categorical data set. Here, the mode is nothing but the most frequent specific response.
Standard deviation: A measure of spread around the mean. For example, the GPA system used in grading students is based on standard deviation: it reflects how far a student's mark in a particular subject is (less/more) from the average score of the class.
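The measures above can be computed with Python's standard library; the exam scores are hypothetical illustration data.

```python
# A sketch of the summary statistics above using Python's standard
# library. The exam scores are hypothetical illustration data.
import statistics

scores = [62, 75, 75, 80, 68, 75, 90, 71]  # hypothetical data

mean_score = statistics.mean(scores)      # average of all values
median_score = statistics.median(scores)  # middle value when sorted
mode_score = statistics.mode(scores)      # most frequent value
sd_score = statistics.stdev(scores)       # sample standard deviation

print(mean_score)    # 74.5
print(median_score)  # 75.0
print(mode_score)    # 75
```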
HYPOTHESIS TESTING:
Hypothesis testing is the statistical concept used to compare sample
parameters of 2 or more samples using various tests such as z-test, t-
test, ANOVA, and chi-square. Hypothesis in simple language means
educated guess. The assumption is made that the mean of both the
sample groups or sample and population groups are either the same or
different. The assumption that must be proven true is considered an
alternate hypothesis.
IMPORTANT POINTS:
1. The hypothesis is a claim, and the objective of hypothesis testing is to
either reject or retain a null hypothesis with data
2. Hypothesis testing consists of two complementary statements called
"Null Hypothesis" and "Alternate Hypothesis".
3. A null hypothesis is an existing belief and an alternate hypothesis is
what we intend to establish with new evidence.
4. Hypothesis tests can be broadly classified into parametric and non-
parametric tests. Parametric tests are about population parameters
of a distribution such as mean, proportion, standard deviation etc.
Non-parametric tests are about the independence of events. The
non-parametric test can be referred to as a distribution-free test.
Type I error:
A type I error is also known as a false positive and occurs when a researcher incorrectly rejects a true null hypothesis. This means you report that your findings are significant (accept the alternate hypothesis) when in fact they occurred by chance.
The probability of making a type I error is represented by your alpha
level (α), which is the p-value below which you reject the null hypothesis.
A p-value of 0.05 indicates that you are willing to accept a 5% chance
that you are wrong when you reject the null hypothesis.
You can reduce your risk of committing a type I error by using a lower
value for p. For example, a p-value of 0.01 would mean there is a 1%
chance of committing a Type I error. However, using a lower value for
alpha means that you will be less likely to detect a true difference if one
really exists (thus risking a type II error).
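The link between α and the Type I error rate can be illustrated by simulation. This is a sketch, assuming a two-sided z-test with known σ = 1; all numbers are illustrative.

```python
# A sketch: simulating the Type I error rate described above. We
# repeatedly sample from a population where the null hypothesis is
# TRUE (mean = 0) and count how often a two-sided z-test at alpha = 0.05
# wrongly rejects it. The rate should come out near 5%.
import random
import statistics

random.seed(42)  # reproducible
CRITICAL = 1.96  # two-sided critical z value at alpha = 0.05
n, trials = 30, 10_000

false_positives = 0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]  # H0 is true: mean 0
    z = statistics.mean(sample) * n ** 0.5           # known sigma = 1
    if abs(z) > CRITICAL:
        false_positives += 1

type_i_rate = false_positives / trials
print(type_i_rate)  # close to alpha = 0.05
```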
Type II error
A type II error is also known as a false negative and occurs when a
researcher fails to reject a null hypothesis which is false. Here a
researcher concludes there is not a significant effect when there really
is.
The probability of making a type II error is called Beta (β), and this is
related to the power of the statistical test (power = 1- β). You can
decrease your risk of committing a type II error by ensuring your test
has enough power.
You can do this by ensuring your sample size is large enough to detect a
practical difference when one truly exists.
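The effect of sample size on power can likewise be illustrated by simulation. This is a sketch, assuming a true mean of 0.5 and known σ = 1; the numbers are illustrative.

```python
# A sketch of the point above: increasing sample size raises the power
# (1 - beta) of a test. We simulate from a population where the
# alternative is TRUE (mean = 0.5, sigma = 1) and count correct
# rejections of H0: mean = 0 for two sample sizes.
import random
import statistics

random.seed(7)  # reproducible
CRITICAL = 1.96  # two-sided z critical value at alpha = 0.05

def estimated_power(n, trials=2_000, true_mean=0.5):
    rejections = 0
    for _ in range(trials):
        sample = [random.gauss(true_mean, 1) for _ in range(n)]
        z = statistics.mean(sample) * n ** 0.5  # known sigma = 1
        if abs(z) > CRITICAL:
            rejections += 1
    return rejections / trials

power_small = estimated_power(10)
power_large = estimated_power(50)
print(power_small)  # lower power with a small sample
print(power_large)  # much higher power with a larger sample
```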
H0 (Null): The return from stock B is not higher than the return from stock A
H1 (Alternate): The return from stock B is higher than the return from stock A
Here, the hypothesis that the statistician wants to prove is ‘The return
from stock B is higher than the return from stock A’ based on available
historical data of stock prices.
The null hypothesis is either retained or rejected based on the p-value obtained by performing statistical tests.
When p-value > alpha, the observed result is not unlikely enough under the null hypothesis to justify the accepted risk of a type I error (rejecting the null hypothesis when it is true), so we fail to reject the null hypothesis.
When p-value < alpha, the observed result would be less likely under the null hypothesis than the accepted risk of a type I error, so the null hypothesis is rejected.
Directional Hypothesis Tests
Example:
The salaries of postgraduates are higher than the salaries of graduates.
A directional hypothesis designates the change, relationship, or difference as being positive or negative. Another difference is the type of statistical test that is used: a directional hypothesis calls for a one-tailed test.
Z-TEST
A z-test is a statistical procedure used to test an alternative hypothesis against a null hypothesis. A z-test is any statistical hypothesis test used to determine whether two sample means are different when the variances are known and the sample is large (n ≥ 30). It is a comparison of the means of two independent groups of samples, taken from a population with known variance.
Null: The sample mean is the same as the population mean
Alternate: The sample mean is not the same as the population mean
If the test statistic is lower than the critical value, we fail to reject the null hypothesis; otherwise, we reject the null hypothesis.
A teacher claims that the mean score of students in his class is greater
than 82 with a standard deviation of 20. If a sample of 81 students was
selected with a mean score of 90 then check if there is enough evidence
to support this claim at a 0.05 significance level.
As the sample size is 81 and the population standard deviation is known, this is an example of a right-tailed one-sample z-test.
Sample mean is 90
Sample size is 81
Population mean is 82
Population standard deviation is 20
H0: μ = 82
H1: μ > 82
From the z-table, the critical value at α = 0.05 is 1.645. The test statistic is z = (90 − 82) / (20/√81) = 8 / 2.22 ≈ 3.6, which exceeds 1.645, so we reject H0: there is enough evidence to support the teacher's claim.
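The teacher's-claim example can be sketched with Python's standard library:

```python
# A sketch of the example above: a right-tailed one-sample z-test with
# known population standard deviation.
from statistics import NormalDist

sample_mean, pop_mean = 90, 82
pop_sd, n = 20, 81
alpha = 0.05

z = (sample_mean - pop_mean) / (pop_sd / n ** 0.5)
critical = NormalDist().inv_cdf(1 - alpha)  # ~1.645 for a right-tailed test
p_value = 1 - NormalDist().cdf(z)

print(round(z, 2))   # 3.6
print(z > critical)  # True: reject H0 at the 5% level
```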
Understanding a Two-Sample Z-Test
Here, let’s say we want to know if Girls on average score 10 marks more
than boys. We have the information that the standard deviation for girls'
scores is 100 and for boys’ scores is 90. Then we collect the data of 20
girls and 20 boys by using random samples and recording their marks.
Finally, we also set our α value (significance level) to be 0.05.
In this example:
Mean score for girls (sample mean) is 641
Mean score for boys (sample mean) is 613.3
Population standard deviation for girls is 100
Population standard deviation for boys is 90
Sample size is 20 for both girls and boys
Hypothesized difference between the population means is 10
Putting these into the formula, we get a z-score, and from it we compute a p-value of 0.278, which is greater than 0.05; hence we fail to reject the null hypothesis.
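The two-sample calculation above can be sketched as:

```python
# A sketch of the two-sample z-test above: do girls score more than 10
# marks higher than boys on average, with known population variances?
from statistics import NormalDist

mean_girls, mean_boys = 641, 613.3
sd_girls, sd_boys = 100, 90
n_girls = n_boys = 20
hypothesized_diff = 10

se = (sd_girls ** 2 / n_girls + sd_boys ** 2 / n_boys) ** 0.5
z = (mean_girls - mean_boys - hypothesized_diff) / se
p_value = 1 - NormalDist().cdf(z)  # right-tailed test

print(round(p_value, 3))  # 0.278 > 0.05, so we fail to reject H0
```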
T-TEST
If the sample size is less than 30 and the population variance is unknown, we must use a t-test.
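A minimal sketch of the one-sample t-statistic; all numbers here are hypothetical.

```python
# A sketch of a one-sample t-test statistic. All numbers are
# hypothetical: a sample of 16 exam scores with mean 72.5 and sample
# standard deviation 8, tested against a claimed population mean of 70.
sample_mean, pop_mean = 72.5, 70
sample_sd, n = 8, 16

t = (sample_mean - pop_mean) / (sample_sd / n ** 0.5)
print(t)  # 1.25

# The statistic is compared against a critical value from a t-table with
# n - 1 = 15 degrees of freedom (about 1.753 for a right-tailed test at
# alpha = 0.05), so here we would fail to reject H0.
```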
One-sample and Two-sample Hypothesis Tests
The population mean is 600
Standard Deviation for the sample is 13.14
The accuracy of these assumptions can be verified using several residual
plot types, which can also offer guidance on how to enhance the model.
For instance, if the regression is successful, the scatter plot of the residuals will look random: there should be no pattern in the residuals. If there is a pattern, the residuals may not be independent. On the
other hand, the normality assumption is more likely to be accurate if a
histogram plot of the residuals shows a symmetric bell-shaped
distribution.
An increasing trend in the residual plot (see the image below) suggests that the error variance increases with the independent variable, whereas a decreasing trend suggests that the error variance decreases with the independent variable. In neither case do the residuals have constant variance, so both patterns suggest that the regression is poor and that the assumption of constant variance is unlikely to be valid. A horizontal-band pattern, on the other hand, shows that the variance of the residuals is constant.
Checking normality of residuals
A histogram of the residuals can show whether the deviations are normally distributed. In a normal probability plot, the ordered residuals are plotted against theoretical normal quantiles at probabilities (i − 0.5)/n, where n is the total number of data points and i is the rank of the ith residual. The normal probability plot of the residuals looks like this:
Normal Probability Plot of the Residuals
Histogram
This is a famous illustration of a normal probability plot with a single outlier among otherwise normally distributed residuals. Other than one data point, the relationship is essentially linear. After eliminating the outlier from the data set, we might proceed under the presumption that the error terms are normally distributed.
Homoscedasticity:
If the residual plot is the same width for all predicted DV values, the data are homoscedastic. Typically, heteroscedasticity is shown by a cluster of points that becomes broader as the predicted DV values increase. Alternatively, you can examine a scatterplot between each IV and the DV to see if there is homoscedasticity. The point cluster should be about the same width throughout, just like the residual plot.
The residuals graphic that follows displays data that are largely
homoscedastic. Because the residual plot is rectangular and has a
concentration of points along the centre, the data in this residual plot
really support the hypotheses of homoscedasticity, linearity, and
normality.
Heteroscedasticity:
Heteroscedasticity is the opposite condition: the variance of the residuals changes across the predicted DV values. It typically appears as a funnel shape, where the cluster of points becomes wider (or narrower) as the predicted values increase. When this happens, the assumption of constant variance is violated and the regression's error estimates become unreliable.
CRISP-DM Overview:
As a methodology, it includes descriptions of the typical phases of a
project, the tasks involved with each phase, and an explanation of the
relationships between these tasks.
As a process model, CRISP-DM provides an overview of the data
mining life cycle.
BUSINESS UNDERSTANDING:
This phase refers to the case and the data: go through them and understand the complexities of the business captured in the data and the problem at hand. Using exploratory analysis, we can further understand the data, with descriptive, visual, and inferential analytics aiding data understanding in a business context.
DATA UNDERSTANDING:
Inferential Analytics: Inferential analytics draw valid inferences about a
population based on an analysis of a representative sample of that
population. The results of such an analysis are generalized to the larger
population from which the sample originates, to make assumptions or
predictions about the population in general.
What is regression?
Regression is one of the main types of supervised learning in machine learning. In regression, the training set contains observations (described by features) and their associated continuous target values. The process of regression has two phases: training, in which the model learns the relationship between features and targets, and prediction, in which the trained model estimates target values for new observations.
Stages of Regression Model
Classification Model:
Regression Properties (Diagnostics)
3. Normality of Residuals: To check for normality, obtain the
standardized residuals and check for normality using a distribution or
histogram plot.
MBA (ANALYTICS) PI QUESTIONS
What is a P chart?
A P chart is an attribute control chart used to monitor the proportion of defective (non-conforming) units in successive samples. By contrast, for variables data we first construct the R chart to ensure the process variation lies within range; if it does, we proceed with the X-bar chart.
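A P chart's control limits (centre line p̄ with limits p̄ ± 3√(p̄(1 − p̄)/n)) can be sketched as follows; the defect counts are hypothetical.

```python
# A sketch of P chart control limits: centre line p-bar and limits
# p-bar +/- 3*sqrt(p-bar*(1 - p-bar)/n). The defect counts are
# hypothetical, with n = 100 items inspected per sample.
defects = [4, 6, 3, 7, 5, 4, 6, 5]  # hypothetical defectives per sample
n = 100

p_bar = sum(defects) / (len(defects) * n)
sigma = (p_bar * (1 - p_bar) / n) ** 0.5
ucl = p_bar + 3 * sigma
lcl = max(0.0, p_bar - 3 * sigma)  # a proportion cannot fall below zero

print(round(p_bar, 3))  # 0.05
print(round(ucl, 4))
print(round(lcl, 4))
for i, d in enumerate(defects):
    assert lcl <= d / n <= ucl, f"sample {i} out of control"
```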
What are the various assumptions of linear regression?
- Linearity of the relationship between the independent and dependent variables
- Homoscedasticity
- Normality of residuals
- No autocorrelation
- Independent variables are not correlated among themselves (no multicollinearity)
We wish all the candidates the best of luck!