IMT-PG Programme in Management

Decision Sciences

Revision
Presented by: Dr. Anuja Shukla
Agenda
Topic / Time
Module 1
  Quiz Module 1: 5 minutes (https://forms.gle/fKZ4zUt1cdK4jfgS8)
  Doubt collection: 5 minutes
  Doubt resolution: 20 minutes
Module 2
  Quiz Module 2: 5 minutes (https://forms.gle/qD9vWqA2ziZnS1ea7)
  Doubt collection: 5 minutes
  Doubt resolution: 20 minutes
Module 3
  Quiz Module 3: 5 minutes (https://forms.gle/ztguRfJhWqAWhoUKA)
  Doubt collection: 5 minutes
  Doubt resolution: 20 minutes
Quiz: Final For Practice (https://forms.gle/SPqJEteFmDvj3TLd8)
Total Time: 90 minutes
Module 1
• Session 1: Basics of Statistics
• Session 2: Measure of Central Tendency
• Session 3: Probability and Probability Distributions
• Session 4: Sampling and Estimation
Descriptive and Inferential Statistics
• Descriptive Statistics is a branch of statistics that describes or summarizes a collection
of information. Descriptive statistics uses the data to provide descriptions of the
population, either through numerical calculations or graphs or tables.
• Inferential statistics makes inferences and predictions about a population based on a
sample of data taken from the population in question.

Statistics
• Descriptive Statistics: Presenting, organising and summarizing data
• Inferential Statistics: Drawing conclusions about a population based on data observed in a sample
Data Visualisation
• Data visualization is the graphical representation of information and data. By using
visual elements like charts, graphs, and maps, data visualization tools provide an
accessible way to see and understand trends, outliers, and patterns in data.

• Example: IMRB Report
Mobile advertisements are expected to grow from INR 34.5 billion in 2017 to INR 304.5 billion in 2023 at a CAGR of 43.8%.
Statistical tools for data visualisation

• Bar Graph
• Pareto Chart
• Histogram
• Line Chart
• Pie Chart
• Pivot chart

More examples on:


https://www.tutorialspoint.com/excel_data_analysis/excel_data_analysis_visualization.htm
https://courses.lumenlearning.com/wmopen-mathforliberalarts/chapter/introduction-representing-data-graphically/
Session 2: Measure of Central Tendency

Mean

Median

Mode
Mean

• Mean is the sum of all the data values divided by the total number of sample
values.
• Mean is commonly represented by the symbol 𝝁.
• Mean = Sum of all observations/ Number of observations

Q: Find Mean of following data: 4,7,9,10,12


Mean= sum/n
= 42/5= 8.4
Median

• Arrange the sample data in ascending order of value, from left to right; the
value in the middle is called the median.
• For an odd number of values, we have one median.
• For an even number of values, the median is the average of the two central
values.
• Suitable in case of outlier
Data:
a. N is odd
4,7,9,10,12
Median= 3rd observation =9

b. N is even
4,7,9,10
Median = Average of 2nd and 3rd observation = (7+9)/2 = 8
Mode

• The mode is the number that appears most frequently in a data set.
• A set of numbers may have one mode, more than one mode, or no mode at all.
• For qualitative data, it is not possible to measure the mean or median values, as there are no numerical values.
• Thus, the category with the highest frequency is considered as the measure of central tendency in such cases.

Example
Data: 4, 1, 5, 3, 0, 2
Mode: No mode (no value repeats)
Data: 4, 2, 4, 3, 2, 2, 1, 5
Mode: 2 (2 occurs three times and 4 twice; had they tied, the data would be bimodal)
Data: 4, 2, 1, 5, 4, 3, 2, 2, 3
Mode: 2 (2 occurs three times, 3 and 4 twice each; a three-way tie would make the data multimodal)
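The three measures can be checked with Python's standard `statistics` module. This is only a sketch on the slide data; the variable names are illustrative. Note that `multimode` returns every value that ties for the highest count, so for data with no repeats it returns all values, which the slides describe as "no mode".

```python
import statistics

data = [4, 7, 9, 10, 12]

# Mean: sum of observations divided by number of observations (42 / 5)
mean_value = statistics.mean(data)

# Median: middle value for odd n, average of the two middle values for even n
median_odd = statistics.median(data)
median_even = statistics.median([4, 7, 9, 10])

# Mode(s): all values that share the highest frequency
modes = statistics.multimode([4, 2, 4, 3, 2, 2, 1, 5])
ties_when_no_repeats = statistics.multimode([4, 1, 5, 3, 0, 2])

print(mean_value, median_odd, median_even, modes, ties_when_no_repeats)
```

Counting occurrences in 4, 2, 4, 3, 2, 2, 1, 5 shows that 2 appears three times and 4 twice, so the code reports a single mode for that data.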
Summary

Mean Average value of data set


Median Mid value of data
(Suitable in case of outliers)

Mode Most frequent value in data


Measure of Dispersion

Variance

Standard deviation

Interquartile ranges
Variance
• One way of measuring the variation within a dataset can be with respect to a fixed
reference value. In the case of variance, this fixed value is the mean.
• Variance is defined as the mean of the square of the difference between data
points and the mean value of all data points within a dataset.
• Variance is a measure of variability that utilizes all the data.
• Average (approximately) of squared deviations of values from the mean.
• So, the formula of variance is —

Variance =∑(x−μ)^2/(n) (for a population)


Variance = ∑(x−x̄)^2/(n-1) (for a sample)
where x is any data point within the dataset, n is the
number of data points, μ is the population mean, and x̄ is
the sample mean.
Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the variance
• This metric serves the purpose of measuring variation without
exaggerating its magnitude.
• It is popularly represented as 𝜎. So, the variance is represented as σ^2.

Sample standard deviation:

$S = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
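As a rough illustration of the population and sample formulas, the sketch below reuses the data 4, 7, 9, 10, 12 from the mean example (reusing that data here is an assumption, not part of the slide). The `statistics` functions `pvariance` and `variance` divide by n and n-1 respectively.

```python
import math
import statistics

data = [4, 7, 9, 10, 12]          # mean is 8.4, squared deviations sum to 37.2

pop_variance = statistics.pvariance(data)   # divides by n      -> 37.2 / 5
sample_variance = statistics.variance(data) # divides by n - 1  -> 37.2 / 4
sample_sd = statistics.stdev(data)          # square root of the sample variance

print(pop_variance, sample_variance, sample_sd)
```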
Interquartile range
• Suitable in case of outlier
• A quick way to suspect an outlier is to calculate the standard deviation. If the
result is much higher than expected, there is a high chance that your data
contains an outlier.
• In such cases, the interquartile spread is a much better way to communicate
the variation or spread in the data.
• Quartile values are the values in a sample at the 25th, 50th, 75th, and 100th
percentiles.

Quartile | Excel Function
25th percentile or First Quartile | QUARTILE(A1:A20, 1)
50th percentile or Second Quartile | QUARTILE(A1:A20, 2)
75th percentile or Third Quartile | QUARTILE(A1:A20, 3)
100th percentile or Fourth Quartile | QUARTILE(A1:A20, 4)
Quartiles

Interquartile range = Q3- Q1


=77-64
=13
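The same calculation can be sketched in Python. The data set below is hypothetical, chosen only so that Q1 = 64 and Q3 = 77 as in the slide (the slide's own data is not available in the text). `statistics.quantiles` with `method="inclusive"` is assumed here to be the closest match to Excel's QUARTILE, since both interpolate over the full range from minimum to maximum.

```python
import statistics

# Hypothetical marks, not the slide's original data
marks = [55, 60, 64, 66, 70, 74, 77, 82, 90]

# n=4 splits the data into quartiles and returns the three cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(marks, n=4, method="inclusive")

iqr = q3 - q1   # Interquartile range = Q3 - Q1
print(q1, q2, q3, iqr)
```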
Session 3: Probability, Sampling and Estimation

• Statistics and types
• Use of probability theory in Business decision making
• Random Variables
• Discrete & Continuous Random variables
• Binomial Distribution
• Normal Distribution
• Empirical Rule
• Standard Normal Distribution
Statistics
• Statistics provides tools that let you make inferences about data.

• Types: Descriptive and Inferential

• Tools for making statistical inferences are


1) built on top of probability theory, and
2) require an understanding of how samples behave when you take them from
distributions (defined by probability theory…).
Probability vs Statistics

• Probability theory is “the doctrine of chances”.


• It’s a branch of mathematics that tells you how often different kinds of events
will happen.
• For example, all of these questions are things you can answer using probability
theory:
• What are the chances of a fair coin coming up heads 10 times in a row?
• If I roll two six sided dice, how likely is it that I’ll roll two sixes?
• How likely is it that five cards drawn from a perfectly shuffled deck will all be
hearts?
• What are the chances that I’ll win the lottery?
Probability vs Statistics
• In statistics, we do not know the truth about the world.
• All we have is the data, and it is from the data that we want
to learn the truth about the world.
• Statistical questions tend to look more like these:
• If my friend flips a coin 10 times and gets 10 heads, are they playing a trick on
me?
• If five cards off the top of the deck are all hearts, how likely is it that the deck
was shuffled?
• If the lottery commissioner’s spouse wins the lottery, how likely is it that the
lottery was rigged?
Probability
• Probability quantifies the likelihood or belief that an event will occur.
• Probability is the branch of mathematics concerning numerical
descriptions of how likely an event is to occur or how likely it is that a
proposition is true. The probability of an event is a number between 0
and 1, where, roughly speaking, 0 indicates impossibility of the event
and 1 indicates certainty.
• p= probability of occurrence of an event
• q=probability of failure of an event (1-p)

p+q=1
Types of Probability

• There are three basic ways of classifying probability. These three
represent rather different conceptual approaches to the study of
probability theory.

Probability
• Classical or Theoretical Approach
• Empirical or Frequentist Approach
• Subjective Approach
Classical or Theoretical Approach

• Defines probability as ratio of favorable outcomes to the total outcomes.


• Also known as a priori probability.
• It relies on a number of assumptions, hence is the most restrictive approach, and it
is least useful in real-life situations.
• The classical/theoretical approach is the one where certain assumptions are made
(for instance, all possible outcomes are equally likely), and the probability values are
then calculated using a formula.
• One does not need to perform any experiment or gather any data when using this
approach to arrive at the probability of an event.
Calculation of Probability: Fair Coin

• Calculate the probability of getting heads on one toss of a fair coin.
• Assume: all the possible outcomes are equally likely to occur.
• Total outcomes = {Head, Tail}
• No. of total outcomes = 2
• Favorable outcomes = {Head}; no. of favorable outcomes = 1
• P(Head) = favorable outcomes / total outcomes = 1/2 = 0.5
Empirical or Frequentist Approach

• Defines probability as observed relative frequency of an event in a very large
number of trials.
• It makes fewer assumptions but requires the event to be capable of being repeated
a large number of times.
• Probability gains accuracy as we increase the number of observations.
• Probabilities are derived from observations.
• For probability of getting heads on tossing a coin, you would toss the coin several
times and note down the outcome of each trial.
• Let’s say you tossed the coin 10,000 times and got heads 5,052 times and tails 4,948 times.

• P(H)=5052/10,000
• Following the frequentist approach, you would conclude that the probability of getting heads is
0.5052 and that of getting tails is 0.4948 for that particular coin.
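The frequentist idea can be sketched as a small simulation. This is an illustration only: the helper `relative_frequency_of_heads` and the fixed seed are hypothetical choices, and a simulated fair coin stands in for physically tossing one.

```python
import random

def relative_frequency_of_heads(n_tosses, seed=0):
    """Simulate n_tosses of a fair coin and return the share that came up heads."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# Relative frequency from the slide's observed counts
slide_estimate = 5052 / 10_000

# Relative frequency from a simulated run of the same length
simulated_estimate = relative_frequency_of_heads(10_000)

print(slide_estimate, simulated_estimate)
```

With more tosses the estimate tends to settle closer to 0.5, which is the sense in which the probability "gains accuracy" as observations increase.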
Example
• Suppose an insurance company knows from past actuarial data that of all males 40
years old, about 60 out of every 100,000 will die within a 1 –year period. Using this
method, the company estimates the probability of death for that age group as 60
per 100,000. Calculate the probability of death.

p = 60/100,000 = 0.0006
Subjective Probability

• Deals with specific or unique situations typical of the business or management
world.
• Based upon some belief or educated guess of the decision maker.
• Subjective assessments of probability permit the widest flexibility of the three
concepts; subjective probability is also known as personal probability.

• Example: Judge punishing a criminal


• Example: Say it is your job to interview and select a new social service caseworker.
You have narrowed your choice to three people. Each has an attractive
appearance, a high level of energy, abounding self-confidence, a record of past
accomplishments, etc. What are the chances each will relate to clients
successfully? Here, choosing among the three will require you to assign a subjective
probability to each person’s potential.
Random Variable

• A Random Variable is a set of possible values from a random experiment.


• A Random Variable has a whole set of values, and it could take on any of those
values, randomly.
• A random variable, usually written X, is a variable whose possible values are
numerical outcomes of a random phenomenon

Types of Random Variables
• Discrete (e.g. Binomial Distribution)
• Continuous (e.g. Normal Distribution)
Discrete random variable

• A discrete random variable is one which may take on only a countable number of distinct
values such as 0,1,2,3,4,........
• The variables which can be counted, and do not have any decimal parts, are known as
discrete variables.
• Discrete random variables are usually (but not necessarily) counts. If a random variable can
take only a finite number of distinct values, then it must be discrete. The probability
distribution of a discrete random variable is a list of probabilities associated with each of its
possible values. It is also sometimes called the probability function or the probability mass
function.
• Examples: number of customers, roll of a die, toss of a coin, number of students in a class, number of
children in a family, the Friday night attendance at a cinema, the number of patients in a
doctor's surgery, the number of defective light bulbs in a box of ten.
• For example, the number of students in a class. A class can have 10 students or 11 students,
but it cannot have 10.25 students.
Continuous Random Variables
• A continuous random variable is one which takes an infinite number of possible
values.

• The variables which can be divided infinitely into smaller parts are known as
continuous variables. For example, a student’s height can be 1 metre or 0.99
metre, or 0.998 metre.

• A continuous random variable is not defined at specific values. Instead, it is defined
over an interval of values, and is represented by the area under a curve.
• The probability of observing any single value is equal to 0, since the number of
values which may be assumed by the random variable is infinite.

• Example: height, weight, the amount of sugar in an orange, the time required to
run a mile.
Density Curve
• Suppose a random variable X may take all values over an interval of real numbers.
Then the probability that X is in the set of outcomes A, P(A), is defined to be the
area above A and under a curve. The curve, which represents a function p(x), must
satisfy the following:
• 1: The curve has no negative values (p(x) ≥ 0 for all x)
• 2: The total area under the curve is equal to 1.
• A curve meeting these requirements is known as a density curve
• Link: http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm
Binomial Distribution
• The binomial probability distribution is the theoretical probability distribution of all
numbers of possible successes over a certain number of Bernoulli trials.
• A binomial experiment is a type of simple random experiment where only two mutually
exclusive outcomes are possible on any trial and those two outcomes are a success and
failure. Such trials where only one of two mutually exclusive outcomes is possible are
Bernoulli trials
• For example, flipping a coin is a Bernoulli trial, because only heads and tails are
possible. Heads could be defined as a “success” and tails could be defined as a
“failure.”
• A person with cancer who is taking a new experimental type of chemotherapy is a
Bernoulli trial, where the patient being cured is a “success” and the patient not
being cured is a “failure.”
• The binomial probability is the probability of observing a certain number of successes (r)
over a certain number of independent Bernoulli trials (n):
$P(X = r) = {}^{n}C_{r}\, p^{r} q^{n-r}$, where ${}^{n}C_{r} = \dfrac{n!}{r!\,(n-r)!}$ and n! = n*(n-1)*(n-2)*(n-3)....1
How many heads when we toss 3 coins?
• The three coins can land in eight possible ways:
HHH, HHT, HTT, HTH, THH, THT, TTH, TTT
• Possible values of X (number of heads) = {0, 1, 2, 3}
• We see just 1 case of Three Heads, but 3 cases of Two Heads, 3 cases of One Head,
and 1 case of Zero Heads. So:

• Total outcomes= 8

• P(X = 0) = 1/8 (TTT) P(X = 1) = 3/8 (TTH, THT, HTT)


• P(X = 2) = 3/8 (THH, HHT, HTH) P(X = 3) = 1/8 (HHH)
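The counts above can be reproduced by enumerating all eight outcomes. This is a minimal sketch; the names `outcomes` and `distribution` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^3 equally likely outcomes of tossing three coins, e.g. ('H', 'T', 'H')
outcomes = list(product("HT", repeat=3))

# Count how many outcomes give 0, 1, 2 or 3 heads
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability of each value of X = number of heads
distribution = {
    heads: Fraction(count, len(outcomes))
    for heads, count in sorted(head_counts.items())
}

for heads, probability in distribution.items():
    print(f"P(X = {heads}) = {probability}")
```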
Probability Distribution of discrete variable

How many heads when we toss 3 coins?


P(X = 0) = 1/8 (TTT)
P(X = 1) = 3/8 (TTH, THT, HTT)
P(X = 2) = 3/8 (THH, HHT, HTH)
P(X = 3) = 1/8 (HHH)

Examples:
1. A new drug is introduced to cure a disease, it
either cures the disease (it’s successful) or it doesn’t
cure the disease (it’s a failure).
2. If you purchase a lottery ticket, you’re either
going to win money, or you aren’t.

Note: The binomial distribution models a process in which each trial has only two possible outcomes, such as success or failure (true or false)
Example 1: The Binomial Probability Distribution

Ten consumers were asked to state their preferences between two types of
ice-cream. Assuming that there is no difference between the two types of
ice-cream, calculate the probability that:

a) 3 or fewer consumers will prefer ice-cream A.

b) 7 or more consumers will prefer ice-cream B.

p = 0.5, q = 0.5 and n = 10

Formula: $P(X = r) = {}^{n}C_{r}\, p^{r} q^{n-r}$
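One way to work this example is to sum the binomial formula over the relevant values of r. The helper `binomial_pmf` below is a hypothetical name for that formula, not something defined in the slides.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials: nCr * p^r * q^(n-r)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 10, 0.5

# a) 3 or fewer consumers prefer ice-cream A: r = 0, 1, 2, 3
p_three_or_fewer_a = sum(binomial_pmf(r, n, p) for r in range(0, 4))

# b) 7 or more consumers prefer ice-cream B: r = 7, 8, 9, 10 (B is also chosen with p = 0.5)
p_seven_or_more_b = sum(binomial_pmf(r, n, 1 - p) for r in range(7, 11))

print(p_three_or_fewer_a, p_seven_or_more_b)
```

Both answers come out to 176/1024, about 0.172, because "7 or more prefer B" is the same event as "3 or fewer prefer A".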
Example 2: The Binomial Probability Distribution
Approximately 20% of U.S. workers are afraid that they will never be able to retire.
Suppose 10 workers are randomly selected. What is the probability that none of the
workers is afraid that they will never be able to retire?
Solution:
Let n = 10, p = 0.2, q = 0.8 and x = 0, then

$P(X = x) = \binom{n}{x} p^{x} q^{n-x} = \dfrac{n!}{x!\,(n-x)!}\, p^{x} q^{n-x}$

for x = 0, 1, 2, …, n. By definition, 0! = 1.

P(X = 0) = (0.8)^10 ≈ 0.1074


Example 3: The Binomial Probability Distribution
Can you tell the difference between Coke and Pepsi in a blind taste test? Most
people say they can and have a preference for one brand or the other. However,
research suggests that people can correctly identify a sample of one of these
products only about 60 percent of the time. Suppose we decide to investigate this
question and select a sample of 15 college students.

a. What is the probability exactly 10 of the students surveyed will correctly identify
Coke or Pepsi?
b. What is the probability at least 10 of the students will correctly identify Coke or
Pepsi?
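A possible way to compute both parts, assuming each student identifies the drink independently with p = 0.6 and n = 15. The helper `binomial_pmf` is a hypothetical name for the binomial formula.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 15, 0.6

# a) exactly 10 correct identifications
p_exactly_10 = binomial_pmf(10, n, p)

# b) at least 10 correct identifications: r = 10, 11, ..., 15
p_at_least_10 = sum(binomial_pmf(r, n, p) for r in range(10, n + 1))

print(round(p_exactly_10, 4), round(p_at_least_10, 4))
```

The results are roughly 0.186 for part (a) and 0.403 for part (b).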

More Practice questions:
http://ecoursesonline.iasri.res.in/mod/resource/view.php?id=89767
https://www.six-sigma-material.com/Binomial-Distribution.html
https://www.coursera.org/lecture/descriptive-statistics-statistical-distributions-business-application/business-application-of-the-binomial-distribution-6N7gq
Normal Distribution
• Normal distribution is symmetric about its mean and extends infinitely
on both sides.
• Probability density is higher close to the mean and decreases
exponentially as we move further away from the mean.
• There is a high probability that the value of the random variable is close
to the mean. As we move further away from the mean, the probability
of the occurrence of such values decreases.
Example 1: The Normal Probability Distribution

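• Find the probability of randomly selecting a graduate that makes more
than Rs. 80000 a year, given that first-year salaries are normally distributed with a mean of
Rs. 60000 and a standard deviation of Rs. 15000 (the values used in the formula below).
• Hint: NORM.DIST with cumulative = TRUE gives the area to the left of the value, so subtract it from 1 to get the right tail
• =1- NORM.DIST(80000, 60000, 15000,TRUE)
≈ 9.12%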
Example 2: The Normal Probability Distribution

The amounts of money requested on home loan applications at State


Bank of follow the normal distribution with a mean of Rs. 70000 and a
standard deviation of Rs. 20,000. A loan application is received this
morning. What is the probability:
a. The amount requested is Rs. 80,000 or more?
b. The amount requested is between Rs. 65,000 and Rs. 80,000?
c. The amount requested is Rs. 65,000 or more?
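The three parts can be sketched with `statistics.NormalDist`, whose `cdf` method plays the same role as Excel's NORM.DIST with cumulative = TRUE. The variable names are illustrative.

```python
from statistics import NormalDist

loans = NormalDist(mu=70_000, sigma=20_000)

# a) Rs. 80,000 or more: right tail = 1 - area to the left of 80,000
p_80k_or_more = 1 - loans.cdf(80_000)

# b) between Rs. 65,000 and Rs. 80,000: difference of two left-tail areas
p_between_65k_80k = loans.cdf(80_000) - loans.cdf(65_000)

# c) Rs. 65,000 or more
p_65k_or_more = 1 - loans.cdf(65_000)

print(round(p_80k_or_more, 4), round(p_between_65k_80k, 4), round(p_65k_or_more, 4))
```

This gives approximately 0.309, 0.290 and 0.599 for parts (a), (b) and (c).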
Empirical Rule
• About 68% of observations fall within the first standard deviation (µ ± σ)
• 95% within the first two standard deviations (µ ± 2σ)
• 99.7% within the first three standard deviations (µ ± 3σ)
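These percentages can be checked against the standard normal distribution. A minimal sketch, with `coverage` as an illustrative name:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area between -k and +k standard deviations, for k = 1, 2, 3
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```

The areas are about 0.6827, 0.9545 and 0.9973, which the empirical rule rounds to 68%, 95% and 99.7%.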
Example 3: The Normal Probability Distribution

• A reputed college advertises its placement data. Suppose
that the data concerning the first-year salaries of college graduates is
normally distributed with the population mean µ = Rs. 60000 and the
population standard deviation σ = Rs. 15000. Find the probability of a
randomly selected graduate earning less than Rs. 45000 annually.
• Hint: To answer this question, we have to find the portion of the area under the
normal curve from 45000 all the way to the left.

=NORM.DIST(45000, 60000, 15000,TRUE)
= 15.87%
Example 4: The Normal Probability Distribution
• Suppose that you have opened a new branch of your office on the outskirts of the
city and you need to provide some extra compensation to the employees having a
higher commute time. Now, suppose that the commute time of the employees of
the new office follows a normal distribution, with a mean of 40 minutes and a
standard deviation of 10 minutes. What is the probability that the commute time of
an employee will be between 20 and 60 minutes?
Example 5: The Normal Probability Distribution
Among U.S. cities with a population of more than 250,000 the mean one-way
commute to work is 24.3 minutes. The longest one-way travel time is New York City,
where the mean time is 38.3 minutes. Assume the distribution of travel times in New
York City follows the normal probability distribution and the standard deviation is 7.5
minutes.
a. What per cent of the New York City commutes are for less than 30 minutes?
b. What per cent are between 30 and 35 minutes?
c. What per cent are between 30 and 40 minutes?
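A sketch of how the three parts might be computed, assuming the stated mean of 38.3 minutes and standard deviation of 7.5 minutes. The variable names are illustrative.

```python
from statistics import NormalDist

nyc_commute = NormalDist(mu=38.3, sigma=7.5)

# a) share of commutes shorter than 30 minutes (left tail)
p_less_than_30 = nyc_commute.cdf(30)

# b) share between 30 and 35 minutes
p_30_to_35 = nyc_commute.cdf(35) - nyc_commute.cdf(30)

# c) share between 30 and 40 minutes
p_30_to_40 = nyc_commute.cdf(40) - nyc_commute.cdf(30)

for label, p in [("a", p_less_than_30), ("b", p_30_to_35), ("c", p_30_to_40)]:
    print(f"{label}) {p:.2%}")
```

The answers are roughly 13.4%, 19.6% and 45.5%.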
Standard normal distribution
• Standard normal distribution is a normal distribution with a mean (µ) of 0 and a
standard deviation (σ) of 1.
• The formula to convert any point in a normal distribution into its equivalent point
(referred to as the z-score) on the standard normal distribution is as follows:
• Z−score=(x−μ)/σ
https://www.youtube.com/watch?v=2tuBREK_mgE
Session 4: Sampling and Estimation
• Sampling
• Central Limit Theorem
• Confidence Intervals

https://learn.upgrad.com/v/course/791/session/90258/segment/504954
Sample vs Population
Population | Sample
The measurable quality is called a parameter | The measurable quality is called a statistic
The population is the complete set | The sample is a subset of the population
Reports are a true representation of opinion | Reports have a margin of error and confidence interval
It contains all members of a specified group | It is a subset that represents the entire population
Central Limit Theorem

• Central limit theorem states that if you take sufficiently large random samples
(sample size ‘n’) from any population distribution with a mean μ and standard
deviation σ, the distribution of sample means (or the ‘sampling distribution of
sample means’) will be approximately a normal distribution with a mean µ and standard deviation
σ/√n.
• First, the mean of the sampling distribution is equal to the mean of
the population.
• Second, the standard deviation of the sampling distribution is equal
to the standard deviation of the population divided by the square root of the sample
size.
• Standard deviation of the sample means distribution is also referred to as the
‘standard error of the mean’, or simply the ‘standard error’, and is denoted by ‘SE’.
• Standard error (n ≥ 30) = σ/√n.
Refer: http://onlinestatbook.com/stat_sim/sampling_dist/
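The theorem can be illustrated with a small simulation. Everything below is an assumption made for the sketch: the population is exponential with mean 1 and standard deviation 1 (deliberately not normal), the sample size is 36, and 5,000 samples are drawn with a fixed seed.

```python
import math
import random
import statistics

rng = random.Random(42)       # fixed seed so the run is repeatable
n = 36                        # size of each sample
num_samples = 5_000           # number of samples drawn

# Draw many samples from a skewed population and record each sample's mean
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)   # should be close to mu = 1
sd_of_means = statistics.stdev(sample_means)     # should be close to sigma / sqrt(n)
expected_se = 1.0 / math.sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_se, 3))
```

A histogram of `sample_means` would look close to a bell curve even though the underlying population is strongly skewed.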
Central Limit Theorem
• From the formula of the standard error, it is clear that as the sample size increases, the
sampling distribution of sample means becomes narrower and better resembles a
normal distribution.
• To summarise, the central limit theorem claims that irrespective of the probability
distribution of the population, the distribution of sample means follows a normal
distribution if the sample size is sufficiently large.

• The Central Limit Theorem tells us that if we take many samples of size n, compute the mean of each,
and plot the frequencies of those means, we get an approximately normal distribution. As the
sample size n increases, the approximation improves.

• Mean of the sample means is the same as the mean of the population.
• Standard deviation of the sample means is equal to the standard deviation of the population divided
by the square root of the sample size.
• As a rule of thumb, the central limit theorem is applied for sufficiently large sample sizes (n≥30).
https://www.youtube.com/watch?v=b5xQmk9veZ4
References
• https://www.mathgoodies.com/lessons/vol6/independent_events
• https://www.mathgoodies.com/lessons/vol6/addition_rules
• https://www.investopedia.com/ask/answers/021215/what-difference-between-standard-deviation-and-variance.asp#:~:text=Key%20Takeaways,average%20of%20all%20data%20points
Additional Links
• https://learn.upgrad.com/v/course/791/session/90255/segment/504929
• https://www.youtube.com/watch?v=gUp2xk5pJcM
• Online visualization tool
• https://www.mathsisfun.com/data/data-graph.php
• http://onlinestatbook.com/stat_sim/sampling_dist/
• https://www.intmath.com/counting-probability/normal-distribution-graph-interactive.php
• Probability Questions
• https://www.careerbless.com/aptitude/qa/discuss/270_1.php
• https://www.six-sigma-material.com/Binomial-Distribution.html
• https://learn.saylor.org/course/view.php?id=109&sectionid=3922
• https://www.analyzemath.com/statistics/normal_distribution.html
• https://crumplab.github.io/statistics/probability-sampling-and-estimation.html
• Moving normal distribution
• https://crumplab.github.io/statistics/probability-sampling-and-estimation.html
• https://sites.google.com/site/fundamentalstatistics/chapter-9

Video:
https://www.youtube.com/channel/UCiiyrRcEuDSzInajTud90Sw
Mean: https://www.youtube.com/watch?v=mk8tOD0t8M0
Normal Distribution: https://www.youtube.com/watch?v=2tuBREK_mgE
Empirical rule: https://www.youtube.com/watch?v=mtbJbDwqWLE&t=3s
Central Limit Theorem: https://www.youtube.com/watch?v=b5xQmk9veZ4
Module 2: Hypothesis Testing
✓Hypothesis
✓Need of Hypothesis
✓Types of Hypothesis : Null and Alternate
✓Types of tail in test
✓Step-by-step process of hypothesis testing
✓Converting of business problem into a hypothesis statement
✓Testing of Hypothesis: Critical Value method and p value method
✓Types of Errors
✓Examples for practice
✓Application of hypothesis in Industry
Test

Test | When to apply | Parameter of measurement
Z test (two tail, left tail, right tail) | One sample | Mean
Student t test for single sample | One sample | Mean
T test for two sample | Two sample | Mean
A/B Testing | Compare two variations of webpages/options | Pop Proportion
Type of Test

Case | Population standard deviation | Sample size | Test
One sample (normality assumption) | Known | N>30 | Z test
One sample (normality assumption) | Not known | N<30 | One-sample t test
One sample (normality assumption) | *Not known | N>30 | One-sample t test
Two sample | | | Paired two-sample means test
Two sample | | | Unpaired two-sample means test
Comparing two versions | | | A/B Testing
Confidence level

• Confidence interval is a range of values, derived from sample statistics, which is
likely to contain the value of an unknown population parameter
• The confidence level is defined for the hypothesis test according to the accuracy
needed.
• A higher confidence level indicates that more evidence is needed to reject the
null hypothesis.
• Therefore, increasing the confidence level makes it harder to reject the null
hypothesis.
• Inversely, a low confidence level indicates that the null hypothesis can be
rejected easily.
Level of Significance
• The significance level is the probability of rejecting the null hypothesis when it is
true. For example, a significance level of 0.05 indicates a 5% risk of concluding that
a difference exists when there is no actual difference. Lower significance levels
indicate that you require stronger evidence before you will reject the null
hypothesis.
• Level of significance (α) = 1 − confidence level
Hypothesis
• A research hypothesis is a specific, clear, and testable proposition or predictive
statement about the possible outcome of a scientific research study based on a
particular property of a population, such as presumed differences between groups
on a particular variable or relationships between variables.
• Decision-makers often face situations wherein they are interested in testing
hypotheses on the basis of available information and then take decisions on the
basis of such testing.
• In social science, where direct knowledge of population parameter(s) is rare,
hypothesis testing is an often-used strategy for deciding whether sample data
offer enough support for a hypothesis that a generalisation can be made.
• A hypothesis may be defined as a proposition or a set of propositions set forth as an
explanation for the occurrence of some specified group of phenomena, either
asserted merely as a provisional conjecture to guide some investigation or accepted
as highly probable in the light of established facts.
Decision making in Industry

• An airline company claims that 90% of its flights are on time.
• A consultant claims that using just-in-time production can reduce your inventory
cost per unit by ₹10.
• A tyre manufacturer claims its tyres last 50% longer than its competitors’. You are
probably left wondering how many of these are actually true.
• For instance, suppose that the fallout rate of samples drawn from two different
groups is 15% and 10%, respectively. Saying that one is better than the other on
these figures alone would be a premature judgment, because the gap may be due to sampling variation.
• Hypothesis testing is designed to detect significant differences: differences that
are unlikely to have occurred by random chance.

• Additional Reading: https://towardsdatascience.com/how-to-interpret-p-value-with-covid-19-data-edc19e8483b
Characteristics of hypothesis
• Hypothesis should be clear and precise.
• Hypothesis should be capable of being tested
• Hypothesis should state relationship between variables, if it happens to be a
relational hypothesis.
• Hypothesis should be limited in scope and must be specific. A researcher must
remember that narrower hypotheses are generally more testable and should
develop such hypotheses.
• Hypothesis should be consistent with most known facts
Types of Hypothesis

Null Hypothesis (represented by H0)
A null hypothesis is a statement of the status quo, one of no difference or no effect. If the null hypothesis is
not rejected, no changes will be made.

Alternate Hypothesis (represented by Ha)
An alternative hypothesis is one in which some difference or effect is expected. Accepting the
alternative hypothesis will lead to changes in opinions or actions.

The null hypothesis refers to a specified value of the population parameter, not of a sample statistic.
A null hypothesis may be rejected, but it can never be accepted based on a single test.
Tails of test

Null | Alternate | Type of test
H0: Mean = 10 | Ha: Mean ≠ 10 | Two tail
H0: Mean ≥ 10 | Ha: Mean < 10 | Left tail
H0: Mean ≤ 10 | Ha: Mean > 10 | Right tail
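For a z test, the type of tail decides where the rejection region sits. The sketch below assumes a significance level of 0.05 (an illustrative choice) and uses `NormalDist.inv_cdf` to find the cut-off z values.

```python
from statistics import NormalDist

alpha = 0.05                      # assumed level of significance
z = NormalDist()                  # standard normal distribution

# Two tail: alpha is split equally between both ends
two_tail_critical = z.inv_cdf(1 - alpha / 2)    # reject H0 if |z| exceeds this

# Left tail: the whole of alpha sits in the lower end
left_tail_critical = z.inv_cdf(alpha)           # reject H0 if z is below this

# Right tail: the whole of alpha sits in the upper end
right_tail_critical = z.inv_cdf(1 - alpha)      # reject H0 if z is above this

print(round(two_tail_critical, 3), round(left_tail_critical, 3), round(right_tail_critical, 3))
```

At α = 0.05 these are approximately ±1.96 for a two-tail test, −1.645 for a left-tail test and +1.645 for a right-tail test.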
Steps Involved in Hypothesis Testing
1. Formulate H0 and H1
2. Select Appropriate Test
3. Choose Level of Significance, α
4. Collect Data and Calculate Test Statistic
5. Either:
   a. p-value method: Determine Probability Associated with Test Statistic, then Compare with Level of Significance, α
   b. Critical value method: Determine Critical Value of Test Statistic (TSCR), then Determine if TSCAL falls into (Non) Rejection Region
6. Reject or Do not Reject H0
7. Draw Marketing Research Conclusion
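The steps can be traced with a one-sample, two-tail z test. All numbers below are hypothetical and chosen only for illustration: H0: μ = 80, a sample of 100 with mean 75, a known population standard deviation of 20, and α = 0.05.

```python
from math import sqrt
from statistics import NormalDist

# Steps 1-3: hypotheses (H0: mu = 80, Ha: mu != 80), z test, significance level
mu_0 = 80.0          # hypothetical value stated in H0
alpha = 0.05         # hypothetical level of significance

# Step 4: sample data (hypothetical) and the calculated test statistic
sample_mean, sigma, n = 75.0, 20.0, 100
standard_error = sigma / sqrt(n)
z_cal = (sample_mean - mu_0) / standard_error

std_normal = NormalDist()

# Step 5a: p-value method (two tail, so both ends are counted)
p_value = 2 * std_normal.cdf(-abs(z_cal))
reject_by_p_value = p_value < alpha

# Step 5b: critical value method
z_critical = std_normal.inv_cdf(1 - alpha / 2)
reject_by_critical_value = abs(z_cal) > z_critical

# Step 6: both methods lead to the same decision
print(z_cal, round(p_value, 4), round(z_critical, 3))
print("Reject H0" if reject_by_p_value else "Do not reject H0")
```

With these numbers the test statistic is −2.5 and the p-value is about 0.012, so H0 is rejected at the 5% level under either method.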


Formulating the Hypotheses
• A well-known car-maker claims that one of its cars has mileage of at least 17
kilometres per litre. You want to challenge this claim. Define the null and alternative
hypotheses for the problem.

• Hypothesis Statement
Null hypothesis: The mileage is greater than or equal to 17
(as this is the default claim made by the brand )
Alternative hypothesis: The mileage is less than 17
(as this challenges the null hypothesis)

• Mathematically
H0: Mileage (mean) ≥ 17
Ha: Mileage (mean) < 17
Formulating the Hypotheses

S.No. | Statement in question | Null Hypothesis statement
1 | At least | More than or equal to (≥)
2 | More than | Less than or equal to (≤)
3 | Less than | More than or equal to (≥)
4 | At most | Less than or equal to (≤)

Convention: Include the equal sign in the null hypothesis statement


Formulating the Hypotheses
• Let’s say you are the COO of a shoe-manufacturing company. An employee has
developed a new sole and claims that incorporating it will decrease the wear after
three years of use by more than 9%. Now, suppose you want to test this claim.
• What will be the null and alternative hypotheses for the sole developed by the
employee in this scenario?

H0: Decrease in wear after 3 years ≤ 9%; Ha: Decrease in wear after 3 years > 9%
Formulating the Hypotheses
• Mr. Mohan of the Civil Engineering Department wants to test the load bearing
capacity of an old bridge which must be more than 10 tons, in that case he can
state his hypotheses as under:
• Null hypothesis H0: µ ≤ 10 tons
• Alternative Hypothesis Ha: µ > 10 tons
A Broad Classification of Hypothesis Tests

Hypothesis Tests
• Tests of Association
• Tests of Differences
  • Distributions
  • Means
  • Proportions
  • Median/Rankings
Formulating the Hypotheses
• The average score in an aptitude test administered at the national level
is 80. To evaluate a state’s education system, 100 of the state’s students
were selected at random, and their average score was 75. The state
wants to know if there is a significant difference between the local
scores and the national scores.

• Null hypothesis H0 : µ = 80
• Alternative Hypothesis Ha : µ ≠ 80
Formulating the Hypotheses

• It is believed that the average commute time for an employee to and
from their office in Hyderabad is at least 35 minutes. Now, suppose you
want to test this claim.
• What will be the null and alternative hypotheses in this case if the
average commute time is represented by μ?
• H0: μ ≥ 35 minutes; Ha: μ < 35 minutes
Formulating the Hypotheses
• Goodyear has launched a new tyre, which, it claims, can travel more
than 7,500 miles before it needs any replacement.
• Assuming that the ‘average distance travelled before replacement’ is
given by μ, what would be the null and alternative hypotheses in this
case?

Ho: μ ≤ 7500 miles; Ha: μ > 7500 miles


Formulating the Hypotheses
• Suppose flour packaged by a manufacturer is sold by weight; and a particular size of
package is supposed to average 40 ounces. Suppose the manufacturer wants to test
to determine whether their packaging process is out of control as determined by
the weight of the flour packages. The null hypothesis for this experiment is that the
average weight of the flour packages is 40 ounces (no problem). The alternative
hypothesis is that the average is not 40 ounces (process is out of control).
• H0: Mean = 40 oz
• Ha: Mean ≠ 40 oz
Formulating the Hypotheses: Practice sets
• A financial investment firm wants to test to determine whether the average hourly
change in the Dow Jones Average over a 10-year period is +0.25.
• A manufacturing company wants to test to determine whether the average
thickness of a plastic bottle is 2.4 millimeters.
• A retail store wants to test to determine whether the average age of its customers
is less than 40 years.
Two tail test

• Example: OnePlus
• You need to verify whether the OnePlus 6 takes 30 minutes to reach 60% charge,
since this is the popular sentiment.
• First, if the time taken is less than 30 minutes, you want to revise your claim to
boast about the better figure.
• Second, if the time taken is more, you want the engineers to fix this issue.
• What will your null and alternative hypotheses be?
• Hypothesis Statement
• Null hypothesis: The time needed to charge till 60 percent is equal to 30 minutes.
• Alternative hypothesis: The time needed to charge till 60 percent is not equal to 30
minutes.
• For Testing hypothesis
• Ho: Mean = 30
• Hα: Mean ≠ 30
• Assumptions
• Population is normal
• Sample size is large (n>30)
• Test used: Z score, two tail hypothesis test
Results of Hypothesis
Critical value method
• Sample z-score lies within the critical z range: Fail to reject null hypothesis
• Sample z-score lies outside the critical z range: Reject null hypothesis

P value method
If p<=alpha , Reject Ho
If p> alpha, Fail to Reject Ho
Confidence level
✓ The confidence level or reliability is the expected percentage of times that the actual value will fall
within the stated precision limits.
✓ Thus, if we take a confidence level of 95%, then we mean that there are 95 chances in 100 (or .95 in
1) that the sample results represent the true condition of the population within a specified precision
range against 5 chances in 100 (or .05 in 1) that it does not.
✓ Confidence level indicates the likelihood that the answer will fall within that range, and the
significance level indicates the likelihood that the answer will fall outside that range.
✓ We can always remember that if the confidence level is 95%, then the significance level will be (100
– 95) i.e., 5%; if the confidence level is 99%, the significance level is (100 – 99) i.e., 1%.
✓ We should also remember that the area of normal curve within precision limits for the specified
confidence level constitute the acceptance region and the area of the curve outside these limits in
either direction constitutes the rejection regions.
Z Score

✓If the z-score of the sample lies further away from the center than the critical z-
values, the null hypothesis is rejected.
✓Otherwise, the test fails to reject the hypothesis.
✓The only two possible outcomes of a hypothesis test are ‘reject the null
hypothesis’ or ‘fail to reject the null hypothesis’. This hypothesis can never be
‘accepted’.
Commonly used critical z scores
Left tail test Two tail test Right tail test
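The critical z-scores for the usual confidence levels can be looked up in a z-table or computed directly. Below is a minimal sketch using Python's standard library; the function name critical_z and its tail argument are illustrative assumptions, not part of the course material.

```python
from statistics import NormalDist


def critical_z(confidence_level, tail="two"):
    """Return the critical z-score(s) for a confidence level and tail type."""
    alpha = 1 - confidence_level          # significance level
    std_normal = NormalDist()             # mean 0, standard deviation 1
    if tail == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)   # alpha is split across both tails
        return (-z, z)
    if tail == "left":
        return std_normal.inv_cdf(alpha)
    if tail == "right":
        return std_normal.inv_cdf(1 - alpha)
    raise ValueError("tail must be 'left', 'two' or 'right'")


# Print left-tail, two-tail and right-tail critical values side by side.
for level in (0.90, 0.95, 0.99):
    lower, upper = critical_z(level, "two")
    print(level,
          round(critical_z(level, "left"), 3),
          (round(lower, 3), round(upper, 3)),
          round(critical_z(level, "right"), 3))
```

For a 95% confidence level this gives roughly ±1.96 for a two tail test and 1.645 (or -1.645) for a one tail test.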
Testing hypothesis : Z score

• The z-score is the distance of a point from the centre of the distribution (in terms of standard deviations)


Testing hypothesis : P Value
✓ An alternative way of obtaining the test
result is by calculating the p-value.
✓ The p-value can be calculated from the
z-score, using a z-table or by inserting
the z-score into a p-value calculator.
✓ The null hypothesis can be rejected at
all confidence levels below 1-p.
✓ The p-value is the probability of observing
a sample result at least as extreme as the one
obtained, assuming the null hypothesis is true.
It is not the probability of the null hypothesis
being true.
✓ It directly tells the confidence level at
which the null hypothesis can be rejected.
✓ If the p-value is less than or equal to the
significance level (α), then you can
reject the null hypothesis.
P value method
If p<=alpha , Reject Ho
If p> alpha, Fail to Reject Ho
http://courses.atlas.illinois.edu/spring2016/STAT/STAT200/pnormal.html
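As an alternative to a z-table or an online calculator, the p-value can be computed from the z-score in a few lines. This is a sketch under stated assumptions: the helper name p_value and the sample z-score of 2.1 are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist


def p_value(z, tail="two"):
    """p-value of a z-score under the standard normal distribution."""
    cdf = NormalDist().cdf
    if tail == "left":
        return cdf(z)                     # area to the left of z
    if tail == "right":
        return 1 - cdf(z)                 # area to the right of z
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))      # both tails beyond |z|
    raise ValueError("tail must be 'left', 'two' or 'right'")


alpha = 0.05                              # significance level (95% confidence)
z_sample = 2.1                            # hypothetical sample z-score
p = p_value(z_sample, "two")
decision = "reject H0" if p <= alpha else "fail to reject H0"
print(round(p, 4), decision)
```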
Right Tail test
• Example: Hypothesis test from the perspective of a OnePlus 6
customer
• Will you care if the OnePlus 6 makes the claim of “a day’s power in half
an hour” and then overperforms by taking less time to charge?
• I would care only if the phone was underperforming. Therefore, it is
often sufficient to perform the hypothesis test on only one side of the
curve, depending on the context.
• Null hypothesis : The time needed ≤ 30 minutes.
• Alternative hypothesis : The time needed is > 30 minutes.
Hypothesis test : One plus Customer

• Two tail test One Tail test (Right Tail)

Example 2: MS EXCEL

Fail to reject Null Hypothesis


Example: One sample t test
• Suppose you’re testing the effect of a medicine on 15 volunteers. The
medicine is supposed to increase the presence of hormone XYZ in the
volunteers’ blood, by exactly 10 micro-units. What will the null and
alternative hypotheses be?
• What kind of hypothesis test will you conduct? (t test, n<30)
• The medicine is believed to increase the hormone by exactly 10 micro-units.
Since we’re talking about the hormonal effects of a medicine, you won’t
want any positive or negative deviation. (two tail, Increase=exact 10)

• Hypothesis Statement
• Ho: Increase in XYZ =10 micro-units
• Hα : Increase in XYZ ≠10 micro-units

• Example 4: MS EXCEL
Left Tail test

• Imagine you’re the owner of a pizza company, and you claim that your pizzas
are more than 9 inches in diameter. But you’ve been receiving complaints
from some of your customers, who say that the pizzas are actually smaller.
Your task is to now find out whether your chefs are producing smaller pizzas.
In this case, you will conduct a ‘left-tailed test’ by checking whether your
sample mean is significantly less than 9 inches, since you’re checking
whether the complaints about smaller pizzas are true.
• Hypothesis Statement
• Null hypothesis : Pizza size is at least 9 inches (i.e. 9 or more).
• Alternative hypothesis : Pizza size is less than 9 inches
• Mathematically
• Null hypothesis : Pizza size ≥ 9.
• Alternative hypothesis : Pizza size is < 9.
Two Sample t Test
• When there is a need to compare the means of two samples, a two-
sample t-test is conducted. In such a case, the formula for the t-statistic
becomes
Types of two sample test
• Paired t test - Paired means that both samples consist of the same test subjects
• Unpaired t test- Unpaired means that both samples consist of distinct test subjects.
Example
• Let’s take the medicine example again. This time, you want to test the
default belief that the medicine affects males and females in the same
way.
• You have 15 male volunteers and 15 female volunteers. You measure
the increase in hormone XYZ in these patients, on taking the medicine.
• Male=Sample A, Female= Sample B.
• Remember, the default belief states that the medicine has the same
effect in both sexes.
• Hypothesis Statements
• Null hypothesis : Mean of the male sample and the mean of the female
sample are equal. Or
• Null hypothesis : The mean of sample A - mean of sample B = 0
• Alternative hypothesis : The mean of sample A - mean of sample B ≠ 0
• Example 5: Excel
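A minimal sketch of an unpaired two-sample t-statistic is given below. It assumes the unequal-variance (Welch) form, which may differ from the variant used on the slide or in the Excel example, and the hormone readings are made-up illustration data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Unpaired two-sample t-statistic and degrees of freedom (Welch form)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a, var_b = variance(sample_a), variance(sample_b)   # 'n-1' variances
    se = sqrt(var_a / n_a + var_b / n_b)                     # standard error of the difference
    t = (mean(sample_a) - mean(sample_b)) / se
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (var_a / n_a + var_b / n_b) ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical increases in hormone XYZ (micro-units)
male = [10.2, 9.8, 11.1, 10.5, 9.9]      # Sample A
female = [10.0, 10.4, 9.7, 10.9, 10.1]   # Sample B
t_stat, dof = welch_t(male, female)
print(round(t_stat, 3), round(dof, 1))
```

The t-statistic is then compared with the critical t-value for the chosen confidence level, exactly as the z-score is compared with the critical z-values. Libraries such as SciPy offer this test ready-made (scipy.stats.ttest_ind), if available.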
Summary
1. Define the hypothesis statements: Your test will either ‘reject’ or ‘fail to reject’ the null hypothesis.
2. Collect as many data points as possible: The data points you collect will produce one sample. The size
of this single sample will depend on how many data points you take.
3. Measure the sample mean and the sample standard deviation: The standard deviation should be
calculated using the ‘n-1’ method. The STDEV function in Excel takes care of this.
4. Identify the distribution of the sample means: If the sample size is larger than 30, the distribution will
be a normal one (We’re only focusing on normal distributions for now).
5. Define the confidence level: This is the level of surety that you demand from a hypothesis test. The
higher the confidence level, the harder it is to reject the null hypothesis.
6. Find the critical z-scores of the confidence level and the test statistic or the z-score of the sample: The
z-score of the sample can be calculated by subtracting the hypothesised mean from the sample mean
and dividing the result by the standard error, i.e. the population standard deviation divided by the square root of the sample size.
7. Compare the sample test statistic with the critical z-scores: Here, you check whether the sample
statistic is more extreme than the z-scores.
8. If the sample test statistic is more extreme than the critical z-scores, you will reject the null
hypothesis. Otherwise, you will fail to reject it.
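The steps above can be sketched in code as follows. This is an illustration only: the function name z_test is an assumption, the charging times are made-up data, and the sample standard deviation ('n-1' method) is used as the estimate of the population standard deviation.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_test(sample, hypothesised_mean, confidence_level=0.95, tail="two"):
    """Return (sample z-score, critical z-score, reject H0?)."""
    n = len(sample)
    sample_mean = mean(sample)                       # step 3
    sample_sd = stdev(sample)                        # 'n-1' method, like Excel's STDEV
    z_sample = (sample_mean - hypothesised_mean) / (sample_sd / sqrt(n))  # step 6
    alpha = 1 - confidence_level                     # step 5
    std_normal = NormalDist()
    if tail == "two":
        z_crit = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z_sample) > z_crit              # steps 7 and 8
    elif tail == "left":
        z_crit = std_normal.inv_cdf(alpha)
        reject = z_sample < z_crit
    elif tail == "right":
        z_crit = std_normal.inv_cdf(1 - alpha)
        reject = z_sample > z_crit
    else:
        raise ValueError("tail must be 'left', 'two' or 'right'")
    return z_sample, z_crit, reject


# Hypothetical charging times in minutes (n = 40 > 30); H0: mean = 30
charging_times = [29 + (i % 5) for i in range(40)]
z, z_critical, reject = z_test(charging_times, 30, 0.95, "two")
print(round(z, 3), round(z_critical, 3), "reject H0" if reject else "fail to reject H0")
```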
Errors in Hypothesis test

Left tail test


Null hypothesis : Pizza size ≥ 9.
Alternative hypothesis : Pizza size is < 9.

• Type I error: the null hypothesis was true but was rejected (pizza size was ≥ 9, but H0 was rejected)
• Type II error: failing to reject H0 when H0 is false (pizza size was not ≥ 9, but H0 was not rejected)
Cost of Error

• Type II error is more costly


Cost of Error

• Type I error is more costly
Handling Error
There are two ways of handling error:
1. Increasing the confidence level of the test
a. Reduces Type I error
b. Increases Type II error
2. Increasing the sample size
a. Reduces Type II error
b. Doesn’t affect Type I error
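The trade-off can be checked numerically for the left tail pizza test (H0: size ≥ 9). The sketch below computes the Type II error probability for a given sample size and confidence level; the true mean of 8.8 inches and the standard deviation of 0.5 inches are hypothetical values, and the function name is an assumption.

```python
from math import sqrt
from statistics import NormalDist


def type_2_error(mu0, true_mean, sigma, n, confidence_level):
    """P(fail to reject H0) for a left tail z-test when the true mean is below mu0."""
    alpha = 1 - confidence_level                 # Type I error probability
    se = sigma / sqrt(n)                         # standard error of the sample mean
    z_crit = NormalDist().inv_cdf(alpha)         # negative critical z-score
    cutoff = mu0 + z_crit * se                   # reject H0 if sample mean < cutoff
    # Probability that the sample mean lands at or above the cutoff
    return 1 - NormalDist(true_mean, se).cdf(cutoff)


beta_95_n30 = type_2_error(9, 8.8, 0.5, 30, 0.95)
beta_99_n30 = type_2_error(9, 8.8, 0.5, 30, 0.99)    # higher confidence level
beta_95_n100 = type_2_error(9, 8.8, 0.5, 100, 0.95)  # larger sample
print(round(beta_95_n30, 3), round(beta_99_n30, 3), round(beta_95_n100, 3))
```

Raising the confidence level from 95% to 99% lowers the Type I error (5% to 1%) but raises the Type II error, while raising the sample size lowers the Type II error and leaves the Type I error at the chosen significance level.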
Summary
✓When the test needs to check only positive or negative deviation from the null
hypothesis, a one-tailed test is performed.
✓When the test needs to check for deviation on either side of the null hypothesis, a
two-tailed test is performed.
✓When the sample size is low, a t-test is performed.
✓A t-test is also preferred over a z-test when the population standard deviation is
unknown.
✓When two sample means need to be checked for equality, a two-sample t-test is
performed.
✓When there is a need to check whether an entire distribution is similar to another,
a goodness of fit test is performed.
✓Hypothesis testing also carries some probability of committing errors. The errors
can be of two types: Type I and Type II.
A/B testing
✓An A/B test tells you whether there is a statistical difference in the performance of
the two options.
✓It is a data-driven decision-making system.
✓A/B tests are used in the industry whenever there is a need to compare two alternatives
and make a choice between them.
✓The A/B test can be considered the most basic kind of randomized controlled
experiment.
A/B testing
• In the 1920s statistician and biologist Ronald Fisher discovered the most important
principles behind A/B testing and randomized controlled experiments in general.
• Fisher ran agricultural experiments, asking questions such as, What happens if I put
more fertilizer on this land? The principles persisted and in the early 1950s
scientists started running clinical trials in medicine. In the 1960s and 1970s the
concept was adapted by marketers to evaluate direct response campaigns (e.g.,
would a postcard or a letter to target customers result in more sales?).
Areas of application
• Medicine, to understand if a drug works or not
• Economics, to understand human behaviour
• Foreign aid and charitable work (the reputable ones at least), to
understand which interventions are most effective at alleviating
problems (health, poverty, etc)
Example: A/ B testing
• Let’s say John builds a website for a free e-book and is testing out two colour
variations — red and blue. On the red website, 45 out of 100 visitors downloaded
the e-book. But on the blue website, 47 out of 100 visitors downloaded the e-book.
Based on this, John may conclude that the blue website is performing better.

• However, John’s method can backfire. This is because he did not bother to check for
statistical significance. The difference in performance observed may be due to plain
old randomness. Thus, there’s a high probability that he may end up with an
inferior website colour.

• You will tackle this problem through an A/B test


Example: A/ B testing

• Null hypothesis (H0): Visitors that receive Layout B will not have higher
end-of-visit conversion rates compared to visitors that receive Layout A
• Alternative hypothesis (H1): Visitors that receive Layout B will have
higher end-of-visit conversion rates compared to visitors that receive
Layout A
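One common way to run this test is a two-proportion z-test. The sketch below applies it to John's numbers, assuming the red website is Layout A and the blue website is Layout B (an assumption made for illustration); the function name ab_test is also hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ab_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Right tail two-proportion z-test. Ha: conversion rate of B > A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)         # conversion rate if H0 holds
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)                # area to the right of z
    return z, p_value, p_value <= alpha


# Red (A): 45 of 100 downloads; Blue (B): 47 of 100 downloads
z, p, reject = ab_test(45, 100, 47, 100)
print(round(z, 3), round(p, 3), "reject H0" if reject else "fail to reject H0")
```

With these numbers the p-value is far above 0.05, so the test fails to reject the null hypothesis: the 2-point gap between blue and red is well within what randomness alone could produce.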
References
➢ file:///D:/nptel/19%20HYPOTHESIS%20TESTING%20T_%20Z%20TEST%20(1).pdf
➢ https://hbr.org/2017/06/a-refresher-on-ab-testing
➢ Upgrad Study Material
➢ Kothari, C. R. (2004). Research methodology: Methods and techniques. New Age International.
➢ Malhotra, N., & Birks, D. (2007). Marketing Research: an applied approach: 3rd European Edition.
Pearson education.
➢ List of test: https://help.xlstat.com/s/article/which-statistical-test-should-you-use?language=en_US
Download stat pro
https://faculty.chicagobooth.edu/jeffrey.russell/teaching/bstats/StatPro.zip
Install Stat Pro
https://www.youtube.com/watch?v=S24BV6tCkQQ
Download Real stats (for Logistic regression)
http://www.real-statistics.com/free-download/real-statistics-resource-pack/
Install Real stats (for Logistic regression)
https://www.youtube.com/watch?v=EKRjDurXau0
P value calculator
http://courses.atlas.illinois.edu/spring2016/STAT/STAT200/pnormal.html
Ab testing
https://www.surveymonkey.com/mp/ab-testing-significance-calculator/
Market Research, Naresh Malhotra
http://www.ru.ac.bd/wp-content/uploads/sites/25/2019/03/407_08_00_Malhotra-Marketing-Research-An-Applied-Orientation.pdf
XL Stat links
https://help.xlstat.com/s/article/k-means-clustering-in-excel-tutorial?language=en_US
https://help.xlstat.com/s/article/conjoint-analysis-in-excel-tutorial-new?language=en_US
https://help.xlstat.com/s/article/discriminant-analysis-in-excel-tutorial?language=en_US
Kotler
http://eprints.stiperdharmawacana.ac.id/24/1/%5BPhillip_Kotler%5D_Marketing_Management_14th_Edition%28BookFi%29.pdf
Module 3: Regression Analysis

• Covariance
• Correlation
• Difference between Covariance
and Correlation
• Regression
• Simple Linear Regression
• Multiple Linear Regression
• Logistic Regression
• Hands on experience using MS
EXCEL
• Hypothesis testing and
Interpretation
Covariance

• Covariance measures the directional association between two variables
• A positive value of covariance indicates that one of the variables increases with an
increase in the other one (positive linear relationship).
• A negative value indicates that one of the variables decreases with an increase in
the other one (negative linear relationship).
• The magnitude of covariance cannot give you any idea of the strength of the
association between the two variables. Also, the covariance value doesn’t
have any bounds and can take any value.
• Only the sign can be interpreted
• Magnitude does not represent anything
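A short sketch shows why only the sign of covariance is interpretable: changing the units of one variable changes the magnitude but not the relationship. The temperature and sales figures are made-up illustration data, and the function name is an assumption.

```python
from statistics import mean


def sample_covariance(xs, ys):
    """Sample covariance using the 'n-1' denominator."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)


temperature_c = [20, 25, 30, 35, 40]     # hypothetical temperatures
ac_sales = [10, 14, 19, 25, 30]          # hypothetical sales, in thousands of units

cov_thousands = sample_covariance(temperature_c, ac_sales)
# Same data, with sales expressed in single units instead of thousands
cov_units = sample_covariance(temperature_c, [s * 1000 for s in ac_sales])
print(cov_thousands, cov_units)
```

Both values are positive, so the direction is the same, but the second is 1000 times larger even though the underlying association has not changed.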
Relationship between Variables

• There are several methods of determining the relationship between variables, but
no method can tell us for certain that a correlation is indicative of causal
relationship.
• Thus we have to answer two types of questions in bivariate or multivariate
populations viz.,
(i) Does there exist association or correlation between the two (or more) variables? If yes, of what
degree?
(ii) Is there any cause and effect relationship between the two variables in case of the bivariate
population or between one variable on one side and two or more variables on the other side in
case of multivariate population? If yes, of what degree and in which direction?

The first question is answered by the use of correlation technique and the second
question by the technique of regression.
Karl Pearson’s coefficient of correlation
• Karl Pearson’s coefficient of correlation (or simple correlation) is the most widely
used method of measuring the degree of relationship between two variables. This
coefficient assumes the following:
• that there is linear relationship between the two variables;
• that the two variables are causally related which means that one of the variables is
independent and the other one is dependent;
• a large number of independent causes are operating in both variables so as to produce a
normal distribution.
• Karl Pearson’s coefficient of correlation can be calculated using following formula:
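In standard notation, the coefficient for n paired observations (X_i, Y_i) with means X̄ and Ȳ can be written as below. The symbols are an assumption and may differ from the notation used in the course material.

```latex
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \; \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
```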
Correlation

• Correlation measures the extent of the


association between two variables — the
directional association and its strength.
• The value of the correlation coefficient
always ranges from -1 to 1
• If the correlation coefficient is —
• 1 two variables are perfectly positively
correlated with each other
• 0 two variables are not correlated with
each other
• -1 two variables are perfectly negatively
correlated with each other
• The value of ‘r’ nearer to +1 or –1 indicates
high degree of correlation between the two
variables.
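A minimal sketch of the correlation coefficient in code, using made-up data and an assumed function name. Unlike covariance, the result stays within -1 and 1 and does not change when a variable is rescaled.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson's coefficient of correlation for paired observations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


temperature = [20, 25, 30, 35, 40]       # hypothetical data
sales = [10, 14, 19, 25, 30]             # hypothetical data

r = pearson_r(temperature, sales)
r_rescaled = pearson_r(temperature, [s * 1000 for s in sales])
print(round(r, 4), round(r_rescaled, 4))
```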
Difference between Covariance and Correlation

Covariance
• Indicates the direction of the linear relationship between variables.
• Only direction can be interpreted, not strength

Correlation
• Measures both the strength and direction of the linear relationship between two
variables
• Direction and degree of association
• How well they move together
• Does not tell about dependency
• It does not mean causation
Regression

• Regression is a statistical method used to determine the relationship between variables in
the form of an equation.
• In regression terminology, the variable that you predict is called the ‘dependent’ or the
‘response’ variable. It is usually denoted by ‘Y’.
• The variable that is used to predict this dependent variable is called the ‘independent’ or
‘explanatory’ variable. It is usually denoted by ‘X’.
• In simple linear regression, a regression model is in the form of a straight line:
Y=β0+β1X or Y = a+bX

Example: Eating = Independent Variable (X); Weight Gain = Dependent Variable (Y)


Regression Equation
• Y = mX + C
• X is known as an independent variable, i.e. it can take any value.
• Y is known as a dependent variable, i.e. its value is calculated from the
value of X.
• Here m is called the slope of the straight line, or in other words, m is
the rate at which Y increases for a unit increase in X.
• C is the intercept, i.e. the value of Y when X is 0.
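As an illustration, the slope m and intercept C of the least-squares line can be computed as below, together with the fitted values and residuals. The data and the function name fit_line are hypothetical; Excel's regression tool reports the same two coefficients.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope m and intercept c for Y = mX + C."""
    x_bar, y_bar = mean(xs), mean(ys)
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    c = y_bar - m * x_bar                 # the line passes through (x_bar, y_bar)
    return m, c


temperature = [20, 25, 30, 35, 40]        # X, hypothetical
sales = [10, 14, 19, 25, 30]              # Y, hypothetical

m, c = fit_line(temperature, sales)
fitted = [m * x + c for x in temperature]               # values predicted by the line
residuals = [y - f for y, f in zip(sales, fitted)]      # actual minus fitted
print(round(m, 2), round(c, 2), [round(e, 2) for e in residuals])
```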
Fitted Value vs Residual value
Hypothesis Testing in Simple Linear Regression

• Null Hypothesis
Null hypothesis: Temperature does not significantly influence air conditioner
sales
Null hypothesis: β1=0, where β1 is the coefficient of X in the
equation Y=β0+β1X

• Alternative Hypothesis
Alternative hypothesis: Temperature has a significant influence on the sales
Alternative hypothesis: β1≠0, where β1 is the coefficient of X in the
equation Y=β0+β1X
R square
• R-square, or the coefficient of determination, is a metric used to
quantify the goodness of fit for a regression model.
• R-square explains how close the actual data points are to the fitted
regression line. In other words, you can say that it measures the
percentage of variability in the dependent variable that is explained by
the regression line.
• R-square is equal to the explained variation in the dependent variable
divided by Total variation. R-square values lie between 0 and 1.
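A minimal sketch of this ratio for a simple linear regression, using made-up data and an assumed function name:

```python
from statistics import mean


def r_squared(xs, ys):
    """Explained variation divided by total variation for a least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * x for x in xs]
    explained = sum((f - y_bar) ** 2 for f in fitted)   # variation captured by the line
    total = sum((y - y_bar) ** 2 for y in ys)           # total variation in Y
    return explained / total


temperature = [20, 25, 30, 35, 40]        # hypothetical data
sales = [10, 14, 19, 25, 30]              # hypothetical data
r2 = r_squared(temperature, sales)
print(round(r2, 4))
```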
R square
• R square=0.8859
• Regression model is able to explain
88.59% of the variability of the
dependent variable
• As a rule of thumb, an R square of 60-70% shows a good model
• It is not an adequate, foolproof test on its own.
As the number of variables increases,
the R square goes up.
• So, also check the standard error. The smaller
the standard error, the more accurate
the predictions of the model
The standard error of the estimate

• The standard error of the estimate is a
measure of the accuracy of predictions.
The regression line is the line that
minimizes the sum of squared deviations
of prediction (also called the sum of
squares error), and the standard error of
the estimate is the square root of the
average squared deviation.
• Standard error of estimate is the average
prediction error of the regression line
when used to predict dependent variable
from the independent variable.
• It is typically the average prediction error
of this model when it is used for
predicting the air conditioner sales from
the temperatures
• The smaller the standard error the more
accurate the predictions are.
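The calculation can be sketched as below for a simple linear regression. One assumption to note: the sum of squared errors is divided by n - 2 (observations minus the two estimated coefficients), which is the convention in typical regression output, rather than by n. The data and the function name are hypothetical.

```python
from math import sqrt
from statistics import mean


def standard_error_of_estimate(xs, ys):
    """Square root of the sum of squared errors divided by (n - 2)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    # Sum of squares error: squared gaps between actual and predicted values
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (len(xs) - 2))


temperature = [20, 25, 30, 35, 40]        # hypothetical data
sales = [10, 14, 19, 25, 30]              # hypothetical data
std_err = standard_error_of_estimate(temperature, sales)
print(round(std_err, 4))
```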
Multiple Linear Regression
• When there are two or more than two independent variables, the
analysis concerning relationship is known as multiple correlation and
the equation describing such relationship as the multiple regression
equation.
Hypothesis in MLR
• ANOVA
• H0: All the regression coefficients of the household size, annual
income and annual credit card charges are equal to 0.
Ha: At least one of the regression coefficients of the household
size, annual income and annual credit card charges is non-zero.
• Individual
• H0: The regression coefficient of the household size is equal to 0.
Ha: The regression coefficient of the household size is not equal
to 0
Example: Regression
Multiple Linear Regression: Hypothesis Testing

✓Overall Significance Test: F test
✓Individual significance test: T test
F test- test for over all significance

• Check P value in ANOVA table
• P<.01, Null hypothesis rejected at 99% confidence level
• At least one independent variable (IDV) has significant impact on the
dependent variable (DV)
Individual significance test : t test

• Check individual p value


• P<.01: reject Null hypothesis for Television advertising and Newspaper
advertising at 99% CL (i.e. accept Alternate and use Beta values)
• P>.01: Fail to reject Null hypothesis for online advertising at 99% CL (Beta
values not important)
Regression Equation
R square and Adjusted R square
• R Square is a basic metric which tells you how much variance is
explained by the model.
• What happens in a multivariate linear regression is that if you keep on adding
new variables, the R square value will never decrease (and will usually increase)
irrespective of the variable significance.
• What adjusted R square does is penalise R square for the number of variables in
the model, so it increases only when an added variable improves the model more
than would be expected by chance.
• So always while doing a multivariate linear regression we should look at adjusted R
square instead of R square.

Regression output:
Multiple R 0.8996
R-Square 0.8093
Adj R-Square 0.7572
StErr of Est 1.1610
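The usual adjustment can be sketched as below. The R-Square value of 0.8093 is taken from the output shown on this slide, but the sample size (15) and number of independent variables (3) are assumptions chosen so that the result lands near the reported Adj R-Square; the actual values are not given.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R square: R square penalised for the number of predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


r2 = 0.8093                               # R-Square from the regression output
adj_r2 = adjusted_r_squared(r2, n_obs=15, n_predictors=3)           # assumed n and k
# Same R square with one more predictor: the adjusted value drops
adj_r2_extra = adjusted_r_squared(r2, n_obs=15, n_predictors=4)
print(round(adj_r2, 4), round(adj_r2_extra, 4))
```

If adding a variable leaves R square almost unchanged, the adjusted value falls, which signals that the extra variable is not earning its place in the model.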
Correlation vs Regression
Links
• Download stat pro
https://faculty.chicagobooth.edu/jeffrey.russell/teaching/bstats/StatPro.zip
• Install Stat Pro
https://www.youtube.com/watch?v=S24BV6tCkQQ
• Download Real stats (for Logistic regression)
http://www.real-statistics.com/free-download/real-statistics-resource-pack/
• Install Real stats (for Logistic regression)
https://www.youtube.com/watch?v=EKRjDurXau0
➢ https://hbr.org/2017/06/a-refresher-on-ab-testing
➢ Upgrad Study Material
➢ Kothari, C. R. (2004). Research methodology: Methods and techniques. New Age International.
➢ Malhotra, N., & Birks, D. (2007). Marketing Research: an applied approach: 3rd European Edition.
Pearson education.
➢ List of test: https://help.xlstat.com/s/article/which-statistical-test-should-you-use?language=en_US

➢ Disclaimer: Some data/pictures are used from internet resources just for teaching purpose
Doubts?
All the Best!

https://www.youtube.com/watch?v=Z9Gw9dIJGiA&t=86s&ab_channel=upGrad_Gmba
