
Business Statistics

Syllabus
• UNIT-I: Probability Distributions

• UNIT-II: Sampling & Sampling Distribution, Estimation and Confidence Interval

• UNIT-III: Hypothesis Testing for Means & Proportions

• UNIT-IV: Chi-Square Test, Non-Parametric Tests & ANOVA

• UNIT-V: Correlation and Regression Analysis


Probability Distributions
• Probability is the chance of occurrence of an event.

• For example, the probability of getting a head (or a tail) in a coin toss is 0.5, and the probability of
getting a 1 in a roll of a die is 1/6

• Random variable is a variable which can take more than one value with some probability.

• For example, X may be a random variable representing the outcome of a die roll. X can take the values
1, 2, 3, etc., each with probability 1/6.

• A probability distribution gives the probability of all possible values of a random variable.

• For example: P(x) = 1/6, for x = 1, 2, 3, 4, 5, 6 is the probability distribution of the random variable x
representing the outcome of a die roll
Probability Distributions

• There are some specific types of probability distributions for which we directly have the formula for the probability of the random variable.

Some of these distributions are:

• Binomial Distribution

• Poisson distribution

• Normal Distribution
Binomial Distribution
A binomial distribution has the following essential properties:
1. The experiment consists of “n” trials
2. All the trials are independent of each other
3. Each trial has only 2 outcomes. Either of them can be denoted as “success” and the other as “failure”.
The probability of “success” is “p” and that of “failure” is “(1-p)”
4. In the experiment conducted “n” times, the random variable “x” is the number of successes obtained
So, x={0, 1, 2, 3, 4, …………, n}; if n=3 then x={0, 1, 2, 3}
x=0 means exactly 0 successes in 3 trials, x=1 means exactly 1 success in 3 trials,
x=2 means exactly 2 successes in 3 trials, x=3 means exactly 3 successes in 3 trials
5. Then, the probability of “x” successes, denoted by P(x), is given as:
P(x) = nCx * p^x * (1-p)^(n-x)
Where nCx = n!/(x!(n-x)!), in which n! = n*(n-1)*(n-2)……1. Also 0! = 1
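The binomial formula P(x) = nCx * p^x * (1-p)^(n-x) can be checked with a short Python sketch; the values n = 10 and p = 0.3 below are only illustrative choices, not from the slides.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x) = nCx * p^x * (1-p)^(n-x): probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# The probabilities over all x = 0..n sum to 1 (up to floating-point rounding).
print(sum(binomial_pmf(x, 10, 0.3) for x in range(11)))
```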
Example

A coin is tossed 3 times. Assume Head to be success. Calculate:

• a. Probability of getting exactly 0 successes

• b. Probability of getting exactly 1 success

• c. Probability of getting exactly 2 successes

• d. Probability of getting exactly 3 successes
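For the coin-toss example, n = 3 and p = 0.5, so all four probabilities can be computed directly from the binomial formula:

```python
from math import comb

n, p = 3, 0.5  # 3 tosses, P(head) = 0.5
probs = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
print(probs)  # [0.125, 0.375, 0.375, 0.125]
```

That is, P(0) = 1/8, P(1) = 3/8, P(2) = 3/8 and P(3) = 1/8, which sum to 1.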


Example

• The manager of a departmental store informs that the probability that a customer who is just browsing will eventually buy some items is 0.4. During the pre-lunch session on a day, 7 customers are seen to browse in the department. Using this information, answer the following questions:

(a): What is the probability of 0, 1, 2, ..., 7 customers buying some items?


Example
(b) What is the probability that at least 4 customers will be buying
something?
Example
(c) What is the probability that no browsing customers will be buying
anything?
Example
(d) What is the probability that not more than 4 customers will be
buying something?
Example

• The manager of a departmental store informs that the probability that a customer
who is just browsing will eventually buy some items is 0.4. During the pre-lunch
session on a day, 7 customers are seen to browse in the department. Find:

• A. Expected number of customers buying some items.

• B. Variance of the customers buying some items.

• We know: P(0)= 0.02799, P(1)= 0.13064, P(2)= 0.26127, P(3)= 0.29030, P(4)=
0.19354, P(5)= 0.07741, P(6)= 0.01720, P(7)= 0.00164
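All parts of the departmental store example (the individual probabilities, the "at least 4" and "not more than 4" questions, and the mean and variance) follow from the binomial distribution with n = 7 and p = 0.4:

```python
from math import comb

n, p = 7, 0.4  # 7 browsing customers, P(a browser buys) = 0.4
pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

at_least_4 = sum(pmf[4:])   # P(x >= 4), about 0.2898
none_buy = pmf[0]           # P(x = 0), about 0.0280
at_most_4 = sum(pmf[:5])    # P(x <= 4), about 0.9037

# E(x) and Var(x) computed from the distribution match the shortcuts n*p and n*p*(1-p)
mean = sum(x * pmf[x] for x in range(n + 1))              # 2.8  (= 7 * 0.4)
var = sum((x - mean) ** 2 * pmf[x] for x in range(n + 1))  # 1.68 (= 7 * 0.4 * 0.6)
print(round(mean, 2), round(var, 2))
```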
Expected value

• Expected value is the average or mean value of the data. The formula for expected value is:

E(x) = Σ xi*pi, summed over i = 1 to n

Where n is the number of data points, xi is the value of each data point
and pi is the probability of the ith data point
Variance
• Variance is the measure of dispersion in the data. If the data points are far away
from each other, variance will be high, and vice versa.
• The formula for calculating Variance is:

Variance = Σ (xi − μ)²*pi, where μ is the mean of the data


Expected value and standard deviation

• We have already discussed the calculation of expected value and standard deviation of a probability distribution.

• It can be shown mathematically that in the case of the binomial distribution, the formulae for these measures simplify to the following:

Expected value or mean E(x)= n*p

Variance = n*p*(1-p)
Poisson distribution
• In many cases, the number of trials “n” and probability “p” is not given

• If we are interested in the:

 Number of events occurring in the future

And the probability of a given number of events occurring in the future,

Then, we can use Poisson distribution


Examples
• Number of customer arrivals per minute at a retail store

• Number of visitors in one hour on an e-commerce website

• The number of telephone calls received per minute

• The number of arrivals at a bank teller booth in every 5-minute period

• Number of accidents per day


Poisson distribution
• The Poisson probability distribution, named after the French mathematician S.D.
Poisson, is another discrete probability distribution with important practical
applications.

• There are many discrete phenomena that are represented by a Poisson process.

• A Poisson distribution is said to exist when we can observe discrete events in some
area of interest (which may be a continuous interval of time, space, length, etc.)
Poisson distribution
• The only condition for a Poisson distribution is that the expected number of successes (or events) must be
known, which is represented by λ. [Example: Average customers visiting a site is 5 per minute]

• If we know λ, we can find the probabilities of getting exactly 0 successes in the future, 1 success, 2, 3, 4, 5
…………… to infinity [Example: Probability of exactly 0 customers per minute, exactly 1 customer per minute,
exactly 2 customers per minute, 3 customers …………]

• So, x is the random variable denoting number of successes or number of arrivals x={0,1,2,3,4,…..∞}

• Then, the formula for the probability of exactly x successes is given by

P(x) = (e^(−λ) * λ^x) / x!

• Here, e = 2.7183 (a mathematical constant)

• λ = the expected number of successes (the mean)

• x = number of successes and P(x) = the probability of x successes
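The Poisson formula P(x) = e^(−λ) λ^x / x! translates directly into Python; the value λ = 5 below is the illustrative "5 customers per minute" figure, and the probabilities over x = 0, 1, 2, … sum to 1.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(x) = e^(-lam) * lam^x / x!: probability of exactly x events."""
    return exp(-lam) * lam**x / factorial(x)

# With lam = 5 events per interval, the probabilities over x = 0, 1, 2, ... sum to 1.
print(sum(poisson_pmf(x, 5) for x in range(50)))
```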


Poisson distribution
The following points may be noted:

• (a) The minimum number of successes in a Poisson distribution is zero while there is no upper limit.

• (b) In calculating probabilities, the value of λ should be defined carefully. To illustrate, it is given
that, on an average, 12 accidents occur in a quarter of a year on a certain crossing. In this case, for
calculating the probability of:

(i) A certain number of accidents occurring over a month, we should take λ = 4.

(ii) A certain number of accidents occurring over a two-month period, we should take λ = 8.

(iii) A certain number of accidents occurring over a three-month period, we should take λ = 12.

(iv) A certain number of accidents occurring over a one-and-a-half-month period, we should take λ = 6.
EXAMPLE:
• If, on an average, 2 customers arrive at a shopping mall per minute, what is the
probability that

(a) In a given minute, exactly 3 customers will arrive?

(b) In a given minute, exactly 4 customers will arrive?

(c) In a given minute, no customer will arrive?

(d) In a given minute, more than 2 customers will arrive?

(e) In a 5-minute period, exactly 10 customers will arrive?
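All five parts of this example use the Poisson formula with λ = 2 per minute; for part (e), λ scales with the interval, so a 5-minute period uses λ = 10:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

lam = 2  # 2 arrivals per minute on average
a = poisson_pmf(3, lam)                             # (a) about 0.1804
b = poisson_pmf(4, lam)                             # (b) about 0.0902
c = poisson_pmf(0, lam)                             # (c) about 0.1353
d = 1 - sum(poisson_pmf(x, lam) for x in range(3))  # (d) P(x > 2) = 1 - [P(0)+P(1)+P(2)], about 0.3233
e = poisson_pmf(10, 5 * lam)                        # (e) lam = 10 for a 5-minute period, about 0.1251
print(round(a, 4), round(b, 4), round(c, 4), round(d, 4), round(e, 4))
```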


Expected Value and Standard Deviation
• The expected value (mean) and the variance of a Poisson distribution
are both numerically equal to λ. Thus, the standard deviation of a Poisson
distribution is equal to √λ (since variance = λ).
Normal Distribution

• Normal Distribution is defined for a continuous random variable

• Normal Distribution holds a very significant place in statistics

• There are several phenomena which seem to follow this distribution very closely or
can be approximated by it

• When the data is very large, then in most of the cases, the random variable follows
the normal distribution

• Examples: Height, weight of a large population


Normal Distribution
• A variable x has a normal distribution if its curve, as
shown in Figure, is given by the following equation:

y(x) = (1 / (σ√(2π))) * e^(−(x − μ)² / (2σ²))

• where, e = 2.7183
• μ = expected value (mean)
• σ = standard deviation
[standard deviation = √variance]
• x = a particular value of the random variable,
• y(x) = density for x
Normal Distribution

• Since the random variable is continuous, the Y axis does not represent probability. It represents density.

• We cannot define probability for a point of a continuous random variable, since there are infinite points. Instead, we define probability for an interval.
Calculation of Probabilities
• Since the underlying variable in a normal
distribution is continuous, the probability that the
variable value will lie in a certain range can be
calculated.

• In this case, the probability that the variable will lie in a given range, say x1 to x2, is equal to the ratio of the area under the normal curve between x1 and x2 to the total area under the curve.

• The total area under the curve is taken to be equal to 1.


Calculation of Probabilities

• z-transformation: Convert any normal random variable x into a standardized normal variable z. This is called the z-transformation. The transformation from x to z is done by the following formula:

z = (x − μ) / σ

• Then, use the standard z table to calculate the area under the curve.
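The z-transformation can be sketched in Python, where `statistics.NormalDist` plays the role of the z table; the values μ = 100 and σ = 15 below are illustrative assumptions, not from the slides.

```python
from statistics import NormalDist

z_table = NormalDist()  # standard normal: mean 0, standard deviation 1

mu, sigma = 100, 15   # assumed population parameters (illustrative)
x = 118
z = (x - mu) / sigma  # z-transformation: z = 1.2
p_below = z_table.cdf(z)  # area under the standard normal curve to the left of z
print(z, round(p_below, 4))  # 1.2 0.8849
```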


Example
A machine fills coffee powder in packets, with an average of 200 gm and a standard deviation
of 4 gm. Assuming that the coffee weight is normally distributed, find the probability that a
coffee packet selected at random will contain the following quantity of coffee.

(a) Between 200 and 206 gm,

(b) Between 206 and 210 gm,

(c) Between 190 and 195 gm,

(d) At least 200 gm.
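Each part of the coffee-packet example is an area under the normal curve with μ = 200 and σ = 4, which `statistics.NormalDist` computes directly:

```python
from statistics import NormalDist

coffee = NormalDist(mu=200, sigma=4)  # packet weight in gm

a = coffee.cdf(206) - coffee.cdf(200)  # (a) P(200 < x < 206), about 0.4332
b = coffee.cdf(210) - coffee.cdf(206)  # (b) P(206 < x < 210), about 0.0606
c = coffee.cdf(195) - coffee.cdf(190)  # (c) P(190 < x < 195), about 0.0994
d = 1 - coffee.cdf(200)                # (d) P(x >= 200) = 0.5 (200 is the mean)
print(round(a, 4), round(b, 4), round(c, 4), round(d, 4))
```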


Sampling-Introduction
• Decision makers cannot always measure an entire population and therefore
they have to rely on the information gained by sampling from the population.

• Samples are taken and analysed not just for their sake but to learn about the
populations from which they are drawn.

• Sampling is an integral part of what is known as inferential statistics, which allows us to draw conclusions regarding large populations from a relatively small number of observations drawn from such populations.
Reasons For Sampling

• Economic: Sampling is done mainly for the economic reasons as it may be too
expensive or too time-consuming to attempt either a complete or a nearly
complete coverage in a statistical study.

• Destructive nature of tests: Where the testing results in the destruction of the
elements in the process of examination

• Very large populations: When the population in question is very large in size or
is infinite, then sampling is the only choice
Types of Sampling
1. Simple Random Sampling

• A simple random sample is one in which every element of the parent population has an equal chance of being included in the sample.

• For a population comprising N elements, the chance of a particular element being selected is 1/N.
2. Systematic Sampling
• Firstly, the elements of population are sequenced randomly.

• Then, the ratio of the population size, N, to the sample size, n, is calculated and
represented by k. Thus, k = N/n.

• Note that only integer value of k is considered here, ignoring the fractional part, if any.

• After this, an element is chosen randomly from the first k elements. This is the first
element selected in the sample.

• It is followed by choosing every kth element from the element chosen, for inclusion in
the sample.
3. Stratified Sampling
• In stratified sampling, the N elements of the population are first sub-divided into distinct and
mutually exclusive sub-populations, also called strata, according to some common
characteristic.

• For example, the employees of a large company can be divided by their rank, gender,
department, and so forth.

• After a population is divided into appropriate strata, a simple random sample is taken within
each stratum

• Stratified sampling is more efficient than simple random sampling or systematic sampling
because such sampling ensures representation of individuals or items across the total population.
4. Cluster Sampling

• In cluster sampling, the elements of a population are divided into several clusters so that each cluster is nearly representative of the entire population.

• A random sample of clusters is then taken and all elements of each selected cluster are then studied.
5. Convenience Sampling

• Convenience sampling is a non-probability sampling procedure, involving no restrictions.

• In this type of sampling, the investigator or his people have the freedom to choose whomsoever they find convenient.

• Convenience sampling is obviously convenient and relatively cheap to undertake, but it does not ensure precision since the sample can be biased.
Statistics and Parameters

• In the context of sampling, it is important to understand the difference between statistics and parameters.

• A statistic refers to a quantitative characteristic of a sample while a parameter is a quantitative characteristic of a population.

• Example: The average of a population is a parameter whereas the average of a sample is a statistic.
Statistics and Parameters
• The sample statistics are distinguished from parameters by using different symbolic
representations.

• For example, the sample mean and standard deviation are represented by x̄ and s
respectively, while the population parameters are represented by μ and σ.

• Since a sample is only a part of the population, we do not expect a statistic value
to match exactly the corresponding parameter, except only by chance.

• Since different values of a statistic are possible, a statistic is a random variable whereas a population parameter is a fixed value.
Sampling and Non Sampling Errors

• In statistics, an error refers to the difference between the value of a sample statistic and the corresponding population parameter.

• Such an error is likely to occur due to the fact that a sample is only a subset of
the population.

• However, this is not the only reason of having errors. There are other reasons
also that cause errors.

• As such, a distinction is made between sampling errors and non-sampling errors.
Sampling Errors

• The sampling errors arise only for the reason of sampling and result from
the chance selection of sampling units

• This type of error occurs simply because only a part of the population is
observed and is expected to disappear when a census study is undertaken.

• Example: Difference between a population mean and a sample mean


Non Sampling Errors

• Non-sampling errors arise because of reasons other than sampling.

• They may arise because of bias, vague definitions used in the data
collection, defective methods of data collection, incomplete coverage of the
population, wrong entries made in the questionnaire, etc.

• While the non-sampling errors can be controlled by careful planning and execution, we cannot avoid sampling errors and have to deal with them.
Sampling Distribution

• Since the “statistic” is a random variable, we can study how its values behave and distribute themselves

• The distribution of a statistic is called a sampling distribution

• A sampling distribution is used to understand how data from samples can be used in the decision-making process.
Sampling Distribution

• If all possible samples of size n are selected from a population and the same statistic is calculated for each sample, the distribution of these values of the statistic is called the sampling distribution of that statistic.

• Example: Distribution of Mean


Standard Deviation of a Sampling Distribution: Standard Error

• A sampling distribution, like any other distribution, is described by its expected value and standard deviation.

• However, the standard deviation of a sampling distribution is called the standard error.

• Thus, the standard deviation of the sampling distribution of mean is known as the standard error of mean (SE).
Sampling Distribution of Mean
• The sampling distribution of mean is the probability distribution of all sample means, obtained
from all possible samples of a certain fixed size from a given population.

• The following steps are used to obtain the sampling distribution of mean:

(a) Take all possible samples of size n from a population of size N, having mean μ and standard
deviation σ

(b) Calculate mean values for all the samples obtained.

(c) Tabulate the mean values and calculate the relative frequency of each value of mean by
dividing the frequency with which it appears by the total frequency (equal to the number of
samples). The relative frequency of each value indicates its probability.
Example

• A population consists of five elements, viz. 2, 3, 4, 6 and 10. Obtain the sampling distribution of mean taking sample size equal to 2. Also, calculate the mean and variance of the sampling distribution
Sampling with replacement

• In sampling with replacement, repetition of elements is permitted.

• If sampling is done with replacement from a population of size N, taking samples of size n each, then a total of N^n samples can be drawn
Sampling without replacement

• In sampling without replacement, repetition of elements is not permitted.

• If sampling is done without replacement from a population of size N, taking samples of size n each, then a total of NCn = N!/(n!(N−n)!) samples can be drawn
Example

• A population consists of five elements, viz. 2, 3, 4, 6 and 10. Obtain the sampling distribution of mean taking sample size equal to 2, assuming that sampling is done without replacement. Also, calculate the mean and variance of the sampling distribution
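The without-replacement case can be worked out by brute force: list every sample of size 2 from the population {2, 3, 4, 6, 10}, take each sample's mean, and then find the mean and variance of those sample means.

```python
from itertools import combinations
from statistics import mean

population = [2, 3, 4, 6, 10]  # population mean = 5, population variance = 8
samples = list(combinations(population, 2))  # all 5C2 = 10 samples without replacement
sample_means = [mean(s) for s in samples]

mu = mean(sample_means)  # equals the population mean, 5
# equals (sigma^2/n) * ((N-n)/(N-1)) = (8/2) * (3/4) = 3
var = mean((m - mu) ** 2 for m in sample_means)
print(len(samples), mu, var)  # 10 5.0 3.0
```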
Mean and Standard Deviation of Sampling Distributions
• The mean of the sampling distribution is equal to the population mean μ.

• The standard deviation (known as the standard error of mean) depends upon whether the sampling is done with or without replacement.

• When sampling is done with replacement: SE = σ/√n

• When sampling is done without replacement: SE = (σ/√n) * √((N−n)/(N−1))


Shape of the Sampling Distribution of Mean

• In situations with larger populations, the number of possible samples, and therefore sample means, can become very large.

• When the number of possible samples is very large, we cannot practically calculate all possible sample means to obtain μ and SE.

• There are two theorems that allow decision-makers to describe how the sample means are distributed.
Sampling from Normally Distributed Populations

• If the population from which sampling is done is normally distributed with mean μ and standard deviation σ, the sampling distribution of mean is also normally distributed with mean μ and standard error σ/√n, regardless of the size of the sample, n.
Sampling from Not Normally-Distributed Populations

• Central Limit Theorem (CLT): If random samples of size n are drawn from any
population with mean μ and standard deviation σ, and if n is sufficiently large, then
the distribution of possible mean values will be approximately normal with expected
value μ and standard error σ/√n, regardless of the population distribution.

• Note: A sample size of 30 or greater is considered sufficiently large


Sampling Distribution of Mean and the Probabilities

• Now we can get the sampling distribution of mean:

1. When the population is normally distributed

2. When the population is not normally distributed but the sample size n is large enough.

• We can use normal area table to calculate probabilities involving the sample
mean.
Example

• The lives of a certain brand of batteries are known to be normally distributed with a mean of 415 hours and a standard deviation of 20 hours. Calculate the probability that

(a) A random sample of 10 batteries will have a mean life of 412 hours or
greater.

(b) A sample of 100 batteries selected randomly will have a mean life of at least
412 hours.
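Both parts of the battery example use the sampling distribution of the mean: it is normal with mean 415 and standard error 20/√n, so only n changes between (a) and (b).

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 415, 20  # population mean and standard deviation (hours)
z_table = NormalDist()

probs = {}
for n in (10, 100):
    se = sigma / sqrt(n)           # standard error of the mean
    z = (412 - mu) / se            # z-transformation of the sample mean
    probs[n] = 1 - z_table.cdf(z)  # P(sample mean >= 412)
print(round(probs[10], 4), round(probs[100], 4))  # about 0.6824 and 0.9332
```

The larger sample (b) gives a smaller standard error, so the sample mean is more likely to land near 415 and above 412.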
Example

• A random sample of 100 discs is taken from a production line which produces discs having mean = 120 gm and standard deviation = 16 gm. Find the probability that the sample mean will be

(i) greater than 124 gm, and

(ii) between 118 and 121.6 gm.


Sampling Distribution of Number of Successes
• If π is the proportion of successes (or the probability of success) in the population, the
sampling distribution of the number of successes calculated from the set of all samples of
size n shall be approximately normally distributed with:
Mean = nπ
Standard deviation = √(nπ(1−π))
when both nπ and n(1−π) are at least equal to 5.
• To calculate probabilities for the normal distribution, we use z-transformation

• Since the number of successes involved here is a discrete variable, we need to apply
continuity correction factor (CCF) of +0.5 or -0.5
Sampling Distribution of Number of Successes

• Example: A large shipment of picture tubes is known to contain 10 percent defectives. A sample of 100 tubes is taken randomly such that each tube is replaced before the next is taken. What is the probability that the sample will have 8 or more defectives? Calculate with CCF and without CCF.
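Here n = 100 and π = 0.10, so the number of defectives is approximately normal with mean nπ = 10 and standard deviation √(nπ(1−π)) = 3; "8 or more" becomes x ≥ 7.5 once the continuity correction factor is applied.

```python
from math import sqrt
from statistics import NormalDist

n, pi = 100, 0.10
mean = n * pi                  # 10
sd = sqrt(n * pi * (1 - pi))   # 3
z_table = NormalDist()

# P(8 or more defectives) under the normal approximation
without_ccf = 1 - z_table.cdf((8 - mean) / sd)    # about 0.7475
with_ccf = 1 - z_table.cdf((7.5 - mean) / sd)     # CCF: "8 or more" -> x >= 7.5, about 0.7977
print(round(without_ccf, 4), round(with_ccf, 4))
```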
Sampling Distribution of Proportion of Successes
• Now we consider the proportion of successes (p) instead of the number of successes in a sample.
If π is the proportion of successes (or the probability of success) in the population, then the sampling
distribution of the proportion of successes shall be approximately normally distributed with:
Mean = π
Standard deviation or standard error = √(π(1−π)/n)
when both nπ and n(1−π) are at least equal to 5.
• To calculate probabilities for the normal distribution, we use z-transformation

• We need to apply continuity correction factor (CCF) of +1/2n or -1/2n


Sampling Distribution of Proportion of Successes

• Example: Of the customers visiting a University cafe, 40 percent prefer to have self-service. Use the sampling distribution of proportion to find the probability that at least 50 percent of the next 120 customers will prefer self-service. Calculate without CCF.
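For the cafe example, π = 0.40 and n = 120, so the sample proportion is approximately normal with mean π and standard error √(π(1−π)/n):

```python
from math import sqrt
from statistics import NormalDist

pi, n = 0.40, 120
se = sqrt(pi * (1 - pi) / n)    # standard error of the proportion, about 0.0447
z = (0.50 - pi) / se            # about 2.236
p_val = 1 - NormalDist().cdf(z)  # P(sample proportion >= 0.50), about 0.0127
print(round(z, 3), round(p_val, 4))
```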
Theory of Estimation

• Estimation refers to the procedure where sample information is used to estimate the numerical value of some population measure, such as the population mean and population variance.

• It, thus, involves using sample statistics to predict the values of the
population parameters.
Theory of Estimation

• There are two kinds of estimates:

Point estimates and Interval estimates.

• A point estimate is a specific value while an interval estimate is a range of values.

• Using a point estimate, we might say that average height of Pune citizens is 160 cm.

• An interval estimate would say that there is a 95% probability that the average height of
Pune citizens falls in the interval 150 cm to 170 cm
Point Estimates

• A point estimate is a specific value

• When a point estimate is found, the sample value or statistic used is called
an estimator and the specific number obtained is called an estimate.

• Thus, we can think of an estimate as an educated guess about some value of a population parameter, whereas an estimator is the rule or procedure used to obtain that guess.
Properties of a Good Estimator

• Unbiasedness: An estimator is said to be unbiased if its expected value is equal to the population parameter it estimates.

• Efficiency: An estimator is said to be efficient if it has relatively small variance.

• Consistency: An estimator is said to be consistent if its probability of being close to the parameter it estimates increases as the sample size increases.

• Sufficiency: A sufficient estimator is one which utilizes all the information a sample contains about the parameter to be estimated.
Good Estimators of mean and standard deviation of
population
• A good estimator of the mean of the population is the sample mean x̄

• A good estimator of the standard deviation of the population is the sample standard deviation s


Interval Estimates
• An interval estimate gives the estimate in terms of a range of values.

• In general terms, an interval estimate is obtained as under:

Point estimate ± (Interval coefficient * Standard error)

• So, interval estimate is obtained by adding and subtracting some quantity to the point estimate.

• Standard error is the standard deviation of the sampling distribution

• The interval coefficient, or confidence coefficient, depends on the level of confidence and the shape of the sampling distribution.

• A level of confidence equal to 95 percent means that the probability is 0.95 that the parameter value being estimated is contained, and 0.05 that it is not contained, within the interval we obtain.

• A level of confidence is designated as 1 − α. Hence, if a 95 percent confidence level is required, then α = 0.05.

• α is the probability of error indicating that the parameter will not be included in the interval estimate.
CONFIDENCE INTERVAL FOR POPULATION MEAN OF LARGE SAMPLES:
When the Population Standard Deviation is Known

• If the sampling distribution is normally distributed, an interval estimate of the population mean may
be constructed as follows:

x̄ ± z(α/2) * σ/√n

• Here x̄ is the sample mean and σ/√n is the standard error of mean.

• z(α/2) is the value determined by the probability associated with the estimate.

• When the level of confidence is 95 percent, we have α = 0.05, and α/2 = 0.025. Now an area equal to
0.025 lies under the normal curve beyond z = 1.96. Thus, for 95 percent level of confidence, we
have z = 1.96.
CONFIDENCE INTERVAL FOR POPULATION MEAN OF LARGE
SAMPLES: When the Population Standard Deviation is Known

• Example: A marketing research firm has contracted with an advertising agency to provide information on the average time spent watching TV per week by families in a city. The firm selected a random sample of 100 families and found the mean time spent by them watching TV to be equal to 32 hours. It is known that the standard deviation of the TV watching time is 12 hours, which has been constant over the past few years.

• The firm wants to construct

(a) a 95 percent confidence interval, and

(b) a 99 percent confidence interval for mean time spent by families watching TV. Help the firm.
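Both intervals follow from x̄ ± z(α/2)·σ/√n with x̄ = 32, σ = 12 and n = 100, so the standard error is 1.2; `NormalDist().inv_cdf` supplies the z-values that would otherwise come from the z table.

```python
from math import sqrt
from statistics import NormalDist

n, x_bar, sigma = 100, 32, 12
se = sigma / sqrt(n)  # standard error = 1.2

intervals = {}
for conf in (0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z(alpha/2): about 1.96 and 2.576
    intervals[conf] = (x_bar - z * se, x_bar + z * se)
print(intervals)
# 95%: roughly (29.65, 34.35) hours; 99%: roughly (28.91, 35.09) hours
```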
CONFIDENCE INTERVAL FOR POPULATION MEAN OF LARGE SAMPLES:
When the Population Standard Deviation is not Known

• We now consider situations where population standard deviation is not known.

• In most of the business applications, neither population mean is known, nor is the population standard
deviation.

• To make interval estimates in such cases, we use the sample standard deviation, s, calculated as follows, as an
estimator of the population standard deviation:

s = √( Σ(x − x̄)² / (n − 1) )

• Thus, if the sample size is large, the confidence interval estimate of the population mean is approximated by the
following expression:

x̄ ± z(α/2) * s/√n
CONFIDENCE INTERVAL FOR POPULATION MEAN OF LARGE SAMPLES:
When the Population Standard Deviation is not Known

• Example: A random sample of 100 discs is taken from a large output of a certain machine and is found to have mean weight equal to 130 gm and a standard deviation of 20 gm. You are required to determine a 95 percent and a 99 percent confidence interval of the mean weight of the entire output of the machine.
Determining the sample size
• Recall that the interval estimate of the population mean is obtained by the following
expression:

x̄ ± z(α/2) * σ/√n

• This can be interpreted as: there is (1 − α) probability that the value of the sample mean will
provide a sampling error of z(α/2) * σ/√n or less
• Given the extent of error acceptable (E), the level of confidence (1 − α), and the population
standard deviation (σ), we can determine the sample size required.

• Rearranging the terms, we get:

n = ( z(α/2) * σ / E )²

Determining the sample size

• Oil India Corporation has a bottle-filling machine which can be adjusted to fill oil
to any given average level, but it fills oil with a standard deviation of 0.010 litres.
The machine has recently been reset to a new filling level and the manager wants
an estimate of the mean amount of fill to be within ±0.001 litres with a 99 percent
level of confidence. How many bottles should the manager sample?
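The bottle-filling question plugs σ = 0.010, E = 0.001 and the 99 percent z-value into n = (z(α/2)·σ/E)², rounding up to the next whole bottle:

```python
from math import ceil
from statistics import NormalDist

sigma, E = 0.010, 0.001  # litres: process sd and acceptable error
z = NormalDist().inv_cdf(1 - 0.01 / 2)   # z(alpha/2) for 99 percent confidence, about 2.576
n_required = ceil((z * sigma / E) ** 2)  # n = (z*sigma/E)^2, rounded up
print(n_required)  # 664
```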
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES
When the population is normally distributed and the population
standard deviation is known
If the population is normally distributed and the population standard deviation is known, then the
sampling distribution of mean is also normally distributed, irrespective of the size of the sample
involved.

This implies that even if the sample is small in size, we can use the z-table to obtain the interval
estimate in the same manner as in the case of large samples, using the interval:

x̄ ± z(α/2) * σ/√n
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES

When the population is normally distributed and the population standard deviation is known

Example:

• Obtain a 90 percent confidence interval for the mean of a population from which a sample of size 10 is drawn randomly. The population is known to be normally distributed and has a standard deviation equal to 15. The sample drawn has been found to have a mean equal to 42.
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES

When the population is normally distributed and the population standard deviation is not known
• We now proceed to consider the case where

• The population is normally distributed,

• The population standard deviation is not known, and

• The sample is small in size, so that n < 30.

• The solution to the problem of small-sample interval estimation lies with a statistic called ‘t’ rather
than with z
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES

• The t variable is defined as

t = (x̄ − μ) / (s/√n)

• Where s is the standard deviation of the sample and is obtained as

s = √( Σ(x − x̄)² / (n − 1) )

• For a sample of size n, the degrees of freedom is one less than the size, that is,
v = n − 1
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES

• Interval estimate of mean

• Using the t-distribution, an interval estimate of the population mean can be obtained in the following way:

x̄ ± t(α/2, v) * s/√n

• As in the case of z, the value of t depends upon the level of confidence. In addition, the t-value is
dependent on the degrees of freedom.
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES

• The t-table is constructed in a manner different from the z-table.

• The chance of μ not being included in the confidence interval is represented by α and, therefore,
reference should be made to the column head represented by α if the table gives areas in both tails of the
distribution and α /2 if the table provides areas in one tail of it.

• Values under area (α) = 0.05 in a two-tail table would match with the values under area (α /2) = 0.025 in
the case of one-tail table.

• If the level of confidence is given to be 90 percent, we focus on the column headed 0.10 in the case of a
two-tail table and 0.05 in the case of a one-tail table.

• Another element relevant in consulting the t-table is the number of degrees of freedom, v, which is
equal to n-1
t-table
Example:

• A sample of 20 items is selected randomly from a very large shipment. It is found to have a mean weight of 310 gm and standard deviation equal to 9 gm. Derive the 95 percent and 99 percent confidence intervals for the population mean weight.
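With n = 20, x̄ = 310 and s = 9, the interval is x̄ ± t(α/2, v)·s/√n for v = 19 degrees of freedom; the t-values below are read from a standard t-table (the standard library has no t-distribution).

```python
from math import sqrt

n, x_bar, s = 20, 310, 9
se = s / sqrt(n)  # about 2.0125 gm

# t-values from a t-table for v = n - 1 = 19 degrees of freedom
t_values = {0.95: 2.093, 0.99: 2.861}
intervals = {conf: (x_bar - t * se, x_bar + t * se) for conf, t in t_values.items()}
print(intervals)
# 95%: roughly (305.79, 314.21) gm; 99%: roughly (304.24, 315.76) gm
```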
Summary-Interval Estimation
Testing of Hypotheses: Means and Proportions
Hypothesis

• A hypothesis refers to a statement or a claim about the whole population.

• E.g., average height of Pune citizens is 170 cm.

• Pune Citizen’s average preference for a product is 6 on a scale of 1-10

• Hypothesis testing involves an examination, based on sample evidence and probability theory, to determine whether the hypothesis is a reasonable statement.
Hypothesis Testing
• The process of hypothesis testing essentially comprises the following:

(a) Making a statement about the population,

(b) Sampling from the population and analysing the sample data,

(c.1) While recognizing that some divergence between sample and population
characteristics may be expected since a sample is only a subset of the population,

(c.2) Determining whether the differences between what is observed and the
statement made could have been due to chance alone and hence insignificant, or
whether they are significant, casting doubt on the statement made.
Example
• A company uses a semi-automatic process to fill coffee powder in 200 gm jars and
this fill is known to have a standard deviation equal to 4 gm. For long, the amount
of coffee powder filled is observed to be normally distributed with a mean of 200
gm. The manager is concerned with ensuring that the process is working
satisfactorily so that the average amount of coffee powder filled in the jars is 200
gm. She has currently taken a sample of 25 jars, weighs the amount of coffee in
each of them and finds the average amount equal to 202 gm. Now, her problem is
as to how this difference of 2 gm be interpreted. Is it a small difference and may
be ignored or is it large enough to conclude that the process is not working
properly and some action is warranted?
Hypothesis Testing
• Process of hypothesis testing consists of following six steps

Step 1: Set up null and alternate hypotheses.

Step 2: Select the level of significance.

Step 3: Select test statistic.

Step 4: Establish the decision rule.

Step 5: Perform computations.

Step 6: Draw a conclusion.


Step 1: Setting up of Null and Alternate Hypotheses
• The first step in hypothesis testing methodology is always to set up a pair of hypotheses,
called the null and alternate hypotheses.

• The null hypothesis is designated as H0 and the alternate hypothesis as H1.

• The null hypothesis is the hypothesis which is tested.

• It is a hypothesis of `no difference' or `no change' and is set up on the presumption that no
significant difference exists between sample result and the population parameter
hypothesized.

• It is assumed that whatever difference is observed between sample statistics and population
parameters is due to random causes only and is not significant.
Step 1: Setting up of Null and Alternate Hypotheses
• The complement of the null hypothesis, the alternate hypothesis, on the other hand, states what may be believed
to be true and is accepted if the null hypothesis is rejected.

• The alternate hypothesis is, therefore, what we theorize.

• In the coffee powder example, we may set up the hypotheses as follows:

• Null hypothesis, H0: µ = 200 gm

• Alternate hypothesis, H1: µ ≠ 200 gm.

• We hypothesize that the mean coffee powder fill is indeed 200 gm. If the sample mean is not found to be significantly
different from this hypothesized value, we have no reason to reject this hypothesis. However, if the sample mean
is significantly higher or lower than this value, then the null hypothesis is rejected and the alternate hypothesis is
accepted. The alternate hypothesis stipulates that mean is, in fact, other than 200 gm, higher or lower.
Examples of Null and Alternate Hypotheses
Example: A study claims that the mean income of the senior executives in the manufacturing sector
in an industrial state is ₹625,000 per annum. To test this claim, it is decided to take a sample of 200
executives and obtain their mean income. Set up the appropriate hypotheses.

The null and alternate hypotheses would be:

• Null hypothesis, H0: µ = ₹625,000 (the claim is true)

• Alternate hypothesis, H1: µ ≠ ₹625,000 (the claim is not true)


Examples of Null and Alternate Hypotheses
Example: A researcher contends that the mean daily sleep of young babies is 15 hours. A sample
of 40 babies is taken randomly and their average sleep is found to be 13 hours and 48 minutes, with a
standard deviation of 80 minutes. Set up the null and the alternate hypotheses to test his claim.

For the purpose of setting up null and alternate hypotheses, it is assumed that the sample mean is
different from the population mean of 15 hours only due to chance factors. The alternate hypothesis
would be that the difference between the two mean values is real and not due to chance.

Null hypothesis, H0: µ = 15 hours (the claim is tenable)

Alternate hypothesis, H1: µ ≠ 15 hours (the claim is not tenable)
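Since n = 40 is reasonably large, a z-test can be applied here; a hedged sketch in Python, treating the sample standard deviation of 80 minutes as the population value (an assumption the example leaves implicit):

```python
import math
from statistics import NormalDist

mu0 = 15 * 60            # hypothesized mean sleep, in minutes
xbar = 13 * 60 + 48      # sample mean: 13 h 48 min = 828 minutes
s, n = 80, 40            # standard deviation (minutes) and sample size

z = (xbar - mu0) / (s / math.sqrt(n))    # test statistic
p = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed p-value
print(f"z = {z:.2f}, p = {p:.4f}")       # z is about -5.69: reject H0 at alpha = 0.05
```

The sample mean lies far outside ±1.96 standard errors of the hypothesized mean, so the claim of 15 hours would be rejected.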


Examples of Null and Alternate Hypotheses
Example: A company administered an intelligence test to all its employees for a long period of time. For all the
80,000 employees, the mean score was found to be 75 and the standard deviation 12. A researcher wishes to
study the theory that the top line supervisors of the company are more intelligent than the average. For that, a
sample of 50 supervisors is chosen randomly and their mean score is found. To test the theory, what should be
the null and alternate hypotheses?

Let the population mean (the mean score of the entire set of employees) be µ and the mean score of the top line
supervisors be µ1. The null and alternate hypotheses would be as under:

Null hypothesis, H0: µ1 = 75 (top line supervisors are not more intelligent than average employees)

Alternate hypothesis, H1: µ1 > 75 (top line supervisors are more intelligent than average employees)
Examples of Null and Alternate Hypotheses
Example: A coin sorting machine at a bank can sort, on an average, 22,800 coins in a day with a
standard deviation of 280 coins. A new model of the machine is out in the market and the bank manager
is considering replacing the old one. The manager proposes to replace the old machine with the
new model only if the latter sorts a larger number of coins than the former. For this, the manager
employs the new sorter for a period of 4 weeks and finds that the average number of coins sorted per
day is 23,540, with almost the same standard deviation as the old one. Write the appropriate hypotheses for
testing.

Null hypothesis, H0: µ1 = 22,800 (the new model is no better than the existing one)

Alternate hypothesis, H1: µ1 > 22,800 (the new model is better than the existing one)
Examples of Null and Alternate Hypotheses
Example: A company is making brake system for cars. Using this brake system, a car running at speed of 40kmph
comes to a halt after covering a distance of 15 feet on an average. Recently, the company has developed power
brakes which, when applied one hundred times to cars running at 40 kmph, indicated that the average distance
covered is 13 feet before stopping. To test if the power brake system is better, set up the appropriate hypotheses.

For null hypothesis, we assume that the power brake is no different from the old brake system, so that the average
distance for stopping a car with new brake system is found lower than that of the average of the existing brake
system due only to chance factors. The alternate hypothesis would, of course, be that the power brake system is
better than the existing one.

Null hypothesis, H0: µ = 15 feet (the power brake is no better than the existing brake)

Alternate hypothesis, H1: µ < 15 feet (the power brake is better than the existing brake)
Examples of Null and Alternate Hypotheses

• A look at all the examples reveals that no matter how a problem is stated, the null hypothesis always contains the "=" sign.

• This sign never appears in the alternate hypothesis.

• The reason for this is simple: it is always the null hypothesis that is tested, and we need a specific parameter value to use in the calculations.
Step 2: Selection of the Level of Significance
• Once the hypotheses are set up, we need to state the level of significance, designated by α.

• The level of significance refers to the probability of rejecting a null hypothesis when it is
true.

• This is also termed the level of risk because it indicates the risk that a true null hypothesis
will be rejected.

• Generally, the level of significance is given in the question, e.g., "test the hypothesis at the 5% level of significance" or "at the 95% level of confidence".
Selection of the Level of Significance

• The level of significance is the complement of the level of confidence

• If the level of confidence is 95%, the level of significance is 5% or simply, α = 0.05; and if the level of confidence is 99%, the level of significance α = 0.01.

• Clearly, a lower value of α implies a smaller chance of rejecting a true null hypothesis and, consequently, a greater chance of its acceptance.
Step 3: Selecting the Test Statistic
• The testing of hypotheses is done by means of computing some statistic and comparing its
value with a certain standard value, called critical value, to determine whether null
hypothesis may be rejected.

• There are several test statistics such as z, t, F, chi square etc. available.

• The choice of an appropriate test-statistic is an important step in hypothesis testing.

• The nature of underlying population from which sampling is done, the knowledge about its
parameters, sample size, number of samples collected, etc. are the factors which govern the
choice of a test-statistic.
Step 3: Selecting the Test Statistic
• To illustrate, it may be recalled that the sampling distribution of the mean is normally distributed when the parent population is normally distributed, and also when the sample is reasonably large even though the population is not normally distributed, with mean equal to the population mean µ and standard deviation (the standard error) equal to σ/√n.

• Since in our coffee example the coffee weight is known to be normally distributed, we shall use the z-statistic for the purpose of testing the hypothesis that µ = 200. Thus, in a hypothesis test for the mean in such a case, we use z as the test statistic and use the properties of the normal curve. The z-statistic is defined as

z = (x̄ − µ) / (σ/√n)
• Different test statistics are used in different kinds of hypothesis testing situations.
Step 4: Establishing the Decision Rule

• A decision rule is a statement of the conditions under which a null hypothesis in a


test will be rejected and the conditions under which it will not be rejected

• Let us understand this for a test on means, where the test statistic used is z.

• In the coffee example, suppose that we select the level of significance as 5 percent
so that α= 0.05. Further, the sampling distribution of mean is known to be
normally distributed with parameters µ = 200 and standard error = 0.8
Step 4: Establishing the Decision Rule
• The sampling distribution is shown
divided into two parts: acceptance region and rejection region
• The acceptance region consists of the 95 percent
area while the rejection region covers the remaining 5 percent
area divided equally on both ends of the curve.
• Since the alternate hypothesis states that μ≠ 200, it follows
that the mean is hypothesized to be different from 200 which
may be higher or lower than this value.
• Thus, α = 0.05, which is the probability of rejecting the null
hypothesis, is divided into two halves of 0.025 each.
• From the z table, we find that the z-value beyond which an area of 0.025 lies under the curve is 1.96. Similarly, an area of 0.025 lies to the left of z = -1.96.
Step 4: Establishing the Decision Rule
• These z values for some level of significance α are called critical values

• If the z-value computed in respect of the sample taken works out to be more than 1.96 or less than -1.96, the null
hypothesis will be rejected.

• If z > 1.96, it implies that the sample mean is significantly higher than the hypothesized population mean µ, and if z < -1.96, then the sample mean is significantly lower than the hypothesized population mean.

• In each case, the difference between x̄ and µ is not seen to be due to fluctuations of sampling, and is considered to be significant. Hence, the null hypothesis is rejected.

• On the other hand, if the z-value is found to lie between ±1.96, the difference between x̄ and the hypothesized µ is considered to be not significant.

• This means that the evidence from sample is not sufficient to reject the null hypothesis. Hence, it is accepted.
Step 4: Establishing the Decision Rule

• The decision rule in this example can be stated as follows:

"Reject the null hypothesis if z > 1.96 or z < -1.96"

• Decision rule: If the computed value of the test statistic is more extreme than its critical value/s, then reject H0; otherwise, do not reject H0.

• The critical values are obtained having reference to appropriate area tables.
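The two-tailed rule at α = 0.05 can be expressed as a small helper function; a sketch (the function name `decide` is ours, not from the text):

```python
def decide(z: float, z_crit: float = 1.96) -> str:
    """Two-tailed decision rule: reject H0 if |z| exceeds the critical value."""
    return "reject H0" if abs(z) > z_crit else "do not reject H0"

print(decide(2.50))   # coffee example: z = 2.50 > 1.96, so H0 is rejected
print(decide(1.20))   # within +/-1.96: insufficient evidence against H0
```

For a one-tailed test, the same idea applies with the sign of z retained and a critical value such as 1.645.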
Step 4: Establishing the Decision Rule:
two tailed and one tailed test

• The critical value to be used in a given problem also depends upon whether the test involved is a
two-tailed test or a one-tailed test.

• To illustrate, in the coffee example, since the rejection region lies on both ends of the sampling
distribution because the alternate hypothesis is μ ≠ 200, the test is called a two-tailed test.

Two-tailed test

• A two-tailed test of hypothesis is conducted where it is important to consider both directions of difference away from the parameter specified in the null hypothesis.

• Here the rejection region lies in both ends of the sampling distribution.
Step 4: Establishing the Decision Rule: One tailed test
• If the rejection in a given problem lies only on
one end of the sampling distribution, the test is
called a one-tailed test.
• For example, if we have to test H0: μ = 200 against H1: μ > 200 at α = 0.05, the test would be one-tailed.
• In this case, the alternate hypothesis involves a
`>' sign.
• This implies that the null hypothesis μ = 200 would stand rejected in favour of the alternate hypothesis μ > 200 if the z-statistic is found to be lying in the region of rejection.
Step 4: Establishing the Decision Rule: One tailed test
• Since the entire 0.05 area lies in the right
tail of the curve, the critical value works
out to be 1.645 (being the value of z to the
right of which an area 0.05 lies).
• For the null hypothesis to be rejected, the
calculated value of z should exceed 1.645.
• When this happens, we may conclude that
the sample mean is found to be
significantly higher than the hypothesized
mean, μ = 200.
• Further, since the region of rejection is on
the right-hand side, it is called right-tailed
test.
Step 4: Establishing the Decision Rule: One tailed test

• In the same manner, if the rejection region lies on the left-hand side of the curve of the sampling distribution, it would be termed a left-tailed test.

• For example: Let H0: μ = 200, H1: μ < 200 and α = 0.05.

• Here the decision rule can be stated as follows: "Reject the null hypothesis if z < -1.645".

• Thus, if the z-statistic from the sample information is found to be smaller than -1.645, then we conclude that the sample mean is significantly lower than the hypothesized population mean, μ = 200.
Step 4: Establishing the Decision Rule
• A two-tailed test of hypothesis is conducted where it is important to consider both directions of difference away from the parameter specified in the null hypothesis. Here the rejection region lies in both ends of the sampling distribution.
• A one-tailed test of hypothesis is conducted where it is important to consider only a single direction of difference away from the parameter value specified in the null hypothesis. In such a test, the rejection region lies only in one end of the sampling distribution.
• Notice that even when the level of significance is the same in the two cases, the critical value in a one-tailed test is different from that in a two-tailed test. This is because the rejection region lies only on one side in the former case, while it is split into two parts in the case of a two-tailed test. For some selected levels of significance, the critical z-values are: α = 0.10: ±1.645 (two-tailed), 1.28 (one-tailed); α = 0.05: ±1.96 (two-tailed), 1.645 (one-tailed); α = 0.01: ±2.58 (two-tailed), 2.33 (one-tailed).
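These critical z-values can be recovered from the standard normal quantile function; a sketch using only Python's standard library:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf   # standard normal quantile (inverse CDF)

for alpha in (0.10, 0.05, 0.01):
    two_tail = z(1 - alpha / 2)   # area alpha split equally between both tails
    one_tail = z(1 - alpha)       # entire area alpha in one tail
    print(f"alpha = {alpha:.2f}: two-tailed +/-{two_tail:.3f}, one-tailed {one_tail:.3f}")
```

Note that the one-tailed critical value for a given α equals the two-tailed critical value for 2α, which is exactly the α versus α/2 correspondence described earlier for area tables.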
Step 4: Establishing the Decision Rule
• To sum up, establishing the decision rule in a test calls for determining the critical value
of the test statistic used for hypothesis testing.

• If the computations based on sample information yield a statistic value which exceeds
the critical value (note that it also means that the statistic value is smaller than the
critical value when they both are negative), the null hypothesis is rejected and the
alternate hypothesis is accepted.

• The critical value in a given situation depends upon the test statistic employed, the level of significance used and, where relevant, whether the test is one-tailed or two-tailed.
Step 5: Computations

• The next step in hypothesis testing is taking sample and using the sample
information to calculate the value of the test statistic in question.

• To illustrate, in the coffee example under consideration, we have x̄ = 202, μ = 200 and σ/√n = 4/√25 = 0.8.

• With these inputs, we get

z = (202 − 200) / 0.8 = 2.50
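This computation can be checked in a couple of lines of Python:

```python
import math

mu0, xbar = 200, 202       # hypothesized mean and sample mean (gm)
sigma, n = 4, 25           # population std dev (gm) and sample size

se = sigma / math.sqrt(n)  # standard error = 4/5 = 0.8
z = (xbar - mu0) / se      # test statistic
print(f"z = {z:.2f}")      # z = 2.50, to be compared with the critical value 1.96
```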
Step 6: Conclusion
• The final step in the hypothesis testing process calls for drawing conclusion of the test.

• This is done on the basis of the comparison of the calculated value of the test statistic with the critical
value.

• For the coffee powder example, the calculated value of z is found to be 2.50, which is larger than the critical value of 1.96 mentioned earlier.

• This leads us to reject the null hypothesis in favour of the alternate hypothesis.

• Thus, the sample mean here is found to be significantly different from the hypothesized value of the
population mean.

• There is evidence to suggest, therefore, that the mean coffee powder fill has drifted from 200 gm.
