
Probability Distributions

Syllabus
• UNIT-I: Probability Distributions

• UNIT-II: Sampling & Sampling Distribution, Estimation and Confidence Interval

• UNIT-III: Hypothesis Testing for Means & Proportions

• UNIT-IV: Chi-Square Test, Non-Parametric Test & ANOVA

• UNIT-V: Correlation and Regression Analysis


Binomial Distribution
A binomial distribution has the following essential properties:
1. The experiment consists of “n” trials.
2. All the trials are independent of each other.
3. Each trial has only 2 outcomes. Either one can be denoted as “success” and the other as “failure”.
The probability of “success” is “p” and the probability of “failure” is “(1-p)”; both stay the same from trial to trial.
4. When the experiment is conducted with “n” trials, the random variable “x” is the number of successes obtained.
So, x={0, 1, 2, 3, 4, …, n}; if n=3 then x={0, 1, 2, 3}
x=0 means exactly 0 successes in 3 trials, x=1 means exactly 1 success in 3 trials,
x=2 means exactly 2 successes in 3 trials, x=3 means exactly 3 successes in 3 trials
5. Then, the probability of “x” successes, denoted by P(x), is given as:
$P(x) = \binom{n}{x}\, p^{x} (1-p)^{n-x}$
where $\binom{n}{x} = \dfrac{n!}{x!\,(n-x)!}$, in which n! = n*(n-1)*(n-2)*…*1. Also, 0! = 1.
Example

A coin is tossed 3 times. Assume Head to be success. Calculate:

• a. Probability of getting exactly no success

• b. Probability of getting exactly 1 success

• c. Probability of getting exactly 2 success

• d. Probability of getting exactly 3 success
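The coin example can be checked with a few lines of Python. This is a minimal sketch that assumes a fair coin (p = 0.5, which the example does not state explicitly); the helper name `binomial_pmf` is illustrative only.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability P(x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Three tosses, Head = success, fair coin assumed (p = 0.5).
coin_probs = [binomial_pmf(x, 3, 0.5) for x in range(4)]

for x, prob in enumerate(coin_probs):
    print(f"P({x}) = {prob}")
```

The four probabilities cover every possible outcome, so they add up to 1.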


Example

• The manager of a departmental store informs that the probability that a
customer who is just browsing will eventually buy some items is 0.4.
During the pre-lunch session on a day, 7 customers are seen to browse in
the department. Using this information, answer the following questions:

(a): What is the probability of 0, 1, 2, ..., 7 customers buying some items?


Example
(b) What is the probability that at least 4 customers will be buying
something?
Example
(c) What is the probability that no browsing customers will be buying
anything?
Example
(d) What is the probability that not more than 4 customers will be
buying something?
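Parts (a) to (d) all follow from the binomial probabilities with n = 7 and p = 0.4. The sketch below is one way to compute them; the variable names are illustrative.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Binomial probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 7, 0.4
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]  # (a) P(0) ... P(7)

p_at_least_4 = sum(pmf[4:])   # (b) P(x >= 4) = P(4) + P(5) + P(6) + P(7)
p_none = pmf[0]               # (c) P(x = 0)
p_at_most_4 = sum(pmf[:5])    # (d) P(x <= 4) = P(0) + ... + P(4)

for x, prob in enumerate(pmf):
    print(f"P({x}) = {prob:.4f}")
print(f"P(x >= 4) = {p_at_least_4:.4f}")
print(f"P(x = 0)  = {p_none:.4f}")
print(f"P(x <= 4) = {p_at_most_4:.4f}")
```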
Example

• The manager of a departmental store informs that the probability that a customer
who is just browsing will eventually buy some items is 0.4. During the pre-lunch
session on a day, 7 customers are seen to browse in the department. Find:

• A. Expected number of customers buying some items.

• B. Variance of the customers buying some items.

• We know: P(0) = 0.0280, P(1) = 0.1306, P(2) = 0.2613, P(3) = 0.2903, P(4) =
0.1935, P(5) = 0.0774, P(6) = 0.0172, P(7) = 0.0017
Expected value and standard deviation

• We have already discussed the calculation of expected value and
standard deviation of a probability distribution.

• It can be shown mathematically that in the case of the binomial
distribution, the formulae for these measures simplify to the following:

Expected value or mean E(x) = n*p

Variance = n*p*(1-p)

Standard deviation = $\sqrt{n\,p\,(1-p)}$
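As a quick check on the departmental store example (n = 7, p = 0.4), the sketch below computes the mean and variance twice: once from the general definitions for a discrete distribution and once from the binomial shortcuts. The variable names are illustrative.

```python
from math import comb, sqrt

n, p = 7, 0.4
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

# General definitions: E(x) = sum x*P(x), Var = sum (x - E(x))^2 * P(x)
mean_def = sum(x * px for x, px in enumerate(pmf))
var_def = sum((x - mean_def) ** 2 * px for x, px in enumerate(pmf))

# Binomial shortcuts
mean_short = n * p
var_short = n * p * (1 - p)
sd_short = sqrt(var_short)

print(f"mean: {mean_def:.4f} vs {mean_short:.4f}")
print(f"variance: {var_def:.4f} vs {var_short:.4f}")
print(f"standard deviation: {sd_short:.4f}")
```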
Poisson distribution
• In many cases, the number of trials “n” and the probability “p” are not given.

• If we are interested in:

 the number of events occurring in a future interval, and

 the probability of a given number of events occurring in that interval,

then we can use the Poisson distribution.


Examples
• Number of customer arrivals per minute at a retail store

• Number of visitors in one hour on an e-commerce website

• The number of telephone calls received per minute

• The number of arrivals at a bank teller booth in every 5-minute period

• Number of accidents per day


Poisson distribution
• The Poisson probability distribution, named after the French mathematician S.D.
Poisson, is another discrete probability distribution with important practical
applications.

• There are many discrete phenomena that are represented by a Poisson process.

• A Poisson distribution is said to exist when we can observe discrete events in some
area of interest (which may be a continuous interval of time, space, length, etc.)
Poisson distribution
• The only condition for the Poisson distribution is that the expected number of successes (or events) must be
known; it is represented by λ. [Example: the average number of customers visiting a site is 5 per minute]

• If we know λ, we can find the probability of getting exactly 0 successes, 1 success, 2, 3, 4, 5, …
and so on with no upper limit [Example: probability of exactly 0 customers per minute, exactly 1 customer per minute,
exactly 2 customers per minute, 3 customers, …]

• So, x is the random variable denoting the number of successes or number of arrivals, x={0,1,2,3,4,…}

• Then, the formula for the probability of exactly x successes is given by

$P(x) = \dfrac{e^{-\lambda}\,\lambda^{x}}{x!}$

• Here, e = 2.7183 (a mathematical constant)

• λ = the expected number of successes (the mean)

• x = number of successes and P(x) = the probability of x successes


Poisson distribution
The following points may be noted:

• (a) The minimum number of successes in a Poisson distribution is zero while there is no upper limit.

• (b) In calculating probabilities, the value of λ should be defined carefully. To illustrate, it is given
that, on an average, 12 accidents occur in a quarter of a year on a certain crossing. In this case, for
calculating the probability of

(i) A certain number of accidents occurring over a month, we should take λ = 4.

(ii) A certain number of accidents occurring over a two-month period, we should take λ = 8.

(iii) A certain number of accidents occurring over a three-month period, we should take λ = 12.

(iv) A certain number of accidents occurring over a one-and-a-half month period, we should take λ = 6.
EXAMPLE:
• If, on an average, 2 customers arrive at a shopping mall per minute, what is the
probability that

(a) In a given minute, exactly 3 customers will arrive?

(b) In a given minute, exactly 4 customers will arrive?

(c) In a given minute, no customer will arrive?

(d) In a given minute, more than 2 customers will arrive?

(e) In a 5-minute period, exactly 10 customers will arrive?
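The five parts can be worked out with the Poisson formula P(x) = e^(-λ) λ^x / x!. The sketch below uses λ = 2 per minute for parts (a) to (d) and rescales to λ = 10 for the 5-minute period in part (e); the helper name `poisson_pmf` is illustrative.

```python
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """Poisson probability of exactly x events when the mean is lam."""
    return exp(-lam) * lam**x / factorial(x)

lam = 2  # average arrivals per minute

p_a = poisson_pmf(3, lam)                              # exactly 3 in a minute
p_b = poisson_pmf(4, lam)                              # exactly 4 in a minute
p_c = poisson_pmf(0, lam)                              # none in a minute
p_d = 1 - sum(poisson_pmf(x, lam) for x in range(3))   # more than 2 = 1 - P(0..2)
p_e = poisson_pmf(10, 5 * lam)                         # 5 minutes, so lam = 10

for label, value in zip("abcde", (p_a, p_b, p_c, p_d, p_e)):
    print(f"({label}) {value:.4f}")
```

Part (d) uses the complement because the Poisson variable has no upper limit, so summing P(3), P(4), … directly would never terminate.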


Expected Value and Standard Deviation
• The expected value (mean) and the variance of a Poisson distribution
are both numerically equal to λ. Thus, the standard deviation of a Poisson
distribution is equal to $\sqrt{\lambda}$ (since standard deviation = $\sqrt{\text{variance}}$).
Normal Distribution

• Normal Distribution is defined for a continuous random variable

• Normal Distribution holds a very significant place in statistics

• There are several phenomena which seem to follow this distribution very closely or
can be approximated by it

• When the number of observations is very large, many naturally occurring random
variables are found to follow the normal distribution approximately

• Examples: Height, weight of a large population


Normal Distribution
• A variable x has a normal distribution if its curve, as
shown in the Figure, is given by the following equation:

$y(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$

• where e = 2.7183 and π = 3.1416
• μ = expected value
• σ = standard deviation
[standard deviation = $\sqrt{\text{variance}}$]
• x = a particular value of the random variable,
• y(x) = density for x

Note: The normal distribution is a symmetric distribution
with its mean (expected value) located in the middle.
Calculation of Probabilities
• We cannot define a probability for a single point of a
continuous random variable, since there are infinitely many
points. Instead, we define probability for an interval.

• In this case, the probability that the variable will lie
in a given range, say between x1 and x2, is equal to the ratio of
the area under the normal curve between x1 and x2
to the total area under the curve.

• The total area under the curve is taken to be equal to 1, so the
probability is simply the area between x1 and x2.


Calculation of Probabilities

• z-transformation: Convert any normal random variable x into a
standardized normal variable z. This is called z-transformation. The
transformation from x to z is done by the following formula:

$z = \dfrac{x - \mu}{\sigma}$

• Then, use the standard z-table to calculate the area under the curve.


Example
A machine fills coffee powder in packets, with an average of 200 gm and a standard deviation
of 4 gm. Assuming that the coffee weight is normally distributed, find the probability that a
coffee packet selected at random will contain the following quantity of coffee.

(a) Between 200 and 206 gm,

(b) Between 206 and 210 gm,

(c) Between 190 and 195 gm,

(d) At least 200 gm, and
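Parts (a) to (d) can be checked with Python's standard-library `statistics.NormalDist`, which gives the cumulative area under the normal curve directly instead of through a z-table. This is a sketch; the variable names are illustrative.

```python
from statistics import NormalDist

fill = NormalDist(mu=200, sigma=4)   # packet weight in gm

p_a = fill.cdf(206) - fill.cdf(200)  # (a) between 200 and 206 gm
p_b = fill.cdf(210) - fill.cdf(206)  # (b) between 206 and 210 gm
p_c = fill.cdf(195) - fill.cdf(190)  # (c) between 190 and 195 gm
p_d = 1 - fill.cdf(200)              # (d) at least 200 gm

# Part (a) again through the z-transformation z = (x - mu) / sigma
z_upper = (206 - 200) / 4
p_a_via_z = NormalDist().cdf(z_upper) - NormalDist().cdf(0)

for label, value in zip("abcd", (p_a, p_b, p_c, p_d)):
    print(f"({label}) {value:.4f}")
```

The z-table route gives the same answer for (a), since z = 1.5 corresponds to an area of about 0.4332 between the mean and x = 206.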


Sampling-Introduction
• Decision makers cannot always measure an entire population and therefore
they have to rely on the information gained by sampling from the population.

• Samples are taken and analysed not just for their sake but to learn about the
populations from which they are drawn.

• Sampling is an integral part of what is known as inferential statistics, which
allows us to draw conclusions regarding large populations from a relatively
small number of observations drawn from such populations.
Reasons For Sampling

• Economic: Sampling is done mainly for economic reasons, as it may be too
expensive or too time-consuming to attempt either a complete or a nearly
complete coverage in a statistical study.

• Destructive nature of tests: Where testing destroys the elements being
examined, only a sample can be tested.

• Very large populations: When the population in question is very large in size or
is infinite, sampling is the only choice.
Types of Sampling
1. Simple Random Sampling

• A simple random sample is one in which every element of the parent
population has an equal chance of being included in the sample.

• For a population comprising N elements, the chance of a particular
element being selected in a single draw is 1/N.
2. Systematic Sampling
• First, the elements of the population are arranged in a random sequence.

• Then, the ratio of the population size, N, to the sample size, n, is calculated and
represented by k. Thus, k = N/n.

• Note that only the integer part of k is used here; the fractional part, if any, is ignored.

• After this, an element is chosen randomly from the first k elements. This is the first
element selected in the sample.

• It is followed by choosing every kth element from the element chosen, for inclusion in
the sample.
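The steps of systematic sampling can be sketched as below. The function name `systematic_sample` and the `seed` parameter are hypothetical, and trimming the result to exactly n elements is an assumption made here to handle the case where N is not an exact multiple of n.

```python
import random

def systematic_sample(population, n, seed=None):
    """Draw a systematic sample of size n (assumes 1 <= n <= N)."""
    rng = random.Random(seed)
    items = list(population)
    rng.shuffle(items)           # step 1: arrange the elements in random order
    k = len(items) // n          # step 2: k = N/n, integer part only
    start = rng.randrange(k)     # step 3: random pick among the first k elements
    return items[start::k][:n]   # step 4: every k-th element from there on

# Population of 100 numbered elements, sample of 8, so k = 12
sample = systematic_sample(range(1, 101), n=8, seed=1)
print(sample)
```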
3. Stratified Sampling
• In stratified sampling, the N elements of the population are first sub-divided into distinct and
mutually exclusive sub-populations, also called strata, according to some common
characteristic.

• For example, the employees of a large company can be divided by their rank, gender,
department, and so forth.

• After a population is divided into appropriate strata, a simple random sample is taken within
each stratum.

• Stratified sampling is generally more efficient than simple random sampling or systematic sampling
because it ensures that individuals or items from every stratum of the population are represented.
4. Cluster Sampling

• In cluster sampling, the elements of a population are divided into
several clusters so that each cluster is nearly representative of the
entire population.

• A random sample of clusters is then taken and all elements of each
selected cluster are studied.
5. Convenience Sampling

• Convenience sampling is a non-probability sampling procedure, involving
no restrictions.

• In this type of sampling, the investigators have the freedom to
choose whomsoever they find convenient.

• Convenience sampling is relatively cheap and easy to
undertake, but it does not ensure precision since the sample can be biased.
Statistics and Parameters

• In the context of sampling, it is important to understand the difference
between statistics and parameters.

• A statistic refers to a quantitative characteristic of a sample while a
parameter is a quantitative characteristic of a population.

• Example: The average of a population is a parameter whereas the average of a
sample is a statistic.
Statistics and Parameters
• The sample statistics are distinguished from parameters by using different symbolic
representations.

• For example, the sample mean and standard deviation are represented by $\bar{x}$ and s
respectively, while the corresponding population parameters are represented by μ and σ.

• Since a sample is only a part of the population, we do not expect a statistic value
to match exactly the corresponding parameter, except only by chance.

• Since different values of a statistic are possible, a statistic is a random variable
whereas a population parameter is a fixed value.
Sampling and Non Sampling Errors

• In statistics, an error refers to the difference between the value of a sample
statistic and the corresponding population parameter.

• Such an error is likely to occur due to the fact that a sample is only a subset of
the population.

• However, this is not the only reason for errors. There are other reasons
that also cause errors.

• As such, a distinction is made between sampling errors and non-sampling
errors.
Sampling Errors

• The sampling errors arise only for the reason of sampling and result from
the chance selection of sampling units

• This type of error occurs simply because only a part of the population is
observed and is expected to disappear when a census study is undertaken.

• Example: Difference between a population mean and a sample mean


Non Sampling Errors

• Non-sampling errors arise because of reasons other than sampling.

• They may arise because of bias, vague definitions used in the data
collection, defective methods of data collection, incomplete coverage of the
population, wrong entries made in the questionnaire, etc.

• While non-sampling errors can be controlled by careful planning and
execution, we cannot avoid sampling errors and have to deal with them.
Sampling Distribution

• Since a “statistic” is a random variable, we can study how the
values of the statistic behave and distribute themselves.

• The distribution of a statistic is called a sampling distribution.

• The sampling distribution is used to understand how the data from
samples can be used in the decision-making process.
Sampling Distribution

• If all possible samples of size n are selected from a population and
the same statistic is calculated for each sample, the distribution of
these values of the statistic is called the sampling distribution of
that statistic.

• Example: Distribution of Mean


Standard Deviation of a Sampling Distribution: Standard Error

• A sampling distribution, like any other distribution, is described by its
expected value and standard deviation.

• However, the standard deviation of a sampling distribution is called the
standard error.

• Thus, the standard deviation of the sampling distribution of the mean is
known as the standard error of the mean ($SE_{\bar{x}}$).
Sampling Distribution of Mean
• The sampling distribution of the mean is the probability distribution of all sample means obtained
from all possible samples of a certain fixed size from a given population.

• The following steps are used to obtain the sampling distribution of the mean:

(a) Take all possible samples of size n from a population of size N, having mean μ and standard
deviation σ.

(b) Calculate mean values for all the samples obtained.

(c) Tabulate the mean values and calculate the relative frequency of each value of mean by
dividing the frequency with which it appears by the total frequency (equal to the number of
samples). The relative frequency of each value indicates its probability.
Example

• A population consists of five elements, viz. 2, 3, 4, 6 and 10. Obtain


sampling distribution of mean taking sample size equal to. Also,
calculate the mean and variance of the sampling distribution
Sampling with replacement

• In sampling with replacement, an element may be selected more than once, so
repeated elements within a sample are permitted.

• If sampling is done with replacement from a population of size N,
taking samples of size n each, then a total of $N^{n}$ samples can be drawn.
Sampling without replacement

• In sampling without replacement, an element cannot be selected more than once,
so repeated elements within a sample are not permitted.

• If sampling is done without replacement from a population of size N,
taking samples of size n each, then a total of $\binom{N}{n}$ samples can be drawn.
Example

• A population consists of five elements, viz. 2, 3, 4, 6 and 10. Obtain the
sampling distribution of the mean taking sample size equal to 2, assuming
that sampling is done without replacement. Also, calculate the mean
and variance of the sampling distribution.
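Because the population is tiny, every possible sample can be listed. The sketch below enumerates all samples of size 2 without replacement, tabulates the relative frequency of each sample mean, and then computes the mean and variance of that distribution. Variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

population = [2, 3, 4, 6, 10]
n = 2

# All samples of size 2 without replacement (order ignored): C(5, 2) = 10
samples = list(combinations(population, n))
sample_means = [sum(s) / n for s in samples]

# Relative frequency of each distinct sample mean = its probability
counts = Counter(sample_means)
distribution = {m: counts[m] / len(sample_means) for m in sorted(counts)}

mean_of_means = sum(sample_means) / len(sample_means)
var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means)

for m, prob in distribution.items():
    print(f"mean = {m:4.1f}  probability = {prob:.1f}")
print("mean of sampling distribution:", mean_of_means)
print("variance of sampling distribution:", var_of_means)
```

The mean of the sampling distribution equals the population mean (2 + 3 + 4 + 6 + 10)/5 = 5.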
Mean and Standard Deviation of Sampling Distributions
• The mean of the sampling distribution is equal to the population mean μ.

• The standard deviation (known as the standard error of the mean) depends
upon whether the sampling is done with or without replacement.

• When sampling is done with replacement: $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

• When sampling is done without replacement: $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}$
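The two standard-error results can be tried on the five-element population 2, 3, 4, 6, 10 with n = 2. The sketch assumes the standard formulas σ/√n (with replacement) and (σ/√n)·√((N−n)/(N−1)) (without replacement), and cross-checks the with-replacement value by enumerating all N^n ordered samples. Variable names are illustrative.

```python
from itertools import product
from math import sqrt

population = [2, 3, 4, 6, 10]
N, n = len(population), 2

mu = sum(population) / N
sigma = sqrt(sum((x - mu) ** 2 for x in population) / N)  # population sd

se_with = sigma / sqrt(n)                         # with replacement
se_without = se_with * sqrt((N - n) / (N - 1))    # without replacement

# Cross-check: enumerate all N**n = 25 ordered samples drawn with replacement
means = [sum(s) / n for s in product(population, repeat=n)]
se_enumerated = sqrt(sum((m - mu) ** 2 for m in means) / len(means))

print(f"with replacement:    {se_with:.4f} (enumerated {se_enumerated:.4f})")
print(f"without replacement: {se_without:.4f}")
```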


Shape of the Sampling Distribution of Mean

• In situations with larger populations, the number of possible samples,
and therefore sample means, can become very large.

• When the number of possible samples is very large, we cannot
practically calculate all possible sample means to obtain the mean and
standard error of the sampling distribution.

• There are two theorems that allow decision-makers to describe how
the sample means are distributed.
Sampling from Normally Distributed Populations

• If the population from which sampling is done is normally distributed
with mean μ and standard deviation σ, the sampling distribution of the
mean is also normally distributed with mean μ and standard error
$\sigma/\sqrt{n}$, regardless of the size of the sample, n.
Sampling from Not Normally-Distributed Populations

• Central Limit Theorem (CLT): If random samples of size n are drawn from any
population with mean μ and standard deviation σ, and if n is sufficiently large, then
the distribution of possible mean values will be approximately normal with expected
value μ and standard error $\sigma/\sqrt{n}$, regardless of the population distribution.

• Note: A sample size of 30 or greater is generally considered sufficiently large.
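The Central Limit Theorem can be illustrated by simulation. The sketch below is an assumption-laden demonstration, not part of the slides: it draws 5,000 samples of size 30 from an exponential population (strongly skewed, with mean 1 and standard deviation 1) and compares the observed spread of the sample means with σ/√n.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(0)            # fixed seed so the run is repeatable
n, num_samples = 30, 5000

# Exponential population with rate 1: mean = 1, standard deviation = 1
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

observed_mean = mean(sample_means)
observed_se = pstdev(sample_means)
expected_se = 1 / sqrt(n)         # sigma / sqrt(n)

print(f"mean of sample means: {observed_mean:.3f} (population mean is 1)")
print(f"sd of sample means:   {observed_se:.3f} (sigma/sqrt(n) is {expected_se:.3f})")
```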


Sampling Distribution of Mean and the Probabilities

• We can now describe the sampling distribution of the mean when

1. The population is normally distributed, or

2. The population is not normally distributed but the sample size n is large enough.

• In both cases, we can use the normal area table to calculate probabilities involving the sample
mean.
Example

• The lives of a certain brand of batteries are known to be normally distributed
with a mean of 415 hours and a standard deviation of 20 hours. Calculate the
probability that

(a) A random sample of 10 batteries will have a mean life of 412 hours or
greater.

(b) A sample of 100 batteries selected randomly will have a mean life of at least
412 hours.
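Both parts use the same threshold but different sample sizes, so only the standard error σ/√n changes. A minimal sketch, with the hypothetical helper `prob_mean_at_least`:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 415, 20   # population mean and standard deviation (hours)

def prob_mean_at_least(threshold, n):
    """P(sample mean >= threshold) for a sample of size n."""
    se = sigma / sqrt(n)                         # standard error of the mean
    return 1 - NormalDist(mu, se).cdf(threshold)

p_a = prob_mean_at_least(412, 10)    # (a) n = 10,  z = -3/6.32 = -0.47
p_b = prob_mean_at_least(412, 100)   # (b) n = 100, z = -3/2    = -1.50

print(f"(a) {p_a:.4f}")
print(f"(b) {p_b:.4f}")
```

The larger sample has a smaller standard error, so its mean is more tightly concentrated around 415 and more likely to be at least 412.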
Example

• A random sample of 100 discs is taken from a production line which
produces discs having mean = 120 gm and standard deviation = 16 gm.
Find the probability that the sample mean will be

(i) greater than 124 gm, and

(ii) between 118 and 121.6 gm.
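Here n = 100 is large, so by the Central Limit Theorem the sample mean is approximately normal with mean 120 and standard error 16/√100 = 1.6. A sketch with illustrative names:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 120, 16, 100
se = sigma / sqrt(n)                       # 1.6 gm
mean_dist = NormalDist(mu, se)             # approximate distribution of the sample mean

p_i = 1 - mean_dist.cdf(124)                       # (i)  z = 2.5
p_ii = mean_dist.cdf(121.6) - mean_dist.cdf(118)   # (ii) z from -1.25 to 1.0

print(f"(i)  {p_i:.4f}")
print(f"(ii) {p_ii:.4f}")
```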


Sampling Distribution of Number of Successes
• If π is the proportion of successes (or the probability of success) in the population, the
sampling distribution of the number of successes calculated from the set of all samples of
size n shall be approximately normally distributed with:
Mean = $n\pi$
Standard deviation = $\sqrt{n\pi(1-\pi)}$
when both $n\pi$ and $n(1-\pi)$ are at least equal to 5.
• To calculate probabilities for the normal distribution, we use z-transformation

• Since the number of successes involved here is a discrete variable, we need to apply a
continuity correction factor (CCF) of +0.5 or -0.5
Sampling Distribution of Number of Successes

• Example: A large shipment of picture tubes is known to contain 10
percent defectives. A sample of 100 tubes is taken randomly such that
each tube is replaced before the next is taken. What is the probability
that the sample will have 8 or more defectives? Calculate using CCF
and without CCF.
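With n = 100 and π = 0.10, the number of defectives is approximately normal with mean nπ = 10 and standard deviation √(nπ(1−π)) = 3. The sketch below computes P(8 or more) both ways; with the continuity correction, "8 or more" starts at 7.5. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, pi = 100, 0.10
mean = n * pi                     # 10 defectives expected
sd = sqrt(n * pi * (1 - pi))      # 3

# The normal approximation needs both n*pi and n*(1 - pi) to be at least 5
approximation_ok = n * pi >= 5 and n * (1 - pi) >= 5

count_dist = NormalDist(mean, sd)
p_without_ccf = 1 - count_dist.cdf(8)      # z = (8 - 10) / 3
p_with_ccf = 1 - count_dist.cdf(8 - 0.5)   # z = (7.5 - 10) / 3

print(f"without CCF: {p_without_ccf:.4f}")
print(f"with CCF:    {p_with_ccf:.4f}")
```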
Sampling Distribution of Proportion of Successes
• Now we consider the proportion of successes (p) instead of the number of successes in a sample.
If π is the proportion of successes (or the probability of success) in the population, then the sampling
distribution of the proportion of successes shall be approximately normally distributed with:
Mean = $\pi$
Standard deviation or standard error = $\sqrt{\dfrac{\pi(1-\pi)}{n}}$
when both $n\pi$ and $n(1-\pi)$ are at least equal to 5.
• To calculate probabilities for the normal distribution, we use z-transformation

• We need to apply a continuity correction factor (CCF) of +1/(2n) or -1/(2n)


Sampling Distribution of Proportion of Successes

• Example: Of the customers visiting a University cafe, 40 percent
prefer to have self-service. Use the sampling distribution of proportion
to find the probability that at least 50 percent of the next 120
customers will prefer self-service. Calculate without CCF.
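For this example π = 0.40 and n = 120, so the sample proportion is approximately normal with mean 0.40 and standard error √(π(1−π)/n). A sketch without the continuity correction, as the example asks; names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

pi, n = 0.40, 120
se = sqrt(pi * (1 - pi) / n)               # standard error of the proportion

# Normal approximation is reasonable: n*pi = 48 and n*(1 - pi) = 72
approximation_ok = n * pi >= 5 and n * (1 - pi) >= 5

z = (0.50 - pi) / se
p_at_least_half = 1 - NormalDist().cdf(z)  # P(p >= 0.50), no CCF

print(f"standard error = {se:.4f}, z = {z:.3f}")
print(f"P(p >= 0.50) = {p_at_least_half:.4f}")
```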
Theory of Estimation

• Estimation refers to the procedure where sample information is used
to estimate the numerical value of some population measure, such as the
population mean or the population variance.

• It, thus, involves using sample statistics to predict the values of the
population parameters.
Theory of Estimation

• There are two kinds of estimates:

Point estimates and Interval estimates.

• A point estimate is a specific value while an interval estimate is a range of values.

• Using a point estimate, we might say that average height of Pune citizens is 160 cm.

• An interval estimate would say that there is a 95% probability that the average height of
Pune citizens falls in the interval 150 cm to 170 cm.
Point Estimates

• A point estimate is a specific value

• When a point estimate is found, the sample value or statistic used is called
an estimator and the specific number obtained is called an estimate.

• Thus, we can think of an estimate as an educated guess about some value
of a population parameter whereas an estimator is the rule or procedure
used to obtain that guess.
Properties of a Good Estimator

• Unbiasedness: An estimator is said to be unbiased if its expected value is equal
to the population parameter it estimates.

• Efficiency: An estimator is said to be efficient if it has relatively small variance.

• Consistency: An estimator is said to be consistent if its probability of being close
to the parameter it estimates increases as the sample size increases.

• Sufficiency: A sufficient estimator is one which utilizes all the information a
sample contains about the parameter to be estimated.
Good Estimators of mean and standard deviation of population
• A good estimator of the population mean μ is the sample mean $\bar{x}$

• A good estimator of the population standard deviation σ is the sample standard deviation s


Interval Estimates
• An interval estimate gives the estimate in terms of a range of values.

• In general terms, an interval estimate is obtained as under:

Point estimate ± (Interval coefficient * Standard error)

• So, interval estimate is obtained by adding and subtracting some quantity to the point estimate.

• Standard error is the standard deviation of the sampling distribution

• The interval coefficient, or confidence coefficient, depends on the level of confidence and the shape of the sampling
distribution.

• A level of confidence equal to 95 percent means that the probability is 0.95 that the parameter value being estimated is
contained, and 0.05 that it is not contained, within the interval we obtain.

• A level of confidence is designated as 1 -α. Hence, if 95 percent confidence level is required, then α = 0.05.

• α is the probability of error indicating that the parameter will not be included in the interval estimate.
CONFIDENCE INTERVAL FOR POPULATION MEAN OF LARGE SAMPLES:
When the Population Standard Deviation is Known

• If the sampling distribution is normally distributed, an interval estimate of the population mean may
be constructed as follows:

$\bar{x} \pm z\,\sigma_{\bar{x}}$

• Here $\bar{x}$ is the sample mean and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ is the standard error of the mean.

• z is the value determined by the probability associated with the estimate.

• When the level of confidence is 95 percent, we have α = 0.05, and α/2 = 0.025. Now an area equal to
0.025 lies under the normal curve beyond z = 1.96. Thus, for a 95 percent level of confidence, we
have z = 1.96.
CONFIDENCE INTERVAL FOR POPULATION MEAN OF LARGE
SAMPLES: When the Population Standard Deviation is Known

• Example: A marketing research firm has contracted with an advertising agency to provide
information on the average time spent on watching TV per week by families in a city. The firm
selected a random sample of 100 families and found the mean time spent by them watching TV
to be equal to 32 hours. It is known that the standard deviation of the TV watching time is 12
hours, which has been constant over the past few years.

• The firm wants to construct

(a) a 95 percent confidence interval, and

(b) a 99 percent confidence interval for mean time spent by families watching TV. Help the firm.
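The two intervals can be computed as sample mean ± z × σ/√n. The sketch below takes z from `NormalDist().inv_cdf` rather than a printed table, so the 99 percent limits may differ in the second decimal from a table-based answer using z = 2.58. The helper name `z_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for 95%, about 2.576 for 99%
    margin = z * sigma / sqrt(n)              # z times the standard error
    return x_bar - margin, x_bar + margin

ci_95 = z_interval(x_bar=32, sigma=12, n=100, confidence=0.95)
ci_99 = z_interval(x_bar=32, sigma=12, n=100, confidence=0.99)

print(f"95% CI: {ci_95[0]:.2f} to {ci_95[1]:.2f} hours")
print(f"99% CI: {ci_99[0]:.2f} to {ci_99[1]:.2f} hours")
```

The 99 percent interval is wider: a higher level of confidence requires a larger z and therefore a larger margin around the same sample mean.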
CONFIDENCE INTERVAL FOR POPULATION MEAN OF LARGE SAMPLES:
When the Population Standard Deviation is not Known

• We now consider situations where population standard deviation is not known.

• In most of the business applications, neither population mean is known, nor is the population standard
deviation.

• To make interval estimates in such cases, we use the sample standard deviation, s, calculated as follows, as an
estimator of the population standard deviation:

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^{2}}{n-1}}$

• Thus, if the sample size is large, the confidence interval estimate of the population mean is approximated by the
following expression:

$\bar{x} \pm z\,\dfrac{s}{\sqrt{n}}$
CONFIDENCE INTERVAL FOR POPULATION MEAN OF LARGE SAMPLES:
When the Population Standard Deviation is not Known

• Example: A random sample of 100 discs is taken from a large output of a
certain machine and is found to have mean weight equal to 130 gm and a
standard deviation of 20 gm. You are required to determine a 95 percent and
a 99 percent confidence interval of the mean weight of the entire output of
the machine.
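Since n = 100 is large, the sample standard deviation s = 20 stands in for the unknown σ and the interval is sample mean ± z × s/√n. A sketch, with z taken from `NormalDist().inv_cdf` instead of a table:

```python
from math import sqrt
from statistics import NormalDist

n, x_bar, s = 100, 130, 20
se = s / sqrt(n)                  # estimated standard error = 2 gm

intervals = {}
for confidence in (0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    intervals[confidence] = (x_bar - z * se, x_bar + z * se)

for confidence, (low, high) in intervals.items():
    print(f"{confidence:.0%} CI: {low:.2f} to {high:.2f} gm")
```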
Determining the sample size
• Recall that the interval estimate of the population mean is obtained by the following
expression:

$\bar{x} \pm z\,\dfrac{\sigma}{\sqrt{n}}$

• This can be interpreted as: there is a (1-α) probability that the value of the sample mean will
give a sampling error of $z\,\sigma/\sqrt{n}$ or less.
• Given the extent of error acceptable (E), the level of confidence (1-α), and the population
standard deviation (σ), we can determine the sample size required.

• Setting $E = z\,\dfrac{\sigma}{\sqrt{n}}$ and rearranging the terms, we get

$n = \left(\dfrac{z\,\sigma}{E}\right)^{2}$


Determining the sample size

• Oil India Corporation has a bottle-filling machine which can be adjusted to fill oil
to any given average level, but it fills oil with a standard deviation of 0.010 litres.
The machine has recently been reset to a new filling level and the manager wants
an estimate of the mean amount of fill to be within ±0.001 litres with a 99 percent
level of confidence. How many bottles should the manager sample?
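The required sample size follows from n = (zσ/E)², rounded up to the next whole bottle. The sketch below takes z from `NormalDist().inv_cdf` (about 2.576 for 99 percent); a table value of z = 2.58 gives a slightly larger n. Variable names are illustrative.

```python
from math import ceil
from statistics import NormalDist

sigma = 0.010        # standard deviation of fill, litres
E = 0.001            # acceptable error, litres
confidence = 0.99

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
n_exact = (z * sigma / E) ** 2
n_required = ceil(n_exact)       # round up: a fraction of a bottle cannot be sampled

print(f"z = {z:.4f}, n = {n_exact:.2f}, so sample {n_required} bottles")
```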
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES
When the population is normally distributed and the population
standard deviation is known
If the population is normally distributed and the population standard deviation is known, then the
sampling distribution of the mean is also normally distributed irrespective of the size of the sample
involved.

This implies that even if the sample is small in size, we can use the z-table to obtain the interval
estimate in the same manner as in the case of large samples using the interval:

$\bar{x} \pm z\,\dfrac{\sigma}{\sqrt{n}}$
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES

When the population is normally distributed and the population
standard deviation is known

Example:

• Obtain a 90 percent confidence interval for the mean of a population
from which a sample of size 10 is drawn randomly. The population is
known to be normally distributed and has a standard deviation equal to
15. The sample drawn has been found to have a mean equal to 42.
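Because the population is normal and σ is known, the z-based interval applies even though n = 10. A sketch, with z obtained from `NormalDist().inv_cdf` (about 1.645 for 90 percent):

```python
from math import sqrt
from statistics import NormalDist

n, x_bar, sigma = 10, 42, 15
confidence = 0.90

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # area of 0.05 in each tail
margin = z * sigma / sqrt(n)
ci_90 = (x_bar - margin, x_bar + margin)

print(f"z = {z:.3f}, margin = {margin:.2f}")
print(f"90% CI: {ci_90[0]:.2f} to {ci_90[1]:.2f}")
```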
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES

When the population is normally distributed and the population
standard deviation is not known
• We now proceed to consider the case where

• The population is normally distributed,

• The population standard deviation is not known, and

• The sample is small in size, so that n < 30.

In this case, the solution to the problem of small-sample interval estimation lies with a statistic called `t' rather
than with z.
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES

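• The t variable is defined as

$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$

• where s is the standard deviation of the sample and is obtained as

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^{2}}{n-1}}$

• For a sample of size n, the number of degrees of freedom is one less than the sample size, that is,
v = n-1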
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES

• Interval estimate of mean

• Using the t-distribution, an estimation of the population mean can be done in the following way:

$\bar{x} \pm t\,\dfrac{s}{\sqrt{n}}$

• As in the case of z, the value of t depends upon the level of confidence. In addition, the t-value is
dependent on the degrees of freedom.
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES

• The t-table is constructed in a manner different from the z-table.

• The chance of μ not being included in the confidence interval is represented by α and, therefore,
reference should be made to the column head represented by α if the table gives areas in both tails of the
distribution and α /2 if the table provides areas in one tail of it.

• Values under area (α) = 0.05 in a two-tail table would match with the values under area (α /2) = 0.025 in
the case of one-tail table.

• If the level of confidence is given to be 90 percent, we focus on the column headed 0.10 in the case of a
two-tail table and 0.05 in the case of a one-tail table.

• Another element relevant in consulting the t-table is the number of degrees of freedom, v, which is
equal to n-1
t-table
Example:

• A sample of 20 items is selected randomly from a very large shipment. It is found
to have a mean weight of 310 gm and standard deviation equal to 9 gm. Derive the
95 percent and 99 percent confidence intervals for the population mean weight.
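With n = 20 the degrees of freedom are v = 19, and the interval is sample mean ± t × s/√n. Python's standard library has no t-distribution quantile, so the sketch below hard-codes the two critical values as read from a standard t-table (2.093 and 2.861 for v = 19); treat them as assumed inputs rather than computed results.

```python
from math import sqrt

# Two-tail critical values for v = 19, copied from a standard t-table
# (assumed inputs: alpha = 0.05 gives 2.093, alpha = 0.01 gives 2.861).
T_CRITICAL_V19 = {0.95: 2.093, 0.99: 2.861}

n, x_bar, s = 20, 310, 9
se = s / sqrt(n)                      # estimated standard error of the mean

t_intervals = {
    level: (x_bar - t * se, x_bar + t * se)
    for level, t in T_CRITICAL_V19.items()
}

for level, (low, high) in t_intervals.items():
    print(f"{level:.0%} CI: {low:.2f} to {high:.2f} gm")
```

Both t-values are larger than the corresponding z-values (1.96 and 2.58), so the intervals are wider than a z-based calculation would give for the same data.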
Summary-Interval Estimation
