Syllabus
• UNIT-I: Probability Distributions
• For example, the probability of getting a head or a tail in a coin toss is 0.5. The probability of getting 1 in a roll of a die is 1/6.
• A random variable is a variable that can take more than one value, each with some probability.
• For example, X is a random variable representing the outcome of a die roll. X can take the values 1, 2, 3, …, 6, each with probability 1/6.
• A probability distribution gives the probability of every possible value of a random variable.
• For example, P(x) = 1/6 for x = 1, 2, 3, 4, 5, 6 is the probability distribution of the random variable x representing the outcome of a die roll.
Probability Distributions
• Binomial Distribution
• Poisson distribution
• Normal Distribution
Binomial Distribution
A binomial distribution has the following essential properties:
1. The experiment consists of n trials.
2. All the trials are independent of each other.
3. Each trial has only 2 outcomes. Any one of them can be denoted as "success" and the other as "failure". The probability of "success" is p and that of "failure" is (1 - p).
4. In the experiment conducted n times, the random variable is the number of successes obtained, x.
So, x = {0, 1, 2, 3, 4, …, n}; if n = 3 then x = {0, 1, 2, 3}.
x = 0 means exactly 0 successes in 3 trials, x = 1 means exactly 1 success in 3 trials,
x = 2 means exactly 2 successes in 3 trials, x = 3 means exactly 3 successes in 3 trials.
5. Then, the probability of x successes, denoted by P(x), is given as:
P(x) = C(n, x) p^x (1 - p)^(n - x)
where C(n, x) = n!/(x!(n - x)!), in which n! = n*(n-1)*(n-2)…1. Also, 0! = 1.
Example
• The manager of a departmental store informs that the probability that a customer who is just browsing will eventually buy some items is 0.4. During the pre-lunch session on a day, 7 customers are seen to browse in the department. Find the probability of each possible number of buyers.
• We know: P(0) = 0.02799, P(1) = 0.13064, P(2) = 0.26127, P(3) = 0.29030, P(4) = 0.19354, P(5) = 0.07741, P(6) = 0.01720, P(7) = 0.00164
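These values can be reproduced with a short computation. This is a sketch using only the Python standard library; `binom_pmf` is a helper name chosen here, not from the source.

```python
# Binomial probabilities for the browsing-customers example:
# n = 7 customers, p = 0.4 probability that a browser buys.
from math import comb

def binom_pmf(x, n, p):
    """P(exactly x successes in n independent trials)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 7, 0.4
probs = {x: binom_pmf(x, n, p) for x in range(n + 1)}
for x, px in probs.items():
    print(f"P({x}) = {px:.5f}")

# The probabilities over all possible x must sum to 1.
print(round(sum(probs.values()), 10))
```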
Expected value
• The expected value of a random variable is E(X) = Σ xi·pi, summed over i = 1, …, n,
where n is the number of data points, xi is the value of each data point and pi is the probability of the i-th data point.
• For a binomial distribution, this works out to E(X) = n*p.
Variance
• Variance is a measure of dispersion in the data. If the data points are far away from each other, the variance will be high, and vice versa.
• The formula for calculating Variance is
Variance = n*p*(1-p)
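Both formulas, applied to the browsing-customers example (n = 7, p = 0.4):

```python
# Mean and variance of a binomial random variable:
# E[X] = n*p, Var(X) = n*p*(1-p).
n, p = 7, 0.4
mean = n * p                 # expected number of buyers, approx. 2.8
variance = n * p * (1 - p)   # approx. 1.68
print(mean, variance)
```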
Poisson distribution
• In many cases, the number of trials n and the probability p are not given.
• There are many discrete phenomena that are represented by a Poisson process.
• A Poisson distribution is said to exist when we can observe discrete events in some
area of interest (which may be a continuous interval of time, space, length, etc.)
Poisson distribution
• The only condition for a Poisson distribution is that the expected number of successes (or events) must be known; it is represented by λ. [Example: the average number of customers visiting a site is 5 per minute.]
• If we know λ, we can find the probability of getting exactly 0 successes in future, 1 success, 2, 3, 4, 5, … up to infinity. [Example: the probability of exactly 0 customers per minute, exactly 1 customer per minute, exactly 2 customers per minute, 3 customers, …]
• So, x is the random variable denoting number of successes or number of arrivals x={0,1,2,3,4,…..∞}
• (a) The minimum number of successes in a Poisson distribution is zero while there is no upper limit.
• (b) In calculating probabilities, the value of λ should be defined carefully. To illustrate, it is given that, on an average, 12 accidents occur in a quarter of a year on a certain crossing. In this case, for calculating probabilities of:
(i) A certain number of accidents to occur over a one-month period, we should take λ = 4.
(ii) A certain number of accidents to occur over a two-month period, we should take λ = 8.
(iii) A certain number of accidents to occur over a three-month period, we should take λ = 12.
(iv) A certain number of accidents to occur over a one-and-a-half-month period, we should take λ = 6.
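The scaling of λ to the period of interest can be sketched as:

```python
# Rescale the Poisson rate lambda to the period of interest:
# 12 accidents per quarter (3 months) means 4 per month.
rate_per_quarter = 12
rate_per_month = rate_per_quarter / 3

periods = {
    "one month": 1 * rate_per_month,
    "two months": 2 * rate_per_month,
    "three months": 3 * rate_per_month,
    "one and a half months": 1.5 * rate_per_month,
}
for period, lam in periods.items():
    print(f"{period}: lambda = {lam}")
```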
EXAMPLE:
• If, on an average, 2 customers arrive at a shopping mall per minute, what is the
probability that
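The question text above is truncated; assuming it asks for the probability of exactly x arrivals in a minute with λ = 2, the Poisson formula P(x) = λ^x e^(−λ) / x! gives:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(exactly x arrivals) when the average rate is lam per interval."""
    return lam**x * exp(-lam) / factorial(x)

lam = 2  # on average, 2 customers arrive per minute
for x in range(4):
    print(f"P({x}) = {poisson_pmf(x, lam):.4f}")
```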
Normal Distribution
• There are several phenomena which seem to follow this distribution very closely or can be approximated by it.
• When the data set is very large, then in most cases the random variable follows the normal distribution.
• The normal density function is:
y(x) = (1/(σ√(2π))) e^(−(x − μ)²/(2σ²))
• where, e = 2.7183
• μ = expected value
• σ = standard deviation [standard deviation = √variance]
• x = a particular value of the random variable
• y(x) = density for x
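A minimal sketch of the density formula, cross-checked against the standard library's implementation (the values μ = 100, σ = 15 are illustrative, not from the source):

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """y(x) = (1 / (sigma * sqrt(2*pi))) * e^(-(x - mu)^2 / (2*sigma^2))"""
    return (1 / (sigma * sqrt(2 * pi))) * exp(-((x - mu) ** 2) / (2 * sigma ** 2))

mu, sigma = 100, 15   # illustrative parameters
print(normal_density(110, mu, sigma))
print(NormalDist(mu, sigma).pdf(110))   # should agree with the line above
```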
Sampling
• Samples are taken and analysed not just for their sake but to learn about the
populations from which they are drawn.
• Economic: Sampling is done mainly for the economic reasons as it may be too
expensive or too time-consuming to attempt either a complete or a nearly
complete coverage in a statistical study.
• Destructive nature of tests: Where the testing results in the destruction of the
elements in the process of examination
• Very large populations: When the population in question is very large in size or
is infinite, then sampling is the only choice
Types of Sampling
1. Simple Random Sampling
• In simple random sampling, every element of the population has an equal chance of being included in the sample, with elements drawn at random from the whole population.
2. Systematic Sampling
• First, the ratio of the population size, N, to the sample size, n, is calculated and represented by k. Thus, k = N/n.
• Note that only the integer value of k is considered here, ignoring the fractional part, if any.
• After this, an element is chosen randomly from the first k elements. This is the first element selected in the sample.
• It is followed by choosing every k-th element from the element chosen, for inclusion in the sample.
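The every-k-th-element procedure above can be sketched as follows (a sketch assuming the population is held in a Python list; `systematic_sample` is a name chosen here):

```python
import random

def systematic_sample(population, n):
    """Pick every k-th element after a random start among the first k."""
    N = len(population)
    k = N // n                      # integer part only, as noted above
    start = random.randrange(k)     # random element from the first k
    return population[start::k][:n]

random.seed(1)                      # for a reproducible illustration
pop = list(range(1, 101))           # population of N = 100 elements
sample = systematic_sample(pop, 10)
print(sample)                       # 10 elements, spaced k = 10 apart
```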
3. Stratified Sampling
• In stratified sampling, the N elements of the population are first sub-divided into distinct and
mutually exclusive sub-populations, also called strata, according to some common
characteristic.
• For example, the employees of a large company can be divided by their rank, gender,
department, and so forth.
• After a population is divided into appropriate strata, a simple random sample is taken within each stratum.
• Stratified sampling is more efficient than simple random sampling or systematic sampling
because such sampling ensures representation of individuals or items across the total population.
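Stratified sampling as described above can be sketched as follows (the company, departments, and employee labels are hypothetical):

```python
import random

def stratified_sample(strata, n_per_stratum):
    """Take a simple random sample of equal size within each stratum."""
    return {name: random.sample(members, n_per_stratum)
            for name, members in strata.items()}

random.seed(2)                      # for a reproducible illustration
# Hypothetical company, stratified by department.
strata = {
    "sales": [f"S{i}" for i in range(40)],
    "engineering": [f"E{i}" for i in range(50)],
    "hr": [f"H{i}" for i in range(10)],
}
sample = stratified_sample(strata, 3)
print(sample)   # every stratum is represented in the sample
```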
4. Cluster Sampling
• In cluster sampling, the population is divided into groups, or clusters, and a random sample of clusters is drawn; the elements in the selected clusters make up the sample.
5. Convenience Sampling
• In this type of sampling, the investigator or his people have the freedom to choose whomsoever they find conveniently.
• For example, the sample mean and standard deviation are represented by x̄ and s respectively, while the population parameters are represented by μ and σ.
• Since a sample is only a part of the population, we do not expect a statistic value
to match exactly the corresponding parameter, except only by chance.
• Such an error is likely to occur due to the fact that a sample is only a subset of
the population.
• However, this is not the only reason for errors; there are other causes as well.
• The sampling errors arise only for the reason of sampling and result from
the chance selection of sampling units
• This type of error occurs simply because only a part of the population is
observed and is expected to disappear when a census study is undertaken.
• Non-sampling errors, in contrast, may arise because of bias, vague definitions used in the data collection, defective methods of data collection, incomplete coverage of the population, wrong entries made in the questionnaire, etc.
(a) Take all possible samples of size n from a population of size N, having mean μ and standard deviation σ.
(b) Calculate the mean of each of these samples.
(c) Tabulate the mean values and calculate the relative frequency of each value of mean by dividing the frequency with which it appears by the total frequency (equal to the number of samples). The relative frequency of each value indicates its probability.
Example
• Central Limit Theorem (CLT): If random samples of size n are drawn from any population with mean μ and standard deviation σ, and if n is sufficiently large, then the distribution of possible mean values will be approximately normal with expected value μ and standard error σ/√n, regardless of the population distribution.
• The sampling distribution of the mean is (approximately) normal in two cases:
1. When the population itself is normally distributed, whatever the sample size.
2. When the population is not normally distributed but the sample size n is large enough.
• In either case, we can use the normal area table to calculate probabilities involving the sample mean.
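The CLT can be checked by simulation. This sketch draws samples from a deliberately non-normal (uniform) population whose parameters are chosen so that μ = 50 and σ = 10; the distribution of sample means should centre on μ with spread close to σ/√n:

```python
import random
from statistics import mean, stdev

random.seed(3)                       # for a reproducible illustration
mu, sigma, n = 50, 10, 40
# A uniform population on [mu - sigma*sqrt(3), mu + sigma*sqrt(3)]
# has mean mu and standard deviation sigma.
half_width = sigma * 3 ** 0.5

sample_means = [
    mean(random.uniform(mu - half_width, mu + half_width) for _ in range(n))
    for _ in range(5000)
]
print(mean(sample_means))    # close to mu = 50
print(stdev(sample_means))   # close to sigma / sqrt(n) = 10 / sqrt(40)
```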
Example
(a) A random sample of 10 batteries will have a mean life of 412 hours or
greater.
(b) A sample of 100 batteries selected randomly will have a mean life of at least
412 hours.
Example
• Since the number of successes involved here is a discrete variable, we need to apply a continuity correction factor (CCF) of +0.5 or −0.5.
Sampling Distribution of Number of Successes
• It, thus, involves using sample statistics to predict the values of the
population parameters.
Theory of Estimation
• Using a point estimate, we might say that average height of Pune citizens is 160 cm.
• An interval estimate would say that there is a 95% probability that the average height of Pune citizens falls in the interval 150 cm to 170 cm.
Point Estimates
• When a point estimate is found, the sample value or statistic used is called
an estimator and the specific number obtained is called an estimate.
• So, an interval estimate is obtained by adding some quantity to, and subtracting it from, the point estimate.
• The interval coefficient or confidence coefficient, depends on the level of confidence and the shape of the sampling
distribution.
• A level of confidence equal to 95 percent means that the probability is 0.95 that the parameter value being estimated is contained within the interval we obtain, and 0.05 that it is not.
• A level of confidence is designated as 1 -α. Hence, if 95 percent confidence level is required, then α = 0.05.
• α is the probability of error indicating that the parameter will not be included in the interval estimate.
CONFIDENCE INTERVAL FOR POPULATION MEAN OF LARGE SAMPLES:
When the Population Standard Deviation is Known
• If the sampling distribution is normally distributed, an interval estimate of the population mean may be constructed as follows:
x̄ ± z(α/2) · σ/√n
• When the level of confidence is 95 percent, we have α = 0.05 and α/2 = 0.025. Now an area equal to 0.025 lies under the normal curve beyond z = 1.96. Thus, for a 95 percent level of confidence, we have z = 1.96.
CONFIDENCE INTERVAL FOR POPULATION MEAN OF LARGE
SAMPLES: When the Population Standard Deviation is Known
• Example: A marketing research firm has contracted with an advertising agency to provide information on the average time spent watching TV per week by families in a city. The firm selected a random sample of 100 families and found the mean time spent by them watching TV to be equal to 32 hours. It is known that the standard deviation of the TV watching time is 12 hours, which has been constant over the past few years. Construct:
(a) a 95 percent confidence interval, and
(b) a 99 percent confidence interval for the mean time spent by families watching TV. Help the firm.
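The intervals for this example can be computed as follows (a sketch using the standard library; `ci` is a helper name chosen here):

```python
from statistics import NormalDist

x_bar, sigma, n = 32, 12, 100      # sample mean, known sigma, sample size
se = sigma / n ** 0.5              # standard error = 12 / 10 = 1.2

def ci(confidence):
    """Two-sided confidence interval x_bar +/- z * se."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return x_bar - z * se, x_bar + z * se

print(ci(0.95))   # roughly (29.65, 34.35)
print(ci(0.99))   # roughly (28.91, 35.09)
```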
CONFIDENCE INTERVAL FOR POPULATION MEAN OF LARGE SAMPLES:
When the Population Standard Deviation is not Known
• In most of the business applications, neither population mean is known, nor is the population standard
deviation.
• To make interval estimates in such cases, we use the sample standard deviation, s, calculated as follows, as an estimator of the population standard deviation:
s = √( Σ (xi − x̄)² / (n − 1) )
• Thus, if the sample size is large, the confidence interval estimate of the population mean is approximated by the following expression:
x̄ ± z(α/2) · s/√n
CONFIDENCE INTERVAL FOR POPULATION MEAN OF LARGE SAMPLES:
When the Population Standard Deviation is not Known
• This can be interpreted as: there is a (1 − α) probability that the value of the sample mean will involve a sampling error of E or less.
• Given the extent of error acceptable (E), the level of confidence (1 − α), and the population standard deviation (σ), we can determine the sample size required:
n = (z(α/2) · σ / E)²
• Oil India Corporation has a bottle-filling machine which can be adjusted to fill oil
to any given average level, but it fills oil with a standard deviation of 0.010 litres.
The machine has recently been reset to a new filling level and the manager wants
an estimate of the mean amount of fill to be within ±0.001 litres with a 99 percent
level of confidence. How many bottles should the manager sample?
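A sketch of the computation with n = (zσ/E)². Using the exact z-value for 99 percent confidence (about 2.576) gives n = 664; textbooks that round z to 2.58 obtain 666.

```python
from math import ceil
from statistics import NormalDist

sigma = 0.010   # litres, the machine's known standard deviation
E = 0.001       # acceptable error, litres

z = NormalDist().inv_cdf(0.995)   # two-sided 99% confidence, z ~ 2.576
n = ceil((z * sigma / E) ** 2)    # round up: a sample size must be whole
print(n)
```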
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES
When the population is normally distributed and the population
standard deviation is known
If the population is normally distributed and the population standard deviation is known, then the sampling distribution of the mean is also normally distributed, irrespective of the size of the sample involved.
This implies that even if the sample is small in size, we can use the z-table to obtain the interval estimate in the same manner as in the case of large samples, using the interval:
x̄ ± z(α/2) · σ/√n
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES
Example:
The solution to the problem of small-sample interval estimation lies with a statistic called 't' rather than with z.
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES
• For a sample of size n, the number of degrees of freedom is one less than the sample size, that is, v = n − 1.
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES
• As in the case of z, the value of t depends upon the level of confidence. In addition, the t-value is
dependent on the degrees of freedom.
CONFIDENCE INTERVAL FOR POPULATION MEAN: SMALL SAMPLES
• The chance of μ not being included in the confidence interval is represented by α and, therefore,
reference should be made to the column head represented by α if the table gives areas in both tails of the
distribution and α /2 if the table provides areas in one tail of it.
• Values under area (α) = 0.05 in a two-tail table would match with the values under area (α /2) = 0.025 in
the case of one-tail table.
• If the level of confidence is given to be 90 percent, we focus on the column headed 0.10 in the case of a
two-tail table and 0.05 in the case of a one-tail table.
• Another element relevant in consulting the t-table is the number of degrees of freedom, v, which is
equal to n-1
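A t-based interval can be sketched as follows. The data values are hypothetical, and the hard-coded excerpt of critical values (two-tail α = 0.05, i.e. 0.025 per tail) is taken from a standard t-table to keep the sketch standard-library-only:

```python
from statistics import mean, stdev

# t critical values for 95% confidence (two-tail alpha = 0.05),
# indexed by degrees of freedom v = n - 1.
T_CRIT_95 = {5: 2.571, 9: 2.262, 14: 2.145, 19: 2.093}

data = [198, 202, 201, 197, 203, 200, 199, 204, 196, 200]  # n = 10, v = 9
n = len(data)
x_bar, s = mean(data), stdev(data)   # stdev uses the (n - 1) divisor
t = T_CRIT_95[n - 1]

half = t * s / n ** 0.5              # half-width of the interval
print((x_bar - half, x_bar + half))
```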
t-table
Example:
(a) Making a statement, or hypothesis, about a population parameter,
(b) Sampling from the population and analysing the sample data,
(c.1) While recognizing that some divergence between sample and population characteristics may be expected, since a sample is only a subset of the population,
(c.2) Determining whether the differences between what is observed and the statement made could have been due to chance alone, and hence insignificant, or whether they are significant, casting a doubt on the statement made.
Example
• A company uses a semi-automatic process to fill coffee powder in 200 gm jars and
this fill is known to have a standard deviation equal to 4 gm. For long, the amount
of coffee powder filled is observed to be normally distributed with a mean of 200
gm. The manager is concerned with ensuring that the process is working
satisfactorily so that the average amount of coffee powder filled in the jars is 200
gm. She has currently taken a sample of 25 jars, weighs the amount of coffee in
each of them and finds the average amount equal to 202 gm. Now, her problem is
as to how this difference of 2 gm be interpreted. Is it a small difference and may
be ignored or is it large enough to conclude that the process is not working
properly and some action is warranted?
Hypothesis Testing
• The process of hypothesis testing consists of the following six steps:
• It is a hypothesis of `no difference' or `no change' and is set up on the presumption that no
significant difference exists between sample result and the population parameter
hypothesized.
• It is assumed that whatever difference is observed between sample statistics and population
parameters is due to random causes only and is not significant.
Step 1: Setting up of Null and Alternate Hypotheses
• The complement of the null hypothesis, the alternate hypothesis, on the other hand, states what may be believed
to be true and is accepted if the null hypothesis is rejected.
• We hypothesize that the mean amount of coffee powder filled is indeed 200 gm. If the sample mean is not found to be significantly different from this hypothesized value, we have no reason to reject this hypothesis. However, if the sample mean is significantly higher or lower than this value, then the null hypothesis is rejected and the alternate hypothesis is accepted. The alternate hypothesis stipulates that the mean is, in fact, other than 200 gm, whether higher or lower.
Examples of Null and Alternate Hypotheses
Example: A study claims that the mean income of the senior executives in the manufacturing sector
in an industrial state is ₹625,000 per annum. To test this claim, it is decided to take a sample of 200
executives and obtain their mean income. Set up the appropriate hypotheses.
For the purpose of setting up the null and alternate hypotheses, it is assumed that the sample mean differs from the claimed population mean of ₹625,000 only due to chance factors. The alternate hypothesis would be that the difference between the two mean values is real and not due to chance.
Null hypothesis, H0: µ = ₹625,000; Alternate hypothesis, H1: µ ≠ ₹625,000.
Let the population mean (the mean score of the entire set of employees) be µ and the mean score of the top-line supervisors be µ1. The null and alternate hypotheses would be as under:
Null hypothesis, H0: µ1 = 75 Top-line supervisors are not more intelligent than average employees.
Alternate hypothesis, H1: µ1 > 75 Top-line supervisors are more intelligent than average employees.
Examples of Null and Alternate Hypotheses
Example: A coin-sorting machine at a bank can sort, on an average, 22,800 coins in a day with a standard deviation of 280 coins. A new model of the machine is out in the market and the bank manager is considering replacing the old one. The bank manager proposes to replace the old machine only if the new model sorts a larger number of coins than the old one. For this, the manager employs the new sorter for a period of 4 weeks and finds that the average number of coins sorted per day is 23,540, with almost the same standard deviation as the old one. Write the appropriate hypotheses for testing.
Null hypothesis, H0: µ = 22,800 The new model is no better than the existing one.
Alternate hypothesis, H1: µ > 22,800 The new model is better than the existing one.
Examples of Null and Alternate Hypotheses
Example: A company is making brake system for cars. Using this brake system, a car running at speed of 40kmph
comes to a halt after covering a distance of 15 feet on an average. Recently, the company has developed power
brakes which, when applied one hundred times to cars running at 40 kmph, indicated that the average distance
covered is 13 feet before stopping. To test if the power brake system is better, set up the appropriate hypotheses.
For the null hypothesis, we assume that the power brake is no different from the old brake system, so that any observed reduction in the average stopping distance with the new brake system, relative to the existing one, is due only to chance factors. The alternate hypothesis would, of course, be that the power brake system is better than the existing one.
Null hypothesis,H0: µ = 15 feet The power brake is no better than the existing brake.
Alternate hypothesis, Hl: µ < 15 feet The power brake is better than the existing brake.
Examples of Null and Alternate Hypotheses
• The reason for this is simple: it is always the null hypothesis that is
tested and we need a specific value to test for calculations.
Step 2:Selection of the Level of Significance
• Once the hypotheses are set up, we need to state the level of significance, designated by α.
• The level of significance refers to the probability of rejecting a null hypothesis when it is
true.
• This is also termed the level of risk because it indicates the risk that a true null hypothesis
will be rejected.
• Generally, the level of significance is given in the question, e.g., "test the hypothesis at a 5% level of significance" or "at a 95% level of confidence".
Selection of the Level of Significance
• There are several test statistics such as z, t, F, chi square etc. available.
• The nature of underlying population from which sampling is done, the knowledge about its
parameters, sample size, number of samples collected, etc. are the factors which govern the
choice of a test-statistic.
Step 3: Selecting the Test Statistic
• To illustrate, it may be recalled that the sampling distribution of the mean is normally distributed when the parent population is normally distributed, and also when the sample is reasonably large even though the population is not normally distributed, with mean equal to the population mean µ and standard deviation (standard error) σ/√n.
• Since in our coffee example the coffee weight is known to be normally distributed, we shall use the z-statistic for the purpose of testing the hypothesis that µ = 200. Thus, in a hypothesis test for the mean in such a case, we use z as the test statistic and use the properties of the normal curve. The z-statistic is defined as
z = (x̄ − µ) / (σ/√n)
• Different test statistics are used in different kinds of hypothesis testing situations.
Step 4: Establishing the Decision Rule
• Let us understand this for test on means where the test statistic used is z.
• In the coffee example, suppose that we select the level of significance as 5 percent, so that α = 0.05. Further, the sampling distribution of the mean is known to be normally distributed with parameters µ = 200 and standard error σ/√n = 4/√25 = 0.8.
Step 4: Establishing the Decision Rule
• The sampling distribution is divided into two parts: the acceptance region and the rejection region.
• The acceptance region consists of 95 percent of the area, while the rejection region covers the remaining 5 percent of the area, divided equally between the two ends of the curve.
• Since the alternate hypothesis states that μ≠ 200, it follows
that the mean is hypothesized to be different from 200 which
may be higher or lower than this value.
• Thus, α = 0.05, which is the probability of rejecting the null
hypothesis, is divided into two halves of 0.025 each.
• From the z-table, we find that the z-value beyond which an area equal to 0.025 lies under the curve is 1.96. Similarly, an area of 0.025 lies to the left of z = −1.96.
Step 4: Establishing the Decision Rule
• These z values for some level of significance α are called critical values
• If the z-value computed in respect of the sample taken works out to be more than 1.96 or less than -1.96, the null
hypothesis will be rejected.
• If z> 1.96, it has the implication that the sample mean is significantly higher than the hypothesized population
mean, µ and if z <-1.96, then the sample mean is significantly lower than the hypothesized population mean.
• In each case, the difference between x̄ and µ is not seen to be due to fluctuations of sampling, and is considered to be significant. Hence, the null hypothesis is rejected.
• On the other hand, if the z-value is found to lie between ±1.96, the difference between x̄ and the hypothesized µ is considered to be not significant.
• This means that the evidence from sample is not sufficient to reject the null hypothesis. Hence, it is accepted.
Step 4: Establishing the Decision Rule
• Decision rule: If the computed value of the test statistic is more extreme than its critical value(s), then reject H0; otherwise, do not reject H0.
• The critical values are obtained by reference to the appropriate area tables.
Step 4: Establishing the Decision Rule:
two tailed and one tailed test
• The critical value to be used in a given problem also depends upon whether the test involved is a
two-tailed test or a one-tailed test.
• To illustrate, in the coffee example, since the rejection region lies on both ends of the sampling
distribution because the alternate hypothesis is μ ≠ 200, the test is called a two-tailed test.
• Here the rejection region lies in both ends of the sampling distribution.
Step 4: Establishing the Decision Rule: One tailed test
• If the rejection region in a given problem lies only on one end of the sampling distribution, the test is called a one-tailed test.
• For example, if we have to test H0: μ = 200 against H1: μ > 200 at α = 0.05, the test would be one-tailed.
• In this case, the alternate hypothesis involves a '>' sign.
• This implies that the null hypothesis μ = 200 would stand rejected in favour of the alternate hypothesis μ > 200 if the z-statistic is found to lie in the region of rejection.
Step 4: Establishing the Decision Rule: One tailed test
• Since the entire 0.05 area lies in the right
tail of the curve, the critical value works
out to be 1.645 (being the value of z to the
right of which an area 0.05 lies).
• For the null hypothesis to be rejected, the
calculated value of z should exceed 1.645.
• When this happens, we may conclude that
the sample mean is found to be
significantly higher than the hypothesized
mean, μ = 200.
• Further, since the region of rejection is on
the right-hand side, it is called right-tailed
test.
Step 4: Establishing the Decision Rule: One tailed test
• For example: Let H0: μ = 200, H1: μ < 200 and α = 0.05. Here the rejection region lies in the left tail and the critical value is z = −1.645; the null hypothesis is rejected, and the alternate hypothesis accepted, if the calculated z-value is smaller than (that is, more negative than) the critical value. Since the region of rejection is on the left-hand side, this is called a left-tailed test.
• The critical value in a given situation depends upon the test statistic employed, the level of significance used and, where relevant, whether the test is one-tailed or two-tailed.
Step 5: Computations
• The next step in hypothesis testing is taking sample and using the sample
information to calculate the value of the test statistic in question.
Step 6: Making a Decision
• The decision is made on the basis of a comparison of the calculated value of the test statistic with the critical value.
• For the coffee powder example, the calculated value of z is found to be 2.50, that is larger than the
critical value of 1.96 mentioned earlier.
• This leads us to reject the null hypothesis in favour of the alternate hypothesis.
• Thus, the sample mean here is found to be significantly different from the hypothesized value of the population mean.
• There is evidence to suggest, therefore, that the mean amount of coffee powder filled has drifted from 200 gm.
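The whole test for the coffee example can be checked numerically (a sketch; the variable names are ours):

```python
from statistics import NormalDist

# Coffee-jar example: H0: mu = 200, H1: mu != 200, alpha = 0.05.
mu0, sigma, n, x_bar = 200, 4, 25, 202

se = sigma / n ** 0.5                 # standard error = 4 / 5 = 0.8
z = (x_bar - mu0) / se                # = 2.5, the computed test statistic
z_crit = NormalDist().inv_cdf(0.975)  # ~ 1.96 for a two-tailed test

print(z, z_crit)
print("reject H0" if abs(z) > z_crit else "do not reject H0")
```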