
Problem Set: Probability and Statistics II (QM1b)

Tathagatha Bandopadhyay

Karthik Sriram

Indian Institute of Management Ahmedabad

September 8, 2020
Introduction to Sampling
1. Suppose that, to solve the following problems, you decide to collect data by probability
sampling. In the context of each problem, define (i) the population, (ii) the element,
(iii) an appropriate sampling design, (iv) the sampling frame, (v) the sampling unit,
(vi) the character under study (variable or attribute), (vii) the parameter of interest,
and (viii) an appropriate statistic to be used for estimating the parameter. (You may
consider any sample size that would give a reasonable estimate. We will talk about
the sample size determination problem later.)
Problem 1. To estimate the percentage of families in Ahmedabad who have sent
their children to private schools. (To assess the demand for private school education
in Ahmedabad)
Problem 2. To estimate the average time per day (in hours) that a college student
in Ahmedabad spends on social networking site(s). (To understand the networking habits
of Ahmedabad college students)
Problem 3. To estimate the percentage of students studying in grades VI-VIII in
schools run by the Ahmedabad Municipal Corporation who are able to read a simple
Gujarati text. (To assess the reading skill of the students studying in municipality
schools)
Problem 4. To estimate the mean rating (1: poor, 2: okay, 3: good, 4: very good &
5: Excellent) of all the members of the IIMA gymnasium about its perceived service
quality. (To assess the service quality of the Gymnasium)
Problem 5. To estimate the percentage of C-section deliveries during 2016 in private
hospitals in Ahmedabad. (According to WHO, the rate of C-section delivery should ideally
be between 10% and 15%. To understand whether the hospitals follow this norm.)

2. To calculate the literacy rate of India, census data are used. In the census, household
literacy data are collected by asking the head of the household which members are literate
and which are not.

Later, a sample of 12000 households was selected from five states in the Hindi belt of
India. From the selected households, the literacy data were collected by the census method
(CM), and also by giving a simple reading test (RT) to each member of the households.
It was found that the estimate of the reading literacy rate by CM is at least 16% more
than that by RT. Also, the estimate obtained by the census method was found to be very
close to the literacy rate of the states reported by the government. Ref: Can India's
literate read? International Review of Education (2010), pp. 705-728.

In the above, the literacy data collected by the census method are clearly subject
to substantial error. Suggest a simple method to reduce this error.
3. Trump's Muslim ban: Different agencies conducted opinion surveys after Trump's proposal
to bar Muslim noncitizens from entering the United States, at least temporarily.

CBS News: Do you think the US should temporarily ban Muslims from other coun-
tries from entering the United States, or not? 36% support, 58% oppose.

YOUGOV: Do you agree or disagree that there should be 'a total and complete
shutdown of Muslims entering the United States until our country's representative
can figure out what is going on'? 45% agree, 41% disagree.

What could be the reason for such divergent results in the two surveys? Is it
possible to avoid this kind of bias?
Ref: https://www.nytimes.com/2015/12/16/upshot/how-unpopular-is-trumps-muslim-ban-depends-how-you-ask.html

4. An experiment was conducted with the following two questions (Schuman & Presser
(1981)):
A. Do you think United States should let Communist newspaper reporters from other
countries come in here and send back to their papers the news as they see it?

B. Do you think a communist country like Russia should let American newspaper
reporters come in and send back to America the news as they see it?
If the questions appeared in the order AB(BA): 54.7% (74.6%) Yes to A, 63.7%
(81.9%) Yes to B
Why do the outcomes diverge in this way? Suggest a method to reduce the bias.

5. Suppose an FPM student would like to draw a random sample of size 200 from the
population of mid-level HR executives working in the financial sector in India for her
thesis. She has two options to collect data:
(i) Collect data from those who would visit the campus for MDP/customized programmes
in the next one year.
(ii) Collect data by mailing the questionnaire (web survey) to the mid-level HR executives
of a large number of companies.

Which of the two options should she prefer? Discuss the kinds of errors
that are expected in the two methods. Is it feasible to get a truly random
sample in this case (a truly random sample means that each element in the
population has an equal probability of being selected)? What could be the
major inhibitor(s)? Suggest a sampling design that could be used
for getting a representative sample.

6. Suppose there are 10 colleges in a city. To select a college student at random from the
population of all students, a college is chosen at random, and then a student is picked
at random from the chosen college.
(i) Do you think that the sampling procedure will lead to the random selection of a
student?
(ii) If not, what procedure should be followed?

7. Suppose a probability sample of students of size 500 or more is to be chosen from the
colleges in Ahmedabad (assuming all the colleges have 500 or more students). Two
sampling schemes are suggested:
(i) Select a college at random and collect data from each student of the selected col-
lege.
(ii) Select a city block at random and collect data from all the college students residing
in the selected block.
Which sampling method would you prefer? Give your justification.

8. In each of the following studies indicate whether the data are collected by observa-
tional study or experimental study.

(i) A car manufacturer has developed a new engine to enhance the mileage of an existing
model of car. The manufacturer finds the mileages of 100 cars manufactured
with the new engine to estimate the average mileage before marketing it.
(ii) The R & D team of a pharmaceutical company administers a newly developed
pain relieving drug to 100 terminally ill patients to get an estimate of the average
number of hours of relief.
(iii) A public interest group tested 100 cell phones of a particular model to estimate
the average number of hours the battery works after full charging.
(iv) A researcher recorded the increase in sugar level in blood of 100 diabetics after
they drank 300 ml of coke.
(v) A researcher collected data from 100 randomly chosen college students in Ahmed-
abad on the average number of hours in a day each of them talks on the cell phone.
(vi) Suppose that, to estimate the prevalence of HIV in India among children aged between
one and five years in 2017, an NGO decides to carry out medical tests on 500000 randomly
chosen children.
(vii) A public interest group tests 100 one-kg packets of Basmati rice of a particular
brand for pesticide content.
(viii) In 2016, Broadcast Audience Research Council India installed meters in ten
thousand households selected by a proper sampling design from all over India to monitor
the TV watching habits of the people living in these households. These meters
recorded who in the family watched which TV programmes in 2016.

9. Suppose ten thousand payment vouchers were generated in 2016 at IIMA. An auditor
checks the vouchers by drawing a probability sample, often called audit sampling.
(i) Why may simple random sampling not be appropriate? What alternative
sampling design or designs might the auditor use?
(ii) Which sampling design would you prefer? Justify it.
(iii) Describe the sampling error (which auditors often call sampling risk) and the
non-sampling errors in this context.

10. Suppose a consignment of 50 sacks of chilli powder (each containing 20 kg, and of
length 36 inches) is to be inspected for mold (a fungus that grows on chilli powder) by
the lab of the Spices Board of India. A sample of 100 portions, each of 50 grams of chilli
powder, is to be selected from the sacks. You may consider each sack to be divided into
six layers, each of length six inches (along its length), and each such layer to be a
sampling unit.
Discuss an appropriate method of sampling in this context. (This is often called sack
sampling.)

Simple Random Sampling
1. Suppose on Aruna's birthday two of her friends decide independently to present her
an Amazon gift cheque. Suppose Amazon gift cheques are of denominations Rs. 500,
Rs. 1000, & Rs. 1500 only. Suppose each friend picks one of the denominations at
random.
(i) Find the probability distribution of the total amount of the gift cheques that they
pick up.
(ii) Find the mean and standard deviation of the total amount.
(iii) Check from the standard formulas whether you are getting the same answers.
(iv) Do (i)-(iii) if her friends decide to select different denominations at random.
(v) Suppose now that each picks a gift cheque of Rs. 500, Rs. 1000 and Rs. 1500 with
probability 0.5, 0.3 and 0.2 respectively. Then do the exercise (i)-(ii) above.
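
One way to check the hand calculations in this exercise is to enumerate the outcomes by machine. The sketch below is illustrative only: the helper names `sum_distribution` and `mean_and_variance` are hypothetical, it assumes the two friends choose independently, and it covers parts (i)-(iii) and (v) but not the "different denominations" case in (iv).

```python
from fractions import Fraction
from itertools import product

def sum_distribution(values, probs):
    """Exact distribution of the sum of two independent picks."""
    dist = {}
    picks = list(zip(values, probs))
    for (v1, p1), (v2, p2) in product(picks, repeat=2):
        dist[v1 + v2] = dist.get(v1 + v2, Fraction(0)) + p1 * p2
    return dist

def mean_and_variance(dist):
    """Mean and variance of a discrete distribution given as {value: prob}."""
    mean = sum(v * p for v, p in dist.items())
    var = sum((v - mean) ** 2 * p for v, p in dist.items())
    return mean, var

values = [500, 1000, 1500]
equal = [Fraction(1, 3)] * 3                                   # parts (i)-(iii)
unequal = [Fraction(5, 10), Fraction(3, 10), Fraction(2, 10)]  # part (v)

dist_equal = sum_distribution(values, equal)
mean_equal, var_equal = mean_and_variance(dist_equal)
dist_unequal = sum_distribution(values, unequal)
mean_unequal, var_unequal = mean_and_variance(dist_unequal)

for total in sorted(dist_equal):
    print(total, dist_equal[total])
print("mean:", mean_equal, "variance:", var_equal)
```

Exact fractions are used so that the enumerated mean and variance can be compared with the formula values without rounding noise.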

2. An alchemist visited the court of a medieval warlord and said, "Your excellency, here
is my tribute to you. I have six envelopes. One of these contains a single copper coin,
another contains two copper coins, while a third one contains three copper coins. The
remaining three envelopes are empty. Kindly pick any three of these six envelopes
at random and without replacement. I shall convert all the coins in the selected
envelopes to gold coins dating from the period of King Solomon; you can imagine
their value as antiques!" "But what happens if I end up picking only the three empty
envelopes?" thundered the warlord. "I shall behead you then." "Take it easy, your
excellency," calmly replied the alchemist. "I am also a sorcerer; in that extreme case,
I shall make seven gold coins for you, again dating from King Solomon's era, simply
from the air." Assume that all the claims of the alchemist were true and that he kept
all his promises (the latter point is natural given the threat about his head!). Let X
be the number of gold coins that the warlord eventually ended up with. Obtain (a)
P(X = 3), (b) P(X = 4), (c) P(X = 5), (d) P(X = 6), (e) P(X = 7), (f) E(X) and (g)
Var(X).
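
The distribution of X can be checked by listing all equally likely selections of three envelopes. This is only a sketch for verifying hand work; the variable names are illustrative, and it assumes every set of three envelopes is equally likely.

```python
from fractions import Fraction
from itertools import combinations

envelopes = [1, 2, 3, 0, 0, 0]   # copper coins in each of the six envelopes

dist = {}
samples = list(combinations(range(len(envelopes)), 3))  # all draws of 3 envelopes
for chosen in samples:
    coins = sum(envelopes[i] for i in chosen)
    gold = coins if coins > 0 else 7   # three empty envelopes: seven coins from the air
    dist[gold] = dist.get(gold, Fraction(0)) + Fraction(1, len(samples))

mean = sum(x * p for x, p in dist.items())
var = sum((x - mean) ** 2 * p for x, p in dist.items())

for x in sorted(dist):
    print(x, dist[x])
print("E(X) =", mean, "Var(X) =", var)
```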

3. A textbook on business statistics contains five chapters. A student, who is not very
serious, takes a simple random sample (without replacement) of three chapters. He
studies these three chapters with some seriousness and completely ignores the remaining
two chapters.

In the final examination, the question paper on this subject consists of five questions,
one from each chapter. The questions from Chapters 1 and 2 are compulsory and
carry 18 and 12 marks respectively. The questions from the other three chapters carry
20 marks each and each student is supposed to answer any one of these three questions
(even if a student answers more than one of these three questions, he/she gets credit
for only one of them). Thus the maximum possible score for any student is 50.

Obviously, the student under consideration gets zero in any question from a chapter
that he had ignored (so he does his best to avoid such a question, if possible). Fur-
thermore, as he is not very serious with his studies, he gets only 50% of the marks in
any question from a chapter that he had included for study. Let T be his score in the
examination.

Obtain the probability distribution of T and hence the expectation and variance of T.

4. Suppose a circus owner had 5 crocodiles to ship from Chennai to Mumbai. The
shipping company agreed to ship them but would charge Rs. 20,000 per 100 kg. Naturally,
they needed to know the total weight of all five crocodiles. Weighing a crocodile is
difficult and expensive. Let us name the crocodiles Jumbo
(J), Kambo (K), Lambo (L), Mambo (M) and Shambo (S). They hired a statistician to
estimate the total weight by weighing only two crocodiles. The statistician proposed
the following procedure.

Step 1. Select two crocodiles at random without replacement.

Step 2. Weigh them, find the mean weight and multiply it by 5.


By following the statistician's procedure, the total weight came out to be 1750 kg. The
manager of the shipping company was not happy with the estimate. After observing
the size of the crocodiles, and from his experience of shipping crocodiles, the manager
felt that the estimate was very low. The statistician, however, claimed that his
estimate was unbiased and that, if the distribution of weight could be assumed to be
normal, it was actually the best among all unbiased estimates.

There was a guy who helped the company in the past for weighing crocodiles. He
could measure the weight of a crocodile by measuring its length and knowing its age.
His error in estimation was always within 10 kg. The manager called the guy. His
estimates of weights (in kg) were: 1000 (J), 600 (K), 500 (L), 400 (M) and 300 (S).

(i) Had the manager accepted the statistician's estimate, what would have
been the minimum loss of the shipping company?
This episode, of course, led the manager to distrust the effectiveness of statistical
estimation theory, and he decided not to call the statistician for consultation
in future.
(ii) As a statistician, how would you have advised the manager in this case?
Incidentally, the estimated weights of the crocodiles by the second method matched
the actual weights. Write down all possible samples of size two. For each sample
find the sample mean and hence the probability distribution of the sample mean. Find
the mean and variance of the distribution of the sample mean. Check whether the
values of the mean and standard deviation of the probability distribution of the sample
mean match the values that you get directly by using the formulas.
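
The enumeration asked for above can be cross-checked with a few lines of code. The sketch below lists all samples of size two and compares the variance of the sample mean with the usual textbook formula for simple random sampling without replacement, Var = (1 - n/N) S^2 / n with S^2 computed using the divisor N - 1; the variable names are illustrative, and the course's own formula sheet may write the same quantity in a different but equivalent form.

```python
from fractions import Fraction
from itertools import combinations

weights = {"J": 1000, "K": 600, "L": 500, "M": 400, "S": 300}  # kg, from the text
N, n = len(weights), 2

samples = list(combinations(weights, n))           # all samples of two crocodiles
means = [Fraction(weights[a] + weights[b], n) for a, b in samples]

# Mean and variance of the sampling distribution (each sample equally likely).
exp_mean = sum(means) / len(means)
var_mean = sum((m - exp_mean) ** 2 for m in means) / len(means)

# Formula-based values for comparison.
pop_mean = Fraction(sum(weights.values()), N)
S2 = sum((w - pop_mean) ** 2 for w in weights.values()) / (N - 1)
var_formula = (1 - Fraction(n, N)) * S2 / n

for pair, m in zip(samples, means):
    print(pair, m, "estimated total:", N * m)
print(exp_mean, var_mean, var_formula)
```

The smallest possible estimate of the total, 5 times the smallest sample mean, corresponds to the 1750 kg reported in the story.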

5. A statistician who belonged to a group of rebels was taken prisoner by the
army of king Juna and produced before the king. The king offered to play a game with
him that might save his life. Six bags of coins labeled B1 to B6 are placed before him.
Each bag contains either gold or silver coins. The statistician has to pick two bags
at random and observe their contents. Based on this information he has to predict the
number of bags containing gold coins. Naturally, as a statistician, his prediction would
be six times the proportion of gold coin bags in the sample. If he predicts correctly
he will be freed; if he errs by 1 bag, he will be imprisoned for 5 years; otherwise he will
be executed.

Suppose the king ordered that two bags of gold coins and four bags of silver
coins be placed.

(i) What are the possible choices of sample of two bags?


(ii) For each choice find the proportion of bags having gold coins.
(iii) Find the probability distribution of sample proportion of bags having gold coins,
and hence the probability distribution of the estimated number of gold bags.
(iv) Find the probabilities of the statistician getting free and getting executed respec-
tively.
(v) Find the mean and standard deviation of the distribution of the estimated num-
ber of gold bags. Check whether the results match with the results obtained from the
formulas.
(vi) What could be the best strategy for the king to maximize the chance of the
statistician's execution?

6. As a promotion strategy, a cell phone company decides to offer a discount of either
Rs. 5000 or Rs. 3000 or Rs. 2000 to the first 10000 customers e-ordering a particular
model on its website. The price of the phone is Rs. 10000. As soon as the customer
places an order, the discount amount will be flashed and deducted from the
price. To decide on the discount to be offered to a customer, the company uses the
following random mechanism. As soon as an order is placed, a digit from 0 to 9
will be selected at random. If the chosen digit is either 0 or 1, the offered discount
will be Rs. 5000; if it is between 2 and 4, the discount will be Rs. 3000; otherwise
the discount will be Rs. 2000.

(i) Find the probability distribution of the price of the phone for a customer (one
among the first 10000) and also find its mean and standard deviation.

(ii) Suppose a couple places an order for two such phones (supposing their orders are
among the first 10000). Find the probability distribution of the total price of the two
phones. Also find its mean and standard deviation.

(iii) Suppose a local cell phone shop owner places orders for 40 cell phones (all are
among the first 10000 orders) using his network of friends and family members.
(a) Find the mean and the standard deviation of the average price of the forty phones.
(b) Find an approximation to its probability distribution.
(c) Also find the approximate probability that the average price is (c1) less than or
equal to Rs. 6000, (c2) more than Rs. 7000 and (c3) between Rs. 6000 and Rs. 8000.

(Hint: Use the central limit theorem for finding an approximation to the distribution
of the average.)
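
The central limit theorem calculation in part (iii) can be sketched as follows. The code is a rough check rather than a model answer: it assumes the ten digits 0 to 9 are equally likely, that "between 2 and 4" includes both endpoints, and that the 40 orders receive independent discounts; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Price after discount and its probability (digits 0-1, 2-4, 5-9 out of ten digits).
price_dist = {5000: 0.2, 7000: 0.3, 8000: 0.5}

mean = sum(x * p for x, p in price_dist.items())
var = sum((x - mean) ** 2 * p for x, p in price_dist.items())

n = 40
# CLT approximation: the average of n prices is roughly normal
# with the same mean and variance var / n.
avg = NormalDist(mu=mean, sigma=sqrt(var / n))

p_at_most_6000 = avg.cdf(6000)
p_above_7000 = 1 - avg.cdf(7000)
p_between = avg.cdf(8000) - avg.cdf(6000)

print(mean, sqrt(var), sqrt(var / n))
print(p_at_most_6000, p_above_7000, p_between)
```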
7. (Application in Statistical Quality Control): A manufacturing process is supposed
to produce capsules containing 400 mg of a chemical, say, C. However, variation in
a manufacturing process is inherent, so the contents of different capsules would not
be identical. Suppose the regulatory authority makes it mandatory that the content
of every capsule should be between 399 mg and 401 mg. To ensure this, the mean
and standard deviation of the contents produced by the manufacturing process are
set at 400 mg and 0.5 mg. The production supervisor knows from his experience
that the standard deviation of the process rarely changes. However, he feels that
continuous monitoring of the process is necessary for checking the stability of the mean
of the process. A consultant suggested that he implement the following procedure.

In every hour during a shift a sample of 100 capsules is to be selected and if the
average content of the sample falls below 399.90 or above 400.10 stop the process and
hunt for the trouble.

(i) Find the probability of a false alarm if this procedure is followed.


(ii) Assuming that the mean has actually shifted to 400.1, find the probability that
the shift will be detected by a sample.
(iii) Find the probability that the change in mean will remain undetected after
inspecting two consecutive samples since the beginning of the morning shift.
(iv) Find the probability that it remains undetected in the first two and gets detected
at the inspection of the third sample.
(v) Suppose the process produces 10000 capsules per hour. What is the expected
number of capsules produced that will violate the norm of the regulatory authority
till the change in mean is detected in an eight hour shift?

(If instead of 100 the sample size is 25, what assumption would be necessary
for the calculation of the above probabilities? Make the assumption and
solve it.)
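
Parts (i)-(iv) reduce to normal tail probabilities for the sample mean. The sketch below assumes a normal approximation for the mean of 100 capsules and independence between hourly samples; the function name `alarm_probability` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

sigma, n = 0.5, 100
lower, upper = 399.90, 400.10
se = sigma / sqrt(n)                      # standard deviation of the sample mean

def alarm_probability(true_mean):
    """P(sample mean falls outside [lower, upper]) when the process mean is true_mean."""
    xbar = NormalDist(mu=true_mean, sigma=se)
    return xbar.cdf(lower) + (1 - xbar.cdf(upper))

false_alarm = alarm_probability(400.0)    # process in control
detect = alarm_probability(400.1)         # mean has shifted to 400.1
miss_two = (1 - detect) ** 2              # undetected in two independent samples
first_at_third = (1 - detect) ** 2 * detect

print(false_alarm, detect, miss_two, first_at_third)
```

Changing `n` to 25 reproduces the variant in the closing note, provided the individual capsule contents themselves can be assumed normal.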
8. (Application in Statistical Quality Control) For assessing the quality of lots sent by
vendors, the quality control department uses a sampling inspection plan to decide
whether to accept or reject a lot. Suppose the department receives lots of size 100 (N).
The sampling inspection plan then selects a sample at random without replacement
from the lot, say, of size 10 (n, to be specified), and if the sample contains, say,
more than 1 (c, to be specified) defective item, the decision is to reject the lot;
otherwise the lot is not rejected. Sampling is often the only option if the testing is
destructive in nature.

For designing sampling inspection plans, the interests of both the consumer and the
vendor should be protected. Since the decision to accept or reject a lot is taken on the
basis of a sample, there is a chance that a good (bad) lot may be rejected (accepted).
In order to avoid rejection of good lots, the vendor imposes a condition like: "if a lot
has 5% (p1) defective items, the chance of rejecting such a lot should not exceed 10%
(VRisk)". Let us call it the vendor's risk. On the other hand, to reduce acceptance of
bad lots, the consumer imposes a condition like, "the chance of accepting a lot with
10% (p2) defective items should not exceed 10% (CRisk)". Let us call it the consumer's
risk.

For illustration, suppose the lot size N = 20, and the sampling plan chosen is given
by n = 5, c = 0 (in other words, draw a sample of size 5; if no defective is found the
lot is accepted, otherwise it is rejected). Suppose you are also given p1 = 5%, VRisk =
10%, p2 = 10%, CRisk = 10%.
(i) Is the above sampling plan able to meet the specified vendor's risk and consumer's
risk?
(ii) If the actual number of defectives in the lot is 4, what is the chance of accepting
such a lot by using the above sampling plan?
(iii) Solve (i) and (ii) with N = 1000, n = 20, c = 2, p1 = 5%, VRisk = 10%, p2 =
10%, CRisk = 10%.
(iv) Solve (iii) when the number of defectives in the lot is equal to 40.
(Use the binomial approximation.)
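
For the small illustration with N = 20, the acceptance probability can be computed exactly from the hypergeometric distribution, since sampling is without replacement. The sketch below is an assumption-laden check, not the prescribed solution: `accept_probability` is a hypothetical helper, and it treats "p% defective" as exactly p% of N items.

```python
from fractions import Fraction
from math import comb

def accept_probability(N, D, n, c):
    """P(at most c defectives in a without-replacement sample of n
    from a lot of N items containing D defectives)."""
    total = comb(N, n)
    return sum(Fraction(comb(D, k) * comb(N - D, n - k), total)
               for k in range(0, min(c, D, n) + 1))

N, n, c = 20, 5, 0
vendor_risk = 1 - accept_probability(N, 1, n, c)   # lot with 5% defectives (1 of 20)
consumer_risk = accept_probability(N, 2, n, c)     # lot with 10% defectives (2 of 20)
accept_with_4 = accept_probability(N, 4, n, c)     # part (ii)

print(float(vendor_risk), float(consumer_risk), float(accept_with_4))
```

The same function also works for N = 1000, where its output can be compared with the binomial approximation suggested for parts (iii) and (iv).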

Mathematical Concepts
1. Justify the statements:

(i) For a parameter $\theta$, it is not possible to have exactly two unbiased estimators.

(ii) For a parameter $\theta$, if $T_n$ is consistent for $\theta$ and the sufficient conditions for
consistency hold for $T_n$, then there exists an infinite number of estimators consistent for
$\theta$. (Though the statement is true even if the sufficient conditions do not hold.)

2. Suppose $X_1$ and $X_2$ represent a random sample of size 2 from an infinite population
with mean $\mu$ and variance $\sigma^2$. Consider unbiased estimators of $\mu$ of the form
$l_1 X_1 + l_2 X_2$, where $l_1$ and $l_2$ are real numbers.

(i) For these estimators to be unbiased, what condition should $l_1$ and $l_2$ satisfy? How
many such estimators exist?

(ii) Among these estimators, find the one which has minimum variance.

(iii) Can the estimator in (ii) be considered the minimum variance unbiased estimator
of $\mu$?

3. Do exercise (2) if $X_1$ and $X_2$ represent a random sample of size 2 from two infinite
populations, both having mean $\mu$ but variances $\sigma_1^2$ and $\sigma_2^2$, respectively.

4. Solve the following problems.

(i) Suppose $x_1, x_2, \ldots, x_n$ represent a random sample of size $n$ from an infinite
population with mean $\mu$ and variance $\sigma^2$. Find an unbiased estimator of $\mu^2$ based
on the sample. (You may assume that the sample mean $\bar{x}_n$ and the sample variance
$s^2 = (n-1)^{-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2$ are unbiased estimators of $\mu$ and $\sigma^2$ respectively.)

(ii) Suppose $x_1, x_2, \ldots, x_n$ represent a random sample of size $n$ of a binary variable $X$ from
an infinite population with population proportion of defectives $p$. Find an unbiased
estimator of $p^2$ based on the sample.

(iii) In problem (i), assuming $\bar{x}_n$ and $s^2$ are consistent for $\mu$ and $\sigma^2$ respectively, find
consistent estimators of $\mu/\sigma$ and $\mu^2$.

(iv) In problem (ii), find consistent estimators of $p(1-p)$, $p^2$ and $1/p$.

5. In a superstore, an employee is assigned the task of estimating the average time spent
by a customer in the store in a day. The company headquarters intends to make this
a part of their daily dashboard. For this, the employee collects a random sample of
10 people during the day and reports the individual time spent by each of them to
his supervisor as and when he gets the data. However, since the supervisor is always
busy, he only remembers the two most recent numbers reported by the employee and so
computes the average based on these. The supervisor reports his computed value to
the company for the dashboard. On the side, the employee also computes the
average time from the data he collected.

(i) Which of the estimators, the supervisor's or the employee's, is unbiased?

(ii) Which of the estimators, the supervisor's or the employee's, is more efficient?

(iii) Suppose, over time, the sample size collected by the employee can be increased,
but suppose the supervisor still continues to use recent two data points. Which of the
two estimators is consistent?

(iv) After tracking the dashboard for one month, suppose the company gave
permission to increase the sample size from 10 to 100, and has been using it as part
of the dashboard for one more month after the change. The supervisor, although
aware of the increase, still uses the two most recent data points reported by the employee
to compute the average and reports it to the company. Is there a way the company will
be able to realize, by looking at the dashboard, that it is not receiving the desired
estimates from the supervisor?

Interval Estimation
1. A sample of 240 observations based on SRS was collected on children aged between
2 and 6 years, who visited a paediatric clinic. The interest was to study the age at
which the children started walking unassisted. The frequency distribution of the data
obtained was as follows. (Assume that the population size can be considered as very
large)

Table 1: Frequency distribution for Age (when unassisted walking started)

Age (months)         9  10  11  12  13  14  15  16  17  18  19  20
Number of children  13  35  44  69  36  24   7   3   2   5   1   1

(i) Estimate the mean age and its standard error.

(ii) Compute the 95% confidence interval based on the CLT approximation. What
is the margin of error?

(iii) Suppose that some researchers want to do another study in a different region
with a population size of 20000 children in the desired age group, and want a
95% confidence interval for the mean age with a margin of error of 0.5. What
sample size should they take? (Assume that they take the sample variance from
the previous study as a proxy for the population variance in the new study.)

(iv) What would be the answer to (iii) if the population in the new region were
also very large?
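
The arithmetic for this problem can be sketched as below. The code assumes the standard large-sample formulas: standard error s/sqrt(n), a normal critical value for the CLT-based interval, and n = n0 / (1 + n0/N) with n0 = z^2 s^2 / e^2 for the finite population case. The helper `sample_size` is hypothetical, and some texts use a slightly different finite population correction, so small differences in rounding are possible.

```python
from math import ceil, sqrt
from statistics import NormalDist

ages = [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
counts = [13, 35, 44, 69, 36, 24, 7, 3, 2, 5, 1, 1]

n = sum(counts)
mean = sum(a * f for a, f in zip(ages, counts)) / n
s2 = sum(f * (a - mean) ** 2 for a, f in zip(ages, counts)) / (n - 1)
se = sqrt(s2 / n)                         # population treated as very large

z = NormalDist().inv_cdf(0.975)           # normal critical value for 95% confidence
margin = z * se
ci = (mean - margin, mean + margin)

def sample_size(s2, margin, z, N=None):
    """Sample size for a target margin of error; N=None means a very large population."""
    n0 = (z ** 2) * s2 / margin ** 2
    if N is not None:
        n0 = n0 / (1 + n0 / N)            # finite population correction
    return ceil(n0)

n_finite = sample_size(s2, 0.5, z, N=20000)
n_large = sample_size(s2, 0.5, z)

print(n, mean, se, ci, n_finite, n_large)
```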

2. In a study to determine the percentage of children that are overdue for vaccination
in a small town, the researcher took a sample out of 580 children served by the only
hospital in town.

(i) What sample size in an SRS (WOR) would be necessary to estimate the
proportion with 95% confidence and a margin of error of 0.107?

(ii) Suppose a sample of 120 was taken, of whom 27 were not overdue for vaccination.
Give a 95% confidence interval for the percentage of children not overdue
for vaccination.

3. Suppose we are interested in estimating the proportion (p) of smokers among the male
students of IIMA. Suppose a sample of size 100 is chosen from the male students, and
the sample proportion of smokers is found to be 0.2. Give an interval based on your
data so that you are 95% confident that the true value of the unknown proportion lies
inside it.

(i) State the assumptions that you have to make for finding this interval. Could you
check these assumptions?

(ii) How would you explain 95% confidence to a layman?

(iii) Suppose a professor of IIMA claims that the proportion is 0.3. Based on a 99%
confidence interval, would you accept this claim?

4. Suppose a new surveillance system is installed on highways to prevent drivers
from speeding. Out of the next 100 days after installation, the number of days with
no major car accident is observed to be 60. Find an interval estimate, with confidence
level 95%, of the proportion of days with no major accident that is expected with the
new system in place. Before the installation of the system, this percentage was 40.

(i) Based on the 95% confidence interval, would you believe that the installation of
the system has led to an improvement?

(ii) State the assumptions that you have made for drawing the conclusion. Could you
check these assumptions?

5. A week before the presidential election in the USA, a news agency wants to predict the
winner (Democrat or Republican) in one particular state. In order to predict, they collected
a random sample of 10000 individuals using an appropriate sampling design, and
found that 5100 of them intend to vote for the Democratic candidate.

(i) Based on the data, construct a 95% confidence interval for the percentage of people
supporting the Democratic candidate. Based on the 95% confidence interval, would you
agree that the Democratic candidate will win?

(ii) Could you claim the same if the confidence level is increased to 99%?

6. A company would like to predict what proportion of people who visit its website will
eventually make a purchase. Last month's data show that, out of 1 million visitors, 10,000
eventually made a purchase.

Could the company claim, based on a 95% confidence interval, that at least 0.5% of
those who visit also make a purchase?

7. Suppose a car manufacturing company claims that the average mileage of model M
is 20 (the distance it can travel on one gallon of gas). The average and the standard
deviation of the mileages for 16 randomly chosen individuals owning a car of model M
are found to be 19.5 and 4, respectively.

(i) Could you state, based on a 95% confidence interval, that the average mileage is
less than 20? Explain.

(ii) State the assumptions that you have made for finding this interval. Could you
check these assumptions?

8. Suppose a new drug for relieving pain is given to 100 randomly chosen terminally ill
patients. The average and the standard deviation of the observed number of hours of
relief are found to be 7.6 and 1.2 respectively. Find a 99% confidence interval for the
population mean. The standard drug is known to give 7.5 hours of relief.

(i) Could you state, based on a 99% confidence interval, that the new drug is better
than the standard one?

(ii) State the assumptions that you have made for finding this interval. Could you
check these assumptions?

9. A vendor sends a lot of 10000 mother boards to a computer manufacturer. The
manufacturer wants to estimate the proportion of defectives in the lot. A sample of
size 100 is chosen and 4 are found to be defective. If, based on a 95% confidence interval
from the sample, the manufacturer finds that the lot proportion of defectives does
not exceed 5%, then the lot will be accepted.

(i) Will the manufacturer accept the lot?

(ii) State the assumptions that you have made for finding this interval. Could you
check these assumptions?

10. Suppose an airline assures its customers that the average check-in time at the counter
should not exceed 15 minutes. A frequent traveller who took 40 flights of the same airline
in the last six months from Sardar Vallabhbhai Patel International Airport observed that
the average and standard deviation of check-in times were 10 and 2 minutes respectively.

(i) Could she reject the claim of the airline company based on a 95% confidence
interval?

(ii) State the assumptions that you have made for drawing your conclusion. Could
you check these assumptions?

11. Suppose a manufacturer of mother boards claims that the average length of life of the
mother boards is at least 10000 hours. The computer manufacturer, who receives the
mother boards from the aforesaid company, subjects 100 such mother boards chosen
at random to an accelerated life testing experiment. The average and the standard
deviation of the failure times are found to be 9600 hours and 300 hours respectively.

(i) Would the computer manufacturer accept the mother board manufacturer's
claim based on a 95% confidence interval?

(ii) State the assumptions that you have made for drawing your conclusion. Could
you check these assumptions?

Hypothesis Testing
1. Lead Contamination: Suppose the Food Safety and Drug Administration (FDA) is supposed
to test whether packaged noodles available in the market under different brands
maintain the safety standards of allowable lead content, which is between 0.01 ppm
and 2.5 ppm. Suppose a consumers' forum claims that a particular brand available in
the market has at least 10% packets that violate the safety standards. Suppose the FDA
decides to take a random sample of 10 packets of the same brand from the market and
test them for lead content in order to verify the claim of the forum. They decide to
reject the claim if no packet is found to violate the safety standard.

(a) Formulate the above decision problem as a hypothesis testing problem in terms
of p, the proportion of packets marketed by the brand that do not comply with the
standard. In other words, giving appropriate reasons, write the hypotheses H0 and
Ha.

(b) Write down the test statistic and its distribution under the null hypothesis.

(c) Find the level of significance of the proposed test.

(d) Find the probability of type I error at p = 0.2, 0.3.

(d) Find the probability of type II error at p = 0.05, 0.01.

(e) Find the power of the test at p = 0.05, 0.01.

(f) Suppose the sample of packets they pick up contains no packet violating the standard.
Find the p-value associated with the test.

2. Heavy Metal Contamination: The permissible limit of lead, a heavy metal, is 0.3
mg per kg of vegetables. Suppose an environmental action group claims that the
cauliflowers grown in Dhapa, located on the eastern fringe of Kolkata, from where most
of the vegetables come to the Kolkata vegetable market, contain lead at least 10%
more than the permissible limit. To test whether the cauliflowers grown in Dhapa
have lead content more than the permissible limit, FDA collected a random sample
of 25 cauliflowers. The sample mean of lead content is found to be 0.27
mg/kg and the sample standard deviation is 0.4 mg/kg.

(a) Formulate the null and alternative hypotheses for this problem with proper reasons.

(b) Write down the test statistic and its distribution under the null hypothesis stating
clearly the assumptions you make.

(c) Specify the values of the test statistic for which you would reject the null hypothesis
at 5% (1%) level of significance.

(d) On the basis of the information P (t < -0.36) = 0.3610 and P( t < -0.38) = 0.3536
(where t follows a t-distribution with 24 df ) answer which options are correct.

(i) P-value is greater than 0.36

(ii) P-value is greater than 0.35

(iii) P-value is less than 0.36

(iv) P-value is less than 0.35

(v) P-value lies between 0.36 and 0.37

(vi) P-value lies between 0.2 and 0.35

(vii) P-value lies between 0.35 and 0.36

(e) Which conclusions stated below are true?

i) Reject H0 at α = 0.40 but not at α = 0.30.

ii) Reject H0 at α = 0.45 and α = 0.4 but not at α = 0.2.

iii) Reject H0 at α = 0.36 and α = 0.37 but not at α = 0.35.

iv) Reject H0 at α = 0.3 but not at α = 0.4 and α = 0.44.

(f) Find a 95% confidence interval of the population mean and interpret.

3. Electronic goods: A dealer in electronic consumer goods received a supply of six


televisions of brand A and ten televisions of brand B. Out of the six televisions of
brand A, an unknown number, M, are defectives. Similarly, the number of defectives
among the ten televisions of brand B is 2M. In order to test the null hypothesis
H0 : M = 2 against the alternative Ha : M > 2, the following procedure is adopted:

(i) From the six televisions of brand A, draw a random sample of size two without
replacement.

(ii) From the ten televisions of brand B, draw a random sample of size two without
replacement.

(iii) Reject if and only if both the televisions in at least one of the two samples are
defectives.

(a) What is the probability of type I error for the above test procedure?

(b) Find the probability of type II error for this procedure when M equals 3.

4. Searching for Butterfly: A biologist, conducting research on butterflies, collects butterflies
in a region. There are two kinds of butterflies in the region, colorful and boring
(that is, not so colorful), their respective proportions being θ and 1−θ. The butterflies
are collected one by one and at random till exactly r colorful butterflies are obtained.
Let X be the number of trials required to achieve this. The population of butterflies
in the region is large so that the successive trials may be assumed to be independent.
In order to test the null hypothesis H0 : θ = 1/2 against the alternative Ha : θ < 1/2,
the following test rule is suggested. Reject H0 if X > c, where c is a suitable constant.

(a) Let r = 1 and c = 3. For what θ does the above test have power 0.6? In case no such
θ exists, just say it does not exist.

(b) Find the p-value when r=2 and X = 4.

5. Emergency Service: A major west coast city in USA provides one of the most com-
prehensive emergency medical services in the world. Operating in a multiple hospital
system with approximately 20 mobile medical units, the service goal is to respond to
medical emergencies with a mean time of 12 minutes or less. The director of medical
services wants to formulate a hypothesis testing problem to determine whether or not
the service goal of 12 minutes or less is being achieved.

(a) Formulate the hypotheses testing problem.

(b) Suppose a random sample of 40 response times is selected from the records and
the sample mean was found to be 13.25 minutes and sample standard deviation was
3.2 minutes. Carry out the test at 5% level of significance by finding the p-value.
State the assumptions you made.

(c) Find an appropriate two-sided confidence interval and carry out the test at 5%
level of significance.

6. Toothpaste Manufacturing: The production line for Glow toothpaste is designed to fill
tubes of toothpaste with a mean weight of 6 ounces. Periodically, a sample of 30 tubes
will be selected in order to check the filling process. Quality assurance procedures call
for the continuation of the filling process if the sample results are consistent with the
assumption that the mean filling weight for the population of toothpaste tubes is 6
ounces; otherwise the filling process will be stopped and adjusted.

(a) Formulate a hypothesis testing problem to help determine when the filling process
should continue operating, and when it should be stopped and corrected.

(b) Assume that a sample of 30 toothpaste tubes provides a sample mean of 6.1 ounces
and standard deviation of 0.2 ounces. Based on the information, carry out a test for
the hypothesis testing problem that you formulated. Also find the p-value.

(c) Find an appropriate 95% two-sided confidence interval of the mean and then carry
out the same test.

7. Catching Speeding Cars: To control speeding on the highway, the state highway au-
thority decides to draw samples of vehicle speeds at a few selected locations on the
highway. The speed limit is 65 mph. Based on the sample from each location, the
hypothesis H0 : µ ≤ 65 is to be tested. The locations where H0 is rejected are to
be selected for putting speedometer. Suppose at Location F, a sample of 16 vehicles
shows a mean speed of 68.2 mph with a standard deviation of 3.8 mph.

(a) Carry out the test of hypothesis at 5% level of significance stating clearly the
assumptions you make.

(b) Find an appropriate two-sided confidence interval and carry out the test at 1%
level of significance.

8. Paint Research: The drying time of a certain type of paint under specified test conditions
is known to be normally distributed with mean value 75 min and standard
deviation 9 min. Chemists have developed a new additive designed to decrease aver-
age drying time. It is believed that drying time with this additive will remain normally
distributed with SD σ = 9. It is desirable that the evidence should strongly suggest
an improvement in average drying time before such a conclusion is adopted.

(a) Formulate a hypothesis testing problem with proper reasons.

(b) Experimental data consists of n = 25 specimens and sample mean is 73 min. Carry
out the test of hypotheses at 5% as well as 1% level of significance.

(c) Find the p-value and hence draw your conclusion.

(d) Find the probability of type I error at µ = 76, and probability of type II error at
µ = 73.

(e) Find an appropriate two-sided confidence interval to carry out the test.

ANOVA
1. Suppose five different forms, say A, B, C, D and E, each consisting of the same 20
questions but in different orders, are created. We want to investigate whether the order
of the questions may have an impact on the scores (a measure of difficulty). From each
section of first year PGP-2017 students, a sample of size 21 is picked, and forms
A, B, C, D, E are administered to students from Sections A, B, C, D, E respectively.
The following data summary is given:

Between Sum of Squares (SSB or SSTR) = 334

Within Sum of Squares (SSW or SSE) = 586.

(i) State the null and the alternative hypotheses that could be tested using the above
information.

(ii) State the assumptions that you would make for performing the test.

(iii) Write down the ANOVA table showing degrees of freedom, sums of squares, mean
squares and F statistic.

(iv) Carry out the test at α = 0.05.

(v) In order to draw a conclusion about the impact of the order of questions, what assumptions
would you make?

2. A retail chain hired a marketing research company for coming up with new marketing
strategies. The marketing research company is interested in understanding the variation
in service quality level from store to store. As a first task, they collected data on 101
randomly chosen customers from each of the five retail outlets in Ahmedabad. Each
one was asked to rate the quality of service during their visit to the store on a scale
of 0-10, with 0 = very poor and 10 = excellent. With these data they would like to
perform a one-way ANOVA on the mean service levels for the five outlets.

(a) State the null and alternative hypotheses (explaining the notations).

(b) The summary of the collected data is given: Between Sum of Squares (SSB or SSTR)
= 1350, Within Sum of Squares (SSW or SSE) = 5600. Based on this information,
complete the ANOVA table.

(c) The degrees of freedom of your test statistic is:

i) 5 ii)101 iii) (5, 500) iv)(4, 500) v)(5, 505)

(d) The conclusion that we can draw based on the above test is:

i) Reject H0 at α = 0.1 but not at α = 0.01 or 0.05.

ii) Reject H0 at α = 0.05 and 0.1 but not at α = 0.01

iii) Reject H0 at α = 0.01 but not at α = 0.05 or 0.1.

iv) Reject H0 at α= 0.1, 0.05 and 0.01.

v) Do not reject H0 at any of the above significance levels.

(e) Based on the decision above, what can we conclude about the service levels at the
five outlets (at α = 0.05)?

i) The outlets differ significantly with regard to the mean service levels.

ii) At least three of the outlets differ significantly with regard to the mean service
levels.

iii) At least one of the outlets differs significantly with regard to the mean service
levels from the others.

iv) There is no significant difference between the outlets with regard to the mean
service levels.

v) We cannot conclude anything definite regarding the mean service levels at the
five outlets.

Regression
Concept/Theory Problems
1. Consider the model $y_i = \beta x_i + \epsilon_i$, where $\epsilon_1, \epsilon_2, \ldots, \epsilon_n \overset{iid}{\sim} N(0, \sigma^2)$, i.e. a model without
an intercept.

(i) Using the least squares method, find the estimator ($b$) of $\beta$.

(ii) What are $E[b]$ and $\mathrm{Variance}(b)$?

2. Consider the simple linear regression model $W_i = \beta_0 + \beta_1 T_i + \epsilon_i$, where $\epsilon_1, \epsilon_2, \ldots, \epsilon_n \overset{iid}{\sim} N(0, \sigma^2)$.
Suppose $W_i$ is the wind speed at location $i$ measured in m/s and $T_i$ is the
temperature measured in degrees Celsius. Let $b_0$ and $b_1$ be estimates obtained for the
model using the data. Suppose you are asked to express the model with temperature
in Fahrenheit and wind speed in km/s. How would you modify the $b_0$ and $b_1$ you
obtained earlier?

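3. Consider the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_1, \epsilon_2, \ldots, \epsilon_n \overset{iid}{\sim} N(0, \sigma^2)$.
Suppose
$$x_i = \begin{cases} 1, & \text{if person is male} \\ 0, & \text{if person is female.} \end{cases}$$
What will be the $\hat{y}_i$ for males? What will be $\hat{y}_i$ for females?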

Data Problems
1. The human resources manager of DataCom Inc. wants to predict the annual salaries
of given employees using the potential explanatory variables in the file Datacom-
Salary.xls. (Note that Gender = 1 denotes "Male"; Departments 1, 2, 3, 4 are codes for
various departments in the company: 1 = Sales, 2 = Purchasing, 3 = Advertising, 4 = Engineering.
See the comment near the heading of each variable in Excel for a brief
description.)

(a) Estimate an appropriate multiple regression equation to predict the annual salary
of a given DataCom employee using all of the data in columns C-H. Check the standardized
residual plots and Q-Q plots to verify model assumptions.

(b) What percentage of variability in salaries is explained by this model?

(c) How would you interpret the regression coefficient of Years of previous experience?
Does the sign make intuitive sense? Assuming the statistical assumptions hold, what
does the P-value for this variable suggest?

(d) According to the estimated model, is there a significant difference between mean
salaries earned by male and female employees at DataCom? If so, how large is the
difference? According to your model, does the difference depend on the values of other
explanatory variables? Construct a 95% confidence interval for this difference. Is the
difference statistically significant at 5% level?

(e) According to the estimated model, is there a difference between mean salaries
earned by employees in the Sales department and those in Engineering? If so, how large
is the difference?

(f) For gender, the variable defined is only w.r.t. the Male indicator. Why isn't a similar
variable added for the Female indicator and also included in the model?

(g) Why does the regression output (see table 5) show coefficients for only 3 departments,
namely 2, 3, 4, and not any coefficient for department 1?

(h) According to the estimated model, what is the predicted salary for a female employee
who has 6 years of previous experience before Datacom, has worked for 4 years
with Datacom but has not supervised any employees, has 4 years of education beyond
high school and is with the Advertising department? Obtain the 95% confidence interval
(using Minitab or otherwise) and the 95% prediction interval. Which interval is wider?
What is the difference between the two intervals in interpretation?

2. A power company located in southern Alabama wants to predict the peak power load
(i.e. the maximum amount of power that must be generated each day to meet demand)
as a function of daily high temperature (X). A random sample of 25 summer days is
chosen and the peak power load and the high temperature are recorded each day. The
file "Power.xls" contains these observations.

a) Create a scatter plot for these data. Comment on the observed relation between Y
and X.

b) Estimate the regression equation between power load and temperature. Interpret
the coefficients.

c) Analyze the standardized residuals. What do you observe? Can you think of a
transformation for temperature that may work better for modeling this data?

d) Use the modified model to predict the peak power on a summer day with a high
temperature of 100 degrees (F).

e) Is it reasonable to use the model to predict the load at a temperature of 59 degrees
(F)?

3. The beta of a stock is found by running a regression with the monthly return on a
market index as the explanatory variable and the monthly return on the stock as the dependent
variable. The beta of the stock is then the slope of this regression line.

a) Why would we usually see that most stocks have positive beta?

b) Explain why a stock with a beta with absolute value greater than one is more
volatile than the market index and a stock with a beta less than one (in absolute
value) is less volatile than the market index.

c) Use the data file stock.xls to estimate beta for each of the four companies listed:
Caterpillar, Goodyear, McDonalds and Ford. Use the S&P500 as the market index.
Which company has the largest beta and which the lowest?

d) For each of these companies, what percentage of variation in returns is explained
by the variation in market index? What percentage is unexplained?

4. The file Salary.xls contains (hypothetical) starting salaries of MBA students directly
after graduation. The file also lists the years of experience prior to the MBA program
and their class rank in the MBA program (on a scale 0-100).

a) Estimate the regression equation with salary as the dependent variable and experience
and class rank as explanatory variables. Interpret the coefficients and $R^2$.
Compute 95% confidence intervals for the coefficients.

b) How would you interpret the confidence interval for Class rank?

c) Create an additional variable which is the product of Experience and Class rank
(such a variable is usually referred to as an "interaction term") and estimate this new
model. Is the interaction variable significant at 5% level?

d) How do you interpret this new equation? Would you expect such interaction to be
present in real data of this type?

Answers to Problem Set
Introduction to Sampling
1. See Table 2

Table 2: Question 1 solution

Problem 1
  Population: Families in Ahmedabad
  Element: Family
  Design: SRSWOR; can involve stratification and cluster sampling by areas/locality etc.
  Frame: All households in Ahd, with details that help identify clusters and strata
  Unit: Household
  Variable: 1/0 whether family has sent any child to private school
  Parameter: % families in Ahd who have sent at least one child to private school
  Estimator: Sample proportion

Problem 2
  Population: College students in Ahmedabad
  Element: Student
  Design: SRSWOR; may involve cluster or stratified sampling based on locality or type of
    college (engg/medical); clusters may involve different days during the time frame of study
  Frame: Colleges in Ahmedabad along with details that help identify strata and clusters
  Unit: College
  Variable: Time spent on social networking site on a given day
  Parameter: Average time spent on social networking sites per day
  Estimator: Sample mean

Problem 3
  Population: Current class VI-VIII students in Gujarat
  Element: Student
  Design: May involve stratification and cluster sampling and then SRSWOR
  Frame: List of Ahmedabad municipal schools
  Unit: School
  Variable: Number of students in class VI-VIII who can read Gujarati; total number of
    students in those classes
  Parameter: Percentage of students of class VI to VIII who can read Gujarati
  Estimator: Sample mean of the percentage of students who can read Gujarati from each
    school, OR ratio estimator: (number of students who can read Gujarati in sample)/
    (number of students in sample)

Problem 4
  Population: Members of IIMA Gymnasium
  Element: Member
  Design: SRSWOR
  Frame: List of all users of gym
  Unit: Member
  Variable: Rating of quality of service
  Parameter: Average rating (or e.g. percentage of people who chose rating as 4 or 5)
  Estimator: Sample mean (or proportion)

Problem 5
  Population: All deliveries that happened in 2016 in private hospitals of Ahmedabad
  Element: Delivery identified by some code, date and hospital
  Design: SRSWOR
  Frame: List of all private hospitals in Ahmedabad
  Unit: Hospital
  Variable: Number of deliveries, number of C-sections
  Parameter: % of deliveries that were C-sections
  Estimator: Ratio estimator: (number of C-sections in sample)/(number of deliveries
    in sample)

Note that when stratification and cluster sampling are involved, the estimates
from individual clusters/strata need to get rolled up to the overall estimator
via weighted averaging.

2. Answer: There is a measurement error involved because, in the CM method, literacy is
checked just by asking the head of the household, whose answer can be biased or wrong. One way
to remove the error is to use the RT method and to train the personnel collecting the
data to administer the test.

3. Answer: The second question hints at something positive about the ban and hence influences
the respondent. By saying the country's representative can figure out what is going
on, they are implying something is perhaps wrong and authorities are looking into it,
which people generally may find hard to disagree with!

4. Answer: This is called the "question-order effect". It may be easy for an American to
agree with question A. However, if that is asked first, he/she may be forced to
agree with B. A similar effect will occur if asked in the reverse order, i.e. the answer to the second
question is forced because you have already expressed a view in the first question.
Typically such an issue can be avoided by asking a more general question first, followed
by specifics. For example: "Do you think a country should allow reporters from other
countries to come in and report news back to their home countries?" Then follow up
with specific questions.

5. Answer: A random sample from MDP participants is convenient to obtain. However,
it may not represent the population of HR executives. Since only selected executives
are sent to MDP at IIMA, there will be a sampling bias. The second approach has
potential to get a more representative sample. However, there may be issues with
non-response. A better way to collect data here may be to first create a frame of
companies that are of interest and sample the companies; once you reach the companies,
a frame of executives can be created and then a sample collected out of those. In
general, it is difficult to collect a truly random sample. Convenience and judgement
may need to be used while collecting the data.

6. Answer: Suppose there are unequal numbers of students per college, $N_1, N_2, \ldots, N_{10}$.
Let $N = N_1 + N_2 + \cdots + N_{10}$ denote the total number of students. For SRS, the probability of
picking a student should be $\frac{1}{N}$, i.e. the same for all students. However, in (i) the probability
is $\frac{1}{10} \times \frac{1}{N_i}$ if the student is from college $i$. For SRS, a student needs to be picked randomly
from a list of all students obtained by pooling all colleges.
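
The unequal selection probabilities in answer 6 can be made concrete with a small sketch. The college sizes below are hypothetical (they are not part of the problem set); the point is only that the two-stage scheme in (i) matches SRS for a college exactly when that college holds N/10 students.

```python
from fractions import Fraction

# Hypothetical college sizes N_1..N_10 (illustrative only, not from the problem set).
sizes = [200, 500, 150, 800, 300, 250, 1000, 400, 100, 300]
N = sum(sizes)

# SRS from the pooled list: every student is picked with probability 1/N.
p_srs = Fraction(1, N)

# Scheme (i): pick one of the 10 colleges uniformly, then one student within it.
p_two_stage = [Fraction(1, 10) * Fraction(1, n_i) for n_i in sizes]

for i, (n_i, p) in enumerate(zip(sizes, p_two_stage), start=1):
    print(f"college {i:2d}: N_i = {n_i:4d}, scheme (i) p = {float(p):.6f}, SRS p = {float(p_srs):.6f}")

# Scheme (i) reproduces SRS only if every college has exactly N/10 students.
matches_srs = all(p == p_srs for p in p_two_stage)
print("scheme (i) is an SRS for these sizes:", matches_srs)
```

Students in small colleges are over-represented under scheme (i) and students in large colleges are under-represented, which is the bias the answer describes.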

7. Answer: Clearly (i) is more convenient. However, it will work only if the heterogeneity
that exists within any college is similar to what exists in the city and also if variation
across colleges is not much. Method (ii) is preferable over (i) if there is variation across
colleges w.r.t. the variable of interest, since it would look at more than one college and
capture some variation that exists across colleges.

8. Answer: (i) Experimental (E) , (ii) E, (iii) Observational (O), (iv) E, (vi) O (vii) O
(viii) O

9. Answer: Since some payments may be large and some small, it is possible that SRS
may just result in looking at perhaps only small payments. To avoid such a situation,
one may do stratified sampling by size of payment. After stratification, it may be
preferred to do systematic sampling rather than SRS, again to avoid similar sized
vouchers getting selected. Sampling error would be present because the audit cannot
be done on all the vouchers but only on a representative sample. Non-sampling errors
here would mainly include any errors made in the audit process itself, since it may not
be possible to do it perfectly on every voucher.

10. Answer: First, we need 100 samples from 50 sacks. So it is better to take 2 samples
from each sack to cover the heterogeneity that may exist across sacks. While selecting
the 2 samples from a sack, it is important to ensure that the 2 samples are not close
to each other. A systematic sampling scheme may be better here, where we choose
one layer in the sack at random and then choose the next layer that is 3 layers away,
e.g. if I choose one sample from layer 1, then choose the second sample from layer 4. If
the first chosen layer is 5, then choose the next layer as 1.

Simple Random Sampling


1. Answer (ii) mean=2000, sd=577.35 (iv) Mean=2000 SD=408.25 (v) Mean=1700,
SD=552.268.

2. Answer: (a) 0.3, (b) 0.15, (c) 0.15, (d) 0.05, (e) 0.05, (f ) 3.35, (g) 2.6275

3. Answer: The possible values of T are 10, 16, 19 and 25 with respective probabilities
0.1, 0.3, 0.3 and 0.3; E(T) = 19, V(T) = 21.6

4. Answer: (i) 200000 (ii) A statistical estimate should at least be accompanied by its standard
error to show the uncertainty in the estimate.

5. Answer Best strategy: two gold and four silver or four gold and two silver.

6. (i) Mean = 7100, SD = 1135.78 (approx). (ii) For Average: Mean = 7100, SD =
803.12 (approx). (iii) a) Mean=7100, sd=179.6, b) Normal with mean, sd as in (a),
c) : c1) 0 (approx)(c2) 0.71 (approx) (c3) 1 (approx)

7. Answer: (i) 0.0455 (ii) 0.5 (iii) 0.25 (iv) 0.125 (v) 992 (approx. Assuming production
stops after the 8 hour shift)

8. Answer: (i)No (Probability = 0.25) No (Probability = 0.55) (ii) 0.28

Mathematical Concepts?
1. (i) If $\hat{\theta}_1$ and $\hat{\theta}_2$ are unbiased estimators, then so is $\lambda\hat{\theta}_1 + (1-\lambda)\hat{\theta}_2$ for any $0 < \lambda < 1$. So,
there cannot be exactly two unbiased estimators, because given two, we can construct
infinitely many other unbiased estimators. (ii) Think of factors to multiply by whose limit
is 1, e.g. $\frac{n}{n+1}$.

2. (i) $l_1 + l_2 = 1$, infinitely many (ii) Minimize $l_1^2 + l_2^2$ subject to $l_1 + l_2 = 1$: $l_1 = 0.5$, $l_2 = 0.5$.
(iii) It is the minimum variance unbiased estimator among all unbiased estimators of the
form $l_1 x_1 + l_2 x_2$.

3. $l_1 = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}$

4. (i) $\bar{x}^2 - \frac{s^2}{n}$ (ii) $\frac{\hat{p}^2 - \bar{x}_n / n}{1 - \frac{1}{n}}$, where $\hat{p} = \bar{x}_n$ (iii) $s$, $\bar{x}_n^2$ (iv) replace $p$ with $\hat{p} = \bar{x}_n$.

5. (i) Both are unbiased (ii) Employee's is more efficient (iii) Employee's estimator is
consistent. Supervisor's estimator is still based on only 2 data points and is not
changing with n (iv) For the employee's numbers the SE would decrease by a factor
of $\frac{1}{\sqrt{10}}$. However, for the supervisor's numbers the variability will not change. If
the company had been receiving data from the employee, they would expect to see a large
decrease in standard deviation of the dashboard values from "before" to "after" the
sample size increase. They can detect the issue when they see that the change in
variation is not as expected.

Interval Estimation
1. (i) 12.079, SE=0.1243 (ii) [11.836, 12.323] , margin of error =0.2435 (iii) ≈ 57 (iv)
≈ 57
2. (i) ≈ 74 (ii) [.15, .3]

3. (i) [.122, .278] We are assuming the normal approximation holds, and also that the population
is large. One rule of thumb is min(np̂, n(1 − p̂)) > 10; here it holds. (ii) If a large number
of samples are drawn and a CI is constructed from each using the same method, then about 95% of them
should contain the true value. Essentially, while we know each sample would give a
different confidence interval, we know that in about 95% of such cases the true value will be
contained in it. (iii) 99% CI is [.097, .303]. Yes, since this interval contains 0.3, there
is no reason to disagree with the professor's claim at 99% confidence.

4. (i) 95% CI [.504, .696]. Since 0.40 is below and not contained in this interval, it
suggests a significant improvement. (ii) CLT is assumed (min(np̂, n(1 − p̂)) > 10
holds)

5. (i) 95% CI [.5002, .5198]. We would conclude the democrat would win since .50 is outside
the interval (although at the border). (ii) However, at 99% confidence, i.e. based on the
99% CI [.497, .523], we cannot agree that the democrat wins.

6. 95% CI [.0098, .0102] Yes, because .005 is clearly outside and below this interval.

7. (i) Using $t_{.025} = 2.13$, 95% CI = [17.37, 21.63]. We cannot conclude that the mileage
has significantly changed from 20 based on the 95% interval, since 20 is contained in it.

8. (i) 99% CI [7.285, 7.915]. 7.5 is contained in this interval. So, the new drug is not
significantly different from the old drug. (ii) We assumed a t distribution for the
t-statistic, which holds when underlying population values are approximately normal.
Also, we are assuming the population of terminally ill patients under study is large.
One way to check these assumptions is to look at the histogram of the data and study
how close it is to normal, in terms of skewness etc.

9. (i) The 95% confidence interval is 0.04 +/- 0.038, i.e. [0.002, 0.078]. Since the rejection
limit of 0.05 lies within the interval, the lot has to be rejected. (ii) We are assuming the
normal approximation for p̂. Here np̂ = 4 < 10, so the normal approximation may
not be very good. (In principle, one could use the binomial distribution, i.e. Number
defective ∼ Bin(100, p))
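
As a hedged illustration of answer 9, the sketch below recomputes the normal-approximation (Wald) interval for p̂ = 0.04 with n = 100, the values implied by np̂ = 4 and Bin(100, p) above, and evaluates the min(np̂, n(1 − p̂)) > 10 rule of thumb. The function name `wald_interval` is an illustrative choice, not something defined in the problem set.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(p_hat, n, level=0.95):
    """Normal-approximation CI for a proportion: p_hat +/- z * sqrt(p_hat(1-p_hat)/n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin

n, defectives = 100, 4          # implied by n * p_hat = 4 and Bin(100, p)
p_hat = defectives / n
low, high, margin = wald_interval(p_hat, n)
print(f"95% CI: {p_hat:.2f} +/- {margin:.3f} = [{low:.3f}, {high:.3f}]")

# Rule of thumb for the normal approximation to be trusted.
rule_ok = min(n * p_hat, n * (1 - p_hat)) > 10
print("min(n*p_hat, n*(1-p_hat)) > 10:", rule_ok)
```

Because the rule of thumb fails here, the interval should be read with caution, which is the caveat the answer raises.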

10. (i) 95% CI is [9.36, 10.639]. Since 15 is clearly outside the interval, we can say that the
average check-in time is indeed lower. (ii) We have assumed a t-distribution for the
t statistic. We are assuming that the check-in times are approximately normally
distributed. Such an assumption can be checked by studying the histogram, skewness etc.
of the sampled observations.

11. (i) 95% CI is [9540, 9659]. 10000 is clearly outside and higher than the interval. So the
company would not accept the claim. (ii) We are assuming that the t-statistic can be
approximated by a t-distribution, which holds when underlying population values are
approximately normal. One way to check these assumptions is to look at the histogram
of the data and study how close it is to normal, in terms of skewness etc.

Hypothesis Testing
1. (a) H0 : p = 0.1 Ha : p <0.1. Think which error is more serious. (b) X (Number of
packets in the sample violating the standard), Binomial distribution with parameters
n = 10, p = 0.1. (c) 0.3487 (d) 0.4013, 0.0956 (e) 0.5987, 0.9044 (f ) 0.3487.
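
The numbers in answer 1 all come from one quantity, P(X = 0) = (1 − p)^10, evaluated at different values of p. The following is a minimal sketch of that calculation; `reject_probability` is a hypothetical helper name.

```python
def reject_probability(p, n=10):
    """P(X = 0) for X ~ Binomial(n, p); the test rejects H0 only when no packet violates."""
    return (1 - p) ** n

# Level of significance: rejection probability at the boundary p = 0.1.
level = reject_probability(0.1)
print(f"level of significance = {level:.4f}")

# Under the alternative (p < 0.1), power is the rejection probability
# and the type II error probability is its complement.
power = {p: reject_probability(p) for p in (0.05, 0.01)}
type_ii = {p: 1 - pw for p, pw in power.items()}
for p in (0.05, 0.01):
    print(f"p = {p}: power = {power[p]:.4f}, type II error = {type_ii[p]:.4f}")

# Observing X = 0 gives p-value P(X <= 0 | p = 0.1), which equals the level here.
p_value = reject_probability(0.1)
```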

2. (a) H0 : µ ≥ 0.3 Ha : µ < 0.3. Think, which error is more serious. (b) $t = \sqrt{n}(\bar{x} - 0.3)/s$;
t follows a t distribution with 24 df under the assumption that the
population distribution is normal. (c) t ≤ −1.71 (t ≤ −2.49) (d) Observed t = -0.375,
P-value = P (t ≤ −0.375), which is less than 0.3610 and more than 0.3536. Now verify
the statements one by one. (e) Note the P-value lies between 0.3536 and 0.3610. Thus if the
P-value is less than or equal to α, you reject, otherwise you don't. Now check the statements.
(f) $\bar{x} \pm t_{.025,24}\,\frac{s}{\sqrt{n}}$, which is [0.1049, 0.4351]. If this confidence interval is found for a

large number of (say 100) independent random samples, on the average 95% of them would contain the true population mean.
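
A short sketch of the arithmetic behind answer 2 follows. The Python standard library has no t-distribution, so the critical values 1.711 and 2.064 (24 df) are hard-coded from a t-table; that is an assumption of this sketch, and with SciPy available one could compute them instead.

```python
from math import sqrt

n, xbar, s, mu0 = 25, 0.27, 0.4, 0.3

# Critical values for 24 df taken from a t-table (assumed, not computed here).
T_ONE_SIDED_05 = 1.711   # P(t > 1.711) = 0.05
T_TWO_SIDED_05 = 2.064   # P(t > 2.064) = 0.025

# Test statistic t = sqrt(n) * (xbar - mu0) / s.
t_obs = sqrt(n) * (xbar - mu0) / s
reject_at_5pct = t_obs <= -T_ONE_SIDED_05
print(f"observed t = {t_obs:.3f}, reject H0 at 5%: {reject_at_5pct}")

# The problem gives P(t < -0.36) = 0.3610 and P(t < -0.38) = 0.3536,
# so the p-value P(t <= t_obs) is bracketed by these two numbers.
p_value_bounds = (0.3536, 0.3610) if -0.38 < t_obs < -0.36 else None
print("p-value lies in:", p_value_bounds)

# 95% confidence interval for the mean.
half_width = T_TWO_SIDED_05 * s / sqrt(n)
ci = (xbar - half_width, xbar + half_width)
print(f"95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```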

3. (a) 43/225 (b) 8/15
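
The fractions in answer 3 can be checked with exact arithmetic. The sketch assumes the two samples are drawn independently of each other; the helper names are illustrative.

```python
from fractions import Fraction
from math import comb

def both_defective(total, defectives):
    """P(both items in a size-2 sample without replacement are defective)."""
    return Fraction(comb(defectives, 2), comb(total, 2))

def reject_probability(m):
    """P(at least one of the two samples consists of two defectives) given M = m."""
    p_a = both_defective(6, m)        # brand A: m defectives among 6
    p_b = both_defective(10, 2 * m)   # brand B: 2m defectives among 10
    return 1 - (1 - p_a) * (1 - p_b)

type_i = reject_probability(2)        # reject H0 when M = 2 is true
type_ii = 1 - reject_probability(3)   # fail to reject when M = 3
print("P(type I error) =", type_i)
print("P(type II error at M = 3) =", type_ii)
```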

4. (a) Power at θ is $(1 - \theta)^3 = 0.6$, so θ = 0.157. (b) 0.5.
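
Both parts of answer 4 rest on the same observation: X > c happens exactly when fewer than r colorful butterflies appear in the first c trials. The sketch below uses that binomial form; `prob_x_exceeds` is a hypothetical helper name.

```python
from math import comb

def prob_x_exceeds(c, r, theta):
    """P(X > c): fewer than r successes in the first c independent trials."""
    return sum(comb(c, k) * theta**k * (1 - theta) ** (c - k) for k in range(r))

# (a) r = 1, c = 3: power(theta) = (1 - theta)^3; solve (1 - theta)^3 = 0.6.
theta_star = 1 - 0.6 ** (1 / 3)
power_at_theta_star = prob_x_exceeds(3, 1, theta_star)
print(f"theta = {theta_star:.3f}, power = {power_at_theta_star:.3f}")

# (b) r = 2, observed X = 4: p-value = P(X >= 4 | theta = 1/2) = P(X > 3 | theta = 1/2).
p_value = prob_x_exceeds(3, 2, 0.5)
print("p-value =", p_value)
```

The solution θ ≈ 0.157 is below 1/2, so it lies in the alternative region where power is defined.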

5. (a) H0 : µ ≥ 12 Ha : µ < 12. (b) 0.9932 (Assumption: the sample size is at least
30, and we are assuming it is appropriate to use the normal distribution to approximate
the distribution of t) (c) The 90% confidence interval is [12.4177, 14.0823]. The value
of µ specified under H0 (=12) is less than the upper limit, so we cannot reject the null
hypothesis at 5% level
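
The normal-approximation calculation in answer 5 can be reproduced with the standard library alone. This is a sketch under the answer's own assumption that the t statistic is approximately standard normal for n = 40.

```python
from math import sqrt
from statistics import NormalDist

n, xbar, s, mu0 = 40, 13.25, 3.2, 12.0
std_normal = NormalDist()

# Standard error and observed test statistic.
se = s / sqrt(n)
z_obs = (xbar - mu0) / se

# Ha: mu < 12, so the p-value is the lower-tail probability.
p_value = std_normal.cdf(z_obs)
print(f"z = {z_obs:.4f}, p-value = {p_value:.4f}")

# A one-sided 5% test pairs with a two-sided 90% confidence interval.
z90 = std_normal.inv_cdf(0.95)
ci = (xbar - z90 * se, xbar + z90 * se)
print(f"90% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")

# H0 is rejected only if mu0 lies above the upper limit of the interval.
reject = mu0 > ci[1]
print("reject H0 at 5%:", reject)
```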

6. (a) H0 : µ = 6 Ha : µ ≠ 6. (b) Observed t = 2.7386, thus the null hypothesis is
rejected at 5% (1%) level of significance. P-value = 2 × P (t ≥ 2.73) = 2 × 0.0031 =
0.0062 (We are assuming that it is appropriate to use the CLT approximation.) (c) The normal
distribution based 95% confidence interval of µ is [6.0284, 6.1716] using CLT. Note that
the value of µ specified under the null hypothesis is outside the interval. We reject the
null hypothesis at 5% level of significance.

7. (a) Assumptions: Sample is randomly selected. The population distribution is normal.
Observed value of $t = 3.3684 > t_{15,.05} = 1.75$, thus H0 is rejected at 5% level of significance.
(b) 98% confidence interval: $\bar{x} \pm t_{15}(.01)\,\frac{s}{\sqrt{n}} = 68.2 \pm 2.6 \times \frac{3.8}{\sqrt{16}} = [65.73, 70.67]$.
The hypothetical value of µ under H0 is 65, which is less than the lower limit, so the
null hypothesis is rejected at 1% level.

8. (a) H0 : µ = 75 Ha : µ < 75. Alternative is a research hypothesis. (b) Observed value
of $z = \sqrt{25}\,\frac{(\bar{x} - 75)}{\sigma} = (5 \times (-2))/9 = -1.1111 > -1.645$ ($-2.33$ at 1%), thus H0 cannot
be rejected at 5% as well as 1% level. (c) P-value = 0.1388. (d) P(Type I error | µ =
76) = 0.0044. P(Type II error | µ = 73) = 0.8296. (e) To carry out the test at 5%
level we find a 90% confidence interval [70.0390, 75.9610]. Note that the hypothetical
value of µ under H0 is 75, which is included in the 90% confidence interval. So we cannot
reject H0 at 5% level.

ANOVA
1. (i) H0 : µA = µB = · · · = µE , Ha : Not all means are equal. (ii) Sample from
each section represents a random sample from a normal population. The means of
different populations corresponding to different sections may differ but all have the
same variance. The samples from different sections are independent. (iii) see Table
3. (iv) Observed F > critical value F at (4,100) dof (shown in the table), thus H0 is
rejected at 5% level. (v) We need to assume that the student populations in different
sections are similar in terms of the scholastic aptitude being tested.

Table 3: ANOVA table for Question 9

Source                 SS     df     MS      F       F4,100 (α = 0.05)
Between Forms (SSB)    334    4      83.5    14.25   2.4626
Within Forms (SSW)     586    100    5.86
Total (SST)            920    104
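
The entries of Table 3 follow mechanically from the two sums of squares, the number of groups and the common group size. A minimal sketch is given below; `one_way_anova` is a hypothetical helper, and the critical value 2.4626 is simply the one quoted in the table rather than computed.

```python
def one_way_anova(ssb, ssw, groups, n_per_group):
    """Build one-way ANOVA table entries for a balanced design."""
    df_between = groups - 1
    df_within = groups * (n_per_group - 1)
    msb = ssb / df_between
    msw = ssw / df_within
    return {
        "df_between": df_between,
        "df_within": df_within,
        "df_total": df_between + df_within,
        "MSB": msb,
        "MSW": msw,
        "F": msb / msw,
        "SST": ssb + ssw,
    }

# Question 1: five forms, 21 students per section.
forms = one_way_anova(ssb=334, ssw=586, groups=5, n_per_group=21)
F_CRIT_4_100 = 2.4626  # critical value at alpha = 0.05, as quoted in Table 3
reject = forms["F"] > F_CRIT_4_100

for key, value in forms.items():
    print(f"{key}: {value}")
print("reject H0 at 5%:", reject)
```

Calling the same helper with `ssb=1350, ssw=5600, groups=5, n_per_group=101` gives the (4, 500) degrees of freedom and F ≈ 30.13 used in Question 2.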

2. Ans: a) H0 : µA = µB = · · · = µE , Ha : Not all means are equal. b) see Table 4. c)
(4, 500) d) Note F4,500 (.05) = 2.3898, F4,500 (.01) = 3.3568, F4,500 (0.1) = 1.9561. H0
is rejected at all levels considered, so (iv) is correct. e) Clearly the mean service levels
are significantly different for different outlets. Thus (i) and (iii) are correct statements.

Table 4: ANOVA table for Question 10

Source                   SS      df     MS      F       F4,500 (α = 0.05)
Between Outlets (SSB)    1350    4      337.5   30.13   2.3898
Within Outlets (SSW)     5600    500    11.2
Total (SST)              6950    504

Regression
Concept/theory problems
1. (i) $b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$, (ii) $E[b] = \beta$, $V(b) = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2}$
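
The no-intercept estimator b = Σ x_i y_i / Σ x_i² can be checked numerically. The data below are made up for illustration; the sketch computes b and confirms that nudging it in either direction increases the residual sum of squares.

```python
def slope_through_origin(x, y):
    """Least squares estimate of beta in y_i = beta * x_i + e_i (no intercept)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def sse(beta, x, y):
    """Residual sum of squares for a candidate slope."""
    return sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical data (not from the problem set).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

b = slope_through_origin(x, y)
print(f"b = {b:.4f}")

# b should minimise the SSE: any nearby slope does worse.
is_minimum = sse(b, x, y) < min(sse(b - 0.01, x, y), sse(b + 0.01, x, y))
print("b minimises SSE locally:", is_minimum)

# For a given error variance sigma^2, Var(b) = sigma^2 / sum(x_i^2).
sigma2 = 0.04  # assumed value, for illustration only
var_b = sigma2 / sum(xi * xi for xi in x)
print(f"Var(b) = {var_b:.6f}")
```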
 
2. The new intercept will be $\frac{b_0}{1000} - \frac{32 \times 5 \times b_1}{1000 \times 9}$ and the new slope will be $\frac{b_1 \times 5}{1000 \times 9}$.
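
One way to convince oneself of the unit-conversion result is to refit the model on converted data and compare. The sketch below uses made-up temperature and wind-speed values and a hand-written simple OLS; substituting T_C = (T_F − 32) × 5/9 and dividing W by 1000 gives slope b1 × 5/(1000 × 9) and intercept b0/1000 − 32 × 5 × b1/(1000 × 9).

```python
def ols(x, y):
    """Simple linear regression by least squares; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical data: temperature in Celsius, wind speed in m/s.
temp_c = [10.0, 15.0, 20.0, 25.0, 30.0]
wind_ms = [3.2, 4.1, 4.9, 6.3, 6.8]
b0, b1 = ols(temp_c, wind_ms)

# Convert the data, then refit: Fahrenheit and km/s.
temp_f = [32 + 9 * t / 5 for t in temp_c]
wind_kms = [w / 1000 for w in wind_ms]
new_b0, new_b1 = ols(temp_f, wind_kms)

# Coefficients predicted by the conversion formulas.
expected_b1 = b1 * 5 / (1000 * 9)
expected_b0 = b0 / 1000 - 32 * 5 * b1 / (1000 * 9)

print(f"refit:   intercept = {new_b0:.8f}, slope = {new_b1:.8f}")
print(f"formula: intercept = {expected_b0:.8f}, slope = {expected_b1:.8f}")
```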

3. ŷi for any male would be the average of all yi in the data that are for males and ŷi for
any female would be the average of all yi in the data that are for females. (check by
least squares method)
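
The least-squares check can be sketched numerically. The example assumes a model with an intercept and a single 0/1 male indicator, and uses made-up data. The fitted value for each group then equals that group's sample mean.

```python
def ols(x, y):
    """Simple linear regression; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical data: male = 1 for males, 0 for females.
male = [1, 1, 0, 0, 1, 0]
y = [50.0, 54.0, 61.0, 59.0, 52.0, 63.0]

b0, b1 = ols(male, y)
fitted_female = b0        # prediction when male = 0
fitted_male = b0 + b1     # prediction when male = 1

mean_female = sum(v for m, v in zip(male, y) if m == 0) / male.count(0)
mean_male = sum(v for m, v in zip(male, y) if m == 1) / male.count(1)

print(f"fitted female = {fitted_female}, mean female = {mean_female}")
print(f"fitted male   = {fitted_male}, mean male   = {mean_male}")
```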

Data problems
1. (a) See Table 5. (b) 85.3%. (c) If all other variables are held fixed, an extra year of
previous experience is associated with a decrease in salary of 72.8 dollars. No, the sign
is the reverse of what we would expect. However, the large p-value (0.72) suggests that
the effect of past experience is statistically insignificant. (d) Male employees, on
average, earn $2,040 less than female employees (assuming that they are comparable
otherwise). This value is independent of the values of the other variables. 95% CI
[-4976, 895]. Since the p-value (0.17) is larger than 5%, this is not a statistically
significant difference at the 5% level. (e) Employees in the Sales department earn, on
average, about $8096 less than employees in the engineering department. This is
significant at the 5% level. (f) In the presence of an intercept, we cannot include both
male and female indicators, as this would create perfect collinearity and redundancy
among the variables (the male and female indicators always add up to 1). (g) Since
department is a categorical variable, it is included in the model via 1/0 indicator
variables, one for each department. However, including all 4 indicators would create
redundancy in the presence of the intercept. Hence, only three of them are in the
model. The interpretation of each coefficient is with respect to Department 1 as the
base, e.g. Department 2 earns 8455 more than Department 1 on average, when all other
variables are held fixed. (h) fit = 32941.34, 95% CI [28604.44, 37278.24], prediction
interval = [22569.41, 43313.28]. The CI is for the expected value (average) of all
salaries at the given levels of the explanatory variables. The prediction interval is a
range within which we can expect to see individual salaries at the given levels of the
X variables.
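
The t values in Table 5 are each estimate divided by its standard error. The sketch below recomputes them from the table. It also backs out the critical value implied by the reported 95% CI for Gender, since the residual degrees of freedom are not given here.

```python
# (estimate, standard error, reported t value) from Table 5.
table5 = {
    "(Intercept)": (19313.19, 2518.57, 7.67),
    "Years.Previous.Experience": (-72.80, 198.39, -0.37),
    "Years.Employed": (709.45, 120.96, 5.87),
    "Years.Education": (1544.52, 338.22, 4.57),
    "Gender": (-2040.25, 1448.97, -1.41),
    "Departments2": (8455.63, 2288.73, 3.69),
    "Departments3": (5049.08, 2333.00, 2.16),
    "Departments4": (8096.05, 1830.64, 4.42),
    "Number.Supervised": (130.17, 81.68, 1.59),
}

# t ratio = estimate / standard error.
t_ratios = {name: est / se for name, (est, se, _) in table5.items()}

for name, t in t_ratios.items():
    print(f"{name:28s} t = {t:6.2f} (reported {table5[name][2]:.2f})")

# Critical value implied by the reported upper CI limit for Gender (895).
gender_est, gender_se, _ = table5["Gender"]
implied_t_crit = (895 - gender_est) / gender_se
print(f"implied 95% critical value = {implied_t_crit:.3f}")
```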

Table 5: Datacom regression coefficients

                             Estimate   Std. Error   t value   P-value
(Intercept)                  19313.19      2518.57      7.67      0.00
Years.Previous.Experience      -72.80       198.39     -0.37      0.72
Years.Employed                 709.45       120.96      5.87      0.00
Years.Education               1544.52       338.22      4.57      0.00
Gender                       -2040.25      1448.97     -1.41      0.17
Departments2                  8455.63      2288.73      3.69      0.00
Departments3                  5049.08      2333.00      2.16      0.04
Departments4                  8096.05      1830.64      4.42      0.00
Number.Supervised              130.17        81.68      1.59      0.12

2. (a) The scatter plot shows a nonlinear relation. (b) See Table 7. We first estimate a
simple linear regression model involving one explanatory variable, Daily High Temperature.
This estimated model reveals that as the daily high temperature rises by one degree,
the peak power load increases by 1.976 megawatts. (c) The plot of the residuals indicates
that there is a nonrandom curved pattern. This suggests estimating a quadratic model,
using a squared term, i.e. $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$.
(d) 154.049. (e) No, that would be outside the range of the data used to develop the model.

Table 6: Power company regression coefficients

                          Estimate   Std. Error   t value   P-value
(Intercept)                 -47.39        15.67     -3.02      0.01
Daily.High.Temperature        1.98         0.18     11.13      0.00

Table 7: Power company regression coefficients after quadratic

                          Estimate   Std. Error   t value   P-value
(Intercept)                 385.05        55.17      6.98      0.00
Daily.High.Temperature       -8.29         1.30     -6.38      0.00
squareTemp                    0.06         0.01      7.93      0.00
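
To see what the quadratic fit implies, the sketch below evaluates it using the rounded coefficients from Table 7. Because the table rounds to two decimals, predictions from this sketch are only approximate and will not reproduce the answer to (d) exactly. The function names are illustrative.

```python
# Rounded coefficients from Table 7 (two decimals only, so results are approximate).
B0, B1, B2 = 385.05, -8.29, 0.06

def predicted_load(temp):
    """Predicted peak load at a given daily high temperature."""
    return B0 + B1 * temp + B2 * temp ** 2

def marginal_effect(temp):
    """Change in predicted load per extra degree at this temperature."""
    return B1 + 2 * B2 * temp

# Temperature at which the fitted parabola reaches its minimum.
turning_point = -B1 / (2 * B2)

print(f"turning point = {turning_point:.2f} degrees")
for temp in (60, 80, 90):
    print(f"temp {temp}: load = {predicted_load(temp):.2f}, "
          f"marginal effect = {marginal_effect(temp):.2f}")
```

Unlike the linear model, where each extra degree always adds the same load, the effect of one more degree here grows with temperature.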

3. a) We would expect most stocks to be positively correlated with the market. As
the market goes up, we would expect the stock price to go up as well, at least on
average. b) The beta is the expected change in the stock return when the market
return increases by 1 unit. So if beta is greater than 1, the stock tends to react more
than the market, and if it is less than 1, the stock tends to react less than the market.
c) McDonalds has the largest beta; Ford has the lowest. The companies Caterpillar,
Goodyear, McDonalds and Ford have betas 1.29, 1.64, 1.76 and 0.73 respectively. d)
The explained variation is 35.7%, 26%, 31.2% and 24.3%, respectively.
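
A stock's beta is the slope from regressing its returns on market returns, and the explained variation is the R² of that regression. The sketch below shows the calculation on made-up returns. It does not use the actual data for the four companies.

```python
def beta_and_r2(market, stock):
    """Slope (beta) and R-squared from regressing stock returns on market returns."""
    n = len(market)
    mbar, sbar = sum(market) / n, sum(stock) / n
    sxx = sum((m - mbar) ** 2 for m in market)
    syy = sum((s - sbar) ** 2 for s in stock)
    sxy = sum((m - mbar) * (s - sbar) for m, s in zip(market, stock))
    beta = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)
    return beta, r2

# Hypothetical monthly returns in percent, for illustration only.
market_returns = [1.0, -2.0, 3.0, 0.0, -1.0, 2.0]
stock_returns = [2.0, -3.5, 4.0, 0.5, -2.0, 3.5]

beta, r2 = beta_and_r2(market_returns, stock_returns)
print(f"beta = {beta:.4f}, explained variation = {100 * r2:.1f}%")
```

A beta above 1, as in this made-up example, means the stock tends to move more than the market.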

4. (a) See Table 8. For every extra year of experience, the student can expect an extra
$2677 in salary. For every extra point in class standing, the student can expect an
extra $61 in salary. These two variables account for 80.6% of the variation in salary.

Table 8: MBA Salary regression

Regression Table   Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant              51051.02      761.5860322   67.0325   < 0.0001    49534.50867    52567.53174
Experience             2677.42      159.5674175   16.7793   < 0.0001    2359.685157      2995.1638
Class Rank               60.87      7.920446574    7.6858   < 0.0001    45.10319227    76.64644037

(b) Keeping experience constant, we can say with 95% confidence that an increase in
Class rank by 1 is associated with an increase in salary of anywhere between 45.1 and
76.6.

c) See Table 9. Yes, the new variable is significant at the 5% level (p-value 0.0377).

Table 9: MBA Salary regression with interaction

Regression Table           Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant                      48920.98      1252.836763   39.0482   < 0.0001    46425.73455    51416.21588
Experience                     3248.27      311.8259345   10.4169   < 0.0001    2627.216641    3869.326987
Class Rank                      103.19      21.45597083    4.8092   < 0.0001    60.45214919     145.918688
Interaction(Exper,Class)        -11.54      5.456608542   -2.1147     0.0377   -22.40661425   -0.671058699

d) To interpret the equation, write it as:

Predicted Salary = (48921 + 103.2 Class) + (3248 - 11.54 Class) * Exper

Now compare two students with different values of Class, say, Class = 25 and Class =
75, and see how their predicted salaries vary with Experience. The first student starts
out lower (the 48921 + 103.2 Class term is lower), but his coefficient of Experience is
higher (because of the minus sign on the interaction term). The important point is that
the effect of more experience on salary, while always positive, decreases as the student's
class standing increases. When interactions are present, the effect of a unit change in
one variable (e.g. experience) will depend on the level of the other variable (e.g. class).
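
The comparison of the two students can be worked out directly from the Table 9 coefficients. A minimal sketch (the function names are illustrative):

```python
# Coefficients from Table 9.
B_CONST, B_EXPER, B_CLASS, B_INTER = 48920.98, 3248.27, 103.19, -11.54

def intercept_for(class_rank):
    """Starting salary (Exper = 0) for a given class standing."""
    return B_CONST + B_CLASS * class_rank

def experience_slope(class_rank):
    """Salary gain per extra year of experience at this class standing."""
    return B_EXPER + B_INTER * class_rank

def predicted_salary(exper, class_rank):
    return intercept_for(class_rank) + experience_slope(class_rank) * exper

for class_rank in (25, 75):
    print(f"Class = {class_rank}: intercept = {intercept_for(class_rank):.2f}, "
          f"slope on Exper = {experience_slope(class_rank):.2f}")

# Class value at which the experience slope would reach zero.
zero_slope_class = -B_EXPER / B_INTER
print(f"slope reaches zero at Class = {zero_slope_class:.1f}")
```

With these coefficients the slope on Experience stays positive for any Class value below about 281, so it is positive at both Class = 25 and Class = 75.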
