
Statistics

Assignment 1

1) You are discussing your favourite sports team with a friend. She tells you that she has observed
their performance in the last 1000 games that they have played and has observed the following
in terms of whether the team won a game or not and whether they were leading at halftime in a
game or not:

                            Win    Did not win    Total
Leading at halftime         352       ____          515
Not leading at halftime    ____       ____          485
Total                       530        470         1000

Based on this information, fill out the remaining cells in the 2x2 table and answer the following:

a) What is P(Team wins)?
b) What is P(Team leading at halftime & Team wins)?
c) What is P(Team wins / Team leading at halftime)?
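Note: the mechanics of completing a 2x2 table and reading probabilities off it can be sketched in Python as below. The counts are hypothetical and are deliberately not the ones in this question; the names fill_table, A and B are illustrative assumptions, not part of the assignment.

```python
def fill_table(a_and_b, a_total, b_total, grand_total):
    """Complete a 2x2 table of counts from one cell and the margins.

    Rows are A / not A, columns are B / not B.
    """
    return {
        ("A", "B"): a_and_b,
        ("A", "not B"): a_total - a_and_b,
        ("not A", "B"): b_total - a_and_b,
        ("not A", "not B"): grand_total - a_total - b_total + a_and_b,
    }

# Hypothetical counts, chosen only for illustration.
N = 200
table = fill_table(a_and_b=60, a_total=80, b_total=110, grand_total=N)

# Marginal probability of B: column total divided by grand total.
p_b = (table[("A", "B")] + table[("not A", "B")]) / N
# Joint probability of A and B: one cell divided by grand total.
p_a_and_b = table[("A", "B")] / N
# Conditional probability of B given A: restrict attention to row A.
row_a_total = table[("A", "B")] + table[("A", "not B")]
p_b_given_a = table[("A", "B")] / row_a_total

print(table)
print(p_b, p_a_and_b, p_b_given_a)
```

The conditional probability uses the row total, not the grand total, as its denominator; that is the only difference between it and the joint probability.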

2) There is debate as to whether a “First mover advantage” exists for companies that innovate
and are the first to launch new products/ideas. The issues surrounding this debate are complex.
A management “expert”, while making an argument in favour of this notion, states that he has
studied 80 companies that dominate the market in their category, and 59 of these companies
were “first movers” in that category.

(a) Letting A = {company dominates category} and B = {company is a first mover}, the number
59/80 = 0.7375 can be expressed as
(i) P(A / B) (ii) P(A & B) (iii) P(A and not B) (iv) P(B / A)

(b) What kind of conditional probability (or probabilities) would you prefer to see instead if you
wanted to be persuaded by the notion of “first mover advantage”?
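
Note: the order of conditioning matters. As a rough illustration with hypothetical counts (not data from the expert's study), the sketch below shows that the probability of B given A and the probability of A given B can be very different even though both use the same joint count in the numerator.

```python
# Hypothetical counts, for illustration only.
n_a_and_b = 30     # cases in both A and B
n_a_not_b = 10     # cases in A but not in B
n_b_not_a = 270    # cases in B but not in A

# Probability of B given A: share of all A cases that are also in B.
p_b_given_a = n_a_and_b / (n_a_and_b + n_a_not_b)
# Probability of A given B: share of all B cases that are also in A.
p_a_given_b = n_a_and_b / (n_a_and_b + n_b_not_a)

print(f"P(B given A) = {p_b_given_a:.2f}")   # prints 0.75
print(f"P(A given B) = {p_a_given_b:.2f}")   # prints 0.10
```

Computing the second quantity requires data on cases that are in B but not in A, which a sample drawn only from A cannot supply.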

Note: this should not impact how you answer the questions above, but it is important to note
that in order to compute these probabilities above, one has to collect data on companies to figure
out which ones were “first mover companies”. One vital point that is often missed or not
appreciated in doing so, is that of “survivor bias”; one may think of a company as having been
the first mover in an industry, but it may be possible that some other company had actually been
the original first mover but went bust early on (did not survive) and so fell out of the historical
record. As a consequence, all resulting calculations and inferences become very suspect when
such bias exists, but the bias is very often missed, resulting in incorrect actions/policies etc. with
serious consequences. A very famous example from World War II that illustrates how ignoring
survivor bias can result in very poor outcomes (here, capture or death) can be seen in this 2 min
video. (The only inaccuracy in the video is the narrator referring to Wald as an Austrian economist
when he was neither; Wald was Hungarian and more of a mathematician/statistician.)

3) Suppose you and a business partner own a chain of spas across Mumbai and are trying to figure
out methods to increase revenue. You hire a consultant and give her access to all the data that
you have on your customers over the past 2 years. After doing various analyses on the data, she
presents you with a summary of findings, one of which says “The 90 minute Swedish Massage
treatment gives a lift of 0.6 to the Facial Care treatment.”
In plain simple English and in no more than 5 sentences (it’s fine if you use fewer), explain to your
business partner, who has never studied Statistics, what that sentence means AND what its
practical implication is for your chain in terms of making or not making any special offers for a
combination of these two treatments.
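
Note: the assignment does not spell out how the consultant computed "lift". The sketch below assumes the usual market-basket definition, lift = P(A & B) / (P(A) x P(B)), and uses hypothetical customer counts; the function name and the numbers are illustrative only.

```python
def lift(n_both, n_a, n_b, n_total):
    """Lift of A and B from raw counts.

    Defined as P(A & B) / (P(A) * P(B)); written here as
    (n_both * n_total) / (n_a * n_b), which is algebraically the same
    and avoids floating-point rounding in the intermediate steps.
    """
    return (n_both * n_total) / (n_a * n_b)

# Hypothetical counts: 1,000 customers, 200 bought A, 100 bought B, 40 bought both.
value = lift(n_both=40, n_a=200, n_b=100, n_total=1000)
print(value)

# If A and B were independent, we would expect 200 * 100 / 1000 = 20 joint buyers.
print(lift(n_both=20, n_a=200, n_b=100, n_total=1000))
```

Under this definition a lift of exactly 1 corresponds to independence, so the number is read by comparing it to 1.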

4) One important methodology in Statistics is that of “Testing of hypotheses”, something that we
will discuss in some detail later in the semester. Tests of hypotheses and a related measure called
a “p value” play a fundamental role in many situations ranging from medicine (almost all new
prescription medicine/drugs that a pharma company wants to sell in the U.S. have to go through
clinical trials, which are tests of hypotheses, to establish that they work effectively; all the press
over the last few months around the effectiveness of hydroxychloroquine is essentially related
to the underlying test of hypothesis that tests whether it is effective in curing a patient from
covid), through engineering to marketing, finance etc. There is literally no area in which tests of
hypotheses do not find application.

Though the methodology was put on a firm footing in the 1930s, the very first known application
of tests of hypotheses, using the logic that is still the basis for them to this day, is from 1710. That
year, John Arbuthnott, a Scottish polymath, who, in addition to knowing/being friends with a
variety of bigwigs such as Jonathan Swift, Isaac Newton and Samuel Johnson, was also the Royal
Physician to Queen Anne, published a paper in the Philosophical Transactions titled
Arbuthnott stated in his paper that for 82 years (1629-1710) in a row in London, there were more
male babies born every year than female babies. He made this observation by showing a table
(below) in the paper that shows the number of male and female christenings in London; this
clearly is a leap of faith because the number of christenings would be substantially less than the
actual number of births at that time, due to causes such as a much higher infant mortality rate,
the stigma surrounding illegitimate babies, etc. Nevertheless, we will let this slide (and treat # of
christenings as equal to # of births) and continue with Arbuthnott’s argument, which is as follows,
and the logic of which continues to be used to this day:

Arbuthnott first assumed that there is no Divine Providence guiding the world and hence, since
only Chance governed everything by this assumption, he concluded correctly that P(Baby is M) =
P(Baby is F) = 0.5 (this is like tossing a fair coin).
From here, he argued that the implication of the above probability was that

P(# of Males born in a year > # of Females born that year) ≈ 0.5

to a very good approximation, and so he set this probability to be exactly 0.5. (The Binomial
distribution, which we will not cover in this class, is used to prove this approximation.
We will not worry about the proof but will take the approximation for granted in what
follows.)
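
Note: although the Binomial distribution is outside the scope of this class, the approximation can be checked numerically. The sketch below assumes a fixed number n of births in a year, each independently M or F with probability 0.5; the function name and the values of n are hypothetical choices for illustration.

```python
from math import comb

def prob_more_males(n_births):
    """Exact P(#males > #females) when each birth is M or F with probability 0.5."""
    # Count the equally likely M/F sequences that have a strict male majority.
    favourable = sum(
        comb(n_births, k) for k in range(n_births // 2 + 1, n_births + 1)
    )
    # All 2**n_births sequences are equally likely under the assumption.
    return favourable / 2 ** n_births

for n in (101, 1000, 5000):
    print(n, prob_more_males(n))
```

When n is odd a tie is impossible and the probability is exactly 0.5 by symmetry. When n is even the probability is slightly below 0.5 because of possible ties, and the shortfall shrinks as n grows.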
Next, he answered the following question:
Given the assumption above and the subsequent calculations, what is the probability that there
are more Males born than Females each year, for 82 years in a row? Note that this event
corresponds to the joint event
(# M born in yr 1 > # F born in yr 1) & (# M born in yr 2 > # F born in yr 2) & … &
(# M born in yr 82 > # F born in yr 82)
a) Compute the above probability, showing the steps and assumptions you make to get your
answer. Do you think that any assumptions you may have made are reasonable and, if
yes, why? (Note: The probability computed in this context, we shall see later, is an
example of what is called the p-value.)
b) Note that the probability you computed in (a) was based on the foundational assumption
that P(M)=P(F)=0.5. Given the probability you computed in (a), together with the FACT
that more births were male than female in every single year for 82 consecutive years,
what do you feel about the foundational assumption that P(M)=P(F)=0.5? (Note: This
foundational assumption in this context, we will see later, is an example of what is called
the Null Hypothesis.)
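
Note: one possible route to a joint probability of this kind is the multiplication rule. The sketch below assumes that the yearly events are independent of each other and share the same probability; both are assumptions you would need to state and defend. The run lengths shown are hypothetical and are not the 82 years of the question.

```python
def prob_streak(p_single, n_years):
    """P(an event occurs in every one of n_years years).

    Assumes the years are independent and each has probability p_single.
    """
    return p_single ** n_years

for n in (5, 10, 20):
    print(n, prob_streak(0.5, n))
```

The probability halves with every additional year, so long unbroken runs become very unlikely under the assumption.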

Arbuthnott had to now make the leap from his feeling (which, if you have reasoned correctly,
should match yours) about the foundational assumption of P(M)=P(F)=0.5 to his claim that this
feeling justified the existence of Divine Providence. There is no name for this in modern Statistics
because this is exactly what I called it, viz. a leap, with no basis in science. However, Arbuthnott
came up with a tortured argument to make his case, as follows, thus claiming the existence of
Divine Providence:

5) There has been a lot of talk over the last year about pooled testing for covid as a cost-effective
and fast means of testing. The way pooled testing works is as follows: A random sample of, say,
5 people is selected, a nasal swab taken from each and then the material from the 5 swabs is
pooled/combined and one single test is run on the pooled material. If this test comes back
negative, all 5 people are declared negative. If this test comes back positive, then all the 5
individuals are tested separately (i.e. 5 tests). We will assume throughout this discussion that the
test is a confirmatory test and not a diagnostic one (i.e. there are no false positives or false
negatives).

Suppose that 10% of the population in a city is infected and a pooled test is carried out on 5
randomly selected people.
a) Assuming that the event “a person is positive” is independent of “another person is positive”,
what is the probability that the pooled test comes back positive? Hint: The event “the test is
positive” is equivalent to “at least one person in 5 is positive”, which is the complement
of the event “none of the 5 people are positive”. It is easier to compute the probability of the
latter event and then use the fact that P(A) = 1 - P(A complement).
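
Note: the complement rule in the hint can be sketched in Python as follows. The prevalence and pool size used in the example call are hypothetical and differ from those in the question; the function name is illustrative.

```python
def prob_pool_positive(prevalence, pool_size):
    """P(at least one of pool_size people is positive).

    Assumes each person is positive independently with probability
    `prevalence` and that the test makes no errors.
    """
    p_all_negative = (1 - prevalence) ** pool_size   # every person negative
    return 1 - p_all_negative                        # complement rule

# Hypothetical values: 2% prevalence, pools of 8 people.
print(round(prob_pool_positive(0.02, 8), 4))
```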

Suppose, further, that one has a city of 50,000 people with an incidence rate of 0.1 (i.e. 10% of
the population is infected). If the city decides to test every person individually, it would need to
run 50,000 tests. However, if it breaks up the 50,000 individuals at random into 10,000 groups of
5 each, then it can run 10,000 pooled tests, though each time a pooled test turns up positive,
the city has to test each of the 5 people in that positive pool.

b) Out of the 10,000 pooled tests, how many pooled tests would you expect to find positive?
(HINT: Apply the long run relative frequency interpretation to the probability that a pooled test
is positive).

c) Based on this expected number of positive pooled tests, how many total tests (pooled and by
individual) would the city expect to run?

(Note: you will see that as the incidence rate becomes smaller, the pooled testing becomes more
efficient in reducing the total number of tests needed, on average. However, as the incidence
rate increases, there isn’t a substantial reduction in the number of tests to be done, on average,
and pooled testing loses its efficiency.)
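
The effect described in this note can be explored with a short sketch. It keeps the 50,000 people and pools of 5 from the question but uses other, hypothetical incidence rates; it also assumes independence between people, an error-free test, and a population that divides evenly into pools. The function name is illustrative.

```python
def expected_total_tests(population, pool_size, prevalence):
    """Expected number of tests under two-stage pooled testing."""
    n_pools = population // pool_size
    # A pool is positive when at least one member is positive.
    p_pool_positive = 1 - (1 - prevalence) ** pool_size
    # Long-run relative frequency: expected count = number of pools * probability.
    expected_positive_pools = n_pools * p_pool_positive
    # One test per pool, plus pool_size follow-up tests per positive pool.
    return n_pools + pool_size * expected_positive_pools

population, pool_size = 50_000, 5
for prevalence in (0.01, 0.05, 0.20, 0.30):
    total = expected_total_tests(population, pool_size, prevalence)
    print(f"incidence {prevalence:.2f}: about {total:,.0f} tests "
          f"(individual testing: {population:,})")
```

At a high enough incidence rate the expected total can exceed the number of individual tests, because almost every pool has to be retested.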

6) One of the many applications of machine learning classification techniques is in deciding which
potential customers to target for mailing advertisements for a product. For a given potential
customer, using the values of their features such as their age, the location where they live, marital
status, etc., the classification model can provide a probability that the potential customer will
respond to the mailing. Suppose that the cost of sending the mailing to the potential customer is
$1 (this includes generating the content, printing and actually mailing it) and that the net profit
for the company, if the potential customer buys the product, is $99 (so this $99 profit is after
accounting for the $1 cost of mailing).
a) Letting p denote the probability that a potential customer will respond to the mailing (i.e. buy
the product), compute the expected value of the profit for the company (your answer will be an
expression that involves the probability p). Hint: First list the two possible values of the random
variable “Profit”, one for the case that the person buys the product and the other if the customer
does not buy the product; in the latter case, note that a cost corresponds to a negative profit
value.
b) Based on your answer to (a), what is the minimum value that p must have, such that the
expected profit for the company is positive?
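
Note: the expected value of a random variable with two possible outcomes can be sketched as below. The dollar figures are hypothetical and deliberately differ from those in the question; the function names are illustrative assumptions.

```python
def expected_profit(p, net_gain, cost):
    """Expected profit when profit is +net_gain with probability p
    and -cost with probability 1 - p."""
    return p * net_gain + (1 - p) * (-cost)

def break_even_probability(net_gain, cost):
    """Value of p at which the expected profit is exactly zero.

    Solving p * net_gain - (1 - p) * cost = 0 gives p = cost / (net_gain + cost).
    """
    return cost / (net_gain + cost)

# Hypothetical figures: $2 mailing cost, $48 net profit on a sale.
p_star = break_even_probability(net_gain=48, cost=2)
print(p_star)
print(expected_profit(0.10, net_gain=48, cost=2))   # p above break-even
print(expected_profit(0.01, net_gain=48, cost=2))   # p below break-even
```

Any p above the break-even value gives a positive expected profit, which is what makes the classifier's predicted probability usable as a targeting threshold.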
