You are on page 1of 21

Advance Statistics

PROJECT

Name- Ashish Gupta


PGP-DSBA Online.AUG22B
Date- 13/11/2022

Table of Contents

1|Page
S. No Heading Page No
1 Problem 1 3
2 Problem 2 6
3 Problem 3 9
4 Problem 4 12
5 Problem 5 14
6 Problem 6 16
7 Problem 7 17

2|Page
Problem 1

A physiotherapist with a male football team is interested in studying the relationship


between foot injuries and the positions at which the players play from the data collected
 
  Striker Forward Attacking Midfielder Winger Total
Players Injured 45 56 24 20 145
Players Not Injured 32 38 11 9 90
Total 77 94 35 29 235
 
1.1 What is the probability that a randomly chosen player would suffer an injury?
1.2 What is the probability that a player is a forward or a winger?
1.3 What is the probability that a randomly chosen player plays in a striker position and has
a foot injury?
1.4 What is the probability that a randomly chosen injured player is a striker?
1.5 What is the probability that a randomly chosen injured player is either a forward or an
attacking midfielder? 

Solution:

1.1 What is the probability that a randomly chosen player would suffer an injury?

Sol:

Let P(I) be the probability of a player suffering from injury.

Total Players Injured


Therefore, P(I) =
Total Count of Players

= 145/235
= 0.617

There is 61.7 % chance that out of total players, a randomly chosen player would suffer an
injury.

1.2 What is the probability that a player is a forward or a winger?

Sol:

3|Page
Let’s find the probability of a player being forward and a winger separately.

Let P(F) be the probability of a player being Forward.


and Let P(W) be the probability of a player being Forward.

P(F  W)= P(F) + P(W)

Since a player can either be Forward or Winger and can’t be both simultaneously, we can’t
have P(F  W).

P(F) = 94/235= 0.40


P(W) = 29/235 = 0.123

Therefore, P(F  W)= P(F) + P(W) = 0.40 + 0.123 = 0.523

Thus, there is 52.3% chance that a randomly chosen player is a Forward or a Winger

1.3 What is the probability that a randomly chosen player plays in a striker position and has
a foot injury?

Sol:

From the table, we can deduce the following:

Number of players who are Strikers and have an injury = 45


Therefore,
Probability that a randomly chosen player who is a Striker and has a Foot injury=

Strikers with injury

Total number of players

= 45/235
= 0.1914

Therefore, there is 19.14% chance that a randomly chosen player plays in a Striker
position and has a foot injury.

1.4 What is the probability that a randomly chosen injured player is a striker?

Sol:

4|Page
To find this, we will first see the total number of injured players and see the probability of a
randomly chosen injured player being a Striker.

Therefore, from the table we can see,

Number of injured Strikers = 45


Total number of Injured players= 145

Therefore, probability that a randomly chosen injured player is a Striker =

Number of Injured Strikers

Total number of injured players

= 45/145
= 0.310

Therefore, there is 31% chance that a randomly chosen injured player is a Striker.

1.5 What is the probability that a randomly chosen injured player is either a forward or an
attacking midfielder? 

Sol:

Let’s first find out the probability that an injured player chosen is a forward and mid-fielder
separately.

Let P( Fi ) be the probability that a randomly chosen injured player is a Forward.


& Let P( Mi ) be the probability that a randomly chosen injured player is an Attacking Mid-
fielder.

P( Fi  Mi ) will give the probability that a randomly chosen injured player is either a forward
or an attacking mid-fielder.

P( Fi  Mi ) = P(Fi) + P(Mi)

Since a player can either be a Forward or a Mid-Fielder and can’t be both. Thus we can’t find
P( Fi  Mi ).

Now, P( Fi )= 56/145 = 0.386


P( Mi )= 24/145 = 0.165

Therefore,
P( Fi  Mi ) = P(Fi) + P(Mi)

= 0.386 + 0.165

5|Page
= 0.551

Therefore, there is 55% chance that a randomly chosen injured player is either a forward
or an attacking midfielder.

Problem 2:

An independent research organization is trying to estimate the probability that an accident


at a nuclear power plant will result in radiation leakage. The types of accidents possible at
the plant are, fire hazards, mechanical failure, or human error. The research organization
also knows that two or more types of accidents cannot occur simultaneously.
According to the studies carried out by the organization, the probability of a radiation leak
in case of a fire is 20%, the probability of a radiation leak in case of a mechanical 50%, and
the probability of a radiation leak in case of a human error is 10%. The studies also showed
the following;
 The probability of a radiation leak occurring simultaneously with a fire is 0.1%.
 The probability of a radiation leak occurring simultaneously with a mechanical failure
is 0.15%.
 The probability of a radiation leak occurring simultaneously with a human error is
0.12%.
On the basis of the information available, answer the questions below:

2.1 What are the probabilities of a fire, a mechanical failure, and a human error
respectively?
2.2 What is the probability of a radiation leak?
2.3 Suppose there has been a radiation leak in the reactor for which the definite cause is not
known. What is the probability that it has been caused by:
 A Fire.
 A Mechanical Failure.
 A Human Error.

Solution:

2.1 What are the probabilities of a fire, a mechanical failure, and a human error
respectively?

Sol:

This problem can be solved by application of Bayes Theorem.

Let P(R) be the probability of Radiation leak


P(F) be the probability of Fire
P(M) be the probability of Mechanical Failure
P(H) be the probability of Human Error

According to the given data,


P(R | F) = 0.20

6|Page
P(R | M) = 0.50
P(R | H) = 0.10
and also,

P( F  R ) = 0.1% = 0.001
P( M  R ) = 0.15% = 0.0015
P( H  R ) = 0.12% = 0.0012

As we know, P( A|B ) = P( A  B ) / P(B), given A & B are independent events.

Since, it is given that all possible accidents cannot occur simultaneously, we can say that
these are independent events.

 P( F ) = P( F  R ) / P( R|F)
= 0.001/0.20
= 0.005
Thus, probability of a fire at the plant is 0.5%

 P( M ) = P( M  R ) / P( R|M)
= 0.0015/0.5
= 0.003
Thus, probability of a mechanical failure at the plant is 0.3%

 P( H ) = P(H  R ) / P( R|H)
= 0.0012/0.10
= 0.012
Thus, probability of a accident due to Human error at the plant is 1.2 %

2.2 What is the probability of a radiation leak?

Sol:

Since, fire, mechanical failure and human error are only possible accidents at the plant, we
can say that there will be no radiation leak if these accidents do not happen.

Let P( N ) be the probability that there will be no accident.

P( N )= 1- (P(F) + P(M) + P(H))


= 1- ( 0.005 + 0.003 + 0.012 )
= 1- 0.02 = 0.98

Now, probability that radiation leak will happen when there is no accident is zero, i.e
P(R|N)= 0

Using Bayes formula,


P( A  B ) = P(A|B) x P(B)

7|Page
 P(R  N) = P(R|N) x P(R)
= 0 x P(R) = 0

Since probability that radiation leak will happen totally depends on if any accident will
happen. This means that no leakage will happen if there is no accident i.e it is completely
dependent on occurrence of an accident.

P( R ) = P( R  F ) + P( R  M ) + P( R  H ) + P( R  N)
= 0.001 + 0.0015 + 0.0012 + 0
= 0.0037

2.3 Suppose there has been a radiation leak in the reactor for which the definite cause is not
known. What is the probability that it has been caused by:
 A Fire.
 A Mechanical Failure.
 A Human Error.

Sol:

Here we have to find,

o P( F | R )
o P( M | R )
o P( H | R )

P( F | R ) = P ( R  F ) / P( R )
= 0.001/0.0037
= 0.277027

Thus, probability that a radiation leak has happened due to fire is 27%

P( M | R ) = P ( R  M ) / P( R )
= 0.0015/0.0037
= 0.4054

Thus, probability that a radiation leak has happened due to mechanical failure is 40.5%

P( H | R ) = P ( R  H ) / P( R )
= 0.0012/0.0037
= 0.3243

Thus, probability that a radiation leak has happened due to human error is 32.4%

8|Page
Problem 3:
The breaking strength of gunny bags used for packaging cement is normally distributed with
a mean of 5 kg per sq. centimeter and a standard deviation of 1.5 kg per sq. centimeter. The
quality team of the cement company wants to know the following about the packaging
material to better understand wastage or pilferage within the supply chain; Answer the
questions below based on the given information; (Provide an appropriate visual
representation of your answers, without which marks will be deducted)

3.1 What proportion of the gunny bags have a breaking strength less than 3.17 kg per sq
cm?
3.2 What proportion of the gunny bags have a breaking strength at least 3.6 kg per sq cm.?
3.3 What proportion of the gunny bags have a breaking strength between 5 and 5.5 kg per
sq cm.?
3.4 What proportion of the gunny bags have a breaking strength NOT between 3 and 7.5 kg
per sq cm.?

Solution:

3.1 What proportion of the gunny bags have a breaking strength less than 3.17 kg per sq
cm?

Let us find the Z score which will give the Standard normal variable

X− X X= Observed value
Z=
σ X = Mean value = 5 kg per sq cm
As per question,  = Standard Deviation = 1.5 kg/cm2
Z= (3.17-5)/1.5
= -1.22

In Python, We will use norm function of scipy.stats to calculate our cumulative density
function, which will give the area to the left of distribution below 3.17
From, Python calculation, using the code scipy.stats.norm.cdf(-1.22), we get the output as
0.111.

Therefore, we can say that 11% of the gunny bags have a breaking strength less than 3.17
kg per square cm.

3.2 What proportion of the gunny bags have a breaking strength at least 3.6 kg per sq cm.?

Using the same approach, let us find the Z- score.

X− X
Z=
σ

9|Page
= (3.6 – 5) / 1.5
= - 0.933

Since we are asked to give a proportion of bags with breaking strength at least 3.6 Kg/ cm 2
This implies that we want to find the area under curve to the right of P( X >= 3.6 )

 P( X >= 3.6 ) = 1 – stats.norm.cdf(-0.933)

Therefore, P( Z > - 0.933 ) = 0.8247

Therefore, we can say that 82.5% of the gunny bags have a breaking strength of at least
3.6 kg per square cm.

3.3 What proportion of the gunny bags have a breaking strength between 5 and 5.5 kg per
sq cm.?

To find this, we will find Z- score for both 5 and 5.5 Kg/cm 2 and calculate areas under curves
correspondingly. Then subtracting the area for both with each other to find the area
between these two points.

Let Z1 be the Z score corresponding to the strength of 5 Kg per sq cm.


And Let Z2 be the Z score corresponding to the strength of 5.5 Kg per sq cm.

Therefore,

Z1= (5-5)/1.5 = 0
Z2= (5.5-5)/5 = 0.3333

We will do stats.norm.cdf(Z2) – stats.norm.cdf(Z1) .

This will give us the area between these two points.

P( Z1 < Z < Z2 ) = P ( Z< 0.3333) – P ( Z < 0 )

This comes out to be 0.1306.

Therefore, we can say that 13% of the gunny bags have breaking strength between 5 and
5.5 kg/cm2.

3.4 What proportion of the gunny bags have a breaking strength NOT between 3 and 7.5 kg
per sq cm.?

To find this, we will first find Z- score for both 3 and 7.5 Kg/cm 2 and calculate areas under
curves correspondingly. Then subtracting the area for both with each other to find the area
between these two points.
To calculate the proportion not between 3 and 7.5 kg/cm2, we will subtract the result from 1

10 | P a g e
Let Z1 be the Z score corresponding to the strength of 3 Kg per sq cm.
And Let Z2 be the Z score corresponding to the strength of 7.5 Kg per sq cm.

Therefore,

Z1= (3-5)/1.5 = - 1.333


Z2= (7.5-5)/5 = 1.6666

We will do 1- (stats.norm.cdf(Z2) – stats.norm.cdf(Z1)) .

This will give us the area between these two points.

This comes out to be 0.1390.

Therefore, we can say that 13.9% of the gunny bags have breaking strength between 3
and 7.5 kg/cm2.

11 | P a g e
Problem 4:
Grades of the final examination in a training course are found to be normally distributed,
with a mean of 77 and a standard deviation of 8.5. Based on the given information answer
the questions below.
 
4.1 What is the probability that a randomly chosen student gets a grade below 85 on this
exam?
4.2 What is the probability that a randomly selected student scores between 65 and 87?
4.3 What should be the passing cut-off so that 75% of the students clear the exam?

Solution:

4.1 What is the probability that a randomly chosen student gets a grade below 85 on this
exam?

Let us find the Z score which will give the Standard normal variable

X− X X= Observed value
Z=
σ X = Mean value = 77
As per question,  = Standard Deviation = 8.5
Z= (85-77)/8.5
= 0.9411

In Python, We will use norm function of scipy.stats to calculate our cumulative density
function, which will give the area to the left of distribution below 85
From, Python calculation, using the code scipy.stats.norm.cdf(0.9411), we get the output as
0.8266.

Therefore, we can say that there is 82.6% probability that a randomly chosen student gets
a grade below 85 on this exam.

4.2 What is the probability that a randomly selected student scores between 65 and 87?

To find this, we will find Z- score for both 65 and 85 and calculate areas under curves
correspondingly. Then subtracting the area for both with each other to find the area
between these two points.

Let Z1 be the Z score corresponding to the score of 65


And Let Z2 be the Z score corresponding to the score of 87.

Therefore,

Z1= (65-77)/8.5 = - 1.4117


Z2= (87-77)/8.5 = 1.1764

We will do stats.norm.cdf(Z2) – stats.norm.cdf(Z1) .

12 | P a g e
This will give us the area between these two points.

P( Z1 < Z < Z2 ) = P ( Z< 1.1764) – P ( Z < -1.4117 )

This comes out to be 0.80128.

Therefore, we can say that there is 80% probability that a randomly chosen student gets a
grade between 65 and 87 on this exam.

4.3 What should be the passing cut-off so that 75% of the students clear the exam?

To calculate this, we will use ppf function (Percent point function) of scipy.stats library,
which will return a discrete value that is less than or equal to the asked probability.

Here, we want the cut off score above which 75% of students clear the exam. We want area
to the right side of the discrete value above which 75% students pass.

Therefore, using stats.norm.ppf(0.25, loc= 77, scale= 8.5) we get 71.266

Thus, the passing cut off score so that 75% of students clear the exam is 71.26

13 | P a g e
Problem 5:
Zingaro stone printing is a company that specializes in printing images or patterns on
polished or unpolished stones. However, for the optimum level of printing of the image the
stone surface has to have a Brinell's hardness index of at least 150. Recently, Zingaro has
received a batch of polished and unpolished stones from its clients. Use the data provided
to answer the following (assuming a 5% significance level);
 
5.1 Earlier experience of Zingaro with this particular client is favorable as the stone surface
was found to be of adequate hardness. However, Zingaro has reason to believe now that
the unpolished stones may not be suitable for printing. Do you think Zingaro is justified in
thinking so?
5.2 Is the mean hardness of the polished and unpolished stones the same?

Solution:

5.1 Earlier experience of Zingaro with this particular client is favorable as the stone surface
was found to be of adequate hardness. However, Zingaro has reason to believe now that
the unpolished stones may not be suitable for printing. Do you think Zingaro is justified in
thinking so?

Step 1:
Defining Hypothesis:

Null Hypothesis
Ho: Adequate hardness of stone found >= 150

Alternate Hypothesis
Ha: Unpolished stone hardness not suitable for printing < 150

Here, Level of Significance ɑ = 0.05 and Sample size n= 75 (derived from dataset), x¯ =
134.11, σ = 33.04, μ = 150

Step 2: Define the test statistic based on the information in the question. Here, we are going
to use the Zstat .

Let us calculate the value of the test statistic.

test_statistic z= (Xbar- mu)/(𝛔/√n)

The value of Zstat comes out to be -4.16.

From the value of the Zstat , we understand that this is a lower tailed-test.

Step 3: Let us check the critical value with respect to α for the test statistic.
Using norm.ppf for a alpha value of 0.05, the critical value is -1.64

14 | P a g e
Here, the calculated Zstat value is less than -1.64. Thus, this value falls in the rejection
region.

Hence, we can reject the H0 i.e accept the Null Hypothesis.


Let’s calculate the p-value as well. The p-value comes out to be 1.5 x 10 -5, which is too low
than alpha.
We see that the p-value < α . Thus, it is confirmed we can reject the null hypothesis.
With 95% confidence, we are able to reject the Null Hypothesis.

With 95% confidence, we can say that we have enough evidence to say that the hardness for
unpolished stones is less than 150. Thus Zingaro is right in its thinking that unpolished stones are
not fit for printing.

5.2 Is the mean hardness of the polished and unpolished stones the same?

Let us look at the five point summary of the data.

From the table, we can see that the mean hardness of Polished stones is 134.11 whereas mean
hardness of Treated and Polished stone is 147.78 which is not same.

15 | P a g e
Problem 6:

Aquarius health club, one of the largest and most popular cross-fit gyms in the country has
been advertising a rigorous program for body conditioning. The program is considered
successful if the candidate is able to do more than 5 push-ups, as compared to when he/she
enrolled in the program. Using the sample data provided can you conclude whether the
program is successful? (Consider the level of Significance as 5%)
Note that this is a problem of the paired-t-test. Since the claim is that the training will make
a difference of more than 5, the null and alternative hypotheses must be formed
accordingly.

Solution:

Step 1:

Defining Hypothesis:

Let µ1 be the Mean push ups done by candidate before enrolment and µ2 be the Mean push ups
done by candidate after enrolment.

Null Hypothesis:

Ho: µ1=µ2

Alternate Hypothesis:

Ha: µ2-µ1 > 5

Here, Level of Significance ɑ = 0.05 and Sample size n= 100 (derived from dataset)

- The level of significance (Alpha) = 0.05.

- But since the population standard deviation (Sigma) is unknown, we have to use a T-stat test.

- Degree of Freedom: Since the sample is the same for both Sampling tests, we have N-1 degrees
of freedom: 99

- Since the sole purpose of the test is to check whether the Program is successful, we would prefer
a One-sided T-test.

From the test, we observe that that p-value is less than alpha. Thus we reject the null hypothesis.

Hence we can say that we have statistical evidence to state that the Program was successful

16 | P a g e
Problem 7:

Dental implant data: The hardness of metal implant in dental cavities depends on multiple factors,
such as the method of implant, the temperature at which the metal is treated, the alloy used as well
as on the dentists who may favour one method above another and may work better in his/her
favourite method. The response is the variable of interest.

1. Test whether there is any difference among the dentists on the implant hardness. State the
null and alternative hypotheses. Note that both types of alloys cannot be considered
together. You must state the null and alternative hypotheses separately for the two types of
alloys.?

2. Before the hypotheses may be tested, state the required assumptions. Are the assumptions
fulfilled? Comment separately on both alloy types.? 

3. Irrespective of your conclusion in 2, we will continue with the testing procedure. What do you
conclude regarding whether implant hardness depends on dentists? Clearly state your
conclusion. If the null hypothesis is rejected, is it possible to identify which pairs of dentists
differ?

4. Now test whether there is any difference among the methods on the hardness of dental
implant, separately for the two types of alloys. What are your conclusions? If the null
hypothesis is rejected, is it possible to identify which pairs of methods differ?

5. Now test whether there is any difference among the temperature levels on the hardness of
dental implant, separately for the two types of alloys. What are your conclusions? If the null
hypothesis is rejected, is it possible to identify which levels of temperatures differ?

6. Consider the interaction effect of dentist and method and comment on the interaction plot,
separately for the two types of alloys?

7. Now consider the effect of both factors, dentist, and method, separately on each alloy. What
do you conclude? Is it possible to identify which dentists are different, which methods are
different, and which interaction levels are different?

Solution:
1. Test whether there is any difference among the dentists on the implant hardness. State the null
and alternative hypotheses. Note that both types of alloys cannot be considered together. You must
state the null and alternative hypotheses separately for the two types of alloys?

Step 1:

Defining Hypothesis:

Defining Separate Hypothesis for both cases

Case 1:

Null Hypothesis:

Ho: Mean Hardness is same across all dentists for Alloy 1

17 | P a g e
Alternate Hypothesis:

Ha: Mean Hardness is not same for at least one pair of Dentists for Alloy 1

Case 2:

Null Hypothesis:

Ho: Mean Hardness is same across all dentists for Alloy 2

Alternate Hypothesis:

Ha: Mean Hardness is not same for at least one pair of Dentists for Alloy 2

Here, Level of Significance ɑ = 0.05 and Sample size n= 90 (derived from dataset)

2. Before the hypotheses may be tested, state the required assumptions. Are the assumptions fulfilled?
Comment separately on both alloy types? 

These are the assumptions that are required before the test:

o The responses for each type of alloy have a normal distribution.


o These distributions have the same variance.
o The data is independent.

Let’s see the boxplot of the Response variable to see the distribution.

We can clearly see that the distribution is not normal. The data clearly does not fulfill the
assumption.

18 | P a g e
3. Irrespective of your conclusion in 2, we will continue with the testing procedure. What do you
conclude regarding whether implant hardness depends on dentists? Clearly state your conclusion. If
the null hypothesis is rejected, is it possible to identify which pairs of dentists differ?

We will perform one way ANOVA for the response variable.

Now, we see that the corresponding p-value is greater than alpha (0.05). Thus, we fail to reject the
Null Hypothesis.

Thus the Mean hardness is same across all dentists

Now let us perform One way ANOVA for Response variable for Alloy 1 and Alloy 2 separately.

Alloy 1:

Alloy 2:

Now, we see that the corresponding p-value is greater than alpha (0.05). Thus, we fail to reject the
Null Hypothesis.

Thus, for both Alloy 1 and Alloy 2, the Mean hardness of Alloy1 and Alloy 2 is same across all dentists

4. Now test whether there is any difference among the methods on the hardness of dental implant,
separately for the two types of alloys. What are your conclusions? If the null hypothesis is rejected, is it
possible to identify which pairs of methods differ?

Step 1:

Defining Hypothesis:

Defining Separate Hypothesis for both cases

Case 1:

Null Hypothesis:

Ho: Mean Hardness is same across all methods for Alloy 1

19 | P a g e
Alternate Hypothesis:

Ha: Mean Hardness is not same for at least one pair of Methods for Alloy 1

Case 2:

Null Hypothesis:

Ho: Mean Hardness is same across all methods for Alloy 2

Alternate Hypothesis:

Ha: Mean Hardness is not same for at least one pair of methods for Alloy 2

Now let us perform One way ANOVA for Response variable for Alloy 1 and Alloy 2 with respect to the
Method separately.

Alloy 1:

Alloy 2:

The p-value is lesser than alpha (0.05) for both alloy 1 and alloy 2. We will reject the null hypothesis.
We can say that the mean hardness for both Alloy 1 and Alloy 2 is different for at least one pair of
methods of Dental Implant.

5. Now test whether there is any difference among the temperature levels on the hardness of dental
implant, separately for the two types of alloys. What are your conclusions? If the null hypothesis is
rejected, is it possible to identify which levels of temperatures differ?

Step 1:

Defining Hypothesis:

Defining Separate Hypothesis for both cases

Case 1:

Null Hypothesis:

Ho: Mean Hardness is same across all temperature levels for Alloy 1

20 | P a g e
Alternate Hypothesis:

Ha: Mean Hardness is not same across different temperature levels for Alloy 1

Case 2:

Null Hypothesis:

Ho: Mean Hardness is same across all temperature levels for Alloy 2

Alternate Hypothesis:

Ha: Mean Hardness is not same across different temperature levels for Alloy 2

Alloy 1:

Alloy 2:

Now, we see that the corresponding p-value is greater than alpha (0.05). Thus, we fail to reject the
Null Hypothesis.

Thus, we can say that Mean Hardness is same across all temperature levels for both Alloy 1 and Alloy
2.

21 | P a g e

You might also like