
Session 7, Lecture 8, BIMTECH, 18 Feb 2022

Statistics for Decision Making in Python


Session 7, Lecture 8
Business Vertical – DA, Trimester III, Batch ‘21-’23

V Shekhar Avasthy, 18th Feb, 2022

Privileged and Confidential. All Rights Reserved © Facts n Data 2022 1

What we intend to cover today?

• Quick Recap
• Variance revisited
• F-Test
• ANOVA


Quick recap…
• Irrespective of the population's distribution, for a population of size N > 0, if we plot the MEANS of ALL possible
samples of size ‘n’ (0 < n < N), the resulting distribution is (almost) always a BELL-shaped curve, provided ‘n’ is reasonably large.
• Because this bell-shaped curve is found so often, it is also called a NORMAL distribution curve.
• This curve has certain properties:
  • It is ‘characterized’ by its MEAN and Standard Deviation (SD), i.e., a given mean and SD ALWAYS
  lead to exactly one specific bell curve.
  • It is symmetric about its PEAK.
  • Mean = Median = Mode for an IDEAL bell-shaped curve. Even if the population mean, median and mode
  do not match, the curve of all possible sample means of size ‘n’ follows this property.
  • The PEAK of the curve of sample means, for any sample size ‘n’, always coincides with the mean of the population.
  • The area under the curve between given multiples of the SD on either side of the PEAK is always the
  same, e.g., 95% of sample means lie within ±1.96 SD of the mean, and so on.
  • We also studied that $\sigma_{\bar{x}} = \sigma / \sqrt{n}$, i.e., the SD of the curve of ALL possible sample means of
  size ‘n’ = Population SD / Square Root of ‘n’ (the sample size for which the curve is created).
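The properties above can be checked with a small simulation. This is a minimal sketch, not part of the original slides: the exponential population, its size, the sample size of 50 and the 5,000 random samples (rather than ALL possible samples) are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical, deliberately non-normal population: exponential with mean 10.
population = [random.expovariate(1 / 10) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

n = 50               # sample size
num_samples = 5_000  # random samples drawn (a stand-in for ALL combinations)

# Mean of each random sample of size n.
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
predicted_se = pop_sd / n ** 0.5  # Population SD / Square Root of n

# Share of sample means within 1.96 predicted SDs of the population mean.
within = sum(abs(m - pop_mean) <= 1.96 * predicted_se for m in sample_means)
share_within = within / num_samples

print(f"population mean      : {pop_mean:.3f}")
print(f"mean of sample means : {mean_of_means:.3f}")
print(f"SD of sample means   : {sd_of_means:.3f}")
print(f"population SD / sqrt(n): {predicted_se:.3f}")
print(f"share within 1.96 SD : {share_within:.3f}")
```

The mean of the sample means should land close to the population mean, the SD of the sample means close to Population SD / Square Root of ‘n’, and roughly 95% of the sample means within 1.96 SD, even though the population itself is strongly skewed.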


…Quick recap
• Statisticians encounter multiple problems to solve:
  • To ESTIMATE the population mean from a sample, the properties listed on the previous slide are exploited
  (remember the difference between Confidence Level and Confidence Interval).
  • To check if a sample is part of a given population, we may need to check whether the sample and the
  population are likely to have the same mean and/or variance, because if they do, the sample is consistent
  with the curve of size ‘n’ for that population, and thus likely belongs to the population from which this
  curve is drawn.
  • To check if a sample and a population are likely to have the same mean, we may use the one-sample t-test
  (provided the data meets the assumptions of the one-sample t-test).
  • To check if TWO given samples are likely to be part of the same population (and thus are similar), we may
  use the two-sample t-test (provided the data meets the assumptions of the two-sample t-test).
  • To check for association or differences in frequencies for sets of categorical data, we may use the Chi-Square test.
  • To check whether the means of several sets of continuous data differ, we may use ANOVA (Analysis Of
  Variance), which does so by comparing variances - following slides.
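As a reminder of what the two t-tests compute, here is a minimal sketch using only the standard library. The function names and the data are hypothetical, and the two-sample version assumes equal variances (the pooled form); in practice `scipy.stats.ttest_1samp` and `scipy.stats.ttest_ind` also return p-values.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t statistic for H0: the population mean equals mu0."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.fmean(sample) - mu0) / standard_error


def two_sample_t_pooled(a, b):
    """t statistic for H0: equal means, assuming equal variances (pooled)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.fmean(a) - statistics.fmean(b)) / standard_error


# Hypothetical data, for illustration only.
sample_a = [12, 14, 16, 18, 20]
sample_b = [10, 12, 14, 16, 18]

t_one = one_sample_t(sample_a, mu0=13)
t_two = two_sample_t_pooled(sample_a, sample_b)

print(f"one-sample t (df = {len(sample_a) - 1}): {t_one:.4f}")
print(f"two-sample t (df = {len(sample_a) + len(sample_b) - 2}): {t_two:.4f}")
```

Each statistic is then compared with the t table at the stated degrees of freedom.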


Remember session 2 Slide 23!


• Total Population Variance is the sum of squared differences of the individual values from the mean, divided by the number of values (N).

• Why squared difference? To make the distance from the mean a positive value.

• So, why not take the modulus, e.g., treat -3.4 as |-3.4| = 3.4? Because we want to AMPLIFY the difference to highlight values that are far away. A difference of, say, 7 would become 49!
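The effect of squaring versus taking the modulus can be seen on a handful of numbers. This is a minimal sketch with hypothetical values, chosen so that one point sits 7 away from the mean.

```python
import statistics

values = [3, 5, 6, 7, 14]  # hypothetical data; the mean is 7
mean = statistics.fmean(values)

deviations = [v - mean for v in values]        # signed distances from the mean
abs_deviations = [abs(d) for d in deviations]  # modulus: sign removed only
sq_deviations = [d ** 2 for d in deviations]   # squared: far values amplified

mean_abs_deviation = sum(abs_deviations) / len(values)
population_variance = sum(sq_deviations) / len(values)

# Share of each total contributed by the farthest point (deviation of 7).
share_abs = max(abs_deviations) / sum(abs_deviations)
share_sq = max(sq_deviations) / sum(sq_deviations)

print(f"mean absolute deviation: {mean_abs_deviation}")
print(f"population variance    : {population_variance}")
print(f"farthest point's share : {share_abs:.0%} (modulus) vs {share_sq:.0%} (squared)")
```

The farthest point accounts for half of the total absolute deviation but 70% of the total squared deviation, which is the amplification the slide refers to. `statistics.pvariance(values)` gives the same population variance directly.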


F-Test
• Based on the F distribution, also called Snedecor's F distribution or the Fisher–Snedecor distribution
• The name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher. Fisher initially developed the statistic as the
variance ratio in the 1920s.
• It is often desirable to compare two variances rather than two averages. For instance:
• College administrators would like two college professors grading exams to have the same variation in their grading.
• In order for a lid to fit a container, the variation in the lid and the container should be approximately the same.
• A service organization such as a car dealership would like to provide customers with a consistent service experience,
implying that the variance between the delivery standards of different workshops or service advisors should be nearly
the same.

• Assumptions/Necessary Conditions:
1. The populations from which the two samples are drawn are approximately normally distributed.
2. The two populations are independent of each other.
• Unlike most other hypothesis tests, the F test for equality of two variances is very sensitive to deviations from
normality. If the two (POPULATION) distributions are NOT normal, or close to normal, the test can give
unreliable results.


Hypothesis for F-Test


Example
PROBLEM:
Two college instructors are interested in whether or not there is any variation in the way they grade math exams. They
each grade the same set of 10 exams. The first instructor's grades have a variance of 52.3. The second instructor's grades
have a variance of 89.9. Test the claim that the first instructor's variance is smaller. The level of significance is 10%.

SOLUTION:
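Let 1 and 2 be the subscripts that indicate the first and second instructor, respectively. n1 = n2 = 10.
$H_0: \sigma_1^2 \ge \sigma_2^2$ and $H_a: \sigma_1^2 < \sigma_2^2$

Calculate the test statistic:

Placing the larger sample variance in the numerator, the F statistic is:
$F_c = s_2^2 / s_1^2 = 89.9 / 52.3 = 1.719$

Look at the F table with $n_2 - 1 = 9$ and $n_1 - 1 = 9$ degrees of freedom at the 10% level of significance: $F_{CRITICAL} \approx 2.44$.

Since F value (1.719) < F CRITICAL (2.44), we do not reject H0: at the 10% level, there is insufficient evidence that the first instructor's variance is smaller.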
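The table lookup can be cross-checked in Python. This is a minimal sketch, assuming SciPy is installed; it uses `scipy.stats.f` for the critical value and the p-value, and it treats the test as right-tailed with the second instructor's (larger) variance in the numerator, as in the F statistic above.

```python
from scipy import stats

# Values taken from the problem statement.
s1_sq, s2_sq = 52.3, 89.9  # sample variances of instructors 1 and 2
n1 = n2 = 10
alpha = 0.10

# Larger sample variance in the numerator, so a large F supports Ha.
f_stat = s2_sq / s1_sq
df_num, df_den = n2 - 1, n1 - 1

# Right-tail critical value and p-value from the F distribution.
f_critical = stats.f.ppf(1 - alpha, df_num, df_den)
p_value = stats.f.sf(f_stat, df_num, df_den)

reject_h0 = f_stat > f_critical

print(f"F statistic : {f_stat:.3f}")
print(f"F critical  : {f_critical:.3f}")
print(f"p-value     : {p_value:.4f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

With 9 and 9 degrees of freedom the F statistic of about 1.719 falls below the critical value of about 2.44 (p-value about 0.215, above 0.10), so H0 is not rejected at the 10% level.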


INTRODUCTION TO ANOVA

• H0 true: all means are nearly similar; minor differences are due to random variation.
• HA true: the means differ (at least one is clearly different); the differences are unlikely to be due to random variation.

STEPS FOR ANOVA


Steps for ANOVA (Contd)


Steps for ANOVA (Contd)…
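A minimal sketch of the usual one-way ANOVA computation, assuming the standard textbook steps: total, between-group and within-group sums of squares, then F = MS between / MS within. The function name and the data are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
import statistics


def one_way_anova(groups):
    """Return the sums of squares, degrees of freedom and F for one-way ANOVA."""
    all_values = [v for group in groups for v in group]
    grand_mean = statistics.fmean(all_values)

    # Total: squared distance of every value from the grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    # Between: squared distance of each group mean from the grand mean,
    # weighted by the group size.
    ss_between = sum(
        len(group) * (statistics.fmean(group) - grand_mean) ** 2
        for group in groups
    )
    # Within: squared distance of every value from its own group mean.
    ss_within = sum(
        (v - statistics.fmean(group)) ** 2 for group in groups for v in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_total": ss_total,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "f": f_stat,
    }


# Hypothetical data: three groups of three observations each.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
for name, value in result.items():
    print(f"{name:>10}: {value:.4g}")
```

For these numbers SS total (32) splits into SS between (26) and SS within (6), giving F = 13 on 2 and 6 degrees of freedom, which is then compared with the F table as in the F-test.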


EXAMPLE 1
BIMTECH students grew tomato plants under different soil cover conditions. Groups of three plants each had one of the
following treatments: bare soil / a commercial ground cover / black plastic / straw / compost.
All plants grew under the same conditions and were the same variety. Students recorded the weight (in grams) of
tomatoes produced by each of the n = 15 plants:


1. Plot the points

[Figure: plot of the 15 tomato weights; vertical axis from 0 to 10,000, horizontal axis from 0 to 16, with the Total Mean of 6147 marked.]


2. Calculate total variance: (distance from mean)²

[Figure: the same axes (0 to 10,000 by 0 to 16) with the Total Mean of 6147 marked.]


[Figure: the same axes (0 to 10,000 by 0 to 16) with the group means marked alongside the Total Mean of 6147: Grp 1 Mean 3512, Grp 2 Mean 5504, Grp 3 Mean 6324, Grp 4 Mean 7804, Grp 5 Mean 7591.]
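The between-group part of the ANOVA table can be worked out from the group means alone. This is a minimal sketch using the rounded means shown on the plot and three plants per group; the within-group part needs the 15 individual weights and is not attempted here.

```python
# Group means as shown on the plot (rounded to whole grams).
group_means = [3512, 5504, 6324, 7804, 7591]
n_per_group = 3  # three plants per treatment

# With equal group sizes, the total mean is the plain average of group means.
grand_mean = sum(group_means) / len(group_means)

# Between-group sum of squares: each group mean's squared distance from the
# total mean, weighted by the number of plants in the group.
ss_between = n_per_group * sum((m - grand_mean) ** 2 for m in group_means)

df_between = len(group_means) - 1
ms_between = ss_between / df_between

print(f"total mean : {grand_mean:.0f}")
print(f"SS between : {ss_between:,.0f}")
print(f"df between : {df_between}")
print(f"MS between : {ms_between:,.0f}")
```

The total mean comes out at 6147, matching the plot. Because the group means are rounded, the sum of squares may differ slightly from a calculation on the raw weights.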


Populating the ANOVA Table


EXAMPLE 2


Thank You!

shekhar@factsNdata.com / 9810228402
