
Session 6, Lecture 7, BIMTECH, 15 Feb 2022 (2/15/2022)

Statistics for Decision Making in Python


Session 6, Lecture 7
Business Vertical – DA, Trimester III, Batch ‘21-’23

V Shekhar Avasthy, 15th Feb, 2022

Privileged and Confidential. All Rights Reserved © Facts n Data 2022 1

What we intend to cover today?

• One Sample t-Test


• Chi-Squared test
• 2 sample t-test


Problem Statement

The label of an energy bar claims that the bar has 20 grams of protein. We want to
know whether the claim is correct.


One sample t-test

• When to use? To determine whether an unknown population mean differs from a
specific value. E.g., is the protein content of an energy bar actually 20 g per bar?
• Typically, H0: μ = Test_Value and Ha: μ ≠ Test_Value. In our case, H0: μ = 20 g, Ha: μ ≠ 20 g.
• Assumptions/ Basic Conditions:
• Data should be continuous.
• Observations should be independent (the value of any observation should NOT
depend on the value of any other observation).
• Data should be a random sample.
• Data should come from a normally distributed population. (For ‘small’ sample sizes, a test for
normality may not be possible. When normality cannot be assumed, this test should not be
used.)


Example
• Imagine we have collected a random sample of 31 energy bars from a number of different stores to represent the
population of energy bars available to the general consumer.
• We measured the protein content (in grams) of these 31 bars and found it to be (values below 20 g were highlighted in red on the slide):
20.70 27.46 22.15 19.85 21.29 24.75 20.75 22.91 25.34 20.33 21.54 21.08
22.14 19.56 21.10 18.04 24.12 19.95 19.72 18.28 16.26 17.46 20.53 22.12
25.06 22.44 19.08 19.88 21.39 22.33 25.79
• Question: Which test to use?
• We need to test whether the population mean (the mean of all bars in the population) is different from 20 g. So, a one sample t-test can be used.
• Are the data values independent? Yes, because we collected bars from different stores (to avoid same batch/ same manufacturing
date). It is logical to believe that the protein content in one bar is INDEPENDENT of that of any other bar.
• Data are continuous.
• It’s a random sample (we took bars from different stores).
• Normality: for “large samples” (n>30), this can be checked by looking at a histogram.
So, one sample t-test is appropriate. Always remember, the selection of any test OR
technique depends on
(A) What is the OBJECTIVE? AND,
(B) Do the data meet the assumptions of the technique(s) that
help you meet that objective?
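The normality check above is done by eye on a histogram. As a rough complement, the sketch below runs a Shapiro-Wilk test on the 31 protein values. The choice of Shapiro-Wilk (`scipy.stats.shapiro`) is an assumption made for illustration; it is not necessarily the check used in the lecture's Excel sheet or notebook.

```python
from scipy import stats

# Protein content (g) of the 31 sampled energy bars, as listed in the example
protein = [
    20.70, 27.46, 22.15, 19.85, 21.29, 24.75, 20.75, 22.91, 25.34, 20.33, 21.54, 21.08,
    22.14, 19.56, 21.10, 18.04, 24.12, 19.95, 19.72, 18.28, 16.26, 17.46, 20.53, 22.12,
    25.06, 22.44, 19.08, 19.88, 21.39, 22.33, 25.79,
]

# Shapiro-Wilk test: H0 = the data come from a normal distribution
w_stat, p_normal = stats.shapiro(protein)

print("Shapiro-Wilk W =", round(w_stat, 4), " p-value =", round(p_normal, 4))
if p_normal > 0.05:
    print("No evidence against normality at the 5% level")
else:
    print("Normality is doubtful; consider a nonparametric alternative")
```

A large p-value here does not prove normality; it only means the sample gives no strong evidence against it, so the histogram is still worth a look.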


See Excel sheet and Jupyter Notebook


…Example
• Step 1: Calculate the Average, Std Dev and “n” for the given sample. In our case, Mean = 21.4; Std Dev = 2.54; n = 31 and Std Error of
Mean = Std Dev / √n = 2.54 / √31 = 0.456

• Step 2: Decide on α. Say, we want to be 95% confident, so, α = 1-95% = 1-0.95 = 0.05. (If we wanted 99% confidence, α = 0.01)

• Step 3: Find the difference between the Sample Mean (21.4, in this case) and the claimed Mean (20, in our case): 21.4-20 = 1.4

• Step 4: Decide if it’s 2-tailed OR 1-tailed (α lies on both sides OR 1 side). For this case, we are interested in a 2-tailed test.

• Step 5: Calculate the Test Statistic: t = Difference / Std Error = 1.4 / 0.456 = 3.07

• Step 6: Get DoF - For the energy bars, n=31, so Degrees of Freedom for the sample = 31-1 = 30.

• Step 7: Refer to the t-value look up table for a 2-tailed test, in this case. For α = 0.05, DoF = 30, t Value = 2.042.

• Step 8: As the test statistic (t-score) 3.07 > t value 2.042, we REJECT the null hypothesis (H0: μ = 20). In other words, the
sample suggests that the mean protein content is not equal to 20 g, OR the sample data do not APPEAR to have come from a
distribution with mean protein content = 20 g.
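The eight steps can be sketched in Python as follows. This is a minimal illustration using the 31 protein values from the example; the variable names are chosen here for readability, and the lecture's own Jupyter Notebook may be organised differently. The last lines cross-check the manual result against `scipy.stats.ttest_1samp`.

```python
import math
from scipy import stats

# Protein content (g) of the 31 sampled energy bars
protein = [
    20.70, 27.46, 22.15, 19.85, 21.29, 24.75, 20.75, 22.91, 25.34, 20.33, 21.54, 21.08,
    22.14, 19.56, 21.10, 18.04, 24.12, 19.95, 19.72, 18.28, 16.26, 17.46, 20.53, 22.12,
    25.06, 22.44, 19.08, 19.88, 21.39, 22.33, 25.79,
]
claimed_mean = 20.0   # value stated on the label (H0)
alpha = 0.05          # 95% confidence, two-tailed

# Step 1: sample size, mean, sample standard deviation, standard error
n = len(protein)
sample_mean = sum(protein) / n
sample_sd = math.sqrt(sum((x - sample_mean) ** 2 for x in protein) / (n - 1))
std_error = sample_sd / math.sqrt(n)

# Steps 3 and 5: difference and test statistic
t_manual = (sample_mean - claimed_mean) / std_error

# Steps 6 and 7: degrees of freedom and two-tailed critical value
dof = n - 1
t_critical = stats.t.ppf(1 - alpha / 2, df=dof)

# Step 8: decision
print("t =", round(t_manual, 2), " critical t =", round(t_critical, 3))
print("Reject H0" if abs(t_manual) > t_critical else "Fail to reject H0")

# Cross-check with SciPy's built-in one sample t-test
t_scipy, p_value = stats.ttest_1samp(protein, popmean=claimed_mean)
print("SciPy t =", round(t_scipy, 2), " p-value =", round(p_value, 4))
```

Comparing the p-value with α gives the same decision as comparing the t-score with the table value, so either route can be used.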


Introduction to Chi Square Test

χ² = Σ (Observed − Expected)² / Expected

• The chi-square distribution can be used to test for relationships between two things, like
grocery prices at different stores.
• Typical questions: Were lottery winning numbers evenly distributed, or did some numbers occur with
a greater frequency? Do the types of movies people prefer differ
across age groups? Does a coffee machine dispense
approximately the same amount of coffee each time?
• The notation for the chi-square distribution is: χ² ∼ χ²(df), where df = degrees of freedom,
which depend on how chi-square is being used.
• Typically used for NOMINAL data

“Chi” in Chinese means “vital force of life”.

χ is the 22nd letter of the Greek alphabet. It looks like X and is pronounced KEE/ KAI, but statisticians commonly call it KAI,
CHAI or CHI.


Problem Statement
• Studies often compare two groups. E.g., the effect of aspirin in preventing heart attacks: one group is given aspirin and
the other group is given a placebo. Then, the heart attack rate is studied over several years.
• Other examples: Politicians compare the proportion of individuals from different segments (such as income brackets /
religion) who might vote for them. Students are interested in whether SAT or GRE preparatory courses really help raise
their scores. Many business applications require comparing two groups. It may be the investment returns of two
different investment strategies, or the differences in production efficiency of different management styles.

• Underlying statement is: Do these two samples appear to have come from same group?


Assumptions
1. The data in the cells should be frequencies or counts of cases rather than percentages or some other
transformation of the data.
2. The levels (or categories) of the variables are mutually exclusive. That is, a particular subject fits into one
and only one level of each of the variables.
3. Each subject may contribute data to one and only one cell in the χ2. If, for example, the same subjects are
tested over time such that the comparisons are of the same subjects at Time 1, Time 2, Time 3, etc., then χ2
may not be used.
4. The study groups must be independent. This means that a different test must be used if the two groups are
related.
5. There are 2 variables, and both are measured as categories, usually at the nominal level. However, data
may be ordinal data. Interval or ratio data that have been collapsed into ordinal categories may also be used.
While Chi-square has no rule about limiting the number of cells (by limiting the number of categories for each
variable), a very large number of cells (over 20) can make it difficult to meet assumption #6 below, and to
interpret the meaning of the results.
6. The expected value should be 5 or more in at least 80% of the cells, and no cell should have an
expected value of less than one.


Example Refer to Excel Sheet

Presume you observe 100 people to see who deposits garbage in the can and who litters. You want to see if
there is a difference based on gender?

Male deposits garbage: 42
Male litters: 33
Female deposits garbage: 18
Female litters: 7

• Basic Assumptions?
• Frequencies? – Yes
• Categories are Mutually Exclusive? – Yes
• Each observation contributes to one cell only? – Yes
• Independent groups? – we shall collect data so that related people do not participate – Yes
• Two variables, both with Nominal categories? –Men/ Women, garbage/ litter – Yes
• Expected value in each cell ≥ 5 and none < 1? – Yes

• We can thus use Chi Squared test


Steps

1. Formulate Hypothesis
• Null hypothesis(H0): No relation between gender and garbage deposit
• Alternate Hypothesis(H1): Relation between gender and garbage deposit

2. Create Contingency matrix

         Deposit   Litter   Total
Women    18        7        25
Men      42        33       75
TOTAL    60        40       100

3. Find expected value: (row total * column total) / grand total

         Deposit              Litter               Total
Women    (25*60)/100 = 15     (25*40)/100 = 10     25
Men      (75*60)/100 = 45     (75*40)/100 = 30     75
TOTAL    60                   40                   100


Steps

4. Apply chi square formula: χ² = ∑(O−E)²/E (calculated value)

χ² = (18−15)²/15 + (7−10)²/10 + (42−45)²/45 + (33−30)²/30
   = 0.6 + 0.9 + 0.2 + 0.3
   = 2

Observed values, with expected values in brackets:

         Deposit    Litter     Total
Women    18 (15)    7 (10)     25
Men      42 (45)    33 (30)    75
TOTAL    60         40         100

5. Find degrees of freedom: (no. of rows − 1)(no. of columns − 1)
= 1*1 = 1

6. The significance level (α) will usually be given; if not, select 0.05.

7. Find the value at the intersection of α and degrees of freedom (df) in the chi-square table (theoretical value). For α = 0.05 and df = 1, this is 3.841.
8. Compare calculated value and theoretical value:
If χ² value > theoretical value, reject the null hypothesis.
If χ² value <= theoretical value, fail to reject the null hypothesis.
Here, 2 < 3.841, so we fail to reject H0: the sample shows no evidence of a relation between gender and garbage deposit.
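As a hedged illustration, the garbage example can be reproduced with `scipy.stats.chi2_contingency`. One detail worth flagging: for a 2x2 table SciPy applies Yates' continuity correction by default, so `correction=False` is passed here to match the hand calculation on the slide. The variable names are chosen for this sketch only.

```python
from scipy.stats import chi2, chi2_contingency

# Observed counts. Rows: women, men. Columns: deposit, litter.
observed = [[18, 7],
            [42, 33]]
alpha = 0.05

# correction=False switches off Yates' continuity correction so that the
# statistic equals the plain sum of (O - E)^2 / E used in the steps above
stat, p, dof, expected = chi2_contingency(observed, correction=False)

# Theoretical (table) value for the chosen alpha and degrees of freedom
critical = chi2.ppf(1 - alpha, dof)

print("Expected counts:", expected.tolist())
print("chi-square =", round(stat, 4), " dof =", dof, " critical =", round(critical, 3))
if stat > critical:
    print("Reject H0: gender and garbage deposit appear related")
else:
    print("Fail to reject H0: no evidence of a relation")
```

Leaving the default `correction=True` would give a smaller statistic than the hand-computed 2, which is a common source of confusion when comparing SciPy output with a manual calculation.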


Refer to Jupyter Notebook


Example 2…
• The aim of the test is to conclude whether the two variables( gender and choice of pet ) are related to each other.
• Null hypothesis: (H0) states that there is no relation between the variables.
• Alternate hypothesis (H1) would state that there is a significant relation between the two.
• Given data:
dog cat bird total
men 207 282 241 730
women 234 242 232 708
total 441 524 473 1438
• Does it meet assumptions? - Verify that it meets all assumptions.
• Define a significance level to decide whether the relation between the variables is statistically significant:
let's select an alpha value of 0.05. This alpha value denotes the probability of erroneously rejecting H0 when it is true. If
the p-value for the test comes out to be strictly greater than the alpha value, we fail to reject H0.
• Expected Values Table : Prepare a similar table of calculated (or expected) values. To do this we need to calculate each item in the new table as:
{row_total * column_total} / {grand_total}

…Example 2
• The expected values table :
        dog            cat            bird           total
men     223.87343533   266.00834492   240.11821975   730
women   217.12656467   257.99165508   232.88178025   708
total   441            524            473            1438

• Chi-Square Table : Prepare this table by calculating for each item the following:
{( Observed_value - Expected_value)^2 } / { Expected_value}

observed (o)   calculated (c)   (o-c)^2 / c
207            223.87           1.2717
282            266.01           0.9613
241            240.12           0.0032
234            217.13           1.3112
242            257.99           0.9912
232            232.88           0.0033
Total                           4.5422

#PYTHON CODE
from scipy.stats import chi2_contingency

# defining the table (rows: men, women; columns: dog, cat, bird)
data = [[207, 282, 241], [234, 242, 232]]
stat, p, dof, expected = chi2_contingency(data)

# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
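The expected values table and the chi-square table can also be built by hand, which helps to see what `chi2_contingency` does internally. The sketch below uses NumPy for this; the use of `np.outer` is simply one convenient way to form row_total * column_total for every cell and is an assumption of this example, not the lecture's stated method.

```python
import numpy as np

# Observed counts. Rows: men, women. Columns: dog, cat, bird.
observed = np.array([[207, 282, 241],
                     [234, 242, 232]])

row_totals = observed.sum(axis=1)     # 730, 708
col_totals = observed.sum(axis=0)     # 441, 524, 473
grand_total = observed.sum()          # 1438

# Expected value per cell: row_total * column_total / grand_total
expected = np.outer(row_totals, col_totals) / grand_total

# Contribution of each cell: (observed - expected)^2 / expected
contributions = (observed - expected) ** 2 / expected
chi_square = contributions.sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(np.round(expected, 2))
print(np.round(contributions, 4))
print("chi-square =", round(float(chi_square), 4), " dof =", dof)
```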

Two Sample t-Test


• Two-sample t-test (also called independent samples t-test) is used to test whether the unknown population means of
two groups are equal or not. It is commonly used to analyse A/B tests.
• Used when
• data values are independent,
• data are randomly sampled from two normal populations and
• the two independent groups have equal variances.

• In case of >2 groups, use a multiple comparison method such as Analysis of variance (ANOVA)/ Tukey-Kramer test of all
pairwise differences/ analysis of means (ANOM) to compare group means to the overall mean / Dunnett’s test to
compare each group mean to a control mean.

• If the variances for the two groups are not equal, a two-sample t-test can still be used, but with a different estimate of the
standard deviation.

• If sample sizes are very small, a test for normality may not be possible and we might need to rely on our understanding of the
data. When normality cannot be safely assumed, perform a nonparametric test that doesn’t assume normality.
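The two fallbacks mentioned above can be sketched with SciPy. This example uses the body fat data from the worked example in this lecture. Two choices here are assumptions made for illustration, since the slide does not name specific methods: Welch's t-test (`equal_var=False`) for unequal variances, and the Mann-Whitney U test as the nonparametric option.

```python
from scipy import stats

# Body fat percentages from the lecture's two-sample example
men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
women = [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]

# Unequal variances: Welch's t-test does not pool the two variances
welch_t, welch_p = stats.ttest_ind(women, men, equal_var=False)
print("Welch t =", round(welch_t, 2), " p-value =", round(welch_p, 4))

# Normality in doubt: Mann-Whitney U test works on ranks, not raw values
u_stat, u_p = stats.mannwhitneyu(women, men, alternative="two-sided")
print("Mann-Whitney U =", u_stat, " p-value =", round(u_p, 4))
```

Both calls take the two groups as separate sequences, so the groups do not need to be the same size.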


When to use?

• For the two-sample t-test, we need two variables.


• One variable defines the two groups (E.g., Male/ Female).
• Second variable is the measurement of interest (E.g., Height etc).
• We also have an idea, or hypothesis, that the means of the underlying populations for
the two groups are different. Here are a couple of examples:
• We have students who speak English as their first language and students who do not. All students
take a reading test. Our two groups are the native English speakers and the non-native speakers. Our
measurements are the test scores. Our idea is that the mean test scores for the underlying
populations of native and non-native English speakers are not the same. We want to know if the
mean score for the population of native English speakers is different from the people who learned
English as a second language.
• We measure the grams of protein in two different brands of energy bars. Our two groups are the
two brands. Our measurement is the grams of protein for each energy bar. Our idea is that the mean
grams of protein for the underlying populations for the two brands may be different. We want to
know if we have evidence that the mean grams of protein for the two brands of energy bars is
different or not.

Assumptions / Necessary Conditions

• Data values must be independent.


• Measurements for one observation do not affect measurements for any
other observation.
• Data in each group must be obtained via a random sample from the
population.
• Data in each group are normally distributed.
• Data values are continuous.
• The variances for the two independent groups are equal.


Example
• One way to measure a person’s fitness is to measure their body fat percentage. Average body fat percentages vary by age, but according to
some guidelines, the normal range for men is 15-20% body fat, and the normal range for women is 20-25% body fat.
• Given data:
Men: 13.3 6.0 20.0 8.0 14.0 19.0 18.0 25.0 16.0 24.0 15.0 1.0 15.0
Women: 22.0 16.0 21.7 21.0 30.0 26.0 12.0 23.2 28.0 23.0

QUICK CHECKS:
• Two groups? Yes
• Variable Continuous? Yes
• Data values independent? (The body fat for any one person does not depend on the body fat for another person.) - Yes
• Random sample from the population? – Yes (assumed)
• Normally distributed? Check Histogram - YES
• Variances for men and women equal? Check – Yes

Two-sample t-test possible!

FROM EXCEL:
The two histograms are on the same scale. From a quick look, we can see
that there are no very unusual points, or outliers. The data look roughly bell-shaped,
so our initial idea of a normal distribution seems reasonable.
Examining the summary statistics, we see that the standard deviations are
similar. This supports the idea of equal variances. We can also check this
using a test for variances.
Based on these observations, the two-sample t-test appears to be an
appropriate method to test for a difference in means.
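The equal-variance check can be sketched as below. The slide does not say which test for variances is meant, so the use of Levene's test (`scipy.stats.levene`) is an assumption for illustration. The sample standard deviations are printed first, mirroring the summary statistics read from Excel.

```python
from statistics import stdev
from scipy import stats

# Body fat percentages from the example
men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
women = [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]

# Summary statistics: sample standard deviation of each group
sd_men = stdev(men)
sd_women = stdev(women)
print("SD men =", round(sd_men, 2), " SD women =", round(sd_women, 2))

# Levene's test: H0 = the two groups have equal variances
lev_stat, lev_p = stats.levene(men, women)
print("Levene statistic =", round(lev_stat, 3), " p-value =", round(lev_p, 4))
if lev_p > 0.05:
    print("No evidence against equal variances at the 5% level")
else:
    print("Variances look different; consider Welch's t-test")
```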

…Example
• Without doing any testing, it can be seen that the averages for men and women in our samples are not the same. But
how different are they? Are the averages “close enough” to conclude that mean body fat is the same for the larger
population of men and women? Or are the averages too different to make this conclusion?
• Start by calculating our test statistic. This calculation begins with finding the difference between the two averages:
22.29 − 14.95 = 7.34. This difference in our samples estimates the difference between the population means for the two
groups.
• Next, calculate the pooled variance: s_p² = [(n1 − 1)·s1² + (n2 − 1)·s2²] / (n1 + n2 − 2) = 38.88

• …and the pooled SD, which is the square root of the pooled variance: s_p = (38.88)^0.5 = 6.24

• And the test statistic: t = difference / (s_p × √(1/n1 + 1/n2)) = 7.34 / (6.24 × √(1/10 + 1/13)) = 2.80

…Example…
• Decide on the risk, say, a 5% risk of saying that the unknown population means for men and
women are not equal when they really are. So, α = 0.05.
• Calculate the test statistic – in this case, 2.80.
• Find the theoretical value from the t-distribution. To find this value, we need the
significance level (α = 0.05) and the degrees of freedom. The degrees of freedom (df) are
based on the sample sizes of the two groups. For the body fat data, this is: df = n1 + n2 − 2
= 10 + 13 − 2 = 21
• The two-tailed t value with α = 0.05 and 21 degrees of freedom is 2.080, from the table.
• Compare the value of the t-statistic (2.80) to the t value (2.080). Since 2.80 > 2.080, reject the
null hypothesis that the mean body fat for men and women is equal, and conclude that
we have evidence that mean body fat in the population differs between men and women at the
5% significance level (95% confidence).
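The whole body fat calculation can be sketched in Python as follows. This is a minimal illustration with variable names chosen here; it computes the pooled variance and t-statistic by hand and then cross-checks them against `scipy.stats.ttest_ind` with `equal_var=True`, which is SciPy's pooled-variance two-sample t-test.

```python
import math
from statistics import mean, variance
from scipy import stats

# Body fat percentages from the example
men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
women = [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]
alpha = 0.05

n1, n2 = len(women), len(men)           # 10 and 13
diff = mean(women) - mean(men)          # difference between the two averages

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * variance(women) + (n2 - 1) * variance(men)) / (n1 + n2 - 2)
pooled_sd = math.sqrt(pooled_var)

# Test statistic and two-tailed critical value
t_manual = diff / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
dof = n1 + n2 - 2
t_critical = stats.t.ppf(1 - alpha / 2, df=dof)

print("difference =", round(diff, 2), " pooled variance =", round(pooled_var, 2))
print("t =", round(t_manual, 2), " dof =", dof, " critical t =", round(t_critical, 3))
print("Reject H0" if abs(t_manual) > t_critical else "Fail to reject H0")

# Cross-check with SciPy's pooled-variance two-sample t-test
t_scipy, p_value = stats.ttest_ind(women, men, equal_var=True)
print("SciPy t =", round(t_scipy, 2), " p-value =", round(p_value, 4))
```

Passing `women` first keeps the sign of the t-statistic positive, matching the 22.29 − 14.95 ordering used in the slides.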


shekhar@factsNdata.com / 9810228402
