Session 6 Lecture 7 20220215 Slides

Session 6, Lecture 7, BIMTECH, 15 Feb 2022
Statistics for Decision Making in Python

Session 6, Lecture 7
Business Vertical – DA, Trimester III, Batch ‘21-’23
V Shekhar Avasthy, 15th Feb, 2022
Privileged and Confidential. All Rights Reserved © Facts n Data 2022
What we intend to cover today?
• One Sample t-Test

• Chi-Squared test
• 2 sample t-test
Problem Statement
The label of an energy bar claims that the bar has 20 grams of protein. We want to know whether the claim is correct.
One sample t-test
• When to use? To determine whether an unknown population mean differs from a specific value. E.g., is the protein content of an energy bar actually 20 g per bar?
• Typically, H0: μ = Test_Value, Ha: μ ≠ Test_Value. In our case, H0: μ = 20 g, Ha: μ ≠ 20 g.
• Assumptions/ Basic Conditions:
• Data should be continuous.
• Observations should be independent (the value of any observation should NOT depend on the value of any other observation).
• Data should be a random sample.
• Data should be from a normal population. (For ‘small’ sample sizes, a test for normality may not be possible. When normality cannot be assumed, this test should not be used.)
Example
• Imagine we have collected a random sample of 31 energy bars from a number of different stores to represent the
population of energy bars available to the general consumer.
• We measured the protein content of these 31 bars and found it to be (on the slide, values below 20 g are shown in red):
20.70 27.46 22.15 19.85 21.29 24.75 20.75 22.91 25.34 20.33 21.54 21.08
22.14 19.56 21.10 18.04 24.12 19.95 19.72 18.28 16.26 17.46 20.53 22.12
25.06 22.44 19.08 19.88 21.39 22.33 25.79
• Question: Which test to use?
• We need to test whether the population mean (the mean of all bars in the population) is different from 20 g, so a one sample t-test can be used.
• Are the data values independent? Yes, because we collected bars from different stores (to avoid the same batch/ same manufacturing date). It is logical to believe that the protein content in one bar is INDEPENDENT of that of any other bar.
• Data are continuous.
• It’s a random sample (we took bars from different stores).
• Normality: for “large samples” (n>30), this can be checked by looking at a histogram.
So, one sample t-test is appropriate. Always remember, selection of any test OR technique depends on
(A) What is the OBJECTIVE? AND,
(B) Does the data meet the assumptions of the technique(s) that help you meet that objective?
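The histogram check for normality can be sketched in Python. This is a minimal sketch, assuming NumPy is available; the variable names and the choice of six bins are illustrative and are not taken from the slides or the course notebook.

```python
import numpy as np

# Protein content (g) of the 31 sampled energy bars, copied from the slide
protein = np.array([
    20.70, 27.46, 22.15, 19.85, 21.29, 24.75, 20.75, 22.91, 25.34, 20.33,
    21.54, 21.08, 22.14, 19.56, 21.10, 18.04, 24.12, 19.95, 19.72, 18.28,
    16.26, 17.46, 20.53, 22.12, 25.06, 22.44, 19.08, 19.88, 21.39, 22.33,
    25.79,
])

n = protein.size

# Bin the data into 6 equal-width bins (the bin count is an arbitrary choice)
counts, edges = np.histogram(protein, bins=6)

print(f"n = {n}")
for count, low, high in zip(counts, edges[:-1], edges[1:]):
    # Text histogram: one '#' per observation in the bin
    print(f"{low:6.2f} to {high:6.2f} | {'#' * int(count)}")
```

A roughly bell-shaped pattern of bars, with no isolated extreme values, is what we would hope to see before relying on the t-test.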

See Excel sheet and Jupyter Notebook

…Example
• Step 1: Calculate the average, standard deviation and “n” for the given sample. In our case, Mean = 21.4; Std Dev = 2.54; n = 31; and Std Error of the Mean = Std Dev / √n = 2.54 / √31 = 0.456
• Step 2: Decide on α. Say we want to be 95% confident, so α = 1-95% = 1-0.95 = 0.05. (If we wanted 99% confidence, α = 0.01)
• Step 3: Find the difference between the sample mean (21.4, in this case) and the claimed mean (20, in our case): 21.4-20 = 1.4
• Step 4: Decide if it’s 2-tailed OR 1-tailed (α lies on both sides OR on 1 side). For this case, we are interested in a 2-tailed test.
• Step 5: Calculate the test statistic: t = Difference / Std Error = 1.4 / 0.456 = 3.07
• Step 6: Get DoF. For the energy bars, n=31, so Degrees of Freedom for the sample = 31-1 = 30.
• Step 7: Refer to the t-value look up table for a 2-tailed test, in this case. For α = 0.05, DoF = 30, t value = 2.042.
• Step 8: As the test statistic (t-score) 3.07 > t value 2.042, we REJECT the null hypothesis (H0: μ = 20). In other words, the sample suggests that the mean protein content is not equal to 20 g, i.e., the sample data do not APPEAR to have come from a population whose mean protein content is 20 g.
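The eight steps can be reproduced in a few lines of Python. This is a minimal sketch, assuming NumPy and SciPy are installed; the variable names are illustrative, and the course's own Jupyter notebook may do this differently.

```python
import numpy as np
from scipy import stats

# Protein content (g) of the 31 sampled energy bars, copied from the slide
protein = np.array([
    20.70, 27.46, 22.15, 19.85, 21.29, 24.75, 20.75, 22.91, 25.34, 20.33,
    21.54, 21.08, 22.14, 19.56, 21.10, 18.04, 24.12, 19.95, 19.72, 18.28,
    16.26, 17.46, 20.53, 22.12, 25.06, 22.44, 19.08, 19.88, 21.39, 22.33,
    25.79,
])
claimed_mean = 20.0   # value stated on the label (H0)
alpha = 0.05          # 95% confidence

# Steps 1, 3, 5, 6: summary statistics, difference, test statistic, DoF
n = protein.size
mean = protein.mean()
sd = protein.std(ddof=1)               # sample standard deviation
se = sd / np.sqrt(n)                   # standard error of the mean
t_manual = (mean - claimed_mean) / se
dof = n - 1

# Step 7: two-tailed critical value instead of a printed look up table
t_crit = stats.t.ppf(1 - alpha / 2, dof)

# Step 8: decision
reject_h0 = abs(t_manual) > t_crit

# Cross-check with SciPy's built-in one sample t-test (two-sided by default)
t_scipy, p_value = stats.ttest_1samp(protein, popmean=claimed_mean)

print(f"mean={mean:.2f}, sd={sd:.2f}, se={se:.3f}")
print(f"t={t_manual:.2f}, critical t={t_crit:.3f}, p-value={p_value:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Comparing the p-value with α leads to the same decision as comparing the t statistic with the critical value.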

Introduction to Chi Square Test

$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
• The chi-square distribution can be used to find relationships between two things, like
grocery prices at different stores.
• Were lottery winning numbers evenly distributed, or did some numbers occur with greater frequency? Do the types of movies people prefer differ across age groups? Does a coffee machine dispense approximately the same amount of coffee each time?
• The notation for the chi-square distribution is: $\chi \sim \chi^2_{df}$, where df = degrees of freedom, which depends on how chi-square is being used.
• Typically used for NOMINAL data
“Chi” in Chinese means “vital force of life”.

It is the 22nd letter of the Greek alphabet (χ). It looks like X and is pronounced KEE/ KAI, but statisticians commonly call it KAI, CHAI or CHI.
Problem Statement
• Studies often compare two groups. E.g., to study the effect of aspirin in preventing heart attacks, one group is given aspirin and the other group is given a placebo. Then, the heart attack rate is studied over several years.
• Other examples: Politicians compare the proportion of individuals from different segments (such as income brackets /
religion) who might vote for them. Students are interested in whether SAT or GRE preparatory courses really help raise
their scores. Many business applications require comparing two groups. It may be the investment returns of two
different investment strategies, or the differences in production efficiency of different management styles.
• Underlying statement is: Do these two samples appear to have come from same group?
Assumptions
1. The data in the cells should be frequencies or counts of cases rather than percentages or some other
transformation of the data.
2. The levels (or categories) of the variables are mutually exclusive. That is, a particular subject fits into one
and only one level of each of the variables.
3. Each subject may contribute data to one and only one cell in the χ2. If, for example, the same subjects are
tested over time such that the comparisons are of the same subjects at Time 1, Time 2, Time 3, etc., then χ2
may not be used.
4. The study groups must be independent. This means that a different test must be used if the two groups are
related.
5. There are 2 variables, and both are measured as categories, usually at the nominal level. However, data
may be ordinal data. Interval or ratio data that have been collapsed into ordinal categories may also be used.
While Chi-square has no rule about limiting the number of cells (by limiting the number of categories for each
variable), a very large number of cells (over 20) can make it difficult to meet assumption #6 below, and to
interpret the meaning of the results.
6. The expected count should be 5 or more in at least 80% of the cells, and no cell should have an expected count of less than one.
Example (Refer to Excel Sheet)
Presume you observe 100 people to see who deposits garbage in the can and who litters. You want to see if there is a difference based on gender.
• Male deposits garbage: 42
• Male litters: 33
• Female deposits garbage: 18
• Female litters: 7
• Basic Assumptions?
• Frequencies? – Yes
• Categories are Mutually Exclusive? – Yes
• Each observation contributes to one cell only? – Yes
• Independent groups? – we shall collect data so that related people do not participate – Yes
• Two variables, both with Nominal categories? –Men/ Women, garbage/ litter – Yes
• Expected count in each cell is 5 or more, and none is < 1? – Yes
• We can thus use Chi Squared test
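The expected-count assumption can be checked numerically before running the test. This is a minimal sketch, assuming NumPy is available; it uses the standard rule that a cell's expected count is row total × column total / grand total, and the variable names are illustrative.

```python
import numpy as np

# Observed counts from the slide (rows: women, men; columns: deposit, litter)
observed = np.array([[18, 7],
                     [42, 33]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count per cell = row total * column total / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# At least 80% of cells with expected count >= 5, and no cell below 1
share_at_least_5 = (expected >= 5).mean()
assumption_met = bool(share_at_least_5 >= 0.8 and expected.min() >= 1)

print(expected)
print("Expected-count assumption met:", assumption_met)
```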
Steps
1. Formulate Hypothesis
• Null hypothesis(H0): No relation between gender and garbage deposit
• Alternate Hypothesis(H1): Relation between gender and garbage deposit
2. Create Contingency matrix

        Deposit   Litter   Total
Women   18        7        25
Men     42        33       75
TOTAL   60        40       100

3. Find expected value: (row total * column total) / grand total

        Deposit               Litter                Total
Women   (25*60)/100 = 15      (25*40)/100 = 10      25
Men     (75*60)/100 = 45      (75*40)/100 = 30      75
TOTAL   60                    40                    100
Steps
4. Apply chi square formula: χ² = ∑(O-E)²/E (calculated value). Expected values are shown in parentheses:

        Deposit   Litter    Total
Women   18 (15)   7 (10)    25
Men     42 (45)   33 (30)   75
TOTAL   60        40        100

χ² = (18-15)²/15 + (7-10)²/10 + (42-45)²/45 + (33-30)²/30 = 0.6 + 0.9 + 0.2 + 0.3 = 2
5. Find degrees of freedom: (no. of rows - 1)(no. of columns - 1) = 1*1 = 1
6. The significance level (α) will usually be given; if not, select 0.05.
7. Find the value at the intersection of α and the degrees of freedom (df) in the chi-square table (theoretical value). For α = 0.05 and df = 1, this is 3.841.
8. Compare the calculated value and the theoretical value:
If χ² value > theoretical value, reject the null hypothesis.
If χ² value <= theoretical value, fail to reject the null hypothesis.
Here 2 < 3.841, so we fail to reject H0: the sample gives no evidence of a relation between gender and garbage deposit.
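The same steps can be run in Python. This is a minimal sketch, assuming SciPy is installed; the variable names are illustrative. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so `correction=False` is passed here to match the hand calculation.

```python
from scipy.stats import chi2, chi2_contingency

# Observed counts (rows: women, men; columns: deposit, litter)
observed = [[18, 7],
            [42, 33]]
alpha = 0.05

# correction=False switches off Yates' continuity correction so that the
# statistic equals the plain sum of (O-E)^2/E computed by hand
stat, p, dof, expected = chi2_contingency(observed, correction=False)

# Theoretical (critical) value, replacing the printed chi-square table
critical = chi2.ppf(1 - alpha, dof)

reject_h0 = stat > critical

print(f"chi-square = {stat:.3f}, df = {dof}, critical value = {critical:.3f}")
print(f"p-value = {p:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

With the default `correction=True`, the statistic would be smaller than the hand-calculated value, which is worth remembering when comparing notebook output against a manual calculation.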
Refer to Jupyter Notebook

Example 2…
• The aim of the test is to conclude whether the two variables( gender and choice of pet ) are related to each other.
• Null hypothesis: (H0) states that there is no relation between the variables.
• Alternate hypothesis (H1) would state that there is a significant relation between the two.
• Given data:

        dog   cat   bird   total
men     207   282   241    730
women   234   242   232    708
total   441   524   473    1438
• Does it meet assumptions? - Verify that it meets all assumptions.
• Define a significance level to decide whether the relation between the variables is statistically significant. Let's select an alpha value of 0.05. This alpha value denotes the probability of erroneously rejecting H0 when it is true. If the p-value for the test comes out strictly greater than alpha, we fail to reject H0; otherwise we reject it.
…Example 2
• Expected Values Table: Next, prepare a similar table of calculated (or expected) values. Calculate each item in the new table as: {row_total * column_total} / {grand_total}
• The expected values table:

        dog            cat            bird           total
men     223.87343533   266.00834492   240.11821975   730
women   217.12656467   257.99165508   232.88178025   708
total   441            524            473            1438

• Chi-Square Table: Prepare this table by calculating for each item the following:
{(Observed_value - Expected_value)^2} / {Expected_value}

observed (o)   calculated (c)   (o-c)^2 / c
207            223.87           1.2717
282            266.00           0.9613
241            240.11           0.0032
234            217.12           1.3112
242            257.99           0.9912
232            232.88           0.0033
Total                           4.5422

#PYTHON CODE
```python
from scipy.stats import chi2_contingency

# defining the table (rows: men, women; columns: dog, cat, bird)
data = [[207, 282, 241], [234, 242, 232]]
stat, p, dof, expected = chi2_contingency(data)

# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
```
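The chi-square table for Example 2 can also be built step by step, which makes it easy to check the total of 4.5422 against the critical value. This is a minimal sketch, assuming NumPy and SciPy are installed; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts (rows: men, women; columns: dog, cat, bird)
observed = np.array([[207, 282, 241],
                     [234, 242, 232]])
alpha = 0.05

# Expected value per cell = row_total * column_total / grand_total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Per-cell contributions (o-c)^2 / c and their total
contributions = (observed - expected) ** 2 / expected
stat = contributions.sum()

# Degrees of freedom = (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

critical = chi2.ppf(1 - alpha, dof)   # theoretical value from the distribution
p_value = chi2.sf(stat, dof)          # upper-tail probability

print(np.round(contributions, 4))
print(f"chi-square = {stat:.4f}, df = {dof}, critical value = {critical:.3f}")
print(f"p-value = {p_value:.4f}")
```

Since the statistic is below the critical value (equivalently, the p-value is above alpha), the data do not give evidence that gender and choice of pet are related.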
Two Sample t-Test

• Two-sample t-test (also called independent samples t-test) is used to test whether the unknown population means of two groups are equal or not. It is often the test behind an A/B test.
• Used when
• data values are independent,
• are randomly sampled from two normal populations and
• the two independent groups have equal variances.
• In case of >2 groups, use Analysis of Variance (ANOVA) or a multiple comparison method such as the Tukey-Kramer test of all pairwise differences, analysis of means (ANOM) to compare group means to the overall mean, or Dunnett’s test to compare each group mean to a control mean.
• If the variances for the two groups are not equal, a two-sample t-test can still be used, but with a different estimate of the standard deviation (the unequal-variance, or Welch, form of the test).
• If sample sizes are very small, a test for normality may not be possible and we might need to rely on our understanding of the data. When normality cannot be safely assumed, perform a nonparametric test that doesn’t assume normality.
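The two fallbacks mentioned above can be sketched with SciPy. This is a minimal sketch under stated assumptions: the data are the body fat percentages used in this lecture's two-sample example, and the Mann-Whitney U test is one common nonparametric choice; the slides do not name a specific nonparametric test.

```python
from scipy import stats

# Body fat percentages from this lecture's two-sample example
men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
women = [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]

# Welch's t-test: equal_var=False drops the equal-variance assumption
welch_t, welch_p = stats.ttest_ind(women, men, equal_var=False)

# Mann-Whitney U test: a nonparametric alternative that does not assume
# normality (chosen here for illustration)
u_stat, u_p = stats.mannwhitneyu(women, men, alternative="two-sided")

print(f"Welch t = {welch_t:.2f}, p-value = {welch_p:.4f}")
print(f"Mann-Whitney U = {u_stat:.1f}, p-value = {u_p:.4f}")
```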
When to use?
• For the two-sample t-test, we need two variables.

• One variable defines the two groups (E.g., Male/ Female).
• Second variable is the measurement of interest (E.g., Height etc).
• We also have an idea, or hypothesis, that the means of the underlying populations for
the two groups are different. Here are a couple of examples:
• We have students who speak English as their first language and students who do not. All students
take a reading test. Our two groups are the native English speakers and the non-native speakers. Our
measurements are the test scores. Our idea is that the mean test scores for the underlying
populations of native and non-native English speakers are not the same. We want to know if the
mean score for the population of native English speakers is different from the people who learned
English as a second language.
• We measure the grams of protein in two different brands of energy bars. Our two groups are the
two brands. Our measurement is the grams of protein for each energy bar. Our idea is that the mean
grams of protein for the underlying populations for the two brands may be different. We want to
know if we have evidence that the mean grams of protein for the two brands of energy bars is
different or not.
Assumptions / Necessary Conditions
• Data values must be independent.

• Measurements for one observation do not affect measurements for any
other observation.
• Data in each group must be obtained via a random sample from the
population.
• Data in each group are normally distributed.
• Data values are continuous.
• The variances for the two independent groups are equal.
Example
• One way to measure a person’s fitness is to measure their body fat percentage. Average body fat percentages vary by age, but according to
some guidelines, the normal range for men is 15-20% body fat, and the normal range for women is 20-25% body fat.
• Given data:
Men: 13.3 6.0 20.0 8.0 14.0 19.0 18.0 25.0 16.0 24.0 15.0 1.0 15.0
Women: 22.0 16.0 21.7 21.0 30.0 26.0 12.0 23.2 28.0 23.0
QUICK CHECKS:
• Two groups? Yes
• Variable Continuous? Yes
• Data values independent? (The body fat for any one person does not depend on the body fat for another person.) - Yes
• Random sample from the population? – Yes (assumed)
• Normally distributed? Check Histogram - YES
• Variances for men and women equal? Check – Yes
Two-sample t-Test possible!
FROM EXCEL:
The two histograms are on the same scale. From a quick look, we can see that there are no very unusual points, or outliers. The data look roughly bell-shaped, so our initial idea of a normal distribution seems reasonable. Examining the summary statistics, we see that the standard deviations are similar. This supports the idea of equal variances. We can also check this using a test for variances.
Based on these observations, the two-sample t-test appears to be an appropriate method to test for a difference in means.
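The summary statistics and a test for variances can be computed in Python. This is a minimal sketch, assuming NumPy and SciPy are installed; Levene's test is used here as one possible test for equal variances, since the slides do not say which test is intended.

```python
import numpy as np
from scipy import stats

# Body fat percentages from the slide
men = np.array([13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0,
                16.0, 24.0, 15.0, 1.0, 15.0])
women = np.array([22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0])

# Summary statistics per group (sample standard deviation, ddof=1)
sd_men = men.std(ddof=1)
sd_women = women.std(ddof=1)
print(f"Men:   n={men.size}, mean={men.mean():.2f}, sd={sd_men:.2f}")
print(f"Women: n={women.size}, mean={women.mean():.2f}, sd={sd_women:.2f}")

# Levene's test: H0 is that both groups have equal variances.
# A p-value above the chosen alpha gives no evidence against equal variances.
lev_stat, lev_p = stats.levene(men, women)
print(f"Levene statistic = {lev_stat:.3f}, p-value = {lev_p:.4f}")
```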
…Example
• Without doing any testing, it can be seen that the averages for men and women in our samples are not the same. But
how different are they? Are the averages “close enough” to conclude that mean body fat is the same for the larger
population of men and women? Or are the averages too different to make this conclusion?
• Start by calculating our test statistic. This calculation begins with finding the difference between the two averages:
22.29−14.95 =7.34 - This difference in our samples estimates the difference between the population means for the two
groups.
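• Next, calculate the pooled variance: $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = 38.88$, where $s_1$ and $s_2$ are the sample standard deviations of the two groups.
• …and the pooled standard deviation is the square root of the pooled variance: $s_p = (38.88)^{0.5} = 6.24$
• And the test statistic: $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}} = \frac{7.34}{6.24 \sqrt{1/10 + 1/13}} = 2.80$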
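The pooled calculation can be verified directly from the data. This is a minimal sketch, assuming NumPy is available; it uses the standard pooled-variance formula for the equal-variance two-sample t-test, and the variable names are illustrative.

```python
import numpy as np

# Body fat percentages from the slide
men = np.array([13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0,
                16.0, 24.0, 15.0, 1.0, 15.0])
women = np.array([22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0])

n1, n2 = women.size, men.size

# Difference between the two sample averages
diff = women.mean() - men.mean()

# Pooled variance: sample variances weighted by their degrees of freedom
pooled_var = ((n1 - 1) * women.var(ddof=1) + (n2 - 1) * men.var(ddof=1)) / (n1 + n2 - 2)
pooled_sd = np.sqrt(pooled_var)

# Test statistic: difference divided by its standard error
t_stat = diff / (pooled_sd * np.sqrt(1 / n1 + 1 / n2))

print(f"difference = {diff:.2f}")
print(f"pooled variance = {pooled_var:.2f}, pooled sd = {pooled_sd:.2f}")
print(f"t = {t_stat:.2f}")
```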
…Example…
• Decide on the risk, say, a 5% risk of saying that the unknown population means for men and women are not equal when they really are. So, α = 0.05.
• Calculate the test statistic – in this case, 2.80.
• Find the theoretical value from the t-distribution. To find this value, we need the significance level (α = 0.05) and the degrees of freedom. The degrees of freedom (df) are based on the sample sizes of the two groups. For the body fat data, this is: df = n1 + n2 − 2 = 10 + 13 − 2 = 21
• The two-tailed t value with α = 0.05 and 21 degrees of freedom is 2.080, from the table.
• Compare the value of the t-statistic (2.80) to the t value (2.080). Since 2.80 > 2.080, reject the null hypothesis that the mean body fat for men and women is equal, and conclude, at the 5% significance level, that we have evidence that mean body fat in the population differs between men and women.
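The full test can be run with SciPy and compared with the table-based decision. This is a minimal sketch, assuming SciPy is installed; the variable names are illustrative, and `equal_var=True` selects the pooled (equal-variance) form used in this example.

```python
from scipy import stats

# Body fat percentages from the slide
men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
women = [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]
alpha = 0.05

# Pooled two-sample t-test (two-sided by default)
t_stat, p_value = stats.ttest_ind(women, men, equal_var=True)

# Degrees of freedom and two-tailed critical value, replacing the table
dof = len(women) + len(men) - 2
t_crit = stats.t.ppf(1 - alpha / 2, dof)

reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.2f}, df = {dof}, critical t = {t_crit:.3f}")
print(f"p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```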
shekhar@factsNdata.com / 9810228402