2022
Problem Statement
Example
• Imagine we have collected a random sample of 31 energy bars from a number of different stores to represent the
population of energy bars available to the general consumer.
• We measured the protein content of these 31 bars and found the following values in grams (several are below 20 g):
20.70 27.46 22.15 19.85 21.29 24.75 20.75 22.91 25.34 20.33 21.54 21.08
22.14 19.56 21.10 18.04 24.12 19.95 19.72 18.28 16.26 17.46 20.53 22.12
25.06 22.44 19.08 19.88 21.39 22.33 25.79
• Question: Which test should we use?
• We need to test whether the population mean (the mean of all bars in the population) is different from 20 g, so a one-sample t-test can be used.
• Are the data values independent? Yes, because we collected bars from different stores (to avoid the same batch or manufacturing date). It is logical to believe that the protein content in one bar is INDEPENDENT of that of any other bar.
• Data are continuous.
• It’s a random sample (we took bars from different stores).
• Data should be approximately normally distributed. For “large samples” (n>30), this can be checked by looking at a histogram.
So, a one-sample t-test is appropriate. Always remember, the selection of any test OR technique depends on
(A) What is the OBJECTIVE? AND,
(B) Do the data meet the assumptions of the technique(s) that help you meet that objective?
• Step 2: Decide on α. Say we want to be 95% confident, so α = 1 − 95% = 1 − 0.95 = 0.05. (If we wanted 99% confidence, α = 0.01.)
• Step 3: Find the difference between the sample mean (21.4 in this case) and the claimed mean (20 in our case): 21.4 − 20 = 1.4
• Step 4: Decide whether the test is 2-tailed or 1-tailed (α split across both sides or placed on 1 side). For this case, we are interested in a 2-tailed test.
• Step 5: Calculate the test statistic: t = Difference / Std Error = 1.4 / 0.456 = 3.07
• Step 6: Get the DoF. For the energy bars, n = 31, so the degrees of freedom for the sample = 31 − 1 = 30.
• Step 7: Refer to the t-value lookup table for a 2-tailed test. For α = 0.05 and DoF = 30, the t value = 2.042.
• Step 8: As the test statistic (t-score) 3.07 > t value 2.042, we REJECT the null hypothesis (H0: mean = 20). In other words, the sample suggests that the mean protein content is not equal to 20 g, OR the sample data do not APPEAR to have come from a population whose mean protein content is 20 g.
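The steps above can be sketched in Python. This is a minimal illustration, assuming SciPy is installed; the variable names are made up for this sketch, and `stats.t.ppf` stands in for the printed t-table lookup.

```python
import math
from scipy import stats

# Protein content (g) of the 31 sampled energy bars, as listed above.
protein = [
    20.70, 27.46, 22.15, 19.85, 21.29, 24.75, 20.75, 22.91, 25.34, 20.33, 21.54, 21.08,
    22.14, 19.56, 21.10, 18.04, 24.12, 19.95, 19.72, 18.28, 16.26, 17.46, 20.53, 22.12,
    25.06, 22.44, 19.08, 19.88, 21.39, 22.33, 25.79,
]
claimed_mean = 20.0  # H0: population mean = 20 g
alpha = 0.05         # two-tailed significance level

n = len(protein)
sample_mean = sum(protein) / n
# Sample standard deviation (n - 1 in the denominator).
sample_sd = math.sqrt(sum((x - sample_mean) ** 2 for x in protein) / (n - 1))
std_error = sample_sd / math.sqrt(n)

t_stat = (sample_mean - claimed_mean) / std_error
dof = n - 1
# Two-tailed critical value: alpha is split across both tails.
t_crit = stats.t.ppf(1 - alpha / 2, dof)

# Cross-check against SciPy's built-in one-sample t-test.
result = stats.ttest_1samp(protein, popmean=claimed_mean)

print(f"mean={sample_mean:.2f}, SE={std_error:.3f}, t={t_stat:.2f}, critical t={t_crit:.3f}")
print(f"scipy: t={result.statistic:.2f}, p={result.pvalue:.4f}")
print("Reject H0" if abs(t_stat) > t_crit else "Fail to reject H0")
```

The manual calculation and `ttest_1samp` should agree on the t statistic; the p-value from SciPy gives the same decision without a lookup table.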
• The chi-square distribution can be used to test for relationships between two variables, like grocery prices at different stores.
• Are lottery winning numbers evenly distributed, or do some numbers occur with greater frequency? Do the types of movies people prefer differ across age groups? Does a coffee machine dispense approximately the same amount of coffee each time?
• The notation for the chi-square distribution is: χ ∼ χ²_df, where df = degrees of freedom, which depends on how chi-square is being used.
• Typically used for NOMINAL data.
Problem Statement
• Studies often compare two groups. E.g., the effect of aspirin in preventing heart attacks: one group is given aspirin and the other group is given a placebo. Then, the heart attack rate is studied over several years.
• Other examples: Politicians compare the proportion of individuals from different segments (such as income brackets /
religion) who might vote for them. Students are interested in whether SAT or GRE preparatory courses really help raise
their scores. Many business applications require comparing two groups. It may be the investment returns of two
different investment strategies, or the differences in production efficiency of different management styles.
• The underlying question is: Do these two samples appear to have come from the same group?
Assumptions
1. The data in the cells should be frequencies or counts of cases rather than percentages or some other
transformation of the data.
2. The levels (or categories) of the variables are mutually exclusive. That is, a particular subject fits into one
and only one level of each of the variables.
3. Each subject may contribute data to one and only one cell in the χ². If, for example, the same subjects are
tested over time such that the comparisons are of the same subjects at Time 1, Time 2, Time 3, etc., then χ²
may not be used.
4. The study groups must be independent. This means that a different test must be used if the two groups are
related.
5. There are 2 variables, and both are measured as categories, usually at the nominal level. However, data
may be ordinal data. Interval or ratio data that have been collapsed into ordinal categories may also be used.
While Chi-square has no rule about limiting the number of cells (by limiting the number of categories for each
variable), a very large number of cells (over 20) can make it difficult to meet assumption #6 below, and to
interpret the meaning of the results.
6. The expected value in a cell should be 5 or more in at least 80% of the cells, and no cell should have an
expected value of less than one.
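Assumption 6 can be checked programmatically once the table of counts is available. The sketch below assumes NumPy is installed; the helper names `expected_counts` and `meets_expected_count_rule` and both example tables are hypothetical, made up only to exercise the check.

```python
import numpy as np

def expected_counts(observed):
    """Expected cell counts under independence: row_total * col_total / grand_total."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    return row_totals * col_totals / observed.sum()

def meets_expected_count_rule(observed):
    """True if at least 80% of expected counts are >= 5 and none is below 1."""
    expected = expected_counts(observed)
    share_at_least_5 = np.mean(expected >= 5)
    return bool(share_at_least_5 >= 0.8 and expected.min() >= 1)

# Hypothetical 2x2 tables of counts (not from the slides).
ok_table = [[30, 20], [25, 25]]
sparse_table = [[1, 2], [3, 40]]

print(meets_expected_count_rule(ok_table))
print(meets_expected_count_rule(sparse_table))
```

Note that the rule applies to the expected counts, not the observed ones, which is why the helper computes the expected table first.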
Suppose you observe 100 people to see who deposits garbage in the can and who litters. You want to see if
there is a difference based on gender.
• Basic Assumptions?
• Frequencies? – Yes
• Categories are Mutually Exclusive? – Yes
• Each observation contributes to one cell only? – Yes
• Independent groups? – we shall collect data so that related people do not participate – Yes
• Two variables, both with Nominal categories? –Men/ Women, garbage/ litter – Yes
• Expected value ≥ 5 in at least 80% of cells, and none < 1? – Yes
Steps
1. Formulate Hypothesis
• Null hypothesis(H0): No relation between gender and garbage deposit
• Alternate Hypothesis(H1): Relation between gender and garbage deposit
…Example 2
• Expected Values Table: Next, prepare a similar table of calculated (or expected) values. To do this, calculate each item in the new table as: {row_total * column_total} / {grand_total}
• The expected values table:
        dog           cat           bird          total
men     223.87343533  266.00834492  240.11821975  730
women   217.12656467  257.99165508  232.88178025  708
total   441           524           473           1438
• Chi-Square Table: Prepare this table by calculating for each item the following:
{(Observed_value - Expected_value)^2} / {Expected_value}
observed (o)   calculated (c)   (o-c)^2 / c
207            223.87           1.2717
282            266.00           0.9613
241            240.11           0.0032
234            217.12           1.3112
242            257.99           0.9912
232            232.88           0.0033
Total                           4.5422
# PYTHON CODE
from scipy.stats import chi2_contingency

# defining the table (rows: men, women; columns: dog, cat, bird)
data = [[207, 282, 241], [234, 242, 232]]
stat, p, dof, expected = chi2_contingency(data)

# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
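The expected values and the chi-square total in this example can also be reproduced step by step. This is a minimal sketch, assuming NumPy and SciPy are installed; the variable names are illustrative and not part of the original material.

```python
import numpy as np
from scipy import stats

# Observed counts from the example: rows = men, women; columns = dog, cat, bird.
observed = np.array([[207, 282, 241],
                     [234, 242, 232]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # 730, 708
col_totals = observed.sum(axis=0, keepdims=True)  # 441, 524, 473
grand_total = observed.sum()                       # 1438

# Expected count per cell: row_total * column_total / grand_total.
expected = row_totals * col_totals / grand_total

# Per-cell contribution (O - E)^2 / E, summed to give the chi-square statistic.
contributions = (observed - expected) ** 2 / expected
chi_square = contributions.sum()

# Degrees of freedom for a contingency table: (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
# Upper-tail probability of the chi-square distribution.
p_value = stats.chi2.sf(chi_square, dof)

print(np.round(expected, 2))
print(np.round(contributions, 4))
print(f"chi-square={chi_square:.4f}, dof={dof}, p={p_value:.4f}")
```

Because the p-value is compared with α = 0.05, this manual route should lead to the same decision as `chi2_contingency`.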
Privileged and Confidential. All Rights Reserved © Facts n Data 2022 17
• In case of >2 groups, use a multiple comparison method such as analysis of variance (ANOVA), the Tukey-Kramer test of all pairwise differences, analysis of means (ANOM) to compare group means to the overall mean, or Dunnett’s test to compare each group mean to a control mean.
• If the variances for the two groups are not equal, the two-sample t-test can still be used, but with a different estimate of the standard deviation.
• If sample sizes are very small, a test for normality may not be possible, and you might need to rely on your understanding of the data. When normality can’t be safely assumed, perform a nonparametric test that doesn’t assume normality.
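As a rough illustration of the last two points, SciPy offers an unequal-variance (Welch) form of the two-sample t-test and the rank-based Mann-Whitney U test. The slides do not name a specific method, so both choices here are assumptions, and the two samples are hypothetical numbers made up for the sketch.

```python
from scipy import stats

# Hypothetical measurements for two independent groups (illustration only).
group_a = [12.1, 14.3, 11.8, 15.2, 13.5, 12.9, 14.8]
group_b = [16.4, 22.1, 9.7, 25.3, 18.8, 12.2, 27.5, 20.1]

# Welch's t-test: equal_var=False drops the pooled-variance assumption.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U test: a rank-based alternative that does not assume normality.
mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Welch t={welch.statistic:.3f}, p={welch.pvalue:.4f}")
print(f"Mann-Whitney U={mwu.statistic:.1f}, p={mwu.pvalue:.4f}")
```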
When to use?
Example
• One way to measure a person’s fitness is to measure their body fat percentage. Average body fat percentages vary by age, but according to
some guidelines, the normal range for men is 15-20% body fat, and the normal range for women is 20-25% body fat.
• Given data:
Men: 13.3 6.0 20.0 8.0 14.0 19.0 18.0 25.0 16.0 24.0 15.0 1.0 15.0
Women: 22.0 16.0 21.7 21.0 30.0 26.0 12.0 23.2 28.0 23.0
QUICK CHECKS:
• Two groups? Yes
• Variable Continuous? Yes
• Data values independent? (The body fat for any one person does not depend on the body fat for another person.) - Yes
• Random sample from the population? – Yes (assumed)
• Normally distributed? Check histogram – Yes
• Variances for men and women equal? Check – Yes
Two-sample t-test possible!
FROM EXCEL:
The two histograms are on the same scale. From a quick look, we can see that there are no very unusual points, or outliers. The data look roughly bell-shaped, so our initial idea of a normal distribution seems reasonable.
Examining the summary statistics, we see that the standard deviations are similar. This supports the idea of equal variances. We can also check this using a test for variances.
Based on these observations, the two-sample t-test appears to be an appropriate method to test for a difference in means.
…Example
• Without doing any testing, it can be seen that the averages for men and women in our samples are not the same. But
how different are they? Are the averages “close enough” to conclude that mean body fat is the same for the larger
population of men and women? Or are the averages too different to make this conclusion?
• Start by calculating our test statistic. This calculation begins with finding the difference between the two averages: 22.29 − 14.95 = 7.34. This difference in our samples estimates the difference between the population means for the two groups.
• Next, calculate the pooled variance: s_p² = ((n1 − 1)·s1² + (n2 − 1)·s2²) / (n1 + n2 − 2), where s1² and s2² are the two sample variances.
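The pooled variance and the resulting t statistic can be worked out from the body fat data with Python's standard library alone. This is a sketch under the usual pooled-variance definition (each sample variance weighted by its degrees of freedom); the variable names are illustrative.

```python
import math
from statistics import mean, variance

# Body fat percentages from the example above.
men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
women = [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]

n_men, n_women = len(men), len(women)
diff = mean(women) - mean(men)

# Pooled variance: each sample variance weighted by its degrees of freedom.
pooled_var = (
    (n_men - 1) * variance(men) + (n_women - 1) * variance(women)
) / (n_men + n_women - 2)
pooled_sd = math.sqrt(pooled_var)

# Standard error of the difference between the two sample means.
std_error = pooled_sd * math.sqrt(1 / n_men + 1 / n_women)
t_stat = diff / std_error

print(f"difference={diff:.2f}, pooled variance={pooled_var:.2f}, "
      f"SE={std_error:.3f}, t={t_stat:.2f}")
```

`statistics.variance` uses n − 1 in the denominator, which is the sample variance the pooled estimate calls for.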
…Example…
• Decide on the risk, say, a 5% risk of saying that the unknown population means for men and women are not equal when they really are. So, α = 0.05.
• Calculate the test statistic – in this case, 2.80.
• Find the theoretical value from the t-distribution. To find this value, we need the significance level (α = 0.05) and the degrees of freedom. The degrees of freedom (df) are based on the sample sizes of the two groups. For the body fat data, this is: df = n1 + n2 − 2 = 10 + 13 − 2 = 21
• The t value with α = 0.05 and 21 degrees of freedom is 2.080, from the t-table.
• Compare the value of the t-statistic (2.80) to the t value (2.080). Since 2.80 > 2.080, reject the null hypothesis that the mean body fat for men and women is equal, and conclude, at the 5% significance level, that we have evidence that mean body fat in the population differs between men and women.
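The same comparison can be run with SciPy in place of the table lookup. This is a minimal sketch, assuming SciPy is installed; `equal_var=True` selects the pooled-variance form of the test, and the variable names are illustrative.

```python
from scipy import stats

# Body fat percentages from the example.
men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
women = [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]
alpha = 0.05

# equal_var=True selects the pooled-variance (Student's) two-sample t-test.
result = stats.ttest_ind(women, men, equal_var=True)

dof = len(men) + len(women) - 2
# Two-tailed critical value, replacing the printed t-table.
t_crit = stats.t.ppf(1 - alpha / 2, dof)

decision = "Reject H0" if abs(result.statistic) > t_crit else "Fail to reject H0"
print(f"t={result.statistic:.2f}, dof={dof}, critical t={t_crit:.3f}, p={result.pvalue:.4f}")
print(decision)
```

Comparing the p-value with α gives the same decision as comparing the t statistic with the critical value.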
shekhar@factsNdata.com / 9810228402