
An Interactive Guide to Hypothesis Testing in Python

T-Test, ANOVA, Chi-Squared Test with Examples

Statistical Tests in Python Cheatsheet (image from author’s website)

What is Hypothesis Testing?

Hypothesis testing is an essential part of inferential statistics, where we use observed data from a sample to draw conclusions about unobserved data, often the population.

Applications of hypothesis testing:

• clinical research: widely used in psychology, biology and healthcare research to examine the effectiveness of clinical trials

• A/B testing: can be applied in business contexts to improve conversions by testing different versions of campaign incentives, website designs, etc.

• feature selection in machine learning: filter-based feature selection methods use different statistical tests to determine feature importance

• college or university: well, if you major in statistics or data science, it is likely to appear in your exams …

4 Steps in Hypothesis Testing

Step 1. Define null and alternative hypothesis

The null hypothesis (H0) is stated differently depending on the statistical test, but it can be generalized as the claim that no difference, no relationship or no dependency exists between two or more variables.

The alternative hypothesis (H1) contradicts the null hypothesis and claims that a relationship exists. It is the hypothesis that we would like to prove right. However, a more conservative approach is favored in statistics: we always assume the null hypothesis is true and try to find evidence to reject it.

Step 2. Choose the appropriate statistical test

Common types of statistical tests include t-tests, z-tests, ANOVA tests and chi-squared tests.

choose the appropriate statistical test (image by author)

T-test: compares two groups/categories of a numeric variable with a small sample size

Z-test: compares two groups/categories of a numeric variable with a large sample size

ANOVA test: compares the difference between two or more groups/categories of a numeric variable

Chi-squared test: examines the relationship between two categorical variables

Correlation test: examines the relationship between two numeric variables

Step 3. Calculate the p-value

How the p-value is calculated varies across statistical tests. First, based on the mean and standard deviation of the observed sample data, we derive the test statistic (e.g. the t-statistic). Then, by calculating the probability of observing this test statistic under the null hypothesis distribution (e.g. Student's t-distribution), we obtain the p-value. We will use some examples to demonstrate this in more detail.

Step 4. Determine the statistical significance

The p-value is then compared against the significance level (also noted as the alpha value) to determine whether there is sufficient evidence to reject the null hypothesis. The significance level is a predetermined probability threshold, commonly 0.05. If the p-value is larger than the threshold, the observed value is likely to occur in the distribution when the null hypothesis is true. On the other hand, if the p-value is lower than the significance level, the observed value is very unlikely to occur under the null hypothesis distribution, so we reject the null hypothesis.
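
In code, this final step is just a comparison; a minimal sketch (the pvalue here is hypothetical and stands in for whichever test's output you computed):

pvalue = 0.024  # hypothetical p-value from a statistical test
alpha = 0.05    # conventional significance level
if pvalue < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")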

Hypothesis Testing with Examples

The Kaggle dataset “Customer Personality Analysis” is used in this case study to demonstrate different types of statistical tests: t-test, ANOVA and chi-squared test. These tests are sensitive to large sample sizes and will almost certainly generate very small p-values when the sample size is large. Therefore, I took a random sample (size of 100) from the original data:

sampled_df = df.sample(n=100, random_state=100)
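
For context, a minimal loading sketch; the file name and separator below are assumptions about a local copy of the Kaggle dataset, so adjust them as needed:

import pandas as pd

# Hypothetical file name/separator for the downloaded Kaggle dataset
df = pd.read_csv("marketing_campaign.csv", sep="\t")
sampled_df = df.sample(n=100, random_state=100)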

T-Test

T Test (image from author’s website)

The t-test is used when we want to test the relationship between a numeric variable and a categorical variable. There are three main types of t-test:

1. one sample t-test: tests the mean of one group against a constant value

2. two sample t-test: tests the difference of means between two groups

3. paired sample t-test: tests the difference of means between two measurements of the same subject

For example, if I would like to test whether “Recency” (the number of days since the customer's last purchase) contributes to the prediction of “Response” (whether the customer accepted the offer in the last campaign), I can use a two sample t-test.

The first sample would be the “Recency” of customers who accepted the offer:
recency_P = sampled_df[sampled_df['Response']==1]['Recency']

The second sample would be the “Recency” of customers who rejected the offer:
recency_N = sampled_df[sampled_df['Response']==0]['Recency']

To compare the “Recency” of these two groups intuitively, we can use a histogram (or distplot) to show the distributions.
distplot for t-test (image by author)
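
A minimal sketch of this comparison using seaborn's histplot (the successor to the deprecated distplot); the colors and labels are my own choices:

import seaborn as sns
import matplotlib.pyplot as plt

# Overlay the two Recency distributions with kernel density curves
sns.histplot(recency_P, kde=True, color="orange", label="accepted offer")
sns.histplot(recency_N, kde=True, color="blue", label="rejected offer")
plt.legend()
plt.show()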

It appears that positive responses have lower Recency compared to negative responses. To quantify the difference, let's follow the steps in hypothesis testing and carry out a t-test.

Step 1. define null and alternative hypothesis

• null: there is no difference in Recency between the customers who accepted the offer in the last campaign and those who did not accept the offer

• alternative: customers who accepted the offer have lower Recency compared to customers who did not accept the offer

Step 2. choose the appropriate test

To test the difference between two independent samples, the two-sample t-test is the most appropriate statistical test. The test statistic follows a Student's t-distribution when the null hypothesis is true. The shape of the t-distribution is determined by the degrees of freedom, calculated as the sum of the two sample sizes minus 2.

Import the Python library scipy.stats and create the t-distribution as below.

from scipy.stats import t

# degrees of freedom: total sample size minus 2 (100 - 2 = 98)
rv = t(df=100-2)

Step 3. calculate the p-value

There are some handy functions in Python to calculate probabilities in a distribution. For any x covered in the range of the distribution, pdf(x) is the probability density function of x, which can be represented as the orange line below, and cdf(x) is the cumulative distribution function of x, which can be seen as the cumulative area under the curve. In this example, we are testing the alternative hypothesis that the Recency of positive responses minus the Recency of negative responses is less than 0. Therefore, we should use a one-tailed test and compare the t-statistic we get against the lower tail of the distribution, so the p-value can be calculated as cdf(t_statistic) in this case.
t-statistics and t-distribution (image by author)
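
A minimal sketch of this calculation, using the rv distribution defined above (the t_stat value is hypothetical, purely for illustration):

t_stat = -2.0            # hypothetical t-statistic
pvalue = rv.cdf(t_stat)  # left-tail (one-tailed) p-value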

ttest_ind() is a handy function for an independent t-test in Python that does all of this for us automatically. Pass the two samples recency_P and recency_N as the parameters, and we get the t-statistic and p-value.

t-test in Python (image by author)
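
A minimal sketch of that call; the alternative='less' argument (available in SciPy 1.6+) makes it a one-tailed test matching our hypothesis:

from scipy import stats

# One-tailed test: the alternative is that recency_P has the smaller mean
t_stat, pvalue = stats.ttest_ind(recency_P, recency_N, alternative='less')
print(t_stat, pvalue)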

Here I use Plotly to visualize the p-value in the t-distribution. Hover over the line and see how the point probability and p-value change as x shifts. The area with filled color highlights the p-value we get for this specific example.

An interactive visualization of t-distribution with t-statistics (check the code)

Check out the Code Snippet on my website if you want to build this yourself.

Step 4. determine the statistical significance

The commonly used significance level threshold is 0.05. Since the p-value here (0.024) is smaller than 0.05, we can say that the result is statistically significant based on the collected sample. The lower Recency of customers who accepted the offer is unlikely to have occurred by chance. This further indicates that the feature “Recency” may be a strong predictor of the target variable “Response”, and if we were to perform feature selection for a machine learning model predicting “Response”, “Recency” would likely have high importance.

ANOVA Test
ANOVA Test (image by author)

We now know that the t-test is used to compare the mean of one or two sample groups. What if we want to test more than two samples? Use the ANOVA test.

ANOVA examines the difference among groups by calculating the ratio of variance across groups vs. variance within a group. A larger ratio indicates that the difference across groups is a result of group differences rather than just random chance.

As an example, I use the feature “Kidhome” for the prediction of “NumWebPurchases”. There are three values of “Kidhome” (0, 1 and 2), which naturally form three groups.

kidhome_0 = sampled_df[sampled_df['Kidhome']==0]['NumWebPurchases']
kidhome_1 = sampled_df[sampled_df['Kidhome']==1]['NumWebPurchases']
kidhome_2 = sampled_df[sampled_df['Kidhome']==2]['NumWebPurchases']

Firstly, visualize the data. I find the box plot to be the visual representation most aligned with the ANOVA test.

box plot for ANOVA test (image by author)
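
A minimal sketch of this plot using seaborn (the styling choices are my own):

import seaborn as sns
import matplotlib.pyplot as plt

# One box per "Kidhome" group, showing the spread of "NumWebPurchases"
sns.boxplot(data=sampled_df, x='Kidhome', y='NumWebPurchases')
plt.show()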

It appears there are distinct differences among the three groups, so let's carry out an ANOVA test to see whether that's the case.
1. define hypothesis:

• null: there is no difference among the three groups

• alternative: there is a difference between at least two groups

2. choose the appropriate test: The ANOVA test is preferred for examining the relationship of a numeric variable against a categorical variable with more than two groups. The test statistic of the null hypothesis in an ANOVA test also follows a distribution defined by degrees of freedom: the F-distribution. The degrees of freedom are calculated from the total number of samples (n) and the number of groups (k).

• dfn = k - 1

• dfd = n - k

from scipy.stats import f

dfn = 3-1    # numerator degrees of freedom: k - 1
dfd = 100-3  # denominator degrees of freedom: n - k
rv = f(dfn, dfd)

3. calculate the p-value: To calculate the p-value of the f-statistic, we use the right-tail cumulative area of the F-distribution, which is 1 - rv.cdf(f_statistic).

f-statistics and f-distribution (image by author)

import numpy as np

x = np.linspace(rv.ppf(0.0001), rv.ppf(0.9999), 100000)  # x-axis grid
y = rv.pdf(x)           # density curve for plotting
pvalue = 1 - rv.cdf(x)  # right-tail area at each point on the grid
An interactive visualization of f-distribution with f-statistics (check the code)

To easily get the f-statistic and p-value using Python, we can use the function stats.f_oneway(), which returns a p-value of 0.00040.

from scipy import stats

f_stat, pvalue = stats.f_oneway(kidhome_0, kidhome_1, kidhome_2)
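
As a cross-check, the same p-value can be recovered from the rv distribution defined above:

pvalue = 1 - rv.cdf(f_stat)  # right-tail area beyond the observed f-statistic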

4. determine the statistical significance: Comparing the p-value against the significance level of 0.05, we can infer that there is strong evidence against the null hypothesis, and it is very likely that there is a difference in “NumWebPurchases” between at least two groups.

Chi-Squared Test

Chi-Squared Test (image from author’s website)

The chi-squared test is used for testing the relationship between two categorical variables. The underlying principle is that if two categorical variables are independent, then one categorical variable should have a similar composition as the other categorical variable changes. Let's look at the example of whether “Education” and “Response” are independent.

First, use a stacked bar chart and a contingency table to summarize the count of each category.

ed_contingency = pd.crosstab(sampled_df['Education'], sampled_df['Response'])

stacked bar chart for Chi-Squared test (image by author)
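
A minimal sketch of that chart using pandas' built-in plotting:

import matplotlib.pyplot as plt

# Each bar is an Education group, stacked by Response counts
ed_contingency.plot(kind='bar', stacked=True)
plt.show()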


If these two variables are completely independent of each other (the null hypothesis is true), then the proportion of positive and negative Responses should be the same across all Education groups. It seems the compositions are slightly different, but is the difference significant enough to say there is dependency? Let's run a chi-squared test.

1. define hypothesis:

• null: “Education” and “Response” are independent of each other.

• alternative: “Education” and “Response” are dependent on each other.

2. choose the appropriate test: The chi-squared test is chosen for categorical vs. categorical statistical tests. The chi-squared distribution is determined by the degrees of freedom, calculated as (rows - 1) x (columns - 1).

from scipy.stats import chi2

r = 5                # "Education" has 5 categories (rows)
c = 2                # "Response" has 2 categories (columns)
dof = (r-1) * (c-1)  # degrees of freedom
rv = chi2(df=dof)

3. calculate the p-value: The p-value is calculated as the right-tail cumulative area: 1 - rv.cdf(chi2_statistic).

chi2-statistics and chi-distribution (image by author)

import numpy as np

x = np.linspace(rv.ppf(0.0001), rv.ppf(0.9999), 100000)  # x-axis grid
y = rv.pdf(x)           # density curve for plotting
pvalue = 1 - rv.cdf(x)  # right-tail area at each point on the grid

Python also provides a useful function to get the chi-squared statistic and p-value given the contingency table.

from scipy.stats import chi2_contingency

chi2_stat, pvalue, dof, exp = chi2_contingency(ed_contingency)

An interactive visualization of chi-distribution with chi-statistics (check the code)
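
As before, the p-value can be cross-checked against the rv distribution defined above:

pvalue = 1 - rv.cdf(chi2_stat)  # right-tail area beyond the observed statistic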

4. determine the statistical significance: The p-value is 0.41, suggesting that the result is not statistically significant. Therefore, we cannot reject the null hypothesis that these two categorical variables are independent. This also indicates that “Education” may not be a strong predictor of “Response”.

Thanks for reading this far. We have covered a lot of content in this article but still have two important hypothesis tests that are worth discussing separately in upcoming posts:

• z-test: tests the difference between two categories of a numeric variable when the sample size is LARGE

• correlation test: tests the relationship between two numeric variables

If you would like to read more articles like this, I would really appreciate your support by signing up for a Medium membership :)

Take-Home Message

In this article, we interactively explored and visualized the differences between three common statistical tests: the t-test, the ANOVA test and the chi-squared test. We also used examples to walk through the essential steps in hypothesis testing:

1. define the null and alternative hypothesis

2. choose the appropriate test

3. calculate the p-value

4. determine the statistical significance
