
Session 5, Lecture 6, BIMTECH, 11 Feb 2022

Statistics for Decision Making in Python


Session 5, Lecture 6
Business Vertical – DA, Trimester III, Batch ‘21-’23

V Shekhar Avasthy, 11th Feb, 2022

Privileged and Confidential. All Rights Reserved © Facts n Data 2022

What we intend to cover today?

• Limitations of Python (Lecture 2)


• Tit-bits about z-score, z-Score in Excel and Python
• Hypothesis Testing
• Type I and Type II errors


Python: The Good, The Bad…


The Good:
• Low entry barriers: easy to use, open source, low cost, not-so-steep learning curve
• Syntax similar to many other languages
• Community support
• Flexible….

The Bad:
• Performance issues:
• In-memory tool (the data size it can handle is limited by RAM), so unfit for huge data volumes
• Interpreter based (slow)
• Inefficient use of multiple threads/cores
• Not efficient in memory management
• Lack of commercial-grade standardization: relies extensively on community support, hence different libraries MAY use different algorithms for the same thing and have different levels of challenges


Z-Score
• Z-Score is also called Standard Score, because it gives the corresponding value on the Standard Normal Curve
• Some textbooks also call this a ‘Normalised Score’, though this should be avoided, as a ‘Normalised’ score is usually the result of ‘normalization’ (normalization is converting a series to a score typically between 0 and 1)
• The process of finding a z-score is also called ‘standardization’; this is NOT to be confused with ‘normalization’!
• Key features of normal distribution:
• Normal distributions are symmetric around their mean.
• The mean, median, and mode of a normal distribution are equal.
• The total area under the normal curve is equal to 1.0. Remember that the height of the curve at any point reflects how many cases/samples are expected to have that particular value; this, divided by the total number of values, is the Probability, so all points cumulatively have a probability of unity.
• Normal distributions are denser in the center and less dense in the tails.
• Normal distributions are defined by two parameters, the mean (μ) and the standard deviation (σ).
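
The difference between standardization (z-score) and normalization can be seen on a small example. The sketch below uses hypothetical values chosen only for illustration, and it assumes min-max scaling as the form of normalization meant above (rescaling to the range 0 to 1).

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
values = np.array([12.0, 15.0, 18.0, 21.0, 24.0])

# Standardization (z-score): subtract the mean, divide by the standard deviation.
# ddof=0 treats the data as a full population.
z_scores = (values - values.mean()) / values.std(ddof=0)

# Min-max normalization: rescale so the smallest value is 0 and the largest is 1.
normalized = (values - values.min()) / (values.max() - values.min())

print("z-scores:  ", np.round(z_scores, 3))
print("normalized:", np.round(normalized, 3))
```

The z-scores are centred on 0 and can be negative, while the normalized values always stay between 0 and 1.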

Calculating Z-Score in Excel

• EXCEL (use appropriate address ranges, indicative steps):
• Create ‘n’ random values between any two numbers, say, ‘min’ and ‘max’, using “=RANDBETWEEN(MIN,MAX)”
• Calculate average using “=AVERAGE(data_range)”
• Calculate Standard Deviation using “=STDEV.P(data_range)” for Population, OR “=STDEV.S(data_range)” for Sample
• Compute Z-Score by using “=(datapoint - Average)/Std Dev”
• The result shows how far a point is from the mean value as a proportion of Std Dev. The sign of the score shows the direction: ‘-’ for left of the mean, ‘+’ for right of the mean. E.g., a Z-Score of 1.9 shows that that particular data point is 1.9 Standard Deviations to the right side of the mean; a Z-Score of -0.2 indicates that the particular point is 0.2 SDs to the left of the mean.
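
For comparison, the Excel steps above can be mirrored with Python's standard library. This is only a sketch: the values of n, low and high, and the fixed seed, are assumptions made for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Like =RANDBETWEEN(MIN, MAX): n random integers, both ends inclusive.
n, low, high = 20, 0, 40
data = [random.randint(low, high) for _ in range(n)]

average = statistics.mean(data)          # like =AVERAGE(data_range)
sd_population = statistics.pstdev(data)  # like =STDEV.P(data_range)
sd_sample = statistics.stdev(data)       # like =STDEV.S(data_range)

# Like =(datapoint - Average)/Std Dev, applied to every data point.
z_scores = [(x - average) / sd_population for x in data]

for x, z in zip(data, z_scores):
    side = "right of" if z > 0 else "left of" if z < 0 else "at"
    print(f"value {x:>2}: z = {z:+.2f} ({abs(z):.2f} SDs {side} the mean)")
```

The sample standard deviation is always a little larger than the population one, because it divides by n - 1 instead of n.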


Calculating Z-Score in Python


• PYTHON (indicative steps):
• We can calculate z-scores in Python using scipy.stats.zscore, which uses the following syntax:
scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate')
where:
a: an array-like object containing data
axis: the axis along which to calculate the z-scores. Default is 0.
ddof: degrees of freedom correction in the calculation of the standard deviation. Default is 0.
nan_policy: how to handle input that contains nan. Default is 'propagate', which returns nan. 'raise' throws an error and 'omit' performs calculations ignoring nan values.
• E.g.:
#Step 1: Import modules.
import numpy as np
import scipy.stats as stat

#Step 2: Create an array of values.
data = np.array([21,19,29,36,12,16,31,0,37,10,16,10,18,27,32,15,16,11,39,27])

#Step 3: Calculate the z-scores for each value in the array.
print(stat.zscore(data, ddof=0))
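
To see what the ddof parameter does, the following sketch compares scipy.stats.zscore with the formula (datapoint - mean)/SD computed by hand, using the same data as above. The variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

data = np.array([21, 19, 29, 36, 12, 16, 31, 0, 37, 10,
                 16, 10, 18, 27, 32, 15, 16, 11, 39, 27])

# ddof=0 divides by n (population SD, like Excel's STDEV.P).
z_population = stats.zscore(data, ddof=0)
manual_population = (data - data.mean()) / data.std(ddof=0)

# ddof=1 divides by n - 1 (sample SD, like Excel's STDEV.S).
z_sample = stats.zscore(data, ddof=1)
manual_sample = (data - data.mean()) / data.std(ddof=1)

# Both comparisons check that scipy and the hand formula agree.
print(np.allclose(z_population, manual_population))
print(np.allclose(z_sample, manual_sample))
```

Because the sample SD is slightly larger, the ddof=1 z-scores are slightly closer to zero than the ddof=0 ones.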


What is Hypothesis? What is Hypothesis Testing?


• Hypothesis defined:
“an idea that is suggested as the possible explanation for something but has not yet been found to be true or correct”
• Hypothesis Testing: a Statistician's way of checking the validity of the idea
• ….BUT OFTEN, Statisticians rely on SAMPLES, and any conclusion drawn from samples has some possibility of being incorrect! Hence, Hypothesis Testing in Statistics has elements of Probability: there’s a chance that we may be wrong!
• What is to be tested is OFTEN a difference (“this group is different from the rest” etc.). E.g., when an alleged criminal is brought to a court, the Police believe that the person has committed a crime, but the Judge presumes him to be innocent. The Judge’s thought process at the start is “the accused is LIKE other people who have not done that crime” OR “there is no difference between the alleged Criminal and the General public”
• …BUT the Police believe that the alternate fact is reality (Alternate fact: “the accused is NOT like other people and has committed the crime”)
• Another example could be that a Statistician “thinks/feels” that “Average height of males in India is more than that of females”
• Thus, one hypothesis would be “there is NO DIFFERENCE between the height of males and females”
• The other hypothesis (to be tested) would be “there is a difference in heights of males and females”
• The Hypothesis that says there is NO DIFFERENCE (difference = Null or 0) is called the NULL HYPOTHESIS, represented as H0
• The Hypothesis that says there IS A DIFFERENCE (difference = NOT Null) is called the ALTERNATE HYPOTHESIS, H1 or H’
• Usually, the belief to be tested is the ALTERNATE Hypothesis, as you are challenging the existing or general belief
• REMEMBER: You CANNOT interchange the two! Think about this logically.


What is Hypothesis? What is Hypothesis Testing?


• One job of a statistician is to make statistical inferences about populations based on samples taken from the population. Confidence intervals are one way to estimate a population parameter.
• Another way to make a statistical inference is to make a decision about the value of a specific parameter. For instance, a car dealer advertises that its new small truck gets 25 kms/liter, on average. A tutoring service claims that its method of tutoring helps 99% of its students get an A or a B grade. A company says that women managers in their company earn an average of $60,000 per year.
• A statistician will make a decision about these claims. This process is called "hypothesis testing." A hypothesis test involves collecting data from a sample and evaluating the data. Then, the statistician makes a decision as to whether or not there is sufficient evidence, based upon analyses of the SAMPLE data, to reject the null hypothesis.
• It MAY be regarded as testing the validity of a claim from SAMPLE data. If you had data for the whole population, there would be no need for Hypothesis Testing, as the Hypothesis would not have existed at all!
• After you have determined which hypothesis the sample supports, you make a decision. There are two options for a decision. They are "reject H0" (also phrased as "cannot accept H0") if the sample information favors the alternative hypothesis, or "do not reject H0" / "decline to reject H0" if the sample information is insufficient to reject the null hypothesis. These conclusions are all based upon a level of probability, a significance level, that is set by the analyst.
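
As an illustration of the decision step, the sketch below tests the truck mileage claim on made-up data. It assumes a one-sample t-test (scipy.stats.ttest_1samp), a technique this lecture has not introduced, and a significance level of 0.05; the readings are hypothetical and not real measurements.

```python
import numpy as np
from scipy import stats

# Hypothetical mileage readings (km/liter) from 10 trucks; not real data.
mileage = np.array([24.1, 25.3, 23.8, 24.6, 25.0, 23.5, 24.9, 24.2, 23.9, 24.4])

claimed_mean = 25.0  # dealer's advertised average
alpha = 0.05         # significance level chosen by the analyst

# H0: mean mileage >= 25; Ha: mean mileage < 25 (one-sided test).
result = stats.ttest_1samp(mileage, popmean=claimed_mean, alternative="less")

print(f"sample mean = {mileage.mean():.2f}, t = {result.statistic:.2f}, p = {result.pvalue:.4f}")

# Decision rule: reject H0 only when the p-value falls below alpha.
if result.pvalue < alpha:
    decision = "reject H0"
else:
    decision = "do not reject H0"
print(decision)
```

The decision is worded as "reject" or "do not reject" H0; the test never "proves" either hypothesis, since it works from a sample.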

IMPORTANT to note…

• As a mathematical convention, H0 always has a symbol with an equal in it (=, ≥, or ≤). H’ or Ha never has a symbol with an equal in it (≠, <, or >).
• The choice of symbol depends on the wording of the hypothesis test.
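
Two examples from earlier in this lecture, written with this convention. How the truck claim is split into H0 and Ha is an assumption made here for illustration (a buyer worried that the true mileage is lower than advertised):

```latex
% Truck mileage claim: "25 km/liter on average"
H_0:\ \mu \ge 25 \qquad H_a:\ \mu < 25

% Heights of males vs. females: "no difference" against "a difference"
H_0:\ \mu_{\text{male}} = \mu_{\text{female}} \qquad H_a:\ \mu_{\text{male}} \ne \mu_{\text{female}}
```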


…but working with samples always has a chance of ERRORS!


• So, a Statistician is NEVER COMPLETELY SURE about the result of testing (unless dealing with a Census)
• Whatever the evidence suggests (H0 OR H1 being true), there is still a ‘chance’ that the sample was skewed
• SO, Statisticians always say: “We fail to Reject the Null Hypothesis” OR “We fail to accept the Null Hypothesis”
• It’s like a Judge saying “Based on the evidence in front of this Court, this Court finds the accused Guilty/ Not Guilty/ gives him the benefit of ‘doubt’”

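• Four Scenarios, two types of ERRORS:

[Figure: table of the four scenarios (H0 true or false, against H0 rejected or not rejected), marking the Type I error with Probability α and the Type II error with Probability β]
IMAGE CREDIT: http://www.personal.ceu.hu/students/08/Olga_Etchevskaia/hypotheses.html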


…but working with samples always has a chance of ERRORS!

When estimating the mean of a population from a sample, the unsaid “hypothesis” (H0) is that μ lies within sample average ± 1.96σ 95% of the time (σ here being the standard deviation of the sample mean) => the probability of rejecting the NULL hypothesis when H0 is actually true is α = 0.05.
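
The 95% figure can be checked by simulation. The sketch below assumes a normal population with a known standard deviation; the population mean, SD, sample size and number of trials are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 50.0, 10.0   # hypothetical population mean and SD
n, trials = 30, 10_000   # sample size and number of repeated samples

# Draw many samples and compute each sample's mean.
sample_means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)

# Half-width of the interval: 1.96 standard errors of the mean.
half_width = 1.96 * sigma / np.sqrt(n)

# Does "sample mean ± half_width" capture the true mean?
covered = np.abs(sample_means - mu) <= half_width
coverage = covered.mean()

print(f"intervals containing mu: {coverage:.3f}")
print(f"intervals missing mu (approx. alpha): {1 - coverage:.3f}")
```

The share of intervals that miss μ should come out close to 0.05, which is the α referred to above.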


…and probability of errors too is measured!

• α = probability of a Type I error = P(Type I error) = probability of rejecting the null hypothesis when the null hypothesis is true: rejecting a good null.
• Statistics allows us to set the probability that we are making a Type I error. The probability of making a Type I error is α. This is the same α used earlier with confidence levels (confidence level = 1 − α).
• β = probability of a Type II error = P(Type II error) = probability of not rejecting the null hypothesis when the null hypothesis is false. (1 − β) is called the Power of the Test.
• α and β should be as small as possible because they are probabilities of errors.
• Depending on the gravity of a Type I or Type II error in a given context, you decide how to adjust α. In some cases, you may not rely on a test at all (if β is above the desired threshold). See next slide.
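
Both error rates can be estimated by simulation. The sketch below assumes a two-sided z-test with a known standard deviation; the null value, the "true" alternative mean, the sample size and the number of trials are hypothetical settings chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

alpha = 0.05
sigma, n, trials = 10.0, 25, 10_000   # hypothetical settings
mu_null = 100.0                       # value stated by H0: mu = 100
mu_alt = 105.0                        # one hypothetical "true" mean when H0 is false

# Two-sided z-test: reject H0 when |z| exceeds the critical value.
z_crit = stats.norm.ppf(1 - alpha / 2)

def rejection_rate(true_mean):
    """Fraction of simulated samples in which H0: mu = mu_null is rejected."""
    means = rng.normal(true_mean, sigma, size=(trials, n)).mean(axis=1)
    z = (means - mu_null) / (sigma / np.sqrt(n))
    return np.mean(np.abs(z) > z_crit)

type_1_rate = rejection_rate(mu_null)  # H0 true, yet rejected
power = rejection_rate(mu_alt)         # H0 false and rejected
type_2_rate = 1 - power                # H0 false, yet not rejected

print(f"Type I rate  ~ {type_1_rate:.3f} (alpha = {alpha})")
print(f"Type II rate ~ {type_2_rate:.3f}, power ~ {power:.3f}")
```

The Type I rate stays near the chosen α, while β depends on how far the true mean is from the null value and on the sample size.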


Examples
• Suppose the null hypothesis, H0, is: The victim of an automobile accident is alive when he arrives at the emergency room of a hospital. Assume that this requires action if it is true. Assume that if the null hypothesis cannot be accepted then no action is required and the hospital will not take any action.
• Type I error: The emergency crew thinks that the victim is dead when, in fact, the victim is alive.
• Type II error: The emergency crew does not know if the victim is alive when, in fact, the victim is dead.
• The error with the greater consequence is the Type I error in this case. (If the emergency crew thinks the victim is
dead, they will not treat him.)
• Suppose the null hypothesis, H0, is: Ram’s parachute number 2 is safe.
• Type I error: Ram thinks that his parachute number 2 may not be safe when, in fact, it really is safe.
• Type II error: Ram thinks that his parachute number 2 may be safe when, in fact, it is not safe.
• Notice that, in this case, the error with the greater consequence is the Type II error. (If Ram thinks his parachute
#2 is safe, he will go ahead and use it, but it may not work.) This is a situation described as "accepting a false null".

Suppose the null hypothesis, H0, is: a patient is not sick. Which type of error has the greater consequence, Type I/ Type II?


Thank You!

Comments/ Clarifications: shekhar@factsNdata.com / +91-9810228402
