
PE 515 CS – DATA SCIENCE

UNIT – II
22.11.2022
 STATISTICAL MODELING
 RANDOM VARIABLE
 SAMPLE STATISTICS
 HYPOTHESIS TESTING
 CONFIDENCE INTERVALS
 P HACKING
 BAYESIAN INFERENCE

Objectives
 Introduction to random variables
 How they are characterized using probability measures and probability density functions
 How parameters of these density functions can be estimated
 How you can make decisions from data using the method of hypothesis testing
 Characterizing random phenomena – what they are and how probability can be used as a
measure to describe them
Statistical Modeling
Random phenomena
1) Deterministic phenomenon
2) Stochastic phenomenon

1) Deterministic phenomenon – Phenomenon whose outcome can be predicted with a
very high degree of confidence
Example – age of a person (using the date of birth stated in the Aadhaar Card)
- accurate up to a number of days; if asked to predict the age of the person to an hour
or a minute, the date of birth from an Aadhaar Card is insufficient
- the information from the birth certificate may be needed instead
- but to predict the age with a higher degree of precision, down to the last minute, it is
not possible to do so with the same level of confidence.

2) Stochastic phenomenon – Phenomenon which can have many possible outcomes for
some experimental conditions. Outcome can be predicted with limited confidence
Example – Outcome of a coin toss
- Might get a head or tail (but can’t say with 90% or 95% confidence)
- Might be able to say it only with a 50% confidence if it is a fair coin

Why are we dealing with stochastic phenomena?


- Data obtained from experiments contain some errors.
- One reason for these errors is that not all the rules governing the data generating
process are known; in other words, we lack knowledge of all the causes that affect the
outcomes.
- These are called modeling errors.
- The other kind of error is due to the sensor itself.
- The sensors used for observing the outcomes may contain errors.
- Such errors are called measurement errors.
- These two kinds of errors are modeled using probability density functions, and therefore
the outcomes are also predicted with certain confidence intervals.
- Random phenomena can either be discrete, where the outcomes are finite
- Example: Coin Toss experiment – only two outcomes (either head or tail)
- Example: Throw of a die – 6 outcomes
- or continuous, with an infinite number of outcomes
- Example: Measurement of body temperature (varies from 96 to 105 degrees F), depending
on whether a person is running a temperature or not.
- Such variables with a continuum of random outcomes are called continuous

Characterizing random phenomena


 Sources of error in observed outcome
o Lack of knowledge of generating process (model error)
o Errors in sensors used for observing outcomes (measurement error)
 Types of random phenomena
o Discrete: Outcomes are finite
 Coin Toss: {H,T}
 Throw of a dice: {1, 2, 3, 4, 5, 6}
o Continuous: infinite number of outcomes
 Body temperature measurement in deg F

Sample Space, Events (Discrete Phenomena)


 Sample space
o Set of all possible outcomes of a random phenomenon
 Coin toss: S = {H, T}
 Two Coin Tosses: S = {HH, HT, TH, TT}
 Event
o Subset of the sample space
 Occurrence of a head in first toss of a two coin toss experiment A = {HH, HT}
 Outcomes of a sample space are elementary events

- Random phenomena - all the notions of probability - using just the coin toss experiment

- A single coin toss whose outcomes are denoted by H and T
- The sample space is the set of all possible outcomes.
- In this case the sample space consists of the two outcomes denoted by the symbols H
and T
- On the other hand if there are two successive coin tosses, then there can be 4 possible
outcomes denoted by the symbol HH, HT, TH and TT and that constitutes the sample space.
- Outcomes of the sample space for example, HH, HT, TH and TT can also be considered as
events. These events are known as elementary events.

Probability measure – is a function that assigns a real value to every outcome of random
phenomena, which satisfies following axioms:
 0 ≤ P(A) ≤ 1 (probabilities are non-negative and at most 1 for any event A)
 P(S) = 1 (probability of the entire sample space – one of the outcomes should occur)
 For two mutually exclusive events A and B,
o P(A∪B) = P(A) + P(B)


Interpretation of a probability as a frequency:


 Conduct an experiment (coin toss) N times.
 If NA is the number of times outcome A occurs, then P(A) ≈ NA/N for large N
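The frequency interpretation above can be illustrated with a short simulation (a sketch; the seed and number of tosses are arbitrary choices, not from the notes):

```python
import random

# Estimate P(head) for a fair coin by the relative frequency N_A / N.
random.seed(0)          # arbitrary seed, for reproducibility
N = 10_000
N_A = sum(1 for _ in range(N) if random.random() < 0.5)  # count "head" outcomes
p_hat = N_A / N
print(p_hat)            # close to 0.5 for a fair coin
```

As N grows, the relative frequency settles near the true probability 0.5.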

Exclusive and independent events


 Independent events
o Two events are independent if occurrence of one has no influence on occurrence of
other
 Formally A and B are independent events if and only if P(A∩B) = P(A) × P(B)
 In a two coin toss experiment, the occurrence of head in the second toss can be
assumed to be independent of the occurrence of head or tail in the first toss, then
P(HH) = P(H in first toss) × P(H in second toss) = 0.5 × 0.5 = 0.25
o All four outcomes in the case of the two coin toss experiment will have a probability
of 0.25
o Two events are said to be independent if the occurrence of one has no influence on
the occurrence of the other. That is, if A and B are independent, knowing that event
A has occurred does not improve the predictability of event B.
o P(A∩B), which means the joint occurrence of A and B, can be obtained by multiplying
their respective probabilities, i.e. P(A) × P(B).

 Mutually exclusive events

o Two events are mutually exclusive if occurrence of one implies other event does not
occur
 In a two coin toss experiment, events {HH} and {HT} are mutually exclusive
P(HH or HT) = P(HH) + P(HT) = 0.25 + 0.25 = 0.5
 obtained from the basic law of probability for mutually exclusive events
o Mutually exclusive events are events that preclude each other.
o That is, if event A has occurred, it implies B has not occurred; then A and B are
called mutually exclusive events
o In two successive coin tosses, if two successive heads have occurred, it is clear that
the event of a head followed by a tail has not occurred; these are mutually exclusive
events
o The probability of getting either two successive heads or a head followed by a
tail can be obtained in this case by simply adding their respective probabilities,
because they are mutually exclusive events.
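These two rules can be verified by enumerating the two-coin-toss sample space (a minimal sketch assuming a fair coin and independent tosses):

```python
from itertools import product

# Enumerate the sample space {HH, HT, TH, TT} of two tosses.
sample_space = ["".join(t) for t in product("HT", repeat=2)]
p = {outcome: 0.25 for outcome in sample_space}  # fair coin, independent tosses

# Independence: P(HH) = P(H in first toss) * P(H in second toss)
print(p["HH"] == 0.5 * 0.5)          # True

# Mutual exclusivity: P(HH or HT) = P(HH) + P(HT)
print(p["HH"] + p["HT"])             # 0.5
```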

Some rules of Probability


- Venn diagram to derive probability rules for the 2 coin toss experiment.

- In the two coin toss experiment the sample space consists of 4 outcomes denoted by HH,
HT, TH and TT.
- The event A is a head in the first toss
- This consists of two outcomes HH and HT
- The complement of A (Ac) is the set of all outcomes that exclude A, which is the set of
outcomes TH and TT
- Probability of the complement = probability of the entire sample space (which is one)
minus P(A)
- P(A) = P(HH) + P(HT) = 0.25 + 0.25 = 0.5
- The probability of the complement {TH, TT} = 0.5
- P(Ac) = 1 - P(A)

- A and B are not mutually exclusive: the common outcome of two successive heads belongs
to both A and B
- Computing P(A or B) - a head in the first toss or a head in the second toss - covers
three outcomes, which together give a probability of 0.75, obtained by adding the
respective probabilities of HT, HH and TH

Conditional Probability
 If two events A and B are not independent, then information available about the
outcome of event A can influence the predictability of event B
 Conditional probability
o P(B | A) = P(A∩B)/P(A) if P(A)>0
o P(A | B)P(B) = P(B | A)P(A) – Bayes Formula
o P(A) = P(A | B)P(B) + P(A | Bc)P(Bc)
 Example: two (fair) coin toss experiment
o Event A : First toss is head = {HT, HH}
o Event B : Two successive heads = {HH}
o Pr(B)=0.25 (no information)
o Given event A has occurred, Pr(B | A) = P(A∩B)/P(A) = 0.25 / 0.5 = 0.5

EXAMPLE:
In a manufacturing process, 1000 parts are produced, of which 50 are defective. We
randomly take a part from the day's production

o Outcomes: {A = Defective Part, B = Non-Defective Part}


o P(A) = 50/1000, P(B) = 950/1000

 Suppose we draw a second part without replacing the first part


o Outcomes: {C = Defective Part, D = Non-Defective Part}
o Pr(C)= 50/1000 (no information about the outcome of the first draw)
o Pr(C | A)= 49/999 (given information that first draw is defective)
o Pr(C | B)= 50/999 (given information that first draw is non-defective)
o P(C)=49/999 * 50/1000 + 50/999 * 950/1000 = 50/1000
o P(A | C) = P (A ∩ C)/P(C) = P(C | A)P(A)/P(C)=49/999
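The total probability and Bayes computations in this example can be reproduced exactly with fractions (a sketch of the same arithmetic):

```python
from fractions import Fraction as F

# 1000 parts, 50 defective; two draws without replacement.
P_A = F(50, 1000)          # first part defective
P_B = 1 - P_A              # first part non-defective
P_C_given_A = F(49, 999)   # second defective, given first defective
P_C_given_B = F(50, 999)   # second defective, given first non-defective

# Total probability rule: P(C) = P(C|A)P(A) + P(C|B)P(B)
P_C = P_C_given_A * P_A + P_C_given_B * P_B
print(P_C)                 # 1/20, i.e. 50/1000

# Bayes formula: P(A|C) = P(C|A)P(A) / P(C)
P_A_given_C = P_C_given_A * P_A / P_C
print(P_A_given_C)         # 49/999
```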
Random Variables and Probability Mass/Density Functions

 The notion of random variables and the idea of probability mass and density
functions
 How to characterize these functions
 How to work with them

Random Variable
 A random variable (RV) is a map from sample space to a real line such that there is a
unique real number corresponding to every outcome of sample space
o Example: Coin toss sample space {H, T} mapped to {0, 1}. If the sample space
outcomes are real valued, there is no need for this mapping (e.g., throw of a die)
o Allows numerical computations such as finding expected value of a RV
o Discrete RV (throw a dice or a coin)
o Continuous RV (sensor readings, time interval between failures)
o Associated with the RV is also a probability measure

Probability mass / Density Function


 For a Discrete RV the probability mass function assigns a probability to every
outcome in sample space
o Sample space of RV (x) for a coin toss experiment: {0, 1}
o P(x=0)=0.5; P(x=1)=0.5
 For a continuous RV the probability density function f(x) can be used to assign a
probability to every interval on a real line
 Continuous RV (x) can take any value in (–∞, ∞)
 The probability corresponds to an area under the density curve
 P(a < x < b) = ∫ f(x) dx, integrated from a to b
 Cumulative distribution function F(x)
 F(b) = P(–∞ < x < b) = ∫ f(x) dx, integrated from –∞ to b
 It is a fair coin and that is why the outcomes are given equal probability.

 In the case of a continuous random variable, we define what is known as a
probability density function, which can be used to compute the probability for every
outcome of the random variable within an interval.
 Notice, in the case of a continuous random variable, there are infinitely many outcomes
and therefore we cannot associate a probability with every individual outcome.
 However, we can associate a probability that the random variable lies within some
finite interval.
 Random variable x which can take any value in the real line from - ∞ to ∞, the density
function f(x), such that the probability that the variable lies in an interval a to b is defined
as the integral of this function from a to b.
 The integral is an area, the area represents the probability, the probability that the random
variable lies between - 1 to 2 is denoted by the shaded area
 The cumulative distribution function, denoted by capital F, is the probability that the
random variable x lies in the interval –∞ to b, i.e. the integral between –∞ and b of
the density function f(x) dx.
 Other functions – Binomial Mass Function, Gaussian or Normal Density Function,
Chi-square density function
 Other examples of pdf – uniform density function & exponential density function
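As a concrete sketch, the shaded-area probability P(−1 < x < 2) mentioned above can be computed for a standard normal density (an assumed example distribution) using the error function:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(x) for a Gaussian density, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# P(a < x < b) = F(b) - F(a): the area under the density between a and b.
p = normal_cdf(2.0) - normal_cdf(-1.0)
print(round(p, 3))  # 0.819
```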

Moments of a PDF
 Similar to describing a function using derivatives, a pdf can be described by its
moments
o For continuous distributions
 E[x^k] = ∫ x^k f(x) dx, integrated from −∞ to ∞
o For discrete distributions
 E[x^k] = ∑ xi^k p(xi), summed over i = 1 to N
o Mean: µ = E[x]
o Variance: σ² = E[(x − µ)²] = E[x²] − µ²
o Standard deviation = square root of variance = σ
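For a discrete example, the moment formulas give the familiar mean and variance of a fair six-sided die (a sketch of the discrete-distribution formula above):

```python
# Moments of a discrete pmf: a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

def moment(k):
    """E[x^k] = sum over i of x_i^k * p(x_i)."""
    return sum(x ** k * px for x, px in zip(outcomes, probs))

mu = moment(1)              # mean
var = moment(2) - mu ** 2   # sigma^2 = E[x^2] - mu^2
print(round(mu, 4), round(var, 4))  # 3.5 2.9167
```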


01.12.2022
Sample Statistics

 Probability provides a theoretical framework for performing statistical analysis of
data.
 Statistics actually deals with the analysis of experimental observations that we have
obtained.

Need for sampling

 PDFs of RVs establish the theoretical framework.


o But entire sample space may not be known
o Parameters of distribution may not be known
 From a finite sample derive conclusions about the pdf and its parameters
 Sample (or observation) set is assumed to be sufficiently representative of the
entire sample space
o Proper sampling procedures and design of experiments to be used for
obtaining the sample
 While doing analysis we do not know the entire sample space
 We may also not know the parameters of the distribution from which the samples are
being drawn.
 Typically we obtain only a small sample of the total available population.
 So, from this finite sample we have to derive conclusions about the probability density
function of the entire population and also make inferences about the parameters of
the distribution.
 So, the sample or observation set is supposed to be sufficiently representative of the
entire sample space.
 Example - To find out the average height of people in the world, can’t take heights of
American people alone because they are known to be much taller, when compared to the
Asian people.
 While taking samples, take them from Europe, Asia and so on, so that we get a
representative sample of the entire population of the world
 This is called proper sampling procedures and these are dealt with in the design of
experiments

Basic concepts
 Population – Set of all possible outcomes of a random experiment characterized by
f(x)
 Sample set (realization) – Finite set of observations obtained through an
experiment

 Inference – conclusion derived regarding the population (pdf, parameters) from the
sample set
o Inference made from a sample set is also uncertain since it depends on the
sample set which is one of many possible realizations, provide also the
confidence interval associated with these estimates that are derived
 Statistical analysis
o Descriptive statistics (Analysis)
 Graphical – organizing and presenting the data (ex: Box plots,
probability plots)
 Numerical – summarizing the sample set (ex: mean, mode, range,
variance, moments)
o Inferential
 Estimation – estimate parameters of the pdf along with its confidence
region
 Hypothesis testing – making judgments about f(x) and its parameters
 Measures of central tendency
o Represent sample set by a single value
 Mean (or average): x̅ = (1/N) Σ xi
 Best estimate in the least squares criterion
 Unbiased estimate of the population mean: E[x̅] = µ
 Affected by outliers
 Ex: sample height of 20 cherry trees
[55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83]
 Mean = 1385 / 20 = 69.25 (population mean used to generate
random sample was 70)
 Mean = 1435 / 20 = 71.75 (after a bias of 50 was added to first
sample value)
 The mean of a sample is defined as the summation of all the data points that was obtained
divided by the number of data points.
 That is also denoted by the symbol x̅ and its also called the mean or the average of the
sample.

Unbiased estimate - Expectation of x̅ is μ. This can be analytically proven for any kind of
distribution.
 Take a sample of N points and get an estimate x̅ . Repeat this experiment and draw
another random sample from the population of N points and get another value of x̅ .
 Average all these x̅ from different experimental sets, then the average of these averages
will tend to this population mean.
 This is a useful and important property of estimates

 One unfortunate aspect of this statistic (the mean) is that if there is one bad data
point in the sample, the estimate x̅ can be significantly affected by that bad value.
Such a bad value is called an outlier, and even a single outlier in the data can give
rise to a bad estimate of x̅.
 A single biased value in the sample will lead to a poor estimate.

MEDIAN
 Represent sample set by a single value
o Median – value of xi such that 50% of the values are less than x i and 50% of
observations are greater than xi
 Robust with respect to outliers in data
 Best estimate in the least absolute deviation sense
 Ex: Sample Heights of 20 Cherry Trees
[55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83]
Median = 69 (population mean used to generate random sample was
70)
 Median = 69 (after a bias of 50 was added to first sample value)
 Another measure of central tendency is what is called a median.
 The median is a value such that 50 percent of the data points lie below this value and 50
percent of the experimental observations are greater than this value.
 Order all the observations from smallest to largest and then find the middle value.
 The 10th point is 67 and the 11th point is 71; because there is an even number of
points, take the average of these two and call that the median: (67 + 71) / 2 = 69
 If there is an odd number of points, take the middle point as it is
 Add a bias of 50 to the first data point of the original (unsorted) sample, making 73
into 123; then reorder the data and find the median again: the median has not changed.
 So, the presence of an outlier has not affected the median

MODE
 Represent sample set by a single value
o Mode – Value that occurs most often (most probable value)
o Ex: Sample heights of 20 cherry trees
o [55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83]
o Mode = 67 (3 occurrences)
 The mode is another measure of central tendency; it is the value that occurs most
often, also called the most probable value.
 Sometimes a distribution may have two modes, called a bimodal distribution, in which
case sampling from such a distribution will give two clusters, one around each of the
two modes
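The three measures of central tendency for the cherry-tree sample, and the effect of an outlier, can be checked with the standard library (using the sample in its original order; the +50 bias is applied to the first value, 73):

```python
import statistics

# The sample of 20 cherry-tree heights (original order).
heights = [73, 75, 55, 60, 66, 71, 81, 67, 83, 75,
           82, 71, 63, 55, 72, 78, 67, 65, 67, 59]

print(statistics.mean(heights))    # 69.25
print(statistics.median(heights))  # 69.0 (average of 10th and 11th sorted values)
print(statistics.mode(heights))    # 67 (occurs 3 times)

# Add a bias of 50 to the first value (73 -> 123): the mean shifts
# noticeably, while the median is unaffected by the outlier.
biased = [heights[0] + 50] + heights[1:]
print(statistics.mean(biased))     # 71.75
print(statistics.median(biased))   # 69.0
```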

Measures of spread
 Represents spread of sample set

https://www.calculatorsoup.com/calculators/statistics/variance-calculator.php

https://www.calculatorsoup.com/calculators/statistics/descriptivestatistics.php

 Another measure which characterizes a sample set is the measures of spread and tells
how widely their data is ranging.
 Sample variance: s² = Σ(xi − x̅)² / (N − 1)
 The sample variance is an unbiased estimate of the population variance
 The square root of the sample variance is also known as the standard deviation.
 The sample variance happens to be also a very susceptible to outliers.
 So, if there is a single outlier, the sample variance, sample standard deviation can become
very poor estimate of the population parameter.

 Another measure of spread which is called the mean absolute deviation, somewhat
similar to the Median
 Mean absolute deviation (about the median) = (1/N) Σ |xi − median|
 A third measure of spread is what is called the range that is basically the difference
between the maximum and minimum value.

 Example: Refer Measures of spread image in page 11, highlighted in red color
 A single outlier can cause the standard deviation and the variance to become very poor
and therefore cannot be trusted as a good estimate of the population standard deviation or
variance.
 Take the mean absolute deviation from the median, it would be even better in terms of
robustness with respect to the outlier.
 The range of the data can be obtained as the maximum and minimum value
 Even when the 20 data points are not given, the mean and standard deviation alone
reveal the properties of the sample (the power of sample statistics)
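The spread measures above can be sketched for the same 20 heights (the mean absolute deviation is taken about the median, as the notes suggest):

```python
import statistics

heights = [73, 75, 55, 60, 66, 71, 81, 67, 83, 75,
           82, 71, 63, 55, 72, 78, 67, 65, 67, 59]
n = len(heights)
xbar = statistics.mean(heights)

# Sample variance: divides by N - 1, making it an unbiased estimate.
s2 = sum((x - xbar) ** 2 for x in heights) / (n - 1)
s = s2 ** 0.5                                   # standard deviation

# Mean absolute deviation about the median (robust to outliers)
med = statistics.median(heights)
mad = sum(abs(x - med) for x in heights) / n

rng = max(heights) - min(heights)               # range = max - min
print(round(s2, 2), round(s, 2), round(mad, 2), rng)  # 70.51 8.4 6.85 28
```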

03.12.2022
Graphical Analysis – Histograms, Box Plot, Probability Plot, Scatter Plot

 Histograms
o Divide the range of values in the sample set into small intervals and count how many
observations fall within each interval
o For each interval plot a rectangle with width = interval size and height equal to
number of observations in interval
o Example – Sample of 20 heights of black cherry trees
[73 75 55 60 66 71 81 67 83 75 82 71 63 55 72 78 67 65 67 59]

 Given a sample set, first divide this sample set into small ranges; and count how many
observations fall within that range or within each interval

 Plot the interval (bin) on the x axis and the number of data points falling in that
interval on the y axis
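The counting step behind a histogram can be sketched directly (bin edges of width 5 from 55 to 85 are an assumed choice, not from the notes):

```python
# Count how many observations fall within each interval (bin).
heights = [73, 75, 55, 60, 66, 71, 81, 67, 83, 75,
           82, 71, 63, 55, 72, 78, 67, 65, 67, 59]

width = 5
edges = list(range(55, 90, width))   # [55, 60, 65, 70, 75, 80, 85]
counts = [sum(lo <= x < lo + width for x in heights) for lo in edges[:-1]]

for lo, c in zip(edges, counts):
    print(f"[{lo}, {lo + width}): {'#' * c} ({c})")  # text rendering of the bars
```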

Box Plot

 Other kinds of plots – which is called the box plot, which is used most often in sometimes
in visualizing stock prices
 Compute quantities called quartiles Q1, Q2 and Q3 and the minimum and maximum
values in the range
 What are quartiles?
 Quartiles are basically an extension of the idea of median
 Q2 is exactly the median - which means half the number of points fall below the value of
Q2 and half the number of points are above Q2.
 Similarly, Q1 represents the 25 percent value, which means 25 percent of the observations
fall below Q1 and 75 percent above Q1.
 Q3 implies that 75 percent of the data points fall below Q3 and 25 percent above Q3.
 And once you have these values, the median, the quartiles and the minimum maximum,
you can plot what is called the box and whisker plot
 The lowest observation and the highest observation are called the whiskers.
 This gives a little more information about the spread of the data
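The quartiles behind a box-and-whisker plot can be computed with `statistics.quantiles` (a sketch; its default "exclusive" method is one of several quartile conventions, so other tools may give slightly different Q1/Q3 values):

```python
import statistics

heights = [73, 75, 55, 60, 66, 71, 81, 67, 83, 75,
           82, 71, 63, 55, 72, 78, 67, 65, 67, 59]

# n=4 cut points: [Q1, Q2, Q3]; Q2 coincides with the median.
q1, q2, q3 = statistics.quantiles(heights, n=4)
lo, hi = min(heights), max(heights)   # the whiskers
print(lo, q1, q2, q3, hi)             # 55 63.5 69.0 75.0 83
```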

Probability Plot

 The third kind of plot, which is very useful for learning about the distribution of the
data, is called the probability plot (the P-P plot or the Q-Q plot).
 Standardization means removing the mean and dividing by the standard deviation
 Take the 20 standardized values and sort them from the lowest to the highest

https://mathcracker.com/normal-probability-plot-maker#results

55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83

 The theoretical frequencies fi need to be computed as well as the associated z-scores zi,
for i = 1, 2, ..., 20:
 Observe that the theoretical frequencies fi are approximated using the formula:
fi = (i − 0.375) / (n + 0.25)
 where i corresponds to the position in the ordered dataset, and zi is the corresponding
associated z-score.
 This is computed as zi = Φ⁻¹(fi)

Position (i) X (Asc. Order) fi zi


1 55 0.0309 -1.868
2 55 0.0802 -1.403
3 59 0.1296 -1.128
4 60 0.179 -0.919
5 63 0.2284 -0.744
6 65 0.2778 -0.589
7 66 0.3272 -0.448
8 67 0.3765 -0.315
9 67 0.4259 -0.187
10 67 0.4753 -0.062
11 71 0.5247 0.062
12 71 0.5741 0.187
13 72 0.6235 0.315
14 73 0.6728 0.448
15 75 0.7222 0.589
16 75 0.7716 0.744
17 78 0.821 0.919
18 81 0.8704 1.128
19 82 0.9198 1.403
20 83 0.9691 1.868

 The normal probability plot is obtained by plotting the X-values (sample data) on the
horizontal axis, and the corresponding zi values on your vertical axis.

 The following normality plot is obtained:
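The table of fi and zi values can be reproduced with the standard library (a sketch using `statistics.NormalDist` for the inverse normal CDF Φ⁻¹):

```python
from statistics import NormalDist

# f_i = (i - 0.375) / (n + 0.25); z_i = inverse standard normal CDF of f_i.
data = sorted([55, 55, 59, 60, 63, 65, 66, 67, 67, 67,
               71, 71, 72, 73, 75, 75, 78, 81, 82, 83])
n = len(data)
for i, x in enumerate(data, start=1):
    f = (i - 0.375) / (n + 0.25)
    z = NormalDist().inv_cdf(f)
    print(i, x, round(f, 4), round(z, 3))
# Plotting x against z: a roughly straight line suggests normal data.
```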

Scatter plot

 The scatter plot plots one random variable against another.

 So, if there are two random variables, let us say y and x and to know whether there is any
relationship between y and x, then one way of visually verifying this dependency or
interdependency is to plot y versus x.
 Data corresponding to 100 students – students have spent time preparing for a quiz and
they have obtained marks in that quiz.
 If more time was spent studying, then students might have scored more marks
 X-axis (Time spent), Y-axis (Marks obtained)
 If the random variables have a dependency, then an alignment of the data can be seen
 If there is no dependency, then the data will spread randomly, and there is no clear
pattern
 This plot is helpful for assessing the dependency between two variables before
proceeding with further analysis

http://www.alcula.com/calculators/statistics/scatter-plot/


Data
1. 3, 100
2. 3, 100
3. 2, 75
4. 1, 50
5. 1, 45
6. 3, 100
7. 3, 100
8. 2, 75
9. 1, 50
10. 1, 45
11. 3, 100
12. 3, 100
13. 2, 75
14. 1, 50
15. 1, 45
16. 3, 100
17. 3, 100
18. 2, 75
19. 1, 50
20. 1, 45
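As a numeric complement to the visual check, the Pearson correlation of the (time, marks) pairs listed above can be computed (the correlation step itself is an addition, not part of the notes):

```python
import math

# The (time spent, marks) pairs from the data list above
# (the five distinct rows repeat four times).
pairs = [(3, 100), (3, 100), (2, 75), (1, 50), (1, 45)] * 4
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
n = len(pairs)

mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sx = sum((x - mx) ** 2 for x in xs)
sy = sum((y - my) ** 2 for y in ys)
r = cov / math.sqrt(sx * sy)     # Pearson correlation coefficient
print(round(r, 3))               # close to 1: marks rise with study time
```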

08.12.2022
Hypothesis testing
 The basics of hypothesis testing, which is an important activity when making decisions
from a set of data.

Motivation for Hypothesis Testing


 Business – will an investment in a mutual fund yield annual returns greater than desired
value? (based on past performance of the fund)
 Medical – is the incidence of the diabetes greater among males than females?
 Social – are women more likely to change mobile service provider than men?
 Engineering – has the efficiency of the pump decreased from its original value due to
aging?

Hypothesis testing
 The hypothesis is generally converted to a test of the mean or variance parameter of a
population (or differences in mean or variances of populations)
 A hypothesis is a statement or postulate about the parameters of a distribution (or model)
o Null hypothesis H0 – The default or status quo postulate that we wish to reject if
the sample set provides sufficient evidence
o Alternative hypothesis H1 – The alternative postulate that is accepted if the null
hypothesis is rejected

Hypothesis testing procedure


 Identify the parameter of interest (mean, variance, proportion) which you wish to test
 Construct the null and alternate hypothesis
 Compute a test statistic which is a function of the sample set of observations
 Derive the distribution of the test statistic under the null hypothesis assumption
 Choose a test criterion (threshold) against which the test statistic is compared to reject /
not reject the null hypothesis

 No hypothesis test is perfect. There are inherent errors since it is based on observations
which are random
 The performance of a hypothesis test depends on
o Extent of variability in data
o Number of observations (Sample size)
o Test Statistic (function of observations)
o Test criterion (Threshold)
 There are 2 types of hypothesis tests – the two sided and one sided test
 If it is a two sided test, it has a lower criterion threshold and the upper threshold selected
from the appropriate distribution
 Depending on the type of test, thresholds are chosen and the test statistic is then
compared against those thresholds

Errors in hypothesis testing
 Two types of errors (Type I & Type II)

 Typically the Type I error probability α (also called the level of significance of the test)
is controlled by choosing the criterion from the distribution of the test statistic under the
null hypothesis
 Type I error (false alarm)
 Type II error also has a probability, which is denoted by β.
 The probability of correctly rejecting a false null hypothesis is known as the power of
the statistical test and is denoted by 1 − β.

Summary of useful hypothesis tests


https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/#Hypothesis

What is a Hypothesis?
A hypothesis is an educated guess about something in the world around you. It should be
testable, either by experiment or observation. For example:
 A new medicine you think might work.
 A way of teaching you think might be better.

What is a Hypothesis Statement?


If you are going to propose a hypothesis, it’s customary to write a statement. Your statement will
look like this:
“If I…(do this to an independent variable)….then (this will happen to the dependent variable).”
For example:
 If I (decrease the amount of water given to herbs) then (the herbs will increase in size).
 If I (give patients counseling in addition to medication) then (their overall depression
scale will decrease).
 If I (give exams at noon instead of 7) then (student test scores will improve).

A good hypothesis statement should:


 Include an “if” and “then” statement (according to the University of California).
 Include both the independent and dependent variables.
 Be testable by experiment, survey or other scientifically sound technique.
 Be based on information in prior research (either yours or someone else’s).
 Have design criteria (for engineering or programming projects).

What is Hypothesis Testing?
Hypothesis testing in statistics is a way for you to test the results of a survey or experiment to see
if you have meaningful results. You’re basically testing whether your results are valid by
figuring out the odds that your results have happened by chance. If your results may have
happened by chance, the experiment won’t be repeatable and so has little use.

What is the Null Hypothesis?


If you trace back the history of science, the null hypothesis is always the accepted fact. Simple
examples of null hypotheses that are generally accepted as being true are:
1. DNA is shaped like a double helix.
2. There are 8 planets in the solar system (excluding Pluto).
3. Taking Vioxx can increase your risk of heart problems (a drug now taken off the market).

Hypothesis Testing
Example #1: Basic Example
 A researcher thinks that if knee surgery patients go to physical therapy twice a week (instead
of 3 times), their recovery period will be longer.
 Average recovery time for knee surgery patients is 8.2 weeks.
 The hypothesis statement in this question is that the researcher believes the average recovery
time is more than 8.2 weeks.
 It can be written in mathematical terms as: H1: μ > 8.2
 Next, state the null hypothesis.
 That’s what will happen if the researcher is wrong.
 In the above example, if the researcher is wrong then the recovery time is less than or equal
to 8.2 weeks.
In math, that’s: H0: μ ≤ 8.2

Example #2: Basic Example


 A principal at a certain school claims that the students in his school are above average
intelligence.
 A random sample of thirty students’ IQ scores has a mean of 112.5.
 Is there sufficient evidence to support the principal’s claim?
 The mean population IQ is 100 with a standard deviation of 15.

Step 1: State the Null hypothesis.


The accepted fact is that the population mean is 100, so: H0: μ = 100.
Step 2: State the Alternate Hypothesis.
The claim is that the students have above average IQ scores, so: H1: μ > 100.
The fact that we are looking for scores “greater than” a certain point means that this is
a one-tailed test.

Step 3: Draw a picture to help you visualize the problem.

Step 4: State the alpha level. If you aren’t given an alpha level, use 5% (0.05).

Step 5: Find the rejection region area (given by your alpha level above) from the z-table.
An area of .05 is equal to a z-score of 1.645.

Step 6: Find the test statistic using the formula z = (x̅ − μ) / (σ/√n).

For this set of data: z = (112.5 – 100) / (15/√30) = 4.56

Step 7: If Step 6 is greater than Step 5, reject the null hypothesis. If it’s less than Step 5,
you cannot reject the null hypothesis. In this case, it is more (4.56 > 1.645), so you can
reject the null.
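Steps 4 through 7 above can be sketched in code (using `statistics.NormalDist` in place of the z-table lookup):

```python
import math
from statistics import NormalDist

# One-tailed z-test from Example #2.
mu0, sigma = 100, 15        # population mean and standard deviation under H0
xbar, n = 112.5, 30         # sample mean and sample size
alpha = 0.05

z = (xbar - mu0) / (sigma / math.sqrt(n))   # test statistic
z_crit = NormalDist().inv_cdf(1 - alpha)    # rejection threshold (z-table value)

print(round(z, 2), round(z_crit, 3))        # 4.56 1.645
print("reject H0" if z > z_crit else "fail to reject H0")  # reject H0
```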
https://youtu.be/N5Wdfd3exmc

11.12.2022
https://youtu.be/cL5ie-669rc

Confidence Interval

 Estimation and confidence interval
 Levels of confidence
 Interpretation of interval estimation
 Margin of error
 Interval estimation of a population mean

Estimation and confidence interval


1. A point estimate is a single value used to estimate a population value
2. An interval estimate states the range within which a population parameter probably lies
3. A confidence interval is a range of values within which the population parameter is
expected to occur
4. The two confidence interval that are used extensively are 95% and 99%

Level of confidence (1 − α)

1. The level of confidence is the probability that the unknown population parameter falls
within the interval
2. Denote level of confidence = (1 − α)%
3. Example – 90%, 95%, 99%
4. α (alpha) is the probability that the parameter is not within the interval

Interpretation of interval estimation


1. For a 95% confidence interval, about 95% of similarly constructed intervals will
contain the parameter being estimated
2. 95% of the sample means for a specified sample size will lie within 1.96 standard
deviations of the hypothesized population mean
3. For the 99% confidence interval, 99% of the sample means for a specified sample size
will lie within 2.58 standard deviations of the hypothesized population mean

Margin of error & Interval estimate


1. An interval estimate can be calculated by adding or subtracting the margin of error to the
point estimate
2. The purpose of an interval estimate is to provide information about how close the point
estimate is to the value of parameter
3. The general form of an interval estimate of a population mean is as follows: X̅ ± Margin
of Error
4. Interval estimation of a population mean:
5. Interval estimate = X̅ ± z (s / √n), where

X̅ = sample mean
z = number of standard deviations from the sample mean
s = standard deviation of the sample
n = size of the sample

Example:

Suppose a student measuring the boiling temperature of a certain liquid observes
the readings (in degrees Celsius) 101.7, 102.4, 106.3, 105.9, 101.2 and 102.7 on 6
different samples of the liquid. He calculates the sample mean to be 103.40. If he knows
that the standard deviation for this procedure is 1.2 degrees, what is the confidence
interval for the population mean at a 95% confidence level?

Use the above interval estimate formula:

= 103.4 ± 1.96 × [1.2 / √6]
= 103.4 ± 0.96
= (102.44, 104.36)
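The boiling-temperature example can be checked with a short standard-library sketch (the 1.96 critical value for 95% is hard-coded, matching the notes):

```python
import math

def z_interval(sample, sigma, z=1.96):
    """Interval estimate x-bar ± z * sigma / sqrt(n) for a known population sd."""
    n = len(sample)
    mean = sum(sample) / n
    margin = z * sigma / math.sqrt(n)
    return mean, margin

readings = [101.7, 102.4, 106.3, 105.9, 101.2, 102.7]
mean, margin = z_interval(readings, sigma=1.2)
print(round(mean, 2), round(margin, 2))  # 103.37 0.96
```

Note that the notes round the sample mean to 103.4 before adding and subtracting the margin.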

https://www.samlau.me/test-textbook/ch/18/hyp_phacking.html

P-hacking
 A p-value or probability value is the chance, computed under the model in the null
hypothesis, that the test statistic is equal to the value that was observed in the data or is
even further in the direction of the alternative.
 If a p-value is small, that means the tail beyond the observed statistic is small and so the
observed statistic is far away from what the null predicts.
 This implies that the data support the alternative hypothesis better than they support the
null.
 By convention, when the p-value is below 0.05, the result is called statistically
significant, and the null hypothesis is rejected.
 There are dangers that present themselves when the p-value is misused.
 P-hacking is the act of misusing data analysis to show that patterns in data are statistically
significant, when in reality they are not.
 This is often done by performing multiple tests on data and only focusing on the tests that
return results that are significant.

Multiple Hypothesis Testing


 One of the biggest dangers of blindly relying on the p-value to determine “statistical
significance” comes when we are just trying to find the results that give us “good” p-
values.
 This is commonly done when doing “food frequency questionnaires,” or FFQs, to study
eating habits’ correlation with other characteristics (diseases, weight, religion, etc).
 FiveThirtyEight, an online blog that focuses on opinion poll analysis, made their own
FFQ – using this data, analyses were carried out to find results that could be considered
“statistically significant.”

data = pd.read_csv('raw_anonymized_data.csv')
# Do some EDA on the data so that categorical values get changed to 1s and 0s
data.replace('Yes', 1, inplace=True)
data.replace('Innie', 1, inplace=True)
data.replace('No', 0, inplace=True)
data.replace('Outie', 0, inplace=True)


# These are some of the columns that give us characteristics of FFQ-takers
characteristics = ['cat', 'dog', 'right_hand', 'left_hand']

# These are some of the columns that give us the quantities/frequencies of
# different foods the FFQ-takers ate
ffq = ['EGGROLLQUAN', 'SHELLFISHQUAN', 'COFFEEDRINKSFREQ']
 Look specifically at whether people own cats or dogs, and at which hand they favor.

 data[characteristics].head()

   cat  dog  right_hand  left_hand
0    0    0           1          0
1    0    0           1          0
2    0    1           1          0
3    0    0           1          0
4    0    0           1          0
 Additionally, look at how much shellfish, eggrolls, and coffee people consumed.
 data[ffq].head()
   EGGROLLQUAN  SHELLFISHQUAN  COFFEEDRINKSFREQ
0            1              3                 2
1            1              2                 3
2            2              3                 3
3            3              2                 1
4            2              2                 2
 Calculate the p-value for every pair of characteristic and food frequency/quantity
features.
# Calculate the p-value between every characteristic and food frequency/quantity pair
pvalues = {}
for c in characteristics:
    for f in ffq:
        pvalues[(c, f)] = findpvalue(data, c, f)
pvalues
{('cat', 'EGGROLLQUAN'): 0.69295273146288583,
('cat', 'SHELLFISHQUAN'): 0.39907214094767007,
('cat', 'COFFEEDRINKSFREQ'): 0.0016303467897390215,
('dog', 'EGGROLLQUAN'): 2.8476184473490123e-05,
('dog', 'SHELLFISHQUAN'): 0.14713568495622972,
('dog', 'COFFEEDRINKSFREQ'): 0.3507350497291003,
('right_hand', 'EGGROLLQUAN'): 0.20123440208411372,
('right_hand', 'SHELLFISHQUAN'): 0.00020312599063263847,
('right_hand', 'COFFEEDRINKSFREQ'): 0.48693234457564749,
('left_hand', 'EGGROLLQUAN'): 0.75803051153936374,
('left_hand', 'SHELLFISHQUAN'): 0.00035282554635466211,
('left_hand', 'COFFEEDRINKSFREQ'): 0.1692235856830212}
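The helper findpvalue is never defined in these notes; a plausible stand-in (an assumption, not FiveThirtyEight's actual code) is a two-sample permutation test on the difference in mean food quantity between the two characteristic groups:

```python
import random

def findpvalue(data, characteristic, food, n_permutations=10_000, seed=0):
    """Permutation p-value for the absolute difference in mean `food`
    between the characteristic == 1 and characteristic == 0 groups.
    `data` can be a dict of columns or a pandas DataFrame."""
    rng = random.Random(seed)
    pairs = list(zip(data[characteristic], data[food]))
    values = [v for _, v in pairs]
    n1 = sum(g for g, _ in pairs)  # size of the "1" group

    def mean_diff(one, zero):
        return abs(sum(one) / len(one) - sum(zero) / len(zero))

    observed = mean_diff([v for g, v in pairs if g == 1],
                         [v for g, v in pairs if g == 0])
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(values)  # break any real association between the columns
        if mean_diff(values[:n1], values[n1:]) >= observed:
            hits += 1
    return hits / n_permutations
```

For example, findpvalue({'cat': [1, 1, 1, 0, 0, 0], 'EGGROLLQUAN': [5, 5, 5, 1, 1, 1]}, 'cat', 'EGGROLLQUAN') comes out near 0.1, the exact probability (2/20) of so extreme a split arising from shuffling alone.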
 The study finds that:

Eating/Drinking   Is linked to:      P-value
Egg rolls         Dog ownership      <0.0001
Shellfish         Right-handedness   0.0002
Shellfish         Left-handedness    0.0004
Coffee            Cat ownership      0.0016

 Clearly this is flawed.
 Aside from the fact that some of these correlations seem to make no sense, shellfish is
found to be linked to both right- and left-handedness.
 This happened because all columns were blindly tested against each other for statistical
significance, and only the pairs giving “statistically significant” results were kept.
 This shows the dangers of blindly following the p-value without a care for proper
experimental design.

Example:
 A simple example of this would be the case of rolling a pair of dice and getting two 6s.
 If the null hypothesis is that the dice are fair and not weighted, and the test statistic is the
sum of the dice, then the p-value of this outcome is 1/36 ≈ 0.028, a “statistically
significant” result suggesting the dice are weighted.
 But obviously a single roll is nowhere near enough rolls to provide good evidence either
way; blindly applying the p-value without properly designing a good experiment can lead
to a bad conclusion.
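The 1/36 figure can be verified by enumerating all 36 equally likely outcomes of a pair of fair dice (a minimal sketch; the test statistic is the sum and the tail is one-sided):

```python
from itertools import product

def dice_p_value(observed_sum):
    """One-tailed p-value: P(sum of two fair dice >= observed_sum)."""
    outcomes = list(product(range(1, 7), repeat=2))  # all 36 rolls
    hits = sum(1 for a, b in outcomes if a + b >= observed_sum)
    return hits / len(outcomes)

p = dice_p_value(12)  # two 6s
print(round(p, 3))    # 0.028
```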

Bayesian Inference
https://towardsdatascience.com/what-is-bayesian-inference-4eda9f9e20a6

 Frequentism is based on frequencies of events.


 Bayesianism is based on our knowledge of events.
 The prior represents your knowledge of the parameters before seeing data.
 The likelihood is the probability of the data given values of the parameters.
 The posterior is the probability of the parameters given the data.
 Bayes’ theorem relates the prior, likelihood, and posterior distributions.
 MLE is the maximum likelihood estimate, which is what frequentists use.
 MAP is the maximum a posteriori estimate, which is what Bayesians use.

Illustration of how our prior knowledge affects our posterior knowledge

 Machine learning is mainly concerned with prediction, and prediction is very much
concerned with probability.
 Two main interpretations of probability: 1) frequentism 2) Bayesianism.
 The frequentist (or classical) definition of probability is based on frequencies of events,
whereas the Bayesian definition of probability is based on our knowledge of events.
(What the data says versus what we know from the data.)
 Analogy: Where did you lose your phone? You hear it ringing.
 Both the frequentist and the Bayesian use their ears (the ringing) when inferring where to look for
the phone, but the Bayesian also incorporates prior knowledge about the lost phone into their
inference.

Bayes’ theorem
 We have two sets of outcomes A and B (also called events), denote the probabilities of each
event P(A) and P(B) respectively.
 The probability of both events is denoted with the joint probability P(A, B), and expand this with
conditional probabilities
P(A, B) = P(A|B) P(B)  (1)
 i.e., the conditional probability of A given B times the probability of B gives us the joint
probability of A and B. It similarly follows that
P(A, B) = P(B|A) P(A)  (2)
 Since the left-hand sides of (1) and (2) are the same, the right-hand sides must be equal:
P(A|B) P(B) = P(B|A) P(A)
⇒ P(A|B) = P(B|A) P(A) / P(B)
 This is Bayes’ theorem.

 The evidence (the denominator above) ensures that the posterior distribution on the left-hand side
is a valid probability density and is called the normalizing constant.
 The theorem in words is stated as follows: Posterior ∝ Likelihood × Prior, where ∝ means
“proportional to”.
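Bayes’ theorem can be sanity-checked numerically on a small joint distribution (the four probabilities below are made-up numbers for illustration):

```python
# Hypothetical joint distribution over binary events A and B
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

p_a = sum(p for (a, _), p in joint.items() if a == 1)  # marginal P(A)
p_b = sum(p for (_, b), p in joint.items() if b == 1)  # marginal P(B)
p_a_given_b = joint[(1, 1)] / p_b                      # P(A|B) from the joint
p_b_given_a = joint[(1, 1)] / p_a                      # P(B|A) from the joint

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
bayes = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4), round(bayes, 4))  # 0.6667 0.6667
```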

Example: coin flipping


 A coin flips heads up with probability θ and tails with probability 1−θ (where θ is unknown).
You flip the coin 11 times, and it ends up heads 8 times. Now, would you bet for or against
the event that the next two tosses turn up heads?

 Let X be a random variable representing the coin, where X=1 is heads and X=0 is tails, such
that P(X=1)=θ and P(X=0)=1−θ. Furthermore, let D denote our data (8 heads, 3 tails).
 Estimate the value of the parameter θ, so that we can calculate the probability of seeing 2
heads in a row.
 If the probability is less than 0.5, we will bet against seeing 2 heads in a row, but if it’s above 0.5,
then we bet for.

Frequentist approach
 As the frequentist, maximize the likelihood, which is to ask the question: what value of θ will
maximize the probability that we got D given θ? More formally, we want to find
θ̂_MLE = argmax_θ P(D | θ)
 This is called maximum likelihood estimation (MLE).
 The 11 coin flips follow a binomial distribution with n=11 trials, k=8 successes, and θ the
probability of success.
 Using the likelihood of a binomial distribution, find the value of θ that maximizes the
probability of the data:
P(D | θ) = C(11, 8) θ⁸ (1 − θ)³  (3)
 Note that (3) expresses the likelihood of θ given D, which is not the same as saying the probability
of θ given D.
 The image underneath shows our likelihood function P(D∣θ) (as a function of θ) and the maximum
likelihood estimate.

Illustration of the likelihood function P(D|θ) with a vertical line at the maximum likelihood
estimate.

 The value of θ that maximizes the likelihood is k/n, i.e., the proportion of successes in the trials.
The maximum likelihood estimate is therefore k/n = 8/11 ≈ 0.73.
 Assuming the coin flips are independent, now calculate the probability of seeing 2 heads in a
row:
P(2 heads) = θ̂² = (8/11)² ≈ 0.53
 Since the probability of seeing 2 heads in a row is larger than 0.5, we would bet for!

Bayesian approach

 As the Bayesian, maximizing the posterior asks the question: what value of θ will maximize the
probability of θ given D?
θ̂_MAP = argmax_θ P(θ | D)
 which is called maximum a posteriori (MAP) estimation.
 To answer the question, use Bayes’ theorem:
P(θ | D) = P(D | θ) P(θ) / P(D)
 Since the evidence P(D) is a normalizing constant not dependent on θ, ignore it.
 This now gives
P(θ | D) ∝ P(D | θ) P(θ)

 From the frequentist approach, we already have the likelihood (3), P(D | θ) ∝ θ⁸ (1 − θ)³.
 Drop the binomial coefficient, since it’s not dependent on θ.
 The only thing left is the prior distribution P(θ).
 This distribution describes initial (prior) knowledge of θ.
 A convenient distribution to choose is the Beta distribution, because it’s defined on the interval [0,
1], and θ is a probability, which has to be between 0 and 1.
 This gives
P(θ) = Beta(θ; α, β) = [Γ(α + β) / (Γ(α) Γ(β))] θ^(α−1) (1 − θ)^(β−1)
 where Γ is the Gamma function. Since the fraction is not dependent on θ, ignore it, which gives
P(θ) ∝ θ^(α−1) (1 − θ)^(β−1)
 Set the prior distribution in such a way that we incorporate what we know about θ prior to seeing
the data.
 Now, we know that coins are usually pretty fair, and if we choose α=β=2, we get a beta
distribution that favors θ=0.5 more than θ=0 or θ=1.
 The illustration below shows this prior Beta(2,2), the normalized likelihood, and the resulting
posterior distribution.

Illustration of the prior P (θ), likelihood P(D | θ), and posterior distribution P (θ | D)
with a vertical line at the maximum a posteriori estimate.

 The posterior distribution ends up being dragged a little more towards the prior distribution,
which makes the MAP estimate a little different from the MLE estimate:
θ̂_MAP = (k + α − 1) / (n + α + β − 2) = (8 + 1) / (11 + 2) = 9/13 ≈ 0.69
 which is a little lower than the MLE estimate. If we now use the MAP estimate to calculate the
probability of seeing 2 heads in a row, we find (9/13)² ≈ 0.48 < 0.5, so we will bet against it.
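The whole coin example can be reproduced in a few lines (a sketch using the closed-form mode of the Beta posterior, with α = β = 2 as chosen above):

```python
def mle(k, n):
    """Maximum likelihood estimate for a binomial: the proportion of successes."""
    return k / n

def map_estimate(k, n, alpha=2, beta=2):
    """Mode of the Beta(k + alpha, n - k + beta) posterior (a Beta prior
    is conjugate to the binomial likelihood)."""
    return (k + alpha - 1) / (n + alpha + beta - 2)

k, n = 8, 11                    # 8 heads in 11 flips
theta_mle = mle(k, n)           # 8/11 ≈ 0.73
theta_map = map_estimate(k, n)  # 9/13 ≈ 0.69
print(theta_mle ** 2 > 0.5)     # True: the frequentist bets FOR two heads
print(theta_map ** 2 > 0.5)     # False: the Bayesian bets AGAINST
```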
