UNIT – II
22.11.2022
STATISTICAL MODELING
RANDOM VARIABLE
SAMPLE STATISTICS
HYPOTHESIS TESTING
CONFIDENCE INTERVALS
P-HACKING
BAYESIAN INFERENCE
Objectives
Introduction to random variables
How they are characterized using probability measures and probability density functions
How parameters of these density functions can be estimated
How you can do decision making from data using the method of hypothesis testing
Characterizing random phenomena – what they are and how probability can be used as a
measure to describe them
Statistical Modeling
Random phenomena
1) Deterministic phenomenon – Phenomenon whose outcome can be predicted exactly for
given experimental conditions
2) Stochastic phenomenon – Phenomenon which can have many possible outcomes for the
same experimental conditions. The outcome can be predicted only with limited confidence
Example – Outcome of a coin toss
- Might get a head or tail (but can’t say with 90% or 95% confidence)
- Might be able to say it only with a 50% confidence if it is a fair coin
- Random phenomena - all the notions of probability - using just the coin toss experiment
- A single coin toss whose outcomes are described by H and T
- The sample space is the set of all possible outcomes.
- In this case the sample space consists of the two outcomes denoted by the symbols H and T
- On the other hand, if there are two successive coin tosses, then there are 4 possible
outcomes, denoted by the symbols HH, HT, TH and TT, and these constitute the sample space.
- Outcomes of the sample space for example, HH, HT, TH and TT can also be considered as
events. These events are known as elementary events.
Probability measure – a function that assigns a real value to every event of a random
phenomenon and satisfies the following axioms:
0 ≤ P(A) ≤ 1 (probabilities are non-negative and at most 1 for any event A)
P(S) = 1 (the probability of the entire sample space is 1 – one of the outcomes must occur)
For two mutually exclusive events A and B,
o P(A∪B) = P(A) + P(B)
o Two events are mutually exclusive if occurrence of one implies other event does not
occur
In a two-coin toss experiment, the events {HH} and {HT} are mutually exclusive:
P(HH or HT) = P(HH) + P(HT) = 0.25 + 0.25 = 0.5
obtained from the addition law of probability for mutually exclusive events
o Mutually exclusive events are the events that preclude each other.
o Which means, if event A has occurred then it implies B has not occurred, then A and
B are called mutually exclusive events
o In two successive coin tosses, if two successive heads have occurred, it is clear that
the event of a head followed by a tail has not occurred; these are mutually exclusive
events
o The probability of either two successive heads or a head followed by a tail can
therefore be obtained by simply adding their respective probabilities, because they
are mutually exclusive events.
- In the two coin toss experiment the sample space consists of 4 outcomes denoted by HH,
HT, TH and TT.
- The event A, a head in the first toss, consists of the two outcomes HH and HT
- The complement of A, denoted Aᶜ, is the set of all outcomes that exclude A, which is
nothing but the set of outcomes TH and TT
- P(A) = P(HH) + P(HT) = 0.25 + 0.25 = 0.5
- The probability of the complement, P({TH, TT}) = 1 − P(A) = 0.5
- In general, P(Aᶜ) = 1 − P(A).
- If B is the event of a head in the second toss, A and B are not mutually exclusive: the
common outcome HH (two successive heads) belongs to both A and B
- Computing P(A or B) – a head in the first toss or a head in the second toss – covers three
outcomes and gives a probability of 0.75, which can be counted from the respective
probabilities of HH, HT and TH
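The complement and union rules above can be checked by enumerating the two-coin sample space; the sketch below (plain Python with exact fractions) is illustrative:

```python
from fractions import Fraction

# Sample space of two fair coin tosses, each outcome equally likely
outcomes = {"HH", "HT", "TH", "TT"}
prob = {o: Fraction(1, 4) for o in outcomes}

def P(event):
    """Probability of an event = sum of its outcome probabilities."""
    return sum(prob[o] for o in event)

A = {"HH", "HT"}   # head in the first toss
B = {"HH", "TH"}   # head in the second toss

# Complement rule: P(A^c) = 1 - P(A)
print(P(outcomes - A))              # 1/2
print(1 - P(A))                     # 1/2

# A and B share HH, so they are not mutually exclusive;
# inclusion-exclusion gives P(A or B) = P(A) + P(B) - P(A and B)
print(P(A | B))                     # 3/4
print(P(A) + P(B) - P(A & B))       # 3/4
```

The union probability 3/4 matches counting the three outcomes HH, HT and TH directly.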
Conditional Probability
If two events A and B are not independent, then information available about the
outcome of event A can influence the predictability of event B
Conditional probability
o P(B | A) = P(A∩B)/P(A) if P(A)>0
o P(A | B)P(B) = P(B | A)P(A) – Bayes Formula
o P(A) = P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ)
Example: two (fair) coin toss experiment
o Event A : First toss is a head = {HT, HH}
o Event B : Two successive heads = {HH}
o Pr(B)=0.25 (no information)
o Given event A has occurred, Pr(B | A) = P(A∩B)/P(A) = 0.25 / 0.5 = 0.5
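The conditional-probability example can be verified the same way (a small sketch; the event sets are those from the notes):

```python
from fractions import Fraction

# Sample space of two fair coin tosses
outcomes = {"HH", "HT", "TH", "TT"}
prob = {o: Fraction(1, 4) for o in outcomes}

def P(event):
    return sum(prob[o] for o in event)

A = {"HH", "HT"}   # first toss is a head
B = {"HH"}         # two successive heads

# Without any information, P(B) = 1/4
print(P(B))

# Conditional probability: P(B | A) = P(A and B) / P(A)
P_B_given_A = P(A & B) / P(A)
print(P_B_given_A)                  # 1/2

# Bayes formula check: P(A | B) P(B) = P(B | A) P(A)
assert (P(A & B) / P(B)) * P(B) == P_B_given_A * P(A)
```

Knowing that the first toss was a head doubles the probability of two successive heads, from 1/4 to 1/2.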
EXAMPLE:
In a manufacturing process, 1000 parts are produced in a day, of which 50 are defective. If
we randomly take a part from the day's production, the probability that it is defective is
50/1000 = 0.05.
The notion of random variables and the idea of probability mass and density
functions
How to characterize these functions
How to work with them
Random Variable
A random variable (RV) is a map from sample space to a real line such that there is a
unique real number corresponding to every outcome of sample space
o Example: Coin toss sample space [H T] mapped to [0 1]. If the sample space
outcomes are already real valued, there is no need for this mapping (e.g. the throw of a die)
o Allows numerical computations such as finding the expected value of a RV
o Discrete RV (throw of a die or toss of a coin)
o Continuous RV (sensor readings, time interval between failures)
o Associated with the RV is also a probability measure
In the case of a continuous random variable, we define what is known as a
probability density function, which can be used to compute the probability for every
outcome of the random variable within an interval.
Notice, in the case of a continuous random variable, there are infinitely many outcomes
and therefore we cannot associate a probability with every individual outcome.
However, we can associate a probability that the random variable lies within some
finite interval.
For a random variable x which can take any value on the real line from −∞ to ∞, the density
function f(x) is such that the probability that the variable lies in an interval a to b is defined
as the integral of this function from a to b: P(a ≤ x ≤ b) = ∫ f(x) dx over [a, b].
The integral is an area, and the area represents the probability; the probability that the
random variable lies between −1 and 2 is denoted by the shaded area.
The cumulative distribution function, denoted by capital F, is the probability that the
random variable x lies in the interval −∞ to b: F(b) = P(x ≤ b), the integral of the density
function f(x) between −∞ and b.
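As a concrete check of this area interpretation, the standard normal distribution from Python's statistics module can evaluate P(−1 ≤ x ≤ 2) as F(2) − F(−1), matching the shaded-area example (a small sketch):

```python
from statistics import NormalDist

# Standard normal density as a concrete f(x); the interval [-1, 2]
# matches the shaded-area example in the notes
std_normal = NormalDist(mu=0, sigma=1)

# P(a <= x <= b) = F(b) - F(a), where F is the cumulative distribution
a, b = -1, 2
p = std_normal.cdf(b) - std_normal.cdf(a)
print(round(p, 4))   # about 0.8186
```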
Other functions – Binomial Mass Function, Gaussian or Normal Density Function,
Chi-square density function
Other examples of pdf – uniform density function & exponential density function
Moments of a PDF
Similar to describing a function using derivatives, a pdf can be described by its
moments
o For continuous distributions: E[x^k] = ∫ x^k f(x) dx, integrated from −∞ to ∞
o For discrete distributions: E[x^k] = Σ x_i^k P(x_i), summed over i = 1, ..., N
o Mean: µ = E[x]
o Variance: σ² = E[(x − µ)²] = E[x²] − µ²
o Standard deviation = square root of variance = σ
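For a discrete example, the moments of a fair six-sided die follow directly from the definitions above (a small sketch using exact fractions):

```python
from fractions import Fraction

# Moments of a fair six-sided die: E[x^k] = sum of x_i^k * p(x_i)
values = range(1, 7)
p = Fraction(1, 6)

def moment(k):
    return sum(Fraction(x ** k) * p for x in values)

mean = moment(1)                 # E[x] = 7/2
var = moment(2) - mean ** 2      # E[x^2] - mu^2 = 35/12
print(mean, var)                 # 7/2 35/12
```

The standard deviation is then the square root of 35/12, about 1.71.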
01.12.2022
Sample Statistics
Basic concepts
Population – Set of all possible outcomes of a random experiment characterized by
f(x)
Sample set (realization) – Finite set of observations obtained through an
experiment
Inference – conclusion derived regarding the population (pdf, parameters) from the
sample set
o Inference made from a sample set is also uncertain, since it depends on the
sample set, which is one of many possible realizations; the confidence interval
associated with the derived estimates should therefore also be provided
Statistical analysis
o Descriptive statistics (Analysis)
Graphical – organizing and presenting the data (ex: Box plots,
probability plots)
Numerical – summarizing the sample set (ex: mean, mode, range,
variance, moments)
o Inferential
Estimation – estimate parameters of the pdf along with its confidence
region
Hypothesis testing – making judgments about f(x) and its parameters
Measures of central tendency
o Represent sample set by a single value
Mean (or average): x̅ = (x1 + x2 + ... + xN) / N
Unbiased estimate – the expectation of x̅ is μ. This can be proven analytically for any kind
of distribution.
Take a sample of N points and get an estimate x̅. Repeat the experiment, draw another
random sample of N points from the population, and get another value of x̅. The average
of these x̅ values from different experimental sets will tend to the population mean μ.
This is a useful and important property of estimates.
One unfortunate aspect of this statistic is that a single bad data point in the sample can
significantly affect the estimate x̅. Such a bad value is called an outlier, and even a single
outlier in the data can give rise to a poor estimate of x̅.
MEDIAN
Represent sample set by a single value
o Median – value of xi such that 50% of the values are less than x i and 50% of
observations are greater than xi
Robust with respect to outliers in data
Best estimate in the least absolute deviation sense
Ex: Sample Heights of 20 Cherry Trees
[55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83]
Median = 69 (population mean used to generate random sample was
70)
Median = 69 (after a bias of 50 was added to first sample value)
Another measure of central tendency is what is called a median.
The median is a value such that 50 percent of the data points lie below this value and 50
percent of the experimental observations are greater than this value.
Order all the observations from smallest to largest and then find the middle value.
The 10th ordered point is 67 and the 11th is 71; because there is an even number of
points, the median is their average: (67 + 71)/2 = 69.
If there is an odd number of points, take the middle point as the median.
Add a bias of 50 to the first data point of the raw sample (73 becomes 123), reorder the
data, and find the median again: the median has not changed.
So, the presence of an outlier has not affected the median.
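The robustness of the median can be demonstrated on the cherry-tree sample (a sketch; the bias of 50 is added to the first raw observation):

```python
from statistics import mean, median

# Raw (unordered) sample of 20 black cherry tree heights
heights = [73, 75, 55, 60, 66, 71, 81, 67, 83, 75,
           82, 71, 63, 55, 72, 78, 67, 65, 67, 59]

# Even number of points: median = average of the 10th and 11th
# ordered values, (67 + 71) / 2 = 69
print(median(heights))          # 69.0

# Add a bias of 50 to the first observation (73 -> 123): the mean
# is pulled up by the outlier, the median is not
corrupted = [123] + heights[1:]
print(mean(heights), mean(corrupted))   # 69.25 vs 71.75
print(median(corrupted))        # still 69.0
```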
MODE
Represent sample set by a single value
o Mode – Value that occurs most often (most probable value)
o Ex: Sample heights of 20 cherry trees
o [55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83]
o Mode = 67 (3 occurrences)
The mode is another measure of central tendency: the value that occurs most often, or
what is called the most probable value.
Sometimes a distribution may have two modes, which is called a bimodal distribution; if
sampling is done from such a distribution, it will give two clusters, one around each of
the modes.
Measures of spread
Represents spread of sample set
https://www.calculatorsoup.com/calculators/statistics/variance-calculator.php
https://www.calculatorsoup.com/calculators/statistics/descriptivestatistics.php
Another characterization of a sample set is the measures of spread, which tell how widely
the data ranges.
Sample variance: s² = Σ(xi − x̅)² / (N − 1)
The sample variance is an unbiased estimate of the population variance
The square root of the sample variance is also known as the standard deviation.
The sample variance is also very susceptible to outliers.
If there is a single outlier, the sample variance and sample standard deviation can become
very poor estimates of the population parameters.
Another measure of spread is the mean absolute deviation, somewhat similar in spirit to
the median: MAD = Σ|xi − m| / N, where m is the median (or the mean) of the sample.
A third measure of spread is what is called the range that is basically the difference
between the maximum and minimum value.
Example: refer to the Measures of spread image above, highlighted in red
A single outlier can cause the standard deviation and the variance to become very poor
estimates, so they cannot be trusted as good estimates of the population standard
deviation or variance.
Taking the mean absolute deviation from the median is even better in terms of
robustness with respect to outliers.
The range of the data is obtained as the difference between the maximum and minimum values.
Even when the 20 data points themselves are not given, the mean and standard deviation
alone convey the properties of the sample (the power of sample statistics).
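The measures of spread above can be computed on the same 20-tree sample (a sketch; results can be cross-checked against the linked calculators):

```python
from statistics import median, variance

heights = [55, 55, 59, 60, 63, 65, 66, 67, 67, 67,
           71, 71, 72, 73, 75, 75, 78, 81, 82, 83]

n = len(heights)
xbar = sum(heights) / n

# Sample variance (unbiased, divides by n - 1) and standard deviation
s2 = sum((x - xbar) ** 2 for x in heights) / (n - 1)
s = s2 ** 0.5
assert abs(s2 - variance(heights)) < 1e-9   # matches statistics.variance

# Mean absolute deviation from the median (robust to outliers)
m = median(heights)
mad = sum(abs(x - m) for x in heights) / n

# Range = max - min
rng = max(heights) - min(heights)
print(round(s2, 2), round(s, 2), round(mad, 2), rng)
```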
03.12.2022
Graphical Analysis – Histograms, Box Plot, Probability Plot, Scatter Plot
Histograms
o Divide the range of values in the sample set into small intervals and count how many
observations fall within each interval
o For each interval plot a rectangle with width = interval size and height equal to
number of observations in interval
o Example – Sample of 20 heights of black cherry trees
[73 75 55 60 66 71 81 67 83 75 82 71 63 55 72 78 67 65 67 59]
Given a sample set, first divide its range into small intervals and count how many
observations fall within each interval.
Plot the interval (width = interval size) on the x axis and the number of data points in
that interval as the height on the y axis.
Box Plot
Another kind of plot is the box plot, which is often used in visualizing stock prices.
Compute quantities called quartiles Q1, Q2 and Q3 and the minimum and maximum
values in the range
What are quartiles?
Quartiles are basically an extension of the idea of median
Q2 is exactly the median, which means half the points fall below the value of Q2 and half
the points lie above Q2.
Similarly, Q1 represents the 25 percent value: 25 percent of the observations fall below
Q1 and 75 percent above it. Q3 implies that 75 percent of the data points fall below Q3
and 25 percent above it.
And once you have these values, the median, the quartiles and the minimum maximum,
you can plot what is called the box and whisker plot
The lowest observation and the highest observation are called the whiskers.
This gives a little more information about the spread of the data
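The quartiles behind the box plot can be computed with the standard library (a sketch; note that quartile conventions differ slightly between tools):

```python
from statistics import quantiles

heights = [55, 55, 59, 60, 63, 65, 66, 67, 67, 67,
           71, 71, 72, 73, 75, 75, 78, 81, 82, 83]

# Q1, Q2 (median), Q3; method="inclusive" interpolates between the
# ordered data points (the convention most spreadsheet tools use)
q1, q2, q3 = quantiles(heights, n=4, method="inclusive")
print(q1, q2, q3)                   # 64.5 69.0 75.0

# Whiskers of the box plot: the lowest and highest observations
print(min(heights), max(heights))   # 55 83
```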
Probability Plot
The third kind of plot, which is very useful for learning about the distribution of the data,
is the probability plot: the P-P plot or the Q-Q plot.
Standardization means removing the mean and dividing by the standard deviation.
Standardize the 20 values (these are called the standardized values) and sort them from
lowest to highest.
https://mathcracker.com/normal-probability-plot-maker#results
55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83
The theoretical frequencies fi need to be computed, as well as the associated z-scores zi,
for i = 1, 2, ..., 20.
The theoretical frequencies fi are approximated using the formula fi = (i − 0.375) / (n + 0.25),
where i corresponds to the position in the ordered dataset.
The corresponding z-score is computed as zi = Φ⁻¹(fi).
The normal probability plot is obtained by plotting the X-values (sample data) on the
horizontal axis and the corresponding zi values on the vertical axis.
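The plotting positions and z-scores can be computed directly with the standard library (a sketch using the 20 sorted heights):

```python
from statistics import NormalDist

heights = [55, 55, 59, 60, 63, 65, 66, 67, 67, 67,
           71, 71, 72, 73, 75, 75, 78, 81, 82, 83]

n = len(heights)
xs = sorted(heights)

# Theoretical frequencies f_i = (i - 0.375) / (n + 0.25) and the
# associated z-scores z_i = inverse normal CDF of f_i
fs = [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]
zs = [NormalDist().inv_cdf(f) for f in fs]

# Plotting xs (horizontal) against zs (vertical) gives the normal
# probability plot; a roughly straight line suggests normality
for x, z in zip(xs[:3], zs[:3]):
    print(x, round(z, 3))
```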
The following normality plot is obtained:
Scatter plot
So, if there are two random variables, say y and x, and we want to know whether there is
any relationship between them, one way of visually verifying this dependency is to plot
y versus x.
Data corresponding to 100 students: each student has spent some time preparing for a
quiz and obtained marks in that quiz.
Students who spent more time studying might have scored more marks.
X-axis (time spent), Y-axis (marks obtained)
If the random variables have a dependency, then an alignment of the data can be seen.
If there is no dependency, the data will spread randomly with no clear pattern.
This plot is helpful in assessing the dependency between two variables before proceeding
to further analysis.
http://www.alcula.com/calculators/statistics/scatter-plot/
Data
1. 3, 100
2. 3, 100
3. 2, 75
4. 1, 50
5. 1, 45
6. 3, 100
7. 3, 100
8. 2, 75
9. 1, 50
10. 1, 45
11. 3, 100
12. 3, 100
13. 2, 75
14. 1, 50
15. 1, 45
16. 3, 100
17. 3, 100
18. 2, 75
19. 1, 50
20. 1, 45
08.12.2022
Hypothesis testing
The basics of hypothesis testing, an important activity when making decisions from a set
of data.
Hypothesis testing
The hypothesis is generally converted to a test of the mean or variance parameter of a
population (or differences in mean or variances of populations)
A hypothesis is a statement or postulate about the parameters of a distribution (or model)
o Null hypothesis H0 – The default or status quo postulate that we wish to reject if
the sample set provides sufficient evidence
o Alternative hypothesis H1 – The alternative postulate that is accepted if the null
hypothesis is rejected
No hypothesis test is perfect. There are inherent errors since it is based on observations
which are random
The performance of a hypothesis test depends on
o Extent of variability in data
o Number of observations (Sample size)
o Test Statistic (function of observations)
o Test criterion (Threshold)
There are 2 types of hypothesis tests – the two-sided and the one-sided test
A two-sided test has a lower criterion threshold and an upper threshold selected from
the appropriate distribution
Depending on the type of test, the thresholds are chosen and the test statistic is
compared against those thresholds
Errors in hypothesis testing
Two types of errors (Type I & Type II)
Typically the Type I error probability α (also called the level of significance of the test)
is controlled by choosing the criterion from the distribution of the test statistic under the
null hypothesis
Type I error (false alarm)
The Type II error also has a probability, which is denoted by β.
The probability of correctly rejecting a false null hypothesis is known as the power of the
statistical test and is denoted by 1 − β.
What is a Hypothesis?
A hypothesis is an educated guess about something in the world around you. It should be
testable, either by experiment or observation. For example:
A new medicine you think might work.
A way of teaching you think might be better.
What is Hypothesis Testing?
Hypothesis testing in statistics is a way for you to test the results of a survey or experiment to see
if you have meaningful results. You’re basically testing whether your results are valid by
figuring out the odds that your results have happened by chance. If your results may have
happened by chance, the experiment won’t be repeatable and so has little use.
Hypothesis Testing
Example #1: Basic Example
A researcher thinks that if knee surgery patients go to physical therapy twice a week (instead
of 3 times), their recovery period will be longer.
Average recovery time for knee surgery patients is 8.2 weeks.
The hypothesis statement in this question is that the researcher believes the average recovery
time is more than 8.2 weeks.
It can be written in mathematical terms as: H1: μ > 8.2
Next, state the null hypothesis.
That’s what will happen if the researcher is wrong.
In the above example, if the researcher is wrong then the recovery time is less than or equal
to 8.2 weeks.
In math, that’s: H0: μ ≤ 8.2
Step 3: Draw a picture to help you visualize the problem.
Step 7: If the test statistic (Step 6) is greater than the critical value (Step 5), reject the null
hypothesis; if it is less, you cannot reject the null hypothesis. In this case it is greater
(4.56 > 1.645), so you can reject the null.
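The one-sided test in this example can be sketched as follows; the sample mean, standard deviation and size below are illustrative assumptions, not the lecture's actual figures:

```python
from statistics import NormalDist

# One-sided z-test sketch for H0: mu <= 8.2 vs H1: mu > 8.2.
mu0 = 8.2
xbar, s, n = 8.9, 1.2, 60        # sample mean, std dev, size (assumed values)

# Test statistic: how many standard errors the sample mean is above mu0
z = (xbar - mu0) / (s / n ** 0.5)

# Critical value at significance level alpha = 0.05 (one-sided)
z_crit = NormalDist().inv_cdf(0.95)
print(round(z, 2), round(z_crit, 3))   # test statistic vs 1.645

# Reject H0 when the statistic exceeds the threshold
print("reject H0" if z > z_crit else "fail to reject H0")
```

The 1.645 threshold is the same one used in the lecture example; only the sample figures are made up for illustration.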
https://youtu.be/N5Wdfd3exmc
11.12.2022
https://youtu.be/cL5ie-669rc
Confidence Interval
Estimation and confidence interval
Levels of confidence
Interpretation of interval estimation
Margin of error
Interval estimation of a population mean
4. 95% of the sample means for a specified sample size will lie within 1.96 standard
deviations of the hypothesized population mean
5. For the 99% confidence interval, 99% of the sample means for a specified sample size
will lie within 2.58 standard deviations of the hypothesized population mean
X̅ = sample mean
Z = number of standard deviations from the sample mean (e.g. 1.96 for 95% confidence)
s = Standard deviation in the sample
n = size of the sample
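Putting the formula together, x̅ ± Z·s/√n (a sketch; the sample figures are illustrative assumptions):

```python
from statistics import NormalDist

# 95% confidence interval for the population mean (illustrative sample)
xbar, s, n = 70.0, 8.4, 20

z95 = NormalDist().inv_cdf(0.975)    # two-sided 95% => 1.96
margin = z95 * s / n ** 0.5          # margin of error
lo, hi = xbar - margin, xbar + margin
print(round(z95, 2), round(lo, 2), round(hi, 2))
```

For a 99% interval, `NormalDist().inv_cdf(0.995)` gives roughly 2.58, matching point 5 above.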
Example:
https://www.samlau.me/test-textbook/ch/18/hyp_phacking.html
P-hacking
A p-value or probability value is the chance, based on the model in the null hypothesis,
that the test statistic is equal to the value that was observed in the data or is even further
in the direction of the alternative.
If a p-value is small, that means the tail beyond the observed statistic is small and so the
observed statistic is far away from what the null predicts.
This implies that the data support the alternative hypothesis better than they support the
null.
By convention, when the p-value is below 0.05, the result is called statistically
significant, and the null hypothesis is rejected.
There are dangers that present themselves when the p-value is misused.
P-hacking is the act of misusing data analysis to show that patterns in data are statistically
significant, when in reality they are not.
This is often done by performing multiple tests on data and only focusing on the tests that
return results that are significant.
# These are some of the columns that give us the quantities/frequencies of different foods
the FFQ-takers ate
ffq = ['EGGROLLQUAN', 'SHELLFISHQUAN', 'COFFEEDRINKSFREQ']
Look specifically at whether people own cats or dogs, and what handedness they are:
characteristics = ['cat', 'dog', 'right_hand', 'left_hand']
data[characteristics].head()
   cat  dog  right_hand  left_hand
0    0    0           1          0
1    0    0           1          0
2    0    1           1          0
3    0    0           1          0
4    0    0           1          0
Additionally, look at how much shellfish, eggrolls, and coffee people consumed.
data[ffq].head()
   EGGROLLQUAN  SHELLFISHQUAN  COFFEEDRINKSFREQ
0            1              3                 2
1            1              2                 3
2            2              3                 3
3            3              2                 1
4            2              2                 2
Calculate the p-value for every pair of characteristic and food frequency/quantity
features.
# Calculate the p-value between every characteristic and food frequency/quantity pair
pvalues = {}
for c in characteristics:
    for f in ffq:
        pvalues[(c, f)] = findpvalue(data, c, f)
pvalues
{('cat', 'EGGROLLQUAN'): 0.69295273146288583,
('cat', 'SHELLFISHQUAN'): 0.39907214094767007,
('cat', 'COFFEEDRINKSFREQ'): 0.0016303467897390215,
('dog', 'EGGROLLQUAN'): 2.8476184473490123e-05,
('dog', 'SHELLFISHQUAN'): 0.14713568495622972,
('dog', 'COFFEEDRINKSFREQ'): 0.3507350497291003,
('right_hand', 'EGGROLLQUAN'): 0.20123440208411372,
('right_hand', 'SHELLFISHQUAN'): 0.00020312599063263847,
('right_hand', 'COFFEEDRINKSFREQ'): 0.48693234457564749,
('left_hand', 'EGGROLLQUAN'): 0.75803051153936374,
('left_hand', 'SHELLFISHQUAN'): 0.00035282554635466211,
('left_hand', 'COFFEEDRINKSFREQ'): 0.1692235856830212}
The study finds that:

Eating/Drinking   Is linked to       P-value
Egg rolls         Dog ownership      <0.0001
Shellfish         Right-handedness   0.0002
Shellfish         Left-handedness    0.0004
Coffee            Cat ownership      0.0016
Clearly this is flawed.
Aside from the fact that some of these correlations make no sense, shellfish is found to be
linked to both right- and left-handedness.
This happened because all columns were blindly tested against each other for statistical
significance, and only the pairs that gave "statistically significant" results were chosen.
This shows the dangers of blindly following the p-value without a care for proper
experimental design.
This shows the dangers of blindly following the p-value without a care for proper
experimental design.
Example:
A simple example of this would be rolling a pair of dice and getting two 6s.
With the null hypothesis that the dice are fair and not weighted, and the sum of the dice
as the test statistic, the p-value of this outcome is 1/36 ≈ 0.028, which would count as a
statistically significant result that the dice are not fair.
But obviously a single roll is nowhere near enough to provide good evidence either way,
which shows that blindly applying the p-value without properly designing a good
experiment can lead to bad conclusions.
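The dice p-value can be reproduced by enumerating all 36 outcomes under the null hypothesis (a small sketch):

```python
from fractions import Fraction
from itertools import product

# Under H0 (fair dice), all 36 outcomes of one roll of a pair of dice
# are equally likely; the test statistic is the sum of the dice
rolls = list(product(range(1, 7), repeat=2))
observed = 12                        # two 6s

# p-value: probability of a sum at least as extreme as the observed one
p_value = Fraction(sum(1 for a, b in rolls if a + b >= observed), len(rolls))
print(p_value, float(p_value))       # 1/36, about 0.028

# p < 0.05 would "reject fairness" on the strength of a single roll,
# which is exactly the misuse the notes warn about
print(float(p_value) < 0.05)
```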
Bayesian Inference
https://towardsdatascience.com/what-is-bayesian-inference-4eda9f9e20a6
Illustration of how our prior knowledge affects our posterior knowledge
Bayes’ theorem
We have two sets of outcomes A and B (also called events), and we denote the probabilities
of each event P(A) and P(B) respectively.
The probability of both events together is denoted by the joint probability P(A, B), which
can be expanded using conditional probabilities:
P(A, B) = P(A|B) P(B)   (1)
i.e., the conditional probability of A given B, multiplied by the probability of B, gives the
joint probability of A and B. It follows that
P(A, B) = P(B|A) P(A)   (2)
Since the left-hand sides of (1) and (2) are the same, the right-hand sides are equal:
P(A|B) P(B) = P(B|A) P(A)
so that
P(A|B) = P(B|A) P(A) / P(B)
This is Bayes’ theorem.
The evidence (the denominator above) ensures that the posterior distribution on the left-hand side
is a valid probability density and is called the normalizing constant.
The theorem in words is stated as follows: Posterior ∝ Likelihood × Prior, where ∝ means
“proportional to”.
Let X be a random variable representing the coin, where X=1 is heads and X=0 is tails, such
that P(X=1) = θ and P(X=0) = 1 − θ. Furthermore, let D denote our data (8 heads, 3 tails).
We estimate the value of the parameter θ so that we can calculate the probability of seeing
2 heads in a row.
If the probability is less than 0.5, we will bet against seeing 2 heads in a row, but if it’s above 0.5,
then we bet for.
Frequentist approach
As the frequentist, we maximize the likelihood, which is to ask: what value of θ will
maximize the probability that we got D given θ? More formally, we want to find
θ_MLE = argmax over θ of P(D | θ)   (3)
Note that (3) expresses the likelihood of θ given D, which is not the same as saying the
probability of θ given D.
The image underneath shows our likelihood function P(D∣θ) (as a function of θ) and the maximum
likelihood estimate.
The value of θ that maximizes the likelihood is k/n, i.e., the proportion of successes in the trials.
The maximum likelihood estimate is therefore k/n = 8/11 ≈ 0.73.
Assuming the coin flips are independent, we can now calculate the probability of seeing
2 heads in a row: θ² = (8/11)² ≈ 0.53.
Since the probability of seeing 2 heads in a row is larger than 0.5, we would bet for!
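The frequentist computation is short (a sketch of the numbers above):

```python
# Frequentist sketch: MLE for the coin bias from 8 heads in 11 flips
k, n = 8, 11
theta_mle = k / n
print(round(theta_mle, 2))           # 0.73

# Probability of 2 heads in a row under independence: theta^2
p_two_heads = theta_mle ** 2
print(round(p_two_heads, 2))         # 0.53, above 0.5, so bet for
```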
Bayesian approach
As the Bayesian, we maximize the posterior, which is to ask: what value of θ will maximize
the probability of θ given D?
Since the evidence P(D) is a normalizing constant not dependent on θ, ignore it.
This now gives θ_MAP = argmax over θ of P(D | θ) P(θ).
Choosing a Beta(α, β) prior gives P(θ) = [Γ(α + β) / (Γ(α) Γ(β))] θ^(α−1) (1 − θ)^(β−1),
where Γ is the Gamma function. Since the fraction is not dependent on θ, ignore it, which
gives P(θ) ∝ θ^(α−1) (1 − θ)^(β−1).
We set the prior distribution in such a way that we incorporate what we know about θ
prior to seeing the data.
Now, we know that coins are usually pretty fair, and if we choose α=β=2, we get a beta
distribution that favors θ=0.5 more than θ=0 or θ=1.
The illustration below shows this prior Beta(2,2), the normalized likelihood, and the resulting
posterior distribution.
Illustration of the prior P (θ), likelihood P(D | θ), and posterior distribution P (θ | D)
with a vertical line at the maximum a posteriori estimate.
The posterior distribution ends up being dragged a little towards the prior distribution,
which makes the MAP estimate a little different from the MLE estimate.
The mode of the posterior is θ_MAP = (k + α − 1) / (n + α + β − 2) = 9/13 ≈ 0.69, which is a
little lower than the MLE estimate. If we now use the MAP estimate to calculate the
probability of seeing 2 heads in a row, we get 0.69² ≈ 0.48, and since this is below 0.5 we
will bet against it.
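The MAP computation can be sketched numerically; the closed-form mode used below is the standard result for a Beta(α, β) prior with k successes in n trials (posterior Beta(k + α, n − k + β)):

```python
# Bayesian sketch: Beta(2, 2) prior on theta; the posterior mode (MAP)
# for k heads in n flips is (k + alpha - 1) / (n + alpha + beta - 2)
k, n = 8, 11
alpha, beta = 2, 2

theta_map = (k + alpha - 1) / (n + alpha + beta - 2)
print(round(theta_map, 3))           # 9/13, about 0.692, below the MLE 8/11

p_two_heads = theta_map ** 2
print(round(p_two_heads, 2))         # 0.48, below 0.5, so bet against
```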