
25/08/2023

Whiteboard Chart by Shopify Partners


https://burst.shopify.com/photos/whiteboard-chart?q=graph

ETF1100 Business Statistics


Week 6
Midterm Test Revision
Charanjit Kaur

Learning Outcomes

• Revision of materials taught in Weeks 1-5

• Complete the Practice Test for the mid-semester test

Basic Concepts of Statistics: Random Variables Wk1

Definition: outcomes of experiments whose values may vary due to chance

• Repeated observations of random variables produce a spread of values.

• These observations are “data” that is relevant in analysing the problem.

Business outcomes are rarely predictable


Basic Concepts of Statistics: Population and Sample Wk1


Sample:
A subset of the population selected for
analysis.
• Often chosen randomly
• Preferably representative of the population.

Population:
All members of a group about which you want to draw a conclusion.
E.g. all voters in an election, all Telstra shareholders, all invoices
submitted to Medicare for reimbursement.

Types of Data Wk1

Data Types

• Categorical (Qualitative): numerical operations are not meaningful.
   • Nominal: values are labels and do not imply any order.
   • Ordinal: values are labels, but they have an order.
• Numerical (Quantitative): numerical operations are meaningful.
   • Discrete: values arise from counting.
   • Continuous: values arise from measuring.

Visualising Data Wk1

• Nominal & Ordinal data: Bar Chart, Pie Chart
• Discrete data: Bar Chart
• Continuous data: Histogram, Box plot

• Bar Chart: great for illustrating relativity, particularly for ordinal data
• Pie Chart: great for illustrating portions or shares
• Histogram and Box plot: great for illustrating the distribution


Normalization of Data Wk1


Purpose: comparability across observations

• Choice of normalization depends on the purpose of the analysis!

Store    Profit ($000)    Net Profit %    ROI
A        8.06             3.96%           22.39%
B        54.229           10.81%          2.36%
C        17.981           17.55%          9.04%
D        94.891           15.67%          5.02%
E        70.913           7.04%           6.66%
F        23.005           15.49%          18.11%
G        108.656          17.94%          2.57%

Normalization of Data Wk1


Other common normalizations:
a. across time; real vs nominal values (CPI adjusted)
• Consumer Price Index (CPI): a weighted average of prices for everyday goods
and services people buy. It is indexed to 100 in the base period and is used
to calculate the inflation rate.
• Nominal value: value that is measured in terms of actual prices that exist at
the time.
• Real value: the value of the same item after it has been adjusted for
inflation.
b. across observations; e.g. percentage, per capita
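
The CPI adjustment described above can be sketched in a few lines of Python. This is only an illustration: the function names `to_real` and `inflation_rate` and all figures are hypothetical, and the CPI is assumed to be indexed to 100 in the base period.

```python
def to_real(nominal_value, cpi, base_cpi=100.0):
    """Express a nominal amount in base-period prices (CPI-adjusted)."""
    return nominal_value * base_cpi / cpi


def inflation_rate(cpi_now, cpi_before):
    """Change in the CPI between two periods, as a fraction."""
    return (cpi_now - cpi_before) / cpi_before


# Hypothetical figures, for illustration only.
real_salary = to_real(nominal_value=50_000, cpi=125.0)
print(real_salary)                    # 40000.0
print(inflation_rate(126.0, 120.0))   # roughly 0.05, i.e. 5% inflation
```

A nominal salary of 50,000 when the CPI is 125 buys what 40,000 bought in the base period, which is why real values are the comparable ones across time.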

Analysis of Categorical Data Wk2


Probability = quantification of chance. The probabilities of all possible outcomes add to 1.
• Marginal Probability: P(A) = probability of event A.
• Joint Probability: Probability of “Intersection” describes “A AND B” 𝑃(𝐴 ∩ 𝐵)

• Conditional probability: the probability of A given that B has occurred, written P(A|B), where

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

• Probability of “Union” describes “Either A OR B” 𝑃(𝐴 ∪ 𝐵)


Analysis of Categorical Data Wk2


• Mutually exclusive events: Events that cannot occur together
$P(\text{Head} \cap \text{Tail}) = 0$

• Independent events: Event A is independent of event B if

$P(A \mid B) = P(A)$ or $P(A \cap B) = P(A) \times P(B)$

• Evaluate the relationship between two categorical variables – refer to the Exercise
vs Heart Disease Example in Seminar 2
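
As a rough sketch of how marginal, joint and conditional probabilities come out of a two-way table, the Python below uses made-up counts. These are not the Seminar 2 data, and the helper names `marginal`, `joint` and `conditional` are hypothetical.

```python
from math import isclose

# Hypothetical counts, for illustration only (not the Seminar 2 data).
counts = {
    ("exercise", "disease"): 20,
    ("exercise", "no disease"): 180,
    ("no exercise", "disease"): 60,
    ("no exercise", "no disease"): 140,
}
total = sum(counts.values())


def marginal(label, position):
    """P(label): position 0 is the exercise variable, 1 is the disease variable."""
    return sum(c for key, c in counts.items() if key[position] == label) / total


def joint(row, col):
    """P(row AND col)."""
    return counts[(row, col)] / total


def conditional(col, given_row):
    """P(col | given_row) = P(given_row AND col) / P(given_row)."""
    return joint(given_row, col) / marginal(given_row, 0)


p_disease = marginal("disease", 1)
p_disease_given_exercise = conditional("disease", "exercise")

# Independence would require P(disease | exercise) = P(disease).
independent = isclose(p_disease_given_exercise, p_disease)
print(p_disease, p_disease_given_exercise, independent)
```

With these invented counts the conditional probability of disease given exercise is about half the marginal probability, so the two variables would not be independent.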

Understanding Statistical Uncertainty Wk3


Distribution of Numerical Data

• Central Tendency: Arithmetic Mean, Median, Mode
  What is the typical or the central value?
• Variation: Range, Interquartile Range, Variance, Standard Deviation
  How much variation in the distribution?
• Shape: Skewness
  Are there any unusual values that contribute to the distribution?

Mean, Median & Mode Wk3


Mean: measure of typical value, also known as “average”.
The sum of all values observed divided by the number of observations. In Excel: =AVERAGE(…)

Median: The middle value if values are sorted from smallest to largest (50th percentile).
50% of values are equal to or lower than the median, and 50% are equal to or higher.
In Excel : =MEDIAN(…)

Mode: Value that occurs most frequently.
This might not be interesting if the values don’t repeat often. For numerical data, the most
populated bin range is often reported. In Excel: =MODE(…)

All are measures of central tendency, but which one should we use?
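
A small Python sketch of the three measures, using the standard `statistics` module on hypothetical data. The data set is invented and deliberately contains one extreme value so the measures can be compared.

```python
import statistics

values = [3, 5, 5, 6, 7, 8, 40]   # hypothetical data with one extreme value

mean = statistics.mean(values)      # like =AVERAGE(...)
median = statistics.median(values)  # like =MEDIAN(...)
mode = statistics.mode(values)      # like =MODE(...)

print(mean, median, mode)
```

Here the single large value pulls the mean well above the median, while the median and mode stay near the bulk of the data.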


Measures of Variability Wk3


Range: The difference between the maximum and the minimum values. It relies just on the two
most extreme values in the dataset. In Excel: =MAX(…)-MIN(…)

Interquartile Range: the spread of the middle 50% of the data, IQR = Q3 – Q1
Q1 = first quartile → 25% of data falls below this value. In Excel: =QUARTILE.EXC(…,1)
Q3 = third quartile → 25% of data falls above this value. In Excel: =QUARTILE.EXC(…,3)

Variance: average squared deviation (distance) from the mean. Reported in squared units.
In Excel: =VAR.S(…)

Standard deviation: $\sqrt{\text{Variance}}$
Same unit as the data. Easier to interpret. In Excel: =STDEV.S(…)
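
The same measures can be sketched in Python with the standard `statistics` module. The data are hypothetical. `statistics.quantiles` with `method="exclusive"` uses an exclusive quartile convention that should agree with QUARTILE.EXC, though that is an assumption worth checking against Excel.

```python
import math
import statistics

data = [2, 4, 4, 5, 7, 9, 10, 12]   # hypothetical sample

data_range = max(data) - min(data)                 # like =MAX(...)-MIN(...)

# Three cut points for n=4: Q1, the median, and Q3.
q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

variance = statistics.variance(data)               # sample variance, like =VAR.S(...)
std_dev = statistics.stdev(data)                   # sample stdev, like =STDEV.S(...)

print(data_range, q1, q3, iqr)
print(variance, std_dev, math.sqrt(variance))
```

The last line prints the standard deviation next to the square root of the variance to show they are the same quantity.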

Shape/Skewness of Data Distribution Wk3


Skewness is the extent of asymmetry in the distribution.
If the distribution is symmetric, the mean is equal to the median.
• Skewness > 0: right-skewed (long right tail)
• Skewness = 0: symmetric
• Skewness < 0: left-skewed (long left tail)

Probability Distribution Wk3


• In statistics, we use a smooth mathematical function to model the
probability density function (pdf)
• These are approximations to the data distribution, i.e. a “model”

• The function $f(X)$ denotes the “pdf”

• Areas under the curve represent probabilities


Normal Distribution Wk3


• The most common distribution in statistics → normal distribution
• It is a symmetric (bell-shaped) distribution
• The normal distribution has two parameters: Mean and Stdev
• Notation: 𝑋 ~ 𝑁(𝑀𝑒𝑎𝑛, 𝑆𝑡𝑑𝑒𝑣)
• Skewness = 0;
• Mean = Median = Mode

Excel Functions:
For probability “=NORM.DIST(xvalue,mean,stdev,TRUE)”
For percentile “=NORM.INV(prob, mean,stdev)”
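
For comparison, Python's standard `statistics.NormalDist` offers rough equivalents of the two Excel functions. The mean of 100 and stdev of 15 below are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)   # hypothetical mean and stdev

# Probability of a value at or below 115, like =NORM.DIST(115,100,15,TRUE)
prob_below = dist.cdf(115)

# Value below which 97.5% of the distribution lies, like =NORM.INV(0.975,100,15)
percentile_975 = dist.inv_cdf(0.975)

print(round(prob_below, 4), round(percentile_975, 2))
```

The two calls are inverses of each other: `cdf` maps a value to a probability and `inv_cdf` maps a probability back to a value.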

Representative Sample Wk4


Representative sample is determined by:
1) Data collection process (sampling design)
2) Survey design → wording design of the questions/form.
3) Sample size → a sufficiently large sample means the sample statistic gets closer to the population
parameter
Biased sample:
• Non-representative statistics
• Invalid inference → invalid conclusions. It could end with catastrophic outcomes if used in business
decisions

Potential biases:
• Selection bias – members of the population have unequal chances of being chosen
• Non-response bias – the data collection process leads to systematic non-response from certain
groups

Statistics is UNCERTAIN Wk4


• Statistics is about quantifying the uncertainty of the sample estimate

• $\bar{x}$ is an estimate of $E(X) = \mu$ (A sample statistic is only an estimate of the
truth. No sample statistic is exact; each has variation/error around it.)
• Assume we take data samples repeatedly, and compute the sample mean as the
statistic for each sample. Then we would have the sampling
distribution of the sample mean to portray its variability.
• Central Limit Theorem: If the sample size $n$ is large: $\bar{x} \sim N\left(\mu, \frac{s}{\sqrt{n}}\right)$

• This is true regardless of the shape of the population distribution
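
A small simulation can make the Central Limit Theorem concrete. The sketch below is an illustration under assumptions: it draws repeated samples from a strongly skewed exponential population with mean 1 and standard deviation 1, and the sample size, number of repetitions and seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)        # fixed seed so the run is repeatable
n = 50                # size of each sample
repetitions = 2000    # number of repeated samples

# Each entry is the mean of one sample from a skewed (exponential) population.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(repetitions)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = 1.0 / math.sqrt(n)   # population stdev divided by sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(theoretical_se, 3))
```

Even though the population is far from bell-shaped, the sample means cluster around the population mean with a spread close to the standard error.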


Confidence Interval for the Population Mean Wk4


Confidence interval = plausible range of the unknown population
mean given some level of probability
$$\bar{x} - Z\frac{s}{\sqrt{n}} < \mu < \bar{x} + Z\frac{s}{\sqrt{n}}$$

If the standard deviation ($\sigma$) ↑, the spread of the distribution is larger:
standard error ($\text{std deviation}/\sqrt{n}$) ↑, width ↑, estimate is less precise

If the sample size ($n$) ↑, standard error ($\text{std deviation}/\sqrt{n}$) ↓, width ↓, estimate is more precise
The bigger the sample, the more information we have to increase the precision of the interval estimate of the
population mean, and the narrower the interval.

If the level of confidence (1-α) ↑, critical value changes, width ↑ , the estimate is less precise
The more confident we are, the more values we need to include in our confidence interval, the wider the
interval.
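
The effects above can be checked numerically with a short Python sketch. The function name `confidence_interval` and the summary statistics are hypothetical, and the critical value is taken from the standard normal distribution, as in the interval formula for the population mean.

```python
import math
from statistics import NormalDist


def confidence_interval(x_bar, s, n, confidence=0.95):
    """Z-based interval: x_bar +/- Z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value
    standard_error = s / math.sqrt(n)
    return x_bar - z * standard_error, x_bar + z * standard_error


# Hypothetical summary statistics, for illustration only.
low, high = confidence_interval(x_bar=50.0, s=10.0, n=100)
print(round(low, 2), round(high, 2))

# A larger sample narrows the interval; a higher confidence level widens it.
low_n, high_n = confidence_interval(50.0, 10.0, 400)
low_c, high_c = confidence_interval(50.0, 10.0, 100, confidence=0.99)
print(high_n - low_n, high - low, high_c - low_c)
```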

Hypothesis Test for Evidence-based Decisions Wk5


A statistical framework for using data to derive evidence-based
decisions.
• Define business problem and variables relevant to that problem
• Formulate hypotheses around these variables that are relevant to business
decisions
• Conduct hypothesis testing to establish the degree of evidence for the
hypotheses
• Based on evidence, make business decisions

Hypothesis Test for Evidence-based Decisions Wk5


STATISTICS
• DESCRIPTIVE
• INFERENTIAL (Sample → Sampling Distribution)
   • ESTIMATION (Point & Interval): Estimating the value of a population parameter
   • HYPOTHESIS TESTS: Testing a claim about the value of a population parameter


Steps in Hypothesis Test Wk5

1) Formulate $H_0$ & $H_1$
2) Decide on $\alpha$
3) Calculate the p-value
4) Apply decision rule: reject $H_0$ if p-value < $\alpha$ OR retain it if p-value > $\alpha$

Defining the hypothesis Wk5


1) Formulate $H_0$ & $H_1$
• The null hypothesis always involves an equality sign (=)
• The alternative hypothesis is what we are seeking evidence for. It can contain a “≠”, “>” or “<” sign

Two-tailed test (“different to”): $H_0: \mu = \mu_0$, $H_1: \mu \neq \mu_0$
Right-tailed test (“greater than”): $H_0: \mu = \mu_0$, $H_1: \mu > \mu_0$
Left-tailed test (“less than”): $H_0: \mu = \mu_0$, $H_1: \mu < \mu_0$

Mechanics of Hypothesis Testing Wk5


2) Decide on $\alpha$.
Recommendations: $\alpha$ = 5%; or $\alpha$ = 1% for conservative cases.

3) Test statistic:
$$\text{Test Statistic} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{\bar{x} - \mu_0}{SE(\bar{x})}$$
Judge whether or not the test statistic is outstanding (“far from zero”) in the
direction of the alternative.

4) Decision: Reject $H_0$ if p-value < $\alpha$ OR retain it if p-value > $\alpha$


A smaller p-value means that there is stronger evidence in favor of H1

P-value for a right-tailed test: =1-NORM.S.DIST(test statistic,TRUE)
P-value for a left-tailed test: =NORM.S.DIST(test statistic,TRUE)
P-value for a two-tailed test: =2*NORM.S.DIST(??,TRUE)


Type I and II errors Wk5


Since we rely on data samples to conduct hypothesis tests, there is a potential
for errors. Possible scenarios:

Type I error: Reject a true null
Type II error: Retain a false null

The true ‘state of the world’:

                        $H_0$ is TRUE          $H_0$ is FALSE
Do not reject $H_0$     CORRECT DECISION!      TYPE II ERROR (β)
Reject $H_0$            TYPE I ERROR (α)       CORRECT DECISION!
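
One way to see what α means in practice is to simulate many tests in a world where the null is true and count how often it is rejected. The sketch below is an illustration only: it assumes a normal population, a two-tailed z-type test, and arbitrary choices of sample size, number of trials and seed.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(1)   # fixed seed so the run is repeatable
alpha, n, trials, mu_0 = 0.05, 50, 2000, 0.0

rejections = 0
for _ in range(trials):
    # The null is TRUE here: the population mean really is mu_0.
    sample = [random.gauss(mu_0, 1.0) for _ in range(n)]
    z = (statistics.fmean(sample) - mu_0) / (statistics.stdev(sample) / math.sqrt(n))
    p_value = 2 * NormalDist().cdf(-abs(z))
    if p_value < alpha:
        rejections += 1   # rejecting a true null is a Type I error

type_i_rate = rejections / trials
print(type_i_rate)
```

The share of rejections should land near α, which is the sense in which α is the probability of a Type I error.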
