(week 1)
Statistics is a way to get information from data.
Descriptive Statistics
Descriptive statistics deals with methods of organising, summarising, and presenting data in a convenient and
informative way. One form of descriptive stats uses graphical techniques (histogram) and another uses
numerical techniques (Measure of central location – the average or median are examples, measure of
variability – range is an example).
Inferential Statistics
Inferential stats is a body of methods used to draw conclusions or inferences about characteristics of
population based on sample data.
1. Population
A population is the group of all items of interest to a statistics practitioner. It is frequently large and may in fact be infinitely large.
A descriptive measure of a population is called a parameter, and it usually represents the information we
need.
2. Sample
A descriptive measure of a sample is called a statistic. Statistics are used to make inferences about
parameters.
3. Statistical inference
Statistical inference is the process of making an estimate, prediction, or decision about a population based on sample data. However, such conclusions and estimates are not always correct, so we build in a measure of reliability:
• The confidence level is the proportion of times that an estimating procedure will be correct
• The significance level measures how frequently the conclusion will be wrong
Values of the variable – are possible observations of the variable. E.g. the values of stock price are real
numbers usually measured in dollars and cents; ranges from 0 to hundreds of dollars.
Data – are the observed values of a variable. E.g. $0.70, $1.12, ... these are the data from which we will extract the info we seek. Data is the plural of datum.
• Interval data are real numbers such as heights, weights, incomes. Also called quantitative or numerical
data. Scores & time are quantitative.
• Values of nominal data are categories. E.g. questions to marital status (1 = single, 2 = married, 3 =
divorced, 4 = widowed). Any numbering system is valid (14 = single, 2 = married…). Nominal data are
also called qualitative or categorical. Gender is qualitative
• Ordinal data appear to be nominal, but the difference is the order of their values has meaning. E.g.
ratings for professor (1 = poor, 2 = fair, 3 = good, 4 = very good, 5 = excellent). Need some sort of
ascending order (6 = poor, 24 = fair, 33 = good…). Order of values important not magnitude. To
distinguish between ordinal and interval data, the intervals or differences between values of interval
data are consistent and meaningful (difference between marks of 80 and 75 is 5 so we can calculate
difference and interpret results) but not for ordinal which aren’t numbers and just give you order.
Interval data – all calculations permitted. Often describe interval data by calculating average.
Nominal data – because codes are arbitrary, we cannot perform calculations on these codes. Can only count or
compute %s of occurrences of each category (marital status example – count numbers of each category and
report frequency).
Ordinal data - most important aspect of this data is the order of the values. Only permissible calculations are
those involving a ranking process. E.g. median.
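The permissible calculations for each data type can be sketched in Python (the marks, ratings, and marital-status data below are hypothetical):

```python
from statistics import mean, median
from collections import Counter

# Interval data: all arithmetic is valid, so the mean is meaningful.
marks = [75, 80, 62, 91, 80]
print(mean(marks))         # 77.6

# Ordinal data: only rank-based calculations, e.g. the median rating.
ratings = [1, 3, 4, 4, 5]  # 1 = poor ... 5 = excellent
print(median(ratings))     # 4

# Nominal data: codes are arbitrary, so we can only count categories
# and report frequencies / percentages.
marital = ["single", "married", "married", "widowed"]
print(Counter(marital))
```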
Hierarchy of Data
1. Interval
2. Ordinal
3. Nominal
Higher level data types may be treated as lower-level ones. E.g. at UNSW marks are converted to grades
(interval data is converted to ordinal data). But when we convert, we lose information. LOWER LEVEL DATA
TYPES CANNOT BE TREATED AS HIGHER LEVEL TYPES
Variables take the same name as their type of data. E.g. interval data are observations of interval variables.
In presenting the diff types of data, which statistical procedure do we use? And what type of info do we need
to produce from our data?
Graphical techniques catch a reader’s eye more quickly. Bar chart often used to display frequencies while a pie
chart shows relative frequencies (proportions). Page 22 excel explanation.
Practice in textbook.
Ordinal data have no specific graphical techniques. Consequently, we treat them as nominal and use the techniques above. Only criterion is that the bars in bar charts should be arranged in ascending or descending ordinal values, and in pie charts the wedges are arranged clockwise in ascending or descending order.
WE USE FREQUENCY AND RELATIVE FREQUENCY TABLES, AND BAR AND PIE CHARTS WHEN:
Page 35 excel explanation. No relationship = bar charts approx the same. Vice versa.
Page 37 example.
Data Formats
Page 38.
THE OBJECTIVE IS TO DESCRIBE THE RELATIONSHIP BETWEEN 2 VARIABLES AND COMPARE 2 OR MORE
SETS OF DATA
DATA TYPE IS NOMINAL
Most important is the histogram (suppose data are quantitative & not qualitative). Bases are the intervals & height is the frequency. Define lower and upper class limits (need to be mutually exclusive & exhaustive). Page 47 excel explanation – should be no gaps between bars.
The number of class intervals depends entirely on the no. of observations in the data set. More observations = larger no. of class intervals. Class width is often rounded to a convenient value. It is more important to choose classes that are easy to interpret; the above are guidelines only. First class interval must contain smallest observation.
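One common guideline for the number of class intervals is Sturges' rule, about 1 + 3.3·log10(n); it and the data below are assumptions for illustration, not from these notes:

```python
import math

def sturges_classes(n):
    """Common guideline: number of class intervals ≈ 1 + 3.3·log10(n)."""
    return round(1 + 3.3 * math.log10(n))

data = [12, 15, 21, 22, 25, 28, 31, 34, 35, 39]
k = sturges_classes(len(data))                   # 10 observations → 4 classes
width = math.ceil((max(data) - min(data)) / k)   # rounded up to a convenient value

# Count frequencies; the intervals are mutually exclusive and exhaustive,
# and the first class contains the smallest observation.
lower = min(data)
freqs = [sum(lower + i * width <= x < lower + (i + 1) * width for x in data)
         for i in range(k)]
print(k, width, freqs)   # 4 7 [2, 3, 2, 3]
```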
Shapes of Histograms
1. Symmetry – when a vertical line is drawn down the centre of the histogram, the two sides are identical
in shape & size.
2. Skewness: positively skewed – long tail extending to the right e.g. income of employees in large firm
negatively skewed – long tail extending to the left e.g. time to finish exams
asymmetric
may have outliers
3. Number of Modal Classes
A unimodal histogram has a single peak. A bimodal histogram has 2 peaks, not necessarily equal in height, indicating 2 diff distributions.
A drawback of histograms is that we lose potentially useful info by classifying the observations into classes.
The stem-and-leaf display overcomes this loss to some extent as we can see actual observations.
First step is to split each observation into 2 parts, a stem & a leaf. The length of each line represents the frequency in the class interval defined by the stem. Page 58 excel explanation.
Ogive
Cumulative relative frequency distribution highlights the proportion of observations that lie below each of the
class limits.
The ogive is a graphical representation of the cumulative relative frequencies. Page 60 excel explanation.
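The cumulative relative frequencies behind an ogive are just running totals of the relative frequencies; a sketch with hypothetical class frequencies:

```python
# Hypothetical class frequencies for 20 observations.
freqs = [4, 6, 7, 3]
n = sum(freqs)

rel = [f / n for f in freqs]   # relative frequencies (proportions)

# Running totals give the cumulative relative frequencies plotted in an ogive.
cum = []
total = 0.0
for r in rel:
    total += r
    cum.append(total)
print(cum)   # final value ≈ 1: all observations lie below the top class limit
```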
Original data may be interval or nominal. A time series can also list the frequencies & relative frequencies of a
nominal variable over a number of time periods. E.g. Asks consumers to identify favourite brand – this is
nominal data. If we repeat the survey once a month for several years, the proportion of consumers who prefer
a certain company’s product each month would constitute a time series.
Line Chart
Time series data graphically depicted on a line chart which is a plot of the variable (vertical axis) over time
(horizontal axis). Page 66 excel explanation.
Contingency table (Cross-tabulation table) used for relationship between qualitative variables
Time series plot is a scatter diagram with one variable as time. Order matters.
2 most important characteristics are strength and direction of linear relationship to verbally describe how 2
variables are related.
Linearity
Draw a straight line through the points. If most points fall on the line, there is a strong linear relationship; if most points fall close to the line, a medium linear relationship; if the points are scattered with only a semblance of a straight line, a weak linear relationship.
Least squares method used to draw an objective straight line. There’s also quadratic and exponential
relationships.
If 2 variables are linearly related, it does not mean one is causing the other. Correlation is not causation!
Graphical excellence – term we apply to techniques that are informative & concise & that impart information
clearly to their viewers.
• The graph represents large data sets concisely & coherently. Graphical techniques used to summarise
& describe large data sets while tables summarise small data sets.
• The ideas & concepts the statistics practitioner wants to deliver are clearly understood by the viewer.
Excellent chart is one that replaces thousands of words.
• The graph encourages the viewer to compare two or more variables. Graphs best used to depict
relationships between variables or to explain how & why observed results occurred. Graphs displaying
1 variable provide little info.
• The display induces the viewer to address the substance of the data and not the form of the graph.
Form of graph supposed to help present substance.
• There is no distortion of what the data reveal. Tells truth about data.
Graphical Deception
There are 3 different measures to describe the centre of a set of data. The arithmetic mean is the best known, also called the mean or average.
Sample mean: x̄ = (Σᵢ₌₁ⁿ xᵢ) / n
Population mean: μ = (Σᵢ₌₁ᴺ xᵢ) / N
Median
Median – the value that falls in the middle when the observations are arranged in ascending order.
Mode
Mode – observation/s that occurs with greatest frequency. Statistic & parameter are computed in the same way.
For populations & large samples, it is preferred to report the modal class. Using the mode in a small sample may not be a good measure of central location and it may not be unique. Page 102 excel.
Mean, Median, Mode: Which is Best?
• Median not as sensitive to extreme values (outliers) as mean is, so it produces a better measure when
there’s a relatively small no. of extreme observations (either very small or large, but not both).
For symmetric distributions: mean=median, positively skewed: mean > median, negatively skewed: mean <
median.
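A quick check of the mean-vs-median relationship on hypothetical positively skewed incomes ($000s, one very high earner):

```python
from statistics import mean, median

# Positively skewed data: a few very large values drag the mean up.
incomes = [40, 42, 45, 47, 50, 300]
print(mean(incomes))     # pulled upward by the outlier (mean > median)
print(median(incomes))   # 46 — unaffected by how extreme the outlier is
```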
However, for ordinal & nominal data, calculation of mean not valid.
Median used for ordinal data, mode appropriate for nominal data. However, nominal data does not have a
“centre”, so pointless to compute mode of nominal data.
Geometric mean
When variable is growth rate or rate of change (value of investment over time) we need to use geometric
mean. It is used to find “average” growth rate or rate of change over time.
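A classic sketch of why the arithmetic mean misleads for growth rates (hypothetical returns of +100% then -50%):

```python
# An investment doubles (+100%) then halves (-50%).
rates = [1.00, -0.50]

arith = sum(rates) / len(rates)      # 0.25, i.e. +25% per year — misleading
growth = 1.0
for r in rates:
    growth *= (1 + r)                # total growth factor: 2.0 × 0.5 = 1.0
geom = growth ** (1 / len(rates)) - 1
print(arith, geom)                   # geometric mean growth rate is 0: no net gain
```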
1. Use mean to find central location for single set of interval data
2. Use median to find central location for single set of ordinal or interval (with extreme observations)
data.
3. Use mode to describe a single set of nominal, ordinal, or interval data.
Range
Variance
Variance and its related measure, standard deviation most important statistics.
Population variance: σ² = Σᵢ₌₁ᴺ (xᵢ − μ)² / N
Sample variance: s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)
For sample variance we divide by “n-1” because it is a better estimator than dividing by “n”
Squared deviations are used because the negative and positive deviations would otherwise cancel each other out, summing to 0; squaring avoids this.
We could also use the absolute value of the deviations (difference between observation & mean) instead of
squaring- called mean absolute deviation (MAD).
Squaring deviations means the units are also squared! Page 111 excel.
Larger variance = more variation. Difficult to interpret because units are squared.
Standard deviation
Standard deviation is spread measured in original units. Simply positive square root of variance; uses original
units.
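The sample variance, standard deviation, and MAD can be computed directly from their definitions; a sketch with hypothetical observations:

```python
import math

data = [4, 7, 9, 12, 8]    # hypothetical observations
n = len(data)
xbar = sum(data) / n       # 8.0

# Sample variance: divide by n - 1, the better estimator.
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)   # 8.5, in squared units
s = math.sqrt(s2)          # standard deviation: back in the original units

# Mean absolute deviation (MAD): the alternative that avoids squaring.
mad = sum(abs(x - xbar) for x in data) / n          # 2.0
print(s2, s, mad)
```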
Coefficient of Variation
Percentile – Pth percentile is value for which P percent are less than that value and (100-P)% are greater than
that value.
Quartiles – 25th (Q1), 50th (Q2) and 75th (Q3) percentiles. Also quintiles and deciles.
Locating Percentiles
L_P = (n + 1)P / 100
where L_P is the location of the Pth percentile & n is the no. of data. This formula allows us to approximate the
location of the Pth percentile. Page 119 excel.
Get idea of histogram from quartiles: 1st & 2nd quartiles closer than 2nd & 3rd, positively skewed. Vice versa.
Distance equal, approx. symmetric.
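The location formula L_P = (n + 1)P/100 can be coded directly, interpolating between adjacent observations when the location is fractional (data below are hypothetical):

```python
def percentile(sorted_data, p):
    """Approximate the Pth percentile via L_P = (n + 1)P/100."""
    n = len(sorted_data)
    loc = (n + 1) * p / 100          # 1-based location, possibly fractional
    lo = int(loc)
    frac = loc - lo
    if lo < 1:
        return sorted_data[0]
    if lo >= n:
        return sorted_data[-1]
    # Interpolate between the lo-th and (lo+1)-th observations.
    return sorted_data[lo - 1] + frac * (sorted_data[lo] - sorted_data[lo - 1])

data = sorted([3, 5, 7, 8, 12, 13, 14, 18, 21, 22])   # n = 10
print(percentile(data, 25))   # L_25 = 11 × 25/100 = 2.75 → 6.5
print(percentile(data, 50))   # L_50 = 5.5 → 12.5
print(percentile(data, 75))   # L_75 = 8.25 → 18.75
```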
Interquartile Range
Interquartile range = upper – lower quartile, measure spread of middle 50% of observations.
Large value = 1st & 3rd quartiles far apart; high level of variability.
Box Plots
Graphs 5 statistics: min and max observations & 1st, 2nd & 3rd quartiles.
3 vertical lines are the quartiles; the extended lines are whiskers. Points outside the whiskers are outliers. Whiskers extend up to 1.5 times the interquartile range on both sides. Page 121 excel.
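The 1.5 × IQR outlier fences can be sketched as follows (the quartile values and data here are hypothetical):

```python
# Box-plot outlier fences: whiskers reach at most 1.5 × IQR past each quartile.
q1, q3 = 6.5, 18.75
iqr = q3 - q1                    # 12.25
lower_fence = q1 - 1.5 * iqr     # -11.875
upper_fence = q3 + 1.5 * iqr     # 37.125

data = [3, 5, 7, 12, 18, 21, 45]
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)                  # [45] lies beyond the upper whisker
```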
Percentiles & quartiles can be used to measure relative standing for interval and ordinal data.
Interquartile range used to measure variability of interval and ordinal data.
Covariance
Population covariance: σ_xy = Σᵢ₌₁ᴺ (xᵢ − μ_x)(yᵢ − μ_y) / N
Sample covariance: s_xy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
Generally when 2 variables move in the same direction (both ↑ or ↓), the covariance will be a large +ve no.
Magnitude of covariance describes strength of association, but it’s difficult to judge w/o additional stats.
Coefficient of Correlation
Coefficient of correlation is defined as the covariance divided by the standard deviations of the variables.
Population correlation: ρ = σ_xy / (σ_x σ_y)
Sample correlation: r = s_xy / (s_x s_y)
−1 ≤ ρ ≤ 1 and −1 ≤ r ≤ 1
The adv of this over covariance is it has a set of lower & upper limits. −1 = perfect −ve linear relationship & the scatter diagram is a straight line. +1 = perfect +ve linear relationship. 0 = no linear relationship. The rest of the values are judged in relation to these.
However, except for −1, 0, +1, we cannot interpret the correlation precisely; we can only get a rough idea.
Scatter diagram depicts relationship graphically; covariance & coefficient of correlation describe linear
relationship numerically.
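Sample covariance and correlation follow directly from the formulas above; a sketch with hypothetical x, y data:

```python
import math

x = [2, 4, 6, 8]
y = [1, 3, 5, 11]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Sample covariance: divide by n - 1, like the sample variance.
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))

# Correlation rescales the covariance to lie between -1 and +1.
r = s_xy / (sx * sy)
print(s_xy, r)   # r is close to +1: a strong positive linear relationship
```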
Objective method of drawing a straight line through scatter diagram to determine linear relationship.
Line equation: ŷ = b₀ + b₁x, chosen so that the sum of squared deviations between the points & the line,
Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²,
is minimised. The coefficients are
b₁ = s_xy / s_x²   and   b₀ = ȳ − b₁x̄
Note:
Compute covariance, coefficient of correlation, coefficient of determination, and least squares line to describe
relationship between 2 variables for interval data.
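The least squares coefficients b₁ = s_xy/s_x² and b₀ = ȳ − b₁x̄ take only a few lines to compute (the (x, y) data below are hypothetical):

```python
# Hypothetical (x, y) data with a roughly linear relationship.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
s_x2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

b1 = s_xy / s_x2        # slope = s_xy / s_x²
b0 = ybar - b1 * xbar   # intercept: the fitted line passes through (x̄, ȳ)
print(b1, b0)           # ŷ = b0 + b1·x minimises the sum of squared deviations
```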
Direct observation
Simplest method. When data gathered this way, data said to be observational – measures actual behaviours
or outcome.
Select sample of men & women & ask if they take aspirin regularly over past 2 years & if they suffered any
heart attacks last 2 years.
However, it is difficult to produce info this way. If people who take aspirin suffer fewer heart attacks, can we conclude aspirin is effective? Maybe ppl who take aspirin are more health conscious.
Experiments
More expensive, but better way to produce data. Data produced this way are called experimental – imposes a
treatment & measures resultant behaviour or outcomes.
Select random sample, split in 2 groups, 1 takes aspirin regularly while the other doesn't. After 2 years, the stats practitioner determines the proportion of ppl in each group who suffered heart attacks.
Surveys
Important aspect of surveys is the response rate – the proportion of all ppl who were selected who complete the survey. A low response rate can destroy the validity of any conclusion.
1. Personal interview – many feel best way to survey people. High expected response rate & fewer incorrect responses resulting from misunderstanding. Interviewer must not say too much to avoid biasing the response. However, expensive, especially when travel involved
2. Telephone interview – less expensive, but less personal & lower expected response rate.
3. Self-administered survey – mailed to sample of ppl. Inexpensive method & therefore attractive when
sample is large. However, lower response rate & high incorrect responses due to misunderstanding
questions.
Questionnaire design:
1. Define & understand the problem, e.g. a firm wants to determine the effectiveness of its advertising
2. Collect data
3. Analyse data: a) use sample statistics to describe the problem (what is a typical customer? what % of customers recall the ads?) b) extract info about population parameters from sample stats
4. Communicate results
5.2 Sampling
Chief motive for using sample rather than population is cost.
Sampled population – actual population from which the sample has been taken
Self-selected sample – almost always biased because individuals who participate in them are more keenly
interested in the issue than are the other members of the population.
Simple random sampling – a sample selected in such a way that every possible sample of the same size is equally likely to be chosen. Like a raffle: number items 1 to n, put them in a hat and draw. Sometimes inappropriate to use, e.g. when ppl have more than 1 phone no. Pg 168 excel.
Low cost, avoids problems of bias where design of sample systematically favours certain outcomes.
Stratified random sampling – is obtained by separating the population into mutually exclusive sets, or strata, and then drawing simple random samples from each stratum. E.g. female or male, age stratum etc.
Only stratify when there’s connection between survey & strata like who favours tax ↑ and we select random
samples from diff income groups, no point selecting from diff religious beliefs
Advantage is besides acquiring info about the entire population, we can also make inferences within each stratum or compare strata. E.g. compare highest & lowest Y groups to see if they differ in their support for a tax ↑.
Can draw random samples from each Y group according to their proportion in population.
Cluster Sampling
Want to estimate average annual household Y in a large city. With simple random sampling we would need a complete list of households in the city from which to sample. With stratified random sampling we would need the list of households & to categorise each by some other variable like age to develop strata. A less expensive alternative would be to let each block within the city represent a cluster. A sample of clusters is then randomly selected & every household within these clusters questioned. ↓ distance surveyor must cover to gather data.
But this ↑ sampling error as households belonging to same cluster likely to have similar Y (rich vs poor
streets).
Sample Size
Whatever sampling plan, need to decide sample size. Large sample size = more accurate.
Sampling error – refers to differences between sample & population that exist only because of the observations that happened to be selected for the sample. Expect it to occur whenever we make a statement about a population that is based only on the observations contained in a sample. This error is not a mistake; it is a cost of sampling.
E.g. determine mean Y of US blue collar workers. To determine the parameter we would have to ask every US blue collar worker's Y & calculate the mean. Expensive & impractical. Use statistical inference instead: record Y of a sample & find its mean. But values will differ, as the value of the sample mean depends on which Ys happened to be selected. The difference between the true value of the population mean & the sample mean is the sampling error.
Non-sampling error
Non-sampling error – results from mistakes made in the acquisition of data or from the sample observations being selected improperly. More serious than sampling error, as taking a larger sample won't ↓ the size of the error.
1. Errors in data acquisition: arise from recording of incorrect responses, possibly from incorrect measurements taken due to faulty equipment, mistakes made during transcription from primary sources, inaccurate recording of data because terms were misinterpreted, or respondents purposely giving inaccurate responses on sensitive issues like sexual activity.
2. Non-response error: error or bias introduced when responses are not obtained from some members of the sample. When this happens, the sample observations may not be representative of the target population, resulting in biased results. May occur when we can't contact a person listed in the sample, or a person may refuse to respond to question(s). This problem is greater when self-administered questionnaires are used rather than an interviewer.
3. Selection bias: occurs when sampling plan is such that some members of the target population cannot
possibly be selected for inclusion in the sample.
Chapter 6: Probability
6.1 Assigning Probability to Events
Probability – mathematical means of studying uncertainty.
Random experiment – an action or process that leads to one of several possible outcomes. Experiment: flip a
coin, outcomes: heads or tails
First step to assigning probabilities is to produce a list of outcomes. The list must be exhaustive – all possible outcomes must be included – & mutually exclusive – no 2 outcomes can occur at the same time. E.g. flip a coin, outcomes: heads. This is not exhaustive as tails is omitted. Ranges of marks 0-10, 10-20, etc. are not mutually exclusive as getting 10 means you're in both outcomes.
Sample space (S) of a random experiment – list of all possible outcomes of the experiment. Outcomes must be exhaustive & mutually exclusive. The outcomes are denoted O1, O2, ..., Ok.
Once a sample space has been prepared, we assign probabilities to outcomes. 2 rules governing probabilities:
1. Probability of any outcome must lie between 0 & 1: 0 ≤ P(Oi) ≤ 1, where P(Oi) is the probability of outcome i.
2. Sum of probabilities of all outcomes in sample space must be 1.
Classical approach – determine probability associated with games of chance. Probability of heads & tails in a
coin flip are equal, so 50% each.
Relative frequency approach – defines probability as the long-run relative frequency with which an outcome occurs. E.g. of 1000 students, 200 got an A, so the relative frequency of A's is 200/1000 = 20%. This figure is an estimate of the probability of getting an A; we would have to observe an infinite no. of grades to determine the exact probability.
Subjective approach – defines probability as the degree of belief that we hold in the occurrence of an event. Used when we can't use the classical approach & there's no history of outcomes for the relative frequency approach. E.g. an investor wants to know the probability a particular stock will ↑ in value; using this approach, the investor analyses a no. of factors associated with the stock & the stock market & assigns a probability to the outcomes of interest.
Defining events
Simple event – an individual outcome of a sample space. All other events composed of simple events in
sample space.
Event – collection or set of one or more simple events in a sample space. E.g. the event of achieving an A is the set of marks that lie between 80 and 100 inclusive.
Probability of events
Probability of an event – sum of the probabilities of the simple events that constitute the event.
Interpreting probability
No matter what method used to assign probability, we interpret using relative frequency approach for an
infinite no. of experiments. Using subjective approach we find that there’s a 65% chance value of stock will ↑
in next month. However, we interpret that if we had an infinite no. of stocks with exactly the same economic
& market characteristics, 65% of them will ↑ in price in a month’s time.
Relative frequency approach useful to interpret probability statements such as weather forecasts.
Intersection of events A & B – is the event that occurs when both A & B occur. Denoted as A and B.
Marginal probability
Marginal probabilities – computed by adding across rows or down columns, are so named because they are calculated in the margins of the table. Joint probabilities P(A and B) are used when the outcome involves 2 variables.
Conditional probability
Conditional probability – finds the probability of one event given the occurrence of another event. E.g. the probability that a fund managed by a graduate of a top-20 MBA program will outperform the market.
The symbol | means "given".
Independence
Independent event – if probability of one event is not affected by the occurrence of the other event.
P(A|B) = P(A)
or
P(B|A) = P(B)
Union
Union of events A and B is the event that occurs when either A or B or both occur. It is denoted as A or B.
Complement rule
Complement of A is the event that occurs when A does not occur, denoted A^C. The probabilities of an event and its complement sum to 1:
P(A^C) = 1 − P(A)
Multiplication rule
Multiplication rule used to calculate the joint probability of 2 events. Based on the formula for conditional probability:
P(A and B) = P(B)·P(A|B)
or, altering the notation,
P(A and B) = P(A)·P(B|A)
If A and B are independent events, P(A|B) = P(A) & P(B|A) = P(B), so the joint probability of 2 independent events is the product of the probabilities of the 2 events.
Addition Rule
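Both the multiplication rule for independent events and the addition rule can be checked with quick arithmetic; a sketch using two hypothetical fair coin flips:

```python
# Two independent fair coin flips.
p_a = 0.5               # P(A): heads on the first flip
p_b = 0.5               # P(B): heads on the second flip

# Multiplication rule for independent events: P(A and B) = P(A)·P(B).
p_joint = p_a * p_b
print(p_joint)          # 0.25

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
p_union = p_a + p_b - p_joint
print(p_union)          # 0.75
```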
Probability Trees
Effective & simpler method of applying probability rules is the probability tree, wherein the events in an
experiment are represented by lines.
Advantage is it restrains users from making wrong calculations. The joint probabilities at the end of the branches must sum to 1, as all possible events are listed.
Discrete random variable – one that can take on a countable number of values. E.g. number of heads observed
when flipping a coin 10 times. X = 0 or 1 or 2 or…or 10.
Continuous random variable – one whose values are uncountable. E.g. time to complete a task. Say X = time to do a 3hr test, where no one can leave before the 30min mark. The smallest value of X is 30, but what is the next value: 30.1, 30.01, 30.001? There is no "next" number – between any two values there is always another – so the values are uncountable.
Probability distribution – table, formula, or graph that describes values of a random variable & probability
associated with these values.
Uppercase letter like X represents name of random variable, lower case counterpart represent value of
random variable. Thus probability that random variable X will equal x as
P(X = x)
Or more simply
P(x).
For infinite populations, we can't calculate the mean by summing observations. Instead, list the values & their associated probabilities to compute the mean & variance of the population.
Population mean – weighted average of all its values. Weights are probabilities. Parameter also called
expected value of X & represented by E(X) → used when population is infinite.
E(X) = μ = Σ_all x x·P(X = x)   → mean of a rv
Population variance – weighted average of squared deviations from the mean.
Var(X) = σ² = Σ_all x (x − μ)²·P(X = x)
Shortcut: V(X) = σ² = [Σ x²P(x)] − μ²
σ = √σ²
Laws of Expected Value
1. E(c) = c
2. E(X + c) = E(X) + c
3. E(cX) = cE(X)
Laws of Variance
1. V(c) = 0
2. V (X + c) = V(X)
3. V(cX) = c2V(X)
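These laws can be verified numerically on a small discrete distribution (the distribution below is hypothetical):

```python
# A small discrete rv: values and their probabilities.
values = [0, 1, 2]
probs = [0.25, 0.50, 0.25]

def E(vals):
    """Expected value: weighted average, weights = probabilities."""
    return sum(v * p for v, p in zip(vals, probs))

def V(vals):
    """Variance: weighted average of squared deviations from the mean."""
    mu = E(vals)
    return sum((v - mu) ** 2 * p for v, p in zip(vals, probs))

c = 3
assert E([v + c for v in values]) == E(values) + c       # E(X + c) = E(X) + c
assert E([c * v for v in values]) == c * E(values)       # E(cX) = cE(X)
assert V([v + c for v in values]) == V(values)           # V(X + c) = V(X)
assert V([c * v for v in values]) == c ** 2 * V(values)  # V(cX) = c²V(X)
print(E(values), V(values))   # 1.0 0.5
```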
Bivariate (joint) probability distribution of X & Y – table or formula that lists joint probabilities for all pairs of
values of X & Y.
Requirements for a discrete bivariate distribution:
Marginal probabilities
Covariance
ρ = σ_xy / (σ_x σ_y)
Sum of 2 variables
If properties 2, 3 & 4 are satisfied = Bernoulli process. Adding property 1 = binomial experiment. Random
variable of a binomial experiment is defined as the no. of successes in ‘n’ trials = binomial random variable.
Have a sequence of Bernoulli random variables: X1, X2, …, Xn where Xi = 1 if success & Xi = 0 for a failure
Under assumptions made this is a sequence of independent & identically distributed (iid) random variables.
E.g. If n = 5, one possible outcome of the sequence of trials is 0, 1, 1, 0, 1, another is 1, 1, 0, 0, 0 etc.
Binomial random variable
Random variable of a binominal experiment is no. of successes in n trials, called binomial random variable. Can
take values 0, 1, 2,…, n. Thus random variable is discrete.
Often interested in random variables constructed from other random variables. Consider random variable
formed by summing these ‘n’ Bernoulli random variables
X = X1 + X2 + … + Xn
X represents number of success in n trials & is called a binomial random variable. Characterised by 2
parameters n & p.
Once we know the parameter values we know everything about the random variable & its probability
distribution.
In gender composition example can calculate P(MMF) = P( X1=1, X2=1, X3=0)= pp(1-p) = p2(1-p)
But there are 3 ways to obtain 2 Males (others being MFM or FMM) so
The coefficient of 3 on the probability is exactly the total number of combinations when choosing 2 from 3 (3C2). In general:
nCx = n! / (x!(n − x)!)
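The combinatorial count and the resulting binomial probability, sketched for the gender-composition example (taking p = P(male) = 0.5 as an assumption):

```python
from math import comb

# P(exactly 2 males in n = 3 births).
n, x, p = 3, 2, 0.5
n_ways = comb(n, x)                          # 3C2 = 3 orderings: MMF, MFM, FMM
prob = n_ways * p ** x * (1 - p) ** (n - x)  # 3 · p²(1 - p)
print(n_ways, prob)                          # 3 0.375
```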
Cumulative probability
Similarly P(X ≤ 1) could be verified from table of individual probabilities by summing as before or directly from
table of cumulative probabilities.
The distribution of Xᵢ is given by:
xᵢ:            0       1
P(Xᵢ = xᵢ):    1 − p   p
⇒ E(Xᵢ) = 0 × (1 − p) + 1 × p = p
Var(Xᵢ) = (0 − p)²(1 − p) + (1 − p)²p = p(1 − p)(p + 1 − p) = p(1 − p)
Thus
E(X) = E(Σᵢ₌₁ⁿ Xᵢ) = Σᵢ₌₁ⁿ E(Xᵢ) = np
Area in all rectangles adds to 1. Do this by dividing each relative frequency by width of interval. Result is
rectangle over each interval whose area = probability that random variable will fall into that interval.
If histogram drawn w/ large no. of small intervals, can smooth edges of rectangles to produce smooth curve.
Function that approximates curve is called a probability density function (pdf) - this is a continuous version of
a probability histogram used for discrete rv’s.
Uniform Distribution
f(x) = 1 / (b − a), where a ≤ x ≤ b
f(x) = height. To find probability of any interval, find area under curve.
Use a continuous distribution to approximate a discrete one when no. of values is countable but large. E.g. no.
of values of weekly income.
There are many other common distributions that can be used to represent or model real phenomena
f(x) = (1 / (σ√(2π))) · e^(−(1/2)((x − μ)/σ)²),   −∞ < x < ∞
It is a continuous rv
- P(X = x) = 0
- P(a < X < b) = area under curve (pdf) between a & b
Pdf is unimodal, symmetric about its mean & bell shaped. The rv ranges over −∞ < x < ∞.
Increasing mean = shifts to the right. Vice versa. Large standard deviations = widen curves. Vice versa
X ~ N(μ, σ2)
To calculate the probability that a normal rv falls into any interval, compute the area in that interval under the curve. We would need a new table for each pair of values of μ & σ to find normal probabilities (the area can't be found by hand). But by standardising the rv,
Z = (X − μ) / σ
we only need 1 table: the standard normal probability table (lists cumulative probabilities). When a normal variable is transformed this way, it's called a standard normal random variable.
Finding values of Z
Use the notation Z_A for the value of Z such that the area to its right is A; finding it requires using the standard normal table backward.
Z_0.025 = 1.96 can be translated to P(Z > 1.96) = 0.025; the area to the right of Z_0.025 is 0.025.
Finding Z_0.05 = find the value of a standard normal random variable such that the probability that the rv is more than it is 5%.
Z A and percentiles
Z_0.05 = 1.645 means that 1.645 is the 95th percentile: 95% of all values of Z are below it, 5% above it.
p̂ = X / n
where X is the no. of successes & n is the sample size. Taking a sample of size n is a binomial experiment, hence X is binomially distributed. Thus the probability of any value of p̂ can be calculated from its value of X.
Recall when we introduced continuous probability functions, we converted a histogram so that total area in
rectangles = 1. Do the same for binomial distribution.
Denote binomial rv by X B
E.g. let n = 20, p = .5, then X_B = 0, 1, 2, ..., 20. A rectangle representing a value of x is drawn so that its area = probability. Accomplish this by letting the height of the rectangle = probability & the base of the rectangle = 1. Thus the base of each rectangle for x is x − .5 to x + .5, such that the rectangle representing X_B = 10 has base 9.5 to 10.5 (the continuity correction factor) & height P(X_B = 10) = .1762.
To use normal approximation, find area under normal curve between 9.5 & 10.5
To find normal probabilities, need to standardise x first, so need to find mean & standard deviation.
P(X = 10) ≈ P(9.5 < Y < 10.5) where Y is normal random variable approximating binomial rv X.
After standardising we get P(-.22 < Z < .22), & using table can find answer.
In general approximate
To find P(X = x) we must use the correction factor. If we don't, we are left finding the area of a line, which is 0.
When computing the probability of a range of values, we may omit the correction factor, though omitting it ↓ the accuracy of the approximation slightly.
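The approximation for the n = 20, p = .5 example can be checked against the exact binomial probability:

```python
from math import comb, erf, sqrt

def phi(z):
    """Cumulative standard normal probability P(Z < z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 20, 0.5
mu = n * p                     # 10.0
sigma = sqrt(n * p * (1 - p))  # ≈ 2.236

exact = comb(n, 10) * p ** n   # exact binomial P(X = 10) ≈ .1762

# Continuity correction: represent X = 10 by the interval (9.5, 10.5).
approx = phi((10.5 - mu) / sigma) - phi((9.5 - mu) / sigma)
print(exact, approx)           # the two probabilities agree closely
```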
Use sample data to estimate a population parameter in 2 ways. 1st, compute the value of the estimator & consider that value as the estimate of the parameter. This is called a point estimator – draws inferences about a population by estimating the value of an unknown parameter using a single value or point; combining sample information to produce a single number as the estimate. E.g. select 25 students at random, calculate mean weekly income to be $400, so mean weekly income of all students is 400.
3 drawbacks to this:
Interval estimator – draws inferences about a population by estimating the value of an unknown parameter using an interval. E.g. select 25 students at random, calculate mean weekly income to be between 380 and 420.
Unbiased estimator of a population parameter – is an estimator whose expected value is equal to that
parameter. Desired quality of an estimator is unbiasedness.
This means if you take an infinite no. of samples & calculate the value of the estimator in each sample, the average value of the
estimators = the parameter. But this does not tell us how close the estimator is to the parameter.
Another desired quality is that as sample size ↑, the sample statistic should come closer to the population
parameter.
Consistency – an unbiased estimator is said to be consistent if the difference between the estimator and the
parameter grows smaller as the sample size grows larger.
Relative efficiency – if there are 2 unbiased estimators of a parameter, the one whose variance is smaller is
said to be relatively more efficient.
Chapter 9: Sampling Distributions (week 7)
9.1 Sampling Distribution of the Mean
Sampling distribution created by sampling. It relates population parameters to properties of the distribution of
the sample statistics.
One way to create a sampling distribution is to draw samples of the same size from a population, calculate
statistic of interest, & use descriptive techniques to learn more about sampling distribution.
Second method relies on rules of probability, laws of expected value & variance to derive sampling
distribution.
Sampling distribution of the sample mean –informs us what a sample mean tells us about the population
mean.
Sampling distribution of a sample statistic, is the distribution of all possible values that can be assumed by that
statistic, computed from samples of the same size drawn from the same population
Population created by throwing fair die infinite times, with rv X indicating no. shown in any 1 throw.
Probability distribution of X:
X 1 2 3 4 5 6
P(X) 1/6 1/6 1/6 1/6 1/6 1/6
Sampling distribution created by drawing samples of size 2 from population (toss die twice or toss 2 dice).
x̄ is a new rv created by sampling, as the sample mean varies randomly from sample to sample.
Sample x̄
1,1 1.0
1,2 1.5
1,3 2.0
… …
6,6 6.0
36 diff possible samples of size 2, & each is equally likely, so each has probability 1/36 of being selected. However
x̄ can only assume 11 diff possible values: 1, 1.5, 2,…,6, with certain values of x̄ occurring more frequently than
others.
x̄ P(x̄)
1.0 1/36
1.5 2/36
2.0 3/36
2.5 4/36
3.0 5/36
3.5 6/36
4.0 5/36
4.5 4/36
5.0 3/36
5.5 2/36
6.0 1/36
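The table above can be generated by enumerating all 36 equally likely samples (a minimal sketch; exact fractions are used so the probabilities match the table):

```python
# Sampling distribution of x-bar for n = 2 throws of a fair die.
from collections import Counter
from fractions import Fraction

dist = Counter()
for d1 in range(1, 7):
    for d2 in range(1, 7):
        xbar = Fraction(d1 + d2, 2)
        dist[xbar] += Fraction(1, 36)

# 11 possible values of x-bar: 1.0, 1.5, ..., 6.0
assert len(dist) == 11
# Properties: E(x-bar) = mu = 3.5 and V(x-bar) = sigma^2 / n = (35/12)/2
mean = sum(x * p for x, p in dist.items())
var = sum((x - mean) ** 2 * p for x, p in dist.items())
print(mean, var)   # 7/2 35/24
```

Note σ² = 35/12 for one die throw, so V(x̄) = σ²/2 here, matching property 2 below.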
1. µ_x̄ = µ
2. σ²_x̄ = σ²/n
3. If X is normal, then x̄ is normal. If X is nonnormal, then x̄ is approximately normal for sufficiently
large sample sizes. The definition of “sufficiently large” depends on the extent of nonnormality of X: the
more nonnormal the original population is, the larger n needs to be for x̄ to be approximately normal. Vice versa.
4. From a modelling perspective, a sufficiently large sample implies it is not necessary to assume
normality of the underlying population to make inferences about the sample mean based upon the
normal distribution.
5. If a question doesn’t say the population is normal, always assume it’s nonnormal and write “by the central
limit theorem…” to justify treating the sample mean as normal.
Standard deviation of the sampling distribution is called the standard error of the mean.
As we increase sample size (n), the sample variance ↓ & the probability that x̄ will be closer to the mean ↑. This is
because a randomly selected value of x̄ is likely to be closer to the mean than a randomly selected value of X.
Thus the sampling distribution of x̄ becomes narrower (more concentrated about the mean) as n ↑, becoming
increasingly bell shaped.
Central limit theorem – the sampling distribution of the mean of a random sample drawn from any population is
approximately normal for a sufficiently large sample size. The larger the sample size, the more closely the sampling
distribution of x̄ will resemble a normal distribution.
Accuracy of the approximation depends on the probability distribution of the population & the sample size. If the population is
normal, then x̄ is normally distributed for all values of n. If the population is nonnormal, then x̄ is approximately
normal only for larger values of n.
If the population size is relatively large (at least 20 times the sample size), then the finite population correction factor
≈ 1 and can be ignored. Most populations qualify as large, because if a population is small we can investigate
each member of the population w/o need for a sample. So the finite population correction factor is usually omitted.
Previously we created the sampling distribution theoretically by listing all possible samples of size 2. Can produce the
distribution empirically by actually tossing 2 fair dice repeatedly, calculating the sample mean for each sample,
counting the no. of times each x̄ occurs & computing its relative frequency. If we toss the 2 dice a large no. of times, the relative
frequencies will approximate the theoretical probabilities, but this is impractical.
Z_A is the value of Z such that the area to the right of Z_A under the standard normal curve is equal to A.
Suppose x̄ ~ N(µ, σ²/n) and thus Z = (x̄ − µ)/(σ/√n)
General form:
P(µ − z_{α/2} σ/√n < x̄ < µ + z_{α/2} σ/√n) = 1 − α
where alpha (α) is the probability that x̄ does not fall into the interval.
e.g. Clare is an auditor. A client has customer accounts with outstanding balances that are known to have σ =
$30.334. Clare decides on a random sample of 250 accounts. What is the probability that the sample mean
balance will be within $4 of the population mean balance? Don’t need to know µ but need to know n and σ.
Need to determine P(µ − 4 < x̄ < µ + 4), which requires the sampling distribution of x̄.
Is X normal? Doesn’t matter, because the sample size is large.
By CLT, x̄ ~ N(µ, σ²/n), i.e. x̄ ~ N(µ, 1.92²) since σ/√n = 30.334/√250 ≈ 1.92.
P(µ − 4 < x̄ < µ + 4) = P(−4/1.92 < (x̄ − µ)/(σ/√n) < 4/1.92)
= P(−2.08 < Z < 2.08) = .9624
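A quick numeric check of this example (a sketch; the normal CDF is built from the standard library’s error function):

```python
# P(mu - 4 < x-bar < mu + 4) with sigma = 30.334 and n = 250,
# using the CLT -- mu itself is not needed.
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

sigma, n = 30.334, 250
se = sigma / math.sqrt(n)        # standard error ≈ 1.92
z = 4 / se                       # ≈ 2.08
prob = normal_cdf(z) - normal_cdf(-z)
# se ≈ 1.92; prob ≈ .963 (the notes round z to 2.08, and the table gives .9624)
print(round(se, 2), round(prob, 4))
```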
Suppose a population has mean µ and standard deviation σ. The population mean is unknown & we need to estimate its value.
Draw a random sample of size n & calculate x̄.
The central limit theorem states that x̄ is normally distributed if X is normally distributed, or approximately
normally distributed if X is nonnormal & n is sufficiently large. This means
Z = (x̄ − µ)/(σ/√n) → used only when the population is, or is assumed to be, normal
P(µ − z_{α/2} σ/√n < x̄ < µ + z_{α/2} σ/√n) = 1 − α
P(x̄ − z_{α/2} σ/√n < µ < x̄ + z_{α/2} σ/√n) = 1 − α
^ another form of the probability statement about the sample mean: the confidence interval estimator of µ.
x̄ − z_{α/2} σ/√n is the lower confidence limit (LCL)
x̄ + z_{α/2} σ/√n is the upper confidence limit (UCL)
i.e. the interval is x̄ ± z_{α/2} σ/√n
Interval estimators produce a range of values & attach a degree of confidence associated with that interval,
hence the name confidence interval.
Confidence level – the probability that the interval includes the actual value of μ. Usually set 1 – α close to 1
(between .90 & .99).
Be careful: to apply the formula, specify the confidence level 1 – α, from which we determine α, α/2, and z_{α/2} (table 3 Appendix B).
e.g. want to estimate the mean value of the distribution resulting from the throw of a fair die. We know the distribution:
µ = 3.5 & σ = 1.71 (from textbook). Pretend we don’t know µ & want to estimate it. To do
so, we draw a sample of size n = 100 & calculate x̄.
x̄ ± z_{α/2} σ/√n
1 – α = .90, α = .10, so z_{α/2} = z_{.05} = 1.645
Hence x̄ ± z_{α/2} σ/√n = x̄ ± 1.645 × 1.71/√100 = x̄ ± .281
This means if we repeatedly draw samples of size 100 from this population, 90% of the values of x̄ will be such that µ
lies somewhere between x̄ − .281 & x̄ + .281, & 10% of the values of x̄ will produce intervals that do
not include µ.
If we draw 40 samples of 100 observations each, we expect 4 (10%) values of x̄ to produce intervals that exclude µ. We
don’t always get the expected 10% (in this case) “wrong” intervals.
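The 90%-coverage interpretation can be illustrated by simulation (a sketch; the trial count of 2000 is an arbitrary choice):

```python
# Draw many samples of n = 100 die throws, form x-bar ± 1.645*sigma/sqrt(n),
# and count how often the interval covers mu = 3.5. Coverage ≈ 90%.
import math
import random

random.seed(1)
mu, sigma, n = 3.5, math.sqrt(35 / 12), 100   # sigma ≈ 1.71
half_width = 1.645 * sigma / math.sqrt(n)     # ≈ .281

trials, covered = 2000, 0
for _ in range(trials):
    xbar = sum(random.randint(1, 6) for _ in range(n)) / n
    if xbar - half_width < mu < xbar + half_width:
        covered += 1
print(covered / trials)   # close to 0.90
```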
Improve confidence associated w/ interval estimate by letting confidence level 1 – α = .95 etc.
Width of confidence interval estimate is a function of the population standard deviation, the confidence level,
& sample size.
Doubling population standard deviation → doubling width of confidence interval estimate. Logical, if there’s a
great deal of variation in the rv, it is more difficult to accurately estimate the population mean. That difficulty
is translated into a wider interval.
Although we cannot control value of σ, we can select values of sample size & confidence level:
↓ confidence level → narrower interval. Vice versa: the interval must be widened to be more
confident in the estimate. However, a large confidence level is desirable, so there is a trade-off between confidence
level & width of interval. 95% confidence level is standard.
↑ sample size fourfold → ↓ width of interval by half. A larger sample provides more potential info; the ↑ amount
of info is reflected in a narrower interval. But the trade-off is ↑ sample size → ↑ sampling cost.
Estimating population mean using the sample median
Sample mean produces better estimators than the sample median because the sample median ignores the actual
observations & uses ranks instead. This means we lose info. With less info we have less precision in the
interval estimators (wider interval estimators) & so ultimately make poorer decisions.
Error of estimation
P(−z_{α/2} σ/√n < x̄ − µ < +z_{α/2} σ/√n) = 1 − α
Tells us the difference between x̄ & µ lies between −z_{α/2} σ/√n & +z_{α/2} σ/√n with probability 1 – α.
In other words, the max. error of estimation that we are willing to tolerate is z_{α/2} σ/√n; label this value B, which
stands for the bound on the error of estimation.
B = z_{α/2} σ/√n
Solving for the sample size: n = [(z_{α/2} × σ)/B]²
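A sketch of the sample-size formula (the inputs z = 1.96, σ = 25, B = 5 are hypothetical, not from the notes):

```python
# Required sample size for a given bound B on the error of estimation:
# n = ((z_{alpha/2} * sigma) / B)^2, rounded up to a whole observation.
import math

def sample_size(z, sigma, B):
    return math.ceil((z * sigma / B) ** 2)

# Hypothetical numbers: 95% confidence (z = 1.96), sigma = 25, bound B = 5
print(sample_size(1.96, 25, 5))     # 97
print(sample_size(1.96, 25, 2.5))   # 385 -- halving B quadruples n
```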
1. Null hypothesis (H 0 ) – defendant is innocent, or more generally, some statement about a population
parameter
2. Alternative hypothesis or research hypothesis (H 1 ) – defendant is guilty
2 possible decisions:
1. Convict defendant = rejecting null hypothesis in favour of the alternative (enough evidence to
conclude that defendant is guilty)
2. Acquit defendant = not rejecting the null hypothesis in favour of the alternative (not enough evidence
to conclude that defendant was guilty)
2 types of errors:
1. Type I error – occurs when we reject a true null hypothesis = innocent person wrongly convicted.
P(type I error) = P(reject H 0 | H 0 true) = α (significance level).
2. Type II error – defined as not rejecting a false null hypothesis = guilty defendant is acquitted. P(type II
error) = β.
Type I error is more serious, so its probability is kept small: we assume the null hypothesis is true & the prosecution must prove
otherwise.
The null hypothesis always states that the parameter equals the value specified in the alternative hypothesis, and the
alternative hypothesis states what we are investigating.
Test statistic – the criterion on which we base our decision about the hypothesis (the evidence presented in the case
in a criminal trial). Based on the best estimator of the parameter, e.g. the best estimator of the population mean is
the sample mean:
Z = (x̄ − µ)/(σ/√n) → ALL TEST STATISTICS HERE ARE APPROX NORMAL
If the test statistic’s value is inconsistent w/ the null hypothesis, we reject the null hypothesis & infer that the alternative
hypothesis is true. E.g. trying to decide if the mean is greater than 350: a large value of x̄, say 600, would provide
enough evidence to infer the mean is > 350. But if x̄ is close to 350, say 355, this does not provide
much evidence to infer that the mean is > 350. In the absence of sufficient evidence, we do not reject the null
hypothesis in favour of the alternative.
11.2 Testing the Population Mean when the Population Standard Deviation is
Known
e.g. H 1 : μ > 170 (install new system)
H 0 : μ = 170 (do not install new system)
2 approaches to determine if sample mean contains sufficient info to infer about population mean e.g. is 178
sufficiently greater than 170 to allow us to confidently infer that the population mean is > than 170:
1. Rejection region – is a range of values such that if the test statistic falls into that range, we reject the
null hypothesis in favour of the alternative hypothesis.
A sample mean of 500 would mean the null hypothesis is false & we would reject it; if the sample mean = 171,
we do not reject the null hypothesis, because it is possible to observe a sample mean of 171 from a
population whose mean is 170. However, a sample mean of 178 is neither very far nor very close.
Standardise: the rejection region is x̄ > x̄_L where, since P(Z > z_α) = α,
z_α = (x̄_L − µ)/(σ/√n)
Instead of using x̄ as the test statistic, where the rejection region has to be set up in terms of x̄_L, we can use the
standardised value of x̄ (easier) – the standardised test statistic:
Z = (x̄ − µ)/(σ/√n)
2. p-value of a test – is the probability of observing a test statistic at least as extreme as the one
computed given that the null hypothesis is true. Measure of the amount of statistical evidence that
supports the alternative hypothesis. DOESN’T INVOLVE CHOOSING α
An extremely small p-value indicates that the actual data differs markedly from that expected if the null
hypothesis were true
Drawback of rejection region method is that it implies a decision is made based purely on the result of the
rejection region method. However, it is only one of several factors considered by a manager when making a
decision e.g. cost & feasibility of restructuring the billing system, possibility of error.
To make a better decision, need to measure amount of statistical evidence supporting the alternative
hypothesis which is provided by p value of a test.
e.g. the p-value is the probability of observing a sample mean at least as large as 178 when the population mean is
170:
p-value = P(x̄ > 178) = P(Z > 2.46) = 1 – P(Z < 2.46) = .0069
The probability of observing a sample mean at least as large as 178 from a population whose mean is 170 (given H 0)
is .0069, which is very small. Since the event is very unlikely, we doubt the null hypothesis is true, so we reject the null hypothesis
& support the alternative hypothesis.
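The p-value above can be reproduced numerically (a sketch using the example’s numbers):

```python
# One-tail p-value for H0: mu = 170 vs H1: mu > 170
# with x-bar = 178, sigma = 65, n = 400.
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

xbar, mu0, sigma, n = 178, 170, 65, 400
z = (xbar - mu0) / (sigma / math.sqrt(n))   # = 2.46
p_value = 1 - normal_cdf(z)
print(round(z, 2), round(p_value, 4))       # 2.46 0.0069
```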
The closer x̄ is to the hypothesised mean, 170, the larger the p-value. Vice versa.
Values of x̄ far above 170 tend to indicate that the alternative hypothesis is true. Thus the smaller the p-value,
the more statistical evidence supports the alternative hypothesis.
Describing the p-value
- p-value < .01 = overwhelming evidence to infer that the alternative hypothesis is true. Test is highly
significant.
- .01 < P-value < .05 = strong evidence to infer that alternative hypothesis is true. Result deemed
significant
- .05 < P-value < .10 = weak evidence to infer that alternative hypothesis is true. When p-value > 5% we
say result is not statistically significant.
- P value > .10 = little to no evidence to infer that alternative hypothesis is true
Rejecting the null hypothesis does not prove the alternative hypothesis is true, since our conclusion is based on sample
data & not the entire population; rather we say there is enough statistical evidence to infer that the null hypothesis is
false & the alternative hypothesis is true.
Likewise, if value of test statistic does not fall into rejection region, we don’t say we accept null hypothesis
(which implies we’re saying it’s true), rather we state we do not reject null hypothesis & we conclude there
isn’t enough evidence to show that the alternative hypothesis is true.
Conclusion of a test of hypothesis – reject null hypothesis = conclude enough statistical evidence to infer that
the alternative hypothesis is true. Vice versa.
One-tail tests – rejection region located in only 1 tail of the sampling distribution. Upper or lower tail.
If α = .05 in a 2-tail test, the rejection region becomes .025 in each tail.
From 1 tail test → 2 tail test, null hypothesis same but alternative hypothesis changes. Two-tail tests
conducted when the alternative hypothesis specifies that the mean is not equal to the value stated in the null
hypothesis:
Conduct one-tail test that focuses on the right tail of sampling distribution when we want to know whether
there is enough evidence to infer that the mean is > the quantity specified by the null hypothesis:
H0: μ = μ0
H1: μ > μ0
Left tail test used when we want to determine whether there is enough evidence to infer that the mean is <
value of the mean state in the null hypothesis:
H0: μ = μ0
H1: μ < μ0
Two-tail test:
H0: μ = μ0
H1: μ ≠ μ0
Determine whether hypothesised value of mean falls into the interval estimate.
e.g. department store thinking of introducing a new billing system. New system only effective if mean monthly
account > $170. n = 400, x̄ = $178, σ = $65.
Suppose that when accounts average at least $180, the new system is so attractive that the manager wouldn’t want to make the mistake of not
installing it.
This tells us that when the mean account = $180, the probability of not rejecting H 0 when H 0 is false is β = .0764.
Or, expressed in terms of the sample mean, the rejection region is x̄ > 177.57.
A statistical test of hypothesis is defined by α & n, both of which are selected by practitioner. Can judge how
well test functions by determining probability of Type II error at some value of the parameter.
If we think β is too high, we can ↓ it by ↑ α, which would ↑ the chance of making a Type I error, which is costly.
Alternatively, can ↑ n.
Developing an understanding of statistical concepts: larger sample size = more info = better decisions.
n ↑ → standard error of the mean (σ/√n) ↓ → narrower distribution, which represents more info; the ↑ info is
reflected in a smaller probability of a Type II error.
Power of a test
Another way of expressing how well a test performs is to report its power – the probability of it leading us to
reject H 0 when it’s false. Power of a test = 1 – β.
If, given same H 1 , n, α, one test has a higher power than another, it is said to be more powerful.
↑ α → ↑power as β ↓.
We control α, so we set up the hypotheses so that the error that is more costly is the Type I error.
Data type
Problem objective | Nominal | Ordinal | Interval
Describe a population | 12.3, 15.1 | Not covered | 12.1, 12.2
Compare 2 populations | 13.5, 15.2 | 19.1, 19.2 | 13.1, 13.3, 13.4, 19.1, 19.3
Compare 2 or more populations | 15.2 | 19.3 | Chapter 13, 19.3
Analyse the relationship between 2 variables | 15.2 | 19.4 | Chapter 16, 19.3
Analyse the relationship between 2 or more variables | Not covered | Not covered | Chapters 17 & 18
Chapter 12: Inference about a Population
12.1 Inference about a Population Mean when the Standard Deviation is
Unknown
Usually when population mean is unknown, so is the population standard deviation.
When σ is unknown & the population is normal, the test statistic for testing hypotheses about µ is:
t = (x̄ − µ)/(s/√n) (t-statistic), where s is substituted for σ
Confidence interval: x̄ ± t_{α/2} s/√n, with degrees of freedom v = n – 1
Thus t = (x̄ − µ)/(s/√n) = (2.18 − 2.0)/(.981/√148) = 2.23
e.g. calculate the sample mean & standard deviation from sample data.
We want a 95% confidence interval estimate: 1 – α = .95, therefore α = .05 & α/2 = .025
x̄ ± t_{α/2} s/√n = 11 343 ± 1.972 × 4400/√184 = 11 343 ± 640
When the population is nonnormal, we can still use the t-test & confidence interval estimate provided n is large
enough. The required n depends on the extent of nonnormality. This works because in large samples s will be close to σ with high
probability.
Properties of t-distribution
When the population is small, must adjust the test statistic & interval estimator using the finite population correction
factor. However, if the population is at least 20 times larger than n, it can be ignored.
Finite populations allow us to use confidence interval estimator of a mean to produce a confidence interval
estimator of the population total by multiplying LCL & UCL by population size.
N [x̄ ± t_{α/2} s/√n]
s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1)
The t-statistic, like the z-statistic, measures the difference between x̄ & the hypothesised value of µ in terms of the no. of
standard errors. However, when σ is unknown, we estimate the standard error by s/√n.
The t-statistic has 2 variables, x̄ & s, both of which vary from sample to sample.
Thus parameter of interest in describing a population of nominal data is the population proportion p.
This parameter is used to calculate probabilities based on a binomial experiment, which has 2 possible outcomes per trial.
Most practical applications of inference about p involve more than 2 outcomes, but many cases only
interested in “success” & other outcomes are “failures”.
The logical statistic used to estimate & test the population proportion is the sample proportion:
P̂ = X/n → unbiased point estimator of the population proportion. (P̂ is equivalent to x̄ for 0/1 data.)
Sampling distribution of P̂ is approx. normal with mean p & standard deviation √(p(1 − p)/n), provided that np
& n(1 − p) > 5.
Z = (P̂ − p)/√(p(1 − p)/n)
Interval estimator: P̂ ± z_{α/2} √(p(1 − p)/n)
Which is useless: to produce the interval estimate, we must compute the standard error √(p(1 − p)/n), which
requires us to know p, the parameter we wish to estimate in the first place. So use:
P̂ ± z_{α/2} √(P̂(1 − P̂)/n)
e.g. H 1 : p > .5
H 0 : p = .5
Test statistic is Z = (P̂ − p)/√(p(1 − p)/n)
Say α = .05
Hence the sample proportion is P̂ = X/n = 407/765 = .532, and z = (.532 − .5)/√(.5 × .5/765) = 1.77
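The computation can be sketched as (numbers from the example above):

```python
# z-test for H0: p = .5 vs H1: p > .5 with 407 successes in n = 765 trials.
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

x, n, p0 = 407, 765, 0.5
p_hat = x / n                                     # ≈ .532
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)   # ≈ 1.77
p_value = 1 - normal_cdf(z)                       # ≈ .038
print(round(p_hat, 3), round(z, 2), round(p_value, 3))
```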
Missing data
Use the finite population correction factor when the population isn’t at least 20 times larger than the sample.
P̂ ± z_{α/2} √(P̂(1 − P̂)/n)
Multiply it by the population size to estimate the total no. of successes in a large finite population:
N [P̂ ± z_{α/2} √(P̂(1 − P̂)/n)]
n depends on the confidence level & the bound on the error of estimation that the stats practitioner will tolerate.
When the parameter to be estimated is a proportion, the bound on the error of estimation is:
B = z_{α/2} √(P̂(1 − P̂)/n)
Solving for n:
n = [z_{α/2} √(P̂(1 − P̂)) / B]²
Probabilistic model – a realistic representation of the relationship between 2 variables - represents the
randomness that is part of a real-life process. E.g. simple linear regression model
To create a probabilistic model, start with a deterministic model that approximates relationship we want to
model. Then add a term that measures the random error of the deterministic component.
e.g. a real estate agent knows the cost of building a new house is $100/square foot & that most lots sell for about
$100 000. The approx selling price of a house of x square feet would be y = 100 000 + 100x.
We know however that the selling price isn’t exactly $300 000 (for a 2000-square-foot house), but ranges from about $200 000 to $400 000. In other
words, the deterministic model isn’t suitable, so use a probabilistic model to represent the situation: y = 100 000 + 100x + ϵ
where ϵ = error or disturbance term – the difference between the actual selling price & the estimated price based on the
size of the house. It accounts for all variables, measurable & immeasurable, that are not part of the model.
The value of ϵ will vary from sale to sale even if x is constant, meaning houses of the same size sell for diff prices because of
diff location etc.
simple linear regression model – 1 independent variable. Analyses the relationship between 2 variables, x & y,
both of which must be interval. Finds line of best fit. P. 642 excel
To define relationship, need to know β 0 & β 1 but they are population parameters, which are almost always
unknown.
Y = β0 + β1x + ϵ
where
- Y = dependent variable
- x = independent / explanatory variable
- β0 = y-intercept
- β1 = slope of the line (rise/run)
- ϵ = error variable
(β0, β1 & ϵ are all population parameters and can’t be measured.)
ŷ = b0 + b1x
- b0 = y-intercept, estimate of β0
- b1 = slope, estimate of β1 → the sign of the slope coefficient is the same as the sign of the covariance (& correlation)
between Y_i & X_i
- ŷ = predicted value of y
Least squares minimises Σ_{i=1}^{n} (y_i − ŷ_i)².
Can only determine the value of ŷ for values of x within the range of the sample values of x!!!
Least squares line coefficients
b1 = s_xy / s_x²
b0 = ȳ − b1 x̄
where
s_xy = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / (n − 1)
s_x² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1)
x̄ = (Σ_{i=1}^{n} x_i) / n
ȳ = (Σ_{i=1}^{n} y_i) / n
pg 638
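The coefficient formulas can be sketched in code (the (x, y) data here are made up for illustration):

```python
# Least squares coefficients b1 = s_xy / s_x^2 and b0 = y-bar - b1 * x-bar.
def ols(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    s_xx = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    b1 = s_xy / s_xx           # slope
    b0 = ybar - b1 * xbar      # intercept
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = ols(xs, ys)
print(round(b0, 3), round(b1, 3))   # 0.05 1.99
```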
Residuals (e_i) – deviations between the actual data points & the line; the observations of the error variable: differences
between the observed values y_i & the predicted values ŷ_i.
e_i = y_i − ŷ_i
Sum of squares error (SSE) – minimised sum of squared deviations. Basis for other statistics that assess how
well the linear model fits the data.
Some basics:
• OLS produces
Y i = b 0 + b 1 X i + e i → the predicted regression relationship links the (X,Y) pairs via estimated
parameters and calculated residuals
Reliable estimates of β 1 will require assumptions restricting the relationship between X i & ϵ i
• population & sample regression line diff, but as n↑, SRL → PRL
ϵ_i ~ N(0, σ_ϵ²)
This shows homoskedasticity: the standard deviation is the same, so the shape is the same everywhere.
b1 = s_xy / s_x² = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²
& if the variance of x = 0 (x_i = x̄ for all i), the denominator is 0 & b1 is undefined.
4. Zero conditional mean: E(ϵ i |x i )=0 which implies ϵ i & X i are uncorrelated
5. Homoskedasticity: Var(ϵ i) = σ 2
6. Disturbances are uncorrelated: Cov(e i , e j ) = 0, (i not equal j)
7. Disturbances are normally distributed (used for inference)
1 – 4 allow the OLS estimates to be unbiased & consistent; 5 allows the estimates to be best out of all linear
estimators.
Decomposition of variance
Σ(Y_i − Ȳ)² = Σ(Ŷ_i − Ȳ)² + Σ e_i²  (SST = SSR + SSE)
SSR – sum of squares due to the regression; what your model explains; distance between ŷ & ȳ, i.e. fitted
line to mean value.
SSE – sum of squared errors (residuals); what your model doesn’t explain, i.e. distance between ŷ & y.
Coefficient of determination (R²) – tests the strength of the linear relationship; the explanatory power of the model – NOT
whether they are +vely or −vely related.
Define:
R² = SSR/SST = 1 − SSE/SST
R² measures the goodness of fit of the model: how much of the variation in y can be explained by the model.
Note: 0 ≤ R² ≤ 1. The closer R² is to 1, the better the fit.
e.g. R2 = .6483, means 64.83% of variation in dependent variable is explained by variation in independent
variable, remaining 35.17% unexplained.
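The decomposition and R² can be verified numerically (a sketch; the toy data and coefficients are hypothetical, chosen so that b0 = .05, b1 = 1.99 are the least squares values for this data):

```python
# Decomposition of variance and R^2 = SSR/SST = 1 - SSE/SST.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = 0.05, 1.99            # least squares coefficients for this data

ybar = sum(ys) / len(ys)
y_hat = [b0 + b1 * x for x in xs]
sst = sum((y - ybar) ** 2 for y in ys)                  # total variation
ssr = sum((yh - ybar) ** 2 for yh in y_hat)             # explained
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))    # unexplained

r2 = ssr / sst
print(round(r2, 4))   # SST = SSR + SSE holds at the least squares fit
```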
Hence we need to assess how well linear model fits data, as if it’s poor, we discard linear model & seek
another one.
Standard error of estimate, t-test of the slope, & coefficient of determination used to evaluate whether linear
model should be employed.
Least squares method determines the coefficients that minimise the sum of squared deviations between
points & line defined by coefficients.
SSE = (n − 1)(s_y² − s_xy²/s_x²)
The disturbance term’s standard deviation σ_ϵ shows the spread around the population regression line. But we do not
know the spread, so we need to estimate it.
If σ_ϵ is large, some errors will be large, which implies the model’s fit is poor. If σ_ϵ is small, errors tend to be close
to the mean 0, hence the model fits well.
Hence we can use σ_ϵ to measure the suitability of using a linear model. But σ_ϵ is a population parameter, and is
unknown.
The unbiased estimator of the variance of the error variable σ_ϵ² is s_ϵ²; its square root is the standard error of estimate:
SEE = s_ϵ = √(SSE/(n − 2)), where s_ϵ² = Σ_{i=1}^{n} e_i² / (n − 2) = SSE/(n − 2)
p. 651 excel
Compare s_ϵ with ȳ. The standard error of estimate cannot be used as an absolute measure of the model’s utility,
as there is no predefined upper limit on s_ϵ.
Most often we conduct a 2-tail test to determine whether there is sufficient evidence to infer that a linear
relationship exists (can be 1-tail):
H0: β1 = 0
H1: β1 ≠ 0
No linear relationship does not mean there’s no relationship at all, can imply quadratic relationship etc.
b1 ± t_{α/2, v} s_{b1}
b1 & b0:
- are normally distributed, as they are linear functions of the Y_i, which are assumed to be normal
- even without normality of the Y_i, can invoke the CLT & assume the b_i will be asymptotically normal
s_{b1} = √(s_ϵ² / Σ(x_i − x̄)²), where s_ϵ² = SSE/(n − 2)
t-statistic:
t = (b 1 - β 1 ) / s b1 ~ t n-2
If the test statistic falls within the rejection region, conclude the variables are linearly related: β1 > 0 = +vely related, β1
< 0 = −vely related. Since β1 is the coefficient of x, a 1-unit change in x is associated with a β1 change in y.
Do not test for β 0 as interpreting value of y-intercept can lead to erroneous conclusions.
Cause-and-effect relationship
Even if there’s evidence of a linear relationship, it does not mean that changes in the independent variable cause changes in the
dependent variable.
The best estimator of the difference between 2 population means, µ1 − µ2, is the difference between the 2 sample means,
x̄1 − x̄2.
Sampling distribution of x̄1 − x̄2:
E(x̄1 − x̄2) = µ1 − µ2
Variance: V(x̄1 − x̄2) = σ1²/n1 + σ2²/n2
Standard error of x̄1 − x̄2 is √(σ1²/n1 + σ2²/n2).
A point prediction by itself will not provide any info on how closely the value will match the true selling price; need to use an
interval.
Prediction interval (for a particular value of y when x = x_g):
ŷ ± t_{α/2, n−2} × s_ϵ × √(1 + 1/n + (x_g − x̄)² / Σ(x_i − x̄)²)
Confidence interval estimator of the expected value E(y) = β0 + β1x:
ŷ ± t_{α/2, n−2} × s_ϵ × √(1/n + (x_g − x̄)² / Σ(x_i − x̄)²)
Basic idea – use the fitted regression line (and b 0 , b 1 ) to predict a value of Y from a given value of X
[Chart: Sydney unleaded petrol prices 1997Q3 to 2004Q4, with forecasts to 2006Q3; Price (cents) vs Time (t)]
Y = β0 + β1x1 + β2x2 + … + βkxk + ϵ
The estimated slope coefficient for the explanatory variable x 1 represents how much the outcome variable
changes when x 1 increases by one unit and all other variables remain the same.
With a large enough sample you can treat the estimated regression coefficients as if they are normally
distributed irrespective of the underlying distribution of the error term.
Error variable is retained because although we included additional independent variables, deviations between
predicted values of y & actual values of y will still occur.
As you add more explanatory variables to a multiple regression model you expect the standard error of the
estimate to ↓
1. Select the independent variables you believe are linearly related to the dependent variable: e.g.
education, age etc → income
2. Use a computer to generate the coefficients & the statistics used to assess the model. Standard error
of estimate: s_ϵ = √(SSE/(n − k − 1)), where k is the no. of independent variables in the model. Judge the
magnitude of the standard error of estimate relative to the mean of y.
3. test validity of model: H 0 : β 1 = β 2 =… β k = 0 H 1 : at least one β i ≠ 0. If null true, none of the
independent variables is linearly related to y & therefore model is invalid.
4. Interpreting the coefficients:
A key “threat” to using simple linear regression for empirical work is the problem of omitted variables
e.g. Suppose you are interested in the relationship between food consumption and level of household
income. Even though your primary interest is in the consumption-income relationship it is good practice to
add in other explanatory variables such as household size in order to avoid problems of confounding.
Y i = β 0 + β 1 X 1i + β 2 X 2i + ϵ i , i = 1, … , n
Violation of A4 implies? A biased estimator of β1.
Adjusted R2
It is adjusted to take into account the sample size & no. of independent variables. If no. of independent
variables (k) is large relative to sample size, the unadjusted R2 value may be unrealistically high
Adding more independent variables should ↑ R², but that does not necessarily mean the model is relevantly better if you add useless
variables.
Recall
R² = SSR/SST = 1 − SSE/SST
Adjusted R², denoted R̄², is given by
R̄² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]
where n = number of observations
& k = number of explanatory variables
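A sketch of the adjusted-R² formula (the inputs n = 30, k = 4 are hypothetical; R² = .6483 echoes the earlier example):

```python
# Adjusted R^2 penalises extra explanatory variables:
# R-bar^2 = 1 - (SSE/(n-k-1)) / (SST/(n-1)) = 1 - (1 - R^2)(n-1)/(n-k-1)
def adjusted_r2(r2, n, k):
    # r2 = 1 - SSE/SST; n observations; k explanatory variables
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.6483, 30, 4), 4))   # 0.592
```

Adjusted R² is always ≤ the unadjusted value when k > 0, and the gap widens as k grows relative to n.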
t-tests allow us to determine whether β_i ≠ 0 (for i = 1, 2, …, k). The F-test in the analysis of variance combines
these t-tests, meaning we test all β_i at once to determine whether at least 1 of them ≠ 0. The F-test needs to be
performed only once, as opposed to k t-tests; since the probability of a Type I error in a
single test is α, the chance of erroneously concluding the model is valid is substantially less with the F-test
than with the t-tests.
Also, because of a commonly occurring problem called multicollinearity t-tests may indicate that some
independent variables are not linearly related to the dependent variable when in fact they are.
Multicollinearity doesn’t affect F-tests.
Return on stock is dependent variable (Y), return on index is independent variable (X).
b 1 measures how sensitive the stock’s rate of return is to changes in the level of the overall market.
If b 1 > 1 → stock’s rate of return is more sensitive to changes in the level of the overall market than the
average stock, meaning it’s more volatile than the market & therefore riskier than the entire market. i.e. b 1
= 2, 1% ↑ in index results → average ↑ of 2% in stock’s return.
b1 measures the stock’s market-related (or systematic) risk because it measures the volatility of the stock price that is related
to the overall market. Tells us nothing about the strength of the relationship.
Firm-specific (or nonsystematic) risk is the proportion of risk associated w/ events specific to company rather
than market. E.g. company’s managers.
χ² = (n − 1)s²/σ² ~ χ²_{n−1}
Confidence interval for σ²:
LCL = (n − 1)s² / χ²_{α/2, n−1}
UCL = (n − 1)s² / χ²_{1−α/2, n−1}
But for the population variance the CI is (s² − error_L < σ² < s² + error_U) & error_L ≠ error_U
Chi-squared goodness-of-fit test – used to test whether the observed & expected distributions are the same.
Like counting the no. of successes in a binomial experiment, we count the no. of outcomes falling into each of the k cells in a multinomial experiment. In doing so, we get observed frequencies f1, f2, …, fk where fi is the observed frequency of outcomes falling into cell i (i = 1, 2, …, k), and

f1 + f2 + … + fk = n
The chi-squared goodness-of-fit test is the equivalent of the z-test of p in binomial experiments, but for multinomial experiments.
Expected frequency: ei = npi → from previous data, the expected frequencies of each cell
χ² = Σ[i=1..k] (fi − ei)² / ei ~ χ²(k−1)
If H0 is true, observed & expected frequencies should be similar & so the test statistic should be small. If H0 is untrue, some observed & expected frequencies will differ & the test statistic will be large.
Rejection region:
χ² > χ²(α, k−1)
Required condition
Like the normal approximation to the binomial in the sampling distribution of a proportion, where we needed np & n(1−p) to be > 5, the chi-squared test statistic needs to satisfy the rule of five – n must be large enough so that the expected value for each cell is 5 or more. Cells can be combined to satisfy this condition.
This is because the test can be unreliable if any value of ei = npi gets too small (e.g. 3 or 4).
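Assuming scipy is available, a goodness-of-fit sketch with hypothetical counts (all expected frequencies here satisfy the rule of five):

```python
from scipy.stats import chisquare

observed = [48, 35, 17]              # f_i: observed cell frequencies
p0 = [0.5, 0.3, 0.2]                 # hypothesized cell probabilities
n = sum(observed)
expected = [n * p for p in p0]       # e_i = n * p_i; each >= 5

# chi^2 = sum (f_i - e_i)^2 / e_i, compared against chi^2 with k-1 df
stat, pvalue = chisquare(observed, f_exp=expected)
```

A small statistic (large p-value) means the observed counts are close to what H0 predicts, so we would not reject the hypothesized probabilities.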
Chi-Squared Test of a Contingency Table – used to determine whether there is enough evidence to infer that 2 nominal variables are related & to infer that differences exist between 2 or more populations of nominal variables. We compare observed cell frequencies with those expected under the null hypothesis of independence.
e.g. 2 variables: undergrad degree (B.A., B.Eng., B.B.A. & other) & MBA major (marketing, finance & accounting). We want to know whether the variables are related. OR determine whether differences exist between B.A.’s, B.Eng.’s, B.B.A.’s & others. In other words, treat holders of each undergrad degree as a separate population, where each population has 3 possible values represented by MBA major. The objective is to compare 4 populations. OR the other way around.
Test statistic:

χ² = Σ[i=1..r] Σ[j=1..c] (oij − eij)² / eij ~ χ²(ν)
Same test statistic as one used to test proportions in the goodness-of-fit test.
Need probabilities to calculate the expected values eij for the test statistic. Get these probabilities from the sample data.
Assume H0 is true:

eij = (ni. / n) × (n.j / n) × n = ni. n.j / n
To determine the RR, we need to know the no. of degrees of freedom associated w/ the chi-squared statistic: ν = (r − 1)(c − 1) for a contingency table w/ r rows & c columns.
χ² > χ²(α, ν)
Rule of five
In a contingency table where 1 or more cells have expected values < 5, we need to combine rows or columns to satisfy the rule of five.
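A sketch of the contingency-table test using scipy’s chi2_contingency, with hypothetical counts for the undergrad-degree × MBA-major example:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = undergrad degree, columns = MBA major
table = [[31, 13, 16],   # B.A.
         [ 8, 16,  7],   # B.Eng.
         [12, 10, 17],   # B.B.A.
         [10,  5,  7]]   # other

# Returns the chi^2 statistic, p-value, df = (r-1)(c-1), and the
# expected frequencies e_ij = n_i. * n_.j / n computed under H0
stat, pvalue, dof, expected = chi2_contingency(table)
```

Checking the returned `expected` array against the rule of five tells us whether rows or columns need to be combined before trusting the test.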