
Chapter 1: What is Statistics?

(week 1)
Statistics is a way to get information from data.

Descriptive Statistics

Descriptive statistics deals with methods of organising, summarising, and presenting data in a convenient and
informative way. One form of descriptive stats uses graphical techniques (e.g. the histogram) and another uses
numerical techniques (measures of central location – the average or median are examples; measures of
variability – the range is an example).

Inferential Statistics

Inferential stats is a body of methods used to draw conclusions or inferences about characteristics of a
population based on sample data.

1.1 Key Statistical Concepts


Statistical inference problems involve 3 key concepts:

1. Population

A population is the group of all items of interest to a statistics practitioner. It is frequently large and may in
fact be infinitely large.

A descriptive measure of a population is called a parameter, and it usually represents the information we
need.

2. Sample

A sample is a random set of data drawn from the studied population.

A descriptive measure of a sample is called a statistic. Statistics are used to make inferences about
parameters.

3. Statistical inference

Statistical inference is the process of making an estimate, prediction, or decision about a population based on
sample data. However, such conclusions and estimates are not always going to be correct, so we build in a
measure of reliability:

• The confidence level is the proportion of times that an estimating procedure will be correct
• The significance level measures how frequently the conclusion will be wrong

1.2 Statistical Applications in Business


Statistical analysis plays a role in virtually all aspects of business and economics.
1.3 Large Real Data Sets
Textbook.

1.4 Statistics and the Computer


Microsoft Excel 2012 (Excel spreadsheets allow for ‘what if’ analyses) is used. See Appendix 1 for details on
Excel add-ins.

Chapter 2: Graphical Descriptive Techniques I


2.1 Types of Data and Information
Variable – some characteristic of a population or sample. E.g. the price of a stock is a variable as it changes daily. We
use uppercase letters to represent the name of a variable. Variables may be discrete (football scores) or continuous
(time remaining in a football game).

Values of the variable – are the possible observations of the variable. E.g. the values of a stock price are real
numbers usually measured in dollars and cents, ranging from 0 to hundreds of dollars.

Data – are the observed values of a variable. E.g. $0.70, $1.12, … these are the data from which we will extract the
info we seek. Data is the plural of datum.

3 types of data (not just sets of numbers):

• Interval data are real numbers such as heights, weights, incomes. Also called quantitative or numerical
data. Scores & time are quantitative.
• Values of nominal data are categories. E.g. questions to marital status (1 = single, 2 = married, 3 =
divorced, 4 = widowed). Any numbering system is valid (14 = single, 2 = married…). Nominal data are
also called qualitative or categorical. Gender is qualitative
• Ordinal data appear to be nominal, but the difference is that the order of their values has meaning. E.g.
ratings for a professor (1 = poor, 2 = fair, 3 = good, 4 = very good, 5 = excellent). Any numbering system in
ascending order is valid (6 = poor, 24 = fair, 33 = good…); the order of the values matters, not their
magnitude. To distinguish between ordinal and interval data: the intervals or differences between values
of interval data are consistent and meaningful (the difference between marks of 80 and 75 is 5, so we can
calculate differences and interpret the results), but not for ordinal data, whose codes only convey order.

Calculations for Types of Data

Interval data – all calculations permitted. Often describe interval data by calculating average.

Nominal data – because codes are arbitrary, we cannot perform calculations on these codes. Can only count or
compute %s of occurrences of each category (marital status example – count numbers of each category and
report frequency).
Ordinal data - most important aspect of this data is the order of the values. Only permissible calculations are
those involving a ranking process. E.g. median.

Hierarchy of Data

Data types can be ranked by the calculations that are permitted on them:

1. Interval
2. Ordinal
3. Nominal

Higher level data types may be treated as lower-level ones. E.g. at UNSW marks are converted to grades
(interval data is converted to ordinal data). But when we convert, we lose information. LOWER LEVEL DATA
TYPES CANNOT BE TREATED AS HIGHER LEVEL TYPES

Interval, Ordinal, and Nominal Variables

Variables take the same name as their type of data; e.g. interval data are observations of interval variables.

Problem Objectives and Information

In presenting the diff types of data: which statistical procedure do we use? And what type of info do we need
to produce from our data?

2.2 Describing a Set of Nominal Data


As we can only count frequency or compute % of occurrence of each value, we can summarise data in a table
called a frequency distribution (categories and their counts) with relative frequency distribution (%). Graphical
techniques can be used too: bar chart and pie chart. Page 20 excel explanation.

Bar and Pie Charts

Graphical techniques catch a reader’s eye more quickly. Bar chart often used to display frequencies while a pie
chart shows relative frequencies (proportions). Page 22 excel explanation.

Other Applications of Pie and Bar Charts

Practice in textbook.

Describing Ordinal Data

No specific graphical techniques. Consequently, we treat it as nominal and use the techniques above. The only
criterion is that the bars in bar charts should be arranged in ascending or descending ordinal values, and in pie
charts the wedges are arranged clockwise in ascending or descending order.

WE USE FREQUENCY AND RELATIVE FREQUENCY TABLES, AND BAR AND PIE CHARTS WHEN:

 THE OBJECTIVE IS TO DESCRIBE A SINGLE SET OF DATA


 THE DATA TYPE IS NOMINAL OR ORDINAL.
2.3 Describing the Relationship between Two Nominal Variables and Comparing
Two or More Nominal Data Sets
Techniques applied to single sets of data are called univariate. Bivariate methods are required when we wish
to depict the relationship between two variables. A cross-classification table (or cross-tabulation table) is used.

Tabular Methods of Describing the Relationship between Two Nominal Variables

Only use frequency or relative frequency of values. Page 34 excel explanation.

Graphing the Relationship between Two Nominal Variables

Page 35 excel explanation. If there is no relationship, the bar charts will be approximately the same; and vice versa.

Comparing Two or More Sets of Nominal Data

Page 37 example.

Data Formats

Page 38.

WE USE A CROSS-CLASSIFICATION TABLE WHEN:

 THE OBJECTIVE IS TO DESCRIBE THE RELATIONSHIP BETWEEN 2 VARIABLES AND COMPARE 2 OR MORE
SETS OF DATA
 DATA TYPE IS NOMINAL

Chapter 3: Graphical Descriptive Techniques II


3.1 Graphical Techniques to Describe a Set of Interval Data
Can use frequency distribution for interval data by creating a series of intervals called classes (e.g. from 0 to
15) that cover a range of observations. Intervals should be equally wide because it makes task of reading &
interpreting the graph easier.

Most important is the histogram (assuming data are quantitative & not qualitative). The bases are the intervals &
the heights are the frequencies. Define lower and upper class limits (need to be mutually exclusive & exhaustive). Page
47 excel explanation – there should be no gaps between bars.

Determining the Number of Class Intervals

Depends entirely on the no. of observations in the data set. More observations = larger no. of class intervals.

Number of Observations Number of Classes


<50 5-7
50-200 7-9
200-500 9-10
500-1000 10-11
1000-5000 11-13
5000-50000 13-17
>50000 17-20

Or Sturges’s formula: Number of class intervals = 1 + 3.3 log (n)

Class Interval Widths

Class width = (Largest observation – Smallest observation) / Number of classes

Often rounded to convenient value. It is more important to choose classes that are easy to interpret, the
above are guidelines only. First class interval must contain smallest observation.
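As a quick illustration, here is a minimal Python sketch of these two guidelines (the sample size and data range used are made up for illustration):

    import math

    def sturges_classes(n):
        # Sturges' formula: number of class intervals = 1 + 3.3 * log10(n)
        return round(1 + 3.3 * math.log10(n))

    def class_width(largest, smallest, num_classes):
        # Class width = (largest observation - smallest observation) / number of classes
        return (largest - smallest) / num_classes

    n = 200                                # hypothetical number of observations
    k = sturges_classes(n)                 # 1 + 3.3*log10(200) ≈ 8.6 -> 9 classes
    print(k, class_width(125.0, 5.0, k))   # width would then be rounded to a convenient value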

Shapes of Histograms

1. Symmetry – when a vertical line is drawn down the centre of the histogram, the two sides are identical
in shape & size.
2. Skewness – an asymmetric histogram, possibly with outliers:
• positively skewed – long tail extending to the right, e.g. income of employees in a large firm
• negatively skewed – long tail extending to the left, e.g. time to finish exams
3. Number of Modal Classes

Mode – observation that occurs with the greatest frequency.

A unimodal histogram has a single peak. A bimodal histogram has 2 peaks, not necessarily equal in height;
this may indicate 2 diff distributions.

4. Bell shape – special type of symmetric unimodal histogram.

Stem and Leaf Display

A drawback of histograms is that we lose potentially useful info by classifying the observations into classes.
The stem-and-leaf display overcomes this loss to some extent as we can see actual observations.

First step is to split each observation into 2 parts, a stem & a leaf. The length of each line represents the
frequency in the class interval defined by the stem. Page 58 excel explanation.
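A minimal Python sketch of building a stem-and-leaf display for two-digit data (the values are hypothetical):

    data = sorted([34, 36, 41, 41, 45, 52, 58, 63])    # hypothetical observations

    stems = {}
    for x in data:
        stems.setdefault(x // 10, []).append(x % 10)   # stem = tens digit, leaf = units digit

    for stem in sorted(stems):
        # the length of each printed line reflects the frequency in that class
        print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
    # 3 | 4 6
    # 4 | 1 1 5
    # 5 | 2 8
    # 6 | 3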

Ogive

Cumulative relative frequency distribution highlights the proportion of observations that lie below each of the
class limits.
The ogive is a graphical representation of the cumulative relative frequencies. Page 60 excel explanation.

WE USE HISTOGRAMS, OGIVES, OR STEM-AND-LEAF DISPLAYS WHEN:

 THE OBJECTIVE IS TO DESCRIBE A SINGLE SET OF DATA


 THE DATA TYPE IS INTERVAL

3.2 Describing Time-Series Data


Besides classifying data by type, we can classify according to whether observations are measured at the same
time (Cross sectional data are measurements at single point in time. E.g. observations at the same point in
time, looking at 100 recently sold homes) or whether they represent measurements at successive points in
time (Time series data refer to measurements at different points in time. E.g. observations taken over time,
looking at baby births by day for the past month).

Original data may be interval or nominal. A time series can also list the frequencies & relative frequencies of a
nominal variable over a number of time periods. E.g. a survey asks consumers to identify their favourite brand –
this is nominal data. If we repeat the survey once a month for several years, the proportion of consumers who prefer
a certain company’s product each month would constitute a time series.

Line Chart

Time series data graphically depicted on a line chart which is a plot of the variable (vertical axis) over time
(horizontal axis). Page 66 excel explanation.

3.3 Describing the Relationship between Two Interval Variables


Scatter diagram used for quantitative variables. We use X to label independent variable and Y to label
dependent variable. In cases with no dependency, we label variables arbitrarily. Page 75 excel explanation.

Contingency table (Cross-tabulation table) used for relationship between qualitative variables

Time series plot is a scatter diagram with one variable as time. Order matters.

Patterns of Scatter Diagrams

2 most important characteristics are strength and direction of linear relationship to verbally describe how 2
variables are related.

Linearity

Draw a straight line through the points. If most points fall close to the line, there is a moderate linear relationship;
if most points are scattered with only a semblance of a straight line, there is a weak linear relationship; if most
points fall on the line itself, the relationship is strong.

Least squares method used to draw an objective straight line. There’s also quadratic and exponential
relationships.

Clusters of points can also exist.


Direction

Positive linear relationship – when both variables increase

Negative linear relationship – when the variables move in opposite directions

No relationship – horizontal straight line

Interpreting a Strong Linear Relationship

If 2 variables are linearly related, it does not mean one is causing the other. Correlation is not causation!

We use a scatter diagram when:

 The objective is to describe the relationship between 2 variables


 The data type is interval

3.4 Art and Science of Graphical Presentations


Graphical excellence

Graphical excellence – term we apply to techniques that are informative & concise & that impart information
clearly to their viewers.

It is achieved when these characteristics apply:

• The graph represents large data sets concisely & coherently. Graphical techniques used to summarise
& describe large data sets while tables summarise small data sets.
• The ideas & concepts the statistics practitioner wants to deliver are clearly understood by the viewer.
Excellent chart is one that replaces thousands of words.
• The graph encourages the viewer to compare two or more variables. Graphs are best used to depict
relationships between variables or to explain how & why observed results occurred. Graphs displaying
1 variable provide little info.
• The display induces the viewer to address the substance of the data and not the form of the graph. The
form of the graph is supposed to help present the substance.
• There is no distortion of what the data reveal. Tells truth about data.

Graphical Deception

1. Graph without scale on one axis


2. Avoid being influenced by caption
3. Beware of percentage changes not being reported, and only absolute values being reported. E.g. a $1
drop in your $2 stock is more distressing than a $1 drop in your $200 stock.
4. Beware of size distortions by focusing on numerical values that the graph represents. Note the scales
on both axes. E.g. Stretching or compressing the horizontal/vertical axes alters perception.
Chapter 4: Numerical Descriptive Techniques (week 2)
4.1 Measures of Central Location
Arithmetic mean

3 different measures to describe centre of a set of data. Arithmetic mean is best known, also called mean or
average.

For sample:

 x1, x2, …, xn used to label observations, where x1 is the first observation etc.
 n is the sample size.
 the sample mean is denoted by x̄.

For population:

 number of observations is labeled N


 population mean denoted by μ

Population mean: μ = (Σ xᵢ) / N (summing over i = 1 to N)

Sample mean: x̄ = (Σ xᵢ) / n (summing over i = 1 to n)

Page 100 excel explanation.

Median

2nd most popular measure of central location.

Median – calculated by placing all observations in ascending/descending order. The observation in the middle is the
median. Sample & population medians are computed the same way. If there is an even no. of observations, the median
is the average of the middle 2 observations. Page 101 excel explanation.

Mode

Mode – observation/s that occur with the greatest frequency. The statistic & parameter are computed in the same way.

For populations & large samples, preferred to report modal class. Using mode in a small sample may not be a
good measure for central location and it may not be unique. Page 102 excel.
Mean, Median, Mode: Which is Best?

Mean usually first selection. Mode seldom chosen.

Several circumstances where median is better:

• Median not as sensitive to extreme values (outliers) as mean is, so it produces a better measure when
there’s a relatively small no. of extreme observations (either very small or large, but not both).

For symmetric distributions: mean=median, positively skewed: mean > median, negatively skewed: mean <
median.

Measures of Central Location for Ordinal and Nominal Data

When data are interval, we can use all 3 measures.

However, for ordinal & nominal data, calculation of mean not valid.

Median used for ordinal data; mode appropriate for nominal data. However, because nominal data do not have a
“centre”, the mode of nominal data is not really a measure of central location.

Geometric mean

When variable is growth rate or rate of change (value of investment over time) we need to use geometric
mean. It is used to find “average” growth rate or rate of change over time.

1. Use mean to find central location for single set of interval data
2. Use median to find central location for single set of ordinal or interval (with extreme observations)
data.
3. Use mode to describe a single set of nominal, ordinal, or interval data.
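A minimal Python sketch of these choices using the standard library (the data are made up; note how the outlier pulls the mean but not the median):

    import statistics

    data = [7, 8, 8, 9, 12, 13, 13, 13, 40]       # hypothetical data with one outlier

    print(statistics.mean(data))                  # pulled upward by the outlier (40)
    print(statistics.median(data))                # 12: middle observation, robust to the outlier
    print(statistics.mode(data))                  # 13: value with the greatest frequency

    # geometric mean for an "average" rate of change over time
    growth_factors = [1.10, 1.25, 0.95]           # hypothetical +10%, +25%, -5%
    print(statistics.geometric_mean(growth_factors) - 1)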

4.2 Measures of Variability


Spread of data.

Range

Range = largest observation – smallest observation

Too simple, doesn’t tell us much about spread.

Variance

Variance and its related measure, standard deviation most important statistics.
Population variance: σ² = Σ (xᵢ − μ)² / N (summing over i = 1 to N)

Sample variance: s² = Σ (xᵢ − x̄)² / (n − 1) (summing over i = 1 to n)

For sample variance we divide by “n-1” because it is a better estimator than dividing by “n”

s² is used because the negative and positive deviations cancel each other out, giving us 0; squaring avoids this.
We could also use the absolute value of the deviations (difference between observation & mean) instead of
squaring – called the mean absolute deviation (MAD).

Squaring the deviations means the units are also squared! Page 111 excel.

Interpreting the Variance

Larger variance = more variation. Difficult to interpret because units are squared.

Standard deviation

Standard deviation is spread measured in original units. Simply positive square root of variance; uses original
units.

Population standard deviation: σ = √σ²

Sample standard deviation: s = √s²

Coefficient of Variation

Population coefficient of variation: CV = σ / μ

Sample coefficient of variation: cv = s / x̄

A standard deviation of 10 is small for observations in the millions, but big for observations less than 50.
The standard deviation depends on the magnitude of the data. Use the coefficient of variation to compare across
different-sized populations & samples.

VARIABILITY CALCULATIONS WORK ON INTERVAL DATA ONLY!!
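A minimal Python sketch of these variability measures (hypothetical interval data; note the n − 1 divisor in the sample versions):

    import statistics

    sample = [4, 7, 8, 10, 16]                 # hypothetical interval data

    s2 = statistics.variance(sample)           # sample variance: divides by n - 1
    s = statistics.stdev(sample)               # sample standard deviation = sqrt(s2), original units
    cv = s / statistics.mean(sample)           # sample coefficient of variation

    sigma2 = statistics.pvariance(sample)      # population version: divides by N
    print(s2, s, cv, sigma2)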

4.3 Measures of Relative Standing and Box Plots


Measures of relative standing provide info about position of particular values relative to the entire set. Median
is one measure of relative standing.

Percentile – Pth percentile is value for which P percent are less than that value and (100-P)% are greater than
that value.

Quartiles – the 25th (Q1), 50th (Q2) and 75th (Q3) percentiles. Also quintiles and deciles.

Locating Percentiles

L_P = (n + 1) × P / 100

Where L P is the location of the Pth percentile & n is the no. of data. This formula allows us to approximate
location of the Pth percentile. Page 119 excel.
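A minimal Python sketch of the location formula (the data are hypothetical; the fractional location is interpolated between the two neighbouring ordered values):

    def percentile_location(n, p):
        # L_P = (n + 1) * P / 100
        return (n + 1) * p / 100

    data = sorted([0, 0, 5, 7, 8, 9, 12, 14, 22, 33])   # n = 10 hypothetical observations

    loc = percentile_location(len(data), 25)   # L_25 = 11 * 25 / 100 = 2.75
    lo, hi = data[1], data[2]                  # 2nd and 3rd ordered values (0-based indexing)
    q1 = lo + (loc - 2) * (hi - lo)            # go 75% of the way from the 2nd to the 3rd value
    print(loc, q1)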

Quartiles give an idea of the histogram’s shape: if the 1st & 2nd quartiles are closer together than the 2nd & 3rd,
the distribution is positively skewed (and vice versa); if the distances are equal, it is approx. symmetric.

Interquartile Range

Interquartile range = upper quartile – lower quartile; measures the spread of the middle 50% of observations.

Q3 – Q1 measures the middle 50% of observations.

A large value means the 1st & 3rd quartiles are far apart, indicating a high level of variability.

Box Plots

Useful for comparing 2 or more data sets.

Graphs 5 statistics: min and max observations & 1st, 2nd & 3rd quartiles.

The 3 vertical lines of the box are the quartiles; the extended lines are whiskers. Points outside the whiskers are
outliers. Whiskers extend up to 1.5 times the interquartile range from the box on both sides. Page 121 excel.

Measures of Relative Standing and Variability for Ordinal Data

No measures of variability for nominal data.

Percentiles & quartiles can be used to measure relative standing for interval and ordinal data.
Interquartile range used to measure variability of interval and ordinal data.

4.4 Measures of Linear Relationship


3 numerical measures of linear relationship:

Covariance

Population covariance: σxy = Σ (xᵢ − μx)(yᵢ − μy) / N (summing over i = 1 to N)

Sample covariance: sxy = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1) (summing over i = 1 to n)

X and y are variables.

Generally when 2 variables move in the same direction (both ↑ or ↓), the covariance will be a large +ve no.

If they move in opposite directions, covariance is a large –ve no.

No pattern = covariance small no.

Sign of covariance tells us nature of linear relationship:

• Positive (negative) covariance = positive (negative) linear association


• 0 covariance = no linear association

Magnitude of covariance describes strength of association, but it’s difficult to judge w/o additional stats.

Coefficient of Correlation

Coefficient of correlation is defined as the covariance divided by the standard deviations of the variables.

Population correlation: ρ = σxy / (σx σy)

Sample correlation: r = sxy / (sx sy)

−1 ≤ ρ ≤ 1 and −1 ≤ r ≤ 1

The adv of this over covariance is that it has fixed lower & upper limits: −1 is a perfect −ve linear relationship (the
scatter diagram is a straight line sloping down), +1 is a perfect +ve linear relationship, and 0 means no linear
relationship. The rest of the values are judged in relation to these.
However, except for −1, 0, +1, we cannot interpret the correlation precisely; we can only get a rough idea.
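A minimal sketch with numpy (assuming it is available; the data are made up):

    import numpy as np

    x = np.array([2, 4, 6, 8, 10], dtype=float)   # hypothetical interval data
    y = np.array([1, 5, 4, 9, 12], dtype=float)

    s_xy = np.cov(x, y, ddof=1)[0, 1]             # sample covariance (n - 1 divisor)
    r = np.corrcoef(x, y)[0, 1]                   # coefficient of correlation, always in [-1, 1]
    print(s_xy, r)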

Comparing the Scatter Diagram, Covariance, and Coefficient of Correlation

Scatter diagram depicts relationship graphically; covariance & coefficient of correlation describe linear
relationship numerically.

Least Squares Method

Objective method of drawing a straight line through scatter diagram to determine linear relationship.

Line equation: ŷ = b0 + b1x, chosen so the sum of squared deviations between points & line is minimised.

b0 is the y-intercept, b1 is the slope, and ŷ is the value of y determined by the line.


Minimise: Σ (yᵢ − ŷᵢ)² (summing over i = 1 to n)

Least squares coefficients:

b1 = sxy / sx²    b0 = ȳ − b1x̄

Page 135 excel

Note:

• The point of the means (x̄, ȳ) will lie on the line of best fit
• b1 will have the same sign as the covariance (correlation) between y & x
• Zero covariance (correlation) means b1 = 0, i.e. a horizontal line

Compute covariance, coefficient of correlation, coefficient of determination, and least squares line to describe
relationship between 2 variables for interval data.
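A minimal least squares sketch continuing the numpy example above (hypothetical data):

    import numpy as np

    x = np.array([2, 4, 6, 8, 10], dtype=float)
    y = np.array([1, 5, 4, 9, 12], dtype=float)

    s_xy = np.cov(x, y, ddof=1)[0, 1]
    b1 = s_xy / np.var(x, ddof=1)         # slope: b1 = s_xy / s_x^2
    b0 = y.mean() - b1 * x.mean()         # intercept: the point of the means lies on the line
    print(b0, b1)

    print(np.polyfit(x, y, 1))            # cross-check: returns [b1, b0]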

4.8 General Guidelines for Exploring Data


Purpose of applying graphical & numerical techniques is to describe & summarise data. Graphical often used
first as it gives shape of distribution.

Chapter 5: Data Collection and Sampling (week 3)


5.1 Methods of Collecting Data
Usually find secondary data – data collected by someone else, possibly for another purpose (e.g. the ABS).
Or collect primary data – data collected by you (e.g. market researchers calling people to determine the impact of ads).

Direct observation

Simplest method. When data gathered this way, data said to be observational – measures actual behaviours
or outcome.

Select sample of men & women & ask if they take aspirin regularly over past 2 years & if they suffered any
heart attacks last 2 years.

However it is difficult to produce useful info this way. If people who take aspirin suffer fewer heart attacks, can we
conclude aspirin is effective? Maybe people who take aspirin are simply more health conscious.

Adv is it’s relatively inexpensive.

Experiments

More expensive, but better way to produce data. Data produced this way are called experimental – imposes a
treatment & measures resultant behaviour or outcomes.

Select a random sample, split into 2 groups; 1 takes aspirin regularly while the other doesn’t. After 2 years, the stats
practitioner determines the proportion of people in each group who suffered heart attacks.

Surveys

Important aspect of surveys is the response rate – the proportion of all people selected who complete the
survey. A low response rate can destroy the validity of any conclusion.

1. Personal interview – many feel it is the best way to survey people. High expected response rate & fewer
incorrect responses resulting from misunderstanding. The interviewer must not say too much, to avoid
biasing the response. However, expensive, especially when travel is involved.
2. Telephone interview – less expensive, but less personal & lower expected response rate.
3. Self-administered survey – mailed to sample of ppl. Inexpensive method & therefore attractive when
sample is large. However, lower response rate & high incorrect responses due to misunderstanding
questions.

Questionnaire design:

1. Short as possible so higher chance ppl will do it


2. Questions themselves should be short, simple and clear.
3. Start with simple demographic q’s to get ppl started
4. Dichotomous q’s (2 answers like yes or no) useful because simple but sometimes neither is answer.
5. Open ended q’s allow responders to express opinions more fully but time consuming & harder to
analyse
6. Avoid leading q’s as persuade ppl
7. Test questionnaire on small no. of ppl to detect ambiguities
8. When preparing q’s think about how you would tabulate responses.
Stages of statistical analysis

1. Define & understand the problem (e.g. a firm wants to determine the effectiveness of its advertising)
2. Collect data
3. Analyse data: a) use sample statistics to describe the problem (what is a typical customer? what % of
customers recall the ads?) b) extract info about population parameters from sample stats
4. Communicate results

5.2 Sampling
Chief motive for using sample rather than population is cost.

Target population – the population about which we want to draw inferences

Sampled population – actual population from which the sample has been taken

Self-selected sample – almost always biased because individuals who participate in them are more keenly
interested in the issue than are the other members of the population.

5.3 Sampling Plans


Simple Random Sampling

Simple Random Sampling – a sample selected in such a way that every possible sample of the same size is
equally likely to be chosen. Like a raffle: number the population 1 to n, put the numbers in a hat and draw.
Sometimes inappropriate, e.g. when people have more than 1 phone no. Pg 168 excel.

Low cost, avoids problems of bias where design of sample systematically favours certain outcomes.

Stratified Random Sampling

A Stratified Random Sample – is obtained by separating the population into mutually exclusive sets, or
strata, and then drawing simple random samples from each stratum. E.g. female or male, age strata etc.
Only stratify when there’s a connection between the survey & the strata, e.g. who favours a tax ↑ and we select
random samples from diff income groups; no point selecting from diff religious beliefs.

Advantage is that besides acquiring info about the entire population, we can also make inferences within each stratum
or compare strata, e.g. compare the highest & lowest Y (income) groups to see if they differ in their support for a tax ↑.

Gets more info than simple random sampling.

Can draw random samples from each Y group according to their proportion in population.

Cluster Sampling

Cluster Sample – is a simple random sample of groups or clusters of elements.


Useful when it is difficult or costly to develop a complete list of population members (making it difficult to
generate a simple random sample). Also useful when population elements are widely dispersed
geographically.

Say we want to estimate average annual household Y in a large city. With simple random sampling we would need a
complete list of households in the city from which to sample. With stratified random sampling we would need the list
of households & to categorise each by some other variable (like age) to develop strata. A less expensive alternative is
to let each block within the city represent a cluster. A sample of clusters is then randomly selected & every household
within these clusters questioned. This ↓ the distance a surveyor must cover to gather data.

But this ↑ sampling error as households belonging to same cluster likely to have similar Y (rich vs poor
streets).

Sample Size

Whatever sampling plan, need to decide sample size. Large sample size = more accurate.

5.4 Sampling and Non-sampling Errors


Sampling error

Sampling error – refers to differences between the sample & the population that exist only because of the
observations that happened to be selected for the sample. Expect it to occur whenever we make a statement
about a population that is based only on the observations contained in a sample. This error is not a mistake;
it is a cost of sampling.

E.g. determine the mean Y of US blue-collar workers. To determine the parameter we would have to ask every US
blue-collar worker’s Y & calculate the mean – expensive & impractical. Instead use statistical inference: record the Y
of a sample & find its mean. But the values will differ, as the value of the sample mean depends on which incomes
happened to be selected. The difference between the true value of the population mean & the sample mean is the
sampling error.

To ↓ this error, take a larger sample.

Non-sampling error

Non-sampling error – results from mistakes made in the acquisition of data or from the sample observations
being selected improperly. More serious than sampling error, as taking a larger sample won’t ↓ the size of the error.

1. Errors in data acquisition: arises from recording of incorrect responses possibly from incorrect
measurements taken due to faulty equipment, mistakes made during transcription from primary
sources, inaccurate recording of data because terms were misinterpreted or purposely give inaccurate
responses from sensitive issues like sexual activity.
2. Non-response error: error or bias introduced when responses are not obtained from some members of
the sample. When this happens, the sample observations may not be representative of the target population,
resulting in biased results. May occur when we can’t contact a person listed in the sample, or a person may
refuse to respond to question(s). This problem is greater when self-administered questionnaires are used
rather than an interviewer.
3. Selection bias: occurs when sampling plan is such that some members of the target population cannot
possibly be selected for inclusion in the sample.

Chapter 6: Probability
6.1 Assigning Probability to Events
Probability – a mathematical means of studying uncertainty.

Random experiment – an action or process that leads to one of several possible outcomes. Experiment: flip a
coin, outcomes: heads or tails

First step in assigning probabilities is to produce a list of outcomes. The list must be exhaustive – all possible
outcomes must be included – and mutually exclusive (uses OR) – no 2 outcomes can occur at the same time (AND
for not mutually exclusive). E.g. flip a coin, outcomes: heads – this is not exhaustive, as tails is omitted. Ranges
of marks 0-10, 10-20 etc. are not mutually exclusive, as getting 10 means you’re in both outcomes.

Sample space (S) of a random experiment – the list of all possible outcomes of the experiment. Outcomes must be
exhaustive & mutually exclusive. The outcomes are denoted O1, O2, …, Ok.

Once a sample space has been prepared, we assign probabilities to outcomes. 2 rules governing probabilities:

1. The probability of any outcome must lie between 0 & 1: 0 ≤ P(Oi) ≤ 1, where P(Oi) is the probability of
outcome i.
2. Sum of probabilities of all outcomes in sample space must be 1.

Three approaches to assigning probabilities

Classical approach – determine probability associated with games of chance. Probability of heads & tails in a
coin flip are equal, so 50% each.

Relative frequency approach – defines probability as the long-run relative frequency with which an outcome
occurs. E.g. of 1000 students, 200 got an A; the relative frequency of A’s is therefore 200/1000 = 20%. This figure is
an estimate of the probability of getting an A; we would have to observe an infinite no. of grades to determine the
exact probability.

Subjective approach – defines probability as the degree of belief that we hold in the occurrence of an event. Used
when we can’t use the classical approach & there’s no history of outcomes for the relative frequency approach. E.g. an
investor wants to know the probability a particular stock will ↑ in value; using this approach, the investor analyses a
no. of factors associated with the stock & the stock market & assigns a probability to the outcomes of interest.

Defining events

Simple event – an individual outcome of a sample space. All other events are composed of simple events in the
sample space.
Event – a collection or set of one or more simple events in a sample space. E.g. the event of achieving an A is the set
of marks that lie between 80 and 100 inclusive.

Probability of events

Probability of an event – sum of the probabilities of the simple events that constitute the event.

Interpreting probability

No matter what method is used to assign probability, we interpret it using the relative frequency approach over an
infinite no. of experiments. E.g. using the subjective approach we find there’s a 65% chance the value of a stock will ↑
next month. We interpret this as: if we had an infinite no. of stocks with exactly the same economic & market
characteristics, 65% of them would ↑ in price in a month’s time.

Relative frequency approach useful to interpret probability statements such as weather forecasts.

6.2 Joint, Marginal, and Conditional Probability


Intersection

Intersection of events A & B – the event that occurs when both A & B occur. Denoted as A and B.

Joint probability – the probability of the intersection.

Marginal probability

Marginal probabilities – computed by adding across rows or down columns of a table of joint probabilities; so
named because they are calculated in the margins of the table. The joint probabilities P(A and B) in the body of the
table involve both variables.

Conditional probability

Conditional probability – the probability of one event given the occurrence of another event. E.g. the probability
that a fund managed by a graduate of a top-20 MBA program will outperform the market.

The probability of event A given event B:

P(A|B) = P(A and B) / P(B)

Similarly the probability of event B given event A

P (B |A) = P(A and B) / P(A)

| = given

Independence

Independent event – if probability of one event is not affected by the occurrence of the other event.

2 events are said to be independent if:

P(A|B) = P(A)
or

P(B|A) = P(B)

Union

Union of events A and B is the event that occurs when either A or B or both occur. It is denoted as A or B.

6.3 Probability Rules and Trees


3 rules that enable us to calculate the probability of more complex events from the probability of simpler
events:

Complement rule

Complement of A – the event that occurs when A does not occur, denoted A^C. An event and its complement sum to 1:

P(A^C) = 1 – P(A)

Multiplication rule

Multiplication rule used to calculate joint probability of 2 events. Based on formula for conditional probability:

P(A|B) = P(A and B)/P(B)

Derive multiplication rules by multiplying both sides by P (B).

Joint probability of any 2 events A and B is

P (A and B) = P (B) P (A|B)

Or altering notation

P (A and B) = P (A) P (B|A)

If A and B are independent events, P (A|B) = P (A) & P (B|A) = P (B). Joint probability of 2 independent events
is product of probabilities of 2 events.

Multiplication rule for independent events:

P (A and B) = P (A) P (B)

Addition Rule

Addition rule enables us to calculate probability of the union of 2 events.

Probability that event A or B or both occur is:

P (A or B) = P (A) + P (B) – P (A and B).

Addition rule for mutually exclusive events:


P (A or B) = P (A) + P (B).
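A minimal Python sketch of the three rules on hypothetical probabilities:

    # hypothetical probabilities for two events A and B
    p_a, p_b, p_a_and_b = 0.40, 0.25, 0.11

    p_not_a = 1 - p_a                            # complement rule
    p_a_given_b = p_a_and_b / p_b                # conditional probability P(A|B)
    p_a_or_b = p_a + p_b - p_a_and_b             # addition rule
    independent = abs(p_a_given_b - p_a) < 1e-9  # independence check: P(A|B) = P(A)?
    print(p_not_a, p_a_given_b, p_a_or_b, independent)   # 0.6 0.44 0.54 False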

Probability Trees

Effective & simpler method of applying probability rules is the probability tree, wherein the events in an
experiment are represented by lines.

Advantage is that it restrains users from making wrong calculations. The joint probabilities at the ends of the
branches must sum to 1, as all possible events are listed.

Chapter 7: Random Variables and Discrete Probability Distributions (week 4)
7.1 Random Variables and Probability Distributions
Random variable – a function or rule that assigns a number to each outcome of an experiment. Often labeled
as X.

2 types of random variables:

Discrete random variable – one that can take on a countable number of values. E.g. number of heads observed
when flipping a coin 10 times. X = 0 or 1 or 2 or…or 10.

Continuous random variable – one whose values are uncountable. E.g. time to complete a task. Say X = time to
do a 3hr test, where no one can leave before the 30min mark. The smallest value of X is 30, but what is the next
value – 30.1, 30.01, etc.? We can’t say: there is always a number larger than 30 & smaller than 30.01, so the
possible values cannot be listed.

Probability distribution – table, formula, or graph that describes values of a random variable & probability
associated with these values.

An uppercase letter like X represents the name of a random variable; its lowercase counterpart represents a value of
the random variable. Thus the probability that the random variable X will equal x is written

P(X = x)

Or more simply

P(x).

Discrete probability distributions

Requirements for a distribution of a discrete random variable:

1. 0 ≤ P(x) ≤ 1 for all x


2. Σ P(x) = 1 for all x
Probability distributions and populations

Probability distributions often represent populations.

Describing the population/probability distribution

For infinite populations, we can’t calculate the mean by summing all values. Instead we list the values & their
associated probabilities and use them to compute the mean & variance of the population.

Population mean – weighted average of all its values. Weights are probabilities. Parameter also called
expected value of X & represented by E(X) → used when population is infinite.

E(X) = μ = Σ x·P(X = x), summing over all x → the mean of a rv (used when the population is infinite).

Population variance – weighted average of squared deviations from the mean:

Var(X) = σ² = Σ (x − μ)²·P(X = x), summing over all x

Short cut calculation for population variance:

V(X) = σ² = [Σ x²P(x)] – μ²

Population standard deviation:

σ = √σ²

Laws of expected value

1. E(c) = c
2. E(X + c) = E(X) + c
3. E(cX) = cE(X)

X is the random variable & c is a constant.

Laws of Variance

1. V(c) = 0
2. V(X + c) = V(X)
3. V(cX) = c²V(X)
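A minimal Python sketch computing E(X) and V(X) for a fair die and checking one law of each kind:

    # probability distribution of X = number shown on one throw of a fair die
    values = [1, 2, 3, 4, 5, 6]
    probs = [1 / 6] * 6

    mu = sum(x * p for x, p in zip(values, probs))                # E(X) = 3.5
    var = sum((x - mu) ** 2 * p for x, p in zip(values, probs))   # V(X) ≈ 2.92

    c = 2
    mu_c = sum(c * x * p for x, p in zip(values, probs))          # E(cX)
    print(mu, var, mu_c == c * mu)                                # law: E(cX) = cE(X)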

7.2 Bivariate Distributions


Bivariate Distributions – provides probabilities of 2 variables.

Bivariate (joint) probability distribution of X & Y – table or formula that lists joint probabilities for all pairs of
values of X & Y.
Requirements for a discrete bivariate distribution:

1. 0 ≤ P(x, y) ≤ 1 for all pairs of values (x, y)


2. Σ ΣP(x, y) = 1

Marginal probabilities

Summing across rows & down columns.

Describing the Bivariate distribution

Describe by computing mean, variance & standard deviation of each variable.

Covariance

COV(X, Y) = σxy = Σ Σ (x – μX)(y – μY) P(x, y)

Simplified to:

COV(X, Y) = σxy = Σ Σ x y P(x, y) – μX μY


Coefficient of correlation

ρ = σxy / (σx σy)
Sum of 2 variables

Laws of expected value & variances of the sum of 2 variables

1. E(X + Y)= E(X) + E(Y)


2. V(X + Y) = V(X) + V(Y) + 2COV(X, Y)
3. If X & Y are independent, COV(X, Y) = 0 & thus V(X+Y) = V(X) + V(Y)

7.4 Binomial Distribution (week 5)


Binomial distribution – result of a binomial experiment which has following properties

• Consists of a fixed no. of trials (n)


• Each trial has 2 possible outcomes. One is success & other is labeled failure.
• Probability of success = p, failure = 1 – p
• Trials are independent; outcome of 1 trial doesn’t affect another.

If properties 2, 3 & 4 are satisfied = Bernoulli process. Adding property 1 = binomial experiment. Random
variable of a binomial experiment is defined as the no. of successes in ‘n’ trials = binomial random variable.

Have a sequence of Bernoulli random variables: X1, X2, …, Xn where Xi = 1 if success & Xi = 0 for a failure

Under assumptions made this is a sequence of independent & identically distributed (iid) random variables.
E.g. If n = 5, one possible outcome of the sequence of trials is 0, 1, 1, 0, 1, another is 1, 1, 0, 0, 0 etc.
Binomial random variable

The random variable of a binomial experiment is the no. of successes in n trials, called the binomial random variable.
It can take values 0, 1, 2, …, n; thus the random variable is discrete.

Often interested in random variables constructed from other random variables. Consider random variable
formed by summing these ‘n’ Bernoulli random variables

X = X1 + X2 + … + Xn

X represents number of success in n trials & is called a binomial random variable. Characterised by 2
parameters n & p.

n = the number of trials (times doing the action)

p = the probability of success (getting what you want) on each trial

Once we know the parameter values we know everything about the random variable & its probability
distribution.

In the gender composition example we can calculate P(MMF) = P(X1=1, X2=1, X3=0) = p·p·(1−p) = p²(1−p).

But there are 3 ways to obtain 2 males (the others being MFM or FMM), so

P(2 Males) = P(X = 2) = 3p²(1−p).

The coefficient of 3 on the probability is exactly the total number of combinations when choosing 2 from 3 (3C2).

nCx = n! / (x!(n − x)!)

The probability of x successes in a binomial experiment with n trials & probability of success p is:

P(X = x) ≡ P(x) = [n! / (x!(n − x)!)] · p^x (1 − p)^(n − x), for x = 0, 1, …, n

Cumbersome to evaluate – use binomial table.

- P(X = 5) is a binomial probability


- P(X ≤ 1) is a cumulative binomial probability
- P(X > 1) is called a survivor probability

Cumulative probability

P(X ≤ x) is a cumulative probability. The formula of the binomial distribution allows us to determine the probability
that X equals certain values.
The cumulative density function (cdf) is defined as F(x) = P(X ≤ x). E.g.:

F(x) = 0 for x < 0; F(x) = x/60 for 0 ≤ x ≤ 60; F(x) = 1 for x > 60.

The cdf is often called the distribution function of X. [Figure: F(x) rises from 0 at x = 0 to 1 at x = 60.]
Binomial table

P(X ≥ x) = 1 – P(X ≤ [x – 1])

P(x) = P(X ≤ x) – P(X ≤ [x-1])

Binomial Probability P(X = k)

                           p
n    k    0.05    0.10    0.15    0.20    0.25    0.90
2    0    0.9025  0.8100  0.7225  0.6400  0.5625  0.0100
     1    0.0950  0.1800  0.2550  0.3200  0.3750  0.1800
     2    0.0025  0.0100  0.0225  0.0400  0.0625  0.8100

3    0    0.8574  0.7290  0.6141  0.5120  0.4219  0.0010
     1    0.1354  0.2430  0.3251  0.3840  0.4219  0.0270
     2    0.0071  0.0270  0.0574  0.0960  0.1406  0.2430
     3    0.0001  0.0010  0.0034  0.0080  0.0156  0.7290

10   0    0.5987  0.3487  0.1969  0.1074  0.0563  0.0000
     1    0.3151  0.3874  0.3474  0.2684  0.1877  0.0000
     2    0.0746  0.1937  0.2759  0.3020  0.2816  0.0000
     3    0.0105  0.0574  0.1298  0.2013  0.2503  0.0000
     4    0.0010  0.0112  0.0401  0.0881  0.1460  0.0001
     5    0.0001  0.0015  0.0085  0.0264  0.0584  0.0015
     6    0.0000  0.0001  0.0012  0.0055  0.0162  0.0112
     7    0.0000  0.0000  0.0001  0.0008  0.0031  0.0574
     8    0.0000  0.0000  0.0000  0.0001  0.0004  0.1937

For P(X=5) when n=10, p=0.2:

• Need column labeled p=0.2 in block of values for n=10


• Now consider row labeled k=5 → 0.0264 as before

Similarly P(X ≤ 1) could be verified from table of individual probabilities by summing as before or directly from
table of cumulative probabilities.
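A minimal sketch with scipy (assuming it is available) that reproduces the table entries instead of looking them up:

    from scipy.stats import binom

    n, p = 10, 0.2
    print(binom.pmf(5, n, p))        # P(X = 5) ≈ 0.0264, as in the table
    print(binom.cdf(1, n, p))        # cumulative probability P(X <= 1)
    print(1 - binom.cdf(1, n, p))    # survivor probability P(X > 1)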
The distribution of each Xi is given by: P(Xi = 0) = 1 − p, P(Xi = 1) = p.

⇒ E(Xi) = 0 × (1 − p) + 1 × p = p
Var(Xi) = (0 − p)²(1 − p) + (1 − p)²·p = p(1 − p)(p + 1 − p) = p(1 − p)

Thus
E(X) = E(Σ Xi) = Σ E(Xi) = np
Var(X) = Var(Σ Xi) = Σ Var(Xi) = np(1 − p)

Mean and Variance of a binomial distribution

μ = np and σ = √(np(1 − p))

Chapter 8: Continuous Probability Distributions


8.1 Probability Density Functions
A continuous random variable is one that can assume an uncountable no. of values – completely diff from a
discrete variable. We cannot list the possible values, & because there is an infinite no. of values, the probability of
each individual value is virtually 0. Consequently, we can only determine the probability of a range of values.

The area in all the rectangles adds to 1. We achieve this by dividing each relative frequency by the width of the
interval. The result is a rectangle over each interval whose area = the probability that the random variable will fall
into that interval.

If histogram drawn w/ large no. of small intervals, can smooth edges of rectangles to produce smooth curve.
Function that approximates curve is called a probability density function (pdf) - this is a continuous version of
a probability histogram used for discrete rv’s.

Probabilities are now represented by areas under the pdf

Requirements for a probability density function f(x) whose range is a ≤ x ≤ b:

1. f(x) ≥ 0 for all x between a & b

2. The total area under the curve between a & b is 1.

Uniform Distribution

Uniform probability distribution is also called rectangular probability distribution.

Uniform distribution described by the function:

f(x) = 1 / (b − a) where a ≤ x ≤ b

f(x) = height. To find probability of any interval, find area under curve.

Probability that a continuous random variable equals any individual value is 0.
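A minimal sketch of uniform probabilities as rectangle areas (a and b are hypothetical):

    # X ~ Uniform(a, b): f(x) = 1/(b - a); P(x1 < X < x2) = (x2 - x1) * height
    a, b = 0.0, 60.0
    height = 1 / (b - a)

    x1, x2 = 20.0, 35.0
    print((x2 - x1) * height)    # P(20 < X < 35) = 0.25
    print(0 * height)            # P(X = 25) = 0: a single value has no area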

Using a continuous distribution to approximate a discrete distribution

Use a continuous distribution to approximate a discrete one when no. of values is countable but large. E.g. no.
of values of weekly income.

There are many other common distributions that can be used to represent or model real phenomena

- The Poisson is another discrete probability distribution


- Binomial – number of successes in sequence of trials
- Poisson – number of successes in a period of time or region of space
- Refers to a count variable (0, 1, 2, …) where successes are relatively rare. Number of times an
individual visited a GP in last year.

For discrete random variables we assigned probabilities to different outcomes

- In binomial model X = 0, 1, 2, …, n (n finite)


- In Poisson model X = 0, 1, 2, ... (said to be countably infinite) but associated rv remains discrete

Chapter 8: Continuous Probability Distributions (week 6)

8.2 Normal Distribution


Most important of all probability distributions because of its crucial role in statistical inference. In practice
heights of men & women are well-approximated by normal distributions.

Normal density function:

Probability density function of a normal random variable:

f(x) = (1 / (σ√(2π))) · e^(−½((x − μ)/σ)²), for −∞ < x < ∞
It is a continuous rv

- P(X = x) = 0
- P(a < X < b) = area under the curve (pdf) between a & b

The pdf is unimodal, symmetric about its mean & bell shaped. The rv ranges over −∞ < x < ∞.

Normal distribution described by 2 parameters: its mean μ & standard deviation σ.

Increasing mean = shifts to the right. Vice versa. Large standard deviations = widen curves. Vice versa

If X is normally distributed with mean μ and standard deviation σ, then we write

X ~ N(μ, σ²)

Calculating normal probabilities

To calculate the probability that a normal rv falls into any interval, we compute the area under the curve in that
interval. We would need a new table for each pair of values of μ & σ to find normal probabilities. But by standardising the rv via

Z = (X − μ) / σ

only need 1 table: standard normal probability table (lists cumulative probabilities). When normal variable
transformed, it’s called standard normal random variable.

In general this standardisation yields a rv with 0 mean & standard deviation of 1.

Z & X are related by

Z = a + bX (where a = −μ/σ and b = 1/σ)

So Z is a linear function (linear transformation) of X.


E.g. to find P(X < 1100): after standardising X, it becomes P[(X – μ)/σ < (1100 – 1000)/100] = P(Z < 1.00), as any
operation performed on X must also be performed on 1100. The area under the curve remains the same.

A value of Z = 1 corresponds to a value of X that is 1 standard deviation above the mean.

Finding values of Z

Use the notation Z_A for the value of Z for which the area to its right under the standard normal curve is A; finding
it requires using the standard normal table backward.

Z0.025 = 1.96 can be translated to P(Z > 1.96) = 0.025: the area to the right of Z0.025 is 0.025.

−Z0.025 has the same area to its left.

Finding Z0.05 = finding the value of a standard normal random variable such that the probability that the rv exceeds
it is 5%.

Z A and percentiles

Values of Z_A are the 100(1 – A)th percentiles of a standard normal rv.

Z.05 = 1.645 means that 1.645 is the 95th percentile: 95% of all values of Z are below it, 5% above it.
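A minimal sketch using scipy's standard normal cdf and its inverse (assuming scipy is available):

    from scipy.stats import norm

    # P(X < 1100) for X ~ N(1000, 100^2): standardise, then use the cdf
    z = (1100 - 1000) / 100
    print(norm.cdf(z))           # P(Z < 1.00) ≈ 0.8413

    # Z_A values: the table used "backward" via the inverse cdf
    print(norm.ppf(1 - 0.025))   # Z_0.025 = 1.96
    print(norm.ppf(1 - 0.05))    # Z_0.05 = 1.645, the 95th percentile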

Chapter 9: Sampling Distributions


9.2 Sampling Distribution of a Proportion
The estimator of a population proportion of successes is the sample proportion; that is, we count the no. of
successes in a sample & compute

p̂ = X / n

where X is the no. of successes & n is the sample size. A sample of size n means we have a binomial experiment,
hence X is binomially distributed. Thus the probability of any value of p̂ can be calculated from the corresponding
value of X.

E.g. P(p̂ ≤ .50) = P(X ≤ .50n) = .8338 using the table.

Normal approximation to the binomial distribution

Recall when we introduced continuous probability functions, we converted a histogram so that total area in
rectangles = 1. Do the same for binomial distribution.

Denote the binomial rv by X_B.

E.g. let n = 20, p = .5; then X_B = 0, 1, 2, …, 20. A rectangle representing a value of x is drawn so that its area = the
probability. Accomplish this by letting the height of the rectangle = the probability & the base = 1. Thus the base of
each rectangle for x runs from x − .5 to x + .5, so the rectangle representing X_B = 10 has base 9.5 to 10.5 (the
continuity correction factor) & height P(X_B = 10) = .1762.

To use normal approximation, find area under normal curve between 9.5 & 10.5

To find normal probabilities, need to standardise x first, so need to find mean & standard deviation.

μ = np and σ = √(np(1 − p))

P(X = 10) ≈ P(9.5 < Y < 10.5) where Y is normal random variable approximating binomial rv X.

After standardising we get P(-.22 < Z < .22), & using table can find answer.

In general approximate

P(X_B ≤ x) by P(Y ≤ x + .5) → area under the normal curve to the left of x + .5

P(X_B ≥ x) by P(Y ≥ x − .5) → area under the normal curve to the right of x − .5

Omitting the correction factor for continuity

To find P(X = x) we must use the correction factor. If we don’t, we are left with finding the area under a line, which is 0.

When computing the probability of a range of values, we may omit the correction factor. Omitting it ↓ the accuracy
of the approximation, but only very slightly.

Use normal approximation of binomial distribution to approximate sampling distribution of a sample


proportion, but we omit correction factor.
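A minimal sketch of the approximation for the n = 20, p = .5 example (assuming scipy is available):

    import math
    from scipy.stats import binom, norm

    n, p = 20, 0.5
    mu = n * p                                # 10
    sigma = math.sqrt(n * p * (1 - p))        # ≈ 2.236

    exact = binom.pmf(10, n, p)               # P(X_B = 10) ≈ 0.1762
    # continuity correction: area under the normal curve between 9.5 and 10.5
    approx = norm.cdf(10.5, mu, sigma) - norm.cdf(9.5, mu, sigma)
    print(exact, approx)                      # both ≈ 0.176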

Approximate sampling distribution of a sample proportion

As n = 200 is large, use the normal approximation to the binomial; hence the sample proportion is also
(approximately) normal, with E(p̂) = p and Var(p̂) = p(1 – p) / n.
Chapter 10: Introduction to Estimation
10.1 Concepts of Estimation
Objective of estimation is to determine approximate value of a population parameter on basis of a sample
statistic. E.g. sample mean is estimator of population mean, once sample mean computed, its value is called
the estimate.

Estimators are random variables

Point and interval estimators

We use sample data to estimate a population parameter in 2 ways. First, compute the value of the estimator &
consider that value the estimate of the parameter. This is called a point estimator – it draws inferences about a
population by estimating the value of an unknown parameter using a single value or point, combining the sample
information to produce a single number. E.g. select 25 students at random & calculate the mean weekly income to be
$400; so we estimate the mean weekly income of all students to be $400.

3 drawbacks to this:

1. Virtually certain estimate will be wrong


2. We often need to know how close estimator is to parameter
3. In drawing inferences about parameter, it is intuitively reasonable to expect that a large sample will
produce more accurate results because it contains more info than a smaller sample. But point
estimators don’t have capacity to reflect effects of larger sample sizes.

Interval estimator – draws inferences about a population by estimating the value of an unknown parameter
using an interval. E.g. select 25 students at random & calculate the mean weekly income to be between $380 and
$420.

It is affected by sample size.

Unbiased estimator of a population parameter – is an estimator whose expected value is equal to that
parameter. Desired quality of an estimator is unbiasedness.

This means that if you take an infinite no. of samples & calculate the value of the estimator in each sample, the
average value of the estimators = the parameter. But it does not tell us how close the estimator is to the parameter.

Another desired quality is that as sample size ↑, the sample statistic should come closer to the population
parameter.

Consistency – an unbiased estimator is said to be consistent if the difference between the estimator and the
parameter grows smaller as the sample size grows larger.

Relative efficiency – if there are 2 unbiased estimators of a parameter, the one whose variance is smaller is
said to be relatively efficient.
Chapter 9: Sampling Distributions (week 7)
9.1 Sampling Distribution of the Mean
Sampling distribution created by sampling. It relates population parameters to properties of the distribution of
the sample statistics.

One way to create a sampling distribution is to draw samples of the same size from a population, calculate
statistic of interest, & use descriptive techniques to learn more about sampling distribution.

Second method relies on rules of probability, laws of expected value & variance to derive sampling
distribution.

Sampling distribution of the sample mean – informs us what a sample mean tells us about the population
mean.

Sampling distribution of a sample statistic, is the distribution of all possible values that can be assumed by that
statistic, computed from samples of the same size drawn from the same population

- Assume random sampling from the population distribution of a rv X


- Thus a sample of size n, X1, X2,…, Xn constitutes a sequence of iid random variables

Sampling distribution of the mean of 2 dice

Population created by throwing fair die infinite times, with rv X indicating no. shown in any 1 throw.

Probability distribution of X:

X 1 2 3 4 5 6
P(X) 1/6 1/6 1/6 1/6 1/6 1/6

Population infinitely large as we can throw infinite times.

Calculate expected value & variance & standard deviation.

Sampling distribution created by drawing samples of size 2 from population (toss die twice or toss 2 dice).

x̄ is a new rv created by sampling, as the sample mean varies randomly from sample to sample.

Sample   x̄
1,1      1.0
1,2      1.5
1,3      2.0
…        …
6,6      6.0
There are 36 diff possible samples of size 2, & each is equally likely, so the probability of any one sample being
selected is 1/36. However x̄ can only assume 11 diff possible values: 1, 1.5, 2, …, 6, with certain values of x̄
occurring more frequently than others.

Resulting sampling distribution of the sample mean:

x̄     P(x̄)
1.0   1/36
1.5   2/36
2.0   3/36
2.5   4/36
3.0   5/36
3.5   6/36
4.0   5/36
4.5   4/36
5.0   3/36
5.5   2/36
6.0   1/36

The sampling distribution of x̄ is bell shaped.

1. μx̄ = μ
2. σ²x̄ = σ²/n
3. If X is normal, then x̄ is normal. If X is nonnormal, then x̄ is approximately normal for sufficiently
large sample sizes. The definition of “sufficiently large” depends on the extent of nonnormality of X: the
more nonnormal the original population, the larger n needs to be for x̄ to be normal. Vice versa.
4. From a modelling perspective, a sufficiently large sample implies it is not necessary to assume
normality of the underlying population to make inferences about the sample mean based upon the
normal distribution.
5. If a question doesn’t say the population is normal, always assume it’s nonnormal and write “by the central
limit theorem…” to justify treating the sampling distribution of x̄ as normal.

The standard deviation of the sampling distribution is called the standard error of the mean.

As we increase the sample size (n), the sample variance ↓ & the probability that x̄ will be close to the mean ↑. This is
because a randomly selected value of x̄ is likely to be closer to the mean than a randomly selected value of X.

Thus the sampling distribution of x̄ becomes narrower (more concentrated about the mean) as n ↑, becoming
increasingly bell shaped.

Central limit theorem – the sampling distribution of the mean of a random sample drawn from any population is
approximately normal for a sufficiently large sample size. The larger the sample size, the more closely the sampling
distribution of x̄ will resemble a normal distribution.
The accuracy of the approximation depends on the probability distribution of the population & the sample size. If the
population is normal, then x̄ is normally distributed for all values of n. If the population is nonnormal, then x̄ is
approximately normal only for larger values of n.

If the population size is relatively large (at least 20 times) compared to the sample size, then the finite population
correction factor ≈ 1 and can be ignored. Most populations qualify as large: if the population is small, we can
investigate each member of the population without the need for a sample. So the finite population correction factor
is usually omitted.

Creating the sampling distribution empirically

Previously we created the sampling distribution theoretically by listing all possible samples of size 2. We can produce
the distribution empirically by actually tossing 2 fair dice repeatedly, calculating the sample mean for each sample,
counting the no. of times each value of x̄ occurs & computing the relative frequency. If we toss the 2 dice a very large
no. of times, the relative frequencies will approach the theoretical probabilities, but this is impractical.
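A minimal Python sketch that builds the theoretical sampling distribution above by enumerating all 36 samples:

    from itertools import product
    from fractions import Fraction
    from collections import Counter

    dist = Counter()
    for a, b in product(range(1, 7), repeat=2):   # all 36 equally likely samples of size 2
        dist[Fraction(a + b, 2)] += Fraction(1, 36)

    for xbar in sorted(dist):
        print(float(xbar), dist[xbar])            # e.g. 3.5 occurs with probability 6/36

    print(float(sum(x * p for x, p in dist.items())))   # mean of x-bar = 3.5 = population mean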

Using the sampling distribution for inference

Z A is the value of Z such that the area to the right of Z A under the standard normal curve is equal to A.

Suppose X̄ ~ N(μ, σ²/n) and thus Z = (X̄ − μ) / (σ/√n).

Consider a symmetric interval around this standardized value of X̄:

P(−a ≤ (X̄ − μ)/(σ/√n) ≤ a) = 1 − α

Say we choose α = .05 ⇒ P(0 ≤ Z ≤ a) = .475 ⇒ a = 1.96 (from tables)

∴ P(−1.96 ≤ (X̄ − μ)/(σ/√n) ≤ 1.96) = .95

Now rearrange this probability statement to yield

P(X̄ − 1.96·σ/√n ≤ μ ≤ X̄ + 1.96·σ/√n) = .95

General form:

P(μ − zα/2·σ/√n < X̄ < μ + zα/2·σ/√n) = 1 − α

where alpha (α) is the probability that X̄ does not fall into the interval.

e.g. Clare is an auditor. A client has customer accounts with outstanding balances that are known to have σ = $30.334. Clare decides on a random sample of 250 accounts. What is the probability that the sample mean balance will be within $4 of the population mean balance? Don't need to know μ, but need to know n and σ.
Need to determine P(µ − 4 < X̄ < µ + 4), which requires the sampling distribution of X̄.
Is X normal? Doesn't matter, because the sample size is large.
By the CLT, X̄ ~ N(µ, σ²/n); here σ/√n = 30.334/√250 = 1.92, so X̄ ~ N(µ, 1.92²).

P(µ − 4 < X̄ < µ + 4) = P(−4/1.92 < (X̄ − µ)/(σ/√n) < 4/1.92)
= P(−2.08 < Z < 2.08) = .9624
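The same calculation sketched in Python (scipy assumed; numbers from the example above):

```python
from math import sqrt
from scipy.stats import norm

sigma, n, tol = 30.334, 250, 4.0
se = sigma / sqrt(n)                  # standard error ≈ 1.92

z = tol / se                          # ≈ 2.08
prob = norm.cdf(z) - norm.cdf(-z)     # P(−2.08 < Z < 2.08)
print(f"standard error = {se:.2f}, probability = {prob:.4f}")   # ≈ .9624
```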

Chapter 10: Introduction to Estimation


10.2 Estimating the Population Mean when the Population Standard Deviation is
known
How an interval estimator is produced from a sampling distribution – a simple but unrealistic example:

Suppose a population has mean μ and standard deviation σ. The population mean is unknown & we need to estimate its value. Draw a random sample of size n & calculate X̄.

The central limit theorem states that X̄ is normally distributed if X is normally distributed, or approximately normally distributed if X is nonnormal & n is sufficiently large. This means

Z = (X̄ − µ)/(σ/√n) → used only when the population is, or can be assumed to be, normal

is standard normally distributed or approximately so.

Probability statement associated w/ the sampling distribution of the mean:

P(µ − z_α/2 σ/√n < X̄ < µ + z_α/2 σ/√n) = 1 − α

Using algebraic manipulation, can express the probability in a diff form:

P(X̄ − z_α/2 σ/√n < µ < X̄ + z_α/2 σ/√n) = 1 − α

^ another form of the probability statement about the sample mean: the confidence interval estimator of μ,

X̄ − z_α/2 σ/√n to X̄ + z_α/2 σ/√n

The probability 1 − α is called the confidence level.

X̄ − z_α/2 σ/√n is the lower confidence limit (LCL)

X̄ + z_α/2 σ/√n is the upper confidence limit (UCL)

We often represent the confidence interval estimator as

X̄ ± z_α/2 σ/√n

Interval estimators produce a range of values & attach a degree of confidence associated with that interval,
hence the name confidence interval.

Confidence level – the probability that the interval includes the actual value of μ. Usually set 1 – α close to 1
(between .90 & .99).

 Be careful

- Endpoints of the interval are rv's because diff samples → diff intervals


- We have constructed a random interval
- μ is a constant but unknown
- For a particular sample (& sample mean value) μ is either in the confidence interval or not

Confidence intervals for means/proportions typically have a similar structure:

1. Centred at the sample statistic


2. Endpoints are ± some multiple of the standard error (standard deviation of the sampling distribution)
3. The “multiple” is determined by confidence level chosen by investigator

To apply formula, specify confidence level 1 – α, from which we determine α, α/2, z a/2 (table 3 Appendix B).

1 – α = .95 becomes 95% confidence interval estimator of μ.
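A small helper for applying this formula, sketched in Python (scipy assumed; the numbers in the call are just example inputs):

```python
from math import sqrt
from scipy.stats import norm

def z_interval(xbar, sigma, n, conf=0.95):
    """Confidence interval for µ when σ is known: x-bar ± z_{α/2}·σ/√n."""
    alpha = 1 - conf
    z = norm.ppf(1 - alpha / 2)          # z_{α/2}, e.g. 1.96 for 95%
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

print(z_interval(xbar=178, sigma=65, n=400))
```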


Interpreting the confidence interval estimate

e.g. want to estimate the mean value of the distribution resulting from the throw of a fair die. We know the distribution; we also know that μ = 3.5 & σ = 1.71 (from textbook). Pretend we don't know μ & want to estimate it. To do so, we draw a sample of size n = 100 & calculate X̄.

Confidence interval estimate of μ is

X̄ ± z_α/2 σ/√n

The 90% confidence interval estimator is

1 − α = .90, α = .10, z_α/2 = z_.05 = 1.645

Hence X̄ ± z_α/2 σ/√n = X̄ ± 1.645 × 1.71/√100 = X̄ ± .281

This means if we repeatedly draw samples of size 100 from this population, 90% of the values of X̄ will be such that μ would lie somewhere between X̄ − .281 & X̄ + .281, & 10% of the values of X̄ will produce intervals that would not include μ.

If we draw 40 samples of 100 observations each, we expect 4 (10%) values of X̄ to produce intervals that exclude μ. We don't always get the expected 10% (in this case) "wrong" intervals.
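A quick simulation of this interpretation (Python, numpy assumed): draw 40 samples of 100 die throws each and count how many 90% intervals miss μ = 3.5:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, n = 3.5, 100
half_width = 1.645 * 1.71 / 10        # ± .281 from above

misses = 0
for _ in range(40):                   # 40 samples of 100 throws each
    xbar = rng.integers(1, 7, size=n).mean()
    if not (xbar - half_width <= mu <= xbar + half_width):
        misses += 1
print(f"{misses} of 40 intervals missed mu")   # ~4 expected, but it varies
```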

Improve the confidence associated w/ the interval estimate by raising the confidence level, e.g. to 1 − α = .95.

Information and the width of the interval

Wide interval provides little info.

Width of confidence interval estimate is a function of the population standard deviation, the confidence level,
& sample size.

Doubling the population standard deviation → doubles the width of the confidence interval estimate. Logical: if there's a great deal of variation in the rv, it is more difficult to accurately estimate the population mean. That difficulty is translated into a wider interval.

Although we cannot control value of σ, we can select values of sample size & confidence level:

↓ confidence level → narrower interval. Vice versa: we need to widen the interval to be more confident in the estimate. However, a large confidence level is desirable, so there is a trade-off between confidence level & width of interval. A 95% confidence level is standard.

↑ sample size fourfold → ↓ width of interval by half. A larger sample provides more potential info., reflected in a narrower interval. But the tradeoff is ↑ sample size → ↑ sampling cost.
Estimating the population mean using the sample median

The sample mean produces better estimators than the sample median, because the sample median ignores the actual observations & uses ranks instead. This means we lose info.; with less info. we have less precision in the interval estimators (wider interval estimators) & so ultimately make poorer decisions.

10.3 Selecting the Sample Size


Can control width of interval estimate by determining sample size necessary to produce narrow intervals.

Error of estimation

Sampling error – difference between an estimator & a parameter.

The difference between X̄ & μ can also be defined as the error of estimation.

P(−z_α/2 σ/√n < X̄ − μ < +z_α/2 σ/√n) = 1 − α

Tells us the difference between X̄ & μ lies between −z_α/2 σ/√n & +z_α/2 σ/√n with probability 1 − α.

In other words, the max. error of estimation that we are willing to tolerate is z_α/2 σ/√n; we label this value B, which stands for the bound on the error of estimation:

B = z_α/2 σ/√n

Determining the sample size

n = [(z_α/2 × σ)/B]²

Any noninteger value is rounded up.
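A sketch of the calculation (Python, scipy assumed; the σ & B in the call are invented):

```python
from math import ceil
from scipy.stats import norm

def sample_size_mean(sigma, B, conf=0.95):
    """n = (z_{α/2}·σ/B)², rounded up to the next integer."""
    z = norm.ppf(1 - (1 - conf) / 2)
    return ceil((z * sigma / B) ** 2)

print(sample_size_mean(sigma=65, B=5))   # 650 at 95% confidence
```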

Chapter 11: Introduction to Hypothesis Testing (week 8)


11.1 Concepts of Hypothesis Testing
A criminal trial is the best-known nonstatistical application of hypothesis testing.
A person accused of a crime faces a trial where the jury makes a decision on the basis of the evidence presented. The jury conducts a test of hypothesis: 2 hypotheses are tested & a decision is made on the basis of the evidence presented by both sides:

1. Null hypothesis (H 0 ) – defendant is innocent, or more generally, some statement about a population
parameter
2. Alternative hypothesis or research hypothesis (H 1 ) – defendant is guilty

2 possible decisions:

1. Convict defendant = rejecting null hypothesis in favour of the alternative (enough evidence to
conclude that defendant is guilty)
2. Acquit defendant = not rejecting the null hypothesis in favour of the alternative (not enough evidence
to conclude that defendant was guilty)

2 types of errors:

1. Type I error – occurs when we reject a true null hypothesis = innocent person wrongly convicted.
P(type I error) = P(reject H 0 | H 0 true) = α (significance level).
2. Type II error – defined as not rejecting a false null hypothesis = guilty defendant is acquitted. P(type II
error) = β.

α & β are inversely related (↓ one → ↑ other).

A Type I error is more serious, so its probability is kept small: we assume the null hypothesis is true & the prosecution must prove otherwise.

The null hypothesis always states that the parameter equals the value specified in the alternative hypothesis; the alternative hypothesis states what we are investigating.

Test statistic – the criterion on which we base our decision about the hypotheses (the evidence presented in the case in a criminal trial). Based on the best estimator of the parameter, e.g. the best estimator of the population mean is the sample mean, giving (X̄ − µ)/(σ/√n). NOTE: the test statistics used in these chapters are approx. normal.

If the test statistic's value is inconsistent w/ the null hypothesis, we reject the null hypothesis & infer that the alternative hypothesis is true. E.g. trying to decide if the mean is greater than 350: a large value of X̄, say 600, would provide enough evidence to infer the mean is > 350. But if X̄ is close to 350, say 355, we would say this does not provide much evidence to infer that the mean is > 350. In the absence of sufficient evidence, we do not reject the null hypothesis in favour of the alternative.

Sufficient evidence = evidence beyond a reasonable doubt.

11.2 Testing the Population Mean when the Population Standard Deviation is
Known
e.g. H 1 : μ > 170 (install new system)
H 0 : μ = 170 (do not install new system)

2 approaches to determine if sample mean contains sufficient info to infer about population mean e.g. is 178
sufficiently greater than 170 to allow us to confidently infer that the population mean is > than 170:

1. Rejection region – is a range of values such that if the test statistic falls into that range, we reject the
null hypothesis in favour of the alternative hypothesis.

A sample mean of 500 would mean the null hypothesis is false & we would reject it; if the sample mean = 171, we do not reject the null hypothesis because it is possible to observe a sample mean of 171 from a population whose mean is 170. However, a sample mean of 178 is neither very far nor very close.

x̄_L = the value of the sample mean just large enough to reject the null hypothesis

Therefore, the rejection region is: X̄ > x̄_L

α = P(rejecting H_0 given H_0 is true)

= P(X̄ > x̄_L given H_0 is true)

Standardise:

P[(X̄ − μ)/(σ/√n) > (x̄_L − μ)/(σ/√n)] = P[Z > (x̄_L − μ)/(σ/√n)] = α

As P(Z > z_α) = α,

z_α = (x̄_L − μ)/(σ/√n)

If we set α & know σ & n, we can determine x̄_L.

Choice of significance level matters!

 How do I choose the significance level α?

 No rules

 Conventional choices are α = 0.1 or 0.05 or 0.01

For a rejection region in the left tail (<), the critical value takes a –ve sign. Vice versa.


Standard test statistic

Instead of using a test statistic where the rejection region has to be set up in terms of X̄, we can use the standardised value of X̄ (easier) – the standardised test statistic.

Z = (X̄ − μ)/(σ/√n)

Algebraically, the rejection region is Z > z_α = z_.05 = 1.645

Z = (X̄ − μ)/(σ/√n) = (178 − 170)/(65/√400) = 2.46

Because 2.46 > 1.645, reject the null hypothesis.

The standardised test statistic leads to the same conclusion as the test statistic in terms of X̄.

A test is statistically significant at the chosen significance level when the null hypothesis is rejected, e.g. the test was significant at the 5% significance level.

2. p-value of a test – is the probability of observing a test statistic at least as extreme as the one
computed given that the null hypothesis is true. Measure of the amount of statistical evidence that
supports the alternative hypothesis. DOESN’T INVOLVE CHOOSING α

An extremely small p-value indicates that the actual data differs markedly from that expected if the null
hypothesis were true

The drawback of the rejection region method is that it implies the decision is made based purely on whether the test statistic falls in the rejection region. However, that is only one of several factors a manager considers when making a decision, e.g. the cost & feasibility of restructuring the billing system, the possibility of error.

To make a better decision, need to measure amount of statistical evidence supporting the alternative
hypothesis which is provided by p value of a test.

e.g. the p-value is the probability of observing a sample mean at least as large as 178 when the population mean is 170

p-value = P(X̄ > 178) = P(Z > 2.46) = 1 − P(Z < 2.46) = .0069

Interpreting the p-value

The probability of observing a sample mean at least as large as 178 from a population whose mean is 170 (given H_0) is .0069, which is very small. The event is very unlikely, so we doubt the null hypothesis is true; reject the null hypothesis & support the alternative hypothesis.

The closer X̄ is to the hypothesised mean, 170, the larger the p-value. Vice versa.

Values of X̄ far above 170 tend to indicate that the alternative hypothesis is true. Thus the smaller the p-value, the more statistical evidence supporting the alternative hypothesis.
Describing the p-value

- p-value < .01 = overwhelming evidence to infer that the alternative hypothesis is true. Test is highly significant.
- .01 < p-value < .05 = strong evidence to infer that the alternative hypothesis is true. Result deemed significant.
- .05 < p-value < .10 = weak evidence to infer that the alternative hypothesis is true. When the p-value > 5% we say the result is not statistically significant.
- p-value > .10 = little to no evidence to infer that the alternative hypothesis is true.

The p-value and rejection region methods

p-value > α = do not reject the null hypothesis. Vice versa.
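A short Python sketch (scipy assumed) of both methods on the department store example (X̄ = 178, μ_0 = 170, σ = 65, n = 400):

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar, alpha = 170, 65, 400, 178, 0.05

z = (xbar - mu0) / (sigma / sqrt(n))   # standardised test statistic ≈ 2.46
z_crit = norm.ppf(1 - alpha)           # rejection region: Z > 1.645
p_value = 1 - norm.cdf(z)              # P(Z > 2.46) ≈ .0069

print(f"z = {z:.2f}, z_crit = {z_crit:.3f}, p-value = {p_value:.4f}")
# Both methods agree: z > z_crit and p-value < alpha, so reject H0
```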

Interpreting results of a test

Rejecting the null hypothesis does not prove the alternative hypothesis is true, since our conclusion is based on sample data & not the entire population; rather, we say there is enough statistical evidence to infer that the null hypothesis is false & the alternative hypothesis is true.

Likewise, if value of test statistic does not fall into rejection region, we don’t say we accept null hypothesis
(which implies we’re saying it’s true), rather we state we do not reject null hypothesis & we conclude there
isn’t enough evidence to show that the alternative hypothesis is true.

Conclusion of a test of hypothesis – reject null hypothesis = conclude enough statistical evidence to infer that
the alternative hypothesis is true. Vice versa.

One- and Two-tail tests

One-tail tests – rejection region located in only 1 tail of the sampling distribution. Upper or lower tail.

When do we conduct one- and two-tail tests

If α = .05, in a 2-tail test the rejection region becomes .025 in each tail.

From 1 tail test → 2 tail test, null hypothesis same but alternative hypothesis changes. Two-tail tests
conducted when the alternative hypothesis specifies that the mean is not equal to the value stated in the null
hypothesis:

Conduct one-tail test that focuses on the right tail of sampling distribution when we want to know whether
there is enough evidence to infer that the mean is > the quantity specified by the null hypothesis:

H0: μ = μ0

H1: μ > μ0

Left tail test used when we want to determine whether there is enough evidence to infer that the mean is <
value of the mean state in the null hypothesis:
H0: μ = μ0

H1: μ < μ0

Two-tail test:

H0: μ = μ0

H1: μ ≠ μ0

Testing hypotheses and confidence interval estimators

Determine whether hypothesised value of mean falls into the interval estimate.

11.3 Calculating the Probability of a Type II Error (week 9)


P (don’t reject H 0 | H 0 is false) = β.

e.g. a department store is thinking of introducing a new billing system. The new system is only cost-effective if the mean monthly account > $170. n = 400, X̄ = $178, σ = $65.

Rejection region: X̄ > x̄_L = $175.34.

Therefore β = P(X̄ < 175.34 | H_0 is false). When H_0 is false, it means the mean ≠ 170.

To compute β need to specify μ

Suppose that when the mean account is at least $180, the new system is so attractive the manager wouldn't want to make the mistake of not installing it.

β = P(X̄ < 175.34 | μ = 180) = P(Z < −1.43) = .0764

This tells us that when the mean account = $180, the probability of not rejecting H_0 when H_0 is false = .0764.

Effect on β due to changing α

For example above, say α = .01 rather than .05.


Rejection region: z > z_.01 = 2.33, or X̄ > 177.57.

Probability of a Type II error when μ = 180:

β = P(Z < −.75) = .2266

By ↓ α → the rejection region shrinks & this ↑ the area where H_0 is not rejected, so β ↑.

Type I & II error have inverse relationship.
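A sketch of these β calculations in Python (scipy assumed), reproducing the numbers above:

```python
from math import sqrt
from scipy.stats import norm

def beta_and_power(mu0, mu1, sigma, n, alpha):
    """β and power for an upper-tail z-test of H0: µ = mu0 when the true mean is mu1."""
    se = sigma / sqrt(n)
    xbar_L = mu0 + norm.ppf(1 - alpha) * se    # rejection region: X̄ > x̄_L
    beta = norm.cdf((xbar_L - mu1) / se)       # P(X̄ < x̄_L | µ = mu1)
    return beta, 1 - beta

print(beta_and_power(170, 180, 65, 400, 0.05))    # β ≈ .076
print(beta_and_power(170, 180, 65, 400, 0.01))    # β ≈ .227 (smaller α → larger β)
print(beta_and_power(170, 180, 65, 1000, 0.05))   # β ≈ 0 (larger n)
```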

Judging the test

A statistical test of hypothesis is defined by α & n, both of which are selected by practitioner. Can judge how
well test functions by determining probability of Type II error at some value of the parameter.

e.g. same example as above

n = 400, α = .05, β = .0764 when μ = 180

if we think β is too high, can ↓ it by ↑ α, which would ↑ chance of making Type I error which is costly.
Alternatively, can ↑ n.

when n = 1000, β = P(Z < −3.22) ≈ 0.

Developing an understanding of statistical concepts: larger sample size = more info = better decisions

n ↑ → standard error of the mean (σ/√n) ↓ → a narrower distribution, which represents more info; the ↑ info is reflected in a smaller probability of a Type II error.

(Figure in notes: sampling distributions before & after n ↑ to 1000.)


Larger n = better decisions in long run as β lower.

Power of a test

Another way of expressing how well a test performs is to report its power – the probability of it leading us to
reject H 0 when it’s false. Power of a test = 1 – β.

If, given same H 1 , n, α, one test has a higher power than another, it is said to be more powerful.

P(Reject H 0 |H 0 not correct) = Power = 1- β

↑ α → ↑power as β ↓.

Operating characteristic curve

To compute β, need to specify α, n & alternative value of population mean.

Determining the alternative hypothesis to define Type I & II errors

We can only control the probability of the error we designate as Type I (α), so set up the hypotheses so that the more costly error is the Type I error.

11.4 The Road Ahead


What statistical technique to use under what problems.

Problem objective | Nominal | Ordinal | Interval (data type)
Describe a population | 12.3, 15.1 | Not covered | 12.1, 12.2
Compare 2 populations | 13.5, 15.2 | 19.1, 19.2 | 13.1, 13.3, 13.4, 19.1, 19.3
Compare 2 or more populations | 15.2 | 19.3 | Chapter 13, 19.3
Analyse the relationship between 2 variables | 15.2 | 19.4 | Chapter 16, 19.3
Analyse the relationship between 2 or more variables | Not covered | Not covered | Chapters 17 & 18
Chapter 12: Inference about a Population
12.1 Inference about a Population Mean when the Standard Deviation is
Unknown
Usually when population mean is unknown, so is the population standard deviation.

When σ is unknown & the population is normal, the test statistic for testing hypotheses about μ is:

t = (X̄ − µ)/(s/√n) ← s is substituted for σ

which is Student t distributed with v = n − 1 degrees of freedom.

Confidence interval estimator of μ when σ is unknown:

X̄ ± t_α/2 s/√n, v = n − 1

e.g. H_1: μ > 2.0

H_0: μ = 2.0, where α = .01 & n = 148

Rejection region: t > t_α,v = t_.01,147 ≈ t_.01,150 = 2.351

From the sample, calculate the sample mean & sample standard deviation:

t = (X̄ − µ)/(s/√n) = (2.18 − 2.0)/(.981/√148) = 2.23

2.23 < 2.351, so we cannot reject H_0 in favour of H_1.

e.g. calculate the sample mean & standard deviation from the sample data: n = 184, X̄ = 11 343, s = 4400.

We want a 95% confidence interval estimate: 1 − α = .95, therefore α = .05 & α/2 = .025

t_α/2,v = t_.025,183 ≈ t_.025,180 = 1.973

Thus the 95% confidence interval estimate of μ is:

X̄ ± t_α/2 s/√n = 11 343 ± 1.973 × 4400/√184 = 11 343 ± 640

Or LCL = 10 703, UCL = 11 983

When the population is nonnormal, we can still use the t-test & confidence interval estimate provided n is large enough; the required n depends on the extent of nonnormality. This works because in large samples s will be close to σ with high probability.
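The interval from the last example, sketched in Python (scipy assumed); using the exact t value rather than the table approximation gives essentially the same limits:

```python
from math import sqrt
from scipy.stats import t

def t_interval(xbar, s, n, conf=0.95):
    """Confidence interval for µ when σ is unknown: x-bar ± t_{α/2,n−1}·s/√n."""
    t_crit = t.ppf(1 - (1 - conf) / 2, df=n - 1)
    half_width = t_crit * s / sqrt(n)
    return xbar - half_width, xbar + half_width

print(t_interval(11_343, 4400, 184))   # ≈ (10 703, 11 983)
```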
Properties of t-distribution

- Symmetric, unimodal distribution
- Looks very similar to the normal but has "fatter" tails
- Is characterized by its degrees of freedom v
- For large v the distribution tends to a normal distribution

P(t > t_a,v) = a

Estimating the totals of finite populations

When the population is small, must adjust the test statistic & interval estimator using the finite population correction factor. However, if the population is 20 times larger than n, can ignore it.

Finite populations allow us to use the confidence interval estimator of a mean to produce a confidence interval estimator of the population total: multiply the LCL & UCL by the population size N.

N[X̄ ± t_α/2 s/√n]

Developing an understanding of statistical concepts 1


s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1)

Suppose n = 3 & we find that X̄ = 10. E.g. if X_1 = 6 & X_2 = 8, then X_3 must = 16 for the mean to be 10.

Therefore 2 degrees of freedom: we lose 1 degree of freedom because we had to calculate X̄.

Developing an understanding of statistical concepts 2

The t-statistic, like the z-statistic, measures the difference between X̄ & the hypothesised value of μ in terms of the no. of standard errors. However, when σ is unknown, we estimate the standard error by s/√n.

Developing an understanding of statistical concepts 3

The only variable in the z-statistic is X̄, which varies from sample to sample.

The t-statistic has 2 variables, X̄ & s, both of which vary from sample to sample.

Because of the greater uncertainty, the t-statistic displays greater variability.


12.3 Inference about a Population Proportion
Look at nominal data. All we can do is count the no. of occurrences of each value & calculate proportions.

Thus parameter of interest in describing a population of nominal data is the population proportion p.

This parameter is used to calculate probabilities based on the binomial experiment: 2 possible outcomes per trial.

Most practical applications of inference about p involve more than 2 outcomes, but in many cases we are only interested in one outcome ("success") & all other outcomes are "failures".

E(X)=np & Var(X)= np(1-p)

Statistic and sampling distribution

The logical statistic used to estimate & test the population proportion is the sample proportion:

P̂ = X/n → an unbiased point estimator of the population proportion (P̂ plays the role for nominal data that X̄ plays for interval data)

where X is the no. of successes in the sample & n is the sample size.

The sampling distribution of P̂ is approx. normal with mean p & standard deviation √(p(1 − p)/n), provided that np & n(1 − p) > 5.

Express this sampling distribution as

Z = (P̂ − p)/√(p(1 − p)/n)

Testing and estimating a proportion

Test statistic for p:

Z = (P̂ − p)/√(p(1 − p)/n)

which is approx. normal for np & n(1 − p) > 5.

A confidence interval estimator of p of the form

P̂ ± z_α/2 √(p(1 − p)/n)

is useless: to produce the interval estimate we must compute the standard error √(p(1 − p)/n), which requires us to know p, the parameter we wish to estimate in the first place.

Hence estimate p with P̂.

Confidence interval estimator of p:

P̂ ± z_α/2 √(P̂(1 − P̂)/n)

e.g. H_1: p > .5

H_0: p = .5

Test statistic is Z = (P̂ − p)/√(p(1 − p)/n)

Say α = .05.

Thus the rejection region is z > z_α = z_.05 = 1.645

Count the no. of successes: x = 407, n = 765

Hence the sample proportion is P̂ = x/n = 407/765 = .532, giving

z = (.532 − .5)/√(.5(1 − .5)/765) = 1.77

Because the test statistic is approx. normally distributed, can determine the p-value:

p-value = P(Z > 1.77) = 1 − P(Z < 1.77) = 1 − .9616 = .0384

1.77 > 1.645 & .0384 < .05

Therefore, there is enough evidence at the 5% significance level to reject H_0.
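The test sketched in Python (scipy assumed; same counts as above):

```python
from math import sqrt
from scipy.stats import norm

# H0: p = .5 vs H1: p > .5
x, n, p0 = 407, 765, 0.5

p_hat = x / n                                   # .532
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)      # ≈ 1.77
p_value = 1 - norm.cdf(z)                       # ≈ .0384

print(f"p-hat = {p_hat:.3f}, z = {z:.2f}, p-value = {p_value:.4f}")
```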

Missing data

The sample is no longer truly random if ppl refuse to answer.

Missing data – nonresponses that we wish to eliminate.

Estimating the total number of successes in a large finite population

Use the finite population correction factor when the population isn't at least 20 times larger than the sample.

Confidence interval estimator:

P̂ ± z_α/2 √(P̂(1 − P̂)/n)

Multiply it by the population size N to give the total no. of successes in a large finite population:

N[P̂ ± z_α/2 √(P̂(1 − P̂)/n)]

Selecting the sample size to estimate the proportion

n depends on the confidence level & the bound on the error of estimation that the stats practitioner will tolerate.
When the parameter to be estimated is a proportion, the bound on the error of estimation is:

B = z_α/2 √(P̂(1 − P̂)/n)

Solving for n:

n = [z_α/2 √(P̂(1 − P̂)) / B]²

When we have no info on P̂, assume P̂ = .5 (the value that maximises P̂(1 − P̂), hence the most conservative choice).
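A sketch (Python, scipy assumed; the ±3% bound in the call is an invented example):

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_proportion(B, p_hat=0.5, conf=0.95):
    """n = (z_{α/2}·√(p̂(1−p̂))/B)²; p̂ = .5 is the conservative default."""
    z = norm.ppf(1 - (1 - conf) / 2)
    return ceil((z * sqrt(p_hat * (1 - p_hat)) / B) ** 2)

print(sample_size_proportion(B=0.03))   # 1068 for a ±3% bound at 95%
```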

Chapter 16: Simple Linear Regression and


Correlation (week 10)
16.1 Model
Deterministic models – equations that allow us to determine the value of the dependent variable (left side of
equation) from the values of the independent variables. E.g. y = mx + b.

Unrealistic, because we can't determine the price of a house based on size alone.

Probabilistic model – a realistic representation of the relationship between 2 variables - represents the
randomness that is part of a real-life process. E.g. simple linear regression model

To create a probabilistic model, start with a deterministic model that approximates relationship we want to
model. Then add a term that measures the random error of the deterministic component.

e.g. real estate agent knows cost of building new house is $100/square foot & that most lots sell for about
$100 000. The approx selling price would be

y = 100 000 + 100x

where y = selling price, x = size of house in square feet.

A 2000 square feet house is then

y = 100 000 + 100(2000) = 300 000

we know however that the selling price isn't exactly $300 000, but ranges from $200 000 – $400 000. In other words, the deterministic model isn't suitable, so use a probabilistic model to represent the situation.

Y = 100 000 + 100x + ϵ

where ϵ = error or disturbance term – the difference between the actual selling price & the estimated price based on the size of the house. It accounts for all variables, measurable & immeasurable, that are not part of the model. The value of ϵ will vary from sale to sale even if x is constant, meaning houses of the same size sell for diff prices because of diff location etc.

simple linear regression model – 1 independent variable. Analyses the relationship between 2 variables, x & y, both of which must be interval. Finds the line of best fit. (p. 642, Excel)

regression = helps us to estimate probabilistic model

To define the relationship, need to know β_0 & β_1, but they are population parameters, which are almost always unknown.

Y = β_0 + β_1x + ϵ

where

- Y = dependent variable
- x = independent / explanatory variable
- β_0 = y-intercept
- β_1 = slope of the line (rise/run)
- ϵ = error variable

(β_0, β_1 & ϵ are population quantities & can't be measured directly.)

16.2 Estimating the Coefficients


Draw a random sample from the population of interest & calculate the sample statistics we need to estimate the parameters β_0 & β_1. As β_0 & β_1 represent coefficients of a straight line, their estimators are based on drawing a straight line through the sample data. The straight line we wish to use is the one that comes closest to the sample data points – the ordinary least squares line, an unbiased estimator.

Have (Yi, Xi) pairs for i = 1, … , n

ŷ = b_0 + b_1x

b_0 = y-intercept, estimate of β_0

b_1 = slope, estimate of β_1 → the sign of the slope coefficient is the same as the sign of the covariance (& correlation) between Y_i & X_i

ŷ = predicted value of y

Coefficients b_0 & b_1 are chosen to minimise the sum of squared deviations

Σ_{i=1}^{n} (y_i − ŷ_i)²

In other words, ŷ on average comes closest to the observed values of y.

There is no ϵ in the estimator because all numbers are based on sample data.

Can only determine the value of ŷ for values of x within the range of the sample values of x!
Least squares line coefficients

b_1 = s_xy / s_x²

b_0 = Ȳ − b_1X̄

where

s_xy = Σ(X_i − X̄)(Y_i − Ȳ) / (n − 1)

s_x² = Σ(X_i − X̄)² / (n − 1)

X̄ = Σx_i / n

Ȳ = Σy_i / n

Shortcut formula for b_1: pg 638

Residuals (e_i) – deviations between the actual data points & the line; observations of the error variable; the differences between the observed values y_i & the predicted values ŷ_i:

e_i = y_i − ŷ_i

Sum of squares error (SSE) – minimised sum of squared deviations. Basis for other statistics that assess how
well the linear model fits the data.
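A minimal sketch of these formulas in Python (numpy assumed; the x–y data are invented for illustration):

```python
import numpy as np

# Hypothetical data: house size (00s sq ft) vs selling price ($000s)
x = np.array([20.0, 23.5, 25.0, 28.0, 30.5, 33.0])
y = np.array([295.0, 320.0, 315.0, 355.0, 350.0, 390.0])

n = len(x)
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)   # sample covariance
s_x2 = ((x - x.mean()) ** 2).sum() / (n - 1)               # sample variance of x

b1 = s_xy / s_x2                 # slope
b0 = y.mean() - b1 * x.mean()    # intercept
y_hat = b0 + b1 * x              # fitted values
sse = ((y - y_hat) ** 2).sum()   # sum of squares for error

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, SSE = {sse:.2f}")
```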

Some basics:

• The population regression relationship is

Y_i = β_0 + β_1X_i + ϵ_i → links the (X,Y) pairs via the unknown parameters and the unobserved errors

• OLS produces

Y_i = b_0 + b_1X_i + e_i → the estimated regression relationship links the (X,Y) pairs via estimated parameters and calculated residuals

• Disturbance term ϵ i plays a crucial role in regression

Distinguishes regression models from deterministic functions

Represents factors other than X i that affect Y i

Regression treats these other factors as unobserved

• β 1 is the marginal effect of X i on Y i holding these other factors constant

Reliable estimates of β 1 will require assumptions restricting the relationship between X i & ϵ i

Desire to make ceteris paribus interpretations of β 1 → include more explanatory variables in


the regression

Leads to extension to multiple regression

• The population & sample regression lines differ, but as n ↑, SRL → PRL

16.3 Error Variable: Required Conditions


Basic assumptions of simple linear regression model:

1. ϵ is normally distributed for all observations


2. Mean of the distribution is 0; E(ϵ i |x i )=0 for all observations → means data points lie on the SRL on
average. The conditional mean of the disturbances does not depend on x ensures unbiasedness of the
OLS estimator. It implies that omitted factors that might affect expenditure but appear in the
disturbance are assumed to be uncorrelated with x.
3. Standard deviation of ϵ is σ ϵ , which is constant regardless of the value of x → say the disturbances are
homoskedastic - variation in error terms are constant, i.e. dots are of equal distance from SRL.
4. disturbances for any two observations are independent. This implies there is no correlation between
disturbances associated with different observations.
5. (y_i, x_i) are drawn by simple random sampling, hence are independent & identically distributed.

ϵ_i ~ N(0, σ_ϵ²) → this shows homoskedasticity: the standard deviation is the same for every x, so the shape of the distribution is the same.

Assumptions of Classical Linear Regression Model:

1. Linear model is correct i.e. β 1 is indeed constant.


2. Random sampling → unbiased estimator
3. Sample variation exists in X → values of X_1, X_2, …, X_n are not all equal, because

b_1 = s_xy/s_x² = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)²

& if the sample variance of X = 0, then X_i = X̄ for all i, making the denominator 0 & b_1 undefined.

4. Zero conditional mean: E(ϵ i |x i )=0 which implies ϵ i & X i are uncorrelated
5. Homoskedasticity: Var(ϵ i) = σ 2
6. Disturbances are uncorrelated: Cov(ϵ_i, ϵ_j) = 0 (i ≠ j)
7. Disturbances are normally distributed (used for inference)

Assumptions 1–4 allow the OLS estimates to be unbiased & consistent; 5 & 6 allow the estimates to be best out of all linear estimators.

Given all 7 assumptions: Y_i ~ N(β_0 + β_1X_i, σ²), where E(Y) = β_0 + β_1X.

Decomposition of variance

ŷ_i = value of y on our sample linear regression line

Ȳ = sample mean of the y values

Y_i = our observed (actual) value

Y_i − Ȳ = (Ŷ_i − Ȳ) + (Y_i − Ŷ_i) = (Ŷ_i − Ȳ) + e_i

[Total deviation] = [part explained by X] + [part unexplained]
Can show that

Σ(Y_i − Ȳ)² = Σ(Ŷ_i − Ȳ)² + Σe_i²

[Total sum of squares] = [regression sum of squares] + [error sum of squares]

SST = SSR + SSE

SST – total sum of squares; distance between the actual y & Ȳ

SSR – sum of squares due to the regression; what your model explains; distance between ŷ & Ȳ, i.e. fitted line to mean value

SSE – sum of squared errors (residuals); what your model doesn't explain, i.e. distance between ŷ & y

Coefficient of determination (R²) – measures the strength of the linear relationship; the explanatory power of the model. NOT whether they are +vely or –vely related.

Define:

R² = SSR/SST = 1 − SSE/SST

R² measures the goodness of fit of the model, i.e. how much of the variation in y can be explained by the model.

Note: 0 ≤ R² ≤ 1; the closer R² is to 1, the better the fit.

e.g. R² = .6483 means 64.83% of the variation in the dependent variable is explained by variation in the independent variable; the remaining 35.17% is unexplained.
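The decomposition on the same made-up data as the least squares sketch above (Python, numpy assumed):

```python
import numpy as np

# Same made-up data as the least squares sketch in 16.2
x = np.array([20.0, 23.5, 25.0, 28.0, 30.5, 33.0])
y = np.array([295.0, 320.0, 315.0, 355.0, 350.0, 390.0])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = ((y - y.mean()) ** 2).sum()        # total sum of squares
ssr = ((y_hat - y.mean()) ** 2).sum()    # explained by the regression
sse = ((y - y_hat) ** 2).sum()           # unexplained (residuals)

r2 = ssr / sst                           # equivalently 1 - sse/sst
print(f"SST = SSR + SSE: {sst:.1f} = {ssr:.1f} + {sse:.1f}; R2 = {r2:.4f}")
```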

16.4 Assessing the Model


If there's no relationship, or a nonlinear relationship, between the 2 variables, the least squares method is inappropriate.

Hence we need to assess how well linear model fits data, as if it’s poor, we discard linear model & seek
another one.

Standard error of estimate, t-test of the slope, & coefficient of determination used to evaluate whether linear
model should be employed.

All methods based on sum of squares for error.

Sum of squares for error

Least squares method determines the coefficients that minimise the sum of squared deviations between
points & line defined by coefficients.

Minimised sum of square deviations = sum of squares for error (SSE)

Shortcut calculation of SSE

SSE = (n − 1)(s_y² − s_xy²/s_x²)

where s_y² is the sample variance of the dependent variable

Standard error of estimate

The error variable ϵ is normally distributed w/ mean 0 & standard deviation σ_ϵ.

 The disturbance term shows the spread (standard deviation) around the population regression line. But we do not know this spread; need to estimate it.

 The estimated value of σ_ϵ is the standard error of estimate (SEE)

 It measures the fit of the regression model

If σ_ϵ is large, some errors will be large, which implies the model's fit is poor. If σ_ϵ is small, errors tend to be close to the mean 0, hence the model fits well.

Hence we can use σ_ϵ to measure the suitability of using a linear model. But σ_ϵ is a population parameter, and is unknown.

The unbiased estimator of the variance of the error variable σ_ϵ² is s_ϵ² = SSE/(n − 2), and the standard error of estimate is its square root:

SEE = s_ϵ = √(SSE/(n − 2))

The divisor n − 2 is a degrees of freedom adjustment (for the 2 β parameters that need to be estimated) to ensure an unbiased estimator.

p. 651 excel

Compare s_ϵ with Ȳ. The standard error of estimate cannot be used as an absolute measure of the model's utility, as there is no predefined upper limit on s_ϵ.

Testing the slope

If no linear relationship between x & y, SLR line would be horizontal, suggesting β 1 = 0.

Draw inferences about population slope β 1 from sample slope b 1 .

H 0 : β 1 = 0 as null hypothesis specifies there’s no linear relationship

Most often we conduct a 2-tail test to determine whether there is sufficient evidence to infer that a linear relationship exists, though it can be a 1-tail test.

H1: β1 ≠ 0

No linear relationship does not mean there’s no relationship at all, can imply quadratic relationship etc.

Confidence interval estimator of β_1:

b_1 ± t_α/2,ν s_b1

Estimator and sampling distribution

b 1 is an unbiased estimator of β 1 → E(b 1 ) = β 1

b1 & b0:

- are normally distributed, as they are linear functions of the Y_i, which are assumed to be normal
- Even without normality of the Y_i, can invoke the CLT & assume the b_i will be asymptotically normal

The estimated standard error of b_1 is

s_b1 = √( s_ϵ² / Σ(X_i − X̄)² ), where s_ϵ² = SSE/(n − 2)

t-statistic:

t = (b_1 − β_1)/s_b1 ~ t_{n−2}

If the test statistic falls into the rejection region, conclude the variables are linearly related: β_1 > 0 = +vely related, β_1 < 0 = −vely related. Since β_1 is the coefficient of x, a 1-unit change in x causes a β_1 change in y.

Sub in β_1 = 0, as we assume H_0 is true.

Do not test for β 0 as interpreting value of y-intercept can lead to erroneous conclusions.
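The slope test on the same made-up data as the sketches above (Python, numpy & scipy assumed):

```python
import numpy as np
from scipy.stats import t as t_dist

# Same made-up data as the sketches above; test H0: β1 = 0 vs H1: β1 ≠ 0
x = np.array([20.0, 23.5, 25.0, 28.0, 30.5, 33.0])
y = np.array([295.0, 320.0, 315.0, 355.0, 350.0, 390.0])
n = len(x)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
sse = ((y - (b0 + b1 * x)) ** 2).sum()

s_e2 = sse / (n - 2)                                  # s_ϵ²
s_b1 = np.sqrt(s_e2 / ((x - x.mean()) ** 2).sum())    # std error of b1

t_stat = (b1 - 0) / s_b1                              # β1 = 0 under H0
p_value = 2 * (1 - t_dist.cdf(abs(t_stat), df=n - 2))
print(f"b1 = {b1:.2f}, t = {t_stat:.2f}, p-value = {p_value:.4f}")
```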

Cause-and-effect relationship

Evidence of a linear relationship does not mean that changes in the independent variable cause changes in the dependent variable.

Regression analysis only show a statistical relationship exists.

Chapter 13: Inference about Comparing Two


Populations
13.1 Inference about the Difference between Two means: Independent Samples
To estimate the difference between 2 population means, draw random samples from each of the 2 populations.

Independent samples – samples completely unrelated to one another.

The best estimator of the difference between the 2 population means, µ_1 − µ_2, is the difference between the 2 sample means, x̄_1 − x̄_2.

Sampling distribution of x̄_1 − x̄_2

1. x̄_1 − x̄_2 is normally distributed if the populations are normal, or approximately normal if the populations are nonnormal & the sample sizes are large
2. Expected value: E(x̄_1 − x̄_2) = µ_1 − µ_2
3. Variance: V(x̄_1 − x̄_2) = σ_1²/n_1 + σ_2²/n_2
4. Standard error: √(σ_1²/n_1 + σ_2²/n_2)

Chapter 16: Simple Linear Regression and


Correlation (week 11)
16.5 Using the Regression Equation
Point prediction – ŷ is a point estimate of y for a given value of x.

A point prediction by itself provides no info on how closely the value will match the true selling price; need to use an interval.

Predicting the particular value of y for a given x

Prediction interval (for a particular value of y):

ŷ ± t_α/2,n−2 × s_ϵ √(1 + 1/n + (x_g − X̄)²/Σ(x_i − X̄)²)

Estimating the expected value of y for a given x

For a given value of x, there is a population of values of y whose mean is

E(y) = β 0 + β 1 x

Use confidence interval to estimate mean of y.

t = (b 1 - β 1 ) / s b1 ~ t n-2

thus P(-t α/2, n-2 < (b 1 - β 1 ) / s b1 < t α/2, n-2 ) = 1 – α

Confidence interval estimator of the expected value of y:

rearranging: ŷ ± t_α/2,n−2 × s_ϵ √(1/n + (x_g − X̄)²/Σ(x_i − X̄)²)

Chapter 22: Multiple Regression


One of the main uses for regression models is for prediction or forecasting

Basic idea – use the fitted regression line (and b 0 , b 1 ) to predict a value of Y from a given value of X
(Figure: Sydney unleaded petrol prices 1997Q3 to 2004Q4 with trend forecasts to 2006Q3; price (cents) against time (t); trend prediction vs actual price.)

17.1 Model and Required Conditions


Now assume k independent variables are potentially related to the dependent variable.

Thus model is represented by:

Y = β_0 + β_1x_1 + β_2x_2 + … + β_kx_k + ϵ

The estimated slope coefficient for the explanatory variable x 1 represents how much the outcome variable
changes when x 1 increases by one unit and all other variables remain the same.

With a large enough sample you can treat the estimated regression coefficients as if they are normally
distributed irrespective of the underlying distribution of the error term.

Error variable is retained because although we included additional independent variables, deviations between
predicted values of y & actual values of y will still occur.

As you add more explanatory variables to a multiple regression model you expect the standard error of the
estimate to ↓

17.2 Estimating the Coefficients and Assessing the Model


Steps to how a regression analysis is performed:

1. Select the independent variables you believe are linearly related to the dependent variable: e.g.
education, age etc → income
2. Use a computer to generate the coefficients & the statistics used to assess the model; standard error of estimate: s_ϵ = √(SSE/(n − k − 1)), where k is the no. of independent variables in the model. Judge the magnitude of the standard error of estimate relative to the mean of y.
3. test validity of model: H 0 : β 1 = β 2 =… β k = 0 H 1 : at least one β i ≠ 0. If null true, none of the
independent variables is linearly related to y & therefore model is invalid.
4. Interpreting the coefficients:
A key "threat" to using simple linear regression for empirical work is the problem of omitted variables.

e.g. Suppose you are interested in the relationship between food consumption and level of household income. Even though your primary interest is the consumption–income relationship, it is good practice to add other explanatory variables, such as household size, to avoid problems of confounding.
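A minimal sketch of such a regression in Python (numpy assumed; the income/household-size/food figures are invented):

```python
import numpy as np

# Hypothetical data: food spending vs income and household size
income = np.array([40, 55, 60, 72, 85, 90, 110, 120.0])      # $000s
hh_size = np.array([1, 3, 2, 4, 2, 5, 3, 4.0])
food = np.array([4.1, 6.5, 5.8, 8.2, 7.0, 9.9, 8.8, 10.4])   # $000s

# Design matrix with an intercept column; least squares via numpy
X = np.column_stack([np.ones_like(income), income, hh_size])
coeffs, *_ = np.linalg.lstsq(X, food, rcond=None)
b0, b1, b2 = coeffs

# b1 = effect of income on food spending holding household size constant
print(f"b0 = {b0:.3f}, b1 = {b1:.4f}, b2 = {b2:.3f}")
```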

Interpretation in multiple regression

Consider 2 explanatory variables:

Y i = β 0 + β 1 X 1i + β 2 X 2i + ϵ i , i = 1, … , n

 What is the interpretation of b 1 ?

 b 1 is the effect of a change in X 1i on Y i holding X 2i constant or controlling for X 2i

 b 1 (& b 2 ) are sometimes called partial regression coefficients

 Suppose instead run regression of Y i on X 1i alone

 Further suppose β_2 ≠ 0 & X_1i is correlated with X_2i

 Then X 2i will appear in the new disturbance term

 A4 will be violated – why?

 Because new disturbance term & X 1i will be correlated

 Violation of A4 implies?

 A biased estimator of β_1

 Estimate of b 1 will include impact of X 2i on Y i captured by the correlation between X 1i &


X 2i

1 additional assumption for multiple regression:

8. No perfect (multi)collinearity which precludes exact linear relationships between explanatory


variables; no variables can be perfectly linearly correlated or least squares won’t work.

“Linear” in linear regression refers to the parameters

Adjusted R2

It is adjusted to take into account the sample size & no. of independent variables. If no. of independent
variables (k) is large relative to sample size, the unadjusted R2 value may be unrealistically high

Adding more independent variables should ↑ R², but that does not necessarily mean a relevant improvement if you add useless variables.

 Adding an extra explanatory variable will never imply a fall in R2


 Useful to make degrees of freedom adjustments to measures of fit

 EXCEL reports an adjusted R2

 May increase or decrease as explanatory variables are added

Recall

R² = SSR/SST = 1 − SSE/SST

Adjusted R², denoted R̄², is given by

R̄² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]

where n = number of observations & k = number of explanatory variables

Adjusted R²'s in the height–income e.g.:

M1: R̄² = 0.0248
M2: R̄² = 0.0425
M3: R̄² = 0.1362

t-Tests and the analysis of variance

t-tests allow us to determine whether β_i ≠ 0 (for i = 1, 2, …, k). The F-test in the analysis of variance combines these t-tests, meaning we test all the β_i at once to determine whether at least 1 of them ≠ 0. The F-test only needs to be performed once, as opposed to k separate t-tests, so the probability of a Type I error in that single trial is α; the chance of erroneously concluding the model is valid is substantially less with the F-test than with multiple t-tests.

Also, because of a commonly occurring problem called multicollinearity, t-tests may indicate that some independent variables are not linearly related to the dependent variable when in fact they are. Multicollinearity doesn't affect the F-test.

Chapter 4: Numerical Descriptive Techniques (week 12)


4.6 Applications in Finance: Market Model
2 goals of investing:

1. Maximise expected return


2. Minimise risk (variance & standard deviation used to measure risk of investments).
The market model assumes the rate of return on a stock is linearly related to the rate of return on the stock market index. The rate of return on the index is calculated the same way as the return on a single stock: e.g. if the index at the end of last yr was 10 000 & the value at the end of this yr is 11 000, the market index annual return is 10%.

Return on stock is dependent variable (Y), return on index is independent variable (X).

b 1 measures how sensitive the stock’s rate of return is to changes in the level of the overall market.

If b_1 > 1 → the stock's rate of return is more sensitive to changes in the level of the overall market than the average stock, meaning it's more volatile than the market & therefore riskier than the entire market. E.g. if b_1 = 2, a 1% ↑ in the index → an average ↑ of 2% in the stock's return.

Systematic and firm-specific risk

b_1 measures the stock's market-related (or systematic) risk, because it measures the volatility of the stock price that is related to the overall market. It tells us nothing about the strength of the relationship.

R2 measures proportion of total risk that is market related.

Firm-specific (or nonsystematic) risk is the proportion of risk associated w/ events specific to company rather
than market. E.g. company’s managers.
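A sketch of estimating the market model by least squares (Python, numpy assumed; the monthly returns are invented):

```python
import numpy as np

# Hypothetical monthly returns (%): market index and one stock
market = np.array([1.2, -0.5, 2.1, 0.3, -1.8, 2.6, 0.9, -0.7])
stock = np.array([2.0, -1.1, 3.5, 0.1, -2.9, 4.0, 1.6, -1.5])

# Market model: stock return = b0 + b1 · market return + e
b1 = np.cov(market, stock, ddof=1)[0, 1] / np.var(market, ddof=1)
b0 = stock.mean() - b1 * market.mean()

resid = stock - (b0 + b1 * market)
r2 = 1 - (resid ** 2).sum() / ((stock - stock.mean()) ** 2).sum()

print(f"beta (b1) = {b1:.2f}")   # > 1 suggests riskier than the market
print(f"R2 = {r2:.2f}")          # proportion of risk that is market related
```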

Chapter 12: Inferences about a Population


12.2 Inference about a Population Variance
The parameter is the population variance σ², as opposed to the population mean as before. σ² is usually unknown; need to draw inferences.

Statistic and sampling distribution

Unbiased & consistent estimator of σ 2 is sample variance s2.

Testing and estimating a population variance

The test statistic used to test hypotheses about σ² is

χ² = (n − 1)s²/σ² ~ χ²_{n−1}

which is called the chi-squared (χ²) statistic, with v = n − 1 degrees of freedom.

Rejection region:

χ² < χ²_{1−α/2, n−1} or χ² > χ²_{α/2, n−1}


Confidence interval estimator of σ²:

LCL = (n − 1)s² / χ²_{α/2, n−1}

UCL = (n − 1)s² / χ²_{1−α/2, n−1}

Notice the CI is not symmetric about s².

Recall that for the population mean the CI was sample mean ± margin of error, but for the population variance the CI is (s² − error_L < σ² < s² + error_U) & error_L ≠ error_U.

Checking the required condition

The chi-squared test & estimator of σ² also require the sampled population to be normal.

chi-squared distribution is skewed to the right & depends on degrees of freedom.
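A sketch of the interval in Python (scipy assumed; the s² & n in the call are invented). Note scipy's `chi2.ppf(1 − α/2)` corresponds to the textbook's χ²_{α/2} (area α/2 to the right):

```python
from scipy.stats import chi2

def variance_interval(s2, n, conf=0.95):
    """CI for σ²: ((n−1)s²/χ²_{α/2,n−1}, (n−1)s²/χ²_{1−α/2,n−1})."""
    alpha = 1 - conf
    lcl = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)  # divide by upper quantile
    ucl = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)      # divide by lower quantile
    return lcl, ucl

print(variance_interval(s2=19.68, n=25))
```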


Chapter 15: Chi-Squared Tests
Data often occurs in nominal (categorical) form

 Categories are mutually exclusive & exhaustive

 Think of each respondent/observation as being a trial

15.1 Chi-Squared Goodness-of-fit Test


The hypothesis tests so far correspond to a binomial experiment: each trial results in only a success or failure.

Multinomial experiment – 2 or more possible outcomes per trial.

Multinomial experiment properties:

1. Experiment consists of a fixed no. n of trials


2. Outcome of each trial can be classified into k categories called cells
3. Probability pi that the outcome will fall into cell i remains constant for each trial.
Moreover, p1 + p2 +…+ pk = 1
4. Each trial of the experiment is independent of other trials → random sampling

 Want to compare observed & expected distributions

 Obviously could calculate differences in expected & observed category frequencies

 Inference problem is to determine whether those differences are statistically large

Chi-squared goodness of fit test - used to test if observed & expected distributions are the same

When k = 2, multinomial experiment is identical to binomial.

Like counting the no. of successes in a binomial experiment, we count the no. of outcomes falling into each of the k cells in a multinomial experiment. In doing so, we get observed frequencies f_1, f_2, …, f_k, where f_i is the observed frequency of outcomes falling into cell i (i = 1, 2, …, k).

As the experiment has n trials:

f_1 + f_2 + … + f_k = n

Chi-squared goodness-of-fit test is the equivalent to z-test of p in binomial experiments but for multinomial
experiments.

Expected frequencies: e_i = np_i → from the hypothesised proportions (e.g. previous data), the expected frequency of each cell

Observed frequencies: f_i → what is actually observed in the sample


Chi-squared goodness-of-fit test statistic

χ² = Σ_{i=1}^{k} (f_i − e_i)²/e_i ~ χ²_{k−1}

- where f_i = observed frequency in cell i
- e_i = expected frequency in cell i

The sampling distribution is approx. chi-squared distributed with v = k − 1 degrees of freedom, where k = number of cells.

If H 0 is true, observed & expected frequencies should be similar & so test statistic should be small. If H 0 is
untrue, some observed & expected frequencies will differ & test statistic will be large.

Rejection region:

χ 2 > χ 2 α, k-1

Required condition

As with the normal approximation to the binomial in the sampling distribution of a proportion, where we needed np & n(1 − p) to be > 5, the chi-squared test statistic needs to satisfy the rule of five – n must be large enough that the expected value for each cell is 5 or more. Cells can be combined to satisfy this condition.

This is because the test can be unreliable if any of the expected values e_i = np_i get too small (e.g. 3 or 4).
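A sketch of the test in Python (scipy assumed; the hypothesised proportions & counts are invented):

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical market shares under H0 and observed counts in a sample
p0 = np.array([0.45, 0.40, 0.15])    # hypothesised cell probabilities
f = np.array([102, 82, 16])          # observed frequencies, n = 200
e = p0 * f.sum()                     # expected frequencies np_i

stat, p_value = chisquare(f_obs=f, f_exp=e)
print(f"chi2 = {stat:.2f}, p-value = {p_value:.4f}")
# Rule of five: all expected frequencies here are >= 5
```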

15.2 Chi-Squared Test of a Contingency Table


 Testing strategy is similar to that used for the goodness of fit test

 Compare observed cell frequencies with those expected under null hypothesis of independence

Chi-Squared Test of a Contingency Table – used to determine whether there is enough evidence to infer that 2
nominal variables are related & to infer that differences exist between 2 or more populations of nominal
variables.

e.g. 2 variables: undergrad degree (B.A., B.Eng., B.B.A. & other) & MBA major (marketing, finance & accounting). Want to know whether the variables are related. OR determine whether differences exist between B.A.'s, B.Eng.'s, B.B.A.'s & others. In other words, treat holders of each undergrad degree as a separate population, where each population takes one of the 3 possible values of MBA major. The objective is to compare 4 populations. OR the other way around.

H 0 : 2 variables are independent

H 1 : 2 variables are dependent


Graphical technique

Bar chart to display data from sample.

Test statistic

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (o_ij − e_ij)²/e_ij ~ χ²_ν

where o_ij = observed frequency of the cell in row i, column j

e_ij = expected frequency of the cell in row i, column j

ν = (r − 1) × (c − 1)

Same test statistic as one used to test proportions in the goodness-of-fit test.

Need probabilities to calculate the expected values e_ij for the test statistic; get the probabilities from the sample data.

Assume H 0 is true:

P (A and B) = P(A) x P(B) → if events A & B are independent.

Expected frequency of the cell in row i and column j in a contingency table:

e_ij = (n_i. / n) × (n_.j / n) × n = n_i. n_.j / n

where n_i. = total obs. in row i (i = 1, …, r) & n_.j = total obs. in column j (j = 1, …, c)

Expected cell frequencies should satisfy the rule of five.

Rejection region & p-Value

To determine the RR, need to know the no. of degrees of freedom associated w/ the chi-squared statistic: ν = (r − 1)(c − 1) for a contingency table w/ r rows & c columns.

χ 2 > χ 2 α, v

The p-value can't be determined manually.

Rule of five

In a contingency table where 1 or more cells have expected values < 5, we need to combine rows or columns to satisfy the rule of five.
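A sketch of the test in Python (scipy assumed; the table counts are invented to match the undergrad degree × MBA major example):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: undergrad degree (rows) × MBA major (columns)
table = np.array([
    [31, 13, 16],   # B.A.
    [8, 16, 7],     # B.Eng.
    [12, 10, 17],   # B.B.A.
    [10, 5, 7],     # other
])

stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {stat:.2f}, df = {dof}, p-value = {p_value:.4f}")
# 'expected' holds e_ij = (row total)(column total)/n; check the rule of five
```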
