PASS notes for week 3 ECON1203.

BES PASS S1 10 1

WEEK 3 BES PASS

Descriptive Statistics

- Population - the set of all possible observations.

- Sample - a portion of a population. We often use information from a sample to make an inference (conclusion) about the population.

- Parameter - describes a characteristic of the population, e.g. the population variance

- Statistic - describes a characteristic of a sample, e.g. the sample variance

Frequency Distribution and Histograms

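- Class - one of a set of mutually exclusive groups (intervals) into which the data are sorted, so that each observation falls into exactly one class

- Frequency distribution - a grouping of data into classes, showing the number of observations in each class

- Relative frequency distribution - shows the number of observations in each class as a percentage of the total number of observations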
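The definitions above can be illustrated with a minimal sketch. The marks and the class width of 10 are hypothetical values chosen for illustration; they are not from the notes.

```python
from collections import Counter

# Hypothetical test marks (assumed data, not from the notes).
marks = [52, 67, 71, 58, 90, 85, 64, 77, 73, 69]
width = 10  # assumed class width: classes are 50-59, 60-69, ...

# Frequency distribution: each mark falls into exactly one class,
# identified here by the lower limit of the class.
frequency = Counter((m // width) * width for m in marks)

# Relative frequency distribution: class count as a share of all data.
relative = {lower: count / len(marks) for lower, count in frequency.items()}

for lower in sorted(frequency):
    print(f"{lower}-{lower + width - 1}: {frequency[lower]} ({relative[lower]:.0%})")
```

Because the classes are mutually exclusive, the frequencies add up to the number of observations and the relative frequencies add up to 100%.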

Shapes of Distributions and Histograms

- A histogram is symmetrical if one half of the histogram is a mirror reflection of the other

- Non-symmetrical distributions are said to be skewed

a) Skewed to the right (positively skewed): Mode < Median < Mean

b) Skewed to the left (negatively skewed): Mode > Median > Mean

c) Symmetric distribution: Mode = Median = Mean

Measures of Central Tendency: The Mean, Mode and Median

- The mean is the average of the scores: the sum of all scores divided by the number of scores

- The mode is the value that has the highest frequency

- The median is the middle value of data ordered from lowest to highest

- The median and the mode are less sensitive to outliers than the mean.
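A short sketch of the three measures, and of how an outlier affects them, using Python's standard `statistics` module. The scores are hypothetical values chosen for illustration.

```python
from statistics import mean, median, mode

# Hypothetical scores (assumed data, not from the notes).
scores = [2, 3, 3, 6, 8]
with_outlier = scores + [100]  # the same scores plus one extreme value

mean_before, mean_after = mean(scores), mean(with_outlier)
median_before, median_after = median(scores), median(with_outlier)
mode_before, mode_after = mode(scores), mode(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
print("mode:  ", mode_before, "->", mode_after)
```

In this sketch the single outlier pulls the mean up sharply, while the median moves only slightly and the mode does not change.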

Quartiles and Percentiles, including the Median

- Percentile values divide the data (arranged in ascending order) into 100 equal parts. They are a measure of relative standing.

- p% of the data is less than the pth percentile, and (100 − p)% of the data is greater than the pth percentile.

Population mean: \( \mu = \frac{\sum x_i}{N} \)        Sample mean: \( \bar{x} = \frac{\sum x_i}{n} \)

Omkar & Yaying Wednesday 5-6pm


Location of the pth percentile: \( L = \frac{p}{100} \times (n + 1) \)

- The median is the 50th percentile

- The inter-quartile range is the difference between the 75th percentile (upper quartile) and the 25th percentile (lower quartile)

E.g. 2, 3, 3, 6, 8, 9, 14, 16, 17, 20 (n = 10)

o Lower quartile: L = (25 x 11) / 100 = 2.75, so take the value 0.75 of the way between the 2nd and 3rd scores = 3

o Median: L = (50 x 11) / 100 = 5.5, so take the value 0.5 of the way between the 5th and 6th scores = 8.5

o Upper quartile: L = (75 x 11) / 100 = 8.25, so take the value 0.25 of the way between the 8th and 9th scores = 16.25
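The location method used in this example can be sketched as a small function. The name `percentile` and the handling of locations that fall outside the data (returning the smallest or largest score) are assumptions made for this illustration.

```python
def percentile(data, p):
    """Return the p-th percentile using the location L = (p / 100) * (n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    location = p / 100 * (n + 1)   # 1-based position, may be fractional
    lower = int(location)          # position of the score just below L
    fraction = location - lower    # how far to move towards the next score
    if lower < 1:                  # assumption: clamp to the smallest score
        return ordered[0]
    if lower >= n:                 # assumption: clamp to the largest score
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


data = [2, 3, 3, 6, 8, 9, 14, 16, 17, 20]
q1 = percentile(data, 25)
med = percentile(data, 50)
q3 = percentile(data, 75)
print(q1, med, q3, "IQR =", q3 - q1)
```

On the data above this reproduces the lower quartile, median and upper quartile found by hand.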

PRACTICE QUESTION #1:

1. A firm has 45 employees. The data shows the number of weeks vacation the employees take annually:

Weeks    Employees
2        19
3        14
4        8
8        4

a) Calculate the first quartile for the number of weeks of vacation. Answer: 2

b) Find the median and the third (upper) quartile. Answer: median = 3, upper quartile = 4
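One way to check these answers is to expand the frequency table into 45 individual observations and use `statistics.quantiles` from the Python standard library (Python 3.8 or later). Its `method="exclusive"` option places the quartiles at positions based on n + 1, which matches the location formula used in these notes. The variable names are illustrative.

```python
from statistics import quantiles

# Frequency table from the question: weeks of vacation -> number of employees.
table = {2: 19, 3: 14, 4: 8, 8: 4}

# Expand the table into one observation per employee.
weeks = [w for w, count in table.items() for _ in range(count)]

# n=4 asks for quartiles; "exclusive" uses positions based on (n + 1).
q1, med, q3 = quantiles(weeks, n=4, method="exclusive")
print(len(weeks), q1, med, q3)
```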

Measures of Dispersion

Measures of dispersion tell us about the variability in a set of data:

- Range = largest score − smallest score

- Inter-quartile range (IQR) = 3rd quartile − 1st quartile. This represents the spread of the middle 50% of the data set.

- Variance and Standard Deviation (SD) are both based on deviations from the mean

For a population, the variance is given by

\( \sigma^2 = \frac{\sum (X_i - \mu)^2}{N} = \frac{\sum X_i^2 - N\mu^2}{N} \)

For a sample, the variance is given by

\( s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1} \)

In both cases, the standard deviation is found by taking the square root of the variance.

E.g. Given the following income data: $23,000 $36,500 $47,200 $20,200 $61,300:

o Sample mean = (23,000+36,500+47,200+20,200+61,300) / 5 = $37,640

o Sample variance = [(23,000² + 36,500² + 47,200² + 20,200² + 61,300²) − 5(37,640)²] / (5 − 1) = 292,743,000 (in squared dollars)

o Sample standard deviation = √292,743,000 = $17,109.73
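The income example can be checked with a short sketch that computes the sample variance twice: once with the shortcut formula used above, and once with `statistics.variance`, which also divides by n − 1. Variable names are illustrative.

```python
from statistics import mean, stdev, variance

incomes = [23_000, 36_500, 47_200, 20_200, 61_300]
n = len(incomes)
x_bar = mean(incomes)

# Shortcut formula: (sum of squares - n * mean squared) / (n - 1).
shortcut = (sum(x * x for x in incomes) - n * x_bar ** 2) / (n - 1)

# Library versions, both using the n - 1 (sample) divisor.
s2 = variance(incomes)
s = stdev(incomes)

print(x_bar, shortcut, s2, round(s, 2))
```

Both routes give the same variance, and the standard deviation is its square root.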


- The Coefficient of Variation (CV) is a relative measure of dispersion. It should be used to compare the variability of data sets that:

o Are measured in different units

o Have means that differ widely, even though they are measured in the same units

E.g. A set of test scores has mean = 890 and SD = 120.

o The coefficient of variation = 120/890 = 0.1348

PRACTICE QUESTION #2:

1. The following data represent a sample of test scores: 88, 76, 67, 90, 98, 68, 75, 86, 82, 90. Calculate:

a) the sample mean = 82

b) the sample standard deviation = 10.23067

c) the coefficient of variation = 0.12476
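The three answers can be reproduced with a minimal sketch using the standard `statistics` module; the sample CV is the sample standard deviation divided by the sample mean. Variable names are illustrative.

```python
from statistics import mean, stdev

scores = [88, 76, 67, 90, 98, 68, 75, 86, 82, 90]

x_bar = mean(scores)   # a) sample mean
s = stdev(scores)      # b) sample standard deviation (n - 1 divisor)
cv = s / x_bar         # c) coefficient of variation

print(x_bar, round(s, 5), round(cv, 5))
```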

Random variable: a function that assigns a real number to each point (outcome) in the sample space, i.e. a quantity whose value depends on the outcome of a random experiment.

Discrete Random Variables

- A random variable takes a value that depends on which outcome in the sample space occurs.

- The random variable may be either continuous, i.e. able to take any value within an interval (e.g. height), or discrete, i.e. restricted to separate values such as whole numbers.

Expected Value

- Expected value is the centre of the probability distribution of the random variable X.

- It is denoted E(X), i.e. the expected value of X:

\( E(X) = \sum_{\text{all } x} x \, P(X = x) \)

- If the probability distribution is the same as the population relative frequency distribution, then \( E(X) = \mu \).

E.g. Let X represent possible returns (in hundreds of dollars) from a portfolio of shares. The

table below shows the probabilities assigned to possible returns.

x         12    20    56
P(X=x)    0.3   0.6   0.1

What is the mean return? E(X) = 12(0.3) + 20(0.6) + 56(0.1) = 21.2 (hundreds of dollars)
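The expected value for this table can be sketched in a few lines: multiply each possible return by its probability and add the results. The dictionary layout is just one convenient way to hold the table.

```python
# Probability distribution from the example: return x -> P(X = x).
distribution = {12: 0.3, 20: 0.6, 56: 0.1}

# Probabilities of a valid distribution sum to 1 (allowing for rounding).
total_probability = sum(distribution.values())

# E(X) = sum over all x of x * P(X = x).
expected_value = sum(x * p for x, p in distribution.items())

print(round(expected_value, 2))
```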

Population CV = \( \frac{\sigma}{\mu} \)

Sample CV = \( \frac{s}{\bar{x}} \)


Variance

Mathematically:

\( V(X) = \sigma^2 = \sum_{\text{all } x} (x - E(X))^2 \, P(X = x) = \sum_{\text{all } x} x^2 \, P(X = x) - [E(X)]^2 \)

In other words, take each possible value x minus the expected value E(X), square it, and multiply by the probability of x. Then sum over all x.

E.g. Consider a coin tossing game where you win 9 dollars if a head appears on a single toss of a fair coin but you lose 9 dollars if a tail appears. Let the random variable R represent the return on a single toss. What is the variance of R?

E(R) = 9(0.5) + (−9)(0.5) = 0, so V(R) = (9 − 0)²(0.5) + (−9 − 0)²(0.5) = 81
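The coin example can be checked with a short sketch that computes the variance both ways: from the definition, and from the shortcut form (the expected value of R squared, minus the square of E(R)). Variable names are illustrative.

```python
# Distribution of the return R on one toss of a fair coin: r -> P(R = r).
distribution = {9: 0.5, -9: 0.5}

expected = sum(r * p for r, p in distribution.items())

# Definition: sum of (r - E(R))^2 * P(R = r).
var_definition = sum((r - expected) ** 2 * p for r, p in distribution.items())

# Shortcut: sum of r^2 * P(R = r), minus [E(R)]^2.
var_shortcut = sum(r * r * p for r, p in distribution.items()) - expected ** 2

print(expected, var_definition, var_shortcut)
```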

Measures of Linear Relationship

Covariance:

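A measure of the linear relationship or association between two random variables.

Positive (negative) covariance => positive (negative) linear association

Zero covariance => no linear association

Population covariance: \( \sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} \)

Sample covariance: \( s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \)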

Correlation:

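The limitation of covariance is that its value depends on the units used to measure X and Y. To avoid this, the population correlation coefficient (\( \rho \)) and the sample correlation coefficient (r) are used instead. Each divides the covariance by the product of the two standard deviations.

Population correlation: \( \rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \), where \( -1 \le \rho \le 1 \) and \( \sigma_{xy} \) is the population covariance

Sample correlation: \( r = \frac{s_{xy}}{s_x s_y} \), where \( -1 \le r \le 1 \) and \( s_{xy} \) is the sample covariance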
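A minimal sketch of the sample covariance and sample correlation, written out from the standard definitions (sum of cross-products of deviations divided by n − 1, then divided by the two sample standard deviations). The x and y values are hypothetical and chosen only to keep the arithmetic simple.

```python
from math import sqrt
from statistics import mean

# Hypothetical paired observations (assumed data, not from the notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = mean(x), mean(y)

# Sample covariance: cross-products of deviations over (n - 1).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations of x and y.
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Sample correlation: unit-free, always between -1 and 1.
r = s_xy / (s_x * s_y)

print(s_xy, round(r, 4))
```

Rescaling x or y (for example, from dollars to cents) changes the covariance but leaves r unchanged, which is why the correlation is preferred for comparisons.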

Least Squares Methodology

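The method of least squares calculates the line of best fit \( \hat{y} = b_0 + b_1 x \) for a particular data set by minimising the sum of squared residuals. The results are:

\( b_1 = \frac{s_{xy}}{s_x^2} \)        \( b_0 = \bar{y} - b_1 \bar{x} \)

where \( s_{xy} \) is the sample covariance of x and y and \( s_x^2 \) is the sample variance of x.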
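The two results can be sketched directly from the sample covariance and the sample variance of x. The data are hypothetical values chosen for illustration, and the variable names are not from the notes.

```python
from statistics import mean

# Hypothetical paired observations (assumed data, not from the notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = mean(x), mean(y)

# Sample covariance of x and y, and sample variance of x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

b1 = s_xy / s_x2         # slope of the line of best fit
b0 = y_bar - b1 * x_bar  # intercept: the line passes through (x_bar, y_bar)

print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```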

