
Omkar & Yaying Wednesday 5-6pm

BES PASS S1 10 1
WEEK 3 BES PASS

Descriptive Statistics
- Population - a set of all possible observations.
- Sample - a portion of a population. We often use information concerning a sample to
make an inference (conclusion) about the population.
- Parameter - describes a characteristic of the population, e.g. the population variance
- Statistic - describes a characteristic of a sample, e.g. the sample variance

Frequency Distribution and Histograms
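- Class - one of a set of mutually exclusive categories or intervals into which the data are grouped
- Frequency distribution - a grouping of data into classes, showing the number of observations in each class
- Relative frequency distribution - expresses the number of observations in each class as a percentage
of the total number of observations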
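The definitions above can be sketched in a few lines of Python. This is only an illustration: the scores and the class width of 10 are hypothetical values chosen for the example, not data from these notes.

```python
from collections import Counter

# Hypothetical test scores, used only to illustrate the definitions.
scores = [52, 57, 61, 64, 68, 71, 73, 75, 78, 79, 82, 85, 88, 91, 96]
CLASS_WIDTH = 10  # assumed class width, giving classes 50-59, 60-69, ...


def class_of(value, width=CLASS_WIDTH):
    """Return the (lower, upper) limits of the class that contains value."""
    lower = (value // width) * width
    return (lower, lower + width - 1)


# Frequency distribution: number of observations in each class.
# Every score falls in exactly one class, so the classes are mutually exclusive.
frequency = Counter(class_of(s) for s in scores)

# Relative frequency distribution: each count as a share of the total.
total = len(scores)
relative_frequency = {c: count / total for c, count in frequency.items()}

for (lower, upper), count in sorted(frequency.items()):
    share = relative_frequency[(lower, upper)]
    print(f"{lower}-{upper}: frequency = {count}, relative frequency = {share:.1%}")
```

The relative frequencies always sum to 1 (100%), which is a quick check that no observation was dropped or counted twice.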

Shapes of Distributions and Histograms
- A histogram is symmetrical if one half of the histogram is a mirror reflection of the other
- Non-symmetrical distributions are said to be skewed





a) Skewed to the right (positively skewed): Mode < Median < Mean
b) Skewed to the left (negatively skewed): Mode > Median > Mean
c) Symmetric distribution: Mode = Median = Mean

Measures of Central Tendency: The Mean, Mode and Median
- The mean is the average of the scores: the sum of all the scores divided by the number of scores
- The mode is the value that has the highest frequency
- The median is the middle value of data ordered from lowest to highest
- The median and the mode are less sensitive to outliers than the mean.
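As a rough check of these three measures, the sketch below uses Python's standard `statistics` module on a small illustrative data set, then replaces the largest score with an extreme value to show how each measure reacts to an outlier. The outlier value of 200 is an assumption made for the example.

```python
import statistics

# Small illustrative data set.
data = [2, 3, 3, 6, 8, 9, 14, 16, 17, 20]

mean = statistics.mean(data)      # sum of the scores / number of scores
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # value with the highest frequency

print(mean, median, mode)  # 9.8 8.5 3

# Replace the largest score with an extreme value (a hypothetical outlier).
with_outlier = data[:-1] + [200]
print(
    statistics.mean(with_outlier),
    statistics.median(with_outlier),
    statistics.mode(with_outlier),
)  # 27.8 8.5 3
```

Only the mean moves when the outlier is introduced. Note also that mode < median < mean here, the ordering associated with a distribution skewed to the right.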

Quartiles and Percentiles, including the Median
- Percentile values divide the data (arranged in ascending order) into 100 equal parts.
They are a measure of relative standing.
- p% of the data is less than the pth percentile, and (100 - p)% of the data is greater than
the pth percentile.
Formulas for the mean: population mean $\mu = \frac{\sum x_i}{N}$; sample mean $\bar{x} = \frac{\sum x_i}{n}$

Location of the pth percentile in the ordered data: $L = \frac{p}{100} \times (n + 1)$
- The median is the 50th percentile
- The inter-quartile range is the difference between the 75th percentile (upper quartile)
and the 25th percentile (lower quartile)

E.g. 2, 3, 3, 6, 8, 9, 14, 16, 17, 20 (n = 10)
o Lower quartile: L = (25 x 11) / 100 = 2.75, so take the value 3/4 of the way between the 2nd and 3rd scores = 3
o Median: L = (50 x 11) / 100 = 5.5, so take the value 1/2 of the way between the 5th and 6th scores = 8.5
o Upper quartile: L = (75 x 11) / 100 = 8.25, so take the value 1/4 of the way between the 8th and 9th scores = 16.25
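The worked example can be reproduced with a short function. This is a minimal sketch of the location method L = (p / 100) x (n + 1) with linear interpolation between neighbouring scores. The function name `percentile` and the way out-of-range locations are handled are assumptions made for the illustration.

```python
def percentile(data, p):
    """Return the p-th percentile using the location L = (p / 100) * (n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    location = (p / 100) * (n + 1)    # 1-based position in the ordered data
    lower_rank = int(location)        # whole-number part of the location
    fraction = location - lower_rank  # how far to move towards the next score

    # Assumption: locations outside 1..n are clamped to the smallest/largest score.
    if lower_rank < 1:
        return ordered[0]
    if lower_rank >= n:
        return ordered[-1]

    lower_value = ordered[lower_rank - 1]  # convert 1-based rank to 0-based index
    upper_value = ordered[lower_rank]
    return lower_value + fraction * (upper_value - lower_value)


scores = [2, 3, 3, 6, 8, 9, 14, 16, 17, 20]
lower_quartile = percentile(scores, 25)
median = percentile(scores, 50)
upper_quartile = percentile(scores, 75)

print(lower_quartile, median, upper_quartile)  # 3.0 8.5 16.25
print("IQR =", upper_quartile - lower_quartile)  # IQR = 13.25
```

Other textbooks and software packages use slightly different location rules, so quartiles computed elsewhere may differ a little from these values.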

PRACTICE QUESTION #1:
1. A firm has 45 employees. The data shows the number of weeks of vacation the employees
take annually:
Weeks   Employees
2       19
3       14
4       8
8       4
a) Calculate the first quartile for the number of weeks of vacation. (Answer: 2)
b) Find the median and the third (upper) quartile. (Answer: 3 and 4)

Measures of Dispersion
Measures of dispersion tell us about the variability in a set of data:
- Range = largest score - smallest score
- Inter-quartile range (IQR) = 3rd quartile - 1st quartile. This represents the spread of the
middle 50% of the data set.
- Variance and Standard Deviation (SD) are both based on deviations from the mean
For a population, the variance is given by
$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N} = \frac{\sum X_i^2 - N\mu^2}{N}$

For a sample, the variance is given by
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1}$


In both cases, the standard deviation is found by taking the square root of the variance.

E.g. Given the following income data: $23,000 $36,500 $47,200 $20,200 $61,300:
o Sample mean = (23,000+36,500+47,200+20,200+61,300) / 5 = $37,640
o Sample variance = $\frac{(23{,}000^2 + 36{,}500^2 + 47{,}200^2 + 20{,}200^2 + 61{,}300^2) - 5(37{,}640)^2}{5 - 1} = 292{,}743{,}000$ (in dollars squared)
o Sample standard deviation = $\sqrt{292{,}743{,}000} = 17{,}109.73$ (in dollars)
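The income example can be checked numerically. The sketch below computes the sample variance in two ways: as the sum of squared deviations from the mean divided by n - 1, and with the shortcut form (sum of squares minus n times the squared mean, divided by n - 1). It then compares the result with `statistics.variance` from the standard library. The variable names are chosen for the illustration.

```python
import math
import statistics

incomes = [23_000, 36_500, 47_200, 20_200, 61_300]
n = len(incomes)
mean = sum(incomes) / n

# Definitional form: sum of squared deviations from the mean, divided by n - 1.
var_definitional = sum((x - mean) ** 2 for x in incomes) / (n - 1)

# Shortcut form: (sum of squares - n * mean^2) / (n - 1).
var_shortcut = (sum(x ** 2 for x in incomes) - n * mean ** 2) / (n - 1)

# The standard deviation is the square root of the variance.
sd = math.sqrt(var_definitional)

print(mean)               # 37640.0
print(var_definitional)   # 292743000.0
print(round(sd, 2))       # 17109.73

# Both forms agree with the standard library's sample variance (n - 1 divisor).
assert math.isclose(var_definitional, var_shortcut)
assert math.isclose(var_definitional, statistics.variance(incomes))
```

For a population rather than a sample, the divisor would be N instead of n - 1 (`statistics.pvariance` in the standard library).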

- The Coefficient of Variation (CV) is a relative measure of dispersion: the standard deviation
divided by the mean. It should be used to compare the dispersion of data sets that:
o Are measured in different units
o Have widely divergent means, even though they are measured in the same units




E.g. A set of test scores has mean = 890 and SD = 120.
o The coefficient of variation = 120/890 = 0.1348

PRACTICE QUESTION #2:
1. The following data represent a sample of test scores: 88, 76, 67, 90, 98, 68, 75, 86, 82,
90. Calculate:
a) the sample mean = 82
b) the sample standard deviation = 10.23067
c) The coefficient of variation = 0.12476
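One way to check these answers is with Python's standard `statistics` module. This is a sketch rather than the expected working: `statistics.stdev` uses the n - 1 divisor, which matches the sample standard deviation, and the coefficient of variation is taken as the sample standard deviation divided by the sample mean.

```python
import statistics

scores = [88, 76, 67, 90, 98, 68, 75, 86, 82, 90]

sample_mean = statistics.mean(scores)
sample_sd = statistics.stdev(scores)  # sample SD (n - 1 divisor)
coefficient_of_variation = sample_sd / sample_mean

print(sample_mean)                         # 82
print(round(sample_sd, 5))                 # 10.23067
print(round(coefficient_of_variation, 5))  # 0.12476
```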

Random variable: a function that assigns a real number to each point in the sample space, i.e.
a quantity whose value depends on the outcome of a random experiment.

Discrete Random Variables
- A random variable takes a value, determined by chance, from within a set sample space.
- A random variable may be either continuous, i.e. able to take any value within an interval
(e.g. height), or discrete, i.e. restricted to separate values such as whole numbers.

Expected Value
- Expected value is the centre of the probability distribution of the random variable X.
- It is denoted E(X), i.e. the expected value of X:

$E(X) = \sum_{\text{all } x} x \, P(X = x)$

- If the probability distribution is the same as the population relative frequency
distribution, then E(X) = $\mu$.

E.g. Let X represent possible returns (in hundreds of dollars) from a portfolio of shares. The
table below shows the probabilities assigned to possible returns.
x 12 20 56
P(X=x) 0.3 0.6 0.1
What is the mean return? 21.2
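The mean return is the expected value: each possible return multiplied by its probability, summed over all possible returns. A minimal sketch of that calculation, storing the table as a dictionary (a representation chosen for the illustration):

```python
# Probability distribution from the table: returns in hundreds of dollars.
distribution = {12: 0.3, 20: 0.6, 56: 0.1}

# A valid probability distribution has probabilities that sum to 1.
assert abs(sum(distribution.values()) - 1.0) < 1e-9

# E(X) = sum over all x of x * P(X = x)
expected_return = sum(x * p for x, p in distribution.items())

print(round(expected_return, 2))  # 21.2, i.e. 12(0.3) + 20(0.6) + 56(0.1)
```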



Coefficient of variation formulas: population CV = $\sigma / \mu$; sample CV = $s / \bar{x}$

Variance

Mathematically:
$V(X) = \sum_{\text{all } x} (x - E(X))^2 P(X = x) = \sum_{\text{all } x} x^2 P(X = x) - [E(X)]^2$

In other words, take each possible value x, subtract the expected value E(X), square the result, and
multiply by the probability of x. Then sum over all x.

E.g. Consider a coin tossing game where you win 9 dollars if a head appears on a single toss
of a fair coin but you lose 9 dollars if a tail appears. Let the random variable R represent the
return on a single toss. What is the variance of R? 81
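The coin-tossing answer can be verified with a short sketch. It computes the variance in two ways: as the probability-weighted sum of squared deviations from E(R), and as the probability-weighted sum of squared values minus the squared expected value. The function names are chosen for the illustration.

```python
def expected_value(dist):
    """E(X): each value times its probability, summed over all values."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """V(X): squared deviation from E(X), weighted by probability."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())


def variance_shortcut(dist):
    """V(X) computed as the sum of x^2 * P(X = x), minus [E(X)]^2."""
    return sum(x ** 2 * p for x, p in dist.items()) - expected_value(dist) ** 2


# Fair coin: win 9 dollars on a head, lose 9 dollars on a tail.
coin_return = {9: 0.5, -9: 0.5}

print(expected_value(coin_return))     # 0.0
print(variance(coin_return))           # 81.0
print(variance_shortcut(coin_return))  # 81.0
```

Because E(R) = 0 here, the variance is simply 9 squared, and the standard deviation is 9 dollars.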



Measures of Linear Relationship
Covariance:
Covariance measures the linear relationship or association between two random variables.
Positive (negative) covariance indicates a positive (negative) linear association.
Zero covariance indicates no linear association.
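Population covariance: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
Sample covariance: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

Correlation:
The limitation of the covariance concept is that it depends on the units used to
measure X and Y. To avoid this, the population correlation coefficient ($\rho$) and the sample
correlation coefficient (r) are used instead.

Population correlation: $\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$, where $-1 \le \rho \le 1$
Sample correlation: $r = \frac{s_{xy}}{s_x s_y}$, where $-1 \le r \le 1$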
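The unit-dependence of covariance can be seen in a small sketch. It assumes the usual sample formulas: the covariance is the sum of the products of deviations from the two means, divided by n - 1, and the correlation is the covariance divided by the product of the two sample standard deviations. The paired data are hypothetical, and the function names are chosen for the illustration.

```python
import math


def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))  # cov(x, x) is the sample variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)
r = sample_correlation(x, y)
print(s_xy, round(r, 4))  # 1.5 0.7746

# Change the units of x (e.g. dollars to cents): covariance scales, r does not.
x_cents = [100 * xi for xi in x]
print(sample_covariance(x_cents, y), round(sample_correlation(x_cents, y), 4))  # 150.0 0.7746
```

The covariance changes by a factor of 100 when x is rescaled, while r stays the same and remains between -1 and 1.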

Least Squares Methodology

The method of least squares calculates the line of best fit for a particular data set by
minimising the sum of squared residuals. Results are:

$b_1 = \frac{s_{xy}}{s_x^2}$ and $b_0 = \bar{y} - b_1 \bar{x}$
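A minimal sketch of these results, assuming the fitted line is written as y = b0 + b1 x, with slope b1 equal to the sample covariance divided by the sample variance of x, and intercept b0 equal to the mean of y minus b1 times the mean of x. The data are hypothetical and the helper `sum_squared_residuals` is introduced only for the illustration.

```python
# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance of x and y, and sample variance of x (both with n - 1).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

b1 = s_xy / s_x2         # slope
b0 = y_bar - b1 * x_bar  # intercept


def sum_squared_residuals(intercept, slope):
    """Sum of squared vertical distances between the data and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


print(round(b1, 4), round(b0, 4))               # 0.6 2.2
print(round(sum_squared_residuals(b0, b1), 4))  # 2.4

# Nudging the intercept or the slope gives a larger sum of squared residuals.
print(sum_squared_residuals(b0 + 0.1, b1) > sum_squared_residuals(b0, b1))  # True
print(sum_squared_residuals(b0, b1 + 0.1) > sum_squared_residuals(b0, b1))  # True
```

The last two lines illustrate what "least squares" means: any other line through the data has a larger sum of squared residuals than the fitted one.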