Reading materials:
Chap 4 (Keller)
Arithmetic mean from raw data
• Arithmetic mean from population: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
• Arithmetic mean from sample: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Where: Xi, xi - the value of each item
N, n - total number of items

Arithmetic mean from frequency table
• Apply this formula for the sample: $\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$
Where: xi - the value of class i
fi – frequency of class i
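As a minimal sketch, the two ways of computing the mean (from raw data and from a frequency table) can be written as below. The function names and the data are hypothetical, chosen only so that both routes give the same answer.

```python
def mean_raw(values):
    """Arithmetic mean from raw data: sum of the items divided by their count."""
    return sum(values) / len(values)


def mean_from_frequency_table(class_values, frequencies):
    """Arithmetic mean from a frequency table: sum(x_i * f_i) / sum(f_i)."""
    weighted_sum = sum(x * f for x, f in zip(class_values, frequencies))
    return weighted_sum / sum(frequencies)


# Hypothetical data, for illustration only.
raw = [2, 2, 3, 3, 3, 5]   # the raw observations
values = [2, 3, 5]         # the distinct values (classes) x_i
freqs = [2, 3, 1]          # how often each value occurs, f_i

print(mean_raw(raw))                             # 3.0
print(mean_from_frequency_table(values, freqs))  # 3.0
```

Both calls return the same value because the frequency table is just a compact description of the raw list.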
• Advantages:
– Easy to understand and calculate
– The value of every item is included => representative of the whole set of data
• Disadvantages
– Sensitive to outliers:
Sample: (43; 38; 37; ...; 27; 34) => $\bar{x} = 33.5$
Contaminated sample: (43; 38; 37; ...; 27; 1934) => $\bar{x} = 71.5$
Median is the value of the observation which is located in the middle of the data set

Steps to find median:
1. Arrange the observations in order of size (normally ascending order)
2. Find the number of observations and hence the middle observation
3. The median is the value of the middle observation

• If the data has an odd number of observations:
– Middle observation: the $\frac{n+1}{2}$th
– $\text{Median} = x_{\left(\frac{n+1}{2}\right)\text{th}}$
• If the data has an even number of observations:
– There are two observations located in the middle, and $\text{Median} = \left(x_{\left(\frac{n}{2}\right)\text{th}} + x_{\left(\frac{n}{2}+1\right)\text{th}}\right)/2$
Example
• E.g. 1. Raw data: 11, 11, 13, 14, 17 => find median
• E.g. 2. Raw data: 11, 11, 13, 14, 16, 17 => find median

Advantages and disadvantages of median
• Advantages:
– Easy to understand and calculate
– Not affected by outlying values => thus can be used when the mean would be misleading
• Disadvantages
– Value of one observation => fails to reflect the whole data set
– Not easy to use in other analysis
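The two median examples (E.g. 1 and E.g. 2) can be checked with a short sketch that follows the three steps for finding the median. The function name is hypothetical, and the contaminated list at the end is an invented illustration of the "not affected by outlying values" point, not data from the slides.

```python
def median(observations):
    """Median by sorting and picking the middle observation(s)."""
    ordered = sorted(observations)   # step 1: arrange in ascending order
    n = len(ordered)                 # step 2: number of observations
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n+1)/2)th observation, which is index n//2 when counting from 0.
        return ordered[mid]
    # Even n: average of the (n/2)th and (n/2 + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([11, 11, 13, 14, 17]))      # 13
print(median([11, 11, 13, 14, 16, 17]))  # 13.5

# Hypothetical contamination: replace 17 by 1934 in E.g. 1.
clean = [11, 11, 13, 14, 17]
contaminated = [11, 11, 13, 14, 1934]
print(median(clean), median(contaminated))  # the median stays at 13
print(sum(clean) / len(clean), sum(contaminated) / len(contaminated))  # the mean jumps
```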
Mode
• Steps to find mode
1. Draw a frequency table for the data
2. Identify the mode as the most frequent value

Example to calculate mode
Value   Frequency
8       3
12      7
16      12
17      8
19      5
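A minimal sketch of the two steps for finding the mode, assuming the two columns of the example are the value and its frequency (the column meaning is an assumption, since the extracted table carries no headers):

```python
# Step 1: the frequency table (value -> frequency), assumed from the example.
frequency_table = {8: 3, 12: 7, 16: 12, 17: 8, 19: 5}

# Step 2: the mode is the value with the highest frequency.
# Note: with tied frequencies, max() returns only the first tied value.
mode = max(frequency_table, key=frequency_table.get)

print(mode)                   # 16
print(frequency_table[mode])  # 12
```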
Which measure of centre is best?
• Mean generally most commonly used
• Sensitive to extreme values
• If data skewed/extreme values present, median better, e.g. real estate prices
• Mode generally best for categorical data – e.g. restaurant service quality (below): the mode is "Very good" (ordinal data)

Rating         # customers
Excellent      20
Very good      50
Good           30
Satisfactory   12
Poor           10
Very Poor      6

Measures of dispersion (variability)
• Measures of dispersion tell you how spread out the values of the distribution are around the central tendency
• Measures of dispersion
• The range, quartile range, and quartile deviation
• Variance and standard deviation
Quartiles
• Quartiles: are defined as values of observations which are a quarter of the way through data
– Q1 - the first quartile: the value of the observation of which 25% of observations fall below
– Q2 - the second quartile: the median (50% of the observations fall below)
– Q3 - the third quartile: the value of the observation of which 75% of observations fall below

Quartile range and quartile deviation
• Quartile range = Q3 – Q1
• Quartile deviation = $\frac{Q_3 - Q_1}{2}$
• Advantages of quartile deviation (semi-interquartile range): less affected by extreme values
• Disadvantages: takes into account only 50% of the data
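The quartiles, quartile range and quartile deviation can be sketched with Python's standard library. The data set is hypothetical. `statistics.quantiles` with `method="exclusive"` places the i-th quartile at position i(n+1)/4 in the sorted data and interpolates linearly between neighbours; that is one common convention, and other software may use a slightly different one.

```python
from statistics import quantiles

# Hypothetical data set with n = 7 observations.
data = [11, 11, 13, 14, 16, 17, 19]

# Cut points at positions (n+1)/4, 2(n+1)/4 and 3(n+1)/4 of the sorted data.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

quartile_range = q3 - q1                # Q3 - Q1
quartile_deviation = quartile_range / 2  # (Q3 - Q1) / 2

print(q1, q2, q3)          # 11.0 14.0 17.0
print(quartile_range)      # 6.0
print(quartile_deviation)  # 3.0
```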
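Variance
• Variance from population: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

Standard deviation ($\sigma$)
• Standard deviation (S.D) is the square root of variance
• S.D from population: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$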
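A minimal sketch of the population variance and standard deviation, computed directly from the definition (average squared deviation from the mean, then its square root). The function names and the data are hypothetical.

```python
import math


def population_variance(values):
    """Average squared deviation of the values from their mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_sd(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(values))


# Hypothetical population, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))  # 4.0
print(population_sd(data))        # 2.0
```

The standard library functions `statistics.pvariance` and `statistics.pstdev` compute the same population quantities.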
$\bar{x}_A = 16\%$, $\bar{x}_B = 12\%$
$s_A^2 = 280.34\ (\%)^2$, $s_B^2 = 99.37\ (\%)^2$
Example – 2 funds over 10 years (2)
Depending on how risk-averse you are:
Fund A: higher risk, but also higher average rate of return.

Empirical rules or the law of 3$\sigma$
• For a normal (symmetrical, bell-shaped) distribution:
– 68.27% of all obs fall within 1 standard deviation of the mean, i.e. in the range: $(\bar{x} - 1s)$ to $(\bar{x} + 1s)$
– 95.45% of all obs fall within 2 standard deviations of the mean, i.e. in the range: $(\bar{x} - 2s)$ to $(\bar{x} + 2s)$
– 99.73% of all obs fall within 3 standard deviations of the mean, i.e. in the range: $(\bar{x} - 3s)$ to $(\bar{x} + 3s)$
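The percentages in the empirical rule can be checked against the standard normal distribution. This is a sketch using `statistics.NormalDist` from the Python standard library; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    # Prints roughly 0.6827, 0.9545 and 0.9973 for k = 1, 2, 3.
    print(k, round(share_within(k), 4))
```

The shares depend only on k, so the same percentages apply to any normal distribution whatever its mean and standard deviation.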
[Figure: Boxplot of Height, with the whisker, box and median labelled]
• Need MEDIAN and QUARTILES to create a boxplot
• MEDIAN = middle of observations, i.e. ½ way through observations
• QUARTILES = mark quarter points of observations, i.e. ¼ (Q1) and ¾ (Q3) of the way through data [(n+1)/4; 3(n+1)/4]
• INTERQUARTILE RANGE = Q3-Q1
• Whiskers: max length is 1.5*IQR; stretch from box to furthest data point (within this range)
• Points further out from box marked with stars; called outliers

Boxplot of Symmetric, Positive skew, Negative skew, Bimodal
• Skewness/symmetry
• Modality
• Range
[Figure: four boxplots labelled Symmetric, Positive skew, Negative skew and Bimodal; vertical axis: Data]
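The boxplot ingredients listed above (quartiles, interquartile range, whiskers of at most 1.5*IQR, outliers) can be sketched as follows. The height data are hypothetical, and `statistics.quantiles` with `method="exclusive"` is used because it places the quartiles at positions (n+1)/4 and 3(n+1)/4.

```python
from statistics import quantiles

# Hypothetical heights in cm (n = 11), for illustration only.
heights = sorted([150, 160, 165, 168, 170, 172, 175, 178, 180, 185, 230])

q1, med, q3 = quantiles(heights, n=4, method="exclusive")
iqr = q3 - q1

# Whiskers may extend at most 1.5 * IQR beyond the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers stop at the furthest data points that are still within the limits.
inside = [h for h in heights if lower_limit <= h <= upper_limit]
whisker_low, whisker_high = inside[0], inside[-1]

# Points beyond the limits are plotted individually as outliers.
outliers = [h for h in heights if h < lower_limit or h > upper_limit]

print(q1, med, q3, iqr)           # 165.0 172.0 180.0 15.0
print(whisker_low, whisker_high)  # 150 185
print(outliers)                   # [230]
```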
Coefficient of skewness (C of S)
• This measures the shape of distribution
• There are some measures of skewness.
• Below is a common one: Pearson’s coefficient of skewness.
Coefficient of skewness = 3 x (mean-median)/standard deviation
• If C of S is nearly +3 or -3, the distribution is highly skewed
• If C of S is positive => distribution is skewed to the right (positive skew)
• If C of S is negative => distribution is skewed to the left (negative skew)

Activity 1
• Summary statistics of two data sets are as follows

                     Set 1: Ages of students    Set 2: Wages of staff
                     studying at UNSW
Mean                 22.4839                    294.3
Median               21                         292.5
Standard deviation   6.3756                     125.93

Compute the Pearson’s coefficient of skewness of these data sets and describe their shapes of distribution
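As a hedged sketch, Pearson's coefficient of skewness can be computed directly from the three summary statistics. The function name is hypothetical; the inputs are the values given for Set 1 and Set 2 in Activity 1.

```python
def pearson_skewness(mean, median, sd):
    """Pearson's coefficient of skewness: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / sd


set1 = pearson_skewness(22.4839, 21, 6.3756)    # ages of students
set2 = pearson_skewness(294.3, 292.5, 125.93)   # wages of staff

print(round(set1, 3))  # about 0.698
print(round(set2, 3))  # about 0.043
```

Both coefficients are positive, so both distributions are skewed to the right. Set 2 is so close to zero that it is nearly symmetric, while Set 1 shows a more noticeable positive skew.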
• Covariance
• Correlation coefficient
[Figure: two frequency charts; vertical axis: Frequency]