
BUSINESS STATISTICS
Advanced Educational Program

Lecture 4
DESCRIPTIVE STATISTICS: Numerical summaries

Reading materials:
Chap 4 (Keller)


Outline: measures of center and spread


• Measures of center:
- Mean, median, mode
- Selection of measures of location
• Measures of dispersion (spread):
- Range, quartile range, quartile deviation,
variance, standard deviation
• Empirical rule (general case: Chebyshev’s
law)
• Coefficient of skewness
• Coefficient of variation

Measures of center

• A measure of center (or location) shows where the center of the data is
• The three most useful measures of location:
– Arithmetic mean (average)
– Median
– Mode

Arithmetic mean from raw data

• Arithmetic mean of a population: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
• Arithmetic mean of a sample: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Where: $X_i$, $x_i$ - the value of each item
$N$, $n$ - total number of items

Arithmetic mean from frequency table

• Apply this formula for the sample: $\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$

Where: $x_i$ - the value of class $i$
$f_i$ - frequency of class $i$
$k$ - number of classes
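
The two formulas can be checked with a short Python sketch. The function names `mean_raw` and `mean_from_frequency` and the data values are illustrative assumptions, not part of the lecture's notation.

```python
def mean_raw(values):
    """Arithmetic mean of raw data: sum of the items divided by their count."""
    return sum(values) / len(values)


def mean_from_frequency(table):
    """Arithmetic mean from (x_i, f_i) pairs: sum(x_i * f_i) / sum(f_i)."""
    weighted_total = sum(x * f for x, f in table)
    total_frequency = sum(f for _, f in table)
    return weighted_total / total_frequency


# Illustrative data (assumed for this sketch).
raw = [11, 11, 13, 14, 17]
freq_table = [(8, 3), (12, 7), (16, 12), (17, 8), (19, 5)]

print(mean_raw(raw))                              # 13.2
print(round(mean_from_frequency(freq_table), 2))  # 15.17
```

Both functions divide by the number of items, so the frequency-table version gives the same result as expanding the table back into raw data.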


Advantages and disadvantages of arithmetic mean

• Advantages:
– Easy to understand and calculate
– The value of every item is included => representative of the whole set of data
• Disadvantages:
– Sensitive to outliers

Mean is sensitive to outliers

Sample: (43, 38, 37, ..., 27, 34) => $\bar{x} = 33.5$
Contaminated sample: (43, 38, 37, ..., 27, 1934) => $\bar{x} = 71.5$
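
The effect of a single contaminated value can be reproduced on a small scale. The five-value sample below is hypothetical; it is not the lecture's full sample, whose middle values are elided, so the means differ from 33.5 and 71.5.

```python
from statistics import mean

# Hypothetical five-value sample (assumed for illustration only).
clean = [43, 38, 37, 27, 34]
# Same sample with the last value mistyped as 1934.
contaminated = clean[:-1] + [1934]

print(mean(clean))         # 35.8
print(mean(contaminated))  # 415.8
```

One wrong entry is enough to pull the mean far away from every genuine observation.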


Median

• Median is the value of the observation which is located in the middle of the data set
• Steps to find median:
1. Arrange the observations in order of size (normally ascending order)
2. Find the number of observations and hence the middle observation
3. The median is the value of the middle observation

Calculate median from raw data

• If the data has an odd number of observations:
– Middle observation: the $\frac{n+1}{2}$th
– $\text{Median} = x_{\left(\frac{n+1}{2}\right)}$
• If the data has an even number of observations:
– There are two observations located in the middle: the $\frac{n}{2}$th and the $\left(\frac{n}{2}+1\right)$th
– $\text{Median} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$

Example

• E.g. 1. Raw data: 11, 11, 13, 14, 17 => find median
• E.g. 2. Raw data: 11, 11, 13, 14, 16, 17 => find median

Advantages and disadvantages of median

• Advantages:
– Easy to understand and calculate
– Not affected by outlying values => can be used when the mean would be misleading
• Disadvantages:
– It is the value of one observation => fails to reflect the whole data set
– Not easy to use in other analysis


Mode

• Mode is the value which occurs most frequently in the data set
• Steps to find mode
1. Draw a frequency table for the data
2. Identify the mode as the most frequent value

Example to calculate mode

X     Frequency
8     3
12    7
16    12
17    8
19    5
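
Reading the mode off the frequency table can be sketched as below. The helper name `modes_from_table` is an assumption; it returns a list so that bimodal or multimodal data are also handled.

```python
def modes_from_table(frequency):
    """Return every value whose frequency equals the highest frequency."""
    highest = max(frequency.values())
    return [x for x, f in frequency.items() if f == highest]


# Frequency table from the example: value X -> frequency.
table = {8: 3, 12: 7, 16: 12, 17: 8, 19: 5}

print(modes_from_table(table))  # [16]
```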


Mean, median and mode in normal and skewed distributions

Bimodal and multimodal data

[Figure: two panels, "Bimodal (two modes)" and "Multimodal (several modes)"]


Which measure of centre is best?

• Mean is generally the most commonly used
• It is sensitive to extreme values
• If the data are skewed or extreme values are present, the median is better, e.g. real estate prices
• Mode is generally best for categorical data, e.g. restaurant service quality (ordinal, below): the mode is "Very good"

Rating          # customers
Excellent       20
Very good       50
Good            30
Satisfactory    12
Poor            10
Very poor       6

Measures of dispersion (variability)

• Measures of dispersion tell you how spread out all other values of the distribution are from the central tendency
• Measures of dispersion:
– The range, quartile range, and quartile deviation
– Variance and standard deviation

Why do we need measures of dispersion? (1)

• Two data sets of midterm marks of 5 students:
– First set: 100, 40, 40, 35, 35 => Mean: 50
– Second set: 70, 55, 50, 40, 35 => Mean: 50
=> Which mean (first or second) is more reliable?
• We need to know the spread of the other values around the central tendency; this is especially important in analysing the stock market.
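
A quick numerical check makes the point: the two sets share a mean but differ in spread. The sketch below uses Python's `statistics` module and the sample standard deviation (a measure introduced later in the lecture); the variable names are assumptions.

```python
from statistics import mean, stdev

first_set = [100, 40, 40, 35, 35]
second_set = [70, 55, 50, 40, 35]

# Same mean for both sets...
print(mean(first_set), mean(second_set))  # 50 50

# ...but very different sample standard deviations.
sd_first = stdev(first_set)
sd_second = stdev(second_set)
print(round(sd_first, 2), round(sd_second, 2))  # 28.06 13.69
```

The second set's marks sit closer to 50, so its mean describes a typical student better than the first set's mean does.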


Range

• Range is the difference between the largest and smallest value => sort the data before computing the range
• Formula: Range = maximum value - minimum value
• Advantages of range: easy to calculate for ungrouped data
• Disadvantages:
– Takes into account only two values
– Affected by one or two extreme values
– More difficult to calculate for grouped data
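
As a minimal sketch, the range formula applied to two illustrative sets of marks looks like this. The function name `value_range` is an assumption.

```python
def value_range(values):
    """Range = maximum value - minimum value (after sorting the data)."""
    ordered = sorted(values)
    return ordered[-1] - ordered[0]


# Illustrative marks (assumed for this sketch).
print(value_range([100, 40, 40, 35, 35]))  # 65
print(value_range([70, 55, 50, 40, 35]))   # 35
```

In the first set the single mark of 100 determines the range on its own, which illustrates the sensitivity to extreme values.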
Quartiles

• Quartiles are defined as the values of observations which are a quarter of the way through the data:
– Q1 - the first quartile: the value of the observation below which 25% of observations fall
– Q2 - the second quartile: the median (50% of the observations fall below)
– Q3 - the third quartile: the value of the observation below which 75% of observations fall

Quartile range and quartile deviation

• Quartile range = Q3 – Q1
• Quartile deviation = $\frac{Q_3 - Q_1}{2}$
• Advantages of quartile deviation (semi-interquartile range): less affected by extreme values
• Disadvantages: takes into account only 50% of the data
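
The quartile range and quartile deviation can be sketched with `statistics.quantiles`. Software packages use several slightly different quartile conventions; this sketch assumes the "exclusive" method, which places Q1 and Q3 at positions (n+1)/4 and 3(n+1)/4 in the sorted data. The seven data values are hypothetical.

```python
from statistics import quantiles

# Hypothetical data with n = 7, so the quartile positions 2, 4 and 6 are whole numbers.
data = [35, 40, 50, 55, 70, 75, 90]

# n=4 splits the data into quarters and returns [Q1, Q2, Q3].
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

quartile_range = q3 - q1
quartile_deviation = quartile_range / 2

print(q1, q2, q3)          # 40.0 55.0 75.0
print(quartile_range)      # 35.0
print(quartile_deviation)  # 17.5
```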

Variance

• Variance of a population: $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$
• Variance of a sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• Advantages:
– Takes into account all values
– Easy to interpret the result
• Disadvantages: the unit of variance is the square of the data's unit, so it has no direct meaning

Standard deviation ($\sigma$)

• Standard deviation (S.D) is the square root of variance
• S.D of a population: $\sigma = \sqrt{\sigma^2}$
• S.D of a sample: $s = \sqrt{s^2}$
• Advantages:
– Overcomes the disadvantage of the meaningless unit of variance
– The most widely used measure of dispersion (the bigger its value => the more spread out the data are)
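
The population and sample formulas differ only in the divisor (N versus n - 1). A minimal sketch, with assumed function names and an illustrative set of five marks:

```python
from math import sqrt


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


marks = [100, 40, 40, 35, 35]  # illustrative data, mean 50

print(population_variance(marks))              # 630.0
print(sample_variance(marks))                  # 787.5
print(round(sqrt(sample_variance(marks)), 2))  # 28.06, the sample S.D in marks
```

The variance is in "marks squared", while the standard deviation is back in the original unit.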


Application of this in finance

• The variance (or S.D) of an investment can be used as a measure of risk, e.g. on profits/returns
• Larger variance => larger risk
• Usually, the higher the rate of return, the higher the risk

Example – 2 funds over 10 years (1)

• Rates of return (%):
A: 8.3 -6.2 20.9 -2.7 33.6 42.9 24.4 5.2 3.1 30.5
B: 12.1 -2.8 6.4 12.2 27.8 25.3 18.2 10.7 -1.3 11.4

$\bar{x}_A = 16\%$, $\bar{x}_B = 12\%$
$s_A^2 = 280.34\ (\%)^2$, $s_B^2 = 99.37\ (\%)^2$

• Which fund would you invest in?
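
The summary figures for the two funds can be reproduced from the listed rates of return. This sketch assumes the sample formulas (divisor n - 1), which is what `statistics.variance` and `statistics.stdev` compute.

```python
from statistics import mean, stdev, variance

# Annual rates of return (%) for the two funds over 10 years.
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]

for name, returns in (("A", fund_a), ("B", fund_b)):
    print(
        name,
        round(mean(returns), 2),      # average return (%)
        round(variance(returns), 2),  # sample variance (%^2)
        round(stdev(returns), 2),     # sample standard deviation (%)
    )
# A 16.0 280.34 16.74
# B 12.0 99.37 9.97
```

Fund A has the higher average return and the larger variance; Fund B has the lower average return and the smaller variance.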

Example – 2 funds over 10 years (2)

=> Depending on how risk-averse you are:
Fund A: higher risk, but also higher average rate of return.
Fund B: lower risk, but also lower average rate of return.

Empirical rule or the law of 3σ

• For a normal (bell-shaped, symmetrical) distribution:
– 68.26% of all observations fall within 1 standard deviation of the mean, i.e. in the range $(\bar{x} - 1s)$ to $(\bar{x} + 1s)$
– 95.45% of all observations fall within 2 standard deviations of the mean, i.e. in the range $(\bar{x} - 2s)$ to $(\bar{x} + 2s)$
– 99.73% of all observations fall within 3 standard deviations of the mean, i.e. in the range $(\bar{x} - 3s)$ to $(\bar{x} + 3s)$
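
The three percentages can be checked against the standard normal distribution with `statistics.NormalDist`. The helper names are assumptions. Computed directly, the first figure rounds to 68.27%; the 68.26% on the slide is the value usually obtained from rounded z-tables. For comparison, the sketch also shows Chebyshev's bound (the general case named in the outline), which guarantees at least 1 - 1/k² of the observations within k standard deviations for any distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


def chebyshev_lower_bound(k):
    """Minimum share within k standard deviations for any distribution."""
    return 1 - 1 / k ** 2


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 2), round(100 * chebyshev_lower_bound(k), 2))
# 1 68.27 0.0
# 2 95.45 75.0
# 3 99.73 88.89
```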


Meaning of the law of 3σ

• Convert z-score to probability (next lecture)
• Identify outliers

Boxplot

Here is the boxplot of the height of international students studying at UNSW.

[Figure: Boxplot of Height. Vertical axis: Height, 150 to 200. Labelled parts, top to bottom: whisker, upper quartile, box, median, lower quartile, whisker.]


Boxplots

• Need MEDIAN and QUARTILES to create a boxplot
• MEDIAN = middle of observations, i.e. ½ way through observations
• QUARTILES = mark quarter points of observations, i.e. ¼ (Q1) and ¾ (Q3) of the way through data [positions (n+1)/4 and 3(n+1)/4]
• INTERQUARTILE RANGE (IQR) = Q3 - Q1
• Whiskers: max length is 1.5*IQR; stretch from box to furthest data point (within this range)
• Points further out from box marked with stars; called outliers

Shapes of Boxplots

• Skewness/symmetry
• Modality
• Range

[Figure: Boxplot of Symmetric, Positive skew, Negative skew, Bimodal. Vertical axis: Data, -5.0 to 5.0.]
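
The numbers behind a boxplot can be computed without drawing it. The sketch below assumes the (n+1)/4 and 3(n+1)/4 quartile positions (the "exclusive" method of `statistics.quantiles`) and the 1.5*IQR whisker limit; the function name `boxplot_summary` and the data are hypothetical.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Quartiles, IQR, whisker ends and outliers for a boxplot."""
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    # Whiskers may extend at most 1.5 * IQR beyond the box.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        # Each whisker stops at the furthest data point within its limit.
        "whiskers": (min(inside), max(inside)),
        "outliers": sorted(x for x in values if x < lower_fence or x > upper_fence),
    }


# Hypothetical data with n = 11, so positions 3, 6 and 9 are whole numbers.
data = [2, 4, 5, 7, 8, 10, 12, 13, 15, 18, 40]
summary = boxplot_summary(data)
print(summary["q1"], summary["median"], summary["q3"])  # 5.0 10.0 15.0
print(summary["whiskers"], summary["outliers"])         # (2, 18) [40]
```

Here the upper limit is 15 + 1.5 * 10 = 30, so the upper whisker stops at 18 and the value 40 is marked as an outlier.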

Coefficient of skewness (C of S)

• This measures the shape of the distribution
• There are several measures of skewness
• Below is a common one: Pearson's coefficient of skewness.
$\text{Coefficient of skewness} = \frac{3 \times (\text{mean} - \text{median})}{\text{standard deviation}}$
• If C of S is nearly +3 or -3, the distribution is highly skewed
• If C of S is positive => distribution is skewed to the right (positive skew)
• If C of S is negative => distribution is skewed to the left (negative skew)

Activity 1

• Summary statistics of two data sets are as follows:

                      Set 1: Ages of students    Set 2: Wages of staff
                      studying at UNSW
Mean                  22.4839                    294.3
Median                21                         292.5
Standard deviation    6.3756                     125.93

=> Compute Pearson's coefficient of skewness for these data sets and describe the shapes of their distributions
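
One way to work through the activity is a direct application of the formula. The function and variable names are assumptions of this sketch.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson's coefficient of skewness: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev


ages_skew = pearson_skewness(22.4839, 21, 6.3756)
wages_skew = pearson_skewness(294.3, 292.5, 125.93)

print(round(ages_skew, 3))   # 0.698
print(round(wages_skew, 3))  # 0.043
```

Both coefficients are positive. The ages are moderately skewed to the right, while the wages coefficient is close to zero, which suggests a nearly symmetrical distribution.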


Distribution shapes

[Figure: two histograms with Frequency on the vertical axis. Left panel: age, skewed to the right. Right panel: wages, nearly normal.]

Investigating the relationship between variables

• Covariance
• Correlation coefficient


Covariance Correlation coefficient
