Lecture 3 - Numerical Statistics

Lecture 4
BUSINESS STATISTICS
Advanced Educational Program

DESCRIPTIVE STATISTICS: Numerical summaries
Reading materials: Chap 4 (Keller)

Outline
• Measures of center:
  – Mean, median, mode
  – Selection of measures of location
• Measures of dispersion (spread):
  – Range, quartile range, quartile deviation, variance, standard deviation
• Empirical rule (general case: Chebyshev’s law)
• Coefficient of skewness
• Coefficient of variation

Measure of center and spread

Measures of center
• A measure of center or location shows where the center of the data is
• Three most useful measures of location:
  – Arithmetic mean/average
  – Median
  – Mode

Arithmetic mean from raw data
• Arithmetic mean from population: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
• Arithmetic mean from sample: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Where: $X_i$, $x_i$ - the value of each item
       $N$, $n$ - total number of items

Arithmetic mean from frequency table
• Apply this formula for the sample: $\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$
Where: $x_i$ - the value of class $i$
       $f_i$ - frequency of class $i$

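As a minimal sketch of the two formulas above, the Python below computes the mean from raw data and from a frequency table. The raw values are borrowed from the median example in this lecture and the frequency table from the mode example; the function name `mean_from_frequency_table` is an illustrative choice, not part of the lecture.

```python
def mean_from_frequency_table(values, freqs):
    """Mean from a frequency table: sum(x_i * f_i) / sum(f_i)."""
    total = sum(freqs)
    if total == 0:
        raise ValueError("frequencies must not sum to zero")
    return sum(x * f for x, f in zip(values, freqs)) / total


# Mean from raw data: sum of the items divided by the number of items.
raw = [11, 11, 13, 14, 17]
raw_mean = sum(raw) / len(raw)

# Mean from a frequency table (class values and their frequencies).
values = [8, 12, 16, 17, 19]
freqs = [3, 7, 12, 8, 5]
table_mean = mean_from_frequency_table(values, freqs)

# Cross-check: expanding the table back into raw data gives the same mean.
expanded = [x for x, f in zip(values, freqs) for _ in range(f)]
expanded_mean = sum(expanded) / len(expanded)

print(raw_mean, table_mean, expanded_mean)
```

The cross-check shows that the frequency-table formula is only a shortcut for the raw-data formula when each class value stands for `f_i` identical items.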
Advantages and disadvantages of arithmetic mean
• Advantages:
  – Easy to understand and calculate
  – The value of every item is included => representative of the whole set of data
• Disadvantages:
  – Sensitive to outliers

Mean is sensitive to outliers
• Sample: (43; 38; 37; ...; 27; 34) => $\bar{x} = 33.5$
• Contaminated sample: (43; 38; 37; ...; 27; 1934) => $\bar{x} = 71.5$

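The effect can be reproduced on a small scale. The sketch below uses a short hypothetical sample (the lecture's full sample is elided, so these six values are an assumption made only for illustration) and replaces the last value 34 with the mistyped 1934.

```python
from statistics import mean, median

# Hypothetical six-value sample; NOT the full sample from the lecture.
clean = [43, 38, 37, 30, 27, 34]
contaminated = clean[:-1] + [1934]  # last value recorded as 1934 instead of 34

clean_mean = mean(clean)
contaminated_mean = mean(contaminated)
clean_median = median(clean)
contaminated_median = median(contaminated)

print("mean:  ", clean_mean, "->", contaminated_mean)
print("median:", clean_median, "->", contaminated_median)
```

One bad value multiplies the mean roughly tenfold here, while the median moves by only two units.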
Median
• Median is the value of the observation which is located in the middle of the data set
• Steps to find median:
  1. Arrange the observations in order of size (normally ascending order)
  2. Find the number of observations and hence the middle observation
  3. The median is the value of the middle observation

Calculate median from raw data
• If the data has an odd number of observations:
  – Middle observation: the $\frac{n+1}{2}$-th
  – $\text{Median} = x_{\left(\frac{n+1}{2}\right)}$
• If the data has an even number of observations:
  – There are two observations located in the middle, and
    $\text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$

Example
• E.g. 1. Raw data: 11, 11, 13, 14, 17 => find median
• E.g. 2. Raw data: 11, 11, 13, 14, 16, 17 => find median

Advantages and disadvantages of median
• Advantages:
  – Easy to understand and calculate
  – Not affected by outlying values => thus can be used when the mean would be misleading
• Disadvantages:
  – Value of one observation => fails to reflect the whole data set
  – Not easy to use in other analysis

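The median steps (sort, count, pick the middle) can be sketched as follows and applied to the two raw data sets from the example. The function name `median_from_raw` is an illustrative choice.

```python
def median_from_raw(data):
    """Median via the lecture's steps: sort, count, take the middle."""
    ordered = sorted(data)          # step 1: arrange in ascending order
    n = len(ordered)                # step 2: number of observations
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th observation (1-based position).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th observations.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


median_odd = median_from_raw([11, 11, 13, 14, 17])        # 5 observations
median_even = median_from_raw([11, 11, 13, 14, 16, 17])   # 6 observations
print(median_odd, median_even)
```

For the odd case the median is the third value, 13; for the even case it is the average of 13 and 14, i.e. 13.5.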
Mode
• Mode is the value which occurs most frequently in the data set
• Steps to find mode:
  1. Draw a frequency table for the data
  2. Identify the mode as the most frequent value

Example to calculate mode
X     Frequency
8     3
12    7
16    12
17    8
19    5

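A minimal sketch of the two steps, using the frequency table from the example. The helper names are illustrative; a list is returned so that ties (bimodal or multimodal data) are reported as well.

```python
from collections import Counter


def modes_from_table(freq_table):
    """Step 2: return every value whose frequency equals the maximum."""
    top = max(freq_table.values())
    return sorted(x for x, f in freq_table.items() if f == top)


def modes_from_raw(data):
    """Step 1 (build the frequency table) followed by step 2."""
    return modes_from_table(Counter(data))


# Frequency table from the example: value -> frequency.
freq_table = {8: 3, 12: 7, 16: 12, 17: 8, 19: 5}
example_modes = modes_from_table(freq_table)
print(example_modes)
```

In the example table the value 16 has the highest frequency (12), so it is the only mode.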
Mean, median and mode in normal and skewed distributions

Bimodal and multimodal data
[Figure: two panels, Bimodal (two modes) and Multimodal (several modes)]

Which measure of centre is best?
• Mean is generally the most commonly used
• Mean is sensitive to extreme values
• If data are skewed or extreme values are present, the median is better, e.g. real estate prices
• Mode is generally best for categorical data, e.g. restaurant service quality (ordinal, below): the mode is "Very good"

Rating          # customers
Excellent       20
Very good       50
Good            30
Satisfactory    12
Poor            10
Very Poor       6

Measures of dispersion (variability)
• Measures of dispersion tell you how spread out all other values of the distribution are from the central tendency
• Measures of dispersion:
  – The range, quartile range, and quartile deviation
  – Variance and standard deviation

Why do we need measures of dispersion?
• Two data sets of midterm marks of 5 students:
  – First set: 100, 40, 40, 35, 35 => Mean: 50
  – Second set: 70, 55, 50, 40, 35 => Mean: 50
=> Which mean (first or second) is more reliable?
• Need to know the spread of the other values around the central tendency; this is especially important when analysing the stock market.

Range
• Range is the difference between the largest and smallest value => sort data before computing range
• Formula: Range = maximum value - minimum value
• Advantages of range: easy to calculate for ungrouped data
• Disadvantages:
  – Takes into account only two values
  – Affected by one or two extreme values
  – More difficult to calculate for grouped data

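As a quick illustration, the range formula can be applied to the two sets of midterm marks from earlier in the lecture, which share the same mean of 50. The helper name `value_range` is an illustrative choice; in code, `max` and `min` find the extremes directly, so explicit sorting is only needed when working by hand.

```python
def value_range(data):
    """Range = maximum value - minimum value."""
    if not data:
        raise ValueError("range of empty data is undefined")
    return max(data) - min(data)


first_set = [100, 40, 40, 35, 35]   # mean 50
second_set = [70, 55, 50, 40, 35]   # mean 50

first_range = value_range(first_set)
second_range = value_range(second_set)
print(first_range, second_range)
```

Both sets have the same mean, but the first has a range of 65 against 35 for the second, driven entirely by the single mark of 100.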
Quartiles
• Quartiles are defined as values of observations which are a quarter of the way through the data
  – Q1 - the first quartile: the value of the observation below which 25% of observations fall
  – Q2 - the second quartile: the median (50% of the observations fall below)
  – Q3 - the third quartile: the value of the observation below which 75% of observations fall

Quartile range and quartile deviation
• Quartile range = Q3 – Q1
• Quartile deviation = $\frac{Q_3 - Q_1}{2}$
• Advantages of quartile deviation (semi-interquartile range): less affected by extreme values
• Disadvantages: takes into account only 50% of the data

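A minimal sketch of these quantities follows. It assumes the position rule k(n+1)/4 quoted on the Boxplots slide of this lecture, with linear interpolation between neighbouring observations when the position is not a whole number; other textbooks and software use slightly different conventions, so results may differ. The data are the six values from the median example, and the function name `quartile` is illustrative.

```python
def quartile(data, k):
    """k-th quartile (k = 1, 2, 3) at 1-based position k * (n + 1) / 4.

    Assumption: linear interpolation between neighbouring observations
    when the position is fractional.
    """
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2 or 3")
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("quartiles of empty data are undefined")
    pos = min(max(k * (n + 1) / 4, 1), n)   # keep the position inside 1..n
    lower = int(pos)                        # observation just below the position
    upper = min(lower + 1, n)               # observation just above
    frac = pos - lower
    return ordered[lower - 1] + frac * (ordered[upper - 1] - ordered[lower - 1])


data = [11, 11, 13, 14, 16, 17]
q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))
quartile_range = q3 - q1
quartile_deviation = quartile_range / 2
print(q1, q2, q3, quartile_range, quartile_deviation)
```

With n = 6 the positions are 1.75, 3.5 and 5.25, giving Q1 = 11, Q2 = 13.5 (the median) and Q3 = 16.25.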
Variance
• Variance from population: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
• Variance from sample: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• Advantages:
  – Takes into account all values
  – Easy to interpret the result
• Disadvantages: the unit of variance has no meaning

Standard deviation ($\sigma$)
• Standard deviation (S.D) is the square root of variance
• S.D from population: $\sigma = \sqrt{\sigma^2}$
• S.D from sample: $s = \sqrt{s^2}$
• Advantages:
  – Overcomes the disadvantage of the meaningless unit of variance
  – The most widely used measure of dispersion (the bigger its value => the more spread out are the data)

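The formulas above can be sketched directly, which makes the difference between the population divisor N and the sample divisor n - 1 visible. The data are the second set of midterm marks from earlier in the lecture; the function names are illustrative.

```python
from math import sqrt


def population_variance(data):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(data)
    if n == 0:
        raise ValueError("variance of empty data is undefined")
    mu = sum(data) / n
    return sum((x - mu) ** 2 for x in data) / n


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)


marks = [70, 55, 50, 40, 35]               # mean 50, squared deviations sum to 750
pop_var = population_variance(marks)       # 750 / 5
samp_var = sample_variance(marks)          # 750 / 4
pop_sd = sqrt(pop_var)                     # back in the original unit (marks)
samp_sd = sqrt(samp_var)
print(pop_var, samp_var, round(pop_sd, 2), round(samp_sd, 2))
```

The variances are in squared marks, while the standard deviations are in marks again, which is why the S.D is easier to relate to the data.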
Application of this in finance
• Variance (or S.D) of an investment can be used as a measure of risk, e.g. on profits/return
• Larger variance => larger risk
• Usually, higher rate of return, higher risk

Example – 2 funds over 10 years (1)
• Rates of return (%):
  A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5
  B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4
• $\bar{x}_A = 16\%$, $\bar{x}_B = 12\%$
• $s_A^2 = 280.34\ (\%)^2$, $s_B^2 = 99.37\ (\%)^2$
• Which fund will you invest in?

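The summary figures for the two funds can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) divisor. This is only a verification sketch of the numbers quoted in the example.

```python
from statistics import mean, stdev, variance

# Annual rates of return (%) over 10 years, as listed in the example.
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]

summary = {
    name: {
        "mean": mean(returns),          # average rate of return (%)
        "variance": variance(returns),  # sample variance, in (%)^2
        "sd": stdev(returns),           # sample standard deviation, in %
    }
    for name, returns in (("A", fund_a), ("B", fund_b))
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimals this reproduces means of 16% and 12% and sample variances of 280.34 and 99.37, so fund A has both the higher average return and the higher variance.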
Example – 2 funds over 10 years (2)
=> Depending on how risk-averse you are:
   Fund A: higher risk, but also higher average rate of return.
   Fund B: lower risk, but also lower average rate of return.

Empirical rules or the law of 3σ
• For a normal (symmetrical, bell-shaped) distribution, approximately:
  – 68.26% of all obs fall within 1 standard deviation of the mean, i.e. in the range $(\bar{x} - 1s)$ to $(\bar{x} + 1s)$
  – 95.45% of all obs fall within 2 standard deviations of the mean, i.e. in the range $(\bar{x} - 2s)$ to $(\bar{x} + 2s)$
  – 99.73% of all obs fall within 3 standard deviations of the mean, i.e. in the range $(\bar{x} - 3s)$ to $(\bar{x} + 3s)$

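The three percentages can be recovered from the standard normal distribution, as sketched below with `statistics.NormalDist` from the Python standard library. The outline names Chebyshev's law as the general case; the sketch assumes its usual form, that at least 1 - 1/k² of any distribution lies within k standard deviations of the mean (for k > 1). The function names are illustrative.

```python
from statistics import NormalDist


def normal_coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


def chebyshev_lower_bound(k):
    """Assumed form of Chebyshev's law: at least 1 - 1/k^2, for any distribution."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


coverage = {k: normal_coverage(k) for k in (1, 2, 3)}
bounds = {k: chebyshev_lower_bound(k) for k in (2, 3)}

for k, share in coverage.items():
    print(f"within {k} s.d.: {share:.4%}")
for k, bound in bounds.items():
    print(f"Chebyshev, within {k} s.d.: at least {bound:.2%}")
```

The normal-distribution shares come out at about 68.27%, 95.45% and 99.73%, matching the rule up to rounding, while the Chebyshev bounds (75% and about 88.9%) are weaker because they hold for any shape of distribution.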
Meaning of the law of 3σ
• Convert z-score to probability (next lecture)
• Identify outliers

Boxplot
Here is the boxplot of the height of international students studying at UNSW.
[Figure: Boxplot of Height. Vertical axis: Height, 150 to 200. Labelled parts: whiskers, box, upper quartile, median, lower quartile]

Boxplots
• Need MEDIAN and QUARTILES to create a boxplot
• MEDIAN = middle of observations, i.e. ½ way through observations
• QUARTILES = mark quarter points of observations, i.e. ¼ (Q1) and ¾ (Q3) of the way through data [positions (n+1)/4 and 3(n+1)/4]
• INTERQUARTILE RANGE (IQR) = Q3 - Q1
• Whiskers: max length is 1.5*IQR; stretch from box to furthest data point (within this range)
• Points further out from box are marked with stars; called outliers

Shapes of Boxplots
• Skewness/symmetry
• Modality
• Range
[Figure: Boxplot of Symmetric, Positive skew, Negative skew, Bimodal. Vertical axis: Data, -5.0 to 5.0]

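The boxplot ingredients listed above can be sketched numerically. The heights below are hypothetical values invented for illustration (they are not the UNSW data behind the figure). The sketch assumes `statistics.quantiles` with `method="exclusive"`, which places the quartiles at the (n+1)/4 and 3(n+1)/4 positions; plotting software may use other conventions.

```python
from statistics import quantiles


def boxplot_summary(data):
    """Quartiles, IQR, whisker ends and outliers using the 1.5 * IQR rule."""
    ordered = sorted(data)
    q1, med, q3 = quantiles(ordered, n=4, method="exclusive")
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr     # furthest a lower whisker may reach
    high_limit = q3 + 1.5 * iqr    # furthest an upper whisker may reach
    inside = [x for x in ordered if low_limit <= x <= high_limit]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],    # furthest data point within the limit
        "whisker_high": inside[-1],
        "outliers": [x for x in ordered if x < low_limit or x > high_limit],
    }


# Hypothetical heights in cm, for illustration only.
heights = [152, 160, 165, 168, 170, 172, 175, 178, 180, 183, 210]
summary = boxplot_summary(heights)
print(summary)
```

Here Q1 = 165 and Q3 = 180, so the IQR is 15 and the whiskers may reach 142.5 and 202.5 at most; the height 210 lies beyond that and would be drawn as an outlier.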
Coefficient of skewness (C of S)
• This measures the shape of distribution
• There are some measures of skewness.
• Below is a common one: Pearson’s coefficient of skewness.
  Coefficient of skewness = 3 x (mean - median) / standard deviation
• If C of S is nearly +3 or -3, the distribution is highly skewed
• If C of S is positive => distribution is skewed to the right (positive skew)
• If C of S is negative => distribution is skewed to the left (negative skew)

Activity 1
• Summary statistics of two data sets are as follows:

                      Set 1: Ages of students     Set 2: Wages of staff
                      studying at UNSW
Mean                  22.4839                     294.3
Median                21                          292.5
Standard deviation    6.3756                      125.93

=> Compute the Pearson’s coefficient of skewness of these data sets and describe the shapes of their distributions

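A short sketch of the calculation for Activity 1, applying Pearson's formula to the summary statistics in the table. The function name `pearson_skewness` is an illustrative choice.

```python
def pearson_skewness(mean, median, sd):
    """Pearson's coefficient of skewness: 3 * (mean - median) / standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return 3 * (mean - median) / sd


ages_skew = pearson_skewness(mean=22.4839, median=21, sd=6.3756)
wages_skew = pearson_skewness(mean=294.3, median=292.5, sd=125.93)

print(f"ages:  {ages_skew:.3f}")
print(f"wages: {wages_skew:.3f}")
```

The ages give a coefficient of roughly 0.70, a clear positive (right) skew, while the wages give roughly 0.04, which is close to zero and so close to symmetrical.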
Distribution shapes
[Figure: two histograms with Frequency on the vertical axis: age (skewed to the right) and wages (nearly normal)]

Investigating the relationship between variables
• Covariance
• Correlation coefficient
