# Lecture 2: Graphical Techniques and Numerical Measures

Graphical Excellence

"Graphical excellence" deals with the effective use of graphical techniques. Effective graphical techniques give the viewer a presentation of the data that is
– informative,
– concise,
– clear.

How can we achieve graphical excellence?

Graphical excellence is achieved when
– The graph presents large data sets concisely and coherently.
– The ideas and concepts to be delivered are clearly understood by the viewer.
– The graph encourages the viewer to compare variables.
– The display induces the viewer to address the substance of the data and not the form of the graph.
– There is no distortion of what the data reveal.

Graphical Deception

It is important to be able to evaluate critically the information presented by graphical techniques. Things to be cautious about when observing a graph:
– Is there a missing scale on one axis?
– Do not be influenced by a graph's caption.
– Are changes presented in absolute values only, or in percent form too?

[Figure: example charts of graphical deception. The panels ask: Is there a missing scale on one axis? Are changes presented in absolute values only, or in percent form too? Has any axis been stretched? Axis labels: Time, Dollars.]

Measures of Central Location (Central Tendency)

Usually, we focus our attention on two aspects of measures of central location:
– Measure of the central data point (the average).
– Measure of dispersion of the data about the average.

The central data point reflects the locations of all the actual data points.
– With one data point, clearly the central location is at the point itself.
– With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
– If the third data point appears exactly in the middle of the current range, the central location should not change (because it is currently residing in the middle).
– But if the third data point appears on the left-hand side of the midrange, it should "pull" the central location to the left.

Arithmetic mean
– This is the most popular and useful measure of central location.

$$\text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}}$$

Sample mean (where $n$ is the sample size):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Population mean (where $N$ is the population size):

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

• Example 4.1

The mean of the sample of six measurements 7, 3, 9, -2, 4, 6 is

$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{6} = \frac{7 + 3 + 9 - 2 + 4 + 6}{6} = 4.5$$
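
The calculation in Example 4.1 can be checked with a few lines of Python. This is only a minimal sketch; the function name `arithmetic_mean` is an illustrative choice, not something defined in the lecture.

```python
# Sample of six measurements from Example 4.1
sample = [7, 3, 9, -2, 4, 6]


def arithmetic_mean(values):
    """Sum of the measurements divided by the number of measurements."""
    return sum(values) / len(values)


sample_mean = arithmetic_mean(sample)
print(sample_mean)
```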

• Example 4.2

Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is

$$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{x_1 + x_2 + \dots + x_{200}}{200} = \frac{42.19 + 15.30 + \dots + 53.21}{200} = 43.59$$

• Example 4.3

When many of the measurements have the same value, the measurements can be summarized in a frequency table. Suppose the number of children in a sample of 16 employees were recorded as follows:

| NUMBER OF CHILDREN | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| NUMBER OF EMPLOYEES | 3 | 4 | 7 | 2 |

The frequencies add up to 16 employees, so

$$\bar{x} = \frac{\sum_{i=1}^{16} x_i}{16} = \frac{x_1 + x_2 + \dots + x_{16}}{16} = \frac{3(0) + 4(1) + 7(2) + 2(3)}{16} = 1.5$$
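
As a sketch of how the mean is obtained from a frequency table, the Example 4.3 data can be stored as value-to-count pairs. The dictionary layout and variable names are assumptions made for illustration.

```python
# Frequency table from Example 4.3:
# number of children -> number of employees
frequency = {0: 3, 1: 4, 2: 7, 3: 2}

# Total number of measurements is the sum of the frequencies
n = sum(frequency.values())

# Each value contributes (value * frequency) to the sum of measurements
weighted_sum = sum(value * count for value, count in frequency.items())

mean_children = weighted_sum / n
print(n, mean_children)
```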

The median
– The median of a set of measurements is the value that falls in the middle when the measurements are arranged in order of magnitude.

Example 4.4

Seven employee salaries were recorded (in 1000s): 28, 60, 26, 32, 30, 26, 29. Find the median salary.
– Odd number of observations. First, sort the salaries. Then, locate the value in the middle.
– Sorted: 26, 26, 28, 29, 30, 32, 60. The median is 29.

Suppose one employee's salary of \$31,000 was added to the group recorded before. Find the median salary.
– Even number of observations. First, sort the salaries. Then, locate the values in the middle. There are two middle values!
– Sorted: 26, 26, 28, 29, 30, 31, 32, 60. The median is the average of the two middle values, (29 + 30)/2 = 29.5.
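
The odd and even cases of Example 4.4 can be reproduced with the standard library's `statistics.median`, which sorts the data and averages the two middle values when the count is even. The variable names below are illustrative.

```python
from statistics import median

# Salaries (in 1000s) from Example 4.4
salaries = [28, 60, 26, 32, 30, 26, 29]

# Odd number of observations: the single middle value
median_odd = median(salaries)

# Add the salary of 31: an even number of observations,
# so the median is the average of the two middle values
median_even = median(salaries + [31])

print(sorted(salaries), median_odd)
print(sorted(salaries + [31]), median_even)
```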

The mode
– The mode of a set of measurements is the value that occurs most frequently.
– A set of data may have one mode (or modal class), or two or more modes.
– The modal class: for large data sets the modal class is much more relevant than a single-value mode.

– Example 4.5
  The manager of a men's store observes the waist size (in inches) of trousers sold yesterday: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.
  The mode of this data set is 34 in.

This information seems valuable (for example, for the design of a new display in the store), much more than "the mean is 33.2 in.".
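
A short sketch of finding the mode for the Example 4.5 data. It uses `statistics.multimode` (Python 3.8+), which returns every most-frequent value, so data sets with two or more modes are handled as well.

```python
from statistics import multimode

# Waist sizes (in inches) from Example 4.5
waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]

# multimode returns a list of all values tied for the highest frequency
modes = multimode(waist_sizes)
print(modes)

# A made-up data set with two modes, for comparison
two_modes = multimode([1, 1, 2, 2, 3])
print(two_modes)
```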


• Example 4.6

A professor of statistics wants to report the results of a midterm exam, taken by 100 students. The data appear in file XM04. Find the mean, median, and mode, and describe the information they provide.

Excel descriptive statistics for Marks:

| Statistic | Marks |
|---|---|
| Mean | 73.98 |
| Standard Error | 2.1502 |
| Median | 81 |
| Mode | 84 |
| Standard Deviation | 21.5022 |
| Sample Variance | 462.3430 |
| Kurtosis | 0.3937 |
| Skewness | -1.0731 |
| Range | 89 |
| Minimum | 11 |
| Maximum | 100 |
| Sum | 7398 |
| Count | 100 |

The mean provides information about the overall performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams.

The median indicates that half of the class received a grade below 81%, and half of the class received a grade above 81%.

The mode must be used when data are qualitative. If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.

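Excel Histogram

| Bin | Frequency |
|---|---|
| 10 | 0 |
| 20 | 3 |
| 30 | 2 |
| 40 | 6 |
| 50 | 6 |
| 60 | 5 |
| 70 | 10 |
| 80 | 16 |
| 90 | 28 |
| 100 | 24 |
| More | 0 |

[Figure: histogram of Frequency against Bin.]

The histogram is skewed to the left. The modal class is the bin with the highest frequency (bin 90, frequency 28).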

Relationship among Mean, Median, and Mode

– If a distribution is symmetrical, the mean, median and mode coincide.
– If a distribution is nonsymmetrical, and skewed to the left or to the right, the three measures differ.

[Figure: a positively skewed distribution ("skewed to the right"), with the mode, median, and mean marked from left to right.]

[Figure: a negatively skewed distribution ("skewed to the left"), with the mean, median, and mode marked from left to right.]

The geometric mean
– This is a measure of the average growth rate.
– Let $R_i$ denote the rate of return in period $i$ ($i = 1, 2, \dots, n$). The geometric mean of the returns $R_1, R_2, \dots, R_n$ is the constant $R_g$ that produces the same terminal wealth at the end of period $n$ as do the actual returns for the $n$ periods.

For the given series of rates of return, the n-period return is calculated by $(1+R_1)(1+R_2)\cdots(1+R_n)$. If the rate of return was $R_g$ in every period, the n-period return would be calculated by $(1+R_g)^n$. Equating the two:

$$(1+R_1)(1+R_2)\cdots(1+R_n) = (1+R_g)^n$$

$$R_g = \sqrt[n]{(1+R_1)(1+R_2)\cdots(1+R_n)} - 1$$

– Example 4.7
  A firm's sales were \$1,000,000 three years ago.
  Sales have grown annually by 20%, 10%, -5%.
  Find the geometric mean rate of growth in sales.
– Solution
  Since $R_g$ is the geometric mean,

$$(1+R_g)^3 = (1+.2)(1+.1)(1-.05) = 1.2540$$

  Thus,

$$R_g = \sqrt[3]{(1+.2)(1+.1)(1-.05)} - 1 = .0784, \text{ or } 7.84\%.$$
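
The geometric mean of Example 4.7 can be sketched in Python as follows. The function name `geometric_mean_return` is hypothetical; it compounds the per-period growth factors and takes the n-th root.

```python
import math

# Annual growth rates from Example 4.7: 20%, 10%, -5%
growth_rates = [0.20, 0.10, -0.05]


def geometric_mean_return(rates):
    """Constant per-period rate giving the same terminal wealth."""
    terminal_wealth = math.prod(1 + r for r in rates)
    return terminal_wealth ** (1 / len(rates)) - 1


r_g = geometric_mean_return(growth_rates)
print(round(r_g, 4))

# Growing at r_g every year reproduces the same 3-period growth factor
print(round((1 + r_g) ** 3, 4))
```

Note that the arithmetic mean of the three rates, (20 + 10 - 5)/3 = 8.33%, would overstate the growth actually achieved.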

Measures of variability
(Dispersion or Spread)
– Measures of central location fail to tell the whole story about the distribution.
– A question of interest still remains unanswered: How typical is the average value of all the measurements in the data set? Or: How spread out are the measurements about the average value?

Observe two hypothetical data sets

– Low variability data set: the average value provides a good representation of the values in the data set.
– High variability data set: the same average value does not provide as good a representation of the values in the data set as before.

The range
– The range of a set of measurements is the difference between the largest and smallest measurements.
– Its major advantage is the ease with which it can be computed.
– Its major shortcoming is its failure to provide information on the dispersion of the values between the two end points.

But how do all the measurements spread out? The range cannot assist in answering this question.

[Figure: the range runs from the smallest measurement to the largest measurement; the spread of the measurements in between is unknown.]

The variance
– This measure of dispersion reflects the values of all the measurements.
– The variance of a population of $N$ measurements $x_1, x_2, \dots, x_N$ having a mean $\mu$ is defined as

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

– The variance of a sample of $n$ measurements $x_1, x_2, \dots, x_n$ having a mean $\bar{x}$ is defined as

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Consider two small populations:
Population A: 8, 9, 10, 11, 12
Population B: 4, 7, 10, 13, 16

The mean of both populations is 10, but measurements in B are much more dispersed than those in A. Thus, a measure of dispersion is needed that agrees with this observation.

Let us start by calculating the sum of deviations.

Population A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2. Sum = 0
Population B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6. Sum = 0

The sum of deviations is zero in both cases; therefore, another measure is needed.

The sum of squared deviations is used in calculating the variance. See example next.

Let us calculate the variance of the two populations

$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$

$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$

Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of dispersion instead? After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases!!
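
The two populations can be checked numerically. The sketch below first confirms that the plain deviations sum to zero, then computes the population variance with `statistics.pvariance` (which divides by N). Variable names are illustrative.

```python
from statistics import mean, pvariance

population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]

# Deviations from the mean always cancel out, so their sum is useless
# as a measure of dispersion
sum_dev_a = sum(x - mean(population_a) for x in population_a)
sum_dev_b = sum(x - mean(population_b) for x in population_b)

# Population variance: average squared deviation (divides by N)
var_a = pvariance(population_a)
var_b = pvariance(population_b)

print(sum_dev_a, sum_dev_b)
print(var_a, var_b)
```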

Which data set has a larger dispersion?

Data set A: the values 1 and 3, each occurring 5 times (mean 2).
Data set B: the values 1 and 5, each occurring once (mean 3).

Data set B is more dispersed around the mean.

Let us calculate the sum of squared deviations for both data sets:

$$\text{Sum}_A = (1-2)^2 + \dots + (1-2)^2 + (3-2)^2 + \dots + (3-2)^2 = 10$$

$$\text{Sum}_B = (1-3)^2 + (5-3)^2 = 8$$

The sum of squared deviations is larger for A, even though B is more dispersed! However, when calculated on a "per observation" basis (variance), the data set dispersions are properly ranked:

$$\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$$

$$\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$$

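– Example 4.8
  Find the mean and the variance of the following sample of measurements (in years).

  3.4, 2.5, 4.1, 1.2, 2.8, 3.7
– Solution

$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{3.4 + 2.5 + 4.1 + 1.2 + 2.8 + 3.7}{6} = \frac{17.7}{6} = 2.95$$

A shortcut formula:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$

$$s^2 = \frac{1}{5}\left[\left(3.4^2 + 2.5^2 + \dots + 3.7^2\right) - \frac{(17.7)^2}{6}\right] = 1.075 \text{ (years)}^2$$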
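
To see that the shortcut formula agrees with the definition, both can be evaluated on the Example 4.8 sample. This is a sketch with illustrative variable names; `statistics.variance` computes the sample variance with the n - 1 divisor.

```python
from statistics import variance

# Sample from Example 4.8 (in years)
sample = [3.4, 2.5, 4.1, 1.2, 2.8, 3.7]

n = len(sample)
sum_x = sum(sample)                   # sum of the measurements
sum_x2 = sum(x * x for x in sample)   # sum of the squared measurements

# Shortcut formula: [sum(x^2) - (sum(x))^2 / n] / (n - 1)
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Definition: sum of squared deviations from the mean, divided by n - 1
s2_definition = variance(sample)

print(round(sum_x / n, 2), round(s2_shortcut, 3), round(s2_definition, 3))
```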


– The standard deviation of a set of measurements is the square root of the variance of the measurements.

$$\text{Sample standard deviation: } s = \sqrt{s^2}$$

$$\text{Population standard deviation: } \sigma = \sqrt{\sigma^2}$$

– Example 4.9
  Rates of return over the past 10 years for two mutual funds are shown below. Which one has a higher level of risk?
  Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5
  Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4

– Solution
– Let us use the Excel printout that is run from the "Descriptive statistics" sub-menu (use file Xm0409).

| Statistic | Fund A | Fund B |
|---|---|---|
| Mean | 16 | 12 |
| Standard Error | 5.295 | 3.152 |
| Median | 14.6 | 11.75 |
| Mode | #N/A | #N/A |
| Standard Deviation | 16.74 | 9.969 |
| Sample Variance | 280.3 | 99.37 |
| Kurtosis | -1.34 | -0.46 |
| Skewness | 0.217 | 0.107 |
| Range | 49.1 | 30.6 |
| Minimum | -6.2 | -2.8 |
| Maximum | 42.9 | 27.8 |
| Sum | 160 | 120 |
| Count | 10 | 10 |

Fund A should be considered riskier because its standard deviation is larger.
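
The comparison in Example 4.9 can be sketched with `statistics.stdev` (sample standard deviation). One assumption to flag: the last Fund A return is entered as 30.5, the value that is consistent with the printout's Sum of 160 and Sample Variance of 280.3.

```python
from statistics import stdev

# Rates of return from Example 4.9.
# Assumption: last Fund A value is 30.5, matching the Excel printout totals.
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]

# Sample standard deviation (square root of the n - 1 variance)
sd_a = stdev(fund_a)
sd_b = stdev(fund_b)

# The fund with the larger standard deviation is treated as riskier
riskier = "Fund A" if sd_a > sd_b else "Fund B"
print(round(sd_a, 2), round(sd_b, 3), riskier)
```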

The coefficient of variation
– The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.

$$\text{Sample coefficient of variation: } cv = \frac{s}{\bar{x}}$$

$$\text{Population coefficient of variation: } CV = \frac{\sigma}{\mu}$$

– This coefficient provides a proportionate measure of variation.

A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
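
The comparison above can be made concrete with a small sketch. The helper name `coefficient_of_variation` is hypothetical; it simply divides the standard deviation by the mean.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation expressed as a proportion of the mean."""
    return std_dev / mean_value


# The same standard deviation of 10 against two different means
cv_mean_100 = coefficient_of_variation(10, 100)
cv_mean_500 = coefficient_of_variation(10, 500)

print(cv_mean_100, cv_mean_500)
```

Relative to a mean of 100 the variation is 10% of the mean; relative to a mean of 500 it is only 2%.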

Interpreting Standard Deviation
– The standard deviation can be used to
  – compare the variability of several distributions
  – make a statement about the general shape of a distribution.
– The empirical rule: If a sample of measurements has a mound-shaped distribution, the interval
  – $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
  – $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
  – $(\bar{x} - 3s, \bar{x} + 3s)$ contains virtually all of the measurements

– Example 4.10
  The durations of 30 long-distance telephone calls are shown next. Check the empirical rule for this set of measurements.
• Solution
  First check if the histogram has an approximate mound shape.

[Figure: histogram of the call durations; horizontal axis bins 2, 5, 8, 11, 14, 17, 20, More; vertical axis frequency from 0 to 10.]

• Calculate the mean and the standard deviation: Mean = 10.26; Standard deviation = 4.29.
• Calculate the intervals:

$$(\bar{x} - s, \bar{x} + s) = (10.26 - 4.29, 10.26 + 4.29) = (5.97, 14.55)$$

$$(\bar{x} - 2s, \bar{x} + 2s) = (1.68, 18.84)$$

$$(\bar{x} - 3s, \bar{x} + 3s) = (-2.61, 23.13)$$

| Interval | Empirical Rule | Actual percentage |
|---|---|---|
| 5.97, 14.55 | 68% | 70% |
| 1.68, 18.84 | 95% | 96.7% |
| -2.61, 23.13 | 100% | 100% |
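
The three intervals can be generated from the reported mean and standard deviation. The sketch below covers only that arithmetic; the 30 call durations themselves are not listed in these notes, so the actual percentages are not recomputed here.

```python
# Summary statistics reported for the telephone call durations
x_bar = 10.26
s = 4.29

# Interval (x_bar - k*s, x_bar + k*s) for k = 1, 2, 3 standard deviations
intervals = {
    k: (round(x_bar - k * s, 2), round(x_bar + k * s, 2))
    for k in (1, 2, 3)
}

for k, (low, high) in intervals.items():
    print(k, low, high)
```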

Other conclusions
– By the empirical rule, approximately 95% of the area under a mound-shaped histogram lies between $(\bar{x} - 2s, \bar{x} + 2s)$.
– Since about 95% of all the measurements fall within two standard deviations around the mean, the range spans approximately four standard deviations, so

$$s \cong \frac{\text{Range}}{4}$$

For the telephone calls duration problem the range is 19.5 - 2.3 = 17.2 minutes, so

$$s \cong \frac{17.2}{4} = 4.3 \text{ minutes}$$

The Chebyshev theorem
– Given any set of measurements and a number $k$ (not smaller than 1), the fraction of these measurements that lie within $k$ standard deviations around the mean is at least $1 - 1/k^2$.
– This theorem is valid for any set of measurements (sample, population) of any shape.

| k | Interval | Chebyshev | Empirical Rule |
|---|---|---|---|
| 1 | $(\bar{x} - s, \bar{x} + s)$ | at least 0% | approximately 68% |
| 2 | $(\bar{x} - 2s, \bar{x} + 2s)$ | at least 75% ($1 - 1/2^2 = 3/4$) | approximately 95% |
| 3 | $(\bar{x} - 3s, \bar{x} + 3s)$ | at least 89% ($1 - 1/3^2 = 8/9$) | approximately 100% |
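
The Chebyshev column of the table follows directly from the bound $1 - 1/k^2$. A minimal sketch, with a hypothetical helper name:

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations."""
    if k < 1:
        raise ValueError("k must not be smaller than 1")
    return 1 - 1 / k ** 2


bounds = {k: chebyshev_lower_bound(k) for k in (1, 2, 3)}

for k, bound in bounds.items():
    print(k, f"at least {bound:.0%}")
```

The bound is weaker than the empirical rule because it makes no assumption about the shape of the distribution.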
