Professional Documents
Culture Documents
(Descriptive Statistics)
1
Graphical Excellence
“Complex ideas communicated with
clarity, precision, and efficiency”
Shows the data
Makes you think about substance rather than
method, graphic design, or something else
Many numbers in a small space
Makes large data sets coherent
Encourages the eye to compare different
pieces of the data
2
Charles Joseph Minard
Available at
http://www.math.yorku.ca/SCS/Gallery/
Source: Minard, C. J. Carte figurative et approximative des quantités de vin français exportés
par mer en 1864. 1865. ENPC (École Nationale des Ponts et Chaussées), 1865.
Also available in: Tufte, Edward R. The Visual Display of Quantitative Information. Cheshire, CT:
Graphics Press, 2001.
3
Summarizing Categorical Data
A frequency table shows the number Bar charts and Pie Charts are used
of occurrences of each category. to graph categorical data. A Pareto
Relative frequency is the proportion chart is a bar chart with categories
of the total in each category. arranged from the highest to lowest
(QC: “vital few from the trivial many”).
Sp ark
Lo se
oa A
B
Total 670 100.0
e d ups
oa p
p
Te ers
ro
r C Dro
r C ter
W ter
ou
P
D
au a C
s
in
H
er
ol cal
g
at
rti
nt
Ve
le
le
ol
H
R
4
Pie Chart and Bar Chart of Attraction
Popularity at an Amusement Park
20.0
15.0
10.0
5.0
0.0
Sp ark
Lo se
oa A
e d ups
oa p
p
Te ers
ro
r C Dro
r C ter
W ter
ou
P
D
au a C
s
in
H
er
ol cal
g
at
rti
nt
Ve
le
le
ol
H
R
Available at
http://www.math.yorku.ca/SCS/Gallery/
Histogram
7
Scatter Plot of Iris Data
4.0
iris[, "Sepal W", "Setosa"]
3.5
3.0
2.5
0 10 20 30 40 50
observation number
8
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Scatter Plot of Iris Data with Observation
Number Indicated
16
34
33
4.0
15
6 17
1920 45 47
11 22 49
5 23 38
3.5
1 18 28 37 41 44
iris21
78 12 21 25 27 29 32 40
24 50
3 30 36 43 48
4 10 31 35
3.0
2 1314 26 39 46
9
2.5
plot(iris21)
text(iris21)
42
0 10 20 30 40 50
observation number 9
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
3.0
Plot of data using jitter function in S-Plus
3.0
2.5
2.5
2.0
2.0
jitter(x)
1.5
1.5
x
1.0
1.0
0.5
0.5
0.0
0.0
0 100 200 300 400 500 0 100 200 300 400 500
Frequency
Compression
0 5 10 15 20 25 30
Production Order Compression
13
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Histogram of Iris Data with Density Curve
1.0
0.8
0.6
0.4
0.2
0.0
2.0 2 .5 3.0 3 .5 4 .0 4 .5
s iris2 1
14
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Stem and Leaf Diagram Cum. Dist. Function
Data: Gas Mileage
Stem Leaf Count
5 0 1 CDF Plot
4 1.0
0.9
4 00 2 0.8
3 5678 4 0.7
Cum Prob
3 01112 5 0.6
0.5
2 557889 6
0.4
2 0134 4 0.3
1 7 1 0.2
1 3 1 0.1
10 15 20 25 30 35 40 45 50 55
Shows distribution of data Miles per gallon
similar to a histogram but
preserves the actual data. Step occurs at each data value
Can see numerical patterns (higher for more values at the
in the data (like 40’s and 50). same data point).
15
Stem and Leaf Diagram for Iris Data
• Decimal point is 1 place to the left of the colon
• 23 : 0
• 24 :
• 25 :
• 26 :
• 27 :
• 28 :
• 29 : 0
• 30 : 000000
• 31 : 0000
• 32 : 00000
• 33 : 00
• 34 : 000000000
• 35 : 000000
• 36 : 000
• 37 : 000
• 38 : 0000
• 39 : 00
• 40 : 0
• 41 : 0
• 42 : 0
• 43 :
• 44 : 0
16
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Summary Statistics for Numerical Data
Measures of Location: n
x1 + x2 + + xn ∑x i
Mean (“average”): x= = i =1
n n
Median: middle of the ordered sample (like θ.5 for distribution)
xmin = x(1) ≤ x(2) ≤ …≤ x(n) = xmax
⎧ x⎛ n+1 ⎞ if n is odd
⎪ ⎜⎝ 2 ⎟⎠
⎪
median= ⎨ ⎡ ⎤
⎪ 2 ⎢ x⎛ n ⎞ + x⎛ n+2 ⎞ ⎥ if n is even
1
⎪⎩ ⎢⎣ ⎜⎝ 2 ⎟⎠ ⎜⎝ 2 ⎟⎠ ⎥⎦
Median of {0,1,2} is 1 : n=3 so n+1=4 & (n+1)/2=2 (2nd value)
Median of {0,1,2,3} is 1.5 (assumes data is continuous): n=4
Mode: The most common value 17
Mean or Median?
Appropriate summary of the center of the data?
– Mean if the data has a symmetric distribution with light tails
(i.e. a relatively small proportion of the observations lie
away from the center of the data).
– Median if the distribution has heavy tails or is asymmetric.
Extreme values that are far removed from the main body of the
data are called outliers.
– Large influence on the mean but not on the median.
Right and left skewness (asymmetry)
mode
mode (high point)
median
median
mean
mean
18
Quantiles, Fractiles, Percentiles
For a theoretical distribution:
The pth quantile is the value of a random variable X,
xp, such that P(X<xp)=p. For the normal dist’n:
In S-Plus: qnorm(p), 0<p<1, gives the quantile.
In S-Plus: pnorm(q) gives the probability.
For a sample:
The order statistics are the sample values in
ascending order. Denoted X(1) ,…X(n)
The pth quantile is the data value in the sorted
sample, such that a fraction p of the data is less than
or equal to that value.
19
1.0
Normal CDF
pnorm(0.8416212)=0.8
0.8
0.6
pnorm(x)
0.4
0.2
0.0
qnorm(0.8)=0.8416212
-3 -2 -1 0 1 2 3
x 20
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
An algorithm for finding sample quantiles:
21
Quantiles, continued
(pth quantile is 100pth percentile)
Example:
Data: {0, 1, 2, 3, 4, 5, 6}
= {x(1),x(2),x(3),x(4),x(5),x(6), x(7)}
n=7
Q1 = ceiling(0.25*7) = 2 ⇒ Q1 = x(2)= 1 = 25th percentile
Q2 = ceiling(0.50*7) = 4 ⇒ Q2 = x(4)= 3 = median (50th percentile)
Q3 = ceiling(0.75*7) = 6 ⇒ Q3 = x(6)= 5 = 75th percentile
22
Measures of Dispersion (Spread, Variability):
Two data sets may have the same center and but quite
different dispersions around it.
Two ways to summarize variability:
1. Give the values that divide the data into equal parts.
– Median is the 50th percentile
– The 25th, 50th, and 75th percentiles are called
quartiles (Q1,Q2,Q3) and divide the data into four
equal parts.
– The minimum, maximum, and three quartiles are
called the “five number summary” of the data.
2. Compute a single number, e.g., range, interquartile
range, variance, and standard deviation.
23
Measures of Dispersion, continued
1 n
1 ⎡ n
2⎤
Sample variance: s =
2
∑
n − 1 i =1
( xi − x ) =
2
⎢ ∑
n − 1 ⎣ i =1
xi − n( x ) ⎥
2
⎦
Sample standard deviation: s = s2
Sample mean, variance, and standard deviations are sample analogs
of the population mean, variance, and standard deviation (µ, σ2, σ)
24
Other Measures of Dispersion
25
Computations for Measures of Dispersion
Example:
Data: {0, 1, 2, 3, 4, 5, 6}
= {x(1),x(2),x(3),x(4),x(5),x(6), x(7)}
mean = (0+1+2+3+4+5+6)/ 7 = 21/ 7 = 3
min = 0, max = 6
Q1 = x(2)= 1 = 25th percentile
Q2 = x(4)= 3 = median (50th percentile)
Q3 = x(6)= 5 = 75th percentile
Range = max - min = 6 - 0 = 6
IQR = Q3 - Q1 = 5 - 1 = 4
s2 = [(02+12+22+32+42+52+62) - 7(32)]/(7-1) = [91-63]/6 =4.67
s = sqrt(4.67) = 2.16
26
Sample Variance and Standard Deviation
s2 and s should only be used to summarize dispersion with
symmetric distributions.
For asymmetric distribution, a more detailed breakup of the
dispersion must be given in terms of quartiles.
For normal data and large samples:
– 50% of the data values fall between mean ± 0.67s
– 68% of the data values fall between mean ± 1s
– 95% of the data values fall between mean ± 2s
– 99.7% of the data values fall between mean ± 3s
For normally distributed data:
IQR=(mean + 0.67s) - (mean - 0.67s) = 1.34s
27
Standard Normal Density
0.4
0.3
dnorm(x)
0.2
0.1
68%
0.0
-4 -2 0 2 4
x 28
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Box (and Whiskers) Plots
Visual display of summary of data (more than five numbers)
Outlier Box Plot Data: Gas Mileage Quantile Box Plot
IQR = Q3 - Q1
Upper Fence = Q3 + 1.5 x IQR
90th
Lower Fence = Q1 – 1.5 x IQR percentile
Rectangle:
Q3 Two lines are called whiskers
and extend to the most extreme
median data values that are still inside
the fences.
Q1
Observations outside the
fences are regarded as 10th
percentile
possible outliers and are
denoted by dots and circles or
asterisks.
29
4.0
3.5
3.0
2.5
Box Plot for Iris Data
iris21
30
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
QQ Plots
Compare Sample to Theoretical
Distribution
Order the data. The ith ordered data value is the pth quantile,
where p = (i - 0.5)/n, 0<p<1.
Text uses i/(n+1).
(Why can’t we just say i/n)?
31
4.0
3.5
QQ (Normal) Plot for Iris Data
iris21
3.0
2.5
-2 -1 0 1 2
4
2
0
-2 -1 0 1 2
0
-1
-2 -1 0 1 2
Quantiles of Standard Normal 35
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
40
30
20
10
0
Histogram of the same data
0 2 4 6 8 10
x 36
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Summarizing Multivariate Data
When two or more variables are measured on each
sampling unit, the result is multivariate data.
If only two variables are measured the result is bivariate
data. One variable may be called the x variable and the
other the y variable.
We can analyze the x and y variable separately with the
methods we have learned so far, but these methods would
NOT answer questions about the relationship between x
and y.
– What is the nature of the relationship between x and y (if
any)?
– How strong is the relationship?
– How well can one variable be predicted from the other?
37
Summarizing Bivariate Categorical Data
Two-way Table
Overall Job Satisfaction
Annual Very Slightly Slightly Very Satisfied Row Sum
Salary Dissatisfied Dissatisfied Satisfied
Less than 81 64 29 10 184
$10,000
$10,000- 73 79 35 24 211
25,000
$25,000- 47 59 75 58 239
50,000
More than 14 23 84 69 190
$50,000
39
Simpson’s Paradox
40
Doctors’ Salaries
• The interpreter of a survey of doctors’
salaries in 1990 and again in 2000
concluded that their average income
actually declined from $97,000 in 1990
to $91,000 in 2000.”
• Income is measured here in nominal
(not adjusted for inflation) dollars.
41
What about the “Rest of the
Story”?
• What deductive piece of logic might
clarify the real meaning of this particular
pair of statistics?
• Look more deeply: Is there a piece
missing?
• Here is a very simple breakdown of “the
numbers” that may help.
42
Doctors’ Salaries by Age
1980 1990
Age fraction, f1 Income fraction, f2 Income
<=45 0.5 $60,000 0.7 $70,000
>45 0.5 $120,000 0.3 $130,000
45
Statistical Ideal
Randomized study
Practical reality
Benjamin Disraeli
47
Summarizing Bivariate Numerical Data
No. Method Method
1 (xi) 2 (yi) 100
90
1 88 86
80
2 78 81 70
Method 2
3 90 87 60
50
4 91 90
40
5 89 89 30
6 79 80 20
10
7 76 74
0
8 80 78 75 80 85 90 95
9 78 76
Method 1
10 90 86
48
Labeled Scatter Plot
Year Country Country Country Country
A B C D
1965 64.7 64.8 61.1 86.2
1970 65.0 65.2 61.2 86.5
Can you see the
1975 66.8 66.3 63.0 87.4
1980 66.9 67.4 62.8 87.0
improvements in
1985 67.9 68.5 63.1 89.2 the literacy rates
1990 68.3 69.1 63.5 89.4 for these four
1995 70.8 69.4 64.3 90.1 countries more
2000 71.7 70.0 65.1 90.5
easily in the
100.0 Table or in the
90.0
80.0
Figure?
70.0
Literary Rate
60.0 Country A
Country B
50.0
Country C
40.0
Country D
30.0
20.0
10.0
0.0
1960 1965 1970 1975 1980 1985 1990 1995 2000 2005
Year 49
Sample Correlation Coefficient
A single numerical summary statistic which measures
the strength of a linear relationship between x and y.
n
s xy 1
r=
sx s y
where s xy = ∑
n − 1 i =1
( xi − x )( yi − y )
r = covar(x,y)/(stddev(x)*stddev(y))
40
20
0
0 20 40 60 80 100
x
51
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
15
10 What is the correlation?
y
5
0
-4 -2 0 2 4
x
52
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
1.0
0.5
0.0
What is the correlation?
y
-0.5
-1.0
-4 -2 0 2 4
x
53
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Correlation and Causation
High correlation is frequently mistaken for a cause and effect
relationship. Such a conclusion may not be valid in
observational studies, where the variables are not controlled.
– A lurking variable may be affecting both variables.
– One can only claim association, not causation.
Countries with high fat diets tend to have higher incidences of
cancer. Can we conclude causation?
A common lurking variable in many studies is time order.
– Wealth and health problems go up with age.
Does wealth cause health problems?
Sometimes correlations can be found without any plausible
explanation, e.g., sun spots and economic cycles.
54
Plots for Multivariate Data
55
Box Plots of Auto Data
widths indicate number of each type
35
fuel.frame[, "Mileage"]
30
25
20
fuel.frame[, "Type"] 56
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Scatter plot matrix Iris –(Versicolor)
2.0 2.4 2.8 3.2 1.0 1.2 1.4 1.6 1.8
7.0
6.5
6.0
Sepal.L.
5.5
5.0
3.2
2.8
Sepal.W.
2.4
2.0
5.0
4.5
4.0
Petal.L.
3.5
3.0
1.8
1.6
1.4
Petal.W.
1.2
1.0
5.0 5.5 6.0 6.5 7.0 3.0 3.5 4.0 4.5 5.0
57
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
• Galaxy S-PLUS Language Reference
• Radial Velocity of Galaxy NGC7531
• SUMMARY:
• The galaxy data frame records the radial velocity of a spiral galaxy
measured at 323 points in the area of sky which it covers. All the
measurements lie within seven slots crossing at the origin. The positions
of the measurements given by four variables (columns).
• ARGUMENTS:
• east.west
– the east-west coordinate. The origin, (0,0), is near the center of the
galaxy, east is negative, west is positive.
• north.south
– the north-south coordinate. The origin, (0,0), is near the center of the
galaxy, south is negative, north is positive.
• angle
– degrees of counter-clockwise rotation from the horizontal of the slot
within which the observation lies.
• radial.position
– signed distance from origin; negative if east-west coordinate is
negative.
• velocity
– radial velocity measured in km/sec. .
58
This output was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Galaxy Data
-40 -20 0 20 40 1400 1500 1600 1700
30
10
east.west
-10
-30
20 40
north.south
0
-40
20 40 60
radial.position
0
-40
1600
velocity
1400
60
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Earthquake Data
36.0 36.5 37.0 37.5 38.0 38.5
-120
-121
longitude
-122
-123
38.0
latitude
37.0
36.0
5
magnitude
4
3
-123 -122 -121 -120 3 4 5 61
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Earthquake 3D
62
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Narrative Graphics of Space
and Time
• Adding spatial dimensions to a graph so that
the data are moving over space and time can
enhance the explanatory power of time series
displays
• The Classic of Charles Joseph Minard (1781-
1870) shows the terrible fate of Napoleon’s
army during his Russian campaign of 1812.
A copy of the map is available at
http://www.math.yorku.ca/SCS/Gallery/
Map Source: Minard, C. J. Carte figurative des pertes successives en hommes de l'armée
qu'Annibal conduisit d'Espagne en Italie en traversant les Gaules (selon Polybe). Carte figurative
des pertes successives en hommes de l'armée française dans la campagne de Russie, 1812-
1813. École Nationale des Ponts et Chaussées (ENPC), 1869. Also available in: Tufte, Edward 63
R. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 2001.
Beginning at the left on the Polish-Russian border near
the Niemen River the thick band shows the size of the
army (422,000) as it invaded Russia in June 1812.
– The width of the band indicates the size of the
army…
– The army reached a sacked and deserted Moscow
with 100,000 men
– Napoleon’s retreat path from Moscow is depicted by
a dark, lower band, linked to a temperature scale
and dates at the bottom.
– The men struggled into Poland with only 10,000
troops remaining.
64
• Minard’s graphic tells a rich, coherent
story with its multivariate data, far more
enlightening than just a single number
65
Scatter plot matrix of air data set in S-Plus
0 50 100 200 300 5 10 15 20
5
4
ozone
3
2
1
250
radiation
150
0 50
90
80
temperature
70
60
20
15
wind
10
5
1 2 3 4 5 60 70 80 90 66
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
ozone
5
4
3
2
1
plot(temperature,ozone)
60 70 80 90
temperature
67
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Fitting Lines
We often try to fit a straight line to bivariate data as a way to summarize
bivariate data:
fit = a + bx
The fit is often denoted by yˆi = a +bxi . The residuals are yi − yˆi .
What about curvature and outliers?
68
Resistant Line
Divide x data into thirds. Find median of x in each third, and
median of the y’s that correspond to the x’s in each third.
Call these three pairs (xa, ya), (xb, yb), (xc, yc). Fit a least-squares
line to these three points.
3
2
1
60 70 80 90
temperature
70
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Prediction and Residuals
Fitted lines can be used to predict. If we go too far beyond
range of x-data, we can expect poor results. Consider problems
of interpolation and extrapolation.
1 n
∑ ( i i)
2
s= y − y
ˆ
n − 2 i =1
0
-1
0 20 40 60 80 100
73
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
resid(lmfit)
2
1
0
-1
Residuals vs. Fitted Values for ozone data
fitted(lmfit)
74
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Smoothing
• Fitting curves to data
• Separate Signal from noise
• Fitted values, ŷ , are a weighted average of the
response y.
• Weights are a function of predictor x.
• Degrees of freedom indicate roughness
• Simple linear regression, df=2
75
plot(temperature,ozone)
5
4 lines(smooth.spline(temperature,ozone,df=16.5))
ozone
3
2
1
60 70 80 90
temperature
76
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
plot(temperature,ozone)
5
4
lines(smooth.spline(temperature,ozone,df=6))
ozone
3
2
1
60 70 80 90
temperature
77
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Time-Series/Runs Chart
100.0 Plot of Compression
90.0
80.0
vs. Time (Order of
70.0 Production)
Compression
60.0
30.0
process not in
20.0 “statistical control” as
10.0
seen from the
0.0
0 5 10 15 20 25 30 downward drift.
Production Order
78
Time-Series Data
Data obtained at successive time points for the same
sampling unit(s).
79
Data Smoothing and Forecasting
Two types of averages for time-series data:
1. Moving averages
2. Exponentially weighted averages
80
(Arithmetic) Moving Averages (MA)
The average of a set of w successive data values (called a
window); the oldest data is successively dropped off.
xt − w+1 + … + xt
MAt = for t = w, w + 1, … , T
w
The bigger the window (w), the more the smoothing.
EWMAt = w xt + (1 − w) EWMAt −1
where 0 < w < 1 is the smoothing constant (usually 0.2 to 0.3).
∑(x t −1 − x )( xt − x )
r1 = t =2
T
∑ t
( x
t =1
− x ) 2
∑(x t −k − x )( xt − x )
rk = t = k +1
T
∑ t
( x
t =1
− x ) 2
83
Lag Plots in S-Plus
lag.plot(x) or plot(x[1:(n-i)],x[(i+1):n])
Housing starts 1966:1974, lagged scatterplots
200
200
200
Series 1
Series 1
Series 1
150
150
150
100
100
100
50
50
50
200
200
200
Series 1
Series 1
Series 1
150
150
150
100
100
100
50
50
50