Professional Documents
Culture Documents
Lecture 1
Lecture 1
Categories.
treatment groups
exposure groups
disease status
Categorical Variables
Dichotomous (binary) – two levels
Dead/alive
Treatment/placebo
Disease/no disease
Exposed/Unexposed
Heads/Tails
Pulmonary Embolism (yes/no)
Male/female
Categorical Variables
Nominal variables – Named categories
Order doesn’t matter!
Counts
Time
Age
Height
Quantitative Variables
Discrete Numbers – a limited set of distinct
values, such as whole numbers.
no
yes
Bar Chart for SI categories
200.0 Much easier to
183.3 extract information
166.7 from a bar chart
than from a table!
Number of Patients
150.0
133.3
116.7
100.0
83.3
66.7
50.0
33.3
16.7
0.0
1 2 3 4 5 6 7 8 9 10
maximum (1.7)
Outliers
1.3
Q3 + 1.5IQR
= .8+1.5(.25)=1.17
“whisker” 5
75th percentile (0.8)
0.7 interquartile range median (.66)
(IQR) = .8-.55 = .25 25th percentile (0.55)
8.3
0.0
0.0 0.7 1.3 2.0
SI
Histogram
6.0 100 bins (too much detail)
4.0
Percent
2.0
0.0
0.0 0.7 1.3 2.0
SI
Histogram
200.0 2 bins (too little detail)
133.3
Percent
66.7
0.0
0.0 0.7 1.3 2.0
SI
Box Plot: Shock Index
2.0
skew”
1.3
0.7
0.0
SI
Box Plot: Age
100.0
maximum
More symmetric
interquartile range
Years
median
25th percentile
33.3
minimum
0.0
AGE
Variables
Histogram: Age
14.0 Not skewed, but not
bell-shaped either…
9.3
Percent
4.7
0.0
0.0 33.3 66.7 100.0
AGE (Years)
Some histograms from our class
data (n=18 so far…)
Starting with politics…
Feelings about math and
writing…
Last year’s class (for comparison)
Sleep behaviors…
Dietary behaviors…
Dietary behaviors…
Physical activity …
Optimism…
Last year’s class, for
comparison…
Health Care Debate…
Measures of central
tendency
Mean
Median
Mode
Central Tendency
Mean – the average; the balancing
point
∑X i
17 + 19 + 21 + 22 + 23 + 23 + 23 + 38
i =1
X= = = 23.25
n 8
Mean of age in Kline’s data
556.9546
14.0
Percent
9.3
4.7
0.0
0.0 33.3 66.7 100.0
Mean of age in Kline’s data
14.0
9.3
Percent
4.7
0.0
0.0 33.3 66.7 100.0
X
i 1
i
181 Histogram
* 1 750 * 0 181
X .1944
100.0 n 931 931
80.56%
66.7 (750)
Percent
33.3
19.44%
(181)
0.0
0.0 0.3 0.7 1.0
PE
Mean
The mean is affected by extreme values
(outliers)
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
Mean = 3 Mean = 4
1 2 3 4 5 15 1 2 3 4 10 20
3 4
5 5 5 5
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Central Tendency
Median – the exact middle value
Calculation:
If there are an odd number of observations,
14.0
Percent
9.3
4.7
0.0
0.0 33.3 66.7 100.0
AGE (Years)
Median of age in Kline’s data
14.0
50% 50%
of mass of
mass
9.3
Percent
4.7
0.0
0.0 33.3 66.7 100.0
Does PE have a median?
Yes, if you line up the 0’s and 1’s, the
middle number is 0.
Median
The median is not affected by extreme
values (outliers).
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
Median = 3 Median = 3
Central Tendency
Mode – the value that occurs most
frequently
Mode: example
Some data:
Age of participants: 17 19 21 22 23 23 23 38
9.3
Percent
4.7
0.0
0.0 33.3 66.7 100.0
AGE (Years)
Range of PE?
1-0 = 1
Quartiles
25% 25% 25% 25%
Q Q Q
1 2 3
The first quartile, Q1, is the value for which
25% of the observations are smaller and 75%
are larger
Q2 is the same as the median (50% are
smaller, 50% are larger)
Only 25% of the observations are greater than
the third quartile
Interquartile Range
Median
minimum Q1 (Q2) Q3 maximum
15 35 49 65 94
Interquartile range
= 65 – 35 = 30
Variance
Average (roughly) of squared deviations
of values from the mean
2
(x X )
i
2
S i
n 1
Why squared deviations?
Adding deviations will yield a sum of 0.
Absolute values are tricky!
Squares eliminate the negatives.
Result:
Increasing contribution to the variance as
you go farther from the mean.
Standard Deviation
Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data
n
(x X )
i
2
S i
n 1
Calculation Example:
Sample Standard Deviation
Age data (n=8) : 17 19 21 22 23 23 23 38
SD is 79/6 = 13
4.7
0.0
0.0 33.3 66.7 100.0
AGE (Years)
Std. Deviation age
0.0
0.0 0.5 1.0 1.5 2.0
SI
Std. Deviation SI
Variation Section of SI
80.56%
19.44%
Std. Deviation PE
Variation Section of PE
Standard
Parameter Variance Deviation
Value 0.156786 0.3959621
Comparing Standard
Deviations
Data A
Mean = 15.5
11 12 13 14 15 16 17 18 19 20 21 S = 3.338
Data B
Mean = 15.5
11 12 13 14 15 16 17 18 19 20 21 S = 0.926
Data C
Mean = 15.5
11 12 13 14 15 16 17 18 19 20 21 S = 4.570
Bienaymé-Chebyshev Rule
Regardless of how the data are distributed,
a certain percentage of values must fall
within K standard deviations from the mean:
68% of
the data
from: ER Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut,
1983, p.69
Notice the X-
axis
From: Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot
Wainer, H. 1997, p.29.
Correctly scaled X-axis…
Report of the Presidential Commission on the Space Shuttle
Challenger Accident, 1986 (vol 1, p. 145)
The graph excludes the observations where no O-rings failed.
Smooth curve at least shows the trend toward failure at high and
low temperatures…
http://www.math.yorku.ca/SCS/Gallery/
Even better: graph all the data (including non-failures)
using a logistic regression model
from: ER Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut,
1983, p.74
What’s the message here?