Lecture 1

Looking at Data
Clinical Data Example
 1. Kline et al. (2002)

 The researchers analyzed data from 934 emergency
room patients with suspected pulmonary embolism
(PE). Only about 1 in 5 actually had PE. The researchers
wanted to know what clinical factors predicted PE.
 I will use four variables from their dataset today:
 Pulmonary embolism (yes/no)
 Age (years)
 Shock index = heart rate/systolic BP
 Shock index categories = take shock index and divide it into 10
groups (lowest to highest shock index)
Descriptive Statistics
Types of Variables: Overview
Categorical Quantitative
binary nominal ordinal discrete continuous

2 categories +
more categories +
order matters +
numerical +
uninterrupted
Categorical Variables
 Also known as “qualitative.”
 Categories.
 treatment groups
 exposure groups
 disease status
 Dichotomous (binary) – two levels
 Dead/alive
 Treatment/placebo
 Disease/no disease
 Exposed/Unexposed
 Heads/Tails
 Pulmonary Embolism (yes/no)
 Male/female
 Nominal variables – Named categories
Order doesn’t matter!
 The blood type of a patient (O, A, B, AB)

 Marital status
 Occupation
 Ordinal variable – Ordered categories. Order
matters!
 Staging in breast cancer as I, II, III, or IV

 Birth order—1st, 2nd, 3rd, etc.
 Letter grades (A, B, C, D, F)
 Ratings on a scale from 1-5
 Ratings on: always; usually; many times; once in a
while; almost never; never
 Age in categories (10-20, 20-30, etc.)
 Shock index categories (Kline et al.)
Quantitative Variables
 Numerical variables; may be
arithmetically manipulated.
 Counts
 Time
 Age
 Height
 Discrete Numbers – a limited set of distinct
values, such as whole numbers.
 Number of new AIDS cases in CA in a year (counts)

 Years of school completed
 The number of children in the family (cannot have a half
a child!)
 The number of deaths in a defined time period (cannot
have a partial death!)
 Roll of a die
 Continuous Variables - Can take on any
number within a defined range.
 Time-to-event (survival time)
 Age
 Blood pressure
 Serum insulin
 Speed of a car
 Income
 Shock index (Kline et al.)
Looking at Data
  How are the data distributed?
 Where is the center?
 What is the range?
 What’s the shape of the distribution (e.g., Gaussian,
binomial, exponential, skewed)?
 Are there “outliers”?
 Are there data points that don’t make

sense?
The first rule of statistics:
USE COMMON SENSE!
90% of the information is

contained in the graph.
Bar Chart
 Used for categorical variables to show
frequency or proportion in each
category.
 Translate the data from frequency
tables into a pictorial representation…
Bar Chart: categorical
variables
no
yes
Bar Chart for SI categories
200.0 Much easier to
183.3 extract information
166.7 from a bar chart
than from a table!
Number of Patients
150.0
133.3
116.7
100.0
83.3
66.7
50.0
33.3
16.7
0.0
1 2 3 4 5 6 7 8 9 10
Shock Index Category

Box plot and histograms: for
continuous variables
 To show the distribution (shape, center,
range, variation) of continuous
variables.
Box Plot: Shock Index
2.0
Shock Index Units
maximum (1.7)
Outliers
1.3
Q3 + 1.5IQR
= .8+1.5(.25)=1.17
“whisker” 5
75th percentile (0.8)
0.7 interquartile range median (.66)
(IQR) = .8-.55 = .25 25th percentile (0.55)
minimum (or Q1-

1.5IQR)
0.0
SI
Histogram of SI
25.0 Bins of size 0.1
Note the “right skew”

16.7
Percent
8.3
0.0
0.0 0.7 1.3 2.0
SI
Histogram
6.0 100 bins (too much detail)
4.0
Percent
2.0
0.0
0.0 0.7 1.3 2.0
SI
Histogram
200.0 2 bins (too little detail)
133.3
Percent
66.7
0.0
0.0 0.7 1.3 2.0
SI
Box Plot: Shock Index
2.0
Also shows the “right

Shock Index Units
skew”
1.3
0.7
0.0
SI
Box Plot: Age
100.0
maximum
More symmetric
66.7 75th percentile
interquartile range
Years
median
25th percentile
33.3
minimum
0.0
AGE
Variables
Histogram: Age
14.0 Not skewed, but not
bell-shaped either…
9.3
Percent
4.7
0.0
0.0 33.3 66.7 100.0
AGE (Years)
Some histograms from our class
data (n=18 so far…)
Starting with politics…
Feelings about math and
writing…
Last year’s class (for comparison)
Sleep behaviors…
Dietary behaviors…
Dietary behaviors…
Physical activity …
Optimism…
Last year’s class, for
comparison…
Health Care Debate…
Measures of central
tendency
 Mean
 Median
 Mode
Central Tendency
 Mean – the average; the balancing
point
calculation: the sum of values divided by the

sample size
n
In math ∑xX1 + X 2 +  + X n
shorthand: i =1
X= =
n n
Mean: example
Some data:
Age of participants: 17 19 21 22 23 23 23 38
∑X i
17 + 19 + 21 + 22 + 23 + 23 + 23 + 38
i =1
X= = = 23.25
n 8
Mean of age in Kline’s data
Means Section of AGE

GeometricHarmonic
Parameter Mean Median Mean Mean Sum Mode
Value 50.19334 49 46.66865 43.00606 46730 49
556.9546
14.0
Percent
9.3
4.7
0.0
0.0 33.3 66.7 100.0
Mean of age in Kline’s data
14.0
9.3
Percent
4.7
0.0
0.0 33.3 66.7 100.0
The balancing point

Mean of Pulmonary Embolism?
(Binary variable?)
n
X
i 1
i
181 Histogram
* 1  750 * 0 181
X    .1944
100.0 n 931 931
80.56%
66.7 (750)
Percent
33.3
19.44%
(181)
0.0
0.0 0.3 0.7 1.0
PE
Mean
 The mean is affected by extreme values
(outliers)
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
Mean = 3 Mean = 4
1  2  3  4  5 15 1  2  3  4  10 20
 3  4
5 5 5 5
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Central Tendency
 Median – the exact middle value
Calculation:
 If there are an odd number of observations,
find the middle value

 If there are an even number of observations,
find the middle two values and average them.

Median: example
Some data:
Median = (22+23)/2 = 22.5

Median of age in Kline’s data
GeometricHarmonic
Parameter Mean Median Mean Mean Sum Mode
Value 50.19334 49 46.66865 43.00606 46730 49
14.0
Percent
9.3
4.7
0.0
0.0 33.3 66.7 100.0
AGE (Years)
Median of age in Kline’s data
14.0
50% 50%
of mass of
mass
9.3
Percent
4.7
0.0
0.0 33.3 66.7 100.0
Does PE have a median?
 Yes, if you line up the 0’s and 1’s, the
middle number is 0.
Median
 The median is not affected by extreme
values (outliers).
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
Median = 3 Median = 3
Central Tendency
 Mode – the value that occurs most
frequently
Mode: example
Some data:
Mode = 23 (occurs 3 times)

Mode of age in Kline’s data
GeometricHarmonic
Parameter Mean Median Mean Mean Sum
Mode
Value 50.19334 49 46.66865 43.00606 46730 49
Mode of PE?
 0 appears more than 1, so 0 is the
mode.
Measures of
Variation/Dispersion
 Range
 Percentiles/quartiles
 Interquartile range
 Standard deviation/Variance
Range
 Difference between the largest and the

smallest observations.
Range of age: 94 years-15 years = 79 years
14.0
9.3
Percent
4.7
0.0
0.0 33.3 66.7 100.0
AGE (Years)
Range of PE?
 1-0 = 1
Quartiles
25% 25% 25% 25%
Q Q Q
1 2 3
 The first quartile, Q1, is the value for which
25% of the observations are smaller and 75%
are larger
 Q2 is the same as the median (50% are
smaller, 50% are larger)
 Only 25% of the observations are greater than
the third quartile
Interquartile Range
 Interquartile range = 3rd quartile – 1st

quartile = Q3 – Q1
Interquartile Range: age
Median
minimum Q1 (Q2) Q3 maximum
25% 25% 25% 25%
15 35 49 65 94
Interquartile range
= 65 – 35 = 30
Variance
 Average (roughly) of squared deviations
of values from the mean
2
 (x  X )
i
2
S  i
n 1
Why squared deviations?
 Adding deviations will yield a sum of 0.
 Absolute values are tricky!
 Squares eliminate the negatives.
 Result:
 Increasing contribution to the variance as
you go farther from the mean.
Standard Deviation
 Most commonly used measure of variation
 Shows variation about the mean
 Has the same units as the original data
n
 (x  X )
i
2
S i
n 1
Calculation Example:
Sample Standard Deviation
Age data (n=8) : 17 19 21 22 23 23 23 38
n=8 Mean = X = 23.25
(17  23.25) 2  (19  23 .25) 2    (38  23 .25) 2

S
8 1
280
  6 .3
7
Std. dev is a measure of
the “average” scatter
14.0
around the mean.
Estimation method: if the
distribution is bell shaped,
the range is around 6 SD,
9.3 so here rough guess for
Percent
SD is 79/6 = 13
4.7
0.0
0.0 33.3 66.7 100.0
AGE (Years)
Std. Deviation age
Variation Section of AGE

Standard
Parameter Variance Deviation
Value 333.1884 18.25345
Std Dev of Shock Index
250.0
Std. dev is a measure of

187.5
around the mean.
Count
125.0 Estimation method: if

the distribution is bell
shaped, the range is
around 6 SD, so here
62.5 rough guess for SD is
1.4/6 =.23
0.0
0.0 0.5 1.0 1.5 2.0
SI
Std. Deviation SI
Variation Section of SI
Standard Std Error Interquartile

Parameter Variance Deviation of Mean Range Range
Value 4.155749E-02 0.2038566 6.681129E-03 0.2460432 1.430856
Std. Dev of binary variable, PE
181 * (1  .1944 ) 2  750 * (0  .1944 ) 2
S Std. dev is a measure of
931  1
145 .8 around the mean.
  .3959
930
80.56%
19.44%
Std. Deviation PE
Variation Section of PE
Standard
Parameter Variance Deviation
Value 0.156786 0.3959621
Comparing Standard
Deviations
Data A
Mean = 15.5
11 12 13 14 15 16 17 18 19 20 21 S = 3.338
Data B
Mean = 15.5
11 12 13 14 15 16 17 18 19 20 21 S = 0.926
Data C
Mean = 15.5
11 12 13 14 15 16 17 18 19 20 21 S = 4.570
Bienaymé-Chebyshev Rule
 Regardless of how the data are distributed,
a certain percentage of values must fall
within K standard deviations from the mean:
Note use of  (sigma) to

Note use of  (mu) to represent “standard deviation.”
represent “mean”.
At least within
(1 - 1/12) = 0% …….….. k=1 (μ ± 1σ)
(1 - 1/22) = 75% …........ k=2 (μ ± 2σ)
(1 - 1/32) = 89% ………....k=3 (μ ± 3σ)
Symbol Clarification
 S = Sample standard deviation
(example of a “sample statistic”)
  = Standard deviation of the entire
population (example of a “population
parameter”) or from a theoretical
probability distribution
 X = Sample mean
 µ = Population or theoretical mean
**The beauty of the normal curve:
No matter what  and  are, the area between - and

+ is about 68%; the area between -2 and +2 is
about 95%; and the area between -3 and +3 is
about 99.7%. Almost all values fall within 3 standard
deviations.
68-95-99.7 Rule
68% of
the data
95% of the data
99.7% of the data

Summary of Symbols
 S2= Sample variance
 S = Sample standard dev
 2 = Population (true or theoretical) variance
  = Population standard dev.
 X = Sample mean
 µ = Population mean
 IQR = interquartile range (middle 50%)
Examples of bad graphics
What’s wrong with
this graph?
from: ER Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut,
1983, p.69
Notice the X-
axis
From: Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot
Wainer, H. 1997, p.29.
Correctly scaled X-axis…
Report of the Presidential Commission on the Space Shuttle
Challenger Accident, 1986 (vol 1, p. 145)
The graph excludes the observations where no O-rings failed.
Smooth curve at least shows the trend toward failure at high and
low temperatures…
http://www.math.yorku.ca/SCS/Gallery/
Even better: graph all the data (including non-failures)
using a logistic regression model
Tappin, L. (1994). "Analyzing data relating to the Challenger disaster".

Mathematics Teacher, 87, 423-426
What’s wrong with
this graph?
from: ER Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut,
1983, p.74
What’s the message here?
Diagraphics II, 1994

Diagraphics II, 1994
For more examples…
 http://www.math.yorku.ca/SCS/Gallery/
References
 http://www.math.yorku.ca/SCS/Gallery/
 Kline et al. Annals of Emergency Medicine 2002; 39: 144-152.
 Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
 Tappin, L. (1994). "Analyzing data relating to the Challenger disaster". Mathematics Teacher, 87, 423-
426
 Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut, 1983.
 Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot
Wainer, H. 1997.

Lecture 1

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lecture 1

Uploaded by

Copyright:

Available Formats

Looking at Data

Clinical Data Example

 1. Kline et al. (2002)

binary nominal ordinal discrete continuous

 The blood type of a patient (O, A, B, AB)

 Staging in breast cancer as I, II, III, or IV

 Number of new AIDS cases in CA in a year (counts)

 Are there “outliers”?

 Are there data points that don’t make

90% of the information is

Shock Index Category

minimum (or Q1-

25.0 Bins of size 0.1

Note the “right skew”

Also shows the “right

66.7 75th percentile

calculation: the sum of values divided by the

Means Section of AGE

The balancing point

find the middle value

find the middle two values and average them.

Median = (22+23)/2 = 22.5

Mode = 23 (occurs 3 times)

 Difference between the largest and the

 Interquartile range = 3rd quartile – 1st

25% 25% 25% 25%

n=8 Mean = X = 23.25

(17  23.25) 2  (19  23 .25) 2    (38  23 .25) 2

Variation Section of AGE

Std. dev is a measure of

125.0 Estimation method: if

Standard Std Error Interquartile

Note use of  (sigma) to

No matter what  and  are, the area between - and

95% of the data

99.7% of the data

Tappin, L. (1994). "Analyzing data relating to the Challenger disaster".

Diagraphics II, 1994

You might also like