You are on page 1of 89

Looking at Data

Clinical Data Example

 1. Kline et al. (2002)


 The researchers analyzed data from 934 emergency
room patients with suspected pulmonary embolism
(PE). Only about 1 in 5 actually had PE. The researchers
wanted to know what clinical factors predicted PE.
 I will use four variables from their dataset today:
 Pulmonary embolism (yes/no)
 Age (years)
 Shock index = heart rate/systolic BP
 Shock index categories = take shock index and divide it into 10
groups (lowest to highest shock index)
Descriptive Statistics
Types of Variables: Overview
Categorical Quantitative

binary nominal ordinal discrete continuous


2 categories +
more categories +
order matters +
numerical +
uninterrupted
Categorical Variables
 Also known as “qualitative.”

 Categories.

 treatment groups
 exposure groups
 disease status
Categorical Variables
 Dichotomous (binary) – two levels

 Dead/alive
 Treatment/placebo
 Disease/no disease
 Exposed/Unexposed
 Heads/Tails
 Pulmonary Embolism (yes/no)
 Male/female
Categorical Variables
 Nominal variables – Named categories
Order doesn’t matter!

 The blood type of a patient (O, A, B, AB)


 Marital status
 Occupation
Categorical Variables
 Ordinal variable – Ordered categories. Order
matters!

 Staging in breast cancer as I, II, III, or IV


 Birth order—1st, 2nd, 3rd, etc.
 Letter grades (A, B, C, D, F)
 Ratings on a scale from 1-5
 Ratings on: always; usually; many times; once in a
while; almost never; never
 Age in categories (10-20, 20-30, etc.)
 Shock index categories (Kline et al.)
Quantitative Variables
 Numerical variables; may be
arithmetically manipulated.

 Counts
 Time
 Age
 Height
Quantitative Variables
 Discrete Numbers – a limited set of distinct
values, such as whole numbers.

 Number of new AIDS cases in CA in a year (counts)


 Years of school completed
 The number of children in the family (cannot have a half
a child!)
 The number of deaths in a defined time period (cannot
have a partial death!)
 Roll of a die
Quantitative Variables
 Continuous Variables - Can take on any
number within a defined range.
 Time-to-event (survival time)
 Age
 Blood pressure
 Serum insulin
 Speed of a car
 Income
 Shock index (Kline et al.)
Looking at Data
  How are the data distributed?
 Where is the center?
 What is the range?
 What’s the shape of the distribution (e.g., Gaussian,
binomial, exponential, skewed)?

 Are there “outliers”?

 Are there data points that don’t make


sense?
The first rule of statistics:
USE COMMON SENSE!

90% of the information is


contained in the graph.
Bar Chart
 Used for categorical variables to show
frequency or proportion in each
category.
 Translate the data from frequency
tables into a pictorial representation…
Bar Chart: categorical
variables

no

yes
Bar Chart for SI categories
200.0 Much easier to
183.3 extract information
166.7 from a bar chart
than from a table!
Number of Patients

150.0
133.3
116.7
100.0
83.3
66.7
50.0
33.3
16.7
0.0
1 2 3 4 5 6 7 8 9 10

Shock Index Category


Box plot and histograms: for
continuous variables
 To show the distribution (shape, center,
range, variation) of continuous
variables.
Box Plot: Shock Index
2.0
Shock Index Units

maximum (1.7)

Outliers
1.3
Q3 + 1.5IQR
= .8+1.5(.25)=1.17
“whisker” 5
75th percentile (0.8)
0.7 interquartile range median (.66)
(IQR) = .8-.55 = .25 25th percentile (0.55)

minimum (or Q1-


1.5IQR)
0.0
SI
Histogram of SI

25.0 Bins of size 0.1

Note the “right skew”


16.7
Percent

8.3

0.0
0.0 0.7 1.3 2.0
SI
Histogram
6.0 100 bins (too much detail)

4.0
Percent

2.0

0.0
0.0 0.7 1.3 2.0
SI
Histogram
200.0 2 bins (too little detail)

133.3
Percent

66.7

0.0
0.0 0.7 1.3 2.0
SI
Box Plot: Shock Index
2.0

Also shows the “right


Shock Index Units

skew”

1.3

0.7

0.0
SI
Box Plot: Age
100.0
maximum

More symmetric

66.7 75th percentile

interquartile range
Years

median

25th percentile
33.3

minimum

0.0
AGE
Variables
Histogram: Age
14.0 Not skewed, but not
bell-shaped either…

9.3
Percent

4.7

0.0
0.0 33.3 66.7 100.0
AGE (Years)
Some histograms from our class
data (n=18 so far…)
Starting with politics…
Feelings about math and
writing…
Last year’s class (for comparison)
Sleep behaviors…
Dietary behaviors…
Dietary behaviors…
Physical activity …
Optimism…
Last year’s class, for
comparison…
Health Care Debate…
Measures of central
tendency
 Mean
 Median
 Mode
Central Tendency
 Mean – the average; the balancing
point

calculation: the sum of values divided by the


sample size
n
In math ∑xX1 + X 2 +  + X n
shorthand: i =1
X= =
n n
Mean: example
Some data:
Age of participants: 17 19 21 22 23 23 23 38

∑X i
17 + 19 + 21 + 22 + 23 + 23 + 23 + 38
i =1
X= = = 23.25
n 8
Mean of age in Kline’s data

Means Section of AGE


GeometricHarmonic
Parameter Mean Median Mean Mean Sum Mode
Value 50.19334 49 46.66865 43.00606 46730 49

556.9546

14.0
Percent

9.3

4.7

0.0
0.0 33.3 66.7 100.0
Mean of age in Kline’s data
14.0

9.3
Percent

4.7

0.0
0.0 33.3 66.7 100.0

The balancing point


Mean of Pulmonary Embolism?
(Binary variable?)
n

X
i 1
i
181 Histogram
* 1  750 * 0 181
X    .1944
100.0 n 931 931
80.56%
66.7 (750)
Percent

33.3
19.44%
(181)
0.0
0.0 0.3 0.7 1.0
PE
Mean
 The mean is affected by extreme values
(outliers)

0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

Mean = 3 Mean = 4
1  2  3  4  5 15 1  2  3  4  10 20
 3  4
5 5 5 5
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Central Tendency
 Median – the exact middle value

Calculation:
 If there are an odd number of observations,

find the middle value


 If there are an even number of observations,

find the middle two values and average them.


Median: example
Some data:
Age of participants: 17 19 21 22 23 23 23 38

Median = (22+23)/2 = 22.5


Median of age in Kline’s data
Means Section of AGE
GeometricHarmonic
Parameter Mean Median Mean Mean Sum Mode
Value 50.19334 49 46.66865 43.00606 46730 49

14.0
Percent

9.3

4.7

0.0
0.0 33.3 66.7 100.0
AGE (Years)
Median of age in Kline’s data
14.0
50% 50%
of mass of
mass
9.3
Percent

4.7

0.0
0.0 33.3 66.7 100.0
Does PE have a median?
 Yes, if you line up the 0’s and 1’s, the
middle number is 0.
Median
 The median is not affected by extreme
values (outliers).

0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

Median = 3 Median = 3
Central Tendency
 Mode – the value that occurs most
frequently
Mode: example
Some data:
Age of participants: 17 19 21 22 23 23 23 38

Mode = 23 (occurs 3 times)


Mode of age in Kline’s data
Means Section of AGE
GeometricHarmonic
Parameter Mean Median Mean Mean Sum
Mode
Value 50.19334 49 46.66865 43.00606 46730 49
Mode of PE?
 0 appears more than 1, so 0 is the
mode.
Measures of
Variation/Dispersion
 Range
 Percentiles/quartiles
 Interquartile range
 Standard deviation/Variance
Range

 Difference between the largest and the


smallest observations.
Range of age: 94 years-15 years = 79 years
14.0

9.3
Percent

4.7

0.0
0.0 33.3 66.7 100.0
AGE (Years)
Range of PE?
 1-0 = 1
Quartiles
25% 25% 25% 25%

Q Q Q
1 2 3
 The first quartile, Q1, is the value for which
25% of the observations are smaller and 75%
are larger
 Q2 is the same as the median (50% are
smaller, 50% are larger)
 Only 25% of the observations are greater than
the third quartile
Interquartile Range

 Interquartile range = 3rd quartile – 1st


quartile = Q3 – Q1
Interquartile Range: age

Median
minimum Q1 (Q2) Q3 maximum

25% 25% 25% 25%

15 35 49 65 94

Interquartile range
= 65 – 35 = 30
Variance
 Average (roughly) of squared deviations
of values from the mean

2
 (x  X )
i
2

S  i

n 1
Why squared deviations?
 Adding deviations will yield a sum of 0.
 Absolute values are tricky!
 Squares eliminate the negatives.

 Result:
 Increasing contribution to the variance as
you go farther from the mean.
Standard Deviation
 Most commonly used measure of variation
 Shows variation about the mean
 Has the same units as the original data
n

 (x  X )
i
2

S i
n 1
Calculation Example:
Sample Standard Deviation
Age data (n=8) : 17 19 21 22 23 23 23 38

n=8 Mean = X = 23.25

(17  23.25) 2  (19  23 .25) 2    (38  23 .25) 2


S
8 1
280
  6 .3
7
Std. dev is a measure of
the “average” scatter
14.0
around the mean.
Estimation method: if the
distribution is bell shaped,
the range is around 6 SD,
9.3 so here rough guess for
Percent

SD is 79/6 = 13

4.7

0.0
0.0 33.3 66.7 100.0
AGE (Years)
Std. Deviation age

Variation Section of AGE


Standard
Parameter Variance Deviation
Value 333.1884 18.25345
Std Dev of Shock Index
250.0

Std. dev is a measure of


187.5
the “average” scatter
around the mean.
Count

125.0 Estimation method: if


the distribution is bell
shaped, the range is
around 6 SD, so here
62.5 rough guess for SD is
1.4/6 =.23

0.0
0.0 0.5 1.0 1.5 2.0
SI
Std. Deviation SI

Variation Section of SI

Standard Std Error Interquartile


Parameter Variance Deviation of Mean Range Range
Value 4.155749E-02 0.2038566 6.681129E-03 0.2460432 1.430856
Std. Dev of binary variable, PE
181 * (1  .1944 ) 2  750 * (0  .1944 ) 2
S Std. dev is a measure of
931  1
the “average” scatter
145 .8 around the mean.
  .3959
930

80.56%

19.44%
Std. Deviation PE
Variation Section of PE
Standard
Parameter Variance Deviation
Value 0.156786 0.3959621
Comparing Standard
Deviations
Data A
Mean = 15.5
11 12 13 14 15 16 17 18 19 20 21 S = 3.338

Data B
Mean = 15.5
11 12 13 14 15 16 17 18 19 20 21 S = 0.926

Data C
Mean = 15.5
11 12 13 14 15 16 17 18 19 20 21 S = 4.570
Bienaymé-Chebyshev Rule
 Regardless of how the data are distributed,
a certain percentage of values must fall
within K standard deviations from the mean:

Note use of  (sigma) to


Note use of  (mu) to represent “standard deviation.”
represent “mean”.
At least within
(1 - 1/12) = 0% …….….. k=1 (μ ± 1σ)
(1 - 1/22) = 75% …........ k=2 (μ ± 2σ)
(1 - 1/32) = 89% ………....k=3 (μ ± 3σ)
Symbol Clarification
 S = Sample standard deviation
(example of a “sample statistic”)
  = Standard deviation of the entire
population (example of a “population
parameter”) or from a theoretical
probability distribution
 X = Sample mean
 µ = Population or theoretical mean
**The beauty of the normal curve:

No matter what  and  are, the area between - and


+ is about 68%; the area between -2 and +2 is
about 95%; and the area between -3 and +3 is
about 99.7%. Almost all values fall within 3 standard
deviations.
68-95-99.7 Rule

68% of
the data

95% of the data

99.7% of the data


Summary of Symbols
 S2= Sample variance
 S = Sample standard dev
 2 = Population (true or theoretical) variance
  = Population standard dev.
 X = Sample mean
 µ = Population mean
 IQR = interquartile range (middle 50%)
Examples of bad graphics
What’s wrong with
this graph?

from: ER Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut,
1983, p.69
Notice the X-
axis

From: Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot
Wainer, H. 1997, p.29.
Correctly scaled X-axis…
Report of the Presidential Commission on the Space Shuttle
Challenger Accident, 1986 (vol 1, p. 145)
The graph excludes the observations where no O-rings failed.
Smooth curve at least shows the trend toward failure at high and
low temperatures…

http://www.math.yorku.ca/SCS/Gallery/
Even better: graph all the data (including non-failures)
using a logistic regression model

Tappin, L. (1994). "Analyzing data relating to the Challenger disaster".


Mathematics Teacher, 87, 423-426
What’s wrong with
this graph?

from: ER Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut,
1983, p.74
What’s the message here?

Diagraphics II, 1994


Diagraphics II, 1994
For more examples…
 http://www.math.yorku.ca/SCS/Gallery/
References
 http://www.math.yorku.ca/SCS/Gallery/
 Kline et al. Annals of Emergency Medicine 2002; 39: 144-152.
 Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
 Tappin, L. (1994). "Analyzing data relating to the Challenger disaster". Mathematics Teacher, 87, 423-
426
 Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut, 1983.
 Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot
Wainer, H. 1997.

You might also like