You are on page 1of 33

Mathematical

Biostatistics
Boot Camp:
Lecture 11,
Plotting

Brian Caffo

Table of
contents
Mathematical Biostatistics Boot Camp: Lecture 11, Plotting
Histograms

Stem and leaf

Dotcharts Brian Caffo


Boxplots
Department of Biostatistics
KDEs
Johns Hopkins Bloomberg School of Public Health
QQ-plots Johns Hopkins University
Mosaic plots

October 22, 2012


Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Table of contents
Plotting

Brian Caffo
1 Table of contents
Table of
contents
2 Histograms
Histograms

Stem and leaf 3 Stem and leaf


Dotcharts

Boxplots 4 Dotcharts
KDEs

QQ-plots 5 Boxplots
Mosaic plots
6 KDEs

7 QQ-plots

8 Mosaic plots
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Histograms
Plotting

Brian Caffo

Table of
contents

Histograms

Stem and leaf

Dotcharts • Histograms display a sample estimate of the density or mass function by


Boxplots plotting a bar graph of the frequency or proportion of times that a variable
KDEs takes specific values, or a range of values for continuous data, within a sample
QQ-plots

Mosaic plots
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Example
Plotting

Brian Caffo

Table of
contents

Histograms
• The data set islands in the R package datasets contains the areas of all
Stem and leaf
land masses in thousands of square miles
Dotcharts

Boxplots
• Load the data set with the command data(islands)
KDEs • View the data by typing islands
QQ-plots
• Create a histogram with the command hist(islands)
Mosaic plots
• Do ?hist for options
Mathematical
Biostatistics
Boot Camp: Histogram of islands
Lecture 11,
Plotting

Brian Caffo
41

40
Table of
contents

Histograms
30
Stem and leaf
Frequency

Dotcharts
20

Boxplots

KDEs

QQ-plots
10

Mosaic plots

2 1 1 1 1 1
0 0
0

0 5000 10000 15000

islands
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Pros and cons
Plotting

Brian Caffo

Table of
contents

Histograms
• Histograms are useful and easy, apply to continuous, discrete and even
Stem and leaf

Dotcharts
unordered data
Boxplots • They use a lot of ink and space to display very little information
KDEs • It’s difficult to display several at the same time for comparisons
QQ-plots

Mosaic plots Also, for this data it’s probably preferable to consider log base 10 (orders of
magnitude), since the raw histogram simply says that most islands are small
Mathematical
Biostatistics
Boot Camp: Histogram of log10(islands)
Lecture 11,
Plotting

25
Brian Caffo

Table of 20

20
contents

Histograms

Stem and leaf 15


Frequency

15

Dotcharts

Boxplots
10

KDEs

QQ-plots
5
Mosaic plots 4
5

2
1 1
0

1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5

log10(islands)
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Stem-and-leaf plots
Plotting

Brian Caffo

Table of
contents

Histograms • Stem-and-leaf plots are extremely useful for getting distribution information on
Stem and leaf the fly
Dotcharts
• Read the text about creating them
Boxplots
• They display the complete data set and so waste very little ink
KDEs

QQ-plots • Two data sets’ stem and leaf plots can be shown back-to-back for comparisons
Mosaic plots • Created by John Tukey, a leading figure in the development of the statistical
sciences and signal processing
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Example
Plotting

Brian Caffo

Table of
contents
> stem(log10(islands))
Histograms

Stem and leaf The decimal point is at the |


Dotcharts

Boxplots 1 | 1111112222233444
KDEs 1 | 5555556666667899999
QQ-plots 2 | 3344
Mosaic plots
2 | 59
3 |
3 | 5678
4 | 012
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Dotcharts
Plotting

Brian Caffo

Table of
contents

Histograms
• Dotcharts simply display a data set, one point per dot
Stem and leaf

Dotcharts
• Ordering of the of the dots and labeling of the axes can the display additional
Boxplots information
KDEs • Dotcharts show a complete data set and so have high data density
QQ-plots
• May be impossible to construct/difficult to interpret for data sets with lots of
Mosaic plots
points
Mathematical
islands data: log10(area) (log10(sq. miles))
Biostatistics
Boot Camp:
Lecture 11, Victoria
Plotting Vancouver
Timor
Tierra del Fuego
Tasmania
Brian Caffo Taiwan
Sumatra
Spitsbergen
Southampton
Table of South America
contents Sakhalin
Prince of Wales
Novaya Zemlya
Histograms North America
Newfoundland
New Zealand (S)
Stem and leaf New Zealand (N)
New Guinea
New Britain
Dotcharts Moluccas
Mindanao
Melville
Boxplots Madagascar
Luzon
Kyushu
KDEs Java
Ireland
Iceland
QQ-plots Honshu
Hokkaido
Hispaniola
Mosaic plots Hainan
Greenland
Europe
Ellesmere
Devon
Cuba
Celon
Celebes
Britain
Borneo
Banks
Baffin
Axel Heiberg
Australia
Asia
Antarctica
Africa

1.0 1.5 2.0 2.5 3.0 3.5 4.0


Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Discussion
Plotting

Brian Caffo

Table of
contents

Histograms

Stem and leaf

Dotcharts • Maybe ordering alphabetically isn’t the best thing for this data set
Boxplots • Perhaps grouped by continent, then nations by geography (grouping Pacific
KDEs
islands together)?
QQ-plots

Mosaic plots
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Dotplots comparing grouped data
Plotting

Brian Caffo

Table of
contents

Histograms
• For data sets in groups, you often want to display density information by group
Stem and leaf

Dotcharts
• If the size of the data permits, it displaying the whole data is preferable
Boxplots • Add horizontal lines to depict means, medians
KDEs
• Add vertical lines to depict variation, show confidence intervals interquartile
QQ-plots
ranges
Mosaic plots
• Jitter the points to avoid overplotting (jitter)
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Example
Plotting

Brian Caffo

Table of
contents

Histograms

Stem and leaf


• The InsectSprays dataset contains counts of insect deaths by insecticide type
Dotcharts
(A, B, C, D, E, F)
Boxplots

KDEs
• You can obtain the data set with the command
QQ-plots data(InsectSprays)
Mosaic plots
Mathematical
Biostatistics
Boot Camp:
Lecture 11, The gist of the code is below
Plotting

Brian Caffo
attach(InsectSprays)
Table of plot(c(.5, 6.5), range(count))
contents
sprayTypes <- unique(spray)
Histograms

Stem and leaf


for (i in 1 : length(sprayTypes)){
Dotcharts
y <- count[spray == sprayTypes[i]]
Boxplots
n <- sum(spray == sprayTypes[i])
KDEs points(jitter(rep(i, n), amount = .1), y)
QQ-plots lines(i + c(.12, .28), rep(mean(y), 2), lwd = 3)
Mosaic plots lines(rep(i + .2, 2),
mean(y) + c(-1.96, 1.96) * sd(y) / sqrt(n)
)
}
Mathematical
Biostatistics

25
Boot Camp:
Lecture 11,
Plotting

Brian Caffo

20
Table of
contents

Histograms 15
Count

Stem and leaf

Dotcharts
10

Boxplots

KDEs

QQ-plots
5

Mosaic plots
0

A B C D E F

Spray
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Boxplots
Plotting

Brian Caffo

Table of
contents
• Boxplots are useful for the same sort of display as the dot chart, but in
Histograms
instances where displaying the whole data set is not possible
Stem and leaf

Dotcharts
• Centerline of the boxes represents the median while the box edges correspond
Boxplots to the quartiles
KDEs • Whiskers extend out to a constant times the IQR or the max value
QQ-plots
• Sometimes potential outliers are denoted by points beyond the whiskers
Mosaic plots
• Also invented by Tukey
• Skewness indicated by centerline being near one of the box edges
Mathematical
Biostatistics
Boot Camp:
Lecture 11,

25
Plotting

Brian Caffo

Table of 20
contents

Histograms

Stem and leaf


15

Dotcharts

Boxplots
10

KDEs

QQ-plots

Mosaic plots
5
0

A B C D E F
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Boxplots discussion
Plotting

Brian Caffo

Table of
contents
• Don’t use boxplots for small numbers of observations, just plot the data!
Histograms

Stem and leaf • Try logging if some of the boxes are too squished relative to other ones; you
Dotcharts can convert the axis to unlogged units (though they will not be equally spaced
Boxplots anymore)
KDEs • For data with lots and lots of observations omit the outliers plotting if you get
QQ-plots
so many of them that you cant see the points
Mosaic plots
• Example of a bad box plot
boxplot(rt(500, 2))
Mathematical
Biostatistics
Boot Camp:

10
Lecture 11,
Plotting

Brian Caffo

0
Table of
contents

Histograms
−10

Stem and leaf

Dotcharts

Boxplots
−20

KDEs

QQ-plots

Mosaic plots
−30
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Kernel density estimates
Plotting

Brian Caffo

Table of
contents
• Kernel density estimates are essentially more modern versions of histograms
Histograms
providing density estimates for continuous data
Stem and leaf

Dotcharts
• Observations are weighted according to a “kernel”, in most cases a Gaussian
Boxplots density
KDEs • “Bandwidth” of the kernel effectively plays the role of the bin size for the
QQ-plots histogram
Mosaic plots a. Too low of a bandwidth yields a too variable (jagged) measure of the density
b. Too high of a bandwidth oversmooths
• The R function density can be used to create KDEs
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Example
Plotting

Brian Caffo

Table of
contents

Histograms
Data is the waiting and eruption times in minutes between eruptions of the Old
Stem and leaf

Dotcharts
Faithful Geyser in Yellowstone National park
Boxplots
data(faithful)
KDEs
d <- density(faithful$eruptions, bw = "sj")
QQ-plots
plot(d)
Mosaic plots
density.default(x = faithful$eruptions, bw = "sj")
Mathematical
Biostatistics

0.6
Boot Camp:
Lecture 11,
Plotting

0.5
Brian Caffo

Table of

0.4
contents

Histograms
Density

Stem and leaf


0.3

Dotcharts

Boxplots
0.2

KDEs

QQ-plots
0.1

Mosaic plots
0.0

2 3 4 5

N = 272 Bandwidth = 0.14


Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Imaging example
Plotting

Brian Caffo

Table of
contents

Histograms

Stem and leaf • Consider the following image slice (created in R) from a high resolution MRI
Dotcharts of a brain
Boxplots
• This is a single (axial) slice of a three-dimensional image
KDEs

QQ-plots
• Consider discarding the location information and plotting a KDE of the
Mosaic plots intensities
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Plotting

Brian Caffo

Table of
contents

Histograms

Stem and leaf

Dotcharts

Boxplots

KDEs

QQ-plots

Mosaic plots
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Plotting

Brian Caffo
Background

0.04
Table of
contents

Histograms
density

Stem and leaf 0.02 Grey matter


Dotcharts
White matter
Boxplots

KDEs

QQ-plots
0.00

Mosaic plots

0 50 100 150 200 250


Mathematical
Biostatistics
Boot Camp:
Lecture 11,
QQ-plots
Plotting

Brian Caffo

Table of
contents

Histograms

Stem and leaf


• QQ-plots (for quantile-quantile) are extremely useful for comparing data to a
Dotcharts
theoretical distribution
Boxplots

KDEs • Plot the empirical quantiles against theoretical quantiles


QQ-plots • Most useful for diagnosing normality
Mosaic plots
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Plotting

Brian Caffo • Let xp be the p th quantile from a N(µ, σ 2 )


Table of
• Then P(X ≤ xp ) = p
contents xp −µ
• Clearly P(Z ≤ σ ) =p
Histograms

Stem and leaf • Therefore xp = µ + zp σ (this should not be news)


Dotcharts • Result, quantiles from a N(µ, σ 2 ) population should be linearly related to
Boxplots standard normal quantiles
KDEs
• A normal qq-plot plot the empirical quantiles against the theoretical standard
QQ-plots

Mosaic plots
normal quantiles
• In R qqnorm for a normal QQ-plot and qqplot for a qqplot against an
arbitrary distribution
Mathematical
Biostatistics Normal Q−Q Plot
Boot Camp:
Lecture 11,
Plotting

Brian Caffo

10
Table of
contents

Histograms
5
Sample Quantiles

Stem and leaf

Dotcharts

Boxplots
0

KDEs

QQ-plots
−5

Mosaic plots

−3 −2 −1 0 1 2 3

Theoretical Quantiles
Mathematical
Biostatistics Normal Q−Q Plot
Boot Camp:
Lecture 11,

5
Plotting

Brian Caffo

Table of

4
contents

Histograms
Sample Quantiles

Stem and leaf

Dotcharts

Boxplots
2

KDEs

QQ-plots
1

Mosaic plots
0

−3 −2 −1 0 1 2 3

Theoretical Quantiles
Mathematical
Biostatistics Normal Q−Q Plot
Boot Camp:
Lecture 11,
Plotting

Brian Caffo

15
Table of
contents

Histograms
Sample Quantiles

Stem and leaf


10

Dotcharts

Boxplots

KDEs

QQ-plots
5

Mosaic plots
0

−3 −2 −1 0 1 2 3

Theoretical Quantiles
Mathematical
Biostatistics
Boot Camp:
Lecture 11,
Mosaic plots
Plotting

Brian Caffo • Mosaic plots are useful for displaying contingency table data
Table of • Consider Fisher’s data regarding hair and eye color data for people from
contents
Caithness
Histograms

Stem and leaf


library(MASS)
Dotcharts

Boxplots
data(caith)
KDEs
caith
QQ-plots mosaicplot(caith, color = topo.colors(4),
Mosaic plots main = "Mosiac plot")
fair red medium dark black
blue 326 38 241 110 3
light 688 116 584 188 4
medium 343 84 909 412 26
dark 98 48 403 681 85
Mathematical
Biostatistics Mosiac plot
Boot Camp:
Lecture 11,
blue light medium dark
Plotting

Brian Caffo

Table of
contents

fair
Histograms

Stem and leaf

Dotcharts
red

Boxplots

KDEs
medium

QQ-plots

Mosaic plots
dark
black

You might also like