Slides chp02 Stats 20221

biometry – bio220
chapter 2
displaying data
Reminder from Ch1
why graphs?
• humans and other primates are visual species
• we can understand patterns in data via graphics
• when doing analysis: plot your data first → get a feeling → decide on the analysis later
• show your results via graphs
Types of graphs
• Bar plots
• Histograms (single, multiple)
• Cumulative relative frequency plots
• Mosaic plots
• Scatter plots
• Line graphs
• Others
bar plots
• bar plot (or bar graph) for categorical data
• plots frequencies or relative frequencies
• relative frequency = proportion = fraction (from 0 to 1)
• example: causes of death in US teenagers
bar plots
- what is the variable?
- what is the type of the variable?
bar plots
would the shape be different if we had plotted relative frequencies or percentages?
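One way to explore this question is to compute both quantities. The sketch below is in Python and uses made-up category labels (not the data from the slide). Every relative frequency is the frequency divided by the same total n, so the ratio between any two bar heights is the same on both scales; only the y-axis scale changes.

```python
from collections import Counter

# Hypothetical categorical observations, for illustration only.
observations = ["A", "B", "A", "C", "A", "B", "D", "A"]

def frequency_table(values):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(values)
    n = sum(counts.values())  # total number of observations
    return {category: (k, k / n) for category, k in counts.items()}

table = frequency_table(observations)
for category, (freq, rel) in table.items():
    print(f"{category}: frequency = {freq}, relative frequency = {rel:.3f}")

# The ratio of two bar heights is identical on both scales.
ratio_freq = table["A"][0] / table["B"][0]
ratio_rel = table["A"][1] / table["B"][1]
```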
bar plots
• rules for bar plot:
1- the order of the bars:
• if the data are ordinal, bars follow the variable's natural order
• if nominal, you may order by frequency
2- gaps between bars – why?
3- aid your reader: add info on total number of observations in the legend
4- prefer 0 as baseline
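Rule 1 can be sketched in a few lines of Python. The function name `bar_order` and the example values are hypothetical: when the levels of an ordinal variable are supplied, the bars follow that order; otherwise the categories are treated as nominal and sorted from most to least frequent.

```python
from collections import Counter

def bar_order(values, ordinal_levels=None):
    """Return (category, frequency) pairs in plotting order."""
    counts = Counter(values)
    if ordinal_levels is not None:
        # Ordinal data: keep the variable's own order, even for empty levels.
        return [(level, counts.get(level, 0)) for level in ordinal_levels]
    # Nominal data: order by decreasing frequency.
    return counts.most_common()

nominal = bar_order(["b", "a", "b", "c", "b", "a"])
ordinal = bar_order(["low", "high", "low", "medium"],
                    ordinal_levels=["low", "medium", "high"])
print(nominal)
print(ordinal)
```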
bar plot vs histogram
histograms
• histograms for numerical data
• bin (group) the data
• plot frequencies (or relative frequencies) of each bin
• no gap between adjacent bars → indicates continuity
histograms
assume we are studying species abundance: what is the variable here?
histograms
• if we are interested in describing "abundance for each bird species" → our variable is "bird species" (each observation classified as species X or Y) → a nominal variable → bar plot
histograms
• if we are interested in describing "bird species abundance in general", i.e. the number of individuals of any one species (rare vs common) → our variable is numeric → histogram
• for a histogram, we may need to create bins
• we can bin the abundance data in intervals of 50 → we will have 650/50 = 13 bins
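The binning arithmetic can be sketched as follows. The abundance values are hypothetical, and `bin_counts` is an illustrative helper, not part of the lecture material; it assumes all values lie in the range 0 to just under 650.

```python
def bin_counts(values, width, upper):
    """Count observations in equal-width bins [0, width), [width, 2*width), ..."""
    n_bins = upper // width  # e.g. 650 // 50 = 13 bins
    counts = [0] * n_bins
    for v in values:
        idx = int(v // width)  # index of the bin that contains v
        if not 0 <= idx < n_bins:
            raise ValueError(f"{v} is outside the range 0 to {upper}")
        counts[idx] += 1
    return counts

# Hypothetical abundances (number of individuals per species).
abundances = [3, 12, 49, 50, 120, 300, 649]
counts = bin_counts(abundances, width=50, upper=650)
print(len(counts), counts)
```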
histograms
• the shape tells you about the distribution of the data → the population’s character
• note the outlier
distributions
• types: uniform vs normal vs exponential vs gamma
• mode: the peak = most frequent bin
• unimodal vs bimodal
• skew = asymmetry
• skewed to the right or left = tail extends right or left, respectively
• outliers = extreme observations
rules about histograms
• 1) choose number of bins (i.e. interval size) carefully
• your choice may help detect or conceal signals in data
• Sturges' rule for a suggested number of bins
– number of bins = ⌈log₂(n) + 1⌉
– n: the total number of observations in the dataset
– ⌈ ⌉: "ceiling" symbols, i.e. round the answer up to the nearest integer
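As a quick check of the arithmetic, Sturges' rule can be written as a small Python function. The name `sturges_bins` is made up for this sketch, and the result is a starting point for the bin count rather than a fixed answer.

```python
import math

def sturges_bins(n):
    """Suggested number of histogram bins for n observations (Sturges' rule)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(math.log2(n) + 1)

for n in (1, 64, 100, 1000):
    print(n, sturges_bins(n))
```

For example, n = 100 gives log₂(100) + 1 ≈ 7.64, which rounds up to 8 bins.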
bin size choice can hide signals
• 2) observations at bin boundaries → added to the right-hand bin (higher bin)
– e.g. 250 goes into 250-300, not 200-250
• 3) prefer 0 as baseline
• 4) prefer to break at readable numbers
– e.g. 0.5 instead of 0.483
• 5) always add info on total number of observations (n) in the legend
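The boundary convention in rule 2 amounts to bins that are closed on the left and open on the right. A minimal Python sketch, assuming a hypothetical helper `bin_label` and bin edges chosen only for illustration:

```python
from bisect import bisect_right

def bin_label(value, edges):
    """Return the 'low-high' label of the bin containing value.

    Bins are [low, high): a value equal to an edge goes to the higher bin.
    """
    i = bisect_right(edges, value) - 1
    if i < 0 or i >= len(edges) - 1:
        raise ValueError(f"{value} is outside the bin range")
    return f"{edges[i]}-{edges[i + 1]}"

edges = [200, 250, 300]
print(bin_label(250, edges))    # the boundary value goes to the higher bin
print(bin_label(249.9, edges))
```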
cumulative relative frequency plots
what will the cumulative relative frequency plot of these data look like?
• another representation of numerical data
• can be helpful for multiple distributions
• shows percentiles / quantiles
• 5th percentile = 0.05 quantile: a value X such that 5% of all values are < X
• 50th percentile = 0.5 quantile: the median
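The percentile definition above can be checked numerically. This Python sketch uses hypothetical data (the integers 1 to 20) and an illustrative helper `fraction_below` that returns the fraction of values strictly smaller than X.

```python
import statistics

def fraction_below(values, x):
    """Fraction of observations strictly smaller than x."""
    return sum(v < x for v in values) / len(values)

data = list(range(1, 21))          # hypothetical data: 1, 2, ..., 20
median = statistics.median(data)   # 50th percentile

print(median, fraction_below(data, median))  # half of the values lie below the median
print(fraction_below(data, 2))               # 5% of the values are < 2
```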
• cumulative relative frequency distribution: a graph of quantiles (or percentiles)
• y-axis: represents 0-1, or 0-100%
high slope → more data in that range
low slope → less data in that range
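To see where the slope comes from, the points of a cumulative relative frequency plot can be computed directly. This is a minimal Python sketch with hypothetical data: for each distinct value x, the y-coordinate is the fraction of observations that are ≤ x.

```python
from collections import Counter
from itertools import accumulate

def cumulative_relative_frequency(values):
    """Return (x, fraction of observations <= x) for each distinct x."""
    counts = Counter(values)
    xs = sorted(counts)
    n = len(values)
    running = accumulate(counts[x] for x in xs)  # cumulative counts
    return [(x, c / n) for x, c in zip(xs, running)]

# Hypothetical data: most values are small, one is large.
points = cumulative_relative_frequency([1, 2, 2, 3, 10])
print(points)
```

In this example the curve rises from 0.2 to 0.8 between x = 1 and x = 3 (steep, many observations) and only from 0.8 to 1.0 between x = 3 and x = 10 (flat, few observations).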
• histograms are easier to digest and better for detecting patterns than cumulative frequency plots
• but cumulative frequency plots can be helpful for comparing multiple distributions
associations between categorical variables
• contingency tables: is variable X related to Y?
• example: does laying an extra egg (after experimental egg removal) increase malaria risk in birds?
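A contingency table is a cross-tabulation of two categorical variables. The Python sketch below builds one from (treatment, outcome) pairs; the counts are hypothetical and chosen only to show the layout, they are not the data from the study.

```python
from collections import Counter

def contingency_table(pairs):
    """Cross-tabulate (row category, column category) pairs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Hypothetical observations: one (treatment, outcome) pair per bird.
birds = ([("control", "malaria")] * 6 + [("control", "no malaria")] * 24
         + [("egg removal", "malaria")] * 5 + [("egg removal", "no malaria")] * 5)

rows, cols, table = contingency_table(birds)
print(cols)
for name, row in zip(rows, table):
    print(name, row)
```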
• how can we plot this?
• bar plot or histogram?
• mosaic plot: the area of each box shows the relative frequency of that combination
• bars are not of fixed width; widths reflect the total number of observations in each group
– can choose between bar plot and mosaic plot
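The geometry of a mosaic plot can be sketched without any plotting library. In this Python example (hypothetical counts, illustrative function name), each group's bar width is its share of all observations, and each box height is the share of that outcome within the group, so width × height equals the box's relative frequency.

```python
def mosaic_geometry(table):
    """table: {group: {outcome: count}} -> {group: (width, {outcome: height})}."""
    grand_total = sum(sum(row.values()) for row in table.values())
    geometry = {}
    for group, row in table.items():
        group_total = sum(row.values())
        width = group_total / grand_total  # bar width reflects group size
        heights = {outcome: k / group_total for outcome, k in row.items()}
        geometry[group] = (width, heights)
    return geometry

# Hypothetical counts, for illustration only.
counts = {"control": {"malaria": 6, "no malaria": 24},
          "egg removal": {"malaria": 5, "no malaria": 5}}
geometry = mosaic_geometry(counts)
for group, (width, heights) in geometry.items():
    print(group, width, heights)
```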
association between categorical and numeric variables
• multiple histograms / cumulative frequency graphs
• example: hemoglobin level in men from the USA (sea level) and from the Andes, Tibet, and Ethiopia (all three at ≥3500 m)
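Multiple histograms are easiest to compare when every group is binned with the same edges. The Python sketch below assumes that design choice; the group names follow the example above, but the values and bin edges are hypothetical.

```python
from collections import defaultdict

def grouped_histograms(records, edges):
    """records: (group, value) pairs; edges: shared, increasing bin edges.

    Returns {group: counts per bin}; boundary values go to the higher bin.
    """
    counts = defaultdict(lambda: [0] * (len(edges) - 1))
    for group, value in records:
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                counts[group][i] += 1
                break
        else:
            raise ValueError(f"{value} is outside the bin range")
    return dict(counts)

# Hypothetical measurements, for illustration only.
records = [("USA", 14.2), ("USA", 15.1), ("USA", 16.0),
           ("Andes", 18.3), ("Andes", 19.5), ("Andes", 17.9)]
histograms = grouped_histograms(records, edges=[12, 14, 16, 18, 20])
print(histograms)
```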
association between two numeric variables: scatter plot
• what does each point represent?
• are the two variables positively correlated?
association between two numeric variables: scatter plot
• each point is a unit, plotted for X and Y
• X-axis: explanatory (independent)
• Y-axis: response (dependent)
• example: male fish sexual attractiveness – is it inherited?
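Whether two variables are positively correlated can also be checked numerically. The Python sketch below computes the Pearson correlation coefficient for paired values; the numbers are hypothetical, with x as the explanatory variable and y as the response for each unit.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements: one (x, y) pair per unit.
x = [0.1, 0.4, 0.5, 0.9, 1.2]
y = [0.2, 0.3, 0.7, 0.8, 1.1]
r = pearson_r(x, y)
print(f"r = {r:.2f}")  # r > 0 indicates a positive association
```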
association between two numeric variables: line graph
• each point is a unit
• X: an ordered series, like time
• one Y for each X (like a function)
• points connected by a line
association between 3 or more numeric variables
• maps
• X & Y: ordered series (usually spatial)
• Z: another variable (e.g. frequency), shown in the color key
• example: ozone concentration in the Southern Hemisphere
other common graphs: boxplots
other common graphs: Kaplan-Meier curves
• cumulative probability of survival of fruit flies under dietary restriction (DR)
• source: 10.1038/nature08619
other common graphs: heatmaps
• axes: genes, individuals (age)
• color key: relative expression
• source: https://www.nature.com/articles/nature02661
to make a good graph
• identify axes
• provide additional info in the legend
• represent magnitudes accurately
• if possible, show the original data (not just summaries)
• draw clearly, minimize unclear patterns
• make it easy to interpret
what to avoid when drawing graphs?
• bars imply relative magnitude / relative frequency → a zero baseline is more accurate
• first graph implies change is e.g. 20X
• but there is also a clear change with time → can use a scatter plot / line graph instead
• or, indicate that the y-axis is cut
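The effect of a cut y-axis on bar heights is simple arithmetic. In this Python sketch the two values and the baseline are hypothetical, picked so that the truncated plot suggests a 20X change: the drawn height of a bar is its value minus the baseline, so the apparent ratio depends on where the axis starts.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar heights when the y-axis starts at `baseline`."""
    if a <= baseline or b <= baseline:
        raise ValueError("both values must lie above the baseline")
    return (b - baseline) / (a - baseline)

# Hypothetical values for two bars.
first, second = 101, 120
print(apparent_ratio(first, second, baseline=100))  # truncated axis
print(apparent_ratio(first, second))                # zero baseline
```

With the axis starting at 100, the second bar is drawn 20 times taller than the first, while the actual ratio of the values is about 1.19.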
• don’t use non-essential elements
• don’t make it too complicated
to make a good graph
• choose the right amount of information to convey
• label axes
• use a readable font size
• add units
• choose colors carefully (red/green contrasts are hard for color-blind readers)
summary
exercise
• please solve all problems at the end of chapter 2
• ask your questions at office hours