
biometry – bio220

chapter 2
displaying data

1
Reminder from Ch1

2
why graphs?
• humans and other primates are visual species
• can understand patterns in data via graphics
• when doing analysis: plot your data first → get
a feeling → decide on the analysis later
• show your results via graphs

3
Types of graphs
• Bar plots
• Histograms (single, multiple)
• Cumulative relative frequency plots
• Mosaic plots
• Scatter plots
• Line graphs
• Others

4
bar plots
• bar plot (or bar graph) for categorical data
• plots frequencies or relative frequencies
• relative frequency = proportion = fraction
(from 0 to 1)

• example: causes of death in US teenagers
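The relation between frequencies and relative frequencies can be sketched in a few lines. This is a minimal illustration only: the category labels `A` to `D` and the counts are made up and are not the data shown on the slide, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical categorical observations (made up for illustration).
observations = ["A", "A", "B", "A", "C", "B", "A", "D", "A", "C"]

def frequency_table(values):
    """Return {category: (frequency, relative_frequency)}."""
    counts = Counter(values)
    n = len(values)
    # relative frequency = count / total number of observations (0 to 1)
    return {category: (count, count / n) for category, count in counts.items()}

table = frequency_table(observations)
for category, (count, rel) in table.items():
    print(f"{category}: frequency={count}, relative frequency={rel:.2f}")
```

Every count is divided by the same n, so the bars keep the same relative heights whether frequencies, proportions, or percentages are plotted; only the axis scale changes.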

5
bar plots

- what is the variable?
- what type of variable is it?

6
bar plots
would the shape be
different if we had plotted
relative frequencies or
percentages?

7
bar plots
• rules for bar plot:
1- the order of the bars:
• if the data are ordinal, bars follow the variable's order
• if nominal, you may order by frequency
2- gaps between bars – why?
3- aid your reader: add info on total number of
observations in the legend
4- prefer 0 as baseline
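Rule 1 can be sketched as follows, assuming the observations arrive as a plain list of category labels. `bar_order` and its `ordinal_levels` parameter are hypothetical names chosen for this illustration.

```python
from collections import Counter

def bar_order(values, ordinal_levels=None):
    """Return (category, frequency) pairs in the order the bars should appear."""
    counts = Counter(values)
    if ordinal_levels is not None:
        # Ordinal data: keep the variable's own order, including empty levels.
        return [(level, counts.get(level, 0)) for level in ordinal_levels]
    # Nominal data: one common choice is descending frequency.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

print(bar_order(["b", "a", "b", "c", "b", "a"]))
print(bar_order(["low", "high", "low"], ordinal_levels=["low", "medium", "high"]))
```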
8
bar plot vs histogram

9
histograms
• histograms for numerical data
• bin (group) the data
• plot frequencies (or relative frequencies) of
each bin
• no gap between adjacent bars → indicates
continuity

10
histograms

assume we are studying species abundance:
what is the variable here?

11
histograms
• if we are interested in describing "abundance
for each bird species" → our variable is "bird
species" (each observation classified as X or Y
species) → a nominal variable → bar plot

12
histograms
• if we are interested in describing "bird species
abundance in general = number of individuals
of any one species (rare vs common)" → our
variable is numeric → histogram

• for a histogram, we may need to create bins


• we can bin the abundance data by 50 → will
have 650/50 = 13 bins
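The binning step can be sketched as below, assuming abundances lie between 0 and 650 and a bin width of 50. The abundance values are made up for illustration and `bin_counts` is a hypothetical helper name.

```python
def bin_counts(values, bin_width, upper_limit):
    """Count observations per bin for bins [0, w), [w, 2w), ... up to upper_limit."""
    n_bins = -(-upper_limit // bin_width)  # ceiling division: 650 / 50 -> 13
    counts = [0] * n_bins
    for v in values:
        if not 0 <= v < upper_limit:
            raise ValueError(f"value {v} is outside [0, {upper_limit})")
        counts[int(v // bin_width)] += 1
    return counts

# Hypothetical abundances (number of individuals per species).
abundances = [3, 49, 50, 120, 649]
counts = bin_counts(abundances, bin_width=50, upper_limit=650)
print(len(counts), counts)
```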

13
histograms

14
histograms

15
histograms

16
histograms
• the shape tells you about the data
distribution → the population's character
• note the outlier

17
distributions

18
distributions
• types: uniform vs normal vs exponential vs
gamma
• mode: the peak = most frequent bin
• unimodal vs bimodal
• skew = asymmetry
• skewed to the right or left = tail extends right
or left, respectively
• outliers = extreme observations
19
rules about histograms
• 1) choose number of bins (i.e. interval size)
carefully
• your choice may help detect or conceal signals in
data
• Sturges' rule: a rule of thumb for the number of bins
– number of bins = ⌈log₂(n) + 1⌉
– n: the total number of observations in the dataset
– ⌈ ⌉: "ceiling", i.e. round the answer up to the
nearest integer
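As a quick worked example of the rule above, a minimal sketch (the function name `sturges_bins` is chosen here for illustration):

```python
import math

def sturges_bins(n):
    """Number of bins suggested by Sturges' rule: ceil(log2(n) + 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(math.log2(n) + 1)

for n in (64, 100, 1000):
    print(n, sturges_bins(n))
```

For n = 100, log2(100) is about 6.64, so the rule suggests ⌈7.64⌉ = 8 bins. It is only a starting point; trying a few bin widths is still worthwhile.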

20
bin size choice can hide signals

21
rules about histograms
• 2) observations at boundaries → added to the
right-hand bin (higher bin)
– e.g. 250 into 250-300, not 200-250

• 3) prefer 0 as baseline
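Rule 2 corresponds to bins that include their lower edge and exclude their upper edge. A minimal sketch, assuming bins of equal width starting at 0 (`bin_interval` is a hypothetical helper name):

```python
def bin_interval(value, bin_width):
    """Return the (lower, upper) edges of the bin [lower, upper) containing value."""
    lower = (value // bin_width) * bin_width
    return lower, lower + bin_width

print(bin_interval(250, 50))    # boundary value goes to the higher bin: (250, 300)
print(bin_interval(249.9, 50))  # just below the boundary stays in the lower bin
```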

22
rules about histograms
• 4) prefer to break at readable numbers
– e.g. 0.5 instead of 0.483

• 5) always add info on total number of
observations (n) in the legend

23
cumulative relative frequency plots

24
cumulative relative frequency plots

what will the cumulative relative
frequency plot of this data look like?

25
cumulative relative frequency plots

26
cumulative relative frequency plots
• another representation of numerical data
• can be helpful for multiple distributions
• shows percentiles / quantiles
• 5th percentile = 0.05 quantile: a value X,
where 5% of all values are < X
• 50th percentile = 0.5 quantile: the median
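The percentile definition above can be checked numerically. This sketch uses made-up values (the integers 1 to 20), and `fraction_below` is a hypothetical helper name.

```python
import statistics

def fraction_below(values, x):
    """Proportion of observations strictly smaller than x."""
    return sum(1 for v in values if v < x) / len(values)

values = list(range(1, 21))          # hypothetical data: 1, 2, ..., 20
median = statistics.median(values)   # 0.5 quantile

print(fraction_below(values, 2))       # 5% of the values lie below 2
print(median, fraction_below(values, median))
```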

27
cumulative relative frequency plots
• cumulative relative frequency distribution: a
graph of quantiles (or percentiles)
• y-axis: represent 0-1, or 0-100%
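The points of such a plot can be computed by sorting the data and accumulating proportions. A minimal sketch, assuming the common convention that the y-value at x is the fraction of observations less than or equal to x (the function name and data are illustrative):

```python
def cumulative_relative_frequency(values):
    """Return [(x, fraction of observations <= x)] for each distinct x, ascending."""
    ordered = sorted(values)
    n = len(ordered)
    points = {}
    for i, v in enumerate(ordered, start=1):
        points[v] = i / n  # for tied values, the last (largest) fraction is kept
    return list(points.items())

points = cumulative_relative_frequency([3, 1, 2, 2])
print(points)
```

The y-values run from 1/n up to 1. A large jump between neighbouring x-values means many observations fall in that range.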

28
cumulative relative frequency plots

high slope → more data in that range

low slope → less data in that range

29
cumulative relative frequency plots
• histogram better to digest & detect patterns
than cumulative freq plots
• but cumulative freq plots can be helpful for
comparing multiple distributions

30
associations between categorical variables

• contingency tables: is variable X related to Y?

• example: does laying an extra egg (after
experimental egg removal) increase malaria
risk in birds?
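A contingency table is a count of every combination of the two categorical variables. The sketch below builds one from paired observations. The records are made up for illustration and are not the study's data.

```python
from collections import Counter

# Hypothetical (treatment, outcome) records, one per bird. Made-up numbers.
records = (
    [("control", "healthy")] * 3 + [("control", "malaria")] * 1
    + [("egg removal", "healthy")] * 2 + [("egg removal", "malaria")] * 2
)

def contingency_table(pairs):
    """Count each (x, y) combination; missing combinations count as 0."""
    return Counter(pairs)

table = contingency_table(records)
treatments = ["control", "egg removal"]
outcomes = ["malaria", "healthy"]
print(f"{'':12s}" + "".join(f"{o:>10s}" for o in outcomes))
for t in treatments:
    print(f"{t:12s}" + "".join(f"{table[(t, o)]:>10d}" for o in outcomes))
```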

31
associations between categorical variables

32
associations between categorical variables

• how can we plot this?

• bar plot or histogram?

33
associations between categorical variables

34
associations between categorical variables

35
associations between categorical variables

• mosaic plot: the area of each tile shows the
relative frequency
• bars are not of fixed width: the width reflects the
total number of observations in each group
– can choose between bar plot and mosaic plot
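The geometry of a mosaic plot can be sketched numerically: each group's width is its share of all observations, and each tile's height is the relative frequency of the outcome within that group. The counts and group names below are made up, and `mosaic_geometry` is a hypothetical helper name.

```python
def mosaic_geometry(table):
    """table: {group: {outcome: count}} -> {group: (width, {outcome: height})}."""
    total = sum(sum(outcomes.values()) for outcomes in table.values())
    geometry = {}
    for group, outcomes in table.items():
        group_total = sum(outcomes.values())
        width = group_total / total  # share of all observations
        heights = {o: c / group_total for o, c in outcomes.items()}  # within-group
        geometry[group] = (width, heights)
    return geometry

# Hypothetical counts (not the study's data).
counts = {
    "group_A": {"yes": 10, "no": 30},
    "group_B": {"yes": 30, "no": 30},
}
geometry = mosaic_geometry(counts)
for group, (width, heights) in geometry.items():
    print(group, round(width, 2), heights)
```

Width times height gives the tile's area, which equals that cell's share of all observations (for example 0.4 × 0.25 = 0.1 for `group_A` / `yes`).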

36
association between categorical and numeric
variables

• multiple histograms / cumulative freq graphs

• example: hemoglobin level in men from USA
(sea-level), Andes, Tibet, and Ethiopia (all
≥3500m)

37
association between categorical and numeric
variables

38
association between categorical and numeric
variables

39
association between 2 numeric var: scatter
plot
• what does each point represent?
• are the two variables positively correlated?

40
association between 2 numeric var: scatter
plot

• each point is a unit, plotted for X and Y


• X-axis: explanatory (independent)
• Y-axis: response (dependent)

• example: male fish sexual attractiveness – is it
inherited?

41
association between 2 numeric var: line
graph

42
association between 2 numeric var: line
graph

• each point is a unit


• X: an ordered series, like time
• one Y for each X (like a function)
• points connected by a line

43
association between 3 or more numeric
variables
• maps
• X & Y: ordered series (usually spatial)
• Z: another variable (e.g. frequency), shown in
color key

• example: ozone concentration in the Southern
Hemisphere

44
association between 3 or more numeric
variables

45
other common graphs: boxplots

46
other common graphs:
Kaplan-Meier curves
• cumulative probability of survival of fruit flies
under dietary restriction (DR)

source: doi:10.1038/nature08619

47
other common graphs: heatmaps
[Figure: heatmap; axes: genes, individuals (age); color key: relative expression]
source: https://www.nature.com/articles/nature02661

48
to make a good graph
• identify axes
• provide additional info in the legend
• represent magnitudes accurately
• if possible, show the original data (not just
summaries)
• draw clearly, minimize unclear patterns
• make easy to interpret

49
what to avoid when drawing graphs?

50
what to avoid when drawing graphs?

51
what to avoid when drawing graphs?

• bars imply relative magnitude / relative
frequency → a zero baseline is more accurate
• the first graph implies the change is e.g. 20X
• but there is also a clear change with time →
can use a scatter plot / line graph instead
• or, indicate that the y-axis is cut
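The effect of a cut baseline can be put in numbers: a reader compares bar heights measured from the baseline, not the underlying values. The values below are made up to reproduce a 20X impression, and `apparent_ratio` is a hypothetical helper name.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights when the y-axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)

# Hypothetical values: 50 and 69.
print(apparent_ratio(50, 69))               # zero baseline: true ratio of 1.38
print(apparent_ratio(50, 69, baseline=49))  # axis cut at 49: bars look 20X apart
```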

52
what to avoid when drawing graphs?

• don’t use non-essential elements

53
what to avoid when drawing graphs?

• don’t make it too complicated

54
to make a good graph
• choosing the right amount of information to
convey
• labeling axes
• font size
• adding units
• choice of color (avoid red/green contrasts: hard for
color-blind readers to tell apart)

55
summary

56
summary

57
exercise
• please solve all problems at the back of
chapter 2
• ask your questions at office hours

58
