chapter 2
displaying data
Reminder from Ch1
why graphs?
• humans, like other primates, are a visual species
• we can understand patterns in data via graphics
• when doing analysis: plot your data first, get a feeling for it, and decide on the analysis later
• show your results via graphs
Types of graphs
• Bar plots
• Histograms (single, multiple)
• Cumulative relative frequency plots
• Mosaic plots
• Scatter plots
• Line graphs
• Others
bar plots
• bar plot (or bar graph) for categorical data
• plots frequencies or relative frequencies
• relative frequency = proportion = fraction
(from 0 to 1)
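As a minimal sketch of what a bar plot displays, the snippet below tallies a categorical variable into frequencies and converts them to relative frequencies. The observations are hypothetical and serve only as an illustration; the text bars stand in for a real plot.

```python
from collections import Counter

# Hypothetical categorical observations (illustration only, not lecture data)
observations = ["cat", "dog", "dog", "bird", "dog", "cat", "dog", "bird"]

# Frequency: number of observations in each category
frequencies = Counter(observations)
n = sum(frequencies.values())

# Relative frequency = proportion = fraction (from 0 to 1)
relative_frequencies = {category: count / n for category, count in frequencies.items()}

# Rough text bar plot: one '#' per observation
for category, count in frequencies.most_common():
    print(f"{category:<5} {'#' * count:<5} {count} ({relative_frequencies[category]:.2f})")
```

The relative frequencies always sum to 1, so each bar can be read as a share of all observations.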
bar plots
• what is the variable?
• what type of variable is it?
bar plots
would the shape be different if we had plotted relative frequencies or percentages?
bar plots
• rules for bar plots:
1- the order of the bars:
• if the data are ordinal, bars follow the variable's order
• if nominal, you may order by frequency
2- gaps between bars – why?
3- aid your reader: add info on the total number of observations in the legend
4- prefer 0 as baseline
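The ordering rule can be sketched as follows, assuming two hypothetical variables (`causes` as nominal, `severity` as ordinal). Neither dataset comes from the lecture; the names are placeholders.

```python
from collections import Counter

# Hypothetical nominal data: bars may be ordered by decreasing frequency
causes = ["accident", "disease", "disease", "predation", "disease", "accident"]
nominal_order = [category for category, _ in Counter(causes).most_common()]

# Hypothetical ordinal data: bars keep the variable's own order
severity_levels = ["mild", "moderate", "severe"]
severity = ["severe", "mild", "mild", "moderate", "mild", "severe"]
severity_counts = Counter(severity)
ordinal_bars = [(level, severity_counts[level]) for level in severity_levels]

# Rule 3: report the total number of observations for the legend
legend = f"n = {len(severity)}"

print(nominal_order)
print(ordinal_bars, legend)
```

Sorting the ordinal bars by frequency instead would put "mild" next to "severe" and break the natural order the reader expects.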
bar plot vs histogram
histograms
• histograms for numerical data
• bin (group) the data
• plot frequencies (or relative frequencies) of
each bin
• no gap between adjacent bars indicates
continuity
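A minimal sketch of the binning step, assuming equal-width bins and a hypothetical set of body masses. The helper `bin_counts` is an illustrative name, not a library function.

```python
import math

def bin_counts(values, start, width, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    for x in values:
        index = math.floor((x - start) / width)
        if 0 <= index < n_bins:  # values outside the binned range are ignored
            counts[index] += 1
    return counts

# Hypothetical body masses in grams (illustration only)
masses = [12.1, 14.8, 15.0, 17.3, 18.9, 21.4, 22.0, 23.7, 24.9, 31.2]

# Bins: 10-15, 15-20, 20-25, 25-30, 30-35
counts = bin_counts(masses, start=10, width=5, n_bins=5)
relative = [c / len(masses) for c in counts]

print(counts)
print(relative)
```

The empty 25-30 bin still appears as a zero-height bar, because adjacent bins represent a continuous numerical axis.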
histograms
assume we are studying species abundance: what is the variable here?
histograms
• if we are interested in describing "abundance for each bird species", our variable is "bird species" (each observation classified as species X or Y): a nominal variable → bar plot
histograms
• if we are interested in describing "bird species abundance in general", i.e. the number of individuals of any one species (rare vs common), our variable is numeric → histogram
histograms
• the shape tells you about the data's distribution, and thus the population's character
• note the outlier
distributions
• types: uniform vs normal vs exponential vs
gamma
• mode: the peak = most frequent bin
• unimodal vs bimodal
• skew = asymmetry
• skewed to the right or left = tail extends right
or left, respectively
• outliers = extreme observations
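The terms above can be made concrete with a small hypothetical sample. The skewness measure used here (the mean of cubed standardized deviations) is one common choice and an assumption of this sketch, not something the slides specify.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical right-skewed sample with one extreme observation (12)
sample = [1, 1, 2, 2, 2, 3, 3, 4, 5, 12]

# Mode: with integer data, each distinct value acts as its own bin
mode_value, mode_count = Counter(sample).most_common(1)[0]

# Skewness: mean of cubed standardized deviations
# positive -> tail extends to the right, negative -> tail extends to the left
m = mean(sample)
s = pstdev(sample)
skewness = sum(((x - m) / s) ** 3 for x in sample) / len(sample)

print(mode_value, mode_count)
print("right-skewed" if skewness > 0 else "left-skewed or symmetric")
```

Here the single large value pulls the tail to the right, so the skewness is positive even though most observations sit near the mode.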
rules about histograms
• 1) choose the number of bins (i.e. the interval size) carefully
• your choice may help detect or conceal signals in the data
• Sturges' rule for the optimal number of bins
– optimal bins = ⌈log₂(n) + 1⌉
– n: the total number of observations in the dataset
– ⌈ ⌉: "ceiling", i.e. round the answer up to the nearest integer
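Sturges' rule is simple enough to compute directly. The sketch below uses a hypothetical helper name, `sturges_bins`; for n = 100 observations, log₂(100) + 1 ≈ 7.64, which rounds up to 8 bins.

```python
import math

def sturges_bins(n):
    """Number of bins suggested by Sturges' rule: ceil(log2(n) + 1)."""
    if n < 1:
        raise ValueError("n must be a positive number of observations")
    return math.ceil(math.log2(n) + 1)

for n in (64, 100, 1000):
    print(n, sturges_bins(n))
```

The rule gives only a starting point; it is still worth trying a few nearby bin counts and checking whether the picture changes.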
bin size choice can hide signals
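A small sketch of how this can happen, using a hypothetical sample with two clusters (around 4 and around 8). The helpers `bin_counts` and `count_peaks` are illustrative names defined here, not library functions.

```python
import math

def bin_counts(values, start, width, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    for x in values:
        index = math.floor((x - start) / width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

def count_peaks(counts):
    """Count bins strictly higher than both neighbours (outside edges count as 0)."""
    padded = [0] + counts + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )

# Hypothetical bimodal sample: one cluster around 4, another around 8
values = [3, 4, 4, 4, 5, 7, 8, 8, 8, 9]

wide = bin_counts(values, start=0, width=4, n_bins=3)     # bins 0-4, 4-8, 8-12
narrow = bin_counts(values, start=0, width=1, n_bins=12)  # bins of width 1

print(wide, count_peaks(wide))
print(narrow, count_peaks(narrow))
```

With wide bins the histogram shows a single peak; with narrow bins the two clusters appear as separate peaks.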
rules about histograms
• 2) observations at bin boundaries are added to the right-hand (higher) bin
– e.g. 250 goes into 250-300, not 200-250
• 3) prefer 0 as baseline
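The boundary rule corresponds to left-closed bins, [lower, upper). A minimal sketch using the 200-250 and 250-300 bins from the example above; `bin_label` is a hypothetical helper written for this illustration.

```python
from bisect import bisect_right

# Bin edges matching the 200-250 and 250-300 example
edges = [200, 250, 300]

def bin_label(x, edges):
    """Return the label of the left-closed bin [lower, upper) that contains x."""
    i = bisect_right(edges, x) - 1
    if i < 0 or i >= len(edges) - 1:
        raise ValueError(f"{x} lies outside the binned range")
    return f"{edges[i]}-{edges[i + 1]}"

print(bin_label(249.9, edges))
print(bin_label(250, edges))
```

Under this convention a value of exactly 300 would belong to a 300-350 bin, so the sketch raises an error for it because no such bin is defined here.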
rules about histograms
• 4) prefer to break at readable numbers
– e.g. 0.5 instead of 0.483
cumulative relative frequency plots
• another representation of numerical data
• can be helpful for multiple distributions
• shows percentiles / quantiles
• 5th percentile = 0.05 quantile: a value X such that 5% of all values are < X
• 50th percentile = 0.5 quantile: the median
cumulative relative frequency plots
• cumulative relative frequency distribution: a graph of quantiles (or percentiles)
• y-axis: represents 0-1, or 0-100%
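A minimal sketch of how such a curve can be built and read, assuming hypothetical measurements. The quantile lookup uses one simple convention (the smallest observed value whose cumulative relative frequency reaches q); other conventions exist and can give slightly different answers.

```python
def cumulative_relative_frequency(values):
    """Return sorted (value, fraction of observations <= value) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]

def quantile_from_curve(curve, q):
    """Smallest observed value whose cumulative relative frequency reaches q."""
    for x, cumulative in curve:
        if cumulative >= q:
            return x
    raise ValueError("q must be between 0 and 1")

# Hypothetical measurements (illustration only)
values = [7, 3, 9, 5, 1, 8, 2, 6, 4, 10]

curve = cumulative_relative_frequency(values)  # x: value, y: 0 to 1
median_estimate = quantile_from_curve(curve, 0.5)

print(curve)
print(median_estimate)
```

For this even-sized sample the lookup returns 5, whereas the usual median, which averages the two middle values, is 5.5; the difference comes from the convention, not from the data.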
cumulative relative frequency plots
• histograms are easier to digest and better for detecting patterns than cumulative frequency plots
• but cumulative frequency plots can be helpful for comparing multiple distributions
associations between categorical variables
association between categorical and numeric
variables
association between 2 numeric variables: scatter plot
• what does each point represent?
• are the two variables positively correlated?
association between 2 numeric variables: line graph
association between 3 or more numeric variables
• maps
• X & Y: ordered series (usually spatial)
• Z: another variable (e.g. frequency), shown with a color key
other common graphs: boxplots
other common graphs: Kaplan-Meier curves
• cumulative probability of survival of fruit flies under dietary restriction (DR)
Source: doi:10.1038/nature08619
other common graphs: heatmaps
[Figure: heatmap of genes vs individuals (by age); color key: relative expression]
Source: https://www.nature.com/articles/nature02661
to make a good graph
• identify axes
• provide additional info in the legend
• represent magnitudes accurately
• if possible, show the original data (not just summaries)
• draw clearly and avoid unclear patterns
• make the graph easy to interpret
what to avoid when drawing graphs?
to make a good graph
• choosing the right amount of information to
convey
• labeling axes
• font size
• adding units
• choice of color (avoid red/green combinations, which are hard to tell apart for color-blind readers)
summary
exercise
• please solve all problems at the end of chapter 2
• ask your questions at office hours