The Fundamentals of Constructing and Interpreting Heat Maps
Abstract
This chapter is intended to introduce the fundamental principles of the heat map, the most widely used
medium to present high-throughput data, to scientists unaccustomed to analyzing large data sets. Its scope
includes describing the general features of heat maps, how their components are designed, the meaning of
parameters such as “distance method” and “linkage method,” and the influence of manipulations such as
row-scaling and logarithmic transformations on data interpretation and presentation. This chapter may
serve as a guide to understanding the use of heat maps in published analyses or to aid in their design,
allowing efficient interpretations of high-throughput experiments, exploration of hypotheses, or clear
communications of findings.
Key words Heat map, Dendrogram, Hierarchical clustering, Linkage method, Correlation
1 Introduction
1.1 Motivation
The demand for analyses of large data sets is becoming more
widespread with continued advancements in high-throughput
molecular profiling technologies. Scientists can now broadly examine
multiple levels of regulation including those of the genome,
transcriptome, proteome, posttranslational modifications, and the
metabolome. These techniques provide researchers with the opportunity
to move beyond studying systems in isolation, and to broadly
quantify the impact of their experimental conditions. However,
technological advances on their own will not facilitate moving
from high-throughput systems-level information to specific
insights into biological processes. The communication gaps
between those who produce large data sets, those who analyze
them, and those with expertise in specific biological fields must be
traversed. Thus, this chapter is intended to serve as a branch
extending some of the fundamentals of “big-data” analyses to
scientists wishing to expand their research questions into the context
of biological systems.
Sarah-Maria Fendt and Sophia Y. Lunt (eds.), Metabolic Signaling: Methods and Protocols, Methods in Molecular Biology, vol. 1862,
https://doi.org/10.1007/978-1-4939-8769-6_20, © Springer Science+Business Media, LLC, part of Springer Nature 2019
Nathaniel M. Vacanti
1.2 Heat Map Overview
A heat map representing the quantified proteome of human tissues,
as measured by Zhu et al. [1], is displayed in Fig. 1. In this
particular example, each column represents a human tissue sample
and each row the protein product of a gene.
The large checkered palette of colors is a map of values (in this
case the protein abundances) for each sample-gene pair and is
henceforth referred to as the matrix visualization. The colors are
representative of a scale from low (cold) to high (hot) as specified
Fig. 1 Heat map displaying the relative quantified proteome of human tissue
samples [1]. Each column corresponds to a human tissue sample and each row
to the protein product of a gene. Values are normalized and log2 transformed as
described in [1]. Hierarchical clustering is based on Pearson correlation applying
complete linkage
Fig. 2 A reproduction of the column dendrogram associated with the heat map in
Fig. 1. Distances separating highlighted dendrogram leaves are represented by
similarly colored double-sided arrows marking the heights (h) of the linkages
connecting the leaves
1
In the pictured example (Fig. 1), the gradient scale spans -2 to 2 and represents log2 transformed normalized
values. Normalization is to a representation of the average value (commonly employed in proteomics, but not
discussed here). Thus 1 on the scale represents about twice the average value and -1 represents about half of the
average value.
2 Materials
3 Methods
3.1 Hierarchical Clustering
3.1.1 General Algorithm
Heat map rows and columns are typically arranged by the application
of hierarchical clustering. A series of steps common to many
hierarchical clustering procedures is described below and referred
to in this chapter as the general algorithm. For clarity, an “item” is
defined as a single set of measurements. For example, each tissue
sample (such as kidney 3) in Fig. 1 is considered an item. An item
may refer to a set of measurements corresponding to a row or a
column of a heat map because hierarchical clustering can be performed
on both the rows and columns. Items grouped together as a
result of hierarchical clustering are collectively defined as a cluster
(though a cluster may sometimes consist of a single item).
Note that there are many software packages available to per-
form hierarchical clustering; a general algorithm is described here
to provide an understanding that will be useful when reading about
parameters described later.
1. Calculate a distance between each item and each other item.
Store these distance values in a matrix where the rows and
columns are both labeled with the item names as in Fig. 3
(top). Methods to calculate these distance values are discussed
in Subheading 3.1.2.
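Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by any particular software package: the item names and values are invented, the helper name `distance_matrix` is hypothetical, and Euclidean distance (via the standard library's `math.dist`) is assumed as the distance method.

```python
import math

# Hypothetical items: each is one set of measurements (illustrative values only).
items = {
    "kidney 1": [0.0, 0.0],
    "kidney 2": [3.0, 4.0],
    "liver 1": [6.0, 8.0],
}

def distance_matrix(items, distance=math.dist):
    """Return a nested dict where matrix[a][b] is the distance between items a and b.

    Rows and columns are both labeled with the item names.
    The distance function is a parameter; Euclidean distance is assumed by default.
    """
    names = list(items)
    return {a: {b: distance(items[a], items[b]) for b in names} for a in names}

matrix = distance_matrix(items)

# Print one labeled row per item.
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in matrix])
```

The matrix is symmetric with zeros on the diagonal, so in practice only one triangle needs to be stored.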
3.1.2 Calculating Distances Between Individual Items
The first step listed in the general algorithm for hierarchical clustering
is to calculate each pairwise distance between individual items
and to tabulate them in a matrix such as that corresponding to the
first iteration in Fig. 3. To do so, the distance method must be
specified. The three most widely used are Euclidean distance, Pearson
correlation, and Spearman correlation, which are described
below. Please be aware that a complete understanding of the equations
presented in this section is not necessary to grasp the concepts
communicated. They are included as a parallel description intended
for the mathematically inclined.
Euclidean Distance
Consider the data set in Fig. 4a. Each sample has two associated
protein measurements. If each protein quantity is represented on an
axis in two-dimensional space, each sample would correspond to a
point in that space (Fig. 4b). The Euclidean distance between the
points is then the physical distance separating them and is calculated
by the application of Eq. 1.
$$ d_{21} = \sqrt{\sum_{i=1}^{n} \left(x_{i2} - x_{i1}\right)^2} \tag{1} $$
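As a worked illustration of Eq. 1, consider two hypothetical samples (the values are invented for this example and are not taken from Fig. 4a): sample 1 with protein measurements (1, 2) and sample 2 with (4, 6). With n = 2 measurements per sample, the distance works out as follows.

```latex
\begin{aligned}
d_{21} &= \sqrt{(x_{12} - x_{11})^2 + (x_{22} - x_{21})^2} \\
       &= \sqrt{(4 - 1)^2 + (6 - 2)^2} \\
       &= \sqrt{9 + 16} = 5
\end{aligned}
```

The same calculation extends to any number of measurements per sample by adding one squared difference per measurement under the square root.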
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2} $$
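Eq. 2 can be evaluated directly, as in the minimal Python sketch below. The function name `pearson_r` and the sample values are hypothetical. The final step, converting the correlation to a distance as 1 - r, is a common convention and is an assumption here, not something stated in this section.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences, following Eq. 2."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_sq_x = sum((xi - mean_x) ** 2 for xi in x)
    sum_sq_y = sum((yi - mean_y) ** 2 for yi in y)
    denominator = math.sqrt(sum_sq_x * sum_sq_y)
    if denominator == 0:
        raise ValueError("r is undefined when either sequence is constant")
    return numerator / denominator

# Hypothetical measurements: b follows the same pattern as a at ten times
# the magnitude, and c follows the opposite pattern.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [4.0, 3.0, 2.0, 1.0]

r_ab = pearson_r(a, b)
r_ac = pearson_r(a, c)

# Assumed conversion of a correlation to a distance: 1 - r.
print("distance a-b:", 1 - r_ab)
print("distance a-c:", 1 - r_ac)
```

Items a and b are far apart in Euclidean terms yet have a correlation-based distance near zero, because correlation compares the pattern of the measurements and ignores their magnitude.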
Fig. 6 Heat maps displaying the relative quantified phosphoproteome of HeLa cells under the specified
conditions. Each column corresponds to a sample of HeLa cells and each row to a phosphorylated protein.
Applied distance methods are provided above the respective heat maps. Complete linkage is applied. Values
are normalized to untreated and log2 transformed as described in [2]
3.1.3 Calculating Distances Between Groups and Individual Items
After the first iteration of the general algorithm, it is necessary to
calculate distances between clusters that are groups of items (Subheading
3.1.1, step 2). To do so, a linkage method must be
specified. Three commonly used linkage methods (complete, single,
and average) determine distances between clusters based on the
pairwise distances between member items of each cluster.
Complete linkage takes the distance between clusters to be the
largest observed between a pair of individual items, one represented
from each cluster. Single linkage is computed analogously, except
the shortest distance between items is used. Average linkage considers
all pairwise distances between items of each cluster and takes
the average of these distances to be the distance separating the
clusters. Employing single linkage results in “long” or “winding”
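The three linkage definitions given in this subsection can be compared side by side in a short Python sketch. The function name `cluster_distance`, the one-dimensional items, and the absolute-difference distance are all hypothetical choices made to keep the arithmetic easy to follow.

```python
from itertools import product
from statistics import mean

def cluster_distance(cluster_a, cluster_b, distance, method="complete"):
    """Distance between two clusters from the pairwise distances of their items."""
    # Every pair with one item drawn from each cluster.
    pairwise = [distance(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "complete":
        return max(pairwise)   # largest pairwise distance
    if method == "single":
        return min(pairwise)   # shortest pairwise distance
    if method == "average":
        return mean(pairwise)  # average of all pairwise distances
    raise ValueError(f"unknown linkage method: {method}")

def abs_diff(a, b):
    """Illustrative distance between two one-dimensional items."""
    return abs(a - b)

# Hypothetical clusters of one-dimensional items.
cluster_a = [0.0, 1.0]
cluster_b = [4.0, 6.0]

complete = cluster_distance(cluster_a, cluster_b, abs_diff, "complete")
single = cluster_distance(cluster_a, cluster_b, abs_diff, "single")
average = cluster_distance(cluster_a, cluster_b, abs_diff, "average")
print(complete, single, average)
```

Here the four pairwise distances are 4, 6, 3, and 5, so the same two clusters are separated by 6 under complete linkage, 3 under single linkage, and 4.5 under average linkage.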
3.2 Row Scaling
A consideration that greatly affects the presentation of heat maps
independently of the distance and linkage methods is whether and
how the rows are scaled. Row scaling allows for correction of two
inherent properties of the data that affect item clustering.
Rows whose corresponding measurements are generally larger
in magnitude will have a greater effect on item clustering. This is a
particularly important consideration in the analysis of microarray
and metabolomic data because signals can vary widely depending
on the characteristics of the mRNA probe or the metabolite (tendency
to ionize in mass spectrometry or magnitude of spin-state
transition in NMR). These varying magnitudes are not driven by
biological properties of the system and may greatly confound interpretation
of clustering if not corrected for by mean, median, or
some other form of normalization.
The second data property that greatly affects clustering is that
higher-variance rows will drive the clustering. This is not necessarily
problematic because differences in biological systems may be driven
by the higher-variance measurements (transcripts, metabolites, proteins,
etc.). However, near equal weight can be given to each
measurement if rows are normalized by a measure of data spread
such as standard deviation or interquartile range.
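One common way to apply both corrections at once is to subtract each row's mean and divide by its standard deviation. The Python sketch below shows that choice only; median centering or scaling by interquartile range would follow the same pattern. The function name `scale_row`, the row labels, and the values are hypothetical.

```python
from statistics import mean, stdev

def scale_row(row):
    """Center a row on its mean and divide by its sample standard deviation."""
    center = mean(row)
    spread = stdev(row)
    if spread == 0:
        # A constant row carries no pattern; after centering it is all zeros.
        return [0.0 for _ in row]
    return [(value - center) / spread for value in row]

# Hypothetical rows with the same pattern across three samples but very
# different signal magnitudes (illustrative values only).
rows = {
    "metabolite A": [100.0, 200.0, 300.0],
    "metabolite B": [1.0, 2.0, 3.0],
}

scaled = {name: scale_row(values) for name, values in rows.items()}
for name, values in scaled.items():
    print(name, values)
```

Before scaling, the first row would dominate a Euclidean distance calculation simply because its values are larger. After scaling, the two rows are identical and contribute equally.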
References
1. Zhu Y, Orre LM, Johansson HJ, Huss M, Boekel J, Vesterlund M, Fernandez-Woodbridge A, Branca RMM, Lehtio J (2018) Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow. Nat Commun 9(1):903. https://doi.org/10.1038/s41467-018-03311-y
2. Panizza E, Branca RMM, Oliviusson P, Orre LM, Lehtio J (2017) Isoelectric point-based fractionation by HiRIEF coupled to LC-MS allows for in-depth quantitative analysis of the phosphoproteome. Sci Rep 7(1):4513. https://doi.org/10.1038/s41598-017-04798-z
3. Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244. https://doi.org/10.1080/01621459.1963.10500845