
Chapter 20

The Fundamentals of Constructing and Interpreting


Heat Maps
Nathaniel M. Vacanti

Abstract
This chapter is intended to introduce the fundamental principles of the heat map, the most widely used
medium to present high-throughput data, to scientists unaccustomed to analyzing large data sets. Its scope
includes describing the general features of heat maps, how their components are designed, the meaning of
parameters such as “distance method” and “linkage method,” and the influence of manipulations such as
row-scaling and logarithmic transformations on data interpretation and presentation. This chapter may
serve as a guide to understanding the use of heat maps in published analyses or to aid in their design,
allowing efficient interpretations of high-throughput experiments, exploration of hypotheses, or clear
communications of findings.

Key words Heat map, Dendrogram, Hierarchical clustering, Linkage method, Correlation

1 Introduction

1.1 Motivation The demand for analyses of large data sets is becoming more
widespread with continued advancements in high-throughput
molecular profiling technologies. Scientists can now broadly exam-
ine multiple levels of regulation including those of the genome,
transcriptome, proteome, posttranslational modifications, and the
metabolome. These techniques provide researchers with the oppor-
tunity to move beyond studying systems in isolation, and to broadly
quantify the impact of their experimental conditions. However,
technological advances on their own will not facilitate moving
from high-throughput systems-level information to specific
insights into biological processes. The communication gaps
between those who produce large data sets, those who analyze
them, and those with expertise in specific biological fields must be
traversed. Thus, this chapter is intended to serve as a branch
extending some of the fundamentals of “big-data” analyses to
scientists wishing to expand their research questions into the con-
text of biological systems.

Sarah-Maria Fendt and Sophia Y. Lunt (eds.), Metabolic Signaling: Methods and Protocols, Methods in Molecular Biology, vol. 1862,
https://doi.org/10.1007/978-1-4939-8769-6_20, © Springer Science+Business Media, LLC, part of Springer Nature 2019


1.2 Heat Map Overview A heat map representing the quantified proteome of human tissues,
as measured by Zhu et al. [1], is displayed in Fig. 1. In this
particular example, each column represents a human tissue sample
and each row the protein product of a gene.
The large checkered palette of colors is a map of values (in this
case the protein abundances) for each sample-gene pair and is
henceforth referred to as the matrix visualization. The colors are
representative of a scale from low (cold) to high (hot) as specified

Fig. 1 Heat map displaying the relative quantified proteome of human tissue
samples [1]. Each column corresponds to a human tissue sample and each row
to the protein product of a gene. Values are normalized and log2 transformed as
described in [1]. Hierarchical clustering is based on Pearson correlation applying
complete linkage

Fig. 2 A reproduction of the column dendrogram associated with the heat map in
Fig. 1. Distances separating highlighted dendrogram leaves are represented by
similarly colored double-sided arrows marking the heights (h) of the linkages
connecting the leaves

by the gradient scale.1 In general, the rows and columns are
arranged in a way where regions of neighboring row-column pairs
have similar associated values (a result of hierarchical clustering,
discussed in Subheading 3.1). Thus the matrix visualization is a
map of the "temperature" localization, lending the name "heat
map" to describe this general style of presentation.
The seemingly uneven trees of brackets flanking the top and left
sides of the matrix visualization are dendrograms. The top (col-
umn) dendrogram is discussed for illustrative purposes; however,
the concepts are analogously applied to the left-side (row)
dendrogram.
Each dendrogram leaf corresponds to a column and thus a
tissue sample. The height at which two leaves are connected is
representative of their similarity (methods to quantify similarity as
distances are discussed in Subheading 3.1). Thus, according to
this dendrogram (magnified in Fig. 2), the samples of “liver 4”
and “liver 1” are more similar than “liver 4” and “kidney 3.”
Though an unsurprising finding, it illustrates how to interpret a
dendrogram (and serves as an important quality control check of
the data set). Other possible observations include that the liver
samples are more similar to those of the kidney than to those of
the testis and that tonsil 1 and tonsil 2 are more similar to each
other than to tonsil 3.
The color-bar located between the top dendrogram and
the matrix visualization (Fig. 1) is the group designations bar. In
this case, the groups correspond to the tissues of origin of each of
the samples as specified in the group key. The group designations

1 In the pictured example (Fig. 1), the gradient scale spans −2 to 2 and represents log2 transformed normalized
values. Normalization is to a representation of the average value (commonly employed in proteomics, but not
discussed here). Thus 1 on the scale represents about twice the average value and −1 represents about half of the
average value.

bar provides a useful visualization of how similar group members
are to each other or to members of other groups because the
distribution of a group’s members among dendrogram branches
can be quickly assessed. In this case, samples from the same tissue
appear similar to each other. Note that the group designations bar
can sometimes be deceiving. If “kidney 3” were a liver sample, the
group designations bar would have it stratified with the other liver
samples. However, as discussed above, the dendrogram clearly
distinguishes it as separated from the liver samples. Thus the
group designations bar provides a broad overview of the stratifica-
tions, but does not serve as a substitute for inspection of the
dendrogram.

2 Materials

This chapter provides an explanation of concepts required to
understand and interpret heat maps, and no materials are required.
However, software packages are needed to construct informative
heat maps. Morpheus (https://software.broadinstitute.org/morpheus/)
as well as R functions such as heatmap and heatmap.plus
(heatmap.plus package) are very capable. Morpheus can be used
without knowledge of programming while the R packages require
familiarity with the R programming language.

3 Methods

3.1 Hierarchical Clustering

3.1.1 General Algorithm Heat map rows and columns are typically arranged by the applica-
tion of hierarchical clustering. A series of steps common to many
hierarchical clustering procedures are described below and referred
to in this chapter as the general algorithm. For clarity, an "item" is
defined as a single set of measurements. For example, each tissue
sample (such as kidney 3) in Fig. 1 is considered an item. An item
may refer to a set of measurements corresponding to a row or a
column of a heat map because hierarchical clustering can be per-
formed on both the rows and columns. Items grouped together as a
result of hierarchical clustering are collectively defined as a cluster
(though a cluster may sometimes consist of a single item).
Note that there are many software packages available to per-
form hierarchical clustering; a general algorithm is described here
to provide an understanding that will be useful when reading about
parameters described later.
1. Calculate a distance between each item and each other item.
Store these distance values in a matrix where the rows and
columns are both labeled with the item names as in Fig. 3
(top). Methods to calculate these distance values are discussed
in Subheading 3.1.2.

Fig. 3 An illustration of the process of forming a dendrogram based on iterative
computations of distance matrices. W, X, Y, and Z are the items. All distance
values are fictional and provided for illustrative purposes

2. Form a matrix analogous to that formed in step 1, except for
the existing clusters instead of the individual items. On the first
iteration, each cluster is composed of a single item and the
matrix formed in step 1 is used here. On subsequent iterations,
the clusters may also be groups of items. These distance values
are stored in a matrix where the rows and columns are both
labeled with the cluster names as in Fig. 3 (middle and bot-
tom). Methods to calculate the distances between clusters are
discussed in Subheading 3.1.3.
3. Find the two clusters that have the shortest distance separating
them, arrange them to be adjacent to each other, and then
connect them with a double right-angle bracket whose total
height is representative of that distance. These paired clusters
are now considered a single cluster.
4. Determine if there is more than one cluster remaining. If there
is, repeat steps 2–4.
This process is illustrated in Fig. 3. The distance between a pair
(of items or clusters) connected by a right-angle bracket is repre-
sented by the distance from the top of that bracket to the bottom of
the dendrogram. The algorithm described above is agglomerative,

meaning the clustering is performed from the "bottom up." Divisive
algorithms are also used, where a single cluster is the starting
point and it is subsequently divided, but are less common and not
discussed here.
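The general algorithm above can be sketched in a few lines of Python. The four items (W, X, Y, Z) and all pairwise distances below are invented for illustration (they are not the values from Fig. 3), and complete linkage (introduced in Subheading 3.1.3) is assumed for the cluster-to-cluster distances:

```python
def complete_linkage_distance(cluster_a, cluster_b, dist):
    # Cluster-to-cluster distance = largest pairwise item distance (complete linkage).
    return max(dist[(min(i, j), max(i, j))] for i in cluster_a for j in cluster_b)

def agglomerate(items, dist):
    # Steps 1-2: on the first iteration every cluster is a single item.
    clusters = [frozenset([i]) for i in range(len(items))]
    merges = []
    while len(clusters) > 1:  # step 4: repeat while more than one cluster remains
        # Step 3: find and link the two closest clusters.
        pairs = [(complete_linkage_distance(a, b, dist), a, b)
                 for k, a in enumerate(clusters) for b in clusters[k + 1:]]
        d, a, b = min(pairs, key=lambda p: p[0])
        merges.append((sorted(items[i] for i in a), sorted(items[i] for i in b), d))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges  # each entry: (cluster 1, cluster 2, linkage height)

items = ["W", "X", "Y", "Z"]
# Hypothetical pairwise item distances, keyed by (smaller index, larger index).
dist = {(0, 1): 2.0, (0, 2): 6.0, (0, 3): 10.0,
        (1, 2): 5.0, (1, 3): 9.0, (2, 3): 4.0}
for left, right, height in agglomerate(items, dist):
    print(left, right, height)
```

Each printed line corresponds to one bracket of the dendrogram, with the linkage height as its total height.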

3.1.2 Calculating Distances Between Individual Items The first step listed in the general algorithm for hierarchical cluster-
ing is to calculate each pairwise distance between individual items
and to tabulate them in a matrix such as that corresponding to the
first iteration in Fig. 3. To do so, the distance method must be
specified. The three most widely used are Euclidian distance, Pear-
son correlation, and Spearman correlation, which are described
below. Please be aware that a complete understanding of the equa-
tions presented in this section is not necessary to grasp the concepts
communicated. They are included as a parallel description intended
for the mathematically inclined.

Euclidian Distance Consider the data set in Fig. 4a. Each sample has two associated
protein measurements. If each protein quantity is represented on an
axis in two-dimensional space, each sample would correspond to a
point in that space (Fig. 4b). The Euclidian distance between the
points is then the physical distance separating them and is calculated
by the application of Eq. 1.
$$ d_{21} = \sqrt{\sum_{i=1}^{n} \left( x_{i2} - x_{i1} \right)^{2}} \qquad (1) $$

where d21 is the distance separating points 2 and 1, i iterates
through the dimensions in space (the proteins), n is the number of
dimensions in space, x is a coordinate of a dimension, and the
subscripts 2 and 1 designate the individual points (samples).
Thus, xi2 is the coordinate (protein abundance) of the ith dimen-
sion (protein) for point (sample) 2.
Most high-throughput data sets have more than two measure-
ments associated with each sample. Though somewhat abstract
because Euclidian distance cannot be visualized beyond three
dimensions, Eq. 1 holds true regardless and is often applied to
data sets containing thousands of dimensions (such as that used
to generate Fig. 1 where each measured protein abundance is
considered a dimension).
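As a minimal illustration, Eq. 1 translates directly to code; the two samples and their protein abundances below are invented:

```python
# Eq. 1 as code: Euclidian distance between two samples, each described by
# n protein abundances (here n = 2, matching Fig. 4b).
import math

def euclidean_distance(p1, p2):
    # Square root of the sum of squared coordinate differences across dimensions.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

sample_1 = [1.0, 4.0]  # (protein a, protein b) abundances for sample 1
sample_2 = [4.0, 8.0]  # a 3-4-5 right triangle away from sample 1
print(euclidean_distance(sample_1, sample_2))  # → 5.0
```

The same function works unchanged for thousands of dimensions, as in the data set behind Fig. 1.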

Correlation Consider the abundances of four proteins (c, d, e, and f) measured
in each of 20 samples. Correlation between the abundances of any
pair of these proteins expresses the degree to which they have a
linear relationship. Accordingly, if the pair is perfectly correlated
(correlation coefficient = 1), all of the points in a plot of the
abundances of one versus the other across all 20 samples would

Fig. 4 Illustrations of the computation of Euclidian distance and the concept of
correlation. (a) Table of abundances of protein a and protein b. (b) Plot of
abundances of protein a versus protein b. (c) Plot of abundances of protein c
versus protein d. (d) Plot of abundances of protein e versus protein f. All protein
abundance values are fictional and provided for illustrative purposes

lie on a straight line with a positive slope. In this example, the
abundances of proteins c and d are positively correlated (Fig. 4c).
The correlation is strong, but not perfect because all of the points
do not lie exactly on a single line. Perfect correlations are essentially
nonexistent in real data. A pair of protein abundances can also have
a negative correlation. If the negative correlation is perfect (corre-
lation coefficient = −1), all points of an analogous plot would lie on
a straight line with a negative slope. In this example, the abun-
dances of proteins e and f are negatively correlated (Fig. 4d). Again,
the correlation is strong, but not perfect because all of the points do
not lie exactly on a single line. A correlation coefficient can take any
value between −1 and 1, indicating the strength of the negative or
positive linear relationship. A value near 0 indicates the absence of a
linear relationship. The mathematical definition of the correlation
coefficient is provided by Eq. 2.

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\left( y_i - \bar{y} \right)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}} \qquad (2) $$

where r is the correlation coefficient, i iterates through the
measurements for the items x and y, n is the total number of
measurements for each item, and x̄ and ȳ are the mean measurement
values of items x and y. Of note, the ith set of measurements
contributes to a positive correlation when both members of the
set are on the same side of their respective item means, and con-
tributes to a negative correlation when members of the set are on
opposite sides of their respective item means (this determines the
sign of the numerator). The denominator serves to set the range of
possible values to −1 to 1.
Correlation coefficients can be calculated for each pair of items;
however, they are not distance values and cannot be placed directly
into a distance matrix of the form in Fig. 3. In most cases, research-
ers are interested in grouping items together whose measurements
are positively correlated. Thus, a correlation coefficient value of one
should correspond to the shortest possible distance of zero. This is
accomplished by transforming the correlation coefficient to a dis-
tance by subtracting it from 1, then dividing by 2 (sets the mini-
mum to 0 and the maximum to 1) (Eq. 3). Note that there are
other methods to transform a correlation coefficient into a distance.
$$ d = \frac{1 - r}{2} \qquad (3) $$

where d is the distance between two items and r is the correla-
tion coefficient between the same items.
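Eqs. 2 and 3 can likewise be sketched in code. The two measurement vectors below are invented and chosen to be perfectly linearly related, so the coefficient should come out at (essentially) 1 and the derived distance at 0:

```python
import math

def pearson_r(x, y):
    # Eq. 2: product of deviations from the means over the spread terms.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)) * \
          math.sqrt(sum((yi - my) ** 2 for yi in y))
    return num / den

def correlation_distance(r):
    # Eq. 3: maps r = 1 (most similar) to d = 0 and r = -1 to d = 1.
    return (1 - r) / 2

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # perfectly linear in x
print(pearson_r(x, y))                        # ≈ 1.0
print(correlation_distance(pearson_r(x, y)))  # ≈ 0.0
```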
Pearson and Spearman correlations differ in that Spearman
correlation coefficients are calculated on values that have been
transformed to ranks. The ranks range from 1 for the smallest
value, to the number of measurements for the largest value. If
values are identical, then they all receive the average of the ranks
they would occupy had they been adjacently ranked. Values are not
ranked prior to calculating Pearson correlation coefficients.
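The rank transformation underlying Spearman correlation, including the tie-averaging rule just described, can be sketched as follows (input values are invented); a Spearman coefficient is then simply a Pearson coefficient computed on these ranks:

```python
def to_ranks(values):
    # Rank 1 = smallest value; tied values share the average of the
    # ranks they would occupy had they been adjacently ranked.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

print(to_ranks([10.0, 20.0, 20.0, 30.0]))  # → [1.0, 2.5, 2.5, 4.0]
```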
Use of Spearman correlation vastly reduces sensitivity to out-
liers and is insensitive to nonlinear transformations (e.g., logarith-
mic or exponential). Use of Spearman correlation is an excellent
starting point if one is uncertain about the nature of the data set
relationships (linear, exponential, logarithmic, etc.) or if order-of-
magnitude errors are suspected in some of the measurements.
However, if linear relationships are expected (recall correlation
analyses quantify the degree of linearity between items) and the
measurements are believed to be stable (roughly meaning they are
rarely expected to be off by an order of magnitude), then Pearson

correlation may be more informative. Pearson correlations are still
extremely useful when examining data with nonlinear relationships
because data can be transformed prior to performing analyses. For
example, if exponential relationships are expected (very common),
logarithmically transforming the data may result in linear
relationships.

Euclidian Distance Versus Correlation With an understanding of the fundamentals of common distance
methods, criteria for selecting Euclidian distance or a correlation-
based method can be discussed. This determination should be
based on the data set and which of its properties the researcher
wishes to visualize.
Consider the plot in Fig. 5. When performing hierarchical
clustering based on Euclidian distance, curves A and B cluster
tightly together and are well separated from curve C. However,
when Pearson correlation is used, curves A and C cluster tightly and
are well separated from curve B. Both methods are correctly
applied, yet they produce very different results. Clustering based
on correlation will group items based on the similarity in the
pattern of the measurements, that is, how similarly the measure-
ments fluctuate above and below the mean, while clustering based
on Euclidian distance groups items based on the similarity in the
magnitude of each measurement.
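This distinction can be made concrete with a small sketch: three invented "curves," where A and B are close in magnitude but differ in pattern, while C shares A's pattern at ten times the magnitude (these are not the curves of Fig. 5):

```python
import math

def euclid(p, q):
    # Eq. 1: distance grows with differences in measurement magnitude.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson(x, y):
    # Eq. 2: coefficient grows with similarity in measurement pattern.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

a = [1.0, 3.0, 2.0, 4.0]      # curve A
b = [1.0, 2.0, 3.0, 2.0]      # curve B: near A in magnitude, different pattern
c = [10.0, 30.0, 20.0, 40.0]  # curve C: same pattern as A, 10x the magnitude

print(euclid(a, b) < euclid(a, c))    # True: A and B cluster under Euclidian distance
print(pearson(a, c) > pearson(a, b))  # True: A and C cluster under correlation
```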
Now consider the two presentations of the quantitative phos-
phoproteome of HeLa cells, as measured by Panizza et al. [2], in
Fig. 6. The two experimental conditions (mitotic arrested and

Fig. 5 An illustration of applying Euclidian distance and correlation-based
distance methods. Values are fictional and provided for illustrative purposes

Fig. 6 Heat maps displaying the relative quantified phosphoproteome of HeLa cells under the specified
conditions. Each column corresponds to a sample of HeLa cells and each row to a phosphorylated protein.
Applied distance methods are provided above the respective heat maps. Complete linkage is applied. Values
are normalized to untreated and log2 transformed as described in [2]

pervanadate-treated) are known inducers of protein phosphorylation,
creating sharp "on-off" effects in the measurements. Thus
alterations in the phosphoproteome are expected to be distin-
guished by measurements of magnitude, and Euclidian distance
based clustering best stratifies the items (in this case an “item” is
referring to a column) into biologically relevant clusters.

3.1.3 Calculating Distances Between Groups and Individual Items After the first iteration of the general algorithm, it is necessary to
calculate distances between clusters that are groups of items (Sub-
heading 3.1.1, step 2). To do so, a linkage method must be
specified. Three commonly used linkage methods (complete, sin-
gle, and average) determine distances between clusters based on the
pairwise distances between member items of each cluster.
Complete linkage takes the distance between clusters to be the
largest observed between a pair of individual items, one represented
from each cluster. Single linkage is computed analogously, except
the shortest distance between items is used. Average linkage con-
siders all pairwise distances between items of each cluster and takes
the average of these distances to be the distance separating the
clusters. Employing single linkage results in “long” or “winding”

Fig. 7 Dendrogram displaying the Euclidian distance-based hierarchical
clustering of relative quantified human kidney and liver sample proteomes.
Centroid linkage is applied. Values are normalized and log2 transformed prior
to clustering as described in [1]. Linkages highlighted in red indicate where
cluster linkages cross dendrogram branches

clusters (imagine a snake-like shape if the data could be represented
in two or three-dimensional space), complete linkage produces
“compact” clusters, and average linkage produces something
in-between.
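The three pairwise linkage rules can be sketched as one-liners. For simplicity, items here are taken to be points on a line so that the item-to-item distance is just an absolute difference; the values are invented:

```python
def complete_linkage(a, b, dist):
    # Largest pairwise distance between one item from each cluster.
    return max(dist(i, j) for i in a for j in b)

def single_linkage(a, b, dist):
    # Smallest pairwise distance between one item from each cluster.
    return min(dist(i, j) for i in a for j in b)

def average_linkage(a, b, dist):
    # Mean of all pairwise distances between items of the two clusters.
    return sum(dist(i, j) for i in a for j in b) / (len(a) * len(b))

dist = lambda i, j: abs(i - j)  # items are 1-D points
a, b = [0.0, 1.0], [4.0, 6.0]
print(complete_linkage(a, b, dist))  # → 6.0
print(single_linkage(a, b, dist))    # → 3.0
print(average_linkage(a, b, dist))   # → 4.5
```

Any of the distance methods from Subheading 3.1.2 could be substituted for `dist`.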
Another widely known linkage method, but not as commonly
employed, is centroid linkage. Conceptually, the centroid can be
thought of as a cluster’s center of gravity. Thus centroid linkage
links clusters (in step 2 of Subheading 3.1.1) based on finding the
minimum distance between cluster centers. This seems a
reasonable approach; however, it can produce counterintuitive results. In
some instances two items can be paired, bringing the centroid of
the newly formed cluster in closer proximity to an outside cluster
than the paired items are to each other. This situation may result in
a link crossing a branch as displayed in Fig. 7.
As an important conceptual note, centroids are a property of
mappings on a coordinate system. Thus centroid-based clustering
may seem only applicable when Euclidian distance is considered.
However, distances between cluster centroids (and between any
item and cluster centroids) can be determined by considering the
distance matrix created for step 1 of Subheading 3.1.1 (the details
of calculating distance to centroid values from distance matrices are
outside the scope of this chapter). Thus when correlations are
mapped to distances (as in Eq. 3), analyses requiring distances to
centroids can be applied.
Another commonly used linkage method is Ward’s method.
However, its application requires a modified algorithm:

1. Calculate a distance between each item and each other item
(using any of the distance methods described above). These
distance values are stored in a matrix where the rows and
columns are both labeled with the item names as in Fig. 3.
2. Use the distance matrix to find the two clusters where merging
into a single cluster minimizes the gain in the total sum of
squared distances between individual points and respective
cluster centroids. Initially, clusters are single items and the
sum of squared distances to cluster centroids is zero.
3. Place these two clusters adjacent to each other and connect
them by a double right-angle bracket whose total height is
representative of the newly formed cluster’s sum of squared
distances to the centroid. These paired clusters are now consid-
ered a single cluster.
4. Determine if there is more than one cluster remaining. If there
is, repeat steps 2–4.
The take-home message about Ward's linkage method is that
clusters are not formed by linking the closest items,
but by examining all potential new clusters that could be formed by
a single link and selecting the link that maximizes the "average"
tightness of all existing clusters, or, stated more precisely,
minimizes an objective function that is typically the within-cluster
variance [3]. Thus applying Ward's linkage method also results in
compact clusters.
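A minimal sketch of Ward's merging criterion, assuming one-dimensional items and Euclidian geometry for simplicity: the cost of a candidate merge is the increase in the summed squared distances to the cluster centroid, and the algorithm picks the merge with the smallest cost. Values are invented:

```python
def sse(cluster):
    # Sum of squared distances of cluster members to the cluster centroid.
    c = sum(cluster) / len(cluster)
    return sum((x - c) ** 2 for x in cluster)

def merge_cost(a, b):
    # Gain in total within-cluster sum of squares if a and b are merged.
    # Single-item clusters contribute zero before merging.
    return sse(a + b) - sse(a) - sse(b)

print(merge_cost([1.0], [2.0]))   # nearby points: small gain → 0.5
print(merge_cost([1.0], [10.0]))  # distant points: large gain → 40.5
```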

3.2 Row Scaling A consideration that greatly affects the presentation of heat maps
independently of the distance and linkage methods is whether and
how the rows are scaled. Row scaling allows for correction of two
inherent properties of the data which affect item clustering.
Rows whose corresponding measurements are generally larger
in magnitude will have a greater effect on item clustering. This is a
particularly important consideration in the analysis of microarray
and metabolomic data because signals can vary widely depending
on the characteristics of the mRNA probe or the metabolite (ten-
dency to ionize in mass spectrometry or magnitude of spin-state
transition in NMR). These varying magnitudes are not driven by
biological properties of the system and may greatly confound inter-
pretation of clustering if not corrected for by mean, median, or
some other form of normalization.
The second data property that greatly affects clustering is that
rows with higher variance will drive the clustering. This is not necessarily
problematic because differences in biological systems may be driven
by the higher-variance measurements (transcripts, metabolites, pro-
teins, etc.). However, near-equal weight can be given to each
measurement if rows are normalized by a measure of data spread
such as the standard deviation or interquartile range.
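A common form of row scaling, the z-score (subtract the row mean, then divide by the row standard deviation), addresses both properties at once and can be sketched as follows. The two invented rows below differ only in magnitude, so they become identical after scaling:

```python
import math

def z_score_row(row):
    # Center the row at its mean and divide by its (population) standard
    # deviation so every row carries near-equal weight in clustering.
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / n)
    return [(v - mean) / sd for v in row]

# Same pattern, 100x different magnitude: identical rows after scaling.
print(z_score_row([100.0, 200.0, 300.0]))
print(z_score_row([1.0, 2.0, 3.0]))
```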

Importantly, the description of any data set must be referenced
to determine if/how it has been normalized or transformed. For
instance, microarray readouts are often "RMA normalized" and
presented as log2 transformed. Additionally, there are varying
methods to tabulate proteomics data (depending partly on whether
they derive from multiplexed, SILAC, or label-free experiments)
and RNAseq data. Understanding the meaning of values within a
data set is the first step to performing or interpreting analyses.

References

1. Zhu Y, Orre LM, Johansson HJ, Huss M, Boekel J, Vesterlund M, Fernandez-Woodbridge A, Branca RMM, Lehtio J (2018) Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow. Nat Commun 9(1):903. https://doi.org/10.1038/s41467-018-03311-y
2. Panizza E, Branca RMM, Oliviusson P, Orre LM, Lehtio J (2017) Isoelectric point-based fractionation by HiRIEF coupled to LC-MS allows for in-depth quantitative analysis of the phosphoproteome. Sci Rep 7(1):4513. https://doi.org/10.1038/s41598-017-04798-z
3. Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244. https://doi.org/10.1080/01621459.1963.10500845
