Professional Documents
Culture Documents
2
One way to begin exploring a large data set the basic case of functional relationships: An equi-
is to search for pairs of variables that are closely table statistic should give similar scores to func- Rows
associated. To do this, we could calculate some tional relationships with similar R2 values (given
3
measure of dependence for each pair, rank the sufficient sample size).
pairs by their scores, and examine the top-scoring Here, we describe an exploratory data anal-
...
pairs. For this strategy to work, the statistic we ysis tool, the maximal information coefficient
use to measure dependence should have two heu- (MIC), that satisfies these two heuristic proper-
ristic properties: generality and equitability. ties. We establish MIC’s generality through proofs, C 1.0
Normalized Score
By generality, we mean that with sufficient show its equitability on functional relationships
sample size the statistic should capture a wide through simulations, and observe that this trans-
0.5
range of interesting associations, not limited to lates into intuitively equitable behavior on more
specific function types (such as linear, exponential, general associations. Furthermore, we illustrate
or periodic), or even to all functional relation- that MIC gives rise to a larger family of sta- 0.0
Ho
20
ships (3). The latter condition is desirable because tistics, which we refer to as MINE, or maximal
riz
on
10
tal
0 10
MINE statistics can be used not only to identify Bins
is B
1 0 Axis
Department of Computer Science, Massachusetts Institute of Vertical
ins
Technology (MIT), Cambridge, MA 02139, USA. 2Broad Institute interesting associations, but also to characterize
of MIT and Harvard, Cambridge, MA 02142, USA. 3Department them according to properties such as nonline-
of Statistics, University of Oxford, Oxford OX1 3TG, UK. 4De-
arity and monotonicity. We demonstrate the Fig. 1. Computing MIC (A) For each pair (x,y), the
partment of Mathematics, Harvard College, Cambridge, MA MIC algorithm finds the x-by-y grid with the highest
02138, USA. 5Department of Computer Science and Applied application of MIC and MINE to data sets in
induced mutual information. (B) The algorithm
Mathematics, Weizmann Institute of Science, Rehovot, Israel. health, baseball, genomics, and the human
6
Center for Systems Biology, Department of Organismic and normalizes the mutual information scores and
microbiota. compiles a matrix that stores, for each resolution,
Evolutionary Biology, Harvard University, Cambridge, MA 02138,
The maximal information coefficient. Intu- the best grid at that resolution and its normalized
USA. 7Wellcome Trust Centre for Human Genetics, University of
Oxford, Oxford OX3 7BN, UK. 8Department of Biology, MIT, itively, MIC is based on the idea that if a re- score. (C) The normalized scores form the char-
Cambridge, MA 02139, USA. 9Department of Systems Biology, lationship exists between two variables, then a acteristic matrix, which can be visualized as a sur-
Harvard Medical School, Boston, MA 02115, USA. 10School of grid can be drawn on the scatterplot of the two face; MIC corresponds to the highest point on this
Engineering and Applied Sciences, Harvard University, Cam- variables that partitions the data to encapsulate
bridge, MA 02138, USA. surface. In this example, there are many grids that
that relationship. Thus, to calculate the MIC of a achieve the highest score. The star in (B) marks a
*These authors contributed equally to this work.
†To whom correspondence should be addressed. E-mail:
set of two-variable data, we explore all grids up sample grid achieving this score, and the star in (C)
dnreshef@mit.edu (D.N.R.); yreshef@post.harvard.edu (Y.A.R.) to a maximal grid resolution, dependent on the marks that grid’s corresponding location on the
‡These authors contributed equally to this work. sample size (Fig. 1A), computing for every pair surface.
A B 1
CorGC **
(MIC)
Cubic 1.00 0.61 0.69 3.09 3.12 0.98 1.00
0.4
Exponential 1.00 0.70 1.00 2.09 3.62 0.94 1.00
Sinusoidal 0.2
1.00 -0.09 -0.09 0.01 -0.11 0.36 0.64
(Fourier frequency)
Correlation Coefficient
1.00 0.00 0.00 0.01 0.20 0.40 0.80
(non-Fourier frequency)
0.2
G Maximal Information Coefficient (MIC)
0.80 0.65 0.50 0.35
0
Relationship Type Added Noise 0 0.2 0.4 0.6 0.8 1
D 5.0
Mutual Information (Kraskov et al.)
2
Line and Parabola
1.5
X
1
Ellipse
* *
0.5
Sinusoid
* *
(Mixture of two signals) 0
0 0.2 0.4 0.6 0.8 1
Non-coexistence
E 1
Maximal Correlation (ACE)
0.8
correlation (estimated using ACE), and the principal curve-based CorGC de-
pendence measure, respectively, of 27 different functional relationships with 0.2
independent uniform vertical noise added, as the R2 value of the data relative to
the noiseless function varies. Each shape and color corresponds to a different 0
0 0.2 0.4
* 0.6 0.8
*
1
combination of function type and sample size. In each plot, pairs of thumbnails
show relationships that received identical scores; for data exploration, we would F 1
CorGC (Principal Curve-Based)
like these pairs to have similar noise levels. For a list of the functions and sample
sizes in these graphs as well as versions with other statistics, sample sizes, and
noise models, see figs. S3 and S4. (G) Performance of MIC on associations not
0.8
* *
well modeled by a function, as noise level varies. For the performance of other 0.6
statistics, see figs. S5 and S6.
0.4
* *
0.2
0
0 0.2 0.4 0.6 0.8 1
Noise (1 - R2)
A B C
30 30 30
ns
ns
ns
20 20 20
Bi
Bi
Bi
0.0 0.0 0.0
xis
xis
0 0 0
xis
lA
lA
lA
10 10 10 10 10
Ver 10
ta
ta
Ver
ta
Ver tica
on
on
tica
on
tica 20 l Ax 20 l Ax 20
l Ax
riz
riz
is B is B
riz
is B 0 ins 0 0
Ho
Ho
ins
Ho
ins 30 30 30
D E F
Normalized Score
30 30 30
ns
s
ns
n
20 20 20
Bi
Bi
Bi
xis
0 0 0
Ax
lA
lA
10 10 10
Ver 10 Ver 10 10
al
Ver
ta
ta
tica
nt
tica tica
on
on
l Ax l Ax l Ax
o
20 20
is B 20
riz
is B
riz
riz
is B 0 ins 0 ins 0
ins
Ho
Ho
Ho
30 30 30
A B C D E
Fig. 3. Visualizations of the characteristic matrices of common relation- axis bins (columns), and the z axis represents the normalized score of the
ships. (A to F) Surfaces representing the characteristic matrices of several best-performing grid with those dimensions. The inset plots show the rela-
common relationship types. For each surface, the x axis represents num- tionships used to generate each surface. For surfaces of additional relation-
ber of vertical axis bins (rows), the y axis represents number of horizontal ships, see fig. S7.
A 1 C 40 D 90 E
20 60
800
Pearson Correlation Coefficient (ρ)
0.5
10
0 30 0
0 4 8 12 16 2 4 6 0 1x105 2x105 2x106
C F Dentist Density (per 10,000) Children Per Woman Number of Physicians
0
G E
F 75 G 60 H
3
25
2000
D
0
0
-1 0 0
0 0.25 0.5 0.75 1 0 20,000 40,000 0 150 300 0 20,000 40,000
MIC Score Income / Person (Int$) Health Exp. / Person (US$) Gross Nat’l Inc / Person (Int$)
B 1 I MIC ~
= 0.85 MIC ~
= 0.65
300 100
Under Five Mortality Rate
Mutual Information =
Pearson Correlation Coefficient (ρ)
200
0.5
60
100
c f 20
0
0 0 2000 4000 6000 10 40 70 100
Health Exp. / Person (Int$) Rural Access to Improved Water Sources (%)
ge
Mutual Information =~ 0.65
250
Under Five Mortality Rate
15 80
12
9
-0.5
6
3
125
60
d
0
-1
0 1.0 2.0 3.0
0 40
Mutual Information (Kraskov et al.) 0 2000 4000 6000 100 500 900
Health Exp. / Person (US$) Cardiovascular Disease Mortality (per 1E5)
Fig. 4. Application of MINE to global indicators from the WHO. (A) MIC Pacific island nations in which obesity is culturally valued (33); most other
versus r for all pairwise relationships in the WHO data set. (B) Mutual countries follow a parabolic trend (table S10). (H) A superposition of two
information (Kraskov et al. estimator) versus r for the same relationships. relationships that scores high under all three tests, presumably because the
High mutual information scores tend to be assigned only to relationships majority of points obey one relationship. The less steep minority trend
with high r, whereas MIC gives high scores also to relationships that are consists of countries whose economies rely largely on oil (37) (table S11).
nonlinear. (C to H) Example relationships from (A). (C) Both r and MIC yield The lines of best fit in (D) to (H) were generated using regression on each
low scores for unassociated variables. (D) Ordinary linear relationships score trend. (I) Of these four relationships, the left two appear less noisy than the
high under both tests. (E to G) Relationships detected by MIC but not by r, right two. MIC accordingly assigns higher scores to the two relationships on
because the relationships are nonlinear (E and G) or because more than one the left. In contrast, mutual information assigns similar scores to the top two
relationship is present (F). In (F), the linear trendline comprises a set of relationships and similar scores to the bottom two relationships.
0 1 2 3 4
(figs. S8 and S9) (24). D
CPR6
Because MIC is general and roughly equal to 1
R2 on functional relationships, we can also define 6
F 0
a natural measure of nonlinearity by MIC – r2,
where r denotes the Pearson product-moment cor- -1
relation coefficient, a measure of linear depen- E 0 1 2 3 4
D
Gene Expression
dence. The statistic MIC – r2 is near 0 for linear 0 C E HSP12
relationships and large for nonlinear relationships 0 0.2 0.4 0.6 0.8 1 1
-1
2.3 and table S1). Finally, MINE statistics can
also be used in cluster analysis to observe the F 0 1 2 3 4
Second Generation
Abundance (%) of OTU4273 (Eubacteriaceae)
0.15
0.04 0.04
0.1
0.02 0.02
0.05
0 0 0
0 0.05 0.1 0.15 0.2 0.25 0 0.1 0.2 0.3 0 0.02 0.04 0.06 0.08 0.1
Abundance (%) of OTU710 (Erysipelotrichaceae) Abundance (%) of OTU3857 (Porphyromonadaceae) Abundance (%) of OTU708 (Lachnospiraceae)
The Structure of the Eukaryotic includes one rRNA chain (18S) and 33 proteins.
Of the 79 proteins, 32 have no homologs in crys-
SUPPLEMENTARY http://science.sciencemag.org/content/suppl/2011/12/15/334.6062.1518.DC2
MATERIALS
http://science.sciencemag.org/content/suppl/2011/12/14/334.6062.1518.DC1
RELATED http://science.sciencemag.org/content/sci/334/6062/1502.full
CONTENT
REFERENCES This article cites 35 articles, 5 of which you can access for free
http://science.sciencemag.org/content/334/6062/1518#BIBL
PERMISSIONS http://www.sciencemag.org/help/reprints-and-permissions
Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of
Science, 1200 New York Avenue NW, Washington, DC 20005. The title Science is a registered trademark of AAAS.
Copyright © 2011, American Association for the Advancement of Science