
Immunity

Perspective

Expanding the Immunology Toolbox: Embracing Public-Data Reuse and Crowdsourcing
Rachel Sparks,1 William W. Lau,2 and John S. Tsang1,*
1Systems Genomics and Bioinformatics Unit, Laboratory of Systems Biology, National Institute of Allergy and Infectious Diseases,
National Institutes of Health
2Office of Intramural Research, Center for Information Technology, National Institutes of Health

*Correspondence: john.tsang@nih.gov
http://dx.doi.org/10.1016/j.immuni.2016.12.008

New technologies have been propelling dramatic increases in the volume and diversity of large-scale public
data, which can potentially be reused to answer questions beyond those originally envisioned. However, this
often requires computational and statistical skills beyond the reach of most bench scientists. The
development of educational and accessible computational tools is thus critical, as are crowdsourcing efforts that
utilize the community’s expertise to curate public data for hypothesis generation and testing. Here we review
the history of public-data reuse and argue for greater incorporation of computational and statistical sciences
into the biomedical education curriculum and the development of biologist-friendly crowdsourcing tools.
Finally, we provide a resource list for the reuse of public data and highlight an illustrative crowdsourcing
exercise to explore public gene-expression data of human autoimmune diseases and corresponding mouse
models. Through education, tool development, and community engagement, immunologists will be poised to
transform public data into biological insights.

Introduction

Genetics, biochemistry, and molecular and cellular biology have long been staples of the
immunologists’ toolbox. They have led to fundamental insights about the immune system, whose
enormous complexity, however, continues to demand new investigative tools. Recently, the advent of
high-throughput, multiplexed molecular technologies, such as next-generation DNA sequencing and
high-dimensional flow and mass cytometry, have provided a more global, higher-resolution view of
immune cells and tissues. Gene expression can now be routinely measured genome-wide (a.k.a. the
“transcriptome”) in different immune-cell populations down to the single-cell level (Satija and
Shalek, 2014; Shapiro et al., 2013). Similar measurements can also be made in other modalities,
including chromatin and methylation states (Laird, 2010; Schwartzman and Tanay, 2015; Zhou et al.,
2011), protein expression (Darmanis et al., 2016; Larance and Lamond, 2015), and post-translational
modifications (Olsen and Mann, 2013). Coupled with the increasingly broad data-sharing requirements
from funding agencies (Paltoo et al., 2014), the volume and complexity of biological information
have been increasing at an unprecedented rate. Because of the highly multiplexed nature of these
measurements (e.g., covering most genes), the data can often be reused to answer questions beyond
those posed in the original publication or envisioned when the dataset was generated. Indeed, there
have been increasing in silico hypothesis testing and generation via reuse of public data, initially
more in cancer biology (Segal et al., 2005; Segal et al., 2004) but recently also in immunology; for
example, researchers have used public gene-expression data from multiple studies to infer blood
transcriptomic modules (BTMs) (Chaussabel and Baldwin, 2014; Li et al., 2014), derive universal
blood signatures of viral infection (Andres-Terre et al., 2015; Tsalik et al., 2016), and develop
predictive signatures to differentiate between infectious and non-infectious sources of
inflammation (Sweeney et al., 2015). These examples highlight how integration of public data from
multiple studies can lead to new and potentially robust biological insights and how it might serve
as a productive starting point for informing the design of future experiments. This data-reuse
approach differs from and is complementary to the traditional “single lab,” hypothesis-driven
research paradigm (Rung and Brazma, 2013) and can be empowering for all scientists. However, to
date, computational biologists have been the primary group of researchers driving and benefitting
the most from public-data reuse and integration approaches because of their computer programming
and statistical expertise.

Here we argue that it is imperative for the immunology community to embrace the reuse and
exploration of public data alongside their current arsenal of experimental tools to tackle
immunological complexity. There are obvious cultural and scientific challenges, including a lack of
general awareness about the scientific potential of data reuse and the fact that few bench
scientists have sufficient computational and statistical expertise for productive, hands-on
analysis of public data. Below we first survey the history and examples of large-scale public-data
generation and reuse to highlight the benefits and challenges. We then discuss ideas involving a
combination of education and the development of open, biologist-friendly software platforms to not
only enable hands-on testing and generation of hypotheses through the use of public data but also
take advantage of the collective scientific expertise of the immunology research community (a.k.a.
“crowdsourcing”) to help enhance the usability and utility of the growing body of public datasets
(Saez-Rodriguez et al., 2016; Silberzahn and Uhlmann, 2015). Finally, we report encouraging
observations from an experiment testing elements of our proposal in the microcosm of the National
Institutes of Health (NIH) immunology community, with whom we carried out a crowdsourcing,
“jamboree” exercise to assess

Immunity 45, December 20, 2016 © 2016 Published by Elsevier Inc. 1191

Box 1. Major Concepts and Glossary

Differentially expressed (DE) gene detection: identifies genes expressed at different levels (i.e., with increased or decreased
expression) between two groups of samples. One can subsequently evaluate the identified DE genes via gene-set enrichment
analysis to determine whether genes annotated in particular biological pathways or processes (e.g., those from the Gene
Ontology Consortium or KEGG [Kyoto Encyclopedia of Genes and Genomes]) are overrepresented in the list of DE genes.
Gene-set enrichment analysis can be performed with programs such as ToppGene (https://toppgene.cchmc.org) (Chen
et al., 2009), which takes in a set of DE genes determined via fixed cutoffs (e.g., a p value and/or a fold-change threshold);
or the analysis can be performed by a cutoff-free approach such as GSEA
(http://software.broadinstitute.org/gsea/index.jsp) (Subramanian et al., 2005).
False Discovery Rate (FDR): the expected fraction of statistically significant tests that are false positives
(Benjamini and Hochberg, 1995).
Replication: the ability to reproduce the findings of one study in a subsequent study (Goodman et al., 2016).
Gene-expression signature: a set of genes whose expression level is associated with a phenotype.
Signature matching: the use of pattern-matching algorithms to compare the gene-expression signature reflecting a particular
phenotype to a set of established gene-expression signatures (Lamb, 2007; Lamb et al., 2006).
Meta-analysis: a statistical technique that pools information from multiple studies to identify coherent signals. This pooling is
often done with a collection of data sets where within each data set the same type of comparison or correlation analysis is
performed (e.g., transcripts differentially expressed in lupus patients versus healthy controls), but it is possible to apply the
same approach to pool information across different types of comparisons or correlation analyses (e.g., searching for conserved
signals between lupus versus healthy and rheumatoid arthritis versus healthy comparisons). Statistical methods frequently used for
meta-analysis include combining p values, combining effect sizes, combining ranks, and directly merging the raw data (Campain and
Yang, 2010; Chang et al., 2013; Evangelou and Ioannidis, 2013; Ramasamy et al., 2008; Sweeney et al., 2016; Tseng et al., 2012).
Module analysis: modules are sets of genes that are grouped together in a biologically meaningful way. Gene modules are
often formed through identification of mRNA transcripts with correlated expression across a set of conditions. Module analysis
is a type of dimension reduction, an effort to extract essential information from data sets with a large number of variables
(Chaussabel and Baldwin, 2014; Chaussabel et al., 2008; Li et al., 2014; Segal et al., 2005; Segal et al., 2004).
Machine learning: computer algorithms that can infer “rules” from data (Libbrecht and Noble, 2015). There are two primary
categories: supervised and unsupervised machine learning. Supervised learning creates classification rules from labeled
data (e.g., identification of gene-expression patterns that can separate known cancer types). Unsupervised learning detects
structure or classification rules from unlabeled data (e.g., learning previously unknown subtypes of cancer from data alone).
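
The FDR definition above is commonly put into practice with the Benjamini-Hochberg step-up procedure, which converts the raw p values from a DE analysis into adjusted values that can be thresholded at a chosen FDR. The following is a minimal Python sketch of that adjustment, assuming the tests are independent or positively dependent. The function name `benjamini_hochberg`, the gene names, and the p values are hypothetical illustrations and are not taken from any study discussed here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    n = len(pvalues)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-gene p values from a two-group comparison.
example_pvalues = {"GENE_A": 0.001, "GENE_B": 0.008, "GENE_C": 0.02, "GENE_D": 0.3}
genes = list(example_pvalues)
adjusted = dict(zip(genes, benjamini_hochberg([example_pvalues[g] for g in genes])))
# Genes called differentially expressed at an FDR threshold of 0.05.
de_genes = [g for g in genes if adjusted[g] <= 0.05]
```

In practice a fold-change cutoff is often applied alongside the FDR threshold, as noted in the DE entry above; the resulting gene list is what a tool such as ToppGene takes as input.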

[Figure B1 schematic: data from Studies A through N are combined as comparison group pairs (CGPs; disease group versus control group, each a genes-by-samples matrix); meta-analysis techniques (see Box 1) are applied; the resulting lists of genes with increased ("Up") or decreased ("Down") expression feed downstream analyses such as pathway enrichment analysis and module analysis.]

Figure B1. Basic Concept of Meta-Analysis Using Gene-Expression Data
A collection of CGPs (comparison group pairs) where each CGP comprises two groups of gene-expression profiles (e.g., disease versus healthy sample
groups) is shown. Meta-analysis is performed by deriving a CGP collection, typically from more than one study, and evaluating for coherent signals (genes with
consistent increased or decreased expression) across the CGPs. One can use multiple statistical techniques to perform meta-analyses (see text in this Box).
The output of meta-analysis typically includes lists of differentially expressed (i.e., with increased or decreased expression) genes, which can be used for
downstream analyses, such as pathway enrichment analysis or module analysis.
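
Of the meta-analysis techniques listed in this Box, combining p values is the simplest to illustrate. The sketch below uses Fisher's method, chosen here only as one common example of that family and not as the approach of any particular study cited in this article. It assumes each CGP contributes an independent, one-sided p value per gene (all testing the same direction of change, so that a small combined value reflects a coherent signal). The function name `fisher_combined_pvalue`, the gene names, and the p values are hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p values with Fisher's method."""
    if not pvalues or any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p values must lie in (0, 1]")
    k = len(pvalues)
    # Fisher's statistic is -2 * sum(ln p), chi-square with 2k degrees of
    # freedom under the null; only half of it is needed below.
    half_stat = -sum(math.log(p) for p in pvalues)
    # Closed-form chi-square upper tail for even degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical one-sided p values (testing increased expression),
# one value per CGP for each gene.
cgp_pvalues = {
    "GENE_A": [0.04, 0.03, 0.06],
    "GENE_B": [0.60, 0.45, 0.70],
}
combined = {gene: fisher_combined_pvalue(ps) for gene, ps in cgp_pvalues.items()}
```

Here GENE_A, which shows modest but consistent evidence in every CGP, receives a much smaller combined p value than any single CGP provides, whereas GENE_B does not. Effect-size and rank-based combinations weigh studies differently and may be preferable when sample sizes vary widely across CGPs.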

transcriptomic signatures of several common human autoimmune diseases and their mouse models by
using public data. Ultimately, we hope that better education, continued tool development, and broad
community engagement will help usher in an era where scientists at any level, from the
undergraduate student to the principal investigator, can effectively utilize data from the public
domain to advance the field of immunology.

A Survey of Large-Scale Data Generation and Reuse: From Cancer to Immunology and Beyond

Gene-expression data obtained from microarrays, and more recently, RNA-seq, are exemplar
large-scale, publicly available biological data types whose reuse remains largely untapped by the
broad immunology community. Pioneered as a technique to measure genome-wide transcript expression
(Chee et al., 1996; Schena et al., 1995), microarray technology found early application in the
field of cancer biology, where it enabled a more comprehensive comparison of gene expression
profiles between tumor and healthy tissue samples (Alon et al., 1999). In the late 1990s, the
development of machine-learning techniques to correlate expression patterns with cancer classes
(e.g., acute myeloid leukemia versus acute lymphoblastic leukemia) led to the use of microarray
data both in cancer subtype prediction (i.e., assignment of samples to known classes of cancer) and
in class discovery (identification of new cancer types or subtypes) (Alizadeh et al., 2000; Golub
et al., 1999). Within a few years, expression data generated from several types of cancer (e.g.,
breast, lung, skin, and colon) had provided cancer biologists with a more global perspective on the
transcriptional changes that occur in carcinogenesis (Alon et al., 1999; Bittner et al., 2000;
Perou et al., 2000; Wang et al., 2000). Shortly thereafter, these techniques were adopted by other
fields, including infectious diseases (Jenner and Young, 2005) and immunology, as demonstrated by
an early evaluation of peripheral blood mononuclear cell (PBMC) gene expression from patients with
the autoimmune disease systemic lupus erythematosus (SLE), which identified an “interferon (IFN)
signature” involving the upregulation of IFN-inducible genes in SLE patients compared to healthy
controls (Baechler et al., 2003; Bennett et al., 2003).

Concurrently, in the late 1990s several groups created compendia of expression profiles from yeast
in a variety of states. Some yeast had genetic mutations, some had been exposed to chemicals,
drugs, and environmental stresses, and some were in distinct stages of the cell cycle (Gasch et
al., 2000; Hughes et al., 2000; Spellman et al., 1998). Most notable was a compendium of expression
profiles generated by Hughes et al. (Hughes et al., 2000) from genetic mutations and chemical
exposures. They used this collection of reference expression patterns to search for potential
pathways affected by chemical compounds with previously unknown targets or to assess the effects of
gene deletions in uncharacterized open-reading frames by assessing how these “unknown” profiles
clustered with profiles in the compendium. This concept later inspired the creation of the
“Connectivity Map,” a collection of mRNA signatures (i.e., a set of genes with altered expression
in one condition versus another) generated through the exposure of human cell lines to a large
number of small molecules (Lamb, 2007; Lamb et al., 2006). Similar to the Hughes et al. approach,
researchers can use these reference profiles with pattern-matching computational tools to identify
both correlated and anti-correlated relationships with other gene expression data through a
process termed “signature matching” (Box 1) (Iorio et al., 2013; Lamb et al., 2006). These
datasets have also been amply reused by bioinformaticians for the development of computational
methods and the construction of gene network models (Pe’er et al., 2001; Segal et al., 2003a;
Segal et al., 2003b). However, performing such analyses has been primarily limited to researchers
with advanced computational and statistical skills and has thus been largely out of reach for the
traditional bench scientist.

With the rapid accumulation of large datasets such as gene-expression compendia has come the
establishment of numerous data repositories, including Gene Expression Omnibus (GEO) and
ArrayExpress, for housing such data (see Box 2 for references, internet links, and more details on
public data repositories discussed here). Some more recent repositories are organized around
biological areas and contain multi-modal data generated from the same samples. For example, The
Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium contain DNA sequence,
gene expression, and other types of molecular and clinical data from multiple tumor types. In the
immunological realm, ImmPort serves as a repository of immunology research data; ImmuneSpace, a
data portal of the Human Immunology Project Consortium (HIPC), contains human immune profiling
data, such as blood transcriptomic and phenotyping of peripheral immune cells, taken before and
after perturbations such as influenza vaccination. On the mouse side, the Immunological Genome
Project (ImmGen) is a large-scale collaborative effort to create a gene-expression compendium of
mouse immune cells.

The increase in publicly available expression datasets spurred interest in reusing and integrating
data across studies through meta-analysis, an approach that pools information from multiple studies
to identify coherent signals (Figure B1 in Box 1). For example, the IFN signature originally
identified in SLE patients was subsequently found in several other autoimmune conditions, including
rheumatoid arthritis (RA) and Sjogren’s syndrome, through meta-analysis studies that evaluated
several autoimmune conditions simultaneously (Higgs et al., 2012; Toro-Domínguez et al., 2014). In
comparison to single-study analysis, meta-analysis has several potential advantages, including
increased statistical power, and the results are
typically more robust. Such studies typically have better replicability because platform- and
study-specific issues are minimized through computational techniques that identify congruent
signals across studies [reviewed in (Campain and Yang, 2010; Chang et al., 2013; Ramasamy et al.,
2008; Sweeney et al., 2016; Tseng et al., 2012); examples in (Andres-Terre et al., 2015; Khatri et
al., 2013; Sweeney et al., 2015)]. Meta-analyses can still be susceptible to pre-analytic biases,
such as those associated with dataset selection, but having a clear preconceived strategy for the
identification of independent discovery and validation datasets and a well-thought-out analysis
plan can help to mitigate these issues (Ramasamy et al., 2008).

Early examples of gene-expression meta-analyses are found primarily in cancer biology (Rhodes et
al., 2002; Rhodes et al., 2004; Wirapati et al., 2008), where independent expression profiling of
tumor samples produced numerous datasets that could be analyzed collectively. More recently,
meta-analysis of TCGA and other public expression data from tumor samples has been used for
molecular and genomic characterization of multiple cancer subtypes (Weinstein et al., 2013) as well
as to uncover novel cancer biology, including the effect of copy number variation on gene
expression (Fehrmann et al., 2015), and the identification of common cancer subtypes among
different tumors (Hoadley et al., 2014) and tumor-associated microRNAs and methylation status
(Gross et al., 2015). These examples demonstrate how data available in the public domain can be
empowering for scientists who wish to generate and test hypotheses, particularly prior to designing
an experiment or when deciding how to prioritize laboratory resources. As the volume of public gene
expression data has grown, there is increased power to extract biological signals within the data
(Lukk et al., 2010; Torrente et al., 2016) and greater ability to construct independent discovery
and validation study sets that are critical for the development and evaluation of robust
gene-expression signatures (Andres-Terre et al., 2015; Khatri et al., 2013; Sweeney et al., 2015).

As the importance of immunotherapy in cancer becomes increasingly appreciated, and given the
pioneering role that cancer biology played in the reuse of gene expression data, it is not
surprising that the field of cancer immunology has been utilizing public data to assess immune
infiltrates in tumors. One example involves using public expression data from purified subsets of
immune cells to build a compendium of mRNA signatures for “deconvolving” the frequency of these
cell subsets in tumors (Bindea et al., 2013). Researchers used this approach, in combination with
other experimental techniques, to evaluate human colorectal cancer samples at both the center of
the tumor and at the invasive margin. This approach generated an “immune spatiotemporal landscape”
within a tumor, which highlighted the important role of the adaptive immune system in the tumor
immunological milieu and led to the discovery that certain immune-cell infiltrates were associated
with disease prognosis. Similarly, a recent study that characterized the infiltrating T cell
receptor repertoire by using more than 9,000 RNA-seq samples from 29 cancer types in TCGA found
that the degree of TCR CDR3 diversity was positively correlated with the burden of tumor
nonsynonymous somatic mutations and the expression level of two genes encoding cancer/testis
antigens (Li et al., 2016). Gentles et al. compiled public data on gene expression and clinical
outcomes from approximately 18,000 subjects with 39 cancer subtypes (Gentles et al., 2015). They
performed a “pan-cancer meta-analysis” to identify genes associated with prognosis and used the
immune-cell deconvolution tool CIBERSORT (Newman et al., 2015) to infer the relative frequency of
22 leukocyte subsets within each tumor type. Together, these data again suggested a relationship
between the tumor immune environment and disease outcome; similar to the findings of Bindea et al.
(2013), a larger presence of T cells within the tumor was associated with better patient outcomes.
In a more focused evaluation of the specific role of cytotoxic immune cells (CD8+ T cells and NK
cells) in the tumor microenvironment, Rooney et al. created a quantitative measure of immune
cytolytic activity on the basis of mRNA levels of granzyme A and perforin by using data from
multiple cancers in TCGA (Rooney et al., 2015). They found that higher expression of this
“cytolytic signature” was associated with a modest increase in survival, greater expression of
tumor neoepitopes, expression of endogenous retroviruses within several tumor types, and increased
presence of mutations in genes involved in antigen presentation and extrinsic apoptosis. Taken
together, these studies of the tumor-immune relationship are notable for their insights into a
complex biology and for the utility of public gene-expression data in contributing to this
understanding.

There has also been increasing in silico hypothesis testing via reuse of public gene-expression
data in immunology. In an effort to better understand the human immune response to vaccination, Li
et al. (2014) combined data generated in a human study of responses to the polysaccharide and
conjugate meningococcal vaccines with public gene-expression data from several other vaccination
studies. To assess multi-gene, “module”-level changes in peripheral blood immune cells after
vaccination, they first used public data from more than 30,000 human blood gene-expression profiles
to construct a set of BTMs, each of which comprises genes whose expression is correlated with each
other (see Box 1). A similar modular approach was used earlier in a study of SLE (Chaussabel and
Baldwin, 2014; Chaussabel et al., 2008). The Li et al. (2014) study found that three modular
transcriptomic patterns were present in PBMCs in the first few days after vaccination and that,
interestingly, these patterns were clustered by the type of vaccine used (live viral versus
polysaccharide versus protein). The authors thus concluded that there was unlikely to be a common,
early-response signature predictive of the antibody response to multiple types of vaccines.
However, in separate efforts, universal blood signatures of viral infection have been derived from
infection data (Andres-Terre et al., 2015; Tsalik et al., 2016). Notably, one study used
meta-analysis of public expression data from multiple studies to generate conserved gene signatures
for several types of viral infection or influenza-specific infection; the influenza-specific
signature was able to distinguish influenza infection from bacterial and non-influenza viral
infections (Andres-Terre et al., 2015). Similarly, a meta-analysis of public microarray datasets
generated an 11-gene signature that allowed researchers to differentiate between infectious and
non-infectious sources of inflammation after, for example, trauma (Sweeney et al., 2015).

Other interesting immunological hypotheses recently evaluated solely through the reuse of public
data include the question of a universal signature of organ rejection and the balance of
leukocyte subsets in schizophrenia. In one case, researchers combined eight transplant
gene-expression datasets from four solid organs (kidney, lung, heart, and liver) to identify a
common gene expression signature of acute rejection (Khatri et al., 2013). This signature was
reassuringly also positively correlated with the degree of histologic graft injury in a separate
group of renal-transplant public datasets. In another example, as part of an investigation of the
link between immunology and psychiatry, researchers applied CIBERSORT (Newman et al., 2015) to
public gene-expression data from subjects with schizophrenia and bipolar disorder to infer the
immune cell proportions in blood for each group of subjects. This approach identified a lower
number of circulating natural killer cells in medicated schizophrenia patients than in healthy
controls (Karpiński et al., 2016). As more data become available in the public domain, these
hypotheses can be further refined, and new hypotheses can be generated.

Researchers have also applied public gene-expression data to identify possible drug targets and
potentially novel drug indications [reviewed in (Chen and Butte, 2016)]. This is facilitated by the
ability to generate disease-specific expression signatures and by the development of drug-specific
gene-signature databases, such as the “Connectivity Map” discussed earlier (Lamb, 2007; Lamb et
al., 2006). Through identification of inverse drug-disease correlations, numerous novel drug
indications, including fenofibrate for the prevention of graft rejection and cimetidine in the
treatment of lung adenocarcinoma, have been proposed (Kidd et al., 2016; Roedder et al., 2013;
Sirota et al., 2011). Although some independent experimental data support these novel drug-disease
associations, there are not yet strong clinical data demonstrating the success of this approach.

In addition to gene-expression data, meta-analyses of publicly available data from genome-wide
association studies (GWASs) are now routinely used as a way to increase statistical power and
decrease the risk of false positives associated with smaller studies (Begum et al., 2012; Evangelou
and Ioannidis, 2013). A particularly successful case is autoimmunity, where GWAS meta-analyses have
identified numerous unique and shared loci among several diseases, including SLE, RA, type I
diabetes, multiple sclerosis, Crohn’s disease, and ulcerative colitis (Anderson et al., 2011;
Bradfield et al., 2011; Franke et al., 2010; Márquez et al., 2016; Morris et al., 2016;
Patsopoulos et al., 2011; Stahl et al., 2010). Although the push for larger meta-analyses has
successfully uncovered new risk loci (Okada et al., 2014), increasing coverage or sample sizes does
not necessarily result in additional associations (Fuchsberger et al., 2016). Candidate genes
emerging from GWAS meta-analyses could be causal for the disease, and thus there has been strong
interest in using meta-analysis results to help reposition drugs or identify novel drug targets
(Grover et al., 2015; Hurle et al., 2013; Nelson et al., 2015; Sanseau et al., 2012). One example
is the association between CD40 and RA (Okada et al., 2014; Raychaudhuri et al., 2008); there is
currently a phase Ib clinical trial of a CD40L antagonist in patients with adult-onset RA (clinical
trial number NCT02780388).

Pairing GWAS data with information about biological pathways and protein-protein interactions
(PPIs) can aid the interpretation of results (Wang et al., 2010); both types of information are
available in public biological repositories (Box 2). Large-scale PPI data (sometimes referred to as
the “interactome”) have been generated via two main strategies: the yeast two-hybrid strategy for
obtaining pairwise interactions and mass spectrometry for characterizing protein complexes
[(Rolland et al., 2014); reviewed in (Lage, 2014)]. Several databases have been established to
facilitate the collection and dissemination of these data, including CORUM, a curated database of
protein complexes (Ruepp et al., 2010). Another effort is the International Molecular Exchange
(IMEx) consortium, which is an attempt to make quality-controlled PPI data available to the public
[(Orchard et al., 2012); www.imexconsortium.org]. Researchers have integrated such pathway and PPI
data with GWAS data to identify potential biological networks involved in specific human
phenotypes (Califano et al., 2012; Cotsapas et al., 2011; Lage, 2014; Rossin et al., 2011).
Similarly, expression quantitative trait locus (eQTL) data, which link genetic variations to
differences in gene expression, are increasingly available for various tissues and cell types
through public databases such as that generated by the Genotype-Tissue Expression (GTEx) project
(see Box 2) (Gibson et al., 2015). Meta-analysis also allows the combination of multiple datasets
to generate potentially more robust eQTLs (Westra et al., 2013). Similar to PPI data, eQTLs can
help researchers interpret associations from GWASs by, for example, generating hypotheses on the
identity of genes whose expression level may be impacted by GWAS variants (Zhu et al., 2016). eQTL
data can also allow researchers to assess the extent of genetic contributions and generate
hypotheses on potential mechanisms that underlie gene expression signatures associated with
particular phenotypes (Huan et al., 2015).

Immunology research often generates cellular phenotyping data, such as those obtained from flow and
mass cytometry (Bendall et al., 2012). Until recently, these data have not routinely been deposited
in public repositories (Box 2), but increased multiplexing of markers (now up to the 40s) and
studies with larger sample sizes have propelled the generation and deposition of bigger datasets
with increasing potential for reuse (Bhattacharya et al., 2014; Roederer et al., 2015). However,
reusing such data can be labor intensive, and the development of automated analysis tools, such as
software for automatic gating of immune cell subsets (Saeys et al., 2016), is still in its infancy.
Despite these challenges, community-based efforts to advance the development of such tools are
underway (Aghaeepour et al., 2013; Finak et al., 2016), and promising examples of reusing and
integrating cellular phenotyping datasets are emerging (Jujjavarapu et al., 2016; Lu et al., 2016).

Toward Adding Public-Data Reuse to the Immunologists’ Toolbox

As the examples above demonstrate, reuse of public data presents an opportunity for bench
scientists to test and generate hypotheses; this is particularly useful for labs, including a
majority of immunology labs, that do not routinely generate such large-scale data. In this sense,
the growth of public data also offers an opportunity to level the playing field for smaller or
less-well-funded groups. This will complement the traditional research model that involves
hypothesis generation followed by experiments testing the hypothesis with a paradigm in which one
can first generate or refine hypotheses by analyzing publicly available data, then design and
execute experiments (Figure 1A). As illustrated above, starting with an examination of public

Box 2. Resources for Reusing Public Data

Commonly used data repositories:

• Gene Expression Omnibus (GEO: http://www.ncbi.nlm.nih.gov/geo/) (Barrett et al., 2013) and ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) (Kolesnikov et al., 2015): contain multiple forms of data, including expression, chromatin accessibility, and methylation, obtained from microarray and next-generation sequencing. Some GEO data sets have undergone expert curation and annotation (GEO DataSets; http://www.ncbi.nlm.nih.gov/gds).
• Flowrepository (https://flowrepository.org) (Spidlen et al., 2012): a database for flow-cytometry data.
• Databases of biological pathways and protein-protein interactions (reviewed in Klingström and Plewczynski, 2011). Examples include the Reactome Pathway Database (http://www.reactome.org) (Fabregat et al., 2016) and the STRING protein-protein interaction database (http://string-db.org) (Szklarczyk et al., 2015).

Large data compendia (including multi-modal data sets associated with individual studies):

• ImmGen (http://www.immgen.org) (Heng et al., 2008): microarray gene-expression data on the major immune cell populations of the mouse.
• The Immunology Database and Analysis Portal (ImmPort; http://immport.org/immport-open/public/home/home) (Bhattacharya et al., 2014): experimental and clinical trial data generated from studies primarily funded by the National Institute of Allergy and Infectious Diseases (NIAID).
• ImmuneSpace (https://www.immunespace.org) (Brusic et al., 2014): the data portal of the NIAID HIPC program; contains multiple data types generated to characterize the human immune system.
• The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov) (Tomczak et al., 2015): genomic, transcriptomic, proteomic, epigenomic, and clinical data on more than 30 types of cancer and matched normal tissues.
• International Cancer Genome Consortium (ICGC; http://icgc.org) (Hudson et al., 2010): multi-modal data on more than 20 types of cancer and matched normal tissues from around the world.
• Genotype-Tissue Expression (GTEx) project (http://www.gtexportal.org/home/) (GTEx Consortium, 2013): expression quantitative trait loci (eQTLs) database created with multiple tissues from deceased human donors.
• Connectivity Map (CMAP) (https://portals.broadinstitute.org/cmap/) (Lamb, 2007; Lamb et al., 2006): a compendium of gene-expression profiles created from small-molecule perturbations of human cell lines. Using expression-pattern matching, the user can query a signature against this compendium to find matches.
• Library of Integrated Network-based Cellular Signatures (LINCS) (http://www.lincsproject.org): similar to the CMAP but with more extensive perturbations and cell types.

Analysis tools and platforms that enable data reuse:

• Search-Based Exploration of Expression Compendium (SEEK; http://seek.princeton.edu) (Zhu et al., 2015): a cross-platform gene co-expression search system. Using gene sets provided by the user, it identifies genes and public datasets of interest (e.g., microarray, next-generation sequencing) that are co-expressed with the query genes.
• OMiCC (https://omicc.niaid.nih.gov) (Shah et al., 2016): open, crowdsourcing-based platform aimed at biologists without computational training for annotating and performing (meta-) analysis of public microarray datasets.
• Open ImmPort (http://immport.org/immport-open/public/home/home) (Bhattacharya et al., 2014): contains an example of how to re-analyze data from a published influenza vaccine study with sample code in R and Python. There are also additional analysis tools, such as those for automatic analysis of flow-cytometry data (access requires registration and approval).
• ImmuNet (http://immunet.princeton.edu) (Gorenshteyn et al., 2015): 15 immunological functional-relationship networks created with publicly available functional genomics data. Using genes provided by the user, it retrieves immunological subnetworks functionally associated with those genes.
• GEO2R (http://www.ncbi.nlm.nih.gov/geo/geo2r/) (Barrett et al., 2013): online tool that can be used for comparing groups of samples within a GEO series. The output is a list of differentially expressed genes. It also generates the R code used to perform the analysis.
• Interactive Gene Expression Data Browser (https://gxb.benaroyaresearch.org/dm3/landing.gsp) (Speake et al., 2015): a web-based viewer for interactive exploration of microarray data sets; preloaded with more than 150 public data sets relevant to human immunology.
• InSilicoDB (www.insilicodb.com) (Coletta et al., 2012): aimed at biologists without computational training, it allows for exploration of public genomics data, with the option to upload and analyze personal, unpublished data.
• NextBio (www.nextbio.com) (Kupershmidt et al., 2010): a database of searchable signatures, including those derived from gene-expression, genetic, and other types of high-throughput biological data.
• Molecular Signatures Database (MSigDB) (http://www.broadinstitute.org/msigdb) (Liberzon et al., 2011): a large collection of annotated gene sets, including immunology-specific sets.
• Selected R packages for data reuse and meta-analysis: MetaOmics, a suite of three R packages (MetaQC, MetaDE, and MetaPath) for quality control, differentially expressed gene identification, and enriched pathway detection for microarray meta-analysis (http://www.pitt.edu/tsengweb/MetaOmicsHome.htm) (Wang et al., 2012). MetaIntegrator, an R package facilitating multi-cohort gene-expression meta-analysis (Haynes et al., 2016).

Figure 1. Expanding the Immunologists’ Toolbox
(A) The traditional research paradigm (left) focuses on hypothesis generation followed by generation of experimental data for hypothesis testing. An augmented, complementary paradigm (right) uses public data to generate and/or refine hypotheses prior to designing experiments.
(B) A proposal for adding public-data exploration and reuse into the immunologists’ toolbox that involves (1) education in bioinformatics, computer programming, and relevant areas of statistics and applied mathematics; (2) development of software and “data commons” platforms to enable hands-on exploration of public data for hypothesis generation and testing; and (3) community engagement to create reusable content such as sample and sample group annotations and data compendia.

data might result in additional biological insights because (1) the signal-to-noise ratio can be improved through meta-analysis of data from multiple, independent studies and (2) as pointed out by Khatri and colleagues (Andres-Terre et al., 2015; Khatri et al., 2013; Sweeney et al., 2015), it can be advantageous to utilize heterogeneity across studies (e.g., different microarray platforms, distinct patient cohorts) because the coherent signals across studies are potentially more likely to be biologically meaningful and reliable as the basis of generating and testing new hypotheses.

Despite advances in the development of data standards and software tools to promote public data reuse and exploration, reuse of public data remains sparsely practiced outside of the bioinformatics community as a result of several barriers. First and foremost, computational and statistical analysis of public data, such as meta-analysis across multiple studies, remains technically challenging for most biologists. For example, the ability to derive or match a gene-expression signature remains largely out of reach for biologists who lack bioinformatics training (Box 1). Commercial offerings such as NextBio and InSilicoDB offer programming-free analysis of pre-processed public data (Box 2), but these options can involve proprietary algorithms or analysis parameters that are not easily modified by the user, can be impractical for individuals with limited resources (e.g., students or labs from developing countries), and are limited in their ability to serve as an open resource to help propagate new ideas and tools within the greater immunology community.

Computational and statistical skills are not the only hurdles; truly embracing the data-reuse paradigm requires a cultural change (Rung and Brazma, 2013). Most bench scientists understandably take pride in generating their own data and thus may find it difficult to relegate hypothesis generation and testing to publicly available data generated by others. However, as a community we are already employing an increasingly large amount of resources generated by “big science” efforts; examples include mouse immune-cell profiling data from the ImmGen Project (Heng et al., 2008) (Box 2). Reusing existing data is also philosophically similar to consulting the literature when designing experiments and generating hypotheses. And as discussed above, hypotheses developed from meta-analysis of multiple studies can potentially be more robust against various biases than data generated within a single lab. However, it is also worth emphasizing that careful interpretation of meta-analysis results is needed given study-to-study heterogeneity in both technical and biological factors. For example, differences in subject inclusion criteria (e.g., age restrictions) can be prevalent among different human studies such that coherent signals across studies might emerge from only a subset of the subjects with shared biological features; thus, one should cautiously evaluate the broad applicability of such results to the general human population [see also Tseng et al. (2012) and Ramasamy et al. (2008) for discussion of potential biases and related statistical issues


on microarray meta-analysis]. Furthermore, examination of public data does not replace conducting experiments or human clinical trials, which are required to confirm hypotheses and generate de novo information most fitting for the scientific questions posed (Dolinski and Troyanskaya, 2015). With the mandate from funding agencies to deposit data in the public domain (Paltoo et al., 2014), a gradual movement toward more accessible supplemental data associated with publications (Pop and Salzberg, 2015), and a push for increasing sharing of raw clinical trial data (Doshi et al., 2013; Ross and Krumholz, 2013), being thoughtful about what public data can be utilized is a prudent starting point for many hypothesis-generating or -testing endeavors. Furthermore, data reuse can increase research efficiency and reduce redundancy, which should be particularly emphasized in research on human subjects given the inherent risk to participants (Doshi et al., 2013).

Even with the right tools, skills, and mindsets, the varying availability and quality of the data and of the associated documentation and annotations pose another challenge. The genetics community has been a leader in data sharing, as exemplified both by early large-scale data-generation efforts such as the Human Genome Project (Collins et al., 2003) [and more recently, the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015)] and by far-reaching data-sharing policies such as the Fort Lauderdale and Toronto agreements (Birney et al., 2009; Kaye et al., 2009). However, reuse of more complex functional genomics data typically requires greater documentation and annotation. Take microarray data as an example: the quality of the data depends on the biological and experimental conditions under which they are generated and the computational procedures by which the data are processed. Crucial information about samples can be missing or provided without the use of standardized vocabularies, and one often needs to consult the original publication and its authors to gather the necessary information to reuse the data. Even the most well-curated datasets might need additional annotation before they can be reused effectively. For example, it is crucial to determine what sample groupings and comparisons should be made to generate a desired gene-expression signature (e.g., naive versus activated CD4+ T cells); and one often needs biological expertise in the area of inquiry to perform such annotations.

To improve data and annotation quality, some groups have made significant efforts to promote standards for data submission to public repositories and encourage the use of standardized vocabularies and ontologies for annotation (Musen et al., 2012) to facilitate data reuse and increase study reproducibility (Brazma, 2009; Brazma et al., 2001; Rung and Brazma, 2013). Despite advances in standardized vocabulary and ontology development, putting them into practice remains challenging because large amounts of existing data would need to be re-annotated, and many major data repositories have yet to directly support their use. Data submission and sharing standards would ideally allow other investigators to more easily reproduce the analyses reported by the primary researchers and enable more informed comparisons across studies. Although such standards have not been consistently enforced (Brazma, 2009; Ioannidis et al., 2009), greater cooperation between journals and public databases to require data annotation and deposition prior to publication of papers reporting large-scale datasets could have a significant positive impact on the quantity and availability of reusable public data.

Overcoming the above major obstacles to make public-data reuse a common practice in immunology and, more broadly, in all of biology will take time and concerted efforts (Figure 1B). First, as a community we need to embrace adding rigorous training in computational and statistical sciences and bioinformatics to the standard biological sciences curriculum, starting at the undergraduate level. We must also commit to covering more immunology-specific issues, such as the analysis of flow cytometry and single-cell phenotyping data, in immunology graduate programs. Doing so would provide future immunologists a solid theoretical grounding and hands-on experience in computational immunology. Second, analysis software and platforms designed specifically for public-data reuse and data integration are needed for different data modalities, including those, such as flow cytometry, that immunologists use frequently. Although some software resources have already been created (Box 2), particularly for reusing and analyzing public gene expression data, they are often limited to one or a subset of analytical steps, and therefore putting together an entire workflow often requires some programming expertise. An important goal of such software should be to enable direct, hands-on exploration of public data even by those without formal bioinformatics training. Doing so provides several benefits (Shah et al., 2016). First, one can directly exercise his/her biological intuition. Second, it helps to promote deeper, more integrated collaboration between bench and bioinformatics scientists, for example, by allowing bioinformaticians to focus more on the computationally challenging aspects while immunologists explore the biological questions of interest by using well-established analytical techniques. The most successful software tools will also emphasize user education, so that biologists can learn the statistical language and techniques necessary to properly utilize the software and interpret the results. Software can thus also play an important role in achieving the educational goals discussed above by allowing students to learn both biology and bioinformatics simultaneously. In doing so, students would come to understand that the two are often inextricably linked.

In addition to enabling programming-free (meta-) analysis of public data, such software platforms can serve as “data commons” for the entire research community (e.g., see https://www.synapse.org), for example to allow sharing of raw data, as well as annotations, data compendia, analysis results, and other reusable content created by members of the community. Not only can such practices of crowdsourcing tap into the collective expertise of the community to create reusable information, they also promote a culture of sharing and openness (Figure 1B). The immunological community in particular should embrace and can benefit from this collaborative approach given both the complex jargon and biology of the field and the accelerating use of technologies that generate increasingly complex data types across multiple biological scales (Germain et al., 2011).

The NIH OMiCC Jamboree: A Microcosmic Crowdsourcing Experiment
We recently developed OMiCC (Shah et al., 2016) (see Box 2), a free online platform that implements some of the key ideas discussed above for programming-free (meta-) analysis of public gene-expression data. One unique feature of OMiCC is the capacity to “crowdshare” the work of annotating public data for automated downstream (meta-) analyses, a critical but


Box 3. NIH OMiCC Jamboree

Here we provide a summary of the jamboree. The analysis approaches, observations, and caveats can be found in Lau and Sparks et al. (Lau et al., 2016) and on the supplemental website accompanying this paper (https://omicc.niaid.nih.gov/2016-nih-jamboree-analysis/report.html).

Advertised to the NIH immunology community, the jamboree involved a half-day group orientation to the OMiCC platform and a subsequent day-long jamboree. The 29 volunteer participants, consisting of faculty, fellows, and students, were separated into ten groups; each group had at least one participant who felt proficient in using OMiCC after the half-day training session. The groups were each responsible for a human disease (one of five autoimmune diseases, namely diabetes mellitus type 1, multiple sclerosis, rheumatoid arthritis, Sjögren’s syndrome, and systemic lupus erythematosus, or the inflammatory condition sarcoidosis) or the corresponding mouse models. The groups were directed to search OMiCC for gene-expression data on their assigned topic and to focus on studies with both case and control samples generated from specific sources (e.g., PBMC and whole blood (WB) for human samples). Using the datasets they found, the participants applied their biological knowledge to annotate sample groups and create “comparison group pairs” (CGPs; Figure B1 in Box 1). We then analyzed CGPs from all groups after the jamboree both within and outside of OMiCC to assess the benefits and caveats of using crowdsourcing for gathering and annotating public data to drive hypothesis generation (Lau et al., 2016).

We used meta-analysis within OMiCC to derive (1) cross-CGP gene expression signatures for each disease (disease versus control comparisons) and (2) conserved signatures across all diseases within each species (pan-disease signatures); we also examined signature conservation between human and mouse (Lau et al., 2016).

Overall, a large number of transcripts with increased or decreased expression in disease versus control comparisons were identified for each disease (Lau et al., 2016). Comparison of these gene signatures among different diseases within and across species suggests greater overlap among diseases within a species than across the two species (most likely also reflecting differences between non-blood murine tissues and human blood). Gene-set and pathway enrichment analysis of individual disease signatures revealed shared signatures across diseases. Even though the data can be noisy and there are a number of caveats, our jamboree experiment suggests that there are potentially interesting and robust signals to be mined; given a programming-free, didactic, community-based platform such as OMiCC, together with some upfront training on the platform, biologists with any level of bioinformatics experience can benefit from and contribute to utilizing public data for hypothesis exploration. Our post-jamboree user survey revealed a similarly positive sentiment from a majority of the participants (see supplemental website).

Key Lessons Learned
• CGP quality can vary depending on the biological complexity of the data set and the biological expertise of the participant. One or more of the following could help mitigate these challenges: (1) having more detailed inclusion and exclusion criteria, (2) early feedback on CGPs, and (3) emphasis on CGP quality rather than quantity.
• The learning curve to use OMiCC can be steep for some participants, suggesting that more extensive training on the software beforehand would be helpful.
• For diseases with a sufficient number of data sets, it would be helpful to collect independent discovery and validation CGP sets for downstream validation.
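
As a purely illustrative aside, the idea of combining disease-versus-control comparisons from several CGPs can be sketched as a fixed-effect, inverse-variance-weighted meta-analysis for a single gene. This is a minimal sketch under stated assumptions, not the procedure implemented in OMiCC or used in Lau et al. (2016): the function name `fixed_effect_meta`, the log2 fold changes, and the standard errors below are all hypothetical, and each CGP is assumed to supply an effect estimate with a known standard error.

```python
import math


def fixed_effect_meta(effects, std_errors):
    """Pool per-study effect sizes using inverse-variance weights.

    Hypothetical helper for illustration only; returns the pooled effect,
    its standard error, the z statistic, and a two-sided p-value.
    """
    if not effects or len(effects) != len(std_errors):
        raise ValueError("need at least one study and one standard error per effect")
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se ** 2) for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    # Two-sided p-value under a standard normal null distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, z, p


# Hypothetical log2 fold changes (disease vs. control) for one gene in three CGPs.
log2_fc = [0.8, 1.1, 0.6]
se = [0.40, 0.50, 0.30]

pooled, pooled_se, z, p = fixed_effect_meta(log2_fc, se)
print(f"pooled log2FC={pooled:.3f}, SE={pooled_se:.3f}, z={z:.2f}, p={p:.4g}")
```

Because the pooled standard error shrinks as independent comparisons are added, a modest but consistent effect can become detectable, which is the signal-to-noise argument made for meta-analysis in the main text. A fixed-effect model ignores between-study heterogeneity; a random-effects model would be the more cautious choice when cohorts or platforms differ substantially.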

time-intensive step in data reuse. OMiCC stores and allows sharing of user-created sample annotations, groupings, two-group comparisons, and compendia for generating expression signatures. Thus, OMiCC provides the research community (“the crowd”) with the capacity to move science forward through “virtual,” community-wide collaborations; a similar concept was perhaps first demonstrated through the group effort to annotate the genomic sequence of Drosophila melanogaster during an 11-day “Annotation Jamboree” in 1999 (Adams et al., 2000; Pennisi, 2000). Various forms of crowdsourcing have been used in the biomedical and computing communities in the form of programming “hackathons,” data-analysis marathons, and open challenges aimed at taking advantage of collective community expertise to tackle complex problems (Celi et al., 2014; Saez-Rodriguez et al., 2016). These approaches have the potential to become even more broadly applicable and fruitful given the increasingly connected global research community.

To test the hypothesis that OMiCC could be used effectively by biologists without formal computational training and that it could facilitate an organized, concerted group effort to collect, annotate, and design two-group comparisons (see Figure B1) by using public gene-expression data for achieving a shared research goal, we organized a social and scientific experiment that we called the “NIH OMiCC Jamboree” (see Box 3). Inspired by the original “Annotation Jamboree” (Adams et al., 2000), we aimed to compile and annotate gene-expression data across multiple human autoimmune diseases and the corresponding mouse models during the jamboree. After the jamboree, we performed meta-analysis of multiple studies within and across diseases by using OMiCC and compared human and mouse to evaluate pan-disease and pan-species gene-expression signatures. Our jamboree exercise provides evidence that when enabled by appropriate tools, a “crowd” of biologists without specialized training in bioinformatics can work together to potentially accelerate the pace by which the increasingly large amounts of public data can be meta-analyzed for generating and testing hypotheses. The data, annotations, and analysis results generated by this exercise provide a resource for the community to further explore autoimmune disease gene-expression signatures (see Box 3 on accessing the jamboree data and annotations). The format and materials used in our jamboree should also be easily replicable in other institutions (e.g., universities)


to help educate students and to utilize the “crowd” to assess other biological questions of interest (materials used in the jamboree are freely available upon request). These “grassroots” efforts can in turn contribute valuable curated and annotated datasets to the larger community.

Conclusion and Outlook
As highlighted here, there is increasing application of data reuse within the immunological community, but major hurdles remain for broader adoption. As the volume of public data grows, so will the extent of missed opportunity if immunologists fail to take advantage of the wealth of data in the public domain.

Some of the most successful researchers in the future will likely be those who recognize the value of reusing public data and are able to utilize it to complement their work at the bench or in the clinic. It is also increasingly likely that some immunological questions can be addressed most effectively by utilizing large-scale public data, owing to the growing size and utility of datasets present in the public domain and the difficulty that a single lab would face in trying to generate the required data de novo.

Our jamboree experience illustrates how free, user-friendly, community-based platforms (such as OMiCC) can potentially facilitate data reuse, user education, and crowdsourcing. But OMiCC is just an illustrative beginning, and much more needs to be done. Specifically, we need functionalities that (1) go beyond the meta-analysis of gene expression data by extending into other modalities, such as flow cytometry data, (2) incorporate more sophisticated statistical modeling and analysis approaches, and (3) provide additional annotation and crowdsourcing features, for example to enable group-based collaborations. The development of natural text-processing methods that can automatically annotate samples and extract biologically meaningful comparisons could also dramatically increase the amount of reusable content. The expansion and evolution of data types and data volume will only increase in the future: by taking the necessary steps as a community to move toward utilizing available data to the fullest extent possible, the field of immunology will have the opportunity to be a leader in transforming public data (a continually growing, massive resource) into biological insights.

AUTHOR CONTRIBUTIONS

R.S. designed and organized the jamboree, performed post-jamboree data curation, and wrote the manuscript; W.W.L. helped design the jamboree, performed post-jamboree data curation, designed and performed post-jamboree data analysis, and developed and authored the supplemental website; J.S.T. conceived and guided the project, designed and helped organize the jamboree, helped design post-jamboree data analysis plan, helped post-jamboree data curation, and wrote the manuscript.

ACKNOWLEDGMENTS

We thank the jamboree participants for their contributions, including data gathering, annotation, comparison-group formation via OMiCC, as well as feedback on the manuscript. Participants include (listed alphabetically by last name): James Austin, Julián Candia, William Coley, Ehren Dancy, Karen L. Elkins, Sara Faghihi-Kashani, Julio Gomez-Rodriguez, Liliana Guedez, Maria J. Gutierrez, Trung Ho, Reiko Horai, Sunmee Huh, Chie Iwamura, Jaimy Joy, Ju-Gyeong Kang, Sunil Kaul, Laura B. Lewandowski, Nathan P. Manes, Mary J. Mattapallil, Sarfraz Memon, M. Jubayer Rahman, Kameron B. Rodrigues, Bruno Silva, Amit Singh, Anthony J. St. Leger, Jessica Tang, Hang Xie, Yongge Zhao, and Ofer Zimmerman. We also thank Neha Bansal, Candace Liu, and Abigail Thorpe for assisting with the jamboree; Yongjian Guo and Yong Lu for OMiCC support; BCBB/OCICB of NIAID for providing computing support and web hosting; NIH Facilities for providing the OMiCC Jamboree hosting venue; and members of the J.S.T. lab for discussions. This research was funded by the Intramural Programs of the National Institute of Allergy and Infectious Diseases (NIAID) and the Center for Information Technology (CIT) at the NIH.

REFERENCES

Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D., Amanatides, P.G., Scherer, S.E., Li, P.W., Hoskins, R.A., Galle, R.F., et al. (2000). The genome sequence of Drosophila melanogaster. Science 287, 2185–2195.

Aghaeepour, N., Finak, G., Hoos, H., Mosmann, T.R., Brinkman, R., Gottardo, R., and Scheuermann, R.H.; FlowCAP Consortium; DREAM Consortium (2013). Critical assessment of automated flow cytometry data analysis techniques. Nat. Methods 10, 228–238.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., and Levine, A.J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96, 6745–6750.

Anderson, C.A., Boucher, G., Lees, C.W., Franke, A., D’Amato, M., Taylor, K.D., Lee, J.C., Goyette, P., Imielinski, M., Latiano, A., et al. (2011). Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of confirmed associations to 47. Nat. Genet. 43, 246–252.

Andres-Terre, M., McGuire, H.M., Pouliot, Y., Bongen, E., Sweeney, T.E., Tato, C.M., and Khatri, P. (2015). Integrated, multi-cohort analysis identifies conserved transcriptional signatures across multiple respiratory viruses. Immunity 43, 1199–1211.

Baechler, E.C., Batliwalla, F.M., Karypis, G., Gaffney, P.M., Ortmann, W.A., Espe, K.J., Shark, K.B., Grande, W.J., Hughes, K.M., Kapur, V., et al. (2003). Interferon-inducible gene expression signature in peripheral blood cells of patients with severe lupus. Proc. Natl. Acad. Sci. USA 100, 2610–2615.

Barrett, T., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Holko, M., et al. (2013). NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res. 41, D991–D995.

Begum, F., Ghosh, D., Tseng, G.C., and Feingold, E. (2012). Comprehensive literature review and statistical considerations for GWAS meta-analysis. Nucleic Acids Res. 40, 3777–3784.

Bendall, S.C., Nolan, G.P., Roederer, M., and Chattopadhyay, P.K. (2012). A deep profiler’s guide to cytometry. Trends Immunol. 33, 323–332.

Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300.

Bennett, L., Palucka, A.K., Arce, E., Cantrell, V., Borvak, J., Banchereau, J., and Pascual, V. (2003). Interferon and granulopoiesis signatures in systemic lupus erythematosus blood. J. Exp. Med. 197, 711–723.

Bhattacharya, S., Andorf, S., Gomes, L., Dunn, P., Schaefer, H., Pontius, J., Berger, P., Desborough, V., Smith, T., Campbell, J., et al. (2014). ImmPort: disseminating data to the public for the future of immunology. Immunol. Res. 58, 234–239.

Bindea, G., Mlecnik, B., Tosolini, M., Kirilovsky, A., Waldner, M., Obenauf, A.C., Angell, H., Fredriksen, T., Lafontaine, L., Berger, A., et al. (2013). Spatiotemporal dynamics of intratumoral immune cells reveal the immune landscape in human cancer. Immunity 39, 782–795.

Birney, E., Hudson, T.J., Green, E.D., Gunter, C., Eddy, S., Rogers, J., Harris, J.R., Ehrlich, S.D., Apweiler, R., Austin, C.P., et al.; Toronto International Data Release Workshop Authors (2009). Prepublication data sharing. Nature 461, 168–170.

Bittner, M., Meltzer, P., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., et al. (2000). Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406, 536–540.

Bradfield, J.P., Qu, H.Q., Wang, K., Zhang, H., Sleiman, P.M., Kim, C.E., Mentch, F.D., Qiu, H., Glessner, J.T., Thomas, K.A., et al. (2011). A genome-wide meta-analysis of six type 1 diabetes cohorts identifies multiple associated loci. PLoS Genet. 7, e1002293.

Brazma, A. (2009). Minimum Information About a Microarray Experiment (MIAME)–successes, failures, challenges. ScientificWorldJournal 9, 420–423.

Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., et al. (2001). Minimum information about a microarray experiment (MIAME)—Toward standards for microarray data. Nat. Genet. 29, 365–371.

Brusic, V., Gottardo, R., Kleinstein, S.H., and Davis, M.M.; HIPC steering committee (2014). Computational resources for high-dimensional immune analysis from the Human Immunology Project Consortium. Nat. Biotechnol. 32, 146–148.

Califano, A., Butte, A.J., Friend, S., Ideker, T., and Schadt, E. (2012). Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat. Genet. 44, 841–847.

Campain, A., and Yang, Y.H. (2010). Comparison study of microarray meta-analysis methods. BMC Bioinformatics 11, 408.

Celi, L.A., Ippolito, A., Montgomery, R.A., Moses, C., and Stone, D.J. (2014). Crowdsourcing knowledge discovery and innovations in medicine. J. Med. Internet Res. 16, e216.

Chang, L.C., Lin, H.M., Sibille, E., and Tseng, G.C. (2013). Meta-analysis methods for combining multiple expression profiles: comparisons, statistical characterization and an application guideline. BMC Bioinformatics 14, 368.

Chaussabel, D., and Baldwin, N. (2014). Democratizing systems immunology with modular transcriptional repertoire analyses. Nat. Rev. Immunol. 14, 271–280.

Chaussabel, D., Quinn, C., Shen, J., Patel, P., Glaser, C., Baldwin, N., Stichweh, D., Blankenship, D., Li, L., Munagala, I., et al. (2008). A modular analysis framework for blood genomics studies: application to systemic lupus erythematosus. Immunity 29, 150–164.

Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X.C., Stern, D., Winkler, J., Lockhart, D.J., Morris, M.S., and Fodor, S.P. (1996). Accessing genetic information with high-density DNA arrays. Science 274, 610–614.

Chen, B., and Butte, A.J. (2016). Leveraging big data to transform target selection and drug discovery. Clin. Pharmacol. Ther. 99, 285–297.

Chen, J., Bardes, E.E., Aronow, B.J., and Jegga, A.G. (2009). ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 37, W305–W311.

Coletta, A., Molter, C., Duqué, R., Steenhoff, D., Taminau, J., de Schaetzen, V., Meganck, S., Lazar, C., Venet, D., Detours, V., et al. (2012). InSilico DB genomic datasets hub: an efficient starting point for analyzing genome-wide studies in GenePattern, Integrative Genomics Viewer, and R/Bioconductor. Genome Biol. 13, R104.

Collins, F.S., Morgan, M., and Patrinos, A. (2003). The Human Genome Project: Lessons from large-scale biology. Science 300, 286–290.

Cotsapas, C., Voight, B.F., Rossin, E., Lage, K., Neale, B.M., Wallace, C., Abecasis, G.R., Barrett, J.C., Behrens, T., Cho, J., et al.; FOCiS Network of Consortia (2011). Pervasive sharing of genetic effects in autoimmune disease. PLoS Genet. 7, e1002254.

Darmanis, S., Gallant, C.J., Marinescu, V.D., Niklasson, M., Segerman, A., Flamourakis, G., Fredriksson, S., Assarsson, E., Lundberg, M., Nelander, S., et al. (2016). Simultaneous multiplexed measurement of RNA and proteins in single cells. Cell Rep. 14, 380–389.

Dolinski, K., and Troyanskaya, O.G. (2015). Implications of Big Data for cell biology. Mol. Biol. Cell 26, 2575–2578.

Doshi, P., Goodman, S.N., and Ioannidis, J.P. (2013). Raw data from clinical trials: Within reach? Trends Pharmacol. Sci. 34, 645–647.

Evangelou, E., and Ioannidis, J.P. (2013). Meta-analysis methods for genome-wide association studies and beyond. Nat. Rev. Genet. 14, 379–389.

Fabregat, A., Sidiropoulos, K., Garapati, P., Gillespie, M., Hausmann, K., Haw, R., Jassal, B., Jupe, S., Korninger, F., McKay, S., et al. (2016). The Reactome pathway Knowledgebase. Nucleic Acids Res. 44 (D1), D481–D487.

Fehrmann, R.S., Karjalainen, J.M., Krajewska, M., Westra, H.J., Maloney, D., Simeonov, A., Pers, T.H., Hirschhorn, J.N., Jansen, R.C., Schultes, E.A., et al. (2015). Gene expression analysis identifies global gene dosage sensitivity in cancer. Nat. Genet. 47, 115–125.

Finak, G., Langweiler, M., Jaimes, M., Malek, M., Taghiyar, J., Korin, Y., Raddassi, K., Devine, L., Obermoser, G., Pekalski, M.L., et al. (2016). Standardizing flow cytometry immunophenotyping analysis from the Human ImmunoPhenotyping Consortium. Sci. Rep. 6, 20686.

Franke, A., McGovern, D.P., Barrett, J.C., Wang, K., Radford-Smith, G.L., Ahmad, T., Lees, C.W., Balschun, T., Lee, J., Roberts, R., et al. (2010). Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat. Genet. 42, 1118–1125.

Fuchsberger, C., Flannick, J., Teslovich, T.M., Mahajan, A., Agarwala, V., Gaulton, K.J., Ma, C., Fontanillas, P., Moutsianas, L., McCarthy, D.J., et al. (2016). The genetic architecture of type 2 diabetes. Nature 536, 41–47.

Gasch, A.P., Spellman, P.T., Kao, C.M., Carmel-Harel, O., Eisen, M.B., Storz, G., Botstein, D., and Brown, P.O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 11, 4241–4257.

Gentles, A.J., Newman, A.M., Liu, C.L., Bratman, S.V., Feng, W., Kim, D., Nair, V.S., Xu, Y., Khuong, A., Hoang, C.D., et al. (2015). The prognostic landscape of genes and infiltrating immune cells across human cancers. Nat. Med. 21, 938–945.

Germain, R.N., Meier-Schellersheim, M., Nita-Lazar, A., and Fraser, I.D. (2011). Systems biology in immunology: A computational modeling perspective. Annu. Rev. Immunol. 29, 527–585.

Gibson, G., Powell, J.E., and Marigorta, U.M. (2015). Expression quantitative trait locus analysis for translational medicine. Genome Med. 7, 60.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Goodman, S.N., Fanelli, D., and Ioannidis, J.P. (2016). What does research reproducibility mean? Sci. Transl. Med. 8, 341ps12.

Gorenshteyn, D., Zaslavsky, E., Fribourg, M., Park, C.Y., Wong, A.K., Tadych, A., Hartmann, B.M., Albrecht, R.A., García-Sastre, A., Kleinstein, S.H., et al. (2015). Interactive Big Data resource to elucidate human immune pathways and diseases. Immunity 43, 605–614.

Gross, A.M., Kreisberg, J.F., and Ideker, T. (2015). Analysis of matched tumor and normal profiles reveals common transcriptional and epigenetic signals shared across cancer types. PLoS ONE 10, e0142618.

Grover, M.P., Ballouz, S., Mohanasundaram, K.A., George, R.A., Goscinski, A., Crowley, T.M., Sherman, C.D., and Wouters, M.A. (2015). Novel therapeutics for coronary artery disease from genome-wide association study data. BMC Med. Genomics 8 (Suppl 2), S1.

GTEx Consortium (2013). The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585.

Haynes, W.A., Vallania, F., Liu, C., Bongen, E., Tomczak, A., Andres-Terrè, M., Lofgren, S., Tam, A., Deisseroth, C.A., Li, M.D., et al. (2016). Empowering multi-cohort gene expression analysis to increase reproducibility. Pac. Symp. Biocomput. 22, 144–153.

Heng, T.S., and Painter, M.W.; Immunological Genome Project Consortium (2008). The Immunological Genome Project: Networks of gene expression in immune cells. Nat. Immunol. 9, 1091–1094.

Higgs, B.W., Zhu, W., Richman, L., Fiorentino, D.F., Greenberg, S.A., Jallal, B., and Yao, Y. (2012). Identification of activated cytokine pathways in the blood of systemic lupus erythematosus, myositis, rheumatoid arthritis, and scleroderma patients. Int. J. Rheum. Dis. 15, 25–35.

Hoadley, K.A., Yau, C., Wolf, D.M., Cherniack, A.D., Tamborero, D., Ng, S., Leiserson, M.D., Niu, B., McLellan, M.D., Uzunangelov, V., et al.; Cancer Genome Atlas Research Network (2014). Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 158, 929–944.

Huan, T., Esko, T., Peters, M.J., Pilling, L.C., Schramm, K., Schurmann, C., Chen, B.H., Liu, C., Joehanes, R., Johnson, A.D., et al.; International Consortium for Blood Pressure GWAS (ICBP) (2015). A meta-analysis of gene expression signatures of blood pressure and hypertension. PLoS Genet. 11, e1005035.

Hudson, T.J., Anderson, W., Artez, A., Barker, A.D., Bell, C., Bernabé, R.R., Bhan, M.K., Calvo, F., Eerola, I., Gerhard, D.S., et al.; International Cancer Genome Consortium (2010). International network of cancer genome projects. Nature 464, 993–998.

Hughes, T.R., Marton, M.J., Jones, A.R., Roberts, C.J., Stoughton, R., Armour, C.D., Bennett, H.A., Coffey, E., Dai, H., He, Y.D., et al. (2000). Functional discovery via a compendium of expression profiles. Cell 102, 109–126.

Hurle, M.R., Yang, L., Xie, Q., Rajpal, D.K., Sanseau, P., and Agarwal, P. (2013). Computational drug repositioning: From data to therapeutics. Clin. Pharmacol. Ther. 93, 335–341.

Ioannidis, J.P., Allison, D.B., Ball, C.A., Coulibaly, I., Cui, X., Culhane, A.C., Falchi, M., Furlanello, C., Game, L., Jurman, G., et al. (2009). Repeatability of published microarray gene expression analyses. Nat. Genet. 41, 149–155.

Iorio, F., Rittman, T., Ge, H., Menden, M., and Saez-Rodriguez, J. (2013). Transcriptional data: A new gateway to drug repositioning? Drug Discov. Today 18, 350–357.

Jenner, R.G., and Young, R.A. (2005). Insights into host responses against pathogens from transcriptional profiling. Nat. Rev. Microbiol. 3, 281–294.

Jujjavarapu, C., Hughey, J., Gheradini, F., Bruggner, R., Nolan, G., Bhattacharya, S., and Butte, A.J. (2016). A Framework for Meta-Analysis of Cytometry Data. J. Immunol. 196, 69.16.

Karpiński, P., Frydecka, D., Sąsiadek, M.M., and Misiak, B. (2016). Reduced number of peripheral natural killer cells in schizophrenia but not in bipolar disorder. Brain Behav. Immun. 54, 194–200.

Kaye, J., Heeney, C., Hawkins, N., de Vries, J., and Boddington, P. (2009). Data sharing in genomics—Re-shaping scientific practice. Nat. Rev. Genet. 10, 331–335.

Khatri, P., Roedder, S., Kimura, N., De Vusser, K., Morgan, A.A., Gong, Y., Fischbein, M.P., Robbins, R.C., Naesens, M., Butte, A.J., and Sarwal, M.M. (2013). A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantation. J. Exp. Med. 210, 2205–2221.

Kidd, B.A., Wroblewska, A., Boland, M.R., Agudo, J., Merad, M., Tatonetti, N.P., Brown, B.D., and Dudley, J.T. (2016). Mapping the effects of drugs on the immune system. Nat. Biotechnol. 34, 47–54.

Klingström, T., and Plewczynski, D. (2011). Protein-protein interaction and pathway databases, a graphical review. Brief. Bioinform. 12, 702–713.

Kolesnikov, N., Hastings, E., Keays, M., Melnichuk, O., Tang, Y.A., Williams, E., Dylag, M., Kurbatova, N., Brandizi, M., Burdett, T., et al. (2015). ArrayExpress update—Simplifying data submissions. Nucleic Acids Res. 43, D1113–D1116.

Kupershmidt, I., Su, Q.J., Grewal, A., Sundaresh, S., Halperin, I., Flynn, J., Shekar, M., Wang, H., Park, J., Cui, W., et al. (2010). Ontology-based meta-analysis of global collections of high-throughput public data. PLoS ONE 5, e13066.

Lage, K. (2014). Protein-protein interactions and genetic diseases: The interactome. Biochim. Biophys. Acta 1842, 1971–1980.

Laird, P.W. (2010). Principles and challenges of genomewide DNA methylation analysis. Nat. Rev. Genet. 11, 191–203.

Lamb, J. (2007). The Connectivity Map: A new tool for biomedical research. Nat. Rev. Cancer 7, 54–60.

Lamb, J., Crawford, E.D., Peck, D., Modell, J.W., Blat, I.C., Wrobel, M.J., Lerner, J., Brunet, J.P., Subramanian, A., Ross, K.N., et al. (2006). The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929–1935.

Larance, M., and Lamond, A.I. (2015). Multidimensional proteomics for cell biology. Nat. Rev. Mol. Cell Biol. 16, 269–280.

Lau, W.W., Sparks, R., OMiCC Jamboree Working Group, and Tsang, J.S. (2016). Meta-analysis of crowdsourced data compendia suggests pan-disease transcriptional signatures of autoimmunity. F1000Research http://dx.doi.org/10.12688/f1000research.10465.1.

Li, S., Rouphael, N., Duraisingham, S., Romero-Steiner, S., Presnell, S., Davis, C., Schmidt, D.S., Johnson, S.E., Milton, A., Rajam, G., et al. (2014). Molecular signatures of antibody responses derived from a systems biology study of five human vaccines. Nat. Immunol. 15, 195–204.

Li, B., Li, T., Pignon, J.C., Wang, B., Wang, J., Shukla, S.A., Dou, R., Chen, Q., Hodi, F.S., Choueiri, T.K., et al. (2016). Landscape of tumor-infiltrating T cell repertoire of human cancers. Nat. Genet. 48, 725–732.

Libbrecht, M.W., and Noble, W.S. (2015). Machine learning applications in genetics and genomics. Nat. Rev. Genet. 16, 321–332.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdóttir, H., Tamayo, P., and Mesirov, J.P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740.

Lu, Y., Biancotto, A., Cheung, F., Remmers, E., Shah, N., McCoy, J.P., and Tsang, J.S. (2016). Systematic analysis of cell-to-cell expression variation of T lymphocytes in a human cohort identifies aging and genetic associations. Immunity 45, 1162–1175.

Lukk, M., Kapushesky, M., Nikkilä, J., Parkinson, H., Goncalves, A., Huber, W., Ukkonen, E., and Brazma, A. (2010). A global map of human gene expression. Nat. Biotechnol. 28, 322–324.

Márquez, A., Vidal-Bralo, L., Rodríguez-Rodríguez, L., González-Gay, M.A., Balsa, A., González-Álvaro, I., Carreira, P., Ortego-Centeno, N., Ayala-Gutiérrez, M.M., García-Hernández, F.J., et al. (2016). A combined large-scale meta-analysis identifies COG6 as a novel shared risk locus for rheumatoid arthritis and systemic lupus erythematosus. Ann. Rheum. Dis. Published online May 18, 2016.

Morris, D.L., Sheng, Y., Zhang, Y., Wang, Y.F., Zhu, Z., Tombleson, P., Chen, L., Cunninghame Graham, D.S., Bentham, J., Roberts, A.L., et al. (2016). Genome-wide association meta-analysis in Chinese and European individuals identifies ten new loci associated with systemic lupus erythematosus. Nat. Genet. 48, 940–946.

Musen, M.A., Noy, N.F., Shah, N.H., Whetzel, P.L., Chute, C.G., Story, M.A., and Smith, B.; NCBO team (2012). The National Center for Biomedical Ontology. J. Am. Med. Inform. Assoc. 19, 190–195.

Nelson, M.R., Tipney, H., Painter, J.L., Shen, J., Nicoletti, P., Shen, Y., Floratos, A., Sham, P.C., Li, M.J., Wang, J., et al. (2015). The support of human genetic evidence for approved drug indications. Nat. Genet. 47, 856–860.

Newman, A.M., Liu, C.L., Green, M.R., Gentles, A.J., Feng, W., Xu, Y., Hoang, C.D., Diehn, M., and Alizadeh, A.A. (2015). Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457.

Okada, Y., Wu, D., Trynka, G., Raj, T., Terao, C., Ikari, K., Kochi, Y., Ohmura, K., Suzuki, A., Yoshida, S., et al.; RACI consortium; GARNET consortium (2014). Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381.

Olsen, J.V., and Mann, M. (2013). Status of large-scale analysis of post-translational modifications by mass spectrometry. Mol. Cell. Proteomics 12, 3444–3452.

Orchard, S., Kerrien, S., Abbani, S., Aranda, B., Bhate, J., Bidwell, S., Bridge, A., Briganti, L., Brinkman, F.S., Cesareni, G., et al. (2012). Protein interaction data curation: the International Molecular Exchange (IMEx) consortium. Nat. Methods 9, 345–350.

Paltoo, D.N., Rodriguez, L.L., Feolo, M., Gillanders, E., Ramos, E.M., Rutter, J.L., Sherry, S., Wang, V.O., Bailey, A., Baker, R., et al.; National Institutes of Health Genomic Data Sharing Governance Committees (2014). Data use under the NIH GWAS data sharing policy and future directions. Nat. Genet. 46, 934–938.

Patsopoulos, N.A., Esposito, F., Reischl, J., Lehr, S., Bauer, D., Heubach, J., Sandbrink, R., Pohl, C., Edan, G., Kappos, L., et al.; Bayer Pharma MS
Genetics Working Group; Steering Committees of Studies Evaluating IFNβ-1b and a CCR1-Antagonist; ANZgene Consortium; GeneMSA; International Multiple Sclerosis Genetics Consortium (2011). Genome-wide meta-analysis identifies novel multiple sclerosis susceptibility loci. Ann. Neurol. 70, 897–912.

Pe’er, D., Regev, A., Elidan, G., and Friedman, N. (2001). Inferring subnetworks from perturbed expression profiles. Bioinformatics 17 (Suppl 1), S215–S224.

Pennisi, E. (2000). Ideas fly at gene-finding jamboree. Science 287, 2182–2184.

Perou, C.M., Sørlie, T., Eisen, M.B., van de Rijn, M., Jeffrey, S.S., Rees, C.A., Pollack, J.R., Ross, D.T., Johnsen, H., Akslen, L.A., et al. (2000). Molecular portraits of human breast tumours. Nature 406, 747–752.

Pop, M., and Salzberg, S.L. (2015). Use and mis-use of supplementary material in science publications. BMC Bioinformatics 16, 237.

Ramasamy, A., Mondry, A., Holmes, C.C., and Altman, D.G. (2008). Key issues in conducting a meta-analysis of gene expression microarray datasets. PLoS Med. 5, e184.

Raychaudhuri, S., Remmers, E.F., Lee, A.T., Hackett, R., Guiducci, C., Burtt, N.P., Gianniny, L., Korman, B.D., Padyukov, L., Kurreeman, F.A., et al. (2008). Common variants at CD40 and other loci confer risk of rheumatoid arthritis. Nat. Genet. 40, 1216–1223.

Rhodes, D.R., Barrette, T.R., Rubin, M.A., Ghosh, D., and Chinnaiyan, A.M. (2002). Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res. 62, 4427–4433.

Rhodes, D.R., Yu, J., Shanker, K., Deshpande, N., Varambally, R., Ghosh, D., Barrette, T., Pandey, A., and Chinnaiyan, A.M. (2004). Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc. Natl. Acad. Sci. USA 101, 9309–9314.

Roedder, S., Kimura, N., Okamura, H., Hsieh, S.C., Gong, Y., and Sarwal, M.M. (2013). Significance and suppression of redundant IL17 responses in acute allograft rejection by bioinformatics based drug repositioning of fenofibrate. PLoS ONE 8, e56657.

Roederer, M., Quaye, L., Mangino, M., Beddall, M.H., Mahnke, Y., Chattopadhyay, P., Tosi, I., Napolitano, L., Terranova Barberio, M., Menni, C., et al. (2015). The genetic architecture of the human immune system: A bioresource for autoimmunity and disease pathogenesis. Cell 161, 387–403.

Rolland, T., Taşan, M., Charloteaux, B., Pevzner, S.J., Zhong, Q., Sahni, N., Yi, S., Lemmens, I., Fontanillo, C., Mosca, R., et al. (2014). A proteome-scale map of the human interactome network. Cell 159, 1212–1226.

Rooney, M.S., Shukla, S.A., Wu, C.J., Getz, G., and Hacohen, N. (2015). Molecular and genetic properties of tumors associated with local immune cytolytic activity. Cell 160, 48–61.

Ross, J.S., and Krumholz, H.M. (2013). Ushering in a new era of open science through data sharing: the wall must come down. JAMA 309, 1355–1356.

Rossin, E.J., Lage, K., Raychaudhuri, S., Xavier, R.J., Tatar, D., Benita, Y., Cotsapas, C., and Daly, M.J.; International Inflammatory Bowel Disease Genetics Consortium (2011). Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology. PLoS Genet. 7, e1001273.

Ruepp, A., Waegele, B., Lechner, M., Brauner, B., Dunger-Kaltenbach, I., Fobo, G., Frishman, G., Montrone, C., and Mewes, H.W. (2010). CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucleic Acids Res. 38, D497–D501.

Rung, J., and Brazma, A. (2013). Reuse of public genome-wide gene expression data. Nat. Rev. Genet. 14, 89–99.

Saeys, Y., Gassen, S.V., and Lambrecht, B.N. (2016). Computational flow cytometry: Helping to make sense of high-dimensional immunology data. Nat. Rev. Immunol. 16, 449–462.

Saez-Rodriguez, J., Costello, J.C., Friend, S.H., Kellen, M.R., Mangravite, L., Meyer, P., Norman, T., and Stolovitzky, G. (2016). Crowdsourcing biomedical research: Leveraging communities as innovation engines. Nat. Rev. Genet. 17, 470–486.

Sanseau, P., Agarwal, P., Barnes, M.R., Pastinen, T., Richards, J.B., Cardon, L.R., and Mooser, V. (2012). Use of genome-wide association studies for drug repositioning. Nat. Biotechnol. 30, 317–320.

Satija, R., and Shalek, A.K. (2014). Heterogeneity in immune responses: From populations to single cells. Trends Immunol. 35, 219–229.

Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470.

Schwartzman, O., and Tanay, A. (2015). Single-cell epigenomics: Techniques and emerging applications. Nat. Rev. Genet. 16, 716–726.

Segal, E., Shapira, M., Regev, A., Pe’er, D., Botstein, D., Koller, D., and Friedman, N. (2003a). Module networks: Identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet. 34, 166–176.

Segal, E., Yelensky, R., and Koller, D. (2003b). Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics 19 (Suppl 1), i273–i282.

Segal, E., Friedman, N., Koller, D., and Regev, A. (2004). A module map showing conditional activity of expression modules in cancer. Nat. Genet. 36, 1090–1098.

Segal, E., Friedman, N., Kaminski, N., Regev, A., and Koller, D. (2005). From signatures to models: Understanding cancer using microarrays. Nat. Genet. 37 (Suppl), S38–S45.

Shah, N., Guo, Y., Wendelsdorf, K.V., Lu, Y., Sparks, R., and Tsang, J.S. (2016). A crowdsourcing approach for reusing and meta-analyzing gene expression data. Nat. Biotechnol. 34, 803–806.

Shapiro, E., Biezuner, T., and Linnarsson, S. (2013). Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat. Rev. Genet. 14, 618–630.

Silberzahn, R., and Uhlmann, E.L. (2015). Crowdsourced research: Many hands make tight work. Nature 526, 189–191.

Sirota, M., Dudley, J.T., Kim, J., Chiang, A.P., Morgan, A.A., Sweet-Cordero, A., Sage, J., and Butte, A.J. (2011). Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci. Transl. Med. 3, 96ra77.

Speake, C., Presnell, S., Domico, K., Zeitner, B., Bjork, A., Anderson, D., Mason, M.J., Whalen, E., Vargas, O., Popov, D., et al. (2015). An interactive web application for the dissemination of human systems immunology data. J. Transl. Med. 13, 196.

Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., and Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9, 3273–3297.

Spidlen, J., Breuer, K., Rosenberg, C., Kotecha, N., and Brinkman, R.R. (2012). FlowRepository: A resource of annotated flow cytometry datasets associated with peer-reviewed publications. Cytometry A 81, 727–731.

Stahl, E.A., Raychaudhuri, S., Remmers, E.F., Xie, G., Eyre, S., Thomson, B.P., Li, Y., Kurreeman, F.A., Zhernakova, A., Hinks, A., et al.; BIRAC Consortium; YEAR Consortium (2010). Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat. Genet. 42, 508–514.

Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., and Mesirov, J.P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550.

Sweeney, T.E., Shidham, A., Wong, H.R., and Khatri, P. (2015). A comprehensive time-course-based multicohort analysis of sepsis and sterile inflammation reveals a robust diagnostic gene set. Sci. Transl. Med. 7, 287ra71.

Sweeney, T.E., Haynes, W.A., Vallania, F., Ioannidis, J.P., and Khatri, P. (2016). Methods to increase reproducibility in differential gene expression via meta-analysis. Nucleic Acids Res. gkw797, Published online September 14, 2016. http://dx.doi.org/10.1093/nar/gkw797.

Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K.P., et al. (2015).
STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452.

The 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature 526, 68–74.

Tomczak, K., Czerwińska, P., and Wiznerowicz, M. (2015). The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. (Pozn.) 19 (1A), A68–A77.

Toro-Domínguez, D., Carmona-Sáez, P., and Alarcón-Riquelme, M.E. (2014). Shared signatures between rheumatoid arthritis, systemic lupus erythematosus and Sjögren’s syndrome uncovered through gene expression meta-analysis. Arthritis Res. Ther. 16, 489.

Torrente, A., Lukk, M., Xue, V., Parkinson, H., Rung, J., and Brazma, A. (2016). Identification of cancer related genes using a comprehensive map of human gene expression. PLoS ONE 11, e0157484.

Tsalik, E.L., Henao, R., Nichols, M., Burke, T., Ko, E.R., McClain, M.T., Hudson, L.L., Mazur, A., Freeman, D.H., Veldman, T., et al. (2016). Host gene expression classifiers diagnose acute respiratory illness etiology. Sci. Transl. Med. 8, 322ra11.

Tseng, G.C., Ghosh, D., and Feingold, E. (2012). Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic Acids Res. 40, 3785–3799.

Wang, T., Hopkins, D., Schmidt, C., Silva, S., Houghton, R., Takita, H., Repasky, E., and Reed, S.G. (2000). Identification of genes differentially over-expressed in lung squamous cell carcinoma using combination of cDNA subtraction and microarray analysis. Oncogene 19, 1519–1528.

Wang, K., Li, M., and Hakonarson, H. (2010). Analysing biological pathways in genome-wide association studies. Nat. Rev. Genet. 11, 843–854.

Wang, X., Kang, D.D., Shen, K., Song, C., Lu, S., Chang, L.C., Liao, S.G., Huo, Z., Tang, S., Ding, Y., et al. (2012). An R package suite for microarray meta-analysis in quality control, differentially expressed gene analysis and pathway enrichment detection. Bioinformatics 28, 2534–2536.

Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R., Ozenberger, B.A., Ellrott, K., Shmulevich, I., Sander, C., and Stuart, J.M.; Cancer Genome Atlas Research Network (2013). The Cancer Genome Atlas Pan-Cancer Analysis Project. Nat. Genet. 45, 1113–1120.

Westra, H.J., Peters, M.J., Esko, T., Yaghootkar, H., Schurmann, C., Kettunen, J., Christiansen, M.W., Fairfax, B.P., Schramm, K., Powell, J.E., et al. (2013). Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat. Genet. 45, 1238–1243.

Wirapati, P., Sotiriou, C., Kunkel, S., Farmer, P., Pradervand, S., Haibe-Kains, B., Desmedt, C., Ignatiadis, M., Sengstag, T., Schütz, F., et al. (2008). Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Res. 10, R65.

Zhou, V.W., Goren, A., and Bernstein, B.E. (2011). Charting histone modifications and the functional organization of mammalian genomes. Nat. Rev. Genet. 12, 7–18.

Zhu, Q., Wong, A.K., Krishnan, A., Aure, M.R., Tadych, A., Zhang, R., Corney, D.C., Greene, C.S., Bongo, L.A., Kristensen, V.N., et al. (2015). Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nat. Methods 12, 211–214.

Zhu, Z., Zhang, F., Hu, H., Bakshi, A., Robinson, M.R., Powell, J.E., Montgomery, G.W., Goddard, M.E., Wray, N.R., Visscher, P.M., and Yang, J. (2016). Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48, 481–487.
