Professional Documents
Culture Documents
Array Expression
Array Expression
Abstract
ArrayExpress is a public resource for microarray data that has two
major goals: to serve as an archive providing access to microarray data
supporting publications and to build a knowledge base of gene expression
profiles. ArrayExpress consists of two tightly integrated databases:
ArrayExpress repository, which is an archive, and ArrayExpress data
warehouse, which contains reannotated data and is optimized for queries.
As of December 2005, ArrayExpress contains gene expression and other
microarray data from almost 35,000 hybridizations, comprising over 1200
studies, covering 70 different species. Most data are related to peer‐
reviewed publications. Password‐protected access to prepublication data
is provided for reviewers and authors. Data in the repository can be
queried by various parameters such as species, authors, or words used in
the experiment description. The data warehouse provides a wide range of
queries, including ones based on gene and sample properties, and provides
capabilities to retrieve data combined from different studies. The ArrayEx-
press resource also includes Expression Profiler (EP)—a microarray data
mining, analysis, and visualization tool—and MIAMExpress—an online
data submission tool. This chapter describes all major ArrayExpress com-
ponents from the user perspective: how to submit to, retrieve from, and
analyze data in ArrayExpress.
Introduction
Since the first genome‐wide microarray gene expression studies were
published in 1997 (e.g., De Risi et al., 1997), microarrays have become a
standard technology in life sciences research. The amounts of data gener-
ated in a single microarray experiment considerably exceed that generated
by any traditional technology, or by DNA sequencing. Not only do micro-
arrays produce large amounts of data, but these data are complex. A series
of non‐trivial data processing steps have to be applied to raw microarray data
to obtain biologically meaningful results. It has been widely acknowledged
that to interpret microarray experiment results both raw and processed data
(2) how data can be analyzed in Expression Profiler, and (3) how data can
be submitted to ArrayExpress.
selected for the warehouse on the basis of the quality of annotation, pres-
ence of raw and normalized data, and MIAME compliance. The array
annotation is improved and updated using the Ensembl genome database
if the respective array features have been mapped in Ensembl. For genomes
or arrays not present in Ensembl, the UniProt database is used. Annotation,
such as Gene Ontology (GO) (Harris et al., 2004) terms, gene names,
synonyms, and InterPro IDs (Mulder et al., 2003), are added from the latest
release of the respective databases.
The ArrayExpress data warehouse supports queries on gene attributes,
for example, gene names, GO terms, InterPro terms, and on sample
and experiment properties, for example, anatomical terms, disease states,
or array platforms used in the experiment (see http://www.ebi.ac.uk/
arrayexpress). The user can retrieve, combine, and visualize the gene
expression values for multiple experiments. For example, a query of gene
name ‘‘jun’’ and sample property ‘‘leukemia’’ results in retrieval of all
experiments that contain data for one or several genes matching this name
(e.g., jun, junb, jund) and that appear in experiments where one or more
samples are annotated as ‘‘leukemia.’’ First a list of genes that match the
query, that is, jun, junb, and jund, is retrieved, from which the user can
make a subselection by ticking the respective check boxes. Data for the
selected genes are visualized using line plots and can be selected for further
analysis. Links are also provided back to the repository where users can
access the full annotation and supporting raw data.
Instead of querying for individual genes by names or properties, all
genes of an organism can be selected. In this case the query retrieves full
gene expression data matrix in a tab‐delimited format, and matrices from
different experiments can be selected (using tick boxes) and combined in a
single matrix. Figure 5 shows the data selection path in the data warehouse
simple interface, with boxes corresponding to sets of data objects and
arrows describing filtering and data retrieval happening as a result of query
parameters supplied by the user.
The advanced data warehouse query option enables users to choose the
query path in a more complex way. A variety of data fields can be used to
select subsets of experiments, hybridizations, samples, and genes and at any
point export the data matrix with the chosen gene, sample, and hybridization
annotation. For instance, the following query scenarios are possible:
Select all experiments performed in a given laboratory.
Select all hybridizations that belong to these experiments, and where
the sample has been treated with a certain compound.
Export all data available for these hybridizations.
Select genes belonging to a certain GO category.
[20] DATA STORAGE AND ANALYSIS IN ARRAYEXPRESS 377
FIG. 5. Example workflow for the ArrayExpress data warehouse simple query interface.
378 DNA microarrays, part B [20]
columns tabs. The Missing values subsection filters out rows of the matrix
with more than a specified percentage of the values marked as NA (not
available). The Select by similarity option provides an option to supply a list
of genes and, for each of those, select a specified number of most similarly
expressed ones in the same data set.
Expression Profiler provides a graphical interface to three commonly
used BioConductor data normalization routines for Affymetrix and other
380 DNA microarrays, part B [20]
microarray data, namely GCRMA, RMA, and VSN (Huber et al., 2002;
Irizarry et al., 2003; Wu et al., 2003) through the Data Normalization
component (from the Data Transformation menu). In EP, GCRMA and
RMA can only be applied to Affymetrix CEL file imports, whereas VSN
can be applied universally to all types of data.
The Data Transformation component is used where data need to be
transformed to make it suitable for some specific analysis. For instance,
with the Absolute‐to‐Relative transformation, absolute expression values
for each gene can be transformed into relative ones: either relative to a
specified column of the data set or relative to the mean value of the gene.
Clustering Analysis
Expression Profiler provides fast implementations of two classes of clus-
tering algorithms (clustering menu): hierarchical clustering and flat parti-
tioning in the Hierarchical Clustering and K‐means/K‐medoids Clustering
components respectively. A range of distance measures such as Euclidean
or correlation‐based distance measures can be used. The Signature Algorithm
component is an alternative approach to clustering‐like analysis, based on the
method by Bergmann et al. (2003).
[20] DATA STORAGE AND ANALYSIS IN ARRAYEXPRESS 381
[20]
K‐means clustering (K ¼ 5) to a dendrogram. Overlaps between matching clusters are shown with Venn diagrams.
[20] DATA STORAGE AND ANALYSIS IN ARRAYEXPRESS 383
ArrayExpress curators will work with the local database developers to make
sure that the exported MAGE‐ML files are consistent with the best practice
recommendations (see http://www.mged.org/Workgroups/MIAME/miame_
mage‐om.html), as well as that the use of identifiers is consistent with those
in ArrayExpress. Once such a pipeline is established, submitting data to
ArrayExpress becomes a simple task and there is no need to use MIAM-
Express. Pipelines from more than 10 different laboratory database have been
established, including Stanford Microarray Database (SMD) (Sherlock,
2001), MIDAS at TIGR (Saeed et al., 2003, 2006), and University Medical
Center Utrecht microarray database, as well as the array manufacturers
Affymetrix and Agilent, and more are under construction.
Details of the submission process, including when and how to build a
pipeline and what is needed for submission, can be found at www.ebi.ac.uk/
arrayexpress.
Future
One of the immediate future goals of ArrayExpress is to populate the data
warehouse with more data from the repository. We estimate that 50% of data
submitted to the repository will eventually be loaded into the warehouse. At
the same time we are working to extend the functionality of the data ware-
house, and new features, such as selecting genes by similar expression profiles
(which is already possible in Expression Profiler), will be added. The data
warehouse will be closely integrated with Expression Profiler data analysis
tools. Links from other EBI resources, including Ensembl, will be expanded
so that users querying, for instance, genomic information, can easily get the
related expression information. We are also exploring a possibility that data
loaded in the data warehouse can be renormalized for consistency (note that
currently we use normalized data provided by the authors).
Another major development task is making data submissions to Array
Express easier. The batch upload tool facilitates the submission of large
experiments. This tool will be improved further to simplify submissions of
standard experimental designs, such as experiments on one‐channel arrays.
The batch upload tool will also be used to establish direct submission
routes to ArrayExpress from laboratory databases.
One of the guiding principles of ArrayExpress development has been
the support of community data standards, such as MIAME, MAGE‐ML,
and MGED ontology. As the development of MAGE‐ML is continuing,
ArrayExpress will be updated constantly to support these developments.
Acknowledgments
EMBL, the European Commission (FELICS), International Life Sciences Institute
(ILSI), and the National Institutes of Health (NIH) support ArrayExpress development.
[20] DATA STORAGE AND ANALYSIS IN ARRAYEXPRESS 385
References
Ball, C. A., Brazma, A., Causton, H., Chervitz, S., Edgar, R., Hingamp, P., Matese, J. C.,
Parkinson, H., Quackenbush, J., Ringwald, M., Sansone, S. A., Sherlock, G., Spellman, P.,
Stoeckert, C., Tateno, Y., Taylor, R., White, J., and Winegarden, N. (2004). Submissions of
microarray data to public repositories. PLOS Biol. 2, e317.
Ball, C. A., Sherlock, G., Parkinson, H., Rocca‐Sera, P., Brooksbank, C., Causton, H. C.,
Cavalieri, D., Gaasterland, T., Hingamp, P., Holstege, F., Ringwald, M., Spellman, P.,
Stoeckert, C. J., Jr., Stewart, J. E., Taylor, R., Brazma, A., and Quackenbush, J., Microarray
Gene Expression Data Society (2004). Standards for microarray data. Science 298, 539.
Barrett, T., and Edgar, R. (2006). Gene Expression Omnibus: Microarray data storage,
submission, retrieval, and analysis. Methods Enzymol. 411, 352–369.
Bergmann, S., Ihmels, J., and Barkai, N. (2003). Iterative signature algorithm for the analysis of
large‐scale gene expression data. Phys. Rev. E. Stat. Nonlin. Soft Matter Phys. 67(3 Pt 1), 031902.
Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N.,
Holloway, E., Kapushesky, M., Kemmeren, P., Lara, G. G., Oezcimen, A., Rocca‐Serra, P.,
and Sansone, S. A. (2003). ArrayExpress: A public repository for microarray gene expression
data at the EBI. Nucleic Acids Res. 31(1), 68–71.
Brazma, A., Robinson, A., Cameron, G., and Ashbumer, M. (2000). ‘‘One‐stop shop for
microarray data.’’ Nature 403(6771), 699–700.
Culhane, A. C., Perriere, G., Considine, E. C., Cotter, T. G., and Higgins, D. G. (2002).
Between‐group analysis of microarray data. Bioinformatics 18(12), 1600–1608.
DeRisi, J. L., Iyer, V. R., and Brown, P. O. (1997). Exploring the metabolic and genetic
control of gene expression on a genomic scale. Science 278(5338), 680–686.
Edgar, R., Domrachev, M., and Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene
expression and hybridization array data repository. Nucleic Acids Res. 30(1), 207–210.
Erickson, J. N., and Spana, E. P. (2006). Mapping Drosophila genomic aberration
breakpoints with comparative genome hybridization on microarrays. Methods Enzymol
410, 377–386.
Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B.,
Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R.,
Li, F. L. C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L.,
Yang, J. Y. H., and Zhang, J. (2004). Bioconductor: Open software development for
computational biology and bioinformatics. Genome Biol. 5, R80.
Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S.,
Marshall, B., Mungall, C., Richter, J., Rubin, G. M., Blake, J. A., Bult, C., Dolan, M.,
Drabkin, H., Eppig, J. T., Hill, D. P., Ni, L., Ringwald, M., Balakrishnan, R., Cherry, J. M.,
Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S., Fisk, D. G., Hirschman, J. E., Hong,
E. L., Nash, R. S., Sethuraman, A., Theesfeld, C. L., Botstein, D., Dolinski, K., Feierbach, B.,
Berardini, T., Mundodi, S., Rhee, S. Y., Apweiler, R., Barrell, D., Camon, E., Dimmer, E.,
Lee, V., Chisholm, R., Gaudet, P., Kibbe, W., Kishore, R., Schwarz, E. M., Sternberg, P.,
Gwinn, M., Hannick, L., Wortman, J., Berriman, M., Wood, V., de la Cruz, N., Tonellato, P.,
Jaiswal, P., Seigfried, T., and White, R., and Gene Ontology Consortium (2004). The Gene
Ontology (GO) database and informatics resource. Nucleic Acids Res. 31, 236–261.
Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A., and Vingron, M. (2002). Variance
stabilization applied to microarray data calibration and to the quantification of differential
expression. Bioinformatics 18(Suppl. 1), S96–S104.
Ikeo, K., Ishi‐i, J., Tamura, T., Gojobori, T., and Tateno, Y. (2003). CIBEX: Center for
information biology gene expression database. C. R. Biol. 326(10–11), 1079–1082.
Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. P. (2003).
Summaries of affymetrix GeneChip probe level data. Nucleic Acids Res. 31(4), e15.
386 DNA microarrays, part B [20]
Kapushesky, M., Kemmeren, P., Culhane, A. C., Durinck, S., Ihmels, J., Korner, C., Kull, M.,
Torrente, A., Sarkans, U., Vilo, J., and Brazma, A. (2004). Expression Profiler: Next
generation—an online platform for analysis of microarray data. Nucleic Acids Res. 32,
W465–W470.
Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Barrell, D., Bateman, A., Binns, D.,
Biswas, M., Bradley, P., Bork, P., Bucher, P., Copley, R. R., Courcelle, E., Das, U., Durbin, R.,
Falquet, L., Fleischmann, W., Griffiths‐Jones, S., Haft, D., Harte, N., Hulo, N., Kahn, D.,
Kanapin, A., Krestyaninova, M., Lopez, R., Letunic, I., Lonsdale, D., Silventoinen, V.,
Orchard, S. E., Pagni, M., Peyruc, D., Ponting, C. P., Selengut, J. D., Servant, F., Sigrist, C. J. A.,
Vaughan, R., and Zdobnov, E. J. (2003). The InterPro Database, 2003 brings increased
coverage and new features. Nucleic Acids Res. 31, 315–318.
Negre, N., Lavrov, S., Hennetin, J., Bellis, M., and Cavalli, G. (2006). Mapping the
distribution of chromatin proteins by ChIP on chip. Methods Enzymol. 410, 316–341.
Parkinson, H., Sarkans, U., Shojatalab, M., Abeygunawardena, N., Contrino, S., Couison, R.,
Fame, A., Lara, G. G., Holloway, E., Kapushesky, M., Lija, P., Mukherjee, G., Oezcimen, A.,
Rayner, T., Rocca‐Serra, P., Sharma, A., Sansone, S., and Brazma, A. (2005). ‘‘Array‐
Express—A public repository for microarray gene expression data at the EBI.’’ Nucleic
Acids Res. 33, Database Issue: D553–D555.
Reimers, M., and Carey, V. J. (2006). Bioconductor: An open source framework for
bioinformatics and computational biology. Methods Enzymol. 411, 119–134.
Saeed, A. I., Bhagabati, N. K., Braisted, J. C., Liang, W., Sharov, V., Howe, E. A., Li, J.,
Thiagarajan, M., White, J. A., and Quackenbush, J. (2006). TM4 microarray software
suite. Methods Enzymol. 411, 134–193.
Saeed, A. I., Sharov, V., White, J., Li, J., Liang, W., Bhagabati, N., Braisted, J., Klapa, M.,
Currier, T., Thiagarajan, M., Sturn, A., Snuffin, M., Rezantsev, A., Popov, D., Ryltsov, A.,
Kostukovich, E., Borisovsky, I., Liu, Z., Vinsavich, A., Trush, V., and Quackenbush, J.
(2003). TM4: A free, open‐source system for microarray data management and analysis.
Biotechniques 34(2), 374–378.
Sarkans, U. P. H., Lara, G. G., Dezcimen, A., Sharma, A., Abeygunawardena, N., Contrino, S.,
Holloway, E., Rocca‐Serra, P., Mukherjee, G., Shojatalab, M., Kapushesky, M., Sansone, S. A.,
Fame, A., Rayner, T., and Brazma, A. (2005). ‘‘The ArrayExpress gene expression database:
A software engineering and implementation perspective.’’ Bioinformatics 21(8), 1495–1501.
Scacheri, P. C., Crawford, G. E., and Davis, D. (2006). Statistics for ChIP‐chip and DNase
hypersensitivity experiments on NimbleGen arrays. Methods Enzymol. 411, 270–282.
Sherlock, G., Hernandez‐Boussard, T., Kasarskis, A., Binkley, G., Matese, J. C., Dwight, S. S.,
Kaloper, M., Weng, S., Jin, H., Ball, C. A., Eisen, M. B., Spellman, P. T., Brown, P. O.,
Botstein, D., and Cherry, J. M. (2001). The Stanford Microarray Database. Nucleic Acids
Res. 29(1), 152–155.
Spellman, P. T., Miller, M., Stewart, J., Troup, C., Sarkans, U., Chervitz, S., Bernhart, D.,
Sherlock, G., Ball, C., Lepage, M., Swiatek, M., Marks, W. L., Goncalves, J., Markel, S.,
Iordan, D., Shojatalab, M., Pizarro, A., White, J., Hubley, R., Deutsch, E., Senger, M.,
Aronow, B. J., Robinson, A., Bassett, D., Stoeckert, C. J., Jr., and Brazma, A. (2002).
Design and implementation of microarray gene expression markup language (MAGE‐
ML). Genome Biol. 3(9), RESEARCH0046.
Thioulouse, J., Chessel, D., Dolédec, S., and Olivier, J. M. (1997). ADE‐4: A multivariate
analysis and graphical display software. Stat. Comput. 7(1), 75–83.
Torrente, A., Kapushesky, M., and Brazma, A. (2005). A new algorithm for comparing and
visualizing relationships between hierarchical and flat gene expression data clusterings.
Bioinformatics 21, 3993–3999.
Wu, Z., Irizarry, R., Gentleman, R., Murillo, F., and Spencer, F. (2003). ‘‘A Model Based
Background Adjustment for Oligonucleotide Expression Arrays.’’ John Hopkins University,
Baltimore, MD.