Minireview
Gene expression data analysis
Alvis Brazma*, Jaak Vilo
European Molecular Biology Laboratory, Outstation Hinxton - the European Bioinformatics Institute, Cambridge CB10 1SD, UK
Received 5 June 2000
Edited by Gianni Cesareni
Abstract Microarrays are one of the latest breakthroughs in experimental molecular biology; they allow monitoring of gene expression for tens of thousands of genes in parallel and are already producing huge amounts of valuable data. Analysis and handling of such data is becoming one of the major bottlenecks in the utilization of the technology. The raw microarray data are images, which have to be transformed into gene expression matrices: tables where rows represent genes, columns represent various samples such as tissues or experimental conditions, and numbers in each cell characterize the expression level of the particular gene in the particular sample. These matrices have to be analyzed further if any knowledge about the underlying biological processes is to be extracted. In this paper we concentrate on discussing bioinformatics methods used for such analysis. We briefly discuss supervised and unsupervised data analysis and its applications, such as predicting gene function classes and cancer classification. Then we discuss how the gene expression matrix can be used to predict putative regulatory signals in the genome sequences. In conclusion we discuss some possible future directions. © 2000 Federation of European Biochemical Societies. Published by Elsevier Science B.V. All rights reserved.
Key words: Microarray; Gene expression; Promoter sequence; Pattern discovery; Clustering
1. Introduction
With several eukaryotic genomes completed and the draft human genome published, we are now entering the postgenomic age. The main focus in genomic research is switching from sequencing to using the genome sequences in order to understand how genomes function. Some questions we would like to ask are:
• what are the functional roles of different genes, and in what cellular processes do they participate;
• how are genes regulated, how do genes and gene products interact, and what are these interaction networks;
• how does the gene expression level differ in various cell types and states, and how is gene expression changed by various diseases or compound treatments.
Knowing the gene transcript abundance in various tissues, developmental stages and under various conditions is important for attacking these questions. Although mRNA is not the ultimate product of a gene, transcription is the first step in gene regulation, and information about the transcript levels is needed for understanding gene regulatory networks. Moreover, the measurement of mRNA levels is currently considerably cheaper and can be done in a more high-throughput way than direct measurement of protein levels. The correlation between mRNA and protein abundance in the cell may not be straightforward; nevertheless, the absence of mRNA in a cell is likely to imply a low level of the respective protein, and thus at least qualitative estimates about the proteome can be based on the transcriptome information. Studies of the correlation between mRNA and protein levels are under way (see [1]).
Monitoring gene expression at the transcript level has become possible with the advent of DNA microarray technologies (see [2]). A microarray is a glass slide onto which single-stranded DNA molecules are attached at fixed locations (spots). There may be tens of thousands of spots on an array, each related to a single gene. Microarrays exploit the preferential binding of complementary single-stranded nucleic acid sequences. There are several variations of microarray technologies, each used in a specific way.
One of the most popular experimental platforms is used for comparing mRNA abundance in two different samples (or a sample and a control). RNA from the sample and control cells is extracted and labeled with two different fluorescent labels, e.g. a red dye for the RNA from the sample population and a green dye for that from the control population. Both extracts are washed over the microarray.
Gene sequences from the extracts hybridize to their complementary sequences in the spots.
To measure the relative abundance of the hybridized RNA, the array is excited by a laser. If the RNA from the sample population is in abundance, the spot will be red; if the RNA from the control population is in abundance, it will be green. If sample and control bind equally, the spot will be yellow, while if neither binds, it will not fluoresce and will appear black. Thus, from the fluorescence intensities and colors for each spot, the relative expression levels of the genes in the sample and control populations can be estimated.
By measuring transcription levels of genes in an organism under various conditions, at different developmental stages and in different tissues, we can build up 'gene expression profiles' which characterize the dynamic functioning of each gene in the genome. We can imagine the expression data represented in a matrix with rows representing genes, columns representing samples (e.g. various tissues, developmental stages and treatments), and each cell containing a number characterizing the expression level of the particular gene in the particular sample. We will call such a table a gene expression matrix. Building up a database of such matrices will help us to understand gene regulation, metabolic and signaling pathways, the genetic mechanisms of disease, and the response to drug treatments. For instance, if overexpression of certain genes is correlated with a certain cancer, we can explore which other conditions affect the expression of these genes and which other genes have similar expression profiles. We can also investigate which compounds (potential drugs) lower the expression level of these genes.
*Corresponding author. Fax: +44 1223 494468. E-mail: brazma@ebi.ac.uk; vilo@ebi.ac.uk
0014-5793/00/$20.00 © 2000 Federation of European Biochemical Societies. Published by Elsevier Science B.V. All rights reserved. PII: S0014-5793(00)01772-5
FEBS 23893 FEBS Letters 480 (2000) 17–24
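The gene expression matrix described in the Introduction can be pictured with a few lines of Python. This is only a minimal sketch for illustration: the gene names, sample names, expression values, the helper functions and the threshold of 2.0 are all invented here and do not come from any real experiment.

```python
# Hypothetical labels and values, invented purely for illustration.
genes = ["geneA", "geneB", "geneC"]
samples = ["liver", "brain", "tumour"]

# Rows are genes, columns are samples; each cell holds an expression
# level (here an arbitrary sample/control ratio, not measured data).
expression = [
    [1.0, 0.5, 4.0],
    [1.1, 0.9, 1.0],
    [0.9, 0.6, 3.5],
]


def gene_profile(gene):
    """Return the expression profile (matrix row) of one gene."""
    return expression[genes.index(gene)]


def sample_profile(sample):
    """Return the expression profile (matrix column) of one sample."""
    j = samples.index(sample)
    return [row[j] for row in expression]


def overexpressed_in(sample, threshold=2.0):
    """List the genes whose level in the given sample exceeds the threshold."""
    column = sample_profile(sample)
    return [g for g, level in zip(genes, column) if level > threshold]
```

With these toy values, `overexpressed_in("tumour")` picks out the two genes whose ratio in the hypothetical tumour column exceeds the chosen threshold.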
2. From raw data to gene expression matrix
Like many experimental technologies, microarrays measure the target quantity (i.e. relative or absolute mRNA abundance) indirectly by measuring another physical quantity: the intensity of the fluorescence of the spots on the array for each fluorescent dye, i.e. for each optical wavelength (so-called channel). Therefore the raw data produced by microarrays are in fact monochrome images (Fig. 1). Transforming these images into the gene expression matrix is a non-trivial process: the spots corresponding to genes on the microarray should be identified, their boundaries determined, and the fluorescence intensity from each spot measured and compared to the background intensity and to the intensities for other channels. The software for this initial image processing is often provided with the image scanner, since it depends on particular properties of the hardware. Often laborious manual adjustment of the grid for spots is used. We will not discuss the raw data processing in detail in this paper; a survey of image analysis software can be found at http://cmpteam4.unil.ch/biocomputing/array/software/MicroArray_Software.html.
Fig. 1. A sample image from scanning a hybridized rat microarray containing over 5000 genes. Each spot features a pool of identical single-stranded DNA molecules representing a single gene. The brightness of the spot is proportional to the amount of fluorescent mRNA hybridized to the DNA of the spot. Automated image analysis software should identify these fluorescent spots, determine their boundaries, measure the fluorescence intensity from each spot and compare it to the background fluorescence. Moreover, the image should be compared to a similar image obtained from the control measurements and the ratio of background-subtracted intensities calculated. In this way images are transformed into a gene expression matrix, which can be analyzed further by numerical methods. The image was kindly provided by Tom Freeman (Sanger Centre, Cambridge, UK).
In any physical experiment it is important to know not only the value of the measurement, but also the standard error or some other indicator of reliability for each data point. For most microarray technology platforms only the ratio of the background-subtracted signals of the given sample and the control is meaningful. If the spot intensity is low, the ratio of these numbers may be high, but the measurement may not be reliable. The spot quality can be assessed not only by the absolute intensity in each channel, but also by many other factors, such as uniformity of the individual pixel intensities, or the shape of the spot. Unfortunately there is currently no standard way of assessing the spot measurement reliability. If experiments have been done in replicates, they can be used to assess the standard errors in addition to the single measurement quality assessments. Little has been published yet on how to assess the reliability of gene expression measurements by combining the information about the spot image in each channel and the replicate images.
Another difficulty in creating a gene expression matrix comes from the necessity to identify each spot with the respective gene. This is not always possible, since spots are typically based on EST sequences, and linking the EST to the respective gene may be non-trivial. Typically it is done through EST clustering. Additionally, the same gene may be represented by several spots on the array, either by exactly the same or by a different sequence. What expression level should be attributed to the gene if measurements from these different spots differ?
Microarray-based gene expression measurements are still far from giving estimates of mRNA counts per cell in the sample. The measurements are relative by nature: essentially we can compare the expression level either of the same gene in different samples, or of different genes in the same sample. Moreover, appropriate normalization should be applied to enable any data comparisons. Typically it is assumed that abundance ratios of 1.5–2 are indicative of a change in gene expression, but such estimates are very crude. The reliability of ratios depends on the absolute intensity values, and also varies from spot to spot due to the specificity of the sequence and cross-hybridization of homologous sequences (for instance see [3]). This should be kept in mind while analyzing the gene expression matrix. The value of microarray-based gene expression measurements would be considerably higher if the reliability and limitations of particular microarray platforms for particular kinds of measurements, as well as cross-platform comparison and normalization, were studied and published.
After we have processed the raw image data into the gene expression matrix, the next task is to analyze this matrix and to try to extract from it some knowledge about the underlying biological processes.
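The ratio of background-subtracted signals discussed in this section, and the caveat about low-intensity spots, can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of any particular image analysis package. The function names, the `min_signal` cutoff of 100 intensity units, and the rule that a spot is flagged as unreliable when either channel falls below that cutoff are all hypothetical. The default fold threshold of 2 is taken from the upper end of the crude 1.5–2 range mentioned above.

```python
import math


def log_ratio(sample_fg, sample_bg, control_fg, control_bg, min_signal=100.0):
    """Return (log2 ratio, reliable) for one spot.

    The *_fg arguments are the measured spot intensities in each channel,
    the *_bg arguments are the corresponding background intensities.
    min_signal is a hypothetical cutoff chosen only for this sketch.
    """
    sample = sample_fg - sample_bg
    control = control_fg - control_bg
    if sample <= 0 or control <= 0:
        # No usable signal above background: the ratio is undefined.
        return None, False
    # Crude assumption: a ratio built on a weak signal is not trusted,
    # however large the ratio itself may be.
    reliable = min(sample, control) >= min_signal
    return math.log2(sample / control), reliable


def is_changed(log2_ratio, fold=2.0):
    """Crude test for a change in expression in either direction."""
    return abs(log2_ratio) >= math.log2(fold)
```

Working with the logarithm of the ratio is a design choice made here for convenience: a two-fold increase and a two-fold decrease then differ only in sign, so a single symmetric threshold covers both.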
3. Gene expression matrix analysis
There are two straightforward ways to study a gene expression matrix:
1. comparing expression profiles of genes by comparing rows in the expression matrix;
2. comparing expression profiles of samples by comparing columns in the matrix.
Additionally both methods can be combined (provided that the data normalization allows it). When comparing rows or columns, we can look either for similarities or for differences. If we find that two rows are similar, we can hypothesize that the respective genes are co-regulated and possibly functionally related. By comparing samples, we can find which genes are differentially expressed and, for instance, study effects of various compounds.
Before we can perform any comparisons, we need a way to measure the similarity (or distance) between the objects we are comparing. We can regard these objects (rows or columns in the matrix) as points in n-dimensional space or as n-dimensional vectors, where n is the number of samples for gene comparison, or the number of genes for sample comparison. The natural, so-called Euclidean distance (for definition see [4]) between these points in the n-dimensional space may be the most obvious, but not necessarily the best choice. It is intuitively appealing to use the correlation coefficient calculated by treating the two n-dimensional vectors as series of random variables. In fact this distance is related to the angle between the two n-dimensional vectors. Euclidean and correlation distance measures are related if we normalize the length of the n-dimensional vectors to 1. This makes it possible to use correlation distance even in cases when Euclidean properties are important. Some other distance measures, including the rank correlation coefficient and a mutual information-based measure, are proposed in D'haeseleer et al. [5]. Currently, to the best of our knowledge, there is no theory of how to choose the best distance measure. Possibly one 'right' distance measure in the expression profile space does not exist, and the choice should depend on the questions that we are asking. Standard sets of known co-regulated genes in various organisms and gene regulatory network modeling can potentially help in finding theoretically substantiated similarity measures.
After having chosen the similarity measure in the expression profile space we can study the expression matrix in either a supervised or an unsupervised manner. The supervised approach assumes that for some (or all) profiles we have additional information, such as functional classes for the genes, or diseased/normal states attributed to the samples. We can view this additional information as labels attached to the rows or columns. Having this information, a typical task is to build a classifier able to predict the labels from the expression profile. A typical example of unsupervised data analysis is expression profile clustering to find groups of co-regulated genes or related samples. For a conceptual illustration of unsupervised and supervised analysis see Fig. 2. First we discuss the clustering approach.
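The relation between the Euclidean and correlation measures mentioned above can be checked numerically. The sketch below uses the Pearson correlation coefficient and assumes that each profile is first centred on its mean and then scaled to unit length. The centring step is our assumption for this example, since the text above only mentions normalizing the length. Under that assumption the squared Euclidean distance d between two standardized profiles equals 2(1 - r), where r is their correlation coefficient. The function names are illustrative, and constant profiles (zero variance) are deliberately not handled.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient of two profiles of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def standardize(x):
    """Centre a profile on its mean and scale it to unit length."""
    m = sum(x) / len(x)
    centred = [a - m for a in x]
    norm = math.sqrt(sum(c * c for c in centred))
    return [c / norm for c in centred]


# Two made-up expression profiles over four hypothetical samples.
x = [1.0, 2.0, 4.0, 3.0]
y = [2.0, 1.0, 5.0, 4.0]

r = pearson(x, y)
d = euclidean(standardize(x), standardize(y))
# After standardization, d * d and 2 * (1 - r) agree up to rounding error.
```

One consequence is that ranking pairs of standardized profiles by Euclidean distance gives the same order as ranking them by decreasing correlation.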
3.1. Unsupervised analysis
The goal of clustering is to group together objects (genes or samples) with similar properties. This can also be viewed as the reduction of the dimensionality of the system. Clustering is not a new technique; many algorithms have been developed for it, and many of these algorithms have been applied to analyze expression data. The hierarchical [6] and K-means clustering algorithms [7,8] as well as self-organizing maps [9] have all been used for clustering expression profiles. Even a simple clustering algorithm based on binning (i.e. discretizing the expression profile space and clustering together the profiles that map into the same bin) has been shown to be useful for clustering genes and subsequently discovering transcription factor binding sites [10]. More recently new algorithms