/  6
 
Nucleic Acids Research, 2007, Vol. 35, Web Server issue
W91–W96 
doi:10.1093/nar/gkm260
FatiGO
1
: a functional profiling tool for genomic data.Integration of functional annotation, regulatory motifsand interaction data with microarray experiments
Fa´ tima Al-Shahrour
1
, Pablo Minguez
1
, Joaquı´ n Ta´ rraga
1,2
, Ignacio Medina
1
,Eva Alloza
1
, David Montaner
1,2
and Joaquı´ n Dopazo
1,2,
*
1
Bioinformatics Department, Centro de Investigacio´ n Prı´ ncipe Felipe (CIPF), Valencia 46013, Spain and
2
Functional Genomics Node, INB, CIPF, Valencia 46013, Spain
Received January 30, 2007; Revised March 27, 2007; Accepted April 8, 2007
 ABSTRACTThe ultimate goal of any genome-scale experimentis to provide a functional interpretation of the data,relating the available information with the hypoth-eses that originated the experiment. Thus, func-tional profiling methods have become essentialin diverse scenarios such as microarray experi-ments, proteomics, etc. We present the FatiGO
1
,a web-based tool for the functional profiling ofgenome-scale experiments, specially orientedto the interpretation of microarray experiments.In addition to different functional annotations(gene ontology, KEGG pathways, Interpro motifs,Swissprot keywords and text-mining based bioen-tities related to diseases and chemical compounds)FatiGO
1
includes, as a novelty, regulatory andstructural information. The regulatory informationused includes predictions of targets for distinctregulatory elements (obtained from the Transfacand CisRed databases). Additionally FatiGO
1
usespredictions of target motifs of miRNA to infer whichof these can be activated or deactivated in thesample of genes studied. Finally, properties of geneproducts related to their relative location andconnections in the interactome have also beenused. Also, enrichment of any of these functionalterms can be directly analysed on chromosomalcoordinates. FatiGO
1
can be found at: http:/www.fatigoplus.org and within the Babelomicsenvironment http://www.babelomics.orgINTRODUCTION
It is well-known that genes do not operate alone withinthe cell, but in an intricate network of interactions that weonly recently start to understand (1–3). These observa-tions question the validity of the traditional reductionisticvision, in which one or a few key genes would be thecausative factors of phenotypes or diseases (4), and urgesto take into consideration the functional dimension in theinterpretation of genome-scale experiments. In this newscenario, the deregulation of blocks of functionally relatedgenes would be behind the disease phenotype (5). Thus,there is a clear necessity for methods and tools to assist inthe functional interpretation of genome-scale experimentssuch as microarrays, and to formulate genome-scalehypothesis from a systems biology perspective (6–8) in away that the collective properties of groups of genes aretaken into account. Thus, a number of tools, pioneered byOnto-Express (9), or FatiGO (10) that used multiple-testing correction for the first time, GOMiner (11), etc.,can be considered representatives of a family of methodsthat make use of different functional annotations(typically gene ontology (GO) or KEGG pathways) tofind significant functional enrichments that might beuseful in the interpretation of the results of microarrayexperiments (8,12). Functional enrichment methods areapplied
a posteriori 
on a group of genes of interest thathave been selected in a previous step on the basis of theirexperimental values. For example, common criteria forselecting groups of genes in the context of microarrayexperiments would be their differential expression betweentwo classes of experiments. By means of this simpletwo-step approach of gene selection followed by thefunctional enrichment analysis, a reasonable biologicalinterpretation of a microarray experiment can beachieved. Nevertheless, it has been noted that the artificialimposition of a threshold in the gene selection step,which ignores the cooperative behaviour among genes,can have arbitrary consequences on the proper interpreta-tion of the experiment (13). Thus, a new generation of methods which directly test the coordinate behaviourof blocks of functionally related genes has recently beenproposed (14,15).
*To whom correspondence should be addressed. Tel:
þ
34 963289680; Fax:
þ
34 963289701; Email: jdopazo@cipf.es
ß
2007 The Author(s)This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
 
There are, however, situations in which functionalenrichment methods have a domain of application becausethe groups of genes to study are defined as discrete classeswithout the necessity of applying any arbitrary threshold.For example, many clustering methods define non-overlapping groups (clusters) of genes in a deterministicway. Other experiments such as ChIP-on-Chip can be usedto define groups of genes under the control of one or moretranscription factors, undergoing methylation, etc., andsuch groups are also real discrete classes.We present here the FatiGO
þ
, an advanced versionof the original FatiGO program (10). FatiGO
þ
is partof a new generation of functional enrichment toolscharacterized by the integration of heterogeneous, biologi-cally relevant information. FatiGO
þ
includes newfunctional annotations, more regulatory information,structural data on protein interactions, new tests andnew facilities for the representation of the results, such asdisplaying enrichment values onto pathway representa-tions. Additional Table 1 shows a comparison betweenFatiGO
þ
and other web-based tools for functionalenrichment analysis. New aspects such as the use of theinteractome scaffold (not merely the list of knowncomplexes as in other tools (16,17)), the use of newregulatory information (e.g. miRNAs) or the use of text-mining derived functional terms, just to cite themost prominent features, represent a novelty of FatiGO
þ
with respect to other tools.
Programcharacteristics and functionality
The original FatiGO program (10) was published in 2004and, since then it has become a extensively used tool(FatiGO/FatiGO
þ
have had an average of more than100 experiments analysed per day and a cumulated total of more than 37000 worldwide uses during the last year; seea map with the geographical distribution of users here:http://bioinfo.cipf.es/access_map/map.html). FatiGO
þ
constitutes the extension of the concept of functionalenrichment not only to new functional annotations butalso to other relevant classes of biological concepts thatcan be assigned to genes, or their corresponding geneproducts, such as regulation and protein interaction data.
Purpose of the analysis.
The main purpose of FatiGO
þ
isto check for significant enrichments of the functionalcharacteristics selected by the user in one of the lists of genes with respect to the other one. Typically thisoperation represents the comparison of a pre-selected listof genes with respect to the genes of reference (usually therest of genes involved in the experiment). Functionalenrichment is carried out by many different tools alreadyavailable (8), nevertheless, FatiGO
þ
can be used for otherpurposes. The main screen displays four tabs. By default,the enrichment analysis option is selected. There two listsof genes (pre-selected and reference groups) can be inputfor their comparison. Another tab offer the possibility of carrying out enrichment analysis using functional termsdefined by the user. Additionally, other tab can beused just for listing functional terms or characteristicspresent in a list of genes. Finally, another tab presentsa different interface, in which enrichment analysis canbe directly performed on chromosomal regions. Thus, thepre-selected group of genes is defined by delimiting thecoordinates of a chromosomal region and the referencegroup is the rest of the genes in the genome. This is veryuseful to study the functional impact of groups of genesthat cluster in close positions in the genome in pathologiesthat occur with chromosomal copy number alterations.
Input data format.
Input data format are simple text filesof genes (although data can also be pasted within thecorresponding boxes). The most common gene or proteinidentifiers are accepted. This is achieved by using Ensembl(version 42) identifiers as universal cross-references.The following model organisms have been included inFatiGO
þ
:
Homo sapiens, Mus musculus, Gallus gallus,Rattus norvegicus, Drosophila melanogaster, Daniorerio, Caenorhabditis elegans, Saccharomyces cerevisiae,Arabidopsis thaliana
and
Streptomyces coelicolor
. Datacan also be directly imported from the GEPAS suite of microarray data analysis (http://www.gepas.org). In thisway, it is possible from GEPAS to directly studyfunctional enrichment of genes contained in a cluster orco-expression, genes differentially expressed, genesselected in a predictor, genes contained in a givenchromosomal region, etc. When the option for functionalenrichment using user-defined functional categories, inaddition to the lists of genes a table of correspondencegene-annotation must be provided. Again this is a verysimple text file with two tab-delimited columns in whichthe first one contains gene identifiers and the second onethe functional annotation.
Understanding the differences between two lists of genes inthe light of functional annotations.
We have includedin FatiGO
þ
different functional annotations availablefor genes. Probably the most widely used is GO (18).GO represents the biological knowledge as a tree (moreprecisely as a directed acyclic graph, DAG, in which anode can have more that one parent) where elements nearthe root of the tree make reference to more generalconcepts while deeper elements near the leaves of the treemake reference to more specific concepts. Conceptually wecan consider that, if a gene is annotated to a given levelthen it is automatically annotated at all the upper levels(all the parent levels) up to the root. Since genes areannotated at different levels of the GO hierarchy, it iscommon to use this abstraction to choose a pre-definedlevel in the hierarchy instead of using directly the originallevels of annotation of the genes. This strategy is used toproduce GOSlim (see http://www.geneontology.org/GO.slims.shtml), which are cut-down versions of the GOhierarchy containing subsets of the terms in the whole GOat different levels. GoSlim levels give a broad overview of the ontology content without the detail of the specific finegrained terms. Choosing a GOSlim level of GO increasesthe power of the enrichment tests (7,8,12). What FatiGO
þ
does is the enrichment analysis at different GOSlim levelsand reports the deepest (the more detailed definition) levelat which significance is found. This strategy is known asnested inclusive analysis (NIA) (19).
W92
Nucleic Acids Research, 2007, Vol. 35, WebServer issue
 
The KEGG pathways database (20) is anotherwell-known source for functional annotation. We havealso included functional motifs mapped to proteins by theInterpro database (21) and the keywords in the Swissprotentries (22). All these functional annotations can beconsidered discrete classes (that is, a gene has or has nota given functional annotation) so, enrichment is tested bymeans of Fisher’s exact test for 2
Â
2 contingency tables(10,23,24). For each functional annotation the genesbelonging to the two groups compared are distributed ina 2
Â
2 contingency table, where rows account forpresence/absence of the annotation, and columns repre-sent each of the two clusters.Additionally, we have used text-mining methods (25) toextract other functional aspects of the genes beyond theones covered by the ‘traditional’ repositories describedbefore. In the approach described here we have used twotypes of specific words (bioentities): those referred tochemical products and those related to diseases. Thebioentities were extracted from PubMed, and are relatedto human genes by a score derived from the frequency of gene-bioentity co-occurrences and depending on theirproximity within the text. The scores are based on howunlikely it is to observe a certain level of co-occurrencesto happen by chance (26). The gene-bioentity corre-spondence tables with the respective scores wereobtained using the AKS software (available at: http://www.bioalma.com/aks2/). Contrarily to the case of GOand other similar functional categories, bioentities are notdiscrete classes. Thus, the membership of a gene to agiven bioentity is conditioned through the scores.Then, instead of the usual Fisher’s (or equivalent) test,we use a Kolmogorov–Smirnov test to check for bioentityenrichments.Finally, any other functional term can be used provid-ing it can be represented as a table of correspondencebetween genes and instances of the term. Any user-definedtable of correspondence gene-term can be input byFatiGO
þ
and functional enrichment with a Fisher’sexact test on contingency tables can be conducted.
Understanding the differences between two lists of genesusing regulatory information.
Information on the regula-tion of gene expression can also be used within the contextof functional enrichment tests. It can be useful todetermine if, for example, a set of genes pre-selectedbecause they co-express across time are under the controlof the same transcription factor. Different databasescontaining transcription factor binding sites and otherregulatory motifs are available so as promoter regions of the genes can be scanned for the possible presence of thesetarget motifs. We have used information on diverseregulatory motifs from the CisRed (27) and Transfac(28) databases. These motifs can be studied at differentlocations in the promoters (in the first 1kb, in 5kb or upto 10kb).Also other levels of regulation beyond the transcriptionfactors or other common regulatory motifs can beconsidered by including information on microRNAstargets. It has recently been demonstrated the role of miRNAs in the negative regulation of the expression of their target genes (29). Actually, miRNAs have gainedimportance as cancer markers as they have demonstratedto be deregulated in some cancers (30). In the context of gene expression one might be interested in knowingwhether there is some enrichment in miRNA targetsequences in genes that have been deregulated in aparticular cancer when compared to their non-deregulatedcounterparts. A significant enrichment would clearly pointto the presence or absence of one or several miRNAs.Data on miRNA targets have been taken from themiRBase (31).
Understanding the differences between two lists of genesover the interactome scaffold.
Data on protein–proteininteraction allows to understand the physical interplaybetween blocks of genes. Two common cases of obviousbiological relevance, in which we can expect interactionsamong genes, are protein complexes (or other proteinaggregates) and signalling cascades. For example, in thecontext of gene expression it can be extremely usefulto study the effect of gene expression changes within apre-defined scaffold of interactions.The parameter used here is the number of partners withwhich a protein interacts. FatiGO
þ
will check, by meansof a Kolmogorov–Smirnov test for abnormally high(significant) number of interactions in one group of pre-selected genes with respect to the reference group. Thatwould point to the possible existence of one or morecomplexes (or at least to the existence of a complexinteraction scheme) within the group of pre-selected genes.We use the modules graph and RBGL (32) from thebioconductor package to manage the interactome andcalculate the parameters. The interactome database hasbeen built using the PIANA (33) program, that allowsintegrating into one unique database different sources of protein–protein interactions. The databases included were:BIND (34), DIP (35), HPRD (36), MIPS (37), as well asinteractions from different publications (2,3,38).
Comparing a set of genes to lists of gene expression intissues and diseases.
A quite common necessity in drugdiscovery or when testing animal models is to knowwhether the effect of a drug is restoring the normalfunction of a tissue or if a given model can be consideredto represent a given disease. A powerful method to checkthe real impact of any treatment is to compare theexpression of the genes with the corresponding profiles of gene expression in the situation aimed.FatiGO
þ
uses two repositories containing informationof gene expression in different tissues. One of them wasobtained from the SAGE Tag libraries (Cancer GenomeAnatomy Project). A total of 279 human libraries thatbelong to 29 different tissues and 190 mouse libraries from26 tissues have been used (the data were obtained from:http://cgap.nci.nih.gov/SAGE). The other repository con-tains gene expression data obtained from microarrayexperiments available from the Genomics Institute of theNovartis Foundation data. A total of 79 human tissuesand 61 mouse tissues with normal histology were obtainedfrom http://wombat.gnf.org/index.html.
Nucleic Acids Research, 2007, Vol. 35, WebServer issue
W93

Share & Embed

More from this user

Add a Comment

Characters: ...