
Hindawi Publishing Corporation

Advances in Bioinformatics
Volume 2012, Article ID 582765, 12 pages
doi:10.1155/2012/582765

Research Article
Exploring Biomolecular Literature with EVEX: Connecting Genes
through Events, Homology, and Indirect Associations

Sofie Van Landeghem,1, 2 Kai Hakala,3 Samuel Rönnqvist,3 Tapio Salakoski,3, 4
Yves Van de Peer,1, 2 and Filip Ginter3
1 Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
2 Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
3 Department of Information Technology, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
4 Turku BioNLP Group, Turku Centre for Computer Science (TUCS), Joukahaisenkatu 3-5, 20520 Turku, Finland

Correspondence should be addressed to Filip Ginter, ginter@cs.utu.fi

Received 22 November 2011; Revised 16 March 2012; Accepted 28 March 2012

Academic Editor: Jin-Dong Kim

Copyright © 2012 Sofie Van Landeghem et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

Technological advancements in the field of genetics have not only led to an abundance of experimental data, but also caused an
exponential increase in the number of published biomolecular studies. Text mining is widely accepted as a promising technique
to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web
application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining
results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract
generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation,
regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The
search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a
powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such
as coregulators.

1. Introduction

The field of natural language processing for biomolecular texts (BioNLP) aims at large-scale text mining in support of life science research. Its primary motivation is the enormous amount of available scientific literature, which makes it essentially impossible to rapidly gain an overview of prior research results other than in a very narrow domain of interest. Among the typical use cases for BioNLP applications are support for database curation, linking experimental data with relevant literature, content visualization, and hypothesis generation; all of these tasks require processing and summarizing large amounts of individual research articles. Among the most heavily studied tasks in BioNLP is the extraction of information about known associations between biomolecular entities, primarily genes and gene products, and this task has recently seen much progress in two general directions.

First, relationships between biomolecular entities are now being extracted in much greater detail. Until recently, the focus was on extracting untyped and undirected binary relations which, while stating that there is some relationship between two objects, gave little additional information about the nature of the relationship. Recognizing that extracting such relations may not provide sufficient detail for wider adoption of text mining in the biomedical community, the focus is currently shifting towards a more detailed analysis of the text, providing additional vital information about the detected relationships. Such information includes the type of the relationship, the specific roles of the arguments (e.g., affector or affectee), the polarity of the relationship (positive versus negative statement), and whether it was stated in a speculative or affirmative context. This more detailed text mining target was formalized as an event extraction task and greatly popularized in the series of BioNLP Shared Tasks on Event Extraction [1, 2]. These shared tasks mark

a truly community-wide effort to develop efficient systems to extract sufficiently detailed information for real-world, practical applications.

Second, text mining systems are now being applied on a large scale, recognizing the fact that, in order for a text mining service to be adopted by its target audience, that is, researchers in the life sciences, it must cover as much of the available literature as possible. While small-scale studies on well-defined and carefully constructed corpora comprising several hundred abstracts are of great utility to BioNLP research, actual applications of the resulting methods require the processing of considerably larger volumes of text, ideally including all available literature. Numerous studies have been published demonstrating that even complex and computationally intensive methods can be successfully applied on a large scale, typically processing all available abstracts in PubMed and/or all full-text articles in the open-access section of PubMed Central. For instance, the iHOP [3] and Medie [4] systems allow users to directly mine literature relevant to given genes or proteins of interest, allowing for structured queries far beyond the usual keyword search. EBIMed [5] offers a broad scope by also including gene ontology terms such as biological processes, as well as drugs and species names. Other systems, such as the BioText search engine [6] and Yale Image Finder [7] allow for a comprehensive search in full-text articles, including also figures and tables. Finally, the BioNOT system [8] focuses specifically on extracting negative evidence from scientific articles.

The first large-scale application that specifically targets the extraction of detailed events according to their definition in the BioNLP Shared Tasks is the dataset of Björne et al. [9], comprising 19 million events among 36 million gene and protein mentions. This data was obtained by processing all 18 million titles and abstracts in the 2009 PubMed distribution using the winning system of the BioNLP’09 Shared Task. In a subsequent study of Van Landeghem et al. [10], the dataset was refined, generalized, and released as a relational (SQL) database referred to as EVEX. Among the main contributions of this subsequent study was the generalization of the events, using publicly available gene family definitions.

Although a major step forward from the original text-bound events produced by the event extraction system, the main audience for the EVEX database was still the BioNLP community. Consequently, the dataset is not easily accessible for researchers in the life sciences who are not familiar with the intricacies of the event representation. Further, as the massive relational database contains millions of events, manual querying is not an acceptable way to access the data for daily use in life science research.

In this study, we introduce a publicly available web application based on the EVEX dataset, presenting the first application that brings large-scale event-based text mining results to a broad audience of end-users including biologists, geneticists, and other researchers in the life sciences. The web application is available at http://www.evexdb.org/. The primary purpose of the application is to provide the EVEX dataset with an intuitive interface that does not presuppose familiarity with the underlying event representation. The application presents a comprehensive and thoroughly interlinked overview of all events for a given gene or protein, or a gene/protein pair. The main novel feature of this application, as compared to other available large-scale text mining applications, is that it covers highly detailed event structures that are enriched with homology-based information and additionally extracts indirect associations by applying cross-document aggregation and combination of events.

In the following section, we provide more details on the EVEX text mining dataset, its text-bound extraction results, and the gene family-based generalizations. Further, we present several novel algorithms for event ranking, refinement, and retrieval of indirect associations. Section 3 presents an evaluation of the EVEX dataset and the described algorithms. The features of the web application are illustrated in Section 4, presenting a real-world use case on the budding yeast gene Mec1, which has known mammalian and plant homologs. We conclude by summarizing the main contributions of this work and highlighting several interesting opportunities for future work.

2. Data and Methods

This section describes the original event data, as well as a ranking procedure that sorts events according to their reliability. Further, two abstract layers are defined on top of the complex event structures, enabling coarse grouping of similar events, and providing an intuitive pairwise point of view that allows fast retrieval of interesting gene/protein pairs. Finally, we describe a hypothesis generation module that finds missing links between two entities, allowing the user to retrieve proteins with common binding partners or genes that act as coregulators of a group of common target genes.

2.1. EVEX Dataset

2.1.1. Core Events. The core set of text mining results accessible through the EVEX resource has been generated by the Turku Event Extraction System, the winning system of the BioNLP’09 Shared Task (ST) on Event Extraction [1]. This extraction system was combined with the BANNER named entity recognizer [11], forming a complete event extraction pipeline that had the highest reported accuracy on the task in 2009, and still remains state-of-the-art, as shown in the recent ST’11 [12]. This event extraction pipeline was applied to all citations in the 2009 distribution of PubMed [9]. As part of the current study, citations from the period 2009–2011 have been processed, using essentially the same pipeline with several minor improvements, resulting in 40.3 million tagged gene symbols and 21.3 million extracted events. The underlying event dataset has thus been brought up to date and will be regularly updated in the future.

The dataset contains events as defined in the context of the ST’09, that is, predicates with a variable number of arguments which can be gene/protein symbols or, recursively, other events. Each argument is defined as having the role of Cause or Theme in the event. There are nine distinct event types: binding, phosphorylation, regulation (positive, negative, and unspecified), protein catabolism, transcription, localization, and gene expression.
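The recursive event definition above can be made concrete with a small data structure. The following Python sketch is only an illustration under stated assumptions: the class names `Gene` and `Event`, the spelling of the type labels, and the helper `gene_symbols` are hypothetical and are not taken from the EVEX database schema. The example event encodes the statement "IL-2 acts by enhancing binding activity of NF-κB to p55" used in Figure 1.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# The nine event types named in the text (label spelling is an assumption).
EVENT_TYPES = frozenset({
    "Binding", "Phosphorylation", "Regulation", "Positive-Regulation",
    "Negative-Regulation", "Protein-Catabolism", "Transcription",
    "Localization", "Gene-Expression",
})
ROLES = ("Cause", "Theme")


@dataclass(frozen=True)
class Gene:
    symbol: str


@dataclass(frozen=True)
class Event:
    type: str
    # Each argument is a (role, value) pair; the value is a gene or,
    # recursively, another event.
    args: Tuple[Tuple[str, Union[Gene, "Event"]], ...]

    def __post_init__(self):
        if self.type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.type}")
        for role, _ in self.args:
            if role not in ROLES:
                raise ValueError(f"unknown argument role: {role}")


def gene_symbols(event):
    """Collect all gene symbols in an event, descending into nested events."""
    found = []
    for _, arg in event.args:
        if isinstance(arg, Gene):
            found.append(arg.symbol)
        else:
            found.extend(gene_symbols(arg))
    return found


# "IL-2 acts by enhancing binding activity of NF-κB to p55"
binding = Event("Binding", (("Theme", Gene("NF-κB")), ("Theme", Gene("p55"))))
example = Event("Positive-Regulation", (("Cause", Gene("IL-2")), ("Theme", binding)))
```

Nesting is expressed simply by placing an `Event` where a `Gene` could appear, which mirrors the "predicates with a variable number of arguments" wording of the definition.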

Further, each event refers to a specific trigger word in text. For example, the word increases typically triggers a positive regulation event and degradation typically refers to protein catabolism. An example event structure is illustrated in Figure 1.

Figure 1: Event representation of the statement IL-2 acts by enhancing binding activity of NF-κB to p55, illustrating recursive nesting of events where the (T)heme of the positive regulation event is the binding event. The (C)ause argument is the gene symbol IL-2 (figure adapted from [10]).

Event definitions impose several restrictions on event arguments: (1) events of the type phosphorylation, protein catabolism, transcription, localization, and gene expression must only have a single argument, a Theme, which must be a gene or a protein, (2) events of the binding type may have any number of gene/protein Theme arguments and cannot have a Cause argument, and finally (3) regulation events must have exactly one Theme argument and may have one Cause argument, with no restrictions as to whether these arguments are genes/proteins or recursively other events.

In the following text, we will state events using a simple bracketed notation, where the event type is stated first, followed by a comma-separated list of arguments enclosed in parentheses. For instance, the event in Figure 1 would be stated as Positive-Regulation(C:IL-2, T:Binding(T:NF-κB, T:p55)), where C: and T: denote the role of the argument as (C)ause or (T)heme. For brevity, we will further refer to all biochemical entities, even proteins and mRNA, as genes.

2.1.2. Event Generalizations. One of the major limitations of the original core set of events is that they are strictly text-bound and provide no facility for a more general treatment, such as being able to abstract from different name spelling variants and symbol synonymy. Further, the events were originally treated as merely text strings with no database identity referring to external resources such as UniProt [13] or Entrez Gene [14]. The EVEX dataset addresses these issues by providing event generalizations [10].

First, the identified gene symbols in the EVEX dataset are canonicalized by removing superfluous affixes (prefixes and suffixes) to obtain the core gene symbol, followed by discarding nonalphanumeric characters and lowercasing. For instance, the full string human Esr-1 subunit is canonicalized into esr1. The purpose of this canonicalization is to abstract away from minor spelling variants and to deal with the fact that the BANNER named entity recognizer often includes a wider context around the core gene symbol.

The canonicalization algorithm itself cannot, however, deal with the ambiguity prevalent among the symbols. EVEX thus further resolves these canonical gene symbols, whenever possible, into their most likely families, using two distinct resources for defining homologous genes and gene families: HomoloGene (eukaryotes, [14]) and Ensembl (vertebrates, [15]). As part of this study, we extended EVEX to also include families from Ensembl Genomes, which provides coverage for metazoa, plants, protists, fungi, and bacteria [16].

Building on top of these definitions, the EVEX dataset now defines four event generalizations, whereby all events whose arguments have the same canonical form, or resolve to the same gene family, are aggregated. As a result, it becomes straightforward to retrieve all information on a specific gene symbol, abstracting away from lexical variants through the canonicalization algorithm, or to additionally apply the synonym-expansion module through the family-based generalizations. These different generalizations are all implemented on the web application (Section 4.2).

2.2. Event Ranking. To rank the extracted events according to their reliability, we have implemented an event scoring algorithm based on the output of the Turku Event Extraction System. This machine learning system uses linear Support Vector Machines (SVMs) as the underlying classifier [17]. Every classification is given a confidence score, the distance to the decision hyperplane of the linear classifier, where higher scores are associated with more confident decisions. There is not a single master classifier to predict the events in their entirety. Rather, individual classifications are made to predict the event trigger and each of its arguments. In order to assign a single confidence score to a specific event occurrence, the predictions from these two separate classifiers must be aggregated.

The confidence scores of the two different classifiers are not directly mutually comparable, and we therefore first normalize all scores in the dataset to zero mean and unit standard deviation, separately for triggers and arguments. Subsequently, the score of a specific event occurrence is assigned to be the minimum of the normalized scores of its event trigger and its arguments, that is, the lowest normalized confidence among all classification decisions involved in extracting that specific event. Using minimum as the aggregation function roughly corresponds to the fuzzy and operator in that it requires all components of an event to be confident for it to be ranked high. Finally, the score of a generalized event is the average of the scores of all its occurrences.

To assign a meaningful interpretation to the normalized and aggregated confidence values, events within the top 20% of the confidence range are classified as “very high confidence.” The other 4 categories, each representing the next 20% of all events, are respectively labeled as “high confidence,” “average confidence,” “low confidence” and “very low confidence.” When presenting multiple possible hits for a certain query, the web application uses the original scores to rank the events from high to low reliability.
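The scoring procedure described above (separate normalization of trigger and argument scores, minimum aggregation per occurrence, averaging per generalized event, and five confidence categories) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the input layout, the function names, and the toy scores are hypothetical, and the category assignment assumes that each label covers one fifth of the events when ranked by score, which is one reading of the description.

```python
from collections import defaultdict
from statistics import mean, pstdev

LABELS = ["very high", "high", "average", "low", "very low"]


def normalize(scores):
    """Rescale raw classifier scores to zero mean and unit standard deviation."""
    mu, sd = mean(scores), pstdev(scores)
    if sd == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sd for s in scores]


def score_occurrences(occurrences):
    """occurrences: list of (event_key, trigger_score, [argument_scores]).

    Triggers and arguments are normalized separately; each occurrence then
    receives the minimum of its normalized trigger and argument scores.
    """
    triggers = normalize([t for _, t, _ in occurrences])
    flat_args = normalize([a for _, _, args in occurrences for a in args])
    scored, start = [], 0
    for (key, _, args), trigger in zip(occurrences, triggers):
        own_args = flat_args[start:start + len(args)]
        start += len(args)
        scored.append((key, min([trigger] + own_args)))
    return scored


def score_generalized(scored):
    """Average the occurrence scores of each generalized event."""
    by_key = defaultdict(list)
    for key, score in scored:
        by_key[key].append(score)
    return {key: mean(values) for key, values in by_key.items()}


def label_events(scores_by_key):
    """Assign one of five confidence labels by rank (assumed: 20% of events each)."""
    ranked = sorted(scores_by_key, key=scores_by_key.get, reverse=True)
    n = len(ranked)
    return {key: LABELS[min(4, rank * 5 // n)] for rank, key in enumerate(ranked)}


# Toy input: two occurrences of one generalized event and one of another.
occurrences = [
    ("A>B", 2.0, [1.0, 3.0]),
    ("A>B", 0.0, [1.0]),
    ("C>D", 1.0, [2.0]),
]
scored = score_occurrences(occurrences)
generalized = score_generalized(scored)
labels = label_events(generalized)
```

Taking the minimum means that a single weak trigger or argument decision pulls the whole occurrence down, which is the "fuzzy and" behaviour the text describes.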

2.3. Event Refinement. The extraction of event structures is highly dependent on the lexical and syntactic constructs used in the sentence and may therefore contain unnecessary complexity. This is because the event extraction system is trained to closely follow the actual statements in the sentence and thus, for instance, will mark both of the words increase and induces as triggers for positive regulation events in the sentence Ang II induces a rapid increase in MAPK activity. Consequently, the final event structure is extracted as Positive-Regulation(C: Ang II, T: Positive-Regulation(T: MAPK)), that is, Ang II is a Cause argument of a positive regulation event, which has another positive regulation event as its Theme.

While correctly extracted, such nested single-argument regulatory events (i.e., regulations with a Theme but no Cause argument), often forming chains that are several events long, are unnecessarily complex. Clearly, the event above can be restated as Positive-Regulation(C: Ang II, T: MAPK), removing the nested single-argument positive regulation event. This refinement helps to establish the event as equivalent with all other events that can be refined to the same elementary structure, enhancing the event aggregation possibilities in EVEX. However, when presenting the details of the extracted event to the user, the original structure of the event is preserved.

Table 1 lists the set of refinement rules. In this context, positive and negative regulation refer to having a general positive or negative effect, while an unspecified regulation could not be resolved to either category due to missing information in the sentence. To simplify the single-argument regulatory events, we proceed iteratively, removing intermediary single-argument regulatory events as long as any rule matches. A particular consideration is given to the polarity of the regulations. While a nested chain of single-argument positive regulations can be safely reduced to a single positive regulation, the outcome of reducing chains of single-argument regulations of mixed polarity is less obvious. As illustrated in Table 1, application of the rules may result in a change of polarity of the outer event. For instance, a regulation of a negative regulation is interpreted as a negative regulation, changing the polarity of the outer event from unspecified to negative. To avoid excessive inferences not licensed by the text, the algorithm only allows one such change of polarity. Any subsequent removal of a nested single-argument regulatory event that results in a type change forces the new type of the outer event to be of the unspecified regulation type.

2.4. Pairwise Abstraction. The most basic query issued on the EVEX web application involves a single gene, which triggers the generation of a structured overview page, listing associated genes grouped by their type of connection with the query gene (Section 4.1). The most important underlying functionality implemented by the web application is thus the ability to identify and categorize pairs of related genes. This pairwise point of view comes naturally in the life sciences and can be implemented on top of the events with ease by analyzing common event structures and defining argument pairs within. The refinements discussed in Section 2.3 substantially decrease the number of unique event structures present in the data, restricting the required analysis to a comparatively small number of event structures. Furthermore, we only need to consider those events that involve more than one gene or that are a recursive argument in such an event, limiting the set of event occurrences from 21 M to 12 M events.

As an example, let us consider the event Positive-Regulation(C:Thrombin, T:Positive-Regulation(C:EGF, T:Phosphorylation(T:Akt))), extracted from the sentence Thrombin augmented EGF-stimulated Akt phosphorylation. The pairs of interest here are Thrombin-Akt and EGF-Akt, both coarsely categorized as regulation. Therefore, whenever a user queries for Thrombin, the Akt gene will be listed among the regulation targets, and, whenever a user queries for Akt, both Thrombin and EGF will be listed as regulators. Note, however, that the categorization of the association as regulation is only for the purpose of coarse grouping of the results on the overview page. The user will additionally be presented with the details of the original event, which is translated from the bracketed notation into the English statement Upregulation of AKT phosphorylation by EGF is upregulated by Thrombin.

There is a limited number of prevalent event structures which account for the vast majority of event occurrences. Table 2 lists the most common structures, together with the gene pairs extracted from them. The algorithm to extract the gene pairs from the event structures proceeds as follows.

(1) All argument pairs are considered a candidate and classified as binding if both participants are a Theme of one specific binding event, and regulation otherwise. (Note that due to the restrictions of event arguments as described in Section 2.1.1, only binding and regulation events can have more than one argument.)

(2) If one of the genes is a Theme argument of an event which itself is a Cause argument, for example, G2 in Regulation(C:Regulation(C:G1, T:G2), T:G3), the association type of the candidate pair G2-G3 is reclassified as indirect regulation, since the direct regulator of G3 is the Cause argument of the nested regulation (G1).

(3) If one of the genes is a Cause argument of an event which itself is a Theme argument, for example, G2 in Regulation(C:G1, T:Regulation(C:G2, T:G3)), the candidate pair (G1-G2) is discarded.

While the association between G1 and G2 is discarded in step (3) since it in many cases cannot convincingly be classified as a regulation, it is represented as a coregulation when indirect associations, described in the following section, are sought.
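Steps (1) to (3) of the pair extraction can be sketched in Python as shown below. This is a hedged illustration rather than the EVEX implementation: events are encoded as hypothetical nested tuples `(type, [(role, argument), ...])` with roles "C" and "T", and the rules are interpreted through the chain of roles leading from the event that joins two genes down to each gene. In particular, checking only the last two roles of that chain, and letting step (3) take precedence over step (2), are assumptions of this sketch.

```python
from itertools import combinations


def gene_paths(event):
    """Yield (gene, roles), where roles is the chain of argument roles
    leading from `event` down to the gene."""
    _type, args = event
    for role, arg in args:
        if isinstance(arg, str):
            yield arg, (role,)
        else:
            for gene, roles in gene_paths(arg):
                yield gene, (role,) + roles


def classify(event_type, roles_a, roles_b):
    """Return the association type of a candidate pair, or None if discarded."""
    if event_type == "Binding":
        return "binding"                      # step (1): Themes of one binding event
    tails = (roles_a[-2:], roles_b[-2:])
    if ("T", "C") in tails:
        return None                           # step (3): Cause of an event that is a Theme
    if ("C", "T") in tails:
        return "indirect regulation"          # step (2): Theme of an event that is a Cause
    return "regulation"                       # step (1): default for regulation events


def extract_pairs(event):
    """Return (first, second, type) triples; for regulations, first regulates second."""
    event_type, args = event
    pairs, branches = [], []
    for role, arg in args:
        if isinstance(arg, str):
            branches.append([(arg, (role,))])
        else:
            branches.append([(g, (role,) + r) for g, r in gene_paths(arg)])
            pairs.extend(extract_pairs(arg))  # pairs lying entirely inside the nested event
    for left, right in combinations(branches, 2):
        for gene_a, roles_a in left:
            for gene_b, roles_b in right:
                kind = classify(event_type, roles_a, roles_b)
                if kind is None:
                    continue
                if kind != "binding" and roles_a[0] != "C":
                    gene_a, gene_b = gene_b, gene_a   # put the Cause-side gene first
                pairs.append((gene_a, gene_b, kind))
    return pairs


thrombin_event = ("Positive-Regulation", [
    ("C", "Thrombin"),
    ("T", ("Positive-Regulation", [
        ("C", "EGF"),
        ("T", ("Phosphorylation", [("T", "Akt")])),
    ])),
])
thrombin_pairs = extract_pairs(thrombin_event)
```

For the Thrombin example this yields the two regulation pairs named in the text, EGF-Akt and Thrombin-Akt, while the Thrombin-EGF candidate is dropped by step (3).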

Table 1: Listing of the refinement rules, involving any nested combination of the three types of regulation: positive regulation (Pos), negative regulation (Neg) and unspecified regulation (Reg). Each parent event has a regulatory (T)heme argument and an optional (C)ause. The nested regulations are all regulations without a Cause and their detailed structure is omitted for brevity. In full, the first structure would read Pos(C:geneA, T:Pos(T:geneB)) which is rewritten to Pos(C:geneA, T:geneB) with geneA and geneB being any two genes.

Original         Result       Example
Pos(C, T:Pos)    Pos(C, T)    BRs induce accumulation of BZR1 protein
Pos(C, T:Reg)    Pos(C, T)    PKS5 mediates PM H+-ATPase regulation
Reg(C, T:Pos)    Pos(C, T)    CaM regulates activation of HSFs
Neg(C, T:Neg)    Pos(C, T)    E2 prevented downregulation of p21
Reg(C, T:Reg)    Reg(C, T)    PDK1 is involved in the regulation of S6K
Neg(C, T:Reg)    Neg(C, T)    GW5074 prevents this effect on ENT1 mRNA
Neg(C, T:Pos)    Neg(C, T)    BIN2 negatively regulates BZR1 accumulation
Reg(C, T:Neg)    Neg(C, T)    The effect of hCG in downregulating ER beta
Pos(C, T:Neg)    Neg(C, T)    DtRE is required for repression of CAB2

Table 2: The most prevalent (refined) event patterns in the EVEX data, considering only events with more than one gene or protein symbol, and their recursively nested events. A and B refer to gene symbols. The first two columns refer to the percentage of event occurrences covered by the given pattern and the cumulative percentage of event occurrences up to and including the pattern. These aggregated patterns refer to any type of regulation (*Reg), to binding events between two genes (Bind), and to any physical event (Phy) concerning a single gene such as protein-DNA binding, protein catabolism, transcription, localization, phosphorylation, and gene expression. The right-most column depicts the extracted gene pair and a coarse classification of its association type. Bindings are represented with ×, and for regulations A > B means A regulates B; pairs marked as indirect express an indirect regulation.

Occ. [%]   Cum. occ. [%]   Event pattern                             Gene pair
58.6       58.6            Phy(T:A)                                  none
15.0       73.6            *Reg(T:A)                                 none
8.4        82.0            *Reg(T:Phy(T:A))                          none
8.0        90.0            Bind(T:A, T:B)                            A × B
4.7        94.7            *Reg(C:A, T:B)                            A > B
3.8        98.5            *Reg(C:A, T:Phy(T:B))                     A > B
0.2        98.7            *Reg(C:*Reg(T:Phy(T:A)), T:Phy(T:B))      A, B (indirect)
0.2        98.9            *Reg(C:Phy(T:A), T:B)                     A, B (indirect)
0.2        99.1            *Reg(C:Phy(T:A), T:Phy(T:B))              A, B (indirect)

2.5. Indirect Associations. A cell’s activity is often organized into regulatory modules, that is, sets of coregulated genes that share a common function. Such modules can be found by automated analysis and clustering of genome-wide expression profiles [18]. Individual events, as defined by the BioNLP Shared Tasks, do not explicitly express such associations. However, indirect regulatory associations can be identified by combining the information expressed in various events retrieved across different articles. For instance, the events Regulation(C:geneA, T:geneZ) and Regulation(C:geneB, T:geneZ) can be aggregated to present the hypothesis that geneA and geneB coregulate geneZ. Such hypothesis generation is greatly simplified by the fact that the events have been refined using the procedure described in Section 2.3 and the usage of a relational database, which allows efficient querying across events.

The indirect associations as implemented for the web application include coregulation and common binding partners (Table 3). These links have been precalculated and stored in the database, enabling fast retrieval of, for example, coregulators or genes that are targeted by a common regulator, facilitating the discovery of functional modules through text mining information. However, it needs to be stated that these associations are mainly hypothetical, as, for example, coregulators additionally require coexpression. Details on gene expression events can be found by browsing the sentences of specific genes as described in Section 4.1.

Table 3: Indirect associations between gene A and gene B, established by combining binding and regulatory events through a common interaction partner gene Z. Bindings are represented with ×, and for regulations A > B means A regulates B.

Association   Interpretation
A > Z < B     A and B coregulate Z
A < Z > B     A and B are being regulated by Z
A × Z × B     A and B share a common binding partner Z
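The three indirect association types of Table 3 can be derived from a list of pairwise associations, as in the sketch below. It is an in-memory Python illustration only; the text states that EVEX precalculates these links in a relational database, and the function name, the triple layout `(first, second, type)`, and the decision to use only direct regulation pairs are assumptions made here.

```python
from collections import defaultdict
from itertools import combinations


def indirect_associations(pairs):
    """pairs: iterable of (a, b, kind); for kind 'regulation', a regulates b.

    Returns a set of (gene_a, gene_b, association, z) tuples, where z is the
    common interaction partner and association follows Table 3.
    """
    regulators = defaultdict(set)   # target    -> genes regulating it
    targets = defaultdict(set)      # regulator -> genes it regulates
    partners = defaultdict(set)     # gene      -> binding partners
    for a, b, kind in pairs:
        if kind == "binding":
            partners[a].add(b)
            partners[b].add(a)
        elif kind == "regulation":
            regulators[b].add(a)
            targets[a].add(b)

    found = set()
    indexes = (("A > Z < B", regulators), ("A < Z > B", targets), ("A × Z × B", partners))
    for association, index in indexes:
        for z, genes in index.items():
            # Every two genes linked to the same partner z form one hypothesis.
            for gene_a, gene_b in combinations(sorted(genes), 2):
                found.add((gene_a, gene_b, association, z))
    return found


toy_pairs = [
    ("geneA", "geneZ", "regulation"),
    ("geneB", "geneZ", "regulation"),
    ("geneZ", "geneX", "regulation"),
    ("geneZ", "geneY", "regulation"),
    ("P", "Q", "binding"),
    ("R", "Q", "binding"),
]
hypotheses = indirect_associations(toy_pairs)
```

As the text cautions, such links are hypotheses only; the sketch does nothing to verify coexpression or any other supporting evidence.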

3. Results and Performance Evaluation

In this section, we present the evaluation of the EVEX resource from several points of view. First, we discuss the performance of the event extraction system used to produce the core set of events in EVEX, reviewing a number of published evaluations both within the BioNLP Shared Task and in other domains. Second, we present several evaluations of the methods and data employed specifically in the EVEX resource in addition to the core event predictions: we review existing results as well as present new evaluations of the confidence scores and their correlation with event precision, the family-based generalization algorithms, and the novel event refinement algorithms introduced above. Finally, we discuss two biologically motivated applications of EVEX, demonstrating the usability of EVEX in real-world use cases.

3.1. Core Event Predictions. The Turku Event Extraction System (TEES), the source of the core set of EVEX events, was extensively evaluated on the BioNLP Shared Tasks. It was the winning system of the ST’09, achieving 46.73% recall, 58.48% precision, and 51.95% F-score [9]. In the current study, the original set of event predictions extracted from the PubMed 2009 distribution has been brought up to date using an improved version of TEES. This updated system was recently shown to achieve state-of-the-art results in the ST’11, obtaining 50.06% recall, 59.48% precision, and 54.37% F-score on the corresponding abstract-only GENIA subchallenge [12].

To assess the generalizability of the text mining results from domain-specific datasets to the whole of PubMed, a precision rate of 64% was previously obtained by manual evaluation of 100 random events [19]. In the same study, the named entities (i.e., gene and protein symbols) as extracted by BANNER were estimated to achieve a precision of 87%. These figures indicate that the performance of the various text mining components generalizes well from domain-specific training data to the entire PubMed.

3.2. Confidence Values. To investigate the correlation of the confidence values (Section 2.2) to the correctness of the extracted events, we have measured the precision and recall rates of binding events between two genes, simulating a use case that involves finding related binding partners for a certain query gene (Section 4.1). This experiment was conducted on the ST’09 development set, consisting of 150 PubMed abstracts with 94 gold-standard binding pairs. For this dataset, 67 interacting pairs were found in EVEX, with confidence values ranging between −1.7 and 1.3. When evaluated against the gold-standard data, the whole set of predictions achieves 59.7% precision and 42.6% recall.

Using the confidence values for ranking, we have subsequently applied a cut-off threshold on the results, only keeping predictions with confidence values above the threshold. A systematic screening was performed between the interval of −1.7 and 1.3, using a step-size of 0.05 (60 evaluations). The results have been aggregated and summarized in Figure 2, depicting the average precision and recall values for each aggregated interval of 0.6 length. For example, a cut-off value between 0.10 and 0.70 (fourth interval) would result in an average precision rate of 70.0% and recall of 14.4%. Only taking the top ranked predictions, with a threshold above 0.7 (fifth interval), results in extremely high precision (91.9%) but only 4.8% recall. On the scale of EVEX, however, 4.8% recall would still translate to more than a million high-precision events.

3.3. EVEX Generalizations. As described in Section 2.1.2, the EVEX resource provides several algorithms to generalize gene symbols and their events, providing the opportunity to identify and aggregate equivalent events across various articles, accounting for lexical variants and synonymy. In a first step, a canonical form of the gene symbols is produced, increasing the proportion of symbols that can be matched to gene databases. This algorithm has previously been evaluated on the ST’09 training set, which specifically aims at identifying entities that are likely to match gene and protein symbol databases. By canonicalizing the symbols as predicted by BANNER, an increase of 11 percentage points in F-score was obtained [10].

The family-based generalizations have also been previously evaluated for both HomoloGene and Ensembl definitions. To expand the coverage of these generalizations, in this study, we have added definitions from Ensembl Genomes. The statistics on coverage of gene symbols, brought up to date by including the 2009–2011 abstracts, are depicted in Table 4. While only a small fraction of all unique canonical symbols matches the gene families from HomoloGene or Ensembl (Genomes) (between 3 and 6%), this small fraction accounts for more than half of all occurrences (between 51 and 61%). The family disambiguation algorithm thus discards a long tail of very infrequent canonical symbols. These findings are similar to the previous statistics presented by Van Landeghem et al. [10]. Additionally, the newly introduced families of Ensembl Genomes clearly provide a higher coverage: 8-9 percentage points higher than HomoloGene or Ensembl.

3.4. Event Refinement. By removing the chains of single-argument regulatory events, the refinement process simplifies and greatly reduces the heterogeneity in event structures, facilitating semantic interpretation and search for similar events. This process reduces the number of distinct event structures by more than 60%.

The main purpose of the event refinement algorithm, in combination with the pairwise view of the events, is to increase the coverage of finding related genes for a certain input query gene. When applying the algorithm as detailed in Section 2.3, the number of events with more than one gene symbol as direct argument increases from 1471 K to 1588 K, successfully generating more than a hundred thousand simplified events that can straightforwardly be parsed for pairwise relations.

It has to be noted, however, that the results of the refinement algorithm are merely used as an abstract layer to group similar events together and to offer quick access to relevant information. The original event structures as extracted by TEES are always presented to the user when detailed information is requested, allowing the user to reject or accept the inferences made by the refinement algorithm.

3.5. Biological Applications. The EVEX dataset and the associated web application have recently been applied in a focused study targeting the regulation of NADP(H) expression in E. coli, demonstrating the resource in a

cerevisiae. By sorting the events according to their confidence values. grisea. and yeast cell cycle.1. a signal transduction analysing three high-quality pathway models. and related biomolecular entities of a gene or pair of genes but rather a lack of semantic coverage. searching for a gene symbol or a pair of genes system. the suitability of the EVEX dataset for meiosis and plays a critical role in the maintenance and web application to the task of pathway curation was of genome stability. The main functionality of the EVEX most common reason for a pathway interaction not being resource is providing fast access to relevant information extracted is not a failure of the event extraction pipeline. at least in separated by a comma. gossypii.0 K 3.3 M extracted gene symbols. we present a use case on a specific budding yeast Genomes families: 72% of event occurrences had both of gene. pombe. Mec1. volume of the event data in EVEX and the ability to aggregate pression data. When homolog of the mammalian ATR/ATM. Only 11% of interactions in the evaluated pathways straightforward way to achieve this is through the canonical were not recovered due to a failure of the event extraction generalization.Advances in Bioinformatics 7 100 90 80 70 60 (%) 50 40 30 20 10 0 Very low Low Average High Very high confidence confidence confidence confidence confidence Avg.1 M 52. evaluation results of 50% for binding events and 46% for regulation events. as such a bioinformatics use case is already ST’09 task and thus out of scope for the event extraction covered by the publicly available MySQL database. and N. out of the total number of 40.0% 40. and the regulatory network isolation. This result shows that the recall in EVEX. precision Avg. their arguments assigned to the correct family.3% Ensembl 60. S.) The most system.0% HomoloGene 68. the only event types that allow more than one argument. Distinct symbols Occurrences Canonical 1833.2% 20. Table 4: Gene symbol coverage comparison. 
461 occurrences of several event occurrences into a single generalized event. . As part of this study. In these cases. Mec1 is required In a separate study. A thorough manual evaluation further suggested that. which is conserved in S.7% 21. This increase can very likely be attributed to the extracted from EVEX was integrated with microarray coex. a tradeoff between precision and recall is obtained. (Analysis of large gene lists is currently not interaction corresponds to an event type not defined in the supported.3 M 60. 60% of all interactions could be retrieved from EVEX using the canonical generalization.9 M 51. lactis.3 M 100. crassa. measured against the gold-standard data of the ST’09 development set. This figure does not automatically mean the failure to extract the compares favorably with the BioNLP’09 Shared Task official generalized event.1% real-life biological use case. E. Web Application to be correctly extracted were further evaluated for the To illustrate the functionality and features of the web correctness of the assignment of their arguments to Ensembl application. it is considered to be a analyzed with a particular focus on recall [21]. surprisingly. with precision of 53%. showing the number of distinct canonical symbols as well as the number of different occurrences covered. with encouraging results [20]. Furthermore.8% Ensembl Genomes 100. the of interest. M. the 4. the pathways under evaluation by Ohta et al. recall Figure 2: Evaluation of predicted binding events. mTOR protein [22].2 K 3. is clearly above The Ensembl Genomes generalization was used to allow the recall value published for the event extraction system in for homology-based inference.6 K 5..5% 24. The event occurrences that were judged 4. two-argument events in the NADP(H) regulatory network where the failure to extract an individual event occurrence were manually evaluated. Gene Overview. K.1 K 100. TLR.
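The confidence-based evaluation summarized in Figure 2 (keep only the predictions whose confidence lies above a cut-off, then score them against gold-standard pairs) can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names, the toy gene pairs, and the cut-off range are hypothetical and chosen only for illustration. Pairs are treated as unordered, on the assumption that binding between two genes is symmetric.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

Pair = FrozenSet[str]            # unordered gene pair (assumption: binding is symmetric)
Prediction = Tuple[Pair, float]  # (gene pair, confidence score)


def precision_recall(
    predictions: List[Prediction], gold: Set[Pair], cutoff: float
) -> Tuple[Optional[float], Optional[float]]:
    """Score only the predictions whose confidence lies above the cut-off."""
    kept = {pair for pair, confidence in predictions if confidence > cutoff}
    true_positives = len(kept & gold)
    # Precision is undefined when no prediction survives the cut-off.
    precision = true_positives / len(kept) if kept else None
    recall = true_positives / len(gold) if gold else None
    return precision, recall


def sweep_cutoffs(
    predictions: List[Prediction], gold: Set[Pair],
    start: float, step: float, count: int,
) -> List[Tuple[float, Optional[float], Optional[float]]]:
    """Evaluate `count` evenly spaced cut-offs, beginning at `start`."""
    results = []
    for i in range(count):
        cutoff = round(start + i * step, 10)  # avoid floating-point drift
        precision, recall = precision_recall(predictions, gold, cutoff)
        results.append((cutoff, precision, recall))
    return results


# Hypothetical toy data, not taken from EVEX.
predictions = [
    (frozenset({"geneA", "geneB"}), 0.9),
    (frozenset({"geneA", "geneC"}), 0.2),
    (frozenset({"geneB", "geneD"}), -0.5),
]
gold = {
    frozenset({"geneA", "geneB"}),
    frozenset({"geneB", "geneD"}),
    frozenset({"geneE", "geneF"}),
}

for cutoff, precision, recall in sweep_cutoffs(predictions, gold, -1.0, 0.5, 4):
    print(cutoff, precision, recall)
```

Raising the cut-off discards low-confidence predictions, which tends to increase precision at the cost of recall. Averaging the per-cut-off results over consecutive intervals would give bars comparable in spirit to those of Figure 2.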

When typing the first characters of a gene symbol, a list of candidate matches is proposed, guiding the user to likely gene symbols found in text. The search page then automatically generates a listing of relevant biomolecular events, grouped by event type. At the top of the page, an overview of all regulators, regulated genes, and binding partners is provided, each accompanied with an example sentence. Further, coregulators are listed together with the number of coregulated genes (Section 2.5). Figure 3 shows the results when searching for Mec1. This overview lists 21 regulation targets, 11 regulators, 27 binding partners, and 263 coregulators. Within each category, the events are ranked by confidence, ranging from (very) high to average and (very) low (Section 2.2). Further, example sentences are always chosen to be those associated with the highest confidence score.

Figure 3: Search results for Mec1 on the canonical generalization. An overview of directly associated genes is presented, grouped by event type. In the screenshot, only the box with regulation targets is shown in detail, but the other event types may also be expanded. At the bottom, relevant links to additional sentences and articles are provided.

Selecting the target RAD9, the web application visualises all event structures expressing regulation of RAD9 by Mec1 (Figure 4). This enables a quick overview of the mechanisms through which the regulation is established, which can have a certain polarity (positive/negative) and may involve physical events such as phosphorylation or protein-DNA binding. The different types of event structures are coarsely grouped into categories of similar events and presented from most to least reliable using the confidence scores.

Exploring the relationship between RAD9 and Mec1 further, EVEX enables a search of all events linking these two genes through any direct or indirect association (Figure 5). This page provides conclusive evidence for a binding event between RAD9 and Mec1. Further, both a "Mec1 regulates RAD9" and a "RAD9 regulates Mec1" event are presented. However, inspecting the sentences, the first one is obviously the only correct one. This illustrates the opportunity to use the large-scale event extraction results for pruning false positives of the text mining algorithm, as the false result only has 1 piece of evidence, and with a "very low" confidence, while the correct regulation is supported by 3 different evidence excerpts, two of which are of "high" confidence, and is thus displayed first.

Apart from the regulatory and binding mechanisms, the overview page also lists potential coregulations, enumerating targets that are regulated by both genes, such as Rad53. When accessing the details for this hypothesis, all evidence excerpts supporting both regulations are presented. Other indirect associations, such as common regulators and binding partners, can be retrieved equally fast.

Finally, the overview page of Mec1 (Figure 3) contains additional relevant information including links to sentences stating events of Mec1 without a second argument, grouped by event type. While these events incorporate only a single gene or protein and may not be very informative by themselves, they are highly relevant for information retrieval purposes, finding interesting sentences and articles describing specific processes such as protein catabolism or phosphorylation.

At the bottom of the overview page, a similar and even more general set of sentences can be found, providing pointers to relevant literature while still requiring manual analysis to determine the exact type of information. Such sentences, even though they contain no extracted events, may include useful background information on the gene such as relevant experimental studies, related diseases, or general functions and pathways.

4.2. Homology-Based Inference. In comparative genomics, it is common practice to transfer functional annotations between related organisms for genes sharing sequence similarity [23, 24]. The EVEX resource provides such functionality for inferring interactions and other biomolecular events based on homology, by summarizing all events pertaining
to a certain family when searching for one of its members (Section 2.3). For example, instead of only looking at the information for one particular gene symbol as described previously, we can extend the search through Ensembl Genomes and retrieve information on homologous genes and their synonyms. The generated listings of regulators and binding partners are structured in exactly the same way as before, but this time each symbol refers to a whole gene family rather than just one gene name.

Figure 4: Detailed representation of all evidence supporting the regulation of RAD9 by Mec1. Regulatory mechanisms can have a certain polarity (positive/negative) and may involve physical events such as phosphorylation or protein-DNA binding.

Conducting such a generalized search for Mec1, EVEX retrieves interaction information for Mec1 and its homologs. The resulting page presents not only results for the symbol Mec1, but also for common symbols which are considered synonyms on the gene-family level, such as ATR. This type of synonym expansion goes well beyond a simple keyword query.

For each gene family present in the text mining data, a family profile lists all genes and synonyms for a specific family, linking to the authoritative resources such as Entrez Gene and the Taxonomy database at NCBI. While ESR1 is a known but deprecated synonym of Mec1 [25], it is not considered as a viable synonym of Mec1, considering Esr1 generally refers to the family of estrogen receptors. The synonym disambiguation algorithm of Van Landeghem et al. [10], which is the basis of the gene family generalizations, will thus prevent Esr1 from being used as a synonym for Mec1. Reliable synonyms found in text do however include ATR and SCKL.

The EVEX web application includes several distinct methods of defining gene families (Section 2.3.1), each accommodating for specific organisms and use cases. For example, Ensembl Genomes defines rather coarse grained families resulting in a family of 19 evolutionarily conserved genes, including the budding yeast gene Mec1, its mammalian ATR orthologs, and genes from green algae and Arabidopsis. In contrast, the corresponding family defined by HomoloGene only includes the 6 conserved Mec1 genes in the Ascomycota.

4.3. Manual Inspection of Text Mining Results. An important aspect of the EVEX web application is the ability to retrieve the original sentences and articles for all claims extracted from literature. In the previous sections, we have described how EVEX can assist in the retrieval of directly and indirectly associated genes and proteins by generating summary overviews. However, to be applicable in real-life use cases and to be valuable to a domain expert, it is necessary to distinguish trustworthy predictions from unreliable hypotheses. For this reason, automatically generated confidence values are displayed for each extracted interaction, ranging from (very) high to average and (very) low. On top of those, the site always provides the opportunity to inspect the textual evidence in detail.

Consider, for example, the phosphorylation of RAD9, regulated by Mec1 (Figure 4). To allow a detailed inspection of this event, the web application integrates the stav visualiser [26], which was developed as a supporting resource for the ST'11 [2] (Figure 6). This open-source tool provides a detailed and easily graspable presentation of the event structures and the associated textual spans. To any user interested in the text mining details, this visualization provides valuable insights into the automated event extraction process. Additionally, the web application provides the opportunity to visualise whole PubMed abstracts with the stav visualiser, allowing a fast overview of event information contained within an abstract.

4.4. Site Navigation. To easily trace back previously found results, a session-based search history at the right-hand side of the screen provides links to the latest searches issued on the site. Further, a box with related searches suggests relevant queries related to the current page. Finally, the web application provides a powerful method to browse indirectly associated information, by allowing the retrieval of nested and parent interactions of a specific event. For example, when accessing the details of Mec1's regulation of RAD9 phosphorylation and selecting the phosphorylation event,
evidence is shown for many parent events involving different regulation polarities and various genes causing this specific phosphorylation. As such, we quickly learn that RAD9 phosphorylation has many different potential regulators, such as Ad5, Ad12, and C-Abl. This sort of explorative information retrieval and cross-article discovery is exactly the type of usage aimed at by the EVEX resource.

Figure 5: All events linking Mec1 and RAD9 through either direct or indirect associations. In the screenshot, only the regulation boxes are shown in detail, but the other event types may also be expanded. This page enables a quick overview of the mechanisms through which two genes interact, while at the same time highlighting false positive text mining results which can be identified by comparing confidence values and the evidence found in the sentences.

5. Conclusions and Future Work

This paper presents a publicly available web application providing access to over 21 million detailed events among more than 40 million identified gene/protein symbols in nearly 6 million PubMed titles and abstracts. This dataset is the result of processing the entire collection of PubMed titles and abstracts through a state-of-the-art event extraction system and is regularly updated as new citations are added to PubMed. The extracted events provide a detailed representation of the textual statements, allowing for recursively nested events and different event types ranging from phosphorylation to catabolism and regulation. The EVEX web application is the first publicly released resource that provides intuitive access to these detailed event-based text mining results.

As the application mainly targets manual explorative browsing for supporting research in the life sciences, several steps are taken to allow for efficient querying of the large-scale event dataset. First, events are assigned confidence scores and ranked according to their reliability. Further, the events are refined to unify different event structures that have a nearly identical interpretation. Additionally, the events are aggregated across articles, accounting for lexical variation and generalizing gene symbols with respect to their gene family. This aggregation allows for efficient access to relevant information across articles and species. Finally, the EVEX web application groups events with respect to the involvement of pairs of genes, providing the users with a familiar gene-centric point of view, without sacrificing the expressiveness of the events. This interpretation is extended also to combinations of events, identifying indirect associations such as common coregulators and common binding partners, as a form of literature-based hypothesis generation.

There are a number of future directions that can be followed in order to extend and further improve the EVEX web application. The core set of events can be expanded by also processing all full-text articles from the open-access section of PubMed Central. Further, as BioNLP methods keep evolving towards more detailed and accurate predictions, the dataset can be enriched with new information, for example, by including epigenetics data as recently introduced by the BioNLP'11 Shared Task [2, 27] and integrating noncausal entity relations [28, 29]. Additionally, gene normalization data can be incorporated, enabling queries using specific gene or protein identifiers [30]. Finally, a web service may be developed to allow programmatic access to the EVEX web application, allowing bulk queries and result export for further postprocessing in various bioinformatics applications.

Acknowledgments

S. Van Landeghem would like to thank the Research Foundation Flanders (FWO) for funding her research and a travel grant to Turku. Y. Van de Peer wants to acknowledge support from Ghent University (Multidisciplinary Research Partnership Bioinformatics: from nucleotides to networks) and the Interuniversity Attraction Poles Programme (IUAP P6/25), initiated by the Belgian State, Science Policy Office (BioMaGNet). This work was partly funded by the Academy

S. Björne. Pyysalo. Ohta. pp.-D. 1968–1970. vol. H. Ohta. [24] S. pp. “Generalizing biomedical by CSC-IT Center for Science Ltd. T. Ginter. 23. Association no. “BANNER: an executable tion in Saccharomyces cerevisiae. 2010. N. 2011. Article ID btq180. M. Björne. A. Sayers.” Bioinformatics. References 39. pp. 2009. “Module networks: engine and GUI-based efficient MEDLINE search tool based identifying regulatory modules and their condition-specific on deep syntactic parsing. [10] S. and J. Genes and gene products (“GGPs”) are marked. [4] T. genomics resource to study gene and genome evolution in eralization of text mining predictions. “Scal. 2009. Hearst. 2196–2197. 2. Gaudan. 2011.. pp. “Complex event extraction at PubMed scale.” Nature Genetics. 2011. Sterck et al. vol. supplement 1. Tsujii. “A gene network for navigating rithms for multiclass problems.-D.” Nucleic Acids Research. Björne. 15. F. arrows denote the roles of each argument in the event (e. 38. Association for Computational Research. D. Birney et al. C. J.” 22. 951–991. S. vol. 17–20. Salakoski. 2007. no. Rebholz-Schuhmann. J. Carballo and R. A. article 207. a bud- [9] J. 105–113. Biology. Regev et al. no.. Barrett. 3104–3112.” BMC Bioinformatics. Raimondo. M. 34. Kirsch.. 1–6. 2009. and M. “From pathways to biomolec- Bioinformatics. vol. Article ID 420. [22] J. and T.” in . no. Ohta. Leaman and G.” in Proceedings of the 3rd pp. Tsujii. pp. Benson et al. (YIF): a new search engine for retrieving biomedical images.” Research. Pyysalo. facts for proteins from Medline. Redfern et al. 24. 2011. vol. is required 37. and J. pp. Kato and H. Pacific Symposium on Biocomputing. vol. S. “BioText search and co-expression networks: targeting NADP(H) metabolism engine: beyond abstract search. no. “An intelligent search [18] E. 2009. Jones. “Overview resources of the National Center for Biotechnology Informa- of BioNLP’09 shared task on event extraction. R. Crammer and Y. and J. and T. ESR1. D16.. Loewenstein. 2012. 
no. 28–36. J. “Ultraconservative online algo- [3] R. 539–550. pp.. J. Ohta. Valencia. Van Landeghem. Kersey. Shapira. 17. 5.” Genome Linguistics. Proost. [13] The UniProt Consortium. Association for Computational Linguistics. Pacific Symposium on [26] P. “PLAZA: a comparative “EVEX: a PubMed-scale resource for homology-based gen. for Computational Linguistics. 2007. [15] P. D. Ohta. pp.” [16] P.” Nucleic Acids Research. pp. J. Nguyen. Van de Peer. 39.” Nucleic Acids Research. T. “Ensembl genomes: in Proceedings of the BioNLP Workshop Companion Volume extending ensembl across the taxonomic space. Divoli. H. Cha. 23. ics. M. T. H.. 12. no. Kreula. Agarwal. 2003. 2008. Barrell et al. 12.Advances in Bioinformatics 11 Our data suggest that Dpb11 is held in proximity to damaged DNA through an interaction with the Cause Positive regulation Theme Theme GGP Phosphorylation GGP phosphorylated 9-1-1 complex. 2009. and J. supplement 8. F. Riethoven. “BioNLP Shared Task 2011: supporting resources.” in Proceedings of [8] S. pp. 4-5. M. 15. 16. Gonzalez. Pyysalo. and F. Y. “EBIMed—text crunching to gather vol. and P. A. database of biomedical negated sentences. 2004. F. Kim. “Overview of BioNLP shared task 2011. Topić. no. Tsujii. Krauthammer. and T. “An essential gene. S. 1994.” in Proceedings tion. D563–D569. pp. vol.” [2] J. 2.. “Ongoing and future developments at the universal protein resource. Xu. 2011.-D.” in Proceedings of the COLING/ACL regulators from gene expression data. 28– [25] R. in E. Pyysalo. 21. ding yeast homolog of mammalian ATR/ATM. Workshop on Building and Evaluating Resources for Biomedical [7] S. Tsujii. Y. vol.” Nature Genetics. Research. D800–D806. Tsujii. Ginter.” Journal of Machine Learning the literature. Finland. Association for Computational Linguistics. Finland and the event extraction. of Finland. vol. E. R.. no. Pyysalo. W. Yu. Kano. 38. T:Phosphorylation(T:RAD9)). T. Department of IT. N. Salakoski. Stenetorp. pp. 36. 
2003.” Nucleic Acids for Shared Task. This visualization corresponds to the formal bracketed format of the event: Positive-regulation(C: Mec1. pp. [5] D. Salakoski.. vol. Ogawa. University of Turku. [19] J. Ginter. pp. pp. Kim. and I. Association for Computational function annotation by homology-based inference. in Proceedings of the BioNLP Workshop Companion Volume [23] Y.” Bioinformatics. Y. and the computational resources were provided [12] J.” Chromosome ing up biomedical event extraction to the entire PubMed. Pyysalo. [14] E. [20] S. Espoo. Singer. Arregui. Tsujii. T.” Plant Cell. leading to Mecl-dependent phosphorylation of Rad9 Figure 6: Visualization of a specific event occurrence by the stav text annotation visualiser. S. ular events: opportunities and challenges. Association for Computational Linguistics. H. “Protein for Shared Task. pp. S. coli with event extraction. D. Y. 10. supplement 1. Amode. for mitotic cell growth. M. 3718–3731. supplement 1. Kaewphan. Kim. pp. no. Salakoski. McCusker. Ninomiya et al. Van Landeghem. “Meiotic roles of Mec1. Flicek. 2. 3.” in Proceedings of the plants.” [21] T. 166–176. DNA repair and meiotic recombina- [11] R. “Ensembl 2011. no. Ginter. D. Biocomputing. 1. “Database [1] J. Linguistics. vol. Theme or Cause). pp. e237–e244. vol. as well as the trigger words that refer to specific event types. Homann and A. Bossy. Kohane. D5– of the BioNLP Workshop Companion Volume for Shared Task. Guturu et al. A. Lawson. D214–D219. Nguyen. “Integrating large-scale text mining [6] M. Van de Peer. 12. Van Bel. “BioNOT: a searchable the BioNLP Workshop Companion Volume for Shared Task. and T. L. Ginter. 13. O. vol. “Yale Image Finder Text Mining (BioTxtM ’12). 1–9. article S4. BioNLP Workshop Companion Volume for Shared Task. S.” BMC Bioinformat.g. 2011. G. [17] K. 2006. vol. i382–i390. 2007. 652–663. Nucleic Acids Research. 7. 2006 Interactive Presentation Sessions. Stoehr. pp. no. P. Segal. S. A.” Bioinformatics. 2010. 26. vol. S. 
Finally. aricle 664. Miyao. F. 2011. 2012. S. survey of advances in biomedical named entity recognition. vol.

Proceedings of the BioNLP Workshop Companion Volume for Shared Task, pp. 112–120, Association for Computational Linguistics, Portland, Oregon, USA, 2011.
[27] J. Björne and T. Salakoski, “Generalizing biomedical event extraction,” in Proceedings of the BioNLP Workshop Companion Volume for Shared Task, pp. 183–191, Association for Computational Linguistics, 2011.
[28] S. Pyysalo, T. Ohta, and J. Tsujii, “Overview of the entity relations (REL) supporting task of BioNLP Shared Task 2011,” in Proceedings of the BioNLP Workshop Companion Volume for Shared Task, pp. 83–88, Association for Computational Linguistics, 2011.
[29] S. Van Landeghem, J. Björne, T. Abeel, B. De Baets, T. Salakoski, and Y. Van de Peer, “Semantically linking molecular entities in literature through entity relationships,” BMC Bioinformatics, vol. 13, supplement 8, article S6, 2012.
[30] Z. Lu, H. Y. Kao, C. H. Wei et al., “The gene normalization task in BioCreative III,” BMC Bioinformatics, vol. 12, supplement 8, article S2, 2011.