
Natural Language Processing in Biology
Jeffrey Chang, Russ Altman
BMI 214

Literature in Biomedicine
• Much literature generated quickly.
  – 11 million citations in MEDLINE.
  – 400,000 added yearly.
• Need methods to deal with data.
  – Query
  – Summarize
  – Organize
  – Understand

PubMed

PubMed Central

Two General Approaches
1. Statistical Natural Language Processing
• Look at documents as a collection of words.
• Base analysis on the statistics of word occurrences and neighbors.
• Do not try to understand all sentence details.
2. Grammar-based, parsing techniques
• Look at structure of sentences (or more).
• Identify parts-of-speech (POS).
• Develop deep model of what is said.
In biology, mostly statistical methods have been applied so far, but a fusion of the two may be best…

Definitions
• Corpus (C, with N documents)
  Collection of documents.
• Term Frequency (tf)
  Number of times a word appears in a document.
• Document Frequency (df)
  Number of documents a word appears in.
• Collection Frequency (cf)
  Total number of times a word appears in a corpus.


Documents as Vectors
”Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments.”

Comparing Two Documents

acid  amino  analysis  comparison  control  environments  […]  our
2     2      1         1           1        3             […]  1

A document is summarized as a vector of word counts. Each dimension contains the number of times a word appears.

(Manning & Schuetze)
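The word-count vector for the quoted sentence can be built in a few lines. This is a minimal sketch, not the lecture's own code: the function name `count_vector` and the tokenization rule (lowercase, alphabetic runs only) are assumptions made for illustration.

```python
import re
from collections import Counter

def count_vector(text):
    """Return a word -> count mapping for one document."""
    # Assumption: lowercase the text and treat each alphabetic run as a word.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

sentence = ("Our analysis includes comparison of amino acid environments "
            "with random control environments as well as with each of the "
            "other amino acid environments.")
vector = count_vector(sentence)

# Print the dimensions shown on the slide.
for word in ["acid", "amino", "analysis", "comparison", "control",
             "environments", "our"]:
    print(word, vector[word])
```

A `Counter` is a sparse vector: words that never occur simply have count 0, so two documents can be compared without fixing the vocabulary in advance.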

Vector Cosine
Cosine of angle between two vectors.

Weighting the “important” words
Use Term Frequency and Inverse Document Frequency:

$[1 + \log(tf_{t,d})] \cdot \log(N / df_t)$

tf = # of times word in document
df = # of documents word is in
Fewer documents, more weight.

acid      $(1 + \log 2) \cdot IDF_{acid}$
amino     $(1 + \log 2) \cdot IDF_{amino}$
analysis  $(1 + \log 1) \cdot IDF_{analysis}$
[…]
our       $(1 + \log 1) \cdot IDF_{our}$
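The weighting and the cosine comparison can be combined in a short sketch. The three toy documents, the function names `tfidf` and `cosine`, and the choice of the natural logarithm are assumptions; the slides do not fix a log base.

```python
import math
from collections import Counter

def tfidf(doc_counts, df, n_docs):
    """Weight each term by [1 + log(tf)] * log(N / df)."""
    return {t: (1 + math.log(tf)) * math.log(n_docs / df[t])
            for t, tf in doc_counts.items() if tf > 0}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus, made up for illustration.
docs = [
    "amino acid environments in protein structures".split(),
    "amino acid substitution in protein kinase".split(),
    "gene expression in yeast".split(),
]
counts = [Counter(d) for d in docs]
df = Counter(t for c in counts for t in c)   # document frequency per term
vectors = [tfidf(c, df, len(docs)) for c in counts]

print(cosine(vectors[0], vectors[1]))  # documents sharing rare words
print(cosine(vectors[0], vectors[2]))  # documents sharing only "in"
```

Because "in" occurs in every document, its weight is log(N/N) = 0, so it contributes nothing to the similarity.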

Stemming
• Want to group together different variations of the same word.
  – Dehydrogenase vs. dehydrogenases
  – Activate vs. activated vs. activating
• Morphological stemmers require a lexicon.
  – Hard to compile for biomedical domain.

Suffix Stemming Algorithm
“Two words are considered to have the same stem if they have the same beginnings and their endings differ in one or two characters.” (Andrade 1998)
“kinase-” and “kinase-s”
“transcript-s” and “transcript-ed”
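One possible reading of the suffix rule quoted from Andrade (1998) is sketched below. The function name `same_stem`, the minimum shared-prefix length, and the interpretation of "endings differ in one or two characters" as "each word has at most two characters left after the common beginning" are assumptions, not the published implementation.

```python
def same_stem(a, b, max_suffix=2, min_stem=3):
    """Return True if a and b share a beginning and differ only in short endings."""
    a, b = a.lower(), b.lower()
    # Length of the common beginning of the two words.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    # Assumption: require a minimum shared stem so short words do not collide.
    if n < min_stem:
        return False
    # Each leftover ending may be at most max_suffix characters long.
    return len(a) - n <= max_suffix and len(b) - n <= max_suffix

print(same_stem("kinase", "kinases"))
print(same_stem("transcripts", "transcripted"))
print(same_stem("kinase", "kinetic"))
```

Under this reading, pairs whose endings are longer than two characters (for example "activate" and "activating") are not grouped, which is one reason rule-based stemmers such as Porter's are often preferred.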


Porter (Rule-based) Stemming
http://www.tartarus.org/~martin/PorterStemmer/

73 rules
organization -> organ (Krovetz 93)

    static RuleList step1a_rules[] = {
        101, "sses", "ss",   3,  1, -1, NULL,
        102, "ies",  "i",    2,  0, -1, NULL,
        103, "ss",   "ss",   1,  1, -1, NULL,
        104, "s",    LAMBDA, 0, -1, -1, NULL,
        000, NULL,   NULL,   0,  0,  0, NULL,
    };

Stopwords
Many of the words in the corpus contribute little to the meaning.
and, an, by, from, of, the, with (Hersh)
(Can be specific to a corpus.)

So how many words do we need to use?
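Stopword removal is a simple filter over the token list. The sketch below uses the seven stopwords listed on the slide; the function name `remove_stopwords` and the example tokens are illustrative assumptions.

```python
# Stopword list taken from the slide (Hersh); real lists are longer
# and can be tuned to a specific corpus.
STOPWORDS = {"and", "an", "by", "from", "of", "the", "with"}

def remove_stopwords(words, stopwords=STOPWORDS):
    """Drop words that carry little meaning, ignoring case."""
    return [w for w in words if w.lower() not in stopwords]

tokens = ("Comparison of the amino acid environments "
          "with random control environments").split()
content = remove_stopwords(tokens)
print(content)
```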

Porter Stemmer Example
Step 1b
(m>0) EED -> EE     feed -> feed
                    agreed -> agree
(*v*) ED ->         plastered -> plaster
                    bled -> bled
(*v*) ING ->        motoring -> motor
                    sing -> sing

If the second or third of the rules in Step 1b is successful, the following is done:
AT -> ATE           conflat(ed) -> conflate
BL -> BLE           troubl(ed) -> trouble
IZ -> IZE           siz(ed) -> size

SWISS-PROT
Release 37, Dec 98
77,977 sequences
59,835 references
64Mb of text
110,081 unique words
http://www.expasy.ch/sprot/sprot-top.html

SWISSPROT Record

ID   KPEL_DROME STANDARD; PRT; 501 AA.
AC   Q05652;
DT   01-OCT-1994 (Rel. 30, Created)
DT   01-OCT-1994 (Rel. 30, Last sequence update)
DT   30-MAY-2000 (Rel. 39, Last annotation update)
DE   PROBABLE SERINE/THREONINE-PROTEIN KINASE PELLE (EC 2.7.1.37).
GN   PLL.
OS   Drosophila melanogaster (Fruit fly).
OC   Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta;
RN   [1]
RP   SEQUENCE FROM N.A., AND MUTAGENESIS.
RX   MEDLINE; 93177834.
RA   Shelton C.A., Wasserman S.A.;
RT   "Pelle encodes a protein kinase required to establish dorsoventral
RT   polarity in the Drosophila embryo.";
RL   Cell 72:515-525(1993).
CC   -!- FUNCTION: REQUIRED FOR THE NUCLEAR IMPORT OF THE DORSAL PROTEIN
CC       WHICH ESTABLISHES DORSOVENTRAL POLARITY IN DROSOPHILA EMBRYOS.
CC   -!- CATALYTIC ACTIVITY: ATP + A PROTEIN = ADP + A PHOSPHOPROTEIN.
CC   -!- DEVELOPMENTAL STAGE: EXPRESSED THROUGHOUT THE LIFE CYCLE WITH
CC       HIGHEST LEVELS IN 0-3 HOUR-OLD EMBRYOS AND ADULT FEMALES.
DR   FLYBASE; FBgn0010441; pll.
DR   INTERPRO; IPR002290; -.
DR   PROSITE; PS00107; PROTEIN_KINASE_ATP; 1.
KW   Transferase; Serine/threonine-protein kinase; ATP-binding.
FT   DOMAIN 213 499 PROTEIN KINASE.
FT   BINDING 240 240 ATP.
SQ   SEQUENCE 501 AA; 56160 MW; 4B29E2B40ACB81A8 CRC64;
     MSGVQTAEAE AQAQNQANGN RTRSRSHLDN TMAIRLLPLP VRAQLCAHLD ALDVWQQLAT
     AVKLYPDQVE QISSQKQRGR SASNEFLNIW GGQYNHTVQT LFALFKKLKL HNAMRLIKDY
     RQVTDRVPEN ETKKNLLDYV KQQWRQNRME LLEKHLAAPM GKELDMCMCA IEAGLHCTAL
     DPQDRPSMNA VLKRFEPFVT D
//

Word Frequency in SP37


Zipf's Law
Empirical observation of the pattern of usage frequencies of words.

$CF \times R = K$

– CF - Collection Frequency
– R - Rank
– K - constant
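Zipf's relation can be checked on any word list by ranking words by collection frequency and multiplying. The sketch below is illustrative only: `zipf_table` is a hypothetical helper, and the toy word list is constructed so that the product is exactly constant, which real corpora only approximate.

```python
from collections import Counter

def zipf_table(words):
    """Return (rank, word, cf, cf * rank) rows, most frequent word first."""
    cf = Counter(words)
    rows = []
    for rank, (word, count) in enumerate(cf.most_common(), start=1):
        rows.append((rank, word, count, count * rank))
    return rows

# Toy corpus built so that CF * R = 12 for every word.
words = ["the"] * 12 + ["of"] * 6 + ["kinase"] * 4 + ["protein"] * 3
rows = zipf_table(words)
for rank, word, cf, k in rows:
    print(rank, word, cf, k)
```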

Zipf for SP37

Information Summarization

Clustering Microarray Papers
• Use standard clustering algorithms.
• Documents are vectors of words.

(Altman & Raychaudhuri)

TextQuest: Concept Discovery
Cluster documents to discover broad themes in a corpus. Find words that describe each cluster:

Score = $\log(f_{ij} / f_i)$
– $f_{ij}$ - frequency of word i in cluster j
– $f_i$ - frequency of word i in corpus

K-Means clustering
1. Choose K documents as “cluster centers”.
2. Assign all documents to the nearest cluster.
3. Recalculate new cluster centers.
4. Repeat 2-3 until clusters do not change.
* Documents are vectors of word counts.
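The four K-means steps can be sketched in plain Python. Assumptions in this sketch: the first K documents serve as the initial centers (the slide only says to choose K documents), distance is squared Euclidean, and the two-dimensional count vectors are made up.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, max_iter=100):
    # Step 1: choose K documents as cluster centers (assumption: the first K).
    centers = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every document to the nearest center.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(v, centers[j]))
            for v in vectors
        ]
        # Step 4: stop once the clusters no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recalculate each center as the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Toy word-count vectors over a two-word vocabulary.
vectors = [[3, 0], [0, 3], [4, 1], [1, 4], [3, 1], [0, 4]]
assignment, centers = kmeans(vectors, k=2)
print(assignment)
```

The result depends on the initial centers, so in practice the procedure is usually run from several starting points.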

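Once documents are clustered, the TextQuest word score log(f_ij / f_i) picks out words that are over-represented in a cluster. The slide does not define "frequency" precisely; this sketch assumes relative frequency (word count divided by total word count), and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def cluster_word_scores(cluster_docs, corpus_docs):
    """Score each word in a cluster by log(f_ij / f_i)."""
    cluster_counts = Counter(w for d in cluster_docs for w in d)
    corpus_counts = Counter(w for d in corpus_docs for w in d)
    cluster_total = sum(cluster_counts.values())
    corpus_total = sum(corpus_counts.values())
    scores = {}
    for word, count in cluster_counts.items():
        f_ij = count / cluster_total                # frequency in cluster j
        f_i = corpus_counts[word] / corpus_total    # frequency in corpus
        scores[word] = math.log(f_ij / f_i)
    return scores

# Toy corpus: the first two documents form the cluster of interest.
corpus = [
    ["dorsal", "embryo", "the"],
    ["ventral", "embryo", "the"],
    ["yeast", "cycle", "the"],
    ["yeast", "kinase", "the"],
]
scores = cluster_word_scores(corpus[:2], corpus)
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(word, round(score, 3))
```

Words spread evenly over the corpus, such as "the", score 0, while words concentrated in the cluster score above 0.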

Cluster of Drosophila Development
PubMed queries: anterior-posterior, dorsal-ventral
– Dorsoventral axis specification
– Egg chamber / oocyte patterning
– Segmentation and embryonic patterning

(Iliopoulos 2001)

How Do We Summarize Protein Families?
• Use protein families from FSSP database.
• Get articles for each family from SWISS-PROT.

Quantities used:
– frequency of word a in family i
– average frequency of word a
– sequences in family i with word a
– sequences in family i
– families with word a
– 1, if family has word a; 0, if family does not have word a

Describing Protein Families
Appears in few families very frequently
tokenization artifact

(Andrade 1998)

Data Collection

What kinds of data to collect?
• Genes and gene products.
• Protein localization.
• Disease associated with proteins.
• Protein-protein interactions.
• Pathways.

Database


Collecting Data with Information Extraction (IE)
Find specific facts from free text.
• Entities
• Relations

(Figure label: Information Extraction)

Relations from IE
LOCALIZED_TO(CYP3A4, LIVER)
HAS_VARIABILITY(CYP3A4)
AFFECTS(CYP3A4, INDINAVIR)

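Diagram of an IE System
(Figure components: Pre-Processing, Rules, Extraction, App)

Pre-Processing, illustrated on “This system synthesizes fibroblast growth factor.”:
– TOKENIZATION: This system synthesizes fibroblast growth factor .
– STEMMING: synthesizes -> synthes
– POS TAGGING: DT NN VBZ NN NN NN
– PARSING: NP V NP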

Rules for IE
• Information Extraction systems are typically rule-based.
• IF <pre-conditions> THEN <action>
• Rules are typically developed manually by domain experts.

Examples of IE Rules
Role:
<NP> receptor -> <protein> receptor

Relations:
<protein> activates <protein>
<finding> in <bodyloc> <conj> <bodyloc>
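Rules of this IF/THEN form can be applied to tokens that pre-processing has already tagged. The sketch below is a simplified illustration: it assumes every noun phrase is a single token, and the function names, tag strings, and the made-up names "ProtA" and "ProtB" are hypothetical.

```python
def apply_role_rule(tokens, tags):
    """Role rule: <NP> receptor -> <protein> receptor."""
    tags = list(tags)
    for i in range(len(tokens) - 1):
        # IF a noun phrase is followed by "receptor" THEN label it a protein.
        if tags[i] == "NP" and tokens[i + 1].lower() == "receptor":
            tags[i] = "protein"
    return tags

def apply_activates_rule(tokens, tags):
    """Relation rule: <protein> activates <protein>."""
    relations = []
    for i in range(1, len(tokens) - 1):
        # IF "activates" sits between two proteins THEN emit a relation.
        if (tokens[i].lower() == "activates"
                and tags[i - 1] == "protein"
                and tags[i + 1] == "protein"):
            relations.append(("ACTIVATES", tokens[i - 1], tokens[i + 1]))
    return relations

role_tags = apply_role_rule(["the", "ProtA", "receptor"], ["DT", "NP", "NN"])
relations = apply_activates_rule(["ProtA", "activates", "ProtB"],
                                 ["protein", "V", "protein"])
print(role_tags)
print(relations)
```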


Protein-Protein Interactions in Drosophila Cell Cycle
• Look for pattern in MEDLINE abstracts: protein A -- action -- protein B
• Protein names specified by user
• 14 possible actions: acetylat-, activat-, associated with, bind, destabiliz-, inhibit, interact, is conjugated to, modulat-, phosphorylat-, regulat-, stabiliz-, suppress, target
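The "protein A, action, protein B" pattern can be approximated with a regular expression over a sentence. This is a rough sketch in the spirit of Blaschke (1999), not that system's implementation: the regular expression, the function name `find_interactions`, and the made-up protein names are assumptions. The action stems are the 14 listed on the slide.

```python
import re

# The 14 action stems from the slide; a stem may be followed by any ending.
ACTION_STEMS = ["acetylat", "activat", "associated with", "bind", "destabiliz",
                "inhibit", "interact", "is conjugated to", "modulat",
                "phosphorylat", "regulat", "stabiliz", "suppress", "target"]

def find_interactions(sentence, proteins):
    """Return (protein A, action stem, protein B) triples found in a sentence."""
    # Longer names first so that a short name does not shadow a longer one.
    names = "|".join(re.escape(p)
                     for p in sorted(proteins, key=len, reverse=True))
    stems = "|".join(re.escape(s) for s in ACTION_STEMS)
    pattern = re.compile(
        rf"\b({names})\b.*?\b({stems})\w*\b.*?\b({names})\b",
        re.IGNORECASE,
    )
    return [m.groups() for m in pattern.finditer(sentence)]

proteins = ["protA", "protB"]  # user-specified names (made up here)
hits = find_interactions("We show that protA phosphorylates protB in vitro.",
                         proteins)
print(hits)
```

A pattern this simple ignores negation and sentence structure, which is one source of the noise that such extractions need to be checked for.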

Interactions Found

(Blaschke 1999)

Protein Names
Protein names come in many forms:
• Single word with mixed case or numbers. e.g. Nef, p53
• Compound word. e.g. interleukin 1-responsive kinase
• Single word all lowercase. e.g. actin, insulin
(Fukuda 1997)

Recognizing Protein Names
• Finding “core terms” (candidate protein names)
  – Capital letters and numbers
  – P54 SAP kinase
• Identifying “f-terms” (high frequency associations)
  – EGF receptor
  – Ras GTPase-activating protein
(Fukuda 1997)

Core-Terms for protein names
• Include words with upper case, numerical figures, and/or special symbols.
• No lower case words longer than 9 characters with "-". (full-length)
• No words with more than half special symbols. (+/-)
• No units. (aa, AA, fold, bp)
• Ignore literature references.
(Fukuda 1997)
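The core-term filters listed above translate into a small predicate. This is a sketch of the listed surface clues only, not the published method of Fukuda (1997): the function name `is_core_term` is hypothetical, "special symbol" is taken to mean any non-alphanumeric character, and the rule about literature references is omitted.

```python
UNITS = {"aa", "AA", "fold", "bp"}  # units named on the slide

def is_core_term(word):
    """Return True if a word looks like a candidate protein-name core term."""
    if word in UNITS:
        return False
    specials = sum(1 for c in word if not c.isalnum())
    # No words with more than half special symbols, e.g. "+/-".
    if specials * 2 > len(word):
        return False
    # No lower case words longer than 9 characters with "-", e.g. "full-length".
    if word.islower() and "-" in word and len(word) > 9:
        return False
    has_upper = any(c.isupper() for c in word)
    has_digit = any(c.isdigit() for c in word)
    # Include words with upper case, numerical figures, and/or special symbols.
    return has_upper or has_digit or specials > 0

for w in ["p53", "Nef", "SH3", "kinase", "full-length", "+/-", "bp"]:
    print(w, is_core_term(w))
```

Sentence-initial words such as "The" would also pass this test, so a practical system would need further filtering.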

Concatenate Core- and F- Terms
Look at surface clues
• Connect adjacent terms. e.g. Src SH3 domain
• Include parentheses.
• Connect words if nouns, adjectives, or numbers inside. e.g. Ras guanine nucleotide exchange factor Sos
• Extend left to a determiner. e.g. the focal adhesion kinase
• Extend right if there is a single upper case letter or greek word. e.g. p85 alpha
(Fukuda 1997)

Use a POS tagger


Computing Biologically

Application: How to find sequence homologies?
PSI-BLAST
• Iterative BLAST
• More sensitive, but subject to "profile drift"

(Figure: PSI-BLAST cycle with boxes Sequence Profile, Search Database, Multiple Alignment, Construct Profile, and Sequence Database)

Augment with Literature
(Figure: the PSI-BLAST cycle with an added Examine Literature box; other boxes are Sequence Profile, Search Database, Multiple Alignment, Construct Profile, and Sequence Database)

Using Text Increases Precision
(Figure: interpolated precision, 0.8 to 1, versus recall, 0 to 0.4, for PSI-BLAST and for 5%, 10%, and 20% text cutoffs)

precision = correct hits / all hits
recall = correct hits / total correct answers
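The two definitions can be computed directly from the set of returned hits and the set of correct answers. The function name and the example identifiers below are made up for illustration.

```python
def precision_recall(hits, correct):
    """Return (precision, recall) for a list of hits and the correct answers."""
    hits, correct = set(hits), set(correct)
    correct_hits = len(hits & correct)
    precision = correct_hits / len(hits) if hits else 0.0
    recall = correct_hits / len(correct) if correct else 0.0
    return precision, recall

# Four sequences returned, two of them among the five true homologs.
precision, recall = precision_recall(["a", "b", "c", "d"],
                                     ["a", "b", "e", "f", "g"])
print(precision, recall)
```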

Application : Assigning GO codes to genes using literature
Problem
INPUT:
1. Controlled terminology of gene function: the Gene Ontology (GO)
2. Literature associated with a set of genes: SGD (yeast genome database)
OUTPUT: Algorithm to assign codes to genes

Genome Research 12(1), p 203-214


Method
Focus on 21 high level GO process terms.
Standard Maximum Entropy classifier compared with:
• Naïve Bayes
• Nearest Neighbor
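To make the document classification task concrete, here is a minimal multinomial Naïve Bayes classifier, one of the comparison methods named above. It is a sketch, not the classifier used in the study: the add-one smoothing, the function names, and the four toy training documents are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (words, label) pairs. Returns the model counts."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def classify(model, words):
    """Return the label with the highest log posterior for the words."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_docs)   # log prior
        for w in words:
            if w in vocab:
                # Assumption: add-one (Laplace) smoothing for unseen words.
                score += math.log((word_counts[label][w] + 1)
                                  / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set, made up for illustration.
training = [
    ("glucose metabolism enzyme".split(), "metabolism"),
    ("mitosis spindle checkpoint".split(), "cell_cycle"),
    ("enzyme pathway glucose".split(), "metabolism"),
    ("cyclin checkpoint mitosis".split(), "cell_cycle"),
]
model = train_naive_bayes(training)
print(classify(model, ["glucose", "enzyme"]))
print(classify(model, ["spindle", "cyclin"]))
```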

(Figure: precision versus recall, both from 0 to 1, in two panels. Panel 1: metabolism, cell_cycle, meiosis, intracellular_protein_traffic. Panel 2: signal_transduction, cell_fusion, biogenesis, transport, ion_homeostasis.)

Document classification into GO codes


Application: Assessing Functional Coherence of groups of genes
Soumya Raychaudhuri, Hinrich Schuetze, Russ Altman

PROBLEM
Grouping genes together is a common activity. When a group of genes is produced from a novel technology (such as microarrays), how can we assess the significance of this grouping?

Gene Clusters from clustering of yeast genes based on expression patterns under a variety of conditions

Manual labeling
– Spindle pole formation
– Proteasome
– mRNA splicing
– Glycolysis
– Mitochondrial ribosome
– ATP synthesis
– Chromatin structure
– Ribosome/translation
– DNA replication
– TCA cycle

Semantically similar articles refer to related genes

“Neighbor Divergence” score (comparison of observed/expected coreferences)

Count of neighbors referring to group genes


Scores of real clusters vs. random clusters

Assess alternative metrics….

Degradation of performance by adding “noise” genes

Gene Clusters
– Spindle pole formation
– Proteasome
– mRNA splicing
– Glycolysis
– Mitochondrial ribosome
– ATP synthesis
– Chromatin structure
– Ribosome/translation
– DNA replication
– TCA cycle

Analysis of Eisen Clusters


Conclusions
Can use literature to assess the functional coherence of groups of genes.
Can distinguish “real” groups from random.
Can identify coherence in manual groups.
Some biases in method still need to be removed (e.g. large groups favored, can fuse two strong, but unrelated groups).

Application: Detecting abbreviations in biomedical literature

PROBLEM
Increasing occurrence of abbreviations in the biomedical literature.
Represents a challenge to both humans and text processing algorithms.
Can we detect abbreviations reliably in abstracts of PubMed?
NOTE: Excellent, but different approach published by Pustejovsky, Castano et al. = ACROMED work.

Sample Alignments

NEUROPEPTIDE Y             = NPY
N----P-------Y
N------P-----Y

Beta-Endorphin             = Beta-EP
BETA-E----P---

c-Jun N-terminal Kinase    = JNK
--J---N----------K-----
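Alignments like the samples above can be enumerated by matching the abbreviation's characters, in order, against the long form. This is an illustrative sketch only: the function name `alignments` is hypothetical, matching is case-insensitive, and the step that scores candidate alignments with the features listed later in the lecture is not shown.

```python
def alignments(abbrev, long_form):
    """List every way to align the abbreviation against the long form.

    Each alignment is a string as long as the long form, with the matched
    abbreviation characters in place and "-" everywhere else.
    """
    a, s = abbrev.upper(), long_form.upper()
    results = []

    def extend(i, start, row):
        # All abbreviation characters placed: record the finished alignment.
        if i == len(a):
            results.append("".join(row))
            return
        # Try every later position in the long form that matches a[i].
        for pos in range(start, len(s)):
            if s[pos] == a[i]:
                extend(i + 1, pos + 1, row[:pos] + [a[i]] + row[pos + 1:])

    extend(0, 0, ["-"] * len(s))
    return results

print(alignments("NPY", "NEUROPEPTIDE Y"))
print(alignments("Beta-EP", "Beta-Endorphin"))
print(alignments("JNK", "c-Jun N-terminal Kinase"))
```

The number of candidate alignments can grow quickly for long phrases, which is why a scoring step is needed to pick the most plausible one.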


Features of Alignment Used
1. Lower case vs. upper case letters
2. Beginning of word
3. End of word
4. Syllable boundary
5. Neighbor
6. Percent aligned
7. Unused words
8. Aligned/word

Abbreviation Server http://abbreviation.stanford.edu/

Summary
• Much biological information is encoded as free text.
• NLP can analyze the text using a combination of statistical and rule-based approaches.
• Computational analyses of text can be useful, but are noisy and must be interpreted carefully.
