The Stanford CoreNLP Natural Language Processing Toolkit
Abstract

… pipeline that provides core natural language analysis. This toolkit is quite widely used, both in the research NLP community and also among commercial and government users of open source NLP technology. We suggest that this follows from …

[Figure 1 (execution flow diagram): text is wrapped in an Annotation object and passed through a sequence of annotators: tokenization (tokenize), sentence splitting (ssplit), morphological analysis (lemma), named entity recognition (ner), syntactic parsing (parse), coreference resolution (dcoref), and other annotators (gender, sentiment), producing the final annotated text.]
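The execution flow sketched in the figure, an Annotation object that successive annotators add analyses to, can be illustrated with a toy sketch (plain Java, not the CoreNLP API; all names here are invented for illustration):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy sketch of the execution flow in the figure: an annotation is a
 *  map of analyses, and a pipeline is an ordered list of annotators
 *  that each add information to it. */
public class PipelineSketch {
  public static Map<String, Object> annotate(String text,
      List<Consumer<Map<String, Object>>> annotators) {
    Map<String, Object> annotation = new HashMap<>();
    annotation.put("text", text);
    for (Consumer<Map<String, Object>> annotator : annotators) {
      annotator.accept(annotation);  // each stage adds its own analysis
    }
    return annotation;
  }

  // A stand-in "tokenize" annotator: naive whitespace tokenization.
  public static final Consumer<Map<String, Object>> TOKENIZE =
      ann -> ann.put("tokens",
          Arrays.asList(((String) ann.get("text")).split("\\s+")));
}
```

CoreNLP's real Annotation is a typesafe map keyed by annotation classes rather than strings, but the control flow is the same: each annotator reads the results of earlier ones and writes its own entry.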
Figure 2: Minimal code for an analysis pipeline.

```shell
export StanfordCoreNLP_HOME=/where/installed
java -Xmx2g -cp "$StanfordCoreNLP_HOME/*" \
    edu.stanford.nlp.StanfordCoreNLP -file input.txt
```

Figure 3: Minimal command-line invocation.

The annotators provided with StanfordCoreNLP can work with any character encoding, making use of Java's good Unicode support, but the system defaults to UTF-8 encoding. The annotators also support processing in various human languages, providing that suitable underlying models or resources are available for the different languages.
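The invocation in Figure 3 extends naturally, since the command-line tool accepts the pipeline's configuration properties as flags. A sketch of a slightly fuller invocation (the properties file name here is made up; consult the CoreNLP documentation for the full flag set):

```shell
# Sketch: select annotators explicitly and load further settings from a file.
# Any pipeline property can also be passed directly as a "-name value" flag.
java -Xmx2g -cp "$StanfordCoreNLP_HOME/*" edu.stanford.nlp.StanfordCoreNLP \
    -annotators tokenize,ssplit,pos \
    -props myconfig.properties \
    -file input.txt
```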
```java
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
import edu.stanford.nlp.util.*;

public class StanfordCoreNlpExample {
  public static void main(String[] args) throws IOException {
    PrintWriter xmlOut = new PrintWriter("xmlOutput.xml");
    Properties props = new Properties();
    props.setProperty("annotators",
        "tokenize, ssplit, pos, lemma, ner, parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation(
        "This is a short sentence. And this is another.");
    pipeline.annotate(annotation);
    pipeline.xmlPrint(annotation, xmlOut);
    xmlOut.close();
    // An Annotation is a Map and you can get and use the
    // various analyses individually. For instance, this
    // gets the parse tree of the 1st sentence in the text.
    List<CoreMap> sentences = annotation.get(
        CoreAnnotations.SentencesAnnotation.class);
    if (sentences != null && sentences.size() > 0) {
      CoreMap sentence = sentences.get(0);
      Tree tree = sentence.get(TreeAnnotation.class);
      PrintWriter out = new PrintWriter(System.out);
      out.println("The first sentence parsed is:");
      tree.pennPrint(out);
    }
  }
}
```

Figure 4: A simple, complete example program.

… annotators in a pipeline is controlled by standard Java properties in a Properties object. The most basic property to specify is what annotators to run, in what order, as shown here. But as discussed below, most annotators have their own properties to allow further customization of their usage. If none are specified, reasonable defaults are used. Running the pipeline is as simple as in the first example, but then we show two possibilities for accessing the results. First, we convert the Annotation object to XML and write it to a file. Second, we show code that gets a particular type of information out of an Annotation and then prints it.

Our presentation shows only usage in Java, but the Stanford CoreNLP pipeline has been wrapped by others so that it can be accessed easily from many languages, including Python, Ruby, Perl, Scala, Clojure, Javascript (node.js), and .NET languages.

The system comes packaged with models for English. Separate model packages provide support for Chinese and for case-insensitive processing of English. Support for other languages is less complete, but many of the Annotators also support models for French, German, and Arabic (see appendix B), and building models for further languages is possible using the underlying tools. In this section, we outline the provided annotators, focusing on the English versions. It should be noted that some of the models underlying annotators are trained from annotated corpora using supervised machine learning, while others are rule-based components, which nevertheless often require some language resources of their own.

tokenize Tokenizes the text into a sequence of tokens. The English component provides a PTB-style tokenizer, extended to reasonably handle noisy and web text. The corresponding components for Chinese and Arabic provide word and clitic segmentation. The tokenizer saves the character offsets of each token in the input text.

cleanxml Removes most or all XML tags from the document.

ssplit Splits a sequence of tokens into sentences.

truecase Determines the likely true case of tokens in text (that is, their likely case in well-edited text), where this information was lost, e.g., for all upper case text. This is implemented with a discriminative model using a CRF sequence tagger (Finkel et al., 2005).

pos Labels tokens with their part-of-speech (POS) tag, using a maximum entropy POS tagger (Toutanova et al., 2003).

lemma Generates the lemmas (base forms) for all tokens in the annotation.

gender Adds likely gender information to names.

ner Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC) and numerical (MONEY, NUMBER, DATE, TIME, DURATION, SET) entities. With the default
annotators, named entities are recognized using a combination of CRF sequence taggers trained on various corpora (Finkel et al., 2005), while numerical entities are recognized using two rule-based systems, one for money and numbers, and a separate state-of-the-art system for processing temporal expressions (Chang and Manning, 2012).

regexner Implements a simple, rule-based NER over token sequences building on Java regular expressions.

```java
/** Simple annotator for locations stored in a gazetteer. */
package org.foo;

public class GazetteerLocationAnnotator implements Annotator {
  // this is the only method an Annotator must implement
  public void annotate(Annotation annotation) {
    // traverse all sentences in this document
    for (CoreMap sentence : annotation.get(SentencesAnnotation.class)) {
      // loop over all tokens in sentence (the text already tokenized)
      List<CoreLabel> toks = sentence.get(TokensAnnotation.class);
      for (int start = 0; start < toks.size(); start++) {
        // assumes that the gazetteer returns the token index
        // after the match or -1 otherwise
        int end = Gazetteer.isLocation(toks, start);
        if (end > start) {
          for (int i = start; i < end; i++) {
            toks.get(i).set(NamedEntityTagAnnotation.class, "LOCATION");
          }
        }
      }
    }
  }
}
```
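The annotator above leans on a Gazetteer.isLocation helper that is not shown: given the sentence's tokens and a start index, it should return the token index just past a matching location name, or -1 if nothing matches there. A self-contained sketch of such a helper (the class, its entries, and its use of plain String tokens rather than CoreLabels are all illustrative assumptions):

```java
import java.util.List;
import java.util.Set;

/** Hypothetical gazetteer backing the annotator above: a fixed set of
 *  location names, each stored as its space-separated token sequence. */
public class Gazetteer {
  private static final Set<String> LOCATIONS =
      Set.of("Palo Alto", "San Francisco", "Germany");

  /** Returns the token index just after the longest gazetteer match
   *  starting at position start, or -1 if no entry matches there. */
  public static int isLocation(List<String> toks, int start) {
    int best = -1;
    StringBuilder span = new StringBuilder();
    for (int end = start; end < toks.size(); end++) {
      if (span.length() > 0) span.append(' ');
      span.append(toks.get(end));
      if (LOCATIONS.contains(span.toString())) best = end + 1;
    }
    return best;
  }
}
```

Returning the longest match means "San Francisco" wins over a hypothetical single-token entry "San", and the annotator's end > start check then labels every token in the matched span as LOCATION.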