
Course G22.2580 - Web Search Engines


3/9/2011
Wei Xu xuwei@cs.nyu.edu
 WordNet®
 a large lexical database of English
 a combination of dictionary and thesaurus
 created and maintained by Cognitive Science Lab
of Princeton University
 designed to establish the connections between
words
 http://wordnet.princeton.edu/
 WORDnet
 4 Parts of Speech (POS)
▪ Noun, Verb, Adjective, Adverb
 Synset
▪ the smallest unit in WordNet
▪ a synonym set
▪ represents one specific meaning of a word
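A minimal sketch of the synset idea, assuming a hand-written toy lexicon: each synset groups synonymous words under one meaning, so a word with several meanings belongs to several synsets. The `Synset` class, the `SYNSETS` entries, and `synsets_for` are hypothetical names for illustration; they are not the WordNet database or any WordNet API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    pos: str       # one of "noun", "verb", "adjective", "adverb"
    lemmas: tuple  # the synonymous words that share this meaning
    gloss: str     # short description of the shared meaning


# Hypothetical, hand-written entries (not copied from WordNet).
SYNSETS = [
    Synset("noun", ("car", "auto", "automobile"), "a motor vehicle"),
    Synset("noun", ("car", "railcar"), "a wheeled vehicle that runs on rails"),
    Synset("adverb", ("quickly", "rapidly", "speedily"), "with speed"),
]


def synsets_for(word, pos=None):
    """Return every synset containing `word`, optionally filtered by POS."""
    return [
        s for s in SYNSETS
        if word in s.lemmas and (pos is None or s.pos == pos)
    ]
```

Here `synsets_for("car")` returns two synsets, one per meaning of the word.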
 wordNET
 Synsets are connected to one another through
semantic and lexical relations
 Types of relations (available relations depend on POS)
▪ hypernym (more general term): ‘vehicle’ is a hypernym of ‘car’
▪ hyponym (kind-of): ‘car’ is a hyponym of ‘vehicle’
▪ holonym (the whole): ‘building’ is a holonym of ‘window’
▪ meronym (part-of): ‘window’ is a meronym of ‘building’
▪ similar to: ‘smart’ is similar to ‘intelligent’
▪ antonym: ‘smart’ is an antonym of ‘unintelligent’
[Figure: diagram of hypernym and hyponym links]
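The relations listed above come in inverse pairs (hypernym/hyponym, holonym/meronym), while antonym and similar-to are symmetric. A rough sketch of that structure, assuming a toy in-memory graph: `RelationGraph`, `INVERSE`, and the `graph` instance are hypothetical names for illustration and do not reflect how WordNet stores its data.

```python
from collections import defaultdict

# Each relation and the relation that holds in the opposite direction.
INVERSE = {
    "hypernym": "hyponym",
    "hyponym": "hypernym",
    "holonym": "meronym",
    "meronym": "holonym",
    "antonym": "antonym",
    "similar_to": "similar_to",
}


class RelationGraph:
    def __init__(self):
        # (source, relation) -> set of targets
        self._edges = defaultdict(set)

    def add(self, source, relation, target):
        """Record that `target` is a `relation` of `source`, plus the inverse."""
        self._edges[(source, relation)].add(target)
        self._edges[(target, INVERSE[relation])].add(source)

    def related(self, source, relation):
        """Return the sorted targets linked to `source` by `relation`."""
        return sorted(self._edges.get((source, relation), ()))


# The examples from the slide, entered once each.
graph = RelationGraph()
graph.add("car", "hypernym", "vehicle")        # 'vehicle' is a hypernym of 'car'
graph.add("window", "holonym", "building")     # 'building' is a holonym of 'window'
graph.add("smart", "similar_to", "intelligent")
graph.add("smart", "antonym", "unintelligent")
```

Because `add` stores the inverse link as well, `graph.related("vehicle", "hyponym")` finds ‘car’ even though only the hypernym direction was entered.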
 Unix-style manual
 Web Interfaces
 Local Interfaces/APIs
 Java
 Perl
 C#

http://wordnet.princeton.edu/wordnet/related-projects/#web
 Definition:
 the process of removing suffixes from words to
obtain their base or root form

 Example:
 ‘fishing’, ‘fished’, ‘fish’, ‘fisher’ → ‘fish’
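The example above can be reproduced with a deliberately naive suffix stripper. This is only a sketch of the general idea; it is not the Porter, Krovetz, or WordNet stemmer, and the suffix list and the `min_stem` threshold are assumptions chosen for illustration.

```python
# Suffixes tried in order; a real stemmer uses many more rules and conditions.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least `min_stem` letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word


stems = [naive_stem(w) for w in ("fishing", "fished", "fish", "fisher")]
```

The `min_stem` guard keeps short words such as ‘sing’ and ‘bus’ intact, but the approach still makes mistakes (for instance, ‘running’ becomes ‘runn’), which is why rule-based stemmers add further conditions.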
 Porter Stemmer
 http://tartarus.org/~martin/PorterStemmer/

 Krovetz Stemmer (in Lemur package)
 http://www.lemurproject.org/phorum/read.php?11,1394

 WordNet Stemmer
 http://tipsandtricks.runicsoft.com/Other/JavaStemmer.html
 Tokenization
 The process of breaking a stream of text up into
“words” and punctuation marks.
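A minimal regex-based sketch of tokenization, assuming English text and a very small rule set: clitics such as 's are split off as their own token, runs of word characters form one token, and each punctuation mark stands alone. The pattern and the name `tokenize` are illustrative assumptions, not the behavior of any particular toolkit.

```python
import re

# Order matters: try an apostrophe clitic ('s, 're) before a plain word,
# then fall back to a single non-space, non-word character (punctuation).
TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


tokens = tokenize("He's at peace with the house.")
```

This sketch does not handle contractions such as "don't", abbreviations, or numbers with internal punctuation; production tokenizers carry many more rules.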
 Sentence Splitting
 Part of Speech Tagging
 Example:

He/PRP 's/VBZ at/IN peace/NN with/IN the/DT
house/NN and/CC could/MD stay/VB there/RB indefinitely/RB ./.
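Tagged output in the word/TAG format shown above is easy to read back into a program. A short sketch, assuming tokens are separated by whitespace and the tag follows the last slash; `parse_tagged` is a hypothetical helper name, not part of any tagger.

```python
def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the LAST slash so a token like './.' parses correctly.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


tagged = parse_tagged(
    "He/PRP 's/VBZ at/IN peace/NN with/IN the/DT "
    "house/NN and/CC could/MD stay/VB there/RB indefinitely/RB ./."
)

# Example use: collect the tokens tagged as singular nouns (NN).
nouns = [token for token, tag in tagged if tag == "NN"]
```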
 Named Entity Recognition
 The process of labeling sequences of words that are the
names of things, such as persons, companies, and
locations.
 Example:

Jim bought 300 shares of Acme Corp. in 2006.

<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought 300
shares of <ENAMEX TYPE="ORGANIZATION">Acme
Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.
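The inline markup above can be turned back into a list of entities with a regular expression. This is a sketch under the assumption that tags are not nested and always carry a single TYPE attribute, as in the example; `extract_entities` and `strip_markup` are hypothetical helper names.

```python
import re

# Matches <ENAMEX TYPE="...">text</ENAMEX> and <TIMEX TYPE="...">text</TIMEX>.
# The backreference \1 requires the closing tag to match the opening tag.
TAG_RE = re.compile(r'<(ENAMEX|TIMEX) TYPE="([^"]+)">(.*?)</\1>', re.DOTALL)


def extract_entities(marked_up):
    """Return (type, text) pairs for every tagged span, in document order."""
    return [(m.group(2), m.group(3)) for m in TAG_RE.finditer(marked_up)]


def strip_markup(marked_up):
    """Remove the tags and keep only the original sentence text."""
    return TAG_RE.sub(lambda m: m.group(3), marked_up)


annotated = (
    '<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought 300 shares of '
    '<ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in '
    '<TIMEX TYPE="DATE">2006</TIMEX>.'
)
entities = extract_entities(annotated)
plain = strip_markup(annotated)
```

Stripping the tags recovers the untagged input sentence, which is a quick consistency check on annotated data.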
 Stanford POS tagger
 http://nlp.stanford.edu/software/tagger.shtml
 Stanford NER
 http://nlp.stanford.edu/software/CRF-NER.shtml
 GATE
 http://gate.ac.uk/
 JET
 http://cs.nyu.edu/grishman/jet/license.html
 http://www.cs.nyu.edu/courses/spring10/G22.2590-001/schedule.html
