Introduction
• NLTK was developed by Steven Bird and Edward Loper in the Department of
Computer and Information Science at the University of Pennsylvania.
• It is a software package for manipulating linguistic data and performing NLP tasks.
• NLTK is intended to support research and teaching in NLP and closely related areas,
including empirical linguistics, cognitive science, artificial intelligence, information
retrieval, and machine learning.
• The Natural Language Toolkit (NLTK) is a suite of open-source Python modules, data
sets, and tutorials
• Suite of classes for several NLP tasks
• Supporting research and development in natural language processing
• Download NLTK from nltk.org
Components of NLTK
• corpus readers
• tokenizers
• stemmers
• taggers
• parsers
• wordnet
• semantic interpretation
• clusterers
• evaluation metrics
•…
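As a rough illustration of two of the components listed above, the following sketch runs a tokenizer and a stemmer that need no downloaded data. The sentence is an illustrative assumption, not taken from the slides, and other tokenizers or stemmers in NLTK may behave differently.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Illustrative sentence (an assumption, not from the slides).
sentence = "The parsers handled running text."

# Tokenizer: split the string into word and punctuation tokens using a regex rule.
tokens = wordpunct_tokenize(sentence)

# Stemmer: reduce each token to its stem with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

for token, stem in zip(tokens, stems):
    print(f"{token} -> {stem}")
```

The other components follow a similar pattern: import a class or function from the relevant NLTK module and apply it to a list of tokens.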
2. Corpora
• Brown Corpus
• Carnegie Mellon Pronouncing Dictionary
• CoNLL 2000 Chunking Corpus
• Project Gutenberg Selections
• NIST 1999 Information Extraction: Entity Recognition Corpus
• US Presidential Inaugural Address Corpus
• Indian Language POS-Tagged Corpus
• Floresta Portuguese Treebank
• Prepositional Phrase Attachment Corpus
• SENSEVAL 2 Corpus
• Sinica Treebank Corpus Sample
• Universal Declaration of Human Rights Corpus
• Stopwords Corpus
• TIMIT Corpus Sample
• Treebank Corpus Sample
• …
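The corpora listed above are distributed as separate data packages that must be fetched with nltk.download() before use. So that it runs without any download, the sketch below instead points a PlaintextCorpusReader at a temporary file; the file name and its text are hypothetical. The bundled corpora under nltk.corpus expose similar fileids() and words() methods.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import PlaintextCorpusReader

with tempfile.TemporaryDirectory() as root:
    # Hypothetical one-file corpus, created only for this illustration.
    Path(root, "sample.txt").write_text(
        "NLTK reads corpora. It splits text into words.", encoding="utf-8"
    )

    # The reader treats every file under root that matches the pattern as a document.
    reader = PlaintextCorpusReader(root, r".*\.txt")
    file_ids = reader.fileids()

    # words() returns a lazy view, so copy it into a list before the directory is removed.
    words = list(reader.words("sample.txt"))

print(file_ids)
print(words)
```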
3. Documentation
• NLTK includes more than 50 corpora and lexical sources such as the Penn
Treebank Corpus, Open Multilingual Wordnet, Problem Report Corpus, and Lin’s
Dependency Thesaurus.
• The process of classifying words into their parts of speech and labelling them
accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging.
Parts of speech are also known as word classes or lexical categories.
• The collection of tags used for a particular task is known as a tag set.
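To make the idea of a tag set concrete, here is a minimal sketch of a rule-based tagger built with NLTK's RegexpTagger. The four patterns and the token list are assumptions chosen for illustration; they form a toy tag set, not the one used by NLTK's default tagger.

```python
from nltk.tag import RegexpTagger

# Hypothetical rules, tried in order; the first matching pattern assigns the tag.
patterns = [
    (r"^(the|a|an)$", "DT"),  # determiners
    (r".*ing$", "VBG"),       # gerunds / present participles
    (r".*s$", "NNS"),         # plural nouns
    (r".*", "NN"),            # fallback: singular noun
]
tagger = RegexpTagger(patterns)

# Tag a short, already tokenized phrase.
tagged = tagger.tag(["the", "parsers", "tagging", "text"])

# The tag set is the collection of distinct tags the tagger used.
tag_set = sorted({tag for _, tag in tagged})

print(tagged)
print(tag_set)
```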
Task: Given a string of words, identify the part of speech of each word.
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> # The tokenizer and tagger models must first be fetched with nltk.download().
>>> text = word_tokenize("Hello welcome to the world of to learn Categorizing and POS Tagging with NLTK and Python")
>>> nltk.pos_tag(text)
OUTPUT:
[('Hello', 'NNP'), ('welcome', 'NN'), ('to', 'TO'), ('the', 'DT'), ('world', 'NN'), ('of', 'IN'),
 ('to', 'TO'), ('learn', 'VB'), ('Categorizing', 'NNP'), ('and', 'CC'), ('POS', 'NNP'),
 ('Tagging', 'NNP'), ('with', 'IN'), ('NLTK', 'NNP'), ('and', 'CC'), ('Python', 'NNP')]
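The tagger returns a list of (word, tag) pairs, which can be processed like any other Python list. As a small sketch, the code below copies the output shown above into a literal, so no tagger model is needed, and counts how often each tag occurs using NLTK's FreqDist. The variable names are illustrative.

```python
from nltk.probability import FreqDist

# Tagged output copied from the example above.
tagged = [
    ("Hello", "NNP"), ("welcome", "NN"), ("to", "TO"), ("the", "DT"),
    ("world", "NN"), ("of", "IN"), ("to", "TO"), ("learn", "VB"),
    ("Categorizing", "NNP"), ("and", "CC"), ("POS", "NNP"), ("Tagging", "NNP"),
    ("with", "IN"), ("NLTK", "NNP"), ("and", "CC"), ("Python", "NNP"),
]

# Count the occurrences of each tag.
tag_counts = FreqDist(tag for _, tag in tagged)
for tag, count in tag_counts.most_common():
    print(tag, count)

# Select the words that were tagged as proper nouns (NNP).
proper_nouns = [word for word, tag in tagged if tag == "NNP"]
print(proper_nouns)
```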
Exercise 1.