
INTRODUCTION TO NLTK

Introduction

• The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and
programs for symbolic and statistical natural language processing (NLP) for English,
written in the Python programming language.

• It was developed by Steven Bird and Edward Loper in the Department of
Computer and Information Science at the University of Pennsylvania.

• It is a software package for manipulating linguistic data and performing NLP tasks.

• NLTK is intended to support research and teaching in NLP or closely related areas,
including empirical linguistics, cognitive science, artificial intelligence, information
retrieval, and machine learning.
Introduction

• Natural Language Toolkit (NLTK) is a suite of open source Python modules, data
sets and tutorials
• Suite of classes for several NLP tasks
• Supporting research and development in natural language processing
• Download NLTK from nltk.org
Components of NLTK

1. Code: corpus readers, tokenizers, stemmers, taggers, chunkers, parsers, wordnet, ...
(50k lines of code)
2. Corpora: >30 annotated data sets widely used in natural language processing
(>300 MB of data)
3. Documentation: a 400-page book, articles, reviews, API documentation
1. Code

• corpus readers
• tokenizers
• stemmers
• taggers
• parsers
• wordnet
• semantic interpretation
• clusterers
• evaluation metrics
•…
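
As a small illustration of one of these components, the sketch below runs NLTK's Porter stemmer over a few inflected forms. The word list is made up for the example, and the Porter stemmer is only one of several stemmers NLTK provides; it should not need any extra data download.

```python
from nltk.stem import PorterStemmer

# Hypothetical sample words, chosen only for illustration.
words = ["walks", "walked", "walking", "running"]

stemmer = PorterStemmer()
# stem() strips common English suffixes to reduce each word to its stem.
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

A stem is not always a dictionary word; stemmers trade linguistic accuracy for speed and simplicity.
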
2. Corpora

• Brown Corpus
• Carnegie Mellon Pronouncing Dictionary
• CoNLL 2000 Chunking Corpus
• Project Gutenberg Selections
• NIST 1999 Information Extraction: Entity Recognition Corpus
• US Presidential Inaugural Address Corpus
• Indian Language POS-Tagged Corpus
• Floresta Portuguese Treebank
• Prepositional Phrase Attachment Corpus
• SENSEVAL 2 Corpus
• Sinica Treebank Corpus Sample
• Universal Declaration of Human Rights Corpus
• Stopwords Corpus
• TIMIT Corpus Sample
• Treebank Corpus Sample
• …
3. Documentation

• A 400-page book about natural language processing in Python and NLTK
• teaches Python and NLP
• provides numerous examples and exercises
• Installation instructions
• Presentation slides for some of the book chapters
• API Documentation: describes every module,
interface, class, and method
Installing NLTK

• Download and install: http://nltk.org/install.html
• Download NLTK data
>>> import nltk
>>> nltk.download()
Linguistic Tasks

• NLTK supports classification, tokenization, stemming, tagging, parsing, and
semantic reasoning functionalities.

•NLTK includes more than 50 corpora and lexical sources such as the Penn
Treebank Corpus, Open Multilingual Wordnet, Problem Report Corpus, and Lin’s
Dependency Thesaurus.

• The process of classifying words into their parts of speech and labelling them
accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging.
Parts of speech are also known as word classes or lexical categories.

• The collection of tags used for a particular task is known as a tag set.
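
To make the ideas of tagged words and a tag set concrete, here is a minimal sketch. The sample sentence and its Penn Treebank-style tags are written by hand for illustration (they are not the output of a tagger); `str2tuple` is NLTK's helper for turning the common `word/TAG` notation into `(word, tag)` pairs.

```python
from nltk.tag import str2tuple

# Hand-annotated example in word/TAG notation (an assumption for this sketch).
tagged_text = "The/DT cat/NN sat/VBD on/IN the/DT mat/NN ./."

# Convert each word/TAG item into a (word, tag) tuple.
tagged_tokens = [str2tuple(item) for item in tagged_text.split()]

# The tag set is simply the collection of distinct tags in use.
tag_set = sorted({tag for _, tag in tagged_tokens})

print(tagged_tokens)
print(tag_set)
```
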

Copyright @ 2015 Learntek. All Rights Reserved.


Linguistic Tasks

Part of Speech Tagging Authoring


Parsing Machine Translation
Word Net Summarization
Named Entity Recognition Information Extraction
Information Retrieval Spoken Dialog Systems
Sentiment Analysis Natural Language Generation
Document Clustering Word Sense Disambiguation
Topic Segmentation
Part of Speech Tagging

Task: Given a string of words, identify the part of speech of each word.

A    man   walks  into  a    bar.
Det  Noun  Verb   Prep  Det  Noun
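
The mapping above can be reproduced with a very simple lookup tagger. This is only a sketch: the lookup table is written by hand and covers just this sentence, whereas practical taggers are trained on annotated corpora. The `"Unknown"` fallback tag is a placeholder chosen for the example.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hand-made lookup table (an assumption for this example, not a trained model).
lookup = {"a": "Det", "man": "Noun", "walks": "Verb", "into": "Prep", "bar": "Noun"}

# Words missing from the table fall back to a placeholder tag.
tagger = UnigramTagger(model=lookup, backoff=DefaultTagger("Unknown"))

tokens = ["A", "man", "walks", "into", "a", "bar"]
# Look words up in lowercase, then pair each tag with the original token.
tags = [tag for _, tag in tagger.tag([token.lower() for token in tokens])]
tagged = list(zip(tokens, tags))

print(tagged)
```

A lookup like this cannot handle ambiguity: "walks" is a verb here but a noun in "long walks", which is why real taggers also consider context.
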
Using a Tagger

A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a
part-of-speech tag to each word. Before tagging, the text must first be tokenized
(tokenization is the process of dividing a text into smaller parts called tokens).
>>> # word_tokenize and pos_tag need the tokenizer and tagger models from nltk.download()
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> text = word_tokenize("Hello welcome to the world of to learn Categorizing and POS Tagging with NLTK and Python")
>>> nltk.pos_tag(text)

OUTPUT:

[('Hello', 'NNP'), ('welcome', 'NN'), ('to', 'TO'), ('the', 'DT'), ('world', 'NN'), ('of', 'IN'),
 ('to', 'TO'), ('learn', 'VB'), ('Categorizing', 'NNP'), ('and', 'CC'), ('POS', 'NNP'),
 ('Tagging', 'NNP'), ('with', 'IN'), ('NLTK', 'NNP'), ('and', 'CC'), ('Python', 'NNP')]
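
`word_tokenize` relies on downloaded tokenizer data. For quick experiments, one possible alternative is `wordpunct_tokenize`, a purely regular-expression-based tokenizer in NLTK that needs no extra data. The sketch below applies it to a sample sentence; note that its splits can differ from those of `word_tokenize`, for example on contractions.

```python
from nltk.tokenize import wordpunct_tokenize

sentence = "A man walks into a bar."

# Splits the text into runs of word characters and runs of punctuation.
tokens = wordpunct_tokenize(sentence)
print(tokens)

# Lowercasing is plain Python string handling applied to each token.
lowered = [token.lower() for token in tokens]
print(lowered)
```
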
Exercise 1.

• Download and install NLTK.
• Carry out sentence and word tokenization using NLTK.
• Convert a piece of text to lowercase using NLTK.
• Find out what stopwords are, and remove the stopwords from a piece of English text
using NLTK.
• NLTK comes with a corpus of stopwords in various languages. Print all the Greek
stopwords.
• Explore the corpora available in NLTK. Print the words in the Brown corpus.
• What are n-grams? Can you obtain all n-grams from a piece of text for n=4?
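
As a starting point for the last question: an n-gram is a run of n consecutive tokens. The sketch below lists the bigrams (n=2) of a made-up token list using `nltk.util.ngrams`; other values of n are passed the same way, so the n=4 case is left as the exercise.

```python
from nltk.util import ngrams

# Made-up sample text, split on whitespace for simplicity.
tokens = "to be or not to be".split()

# ngrams() yields tuples of n consecutive tokens; list() collects them.
bigrams = list(ngrams(tokens, 2))
print(bigrams)
```
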
