
Statistical Methods in NLP

Laboratory 5

Given a character sequence and a defined document unit, tokenization is the task of chopping it up
into pieces, called tokens, at the same time throwing away certain characters, such as punctuation.
Here is an example of tokenization:

Input: Paris is a nice city.


Output:
Paris
is
a
nice
city
.

What can we learn just through tokenization?


- Text statistics: no. of words, multi-word expressions, length of words/sentences, freq. of
vowels/consonants; language identification, etc.

The major question of the tokenization phase is what the correct tokens to use are. In this example,
it looks fairly trivial: you chop on whitespace and separate punctuation characters. This is a starting
point, and your first assignment.

Task 1 (2 points):
Obtain some plain text data, and store it in a file ‘corpus.txt’, of about 2 pages. Then
implement a tokenizer which splits the corpus on whitespace and separates punctuation
characters.

Link: https://docs.python.org/3/library/tokenize.html#examples
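The splitting step can be sketched with a single regular expression (a minimal sketch; it treats every non-whitespace, non-word character as its own token):

```python
import re

def tokenize(text):
    # \w+ matches maximal runs of word characters; [^\w\s] emits each
    # punctuation character as a separate token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Paris is a nice city."))
# ['Paris', 'is', 'a', 'nice', 'city', '.']
```

For the assignment itself, apply the function to the file contents, e.g. `tokenize(open('corpus.txt', encoding='utf-8').read())`.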

Task 2 (2 points):
Implement lemmatization on your corpus.

Lemmatization aims to remove inflectional endings only and to return the base or dictionary form of
a word (which is known as the lemma).
If confronted with the token saw, lemmatization would attempt to return either see or saw
depending on whether the use of the token was as a verb or a noun.

Link: http://www.nltk.org/api/nltk.stem.html?highlight=lemmatizer


Task 3 (3 points):
Implement a part-of-speech tagger for your corpus.

Link: https://www.nltk.org/book/ch05.html

Task 4 (2 points):
Extract the most frequent words out of your corpus.

Task 5 (2 points):
Extract the most frequent part-of-speech tags out of your corpus.
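Both counting tasks can be sketched with `collections.Counter` from the standard library, assuming the token list from Task 1 and the tagged pairs from Task 3 (the sample values below are illustrative):

```python
from collections import Counter

tokens = ["the", "cat", "saw", "the", "dog", "."]            # from Task 1
tagged = [("the", "DT"), ("cat", "NN"), ("saw", "VBD"),
          ("the", "DT"), ("dog", "NN"), (".", ".")]          # from Task 3

# Task 4: most frequent words
word_freq = Counter(tokens).most_common(3)
print(word_freq)  # [('the', 2), ('cat', 1), ('saw', 1)]

# Task 5: most frequent POS tags
tag_freq = Counter(tag for _, tag in tagged).most_common(3)
print(tag_freq)   # [('DT', 2), ('NN', 2), ('VBD', 1)]
```

Depending on what you want to count, you may wish to lowercase tokens and skip punctuation before building the `Counter`.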

_________________________________BONUS______________________________________

Task 6 (2 points):
Communication and computer technologies have introduced new types of character sequences that
a tokenizer should tokenize as a single token, including:
- email addresses: jblack@mail.yahoo.com
- web URLs: http://stuff.big.com/new/specials.html
- numeric IP addresses: 142.32.48.231
- and more.
Adjust your tokenizer to recognize only one token for emails, URLs etc.
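One way to sketch this is an ordered regex alternation: the special patterns are tried before the generic word/punctuation patterns, so a matching email, URL, or IP is consumed as a single token (the patterns below are deliberately simplified):

```python
import re

# Patterns are tried left to right, so the specific ones must come first.
TOKEN_RE = re.compile(r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # email addresses
  | https?://\S+                   # web URLs
  | \d{1,3}(?:\.\d{1,3}){3}        # numeric IP addresses
  | \w+                            # ordinary words
  | [^\w\s]                        # punctuation, one character per token
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Mail jblack@mail.yahoo.com or ping 142.32.48.231 now."))
# ['Mail', 'jblack@mail.yahoo.com', 'or', 'ping', '142.32.48.231', 'now', '.']
```

A known limitation of the `\S+` URL pattern: sentence-final punctuation directly after a URL is swallowed into the URL token, which a real tokenizer would need to handle.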

Task 7 (3 points):
What about splitting “Mr. John is a doctor.”? The abbreviation tells us that the period should not be a
split point here, but should stay attached to the preceding word, giving “Mr. John” as one token.
Adjust your tokenizer to treat abbreviations in this way.
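A minimal sketch using a closed list of abbreviations (the list is illustrative; a real tokenizer would need a longer one, or a learned model): when an abbreviation is seen, the period stays attached and the following word is merged into the same token.

```python
import re

# A small closed list of title abbreviations; extend it for your corpus.
ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}

def tokenize(text):
    tokens = []
    words = text.split()
    i = 0
    while i < len(words):
        word = words[i]
        if word in ABBREVIATIONS and i + 1 < len(words):
            # Keep the period and attach the following name: "Mr. John".
            following = re.findall(r"\w+|[^\w\s]", words[i + 1])
            tokens.append(word + " " + following[0])
            tokens.extend(following[1:])
            i += 2
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", word))
            i += 1
    return tokens

print(tokenize("Mr. John is a doctor."))
# ['Mr. John', 'is', 'a', 'doctor', '.']
```

Note that the final period after "doctor" is still split off as its own token; only the abbreviation's period is kept attached.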
