
TON DUC THANG UNIVERSITY

FACULTY OF INFORMATION TECHNOLOGY

Natural Language Processing

MIDTERM REPORT

Author: Trà Lâm Thanh Hà – 520H067

Trần Lê Thành Lộc – 519H0310

Instructor: Mr. Lê Anh Cường

HO CHI MINH CITY, 2021


CONTENT
Introduction
Exercise 1
  1. Algorithm used to handle the exercise
  2. Definition of the algorithms used
  3. Comparison and accuracy of the algorithms
  4. Demo code
Exercise 2
  1. Preprocessing
  2. Split train and test
  3. Model used
  4. Accuracy of the model
  5. Demo code
Exercise 3
  1. Preprocessing
  2. UNK solves the zero problem
  3. Tokenize and modeling
  4. Outputs
Conclusion
Reference
Exercise 1
Algorithms used to handle the exercise
In order to find similar content across news articles and reports, we use SimHash and MinHash to measure their similarity.

• Definition of MinHash: a MinHash function converts tokenized text into a set of hash integers, then selects the minimum value.
+ Math formula of the hash: h(x) = (a·x + b) mod c, where
x: input integer
a, b: random numbers with a, b < x
c: random number with c > x
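As a demo-code sketch of the idea (our own illustrative implementation, not the exact code from the report; the prime modulus and the crc32 token hashing are our assumptions):

```python
import random
import zlib

PRIME = 2147483647  # a prime c larger than any 32-bit token hash (our choice)

def make_hash(seed):
    # One universal hash of the form h(x) = (a*x + b) mod c
    rng = random.Random(seed)
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)
    return lambda x: (a * x + b) % PRIME

def minhash_signature(tokens, num_hashes=16):
    # Hash every distinct token and keep the minimum value per hash function
    ids = {zlib.crc32(t.encode()) for t in set(tokens)}
    return [min(h(x) for x in ids) for h in (make_hash(i) for i in range(num_hashes))]

def estimate_similarity(sig_a, sig_b):
    # The fraction of matching minima estimates the Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The more hash functions we use, the more accurate (and slower) the estimate becomes.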
• Definition of SimHash: SimHash is a hashing function whose property is that the more similar the text inputs are, the smaller the Hamming distance of their hashes.
+ Math formula: Wi = TF(i), where
Wi: weight of the i-th word in the text
TF(i): frequency of the i-th word in the text
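A minimal SimHash sketch following the weighting above (our own illustration; using crc32 as the per-word hash is an assumption, not the report's exact code):

```python
import zlib
from collections import Counter

def simhash(tokens, bits=32):
    # Weight each word by its frequency (Wi = TF(i)), as in the formula above
    weights = Counter(tokens)
    v = [0] * bits
    for word, w in weights.items():
        h = zlib.crc32(word.encode())
        for i in range(bits):
            v[i] += w if (h >> i) & 1 else -w
    # Bit i of the fingerprint is 1 where the weighted sum is positive
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    # Similar texts should yield fingerprints with a small Hamming distance
    return bin(a ^ b).count("1")
```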

• Definition of Jaccard Distance: a statistic used for gauging the similarity and diversity of sample sets.
+ Math formula: J(A, B) = |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 − J(A, B).
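The formula translates directly into code (a small sketch of our own):

```python
def jaccard_similarity(a, b):
    # J(A, B) = |A ∩ B| / |A ∪ B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def jaccard_distance(a, b):
    # The distance is the complement of the similarity: 1 - J(A, B)
    return 1.0 - jaccard_similarity(a, b)
```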
Comparison and accuracy of the algorithms

                      MIN HASH                                SIM HASH
Big-O                 O(mnk + m^2·k) with k hash functions    O(n^2)
Similarity measure    Jaccard index                           Cosine similarity

Accuracy: SimHash < MinHash (MinHash is the more accurate)
Running time: SimHash > MinHash (MinHash is the faster)


Exercise 2
Preprocessing
+ Read the csv file with pandas into a dataframe
(because we lack computing resources, we only run on 10,000 rows of data)

+ Initialize the content, title, and category variables

+ After that, we split them into two arrays, X and y

+ Next, we tokenize the text in both the X and y arrays
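The steps above can be sketched as follows (a toy dataframe stands in for the real csv, and the column names 'title', 'content', 'category' are our assumptions):

```python
import pandas as pd

# Toy rows standing in for the real csv (in practice: pd.read_csv("..."))
df = pd.DataFrame({
    "title": ["Match report", "Market news"],
    "content": ["The team won the final match", "Stocks rose sharply today"],
    "category": ["sports", "business"],
})
df = df.head(10000)  # keep at most 10,000 rows due to limited resources

# Split into the two arrays X and y
X = df["content"].astype(str)
y = df["category"]

# Simple whitespace tokenization of the text in X
X_tokens = X.str.lower().str.split()
```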


Split train and test
+ First, transform the text set X into a matrix and call it X_train1

+ We use the sklearn library to split X_train1 and y into train and test sets

+ Fitting the model on either (X_train1, y) or (X_train, y_train) is possible
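A sketch of this step with sklearn (the toy corpus and labels are our own stand-ins for X and y):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus and labels standing in for X and y
corpus = ["the team won the match", "stocks rose today",
          "the match was close", "markets fell again"]
labels = ["sports", "business", "sports", "business"]

# Transform the text set X into a matrix, called X_train1 above
vectorizer = CountVectorizer()
X_train1 = vectorizer.fit_transform(corpus)

# Split X_train1 and y into train and test sets with sklearn
X_train, X_test, y_train, y_test = train_test_split(
    X_train1, labels, test_size=0.25, random_state=42)
```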
Model used and accuracy of the model
Steps for using any of the models:
1. First, fit the model on the train set
2. Second, take the value we want to predict and convert it to a matrix
3. Third, predict the values of the test set

(accuracy screenshots: KNN, Decision Tree, Logistic Regression)

==> The most efficient model is the Decision Tree; the least efficient is KNN.
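The three steps apply identically to each model; a sketch with a toy corpus of our own (hyperparameters such as n_neighbors and max_iter are illustrative choices):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

corpus = ["the team won the match", "stocks rose today",
          "the match was close", "markets fell again"]
labels = ["sports", "business", "sports", "business"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=1),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
predictions = {}
for name, model in models.items():
    model.fit(X, labels)                            # step 1: fit on the train set
    query = vectorizer.transform(["the team won"])  # step 2: convert input to a matrix
    predictions[name] = model.predict(query)[0]     # step 3: predict
```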
Exercise 3
N-Gram with Smoothing:
1. Preprocessing text
- First, we choose the 'content' column of the dataset as our data.

(an example row from 'content')

- Then we remove special characters from the data and lower-case every word in this column.
- Build n-grams (2-grams in this case) from the sentences in the text:
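These two preprocessing steps can be sketched as (our own illustration; the regex keeps only lowercase letters, digits, and whitespace):

```python
import re

def preprocess(text):
    # Remove special characters and lower-case, as described above
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

def ngrams(tokens, n=2):
    # Slide a window of size n over the token list (n=2 gives bigrams)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```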
2. UNK solves the zero problem

- Next, we count each item in the n-gram list to calculate probabilities:

- We use UNK to avoid zero probabilities:

(counts before UNK / after UNK)
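A minimal sketch of the UNK replacement (the min_count threshold of 2 is our assumption):

```python
from collections import Counter

UNK = "<UNK>"

def apply_unk(tokens, min_count=2):
    # Replace rare words (seen fewer than min_count times) with UNK, so
    # unseen words at test time still receive non-zero probability mass
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else UNK for t in tokens]
```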
3. Tokenize and Modeling

- Then we tokenize each item in the n-gram list before calculating probabilities:

- The pipeline is encode → modeling → decode:

(Encode / Modeling / Decode screenshots)
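The encode → modeling → decode pipeline can be sketched as a bigram model with add-one (Laplace) smoothing (our own illustration; the report's title says "with Smoothing", and add-one is the simplest such scheme):

```python
from collections import Counter

def bigram_model(tokens):
    # Encode: map each word to an integer id
    vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
    ids = [vocab[w] for w in tokens]

    # Modeling: count bigrams and unigrams over the encoded sequence
    bigrams = Counter(zip(ids, ids[1:]))
    unigrams = Counter(ids)
    V = len(vocab)

    def prob(w1, w2):
        # Add-one smoothed P(w2 | w1); unseen pairs get a small
        # non-zero probability instead of zero
        a, b = vocab[w1], vocab[w2]
        return (bigrams[(a, b)] + 1) / (unigrams[a] + V)

    # Decode: invert the vocabulary to recover words from ids
    id2word = {i: w for w, i in vocab.items()}
    return prob, id2word
```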


4. Outputs
(sample outputs screenshot)
REFERENCES
1. Mengxia Wang, Wenqiang Fan, An Improved Simhash Algorithm for Academic Paper Checking System, 2007
2. Pyi, MinHash
3. SimHash
4. sklearn, https://scikit-learn.org/stable/tutorial/index.html
5. Stanford, 04-lsh theory, slide 5
6. https://github.com/memosstilvi/simhash
7. http://web.eecs.utk.edu/~jplank/plank/classes/cs494/494/notes/Min-Hash/index.html
8. machinelearningcoban
