
TON DUC THANG UNIVERSITY

FACULTY OF INFORMATION TECHNOLOGY

Natural Language Processing

MIDTERM REPORT

Author: Trà Lâm Thanh Hà – 520H067

Trần Lê Thành Lộc – 519H0310

Instructor: Mr. Lê Anh Cường

HO CHI MINH CITY, 2021


CONTENT
Introduction
Exercise 1
  1. Algorithm used to handle the exercise
  2. Definition of the algorithms used
  3. Comparison and accuracy of the algorithms
  4. Demo code
Exercise 2
  1. Preprocessing
  2. Split train and test
  3. Model used
  4. Accuracy of the model
  5. Demo code
Exercise 3
  1. Preprocessing
  2. UNK solves the zero problem
  3. Tokenize and modeling
  4. Outputs
Conclusion
Reference
Exercise 1
Algorithms used to handle the exercise
In order to find similar content across news articles and reports, we use SimHash and MinHash to measure their similarity.

• Definition of MinHash: a MinHash function converts tokenized text into a set of hash integers, then selects the minimum value.
+ Math formula of the hash: h(x) = (a·x + b) mod c, where
x: input integer
a, b: random numbers with a, b < x
c: random number with c > x
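As a demo-code sketch of the idea (our own illustrative implementation, not the exact code from the report; the prime modulus and the crc32 token hashing are our assumptions):

```python
import random
import zlib

PRIME = 2147483647  # a prime c larger than any 32-bit token hash (our choice)

def make_hash(seed):
    # One universal hash of the form h(x) = (a*x + b) mod c
    rng = random.Random(seed)
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)
    return lambda x: (a * x + b) % PRIME

def minhash_signature(tokens, num_hashes=16):
    # Hash every distinct token and keep the minimum value per hash function
    ids = {zlib.crc32(t.encode()) for t in set(tokens)}
    return [min(h(x) for x in ids) for h in (make_hash(i) for i in range(num_hashes))]

def estimate_similarity(sig_a, sig_b):
    # The fraction of matching minima estimates the Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The more hash functions we use, the more accurate (and slower) the estimate becomes.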
• Definition of SimHash: SimHash is a hashing function whose property is that the more similar the text inputs are, the smaller the Hamming distance of their hashes.
+ Math formula: Wi = TF(i), where
Wi: weight of the i-th word in the text
TF(i): frequency of the i-th word in the text
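A minimal SimHash sketch following the weighting above (our own illustration; using crc32 as the per-word hash is an assumption, not the report's exact code):

```python
import zlib
from collections import Counter

def simhash(tokens, bits=32):
    # Weight each word by its frequency (Wi = TF(i)), as in the formula above
    weights = Counter(tokens)
    v = [0] * bits
    for word, w in weights.items():
        h = zlib.crc32(word.encode())
        for i in range(bits):
            v[i] += w if (h >> i) & 1 else -w
    # Bit i of the fingerprint is 1 where the weighted sum is positive
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    # Similar texts should yield fingerprints with a small Hamming distance
    return bin(a ^ b).count("1")
```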

• Definition of Jaccard Distance: a statistic used for gauging the similarity and diversity of sample sets.
+ Math formula: J(A, B) = |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 − J(A, B).
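The formula translates directly into code (a small sketch of our own):

```python
def jaccard_similarity(a, b):
    # J(A, B) = |A ∩ B| / |A ∪ B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def jaccard_distance(a, b):
    # The distance is the complement of the similarity: 1 - J(A, B)
    return 1.0 - jaccard_similarity(a, b)
```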
Comparison and accuracy of the algorithms

                      MIN HASH                                SIM HASH
Big-O                 O(mnk + m^2·k) with k hash functions    O(n^2)
Similarity measure    Jaccard index                           Cosine similarity

Accuracy: SimHash < MinHash (MinHash is the more accurate)
Running time: SimHash > MinHash (MinHash is the faster)


Exercise 2
Preprocessing
+ Read the csv file with pandas into a dataframe
(because we lack computing resources, we only run on 10,000 rows of data)

+ Initialize the content, title, and category variables

+ After that, we split them into two arrays, X and y

+ Next, we tokenize the text in both the X and y arrays
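The steps above can be sketched as follows (a toy dataframe stands in for the real csv, and the column names 'title', 'content', 'category' are our assumptions):

```python
import pandas as pd

# Toy rows standing in for the real csv (in practice: pd.read_csv("..."))
df = pd.DataFrame({
    "title": ["Match report", "Market news"],
    "content": ["The team won the final match", "Stocks rose sharply today"],
    "category": ["sports", "business"],
})
df = df.head(10000)  # keep at most 10,000 rows due to limited resources

# Split into the two arrays X and y
X = df["content"].astype(str)
y = df["category"]

# Simple whitespace tokenization of the text in X
X_tokens = X.str.lower().str.split()
```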


Split train and test
+ First, transform the text set X into a matrix and call it X_train1

+ We use the sklearn library to split X_train1 and y into train and test sets

+ Fitting the model on either (X_train1, y) or (X_train, y_train) is possible
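A sketch of this step with sklearn (the toy corpus and labels are our own stand-ins for X and y):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus and labels standing in for X and y
corpus = ["the team won the match", "stocks rose today",
          "the match was close", "markets fell again"]
labels = ["sports", "business", "sports", "business"]

# Transform the text set X into a matrix, called X_train1 above
vectorizer = CountVectorizer()
X_train1 = vectorizer.fit_transform(corpus)

# Split X_train1 and y into train and test sets with sklearn
X_train, X_test, y_train, y_test = train_test_split(
    X_train1, labels, test_size=0.25, random_state=42)
```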
Model used and accuracy of the model
Steps for using any of the models:
1. First, fit the model on the train set
2. Second, take the value we want to predict and convert it to a matrix
3. Third, predict the values of the test set

(accuracy screenshots: KNN, Decision Tree, Logistic Regression)

==> The most efficient model is the Decision Tree; the least efficient is KNN.
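The three steps apply identically to each model; a sketch with a toy corpus of our own (hyperparameters such as n_neighbors and max_iter are illustrative choices):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

corpus = ["the team won the match", "stocks rose today",
          "the match was close", "markets fell again"]
labels = ["sports", "business", "sports", "business"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=1),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
predictions = {}
for name, model in models.items():
    model.fit(X, labels)                            # step 1: fit on the train set
    query = vectorizer.transform(["the team won"])  # step 2: convert input to a matrix
    predictions[name] = model.predict(query)[0]     # step 3: predict
```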
Exercise 3
N-Gram with Smoothing:
1. Preprocessing text
- First, we choose the 'content' column of the dataset as our data.

(an example row from 'content')

- Then we remove special characters from the data and lower-case every word in this column.
- Build n-grams (2-grams in this case) from the sentences in the text:
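These two preprocessing steps can be sketched as (our own illustration; the regex keeps only lowercase letters, digits, and whitespace):

```python
import re

def preprocess(text):
    # Remove special characters and lower-case, as described above
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

def ngrams(tokens, n=2):
    # Slide a window of size n over the token list (n=2 gives bigrams)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```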
2. UNK solves the zero problem

- Next, we count each item in the n-gram list to calculate probabilities:

- We use UNK to avoid zero probabilities:

(counts before UNK / after UNK)
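A minimal sketch of the UNK replacement (the min_count threshold of 2 is our assumption):

```python
from collections import Counter

UNK = "<UNK>"

def apply_unk(tokens, min_count=2):
    # Replace rare words (seen fewer than min_count times) with UNK, so
    # unseen words at test time still receive non-zero probability mass
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else UNK for t in tokens]
```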
3. Tokenize and Modeling

- Then we tokenize each item in the n-gram list before calculating probabilities:

- The pipeline is encode → modeling → decode:

(Encode / Modeling / Decode screenshots)
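The encode → modeling → decode pipeline can be sketched as a bigram model with add-one (Laplace) smoothing (our own illustration; the report's title says "with Smoothing", and add-one is the simplest such scheme):

```python
from collections import Counter

def bigram_model(tokens):
    # Encode: map each word to an integer id
    vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
    ids = [vocab[w] for w in tokens]

    # Modeling: count bigrams and unigrams over the encoded sequence
    bigrams = Counter(zip(ids, ids[1:]))
    unigrams = Counter(ids)
    V = len(vocab)

    def prob(w1, w2):
        # Add-one smoothed P(w2 | w1); unseen pairs get a small
        # non-zero probability instead of zero
        a, b = vocab[w1], vocab[w2]
        return (bigrams[(a, b)] + 1) / (unigrams[a] + V)

    # Decode: invert the vocabulary to recover words from ids
    id2word = {i: w for w, i in vocab.items()}
    return prob, id2word
```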


4. Outputs
(sample outputs screenshot)
REFERENCES
1. Mengxia Wang, Wenqiang Fan, An Improved Simhash Algorithm for Academic Paper Checking System, 2007
2. Pyi, MinHash
3. SimHash
4. sklearn, https://scikit-learn.org/stable/tutorial/index.html
5. Stanford, 04-lsh theory, slide 5
6. https://github.com/memosstilvi/simhash
7. http://web.eecs.utk.edu/~jplank/plank/classes/cs494/494/notes/Min-Hash/index.html
8. machinelearningcoban
