
NAME : TRAN NGUYEN ANH THOAI

MODULE :

COURSE CODE :

COURSEWORK LEADER :

DUE DATE :

CENTRE : GREENWICH, HCMC

WORD COUNT :
Commitment
This dissertation has been completed with reference to the documents, articles, and websites listed in the references section, and each source is cited where it is used.

I hereby certify that, apart from the cited references, all content and data in this thesis were compiled by myself based on my own research under the supervision of Mr. Le Minh Nhat Trieu. I accept full responsibility for any violation of the regulations.
Acknowledgements
First, I would like to thank my loved ones, especially my parents, who raised me and have given me the best things in this life, both materially and morally.

Second, I would like to thank FPT University and the University of Greenwich. Thank you to the school and the teachers who created the classes and subjects for students, and for the knowledge they have passed on.

Thank you to Mr. Le Minh Nhat Trieu, who accompanied me during the Top-Up semester this past school year. He devoted his time and energy to supporting me as much as possible.

Finally, I would like to thank the teachers and friends who have accompanied me over the past four school years. Thank you for your valuable knowledge and your enthusiasm in helping me.
Abstract
Nowadays, the development of IT has changed our lives greatly, especially data mining and machine learning. They have been applied in every field of life, from face and voice recognition to natural language processing. Natural language processing in particular is used in many areas of today's life, for example robots capable of communicating in place of humans; a typical example is the explosion of anti-epidemic robots during the Covid-19 era. In the field of mass media, newspapers and news production are receiving more and more attention from the public. This brings a great deal of work: it is a waste of time to read the title of every article just to classify it. This thesis was written to address the problem of sorting articles by topic; because time is limited, the topics only cover the world, sports, life, law, health, and so on. Articles are usually stored as natural language, that is, unstructured data. The most practical way to classify these articles is to represent them in a vector space. However, in order to vectorize the information, we need to process the data first. The specific tasks are word segmentation, removing accents in sentences, and eliminating stop words. In this thesis, a word segmentation tool is used to separate words, then vectors are constructed with the Bag-of-Words and Word2Vec methods. Finally, a Jupyter notebook is used to present the results obtained in news classification using machine learning methods.
1. Introduction

In digital life, all the information and knowledge that we know, or do not yet know, is on the internet. This solves many human problems, such as storing documents without paper or pen, storing them for a long time, and making them convenient to search. However, in the face of this huge amount of information, proper categorization is an important concern. In practice this work is done manually and takes a lot of time and effort, so automatic classification is very necessary.

Seeing this need, I decided to explore the steps needed to classify information using ML. The data set consists of news articles taken from Vietnamese online news sites. From there, we proceed to build and apply the classification methods. This is a research project and also the subject of my graduation thesis.

The purpose of this thesis is to show how machine learning can be used to categorize Vietnamese news and how I carried it out.

1.1 Thesis layout


CHAPTER 2

MACHINE LEARNING AND NATURAL LANGUAGE PROCESSING IN GENERAL

2.1. Overview of the background of machine learning

In today's industrialized life, the amount of data is increasing in both quality and quantity, but only a small part of that huge block of data has value. The desire to find and exploit information and value from that block of data has opened a new wing for the information technology industry: Knowledge Discovery from Data, that is, extracting knowledge from databases.
Steps for data mining include:

● Identify the requirement and the associated data space (problem understanding and data understanding).

● Data preparation, including data cleaning, data integration, data selection, and data transformation.

● Data mining, including identifying the target of the data to be exploited and the exploitation technique. The result will be an image or text source.

● Evaluation, based on the criteria, and filtering of the obtained data.

● Deployment

The data mining process is repeated many times. Data extraction is the process of extracting knowledge from a data set. It requires knowledge of many fields such as IT, AI, databases, and mathematics.

Mining methods include:

● Classification: a technique that assigns an object to one or more specific classes.

● Regression: maps a data sample to a predictive variable with a real value.

● Clustering: a "cluster" is a group of data objects; similar objects are located in the same cluster, so the result groups similar objects together.
2.2. What is machine learning.

With the explosion of big data, classical algorithms that look good on paper no longer perform well in practice. The emergence of machine learning was therefore inevitable, and it also opened a new chapter for the IT industry.

Machine learning is a field of artificial intelligence concerned with the research and construction of techniques that allow systems to "learn" automatically from data in order to solve specific problems.

Machine learning is strongly related to statistics, as both fields study data analysis, but unlike statistics, machine learning focuses on the complexity of the algorithms used to perform the computation. Many inference problems are NP-hard, so part of machine learning is the study of approximate inference algorithms that can be handled in practice.

Currently, thanks to the development of hardware, many new algorithms such as deep learning and reinforcement learning have been produced and improved. But all of them rely on machine learning, which is the core of today's advanced algorithms.

An outstanding feature of document classification is the variety of topics: the number of topics and texts is unlimited. For example, some popular topics in Vietnamese news are law, health, life, education, and economics.

Machine learning is widely used today, including in data search engines, medical diagnostics, detecting fake credit cards, analyzing the stock market, classifying DNA sequences, speech and handwriting recognition, automatic translation, game playing, and robot locomotion. [1]
2.3 General Machine Learning Structure
There are two phases that need to be addressed in machine learning: the training phase and the testing phase.

Training phase.

+ The issues that need to be handled in the training phase are the feature extractor and the main algorithms.
+ Raw input data is all the information we know about the data, for example the value of each pixel for an image, every word and sentence for a text, or the signal for an audio file. This data is called raw data and is in an unstructured form. In order for the machine to understand and learn from this raw data, we need to convert each sample into a feature vector, so that the whole data set becomes a 2-dimensional matrix.
+ Prior knowledge about the data: theory about the data type also helps. For example, for a text classification problem we need knowledge related to that problem; since this thesis analyzes Vietnamese text, we need stop word files or vocabulary sets related to Vietnamese text.
+ Main algorithms: after extracting features from the data set, these extracted features are fed to training algorithms such as classification, clustering, or regression.

Testing phase

+ This step is much simpler because it is a forward pass over new raw inputs. The same processing as in the training phase is applied to create the corresponding feature vectors, and the trained algorithm is then used to predict the output.

2.4 Natural Language Processing With Machine Learning

Textual data holds a large amount of the human knowledge used in communication and in documents. This makes the text data set extremely large, and the amount of accumulated human knowledge constantly increases over time. However, with such a rapidly growing number of documents, people cannot control, evaluate, and classify them by themselves as before. In addition, uncontrolled data can cause serious consequences for security in general and for human life in particular. Because of this urgency, the use of ML in NLP helps people save energy when selecting, classifying, and filtering the clean information that deserves attention. The text classification problem plays a very important role in handling big data today. So in this thesis, we will dig deeper into the methods that ML applies to NLP to see the power of ML in the current 4.0 era.

Text data mining is the process of extracting data from articles or documents in text form. This is a multi-disciplinary problem including information retrieval, text analysis, information extraction, clustering, and categorization. In the next part we will present in depth the categorization problem that is the topic of this thesis.

Natural language processing is a branch of this field in which the input data is text stored in an unstructured form. Here the raw input is text scraped from articles or from data files saved in an unstructured format. As mentioned in the overall machine learning section, we need knowledge related to the problem we want to solve. Some methods for extracting features from text in natural language processing are Bag of Words, TF-IDF, and word embeddings.

News classification is one form of the text classification problem. Text classification is a classic text processing task. According to Yang & Liu (1999) [2], automatic text classification is the assignment of classification labels to a new document based on the degree of similarity of that text to the labeled texts in the training set.

Around the world there are many research projects that have achieved positive results, for example Support Vector Machine (Joachims, 1998), k-Nearest Neighbor (Yang, 1994), Linear Least Squares Fit (Yang and Chute, 1994), Neural Network (Wiener et al., 1995), Naive Bayes (Baker and McCallum, 2000), and Centroid-based (Shankar and Karypis, 1998). These methods are based on statistical probability or on word weight information in the text. For Vietnamese, however, tokenization is harder. A token is often interpreted as a word, although this is not entirely correct. In English, for example, words are usually separated by spaces, but "New York" is still considered a single word even though it contains a space, so there is only one token in this case. Another example is "I'm", which contains the two words "I" and "am" even though there is no space between them; in this case we have two tokens. Vietnamese still has many limitations because of the difficulty of separating words and sentences.

2.5 Machine learning And Natural Language Processing With Performance

Because of the increasing needs of life, society is more and more developed, and language processing methods are increasingly being applied in daily life.

For example, in the business sector, Facebook uses NLP to keep track of trending topics and popular hashtags and to manage the news with fake news detection and spam detection. Spam detection technologies use NLP's text classification capabilities to scan emails and identify them as spam or phishing. NLP is also used to build chatbots for replying to customers, to generate text for new documents, and to translate text into other languages. Many more tasks in life can be solved with natural language processing and machine learning.

CHAPTER 3

TEXT CLASSIFICATION IN NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING

3.1 Background

Because computers and the Internet are widely used, an extremely large amount of information is produced every day. Nowadays, information fills our lives, and most of it is stored as text. A large number of unstructured texts are posted and stored on web pages, in digital libraries, and in communities. Therefore, automatic methods are necessary to help people manage and filter this information instead of doing it manually. Predicting class labels for online texts is required by a variety of applications. For example, in spam filtering, classification methods are used to identify junk information automatically. In news organization, because most news is provided on the Internet and the amount is huge, it is impractical to finish this task manually.

The motivation for exploiting background knowledge in text classification comes from two reasons. First, more information from texts allows more reasonable classification. Second, people have basic concepts and general knowledge in their minds, whereas the common corpora/datasets are special cases that may lack some of these basic concepts and general knowledge. These basic concepts and general knowledge are the background knowledge in our lives.

So classification is one of the most important tasks for filtering out the important texts we need to use and skipping the useless ones.

3.2. Application

Many applications today need a classification task, and depending on the task there are different kinds of class sets. The most familiar application is keyword search on Google: the server uses classification to filter the theme of the information we need and returns results matching what we want. Google, Yahoo, and Facebook also use classification to remove spam email, which helps people avoid accessing dangerous links.
3.3 Overview of Methods and Contributions

3.3.1. General Method

The goal of text classification is to predict the correct class label for a given text. The text classification task is defined over a set of training texts D = {X1, ..., Xn}, where each text is labeled with a category value drawn from a set of k different discrete values indexed by {1, ..., k} [1]. All the texts are split into two subsets, training texts and test texts. The training texts are used to train a classification model using machine learning algorithms. The test texts are used to evaluate the performance of the model. For a test text whose category is unknown, the model is used to predict its category. Each category is assigned a label, and these labels are numeric values that represent the categories. In practice, the text classification task is computing the text's label.

For example: "Thủ tướng hôm nay vừa có một chuyến thăm ở Trung Quốc" (The Prime Minister has just made a visit to China today).

This sentence belongs to the politics theme.

3.3.2. Background Knowledge

To understand the categories of sentences, the model needs knowledge of the vocabulary of the language and of the structure of the language.

For example, to understand a sentence in English, the corpus/dataset is the dictionary the model uses to build its knowledge base and then retrieve the meaning of the sentence. Based on the structural knowledge, the model can then understand the context of the sentence.

3.3.3 Feature engineering

Feature engineering is the process of converting the original raw data set into sets of attributes. This makes the raw data easier to work with, more compatible with prediction models, and improves the accuracy of predictive models. Some of the text conversion models widely used in NLP today are Bag of Words, TF-IDF, and Word2Vec.

3.3.3.1. Bag of word

In 1954, Zellig Harris, a linguist and mathematician, talked about the bag of words in a linguistic context in his article on distributional structure. According to Harris [4] in "Distributional Structure": "... for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use". This was the first premise for data scientists to research and develop the bag-of-words technique and apply it to data preprocessing for NLP.

The bag-of-words model is a popular method for converting unstructured text data into a simple vector space representation. It creates a dictionary containing the distinct words of a text. A sentence is then represented as a vector whose length equals the length of the dictionary, and each cell in the vector holds the number of occurrences of the corresponding word. As its name suggests, it is a bag containing the words of the text, arranged without order, regardless of word sequence or grammar in the sentence.
In practice, the bag-of-words model is mainly used as a tool for feature generation. After transforming the text into a "bag of words", we can calculate various measures to characterize the text. The most common type of feature calculated from the bag-of-words model is term frequency, namely, the number of times a term appears in the text. For the example below, we construct two lists recording the term frequencies of all the distinct words.

Suppose we have two sentences in a Vietnamese text.

1. Con chó đang nằm canh nhà (The dog is at the house guard)

2. Con mèo đang nằm ngủ (The cat is sleeping)

Based on the above two sentences, a list of 8 words will appear.

[“con”,”chó”,”đang ”,”ngủ”,”mèo”,”nằm”,”nhà”,”canh”]

Based on the dictionary created from the two sentences above, we proceed to create a vector storing the frequency of words appearing in each sentence. Since the sentences are six words and five words long, we will have the following vectors.

1. [1,1,1,0,0,1,1,1]

2. [1,0,1,1,1,1,0,0]

Note that if a word appears more than once in a sentence, it is still listed only once in the dictionary, but its count in the corresponding cell of the vector increases accordingly.

For example, we have two sentences:

● Con chó nằm cạnh con mèo


● Con mèo nằm trước nhà

Based on the above two sentences, a list of 7 words will appear.

[“mèo”, “chó”, “cạnh”, “nằm”, “trước”, “nhà”, “con”]

Based on the created dictionary we proceed as follows.

● [1,1,1,1,0,0,2]
● [1,0,0,1,1,1,1]
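To make this concrete, the short Python sketch below builds the same kind of count vectors with scikit-learn's CountVectorizer. It is only an illustration: CountVectorizer sorts its vocabulary alphabetically, so the column order differs from the hand-built dictionary above, but each row still counts the words of one sentence.

    # A minimal bag-of-words sketch using scikit-learn's CountVectorizer.
    from sklearn.feature_extraction.text import CountVectorizer

    sentences = [
        "con chó nằm cạnh con mèo",
        "con mèo nằm trước nhà",
    ]

    vectorizer = CountVectorizer()             # builds the dictionary of distinct words
    bow = vectorizer.fit_transform(sentences)  # one count vector per sentence

    print(vectorizer.get_feature_names_out())  # the learned dictionary
    print(bow.toarray())                       # the first row counts "con" twice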

3.3.3.2.TF-IDF

This method was introduced by Karen Spärck Jones in an article that appeared in 1972 under the title "term specificity". Although it works as a heuristic, it was controversial on theoretical grounds for at least thirty years. Nevertheless, it is also a premise of data preprocessing for NLP models.

As in the second example in the Bag of Words section, a problem arises when a paragraph contains many duplicate words or words that appear very often: they interfere with the other words in the dictionary. If we only consider the raw frequency of each word, the classification can give wrong results and the accuracy will not be high.

TF-IDF (Term Frequency – Inverse Document Frequency) is a popular method of representing text as a vector. TF-IDF weights the important words in the text: a high value shows the importance of a word in a document, but the weight is inversely related to how many documents of the collection contain that word.

TF (term frequency) measures how often a word appears in a document. Each text has a different length, so a word may appear many more times in a long document than in a short one; TF is therefore normalized:

tf(t, d) = f(t, d) / |d|

where f(t, d) is the number of occurrences of word t in document d, and the denominator |d| is the total number of words in document d.

IDF (Inverse Document Frequency) inverts the frequency of a term across the whole document set D, helping to better assess the importance of a word. When only TF is used, every occurrence counts the same, yet there are many words that occur frequently but carry little importance, for example "như", "thì", "là", "bị", ... in Vietnamese, or "is", "the", "and", "a", "an", ... in English. Using IDF reduces the importance of such words:

idf(t, D) = log(N / (1 + z))

where N is the number of texts in the set D and z is the number of texts in D containing the word t. In cases where the word t is not in the dictionary, z = 0, so we add 1 to the denominator to avoid division by zero; in addition, using the logarithm does not change the idf property. The final weight of a word is tf-idf(t, d, D) = tf(t, d) × idf(t, D).
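As an illustration, the sketch below computes TF-IDF vectors with scikit-learn's TfidfVectorizer. Note that scikit-learn uses a slightly different smoothed idf formula than the one above, but the effect is the same: words occurring in every document get lower weights.

    # A minimal TF-IDF sketch using scikit-learn's TfidfVectorizer.
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "con chó đang nằm canh nhà",
        "con mèo đang nằm ngủ",
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(documents)  # rows are documents, columns are words

    # Words occurring in both documents (e.g. "con", "đang", "nằm") get lower idf
    # and therefore lower weights than words unique to one document.
    print(vectorizer.get_feature_names_out())
    print(tfidf.toarray().round(3))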

3.3.3.3.N-Gram language model

Predicting is difficult—especially about the future, as the old quip goes. But how
about predicting something that seems much easier, like the next few words someone is
going to say. Predicting upcoming words, or assign probabilities to sentences is important
because probabilities are essential in any task in which we have to identify words in noisy,
ambiguous input, like speech recognition. For a speech recognizer to realize that you said I
will be back soonish and not I will be bassoon dish, it helps to know that back soonish is a
much more probable sequence than bassoon dish. For writing tools like spelling correction
or grammatical error correction, we need to find and correct errors in writing like Their are
two midterms, in which There was mistyped as Their, or Everything has improve, in which
improve should have been improved. The phrase There are will be much more probable
than Their are, and has improved than has improve, allowing us to help users by detecting
and correcting these errors. [https://web.stanford.edu/~jurafsky/slp3/3.pdf] So an n-gram model is a method that can solve this problem by computing P(w|h), the probability of a word w given some history h.
Suppose the history h is “its water is so transparent that” and we want to know the
probability that the next word is the:

P(the|its water is so transparent that).

One way to estimate this probability is from relative frequency counts: take a very
large corpus, count the number of times we see its water is so transparent that, and count
the number of times this is followed by the. This would be answering the question “Out of
the times we saw the history h, how many times was it followed by the word w”, as follows:

P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that)

An n-gram model looks only N−1 words into the past. Thus, the general equation for the n-gram approximation to the conditional probability of the next word in a sequence is

P(wn | w1:n−1) ≈ P(wn | wn−N+1:n−1)

In Vietnamese, for example, the segmented sentence

"thủ_tướng đức nhận_lời tham_dự lễ kỷ_niệm" (the German prime minister accepted the invitation to attend the anniversary ceremony)

→ thủ_tướng đức, đức nhận_lời, nhận_lời tham_dự, tham_dự lễ, lễ kỷ_niệm


Some examples of n-grams are bigrams, trigrams, and so on. The n-gram method is very useful for understanding sentences, but its memory cost is very large, so we cannot use this method with large n or very long sentences.
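The short sketch below shows how the bigrams above can be extracted from an already segmented sentence; the helper function here is only illustrative.

    # A minimal sketch of extracting word n-grams from a segmented Vietnamese sentence.
    # The input is assumed to be already word-segmented (compound words joined by "_").
    def ngrams(tokens, n):
        """Return the list of n-grams (as tuples) of a token sequence."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    sentence = "thủ_tướng đức nhận_lời tham_dự lễ kỷ_niệm"
    tokens = sentence.split()

    bigrams = ngrams(tokens, 2)
    # [('thủ_tướng', 'đức'), ('đức', 'nhận_lời'), ('nhận_lời', 'tham_dự'),
    #  ('tham_dự', 'lễ'), ('lễ', 'kỷ_niệm')]
    print(bigrams)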

3.3.3.4. Truncated Singular Value Decomposition

Matrix factorization is one of the most popular methods for reducing the size of a data set so that the hardware can save memory during computation, and Singular Value Decomposition (SVD) is one of the matrix factorization methods. Suppose we have an m×n matrix A; we can factorize the matrix like this:

A = U Σ V^T

where U is an m×m orthogonal matrix, V is an n×n orthogonal matrix, and Σ is an m×n matrix whose diagonal values σ1 ≥ σ2 ≥ ... ≥ 0 are the singular values.

In Truncated Singular Value Decomposition (low-rank approximation) we keep only the k largest singular values:

A ≈ A_k = U_k Σ_k V_k^T

where A is the matrix, k is the retained rank, and σ are the values on the diagonal of Σ. If we want to keep 90% of the information of the matrix, we calculate the ratio (σ1² + ... + σk²) / (σ1² + ... + σr²) and choose the minimum k for which it reaches 0.9.
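As an illustration, the sketch below reduces TF-IDF vectors with scikit-learn's TruncatedSVD. The tiny corpus is a placeholder; on the real data, n_components would be set to 300, the vector size used later in this thesis.

    # A minimal sketch of dimensionality reduction with scikit-learn's TruncatedSVD.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["văn bản thứ nhất", "văn bản thứ hai", "văn bản thứ ba"]  # placeholder texts

    tfidf = TfidfVectorizer().fit_transform(corpus)       # sparse document-term matrix
    svd = TruncatedSVD(n_components=2, random_state=0)    # use n_components=300 on real data
    reduced = svd.fit_transform(tfidf)                    # dense, low-dimensional vectors

    # Fraction of the variance ("information") kept by the retained components.
    print(svd.explained_variance_ratio_.sum())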

3.3.4. Machine Learning Model

3.3.4.1. Naive Bayes

Naive Bayes is the most classical method. The Naive Bayes classification method is based on Bayes' theorem, named after Thomas Bayes. Naive Bayes is a prime example of a simple solution that is also very powerful. Even with the remarkable advances of machine learning in recent years, the naive method is not only simple but also fast, accurate, and reliable, especially in the field of natural language processing, where it is used a lot in classification problems.
Based on the Bayes formula, we find the probability of a label given the words of the text. The prediction of a label for a text therefore depends on the frequency of occurrence of words and sentences and on their conditional probabilities. The algorithm applies

P(label | text) = P(text | label) × P(label) / P(text)

For example, consider a classification problem with c labels whose input vector x represents the words of a document. We call p(c | x) the probability that the sample falls into class c when we know the vector x. From there, data can be classified by determining the class with the highest probability.

Applying the Bayes formula to p(c | x) yields

p(c | x) = p(x | c) p(c) / p(x), with p(x) = Σc p(x | c) p(c)

We can omit p(x) because p(x) does not depend on c. For calculation convenience, we can assume that the components xi of the vector x are independent of each other when c is known:

p(x | c) = ∏i p(xi | c)

When the algorithm is executed, the values p(c) and p(xi | c) are taken from the training data set.

Training

+ For each label c, we compute its probability p(c).

+ For each element xi belonging to x, we calculate its probability p(xi | c) for class c.

Testing

Given the elements xi of a new sample x, we predict its label as ĉ = argmaxc p(c) ∏i p(xi | c).
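A minimal sketch of this training and testing procedure with scikit-learn's MultinomialNB is shown below; the tiny corpus and the labels are hypothetical stand-ins for the real Vietnamese news data set.

    # Bag-of-words counts feed a multinomial Naive Bayes model,
    # which estimates p(c) and p(word | c) from the training set.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "thủ_tướng thăm trung_quốc",          # politics
        "đội_tuyển thắng trận chung_kết",     # sports
    ]
    train_labels = ["chinh_tri", "the_thao"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    print(model.predict(["trận đấu bóng_đá tối nay"]))  # expected: ['the_thao']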

Advantages
The algorithm is simple, effective, fast, and saves time. It is widely used in classification problems and can give good predictions even when relatively little input data is available.

Disadvantages

The predicted probability can be wrong in some cases, so we should not rely too heavily on the probability values themselves. The naive independence assumption is only correct in some cases, and applying it to real-world data limits the algorithm's capabilities: the assumption does not work well when the features are interdependent, and because the model parameters are independent probability values, the interaction between them cannot be estimated.

3.3.4.2.Neural Network

The first neural network model appeared in 1944, proposed by Warren McCullough and Walter Pitts, two researchers at the University of Chicago who in 1952 moved to MIT as founding members of what is sometimes called the first cognitive science department.

3.3.4.3.Recurrent Neural Network

Recurrent neural networks were studied by David Rumelhart, an American psychologist, in 1986. In 1982 the Hopfield network, a special type of RNN, was discovered by the American scientist John Joseph Hopfield. In 1993 came a turning point for the RNN model, when it solved a deep learning task requiring more than 1000 layers.

As you know, neural networks are developed and function like the human nervous system. They consist of three main parts: the input layer (x), the hidden layers, and the output layer (y). The input data and the output data of a plain neural network are independent of each other, so it cannot be used for problems where the output depends on a sequence, such as predicting the next word or completing a sentence.

For example, when you are reading this sentence, each word contributes a piece of information that makes up the meaning of the whole sentence. Based on the sentence you just read, your brain stores the information and continues processing the semantics of the next sentence. This is a complicated process that a plain neural network cannot do, so the RNN was born to solve this problem. An RNN is able to recall information computed previously in order to give the most accurate prediction at the current step.

Training

For each time step t, the activation value h_t and the output y_t are calculated using the following formulas:

h_t = f(W_xh x_t + W_hh h_{t−1} + b_h) and y_t = g(W_hy h_t + b_y)

where the W matrices and the b vectors are the weight and bias parameters of the model, and f and g are the activation functions used in the model. Just like a normal neural network model, to perform the classification we perform the following steps:
1. Feed forward

2. Calculate loss

3. Back propagation

4. Perform the above steps with enough epochs.

Advantages

Unlike plain neural networks, an RNN can compute the output based on previous values. The weights are shared across all time steps. The RNN model is designed to memorize each previous value, and the model size does not change with the input size.

Disadvantages

The computation of the hidden states occurs sequentially, which makes calculation slow. Training an RNN model is very complicated, and it has difficulty remembering information that is too old, which easily leads to vanishing and exploding gradients.

3.3.4.4.Support Vector Machine

3.3.4.4.1. Original Algorithm

The Support Vector Machine algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis, two Soviet mathematicians, in 1963. This algorithm is well supported both in theory and in practice.

The main idea of the support vector machine is as follows: given a training set represented in a vector space, where each labeled data point is a point within this space, we look for a separating hyperplane. In 2D this is a line that divides the points of the space into two separate sides, temporarily called the positive side and the negative side, so that the two sides are separated from each other with the greatest possible gap. This greatest gap is called the margin between the two classes of data; the larger this margin, the more clearly the two classes are divided and the better the result. The goal of the support vector machine is to find the maximum margin between the two classes of data to give the best classification results.
The margin is the distance from the dividing line to the nearest point of each labeled class. In order to divide the labeled data best, we need the dividing line to make the margin as large as possible: the wider the margin, the easier it is to separate the labeled data.

Assume the data with the blue label is 1 and the data with the red label is −1, and the divider between them is the hyperplane w^T x + b = 0.

We have the following constraints:

label data = 1: w^T x + b ≥ 1

label data = −1: w^T x + b ≤ −1

Next, we select the two planes passing through the label-1 and label-(−1) data points, which are w^T x + b = 1 and w^T x + b = −1.

To calculate the margin, we use the formula for the distance from a point to a hyperplane in a space of d dimensions: the distance of any point x_n is |w^T x_n + b| / ||w||_2.

It is easy to see that, because of the way the plane divides the data, y_n (w^T x_n + b) always has the same sign, so this quantity is non-negative.

The margin is the minimum distance from the data points to the divider:

margin = min_n [ y_n (w^T x_n + b) / ||w||_2 ]

The goal of the support vector machine is to find w and b such that this margin is maximized:

(w, b) = argmax_{w,b} min_n [ y_n (w^T x_n + b) / ||w||_2 ]

Advantages
+ The support vector machine works quite well for separating labeled data
+ It handles multi-dimensional data problems quite well
+ SVM is relatively efficient in memory
Disadvantages
+ The separation is not good when the number of features of the data is much larger than the number of samples
+ SVM is not suitable for very large data sets
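A minimal sketch of a linear SVM text classifier with scikit-learn is shown below; the two training sentences and their labels are hypothetical stand-ins for the real data.

    # TF-IDF vectors are separated by a maximum-margin hyperplane (linear SVM).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train_texts = [
        "thủ_tướng thăm trung_quốc",       # politics
        "đội_tuyển thắng trận chung_kết",  # sports
    ]
    train_labels = ["chinh_tri", "the_thao"]

    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(train_texts, train_labels)

    print(model.predict(["thủ_tướng phát_biểu tại hội_nghị"]))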

3.3.4.4.2.Sequential Minimal Optimization (SMO)


[https://zeyiwen.github.io/papers/jmlr18-thundersvm.pdf]

The goal of training is to find a weight vector α that maximizes the value of the objective function F(α). For training, the Sequential Minimal Optimization (SMO) algorithm is used. It iteratively improves the weight vector until the optimality condition of the SVM is met. The optimality condition is reflected by an optimality indicator vector f = {f1, f2, ..., fn}, where f(i) is the optimality indicator for the i-th instance x(i) and can be obtained from the current weights.

In each iteration, the SMO algorithm has the following three steps:

Step 1: Select two extreme instances x(u) and x(l) based on their optimality indicators.

Step 2: Improve the weights of x(u) and x(l), denoted by α(u) and α(l).

Step 3: Update the optimality indicators of all the instances.

SMO repeats the above steps until the optimality condition is met.

Prediction: after the training, the trained SVM is used to predict the labels of unseen instances.

Chapter 4

Implement Algorithm

1. Data Pre-processing

1.1.Crawl Data:

At present, the increasing volume of news leads to an explosion of textual information, so the use of text, and in particular the classification of text by subject, should be well elaborated. This is a Vietnamese text classification task, so we need training data sets built from Vietnamese texts. The data set used here is the VNTC data set, and I also use the BeautifulSoup tool to enlarge the data set, which covers 10 topics: social politics, life, science, business, law, health, world, sports, culture, and computers. To make sure the model has enough knowledge of the words, we built a crawl bot to collect additional data from https://vnexpress.net/ to increase the data for the 10 topics.

In this code we use the Beautiful Soup library (released in 2004 by Leonard Richardson, it is a library for Python 2.7 and Python 3; earlier versions include BS and BS3, which were officially maintained from May 2006 to March 2012, and the current release is Beautiful Soup 4, released on May 17, 2020). Normally, in order to crawl data from a website, we need to know the API that the website provides. Sometimes no API is available, so we need some technique to get the information without an API. The Beautiful Soup library is used to get data from HTML and XML files; its parser extracts data from HTML, which is useful for scraping the web.

As mentioned, text is only unstructured data, so for the machine to understand it and perform automatic classification, we need to convert it to an appropriate, structured form. We separate each whole post on the website into sentences stored in files under the corresponding topic folder.
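A minimal sketch of collecting article text with requests and BeautifulSoup is shown below; the URL and the selector are only illustrative, since the real selectors depend on the current HTML layout of vnexpress.net.

    # Fetch a page and collect paragraph text with BeautifulSoup.
    import requests
    from bs4 import BeautifulSoup

    url = "https://vnexpress.net/"          # placeholder article or listing page
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Collect the text of paragraph tags; on a real article page a more specific
    # selector (e.g. the class of the article body) should be used.
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    article_text = "\n".join(paragraphs)
    print(article_text[:500])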
1.2. Characteristics Of The Vietnamese Language

Vietnamese is a monosyllabic language, that is, each syllable is pronounced separately and represented by its own written form. This feature is evident in all aspects of phonetics, vocabulary, and grammar.

1.2.1. Formatting The Characters

Vietnamese words can be typed with two input conventions, Telex and VNI. In different systems the resulting character encoding can differ, so formatting is needed during preprocessing. We define a function to normalize the characters to one consistent form.
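A minimal sketch of such a normalization function is shown below, assuming the goal is simply to bring every string into one canonical Unicode form (NFC) and lower case.

    # Normalize Vietnamese text so that the same accented character is always
    # represented by the same code points, then lower-case it.
    import unicodedata

    def normalize_text(text: str) -> str:
        """Return the NFC-normalized, lower-cased version of a Vietnamese string."""
        return unicodedata.normalize("NFC", text).lower()

    print(normalize_text("Thủ tướng"))  # -> "thủ tướng"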
1.2.2. Tokenization

Tokenization is the process of converting a sequence of characters into a series of tokens (a token is a sequence of characters with a specific meaning, representing a semantic unit in language processing). A token is often interpreted as a word, although this is not entirely correct. In English, for example, words are usually separated by spaces, but "New York" is still considered a single word even though it contains a space, so there is only one token in this case. Another example is "I'm", which contains the two words "I" and "am" even though there is no space between them; in this case, we have two tokens.

One point to note is the distinction between "word types" and "word tokens". Types are the distinct words present in a corpus, regardless of the number of occurrences of each word; for example, a word that appears 40 times in a paragraph is counted only once. Tokens count every occurrence: a sentence of 12 tokens may contain only 9 distinct words, with 3 of them repeated.
For Vietnamese text, we use the Gensim library and ViTokenizer for word segmentation and for converting words to vectors.
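A minimal sketch of Vietnamese word segmentation is shown below, assuming the ViTokenizer class from the pyvi package is the segmenter referred to here.

    # Segment a Vietnamese sentence into words (compound words joined by "_").
    from pyvi import ViTokenizer

    sentence = "Thủ tướng hôm nay vừa có một chuyến thăm ở Trung Quốc"
    segmented = ViTokenizer.tokenize(sentence)

    # Compound words are joined with underscores, e.g. "Thủ_tướng", "Trung_Quốc",
    # so that each segmented token can be treated as a single word downstream.
    print(segmented)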

1.2.4. Stopword

In computing and natural language processing, stop words are words that are filtered out before or after text data processing. Although stop words are generally the most common words in a language, there is no single list of stop words used by all natural language processing tools, and indeed not all tools use such a list. Any group of words can be chosen as stop words for a given purpose; usually these are the most common words, such as "the, is, at, which, and, on..." in English, or "bị, bởi, cả, thì, là, mà..." in Vietnamese.
However, for Vietnamese, an existing stop word list may not be effective on another data set, since some of its words may still be useful to the model. To make sure the model is good, we build our own stop word list based on word count frequencies.
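A minimal sketch of building such a frequency-based stop word list is shown below; the toy corpus and the cut-off are placeholders for the real data and threshold.

    # Build a corpus-specific stop word list from word frequencies.
    from collections import Counter

    documents = [
        "thủ_tướng hôm_nay có chuyến thăm ở trung_quốc",
        "đội_tuyển có trận thắng ở vòng chung_kết",
    ]  # placeholder for the real segmented corpus

    counts = Counter(word for doc in documents for word in doc.split())

    # Treat the most frequent words as candidate stop words; the cut-off here
    # (top 2 words) is arbitrary and would be tuned on the real data set.
    stop_words = {word for word, _ in counts.most_common(2)}
    print(stop_words)  # e.g. {'có', 'ở'} for this toy corpus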

1.2.5.Analyze topic requirements:

Regarding the problem of news classification, there are two main types of research. The first is to rely on machine learning methods, or to refine models derived from machine learning methods, to improve classification efficiency. The second is to take a data set, conduct training and testing, apply algorithms to perform classification, and evaluate the results.

Based on the knowledge mentioned above, this thesis follows the second form: collecting data, preprocessing it, labeling it, and applying the technical algorithms mentioned. Building on the strengths of document classification with existing techniques, I have the advantage of being able to read the existing literature, refer to predecessors, and reuse the algorithms for classification. There are also supporting libraries for the data preprocessing and feature engineering processes.

1.2.6. Level Word Embedding - Ngram Embedding - Bag Of Words :

We use pretrained embeddings to convert the text to vectors, then apply truncated SVD to reduce the dimensionality of the vectors.
2.Text Classification Structure By Machine Learning

2.1. Flow Chart

Based on a natural language processing pipeline model, I will conduct a step-by-step analysis to give everyone a better overview of the process.

The first element is the text data (Vietnamese text). These are paragraphs of text that I crawled from typical electronic news sites such as vnexpress.net; they are labeled during collection. Then we clean the data using methods such as tokenization, removing special characters, and creating a stop word list. After all the raw data processing steps, we have a clean data set with no interfering words left. We then have two ways to train on the data. The first direction assumes that, during crawling from the online news sites, we have already classified each text under a data label, so we can directly apply techniques to vectorize the input data. After the input data has been vectorized into numeric form, we can begin applying the data set to the algorithm methods described earlier and compare the prediction accuracy of each alternative.

This is also called the training step: the standardized data sets are fed to the algorithms so that the machine learns from them. After training, we can put a new text into the model and predict which topic it belongs to. The second direction uses the pre-trained word embedding technique to process a piece of news whose subject may or may not be known; its input goes to the automatic classification program and may or may not have been included in the training data. This is called data testing. To apply a training data set to an algorithm, we must already have gone through the data processing steps, removing stop words, special characters, and so on, and have vectorized the data set so that each news item in a topic is a corresponding vector built from word frequencies. After vectorizing the data set, we feed it to the training algorithm, also applying the technical tools to train on the data set. After training on the data set, we perform classification with the algorithmic models.

2.2. Implement Model

2.2.1. Neural Network

We use the sklearn library to define the backbone of the model, with input shape = 300 (the same shape as the truncated samples), hidden layer 1 = 2048, hidden layer 2 = 1024, hidden layer 3 = 512, and 10 labels.
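A minimal sketch of this backbone is shown below, assuming scikit-learn's MLPClassifier is the model meant here; the random arrays stand in for the real 300-dimensional truncated vectors and their 10 topic labels.

    # Multi-layer perceptron with the hidden layer sizes described above.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Placeholder data standing in for the real truncated document vectors.
    X_train = np.random.rand(20, 300)
    y_train = np.random.randint(0, 10, size=20)

    model = MLPClassifier(
        hidden_layer_sizes=(2048, 1024, 512),  # hidden layers 1-3 from the text
        max_iter=10,                           # kept small for this toy example
        random_state=0,
    )
    model.fit(X_train, y_train)
    print(model.predict(X_train[:3]))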
(Result figures: Level Word, N-gram, Bag of Words.)

2.2.2. Recurrent Neural Network:

We define the parameters as: input_shape = 300, hidden1 = 2048, hidden2 = 1024, hidden3 = 512, labels = 10, batch size = 512, epochs = 8.
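A minimal sketch of a recurrent model with these parameters is shown below, assuming Keras/TensorFlow as the framework; the exact layer layout of the thesis model is not specified here, so this configuration is only illustrative.

    # A small recurrent classifier over 300-dimensional document vectors.
    import numpy as np
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Input(shape=(1, 300)),            # sequences of 300-dim vectors
        keras.layers.SimpleRNN(2048),                  # hidden1
        keras.layers.Dense(1024, activation="relu"),   # hidden2
        keras.layers.Dense(512, activation="relu"),    # hidden3
        keras.layers.Dense(10, activation="softmax"),  # 10 topic labels
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # Placeholder data standing in for the real vectorized news articles.
    X_train = np.random.rand(32, 1, 300)
    y_train = np.random.randint(0, 10, size=32)
    model.fit(X_train, y_train, batch_size=512, epochs=8, verbose=0)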
(Result figures: Level Word, N-gram, Bag of Words.)
2.2.3.Naive Bayes:
(Result figures: Level Word, N-gram, Bag of Words.)

2.2.4. Support Vector Machine

We use the thundersvm library for the support vector machine, with SMO optimization for the calculation.
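A minimal sketch of the training step with thundersvm's scikit-learn-style SVC class is shown below; the random arrays are placeholders for the real document vectors and labels.

    # GPU-accelerated SVM trained internally with the SMO algorithm.
    import numpy as np
    from thundersvm import SVC

    X_train = np.random.rand(20, 300)           # placeholder for the real vectors
    y_train = np.random.randint(0, 10, size=20)

    model = SVC(kernel="linear")
    model.fit(X_train, y_train)
    print(model.predict(X_train[:3]))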
(Result figures: Level Word, N-gram, Bag of Words.)

2.2.5. Conclusion:

From these results, we can see some overall trends:

+ For the neural network, the RNN, and Naive Bayes, the Level Word and Bag of Words methods work really well, but N-gram has low performance.

+ For the Support Vector Machine, Level Word outperforms Bag of Words and N-gram.
Overall: in different languages, the embedding method and the natural language processing model will produce different results. One limitation of this thesis is that we did not use vocabulary from other sources, so the model cannot cover a word if the vocabulary we built does not contain it; this is the reason why the N-gram method has low performance. The last limitation is that none of the models can clearly understand what a sentence is really about when it contains many combinations of words from different categories. But in general, all the models give good results.

References

[3]
