MODULE :
COURSE CODE :
COURSEWORD LEADER :
DUE DATE :
WORD :
Commitment
This dissertation draws on documents, articles, and websites, all of which are listed in the references, and I cite each source where it is used.
I hereby certify that, apart from the cited references, all content and data in this essay were compiled by myself from research carried out under the supervision of Mr. Le Minh Nhat Trieu. I accept full responsibility for any violation of the regulations.
Acknowledgements
First, I would like to thank my loved ones, especially my parents, who have raised me and given me the best things in life, both material and moral.
Second, I want to thank FPT University and the University of Greenwich. Thank you to the school and the teachers who created the classes and subjects for students, and for the knowledge the teachers have passed on.
Thank you to Mr. Le Minh Nhat Trieu, who accompanied me during the past school year in the TopUp semester. He devoted his time and energy to supporting me as much as possible.
Finally, I would like to thank the teachers and friends who have accompanied me during the past 4 school years. Thank you for your valuable knowledge and your enthusiasm to help.
Abstract
The development of IT, and especially of data mining and machine learning, has changed our lives considerably. These techniques are now applied in many fields, from face and voice recognition to natural language processing. Natural language processing in particular is used in many areas of life today; for example, some places use robots capable of conversation to replace humans in communication, and a typical example is the surge of anti-epidemic robots during the COVID-19 period. In the field of mass media, newspapers and news production are getting more and more attention from the public. This brings a great deal of work: reading the title of every article in order to classify it by hand is a waste of time. This essay addresses that weakness of the media industry by classifying articles by topic automatically. Because time is limited, the topics are restricted to world, sports, life, law, health and similar categories. Articles are usually stored as natural language, that is, as unstructured data. The most straightforward way to classify them is to represent them in a vector space. However, before the text can be vectorized, the data must be preprocessed. The specific tasks are word segmentation, removing accents from sentences, and eliminating stop words. In this work I use a segmentation tool to separate words, then construct vectors with the BoW and Word2Vec methods. The results of classifying news with ML methods are then presented in a Jupyter notebook.
1. Introduction
In digital life, almost all information and knowledge, whether or not we are aware of it, is on the internet. This solves many human problems: documents can be stored without paper or pen, kept for a long time, and searched conveniently. Faced with this huge amount of information, however, categorizing it properly becomes an important concern. In practice this job is done manually and takes a lot of time and effort, so automatic classification is necessary.
Seeing this need, I decided to explore the steps for classifying information using ML. The data set consists of news taken from online news sites in Vietnamese, on which the classification methods are built and applied. This is a research project and also the subject of my graduation thesis.
The purpose of this essay is to show how machine learning can be used to categorize Vietnamese news and how I carried this out.
In today's industrialized life, the amount of data is increasing in both quality and quantity, but only a small part of that huge body of data has value. The desire to find and exploit information and value in that data has opened a new branch of the information technology industry: extracting knowledge from databases (Knowledge Discovery from Data).
Steps for data mining include:
● Identify the request and the associated data space (problem understanding and data understanding).
● Data mining, including identifying the target data to be exploited and the exploitation technique. The result will be an image or text source.
● Evaluation: assess and filter the obtained results based on the criteria.
● Deployment
The data mining process is repeated many times. Data mining is the process of extracting knowledge from a data set, and it draws on many fields such as IT, AI, databases, and mathematics. Common techniques include:
● Classification: a technique that assigns an object to one or more predefined classes.
● Regression: maps a data sample to a real-valued prediction variable.
● Clustering: a "cluster" is a group of data objects. Similar objects are placed in the same cluster.
2.2. What is machine learning
With the explosion of big data, classical algorithms have not performed as well in practice as they do on paper. The emergence of machine learning was therefore inevitable, and it has opened a new area for the IT industry.
Machine learning is a field of artificial intelligence concerned with researching and building techniques that allow systems to "learn" automatically from data in order to solve specific problems.
Training phase
+ The two components to explain in the training phase are the feature extractor and the main algorithms.
+ Raw input data is all the information we know about the data: for example, the value of each pixel for an image, every word and sentence for a text, or the signal for an audio file. This raw data is in an unstructured form. For the machine to understand and learn from it, we need to convert it into a numeric feature vector.
+ Prior knowledge about data: knowledge about the type of data also helps. For example, a text classification problem requires knowledge related to that problem. Since this essay deals with Vietnamese text analysis, we need stop word files or vocabulary sets for Vietnamese text.
+ Main algorithms: after features are extracted from the data set, they are fed to training algorithms such as classification, clustering, regression...
Testing phase
+ This step is much simpler because it is a forward pass of the training phase with new raw inputs: the feature extractor from the training phase is reused to create the corresponding feature vectors, and the trained algorithm is then used to predict the output.
Text data mining is the process of extracting information from articles or documents in text form. It is a multi-disciplinary problem that includes information retrieval, text analysis, information extraction, clustering, categorization… The next part presents in depth the categorization problem, which is the topic of this thesis.
Natural Language Processing is a branch of this field in which the input data is text stored in an unstructured form, such as text scraped from articles or data files saved in an unstructured format. As noted in the machine learning overview, we need knowledge related to the problem we want to solve. Some methods for extracting features from text in natural language processing are Bag of Words, TF-IDF, and Word Embedding.
Many research projects around the world have achieved positive results, for example Support Vector Machine (Joachims, 1998), k-Nearest Neighbor (Yang, 1994), Linear Least Squares Fit (Yang and Chute, 1994), Neural Network (Wiener et al., 1995), Naïve Bayes (Baker and McCallum, 2000), and Centroid-based (Shankar and Karypis, 1998). These methods are based on statistical probability or on word weight information in the text. For Vietnamese, however, there are still many limitations because of the difficulty of separating words and sentences. A token is often interpreted as a word, although this is not entirely correct. In English, for example, words are usually separated by spaces, but "New York" is still treated as a single word even though it has a space in between, so there is only 1 token in this case. Another example is "I'm", which stands for the words "I" and "am" even though there is no space between them; in this case we have 2 tokens.
As society develops and the needs of life increase, language processing methods are increasingly being applied in everyday life.
Example: in the business sector, Facebook uses NLP to keep track of trending topics and popular hashtags. NLP is also used to manage news through fake news detection and spam detection; spam detection technologies use NLP's text classification capabilities to scan emails and identify them as spam or phishing. Other uses include chatbots that reply to customers, text generation for creating new documents, and translation of text into another language. Many more tasks can be solved with Natural Language Processing and Machine Learning.
CHAPTER 3
3.1 Background
Because computers and the Internet are widely used, an extremely large amount of information is produced every day, and information now fills our lives. Most of it is stored as text. A large number of unstructured texts are posted and stored in web pages, digital libraries, and communities. Automatic methods are therefore necessary to help people manage and filter this information instead of doing so manually. Predicting class labels for online texts is required by a variety of applications. For example, in spam filtering, classification methods are used to identify junk information automatically. In news organization, most news is provided on the Internet and the amount is huge, so it is impractical to finish this task manually.
Classification is therefore one of the most important tasks for keeping the important texts we need and skipping the useless ones.
3.2. Application
Many applications today need the classification task. Depending on the classification task, there are different kinds of class sets. The most familiar application is keyword search on Google: the server uses classification to filter the theme of the information we are looking for and returns the results we want. Google, Yahoo, and Facebook also use classification to remove spam email and help people avoid accessing dangerous links.
3.3 Overview of Methods and Contributions
The goal of text classification is to predict the correct class label for a given text. The text classification task is defined over a set of training texts D={X1, ..., Xn}, where each text is labeled with a category value drawn from a set of k different discrete values indexed by {1, ..., k} [1]. All the texts are split into 2 subsets, training texts and test texts. The training texts are used to train a classification model with machine learning algorithms. The test texts are used to evaluate the performance of the model. For a test text whose category is unknown, the model is used to predict its category. Each category is assigned a label, a numeric value that represents the category. In practice, the text classification task amounts to computing a text's label.
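As a minimal sketch of this setup, the snippet below maps category names to numeric labels 1..k and splits a small labeled corpus into training and test subsets. The sample texts, the 25% test ratio, and the helper names `encode_labels` and `split_train_test` are illustrative assumptions, not the data or code used in this thesis.

```python
import random


def encode_labels(categories):
    """Map each distinct category name to a numeric label 1..k."""
    names = sorted(set(categories))
    return {name: index for index, name in enumerate(names, start=1)}


def split_train_test(samples, test_ratio=0.25, seed=0):
    """Shuffle the labeled samples and split them into (train, test)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (text, category) pairs standing in for crawled news.
corpus = [
    ("đội tuyển thắng trận chung kết", "sports"),
    ("bệnh viện tiếp nhận bệnh nhân mới", "health"),
    ("quốc hội thông qua luật mới", "law"),
    ("cầu thủ ghi bàn phút cuối", "sports"),
]

label_of = encode_labels(category for _, category in corpus)
dataset = [(text, label_of[category]) for text, category in corpus]
train_set, test_set = split_train_test(dataset)
```

The model would be fitted on `train_set` only, and `test_set` would be held back to measure how well the predicted labels match the true ones.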
For example: Thủ tướng hôm nay vừa có một chuyến thăm ở Trung Quốc (The Prime Minister has just made a visit to China today)
To identify the category of a sentence, the model needs knowledge of the vocabulary of the language and of the structure of the language.
Feature extraction is the process of converting the original raw data set into sets of attributes. This makes the raw data easier to work with, more compatible with prediction models, and improves predictive model accuracy.
Some of the text conversion models widely used in NLP today are Bag of Words, TF-IDF, and Word2Vec.
In 1954 Zellig Harris, a linguist and mathematician, mentioned the bag of words in a linguistic context in his article on distributional structure. According to Harris [4] in "Distributional Structure": "…for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use". This was the first premise for data scientists to research and develop the bag-of-words technique for data preprocessing in NLP.
The bag-of-words model is a popular method for converting unstructured text data into a simple vector space representation. It creates a dictionary containing the distinct words in the texts. Each sentence is then represented by a vector whose length equals the length of the dictionary, and each cell in the vector holds the number of occurrences of the corresponding word. As its name suggests, it is a bag containing the words of the text in no particular order, regardless of sequence or grammar in the sentence.
In practice, the bag-of-words model is mainly used as a tool for feature generation. After transforming the text into a "bag of words", we can calculate various measures to characterize the text. The most common feature calculated from the bag-of-words model is term frequency, namely the number of times a term appears in the text. For the example below, we can construct two lists that record the term frequencies of all the distinct words:
1. Con chó đang nằm canh nhà (The dog is lying down guarding the house)
2. Con mèo đang nằm ngủ (The cat is lying asleep)
["con", "chó", "đang", "ngủ", "mèo", "nằm", "nhà", "canh"]
Based on the dictionary created from the two sentences above, we create a vector that stores the frequency of each dictionary word in each sentence. The sentences are six and five words long, which gives the following vectors:
1. [1,1,1,0,0,1,1,1]
2. [1,0,1,1,1,1,0,0]
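The two vectors above can be reproduced with a few lines of Python. This is a minimal sketch: the helper names are illustrative, words are split on whitespace only, and the second sentence is an assumption reconstructed from its vector (it contains exactly the words "con", "mèo", "đang", "nằm", "ngủ").

```python
def build_vocabulary(sentences):
    """Return the distinct lower-cased words, in order of first appearance."""
    vocabulary = []
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]


sentences = [
    "Con chó đang nằm canh nhà",
    "Con mèo đang nằm ngủ",  # assumed reconstruction of the second sentence
]
# Dictionary in the same order as listed in the text.
dictionary = ["con", "chó", "đang", "ngủ", "mèo", "nằm", "nhà", "canh"]
vectors = [bag_of_words(sentence, dictionary) for sentence in sentences]
```

`build_vocabulary(sentences)` yields the same eight words in a different order; the order of the dictionary only changes the position of each count, not the information the vector carries.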
However, if a word appears more than once in a document, it still occupies only one position in the dictionary, and its count in the sentence vector increases accordingly.
● [0,1,1,1,0,0,2]
● [1,0,0,1,1,1,1]
3.3.3.2.TF-IDF
This method was introduced by Karen Spärck Jones in an article published in 1972 on "term specificity". Although it works well as a heuristic, its theoretical foundations were debated for at least 30 years afterwards. It nevertheless serves as a basis for the data preprocessing of NLP models.
As in the second example in the Bag of Words section, a problem arises when a paragraph contains many duplicate words or words that appear very often. Such words drown out the other words in the dictionary: if only the frequency of each word is considered, the classification gives wrong results and accuracy is low.
TF – term frequency (how often the word appears in the text). Texts have different lengths, so some words may appear many more times in a long document. TF is therefore normalized to account for the length of the text.
z is the number of text in set D containing the word t, with cases where the word t is
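The exact formulas are not shown in the text here, so the following is only a sketch under stated assumptions: TF is taken as the count of a term divided by the document length, and IDF as log(|D| / z), where z is the number of texts in D containing the term (returning 0 when z is 0). Other variants exist, for example adding 1 to the denominator. The function names and the three tokenized documents are illustrative.

```python
import math


def term_frequency(term, document):
    """Occurrences of the term divided by the number of tokens in the document."""
    return document.count(term) / len(document)


def inverse_document_frequency(term, documents):
    """log(|D| / z), where z is the number of documents containing the term."""
    z = sum(1 for document in documents if term in document)
    if z == 0:
        return 0.0  # assumed convention for terms that never occur
    return math.log(len(documents) / z)


def tf_idf(term, document, documents):
    """Weight of the term in one document of the collection."""
    return term_frequency(term, document) * inverse_document_frequency(term, documents)


# Hypothetical tokenized documents.
docs = [
    ["con", "chó", "canh", "nhà"],
    ["con", "mèo", "ngủ"],
    ["con", "chó", "ngủ", "ngủ"],
]
weight_common = tf_idf("con", docs[0], docs)  # word present in every document
weight_rare = tf_idf("mèo", docs[1], docs)    # word present in one document
```

A word such as "con" that occurs in every document receives weight 0, while a rarer word such as "mèo" keeps a positive weight, which is the effect the paragraph above describes.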
Predicting is difficult, especially about the future, as the old quip goes. But how about predicting something that seems much easier, like the next few words someone is going to say? Predicting upcoming words, or assigning probabilities to sentences, is important because probabilities are essential in any task in which we have to identify words in noisy, ambiguous input, like speech recognition. For a speech recognizer to realize that you said "I will be back soonish" and not "I will be bassoon dish", it helps to know that "back soonish" is a much more probable sequence than "bassoon dish". For writing tools like spelling correction or grammatical error correction, we need to find and correct errors in writing like "Their are two midterms", in which "There" was mistyped as "Their", or "Everything has improve", in which "improve" should have been "improved". The phrase "There are" will be much more probable than "Their are", and "has improved" than "has improve", allowing us to help users by detecting and correcting these errors. [https://web.stanford.edu/~jurafsky/slp3/3.pdf] The n-gram model addresses this problem through the task of computing P(w|h), the probability of a word w given some history h.
Suppose the history h is "its water is so transparent that" and we want to know the probability that the next word is "the":
$P(\text{the} \mid \text{its water is so transparent that})$
One way to estimate this probability is from relative frequency counts: take a very large corpus, count the number of times we see "its water is so transparent that", and count the number of times this is followed by "the". This would be answering the question "Out of the times we saw the history h, how many times was it followed by the word w", as follows:
$P(\text{the} \mid \text{its water is so transparent that}) = \frac{C(\text{its water is so transparent that the})}{C(\text{its water is so transparent that})}$
An n-gram model looks n−1 words into the past. Thus, the general equation for this n-gram approximation to the conditional probability of the next word in a sequence is
$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})$
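The relative-frequency estimate described above can be sketched as follows. The tiny corpus and the function names are illustrative assumptions; a real estimate would need a very large corpus.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def conditional_probability(tokens, history, word):
    """Estimate P(word | history) as C(history + word) / C(history)."""
    history = tuple(history)
    history_count = ngram_counts(tokens, len(history))[history]
    if history_count == 0:
        return 0.0  # the history was never observed
    joint_count = ngram_counts(tokens, len(history) + 1)[history + (word,)]
    return joint_count / history_count


# Hypothetical corpus in which the history occurs twice.
corpus = (
    "its water is so transparent that the fish are visible and "
    "its water is so transparent that you can see the bottom"
).split()

p_full_history = conditional_probability(
    corpus, "its water is so transparent that".split(), "the"
)
# Bigram approximation: only the single previous word is used as history.
p_bigram = conditional_probability(corpus, ["that"], "the")
```

In this corpus the full history is followed by "the" in one of its two occurrences, and the bigram approximation gives the same estimate from a much shorter history.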
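Matrix factorization is one of the most popular methods for reducing the size of a data set so that the hardware uses less memory during computation. Singular Value Decomposition (SVD) is one such matrix factorization method. Suppose we have an m×n matrix A; we can factorize it as $A = U \Sigma V^T$, where U is an m×m orthogonal matrix, Σ is an m×n diagonal matrix of singular values, and V is an n×n orthogonal matrix.
Truncated Singular Value Decomposition (low-rank approximation) keeps only the k largest singular values: $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, where A is the matrix, k is the rank, and $\sigma_i$ are the values on the diagonal of Σ.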
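As a minimal sketch of truncated SVD, the snippet below uses NumPy's `numpy.linalg.svd` to keep the k largest singular values of a small random matrix. The matrix, its size, and the helper name `truncated_svd` are illustrative assumptions; the thesis's own implementation is not shown here.

```python
import numpy as np


def truncated_svd(matrix, k):
    """Return the factors U_k, s_k, Vt_k for the k largest singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))  # hypothetical 6 samples with 4 features

k = 2
U_k, s_k, Vt_k = truncated_svd(A, k)
A_k = U_k @ np.diag(s_k) @ Vt_k  # rank-k approximation of A
reduced = U_k * s_k              # each sample described by k values instead of 4
```

`reduced` is the lower-dimensional representation that would be passed to a classifier, while `A_k` shows how much of the original matrix the k components retain.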
Naive Bayes is one of the most classical methods. The Naive Bayes classification method is based on Bayes' theorem, named after Thomas Bayes. It is a prime example of a simple solution that is also powerful. Even with the remarkable advancement of machine learning in recent years, the method remains simple, fast, and often accurate and reliable, especially in the field of natural language processing, where it is widely used for classification problems.
Based on Bayes' formula, we find the probability of a label given the probabilities of the observed words. The prediction of a label for a text therefore depends on the frequency of occurrence of its words and on the conditional probabilities of those words given the label. We apply P(label | text) = (P(text | label) * P(label)) / P(text). For example, consider a classification problem with several labels and an input vector x that represents a text. We call $p(c \mid x)$ the probability of falling into class c when we know the vector x. From there, data can be classified by determining the class with the highest probability:
$c = \arg\max_{c} p(c \mid x) = \arg\max_{c} \frac{p(x \mid c)\, p(c)}{p(x)} = \arg\max_{c} p(x \mid c)\, p(c)$
We can omit p(x) because p(x) does not depend on c. For convenience of calculation, we assume that the components of x are independent of each other when c is known:
$p(x \mid c) = \prod_{i} p(x_i \mid c)$
When the algorithm is executed, the values $p(c)$ and $p(x_i \mid c)$ are estimated from the training data set.
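A minimal sketch of this rule for word counts is given below. It assumes a multinomial model with Laplace smoothing (`alpha`), works in log space to avoid multiplying many small probabilities, and skips words that never appeared in training; these choices, the function names, and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels, alpha=1.0):
    """Estimate log p(c) and log p(word | c) from tokenized training documents."""
    vocabulary = {word for document in documents for word in document}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for document, label in zip(documents, labels):
        word_counts[label].update(document)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + alpha) / total)
            for word in vocabulary
        }
    return log_prior, log_likelihood


def predict(document, log_prior, log_likelihood):
    """Return the class maximizing log p(c) + sum_i log p(x_i | c)."""
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in document if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical tokenized training data.
train_docs = [
    ["bóng", "đá", "trận", "đấu"],
    ["cầu", "thủ", "ghi", "bàn"],
    ["bệnh", "viện", "bác", "sĩ"],
    ["thuốc", "bệnh", "nhân"],
]
train_labels = ["sports", "sports", "health", "health"]

log_prior, log_likelihood = train_naive_bayes(train_docs, train_labels)
predicted = predict(["trận", "đấu", "bóng"], log_prior, log_likelihood)
```

Dropping p(x) corresponds to comparing the unnormalized scores directly, since the same denominator would be subtracted from every class score.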
Training
Testing
Advantages
The algorithm is simple, effective, and fast to train. It is widely used in classification problems and can give good predictions even when relatively little training data is available.
Disadvantages
The predicted probabilities can be unreliable, so we should not rely too much on them. The independence assumption rarely holds for real-world data, which limits the algorithm when the features are interdependent. Because the model parameters are independent probability values, the interactions between features cannot be estimated.
3.3.4.2.Neural Network
Neural networks were first proposed in 1944 by Warren McCulloch and Walter Pitts, two researchers at the University of Chicago. In 1952 they moved to MIT as founding members of what is sometimes called the first cognitive science department.
Neural networks are modeled on, and function like, human neural networks. A network consists of 3 main parts: an input layer (x), hidden layers, and an output layer (y). In an ordinary neural network the inputs and outputs are treated as independent of each other, so it is not suited to problems such as completing a sentence, where each prediction depends on what came before.
For example, when you read this sentence, each word contributes a piece of information and together they make up the meaning of the whole sentence. Your brain stores the information from the sentence you have just read and uses it when processing the meaning of the next sentence. This is a process that an ordinary neural network cannot carry out, so the Recurrent Neural Network (RNN) was created to solve this problem. An RNN is able to recall information computed previously in order to give the most accurate prediction at the current step.
Training
For each time step, the activation value and the output are calculated from the current input and the activation of the previous step, using the weight and bias parameters of the model and its activation functions. Just as with an ordinary neural network model, classification is performed in the following steps:
1. Feed forward
2. Calculate loss
3. Back propagation
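The first two steps can be sketched with NumPy as below. The sketch assumes the standard vanilla RNN recurrence, a_t = tanh(W_aa a_{t-1} + W_ax x_t + b_a) and y_t = softmax(W_ya a_t + b_y), with a cross-entropy loss; the parameter names, sizes, and random data are illustrative, and back propagation is omitted.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def rnn_forward(inputs, W_ax, W_aa, W_ya, b_a, b_y):
    """Step 1 (feed forward): return the activation and output of every time step."""
    a = np.zeros(W_aa.shape[0])  # initial activation a_0
    activations, outputs = [], []
    for x in inputs:
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)  # depends on the previous activation
        activations.append(a)
        outputs.append(softmax(W_ya @ a + b_y))
    return activations, outputs


def cross_entropy(outputs, targets):
    """Step 2 (calculate loss): mean negative log-probability of the target classes."""
    return -sum(np.log(y[t]) for y, t in zip(outputs, targets)) / len(targets)


rng = np.random.default_rng(0)
input_size, hidden_size, n_classes = 4, 3, 2  # hypothetical sizes
W_ax = rng.normal(size=(hidden_size, input_size))
W_aa = rng.normal(size=(hidden_size, hidden_size))
W_ya = rng.normal(size=(n_classes, hidden_size))
b_a, b_y = np.zeros(hidden_size), np.zeros(n_classes)

sequence = [np.eye(input_size)[i] for i in (0, 2, 1, 3, 0)]  # one-hot inputs
targets = [0, 1, 1, 0, 1]
activations, outputs = rnn_forward(sequence, W_ax, W_aa, W_ya, b_a, b_y)
loss = cross_entropy(outputs, targets)
```

The same five parameter arrays are reused at every time step, so the function accepts a sequence of any length without any change in model size.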
Advantages
Unlike ordinary neural networks, an RNN computes its output taking previous values into account. The weights are shared across all time steps. The model is designed to remember previous values, and its size does not change with the size of the input.
Disadvantages
The Support Vector Machine algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis, two Soviet mathematicians, in 1963. The algorithm is well supported both in theory and in practice.
The main idea of the support vector machine is to take a training set represented in vector space, where each labeled data item is a point in this space, and to find a boundary that separates the labels. In 2D this boundary is a line that divides the points into two separate sides, which we temporarily call the positive side and the negative side. The line should separate the two sides with the greatest possible gap. This gap is the distance between the two data sides; the larger it is, the more clearly the two sides are divided and the better the results. The goal of the support vector machine is to find the maximum distance between the two data sides to give the best classification results.
The margin is the distance from the dividing line to the nearest point of each label. To divide the labeled data well, the dividing line must make the margin as large as possible: the wider the margin, the easier it is to separate the labeled data.
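Assume the blue points have label 1 and the red points have label -1, and the divider between them is the hyperplane $w^T x + b = 0$. For points with label 1 we have $w^T x + b > 0$, and for points with label -1 we have $w^T x + b < 0$.
Next, we select the 2 hyperplanes passing through the nearest data points of label 1 and label -1, which are $w^T x + b = 1$ and $w^T x + b = -1$.
To calculate the margin, we use the formula for the distance from a point $x_n$ with label $y_n$ to the hyperplane, $\frac{y_n (w^T x_n + b)}{\|w\|}$. Because $y_n$ always has the same sign as $w^T x_n + b$, this distance is never negative. The distance between the two selected hyperplanes is $\frac{2}{\|w\|}$.
The goal of the support vector machine is to find w and b such that the margin is maximal.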
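To make the margin concrete, the sketch below computes the distance y (w·x + b) / ||w|| from each labeled point to a candidate dividing line and takes the smallest one as the margin. The four points and the two candidate lines are hypothetical; an SVM solver would search for the w and b that maximize this value rather than compare two fixed candidates.

```python
import math


def signed_distance(w, b, x, y):
    """Distance y * (w.x + b) / ||w|| from point x with label y to the line."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points, labels):
    """Smallest distance from any labeled point to the dividing line."""
    return min(signed_distance(w, b, x, y) for x, y in zip(points, labels))


# Hypothetical 2D data: label 1 (blue) and label -1 (red).
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

margin_diagonal = margin((1.0, 1.0), -2.0, points, labels)  # line x1 + x2 = 2
margin_vertical = margin((1.0, 0.0), -1.0, points, labels)  # line x1 = 1
```

Both lines separate the two labels, but the diagonal one leaves a wider gap to the nearest points, so it is the better divider under the maximum-margin criterion.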
Advantages
+ The support vector machine works well for separating labeled data
+ It handles high-dimensional data well
+ SVM is relatively memory efficient
Disadvantages
+ It does not separate well when the number of features is much larger than the number of samples
+ SVM is not suitable for large data sets
In each iteration, the SMO algorithm has the following three steps:
where:
Step 2: Improve the weights of x(u) and x(i) denoted by a(u) , a(i)
Step 3: Update the optimality indicators of all the instances. The optimality indicator
SMO repeats the above steps until the optimality condition is met.
Prediction: after training, the trained SVM is used to predict the labels of unseen instances.
Chapter 4
Implement Algorithm
1. Data preprocessing
1.1.Crawl Data:
Vietnamese text can be typed with 2 conventions, Telex and VNI, and the convention differs between systems. The text therefore needs to be brought into a single format during preprocessing. We define a function that normalizes the text to a consistent form.
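The function used in this thesis is not shown here. As a minimal sketch, assuming the inconsistency appears as different Unicode representations of the same accented letter (a precomposed character versus a base letter followed by combining marks), Python's standard `unicodedata.normalize` can bring the text into one form. The function name `normalize_vietnamese` and the choice of the NFC form are assumptions.

```python
import unicodedata


def normalize_vietnamese(text):
    """Convert text to the precomposed NFC form and collapse repeated whitespace."""
    composed = unicodedata.normalize("NFC", text)
    return " ".join(composed.split())


# "tiếng" written with combining marks: e + circumflex + acute accent.
decomposed = "tie\u0302\u0301ng   Vie\u0323\u0302t"
# The same words written with precomposed characters.
precomposed = "ti\u1ebfng Vi\u1ec7t"

normalized = normalize_vietnamese(decomposed)
```

After normalization the two spellings compare equal, so the same word is no longer counted as two different dictionary entries.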
1.2.2. Tokenization
One point to note is the distinction between word "types" and word "tokens". Types are the distinct words present in a corpus, regardless of how many times each occurs: a word that appears 40 times in a paragraph is counted only once. Tokens are the running words: a sentence with 12 tokens may contain only 9 types, with 3 tokens being repetitions.
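The distinction can be checked with a short count. This sketch splits on whitespace only, and the example sentence is an illustrative assumption chosen to have 12 tokens and 9 types.

```python
from collections import Counter


def count_tokens_and_types(text):
    """Return (number of tokens, number of types, per-type counts)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return len(tokens), len(counts), counts


sentence = "the cat saw the dog and the bird saw a mouse today"
n_tokens, n_types, counts = count_tokens_and_types(sentence)
repeated = n_tokens - n_types  # tokens that repeat an earlier type
```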
For Vietnamese text, we use the Gensim library and ViTokenize to tokenize the text and convert words to vectors.
1.2.4. Stopword
In computing and natural language processing, stop words are words that are filtered out before or after text data processing. Although stop words are generally considered the common words of a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools use such a list. Any group of words can be chosen as stop words for a given purpose; usually these are the most common words, such as "the, is, at, which, and on..." in English. In Vietnamese they are "bị, bởi, cả, thì, là, mà...".
In Vietnamese, however, a stop word list built for one data set may not be effective for another, and some of the listed words may be useful to the model. To ensure a good model, we build our own stop word list based on word count frequencies.
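One possible way to derive such a list is sketched below: a word is treated as a stop word when it occurs in at least a given share of the documents. The document-frequency criterion, the 0.75 threshold, the function names, and the toy documents are all assumptions for illustration; the thesis's actual counting rule is not specified here.

```python
from collections import Counter


def build_stop_words(documents, min_document_ratio=0.75):
    """Return the words that occur in at least min_document_ratio of the documents."""
    document_frequency = Counter()
    for tokens in documents:
        document_frequency.update(set(tokens))  # count each word once per document
    threshold = min_document_ratio * len(documents)
    return {word for word, n in document_frequency.items() if n >= threshold}


def remove_stop_words(tokens, stop_words):
    """Drop every token that is in the stop word set."""
    return [token for token in tokens if token not in stop_words]


# Hypothetical tokenized documents.
documents = [
    ["tôi", "là", "sinh", "viên"],
    ["anh", "ấy", "là", "bác", "sĩ"],
    ["đây", "là", "tin", "thể", "thao"],
    ["bóng", "đá", "là", "môn", "thể", "thao"],
    ["hôm", "nay", "trời", "mưa"],
]
stop_words = build_stop_words(documents)
cleaned = remove_stop_words(documents[0], stop_words)
```

Because the list is computed from the data set itself, topic words that are frequent in only some categories stay below the threshold and remain available to the model.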
Regarding the problem of news classification, there are two main types of research. The first relies on machine learning methods, or refines models derived from them, to improve classification efficiency. The second starts from a data set, conducts training and testing, applies algorithms to perform the classification, and evaluates the classification based on the results.
This thesis follows the second form: data collection, data preprocessing, data labeling, and application of the algorithms mentioned above. Building on existing document classification techniques gives me the advantage of being able to read the literature in advance, refer to previous work, and reuse existing algorithms for the classification. There are also supporting libraries for the data preprocessing and engineering steps.
We use a pretrained embedding to convert the text to vectors, then use truncated SVD to reduce the dimensionality of the vectors.
2.Text Classification Structure By Machine Learning
The input consists of Vietnamese text data: paragraphs of text that I crawled from a typical electronic news site, vnexpress.vn. Labels are assigned to these texts during collection. We then clean the data with methods such as tokenization, removing special characters, and creating a stop word list. After all the raw data processing steps we have a clean data set with no interfering words. From here there are two directions. The first assumes that, while crawling data from online news sites, we have already assigned each text to a label; we can then directly vectorize the input data. Once vectorization has brought the data set into numeric form, we can apply the algorithms described earlier and compare the prediction accuracy of each alternative.
This is the training step: the standardized data sets are fed to the algorithms so that the machine learns from them. Once a model has been trained, we can put a new text into the model and predict which topic it belongs to. The second direction uses the pre-trained word embedding technique to process a piece of news whose topic may or may not be known. This text is the input to the automatic classification program and may or may not have been included in the training data. This is called the testing data. Applying a training data set to an algorithm means that we have already gone through the data processing steps (removing stop words, removing special characters, ...) and have vectorized the data set, so that each news item in a topic corresponds to a vector. The frequency of words is constructed based on bias. After vectorizing the data set, we feed it to the training algorithm, using supporting tools to train on the data set. After training, we evaluate the classification performance of each algorithmic model.
We use the sklearn library to define the backbone of the model, with input shape = 300 (the same shape as the truncated sample), hidden layer 1 = 2048, hidden layer 2 = 1024, hidden layer 3 = 512, and 10 output labels.
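One way to express this configuration with scikit-learn is sketched below. It assumes the model is an `MLPClassifier`; the text does not name the estimator class, and `max_iter` and `random_state` are placeholder values, not the settings used in this thesis.

```python
from sklearn.neural_network import MLPClassifier


def build_classifier():
    """Three hidden layers of 2048, 1024, and 512 units, as described above."""
    return MLPClassifier(
        hidden_layer_sizes=(2048, 1024, 512),
        max_iter=50,      # assumed value
        random_state=0,   # assumed value, for reproducibility
    )


classifier = build_classifier()
```

The input width (300) and the number of output labels (10) are not constructor arguments: scikit-learn infers them when `classifier.fit(X_train, y_train)` is called with `X_train` of shape (n_samples, 300) and `y_train` containing the 10 topic labels.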
[Figures: results for word-level, n-gram, and bag-of-words features]
2.2.3.Naive Bayes:
[Figures: results for word-level, n-gram, and bag-of-words features]
We use the thundersvm library for the support vector machine, with SMO optimization for the calculation.
[Figures: results for word-level, n-gram, and bag-of-words features]
2.2.5. Conclusion:
Refer
[3]