
● Generate word cloud:

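A word cloud is driven by a word-frequency table. The sketch below computes such a table with the standard library only; the stop-word list and sample text are illustrative, and the actual rendering would use a third-party package such as `wordcloud`.

```python
from collections import Counter
import re

def word_frequencies(text, stopwords=frozenset({"the", "of", "and", "in", "a"})):
    """Count lowercase word frequencies, ignoring a small stop-word list."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stopwords)

freqs = word_frequencies("Cultural relic tourism and the cultural heritage development")
# The frequency table would then feed a renderer, e.g. the third-party
# `wordcloud` package: WordCloud().generate_from_frequencies(freqs)
```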
● Identify the most frequent words (N-gram):


An N-gram model predicts the occurrence of a word based on the occurrence of
its N-1 previous words. The results of the word-frequency analysis are as
follows:
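The N-gram counting step can be sketched with the standard library alone; the token list below is illustrative, not taken from the study's corpus.

```python
from collections import Counter

def top_ngrams(tokens, n=2, k=5):
    """Return the k most frequent n-grams in a token list."""
    grams = zip(*(tokens[i:] for i in range(n)))  # sliding windows of length n
    return Counter(grams).most_common(k)

tokens = "cultural relic tourism development of cultural relic".split()
top3 = top_ngrams(tokens, n=2, k=3)  # bigrams ranked by frequency
```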

Figure 4: Most frequent words occurring in the “Title” section

Figure 4 shows the most frequent words in the “Title” section. The words used most
often in the titles are relic, cultural, and tourism, followed by heritage and
development.
Figure 5: Most frequent words occurring in the “Abstract” section

Similar to figure 4, the most frequently used words in the abstracts are cultural, relic, and
tourism. There is only a minor difference in word frequency between figure 4 and figure 5:
the words in the title and abstract are almost the same, and only their frequency order is
shuffled. The most frequent words used in the titles are relic, cultural, and tourism, while
the most frequent words used in the abstracts are cultural, relic, and tourism.

1. BERT-based Text Classification:

For text classification via machine learning, we used Bidirectional Encoder Representations
from Transformers (BERT). BERT is an architecture based on the transformer. The transformer
model was introduced by Google and employs a self-attention mechanism that is well suited to
language understanding. Language models generally work with one-directional training aimed at
next-word prediction, but BERT uses bidirectional training. The BERT model is limited to the
encoder role: it reads and processes text input. Two training objectives enable BERT to become
a bidirectional model: 1) MLM (masked language modeling) and 2) NSP (next sentence
prediction). In this study, the process for the BERT-based classification is as follows:

1. Load Pre-Trained Embedding: After the dataset analysis, three categories were decided:
   “Culture”, “Heritage”, and “Tourism”. In the next step, pre-trained embeddings are
   downloaded through the Gensim API and checked for words similar to the above-mentioned
   categories.
2. Create Dictionary: Three clusters are created, one for each of the categories “Culture”,
   “Heritage”, and “Tourism”. Every cluster consists of 30 similar words.
3. Word Embedding:

Figure 6: Word Embedding


4. Apply BERT Model:

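Steps 1 and 2 above can be sketched as a small helper that builds a category cluster from any Gensim-style embedding model (anything exposing `most_similar`). The stub model and its neighbour lists are purely illustrative; in the study the vectors would come from a pre-trained embedding loaded via the Gensim downloader API.

```python
def build_cluster(model, seed, topn=30):
    """Collect the seed word plus the `topn` most similar words from a
    gensim-style KeyedVectors object (duck-typed on `most_similar`)."""
    return {seed} | {word for word, _score in model.most_similar(seed, topn=topn)}

# Stand-in for a real embedding model, e.g. one loaded with:
#   import gensim.downloader
#   model = gensim.downloader.load("glove-wiki-gigaword-100")  # assumed model name
class StubModel:
    def most_similar(self, seed, topn=30):
        neighbours = {"culture": ["heritage", "art", "tradition"]}  # illustrative
        return [(w, 0.9) for w in neighbours.get(seed, [])][:topn]

cluster = build_cluster(StubModel(), "culture", topn=3)
```

With a real pre-trained model, calling `build_cluster` once per category (“Culture”, “Heritage”, “Tourism”) with `topn=30` yields the three 30-word dictionaries described in step 2.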