Proceedings of the 30th Hangul and Korean Information Processing Conference (2018)
Abstract
This paper proposes a multi-channel CNN that analyzes the sentiment of Korean sentences by passing their morphemes, syllables, and graphemes through separate convolutional layers simultaneously. In colloquial sentences containing typos, features that cannot be extracted from morphemes can still be extracted from syllables or graphemes. Although morpheme-based CNNs are widely used in Korean sentiment analysis, the multi-channel CNN model in this paper classifies sentence sentiment more accurately by considering morphemes, syllables, and graphemes at the same time. The proposed model classified sentiment about 4.8% more accurately on baseball comment data and about 1.3% more accurately on movie review data than a morpheme-based CNN.
1. Introduction

Sentiment analysis is a natural language analysis technology that identifies the sentiment of text; it is used to identify user opinions or to predict election results [1]. Traditional machine learning methods such as Naive Bayes and Logistic Regression have been widely used for sentiment analysis, but recently deep learning-based technologies have achieved high performance in many fields [2]. Although interest in opinion mining and sentiment analysis is growing, most sentiment analysis studies focus on English [3]. Since each language has its own grammatical rules, and there are grammatical differences between Korean and English, inaccurate results may be obtained in Korean sentiment analysis if the techniques and models used for English are applied as-is. High performance can be expected only when sentiment analysis is performed with models and technologies suited to the linguistic characteristics of Korean [4].

In this paper, we propose a multi-channel CNN (Convolutional Neural Network) model that is effective for sentiment classification of Korean sentences. The proposed model uses three different convolutional layers to receive morphemes, syllables, and graphemes as input. Online text in particular contains many abbreviations and spelling errors, which can cause considerable information loss at the morpheme level. By also extracting feature vectors from graphemes and syllables, information lost to typos and grammatical errors can be recovered at the grapheme level, and compound words and abbreviations can be captured at the syllable level. In addition, for new words not seen during training, features that cannot be extracted from morphemes can be extracted from graphemes and syllables. Our model, which classifies the sentiment of a sentence by taking the feature vectors found from morphemes, graphemes, and syllables all into account, can classify sentiment more accurately than existing CNN models. This was confirmed on two different online data sets.

2. Related research

Several studies have extended Y. Kim's CNN-based text classification research [5] and applied it to Korean [6, 7]. Y. Kim proposed a word-based CNN [5], but in Korean it suffers from a serious OOV (Out of Vocabulary) problem: because Korean is an agglutinative language, countless word forms can be created. Morphemes or syllables are therefore often used as input instead of words. Morphemes, the smallest meaningful units in Korean, are actively used in Korean text classification research. In [6], Naver movie reviews were segmented into morphemes with KoNLPy's Twitter morpheme analyzer to train word2vec embeddings, which were then used as input to a CNN. The overall structure is similar to the model in [5], and it classifies Naver movie review sentiment about 8% more accurately than a Naive Bayes model. [7] proposed a syllable-based CNN model for Korean that can make predictions even for new, unseen words. For example, a syllable-based model that has learned 'Google' or 'Galaxy Note' can make similar predictions for compound words and abbreviations derived from them, such as 'Google God', so the OOV problem is alleviated compared with a morpheme-based CNN [7]. Sentiment analysis studies using multi-channel CNNs, which form the basis of this paper, include [8, 9, 10].
In [8], it was confirmed that a multi-channel CNN that uses words and characters simultaneously is superior to a word-based or character-based CNN when classifying colloquial English sentences. In [9], performance improved when word embeddings such as word2vec, GloVe, and syntactic embeddings were used simultaneously through a multi-channel CNN for sentiment classification, compared with using a single word embedding. Recently, [10] classified the sentiment of movie reviews in Korean by simultaneously using the graphemes and morphemes of sentences. However, [10] did not consider syllables, and its experiments were conducted only on Naver movie review data, which, among online comments, is relatively standardized and has long sentences. In contrast, in this paper we not only improved sentiment classification performance with a multi-channel model that uses all three of morphemes, graphemes, and syllables, but also conducted experiments on two data sets with different characteristics. In our experiments, the final model recorded about 2.4% higher performance on baseball comment data and about 1.12% higher performance on Naver movie review data than a multi-channel model that did not consider syllables. Additionally, we present an optimized combination of Korean input units for CNNs through experiments combining several Korean subword units.

3. Korean sentence classification using multi-channel CNN

The multi-channel CNN model proposed in this paper takes morphemes, syllables, and graphemes as input, as shown in Figure 1. It is a variation of Y. Kim's CNN model [5], which takes only one subword level as input; our model can take morphemes, syllables, and graphemes at the same time, so documents can be classified by considering multiple subword levels rather than only one. The model was constructed to simultaneously use feature vectors extracted from morphemes, syllables, and graphemes. As shown in Figure 1, it has three input channels, and one sentence is divided into morphemes, syllables, and graphemes, which are used as input to the three channels. For example, when the sentence "엘지 화이팅!" ("Go LG!") is divided into morphemes, it becomes ["엘지", "화이팅", "!"], which goes into the first channel. When the same sentence is divided into syllables, it becomes ["엘", "지", "<space>", "화", "이", "팅", "!"], which is input to the second channel; a <space> token is inserted between words. Likewise, the sentence divided into graphemes, ["ㅇ", "ㅔ", "ㄹ", "<eoc>", "ㅈ", "ㅣ", ...], is used as input to the last channel; an <eoc> token is inserted between syllables.

The morpheme-based and syllable-based channels limit the maximum sentence length to 50 tokens: longer sentences are truncated and shorter sentences are padded. The grapheme-based channel limits the maximum length to 150 tokens and is truncated or padded in the same way. In each channel, the embeddings of the input values are learned through an embedding layer. We experimented with the pre-trained Korean FastText embeddings provided by Facebook Research, but there was no significant performance improvement, so randomly initialized 300-dimensional embeddings were used in this model. The output of the embedding layer is fed into a convolution layer consisting of three different filter window sizes. The outputs of the convolution layers are joined into one matrix through max-pooling, and the sentence is classified through a softmax layer.
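The syllable- and grapheme-level preprocessing described above can be sketched in plain Python using standard Unicode Hangul arithmetic. This is an illustrative reimplementation, not the authors' code: the <space> and <eoc> token names and the 50/150-token limits follow the paper, while the function names are ours.

```python
# Unicode jamo tables for decomposing composed Hangul syllables (U+AC00-U+D7A3).
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")              # 19 initial consonants
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")          # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + "none"

def to_syllables(sentence):
    """Syllable channel: one token per character, <space> between words."""
    return ["<space>" if ch == " " else ch for ch in sentence]

def to_graphemes(sentence):
    """Grapheme channel: jamo per syllable, <eoc> between syllables, <space> between words."""
    tokens = []
    words = sentence.split(" ")
    for wi, word in enumerate(words):
        for si, ch in enumerate(word):
            if si > 0:
                tokens.append("<eoc>")
            code = ord(ch) - 0xAC00
            if 0 <= code < 11172:                    # a composed Hangul syllable
                tokens.append(CHOSEONG[code // 588])
                tokens.append(JUNGSEONG[(code % 588) // 28])
                if code % 28:                        # final consonant, if present
                    tokens.append(JONGSEONG[code % 28])
            else:
                tokens.append(ch)                    # punctuation, Latin letters, digits
        if wi < len(words) - 1:
            tokens.append("<space>")
    return tokens

def pad_or_truncate(tokens, max_len, pad="<pad>"):
    """Fixed-length input: 50 tokens for the morpheme/syllable channels, 150 for graphemes."""
    return tokens[:max_len] + [pad] * max(0, max_len - len(tokens))
```

For the example sentence, `to_syllables("엘지 화이팅!")` yields ["엘", "지", "<space>", "화", "이", "팅", "!"], and `to_graphemes("엘지")` yields ["ㅇ", "ㅔ", "ㄹ", "<eoc>", "ㅈ", "ㅣ"], matching the channel inputs described above.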
4.1 Data
Table 2: Baseball comment sentiment examples

  Sentence                                                                    Sentiment
  "I'm 100% sure KIA will win"                                                Positive
  "Just get some rest after the season"                                       Negative
  "Good job"                                                                  Neutral
  "Well done Hyunsik!"                                                        Positive
  "I paid more and bought it for 210,000 won. Even if you wait 3 hours,
   you won't be able to get a ticket."                                        Negative
  "The injured area is also unique"                                           Neutral

Baseball comments: A sentiment corpus of baseball article comments was used as experimental data. The data set was produced in-house, and each comment is tagged as 'positive', 'negative', or 'neutral'; the three labels are evenly distributed across the training and test sets at a 1:1:1 ratio. As the examples in Table 2 show, the sentences are colloquial, frequently contain spacing and grammatical errors, and are often short. "Well done Hyunsik!" is tagged as positive, while the similar "Good job" is tagged as neutral. Expressions such as "You are good" can be used positively but, depending on the situation, also sarcastically, negatively, or neutrally, which makes sentiment analysis difficult. In addition, many sentences have ambiguous sentiment, such as "The injured area is also unique", which even a person finds hard to judge as neutral or negative.

4.2 Experimental models

Word-based CNN (Baseline): Y. Kim's CNN model [5] was used. Sentences were separated into word units and used as CNN input.

Morpheme-based CNN: The same as the word-based model, but with morphemes rather than words as input. A syllable-level morpheme analyzer based on bidirectional LSTM-CRFs was used [11]. After experimenting with several morpheme analyzers, the in-house analyzer distinguished the sentiment of the baseball comment data about 2% more accurately than morpheme-based CNNs using KoNLPy's Twitter or Kkma analyzers.

Syllable-based CNN: The same as the models above, but with syllables as input. During preprocessing, a <space> token was inserted between words.

Grapheme-based CNN: The same as the models above, but with graphemes as input. During preprocessing, when decomposing a sentence into graphemes, <eoc> tokens were inserted between syllables and <space> tokens between words.

Multi-channel CNN (morpheme + syllable + grapheme): The final model proposed in this paper, a multi-channel CNN that uses all of morphemes, syllables, and graphemes. Preprocessing for each subword is the same as in the models above. In addition to this final model, multi-channel CNNs using combinations of two of the three subword units were also tested for comparison.

The values shown in Table 4 are the final hyperparameters found by grid search. The number of epochs was set to 100 with early stopping, dropout was set to 0.5 in all models, and the Adam optimizer was used. In all models, three filter window sizes were used: 3, 4, and 5.
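As a concrete illustration of the architecture in Section 3 and the models above, the forward pass of one channel (embedding lookup, convolutions with windows 3/4/5, max-over-time pooling) and the three-channel combination can be sketched in NumPy. This is a toy forward pass with random, untrained weights, not the authors' implementation; the vocabulary sizes and the 4-filters-per-window setting are illustrative, and the embedding dimension is reduced from the paper's 300 for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_features(token_ids, vocab_size, emb_dim=8, filter_sizes=(3, 4, 5), n_filters=4):
    """One CNN channel: embedding lookup -> 1-D convolutions -> max-over-time pooling."""
    emb = 0.1 * rng.standard_normal((vocab_size, emb_dim))  # randomly initialised, as in the paper
    x = emb[token_ids]                                      # (seq_len, emb_dim)
    pooled = []
    for w in filter_sizes:
        filt = 0.1 * rng.standard_normal((n_filters, w * emb_dim))
        windows = np.stack([x[i:i + w].ravel()              # slide a w-token window
                            for i in range(len(token_ids) - w + 1)])
        conv = np.maximum(windows @ filt.T, 0.0)            # ReLU feature maps
        pooled.append(conv.max(axis=0))                     # max-over-time pooling
    return np.concatenate(pooled)

def multichannel_predict(morpheme_ids, syllable_ids, grapheme_ids, n_classes=3):
    """Concatenate the three channels' pooled features and classify with softmax."""
    feats = np.concatenate([
        channel_features(morpheme_ids, vocab_size=100),     # morpheme channel (padded to 50)
        channel_features(syllable_ids, vocab_size=200),     # syllable channel (padded to 50)
        channel_features(grapheme_ids, vocab_size=60),      # grapheme channel (padded to 150)
    ])
    w_out = 0.1 * rng.standard_normal((n_classes, feats.size))
    logits = w_out @ feats
    p = np.exp(logits - logits.max())
    return p / p.sum()                                      # positive / neutral / negative

# Toy padded inputs of the lengths used in the paper.
probs = multichannel_predict(np.zeros(50, dtype=int),
                             np.zeros(50, dtype=int),
                             np.zeros(150, dtype=int))
```

In a trained version, the embeddings, filters, and output weights would be learned jointly with dropout 0.5 and the Adam optimizer, as described above.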
4.4 Evaluation criteria

Since the sentiment classes are evenly distributed, accuracy was used to evaluate the overall performance of each model. In addition, the F1 score of each sentiment class was included in the experimental results to show per-class classification performance.
Table 5: Sentiment classification results (left: baseball comments, right: movie reviews)

  Model                                                Accuracy  Positive(F1)  Neutral(F1)  Negative(F1) | Accuracy  Positive(F1)  Negative(F1)
  Word-based CNN (Baseline)                            45.03%    55.85%        32.41%       39.10%       | 79.17%    80.68%        77.40%
  Multi-channel CNN (syllable + grapheme)              60.25%    66.43%        50.48%       63.64%       | 84.96%    85.02%        84.90%
  Multi-channel CNN (morpheme + syllable)              63.14%    69.94%        57.79%       62.48%       | 85.33%    85.18%        85.48%
  Multi-channel CNN (morpheme + grapheme)              64.56%    70.08%        58.31%       65.37%       | 85.15%    85.38%        84.92%
  Multi-channel CNN (morpheme + syllable + grapheme)   66.95%    71.46%        60.88%       68.91%       | 86.27%    86.42%        86.11%
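The accuracy and per-class F1 scores reported in Table 5 follow their standard definitions, which can be computed in a few lines of Python. The label lists below are hypothetical, for illustration only.

```python
def accuracy(gold, pred):
    """Fraction of sentences whose predicted sentiment matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1(gold, pred, label):
    """Per-class F1: harmonic mean of precision and recall for one sentiment label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical gold labels and model predictions for four comments.
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "positive", "negative"]
```

Here `accuracy(gold, pred)` is 0.5, and `f1(gold, pred, "positive")` is 0.5 (precision 0.5, recall 0.5).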
Example sentences with gold labels and each model's prediction

  Sentence                                                                     Gold      Multi-channel  Morpheme  Syllable  Grapheme
  "Even though our team has lost in a row, the players are in a great mood…"   Positive  Positive       Negative  Positive  Positive
  "This is the best scammer of all time,,,"                                    Negative  Negative       Neutral   Negative  Positive
  "There are no bad comments… You must have had a successful life"             Positive  Positive       Positive  Negative  Negative
  "For domestic use" "Suarez in soccer"                                        Negative  Neutral        Neutral   Positive  Positive
5. Experimental results

The sentiment analysis results of each CNN model are summarized in Table 5. In the example sentence "Even though our team has lost in a row, the players are in a great mood…", the morpheme 'lost in a row' carries negative sentiment, so the morpheme-based CNN classifies the sentence as negative; had the expression 'great mood' also been decomposed into meaningful morphemes, the sentence could have been distinguished correctly.
The three-channel CNN also appears to achieve higher performance than the multi-channel CNNs that use only morphemes and syllables or only morphemes and graphemes: when three subword levels are used simultaneously, the sentiment of a sentence can be determined more accurately because more diverse feature vectors are considered. For example, "This is the best scammer of all time,,," was classified incorrectly by the morpheme-based and grapheme-based CNNs, which could not extract the negative features of the sentence, whereas the syllable-based CNN, which considers features across syllables, correctly classified it as negative. Conversely, features that cannot be extracted by the syllable-based CNN can be extracted from morphemes and graphemes, so sentiment classification performance is highest when all three subword levels are used together.

In the baseball comments, implicit or ambiguous sentences such as "For domestic use" are difficult to classify accurately without knowing the context of the sentence. Because the experimental models often classify such sentences as neutral, their performance on the neutral class is generally lower than on the positive or negative classes. On the movie review data, which has no neutral class, the final multi-channel CNN model classifies sentiment with up to 86.27% accuracy. The performance improvement from the multi-channel CNN was larger on the baseball comment data than on the movie review data. This appears to be because movie review sentences are long and the training data is large, so the sentiment of most sentences can be classified from morphemes alone, whereas the baseball comments contain many short, colloquial sentences, including neutral ones, that are difficult to classify using morphemes alone. By verifying the model on two data sets with different characteristics, we confirmed that the model presented in this paper is effective for sentiment classification of online data.

6. Conclusion

In this paper, through several experiments, we proposed a multi-channel CNN based on morphemes, graphemes, and syllables that is effective for sentiment analysis of colloquial Korean. We confirmed that feature vectors extracted from syllables and graphemes can be used complementarily with morpheme-based feature vectors, alleviating the out-of-vocabulary (OOV) problem of morpheme-based CNNs. By achieving higher sentiment classification accuracy than the commonly used morpheme- and syllable-based CNNs, the model was shown to have potential for use in other studies that classify colloquial Korean.

References

[1] H. Lee and S. Lee, Implementation of sentiment analysis and sentiment information attachment system, KIPS Tr. Software and Data Eng., Vol. 5, No. 8, pp. 377-384, 2016.
[2] pp. 513-520, 2011.
[3] M. Bautin, L. Vijayarenu and S. Skiena, International Sentiment Analysis for News and Blogs, ICWSM, 2008.
[4] H. Jang and H. Shin, Language-Specific Sentiment Analysis in Morphologically Rich Languages, COLING '10: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 498-506, 2010.
[5] Y. Kim, Convolutional Neural Networks for Sentence Classification, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pp. 1746-1751, 2014.
[6] W. Kim and K. Park, Design of Hangul text sentiment classifier using convolutional neural network, Proceedings of the 2017 Korean Computer Science Conference of the Korean Institute of Information Scientists and Engineers, pp. 642-644, 2017.
[7] S. Choi et al., A Syllable-based Technique for Word Embeddings of Korean Words, Proceedings of the First Workshop on Subword and Character Level Models in NLP, pp. 36-40, 2017.
[8] J. Park and P. Fung, One-step and Two-step Classification for Abusive Language Detection on Twitter, Proceedings of the First Workshop on Abusive Language Online, pp. 41-45, 2017.
[9] Y. Zhang, S. Roller and B. Wallace, MGNC-CNN: A Simple Approach to Exploiting Multiple Word Embeddings for Sentence Classification, Proceedings of NAACL-HLT 2016, pp. 1522-1527, 2016.
[10] K. Mo et al., Text Classification based on Convolutional Neural Network with Word and Character Level, Journal of the Korean Institute of Industrial Engineers, Online, 2018.
[11] H. Kim et al., Syllable-level morpheme analyzer using part-of-speech distribution and Bidirectional LSTM-CRFs, Proceedings of the 28th Hangul and Korean Information Processing Conference, 2016.