Article
The Study on the Text Classification Based on Graph
Convolutional Network and BiLSTM †
Bingxin Xue *, Cui Zhu, Xuan Wang and Wenjun Zhu

Faculty of Information Technology, Beijing University of Technology, Beijing 100020, China


* Correspondence: xuebx163@163.com
† This paper is an extension version of the conference paper: Xue, B.; Zhu, C.; Wang, X.; Zhu, W. The Study on the
Text Classification Based on Graph Convolutional Network and BiLSTM. In Proceedings of the 8th
International Conference on Computing and Artificial Intelligence (ICCAI 2022), Tianjin, China,
18–21 March 2022; pp. 323–331.

Abstract: Graph Convolutional Neural Network (GCN) is widely used in text classification tasks, and it has been effectively used to accomplish tasks that are thought to have a rich relational structure. However, due to the sparse adjacency matrix constructed by GCN, GCN cannot make full use of context-dependent information in text classification, and it is not good at capturing local information. The Bidirectional Encoder Representation from Transformers (BERT) has the ability to capture contextual information in sentences or documents, but it is limited in capturing global (corpus-level) information about vocabulary in a language, which is the advantage of GCN. Therefore, this paper proposes an improved model to solve the above problems. The original GCN uses word co-occurrence relationships to build text graphs. Word connections are not abundant enough and cannot capture context dependencies well, so we introduce a semantic dictionary and dependencies. While the model enhances the ability to capture contextual dependencies, it lacks the ability to capture sequences. Therefore, we introduce BERT and a Bi-directional Long Short-Term Memory (BiLSTM) network to perform deeper learning on the features of the text, thereby improving the classification effect of the model. The experimental results show that our model is more effective than previous research reports on four text classification datasets.

Keywords: text classification; graph convolutional network; dependencies; Bi-directional Long Short-Term Memory; ResNet
1. Introduction
Text classification is the basic work of natural language processing tasks, and it is also one of the hot spots in its research field. Text classification is mainly used to manage texts effectively, and it plays an important role in application fields such as information retrieval, public opinion analysis, topic classification, spam screening, and opinion mining. Traditional text classification models are mainly based on machine learning algorithms, such as Support Vector Machines (SVM) [1] and naive Bayes [2]. These algorithms are relatively mature and have achieved good results. Of course, they also have limitations, such as high dimensionality and data sparsity in text classification, insufficient representation of text semantic features, etc. Deep learning methods can extract text features that are richer in semantic information, and they can greatly reduce the cost of manual intervention. Existing text classification models based on deep neural networks include typical neural network methods (e.g., Convolutional Neural Network (CNN) [3], Recurrent Neural Network (RNN) [4], and Capsule Network [5]), their variants (e.g., Long Short-Term Memory (LSTM) [6] networks, Gated Recurrent Unit (GRU) [7], and Convolutional Recurrent Neural Network (CRNN) [8]), and various models based on attention [9] (e.g., BERT [10]). These deep learning models can well capture the semantic and syntactic information in local continuous word sequences, but most of them may
ignore information carried by non-consecutive semantic words. Moreover, they are limited in their ability to capture the global features of the information [11]; for example, global word co-occurrence information, global syntactic structure information, and the global dependencies between all words also provide useful guidance for classification. The
Graph Convolutional Neural Network [12] method has been a research hotspot in recent
years. This model performs graph convolution operations on the constructed topological
relationship graph to obtain features, thereby realizing text classification. In order to solve
the classification problem, many variants of GCN [13–16] have been proposed and explored.
Although GCN is gradually becoming a good choice for graph-based text classification, the
current research still has some shortcomings that cannot be ignored.
In the recent study on text classification, researchers represented text as a graph
structure and captured the structural information of the text as well as the discontinuous
and long-distance dependencies between words through a Graph Convolutional Network.
The most representative work is Graph Convolutional Network (GCN) [12] and its variant
Text GCN [13]. This method represents words as nodes and aggregates the neighborhood
information of the nodes through an adjacency matrix, thus integrating the global context
of domain-specific languages to some extent. When the graph convolutional network
model builds text graphs, most of them use the co-occurrence relationship of words and
the inclusion relationship between documents and words, resulting in a lack of richer
relationship information expression between words and words and the insufficient capture
of long-distance dependencies of words in sentences. The measure of a word co-occurrence
relationship generally uses normalized point-wise mutual information (NPMI) to calculate
the weight between two word nodes. However, the calculation of NPMI depends on the
corpus. If some words have a low probability of appearing in the corpus, this method may make the calculated NPMI value very small, so the two words are judged to have little or no similarity.
In addition, graph convolution based on word co-occurrence relationships and seman-
tic dictionary composition can only capture the relationship of non-continuous entities in
the text and cannot simultaneously capture the short-distance and long-distance depen-
dencies of words in sentences. For example, a text graph is constructed from a corpus
of word co-occurrence relations and semantic dictionaries, as shown in Figure 1. Here
“This movie can be described in one word: wonderful.” is a sentence in the corpus. Taking
“wonderful” as the center, the blue circle represents the aggregation of one-level neighbor
node information, and the green circle represents the aggregation of two-level neighbor
node information. Among them, “wonderful” is used to modify “movie” in the sentence,
and it is very useful for analyzing the expression of the text as a whole. However, the two
word nodes “movie” and “wonderful” in Figure 1 are not directly related, but there should
be a certain dependency between them. Graph Convolutional Network only aggregates
direct neighbor node information, so one layer of GCN can only capture short-distance
dependencies of words in sentences. Capturing long-range dependencies between words
such as “movie” and “wonderful” can be solved by increasing the number of GCN layers.
However, current research shows that multi-layer graph convolutional networks for text
classification tasks incur high space complexity. At the same time, increasing the number
of network layers will also make the local features converge to similar values. To overcome
this problem, we introduce dependencies when building the text graph. Dependency
syntax is mainly concerned with the relationship between words (sentence components) in
a sentence, and it is not affected by the physical location of the components. By introducing
dependencies, short-range dependencies and long-range dependencies can be well utilized
while also providing syntactic constraints and partially reducing the number of GCN layers.
However, due to the characteristics of the graphs, when learning word features, graph
convolutional networks do not effectively capture the word order that is very useful for
them, resulting in the loss of contextual semantic information of the text. For example,
“You have a lot of hard work.”, “You have to work hard.”, where the word “work” will
change meaning depending on its position in the contextual sentence. The graph formed by
the words in these two sentences cannot reflect the order between the words, so it cannot
represent the characteristics of word order changes. Obviously, this is very important
for understanding the meaning of the sentence. In addition, the graph convolutional
network cannot distinguish polysemy; for example, “She’s grown into quite an individual.”,
“So this individual came up.”, the meaning of the word “individual” expresses different
semantics depending on the context. In the first sentence, it is the meaning of “person with
personality”, and in the second sentence, it has the meaning of “weird person”. In this
case, if not differentiated, it may affect the understanding of the text, thereby affecting the
classification effect. Leveraging multilayer multi-head self-attention and positional word
embeddings, BERT mainly focuses on local contiguous word sequences that provide local
contextual information. By pre-training a large amount of text data in different fields, BERT
can well infer the meaning that polysemous words should express in the current context.
However, BERT may have difficulty interpreting global information about vocabulary.

Figure 1. Text Graph.

To improve the above-mentioned problems, this paper proposes an improved model (IMGCN) for text classification. On the basis of Vocabulary Graph Convolution (VGCN) [15], a semantic dictionary (WordNet) and dependencies are introduced, and the features of GCN and BERT are further learned through a residual bi-layer BiLSTM with an attention mechanism [17]. By building dependencies, GCNs can be made to capture long-range dependencies of words in sentences more efficiently. WordNet was introduced to provide more useful connections between words. At the same time, the BERT model is used to extract the local feature information of a text. The global vocabulary information obtained by GCN is combined with the context information obtained by pre-trained BERT. Considering that the text has a hierarchical structure of word-sentence-document, in order to obtain more comprehensive text features, we use hierarchical BiLSTM to extract the combined features hierarchically to capture word, sentence, and document feature information, respectively. In the process of word-to-sentence feature extraction, the attention mechanism is used to generate sentence embeddings, which solves the problem that keyword features cannot be paid attention to in text classification. We changed the initialization method of the context vector in the attention mechanism and combined the bidirectional word hidden features as the initial value of the context vector to guide the learning of the attention mechanism. By introducing residual connections, the neural network degradation problem
that occurs when stacking multi-layer network models can be alleviated, and the model
can learn residuals to better obtain new features.
We call the proposed model IMGCN. Experiments on four benchmark datasets demonstrate that IMGCN effectively improves on current GCN-based methods and achieves better results. The main contributions of this paper are as follows:
• The dependency and semantic dictionary are fully integrated into IMGCN. They can
enrich the connections between words and capture the long-range dependencies of
words in sentences while providing syntactic constraints and partially reducing the
number of GCN layers.
• We combine graph features and sequence features, using a hierarchical BiLSTM to extract input features hierarchically and capture word-level and sentence-level feature information, respectively, so as to obtain more comprehensive text features.

2. Related Work
Traditional text classification methods are mainly based on feature engineering. Fea-
ture engineering includes the bag-of-words model [18] and n-grams, etc. Later, there was
research [19,20] that converted text into graphics and performed feature engineering on
graphics. However, these methods cannot automatically learn the embedding represen-
tation of nodes. Compared with traditional methods, deep learning-based methods can
automatically acquire features, and they can learn the deep semantic information of the
text. In terms of deep learning models, Kim 2014 [3] used a simple convolutional neural network (CNN) for text classification. Tang et al. 2015 [21] proposed a gated recurrent neural network. Yang et al. 2016 [17] proposed the use of hierarchical attention networks to model and classify documents. Wang et al. 2016 [22] introduced the attention mechanism into LSTM. Dong et al. 2020 [23] introduced the Bidirectional Encoder Representation from Transformers (BERT) [10] and the self-interaction attention mechanism. Most attention-
based deep neural networks mainly focus on local continuous word sequences, which
provide local context information. However, the ability to capture the global characteristics
of the relevant information may not be sufficient.
In recent years, many studies have tried to apply convolution operations to graph
data. Kipf and Welling 2017 [12] proposed Graph Convolutional Networks (GCNs), and
the model has achieved good results on node classification tasks. Velickovic et al. 2018 [24]
proposed a Graph Attention Network (GAT), which uses the attention mechanism [25]
to assign weights to the neighborhood information of nodes. Compared with the GCN
without adding weights, it has a better promotion. Later, Yao et al. 2019 [13] proposed the
TextGCN model, which introduced the Graph Convolutional Network to text classification
for the first time. By modeling the text into a graph, the nodes in the graph are composed
of words and documents, and a good effect is finally obtained. Cavallari et al. 2019 [26]
introduced a new setting for graph embedding, which considers embedding communities
instead of individual nodes.
Now, many studies try to combine GCN with other models. Ye et al. 2020 [14] input
the word and document nodes trained by GCN and BERT’s word vector into the BiLSTM
classification model, where the input vector of GCN is a one-hot vector. Lu et al. 2020 [15]
integrated the vocabulary graph embedding module with BERT and achieved good results
in many public datasets, where GCN uses BERT’s word vector. In contrast to this, in our
method, GCN uses BERT’s word vectors and uses dependencies when building the text
graph. One graph focuses on extracting dependencies between words, and the other graph
focuses on word co-occurrences. We use a two-layer BiLSTM with attention combined with
GCN and BERT; one layer is used to focus on the word sequence, the other layer focuses on
the sentence sequence, and a residual operation is added after BiLSTM to learn the residual.

3. Proposed Method
Following in-depth research on text classification based on neural networks, this paper proposes IMGCN. The whole process of IMGCN can be divided into the steps of text composition, graph convolutional network, feature fusion, and final prediction. The architecture of IMGCN is shown in Figure 2. The colored continuous line in the IMGCN architecture is the output of each single model, and the intermittent line represents the modules composed of a single model. We construct two text graphs by establishing edges between word nodes. A vocabulary graph is built from word co-occurrence relations and the semantic lexicon, and a dependency graph is built from the inclusion relations and dependencies of documents and words. We use normalized point-wise mutual information (NPMI) [27] and a semantic dictionary to assign association weights to edges between words to jointly build a text graph. Then, the dependency-based TF-IDF method is used to assign association weights to the edges between words and documents, and a matrix transformation is used to map them to the relations between words and words, thereby constructing another text graph. Then the word embedding of BERT is used as the input vector required by the graph convolutional network, and the two text graphs are respectively used as the adjacency matrix required by the graph convolutional network. Subsequently, they are separately fed into a two-layer graph convolutional network for training, and the two hidden output vectors are then summed. Next, the added word vector is fused with the feature vector of the pre-trained BERT and input to the first-layer BiLSTM model to extract the semantic dependency information between words within a single sentence. To further distinguish the importance of words, we add an attention mechanism to the first layer of BiLSTM, assigning different word weights. The word weights and the output of the first layer of BiLSTM are weighted and summed to form an overall representation of the current sentence. Here, the bidirectional word hidden features are combined as the context vector of the attention mechanism to guide the learning of the attention mechanism. The obtained sentence feature representation is then input to the second layer of the BiLSTM to extract semantic dependency information between sentences within the text. The learned sentence feature representation and the feature obtained by the context attention layer are subjected to a residual operation to form an overall text representation. Finally, the overall text representation is input to the Softmax classification layer to obtain the final category of the text.

Figure 2. The architecture of IMGCN.

3.1. Text Graph


3.1.1. Vocabulary Graph
In this paper, we first use a semantic dictionary (Wordnet) and Normalized Pointwise
Mutual Information (NPMI) to construct a text graph. For constructing text graphs using
word co-occurrence relationships, normalized pointwise mutual information (NPMI) is
generally used to calculate edge weights between two word nodes, which is a common
measure of word association. However, the calculation of NPMI depends on the corpus.
If the probability of some words appearing in the corpus is low, this method may yield a very small NPMI value, and the two words will be judged to have little or no similarity. For example, the word “abundant” has a very low probability of appearing in the
corpus. Let P (word and other words) be the co-occurrence probability between words.
When calculating the co-occurrence probability of “abundant” and other words, the values
of P (abundant and other words) and P (abundant) are very small; even if it is 0, it will
result in a very small value of NPMI, so in the end, “abundant” and other words may be
judged as having little or no semantic relevance.
In addition, there are a lot of connections between words, and the most direct con-
nection is the synonymous relationship that exists between certain words. For example,
the words “abundant” and “plentiful” have very similar meanings and can be used inter-
changeably in most cases. If two words are synonymous, then only constructing a text
graph through word co-occurrence relationship may lack this part of word relationship
information. If this synonymous relationship can be used, the relationship information of
words can be enriched, thereby improving the accuracy of the calculation. The semantic
dictionary describes the different semantic relations between words. Therefore, we jointly
construct a text graph using word co-occurrence relations and semantic lexicon.
For the selection of the semantic dictionary, we use WordNet. WordNet describes a
network of semantic relationships between words. For every content word in the language,
it gives all possible word meanings. WordNet organizes words in a dictionary using
semantic and lexical relations, where semantic relations represent the degree of association
between the semantic items of two words. Nouns, verbs, adverbs, and adjectives are each
organized into a network of synonyms, each set of synonyms representing a basic semantic
concept, and these sets are also connected by various relationships. Since a word can
have multiple meanings, it is necessary to compare the meanings between the two words.
Therefore, we first use WordNet to expand synonyms and then use the WordNet-based
Wup method to calculate the semantic similarity between words in the two synonym sets
and calculate the average of all similarity scores. This value is the final similarity score of
the two words. The larger the calculated value, the higher the semantic similarity between
them. Among them, the “Wup” method [28] is a similarity measurement method based
on the path structure proposed by Wu and Palmer. The similarity is calculated according
to the similarity of the meaning of the words and the position of the synonym sets in the
hypernym tree relative to each other.
Therefore, for two words i and j, we first calculate the edge weights between two words
using the NPMI method. If the calculated NPMI value is small, the WordNet-based method
is used for calculation, which can minimize the problems caused by the low frequency of
some words in the corpus, and at the same time, add more word relationship information.
Formally, the edge weight between node i and node j is defined as:

A_{ij} =
\begin{cases}
\mathrm{NPMI}(i,j), & i, j \text{ are words and } \mathrm{NPMI}(i,j) > 0 \\
\mathrm{WordSim}(i,j), & 0.5 \le \mathrm{WordSim}(i,j) \le 1 \text{ and } \mathrm{NPMI}(i,j) < 0 \\
1, & i = j \\
0, & \text{otherwise}
\end{cases}
\qquad (1)

The NPMI value of a pair of words i, j is calculated as:

\mathrm{NPMI}(i,j) = -\frac{1}{\log p(i,j)} \log \frac{p(i,j)}{p(i)\,p(j)} \qquad (2)

where p(i,j) = \#W(i,j)/\#W, p(i) = \#W(i)/\#W, \#W(i) is the number of all sliding windows containing word i, \#W(i,j) is the number of all sliding windows containing both word i and word j, and \#W is the total number of sliding windows. We set a larger window to capture long-term dependence. The value range of NPMI is [−1, 1]. A positive NPMI value indicates that
the semantic relevance between words is high, while a negative NPMI value indicates that
there is little or no semantic relevance. Therefore, we only create an edge between word
pairs with positive NPMI values.
When the NPMI value of a pair of words i, j is calculated to be negative, their semantic similarity is calculated as:

\mathrm{WordSim}(i,j) = \frac{\sum_{q=1}^{len_2}\sum_{p=1}^{len_1} \mathrm{sim}_{wup}(w_p, k_q)}{len_1 + len_2 - a} \qquad (3)

where \mathrm{sim}_{wup}(w_p, k_q) = \frac{2 \times N_3}{N_1 + N_2 + 2 \times N_3} represents the semantic similarity of two words in the synonym lexicon. Let s_3 be the deepest common parent concept of w_p and k_q; N_1 is the number of nodes on the path from w_p to s_3, N_2 is the number of nodes on the path from k_q to s_3, and N_3 is the number of nodes on the path from s_3 to the root of the concept hierarchy tree. Let S_1 = \{w_1, w_2, \ldots, w_p\}, p \in [1, len_1], be the synonym word set of word i and len_1 be the number of words in S_1; let S_2 = \{k_1, k_2, \ldots, k_q\}, q \in [1, len_2], be the synonym word set of word j and len_2 be the number of words in S_2; and a is the number of word pairs with 0 semantic similarity in the word sets. The value range of WordSim(i, j) is [0, 1]. If the WordSim(i, j) value of two words is within a certain range, we create an edge between the two words. Experiments show that the performance is better when WordSim(i, j) is between 0.5 and 1.
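For illustration, the sketch below computes the two components of Formula (1) in Python. The sliding-window counts (win_count, pair_count, n_windows) are assumed to be collected elsewhere, the helper names are ours, and NLTK's WordNet interface stands in for the semantic dictionary.

```python
import math
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet corpus


def npmi(i, j, win_count, pair_count, n_windows):
    """Formula (2): normalized PMI from sliding-window counts.
    win_count[w] = #W(w), pair_count[(i, j)] = #W(i, j), n_windows = #W."""
    p_i, p_j = win_count[i] / n_windows, win_count[j] / n_windows
    p_ij = pair_count.get((i, j), 0) / n_windows
    if p_ij == 0 or p_i == 0 or p_j == 0:
        return -1.0
    if p_ij == 1.0:
        return 1.0
    return -math.log(p_ij / (p_i * p_j)) / math.log(p_ij)


def word_sim(i, j):
    """Formula (3): Wu-Palmer similarity averaged over the two synonym sets,
    excluding zero-similarity pairs from the denominator."""
    s1, s2 = wn.synsets(i), wn.synsets(j)
    if not s1 or not s2:
        return 0.0
    scores = [a.wup_similarity(b) or 0.0 for a in s1 for b in s2]
    zero_pairs = sum(1 for s in scores if s == 0.0)
    denom = len(s1) + len(s2) - zero_pairs
    return sum(scores) / denom if denom > 0 else 0.0


def edge_weight(i, j, win_count, pair_count, n_windows):
    """Formula (1): edge weight A_ij of the vocabulary graph."""
    if i == j:
        return 1.0
    score = npmi(i, j, win_count, pair_count, n_windows)
    if score > 0:
        return score
    sim = word_sim(i, j)
    return sim if 0.5 <= sim <= 1.0 else 0.0
```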

3.1.2. Dependency Graph


Based on the above two relational compositions, graph convolutional networks can
capture short-range dependencies of words in sentences but cannot capture long-range
dependencies at the same time. Therefore, we introduce dependencies to construct another
text graph to overcome the above problems. In addition, in order to obtain the inclusion
relationship between documents and words while introducing the dependencies between
words, we combine the two relationships. The edge weights between general document
nodes and word nodes are calculated using the TF-IDF algorithm. TF-IDF is the word
frequency-inverse text frequency index. The term frequency (TF) is the number of times a
word appears in the document, and the inverse text frequency (IDF) reflects the general
importance of the feature word in the corpus. However, the TF-IDF method lacks the
combination of textual semantics, which is useful for understanding textual content. To
overcome this problem, we refer to [29], introduce dependencies, and use an improved
edge weight calculation method to understand and optimize text features.
Dependency syntax is mainly concerned with the relationship between the various
words in a sentence, and it is not affected by the physical location of the components [30].
Since the importance and relevance of various dependencies in a sentence are not the
same, first, the importance of words to sentences, texts, and even categories is determined
according to the different dependencies between words and predicates in terms of word
meaning. Then, according to the importance of different components to the sentence, the
sentence composition is divided into eight levels.
The results of dependency analysis are usually represented by a labeled directed graph.
For example, as shown in Figure 3, “root” marks the central word “has”. An example of the relationship between a head word and its subsidiary word is the pair “has” and “performances”: the result of dependency analysis shows that the relationship between these two words is a direct object (dobj) relationship.
Figure 3. Directed graph of dependency analysis (“The film has several strong performances.”).
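Such an analysis can be reproduced programmatically; the following sketch uses spaCy merely as a convenient stand-in for the Stanford Parser employed in this paper.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any English dependency parser would do here
doc = nlp("The film has several strong performances.")

for token in doc:
    # e.g. "performances" --dobj--> "has", matching the dobj relation in Figure 3
    print(f"{token.text:13s} {token.dep_:6s} head={token.head.text}")
```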

After classifying the text features in the dataset according to the dependency relationship, this paper uses the following dependency-based TF-IDF weight calculation method [29]. The specific steps are as follows:
For the word i, we calculate the number of times that word i appears in the text and set it as n. Then, according to the result of the dependency syntax analysis implemented by the Stanford Parser, it is obtained that word i belongs to the m-th (1 ≤ m ≤ n) sentence component in the text. Furthermore, according to paper [29], the m-th occurrence of word i in the text is classified into level k_{i,m} and assigned the weight w_{i,m}. The calculation process is shown in Formula (4):

w_{i,m} = 2\cos\left[\left(\frac{k_{i,m}}{8}\right)^{\lambda} \times \frac{\pi}{2}\right] \qquad (4)

TF_i = \sum_{m=1}^{n} w_{i,m} \qquad (5)

\text{TF-IDF}_i = \frac{TF_i}{s} \times \log\left(\frac{D}{p_i} + 0.01\right) \qquad (6)

Then the improved frequency TF_i with the weight of word i in the text is calculated by Formula (5). For word i, the improved TF-IDF weight based on the dependency relationship is as shown in Formula (6), where s represents the total number of words in the text where word i is located, p_i represents the number of texts containing word i, and D represents the total number of texts in the dataset. λ is a parameter used to adjust the weight gap between feature levels, and its range is [0, 1].
The weight of the edge between the word node and the document node is calculated according to the above method, and the adjacency matrix A is obtained. Then, as shown in Formula (7), the relationship between words and documents is mapped to the relationship between words and words containing document information and dependencies through a matrix transformation, so as to construct another text graph \overline{A}:

\overline{A} = A^{T} A \qquad (7)

where A \in R^{n \times m}, n is the number of document nodes and m is the number of word nodes.
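As a rough sketch of Formulas (4)-(7): the mapping from a dependency relation to one of the eight levels follows [29] and is only taken as an input here, and the value of λ is an illustrative assumption.

```python
import math
import numpy as np

LAM = 0.5  # λ in [0, 1]; adjusts the weight gap between feature levels (value assumed)


def occurrence_weight(level):
    """Formula (4): weight of one occurrence of a word, given its dependency level k in 1..8."""
    return 2 * math.cos((level / 8) ** LAM * math.pi / 2)


def dep_tfidf(levels, doc_len, n_docs, docs_with_word):
    """Formulas (5)-(6): dependency-weighted TF-IDF of one word in one document;
    `levels` lists the dependency level of each occurrence of the word."""
    tf = sum(occurrence_weight(k) for k in levels)
    return tf / doc_len * math.log(n_docs / docs_with_word + 0.01)


def word_word_graph(A):
    """Formula (7): map the n x m document-word matrix A (filled with dep_tfidf
    weights) to an m x m word-word graph carrying document and dependency information."""
    return A.T @ A
```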

3.2. Vocabulary GCN

A general graph convolutional network is a multi-layer (usually two-layer) neural network that performs convolution directly on the graph and updates a node's embedding vector based on the information of its neighbor nodes. The graph nodes of a graph convolutional network are “task entities”, such as documents that need to be classified. It requires all entities, including those from training, validation, and test sets, to be displayed in the graph, so that node representations are not missing from downstream tasks. This limits the application of graph convolutional networks to many prediction tasks where test data are not seen during training. Therefore, we use Vocabulary Graph Convolution (VGCN), whose graph is constructed on the vocabulary without using the training set
to obtain entities, mainly convolving related words. Then for a single document, the
single-layer graph convolution is shown in Formula (8):

H = X\widetilde{A}W \qquad (8)

where \widetilde{A} represents the vocabulary graph. X\widetilde{A} extracts the part of the vocabulary related to the input matrix X, combining the words in the input sentence with the relevant words in the vocabulary. W is the weight matrix holding the hidden state vector of a single document, and its dimension is |V| \times h.
The corresponding two-level GCN with LeakyReLU function is:

\mathrm{GCN} = \mathrm{LeakyReLU}(X_{mv} A_{vv} W_{vh}) W_{hc} \qquad (9)

where m is the mini-batch size, v is the vocabulary size, h is the hidden layer size, and c is
the class size or sentence embedding size. W holds the weight of the hidden state vector of
a single document. Each row of Xmv is a vector containing document features, which is
the word embedding of BERT. Then a two-layer convolution is performed to combine the
words in the input sentence with the related words in the vocabulary.
Specifically, we perform two layers of graph convolution for the two text graphs
generated in the previous section, then add the two generated hidden layer vectors.
In the training process of the original GCN, ReLU is used as the activation function
of the GCN. In this paper, the non-linear function LeakyReLU is selected as the activation
function of GCN, which overcomes the problem of gradient disappearance and speeds up
training. In addition, L2 regularization is also used to reduce overfitting. The LeakyReLU
function is as follows:

\mathrm{LeakyReLU}(x) =
\begin{cases}
0.2x, & x < 0 \\
x, & x \ge 0
\end{cases}
\qquad (10)
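A minimal PyTorch sketch of the two-layer vocabulary graph convolution of Formula (9); the dimensions follow Section 4.2, the 0.2 slope follows Formula (10), and the class and variable names are ours. As described above, the two text graphs are convolved separately and the hidden outputs are summed.

```python
import torch
import torch.nn as nn


class VocabGCN(nn.Module):
    """Two-layer vocabulary graph convolution: LeakyReLU(X A W_vh) W_hc (Formula (9))."""

    def __init__(self, vocab_size, hidden_dim=128, out_dim=16):
        super().__init__()
        self.W_vh = nn.Parameter(torch.empty(vocab_size, hidden_dim))
        self.W_hc = nn.Parameter(torch.empty(hidden_dim, out_dim))
        nn.init.xavier_uniform_(self.W_vh)
        nn.init.xavier_uniform_(self.W_hc)
        self.act = nn.LeakyReLU(0.2)  # Formula (10)

    def forward(self, X, A):
        # X: (batch, vocab) document features built from BERT word embeddings
        # A: (vocab, vocab) adjacency matrix of one text graph
        return self.act(X @ A @ self.W_vh) @ self.W_hc


# Sketch of combining the two graphs:
# H = gcn(X, A_vocabulary) + gcn(X, A_dependency)
```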

3.3. Feature Fusion and Classification


Since words are represented as nodes, the neighborhood information of nodes is
aggregated through an adjacency matrix, ignoring the order structure of a text, resulting
in the loss of contextual semantic information of a text. Therefore, we combine the global
vocabulary information obtained by GCN with the context information obtained by the
pre-trained BERT. While using the global vocabulary information, we can also take into
account the order of words to capture contextual information. In order to dig deeper into
the semantics, the sequence model LSTM can be used to fuse the two parts of information,
but it can only capture features in a single direction, from front to back or from back to
front. The mode of human reading text is to use the foregoing and the following to jointly
infer the meaning of a word when encountering a word. Therefore, BiLSTM models that
can capture features in both front-to-back and back-to-front directions are more in line with
this pattern. Considering that text has a hierarchical structure, the text is composed of
sentences, and sentences are composed of words. Therefore, we use a two-layer BiLSTM
to learn word features and sentence features, respectively. Since the feature words have
different importance to the text, and the output vector of BiLSTM does not distinguish the
importance of words, we add an attention mechanism to its hidden output. We changed
the initialization method of the context vector in the attention mechanism and combined
the bidirectional word hidden features as the initial value of the context vector to guide the
learning of the attention mechanism. At the same time, in order to prevent the deepening
of the model layers from affecting the classification results, we introduced the residual
operation. The specific structure of the model feature fusion module is shown in Figure 4.
Figure 4. Residual two-layer BiLSTM structure with attention.

The module consists of a semantic feature splicing layer, a word-level feature extraction layer, a contextual attention layer, and a sentence-level feature extraction layer. First, the GCN and the pre-trained BERT model convert the text word sequence into two different word vector representations. Secondly, the corresponding fusion of the obtained two different word vector representations is input into the first-layer BiLSTM model to extract the semantic dependency information between words within a single sentence, and a layer of attention mechanism added to the first-layer BiLSTM constructs an overall representation of the current sentence. Here, the bidirectional word hidden features are combined as the context vector of the attention mechanism. Then the obtained sentence feature representation is input to the second-layer BiLSTM to extract semantic dependency information between sentences within the text. The learned sentence feature representation and the sentence feature representation obtained by the context attention layer are subjected to a residual operation to form an overall text representation. Finally, the overall text representation is input to the Softmax classification layer to obtain the final category of the text.

3.3.1. Semantic Feature Connection Layer

We do not directly classify the output features of GCN but use the generated hidden feature representation H^{(2)} for subsequent learning. Since H^{(2)} does not adequately capture contextual information and does not discriminate word vector representations of polysemy, pre-trained BERT is introduced. The BERT model uses a bidirectional transformer structure for encoding, which can characterize the specific semantics of each word in the context and distinguish multiple semantics according to the semantic relationship of the context. Each word obtains the word vector representation W of the text through the pre-trained BERT model, as shown in Formula (11):

W^{(i)} = \left\{ w_1^{(i)}, w_2^{(i)}, \ldots, w_j^{(i)}, \ldots, w_n^{(i)} \right\} \qquad (11)

where W^{(i)} represents the vector matrix of the i-th text, w_j^{(i)} represents the feature vector of the j-th word in the i-th text, and n represents the maximum sentence length.

Then, the embedded representation H^{(2)} output after the second-layer graph convolution operation in the GCN model is spliced with the word vector W generated by the pre-trained BERT to obtain a multi-feature vector X, as shown in Formula (12):

X = \mathrm{concat}(W, H^{(2)}) \qquad (12)
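In code, Formula (12) is a per-token concatenation of the two feature streams; the shapes below assume BERT-base embeddings (768 dimensions) and the 16-dimensional graph embedding of Section 4.2.

```python
import torch

batch, seq_len, d_bert, d_gcn = 16, 200, 768, 16
W = torch.randn(batch, seq_len, d_bert)   # BERT token embeddings, Formula (11)
H2 = torch.randn(batch, seq_len, d_gcn)   # second-layer GCN output aligned to the same tokens
X = torch.cat([W, H2], dim=-1)            # Formula (12): multi-feature vector, shape (16, 200, 784)
```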

3.3.2. Word-Level Feature Extraction Layer


After the semantic feature connection layer is the word-level feature layer BiLSTM
model, which is used to extract semantic dependency information between words within
a single sentence. The BiLSTM model corresponds to the forward and backward LSTM
models. The sequence information and contextual features are captured by LSTMs in two
directions, and the combination of the two output layers generated by the two directions
can provide past contextual information and future contextual information. The input of
this layer of BiLSTM is the result X after semantic feature concatenation.
Suppose a text has L sentences s_i, where i ∈ [1, L], and each sentence contains T words. x_{it} represents the word vector of the t-th word in the i-th sentence, where t ∈ [1, T]. In the forward LSTM layer, sentence s_i is read from x_{i1} to x_{iT} and a forward calculation is performed to obtain the hidden state \overrightarrow{h_{it}}; in the backward LSTM layer, sentence s_i is read from x_{iT} to x_{i1} and a reverse calculation is performed to obtain the hidden state \overleftarrow{h_{it}}, as shown in Equations (13) and (14):

\overrightarrow{h_{it}} = \overrightarrow{\mathrm{LSTM}}(x_{it}, \overrightarrow{h_{i,t-1}}), \quad t \in [1, T] \qquad (13)

\overleftarrow{h_{it}} = \overleftarrow{\mathrm{LSTM}}(x_{it}, \overleftarrow{h_{i,t+1}}), \quad t \in [T, 1] \qquad (14)

Finally, the hidden state \overrightarrow{h_{it}} obtained by the forward LSTM layer is spliced with the hidden state \overleftarrow{h_{it}} obtained by the backward LSTM layer to obtain the final hidden layer state h_{it} of the bidirectional LSTM, as shown in Formula (15):

h_{it} = \left[\overrightarrow{h_{it}}, \overleftarrow{h_{it}}\right] \qquad (15)
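Formulas (13)-(15) correspond to a standard bidirectional LSTM pass; a minimal PyTorch sketch with illustrative sizes (784 assumes the fused BERT-plus-graph features from above):

```python
import torch
import torch.nn as nn

# Word-level BiLSTM: forward and backward hidden states are computed in one pass
# and concatenated per word, as in Formulas (13)-(15).
word_bilstm = nn.LSTM(input_size=784, hidden_size=256,
                      batch_first=True, bidirectional=True)

x = torch.randn(16, 200, 784)   # (batch, words per sentence, fused feature dim)
h_it, _ = word_bilstm(x)        # (16, 200, 512): [h_forward ; h_backward] for every word
```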

3.3.3. Contextual Attention Layer


Considering that the words in a sentence contribute differently to the meaning ex-
pression of the sentence, the attention mechanism can focus on important information
from a large amount of information while ignoring unimportant information. Therefore,
an improved attention mechanism is used to focus on important information, and these
word representations are aggregated to form sentence feature representations. The hid-
den output result hit of the word-level feature layer is used as the input of the context
attention layer, and the calculation process of assigning attention weights is shown in
Equations (16) and (17):
u_{it} = \tanh(W_w h_{it} + b_w) \qquad (16)

\alpha_{it} = \frac{\exp(u_{it}^{T} u_w)}{\sum_t \exp(u_{it}^{T} u_w)} \qquad (17)

where Ww is the weight matrix, bw is the bias term, and αit is the normalized importance
weight. uw is the attention context vector, which contains useful information for the text to
guide the attention model to locate informative words from the input sequence, and thus
plays an important role in the attention mechanism. Prior work either ignores the context
vector or initializes it randomly, which greatly weakens the role of context. Therefore, we
use the combination of word hidden features learned by LSTM in two directions as the
context vector of the attention mechanism, that is, u_w = h_{iT} + h_{i1}. It contains contextual
information of words in sentences to guide the learning of the attention mechanism.
Next, the word importance weight calculated by Formula (17) and the hidden output
result hit of the word-level feature layer are weighted and summed. The word-level
feature vectors are aggregated into sentence-level feature vectors si to form an overall
representation of the current sentence, as shown in Formula (18):

s_i = \sum_t (\alpha_{it} h_{it}) \qquad (18)
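A PyTorch sketch of the contextual attention layer of Formulas (16)-(18); reading u_w as the sum of the hidden states at the last and first time steps is our interpretation of the bidirectional context vector described above.

```python
import torch
import torch.nn as nn


class ContextAttention(nn.Module):
    """Attention of Formulas (16)-(18) with a context vector built from the
    bidirectional word hidden features."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                # W_w and b_w

    def forward(self, h):                              # h: (batch, T, dim) word-level BiLSTM outputs
        u = torch.tanh(self.proj(h))                   # Formula (16)
        u_w = h[:, -1, :] + h[:, 0, :]                 # u_w = h_iT + h_i1 (assumed reading)
        scores = (u * u_w.unsqueeze(1)).sum(-1)        # u_it^T u_w
        alpha = torch.softmax(scores, dim=1)           # Formula (17)
        return (alpha.unsqueeze(-1) * h).sum(dim=1)    # sentence vector s_i, Formula (18)
```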

3.3.4. Sentence Level Feature Extraction Layer


Since a text contains multiple sentences, there are semantic dependencies between
each sentence. Therefore, the main task of the sentence-level feature extraction layer is to
extract the semantic dependency information between sentences within the text in units of
sentences. The sentence vector si obtained by the contextual attention layer is input into

a layer of BiLSTM for sentence sequence learning. Where hl is the hidden state obtained

by the forward LSTM layer through the forward input, hl is the hidden state obtained by
the backward LSTM layer through the reverse input, and finally, the two hidden states are
spliced to obtain hi .
→ −−−−→ −−−→
hl = LSTM (s i , ht−1 ), t ∈ [1, L] (19)
← ←−−−− ←−−−
hl = LSTM (s i , ht−1 ), t ∈ [L, 1] (20)
→ ←
hi = hl , hl (21)

The deepening of the number of network layers can improve the accuracy of the
model, but there are problems with the disappearance of the gradient of the deep network
and the degradation of the network. When the depth reaches a certain level, the accuracy of
the model begins to decline again, and the fitting ability becomes worse. In the deep neural
network, the residual network can effectively inhibit the network degradation phenomenon
and avoid the problem of sub-optimal solutions in deeper networks. Therefore, we intro-
duce a residual connection between two adjacent layers and perform a residual operation
on the sentence feature representation learned by the sentence-level feature extraction
layer and the sentence feature representation output by the context attention layer. Let the
model learn the residual to obtain a higher-level feature representation to achieve a better
classification effect, as shown in Formula (22):

O_i = s_i + h_i \qquad (22)

Finally, the result Oi is input to the Softmax layer for classification, and the final
category of the text can be obtained.
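A minimal sketch of the sentence-level feature extraction layer with the residual connection and the Softmax classifier (Formulas (19)-(22)); mean-pooling the sentence representations into one text vector before classification is an assumption, since the paper does not spell out this step.

```python
import torch
import torch.nn as nn


class SentenceLevelClassifier(nn.Module):
    """Sentence-level BiLSTM with a residual connection to the attention output,
    followed by the Softmax classification layer (Formulas (19)-(22))."""

    def __init__(self, dim=512, num_classes=2):
        super().__init__()
        self.sent_bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, s):                        # s: (batch, L, dim) sentence vectors from the attention layer
        h, _ = self.sent_bilstm(s)               # Formulas (19)-(21): [h_forward ; h_backward] per sentence
        o = s + h                                # Formula (22): residual connection O_i = s_i + h_i
        logits = self.classifier(o.mean(dim=1))  # pool sentences into one text representation (assumed)
        return torch.log_softmax(logits, dim=-1)
```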

4. Discussion
In this section, we conduct experiments on four publicly available datasets and analyze the experimental results of different models. Subsequently, ablation studies are performed to further explore the effectiveness of each component.

4.1. Datasets
For an overall comparison with other more advanced methods, we choose the SST-2
dataset [31], MR [32], CoLA [33], and R8 dataset [13], which are commonly used in other
methods to evaluate our proposed method. The four datasets contain the content of reviews,
grammar, and news. The SST-2 dataset is the Stanford Sentiment Treebank dataset, which
is a binary single-sentence classification task consisting of sentences extracted from movie
reviews and human annotations of their sentiments. We use the public version, which
has 6920 training samples, 872 validation samples, and 1821 test samples. In total, there
are 4963 positive reviews and 4650 negative reviews. The average length of reviews is
19.3 words. The MR dataset is a sentiment classification dataset for movie reviews. We used
the public version in [34] with 3554 test samples, 6392 training samples, and 710 validation
samples. The average length of reviews is 21.0 words. The CoLA dataset is a grammar
dataset released by New York University. This task is mainly to judge whether the grammar
of a given sentence is correct, which belongs to the single-sentence text binary classification
task. We use the public version with 8551 training data, 1043 validation data, and 1062 test
samples for a total of 6744 positives and 2850 negatives. The average length is 7.7 words.
The R8 dataset represents a subset of the Reuters 21,578 dataset. The dataset includes eight
categories, divided into 5485 training documents and 2189 test documents. We randomly
select 20% of the training set of this dataset as the validation set. The average length is
65.72 words.

4.2. Implementation Details


The WordNet threshold of all datasets in this paper is set to 0.5. When calculating
NPMI on the dataset, the entire sentence is used as a text window to construct a vocabulary
graph. The NPMI threshold of all datasets is set to 0.2 to filter out meaningless relationships
between words. In the IMGCN model, the graph embedding output size is set to 16, and
the hidden dimension of graph embedding is set to 128. We use the pre-trained BERT
model provided by Google and set the maximum sequence length to 200. The dropout rate
is set to 0.2. We train the model for 15 epochs. The batch size is set to 16. The learning rate of the SST-2 dataset is set to 1 × 10−5, and the learning rate of the other datasets is set to
8 × 10−5 . The loss weight decay of the CoLA dataset is set to 0.0001, and the loss weight
decay of other datasets is set to 0.001.
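Collected as one configuration sketch (values taken from this section; the key names themselves are ours):

```python
config = {
    "wordnet_threshold": 0.5,
    "npmi_threshold": 0.2,
    "graph_embedding_dim": 16,
    "graph_hidden_dim": 128,
    "bert_model": "bert-base-uncased",
    "max_seq_length": 200,
    "dropout": 0.2,
    "epochs": 15,
    "batch_size": 16,
    "learning_rate": {"SST-2": 1e-5, "default": 8e-5},
    "loss_weight_decay": {"CoLA": 1e-4, "default": 1e-3},
}
```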

4.3. Main Experimental Results


To verify the performance of the proposed model, we conduct comparative experi-
ments with other models, including the classic Bi-LSTM [35], Text-GCN [13], and BERT [10] models, and three improved GCN methods: VGCN [15], VGCN-BERT [15], and STGCN + BERT + BiLSTM [14].
• Bi-LSTM: Bidirectional LSTM is often used for text classification. We use BERT pre-trained word embeddings as input to the Bi-LSTM model.
• Text-GCN: A text classification method that uses graphs to model text. The words
and documents in the text are regarded as nodes, where the edges of the document
and the words are based on the appearance information of the words in the doc-
ument, and the edges of the words and the words are based on the global word
co-occurrence information of the words. It improves over many of the most advanced models.
• VGCN: The model only uses VGCN. The pre-trained word embedding of BERT is
used as an input. The output of VGCN is relayed to a fully connected layer with the
Softmax function to generate classification scores. This model only uses the global
information of the vocabulary graph.
• BERT: We use a small version (BERT-base-uncase) of pre-trained BERT.
• VGCN-BERT: Based on the word co-occurrence information, a graph convolutional
network is constructed on the vocabulary graph, and then the graph embedding and
word embedding are provided to the self-attention encoder in BERT.
• STGCN + BERT + BiLSTM: The topic model is employed to extract the short text
topic information, and a short-text graph is constructed by word co-occurrence and
document word relations. The word nodes and document nodes trained by the GCN
and vectors generated by the BERT’s hidden layer are fed together to the BiLSTM
classifier for short text classification.
In order to comprehensively evaluate the model, the weighted average F1-Score and
macro F1-Score are selected as evaluation indicators. The main results on the accuracy,
weighted average F1-Score, and macro F1-Score on test sets are presented in Table 1. The
experimental results on four benchmark datasets confirm that the performance of IMGCN
is basically better than other baseline models, which further proves the effectiveness and
robustness of IMGCN in text classification. The “—” in the table represents no data. The
bold in the table indicates the best results of the model. The Macro F1 score is the arithmetic
average of multiple F1 scores, and it assigns equal weight to each class. The weighted avg
F1 score assigns different weights to different classes according to the number of samples in
each class. When calculating the two scores, the respective Macro F1 scores and Weighted
avg F1 scores of the SST-2 and MR datasets are close. Since we rounded our calculations to
two decimal places, this resulted in the macro F1 score and the weighted average F1 score
being the same on SST-2 and MR.
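Both scores can be computed directly with scikit-learn, which makes the distinction concrete (the labels below are illustrative only):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

macro_f1 = f1_score(y_true, y_pred, average="macro")        # equal weight per class
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weighted by class support
```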

Table 1. Accuracy, weighted average F1-Score, and (Macro F1-Score) on the test sets.

Model | SST-2 Weighted Avg F1 (Macro F1) | SST-2 Acc | MR Weighted Avg F1 (Macro F1) | MR Acc | CoLA Weighted Avg F1 (Macro F1) | CoLA Acc | R8 Weighted Avg F1 (Macro F1) | R8 Acc
Text-GCN | 80.45 | 82.39 | 75.67 | 76.70 | 56.18 (52.30) | 52.30 | — | 97.00
Bi-LSTM | 81.32 | 83.50 | 76.39 | 77.70 | 62.88 (55.25) | 64.10 | — | 96.30
VGCN | 81.64 | 83.43 | 76.42 | 76.42 | 63.59 (54.82) | 65.61 | 97.12 (94.65) | 97.43
BERT | 91.49 | 91.53 | 86.24 | 80.30 | 81.22 (77.02) | 81.20 | 98.12 (95.59) | 98.20
STGCN + BERT + BiLSTM | — | — | — | 82.50 | — | — | — | 98.50
VGCN-BERT | 91.93 | 91.94 | 86.35 | 86.35 | 83.68 (80.46) | 83.70 | 98.04 (95.43) | 98.04
IMGCN | 92.48 | 92.48 | 87.81 | 87.81 | 84.02 (80.91) | 84.37 | 98.14 (96.50) | 98.34

It can be seen from the table that IMGCN achieves a good classification effect compared with the baseline models, which demonstrates the effectiveness of the model in this paper. The performance of the Bi-LSTM model is comparable to that of VGCN and Text-GCN, the models that utilize the vocabulary graph. Compared with the VGCN and BERT models, the model in this paper achieves a certain improvement, indicating that the introduction of dependencies, the semantic dictionary, the BERT model, the Bi-LSTM model, and residual attention does enrich the contextual semantic information and local feature information of the GCN, giving the proposed model better classification performance than the other models. At the same time, IMGCN consistently outperforms VGCN-BERT. All of this shows that the combination of components in this paper is useful.

4.4. Ablation Study


In order to further study the benefits of the various components of IMGCN, this paper conducted ablation experiments on the constructed model. Since the experimental results of IMGCN and its variant models on the R8 dataset differ very little, Table 2 only shows the results of the ablation experiments on the three public datasets SST-2, MR, and CoLA. We selected two indicators, the weighted average F1-Score and the macro F1-Score. The experimental results of IMGCN and its variant models show that the components used in this paper improve the performance of IMGCN. Among them, MGB is a model that only uses pointwise mutual information to build the graph network and concatenates its output with the local information obtained by BERT. Based on the MGB model, IMB additionally introduces dependency syntax and a semantic dictionary into the graph construction. On the basis of IMB, IMGCN further fuses the graph features and the local features obtained by BERT using a hierarchical BiLSTM.
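
As an illustration of the edge weighting that MGB relies on, the sketch below computes positive pointwise mutual information (PMI) between words over sliding windows, in the spirit of Text-GCN. The window size and whitespace tokenization are assumptions for illustration, not the paper's exact settings.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window=20):
    """Return {(word_i, word_j): positive PMI} computed over sliding windows."""
    word_count, pair_count, n_windows = Counter(), Counter(), 0
    for doc in docs:
        tokens = doc.split()
        for start in range(max(1, len(tokens) - window + 1)):
            win = set(tokens[start:start + window])
            n_windows += 1
            word_count.update(win)
            pair_count.update(frozenset(p) for p in combinations(sorted(win), 2))
    edges = {}
    for pair, n_ij in pair_count.items():
        i, j = tuple(pair)
        # PMI(i, j) = log( p(i, j) / (p(i) * p(j)) ), probabilities estimated from windows
        pmi = math.log(n_ij * n_windows / (word_count[i] * word_count[j]))
        if pmi > 0:  # keep only positive PMI as an edge weight
            edges[(i, j)] = pmi
    return edges

# Example: edges = pmi_edges(["the cat sat on the mat", "the dog sat"], window=5)
```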

Table 2. Experimental results of ablation study.

Model | SST-2 Weighted Avg F1 | SST-2 Macro F1 | MR Weighted Avg F1 | MR Macro F1 | CoLA Weighted Avg F1 | CoLA Macro F1
MGB | 91.38 | 91.38 | 86.49 | 86.49 | 80.70 | 76.30
IMB | 91.80 | 91.80 | 87.40 | 87.40 | 82.32 | 79.12
IMGCN | 92.48 | 92.48 | 87.81 | 87.81 | 84.02 | 80.91
As shown in Table 2, the performance of the model improves to a certain degree as each corresponding module is added, indicating that the modules are complementary to each other. It can also be observed that different modules contribute differently on different datasets.

As shown in Figure 5, the performance of the model named IMB improves over the MGB model: the F1 scores of the IMB model on the three datasets are higher than those of the MGB model. Although the gap is not large, it proves that the introduction of dependency syntax and the semantic dictionary can enrich the word embedding information of the graph. Compared with the F1 scores of the IMB model, the F1 scores of the IMGCN model on the SST-2, MR, and CoLA datasets all increase to varying degrees. This shows that, compared with simple splicing, fusing the global information and the local information through model learning (see the sketch below) yields better classification ability, which also demonstrates the effectiveness of this module.
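
The difference between the two strategies can be illustrated as follows; this is a schematic sketch with assumed dimensions, not the exact IMGCN architecture. "Simple splicing" only concatenates the global (graph) and local (BERT) features, whereas the fusion step lets a BiLSTM learn how to mix the two views before classification.

```python
import torch
import torch.nn as nn

batch, seq_len, dim = 8, 64, 768
bert_feats = torch.randn(batch, seq_len, dim)  # local, context-dependent features (BERT)
gcn_feats = torch.randn(batch, seq_len, dim)   # global, vocabulary-graph features (GCN)

# (a) Simple splicing: concatenate along the feature dimension and classify directly.
spliced = torch.cat([bert_feats, gcn_feats], dim=-1)   # (8, 64, 1536)

# (b) Learned fusion: a bidirectional LSTM reads the spliced sequence and produces
#     a fused representation that mixes the two feature sources before classification.
fusion = nn.LSTM(2 * dim, 256, batch_first=True, bidirectional=True)
fused, _ = fusion(spliced)                              # (8, 64, 512)
print(spliced.shape, fused.shape)
```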

Figure 5. Histogram of experimental results of ablation study. (Two panels: Weighted avg F1 and Macro F1 of MGB, IMB, and IMGCN on SST-2, MR, and CoLA.)
After combining the feature representation obtained by GCN and the pre-trained BERT, the feature fusion module is used for further learning, and the classification effect is better than directly splicing the two parts of the feature information; however, the larger model affects the training time. Therefore, we compare the training time of directly concatenating the two-part features with that of using the feature fusion module to combine them on the four datasets, as shown in Figure 6.

In Figure 6, the IMB model directly splices the output features of the GCN model with the pre-trained BERT word vectors for classification, while the IMGCN model extracts hierarchical features from the combination of the two. As the model becomes more complex, the required training time also increases. Compared with the IMB model, the training time required for each epoch of the IMGCN model increases by 70 s, 18 s, 15 s, and 40 s on the CoLA, SST-2, MR, and R8 datasets, respectively. The best results are achieved around 15 epochs of training. Therefore, compared with the IMB model, the overall training time of the IMGCN model on the different datasets increases by about 4 to 17 min, while the classification accuracy and Weighted avg F1 value of the IMGCN model on the CoLA dataset increase by 1.92% and 2.32%, respectively. Compared with the improvement in the classification effect, the increase in training time is within an acceptable range. Therefore, it is meaningful to further fuse and learn the feature information obtained by the GCN model and the pre-trained BERT by using contextual attention and the hierarchical model.
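
As a rough consistency check of the reported range (an illustrative calculation assuming the stated 15 training epochs):

```python
# Per-epoch training-time increase of IMGCN over IMB, from the figures reported above.
per_epoch_increase_s = {"CoLA": 70, "SST-2": 18, "MR": 15, "R8": 40}
for dataset, seconds in per_epoch_increase_s.items():
    # 15 epochs * per-epoch increase, converted to minutes (CoLA ~17.5 min, MR ~3.8 min)
    print(dataset, round(15 * seconds / 60, 1), "min")
```

These values are consistent with the reported overall increase of about 4 to 17 min.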

Figure 6. Average training time per epoch for different models on different datasets. (Training time in seconds, IMB vs. IMGCN: CoLA 134 vs. 204; SST-2 206 vs. 224; MR 233 vs. 248; R8 581 vs. 621.)

5. Conclusions

In this research, we propose a new IMGCN model that uses dependency relationships to capture context dependencies and a semantic dictionary to add useful connections between words, compensating for the shortcomings of the graph convolutional network in text representation. At the same time, a two-layer BiLSTM model is used to combine the local feature information obtained by the BERT model with the global information obtained by the GCN and to mine deeper features of the text, and an attention mechanism is added to focus on keyword features. Furthermore, by introducing residual connections, the degradation problem that occurs when the network is stacked with multiple layers is alleviated; letting the model learn the residual allows it to obtain new features and improves the effect of text classification. We have conducted experiments on four datasets and achieved a certain degree of improvement over the baseline models.

In future work, we will further explore dependencies and construct graph features richer in semantic information to improve the performance of the model. In addition, our proposed model does not analyze model complexity or how to reduce it, which is a direction for further research.

Author Contributions: Conceptualization, B.X., C.Z., X.W. and W.Z.; Methodology, B.X.; Software, B.X.; Validation, B.X.; Writing—original draft preparation, B.X.; Writing—review and editing, B.X. and C.Z.; Visualization, B.X.; Project administration, B.X. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.

Conflicts of Interest: The authors declare no conflict of interest.

References
1. Joachims, T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the
ECML, Chemnitz, Germany, 21–23 April 1998.
2. Alhajj, R.; Gao, H.; Li, X.; Li, J.; Zaiane, O.R. Advanced Data Mining and Applications. In Proceedings of the Third International
Conference (ADMA 2007), Harbin, China, 6–8 August 2007.
3. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods
in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014.
4. Zhang, H.; Xiao, L.; Wang, Y.; Jin, Y. A Generalized Recurrent Neural Architecture for Text Classification with Multi-Task Learning.
arXiv 2017, arXiv:1707.0289.
5. Zhao, W.; Peng, H.; Eger, S.; Cambria, E.; Yang, M. Towards Scalable and Reliable Capsule Networks for Challenging NLP
Applications. arXiv 2019, arXiv:1906.0282.
6. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
7. Cho, K.; Merrienboer, B.V.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations
using RNN Encoder–Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
8. Wang, R.; Li, Z.; Cao, J.; Chen, T.; Wang, L. Convolutional Recurrent Neural Networks for Text Classification. In Proceedings of
the 2019 International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019.
9. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you
Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017;
Volume 30.
10. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
arXiv 2018, arXiv:1810.04805.
11. Battaglia, P.W.; Hamrick, J.B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.F.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro,
A.; Faulkner, R.; et al. Relational inductive biases, deep learning, and graph networks. arXiv 2018, arXiv:1806.01261.
12. Kipf, T.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2017, arXiv:1609.02907.
13. Yao, L.; Mao, C.; Luo, Y. Graph Convolutional Networks for Text Classification. arXiv 2019, arXiv:1809.05679. [CrossRef]
14. Zhenbo, B.; Shiyou, Z.; Hongjun, P.; Yuanhong, W.; Hua, Y. A Survey of Preprocessing Methods for Marine Ship Target Detection
Based on Video Surveillance. In Proceedings of the 2021 7th International Conference on Computing and Artificial Intelligence,
Tianjin, China, 23–26 April 2021.
15. Lu, Z.; Du, P.; Nie, J. VGCN-BERT: Augmenting BERT with Graph Embedding for Text Classification. Adv. Inf. Retr. 2020, 12035,
369–382.
16. Xue, B.; Zhu, C.; Wang, X.; Zhu, W. An Integration Model for Text Classification using Graph Convolutional Network and BERT.
J. Phys. Conf. Ser. 2021, 2137, 012052. [CrossRef]
17. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E.H. Hierarchical Attention Networks for Document Classification.
In Proceedings of the NAACL-HLT 2016, San Diego, CA, USA, 12–17 June 2016.
18. Harris, Z.S. Distributional structure. Word 1954, 10, 146–162. [CrossRef]
19. Rousseau, F.; Kiagias, E.; Vazirgiannis, M. Text Categorization as a Graph Classification Problem. In Proceedings of the 53rd
Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language
Processing, Beijing, China, 26–31 July 2015.
20. Luo, Y.; Uzuner, Ö.; Szolovits, P. Bridging semantics and syntax with graph algorithms-state-of-the-art of extracting biomedical
relations. Brief. Bioinform. 2017, 18, 160–178. [CrossRef] [PubMed]
21. Tang, D.; Qin, B.; Liu, T. Document Modeling with Gated Recurrent Neural Network for Sentiment Classification. In Proceedings
of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015.
22. Wang, Y.; Huang, M.; Zhu, X.; Zhao, L. Attention-based LSTM for Aspect-level Sentiment Classification. In Proceedings of the
2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016.
23. Dong, Y.; Liu, P.; Zhu, Z.; Wang, Q.; Zhang, Q. A Fusion Model-Based Label Embedding and Self-Interaction Attention for Text
Classification. IEEE Access 2020, 8, 30548–30559. [CrossRef]
24. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio’, P.; Bengio, Y. Graph Attention Networks. arXiv 2018, arXiv:1710.10903.
25. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.C.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, Attend and Tell: Neural Image
Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille,
France, 6–11 July 2015.
26. Cavallari, S.; Cambria, E.; Cai, H.; Chang, K.C.; Zheng, V.W. Embedding Both Finite and Infinite Communities on Graphs. IEEE
Comput. Intell. Mag. 2019, 14, 39–50. [CrossRef]
27. Bouma, G. Normalized (pointwise) mutual information in collocation extraction. Proc. GSCL 2019, 30, 31–40.
28. Wu, Z.; Palmer, M. Verb Semantics and Lexical Selection. In Proceedings of the 32nd Annual Meeting on Association for
Computational Linguistics, Las Cruces, NM, USA, 27–30 June 1994.
29. Tang, J.; Qu, M.; Mei, Q. PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks. In Proceedings
of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13
August 2015.
30. Tesnière, L. Eléments de syntaxe structurale, 1959 Paris, Klincksieck. Can. J. Linguist./Rev. Can. Linguist. 1960, 6, 67–69.

31. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality
Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
Seattle, WA, USA, 18–21 October 2013.
32. Pang, B.; Lee, L. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, MI, USA, 25–30
June 2005.
33. Warstadt, A.; Singh, A.; Bowman, S.R. Neural Network Acceptability Judgments. Trans. Assoc. Comput. Linguist. 2019, 7, 625–641.
[CrossRef]
34. Zhu, X.; Xu, Q.; Chen, Y.; Chen, H.; Wu, T. A Novel Class-Center Vector Model for Text Classification Using Dependencies and a
Semantic Dictionary. IEEE Access 2020, 8, 24990–25000. [CrossRef]
35. Graves, A.; Mohamed, A.; Hinton, G.E. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649.
