The Study on the Text Classification Based on Graph
Convolutional Network and BiLSTM †
Bingxin Xue *, Cui Zhu, Xuan Wang and Wenjun Zhu
Abstract: Graph Convolutional Neural Network (GCN) is widely used in text classification tasks.
Furthermore, it has been effectively used to accomplish tasks that are thought to have a rich relational
structure. However, due to the sparse adjacency matrix constructed by GCN, GCN cannot make
full use of context-dependent information in text classification, and it is not good at capturing local
information. The Bidirectional Encoder Representation from Transformers (BERT) has the ability
to capture contextual information in sentences or documents, but it is limited in capturing global
(the corpus) information about vocabulary in a language, which is the advantage of GCN. Therefore,
this paper proposes an improved model to solve the above problems. The original GCN uses word
co-occurrence relationships to build text graphs. Word connections are not abundant enough and
cannot capture context dependencies well, so we introduce a semantic dictionary and dependencies.
While the model enhances the ability to capture contextual dependencies, it lacks the ability to capture
sequences. Therefore, we introduced BERT and Bi-directional Long Short-Term Memory (BiLSTM)
Network to perform deeper learning on the features of text, thereby improving the classification
effect of the model. The experimental results show that our model is more effective than previous research reports on four text classification datasets.

Keywords: text classification; graph convolutional network; dependencies; Bi-directional Long Short-Term Memory; ResNet
ignore the information with discontinuous semantic words. Moreover, they are limited
in their ability to capture the global features of the information [11], such as global word co-occurrence information and global syntactic structure information; the global dependencies between all words also have a certain guiding role for classification. The
Graph Convolutional Neural Network [12] method has been a research hotspot in recent
years. This model performs graph convolution operations on the constructed topological
relationship graph to obtain features, thereby realizing text classification. In order to solve
the classification problem, many variants of GCN [13–16] have been proposed and explored.
Although GCN is gradually becoming a good choice for graph-based text classification, the
current research still has some shortcomings that cannot be ignored.
In the recent study on text classification, researchers represented text as a graph
structure and captured the structural information of the text as well as the discontinuous
and long-distance dependencies between words through a Graph Convolutional Network.
The most representative work is Graph Convolutional Network (GCN) [12] and its variant
Text GCN [13]. This method represents words as nodes and aggregates the neighborhood
information of the nodes through an adjacency matrix, thus integrating the global context
of domain-specific languages to some extent. When the graph convolutional network
model builds text graphs, most of them use the co-occurrence relationship of words and
the inclusion relationship between documents and words, resulting in a lack of richer
relationship information expression between words and words and the insufficient capture
of long-distance dependencies of words in sentences. The measure of a word co-occurrence
relationship generally uses normalized point-wise mutual information (NPMI) to calculate
the weight between two word nodes. However, the calculation of NPMI depends on the
corpus. If some words have a low probability of appearing in the corpus, this method may
make the calculation result of NPMI very small, so that the two words are judged as having low or no similarity.
In addition, graph convolution based on word co-occurrence relationships and seman-
tic dictionary composition can only capture the relationship of non-continuous entities in
the text and cannot simultaneously capture the short-distance and long-distance depen-
dencies of words in sentences. For example, a text graph is constructed from a corpus
of word co-occurrence relations and semantic dictionaries, as shown in Figure 1. Here
“This movie can be described in one word: wonderful.” is a sentence in the corpus. Taking
“wonderful” as the center, the blue circle represents the aggregation of one-level neighbor
node information, and the green circle represents the aggregation of two-level neighbor
node information. Among them, “wonderful” is used to modify “movie” in the sentence,
and it is very useful for analyzing the expression of the text as a whole. However, the two
word nodes “movie” and “wonderful” in Figure 1 are not directly related, but there should
be a certain dependency between them. Graph Convolutional Network only aggregates
direct neighbor node information, so one layer of GCN can only capture short-distance
dependencies of words in sentences. Capturing long-range dependencies between words
such as “movie” and “wonderful” can be solved by increasing the number of GCN layers.
However, current research shows that multi-layer graph convolutional networks for text
classification tasks incur high space complexity. At the same time, increasing the number
of network layers will also make the local features converge to similar values. To overcome
this problem, we introduce dependencies when building the text graph. Dependency
syntax is mainly concerned with the relationship between words (sentence components) in
a sentence, and it is not affected by the physical location of the components. By introducing
dependencies, short-range dependencies and long-range dependencies can be well utilized
while also providing syntactic constraints and partially reducing the number of GCN layers.
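As a concrete illustration of how such dependency relations can be obtained, the short sketch below parses the example sentence used later in Figure 1. spaCy and the `en_core_web_sm` model are only illustrative assumptions here; the dependency analysis in Section 3.1 uses the Stanford Parser.

```python
# Illustrative sketch only: extract dependency arcs with spaCy
# (model name "en_core_web_sm" is an assumption; any dependency parser works).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This movie can be described in one word: wonderful.")

# Each (head, relation, dependent) arc is a candidate edge for the text graph,
# so distant but related words such as "movie" and "wonderful" can be linked directly.
for token in doc:
    print(f"{token.head.text:>12} --{token.dep_}--> {token.text}")
```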
However, due to the characteristics of the graphs, when learning word features, graph
convolutional networks do not effectively capture the word order that is very useful for
them, resulting in the loss of contextual semantic information of the text. For example,
“You have a lot of hard work.”, “You have to work hard.”, where the word “work” will
change meaning depending on its position in the contextual sentence. The graph formed by
the words in these two sentences cannot reflect the order between the words, so it cannot
represent the characteristics of word order changes. Obviously, this is very important
for understanding the meaning of the sentence. In addition, the graph convolutional
network cannot distinguish polysemy; for example, “She’s grown into quite an individual.”,
“So this individual came up.”, the meaning of the word “individual” expresses different
semantics depending on the context. In the first sentence, it is the meaning of “person with
personality”, and in the second sentence, it has the meaning of “weird person”. In this
case, if not differentiated, it may affect the understanding of the text, thereby affecting the
classification effect. Leveraging multilayer multi-head self-attention and positional word
embeddings, BERT mainly focuses on local contiguous word sequences that provide local
contextual information. By pre-training a large amount of text data in different fields, BERT
can well infer the meaning that polysemous words should express in the current context.
However, BERT may have difficulty interpreting global information about vocabulary.
Figure 1. Text Graph.
To improve the above-mentioned problems, this paper proposes an improved model (IMGCN) for text classification. On the basis of Vocabulary Graph Convolution (VGCN) [15], a semantic dictionary (WordNet) and dependencies are introduced, and the features of GCN and BERT are further learned through residual bi-layer BiLSTM with an attention mechanism [17]. By building dependencies, GCNs can be made to capture long-range dependencies of words in sentences more efficiently. WordNet was introduced to provide more useful connections between words. At the same time, the BERT model is used to extract the local feature information of a text. The global vocabulary information obtained by GCN is combined with the context information obtained by pre-trained BERT. Considering that the text has a hierarchical structure of word-sentence-document, in order to obtain more comprehensive text features, we use hierarchical BiLSTM to extract the combined features hierarchically to capture word, sentence, and document feature information, respectively. In the process of word-to-sentence feature extraction, the attention mechanism is used to generate sentence embeddings, which solves the problem that keyword features cannot be paid attention to in text classification. We changed the initialization method of the context vector in the attention mechanism and combined the bidirectional word hidden features as the initial value of the context vector to guide the learning of the attention mechanism. By introducing residual connections, the neural network degradation problem that occurs when stacking multi-layer network models can be alleviated, and the model can learn residuals to better obtain new features.
We call the proposed model IMGCN. Experiments on four benchmarking datasets
demonstrate that IMGCN can effectively improve the results in current GCN-based meth-
ods and achieve better results. The main contributions of this paper are as follows:
• The dependency and semantic dictionary are fully integrated into IMGCN. They can
enrich the connections between words and capture the long-range dependencies of
words in sentences while providing syntactic constraints and partially reducing the
number of GCN layers.
• We combine graph features and sequence features, using hierarchical BiLSTM to extract input features hierarchically and capture word-level and sentence-level feature information, respectively, so as to obtain more comprehensive text features.
2. Related Work
Traditional text classification methods are mainly based on feature engineering. Fea-
ture engineering includes the bag-of-words model [18] and n-grams, etc. Later, there was
research [19,20] that converted text into graphs and performed feature engineering on them. However, these methods cannot automatically learn the embedding representation of nodes. Compared with traditional methods, deep learning-based methods can
automatically acquire features, and they can learn the deep semantic information of the
text. In terms of deep learning models, Kim 2014 [3] used a simple convolutional neural
network (CNN) for text classification. Tang et al. 2015 [21] proposed a gated recurrent
neural network. Yang et al. 2016 [17] proposed the use of hierarchical attention networks to
model and classify documents. Wang et al. 2016 [22] introduced the attention mechanism
into LSTM. Dong et al. 2020 [23] introduced the Bidirectional Encoder Representation from
Transformers (BERT) [10] and the self-interaction attention mechanism. Most attention-
based deep neural networks mainly focus on local continuous word sequences, which
provide local context information. However, the ability to capture the global characteristics
of the relevant information may not be sufficient.
In recent years, many studies have tried to apply convolution operations to graph
data. Kipf and Welling 2017 [12] proposed Graph Convolutional Networks (GCNs), and
the model has achieved good results on node classification tasks. Velickovic et al. 2018 [24]
proposed a Graph Attention Network (GAT), which uses the attention mechanism [25]
to assign weights to the neighborhood information of nodes. Compared with GCN, which does not weight neighborhood information in this way, GAT achieves better performance. Later, Yao et al. 2019 [13] proposed the
TextGCN model, which introduced the Graph Convolutional Network to text classification
for the first time. By modeling the text into a graph, the nodes in the graph are composed
of words and documents, and a good effect is finally obtained. Cavallari et al. 2019 [26]
introduced a new setting for graph embedding, which considers embedding communities
instead of individual nodes.
Now, many studies try to combine GCN with other models. Ye et al. 2020 [14] input
the word and document nodes trained by GCN and BERT’s word vector into the BiLSTM
classification model, where the input vector of GCN is a one-hot vector. Lu et al. 2020 [15]
integrated the vocabulary graph embedding module with BERT and achieved good results
in many public datasets, where GCN uses BERT’s word vector. In contrast to this, in our
method, GCN uses BERT’s word vectors and uses dependencies when building the text
graph. One graph focuses on extracting dependencies between words, and the other graph
focuses on word co-occurrences. We use a two-layer BiLSTM with attention combined with
GCN and BERT; one layer is used to focus on the word sequence, the other layer focuses on
the sentence sequence, and a residual operation is added after BiLSTM to learn the residual.
3. Proposed Method
Following in-depth research on text classification based on neural networks, this
paper proposes IMGCN. The whole process of IMGCN can be divided into the steps of text graph construction, vocabulary graph convolution combined with BERT word embedding, residual two-layer BiLSTM feature learning with attention, and final classification, as shown in Figure 2.
Figure 2. The architecture of IMGCN.
Normalized point-wise mutual information (NPMI) is generally used as the measure of word association. However, the calculation of NPMI depends on the corpus.
If the probability of some words appearing in the corpus is low, this method may lead to
a small calculation result of NPMI, and the two words are judged as having low or no similarity. For example, the word "abundant" has a very low probability of appearing in the
corpus. Let P (word and other words) be the co-occurrence probability between words.
When calculating the co-occurrence probability of “abundant” and other words, the values
of P (abundant and other words) and P (abundant) are very small; even if it is 0, it will
result in a very small value of NPMI, so in the end, “abundant” and other words may be
judged as having little or no semantic relevance.
In addition, there are a lot of connections between words, and the most direct con-
nection is the synonymous relationship that exists between certain words. For example,
the words “abundant” and “plentiful” have very similar meanings and can be used inter-
changeably in most cases. If two words are synonymous, then only constructing a text
graph through word co-occurrence relationship may lack this part of word relationship
information. If this synonymous relationship can be used, the relationship information of
words can be enriched, thereby improving the accuracy of the calculation. The semantic
dictionary describes the different semantic relations between words. Therefore, we jointly
construct a text graph using word co-occurrence relations and semantic lexicon.
For the selection of the semantic dictionary, we use WordNet. WordNet describes a
network of semantic relationships between words. For every content word in the language,
it gives all possible word meanings. WordNet organizes words in a dictionary using
semantic and lexical relations, where semantic relations represent the degree of association
between the semantic items of two words. Nouns, verbs, adverbs, and adjectives are each
organized into a network of synonyms, each set of synonyms representing a basic semantic
concept, and these sets are also connected by various relationships. Since a word can
have multiple meanings, it is necessary to compare the meanings between the two words.
Therefore, we first use WordNet to expand synonyms and then use the WordNet-based
Wup method to calculate the semantic similarity between words in the two synonym sets
and calculate the average of all similarity scores. This value is the final similarity score of
the two words. The larger the calculated value, the higher the semantic similarity between
them. Among them, the “Wup” method [28] is a similarity measurement method based
on the path structure proposed by Wu and Palmer. The similarity is calculated according
to the similarity of the meaning of the words and the position of the synonym sets in the
hypernym tree relative to each other.
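A minimal sketch of this similarity computation, using NLTK's WordNet interface; the function name word_sim and the handling of synset pairs with no score are our own assumptions.

```python
# Sketch of the WordNet/Wup similarity described above (requires the WordNet
# corpus, e.g. nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def word_sim(word_i: str, word_j: str) -> float:
    """Average Wu-Palmer similarity over all synset pairs of the two words."""
    scores = []
    for syn_i in wn.synsets(word_i):
        for syn_j in wn.synsets(word_j):
            s = syn_i.wup_similarity(syn_j)  # position-based score in the hypernym tree
            if s is not None:                # None when the synsets share no common root
                scores.append(s)
    return sum(scores) / len(scores) if scores else 0.0

# Synonymous words such as "abundant" and "plentiful" receive a high score.
print(word_sim("abundant", "plentiful"))
```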
Therefore, for two words i and j, we first calculate the edge weights between two words
using the NPMI method. If the calculated NPMI value is small, the WordNet-based method
is used for calculation, which can minimize the problems caused by the low frequency of
some words in the corpus, and at the same time, add more word relationship information.
Formally, the edge weight between node i and node j is defined as:
\[
A_{ij} =
\begin{cases}
\mathrm{NPMI}(i,j), & i, j \text{ are words and } \mathrm{NPMI}(i,j) > 0 \\
\mathrm{WordSim}(i,j), & 0.5 \le \mathrm{WordSim}(i,j) \le 1 \text{ and } \mathrm{NPMI}(i,j) < 0 \\
1, & i = j \\
0, & \text{otherwise}
\end{cases}
\tag{1}
\]

\[
\mathrm{NPMI}(i,j) = -\frac{1}{\log p(i,j)} \log \frac{p(i,j)}{p(i)\,p(j)}
\tag{2}
\]
where p(i,j) = #W(i,j)/#W, p(i) = #W(i)/#W, #W(i) is the number of all sliding windows containing word i, #W(i,j) is the number of all sliding windows containing both word i and word j, and #W is the total number of sliding windows. We use a large window size to capture long-distance dependencies. The value range of NPMI is [−1, 1]. A positive NPMI value indicates that
the semantic relevance between words is high, while a negative NPMI value indicates that
there is little or no semantic relevance. Therefore, we only create an edge between word
pairs with positive NPMI values.
When the NPMI value of a pair of words i, j is calculated to be negative, their semantic
similarity is calculated as:
\[
\mathrm{WordSim}(i,j) = \frac{\sum_{q=1}^{len_2}\sum_{p=1}^{len_1} sim_{wup}(w_p, k_q)}{len_1 + len_2 - a}
\tag{3}
\]
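The edge-weighting scheme of Formulas (1)-(3) can be sketched as follows; the sliding-window counting interface and the function names are illustrative assumptions, and wordnet_sim stands for the WordNet-based WordSim value computed as described above.

```python
# Sketch of the edge weight A_ij between two word nodes (Formulas (1)-(3)).
import math

def npmi(n_windows: int, n_i: int, n_j: int, n_ij: int) -> float:
    """Normalized PMI of words i and j from sliding-window co-occurrence counts."""
    if n_ij == 0:
        return -1.0                              # never co-occur
    p_i, p_j, p_ij = n_i / n_windows, n_j / n_windows, n_ij / n_windows
    if p_ij == 1.0:
        return 1.0                               # co-occur in every window
    return -1.0 / math.log(p_ij) * math.log(p_ij / (p_i * p_j))

def edge_weight(n_windows: int, n_i: int, n_j: int, n_ij: int,
                wordnet_sim: float) -> float:
    """Piecewise weight of Formula (1); diagonal entries are set to 1 separately."""
    score = npmi(n_windows, n_i, n_j, n_ij)
    if score > 0:
        return score                             # reliable co-occurrence evidence
    # Fall back to the WordNet-based similarity when NPMI is negative.
    return wordnet_sim if 0.5 <= wordnet_sim <= 1.0 else 0.0
```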
Figure 3. Directed graph of dependency analysis.
After classifying the text features in the dataset according to the dependency relationship, this paper uses the following dependency-based TFIDF weight calculation method [29]. The specific steps are as follows:
For the word i, we calculate the number of times that the word i appears in the text and set it as n. Then, according to the result of the dependency syntax analysis implemented by the Stanford Parser, it is obtained that the word i belongs to the m-th (1 ≤ m ≤ n) sentence component in the text. Furthermore, according to paper [29], the m-th occurrence of word i in the text is classified into level k_{i,m} and assigned the weight w_{i,m}. The calculation process is shown in Formula (4):
\[
w_{i,m} = 2\cos\!\left[\left(\frac{k_{i,m}}{8}\right)^{\lambda} \times \frac{\pi}{2}\right]
\tag{4}
\]

\[
TF_i = \sum_{m=1}^{n} w_{i,m}
\tag{5}
\]

\[
\text{TF-IDF}_i = \frac{TF_i}{s} \times \log\!\left(\frac{D}{p_i} + 0.01\right)
\tag{6}
\]
Then the improved frequency TF_i with the weight of word i in the text is calculated by Formula (5). For word i, the improved TFIDF weight based on the dependency relationship is as shown in Formula (6), where s represents the total number of words in the text where the word i is located, p_i represents the number of texts containing the word i, and D represents the total number of texts in the dataset. λ is a parameter used to adjust the weight gap between feature levels, and its range is [0, 1].
Calculate the weight of the edge between the word node and the document node according to the above method, and obtain the adjacency matrix A, as shown in Formula (7). Then, the relationship between words and documents is mapped to the relationship between words and words, containing document information and dependencies, through matrix transformation, so as to construct another text graph Ã:

\[
\tilde{A} = A^{T} A
\tag{7}
\]

where A ∈ R^{n×m}, n is the number of document nodes, and m is the number of word nodes.
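As a small sketch of Formula (7), the word-document matrix A (whose entries would be the dependency-based TF-IDF weights of Formulas (4)-(6)) is mapped to a word-word graph; the toy matrix below is purely an assumption for illustration.

```python
# Sketch of Formula (7): A~ = A^T A maps word-document weights to a word-word graph.
import numpy as np

# Toy word-document matrix A with n = 2 documents and m = 3 words; in IMGCN the
# entries are the dependency-based TF-IDF weights of Formulas (4)-(6).
A = np.array([[0.3, 0.0, 0.7],
              [0.1, 0.5, 0.0]])

A_tilde = A.T @ A            # m x m word-word graph carrying document information
print(A_tilde.shape)         # (3, 3)
```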
3.2. Vocabulary GCN
A general graph convolutional network is a multi-layer (usually two-layer) neural network that performs convolution directly on the graph and updates a node's embedding vector based on the information of its neighbor nodes. The graph nodes of a graph convolutional network are "task entities", such as documents that need to be classified. It requires all entities, including those from training, validation, and test sets, to be displayed in the graph, so that node representations are not missing from downstream tasks. This limits the application of graph convolutional networks to many prediction tasks where test data are not seen during training. Therefore, we use Vocabulary Graph Convolution (VGCN), whose graph is constructed on the vocabulary without using the training set
to obtain entities, mainly convolving related words. Then for a single document, the
single-layer graph convolution is shown in Formula (8):

\[
H = X \tilde{A} W
\tag{8}
\]

where Ã represents the vocabulary graph, XÃ extracts the part of the vocabulary related to the input matrix X, combining the words in the input sentence with the relevant words in the vocabulary, and W holds the weight of the hidden state vector of a single document, with dimension |V| × h.
The corresponding two-level GCN with the LeakyReLU function is:

\[
\mathrm{GCN} = \mathrm{LeakyReLU}(X_{mv}\,\tilde{A}_{vv}\,W_{vh})\,W_{hc}
\tag{9}
\]
where m is the mini-batch size, v is the vocabulary size, h is the hidden layer size, and c is
the class size or sentence embedding size. W holds the weight of the hidden state vector of
a single document. Each row of Xmv is a vector containing document features, which is
the word embedding of BERT. Then a two-layer convolution is performed to combine the
words in the input sentence with the related words in the vocabulary.
Specifically, we perform two layers of graph convolution for the two text graphs
generated in the previous section, then add the two generated hidden layer vectors.
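The two-layer vocabulary graph convolution of Formulas (8) and (9), applied to the two text graphs and summed, could look roughly like the PyTorch sketch below; the class name, dimensions, and initialization are our assumptions rather than the released implementation.

```python
# Sketch of Formula (9): GCN = LeakyReLU(X A~ W_vh) W_hc, for two vocabulary graphs.
import torch
import torch.nn as nn

class VocabGCN(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int, out_size: int):
        super().__init__()
        self.W_vh = nn.Parameter(0.02 * torch.randn(vocab_size, hidden_size))
        self.W_hc = nn.Parameter(0.02 * torch.randn(hidden_size, out_size))
        self.act = nn.LeakyReLU(0.2)             # slope 0.2, as in Formula (10)

    def forward(self, X: torch.Tensor, A_tilde: torch.Tensor) -> torch.Tensor:
        # X: (batch, vocab) document features; A_tilde: (vocab, vocab) vocabulary graph
        return self.act(X @ A_tilde @ self.W_vh) @ self.W_hc

v, h, c = 100, 64, 32                            # toy sizes
gcn_cooc, gcn_dep = VocabGCN(v, h, c), VocabGCN(v, h, c)
X = torch.rand(8, v)                             # mini-batch of 8 documents
A_cooc, A_dep = torch.eye(v), torch.eye(v)       # placeholders for the two text graphs
H = gcn_cooc(X, A_cooc) + gcn_dep(X, A_dep)      # hidden vectors of the two graphs, added
```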
In the training process of the original GCN, ReLU is used as the activation function
of the GCN. In this paper, the non-linear function LeakyReLU is selected as the activation
function of GCN, which overcomes the problem of gradient disappearance and speeds up
training. In addition, L2 regularization is also used to reduce overfitting. The LeakyReLU
function is as follows:
\[
\mathrm{LeakyReLU}(x) =
\begin{cases}
0.2x, & x < 0 \\
x, & x \ge 0
\end{cases}
\tag{10}
\]
Figure 4. Residual two-layer BiLSTM structure with attention.
The module consists of a semantic feature splicing layer, a word-level feature extraction layer, a contextual attention layer, and a sentence-level feature extraction layer. First, the GCN and the pre-trained BERT model convert the text word sequence into two different word vector representations. Secondly, the fusion of the two obtained word vector representations is input into the first-layer BiLSTM model to extract the semantic dependency information between words within a single sentence, and a layer of attention is added to the first-layer BiLSTM, which constructs an overall representation of the current sentence. Here, the bidirectional word hidden features are combined as the context vector of the attention mechanism. Then the obtained sentence feature representation is input to the second-layer BiLSTM to extract semantic dependency information between sentences within the text. The learned sentence feature representation and the sentence feature representation obtained by the context attention layer are subjected to a residual operation to form an overall text representation. Finally, the overall text representation is input to the Softmax classification layer to obtain the final category of the text.
3.3.1. Semantic Feature Connection Layer
We do not directly classify the output features of GCN but use the generated hidden feature representation H(2) for subsequent learning. Since H(2) does not adequately capture contextual information and does not discriminate word vector representations of polysemy, pre-trained BERT is introduced. The BERT model uses a bidirectional transformer structure for encoding, which can characterize the specific semantics of each word in the context and distinguish multiple semantics according to the semantic relationship of the context. Each word obtains the word vector representation W of the text through the pre-trained BERT model, as shown in Formula (11):

\[
W^{(i)} = \left\{ w_1^{(i)}, w_2^{(i)}, \ldots, w_j^{(i)}, \ldots, w_n^{(i)} \right\}
\tag{11}
\]

where W^{(i)} represents the vector matrix of the i-th text, w_j^{(i)} represents the feature vector of the j-th word in the i-th text, and n represents the maximum sentence length.
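For illustration, the per-token vectors of Formula (11) can be obtained with the HuggingFace transformers library as sketched below; the checkpoint name bert-base-uncased and the use of the last hidden state are assumptions about one reasonable setup, not the authors' exact configuration.

```python
# Sketch of Formula (11): contextual word vectors from a pre-trained BERT.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

text = "This movie can be described in one word: wonderful."
inputs = tokenizer(text, return_tensors="pt", padding="max_length",
                   truncation=True, max_length=32)
with torch.no_grad():
    outputs = bert(**inputs)

W = outputs.last_hidden_state    # (1, n, 768): one vector w_j per token position
print(W.shape)
```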
Then, the embedded representation H(2) output after the second-layer graph convolution operation in the GCN model is spliced with the word vector W generated by the pre-trained BERT to obtain a multi-feature vector X, as shown in Formula (12):

\[
X = \left[ H^{(2)}, W \right]
\tag{12}
\]

The fused feature vectors are then input into the word-level BiLSTM, which computes forward and backward hidden states for each word:

\[
\overrightarrow{h_{it}} = \overrightarrow{\mathrm{LSTM}}\left(x_{it}, \overrightarrow{h_{i,t-1}}\right), \quad t \in [1, T]
\tag{13}
\]

\[
\overleftarrow{h_{it}} = \overleftarrow{\mathrm{LSTM}}\left(x_{it}, \overleftarrow{h_{i,t-1}}\right), \quad t \in [T, 1]
\tag{14}
\]

Finally, the hidden state obtained by the forward LSTM layer is spliced with the hidden state obtained by the backward LSTM layer to obtain the final hidden layer state h_it of the bidirectional LSTM, as shown in Formula (15):

\[
h_{it} = \left[\overrightarrow{h_{it}}, \overleftarrow{h_{it}}\right]
\tag{15}
\]
where Ww is the weight matrix, bw is the bias term, and αit is the normalized importance
weight. uw is the attention context vector, which contains useful information for the text to
guide the attention model to locate informative words from the input sequence, and thus
plays an important role in the attention mechanism. Prior work either ignores the context
vector or initializes it randomly, which greatly weakens the role of context. Therefore, we
use the combination of word hidden features learned by LSTM in two directions as the
context vector of the attention mechanism, that is, u_w = h_it + h_i1. It contains contextual
information of words in sentences to guide the learning of the attention mechanism.
Next, the word importance weight calculated by Formula (17) and the hidden output
result hit of the word-level feature layer are weighted and summed. The word-level
feature vectors are aggregated into sentence-level feature vectors si to form an overall
representation of the current sentence, as shown in Formula (18):
\[
s_i = \sum_{t} \alpha_{it} h_{it}
\tag{18}
\]
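A minimal PyTorch sketch of the word-level BiLSTM with attention described by Formulas (13)-(18); the layer sizes and the exact way the context vector u_w is formed from the bidirectional hidden features are our assumptions about one plausible reading.

```python
# Sketch of the word-level BiLSTM + attention (Formulas (13)-(18)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttentionBiLSTM(nn.Module):
    def __init__(self, in_size: int, hidden: int):
        super().__init__()
        self.bilstm = nn.LSTM(in_size, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 2 * hidden)     # W_w and b_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, in_size) fused GCN+BERT features of the words of one sentence
        h, _ = self.bilstm(x)                             # h_it, shape (batch, T, 2*hidden)
        # Context vector built from the bidirectional word hidden features
        # (one plausible reading of combining the two directions).
        u_w = h[:, -1, :] + h[:, 0, :]                    # (batch, 2*hidden)
        u = torch.tanh(self.proj(h))                      # (batch, T, 2*hidden)
        alpha = F.softmax((u * u_w.unsqueeze(1)).sum(-1), dim=1)  # word weights
        s = (alpha.unsqueeze(-1) * h).sum(dim=1)          # sentence vector s_i, Formula (18)
        return s

word_layer = WordAttentionBiLSTM(in_size=768, hidden=128)
s_i = word_layer(torch.rand(4, 20, 768))                  # 4 sentences of 20 words each
```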
The deepening of the number of network layers can improve the accuracy of the
model, but there are problems with the disappearance of the gradient of the deep network
and the degradation of the network. When the depth reaches a certain level, the accuracy of
the model begins to decline again, and the fitting ability becomes worse. In the deep neural
network, the residual network can effectively inhibit the network degradation phenomenon
and avoid the problem of sub-optimal solutions in deeper networks. Therefore, we intro-
duce a residual connection between two adjacent layers and perform a residual operation
on the sentence feature representation learned by the sentence-level feature extraction
layer and the sentence feature representation output by the context attention layer. Let the
model learn the residual to obtain a higher-level feature representation to achieve a better
classification effect, as shown in Formula (22):
\[
O_i = s_i + h_i
\tag{22}
\]
Finally, the result Oi is input to the Softmax layer for classification, and the final
category of the text can be obtained.
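The residual combination of Formula (22) followed by the Softmax classifier can be sketched as below; the feature sizes and the linear projection to class logits are assumptions for illustration.

```python
# Sketch of Formula (22) and the output layer: O_i = s_i + h_i, then Softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, num_classes = 256, 2
classifier = nn.Linear(hidden, num_classes)

s_i = torch.rand(4, hidden)   # sentence features from the contextual attention layer
h_i = torch.rand(4, hidden)   # sentence features from the sentence-level BiLSTM
O_i = s_i + h_i               # residual combination (Formula (22))
probs = F.softmax(classifier(O_i), dim=-1)
pred = probs.argmax(dim=-1)   # final text category
```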
4. Discussion
In this section, we will conduct experiments on four publicly available datasets and
analyze the experimental results of different models. Subsequently, ablation studies are performed to further explore the effectiveness of each component.
4.1. Datasets
For an overall comparison with other more advanced methods, we choose the SST-2
dataset [31], MR [32], CoLA [33], and R8 dataset [13], which are commonly used in other
methods to evaluate our proposed method. The four datasets contain the content of reviews,
grammar, and news. The SST-2 dataset is the Stanford Sentiment Treebank dataset, which
is a binary single-sentence classification task consisting of sentences extracted from movie
reviews and human annotations of their sentiments. We use the public version, which
has 6920 training samples, 872 validation samples, and 1821 test samples. In total, there
are 4963 positive reviews and 4650 negative reviews. The average length of reviews is
19.3 words. The MR dataset is a sentiment classification dataset for movie reviews. We used
the public version in [34] with 3554 test samples, 6392 training samples, and 710 validation
samples. The average length of reviews is 21.0 words. The CoLA dataset is a grammar
dataset released by New York University. This task is mainly to judge whether the grammar
of a given sentence is correct, which belongs to the single-sentence text binary classification
task. We use the public version with 8551 training data, 1043 validation data, and 1062 test
samples for a total of 6744 positives and 2850 negatives. The average length is 7.7 words.
The R8 dataset represents a subset of the Reuters 21,578 dataset. The dataset includes eight
categories, divided into 5485 training documents and 2189 test documents. We randomly
select 20% of the training set of this dataset as the validation set. The average length is
65.72 words.
The accuracy, weighted average F1-Score, and macro F1-Score on the test sets are presented in Table 1. The
experimental results on four benchmark datasets confirm that the performance of IMGCN
is basically better than other baseline models, which further proves the effectiveness and
robustness of IMGCN in text classification. The “—” in the table represents no data. The
bold in the table indicates the best results of the model. The Macro F1 score is the arithmetic
average of multiple F1 scores, and it assigns equal weight to each class. The weighted avg
F1 score assigns different weights to different classes according to the number of samples in
each class. When calculating the two scores, the respective Macro F1 scores and Weighted
avg F1 scores of the SST-2 and MR datasets are close. Since we rounded our calculations to
two decimal places, this resulted in the macro F1 score and the weighted average F1 score
being the same on SST-2 and MR.
Table 1. Accuracy, Weighted average F1-Score and (Macro F1-score) on the test sets.
| Model | SST-2 Weighted Avg F1 (Macro F1) | SST-2 Acc | MR Weighted Avg F1 (Macro F1) | MR Acc | CoLA Weighted Avg F1 (Macro F1) | CoLA Acc | R8 Weighted Avg F1 (Macro F1) | R8 Acc |
|---|---|---|---|---|---|---|---|---|
| Text-GCN | 80.45 | 82.39 | 75.67 | 76.70 | 56.18 (52.30) | 52.30 | — | 97.00 |
| Bi-LSTM | 81.32 | 83.50 | 76.39 | 77.70 | 62.88 (55.25) | 64.10 | — | 96.30 |
| VGCN | 81.64 | 83.43 | 76.42 | 76.42 | 63.59 (54.82) | 65.61 | 97.12 (94.65) | 97.43 |
| BERT | 91.49 | 91.53 | 86.24 | 80.30 | 81.22 (77.02) | 81.20 | 98.12 (95.59) | 98.20 |
| STGCN + BERT + BiLSTM | — | — | — | 82.50 | — | — | — | 98.50 |
| VGCN-BERT | 91.93 | 91.94 | 86.35 | 86.35 | 83.68 (80.46) | 83.70 | 98.04 (95.43) | 98.04 |
| IMGCN | 92.48 | 92.48 | 87.81 | 87.81 | 84.02 (80.91) | 84.37 | 98.14 (96.50) | 98.34 |
It can be seen from the table that IMGCN has a good classification effect compared
with the baseline models, which proves the effectiveness of the model in this paper. The
performance of the Bi-LSTM model and the models VGCN and Text-GCN that utilize the
vocabulary graph are comparable. Compared with the VGCN and BERT models, the model in this paper achieves a certain degree of improvement, indicating that the introduction of dependencies, the semantic dictionary, the BERT model, the Bi-LSTM model, and residual attention does enrich the contextual semantic information and local feature information of GCN, giving the model in this paper better classification performance than the other models. At the same time, compared with VGCN-BERT, IMGCN always
performs better. All this shows that the combination of each component of this paper
is useful.
Figure 5. Weighted avg F1 and Macro F1 of IMGCN, IMB, and MGB on the SST-2, MR, and CoLA datasets.
meaningful for us to further fuse and learn the feature information obtained by the GCN model and the pre-trained BERT by using contextual attention and the hierarchical model.
Figure 6. Average training time per epoch for different models on different datasets.
5. Conclusions
In this research, we propose a new IMGCN model that uses the dependency relationship to capture context dependency, while the semantic dictionary increases the useful connections between words to supplement the shortcomings of the graph convolutional network in text representation. At the same time, a two-layer BiLSTM model is used to combine the local feature information obtained by the BERT model and the global information obtained by the GCN, and to mine the deep features of the text. An attention mechanism is then added to focus on keyword features. Furthermore, by introducing residual connections, the neural network degradation problem that occurs when the network model is stacked with multiple layers is solved; at the same time, the model learns the residual so as to better obtain new features and improve the effect of text classification. We have conducted experiments on four datasets and achieved a certain degree of improvement compared with the baseline models.
In future work, we will further explore the dependencies and construct graph features richer in semantic information to improve the performance of the model. In addition, our proposed model does not analyze model complexity or how to reduce it, which is a direction for further research.
Author Contributions: Conceptualization, B.X., C.Z., X.W. and W.Z.; Methodology, B.X.; Software, B.X.; Validation, B.X.; Writing—original draft preparation, B.X.; Writing—review and editing, B.X. and C.Z.; Visualization, B.X.; Project administration, B.X. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available on request from the corresponding author.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Joachims, T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the
ECML, Chemnitz, Germany, 21–23 April 1998.
2. Alhajj, R.; Gao, H.; Li, X.; Li, J.; Zaiane, O.R. Advanced Data Mining and Applications. In Proceedings of the Third International
Conference (ADMA 2007), Harbin, China, 6–8 August 2007.
3. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods
in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014.
4. Zhang, H.; Xiao, L.; Wang, Y.; Jin, Y. A Generalized Recurrent Neural Architecture for Text Classification with Multi-Task Learning.
arXiv 2017, arXiv:1707.0289.
5. Zhao, W.; Peng, H.; Eger, S.; Cambria, E.; Yang, M. Towards Scalable and Reliable Capsule Networks for Challenging NLP
Applications. arXiv 2019, arXiv:1906.0282.
6. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
7. Cho, K.; Merrienboer, B.V.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations
using RNN Encoder–Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
8. Wang, R.; Li, Z.; Cao, J.; Chen, T.; Wang, L. Convolutional Recurrent Neural Networks for Text Classification. In Proceedings of
the 2019 International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019.
9. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you
Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017;
Volume 30.
10. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
arXiv 2018, arXiv:1810.04805.
11. Battaglia, P.W.; Hamrick, J.B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.F.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro,
A.; Faulkner, R.; et al. Relational inductive biases, deep learning, and graph networks. arXiv 2018, arXiv:1806.01261.
12. Kipf, T.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2017, arXiv:1609.02907.
13. Yao, L.; Mao, C.; Luo, Y. Graph Convolutional Networks for Text Classification. arXiv 2019, arXiv:1809.05679. [CrossRef]
14. Zhenbo, B.; Shiyou, Z.; Hongjun, P.; Yuanhong, W.; Hua, Y. A Survey of Preprocessing Methods for Marine Ship Target Detection
Based on Video Surveillance. In Proceedings of the 2021 7th International Conference on Computing and Artificial Intelligence,
Tianjin, China, 23–26 April 2021.
15. Lu, Z.; Du, P.; Nie, J. VGCN-BERT: Augmenting BERT with Graph Embedding for Text Classification. Adv. Inf. Retr. 2020, 12035,
369–382.
16. Xue, B.; Zhu, C.; Wang, X.; Zhu, W. An Integration Model for Text Classification using Graph Convolutional Network and BERT.
J. Phys. Conf. Ser. 2021, 2137, 012052. [CrossRef]
17. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E.H. Hierarchical Attention Networks for Document Classification NAACL.
In Proceedings of the NAACL-HLT 2016, San Diego, CA, USA, 12–17 June 2016.
18. Harris, Z.S. Distributional structure. Word 1954, 10, 146–162. [CrossRef]
19. Rousseau, F.; Kiagias, E.; Vazirgiannis, M. Text Categorization as a Graph Classification Problem. In Proceedings of the 53rd
Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language
Processing, Beijing, China, 26–31 July 2015.
20. Luo, Y.; Uzuner, Ö.; Szolovits, P. Bridging semantics and syntax with graph algorithms-state-of-the-art of extracting biomedical
relations. Brief. Bioinform. 2017, 18, 160–178. [CrossRef] [PubMed]
21. Tang, D.; Qin, B.; Liu, T. Document Modeling with Gated Recurrent Neural Network for Sentiment Classification. In Proceedings
of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015.
22. Wang, Y.; Huang, M.; Zhu, X.; Zhao, L. Attention-based LSTM for Aspect-level Sentiment Classification. In Proceedings of the
2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016.
23. Dong, Y.; Liu, P.; Zhu, Z.; Wang, Q.; Zhang, Q. A Fusion Model-Based Label Embedding and Self-Interaction Attention for Text
Classification. IEEE Access 2020, 8, 30548–30559. [CrossRef]
24. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio’, P.; Bengio, Y. Graph Attention Networks. arXiv 2018, arXiv:1710.10903.
25. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.C.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, Attend and Tell: Neural Image
Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille
France, 6–11 July 2015.
26. Cavallari, S.; Cambria, E.; Cai, H.; Chang, K.C.; Zheng, V.W. Embedding Both Finite and Infinite Communities on Graphs. IEEE
Comput. Intell. Mag. 2019, 14, 39–50. [CrossRef]
27. Bouma, G. Normalized (pointwise) mutual information in collocation extraction. Proc. GSCL 2019, 30, 31–40.
28. Wu, Z.; Palmer, M. Verb Semantics and Lexical Selection. In Proceedings of the 32nd Annual Meeting on Association for
Computational Linguistics, Las Cruces, NM, USA, 27–30 June 1994.
29. Tang, J.; Qu, M.; Mei, Q. PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks. In Proceedings
of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13
August 2015.
30. Tesnière, L. Eléments de syntaxe structurale, 1959 Paris, Klincksieck. Can. J. Linguist./Rev. Can. Linguist. 1960, 6, 67–69.
31. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality
Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
Seattle, WA, USA, 18–21 October 2013.
32. Pang, B.; Lee, L. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, MI, USA, 25–30
June 2005.
33. Warstadt, A.; Singh, A.; Bowman, S.R. Neural Network Acceptability Judgments. Trans. Assoc. Comput. Linguist. 2019, 7, 625–641.
[CrossRef]
34. Zhu, X.; Xu, Q.; Chen, Y.; Chen, H.; Wu, T. A Novel Class-Center Vector Model for Text Classification Using Dependencies and a
Semantic Dictionary. IEEE Access 2020, 8, 24990–25000. [CrossRef]
35. Graves, A.; Mohamed, A.; Hinton, G.E. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649.