
Neural Processing Letters (2023) 55:735–750

https://doi.org/10.1007/s11063-022-10906-6

A Word-Concept Heterogeneous Graph Convolutional Network for Short Text Classification

Shigang Yang1 · Yongguo Liu1 · Yun Zhang1 · Jiajing Zhu1

Accepted: 27 May 2022 / Published online: 22 June 2022


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022

Abstract
Text classification is an important task in natural language processing. However, most existing models focus on long texts, and their performance on short texts is not satisfactory due to the problem of data sparsity. To alleviate this problem, recent studies have introduced the concepts of words to enrich the representation of short texts. However, these methods ignore the interactive information between words and concepts, so the introduced concepts may become noise that hinders semantic understanding. In this paper, we propose a new model called word-concept heterogeneous graph convolution network (WC-HGCN), which introduces interactive information between words and concepts for short text classification. WC-HGCN builds a graph over words and their relevant concepts and adopts graph convolution networks to learn representations that capture this interactive information. Furthermore, we design an innovative learning strategy that makes full use of the introduced concept information. Experimental results on seven real-world short text datasets show that our model outperforms the latest baseline methods.

Keywords Short text classification · Concepts · Words · Graph convolution network

1 Introduction

Rapid developments in social networks and their usage in everyday life have caused an explosion in the amount of short electronic documents, and short texts, such as online news, queries, reviews, and tweets, are increasingly widespread on the Internet [1, 27]. Short text classification, the task of assigning predefined categories to texts, is an important basic task with a wide range of applications, including news title classification [18, 32], sentiment analysis [4, 34], and question answering systems [17]. In past research, most models have focused on long document classification, including traditional machine learning methods, such as support vector machines (SVM) [5, 30] and naive Bayesian (NB) [1], and deep learning based methods, such as convolutional neural networks (CNN) [15], recurrent neural networks (RNN) [7], and transformers [29]. Although these methods have achieved great success, they are limited

Corresponding author: Yongguo Liu (liuyg@uestc.edu.cn)
1 Knowledge and Data Engineering Laboratory of Chinese Medicine, School of Information and Software
Engineering, University of Electronic Science and Technology of China, Chengdu 610054, People’s
Republic of China


in short texts due to the lack of sufficient context information and the data sparsity problem [32, 37].
To solve these problems, various text representation models are used to capture adequate information from short texts. These models can be divided into two groups [32]. The first group consists of explicit representation models, which introduce features of short texts from outside sources, such as concepts in a knowledge base. Chen et al. [4] introduced the concepts of words by linking to a knowledge base and used an attention mechanism to weight the concept information, which was then combined with the text for classification. Xu et al. [32] also introduced the concepts of words and assigned a weight to each concept according to the words in context; the representation of each word then consisted of its pre-trained word vector and its concept information, and a convolutional neural network was used for classification. These methods improve performance on short texts, but they ignore the relevance between the introduced knowledge and the original sentence and fail to model the interaction between text and external knowledge. As a result, the introduced knowledge may not enhance the semantics of sentences; moreover, such useless knowledge can become noise and degrade performance. The second group consists of implicit representation models, which mine hidden semantic information in short texts to obtain richer features using various models, such as neural networks and topic models. Zhang et al. [37] proposed a cluster-gated convolutional neural network (CGCNN), which uses word-level clustering to mine higher-level semantic features. Yang et al. [34] proposed an extended topic model named seeded biterm topic model (SeedBTM) to explore topic information in short texts. Implicit representation models can achieve good performance without external knowledge; however, they still face the problem of data sparsity.
In order to make better use of the introduced external concept knowledge and reduce the impact of noisy data, we consider the interactive influence between words and concepts to learn contextual feature representations. More specifically, we establish separate feature learning spaces for words and concepts to learn the representation of each word and its concepts in a specific context. Such word-concept data are naturally represented as a heterogeneous graph and can be learned by a graph neural network. Hence, we propose a graph learning model combined with concept information for short text classification, called the word-concept heterogeneous graph convolution network (WC-HGCN). It introduces the concept information of words to enrich the feature representation of short texts and builds a text-level heterogeneous graph for each sentence, where the words and related concepts are viewed as nodes. A graph convolutional network (GCN) is used to update the two kinds of nodes in the graph with a designed strategy, and the text is finally classified with the two kinds of node information. Extensive experiments show that our model performs better than other baseline models, which proves that our model can learn appropriate representations of words and concepts and effectively mitigate the impact of noise. In summary, the main contributions of this paper are as follows:

– We construct a text-level word-concept heterogeneous graph neural network for short text classification.
– Extensive experiments show that the proposed method can effectively use concept knowledge and reduce the impact of the introduced noise.

The rest of the paper is organised as follows: Sect. 2 describes the related work in short text classification. The WC-HGCN framework is described in Sect. 3, while Sect. 4 describes the experiments and Sect. 4.4 discusses the experimental results. Finally, Sect. 5 presents the conclusions.


2 Related Work

Text classification has a long history of research. Given a text or document, the task of text classification is to determine which category it belongs to. Traditional machine learning methods have been widely used in text classification, including SVM [14, 26], naive Bayesian (NB) [6, 9, 33], random forest (RF) [12, 24, 31] and k-nearest neighbor (KNN) [3, 8]. These models are simple and efficient but require many hand-crafted features. With the improvement of computational performance in recent years, many deep learning based methods have been proposed to model text classification in an end-to-end way. The network structures used in these models mainly include convolutional neural networks (CNN) [15, 32], recurrent neural networks (RNN) [2, 20], transformers [29], graph neural networks (GNN) [25] and some variant models, for example recurrent convolutional neural networks (RCNN) [25]. These models achieve higher accuracy and satisfactory performance on different corpora. However, most of them are designed for long texts and face the problem of data sparsity when applied to short texts.
In order to improve performance on short texts, several methods designed for short texts have been proposed in recent years. These methods can be mainly divided into two categories: one is based on new architectures, and the other is based on additional features. The models based on new architectures focus on changing the model architecture to adapt to short texts. Zeng et al. [36] proposed a topic memory network for short text classification, which enriched semantic information by encoding latent topic representations of class labels. Zhang et al. [37] proposed a cluster-gated convolutional neural network to explore hidden semantic information in short texts. Although the above models perform well, they cannot fundamentally resolve the data sparsity and thus have limited room for improvement. The other category concentrates on introducing external knowledge to enrich the semantic representation of short texts. Chen et al. [4] introduced the concepts of words to enhance semantics by linking to a knowledge base and used an attention mechanism to weight concepts. Xu et al. [32] also used the concepts of words to alleviate the problem of data sparsity and proposed the dual embeddings convolutional neural network (DE-CNN), which combines the representations of words and their concepts to enhance semantic information. These models enrich the semantic features of sentences and achieve excellent performance. However, the existing methods cannot adequately model the correlation between the introduced knowledge and the original text, so the introduced knowledge may be useless for enhancing semantic features. Furthermore, such useless knowledge can become noise, which makes the semantic understanding of sentences difficult.
In addition, with the success of graph neural networks in various fields, many researchers have applied them to text classification. Yao et al. [35] used a text graph convolution network (Text-GCN) for text classification. They regarded text classification as a node classification task and constructed a graph for each corpus, where each text and its contained words were regarded as nodes, and the point-wise mutual information (PMI) between words and the term frequency-inverse document frequency (TF-IDF) between documents and words were regarded as the weights of edges. Based on Text-GCN, Liu et al. [21] proposed tensor graph convolutional networks (Tensor-GCN). Tensor-GCN constructs three graphs based on semantics, syntax, and sequence, and node information can be transmitted within each graph or between graphs. However, the structure and parameters of these two methods depend on the corpus and cannot be modified after training, so it is difficult for them to perform online testing. To solve this problem, recent works have focused on building a text-level graph for each sentence instead of the whole corpus. Huang et al. [11] proposed a text-level graph neural network, which constructs a graph for each text and updates nodes through a message passing mechanism. Zhang et al. [38] also constructed a text-level graph and used gated graph neural networks [19] to update nodes. However, these methods are proposed for long documents and still face data sparsity in short texts. For short texts, Hu et al. [10] proposed a heterogeneous graph attention network (HGAN), which introduces topics and entities for more abundant semantics. However, this model builds a graph for each corpus and still has the disadvantage that it cannot be tested on new samples.
To better introduce knowledge for short text classification and reduce the impact of noise, we combine the advantages of graph neural networks and external knowledge. We use a text-level heterogeneous graph to represent the text and its concepts and consider the interaction between words and concepts, so as to learn independent representations of each word and concept.

3 Model

In this paper, we propose a word-concept heterogeneous graph convolution network (WC-HGCN) for short text categorization, which can model the interaction between words and concepts. We first obtain the words and their concepts for each text, view them as nodes to construct a word-concept heterogeneous graph, and then update the node information through a graph convolution network. As shown in Fig. 1, our model consists of three parts: word-concept heterogeneous graph construction, heterogeneous graph learning, and graph classification.

3.1 Word-Concept Heterogeneous Graph Construction

We first introduce how to construct a text-level word-concept heterogeneous graph G = <V, E>. As shown in Fig. 1, there are two types of nodes, word nodes and concept nodes, and three kinds of edges: the edge between words (w-w), the directed edge from words to concepts (w-c), and the directed edge from concepts to words (c-w). For an input sentence, suppose it contains n words, expressed as S = {w_1, w_2, ..., w_n}, where w_n refers to the n-th word of sentence S. For each word, we introduce the top k concepts. The corresponding concepts of word w_n can be expressed as C_n = {c_1, c_2, ..., c_k}, where c_k denotes the k-th concept of the word w_n. The corresponding concepts of all words in sentence S can be described as C = {C_1, C_2, ..., C_n}. Following Xu et al. [32], we crawl the concepts and their probability scores of each word from the Microsoft Concept Graph.1 For example, the top k = 2 concepts of the word "Microsoft" are "company: 0.87, vendor: 0.13", where "company" is the most relevant concept of the word "Microsoft" with a probability score of 0.87. We then regard all words in the text S and their concepts C as the nodes of the graph. Next, we introduce how to construct the edges, including the edge w-w between words denoted as E^ww, the edge w-c from words to concepts denoted as E^wc, and the edge c-w from concepts to words denoted as E^cw. For the edge w-w between words, we connect the p closest neighbors of each word. The weights of these edges are initialized with the normalized word co-occurrence frequency and dynamically updated during training. For the edge w-c from words to concepts, each word is connected to its corresponding concepts by a directed edge whose initial weight is the probability score. The edges c-w from concepts to words are the reverse connections of the edges w-c. We simply take 1/m as the initial weight of an edge c-w if the concept is connected to m words.
1 https://concept.research.microsoft.com/Home/API.


Fig. 1 The architecture of our word-concept heterogeneous graph convolution network (WC-HGCN). For a sentence, we obtain all concepts of the words in the sentence and build a word-concept heterogeneous graph. In the heterogeneous graph, there are two kinds of nodes, words and concepts, where words are marked orange and concepts blue, and three kinds of edges: the edge from word to word (w-w), the edge from word to concept (w-c), and the edge from concept to word (c-w). Then, we carry out heterogeneous graph learning and update the nodes contained in the edges w-w, w-c, and c-w in order. Finally, the two types of nodes in the graph are summed separately and concatenated as the final feature of the sentence for classification. For convenience of display, the self-loops of all nodes are omitted

Concretely, the graph G = <V, E> of sentence S is defined as:

V = {v | v ∈ S ∪ U}    (1)
E = {e | e ∈ E^ww ∪ E^wc ∪ E^cw}    (2)

where U = C_1 ∪ C_2 ∪ ... ∪ C_n denotes the set of all introduced concept nodes.
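To make this construction concrete, the following Python sketch (written by us for illustration, not taken from the authors' released code) builds the node set and the three weighted edge sets for one sentence. It assumes a hypothetical concept_lookup dictionary mapping each word to its pre-crawled top-k (concept, probability-score) pairs, e.g. obtained from the Microsoft Concept Graph, and initialises the w-w weights with a simple per-word normalised co-occurrence count as a stand-in for the corpus statistics described in the text.

```python
from collections import defaultdict

def build_word_concept_graph(words, concept_lookup, top_k=4, p=1):
    """Sketch of the text-level heterogeneous graph of Sect. 3.1.

    words:          tokenized sentence, e.g. ["taking", "on", "apple", ...]
    concept_lookup: hypothetical dict  word -> [(concept, prob), ...]
    Returns the node list and three weighted edge dicts (w-w, w-c, c-w).
    """
    # Concept nodes are the union of the top-k concepts of all words.
    concepts = {c for w in words for c, _ in concept_lookup.get(w, [])[:top_k]}
    nodes = list(words) + sorted(concepts)

    # w-w edges: connect each word to its p closest neighbors in the sentence,
    # with weights normalised per source word (a simple stand-in for the
    # normalised co-occurrence frequency used in the paper).
    e_ww = defaultdict(float)
    for i, w in enumerate(words):
        for j in range(max(0, i - p), min(len(words), i + p + 1)):
            if j != i:
                e_ww[(w, words[j])] += 1.0
    deg = defaultdict(float)
    for (w, _), v in e_ww.items():
        deg[w] += v
    e_ww = {k: v / deg[k[0]] for k, v in e_ww.items()}

    # w-c edges: directed word -> concept, weighted by the crawled probability score.
    e_wc = {(w, c): s for w in words
            for c, s in concept_lookup.get(w, [])[:top_k]}

    # c-w edges: reverse direction, weight 1/m where m = #words linked to the concept.
    fan_in = defaultdict(list)
    for (w, c) in e_wc:
        fan_in[c].append(w)
    e_cw = {(c, w): 1.0 / len(ws) for c, ws in fan_in.items() for w in ws}

    return nodes, e_ww, e_wc, e_cw
```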

3.2 Heterogeneous Graph Learning

After the graph is built, each node in the graph is represented as a vector by word embedding, and we use the graph convolution network (GCN) [35] to update the nodes in the graph.
Consider a graph G = <V, E>, where V (|V| = u) and E are the sets of nodes and edges, respectively. We introduce the adjacency matrix A of G and its degree matrix D, where D_ii = Σ_j A_ij. For a node v ∈ V, its vector representation x_v is obtained by looking up the embedding table, and we use Eq. 3 to update the nodes and obtain the higher-level feature representation h_v, where W_0 is the weight matrix of edges:

h_v = ReLU(D^{-1/2} A D^{-1/2} x_v W_0)    (3)
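For reference, the propagation step of Eq. 3 can be written compactly as follows; this is a minimal PyTorch sketch we provide for illustration, not the authors' implementation, and it assumes the self-loops are already included in A.

```python
import torch

def gcn_update(A, X, W):
    """Minimal sketch of Eq. 3: H = ReLU(D^{-1/2} A D^{-1/2} X W).

    A: (u, u) weighted adjacency matrix (self-loops assumed already added),
    X: (u, d) node features, W: (d, d) learnable weight matrix.
    """
    deg = A.sum(dim=1)                               # D_ii = sum_j A_ij
    d_inv_sqrt = torch.diag(deg.clamp(min=1e-12).pow(-0.5))
    A_hat = d_inv_sqrt @ A @ d_inv_sqrt              # symmetric normalisation
    return torch.relu(A_hat @ X @ W)
```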
We notice that some words may also belong to the concept category, such as "company" and "fruit". To make the expressions of words and concepts clearer, we maintain a word embedding table for words and a concept embedding table for concepts: if the node v is a word, its representation is looked up from the word embedding table, and if it is a concept, from the concept embedding table. In order to learn the interaction


between words and concepts, we design the learning order according to the types of edges, and only the nodes contained in one edge type are updated at a time. Firstly, we update the nodes belonging to the edge w-w in the graph, as shown in Fig. 1. The word "taking" updates itself by aggregating information from its directly connected neighbor nodes, "Microsoft" and "on". The relationship between words can be regarded as context learning and is helpful for understanding the context. When all nodes contained in the edge w-w have been updated by obtaining information from their neighbors, each node carries contextual information. Then we learn the interactive information between words and concepts in two steps: using word nodes with context information to adjust concept nodes, and in turn using concept nodes to update word nodes. After the w-w update, each word carries contextual information which is essential to understanding the sentence. However, the initial concept representations are inconsistent with the context of the current sentence, so we adjust the representations of concepts using the context-aware words, which means updating the nodes in the edge w-c. Hence, each concept aggregates information from itself and its word neighbor nodes. For example, the concept "company" receives information from its word neighbor nodes "company" and "apple". Finally, when all concept nodes have been updated, we in turn use the concept representations to enhance the expression of words. At this point, we update the nodes contained in the edge c-w, and each word receives information from its concept neighbor nodes to update itself. For example, the word "apple" receives information from the concept nodes "company" and "fruit". After these three updates, the context of each word and the interactive information between words and concepts have been fully learned.
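The following sketch summarises this three-stage order. It reflects our reading of the text rather than the released code: it reuses the gcn_update helper sketched above, assumes a single hidden width d for all weight matrices, and, for brevity, aggregates the rectangular w-c and c-w blocks directly with their edge weights plus a self term instead of re-deriving the full normalised heterogeneous adjacency.

```python
import torch

def heterogeneous_update(A_ww, A_wc, A_cw, X_word, X_concept, W_ww, W_wc, W_cw):
    """Sketch of the three-stage update order of Sect. 3.2 (shapes are assumptions).

    A_ww: (n, n) word-word adjacency with self-loops,
    A_wc: (m, n) concept-from-word adjacency, A_cw: (n, m) word-from-concept adjacency,
    X_word: (n, d) word embeddings, X_concept: (m, d) concept embeddings,
    W_*:   (d, d) learnable weight matrices, one per edge type.
    """
    # Step 1 (edge w-w): words aggregate context from neighbouring words,
    # using the normalised propagation of the gcn_update sketch above.
    H_word = gcn_update(A_ww, X_word, W_ww)

    # Step 2 (edge w-c): concepts are adjusted by their context-aware word
    # neighbours; the concept's own embedding is kept as a self term.
    H_concept = torch.relu(A_wc @ H_word @ W_wc + X_concept)

    # Step 3 (edge c-w): words are enhanced with the updated concept
    # representations while keeping their contextual state.
    H_word = torch.relu(A_cw @ H_concept @ W_cw + H_word)
    return H_word, H_concept
```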

3.3 Graph Classification

When the nodes in the graph have been sufficiently updated, they are aggregated into a graph-level representation of the short text. More specifically, all word nodes and concept nodes in the graph are summed separately and then concatenated as the final representation H of the sentence. The final text category is obtained by a linear layer followed by a softmax layer. The goal of training is to minimize the cross-entropy loss between the ground-truth label and the predicted label.
H = Concat( Σ_{w=1}^{n} h_w , Σ_{c=1}^{n*k} h_c )    (4)

p = softmax(W · H + b)    (5)

L = − Σ_t y_t log(p_t)    (6)

where h_w and h_c are the high-level feature representations of words and concepts, respectively, W and b are the weights and bias, y_t is the t-th element of the one-hot label, and p_t represents the score of the t-th category label normalized by the softmax function.
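Eqs. 4-6 correspond to the following readout and classification step, given as a minimal sketch under assumed tensor shapes rather than the authors' actual code.

```python
import torch

def classify_graph(H_word, H_concept, W, b, y_true=None):
    """Sketch of Eqs. 4-6: sum-pool each node type, concatenate, classify.

    H_word: (n, d) updated word nodes, H_concept: (n*k, d) updated concept nodes,
    W: (num_classes, 2*d) and b: (num_classes,) form the linear layer (assumed shapes),
    y_true: optional (num_classes,) one-hot label for the cross-entropy loss.
    """
    H = torch.cat([H_word.sum(dim=0), H_concept.sum(dim=0)], dim=-1)   # Eq. 4
    p = torch.softmax(W @ H + b, dim=-1)                               # Eq. 5
    loss = None
    if y_true is not None:
        loss = -(y_true * torch.log(p + 1e-12)).sum()                  # Eq. 6
    return p, loss
```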

4 Experiments

In this section, we evaluate the performance of our model on the real-world datasets and
compare it with baseline models. Here, we first describe each dataset in detail and experi-


Table 1 The description of experimental datasets. Here, Class indicates the number of classification labels. Training and Test define the number of training and test data respectively. Max, Min and Avg represent the maximum, minimum and average words of sentences in the dataset respectively

Dataset     Class  Training  Test   Max  Min  Avg
Biomedical  20     17976     1998   31   1    7.8
Dblp        6      61422     20000  40   1    8.5
MR          2      7074      3554   56   1    20.4
SST1        5      9600      2210   51   1    18.4
SST2        2      7770      1821   51   1    18.5
TREC        6      5394      500    38   3    11.3
Twitter     2      79762     8863   112  1    8.4

mental settings. Then, WC-HGCN is compared with nine baseline methods to evaluate its performance in short text classification. After that, the parameter analysis of WC-HGCN is presented. All experiments are conducted with Python 3.6 on an Ubuntu 18.04 server with an Intel Xeon Gold 5112 CPU. Following previous works [11, 35], we choose the test accuracy as the evaluation metric of classification. A higher accuracy represents better performance of the model.

4.1 Dataset

We conduct extensive experiments on seven commonly used short text classification datasets: Biomedical, Dblp, MR, SST1, SST2, TREC, and Twitter. Detailed descriptions of the datasets are given in Table 1. For all datasets, we randomly select 10% of the training set as the validation set because none of the datasets has a standard validation set. All datasets are available online.2
Biomedical Biomedical is a subset of the challenge data published on the BioASQ’s website,
where 19974 paper titles from 20 groups are randomly selected.
Dblp This dataset contains titles of papers from the computer science bibliography [28].
MR The movie review dataset contains 10,628 sentences and each sentence denotes a positive
or negative comment on movies.
SST1 Stanford sentiment treebank is an extension of MR, which refines the labels into five classes: very negative, negative, neutral, positive and very positive.
SST2 Derived from the same source as SST1, SST2 includes only two classes: positive and negative.
TREC This dataset is a question dataset, which classifies sentences into 6 question types,
including person, location, numeric information, etc.
Twitter We build the Twitter dataset from the SemEval 2013 data,3 in which 88625 sentences are selected for model evaluation.

4.2 Baseline Methods

For the purpose of comparison, we compare our model with the following baseline methods
for short text classification.
TextCNN [15]. This is a classic convolutional neural network baseline model and a state-of-the-art method for text classification.
2 https://drive.google.com/file/d/1Li2QyQm8dCVT81jBLmbT0tQm6HNVzTyA.
3 https://www.cs.york.ac.uk/semeval-2013/task2/.


FastText [13]. A simple and efficient text classification method, which treats the average
of word/n-grams embeddings as document embeddings, then feeds document embeddings
into a linear classifier.
Bi-LSTM [20]. A text classification model based on Bi-directional LSTM. We implement
a simplified version with a single classification task.
Transformer [29]. Transformer is a model based on self-attention, which consists of two
parts: encoder and decoder. We use its encoder for short text classification.
Global-Local-Encoder(GL-Encoder) [22]. GL-Encoder is a text classification model,
which considers local and global information by two encoders. Here, we use CNN as its
encoders.
TextGCN [35]. This model views the text corpus as a graph with documents and words as nodes and applies a graph convolution network for document classification.
Text-level-GNN (TL-GNN) [11]. TL-GNN is a text-level graph neural network model based on a message passing mechanism, in which a graph is constructed for each input text and the words in the text are used as nodes.
STCKA [4]. STCKA is a short text classification model based on words and their concepts,
where each short text and its related concepts are encoded separately, and two attention
mechanisms are used to calculate the weight of the concept.
DE-CNN [32]. DE-CNN is a short text classification model based on texts and their
concepts, where each short text and its related concepts are encoded separately. Then they
are concatenated and encoded by CNN.

4.3 Parameter Setups

We use Adam [16] for learning with a learning rate of 0.001, and L2 weight decay is set to 0.0001. The batch size is set to 256 and the total number of training epochs is set to 1000. In the heterogeneous graph, words are only linked with their 1-hop neighbors. To prevent over-fitting, we set the dropout rate to 0.5 and stop training if the validation accuracy does not increase for 20 consecutive epochs. The dimension of the node representations is 300, and they are initialized with random vectors or GloVe [23]. The pre-trained GloVe word vectors are available online.4 For the baseline models, we use the default parameter settings from their original implementations. For the models using pre-trained word embeddings, we use the same 300-dimensional GloVe word embeddings as our model. For the models using concepts, we introduce four concepts for every word. For a fair comparison, a grid search strategy is adopted to tune the parameters of WC-HGCN and the best results are reported.
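For reference, these optimisation settings translate into roughly the following PyTorch configuration. This is a hedged sketch with a stand-in model; the actual WC-HGCN module and data pipeline are not shown here.

```python
import torch
import torch.nn as nn

# Minimal sketch of the reported training setup: Adam with lr = 0.001 and
# L2 weight decay 1e-4, dropout 0.5, and early stopping after 20 epochs without
# validation-accuracy improvement. The model below is only a placeholder.
model = nn.Sequential(nn.Dropout(0.5), nn.Linear(600, 6))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

def should_stop(val_acc_history, patience=20):
    """Return True when validation accuracy has not improved for `patience` epochs."""
    best_epoch = max(range(len(val_acc_history)), key=val_acc_history.__getitem__)
    return len(val_acc_history) - 1 - best_epoch >= patience
```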

4.4 Experimental Results and Discussion

4.4.1 Comparisons with Baseline Methods

Table 2 reports the results of our model against the other baseline methods. Concretely, we find that TextCNN, as a classic baseline model, achieves better performance than most baseline models. We think the main reason is that convolutional neural networks can capture the most important features from short texts through the convolution operation. Although FastText also uses local sliding windows like TextCNN, its performance is not satisfactory, which shows that the convolution operation can better mine semantic features in short texts. After

4 http://nlp.stanford.edu/data/glove.6B.zip.


Table 2 Test accuracy comparison with baseline methods on benchmark datasets


Model Biomedical Dblp MR SST1 SST2 TREC Twitter Avg

TextCNN 0.6657 0.7667 0.7617 0.4127 0.8103 0.9688 0.5522 0.7054
Bi-LSTM 0.6436 0.7522 0.7580 0.4036 0.8067 0.9727 0.5518 0.6984
Transformer 0.6532 0.7560 0.7347 0.4158 0.8001 0.9648 0.5517 0.6966
FastText 0.6201 0.7491 0.6216 0.3344 0.6623 0.9453 0.5526 0.6408
GL-Encoder 0.6912 0.7418 0.6781 0.3950 0.7595 0.9780 0.5513 0.6850
Text-GCN 0.6862 0.7777 0.7634 0.3873 0.8166 0.9060 0.5479 0.6979
TL-GNN 0.5661 0.7715 0.7470 0.3824 0.7946 0.9720 0.5521 0.6837
STCKA 0.6802 0.7724 0.7670 0.4167 0.8207 0.9720 0.5514 0.7115
DE-CNN 0.6527 0.7565 0.6820 0.4181 0.7831 0.9609 0.5523 0.6865
WC-HGCN 0.7029 0.7870 0.7739 0.4264 0.8258 0.9840 0.5546 0.7221
The highest test accuracy for each dataset is marked in bold

TextCNN, we notice that Bi-LSTM and Transformer also perform well. This result is expected because their structures are suitable for dealing with sequential sentences. However, their accuracy is lower than that of TextCNN. We think the main reason is that the coherence between words in short texts is weak, which degrades their performance. For the GL-Encoder model with CNN as its encoders, we observe that its performance is not as good as TextCNN. We suggest that the most likely cause is that the introduction of repeated features leads to semantic confusion, so GL-Encoder cannot extract effective semantic information. Besides, we notice that the models based on graph neural networks achieve good results on short texts, which shows that graph structures can obtain high-level semantic features from sentences. However, these models cannot do better than our WC-HGCN because they face the problem of data sparsity in short texts; this shows that the introduced external knowledge can effectively alleviate the sparsity problem. To verify that our model can learn more accurate semantic representations, we compare WC-HGCN with the two baseline models that introduce external knowledge, STCKA based on Bi-LSTM and DE-CNN based on CNN. The results show that STCKA achieves a clear improvement over Bi-LSTM, which proves that the introduction of external knowledge can strengthen the semantics of sentences. Observing the performance of DE-CNN, we find that it declines compared with TextCNN, which shows that the representation of the introduced knowledge should be learned reasonably; otherwise, the introduced knowledge contains too much noise, resulting in performance degradation. This supports our claim that WC-HGCN learns the representation of the introduced knowledge by considering the correlation between words and concepts.

4.4.2 Effect of the Number of Concepts

Figure 2 shows the average test accuracy of our model on the seven datasets with different numbers of concepts. As can be seen clearly, the average accuracy first increases with the number of concepts, reaching its highest value at 4, and falls when the number is larger than 4. This shows that introducing a certain amount of knowledge can indeed alleviate the problem of data sparsity in short texts; however, introducing excessive knowledge brings noise, resulting in performance degradation. Table 3 shows in detail the variation of test accuracy on each dataset with different numbers


Fig. 2 The variation curve of the average test accuracy with different numbers of concepts

Table 3 Test accuracy comparison with a different number of concepts on benchmark datasets
Concepts Biomedical Dblp MR SST1 SST2 TREC Twitter Avg

1 0.7029 0.7836 0.7632 0.4046 0.8049 0.9720 0.5517 0.7118
2 0.7019 0.7868 0.7739 0.3942 0.8093 0.9780 0.5527 0.7138
3 0.6954 0.7871 0.7691 0.4187 0.8071 0.9740 0.5459 0.7139
4 0.7024 0.7833 0.7634 0.3965 0.8203 0.9840 0.5541 0.7149
5 0.6934 0.7853 0.7634 0.4033 0.8049 0.9780 0.5529 0.7116

of concepts. From Table 3, it can be seen that the test accuracy first rises and then decreases on most datasets, such as Dblp, MR and TREC. One main reason why the performance of WC-HGCN declines is that the introduced concepts are independent of the context of the current sentence and become noise that interferes with the understanding of sentence semantics. As the number of concepts increases, the probability scores between a word and its introduced concepts become lower, and these irrelevant concepts finally make semantic understanding difficult.

4.4.3 Effect of the Number of Neighbors

As Table 4 shows, we compare the performance with different numbers of neighbor nodes. As can be seen from Table 4, with the increase of neighbor nodes, the test accuracy shows no clear trend and fluctuates only slightly. This means that the GCN can learn information from non-neighbor nodes through gradual transmission: even when two nodes are not connected directly, if there is a path between them, node information can still be passed along the path.

4.4.4 Effect of the Number of WC-HGCN Layers

Figure 3 shows the performance with different numbers of layers. As can be seen from Fig. 3, the test accuracy presents a downward trend as the number of layers increases. This means that the nodes in the

Table 4 Test accuracy comparison with a different number of neighbors on benchmark datasets
Neighbors Biomedical Dblp MR SST1 SST2 TREC Twitter Avg

1 0.7024 0.7833 0.7634 0.3965 0.8203 0.9840 0.5541 0.7149
2 0.6959 0.7832 0.7674 0.3933 0.8121 0.9800 0.5544 0.7123
3 0.6994 0.7793 0.7609 0.4078 0.8104 0.9720 0.5546 0.7121
4 0.6914 0.7801 0.7598 0.3919 0.8154 0.9720 0.5542 0.7093
5 0.6909 0.7756 0.7665 0.4169 0.8165 0.9760 0.5542 0.7138

Fig. 3 The variation curve of test accuracy with different numbers of layers

graph become over-smooth as the number of layers increases, and all nodes eventually learn the same expression, so the model cannot learn features that distinguish the semantics. Table 5 shows the detailed variation with different numbers of layers. We observe that the average test accuracy of the 2-layer WC-HGCN is 2.7% lower than that of the 1-layer WC-HGCN, and the accuracy of the 10-layer WC-HGCN is lowered by 5.68%. Furthermore, we notice that there is a considerable difference in test accuracy between the 1-layer and 5-layer WC-HGCN on Biomedical and Dblp, 12.42% and 10.89% respectively. We believe that the main reason for this phenomenon is that these two datasets have shorter average sentence lengths than the other datasets, so the higher-level layers cannot capture more effective features from the lower-level layers.

4.4.5 Analysis of Memory and Time Consumption

We further analyze the memory and time consumption of WC-HGCN and the baselines. Tables 6 and 7 show the average values of memory consumption and running time over five runs, respectively. It can be seen from Table 6 that, due to its complex architecture, the memory consumption of Transformer on the seven datasets is higher than that of the other models, and its average memory consumption reaches 9421 MB. Besides, we notice that the TL-GNN model consumes the least memory, with an average memory consumption of only 588 MB. We infer that one main reason is that duplicated words are removed when building the graph, resulting in less memory consumption for each data instance. Comparing the graph models TL-

Table 5 Test accuracy comparison with a different number of layers on benchmark datasets
Layers Biomedical Dblp MR SST1 SST2 TREC Twitter Avg

1 0.7024 0.7833 0.7634 0.3965 0.8203 0.9840 0.5541 0.7149
2 0.6568 0.7670 0.7395 0.3512 0.7813 0.9660 0.5537 0.6879
3 0.6278 0.7513 0.7333 0.3584 0.7802 0.9560 0.5529 0.6800
4 0.6102 0.6860 0.7268 0.3457 0.7604 0.9600 0.5529 0.6631
5 0.5782 0.6744 0.7525 0.3475 0.7731 0.9700 0.5529 0.6641
6 0.5256 0.6872 0.7251 0.3158 0.7747 0.9440 0.5528 0.6465
7 0.5426 0.7153 0.7330 0.3498 0.7819 0.9500 0.5528 0.6608
8 0.4765 0.7182 0.7254 0.3253 0.7813 0.9340 0.5528 0.6448
9 0.4734 0.7172 0.7347 0.3752 0.7714 0.9500 0.5528 0.6535
10 0.5125 0.7320 0.7316 0.3394 0.7901 0.9480 0.5528 0.6581

Table 6 Comparison of average memory consumption over five runs (MB)


Model Biomedical Dblp MR SST1 SST2 TREC Twitter Avg

TextCNN 429 919 561 511 424 279 192 721
Bi-LSTM 159 4963 2020 2206 1826 972 8745 3189
Transformer 5165 18699 6188 7054 4925 2293 21620 9421
FastText 1767 3142 1781 1931 1878 1492 5316 2472
GL-Encoder 634 1421 816 738 629 561 631 776
Text-GCN 1674 1977 625 636 1358 907 1912 1196
TL-GNN 332 813 458 524 587 248 1151 588
STCKA 485 996 824 813 745 522 1102 784
DE-CNN 1500 5085 1946 2338 1893 1023 6488 2896
WC-HGCN 1117 3566 1325 1303 1149 633 4247 1906

GNN, Text-GCN and WC-HGCN, we observe that Text-GCN needs more memory than TL-GNN because Text-GCN uses the whole corpus to build its graph, which contains more edges than the graphs of TL-GNN. WC-HGCN introduces the concepts of words, and the constructed graph contains edges between words and concepts, so the memory consumption of WC-HGCN is the largest among these three models. As can be seen from Table 7, we observe that Text-GCN has the lowest time consumption, with an average of 53 s. The most likely cause is that Text-GCN constructs the whole corpus as a single graph and updates all parameters through matrix operations in each epoch, while the parameters of the other models are updated through batch processing. Comparing the text-level WC-HGCN and TL-GNN models, we notice that the average time consumption of WC-HGCN is lower, which shows that although both are text-level graph learning models, the learning mechanism of WC-HGCN is faster and converges more easily than that of TL-GNN.


Table 7 Comparison of average time consumption over five runs (s)


Model Biomedical Dblp MR SST1 SST2 TREC Twitter Avg

TextCNN 185 686 127 128 150 60 598 276
Bi-LSTM 151 1001 95 156 104 54 1073 376
Transformer 326 1284 275 290 236 92 583 441
FastText 2862 15077 626 879 874 360 5637 3759
GL-Encoder 269 981 356 247 162 168 612 399
Text-GCN 32 136 44 41 30 16 71 53
TL-GNN 772 3563 395 455 379 239 5531 1619
STCKA 299 1463 188 254 199 91 1974 638
DE-CNN 1332 5502 547 752 633 187 6444 2199
WC-HGCN 458 2521 434 348 609 599 2120 1013

5 Conclusions

In this paper, we propose a novel word-concept heterogeneous graph convolutional network (WC-HGCN), which incorporates words and concepts for short text classification. In WC-HGCN, each sentence owns its own structural heterogeneous graph, and text-level interactions between words and concepts can be learned by graph convolutional networks. Experimental results on real-world datasets show that WC-HGCN achieves better performance than other baseline methods, which proves the effectiveness of our approach in modeling short texts. However, there are still problems to be solved. We find that WC-HGCN faces the same problem as other graph convolutional networks in that its performance becomes worse as the number of layers increases. In the future, we plan to introduce residual connections to improve the prediction results; for example, we can combine the original input and the output of lower-level layers as the input of higher-level layers.
Acknowledgements We thank the anonymous reviewers for their many innovative comments and suggestions.

Funding This research was supported in part by the National Key R&D Program of China under Grant
2017YFC1703905, the Natural Science Foundation of Sichuan under Grant 2022NSFSC0958, and the Sichuan
Science and Technology Program under Grants 2020YFS0372, 2020YFS0302 and 2020YFS0283.

Availability of data and material All datasets are available online.

Code Availability The code of this paper is available online.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.

References
1. Alsmadi IM, Gan KH (2019) Review of short-text classification. Int J Web Inf Syst 15(2):155–182. https://
doi.org/10.1108/IJWIS-12-2017-0083
2. Arevian G (2007) Recurrent neural networks for robust real-world text classification. In: 2007 IEEE /
WIC/ACM international conference on web intelligence, WI 2007, 2–5 November 2007, Silicon Valley,


CA, USA, Main Conference Proceedings, IEEE Computer Society, pp 326–329. https://doi.org/10.1109/
WI.2007.126
3. Batal I, Hauskrecht M (2009) Boosting KNN text classification accuracy by using supervised term weight-
ing schemes. In: Cheung DW, Song I, Chu WW, Hu X, Lin JJ (eds) Proceedings of the 18th ACM
conference on information and knowledge management, CIKM 2009, Hong Kong, China, November
2–6, 2009, ACM, pp 2041–2044. https://doi.org/10.1145/1645953.1646296
4. Chen J, Hu Y, Liu J, Xiao Y, Jiang H (2019) Deep short text classification with knowledge powered atten-
tion. In: The thirty-third aaai conference on artificial intelligence, AAAI 2019, the thirty-first innovative
applications of artificial intelligence conference, IAAI 2019, The Ninth AAAI symposium on educational
advances in artificial intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27–February 1, 2019,
AAAI Press, pp 6252–6259. https://doi.org/10.1609/aaai.v33i01.33016252
5. Dilrukshi I, de Zoysa K (2014) A feature selection method for twitter news classification. Int J Mach
Learn Comput 4(4):365
6. Ding W, Yu S, Wang Q, Yu J, Guo Q (2008) A novel naive bayesian text classifier. In: Yu F, Luo Q (eds)
International symposium on information processing, ISIP 2008/international pacific workshop on web
mining, and web-based application, WMWA 2008, Moscow, Russia, 23–25 May 2008, IEEE Computer
Society, pp 78–82.https://doi.org/10.1109/ISIP.2008.54
7. Du C, Huang L (2018) Text classification research with attention-based recurrent neural networks. Int J
Comput Commun Control 13(1):50–61. https://doi.org/10.15837/ijccc.2018.1.3142
8. Han E, Karypis G, Kumar V (2001) Text categorization using weight adjusted k-nearest neighbor clas-
sification. In: Cheung DW, Williams GJ, Li Q (eds) Knowledge discovery and data mining—PAKDD
2001, 5th Pacific-Asia Conference, Hong Kong, China, April 16–18, 2001, Proceedings, Springer, Lecture
Notes in Computer Science, vol 2035, pp 53–65.https://doi.org/10.1007/3-540-45357-1_9
9. Hindi KME, Aljulaidan RR, AlSalman H (2020) Lazy fine-tuning algorithms for naïve bayesian text
classification. Appl Soft Comput 96:106652.https://doi.org/10.1016/j.asoc.2020.106652
10. Hu L, Yang T, Shi C, Ji H, Li X (2019) Heterogeneous graph attention networks for semi-supervised
short text classification. In: Inui K, Jiang J, Ng V, Wan X (eds) Proceedings of the 2019 conference on
empirical methods in natural language processing and the 9th international joint conference on natural
language processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019, Association for
Computational Linguistics, pp 4820–4829.https://doi.org/10.18653/v1/D19-1488
11. Huang L, Ma D, Li S, Zhang X, Wang H (2019) Text level graph neural network for text classification. In:
Inui K, Jiang J, Ng V, Wan X (eds) Proceedings of the 2019 conference on empirical methods in natural
language processing and the 9th international joint conference on natural language processing, EMNLP-
IJCNLP 2019, Hong Kong, China, November 3–7, 2019, Association for Computational Linguistics, pp
3442–3448.https://doi.org/10.18653/v1/D19-1345
12. Islam MZ, Liu J, Li J, Liu L, Kang W (2019) A semantics aware random forest for text classification.
In: Zhu W, Tao D, Cheng X, Cui P, Rundensteiner EA, Carmel D, He Q, Yu JX (eds) Proceedings of the
28th ACM international conference on information and knowledge management, CIKM 2019, Beijing,
China, November 3–7, 2019, ACM, pp 1061–1070. https://doi.org/10.1145/3357384.3357891
13. Joulin A, Grave E, Bojanowski P, Mikolov T (2017) Bag of tricks for efficient text classification. In:
Lapata M, Blunsom P, Koller A (eds) Proceedings of the 15th conference of the european chapter of the
association for computational linguistics, EACL 2017, Valencia, Spain, April 3–7, 2017, Volume 2: Short
Papers, Association for Computational Linguistics, pp 427–431.https://doi.org/10.18653/v1/e17-2068
14. Keerthi SS (2005) Generalized LARS as an effective feature selection tool for text classification with
svms. In: Raedt LD, Wrobel S (eds) Machine learning, proceedings of the twenty-second international
conference (ICML 2005), Bonn, Germany, August 7–11, 2005, ACM, ACM International Conference
Proceeding Series, vol 119, pp 417–424.https://doi.org/10.1145/1102351.1102404
15. Kim Y (2014) Convolutional neural networks for sentence classification. In: Moschitti A, Pang B, Daele-
mans W (eds) Proceedings of the 2014 conference on empirical methods in natural language processing,
EMNLP 2014, October 25–29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of
the ACL, ACL, pp 1746–1751.https://doi.org/10.3115/v1/d14-1181
16. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y (eds) 3rd
international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015,
Conference Track Proceedings.http://arxiv.org/abs/1412.6980
17. Li C, Ouyang J, Li X (2019) Classifying extremely short texts by exploiting semantic centroids in word
mover’s distance space. In: Liu L, White RW, Mantrach A, Silvestri F, McAuley JJ, Baeza-Yates R, Zia L
(eds) The world wide web conference, WWW 2019, San Francisco, CA, USA, May 13–17, 2019, ACM,
pp 939–949.https://doi.org/10.1145/3308558.3313397
18. Li Y, Liu B (2020) A new vector representation of short texts for classification. Int Arab J Inf Technol
17(2):241–249. https://doi.org/10.34028/iajit/17/2/12


19. Li Y, Tarlow D, Brockschmidt M, Zemel RS (2016) Gated graph sequence neural networks. In: Bengio
Y, LeCun Y (eds) 4th international conference on learning representations, ICLR 2016, San Juan, Puerto
Rico, May 2–4, 2016, Conference Track Proceedings.http://arxiv.org/abs/1511.05493
20. Liu P, Qiu X, Huang X (2016) Recurrent neural network for text classification with multi-task learning.
In: Kambhampati S (ed) Proceedings of the twenty-fifth international joint conference on artificial intel-
ligence, IJCAI 2016, New York, NY, USA, 9–15 July 2016, IJCAI/AAAI Press, pp 2873–2879.http://
www.ijcai.org/Abstract/16/408
21. Liu X, You X, Zhang X, Wu J, Lv P (2020) Tensor graph convolutional networks for text classification.
In: The thirty-fourth AAAI conference on artificial intelligence, AAAI 2020, the thirty-second innovative
applications of artificial intelligence conference, IAAI 2020, the tenth AAAI symposium on educational
advances in artificial intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020, AAAI Press,
pp 8409–8416.https://aaai.org/ojs/index.php/AAAI/article/view/6359
22. Niu G, Xu H, He B, Xiao X, Wu H, Gao S (2019) Enhancing local feature extraction with global
representation for neural text classification. In: Inui K, Jiang J, Ng V, Wan X (eds) Proceedings of the
2019 conference on empirical methods in natural language processing and the 9th international joint
conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7,
2019, Association for Computational Linguistics, pp 496–506.https://doi.org/10.18653/v1/D19-1047
23. Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word representation. In: Moschitti
A, Pang B, Daelemans W (eds) Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special
Interest Group of the ACL, ACL, pp 1532–1543.https://doi.org/10.3115/v1/d14-1162
24. Salles T, Gonçalves MA, Rodrigues V, da Rocha LC (2018) Improving random forests by neighborhood
projection for effective text classification. Inf Syst 77:1–21. https://doi.org/10.1016/j.is.2018.05.006
25. Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2009) The graph neural network model.
IEEE Trans Neural Netw 20(1):61–80. https://doi.org/10.1109/TNN.2008.2005605
26. Shanahan JG, Roma N (2003) Improving SVM text classification performance through threshold adjust-
ment. In: Lavrac N, Gamberger D, Todorovski L, Blockeel H (eds) Machine learning: ECML 2003, 14th
European conference on machine learning, Cavtat-Dubrovnik, Croatia, September 22–26, 2003, Proceed-
ings, Springer, Lecture Notes in Computer Science, vol 2837, pp 361–372.https://doi.org/10.1007/978-
3-540-39857-8_33
27. Song G, Ye Y, Du X, Huang X, Bie S (2014) Short text classification: a survey. J Multim 9(5):635–643.
https://doi.org/10.4304/jmm.9.5.635-643
28. Tang J, Qu M, Mei Q (2015) PTE: predictive text embedding through large-scale heterogeneous text
networks. In: Cao L, Zhang C, Joachims T, Webb GI, Margineantu DD, Williams G (eds) Proceedings
of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, Sydney,
NSW, Australia, August 10–13, 2015, ACM, pp 1165–1174.https://doi.org/10.1145/2783258.2783307
29. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017)
Attention is all you need. In: Guyon I, von Luxburg U, Bengio S, Wallach HM, Fergus R, Vishwanathan
SVN, Garnett R (eds) Advances in neural information processing systems 30: annual conference on neural
information processing systems 2017, December 4–9, 2017, Long Beach, CA, USA, pp 5998–6008.http://
papers.nips.cc/paper/7181-attention-is-all-you-need
30. Wang S, Li D, Song X, Wei Y, Li H (2011) A feature selection method based on improved fisher’s
discriminant ratio for text sentiment classification. Expert Syst Appl 38(7):8696–8702. https://doi.org/
10.1016/j.eswa.2011.01.077
31. Xu B, Huang JZ, Williams GJ, Li MJ, Ye Y (2012) Hybrid random forests: advantages of mixed trees
in classifying text data. In: Tan P, Chawla S, Ho CK, Bailey J (eds) Advances in knowledge discovery
and data mining—16th Pacific-Asia conference, PAKDD 2012, Kuala Lumpur, Malaysia, May 29-June
1, 2012, Proceedings, Part I, Springer, Lecture Notes in Computer Science, vol 7301, pp 147–158.https://
doi.org/10.1007/978-3-642-30217-6_13
32. Xu J, Cai Y, Wu X, Lei X, Huang Q, Leung H, Li Q (2020) Incorporating context-relevant concepts into
convolutional neural networks for short text classification. Neurocomputing 386:42–53. https://doi.org/
10.1016/j.neucom.2019.08.080
33. Xu S (2018) Bayesian naïve bayes classifiers to text classification. J Inf Sci 44(1):48–59. https://doi.org/
10.1177/0165551516677946
34. Yang Y, Wang H, Zhu J, Wu Y, Jiang K, Guo W, Shi W (2020) Dataless short text classification based on
biterm topic model and word embeddings. In: Bessiere C (ed) Proceedings of the twenty-ninth international
joint conference on artificial intelligence, IJCAI 2020, ijcai.org, pp 3969–3975.https://doi.org/10.24963/
ijcai.2020/549


35. Yao L, Mao C, Luo Y (2019) Graph convolutional networks for text classification. In: The thirty-third
AAAI conference on artificial intelligence, AAAI 2019, the thirty-first innovative applications of artificial
intelligence conference, IAAI 2019, the Ninth AAAI symposium on educational advances in artificial
intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27—February 1, 2019, AAAI Press, pp 7370–
7377.https://doi.org/10.1609/aaai.v33i01.33017370
36. Zeng J, Li J, Song Y, Gao C, Lyu MR, King I (2018) Topic memory networks for short text classification.
In: Riloff E, Chiang D, Hockenmaier J, Tsujii J (eds) Proceedings of the 2018 conference on empirical
methods in natural language processing, Brussels, Belgium, October 31—November 4, 2018, Association
for Computational Linguistics, pp 3120–3131.https://doi.org/10.18653/v1/d18-1351
37. Zhang H, Ni W, Zhao M, Lin Z (2019) Cluster-gated convolutional neural network for short text clas-
sification. In: Bansal M, Villavicencio A (eds) Proceedings of the 23rd Conference on Computational
Natural Language Learning, CoNLL 2019, Hong Kong, China, November 3-4, 2019, Association for
Computational Linguistics, pp 1002–1011.https://doi.org/10.18653/v1/K19-1094
38. Zhang Y, Yu X, Cui Z, Wu S, Wen Z, Wang L (2020) Every document owns its structure: Inductive text
classification via graph neural networks. In: Jurafsky D, Chai J, Schluter N, Tetreault JR (eds) Proceedings
of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–
10, 2020, Association for Computational Linguistics, pp 334–339. https://doi.org/10.18653/v1/2020.acl-
main.31

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
