Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 214 (2022) 837–844
www.elsevier.com/locate/procedia
9th International Conference on Information Technology and Quantitative Management

Keywords Extraction and Thesaurus Construction for Domain News
Fan Meng a, Kaile Zhou a, Yi Bu a, Win-Bin Huang a,*, Pengyi Zhang a, Fei Long b, Yan Li c,d
a Department of Information Management, Peking University, Beijing 100871, China
b Chinaso Inc., Beijing 100077, China
c State Key Laboratory of Media Convergence Production Technology and Systems, Xinhua News Agency, Beijing 100803, China
d E Surfing IOT Tech Co., Ltd, China Telecom, Guangzhou 510000, China
Abstract
In modern information retrieval systems, the thesaurus plays an increasingly important role. To better describe and analyze domain news, this paper proposes a method of domain keyword extraction and uses it to construct an effective domain thesaurus. Compared with previous research, this paper grasps the core information in the field by extracting and combining domain keywords, and improves the domain effectiveness of the thesaurus. In addition, this paper combines manual analysis with automated processing to construct a high-quality thesaurus, which has practical application value. The final results support the indexing, organizing, retrieving and recommending of news.
© 2022 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the 9th International Conference on Information Technology and Quantitative Management
Keywords: domain news; thesaurus construction; keywords extraction; topic model; co-occurrence analysis
* Corresponding author. Tel.: +86-186-0049-4648; E-mail address: huangwb@pku.edu.cn
1877-0509 © 2022 The Authors. Published by Elsevier B.V.
10.1016/j.procs.2022.11.249
1. Introduction
With the development of information technology, the total amount of information on the Internet has grown rapidly. Various retrieval systems have emerged and become an important way for people to quickly find the information and knowledge they need. However, in natural language, concepts and terms do not map one-to-one, and polysemy and synonymy are very common, which increases the difficulty and complexity of information retrieval. For example, a mismatch between the queries used by users and the terms used in the information resource may cause relevant items to be missed, which reduces search performance. As a result, it is important to build a professional and accurate domain thesaurus for the retrieval system.
The essence of a thesaurus is the specification, constraint and description of concepts and terms and of the relationships between terms. Terms are divided into preferred and non-preferred words; each preferred word corresponds to exactly one concept and is used for data indexing and retrieval. Using the thesaurus, a retrieval system can index data automatically and run related-term searches through the relationships between subject words, which improves search results and reduces the background knowledge required of users. Furthermore, information resources can be organized systematically to provide users with advanced information recommendation services.
In recent years, thesauri for specific fields have received widespread attention because of this role. For texts from a specific field, the thesaurus needs to reflect domain-specific knowledge and content. To explore the core content of domain news in more depth, this paper introduces domain keyword extraction. Research has shown that efficient and accurate keyword screening and extraction in a field greatly helps to identify domain knowledge and to detect research frontiers and topic hotspots [1].
This paper takes news text on the Sino-US trade war as the experimental object and carries out domain keyword extraction and thesaurus construction. The results can help news retrieval platforms quickly and accurately construct a thesaurus for a subdivided field, supporting the indexing, organization, retrieval and recommendation of news.
2. Related research
2.1. The construction of domain thesaurus
As a tool for information organization and retrieval, the thematic thesaurus has a long history of research and application. Thesauri have long been widely used in library literature retrieval systems, but with the development of the Internet, the rise of web search engines has challenged their status. However, in 2016, Hjørland [2] discussed the relationship between modern web search engines and the traditional thesaurus and pointed out that information retrieval in the network environment still cannot do without the thesaurus; in particular, thematic thesauri in individual disciplines still play an important role.
With the development of information technology and the explosive growth of information resources in recent years, building a thesaurus manually has become expensive and time-consuming. Automatic thesaurus construction has therefore become a research trend, and many studies have followed this direction. Güntzer et al. [3] propose a thesaurus construction system based on expert systems and user interaction. The approach assumes that users always adopt an effective search strategy, so a set of rules is designed to analyze users' search patterns and extract inter-word relationships from them. The expert-designed rules are iterated and improved continuously in use. The method achieves decent results, but the expert labor cost is high and the rules are hard to maintain. Tsurumaru et al. [4] develop a method that identifies semantic relationships between words from linguistic knowledge: it finds word tuples with syntactic relationships through syntactic analysis, computes mutual information values, and from these calculates the degree of correlation between words. However, for Chinese this method depends on the development of Chinese linguistics, and it has seen little research or practice in China.
For Chinese, research on the automatic construction of thematic thesauri did not make significant progress until recent years. An Yawei et al. [5] propose an automatic construction method for thematic thesauri from large-scale corpora in specific fields. They build a feature matrix from the word co-occurrence matrix, divide the words into clusters on this basis, reorganize each cluster around its central word, and so complete the automatic construction. Li Fajun [6] introduces an automatic construction method for domain thesauri based on graph-theoretic clustering and PageRank, in which the graph is built from lexical co-occurrence. Sun Liyuan et al. [7] provide a thesaurus construction method for emergency decision-making in universities; they build the thematic thesaurus by establishing a document library covering the field, constructing the knowledge representation of the thesaurus, and iteratively updating it. Shao Wei et al. [8] propose a method based on dependency parsing that both handles the segmentation of out-of-vocabulary words in the field and supplies the missing relationships between subject words, and they construct a thesaurus for the field of science and technology policy without supervision.
While automatic construction is a hot research topic, it has limitations. Zeng Wen [9] points out that although automatic construction of thematic thesauri is feasible and operable, it cannot match human judgment, its results are still not as good as those of expert construction, and current automated techniques mostly play an auxiliary role.
2.2. Keyword extraction in specific domain
In the field of information organization, people hope to extract the important core information from large domain corpora so as to better grasp the core knowledge. Core domain keywords are therefore extracted from massive data, and the most important information features are retained through dimensionality reduction. Network science and network analysis have been introduced into the field, and keyword co-occurrence analysis has become mainstream [1].
The features chosen for network analysis vary across existing studies. Khan et al. [10] propose a keyword extraction method based on word frequency: they first select a set of high-frequency keywords with a frequency threshold, then build a co-occurrence network and analyze it, which is simple and effective. Lee et al. [11] give a degree-based keyword extraction method, which focuses on how broadly a keyword is connected to other keywords and holds that words associated with more keywords tend to form the core of the domain. The top N nodes in the network are then selected as the core keywords of the domain. However, this method cannot reflect the strength of co-occurrence relationships and is not suitable for weighted graphs. Zhang et al. [12] therefore provide a keyword extraction method based on relation frequency. Filtering and ranking edge weights against a threshold yields the set of important relationships in the whole network. Compared with the previous methods, the relation-frequency-based approach suits weighted graphs and reflects the connections between important keywords. Using a similar method, Suna [13] carries out analysis and keyword extraction across various areas of digital library research. Wei Yumei and Teng Guangqing [1] compare these keyword extraction methods and conclude that the word-frequency-based method is suited to identifying hot spots, degree-based methods are suited to identifying domain cores in unweighted networks, and the relation-frequency-based approach focuses on the strength of associations and is suited to exploring keyword relevance.
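To make the contrast between the three families of methods concrete, the following minimal Python sketch applies each of them to the same toy co-occurrence network. The corpus, the function names and the threshold are illustrative assumptions, not taken from the cited studies.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each document is the set of keywords it contains (illustrative data only).
docs = [
    {"tariff", "trade", "soybean"},
    {"tariff", "trade", "huawei"},
    {"trade", "soybean", "price"},
    {"tariff", "trade"},
    {"huawei", "chip"},
]

def keyword_frequency(docs):
    """Frequency-based: number of documents in which each keyword occurs."""
    return Counter(word for doc in docs for word in doc)

def cooccurrence_edges(docs):
    """Weighted co-occurrence network: edge weight = number of shared documents."""
    edges = Counter()
    for doc in docs:
        for a, b in combinations(sorted(doc), 2):
            edges[(a, b)] += 1
    return edges

def degree(edges):
    """Degree-based: number of distinct neighbours, ignoring edge weights."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

def strong_relations(edges, min_weight):
    """Relation-frequency-based: keep edges at or above a threshold, strongest first."""
    kept = [(pair, w) for pair, w in edges.items() if w >= min_weight]
    return sorted(kept, key=lambda item: (-item[1], item[0]))

freq = keyword_frequency(docs)
edges = cooccurrence_edges(docs)
deg = degree(edges)
relations = strong_relations(edges, min_weight=2)
```

In this toy network "huawei" and "soybean" have the same degree as "tariff", yet only the "tariff"/"trade" and "soybean"/"trade" pairs survive the weight threshold, which is the kind of difference between degree-based and relation-frequency-based selection that the comparison above describes.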
3. Research methodology
3.1. Data description and preprocessing
The experimental data come from a news corpus related to the "Sino-US trade war" collected from China Search (www.chinaso.com). This study aims to extract a more representative keyword set for this domain, construct a thesaurus of the field's topics accordingly, and improve information search and news recommendation in the field.
The corpus contains 50,000 news articles from 53 different news outlets, published between January 2018 and February 2021. For most articles, the body is between 400 and 3,300 words long and the headline between 15 and 40 words. In addition, the distribution of words in the corpus is long-tailed.
Data preprocessing consists of two parts: deduplication and cleaning. Because the news texts come from different sources, the corpus contains duplicate articles. Duplication reflects how much emphasis different sources place on a story, but it can distort content analysis, so duplicate articles are removed first. Deduplication reduced the number of articles from 50,000 to 19,013. Data cleaning mainly deletes content unrelated to the news itself, such as copyright notices at the beginning or end of an article. The fields of the data after preprocessing are summarized in Table 1.
Table 1. Field review after data preprocessing

Field name     Data type    Count    Unique values
news_source    text         10806    2623
publish_time   date-time    16797    13925
title          text         19013    19013
url            text         19013    19013
content        text         19005    18887
news_author    text         3435     2435
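The preprocessing described in Section 3.1 could be sketched as follows. The paper does not state how duplicates are detected, so this example assumes exact matching on a whitespace-normalized body; it also strips notices before hashing so that copies differing only in boilerplate collapse together, which is a simplification of the two-stage order given in the text. The field names follow Table 1, while the notice patterns and sample records are hypothetical.

```python
import hashlib
import re

def normalize(text):
    """Collapse whitespace so trivially different copies hash identically."""
    return re.sub(r"\s+", " ", text).strip()

def strip_boilerplate(text, patterns):
    """Remove copyright-style notices; the patterns are illustrative assumptions."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text

def deduplicate(articles, patterns=()):
    """Keep the first article for each distinct cleaned body (exact-match dedup)."""
    seen = set()
    unique = []
    for article in articles:
        body = normalize(strip_boilerplate(article["content"], patterns))
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append({**article, "content": body})
    return unique

# Hypothetical notice patterns and sample records, for illustration only.
NOTICE_PATTERNS = [r"^Copyright ©[^.]*\.\s*", r"\s*All rights reserved\.$"]
articles = [
    {"url": "u1", "content": "Copyright © Example News. Tariffs rise  again."},
    {"url": "u2", "content": "Tariffs rise again. All rights reserved."},
    {"url": "u3", "content": "Soybean prices fall."},
]
unique = deduplicate(articles, NOTICE_PATTERNS)
```

Near-duplicate detection (for example, lightly edited reprints) would need a fuzzier comparison than the exact hash used here.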
3.2. Domain keyword extraction
Extracting domain keywords allows us to grasp the core information characteristics of the field, discover hot spots of public opinion, and explore different aspects of the information along different dimensions. Such global, macro-level information provides a more accurate reference for the thematic thesaurus built for the field. Specifically, in addition to the co-occurrence analysis commonly used in previous studies, we also tried the LDA (Latent Dirichlet Allocation) topic model [14].
3.2.1. LDA model

LDA is an important probabilistic graphical model in traditional machine learning. As shown in Figure 1, it is a multi-layered directed structure that describes the probabilistic relationships among three levels: words, topics, and documents. LDA models text in an unsupervised way. In this paper, we use the LDA model in the following steps:
• Segment the news corpus, remove stop words, and represent each news article as a bag of words
• Model each news category separately, as well as all categories together, and write each result to its own file
• Run the analysis at different granularities, such as words, nouns, and noun phrases
• Try different numbers of topics and observe how the number of topics affects the final result
Fig. 1. LDA probability graph
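As a rough illustration of the modelling step, the sketch below fits LDA to already segmented documents with collapsed Gibbs sampling in plain Python. The paper does not say which implementation or inference method was used, so the sampler, the hyperparameters `alpha` and `beta`, the iteration count and the toy documents are all assumptions chosen for brevity.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents (lists of words)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # n(d, k)
    topic_word = [[0] * V for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                      # n(k)
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Conditional p(z = t | all other assignments), up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed topic-word distributions, sorted by weight within each topic.
    topics = []
    for k in range(num_topics):
        dist = [(vocab[v], (topic_word[k][v] + beta) / (topic_total[k] + V * beta))
                for v in range(V)]
        topics.append(sorted(dist, key=lambda pair: -pair[1]))
    return topics

# Toy bag-of-words input (segmented, stop words removed); illustrative only.
docs = [
    ["tariff", "trade", "tariff", "export"],
    ["trade", "export", "tariff"],
    ["fund", "stock", "fund", "yield"],
    ["stock", "yield", "fund"],
]
topics = lda_gibbs(docs, num_topics=2)
```

Calling `lda_gibbs` repeatedly with different `num_topics` values corresponds to the last step in the list; each returned topic is a list of (word, weight) pairs in the same shape as the rows of a topic table.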
3.2.2. Co-occurrence relationship analysis

This paper adopts co-occurrence analysis based on relation frequency, which pays more attention to the quality of the relationships between keywords, explores the associations between them, and provides a macro-level reference for the later construction of inter-word relationships. In addition, this paper compares co-occurrence analysis at two granularities, nouns and noun phrases, where nouns that appear consecutively within a window are combined into a noun phrase. In this way, more domain-indicative phrases can be formed. The steps are as follows:
• Segment the news corpus, remove stop words, and represent each news article as a bag of words
• Select an initial set of keywords with an occurrence-frequency threshold; this set is large and covers a wide range
• Based on these preliminary keywords, run a co-occurrence analysis over all news (at both noun and noun-phrase granularity), counting how many times each pair of keywords appears in the same news article
• Visualize the results of the analysis and, accordingly, adjust and iterate on the previous steps
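The phrase-merging and counting steps above can be sketched as follows. This is a minimal example under stated assumptions: tokens arrive as (word, tag) pairs with nouns tagged "n", the window is a maximum phrase length `max_len`, and phrases are joined with a space for readability (for Chinese text an empty separator would be the natural choice). None of these details are specified in the paper.

```python
from collections import Counter
from itertools import combinations

def merge_noun_phrases(tagged_tokens, max_len=3):
    """Join runs of consecutive nouns (at most max_len tokens) into one phrase."""
    phrases, run = [], []
    for word, tag in tagged_tokens + [("", "END")]:   # sentinel flushes the last run
        if tag == "n" and len(run) < max_len:
            run.append(word)
            continue
        if run:
            phrases.append(" ".join(run))
        run = [word] if tag == "n" else []
    return phrases

def cooccurrence(doc_phrases, min_freq=2):
    """For phrases found in at least min_freq documents, count shared documents per pair."""
    doc_sets = [set(phrases) for phrases in doc_phrases]
    freq = Counter(p for s in doc_sets for p in s)
    keep = {p for p, n in freq.items() if n >= min_freq}
    pairs = Counter()
    for s in doc_sets:
        for a, b in combinations(sorted(s & keep), 2):
            pairs[(a, b)] += 1
    return freq, pairs

# Hypothetical tagged documents, for illustration only.
tagged_docs = [
    [("trade", "n"), ("war", "n"), ("hits", "v"), ("soybean", "n"), ("exports", "n")],
    [("trade", "n"), ("war", "n"), ("raises", "v"), ("tariffs", "n")],
    [("soybean", "n"), ("exports", "n"), ("fall", "v"), ("amid", "p"),
     ("trade", "n"), ("war", "n")],
]
doc_phrases = [merge_noun_phrases(doc) for doc in tagged_docs]
freq, pairs = cooccurrence(doc_phrases, min_freq=2)
```

The resulting `freq` and `pairs` counters map directly onto node sizes and edge weights in a network visualization.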
3.3. Thesaurus construction
3.3.1. Construction principles

Starting from the goal of improving the performance of the search system, the thesaurus follows three principles:
• Strong practicality: starting from the actual needs of citation and retrieval, select noun terms that have a certain frequency of use and can gather a certain amount of news
• High accuracy: select words that are scientific and in general use, and limit homonyms and polysemous words
• Use of phrases: give full play to the advantages of word combinations, and improve the specificity of subject words by using phrases
3.3.2. Candidate words extraction

First, the preprocessed news text is segmented into words; each word is tagged with its part of speech, its frequency is counted, and its tf * idf value is calculated. For named entities (place names, personal names, institution names, etc.), this paper compiles separate statistics to form a structured document. From the resulting data, we filter out words whose frequency falls below a threshold and remove stop words and meaningless words. At this point, we have a relatively clean set of candidate words and their statistics.
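A minimal sketch of the candidate statistics follows. Several tf * idf variants exist and the paper does not say which one was used; this example assumes a corpus-level term frequency multiplied by idf = log(N / df), and the stop-word list, threshold and documents are illustrative.

```python
import math
from collections import Counter

def candidate_statistics(docs, stop_words, min_freq=2):
    """Return {word: (frequency, tf * idf)} for words passing the filters."""
    n_docs = len(docs)
    term_freq = Counter(w for doc in docs for w in doc)       # occurrences in corpus
    doc_freq = Counter(w for doc in docs for w in set(doc))   # documents containing w
    total = sum(term_freq.values())
    stats = {}
    for word, freq in term_freq.items():
        if freq < min_freq or word in stop_words:
            continue
        tf = freq / total
        idf = math.log(n_docs / doc_freq[word])
        stats[word] = (freq, tf * idf)
    return stats

# Hypothetical segmented documents, for illustration only.
docs = [
    ["tariff", "trade", "the"],
    ["trade", "soybean", "the"],
    ["tariff", "tariff", "chip", "the"],
]
stats = candidate_statistics(docs, stop_words={"the"}, min_freq=2)
```

Here "soybean" and "chip" fall below the frequency threshold and "the" is a stop word, so only "tariff" and "trade" remain as candidates.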
3.3.3. Inter-word similarity calculation

After candidate word extraction, the number of words is still large, and many words are very close in meaning; these need to be merged to avoid redundancy. To reflect inter-word similarity comprehensively, this paper calculates the semantic similarity between words in three ways: co-occurrence frequency, cosine distance, and Word2Vec [15] embedding similarity.
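The cosine measure can be illustrated with a short sketch. It assumes each candidate word is represented by a sparse vector of co-occurrence counts; the profiles below are hypothetical. The same function would apply to dense embedding vectors, such as those produced by a Word2Vec model, if they were stored as dictionaries keyed by dimension.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical co-occurrence profiles: counts of context words seen with each candidate.
profiles = {
    "tariff": {"trade": 4, "import": 2},
    "duty": {"trade": 2, "import": 1},
    "soybean": {"price": 3, "yield": 1},
}
sim_close = cosine_similarity(profiles["tariff"], profiles["duty"])
sim_far = cosine_similarity(profiles["tariff"], profiles["soybean"])
```

Words with proportional profiles, like "tariff" and "duty" here, score close to 1 and become candidates for merging, while words with disjoint contexts score 0.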
3.3.4. Merger and consolidation

At this point, we have the preprocessed candidate words, the inter-word similarities, the domain keywords, and other supporting information, and the final merging and integration remains.
First, all words are categorized: nouns and adjectives are grouped into specific categories, such as "经济(economic)" or "政治(political)", and the classification is determined with reference to the topics of the "Sino-US trade war" news. This step is carried out by several people, whose results are then compared and unified.
Next comes the construction of inter-word relationships. We introduce and annotate the following relations: "RT" (related term), indicating an associative relationship; "NT" (narrower term), representing the hierarchical relationship between broader and narrower terms; and "UF" (used for), pointing from the descriptor to a non-preferred entry word.
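One possible in-memory representation of these three relations is sketched below. The class, its method names and the sample terms are hypothetical and are not the paper's data model; the sketch only shows how RT, NT and UF links could support term resolution and query expansion in a retrieval system.

```python
from collections import defaultdict

class Thesaurus:
    """Minimal store for RT / NT / UF relations between subject words."""

    def __init__(self):
        self.relations = defaultdict(lambda: defaultdict(set))
        self.preferred = {}   # non-preferred entry word -> descriptor

    def add(self, source, relation, target):
        if relation not in {"RT", "NT", "UF"}:
            raise ValueError(f"unknown relation: {relation}")
        self.relations[source][relation].add(target)
        if relation == "RT":        # relatedness is treated as symmetric
            self.relations[target]["RT"].add(source)
        elif relation == "UF":      # remember how to resolve the entry word
            self.preferred[target] = source

    def resolve(self, word):
        """Map an entry word to its descriptor; descriptors map to themselves."""
        return self.preferred.get(word, word)

    def expand(self, word):
        """Descriptor plus its narrower and related terms, for query expansion."""
        term = self.resolve(word)
        rels = self.relations.get(term, {})
        return {term} | set(rels.get("NT", ())) | set(rels.get("RT", ()))

# Illustrative entries only; these links are not taken from the constructed thesaurus.
thesaurus = Thesaurus()
thesaurus.add("tariff", "UF", "customs duty")
thesaurus.add("tariff", "RT", "trade war")
thesaurus.add("global economy", "NT", "world market")
expanded = thesaurus.expand("customs duty")
```

A query for the non-preferred word "customs duty" is first resolved to the descriptor "tariff" and then widened with its related term, which mirrors the related-term search that a thesaurus enables.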
4. Research results
4.1. Domain keyword extraction
The LDA-based keyword extraction shows that if the number of topics is set too large, the categories are too scattered; for example, the financial concepts "基金(funds)" and "债券(bonds)" end up in different topics. If the number of topics is set too small, words from different domains easily fall into the same topic. In our experiments, 10 to 15 topics worked best. After selecting a balanced number of topics, the LDA output is shown in Table 2.
As the table shows, some topics are clear while others contain abstract vocabulary. Still, LDA outputs domain keywords and relationships of some value. For example, LDA finds the relationship between words such as "华为(Huawei)", "知识产权(intellectual property rights)", "专利(patents)" and "任正非(Ren Zhengfei)"; for equity funds, it identifies keywords such as "基金(fund)", "股票(stock)", "收益(income)" and "净值(net value)"; for grain trade, it links keywords such as "玉米(corn)", "大豆(soybeans)", "产量(yield)" and "价格(price)". The output shows that LDA can find keywords and relationships that are not obvious but appear frequently in the corpus and are indicative of the field. However, only topic 9 directly reflects the theme of the "Sino-US trade war", a very small share of all topics, so the LDA method is not well suited to extracting a large number of keywords on that theme.
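Table 2. LDA results

Topic  Vocabularies and weights
1      中国(China) 0.022, 国家(country) 0.012, 疫情(epidemic) 0.010, 经济(economy) 0.008, 美国(United States) 0.008, 社会(society) 0.008, 香港(Hong Kong) 0.007, 台湾(Taiwan) 0.007, 文化(culture) 0.007, 世界(world) 0.006
2      中国(China) 0.030, 印度(India) 0.017, 航母(aircraft carrier) 0.014, 海军(navy) 0.011, 女性(women) 0.009, 时尚(fashion) 0.008, 电影(film) 0.008, 国家(country) 0.007, 菲律宾(Philippines) 0.007, 美军(US military) 0.007
3      企业(enterprise) 0.024, 行业(industry) 0.015, 产品(product) 0.014, 中国(China) 0.012, 市场(market) 0.011, 项目(project) 0.011, 产业(industry sector) 0.010, 业务(business) 0.009, 收入(revenue) 0.009, 技术(technology) 0.009
4      价格(price) 0.026, 市场(market) 0.020, 需求(demand) 0.015, 国内(domestic) 0.014, 中国(China) 0.013, 产量(yield) 0.011, 大豆(soybeans) 0.011, 全球(global) 0.010, 玉米(corn) 0.010, 企业(enterprise) 0.010
5      华为(Huawei) 0.091, 专利(patents) 0.028, 知识产权(intellectual property rights) 0.020, 任正非(Ren Zhengfei) 0.017, 法律(law) 0.015, 司法(judiciary) 0.011, 协议(agreement) 0.011, 案件(case) 0.009, 人机(human-machine) 0.009, 谷歌(Google) 0.009
6      市场(market) 0.033, 价格(price) 0.014, 政策(policy) 0.012, 风险(risk) 0.011, 增速(growth rate) 0.010, 经济(economy) 0.010, 人民币(renminbi) 0.010, 需求(demand) 0.010, 压力(pressure) 0.010, 利率(interest rate) 0.009
7      市场(market) 0.025, 美国(United States) 0.020, 美元(US dollar) 0.017, 黄金(gold) 0.017, 美联储(Federal Reserve) 0.012, 股市(stock market) 0.011, 投资者(investors) 0.011, 数据(data) 0.009, 行情(market conditions) 0.008, A股(A-shares) 0.007
8      芯片(chip) 0.029, 市场(market) 0.024, 中国(China) 0.021, 汽车(automobile) 0.021, 技术(technology) 0.017, 品牌(brand) 0.016, 产品(product) 0.016, 半导体(semiconductor) 0.015, 全球(global) 0.012, 手机(mobile phone) 0.009
9      中国(China) 0.087, 美国(United States) 0.085, 中美(China-US) 0.020, 国家(country) 0.015, 特朗普(Trump) 0.015, 经济(economy) 0.012, 贸易战(trade war) 0.011, 关税(tariff) 0.010, 全球(global) 0.010, 贸易(trade) 0.008
10     基金(fund) 0.084, 报告(report) 0.036, 比例(proportion) 0.027, 收益(income) 0.024, 报告期(reporting period) 0.023, 业绩(performance) 0.019, 股票(stock) 0.019, 净值(net value) 0.017, 基金管理(fund management) 0.016, 市场(market) 0.015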
The co-occurrence-based keyword extraction shows that noun phrases work better than single nouns: the visualization reveals connections between phrases with more domain character. The noun-phrase result is shown in Figure 2. The size and color shade of a node represent the word's frequency, the edge weight between two nodes represents their co-occurrence frequency, and the stronger the co-occurrence relationship, the closer the two nodes are placed.
Fig. 2. Visualization results of the co-occurrence analysis

4.2. Thesaurus
The resulting thesaurus contains 22 categories; the statistics for each category are shown in Table 3.
Table 3. Thesaurus categories and statistics

Vocabulary category                   Number of words
Named entity - Person name            373
Named entity - Institution name       167
Economy - Macroeconomic concepts      81
Economy - Business                    113
Economy - Taxation                    19
Economy - Trade & transportation      43
Economy - Finance                     252
Economy - Resources                   38
Economy - Industries & businesses     101
Economy - Manufacturing               203
Economy - Other                       120
Military                              81
Person                                486
Region                                74
Technology                            169
Politics                              132
Health                                41
Organizational structure              222
Education                             40
Time                                  35
Religion                              6
Adjective                             197
The thesaurus selects words with a certain frequency and deletes low-frequency and non-retrievable words, which ensures its applicability. It conforms to the patterns of news retrieval, is closely tied to the theme of the "Sino-US trade war", improves retrieval efficiency, and offers a degree of compatibility and scalability.
Taking "全球经济(global economy)" under "宏观经济概念(Macroeconomic Concepts)" as an example, the subject vocabulary shows that manual consolidation and integration yields a high-quality thesaurus. Within "全球经济(global economy)", core subject words such as "国际分工(international division of labor)", "海外经济(overseas economy)", "全球产业(global industry)" and "世界市场(world market)" occupy the central position, and through the inter-word relationships they connect to more specific subject words such as "CPI" and "双循环(double circulation)".
5. Conclusion
This paper proposes a keyword extraction and thesaurus construction method for domain news. First, domain keywords are extracted in two ways and visualized so that domain knowledge can be grasped intuitively. Then candidate words are extracted and their semantic similarity is calculated automatically, the information from the domain keywords is added, and manual analysis based on expert knowledge is introduced to standardize the subject words, assign them to categories, and establish the inter-word relationships. The result is a domain thesaurus that helps improve the search performance of the relevant information retrieval system.
This study also has limitations. First, automated thesaurus construction remains the trend of related research, and the labor cost of the method in this paper still needs to be reduced. Second, in the calculation of inter-word similarity, the Word2Vec model rests on the hypothesis that co-occurring words are more similar, so further methods could be tried and compared to better express true semantic similarity. In addition, in the phrase-level analysis, noun phrases are formed simply by joining consecutive nouns, which may produce unreasonable phrases. In the future, other phrase construction methods, including dependency parsing, may be used to improve the result.
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China (#71801232, #71932008). We would also like to express our sincere gratitude to all the editors and reviewers.
References
[1] Wei Yumei, & Teng Guangqing. (2020). A comparative study on the extraction methods of important keywords in the field under the network perspective. Intelligence Information Work, 41(3), 8.
[2] Hjørland, B. (2016). Does the traditional thesaurus have a place in modern information retrieval? Knowledge Organization, 43(3), 145-159.
[3] Güntzer, U., Jüttner, G., Seegmüller, G., & Sarre, F. (1989). Automatic thesaurus construction by machine learning from retrieval sessions. Information Processing & Management, 25(3), 265-273.
[4] Tsurumaru, H., Hitaka, T., & Yoshida, S. (2017). An attempt to automatic thesaurus construction from an ordinary Japanese language dictionary.
[5] An Yawei, Cao Xiaochun, & Luo Shun. (2018). Corpus-oriented domain thesaurus construction algorithm. Computer Science, 45(B06), 3.
[6] Li Fajun. (2015). Research on automatic construction of domain thesaurus based on graph theory clustering and PageRank. Innovative Technology (11), 4.
[7] Sun Liyuan, & Su Xinning. (2019). Research on the construction of thematic thesaurus in the field of emergency decision-making of universities. Information Science, 37(4), 7.
[8] Shao Wei, & Hua Berlin. (2020). Unsupervised construction of thematic thesaurus in the field of science and technology policy based on dependent syntactic analysis. Information Engineering, 6(6), 12.
[9] Zeng Wen. (2012). Exploration and practice of automatic thesaurus construction technology of theme vocabulary in the networked digital era. National Library Journal (4), 5.
[10] Khan, G. F., & Wood, J. (2015). Information technology management domain: Emerging themes and keyword analysis. Scientometrics, 105(2), 959-972.
[11] Lee, P., & Su, H. (2010). Investigating the structure of regional innovation system research through keyword co-occurrence and social network analysis. Innovation (North Sydney), 12(1), 26-40.
[12] Zhang, J., Xie, J., Hou, W., Tu, X., Xu, J., Song, F., et al. (2012). Mapping the knowledge structure of research on patient adherence: knowledge domain visualization-based co-word analysis and social network analysis. PLoS ONE, 7(4), e34497.
[13] Suna. (2009). Research topics and progress in the field of digital library based on co-word analysis. Journal of Intelligence (6), 5.
[14] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2001). Latent dirichlet allocation. The Annals of Applied Statistics.
[15] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. Computer Science.
