Professional Documents
Culture Documents
a r t i c l e i n f o a b s t r a c t
Article history: Covid 19 pandemic has placed the entire world in a precarious condition. Earlier it was a serious issue in
Received 29 December 2020 china whereas now it is being witnessed by citizens all over the world. Scientists are working hard to find
Received in revised form 29 December 2020 treatment and vaccines for the coronavirus, also termed as covid. With the growing literature, it has
Accepted 17 January 2021
become a major challenge for the medical community to find answers to questions related to covid-
Available online xxxx
19. We have proposed a machine learning-based system that uses text classification applications of
NLP to extract information from the scientific literature. Classification of large textual data makes the
Keywords:
searching process easier thus useful for scientists. The main aim of our system is to classify the abstracts
COVID-19
Text classification
related to covid with their respective journals so that a researcher can refer to articles of his interest from
KNN the required journals instead of searching all the articles. In this paper, we describe our methodology
MLP needed to build such a system. Our system experiments on the COVID-19 open research dataset and
XGBoost the performance is evaluated using classifiers like KNN, MLP, etc. An explainer was also built using
Explainability XGBoost to show the model predictions.
Ó 2021 Elsevier Ltd. All rights reserved.
Selection and peer-review under responsibility of the scientific committee of the Emerging Trends in
Materials Science, Technology and Engineering.
The district named Wanzhou was affected by COVID 19 pan- On 7th January the raging virus was detected as coronavirus. It
demic in China. Wanzhou became an enclosed center for epidemi- had >95% homology with the bat coronavirus and >70% similarity
ological investigations with the lockdown of Wuhan and with the SARS-CoV. The covid related articles started increasing
surrounding places on January 23, 2020. Hence an opportunity from February 2020 to date making it very difficult for human ana-
was provided for understanding the transmission dynamics and lysts to go through all the articles. Applications like identification
other risk factors associated with the spread of SARS-COV-2, the of objects in images, speech to text conversion, news items match-
agent of COVID-19. 47 other Chinese cities also implemented the ing, products related to user interests, etc can be done by using
same measures to tackle COVID-19. Most of the COVID-19 cases machine learning techniques [5]. They make use of the class of
are very mild in severity [1,2] thus reducing the likelihood that techniques named deep learning [6]. A system that is based on
they would look around for testing and medical care [3]. deep learning was proposed that uses NLP question answering
CORD 19 corpora [4] introduced by the Allen Institute for AI and methods to mine the literature [7]. CovidQA, a question-
other research groups consists of over 200,000 scholarly articles, of answering dataset was developed for covid 19 [8]. Shuja et al. [9]
which 100,000 are with full text, about COVID-19, SARS-CoV-2, and have formulated research domain taxonomy and identified fea-
similar coronaviruses like SARS and MERS. People applied various tures of datasets concerning the type, methods, application. Santos
AI-based techniques in information retrieval and NLP on this data- et al. [10] presented a dataset on COVID-19 in which research
set for extracting important information. We propose a machine activities overview was provided so that it would be easy to find
learning-based system that uses text classification application of scientists and researchers who are active in the task of combating
NLP for extracting the information from this scientific literature. the disease. The models in data mining to predict COVID-19
The main aim is to classify the covid related articles according to patient’s recovery using an epidemiological dataset of South Korea
the journals in which they were published. Hence it can be consid- COVID-19 patients were implemented [11]. The assessment of
ered as a multi-class classification problem. information flow as well as, scientific collaboration quality was
https://doi.org/10.1016/j.matpr.2021.01.480
2214-7853/Ó 2021 Elsevier Ltd. All rights reserved.
Selection and peer-review under responsibility of the scientific committee of the Emerging Trends in Materials Science, Technology and Engineering.
Please cite this article as: D. Godavarthi and Mary Sowjanya A, Classification of covid related articles using machine learning, Materials Today: Proceedings,
https://doi.org/10.1016/j.matpr.2021.01.480
D. Godavarthi and Mary Sowjanya A Materials Today: Proceedings xxx (xxxx) xxx
2
D. Godavarthi and Mary Sowjanya A Materials Today: Proceedings xxx (xxxx) xxx
Fig. 5. (a) Top frequent tags in PLoS One journal: (b) Top frequent tags in Virology journal.
Fig. 6. (a) Histogram and density plots for entity ‘PERSON’: (b) histogram and density plots for entity ‘ORG’.
3
D. Godavarthi and Mary Sowjanya A Materials Today: Proceedings xxx (xxxx) xxx
Fig. 8. (a) Plots obtained for char count: (b) plots obtained for word count.
Fig. 9. (a) Plots obtained for avg word length: (b) plots obtained for avg sentence length.
3. Methodology Data cleaning, Length analysis, Word count. The second module
performs Target encoding and Dataset partitioning. The third mod-
Our system based on machine learning consists of 3 modules.1) ule performs Feature engineering with vectorizer, Feature selec-
Text analysis 2) Data Processing 3) Machine learning. The first tion, Model design, Train/test, Evaluate, Explainability. The
module performs Language detection, Named entity recognition, architecture of our system is shown in Fig. 1.
4
D. Godavarthi and Mary Sowjanya A Materials Today: Proceedings xxx (xxxx) xxx
Fig. 11. (a) Most frequent words in virology journal: (b) Most frequent words in virology journal.
Fig. 12. (a) Word count for PLoS One journal: (b) Word count for bioRxiv journal.
into one column with a count giving the number of times that Fig. 4. As Spacy can identify a person’s name it was used for name
entity occurred in the text. Then one more column was created detection and then the string would be modified.
for every tag category and the count of each entity is place. We The top frequent tags and histogram, density plots are shown in
can observe the tag types distribution macro view as shown in Figs. 5 and 6.
6
D. Godavarthi and Mary Sowjanya A Materials Today: Proceedings xxx (xxxx) xxx
Fig. 19. (a) ROC curve for XGBoost: (b) Precision-Recall curve.
7
D. Godavarthi and Mary Sowjanya A Materials Today: Proceedings xxx (xxxx) xxx
with a count of the token. This information can be visualized using Table 1
the word cloud in which each tag frequency could be displayed Performance comparison of our model with XGBoost, MLP, KNN.
with font size and color. The word count for various journals is Machine Learning Algorithms Accuracy AUC
shown below in Fig. 12.
XGBoost 0.84 0.95
3.2. Text processing MLP 0.83 0.94
KNN 0.76 0.86
3.3.3. Design of the model [1] R.T. Gandhi, J.B. Lynch, C. del Rio, Mild or moderate Covid-19, N. Engl. J. Med.
Support vector machine, XGBoost and, MLP classifier are used to 383 (18) (2020) 1757–1766, https://doi.org/10.1056/nejmcp2009249.
[2] Z. Wu, J.M. McGoogan, Characteristics of and important lessons from the
train the model, then predictions can be made based on knowledge coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report
about the related conditions. If the dataset is very large then this of 72314 cases from the Chinese Center for Disease Control and Prevention,
algorithm is very suitable because it takes each feature indepen- JAMA – J. Am. Med. Assoc. 323 (13) (2020) 1239–1242, https://doi.org/
10.1001/jama.2020.2648.
dently into consideration, calculating each category probability,
[3] M. Peppa, W. John Edmunds, S. Funk, Disease severity determines health-
and then the highest probability category will be predicted. seeking behaviour amongst individuals with influenza-like illness in an
internet-based cohort, BMC Infect. Dis. 17 (1) (2017) 1–13, https://doi.org/
3.3.4. Training/testing 10.1186/s12879-017-2337-5.
[4] L. L. Wang et al., CORD-19: The C, 2020.
All the models are trained on the feature matrix and tested on [5] Y. Lecun, Y. Bengio, G. Hinton, Deep learning, no. May, 2015, doi:
the transformed test set and then a sci-kit learn pipeline was built. 10.1038/nature14539.
8
D. Godavarthi and Mary Sowjanya A Materials Today: Proceedings xxx (xxxx) xxx
[6] H. Wang, On the Origin of Deep Learning, pp. 1–72. [16] S.K. Sonbhadra, S. Agarwal, P. Nagabhushan, Target specific mining of COVID-
[7] D. Su, Y. Xu, T. Yu, F. Bin Siddique, E. J. Barezi, P. Fung, CAiRE-COVID: A 19 scholarly articles using one-class approach, pp. 1–12, 2020, [Online].
Question Answering and Multi-Document Summarization System for COVID- Available: http://arxiv.org/abs/2004.11706.
19 Research, 2020, [Online]. Available: http://arxiv.org/abs/2005.03975. [17] J.P. Andersen, M.W. Nielsen, N.L. Simone, R.E. Lewiss, R. Jagsi, COVID-19
[8] R. Tang et al., Rapidly Bootstrapping a Question Answering Dataset for COVID- medical papers have fewer women first authors than expected, Elife 9 (734)
19. (2020) 1–7, https://doi.org/10.7554/eLife.58807.
[9] J. Shuja, E. Alanazi, W. Alasmary, A. Alashaikh, COVID-19 open source data sets: [18] V. Kieuvongngam, B. Tan, Y. Niu, Automatic Text Summarization of COVID-19
a comprehensive survey, Appl. Intell. (2020), https://doi.org/10.1007/s10489- Medical Research Articles using BERT and GPT-2,” 2020, [Online]. Available:
020-01862-6. http://arxiv.org/abs/2006.01997.
[10] B. S. Santos, I. Silva, M. da C. Ribeiro-Dantas, G. Alves, P. T. Endo, L. Lima, [19] V. Chamola, V. Hassija, V. Gupta, M. Guizani, A Comprehensive Review of the
COVID-19: A scholarly production dataset report for research analysis, Data COVID-19 Pandemic and the Role of IoT , Drones, AI , Blockchain, and 5G in
Br., vol. 32, 2020, doi: 10.1016/j.dib.2020.106178. Managing Its Impact, no. April, pp. 90225–90265, 2020.
[11] L.J. Muhammad, M.M. Islam, S.S. Usman, S.I. Ayon, Predictive Data Mining [20] Feature extraction methods for character recognition – a survey, vol. 29, no. 4,
Models for Novel Coronavirus (COVID-19) Infected Patients’ Recovery, SN pp. 641–662, 1996.
Comput. Sci., vol. 1, no. 4, 2020, doi: 10.1007/s42979-020-00216-w. [21] G. Dhole, N. Uke, NLP Based Retrieval of Medical Information for Diagnosis of
[12] J.H.I.K.D. Virag, Preliminary analysis of COVID - 19 academic information Human Diseases, pp. 243–248, 2014.
patterns: a call for open science in the times of closed borders, Scientometrics [22] P. Recognition, A.E. Bradley, The use of the Area Under The ROC Curve in the
124 (3) (2020) 2687–2701, https://doi.org/10.1007/s11192-020-03587-2. Evaluation Of machine learning Algorithms, vol. 30, no. 7, pp. 1145–1159,
[13] E. Ogundepo et al., An exploratory assessment of a multidimensional 1997.
healthcare and economic data on COVID-19 in Nigeria, Data Br. 33 (2020), [23] M. Ikonomakis, S. Kotsiantis, V. Tampakas, Text classification using machine
https://doi.org/10.1016/j.dib.2020.106424. learning techniques, WSEAS Trans. Comput. 4 (8) (2005) 966–974, https://doi.
[14] M. Saefi et al., Survey data of COVID-19-related knowledge, attitude, and org/10.11499/sicejl1962.38.456.
practices among indonesian undergraduate students, Data Br. 31 (2020), [24] K. Kowsari, K.J. Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, D. Brown, Text
https://doi.org/10.1016/j.dib.2020.105855. classification algorithms: a survey, Inf. 10 (4) (2019) 1–68, https://doi.org/
[15] P. Le Bras, A. Gharavi, D.A. Robb, A.F. Vidal, S. Padilla, M.J. Chantler, Visualising 10.3390/info10040150.
COVID-19 Research, no. May, pp. 1–11, 2020, [Online]. Available: http://arxiv.
org/abs/2005.06380.