
Int. j. inf. tecnol.

(April 2023) 15(4):2273–2282


https://doi.org/10.1007/s41870-023-01273-z

ORIGINAL RESEARCH

A two‑staged NLP‑based framework for assessing the sentiments on Indian supreme court judgments
Isha Gupta1 · Indranath Chatterjee2 · Neha Gupta1

Received: 21 December 2022 / Accepted: 11 April 2023 / Published online: 27 April 2023
© The Author(s), under exclusive licence to Bharati Vidyapeeth’s Institute of Computer Applications and Management 2023

* Isha Gupta
ishaanujgupta@gmail.com

1 Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies, Faridabad 121003, India
2 Department of Computer Engineering, Tongmyong University, Busan 48520, South Korea

Abstract Topic modeling is a powerful technique for uncovering hidden patterns in large documents. It can identify themes that are highly connected and lead to a certain region while accounting for temporal and spatial complexity. In addition, sentiment analysis can determine the sentiments of media articles on various issues. This study proposes a two-stage natural language processing-based model that utilizes Latent Dirichlet Allocation to identify critical topics related to each type of legal case or judgment and the Valence Aware Dictionary and sEntiment Reasoner algorithm to assess people’s sentiments on those topics. By applying these strategies, this research aims to influence public perception of controversial legal issues. This study is the first of its kind to use topic modeling and sentiment analysis on Indian legal documents and paves the way for a better understanding of legal documents.

Keywords Sentiment analysis · Topic modeling · Legal documents · Latent Dirichlet allocation · Valence aware dictionary sentiment reasoner · Google news feed

1 Introduction

The Internet is a vast repository of data that spans various fields, including medical, engineering, technology, e-commerce, legal, historical, and geographical [1]. However, the sheer volume of unstructured data available on the web can make it challenging to extract meaningful insights from it [2]. Text mining is a powerful tool for converting unstructured text into a structured format to identify significant patterns and fresh insights [1, 3].

One useful application of text mining is topic modeling, which can help identify the general subjects covered by a large corpus of data [4]. Topic modeling is an unsupervised technique that discovers hidden patterns from a large amount of data [6, 7]. It involves clustering groups of words that semantically represent the same topic, offering the benefit of saving time and effort [5]. While it does not produce a summary of the whole document [7], topic modeling outputs the topics that are dominant in the document. A good topic model produces results that a human can easily interpret. Topic modeling is mainly of three types [8], with Latent Dirichlet Allocation (LDA) being one of the finest strategies [9]. LDA is a Bayesian form of Latent Semantic Analysis (LSA), in which the distribution is sampled over a probability simplex, and the model has a generative procedure [10].

LSA, on the other hand, uses Term Frequency-Inverse Document Frequency (TF-IDF) to evaluate textual documents [11]. Both LSA and LDA are based on the distributional hypothesis. The primary difference is that LDA assumes that the distribution of topics in a document and the distribution of words in each topic follow Dirichlet distributions, while LSA does not assume any distribution, resulting in less interpretable vector representations of topics and documents [12].

Probabilistic Latent Semantic Analysis (pLSA) was proposed to address the representation problem in LSA by substituting Singular Value Decomposition (SVD) with a probabilistic model [13]. It models every entry in the TF-IDF matrix using a probability.
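To make the TF-IDF weighting used by LSA concrete, the following is a minimal sketch in plain Python. It assumes raw term frequency normalized by document length and idf = log(N / df); other TF-IDF variants exist, and the three toy documents and the function name `tf_idf` are illustrative assumptions, not data or code from this study.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes tf = count / document length and idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {term: (count / len(doc)) * math.log(n_docs / df[term])
             for term, count in tf.items()}
        )
    return weights

# Hypothetical, already pre-processed toy documents.
docs = [
    ["court", "bail", "accused"],
    ["court", "land", "compensation"],
    ["court", "tax", "refund"],
]
weights = tf_idf(docs)
```

In this toy corpus the term "court" occurs in every document, so its weight is zero, while a term such as "bail" that occurs in only one document receives a positive weight. This is the sense in which TF-IDF down-weights vocabulary that is common to all judgments.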


Sentiment analysis (SA) is another important application of text mining that focuses on analyzing people’s feelings, sentiments, or attitudes toward something [14, 15]. SA can be applied to topics, events, products, or organizations, and it works at three levels [17]: document level, sentence level, and aspect level [18]. Sentiment classification mainly relies on three techniques: supervised learning, unsupervised learning, and a hybrid technique [19], which combines the supervised and unsupervised techniques.

Combining SA and topic modeling can help deduce a document’s importance, as topic modeling can find abstract topics and related words or patterns in the document, and these features can help find the sentiment more efficiently. In this paper, we propose the amalgamation of these two techniques to evaluate the most effective sentiments related to the topics found in legal judgments. Legal documents are critical in all sectors, educational or work-related, and are prepared by a legal officer or corporate lawyer with the provision of entering the trial court of law. Legal documents are often challenging for non-experts to read due to their technical vocabulary, complex syntax and semantics, and use of unusual meanings, doublets, and triplets. Furthermore, legal documents tend to be lengthy and technical, making it essential to develop an automated tool that can help us understand them better. Topic modeling techniques can play a pivotal role in developing this automated tool by identifying the most relevant topics covered in legal judgments.

Section 1 provides an introduction to the concept of topic modeling, its significance, and its distinctive features in comparison to summarization techniques. Furthermore, this section delves into the relevance of sentiment analysis (SA) within the context of legal documents. Section 2 reviews the prior research conducted on topic modeling and sentiment analysis, highlighting the pertinent literature on the topic. Section 3 outlines the proposed methodology and its two-stage framework, providing a comprehensive overview of the approach taken. Section 4 elaborates on the experimental details of the proposed model, including the methodology employed in the implementation of the framework. Section 5 presents the results of the experiment and provides a detailed analysis of the outcomes. The discussion will address the implications of the findings and outline avenues for future research.

2 Related work

Researchers have extensively studied sentiment analysis (SA) and topic modeling in various domains, and their importance in different applications has been demonstrated through numerous studies. In recent literature, a bibliometric analysis of over 3,710 publications from 1971 to 2018 in a particular journal was conducted to identify significant topical elements and to examine publication and reference patterns, among other things. Using word cloud analysis and topic modeling, the authors of this study revealed key trends and topics in the data [16, 17, 20].

Another study investigated the impact of the ongoing pandemic on hydro-meteorological disasters, such as floods and typhoons, in 24 countries [22]. To provide an overview of the concerns in these countries, the researchers employed Latent Dirichlet Allocation (LDA), a computational topic modeling technique, to extract key terms and topics from numerous reports and news. This interdisciplinary study offers insights that can be beneficial for policymakers and researchers to address the challenges of responding to such disasters during a pandemic.

Moreover, a different study introduced various topic modeling approaches that can handle the relationship between topics, changes in topics over time, and short messages such as those encountered in social media or sparse message data. The paper also briefly reviewed the algorithms used to optimize and infer parameters in topic modeling, which are crucial for producing meaningful results, regardless of the approach [23].

The authors of [24] outlined the possibilities of structural topic modeling for organizational research and gave a step-by-step tutorial on how to apply it. The application example drew on 428,492 reviews of Fortune 500 companies from the online platform Glassdoor, on which employees can evaluate organizations. The research showed how structural topic models allow themes that matter to employees to be identified inductively and their relationship with employees’ perception of organizational culture to be measured. The paper examined the benefits and limitations of topic modeling as a research strategy and laid out how future studies can apply the method to study organizational phenomena.

The study [25] developed the embedded topic model, a generative model of documents that marries traditional topic models with word embeddings. More specifically, the model represents each word with a categorical distribution whose natural parameter is the inner product between the word’s embedding and an embedding of its assigned topic. To fit the model, the authors developed an efficient amortized variational inference algorithm. The model found interpretable topics even with large vocabularies that include rare words and stop words.

The authors of the study [26] reviewed the research literature dealing with appropriate pre-processing of the text collection; adequate selection of model parameters, including the number of topics to be generated; evaluation of the model’s reliability; and the process of validly interpreting the resulting topics. They proposed an approach
that addresses these challenges. The objective was to make LDA topic modeling more accessible to communication researchers and to ensure consistency with disciplinary standards. To that end, the research developed a brief hands-on user guide for applying LDA topic modeling and demonstrated the value of the approach with empirical data from an ongoing research project.

Extensive work has been done in the past on SA, and many concerns and issues are still being addressed by researchers. We have tried to review the latest work on SA. One of the main concerns with SA is the polarity of Twitter sentiment. The suggested method [27] entails categorizing the attitudes using the eight fundamental emotions provided by Plutchik’s wheel of emotion, which makes the task more manageable. Other elements have been applied following the Rule Based Emotion Classification (RBEM) algorithm to determine the polarity of messages. The proposed algorithm has demonstrated reasonable accuracy.

The authors [28] assessed the effectiveness of the various episodes of "Mann Ki Baat," a program that the Indian Prime Minister launched in 2014. This was done in two steps. First, SA was performed on the written episodes of this radio show. Second, Twitter posts that the general public made about the subjects covered in the various episodes of the show were analyzed. The outcomes reveal that this show has benefited Indian citizens in a variety of ways. Additionally, their method validated this outcome with a respectable accuracy of 85.4%.

The study [29] presents a hybrid SA strategy in which lexicon-based approaches are employed in conjunction with deep learning models to increase sentiment accuracy. The experiments examine the influence of TextBlob on model classification accuracy in comparison to the original annotations while keeping in mind the possibility of fraudulent annotations.

The goal of the study [30] was to evaluate Persian tweets to assess Iranians’ attitudes toward COVID-19 vaccination and toward domestic and foreign COVID-19 vaccines. The authors recognized the sentiments of retrieved tweets using a deep learning SA model based on a CNN-LSTM architecture.

Several studies have also combined topic modeling and SA. In the study [31], topic modeling and classification approaches are used to create a hybrid model for extracting customer opinions from tweets about the Abuja Electricity Distribution Company (AEDC). The electricity company can use SA to enhance the quality of its services. Tweets were used to generate dominant topics with the LDA topic modeling technique. The proposed model achieved a prediction accuracy of 94.8%.

In the study [32], a huge dataset of geo-tagged tweets containing specific keywords related to climate change was analyzed using volume analysis. To generate the various themes of discussion, LDA was used for topic modeling, and the Valence Aware Dictionary and sEntiment Reasoner (VADER) was used for SA to determine the broad attitudes and viewpoints contained in the dataset. These techniques were used to investigate the nature of the climate change discussion between various countries over time. SA showed that the overall discussion is negative, particularly when users are reacting to political or extreme weather events. Topic modeling showed that the topics of discussion on climate change are diverse, yet a few topics are more common than others.

The authors [33] proposed a cross-mapping table approach based on a location’s popularity, ratings, latent topics, and sentiment. The outcomes show that the combined features of LDA, SVM, ratings, and cross-mappings help to improve performance.

The motivation behind the study [34] was to use the real experiences of customers who have flown with airlines. The data gathered were online reviews of 27 airlines, comprising more than 14,000 reviews. The objective was to determine which kinds of significant words appear in the online reviews.

The study [35] proposed an ontology and LDA (OLDA) based topic modeling and word embedding approach for sentiment classification. The proposed framework retrieves transportation content from social networks, removes irrelevant content to extract meaningful information, and generates topics and features from the extracted information using OLDA. Machine learning classifiers are used to assess the proposed word embedding framework. The approach achieved an accuracy of 93%, which showed that it is effective for sentiment classification.

The study [36] presented a bibliometric survey of SA based on a structural topic modeling strategy to obtain a broad overview of the research field. The authors additionally used techniques such as regression analysis, geographic visualization, social network analysis, and the Mann-Kendall trend test. The findings gave a comprehensive understanding of the trends and topics regarding SA, which could help in effectively monitoring future research work. The study also proposed a framework for conducting a complete bibliometric analysis.

To date, little or no work has been done on legal judgments employing topic modeling. A few works related to this field are described here. The authors [21] investigated the use of 56 distinct methods for analyzing text-based similarity across court case statements on a dataset of Indian Supreme Court cases. Thirty of the 56 methods are modifications of existing procedures, while the remaining 26 are the authors’ own proposals. Models such as BERT and Law2Vec are among the techniques considered. It was
found that more conventional approaches (such as TF-IDF and LDA) that rely on a bag-of-words representation perform better than more advanced context-aware approaches (such as BERT and Law2Vec) for determining document-level similarity. Finally, the authors selected their five best-performing methods for evaluating similarity across case reports based on empirical validation.

The paper [37] applied LDA topic modeling to a dataset of 3931 journal articles and investigated three questions: Which topics within legal research on AI can be identified? When were these topics addressed? Can similar papers be identified? The topic modeling resulted in a total of 32 meaningful topics. It was also found that legal research on AI increased from 2016 onward, with topics becoming more granular and diverse over time. Finally, a comparison of the similarity assessments produced by the algorithm and by a human expert suggests that the assessments often match.

The study [38] proposed the Supreme Court classifier, a framework that addresses the problem of classifying legal court opinion documents. The research compares methodologies that use traditional machine learning and neural network-based approaches. The authors also provided a CNN used with pre-trained word vectors that outperforms the state of the art when applied to their dataset. The Washington University School of Law Supreme Court Database (SCDB) was used to train and evaluate the framework. The best system (word2vec + CNN) achieves 72.4% accuracy when classifying the court decisions into 15 broad SCDB categories and 31.9% accuracy when classifying among 279 finer-grained SCDB categories.

The work [39] described and assessed the use of BERT for topic modeling in legal documents. The authors focused on a subset of landmark cases from the US Case law dataset to assess the effect of topic modeling through domain-specific embeddings pre-trained from LEGAL-BERT. The study examined several ways of producing sentence embeddings from the cases.

Table 1 summarizes the work done in the area of topic modeling by different authors. To the best of our knowledge, no one has applied topic modeling and SA in the field of legal documents, especially for Indian judgments. As discussed above, there are studies in other domains that make use of SA and topic modeling, and most of them have used Twitter as the data source. This is the motivation for us to work in this field. The main objective of the paper is to employ topic modeling and SA as a coupled model for analyzing the legal documents from Indian court judgments.

Table 1  Applications of Topic Modeling

References | Technique of topic modeling | Dataset | Sample size | Objective
[32] | LDA | Tweets | 3,90,016 | Inference different topics of discussion of global climate change
[40] | LDA | Twitter | NA | Identification of noteworthy topics of Twitter messages
[41] | LDA | Online review data | 23,614 | Social media mining for product planning
[42] | Nonnegative matrix factorization (NMF) | Cases reported by Los Angeles police department | 1,027,168 | Classification of crime into discrete categories
[43] | Top2Vec | News headlines | 100,000 | Investigation of COVID-19 News
[44] | LDA | News | 8000 | Inferencing connections to the sociological view of culture
[45] | LDA | Blogs | 1,300,000 | Discussion on change in climate
[46] | LDA | Tweets | 1,09,076 | Analyze scholar’s Twitter usage in CS conferences

3 Proposed methodology

In this study, we propose a two-staged NLP framework that leverages the LDA and VADER algorithms to identify topics and the sentiments expressed by individuals towards those topics. Our methodology consists of two stages: Stage I employs topic modeling to identify topics from large documents, while Stage II utilizes VADER to extract sentiments related to a specific topic. The flowchart of our proposed technique is depicted in Fig. 1.

Fig. 1  Flowchart/Framework of the algorithm

Stage 1: In Stage 1, we begin by gathering large documents from diverse sources. These sources can be any written or printed materials that contain a substantial amount of data in a single document. To process the data, we first convert it into a text file, which undergoes several pre-processing procedures such as stopword elimination, tokenization (the process of breaking text into tokens), stemming (the process of reducing a word to its root form), and lemmatization (grouping the inflected forms of a word). Once pre-processing is complete, the documents are ready for topic modeling. We use the LDA implementation of the Java-based Mallet tool to perform topic modeling. LDA generates a topic-per-document model and a words-per-topic model by leveraging Dirichlet distributions as the modeling framework. After applying LDA, we generate clusters of features, where
each cluster represents a group and is named by an expert in the field.

Stage 2: In Stage 2, we focus on SA, which involves studying the sentiments of individuals towards a specific topic. Here, we use the topics generated from Stage 1 as input. We extract Google News related to each topic using the Google News API, which then undergoes pre-processing techniques such as data mining. We then apply VADER to obtain sentiment scores of the news related to each topic. Finally, we analyze the data using visualization techniques.

4 Experimental setup

The experimental setup for the proposed methodology is described as follows. The experiments were conducted on a High-Performance Computing facility equipped with an AMD Ryzen Threadripper PRO 3945WX processor with 12 cores and 64 GB of quad-channel DDR4 RAM, along with an NVIDIA RTX A5000 graphics card with 24 GB of GDDR6 memory. The experiments were implemented using the Python programming language.

For Stage 1 (Topic Modeling), the dataset comprised 700 legal judgments, downloaded as Portable Document Format (PDF) files from the Supreme Court of India website, as per the study conducted in [47]. For Stage 2 (News from Google feed), the Google News API was employed to extract the news feed, using the topics generated from the clusters of features obtained in Stage 1 as the search key.

The experiment was conducted in two stages. Stage I aimed to identify the critical topics of legal documents using topic modeling techniques, while Stage II aimed to evaluate the sentiments of people based on news articles published via Google News on the topics identified in Stage I.

4.1 Stage I

• Data Collection: The judgments were downloaded from the Supreme Court of India website.
• Data Pre-processing: The documents were converted to text files. These judgments contain common keywords. Data pre-processing was performed on these text files to convert the text to lowercase and to eliminate stop words, punctuation, and numbers. After that, the text files were tokenized and stemmed.
• Topic Modeling: LDA requires the number of topics to be known in advance, so the model was trained for several different numbers of topics. We empirically ran the dataset with the number of topics varied over the range [10, 25] in steps of 5 and found that n = 15 performs well on topic modeling. We therefore used 15 topics in our experiment. Fifteen separate topic
clusters were generated, with the top correlated words in each cluster.

4.2 Stage II

• Web Scraping: Four topics were randomly selected from the topics resulting from Stage I. The topics were named for better clarity. Using these topic names, web scraping of news related to those topics was performed using the Google News API.
• Data Preprocessing: The news articles were pre-processed by converting the sentences to lowercase, tokenizing, stemming, and removing stop words.
• Sentiment Polarity: The sentiment polarity of each news article was computed using the VADER algorithm. The VADER score ranges from -1 (strongly negative) to +1 (strongly positive). All the articles were divided into positive (> 0), negative (< 0), and neutral (= 0) categories.

5 Results & discussion

After downloading 700 legal documents from the Supreme Court of India website as Portable Document Format (PDF) files, they were converted to text files and pre-processed using stemming and lemmatization techniques. The resulting vocabulary size was 39,828, and the average number of words per document was around 2000.

To identify critical topics of legal documents, we applied topic modeling to these files, selecting 15 topics empirically. Each topic provided a set of features or words, with Table 2 indicating these various topics and their associated features. In general terms, we assigned a name to each topic, with some being unnamed. Topic 0 contained generalized features, while the remaining topics were named according to their focus. These included writ cases, constitutional matters, land/property disputes, criminal matters, disaster management, Indian Penal Code matters, Vehicle Act cases, property-related issues, Insolvency Act matters, arbitration
trials, service agreements, criminal cases, sales tax cases, and capital punishment in India.

Table 2  Labels and top 20 words for 15 topics from the Proposed LDA topic model

Topic No | Top Words | Label of Topic
0 | [‘provisions’, ‘right’, ‘SCC’, ‘public’, ‘authority’, ‘decision’, ‘person’, ‘power’, ‘time’, ‘state’, ‘sub’, ‘provision’, ‘government’, ‘judicial’, ‘provided’, ‘manner’, ‘matter’, ‘necessary’, ‘effect’, ‘article’] | General
1 | [‘high’, ‘learned’, ‘judgment’, ‘said’, ‘filed’, ‘passed’, ‘submitted’, ‘counsel’, ‘civil’, ‘writ’, ‘view’, ‘appellants’, ‘present’, ‘proceedings’, ‘matter’, ‘date’, ‘period’, ‘impugned’, ‘orders’, ‘notice’] | Writ
2 | [‘state’, ‘article’, ‘constitution’, ‘scheduled’, ‘backward’, ‘commission’, ‘list’, ‘reservation’, ‘election’, ‘committee’, ‘constitutional’, ‘amendment’, ‘classes’, ‘government’, ‘union’, ‘parliament’, ‘members’, ‘power’, ‘judgment’, ‘castes’] | Constitutional Matters
3 | [‘land’, ‘building’, ‘government’, ‘project’, ‘development’, ‘plan’, ‘area’, ‘public’, ‘state’, ‘construction’, ‘authority’, ‘central’, ‘notification’, ‘buildings’, ‘plot’, ‘forest’, ‘use’, ‘proposed’, ‘heritage’, ‘compensation’] | Land/property dispute
4 | [‘accused’, ‘police’, ‘evidence’, ‘prosecution’, ‘witness’, ‘mohmed’, ‘confession’, ‘stated’, ‘body’, ‘examination’, ‘deposed’, ‘time’, ‘persons’, ‘criminal’, ‘said’, ‘ali’, ‘victim’, ‘recovered’, ‘taken’, ‘kumar’] | Criminal
5 | [‘government’, ‘state’, ‘submitted’, ‘covid’, ‘persons’, ‘disability’, ‘union’, ‘disabilities’, ‘workers’, ‘national’, ‘policy’, ‘disaster’, ‘pandemic’, ‘central’, ‘writ’, ‘special’, ‘children’, ‘scheme’, ‘states’, ‘relief’] | Disaster management
6 | [‘accused’, ‘offence’, ‘criminal’, ‘bail’, ‘high’, ‘fir’, ‘police’, ‘offences’, ‘complaint’, ‘investigation’, ‘magistrate’, ‘trial’, ‘person’, ‘singh’, ‘code’, ‘crpc’, ‘charge’, ‘sections’, ‘ipc’, ‘proceedings’] | IPC
7 | [‘bank’, ‘company’, ‘compensation’, ‘alcohol’, ‘commission’, ‘vehicle’, ‘complaint’, ‘insurance’, ‘loss’, ‘accident’, ‘consumer’, ‘person’, ‘cheque’, ‘locker’, ‘said’, ‘national’, ‘agreement’, ‘forum’, ‘claim’, ‘liability’] | Vehicle-Related
8 | [‘suit’, ‘property’, ‘company’, ‘plaintiff’, ‘decree’, ‘defendant’, ‘possession’, ‘sale’, ‘deed’, ‘filed’, ‘family’, ‘rule’, ‘land’, ‘parties’, ‘trial’, ‘companies’, ‘tata’, ‘judgment’, ‘civil’, ‘held’] | Property-related
9 | [‘resolution’, ‘corporate’, ‘plan’, ‘debtor’, ‘financial’, ‘creditors’, ‘code’, ‘authority’, ‘adjudicating’, ‘debt’, ‘nclt’, ‘insolvency’, ‘coc’, ‘creditor’, ‘ibc’, ‘process’, ‘approval’, ‘cirp’, ‘committee’, ‘company’] | Insolvency Act
10 | [‘arbitration’, ‘award’, ‘agreement’, ‘parties’, ‘contract’, ‘tribunal’, ‘arbitral’, ‘arbitrator’, ‘scc’, ‘party’, ‘commercial’, ‘dispute’, ‘limitation’, ‘proceedings’, ‘period’, ‘held’, ‘disputes’, ‘judgment’, ‘civil’, ‘foreign’] | Arbitration
11 | [‘service’, ‘candidates’, ‘appointment’, ‘rules’, ‘post’, ‘selection’, ‘state’, ‘years’, ‘age’, ‘year’, ‘medical’, ‘vacancies’, ‘government’, ‘examination’, ‘rule’, ‘officers’, ‘committee’, ‘college’, ‘high’, ‘list’] | Service agreement
12 | [‘accused’, ‘evidence’, ‘deceased’, ‘prosecution’, ‘trial’, ‘singh’, ‘high’, ‘ipc’, ‘witnesses’, ‘state’, ‘appellants’, ‘injuries’, ‘stated’, ‘persons’, ‘judgment’, ‘police’, ‘scc’, ‘incident’, ‘criminal’, ‘learned’] | Criminal
13 | [‘goods’, ‘tax’, ‘power’, ‘refund’, ‘payment’, ‘customs’, ‘rate’, ‘services’, ‘state’, ‘sale’, ‘input’, ‘import’, ‘itc’, ‘high’, ‘income’, ‘credit’, ‘purchase’, ‘paid’, ‘assesse’, ‘supply’] | Sales Tax
14 | [‘death’, ‘sentence’, ‘state’, ‘imprisonment’, ‘years’, ‘victim’, ‘offence’, ‘life’, ‘criminal’, ‘accused’, ‘ipc’, ‘circumstances’, ‘conviction’, ‘scc’, ‘committed’, ‘crime’, ‘evidence’, ‘child’, ‘sexual’, ‘punishment’] | Capital Punishment

After performing topic modeling on the 700 pre-processed and lemmatized legal documents, correlation tests were conducted to obtain the correlation values of each document with the 15 topics. The boxplot visualization of the various topics with the documents is shown in Fig. 2, which indicates that features are more dispersed in Topic 1 and Topic 12. Topic 0 is right skewed, Topic 1 is close to a normal distribution, and Topic 12 is skewed to the right, with many outliers lying outside the whiskers. Table 3 displays the maximum and minimum probability values of each topic with the document. The maximum probability value of 0.965 was obtained for topic number 1 with document 590. Figure 3 shows, for each topic, the document with which it is most strongly associated; Table 3 gives the same results in tabular form.

To extract the Google news feed related to four randomly chosen topics resulting from topic modeling, namely land or property dispute cases, capital punishment in India, insolvency and bankruptcy cases, and service-related cases or matters, web scraping was conducted using the Google News API. After extracting the news, VADER was applied to obtain sentiment values in terms of a polarity value, where ’1’ shows positive sentiments, ’0’ indicates neutral behavior, and ’-1’ indicates negative sentiments.

Fig. 2  Boxplot visualization of various topics with the documents
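For readers who want to reproduce the quantities that a boxplot such as Fig. 2 encodes, the sketch below computes quartiles, the conventional 1.5 × IQR whisker fences, the outliers beyond them, and a crude skew hint (mean above median suggests right skew). The input values are hypothetical topic probabilities, and `boxplot_summary` is an illustrative helper, not code from this study.

```python
import statistics

def boxplot_summary(values):
    """Quartiles, 1.5*IQR fences, outliers and a simple right-skew hint."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "outliers": outliers,
        # Mean above median is only a rough indicator of right skew.
        "right_skew_hint": statistics.fmean(data) > median,
    }

# Hypothetical probabilities of one topic across eight documents.
topic_probs = [0.01, 0.02, 0.02, 0.03, 0.04, 0.05, 0.06, 0.95]
summary = boxplot_summary(topic_probs)
```

With these toy values, the single document that is dominated by the topic (0.95) falls outside the upper fence and is reported as an outlier, which mirrors the pattern of many points beyond the whiskers described for the skewed topics.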

Table 3  Documents with the highest and lowest probability per topic

Topic no | Max probability value | Max probability document No | Min probability value | Min probability document No
0 | 0.788751323 | 441 | 0.00012936 | 425
1 | 0.965711919 | 590 | 8.21518E-05 | 586
2 | 0.740050002 | 144 | 1.03875E-06 | 213
3 | 0.696587298 | 80 | 2.50045E-06 | 321
4 | 0.815674705 | 653 | 8.81921E-07 | 65
5 | 0.826009937 | 304 | 1.80191E-06 | 580
6 | 0.757856215 | 223 | 2.21043E-06 | 403
7 | 0.692951989 | 399 | 1.38036E-06 | 403
8 | 0.770479731 | 148 | 1.14694E-06 | 65
9 | 0.794709373 | 117 | 5.02766E-07 | 65
10 | 0.738390115 | 224 | 1.96209E-06 | 580
11 | 0.761465274 | 206 | 1.62472E-06 | 213
12 | 0.95425968 | 481 | 7.8978E-06 | 321
13 | 0.711524975 | 216 | 7.24678E-07 | 65
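A table of this shape can be derived from a document-topic probability matrix by taking, for each topic, the documents with the largest and smallest probability. The sketch below assumes the matrix is available as a list of rows (one row per document, one column per topic); the matrix values and the function name `extremes_per_topic` are illustrative assumptions.

```python
def extremes_per_topic(doc_topic):
    """For each topic, return (topic, max value, max doc, min value, min doc)."""
    n_topics = len(doc_topic[0])
    rows = []
    for topic in range(n_topics):
        column = [row[topic] for row in doc_topic]
        # Index of the document with the highest / lowest probability.
        max_doc = max(range(len(column)), key=column.__getitem__)
        min_doc = min(range(len(column)), key=column.__getitem__)
        rows.append((topic, column[max_doc], max_doc, column[min_doc], min_doc))
    return rows

# Hypothetical 3-document, 2-topic matrix (each row sums to 1).
doc_topic = [
    [0.7, 0.3],
    [0.1, 0.9],
    [0.5, 0.5],
]
rows = extremes_per_topic(doc_topic)
```

Document indices here are zero-based positions in the matrix; mapping them to the document numbers used in the table would depend on how the corpus was indexed.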


Table 4  Comparison table for the Proposed Approach

Paper | Techniques | Objective
[48] | LDA, SA | Stock Market Prediction
[32] | LDA, SA | Global climate change
[49] | Topic Modeling, SA | Product opportunities
[34] | Topic modeling, SA | Airline Reviews
[33] | LDA, SA | Tourist Spots
[50] | Topic Modeling, SA | Airport service experience
[51] | Topic Modeling, SA | Online Education in the COVID-19 Era
[52] | Topic Modeling, SA | Bangladesh Airlines

Fig. 3  Displaying Top Title Topic-wise

Fig. 4  Graphs representing SA on different topics. a SA of Google news on Land/Property Dispute Cases. b SA of Google news on Capital Pun-
ishment Cases in India. c SA of Google news on Insolvency & Bankruptcy Cases. d SA of Google news on service-related laws in India

Figure 4a indicates that people or media have almost equal positive and negative sentiments. Figure 4b suggests that news related to capital punishment in India is more negatively perceived, while Fig. 4c suggests that news related to insolvency and bankruptcy cases/matters is more positively perceived. Figure 4d shows that news related to service-related laws is perceived more positively.

The analysis of the legal judgments using topic modeling reveals that there are numerous themes of debate, with some being more prevalent than others. To ascertain the effectiveness of our proposed methodology, it is worth noting that no prior studies have attempted to use topic
13
Int. j. inf. tecnol. (April 2023) 15(4):2273–2282 2281

modeling and sentiment analysis in the context of legal References


documents or judgments. As such, a comparative analysis
with previous works is not feasible. Nonetheless, Table 4 1. Hearst M (2003) What is text mining. SIM UC Berkeley. 5:2234
presents past research that has utilized topic modeling and 2. Kumar A, Dabas V, Hooda P (2020) Text classification algorithms
for mining unstructured data: a SWOT analysis. Int J Inf Technol
sentiment analysis in other domains, demonstrating the 12(4):1159–1169. https://​doi.​org/​10.​1007/​s41870-​017-​0072-1
novelty of our approach in the legal domain. 3. Ding K, Choo WC, Ng KY, Ng SI (2020) Employing structural
So, the proposed approach is not a quantitative model, topic modelling to explore perceived service quality attributes in
but a qualitative model. So this model is unique in itself Airbnb accommodation. Int J Hosp Manag 91:102676. https://d​ oi.​
org/​10.​1016/J.​IJHM.​2020.​102676
as it is a completely using novel dataset. The model saves 4. Khurana D, Koli A, Khatter K, Singh S (2022) Natural language
a lot of time as it reduces the need to read large jargon processing: state of the art, current trends and challenges. Mul-
documents which might be difficult to understand. timed Tools Appl. https://​doi.​org/​10.​1007/​s11042-​022-​13428-4
5. Vayansky I, Kumar SAP (2020) A review of topic modeling
methods. Inf Syst 94:101582. https://​doi.​org/​10.​1016/j.​is.​2020.​
101582
6 Conclusion 6. Koltcov NSISOK (2015) Topic modelling for qualitative stud-
ies. J Inf Sci 26(5):599–613. https://​d oi.​o rg/​1 0.​1 177/​0 1655​
This paper presents a pioneering study that investigates 51515​617393
7. Asmussen CB, Møller C (2019) Smart literature review: a prac-
the application of topic modeling and sentiment analysis tical topic modelling approach to exploratory literature review.
in Indian legal documents. Our proposed methodology J Big Data. https://​doi.​org/​10.​1186/​s40537-​019-​0255-7
effectively extracts topics and identifies related senti- 8. Negara ES, Triadi D, Andryani R (2019) Topic Modelling Twit-
ter Data with Latent Dirichlet Allocation Method. ICECOS Int
ments in lengthy legal documents, with promising results
Conf Electr Eng Comput Sci. https://​doi.​org/​10.​1109/​ICECO​
that have the potential to enhance users’ ability to compre- S47637.​2019.​89845​23
hend legal judgments and identify relevant sentiments in 9. Reisenbichler M, Reutterer T (2019) Topic modeling in mar-
a shorter amount of time. While this area has been largely keting: recent advances and research opportunities. J Bus Econ
89(3):327–356. https://​doi.​org/​10.​1007/​s11573-​018-​0915-7
unexplored by previous studies, our approach provides a
10. Yu H, Yang J (2001) A direct LDA algorithm for high-dimen-
valuable contribution to the field. However, there are still sional data—with application to face recognition. Pattern Rec-
significant challenges that need to be addressed, such as the ognit 34(10):2067–2070
lack of optimized topic models for legal data. Overall, this 11. Iqbal F et al (2019) A Hybrid Framework for Sentiment Analy-
sis Using Genetic Algorithm Based Feature Reduction. IEEE
study represents a significant milestone in the exploration of
Access. 7:14637–14652. https://​d oi.​o rg/​1 0.​1 109/​ACCESS.​
topic modeling and sentiment analysis in Indian legal docu- 2019.​28928​52
ments, providing valuable insights to legal professionals and 12. Landauer TK (2007) LSA as a theory of meaning. In Handbook
researchers. of latent semantic analysis. https://​doi.​org/​10.​4324/​97802​03936​
399
13. Lu Y, Mei Q, Zhai C (2011) Investigating task performance of
Authors contributions All the authors of this manuscript contrib-
probabilistic topic models: an empirical study of PLSA and LDA.
uted to the conceptual framework and design of the study. Material
Inf Retr Boston 14:178–203
preparation, data collection, and analysis were performed by IG and
14. Liu B (2015) Sentiment analysis: Mining opinions, sentiments,
IC. IG wrote the first draft of the manuscript. IC and NG have edited
and emotions. Cambridge University Press, USA
and revised the manuscript. All authors read and approved the final
15. Liu B (2012) Sentiment analysis and opinion mining. Synth Lect
manuscript.
Hum Lang Technol 5(1):1–184. https://​doi.​org/​10.​2200/​S0041​
6ED1V​01Y20​1204H​LT016
16. Farhadloo M, Rolland E (2016) Fundamentals of sentiment analy-
sis and its applications. In Studies in Computat Intell. 639:1–24
Funding The authors declare that no funding has been received to 17. Chen X, Zou D, Xie H (2020) Fifty years of British Journal of
perform this study. Educational Technology: A topic modeling based bibliometric
perspective. Br J Educ Technol 51(3):692–708
Data availability The authors declare that this study is fully repro- 18. Zhang L, Liu B (2014) Aspect and Entity Extraction for Opinion
ducible, and the data used in this research work may be shared with Mining”. In: Chu WW (ed) Data Mining and Knowledge Discov-
readers at their request. ery for Big Data: Methodologies, Challenge and Opportunities.
Springer, Berlin. Heidelberg, Berlin, Heidelberg
Declarations 19. Ghosh S, Hazra A, Raj A (2020) A comparative study of different
classification techniques for sentiment analysis. Int J Synt Emot.
11(49–57):2020. https://​doi.​org/​10.​4018/​IJSE.​20200​101.​oa
Conflict of interest The authors declare no conflict of interest. 20. Wawre SV, Deshmukh SN (2016) Sentiment Classification using
Machine Learning. Techniques 5:2015–2017
Consent For publication The authors give their full consent for the 21. Mandal A, Ghosh K, Ghosh S, Mandal S (2021) Unsupervised
publication of identifiable details, which can include a photograph(s) approaches for measuring textual similarity between legal court
and/or details within the text (“Material”) to be published in this case reports. Artif Intell Law 29(3):417–451. https://​doi.​org/​10.​
esteemed journal in the form of an article. 1007/​s10506-​020-​09280-2


22. Malakar K, Lu C (2022) Hydrometeorological disasters during COVID-19: Insights from topic modeling of global aid reports. Sci Total Environ 838:155977. https://doi.org/10.1016/j.scitotenv.2022.155977
23. Vayansky I, Kumar SAP (2020) A review of topic modeling methods. Inf Syst 94:101582. https://doi.org/10.1016/j.is.2020.101582
24. Schmiedel T, Müller O, vom Brocke J (2019) Topic modeling as a strategy of inquiry in organizational research: a tutorial with an application example on organizational culture. Organ Res Methods 22(4):941–968. https://doi.org/10.1177/1094428118773858
25. Dieng AB, Ruiz FJR, Blei DM (2020) Topic modeling in embedding spaces. Trans Assoc Comput Linguist 8:439–453. https://doi.org/10.1162/tacl_a_00325
26. Maier D et al (2018) Applying LDA topic modeling in communication research: toward a valid and reliable methodology. Commun Methods Meas 12(2–3):93–118. https://doi.org/10.1080/19312458.2018.1430754
27. Kumar P, Vardhan M (2022) PWEBSA: twitter sentiment analysis by combining plutchik wheel of emotion and word embedding. Int J Inf Technol 14(1):69–77. https://doi.org/10.1007/s41870-021-00767-y
28. Garg K (2020) Sentiment analysis of Indian PM's 'Mann Ki Baat.' Int J Inf Technol 12(1):37–48. https://doi.org/10.1007/s41870-019-00324-8
29. Aljedaani W et al (2022) Sentiment analysis on Twitter data integrating TextBlob and deep learning models: The case of US airline industry. Knowled-Based Syst 255:109780. https://doi.org/10.1016/j.knosys.2022.109780
30. Bokaee Nezhad Z, Deihimi MA (2022) Twitter sentiment analysis from Iran about COVID 19 vaccine. Diabetes Metab Syndr Clin Res Rev 16:102367. https://doi.org/10.1016/j.dsx.2021.102367
31. Ugochi O, Prasad R, Odu N, Ogidiaka E, Ibrahim BH (2022) Customer opinion mining in electricity distribution company using twitter topic modeling and logistic regression. Int J Inf Technol 14(4):2005–2012. https://doi.org/10.1007/s41870-022-00890-4
32. Dahal B, Kumar SAP, Li Z (2019) Topic modeling and sentiment analysis of global climate change tweets. Soc Netw Anal Min 9(1):1–20. https://doi.org/10.1007/s13278-019-0568-8
33. Shafqat W, Byun YC (2020) A recommendation mechanism for under-emphasized tourist spots using topic modeling and sentiment analysis. Sustain. https://doi.org/10.3390/SU12010320
34. Kwon HJ, Ban HJ, Jun JK, Kim HS (2021) Topic modeling and sentiment analysis of online review for airlines. Inf 12(2):1–14. https://doi.org/10.3390/info12020078
35. Ali F et al (2019) Transportation sentiment analysis using word embedding and ontology-based topic modeling. Knowled-Based Syst 174:27–42. https://doi.org/10.1016/j.knosys.2019.02.033
36. Chen X, Xie H (2020) A structural topic modeling-based bibliometric study of sentiment analysis literature. Cognit Comput 12(6):1097–1129. https://doi.org/10.1007/s12559-020-09745-1
37. Rosca C, Covrig B, Goanta C, van Dijck G, Spanakis G (2020) Return of the AI: An Analysis of Legal Research on Artificial Intelligence Using Topic Modeling. In NLLP@KDD. 3–10
38. Undavia S, Meyers A, Ortega JE (2018) A Comparative Study of Classifying Legal Documents with Neural Networks. Fed Confer Comp Sci Inform Syst (FedCSIS) 2018:515–522
39. Silveira R, Fernandes CG, Neto JAM, Furtado V, Pimentel Filho JE (2021) Topic Modelling of Legal Documents via LEGAL-BERT. Proc 1613:73
40. Ostrowski DA (2015) Using latent dirichlet allocation for topic modelling in twitter. Proc 2015 IEEE 9th Int Conf Semant Comput (IEEE ICSC 2015), pp 493–497. https://doi.org/10.1109/ICOSC.2015.7050858
41. Jeong B, Yoon J, Lee J-M (2019) Social media mining for product planning: A product opportunity mining approach based on topic modeling and sentiment analysis. Int J Inf Manage 48:280–290. https://doi.org/10.1016/j.ijinfomgt.2017.09.009
42. Kuang D, Brantingham PJ, Bertozzi AL (2016) Crime topic modeling. Crime Sci. https://doi.org/10.1186/s40163-017-0074-0
43. Ghasiya P, Okamura K (2021) Investigating COVID-19 News across Four Nations: A Topic Modeling and Sentiment Analysis Approach. IEEE Access 9:36645–36656. https://doi.org/10.1109/ACCESS.2021.3062875
44. DiMaggio P, Nag M, Blei D (2013) Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding. Poetics 41(6):570–606. https://doi.org/10.1016/J.POETIC.2013.08.004
45. Elgesem D, Steskal L, Diakopoulos N (2015) Structure and Content of the Discourse on Climate Change in the Blogosphere: The Big Picture. Environ Commun 9(2):169–188. https://doi.org/10.1080/17524032.2014.983536
46. Parra D, Trattner C, Gómez D, Hurtado M, Wen X, Lin YR (2016) Twitter in academic events: A study of temporal usage, communication, sentimental and topical patterns in 16 Computer Science conferences. Comput Commun 73:301–314. https://doi.org/10.1016/J.COMCOM.2015.07.001
47. "No Title." https://main.sci.gov.in/judgments Accessed 04 Oct 2022.
48. Nguyen TH, Shirai K (2015) Topic modeling based sentiment analysis on social media for stock market prediction in proceedings of the 53rd annual meeting of the association for computational linguistics. Int Joint Conf Natural Lang Process. 1:354–1364
49. Jeong B, Yoon J, Lee JM (2019) Social media mining for product planning: A product opportunity mining approach based on topic modeling and sentiment analysis. Int J Inf Manage 48(April):280–290. https://doi.org/10.1016/j.ijinfomgt.2017.09.009
50. Kiliç S, Çadirci TO (2022) An evaluation of airport service experience: An identification of service improvement opportunities based on topic modeling and sentiment analysis. Res Transp Bus Manag 43:100744. https://doi.org/10.1016/j.rtbm.2021.100744
51. Waheeb SA, Khan NA, Shang X (2022) Topic modeling and sentiment analysis of online education in the covid-19 era using social networks based datasets. Electronics 11:5. https://doi.org/10.3390/electronics11050715
52. Hasib KM, Towhid NA, Alam MGR (2021) Topic modeling and sentiment analysis using online reviews for bangladesh airlines ieee 12th annual information technology. Elect Mob Commun Conf (IEMCON) 2021:428–434. https://doi.org/10.1109/IEMCON53756.2021.9623155

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
