The Information Management (IM) domain has experienced convergence and integration, blurring
the traditional separation of roles between data administrators and information managers. This
convergence of skills has made IM a strategic issue in organizations: it improves corporate performance,
strengthens competitiveness, and reduces uncertainty. Strategic information management enhances
differentiation in value-chain activities and has broader impacts at the organizational, sectoral, and
societal levels.
The exponential growth in information production has increased the importance of information
management (IM), which requires new skills, knowledge, qualifications, and experience for managing
information at four levels:
• Information Retrieval
• Information Systems
• Information Contexts
• Information Environments
IM is defined as the management of the processes and systems that create, acquire, organize, store,
distribute, and use information. Over the last five decades, IM research has covered a wide range of
themes and topics, but limited effort has been made to integrate this fragmented work. Tracing the
evolution of the intellectual structure of IM research published in journals and conference proceedings
over those five decades would therefore add significant value to the existing body of knowledge and be
of interest to both academics and practitioners in the field.
1. Bibliometric analysis
Bibliometric analysis is a widely accepted technique for analyzing and summarizing vast and
fragmented bodies of research. Originating in library and information science, it has since been
applied to fields as varied as social science, international business, public policy, marketing,
advertising, psychology, travel and tourism marketing, computer-integrated manufacturing,
communications, and information systems. Citation analysis measures similarity and association
among research papers, contributors, and journals. For a specific research domain, bibliometric
analysis can map thematic areas and visualize conceptual subdomains; for a journal, it can trace
thematic evolution, visualize citation patterns, discern progressive themes, and suggest future
research avenues. Overall, it offers a reliable, widely used way to organize and summarize a
research field.
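The co-occurrence counting at the heart of citation analysis can be sketched in a few lines of Python; the paper IDs and reference sets below are purely illustrative, not from the study's data:

```python
from collections import Counter
from itertools import combinations

# Toy reference lists: each paper cites a set of earlier works (hypothetical IDs).
papers = {
    "P1": {"A", "B", "C"},
    "P2": {"A", "B"},
    "P3": {"B", "C", "D"},
}

# Citation counts: how often each work is cited across the corpus.
citation_counts = Counter(ref for refs in papers.values() for ref in refs)

# Co-citation counts: two works are associated when the same paper cites both.
co_citations = Counter()
for refs in papers.values():
    for pair in combinations(sorted(refs), 2):
        co_citations[pair] += 1

print(citation_counts["B"])      # cited by all three papers
print(co_citations[("A", "B")])  # co-cited by P1 and P2
```

The resulting co-citation counts form the similarity matrix that bibliometric tools cluster and visualize.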
2. Topic modeling based on structural topic models
Topic modeling is a natural language processing and text analytics technique for extracting
underlying topics (latent themes) from text documents. It is an unsupervised machine learning
approach that discovers latent themes and their prevalence across a collection of documents.
Popular techniques include Latent Semantic Analysis (LSA), Probabilistic Latent Semantic
Analysis (PLSA), and Latent Dirichlet Allocation (LDA). LDA is a Bayesian generative
probabilistic model: it assumes each document is a probabilistic mixture over latent topics, and
each latent topic is a distribution over a fixed vocabulary, with each word assigned a probability.
Structural topic modeling (STM) is a more recent probabilistic topic modeling technique that
estimates a topic model through fast variational approximation using an expectation-
maximization algorithm. Unlike LDA, STM can also model interactions between topics and
document-level covariates drawn from the metadata.
Step-1. Estimate the topic-prevalence parameter (proportion vector) θ_d for each document d
using a logistic-normal generalized linear model on the document covariates X_d:

    θ_d | X_d, Γ, Σ ~ LogisticNormal(X_d Γ, Σ)

where θ_d holds the proportions of the k = 1, …, K topics in document d, Γ = [γ_1 | … | γ_K]
collects the topic-prevalence model's coefficients, and Σ is a hyper-parameter modeled as a
(K − 1) by (K − 1) covariance matrix.
Step-2. Generate the topical content model β, which represents each topic k as a probabilistic
distribution over words, using the following equation:

    β ∝ exp(m + κ^(t) + κ^(c) + κ^(i))

where m is the baseline word-distribution vector of length V, κ^(t) is the topic (t) specific
deviation, κ^(c) is the covariate (c) group deviation, and κ^(i) is the interaction (i)
coefficient. This study includes publication year as the covariate.
Step-3. For each word in the document (n ∈ {1, …, N_d}), the core language model samples a
topic from a multinomial distribution over the topic-prevalence parameter, and then, given that
topic, samples the word from the topic's multinomial word distribution:

    z_{d,n} ~ Multinomial(θ_d),    w_{d,n} ~ Multinomial(β_{z_{d,n}})
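The three steps above can be sketched as a generative simulation; the dimensions, the covariate vector, and the randomly drawn β below are illustrative stand-ins for quantities that STM would estimate from data:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, P = 3, 8, 2                      # topics, vocabulary size, covariates (toy sizes)
X_d = rng.normal(size=P)               # document covariates (e.g., publication year)
Gamma = rng.normal(size=(P, K - 1))    # topic-prevalence coefficients
Sigma = np.eye(K - 1) * 0.5            # (K-1) x (K-1) covariance hyper-parameter

# Step 1: logistic-normal draw of the topic-prevalence vector theta_d.
eta = rng.multivariate_normal(X_d @ Gamma, Sigma)
eta = np.append(eta, 0.0)              # K-th entry fixed at 0 for identifiability
theta_d = np.exp(eta) / np.exp(eta).sum()

# Step 2: topical-content distributions beta (random here; estimated in practice).
beta = rng.dirichlet(np.ones(V), size=K)   # each row: a topic's word distribution

# Step 3: per word, sample a topic from theta_d, then a word from that topic.
N_d = 5
topics = rng.choice(K, size=N_d, p=theta_d)
words = [rng.choice(V, p=beta[z]) for z in topics]
print(theta_d, words)
```

Estimation reverses this process: given the observed words and covariates, the variational EM algorithm recovers Γ, Σ, and β.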
This study uses the logistic-normal distribution to compute the topic-prevalence parameters in
STM, relating them to document-level covariates. The text corpus for STM is created from each
article's title, keywords, abstract, and publication year. The text data is preprocessed to remove
stop words, numbers, non-English words, special characters, and punctuation. The corpus is then
cleaned of frequent boilerplate terms related to copyright information and publishers. Bigram
terms are generated from the text corpus and compared with the authors' specified keywords,
and the most frequent bigrams are concatenated into single tokens for better topic modeling
results.
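A minimal sketch of this preprocessing and bigram-concatenation step, using a toy stop-word list and corpus; the frequency threshold of 2 and the underscore joining are assumptions for illustration:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "in", "for", "to", "on"}  # tiny toy list

def preprocess(text):
    """Lowercase, keep alphabetic tokens only (drops numbers and punctuation),
    and remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

docs = [
    "A study of information management in the digital age (2019).",
    "Information management and decision support for the firm.",
]
tokenized = [preprocess(d) for d in docs]

# Count bigrams across the corpus and concatenate the frequent ones, so that
# e.g. "information management" becomes the single token "information_management".
bigrams = Counter((a, b) for doc in tokenized for a, b in zip(doc, doc[1:]))
frequent = {bg for bg, n in bigrams.items() if n >= 2}

def concat_bigrams(doc):
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
            out.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out

processed = [concat_bigrams(d) for d in tokenized]
print(processed[0])
```

Concatenating frequent bigrams lets multi-word concepts enter the topic model as single vocabulary items instead of being split apart.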
3. Data
The study retrieved bibliographic data for the Information Management (IM) domain from the
Scopus database for 1970–2019. The search query was restricted to Business, Management, and
Accounting as the most prominent subject area, yielding 20,057 documents; after removing
discrepancies, 19,916 research documents were retained. All analyses, including the
bibliometric overview, topic modeling, and results visualization, were performed in the R
environment on a Windows 10 computer with 16 GB RAM and a 64-bit architecture.
Results
Using STM, the study discovered 16 topics across the 19,916 research documents in the
Information Management (IM) domain. The topics were given semantically descriptive labels based on
their most probable words, and the most frequent bigrams were concatenated so that meaningful
bigrams entered the topic modeling. Semantic coherence and exclusivity are the two key constructs
measuring the overall quality of topic models. The exclusivity scores for the topics range from 11.33 to
11.80, while semantic coherence ranges from -191.29 to -102.45. The study also found that the top
words of any two topics do not co-occur equally within the documents.
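Semantic coherence is commonly computed with the UMass formulation, which rewards topics whose top words co-occur in the same documents; the toy document sets below are illustrative, and the study's exact variant is an assumption here:

```python
import math

# Toy per-document word sets (hypothetical); in the study these would come
# from the 19,916 Scopus documents.
docs = [
    {"information", "management", "system"},
    {"information", "system", "user"},
    {"industry", "innovation", "environment"},
    {"innovation", "industry", "policy"},
]

def doc_freq(*words):
    """Number of documents containing all the given words."""
    return sum(1 for d in docs if all(w in d for w in words))

def umass_coherence(top_words):
    """UMass semantic coherence: higher (less negative) means the topic's
    top words co-occur more often within the same documents."""
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log(
                (doc_freq(top_words[i], top_words[j]) + 1) / doc_freq(top_words[j])
            )
    return score

coherent = umass_coherence(["information", "system", "management"])
incoherent = umass_coherence(["information", "innovation", "management"])
print(coherent, incoherent)
```

A topic mixing words that never appear together (here "information" and "innovation") scores lower than one whose top words share documents.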
The perspective visualization in Fig. 7 shows the topical contrast between Topic-12 (International
Accounting and Global Business), Topic-3 (Information, Web, and User), and Topic-8 (Industry and
Industrial Innovation), indicating the semantic association between topics.
The research on international accounting and global business focuses on macro-level aspects,
while topics like financial performance and investment address micro-level aspects. Topics such as
information, web, and user center on information content, search, and semantics, while topics such as
industry and industrial innovation emphasize innovation, industry, and environmental aspects.
This study used correlation analysis to quantify the association among the extracted topics. All
correlation values were below 0.3, indicating weak or no correlation among the extracted topics. A
positive correlation between two topics would indicate that many documents cover both topics
together.
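The correlation analysis can be sketched as follows, using a randomly generated document-topic matrix as a stand-in for the fitted STM proportions (the study has 16 topics; 8 are used here for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical document-topic proportion matrix (documents x topics); in the
# study, these proportions come from the fitted STM.
doc_topic = rng.dirichlet(np.ones(8), size=200)  # 200 docs, 8 topics

# Pairwise Pearson correlations between topic-proportion columns.
corr = np.corrcoef(doc_topic, rowvar=False)      # shape (8, 8)

# Off-diagonal pairs with |r| < 0.3 count as weakly (or un-) correlated topics.
weak_pairs = [
    (i, j)
    for i in range(corr.shape[0])
    for j in range(i + 1, corr.shape[1])
    if abs(corr[i, j]) < 0.3
]
print(len(weak_pairs), "weakly correlated topic pairs out of", 8 * 7 // 2)
```

Applying the same 0.3 threshold to the study's 16-topic matrix is what yields the reported "weak or no correlation" finding.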
This study analyzed maximum-a-posteriori (MAP) estimates of the document-topic loadings to
confirm topic quality. A histogram showed the expected distribution of topic proportions across the
research documents from 1970 to 2019. Under the statistical mixture hypothesis, each document is a
probabilistic mixture of a few key topics and many non-key topics, and the plot confirms this: each
extracted latent topic has little or no presence in most research documents, with high loadings
concentrated in a few.
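The MAP check can be sketched by taking each document's largest topic proportion; the sparse Dirichlet draw below is a hypothetical stand-in that merely mimics documents dominated by a few key topics:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical document-topic proportions from a fitted topic model.
# A sparse Dirichlet (alpha < 1) mimics documents dominated by few key topics.
doc_topic = rng.dirichlet(np.full(16, 0.1), size=1000)  # 1000 docs, 16 topics

# MAP topic per document: the topic with the largest proportion.
map_topic = doc_topic.argmax(axis=1)
map_share = doc_topic.max(axis=1)

# Under the mixture hypothesis, the MAP proportions concentrate near 1,
# so a histogram of map_share is heavily right-shifted.
print("median MAP proportion:", round(float(np.median(map_share)), 2))
```

Plotting `map_share` as a histogram reproduces the kind of figure the study uses to confirm that documents load heavily on only a few topics.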