Scientific African: Kingstone Nyakurukwa, Yudhvir Seetharam

Scientific African 20 (2023) e01596
journal homepage: www.elsevier.com/locate/sciaf
The evolution of studies on social media sentiment in the stock market: Insights from bibliometric analysis
Kingstone Nyakurukwa∗, Yudhvir Seetharam
University of the Witwatersrand, School of Economics and Finance, 1 Jan Smuts Avenue, Braamfontein, 2000, Johannesburg, South Africa
Article info
Article history:
Received 6 June 2022
Revised 23 December 2022
Accepted 16 February 2023
Editor: DR B Gyampoh
Keywords:
Social media sentiment
Bibliometric analysis
Textual sentiment
Stock market
Abstract
Social media sentiment applied in the stock market is extracted from social media platforms and researchers have grappled with the way it influences different stock market features like returns, trading volume and volatility. The growth in Twitter, StockTwits, WeChat and Sina-Weibo social media platforms has provided investors with convenient avenues for expressing their opinions about the stock market. We seek to examine the evolution of textual sentiment in the stock market over the past decade. We used co-citation, bibliographic coupling and co-occurrence analysis to provide an overview of the structure of social media sentiment within the stock market. The findings from the study show that the concept of social media sentiment as applied in the stock market is multidisciplinary. Most of the studies are found in the computer science and mathematical sciences domains with a few in the economics and finance domains. More recent studies are centred on ways and methods of extracting sentiment from social media as seen by the emergence of such author keywords like “Natural language processing”, “machine learning” and “deep learning” in the second half of the decade of the sample period used in the study. In summary, “social media sentiment” in the stock market has many avenues of expansion as seen by permeating different research domains like physics, mathematical sciences, computer science and finance. To the best of our knowledge, this is the first study to examine the evolution of social media sentiment using bibliometric analysis.
© 2023 The Author(s). Published by Elsevier B.V. on behalf of African Institute of
Mathematical Sciences / Next Einstein Initiative.
This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/)
Introduction
Sentiment analysis is a series of methods and techniques used to detect and extract subjective information such as
opinions from language. Sentiment analysis has been used in different fields to obtain opinion polarity (i.e. whether a person
has a neutral, positive or negative opinion towards something). Most of the proxies for investor sentiment that have been
used in finance are aggregate measures extracted from macroeconomic variables as well as market-wide metrics. However,
the advent of social media has provided another avenue through which the opinions of investors can be extracted and
linked to stock market features at a micro-level. The establishment of Twitter in 2006 and StockTwits in 2008 led to the
proliferation of studies examining how opinions from these platforms can be linked to stock market features.
∗ Corresponding author.
E-mail addresses: knyakurukwa@gmail.com (K. Nyakurukwa), Yudhvir.Seetharam@wits.ac.za (Y. Seetharam).
https://doi.org/10.1016/j.sciaf.2023.e01596
Online stock forums have emerged as essential investing platforms where multiple users can share their opinions about
financial markets. The role of social media in the financial markets was especially brought to the fore by the events sur-
rounding GameStop Inc., a struggling, Texas-based video game retailer. Wall Street hedge funds placed major bets on the
demise of the GameStop stock because of the effects of the COVID-19 pandemic on the brick-and-mortar company. Some
retail investors who share opinions on Reddit, an online discussion forum, rallied together to buy the GameStop stock. These
retail investors were not buying the stock because of its fundamentals but because of the sentiment attached to the stock.
The GameStop stock soared by more than 1700% as a result, triggering a short squeeze on several hedge funds that had
taken short positions in the stock. This event, and many others in the recent history of financial markets, make social media sentiment in the stock market a concept that warrants intensive research. Given the prominence that social media sentiment has attained in financial markets, this study explores how research on social media sentiment in the stock market has evolved over the past decade, since the first article was published in the Scopus and Web of Science databases. We therefore apply co-citation, bibliographic coupling and co-occurrence analysis to provide an overview of the structure of social media sentiment within the stock market.
This study adds to the growing literature on social media sentiment in the stock market in several ways. Firstly, though
bibliometric analysis has been done on investor sentiment generally (e.g. [24]), this study especially looks at social media
sentiment, an emerging type of investor sentiment that is extracted from social media. To the best of our knowledge, this
is the first bibliometric study to specifically look at the evolvement of social media sentiment within the stock market.
Secondly, we utilise two of the most used databases (Scopus and Web of Science) for our analysis, providing us with a com-
prehensive database of peer-reviewed articles as well as conference proceedings. Most studies utilising bibliometric analysis
use one of the above-mentioned databases but usually not both. Thirdly, our database includes studies encompassing the
period since the first COVID-19 cases were reported in different countries, which is a period that saw many studies being
done on the role of social media in the stock market as many retail investors resorted to online stock platforms like Robin-
hood for investment opinions. Finally, we identify one university in an under-represented region and explore the extent to
which its library provides access to the publications cited. In undertaking this research, we seek to answer the following
research questions:
• What is the current trend of research in the area of social media sentiment in the stock market?
• Which are the leading, influential and impactful sources and contributors to the extant literature?
• Which are the most influential articles in this research domain?
• What are the prominent themes prevailing in this area of research?
• What is the scope for future research?
The findings show that the fields of mathematical sciences, as well as computer sciences, dominate the research space
regarding this concept. Most of the studies done on this concept since the first article appeared in 2011 are concentrated
in China, the United States of America, the United Kingdom and several other European countries. More recent studies are
centred on ways and methods of extracting sentiment from social media as seen by the emergence of such author keywords
like “Natural language processing”, “machine learning” and “deep learning” in the second half of the decade of the sample
period used in the study. In summary, “social media sentiment” in the stock market has many avenues of expansion as seen
by permeating different research domains like physics, mathematical sciences, computer science and finance.
The study proceeds as follows: Section 2 looks at the methodology used in the study, Section 3 presents the results from
the study while Section 4 discusses the results and concludes.
Literature review
Several models have been developed to comprehend how investor sentiment may form and the subsequent impact on
financial markets. The challenge is that investor sentiment is not observable and has therefore to be estimated from proxies.
The closed-end discount, first used by Zweig [46] and later expanded by Lee, Shleifer and Thaler [23], is perhaps the oldest existing measure of investor sentiment. Baker and Wurgler [3] combine the closed-end discount with five other measures to create an investor sentiment index, which captures sentiment more appropriately than any of the individual components in explaining the cross-section of returns. This index, together with some associated variants, has since become one of the most used proxies for investor sentiment in the literature. However, the Baker and Wurgler [3] index and its associated metrics simply measure various beliefs without using a specific benchmark model. It therefore becomes challenging to separate rational from irrational behaviour, as the measured beliefs may be consistent with those from rational models.
Zhou [45] identifies three broad categories of investor sentiment measures: market-based, survey-based and text-based. Market-based measures draw on market data such as stock prices and trading volumes; survey-based measures use data from polls in which market participants state their opinions; text-based measures are a fairly new approach that uses content extracted from news, social media platforms, internet message boards and similar sources. The first two categories have weaknesses in that they are quite indirect in their attribution of investor sentiment to a particular asset. Moreover, they are mostly measured at low frequencies. In a fast-paced contemporary economic and financial environment, they might not be able to explain shocks that take place at very high frequencies. High-frequency traders are also becoming more dominant in financial markets, justifying the need to identify proxies of investor sentiment that can be measured at moderately higher frequencies.
As a result of increased internet penetration coupled with more computing power, several disciplines have used this synergistic advantage to extract sentiment from text messages, especially from online platforms like social media and internet news websites. The finance domain has also joined the bandwagon, as seen by a shift in the investor sentiment literature towards sentiment extracted from textual data. Gan et al. [17] define investor sentiment extracted from textual data as the overall attitude of investors towards a specific security, sector or market. This deviates from the definition of Baker and Wurgler [3] by giving a more direct measurement of market sentiment. Kearney and Liu [20] broadly identify two types of sentiment in finance: investor sentiment (as defined by Baker and Wurgler [3]) and textual sentiment, which they define as the “degree of positivity or negativity in texts”. According to Kearney and Liu [20], the fundamental difference between investor sentiment and textual sentiment is that the former captures the subjective judgements of behavioural properties of individuals while the latter also includes a more objective reflection of the conditions in financial markets.
Several studies have been done on the role of social media in finance, particularly in the stock market. One strand of
literature has investigated social media sentiment contagion within a network of firms. Using a generalised difference-in-differences approach and 246,515 firm-day observations for 2988 unique firms, Gu, Teoh and Wu [18] sought to establish if
GIFs on StockTwits can be a source of investor sentiment contagion amongst the different users of the social media platform.
The empirical results from the study show that investor sentiment directed at a particular company increases when stock
opinion about the company is first debated using GIFs. Also, it is revealed that days on which a higher proportion of stock
opinions are expressed in the form of GIFs are associated with higher investor sentiment. Besides StockTwits, social media
sentiment contagion has also been investigated using other platforms like Sina Weibo [16], Guba Eastmoney [35] and Reddit
[34].
Most of the studies that have examined the role of social media in the stock market have investigated its association
with stock returns. Several proxies for social media sentiment have been used in this regard including sentiment extracted
from Twitter [26,27], Facebook [36], Sharewise [10] and Reddit [1]. The majority of these studies have confirmed that dif-
ferent proxies of social media sentiment scores can predict stock returns. While the common trend in studies linking social
media and stock returns has been within a certain geographical location, some recent studies have attempted to conduct an
external validation of the social media sentiment proxies using dual-listed companies. The rationale is that if social media
sentiment is an accurate proxy for investor sentiment, then it should at least partially explain some of the deviations from
price parity observed with dual-listed companies. An example is Karabulut [19] who uses Facebook’s Gross National Happi-
ness scores to examine if social media can partially explain deviations from the law of one price for dual-listed stocks. To
summarise the literature, social media has become an important aspect of understanding financial markets as the growing
literature shows previously unexplored areas being examined.
Research methodology
This study utilises a bibliometric analysis of peer-reviewed articles and conference proceedings on social media sentiment in the stock market. The term bibliometrics was first used by Pritchard [31], who defined it as the “application
of mathematical and statistical methods to books and other means of communication”. The science of bibliometrics uses
quantitative analysis of published articles to establish patterns within a specific field. Vogel and Güttel [42] state that bib-
liometrics is essential in examining the emerging themes in a particular study area. Bibliometrics is usually complemented
with science mapping techniques to visualise the intellectual structure of a particular field. The tools used in the bibliometric
analysis include citation analysis, co-citation analysis, keyword analysis and co-authorship analysis.
Database
The initial stage of a bibliometric study is the determination of the appropriate database from which the relevant documents can be retrieved. Most of the bibliometric studies done in the broad area of finance use either Scopus or Web of Science (WoS) as the preferred database but usually not both. The full Scopus database dates back to 1966 while the full WoS database dates back to 1945. According to Pranckutė [30], although extensive scholarly research has compared these two commonly used databases, no conclusion has been reached on which is better, as neither is fully inclusive. Where one database is lacking, the other complements it. For example, it is known that Scopus indexes a greater number of unique sources not covered by WoS [30]. Since scholarly evidence shows that the two databases can be complementary, this study adopts both for the purposes of metadata analysis. According to Keramatfar and Amirkhani [21], the process of extracting sentiment from texts is more inclined toward the computer science and mathematical sciences domains and, as a result, the main venue of publication is conference proceedings. Since Scopus has more conference proceedings publications, it is used as the primary database, with WoS used as a secondary database to identify publications not captured in Scopus. Thus, the final database used in this study is a merged set of records from Scopus and WoS.
Search strategy
A common challenge in bibliometric studies is building a valid search query that retrieves as many relevant articles as possible while minimising the number of irrelevant ones. When it comes to social media sentiment in the stock market, authors have used a variety of keywords. This study therefore reviewed several published systematic reviews and bibliometric studies to devise an appropriate search query for “social media sentiment in the stock market”. The strategy was to identify the words commonly used for social media sentiment in the literature as well as the sources used to extract sentiment from social media. Outside China, the social media platforms most commonly used to construct social media sentiment relevant to the stock market are Twitter, StockTwits, Facebook and Reddit. Because most western social media platforms are banned in China, the country has its home-grown social media platforms, which researchers use to mine opinions relevant to the stock market. The literature shows that the most used of these are Weibo, a “Twitter-like platform”, and WeChat. The main terms used in the literature to represent opinions generated from social media are “social media sentiment”, “opinion mining” and “textual sentiment”. Instead of “social media”, some studies use “microblogs” and “microblogging sites”. Since the study specifically aims to examine the evolution of social media sentiment relevant to the stock market, the following terms were identified as those commonly used when companies listed on a stock exchange are examined: “stock market”, “listed companies” and “stock exchange”.
As a result, the final string formed was (Twitter OR StockTwits OR Facebook OR Weibo OR Reddit OR WeChat OR {Social media sentiment} OR {textual sentiment} OR microblog*) AND ({stock market} OR {listed companies} OR {stock exchange}). Curly brackets were used to make sure that the words searched appeared as phrases rather than independent words. An asterisk was appended to “microblog” to capture variations of the word. The search terms were matched against the keywords, abstracts and titles of the articles. The Boolean operator “OR” finds documents that contain any of the terms, while “AND” finds documents that contain all of the terms. The search results were scrutinised by reading each abstract to check whether the content of the manuscript was in line with the objective of the study. This yielded 292 documents from Scopus and 184 documents from WoS. After merging the documents from both databases and removing duplicates, the final database for this study contained 366 documents comprising conference proceedings as well as peer-reviewed articles. These documents were used for the rest of the analysis, and the publications by year are shown in Fig. 1.
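The search and merge steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the record fields (`doi`, `title`) and the duplicate rule (match on DOI, else on a normalised title) are hypothetical, since the study does not state how duplicates were detected.

```python
import re


def build_query(terms, phrases, wildcards, market_phrases):
    """Compose the boolean search string; curly braces mark exact phrases."""
    left = list(terms) + [f"{{{p}}}" for p in phrases] + list(wildcards)
    right = [f"{{{p}}}" for p in market_phrases]
    return f"({' OR '.join(left)}) AND ({' OR '.join(right)})"


def record_key(record):
    """Hypothetical duplicate key: the DOI if present, else a normalised title."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    title = re.sub(r"[^a-z0-9]+", " ", record.get("title", "").lower()).strip()
    return "title:" + title


def merge_records(primary, secondary):
    """Keep every primary record, then add secondary records not seen before."""
    merged, seen = [], set()
    for record in list(primary) + list(secondary):
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            merged.append(record)
    return merged


QUERY = build_query(
    terms=["Twitter", "StockTwits", "Facebook", "Weibo", "Reddit", "WeChat"],
    phrases=["Social media sentiment", "textual sentiment"],
    wildcards=["microblog*"],
    market_phrases=["stock market", "listed companies", "stock exchange"],
)

# Counts reported in the text: 292 (Scopus) + 184 (WoS) merged into 366,
# which implies that 110 records were removed as duplicates.
DUPLICATES_REMOVED = 292 + 184 - 366
```

Listing the primary database first means that, when a document appears in both exports, the Scopus version of the record is the one retained, which mirrors the primary/secondary roles described in the Database section.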
Fig. 1 shows the number of articles published per year using the combined database containing documents from both
Scopus and the WoS. As can be seen in Fig. 1, the first article was published in 2011, which is the Bollen et al. [9] paper titled
“Twitter mood predicts the stock market”. Two other articles on social media sentiment in the stock market were published
in 2011. From 2012, the number of articles published increased continuously until 2020. Since the document search for this
study was executed on 30 March 2022, the 2022 publications in Fig. 1 represent the number of articles published as of 30
March 2022.
Fig. 1. Distribution of publications by year.
Notes: Though the first article in our database starts from 2011, we used the entire period in the respective databases in our search criteria.
Bibliometric indicators
Different bibliometric indicators that have been used in other studies in the broad area of finance are also used in this study. One of the extensively utilised metrics in bibliometric studies is co-citation analysis [14]. Small [38] defines co-citation as two publications that are cited together in a single study. This metric is important as explained by Benckendorff and Zehrer [6], who argue that when two studies are habitually cited together, chances are that these studies have something in common. For this study, co-citation is implemented using the R package Bibliometrix. According to the Bibliometrix package, a co-citation network can be obtained using the general formulation:
$$C = A^{T} \times A$$
where A is a bipartite network. The main diagonal of C contains the number of cases in which a reference is cited in a
dataframe. In other words, the diagonal element is the number of local citations of the reference i.
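As a small worked illustration of the formulation above, the sketch below computes C for a toy matrix in plain Python. It assumes that the rows of A are citing documents and the columns are cited references, which is the orientation under which the diagonal of C counts local citations; the toy data and the function name are hypothetical and this is not the Bibliometrix implementation itself.

```python
def cocitation_matrix(A):
    """Return C = A^T x A for a binary documents-by-references matrix A."""
    n_refs = len(A[0])
    return [
        [sum(row[i] * row[j] for row in A) for j in range(n_refs)]
        for i in range(n_refs)
    ]


# Toy data (hypothetical): 3 citing documents, 3 cited references.
# A[d][r] == 1 means document d cites reference r.
A = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]

C = cocitation_matrix(A)
# C[i][i] is the number of documents citing reference i (local citations);
# C[i][j] is the number of documents citing references i and j together.
```

Here reference 1 is cited by all three documents, so the corresponding diagonal element is 3, while references 0 and 2 are cited together only once.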
Chang et al. [12] argue that though co-citation analysis provides the basis on which most bibliometric studies are based,
it does not provide a content picture of the research topics inherent in the studies reviewed. To this end, co-word analysis
is often used to circumvent this anomaly. The principle of co-word analysis is based on the examination of the frequency
of co-occurrence of keywords, that is, the number of papers in which two keywords appear together. Co-word analysis is
therefore useful as it can reveal the interactions between keywords through visualisation of the strengths of the interactions.
The use of keywords helps readers in searching for an article and usually shows the core of a research article. As such, co-
word analysis can be used to diagnose the concept network of research topics and can also show emerging trends within
a particular research field. Another bibliometric indicator used in this study is bibliographic collaboration. A scientific collaboration network is created in which nodes are authors and links are co-authorships. In the Bibliometrix package, an author collaboration network can be obtained using the following general formulation:
$$AC = A^{T} \times A$$
where A is a bipartite network Manuscripts × Authors. The diagonal element ac_i is the number of manuscripts authored or co-authored by researcher i.
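The co-word counting principle described above (the number of papers in which two keywords appear together) can be sketched as follows. The function name, the lower-casing of keywords and the toy keyword lists are assumptions made for illustration; they are not taken from the study or from the Bibliometrix package.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(keyword_lists):
    """Count, for each keyword pair, the number of papers listing both."""
    pairs = Counter()
    for keywords in keyword_lists:
        # Normalise case and drop repeats so a paper counts once per pair.
        unique = sorted({k.strip().lower() for k in keywords if k.strip()})
        pairs.update(combinations(unique, 2))
    return pairs


# Toy author-keyword lists (hypothetical), one list per paper.
papers = [
    ["Twitter", "sentiment analysis", "stock market"],
    ["twitter", "Stock Market"],
    ["sentiment analysis", "deep learning"],
]

cooccurrence = keyword_cooccurrence(papers)
```

The resulting counts are the edge weights of a keyword network: the heavier the edge, the more often the two keywords are used in the same paper.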
Results and discussion
Productive authors
According to Bergh et al. [7], the characteristics of an author have a bearing on the impact that an article written by the
author will have. Those authors who publish a lot in a specific field have a significant impact on the themes that future
research follows in that field. It is therefore of paramount importance to examine the most published authors in social
media sentiment in the stock market to understand the past evolution in the field as well as the expected direction of
research in the field. The 366 articles which formed part of the sample used in this study were written by 845 different
authors. Amongst these 845 authors, 11 (1.3%) published 5 or more articles, 9 (1.1%) published 4 articles, 29 (3.4%) published 3 articles, 85 (10.1%) published 2 articles and 711 (84.1%) published only 1 article. In line with Cuccurullo et al. [13], the contributing authors are ranked using the frequency of articles published as well as the fractionalised frequency, which captures the dynamics of multi-authored articles. Under fractionalised frequency, an article published by two authors gives each author half a credit, an article by three authors gives each a third of a credit, and so on. Table 1 provides a list of the top ten most-published authors based on adjusted and total appearances.
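The two counting schemes can be made concrete with a short sketch. The function name and the toy author lists are hypothetical; the credit rule (each of n co-authors receives 1/n per article) is the one described above.

```python
from collections import defaultdict
from fractions import Fraction


def author_frequencies(papers):
    """Return (total, fractionalised) appearance counts per author."""
    total = defaultdict(int)
    fractional = defaultdict(Fraction)
    for authors in papers:
        unique = list(dict.fromkeys(authors))  # drop repeated names, keep order
        if not unique:
            continue
        share = Fraction(1, len(unique))  # each co-author receives 1/n credit
        for author in unique:
            total[author] += 1
            fractional[author] += share
    return dict(total), dict(fractional)


# Toy data (hypothetical): one author list per article.
papers = [["A", "B"], ["A", "B", "C"], ["A"]]
total, fractional = author_frequencies(papers)
# Author A appears in 3 articles and earns 1/2 + 1/3 + 1 = 11/6 credits.
```

Because every article distributes exactly one credit, the fractionalised frequencies sum to the number of articles, which is why this measure discounts authors who publish mainly in large teams.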
The top 2 most prolific authors are Shen D. with a fractionalised frequency of 2.42 and 9 total appearances, and Zhang W. with a fractionalised frequency of 2.33 and 9 total appearances. The top 5 authors are affiliated with institutions based in China, showing that studies in China are pushing the research agenda in the area of social media sentiment in the stock market. The majority of the most prolific authors publish their articles in computer science journals as well as conference proceedings, showing the dominance of articles centred on the computational dynamics of extracting sentiment from social media rather than explaining the dynamics using mainstream finance theories. This explains why most of the contemporary research on social media sentiment in the stock market is multidisciplinary but with the mathematics and computer science domains dominating. For the top 10 most prolific authors shown in Table 1, the Pearson correlation coefficient between the total and fractionalised frequency is 0.849. This shows a strong association between total and adjusted author appearances and implies that the most prolific authors tend to write with a similar number of co-authors.
Table 1
Most published authors.
Authors | Frequency | F Frequency
Shen D | 9 | 2.42
Zhang W | 9 | 2.33
Li X | 7 | 1.63
Wang D | 6 | 2.00
Zhang Y | 6 | 1.68
Areal N | 5 | 1.67
Chen W | 5 | 1.37
Cortez P | 5 | 1.67
Oliveira N | 5 | 1.67
Wang B | 5 | 1.18
Notes: Frequency shows the number of times an author has appeared in the database while F Frequency is the fractionalised frequency of each author.
Productive countries
According to the search results, all the articles and conference proceedings papers on social media sentiment in the stock market used in this study came from 38 countries. The top nine countries are shown in Table 2, with a total of 130 papers accounting for 39.40% of the total publications. China has the most publications with 47, accounting for 14.20%, followed by the United States of America with 23 articles accounting for 6.94% of the total publications. Table 2 shows that using the Scopus and WoS databases, most of the publications come from developed countries like the USA and the United Kingdom as well as emerging economies like China and India. The studies are concentrated in North America, Europe and Asia, with Africa underrepresented.
Table 2
Most prolific countries.
Country | N | %
China | 47 | 14.20
USA | 23 | 6.94
UK | 13 | 3.92
India | 11 | 3.33
Spain | 11 | 3.33
Turkey | 7 | 2.11
Brazil | 6 | 1.81
Germany | 6 | 1.81
Hong Kong | 6 | 1.81
Notes: N represents the total number of articles of each country and % represents the proportion of each country’s publications to the total publications.
Fig. 2. Most productive countries.
Notes: Fig. 2 shows the most productive countries in terms of the number of articles published by researchers affiliated with institutions in each respective country. SCP shows single-country publications and MCP shows multiple-country publications.
Fig. 2 shows that amongst the most productive countries, the United Kingdom has the most collaborations with other countries, followed by China and the United States of America. The dominance of China is mainly driven by the fact that the majority of Chinese investors are individual rather than institutional, making studies of online opinions on the stock market important since these individual investors depend mainly on opinions on online stock forums [43].
Most cited articles
Table 3 shows the most cited articles within the period of the analysis, which is between 2011 and 2020. The average global citations score (GCS) for each article is 20.70 while the standard deviation is 134.67 and the median is 4. The most cited article is the article by Bollen et al. [9] titled “Twitter mood predicts the stock market”, which is reputed to be the first documented article to extract sentiment from Twitter to predict the stock market. The importance of this paper can also be seen in the fact that it led to the creation of the world’s first Twitter-based hedge fund as well as the media hype it received from reputable media outlets like Huffington Post, CNBC and CNN [22]. Another thing to note from the top ten most cited articles is that almost all of them use investor sentiment extracted from Twitter, again showing the importance of the Twitter platform in studies that connect textual sentiment to the stock market. Also, it can be seen that the majority of the most-cited articles examined how social media sentiment could be used to predict the stock market.
Table 3
Most cited articles.
No. | Authors | Article title | TC | TC/Y
1 | Bollen [9] | Twitter mood predicts the stock market | 2410 | 219.09
2 | Yu et al. [44] | The impact of social and conventional media on firm equity value: A sentiment analysis approach | 245 | 27.22
3 | Sprenger et al. [39] | Tweets and Trades: The Information Content of Stock Microblogs | 177 | 22.13
4 | Ruiz et al. [33] | Correlating Financial Time Series with Micro-Blogging Activity | 171 | 17.1
5 | Smailović et al. [37] | Stream-based active learning for sentiment analysis in the financial domain | 131 | 16.37
6 | Oliveira et al. [28] | The impact of microblogging data for stock market prediction: Using Twitter to predict returns, volatility, trading volume and survey sentiment indices | 126 | 25.20
7 | Pagolu et al. [29] | Sentiment analysis of Twitter data for predicting stock market movements | 109 | 21.9
8 | Ranco et al. [32] | The Effects of Twitter Sentiment on Stock Price Returns | 106 | 15.14
9 | Bollen and Mao, [8] | Twitter Mood as a Stock Market Predictor | 105 | 10.56
10 | Bartov et al. [5] | Can Twitter Help Predict Firm-Level Earnings and Stock Returns? | 79 | 5.82
Notes: TC shows the total citations; TC/Y shows the total citations per year.
Fig. 3. Co-citation network.
Notes: The colours show the different clusters of cited references included in the analysis. References included in a cluster are more likely to be cited with other references from the same cluster. The greater the size of the circle on each reference, the more often the reference is cited.
Co-citation analysis
As alluded to earlier, when two studies are habitually cited together, chances are that these studies have something in
common. Fig. 3 shows the co-citation network, which captures the instances where two or more authors were cited together in the same article. Each node denotes an author while the size of the node reflects the number of citations. The larger the node, the more times the author is cited by all the other authors, while the colour of each node and line shows clustering. A connection between two authors indicates that they are cited together in the same paper. The thicker the line that connects two authors, the more often the two authors appear together.
As can be seen, Bollen et al. [9], the most cited article, is also the article most frequently co-cited with other articles in the database. Several patterns can be seen from the co-citation network shown in Fig. 3. Firstly, Bollen et al. [9], an article that examined how sentiment mined from Twitter can predict the stock market, was mainly premised on the optimal way of mining opinions from Twitter relevant to the stock market rather than explaining the fundamental theoretical underpinnings from the broad area of finance. It was published in the Journal of Computational Science, a computer science journal. From the network analysis, it can be seen that the article was co-cited with articles that were inclined towards testing the predictive power of sentiment from social media without grounding the findings within the theories of finance. This corroborates the criticism levelled against Bollen et al. [9] by Lachanski and Pav [22], who argued that the former failed to place their findings in the wider world of text mining applications for empirical asset pricing. On the other hand, another network cluster of co-citations exists, with Antweiler and Frank [2] and Tetlock [41] co-cited with authors like Fama [15], Brown and Cliff [11], Baker and Wurgler [3] and Baker and Wurgler [4]. This second network cluster tries to place research findings within the theories of finance, as seen by linking the findings to the efficient market hypothesis [15] as well as to Baker and Wurgler [3], who proposed one of the most used investor sentiment indices in behavioural finance.
Keyword analysis
Authors of articles provide keywords that give a synopsis of their studies as well as a summary of the content of the
articles. This section presents the results on the most used keywords in the literature. The articles used in this study span
a twelve-year period from the first article in 2011 to the last article in 2022. Author keyword analysis is done by dividing
the twelve-year period into two six-year periods (2011–2016 and 2017–2022) and investigating the trends of the keywords.
This is then followed by an analysis of author keyword co-occurrences.
Analysis of author keywords by subperiods

In Table 4, the twelve-year period used in the study is divided into two six-year periods to analyse the trends in the
usage of the author keywords. Table 4 shows the popularity of the top ten most used author keywords for the whole sample
period as well as the two subperiods. For the three sample periods shown in Table 4, “sentiment analysis”, “Twitter” and
“stock market” are the most used author keywords. This shows that Twitter was the social media platform most
used by researchers to extract social media sentiment. Also, the author keyword “stock market” is in line with the study
objective, which aimed to filter studies involving social media sentiment within the stock market only. It can also be seen that
in the 2011–2016 period, “microblogging” and “microblogging data” were amongst the top ten most used author keywords
but dropped out of the top ten in the 2017–2022 period as authors preferred using “social media” instead. In the 2017–2022
subperiod, three new author keywords appear which do not appear in the 2011–2016 subperiod, showing the different direction
that research is taking. These keywords are “machine learning”, “deep learning” and “natural language processing” and show
the preoccupation of researchers with the best ways of extracting sentiment from texts using these three methods. Since
existing research and applications of sentiment analysis have focused primarily on written texts, “natural language
processing” has grown in usage in the second period. “Prediction” and “stock market prediction” exist in all the subperiods, showing
that the majority of the studies primarily examine whether social media sentiment can predict stock market features like
returns, return volatility, earnings and trading volume. “Behavioural finance” was in the top ten used author keywords in the
first period and disappears in the second period, confirming the dearth of papers that seek to link sentiment to the theories
of finance.
Table 4
Analysis of author keywords.
      2011–2022                                2011–2016                        2017–2022
Rank  Author keyword                N   %      Author keyword        N   %      Author keyword                N   %
1     Sentiment analysis            95  7.2    Twitter               18  14.6   Sentiment analysis            87  7.8
2     Twitter                       77  5.8    Stock market          14  6.9    Twitter                       59  5.3
3     Stock market                  58  4.4    Sentiment analysis    8   3.9    Stock market                  44  3.9
4     Social media                  40  3.3    Prediction            6   1.5    Social media                  34  3.0
5     Stock market prediction       30  2.3    Social media          6   1.5    Stock market prediction       29  2.6
6     Machine learning              28  2.1    Social networks       6   1.5    Machine learning              28  2.5
7     Deep learning                 13  0.98   Behavioural finance   4   1.9    Deep learning                 13  1.1
8     Natural language processing   13  0.98   Data mining           4   1.9    Natural language processing   13  1.1
9     Stock market prediction       13  0.98   Microblogging data    4   1.9    Stock prediction              13  1.1
10    Text mining                   13  0.98   Microblogging         4   1.9    Big data                      12  1.0
Notes: N shows the number of times author keywords appeared within a specified sample period; % is the percentage of the specified keyword to the total
number of keywords within a specified period. The total number of keywords submitted by the authors for the whole sample period is 637 which together
had 1319 appearances; for the 2011–2016 subperiod, 123 author keywords were submitted together totalling 204 appearances; while for the 2017–2022
subperiod, 563 keywords were submitted together totalling 1115 appearances.
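The N and % columns described in the notes can be illustrated with a short sketch. This is not the authors' procedure: the function name `keyword_shares`, the record layout, the lower-casing of keywords and the choice of total keyword appearances as the denominator are all assumptions made for illustration.

```python
from collections import Counter


def keyword_shares(records, start, end, top=10):
    """Return (keyword, N, %) for the most used author keywords between start and end."""
    counts = Counter()
    for year, keywords in records:
        if start <= year <= end:
            # Assumption: normalise case so "Twitter" and "twitter" count as one keyword.
            counts.update(k.strip().lower() for k in keywords)
    # Assumption: % is taken over all keyword appearances in the period.
    total = sum(counts.values())
    return [(k, n, round(100 * n / total, 1)) for k, n in counts.most_common(top)]


# Hypothetical toy records: (publication year, author keywords).
records = [
    (2012, ["Twitter", "Stock market"]),
    (2015, ["twitter", "Behavioural finance"]),
    (2019, ["Sentiment analysis", "Deep learning"]),
    (2021, ["Sentiment analysis", "Twitter"]),
]
early = keyword_shares(records, 2011, 2016)
late = keyword_shares(records, 2017, 2022)
print(early)
print(late)
```

Calling the function once per subperiod and once for the whole sample would yield the three column groups of a table laid out like Table 4.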
Fig. 4. Author keyword co-occurrence network.

Notes: Each node represents a keyword, the size of the node indicates the number of occurrences of the keyword, and the link between
nodes represents the relationship between two keywords.
Analysis of author keyword co-occurrences

Author keyword co-occurrence analysis is utilised to establish the strength of links between different keywords in a
collection of documents. The analysis of author keyword co-occurrences is important as it reveals the research frontiers of a
specific academic domain. To avoid cluttering of keywords in the visualisation of author keywords co-occurrence, a filter was
used to only analyse keywords that appeared in the documents more than five times. The results of the author keywords
co-occurrence are visualised in the network diagram in Fig. 4.
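As a rough illustration of the filtering and linking described above, the sketch below counts keyword co-occurrences and keeps only keywords that appear more than a threshold number of times. It does not reproduce the software used for Fig. 4; the function name `cooccurrence_links`, the case normalisation and the toy data are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_links(keyword_lists, threshold=5):
    """Link strengths between keywords that occur in more than `threshold` documents."""
    # Assumption: normalise case and drop duplicates within a document.
    cleaned = [sorted({k.strip().lower() for k in kws}) for kws in keyword_lists]
    occurrences = Counter(k for kws in cleaned for k in kws)
    kept = {k for k, n in occurrences.items() if n > threshold}
    links = Counter()
    for kws in cleaned:
        # Each pair of retained keywords in the same document adds one to the link.
        for a, b in combinations([k for k in kws if k in kept], 2):
            links[(a, b)] += 1
    return links


# Hypothetical toy data; a low threshold is used so the small sample keeps some keywords.
docs = [
    ["Twitter", "Stock market", "Prediction"],
    ["Twitter", "Stock market"],
    ["Twitter", "Deep learning"],
]
links = cooccurrence_links(docs, threshold=1)
print(links)
```

The default `threshold=5` mirrors the "more than five times" filter; the resulting link counts play the role of line thickness in a co-occurrence network.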
Three clusters can be seen in Fig. 4. There is a significant correlation between the keywords in each cluster. Cluster 1,
which is shown in red in Fig. 4, includes the keywords “Twitter”, “stock market” and “social media”, which have
the most appearances. The other keywords in the first cluster include “investor sentiment”, “behavioural finance”,
“prediction”, “stock returns” and “volatility”. The strongest link between the keywords is between “stock market” and “prediction”,
showing that most of the studies in the sample looked at stock market prediction from sentiment extracted from social
media. “Twitter” and “stock market” are the most frequent keywords and again show the prevalence of the Twitter platform for
sentiment analysis. “Investor sentiment” and “behavioural finance” occur together, and these words are not normally used in
the mathematical sciences and computer science domains but rather in the finance domain. This suggests that articles
identified within Cluster 1 are likely to be studies that try to explain the role of sentiment in the stock market using established
finance theories. Cluster 2 is shown in blue in Fig. 4, and the keywords with the most appearances include
“sentiment analysis”, “stock market prediction” and “machine learning”. As seen by the thickness of the lines between keywords, the
strongest link exists between “sentiment analysis” and “stock markets”. “Sentiment analysis” co-occurs
with such keywords as “machine learning”, “deep learning” and “text mining”, which shows that the cluster mainly contains
studies from the computer science domain, as the emphasis is on the methods of extracting and classifying sentiment from
texts. The third cluster (in green) has three keywords, “tweets”, “stock price” and “data mining”, co-occurring, again
showing that most of the studies use Twitter sentiment specifically to predict stock prices. There are some overlaps amongst
the three clusters, which shows that there are some similarities amongst different research topics. This result confirms that
social media sentiment is a cross-disciplinary research field.
Implications of the results

The findings show that studies on social media sentiment on the African continent are grossly underrepresented, as
only three countries feature in the sample: South Africa with three studies, and Morocco and Tunisia with one
article apiece. This dearth in the literature on the African stock exchanges could be a result of several reasons. First, Africa
still lags behind the developed world in terms of usage of social media and therefore might not be generating adequate
social media interactions to warrant meaningful research. The other reason could be attributed to the lack of access to the
most important seminal work that has been done on the subject in university libraries. We explore the latter using one
of the top-ranked universities in Africa, the University of the Witwatersrand, Johannesburg, South Africa. In terms of the
database used for this study, most of the authors who wrote on Africa were affiliated with this university.
To achieve the objective, we first identified the articles which were cited by the most influential journals. The most
influential journals are here defined as the journals that contributed most of the papers to the sample that formed part of this
study. We then checked whether these articles or books are accessible from the library’s website. The university provides
full access to the top 20 influential journals, dominated by computational and mathematical journals. We explored the
accessibility of important articles to the researchers affiliated with the University of the Witwatersrand with interests in social
media sentiment. For the four articles published at the University, we extracted the references used by the authors, and
references from open-access journals, to see the extent to which researchers have access to subscription-type and pay-wall-type
journal articles. The findings show that from a combined reference list of 145 different sources, 6 were from books and the
remainder were distributed across conference proceedings and peer-reviewed titles. We find that the university provides full
access to all the books in print and digital form as well as 96% of the peer-reviewed articles and conference proceedings.
This compares well with other studies done in developed markets [25]. These findings show that the university library
provides the majority of the most cited and influential journal titles. This could explain why the university has produced the
most articles in the subregion. We, therefore, recommend that African universities put more effort into ensuring that
researchers have access to the most influential journals.
From the study, we note that several proxies of social media sentiment are used, including sentiment extracted from
Twitter, Facebook and Sina-Weibo. Future studies could directly compare some of these proxies to establish
which can be regarded as more accurate proxies of investor sentiment. More studies could also examine how social
media sentiment plays a role in asset pricing. In a new theoretical framework (Sentiment Efficient Markets Hypothesis), Sun
and Zeng [40] show that social media sentiment provides a powerful instrument to interpret financial facts and anomalies
inconsistent with the traditional EMH. Future empirical studies could therefore test this new theoretical framework.
Conclusion
The study aimed to examine the evolution of the concept of “social media sentiment” as it is applied in studies that
seek to explore the dynamics of the stock market. The findings show that the fields of mathematical sciences, as well as
computer sciences, dominate the research space regarding this concept. Most of the studies done on this concept since
the first article appeared in 2011 are concentrated in China, the United States of America, the United Kingdom and several
other European countries. Only 6 of the articles in the database are studies done in Africa. This either shows a dearth of
studies examining the role of social media sentiment on the African continent or reflects that studies done in Africa are
indexed in databases other than Scopus and Web of Science. More recent studies are centred on ways and methods
of extracting sentiment from social media, as seen by the emergence of such author keywords as “natural language
processing”, “machine learning” and “deep learning” in the second half of the sample period used in the study.
In summary, “social media sentiment” in the stock market has many avenues of expansion as seen by permeating different
research domains like physics, mathematical sciences, computer science and finance. China dominates the studies on social
media sentiment in the stock market. However, social media is heavily censored in China.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
References
[1] AlZaabi, S. (2021). Correlating sentiment in reddit’s wallstreetbets with the stock market using machine learning techniques [Rochester Institute of
Technology]. https://scholarworks.rit.edu/theses/11061
[2] W. Antweiler, M.Z. Frank, Is all that talk just noise? The information content of internet stock message boards, J. Finance 59 (3) (2004) 1259–1294,
doi:10.1111/j.1540-6261.2004.00662.x.
[3] M. Baker, J. Wurgler, Investor sentiment and the cross-section of stock returns, J. Finance 61 (4) (2006) 1645–1680,
doi:10.1111/j.1540-6261.2006.00885.x.
[4] M. Baker, J. Wurgler, Investor sentiment in the stock market, J. Econ. Perspect. 21 (2) (2007) 129–151.
[5] E. Bartov, L. Faurel, P.S. Mohanram, Can twitter help predict firm-level earnings and stock returns? Account. Rev. 93 (3) (2018) 25–57,
doi:10.2308/accr-51865.
[6] P. Benckendorff, A. Zehrer, A network analysis of tourism research, Ann. Tourism Res. 43 (2013) 121–149, doi:10.1016/j.annals.2013.04.005.
[7] D.D. Bergh, J. Perry, R. Hanke, Some predictors of SMJ article impact, Strateg. Manag. J. 27 (1) (2006) 81–100, doi:10.1002/smj.504.
[8] J. Bollen, H. Mao, Twitter mood as a stock market predictor, Computer (Long Beach Calif) 44 (10) (2011) 91–94, doi:10.1109/MC.2011.323.
[9] J. Bollen, H. Mao, X. Zeng, Twitter mood predicts the stock market, J. Comput. Sci. 2 (1) (2011) 1–8, doi:10.1016/j.jocs.2010.12.007.
[10] B. Breitmayer, F. Massari, M. Pelster, Swarm intelligence? Stock opinions of the crowd and stock returns, Int. Rev. Econ. Financ. 64 (C) (2019) 443–464.
[11] G.W. Brown, M.T. Cliff, Investor sentiment and the near-term stock market, J. Empirical Finance 11 (1) (2004) 1–27.
[12] Y.-W. Chang, M.-H. Huang, C.-W. Lin, Evolution of research subjects in library and information science based on keyword, bibliographical coupling,
and co-citation analyses, Scientometrics 105 (3) (2015) 2071–2087, doi:10.1007/s11192-015-1762-8.
[13] C. Cuccurullo, M. Aria, F. Sarto, Foundations and trends in performance management. A twenty-five years bibliometric analysis in business and public
administration domains, Scientometrics 108 (2) (2016) 595–611, doi:10.1007/s11192-016-1948-8.
[14] N. Donthu, S. Kumar, D. Mukherjee, N. Pandey, W.M. Lim, How to conduct a bibliometric analysis: an overview and guidelines, J. Bus. Res. 133 (2021)
285–296, doi:10.1016/j.jbusres.2021.04.070.
[15] E.F. Fama, Efficient capital markets: a review of theory and empirical work, J. Finance 25 (2) (1970) 383–417, doi:10.2307/2325486.
[16] R. Fan, J. Zhao, Y. Chen, K. Xu, Anger is more influential than joy: sentiment correlation in Weibo, PLoS One 9 (10) (2014) e110184,
doi:10.1371/journal.pone.0110184.
[17] B. Gan, V. Alexeev, R. Bird, D. Yeung, Sensitivity to sentiment: news vs social media, Int. Rev. Financ. Anal. 67 (2020) 101390,
doi:10.1016/j.irfa.2019.101390.
[18] Gu, M., Teoh, S.H., & Wu, S. (2022). Contagion of investor sentiment in online investment communities: evidence from dynamic visuals on stocktwits
(SSRN Scholarly Paper No. 4110191). doi:10.2139/ssrn.4110191.
[19] Karabulut, Y. (2013). Can facebook predict stock market activity? (SSRN Scholarly Paper No. 2017099). doi:10.2139/ssrn.2017099.
[20] C. Kearney, S. Liu, Textual sentiment in finance: a survey of methods and models, Int. Rev. Financ. Anal. 33 (2014) 171–185,
doi:10.1016/j.irfa.2014.02.006.
[21] A. Keramatfar, H. Amirkhani, Bibliometrics of sentiment analysis literature, J. Inf. Sci. 45 (1) (2019) 3–15, doi:10.1177/0165551518761013.
[22] M. Lachanski, S. Pav, Shy of the character limit: “Twitter mood predicts the stock market” revisited, Econ. J. Watch 14 (3) (2017) 302–345.
[23] C.M.C. Lee, A. Shleifer, R.H. Thaler, Investor sentiment and the closed-end fund puzzle, J. Finance 46 (1) (1991) 75–109,
doi:10.1111/j.1540-6261.1991.tb03746.x.
[24] M.Á. López-Cabarcos, A.M. Pérez-Pico, P. Vázquez-Rodríguez, M.L. López-Pérez, Investor sentiment in the theoretical field of behavioural finance, Econ.
Research-Ekon. Istraživanja 33 (1) (2020) 2101–2119, doi:10.1080/1331677X.2018.1559748.
[25] O. Nigro, J. Johansson, S. Högvik Hansson, Insight into what they cite: a citation analysis of publications at the School of Business, Economics and Law
at the University of Gothenburg, J. Bus. Financ. Librariansh. (2022) 1–27 Ahead-of-print, doi:10.1080/08963568.2022.2044614.
[26] Nyakurukwa, K., & Seetharam, Y. (2022). Does online investor sentiment explain analyst recommendation changes? Evidence from an emerging market.
Managerial Finance, ahead-of-print(ahead-of-print). doi:10.1108/MF-05-2022-0221.
[27] K. Nyakurukwa, Y. Seetharam, The wisdom of the Twitter crowd in the stock market: evidence from a fragile state, Afric. Rev. Econ. Financ. 14 (1)
(2022) 203–228, doi:10.10520/ejc-aref_v14_n1_a7.
[28] N. Oliveira, P. Cortez, N. Areal, The impact of microblogging data for stock market prediction: using Twitter to predict returns, volatility, trading volume
and survey sentiment indices, Expert Syst. Appl. 73 (2017) 125–144, doi:10.1016/j.eswa.2016.12.036.
[29] Pagolu, S., Reddy, K., Panda, G., & Majhi, B. (2016). Sentiment analysis of Twitter data for predicting stock market movements. 1345–1350.
doi:10.1109/SCOPES.2016.7955659.
[30] R. Pranckutė, Web of Science (WoS) and Scopus: the titans of bibliographic information in today’s academic world, Publications 9 (1) (2021) Article 1,
doi:10.3390/publications9010012.
[31] A. Pritchard, Statistical Bibliography or Bibliometrics? J. Document. 25 (4) (1969) 348–349.
[32] G. Ranco, D. Aleksovski, G. Caldarelli, M. Grčar, I. Mozetič, The effects of Twitter sentiment on stock price returns, PLoS One 10 (9) (2015) e0138441,
doi:10.1371/journal.pone.0138441.
[33] E.J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, A. Jaimes, Correlating financial time series with micro-blogging activity, in: Proceedings of the Fifth ACM
International Conference on Web Search and Data Mining, 2012, pp. 513–522, doi:10.1145/2124295.2124358.
[34] Semenova, V., & Winkler, J. (2022). Social contagion and asset prices: reddit’s self-organised bull runs. arXiv. doi:10.48550/arXiv.2104.01847.
[35] Y. Shi, Y. Tang, W. Long, Sentiment contagion analysis of interacting investors: evidence from China’s stock forum, Phys. A 523 (2019) 246–259,
doi:10.1016/j.physa.2019.02.025.
[36] A. Siganos, E. Vagenas-Nanos, P. Verwijmeren, Facebook’s daily sentiment and international stock markets, J. Econ. Behav. Organ. 107 (2014) 730–743,
doi:10.1016/j.jebo.2014.06.004.
[37] J. Smailović, M. Grčar, N. Lavrač, M. Žnidaršič, Stream-based active learning for sentiment analysis in the financial domain, Inf. Sci. (Ny) 285 (2014)
181–203, doi:10.1016/j.ins.2014.04.034.
[38] H. Small, Co-citation in the scientific literature: a new measure of the relationship between two documents, J. Am. Soc. Inf. Sci. 24 (4) (1973) 265–269,
doi:10.1002/asi.4630240406.
[39] T.O. Sprenger, A. Tumasjan, P.G. Sandner, I.M. Welpe, Tweets and trades: the information content of stock microblogs, Eur. Financ. Manag. 20 (5) (2014)
926–957, doi:10.1111/j.1468-036X.2013.12007.x.
[40] Sun, Y., & Zeng, X. (2022). Efficient markets: information or sentiment? (SSRN Scholarly Paper No. 4293484). doi:10.2139/ssrn.4293484.
[41] P.C. Tetlock, Giving content to investor sentiment: the role of media in the stock market, J. Finance 62 (3) (2007) 1139–1168,
doi:10.1111/j.1540-6261.2007.01232.x.
[42] R. Vogel, W.H. Güttel, The Dynamic Capability view in strategic management: a bibliometric review, Int. J. Manag. Rev. 15 (4) (2013) 426–446,
doi:10.1111/ijmr.12000.
[43] G. Wang, G. Yu, X. Shen, The Effect of online investor sentiment on stock movements: an LSTM approach [Research Article], Complexity (2020)
Hindawi, doi:10.1155/2020/4754025.
[44] Y. Yu, W. Duan, Q. Cao, The impact of social and conventional media on firm equity value: a sentiment analysis approach, Decis. Support Syst. 55 (4)
(2013) 919–926, doi:10.1016/j.dss.2012.12.028.
[45] G. Zhou, Measuring investor sentiment, Annu. Rev. Financ. Econ. 10 (2018) 239–259, doi:10.1146/annurev-financial-110217-022725.
[46] M.E. Zweig, An investor expectations stock price predictive model using closed-end fund premiums, J. Finance 28 (1) (1973) 67–78,
doi:10.1111/j.1540-6261.1973.tb01346.x.