Zoeynull - 1500 - Big Data Analytics - Jisha - Jisha - 1may

Sentiment Analysis of Twitter Users on Data Breaching in Different Organizations
(Student’s anonymous Z code here)

Abstract—Data breaches are a persistent threat to businesses of all sizes. Even if the types of breaches differ, the implications are the same. This research examines data breaches that have been made public and have resulted in the loss of a large number of individual records, as well as significant financial and technical implications. As a result, data breaches are one of the most typical problems for every business that interacts with the public. Furthermore, because cyber breaches are the most serious, this research concentrates on revealing the targeted firms, such as Marriott, Chegg, T-Mobile, and Under Armour, as well as tracking how hackers' interests change over time. The goal of this research is to recognise and analyse emotions expressed by individuals in tweets in order to make recommendations following a data breach. This research developed a dataset comprising text, user, and other information by collecting tweets and replies on selected themes. The dataset was also used to analyze emotions in tweets and identify whether users' tweets were positive, negative, or neutral.

Keywords— Data Breach, Unsupervised Sentiment Analysis, Hacking Breaches, Organizations, word2vec.

I. INTRODUCTION
Data breaches are defined as illegal or unintentional disclosures by businesses that result in exposure of users' personally identifiable information (PII), such as social security numbers (SSNs) and credit/debit card data. A data breach can have serious ramifications for a company, including a loss in market value and large penalty fines. Firms must build a better user privacy programme to overcome a security breach and limit its negative implications, both as a responsibility check and as a preventive approach against similar events.
This paper covers data breaches that have occurred within businesses and a sentiment analysis of the tweets individuals posted after the various organizations were breached. The main method used here to discriminate between positive and negative attitudes in text is unsupervised. The key idea of unsupervised learning is that no prior assumptions or constraints regarding the output are provided to the model; instead, researchers simply enter the preprocessed data and let the model learn from the data by itself. This is very useful here because the dataset provided is unlabeled.

II. LITERATURE REVIEW
As technology progresses and information storage becomes digital, organisations are swiftly changing
from a sourcing model to a value-based one, while keeping client wants in mind. Businesses are known to gather and analyse personally identifiable information (PII) from customers, like purchasing patterns, surfing habits, credit card details, and social security numbers (SSNs), and then use this data to present customers with personalized promotions [1]. Consumers' privacy is affected by data breaches, which has a massive effect on their confidentiality decisions. Consumers who had their personal information stolen expressed their dissatisfaction with the compromised firm, citing fears that their privacy had been abused [2]. According to research, as a result of their lack of faith in the compromised firm, these affected clients switched to competing organisations (Choi et al., 2016). According to research conducted by the Philippine Institute, data breaches cost impacted organisations around $3 million and resulted in a 5% drop in stock prices, as well as users ending partnerships with the companies involved and shifting to those with superior security measures. In brief, the exploitation of users' PII and the subsequent data breach foretold major implications for the companies involved [3].
According to scientific studies, when the data breach was imminent, financial markets responded negatively, lowering the valuation of the breached firm and, as a result, the wealth of its shareholders [4]. Businesses face considerable challenges as a result of data breaches or the prospect of a data breach. Illegal access to confidential data or the unintentional disclosure of confidential information could have catastrophic consequences. Fines may be enforced as a consequence of compliance issues, customer legal action, increased security costs, and a lack of consumer faith [5]. According to a recent poll, the cost of a data breach is continuing to rise. When it was discovered that sensitive data had been lost, organisations were hit with hefty fines. According to Hovav & D'Arcy et al. [6], businesses responded significantly to breach notices depending on whether or not the business was an e-tailer. After announcing a data breach, businesses lose 2.1 percent of their market share [7]. In their investigation, Shankar and Mohammed (2020) discovered two potentially dangerous data breaches at ChoicePoint and TJX. They believe that when developing organisational privacy policies, companies should think about ethical duties [8].

III. METHODOLOGY
For this analysis, the researchers took the Twitter data of individuals relating to the data breaches at four organizations: (1) the Chegg data breach, (2) the Marriott data breach, (3) the Under Armour data breach, and (4) the T-Mobile data breach.
On the dataset, the researchers used Word2vec, tfidf weighting, and KMeans clustering. Unsupervised sentiment analysis was performed using word embeddings learned for the current dataset with gensim's Word2Vec implementation.

A. Chegg Data Breach
Chegg informed the SEC of a security issue
affecting more than 40 million users. The attacker was able to obtain the email IDs, names, passwords, and shipping addresses of the users.

B. Marriott Data Breach
Hackers had access to many of Marriott's hotel reservation systems for four years, disclosing the travel plans, user names, contact numbers, email addresses, passport numbers, dates of birth, and gender of 500 million people. Some victims' payment card information and expiration dates were also stolen.

C. T-Mobile Data Breach
According to T-Mobile, more than 2 million people's information may have been accessed. T-Mobile notified affected customers via text message that hackers had acquired access to user names, ZIP codes, contact numbers, email IDs, account types, and account data.

D. Under Armour Data Breach
Under Armour notified customers in 2018 that a MyFitnessPal data breach had occurred, impacting 150 million accounts. Under Armour lost no time in informing authorities and customers. Cybersecurity experts secured the app and remain on the lookout for any strange or suspicious activity. They also required every user to update their password.

IV. RESULTS AND ANALYSIS
The Word2Vec method, KMeans clustering, and tfidf weighting were used to analyse the dataset.
Unsupervised sentiment analysis was performed using word embeddings trained on the provided dataset with gensim's Word2Vec implementation, and the results are shown below. The main steps were as follows. First, negative and positive clusters were detected in word vector space using sklearn's implementation of the KMeans clustering method; these clusters were then used to convert each sentence into a vector of sentiment scores, one for every word or phrase. A second vector for a given sentence was created by replacing all of the terms in the sentence with their associated tfidf scores. For each sentence, the resulting prediction was generated as the dot product of these two vectors: if the dot product was positive, the overall sentiment was assumed positive; if it was negative, the overall sentiment was considered negative.

A. Chegg Data
Table 1. Confusion Matrix for Chegg Tweets
Table 2. Model Scores

B. Marriott Data
Table 3. Confusion Matrix for Marriott Tweets
Table 4. Model Score

C. T-Mobile Data
Table 5. Confusion Matrix for T-Mobile Tweets
Table 6. Model Score

D. Under Armour Data
Table 7. Confusion Matrix for Under Armour Tweets
Table 8. Model Score

Since the categories in the dataset were considerably imbalanced, the chosen metrics for measuring model performance were precision, recall, and F-score; nevertheless, the dataset was so skewed that this study should arguably have used a metric that penalised this imbalance even more. The model had a precision of 0.99 for almost all of the businesses, indicating that it was quite good at discriminating negative sentiment data (the model almost never confused negative observations with positive observations). One could argue that this was to be expected, since there were few negative observations and they probably differed considerably from the others, and this is partly correct. Even so, the models also attained nearly 80 percent recall in almost all of the organisations (meaning that about 80 percent of all positive observations in the sample were correctly identified as positive), which suggests that the model also learned something and did not simply divide the data in half.
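The precision, recall, and F-score discussed in this section follow directly from the counts in a confusion matrix. The following is a minimal sketch, assuming a binary positive/negative labelling; the function name `precision_recall_f1` and the example counts are hypothetical and are not the values reported in the tables of this study.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class.

    tp: positives correctly predicted as positive
    fp: negatives wrongly predicted as positive
    fn: positives wrongly predicted as negative
    """
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts chosen only to illustrate a high-precision, ~80%-recall case.
precision, recall, f1 = precision_recall_f1(tp=80, fp=1, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With counts like these, precision stays close to 1 because almost no negative tweet is labelled positive, while recall shows how many positive tweets were actually recovered. This is why the two values are read together on a skewed dataset.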

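The sentence-level scoring rule described in Section IV (a vector of per-word sentiment scores combined with a vector of tfidf weights through a dot product) can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the dictionaries `word_sentiment` and `tfidf_weight`, their values, and the functions `score_sentence` and `label` are hypothetical stand-ins for the scores that would be derived from the KMeans clusters in Word2Vec space and from the tfidf weighting. Labelling a score of exactly zero as neutral is also an assumption.

```python
from typing import Dict, List

# Hypothetical per-word sentiment scores (sign = cluster, magnitude = closeness to centroid).
word_sentiment: Dict[str, float] = {
    "breach": -1.0, "stolen": -0.8, "thanks": 0.9, "fixed": 0.7,
}
# Hypothetical tfidf weights for the same vocabulary.
tfidf_weight: Dict[str, float] = {
    "breach": 1.2, "stolen": 2.0, "thanks": 1.5, "fixed": 1.8,
}


def score_sentence(tokens: List[str],
                   sentiment: Dict[str, float],
                   tfidf: Dict[str, float]) -> float:
    """Dot product of the sentence's sentiment vector and its tfidf vector."""
    # Words missing from either table contribute zero to the score.
    sentiment_vec = [sentiment.get(t, 0.0) for t in tokens]
    tfidf_vec = [tfidf.get(t, 0.0) for t in tokens]
    return sum(s * w for s, w in zip(sentiment_vec, tfidf_vec))


def label(score: float) -> str:
    """Map a sentence score to a sentiment label (zero treated as neutral by assumption)."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for tokens in (["my", "data", "stolen", "breach"], ["thanks", "fixed"]):
    s = score_sentence(tokens, word_sentiment, tfidf_weight)
    print(tokens, label(s))
```

Weighting by tfidf lets rare, informative words dominate the sentence score, so a single strongly negative term can outweigh several common words with weak sentiment.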
V. CONCLUSION
The researchers investigated individual tweets following data breaches at several firms and observed the impact of information breaches on the organisation as a whole. Data in people's Twitter posts was used to detect and analyse their sentiment. The tweets and replies on a few key issues were compiled into a dataset that includes text, user, and sentiment information, among other things. The dataset was then used to detect sentiment in tweets and replies, as well as to calculate model scores based on a variety of user- and tweet-based criteria. Word2vec, tfidf weighting, and KMeans clustering were used on the dataset. Word embeddings generated for the current dataset were used to perform unsupervised sentiment analysis with gensim's Word2Vec implementation. The researchers observed that the unsupervised approach produced quite impressive results: its scores were much greater than would be expected by chance, without the use of any pre-trained models and with no prior negative or positive labels provided in the dataset.

ACKNOWLEDGMENT
This research was partially supported by Mr.s Reezu Nandi. I thank my colleagues from PaperPedia who provided their insight in this research, although they may not agree with all of the interpretations/conclusions of this paper.

REFERENCES
[1] Ayyagari, R., 2012. An exploratory analysis of data breaches from 2005-2011: Trends and insights. Journal of Information Privacy and Security, 8(2), pp.33-56.
[2] Bansal, G. and Zahedi, F.M., 2015. Trust violation and repair: The information privacy perspective. Decision Support Systems, 71, pp.62-77.
[3] Garrison, C.P. and Ncube, M., 2011. A longitudinal analysis of data breaches. Information Management & Computer Security.
[4] Goode, S., Hoehle, H., Venkatesh, V. and Brown, S.A., 2017. User compensation as a data breach recovery action: An investigation of the Sony PlayStation network breach. MIS Quarterly, 41(3), pp.703-727.
[5] Juma'h, A.H. and Alnsour, Y., 2020. The effect of data breaches on company performance. International Journal of Accounting & Information Management.
[6] Noor, U., Anwar, Z., Malik, A.W., Khan, S. and Saleem, S., 2019. A machine learning framework for investigating data breaches based on semantic analysis of adversary's attack patterns in threat intelligence repositories. Future Generation Computer Systems, 95, pp.467-487.
[7] Sailunaz, K. and Alhajj, R., 2019. Emotion and sentiment analysis from Twitter text. Journal of Computational Science, 36, p.101003.
[8] Shankar, N. and Mohammed, Z., 2020. Surviving data breaches: A multiple case study analysis. Journal of Comparative International Management, 23(1), pp.35-54.
