
MINI STUDY

Discovery of Theme-Topic Labels for Phishing Email Subject Lines via Zero-Shot Learning and Emotional Model

In this experiment, I attempted to infer possible topics and sentiments for the subject lines of phishing emails (including
our own and commonly reported subject lines [1]). According to [4,5], blacklisted words in email subject lines have been
found to be the most informative features for phishing/benign email classification in terms of information gain. The topic
modeling was carried out with the zero-shot learning (ZSL) scheme provided by the HuggingFace modules. Zero-shot NLP
models built on RoBERTa and BERT can infer how closely a sentence relates to a given set of topic classes. These
classes are supplied on the fly, so no customized training is needed: the above-mentioned transformer NLP models
were already trained on a huge number of documents. The relatedness of the given classes (e.g. "security", "urgency") to the
given sentences (i.e. the subject lines) is computed through the neighborhood of word vectors, long- and short-term relations,
and the attention mechanism. The challenge of this study is to find a discriminative set of classes that best fits the
email subjects and that we, as human beings, can intuitively validate. The email subject lines are provided
below. Our campaigns' subject lines were taken from Lucy System.
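
The labeling step described above can be sketched as follows. This is a minimal illustration, assuming the HuggingFace
`transformers` zero-shot-classification pipeline. The model name `facebook/bart-large-mnli`, the helper names
`build_classifier`, `label_subject`, and `demo`, and the example class set are assumptions for illustration, not necessarily
the configuration used in this study.

```python
from typing import Callable, Dict, List, Tuple


def build_classifier(model_name: str = "facebook/bart-large-mnli") -> Callable:
    """Create a zero-shot classifier (assumed model; downloaded on first use)."""
    # Imported lazily so label_subject() can be exercised without transformers.
    from transformers import pipeline

    return pipeline("zero-shot-classification", model=model_name)


def label_subject(classifier: Callable, subject: str, classes: List[str]) -> Tuple[str, float]:
    """Return the best-matching class and its probability for one subject line."""
    result: Dict = classifier(subject, candidate_labels=classes)
    # The pipeline returns labels and scores sorted by descending score.
    return result["labels"][0], float(result["scores"][0])


def demo() -> None:
    """Hypothetical usage; needs transformers installed and network access."""
    classes = ["security", "account", "meeting", "post", "work", "travel", "joy", "urgency"]
    classifier = build_classifier()
    print(label_subject(classifier, "Change of Password Required Immediately", classes))
```

The classes are passed at call time, which is what makes it possible to try several candidate class sets without retraining.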

I first searched the literature for a paper that could shed light on this problem. Our need was a resource
stating the common concepts, themes, or topics used in the subject lines of phishing emails. To the best of my
knowledge, there is no satisfactory document that directly addresses this. The studies [1,4,5,6] cover
only various common words or samples of frequent subject lines. Thus, the most appropriate concepts or
topics for phishing emails remain open to debate. I therefore intuitively proposed several sets of classes, such as
{security, account, meeting, post, work, travel, joy, urgency} or {security, login, meeting, announcement, marketing, work,
vacation, event, health, urgency}. Furthermore, I employed Plutchik's wheel of emotions [7] as an alternative to
infer the emotion conveyed by a subject line. The Plutchik model of emotions provides a simple, logical way to express
primary and opposite emotions on a polar coordinate system. It comprises 8 basic emotions: joy, trust, fear,
surprise, sadness, anticipation, anger, and disgust, and it organizes them based on the physiological
purpose of each. The wheel also includes combinations of emotions: love, submission, awe, disapproval,
remorse, contempt, aggressiveness, and optimism.

Fig.1 Plutchik’s Wheel of Emotions
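
The structure of the wheel can be sketched as a small data structure. The sketch below assumes the usual layout of
Plutchik's model, in which opposite emotions sit across the wheel and each combined emotion is formed by two adjacent
basic emotions. The names `BASIC`, `DYADS`, `opposite`, and `dyad_components` are illustrative.

```python
from typing import List, Tuple

# Eight basic emotions in their order around the wheel.
BASIC: List[str] = ["joy", "trust", "fear", "surprise",
                    "sadness", "disgust", "anger", "anticipation"]

# Combined emotions; entry i blends BASIC[i] with its clockwise neighbour.
DYADS: List[str] = ["love", "submission", "awe", "disapproval",
                    "remorse", "contempt", "aggressiveness", "optimism"]


def opposite(emotion: str) -> str:
    """Return the emotion located across the wheel (four steps away)."""
    return BASIC[(BASIC.index(emotion) + 4) % len(BASIC)]


def dyad_components(dyad: str) -> Tuple[str, str]:
    """Return the two adjacent basic emotions that form a combined emotion."""
    index = DYADS.index(dyad)
    return BASIC[index], BASIC[(index + 1) % len(BASIC)]
```

Each of these lists can be passed directly as a candidate class set to a zero-shot classifier.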

In this mini-study, I attempted to obtain higher cohesion between the suggested sets of classes and the labels assigned by
zero-shot NLP topic/theme classification. More specifically, the predictions are computed via a softmax function, so each
prediction comes with a probability score. To quantify the above-mentioned "cohesion", I relied on this prediction
probability score at the inference stage. As is known, the zero-shot learning scheme learns a classifier on one set of labels and
then evaluates it on a different set of labels that the classifier has never seen before. For instance, GPT-2 models were used
directly on different downstream tasks such as machine translation without any fine-tuning stage. Similarly,
representations learned in an unsupervised fashion can be used to classify unseen classes based on latent relations among
the words and sequences.
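
To illustrate how a probability score arises from the softmax step, here is a minimal sketch. The class names and the
raw scores in `logits` are made-up values for illustration, not outputs of the models used in this study.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw per-class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


classes = ["security", "urgency", "joy"]
logits = [2.0, 1.0, -1.0]  # hypothetical raw scores for one subject line
probs = softmax(logits)

# The predicted label is the class with the highest probability.
best_label, best_prob = max(zip(classes, probs), key=lambda pair: pair[1])
```

Because the probabilities are normalized over the candidate classes, the top score depends on which class set is supplied, which is why different class sets yield different scores for the same subject line.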

As the dataset, I collected several subject lines that are reported as frequently occurring in [1,6]. In addition,
I added the subject lines of our own campaigns. In total, I collected 138 subject lines, which are given below. I translated
all lines that were in German into English. Next, I created 9 different sets of classes (5 of our own proposal and 4
gathered from Plutchik's wheel of emotions). Here, the objective is to evaluate which class set lets the ZSL model perform
better in terms of a higher prediction probability score. During the experiments, I used 3 different assessment schemes, each
applying a different probability threshold (i.e. 0.5, 0.4, and 0). The higher the probability threshold, the more reliable
the retained labels. My initial results suggest that the set "ecstasy, admiration, terror, amazement, grief, loathing, rage,
vigilance" achieves the best cohesion in terms of acquired probability scores in all 3 settings. Note that these scores currently
do not provide any evidence of a correlation between labels and the difficulty level of campaigns. My
next step will be to explore the relation between the subject line theme/emotion and click rates.
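
The thresholded assessment can be sketched as below. The sketch assumes that the "total score" is the sum of the top-label
probabilities of the subject lines that pass the threshold. That reading, the function name `assess`, and the sample
predictions are assumptions for illustration.

```python
from typing import List, Tuple


def assess(predictions: List[Tuple[str, float]], threshold: float = 0.0) -> Tuple[int, float]:
    """Count predictions whose top score exceeds the threshold and sum those scores."""
    kept = [score for _, score in predictions if score > threshold]
    return len(kept), sum(kept)


# Hypothetical (label, probability) pairs, one per subject line.
sample = [("urgency", 0.72), ("security", 0.45), ("joy", 0.31)]

for threshold in (0.5, 0.4, 0.0):
    count, total = assess(sample, threshold)
    print(f"> {threshold}: {count} out of {len(sample)}, total score: {total:.4f}")
```

With a threshold of 0 every subject line is retained, since softmax probabilities are always positive.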

SOME COMMON EMAIL SUBJECT LINES OF 2020 FROM [1]

"Change of Password Required Immediately",
"Microsoft/Office 365: De-activation of Email in Process",
"Password Check Required Immediately",
"HR: Employees Raises",
"Dropbox: Document Shared With You",
"IT: Scheduled Server Maintenance – No Internet Access",
"Office 365: Change Your Password Immediately",
"Airbnb: New device login",
"Slack: Password Reset for Account",
"SharePoint: Approaching SharePoint Site Storage Limit",
"Microsoft: Anderson Hauck has shared a Whiteboard with you",
"Office 365: Medium-severity alert: Unusual volume of file deletion",
"FedEx: Correct address needed for your package delivery on",
"USPS: Your digital receipt is ready",
"Twitter: Your Twitter account has been locked",
"Google: Please Complete the Required Steps",
"Cash App: Your Account Has Been Closed",
"Coinbase: Important Please Resolve Error Now",
"Would you mind taking a look at this invoice?",
"COVID-19 - Now airborne, Increased community transmission",
"Confidential Information on COVID-19",
"Coronavirus Stimulus Checks",
"Branch/Corporate Reopening Schedule",
"Earn money working from home",
"IT: ATTENTION: Security Violation",
"Ring: Karen has shared a Ring Video with you",
"HR: Company Policy Notification: COVID-19 - Test & Trace Guidelines",
"Chase: Stimulus Funds",

FROM [6]

"Job Opportunity",
"Strategy Meeting",
"What is Chen Guangcheng fighting for?",
"FW: for the extension of the measures against North Korea",
"2012 U.S.Army orders for weapons",
"FW: results of homemaking 2007 annual business plan (min quarter 1 included)",
"DSO-DARPA-BAA-11-65",
"Wage Data 2012",
"U.S.Air Force Procurement Plan 2012",
"About seconded expatriate management in overseas offices",
"FW:[CLASSIFIED] 2012 USA Government of the the Health Reform",
"T.T COPY",
"USA to Provide Declassified FISA Documents",
"FY2011-12 Annual Merit Compensation Guidelines for Staff",
"Contact List Update",
"DOD Technical Cooperation Program",
"DoD Protection of Whistleblowing Spies",
"FW:UK Non Paper on arrangements for the Arms Trade Treaty (ATT) Secretariat",
"Mail delivery failed: returning message to sender",
"Delivery Status Notification (Failure)",
"Become A Paid Mystery Shopper Today! Join and Shop For Free!",
"Re:",
"failure notice",
"Delivery Status Notification (Delay)",
"Returned mail: see transcript for details",
"Get a job as Paid Mystery Shopper! Shop for free and get Paid!",
"Application number: AA700003125331",
"Your package is available for pickup",
"Your statement is ready for your review",
"Unpaid invoice 2913.",
"Track your parcel",
"You have received A Hallmark E-Card!",
"Your Account Opening is completed.",
"Delivery failure",
"Undelivered Mail Returned to Sender",
"Laura would like to be your friend on hi5!",
"You have got a new message on Facebook!",

OUR EMAIL SUBJECT LINES

"Kostenloser Overleaf Account",
"You won 5.- in Swisslotto",
"ZHAW Umfragen ausschalten",
"A flower delivery has failed",
"Google someone hacked your account",
"Gratis Mega Upload Subscription",
"Ihre DropBox ist fast voll",
"Rechnungsversand Zara",
"Urgent: Please check your Account information",
"Migration Outlook E-Mail Account",
"Eingehendes Document auf OneDrive",
"Postfach voll - Bitte bereinigen",
"ZHAW Passwort Reset",
"Gratulation zum Gratiswochenende in Davos!",
"Eventoweb Account Sicherheitsmeldung",
"Zufällige Bewerbung",
"Eingehendes Document auf OneDrive",
"MS Teams - XX hat eine Nachricht gesendet",
"Einladung zur Mitarbeiterbesprechung",
"Valentines Day",
"Gratis DropBox Account for ZHAW Students",
"XX has shared a picture with you on Google",
"An incident was opened for your account",
"Please migrate your OLAT account",
"Google Drive ist fast voll. Jetzt bereinigen",
"Umfrage mit Gewinnspiel",
"Outlook Reminder Sitzungseinladung",
"Bitcoin Fraud",
"Passwort Check nötig",
"Check your recent Activty on Google",
"Netflixaccount - Payment was rejected",
"Your password was leaked",
"Someone Sharded a Photo with you on Google",
"Your Github account has been flagged",
"Instagram - Security Alert",
"Verschlüsselte Nachricht erhalten",
"Erstattung ihres Ticket von Ticketcorner.ch",
"Instagram Friend Request",
"Dein Passwort wurde erfolgreich zurückgesetzt",
"Game: Spot the phishing scam!",
"Google Login Alarm",
"Linkedin Friend Request",
"Stephen Hawking visits ZHAW",
"Outlook: Re-Authenticate to get latest E-Mails",
"OneDrive Dateifreigabe",
"Swiss National Football Team visits ZHAW",
"Zwei Faktor Authentisierung wurde deaktiviert",
"Google Document Invation",
"ZHAW Human Resources - Ferienzeitanpassung",
"MS Teams Einladung Klassenteam",
"Grusskarte zum Geburtstag",
"Gratis Netflix Account",
"Abschaltung der Multifaktorauthentisierung",
"Warning! Security Violation detected",
"Bitcoin - Trade with a 500 USD starting balance",
"Test our new virus online scanner now!",
"Eingehendes Document auf OneDrive",
"Need for a Tutor",
"Ihre Rechnung vom XX.XX.2020",
"Einladung zum Feierabendbier",
"Your membership account has been created",
"Gratis Spotify Abo",
"Newsletter Abstellen",
"Zahlungsbestätigung ZHAW",
"Facebook - Dein Freund hat dich markiert",
"Verifizieren Sie ihre E-Mail Addresse",
"Pressemeldung Data Breach Nachricht",
"A new class was opened for you on Moodle",
"Fake Conversation of other ZHAW Staff",
"MS Teams Einladung zum Team",
"Persönliche Office 365 Lizenz abgelaufen",
"A friend Shared a Meme with you on Pinterest!",
"Willkommensgeschenk neuer Mitarbeiter"

RESULTS

An asterisk (*) marks the class set with the highest total score in each setting.

WITHOUT ANY DETECTION THRESHOLD PROBABILITY

security,login,meeting,delivery,health,work,vacation,fun,health,urgency: 138 out of 138, total score: 62.4693
security,login,meeting,announcement,marketing,work,vacation,event,health,urgency: 138 out of 138, total score: 58.5730
security,meeting,announcement,trade,work,vacation,event,urgency: 138 out of 138, total score: 63.0540
security,login,meeting,post,work,travel,fun,urgency: 138 out of 138, total score: 59.7602
security,account,meeting,post,work,travel,joy,urgency: 138 out of 138, total score: 61.89378
warning,login,meeting,post,work,travel,fun,urgency: 138 out of 138, total score: 60.41261
joy,trust,fear,surprise,sadness,disgust,anger,anticipation: 138 out of 138, total score: 56.83123
ectasy,admiration,terror,amazement,grief,loathing,rage,vigilance: 138 out of 138, total score: 66.9755 *
serenity,acceptance,apprehension,distraction,pensiveness,boredom,annoyance,interest: 138 out of 138, total score: 53.7817
love,submission,awe,disapproval,remorse,contempt,aggresiveness,optimism: 138 out of 138, total score: 58.07690

THRESHOLDED DETECTIONS >0.5 (probability)

security,login,meeting,delivery,health,work,vacation,fun,health,urgency: 45 out of 138, total score: 30.88856
security,login,meeting,announcement,marketing,work,vacation,event,health,urgency: 36 out of 138, total score: 23.03357
security,meeting,announcement,trade,work,vacation,event,urgency: 46 out of 138, total score: 29.28598
security,login,meeting,post,work,travel,fun,urgency: 34 out of 138, total score: 22.28335
security,account,meeting,post,work,travel,joy,urgency: 46 out of 138, total score: 29.06959
warning,login,meeting,post,work,travel,fun,urgency: 44 out of 138, total score: 27.97123
joy,trust,fear,surprise,sadness,disgust,anger,anticipation: 31 out of 138, total score: 20.45792
ectasy,admiration,terror,amazement,grief,loathing,rage,vigilance: 55 out of 138, total score: 38.52926 *
serenity,acceptance,apprehension,distraction,pensiveness,boredom,annoyance,interest: 28 out of 138, total score: 16.7191
love,submission,awe,disapproval,remorse,contempt,aggresiveness,optimism: 30 out of 138, total score: 20.61194

THRESHOLDED DETECTIONS >0.4 (probability)

security,login,meeting,delivery,health,work,vacation,fun,health,urgency: 71 out of 138, total score: 42.54702
security,login,meeting,announcement,marketing,work,vacation,event,health,urgency: 62 out of 138, total score: 34.54367
security,meeting,announcement,trade,work,vacation,event,urgency: 79 out of 138, total score: 43.91945
security,login,meeting,post,work,travel,fun,urgency: 74 out of 138, total score: 40.08870
security,account,meeting,post,work,travel,joy,urgency: 75 out of 138, total score: 41.81558
warning,login,meeting,post,work,travel,fun,urgency: 70 out of 138, total score: 39.45468
joy,trust,fear,surprise,sadness,disgust,anger,anticipation: 58 out of 138, total score: 32.54473
ectasy,admiration,terror,amazement,grief,loathing,rage,vigilance: 75 out of 138, total score: 47.62525 *
serenity,acceptance,apprehension,distraction,pensiveness,boredom,annoyance,interest: 47 out of 138, total score: 25.00121
love,submission,awe,disapproval,remorse,contempt,aggresiveness,optimism: 57 out of 138, total score: 32.77099

REFERENCES

[1] https://www.techrepublic.com/article/these-subject-lines-are-the-most-clicked-for-phishing/

[2] Wallach, H. M. (2007). Generating Summary Keywords for Emails Using Topics.

[3] Bergholz, A., De Beer, J., Glahn, S., Moens, M.-F., Paaß, G., and Strobel, S. (2010). New filtering approaches for
phishing email. Journal of Computer Security, 18(1), 7-35. IOS Press.

[4] Hamid, I. R. A., and Abawajy, J. (2011). Hybrid feature selection for phishing email detection. In International
Conference on Algorithms and Architectures for Parallel Processing, pp. 266-275. Springer.

[5] Ma, L., Ofoghi, B., Watters, P., and Brown, S. (2009). Detecting phishing emails using hybrid features. In 2009 Symposia
and Workshops on Ubiquitous, Autonomic and Trusted Computing, pp. 493-497. IEEE.

[6] Dewan, P., Kashyap, A., and Kumaraguru, P. (2014). Analyzing social and stylometric features to identify spear phishing
emails. In 2014 APWG Symposium on Electronic Crime Research (eCrime), pp. 1-13. IEEE.

[7] Ekman, P. (1992). An argument for basic emotions. Cognition & Emotion, 6(3-4), 169-200. Taylor & Francis.
