
Author profiling

Author profiling is the analysis of a given set of texts in an
attempt to uncover various characteristics of the author
based on stylistic and content-based features.
Characteristics analysed commonly include age and gender,
though more recent studies have looked at other
characteristics like personality traits and occupation.[1]

[Image: Thomas Corwin Mendenhall (PSM V37 D594)]

Author profiling is one of the three
major fields in Automatic
Authorship Identification (AAI), the other two being
authorship attribution and authorship identification. The
process of AAI emerged at the end of the 19th century.
Thomas Corwin Mendenhall, an American autodidact
physicist and meteorologist, was the first to apply this
process to the works of Francis Bacon, William Shakespeare,
and Christopher Marlowe. From these three historic figures,
Mendenhall sought to uncover their quantitative stylistic
differences by inspecting word lengths. [2]
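
Mendenhall's word-length comparison can be illustrated with a
short sketch. The function name and the sample sentence below
are illustrative assumptions, not taken from his study; the
code only shows how a relative word-length distribution for a
text might be computed.

```python
import re
from collections import Counter


def word_length_spectrum(text):
    """Return the relative frequency of each word length in `text`."""
    # Treat any run of ASCII letters as a word (a simplifying assumption).
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(len(word) for word in words)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}


# Illustrative sample sentence, chosen only for demonstration.
sample = "To be or not to be that is the question"
spectrum = word_length_spectrum(sample)
```

Spectra computed in this way for two bodies of text can then
be compared length by length.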

Although much progress has been made in the 21st century,
the task of author profiling remains an unsolved problem due
to its difficulty.

Contents
1 Techniques
2 Author profiling and the Internet
2.1 Social media
2.1.1 Facebook
2.1.2 Weibo
2.1.3 Chat logs
2.1.4 Blogs
2.2 Email
3 Applications
3.1 Forensic linguistics
3.2 Bot detection
3.3 Marketing
3.4 Literary works
3.5 Library cataloguing
4 In popular culture
5 See also
6 References

Techniques
Through the analysis of texts, various author profiling
techniques can be applied to predict information about the
author. For example, function words, as well as part-of-
speech analysis, can be used to predict the author's gender
and to assess the truthfulness of a text.[3]

The process of author profiling usually involves the following
steps:[4]

1. Identifying specific features to be extracted from the
text
2. Building an adopted, standard representation (e.g.
Bag-of-Words model) for the target profile
3. Building a classification model using a standard
classifier (e.g. Support Vector Machines) for the target
profile
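
The three steps can be sketched in a few lines of Python. This
is a minimal illustration under stated assumptions: it uses a
multinomial Naive Bayes classifier rather than a Support
Vector Machine so that it needs only the standard library, and
the class name, training texts and age labels are invented for
the example.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(text):
    """Step 1: use lowercase word tokens as the features."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Step 2: represent a text as token counts (Bag-of-Words)."""
    return Counter(extract_features(text))


class NaiveBayesProfiler:
    """Step 3: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        bow = bag_of_words(text)
        n_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_docs.items():
            total = sum(self.word_counts[label].values())
            score = math.log(doc_count / n_docs)  # log prior
            for word, count in bow.items():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                p = (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data; real studies train on large labelled corpora.
train_texts = [
    "homework exam school lol",
    "school lol cant wait for the weekend",
    "mortgage meeting office deadline",
    "office commute mortgage and the kids",
]
train_labels = ["younger", "younger", "older", "older"]
model = NaiveBayesProfiler().fit(train_texts, train_labels)
prediction = model.predict("exam at school tomorrow lol")
```

With this toy data the model labels the last text "younger",
because its known words occur only in the "younger" training
texts.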

Machine learning algorithms for author profiling have
become increasingly complex over time. Algorithms used in
author profiling include:

Support Vector Machines [5]
Naive Bayes Classifiers [5]
Deep averaging networks,[6] which average the word
embeddings of a text and pass the result through several
feed-forward layers [7]
Long Short-Term Memory networks [8]
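
As an illustration of the deep averaging idea, the sketch
below computes the mean of a text's word embeddings and passes
it through one hidden layer and a softmax output. The
two-dimensional embeddings, the weights and all function names
are hypothetical; a real network learns these values from
training data.

```python
import math

# Hypothetical 2-dimensional embeddings; real systems learn these.
EMBEDDINGS = {
    "great": [1.0, 0.0],
    "game": [0.5, 0.5],
    "meeting": [0.0, 1.0],
}


def mean_embedding(tokens):
    """Average the embeddings of all known tokens."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dense(x, weights, bias):
    """One fully connected layer: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dan_forward(tokens, w1, b1, w2, b2):
    """Mean embedding -> hidden layer -> class probabilities."""
    hidden = relu(dense(mean_embedding(tokens), w1, b1))
    return softmax(dense(hidden, w2, b2))


# Identity weights, used here only to keep the example easy to follow.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
probs = dan_forward(["great", "game"], identity, zeros, identity, zeros)
```

Because the embeddings are averaged before any layer is
applied, word order plays no role in the prediction.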

In the past, author profiling was limited to physical
documents, often in the form of books and newspaper
articles. Different combinations of the authors' textual
attributes, including lexical and syntactic features, were
identified and analysed.[4] Pioneering research in author
profiling focused mostly on a single genre until the shift
towards author profiling on social media and the Internet.[9]
While attributes such as content words and POS tags are
effective in author profile predictions on physical
documents, their effectiveness on digital texts varies with
the type of online content being analysed.[4]

With the advances in technology, author profiling on the
Internet has become increasingly common. Digital texts,
such as social media posts, blog posts and emails, are now
being used.[4] This has sparked greater research efforts
because of the advantages analysing digital texts can bring
to sectors like marketing and business.[8] Author profiling on
digital texts has also enabled predictions of a wider range of
author characteristics such as personality,[8] income and
occupation.[10]

The most effective attributes for author profiling on digital
texts involve a combination of stylistic and content
features.[4] Author profiling on digital texts focuses on
cross-genre author profiling, whereby one genre is used for
training data and another genre is used for testing data,
though both need to be relatively similar for good results.[9]

There are some problems[4] when applying author profiling
techniques to online texts. These problems include:

Wide variation in lengths of texts used
Class imbalance in data
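
One common way to handle class imbalance, shown here only as
an illustrative assumption and not as the method used in the
cited work, is to weight each class inversely to its frequency
so that rare classes count more during training. The function
name and the age-group labels are invented for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Invented, deliberately skewed age-group labels.
labels = ["18-24"] * 6 + ["25-34"] * 3 + ["50+"] * 1
weights = balanced_class_weights(labels)
```

Here the rarest group ("50+") receives the largest weight and
the most frequent group ("18-24") the smallest.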

Author profiling and the Internet


The rise of the Internet in the late 20th and early 21st
centuries catalysed an increase in author profiling research,
since data could be mined from the web, including social
media platforms, emails and blogs. Content from the web has
been analysed in author profiling tasks to identify the age,
gender, geographic origins, nationality and psychometric
traits of web users. The information obtained has been used
to serve various applications, including marketing and
forensics.

Social media

The increased integration of social media in people’s daily
lives has made it a rich source of textual data for author
profiling. This is mainly because users frequently upload and
share content for various purposes including self-expression,
socialisation, and personal businesses. Social bots are also
a frequent feature of social media platforms, especially
Twitter, generating content that may be analysed for author
profiling.[11] While different platforms contain similar data,
they may also contain different features depending on the
format and structure of the particular platform.

There are still limitations in using social media as a data
source for author profiling, because the data obtained may
not always be reliable or accurate. Users sometimes provide
false information about themselves or withhold
information.[12] As a result, the training of algorithms for
author profiling may be impeded by data that is less
accurate. Another limitation is the irregularity of text in
social media. Features of irregularity include deviations
from normal linguistic standards, such as spelling errors,
unstandardised transliteration (as with the substitution of
letters with numbers), shorthand and user-created
abbreviations for phrases, all of which may pose a challenge
to author profiling.[13] Researchers have adopted methods to
overcome these limitations in training their algorithms for
author profiling.[13]
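
One simple way of reducing such irregularity before feature
extraction is to normalise the text. The sketch below is an
illustrative assumption, not the method of the cited study:
the abbreviation table is hypothetical, and the function only
lowercases the text, shortens long runs of repeated characters
and expands a few known shorthands.

```python
import re

# Hypothetical lookup table; a real system would need a far larger one.
ABBREVIATIONS = {
    "u": "you",
    "gr8": "great",
    "2moro": "tomorrow",
    "idk": "i do not know",
}


def normalise(text):
    """Lowercase, squeeze character runs and expand known shorthands."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    cleaned = []
    for token in tokens:
        # Reduce runs of three or more identical characters to two.
        token = re.sub(r"(.)\1{2,}", r"\1\1", token)
        cleaned.append(ABBREVIATIONS.get(token, token))
    return " ".join(cleaned)


normalised = normalise("C U 2moro, that was gr8!!! soooo good")
```

Normalisation of this kind trades away some stylistic signal,
since the irregularities themselves can be characteristic of
an author.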

Facebook

Facebook is useful for author profiling studies as a social
networking service. This is because of how a social network
may be built, expanded, and used for social action on the
site.[14] In such processes, users share personal content
that may be used for author profiling studies. Textual data
is obtained from users’ personal posts, such as ‘status
updates’.[15] These are collected to produce a corpus in the
selected language(s), creating either a bilingual or
multilingual database of content words,[15][16] which may
then be used for author profiling.

In the context of Facebook, author profiling mainly involves
English textual data, but also uses non-English languages,
including Roman Urdu, Arabic, Brazilian Portuguese and
Spanish.[16][11] While author profiling studies on Facebook
have been predominantly for gender and age-group
identification, there have been attempts to derive attributes
to predict religiosity, the IT background of users, and even
basic emotions (as defined by Paul Ekman), among
others.[15][17]

Weibo

Sina Weibo is one of the few Asian social media platforms
containing texts in Asian languages to have been analysed
for author profiling. The primary content of focus for author
profiling on Weibo includes classical Chinese characters,
hashtags, emoticons, kaomoji, homogeneous punctuation,
Latin sequences (due to the multilingualism of text) and
even poetic formats. Particularly popular Chinese
expressions, POS tags and word types are also tracked for
author profiling.[18]

Author profiling for Weibo content requires algorithms
different from those utilised for other social media platforms,
mainly due to the linguistic differences between Mandarin
Chinese and Western languages. For example, Chinese
emoticons involve Chinese characters describing the gesture
or facial expression in brackets, such as [哈哈] ‘laughter’,
[泪] ‘tears’, [偷笑] ‘giggle’, [爱你] ‘love’ and [心]
‘heart’.[18] This differs from the use of punctuation symbols
for emoticons in Western languages, or the common use of
Unicode emojis on other platforms such as Facebook and
Instagram. Further, while there are around 161 Western
emoticons, there are around 2900 emoticons regularly used
in Mainland China for web content, as on Weibo.[19] To
tackle these differences, author profiling algorithms have
been trained on Chinese emoticons and linguistic features.
For example, author profiling algorithms have been designed
to detect Chinese stylistic expressions expressing formality
and sentiment, in place of algorithms detecting English
linguistic features such as capital letters.[19]
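
The bracketed emoticons and Latin sequences described above
can be picked out with regular expressions. The sketch below
is a minimal illustration; the pattern, the length limit of
eight characters, the function names and the sample post are
assumptions made for the example, not details from the cited
studies.

```python
import re
from collections import Counter

# Assumed pattern: one to eight non-space characters inside square brackets.
BRACKET_EMOTICON = re.compile(r"\[([^\[\]\s]{1,8})\]")


def emoticon_counts(post):
    """Count each bracketed emoticon label in a post."""
    return Counter(BRACKET_EMOTICON.findall(post))


def count_latin_sequences(post):
    """Count runs of Latin letters, a rough sign of mixed-language text."""
    return len(re.findall(r"[A-Za-z]+", post))


# Invented sample post.
post = "今天考试结束了[哈哈][哈哈] so happy [爱你]"
counts = emoticon_counts(post)
latin = count_latin_sequences(post)
```

Counts of this kind can then be added to the feature set
alongside word and POS features.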

Compared to other more popular, globalised platforms,
texts on Weibo are not as commonly used in the task of
author profiling. This is likely because Weibo's user base
is concentrated in Mainland China, limiting its usage to
predominantly Chinese nationals. Studies done for this
platform have utilised bots and machine learning algorithms
to identify authors’ age and gender. Data is acquired from
the Weibo microblog posts of willing participants and used
to train algorithms that build concept-based profiles of
users to a certain accuracy.[18]

Chat logs

Chat logs have been studied for author profiling as they
include much textual discourse, the analysis of which has
contributed to applied studies including social trends
and forensic science. Sources of data for author profiling
from chat logs include platforms such as Yahoo!, AIM and
WhatsApp.[20] Computational systems have been devised to
produce concept-based profiles listing chat topics discussed
in a single chat room or by independent users.[21]

Blogs

Author profiling can be used to identify characteristics of
blog writers, such as their age, gender and geographical
location, based on their different writing styles.[22] This is
especially useful when it comes to anonymous blogs. The
choice of content words, style-based features and topic-
based features are analysed in order to discover
characteristics of the author.[23]

In general, features that frequently occur in blogs include
a high distribution of verbs per text and a relatively high
use of pronouns. The frequencies of verbs, pronouns and
other word classes are used to profile and classify emotions
in the writings of authors, as well as their gender and age.[24]
Classification models that were used for author profiling on
physical documents in the past, such as Support Vector
Machines, have also been tested on blogs. However, they
have been found to be unsuitable for the latter due to their
low performance.[22]
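
A word-class frequency of this kind can be computed directly
when the class is a closed set of words. The sketch below
measures the share of pronouns in an English text; the pronoun
list is deliberately short and is an assumption for the
example. Counting verbs would additionally need a
part-of-speech tagger, which is not shown here.

```python
import re

# A small, incomplete list of English pronouns (illustrative only).
PRONOUNS = {
    "i", "me", "my", "mine", "myself", "you", "your", "yours",
    "he", "him", "his", "she", "her", "hers", "it", "its",
    "we", "us", "our", "ours", "they", "them", "their", "theirs",
}


def pronoun_rate(text):
    """Return the fraction of tokens that are pronouns."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in PRONOUNS for token in tokens) / len(tokens)


rate = pronoun_rate("I told her that we love our new blog")
```

In the sample sentence four of the nine tokens are pronouns.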

The machine learning algorithms that work well for author
profiling on blogs[22] include:

Instance-based learning
Random Decision Forests

Email

Email has been a consistent focus for author profiling due to
the rich textual data that can be found in various sections of
a typical emailing platform. These sections include the sent,
inbox, spam, trash, and archived folders.[25] Multilingual
approaches to author profiling for emails have included
English, Spanish, and Arabic emails as data sources, among
others.[25][12] Through author profiling, details of email users
may be identified, such as their age, gender, geographical
origin, level of education, nationality and even psychometric
traits of personality, which include neuroticism,
agreeableness, conscientiousness and extraversion (as
opposed to introversion) from the Big Five personality
traits.[citation needed]

In author profiling for email, content is processed for
important textual data, while unimportant features such as
metadata and other HyperText Markup Language (HTML)
redundancies are excluded. Important parts of the
Multipurpose Internet Mail Extensions (MIME) that contain
content of the emails are also included in the analysis.
Obtained data is often parsed into various sections of
content, including author text, signature text, advertisement,
quoted text, and reply lines.[25] Further analysis of email
textual content in author profiling tasks involves the
extraction of tone of voice, sentiment, semantics and other
linguistic features to be processed.
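
The parsing of an email into sections can be sketched with
Python's standard email package. The splitting rules below are
simple heuristics assumed for illustration (lines starting
with ">" as quoted text, a line ending in "wrote:" as a reply
line, and everything after a "--" line as the signature); they
are not the rules of the cited system, and the sample message
is invented.

```python
from email import message_from_string

# Invented sample message.
RAW = "\n".join([
    "From: alice@example.com",
    "To: bob@example.com",
    "Subject: Re: lunch",
    "MIME-Version: 1.0",
    'Content-Type: text/plain; charset="utf-8"',
    "",
    "Sounds good, see you at noon.",
    "",
    "On Monday, Bob wrote:",
    "> Lunch tomorrow?",
    "--",
    "Alice",
    "",
])


def split_sections(raw):
    """Split the plain-text MIME part into rough content sections."""
    message = message_from_string(raw)
    part = next(p for p in message.walk()
                if p.get_content_type() == "text/plain")
    sections = {"author": [], "reply_line": [], "quoted": [], "signature": []}
    in_signature = False
    for line in part.get_payload().splitlines():
        if line.rstrip() == "--":
            in_signature = True  # assumed signature delimiter
        elif in_signature:
            sections["signature"].append(line)
        elif line.startswith(">"):
            sections["quoted"].append(line.lstrip("> "))
        elif line.rstrip().endswith("wrote:"):
            sections["reply_line"].append(line)
        elif line.strip():
            sections["author"].append(line)
    return sections


sections = split_sections(RAW)
```

Only the "author" section would normally be passed on to
feature extraction, since the other sections were not written
by the sender at the time of writing.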

Applications
Author profiling has applications in various fields where there
is a need to identify specific characteristics of an author of a
text, with a growing importance in fields like forensics and
marketing.[26] Depending on its application, the task of
author profiling can vary in terms of the characteristics to be
identified, number of authors studied and number of texts
available for analysis.

Although its applications have traditionally been limited to
written texts, such as literary works, this has extended to
online texts with the advancement of the computer and the
Internet.

Forensic linguistics

In the context of forensic linguistics, author profiling is used
to identify characteristics of the author of anonymous,
pseudonymous or forged text, based on the author’s use of
the language. Through linguistic analysis, forensic linguists
seek to identify the suspect’s motivation and ideology, along
with other class features, such as the suspect’s ethnicity or
profession. While this does not always lead to decisive
author identification, such information can help law
enforcement narrow the pool of suspects.[27]

In most cases, author profiling in the context of forensic
linguistics involves a single-text problem, in which there
are few or no comparison texts available and no external
evidence that points to the author.[28] Examples of texts
analysed by forensic linguists include blackmail letters,
confessions, testaments, suicide letters and plagiarised
writing.[29] With the increasing number of cybercrimes
committed on the Internet,[30] this has also extended to
online texts, such as sexually explicit online chat logs
between middle-aged men and underaged girls.[28]

One of the earliest and best-known examples of the use of
author profiling is by Roger Shuy, who was asked to examine
a ransom note linked to a notorious kidnapping case in 1979.
Based on his analysis of the kidnapper’s idiolect, Shuy was
able to identify crucial elements of the kidnapper’s identity
from his misspellings and a dialect item: that the kidnapper
was well-educated and from Akron, Ohio.[31] This eventually
led to a successful arrest and a confession by the suspect.

However, there are criticisms that author profiling methods
lack objectivity, since these methods are reliant on a forensic
linguist’s subjective identification of crucial sociolinguistic
markers. These methods, such as those adopted by literary
critic Donald Wayne Foster, are said to be speculative and
based entirely on one’s subjective experience, and therefore
cannot be tested empirically.[32]

Bot detection

Author profiling is adopted in the identification of social bots,
the most common being Twitter bots. Social bots have been
deemed a threat given their commercial, political and
ideological influence, as in the 2016 United States
presidential election, during which they polarised political
conversations and spread misinformation and unverified
information. In the context of marketing, social bots can
artificially inflate the popularity of a product by posting
positive reviews, and undermine the reputation of
competing products with unfavourable reviews.[33]
Therefore, bot detection from an author profiling perspective
is a task of high importance.[33][34]

Made to appear as human accounts, bots can mostly be
identified by information on their profiles, like their
username, profile photo and time of posting.[34] However,
the task of identifying bots solely from textual data (i.e.
without metadata) is significantly more challenging,
requiring author profiling techniques.[34] This usually
involves a classification task based on semantic and
syntactic features.[35][36]

The task of bot and gender profiling was one of four shared
tasks in the 2019 edition of PAN, which organises a series of
scientific events and shared tasks on digital text forensics
and stylometry.[33] Participating teams achieved much
success, with the best results for bot detection on English
and Spanish tweets at 95.95% and 93.33% respectively.[35]

Marketing

Author profiling is also useful from a marketing viewpoint, as
it allows businesses to identify the demographics of people
that like or dislike their products based on an analysis of
blogs, online product reviews and social media content.[26]
This is important since most individuals post their reviews on
products anonymously. Author profiling techniques are
helpful to business experts in making better informed
strategic decisions based on the demographics of their
target group.[37] In addition, businesses can target their
marketing campaigns at groups of consumers who match
the demographics and profile of current customers.[38]

Literary works

Author profiling techniques are used to study traditional
media and literature to identify the writing style of various
authors as well as the topics they write about. Author
profiling for literature has also been done to deduce the
social networks of authors and their literary influence based
on their bibliographic records of co-authorship.

Some examples of author profiling studies on literature and
traditional media include studies on the following:[39][40]

The Bible
Gospels of the New Testament
Shakespeare’s works [41]
The Federalist Papers in the 1990s and 1960s
Author profiling studies for Lithuanian Literary Texts [40]

Library cataloguing

Another application of author profiling is in devising
strategies for cataloguing library resources based on
standard attributes.[42] In this approach, author profiling
techniques may improve the efficiency of library cataloguing
in which library resources are automatically classified based
on the authors' bibliographic records. This was a significant
issue in the early 21st century when much of library
cataloguing was still done manually.

In using author profiling for library cataloguing, researchers
have utilised machine learning, such as Support Vector
Machine (SVM) algorithms, for automatic processes in the
library. With the use of SVMs for author profiling,
bibliographic records of authors within existing databases
may be identified, tracked, and updated to identify an author
based on his or her topics of literary content and expertise,
as indicated in his or her bibliographic records. In this case,
author profiling utilises the social structures of authors that
may be derived from physical copies of published media to
catalogue library resources.[42]

In popular culture
Author profiling has been featured in popular culture. The
2017 Discovery Channel mini-series Manhunt: Unabomber is
a fictionalised account of the FBI investigation surrounding
the Unabomber. It features a criminal profiler who identifies
defining characteristics of the Unabomber’s identity based
on his analysis of the Unabomber’s idiolect in his published
manifesto and letters. The show highlighted the importance
of author profiling in criminal forensics, as it was critical in
the capture of the real Unabomber in 1996.[43]

See also
Related subjects

Computational linguistics
Forensic linguistics
Native-language identification
Social bot
Stylometry

References
1. Wiegmann, M., Stein, B. & Potthast, M. (2019).
"Overview of the Celebrity Profiling Task at PAN 2019."
CLEF.
2. Mikros, G.K., & Perifanos, K. (2013). "Authorship
attribution in Greek tweets using author's multilevel n-
gram profiles." 2013 AAAI Spring Symposium Series.
3. Koppel, M., Argamon, S., & Shimoni, A.R. (2002).
"Automatically categorizing written texts by author
gender." Literary and Linguistic Computing, 17(4), 401–412.
4. López-Monroy, A. P., Montes-y-Gómez, M.,
Escalante, H. J., Villaseñor-Pineda, L. & Stamatatos, E.
(2015). "Discriminative subprofile-specific
representations for author profiling in social media." In:
Knowledge-Based Systems, 89, 134 - 147.
5. Lundeqvist, E. & Svensson, M. (2017). "Author
profiling: A machine learning approach towards
detecting gender, age and native language of users in
social media." In: Department of Information
Technology.
6. Franco-Salvador, M., Plotnikova, N., Pawar, N., &
Benajiba, Y. (2017). "Subword-based deep averaging
networks for author profiling in social media." CLEF.
7. Kurita, K. (2018). "Paper dissected: Deep unordered
composition rivals syntactic methods for text
classification explained." Machine Learning Explained.
8. Bsi, B. & Zrigui, M. (2018). "Deep learning
techniques for author profiling in social media content."
In: 31st IBIMA Conference.
9. Bilan, I. & Zhekova, D. (2016). "CAPS: A cross-genre
author profiling system." CLEF.
10. Schler, J., Koppel, M., Argamon, S., & Pennebaker, J.W.
(2005). "Effects of Age and Gender on Blogging." AAAI
Spring Symposium: Computational Approaches to
Analyzing Weblogs.
11. Rangel, F., & Rosso, P. (2019). "Overview of the 7th
author profiling task at PAN 2019: Bots and gender
profiling in Twitter." CLEF.
12. Rosso, P., Rangel, F., Farías, I. H., Cagnina, L.,
Zaghouani, W., & Charfi, A. (2018). "A survey on author
profiling, deception, and irony detection for the Arabic
language." Language and Linguistics Compass, 12(4).
13. Gómez-Adorno, H., Markov, I., Sidorov, G.,
Posadas-Durán, J.-P., Sanchez-Perez, M. A., &
Chanona-Hernandez, L. (2016). "Improving Feature
Representation Based on a Neural Network for Author
Profiling in Social Media Texts". In: Computational
Intelligence and Neuroscience, pg 1–13.
14. Dam, J. W. V., & Velden, M. V. D. (2015). "Online
profiling and clustering of Facebook users". In: Decision
Support Systems, 70, 60–72.
15. Hsieh, F.C., Sandroni, R.F., & Paraboni, I. (2018).
"Author Profiling from Facebook Corpora". LREC.
16. Fatima, M., Hasan, K., Anwar, S., & Nawab, R. M. A.
(2017). "Multilingual author profiling on Facebook". In:
Information Processing & Management, 53(4), 886–
904.
17. Rangel, F., & Rosso, P. (2013). "Use of Language and
Author Profiling: Identification of Gender and Age."
18. Zhang, W., Caines, A., Alikaniotis, D., & Buttery, P.
(2015). "Predicting author age from Weibo microblog
posts." LREC.
19. Chen, L., Qian, T., Wang, F., You, Z., Peng, Q., &
Zhong, M. (2015). "Age Detection for Chinese Users in
Weibo." WAIM 2015, LNCS 9098, 83–95.
20. Lin, J. (2007). "Automatic Author Profiling of Online
Chat Logs"
21. Bengel, J., Gauch, S., Mittur, E., & Vijayaraghavan, R.
(2004). "ChatTrack: Chat Room Topic Detection Using
Classification." In: Chen H., Moore R., Zeng D.D., Leavitt
J. (eds) Intelligence and Security Informatics. ISI 2004.
Lecture Notes in Computer Science, 3073. Springer,
Berlin, Heidelberg
22. Pham, D.D., Tran, G.B., & Pham, S.B. (2009).
Author Profiling for Vietnamese Blogs. 2009
International Conference on Asian Language
Processing, 190-194.
23. Santosh, K., Bansal, R., Shekhar, M. & Varma, V. (2013).
Author Profiling: Predicting Age and Gender from Blogs
Notebook for PAN at CLEF 2013. CLEF.
24. Rangel, F. & Rosso, P. (2013). Use of Language and
Author Profiling: Identification of Gender and Age.
Natural Language Processing and Cognitive Science
2013.
25. Estival, D., Gaustad, T., Pham, S. B., Radford, W.,
& Hutchinson, B. (2007). Author Profiling for English
Emails.
26. Author Profiling 2018. (n.d.).
27. Foster, D. (2000). Author Unknown: On the Trail of
Anonymous. Henry Holt and Company
28. Grant, T. D. (2008). "Approaching questions in
forensic authorship analysis." In Gibbons, J. & Turell, M.
T. (Eds.). Dimensions of Forensic Linguistics. John
Benjamins.
29. Kotzé, E. F. (2010). "Author identification from
opposing perspectives in forensic linguistics". South
African Linguistics and Applied Language Studies.
28(2). 185-197
30. Yang, M. & Chow, K. P. (2014) "Authorship Attribution
for Forensic Investigation with Thousands of Authors."
In: Cuppens-Boulahia N., Cuppens F., Jajodia S., Abou
El Kalam A., Sans T. (eds) ICT Systems Security and
Privacy Protection. SEC 2014. IFIP Advances in
Information and Communication Technology, vol 428.
Springer, Berlin, Heidelberg.
31. Leonard, R. A. (2005). "Applying the Scientific
Principles of Language Analysis to Issues of the Law."
International Journal of Humanities. 3. 1-9
32. Chaski, C. E. (2001). "Empirical evaluations of
language-based author identification techniques."
Forensic Linguistics, 8, 1-65.
33. "Bots and Gender Profiling 2019". (n.d.).
34. Goubin, Régis & Lefeuvre, Dorian & Alhamzeh,
Alaa & Mitrović, Jelena & Egyed-Zsigmond, Előd & Fossi,
Leopold. (2019). "Bots and Gender Profiling using a
Multi-layer Architecture Notebook for PAN at CLEF
2019".
35. Daelemans W. et al. (2019) "Overview of PAN
2019: Bots and Gender Profiling, Celebrity Profiling,
Cross-Domain Authorship Attribution and Style Change
Detection." In: Crestani F. et al. (eds) Experimental IR
Meets Multilinguality, Multimodality, and Interaction.
CLEF 2019. Lecture Notes in Computer Science, vol
11696. Springer, Cham.
36. Kovács, G., Balogh, V., Mehta, P., Shridhar, K., Alonso,
P., & Liwicki, M. (2019). "Author Profiling using
Semantic and Syntactic Features: Notebook for PAN at
CLEF 2019."
37. Raghunadha Reddy T., Lakshminarayana M., Vishnu
Vardhan B., Sai Prasad K., Amarnath Reddy E. (2019) "A
New Document Representation Approach for Gender
Prediction Using Author Profiles." In: Bapi R., Rao K.,
Prasad M. (eds) First International Conference on
Artificial Intelligence and Cognitive Computing.
Advances in Intelligent Systems and Computing, vol
815. Springer, Singapore
38. Maharjan, Suraj & Shrestha, Prasha & Solorio, Thamar
& Hasan, Ragib. (2014). "A Straightforward Author
Profiling Approach in MapReduce." LNCS (LNAI).
39. Company, J. S., & Wanner, L. (2017). "On the Relevance
of Syntactic and Discourse Features for Author Profiling
and Identification." Proceedings of the 15th Conference
of the European Chapter of the Association for
Computational Linguistics, 2, 681–687.
40. Dzikiene, J. K., Utka, A., & Šarkute, L. (2015).
"Authorship Attribution and Author Profiling of
Lithuanian Literary Texts", 96–105.
41. Ledger, G. (1994). "Shakespeare, Fletcher, and the Two
Noble Kinsmen." Literary and Linguistic Computing,
9(3), 235–247.
42. Nomoto, T. (2009). "Classifying library catalogues
by author profiling." In: Proceedings of the 32nd
International ACM SIGIR Conference on Research and
Development in Information Retrieval - SIGIR 09.
43. Davies, D. (2017, August 22). "FBI Profiler Says
Linguistic Work Was Pivotal In Capture Of Unabomber."
