You are on page 1of 9

Organization vs. Individual: Twitter User classification.

Kheir Eddine Daouadi1 , Rim Zghal Rebai2 and Ikram Amous3


1 MIRACL-FSEGS, Sfax University, 3000 Sfax, Tunisia.
2,3 MIRACL-ISIMS, Sfax University, Tunis Road Km 10, 3021 Sfax, Tunisia.
khairi.informatique@gmail.com
{rim_zghal, ikram_amous}@yahoo.fr

Abstract. This paper present a novel technique for classifying user accounts on
twitter. The main purpose of our classification is to distinguish the patterns of
users from those of individuals and organizations. The ability of distinguishing
between the two account types is needed for developing consumer products opin-
ion mining tools, recommendation engines, information dissemination platforms
etc. However, such a task is non-trivial. Classic and consolidated approaches use
textual features from natural language processing for classification. Nevertheless,
such approaches still have some drawbacks like the computational cost and time
consumption. In this work, we propose a statistical approach based on metadata
of user profile and popularity of posts features so as to recognize the type of users
without using textual content. We performed a set of experiments over a twitter
dataset and learn-based algorithms in classification task. This yields 95.6% f-
measure using random forest algorithm and synthetic minority oversampling
technique.

Keywords: Twitter User Classification, Individual Vs. Organization, Metadata


of User Profile, Popularity of Posts Features.

1 Introduction:

Nowadays, social media attract the attention of several parties in the community, such
as organizations, researchers, governmental institutions, and politicians etc. This gain
of interests is due to the fact of the increasing number of applications targeted by the
aforementioned actors. This growing phenomenon has become an important part of our
everyday life; millions of people exchange a tremendous amount of data on such social
media which allow people to register through the creation of a profile account, sharing
content and following other members. The emergence of social media has motivated a
large number of recent research. Several research axes, such as authors profiling [1,2,3],
user recommendation [4], event detection [5]…etc.
Twitter is one of the leading social networks which allow users to post 280 charac-
ters, known as tweets. Users can generate three main types of tweets. These are mainly
‘tweet’, which means an original posting from a user, ‘re-tweet’, which means a re-
posting of a tweet posted by another user, and ‘reply’, which means a comment for
another tweet. This free micro blogging has been used not only by individuals to create,
2

publish and share content, but also by organizations to spread information and engage
users. We define an individual account as one having a non-organizational nature and
usually managed by an individual whereas the organization account is one managed by
a non-profit organization, an institution, an association, a corporation or any group of
common interests. In [6], authors showed that the presence of organizations makes up
9.4 % of the accounts in twitter.
In this paper, we attempt to distinguish between accounts belonging to individuals
and organizations in twitter social networks. The ability to classify the patterns from
both users’ types is needed for developing of many applications such as products opin-
ion mining tools, recommendation engines, information dissemination platforms etc.
Classic and consolidated approaches of text mining use textual features from natural
language processing (NLP) for classification. Nevertheless, such approaches still have
some drawbacks like the computational cost and time consumption. In this context, the
main contribution of this work is to demonstrate the importance of using parameters
from metadata of user profile and metadata of posts in order to perform the Individual-
vs-Organization (IvO) classification task. We illustrated the benefits from using both
metadata of user profile and metadata of posts in order to enhance the performance the
IvO classification users.
The remainder of this paper is organized as follows. In the Second Section, we will
discuss the related works and justify the choice of the used approach. In Section Three,
we will show the line of our proposed method: we will present the different phases of
our proposed framework. In Section Four, experimental results are displayed. In the
closing, we will summarize the main contributions of our proposed method.

2 Related works

Classifying social networks account types is considered as a type of latent attribute


inferences, in which their main purpose is to infer various properties of online accounts.
Several works has been done on a single and specific perspective such as political ori-
entation [9], fraud detection [7], organization and individual classification [3, 6, 10],
age prediction [8]. Depending on the type of features used in the classification of twitter
account types, we divide the existing literature into three main classes, mainly statisti-
cal-based approaches, content-based approaches, and hybrid approaches.
Statistical-based approaches used statistic features in the classification task such as
the social network structure, post frequency…etc. In [7], authors used post frequency
feature with a supervised learning framework in order to classify users into three main
classes, mainly human, bots, or cyborgs. In [11], authors used only the time distribution
between posts so as to distinguish between three types of users including human, bot
and organization.
Content-based approaches used only on the textual features from (NLP) such as sen-
timents and emotion expressions, n-grams, informal language use, tweet style etc. in
order to classify tweets according to their author type (individual and organization us-
ers). Authors In [10], proposed a rich set of textual features from different linguistic
3

aspects of tweet content using a supervised classifier framework. In [9], authors pro-
posed an approach for political orientation detection. They used textual features in order
to examine users’ political ideology. In [8], authors used content of the tweets with deep
convolution neural network so as to infer the age groups of users from those of teenag-
ers and adults.
Apart from statistical-based and content-based approaches, the third commonly used
type of approaches are the hybrid approaches which aims to combine textual and sta-
tistical features in the classification task. In [2], authors used features from the social
connectivity and the content of the description of the user profile so as to distinguish
between individuals, organizations and celebrities users. In [3], authors used the con-
tent, social, and temporal features so as to classify users as individuals or organizations.
In [6], authors used post content features, stylistic features, structural and behavioral
features in order to separate users from those individuals and organization. In [1], au-
thors classify users from those individuals, vaper enthusiasts, informed agency market-
ers, and spammers. They used a set of both statistical and derived tweeting behavior
features. In [12], authors used both content and network-based features in order to sep-
arate the patterns from those of organization and individual.
Although content-based and hybrid approaches were heavily used in the literature
with respect to their specific aspect, these approaches are usually have a computational
cost and time-consuming. Such limitations stand as an obstacle to obtaining a real-time
response. Moreover, they needed external resources of (NLP). In contrast, statistical
approaches take a shorter time and have a lower computational cost than the other ap-
proaches. In addition, they do not use any external resources. Therefore, we proposed
a statistical-based approach in order to perform an IvO classification of twitter users.
Our work is similar to that of [3,6,10,12], but these previous works used textual content
features in the classification task, while the user of social networks may use a multime-
dia posts. Our proposed approach may offer an important advantage by removing noise
from individual accounts which do not have organization attributes, with an independ-
ent language features, lower computational cost, less time consumption and a higher
accuracy of classification than the previous works in the literature.

3 Proposed method

In this section, we will discuss the progress of our proposed framework. The proposed
framework contains three main phases namely data collection, features extraction and
classification. We will discuss each phase in the following subsections.

3.1 Data collection

In order to create our data sets, first we used Twitter timeline (API), which allows col-
lecting the K most recent posts of such user. Previous studies have shown that 200 posts
are typically sufficient and it is enough for predicting Twitter user characteristics [mc].
Second, we used the datasets published on [6]. Table.1, present an overview about the
used datasets. These datasets contain a user_id with a label as “ind” or “org” (individual
4

or organisation). We used the user_id with twitter timeline API in order to collect the
200 most recent posts of each user. The posts came with a little attribute; they were:
Timestamp, Tweet and User. First, the Timestamp attribute contained the time and date
that the post was submitted by the user in the Twitter site, for example (“31/03/2018
21:30:12”). Second, the Tweet attribute contained post metadata such as the number
of favorite that this post has, the number of retweets that this post has, whether there is
a tweet, a retweet, or a reply…etc. Finally, the User attribute contained the metadata of
the user profile such as the number of followings, the number of followers, the number
of public lists that this user is a member of …etc.
We observed the users’ accounts’ statuses via the statuses/user_Timeline API end-
point. These statuses can take on one of four values:
1. Active account: an account which is available in the site and its posts are public.
2. Removed account: a user whose account is suspended permanently by the site; here
the user cannot petition Twitter to have his account reinstated.
3. Suspend account: a user whose account is suspended temporarily by the site; here
the user can petition Twitter to have his account reinstated.
4. Private account: a user who makes his/her posts in private.

Table 1. An overview about the used datasets.

Total number Number of Removed, sus- Number of


labeled account pended, or private account Active account
Individual 18362 3065 15297
Organization 1911 96 1815

Since, we could not access to the posts of suspend, removed and private accounts, in
this work we used only active user. This way, the next step of the method could follow.

3.2 Features extraction

This phase aims to build out the feature vector space for the active user using the 200
posts crawled for each user. We proposed three types of features. These mainly
metadata of user profile, popularity of posts features, and statistical features about the
posts.

Metadata of user profile features. Different parameters from metadata of the user
profile are exploited. We used the User attribute described in (section 3.1) in order to
get features from metadata of user profile, these including binary and numerical fea-
tures. First, the numerical features including the age of the account, number of follow-
ers, number of followings, the ratio of followers to followings, the number of lists that
this user is a member of, the number of tweets that this user has liked, the number of
all posts per day, and the total number of posts. Second, the binary features include
whether a user has enabled the possibility of geo-tagging their Tweets, whether the user
has declared their location, whether the user has provided a URL in association with
5

their profile, whether the profile has a description and whether the user has a verified
account.

Popularity of posts. In addition to the metadata of user profile features, we proposed


a statistical features from the metadata of posts, these features represent how much us-
ers interact with the posts of such user. 12 popularity of posts features were proposed.
We used both User and Tweet attributes described in (Section 3.1) in order to acquire
the popularity of posts features. We focused to the social media metric in order to cap-
ture the popularity of posts, two metric of social media are used (retweet and favorite
metrics). For each type of posts (i.e tweet, retweet, and reply) we calculated the follow-
ing equation:

𝐴𝑣𝑒𝑟𝑎𝑔𝑒_𝑛𝑢𝑚𝑏𝑒𝑟_𝑜𝑓_𝑓𝑎𝑣𝑜𝑟𝑖𝑡𝑒𝑠(𝑃)
Avg_𝑓𝑎𝑣(𝑃) = (1)
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑓𝑜𝑙𝑙𝑜𝑤𝑖𝑛𝑔
𝐴𝑣𝑒𝑟𝑎𝑔𝑒_𝑛𝑢𝑚𝑏𝑒𝑟_𝑜𝑓_𝑟𝑒𝑡𝑤𝑒𝑒𝑡𝑠(𝑃)
𝐴𝑣𝑔_𝑟𝑒𝑡(𝑃) = (2)
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑓𝑜𝑙𝑙𝑜𝑤𝑖𝑛𝑔 ∗ 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 (𝑃)

𝑎𝑙𝑙_𝑓𝑎𝑣(𝑃) = (∑𝑛𝑘=0 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑓𝑎𝑣𝑜𝑟𝑖𝑡𝑒𝑠 (𝑃)) (3)


𝑛
𝑎𝑙𝑙_𝑟𝑒𝑡(𝑃) = (∑𝑘=0 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑟𝑒𝑡𝑤𝑒𝑒𝑡𝑠(𝑃)) (4)

Where: P may be (tweets, retweets, and replies), n represent the number of P, number
of retweets is the number of retweets to the posts done by other user, number of favor-
ites is the number of favorites to the posts done by other user, and number of following
is the number of user that follows the user.

Statistical features. Apart from metadata of user profile and post popularity features
we proposed also a statistical features about the posts of user, these features include the
number of posts that contain a mention to another user, the number of quoted tweets
(i.e. retweets with a comment), the number of posts that were posted in association with
a place, the number of posts geo-tagged, and the average interval between posts.

3.3 Classification
The third phase of our proposed framework represent the classification task. We used
the supervised machine-learning approach, a set of classifiers were used. Then, a com-
parison of the result of each classifier was performed so as to make a decision as to
which classifier performs better with our proposed features. After choosing the classi-
fier that give us the best results of accuracy, a final predictive model were constructed
in order to use it in the process of IvO classification task. We will discuss experimental
setting and the results in the following section.
6

4 Experiment and evaluation

To verify the validity of our proposed approach, we performed several experiments.


We used a set of supervised classifiers of machine learning, namely Random Forest
(RF), Simple Logistic (SL), Logit Boost (LB), Bagging (B) and Multilayer Perceptron
(MLP). A 10-fold cross validation was performed in order to test the performance meas-
urement of each classifier, The tests were implemented using the Waikato Environment
for Knowledge Analysis (WEKA), which is the one most commonly used by similar
research. In order to evaluate the performance of our proposed method, we used the
three performance metrics. In our context, average precision is presented by (Eq 5),
average recall is presented by (Eq 6), and average F-measure can be defined as (Eq 7).
1 𝑇𝑃𝑜𝑟𝑔 𝑇𝑃𝑖𝑛𝑑𝑣
𝐴𝑣𝑔_𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = ( + ) (5)
2 𝑇𝑃𝑜𝑟𝑔 +𝐹𝑃𝑜𝑟𝑔 𝑇𝑃𝑖𝑛𝑑𝑣 +𝐹𝑃𝑖𝑛𝑑𝑣

1 𝑇𝑃𝑜𝑟𝑔 𝑇𝑃𝑖𝑛𝑑𝑣
𝐴𝑣𝑔_𝑅𝑒𝑐𝑎𝑙𝑙 = ( + ) (6)
2 𝑇𝑃𝑜𝑟𝑔 +𝐹𝑁𝑜𝑟𝑔 𝑇𝑃𝑖𝑛𝑑𝑣 +𝐹𝑁𝑖𝑛𝑑𝑣

2×𝐴𝑣𝑔𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛×𝐴𝑣𝑔𝑅𝑒𝑐𝑎𝑙𝑙
𝐴𝑣𝑔𝐹_𝑚𝑒𝑎𝑠𝑢𝑟𝑒 = (7)
𝐴𝑣𝑔𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛+𝐴𝑣𝑔𝑅𝑒𝑐𝑎𝑙𝑙

where: TPindv and TPorg are the total number of users that are correctly classified as
individuals and organizations, respectively. FPindv is the number of organizations that
are incorrectly classified as individuals. FPorg is the number of individuals that are in-
correctly classified as organizations. FNorg and FNindv are the number of users that the
classifier has classified as individuals and organizations respectively.
Table.2 shows the performance measurement of each classifier, the (RF) outper-
formed the others classifiers. This yields a 93.1% f-measure.

Table 2. 10-fold cross validation performance measurement.

Avg_Precision% Avg_Recall% Avg_f-measure%


RF 92,9 93,4 93,1
SL 89,4 90,9 89,3
LB 91,9 92,5 92,1
B 92,6 93,1 92,8
MLP 89,2 90,3 89,6

Table 3. F-measure results for both class and in two condition.

Balanced datasets Full datasets


Ind Org %acc Ind Org %Acc
Majority class 0 100 50 100 0 89,39
Proposed method 88,9 89,3 89,28 96,4 65,9 93,43
7

Given the skewed distribution of the user account types,we compared the perfor-
mance of the proposed approach with the majority-class baseline, which classes all in-
stance in the class that contain the majority samples (in our case individual class). We
also evaluated our proposed approach in two condition, using balanced datasets and
imbalanced datasets. As shown in Table.3, we outperformed the majority-class base-
line in both balanced datasets and imbalanced datasets.
In order to evaluate the importance of features, we also evaluate the f-measure of the
corresponding models:

 (MUP): metadata of user profile features 91,4% .


 (PP): popularity of posts features 90,2%.
 (S): statistical features 85,9%.
 (MUP) + (PP): metadata of user profile and popularity of posts features 92,8%.
 (MUP)+(S): metadata of user profile and statistical features 91,9 %.
 (PP)+(S): popularity of posts features and statistical features 90,8 %.
 (MUP)+(PP)+(S): all features 93,1%.

Figure. Confusion matrix.

We also used Synthetic Minority Oversampling Technique (SMOTE) [13] in order to


balance the datasets. SMOTE algorithm allows to generate a new sample based on fea-
ture vector of the minority class (in our case organization class), and is a powerful
8

method that has seen successfully in many domains [14]. Significant performance gains
were observed on balancing datasets. As shown in Table.4, when we used our proposed
method with SMOTE technique the average f-measure reach 95.6%.

Table 4. Performance measurement using SMOTE technique.

Avg_Precision% Avg_Recall% Avg_f-measure%


RF+SMOTE 95,7 95,6 95,6

5 Conclusion

This paper present an alternative approach for user classification on online social net-
works using a set of rich statistical features, unlike the traditional approaches which
used textual content. This type of approach highlights a new field in the classification
of individual and organization users. Our proposed framework differ from the previous
one in term of the features used in the classification task, previous work used the textual
content of post as features, whereas some drawbacks such as time consumption, com-
putational cost…etc. Our work outperforms the previous one in two dimensions: first,
in previous work, feature extraction needs multilingual and multidialectal resources,
but in our proposed approach feature extraction is performed without any external re-
sources and classifies users without taking into account the user's language. Second,
the user can post textual data, images, or videos, previous work fails with the user who
posts multimedia posts while the proposed method does well with the user who uses a
multimedia content since it uses only both metadata of the user profile and metadata of
posts features. Examining the proposed metadata of user profile and metadata of posts
features was critical in improving the overall performance of IvO classification task.
As future work, we plan to make our system open source and implement a web service
(ex an API) to allow the research community to perform user level organization detec-
tion using it.

References
1. Kim, A., Miano, T., Chew, R., Eggers, M., Nonnemaker, J.: Classification of Twitter Users
Who Tweet About E-Cigarettes. In: JMIR Public Health and Surveillance, vol. 3, no 3, p.
e63, (2017).
2. Nagpal, C., Singhal, K.: Twitter User Classification using Ambient Metadata, arXiv preprint
arXiv:1407.8499, (2014).
3. Oentaryo, R. J., Low, J. W., Lim, E. P.: Chalk and Cheese in Twitter: Discriminating Per-
sonal and Organization Accounts. In: Hanbury A., Kazai G., Rauber A., Fuhr N. (Advances
in Information Retrieval). European Conference on Information Retrieval ECIR 2015, Lec-
ture Notes in Computer Science, vol 9022, pp. 465-476. Springer, Cham (2015).
9

4. Kalaï, A., Wafa, A., Zayani, C. A., Amous, I.: LoTrust: A social Trust Level model based
on time-aware social interactions and interests similarity. In: 14th Annual Conference on
Privacy, Security and Trust (PST), pp. 428-436. IEEE, New Zealand, (2016).
5. Troudi, A., Zayani, C. A., Jamoussi, S., Amor, I. A. B.: A New Mashup Based Method for
Event Detection from Social Media. Information Systems Frontiers, 1-12. (2018)
6. McCorriston, J., Jurgens, D., Ruths, D.: Organizations Are Users Too: Characterizing and
Detecting the Presence of Organizations on Twitter. In: 9th International Conference on Web
and Social Media ICWSM. pp. 650-653, The AAAI Press, UK, (2015).
7. Tavares, G.M., MASTELINI, S.M., BARBON JR, S.: User Classification on Online Social
Networks by Post Frequency. CEP, vol. 86057, pp. 970-977 (2017).
8. Guimaraes, R. G., Rosa, R. L., De Gaetano, D., Rodriguez, D. Z., Bressan, G.: Age Groups
Classification in Social Network Using Deep Learning, IEEE Access, 1-11 (2017).
9. Preoţiuc-Pietro, D., Liu, Y., Hopkins, D., Ungar, L.: Beyond binary labels: political ideology
prediction of twitter users. In Proceedings of the 55th Annual Meeting of the Association
for Computational Linguistics, pp. 729-740, Canada (2017).
10. De Silva, L., Riloff, E.: User Type Classification of Tweets with Implications for Event
Recognition. ACL ,98-108 (2014).
11. Tavares, G., Faisal, A.: Scaling-laws of human broadcast communication enable distinction
between human, corporate and robot Twitter users. PloS one, vol. 8, no 7, p. e65774, (2013).
12. Kim, S. M., Paris, C., Power, R., & Wan, S.: Distinguishing Individuals from Organisations
on Twitter. In Proceedings of the 26th International Conference on World Wide Web Com-
panion, pp. 805-806 (2017).
13. Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P.: SMOTE: synthetic minority
over-sampling technique. Journal of artificial intelligence research, 16, 321-357. (2002).
14. He, H., & Garcia, E. A. : Learning from imbalanced data. IEEE Transactions on Knowledge
& Data Engineering, (9), 1263-1284. (2008).

You might also like