
Social Trait Analyzer

Muhammad Farhan, Muhammad Ali Khan, Muhammad Atif
Department of Computer Science, DHA Suffa University (DSU), Karachi, Pakistan
cs161095@dsu.edu.pk, cs161070@dsu.edu.pk, cs161066@dsu.edu.pk

Abstract—Twitter is a social network used by hundreds of millions of users. People use it to spread important information, and it is also used by individuals, many of them celebrities, politicians, and other people with large numbers of followers. With so many users, it can be difficult to distinguish the traits of a given user. To address this, we are building an API that works in real time and predicts and classifies any user's social traits with the help of our trained models. These traits include the user's domain (area of interest), drawn from a set pre-selected by us. The API will be able to predict 11 different domains by providing a set of up to 50 features to various machine learning models, mainly random forest and logistic regression. The API is currently under development, so we will analyze the feasibility of the work described above and take further steps according to its progress.

I. INTRODUCTION

A. Description of the Project
Twitter is one of the most commonly used social platforms, with around 600 million users. People around the globe with different traits use Twitter, so this project is mainly useful for understanding those users. Identifying domains is useful in many ways. For example, a user with a particular domain of interest usually follows, or is followed by, users with the same domain, which produces a chain of users with similar interests. By identifying these blocks of people, we can carry out marketing activities or any other task that depends on knowing users' domains. Apart from identifying blocks of users who share a domain of interest, this project can also be used to identify the domain of a single user for multiple purposes.

B. Details about the Domain
Understanding the user is the main objective of this project. Users are classified into 11 classes (Sports, Business, Politics, Entertainment, Science and Technology, Religion, General category, Health, Education, News, Wellbeing), and this classification can be used for many further purposes. Machine learning is the domain of this project. A machine learning model learns by analyzing datasets consisting of various features, after which it can predict results, and its accuracy can be tuned. Because of this capability, machine learning was the most suitable field for this project.

C. Relevant Background
User classification is a widely studied research problem and has been applied to many platforms, including Twitter. Much of the existing work identifies whether a user is a bot or a human, or classifies users into only 2 to 3 classes. This work proposes 11 classes (Sports, Business, Politics, Entertainment, Science and Technology, Religion, General category, Health, Education, News, Wellbeing), and so extends the classification work on Twitter.

II. MOTIVATION AND BACKGROUND
This project is based on classifying Twitter users into 11 major classes (Sports, Business, Politics, Entertainment, Science and Technology, Religion, General category, Health, Education, News, Wellbeing). Since this is a classification problem, machine learning, with the classification models it provides, is the most suitable field for this project. Similar work exists on Twitter classification, but classification into these 11 major classes has not been done before.
Classification: Assigning an element to its particular category.
Machine Learning: A field that provides algorithms and techniques for classification and other tasks.
Data Augmentation: A machine learning technique used to compensate for data imbalance.

A. Project Objectives
Our objective is to develop a service that assesses users' tweets and applies further methodology to classify each user's domain. With this service, companies can build marketing strategies on a much better understanding of the users of their products.

III. LITERATURE REVIEW
Work of a similar nature has been done before. The research project “Understanding Types of Users on Twitter” identified 6 major classes of Twitter users: personal users, professional users, business users, spam users, feed/news, and viral/marketing. The authors observed 716 different user profiles to predict these classes. Random forest with bagging was used for classification, and the reported scores were notable.
In 2012, the project “Detecting Automation of Twitter Accounts: Are You a Human, Bot, or Cyborg?” was developed to classify users as human, bot, or cyborg. Malicious bots spread spam or malicious content, while a cyborg is either a bot-assisted human or a human-assisted bot. To classify them, the authors collected 500,000 accounts and analyzed features such as tweeting behavior, tweet content, and account properties.
In 2010, the research “Detecting Spammers on Twitter” was introduced to distinguish spammers from non-spammers. The authors crawled 54 million user profiles, along with all their tweets and their follower and followee links. They then formed features by manual inspection and, after creating the datasets, passed them to an SVM model to obtain the required scores.
The research “Classifying Twitter User Interests using Time Series,” proposed in 2013, classified users' interests by analyzing the periodicity of their tweet streams. The technique was applied to only 2 classes, politics and sports, and its scores were significantly better than those of eight competing classification solutions.

A. Related Work
A lot of similar work exists. For example, “Understanding Types of Users on Twitter” classified users into 6 broad classes; “Detecting Automation of Twitter Accounts: Are You a Human, Bot, or Cyborg?” classified accounts as bot, human, or cyborg; “Detecting Spammers on Twitter” addressed spammer classification; and “Classifying Twitter User Interests using Time Series” classified 2 domains with the help of time series.

B. Relevant Background & Definition
This project is based on classifying Twitter users into 11 major classes (Sports, Business, Politics, Entertainment, Science and Technology, Religion, General category, Health, Education, News, Wellbeing). Since this is a classification problem, machine learning, with the classification models it provides, is the most suitable field for this project. Similar work exists on Twitter classification, but classification into these 11 major classes has not been done before.
Classification: Assigning an element to its particular category.
Machine Learning: A field that provides algorithms and techniques for classification and other tasks.
Data Augmentation: A machine learning technique used to compensate for data imbalance.

IV. GAP ANALYSIS
This work differs from most existing projects. Research proposed in 2012 classified accounts as bot, human, or cyborg, whereas we classify users by domain of interest. Work presented in 2014 was based on 6 classes (personal users, professional users, business users, spam users, feed/news, and viral/marketing), while our work provides 11 classes, all different from those 6. Another study classified spammers, while this work classifies interests/domains. Yet another focused entirely on time periodicity to predict only 2 classes, while we analyze around 50 features to classify 11 classes.

V. PROJECT EXPLANATION

A. In Scope
We will create a service that can predict a user's domain. The service will be able to predict up to 11 domains: Sports, Business, Politics, Entertainment, Science and Technology, Religion, General category, Health, Education, News, and Wellbeing.

B. Out of Scope
We will not predict anything regarding a user's personality, bot-like behavior, or sentiment analysis.

C. Web Interface
Fig. 1. GUI.

D. Software Engineering Methodology
We used the Waterfall methodology to run and test various machine learning models on data collected from various sources. We followed the five essential steps of the Waterfall software development life cycle:
• Analysis
• Design
• Implementation
• Verification
• Maintenance
E. Project Methodology
We used the waterfall methodology in this project, combined with an agile approach because of the complexity of the work. The work was divided into subcomponents, and the implementation of each subcomponent followed the waterfall model. The first subcomponent was project planning, followed by deciding how to fetch users' data; for this we held several meetings with our supervisor on how to collect the data and how to collect it quickly. The next subtask was feature formation, which began with identifying the features needed for our project. We extracted the features that would be helpful and then preprocessed the data, which led to selecting the models to implement according to their nature and type. After multiple meetings, we decided to use random forest and logistic regression because of their notable results. After that, optimization took most of the time.

F. System Level Architecture
Initially, the project extracts the user's tweets. It then extracts the tweets of the user's followers and of the accounts the user follows, then the tweets the user has liked, and finally collects extra attributes about the user. After gathering all the required data, we transform it to suit the project's needs (data filtering, mathematical modelling). The transformed data is then provided to various machine learning models to predict the domain, spread, influence, etc.

G. Design Strategy
Our design focuses on simplicity for the user. The only task the user has to perform is to enter the username of a Twitter account. The project then extracts the user's tweets, the tweets of their followers and of the accounts they follow, their liked tweets, and various extra attributes. After that, various machine learning algorithms are executed. The whole process, from tweet extraction to domain prediction, is carried out by Python-based scripts. Finally, the user is given the domain of the account, as well as its spread, influence, etc.

H. Design Considerations
1) Assumptions and Dependencies: We assume that we will be able to explore and understand a user's traits by considering various attributes/features of the user. Our main dependency is on the data to be collected and the APIs needed for extra features.
2) Risks and Volatile Areas: This project is highly dependent on Twitter. If Twitter's data becomes inaccessible, the project will not be able to extract data. Browser versions are also a dependency: newer versions may distort our interface.

I. System Functions / Functional Requirements
Fig. 2. System Function.[a]
Fig. 3. System Function.[b]

J. System Attributes / Nonfunctional Requirements

K. Non-Functional Requirements
1) Performance Requirements: Since we require heavy processing power to run various machine learning models, this software will require a powerful CPU and GPU to run quickly and efficiently.
2) Security Requirements: The data we extract for our project is publicly available to everyone, so data leakage or any sort of hacking would not breach our security constraints.
3) Reliability Requirements: Several testing techniques will be applied to ensure the quality and reliability of the project.
4) Usability Requirements: We will have a simple web interface where the input is a username and the output is the domain, spread, influence, etc. The interface will be easy to use, with a single input and output and fast results.
5) Supportability Requirements: This project will support all operating systems and all browsers. An internet connection is required to use it.

L. User Documentation Requirements
• Internet Browser Version
• Twitter Username
• Internet connection

M. Experimental Evaluations & Results
1) Evaluation Test bed: Initially, we generated data sets consisting of 56 attributes. The attributes were derived mainly from the 4 main sources we had: the user's own tweets, their followers' tweets, the tweets of the accounts they follow, and the tweets they liked. We also extracted some additional features, such as the average tweet length and the counts of follower tweets, following tweets, own tweets, and liked tweets.
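As an illustration of the test bed described above, the following is a minimal sketch of how per-user features might be derived from the four tweet sources. The function name `extract_features`, the source keys, and the tiny keyword lexicon are hypothetical; the project's actual 56 attributes and its Twitter API calls are not specified in this paper, so tweets are passed in as plain lists of strings.

```python
from statistics import mean

# Hypothetical keyword lexicon for two of the 11 domains (illustration only).
DOMAIN_KEYWORDS = {
    "sports": {"match", "goal", "cricket", "team"},
    "politics": {"election", "vote", "minister", "parliament"},
}

# The four tweet sources named in the text.
SOURCES = ("own", "followers", "following", "liked")


def extract_features(tweets_by_source):
    """Build a flat feature dict from tweets grouped by source.

    tweets_by_source maps each name in SOURCES to a list of tweet strings.
    Missing sources are treated as empty.
    """
    features = {}
    for source in SOURCES:
        tweets = tweets_by_source.get(source, [])
        # Count of tweets collected from this source.
        features[f"{source}_count"] = len(tweets)
        # Average tweet length in characters (0.0 when there are no tweets).
        features[f"{source}_avg_len"] = (
            mean(len(t) for t in tweets) if tweets else 0.0
        )
        # Keyword hits per domain: an assumed stand-in for content features.
        words = [w.strip(".,!?#@").lower() for t in tweets for w in t.split()]
        for domain, keywords in DOMAIN_KEYWORDS.items():
            features[f"{source}_{domain}_hits"] = sum(w in keywords for w in words)
    return features


example = {
    "own": ["Great goal by the team!", "Vote early"],
    "liked": ["Cricket match tonight"],
}
features = extract_features(example)
```

Each source contributes a count, an average length, and one keyword score per domain, so extending `DOMAIN_KEYWORDS` to all 11 domains would give 13 features per source. The resulting dictionary can be turned into one row of the training data frame.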
Fig. 4. Non Function.

Eventually, we had a data set consisting of 56 attributes in total. This data set was for the extraction of the first domain. We used an API to extract the individual tweets of each user, and the data set was generated from them.

Fig. 5. Data set.

We implemented SVM, logistic regression, neural network, gradient boosting, and random forest models to predict a user's domain. Their accuracy was quite low at first, but we applied several techniques to tune them to our needs. The results they gave us are shown below.

Fig. 6. Result.

It was clearly visible that the results of logistic regression stood out among the other models. However, this result could not be trusted because the data set was imbalanced, so we applied data augmentation and obtained the following results.

Fig. 7. Result.

We managed to reach an accuracy of 92% with logistic regression after applying data augmentation.
We then went further and tried to fetch the second and third domains of a user. We preprocessed the data frame accordingly, and the results are shown below.

Fig. 8. Result.

After applying data augmentation:

Fig. 9. Result.

N. Results and Discussion
After applying all the methods and techniques, we found that logistic regression provides 92% accuracy for the first domain, while the extraction of the second and third domains was not as successful.

VI. CONCLUSION AND FUTURE WORK

A. Limitations and Future Work
The limitations of this project are as follows:
• Running the machine learning models required a powerful computer with at least 12 GB of RAM.
• Data changes over time, so we had to collect it as soon as possible. Once data becomes unavailable, we cannot fetch those data sets again.
• The accuracy of the models varies with the data sets.
Future work will focus on improving the accuracy of our models by applying different techniques over time, whether by adopting new developments in machine learning or by developing techniques of our own.
Reasons for Failure – If Any: Since all tweets come from Twitter's API, if Twitter restricts access to its API, the data cannot be fetched and the whole project would stop functioning.
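The evaluation above reports that class imbalance made the first results untrustworthy and that data augmentation was applied before the final figures were obtained. The paper does not say which augmentation method was used. As one common option, the sketch below shows random oversampling, which duplicates randomly chosen minority-class rows until every class matches the largest one. The function name `oversample` and the toy data are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Keep every original row, then add random duplicates up to the target.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy imbalanced data: 4 "sports" users, 1 "health" user (features are made up).
rows = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.95, 0.05], [0.1, 0.9]]
labels = ["sports", "sports", "sports", "sports", "health"]
balanced_rows, balanced_labels = oversample(rows, labels)
counts = Counter(balanced_labels)
```

Oversampling of this kind should be applied to the training split only. If duplicated rows end up in the test split as well, the measured accuracy will be inflated.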

REFERENCES
[1] Understanding Types of Users on Twitter: https://www.researchgate.net/publication/261994329_Understanding_Types_of_Users_on_Twitter
[2] https://ieeexplore.ieee.org/document/6280553
[3] Detecting Spammers on Twitter: https://www.researchgate.net/publication/228855257_Detecting_spammers_on_Twitter
[4] http://pike.psu.edu/publications/asonam13.pdf
[5] Anaconda Software: https://www.anaconda.com/what-is-anaconda/
[6] Python download: https://www.python.org/downloads/release/python-374/
[7] What is feature engineering: https://medium.com/mindorks/what-is-feature-engineering-for-machine-learning-d8ba3158d97a
[8] What is data augmentation: https://bair.berkeley.edu/blog/2019/06/07/data_aug/
