
Tuning machine learning models to detect bots on Twitter


Stefano M P C Souza,
Department of Electrical Engineering, University of Brasília (UnB), Brasília, Brazil
stefanomozart@ieee.org

Tito B Rezende, José Nascimento, Levy G Chaves, Darlinne H P Soto, Soroor Salavati
Institute of Computing, University of Campinas (Unicamp), Campinas, Brazil
{t025327, j170862, l264958, d264955, s264967}@dac.unicamp.br

https://github.com/stefanomozart/twitter_bot_detection

1
Motivation
● Bots can be used to spread fake news, manipulate public
opinion, and fabricate hashtag trends;

● However, not all bots are malicious.

2
Related work
● Related work can be split into three categories:

○ Account/Profile based;

○ Tweet text based;

○ Topological.

● State-of-the-art models are built with feature engineering
techniques specific to a single platform.

3
Pipeline
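The original slide shows the pipeline as a diagram. As a minimal sketch of how one branch of such a pipeline could be wired in scikit-learn (the structure and the parameter grid are assumptions for illustration, not the authors' code):

```python
# Minimal sketch of one pipeline branch: optional scaling followed by a
# candidate model, tuned with cross-validated grid search.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scaler", StandardScaler()),                 # or MinMaxScaler() / "passthrough"
    ("model", LogisticRegression(max_iter=1000)),
])

search = GridSearchCV(
    pipe,
    param_grid={"model__C": [0.01, 0.1, 1, 10]},  # hypothetical grid
    cv=5,
    scoring="accuracy",
)
# search.fit(X_train, y_train) yields the best-tuned branch for this model.
```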

4
Dataset
Combines three publicly available datasets:

Dataset              #bots    #humans
botwiki-2019           698          0
cresci-rtbust-2019     353        340
cresci-stock-2018    7,102      6,174
Total                8,153      6,514
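A minimal sketch of assembling the combined dataset, assuming each source ships as a CSV of account features with a binary label column (file names and the `label` column are placeholders):

```python
# Sketch: concatenate the three public datasets into one labelled frame.
import pandas as pd

parts = ["botwiki-2019.csv", "cresci-rtbust-2019.csv", "cresci-stock-2018.csv"]
df = pd.concat([pd.read_csv(p) for p in parts], ignore_index=True)

# Expect roughly 8,153 bots and 6,514 humans in total.
print(df["label"].value_counts())
```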

5
EDA: t-SNE

- There is no clear linear boundary that could separate the two groups;

- Hence, we can expect poor results from classifiers with linear
assumptions, such as logistic regression.
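A sketch of the t-SNE projection behind this observation, reusing `df` from the loading sketch above (perplexity and plotting choices are assumptions):

```python
# Sketch: 2-D t-SNE embedding of the account features, coloured by class.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X = df.drop(columns=["label"]).values  # feature matrix
y = df["label"].values                 # 0/1 labels

emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=y, s=3, cmap="coolwarm")
plt.title("t-SNE projection: bots vs. humans")
plt.show()
```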

6
Features
Selected features:
● Statuses count
● Followers count
● Friends count
● Favourites count
● Listed count
● Default profile
● Profile uses background image
● Verified

Engineered features:
● Screen name length
● Screen name number of digits
● Name length
● Name number of digits
● Description length
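A sketch of how the engineered features could be derived from the raw profile strings (column names follow the Twitter API v1.1 user object; the exact implementation is an assumption):

```python
# Sketch: derive the engineered features from raw profile fields.
import pandas as pd

def engineer_features(users: pd.DataFrame) -> pd.DataFrame:
    out = users.copy()
    out["screen_name_length"] = users["screen_name"].str.len()
    out["screen_name_digits"] = users["screen_name"].str.count(r"\d")
    out["name_length"] = users["name"].str.len()
    out["name_digits"] = users["name"].str.count(r"\d")
    out["description_length"] = users["description"].fillna("").str.len()
    return out
```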

7
Feature scaling
● No feature scaling
● Standard score: z = (x − μ) / σ
● Min-max: x′ = (x − min(x)) / (max(x) − min(x))
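The three options map onto scikit-learn as follows (a sketch; `"passthrough"` stands in for no scaling inside a `Pipeline`):

```python
# Sketch: the three scaling variants compared in the experiments.
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scalers = {
    "none": "passthrough",         # raw features, no scaling
    "standard": StandardScaler(),  # z = (x - mean) / std
    "minmax": MinMaxScaler(),      # x' = (x - min) / (max - min)
}
```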

8
Parameter Tuning
Search for the best hyper-parameters for each model.
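As an illustration, a cross-validated random search over a hypothetical XGBoost grid (the search strategy and value ranges are assumptions, not the authors' exact setup):

```python
# Sketch: randomized hyperparameter search for one candidate model.
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
}
search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_dist,
    n_iter=10, cv=5, scoring="accuracy", random_state=42,
)
# search.fit(X_train, y_train); search.best_params_ holds the tuned values.
```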

9
Model Selection

● KNN
● Logistic Regression
● SVM
● Decision Tree
● Random Forest
● Bagging: bootstrap aggregating with tree-based estimators
● XGBoost: gradient boosting with decision trees
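The candidate set above could be instantiated as a simple model zoo (a sketch with default constructors; the tuned hyperparameters come from the search step):

```python
# Sketch: the seven candidate classifiers evaluated in the pipeline.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from xgboost import XGBClassifier

models = {
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "tree": DecisionTreeClassifier(),
    "rf": RandomForestClassifier(),
    "bagging": BaggingClassifier(),  # decision trees by default
    "xgb": XGBClassifier(eval_metric="logloss"),
}
```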

10
Results
● We present the best-tuned model for each classifier;
● XGBoost was the best among all classifiers.
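A sketch of how the tuned candidates could be compared, reusing `X`, `y` and `models` from the sketches above (the metric and split are assumptions; the slides report only the ranking):

```python
# Sketch: score every candidate on a stratified held-out test set.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.4f}")
```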

11
Conclusions
- Ensemble methods gave better results, probably because they
handle the non-linearities in the data better, in agreement with
the findings of the t-SNE EDA;

- No significant difference between XGBoost, RF and Bagging;

- Feature normalization does not have a major effect on the
accuracy of tree-based models;

- The proposed pipeline proved to be a simple way to compare
many ML models and tuning strategies. Pipeline steps can easily
be replaced or enhanced to evaluate and compare other techniques.

12
Future work

- Include different datasets;

- Modify the pipeline to include NLP (tweet-text-based features);

- Work with different classes of bots, as other modes of
operation, such as real accounts with a few automated posts,
are becoming more popular;

- Include datasets from different platforms.

13
Thank YOU!

Image credits: Unsplash and Morguefile

14
