Malicious Twitter Bots Detection Using Machine Learning: A Mini Project Report

A MINI PROJECT REPORT
On
MALICIOUS TWITTER BOTS DETECTION USING MACHINE LEARNING
Submitted in Partial fulfilment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING

Under the Guidance of K.SRINIVAS REDDY
Associate Professor
By
V.YOSHITHA PRIYANKA 19D21A05G6
M.SHIREESHA 19D21A05E6
M.POORVAJA 19D21A05E2
Department of Computer Science and Engineering

SRIDEVI WOMEN’S ENGINEERING COLLEGE
(Approved by AICTE, affiliated to JNTUH, HYD and Accredited by NBA and
NAAC An ISO 9001:2015 Certified Institution)
V.N. PALLY, Gandipet, Hyderabad-75
2022-2023
CERTIFICATE
This is to certify that the mini project report entitled “MALICIOUS TWITTER
BOTS DETECTION USING MACHINE LEARNING”, submitted by
V.YOSHITHA PRIYANKA (19D21A05G6), M.SHIREESHA (19D21A05E6),
M.POORVAJA (19D21A05E2) in partial fulfilment of the requirements for the award
of the degree of Bachelor of Technology in Computer Science and Engineering, is a
record of bonafide work carried out by them.
Under the guidance of: K. SRINIVAS REDDY, Associate Professor
Coordinator: Dr. A. RAVI KUMAR, Professor
Head of the Department: Dr. A. GAUTHAMI LATHA, Professor
EXTERNAL EXAMINER

Date: 15-12-2022
Certificate of Internship
This is to certify that Ms. V Yoshitha Priyanka, Ms. M Shireesha, and Ms. M
Poorvaja, bearing roll numbers 19D21A05G6, 19D21A05E6, and 19D21A05E2, students
of B.Tech (Department of Computer Science and Engineering), Sridevi
Women’s Engineering College, Hyderabad, have successfully completed the Internship
Program on “Malicious Twitter Bots Detection using Machine Learning” at our
Organization from 28-08-2022 to 15-12-2022. During this period we found them to be
sincere, honest, hardworking, dedicated candidates with a professional attitude and
very good technical knowledge.

Triad Techno Services.
Managing Director
Head Office:204, Ratna Complex, Image Hospital Lane, Beside DHFL, Ameerpet
Hyderabad - 500073, Contact: 7036987111/222/333/444
Branch Office:2nd Floor, Datta Lord House, Behind PVP Cinemas, M.G.Road, Labbipet
Vijayawada - 520002, Contact: 7036987666
DECLARATION
We hereby declare that the mini project entitled “MALICIOUS TWITTER
BOTS DETECTION USING MACHINE LEARNING” is the work done during
the period from 29th August 2022 to 15th December 2022 and is submitted in partial
fulfilment of the requirements for the award of Bachelor of Technology in Computer
Science and Engineering from Jawaharlal Nehru Technological University,
Hyderabad.


ACKNOWLEDGEMENT
There are many people who helped us directly or indirectly to complete our
mini project successfully. We would like to take this opportunity to thank one and
all.
First of all, we would like to express our sincere gratitude and indebtedness to
K. SRINIVAS REDDY, Associate Professor and internal guide, for his valuable
guidance, suggestions, and keen personal interest throughout the course of this
project, and for his tireless patience in hearing all our seminars, minutely reviewing
all the reports, and giving appropriate guidance and suggestions.
It is our privilege and pleasure to express our deep sense of indebtedness to
Dr. A. RAVI KUMAR, Professor, Coordinator for his timely cooperation
and valuable suggestions throughout the project. We are indebted to him for the
support given to us throughout the project work.
We would like to express our gratitude to Dr. A. GAUTAMI LATHA,
Professor and Head of the Department of CSE for her precious suggestions,
motivations and cooperation for the successful completion of the project.
We are also extremely thankful to Dr. B.L. MALLESWARI, Principal, for
her precious guidance and valuable suggestions.
We would also like to thank all our faculty, family, and friends for their
help and constructive criticism during our project period. Finally, we are very much
indebted to our parents for their moral support and encouragement to achieve our goals.

LIST OF CONTENTS
Title Page
Certificate (College)
Certificate (Company)
Declaration
Acknowledgement
List of Contents
List of Figures
List of Tables
Abstract
1 Introduction
1.1 Purpose
1.2 Scope
1.3 Model Diagram
2 Literature Survey
2.1 Technology Used
2.1.1 Python
2.1.2 Machine Learning
3 System Analysis
3.1 Existing System
3.1.1 Disadvantages
3.2 Problem Statement
3.3 Proposed System
3.3.1 Advantages
4 Software Requirement Specifications
4.1 Functional Requirements
4.2 Non-Functional Requirements
4.3 Hardware Requirements
4.4 Software Requirements
5 System Design
5.1 System Architecture
5.2 System components
5.3 UML diagrams
6 Implementation
6.1 Sample code
7 System Testing
7.1 Introduction to Testing
7.2 Testing Strategies
7.3 Test Cases
7.4 Discussion of Results
8 Conclusion And Future Enhancements
8.1 Conclusion
8.2 Future Enhancements
9 References

LIST OF FIGURES
1.1 Model diagram
5.1 System Architecture
5.2 Use case diagram for user
5.3 Sequence diagram for user
5.4 Class diagram for user
5.5 Collaboration diagram for user
5.6 Activity diagram for user
7.1 Home Screen
7.2 Uploading dataset screen
7.3 Dataset loaded
7.4 Displaying Tweets
7.5 Possible bot users
7.6 ROC graph
7.7 Malicious bot users
7.8 ROC graph

LIST OF TABLES
7.1 Upload tweets dataset
7.2 Extract tweets
7.3 Recognize twitter bots using machine learning
7.4 Recognize malicious bot URLs using machine learning

ABSTRACT
Twitter is one of the most popular social networking sites, allowing users to
express their opinions on topics such as politics, sports, the stock market, and
entertainment. It strongly influences people’s perspectives, so it is necessary that
tweets are sent by genuine users and not by Twitter bots. Malicious social bots
generate fake news and automate their social relationships, either by posing as
followers or by creating multiple fake accounts that carry out malicious activities.
Because a tweet is restricted to 280 characters, malicious social bots post shortened
malicious URLs in tweets in order to redirect the requests of online social networking
participants to malicious servers. Distinguishing malicious bots is therefore one of
the most important tasks in the Twitter network. A growing number of people on
Twitter mask their identities for malicious reasons. Because this poses a risk to other
users, it is important to recognize Twitter bots and to ensure that tweets are posted
by real people. A Twitter bot posts spam-related content, so identifying bots also
helps in identifying spam messages. Twitter account attributes are used as features
in machine learning algorithms to categorize users as real or fake.

1. INTRODUCTION
Twitter is one of the fastest-growing social media platforms. It enables users
to exchange news, express themselves, and debate current events. Users may follow
individuals who share their interests or have similar viewpoints. Users may send
tweets to their followers right away. Re-tweeting allows the content to reach a wider
audience. During live events such as sports or award ceremonies, the number of
tweets spikes. Smartphones and PCs can both access Twitter. Paid promotions may
result in significant income creation as well as an increase in product sales. Students
may use Twitter to learn more about the subjects that are covered in class. The
message that is shared with followers is referred to as a tweet. A tweet should be
short and to the point, with a maximum of 280 characters. The hashtag (#) is used to
locate and follow a certain subject. When a hashtag becomes popular, it is referred to
as a trending topic. Twitter connections run in two directions, meaning that a person
may have both followers and accounts that they follow. If you follow someone on
Twitter, you will be able to view all of their tweets if the account is public, but this
does not imply that he or she will be able to see your tweets. If that person follows
you back, they will be able to view your tweets. A Twitter bot is a piece of software
that automatically sends tweets to users. Bots are created to perform tasks such as
spamming:
1) To disseminate rumors and incorrect information.
2) To damage someone's reputation.
3) To steal credentials through forged communications.
4) To lead users to fraudulent websites.
5) To alter the perspective of a person or a group, for instance by
influencing popularity.

1.1 PURPOSE
Undoubtedly, social media, such as Facebook and Twitter, constitute a major
part of our everyday life due to the incredible possibilities they offer to their users.
However, Twitter and generally online social networks (OSNs) are increasingly used
by automated accounts, widely known as bots, due to their immense popularity
across a wide range of user categories. Their main purpose is the dissemination of
fake news, the promotion of specific ideas and products, the manipulation of the
stock market and even the diffusion of sexually explicit material. Therefore, the early
detection of bots in social media is quite essential. In this paper, two methods are
introduced targeting this that are mainly based on Natural Language Processing
(NLP) to distinguish legitimate users from bots. In the first method, a feature
extraction approach is proposed for identifying accounts posting automated
messages. After applying feature selection techniques and dealing with imbalanced
datasets, the subset of features selected is fed in machine learning algorithms. In the
second method, a deep learning architecture is proposed to identify whether tweets
have been posted by real users or generated by bots. To the best of the authors’
knowledge, there is no prior work on using an attention mechanism for identifying
bots. The introduced approaches have been evaluated over a series of experiments
using two large real Twitter datasets and demonstrate valuable advantages over other
existing techniques targeting the identification of malicious users in social media.
1.2 SCOPE
Twitter is one of the fastest methods of information transfer. It significantly
influences how individuals think. A growing number of people on Twitter mask their
identities for malicious reasons. Because this poses a risk to other users, it is
important to recognize Twitter bots. It is therefore crucial that tweets are
posted by real people and not by Twitter bots. A Twitter bot posts spam-related
content, so identifying bots also helps in identifying spam messages.

1.3 MODEL DIAGRAM/OVERVIEW
Fig 1.1 Model Diagram

2. LITERATURE SURVEY
“Using machine learning to detect fake identities: bots vs humans”
A growing number of people use social media platforms (SMPs) to
maintain accounts while concealing their identities in order to do harm.
Unfortunately, relatively little research has been done to date to identify
human-created false identities, particularly with regard to SMPs. On the other
hand, there are numerous instances where false accounts created by bots or
computers have been effectively identified using machine learning models. In the
case of bots, these machine learning models relied on engineered features such as the
"friend-to-followers ratio." These features were derived from attributes that are
directly available in account profiles on SMPs, such as "friend-count" and
"follower-count." The study covered in this paper attempts to improve the
identification of fake human identities on SMPs by applying the same
engineered features to a set of fake human accounts.
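As an illustration of how such an engineered feature can be derived from raw profile attributes, here is a minimal sketch. The function name `friend_to_follower_ratio`, the sample accounts, and the way a zero follower count is handled are assumptions made for this example; they are not taken from the cited paper.

```python
def friend_to_follower_ratio(friend_count: int, follower_count: int) -> float:
    """Engineered feature built from two raw profile attributes."""
    # Assumption: an account with no followers is treated as having one,
    # which avoids division by zero.
    return friend_count / max(follower_count, 1)


# Hypothetical accounts used only to demonstrate the calculation
accounts = [
    {"screen_name": "user_a", "friend_count": 150, "follower_count": 300},
    {"screen_name": "user_b", "friend_count": 2000, "follower_count": 10},
]

for account in accounts:
    account["ratio"] = friend_to_follower_ratio(
        account["friend_count"], account["follower_count"]
    )
    print(account["screen_name"], account["ratio"])
```

An account that follows many users but has few followers gets a high ratio, which is the kind of signal a classifier can combine with other attributes.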
“Real-time detection of content polluters in partially observable Twitter
networks”
A well-known issue for event prediction, election forecasting, and
distinguishing real news from fake news in social media data is the presence of
content polluters, or bots, that hijack a conversation for political or commercial goals.
Identifying this kind of bot is particularly challenging, because state-of-the-art
techniques use vast amounts of network data as features for machine learning
models, and such datasets are typically not readily available in applications that
stream social media data for real-time event prediction. In this study, we create a
strategy for identifying content polluters in real-time streamed social media datasets.
By applying our approach to the problem of predicting civil unrest events in
Australia, we identify content polluters from individual tweets, without collecting
social network or historical data from individual accounts. In our dataset, we find
certain peculiar characteristics of these bots, and we propose metrics for identifying
such accounts. We then pose several research questions about this type of bot
detection, such as how effective Twitter is at identifying content polluters and how
well state-of-the-art techniques perform at identifying bots in our dataset.

“Detecting Fake Followers in Twitter: A Machine Learning Approach”
A new spam business has emerged as a result of Twitter's popularity.
This market offers a variety of services, such as the sale of fake accounts, affiliate
schemes that help spread Twitter spam, and a group of spammers who carry out
large-scale spam campaigns. Twitter users have also begun to purchase fake
followers for their profiles. In this work, we present machine learning
techniques that we have developed to identify fake Twitter followers. We
manually checked 13000 purchased fake followers and 5386 real followers on an
account we set up for the study. Then, we determined a number of characteristics
that set fake followers apart from real ones. These characteristics served as attributes
for the machine learning algorithms that we used to categorize users as fake or real.
Some of the machine learning algorithms we used achieved high detection accuracy,
while others achieved low accuracy.
“I Spot a Bot: Building a binary classifier to detect bots on Twitter”
According to estimates, up to 50% of Twitter activity is
generated by bots: algorithmically automated accounts intended to
advertise goods, disseminate spam, or influence public opinion. According to
studies, up to 20% of Twitter activity related to the 2016 U.S.
presidential election came from accounts that were suspected to be bots. There
is also evidence that bots were used to spread false information about French
presidential candidate Emmanuel Macron and to escalate a recent conflict in
Qatar. Detecting bots is necessary to identify bad actors in the "Twitterverse" and
to protect real users from false information and malicious intent. Although there
has been research in this field for a while, algorithms today still perform worse than
people do. The goal of our research was to create a binary classifier that can
determine whether a given Twitter user is a "bot" or a "human" based on their
profile and tweet history. The end-user application for a classifier like this one would
be a web browser plug-in that can evaluate a specific account in real time.
The Twitter API provides all of the raw data needed to classify a public Twitter
account using our algorithm, and our functional prototype software, check
screenname.py, uses the API to quickly classify a given Twitter user handle. In our
view, the typical Twitter user badly needs a product like this.

“Bot spammer detection in Twitter using tweet similarity and time interval
entropy”
Due to Twitter's popularity, a lot of spam has been distributed by
spammers. According to preliminary investigations, the majority of spam messages
are generated automatically by bots. Consequently, detecting bot
spammers can drastically lower the volume of spam messages on Twitter.
To the best of our knowledge, however, not many studies have concentrated
on identifying Twitter bot spammers. This study therefore proposes a novel
method that uses tweet similarity and time interval entropy to distinguish
between bot spammers and legitimate human accounts. Timestamp collections are
used to determine each user's time interval entropy. The calculation of tweet
similarity is based on unigram matching. Twitter datasets comprising both
legitimate and spammy accounts were crawled. The results of the experiment showed
that legitimate users may post tweets at regular intervals, as bot spammers do.
Several legitimate users were also found to post identical tweets.
As a result, it is less effective to identify bot spammers using just one of
those features. However, combining the two features produces a better classification
result. The proposed method's precision, recall, and f-measure were 85.71%, 94.74%,
and 90%, respectively. In terms of precision, recall, and f-measure, it performs better
than methods that use only time interval entropy or only tweet similarity.
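The two features described in this abstract can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function names, the use of Shannon entropy over exact interval values, and the Jaccard overlap used for unigram matching are all assumptions chosen for this example.

```python
import math
from collections import Counter


def time_interval_entropy(timestamps):
    """Shannon entropy (in bits) of the gaps between consecutive tweets.

    `timestamps` are posting times in seconds. Assumption: identical gaps
    are grouped together without any binning.
    """
    ordered = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not intervals:
        return 0.0
    total = len(intervals)
    counts = Counter(intervals)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def unigram_similarity(tweet_a, tweet_b):
    """Share of distinct words that two tweets have in common (Jaccard overlap)."""
    words_a = set(tweet_a.lower().split())
    words_b = set(tweet_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical posting times: one account posts every 60 seconds, the other irregularly
regular_times = [0, 60, 120, 180, 240]
irregular_times = [0, 45, 400, 410, 2000]

print(time_interval_entropy(regular_times))
print(time_interval_entropy(irregular_times))
print(unigram_similarity("win a free phone", "win a free laptop"))
```

Perfectly regular posting yields low entropy and near-duplicate tweets yield high similarity; as the abstract notes, either signal alone can also occur for legitimate users, which is why the two are combined.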
“Tweets as impact indicators: Examining the implications of automated bot
accounts on Twitter” (Journal of the Association for Information Science and
Technology)
This brief communication presents preliminary findings on automated Twitter
accounts distributing links to scientific articles deposited on the preprint repository
arXiv. It discusses the implication of the presence of such bots from the perspective
of social media metrics (altmetrics), where mentions of scholarly documents on
Twitter have been suggested as a means of measuring impact that is both broader and
timelier than citations. Our results show that automated Twitter accounts create a
considerable number of tweets to scientific articles and that they behave differently
than common social bots, which has critical implications for the use of raw tweet
counts in research evaluation and assessment. We discuss some definitions of
Twitter cyborgs and bots in scholarly communication and propose distinguishing
between different levels of engagement, that is, differentiating between tweeting only
bibliographic information and discussing or commenting on the content of a scientific
work.

2.1 TECHNOLOGIES USED
2.1.1 PYTHON
Python is an interpreted high-level programming language for general-purpose
programming. Created by Guido van Rossum and first released in 1991, Python has a
design philosophy that emphasizes code readability, notably using significant
whitespace. Python features a dynamic type system and automatic memory
management. It supports multiple programming paradigms, including object-
oriented, imperative, functional and procedural, and has a large and comprehensive
standard library.
• Python is Interpreted − Python is processed at runtime by the interpreter. You do
not need to compile your program before executing it. This is similar to PERL
and PHP.
• Python is Interactive − You can sit at a Python prompt and interact with
the interpreter directly to write your programs.
• Python is one of the most widely used multi-purpose, high-level programming
languages.
• Python allows programming in Object-Oriented and Procedural paradigms.
• One of the biggest strengths of Python is its large collection of standard and
third-party libraries, which can be used for the following:
   • Machine Learning
   • GUI Applications (like Kivy, Tkinter, PyQt etc.)
   • Test frameworks
   • Scientific computing
   • Pre-processing and many more.

Features of Python
There are many features in python, some of which are discussed below -
• Easy to Learn and Use:
Python is easy to learn compared with other programming languages. Its syntax is
straightforward and reads much like the English language. Semicolons and curly
brackets are not used; indentation defines the code block. It is often
recommended as a programming language for beginners.
• Expressive Language:
Python can perform complex tasks using a few lines of code. As a simple example, the
hello world program is just print("Hello World"). It takes only one line,
while Java or C takes multiple lines.
• Interpreted Language:
Python is an interpreted language, which means a Python program is executed one
line at a time. Being interpreted makes programs easy to debug and
portable.
• Cross-platform Language:
Python can run equally on different platforms such as Windows, Linux, UNIX, and
Macintosh, so we can say that Python is a portable language. It enables
programmers to develop software for several competing platforms by writing a
program only once.
• Object-Oriented Language:
Python supports object-oriented programming, including the concepts of classes and
objects. It supports inheritance, polymorphism, encapsulation, etc. The
object-oriented approach helps the programmer write reusable code and develop
applications with less code.
• Large Standard Library:
Python provides a vast range of libraries for fields such as machine learning,
web development, and scripting. In addition to the standard library, there are
third-party machine learning and data libraries such as TensorFlow, Pandas, NumPy,
Keras, and PyTorch. Django, Flask, and Pyramid are popular frameworks for Python
web development.
• Embeddable:
Code written in other programming languages can be used from Python source code,
and the Python interpreter can be embedded in programs written in other
programming languages.
• Integrated:
It can be easily integrated with languages like C, C++, and JAVA, etc.
• Extensible:
Code written in other languages such as C/C++ can be compiled and then used
from our Python code. Python itself converts a program into byte
code, and any platform with a Python interpreter can run that byte code.
Libraries in Python
• TensorFlow
TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also
used for machine learning applications such as neural networks. It is used for both
research and production at Google.
• Numpy
Numpy is a general-purpose array-processing package. It provides a high-
performance multidimensional array object, and tools for working with these arrays.
It is the fundamental package for scientific computing with Python. It contains
various features including these important ones:
   • A powerful N-dimensional array object.
   • Sophisticated (broadcasting) functions.
   • Tools for integrating C/C++ and Fortran code.
   • Useful linear algebra, Fourier transform, and random number capabilities.
Besides its obvious scientific uses, Numpy can also be used as an efficient
multidimensional container of generic data.
• Pandas
Pandas is an open-source Python library providing high-performance data
manipulation and analysis tools built on its powerful data structures. Before Pandas,
Python was mainly used for data munging and preparation and contributed little to
data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish
five typical steps in the processing and analysis of data, regardless of the origin of
the data: load, prepare, manipulate, model, and analyze.
• Matplotlib
Matplotlib is a Python 2D plotting library which produces publication quality
figures in a variety of hardcopy formats and interactive environments across
platforms. Matplotlib can be used in Python scripts, the Python and IPython shells,
the Jupyter Notebook, web application servers, and four graphical user interface
toolkits. Matplotlib tries to make easy things easy and hard things possible. You can
generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc.,
with just a few lines of code.
• Scikit-learn
Scikit-learn provides a range of supervised and unsupervised learning algorithms via
a consistent interface in Python. It is licensed under a permissive simplified BSD
license and is distributed under many Linux distributions, encouraging academic and
commercial use.
• Tkinter
Tkinter is the standard GUI library for Python. Python, when combined with Tkinter,
provides a fast and easy way to create GUI applications. Tkinter provides a powerful
object-oriented interface to the Tk GUI toolkit.

2.1.2 MACHINE LEARNING
Machine learning is often categorized as a subfield of artificial intelligence,
but I find that categorization can often be misleading at first brush. The study of
machine learning certainly arose from research in this context, but in the data science
application of machine learning methods, it's more helpful to think of machine
learning as a means of building models of data. Fundamentally, machine learning
involves building mathematical models to help understand data. "Learning" enters
the fray when we give these models tunable parameters that can be adapted to
observed data; in this way the program can be considered to be "learning" from the
data. Once these models have been fit to previously seen data, they can be used to
predict and understand aspects of newly observed data. I'll leave to the reader the
more philosophical digression regarding the extent to which this type of
mathematical, model-based "learning" is similar to the "learning" exhibited by the
human brain. Understanding the problem setting in machine learning is essential to
using these tools effectively.
Terminologies of Machine Learning
• Model – A model is a specific representation learned from data by applying
some machine learning algorithm. A model is also called a hypothesis.
• Feature – A feature is an individual measurable property of the data. A set of
numeric features can be conveniently described by a feature vector. Feature
vectors are fed as input to the model. For example, in order to predict a fruit,
there may be features like color, smell, taste, etc.
• Target (Label) – A target variable or label is the value to be predicted by our
model. For the fruit example discussed in the feature section, the label with each
set of input would be the name of the fruit like apple, orange, banana, etc.
• Training – The idea is to give a set of inputs (features) and its expected
outputs (labels), so after training, we will have a model (hypothesis) that will then
map new data to one of the categories trained on.
• Prediction – Once our model is ready, it can be fed a set of inputs to which it
will provide a predicted output (label).
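The terms above map directly onto a few lines of scikit-learn code. The sketch below is only an illustration: the two numeric features (weight and surface smoothness), the sample values, and the choice of a decision tree are assumptions made for this example, standing in for the color, smell, and taste features mentioned in the text.

```python
from sklearn.tree import DecisionTreeClassifier

# Features: one feature vector per fruit,
# [weight in grams, surface smoothness from 0 (rough) to 1 (smooth)]
features = [[150, 0.9], [170, 0.8], [140, 0.2], [130, 0.1]]

# Targets (labels): the fruit name that belongs to each feature vector
labels = ["apple", "apple", "orange", "orange"]

# Training: fit a model (hypothesis) to the feature/label pairs
model = DecisionTreeClassifier(random_state=0)
model.fit(features, labels)

# Prediction: feed a new, unseen feature vector to the trained model
predicted = model.predict([[160, 0.85]])
print(predicted[0])
```

Any other scikit-learn classifier could be swapped in here, because they all expose the same `fit` and `predict` interface.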

Applications of Machine Learning
Machine learning is a rapidly growing technology. It is used to solve
many complex real-world problems that cannot be solved with traditional
approaches.
• Emotion & Sentiment analysis
• Error detection and prevention
• Weather forecasting and prediction
• Stock market analysis and forecasting
• Speech recognition
• Object recognition
• Fraud detection & prevention
• Recommendation of products to customers in online shopping

3. SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
Due to the vast array of opportunities they present, social media platforms like
Facebook and Twitter have become deeply embedded in our daily lives. However,
due to the widespread appeal of Twitter and other OSNs, automated accounts,
commonly referred to as bots, are rapidly making use of them. The training data has
many attributes, and the required features are extracted using the Spearman
correlation method. The existing system uses the Random Forest algorithm. Random
Forest is a classifier that builds a number of decision trees on various subsets of the
given dataset and combines their outputs (by averaging or majority vote) to improve
predictive accuracy.
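Feature selection with Spearman correlation can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the toy values, the `THRESHOLD` of 0.5, and the rule of keeping attributes by absolute correlation with the `bot` label are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror typical Twitter account attributes
data = pd.DataFrame({
    "followers_count": [10, 25, 3000, 4500, 15, 5200],
    "friends_count": [900, 1200, 300, 250, 1500, 200],
    "statuses_count": [50, 400, 120, 90, 300, 210],
    "bot": [1, 1, 0, 0, 1, 0],
})

# Spearman rank correlation of every attribute with the label
correlations = data.corr(method="spearman")["bot"].drop("bot")
print(correlations)

# Keep the attributes whose absolute correlation exceeds an assumed threshold
THRESHOLD = 0.5
selected = correlations[correlations.abs() > THRESHOLD].index.tolist()
print(selected)
```

Because Spearman correlation works on ranks, it captures monotonic relationships even when attributes such as follower counts are heavily skewed.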
3.1.1 Disadvantages
• Low security
3.2 PROBLEM STATEMENT

To assist human users in identifying who they are interacting with, our project
focuses on the classification of human and bot accounts on Twitter, as the service
could lose users who are annoyed, concerned, or even harmed by bots. We use a
combination of features extracted from a user’s account to determine the likelihood
of the account being a human or a bot.
3.3 PROPOSED SYSTEM

The training data has numerous attributes. The necessary
features are extracted using the Spearman correlation technique, and the Logistic
Regression machine learning algorithm is used for classification. Logistic regression
is used to describe data and to explain the relationship between one dependent binary
variable and one or more nominal, ordinal, interval or ratio-level independent
variables. Real-time data are used to evaluate the learning model. The data are
preprocessed using pandas (a tool for pre-processing), and zero values are
eliminated. The model is trained on the dataset, and the test
data set is the actual Twitter data. The output is either 0 or 1. In our research we
developed an algorithm that identifies Twitter bots and malicious URLs. The URL
prediction accuracy is 73%, and the Logistic Regression model achieves an accuracy
of 74%. Thus, the algorithms were applied to real-time data, and the Twitter
bots were detected effectively.
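The pre-processing, training, and prediction steps described above can be sketched as follows. This is a minimal example under stated assumptions rather than the project's code: the toy data, the two chosen attributes, the split parameters, and the interpretation of "zero values" as missing entries removed with `dropna` are all hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data; a real experiment would load the Twitter dataset instead
frame = pd.DataFrame({
    "followers_count": [5, 12, 3000, None, 8, 4100, 20, 5200, 15, 2800, 9, 3600],
    "friends_count": [900, 1500, 300, 400, 1100, 250, 1300, 200, 1000, 350, 1200, 280],
    "bot": [1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Pre-processing: drop rows with missing values (assumed meaning of "zero values")
frame = frame.dropna()

X = frame[["followers_count", "friends_count"]]
y = frame["bot"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Training: fit the logistic regression classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Prediction: every output is 0 (human) or 1 (bot)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(predictions, accuracy)
```

The accuracy printed here reflects only the toy data and says nothing about the figures reported for the real dataset.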

3.3.1 Advantages:
• High security
• High accuracy
• High efficiency

4. SYSTEM REQUIREMENT SPECIFICATION
FUNCTIONAL REQUIREMENTS
• Detecting malicious twitter bots.
NON-FUNCTIONAL REQUIREMENTS
• Accuracy (to detect the malicious twitter bots).
• The project should be portable, i.e., can be run on any device that has python
installed on it.
SOFTWARE REQUIREMENTS
• Operating system - Windows 10
• Programming language - PYTHON
HARDWARE REQUIREMENTS
• Processor - Intel core i3 or higher
• Speed - 1.1 GHz
• Ram - 4 GB or higher
• Hard Disk - 500 GB

5. SYSTEM DESIGN
5.1 SYSTEM ARCHITECTURE
Fig 5.1: System architecture

The first step of the proposed system is to obtain tweets, either from the Twitter
API or from an offline Kaggle dataset. We then search the extracted tweets for URLs.
If a URL is found in the tweet text, we compare the account IDs; if they match, the
Twitter account is identified as a bot that is posting malicious URLs.

5.2 SYSTEM COMPONENTS
The system has the following components:
1. Upload tweets dataset:
Upload is one of the modules in our project. First, we have to upload the Twitter
dataset of the project into our system.
2. Extract tweets:
Using this module, we extract tweets either online or offline. Because an internet
connection is not always available, we use an offline Kaggle tweets dataset. This
module reads (extracts) all tweets from the dataset. If we are
downloading tweets online, then we need a WOEID from Twitter.
3. Recognize twitter bots using Machine learning:
In this module we extract features from tweets such as Activity, Anonymity,
and Amplification. Activity refers to the tweet frequency, Anonymity
refers to the account information, and Amplification refers to the retweet count.
Using these three concepts, we check whether an account is normal or a bot.
4. Recognize malicious URLs using Machine Learning:
Using this module, we analyze all tweets and check whether a tweet contains a large
number of URLs; if it does, the URLs are considered malicious by the ML model.
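The URL check in the fourth module can be illustrated with a short sketch. It is a plain rule-based illustration, not the machine learning model the project uses, and the regular expression, the `MAX_URLS_PER_TWEET` threshold, and the function names are assumptions made for this example.

```python
import re

# Simple pattern for http/https links; real tweets may need a stricter pattern
URL_PATTERN = re.compile(r"https?://\S+")

# Assumed threshold: more than this many URLs in one tweet is treated as suspicious
MAX_URLS_PER_TWEET = 1


def count_urls(tweet_text):
    """Number of URLs found in the text of a single tweet."""
    return len(URL_PATTERN.findall(tweet_text))


def find_suspicious_tweets(tweets, max_urls=MAX_URLS_PER_TWEET):
    """Tweets that contain more URLs than the allowed maximum."""
    return [tweet for tweet in tweets if count_urls(tweet) > max_urls]


# Hypothetical tweets used only to demonstrate the check
tweets = [
    "Great match today!",
    "Read our blog http://example.com/post",
    "Win now http://example.org/a http://example.org/b https://example.net/c",
]

suspicious = find_suspicious_tweets(tweets)
print(suspicious)
```

In a learned model, the per-tweet URL count would typically be one input feature among several rather than a hard cut-off.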

5.3 UML DIAGRAMS
Fig 5.2: Use case Diagram of user

Fig 5.3: Sequence diagram of user

Fig 5.4: Class diagram of user

Fig 5.5: Collaboration diagram of user

Fig 5.6: Activity Diagram of user

6. IMPLEMENTATION
6.1 SAMPLE CODE
import re
import tkinter
from collections import defaultdict
from tkinter import *
from tkinter import filedialog

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Designing the main screen
main = tkinter.Tk()
main.title("Detecting Malicious Twitter Bots Using Machine Learning")
main.geometry("1300x1200")

# Shared state: path of the selected CSV file and the loaded DataFrame
filename = None
dataset = None

# Words treated as indicators of bot accounts
words = ['bot', 'cannabis', 'tweetme', 'mishear', 'followme', 'updates',
         'every', 'gorilla', 'forget']


def getFrequency(bow):
    # Total number of occurrences of the bot words in the bag of words
    count = 0
    for word in words:
        if word in bow:
            count = count + bow.get(word)
    return count


def uploadDataset():
    # Ask the user for the dataset file and remember its path
    global filename
    text.delete('1.0', END)
    filename = filedialog.askopenfilename(initialdir="Dataset")
    text.insert(END, filename + " loaded\n\n")


def runModule1():
    # Module 1: read the tweets dataset and display it
    global dataset
    dataset = pd.read_csv(filename)
    text.insert(END, str(dataset))


def runModule2():
    # Module 2: list possible bot accounts, then train logistic regression
    train = dataset[['screen_name', 'status', 'name', 'followers_count',
                     'friends_count', 'listedcount', 'favourites_count',
                     'statuses_count', 'verified']]
    details = train.values
    text.insert(END, "Possible BOT users\n\n")
    users = []
    for i in range(len(details)):
        screen = details[i, 0]
        status = details[i, 1]
        name = details[i, 2]
        followers = int(details[i, 3])
        listed = int(details[i, 5])
        verified = details[i, 8]
        if not verified:  # check user not verified
            bow = defaultdict(int)  # bag of words
            # checking screen name, name and tweets
            data = str(screen) + " " + str(name) + " " + str(status)
            data = data.lower().strip("\n").strip()
            data = re.findall(r'\w+', data)
            for j in range(len(data)):
                bow[data[j]] += 1  # adding each word frequency to bag of words
            frequency = getFrequency(bow)  # getting frequency of BOTS words
            # if the condition is true then the account is a possible bot
            if frequency > 0 and listed < 16000 and followers < 200:
                users.append(screen)
    text.insert(END, str(users) + "\n")

    train_attr = dataset[['followers_count', 'friends_count', 'listedcount',
                          'favourites_count', 'statuses_count', 'verified']]
    train_label = dataset[['bot']]
    X = train_attr
    # DataFrame.as_matrix() was removed in pandas 1.0; use a flat NumPy array
    Y = train_label.values.ravel()
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    logreg = LogisticRegression().fit(X_train, y_train)  # logistic regression object
    actual = y_test
    pred = logreg.predict(X_test)
    accuracy = accuracy_score(actual, pred) * 100
    precision = precision_score(actual, pred) * 100
    recall = recall_score(actual, pred) * 100
    f1 = f1_score(actual, pred)
    auc = roc_auc_score(actual, pred)
    text.insert(END, '\nLogistic Regression Accuracy : ' + str(accuracy) + "\n")
    text.insert(END, 'Logistic Regression Precision : ' + str(precision) + "\n")
    text.insert(END, 'Logistic Regression Recall is : ' + str(recall) + "\n")
    text.insert(END, 'Logistic Regression Area Under Curve is : ' + str(auc))

    # ROC curve: blue is the classifier, red dashed is the chance diagonal
    fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
    auc = metrics.auc(fpr, tpr)
    plt.title('ROC')
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % auc)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([-0.1, 1.2])
    plt.ylim([-0.1, 1.2])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()


def runModule3():
    # Module 3: flag tweets that contain a URL, then train logistic regression
    urls = []
    details = dataset.values
    for i in range(len(details)):  # checking URLS in tweets
        tweets = details[i, 14]  # column index 14 is read as the tweet text
        if 'http' in str(tweets):
            urls.append(1)
        else:
            urls.append(0)
    # copy() so that adding a column does not write into a view of the dataset
    train_attr = dataset[['followers_count', 'friends_count', 'listedcount',
                          'favourites_count', 'statuses_count', 'verified']].copy()
    train_attr["URLS"] = urls  # adding URLS to training dataset
    text.insert(END, str(train_attr))
    train_label = dataset[['bot']]
    X = train_attr
    Y = np.asarray(train_label).ravel()
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    logreg = LogisticRegression().fit(X_train, y_train)  # logistic regression object
    actual = y_test
    pred = logreg.predict(X_test)
    accuracy = accuracy_score(actual, pred) * 100
    precision = precision_score(actual, pred) * 100
    recall = recall_score(actual, pred) * 100
    f1 = f1_score(actual, pred)
    auc = roc_auc_score(actual, pred)
    text.insert(END, '\nLogistic Regression Accuracy : ' + str(accuracy) + "\n")
    text.insert(END, 'Logistic Regression Precision : ' + str(precision) + "\n")
    text.insert(END, 'Logistic Regression Recall is : ' + str(recall) + "\n")
    text.insert(END, 'Logistic Regression Area Under Curve is : ' + str(auc))

    # ROC curve: blue is the classifier, red dashed is the chance diagonal
    fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
    auc = metrics.auc(fpr, tpr)
    plt.title('ROC')
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % auc)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([-0.1, 1.2])
    plt.ylim([-0.1, 1.2])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()


# Title bar
font = ('times', 16, 'bold')
title = Label(main, text='Detecting Malicious Twitter Bots Using Machine Learning')
title.config(bg='goldenrod2', fg='black')
title.config(font=font)
title.config(height=3, width=120)
title.place(x=0, y=5)

# Output area
font1 = ('times', 12, 'bold')
text = Text(main, height=20, width=150)
scroll = Scrollbar(text)
text.configure(yscrollcommand=scroll.set)
text.place(x=50, y=120)
text.config(font=font1)

# Buttons
font1 = ('times', 13, 'bold')
uploadButton = Button(main, text="Upload Tweets Dataset",
                      command=uploadDataset, bg='#ffb3fe')
uploadButton.place(x=50, y=550)
uploadButton.config(font=font1)
module1Button = Button(main, text="Run Module 1 (Extract Tweets)",
                       command=runModule1, bg='#ffb3fe')
module1Button.place(x=450, y=550)
module1Button.config(font=font1)
module2Button = Button(main, text="Run Module 2 (Recognize Twitter Bots using ML)",
                       command=runModule2, bg='#ffb3fe')
# Positions of the Module 2 and Module 3 buttons are assumed values
module2Button.place(x=50, y=600)
module2Button.config(font=font1)
module3Button = Button(main, text="Run Module 3 (Recognize Malicious URLS using ML)",
                       command=runModule3, bg='#ffb3fe')
module3Button.place(x=450, y=600)
module3Button.config(font=font1)

main.config(bg='SpringGreen2')
main.mainloop()
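
The keyword rule inside runModule2 can also be written as a pure function with no Tkinter or dataset dependency, which makes it easier to reuse and to test. The sketch below mirrors the word list and the thresholds of the listing (`listed < 16000`, `followers < 200`); the function names `bot_word_frequency` and `is_possible_bot` and the two sample accounts are hypothetical.

```python
import re
from collections import Counter

# Same word list as the listing
BOT_WORDS = ['bot', 'cannabis', 'tweetme', 'mishear', 'followme',
             'updates', 'every', 'gorilla', 'forget']


def bot_word_frequency(screen_name, name, status):
    """Count how often the bot words occur as whole tokens."""
    tokens = re.findall(r'\w+', f"{screen_name} {name} {status}".lower())
    counts = Counter(tokens)
    return sum(counts[word] for word in BOT_WORDS)


def is_possible_bot(screen_name, name, status, followers, listed, verified):
    """Apply the rule from runModule2 to a single account."""
    if verified:
        return False
    frequency = bot_word_frequency(screen_name, name, status)
    return frequency > 0 and listed < 16000 and followers < 200


# "daily_bot" is a single token (underscore is a word character), so only
# "updates" and "every" are counted for this account
flagged = is_possible_bot("daily_bot", "Daily Updates", "updates every hour",
                          followers=50, listed=3, verified=False)
not_flagged = is_possible_bot("jane_doe", "Jane Doe", "lunch with friends",
                              followers=50, listed=3, verified=False)
print(flagged, not_flagged)
```

Because the match is on whole tokens, a screen name such as `daily_bot` does not count as containing the word `bot`; whether that is the intended behaviour is a design decision the rule leaves open.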

7. SYSTEM TESTING
7.1 INTRODUCTION TO TESTING
Testing is the process in which test data is prepared and used to test the
modules individually, after which the validation provided for the fields is checked.
System testing then takes place to make sure that all components of the system
function properly as a unit. The test data should be chosen so that it passes through
all possible conditions. Testing is the stage of implementation aimed at ensuring
that the system works accurately and efficiently before actual operation
commences. The following is a description of the testing strategies that were
carried out during the testing period.
7.2 TESTING STRATEGIES

Field testing will be performed manually and functional tests will be written
in detail. There are various types of testing strategies, and each test type addresses
a specific testing requirement. A few of them are described below.
Unit Testing
Unit Testing is a software testing technique in which individual units of
software, i.e., groups of computer program modules, usage procedures and
operating procedures, are tested to determine whether they are suitable for use.
Every independent module is tested by the developer to determine whether there is
an issue, so unit testing is concerned with the functional correctness of the
independent modules. Unit testing of the software product is carried out during the
development of an application.
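
As an illustration of unit testing in this project, the URL check from Module 3 can be isolated and exercised with Python's built-in `unittest` module. The helper name `contains_url` and the three test cases are hypothetical; the check itself (`'http' in str(tweet)`) is the one used in the sample code of Section 6.1.

```python
import unittest


def contains_url(tweet):
    """Return 1 if the tweet text contains a link, else 0 (as in Module 3)."""
    return 1 if 'http' in str(tweet) else 0


class ContainsUrlTest(unittest.TestCase):
    def test_tweet_with_link(self):
        self.assertEqual(contains_url("see https://example.com"), 1)

    def test_tweet_without_link(self):
        self.assertEqual(contains_url("no links here"), 0)

    def test_missing_tweet(self):
        # pandas represents an empty cell as NaN, which is a float
        self.assertEqual(contains_url(float("nan")), 0)


# Run the tests without exiting the interpreter
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ContainsUrlTest)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Keeping the check in a small function of its own is what makes it testable without loading the dataset or opening the GUI.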
System Testing
Testing has become an integral part of any system or project, especially in the
field of information technology. Before developed software is given to the user, it
must be tested to check whether it serves the purpose for which it was developed.
This testing involves various types of tests through which one can ensure that
the software is reliable. The program was tested logically, and the pattern of
execution of the program was repeated for a set of data.

Module Testing
To locate errors, each module is tested individually. This enables us to detect
an error and correct it without affecting any other module. Whenever the program
does not satisfy the required function, it must be corrected to get the required
result. Thus, all the modules are tested individually from the bottom up, starting
with the smallest and lowest modules and proceeding to the next level. For example,
the job classification module is tested separately: it is tested with different jobs
and their approximate execution times, and the result of the test is compared with
results prepared manually. The comparison shows that the proposed system works
more efficiently than the existing system. In this system the resource classification
and job scheduling modules are tested separately and their corresponding results are
obtained, which reduces the process waiting time.
Integration Testing
After module testing, integration testing is applied. Errors may occur when the
modules are linked, and this testing is used to find and correct them. In this system
all modules are connected and tested together. The test results are correct; thus,
the mapping of jobs with resources is done correctly by the system.
Acceptance Testing
When the user finds no major problems with its accuracy, the system passes
through a final acceptance test. This test confirms that the system meets the
original goals, objectives and requirements established during analysis, which
eliminates wastage of time and money. Acceptance tests rest on the shoulders of
users and management; once they are passed, the system is finally acceptable and
ready for operation.

7.3 TEST CASES
Table 7.1: Upload tweets dataset
Test Case#: 1              Priority (h, l): high
Test Objective             Upload tweets dataset
Test Description           Test whether the tweets dataset is uploaded into the system or not
Requirements Verified      Dataset
Test environment           Command Prompt

Actions               Expected                        Actual
Data uploaded         Can do further operations       Operations are performed
Data not uploaded     Cannot do further operations    No operations are performed

Pass: Yes     Conditional pass:     Fail
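
Test case 1 could also be checked automatically. The following is a minimal sketch, assuming the dataset is a CSV file with a header row: the column names are the ones read by the sample code in Section 6.1, while the helper `missing_columns` and the in-memory sample files are hypothetical.

```python
import csv
import io

# Columns that the sample code reads from the dataset
REQUIRED_COLUMNS = ['screen_name', 'status', 'name', 'followers_count',
                    'friends_count', 'listedcount', 'favourites_count',
                    'statuses_count', 'verified', 'bot']


def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [column for column in REQUIRED_COLUMNS if column not in header]


# In-memory stand-ins for an acceptable and an incomplete dataset file
good_file = io.StringIO(",".join(REQUIRED_COLUMNS) + "\n")
bad_file = io.StringIO("screen_name,status\n")

print(missing_columns(good_file))  # empty list: all columns present
print(missing_columns(bad_file))   # the eight columns that are missing
```

With a real file, the same function could be called as `missing_columns(open(path, newline=""))` before the later modules are run.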
Table 7.2: Extract tweets
Test Case#: 2              Priority (h, l): high
Test Objective             Extract tweets
Test Description           Test whether the tweets are extracted from the uploaded dataset or not
Requirements Verified      Dataset

Actions               Expected                        Actual
Data uploaded         Can do further operations       Operations are performed and
                      and extract tweets              tweets are extracted
Data not uploaded     Cannot do further operations    No operations are performed

Table 7.3: Recognize Twitter bots using machine learning
Test Objective             Run logistic regression algorithm to recognize Twitter bots
Test Description           Verify whether the tweets are recognized or not using ML
Requirements Verified      Twitter Dataset

Actions               Expected                        Actual
Data uploaded         Recognition of Twitter bots     Recognition of bots is done and
                      will be done                    accuracy is displayed
Data not uploaded     Recognition of Twitter bots     Twitter bots are not recognized
                      cannot be done
Table 7.4: Recognize malicious bot URLs using machine learning
Test Objective             Recognize malicious bot URLs using ML
Test Description           Verify malicious bot URLs are recognized using ML
Requirements Verified      Twitter Dataset

Actions               Expected                        Actual
Data uploaded         Malicious Twitter bots          Malicious Twitter bots are
                      detection will be done          detected
Data not uploaded     Malicious bot URLs detection    Malicious bots are not detected
                      cannot be done

7.4 DISCUSSION OF RESULTS
Fig 7.1: Home screen

• Click on the ‘Upload Tweets Dataset’ button.
• The tweets dataset will be uploaded.

Fig 7.2: Uploading dataset screen
• The above figure shows the dataset file.
• Here we select the ‘kaggle_tweets.csv’ file and then click on the ‘Open’ button
to load the dataset and get the screen below.

Fig 7.3: Dataset loaded
• Here the tweets dataset has been uploaded.
• Now click on the ‘Run Module 1 (Extract Tweets)’ button to read all tweets from
the dataset and get the screen below.

Fig 7.4: Displaying tweets
• The above screen shows that all tweets have been read; a few of them are
displayed.
• Now click on the ‘Run Module 2 (Recognize Twitter Bots using ML)’ button to
recognize bot users and then apply logistic regression to calculate the bot
prediction accuracy.

Fig 7.5: Possible bot users
• In the above screen, we can see the screen names of all possible bot accounts.
• The accuracy of logistic regression is 74%, and the ROC graph is shown in the
screen below.

Fig 7.6: ROC graph
• In the above graph, the x-axis represents the False Positive Rate (the fraction
of normal accounts wrongly predicted as bots) and the y-axis represents the True
Positive Rate (the fraction of bots correctly predicted). The red dashed line is the
diagonal reference line of a random classifier and the blue line is the ROC curve of
the logistic regression model. The blue curve lies above the red line, so we can
conclude that the true positive rate is higher than the false positive rate.
• Now click on the ‘Run Module 3 (Recognize Malicious URLS using ML)’ button to
find malicious URLs and then calculate the malicious URL prediction accuracy.
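
The sample code passes hard 0/1 predictions to `roc_curve`, which gives at most one operating point between (0, 0) and (1, 1). A fuller curve can be drawn from the predicted probabilities instead. The sketch below shows the difference on synthetic data from scikit-learn's `make_classification`; the data, the `random_state` values and the variable names are assumptions for illustration and are not related to the 74% result reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the six account features (not the Kaggle dataset)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

hard_labels = model.predict(X_test)               # 0/1 predictions
bot_scores = model.predict_proba(X_test)[:, 1]    # probability of class 1

# ROC from hard labels: at most three points
fpr_hard, tpr_hard, _ = roc_curve(y_test, hard_labels)
# ROC from probabilities: one point per distinct threshold
fpr_soft, tpr_soft, _ = roc_curve(y_test, bot_scores)
auc_soft = roc_auc_score(y_test, bot_scores)

print("points from hard labels  :", len(fpr_hard))
print("points from probabilities:", len(fpr_soft))
print("AUC from probabilities   : %.3f" % auc_soft)
```

Plotting `fpr_soft` against `tpr_soft` with the same Matplotlib calls as in the sample code would give a smoother curve than the three-point line obtained from hard labels.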

Fig 7.7: Malicious bot users
• The last column of the above screen marks whether each tweet contains a URL.
• In the last column (URLS), 1 indicates that the tweet contains a URL, which is
treated as potentially malicious, and 0 indicates that it does not. The URL
prediction accuracy is 73%, and the URL prediction ROC graph is shown below.

Fig 7.8: ROC graph
• The above screen shows the ROC graph for malicious URL prediction.
• The blue ROC curve lies above the red reference line.

8. CONCLUSION AND FUTURE ENHANCEMENTS
8.1 CONCLUSION
On Twitter, bots are automated accounts that can do the same things as real
human beings, such as sending out tweets, following other users and retweeting
posts by others. Spam bots use these abilities to engage in harmful or annoying
activity. Our priority is therefore to detect spam bots, with the aim of “defeating
the spam bots and authenticating all humans”.
8.2 FUTURE ENHANCEMENTS

The completion of this work gave rise to a number of ideas for further research
on the subject of this report. Possible areas include:
• Expanding the created dataset with tweets regarding other trending topics on
Twitter. Trending topics are usually the topics that bots select to tweet about, in
order to conceal their intentions in the network.
• Integrating the proposed methodology into other social media platforms. It would
be interesting to observe the different features and behaviour of bots on other
platforms and examine the predictions that the developed model would produce.
• Incorporating graph-based features into the proposed methodology, for instance
features that could be extracted from a graph created locally around the account in
question, regarding its connections or interactions with other accounts in the
network.
• Collecting the input @usernames from the User Interface tool and refitting the
model with a new dataset.
• Conducting surveys about the User Interface, Bot Detective, to get feedback from
its users and improve the user experience accordingly.


