
Abstract

Email is now a faster means of communication than traditional channels. It is used to
exchange information between companies and between students and their teachers and
professors, to contact friends, and to apply for jobs, internships and scholarships.
Although email has helped people in many ways, unwanted and harmful email remains an
open problem that developers are still working to resolve. Email is a very common
channel for spreading viruses. Some email providers already scan messages to check
whether a virus is being sent on the user's behalf, but scanning is sometimes difficult.
Email providers also perform malware analysis, but as technology advances, hacking
grows with it and its techniques become more advanced, which creates difficulties for
email providers and hinders the security of user information. Phishing techniques are
also advancing rapidly, and emails that claim to come from a legitimate company in
order to create backdoors are becoming common. Scammers use these techniques to
swindle users into trusting them and handing over confidential information such as
bank account details and important passwords. While some emails do contain malware,
at other times an email that is neither junk nor harmful is flagged as unwanted by the
email provider; in such cases the user may miss notifications and new email updates and
so lose important information. We propose a smart system based on machine learning,
trained on datasets of senders with a record of scamming or spamming, phishing sites,
malicious software, files with harmful extensions, and suspicious links.

Introduction and Background

Previous Investigations

Leading email service providers such as Google, Outlook and Yahoo filter mail to
protect their users from malware and bots. They do this by passing messages through
spam filters. Existing machine-learning-based spam filters usually use the Naive Bayes,
Random Forest, Support Vector Machine (SVM) and Decision Tree algorithms.
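
To make the idea concrete, the following is a minimal sketch of how a Naive Bayes spam filter of the kind mentioned above can work. It is not taken from any provider's implementation: the class name NaiveBayesSpamFilter, the whitespace tokenizer and the four training messages are assumptions made purely for illustration, and only the Python standard library is used.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, messages, labels):
        # Count how often each word appears in spam and in ham messages.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.class_counts = Counter(labels)
        for text, label in zip(messages, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            # Start from the log prior of the class.
            score = math.log(self.class_counts[label] / total)
            n_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen during training
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen word/class pairs above zero.
                score += math.log((count + 1) / (n_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Toy, made-up training data; a real filter needs a large labelled dataset.
train_messages = [
    "win a free prize now",
    "claim your free money",
    "meeting agenda for monday",
    "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]

spam_filter = NaiveBayesSpamFilter().fit(train_messages, train_labels)
print(spam_filter.predict("free prize money"))
print(spam_filter.predict("monday report meeting"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.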

Proposed Research

We will combine general filtering techniques and previous research by different experts
with Machine Learning, training our model both on datasets available on the internet
and on data we collect ourselves. Data collection will begin with the creation of a word
dictionary, after which information about malicious programs can be gathered. One
proposed method for collecting data through practical dynamic analysis is to create
dummy email addresses and leave them on malicious websites. Dynamic analysis will
then be performed on the mail gathered, using a well-known and effective open source
malware analysis tool or website such as Cuckoo Sandbox or hybridanalysis.com.

Each mail will first be parsed into different parts, and each part will be analysed
separately. A Machine Learning model will then be trained on data about malicious
software, phishing sites and links, harmful extensions, and spam senders and scammers
with a previous record. The content will be analysed using a Python program and a
Machine Learning model trained on a dataset to check whether the mail contains any
unwanted or inappropriate words. The subject of the mail will be analysed separately,
using the same technique as for the content. If a file is attached to the mail, its
extension will be analysed using Python and a Machine Learning model trained on a
dataset of file extensions. The inside of the file will also be checked, because spam or
harmful content might be shared within it. Multiple Machine Learning algorithms from
previous research will be tried on each section of the mail, and the one with the highest
accuracy and precision will be applied in order to give efficient results. Data that needs
visualisation will also be visualised in order to give a better understanding of the results.
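
The parsing step described above can be sketched with Python's standard email package. This is a minimal illustration under stated assumptions, not the proposed system itself: the function names (split_mail, flag_words, flag_attachments, analyse_mail), the sample message, and the fixed SUSPICIOUS_WORDS and RISKY_EXTENSIONS sets are all hypothetical, and the fixed sets only stand in for the trained Machine Learning models and the word dictionary that the proposal calls for.

```python
import os
import re
from email import message_from_string
from email.message import EmailMessage
from email.policy import default

# Hypothetical stand-ins for the word dictionary and the extensions dataset.
SUSPICIOUS_WORDS = {"lottery", "winner", "password", "urgent"}
RISKY_EXTENSIONS = {".exe", ".scr", ".js", ".vbs", ".bat"}


def split_mail(raw_mail):
    """Parse a raw mail into subject, plain-text body and attachment names."""
    msg = message_from_string(raw_mail, policy=default)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    attachments = [part.get_filename() or "" for part in msg.iter_attachments()]
    return {
        "subject": str(msg.get("Subject", "")),
        "body": body,
        "attachments": attachments,
    }


def flag_words(text):
    """Return the suspicious words found in the text, sorted alphabetically."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(words & SUSPICIOUS_WORDS)


def flag_attachments(filenames):
    """Return attachment names whose final extension is considered risky."""
    return [
        name for name in filenames
        if os.path.splitext(name)[1].lower() in RISKY_EXTENSIONS
    ]


def analyse_mail(raw_mail):
    """Analyse subject, body and attachments separately."""
    parts = split_mail(raw_mail)
    return {
        "subject_flags": flag_words(parts["subject"]),
        "body_flags": flag_words(parts["body"]),
        "attachment_flags": flag_attachments(parts["attachments"]),
    }


# Build a made-up sample mail so the sketch runs without external files.
sample = EmailMessage()
sample["From"] = "sender@example.com"
sample["To"] = "user@example.com"
sample["Subject"] = "Urgent: you are a lottery winner"
sample.set_content("Send your password to claim the prize.")
sample.add_attachment(
    b"dummy bytes",
    maintype="application",
    subtype="octet-stream",
    filename="invoice.pdf.exe",
)

report = analyse_mail(sample.as_string())
print(report)
```

Keeping the subject, body and attachment checks in separate functions mirrors the plan to evaluate a different model on each section of the mail; each function could later be replaced by a call to whichever trained classifier performs best on that section.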
