See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/336252196

MONITORING OF SUSPICIOUS DISCUSSIONS ON ONLINE FORUMS USING DATA MINING

Research · January 2018


DOI: 10.13140/RG.2.2.11235.91680


3 authors, including:

Shailesh Dudala
University of Chicago

All content following this page was uploaded by Shailesh Dudala on 04 October 2019.



International Journal of Pure and Applied Mathematics
Volume 118 No. 22 2018, 257-262
ISSN: 1314-3395 (on-line version)
url: http://acadpubl.eu/hub
Special Issue
ijpam.eu

MONITORING OF SUSPICIOUS DISCUSSIONS ON ONLINE FORUMS USING DATA MINING

1Tanya Srivastava, 2R. Mangalagowri, 3Shailesh S. Dudala
2Assistant Professor, 1,2,3Department of CSE,
SRM University, Chennai, India 603203
1tanyasriva96@gmail.com, 3shailesh.dudala@hotmail.com

Abstract: Over the years, the internet has changed the lives of many people, for better or worse. As internet technology progresses, illegal activities have also increased. The internet is an unacknowledged path for illegal activities such as hacking, trafficking, betting, fraud and scams. Cyber-crime branches are looking for provisions to monitor forums for illegal feedback, comments or reviews and to download questionable postings as evidence for their investigations. Our proposed system will monitor for suspicious postings, collect them from a few discussion forums, apply data mining techniques and extract meaningful data. To this end, we focus on data mining and sentiment analysis to enhance these techniques and to extract the features of the text that represent it.

Keywords: Illegal Activities, Discussion Forums, Sentiment Analysis.

1. Introduction

The rise in crime on digital media alerts law enforcement bodies to the need to monitor online activities continuously. To achieve this, we need to build a system which detects suspicious postings on online forums. Surveys have shown that it is difficult to manage information that constantly changes on the internet; data mining is therefore a suitable choice for gathering and analysing such data. Using various data mining techniques, raw data is extracted from a large text corpus, and this raw, unstructured data is transformed into structured data during pre-processing. This paper highlights the data mining techniques and the sentiment analysis algorithm, which are prototyped and implemented in Python using the Natural Language Toolkit (NLTK) library for natural language processing.

A. Existing systems

Digital technology has been influencing human behaviour for a very long time. The existing system analyses text sources from social media and classifies the text into different groups. The system distinguishes between legal and illegal data using Stop-word Selection, the Stemmer algorithm and the Levenshtein algorithm. In Stop-word Selection, commonly used English words such as “we”, “he” and “they” are removed. More such words can be removed using this algorithm.
The Porter Stemmer algorithm removes suffixes from English words and reduces each word to its root, for example:
The words “Stemmed”, “Stemmer” and “Stemming” have the suffixes “ED”, “ER” and “ING”, which are removed during information retrieval, leaving the single root word “STEM”.
In the Levenshtein algorithm, the words in a large set are compared with each other. For example, the Levenshtein distance between "fitten" and "bitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

fitten → bitten (substitution of "b" for "f")
bitten → bittin (substitution of "i" for "e")
bittin → bitting (insertion of "g" at the end).

Levenshtein distance is a measure of similarity between two words: the smaller the distance, the more similar the words.

B. Some Limitations of current systems

Although the existing system works, there is still scope for improvement. The performance of data retrieval and analysis on real online forums remains debatable due to a lack of tools. The system is meant to monitor suspicious discussions on an online forum automatically, but it cannot take a large amount of data as input. The system is difficult to moderate. It is limited to spam reviews and feedback. Security vulnerability is also one of its major disadvantages.
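The three preprocessing steps described for the existing system (stop-word removal, suffix stripping and Levenshtein distance) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the existing system's implementation: the stop-word list is a tiny hypothetical one, and `strip_suffix` is a deliberately simplified stand-in that handles only the three suffixes from the example above, not the full Porter algorithm.

```python
# Illustrative sketch only: the stop-word list and suffix rules are assumptions.
STOP_WORDS = {"we", "he", "they", "the", "a", "is"}  # hypothetical, tiny list
SUFFIXES = ("ing", "ed", "er")  # only the suffixes from the example in the text


def remove_stop_words(tokens):
    """Drop commonly used words that carry little meaning."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def strip_suffix(word):
    """Simplified suffix stripping; NOT the full Porter algorithm."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            # Undo consonant doubling, e.g. "stemm" -> "stem".
            if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeiou":
                w = w[:-1]
            break
    return w


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(remove_stop_words("They stemmed the words".split()))  # ['stemmed', 'words']
print([strip_suffix(w) for w in ("Stemmed", "Stemmer", "Stemming")])  # ['stem', 'stem', 'stem']
print(levenshtein("fitten", "bitting"))  # 3
```

The distance function keeps only two rows of the dynamic-programming table, which is enough when only the distance (and not the edit sequence) is needed. The unconditional consonant undoubling in `strip_suffix` is a shortcut that would mis-stem words such as "falling"; a complete stemmer applies additional rules.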


2. Literature Survey

A framework developed by [6] is used to detect emotion in online media. EmoTxt finds the emotions and categorizes them based on input data provided in comma-separated values (CSV) format. The output is also a CSV file, which contains the text id and the predicted label for each input item. The model classifies emotions such as joy, sadness and anger. According to the researchers [6], the model follows a tree-structured hierarchical classification of emotions, where the later layers provide an understanding of the emotions of the previous layers. The model includes six basic emotions, namely love, joy, anger, sadness, fear and surprise. The model is trained and tested on a gold-standard dataset using a linear Support Vector Machine (SVM).
A published research paper [1] suggests various techniques and algorithms which can be employed. The paper elaborates on Stop-word Selection, the Stemming algorithm, the Brute-force algorithm, the Learning-based algorithm and the Matching algorithm. Matching algorithms use two constraints, Stemmer Strength and Index Compression. Using these two constraints, stem words in the database are compared and their value is calculated. Learning-based algorithms include machine learning techniques such as SVM and conditional random fields.
Another paper [2] describes a system that analyses data from a few discussion forums and classifies the data into different groups, i.e. legal and illegal data, using the Levenshtein algorithm. Levenshtein distance is used to measure the similarity between two words.
In another paper [4], research is carried out using web mining. Using web mining, the data set is collected by crawling a large number of web pages. It requires an interactive user query interface intended for predicting crime hotspots from various web pages. The main techniques used are classification, sequential pattern mining, association analysis, outlier analysis and cluster analysis. Clustering and classification techniques identify similar items and group them into classes. The association rule mining and sequential pattern mining techniques are similar: they both identify frequently occurring sets and extract a pattern. Using all these techniques together makes web mining more complex. Along with the techniques, a conceptual network, i.e. a dynamic structure of nodes connected in a functional way, is required for better visualisation of criminal networks and to reveal the vulnerabilities inside the network. The biggest challenge faced by the researchers was collecting the data from web pages, which consist of hyperlinks, navigation links, advertisements, privacy policies etc. This noise should be removed from the data before processing. Another challenge was that information on the web is never constant. The model intends to concentrate on efficiency by using multiple processes, threads and asynchronous resources.
In other research on sentiment analysis and opinion mining, Rudy Prabowo and Mike Thelwall [9] presented a method which uses different classification methods and hybrid classification on multiple classifiers. They showed that hybrid classification on multiple classifiers can not only improve the performance but can also increase the effectiveness. The paper presents results from a comparative study of various automatic classification methods, comparing machine learning classification with hybrid classification.

3. Proposed Method

Data mining can be used to monitor social media as well as discussion forums for suspicious feedback or comments. Discussion forums can be used to spread any message to a large population almost instantly. Millions of people share their views and ideas on politics and religion, and there are also people who intentionally hurt religious or racial sentiments through malicious posts. Hence it becomes important to monitor the posts on these forums. This application collects the postings and comments from the discussion sites and analyses those comments using data mining techniques and algorithms.
The collected data will be analysed for provoking posts using a set of keywords in the algorithm. The set of sensitive keywords is further divided into 6 categories: hacking, sexuality, religious, piracy, gambling and fraud. If a comment in the text corpus contains any keyword related to any of the 6 categories, then it is classified into the category to which the keyword belongs. Our goal is to perform sentiment analysis on data from discussion forums, for which we will build a classifier that combines different machine learning classifiers.
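The keyword-based categorization described in the proposed method can be illustrated with a short Python sketch. The paper names the six categories but does not list the keywords, so every keyword below is a hypothetical placeholder, and the exact-match rule is an assumption made for illustration rather than the authors' classifier.

```python
import re

# Hypothetical keyword lists: the six category names come from the text,
# the keywords themselves are placeholders chosen for this example.
CATEGORY_KEYWORDS = {
    "hacking": {"hack", "exploit", "malware"},
    "sexuality": {"explicit", "obscene"},
    "religious": {"blasphemy", "heretic"},
    "piracy": {"torrent", "warez", "crack"},
    "gambling": {"bet", "casino", "jackpot"},
    "fraud": {"scam", "phishing", "counterfeit"},
}


def categorize(comment):
    """Return the sorted list of categories whose keywords occur in the comment."""
    # Lowercase the comment and split it into alphabetic words.
    words = set(re.findall(r"[a-z]+", comment.lower()))
    return sorted(
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if words & keywords  # at least one keyword of this category is present
    )


print(categorize("Download the crack from this torrent, no scam!"))  # ['fraud', 'piracy']
print(categorize("Lovely weather today"))  # []
```

A comment that contains keywords from several categories is reported under each of them, and a comment with no sensitive keyword yields an empty list. In practice the words would typically be stemmed before matching so that inflected forms hit the same keyword.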


Figure 3.1 Architecture Diagram

4. Conclusion

This paper gives a detailed explanation of various existing methods for detecting suspicious activities of users in online forums. We found that the techniques developed in this domain focus on data mining in order to reduce suspicious behaviour by users on the web. The study revealed that we need to analyse users' behaviour based on the feedback, comments and reviews they share. Suspicious behaviour can also be categorized into groups such as terrorist activity, financial laundering, hacking, and sexual or racial harassment. Using this categorization, a corpus of probable suspicious words can be built, which will further assist in developing more refined and reliable techniques for detecting these activities.

References

[1] Murugesan, M. Sururthi, R. Pavitha Devi, S. Deepthi, V. Shri Lavanya, and Annie Princy. Automated Monitoring Suspicious Discussions on Online Forums Using Data Mining Statistical Corpus Based Approach. Imperial Journal of Interdisciplinary Research (IJIR), Vol-2, Issue-5, 2016.
[2] Harika Upgaganlawar, Nilesh Sambhe. Surveillance of Suspicious Discussions on Online Forums Using Text Mining. International Journal of Advances in Electronics and Computer Science, Volume-4, Issue-4, April-2017.
[3] Suhas Pandhe and Sahil Pawar. Algorithm to Monitor Suspicious Activity on Social Networking Sites Using Data Mining Techniques. International Journal of Computer Applications, Volume 116, No. 12, April 2015.
[4] Javad Hosseinkhani, Mohammad Koochakazei, Solmaaz Keikhaee and Yahaya Hamedi Amin. Detecting Suspicion Information on Web Crime Using Crime Data Mining Techniques. International Journal of Advanced Computer Science and Information Technology (IJACSIT), Vol. 3, No. 1, 2014, pp. 32-41.
[5] G. Vinodhini, R. M. Chandrasekaran. Sentiment Analysis and Opinion Mining: A Survey. International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 6, June 2012.
[6] Fabio Calefato, Filippo Lanubile, Nicole Novielli. EmoTxt: A Toolkit for Emotion Recognition from Text. University of Bari Aldo Moro.
[7] M. F. Porter. An Algorithm for Suffix Stripping. Program, Vol. 14, Issue 3, pp. 130-137.
[8] B. O'Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith. From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. School of Computer Science, Carnegie Mellon University.
[9] Rudy Prabowo and Mike Thelwall. Sentiment Analysis: A Combined Approach.
[10] S. V. Manikanthan and K. Baskaran. “Low Cost VLSI Design Implementation of Sorting Network for ACSFD in Wireless Sensor Network”, CiiT International Journal of Programmable Device Circuits and Systems, Print: ISSN 0974 – 973X & Online: ISSN 0974 – 9624, Issue: November 2011, PDCS112011008.
[11] T. Padmapriya, N. Dhivya, U. Udhayamathi. “Minimizing Communication Cost In Wireless Sensor Networks To Avoid Packet Retransmission”, International Innovative Research Journal of Engineering and Technology, Vol. 2, Special Issue, pp. 38-42.


Table 2.1 Literature Survey

Work Done: [1] Automated Monitoring of Suspicious Discussions Using Data Mining Statistical Corpus-Based Approach
Algorithms Used: Stop word selection; Stemming Algorithm; Brute-Force Algorithms; Suffix-Stripping Algorithms; Matching Algorithms; Learning-based Methods; Emotional Algorithms; Keyword Spotting Technique; Learning-based Methods
Result Obtained: The purpose of this system is to monitor suspicious discussions automatically on an online forum. Text analysis is used to detect suspicious posts.
Conclusion: This system is intended to be an automatic system for detecting illegal discussions on online forums, through which we can discover illegal activities of all users. This system also focuses on reducing execution time and on easier classification to identify more relevant discussions.

Work Done: [2] Surveillance of Suspicious Discussions on Online Forums Using Text Mining
Algorithms Used: Stop word selection; Stemming algorithm; Levenshtein algorithm
Result Obtained: Text mining is used to detect suspicious posts in online forums. Fails to detect suspicious activity in larger data sets.
Conclusion: This model is presented to detect suspicious activity on smaller data sets using text mining.

Work Done: [3] Algorithm to Monitor Suspicious Activity on Social Networking Sites using Data Mining Techniques
Algorithms Used: Data Analysis Technique; Cleaning and Integration; Selection and Transformation; Data Mining; Pattern Evaluation and Presentation; ID3 Decision Tree
Result Obtained: Using ID3, an accuracy of 86% is achieved, which is comparatively higher than other classification algorithms and neural networks. The 14% error is due to unstructured data which consists of grammar mistakes and abusive comments.
Conclusion: The ID3 decision tree algorithm forms a decision tree which will monitor derogatory comments on social networking sites. The accuracy is 86% using the ID3 algorithm.

Work Done: [4] Detecting Suspicion Information on The Web Using Crime Data Mining Techniques
Algorithms Used: Clustering; Association rule mining; Deviation detection; Classification; String comparator techniques; Sequential Pattern Mining
Result Obtained: Using the traditional data mining techniques on web mining can solve the problem of web crime.
Conclusion: Although the model uses the traditional data mining techniques, the efficiency can increase by using multiple processes and optimal utilization of asynchronous resources.

Work Done: [5] Sentiment Analysis and Opinion Mining: A Survey
Algorithms Used: Rule Based Classification; General Inquirer Based Classifier (GIBC); Rule-Based Classifier (RBC); Statistics Based Classifier (SBC); Induction Rule Based Classifier (IRBC); Support Vector Machines; Hybrid Classification
Result Obtained: We can use a Sentiment Analysis Tool (SAT) that applies a semi-automatic, complementary approach, in which each classifier contributes to the other classifiers in achieving effectiveness.
Conclusion: The hybrid classification on multiple classifiers can not only improve the performance but can also increase the effectiveness.
