
International Conference on Innovative Mechanisms for Industry Applications
(ICIMIA 2017)

Plagiarism Detection on Big Data Using a Modified
Map-Reduce Based SCAM Algorithm

Jayshree Dwivedi
Department of Computer Science and Engineering
SIRTS Group of Institute
Bhopal, India
jayshree1401@gmail.com

Prof. Abhigyan Tiwary
Department of Computer Science and Engineering
SIRTS Group of Institute
Bhopal, India
abhigyantiwary@gmail.com

Abstract— Plagiarism is one of the biggest problems of scientific research and engineering. Plagiarism is understood as presenting, intentionally or otherwise, someone else's words, thoughts, analyses, argumentation, pictures, techniques, computer programs, etc. as one's own. Plagiarism also has a wider meaning: paraphrasing someone else's text by replacing a few words with synonyms, or rearranging some sentences in one's own way, is plagiarism as well. Even reproducing in your own words a line of reasoning or an analysis made by someone else may constitute plagiarism if you add no content of your own; in doing so, you create the impression that you invented the argumentation yourself, while this is not the case. The same applies if you bring together pieces of work by various authors without mentioning the sources. Plagiarism has also increased with the use of the internet and the large amount of big data available. Plagiarism detection techniques are applied by making a distinction between natural and programming languages, and a similarity score is determined for each pair of documents that match significantly. The SCAM (Standard Copy Analysis Mechanism) plagiarism detection algorithm calculates a relative measure of overlap by comparing the set of words that are common between a test document and a registered document. Our proposed detection process compares natural-language documents. We have implemented a Map-Reduce based SCAM algorithm for processing big data using Hadoop and detecting plagiarism in big data; the standard SCAM algorithm is suitable for ordinary data volumes but not for big data processing.

Keywords— Plagiarism, SCAM, Big data, Hadoop, MapReduce

INTRODUCTION
The word plagiarism is derived from the Latin roots plagiarius, an abductor, and plagiare, to steal. The expropriation of another author's text, and its presentation as one's own, constitutes plagiarism and is a serious violation of the ethics of scholarship [1, 2]. Plagiarism includes more subtle and perhaps more pernicious abuses than simply expropriating the exact wording of another writer without attribution. It also includes the limited borrowing, without attribution, of another person's distinctive and significant research findings, hypotheses, theories, methodological strategies, or analyses, or an extended borrowing even with attribution. Plagiarism is always a violation of someone else's intellectual property rights. Obviously, each discipline advances by building on the knowledge and understanding gained and published earlier. There is no grievance at all if you refer to previous work and quote it while mentioning the source. It must, however, remain clear where existing knowledge ends and where you start presenting the results of your own views or research. As long as you are not capable of contributing to the field by adding something crucial to what others have already found, it is misleading and therefore wrong to pretend you have reached that level. It is very important for both the teacher and the student to have a correct impression of the knowledge, understanding and skills of the latter [1].

Two kinds of plagiarism are recognized in scientific writing: plagiarism of data and plagiarism of text. The first is where a researcher takes the data, tables or figures from a published paper and uses them, often marginally modified to give some plausibility, in his or her own paper, pretending they are his or her own results. Such cases are clearly theft and distortion of data and are regarded as a major breach of research ethics; when such cases are discovered, they carry severe penalties. It is important to distinguish plagiarism of data from legitimate reuse: one may use the data of others in order to conduct a new analysis, as for example in a systematic review, provided the review gives due acknowledgement of the sources of the data. Plagiarism of text probably arises more frequently and for a variety of reasons. There are many situations in which the words used by one author so clearly express a point that another author uses exactly those words, because he or she cannot think of a better way to describe that situation. Using another writer's words is allowed provided that clear credit is given, normally by putting the quotation in inverted commas and giving the reference, so for example… "When you insert your ideas on paper, your preceptor wants to distinguish between the building

978-1-5090-5960-7/17/$31.00 ©2017 IEEE 608



block ideas imitated from other persons and your own newly reasoned prospects or conclusions" [8].

Big data is "data beyond the storage capacity and beyond the processing power" of conventional systems. The term is used for data sets so vast and complex that traditional data-processing techniques cannot handle them, and it covers data sets of widely different sizes. The volume of data produced by mankind is steadily growing, ranging from a few dozen terabytes to many petabytes; sources such as social networking sites generate rapidly increasing amounts of data every year. Big data has the 3V characteristics: volume (GB, TB, PB, ZB), velocity (the speed of change or of generation), and variety (structured, unstructured, semi-structured). Big data has become deeply relevant to our lives and has emerged as one of the most important technologies in the modern world. It offers many benefits; for example, using the information kept in social networks such as Facebook, marketing agencies can review the feedback on their campaigns, promotions, and other advertising media.

Hadoop is the core platform for structuring big data, and it solves the problem of formatting the data for subsequent analytics. Hadoop uses a distributed computing architecture consisting of multiple servers built from commodity hardware, making it relatively inexpensive to scale and able to support extremely large data stores. It is based on parallel computing: to process data at this scale, Google introduced two ideas, a distributed file system to store huge data sets and MapReduce to process them. Hadoop, which takes its name from a toy elephant, is open-source software (a framework) overseen by the Apache Software Foundation for storing and processing huge data sets on clusters of commodity hardware. The Hadoop ecosystem includes tools such as Pig, Hive, Sqoop, HBase, and Oozie. Hadoop has two main components: HDFS and MapReduce. HDFS (Hadoop Distributed File System) is a file system specially designed for storing and processing huge data sets on clusters of commodity hardware with a streaming access pattern. It is a self-healing, high-bandwidth clustered storage system [15].

SCAM (Standard Copy Analysis Mechanism) computes a relative measure of overlap by comparing the set of words that are common between a test document and a registered document. Using the SCAM formula, a similarity measure is calculated between the test document and each document belonging to the dataset; the formula is a relative measure of overlap, independent of differences in document sizes. In this paper we propose a new SCAM algorithm working on the Hadoop framework. The algorithm is MapReduce based and capable of handling big data [11].

SCAM ALGORITHM
The SCAM detection system works in four steps. The first step is indexing the data set: an index of the data-set documents is created, and this index is used to retrieve the information needed in the evaluation step. The second step is processing the test document: the input document is split into tokens using regular expressions; the main method in this class is extract tokens, which has two objectives, filling a data structure and filling the table. The third step is searching the index: matching is done between the test document and the documents belonging to the data set. In the last step, similarity is evaluated with SCAM: the SCAM formula is used to detect overlap irrespective of the differences in document sizes. The formula returns a high value when the content of the test document is either a subset or a superset of the registered document.

PROPOSED WORK
Algorithm SCAM_HADOOP
{
1. Place the data to be used for plagiarism detection on HDFS.
2. The input file is distributed to all the mappers.
3. Every mapper matches the input file against a block of data containing multiple documents.
4. Every mapper outputs the matching percentage for its respective documents.
5. The reducer combines the results of all the mappers.
6. The output is produced on HDFS.
}

EXPERIMENTAL SETUP AND RESULTS
In this paper a Hadoop cluster of 2, 4 and 6 nodes is configured. Each node has the following configuration: Intel i7 processor and 8 GB RAM.

[Figure: comparison of execution time (msec) for variable amounts of data (1, 5 and 10 GB) on a 2-node cluster.]
[Figure: comparison of execution time (msec) for variable amounts of data (1, 5 and 10 GB) on a 4-node cluster.]
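The indexing and searching steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation; the function names (`tokenize`, `build_index`, `candidate_documents`) and the word-level regular expression are assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split a document into lowercase word tokens using a regular expression."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(corpus):
    """Index the data set: map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def candidate_documents(index, test_text):
    """Search the index: registered documents sharing words with the test document."""
    candidates = set()
    for token in set(tokenize(test_text)):
        candidates |= index.get(token, set())
    return candidates

corpus = {"d1": "plagiarism is theft of text", "d2": "hadoop stores big data"}
index = build_index(corpus)
print(candidate_documents(index, "Detecting plagiarism in text"))  # → {'d1'}
```

Only the candidates returned by the index lookup need to be scored in the evaluation step, which is what makes the indexing step worthwhile.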

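The paper states only that the SCAM measure is relative, size-independent, and high when one document is a subset or a superset of the other; it does not reproduce the formula itself. As a rough illustration with those properties, the sketch below uses a simplified word-containment measure, not the published SCAM formula; `scam_like_similarity` is an assumed name.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def scam_like_similarity(test_text, registered_text):
    """Simplified containment measure in the spirit of SCAM: the fraction of each
    document's word occurrences that also occur in the other, taking the maximum
    so the score is high when either document is a subset OR a superset."""
    t, r = Counter(tokenize(test_text)), Counter(tokenize(registered_text))
    if not t or not r:
        return 0.0
    common = sum((t & r).values())  # overlapping word occurrences (multiset min)
    return max(common / sum(t.values()), common / sum(r.values()))

# A test document fully contained in a registered document scores 1.0.
print(scam_like_similarity("big data needs hadoop",
                           "processing big data needs hadoop clusters"))  # → 1.0
```

Because the score is normalised by each document's own length and the maximum is taken, a short extract copied into a long document is still flagged, which is the size-independence property the text describes.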

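The six SCAM_HADOOP steps can be mimicked locally in plain Python, with each `mapper` call standing in for one Hadoop map task over a block of registered documents and `reducer` merging the per-block results. This is a local simulation under assumed names (`mapper`, `reducer`, `similarity`), not actual Hadoop code; on a real cluster these functions would run as MapReduce tasks over files stored on HDFS.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def similarity(a, b):
    """Simplified containment score (stand-in for the SCAM formula)."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    common = sum((ca & cb).values())
    return max(common / max(sum(ca.values()), 1), common / max(sum(cb.values()), 1))

def mapper(test_text, block):
    """One mapper: emit (doc_id, matching percentage) for its block of documents."""
    return [(doc_id, 100.0 * similarity(test_text, text)) for doc_id, text in block]

def reducer(mapper_outputs):
    """Reducer: combine the results of all mappers into one ranked report."""
    merged = [pair for out in mapper_outputs for pair in out]
    return sorted(merged, key=lambda p: p[1], reverse=True)

# Registered documents split into blocks, one block per mapper (step 3).
blocks = [[("d1", "plagiarism detection on big data")],
          [("d2", "string matching algorithms"),
           ("d3", "processing big data on clusters")]]
test = "plagiarism detection on big data with hadoop"
report = reducer([mapper(test, b) for b in blocks])
print(report[0][0])  # best match: d1
```

The point of the split into blocks is that each mapper scores only its own slice of the corpus in parallel, so the wall-clock time grows slowly as the data set grows, which is the behaviour the experiments report.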


[Figure: comparison of execution time (msec) for variable amounts of data (1, 5 and 10 GB) on a 6-node cluster.]
[Figure: comparison of execution time (msec) for different capacity clusters (2-node and 4-node) over 1, 5 and 10 GB of data.]

CONCLUSION
In this work the SCAM algorithm is modified for a distributed computing platform using Hadoop, and datasets of different sizes are tested for plagiarism using the modified SCAM on Hadoop. It is found that execution time does not increase considerably even for bigger datasets, because the data are distributed across the cluster of machines. The technique produces results in a short time with good speed and accuracy, and it can easily process and handle big data sets; Hadoop is therefore used for performance enhancement.

REFERENCES

1. Rajesh Prasad, "An Efficient Multi-Patterns Parameterized String Matching Algorithm with Super Alphabet," Department of Computer Science & Engineering, LDC Institute of Technical Studies, Allahabad, India.

2. Anat Bremler-Barr and Yaron Koral, "Accelerating Multi-Patterns Matching on Compressed HTTP Traffic," Computer Science Dept., Interdisciplinary Center, Herzliya, Israel.

3. Chouvalit Khancome, "Dynamic Multiple Pattern Detection Algorithm," Software System Engineering Laboratory, Department of Mathematics and Computer Science, Faculty of Science, King Mongkut's Institute of Technology at Ladkrabang (KMITL).

4. Fang Xiangyan, "The Research and Improving for Multi-pattern String Matching Algorithm," School of Computer, Harbin Engineering University, Harbin; The 709th Research Institute, China Shipbuilding Industry Corporation, Wuhan, China.

5. Xinyan Zha and Sartaj Sahni, "Multipattern String Matching on a GPU," Computer and Information Science and Engineering, University of Florida, Gainesville, FL.

6. Xinyan Zha, Daniele Paolo Scarpazza and Sartaj Sahni, "Highly Compressed Multi-pattern String Matching on the Cell Broadband Engine."

7. Leena Salmela, Jorma Tarhio and Jari Kytojoki, "Multi-Pattern String Matching with q-Grams," Helsinki University of Technology.

8. Nan Hua, "Variable-Stride Multi-Pattern Matching For Scalable Deep Packet Inspection," College of Computing, Georgia Institute of Technology.

9. Weirong Jiang, Yi-Hua E. Yang and Viktor K. Prasanna, "Scalable Multi-Pipeline Architecture for High Performance Multi-Pattern String Matching," Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA.

10. Alan Kennedy, Xiaojun Wang and Zhen Liu, "Ultra-High Throughput String Matching for Deep Packet Inspection," School of Electronic Engineering, Dublin City University, Dublin, Ireland.

11. Daniele Anzelmi, Domenico Carlone, Fabio Rizzello, Robert Thomsen and D. M. Akbar Hussain, "Plagiarism Detection Based on SCAM Algorithm."

12. http://wnsql.sourceforge.net/

13. http://wordnet.princeton.edu/wordnet/documentation/

14. http://en.wikipedia.org/wiki/Plagiarism

15. https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm