
BE Project: Presentation on,

Design & Analysis of Duplication Detection Tool for Articles

By,
Group No: 12

Sanket Patil (A-11)


Harshvardhan Tambe (B-59)
Anuja Bhamare (B-57)
Harshwardhan Pawar (B-73)

Guided By
Prof. Archana Banait
Dept. of Computer Engineering

MET’s Bhujbal Knowledge City, Institute of Engineering


INDEX
• Overview
• Introduction
• Problem Statement
• Objectives
• Literature Survey
• Architecture
• Modules
• Algorithm
• Flowchart
• Mathematical Model
• ER Diagram
• UML Diagrams
• System Requirements
• Conclusion
• References

OVERVIEW
• Large-volume public comment campaigns and web portals that
encourage the public to customize articles produce many
duplicate documents, which increases processing and storage
costs.
• Filtering near-duplicates out of a collection is therefore
important, and it is particularly challenging in applications
that must filter them in real time with high precision.
• The proposed system detects article duplication using a
hashing technique with a hash index and a similarity measure.
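
As a rough illustration of the hash-index idea, the sketch below maps each sentence fingerprint to the articles that contain it, so that only articles sharing at least one fingerprint need to be compared. It is written in Python for brevity (the project itself targets C#). The function names, the use of MD5, and the choice of sentence-level fingerprints are assumptions made for this example, not the project's actual design.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def sentence_fingerprints(text):
    """Hash each normalized sentence of an article (illustrative choice)."""
    prints = set()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z0-9]+", sentence)
        if words:
            digest = hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()
            prints.add(digest)
    return prints


def build_hash_index(articles):
    """Map every fingerprint to the set of article ids that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for fp in sentence_fingerprints(text):
            index[fp].add(article_id)
    return index


def candidate_pairs(index):
    """Return article pairs that share at least one fingerprint."""
    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical sample articles.
articles = {
    "a1": "The council approved the budget. Taxes will rise next year.",
    "a2": "Taxes will rise next year! Residents are unhappy.",
    "a3": "A new park opened downtown.",
}
index = build_hash_index(articles)
pairs = candidate_pairs(index)
print(pairs)
```

Only the pairs returned by `candidate_pairs` would then be passed to the more expensive similarity computation, which avoids comparing every article with every other article.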
INTRODUCTION
• Electronic media is developing rapidly, so a large number of
articles are published online and duplication detection is
needed.
• The proposed system focuses on detecting duplicate articles.
It uses a fingerprinting technique with a hash index to
detect duplication.
• It will also conduct an empirical study and summarize a few
of the most common plagiarism patterns found in plagiarized
articles.



PROBLEM STATEMENT
• To design a tool that processes each sentence of an article,
detects plagiarism patterns, and finds duplicate articles.



OBJECTIVES
• To support accurate duplication detection of articles.

• To explore plagiarism patterns in articles.

• To extend the existing system with new features.



LITERATURE SURVEY
| Sr. No. | Title | Author | Findings | Limitations |
|---|---|---|---|---|
| 1 | Duplication Detection (IEEE base paper) | Lu Lu, Pengcheng Wang (2019) | NDFinder using a hashing technique | Proposed only for WeMedia platforms. |
| 2 | Cosine Similarity Method | Salton et al. (1975) | Cosine distance between document vectors | Computation time and memory overhead are high because it calculates the cosine distance for every pair of vectors. |
| 3 | The Shingling Method | Broder (1997-2000) | Set similarity method for sets of overlapping shingles | Costly, as the number of shingles generated can be quite large. |
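
To make the limitation noted for the cosine similarity method (row 2) concrete, here is a minimal Python sketch that builds term-frequency vectors and compares every pair of documents. The tokenizer, the raw term counts (no TF-IDF weighting), and the sample documents are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Bag-of-words term-frequency vector (raw counts, an assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the cat sat on the mat",
    "d3": "stock markets fell sharply",
}
vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}

# Every pair is compared: n * (n - 1) / 2 comparisons for n documents.
scores = {
    (a, b): cosine_similarity(vectors[a], vectors[b])
    for a, b in combinations(sorted(vectors), 2)
}
print(scores)
```

For n documents the dictionary `scores` holds n(n-1)/2 entries, which is the quadratic cost the table refers to.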



LITERATURE SURVEY

| Sr. No. | Title | Author | Findings | Limitations |
|---|---|---|---|---|
| 4 | Locality-Sensitive Hashing | Charikar (2002) | Near-duplicate detection | Works on the probability that the hash values of two vectors match; if the values are not equal, it can give a false result. |
| 5 | I-Match | Chowdhury et al. (2002) | Fingerprinting by hashing all the significant tokens | Sensitive to very slight changes in the document. |
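
The locality-sensitive hashing entry (row 4) can be illustrated with a small SimHash-style fingerprint, a common practical variant of this family of techniques. The sketch below is in Python; the 64-bit width, the MD5 token hash, and the unweighted tokens are assumptions for this example and are not taken from the cited work.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed)


def simhash(text):
    """Build a 64-bit fingerprint by letting every token vote on each bit."""
    votes = [0] * BITS
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the token
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp_a = simhash("Heavy rain flooded the city centre on Monday")
fp_b = simhash("heavy rain flooded the city centre on monday")
print(hamming_distance(fp_a, fp_b))
```

Documents whose fingerprints differ in only a few bits are treated as likely near-duplicates. Because the decision rests on hash values, both false positives and false negatives are possible, which is the limitation listed in the table.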



LITERATURE SURVEY
| Sr. No. | Title | Author | Findings | Limitations |
|---|---|---|---|---|
| 6 | Improved Robustness of Signature-Based Near-Replica Detection via Lexicon Randomization | Kołcz et al. (2004) | Lexicon method | Focused on judging the similarity of two documents without performing the computationally expensive bit-wise comparison of entire documents. |
| 7 | Near-duplicate Detection | Henzinger (2006) | Shingling and locality-sensitive hashing | Costly, because it combines shingling with locality-sensitive hashing; false results are possible because the method is probabilistic. |
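
Shingling is usually paired with min-wise hashing so that the large shingle sets mentioned in the survey do not have to be stored or compared directly. The Python sketch below is one possible illustration; the 3-word shingles, the 64 seeded SHA-1 hash functions, and the sample texts are assumptions, not parameters from the cited papers.

```python
import hashlib
import re


def shingles(text, k=3):
    """Set of overlapping k-word shingles (k = 3 is an assumed default)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed, shingle):
    """One member of a family of hash functions, selected by `seed`."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=64):
    """Keep only the minimum hash per function; expects a non-empty set."""
    return [
        min(seeded_hash(seed, s) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


sig_a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
sig_b = minhash_signature(shingles("the quick brown fox leaps over the lazy dog"))
estimate = estimated_jaccard(sig_a, sig_b)
print(estimate)
```

Each document is reduced to a fixed-length signature regardless of how many shingles it has. The estimate is probabilistic, so it approximates the true set similarity rather than reproducing it exactly.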



ARCHITECTURE OF EXISTING SYSTEM

Figure 1. Architecture of Existing System [1]



ARCHITECTURE OF PROPOSED SYSTEM

Input Articles → Data Extraction → Data Pre-processing →
Normalization → Hashing Algorithm → Jaccard Similarity Algorithm →
Measuring and Reporting → Duplication Detection

Figure 2. Architecture of System



MODULES
• Data Extraction
• Dataset Preprocessing
• Normalization
• Fingerprinting and Hashing Algorithm
• Jaccard Similarity Algorithm
• Report Generation
• Duplication Detection
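
The modules above can be sketched end to end as follows. This is a minimal Python illustration under stated assumptions, not the project's C# implementation: the stop-word list, the 3-word fingerprints, the truncated SHA-1 hash, and the 0.5 threshold are all hypothetical choices.

```python
import hashlib
import re

# Tiny illustrative stop-word list (assumption).
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is"}


def normalize(text):
    """Normalization: lowercase, strip punctuation, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def fingerprints(words, k=3):
    """Fingerprinting and hashing: hash every k-word window."""
    windows = [
        " ".join(words[i:i + k])
        for i in range(max(len(words) - k + 1, 1))
    ]
    return {
        hashlib.sha1(w.encode("utf-8")).hexdigest()[:16]
        for w in windows
        if w
    }


def jaccard(set_a, set_b):
    """Jaccard similarity: size of intersection over size of union."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def detect_duplicate(text_a, text_b, threshold=0.5):
    """Report generation and duplication detection for one article pair."""
    score = jaccard(
        fingerprints(normalize(text_a)),
        fingerprints(normalize(text_b)),
    )
    return {"similarity": score, "duplicate": score >= threshold}


report = detect_duplicate(
    "The quick brown fox jumps over the lazy dog.",
    "the QUICK brown fox jumps over the lazy dog!",
)
print(report)
```

The two sample sentences differ only in case and punctuation, so after normalization they yield identical fingerprint sets and are reported as duplicates. The threshold would need to be tuned on real articles.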



ALGORITHM FOR TEXT PLAGIARISM



ALGORITHM FOR IMAGE PLAGIARISM



FLOWCHART



MATHEMATICAL MODEL
• Let the system be described by S:
• S = {D, DP, N, FH, JA, MR, DD}
• Where,
S: the system
D: Dataset
DP: Dataset Preprocessing
N: Normalization
FH: Fingerprinting and Hashing Algorithm
JA: Jaccard Similarity Algorithm
MR: Measuring and Reporting
DD: Duplication Detection
• S = {I, F, O}



MATHEMATICAL MODEL
• I = set of inputs = {I1, I2}
I1 = Articles
I2 = Trained Dataset
• F = set of functions = {Fn1, Fn2, Fn3, Fn4, Fn5}
Fn1: Normalization
Fn2: Fingerprinting and Hashing Algorithm
Fn3: Jaccard Similarity Algorithm
Fn4: Measuring and Reporting
Fn5: Duplication Detection
• O = set of outputs = {O1, O2}
O1 = Similarity value
O2 = Duplicate detection
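
One possible way to state the two outputs formally is given below. The symbols A and B (the fingerprint sets of the two articles being compared) and the threshold \tau are assumptions introduced for this sketch; the slides do not fix a threshold value.

```latex
\[
  O_1 = J(A, B) = \frac{|A \cap B|}{|A \cup B|},
  \qquad 0 \le J(A, B) \le 1
\]
\[
  O_2 =
  \begin{cases}
    \text{duplicate}     & \text{if } O_1 \ge \tau, \\
    \text{not duplicate} & \text{otherwise.}
  \end{cases}
\]
```

For example, if two articles share 6 fingerprints out of 10 distinct fingerprints in total, then J = 6/10 = 0.6, and the pair is reported as a duplicate for any threshold \tau of 0.6 or lower.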



MATHEMATICAL MODEL

Figure 3. Mathematical Model



ER DIAGRAM

Figure 4. ER Diagram



UML: CLASS DIAGRAM

Figure 5. UML Class Diagram



UML: SEQUENCE DIAGRAM

Figure 6. UML: Sequence Diagram



UML: USE CASE DIAGRAM

Figure 7. UML: Use Case Diagram


MINIMUM SYSTEM REQUIREMENTS
Hardware Requirements:
• Hard disk: 1 GB or more
• RAM: 4 GB or more
• Processor: Core 2 Duo or better

Software Requirements:
• IDE: Visual Studio
• Database: Dataset
• Language: C# (.NET)
• OS: Windows 7 or later



CONCLUSION
The proposed system will be able to detect plagiarism patterns
and find duplication in articles. A project report and a review
paper have been prepared for the system. The text and image
plagiarism algorithms of the proposed system have been designed
and will be implemented.



REFERENCES
[1] Lu Lu and Pengcheng Wang, “Duplication detection in news
articles based on big data,” 2019 IEEE 4th International
Conference on Cloud Computing and Big Data Analytics.

[2] W. Kienreich, M. Granitzer, V. Sabol, and W. Klieber,
“Plagiarism detection in large sets of press agency news
articles,” 17th IEEE International Workshop on Database and
Expert Systems Applications, 2006, pp. 181–188.

[3] A. S. Bin-Habtoor and M. A. Zaher, “A survey on plagiarism
detection systems,” International Journal of Computer Theory
and Engineering, 2012, 4(2): pp. 185–188.

[4] S. M. Alzahrani and N. Salim, “Plagiarism detection in
Arabic scripts using fuzzy information retrieval,” in Student
Conf. Res. Develop., Johor Bahru, Malaysia, 2008, pp. 281–285.

[5] C. Liu, C. Chen, J. Han, and P. S. Yu, “GPLAG: detection of
software plagiarism by program dependence graph analysis,”
Proc. SIGKDD, 2006, pp. 872–881.

THANK YOU
