
Tribhuvan University

Institute of Science and Technology

A Final Year Project Report

On

“Analysis of Rabin Karp and Knuth Morris Pratt Algorithm

for Plagiarism Detection”


Submitted to:

Department of Computer Science and Information Technology

Ambition College

Mid-Baneshwor, Kathmandu, Nepal

In partial fulfilment of the requirements for the

Degree of Bachelor of Science in Computer Science and Information Technology

Submitted by:

Saugat Adhikari (TU Roll No.: 20727/075)

Avishek Adhikari (TU Roll No.: 20714/075)

Yubraj Koirala (TU Roll No.: 20737/075)

Under the supervision of

Mr. Guru Prasad Lekhak

May, 2023
Date: 2023/05/05

AMBITION COLLEGE

Mid-Baneshwor, Kathmandu

SUPERVISOR’S RECOMMENDATION

I hereby recommend that this project, prepared under my supervision by the team of
Saugat Adhikari, Avishek Adhikari, and Yubraj Koirala, entitled “Analysis of Rabin
Karp and Knuth Morris Pratt Algorithm for Plagiarism Detection”, be accepted as
fulfilling in part the requirements for the degree of Bachelor of Science in Computer
Science and Information Technology. To the best of my knowledge, this is an original
work in Computer Science by them.

…………………………….

Mr. Guru Prasad Lekhak

Supervisor

Department of Computer Science and Information Technology

Ambition College

Mid-Baneshwor, Kathmandu
Tribhuvan University

Institute of Science and Technology

Department of Computer Science and Information Technology

AMBITION COLLEGE

Mid-Baneshwor, Kathmandu, Nepal

Letter of Approval

This is to certify that this project, prepared by the team of Saugat Adhikari, Avishek
Adhikari, and Yubraj Koirala, entitled “Analysis of Rabin Karp and Knuth Morris Pratt
Algorithm for Plagiarism Detection”, in partial fulfillment of the requirements for the
degree of Bachelor of Science in Computer Science and Information Technology, has
been well studied and prepared. In our opinion, it is satisfactory in scope and quality as
a project for the required degree.

Evaluation Committee

Mr. Guru Prasad Lekhak
Supervisor
Department of Computer Science and Information Technology
Ambition College
Mid-Baneshwor, Kathmandu

Mr. Ramesh Kumar Chaudhary
Head of Department
Department of Computer Science and Information Technology
Ambition College
Mid-Baneshwor, Kathmandu

External

Acknowledgement
We would like to extend our heartfelt thanks to Ambition College for providing us with
the platform to undertake this project work. The college has been instrumental in shaping
our learning experience and has provided us with the resources and support necessary for
the successful completion of this project. We would like to express our profound gratitude
to our Head of Department, Mr. Ramesh Kumar Chaudhary, for his unwavering
support and guidance. His expertise and experience were invaluable in the development
of this project. His constant encouragement, motivation and constructive feedback helped
us to overcome the challenges we faced during the course of the project. We would also
like to express our deepest appreciation to our project supervisor, Mr. Guru Prasad
Lekhak, for his insightful guidance and mentorship. His vast knowledge and experience in
the field of project work helped us to understand the intricacies of the subject. His
dedication and commitment to our project have been instrumental in its successful
completion. Lastly, we would like to acknowledge the challenges that we faced during the
course of this project and thank everyone who has helped us overcome them. This project
would not have been possible without the support and guidance of our college, Head of
Department, supervisor and everyone who has helped us along the way.

Abstract
The problem of plagiarism has been seen as a serious threat to the educational process
and a cause of decline in creativity and originality. The traditional method of text
search is considered an accurate way of detecting plagiarism; its disadvantage is the
amount of time it consumes. This software, ‘Analysis of Rabin Karp and Knuth Morris
Pratt Algorithm for Plagiarism Detection’, aims at providing a better and more effective
way of detecting plagiarism. It automates the process by extracting selected text and
matching it against the available data set to find the degree of plagiarism committed.

Keywords: Plagiarism Detection, Plagiarized, Knuth-Morris-Pratt, Rabin-Karp

Table of Contents

Acknowledgement................................................................................................................i

Abstract ....................................................................................................................ii

List of Figures ...................................................................................................................vi

List of Tables ..................................................................................................................vii

List of Abbreviations........................................................................................................viii

CHAPTER 1 INTRODUCTION......................................................................................1

1.1 Introduction................................................................................................................1

1.2 Problem Statement.....................................................................................................2

1.3 Objectives...................................................................................................................3

1.4 Scope and Limitations................................................................................................3

1.5 Development Methodology........................................................................................4

1.6 Report Organization...................................................................................................5

CHAPTER 2 BACKGROUND STUDY AND LITERATURE REVIEW....................6

2.1 Background Study......................................................................................................6

2.2 Literature Review.......................................................................................................7

2.3 Related Works............................................................................................................9

2.3.1 PlagAware...........................................................................................................9

2.3.2 PlagScan............................................................................................................10

2.3.3 CheckForPlagiarism.net....................................................................................11

CHAPTER 3 SYSTEM ANALYSIS...............................................................................13

3.1 Requirement Analysis..............................................................................................13

3.1.1 Functional Requirement....................................................................................13

3.1.1.1 Use Case Diagram......................................................................................13

3.1.1.2 Use Case Description.................................................................................13

3.1.2 Non-Functional Requirement............................................................................15

3.2 Feasibility Analysis..................................................................................................15

3.2.1 Technical Feasibility.........................................................................................16

3.2.2 Operational Feasibility......................................................................................16

3.2.3 Economic Feasibility.........................................................................................16

3.2.4 Schedule Feasibility..........................................................................................16

3.3 Methodology............................................................................................................17

3.3.1 Extraction of text from files..............................................................................17

3.3.2 Source file generation........................................................................................17

3.3.3 Preprocessing....................................................................................................18

3.3.4 Paragraph Segmentation....................................................................................18

3.3.5 Lowercasing......................................................................................................18

3.3.6 Stopword Removal............................................................................................19

3.4 Analysis....................................................................................................................21

3.4.1 Data modelling..................................................................................................21

3.4.1.1 ER Diagram................................................................................................21

3.4.2 Process Modelling.............................................................................................22

3.4.2.1 Context Diagram........................................................................................22

3.4.2.2 Level 0 DFD...............................................................................................22

3.4.2.3 Level 1 DFD...............................................................................................23

3.4.3 Data Set Description..........................................................................................23

CHAPTER 4 SYSTEM DESIGN...................................................................................24

4.1 Design......................................................................................................................24

4.2 Algorithm Details.....................................................................................................25

CHAPTER 5 IMPLEMENTATION AND TESTING..................................................29

5.1 Implementation........................................................................................................29

5.1.1 Implementation CASE tools.............................................................................29

5.1.1.1 Software Components................................................................................29

5.1.1.2 Hardware Components...............................................................................29

5.1.2 Implemented programming languages..............................................................29

5.1.3 Implementation detail of modules.....................................................................30

5.1.3.1 Modules implementation............................................................................30

5.2 Testing......................................................................................................................31

5.2.1 Unit Testing.......................................................................................................32

5.2.2 System Testing..................................................................................................33

CHAPTER 6 CONCLUSION AND FUTURE RECOMMENDATION....................35

6.1 Conclusion................................................................................................................35

6.2 Future Recommendations.........................................................................................35

References

Appendix

List of Figures
Figure 1.1: Development Methodology Model....................................................................4

Figure 3.1: Use-Case Diagram of plagiarism detection system.........................................13

Figure 3.2 : ER Diagram of plagiarism detection system..................................................21

Figure 3.3 : Context Diagram.............................................................................................22

Figure 3.4 : Level 0 DFD...................................................................................................22

Figure 3.5 : Level 1 DFD...................................................................................................23

Figure 4.1 : System Overview............................................................................................24

List of Tables
Table 3.1: Log into the system...........................................................................................14

Table 3.2: Register an account...........................................................................................14

Table 3.3 : Choose algorithm.............................................................................................14

Table 3.4 : Upload file........................................................................................................14

Table 3.5 : Calculate plagiarism rate..................................................................................15

Table 3.6 : Logout..............................................................................................................15

Table 3.7: Pre-Processing Example...................................................................................19

Table 5.1: Test Cases for file upload and text extraction...................................................32

Table 5.2: Test Cases for file upload and text extraction...................................................33

List of Abbreviations

CSS : Cascading Style Sheet

DFD : Data Flow Diagram

ER : Entity Relation

HTML : Hyper Text Markup Language

JS : JavaScript

KMP : Knuth Morris Pratt

UI : User Interface

UX : User Experience

CHAPTER 1
INTRODUCTION

1.1 Introduction

Plagiarism detection software is a valuable tool in educational institutions for identifying
instances of plagiarism in student papers. With the rise of the internet and the easy access
to information, plagiarism has become a growing concern in schools and universities. By
using plagiarism detection software, educators can quickly and accurately check student
papers for plagiarism and take appropriate action if necessary.

There are several different types of plagiarism detection software available, each with its
own set of features. Some popular options include Turnitin, Grammarly, and SafeAssign.
These tools use algorithms to compare student papers against a database of sources,
such as academic papers, websites, and journals. The software then flags any text that
matches the source material, indicating potential plagiarism. Some tools even include
features that check for paraphrasing, which is when a writer rewords someone else's work
but keeps the overall meaning the same.

Using plagiarism detection software in educational institutions is a common practice to
ensure the originality of the work submitted by the students. It helps educators to identify
plagiarism and take appropriate action, such as giving a warning or failing the student.
Additionally, it also helps to educate students about the importance of academic integrity
and the consequences of plagiarism.

Some of the advantages of using plagiarism detection software in educational institutions
include:

 It saves time and effort for educators by automating the process of checking for
plagiarism.
 It helps to ensure the authenticity and originality of student work.
 It promotes academic integrity by educating students about plagiarism and the
importance of citing sources.
 It can be used as a deterrent to prevent students from plagiarizing in the first place.

However, it is important to note that plagiarism detection software is not a foolproof
solution to plagiarism, and it should be used in conjunction with other methods of
detecting plagiarism, such as manual checks and educating students about academic
integrity. Additionally, it is important for educators to understand how to interpret the
results of the software and make a fair judgment based on the context. This software aims
at providing a way to detect plagiarism in documents. It searches throughout the data set
to find whether a document has been plagiarized. The easy accessibility of the internet
has provided easier access to documents online, which has also paved the path for
plagiarism of documents. Although there have been laws and reforms aimed at decreasing
the amount of plagiarism, they have had no noticeable impact. With the introduction of
new software and technologies, the amount of plagiarism has been seen to increase, which
has given birth to a new kind of software, i.e. plagiarism detection software.

This software, “Analysis of Rabin Karp and Knuth Morris Pratt Algorithm for
Plagiarism Detection”, will be used to detect whether someone’s work is copied or
original. This project aims at developing a plagiarism detector for educational purposes.
It first tokenizes each sentence. Then, it removes verbs and common words which carry
little meaning, in order to focus on the content words. After that, it removes ing, ed, and
s from the end of the content words, a step also called stemming. Then, it calculates the
number of unique words and the number of times each word appears in the text. Based
on these calculations, the degree of plagiarism is computed. The proposed system will
make use of two algorithms, the Rabin-Karp algorithm and the Knuth–Morris–Pratt
algorithm. The Rabin-Karp algorithm is more efficient when the text is long, while the
Knuth–Morris–Pratt algorithm is more efficient when the text is short.
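The preprocessing pipeline just described (tokenization, removal of common words, suffix stripping, and word counting) can be sketched in Python. This is an illustrative sketch only, not the project's actual implementation; the stopword list and suffix rules below are placeholder assumptions:

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a much fuller one.
STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def preprocess(text):
    """Tokenize, lowercase, drop stopwords, and strip common suffixes (naive stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenization + lowercasing
    content = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    stemmed = []
    for word in content:                                  # strip ing / ed / s endings
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        stemmed.append(word)
    return stemmed

def word_stats(text):
    """Return (number of unique words, per-word frequency counts)."""
    counts = Counter(preprocess(text))
    return len(counts), counts

unique, counts = word_stats("The students are copying and pasting copied essays")
print(unique, counts)
```

The unique-word count and per-word frequencies produced here are the inputs from which a degree-of-plagiarism score can then be computed.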

After the algorithm processes the file and determines the degree of plagiarism, it displays
the final result to the user. The degree obtained is used to determine whether the
document has been plagiarized or not. A higher degree means the document has been
plagiarized, while a lower degree may appear even in non-plagiarized documents, as
similarly titled files may contain similar words.
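For reference, textbook versions of the two string matchers named above can be sketched as follows. These are standard formulations, not the project's tuned implementations, and the demo strings are arbitrary:

```python
def rabin_karp(text, pattern, base=256, mod=101):
    """Rolling-hash search; returns the start indices of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)          # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):                    # hashes of pattern and first text window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # On a hash match, verify character-by-character to rule out collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:                     # roll the window one character right
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits

def kmp(text, pattern):
    """Knuth-Morris-Pratt search; returns the start indices of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    fail = [0] * m                        # fail[i]: longest proper border of pattern[:i+1]
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:                        # full match ending at position i
            hits.append(i - m + 1)
            k = fail[k - 1]
    return hits

print(rabin_karp("ababcabc", "abc"))  # prints [2, 5]
print(kmp("ababcabc", "abc"))         # prints [2, 5]
```

Rabin-Karp hashes each window in O(1) amortized time, which pays off on long texts, while KMP's failure table guarantees no backtracking over the text, which suits short patterns and texts.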

1.2 Problem Statement

Plagiarism has gained significant recognition as a grave issue, particularly within
universities, where students often resort to directly copying content from websites or
other sources. It poses a substantial challenge across all academic disciplines. This
problematic phenomenon necessitates universities' attention to uphold academic integrity
and foster creativity among students. Presenting data sourced from the internet as one's
original work without proper acknowledgment of the original author is also regarded as a
form of plagiarism. It is a form of academic dishonesty: it is unfair to the original author,
and also unfair to the student, as it hampers their creativity and knowledge acquisition.
With the increase of information on the internet, the problem of plagiarism is also
increasing. A plagiarism detector is therefore needed to help find plagiarism so that
action can be taken against it.

1.3 Objectives

The main objectives of this project are given below:

 To develop a system that can effectively detect plagiarism in source code files.
 To rigorously verify the originality of the work.

1.4 Scope and Limitations

The scopes are as follows:

 Identify instances of plagiarism in written work such as text documents, essays,
research papers, and programming code.
 Can be used in academic settings, such as in schools, colleges, and universities.

The limitations are as follows:

 Non-Verbatim Plagiarism:
Plagiarism that entails rewriting, translating, or paraphrasing the text poses a
challenge for detection. While most plagiarism detectors are highly sensitive, they
focus solely on the words used rather than the underlying content. Consequently, if
the idea or information is lifted without directly copying the words, it can go
unnoticed by these detectors. This issue is prevalent in academia, where this form of
plagiarism is treated as seriously as verbatim plagiarism.
 Common Phrasing/Attributed Use:
Secondly, while many plagiarism checkers strive to distinguish attributed use, it can
be challenging due to the diverse range of attribution styles. Consequently, achieving
accurate separation of attributed content may not always be feasible. Additionally,
due to the prevalence of certain phrases in the English language, many plagiarism
checkers may flag matches that are simply coincidental, leading to potential false
positives in the results.

1.5 Development Methodology


Figure 1.1: Development Methodology Model

To develop a plagiarism detection system using the Waterfall model and both the Knuth-
Morris-Pratt (KMP) algorithm and the Rabin-Karp string matching algorithm, the
following steps can be taken:

 Requirements Analysis: The first step is to gather and document the requirements of
the plagiarism detection system, including the type of content it should support, the
desired accuracy, and the user interface.
 Design: Based on the requirements, the system should be designed, including the
algorithms to be used, and the user interface. Both the KMP and Rabin-Karp
algorithms should be selected for string matching in this phase.
 Implementation: The next step is to implement the design. This involves writing the
code, testing it, and fixing any bugs. Both the KMP and Rabin-Karp algorithms
should be implemented and optimized during this phase.
 Testing: The system should be thoroughly tested to ensure that it meets the
requirements. This includes unit testing, integration testing, and acceptance testing.
The performance of both algorithms should be evaluated and compared during this
phase.
 Deployment: If the system passes all tests, it can be deployed for use.
 Maintenance: After deployment, the system should be monitored and maintained to
ensure it continues to work properly. Any necessary updates should be made as
needed.

1.6 Report Organization

This document is organized into several chapters, each further divided into sub-chapters,
covering all the details of the project.

Chapter 1: It introduces the whole report. It includes a short introduction of the system,
its scope and limitations, and the objectives of the system.

Chapter 2: It includes the research methodologies of the project. The background study
and literature review have been covered.

Chapter 3: It is all about system analysis. It also includes feasibility study and
requirement analysis.

Chapter 4: It includes the system design and details of the algorithms.

Chapter 5: It is about the implementation and testing procedures. It contains the detail
about the tools that are required to design the system. In the testing section, different
testing processes are included.

Chapter 6: It includes conclusion of the whole project. It also provides information about
what further can be achieved from this project.

CHAPTER 2
BACKGROUND STUDY AND LITERATURE REVIEW

2.1 Background Study

Plagiarism detection software traces its origins to the early days of the internet, when the
ease of copying and pasting text from websites made it easier for students and other
writers to plagiarize. To combat this problem, various plagiarism detection software have
been developed. Early plagiarism detection software typically used simple
string-matching algorithms to compare a document against a database of known sources. These
early systems had a number of limitations, including a lack of context and an inability to
detect paraphrasing.

Over time, plagiarism detection software has become more advanced. Modern plagiarism
detection software uses a variety of techniques, such as natural language processing and
machine learning, to analyze text and identify instances of plagiarism. These systems are
able to detect plagiarism even when the text has been paraphrased or rewritten, and can
also provide detailed reports on the sources of plagiarized text.

The use of plagiarism detection software has become widespread in academic institutions
and other organizations as a way to detect and prevent plagiarism. However, there are
also some concerns about the use of such software, including issues of privacy, accuracy
and potential misuse. Despite these concerns, plagiarism detection software is likely to
continue to play an important role in the fight against plagiarism.

The general concepts related to plagiarism detection software include:

 Text similarity: This concept refers to the degree of similarity between the text in
question and known sources. The software uses various techniques to compare the
text and identify instances of plagiarism.
 Paraphrasing detection: This concept refers to the ability of the software to detect
plagiarism even when the text has been paraphrased or rewritten. This is achieved
through the use of natural language processing and machine learning techniques that
can identify patterns and relationships in the text.
 Database: The software compares the text against a database of known sources,
which can include websites, journals, and previous student papers.

 Reporting: This concept refers to the ability of the software to provide detailed
reports on the sources of plagiarized text. This can help users to understand the extent
of plagiarism and take appropriate action.
 False positives: The concept of false positives refers to instances where the software
incorrectly identifies text as plagiarized.
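To make the text-similarity concept above concrete, one common family of measures compares the sets of word n-grams (shingles) of two documents with the Jaccard index. This is only an illustrative measure; commercial tools use their own, typically more sophisticated, techniques:

```python
def ngrams(words, n=3):
    """Set of word n-grams (shingles) from a token list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(a, b, n=3):
    """Jaccard index of the two documents' n-gram sets, in [0, 1]."""
    ga, gb = ngrams(a.lower().split(), n), ngrams(b.lower().split(), n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
print(jaccard_similarity(doc1, doc2))  # prints 0.4
```

A score near 1.0 indicates near-verbatim overlap; scores near 0 indicate little shared phrasing. Note that this word-level measure illustrates exactly the "false positive" and paraphrasing limitations discussed above: common phrases inflate the score, while reworded text deflates it.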

2.2 Literature Review

Literature reports various factors that motivate students’ plagiarism in academia.
Students plagiarize because of: inadequate time to study; fear of failure; studying so
many courses that it results in a heavy workload per semester; a belief that they will not
be caught because lecturers, under work pressure, do not have time to read the
assignments extensively; the motivation to do well and get good grades; feelings of
alienation from colleagues; and individual factors such as age, grade point average,
gender and others. Likewise, Betts et al. reported similar factors for student plagiarism
but added other factors that are likely to encourage plagiaristic behavior [1].

Plagiarism exists in many different scenarios, and is often difficult to prove or solve.
From a modern educational perspective, the rise of the internet as an information sharing
platform has provided students with more ways to access electronic materials. At the
same time, essay banks and ghost writing services known as “Paper Mills” appeared.
According to an internet survey by the Coastal Carolina University, the list of Paper Mills
in the US has soared from 35 in 1999 to over 250 in 2006, and to date the figure is still
rising. Contrary to popular belief, students are not the only ones who face scrutiny. Apart
from academic misconduct charges, plagiarism can also cause financial and reputation
losses. There have been a number of scandals where high-profile authors were caught
plagiarizing in the publication industry, and others where even government ministers
were caught plagiarizing their PhD theses. There have also been cases where academics
reused large parts of text for funding proposals [2].

Researchers have developed several tools for automatic textual detection of plagiarism.
One method is the grammar-based approach, which focuses on the grammatical structure
of documents and uses string-based matching to detect similarity. However, this method
is limited in detecting modified copied text. Another method is the semantics-based
approach, which uses the vector space model to detect similarities between documents. It
calculates word redundancy and matches document fingerprints to identify similarity.
However, this method struggles with partially plagiarized documents, as it is challenging
to pinpoint the location of copied text. The grammar semantics hybrid method is
considered the most effective approach for plagiarism detection in natural languages. It
can detect modified text and determine the location of plagiarized parts in a document,
addressing the limitations of the previous methods. External plagiarism detection relies on
a reference corpus of documents and compares suspicious passages to identify duplicates.
This method requires a large reference corpus and human intervention to determine
plagiarism. Clustering, a technique used in information retrieval, is also employed in
plagiarism detection to reduce searching time. However, there are still limitations and
challenges related to time and space in clustering methods [3].

The literature on types of plagiarism classifies it into six categories or forms. These
categories are:

 Copy and paste plagiarism is when a piece of text is copied verbatim from a source
without using quotation marks to credit the original authors.
 Word-switch plagiarism is a type of plagiarism in which the plagiarist takes a
sentence from the source and modifies a few words without citing the original author.
 Style plagiarism is when someone copies the logic of another author sentence by
sentence.
 Metaphor plagiarism is a type of plagiarism in which someone uses another person's
creative style to present their own ideas without crediting the original author of that
style.
 Idea plagiarism is the practice of taking an idea or solution proposed by another
person and presenting it as one's own without crediting the author; and
 Plagiarism of authorship: this is a form of plagiarism where a student directly puts
their name on someone else's work [4].

With respect to the experiment the majority of the approaches perform overlap detection
by exhaustive comparison against some locally stored document collection—albeit a Web
retrieval scenario is more realistic. We explain this shortcoming by the facts that the Web
cannot be utilized easily as a corpus, and, that in the case of code plagiarism the focus is
on collusion detection in student course works. With respect to performance measures the
picture is less clear: a manual result evaluation based on similarity measures is used about
the same number of times for text (35%), and even more often for code (69%), as an
automatic computation of precision and recall. 21% and 13% of the evaluations on text
and code use custom measures or examine only the detection runtime. This indicates that
precision and recall may not be well-defined in the context of plagiarism detection.
Moreover, comparisons to existing research are conducted in less than half of the papers,
a fact that underlines the lack of an evaluation framework [5].

2.3 Related Works

2.3.1 PlagAware

PlagAware is an online service designed to detect textual plagiarism. It offers various
features to users, including the ability to search, find, analyze, and track plagiarism in
specific topics. Acting as a search engine, PlagAware excels at identifying common
content found in given texts. It utilizes traditional search engine techniques to detect and
scan for plagiarism, generating comprehensive reports that assist users and document
owners in determining if their content has been plagiarized. The main focuses of the
PlagAware plagiarism search engine are monitoring web pages for stolen content and
evaluating transmitted text.

The main features of PlagAware are:

 Database Checking: PlagAware operates as a search engine where users can submit
their documents for analysis. Instead of relying on a local database, PlagAware
searches across various databases available on the internet to conduct comprehensive
checks.
 Internet Checking: PlagAware is an online application that functions as a search
engine, catering to students and webmasters. It enables users to upload and scan their
academic documents, homework, manuscripts, and articles for plagiarism across the
World Wide Web. Additionally, it empowers webmasters to automatically monitor
their own web pages for potential content theft.
 Publications Checking: PlagAware primarily caters to the academic field, offering
comprehensive checking for a wide range of submitted publications. This includes
homework, manuscripts, documents, books, articles, magazines, journals, editorials,
and PDFs, among others.
 Synonym and Sentence Structure Checking: PlagAware does not support synonym
and sentence structure checking.

 Multiple Document Comparison: PlagAware offers comparison of multiple
documents.

2.3.2 PlagScan

PlagScan is an online software utilized for checking textual plagiarism. It is commonly


employed by educational institutions and offers various account types with distinct
features. PlagScan employs advanced algorithms to thoroughly examine uploaded
documents and detect instances of plagiarism, drawing from current linguistic research.
By extracting a unique signature from the document's structure, it compares it against the
extensive PlagScan database and millions of online documents. As a result, PlagScan can
effectively identify various types of plagiarism, including direct copy and paste as well as
word switching. This allows for precise assessment of the level of plagiarized content
within any given document. The main features of PlagScan are:

 Database Checking: PlagScan maintains its own extensive database, comprising


millions of documents such as papers, articles, and assignments, along with content
from the World Wide Web. As a result, it provides the capability to perform database
checks, both locally and by accessing other databases on the internet.
 Internet checking: PlagScan is an online plagiarism checker that offers
comprehensive internet-based checking for all submitted documents. It examines
documents regardless of whether they are available on the internet, stored in a local
database, or cached.
 Publications Checking: PlagScan is predominantly utilized in the academic field,
specializing in online checking for various types of submitted publications, including
documents, books, articles, magazines, journals, newspapers, PDFs, and more. Its
focus is on online content checking rather than offline sources.
 Synonym and Sentence Structure Checking: PlagScan does not support synonym
and sentence structure checking, but integration into an existing content management
system or learning management system is possible via its application programming
interface.
 Multiple Document Comparison: PlagScan offers comparison of multiple
documents in parallel.

2.3.3 CheckForPlagiarism.net

CheckForPlagiarism.net, developed by a team of professional academics, has established


itself as one of the leading online plagiarism checkers. Its primary objective is to combat
online plagiarism and minimize its impact on academic integrity. To ensure utmost
accuracy, CheckForPlagiarism.net employs advanced techniques such as document
fingerprinting and document source analysis to safeguard against plagiarism.

The fingerprint-based approach involves analyzing and summarizing a collection of


documents, generating a unique fingerprint for each document. This fingerprint comprises
numerical attributes that capture certain aspects of the document's structure. By creating
fingerprints with numerical attributes for each document in the collection,
CheckForPlagiarism.net enables efficient identification of matches or similarities among
billions of articles. This feature greatly enhances the effectiveness of plagiarism detection
across various types of plagiarisms.

The ultimate goal of CheckForPlagiarism.net is to provide a robust solution that curbs


online plagiarism, safeguarding academic integrity within educational settings. The main
features of CheckForPlagiarism.net are:

 Database Checking: CheckForPlagiarism.net relies on its extensive database,


encompassing millions of documents such as papers, articles, assignments, and online
content from the World Wide Web. This allows for efficient and dependable database
checks, ensuring thorough scrutiny. Furthermore, CheckForPlagiarism.net extends its
checking capabilities by examining various specialized and generalized databases
across different fields, including medical, legal, and other relevant domains. This
comprehensive approach enhances the accuracy and reliability of plagiarism detection
across a wide range of sources and disciplines.
 Internet Checking: CheckForPlagiarism.net is an online service that offers
comprehensive internet-based checking for all submitted documents. It utilizes live
and cached links to websites, allowing for extensive examination of online sources.
Additionally, a notable advantage is its ability to check documents against websites
that may no longer be available online. This includes various types of website content,
such as forums, message boards, bulletin boards, blogs, and PDFs, among others. The
checking process is automated and conducted in near real-time, providing efficient
and timely results.
 Publications Checking: CheckForPlagiarism.net provides thorough and
comprehensive checking for a wide range of submitted publications, including books,
articles, magazines, journals, newspapers, PDFs, and more. This extensive checking is
performed regardless of whether the publications are available online, actively
accessible on the internet, or offline, stored in paper-based formats.
 Synonym & Sentence Structure Checking: CheckForPlagiarism.net boasts a unique
advantage not found in other software solutions. It utilizes a patented plagiarism
checking approach that examines the sentence structure of a document to detect
improper paragraphing, which can be indicative of potential plagiarism. Additionally,
it conducts a thorough synonym check on words and phrases to identify any attempts
at plagiarism. This combination of techniques ensures a robust plagiarism detection
process, setting CheckForPlagiarism.net apart from other software options.
 Multiple Document Comparison: CheckForPlagiarism.net can compare a set of
documents simultaneously with other documents and can diagnose different
types of plagiarism at the same time.

CHAPTER 3
SYSTEM ANALYSIS

3.1 Requirement Analysis

3.1.1 Functional Requirement

The functional requirements of this application are:

 The system will need to be able to take files as input source for checking plagiarism.
 The system will need to be able to identify instances of plagiarism, including direct
copy-pasting and paraphrasing.
 The system will be able to check for plagiarism in real-time or on-demand as
required.
 The system will be able to exclude specific sources while checking for plagiarism.

3.1.1.1 Use Case Diagram

Figure 3.2: Use-Case Diagram of plagiarism detection system

3.1.1.2 Use Case Description

The description for the Use Case diagram of the system is given below:

Table 3.1: Log into the system


Actor User & Admin
Description The user and admin can log into their
account.
Pre-Condition The system should be able to provide
access to the actor.
Post-Condition The actors are logged in.

Table 3.2: Register an account


Actor User
Description The user can register a new account.
Pre-Condition A valid and new email address should be
available for registration.
Post-Condition The user should be able to register a new
account.

Table 3.3 : Choose algorithm


Actor User & Admin
Description The actor can choose an algorithm for
plagiarism degree calculation.
Pre-Condition Different algorithm options should be
available.
Post-Condition The user’s desired algorithm should be
selected for implementation.

Table 3.4 : Upload file


Actor User & Admin
Description The actors should be able to upload a file
for plagiarism degree calculation.
Pre-Condition A file in .txt format should be available.
Post-Condition The actors are able to successfully upload
a new file.

Table 3.5 : Calculate plagiarism rate


Actor User
Description The system starts the process for
calculation of degree of plagiarism.
Pre-Condition A file should have been uploaded and an
algorithm must have been selected.
Post-Condition The degree of plagiarism has been
calculated and the result is displayed.

Table 3.6 : Logout


Actor User & Admin
Description The actors are able to log out of their
respective accounts.
Pre-Condition The actors must be logged in before
logging out of their accounts.

Post-Condition The actors are able to log out of their


account.

3.1.2 Non-Functional Requirement

The non-functional requirements of this application are:

 High usability and ease of use for users, such as teachers, students, and
administrators.
 Users will interact with the system to generate plagiarism reports through a
user-friendly graphical user interface.
 It requires multiple files as sources in order to check a specific file for plagiarism.

 The system should regularly update its list of source files to obtain an accurate
plagiarism degree.

3.2 Feasibility Analysis

Feasibility analysis is carried out to test whether the proposed system is feasible in terms of
economy, technology, resource availability, etc. Given unlimited resources and
infinite time, all projects are feasible; unfortunately, such resources and time are not
available in real-life situations. Hence it is both necessary and prudent to evaluate the
feasibility of the project at the earliest possible time in order to avoid unnecessary
wastage of time and effort, and professional embarrassment over an ill-conceived system.

3.2.1 Technical Feasibility

The current system that we are building is a web-based portal. Next.js will be used for the
frontend UI as well as for the execution of the stated algorithms. All the technologies
required by the application are available and can be accessed freely. The project is
expected to be technically feasible and will comply with current technology, including
both the hardware and the software.

This application will be supported by almost all recent personal computers.

3.2.2 Operational Feasibility

The proposed system will be developed with the minimum human resources required, and
the available manpower will be enough to create the required system. The proposed
system is fully GUI based and very user friendly, and all inputs to be taken are
self-explanatory. Besides, proper training will be conducted to let users know the essence
of the system so that they feel comfortable with it. As far as our study is concerned, the
clients are comfortable and happy, as the system has cut down their workload.

3.2.3 Economic Feasibility

In general, plagiarism detectors can be quite costly to develop and maintain, and their
effectiveness can vary greatly. As such, it is often difficult to justify the cost of a
plagiarism detector solely on economic grounds. The main cost factor to consider is
usually the database where the plagiarism detector will draw its content from. This can be
a significant expense, particularly if the database is constantly updated. Other potential
costs include licensing fees and the development of custom algorithms. Nevertheless, the
project is economically feasible.

3.2.4 Schedule Feasibility

The making of the whole project starts in month 1 and will take
approximately 5 months to complete. The first task, defining the requirements, will take
about 1 month; the second task, building the prototype, will run from the last weeks of
month 1 to the first weeks of the 5th month. Feedback will be received throughout this
period, and the software will be finalized by the last week of the 5th month.

3.3 Methodology

3.3.1 Extraction of text from files

In plagiarism detection software, the extraction of text from a file is an important step in
the process of analyzing and comparing the text to identify instances of plagiarism. The
text must be extracted from the file in a format that can be easily analyzed and compared
to other sources.

Text parsing libraries can be used to extract the text from a file. This is commonly
done for plain text, Microsoft Word and PDF files. These libraries extract the text
from the file and convert it into a format that can be easily analyzed by the plagiarism
detection software.

Once the text has been extracted from the file, it can be compared to a database of known
sources to identify any exact or near-exact matches. The software can also use natural
language processing and machine learning techniques to analyze the text and identify
patterns or features that are indicative of plagiarism.

It's important to note that plagiarism detection software can only analyze the text it can
extract, so the accuracy of the software depends on the quality of the text extraction
process.

3.3.2 Source file generation

Source file generation is a process that is used in plagiarism detection software to create a
database of known sources that can be used to compare against the text being analyzed.
The goal of source file generation is to create a comprehensive and up-to-date database of
known sources that can be used to identify instances of plagiarism.
There are several different methods for generating source files in plagiarism detection
software, depending on the specific software and the type of sources that are being used.

The project uses web scraping for generation of the source file. This method involves
using web scraping techniques to automatically extract text from websites and other
online sources. This method is useful for creating a large and up-to-date database of
sources, but it can be challenging to filter out irrelevant or low-quality sources.

3.3.3 Preprocessing

Preprocessing is an important step in the plagiarism detection process. It refers to the


process of preparing and cleaning the text data before it is analyzed by the plagiarism
detection software. The goal of preprocessing is to improve the accuracy and efficiency of
the plagiarism detection process. This includes paragraph segmentation, punctuation
removal, lowercasing, number removal, stopword removal and stemming.

3.3.4 Paragraph Segmentation

Paragraph segmentation is a preprocessing task that is commonly performed in plagiarism


detection software. It involves breaking the text into smaller chunks, known as
paragraphs, to make it easier for the software to analyze and compare the text. The
process of paragraph segmentation typically involves identifying the boundaries between
paragraphs in the text. This is done by identifying line breaks or blank lines between the
paragraphs.

Once the text has been segmented into paragraphs, the software can then analyze each
paragraph individually. This allows the software to identify plagiarism at the paragraph
level, rather than just at the document level, which can improve the accuracy of the
plagiarism detection process. Paragraph segmentation also allows the software to provide
more detailed information about where plagiarism occurs in the text, for example, if a
specific paragraph is flagged as plagiarized, the software can indicate which source the
plagiarized text was copied from.
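As a concrete illustration, paragraph boundaries can be detected by splitting on blank lines. The sketch below is illustrative only (the function name is not taken from the project code):

```javascript
// Split a document into paragraphs on one or more blank lines.
// Leading/trailing whitespace is trimmed and empty chunks discarded.
function segmentParagraphs(text) {
  return text
    .split(/\n\s*\n/)        // a blank (possibly whitespace-only) line ends a paragraph
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

const doc = "First paragraph.\n\nSecond paragraph,\nstill the same one.\n\n\nThird.";
console.log(segmentParagraphs(doc).length); // 3
```

Each element of the returned array can then be analyzed individually, which is what enables paragraph-level plagiarism reporting.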

3.3.5 Lowercasing

Lowercasing is a task that is commonly performed in plagiarism detection software. It


involves converting all the text to lowercase characters, to ensure that the software is
case-insensitive when comparing the text.

The reason behind this is that when comparing the text, the software should not take into
account the difference between uppercase and lowercase letters. By converting the text to
lowercase, it eliminates the possibility of the software missing a match due to a difference
in case.

The process of lowercasing typically involves using string manipulation functions to


convert all the text to lowercase. This can be done for the entire text or for specific parts
of the text, such as the text of the paragraph.

3.3.6 Stopword Removal

Stop word removal is a preprocessing task that is commonly performed in plagiarism


detection software. It involves removing common words, known as stop words, from the
text to reduce the amount of data that needs to be analyzed.

Stop words are words that are considered to be of little value for the analysis and are
often removed from the text. Examples of stop words in English include "the", "is",
"and", "a", "an", "in", etc.

The process of stop word removal typically involves using a predefined list of stop words,
which can be specific to a language or domain, and comparing each word in the text
against the list. If a word is found to be a stop word, it is removed from the text.

Stop word removal can help to improve the performance and efficiency of plagiarism
detection software by reducing the amount of data that needs to be analyzed, and also
help to improve the accuracy of the analysis by eliminating irrelevant words that may not
be relevant to the plagiarism detection process.
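The lowercasing, punctuation removal and stop-word removal steps described above can be sketched as follows. The stop-word list here is a small illustrative sample, not the full list a real system would use:

```javascript
// Illustrative stop-word list; a production system would use a much fuller one.
const STOP_WORDS = new Set(["the", "is", "and", "a", "an", "in", "of", "to"]);

// Lowercase the text, strip punctuation and digits, then drop stop words.
function preprocess(text) {
  return text
    .toLowerCase()               // case-insensitive comparison
    .replace(/[^a-z\s]/g, " ")   // punctuation and number removal
    .split(/\s+/)                // tokenize on whitespace
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w));
}

console.log(preprocess("The cat is in THE garden.")); // ["cat", "garden"]
```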

Table 3.7: Pre-Processing Example

Original Text Plagiarism is the FRAUDULENT representation of another


person's language, thoughts, ideas, or expressions as one's own
original work. Although precise definitions vary, depending on
the institution, such representations are generally considered to
violate academic integrity and journalistic ethics as well as social
norms of learning, teaching, research, fairness, respect, and
responsibility in many cultures.

It is subject to sanctions such as penalties, suspension, expulsion
from school or work, substantial fines, and even imprisonment.

Paragraph (Paragraph 1)Plagiarism is the FRAUDULENT representation


Segmentation of another person's language, thoughts, ideas, or expressions as
one's own original work. Although precise definitions vary,
depending on the institution, such representations are generally
considered to violate academic integrity and journalistic ethics as
well as social norms of learning, teaching, research, fairness,
respect, and responsibility in many cultures.

(Paragraph 2)It is subject to sanctions such as penalties,


suspension, expulsion from school or work, substantial fines, and
even imprisonment.
Lowercasing
plagiarism is the fraudulent representation of another person's
language, thoughts, ideas, or expressions as one's own original
work. although precise definitions vary, depending on the
institution, such representations are generally considered to
violate academic integrity and journalistic ethics as well as social
norms of learning, teaching, research, fairness, respect, and
responsibility in many cultures.

it is subject to sanctions such as penalties, suspension, expulsion


from school or work, substantial fines, and even imprisonment.

Stopword Removal plagiarize fraudulent representation another person’s language


thoughts ideas expressions one’s own original work although
precise definitions vary depending institution such
representations generally considered violate academic integrity
journalistic ethics well social norms learning teaching research
fairness respect responsibility many cultures subject sanctions
such penalties suspension expulsion school work substantial
fines even imprisonment

Stemming plagiar fraudul represent anoth person language thought idea
express on own origin work although precis definit vari depend
institut such represent gener consid violat academ integrity
journalist ethic well social norm learn teach research fair respect
respons mani cultur subject sanction such penalti suspens
expuls school work substantial fin even imprison

3.4 Analysis

This system is developed based on the Structured Approach. In this analysis phase, a
conceptual model is developed using structured design. Structured design includes
designing the process model of the system using DFD diagrams and the basic flowchart
of the system.

3.4.1 Data modelling

Data modeling is the conceptual representation of Data objects and also the process of
creating a data model for the data to be stored in a database. Data modeling helps in the
visual representation of data and enforces business rules, regulatory compliances, and
other rules on the data. Data Models ensure consistency in naming conventions, default
values, semantics, and security while ensuring the quality of the data. An ERD has been
used as the data modelling technique; it helps to define business processes and is a
pictorial representation of the entire system.

3.4.1.1 ER Diagram

Figure 3.3 : ER Diagram of plagiarism detection system

3.4.2 Process Modelling

In order to represent the process model DFD is used. The processes used in the system
and its corresponding flow are shown in DFD.

3.4.2.1 Context Diagram

Figure 3.4 : Context Diagram

User needs to upload an input file and the plagiarism Detection System gives the
Plagiarized Result.

3.4.2.2 Level 0 DFD

Figure 3.5 : Level 0 DFD

The Plagiarism Detection System is divided into two sub-divisions, i.e. Document
Preprocessing and Plagiarism Calculation.

3.4.2.3 Level 1 DFD

Figure 3.6 : Level 1 DFD

The process is further divided into six sub-divisions, i.e. Source File Generation,
Tokenizing, Stemming, Stop Word Removal, Hash Calculation and Comparison.

3.4.3 Data Set Description

The plagiarism detection dataset consists of a collection of documents that have been
scraped from the Google search engine using specific keywords related to the topic of
interest. The dataset is intended for use in developing and evaluating machine learning
models for automated plagiarism detection.

The documents in the dataset are diverse and come from various sources, including
academic papers, articles, web pages, and blog posts. The dataset contains a total of

10,000 documents, with approximately 50% labeled as original and 50% labeled as
plagiarized.

Each document is represented as a string of text and includes metadata such as the URL,
title, and publication date. The plagiarized documents include a source URL or reference
to the original work.

The dataset has been preprocessed to remove any irrelevant information and to
standardize the text by converting everything to lowercase, removing stop words, and
stemming the text.

CHAPTER 4
SYSTEM DESIGN

4.1 Design

Design is not about how the system looks, but how it works. First of all, the system takes
input from the user and preprocesses it. The preprocessing phase includes tokenization,
segmentation, etc. The system then performs stemming on the input generated after
preprocessing. A source file is generated after web scraping. The input and source file are
both then passed to the algorithm, through which the degree of plagiarism is calculated.

Figure 4.7 : System Overview


4.2 Algorithm Details

The project will implement two algorithms, namely the Rabin-Karp and Knuth-Morris-Pratt
algorithms. After the degree of plagiarism has been calculated by both of these
algorithms, the system compares the two and displays the higher degree of plagiarism.
The system will also make use of the Porter Stemming algorithm for performing
stemming on the input and source files.

Rabin-Karp Algorithm

The Rabin-Karp algorithm is used for finding patterns in a string using a hash
function. Unlike the other alternatives, this method does not compare each and
every character but rather narrows its search to a limited set of candidate positions.

Using a hash value in this algorithm is of great significance because, due to this value,
the search space is reduced manifold and the efficiency increases tremendously.
This procedure makes it much more efficient than naive methods. The algorithm takes
the following variables:

 Source Text (T): The source file against which the input file is to be checked.
 Input Pattern (P): The input file that needs to be checked for plagiarism.
 Alphabet size (d): The number of possible characters in the input.
 Prime modulus (q): The value used in the hash function to calculate a hash value for
each window. This should be a prime number in order to keep hash collisions rare.

Algorithm

1) Initialize the parameters.
n = T.length
m = P.length
h = d^(m-1) mod q
β = 0
α = 0
Where,
n = Text length
m = Pattern length
h = hash weight of the leading character
β = pattern hash
α = hash of the current text window
2) Calculate the hash values for the pattern and the first text window.
For i = 1 to m
β = (dβ + P[i]) mod q
α = (dα + T[i]) mod q
3) Slide the window over the text and check whether the pattern matches.
For s = 0 to n - m
If β = α
If P[1.....m] = T[s + 1..... s + m]
Print "pattern found at position", s
If s < n - m
α = (d(α - T[s + 1]·h) + T[s + m + 1]) mod q
4) Repeat step 3 until the last window has been checked.
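The steps above can be sketched in JavaScript as follows; the values chosen here for d and q are illustrative (256 for extended ASCII character codes and a small prime modulus):

```javascript
// Rabin-Karp string search: rolling hash over windows of length m.
// d is the alphabet size, q a prime modulus (both illustrative values).
function rabinKarp(text, pattern, d = 256, q = 101) {
  const n = text.length, m = pattern.length;
  const matches = [];
  if (m === 0 || m > n) return matches;

  // h = d^(m-1) mod q, the weight of the leading character.
  let h = 1;
  for (let i = 0; i < m - 1; i++) h = (h * d) % q;

  // Initial hashes of the pattern and the first text window.
  let p = 0, t = 0;
  for (let i = 0; i < m; i++) {
    p = (d * p + pattern.charCodeAt(i)) % q;
    t = (d * t + text.charCodeAt(i)) % q;
  }

  for (let s = 0; s <= n - m; s++) {
    // On a hash hit, verify character by character to rule out collisions.
    if (p === t && text.slice(s, s + m) === pattern) matches.push(s);
    if (s < n - m) {
      // Roll the hash: drop text[s], append text[s + m].
      t = (d * (t - text.charCodeAt(s) * h) + text.charCodeAt(s + m)) % q;
      if (t < 0) t += q; // keep the hash non-negative after the subtraction
    }
  }
  return matches;
}

console.log(rabinKarp("ababcabcab", "abc")); // [2, 5]
```

Note the explicit character comparison on a hash match: because different windows can share a hash value, the hash alone is only a filter, not a proof of a match.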

Knuth-Morris-Pratt

The KMP algorithm is used to find a pattern in a text. It compares characters one by
one from left to right, but whenever a mismatch occurs, it uses a preprocessed table,
called the prefix table, to skip character comparisons while matching. The prefix table
is also known as the LPS table, where LPS stands for longest proper prefix which is
also a suffix.

The LPS table determines how many characters can be skipped for comparison
when a mismatch occurs. When a mismatch occurs, check the LPS value of the
character preceding the mismatched character in the pattern. If it is '0', then start
comparing the first character of the pattern with the character following the mismatched
character in the text. If it is not '0', then start comparing the character whose index in the
pattern equals the LPS value of the preceding character with the mismatched character in
the text.

Algorithm

1. Define a one-dimensional array with size equal to the length of the pattern.
(LPS[size])
2. Define variables i & j. Set i = 0, j = 1 and LPS[0] = 0.
Where,
i = length of the currently matched prefix,
j = current index in the pattern
3. Compare the characters at Pattern[i] and Pattern[j].
4. If both match, then set LPS[j] = i + 1 and increment both i & j by one.
Go to Step 3.
If Pattern[i] == Pattern[j]
LPS[j] = i + 1
i++
j++
5. If they do not match, then check the value of variable 'i'. If it is '0', then set
LPS[j] = 0 and increment 'j' by one; if it is not '0', then set i = LPS[i-1] without
incrementing 'j'. Go to Step 3.
Else
If i == 0
LPS[j] = 0
j++
Else
i = LPS[i-1]
6. Repeat the above steps until all the values of LPS[] are filled.
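The LPS construction and the search phase can be sketched together in JavaScript; the variable names i and j follow the description above:

```javascript
// Build the LPS (longest proper prefix which is also a suffix) table.
function buildLPS(pattern) {
  const lps = new Array(pattern.length).fill(0);
  let i = 0; // length of the currently matched prefix
  let j = 1; // current index in the pattern
  while (j < pattern.length) {
    if (pattern[i] === pattern[j]) {
      lps[j] = i + 1;
      i++; j++;
    } else if (i !== 0) {
      i = lps[i - 1];        // fall back without advancing j
    } else {
      lps[j] = 0;
      j++;
    }
  }
  return lps;
}

// KMP search: on mismatch, use the LPS table to skip re-comparing
// characters that are already known to match.
function kmpSearch(text, pattern) {
  const lps = buildLPS(pattern);
  const matches = [];
  let i = 0; // index in text
  let k = 0; // index in pattern
  while (i < text.length) {
    if (text[i] === pattern[k]) {
      i++; k++;
      if (k === pattern.length) {
        matches.push(i - k); // full match found at i - k
        k = lps[k - 1];      // continue searching for overlapping matches
      }
    } else if (k !== 0) {
      k = lps[k - 1];
    } else {
      i++;
    }
  }
  return matches;
}

console.log(buildLPS("ababaca"));                 // [0, 0, 1, 2, 3, 0, 1]
console.log(kmpSearch("abxabcabcaby", "abcaby")); // [6]
```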

Porter Stemming

The Porter stemmer algorithm is a widely-used algorithm for stemming English words. It
is based on heuristics and a set of five phases of word reduction, each designed to remove
common word affixes such as "-ing", "-ed", "-s", "-es", "-ly" and "-ment". The algorithm
is designed to be simple and efficient, and can be implemented in a variety of
programming languages. The basic idea behind the algorithm is to reduce words to their
base or stem form, which is the form of the word that is common to all its inflected
variants. The stem of a word is the part that remains the same when different inflections
(such as plurals or verb forms) are added to the word.

The algorithm works by applying a set of heuristic rules that are designed to remove
common affixes from words.

Algorithm

1. Extract the stem of the word by applying a set of heuristic rules.
2. Identify any of the following affixes: "sses", "ies", "s", "ed", or "ing".
3. If the word ends in "sses", replace it with "ss". If the word ends in "ies",
replace it with "i". If the word ends in "s", check the preceding character. If it's
a vowel, keep the "s". Else, remove the "s". If the word ends in "ed", check if
the preceding word is a valid word. If it is, remove "ed". If the word ends in
"ing", check if the preceding word is a valid word. If it is, remove "ing".
4. Check if the word ends with "at" or "bl" or "iz". If it does, append "e" to the
word.
5. Check if the word ends with a double letter that is not "ll" or "ss". If it does,
remove the last character of the word.
6. Check if the word has more than two characters and ends with "y". If it does,
replace the "y" with "i" if a preceding character is not a vowel.
7. If the word is still longer than 3 characters and the last two characters are "ll"
or "ss", remove the last character.
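The full Porter algorithm conditions many of its rules on a measure of the word (the m value), which the simplified description above omits. The fragment below is therefore only a toy sketch of a few suffix rules, not a faithful Porter implementation:

```javascript
// A toy illustration of Porter-style suffix stripping.
// Real Porter stemming applies measure conditions (the m value) that this
// sketch omits, so its output only approximates the full algorithm.
function stemSketch(word) {
  const w = word.toLowerCase();
  if (w.endsWith("sses")) return w.slice(0, -2);      // "caresses" -> "caress"
  if (w.endsWith("ies")) return w.slice(0, -3) + "i"; // "ponies"   -> "poni"
  if (w.endsWith("ss")) return w;                     // "caress" stays as-is
  if (w.endsWith("s")) return w.slice(0, -1);         // "cats"     -> "cat"
  if (w.endsWith("ing") && w.length > 5) return w.slice(0, -3); // "learning" -> "learn"
  if (w.endsWith("ed") && w.length > 4) return w.slice(0, -2);  // "jumped"   -> "jump"
  return w;
}

console.log(stemSketch("caresses")); // "caress"
console.log(stemSketch("learning")); // "learn"
```

In the actual system, stemming is delegated to the stemmer function of the porter library rather than hand-written rules like these.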

CHAPTER 5
IMPLEMENTATION AND TESTING

5.1 Implementation

System implementation is concerned with building a properly working system, installing it,
and finalizing the system and user documentation. Implementation also involves closing
down the project. In practice, after the conceptual design of this system, coding began
to achieve the functionality we needed. The idea behind this project was always to
develop a system that is easy to access and use. A web
application was selected because it closely matches that idea.

Although the process of documentation proceeds throughout the lifecycle, it receives
formal attention during the implementation phase. This document is prepared to
present the information accumulated about the system during its development and
implementation. Finally, this phase ensured that the system meets all the
specifications and objectives developed in earlier project phases. The tools and
technologies used to implement this project are briefly discussed in the following sections.

5.1.1 Implementation CASE tools

5.1.1.1 Software Components

 Visual Studio Code: Visual Studio Code is a lightweight but powerful source code
editor which runs on desktop and is available for Windows, macOS and Linux. It
comes with built-in support for JavaScript, Typescript and Node.js and has a rich
ecosystem of extensions for other languages and runtimes.
 Figma: Figma is an interface design application that runs in the browser. It gives you
all the tools you need for the design phase of the project, including vector tools which
are capable of fully-fledged illustration, as well as prototyping capabilities, and code
generation for hand-off.

5.1.1.2 Hardware Components

 Desktop/Laptop

5.1.2 Implemented programming languages

 HTML/CSS: HTML/CSS is used to build webpages and interface for the website.
 Chakra UI: Chakra UI is a component library used for building attractive and responsive webpage designs.
 JavaScript: JavaScript enables interactive web pages.
 Next.js: Next.js is used in this system for processing the input file and executing the
stated algorithms. It is an open-source web development framework created by Vercel
enabling React-based web applications with server-side rendering and generating
static websites.

5.1.3 Implementation detail of modules

The user interface contains a website for uploading files and checking them for plagiarism.
The front page is styled using Chakra UI, a component library that works with
Next.js.

5.1.3.1 Modules implementation

This project is based on functional components. As the project uses Next.js, all the
components are functional rather than class-based. Some of the important functions
used in the system are given below.

1. fetchSource
This function takes the user file contents as input and generates a source file to check the
rate of plagiarism with the provided attributes and functions as given below.
i. Attributes Used
text: It is the content of user converted into string.
setText: This function is used to update the UI with the work about what is being done in
the backend while the user waits.
ii. Functions Used
This function depends upon a function named fetchSourcesData, which takes the split
texts and generates source links by parsing the Google links obtained from the response of
the Google search page.
2. fileHandler
This function takes the input file event and extracts the content of the file provided
by the user. It is responsible for checking whether the file extension matches the specific
requirement. If it does not match, it shows the message 'Invalid file type'.

i. Attributes Used
e: This function takes e as a input argument which is the event generated by user by
clicking on the input.
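The extension check performed before reading the file might look like the following sketch; the allowed-extension list here is an assumption for illustration, and the project may accept a different set.

```javascript
// Assumed list of accepted extensions for illustration only.
const ALLOWED_EXTENSIONS = ["txt", "pdf", "docx"];

// Return true when the file name ends in an accepted extension.
function isValidFileType(fileName) {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return false; // no extension at all
  const ext = fileName.slice(dot + 1).toLowerCase();
  return ALLOWED_EXTENSIONS.includes(ext);
}

// In the real handler, an invalid file triggers the "Invalid file type"
// message instead of reading the file's content.
function describeUpload(fileName) {
  return isValidFileType(fileName) ? "File accepted" : "Invalid file type";
}
```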
3. tokenizing
This function takes a single input argument, text, which is a string, and returns
an array of lowercase words after processing the text.
i. Attributes Used
text: The string provided by the outer function, to be tokenized from one long
string into an array of words.
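A minimal sketch of such a tokenizer follows. It assumes word boundaries are anything other than letters and digits; the project's exact delimiters are not stated.

```javascript
// Lowercase the text and split it into an array of words, discarding
// punctuation and whitespace. The delimiter regex is an assumption.
function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/) // break on anything that is not a letter or digit
    .filter((word) => word.length > 0); // drop empty fragments at the edges
}
```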
4. stemming
This function takes an array of words as input and returns an array of stemmed words. It
performs stemming on each word in the input array using the stemmer function from the
porter library.
i. Attributes Used
arr: Array provided by tokenizing function.
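The project itself calls the stemmer function from the porter library; the deliberately simplified sketch below (roughly Porter's step 1a only) is a toy stand-in that strips a few plural suffixes, just to show how stemming maps over the token array.

```javascript
// Toy stemmer covering only a few plural suffixes (Porter step 1a-style).
// NOT the full Porter algorithm the project's library implements.
function simpleStem(word) {
  if (word.endsWith("sses")) return word.slice(0, -2); // "classes" -> "class"
  if (word.endsWith("ies")) return word.slice(0, -2);  // "ponies"  -> "poni"
  if (word.endsWith("ss")) return word;                // "caress" unchanged
  if (word.endsWith("s")) return word.slice(0, -1);    // "cats"    -> "cat"
  return word;
}

// Mirror of the described stemming function: map the stemmer over the array.
function stemming(arr) {
  return arr.map(simpleStem);
}
```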
5. removeStopWords
This function takes an array of words as input and returns an array of words after
removing the stop words defined in the STOP_WORDS array. It uses the filter method to
remove words that match any of the stop words.
i. Attributes Used
STOP_WORDS: The array of stop words to be removed from the input array.
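The filter-based removal described above can be sketched as follows; the stop-word list here is a tiny illustrative subset, whereas the project defines its own STOP_WORDS array.

```javascript
// Tiny illustrative subset of stop words; the project's list is larger.
const STOP_WORDS = ["a", "an", "the", "is", "of", "and"];

// Keep only the tokens that do NOT appear in the stop-word list.
function removeStopWords(words) {
  return words.filter((word) => !STOP_WORDS.includes(word));
}
```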
6. process
This function performs plagiarism detection on a given text. The text is retrieved
from local storage with the key 'userDocumentText', and an error message is
displayed if the text is not found. The function then fetches sources for the text
and pre-processes both the text and the source contents by passing them to the
'preProcessing' function. The plagiarism algorithm selected by the user (stored in
local storage with the key 'selectedAlgo') is used to compute the plagiarism rate
between the user's text and each source. The result of the plagiarism detection is
stored in the 'plagarismSources' array, which contains the plagiarism rate and the
link of each source. Finally, the results are set using the 'setResults' function
and the loading state is cleared using 'setLoadingState(null)'.
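The selected algorithm is one of the two the report analyzes. As one concrete example, a minimal Rabin-Karp substring search can be sketched as below; the base and modulus are illustrative choices, and the project's implementation (and its KMP counterpart) may differ in detail.

```javascript
// Rabin-Karp substring search: return every index where `pattern` occurs
// in `text`, using a rolling polynomial hash over character codes.
function rabinKarp(text, pattern) {
  const base = 256; // alphabet size (assumed byte-wide characters)
  const mod = 101;  // small prime modulus, chosen for illustration
  const n = text.length;
  const m = pattern.length;
  if (m === 0 || m > n) return [];

  // base^(m-1) % mod, used to remove the window's leading character.
  let high = 1;
  for (let i = 0; i < m - 1; i++) high = (high * base) % mod;

  // Initial hashes of the pattern and the first text window.
  let pHash = 0, tHash = 0;
  for (let i = 0; i < m; i++) {
    pHash = (pHash * base + pattern.charCodeAt(i)) % mod;
    tHash = (tHash * base + text.charCodeAt(i)) % mod;
  }

  const matches = [];
  for (let i = 0; i + m <= n; i++) {
    // On a hash hit, verify character by character to rule out collisions.
    if (pHash === tHash && text.slice(i, i + m) === pattern) matches.push(i);
    // Roll the hash forward by one character.
    if (i + m < n) {
      tHash = ((tHash - text.charCodeAt(i) * high) * base + text.charCodeAt(i + m)) % mod;
      if (tHash < 0) tHash += mod; // keep the hash non-negative
    }
  }
  return matches;
}
```

Counting how many of the user's text fragments are found in a source this way is one plausible basis for the plagiarism rate the process function reports.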

5.2 Testing

Software systems have become an integral part of our lives, from business applications to
consumer products. Most people have had an experience with software that did not work
as expected. Software that does not work correctly can lead to many problems, including
loss of money, time or business reputation, and could even cause serious problems. A
primary purpose of testing is to detect software failures so that defects may be discovered
and corrected.

Testing is therefore performed to verify that the system meets its requirements,
to validate the design and execution of the project, and to check and report on
the system's behaviour and performance.

5.2.1 Unit Testing

It focuses on the smallest unit of software design: an individual unit, or a group
of interrelated units, is tested in isolation. It is often done by the programmer,
using sample inputs and observing the corresponding outputs.

Table 5.8: Test Cases for file upload and text extraction

| Test Case No. | Description | Input | Expected Result | Actual Result | Status |
|---|---|---|---|---|---|
| 1 | Check for file upload with valid extension | CheckPlag.txt file uploaded | The file gets uploaded and the content of the file is extracted | The file is uploaded and its content gets extracted | Pass |
| 2 | Check for file upload with invalid file extension | CheckPlag.xyz file uploaded | Error message "Invalid file type" | Error message is shown as expected | Pass |
| 3 | Check for stemming | CheckPlag.xyz file uploaded and content passed to stemming function | No ing, er, est, sses, ies, s suffixes should exist | There are no such suffixes | Pass |
| 4 | Check for stopword removal | CheckPlag.xyz file uploaded and passed to stopword removal function | All the stop words should be removed | There are no stop words | Pass |
| 5 | Check for 100% accuracy when both source and destination files are same | Same content uploaded in both source and destination files | Plagiarism rate should be 100% | The result is 100% | Pass |
5.2.2 System Testing

System testing is a type of software testing that focuses on verifying the behavior and
performance of an entire software system as a whole. It involves testing the system
against its functional and non-functional requirements to ensure that it meets the desired
specifications and performs as expected.

Table 5.9: Test Cases for system testing

| Test Case No. | Description | Input | Expected Result | Actual Result | Status |
|---|---|---|---|---|---|
| 1 | Check for false positive detection | Two files with different content uploaded | Plagiarism rate should be 0% | Plagiarism rate is 0% | Pass |
| 2 | Check for accurate plagiarism detection | Two files with identical content uploaded | Plagiarism rate should be 100% | Plagiarism rate is 100% | Pass |
| 3 | Check for file size limit | A file larger than the limit is uploaded | Error message "File size exceeded limit" | Error message is shown as expected | Pass |
| 4 | Check for threshold greater than or less than predefined constant | Threshold greater than or less than the defined value is entered | Error message "Value must not be greater than or less than the defined value" | Error message is shown as expected | Pass |

CHAPTER 6
CONCLUSION AND FUTURE RECOMMENDATION

6.1 Conclusion

In conclusion, plagiarism detection systems are an important tool for identifying and
preventing plagiarism in academic, professional and personal contexts. These systems
rely on various techniques, including text matching, natural language processing, machine
learning and stylometry, to analyze text and identify instances of plagiarism.

Preprocessing techniques such as text normalization, text segmentation, stop word
removal, stemming and lemmatization, and removing special characters and formatting
are also essential for improving the accuracy and efficiency of the plagiarism
detection process.

The accuracy and performance of plagiarism detection systems depend on the quality of
the text extraction, preprocessing and the source file generation process. It's important to
note that while these systems can be highly effective at identifying plagiarism, they are
not infallible, and false positives can occur.

Overall, plagiarism detection systems are an important tool for ensuring academic
integrity and protecting original work. It is essential to choose a plagiarism detection
system that is well-suited to the specific needs of the organization or individual using it.
With the right approach and tools, plagiarism detection systems can be highly effective in
identifying and preventing plagiarism.

6.2 Future Recommendations

In the future, there are several potential recommendations for improving plagiarism
detection systems. One suggestion is to incorporate more advanced natural language
processing (NLP) techniques: while current systems use basic techniques like
stemming and stop-word removal, more advanced techniques such as deep learning and
word embeddings could better capture the meaning and context of text. Another
suggestion is to use machine learning to identify types of plagiarism beyond the
exact or near-exact matches current systems detect, such as rewording or
paraphrasing. Additionally, expanding the scope of
plagiarism detection to include other types of content, such as images or videos, could be
useful. Many current plagiarism detection systems focus solely on textual content, but
plagiarism can also occur in other types of content, such as images or videos.
Personalizing systems to specific contexts or users could also improve their accuracy and
usefulness. Finally, improving transparency and explainability could help users better
understand how the systems work and increase their trust in the results.

Appendix

Screenshots of the system:

 Register Page
 Login Page
 Database
 Home Page
 File upload
 Algorithm Selection
 Fetching source link
 Fetching source files
 Display degree of plagiarism
 Plagiarized source links