You are on page 1of 4

Semantic Exploration of Textual Analogies for Advanced Plagiarism Detection

Elyah Frisco Andriantsialo Volatiana Marielle Ratianantitra


GLoRe, Madagascar GLoRe, Madagascar
elyahfrisco7@gmail.com volatianamarielle@yahoo.fr

Thomas Mahatody
GLoRe, Madagascar
tsmahatody@gmail.com

Abstract infringement and violation of copyright law


(Vandendorpe, 1992).
This study explores textual analogies in A document is considered plagiarized when it
French within the context of plagiarism is produced by applying a series of
detection, adopting a semantic approach.
transformations to an original document. The
By combining traditional methods with
advanced models such as BERT and GPT,
plagiarized document should retain the same
the paper proposes a hybrid model to function as the original but may take on a different
enhance detection efficiency. Comparative form. There are several types of plagiarism,
evaluation highlights the model's ability to including copy-paste, paraphrasing, the use of
detect subtle similarities and paraphrases. false references, and plagiarism of ideas.
The approach represents a significant (Mostafa, 2016)
advancement in accurate plagiarism
detection by leveraging deep contextual 2.2 Natural Language Processing
understanding and the reformulation
Natural Language Processing (NLP) is a
capabilities of integrated models.
multidisciplinary field involving linguistics,
computer science, and artificial intelligence (AI)
1 Introduction
with the aim of creating NLP tools capable of
Plagiarism detection, constantly evolving, remains automatically processing linguistic data for
a crucial challenge in the field of digital content various applications.
management. The emergence of new copying . Some of the most well-known applications
methods and circumvention of traditional systems include automatic translation, information
necessitates the exploration of more advanced and extraction, text summarization, spell checking,
adaptive solutions. In this perspective, our automatic generation, voice synthesis, speech
research positions itself at the forefront of recognition, and the detection of specific topics
innovation by focusing on Semantic Exploration (sentiment analysis, etc.) (Ratianantitra, 2023).
of Textual Analogies. One outcome of the progress in NLP is GPT
The current landscape, marked by the (Generative Pre-trained), a language model
sophistication of plagiarism practices, underscores employing deep learning to generate text
the urgency of adopting more complex and resembling human speech. In simpler terms, it's a
sophisticated approaches. Our research capitalizes computational system created to produce
on the latest advancements in natural language sequences of words, code, or other data from an
processing (NLP), thus laying the groundwork for input source known as the prompt. GPT finds
a more robust plagiarism detection system. applications in various tasks like machine
translation, where it predicts word sequences
2 Basic Theory statistically. The model is trained on an unlabelled
dataset comprising texts from sources like
2.1 Plagiarism Wikipedia, available mostly in English but also in
Plagiarism is a term with moral and aesthetic other languages. This computational approach
connotations, used in literature to describe the act serves diverse purposes, including summarization,
of incorporating, in an undisclosed and more or translation, grammar correction, question
less faithful manner, textual elements from answering, chatbots, composing emails, and more
another author. This term is not commonly used in (Floridi, Luciano, Massimo Chiriatti, 2020).
legal contexts, where one would rather refer to
3 Literature review Cedeño et al. (2013) highlights the challenges
posed by these practices. Detecting paraphrases
The literature review emphasizes the importance and rephrases requires distinct approaches,
of detecting plagiarism by examining similarities although they are semantically related (Harris,
between documents. Different approaches, can be 1957; Martin, 1976; Duclaye, 2003).
explored. Table 1 summarizes these plagiarism
detection methods, ranging from simple To overcome these challenges, alternative
algorithms to advanced approaches. methods can explored:

Method Description • Stylometric Approaches, this method


employs statistical techniques to analyze
Fingerprinting Represents the document in the
form of fingerprints (n-grams) and
various aspects of writing style, focusing
utilizes algorithms such as "Rabin- on features such as word frequencies,
Karp" for plagiarism detection. sentence lengths, punctuation usage, and
String Matching Compares documents word by syntactic structures. By quantifying these
word using algorithms such as features, the method aims to capture
"Brute Force" unique patterns and characteristics
Bag of Words Utilizes a vector space model with specific to each author's writing style.
vectors representing documents,
calculating cosine similarity to • Neural networks, this method achieves
measure the similarity between high performance in detecting plagiarism,
texts. especially in identifying paraphrases and
Citation Analysis Analyzes citations within texts to subtle similarities within complex texts. It
detect similar patterns in citation utilizes advanced techniques such as deep
sequences, adapted for academic learning models, which have
and scientific texts demonstrated superior capabilities in
Stylometry Utilizes statistical methods to capturing intricate patterns and nuances in
quantify and analyze the writing language.
style of an author based on
features such as word frequencies. • BERT is a pre-trained natural language
Rule-Based Simple to implement and quick, processing (NLP) model developed by
Algorithms but limited to predefined rules and Google. It uses Transformer architecture
less suitable for different and is trained on large unlabeled text
languages and sentence structures. corpora. BERT is designed to understand
Neural networks Achieves high performance on the context of words in a sentence by
complex texts, detects paraphrases looking at both preceding and succeeding
and similarities, but requires a words, allowing it to capture nuances of
high level of implementation meaning and context.
complexity and massive amounts
of data for training. Methods for detecting paraphrastic rephrasing
Bidirectional Utilizes a pre-trained model on a are common, using alignment methods (Callison-
Encoder large corpus of text, provides a Burch et al., 2008) or more advanced techniques
Representations deep understanding of the text but (Shen et al., 2006). The work of Fenoglio et al.
from requires high computational power (2007) emphasizes fundamental elementary
Transformers
and is not designed for text transformations, while Mel’cuk's Sense-Text
generation. theory (1967) is often adopted.
(BERT)
These advancements in NLP and models like
Table 1: Overview of Plagiarism Detection BERT contribute not only to the efficiency of
Methods plagiarism detection but also to a more nuanced
understanding of language use.
When comparing documents to detect
plagiarism, the search for similarities is crucial.
Word-for-word comparison, while effective in
identifying "copy and paste" instances, becomes
insufficient in the face of sophisticated
paraphrasing and rephrasing. The work of Barron-
4 Methodology 5 Evaluation
Our approach is to combine the traditional We assess the effectiveness of our plagiarism
method, which is Direct Textual Comparison, with detection approach by applying various methods.
Natural Language Processing techniques such as The tests encompass diverse datasets containing
BERT (Bidirectional Encoder Representations authentic texts and examples of plagiarism with
from Transformers) and GPT (Generative Pre- varying levels of complexity.
trained Transformer) to provide a more robust and Evaluation Metrics: Precision, recall, and F-
accurate approach. measure are employed for a balanced assessment
We will proceed as follows: of the model.
Test Dataset: Various texts representing
different styles are utilized, including simulated
Step Name Description
cases of plagiarism to test the model's sensitivity.
Phase 1 Text Using text cleaning techniques to Comparison with Other Methods: Our
Preprocessing: eliminate irrelevant elements such as performance is compared to traditional methods
Tokenization whitespaces, punctuation, etc.
and others.
Tokenizing the texts to prepare them
for processing by models.
Phase 2 Direct Using traditional methods such as Let's take a look at two text extracts::
Textual string comparison to detect direct
Comparison copy-pasting. • Text A: " Les avancées technologiques
Phase 3 Converting texts into embeddings ont révolutionné notre quotidien."
Semantic (vector representations) using BERT • Text B: " Les progrès technologiques ont
Analysis with and comparing the embeddings to bouleversé la vie quotidienne."
BERT evaluate semantic similarity between
texts. If the embeddings are very
For testing, we used two sentences in French,
similar, it could indicate paraphrasing
as it is the most widely used language for
or plagiarism
Phase 4 Using GPT to rephrase one of the publishing articles or writing theses in Africa.
Generation texts and comparing the rephrased However, it can also be used with various
and text with the other. If GPT generates languages such as English and Spanish, as BERT
Comparison a text very similar to the other text, it and GPT already support multiple languages.
with GPT could indicate plagiarism.texts
Phase 5 Using stylometric analysis techniques Method Precision Limitation
Stylometric to compare the writing style of both Textual Comparison 0.85 Limited to copy-
Analysis texts. If the styles are very similar, it paste cases, less
could indicate plagiarism, especially effective on longer
if the content is also similar. texts.
Phase 6 Combining results from all methods Semantic Analysis 0.92 High computational
Evaluation to make a final decision on with BERT costs, requires large
and Decision plagiarism. For example, if direct amounts of training
textual comparison, semantic data.
analysis with BERT, and stylometric Generation and 0.88 It can generate text,
analysis all indicate plagiarism, you Comparison with unlike BERT, but
can be reasonably certain that the text GPT it's not primarily
is plagiarized. designed for
plagiarism
detection and
Table 2: Different phases of designing the stages requires finesse in
of the multi-level combination method. hyperparameter
tuning
By combining these models, we can leverage Stylometric 0.80 May be sensitive to
multiple architectures and achieve better results, Analysis intentional stylistic
obtaining superior performance compared to each variations.
individual model. However, this requires careful
planning and implementation. Table 3: Results of the approaches used
To obtain the result, we followed a rigorous 7 References
evaluation procedure using diverse datasets,
including authentic texts and plagiarism examples Vandendorpe, C. (1992). Le plagiat.
of varying levels of complexity. Each method was Hambi El Mostafa, Faozia Benabbou, El Habib Ben
evaluated based on its precision performance, LahMar. (2016, June). Comparaison
taking into account its specific advantages and Des Techniques De Détection Du Plagiat
limitations. Académique.
The textual comparison method was applied to Ratianantitra, V. M. (2023, December). A State of the
authentic texts and copy-paste cases, evaluating art review on Natural Language Processing applied
accuracy and identifying limitations on lengthy to the Malagasy Language. In International
Conference on Artificial Intelligence and its
texts. Semantic analysis with BERT converted
Applications (pp. 1-5).
texts into embeddings, measuring semantic
similarity with paraphrase examples while Floridi, L., & Chiriatti, M. (2020). GPT-3: Its nature,
assessing computational costs. scope, limits, and consequences. Minds and
Machines, 30, 681-694.
Generation and comparison with GPT involved
rephrasing a text and adjusting hyperparameters, Barrón-Cedeño, A., Vila, M., Martí, M. A., & Rosso,
evaluating accuracy and detecting creative P. (2013). Plagiarism meets paraphrasing: Insights
similarities. Stylometric analysis assessed writing for the next generation in automatic plagiarism
detection. Computational Linguistics, 39(4), 917-
style with tests sensitive to stylistic variations,
947.
measuring accuracy.
The overall process encompassed a comparison Harris, Z. S. (1957). Co-occurrence and
of the performance of each method, identifying transformation in linguistic structure. Language,
33(3), 283-340.
and analyzing the limitations of each approach for
a comprehensive evaluation. Fenoglio, I., Lebrave, J. L., & Ganascia, J. G. (2007).
Following the evaluation of these results, it was EDITE MEDITE: un logiciel de comparaison de
observed that the plagiarism detection approach versions.
En ligne: http://www. item. ens. fr/index. php.
combined with the natural language processing
methods Bert and GPT reflects effectiveness in Martin, R. (1976). Inférence, antonymie et
several key aspects: improved accuracy, detection paraphrase: éléments pour une théorie sémantique.
of complex plagiarism patterns, scalability, and Duclaye, F. (2003). Apprentissage automatique de
generalization. relations d'équivalence sémantique à partir du Web
(Doctoral dissertation, Télécom ParisTech).
6 Conclusion Callison-Burch, C., Cohn, T., & Lapata, M. (2008,
August). Parametric: An automatic evaluation
Our innovative semantic approach, integrating metric for paraphrasing. In Proceedings of the 22nd
BERT and GPT, has demonstrated increased International Conference on Computational
effectiveness in detecting various forms of Linguistics (Coling 2008) (pp. 97-104).
plagiarism, including subtle paraphrasing. Despite
Shen, S., Radev, D., Patel, A., & Erkan, G. (2006,
challenges related to complexity and July). Adding syntax to dynamic programming for
computational costs, significant benefits, such as aligning comparable texts for the generation of
accurately detecting paraphrased content, make paraphrases. In Proceedings of the COLING/ACL
our model a promising solution for meeting the 2006 Main Conference Poster Sessions (pp. 747-
requirements of plagiarism detection in diverse 754).
digital contexts. Moreover, our approach is
designed to be easily implementable, utilizing
programming languages compatible with
OpenAI's library and BERT. Ongoing research is
necessary to optimize the model and explore
emerging domains, underscoring our commitment
to evolving plagiarism detection tools and
preserving the integrity of digital content.

You might also like