discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/261164999
All content following this page was uploaded by Masoud Asadpour on 24 May 2015.
Abstract—In this paper, a plagiarism detection framework based on coding style is proposed. Furthermore, the typical style-based approach is improved to better detect plagiarism in program code. Plagiarism detection is performed in two phases: in the first phase, the main features representing a coding style are extracted. In the second phase, the extracted features are used in three different modules to detect plagiarized codes and to determine the giver and takers of the codes. The extracted features for each code developer are kept in a history log, i.e. a user profile representing his/her style of coding, and are used to detect changes in coding style. The user profile allows the system to detect whether a code was truly developed by the claimed developer or written by another person with a different style. Furthermore, the user profile allows determining the code giver and code taker when two codes are similar, by comparing the codes' styles with the styles of the programmers. Also, if a code is copied from the internet or developed by a third party, the person who claims ownership of the code is normally less proficient in coding than the third party, and this can be detected. The difference between the style levels is determined by the style level checker module in the proposed framework. The proposed framework has been implemented and tested, and the results are compared to those of MOSS, showing comparable performance in detecting plagiarized codes.

Keywords—Plagiarism detection; author identification; software forensics; source code

I. INTRODUCTION

With the widespread use of portable media and the access that the internet has provided to people, the software community has faced the new challenge of copied and plagiarized code. For instance, students' programming assignments are designed to help students improve their coding skills and to determine their level of proficiency in coding. However, with the ease of sharing and copying codes, these goals cannot be achieved as plagiarism increases. Plagiarism is especially noticeable in online education systems, in which students lack any direct relation with teachers. In other words, it is harder to determine whether a student has achieved a certain level of proficiency in coding, since proctored exams are not used in online education.

An early survey of 380 bachelor students by Haines, Diekhoff, Labeff and Clark in 1986 shows that although half of the students cheated, only 1.3% of them were caught [1]. Recent surveys show that in the past two decades plagiarism among students has increased due to the ease of accessing and sharing answers to homework and assignments. It is interesting to mention that another study, by Ashworth, Bannister and Thorne in 1997, found that the main reason for plagiarism among students is lack of time for doing homework [2]. Furthermore, plagiarism mostly occurs in homework and exercises due to the lack of proper control by instructors. Therefore, identification of copied assignments and homework is a vital need in educational systems. Consequently, it has ignited a body of research on detecting plagiarism, to prevent students from submitting unoriginal assignments and to punish those who do so.

Obviously, with the increase in the size of assignments and in the number of students, manual comparison of assignments is not a viable solution. Consequently, there have been attempts to develop algorithms and systems that can check assignments automatically and detect similar ones. In the case of coding assignments, the chance of similarity between codes is higher than for natural language texts, since the syntax is more restricted. Furthermore, it is fairly easy to copy a code and change its overall look, which makes it more difficult to detect the similarity through manual observation or basic automatic algorithms. MOSS [3], an automatic tool to detect similar coding assignments, and JPlag [4], an online program comparison system, are two systems developed to detect plagiarism. Although these systems are fairly successful in detecting similarity between codes submitted to them, they have limits in two areas: a) code obfuscation and b) third party development. Code obfuscation is a technique that was developed to aid in the prevention of software piracy by applying semantics-preserving transformations to programs [5]. If students use it to change the appearance of their code, they can defeat these detection systems. Third party development happens when a piece of code is copied from the internet or specifically developed by a third party. In the first case, keeping a huge database of the codes available online may help to detect the plagiarism. However, the second case, i.e. hiring a person to code (paid or unpaid) on behalf of the programmer, is undetectable, since there is no original code to compare against.

In this paper, a style-based framework for plagiarism detection is introduced, in which the coding style of each student is detected and compared with that of other students. Similarity between styles, and changes in a student's style over several assignments, raise the possibility of plagiarism. The proposed approach is capable of detecting the person who shares his/her document. Also the proposed method can distinguish between the style of coding between
To determine the plagiarized codes, each code is compared to all other codes submitted to the system and also to its owner's style, if one is available. In the comparison with others, when the similarity rate is more than a given threshold, the code is marked as a potential candidate for plagiarism. On the other hand, the main point of self-comparison is that every programmer has a specific style, which is normally followed in all codes developed by that programmer. Therefore, given the style found from previous codes, the authenticity of every new code can be ascertained. That is, a change in style hints at the possibility of plagiarism.

Fig. 1 shows the general framework of the proposed plagiarism detection system. The process is based on five modules. In the feature extraction module, the features are detected and normalized to create the feature vector for each document. In the similarity clustering module, the feature vectors from two different submitted codes are compared, and similar codes are sent to the next module to determine the giver and taker. The feature vector, i.e. the output of the feature extraction module, is compared to the profile of the developer of the code. If it is similar, it is added to the current profile. If there is no profile, because it is the first time the developer has submitted a code to the system, a profile is created for him/her and used in the future. The giver/taker module uses the output of the similarity clustering module and the developer profile module to determine the similarity between the submitted codes and the profiles, and from that the giver and taker. In the case that developer profiles are very similar to each other, there is no way to determine the giver and taker through their style.

The style checker module is designed to determine whether a code is taken from the web or written by a third party (paid or as a favor). This module is mainly useful when the style of the current code does not match any other code submitted for checking. This case happens if a developer has not copied from another developer in the group of codes submitted for checking, but has copied from the web or asked another person to write the code for him/her. In such a case, the style of the submitted code is usually more proficient than the style of the developer who has submitted it. Consequently, to check the style level of the submitted code, it is passed through the style level checker, which compares it with a trained dataset. The profile of the user is also passed through the style checker to determine his/her level of proficiency in coding. If the results of the two checks differ, there is a possibility of plagiarism.

The details of each module are discussed in the following.

A. Feature Extraction module

In this module, a feature vector is derived for each document under investigation. A derived feature vector is considered the fingerprint of a document and is used throughout the investigation. In general, a feature vector should represent the general structure of a document and characteristics of a document that might be overlooked. For instance, in a C/C++ program, the general structure consists of features such as number of loops, number of functions, and number of
Figure 2. Sample code showing the effect of the proposed features on style: (a) and (b) command placement; (c) and (d) indentation; (e) and (f) spacing at the end of a line; (g) and (h) line comment or block comment; (i) and (j) placing characters such as '+', '=', ')' with one space before and after, or without spaces; (k) and (l) placement of '{' and '}'.
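
To make the style features listed in the caption of Figure 2 and the threshold-based comparison concrete, the following is a minimal sketch in Python, not the authors' implementation. It covers only a subset of the features (indentation, trailing spaces, comment type, spacing around '=', and brace placement). The function names, the similarity measure (one minus the mean absolute difference) and the threshold value 0.9 are assumptions made for illustration; the text only states that features are normalized and that similarity is compared against "a given threshold".

```python
import re

# Hypothetical value: the paper only speaks of "a given threshold".
SIMILARITY_THRESHOLD = 0.9


def _share(part, whole):
    """Return part/whole as a value in [0, 1], or 0.0 when whole is zero."""
    return part / whole if whole else 0.0


def extract_style_features(source):
    """Derive a small normalized style vector from C/C++ source text.

    Illustrative subset of the features named in Figure 2; every entry
    is a share in [0, 1], so no further normalization is needed here.
    """
    lines = [ln for ln in source.splitlines() if ln.strip()]

    # Indentation: share of indented lines that start with a tab.
    indented = [ln for ln in lines if ln[0] in " \t"]
    tab_indent = _share(sum(ln[0] == "\t" for ln in indented), len(indented))

    # Spacing at the end of a line: share of lines with trailing whitespace.
    trailing_ws = _share(sum(ln != ln.rstrip() for ln in lines), len(lines))

    # Line comments versus block comments (naive count, ignores strings).
    line_c = source.count("//")
    block_c = source.count("/*")
    line_comment = _share(line_c, line_c + block_c)

    # Spaces around '=': share of plain assignments written as " = ".
    singles = len(re.findall(r"(?<![=!<>+\-*/%&|^])=(?!=)", source))
    spaced_assign = _share(source.count(" = "), singles)

    # Placement of '{': share of opening braces that sit on their own line.
    opening = [ln for ln in lines if ln.rstrip().endswith("{")]
    own_line_brace = _share(sum(ln.strip() == "{" for ln in opening),
                            len(opening))

    return [tab_indent, trailing_ws, line_comment, spaced_assign,
            own_line_brace]


def style_similarity(a, b):
    """Assumed measure: 1 minus the mean absolute difference of two vectors."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def is_plagiarism_candidate(a, b, threshold=SIMILARITY_THRESHOLD):
    """Mark a pair as a candidate when similarity exceeds the threshold."""
    return style_similarity(a, b) > threshold
```

In this sketch the same vector could also be compared against a stored profile vector of the developer, so that a low similarity to one's own profile signals a change in style. Because counts are taken naively from the raw text, characters inside string literals are counted as well; a real extractor would need a tokenizer.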