
A coding style-based plagiarism detection

Conference Paper · November 2012


DOI: 10.1109/IMCL.2012.6396471


A Coding Style-based Plagiarism Detection
S. Arabyarmohamady, H. Moradi, M. Asadpour
Advanced Robotics and Intelligent Systems Laboratory & Control and Intelligent Processing Center
School of Electrical and Computer Engineering
University of Tehran

Abstract: In this paper a plagiarism detection framework based on coding style is proposed. Furthermore, the typical style-based approach is improved to better detect plagiarism in program code. Plagiarism detection is performed in two phases. In the first phase, the main features representing a coding style are extracted. In the second phase, the extracted features are used in three different modules to detect plagiarized codes and to determine the giver and the takers of the codes. The features extracted for each developer are kept in a history log, i.e. a user profile that captures his/her coding style, and are used to detect changes in that style. The user profile allows the system to determine whether a code was truly developed by the claimed developer or written by another person with a different style. Furthermore, when two codes are similar, the user profile allows the code giver and the code taker to be determined by comparing the codes' styles with the styles of the programmers. Also, if a code is copied from the internet or developed by a third party, the person who claims ownership of the code is normally less proficient in coding than the third party, and this difference can be detected. The difference between style levels is checked by the style level checker module of the proposed framework. The proposed framework has been implemented and tested, and its results are compared to those of Moss, showing comparable performance in detecting plagiarized codes.

Keywords: Plagiarism detection; author identification; software forensics; source code

I. INTRODUCTION

With the widespread use of portable media and the access that the internet has provided, the software community faces the new challenge of copied and plagiarized code. For instance, students' programming assignments are designed to improve the students' coding skills and to determine their level of proficiency in coding. However, with the ease of sharing and copying codes, these goals cannot be achieved as plagiarism increases. Plagiarism is especially noticeable in online education systems, in which students lack any direct relation with teachers. In other words, it is harder to determine whether a student has achieved a certain level of proficiency in coding, since proctored exams are not used in online education.

An early survey of 380 bachelor's students by Haines, Diekhoff, Labeff and Clark in 1986 showed that although half of the students cheated, only 1.3% of them were caught [1]. Recent surveys show that in the past two decades plagiarism among students has increased due to the ease of accessing and sharing answers to homework and assignments. It is interesting to mention that another study, by Ashworth, Bannister and Thorne in 1997, found that the main reason for plagiarism among students is lack of time for doing homework [2]. Furthermore, plagiarism mostly occurs in homework and exercises due to the lack of proper control by instructors. Therefore, identifying copied assignments and homework is a vital need in educational systems. Consequently, it has ignited a body of research on detecting plagiarism, in order to prevent students from submitting unoriginal assignments and to punish those who do.

Obviously, with the increase in the size of assignments and in the number of students, manual comparison of assignments is not a viable solution. Consequently, there have been attempts to develop algorithms and systems that can check assignments automatically and detect similar ones. In the case of coding assignments, the chance of similarity between codes is higher than for natural language texts, since the syntax is more restricted. Furthermore, it is fairly easy to copy a code and change its overall look, which makes it more difficult to detect the similarity through manual observation or basic automatic algorithms. MOSS [3], an automatic tool for detecting similar coding assignments, and JPlag [4], an online program comparison system, are two systems developed to detect plagiarism. Although these systems are fairly successful in detecting similarity between the codes submitted to them, they have limits in two areas: a) code obfuscation and b) third party development. Code obfuscation is a technique that was developed to help prevent software piracy by applying semantics-preserving transformations to programs [5]. If students use it to change the appearance of their code, they can defeat these detection systems. Third party code development happens when a piece of code is copied from the internet or is specifically developed by a third party. In the first case, keeping a huge database of the codes available online may help to detect the plagiarism. However, the second case, i.e. hiring a person to code (paid or unpaid) on behalf of the programmer, is undetectable, since there is no code to compare with the original code.

In this paper, a style-based framework for plagiarism detection is introduced in which the coding style of each student is detected and compared with that of other students. Similarity between styles, and a change in the style of a student over several assignments, raise the possibility of plagiarism. The proposed approach is capable of detecting the person who shares his/her document. Also, the proposed method can distinguish between the coding styles of

978-1-4673-4923-9/12/$31.00 ©2012 IEEE November 6-8, 2012, Amman, Jordan


2012 International Conference on Interactive Mobile and Computer Aided Learning (IMCL)
professional programmers and beginners. This is done by creating two style classes, i.e. a professional class and a beginner class, and assigning a given code to one of these two classes. If this classification differs from the history of the programmer, the given code is flagged as possibly illegitimate.

The person who shares his/her document is called the giver from now on, and the person who receives/copies the document is called the taker. Also, the term document is used for a source code throughout this paper.

The paper is organized as follows. In Section II the related work in the area of plagiarism detection is reviewed. Then in Section III the proposed approach is explained, and the results are presented in Section IV. Finally, Section V presents the conclusion and ideas for further development of this method.

II. RELATED WORK

Research on detecting plagiarized documents, which can be code, essays or any other assignment, has been done in two directions: a) detecting the similarity between two documents [6], [3] and b) stealthily signing each document with a unique signature, i.e. a watermark [7]-[10], that can be detected later if the document is used by another person.

In the latter, a hidden watermark is saved in the document. It includes unique characteristics of the owner of the document, such as a student's ID in the case of a school assignment, and characteristics of the document, such as the assignment number and date. Whenever a person copies the document, the hidden watermark is copied along with it. The advantage of such an approach is that it can be used to detect the giver and taker of a document. The main issue is to design the watermark in such a way that changing the visible areas of the document does not affect the watermark, so that it can be detected later. Furthermore, the user should not be able to detect the watermark, so that he/she cannot manipulate it to deceive the system and avoid plagiarism detection. Finally, the original document should be electronically signed so that the watermark is placed in the document before the user starts changing it.

In the first approach, i.e. similarity detection, two methods are used to determine similarity. Both methods, i.e. word/token sequence analysis and attribute counting, try to determine a fingerprint for each document and match the documents with the same or close fingerprints. In the attribute counting method the number of certain attributes is counted. The closer the attribute counts, the higher the chance that the documents are similar and copied. In the word/token sequence analysis method, a structure metric is defined and calculated for each document such that the closer the metrics, the higher the chance of having copied documents.

In attribute counting, attributes such as the number of lines, number of loops, and number of functions are used. Since these counts can be close even if the codes are written by two different programmers, the approach may falsely mark programs as copied.

In systems based on sequence analysis, the global structures of documents are compared by deriving n-grams of words or tokens. An n-gram is a stretch of text n words long, and the information in n-grams tells us something about the text. This method has a lower probability of falsely marking two documents as similar, since it captures a document owner's style by determining the structure of the document. However, deriving and storing n-grams and token streams is time consuming and costly. Examples of these systems are Moss and JPlag. Moss (Measure Of Software Similarity) uses a string matching algorithm to divide the source programs into n-grams and hash them, and then selects a subset of these hashes as fingerprints.

JPlag is a token-based system that is freely available on the internet. It outputs the similarity between each pair of programs. The major limitation of JPlag is that it requires parsing the dataset: if a program fails to parse, it is omitted from the dataset.

Another method of similarity detection through sequence analysis is called the Static Execution Tree (S.E.T) [11]. An S.E.T is a representation, obtained by parsing the source program, that displays the interconnection between the main program body and all procedures. After a tree is built, it is compared to the trees built for other programs and similar trees are detected. The main disadvantage of this approach is its limited use, since it is applicable to program-based documents only.

Another line of study concerning the plagiarism problem is authorship analysis, i.e. determining the owner of the code [12]-[1]. In the author identification approach, a profile is created for each author based on all the programs he/she has written. The profile can be created based on the n-grams or the attribute counts of the written programs. In the identification phase, the profile of the given document is compared to the set of available profiles, and the author of the closest profile is selected as the author of the document.

In this research, the attribute counting method is improved by including more attributes to better detect plagiarized documents. Furthermore, a giver-taker module is proposed to determine the giver and the copier of documents. Finally, a style level checker module is developed to determine the style level of a code and compare it with the programmer's style level in order to assess the legitimacy of the code.

III. THE PROPOSED FRAMEWORK

The plagiarism detection algorithm proposed in this paper uses pattern recognition methods based on attribute counting for programs. Furthermore, a profile of each developer is built for future use. The developer profile is used to determine the consistency of a developer's coding style, which can be used to predict the possibility of plagiarism. Moreover, it can be used to determine the giver and taker of copied codes by matching the style of the giver with his/her profile.
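
As a rough illustration of how a developer profile might separate the giver from the taker, the following Python sketch compares a new feature vector with the mean of a developer's earlier vectors. The function names, the use of the mean as a summary of the profile, and the threshold value are assumptions made for this example only; the paper does not specify them.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def profile_centroid(profile):
    """Mean of the feature vectors stored in a profile (assumed summary)."""
    n = len(profile)
    return [sum(col) / n for col in zip(*profile)]

def classify_role(new_vector, profile, in_similar_pair, threshold=0.5):
    """Hypothetical giver/taker rule based on consistency with the profile.

    new_vector      -- normalized feature vector of the submitted code
    profile         -- earlier feature vectors of the same developer
    in_similar_pair -- True if the code was found similar to another submission
    threshold       -- assumed largest distance still counted as "same style"
    """
    if not profile:
        return "no profile yet"
    consistent = euclidean(new_vector, profile_centroid(profile)) <= threshold
    if not consistent:
        return "possible taker"   # style differs from the developer's history
    if in_similar_pair:
        return "possible giver"   # own style, but the code also appears elsewhere
    return "consistent"

# Toy data: two developers whose submissions were found similar to each other.
alice_profile = [[0.20, 0.50, 0.10], [0.22, 0.48, 0.12]]
bob_profile = [[0.80, 0.10, 0.60], [0.78, 0.12, 0.58]]
submitted = [0.21, 0.49, 0.11]  # both handed in code with this style

print(classify_role(submitted, alice_profile, True))  # possible giver
print(classify_role(submitted, bob_profile, True))    # possible taker
```

In this toy case the shared code matches the first developer's history and not the second's, so the first is reported as the likely giver. If both profiles were close to the submitted vector, the rule could not tell them apart.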

Figure 1. General framework of the proposed method

To determine plagiarized codes, each code is compared to all other codes submitted to the system and also to its owner's style, if one is available. In the comparison with others, when the similarity rate exceeds a given threshold the code is marked as a potential candidate for plagiarism. On the other hand, the main point of self comparison is that every programmer has a specific style, which is normally followed in all codes developed by that programmer. Therefore, given the style found from previous codes, the authenticity of every new code can be ascertained. That is, a change in style hints at the possibility of plagiarism.

Fig. 1 shows the general framework of the proposed plagiarism detection system. The process is based on five modules. In the feature extraction module, the features are detected and normalized to create the feature vector for each document. In the similarity clustering module, the feature vectors of different submitted codes are compared and the similar codes are sent to the next module to determine the giver and taker. The feature vector, i.e. the output of the feature extraction module, is also compared to the profile of the developer of the code. If it is similar, it is added to the current profile. If there is no profile because this is the first time the developer has submitted a code to the system, a profile is created for him/her and used in the future. The giver/taker module uses the output of the similarity clustering module and the developer profile module to determine the similarity between the submitted codes and the profiles, and thereby the giver and taker. If the developer profiles are very similar to each other, there is no way to determine the giver and taker through their style.

The style level checker module is designed to determine whether a code is taken from the web or written by a third party (paid or as a favor). This module is mainly useful when the style of the code does not match any other code submitted for checking. This happens if a developer has not copied from another developer in the group of codes submitted for checking, but has copied from the web or asked another person to write the code for him/her. In such a case, the style of the submitted code is usually more proficient than the style of the developer who submitted it. Consequently, to check the style level of the code submitted by the developer, it is passed through the style level checker, which compares it with a trained dataset. The profile of the user is also passed through the style level checker to determine his/her level of proficiency in coding. If the results of the two checks differ, there is a possibility of plagiarism.

The details of each module are discussed in the following.

A. Feature Extraction Module

In this module, a feature vector is derived for each document under investigation. A derived feature vector is considered the fingerprint of a document and is used throughout the investigation. In general, a feature vector should represent the general structure of a document as well as characteristics of the document that are likely to be overlooked. For instance, in a C/C++ program, the general structure consists of features such as the number of loops, number of functions, and number of

includes. Furthermore, a set of overlooked features, such as the number of spaces at the end of each line, the average length of variable names, and the spacing between a variable and the assignment operator, are among the features that normally stay unchanged during the plagiarism process.

The idea of employing the overlooked features is based on typical behavior during copying. For example, if a student copies a programming assignment, he/she possibly applies the following changes to avoid plagiarism detection:

• Changes variable names
• Relocates code blocks
• Transforms one kind of loop structure into another
• Changes or deletes comments
• Replaces conditional and control structures with similar commands

As can be seen, the major changes involve general cases, and the copier overlooks the details that can distinguish one code from a similar one. Consequently, if two programs are written by two different people, the codes should differ in their details. In other words, if a program is an altered version of another one, the details remain the same or similar. A few of these overlooked features are:

• Writing each command on its own line or several commands on one line, Fig. 2 (a), (b).
• Type of indentation at the beginning of lines and inside loops, Fig. 2 (c), (d).
• Spacing at the end of lines, Fig. 2 (e), (f).
• Placing line comments or block comments before functions or on each line, Fig. 2 (g), (h).
• Placing characters like '+', '=', ')' with one space before and after them, or without spaces, Fig. 2 (i), (j).
• Placing the block begin and end marks (braces) on a line of their own or as a continuation of a command, Fig. 2 (k), (l).
• Usage of special marks like "_" in variable names
• Number of empty lines

The above cases are normally overlooked by students who plagiarize. Consequently, if most of these cases are identical or similar between two codes, the likelihood of plagiarism is high. That is why the feature vector is a combination of general and detail cases that relate to the writing style of a program.

Figure 2. Sample code showing the effect of the proposed features on style: (a) and (b) command placement, (c) and (d) indentation, (e) and (f) spacing at the end of a line, (g) and (h) line comment or block comment, (i) and (j) placing characters like '+', '=', ')' with one space before and after them or without spaces, (k) and (l) placement of '{' and '}'.

A few of the feature vector elements that capture the above general and detail cases are: average line length, average variable name length, and number of loops. It must be mentioned that in producing these vectors no distinction is made between synonymous keywords, because the first action of a plagiarist is to transform conditional and loop structures into one another. Thus the result of the feature extraction module is a feature vector for each program file. This vector is normalized so that programs of different sizes are comparable.

B. Similarity Detection Module

In this module a hierarchical clustering algorithm is used to group the feature vectors derived by the feature extraction module. The method is a bottom-up, or agglomerative, approach: each feature vector starts in its own cluster, and pairs of clusters are then successively merged until all clusters have been merged into a single cluster that contains all feature vectors and no further merge is possible.

At the end, the clusters with a high similarity rate between their programs, i.e. similarity over a given threshold, are marked as candidates for plagiarism. Since clustering is conducted hierarchically, it is possible to detect groups of cheaters.

The clustering is performed by constructing a dissimilarity matrix using Euclidean distance as the distance metric. Then single-linkage clustering is applied to the dissimilarity matrix as the linkage criterion for merging feature vectors. In other words, if a, b are two feature vectors and A, B are two clusters, then the distance between the vectors and the linkage distance between the clusters are:

$$ d(a, b) = \lVert a - b \rVert_2 = \sqrt{\sum_i (a_i - b_i)^2} $$

$$ D(A, B) = \min \{\, d(a, b) : a \in A,\ b \in B \,\} $$

The result of this module can be shown as a dendrogram.

C. Developer Profile Module

In this module, the feature vector of the present program is compared with the older vectors in the profile of the program's developer. If the similarity between them is high, the vector is added to the developer's profile. If the similarity rate is lower than a given threshold, it is possible that the new code was not written by this programmer. In that case this developer is likely the copier of the code. If the similarity rate to the previous vectors is high but the similarity detection module shows that this person is in a plagiarism cluster, then it may be concluded that this person is a giver.

D. Style Level Checking Module

As mentioned before, Moss and JPlag compare codes to each other. Consequently, if a given code is copied from the internet or provided by a third party, these systems may fail to mark it as plagiarized. Although it is possible to keep a large database of the codes available on the web, similar to the way Turnitin works, this would require a huge database to hold all these codes, and the detection process would take a long time.

To resolve this issue, the proposed method has been trained to classify codes into advanced and beginner codes, i.e. whether a code has been written by an advanced programmer or by a naïve programmer. The training data has been collected from professional programmers developing large programs, mainly from sourceforge.net, since the contributors to sourceforge.net are mainly professional programmers. The naïve programmer samples are collected from first-year students in the Electrical and Computer Engineering field.

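
The feature extraction and similarity detection steps of this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the four features, the per-line normalization, the regular expression, and the 0.2 threshold are chosen only for the example. Stopping the merges once the smallest single-linkage distance exceeds the threshold gives the same groups as building the full dendrogram and cutting it at that height.

```python
import math
import re

def extract_features(source):
    """Build a small style vector; every count is divided by the line count."""
    lines = source.splitlines() or [""]
    n = len(lines)
    return [
        sum(1 for ln in lines if ln != ln.rstrip()) / n,         # trailing spaces
        sum(1 for ln in lines if ln.strip() == "") / n,          # empty lines
        sum(1 for ln in lines if ln.strip() in ("{", "}")) / n,  # lone braces
        len(re.findall(r"\b(?:for|while|do)\b", source)) / n,    # loops, all kinds alike
    ]

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def single_linkage(vectors, threshold):
    """Merge the two closest clusters until the smallest single-linkage
    distance exceeds the threshold; returns lists of vector indices."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(euclidean(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > 1:
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        dist, p, q = min(pairs)
        if dist > threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Toy submissions (hypothetical). "copied" renames a variable and swaps the
# loop type but keeps the layout habits of "original"; "other" differs in layout.
original = ("int main()\n{\n    int i;\n    for (i = 0; i < 3; i++)\n"
            "    {\n        f(i);\n    }\n}\n")
copied = ("int main()\n{\n    int k = 0;\n    while (k < 3)\n"
          "    {\n        f(k); k++;\n    }\n}\n")
other = "int main() { \n  int i; \n\n  for (i = 0; i < 3; i++) { f(i); } \n\n}\n"

vectors = [extract_features(s) for s in (original, copied, other)]
print(single_linkage(vectors, threshold=0.2))  # [[0, 1], [2]]
```

Because the loop keywords are counted together and only layout details are measured, the renamed and restructured copy receives the same vector as the original and is grouped with it, while the independently formatted program stays in its own cluster.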
IV. RESULTS

The system was tested on a group of 120 freshman students. The projects for the "introduction to programming" course were analyzed using the Moss system and the proposed framework. The results showed that for cases with more than 30% similarity, the system's result is comparable to that of Moss. The proposed system also found resemblance between codes with less than 30% similarity, which is normally ignored due to possible random resemblance. These cases can be manually evaluated by the instructor or rejected using a given threshold.

Considering the fast performance of the proposed method compared to Moss (one minute for every 120 students), the system can be a reliable replacement for online and fast plagiarism detection.

Figure 3. Moss system result on 8 given programs; N1 to N8 are the given programs and the vertical axis shows the percentage of similarity.

Fig. 3 shows the result of running Moss on 8 source codes from different programmers. Moss reports the similarity between each pair of programs in a table, which is shown here as a chart. For example, N4 is about 60% similar to N8 but N8 is 85% similar to N4. This asymmetry arises because the programs differ in size.
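
The asymmetry in the N4/N8 example can be reproduced with a small calculation. The Python sketch below assumes, for illustration only, that a pairwise percentage is the number of shared fingerprints divided by the number of fingerprints in the program being reported. The fingerprint counts are invented so that they match the percentages quoted above; they are not taken from the experiment.

```python
def percent_matched(shared, total):
    """Share of one program's fingerprints that also occur in the other."""
    return 100.0 * shared / total

# Hypothetical counts: N4 is the longer program, N8 the shorter one.
n4_fingerprints = 85
n8_fingerprints = 60
shared = 51  # fingerprints common to both programs

n4_vs_n8 = percent_matched(shared, n4_fingerprints)
n8_vs_n4 = percent_matched(shared, n8_fingerprints)
print(f"N4 matched in N8: {n4_vs_n8:.0f}%")  # 60%
print(f"N8 matched in N4: {n8_vs_n4:.0f}%")  # 85%
```

The same shared material covers a smaller fraction of the longer program, so the two directions of the comparison report different percentages.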
Fig. 4 shows the result of running the proposed method on the same dataset. In this case the chart is a dendrogram that shows the relationships between programs in a hierarchical clustering format.

Figure 4. Result of the proposed system on the same 8 programs, shown as a dendrogram.

As can be seen, in both cases two clusters were distinguished, one of which includes source codes 1, 2 and 3 and the other source codes 4, 5, 6, 7 and 8. When using Moss, the user must determine these clusters visually, while in the proposed framework the dendrogram shows the clusters automatically.

V. CONCLUSION AND FUTURE WORK

In this paper a new framework for plagiarism detection in coding documents, based on attribute counting, is presented. The advantages of the proposed framework are: a) it is fast and can work on large volumes of data, since it creates a feature vector for each document and eliminates the need to process the whole document every time, b) it provides a method to detect the giver/taker of a code in case of plagiarism, and c) it is capable of detecting plagiarized documents in which the code is copied from the internet or provided by a third party.

Future work will focus on assigning different weights to the features, since one feature could be more important than another. Furthermore, the current version of the proposed feature extraction creates the feature vector for the whole document rather than for part of a document. Consequently, it may fail to detect plagiarism when only part of a document is copied. The feature vectors therefore need to be created for blocks of code rather than for the whole code. This would further help with detecting the areas of the code that have been copied. Finally, the developer profile module will be improved to merge feature vectors efficiently. It must be mentioned that some development tools automatically apply formatting rules to code, which decreases the importance of some elements of the feature vector; this should be handled as well.

REFERENCES

[1] V.J. Haines, G.M. Diekhoff, E.E. Labeff and R.E. Clark, "College cheating: Immaturity, lack of commitment, and the neutralizing attitude," Research in Higher Education 25(4):342-354, 1986.
[2] P. Ashworth, P. Bannister, and P. Thorne, "Guilty in whose eyes? University students perceptions of cheating and plagiarism in academic work and assessment," Studies in Higher Education 22(2):187-203, 1997.
[3] S. Schleimer, D. Wilkerson and A. Aiken, "Winnowing: Local Algorithms for Document Fingerprinting," SIGMOD 2003, San Diego, CA, USA, June 9-12, 2003.
[4] L. Prechelt, G. Malpohl and M. Philippsen, "Finding plagiarisms among a set of programs with JPlag," J. Univ. Comput. Sci., 8, 1016-1038, 2002.
[5] C. Collberg, G. Myles and M. Stepp, "Cheating Cheating Detectors," Technical Report TR04-05, 2004.
[6] C. Arwin, and S.M.M. Tahaghoghi, "Plagiarism Detection across Programming Languages," 29th Australasian Computer Science Conference Vol.48, 2006.
[7] J. Brassil, S. Low, N. Maxemchuk and L. O'Gorman, "Electronic Marking and Identification Techniques to Discourage Document Copying," 13th Proceedings IEEE Digital Object Identifier, 1278 - 1287 vol.3, 1994.

[8] C. Daly and J. Horgan, "Patterns of plagiarism," Proceedings of the Thirty-Sixth SIGCSE Technical Symposium on Computer Science Education, pp. 383-387, SIGCSE 2005.
[9] C. Daly and J.M. Horgan, "Automatic Plagiarism Detection," Proceedings of the IASTED International Conference Applied Informatics, pp. 255-259, 2001.
[10] Simon, "Electronic watermarks to help authenticate soft-copy exams," ACE '05: Proceedings of the 7th Australasian conference on Computing education - Volume 42, 2005.
[11] H.T. Jankowitz, "Detecting plagiarism in student Pascal programs," Computer Journal, Vol. 31, No. 1, pp. 1-8, 1988.
[12] G. Frantzeskou, S. MacDonell, E. Stamatatos, and S. Gritzalis, "Examining the Significance of High-level Programming Features in Source Code Author Classification," The Journal of Systems and Software, 81(3), 447-460, Elsevier, 2008.
[13] G. Frantzeskou, E. Stamatatos, S. Gritzalis, C.E. Chaski, and B.S. Howald, "Identifying Authorship by Byte-Level N-Grams: The Source Code Author Profile (SCAP) Method," Int. Journal of Digital Evidence, 6(1), 2007.
[14] J. Hope, "The Authorship of Shakespeare's Plays," Cambridge University Press, Cambridge, 1994.
[15] A. Jadalla and A. Elnagar, "PDE4Java: Plagiarism Detection Engine for Java source code: a clustering approach," IJBIDM 3(2): 121-135, 2008.
[16] S. Mann and Z. Frew, "Similarity and originality in code: plagiarism and normal variation in student assignments," Proceedings of the 8th Australasian conference on Computing education - Volume 52, 2006.
[17] L. Moussiades and A. Vakali, "PDetect: A Clustering Approach for Detecting Plagiarism in Source Code Datasets," The Computer Journal Vol. 48 No. 6, 2005.
[18] R.C. Lange and S. Mancoridis, "Using code metric histograms and genetic algorithms to perform author identification for software forensics," Proceedings of the 9th annual conference on Genetic and evolutionary computation, July 07-11, 2007.
[19] J. Sheard, A. Carbone and M. Dick, "Determination of Factors which Impact on IT Students' Propensity to Cheat," Proc. Fifth Australasian Computing Education Conference, 119-126, ACM Press, 2002.
[20] M. Shevertalov, J. Kothari, E. Stehle, and S. Mancoridis, "On the Use of Discretized Source Code Metrics for Author Identification," IEEE Proceedings of the 1st International Symposium on Search Based Software Engineering (SBSE'09), Windsor, UK, May, 2009.
