Professional Documents
Culture Documents
Edward L. Jones
Department of Computer and Information Sciences
Florida A&M University
Tallahassee, FL 32307
850-412-7362
ejones@cis.famu.edu
ABSTRACT
1 INTRODUCTION
Plagiarism is a common problem in programming courses. Although course
syllabi warn students that plagiarism is punishable, the effort the instructor must invest
to build and pursue a case of plagiarism effectively guarantees prosecution will
seldom occur. It is important that educators address the issue of plagiarism, despite the
additional time and emotional burden required to confront offenders [1]. Educators are
obligated not only to talk about issues of ethics and social responsibility [2], but also
to view ethical conduct with the same weight as technical competence. Certainly,
plagiarism will be a constant temptation for computer science students. What ethical
dilemma besides plagiarism is better suited to satisfy the curriculum mandate to treat
ethical problems? Plagiarism is a concrete ethical problem that will recur throughout
the student's academic career.
This paper illustrates how simple metrics can provide initial levels of screening
for excessive collaboration and plagiarism. The application of metrics is a part of an
overall process of accumulating the evidence necessary to confront violators. Our
approach is extensible, permitting the use of additional measures and documentation.
The system provides a sanitized report of the results from the monitoring, which is
distributed to students to sensitize them to fact that “big brother” is watching.
In section 2, we discuss other approaches to plagiarism prevention, detection, and
confrontation. In section 3, we describe the metrics used by our approach, and the
environment in which evidence is collected on suspicious programs. Section 4
contains sample results from using our approach. In section 5 we illustrate how to
extend this approach by adding other measures. The final section presents conclusions
and suggestions for future work.
6th Annual CCSC Northeastern Conference, Middlebury, Vermont, April 20-21, 2001.
1. Verbatim copying.
2. Changing comments.
3. Changing white space and formatting.
4. Renaming identifiers.
5. Reordering code blocks.
6. Reordering statements within code blocks.
7. Changing the order of operands/operators in
expressions.
8. Changing data types.
9. Adding redundant statements or variables.
10. Replacing control structures with equivalent
structures.
The YAP family of approaches [5,6] also tokenizes the source program, but
retains only the tokens that indicate the program structure -- control structures, blocks,
subprograms, and uses of library functions. A language-specific lexicon defines the
significant language elements, including library functions, to be tokenized. The bodies
of subprograms are expanded once, in the order in which they are called. Each
subsequent call to an expanded subprogram is replaced by a unique token. The
resultant token string is the program profile that is used in comparisons. The closeness
of two programs is computed using a longest common subsequence algorithm applied
to the pair of program profiles. The YAP approach detects most of the plagiarism
transformations listed in Table 1.
6th Annual CCSC Northeastern Conference, Middlebury, Vermont, April 20-21, 2001.
3 OUR APPROACH
Our approach creates program-size independent, numeric program profiles.
Closeness is computed as the normalized Euclidean distance between profiles. We
investigate the three profiles shown in Table 2: one focuses on the physical attributes
of the source file, another on language usage, and the third combines the other two.
These profiles are selected for economy of construction and robustness of detecting
the plagiarism transformations given in Table 1.
3.1 Programming Environment
We use a Unix-based program submission and grading environment. Each
student is provided a private repository for submitting assignments and receiving
feedback from the instructor. The student environment contains a set of Unix
commands for accessing the repository. The student is permitted unlimited
submissions, subject to deadline controls. The environment maintains a time-stamped
log of all submissions. The submission environment facilitates the capture of artifacts
used to document potential violations, and permits the accumulation of a historical
archive of student work that provides insight into historical patterns of student
behavior. This archive also serves as the test bed of students programs used in this
work.
3.2 Profiles
The physical profile P describes a program in terms of its physical attributes:
number of lines(l), words(w), and characters (c). The physical profile is obtained from
the UNIX word count (wc) command. The Halstead profile H consists of Halstead
metrics length (N), vocabulary (n) and volume (V). Profile H characterizes the
program in terms of program token types and frequencies. Comments are discarded.
Operator tokens correspond to source language keywords, operator symbols, and
standard library module names. Operands are programmer-defined words.
A third profile is obtained by combining the physical and Halstead profiles. The
composite profile is the 6-vector (l,w,c,N,n,V). Composition combines the similarity
detection strengths of two or more profiles. Table 3 shows the types of transformations
(from Table 1) against which the physical, Halstead and composite profiles are robust.
For the transformation where a check mark appears, the profile will give the same
value. In other instances, the value may differ only slightly. Note that in the distance
computation, small differences in one component are generally weighted downward
6th Annual CCSC Northeastern Conference, Middlebury, Vermont, April 20-21, 2001.
by the other components. Superficial plagiarism transformations usually have only
small effects on the computed distance
Transformation 1 2 3 4 5 6 7 8 9 10
profile P √
profile H √ √ √ √ √ √ √ √
Composite profile P+ H √ √ √ √ √ √ √ √