Professional Documents
Culture Documents
discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/282942297
READS
148
3 authors, including:
Ali El-Matarawy
Reem Bahgat
8 PUBLICATIONS 5 CITATIONS
Cairo University
SEE PROFILE
55 PUBLICATIONS 83 CITATIONS
SEE PROFILE
Mohammad El-Ramly
Reem Bahgat
ABSTRACT
This paper presents a new technique for clone detection using
sequential pattern mining titled EgyCD. Over the last decade
many techniques and tools for software clone detection have
been proposed such as textual approaches, lexical approaches,
syntactic approaches, semantic approaches , etc. In this
paper, we explore the potential of data mining techniques in
clone detection. In particular, we developed a clone detection
technique based on sequential pattern mining (SPM). The
source code is treated as a sequence of transactions processed
by the SPM algorithm to find frequent itemsets. We run three
experiments to discover code clones of Type I, Type II and
Type III and for plagiarism detection. We compared the
results with other established code clone detectors. Our
technique discovers all code clones in the source code and
hence it is slower than the compared code clone detectors
since they discover few code clones compared with EgyCD.
Keywords
Sequential Pattern Mining, Clone Detection, Data Mining
1. INTRODUCTION
Over the last decade many techniques and tools for software
2. BASIC DEFINITIONS
Mainly we followed the same basic definitions mentioned in
[1, 3].
Definition 1: Code Fragment. A code fragment is a
continuous part of the source code, may consist of one or
more lines. It can be of any granularity, e.g., function
definition, begin-end block, or sequence of statements.
10
3. RELATED WORK
According to [1] clone detection techniques can be divided
into string-based, token-based, parse tree-based, metricsbased, PDG-based or hybrids-based techniques.
It is developed in C++.
It use
clones
11
where > 1
12
i=2
F = { d = 2 * n; }
T then
add e to CC
stillMore = true
end if
}
prune CC by removing all e
where |e| = i and e f and f
where |f| = i+1
CC
CC
i = i + 1
prune all non used elements in F
}
Figure 1. Pseudo-code of EgyCD Algorithm
Check_Apriori(Si+1)
{
For all e
Si+1
{
X = all elements in e except
first element
if e Si then
prune e from Si
end if
}
// after pruning
}
Iteration2:
{
stillMore = false
Si+1 = { ( c = a + b; , d = 2 *n; ) } x { d = 2 * n; }
Si+1 = { ( c = a + b; , c = a + b; , d = 2 * n) }
Si+1 =
//
After
Check_Apriori(S)
stillMore = false
CC = { ( c = a + b; , d = 2 *n; ) }
i=3
F=
}
No more loops since stillMore = false and
CC = { ( c = a + b; , d = 2 *n; ) }
}
Return Si+1
b.
Figure 2. Pseudo-code of Check_Apriori(Si+1)
c.
d.
7.
IMPLEMENTATION DETAILS
13
int x, y, z ;
cin >> x >> y;
z = x + y;
cout c;
cout z;
3.a Source code before transformation
T X, X,X ;
cin>>X>>X;
X=X+X;
T X, X,X ;
cin>>X>>X;
X=X+X;
coutX;
coutX;
3.b Source code after transformation
Figure 3
TX,X,X ;
cin >> X >> X;
X=X+X;
------------------------Cout X;
1.
2.
3.
4.
5.
TX,X,X ;
cin >> X>> X;
--------X=X+X;
Cout X;
8. CASE STUDIES
We have three cases studies. The first case compares the
number of code clones and its corresponding time among
EgyCD, NICAD [28] and simCAD [29] for Type I. The
second case compares the number of code clones and its
corresponding time among the same three tools for Type II.
The third case study was done on a large Java system to detect
code clones of Type I. No need to do a case study for Type III
14
14
12
Time in Seconds
T. in
Sec.
No. of
Clones
T. in
sec.
No. of
Clones
T. in
sec.
1915
80
3.00
0.08
1.2
4304
231
5.00
10
0.1
1.5
5949
345
7.00
11
0.5
1.8
7424
431
8.00
18
0.6
2.1
8454
486
12.00
20
0.7
2.4
EgyCD
600
500
300
8454
7424
5949
Graph (2) and Table (1) compare the detection time for the
same group of files for each tool. EgyCD is slower in
comparison with NICAD and simCad, and this is because
EgyCD discovers much more code clones than NICAD.
It is also due to the nature of the EgyCD algorithm of building
code clones, especially in getting the second itemset of code
clones since its cardinality is so high and equals to the square
of the first itemset cardinality; the first itemset cardinality
equals to all items (lines) in the source code that appear more
than once in the source code.
simCAD
NICAD
EgyCD
100.0000
80.0000
60.0000
40.0000
20.0000
0.0000
1915
400
8454
simCAD
Size in Lines
7424
NICAD
No. of
Clones
Size in Lines
200
100
8454
7424
5949
4304
0
1915
5949
EgyCD
Size in
Lines
S
e
q
.
10
1915
simCAD
NICAD
EgyCD
The third case study is for a very large scale to show that
EgyCD can detect code clones for large scale systems. We
randomly selected 2151 files from the JDK. Their size is 21.8
MB.
4304
The first and the second case studies are applied for the same
set of files. This is the example set of 25 C language files
bundled with NICADs clone detector. We divided them into
5 groups; the first group contains 5 files and each consequent
group contains the files of the pervious group and has 5
additional files. So, the last group contains 25 files. The total
size of these files is 332 KB and they collectively contain
about 8545 lines of code.
Graph (1) and Table (1) compare the number of code clones
detected by each tool, since EgyCD uses an Apriori-based
algorithm, it comprehensively detects all code clones in the
source code. Hence, EgyCD has a high precision and high
recall; it detects all code clones regardless of whether they are
meaningful or not.
4304
since Type II is a sub-case from Type III and it will give the
whole information we need about the algorithm.
Size in Lines
15
No. of
Clones
T. in
Sec.
No. of
Clones
T.
in
sec
No. of
Clones
T.
in
Sec.
1915
148
29.00
18
0.5
4304
345
50.00
21
0.5
5949
506
60.00
26
0.6
12
7424
636
68.00
38
0.6
16
8454
706
73.00
80
0.6
19
80.00
60.00
40.00
20.00
0.00
Siz in Lines
600
EgyCD
500
8454
7424
5949
4304
1915
Size in Lines
No. of
Code
Clones
Time in
Hours
629
59665
10502
3105
0.07
868
102777
17895
5118
0.37
1315
154267
27392
7565
0.63
1782
230485
41328
10772
4.55
2151
310861
53420
14189
7.54
70
60
50
40
30
20
10
8454
Size in Lines
7424
5949
4304
1915
EgyCD
80
Clone Size
(Lines)
16000
14000
12000
10000
8000
6000
4000
2000
0
310861
100
No. Of
Lines
230485
200
No. of
Files
154267
300
S
e
q
.
102777
400
Time in Seconds
100.00
59665
800
120.00
7424
Size in
Lines
simCAD
140.00
1915
S
e
q
.
NICAD
EgyCD
simCAD
NICAD
EgyCD
8454
5949
4304
Size in Lines
16
Detecticting Time
EgyCD
8.00
Time in Hours
7.00
6.00
5.00
4.00
3.00
2.00
1.00
310861
230485
154267
102777
59665
0.00
Size in Lines
11. ACKNOWLEDGMENTS
Thanks to. Chanchal K. Roy, for his support and technical
comments as well as his encouragement for this work, also
thanks for Auni Ku and Ira for their support.
12. REFERENCES
[1] C. K. Roy, J. R. Cordy, R. Koschke, Comparison and
Evaluation of Code Clone Detection Techniques and
Tools: A Qualitative Approach. Comparison and
Evaluation of Code Clone Detection Techniques,
Science of Computer Programming, 74, 470-495, (2009).
[2] B. Baker, On Finding Duplication and Near-Duplication
in Large Software Systems, in: Proceedings of the 2nd
Working Conference on Reverse Engineering, WCRE
1995, pp. 86-95 (1995).
[3] C. K. Roy and J. R. Cordy, An Empirical Study of
Function Clones in Open Source Software Systems, in:
Proceedings of the 15th Working Conference on Reverse
Engineering, WCRE 2008, pp. 81-90 (2008).
[4] E. Juergens, F. Deissenboeck, B. Hummel and S.
Wagner. Do Code Clones Matter? In Proceedings of the
31st International Conference on Software Engineering
(ICSE09), pp. 485495, Vancouver, Canada, May 2009.
[5] J. H. Johnson. Identifying Redundancy in Source Code
Using Fingerprints. In Proceeding of the 1993
Conference of the Centre for Advanced Studies
Conference (CASCON 93), pp. 171183, Toronto,
Canada, October 1993.
[6] B. Baker. On Finding Duplication and Near-Duplication
in Large Software Systems. In Proceedings of the
Second
Working
Conference
on
Reverse
Engineering(WCRE95), pp. 8695, Toronto, Ontario,
Canada, July 1995.
[7] A. Chou, J. Yang, B. Chelf, S. Hallem and D. R. Engler.
An Empirical Study of Operating System Errors. In
Proceedings of the 18th ACM symposium on Operating
systems principles (SOSP01), pp. 7388, Banff,
Alberta, Canada, October 2001.
[8] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner:
Finding Copy-Paste and Related Bugs in Large-Scale
Software Code. IEEE Transactions on Software
Engineering, 32(3):176192, 2006.
[9] M. Fowler. Refactoring: Improving the Design of
Existing Code. Addison-Wesley, 2000.
[10] S. Bellon, R. Koschke, G. Antoniol, J. Krinke and E.
Merlo, Comparison and Evaluation of Clone Detection
Tools, Transactions on Software Engineering, 33(9):577591 (2007).
17
IJCATM : www.ijcaonline.org
18