ability information from public CVE advisories to vulner-able packages and vulnerable source files. We then dis-cover all clones of these packages in a Linux distribution.Finally, we check the manually tracked vulnerable pack-ages that Debian Linux maintain for each CVE and reportif any of our discovered clones are not identified as beingvulnerable.
Clonewise is based on machine learning. We employ sta-tistical classification to learn and then classify two pack-ages as sharing or not sharing code.The classification problem is to have a known set ofclasses and then given an object, assign a class to it. Bina-ry classification is when there are two classes. For exam-ple, spam and not spam in spam classification. Anotherexample is malicious and non malicious in malware clas-sification.In general, classification can also be divided into su-pervised and unsupervised learning. In supervised learn-ing, the classifier is trained using labeled data. The la-beled data has objects with their classes. After training,objects with unknown classes can be classified. In unsu-pervised learning, there is no labelled data. The typicalapproach is then to use clustering to identify classes.Clustering groups similar objects into the same cluster.Many classification algorithms work on feature vec-tors. A feature vector contains a fixed number of elementsor dimensions. Each dimension represents a feature. Thevalue of element might be the number of times that fea-ture occurs, or it may a numeric value representing thefeature directly. Features do not always need to be nu-meric. However, for the classification algorithms we usein Clonewise, numeric features are used.Classification is a well studied problem in machinelearning and software is available to make analyses easy.Weka is a popular data mining toolkit using machinelearning that Clonewise uses to perform machine learn-ing.
2.1 Feature Extraction
Feature extraction is necessary to perform classification.We need to select features that reflect if two packagesshare or do not share code. The list of features we use isshown in Fig. 1 and is discussed in the following subsec-tions.
Number of Filenames
Our first set of features is simply the number of filenamesin the source trees of the two packages being compared.
Source Filenames and Data Filenames
In Clonewise, we distinguish between two types of file-name features. Filenames that represent program sourcecode and programs that represent non program sourcecode. We distinguish these two types of filenames bytheir file extension. The list of extensions used to identifysource code is shown in Fig. 1. Almost all of the featuresin Clonewise are applied for both source and data file-names.
Number of Common Filenames
To identify a relationship exists between two packagessuch that they share common code we use common file-names in their source packages as a feature. Filenamestends to remain somewhat constant between minor ver-sion revisions, and many filenames remain present evenfrom the initial release of that software. For our purposeswe can ignore directory structure and consider the pack-age as a set of files, or we can include directory structureand consider the package as a tree of files. We noted sev-eral things while experimenting with this feature:1.
Many files in a package do not contribute to theactual program code.2.
C code is sometimes repackaged as C++ codewhen cloned. For example, lib3ds.c might becomelib3ds.cxx.3.
The filenames of small libraries can often be re-ferred to as libfoo.xx or foo.xx in cloned form.4.
Some files that are cloned may include the versionnumber. For example, libfoo.c might becomelibfoo43.c.We therefore employ a normalization process on thefilenames to make this feature counting the number ofsimilar filenames more effective.Normalization works by changing the case of eachfilename to be all lower case. If the filename is prefixedwith lib, it is removed from the filename. The file exten-
Figure 3. The assignment problem.
C cpp cxx cc phpinc java py rb jspl pm m mli lua
Figure 2. Source code filename extensions.