
Detecting and Classifying Malware Executables Through Data Mining

Elizabeth K. Maloney
Student, Computer Science Department, Columbus State University, Columbus, Georgia
Abstract. Computer systems connected to the Internet are being bombarded with malware. Governments, rogue corporations, organized crime, and individuals have been identified as malware developers, and although each has a different purpose, they communicate with and support one another in continuously developing new approaches to creating malware [3]. More automated tools are necessary to fight this increasing burden on both users and anti-malware software companies. "Learning to Detect and Classify Malicious Executables in the Wild" describes an experimental automated data mining solution, based on hex dumps, that is proposed for both malware identification and analysis on the fly [1]. This paper describes an alternative solution that takes Kolter & Maloof's experiment one step closer to becoming a viable tool for automatically detecting and removing malware. It proposes combining the strengths of Kolter & Maloof's proposal with several changes to the input data and testing setup. The alternative solution retains the hex dump technique and statistical analysis while changing the number and type of executables used for training and testing, classifying benign executables, and adding simulated networks on which to apply the new solution. Depending on the outcome, the results will either provide further support for or discount the Kolter & Maloof solution for detecting malware.

I. INTRODUCTION

As recently as February 2008, AV-Test reported that the Internet is experiencing a glut of malware, causing antivirus companies to analyze 2000-3000 new viruses per hour [3]. There are three primary reasons for this abundance: malware developers have easier-to-use development tools; malware development is becoming more profitable; and malware can produce variants of itself, thereby avoiding detection and increasing the number of infections. Along with this increase, the complexity of new malware and the damage it inflicts are becoming more burdensome [3]. Infected computers contain spyware, backdoors, damaged data, and lost executables. Malware developers are adding more malevolent functionality to their products: instead of being just a virus, worm, or trojan, a single new malware product can have all three features and can be packaged in ever-changing obfuscated files. This rapid increase in malware numbers and complexity is forcing antivirus companies to research and implement new ways to identify, classify, and delete malware.

Kolter & Maloof, authors of the 2006 paper "Learning to Detect and Classify Malicious Executables in the Wild", could not have known it at the time, but their concerns about malware were corroborated by the 2008 investigations of AV-Test.org, which found that increasing complexity in malware makes it more difficult to keep malware away from legitimate computer systems [1]. Because of this problem, Kolter & Maloof recommend that antivirus software developers add new tools and methodologies to antivirus solutions to address these increasing difficulties [1]. Their publication discusses an experimental solution that uses data mining to analyze four-byte hexadecimal codes derived from binary executables. This paper proposes an alternative methodology that combines the strengths of Kolter & Maloof's experiment with improvements that, depending on actual results, will either strengthen or weaken support for using this process for malware detection. The following sections describe Kolter & Maloof's experiment, together with the weaknesses identified in their proposal and recommended changes. Section 2 describes related work, Section 3 provides an overview of how the hexadecimal n-grams are created, Sections 4 through 6 explain the analytical approaches and results for each experimental setup, Section 7 describes proposed improvements, and Section 8 contains a conclusion.

II. RELATED WORK

The authors discuss several related papers, but only one proposed a solution similar to theirs. Schultz et al. (2001) created binary profiles of programs by deriving binary attributes from two-byte hexadecimal sequences in the executables [5]. Using data mining techniques, Schultz et al. trained their system on 6 files, each containing 1/6 of the lines from the hexadecimal datasets. The first file contained lines 1, 6, 12, ...; the second file contained lines 2, 7, 13, ...; and so on, until 6 learning files had been created. For this analysis, Voting Naive Bayes on the hex dump produced the best result, with a .95 true-positive rate, a .06 false-positive rate, and 96.88% accuracy.

III. TECHNICAL DESCRIPTION

To begin their investigation in 2003, Kolter & Maloof created a collection of 3622 executables, made up of 1971 benign and 1651 malicious executables in the Windows PE format [1]. Sources for the benign executables included folders on machines running Windows 2000 and XP, SourceForge [6], and a third source [7]. There is no mention of the method used for selecting these benign executables. A better methodology for this experiment would be to deliberately choose benign executables similar to those used by most computer users. With such a selection method, the experiment could more clearly measure the extent of potential false positives in real-world situations.

VX Heavens and MITRE Corporation provided malicious virus loaders, worms, and trojan horses; some were not obfuscated, while others used compression, encryption, or both [8] [9]. The obfuscation status of these malware samples remained unknown to the researchers. Commercial antivirus programs were run on this set of malware to determine malware categories. One product was unable to identify 18 of the 114 malicious executables, and another was unable to recognize 20 executables as malware, even though they were in the public domain. An additional 291 malware samples, collected after the initial 2003 set, were used for a real-world online evaluation. Since most of the executables were collected in or before 2003, this paper proposes performing the experiment again using the current samples along with more executables gathered through 2008.

All executables were converted to hexadecimal codes in ASCII format. Starting with the first byte, four sequential bytes formed one n-gram, and a rolling window advanced one byte at a time, so that each byte in turn occupied the first position and was concatenated with the three bytes that followed it (the authors did not say how they handled the last three bytes). This processing produced 255,904,403 distinct n-grams. For each executable, the authors took the combined n-gram dataset and assigned each n-gram a Boolean value of 1 (True) or 0 (False), depending on whether or not the executable contained that n-gram. These n-gram features provided the input for three types of analysis: 1. Detecting Malicious Executables, 2. Classifying Executables, and 3. Evaluating Real-world, Online Performance.
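The rolling four-byte window and the Boolean encoding described above can be illustrated with a minimal Python sketch. The function names and the toy byte strings are hypothetical, and dropping the incomplete windows at the end of a file is an assumption, since the handling of the last three bytes is not documented.

```python
from typing import Iterable

N = 4  # n-gram length in bytes, as described in the text


def byte_ngrams(data: bytes, n: int = N) -> set[str]:
    """Return the distinct n-grams of `data`, each as a hex string.

    A window of n bytes slides forward one byte at a time. The final
    n - 1 bytes never start a full window; this sketch simply drops
    those partial windows (an assumption, the source does not say).
    """
    return {data[i:i + n].hex() for i in range(len(data) - n + 1)}


def boolean_features(sample: set[str], vocabulary: Iterable[str]) -> list[int]:
    """Map one executable's n-gram set onto a fixed vocabulary as 0/1 flags."""
    return [1 if gram in sample else 0 for gram in vocabulary]


# Toy stand-ins for executables (hypothetical byte strings, not real PE files).
exe_a = bytes.fromhex("4d5a90000300")
exe_b = bytes.fromhex("4d5a9000ffff")

grams_a = byte_ngrams(exe_a)
grams_b = byte_ngrams(exe_b)

# The combined vocabulary plays the role of the full n-gram dataset.
vocabulary = sorted(grams_a | grams_b)

row_a = boolean_features(grams_a, vocabulary)
row_b = boolean_features(grams_b, vocabulary)
```

Storing each executable's n-grams as a set keeps only distinct values, which matches the presence/absence encoding; counts of repeated n-grams are deliberately discarded.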
This methodology of creating n-grams is sound and should be applied consistently in future experiments. It provides a mechanism for examining executables without decrypting or reverse engineering them, steps that could introduce errors into subsequent analysis. The other benefit of this process is that every byte takes a turn in the first position, allowing the system to examine every contiguous four-byte combination of codes.

IV. DETECTING MALICIOUS EXECUTABLES

Kolter & Maloof ran their experiment on both a small and a large collection of executables [1]. For each collection, the executables were randomly divided into ten data sets of equal size, called partitions. One data set was used for testing and the other nine for training. The procedure was then repeated nine more times using the same methodology, with a different data set held out for testing each time. Within each partition, they calculated the information gain (average mutual information) of each n-gram across all the executables combined and used it to select the top 500 n-grams for analysis. Once the partitions were created, the analysis was carried out using the following methods: Boosted J48; Boosted SVM; IBk, k=5; SVM; Boosted Naive Bayes; J48; and Naive Bayes. Analysis of the larger data set gave a slightly better outcome than the smaller set. The area under the ROC curve (AUC) was used to determine the best analytical method on this data; a perfect score occurs when the area under the curve equals 1. As the reproduced Tables 1 and 2 show, Boosted J48 performed best for both data set sizes: .98 on the smaller set and .99 on the larger set [1]. Others, such as IBk and boosted SVMs, also performed well, with AUCs of .9695/.9903 and .9744/.9925 respectively on the small/large collections.

TABLE 1: RESULTS FOR DETECTING MALICIOUS EXECUTABLES IN THE SMALL COLLECTION. MEASURES AREA UNDER THE ROC CURVE (AUC) WITH A 95% CONFIDENCE INTERVAL [1].

Method                 AUC
Boosted J48            0.9836 ± 0.0095
Boosted SVM            0.9744 ± 0.0118
IBk, k=5               0.9695 ± 0.0129
SVM                    0.9671 ± 0.0133
Boosted Naive Bayes    0.9461 ± 0.0170
J48                    0.9235 ± 0.0204
Naive Bayes            0.8850 ± 0.0247

TABLE 2: RESULTS FOR DETECTING MALICIOUS EXECUTABLES IN THE LARGER COLLECTION. MEASURES AREA UNDER THE ROC CURVE (AUC) WITH A 95% CONFIDENCE INTERVAL [1].

Method                 AUC
Boosted J48            0.9958 ± 0.0024
Boosted SVM            0.9925 ± 0.0033
IBk, k=5               0.9903 ± 0.0038
SVM                    0.9899 ± 0.0038
Boosted Naive Bayes    0.9887 ± 0.0042
J48                    0.9712 ± 0.0067
Naive Bayes            0.9366 ± 0.0099
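The feature-selection step in this section, ranking Boolean n-gram features by information gain and keeping the best ones, can be sketched as follows. This is an illustrative reading of the standard information-gain definition rather than the authors' code; the function names and the toy data are hypothetical, and the toy example keeps k=1 feature where the experiment kept 500.

```python
import math


def entropy(pos: int, neg: int) -> float:
    """Shannon entropy (in bits) of a two-class count."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(column: list[int], labels: list[int]) -> float:
    """Reduction in label entropy obtained by splitting on one Boolean feature."""
    n = len(labels)
    gain = entropy(sum(labels), n - sum(labels))
    for value in (0, 1):
        subset = [y for x, y in zip(column, labels) if x == value]
        if subset:
            pos = sum(subset)
            gain -= len(subset) / n * entropy(pos, len(subset) - pos)
    return gain


def top_k_features(rows: list[list[int]], labels: list[int], k: int) -> list[int]:
    """Indices of the k features with the highest information gain."""
    columns = list(zip(*rows))
    scores = [information_gain(list(col), labels) for col in columns]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: 4 executables x 3 n-gram features; label 1 = malicious (hypothetical).
rows = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
labels = [1, 1, 0, 0]
selected = top_k_features(rows, labels, k=1)
```

In the toy data the first feature separates the classes perfectly, so it carries one full bit of information and is the one selected; the other two features are independent of the label and score zero.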

V. CLASSIFYING EXECUTABLES

To classify executables, Kolter & Maloof investigated whether malware could be classified by the type of malicious function it performs on a system. Unfortunately, there was not enough data to support stratifying the samples into many functions, so the authors adopted a one-versus-all classification, in which the analysis compares one function, such as backdoor, against all the other types combined [1]. To classify executables, the authors needed a mechanism for training the system using existing information, so they attempted to label the collected malware using several popular malware detection software packages. Although their process seemed sound, it yielded only three weak classification schemes, and they settled on three partitioned sets: 1. malicious executable viruses versus all other malicious executables, 2. malicious executables with a backdoor versus all other malicious executables, and 3. malicious executables with a mass-mailer versus all other malicious executables. Each set was analyzed with the same seven statistical methods used for detecting malicious executables, resulting in lower overall areas under the ROC curve. For Mass Mailer the highest AUC was SVM at .89; for Backdoor it was Boosted J48 at .87; and for Virus it was Boosted J48 at .91.
                       Payload
Method                 Mass Mailer        Backdoor           Virus
Boosted J48            0.8888 ± 0.0152    0.8704 ± 0.0161    0.9114 ± 0.0166
SVM                    0.8986 ± 0.0145    0.8508 ± 0.0171    0.8999 ± 0.0175
IBk, k=5               0.8829 ± 0.0155    0.8434 ± 0.0174    0.8975 ± 0.0177
Boosted SVM            0.8758 ± 0.0160    0.8625 ± 0.0165    0.8775 ± 0.0192
Boosted Naive Bayes    0.8773 ± 0.0159    0.8313 ± 0.0180    0.8370 ± 0.0216
J48                    0.8315 ± 0.0184    0.7612 ± 0.0205    0.8295 ± 0.0220
Naive Bayes            0.7820 ± 0.0205    0.8190 ± 0.0185    0.7574 ± 0.0250
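The one-versus-all setup and the AUC figures reported for it can be made concrete with a small sketch. The payload tags and classifier scores below are invented for illustration, the function names are hypothetical, and the AUC is computed with the pairwise-comparison formulation rather than whatever tooling the original study used.

```python
def one_versus_all(payloads: list[set[str]], target: str) -> list[int]:
    """Label 1 if an executable carries the target payload, else 0."""
    return [1 if target in p else 0 for p in payloads]


def auc(scores: list[float], labels: list[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive outscores a
    randomly chosen negative, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical payload tags for five malicious executables.
payloads = [
    {"backdoor"},
    {"virus"},
    {"mass-mailer", "backdoor"},
    {"virus"},
    {"mass-mailer"},
]
# Hypothetical classifier scores for "has a backdoor" (higher = more likely).
scores = [0.9, 0.2, 0.6, 0.7, 0.1]

labels = one_versus_all(payloads, "backdoor")
backdoor_auc = auc(scores, labels)
```

Repeating the relabeling with "virus" and "mass-mailer" as the target gives the other two binary problems, each evaluated the same way.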




VI. EVALUATING REAL-WORLD, ONLINE PERFORMANCE

Using the 291 malicious executables collected after the initial 2003 data set, Kolter & Maloof designed a real-world, online evaluation of the detectors [1]. For this evaluation, the authors did not partition the original data; instead, they used the entire 2003 data set to train the classifiers and selected three desired false-positive rates, .01, .05, and .1, as decision thresholds. As expected, Boosted J48 provided the best results compared to the other six methods.
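One plausible way to turn a desired false-positive rate into a decision threshold, and then measure the detect rate on unseen malware, is sketched below. This is an assumption about the mechanics, not necessarily the procedure the authors followed; the function names and all scores are hypothetical.

```python
def threshold_for_fpr(benign_scores: list[float], desired_fpr: float) -> float:
    """Score cutoff such that at most desired_fpr of benign scores lie strictly above it.

    A sample is flagged as malicious when its score is strictly greater
    than the returned threshold.
    """
    ordered = sorted(benign_scores, reverse=True)
    allowed = int(desired_fpr * len(ordered))  # benign samples we may misflag
    return ordered[allowed] if allowed < len(ordered) else float("-inf")


def detect_rate(malware_scores: list[float], threshold: float) -> float:
    """Fraction of malicious samples scoring strictly above the threshold."""
    return sum(s > threshold for s in malware_scores) / len(malware_scores)


# Hypothetical detector scores (higher means "more likely malicious").
benign_scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.55, 0.80]
new_malware_scores = [0.45, 0.60, 0.70, 0.90, 0.95]

# Detect rate on the new malware at each desired false-positive rate.
results = {}
for desired in (0.01, 0.05, 0.1):
    cutoff = threshold_for_fpr(benign_scores, desired)
    results[desired] = detect_rate(new_malware_scores, cutoff)
```

With only ten benign scores, the .01 and .05 targets both allow zero false positives and therefore share a cutoff; a realistic evaluation would need a much larger benign set for the three rates to differ.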
                       Desired False-positive Rate
                       0.01           0.05           0.1
Method                 P      A       P      A       P      A
Boosted J48            0.94   0.86    0.99   0.98    1.00   1.00
SVM                    0.82   0.41    0.98   0.90    0.99   0.93
Boosted SVM            0.86   0.56    0.98   0.89    0.99   0.92
IBk, k=5               0.90   0.67    0.99   0.81    1.00   0.99
Boosted Naive Bayes    0.79   0.55    0.94   0.93    0.98   0.98
J48                    0.20   0.34    0.97   0.94    0.98   0.95
Naive Bayes            0.48   0.28    0.57   0.72    0.81   0.83

(P = predicted detect rate, A = actual detect rate.)

Although the results of the real-world experiments seem impressive, comparing them with current traditional malware detection methods would make the case for the model more robust. One possible technique would be to set up two identical networks, one with traditional malware detection software and the other with a data mining solution running alongside traditional malware detectors. Data captured from these two systems could be compared to show whether the data mining solution adds value in detecting malware.

VII. PROPOSED IMPROVEMENTS

Kolter & Maloof's method of reducing an executable to a set of rolling hex bytes, and then using a smaller number of them selected by highest information gain to identify malware, is one of the major strengths of their experiment. Additionally, the authors paid attention to detail by dividing the training data into 10 partitions that included training and test data, and then testing the various data mining statistical algorithms. They did a good job of running the processes and evaluating results using various statistical models. Since these principles are sound and the results positive, this part of their proposal will be incorporated into the new proposed solution.

The new solution would improve on the original by adding more benign and malicious executables to the datasets, ensuring that more current ones are included. For benign executables, the new solution would attempt to include those from popular as well as rare software applications. The benign executables would also be classified into types, to help identify whether certain types of benign executables can trick the model and be reported as false positives. For categorizing malware, an industry set of standards should be developed that indicates more precisely the types of malware, along with the methodology for classifying each type. Kolter & Maloof mention that there is missing and conflicting data regarding malware classification; this could be reduced by having a single clearinghouse for labeling malware.

Assuming that most real-world systems will need or want to introduce new benign executables, another flaw in the Kolter & Maloof study is that it did not test new benign executables. The new proposed solution would answer the following questions. How does a system get legitimate updates or additional known software added to it? What is the effect of these two activities on the data mining solution, i.e., would unknown benign executables be identified as malware?

This new solution would take Kolter & Maloof's experiment a step further by setting up three real networks, all operating during the same time frame. One would run traditional malware detectors and would be used to determine baseline comparison metrics. One would be targeted at detecting and classifying malware on the fly (this model could use a combined version of the data mining approach and the traditional detector). The third network would have a process for updating the trained data mining model on a planned schedule and for updating and adding benign executables. These experimental systems could capture data about the number of times malware hit the system, as well as other data such as the date/time and type of malware, and perhaps the malware executable itself for future study.

VIII. CONCLUSION

Kolter & Maloof propose a new, potentially automated methodology for detecting and categorizing malware and for deploying real-world solutions to prevent malware from infecting computer systems. Their experimental results are positive and are a good first step toward developing a useful new tool for automatically identifying malware. This paper proposes a new solution that marries the strengths of their solution with several enhancements. If positive, the results from adding these features will strengthen the case for adding data mining techniques to the anti-malware toolbox.

The new solution uses Kolter & Maloof's process of reducing an executable to n-grams, dividing the data into 10 partitions for training and testing, and finally testing the various data mining statistical algorithms. Unfortunately, this may be a weakness in the proposal, since deploying this type of tool may encourage malware developers to write code that looks similar to benign executables when broken into n-grams. The new process should retrain the data mining model using newer and different benign and malicious executables to further validate the positive results. This proposal would increase the number and quality of benign executables to ensure that the system is truly identifying malicious code. Retraining the model carries a risk, as the results may turn out negative. To ensure the model does not overwhelm a system with false positives, benign executables need to be identified and classified. If the model produces a large number of false positives, these classifications will help determine whether the model should be discarded or modified to exclude certain types of benign executables. Classification will also help validate that the benign executable set contains a good representation of common and rare applications.
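The proposal to classify benign executables by type, so that categories prone to false positives can be spotted, could be tallied along the following lines. The category names, the verdicts, and the function name are all hypothetical; the paper does not define a benign taxonomy.

```python
from collections import defaultdict


def false_positive_rate_by_type(
    benign_types: list[str], flagged: list[bool]
) -> dict[str, float]:
    """Per-category share of benign executables the model flagged as malware."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for kind, is_flagged in zip(benign_types, flagged):
        totals[kind] += 1
        hits[kind] += int(is_flagged)
    return {kind: hits[kind] / totals[kind] for kind in totals}


# Hypothetical benign test set: one category per executable and the model's verdict.
benign_types = ["installer", "installer", "browser", "utility", "installer", "utility"]
flagged = [True, False, False, False, True, False]

rates = false_positive_rate_by_type(benign_types, flagged)

# The category with the highest false-positive rate is the first candidate
# for closer inspection or exclusion.
worst = max(rates, key=rates.get)
```

A breakdown like this would show whether false positives are spread evenly or concentrated in one kind of software, which is the evidence needed to decide between discarding the model and excluding a category.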

Setting up simulated networks will be expensive, but well worth the cost, especially if the results are as positive as Kolter & Maloof's. The limitations of these networks will be time constraints and the inability to simulate the solution on a network used by ordinary people in a real working situation. One network would simulate a traditionally protected system; this would be the control, used to capture baseline metrics. The second would contain a trained data mining system along with the traditional malware protection methods; metrics would be captured to prove (or disprove) that the data mining system enhances protection. A third network would be set up like the second, with the addition of a proposed mechanism for automatically updating the malware protection system as well as legitimate software.

Although Kolter & Maloof tried to use a data mining approach for classifying malware, realistically the ability to classify malware is beyond the scope of this proposal. It is a real problem not only for this solution but for others as well. Until a professional working group or organization commits to this endeavor, classifying malware will remain too cumbersome.

This proposed solution, which combines several changes with the strongest features of the Kolter & Maloof proposal, should provide researchers with more robust data for determining whether this type of data mining is a useful tool in the war against malware. If the results are positive, the anti-malware industry would gain an invaluable tool for combating destructive intrusions from malware.

IX. REFERENCES






[1] J. Zico Kolter and Marcus A. Maloof. Learning to Detect and Classify Malicious Executables in the Wild. Journal of Machine Learning Research 7 (2006) 2721-2744.
[2] J. Zico Kolter and Marcus A. Maloof. Learning to Detect Malicious Executables in the Wild. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2004.
[3] Andreas Marx. Malware vs. Anti-Malware: (How) Can We Still Survive? Virus Bulletin, February 2008.
[4] B. Krebs. Anti-Virus Firms Scrambling to Keep Up (Sophistication of Viruses and Other Threats Poses Big Challenges for Companies, Customers). Wednesday, March 19, 2008, 11:12am.
[5] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo. Data mining methods for detection of new malicious executables. Proceedings of the IEEE Symposium on Security and Privacy, pages 38-49, Los Alamitos, CA, 2001. IEEE Press. URL
[6] SourceForge, URL
[7] URL
[8] VX Heavens, URL
[9] MITRE Corporation, URL