Machine Learning Methods for Malware Detection

Kaspersky

In this paper, we summarize our extensive experience using machine learning to build advanced protection for our customers.

Contents

Basic Approaches to Malware Detection
Machine Learning: Concepts and Definitions
  Unsupervised learning
  Supervised learning
  Deep learning
Machine Learning Application Specifics in Cybersecurity
  Large representative datasets are required
  The trained model has to be interpretable
  False positive rates must be extremely low
  Algorithms must allow us to quickly adapt them to malware writers' counteractions
Kaspersky Machine Learning Application
  Detecting new malware in pre-execution with similarity hashing
  Two-stage pre-execution detection on users' computers with similarity hash mapping combined with decision trees ensemble
  Deep learning against rare attacks
  Deep learning in post-execution behavior detection
Applications in the Infrastructure
  Clustering the incoming stream of objects
  Distillation: packing the updates
Summary

Basic Approaches to Malware Detection

An efficient, robust and scalable malware recognition module is the key component of every cybersecurity product. Malware recognition modules decide if an object is a threat, based on the data they have collected on it. This data may be collected at different phases:

• Pre-execution phase data is anything you can tell about a file without executing it. This may include executable file format descriptions, code descriptions, binary data statistics, text strings and information extracted via code emulation and other similar data.
• Post-execution phase data conveys information about behavior or events caused by process activity in a system.

In the early part of the cyber era, the number of malware threats was relatively low, and simple manually created pre-execution rules were often enough to detect threats. The rapid rise of the Internet and the ensuing growth in malware meant that manually created detection rules were no longer practical, and new, advanced protection technologies were needed.

Anti-malware companies turned to machine learning, an area of computer science that had been used successfully in image recognition, searching and decision-making, to augment their malware detection and classification. Today, machine learning boosts malware detection using various kinds of data on host, network and cloud-based anti-malware components.

Machine Learning: Concepts and Definitions

According to the classic definition given by AI pioneer Arthur Samuel, machine learning is a set of methods that gives computers "the ability to learn without being explicitly programmed".

In other words, a machine learning algorithm discovers and formalizes the principles that underlie the data it sees. With this knowledge, the algorithm can reason about the properties of previously unseen samples. In malware detection, a previously unseen sample could be a new file. Its hidden property could be malware or benign. A mathematically formalized set of principles underlying data properties is called the model.

Machine learning is a broad variety of approaches to a solution rather than a single method. These approaches have different capacities and different tasks that they suit best.

Unsupervised learning

One machine learning approach is unsupervised learning. In this setting, we are given only a data set without the right answers for the task. The goal is to discover the structure of the data or the law of data generation.

One important example is clustering. Clustering is a task that includes splitting a data set into groups of similar objects. Another task is representation learning. This includes building an informative feature set for objects based on their low-level description (for example, an autoencoder model).

Large unlabeled datasets are available to cybersecurity vendors, and the cost of their manual labeling by experts is high. This makes unsupervised learning valuable for threat detection. Clustering can help to optimize efforts for the manual labeling of new samples. With informative embedding, we can decrease the number of labeled objects needed for the next machine learning approach in our pipeline: supervised learning.

Supervised learning

Supervised learning is a setting that is used when both the data and the right answers for each object are available. The goal is to fit the model that will produce the right answers for new objects.

Supervised learning consists of two stages:

• Training a model and fitting a model to available training data.
• Applying the trained model to new samples and obtaining predictions.

The task:

• We are given a set of objects.
• Each object is represented with feature set X.
• Each object is mapped to the right answer or labeled as Y.

This training information is utilized during the training phase, when we search for the best model that will produce the correct label Y for previously unseen objects given the feature set X.

In the case of malware detection, X could be some features of file content or behavior, for instance, file statistics and a list of used API functions. Labels Y could be malware or benign, or even a more precise classification, such as a virus, Trojan-Downloader or adware.

In the training phase, we need to select some family of models, for example, neural networks or decision trees. Usually, each model in a family is determined by its parameters.
Training means that we search the selected family for the model with the particular set of parameters that gives the most accurate answers over the set of reference objects according to a particular metric. In other words, we learn the optimal parameters that define a valid mapping from X to Y.

After we have trained a model and verified its quality, we are ready for the next phase: applying the model to new objects. In this phase, the type of the model and its parameters do not change. The model only produces predictions.

In the case of malware detection, this is the protection phase. Vendors often deliver a trained model to users, where the product makes decisions based on model predictions autonomously. Mistakes can cause devastating consequences for a user, for example, removing an OS driver. It is crucial for the vendor to select a model family properly. The vendor must use an efficient training procedure to find the model with a high detection rate and a low false positive rate.

[Figure: training phase and protection phase; executables are processed by a predictive model.]

Deep learning

Deep learning is a special machine learning approach that facilitates the extraction of features of a high level of abstraction from low-level data. Deep learning has proven successful in computer vision, speech recognition, natural language processing and other tasks. It works best when you want the machine to infer high-level meaning from low-level data. For image recognition challenges, like ImageNet, deep learning-based approaches already surpass humans.

It is natural that cybersecurity vendors tried to apply deep learning for recognizing malware from low-level data. A deep learning model can learn complex feature hierarchies and incorporate diverse steps of the malware detection pipeline into one solid model that can be trained end-to-end, so that all of the components of the model are learned simultaneously.

Machine Learning Application Specifics in Cybersecurity

User products that implement machine learning make decisions autonomously. The quality of the machine learning model impacts the user system performance and its state. Because of this, machine learning-based malware detection has specifics.

Large representative datasets are required

It is important to emphasize the data-driven nature of this approach. A created model depends heavily on the data it has seen during the training phase to determine which features are statistically relevant for predicting the correct label.

Let's look at why making a representative data set is so important. Imagine we collect a training set and overlook the fact that, by coincidence, every file in it that is larger than 10 MB is malware and none is benign (which is certainly not true for real-world files). While training, the model will exploit this property of the data set and will learn that any file larger than 10 MB is malware. It will use this property for detection. When this model is applied to real-world data, it will produce many false positives. To prevent this outcome, we need to add benign files with larger sizes to the training set. Then, the model will not rely on an erroneous data set property.

Generalizing this, we must train our models on a data set that correctly represents the conditions where the model will be working in the real world.
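The 10 MB pitfall can be reproduced on toy data. The sketch below is only an illustration under stated assumptions: the one-feature model (`train_stump`, a threshold on file size), the helper `false_positive_rate` and all file sizes are invented for the example and do not describe any production model.

```python
# Toy illustration of a non-representative training set.
# All file sizes and labels below are invented for the example.

def train_stump(samples):
    """Fit the rule 'size > threshold => malware' on (size_mb, is_malware) pairs.

    Returns (threshold, training_accuracy) for the best threshold found.
    """
    best_threshold, best_accuracy = None, -1.0
    for threshold in sorted({size for size, _ in samples}):
        correct = sum((size > threshold) == is_malware
                      for size, is_malware in samples)
        accuracy = correct / len(samples)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


def false_positive_rate(threshold, benign_sizes):
    """Share of benign files that the size rule would flag as malware."""
    flagged = sum(size > threshold for size in benign_sizes)
    return flagged / len(benign_sizes)


# Biased set: by coincidence, every file above 10 MB is malware.
biased = [(1, False), (3, False), (8, False), (10, False),
          (2, True), (12, True), (15, True), (40, True)]

# Hypothetical real-world benign files, several of them large.
real_world_benign = [4, 11, 25, 30, 60]

threshold, accuracy = train_stump(biased)
print(threshold, accuracy)                                # 10 0.875
print(false_positive_rate(threshold, real_world_benign))  # 0.8

# Adding large benign files removes the spurious signal: the size rule
# is now no better than always answering "benign" (8 of 12 files).
representative = biased + [(11, False), (25, False), (30, False), (60, False)]
print(train_stump(representative))                        # (11, 0.666...)
```

On the biased set the size rule looks accurate, yet it flags most large benign files. Once large benign files are part of the training data, file size alone stops being a useful predictor, which is the behavior we want.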
Collecting a representative data set is therefore crucial for machine learning to be successful.

The trained model has to be interpretable

Most of the model families used currently, like deep neural networks, are called black box models. Black box models are given the input X, and they will produce Y through a complex sequence of operations that can hardly be interpreted by a human. This could pose a problem in real-life applications. For example, when a false alarm occurs and we want to understand why it happened, we ask whether it was a problem with the training set or the model itself. The interpretability of a model determines how easy it will be for us to manage it, assess its quality and correct its operation.

False positive rates must be extremely low

False positives happen when an algorithm mistakenly assigns a malicious label to a benign file. Our aim is to make the false positive rate as low as possible, or zero. This is not typical for a machine learning application. It is important, because even one false positive in a million benign files can create serious consequences for users. This is complicated by the fact that there are lots of clean files in the world, and they keep appearing.

To address this problem, it is important to impose high requirements for both machine learning models and the metrics that will be optimized during training, with a clear focus on low false positive rate (FPR) models.

This is still not enough, because new benign files that went unseen earlier may occasionally be falsely detected. We take this into account and implement a flexible design of a model that allows us to fix false positives on the fly, without completely retraining the model. Examples of this are implemented in our pre- and post-execution models, which are described in the following sections.

Algorithms must allow us to quickly adapt them to malware writers' counteractions

Outside the malware detection domain, machine learning algorithms regularly work under the assumption of a fixed data distribution, which means that it doesn't change with time. When we have a training set that is large enough, we can train the model so that it will effectively reason about any new sample in a test set. As time goes on, the model will continue working as expected.

In malware detection, we have to face the fact that the data distribution is not fixed:

• Active adversaries (malware writers) constantly work on avoiding detections and releasing new versions of malware files that differ significantly from those that have been seen during the training phase.
• Thousands of software companies produce new types of benign executables that are significantly different from previously known types. The data on these types was lacking in the training set, but the model, nevertheless, needs to recognize them as benign.

This causes serious changes in data distribution and raises the problem of detection rate degradation over time in any machine learning implementation. Cybersecurity vendors that implement machine learning in their anti-malware solutions face this problem and need to overcome it. The architecture needs to be flexible and has to allow model updates on the fly between retraining. Vendors must also have effective processes for collecting and labeling new samples, enriching training datasets and regularly retraining models.

[Figure: Degradation of a simple test model. Vertical axis: detection rate (% of malware detected). Horizontal axis: how long ago the model has been trained (months). Curves are labeled by FPR.]

Kaspersky Machine Learning Application

The aforementioned properties of real-world malware detection make straightforward application of machine learning techniques a challenging task. Kaspersky has almost a decade's worth of experience when it comes to utilizing machine learning methods in information security applications.

Detecting new malware in pre-execution with similarity hashing

At the dawn of the antivirus industry, malware detection on computers was based on heuristic features that identified particular malware files by:

• code fragments
• hashes of code fragments or the whole file
• file properties
• combinations of these features.
The main goal was to create a reliable fingerprint (a combination of features) of a malicious file that could be checked quickly. Earlier, this workflow required the manual creation of detection rules, via the careful selection of a representative sequence of bytes or other features indicating malware. During the detection, an antiviral engine in a product checked the presence of the malware fingerprint in a file against known malware fingerprints stored in the antivirus database.

However, malware writers invented techniques like server-side polymorphism. This resulted in a flow of hundreds of thousands of malicious samples being discovered every day. At the same time, the fingerprints used were sensitive to small changes in files. Minor changes in existing malware took it off the radar. The previous approach quickly became ineffective because:

• Creating detection rules manually couldn't keep up with the emerging flow of malware.
• Checking each file's fingerprint against a library of known malware meant that you couldn't detect new malware until analysts manually created a detection rule.

We were interested in features that were robust against small changes in a file. These features would detect new modifications of malware, but would not require more resources for calculation. Performance and scalability are the key priorities of the first stages of anti-malware engine processing.

To address this, we focused on extracting features that could be:

• calculated quickly, like statistics derived from file byte content or code disassembly
• directly retrieved from the structure of the executable, like a file format description.

Using this data, we calculated a specific type of hash functions called locality-sensitive hashes (LSH).

Regular cryptographic hashes of two almost identical files differ as much as hashes of two very different files. There is no connection between the similarity of files and their hashes. However, LSHs of almost identical files map to the same binary bucket (their LSHs are very similar) with high probability. LSHs of two different files differ substantially.

[Figure: hash values of similar, non-similar and very similar files under a cryptographic hash and under a locality-sensitive hash.]

But we went further. The LSH calculation was unsupervised. It didn't take into account our additional knowledge of each sample being malware or benign.

Having a dataset of similar and non-similar objects, we enhanced this approach by introducing a training phase. We implemented a similarity hashing approach. It is similar to LSH, but it is supervised and capable of utilizing information about pairs of similar and non-similar objects. In this case:

• Our training data X would be pairs of file feature representations [X1, X2].
• Y would be the label that would tell us whether the objects were actually semantically similar or not.
• During training, the algorithm fits parameters of hash mapping h(X) to maximize the number of pairs from the training set for which h(X1) and h(X2) are identical for similar objects and different otherwise.

This algorithm, applied to executable file features, provides a specific similarity hash mapping with useful detection capabilities. In fact, we train several versions of this mapping that differ in their sensitivity to local variations of different sets of features. For example, one version of similarity hash mapping could be more focused on capturing the executable file structure, while paying less attention to the actual content. Another could be more focused on capturing the ASCII strings of the file.

This captures the idea that different subsets of features could be more or less discriminative to different kinds of malware files.
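The contrast between cryptographic and locality-sensitive hashes can be seen on synthetic data. The sketch below uses SimHash over byte 4-grams, a well-known unsupervised LSH scheme, purely as a stand-in: it is not the supervised similarity hash mapping described above, and the helper names (`shingles`, `simhash`, `pseudo_file`) and the synthetic "files" are assumptions made for illustration.

```python
import hashlib


def shingles(data: bytes, n: int = 4) -> set:
    """Overlapping n-byte substrings used as simple content features."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def simhash(data: bytes, bits: int = 64) -> int:
    """SimHash: every feature votes +1 or -1 on each output bit."""
    votes = [0] * bits
    for feature in shingles(data):
        digest = hashlib.blake2b(feature, digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(bits):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if votes[bit] > 0)


def sha256_int(data: bytes) -> int:
    """SHA-256 digest as an integer, so bit differences can be counted."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")


def pseudo_file(seed: bytes, blocks: int = 64) -> bytes:
    """Deterministic filler bytes standing in for real file content."""
    return b"".join(hashlib.sha256(seed + bytes([i])).digest()
                    for i in range(blocks))


original = pseudo_file(b"sample-a")
# Same content with a single byte inverted at offset 1000.
patched = original[:1000] + bytes([original[1000] ^ 0xFF]) + original[1001:]
unrelated = pseudo_file(b"sample-b")

sha_near = hamming(sha256_int(original), sha256_int(patched))
sha_far = hamming(sha256_int(original), sha256_int(unrelated))
lsh_near = hamming(simhash(original), simhash(patched))
lsh_far = hamming(simhash(original), simhash(unrelated))

print("SHA-256 bit differences (of 256):", sha_near, sha_far)
print("SimHash bit differences (of 64): ", lsh_near, lsh_far)
```

For the cryptographic hash, the one-byte patch changes about as many bits as a completely different file does. For SimHash, the patched file stays close to the original while the unrelated file does not, which is the property that makes bucket lookups by hash value useful for detecting modifications.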
For some malware files, file content statistics could reveal the presence of an unknown malicious packer. For others, the most important piece of information regarding potential behavior is concentrated in strings representing used OS API, created file names, accessed URLs or other feature subsets.

For more precise detection in products, the results of a similarity hashing algorithm are combined with other machine learning-based detection methods.

Two-stage pre-execution detection on users' computers with similarity hash mapping combined with decision trees ensemble

To analyze files during the pre-execution stage, our products combine a similarity hashing approach with other trained algorithms in a two-stage scheme. To train it, we use a large collection of files.

[Figure: feature space split into hard regions (decision trees ensemble) and simple regions (similarity hash). Caption: Schematic representation of segmentation of the object space created with similarity hash mapping. For simplicity, the illustration has only two dimensions. An index of each cell corresponds to the particular similarity hash mapping value. Each cell of the grid illustrates a region of objects with the same value of similarity hash mapping, also known as a hash bucket. Dot colors: malicious (red) and benign/unknown (green). Two options are available: add the hash of a region to the malware database (simple regions) or use it as the first part of the two-stage detector combined with a region-specific classifier (hard regions).]

The two-stage analysis design addresses the problem of reducing computational load on a user system and preventing false positives.

Some file features important for detection require larger computational resources for their calculation. Those features are called "heavy". To avoid their calculation for all scanned files, we introduced a preliminary stage called a pre-detect. A pre-detect occurs when a file is analyzed with "lightweight" features, which are extracted without substantial load on the system. In many cases, a pre-detect provides us with enough information to know if a file is benign and ends the file scan.
Sometimes it even detects a file as malware. If the first stage was not sufficient, the file goes to the second stage of analysis, when "heavy" features are extracted for precise detection.

In our products, the two-stage analysis works in the following way. In the pre-detect stage, the learned similarity hash mapping is calculated for the lightweight features of the scanned file. Then, it is checked to see if there are any other files with the same hash mapping, and whether they are malware or benign. A group of files with the same hash mapping value is called a hash bucket. Depending on the hash bucket that the scanned file falls into, the following outcomes may occur:

• In a simple region case, the file falls into a bucket that contains only one kind of object: malware or benign. If a file falls into a "pure malware bucket", we detect it as malware. If it falls into a "pure benign bucket", we don't scan it any deeper. In both cases, we do not extract any new "heavy" features.
• In a hard region, the hash bucket contains both malware and benign files. It is the only case when the system may extract "heavy" features from the scanned file for precise detection. For each hard region, there is a separate region-specific classifier trained. Currently, we use a modification of a decision tree ensemble or of "heavy" feature-based similarity hashing, depending on what is more effective for each hard region.

In reality, there are some hard regions that are not suitable for further analysis by this two-stage technology, because they contain too many popular benign files. Processing them with this method yields a high risk of false positives and performance degradation. For such cases, we do not train a region-specific classifier and do not scan files in this region through this model. For correct analysis in a region like this, we use other detection technologies.

[Figure: flow from a scanned file through the model to a malware or benign verdict.]

Implementation of a pre-detect stage drastically reduces the amount of files that are heavily scanned during the second step.
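The control flow of the two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the product implementation: `toy_similarity_hash` is a hypothetical coarse quantization standing in for the learned mapping, the "heavy" feature is an invented entropy value, and the region classifier is a one-line threshold instead of a tree ensemble.

```python
from collections import defaultdict


def toy_similarity_hash(light):
    """Hypothetical stand-in for the learned similarity hash mapping."""
    return (light["size_kb"] // 100, light["sections"])


def build_buckets(training_set):
    """Mark each hash bucket as pure_malware, pure_benign or hard."""
    labels_by_hash = defaultdict(set)
    for light, label in training_set:
        labels_by_hash[toy_similarity_hash(light)].add(label)
    return {h: "hard" if len(labels) > 1 else "pure_" + next(iter(labels))
            for h, labels in labels_by_hash.items()}


heavy_calls = []  # records every expensive feature extraction


def extract_heavy(file):
    """Pretend to compute expensive features; here just a stored entropy."""
    heavy_calls.append(file)
    return {"entropy": file["entropy"]}


def scan(file, buckets, region_classifiers):
    """Stage 1: bucket lookup. Stage 2: region classifier, hard regions only."""
    h = toy_similarity_hash(file["light"])
    region = buckets.get(h)
    if region == "pure_malware":
        return "malware"
    if region == "pure_benign":
        return "benign"
    if region == "hard" and h in region_classifiers:
        return region_classifiers[h](extract_heavy(file))
    return "unknown"  # left to other detection technologies


training_set = [
    ({"size_kb": 120, "sections": 3}, "malware"),
    ({"size_kb": 150, "sections": 3}, "malware"),
    ({"size_kb": 480, "sections": 5}, "benign"),
    ({"size_kb": 730, "sections": 4}, "malware"),
    ({"size_kb": 760, "sections": 4}, "benign"),
]
buckets = build_buckets(training_set)

# One invented classifier for the single hard region, bucket (7, 4).
region_classifiers = {
    (7, 4): lambda heavy: "malware" if heavy["entropy"] > 7.0 else "benign",
}

files = [
    {"light": {"size_kb": 199, "sections": 3}, "entropy": 6.1},  # pure malware bucket
    {"light": {"size_kb": 410, "sections": 5}, "entropy": 7.9},  # pure benign bucket
    {"light": {"size_kb": 705, "sections": 4}, "entropy": 7.6},  # hard region
    {"light": {"size_kb": 50, "sections": 2}, "entropy": 5.0},   # unseen bucket
]
verdicts = [scan(f, buckets, region_classifiers) for f in files]
print(verdicts)          # ['malware', 'benign', 'malware', 'unknown']
print(len(heavy_calls))  # 1
```

Only the file that lands in the hard region triggers heavy feature extraction; the other three are resolved, or handed over to other technologies, after a single dictionary lookup.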
The pre-detect stage also improves performance, because the lookup by similarity hash mapping is completed quickly.

Our two-stage design also reduces the risk of false positives:

• In the first (pre-detect) stage, we do not enable detection with region-specific classifiers in regions with a high risk of false positives. Because of this, the distribution of objects passed to the second stage is biased towards the "malware" class. This reduces the false positive rate, too.
• In the second stage, classifiers in each hard region are trained on malware from only one bucket, but on all clean objects available in all the buckets of the training set. This makes a regional classifier detect the malware of a particular hard region bucket more precisely. It also prevents any unexpected false positives when the model works in products with real-world data.

Interpretability of the two-stage model comes from the fact that each hash in a database is associated with some subset of malware samples in training. The whole model could be adapted to a new daily malware stream via adding detections, including hash mappings and tree ensemble models for a previously unobserved region. This lets us revoke and retrain region-specific classifiers without significantly degrading the detection rate of the whole product. Without this, we would need to retrain the whole model on all of the malware that we know with every change we would want to make.

Two-stage malware detection is thus suitable for the specifics of machine learning discussed in the introduction.

Deep learning against rare attacks

Typically, machine learning faces tasks in which malicious and benign samples are numerously represented in the training set. But some attacks are so rare that we have only one example of malware for training. This is typical for high-profile targeted attacks. In this case, we use a very specific deep learning-based model architecture. We call this approach exemplar network (ExNet).
[Figure: Malware X, Malware Y and Malware Z samples mapped to compact object representations and per-exemplar classifiers.]

The idea here is that we train the model to build compact representations of input features, which are then used to train per-exemplar classifiers. These are algorithms that detect particular types of malware. Deep learning allows us to combine these multiple steps (object feature extraction, compact feature representation and local, or per-exemplar, model creation) into one neural network pipeline that distills the discriminative features for various types of malware.

This model can efficiently generalize knowledge about single malware samples and a large collection of clean samples. Then, it can detect new modifications of the corresponding malware.

Deep learning in post-execution behavior detection

The approaches described earlier were considered in the framework of static analysis, when an object description is extracted and analyzed before the object's execution in the real user environment.

Static analysis at the pre-execution stage has a number of significant advantages. The main advantage is that it is safe for the user. An object can be detected before it starts to act on a real user's machine. But it faces issues with advanced encryption, obfuscation techniques and the use of a wide variety of high-level script languages, containers and fileless attack scenarios. These are situations when post-execution behavior detection comes into play.

We also use deep learning methods to address the task of behavior detection. In the post-execution stage, we are working with behavior logs provided by the threat behavior engine. The behavior log is the sequence of system events occurring during the process execution, together with corresponding arguments. In order to detect malicious activity in observed log data, our model compresses the obtained sequence of events to a set of binary vectors. We then train a deep neural network to distinguish clean and malicious logs.
[Figure: Log compression. A behavior log is converted into a bipartite graph, from which behavior patterns are extracted.]

A log's compressing stage includes several steps:

1. The log is transformed into a bipartite behavior graph. This graph contains two types of vertices: events and arguments. Edges are drawn between each event and argument which occur together in the same line in the log. Such a graph representation is much more compact than the initial raw data. It stays robust against any permutations of lines caused by tracing different runs of the same multiprocessing program, or behavior obfuscation by the analyzed process.
2. After that, we automatically extract specific subgraphs, or behavior patterns, from this graph. Each pattern contains a subset of events and adjacent arguments related to a specific activity of the process, such as network communications, file system exploration, modification of the system registry, etc.
3. We compress each "behavior pattern" to a sparse binary vector. Each component of this vector is responsible for the inclusion of a specific event's or argument's token (related to web, file and other types of activity) in the template.
4. The trained deep neural network transforms sparse binary vectors of behavior patterns into compact representations called pattern embeddings. Then they are combined into a single vector, or log embedding, by taking the element-wise maximum.
5. Finally, based on the log embedding, the network predicts the log's suspiciousness.

The main feature of the used neural network is that all the weights are positive and all the activation functions are monotonic. These properties provide us with many important advantages:

• Our model's suspicion score output only grows with time while processing new lines from the log. As a result, malware cannot evade detection by performing additional noise or "clean" activity in parallel with its main payload.
• Since the model's output is stable in time, we are probably protected from eventual false alarms caused by the prediction's fluctuation in the middle of scanning of a clean log.
• Working with log samples in a monotonic space allows us to automatically select events that cause the detection and manage false alarms more conveniently.

An approach like this enables us to train a deep learning model capable of operating with high-level interpretable behavior concepts. This approach is safely applied to the whole diversity of user environments and incorporates false alarm fixing capability in its architecture. Altogether, all of that gives us a powerful means for the behavioral detection of the most complicated modern threats.

Applications in the Infrastructure

From efficiently processing incoming streams of malware in Kaspersky to maintaining large-scale detection algorithms, machine learning plays an equally important role in building a proper in-lab infrastructure.

Clustering the incoming stream of objects

With hundreds of thousands of samples coming into Kaspersky every day, along with the high cost of manual annotation of new types of samples, reducing the amount of data that analysts need to look at becomes a crucial task. Clustering lets us move from an unmanageable number of separate files to a reasonable number of object groups, some of which would be automatically processed based on the presence of an already annotated object.

[Figure: clustering of the incoming stream of objects.]

All recently received incoming files are analyzed by our in-lab malware detection techniques, including pre- and post-execution. We aim to label as many objects as possible, but some objects are still unclassified. We want to label them. For this, all objects, including the labeled ones, are processed by multiple feature extractors. Then, they are passed together through several clustering algorithms (e.g. K-means and DBSCAN), depending on the file type.
This produces groups of similar objects. At this point, we face four different types of resulting clusters with unknown files:

1. clusters that contain malware and unknown files
2. clusters that contain clean and unknown files
3. clusters that contain malware, clean and unknown files
4. clusters that only contain unknown files.

For objects in the clusters of types 1 to 3, we use additional machine learning algorithms like belief propagation to verify the similarity of unknown samples to classified ones. In some cases, this is effective even in the clusters of type 3. This allows us to automatically label unknown files, leaving only the clusters of type 4, and partially of type 3, for humans. This results in a drastic reduction of the human annotations needed on a daily basis.
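The triage of clusters described above can be sketched on toy data. Note the assumptions: the clusters are taken as already produced by some clustering algorithm, the feature vectors are invented 2-D points, and a simple nearest-labeled-neighbor rule with a distance cutoff (`propagate_labels`, `max_distance`) is used as a deliberately simplified stand-in for belief propagation.

```python
import math


def cluster_type(cluster):
    """Return the cluster type (1 to 4) from the list in the text."""
    known = {label for _, label in cluster if label != "unknown"}
    if known == {"malware"}:
        return 1
    if known == {"clean"}:
        return 2
    if known == {"malware", "clean"}:
        return 3
    return 4  # only unknown files


def propagate_labels(cluster, max_distance):
    """Give each unknown object the label of its nearest labeled neighbor,
    but only if that neighbor is within max_distance."""
    labeled = [(vec, label) for vec, label in cluster if label != "unknown"]
    result = []
    for vec, label in cluster:
        if label == "unknown" and labeled:
            near_vec, near_label = min(
                labeled, key=lambda item: math.dist(vec, item[0]))
            if math.dist(vec, near_vec) <= max_distance:
                label = near_label
        result.append((vec, label))
    return result


clusters = [
    [((0.0, 0.0), "malware"), ((0.1, 0.0), "unknown"), ((0.0, 0.2), "unknown")],
    [((5.0, 5.0), "malware"), ((5.2, 5.0), "unknown"),
     ((6.0, 6.0), "clean"), ((5.6, 5.6), "unknown")],
    [((9.0, 9.0), "unknown"), ((9.1, 9.0), "unknown")],
]

summary = []
for cluster in clusters:
    relabeled = propagate_labels(cluster, max_distance=0.5)
    left_for_humans = sum(label == "unknown" for _, label in relabeled)
    summary.append((cluster_type(cluster), left_for_humans))

print(summary)  # [(1, 0), (3, 1), (4, 2)]
```

In this toy run the type 1 cluster is labeled fully automatically, the mixed type 3 cluster leaves one ambiguous object for an analyst, and the type 4 cluster is left entirely to humans, mirroring the split described in the text.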
