Malware is the biggest threat to our data. Protecting against the immense number of new malware samples and the damage they cause requires automatic malware classification methods. Existing systems for dynamic malware analysis are mostly provided as online services and cannot be deployed locally, which is inconvenient for users. To address this issue, the proposed approach is developed as an open source framework that supports local deployment. An experiment is carried out on real-time malware samples, in which API calls are extracted and used in a supervised learning approach. Based on the analysis of the extracted API calls, a binary classification is performed using an Artificial Neural Network.
Keywords: Malware - API call sequence - Cuckoo - Open source
Technological growth has changed the way we communicate. The use of electronic devices and technology is increasing tremendously, organizations are becoming more dependent on their information systems, and the public is increasingly concerned with the security of personal data.
Malware is malicious software designed with the intent of disturbing the normal operation of a computer and gaining unauthorized access to a computer system or resource. Types of malware include viruses, worms, Trojan horses, keyloggers, and spyware. Every day hundreds of new malware samples are released. To protect systems from malware, automatic malware classification methods are needed, and malware analysis is required to understand how a given malware functions. The traditional method of malware classification is the use of signatures, which uniquely identify a malware. Although signature-based classification works efficiently, its major drawbacks are the need for frequent updates of the signature database and the failure to identify malware whose signature is not present in the database. To overcome these drawbacks, much research is being carried out on malware detection based on non-signature methods.
The rest of the paper is organized as follows. Section 2 describes existing work on malware detection. Section 3 describes the motivation for the proposed approach. Section 4 outlines the architecture of the proposed system. Section 5 provides a detailed analysis of the experimental results. Section 7 discusses future work.
This section reviews the research that has been carried out on malware detection.
Nature of Analysis:
Previous works have analyzed malware samples using three broad approaches: static, dynamic, and hybrid analysis.
The most commonly used method is static analysis, which involves performing a detailed analysis of the code. Techniques for performing static analysis, such as file fingerprinting, disassembly, and packer detection, have been explained in prior work. Tools such as IDA and OllyDbg can be used for performing static analysis. The advantage of static analysis is that the entire execution path can be analyzed, and it is safer since the malware is not allowed to execute. The major drawback of static analysis is that it cannot detect malware that uses anti-reverse-engineering techniques such as obfuscation. Zahra et al. used the IDA static analysis tool to extract API calls and employed Euclidean distance and a multilayer perceptron with different learning algorithms to classify malware variants into the correct families. Christodorescu et al. used a dependency graph between system calls to specify malware behavior.
To address the drawbacks of static analysis, dynamic malware analysis was proposed. In dynamic analysis, the malware is run within a controlled environment and its behavior is analyzed. The major advantage of dynamic analysis is that it is insensitive to techniques such as obfuscation or runtime packing. Its limitation is that it fails to observe all capabilities of a malware, since only the behavior exhibited during the run is seen. Tools used for performing dynamic analysis include Cuckoo Sandbox, CWSandbox, Anubis, and BitBlaze. Behavioral features of malware have been analyzed by using API calls. In another work, API calls are extracted and the NMF algorithm is used to group the malware samples into several different clusters. An open source, locally deployable software for performing automated malware analysis has also been proposed.
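As an illustration of how API calls might be pulled out of a sandbox run, the sketch below walks a JSON report and collects the call names in order. It assumes a Cuckoo-style report layout in which each entry under `behavior.processes` carries a `calls` list whose items have an `api` field. The trimmed report, the process names, and the helper names `extract_api_sequence` and `api_call_counts` are hypothetical and are not taken from the proposed framework.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed report in the layout assumed above.
SAMPLE_REPORT = json.loads("""
{
  "behavior": {
    "processes": [
      {"process_name": "sample.exe",
       "calls": [{"api": "NtCreateFile"}, {"api": "NtWriteFile"}, {"api": "NtClose"}]},
      {"process_name": "child.exe",
       "calls": [{"api": "RegOpenKeyExW"}, {"api": "NtClose"}]}
    ]
  }
}
""")


def extract_api_sequence(report):
    """Return the API call names of all processes, in recorded order."""
    sequence = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            name = call.get("api")
            if name:  # skip malformed entries without an API name
                sequence.append(name)
    return sequence


def api_call_counts(sequence):
    """Count how often each API call occurs in a sequence."""
    return Counter(sequence)


api_sequence = extract_api_sequence(SAMPLE_REPORT)
api_counts = api_call_counts(api_sequence)
print(api_sequence)
print(api_counts)
```

Either the ordered sequence or the per-call counts could then serve as input features for a classifier or a clustering step.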
Hybrid analysis is the combination of static and dynamic analysis. One hybrid approach uses the frequencies of opcodes for the static analysis and analyzes the behavior of malware binaries for the dynamic analysis.
Machine Learning for Malware Detection:
The two major techniques used for malware detection are supervised learning and unsupervised learning.
A supervised learning algorithm uses a known dataset to make predictions. The training data include input data and their corresponding output responses. Classification falls under supervised learning: it is the process of identifying the class to which a data item belongs, whereby the model learns a function that maps the input variables to the output variables. Ahmed et al. used a combination of spatial and temporal features to classify malware. Machine learning techniques such as Random Forest have also been used to perform malware classification.
Artificial Neural Networks (ANN) can be used to perform supervised learning. They can be viewed as parallel and distributed processing systems that consist of a huge number of simple and highly interconnected processors, or processing units. They are inspired by biological neural networks, in which neurons communicate with one another by passing signals.
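To make the idea of a processing unit concrete, the sketch below trains a single sigmoid neuron for binary classification (1 for malware, 0 for benign) by gradient descent on the log loss. It is only one unit, not a full network, and it is not the network used in this work. The three features (fractions of file, registry, and network API calls), the toy values, and the function names are all invented for illustration.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Output of one processing unit: sigmoid of the weighted input sum."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_neuron(samples, labels, epochs=2000, learning_rate=0.5):
    """Fit the unit with stochastic gradient descent on the log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = predict_proba(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Invented toy data: [file-call fraction, registry-call fraction, network-call fraction]
samples = [
    [0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.7, 0.2, 0.1],  # labeled malware
    [0.1, 0.1, 0.8], [0.2, 0.1, 0.7], [0.1, 0.2, 0.7],  # labeled benign
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_neuron(samples, labels)
predicted = [1 if predict_proba(weights, bias, s) >= 0.5 else 0 for s in samples]
print(predicted)
```

A multilayer network stacks many such units and passes the outputs of one layer as the inputs of the next, which lets it learn decision boundaries that a single unit cannot represent.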
In one approach, vector creation is done by calculating the average number of API calls for each family. This vector is then fed to a Euclidean distance calculation to classify the malware into their respective families. ANN with 10 different algorithms has also been used, and their performance measures compared. A comparative study of BP in ANN, J48, K-nearest classifier, and Naïve Bayes has also been performed.
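The average-vector and Euclidean distance step described above can be sketched as follows, under the assumption that each sample is represented by a fixed-length vector of API call counts. The family names, the counts, and the helper names are hypothetical, and the cited work may build its vectors differently.

```python
import math


def mean_vector(vectors):
    """Average the API call counts column by column."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_family_profiles(samples_by_family):
    """Map each family name to the mean vector of its samples."""
    return {family: mean_vector(vectors)
            for family, vectors in samples_by_family.items()}


def classify(profiles, vector):
    """Return the family whose mean vector is closest to the sample."""
    return min(profiles,
               key=lambda family: euclidean_distance(profiles[family], vector))


# Invented counts of three API calls per sample, grouped by hypothetical family.
samples_by_family = {
    "family_a": [[10, 2, 0], [12, 4, 2]],
    "family_b": [[1, 8, 9], [3, 10, 11]],
}

profiles = build_family_profiles(samples_by_family)
unknown_family = classify(profiles, [9, 3, 1])
print(profiles)
print(unknown_family)
```

Because every family is reduced to one mean vector, classification costs one distance computation per family, but families whose samples vary widely may be poorly represented by a single average.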
In unsupervised learning, inferences are made from a dataset consisting only of input values, i.e. unsupervised learning does not have labeled responses. In clustering, the data are partitioned into groups based on similarities among the data. There are two types of clustering, namely hard clustering and soft clustering. Hard clustering is the most common type, in which one sample belongs to only one cluster, whereas with soft clustering a sample can belong to more than one cluster. Frameworks for performing clustering of malware samples have been proposed. Akinori et al. used the Non-negative Matrix Factorization method for grouping malware belonging to a particular family into one cluster. Cuckoo Sandbox, BBIS, and Malheur techniques have been integrated to develop an open source framework for malware analysis that uses the n-gram technique for clustering of malware.
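The difference between hard and soft clustering can be shown with fixed cluster centers. In the sketch below, hard assignment picks the single nearest center, while soft assignment computes a membership degree for every center using the fuzzy c-means membership formula with fuzzifier 2. This formula is a choice made here for illustration and does not come from the works discussed above; the centers and the sample are invented.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hard_assignment(centroids, vector):
    """Hard clustering: index of the single nearest cluster center."""
    distances = [euclidean_distance(c, vector) for c in centroids]
    return distances.index(min(distances))


def soft_memberships(centroids, vector):
    """Soft clustering: one membership degree per cluster, summing to 1."""
    distances = [euclidean_distance(c, vector) for c in centroids]
    zeros = [d == 0.0 for d in distances]
    if any(zeros):
        # The sample sits exactly on one or more centers.
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    inverse = [1.0 / d ** 2 for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]


centroids = [[0.0, 0.0], [4.0, 0.0]]  # invented cluster centers
sample = [1.0, 0.0]                   # invented sample

hard_label = hard_assignment(centroids, sample)
memberships = soft_memberships(centroids, sample)
print(hard_label)
print(memberships)
```

Here the sample is assigned only to the first cluster under hard clustering, while soft clustering reports that it belongs mostly to the first cluster and partly to the second.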
The existing solutions for malware detection are available online, and local deployment is not possible. The size of a file that can be uploaded online for malware detection is limited. Behavioral analysis has high false positive rates. None of the existing works provides a solution to these problems. This work therefore proposes a framework towards building an open source malware detection system that is locally deployable.