Abstract
In this paper, we develop four malware detection methods that use Hamming distance to measure the similarity between samples: first nearest neighbors (FNN), all nearest neighbors (ANN), weighted all nearest neighbors (WANN), and k-medoid based nearest neighbors (KMNN). The proposed methods raise an alarm when an Android app is detected as malicious, which helps to prevent detected malware from spreading on a broader scale. We provide a detailed description of the proposed detection methods and the related algorithms, together with an extensive analysis of the suitability of these similarity-based methods. We run our experiments on three datasets that contain benign and malicious Android apps: Drebin, Contagio, and Genome. To corroborate the effectiveness of our classifiers, we compare their performance with state-of-the-art classification and malware detection algorithms, namely the Mixed and Separated solutions, the program dissimilarity measure based on entropy (PDME), and the FalDroid algorithms. We evaluate three types of features on these three datasets: API, intent, and permission features. The results confirm that the accuracy rates of the proposed algorithms exceed 90% and in some cases (i.e., when considering API features) exceed 99%, which is comparable with existing state-of-the-art solutions.
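To make the nearest-neighbor idea in the abstract concrete, the following minimal Python sketch computes the Hamming distance between binary feature vectors and applies a first-nearest-neighbor (FNN) rule and an all-nearest-neighbors (ANN) majority vote. The training vectors, the labels (0 = benign, 1 = malware), the function names, and the tie-breaking rule are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))

def fnn_predict(x, samples, labels):
    """FNN sketch: return the label of the first sample at minimum distance."""
    distances = [hamming(x, s) for s in samples]
    return labels[distances.index(min(distances))]

def ann_predict(x, samples, labels):
    """ANN sketch: majority vote over every sample at the minimum distance."""
    distances = [hamming(x, s) for s in samples]
    d_min = min(distances)
    votes = Counter(lab for d, lab in zip(distances, labels) if d == d_min)
    # Assumption: a tied vote is resolved toward 1 (malware).
    return 1 if votes[1] >= votes[0] else 0

# Hypothetical query and training data (10 binary features each).
x = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
samples = [
    [0, 0, 0, 1, 0, 0, 0, 1, 1, 0],  # distance 1 from x
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # distance 1 from x
    [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],  # distance 1 from x
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # distance 5 from x
]
labels = [0, 1, 1, 0]

print(fnn_predict(x, samples, labels))  # 0
print(ann_predict(x, samples, labels))  # 1
```

With this toy data the two rules disagree: FNN copies the label of the first closest sample, while ANN lets all equally close samples vote.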
Original language | English |
---|---|
Pages (from-to) | 230-247 |
Number of pages | 18 |
Journal | Future Generation Computer Systems |
Volume | 105 |
Early online date | 9 Dec 2019 |
DOIs | https://doi.org/10.1016/j.future.2019.11.034 |
Publication status | Published - 1 Apr 2020 |
Keywords
- Android
- Clustering
- Hamming distance
- K-nearest neighbor (KNN)
- Malware detection
- Static analysis
In: Future Generation Computer Systems, Vol. 105, 01.04.2020, p. 230-247.
Research output: Contribution to journal › Article › peer-review
TY - JOUR
T1 - Similarity-based Android malware detection using Hamming distance of static binary features
AU - Taheri, Rahim
AU - Ghahramani, Meysam
AU - Javidan, Reza
AU - Shojafar, Mohammad
AU - Pooranian, Zahra
AU - Conti, Mauro
N1 - Funding Information: Mohammad Shojafar is supported by a Marie Curie Global Fellowship, UK (MSCA-IF-GF), funded by the European Commission under agreement grant number MSCA-IF-GF-839255, and Mauro Conti is supported by a Marie Curie Fellowship funded by the European Commission (agreement PCIG11-GA-2012-321980). Appendix: In this section, we give a simple example of how our proposed similarity-based algorithms detect Android malware in a binary dataset. Definition 1: Suppose X is the sample whose label we want to predict. As an example, the vector X can be defined as X = (0, 0, 0, 1, 0, 0, 0, 1, 0, 0). This vector can also be written by listing only the positions that have a value of 1, so we have X = {4, 8}. Definition 2: Suppose S is the training set. Because this matrix is sparse, it can be written by storing only the features that have a value of 1. All proposed algorithms use the distance from sample X to every sample in the training set. Table 8 shows the Hamming distance between each sample of the training set and the sample X. Given these definitions, we now apply our methods to the defined samples. Applying the FNN algorithm: In Table 8, the first nearest sample to X, which the FNN algorithm selects, is S2. Since the label of S2 is 0, the label 0 is assigned to X. Applying the ANN algorithm: The ANN algorithm selects all of the nearest samples. In this example, S2, S4, S7, and S8 are selected according to Table 8 (i.e., the four vectors with the lowest distance, 2). By voting among the labels of these samples, the label 1 is assigned to X. 
Applying the WANN algorithm: In the WANN algorithm, we first count the features in the training samples to find the weight vector w (see Table 9, which lists the weight of each feature). We then compute the weight of each sample, which is the total weight of the features whose value is 1 in that sample. The weight of sample X is 6 and, as Table 10 shows, samples S2, S3, S5, and S7 are similar to X; voting among them assigns the label 1 to X. Applying the KMNN algorithm: In the KMNN method, we first select the samples nearest to X as in the ANN method, namely S2, S4, S7, and S8. We then create two clusters by placing similar samples in the same cluster, where the similarity measure is the distance between the samples in each cluster. For this purpose, we build the matrix I of distances between these samples. Each entry of I is the distance between two samples, obtained by comparing the corresponding vectors element by element. In matrix I, the distance between S2 and S8 is the smallest, so we place them in one cluster. Similarly, S4 and S7 are near each other, and we place them in another cluster. In each cluster, we then select the sample with the minimum distance from the other samples as the cluster head (CH). In this example, each cluster has only two samples, so either sample can serve as the CH. Hence, we define S2 as the CH of the first cluster and S4 as the CH of the second cluster. Then, we compute the total distance d of all the samples from the two CHs (see Table 11). In the last step, we discard the k percent most distant samples and vote among the remaining ones. In the proposed method we use k = 10, but for clarity in this example we set k = 25, so that only the last (most distant) sample is left out. 
After that, we vote among the rest of the samples. The vote yields the label 1 for the sample X. Rahim Taheri received his B.Sc. degree in computer engineering from Bahonar Technical College of Shiraz and his M.Sc. degree in computer networks from the Shiraz University of Technology in 2007 and 2015, respectively. He is now a Ph.D. candidate in computer networks at the Shiraz University of Technology. In February 2018, he joined the SPRITZ Security & Privacy Research Group at the University of Padua as a visiting Ph.D. student. His main research interests include machine learning, data mining, network security, and heuristic algorithms, and in particular adversarial machine learning and deep learning as a new trend in computer security. Meysam Ghahramani received his B.Sc. degree in mathematics and its applications in 2014. He won first rank in the ACM programming competitions of the university in 2013. He was admitted to postgraduate study in the field of cryptography and, in 2016, graduated with first rank and received the distinguished university student award. Mr. Ghahramani is currently a Ph.D. student in the Department of Computer Engineering and Information Technology at the Shiraz University of Technology. His primary fields of interest are post-quantum cryptography, cryptographic protocol analysis, applied mathematics, and information security. Reza Javidan received his M.Sc. degree in Computer Engineering (Machine Intelligence and Robotics) from Shiraz University in 1996 and his Ph.D. degree in Computer Engineering (Artificial Intelligence) from Shiraz University in 2007. Dr. Javidan has many publications in international conferences and journals on image processing, Underwater Wireless Sensor Networks (UWSNs), and Software Defined Networks (SDNs). 
His major fields of interest are network security, Underwater Wireless Sensor Networks (UWSNs), Software Defined Networks (SDNs), Internet of Things, artificial intelligence, image processing, and SONAR systems. Dr. Javidan is an associate professor in the Department of Computer Engineering and Information Technology at the Shiraz University of Technology. Mohammad Shojafar is a Marie Curie Fellow, Intel Innovator, and Senior Researcher in the SPRITZ Security and Privacy Research Group at the University of Padua, Italy, since January 2018. He was a CNIT Senior Researcher at the University of Rome Tor Vergata, where he contributed to the European H2020 “SUPERFLUIDITY” project. Mohammad is the principal investigator of the PRISENODE project, a 275,000 euro Horizon 2020 Marie Curie project in the areas of network security, Fog computing, and resource scheduling, carried out in collaboration between the University of Padua and the University of Melbourne. He was also a principal investigator on an Italian SDN security and privacy project (60,000 euro) supported by the University of Padua in 2018. He has also contributed to Italian telecommunications projects such as GAUChO - A Green Adaptive Fog Computing and Networking Architecture (400,000 euro), S2C: Secure, Software-defined Cloud (30,000 euro), and SAMMClouds-Secure and Adaptive Management of Multi-Clouds (30,000 euro), in collaboration among Italian universities. He received his Ph.D. degree from Sapienza University of Rome, Italy, in 2016 with an “Excellent” degree. His main research interest is in the area of networks and network security and privacy. In this area, he has published more than 100 papers in top international peer-reviewed journals and conferences, e.g., IEEE TCC, IEEE TNSM, IEEE TGCN, and IEEE ICC/GLOBECOM (h-index=26, 2.5k+ citations). He is an Associate Editor for IEEE Transactions on Consumer Electronics, IET Communication, Cluster Computing, and Ad Hoc & Sensor Wireless Networks Journals. He is a Senior Member of the IEEE. 
For additional information: http://mshojafar.com . Zahra Pooranian is currently a Postdoc in the SPRITZ Security and Privacy Research Group at the University of Padua, Italy, a position she has held since April 2017. She received her Ph.D. degree in Computer Science from Sapienza University of Rome, Italy, in February 2017. She is a (co)author of several peer-reviewed publications (h-index=16, citations=700+) in well-known conferences and journals. She is an Editor of KSII Transactions on Internet and Information Systems and Future Internet. Her current research focuses on Machine Learning, Smart Grid, and Cloud/Fog Computing. She worked as a programmer in several companies in Iran from 2009 to 2014. She is a member of the IEEE. For additional information: https://www.math.unipd.it/~zahra/ . Mauro Conti received his M.Sc. and his Ph.D. in Computer Science from Sapienza University of Rome, Italy, in 2005 and 2009, respectively. He has been a Visiting Researcher at GMU (2008, 2016), UCLA (2010), UCI (2012, 2013, 2014), TU Darmstadt (2013), UF (2015), and FIU (2015, 2016). He became Associate Professor in 2015 and Full Professor in 2018. He has been awarded a Marie Curie Fellowship (2012) by the European Commission and a Fellowship by the German DAAD (2013). His main research interest is in the area of security and privacy. In this area, he has published more than 300 papers in top international peer-reviewed journals and conferences. He is an Associate Editor for several journals, including IEEE Communications Surveys & Tutorials, IEEE Transactions on Network and Service Management, and IEEE Transactions on Information Forensics and Security. He is a Senior Member of the IEEE. For additional information: http://www.math.unipd.it/~conti/ . We conduct our experiments on three datasets, which are explained below: • Drebin dataset: The Drebin dataset is a collection of Android samples that can be used directly. 
The Drebin dataset includes 118,505 applications/samples from various Android sources [26]. • Genome dataset: The Genome project is supported by the National Science Foundation (NSF) of the United States. From August 2010 to October 2011, the authors collected about 1,200 samples of Android malware from different categories as the Genome dataset [45]. • Contagio dataset: It consists of 11,960 mobile malware samples and 16,800 benign samples [46]. Publisher Copyright: © 2019 Elsevier B.V.
PY - 2020/4/1
Y1 - 2020/4/1
AB - In this paper, we develop four malware detection methods that use Hamming distance to measure the similarity between samples: first nearest neighbors (FNN), all nearest neighbors (ANN), weighted all nearest neighbors (WANN), and k-medoid based nearest neighbors (KMNN). The proposed methods raise an alarm when an Android app is detected as malicious, which helps to prevent detected malware from spreading on a broader scale. We provide a detailed description of the proposed detection methods and the related algorithms, together with an extensive analysis of the suitability of these similarity-based methods. We run our experiments on three datasets that contain benign and malicious Android apps: Drebin, Contagio, and Genome. To corroborate the effectiveness of our classifiers, we compare their performance with state-of-the-art classification and malware detection algorithms, namely the Mixed and Separated solutions, the program dissimilarity measure based on entropy (PDME), and the FalDroid algorithms. We evaluate three types of features on these three datasets: API, intent, and permission features. The results confirm that the accuracy rates of the proposed algorithms exceed 90% and in some cases (i.e., when considering API features) exceed 99%, which is comparable with existing state-of-the-art solutions.
KW - Android
KW - Clustering
KW - Hamming distance
KW - K-nearest neighbor (KNN)
KW - Malware detection
KW - Static analysis
UR - http://www.scopus.com/inward/record.url?scp=85075977961&partnerID=8YFLogxK
U2 - 10.1016/j.future.2019.11.034
DO - 10.1016/j.future.2019.11.034
M3 - Article
AN - SCOPUS:85075977961
SN - 0167-739X
VL - 105
SP - 230
EP - 247
JO - Future Generation Computer Systems
JF - Future Generation Computer Systems
ER -
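The appendix note in the record above stores each binary sample as the positions of its 1-valued features (e.g., X = {4, 8}) and, in its KMNN walk-through, discards the k percent of candidates that lie farthest from the two cluster heads before voting. The Python sketch below illustrates both ideas under stated assumptions: the candidate samples, labels, and clusters are hypothetical, the clusters are passed in rather than computed, the number of discarded samples is rounded down, and tied votes go to 1. It is a loose illustration of the described steps, not the authors' implementation.

```python
from collections import Counter

def sparse_hamming(a, b):
    """Hamming distance between binary vectors stored as sets of 1-positions.

    Two vectors differ exactly at the positions set in one but not the other,
    so the distance is the size of the symmetric difference.
    """
    return len(set(a) ^ set(b))

def medoid(cluster):
    """Cluster member with the smallest total distance to the other members."""
    return min(cluster, key=lambda s: sum(sparse_hamming(s, t) for t in cluster))

def trimmed_vote(samples, labels, clusters, k_percent):
    """Drop the k percent of samples farthest from the cluster heads, then vote."""
    heads = [medoid(c) for c in clusters]
    total = [sum(sparse_hamming(s, h) for h in heads) for s in samples]
    n_drop = len(samples) * k_percent // 100  # assumption: round down
    order = sorted(range(len(samples)), key=lambda i: total[i])  # stable sort
    kept = order[:len(samples) - n_drop]
    votes = Counter(labels[i] for i in kept)
    # Assumption: a tied vote is resolved toward 1 (malware).
    return 1 if votes[1] >= votes[0] else 0

# Hypothetical candidates, each at Hamming distance 2 from X = {4, 8}.
x = frozenset({4, 8})
s_a, s_b = frozenset({4, 9}), frozenset({4, 10})
s_c, s_d = frozenset({1, 2, 4, 8}), frozenset({1, 3, 4, 8})
samples = [s_a, s_b, s_c, s_d]
labels = [0, 1, 1, 0]
clusters = [[s_a, s_b], [s_c, s_d]]

print(sparse_hamming(x, s_a))                       # 2
print(trimmed_vote(samples, labels, clusters, 25))  # 1
```

With four candidates and k = 25, exactly one sample is dropped, which mirrors the note's choice of k = 25 for its four-sample example.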