International Journal On Advances in Security, volume 16, numbers 3 and 4, 2023
Malware Self-Supervised Graph Contrastive Learning with Data Augmentation
Authors:
Yun Gao
Hirokazu Hasegawa
Yukiko Yamaguchi
Hajime Shimada
Keywords: malware classification; graph contrastive learning; data augmentation; self-supervised learning.
Abstract:
Traditional malware detection methods struggle to keep pace with the massive volume of newly created malware. Machine learning, which works from the features of samples, is a promising approach to detecting and classifying newly created malware at scale, and current research increasingly uses it to learn new malware rapidly and accurately. In this paper, we propose a malware classification framework based on Graph Contrastive Learning (GraphCL) with data augmentation. We first extract the Control-Flow Graph (CFG) from Portable Executable (PE) files and, at the same time, generate node feature vectors from the disassembly code of each basic block using MiniLM, a large-scale pre-trained language model. Four different data augmentation methods are then used to expand the graph data, and the GraphCL model generates the final graph representations. These representations can be applied directly to downstream tasks. For our classification task, we use C-Support Vector Classification (SVC) as the classification model. To evaluate our approach, we built a CFG-based malware classification dataset from the PE files of the BODMAS Malware Dataset, which we call the Malware Geometric Multi-Class Dataset (MGD-MULTI), and collected results on it. In addition, we added a new public malware graph dataset called MALNET-TINY. We also employed Local Degree Profile (LDP) to generate node representations for the graph data. The evaluation results on the two malware graph datasets show that our proposal has great potential. According to our experimental evaluation, the self-supervised learning method can achieve superior results, even surpassing the supervised learning method.
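As an illustration of the Local Degree Profile (LDP) node representation mentioned in the abstract, the following minimal sketch computes the commonly used five-value profile for each node: its own degree, followed by the minimum, maximum, mean, and standard deviation of its neighbors' degrees. It is not the authors' implementation. The function name `local_degree_profile`, the treatment of the graph as undirected, the use of the population standard deviation, and the toy edge list are all assumptions made for this example.

```python
from statistics import mean, pstdev


def local_degree_profile(num_nodes, edges):
    """Return one 5-value LDP feature row per node.

    Assumptions (not taken from the paper): edges are treated as
    undirected, self-loops are ignored, and the spread of neighbor
    degrees is the population standard deviation.
    """
    neighbors = [set() for _ in range(num_nodes)]
    for u, v in edges:
        if u != v:  # skip self-loops in this sketch
            neighbors[u].add(v)
            neighbors[v].add(u)

    degree = [len(n) for n in neighbors]

    features = []
    for node in range(num_nodes):
        nbr_degrees = [degree[n] for n in neighbors[node]]
        if nbr_degrees:
            row = [
                degree[node],
                min(nbr_degrees),
                max(nbr_degrees),
                mean(nbr_degrees),
                pstdev(nbr_degrees),
            ]
        else:
            # Isolated node: no neighbor statistics, so use zeros.
            row = [0, 0, 0, 0, 0]
        features.append([float(x) for x in row])
    return features


# Hypothetical toy graph: a path 0-1-2 plus an isolated node 3.
toy_edges = [(0, 1), (1, 2)]
ldp = local_degree_profile(4, toy_edges)
for node, row in enumerate(ldp):
    print(node, row)
```

Because LDP depends only on graph structure, a profile like this can serve as node features when no per-node content, such as disassembly text, is available.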
Pages: 116 to 125
Copyright: Copyright (c) to authors, 2023. Used with permission.
Publication date: December 30, 2023
Published in: journal
ISSN: 1942-2636