Unsupervised Graph Contrastive Learning with Data Augmentation for Malware Classification

Gao, Yun; Hasegawa, Hirokazu; Yamaguchi, Yukiko; Shimada, Hajime

SECURWARE 2022, The Sixteenth International Conference on Emerging Security Information, Systems and Technologies

Authors:
Yun Gao
Hirokazu Hasegawa
Yukiko Yamaguchi
Hajime Shimada

Keywords: malware classification; graph contrastive learning; data augmentation; unsupervised learning

Abstract:
Traditional malware detection methods struggle to keep pace with the massive volume of newly created malware. Machine learning, which works from the features of samples, is a promising approach for detecting and classifying newly created malware at scale, and current research uses it to learn new malware rapidly and accurately. In this paper, we propose a malware classification framework based on Graph Contrastive Learning (GraphCL) with data augmentation. We first extract the Control-Flow Graph (CFG) from Portable Executable (PE) files and simultaneously generate a node feature vector from the disassembly code of each basic block using MiniLM, a large-scale pre-trained language model. Four different data augmentation methods are then used to expand the graph data, and the GraphCL model generates the final graph representations, which can be applied directly to downstream tasks. For our classification task, we use C-Support Vector Classification (SVC) as the classifier. To evaluate our approach, we built a CFG-based malware classification dataset from the PE files of the BODMAS Malware Dataset, which we call the Malware Geometric Multi-Class Dataset (MGD-MULTI). The evaluation results show that our proposal achieved a Micro-F1 score of 0.9975 and a Macro-F1 score of 0.9976. In our experimental evaluation, the unsupervised learning approach outperformed the supervised learning approach in Graph Neural Network-based malware classification.
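
The abstract reports Micro-F1 and Macro-F1 without defining them. The following is a minimal sketch, assuming the standard definitions for single-label multi-class classification; it is not the authors' evaluation code, and the function name `f1_scores` and the toy labels are hypothetical. Micro-F1 pools true positives, false positives, and false negatives over all classes, whereas Macro-F1 averages the per-class F1 scores, so small malware families weigh as much as large ones.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    # Per-class counts of true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts across classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy, imbalanced labels (hypothetical; not drawn from MGD-MULTI).
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "a", "a", "b"]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro={micro:.4f} macro={macro:.4f}")
```

In this toy case the single error falls on the smaller class, so the macro score drops below the micro score. Micro and macro scores that nearly coincide, as in the reported results, are consistent with performance that is similar across classes.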

Pages: 41 to 47

Copyright: Copyright (c) IARIA, 2022

Publication date: October 16, 2022

Published in: conference

ISSN: 2162-2116

ISBN: 978-1-68558-007-0

Location: Lisbon, Portugal

Dates: from October 16, 2022 to October 20, 2022