CYBER 2022, The Seventh International Conference on Cyber-Technologies and Cyber-Systems


How Good is Openly Available Code Snippets Containing Software Vulnerabilities to Train Machine Learning Algorithms?

Authors:
Kaan Oguzhan
Tiago Espinha Gasiba
Akram Louati

Keywords: machine learning; deep learning; industry; software; vulnerabilities

Abstract:
Machine learning has attracted growing attention in recent years. One recent application area is secure software development, where it is used to identify software vulnerabilities. The performance of these algorithms depends on the amount and quality of the data used for training. Although many studies on machine learning algorithms are emerging, one must also question the data used to train them. This paper does so by investigating and analyzing freely available vulnerable code snippets. We investigate their quantity and quality in terms of the existing categorization of security vulnerabilities used in industrial environments. Furthermore, we examine how these aspects vary across several programming languages. In addition, we provide the database containing the collected vulnerable code snippets for further research. Our results show that, while a large amount of training data is available for some programming languages, this is not the case for every language. Our results can be used by researchers and industry practitioners who work on machine learning and apply these algorithms to improve software security.

Pages: 25 to 33

Copyright: Copyright (c) IARIA, 2022

Publication date: November 13, 2022

Published in: conference

ISSN: 2519-8599

ISBN: 978-1-61208-996-6

Location: Valencia, Spain

Dates: from November 13, 2022 to November 17, 2022