ADVCOMP 2021, The Fifteenth International Conference on Advanced Engineering Computing and Applications in Sciences
AMPRO-HPCC: A Machine-Learning Tool for Predicting Resources on Slurm HPC Clusters
Authors:
Mohammed Tanash
Daniel Andresen
William Hsu
Keywords: HPC; Scheduling; Supervised Machine Learning; Slurm; Performance.
Abstract:
Determining resource allocations (memory and time) for jobs submitted to High Performance Computing (HPC) systems is challenging, even for computer scientists. HPC users are strongly encouraged to overestimate the resources they request so that their jobs will not be killed for lack of resources. Overestimation occurs because of the wide variety of HPC applications and environment configuration options, and because users lack knowledge of the complex structure of HPC systems. It wastes HPC resources, decreases system utilization, and increases waiting and turnaround times for submitted jobs. In this paper, we introduce our first implementation of a fully offline, fully automated, stand-alone, open-source Machine Learning (ML) tool that helps users predict the memory and time requirements of the jobs they submit to a cluster. The tool implements six discriminative ML models from the scikit-learn and Microsoft LightGBM libraries, applied to historical accounting data (sacct data) from the Simple Linux Utility for Resource Management (Slurm). We tested the tool on historical sacct data from the HPC resources of Kansas State University (Beocat), covering January 2019 through March 2021 and containing around 17.6 million jobs. Our results show that the tool achieves high predictive accuracy, measured by R² (0.72 using LightGBM for predicting memory and 0.74 using Random Forest for predicting time), helps dramatically reduce average waiting and turnaround times for submitted jobs, and increases utilization of HPC resources. As a result, the tool also decreases the power consumption of the HPC resources.
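For readers unfamiliar with the R² (coefficient of determination) metric reported in the abstract, the following is a minimal sketch of how such a score is computed from actual and predicted values. It is not the authors' code: the function name `r2_score` mirrors the scikit-learn metric of the same name but is re-implemented here in plain Python, and the memory figures are hypothetical values chosen only for illustration.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left over after the model's prediction.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical peak memory usage (GB) of five jobs and a model's predictions.
# These numbers are illustrative assumptions, not data from the paper.
actual_mem_gb = [2.0, 4.0, 6.0, 8.0, 10.0]
predicted_mem_gb = [3.0, 4.0, 5.0, 9.0, 10.0]

score = r2_score(actual_mem_gb, predicted_mem_gb)
print(f"R^2 = {score:.3f}")  # prints "R^2 = 0.925"
```

A score of 1.0 means the predictions match the actual values exactly, while a score of 0 means the model does no better than always predicting the mean of the actual values.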
Pages: 20 to 27
Copyright: Copyright (c) IARIA, 2021
Publication date: October 3, 2021
Published in: conference
ISSN: 2308-4499
ISBN: 978-1-61208-887-7
Location: Barcelona, Spain
Dates: from October 3, 2021 to October 7, 2021