ADVCOMP 2024, The Eighteenth International Conference on Advanced Engineering Computing and Applications in Sciences
Comparing Fault-tolerance in Kubernetes and Slurm in HPC Infrastructure
Authors:
Mirac Aydin
Michael Bidollahkhani
Julian Kunkel
Keywords: fault-tolerance; Kubernetes; Slurm; High-Performance Computing; HPC; resilience; comparative analysis; literature review; case studies.
Abstract:
In this paper, we explore the fault-tolerance mechanisms in Kubernetes and Slurm within High-Performance Computing (HPC) infrastructures. As computational workloads and data requirements continue to expand, ensuring reliable and resilient HPC systems becomes increasingly critical. Our study examines the strategies employed by Kubernetes and Slurm to handle failures, maintain system stability, and provide continuous service. We present a comparative analysis, highlighting the strengths and limitations of each system in various failure scenarios. We review and synthesize findings from existing literature and case studies to infer the effectiveness of these fault-tolerance mechanisms. Through this comprehensive review, we provide insights into the current state of fault-tolerance in Kubernetes and Slurm and propose recommendations for enhancing resilience in HPC environments.
Pages: 40 to 48
Copyright: Copyright (c) IARIA, 2024
Publication date: September 29, 2024
Published in: conference
ISSN: 2308-4499
ISBN: 978-1-68558-184-8
Location: Venice, Italy
Dates: from September 29, 2024 to October 3, 2024