Comparing Fault-tolerance in Kubernetes and Slurm in HPC Infrastructure

Aydin, Mirac; Bidollahkhani, Michael; Kunkel, Julian

ADVCOMP 2024, The Eighteenth International Conference on Advanced Engineering Computing and Applications in Sciences

Keywords: fault-tolerance; Kubernetes; Slurm; High-Performance Computing; HPC; resilience; comparative analysis; literature review; case studies.

Abstract:
In this paper, we explore the fault-tolerance mechanisms in Kubernetes and Slurm within High-Performance Computing (HPC) infrastructures. As computational workloads and data requirements continue to expand, ensuring reliable and resilient HPC systems becomes increasingly critical. Our study examines the strategies employed by Kubernetes and Slurm to handle failures, maintain system stability, and provide continuous service. We present a comparative analysis, highlighting the strengths and limitations of each system in various failure scenarios. We review and synthesize findings from existing literature and case studies to infer the effectiveness of these fault-tolerance mechanisms. Through this comprehensive review, we provide insights into the current state of fault-tolerance in Kubernetes and Slurm and propose recommendations for enhancing resilience in HPC environments.

Pages: 40 to 48

Copyright: Copyright (c) IARIA, 2024

Publication date: September 29, 2024

Published in: conference

ISSN: 2308-4499

ISBN: 978-1-68558-184-8

Location: Venice, Italy

Dates: from September 29, 2024 to October 3, 2024