International Journal On Advances in Systems and Measurements, volume 9, numbers 1 and 2, 2016
Data Persistency for Fault-Tolerance Using MPI Semantics
Authors:
Jose Gracia
Nico Struckmann
Julian Rilli
Rainer Keller
Keywords: Message Passing Interface; MPI; fault-tolerance; application-level checkpointing; data persistency
Abstract:
As the size and complexity of high-performance computing hardware and applications increase, the likelihood of a hardware failure during the execution of a large distributed application is no longer negligible. At the same time, frequent checkpointing of the full application state, or even of full compute node memory, is prohibitively expensive. Thus, application-level checkpointing of only indispensable data and application state is the only viable option to increase an application's resiliency against faults. Existing application-level checkpointing approaches, however, require the user to learn new programming interfaces, among other things. In this paper, we present an approach to persist data and application state, for instance messages transferred between compute nodes, which is seamlessly integrated into the Message Passing Interface (MPI), the de facto standard for distributed parallel computing in high-performance computing. The basic idea is to allow the user to mark a given communicator as having a special, persistent meaning. All communication through this persistent communicator is stored transparently by the system and remains available for application restart even after a failure. The concept is demonstrated by a prototypical implementation of the proposed interface.
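The idea of a persistent communicator can be illustrated with a small, MPI-free toy model. The sketch below is not the interface proposed in the paper. The class `PersistentChannel`, its methods `send` and `recover`, and the log file format are all hypothetical names chosen for illustration. It assumes that "persisting" a message means appending it to a file on stable storage before the send returns, so that a restarted process can read it back.

```python
import json
import os
import tempfile


class PersistentChannel:
    """Toy stand-in for a communicator marked as persistent (hypothetical, not MPI)."""

    def __init__(self, log_path):
        self.log_path = log_path

    def send(self, dest, tag, payload):
        # Append the message to the log and force it to disk before returning,
        # so that it survives a crash of the sending process.
        record = {"dest": dest, "tag": tag, "payload": payload}
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())

    def recover(self, dest, tag):
        # After a restart, return every payload previously sent to (dest, tag),
        # in the order in which it was sent.
        if not os.path.exists(self.log_path):
            return []
        with open(self.log_path, encoding="utf-8") as log:
            records = [json.loads(line) for line in log if line.strip()]
        return [r["payload"] for r in records
                if r["dest"] == dest and r["tag"] == tag]


def demo():
    # Hypothetical scenario: an application persists its solver state twice,
    # the process is lost, and a new process picks up the latest state.
    path = os.path.join(tempfile.mkdtemp(), "persistent_comm.log")

    before_failure = PersistentChannel(path)
    before_failure.send(dest=0, tag=7, payload={"step": 10, "residual": 0.5})
    before_failure.send(dest=0, tag=7, payload={"step": 20, "residual": 0.25})
    del before_failure  # simulate losing the process

    after_restart = PersistentChannel(path)
    history = after_restart.recover(dest=0, tag=7)
    return history[-1]  # most recently persisted state
```

Calling `demo()` returns the last state written before the simulated failure. In a real MPI setting, the logging would presumably happen inside the library, so that ordinary send calls on the marked communicator need no extra application code.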
Pages: 58 to 65
Copyright: Copyright (c) to authors, 2016. Used with permission.
Publication date: June 30, 2016
Published in: journal
ISSN: 1942-261X