INFOCOMP 2015, The Fifth International Conference on Advanced Communications and Computation
Towards Data Persistency for Fault-tolerance Using MPI Semantics
Authors:
Jose Gracia
Muhammad Wahaj Sethi
Nico Struckmann
Rainer Keller
Keywords: Message Passing Interface; MPI; fault-tolerance; application-level checkpointing; data persistency
Abstract:
As the size and complexity of high-performance computing hardware and applications increase, the likelihood of a hardware failure during the execution of large distributed applications is no longer negligible. On the other hand, frequent checkpointing of the full application state, or even of full compute node memory, is prohibitively expensive. Thus, application-level checkpointing of only indispensable data and application state is the only viable option to increase an application's resiliency against faults. Existing application-level checkpointing approaches, however, require the user to learn new programming interfaces, among other things. In this paper, we present an approach to persist data and application state, for instance messages transferred between compute nodes, which is seamlessly integrated into the Message Passing Interface (MPI), the de facto standard for distributed parallel computing in high-performance computing. The basic idea is to allow the user to mark a given communicator as having a special, persistent meaning. All communication through this persistent communicator is stored transparently by the system and is available for application restart even after a failure.
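The idea of a persistent communicator can be illustrated with a small toy model. The sketch below is not MPI code and is not the authors' implementation. It assumes a hypothetical `PersistentChannel` class that appends every sent message to a log file before delivering it, and a hypothetical `restart` method that rebuilds the pending messages from that log after the in-memory state has been lost. The rank numbers, tag, and payload are made up for illustration.

```python
import json
import os
import tempfile


class PersistentChannel:
    """Toy stand-in for a communicator marked as persistent (hypothetical)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.inbox = {}  # destination rank -> list of pending message records

    def send(self, source, dest, tag, payload):
        record = {"source": source, "dest": dest, "tag": tag, "payload": payload}
        # Write the message to stable storage before delivering it,
        # so that it survives a failure of the running process.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.inbox.setdefault(dest, []).append(record)

    def recv(self, dest):
        # Return the oldest pending message for this destination rank.
        return self.inbox[dest].pop(0)

    @classmethod
    def restart(cls, log_path):
        # Rebuild the pending messages from the log after a failure.
        channel = cls(log_path)
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    channel.inbox.setdefault(record["dest"], []).append(record)
        return channel


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "comm.log")
        comm = PersistentChannel(path)
        comm.send(0, 1, 7, {"step": 3, "residual": 0.25})
        del comm  # simulate a failure: the in-memory state is gone
        recovered = PersistentChannel.restart(path)
        return recovered.recv(1)["payload"]


if __name__ == "__main__":
    print(demo())
```

In this simplification, a restart replays every logged message, including messages that had already been received before the failure. A real system would also have to decide which messages are still needed, which the abstract does not specify.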
Pages: 26 to 29
Copyright: Copyright (c) IARIA, 2015
Publication date: June 21, 2015
Published in: conference
ISSN: 2308-3484
ISBN: 978-1-61208-416-9
Location: Brussels, Belgium
Dates: from June 21, 2015 to June 26, 2015