INFOCOMP 2015, The Fifth International Conference on Advanced Communications and Computation


Towards Data Persistency for Fault-tolerance Using MPI Semantics

Authors:
Jose Gracia
Muhammad Wahaj Sethi
Nico Struckmann
Rainer Keller

Keywords: Message Passing Interface; MPI; fault-tolerance; application-level checkpointing; data persistency

Abstract:
As the size and complexity of high-performance computing hardware and applications increase, the likelihood of a hardware failure during the execution of large distributed applications is no longer negligible. On the other hand, frequent checkpointing of the full application state, or even of full compute-node memory, is prohibitively expensive. Thus, application-level checkpointing of only indispensable data and application state is the only viable option to increase an application's resiliency against faults. Existing application-level checkpointing approaches, however, require the user, for example, to learn new programming interfaces. In this paper, we present an approach to persist data and application state, for instance messages transferred between compute nodes, that is seamlessly integrated into the Message Passing Interface (MPI), the de facto standard for distributed parallel computing in high-performance computing. The basic idea is to allow the user to mark a given communicator as having a special, persistent meaning. All communication through this persistent communicator is stored transparently by the system and remains available for application restart even after a failure.
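
The persistent-communicator idea from the abstract can be illustrated with a toy model. The sketch below is not MPI and not the authors' implementation: `ToyCommunicator`, its `log_dir` parameter, and the `replay` method are hypothetical names invented here. It assumes a single process and a JSON log on local disk, and shows only the principle that messages sent through a communicator marked as persistent are written to stable storage and can be read back after a restart.

```python
import json
import os
import tempfile


class ToyCommunicator:
    """Toy in-process stand-in for a communicator (hypothetical, not an MPI API)."""

    def __init__(self, name, log_dir=None):
        self.name = name
        # Assumption: a communicator counts as "persistent" only if it is
        # given a log directory; otherwise it behaves like a plain one.
        self.log_path = os.path.join(log_dir, name + ".log") if log_dir else None
        self.mailboxes = {}  # destination rank -> list of pending payloads

    def send(self, src, dest, payload):
        if self.log_path is not None:
            # Persist the message before delivery, one JSON record per line.
            with open(self.log_path, "a", encoding="utf-8") as log:
                record = {"src": src, "dest": dest, "payload": payload}
                log.write(json.dumps(record) + "\n")
                log.flush()
                os.fsync(log.fileno())
        self.mailboxes.setdefault(dest, []).append(payload)

    def recv(self, dest):
        # Deliver the oldest pending message for this destination rank.
        return self.mailboxes[dest].pop(0)

    def replay(self):
        """Return every message logged so far, in send order."""
        if self.log_path is None or not os.path.exists(self.log_path):
            return []
        with open(self.log_path, encoding="utf-8") as log:
            return [json.loads(line) for line in log if line.strip()]


def demo():
    with tempfile.TemporaryDirectory() as log_dir:
        comm = ToyCommunicator("persistent_comm", log_dir=log_dir)
        comm.send(src=0, dest=1, payload={"step": 10, "residual": 0.5})
        comm.send(src=0, dest=1, payload={"step": 20, "residual": 0.25})
        del comm  # simulate a failure: the in-memory mailboxes are lost

        # After the "restart", a new object with the same name finds the log.
        restarted = ToyCommunicator("persistent_comm", log_dir=log_dir)
        return restarted.replay()
```

Calling `demo()` returns the two logged records even though the original object was discarded, whereas a communicator created without `log_dir` keeps nothing across a restart. In this sketch the application code calls the same `send` either way, which is meant to mirror the transparency described in the abstract.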

Pages: 26 to 29

Copyright: Copyright (c) IARIA, 2015

Publication date: June 21, 2015

Published in: conference

ISSN: 2308-3484

ISBN: 978-1-61208-416-9

Location: Brussels, Belgium

Dates: from June 21, 2015 to June 26, 2015