PESARO 2013, The Third International Conference on Performance, Safety and Robustness in Complex Systems and Applications
Algorithm-Based Master-Worker Model of Fault Tolerance in Time-Evolving Applications
Authors:
Md Mohsin Ali
Peter Strazdins
Keywords: fault tolerance; MPI; FT-MPI; process failure
Abstract:
To make High Performance Computing (HPC) applications fault-tolerant, many application developers are investigating Algorithm-Based Fault Tolerance (ABFT) techniques to improve the efficiency of application recovery beyond what existing checkpoint/restart techniques alone can provide. Unfortunately, the Message Passing Interface (MPI), the standard library used for implementing this type of application, does not have standardized fault tolerance semantics. This paper presents how the fault tolerance semantics of Fault-Tolerant MPI (FT-MPI) can be used as part of ABFT to design and implement a fault-tolerant algorithm for time-evolving applications that can survive process failures. The presented technique follows a master-worker scheme that can tolerate the failure of all worker processes. As an example of a time-evolving application, we consider the upwind scheme for solving the one-dimensional advection equation. We focus on communication-level issues, data loss prevention techniques, and time-evolving control issues. The paper also highlights a common set of issues, such as failure detection, failed-process recovery, and duplicate message handling. This contribution will help application developers resolve design and implementation issues in fault-tolerant algorithms for more complex time-evolving applications.
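To make the example application concrete, the following is a minimal sketch, not the authors' implementation. It assumes the first-order upwind update u_i^{n+1} = u_i^n - c (u_i^n - u_{i-1}^n) with Courant number c = a*dt/dx (stable for 0 <= c <= 1), positive advection speed, and a periodic boundary. The master-worker part is simulated in a single process without MPI or FT-MPI: the names `worker_update`, `master_step`, and `WorkerFailure`, and the idea of the master re-issuing a chunk from its own retained copy, are hypothetical illustrations of the general scheme described in the abstract.

```python
class WorkerFailure(Exception):
    """Hypothetical stand-in for a detected worker process failure."""


def upwind_step(u, c):
    """One serial upwind step for u_t + a*u_x = 0 (a > 0), periodic boundary.

    c is the Courant number a*dt/dx; u[i - 1] wraps around at i == 0.
    """
    return [u[i] - c * (u[i] - u[i - 1]) for i in range(len(u))]


def worker_update(chunk, left_halo, c, fail=False):
    """Update one sub-domain, given the value just left of it (the halo)."""
    if fail:
        raise WorkerFailure("simulated worker crash")
    out = []
    prev = left_halo
    for v in chunk:
        out.append(v - c * (v - prev))
        prev = v
    return out


def master_step(u, c, n_workers, failing=()):
    """Advance one time step, tolerating failures of the listed workers.

    The master holds the full field u from the previous step, so a failed
    worker's chunk can be recomputed by a replacement from that copy.
    """
    n = len(u)
    size = n // n_workers
    new_u = []
    for w in range(n_workers):
        lo = w * size
        hi = n if w == n_workers - 1 else lo + size
        chunk, halo = u[lo:hi], u[lo - 1]  # u[-1] gives the periodic halo
        try:
            part = worker_update(chunk, halo, c, fail=(w in failing))
        except WorkerFailure:
            # Assumed recovery policy: re-issue the same work to a fresh worker.
            part = worker_update(chunk, halo, c)
        new_u.extend(part)
    return new_u


# Usage: a square pulse advected for 5 steps, with one worker failing per step.
u0 = [1.0 if 2 <= i < 5 else 0.0 for i in range(12)]
serial, tolerant = u0, u0
for step in range(5):
    serial = upwind_step(serial, 0.5)
    tolerant = master_step(tolerant, 0.5, n_workers=3, failing={step % 3})
```

In this sketch the fault-tolerant run reproduces the serial result because the master never discards the previous time step until every chunk of the next one has been received. A real MPI-based version would additionally have to detect failures, respawn processes, and discard duplicate messages, which are the issues the paper addresses.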
Pages: 40 to 47
Copyright: Copyright (c) IARIA, 2013
Publication date: April 21, 2013
Published in: conference
ISSN: 2308-3700
ISBN: 978-1-61208-268-4
Location: Venice, Italy
Dates: from April 21, 2013 to April 26, 2013