PESARO 2013, The Third International Conference on Performance, Safety and Robustness in Complex Systems and Applications


Algorithm-Based Master-Worker Model of Fault Tolerance in Time-Evolving Applications

Authors:
Md Mohsin Ali
Peter Strazdins

Keywords: fault tolerance; MPI; FT-MPI; process failure

Abstract:
To make High Performance Computing (HPC) applications fault-tolerant, many application developers are investigating Algorithm-Based Fault Tolerance (ABFT) techniques to improve the efficiency of application recovery beyond what existing checkpoint/restart techniques alone can provide. Unfortunately, the Message Passing Interface (MPI), the standard library used to implement this type of application, does not have standardized fault tolerance semantics. This paper presents how the fault tolerance semantics of Fault-Tolerant MPI (FT-MPI) can be used as part of ABFT to design and implement a fault-tolerant algorithm for time-evolving applications that can survive process failures. The presented technique follows a master-worker scheme that can tolerate the failure of all worker processes. As an example of a time-evolving application, we consider the upwind scheme for solving the one-dimensional advection equation. We focus on communication-level issues, data prevention techniques, and time-evolving control issues. This paper also highlights a common set of issues, including failure detection, failed process recovery, and duplicate message handling. This contribution will help application developers resolve design and implementation issues of fault-tolerant algorithms for more complex time-evolving applications.
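
The abstract names the upwind scheme for the one-dimensional advection equation but does not state the discretization. For reference, the textbook first-order upwind update for u_t + a u_x = 0 with a > 0 is u_i^{n+1} = u_i^n - c (u_i^n - u_{i-1}^n), where c = a*dt/dx. The sketch below is a minimal serial illustration of that update only. The function names, the periodic boundary, and the positive advection speed are assumptions made for the example; it is not the authors' implementation and shows none of the master-worker or FT-MPI recovery logic.

```python
def upwind_step(u, courant):
    """Advance u by one time step of u_t + a*u_x = 0 with a > 0.

    courant = a * dt / dx; the scheme is stable for 0 <= courant <= 1.
    A periodic boundary is assumed for this illustration.
    """
    n = len(u)
    # For i = 0, u[i - 1] is u[-1], i.e. the last cell (periodic wrap-around).
    return [u[i] - courant * (u[i] - u[i - 1]) for i in range(n)]


def advect(u0, courant, steps):
    """Apply `steps` upwind updates to the initial field u0 (hypothetical helper)."""
    u = list(u0)
    for _ in range(steps):
        u = upwind_step(u, courant)
    return u
```

With a Courant number of exactly 1 the update reduces to a shift by one cell per step, which gives a simple sanity check. Each new time level depends only on the previous one, which is the time-evolving structure the abstract refers to.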

Pages: 40 to 47

Copyright: Copyright (c) IARIA, 2013

Publication date: April 21, 2013

Published in: conference

ISSN: 2308-3700

ISBN: 978-1-61208-268-4

Location: Venice, Italy

Dates: from April 21, 2013 to April 26, 2013