INFOCOMP 2013, The Third International Conference on Advanced Communications and Computation


FETOL: A Divide-and-Conquer Based Approach for Resilient HPC Applications

Authors:
Wassim Abu Abed
Kostyantyn Kucher
Manfred Krafczyk
Markus Wittmann
Thomas Zeiser
Gerhard Wellein

Keywords: hpc; resilience; fault-tolerance; divide-and-conquer

Abstract:
The inevitable increase in the frequency of hard and soft faults in current and future high performance computing systems motivates the need for integrated approaches to improve the resilience of such systems. This paper presents FETOL, a framework for a fault-tolerant environment that implements an approach to achieving a coordinated resilience solution. FETOL is a software solution that exploits a Divide-and-Conquer strategy and will offer comprehensive methods at the middleware and application levels to deal with various failure scenarios.

Pages: 7 to 12

Copyright: Copyright (c) IARIA, 2013

Publication date: November 17, 2013

Published in: conference

ISSN: 2308-3484

ISBN: 978-1-61208-310-0

Location: Lisbon, Portugal

Dates: from November 17, 2013 to November 21, 2013