INFOCOMP 2013, The Third International Conference on Advanced Communications and Computation
FETOL: A Divide-and-Conquer Based Approach for Resilient HPC Applications
Authors:
Wassim Abu Abed
Kostyantyn Kucher
Manfred Krafczyk
Markus Wittmann
Thomas Zeiser
Gerhard Wellein
Keywords: hpc; resilience; fault-tolerance; divide-and-conquer
Abstract:
The inevitable increase in the frequency of hard and soft faults in current and future high performance computing systems motivates the need for integrated approaches to improve the resilience of such systems. This paper presents FETOL, a framework for a fault-tolerant environment that implements an approach to a coordinated resilience solution. FETOL is a software solution built on a Divide-and-Conquer strategy, and it will offer comprehensive methods at the middleware and application levels to deal with various failure scenarios.
Pages: 7 to 12
Copyright: Copyright (c) IARIA, 2013
Publication date: November 17, 2013
Published in: conference
ISSN: 2308-3484
ISBN: 978-1-61208-310-0
Location: Lisbon, Portugal
Dates: from November 17, 2013 to November 21, 2013