ICN 2020, The Nineteenth International Conference on Networks


Selective Process Replication for Fault Tolerance in Large-scale, Heterogeneous Environments with Non-Uniform Node Failure Distribution

Authors:
Longhao Li
Taieb Znati
Rami Melhem

Keywords: fault-tolerance; selective replication; cloud computing; heterogeneous environment.

Abstract:
Future systems are scaling to a large number of cores. Consequently, their propensity to fail increases dramatically, making it more challenging for compute-intensive applications to achieve forward progress on a large number of cores. Pure process replication is a widely accepted technique for tolerating fail-stop errors. At extreme scale, however, it cannot achieve fault tolerance efficiently, because it doubles or even triples computational resource usage. In this paper, we propose a selective process replication model that assigns replicas only to failure-prone processes. The model assumes that cores fail independently, but not identically. Simulation results show that, on average, selective replication reduces energy consumption by more than 35 percent and time to completion by more than 25 percent compared to full replication on 1 million cores, 20 percent of which are failure-prone.
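The idea of replicating only failure-prone processes can be illustrated with a minimal sketch. It is not the model from the paper: the function names, the threshold rule for deciding which processes are failure-prone, the per-process failure probabilities, and the assumption that a replica fails with the same probability as its original are all hypothetical, chosen only to show the trade-off between cores used and the chance that no process is lost.

```python
from math import prod


def select_replicated(fail_probs, threshold):
    """Indices of processes treated as failure-prone (hypothetical rule:
    failure probability strictly above the given threshold)."""
    return {i for i, p in enumerate(fail_probs) if p > threshold}


def survival_probability(fail_probs, replicated):
    """Probability that no process is lost, assuming independent failures.

    Assumption: a replicated process is lost only if both copies fail,
    and the replica fails with the same probability p as the original.
    """
    return prod(
        1 - (p * p if i in replicated else p)
        for i, p in enumerate(fail_probs)
    )


def cores_used(fail_probs, replicated):
    """One core per process plus one extra core per replica."""
    return len(fail_probs) + len(replicated)


# Illustrative input: 8 reliable processes and 2 failure-prone ones.
fail_probs = [0.01] * 8 + [0.2] * 2
selective = select_replicated(fail_probs, threshold=0.1)
full = set(range(len(fail_probs)))

for name, chosen in [("none", set()), ("selective", selective), ("full", full)]:
    print(name, cores_used(fail_probs, chosen),
          survival_probability(fail_probs, chosen))
```

In this toy setting, selective replication uses 12 cores instead of the 20 needed for full replication, while still removing most of the failure risk, which comes from the two failure-prone processes.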

Pages: 55 to 60

Copyright: Copyright (c) IARIA, 2020

Publication date: February 23, 2020

Published in: conference

ISSN: 2308-4413

ISBN: 978-1-61208-770-2

Location: Lisbon, Portugal

Dates: from February 23, 2020 to February 27, 2020