ICN 2020, The Nineteenth International Conference on Networks
Authors:
Longhao Li
Taieb Znati
Rami Melhem
Keywords: fault-tolerance; selective replication; cloud computing; heterogeneous environment.
Abstract:
Future systems are scaling to a large number of cores. Consequently, their propensity to failure increases dramatically, making it more challenging for compute-intensive applications to achieve forward progress. Pure process replication is a widely accepted technique for tolerating fail-stop errors. At extreme scale, however, it cannot provide fault tolerance efficiently, because it doubles or even triples computational resource usage. In this paper, we propose a selective process replication model that assigns replicas only to failure-prone processes. The model assumes that cores fail independently but not identically. Simulation results show that, on average, selective replication reduces energy consumption by more than 35 percent and time to completion by more than 25 percent compared with full replication on 1 million cores, 20 percent of which are failure-prone.
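The resource argument in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not the authors' model: the function names, the single-replica assumption, the one-process-per-core mapping, and the assumption that a replica fails with the same probability as its primary are all hypothetical choices made here for illustration only.

```python
def cores_needed(n_processes, prone_fraction, replicas_per_process=1):
    """Return (full, selective) core counts.

    Assumption (not from the paper): each process and each replica
    occupies exactly one core.
    """
    prone = round(n_processes * prone_fraction)
    # Full replication: every process gets replicas.
    full = n_processes * (1 + replicas_per_process)
    # Selective replication: only failure-prone processes get replicas.
    selective = n_processes + prone * replicas_per_process
    return full, selective


def survival_probability(failure_probs, replicated):
    """Probability that no process is lost, given independent,
    non-identical per-process failure probabilities.

    Assumption (not from the paper): a replicated process is lost only
    if both copies fail, and the replica has the same failure
    probability as its primary.
    """
    p = 1.0
    for q, has_replica in zip(failure_probs, replicated):
        loss = q * q if has_replica else q
        p *= 1.0 - loss
    return p


if __name__ == "__main__":
    full, selective = cores_needed(1_000_000, 0.20)
    print(full, selective)
```

Under these assumptions, 1 million processes with 20 percent marked failure-prone need 2,000,000 cores with full duplication and 1,200,000 with selective duplication. The energy and time figures reported in the abstract come from the authors' simulation and are not reproduced by this sketch.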
Pages: 55 to 60
Copyright: Copyright (c) IARIA, 2020
Publication date: February 23, 2020
Published in: conference
ISSN: 2308-4413
ISBN: 978-1-61208-770-2
Location: Lisbon, Portugal
Dates: from February 23, 2020 to February 27, 2020