TY - GEN
T1 - Simulating crash failures with many faulty processors
AU - Bazzi, Rida
AU - Neiger, Gil
N1 - Funding Information: Distributed computer systems give algorithm designers the ability to write fault-tolerant applications in which correctly functioning processors can complete a computation despite the failure of others. It has been well-established that the complexity of writing such applications depends upon the type of faulty behavior that processors may exhibit. While simple stopping failures are relatively * Partial support for this work was provided by the National Science Foundation under grants CCR-8909663 and CCR-9106627. ** This author was supported in part by a scholarship from the Harirl Foundation.
PY - 1992
Y1 - 1992
N2 - The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe omission failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Earlier results had suggested that these translations must, in general, have limited fault-tolerance: that crash failures could not be simulated unless a majority of processors remained correct throughout any execution. We show that this limitation does not apply when considering a broad range of distributed computing problems that includes most classical problems in the field. We do this by exhibiting a hierarchy of translations, each with different fault-tolerance and complexity; for any number of possible failures, we give an appropriate translation. Each of these translations is shown to be optimal with respect to the joint measures of fault-tolerance and round-complexity (the round-complexity of a translation is the number of communication rounds that the translation uses to simulate one round of the original algorithm). That is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. Furthermore, this hierarchy has more structure than that seen for other failure models, indicating that the relationship between crash and omission failures is more complex than had been previously thought.
AB - The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe omission failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Earlier results had suggested that these translations must, in general, have limited fault-tolerance: that crash failures could not be simulated unless a majority of processors remained correct throughout any execution. We show that this limitation does not apply when considering a broad range of distributed computing problems that includes most classical problems in the field. We do this by exhibiting a hierarchy of translations, each with different fault-tolerance and complexity; for any number of possible failures, we give an appropriate translation. Each of these translations is shown to be optimal with respect to the joint measures of fault-tolerance and round-complexity (the round-complexity of a translation is the number of communication rounds that the translation uses to simulate one round of the original algorithm). That is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. Furthermore, this hierarchy has more structure than that seen for other failure models, indicating that the relationship between crash and omission failures is more complex than had been previously thought.
UR - http://www.scopus.com/inward/record.url?scp=0342365620&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0342365620&partnerID=8YFLogxK
U2 - 10.1007/3-540-56188-9_12
DO - 10.1007/3-540-56188-9_12
M3 - Conference contribution
SN - 9783540561880
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 166
EP - 184
BT - Distributed Algorithms - 6th International Workshop, WDAG 1992, Proceedings
A2 - Segall, Adrian
A2 - Zaks, Shmuel
PB - Springer Verlag
T2 - 6th International Workshop on Distributed Algorithms, WDAG 1992
Y2 - 2 November 1992 through 4 November 1992
ER -