
Vajra: Benchmarking Survivability in Distributed Systems

Researcher: Priya Narasimhan



Distributed mission-critical applications need a way to quantify their survivability, and their developers need a way to compare different fault-tolerance approaches against each other in order to select the best one. Unfortunately, while reliability-modeling tools exist today, there are no comprehensive run-time approaches that ensure coverage over a wide span of faults and, moreover, do so in distributed settings. Most run-time fault-injection techniques assume that failures are independent (i.e., faults happen in isolation, without any correlation across the system) and benign (i.e., there is no malicious adversary orchestrating the failures). Neither assumption is realistic for applications that must complete their missions despite arbitrary failures.

The Vajra survivability benchmark project aims to perform run-time fault injection of various kinds of failures, including crash, communication, malicious and timing failures. Beyond this comprehensive coverage, the Vajra tools allow the injection of distributed failures, e.g., a timing fault on one component coupled with a message loss on another. The utility of such a benchmark is that it allows us to compare the relative dependability of different distributed systems that claim to be survivable. Even for a single distributed application, Vajra allows us to quantify the impact of failures, including the complex ones that distributed systems are vulnerable to. Additionally, Vajra is transparent and target-agnostic -- using an interception approach, it attaches itself non-intrusively, at run-time, to an existing distributed application and allows a variety of failures to be injected into the application in any order. This implies that it can be used to measure and analyze the dependability of any distributed system without requiring any changes to the system.
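As an illustration, such a transparent injector can be modeled as a wrapper around a node's message-send hook, driven by a per-node fault plan. The sketch below is ours, with hypothetical names (the real Vajra interception happens at a lower level); it shows how correlated distributed faults, such as a timing fault on the client coupled with message loss at the server, could be expressed:

```python
import random
import time

class FaultInjector:
    """Wraps a node's message-send hook to inject faults transparently.

    The target application calls its transport as usual; faults are applied
    here, without any change to the application. Illustrative sketch only,
    not the actual Vajra API.
    """

    def __init__(self, node_id, plan, rng=None):
        self.node_id = node_id
        self.plan = plan                      # node_id -> (fault_kind, parameter)
        self.rng = rng or random.Random(0)    # seeded for reproducible runs

    def send(self, transport_send, msg):
        fault, param = self.plan.get(self.node_id, (None, None))
        if fault == "drop" and self.rng.random() < param:
            return None                       # communication fault: lose the message
        if fault == "delay":
            time.sleep(param)                 # timing fault: late delivery
        if fault == "corrupt":
            msg = bytes(b ^ 0xFF for b in msg)  # value fault: flip every bit
        return transport_send(msg)

# A correlated, distributed fault plan: a timing fault on the client
# coupled with message loss at the server, injected by the same tool.
plan = {"client": ("delay", 0.01), "server": ("drop", 0.5)}
```

Because the plan is just data, the same injector attaches to every node, and only the plan decides which (possibly correlated) faults each node experiences.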

Results from Prior CyLab Support

Our past year’s accomplishments (under the seed grant on “Resource-Aware Configurable Survivability”, 2004-2005) include the development of a prototype for resource-aware adaptive fault-tolerance for distributed middleware applications. We also integrated a decentralized resource-monitoring infrastructure into this prototype. Our research has helped us to understand and characterize the trade-offs between resources, performance and fault-tolerance under a variety of reconfigurations. Specifically, changing the replication style of a running distributed application (switching from active to passive replication on the fly) allowed us to understand fundamental mechanisms such as quiescence in distributed systems.

Our last year’s CyLab funding has led to publications in international conferences, along with a beta-version software prototype of the system that we released in open-source form with documentation to industrial partners, including Lockheed Martin and Raytheon. We intend to exploit the system that we have built over the last year as one of the first targets that Vajra will evaluate. In addition, I have personally presented the results of this research to some of CyLab’s industrial affiliates through invited seminars at Lockheed Martin, Siemens Research, Bosch Research, HP Laboratories and IBM Research. The funding on this project has also resulted in presentations at various professional forums:

  • Book chapter to appear in “Architecting Dependable Systems, vol. 3”, 2005
  • Journal paper to appear in “Concurrency and Computation: Practice and Experience”, 2005
  • Conference paper, Hawaii International Conference on System Sciences, 2005
  • Student posters and presentations, CyLab Industrial Affiliates Meeting, 2004
  • Presentations, Software Engineering Research Seminar at Carnegie Mellon University

Proposed Research

Current survivable systems rely on strong theoretical properties (such as Byzantine fault-tolerance guarantees) for their survivability. Unfortunately, it is not common to measure the efficiency or the effectiveness of these properties in implementations of survivable systems. Instead of focusing on the survivability benefit of a system or technique, evaluations of such systems generally focus on the performance overhead of the mechanisms in the fault-free case: a metric that, in itself, is not a good measure of survivability. This dearth of metrics makes the objective comparison of the survivability of different system implementations---even those that employ similar algorithms---nearly impossible. To solve this problem, we propose the development of metrics to characterize and evaluate survivability.

We intend to employ these metrics to evaluate survivable systems, including the dependable systems that we ourselves have built. There are two important categories of operation for any survivable system: (i) the fault-free case; and (ii) the faulty case, under which the system's resistance---though not necessarily its survivability---has been overcome, i.e., a fault, either latent or active, now exists in the system. For the purpose of evaluation, it is useful to categorize the faulty case further into (a) proactive and (b) reactive, based on the survivability strategies employed by the system. We intend to evaluate a number of different systems, measuring their survivability, performance and resource usage under the fault-free, reactive-faulty and proactive-faulty cases. Our aim is to better understand the precise implications of the different survivability approaches of these systems, to derive insights into the specific fault-tolerance mechanisms and to determine the effectiveness of these systems under a variety of faults.
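As a sketch of how these cases might be exercised, the hypothetical harness below measures an operation's latency in the fault-free case and with a fault injected before each trial; a real evaluation would also record resilience (did the system survive?) and resource usage, not just latency:

```python
import time
import statistics

def measure(operation, inject=None, trials=50):
    """Run `operation` repeatedly, optionally injecting a fault before each
    trial, and return the median latency in seconds. Illustrative harness
    only; fault names and hooks are placeholders, not the Vajra interface."""
    latencies = []
    for _ in range(trials):
        if inject is not None:
            inject()                          # put the system into the faulty case
        start = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)

# The three cases of interest (the inject hooks are placeholders):
#   fault_free       = measure(op)
#   reactive_faulty  = measure(op, inject=trigger_fault_system_reacts)
#   proactive_faulty = measure(op, inject=trigger_fault_under_proactive_recovery)
```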

Clearly, in order to evaluate these systems objectively, we need a transparent way of injecting both malicious and accidental faults into them. The transparency is essential because (i) we need to evaluate the different systems using the same approach, treating them as black boxes, and (ii) the evaluation technique should not itself perturb the survivability or performance of the system. We have previously developed interception approaches for inserting various functionalities transparently into distributed applications; we intend to exploit this technique to develop a survivability evaluator through the injection of various kinds of faults, including crash faults, message losses, resource exhaustion, value faults, masquerades, mutant messages, corrupted messages, etc. The idea is to “attach” the fault-injection interceptor transparently to each of these target systems, at run-time, to inject these faults (without necessarily concerning ourselves with the precise architecture or implementation details, although these would be useful in the post-mortem of our evaluation results) and to “measure” each system’s resilience, speed of reaction and resource usage in the face of these faults.
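To make the fault catalogue concrete, the sketch below expresses a few of the fault kinds named above as transformations on an intercepted message. The names, the assumed `sender:payload` message format, and the transformations are our own illustrative assumptions, not Vajra's actual catalogue:

```python
import random

def message_loss(msg, rng):
    return None                                  # the message simply vanishes

def corruption(msg, rng):
    i = rng.randrange(len(msg))                  # assumes a non-empty message
    return msg[:i] + bytes([msg[i] ^ 0x01]) + msg[i + 1:]   # flip one bit

def mutant(msg, rng):
    return msg + b"\x00"                         # well-formed but altered payload

def masquerade(msg, rng):
    # forge the sender, assuming a hypothetical b"sender:payload" format
    return b"node-evil:" + msg.split(b":", 1)[-1]

FAULT_CATALOGUE = {
    "message_loss": message_loss,
    "corruption": corruption,
    "mutant": mutant,
    "masquerade": masquerade,
}

def intercept(fault_name, msg, rng=None):
    """Apply the named fault to a message at an interception point."""
    return FAULT_CATALOGUE[fault_name](msg, rng or random.Random(0))
```

Crash faults and resource exhaustion do not fit this per-message shape; they would act on the process itself rather than on a message in flight.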

Apart from transparency, another goal is objectivity and target-independence. The idea is to have the fault-injection process, and therefore the results of the fault injection, be independent of the specific target system under test. This is the only way that we will be able to compare two different systems against each other. Second, this means that, when benchmarking two different systems, we will not place any specific emphasis on the meaning of messages in either system. We could easily do so when putting a number on the survivability of any one specific system (e.g., we could induce corruption of only the membership messages in the Byzantine Fault Tolerance (BFT) system), but the results would lack meaning if we targeted BFT’s membership messages and, say, Immune’s membership messages, and then sought a comparison of their survivability when their approaches might be dissimilar (although it might provide some insight into which membership approach is more resilient in its implementation). Thus, in the first version of the Vajra framework, we aim to be target-agnostic, injecting the same set of failures at the same interception points across the target systems being compared.

At this initial stage of the research, we have categorized a variety of benign and malicious faults that we intend to inject, and we have also identified interception points in the Linux operating system where these faults might be injected. Over the next year, with this seed-grant funding, we expect to complete the injection of faults into BFT, enabling us to report the dependability of BFT purely from a run-time standpoint. We hope to learn from the BFT evaluation process to harden the Vajra tools and make them more comprehensive. Furthermore, we intend to evaluate BFT against a counterpart system, such as Immune. The results should be rather interesting: apart from providing an objective way to compare BFT to Immune in terms of their implementations (independent of the underlying algorithms), we might be able to report on how accurately each system follows its fault model (i.e., handles the faults that it claims to handle). The Vajra project is not intended to be forgiving towards our own systems; in fact, we intend to use it to evaluate the strengths and weaknesses of any dependable system that we build, so that we can quantify our own survivability.

The ultimate aim of our work on the Vajra framework is to classify the survivability of a number of different systems and to share with the community the relative resilience of these systems and their underlying approaches.

A secondary aim is to go after the complex, distributed failures that no fault-injection tool today addresses: e.g., what happens when a timing failure occurs at the client and a communication failure at the server, or when a failure propagates unchecked through the system? These are truly challenging, but very real, failure scenarios. If we had the run-time ability to inject such failures, we could build complex distributed systems out of COTS parts, secure in the knowledge that we would have a way to evaluate the dependability of even systems with complex failure modes. Given our inability (speaking collectively as the dependable-systems community) to “put a number on” the survivability of the systems that we build every day, the Vajra tool will be invaluable to our research group and to other research groups.
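One way such coordinated, distributed failure scenarios might be specified is as a shared schedule that each node's injector consults; the schedule format below is purely hypothetical, but it captures the example above of a timing failure at the client combined with a communication failure at the server:

```python
# Hypothetical schedule of a coordinated, distributed failure scenario:
# each entry names a node, a time offset (seconds), and a fault to inject.
SCENARIO = [
    {"t": 0.0, "node": "client", "fault": "timing",       "param": 0.05},
    {"t": 0.0, "node": "server", "fault": "message_loss", "param": 0.30},
    {"t": 5.0, "node": "server", "fault": "crash",        "param": None},
]

def events_for(node, schedule):
    """Return, in time order, the fault events a node's injector must execute."""
    return sorted((e for e in schedule if e["node"] == node), key=lambda e: e["t"])
```

Because all injectors execute against the same clock-offset schedule, faults on different nodes land in a controlled, repeatable combination rather than in isolation.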