
Development & Evaluation of Resource-Aware Configurable Survivability

Researcher: Priya Narasimhan

Abstract

As middleware (or distributed object) technologies such as Jini, EJB, CORBA, and DCOM have become more important for designing large software systems, it has become increasingly important to develop mechanisms, infrastructural support, and algorithms that provide intrusion-detection and intrusion-tolerance capabilities in middleware systems. The Starfish project [2] builds on my previous experience with the Immune system [7] and the Eternal system [6]. The Starfish system transparently provides survivability to CORBA applications, enabling them to continue to operate despite intrusions or accidents that damage the underlying distributed system, or faults that occur within the system. The Starfish system aims to go beyond typical reactive survivability strategies (i.e., waiting for a fault/attack to occur before reacting to tolerate/isolate it) by providing a proactive approach: detecting an impending fault/attack and taking effective action before the fault/attack actually occurs.

Proposed Research

Having demonstrated the effectiveness of proactive fault-tolerance for isolated, accidental faults, our intention is to study malicious faults and their propagation in highly connected, distributed systems, to determine whether/where/how a proactive fault-tolerance approach might aid in faster recovery, less disruption of service and higher availability of such systems. The ability of a system to respond quickly, and in a resource-aware manner, to unanticipated attacks, and to learn from them, is invaluable in our highly connected world, where new attacks are the norm, rather than the exception. In our second year of research, we intend to focus on the resource-aware aspects of configuring and measuring survivability, with the intention of quantifying the ability of distributed systems to detect, react to, and sustain a variety of faults. There are, thus, two distinct (but very related) thrusts in our second year of developing the Starfish system.

Resource-aware survivability configuration advisor

Configuring survivability in a distributed system is not necessarily straightforward. In fact, in today's systems, this involves considerable guesswork and insight, even on the part of experienced system developers/administrators. For large-scale systems, it is not always possible to guess the right configuration for survivability, e.g., which nodes should be used for protecting critical resources, which nodes can afford to run compute-intensive protocols, which nodes can afford to use replication, and how the application's components can be distributed across the various nodes in the system to obtain the expected behavior. These decisions also need to be made under various kinds of scenarios: the fault-free (normal) case, a real attack, an injected attack, a training sequence, etc. In the context of Starfish, the key research aspects of this strategy will include:

  • Developing mechanisms for profiling (with minimal performance overhead) the resource usage of the distributed system under the various scenarios.
  • Combining the resource-usage data, the survivability requirements, and the observed behavior of the various nodes (e.g., which nodes have historically been more faulty than others) to derive a configuration for distributing the application and the Starfish infrastructural components onto the system.
  • Recognizing that system and resource conditions might change at run-time (e.g., processors might crash fatally and never recover), the configuration advisor has to work actively with the run-time resource monitors to adjust the configuration dynamically and keep up with the changing conditions.

Additional issues that we will need to address include the stability (or eventual termination) of this distributed adaptive-tuning algorithm, and the kind of feedback control that we employ to allow the system to reconfigure itself and settle down rapidly, even under attack.
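To make the advisor's decision concrete, the sketch below (all node names, profiled values, and scoring weights are hypothetical illustrations, not part of Starfish) scores candidate nodes by their profiled resource headroom and their observed fault history, then greedily selects hosts for a replicated component. A run-time monitor could re-invoke such a placement function whenever profiled conditions change.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_free: float      # fraction of CPU currently idle (from profiling)
    mem_free: float      # fraction of memory free (from profiling)
    fault_rate: float    # observed faults per hour (from history)

def score(node: Node, cpu_w: float = 0.4, mem_w: float = 0.3,
          fault_w: float = 0.3) -> float:
    # More headroom raises the score; a poor fault history lowers it.
    return (cpu_w * node.cpu_free
            + mem_w * node.mem_free
            - fault_w * min(node.fault_rate, 1.0))

def place_replicas(nodes: list[Node], num_replicas: int) -> list[str]:
    # Greedy placement: pick the highest-scoring distinct nodes.
    ranked = sorted(nodes, key=score, reverse=True)
    return [n.name for n in ranked[:num_replicas]]

nodes = [
    Node("alpha", cpu_free=0.8, mem_free=0.7, fault_rate=0.0),
    Node("beta",  cpu_free=0.9, mem_free=0.9, fault_rate=0.6),
    Node("gamma", cpu_free=0.5, mem_free=0.6, fault_rate=0.1),
]
print(place_replicas(nodes, 2))  # -> ['alpha', 'beta']
```

A real advisor would replace the fixed weights with feedback-controlled parameters so that the system can re-balance itself as conditions change; the stability concern raised above corresponds to ensuring such re-placement converges rather than oscillating.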

Objective evaluation of survivability and resource usage of distributed system

Current survivable systems rely on strong theoretical properties (such as Byzantine fault-tolerance guarantees) to ensure survivability. Unfortunately, the efficiency and effectiveness of these properties are rarely measured in implementations of survivable systems.

Instead of focusing on the survivability benefit of a system or technique, evaluations of such systems generally focus on the performance overhead of the mechanisms in the fault-free case: a metric that, in itself, is not a good evaluator of survivability. This dearth of metrics makes the objective comparison of the survivability of different implementations of systems---even those that employ similar algorithms---nearly impossible. To solve this problem, we propose the development of metrics [5] to characterize and evaluate survivability. We intend to employ these metrics to evaluate survivable systems, including Starfish itself.

There are two important categories of operation for any survivable system: (i) the fault-free case; and (ii) the faulty case, in which the system's resistance---though not necessarily its survivability---has been overcome, i.e., a fault, either latent or active, now exists in the system. For the purpose of evaluation, it is useful to categorize the faulty case further into (a) proactive and (b) reactive, based on the survivability strategies employed by the system. It is our intention to evaluate a number of different systems, including BFT [1], Fleet [3], Immune [7] and Starfish [2], measuring their survivability, performance and resource usage under the fault-free, reactive-faulty and proactive-faulty cases. The intention here is for us to better understand the precise implications of the different survivability approaches of these systems, to derive insights into their specific fault-tolerance mechanisms and to determine their effectiveness under a variety of faults.
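One way such an evaluation could be organized (the class, metric, and sample values below are hypothetical, not the proposed metrics [5]) is to collect per-category measurements for each system under test and express faulty-case cost relative to the fault-free baseline:

```python
from statistics import mean

# Operating categories for a survivable system, as described above.
CATEGORIES = ("fault-free", "reactive-faulty", "proactive-faulty")

class SurvivabilityRecorder:
    """Collects per-category latency samples for one system under test."""

    def __init__(self, system: str):
        self.system = system
        self.samples: dict[str, list[float]] = {c: [] for c in CATEGORIES}

    def record(self, category: str, latency_ms: float) -> None:
        self.samples[category].append(latency_ms)

    def overhead(self, category: str) -> float:
        # Mean latency in the given category relative to the
        # fault-free baseline (1.0 means no degradation).
        baseline = mean(self.samples["fault-free"])
        return mean(self.samples[category]) / baseline

rec = SurvivabilityRecorder("system-under-test")
for latency in (10.0, 11.0, 9.0):
    rec.record("fault-free", latency)
for latency in (25.0, 35.0):
    rec.record("reactive-faulty", latency)
print(rec.overhead("reactive-faulty"))  # mean 30 / mean 10 -> 3.0
```

The same recorder could be populated under the proactive-faulty category, allowing a side-by-side comparison of how much each strategy pays, and when.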

Clearly, in order to evaluate these systems objectively, we need a transparent way of injecting both malicious and accidental faults into them. The transparency is essential because (i) we need to evaluate the different systems with the same approach, treating them as black boxes, and (ii) the evaluation technique should not itself perturb the survivability/performance of the system. We have previously developed interception approaches [6] for inserting various functionality transparently into distributed applications; we intend to exploit this technique to develop a survivability evaluator through the injection of various kinds of faults, including crash faults, message losses, resource exhaustion, value faults, masquerades, mutant messages, corrupted messages, etc. The idea is to “attach” the fault-injection interceptor transparently to each of these target systems, at runtime, to inject these faults (without necessarily concerning ourselves with the precise architecture or implementation details, although these would be useful in the post-mortem of the results of our evaluation) and to “measure” each system’s resilience, speed-of-reaction and resource usage in the face of these faults.
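The interception idea can be illustrated with a minimal sketch (a toy stand-in, not the CORBA-level interceptors of [6]): a proxy wraps a target object, and every method call passes through it, where message-loss and value faults can be injected probabilistically without modifying the target.

```python
import random

class FaultInjectingInterceptor:
    """Wraps a target object and transparently injects faults into its
    method calls; the target itself needs no modification (black box)."""

    def __init__(self, target, drop_prob=0.0, corrupt_prob=0.0, seed=None):
        self._target = target
        self._drop = drop_prob        # probability of a message-loss fault
        self._corrupt = corrupt_prob  # probability of a value fault
        self._rng = random.Random(seed)

    def __getattr__(self, name):
        method = getattr(self._target, name)

        def intercepted(*args, **kwargs):
            if self._rng.random() < self._drop:
                # Simulate a lost message: the call never completes.
                raise TimeoutError(f"injected message loss on {name}()")
            result = method(*args, **kwargs)
            if isinstance(result, int) and self._rng.random() < self._corrupt:
                return result ^ 0x1  # flip the low bit: injected value fault
            return result

        return intercepted

class Adder:  # toy stand-in for a remote servant
    def add(self, a, b):
        return a + b

proxy = FaultInjectingInterceptor(Adder(), corrupt_prob=1.0)
print(proxy.add(2, 3))  # 5 with its low bit flipped -> 4
```

An evaluator built this way can raise the fault probabilities scenario by scenario and observe each black-box system's resilience and speed of reaction, which is exactly the measurement the proposed metrics require.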