Engineering Self-Healing Systems to Support Trustworthy Computing

Researcher: David Garlan


Today’s software engineering practices are largely predicated on the notion that, with sufficient effort, one can design systems to eliminate all critical flaws. Hence most approaches to developing trustworthy systems have focused on design-time techniques: specification, verification, modeling and analysis, protocol design, and so on. This approach works well for systems that function in a known environment, that interact with other systems over which we have considerable control, and that can be taken off-line to correct problems.

However, increasingly systems must function in environments that are highly unpredictable, if not outright hostile. They must interact with other components of dubious pedigree. They must function in a world where resources are not limitless, and where cost may be a major concern in achieving trustworthy behavior. And they must continue to run without interruption.

For such systems it becomes essential that they take greater responsibility for their own behavior, adapting as appropriate at run time to maintain adequate levels of service. These systems must be able to detect when problems arise and fix them automatically or semi-automatically.

Regrettably, software engineering has little to say about principled ways to build such systems. Today most self-adaptation is performed at a low level and embedded in the code of applications (e.g., timeouts and exception handling). While such techniques can deal with localized problems, they are usually incapable of detecting system-wide problems, or gradual system degradation.
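To make the limitation concrete, the following is a minimal sketch (the function name and fallback behavior are illustrative, not drawn from any particular system) of the kind of code-embedded adaptation described above: a timeout with a local fallback. It copes with the failure of this one call, but has no visibility into system-wide health or gradual degradation.

```python
import socket

def fetch_with_fallback(host, port, timeout_s=2.0):
    """Localized, code-embedded adaptation: time out and degrade locally.
    No global diagnosis or repair is possible from this vantage point."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as s:
            s.sendall(b"GET / HTTP/1.0\r\n\r\n")
            return s.recv(4096)
    except (socket.timeout, OSError):
        return b""  # fall back silently; the rest of the system never learns why
```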

Over the past few years my research group has been exploring a new technique, termed architecture-based adaptation. The key idea is to engineer a software system in the fashion of a closed-loop control system. As illustrated in the figure below, systems are monitored at run time (1). Low-level events are abstracted via a set of components (2) that interpret those events as actions on a higher-level architectural model (3). An architectural model represents a system formally in terms of its main run-time components, their connections, and their envelope of acceptable behavior. These models permit one to detect when problems arise, via an architecture analyzer (4). Constraint violations trigger repair actions (5), which are reflected down into the running system (6).
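The six-step loop can be sketched in code. Everything here is a simplified illustration under assumed names (ArchModel, gauge, adaptation_cycle, and the property keys are all hypothetical, not the project's actual API): raw events are abstracted into architectural properties, constraints over those properties play the role of the analyzer, and violations drive repair.

```python
class ArchModel:
    """Architectural model (3): component/connector properties plus
    constraints expressing the envelope of acceptable behavior."""
    def __init__(self):
        self.props = {}        # e.g. {"server.latency_ms": 120.0}
        self.constraints = []  # predicates over self.props

    def update(self, prop, value):
        self.props[prop] = value

    def violations(self):
        # Architecture analyzer (4): find constraints that no longer hold.
        return [c for c in self.constraints if not c(self.props)]

def gauge(raw_events):
    """Abstract low-level monitored events (1) into architectural
    properties (2) -- here, a simple average request latency."""
    latencies = [e["ms"] for e in raw_events if e["kind"] == "request"]
    return {"server.latency_ms": sum(latencies) / max(len(latencies), 1)}

def adaptation_cycle(model, raw_events, repair, effector):
    """One pass around the closed loop."""
    for prop, value in gauge(raw_events).items():
        model.update(prop, value)
    for violation in model.violations():
        plan = repair(violation, model)  # choose a repair action (5)
        effector(plan)                   # reflect it into the system (6)
```

A usage example: with a constraint that average latency stay under 100 ms, a batch of slow requests drives the loop to emit a repair plan (say, adding a server replica) to the effector.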

Our research to date has focused on applying the techniques of architecture-based adaptation to the problem of performance enhancement. For example, we have shown that one can effectively monitor and adapt distributed web-based client-server applications to produce run-time performance improvement comparable to the best one can do with existing embedded techniques.

We plan to extend this approach to the domain of security. Specifically, we will adapt existing mechanisms for system monitoring, system modeling, and system repair in the area of security to support robust run-time self-adaptation via runtime architectural models. The goal will be to show that we can add effective security-enhancing self-adaptation to complex legacy systems with minimal engineering effort. Our current technology and research base will be essential to achieving this, but a number of critical research questions must also be addressed for this approach to succeed.

Plan of research

While we are confident that we can apply our techniques to the security domain, the degree of success will depend on our ability to solve a number of problems in four key areas:

(1)    Detection: ways to exploit existing monitoring technology to detect security problems. What kinds of dynamic security violations can be detected at run time through system observation? What kinds of system monitoring capabilities are needed to detect them? How can one abstract from large amounts of monitored data to provide architectural views of system behavior related to security concerns?

(2)    Modeling: ways to use architectural representations at run time for monitoring, problem resolution and repair. What kinds of architectural models can be used to represent security-oriented architectures? How can one characterize situations of concern through architectural properties and constraints?

(3)    Resolution: ways to adapt known security-analysis techniques to pinpoint likely sources of a problem. What kind of runtime information is needed to isolate a security problem? How does one use knowledge of the system’s environment to help with the process?

(4)    Repair: ways to adapt security-enhancing techniques to runtime adaptation. What kinds of security-enhancing mechanisms and approaches are amenable to self-adaptation? How can one engineer systems so that they can be adapted at runtime to accommodate security-enhancing capabilities? How can one combine multiple models to pick security enhancements that maximize utility (including cost) for the task at hand?
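One way the detection and repair questions above might fit together is sketched below. All names and the constraint itself are hypothetical illustrations, not results from this project: a security-oriented architectural constraint ("no untrusted component may connect directly to a trusted one") is checked over a toy component-and-connector model, and the repair interposes an auditing wrapper on the offending connector.

```python
def direct_untrusted_links(model):
    """Detection at the architectural level: flag connectors whose source
    is untrusted and whose destination is trusted, unless the destination
    is itself an auditing wrapper."""
    return [(s, d) for (s, d) in model["connectors"]
            if not model["components"][s].get("trusted", False)
            and model["components"][d].get("trusted", False)
            and not model["components"][d].get("wrapper", False)]

def interpose_wrapper(model, link):
    """Repair action: reroute the offending connector through a wrapper
    component that can audit and filter traffic."""
    src, dst = link
    model["connectors"].remove(link)
    model["components"]["audit-wrapper"] = {"trusted": True, "wrapper": True}
    model["connectors"] += [(src, "audit-wrapper"), ("audit-wrapper", dst)]

# A toy architectural model: one untrusted client wired straight to a
# trusted server, which violates the constraint above.
model = {
    "components": {"client": {}, "server": {"trusted": True}},
    "connectors": [("client", "server")],
}
for link in direct_untrusted_links(model):  # detection
    interpose_wrapper(model, link)          # repair
```

After the cycle runs, the direct client-to-server connector has been replaced by two connectors routed through the wrapper, and the constraint check comes back clean; real repairs would of course have to weigh cost and utility, as question (4) notes.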