Architectural Support for Scalable Program Checkpointing

Researchers: Babak Falsafi

Abstract

Architectural Support for Scalable Program Checkpointing and Recovery

Technology and design projections, together with the emerging demand for system reliability, security, and debuggability, indicate the need for efficient periodic checkpointing and recovery of program execution in future computer systems. Hardware performance variability due to variations in semiconductor fabrication, transient (soft) errors induced by cosmic radiation, and hardware wear-out are all critical reliability concerns for future computer systems, as projected by ITRS and by industrial and academic experts. Similarly, detecting and recovering from security breaches and faulty software is becoming ever more critical: as hardware systems grow in complexity, application and system software grows commensurately in complexity and vulnerability. Today’s computer system hardware provides little or no support for generalized fault detection and recovery in program execution.

We propose to design and evaluate Transactional Computer Systems (TCS), in which program execution proceeds as a series of transactions (i.e., atomic code sequences whose effects are reversible) that the hardware supports as first-class execution objects. TCS allows for continuous checkpointing of machine state and subsequent recovery upon encountering faults. Conventional fault-tolerant systems have relied on software to implement checkpointing and recovery, but the prohibitive overhead of checkpointing machine state in software has limited its applicability to scenarios where checkpoints are infrequent enough (e.g., one every hundreds of millions of instructions) to amortize that overhead over the interval between checkpoints.
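The amortization argument above can be illustrated with a short calculation. This is a minimal sketch under a simple assumed model, in which every checkpoint costs a fixed number of instruction-equivalents; the function name and the cost figure are hypothetical and are not measurements from this project.

```python
def checkpoint_overhead(checkpoint_cost: float, interval: float) -> float:
    """Fraction of total execution spent checkpointing.

    Assumed model: one checkpoint of fixed cost is taken after every
    `interval` useful instructions. Both values are in
    instruction-equivalents.
    """
    if checkpoint_cost < 0 or interval <= 0:
        raise ValueError("cost must be >= 0 and interval must be > 0")
    return checkpoint_cost / (checkpoint_cost + interval)


# Hypothetical software checkpoint cost, chosen only for illustration.
SOFTWARE_COST = 1_000_000

for interval in (10_000, 1_000_000, 100_000_000):
    share = checkpoint_overhead(SOFTWARE_COST, interval)
    print(f"interval={interval:>11,d}  overhead={share:.1%}")
```

Under this model the overhead share falls only as the interval grows well beyond the per-checkpoint cost, which is why a costly software checkpoint is practical only at long intervals.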

In contrast, TCS allows for minimal-overhead checkpointing of machine state at arbitrarily small transaction sizes, and for fast detection of, and recovery from, errors in software or hardware. TCS facilitates forward and reverse replay of program execution to allow for flexible inspection and detection of software and hardware vulnerabilities. Moreover, with TCS, unsafe program execution can proceed encapsulated in a transaction while various safety properties are monitored and checked in parallel to eliminate the verification overhead. Transactions can be committed upon verification or rolled back when suspicious activity or faults are detected.
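The commit-or-roll-back behavior described above can be modeled in software with an undo log. The sketch below is purely illustrative and is not the proposed hardware mechanism: the names `Transaction` and `run_checked` are hypothetical, memory is modeled as a Python dictionary, and the safety check runs after the transaction body rather than in parallel with it.

```python
_ABSENT = object()  # marks an address that had no value before the write


class Transaction:
    """Software model of a reversible code sequence, using an undo log."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.undo_log = []  # (address, previous value) pairs, oldest first

    def write(self, addr, value):
        # Record the old value before overwriting so the write can be undone.
        self.undo_log.append((addr, self.memory.get(addr, _ABSENT)))
        self.memory[addr] = value

    def commit(self):
        # Effects become permanent: discard the undo information.
        self.undo_log.clear()

    def rollback(self):
        # Undo writes in reverse order to restore the pre-transaction state.
        while self.undo_log:
            addr, old = self.undo_log.pop()
            if old is _ABSENT:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old


def run_checked(memory: dict, body, is_safe) -> bool:
    """Run body(tx) in a transaction; commit only if is_safe(memory) holds."""
    tx = Transaction(memory)
    try:
        body(tx)
    except Exception:
        tx.rollback()  # a fault inside the body also triggers recovery
        return False
    if is_safe(memory):
        tx.commit()
        return True
    tx.rollback()
    return False


if __name__ == "__main__":
    state = {"balance": 100}
    ok = run_checked(
        state,
        lambda tx: tx.write("balance", state["balance"] - 150),
        lambda mem: mem["balance"] >= 0,
    )
    print(ok, state)
```

In this usage example the write drives the balance negative, the check fails, and the rollback restores the original state. A hardware design would aim to make the logging step cheap enough that transactions can be very short.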

Methods:

We will design and evaluate a spectrum of alternatives for TCS hardware mechanisms using SimFlex, our fast and accurate full-system simulation infrastructure developed here at Carnegie Mellon (http://www.ece.cmu.edu/~simflex). We have installed and tuned transaction processing workloads on IBM DB2 and Oracle, and our evaluation also uses standard desktop/engineering benchmarks (SPEC) and other server workloads such as Apache and Zeus.