Deterministic Replay Using Global Clock
ACM Transactions on Architecture and Code Optimization, 2013, 10(1): 1 | April 1, 2013 | doi.org/10.1145/2445572.2445573
Debugging parallel programs is a notoriously difficult problem. A promising way to facilitate it is to use hardware support to achieve deterministic replay on a Chip Multi-Processor (CMP). As a Design-For-Debug (DFD) feature, a practical hardware-assisted deterministic replay scheme should have low design and verification costs, as well as a small log size. To achieve these goals, we propose a novel and succinct hardware-assisted deterministic replay scheme named LReplay. The key innovation of LReplay is that, instead of recording the logical time orders between instructions or instruction blocks as previous schemes do, it records the pending period information infused by the global clock. From the recorded pending period information, about 99% of execution orders are inferrable, so LReplay needs to record directly only the remaining 1% of noninferrable execution orders in a production run. These noninferrable orders can be handled by a simple yet cost-effective direction prediction technique, which further reduces the log size of LReplay. As a result, the overall log size of LReplay over the SPLASH-2 benchmarks is about 0.17 B/K-Inst (bytes per thousand instructions) under sequential consistency, and 0.57 B/K-Inst under the Godson-3 consistency model. These log sizes are an order of magnitude smaller than those of previous deterministic replay schemes that incur no performance loss. Furthermore, LReplay consumes only about 0.5% of the area of the Godson-3 CMP, since it requires only trivial modifications to existing Godson-3 components. These features demonstrate the potential of integrating hardware support for deterministic replay into future industrial processors.
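
The idea of inferring execution orders from pending periods can be illustrated with a small software sketch. This is not the LReplay hardware design. It assumes, purely for illustration, that each memory operation has a pending period given as a pair of global-clock cycles, and that the order of two operations is inferrable when their periods do not overlap. The names `PendingPeriod` and `classify_orders` are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PendingPeriod:
    """Hypothetical record of one memory operation's pending period."""
    op_id: str
    start: int  # assumed: global-clock cycle at which the operation became pending
    end: int    # assumed: global-clock cycle at which the operation completed


def classify_orders(periods):
    """Split operation pairs into inferrable and noninferrable orders.

    Assumption: if one period ends strictly before the other starts,
    the order follows from the timestamps alone; overlapping periods
    leave the order ambiguous, so it would have to be logged explicitly.
    """
    inferrable = []      # (earlier, later) pairs recoverable from timestamps
    noninferrable = []   # pairs whose order must be recorded directly
    for a, b in combinations(periods, 2):
        if a.end < b.start:
            inferrable.append((a.op_id, b.op_id))
        elif b.end < a.start:
            inferrable.append((b.op_id, a.op_id))
        else:
            noninferrable.append((a.op_id, b.op_id))
    return inferrable, noninferrable


periods = [
    PendingPeriod("A", 0, 3),
    PendingPeriod("B", 5, 9),
    PendingPeriod("C", 8, 12),
]
inferrable, noninferrable = classify_orders(periods)
print("inferrable:", inferrable)
print("noninferrable:", noninferrable)
```

In this toy input, only the pair whose periods overlap (B and C) would need an explicit log entry, which mirrors the abstract's point that most orders need not be recorded directly.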