Aggressive technology scaling trends are expected to make the hardware of high-performance computing (HPC) systems more susceptible to transient faults. Transient faults are commonly caused by electrical noise or high-energy particles from cosmic radiation, which in turn cause random bit flips. These faults can have several outcomes: they may be masked by program logic without affecting the program output; cause the program to crash, e.g., when a transient fault corrupts a pointer variable; or lead to silent data corruption (SDC), where the transient fault introduces an error into the application that goes undetected. The HPC community is concerned with the impact of SDC on computation results, on which critical decisions may rely. Because SDC is insidious, understanding the behavior of a fault-corrupted program is crucial for developing and evaluating software resiliency techniques that ensure correct output from HPC applications. Classical fault-injection studies give an overall statistical resiliency profile for an application. However, summarizing the critical information and understanding the behavior of the fault-injected program remain difficult. In this work, we introduce SpotSDC, a visualization system for analyzing how resilient a program is to SDC. SpotSDC provides an overview of the effect of SDC on an application by indicating the impact of a transient fault across the program's different regions and at different time steps during execution. It highlights the regions of code that are most susceptible to SDC and that have a high impact on the program's output. SpotSDC also enables users to visualize how an error propagates through a program execution. The results from SpotSDC allow developers to design selective algorithmic error detection and mitigation techniques that protect their applications from transient faults.
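The three fault outcomes described above (masked, crash, SDC) can be illustrated with a minimal single-bit fault-injection sketch. This is not SpotSDC's implementation or any particular fault-injection framework; the toy `histogram` kernel and the names `flip_bit` and `classify` are hypothetical, chosen only to show how one bit flip in a floating-point value can lead to each outcome.

```python
import struct
from collections import Counter


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of the IEEE 754 double encoding of value."""
    (encoded,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", encoded ^ (1 << bit)))
    return corrupted


def histogram(samples, num_bins=4):
    """Toy kernel (hypothetical): count samples per unit-width bin."""
    bins = [0] * num_bins
    for x in samples:
        # A corrupted x may land in the same bin, a wrong bin, or out of range.
        bins[int(x)] += 1
    return bins


def classify(samples, index, bit):
    """Inject one bit flip into samples[index] and classify the outcome."""
    golden = histogram(samples)  # fault-free reference output
    faulty = list(samples)
    faulty[index] = flip_bit(faulty[index], bit)
    try:
        result = histogram(faulty)
    except (IndexError, ValueError, OverflowError):
        return "crash"  # the fault made the program fail visibly
    # Same output: the fault was masked. Different output, no error: SDC.
    return "masked" if result == golden else "sdc"


samples = [0.5, 1.5, 2.5, 3.5]
# Sweep every bit position of one input value and tally the outcomes.
profile = Counter(classify(samples, 1, bit) for bit in range(64))
print(dict(profile))
```

The sweep over all 64 bit positions yields the kind of aggregate statistical profile that classical fault-injection studies report; it says how often each outcome occurs but not where or how the error propagates, which is the gap the abstract describes.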
Posted by: Steve Petruzza