Determinism and Reproducibility in Large-Scale HPC Systems


Wei-Fan Chiang, Ganesh Gopalakrishnan, Zvonimir Rakamaric, Dong H. Ahn, Gregory L. Lee. 4th Workshop on Determinism and Correctness in Parallel Programming (WoDet 2013), Houston, TX, USA.
Abstract: The ability to reproduce simulation results (external determinism) goes a long way towards enhancing the trustworthiness of high performance computing (HPC) simulations. The ability to replay schedules (internal determinism) greatly facilitates reproducing bugs and helps reduce wasted programmer effort. In this paper, we consider these issues in the context of the software libraries and APIs used in today’s mainstream HPC systems, as well as those expected to be used in upcoming high-end systems. After cataloging the main sources of external and internal nondeterminism, we summarize two thrusts in our current research: (1) mechanisms to control internal nondeterminism through active schedule control, and (2) techniques that may help assess the extent of result nondeterminism in floating-point calculations.


