Overcoming Extreme-Scale Reproducibility Challenges Through a Unified, Targeted, and Multilevel Toolset

Dong H. Ahn, Gregory L. Lee, Ganesh Gopalakrishnan, Zvonimir Rakamarić, Martin Schulz, Ignacio Laguna. 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering (SE-HPCCSE 2013), Denver, CO, USA.

Abstract: Reproducibility, the ability to repeat program executions with the same numerical result or code behavior, is crucial for computational science and engineering applications. However, non-determinism in concurrency scheduling often hampers achieving this ability on high performance computing (HPC) systems. To aid in managing the adverse effects of non-determinism, prior work has provided techniques to achieve bit-precise reproducibility, but most of them focus only on small-scale parallelism. While scalable techniques recently emerged, they are disparate and target special purposes, e.g., single-schedule domains. On current systems with O(10^6) compute cores and future ones with O(10^9), any technique that does not embrace a unified, targeted, and multilevel approach will fall short of providing reproducibility. In this paper, we argue for a common toolset that embodies this approach, where programmers select and compose complementary tools and can effectively, yet scalably, analyze, control, and eliminate sources of non-determinism at scale. This allows users to gain reproducibility only to the levels demanded by specific code development needs. We present our research agenda and ongoing work toward this goal.

BibTeX:

@inproceedings{sehpccse2013-algrsl,
  author = {Dong H. Ahn and Gregory L. Lee and Ganesh Gopalakrishnan
    and Zvonimir Rakamari{\'c} and Martin Schulz and Ignacio Laguna},
  title = {Overcoming Extreme-Scale Reproducibility Challenges
    Through a Unified, Targeted, and Multilevel Toolset},
  booktitle = {Proceedings of the 1st International Workshop
    on Software Engineering for High Performance Computing in
    Computational Science and Engineering (SE-HPCCSE)},
  publisher = {ACM},
  year = {2013},
  pages = {41--44},
}