Systematic Debugging Methods for Large Scale HPC Computational Frameworks

Abstract

Parallel computational frameworks for high-performance computing (HPC) are central to the advancement of simulation-based studies in science and engineering. Unfortunately, finding and fixing bugs in these frameworks can be extremely time-consuming. Left unchecked, these bugs can drastically diminish the amount of new science that can be performed. This paper presents our systematic study of the Uintah computational framework and our approaches to debugging it more incisively. Our key insight is to leverage the modular structure of Uintah, which lends itself to systematic debugging. In particular, we have developed a new approach based on Coalesced Stack Trace Graphs (CSTGs), which summarize system behavior in terms of key control flows manifested through function invocation chains. We illustrate several scenarios in which CSTGs can help localize bugs efficiently, and present a case study of how we found and fixed a real Uintah bug using CSTGs.
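
As a rough illustration of the idea behind CSTGs, the sketch below coalesces a set of stack traces into a graph of caller-to-callee edges with occurrence counts, then compares the graph from a working run against the graph from a failing run. This is a minimal sketch under stated assumptions and not the authors' implementation: the function names `coalesce` and `diff`, the trace format (a list of function names, outermost call first), and the sample traces are all hypothetical.

```python
from collections import Counter


def coalesce(traces):
    """Merge stack traces (outermost call first) into caller -> callee edge counts."""
    edges = Counter()
    for trace in traces:
        # Each adjacent pair of frames is one invocation edge in the graph.
        for caller, callee in zip(trace, trace[1:]):
            edges[(caller, callee)] += 1
    return edges


def diff(good, bad):
    """Return edges whose counts differ, mapped to (good_count, bad_count)."""
    return {
        edge: (good[edge], bad[edge])  # Counter yields 0 for absent edges
        for edge in sorted(set(good) | set(bad))
        if good[edge] != bad[edge]
    }


# Hypothetical traces collected at the same observation point in two runs.
good_traces = [
    ["main", "run_task", "send_data"],
    ["main", "run_task", "send_data"],
    ["main", "run_task", "recv_data"],
]
bad_traces = [
    ["main", "run_task", "send_data"],
    ["main", "run_task", "recv_data"],
    ["main", "run_task", "recv_data"],
]

good_graph = coalesce(good_traces)
bad_graph = coalesce(bad_traces)
delta = diff(good_graph, bad_graph)

for (caller, callee), (g, b) in delta.items():
    print(f"{caller} -> {callee}: good={g}, bad={b}")
```

Edges whose counts differ between the two runs point to invocation chains worth inspecting first, which is the kind of localization the abstract describes.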

Citation

Alan Humphrey, Qingyu Meng, Martin Berzins, Diego Caminha B. de Oliveira, Zvonimir Rakamaric, Ganesh Gopalakrishnan
Systematic Debugging Methods for Large Scale HPC Computational Frameworks
Computing in Science and Engineering (CiSE), 16(3): 48--56, doi:10.1109/MCSE.2014.11, 2014.

BibTeX

@article{2014_cise_hmbcrg,
  title = {Systematic Debugging Methods for Large Scale HPC Computational Frameworks},
  author = {Alan Humphrey and Qingyu Meng and Martin Berzins and Diego Caminha B. de Oliveira and Zvonimir Rakamaric and Ganesh Gopalakrishnan},
  journal = {Computing in Science and Engineering (CiSE)},
  volume = {16},
  publisher = {IEEE},
  pages = {48--56},
  doi = {10.1109/MCSE.2014.11},
  number = {3},
  month = {may},
  year = {2014}
}

Acknowledgements

The authors wish to thank the referees for their insightful comments. This work was supported by the National Science Foundation under grant OCI-0721659, by the NSF OCI PetaApps program through award OCI 0905068, and by DOE NETL under NET DE-EE0004449. This project used the University of Delaware’s Chimera computer, which was funded by U.S. National Science Foundation award CNS-0958512.