Systematic Debugging Methods for Large Scale HPC Computational Frameworks

Alan Humphrey, Qingyu Meng, Martin Berzins, Diego Caminha B. de Oliveira, Zvonimir Rakamarić, Ganesh Gopalakrishnan. Computing in Science and Engineering (CiSE), 2014.

Abstract: Parallel computational frameworks for high-performance computing (HPC) are central to the advancement of simulation-based studies in science and engineering. Unfortunately, finding and fixing bugs in these frameworks can be extremely time-consuming. Left unchecked, these bugs can drastically diminish the amount of new science that can be performed. This paper presents our systematic study of the Uintah computational framework and our approaches to debugging it more incisively. Our key insight is to leverage the modular structure of Uintah, which lends itself to systematic debugging. In particular, we have developed a new approach based on Coalesced Stack Trace Graphs (CSTGs), which summarize system behavior in terms of key control flows manifested through function invocation chains. We illustrate several scenarios in which CSTGs could help localize bugs efficiently, and present a case study of how we found and fixed a real Uintah bug using CSTGs.
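
To make the idea of coalescing stack traces concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes a trace is a list of function names ordered from outermost caller to innermost callee, that coalescing means counting caller-to-callee edges across all traces, and that comparing the graphs of two runs is one way to narrow down a bug. The function names `coalesce` and `diff_graphs` and the example traces are hypothetical.

```python
from collections import Counter

def coalesce(traces):
    """Merge stack traces into one graph: (caller, callee) edge -> count."""
    edges = Counter()
    for trace in traces:
        # Assumption: each trace runs from outermost caller to innermost callee.
        for caller, callee in zip(trace, trace[1:]):
            edges[(caller, callee)] += 1
    return edges

def diff_graphs(good, bad):
    """Return edges whose counts differ, as edge -> (good_count, bad_count)."""
    return {
        edge: (good.get(edge, 0), bad.get(edge, 0))
        for edge in sorted(set(good) | set(bad))
        if good.get(edge, 0) != bad.get(edge, 0)
    }

# Hypothetical traces; these are illustrative names, not real Uintah functions.
working_run = [
    ["main", "run_timestep", "schedule", "send"],
    ["main", "run_timestep", "schedule", "recv"],
]
failing_run = [
    ["main", "run_timestep", "schedule", "send"],
    ["main", "run_timestep", "schedule", "send"],
]

delta = diff_graphs(coalesce(working_run), coalesce(failing_run))
for (caller, callee), (good_count, bad_count) in delta.items():
    print(f"{caller} -> {callee}: working={good_count} failing={bad_count}")
```

In this toy case the only edges that differ are the ones leaving `schedule`, which points at a receive that the failing run never reached. The graphs described in the paper may record more or different information than these edge counts.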


  author = {Alan Humphrey and Qingyu Meng and Martin Berzins and
    Diego Caminha B. de Oliveira and Zvonimir Rakamari\'c and Ganesh Gopalakrishnan},
  title = {Systematic Debugging Methods for Large Scale {HPC} Computational Frameworks},
  journal = {Computing in Science and Engineering (CiSE)},
  volume = {16},
  number = {3},
  year = {2014},
  month = {May},
  pages = {48--56},
  publisher = {IEEE}