Systematic Debugging of Concurrent Systems Using Coalesced Stack Trace Graphs

Diego Caminha B. de Oliveira, Zvonimir Rakamarić, Ganesh Gopalakrishnan, Alan Humphrey, Qingyu Meng, Martin Berzins. 27th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2014), Hillsboro, OR, USA.

Abstract: A central need during software development of large-scale parallel systems is tools that help to quickly identify the root causes of bugs. Given the massive scale of these systems, tools that highlight changes — say introduced across software versions or their operating conditions (e.g., inputs, schedules) — can prove to be highly effective in practice. Conventional debuggers, while good at presenting details at the problem-site (e.g., crash), often omit contextual information to identify the root causes of the bug. We present a new approach to collect and coalesce stack traces, leading to an efficient summary display of salient system control flow differences in a graphical form called Coalesced Stack Trace Graphs (CSTG). CSTGs have helped us debug situations within a computational framework called Uintah that has been deployed at very large scale. In this paper, we detail CSTGs through case studies in the context of Uintah where unexpected behaviors caused by different versions of software or occurring across different time-steps of a system (e.g., due to non-determinism) are debugged. We show that CSTG also gives conventional debuggers a far more productive and guided role to play.
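The coalescing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each stack trace is a list of function names ordered from outermost to innermost frame, that a graph is a map from caller-to-callee edges to the number of traces containing them, and that two runs are compared by reporting edges whose counts differ. The names `coalesce`, `diff`, and the sample traces are hypothetical.

```python
from collections import Counter


def coalesce(traces):
    """Merge stack traces into a graph of caller -> callee edge counts.

    Each trace is assumed to list function names, outermost frame first.
    """
    edges = Counter()
    for trace in traces:
        # Consecutive frames form one caller -> callee edge.
        for caller, callee in zip(trace, trace[1:]):
            edges[(caller, callee)] += 1
    return edges


def diff(old, new):
    """Return edges whose counts differ, as {edge: (old_count, new_count)}."""
    # A Counter reports 0 for an edge it has never seen.
    return {
        edge: (old[edge], new[edge])
        for edge in old.keys() | new.keys()
        if old[edge] != new[edge]
    }


# Hypothetical traces from a working run and a failing run.
working = [
    ["main", "run", "send"],
    ["main", "run", "recv"],
]
failing = [
    ["main", "run", "send"],
    ["main", "run", "send"],
]

delta = diff(coalesce(working), coalesce(failing))
for (caller, callee), (before, after) in sorted(delta.items()):
    print(f"{caller} -> {callee}: {before} -> {after}")
```

In this toy input, the edge shared by both runs with equal counts (`main -> run`) drops out of the result, and only the edges into `recv` and `send` remain as candidates for closer inspection with a conventional debugger.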

BibTeX:

@inproceedings{lcpc2014-orghmb,
  author = {Diego Caminha B. de Oliveira and Zvonimir Rakamari{\'c}
    and Ganesh Gopalakrishnan and Alan Humphrey and Qingyu Meng
    and Martin Berzins},
  title = {Systematic Debugging of Concurrent Systems Using Coalesced
    Stack Trace Graphs},
  booktitle = {Proceedings of the 27th International Workshop on Languages
    and Compilers for Parallel Computing (LCPC)},
  series = {Lecture Notes in Computer Science},
  volume = {8967},
  publisher = {Springer},
  editor = {James Brodman and Peng Tu},
  year = {2014},
  pages = {317--331},
}