Abstract
Despite the growing popularity of GPGPU programming, there is not yet a portable and formally specified barrier that one can use to synchronise across workgroups. Moreover, the occupancy-bound execution model of GPUs breaks assumptions inherent in traditional software execution barriers, exposing them to deadlock. We present an occupancy discovery protocol that dynamically discovers a safe estimate of the occupancy for a given GPU and kernel. Restricting the number of workgroups according to this estimate allows for a starvation-free (and hence deadlock-free) inter-workgroup barrier. We implement this idea by adapting an existing, previously non-portable, GPU inter-workgroup barrier to use OpenCL 2.0 atomic operations, and prove that the barrier meets its natural specification in terms of synchronisation.
We assess the portability of our approach over eight GPUs spanning four vendors, comparing the performance of our method against alternative methods. Our key findings include: (1) the recall of our discovery protocol is nearly 100%; (2) runtime comparisons vary substantially across GPUs and applications; and (3) our method provides portable and safe inter-workgroup synchronisation across the applications we study.
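To make the idea in the abstract concrete, the following is a minimal CPU-side simulation in Python, not the paper's OpenCL implementation. It assumes a simplified reading of the protocol: workgroups that start running while a poll is open take a ticket, the first participant to finish polling closes the poll, and only ticket holders take part in the barrier. All names (`run_kernel`, `slots`, `poll_open`, and so on) are hypothetical, and a semaphore stands in for the GPU's occupancy limit.

```python
import threading
import time


def run_kernel(num_workgroups, occupancy):
    """Simulate num_workgroups launched on a device that can keep only
    `occupancy` of them resident at once (an assumed, simplified model)."""
    slots = threading.Semaphore(occupancy)  # models occupancy-bound scheduling
    mutex = threading.Lock()
    state = {"poll_open": True, "count": 0, "arrived": 0, "passed": 0}

    def workgroup():
        with slots:  # a workgroup keeps its slot until it finishes
            # Discovery: take a ticket only while the poll is still open.
            with mutex:
                if not state["poll_open"]:
                    return  # started too late, so it does not participate
                state["count"] += 1
            # Close the poll; the participant count is final from here on.
            with mutex:
                state["poll_open"] = False
                participants = state["count"]
            # Barrier restricted to the discovered participants.
            with mutex:
                state["arrived"] += 1
            while True:
                with mutex:
                    if state["arrived"] == participants:
                        break
                time.sleep(0)  # yield to other simulated workgroups
            with mutex:
                state["passed"] += 1

    threads = [threading.Thread(target=workgroup) for _ in range(num_workgroups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"], state["passed"]


if __name__ == "__main__":
    discovered, passed = run_kernel(num_workgroups=16, occupancy=4)
    print(f"discovered {discovered} participants; {passed} passed the barrier")
```

In this model every participant holds its slot until it has passed the barrier, so the discovered count can never exceed the occupancy, and every workgroup the barrier waits for is already running. A barrier that instead waited for all 16 launched workgroups would hang, because at most 4 can be resident at once. The discovered count varies between runs, since it depends on how many workgroups start before the poll closes.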
Citation
Tyler Sorensen, Alastair F. Donaldson, Mark Batty, Ganesh Gopalakrishnan, and Zvonimir Rakamaric.
Portable Inter-workgroup Barrier Synchronisation for GPUs.
Proceedings of the ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pp. 39–58, doi:10.1145/3022671.2984032, 2016.
BibTeX
@inproceedings{2016_oopsla_sdbgr,
  title = {Portable Inter-workgroup Barrier Synchronisation for GPUs},
  author = {Tyler Sorensen and Alastair F. Donaldson and Mark Batty and Ganesh Gopalakrishnan and Zvonimir Rakamaric},
  booktitle = {Proceedings of the ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA)},
  publisher = {ACM},
  pages = {39--58},
  doi = {10.1145/3022671.2984032},
  year = {2016}
}
Acknowledgements
We are grateful to Brad Beckman, Marco Cornero, Hugues Evrard, Hedley Francis, Thibaut Lutz, Marc Orr, Sven Van Haastregt, and John Wickerson for feedback and insightful discussions around this work, and to the OOPSLA reviewers (paper and artifact) for their thorough evaluations and feedback, which greatly improved this paper. This work was supported in part by an equipment grant from GCHQ, a gift from Intel Corporation, an EPSRC Impact Acceleration Award, the Royal Academy of Engineering, the Lloyd's Register Foundation, and NSF grants CCF 1346756 and ACI 1535032.