HIGH PERFORMANCE AND SCALABLE RADIX SORTING: A CASE STUDY OF IMPLEMENTING DYNAMIC PARALLELISM FOR GPU COMPUTING
Abstract
The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compared to state-of-the-art GPU sorting. We also reverse the performance differentials observed between GPU and multi/many-core CPU architectures by recent comparisons in the literature, including those with 32-core CPU-based accelerators. Our average sorting rates exceed 1B 32-bit keys/sec on a single GPU microprocessor. Our sorting passes are constructed from a very efficient parallel prefix scan "runtime" that incorporates three design features: (1) kernel fusion for locally generating and consuming prefix scan data; (2) multi-scan for performing multiple related, concurrent prefix scans (one for each partitioning bin); and (3) flexible algorithm serialization for avoiding unnecessary synchronization and communication within algorithmic phases, allowing us to construct a single implementation that scales well across all generations and configurations of programmable NVIDIA GPUs.
References
- Proceedings of the IEEE 96(5), 879 (2008), DOI: 10.1109/JPROC.2008.917757. Crossref, ISI, Google Scholar
Victor W. Lee , Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU, Proceedings of the 37th annual international symposium on Computer architecture pp. 451–460. Google ScholarRichard Vuduc , On the limits of GPU acceleration, Proceedings of the 2nd USENIX conference on Hot topics in parallelism (HotPar'10) pp. 13–13. Google Scholar-
Erich Gamma , Design Patterns: Elements of Reusable Object-Oriented Software ( Addisson-Wesley , Reading, MA, USA , 1995 ) . Google Scholar Nadathur Satish , Mark Harris and Michael Garland , Designing efficient sorting algorithms for manycore GPUs, IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing pp. 1–10. Google Scholar- GPGPU.org. [Online] , http://gpgpu.org/developer/cudpp . Google Scholar
- Nadathur Satish et al., "Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort,", 2010, pp. 351--362 . Google Scholar
Jatin Chhugani , Efficient implementation of sorting on multi-core SIMD CPU architecture, Proc. VLDB Endow. (2008) pp. 1313–1324. Google Scholar- Nadathur Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," techreport 2010 . Google Scholar
-
Donald Knuth , The Art of Computer Programming ,Sorting and Searching III ( Addison-Wesley , Reading, MA, USA , 1973 ) . Google Scholar -
Thomas H. Cormen , Introduction to Algorithms , 2nd edn. ( McGraw-Hill , 2001 ) . Google Scholar Guojing Cong , George Almasi and Vijay Saraswat , Fast PGAS Implementation of Distributed Graph Algorithms, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10) pp. 1–11. Google Scholar-
Kun Zhou , Data-Parallel Octrees for Surface Reconstruction , EEE Transactions on Visualization & Computer Graphics ( 2010 ) . Google Scholar Kun Zhou , Real-time KD-tree construction on graphics hardware, SIGGRAPH Asia '08: ACM SIGGRAPH Asia 2008 papers pp. 1–11. Google ScholarJ. Pantaleoni and D. Luebke , HLBVH: hierarchical LBVH construction for real-time ray tracing of dynamic geometry, Proceedings of the Conference on High Performance Graphics (HPG '10) pp. 87–95. Google ScholarErik Sintorn and Ulf Assarsson , Real-time approximate sorting for self shadowing and transparency in hair rendering, I3D '08: Proceedings of the 2008 symposium on Interactive 3D graphics and games pp. 157–162. Google ScholarKun Zhou , RenderAnts: interactive Reyes rendering on GPUs, SIGGRAPH Asia '09: ACM SIGGRAPH Asia 2009 papers pp. 1–11. Google ScholarBernhard Kainz , Ray casting of multiple volumetric datasets with polyhedral boundaries on manycore GPUs, SIGGRAPH Asia '09: ACM SIGGRAPH Asia 2009 papers pp. 1–9. Google ScholarPeter Kipfer , Mark Segal and Rüdiger Westermann , UberFlow: a GPU-based particle engine, HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware pp. 115–122. Google ScholarJonathan M. Cohen , Sarah Tariq and Simon Green , Interactive fluid-particle simulation using translating Eulerian grids, I3D '10: Proceedings of the 2010 ACM SIGGRAPH symposium on Interactive 3D Graphics and Games pp. 15–22. Google Scholar-
Charles Loop and Kirill Garanzha , Fast Ray Sorting and Breadth-First Packet Traversal for GPU Ray Tracing , Eurographics . Google Scholar - Ignacio Castaño. (2007, February) High Quality DXT Compression Using CUDA. [Online] , http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/dxtc/doc/cuda_dxtc.pdf . Google Scholar
Dan A. Alcantara , Real-time parallel hashing on the GPU, SIGGRAPH Asia '09: ACM SIGGRAPH Asia 2009 papers pp. 1–9. Google ScholarBingsheng He , Relational joins on graphics processors, SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data pp. 511–524. Google ScholarNaga Govindaraju , GPUTeraSort: high performance graphics co-processor sorting for large database management, SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data pp. 325–336. Google ScholarNaga Govindaraju , Nikunj Raghuvanshi and Dinesh Manocha , Fast and approximate stream mining of quantiles and frequencies using graphics processors, SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data pp. 611–622. Google ScholarJeremy Shopf , March of the Froblins: simulation and rendering massive crowds of intelligent and detailed creatures on GPU, SIGGRAPH '08: ACM SIGGRAPH 2008 classes pp. 52–101. Google ScholarA. Hartstein , Cache miss behavior: is it √2?, CF '06: Proceedings of the 3rd conference on Computing frontiers pp. 313–320. Google Scholar- IEEE Trans. Comput. 31(3), 260 (1982). ISI, Google Scholar
Siddhartha Chatterjee , Guy Blelloch and Marco Zagha , Scan primitives for vector computers, Supercomputing '90: Proceedings of the 1990 ACM/IEEE conference on Supercomputing pp. 666–675. Google ScholarAndrea C. Dusseau , Fast Parallel Sorting Under LogP: Experience with the CM-5, IEEE Trans. Parallel Distrib. Syst.7 (1996) pp. 791–805. Google ScholarMarco Zagha and Guy Blelloch , Radix sort for vector multiprocessors, Supercomputing '91: Proceedings of the 1991 ACM/IEEE conference on Supercomputing pp. 712–721. Google ScholarShubhabrata Sengupta , Scan Primitives for GPU Computing, GH '07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware pp. 97–106. Google ScholarBingsheng He , Efficient gather and scatter operations on graphics processors, SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing pp. 1–12. Google Scholar- Shubhabrata Sengupta, Mark Harris, and Michael Garland, "Efficient Parallel Scan Algorithms for GPUs," NVIDIA, Technical Report NVR-2008-003, 2008 . Google Scholar
Yuri Dotsenko , Fast scan algorithms on graphics processors, ICS '08: Proceedings of the 22nd annual international conference on Supercomputing pp. 205–213. Google Scholar- Duane Merrill and Andrew Grimshaw, "Parallel Scan for Stream Architectures," University of Virginia, Department of Computer Science, Charlottesville, VA, USA, Technical Report CS2009-14, 2009 . Google Scholar
Guy E. Blelloch , Siddhartha Chatterjee and Marco Zagha , Solving Linear Recurrences with Loop Raking, Proceedings of the 6th International Parallel Processing Symposium pp. 416–424. Google Scholar- Mark Harris. (2007) Optimizing parallel reduction in CUDA. [Online] , http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf . Google Scholar
- IEEE Trans. Comput. 22, 786 (1973). ISI, Google Scholar
- Duane Merrill and Andrew Grimshaw, "Revisiting Sorting for GPGPU Stream Architectures," University of Virginia, Charlottesville, VA, Technical Report CS2010-03, 2010 . Google Scholar
K. Thearling and S. Smith , An improved supercomputer sorting benchmark, Proceedings of the 1992 ACM/IEEE conference on Supercomputing (SC '92) pp. 14–19. Google Scholar- J. Exp. Algorithmics 14, 1.4 (2009), DOI: 10.1145/1498698.1564500. Google Scholar
- Nikolaj Leischner, Vitaly Osipov, and Peter Sanders, GPU sample sort, 2009 . Google Scholar
- Frank Dehne and Hamidreza Zaboli, Deterministic Sample Sort For GPUs, 2010 . Google Scholar
- Guy Blelloch, "Prefix Sums and Their Applications," School of Computer Science, Carnegie Mellon University, Technical Report CMU-CS-90-190, 1990 . Google Scholar
- Thrust. [Online] , http://code.google.com/p/thrust/ . Google Scholar


