World Scientific
  • Search
Skip main navigation

Cookies Notification

We use cookies on this site to enhance your user experience. By continuing to browse the site, you consent to the use of our cookies. Learn More
×
Our website is made possible by displaying certain online content using javascript.
In order to view the full content, please disable your ad blocker or whitelist our website www.worldscientific.com.

System Upgrade on Tue, Oct 25th, 2022 at 2am (EDT)

Existing users will be able to log into the site and access content. However, E-commerce and registration of new users may not be available for up to 12 hours.
For online purchase, please visit us again. Contact us at [email protected] for any enquiries.

HIGH PERFORMANCE AND SCALABLE RADIX SORTING: A CASE STUDY OF IMPLEMENTING DYNAMIC PARALLELISM FOR GPU COMPUTING

    The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compared to state-of-the-art GPU sorting. We also reverse the performance differentials observed between GPU and multi/many-core CPU architectures by recent comparisons in the literature, including those with 32-core CPU-based accelerators. Our average sorting rates exceed 1B 32-bit keys/sec on a single GPU microprocessor. Our sorting passes are constructed from a very efficient parallel prefix scan "runtime" that incorporates three design features: (1) kernel fusion for locally generating and consuming prefix scan data; (2) multi-scan for performing multiple related, concurrent prefix scans (one for each partitioning bin); and (3) flexible algorithm serialization for avoiding unnecessary synchronization and communication within algorithmic phases, allowing us to construct a single implementation that scales well across all generations and configurations of programmable NVIDIA GPUs.

    References

    • J. D. Owenset al., Proceedings of the IEEE 96(5), 879 (2008), DOI: 10.1109/JPROC.2008.917757. Crossref, ISIGoogle Scholar
    • Victor W. Leeet al., Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU, Proceedings of the 37th annual international symposium on Computer architecture pp. 451–460. Google Scholar
    • Richard Vuducet al., On the limits of GPU acceleration, Proceedings of the 2nd USENIX conference on Hot topics in parallelism (HotPar'10) pp. 13–13. Google Scholar
    • Erich   Gamma et al. , Design Patterns: Elements of Reusable Object-Oriented Software ( Addisson-Wesley , Reading, MA, USA , 1995 ) . Google Scholar
    • Nadathur Satish, Mark Harris and Michael Garland, Designing efficient sorting algorithms for manycore GPUs, IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing pp. 1–10. Google Scholar
    • GPGPU.org. [Online] , http://gpgpu.org/developer/cudpp . Google Scholar
    • Nadathur Satish et al., "Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort,", 2010, pp. 351--362 . Google Scholar
    • Jatin Chhuganiet al., Efficient implementation of sorting on multi-core SIMD CPU architecture, Proc. VLDB Endow. (2008) pp. 1313–1324. Google Scholar
    • Nadathur Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," techreport 2010 . Google Scholar
    • Donald   Knuth , The Art of Computer Programming , Sorting and Searching   III ( Addison-Wesley , Reading, MA, USA , 1973 ) . Google Scholar
    • Thomas H.   Cormen et al. , Introduction to Algorithms , 2nd edn. ( McGraw-Hill , 2001 ) . Google Scholar
    • Guojing Cong, George Almasi and Vijay Saraswat, Fast PGAS Implementation of Distributed Graph Algorithms, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10) pp. 1–11. Google Scholar
    • Kun   Zhou et al. , Data-Parallel Octrees for Surface Reconstruction , EEE Transactions on Visualization & Computer Graphics ( 2010 ) . Google Scholar
    • Kun Zhouet al., Real-time KD-tree construction on graphics hardware, SIGGRAPH Asia '08: ACM SIGGRAPH Asia 2008 papers pp. 1–11. Google Scholar
    • J. Pantaleoni and D. Luebke, HLBVH: hierarchical LBVH construction for real-time ray tracing of dynamic geometry, Proceedings of the Conference on High Performance Graphics (HPG '10) pp. 87–95. Google Scholar
    • Erik Sintorn and Ulf Assarsson, Real-time approximate sorting for self shadowing and transparency in hair rendering, I3D '08: Proceedings of the 2008 symposium on Interactive 3D graphics and games pp. 157–162. Google Scholar
    • Kun Zhouet al., RenderAnts: interactive Reyes rendering on GPUs, SIGGRAPH Asia '09: ACM SIGGRAPH Asia 2009 papers pp. 1–11. Google Scholar
    • Bernhard Kainzet al., Ray casting of multiple volumetric datasets with polyhedral boundaries on manycore GPUs, SIGGRAPH Asia '09: ACM SIGGRAPH Asia 2009 papers pp. 1–9. Google Scholar
    • Peter Kipfer, Mark Segal and Rüdiger Westermann, UberFlow: a GPU-based particle engine, HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware pp. 115–122. Google Scholar
    • Jonathan M. Cohen, Sarah Tariq and Simon Green, Interactive fluid-particle simulation using translating Eulerian grids, I3D '10: Proceedings of the 2010 ACM SIGGRAPH symposium on Interactive 3D Graphics and Games pp. 15–22. Google Scholar
    • Charles   Loop and Kirill   Garanzha , Fast Ray Sorting and Breadth-First Packet Traversal for GPU Ray Tracing , Eurographics . Google Scholar
    • Ignacio Castaño. (2007, February) High Quality DXT Compression Using CUDA. [Online] , http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/dxtc/doc/cuda_dxtc.pdf . Google Scholar
    • Dan A. Alcantaraet al., Real-time parallel hashing on the GPU, SIGGRAPH Asia '09: ACM SIGGRAPH Asia 2009 papers pp. 1–9. Google Scholar
    • Bingsheng Heet al., Relational joins on graphics processors, SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data pp. 511–524. Google Scholar
    • Naga Govindarajuet al., GPUTeraSort: high performance graphics co-processor sorting for large database management, SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data pp. 325–336. Google Scholar
    • Naga Govindaraju, Nikunj Raghuvanshi and Dinesh Manocha, Fast and approximate stream mining of quantiles and frequencies using graphics processors, SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data pp. 611–622. Google Scholar
    • Jeremy Shopfet al., March of the Froblins: simulation and rendering massive crowds of intelligent and detailed creatures on GPU, SIGGRAPH '08: ACM SIGGRAPH 2008 classes pp. 52–101. Google Scholar
    • A. Hartsteinet al., Cache miss behavior: is it √2?, CF '06: Proceedings of the 3rd conference on Computing frontiers pp. 313–320. Google Scholar
    • R. P. Brent and H. T. Kung, IEEE Trans. Comput. 31(3), 260 (1982). ISIGoogle Scholar
    • Siddhartha Chatterjee, Guy Blelloch and Marco Zagha, Scan primitives for vector computers, Supercomputing '90: Proceedings of the 1990 ACM/IEEE conference on Supercomputing pp. 666–675. Google Scholar
    • Andrea C. Dusseauet al., Fast Parallel Sorting Under LogP: Experience with the CM-5, IEEE Trans. Parallel Distrib. Syst.7 (1996) pp. 791–805. Google Scholar
    • Marco Zagha and Guy Blelloch, Radix sort for vector multiprocessors, Supercomputing '91: Proceedings of the 1991 ACM/IEEE conference on Supercomputing pp. 712–721. Google Scholar
    • Shubhabrata Senguptaet al., Scan Primitives for GPU Computing, GH '07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware pp. 97–106. Google Scholar
    • Bingsheng Heet al., Efficient gather and scatter operations on graphics processors, SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing pp. 1–12. Google Scholar
    • Shubhabrata Sengupta, Mark Harris, and Michael Garland, "Efficient Parallel Scan Algorithms for GPUs," NVIDIA, Technical Report NVR-2008-003, 2008 . Google Scholar
    • Yuri Dotsenkoet al., Fast scan algorithms on graphics processors, ICS '08: Proceedings of the 22nd annual international conference on Supercomputing pp. 205–213. Google Scholar
    • Duane Merrill and Andrew Grimshaw, "Parallel Scan for Stream Architectures," University of Virginia, Department of Computer Science, Charlottesville, VA, USA, Technical Report CS2009-14, 2009 . Google Scholar
    • Guy E. Blelloch, Siddhartha Chatterjee and Marco Zagha, Solving Linear Recurrences with Loop Raking, Proceedings of the 6th International Parallel Processing Symposium pp. 416–424. Google Scholar
    • Mark Harris. (2007) Optimizing parallel reduction in CUDA. [Online] , http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf . Google Scholar
    • Peter M. Kogge and Harold S. Stone, IEEE Trans. Comput. 22, 786 (1973). ISIGoogle Scholar
    • Duane Merrill and Andrew Grimshaw, "Revisiting Sorting for GPGPU Stream Architectures," University of Virginia, Charlottesville, VA, Technical Report CS2010-03, 2010 . Google Scholar
    • K. Thearling and S. Smith, An improved supercomputer sorting benchmark, Proceedings of the 1992 ACM/IEEE conference on Supercomputing (SC '92) pp. 14–19. Google Scholar
    • Daniel Cederman and Philippas Tsigas, J. Exp. Algorithmics 14, 1.4 (2009), DOI: 10.1145/1498698.1564500. Google Scholar
    • Nikolaj Leischner, Vitaly Osipov, and Peter Sanders, GPU sample sort, 2009 . Google Scholar
    • Frank Dehne and Hamidreza Zaboli, Deterministic Sample Sort For GPUs, 2010 . Google Scholar
    • Guy Blelloch, "Prefix Sums and Their Applications," School of Computer Science, Carnegie Mellon University, Technical Report CMU-CS-90-190, 1990 . Google Scholar
    • Thrust. [Online] , http://code.google.com/p/thrust/ . Google Scholar