HYBRID-PARALLEL SPARSE MATRIX-VECTOR MULTIPLICATION WITH EXPLICIT COMMUNICATION OVERLAP ON CURRENT MULTICORE-BASED SYSTEMS
Abstract
We evaluate optimized parallel sparse matrix-vector operations for several representative application areas on widespread multicore-based cluster configurations. First, the single-socket baseline performance is analyzed and modeled with respect to basic architectural properties of standard multicore chips. Beyond the single node, the performance of parallel sparse matrix-vector operations is often limited by communication overhead. Starting from the observation that standard MPI implementations do not hide communication cost behind nonblocking calls, we demonstrate that explicit overlap of communication and computation can be achieved by a dedicated communication thread, which may run on a virtual core. Moreover, we identify performance benefits of hybrid MPI/OpenMP programming due to improved load balancing, even without explicit communication overlap. We compare performance results for pure MPI, the widely used "vector-like" hybrid programming strategy, and explicit overlap on a modern multicore-based cluster and a Cray XE6 system.


