ASYMPTOTIC PEAK UTILISATION IN HETEROGENEOUS PARALLEL CPU/GPU PIPELINES: A DECENTRALISED QUEUE MONITORING STRATEGY
Abstract
Widespread heterogeneous parallelism is unavoidable given the emergence of general-purpose computing on graphics processing units (GPGPU). The characteristics of a graphics processing unit (GPU), including significant memory transfer latency and complex performance behaviour, demand new approaches to ensuring that all available computational resources are efficiently utilised. This paper considers the simple case of a divisible workload based on widely used numerical linear algebra routines, together with the challenges that prevent a naive single-program, multiple-data (SPMD) application that uses the GPU as an accelerator from making efficient use of all available resources. We suggest a possible queue monitoring strategy that aims to balance CPU/GPU utilisation for applications that fit the pipeline-parallel architectural pattern on heterogeneous multicore/multi-node CPU and GPU systems. We also propose a stochastic allocation technique that may serve as a foundation for heuristic approaches to balancing CPU/GPU workloads.
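To make the idea of stochastic allocation driven by queue monitoring concrete, the following is a minimal sketch and not the method of the paper. It assumes that each device exposes its current queue length and a rough relative service rate, and it picks a device at random with a weight of rate / (1 + backlog). The function names, the weighting rule, and the example rates are all hypothetical.

```python
import random


def allocation_probabilities(queue_lengths, service_rates):
    """Return per-device selection probabilities.

    Hypothetical weighting (an assumption, not taken from the paper):
    weight = service_rate / (1 + backlog), so a fast device with a
    short queue is favoured over a slow or heavily loaded one.
    """
    weights = {
        device: service_rates[device] / (1 + queue_lengths[device])
        for device in queue_lengths
    }
    total = sum(weights.values())
    return {device: w / total for device, w in weights.items()}


def choose_device(queue_lengths, service_rates, rng=random):
    """Draw one device at random according to the probabilities above."""
    probs = allocation_probabilities(queue_lengths, service_rates)
    devices = list(probs)
    return rng.choices(devices, weights=[probs[d] for d in devices], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    queues = {"cpu": 0, "gpu": 0}
    rates = {"cpu": 1.0, "gpu": 4.0}  # assumed relative throughputs
    for _ in range(20):
        device = choose_device(queues, rates, rng)
        queues[device] += 1  # enqueue one work chunk on the chosen device
    print(queues)
```

Because the choice is random rather than greedy, a slower device still receives some work, which is one simple way to keep both the CPU and the GPU busy when rate estimates are imprecise.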
This work has been partially funded by the European Commission FP7 through the project ParaPhrase: Parallel Patterns for Adaptive Heterogeneous Multicore Systems, under contract no. 288570 (10/2011 to 9/2014).


