LEVERAGING SHARED CACHES FOR PARALLEL TEMPORAL BLOCKING OF STENCIL CODES ON MULTICORE PROCESSORS AND CLUSTERS
Abstract
Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. Benchmark results are presented for three current x86-based microprocessors, showing clearly that our optimization works best on designs with high-speed shared caches and low memory bandwidth per core. We furthermore demonstrate that simple bandwidth-based performance models are inaccurate for this kind of algorithm and employ a more elaborate, synthetic modeling procedure. Finally we show that temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment, albeit with limited benefit at strong scaling.
References
-
B. Bergen , F. Hülsemann and U. Rüde , Is 1.7×1010 Unknowns the Largest Finite Element System that Can Be Solved Today? , Proceedings of the ACM/IEEE SC 2005 Conference (Supercomputing Conference '05 . Google Scholar -
K. Datta , Stencil Computation Optimization and Auto-tuning on State-of-the-art Multicore Architectures , Proc. SC2008 . Google Scholar - SIAM Review 51(1), 129 (2009), DOI: 10.1137/070693199. Crossref, ISI, Google Scholar
-
G. Wellein , Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization , Proc. COMPSAC 2009 . Google Scholar -
J. Treibig and G. Hager , Introducing a Performance Model for Bandwidth Limited Loop Kernels , Workshop on Memory Issues on Multi- and Manycore Platforms, Proc. PPAM 2009 , http://arxiv.org/abs/0905.0792 . Google Scholar -
D. Wonnacott , Using Time Skewing to Eliminate Idle Time due to Memory Bandwidth and Network Limitations , Proc. IPDPS 2000 . Google Scholar -
G. Jin , J. Mellor-Crummey and R. Fowler , Increasing temporal locality with skewing and recursive blocking , Proc. SC2001 . Google Scholar - M. Kowarschik: Data Locality Optimizations for Iterative Numerical Algorithms and Cellular Automata on Hierarchical Memory Architectures. Dissertation, SCS Publishing House, Erlangen (2004) , http://www10.informatik.uni-erlangen.de/Publications/Dissertations/Diss_Kowarschik_2004.pdf . Google Scholar
-
M. Frigo and V. Strumpen , Cache oblivious stencil computations , Proc. ICS 2005 . Google Scholar -
S. Kamil , Implicit and explicit optimizations for stencil computations , Proc. MSPC 2006 . Google Scholar - Progress in CFD 8(1–4), 179 (2008). Google Scholar
- J. Treibig, G. Hager, G. Wellein: Multi-core architectures: Complexities of performance prediction and the impact of cache topology. Under review , http://arxiv.org/abs/0910.4865 . Google Scholar
-
C. Ding and Y. He , A ghost cell expansion method for reducing communications in solving PDE problems , Proc. SC2001 . Google Scholar - M. Wittmann: Potentials of temporal blocking for stencil-based computations on multi-core systems. Master's Thesis, University of Applied Sciences Nuremberg, March 2009 , http://www.hpc.rrze.uni-erlangen.de/Projekte/stencil.shtml . Google Scholar


