AN MPI PERFORMANCE MONITORING INTERFACE FOR CELL BASED COMPUTE NODES
Abstract
In this paper, we present a methodology for profiling parallel applications executing on the family of architectures commonly referred to as the "Cell" processor. Specifically, we examine Cell-centric MPI programs on hybrid clusters containing multiple Opteron and IBM PowerXCell 8i processors per node, such as those used in the petascale Roadrunner system. We analyze the performance of our approach on a PlayStation 3 console based on the Cell Broadband Engine (the CBE) as well as on an IBM BladeCenter QS22 based on the PowerXCell 8i. Our implementation incurs less than 0.5% overhead and 0.3 µs per profiler call for a typical molecular dynamics code on the Cell BE while efficiently utilizing the limited local store of the Cell's SPE cores. In our worst-case overhead analysis on the PowerXCell 8i, each profiler call costs 3.2 µs while using only two 5 KiB buffers. We demonstrate the use of our profiler on a cluster of hybrid nodes running a suite of scientific applications. Our analyses of inter-SPE communication (across the entire cluster) and function call patterns provide valuable information that can be used to optimize application performance.
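The abstract mentions that the profiler uses only two 5 KiB buffers in the SPE local store. One plausible way to use two small buffers is double buffering: events are appended to the active buffer, and when it fills, it is handed off for transfer to main memory while recording continues in the other. The following C sketch illustrates that idea only. It is not the authors' implementation: the record layout (`event_t`), the function names, and the `flush_buffer` stand-in (which merely counts events instead of issuing a DMA transfer) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_BYTES (5 * 1024)  /* 5 KiB per buffer, the size quoted in the abstract */

/* Hypothetical 8-byte event record; the real record format is not given. */
typedef struct {
    uint32_t event_id;   /* which profiled call occurred */
    uint32_t timestamp;  /* e.g. a timer/decrementer reading */
} event_t;

#define EVENTS_PER_BUF (BUF_BYTES / sizeof(event_t))  /* 640 records */

typedef struct {
    event_t buf[2][EVENTS_PER_BUF];  /* the two event buffers */
    size_t  count;                   /* records in the active buffer */
    int     active;                  /* index of the active buffer: 0 or 1 */
    size_t  flushes;                 /* number of buffer hand-offs */
    size_t  flushed_events;          /* total records handed off */
} profiler_t;

void profiler_init(profiler_t *p)
{
    memset(p, 0, sizeof *p);
}

/* Stand-in for an asynchronous transfer to main memory (assumption):
 * here we only account for the records that would be sent. */
void flush_buffer(profiler_t *p, int which, size_t n)
{
    (void)which;
    p->flushes++;
    p->flushed_events += n;
}

/* Record one event; swap buffers when the active one becomes full. */
void profiler_record(profiler_t *p, uint32_t id, uint32_t ts)
{
    assert(p->count < EVENTS_PER_BUF);
    event_t *e = &p->buf[p->active][p->count++];
    e->event_id = id;
    e->timestamp = ts;

    if (p->count == EVENTS_PER_BUF) {
        flush_buffer(p, p->active, p->count);
        p->active ^= 1;  /* continue recording in the other buffer */
        p->count = 0;
    }
}

/* Hand off any partially filled buffer at the end of the run. */
void profiler_finish(profiler_t *p)
{
    if (p->count > 0) {
        flush_buffer(p, p->active, p->count);
        p->count = 0;
    }
}
```

In a real SPE implementation the hand-off would need to wait for the previous transfer of a buffer to complete before that buffer is reused; the sketch omits this because its flush is synchronous.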


