World Scientific
MPI ON MILLIONS OF CORES

    Petascale parallel computers with more than a million processing cores are expected to be available in a couple of years. Although MPI is the dominant programming interface today for large-scale systems, which at the highest end already have close to 300,000 processors, a challenging question for both researchers and users is whether MPI will scale to processor and core counts in the millions. In this paper, we examine the issue of scaling MPI to very large systems. We first examine the MPI specification itself, discuss areas with scalability concerns, and describe how they can be overcome. We then investigate issues that an MPI implementation must address in order to be scalable. To illustrate these issues, we ran a number of simple experiments to measure MPI memory consumption at scale on up to 131,072 processes, or 80% of the IBM Blue Gene/P system at Argonne National Laboratory. Based on the results, we identified nonscalable aspects of the MPI implementation and found ways to tune it to reduce its memory footprint. We also briefly discuss application scalability to large process counts and features of MPI that enable the use of other techniques to alleviate scalability limitations in applications.
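    The memory-footprint concern can be illustrated with a back-of-the-envelope calculation: any data structure that an MPI implementation or application keeps per peer process, such as a connection table entry or the count and displacement arrays required by irregular collectives like MPI_Alltoallv, grows linearly with the number of processes. The sketch below is illustrative only, not the measurement methodology used in the paper; the 4-bytes-per-peer figure is an assumption (one int per peer), and real implementations vary.

    ```python
    # Illustrative only: estimate memory for a single O(P) per-peer array,
    # such as the count array an application must pass to MPI_Alltoallv.
    # The 4 bytes/peer figure is an assumed size; real structures vary.

    def per_process_bytes(num_procs: int, bytes_per_peer: int = 4) -> int:
        """Memory one process needs for one O(P) array."""
        return num_procs * bytes_per_peer

    def total_bytes(num_procs: int, bytes_per_peer: int = 4) -> int:
        """Aggregate memory across all processes: grows as O(P^2)."""
        return num_procs * per_process_bytes(num_procs, bytes_per_peer)

    for p in (131_072, 1_000_000):
        per = per_process_bytes(p)
        tot = total_bytes(p)
        print(f"{p:>9} processes: {per / 2**20:6.2f} MiB/process, "
              f"{tot / 2**40:8.2f} TiB aggregate")
    ```

    At a million processes, even this single assumed 4-bytes-per-peer array costs about 4 MB on every process, and several such arrays quickly become significant on memory-constrained nodes like those of Blue Gene/P, which is why reducing O(P) per-process state matters for the memory-footprint tuning the paper describes.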

    This paper is an extended version of a paper titled "MPI on a Million Processors" presented at the 16th European PVM/MPI Users' Group Meeting in 2009 and published in volume 5759 of Lecture Notes in Computer Science, Springer-Verlag, 2009.
