Scaling Up Parallel Computation of Tiled QR Factorizations by a Distributed Scheduling Runtime System and Analytical Modeling
Abstract
Implementing parallel QR factorization software that achieves scalable performance on massively parallel manycore systems requires a comprehensive design: algorithm redesign, an efficient runtime system, reduced synchronization and communication, and analytical performance modeling. This paper presents tiled communication-avoiding QR factorization software that scales efficiently for matrices of general dimensions. We design a tiled communication-avoiding QR factorization algorithm and implement it with a fully distributed dynamic-scheduling runtime system to minimize both synchronization and communication. The whole class of communication-avoiding QR factorization algorithms depends on an important parameter D (the number of domains), whose optimal value is not known a priori and has so far required manual tuning and empirical search. To that end, we introduce a simplified analytical performance model that determines an optimal number of domains D. Experimental results show that our new parallel implementation is up to 30% faster than a state-of-the-art multicore numerical library, and up to 30 times faster than ScaLAPACK, on thousands of CPU cores. Furthermore, using the new analytical model to predict an optimal number of domains is as competitive as exhaustive search, exhibiting an average performance difference of 1%.
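The role of the domain parameter D in communication-avoiding QR can be illustrated with a minimal, single-node sketch: the matrix is split into D block-row domains, each domain is factored independently, and the local R factors are combined in a reduction step. This is an illustrative NumPy sketch of the general TSQR idea under simplifying assumptions (one reduction level, evenly divisible rows), not the paper's distributed tiled implementation; the function name `tsqr_r` is hypothetical.

```python
import numpy as np

def tsqr_r(A, D):
    """Compute the R factor of a tall-skinny matrix A via a one-level
    TSQR-style reduction over D row domains (illustrative sketch).

    Assumes A's row count is divisible by D and each domain block is
    still tall (at least as many rows as columns).
    """
    m, n = A.shape
    assert m % D == 0 and m // D >= n, "each domain must stay tall-skinny"

    # Step 1: independent local QR factorizations, one per domain.
    # In a distributed setting these run in parallel with no communication.
    local_Rs = [np.linalg.qr(block, mode='r') for block in np.split(A, D)]

    # Step 2: reduction -- stack the D small (n x n) R factors and factor
    # once more. Only these small R factors would need to be communicated.
    R = np.linalg.qr(np.vstack(local_Rs), mode='r')
    return R
```

A larger D exposes more parallelism in step 1 but enlarges the stacked matrix in the reduction step, which is the trade-off the paper's analytical model navigates when choosing D.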