World Scientific
  • Search
Skip main navigation

Cookies Notification

We use cookies on this site to enhance your user experience. By continuing to browse the site, you consent to the use of our cookies. Learn More
×
Our website is made possible by displaying certain online content using javascript.
In order to view the full content, please disable your ad blocker or whitelist our website www.worldscientific.com.

System Upgrade on Tue, Oct 25th, 2022 at 2am (EDT)

Existing users will be able to log into the site and access content. However, E-commerce and registration of new users may not be available for up to 12 hours.
For online purchase, please visit us again. Contact us at [email protected] for any enquiries.

Scaling Up Parallel Computation of Tiled QR Factorizations by a Distributed Scheduling Runtime System and Analytical Modeling

    Implementing parallel software for QR factorizations to achieve scalable performance on massively parallel manycore systems requires a comprehensive design that includes algorithm redesign, efficient runtime systems, synchronization and communication reduction, and analytical performance modeling. This paper presents a piece of tiled communication-avoiding QR factorization software that is able to scale efficiently for matrices with general dimensions. We design a tiled communication-avoiding QR factorization algorithm and implement it with a fully distributed dynamic scheduling runtime system to minimize both synchronization and communication. The whole class of communication-avoiding QR factorization algorithms uses an important parameter of D (i.e., the number of domains), whose best solution is still unknown so far and requires manual tuning and empirical searching to find it. To that end, we introduce a simplified analytical performance model to determine an optimal number of domains D. The experimental results show that our new parallel implementation is faster than a state-of-the-art multicore-based numerical library by up to 30%, and faster than ScaLAPACK by up to 30 times with thousands of CPU cores. Furthermore, using the new analytical model to predict an optimal number of domains is as competitive as exhaustive searching, and exhibits an average performance difference of 1%.

    References

    • 1. E. Anderson, Z. Bai and J. Dongarra, Generalized QR factorization and its applications, Linear Algebra and its Applications 162 :243–271, 1992. Crossref, ISIGoogle Scholar
    • 2. A. Björck, Numerical Methods for Least Squares Problems, SIAM, 1996. CrossrefGoogle Scholar
    • 3. J. W. Demmel, Applied Numerical Linear Algebra, SIAM, 1997. CrossrefGoogle Scholar
    • 4. E. Anderson, Z. Bai, C. Bischof, L. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney and D. Sorensen, LAPACK Users’ Guide, SIAM, 1992. Google Scholar
    • 5. L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet et al., ScaLAPACK Users’ Guide, SIAM, volume 4, 1997. CrossrefGoogle Scholar
    • 6. J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker and R. C. Whaley, ScaLAPACK: A portable linear algebra library for distributed memory computers: Design issues and performance, in Applied Parallel Computing Computations in Physics, Chemistry and Engineering Science, Springer, 1996, pages 95–106. CrossrefGoogle Scholar
    • 7. J. W. Demmel, L. Grigori, M. F. Hoemmen and J. Langou, Communication-optimal parallel and sequential QR and LU factorizations. LAPACK Working Note 204, UTK, August 2008. Google Scholar
    • 8. G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, N. Knight and H. D. Nguyen, Reconstructing householder vectors from tall-skinny QR, Journal of Parallel and Distributed Computing 85 :3–31, 2015. IPDPS 2014 Selected Papers on Numerical and Combinatorial Algorithms. Crossref, ISIGoogle Scholar
    • 9. M. Hoemmen, A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method, in 2011 IEEE International Parallel Distributed Processing Symposium (IPDPS ), May 2011, pages 966–977. Google Scholar
    • 10. F. Song, H. Ltaief, B. Hadri and J. Dongarra, Scalable tile communication-avoiding QR factorization on multicore cluster systems, in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10 ), 2010, pages 1–11. Google Scholar
    • 11. W. M. Gentleman, Row elimination for solving sparse linear systems and least squares problems, Numer. Anal., Lect Notes Math, 506 :122–133, 1976. CrossrefGoogle Scholar
    • 12. A. Pothen and P. Raghavan, Distributed orthogonal factorization: Givens and Householder algorithms, SIAM Journal on Scientific and Statistical Computing, 10 :1113–1134, 1989. Crossref, ISIGoogle Scholar
    • 13. M. Mohiyuddin, M. Hoemmen, J. Demmel and K. Yelick, Minimizing communication in sparse matrix solvers, in Conference on High Performance Computing Networking, Storage and Analysis (SC’09 ), page 36. ACM, 2009. Google Scholar
    • 14. A. Khabou, J. W. Demmel, L. Grigori and M. Gu, LU factorization with panel rank revealing pivoting and its communication avoiding version, SIAM Journal on Matrix Analysis and Applications 34(3) :1401–1429, 2013. Crossref, ISIGoogle Scholar
    • 15. aRSEC. http://icl.utk.edu/parsec. Google Scholar
    • 16. F. Song, A. YarKhan and J. Dongarra, Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems, in Proceedings of the 2009 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’09), pages 1–11. IEEE, 2009. Google Scholar
    • 17. F. Song, S. Tomov and J. Dongarra, Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems, in Proceedings of the International Conference on Supercomputing, ICS’12, pages 1–11. ACM, 2012. Google Scholar
    • 18. F. Song and J. Dongarra, A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems, Concurrency and Computation : Practice and Experience 27(14) :3702–3723, 2015. Crossref, ISIGoogle Scholar
    • 19. E. Agullo, L. Giraud, A. Guermouche, S. Nakov and J. Roman, Task-based conjugate gradient: from multi-GPU towards heterogeneous architectures, in European Conference on Parallel Processing, Springer, 2016, pages 69–82. Google Scholar
    • 20. S. Zhuang and M. Casas, Iteration-fusing conjugate gradient, in Proceedings of the International Conference on Supercomputing, page 21. ACM, 2017. Google Scholar
    • 21. BigRedII. https://kb.iu.edu/d/bcqt. Google Scholar
    • 22. Comet. https://portal.xsede.org/sdsc-comet. Google Scholar