Exploiting ILP, TLP, and DLP to Improve Multi-Core Performance of One-Sided Jacobi SVD
Abstract
This paper shows how the performance of singular value decomposition (SVD) is enhanced through the exploitation of ILP, TLP, and DLP on Intel multi-core processors using superscalar execution, multi-threading computation, and streaming SIMD extensions, respectively. To facilitate the exploitation of TLP on multiple execution cores, the well-known cyclic one-sided Jacobi algorithm is restructured to work in parallel. On two dual-core Intel Xeon processors with hyper-threading technology running at 3.0 GHz, our results show that the multi-threaded implementation of one-sided Jacobi SVD gives about four times faster than the single-threaded superscalar implementation. Furthermore, the multi-threaded SIMD implementation speeds up the execution of single-threaded one-sided Jacobi by a factor of 10, which is close to the ideal speedup. On a reasonable large matrix size fitted in the L2 cache, our results show a performance of 11 GFLOPS (double-precision) is achieved on the target system through the exploitation of ILP, TLP, and DLP as well as memory hierarchy.
References
J. Hennessy and D. Patterson , Computer Architecture: A Quantitative Approach, 4th edn. (Morgan Kaufmann, San Francisco, CA, 2007). Google Scholar-
J. Mike , Superscalar Microprocessor Design ,Prentice Hall Series in Innovative Technology ( 1991 ) . Google Scholar - ACM SIGARCH Computer Architecture News 17, 272 (1989), DOI: 10.1145/68182.68207. Crossref, Google Scholar
S. Palacharla , N. Jouppi and J. Smith , Complexity-Effective Superscalar Processors, proc. 24th Annual International Symposium on Computer Architecture (1997) pp. 206–218. Google ScholarZ. Youssfi and M. Shanblatt , A New Technique to Exploit. Instruction-Level Parallelism for Reducing Microprocessor Power Consumption, proc. 6th International Conference on Electro/Information Technology (2006) pp. 119–124. Google Scholar- Intel Technology Journal 11, 217 (2007), DOI: 10.1535/itj.1103.05. ISI, Google Scholar
- Intel 64 and IA-32 Architectures Optimization Reference Manual, 2006, www.intel.com/products/processor/manuals/index.htm . Google Scholar
- J. Held, J. Bautista, and S. Koehl, From a Few Cores to Many: A Tera-scale Computing Research Overview, White Paper Research at Intel, http://download.intel.com/research/platform/terascale/terascale.overview paper.pdf . Google Scholar
- IEEE Transactions on Automatic Control 25, (1980). Google Scholar
-
G. Golub and C. Van Loan , Matrix Computations , 2nd edn. ( John Hopkins University Press , Baltimore and London , 1993 ) . Google Scholar -
E. Anderson , LAPACK User's Guide ( SIAM , Philadelphia , 1992 ) . Crossref, Google Scholar - http://www.intel.com/software/products/inkl/does/WebHelp/mkl.htm . Google Scholar
- SIAM Journal on Scientific and Statistical Computing 10, 3 (1989). ISI, Google Scholar
- ACM Transactions on Mathematical Software 6, 524 (1980), DOI: 10.1145/355921.355925. Crossref, ISI, Google Scholar
- SIAM Journal on Matrix Analysis and Applications 13, 1204 (1992), DOI: 10.1137/0613074. Crossref, ISI, Google Scholar
- SIAM Journal on Scientific and Statistical Computing 6, 69 (1985), DOI: 10.1137/0906007. Crossref, ISI, Google Scholar
- SIAM Journal on Scientific and Statistical Computing 7, 441 (1986), DOI: 10.1137/0907029. Crossref, ISI, Google Scholar
- Journal of Parallel and Distributed Computing 42, 1 (1997), DOI: 10.1006/jpdc.1997.1304. Crossref, ISI, Google Scholar
- Parallel Algorithms Application 13, 265 (1999). Crossref, Google Scholar
- Parallel Algorithms Application 14, 37 (1999). Crossref, Google Scholar
- Parallel Computing 28, 243 (2002), DOI: 10.1016/S0167-8191(01)00138-7. Crossref, ISI, Google Scholar
- Journal of Parallel and Distributed Computing 63, 638 (2003), DOI: 10.1016/S0743-7315(03)00007-8. Crossref, ISI, Google Scholar
- Parallel Algorithms and Applications 18, 49 (2003). Crossref, ISI, Google Scholar
- Computing 75, 27 (2005), DOI: 10.1007/s00607-004-0113-z. Crossref, ISI, Google Scholar
- V. Strumpen, H. Hoffmann, and A. Agarwal, A Stream Algorithm for the SVD, Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, MIT-LCS-TM-641, 2003 . Google Scholar
- Journal of Parallel and Distributed Computing 68, 769 (2008), DOI: 10.1016/j.jpdc.2007.12.003. Crossref, ISI, Google Scholar
M. Soliman , S. Rajasekaran and R. Ammar , A Block JRS Algorithm for Highly Parallel Computation of SVDs, proc. High Performance Computing Conference HPCC 2007 (2007) pp. 346–357. Google Scholar- Journal of the Society for Industrial and Applied Mathematics 6, 51 (1958), DOI: 10.1137/0106005. Crossref, Google Scholar
-
M. Jahre and L. Natvig , Performance Effects of a Cache Miss Handling Architecture in a Multi-core Processor , proc. Norwegian Informatics Conference NIK 2007 ( 2007 ) . Google Scholar - Intel 64 and IA-32 Architectures Software Developer's Manual: Basic Architecture, 1 (2006), www.intel.com/products/processor/manuals/index.htm . Google Scholar
- World Scientific Journals: Parallel Processing Letters 19, 159 (2009), DOI: 10.1142/S0129626409000134. Link, ISI, Google Scholar
-
J. Dongarra , The Sourcebook of Parallel Computing ( Morgan Kaufmann , 2002 ) . Google Scholar -
E. Athanasaki , Tuning Blocked Array Layouts to Exploit Memory Hierarchy in SMT architectures , proc. 10th Pan-Hellenic Conference in Informatics ( 2005 ) . Google Scholar -
R. Gerber , The Software Optimization Cookbook: High-Performance Recipes for IA-32 Platforms , 2nd edn. ( Intel PRESS , 2006 ) . Google Scholar -
A. Binstock and R. Gerber , Programming with Hyper-Threading Technology: How to Write Multithreaded Software For Intel IA-32 Processors ( Intel PRESS , 2003 ) . Google Scholar -
S. Akhter and J. Roberts , Multi-Core Programming: Increasing Performance through Software Multithreading ( Intel PRESS , 2006 ) . Google Scholar - Electronics 38, 114 (1965). Google Scholar


