World Scientific
  • Search
Skip main navigation

Cookies Notification

We use cookies on this site to enhance your user experience. By continuing to browse the site, you consent to the use of our cookies. Learn More
×
Our website is made possible by displaying certain online content using javascript.
In order to view the full content, please disable your ad blocker or whitelist our website www.worldscientific.com.

System Upgrade on Tue, Oct 25th, 2022 at 2am (EDT)

Existing users will be able to log into the site and access content. However, E-commerce and registration of new users may not be available for up to 12 hours.
For online purchase, please visit us again. Contact us at [email protected] for any enquiries.

Exploiting ILP, TLP, and DLP to Improve Multi-Core Performance of One-Sided Jacobi SVD

    This paper shows how the performance of singular value decomposition (SVD) is enhanced through the exploitation of ILP, TLP, and DLP on Intel multi-core processors using superscalar execution, multi-threading computation, and streaming SIMD extensions, respectively. To facilitate the exploitation of TLP on multiple execution cores, the well-known cyclic one-sided Jacobi algorithm is restructured to work in parallel. On two dual-core Intel Xeon processors with hyper-threading technology running at 3.0 GHz, our results show that the multi-threaded implementation of one-sided Jacobi SVD gives about four times faster than the single-threaded superscalar implementation. Furthermore, the multi-threaded SIMD implementation speeds up the execution of single-threaded one-sided Jacobi by a factor of 10, which is close to the ideal speedup. On a reasonable large matrix size fitted in the L2 cache, our results show a performance of 11 GFLOPS (double-precision) is achieved on the target system through the exploitation of ILP, TLP, and DLP as well as memory hierarchy.

    References

    • J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 4th edn. (Morgan Kaufmann, San Francisco, CA, 2007). Google Scholar
    • J.   Mike , Superscalar Microprocessor Design , Prentice Hall Series in Innovative Technology ( 1991 ) . Google Scholar
    • Norman P. Jouppi and David W. Wall, ACM SIGARCH Computer Architecture News 17, 272 (1989), DOI: 10.1145/68182.68207. CrossrefGoogle Scholar
    • S. Palacharla, N. Jouppi and J. Smith, Complexity-Effective Superscalar Processors, proc. 24th Annual International Symposium on Computer Architecture (1997) pp. 206–218. Google Scholar
    • Z. Youssfi and M. Shanblatt, A New Technique to Exploit. Instruction-Level Parallelism for Reducing Microprocessor Power Consumption, proc. 6th International Conference on Electro/Information Technology (2006) pp. 119–124. Google Scholar
    • S. Kumar, C. Hughes and A. Nguyen, Intel Technology Journal 11, 217 (2007), DOI: 10.1535/itj.1103.05. ISIGoogle Scholar
    • Intel 64 and IA-32 Architectures Optimization Reference Manual, 2006, www.intel.com/products/processor/manuals/index.htm . Google Scholar
    • J. Held, J. Bautista, and S. Koehl, From a Few Cores to Many: A Tera-scale Computing Research Overview, White Paper Research at Intel, http://download.intel.com/research/platform/terascale/terascale.overview paper.pdf . Google Scholar
    • V. Klema and A. Laub, IEEE Transactions on Automatic Control 25, (1980). Google Scholar
    • G.   Golub and C.   Van Loan , Matrix Computations , 2nd edn. ( John Hopkins University Press , Baltimore and London , 1993 ) . Google Scholar
    • E.   Anderson et al. , LAPACK User's Guide ( SIAM , Philadelphia , 1992 ) . CrossrefGoogle Scholar
    • http://www.intel.com/software/products/inkl/does/WebHelp/mkl.htm . Google Scholar
    • R. Schreibar and C. Van Loan, SIAM Journal on Scientific and Statistical Computing 10, 3 (1989). ISIGoogle Scholar
    • F. Luk, ACM Transactions on Mathematical Software 6, 524 (1980), DOI: 10.1145/355921.355925. Crossref, ISIGoogle Scholar
    • J. Demmel and K. Veselic, SIAM Journal on Matrix Analysis and Applications 13, 1204 (1992), DOI: 10.1137/0613074. Crossref, ISIGoogle Scholar
    • R. Brent and F. Luk, SIAM Journal on Scientific and Statistical Computing 6, 69 (1985), DOI: 10.1137/0906007. Crossref, ISIGoogle Scholar
    • R. Schreiber, SIAM Journal on Scientific and Statistical Computing 7, 441 (1986), DOI: 10.1137/0907029. Crossref, ISIGoogle Scholar
    • B. Zhou and R. Brent, Journal of Parallel and Distributed Computing 42, 1 (1997), DOI: 10.1006/jpdc.1997.1304. Crossref, ISIGoogle Scholar
    • M. Becka and M. Vajtersic, Parallel Algorithms Application 13, 265 (1999). CrossrefGoogle Scholar
    • M. Becka and M. Vajtersic, Parallel Algorithms Application 14, 37 (1999). CrossrefGoogle Scholar
    • M. Becka, G. Oksa and M. Vajtersic, Parallel Computing 28, 243 (2002), DOI: 10.1016/S0167-8191(01)00138-7. Crossref, ISIGoogle Scholar
    • B. Zhou and R. Brent, Journal of Parallel and Distributed Computing 63, 638 (2003), DOI: 10.1016/S0743-7315(03)00007-8. Crossref, ISIGoogle Scholar
    • G. Oksa and M. Vajtersic, Parallel Algorithms and Applications 18, 49 (2003). Crossref, ISIGoogle Scholar
    • V. Hari, Computing 75, 27 (2005), DOI: 10.1007/s00607-004-0113-z. Crossref, ISIGoogle Scholar
    • V. Strumpen, H. Hoffmann, and A. Agarwal, A Stream Algorithm for the SVD, Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, MIT-LCS-TM-641, 2003 . Google Scholar
    • S. Rajasekaran and M. Song, Journal of Parallel and Distributed Computing 68, 769 (2008), DOI: 10.1016/j.jpdc.2007.12.003. Crossref, ISIGoogle Scholar
    • M. Soliman, S. Rajasekaran and R. Ammar, A Block JRS Algorithm for Highly Parallel Computation of SVDs, proc. High Performance Computing Conference HPCC 2007 (2007) pp. 346–357. Google Scholar
    • M. Hestenes, Journal of the Society for Industrial and Applied Mathematics 6, 51 (1958), DOI: 10.1137/0106005. CrossrefGoogle Scholar
    • M.   Jahre and L.   Natvig , Performance Effects of a Cache Miss Handling Architecture in a Multi-core Processor , proc. Norwegian Informatics Conference NIK 2007 ( 2007 ) . Google Scholar
    • Intel 64 and IA-32 Architectures Software Developer's Manual: Basic Architecture, 1 (2006), www.intel.com/products/processor/manuals/index.htm . Google Scholar
    • M. Soliman, World Scientific Journals: Parallel Processing Letters 19, 159 (2009), DOI: 10.1142/S0129626409000134. Link, ISIGoogle Scholar
    • J.   Dongarra et al. , The Sourcebook of Parallel Computing ( Morgan Kaufmann , 2002 ) . Google Scholar
    • E.   Athanasaki et al. , Tuning Blocked Array Layouts to Exploit Memory Hierarchy in SMT architectures , proc. 10th Pan-Hellenic Conference in Informatics ( 2005 ) . Google Scholar
    • R.   Gerber et al. , The Software Optimization Cookbook: High-Performance Recipes for IA-32 Platforms , 2nd edn. ( Intel PRESS , 2006 ) . Google Scholar
    • A.   Binstock and R.   Gerber , Programming with Hyper-Threading Technology: How to Write Multithreaded Software For Intel IA-32 Processors ( Intel PRESS , 2003 ) . Google Scholar
    • S.   Akhter and J.   Roberts , Multi-Core Programming: Increasing Performance through Software Multithreading ( Intel PRESS , 2006 ) . Google Scholar
    • G. Moore, Electronics 38, 114 (1965). Google Scholar