World Scientific
Performance Evaluation of Multi-Core Intel Xeon Processors on Basic Linear Algebra Subprograms

    Multi-core technology is the natural next step in delivering the benefits of Moore's law to computing platforms. On multi-core processors, the performance of many applications can be improved by executing threads of code in parallel using multi-threading techniques. This paper evaluates the performance of multi-core Intel Xeon processors on the widely used Basic Linear Algebra Subprograms (BLAS). On two dual-core Intel Xeon processors with Hyper-Threading Technology, our results show that around 20 GFLOPS is achieved on Level-3 BLAS (matrix-matrix operations) using multi-threading, SIMD, matrix blocking, and loop unrolling. For small Level-2 (matrix-vector operations) and Level-1 (vector operations) BLAS problems, however, multi-threading slows execution down because of thread-creation overhead; for these kernels, the Intel SIMD instruction set is the way to improve single-threaded performance, reaching 6 GFLOPS on Level-2 and 3 GFLOPS on Level-1 BLAS. When the problem size grows too large to fit in the L2 cache, the performance of the four Xeon cores falls below 2 GFLOPS on Level-2 and 1 GFLOPS on Level-1 BLAS, even though eight threads execute in parallel on eight logical processors.
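    As a rough illustration of the matrix-blocking technique evaluated for Level-3 BLAS, the C sketch below shows a cache-blocked matrix multiply. It is a simplified single-threaded kernel, not the paper's implementation: the matrix dimension `N`, the block size `BS`, and the function name are illustrative assumptions, and a tuned kernel would add the SIMD intrinsics, loop unrolling, and multi-threading the abstract describes.

```c
#include <string.h>

#define N  64   /* matrix dimension (illustrative; assumed divisible by BS) */
#define BS 16   /* block size chosen so three BS x BS tiles fit in cache   */

/* Cache-blocked C = A * B for row-major N x N single-precision matrices.
 * Blocking the i/k/j loops keeps tiles of A, B, and C resident in cache,
 * which is the key idea behind high-performance Level-3 BLAS kernels. */
static void gemm_blocked(const float *A, const float *B, float *C)
{
    memset(C, 0, (size_t)N * N * sizeof(float));
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* multiply one BS x BS tile pair into the C tile */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        float a = A[i * N + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

    The tile loops (`ii`, `kk`, `jj`) reorder the computation so each BS x BS working set is reused from cache before being evicted, which is why Level-3 BLAS can stay compute-bound while Level-1 and Level-2 kernels, with no such data reuse, become memory-bound once the vectors exceed the L2 cache.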
