Performance Evaluation of Multi-Core Intel Xeon Processors on Basic Linear Algebra Subprograms
Abstract
Multi-core technology is a natural next step in delivering the benefits of Moore's law to computing platforms. On multi-core processors, the performance of many applications can be improved by executing threads of code in parallel using multi-threading techniques. This paper evaluates the performance of multi-core Intel Xeon processors on the widely used Basic Linear Algebra Subprograms (BLAS). On two dual-core Intel Xeon processors with Hyper-Threading Technology, our results show that around 20 GFLOPS is achieved on Level-3 BLAS (matrix-matrix operations) by combining multi-threading, SIMD, matrix blocking, and loop unrolling. However, for small Level-2 (matrix-vector operations) and Level-1 (vector operations) problems, multi-threading slows execution down because of thread-creation overhead. For these cases, the Intel SIMD instruction set is the way to improve the performance of single-threaded Level-2 (6 GFLOPS) and Level-1 BLAS (3 GFLOPS). When the problem becomes too large to fit in the L2 cache, the performance of the four Xeon cores falls below 2 GFLOPS on Level-2 and below 1 GFLOPS on Level-1 BLAS, even though eight threads run in parallel on eight logical processors.
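The Level-3 result above is credited partly to matrix blocking and loop unrolling. The following is a minimal C sketch of those two techniques applied to C += A * B. It assumes square, row-major, double-precision matrices. The tile size BS = 64 and the function names dgemm_naive, dgemm_blocked, and dgemm_gflops are hypothetical illustrations, not the authors' code. The sketch leaves SIMD vectorization to the compiler and omits multi-threading entirely. The GFLOPS helper uses the standard count of 2n^3 floating-point operations for an n x n matrix multiply.

```c
#include <stddef.h>

/* Hypothetical tile size (an assumption, not from the paper): three
   BS x BS tiles of doubles take 3 * 64 * 64 * 8 bytes = 96 KiB, small
   enough to stay resident in a multi-megabyte L2 cache. */
#define BS 64

/* Reference triple loop: C += A * B, all n x n, row-major. */
void dgemm_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double a = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked (tiled) C += A * B: each BS x BS tile is reused from cache
   before moving on; the innermost loop is unrolled by 4. */
void dgemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS) {
                size_t imax = min_sz(ii + BS, n);
                size_t kmax = min_sz(kk + BS, n);
                size_t jmax = min_sz(jj + BS, n);
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        double a = A[i * n + k];
                        const double *b = &B[k * n];
                        double *c = &C[i * n];
                        size_t j = jj;
                        /* Four independent updates per iteration, which an
                           optimizing compiler can map to SIMD instructions. */
                        for (; j + 4 <= jmax; j += 4) {
                            c[j]     += a * b[j];
                            c[j + 1] += a * b[j + 1];
                            c[j + 2] += a * b[j + 2];
                            c[j + 3] += a * b[j + 3];
                        }
                        for (; j < jmax; j++) /* remainder columns */
                            c[j] += a * b[j];
                    }
            }
}

/* GFLOPS for an n x n x n multiply, which performs 2 n^3 flops. */
double dgemm_gflops(size_t n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
```

To obtain a throughput figure in the same unit as the abstract, time one call to dgemm_blocked and pass n and the elapsed seconds to dgemm_gflops. The blocked routine visits each (i, j) entry's k terms in the same order as the naive loop, so the two should agree; a size that is not a multiple of BS or 4 exercises the remainder paths.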


