SYSTEMC IMPLEMENTATION AND PERFORMANCE EVALUATION OF A DECOUPLED GENERAL-PURPOSE MATRIX PROCESSOR

    Technological advances in IC manufacturing make it possible to integrate more and more functionality into a single chip. Modern processors have nearly one billion transistors on a single chip. Given the increasing complexity of today's systems, designs have to be modeled at a high level of abstraction before they are partitioned into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core in SystemC, a system-level modeling language. Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. As in vector architectures, the data computation unit is organized in parallel lanes. On these lanes, however, Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. To control the execution of vector/matrix instructions on the matrix core, this paper extends the well-known scoreboard technique. Furthermore, the performance of Mat-Core is evaluated on vector and matrix kernels. Our results show that a four-lane Mat-Core with matrix registers of size 4 × 4 (16 elements each), a queue size of 10, a start-up time of 6 clock cycles, and a memory latency of 10 clock cycles achieves about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, Givens, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
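
To illustrate why decoupling address generation from data computation through a queue can hide memory latency, the following is a minimal cycle-counting sketch in Python. It is not the paper's SystemC model. The function name `simulate_decoupled` and the modeling rules are assumptions made for illustration: one load may be issued per cycle, each element needs one compute cycle, and a queue slot is reserved when a load is issued. Only the queue size of 10 and the memory latency of 10 cycles are taken from the abstract.

```python
from collections import deque


def simulate_decoupled(n_elements, memory_latency, queue_size):
    """Count the cycles a toy decoupled unit needs to process n_elements.

    Hypothetical model (not the Mat-Core design): the address unit issues at
    most one load per cycle, and only if a queue slot can be reserved for it;
    the compute unit consumes at most one queued element per cycle.
    """
    in_flight = deque()  # arrival cycles of loads that were issued
    ready = 0            # elements waiting in the data queue
    issued = 0
    consumed = 0
    cycle = 0
    while consumed < n_elements:
        cycle += 1
        # Loads whose latency has elapsed arrive in the data queue.
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
            ready += 1
        # Data computation: consume one element if one is available.
        if ready > 0:
            ready -= 1
            consumed += 1
        # Address generation: issue a load if a queue slot is free.
        if issued < n_elements and len(in_flight) + ready < queue_size:
            in_flight.append(cycle + memory_latency)
            issued += 1
    return cycle


if __name__ == "__main__":
    # Queue deep enough to cover the latency: latency is paid only once.
    print(simulate_decoupled(100, memory_latency=10, queue_size=10))  # 110
    # A single slot: every element pays the full latency.
    print(simulate_decoupled(100, memory_latency=10, queue_size=1))   # 1001
```

In this toy model, a queue at least as deep as the memory latency lets the address unit run ahead, so the compute unit receives one element per cycle after the initial delay. With a single slot, each element waits for the full round trip.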
