World Scientific
CacheFlow: Cache Optimizations for Data Driven Multithreading

    Data-Driven Multithreading is a non-blocking multithreading model of execution that provides effective latency tolerance by allowing the computation processor to do useful work while a long-latency event is in progress. With the Data-Driven Multithreading model, a thread is scheduled for execution only if all of its inputs have been produced and placed in the processor's local memory. Data-driven sequencing leads to irregular memory access patterns that could negatively affect cache performance. Nevertheless, it enables the implementation of short-term optimal cache management policies. This paper presents the implementation of CacheFlow, an optimized cache management policy that eliminates the side effects of the loss of locality caused by data-driven sequencing and further reduces cache misses. CacheFlow employs thread-based prefetching to preload the data blocks of threads deemed executable. Simulation results for nine scientific applications on a 32-node Data-Driven Multithreaded machine show that the average speedup improves from 19.8 to 22.6. Two techniques to further improve the performance of CacheFlow, conflict avoidance and thread reordering, are proposed and tested. Simulation experiments have shown speedup improvements of 24% and 32%, respectively. The average speedup for all applications on a 32-node machine with both optimizations is 26.1.
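
    The scheduling rule and the prefetching idea described above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the CacheFlow implementation: the class `ToyDDMScheduler`, its fields, the helper `build_example`, the thread and block names, and the unbounded set used in place of a real cache are all hypothetical. It shows only that a thread fires when its count of missing inputs reaches zero, and that preloading the blocks of a thread at the moment it becomes executable can remove misses it would otherwise take when it runs.

```python
from collections import deque


class ToyDDMScheduler:
    """Toy data-driven scheduler (hypothetical, for illustration only)."""

    def __init__(self, prefetch):
        self.prefetch = prefetch
        self.ready_count = {}        # thread id -> inputs still missing
        self.consumers = {}          # thread id -> threads fed by its output
        self.blocks = {}             # thread id -> data blocks it will access
        self.firing_queue = deque()  # threads deemed executable
        self.cache = set()           # assumption: unbounded cache, no evictions
        self.misses = 0

    def add_thread(self, tid, n_inputs, consumers=(), blocks=()):
        self.ready_count[tid] = n_inputs
        self.consumers[tid] = list(consumers)
        self.blocks[tid] = list(blocks)
        if n_inputs == 0:
            self._make_ready(tid)

    def _make_ready(self, tid):
        # Thread-based prefetching: preload the blocks of a thread as soon
        # as all of its inputs are available, before it starts executing.
        if self.prefetch:
            self.cache.update(self.blocks[tid])
        self.firing_queue.append(tid)

    def run(self):
        order = []
        while self.firing_queue:
            tid = self.firing_queue.popleft()
            # Count blocks that are not yet cached when the thread executes.
            self.misses += sum(1 for b in self.blocks[tid] if b not in self.cache)
            self.cache.update(self.blocks[tid])  # on-demand fill after a miss
            order.append(tid)
            # Completing a thread produces one input for each consumer.
            for c in self.consumers[tid]:
                self.ready_count[c] -= 1
                if self.ready_count[c] == 0:
                    self._make_ready(c)
        return order


def build_example(prefetch):
    # Hypothetical graph: t1 and t2 have no inputs and both feed t3.
    s = ToyDDMScheduler(prefetch)
    s.add_thread("t1", 0, consumers=["t3"], blocks=["A"])
    s.add_thread("t2", 0, consumers=["t3"], blocks=["B"])
    s.add_thread("t3", 2, blocks=["A", "C"])
    return s


if __name__ == "__main__":
    for flag in (False, True):
        sched = build_example(flag)
        print(flag, sched.run(), sched.misses)
```

    In this toy graph, the run without prefetching counts three misses and the run with prefetching counts none, while the execution order is the same in both. The sketch ignores cache capacity, conflicts, and prefetch timing, which are the issues that the conflict avoidance and thread reordering techniques mentioned above would address.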
