World Scientific
  • Search
Skip main navigation

Cookies Notification

We use cookies on this site to enhance your user experience. By continuing to browse the site, you consent to the use of our cookies. Learn More
×
Our website is made possible by displaying certain online content using javascript.
In order to view the full content, please disable your ad blocker or whitelist our website www.worldscientific.com.

System Upgrade on Tue, Oct 25th, 2022 at 2am (EDT)

Existing users will be able to log into the site and access content. However, E-commerce and registration of new users may not be available for up to 12 hours.
For online purchase, please visit us again. Contact us at [email protected] for any enquiries.
Special Issue on High-Level Programming for Heterogeneous and Hierarchical Parallel Systems; Guest-editors: Gaétan Hains and Frédéric Gava (LACL, Université Paris-Est, France), and Kevin Hammond (University of St. Andrews, UK)No Access

AUTO-TUNING PARALLEL SKELETONS

    Parallel skeletons are a structured parallel programming abstraction that provide programmers with a predefined set of algorithmic templates that can be combined, nested and parameterized with sequential code to produce complex programs. The implementation of these skeletons is currently a manual process, requiring human expertise to choose suitable implementation parameters that provide good performance. This paper presents an empirical exploration of the optimization space of the FastFlow parallel skeleton framework. We performed this using a Monte Carlo search of a random subset of the space, for a representative set of platforms and programs. The results show that the space is program and platform dependent, non-linear, and that automatic search achieves a significant average speedup in program execution time of 1.6× over a human expert. An exploratory data analysis of the results shows a linear dependence between two of the parameters, and that another two parameters have little effect on performance. These properties are then used to reduce the size of the space by a factor of 6, reducing the cost of the search. This provides a starting point for automatically optimizing parallel skeleton programs without the need for human expertise, and with a large improvement in execution time compared to that achievable using human expert tuning.

    References

    • K. Asanovicet al., CACM 52, 56 (2009). Crossref, ISIGoogle Scholar
    • M.   Cole , Algorithmic Skeletons: Structured Management of Parallel Computation ( MIT Press , 1991 ) . Google Scholar
    • M. Cole, Parallel Computing 30, 389 (2004). Crossref, ISIGoogle Scholar
    • H. González-Vélez and M. Leyton, Software: Practice and Experience 40, 1135 (2010). Crossref, ISIGoogle Scholar
    • M. Aldinucci and M. Danelutto, Stream parallel skeleton optimization, Proc. Parallel and Distributed Computing and Systems (1999) pp. 955–962. Google Scholar
    • D. Caromel and M. Leyton, Fine tuning algorithmic skeletons, EuroPar (2007) pp. 72–81. Google Scholar
    • M. Aldinucci, M. Torquati and M. Meneghin, FastFlow: Efficient parallel streaming applications on multi-core, Tech. Rep. TR-09-12, Dipartimento di Informatica, Università di Pisa (2009) . Google Scholar
    • M. Aldinucci and M. Torquati, FastFlow website (2011) , http://calvados.di.unipi.it/dokuwiki/doku.php?id=ffnamespace:about . Google Scholar
    • M.   Aldinucci et al. , Efficient streaming applications on multi-core with FastFlow: the biosequence alignment test-bed , Proc. Int. Conf. Parallel Computing ( 2009 ) . Google Scholar
    • M. Aldinucci, M. Meneghin and M. Torquati, Efficient smith-waterman on multi-core with FastFlow, Proc. Euromicro Conf. on Parallel, Distributed and Network-based Processing (2010) pp. 195–199. Google Scholar
    • M. Aldinucci, S. Ruggieri and M. Torquati, Porting decision tree algorithms to multicore using FastFlow, Proc. Euro. Conf. on Machine Learning and Knowledge Discovery in Databases: Part I (2010) pp. 7–23. Google Scholar
    • M. Torquati, Single-producer/single-consumer queue on shared cache multi-core systems, Tech. Rep. TR-10-20, Dipartimento di Informatica, Università di Pisa (2010) . Google Scholar
    • A. Georges, D. Buytaert and L. Eeckhout, Statistically rigorous java performance evaluation, OOPSLA (2007) pp. 57–76. Google Scholar
    • V.   Barnett and T.   Lewis , Outliers in Statistical Data ( Wiley-Blackwell , 1994 ) . Google Scholar
    • H.   Abdi , The Bonferroni and Sidák corrections for multiple comparisons ( Sage , 2007 ) . Google Scholar
    • C.   Bishop , Pattern Recognition and Machine Learning ( Springer , 2006 ) . Google Scholar
    • Z. Wang and M. O'Boyle, Partitioning streaming parallelism for multi-cores: A machine learning based approach, in PACT (2010), 307–318 . Google Scholar
    • M. Christen, O. Schenk and H. Burkhart, Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures, PDPS (2011) pp. 676–687. Google Scholar
    • U. Dastgeer, J. Enmyren and C. W. Kessler, Auto-tuning SkePU: a multi-backend skeleton programming framework for multi-gpu systems, Proc. Int. Works. Multicore software engineering (2011) pp. 25–32. Google Scholar
    • G. Contreras and M. Martonosi, Characterising and improving the performance of intel threading building blocks, IISWC (2008) pp. 57–66. Google Scholar
    • Z. Wang and M. O'Boyle, Mapping parallelism to multi-cores: a machine learning based approach, PPoPP (2009) pp. 75–84. Google Scholar