World Scientific

PREVENTIVE MIGRATION VS. PREVENTIVE CHECKPOINTING FOR EXTREME SCALE SUPERCOMPUTERS

    An alternative to classical fault tolerance for large-scale clusters is failure avoidance, in which the occurrence of a fault is predicted and a preventive measure is taken before it strikes. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term, by orders of magnitude. In the longer term, however, both approaches have comparable merit, with a marginal advantage for preventive checkpointing. We also develop an analytical model of the performance of fault tolerance based on periodic checkpointing and compare this approach to both failure avoidance techniques. We find that this comparison is sensitive to the stochastic distribution of the time between failures, and that failure avoidance is likely inferior to fault tolerance in the long term. Regardless, our results show that each approach is likely to achieve poor utilization on large-scale platforms (e.g., 2^20 nodes) unless the mean time between failures is large. We show how bounding parallel job size improves utilization, but conclude that achieving good utilization on future large-scale platforms will require a combination of techniques.
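To give a feel for why utilization degrades at scale under periodic checkpointing, here is a minimal sketch. It is not the analytical model developed in the paper. It assumes independent, exponentially distributed node failures (so the platform MTBF is the node MTBF divided by the node count) and uses the classical first-order approximation of the optimal checkpoint interval, sqrt(2 * C * MTBF). The function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def platform_mtbf(node_mtbf_s: float, num_nodes: int) -> float:
    """Platform MTBF, assuming independent exponential node failures."""
    return node_mtbf_s / num_nodes


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal time between checkpoints: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Rough fraction of time lost to checkpoints plus expected rework.

    The approximation is only meaningful when the checkpoint cost is much
    smaller than the MTBF; the result is capped at 1.0 (no useful work).
    """
    t = young_interval(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / t + t / (2.0 * mtbf_s)
    return min(1.0, waste)


if __name__ == "__main__":
    # Hypothetical inputs: 10-year node MTBF, 10-minute checkpoint.
    node_mtbf_s = 10 * 365 * 24 * 3600.0
    checkpoint_cost_s = 600.0
    for exponent in (10, 14, 17, 20):
        nodes = 2 ** exponent
        mtbf = platform_mtbf(node_mtbf_s, nodes)
        waste = first_order_waste(checkpoint_cost_s, mtbf)
        print(f"2^{exponent} nodes: platform MTBF {mtbf:.0f} s, "
              f"estimated waste {waste:.2f}")
```

Under these assumed inputs the platform MTBF at 2^20 nodes falls below the checkpoint cost itself, so the first-order estimate saturates. That is one way to see why a large MTBF or bounded job sizes matter at this scale.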
