Elastic Parallel Systems for High Performance Cloud Computing: State-of-the-Art and Future Directions
Abstract
With on-demand access to compute resources, pay-per-use, and elasticity, the cloud has evolved into an attractive execution environment for High Performance Computing (HPC). Although elasticity, often cited as the most beneficial cloud-specific property, has been heavily used in interactive (multi-tier) applications, elasticity-related research in the HPC domain is still in its infancy. Existing parallel computing theory and the traditional metrics used to evaluate parallel systems analytically do not comprehensively consider elasticity, i.e., the ability to control the number of processing units at runtime. To address these issues, we introduce a conceptual framework for understanding elasticity in the context of parallel systems, define the term elastic parallel system, and discuss novel metrics for both elasticity control at runtime and the ex-post performance evaluation of elastic parallel systems. Based on this conceptual framework, we provide an in-depth analysis of existing research to describe the state of the art, and we compile our findings into an agenda for future research on elastic parallel systems.


