World Scientific

Chapter 8: Using Software Aging Monitoring and Rejuvenation for the Assessment of High-Availability Systems

    Background: In this chapter, we present the application of software aging monitoring and software rejuvenation for the assessment of high-availability systems. In high-availability systems, the metric of interest is the transient performability during system recovery, also referred to as “survivability”. A survivability assessment requires the definition of the failure model. In addition, extensive testing using loads and configurations that are able to model the conditions customers encounter in production is required.
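
As a rough illustration of survivability as a transient measure, the sketch below computes the probability that service has been restored within a given time after a failure. It assumes a hypothetical two-stage recovery (an exponential detection stage followed by an exponential repair stage); the function name `prob_recovered_by` and the rates are illustrative assumptions, not the chapter's model or measurements.

```python
import math

def prob_recovered_by(t_s, detection_rate, repair_rate):
    """P(service restored within t_s seconds of a failure).

    Assumes (hypothetically) that recovery is an exponential detection
    stage followed by an exponential repair stage, i.e. a hypoexponential
    recovery time. Rates are in 1/s and must differ.
    """
    if detection_rate == repair_rate:
        raise ValueError("this closed form needs distinct rates")
    d, m = detection_rate, repair_rate
    return 1.0 - (m * math.exp(-d * t_s) - d * math.exp(-m * t_s)) / (m - d)

# Illustrative rates only: mean detection 30 s, mean repair 5 s.
curve = [(t, prob_recovered_by(t, 1 / 30, 1 / 5)) for t in (0, 30, 60, 120)]
```

Plotting `curve` against time would give one possible survivability curve; a fuller treatment would weight each state by a performance reward rather than tracking only whether service is restored.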

    Aim: We describe the application of an agile devops methodology leveraging a failure model that incorporates aging-related failures. This agile devops methodology has been developed to integrate the failure reporting from production (i.e., ticket history), the Markov chain design, the performance test case design, the performance test case execution, the performance check and the decision to release the software.

    Applicability domain: The domain of applicability of this study is mission-critical systems that employ high-availability strategies, such as software component hosting supporting open-source development, media streaming hardware and software supporting high-volume media processing, and online banking. Continuous integration, testing and operations are a key part of building software in the new devops paradigm.

    Method: Our method involves the following steps, which embrace development and operations. Each step is based on the output of its predecessor: 1) an analysis of ticket history generated by operations; 2) a Markov chain design derived from ticket history; 3) a performance test case design based on Markov chain analysis; 4) a performance test case execution for each software version; 5) a Markov chain parameterization based on test case results; 6) an evaluation of performance metrics of interest using the parameterized Markov chain; 7) performance checks; and 8) a new software release delivered to operations. The development team receives feedback on performance issues and bugs, and provides new releases for performance checking and testing.
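
Steps 5 to 7 of the method can be sketched as follows. This is a minimal sketch under stated assumptions, not the chapter's actual model: it uses a hypothetical three-state continuous-time Markov chain (up, failed but undetected, recovering), estimates each rate as the reciprocal of a mean measured duration, and checks steady-state availability against an assumed release threshold. The sample durations, state names and the 0.95 target are all invented for illustration.

```python
from statistics import mean

def estimate_rate(durations_s):
    """Estimate an exponential transition rate (1/s) as 1 / mean duration."""
    return 1.0 / mean(durations_s)

def build_generator(failure_rate, detection_rate, recovery_rate):
    """Generator matrix for the hypothetical chain
    0 = up -> 1 = failed, undetected -> 2 = recovering -> 0 = up."""
    return [
        [-failure_rate, failure_rate, 0.0],
        [0.0, -detection_rate, detection_rate],
        [recovery_rate, 0.0, -recovery_rate],
    ]

def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gaussian elimination."""
    n = len(Q)
    # Transpose Q, then replace the last balance equation by normalization.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(A[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# Step 5: parameterize from (invented) test measurements, in seconds.
Q = build_generator(
    estimate_rate([900, 1000, 1100]),  # time to failure under test load
    estimate_rate([25, 30, 35]),       # time to detect the failure
    estimate_rate([4, 5, 6]),          # time to recover once detected
)
# Step 6: evaluate the metric of interest.
availability = steady_state(Q)[0]
# Step 7: performance check against an assumed target.
release_ok = availability >= 0.95
```

For this cyclic chain the result can be checked by hand: availability equals the mean time to failure divided by the mean cycle length, 1000 / (1000 + 30 + 5).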

    Results: We present extensive measurements from a large industrial system, as well as a list of test cases identified from the high-availability Markov chain. These test cases were executed using the industrial system under study in this research, and the obtained test results are presented. These results were used in the Markov model parameterization.

    Lessons learned: Several high-availability strategies, such as automated load balancing and software rejuvenation, require that failures be detected efficiently. We found that designs built for fast failure recovery must also provide fast failure detection, because in our experiments failure recovery rates were dominated by the failure detection time. We believe this finding can help high-availability system architects select which software aging and rejuvenation features to add to their designs.
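
To see why detection time can dominate the recovery rate, consider the small worked sketch below. It assumes, as a simplification, that total downtime per failure is the sum of a detection time and a repair time; the helper `recovery_breakdown` and the 30 s and 5 s figures are hypothetical, not measurements from the chapter.

```python
def recovery_breakdown(detection_s, repair_s):
    """Split mean downtime per failure into detection and repair parts.

    Assumes downtime = detection time + repair time (a simplification).
    """
    total = detection_s + repair_s
    return {
        "downtime_s": total,
        "effective_recovery_rate_per_s": 1.0 / total,
        "detection_share": detection_s / total,
    }

# Invented baseline: 30 s to detect a failure, 5 s to repair it.
baseline = recovery_breakdown(30.0, 5.0)
faster_repair = recovery_breakdown(30.0, 2.5)     # halve repair time
faster_detection = recovery_breakdown(15.0, 5.0)  # halve detection time
```

With these illustrative numbers, halving the repair time shortens downtime from 35 s to 32.5 s, while halving the detection time shortens it to 20 s.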