World Scientific

METADATA RANKING AND PRUNING FOR FAILURE DETECTION IN GRIDS

    The objective of Grid computing is to make processing power as accessible and easy to use as electricity and water. The last decade has seen unprecedented growth in Grid infrastructures, which nowadays enable large-scale deployment of applications in the scientific computation domain. One of the main challenges in realizing the full potential of Grids is making these systems dependable.

    In this paper we present FailRank, a novel framework for integrating and ranking information sources that characterize failures in a Grid system. Once the failing sites have been ranked, they can be eliminated from the job scheduling resource pool, yielding a more predictable, dependable and adaptive infrastructure. We also present the tools we developed to evaluate the FailRank framework. In particular, we present the FailBase Repository, a 38 GB corpus of state information that characterizes the EGEE Grid for one month in 2007. Such a corpus paves the way for the community to systematically uncover new, previously unknown patterns and rules among the multitude of parameters that can contribute to failures in a Grid environment. Additionally, we present an experimental evaluation of the FailRank system over 30 days, which shows that our framework identifies failures in 93% of the cases and can achieve this by fetching only 65% of the available information sources. We believe that our work constitutes another important step towards realizing adaptive Grid computing systems.
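    The rank-then-exclude idea described above can be illustrated with a minimal sketch. The abstract does not specify the scoring function, so the weighted sum below is an assumption made only for illustration, as are the function names (`rank_sites`, `prune_pool`), the attribute names, and the sample values. The sketch also does not model fetching only a subset of the information sources.

```python
from typing import Dict, List, Tuple

# Hypothetical per-site metrics, each normalized to [0, 1]
# (higher means more failure-prone). Names and values are illustrative only.
Metrics = Dict[str, Dict[str, float]]


def rank_sites(metrics: Metrics, weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (site, score) pairs sorted from most to least failure-prone.

    The score is an assumed weighted sum of the site's attributes.
    """
    scores = {
        site: sum(weights[name] * value
                  for name, value in attrs.items() if name in weights)
        for site, attrs in metrics.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def prune_pool(metrics: Metrics, weights: Dict[str, float], k: int) -> List[str]:
    """Drop the k highest-ranked (most failure-prone) sites from the pool."""
    failing = {site for site, _ in rank_sites(metrics, weights)[:k]}
    return [site for site in metrics if site not in failing]


if __name__ == "__main__":
    metrics = {
        "siteA": {"failed_tests": 0.9, "queue_delay": 0.7},
        "siteB": {"failed_tests": 0.1, "queue_delay": 0.2},
        "siteC": {"failed_tests": 0.5, "queue_delay": 0.4},
    }
    weights = {"failed_tests": 0.6, "queue_delay": 0.4}
    print(rank_sites(metrics, weights))
    print(prune_pool(metrics, weights, k=1))
```

    With these sample values the most failure-prone site is removed and the remaining sites stay available to the scheduler; a real deployment would choose the attributes, weights, and cut-off from observed monitoring data.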

    This work is supported in part by the European Union under projects CoreGRID (#IST-2002-004265) and EGEE (#IST-2003-508833). A preliminary version of this paper appeared in [32] and [33]. The second author was supported by a CoreGRID REP Fellowship during 2008.

    References

    • D. Berndt and J. Clifford, Using Dynamic Time Warping to Find Patterns in Time Series, KDD (1994).
    • N. Bruno, L. Gravano and A. Marian, Evaluating Top-K Queries Over Web Accessible Databases, ICDE (2002).
    • G. Das, D. Gunopulos and H. Mannila, Finding Similar Time Series, PKDD (1997).
    • G. Chun et al., Benchmark probes for grid assessment, IEEE IPDPS (2004).
    • "CIC", http://cic.gridops.org/.
    • G. Da Costa, S. Orlando and M. D. Dikaiakos, Nine months in the life of EGEE: a look from the South, IEEE MASCOTS (2007).
    • C. Dumitrescu et al., DiPerF: An automated Distributed PERformance testing Framework, IEEE/ACM Grid (2004).
    • "EGEE", http://www.eu-egee.org/.
    • "Global Grid User Support (GGUS) ticketing", https://gus.fzk.de/pages/home.php.
    • "GridICE", http://grid.infn.it/gridice/.
    • J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. (Elsevier, 2006).
    • R. Fagin, A. Lotem and M. Naor, Optimal Aggregation Algorithms For Middleware, PODS (2001).
    • I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure (Elsevier, 2004).
    • I. Foster, C. Kesselman and S. Tuecke, Intl. J. Supercomputer Applications 15(3), 200 (2001), DOI: 10.1177/109434200101500302.
    • I. Foster, Globus Toolkit Version 4: Software for Service-Oriented Systems, ICNP'05.
    • "gLite middleware", http://glite.org/.
    • "Grid Statistics (GStat)", http://goc.grid.sinica.edu.tw/gstat/.
    • F. P. Junqueira and K. Marzullo, The virtue of dependent failures in multi-site systems, HotDep (2005).
    • E. Kiciman and A. Fox, IEEE Transactions on Neural Networks (2004).
    • E. Kiciman and L. Subramanian, Root Cause Localization in Large Scale Systems, HotDep (2005).
    • S. Krishnamurthy, W. H. Sanders and M. Cukier, A Dynamic Replica Selection Algorithm for Tolerating Timing Faults, DSN (2001).
    • M. E. Locasto, S. Sidiroglou and A. D. Keromytis, Application Communities: Using Monoculture for Dependability, HotDep (2005).
    • "OSG", http://www.opensciencegrid.org.
    • R. Raman, M. Livny and M. H. Solomon, Cluster Computing 2, 129 (1999), DOI: 10.1023/A:1019022624119.
    • "TeraGrid", http://www.teragrid.org/.
    • G. Tsouloupas and M. D. Dikaiakos, Journal of Parallel and Distributed Computing 67, 1029 (2007), DOI: 10.1016/j.jpdc.2007.04.009.
    • K. Neokleous et al., Failure Management in Grids: The Case of the EGEE Infrastructure, Parallel Processing Letters (2007).
    • G. Tsouloupas and M. D. Dikaiakos, Grid Resource Ranking using Low-level Performance Measurements, Euro-Par (2007).
    • M. Vlachos et al., Indexing multidimensional time-series with support for multiple distance measures, KDD (2003).
    • "WISDOM", http://wisdom.eu-egee.fr/.
    • "Service Availability Monitoring (SAM)", http://goc.grid.sinica.edu.tw/gocwiki/SAM.
    • D. Zeinalipour-Yazti et al., FailRank: Towards a Unified Grid Failure Monitoring and Ranking System, CoreGRID Workshop on Grid Programming Models and P2P Systems Architecture (CoreGRID 2007 Workshop) (2007) pp. 12–13.
    • D. Zeinalipour-Yazti et al., Identifying Failures in Grids through Monitoring and Ranking, The 7th IEEE International Symposium on Network Computing and Applications (IEEE NCA '08) (2008).