METADATA RANKING AND PRUNING FOR FAILURE DETECTION IN GRIDS
Abstract
The objective of Grid computing is to make processing power as accessible and easy to use as electricity and water. The last decade has seen unprecedented growth in Grid infrastructures, which nowadays enable large-scale deployment of applications in the scientific computation domain. One of the main challenges in realizing the full potential of Grids is making these systems dependable.
In this paper we present FailRank, a novel framework for integrating and ranking information sources that characterize failures in a Grid system. Once the failing sites have been ranked, they can be eliminated from the job scheduling resource pool, yielding a more predictable, dependable and adaptive infrastructure. We also present the tools we developed to evaluate the FailRank framework. In particular, we present the FailBase Repository, a 38GB corpus of state information that characterizes the EGEE Grid for one month in 2007. Such a corpus paves the way for the community to systematically uncover new, previously unknown patterns and rules among the multitude of parameters that can contribute to failures in a Grid environment. Additionally, we present an experimental evaluation of the FailRank system over 30 days, which shows that our framework identifies failures in 93% of the cases while fetching only 65% of the available information sources. We believe that our work constitutes another important step towards realizing adaptive Grid computing systems.
This work is supported in part by the European Union under projects CoreGRID (#IST-2002-004265) and EGEE (#IST-2003-508833). A preliminary version of this paper appeared in [32] and [33]. The second author was supported by a CoreGRID REP Fellowship during 2008.
References
- Berndt D., Clifford J., "Using Dynamic Time Warping to Find Patterns in Time Series", In KDD 1994.
- Bruno N., Gravano L. and Marian A., "Evaluating Top-K Queries Over Web Accessible Databases", In ICDE 2002.
- G. Das, D. Gunopulos and H. Mannila, Finding Similar Time Series, PKDD (1997).
- G. Chun, Benchmark probes for grid assessment, IEEE IPDPS (2004).
- "CIC", http://cic.gridops.org/
- G. Da Costa, S. Orlando and M. D. Dikaiakos, Nine months in the life of EGEE: a look from the South, IEEE MASCOTS (2007).
- C. Dumitrescu, DiPerF: An automated Distributed PERformance testing Framework, IEEE/ACM Grid (2004).
- "EGEE", http://www.eu-egee.org/
- "Global Grid User Support (GGUS) ticketing", https://gus.fzk.de/pages/home.php
- "GridICE", http://grid.infn.it/gridice/
- J. Han and M. Kamber, "Data Mining: Concepts and Techniques", 2E (Elsevier, 2006).
- R. Fagin, A. Lotem and M. Naor, Optimal Aggregation Algorithms For Middleware, PODS (2001).
- I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure (Elsevier, 2004).
- Intl. J. Supercomputer Applications 15(3), 200 (2001), DOI: 10.1177/109434200101500302.
- I. Foster, Globus Toolkit Version 4: Software for Service-Oriented Systems, ICNP'05.
- gLite middleware, http://glite.org/
- Grid Statistics (GStat), http://goc.grid.sinica.edu.tw/gstat/
- F. P. Junqueira and K. Marzullo, The virtue of dependent failures in multi-site systems, HotDep (2005).
- IEEE Transactions on Neural Networks (2004).
- E. Kiciman and L. Subramanian, Root Cause Localization in Large Scale Systems, HotDep (2005).
- S. Krishnamurthy, W. H. Sanders and M. Cukier, A Dynamic Replica Selection Algorithm for Tolerating Timing Faults, DSN (2001).
- M. E. Locasto, S. Sidiroglou and A. D. Keromytis, Application Communities: Using Monoculture for Dependability, HotDep (2005).
- "OSG", http://www.opensciencegrid.org
- Cluster Computing 2, 129 (1999), DOI: 10.1023/A:1019022624119.
- "TeraGrid", http://www.teragrid.org/
- Journal of Parallel and Distributed Computing 67, 1029 (2007), DOI: 10.1016/j.jpdc.2007.04.009.
- K. Neokleous, Failure Management in Grids: The Case of the EGEE Infrastructure, Parallel Processing Letters (2007).
- G. Tsouloupas and M. D. Dikaiakos, Grid Resource Ranking using Low-level Performance Measurements, Euro-Par (2007).
- M. Vlachos, Indexing multidimensional time-series with support for multiple distance measures, KDD (2003).
- "WISDOM", http://wisdom.eu-egee.fr/
- "Service Availability Monitoring (SAM)", http://goc.grid.sinica.edu.tw/gocwiki/SAM
- D. Zeinalipour-Yazti, FailRank: Towards a Unified Grid Failure Monitoring and Ranking System, CoreGRID Workshop on Grid Programming Models and P2P Systems Architecture (CoreGRID 2007 Workshop) (2007) pp. 12–13.
- D. Zeinalipour-Yazti, Identifying Failures in Grids through Monitoring and Ranking, The 7th IEEE International Symposium on Network Computing and Applications (IEEE NCA '08) (2008).


