FAILURE MANAGEMENT IN GRIDS: THE CASE OF THE EGEE INFRASTRUCTURE
Abstract
The emergence of Grid infrastructures like EGEE has enabled the deployment of large-scale computational experiments that address challenging scientific problems in various fields. To realize their full potential, however, Grid infrastructures need to achieve a higher degree of dependability, i.e., they need to improve the ratio of Grid-job requests that complete successfully in the presence of Grid-component failures. Achieving this requires determining, analyzing, and classifying the causes of job failures on Grids. In this paper we study the reasons behind Grid job failures in the context of EGEE, the largest Grid infrastructure currently in operation. We present points of failure in a Grid that affect the execution of jobs, and describe error types and contributing factors. We discuss various information sources that provide users and administrators with indications about failures, and assess their usefulness based on the accuracy and completeness of the error information they provide. We present two real-life case studies of failures that occurred on a production site of EGEE, together with the troubleshooting process for each case. Finally, we propose an architecture for a system that could provide failure management support to administrators and end-users of large-scale Grid infrastructures like EGEE.


