World Scientific
FAILURE MANAGEMENT IN GRIDS: THE CASE OF THE EGEE INFRASTRUCTURE

    The emergence of Grid infrastructures like EGEE has enabled the deployment of large-scale computational experiments that address challenging scientific problems in various fields. To realize their full potential, however, Grid infrastructures need to achieve a higher degree of dependability, i.e., they need to improve the ratio of Grid-job requests that complete successfully in the presence of Grid-component failures. Achieving this requires determining, analyzing, and classifying the causes of job failures on Grids. In this paper we study the reasons behind Grid job failures in the context of EGEE, the largest Grid infrastructure currently in operation. We present points of failure in a Grid that affect the execution of jobs, and describe error types and contributing factors. We discuss various information sources that provide users and administrators with indications about failures, and assess their usefulness based on the accuracy and completeness of the error information they provide. We describe two real-life case studies of failures that occurred on a production site of EGEE, together with the troubleshooting process for each case. Finally, we propose an architecture for a system that could provide failure management support to administrators and end-users of large-scale Grid infrastructures like EGEE.
