World Scientific
  • Search
Skip main navigation

Cookies Notification

We use cookies on this site to enhance your user experience. By continuing to browse the site, you consent to the use of our cookies. Learn More
×
Our website is made possible by displaying certain online content using javascript.
In order to view the full content, please disable your ad blocker or whitelist our website www.worldscientific.com.

System Upgrade on Tue, Oct 25th, 2022 at 2am (EDT)

Existing users will be able to log into the site and access content. However, E-commerce and registration of new users may not be available for up to 12 hours.
For online purchase, please visit us again. Contact us at [email protected] for any enquiries.

Using Model-based Clustering to Improve Predictions for Queueing Delay on Parallel Machines

    Most space-sharing parallel computers presently operated by production high-performance computing centers use batch-queuing systems to manage processor allocation. In many cases, users wishing to use these batch-queued resources may choose among different queues (charging different amounts) potentially on a number of machines to which they have access. In such a situation, the amount of time a user's job will wait in any one batch queue can be a significant portion of the overall time from job submission to job completion. It thus becomes desirable to provide a prediction for the amount of time a given job can expect to wait in the queue. Further, it is natural to expect that attributes of an incoming job, specifically the number of processors requested and the amount of time requested, might impact that job's wait time.

    In this work, we explore the possibility of generating accurate predictions by automatically grouping jobs having similar attributes using model-based clustering. Moreover, we implement this clustering technique for a time series of jobs so that predictions of future wait times can be generated in real time.

    Using trace-based simulation on data from 7 machines over a 9-year period from across the country, comprising over one million job records, we show that clustering either by requested time, requested number of processors, or the product of the two generally produces more accurate predictions than earlier, more naive, approaches and that automatic clustering outperforms administrator-determined clustering.

    References

    • IBM LoadLeveler User's Guide. Technical report, International Business Machines Corporation, 1993 . Google Scholar
    • G.   Box , G.   Jenkins and G.   Reinsel , Time Series Analysis, Forecasting, and Control , 3rd edn. ( Prentice Hall , 1994 ) . Google Scholar
    • J.   Brevik , D.   Nurmi and R.   Wolski , Predicting bounds on queuing delay for batch-scheduled parallel machines , Proceedings of ACM Symposium on the Principles and Practice of Parallel Programming (PPoPP) 2006 ( 2006 ) . Google Scholar
    • Su-Hui   Chiang and Mary K.   Vernon , Dynamic vs. Static Quantum-based Processor Allocation ( Springer-Verlag , 1996 ) . CrossrefGoogle Scholar
    • Scott Clearwater and Stephen Kleban. Heavy-tailed distributions in supercomputer jobs. Technical Report SAND2002-2378C, Sandia National Labs, 2002 . Google Scholar
    • Allen   Downey , Predicting queue times on space-sharing parallel computers , Proceedings of the 11th International Parallel Processing Symposium ( 1997 ) . Google Scholar
    • Allen   Downey , Using queue time predictions for processor allocation , Proceedings of the 3rd Workshop on Job Scheduling Strategies for Parallel Processing ( 1997 ) . Google Scholar
    • The Dror Feitelson's Parallel Workload Page. http://www.cs.huji.ac.il/labs/parallel/workload . Google Scholar
    • Dror G.   Feitelson and B.   Nitzberg , Job characteristics of a production parallel scientific workload on the NASA Ames iPSC/860 ( Springer-Verlag , 1996 ) . Google Scholar
    • Dror G.   Feitelson and Larry   Rudolph , Parallel Job Scheduling: Issues and Approaches ( Springer-Verlag , 1995 ) . CrossrefGoogle Scholar
    • Dror G.   Feitelson and Larry   Rudolph , Towards Convergence in Job Schedulers for Parallel Supercomputers ( Springer-Verlag , 1996 ) . CrossrefGoogle Scholar
    • Eitan   Frachtenberg et al. , Parallel Job Scheduling Under Dynamic Workloads ( Springer-Verlag , 2003 ) . CrossrefGoogle Scholar
    • C. W. P.   Granger and P.   Newbold , Forecasting Economic Time Series ( Academic Press , 1986 ) . Google Scholar
    • Gridengine home page - http://gridengine.sunsource.net/ . Google Scholar
    • Mor   Harchol-Balter , The effect of heavy-tailed job size distributions on computer system design , Proceedings of ASA-IMS Conference on Applications of Heavy Tailed Distributions in Economics, Engineering and Statistics ( 1999 ) . Google Scholar
    • Anil K.   Jain and Richard C.   Dubes , Algorithms for clustering data ( Prentice-Hall, Inc. , Upper Saddle River, NJ, USA , 1988 ) . Google Scholar
    • Dave   Lifka , The ANL/IBM SP scheduling system   949 ( Springer-Verlag , 1995 ) . CrossrefGoogle Scholar
    • David Lifka, Mark Henderson, and Karen Rayl. Users guide to the argonne SP scheduling system. Technical Report TM-201, Argonne National Laboratory, Mathematics and Computer Science Division, May 1995 . Google Scholar
    • J. MacQueen. Some methods for classification and analysis of multivariate observations. pages 281–297, 1967 . Google Scholar
    • Maui scheduler home page - http://www.clusterresources.com/products/maui/ . Google Scholar
    • Cray NQE User's Guide - http://docs.cray.com/books/2148_3.3/html-2148_3.3 . Google Scholar
    • Pbspro home page - http://www.altair.com/software/pbspro.htm . Google Scholar
    • Christian Posse, Journal of Computational and Graphical Statistics 10(3), 464 (2001), DOI: 10.1198/106186001317115072. Crossref, ISIGoogle Scholar
    • G. Schwartz, Ann. of Statistics 461 (1979). Google Scholar
    • Warren Smith, Valerie E. Taylor and Ian T. Foster, Using run-time predictions to estimate queue wait times and improve scheduler performance, IPPS/SPDP '99/JSSPP '99: Proceedings of the Job Scheduling Strategies for Parallel Processing (Springer-Verlag, 1999) pp. 202–219. Google Scholar
    • R. Wolski, Cluster Computing 1, 119 (1998), DOI: 10.1023/A:1019025230054. CrossrefGoogle Scholar
    • R. Wolski, ACM SIGMETRICS Performance Evaluation Review 30(4), 41 (2003), DOI: 10.1145/773056.773064. CrossrefGoogle Scholar
    • R.   Wolski et al. , Handbook of Innovative Computing: Integrating Classical Models with Emerging Technologies , ed. A.   Zomaya ( Springer-Verlag , 2005 ) . Google Scholar
    • Shi Zhong Zhong, Journal of machine learning research 4, 1001 (2003), DOI: 10.1162/jmlr.2003.4.6.1001. ISIGoogle Scholar