Using Model-based Clustering to Improve Predictions for Queueing Delay on Parallel Machines
Abstract
Most space-sharing parallel computers presently operated by production high-performance computing centers use batch-queuing systems to manage processor allocation. In many cases, users wishing to use these batch-queued resources may choose among different queues (charging different amounts) potentially on a number of machines to which they have access. In such a situation, the amount of time a user's job will wait in any one batch queue can be a significant portion of the overall time from job submission to job completion. It thus becomes desirable to provide a prediction for the amount of time a given job can expect to wait in the queue. Further, it is natural to expect that attributes of an incoming job, specifically the number of processors requested and the amount of time requested, might impact that job's wait time.
In this work, we explore the possibility of generating accurate predictions by automatically grouping jobs having similar attributes using model-based clustering. Moreover, we implement this clustering technique for a time series of jobs so that predictions of future wait times can be generated in real time.
Using trace-based simulation on data from 7 machines over a 9-year period from across the country, comprising over one million job records, we show that clustering either by requested time, requested number of processors, or the product of the two generally produces more accurate predictions than earlier, more naive, approaches and that automatic clustering outperforms administrator-determined clustering.
References
- IBM LoadLeveler User's Guide. Technical report, International Business Machines Corporation, 1993 . Google Scholar
-
G. Box , G. Jenkins and G. Reinsel , Time Series Analysis, Forecasting, and Control , 3rd edn. ( Prentice Hall , 1994 ) . Google Scholar -
J. Brevik , D. Nurmi and R. Wolski , Predicting bounds on queuing delay for batch-scheduled parallel machines , Proceedings of ACM Symposium on the Principles and Practice of Parallel Programming (PPoPP) 2006 ( 2006 ) . Google Scholar -
Su-Hui Chiang and Mary K. Vernon , Dynamic vs. Static Quantum-based Processor Allocation ( Springer-Verlag , 1996 ) . Crossref, Google Scholar - Scott Clearwater and Stephen Kleban. Heavy-tailed distributions in supercomputer jobs. Technical Report SAND2002-2378C, Sandia National Labs, 2002 . Google Scholar
-
Allen Downey , Predicting queue times on space-sharing parallel computers , Proceedings of the 11th International Parallel Processing Symposium ( 1997 ) . Google Scholar -
Allen Downey , Using queue time predictions for processor allocation , Proceedings of the 3rd Workshop on Job Scheduling Strategies for Parallel Processing ( 1997 ) . Google Scholar - The Dror Feitelson's Parallel Workload Page. http://www.cs.huji.ac.il/labs/parallel/workload . Google Scholar
-
Dror G. Feitelson and B. Nitzberg , Job characteristics of a production parallel scientific workload on the NASA Ames iPSC/860 ( Springer-Verlag , 1996 ) . Google Scholar -
Dror G. Feitelson and Larry Rudolph , Parallel Job Scheduling: Issues and Approaches ( Springer-Verlag , 1995 ) . Crossref, Google Scholar -
Dror G. Feitelson and Larry Rudolph , Towards Convergence in Job Schedulers for Parallel Supercomputers ( Springer-Verlag , 1996 ) . Crossref, Google Scholar -
Eitan Frachtenberg , Parallel Job Scheduling Under Dynamic Workloads ( Springer-Verlag , 2003 ) . Crossref, Google Scholar -
C. W. P. Granger and P. Newbold , Forecasting Economic Time Series ( Academic Press , 1986 ) . Google Scholar - Gridengine home page - http://gridengine.sunsource.net/ . Google Scholar
-
Mor Harchol-Balter , The effect of heavy-tailed job size distributions on computer system design , Proceedings of ASA-IMS Conference on Applications of Heavy Tailed Distributions in Economics, Engineering and Statistics ( 1999 ) . Google Scholar -
Anil K. Jain and Richard C. Dubes , Algorithms for clustering data ( Prentice-Hall, Inc. , Upper Saddle River, NJ, USA , 1988 ) . Google Scholar -
Dave Lifka , The ANL/IBM SP scheduling system 949 ( Springer-Verlag , 1995 ) . Crossref, Google Scholar - David Lifka, Mark Henderson, and Karen Rayl. Users guide to the argonne SP scheduling system. Technical Report TM-201, Argonne National Laboratory, Mathematics and Computer Science Division, May 1995 . Google Scholar
- J. MacQueen. Some methods for classification and analysis of multivariate observations. pages 281–297, 1967 . Google Scholar
- Maui scheduler home page - http://www.clusterresources.com/products/maui/ . Google Scholar
- Cray NQE User's Guide - http://docs.cray.com/books/2148_3.3/html-2148_3.3 . Google Scholar
- Pbspro home page - http://www.altair.com/software/pbspro.htm . Google Scholar
- Journal of Computational and Graphical Statistics 10(3), 464 (2001), DOI: 10.1198/106186001317115072. Crossref, ISI, Google Scholar
- Ann. of Statistics 461 (1979). Google Scholar
Warren Smith , Valerie E. Taylor and Ian T. Foster , Using run-time predictions to estimate queue wait times and improve scheduler performance, IPPS/SPDP '99/JSSPP '99: Proceedings of the Job Scheduling Strategies for Parallel Processing (Springer-Verlag, 1999) pp. 202–219. Google Scholar- Cluster Computing 1, 119 (1998), DOI: 10.1023/A:1019025230054. Crossref, Google Scholar
- ACM SIGMETRICS Performance Evaluation Review 30(4), 41 (2003), DOI: 10.1145/773056.773064. Crossref, Google Scholar
- , Handbook of Innovative Computing: Integrating Classical Models with Emerging Technologies , ed.
A. Zomaya ( Springer-Verlag , 2005 ) . Google Scholar - Journal of machine learning research 4, 1001 (2003), DOI: 10.1162/jmlr.2003.4.6.1001. ISI, Google Scholar


