DETERMINISTIC INITIALIZATION OF THE K-MEANS ALGORITHM USING HIERARCHICAL CLUSTERING
Abstract
K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. Many of these methods, however, have superlinear complexity in the number of data points, making them impractical for large data sets. On the other hand, linear methods are often random and/or order-sensitive, which renders their results unrepeatable. Recently, Su and Dy proposed two highly successful hierarchical initialization methods named Var-Part and PCA-Part that are not only linear, but also deterministic (nonrandom) and order-invariant. In this paper, we propose a discriminant analysis based approach that addresses a common deficiency of these two methods. Experiments on a large and diverse collection of data sets from the UCI machine learning repository demonstrate that Var-Part and PCA-Part are highly competitive with one of the best random initialization methods to date, i.e. k-means++, and that the proposed approach significantly improves the performance of both hierarchical methods.
References
- ACM Comput. Surveys 31(3), 264 (1999). Crossref, Web of Science, Google Scholar
- Pattern Recog. Lett. 31(8), 651 (2010). Crossref, Web of Science, Google Scholar
-
L. Kaufman and P. Rousseeuw , Finding Groups in Data: An Introduction to Cluster Analysis ( Wiley-Interscience , 1990 ) . Crossref, Google Scholar - Machine Learn. 75(2), 245 (2009). Crossref, Web of Science, Google Scholar
- Theor. Comput. Sci. 442 , 13 ( 2012 ) . Crossref, Web of Science, Google Scholar
- Pattern Recog. 36(12), 2955 (2003). Crossref, Web of Science, Google Scholar
- IEEE Trans. Inf. Theor. 28(2), 129 (1982). Crossref, Web of Science, Google Scholar
P. S. Bradley and U. Fayyad , Refining initial points for k-means clustering, Proc. 15th Int. Conf. Machine Learning (1998) pp. 91–99. Google ScholarM. Dash , H. Liu and X. Xu , '1+1>2': Merging distance and density based clustering, Proc. 7th Int. Conf. Database Systems for Advanced Applications (2001) pp. 32–39. Google Scholar- IEEE Trans. Pattern Anal. Machine Intell. 33(3), 568 (2011). Crossref, Web of Science, Google Scholar
- IEEE Trans. Pattern Anal. Machine Intell. 24(7), 881 (2002). Crossref, Web of Science, Google Scholar
G. Hamerly , Making k-means even faster, Proc. 2010 SIAM Int. Conf. Data Mining (2010) pp. 130–140. Google Scholar- IEEE Trans. Very Large Scale Integration (VLSI) Syst. 18(6), 957 (2010). Crossref, Web of Science, Google Scholar
- IEEE Trans. Knowledge and Data Eng. 16(8), 909 (2004). Crossref, Web of Science, Google Scholar
- IEEE Trans. Pattern Anal. Machine Intell. 6(1), 81 (1984). Crossref, Web of Science, Google Scholar
- Advances in Neural Information Processing Systems 7, eds.
G. Tesauro , D. S. Touretzky and T. K. Leen (MIT Press, 1995) pp. 585–592. Google Scholar , - Stat. Anal. Data Mining 3(4), 209 (2010). Crossref, Google Scholar
- IEEE Trans. Neural Netw. 7(1), 16 (1996). Crossref, Web of Science, Google Scholar
- IEEE Trans. Syst. Man Cybern. B 33(6), 983 (2003). Crossref, Web of Science, Google Scholar
- Expert Syst. Appl. 40(1), 200 (2013). Crossref, Web of Science, Google Scholar
- Image Vis. Comput. 29(4), 260 (2011). Crossref, Web of Science, Google Scholar
- Pattern Recogn. Lett. 20(10), 1027 (1999). Crossref, Web of Science, Google Scholar
J. He , Initialization of cluster refinement algorithms: A review and comparative study, Proc. 2004 IEEE Int. Joint Conf. Neural Networks (2004) pp. 297–302. Google Scholar- Computer J. 10(3), 271 (1967). Crossref, Web of Science, Google Scholar
- M. M. Astrahan, Speech analysis by clustering, or the hyperphoneme method, Technical Report AIM-124, Stanford University, 1970 . Google Scholar
- J. R. Stat. Soc. C 28(1), 100 (1979). Crossref, Google Scholar
- Pattern Recog. 36(2), 451 (2003). Crossref, Web of Science, Google Scholar
M. Al-Daoud , A new algorithm for cluster initialization, Proc. 2nd World Enformatika Conf. (2005) pp. 74–76. Google Scholar- Pattern Recog. Lett. 28(8), 965 (2007). Crossref, Web of Science, Google Scholar
- Pattern Recog. Lett. 30(11), 994 (2009). Crossref, Web of Science, Google Scholar
- Comput. Math. Appl. 58(3), 474 (2009). Crossref, Web of Science, Google Scholar
P. Kang and S. Cho , K-means clustering seeds initialization based on centrality, sparsity, and isotropy, Proc. 10th Int. Conf. Intelligent Data Engineering and Automated Learning (2009) pp. 109–117. Google Scholar- Biometrics 21 , 768 ( 1965 ) . Web of Science, Google Scholar
- Australian J. Botany 14(1), 127 (1966). Crossref, Web of Science, Google Scholar
J. MacQueen , Some methods for classification and analysis of multivariate observations, Proc. 5th Berkeley Symp. Mathematical Statistics and Probability (1967) pp. 281–297. Google Scholar- Behav. Sci. 12(2), 153 (1967). Crossref, Google Scholar
-
J. T. Tou and R. C. Gonzales , Pattern Recognition Principles ( Addison-Wesley , 1974 ) . Google Scholar - Eur. J. Oper. Res. 1(1), 23 (1977). Crossref, Google Scholar
D. Arthur and S. Vassilvitskii , k-means++: The advantages of careful seeding, Proc. 18th Annual ACM-SIAM Symp. Discrete Algorithms (2007) pp. 1027–1035. Google Scholar- Intell. Data Anal. 11(4), 319 (2007). Crossref, Web of Science, Google Scholar
-
M. R. Anderberg , Cluster Analysis for Applications ( Academic Press , 1973 ) . Google Scholar -
M. J. Norušis , IBM SPSS Statistics 19 Statistical Procedures Companion ( Addison Wesley , 2011 ) . Google Scholar - Theor. Comput. Sci. 38(3), 293 (1985). Crossref, Web of Science, Google Scholar
- Psychometrika 1(1), 27 (1936). Crossref, Google Scholar
- Inf. Sci. 2(3), 319 (1970). Crossref, Web of Science, Google Scholar
- J. Electron. Imaging 13(1), 146 (2004). Crossref, Web of Science, Google Scholar
- IEEE Trans. Syst. Man Cybern. 9(1), 62 (1979). Crossref, Web of Science, Google Scholar
- Comput. Vis. Graphics Image Process. 41(2), 233 (1988). Crossref, Web of Science, Google Scholar
- Comput. Vis. Graphics Image Process. 52(2), 171 (1990). Crossref, Google Scholar
- IEEE Trans. Pattern Anal. Machine Intell. 17(3), 312 (1995). Crossref, Web of Science, Google Scholar
- Pattern Recog. Lett. 26(10), 1423 (2005). Crossref, Web of Science, Google Scholar
- M. E. Celebi, Q. Wen, S. Hwang, H. Iyatomi and G. Schaefer, Lesion border detection in dermoscopy images using ensembles of thresholding methods, to appear in Skin Res. Technol. (2013) . Google Scholar
- A. Frank and A. Asuncion, UCI machine learning repository, University of California, Irvine, School of Information and Computer Sciences, http://archive.ics.uci.edu/ml, 2012 . Google Scholar
- J. Emerging Technol. Web Intell. 4(1), 51 (2012). Google Scholar
- J. Class. 5(2), 181 (1988). Crossref, Web of Science, Google Scholar
- ACM Trans. Model. Comput. Simul. 8(1), 3 (1998). Crossref, Google Scholar