The Exploitation of Distance Distributions for Clustering

    https://doi.org/10.1142/S1469026821500164
    Cited by: 12 (Source: Crossref)

    Although distance measures are used in many machine learning algorithms, the literature on the context-independent selection and evaluation of distance measures is limited, because existing approaches rely on prior knowledge. In cluster analysis, current studies evaluate the choice of distance measure after applying unsupervised methods based on error probabilities, implicitly setting the goal of reproducing predefined partitions in data. Such studies use clusters of data that are often based on the context of the data as well as the custom goal of the specific study. Depending on the data context, different properties of distance distributions are judged to be relevant for appropriate distance selection. However, if cluster analysis is based on the task of finding similar partitions of data, then the intrapartition distances should be smaller than the interpartition distances. By systematically investigating this specification using distribution analysis through the mirrored-density plot (MD plot), it is shown that multimodal distance distributions are preferable in cluster analysis. As a consequence, it is advantageous to model distance distributions with Gaussian mixtures prior to the evaluation phase of unsupervised methods. Experiments are performed on several artificial and natural datasets for the task of clustering.
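The central idea of the abstract, that intrapartition distances should be smaller than interpartition distances and that the resulting distance distribution is then multimodal, can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic two-cluster data, the helper names `pairwise_distances` and `fit_two_gaussians`, the choice of Euclidean distance, and the fixed number of two mixture components are all assumptions made for illustration, and a plain EM fit stands in for whatever mixture-modeling tool the study used.

```python
import math
import random
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances between all unordered pairs of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def fit_two_gaussians(values, iterations=100):
    """Fit a two-component 1-D Gaussian mixture with plain EM (illustrative)."""
    ordered = sorted(values)
    n = len(ordered)
    # Initialise the component means at the lower and upper quartiles.
    means = [ordered[n // 4], ordered[(3 * n) // 4]]
    spread = (ordered[-1] - ordered[0]) / 4 or 1.0
    sds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each distance.
        resp = []
        for x in values:
            dens = [
                w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                for w, m, s in zip(weights, means, sds)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and standard deviation.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds


# Hypothetical data: two well-separated 2-D clusters of 40 points each.
rng = random.Random(0)
cluster_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
cluster_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(40)]

# Distances within partitions versus distances between partitions.
intra = pairwise_distances(cluster_a) + pairwise_distances(cluster_b)
inter = [math.dist(p, q) for p in cluster_a for q in cluster_b]

# Model the full distance distribution, without using the labels.
weights, means, sds = fit_two_gaussians(intra + inter)

print("mean intrapartition distance:", sum(intra) / len(intra))
print("mean interpartition distance:", sum(inter) / len(inter))
print("mixture means:", means)
print("mixture weights:", weights)
```

In this sketch the mixture is fitted to all distances at once, so the labels are used only to report the intra- and interpartition means for comparison. Two clearly separated mixture modes suggest that the chosen distance measure separates the partitions; a single dominant mode would suggest that it does not.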
