Abstract
Although distance measures are used in many machine learning algorithms, the literature on the context-independent selection and evaluation of distance measures is limited, because existing approaches rely on prior knowledge. In cluster analysis, current studies evaluate the choice of distance measure after applying unsupervised methods, using error probabilities, which implicitly sets the goal of reproducing predefined partitions of the data. The clusters used in such studies often reflect the context of the data as well as the specific goal of the study. Depending on the data context, different properties of distance distributions are judged relevant for selecting an appropriate distance. However, if cluster analysis is understood as the task of finding partitions of similar data, then the intrapartition distances should be smaller than the interpartition distances. By systematically investigating this specification through distribution analysis with the mirrored-density plot (MD plot), it is shown that multimodal distance distributions are preferable in cluster analysis. Consequently, it is advantageous to model distance distributions with Gaussian mixtures prior to the evaluation phase of unsupervised methods. Experiments are performed on several artificial and natural datasets for the task of clustering.
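The central specification, that intrapartition distances should be smaller than interpartition distances and that the pooled distance distribution should therefore be multimodal, can be illustrated with a minimal sketch. The sketch is not the procedure of the paper: the toy data (two well-separated 2-D Gaussian blobs), the Euclidean distance, the choice of exactly two mixture components, and all function names are assumptions made for illustration, and the MD plot itself is not reproduced. The known labels are used only to show the two distance groups; the mixture fit sees the pooled distances without labels.

```python
import itertools
import math
import random
import statistics


def make_two_clusters(n_per_cluster=30, separation=10.0, seed=0):
    """Toy data (an assumption for illustration): two 2-D Gaussian blobs."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, center in enumerate([(0.0, 0.0), (separation, 0.0)]):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)))
            labels.append(label)
    return points, labels


def pairwise_distances(points, labels):
    """Split Euclidean pairwise distances into intra- and interpartition sets."""
    intra, inter = [], []
    for (p, lp), (q, lq) in itertools.combinations(zip(points, labels), 2):
        d = math.dist(p, q)
        (intra if lp == lq else inter).append(d)
    return intra, inter


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_two_gaussians(values, iterations=100):
    """Plain EM for a two-component 1-D Gaussian mixture (uses no labels)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Initialise each component from one half of the sorted distances.
    means = [statistics.fmean(ordered[:half]), statistics.fmean(ordered[half:])]
    sds = [statistics.pstdev(ordered[:half]) or 1.0,
           statistics.pstdev(ordered[half:]) or 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each distance.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and standard deviation.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds


points, labels = make_two_clusters()
intra, inter = pairwise_distances(points, labels)
weights, means, sds = fit_two_gaussians(intra + inter)
print("mean intrapartition distance:", round(statistics.fmean(intra), 2))
print("mean interpartition distance:", round(statistics.fmean(inter), 2))
print("fitted component means:", [round(m, 2) for m in means])
```

On data of this kind, the two fitted components should roughly separate the small (intrapartition) from the large (interpartition) distances. Distances are non-negative and not exactly Gaussian, so the mixture is only an approximation here, and the number of components would in practice have to be chosen by inspecting the distribution.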