A model-based clustering algorithm with covariates adjustment and its application to lung cancer stratification
Abstract
Clustering is often the first step in a data analysis. It allows us to identify patterns that had gone unnoticed and helps raise new hypotheses. However, one challenge when analyzing empirical data is the presence of covariates, which may mask the clustering structure of interest. For example, suppose we are interested in clustering a set of individuals into controls and cancer patients. In this case, a clustering algorithm could instead group the subjects into young and elderly, because age at diagnosis is associated with cancer. We therefore developed CEM-Co, a model-based clustering algorithm that removes or minimizes the effects of undesirable covariates during the clustering process. We applied CEM-Co to a gene expression dataset of 129 stage I non-small cell lung cancer patients. CEM-Co identified a subgroup with a poorer prognosis, whereas standard clustering algorithms failed to do so.
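The masking effect described in the abstract can be illustrated with a small simulation. The sketch below is not CEM-Co: it uses a naive two-step alternative (regress the features on the covariate, then cluster the residuals), whereas CEM-Co adjusts for covariates during the clustering itself. All names (`two_means`, `agreement`), the effect sizes, and the 70%/30% age split are assumptions chosen only for illustration.

```python
import numpy as np


def two_means(x, n_iter=50):
    """Plain Lloyd's algorithm for k = 2 with a deterministic farthest-point start."""
    first = x[np.argmax(np.linalg.norm(x - x.mean(axis=0), axis=1))]
    second = x[np.argmax(np.linalg.norm(x - first, axis=1))]
    centers = np.stack([first, second])
    for _ in range(n_iter):
        # Distance of every point to both centers, then nearest-center assignment.
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        centers = np.stack([x[labels == k].mean(axis=0) for k in range(2)])
    return labels


def agreement(labels, truth):
    """Fraction of matching labels, invariant to swapping the two cluster names."""
    a = np.mean(labels == truth)
    return max(a, 1.0 - a)


rng = np.random.default_rng(0)
n = 600

# Hypothetical cohort: cancer status, and an age group that is associated with it.
status = rng.integers(0, 2, size=n)             # 0 = control, 1 = cancer
p_elderly = np.where(status == 1, 0.7, 0.3)     # assumed association with age
elderly = (rng.random(n) < p_elderly).astype(int)

# Two features: a large age effect and a smaller disease effect (assumed sizes).
x = (6.0 * elderly[:, None] * np.array([1.0, 1.0])
     + 2.0 * status[:, None] * np.array([1.0, -1.0])
     + rng.normal(scale=0.3, size=(n, 2)))

# Clustering the raw features.
raw = two_means(x)

# Naive adjustment: remove the linear effect of the covariate, then cluster.
design = np.column_stack([np.ones(n), elderly])
coef, *_ = np.linalg.lstsq(design, x, rcond=None)
residuals = x - design @ coef
adjusted = two_means(residuals)

print("raw clusters vs. age group:     ", agreement(raw, elderly))
print("raw clusters vs. cancer status: ", agreement(raw, status))
print("adjusted clusters vs. status:   ", agreement(adjusted, status))
```

In this toy setting, the clusters found on the raw features track the age group rather than the disease status, while the clusters found on the residuals track the status. Because age and status are correlated, the regression step also absorbs part of the disease signal, which is one motivation for handling covariates inside the clustering model rather than in a separate preprocessing step.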