DEVELOPING OPTIMAL PREDICTION MODELS FOR CANCER CLASSIFICATION USING GENE EXPRESSION DATA
Abstract
Microarrays can provide genome-wide expression patterns for various cancers, especially for tumor sub-types that may differ substantially in patient prognosis. Several approaches have been proposed to classify tumor sub-types accurately from such gene expression data. These classification methods are not robust and often depend on a particular training sample for model building, which raises concerns about using them to guide treatment for a future patient. We propose to construct an optimal, robust prediction model for classifying cancer sub-types using gene expression data. Our model is constructed in a step-wise fashion using cross-validated quadratic discriminant analysis. At each step, all identified models are validated on an independent sample of patients to develop a model that is robust for future data. We apply the proposed methods to two cancer microarray data sets: the acute leukemia data of Golub et al. [3] and the colon cancer data of Alon et al. [12]. We find that the dimensionality of our optimal prediction models is relatively small in these cases, and that our prediction models with one or two gene factors outperform, or perform comparably to, other methods based on 50 or more predictive gene factors, especially on independent samples. The methodology is implemented in R and S-PLUS. The source code can be obtained at .
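The step-wise procedure described above can be illustrated with a minimal sketch. It is not the authors' R/S-PLUS implementation. It assumes leave-one-out cross-validation, a small ridge term on each class covariance, synthetic data, and validation of only the best model at each step (the abstract says all identified models are validated). All function names (`stepwise_qda`, `loo_error`, `make_synthetic`, and so on) are hypothetical.

```python
import math
import random

def gaussian_fit(rows):
    """Mean vector and unbiased covariance of a class (needs >= 2 rows)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
            for b in range(d)] for a in range(d)]
    for a in range(d):
        cov[a][a] += 1e-6  # assumption: small ridge keeps cov invertible
    return mean, cov

def log_density(x, mean, cov):
    """Gaussian log-density up to an additive constant (elimination solve)."""
    d = len(x)
    a = [row[:] + [x[i] - mean[i]] for i, row in enumerate(cov)]
    log_det = 0.0
    for i in range(d):  # forward elimination; pivots are positive for SPD cov
        log_det += math.log(a[i][i])
        for k in range(i + 1, d):
            f = a[k][i] / a[i][i]
            for j in range(i, d + 1):
                a[k][j] -= f * a[i][j]
    z = [0.0] * d
    for i in reversed(range(d)):  # back substitution: z = cov^{-1} (x - mean)
        z[i] = (a[i][d] - sum(a[i][j] * z[j] for j in range(i + 1, d))) / a[i][i]
    maha = sum((x[i] - mean[i]) * z[i] for i in range(d))
    return -0.5 * (log_det + maha)

def qda_fit(X, y):
    """Per-class log prior, mean, and covariance."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        model[c] = (math.log(len(rows) / len(y)),) + gaussian_fit(rows)
    return model

def qda_predict(model, x):
    return max(model, key=lambda c: model[c][0]
               + log_density(x, model[c][1], model[c][2]))

def loo_error(X, y, genes):
    """Leave-one-out misclassification rate using only the given genes."""
    errors = 0
    for i in range(len(X)):
        tr_X = [[r[g] for g in genes] for k, r in enumerate(X) if k != i]
        tr_y = [label for k, label in enumerate(y) if k != i]
        model = qda_fit(tr_X, tr_y)
        errors += qda_predict(model, [X[i][g] for g in genes]) != y[i]
    return errors / len(X)

def stepwise_qda(X, y, X_val, y_val, max_genes=2):
    """Forward selection by CV error; each step is checked on held-out data."""
    selected, history = [], []
    for _ in range(max_genes):
        candidates = [g for g in range(len(X[0])) if g not in selected]
        best = min(candidates, key=lambda g: loo_error(X, y, selected + [g]))
        selected.append(best)
        model = qda_fit([[r[g] for g in selected] for r in X], y)
        wrong = sum(qda_predict(model, [r[g] for g in selected]) != label
                    for r, label in zip(X_val, y_val))
        history.append((list(selected), loo_error(X, y, selected),
                        wrong / len(y_val)))
    return history

def make_synthetic(n_per_class, n_genes=6, informative=2, shift=6.0, seed=0):
    """Synthetic stand-in for expression data: one gene separates the classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            row[informative] += shift * label
            X.append(row)
            y.append(label)
    return X, y

X_train, y_train = make_synthetic(20, seed=1)
X_val, y_val = make_synthetic(10, seed=2)
for genes, cv_err, val_err in stepwise_qda(X_train, y_train, X_val, y_val):
    print(genes, round(cv_err, 3), round(val_err, 3))
```

Each printed row lists the selected gene indices, the cross-validated error on the training sample, and the error on the independent sample. Comparing the last two columns across steps is one way to choose a small model that still generalizes.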
References
- Nat. Genet. 24(3), 208 (2000).
- D. K. Slonim, P. Tamayo, J. P. Mesirov, T. R. Golub and E. S. Lander, "Class prediction and discovery using gene expression data," in Proc. 4th Ann. Int. Conf. Comput. Mol. Biol. (RECOMB), Tokyo, Japan, 2000.
- Science 286, 531 (1999).
- Bioinformatics 16(10), 906 (2000).
- PNAS 98(20), 11462 (2001).
- Bioinformatics 18(1), 39 (2002).
- Bioinformatics 17(12), 1131 (2001).
- PNAS 99(10), 6567 (2002).
- F. E. Harrell, Regression Modeling Strategies (Springer-Verlag, New York, 2001), pp. 60–64.
- PNAS 98(1), 31 (2001).
- R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis (Prentice Hall, Upper Saddle River, New Jersey, 1998), pp. 630–675.
- PNAS 96(12), 6745 (1999).
- J. Am. Stat. Assoc. 97, 77 (2002).
- A. Smith, J. Satagopan, M. Gonen and C. Begg, "Exploring class prediction for leukemia gene expression data," CAMDA 2000: December 18th (2000).
- J. Biol. Chem. 266, 5847 (1991).
- Prostate 29(6), 381 (1996).
- J. Am. Stat. Assoc. 92, 179 (1997).
- J. Roy. Stat. Soc. series A 150, 1 (1987).
- LabMedica International 19(2), 8 (2002).


