DATA MINING TOOLS FOR BIOLOGICAL SEQUENCES
Abstract
We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.
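As a rough illustration of the first two steps, the sketch below counts k-grams in windows upstream and downstream of a candidate ATG and scores a single feature with a signal-to-noise measure. It is a minimal sketch, not the authors' implementation: the function names, the window size, and the choice of raw counts as feature values are assumptions made for illustration, and the score is assumed to take the commonly used form |mean1 - mean2| / (sd1 + sd2).

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev


def kgram_counts(seq, k):
    """Count every overlapping k-gram in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kgram_features(seq, atg_pos, k=3, window=9):
    """Candidate features for the ATG starting at index atg_pos.

    Hypothetical feature scheme: raw counts of each DNA k-gram in a
    fixed-size window upstream and downstream of the ATG.
    """
    regions = {
        "up": seq[max(0, atg_pos - window):atg_pos],
        "down": seq[atg_pos + 3:atg_pos + 3 + window],
    }
    feats = {}
    for label, region in regions.items():
        counts = kgram_counts(region, k)
        for gram in map("".join, product("ACGT", repeat=k)):
            feats[f"{label}_{gram}"] = counts[gram]  # 0 if absent
    return feats


def signal_to_noise(pos_values, neg_values):
    """Assumed form: |mean_pos - mean_neg| / (sd_pos + sd_neg)."""
    diff = abs(mean(pos_values) - mean(neg_values))
    spread = pstdev(pos_values) + pstdev(neg_values)
    if spread == 0:
        # Both classes are constant: perfectly separating or useless.
        return float("inf") if diff > 0 else 0.0
    return diff / spread


def rank_features(pos_rows, neg_rows, top=10):
    """Rank features by signal-to-noise over two lists of feature dicts."""
    scores = {
        name: signal_to_noise([r[name] for r in pos_rows],
                              [r[name] for r in neg_rows])
        for name in pos_rows[0]
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

The top-ranked features returned by `rank_features` would then be passed to a classifier such as a decision tree, SVM, or Naive Bayes for the integration step.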