STEM KERNELS FOR RNA SEQUENCE ANALYSES
Abstract
Several computational methods based on stochastic context-free grammars have been developed for modeling and analyzing functional RNA sequences. These grammatical methods have succeeded in modeling typical secondary structures of RNA, and are used for structural alignment of RNA sequences. However, such stochastic models cannot sufficiently discriminate member sequences of an RNA family from nonmembers, and hence cannot reliably detect noncoding RNA regions in genome sequences. A novel kernel function, the stem kernel, is proposed for the discrimination and detection of functional RNA sequences using support vector machines (SVMs). The stem kernel is a natural extension of the string kernel, specifically the all-subsequences kernel, and is tailored to measure the similarity of two RNA sequences from the viewpoint of secondary structures. The stem kernel examines all possible common base pairs and stem structures of arbitrary lengths, including pseudoknots, between two RNA sequences, and calculates the inner product of common stem structure counts. An efficient algorithm based on dynamic programming is developed to calculate the stem kernels. The stem kernels are then applied to discriminate members of an RNA family from nonmembers using SVMs. The study indicates that the discrimination ability of the stem kernel is strong compared with conventional methods. Furthermore, the potential application of the stem kernel is demonstrated by the detection of remotely homologous RNA families in terms of secondary structures. This application is motivated by the fact that the string kernel has been shown to work for the remote homology detection of protein sequences. These experimental results encourage the application of the stem kernel to finding novel RNA families in genome sequences.
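As a point of reference for the kernel that the stem kernel extends, the following is a minimal sketch of the standard all-subsequences string kernel computed by dynamic programming. It is not the stem kernel itself and does not model base pairs or stems. The function name `all_subsequences_kernel` and the example sequences are illustrative assumptions, not taken from the paper.

```python
def all_subsequences_kernel(s, t):
    """Count matching pairs of (possibly non-contiguous) subsequences of s and t.

    This is the standard all-subsequences string kernel, shown only as
    background; it is NOT the stem kernel described in the abstract.
    The empty subsequence is counted, so the result is always >= 1.
    """
    n, m = len(s), len(t)
    # dp[i][j] holds the kernel value for the prefixes s[:i] and t[:j].
    # Any prefix pair involving an empty string shares only the empty subsequence.
    dp = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        # aux accumulates dp[i-1][k-1] over positions k <= j where t[k-1] == s[i-1],
        # i.e. the matches in which the last symbol of s[:i] is actually used.
        aux = 0
        for j in range(1, m + 1):
            if t[j - 1] == s[i - 1]:
                aux += dp[i - 1][j - 1]
            # Either the last symbol of s[:i] is unused (dp[i-1][j]) or it is
            # matched to some equal symbol in t[:j] (aux).
            dp[i][j] = dp[i - 1][j] + aux
    return dp[n][m]


if __name__ == "__main__":
    # Hypothetical toy sequences, chosen only for illustration.
    print(all_subsequences_kernel("GCAU", "GCAU"))
    print(all_subsequences_kernel("GGCA", "GCA"))
```

The table has one entry per pair of prefixes, so the sketch runs in time proportional to the product of the two sequence lengths. A kernel value computed this way can be used as an entry of the Gram matrix supplied to an SVM.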