Predicting Model and Algorithm in RNA Folding Structure Including Pseudoknots

Predicting RNA structure with pseudoknots is a nondeterministic polynomial-time hard (NP-hard) problem. We investigate RNA-pseudoknotted structure using minimum free energy models and computational methods. This paper presents an efficient algorithm for predicting RNA structure with pseudoknots that takes $O(n^3)$ time and $O(n^2)$ space. Experimental tests on Rfam10.1 and PseudoBase indicate that the algorithm is more effective and precise: its prediction accuracy, time complexity and space complexity outperform those of existing algorithms such as the maximum weight matching (MWM) algorithm, the PKNOTS algorithm and the iterated loop matching (ILM) algorithm, and it can predict arbitrary pseudoknots. We also prove that there exists a $(1 + \varepsilon)$ ($\varepsilon > 0$) polynomial-time approximation scheme for finding the maximum number of stackings in RNA-pseudoknotted structure. Finally, we have improved several types of pseudoknots considered in RNA folding structure and analyze the possible transitions between them.


Introduction
RNA is an important biomacromolecule that performs a wide range of functions in biological systems and is a key component of several vital molecular biological processes. RNAs are three-dimensional molecules. The major driving force of RNA folding is base pairing: the Watson-Crick pairs A-U and C-G and the wobble pair G-U. An RNA folds into a three-dimensional structure by forming base pairs. A pseudoknot is formed by two overlapping base pairs, and pseudoknots are known to exist in some RNAs [23]; some RNA helices contain overlapping base pairs, and the term pseudoknot is also used to refer to pairs of overlapping substructures. RNA tertiary structure is a more stable structure, and predicting the RNA secondary structure is the first step toward predicting the tertiary structure of an RNA sequence.
Crossing base pairs form pseudoknots, and crossing stems form pseudoknotted structures. Existing polynomial-time prediction algorithms have difficulty handling large RNA molecules that include pseudoknots. Finding the optimal RNA structure as a combination of stems has become a new method for predicting RNA-pseudoknotted structures. Finding an optimal RNA secondary structure is NP-hard if any set of base pairs is allowed, and a structure is legal when its base pairs obey the minimum separation requirement. For the case in which the energy function is minimized when the number of base pairs is maximized, Nussinov obtained an $O(n^3)$ time algorithm for predicting RNA secondary structures [22], but the Nussinov algorithm cannot predict pseudoknotted structures. Reeder and Giegerich proposed an algebraic dynamic programming algorithm for finding RNA-pseudoknotted structures with simple planar pseudoknots; it takes $O(n^4)$ time and $O(n^2)$ space [12]. An efficient algorithm for finding the optimal folding of an RNA structure was first given by Michael Zuker [35]. Rivas and Eddy presented the PKNOTS algorithm, a minimum free energy (MFE) prediction model for RNA-pseudoknotted structure [28], whose time and space complexities are $O(n^6)$ and $O(n^4)$, respectively. Predicting RNA secondary structure including pseudoknots is NP-complete [21], and maximizing the number of stacking pairs in a planar secondary structure that allows pseudoknots is NP-hard [10], so approximation algorithms have naturally been sought. Pseudoknots evidently occur in RNA structures [15]. Ren presented a heuristic algorithm for finding RNA-pseudoknotted structures [27]. Several publications show that when RNA structures are extended to include arbitrary pseudoknots, finding the optimum structure is NP-hard [1]. A more stable structure with arbitrary pseudoknots can be found if RNA secondary structure is modeled as maximum weighted matching [30]. To reduce the time and space complexities of prediction algorithms in the sparse case, sparsity-related techniques have also been applied to RNA folding [34,5,6].
We analyze the RNA secondary structure. The contribution of this paper is an efficient algorithm for predicting RNA-pseudoknotted structure, with time complexity $O(n^3)$ and space complexity $O(n^2)$. We implemented the algorithm in VC++; experimental tests on PseudoBase and Rfam10.1 show that the algorithm is more effective and exact than other algorithms, and it can predict arbitrary pseudoknots. Furthermore, we propose and prove that a $(1 + \varepsilon)$ ($\varepsilon > 0$) polynomial-time approximation scheme (PTAS) exists for finding the maximum number of stackings. We also investigate the complexity of maximizing the number of stacking pairs in RNA structures, present a 2-approximation algorithm and analyze its approximation ratio. Finally, the paper presents several types of pseudoknots considered in RNA folding structures and analyzes the possible transitions between them.

Predicting Model of RNA Structure
We discuss an MFE-based model of RNA structure prediction with simple pseudoknots, obtained by restricting the types of pseudoknots considered.

Preliminary
(1) RNA secondary structure S: let $S$ be a set of base pairs $s_i : s_j$, where each base $s_i, s_j \in \{A, C, G, U\}$ and $1 \le i \le n$.
(2) Pseudoknot: if $s_i : s_j \in S$ and $s_{i'} : s_{j'} \in S$ with $i < i' < j < j'$ or $i' < i < j' < j$, then the RNA base sequence $s_i \ldots s_{i'} \ldots s_j\, s_{j'}$ composes a pseudoknot.
(7) Nested structure: if $s_i : s_j \in S$ and $s_{i'} : s_{j'} \in S$ with $i < j < i' < j'$ or $i' < j' < i < j$, then the RNA base sequence $s_i, \ldots, s_{i'}, \ldots, s_j\, s_{j'}$ composes a nested structure.
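The crossing condition in item (2) can be checked mechanically. The following is a minimal sketch, assuming base pairs are given as tuples of sequence positions; the function names `crosses` and `has_pseudoknot` are illustrative and do not come from the paper.

```python
from itertools import combinations

def crosses(p, q):
    """True if the two base pairs satisfy i < i' < j < j' (in either order)."""
    # Normalize each pair to (small, large), then order the pairs by left end.
    (i, j), (i2, j2) = sorted((tuple(sorted(p)), tuple(sorted(q))))
    return i < i2 < j < j2

def has_pseudoknot(pairs):
    """True if any two base pairs of the structure cross."""
    return any(crosses(p, q) for p, q in combinations(pairs, 2))

if __name__ == "__main__":
    print(has_pseudoknot([(1, 10), (2, 9), (5, 15)]))  # (2, 9) and (5, 15) cross
```

Checking all pairs of base pairs takes quadratic time in the number of pairs, which is adequate for illustration but not tuned for long sequences.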
In the formula, let $V(i,j)$ be the minimum energy of the subsequence $s_{i,j}$ when $(s_i, s_j)$ form a base pair, and let $W(i,j)$ be the minimum energy of the subsequence $s_{i,j}$; the computation of $V(i,j)$ and $W(i,j)$ is shown in Figs. 1 and 2. $E_1$ represents the energy of a base pair and $E_2$ the energy of a stack. $EF$ represents the minimum energy of sequence $F$ and $LF$ the length of sequence $F$. $WM$ represents the energy value of the multiloop; the computation of $W(i,j)$ and $WM$ is shown in Figs. 1 and 2. $P$ represents the value of every base pair of the multiloop, $M$ the weight value of the multiloop, $U$ the value of the length of base pairs, and $G_w$ the value of the pseudoknots. Based on the MFE principle, we use the thermodynamic parameters of Andronescu, Condon and Mathews [2,3]; the algorithm can identify the pseudoknot type of the query RNA structure and output the RNA-pseudoknotted structure.
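The recurrences for $V$, $W$ and $WM$ are given only in Figs. 1 and 2, so they are not reproduced here. As a simplified stand-in, the sketch below shows the same kind of interval dynamic programming with Nussinov-style pair counting in place of thermodynamic energies. The pair set, the minimum loop length of 3 and the name `max_base_pairs` are assumptions for illustration; the stacking, multiloop and pseudoknot terms are omitted entirely.

```python
# Assumed pairing rules: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (pseudoknot-free)."""
    n = len(seq)
    if n == 0:
        return 0
    # W[i][j] holds the best pair count for the subsequence seq[i..j].
    W = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = W[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = W[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + W[k + 1][j - 1])
            W[i][j] = best
    return W[0][n - 1]

if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))
```

The three nested loops give $O(n^3)$ time and the table gives $O(n^2)$ space, which is the complexity class quoted for the nested case; replacing the `+ 1` with an energy term is where a thermodynamic model would plug in.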

Predicting Algorithm of RNA Folding Structure
Based on the MFE principle, we design a polynomial-time algorithm, PreAlgorithm, for predicting RNA folding structure including pseudoknots. A stem is a run of consecutive stacking pairs (Fig. 3).
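To make the notion of a stem concrete, the following sketch groups the base pairs of a structure into stems, assuming a stem is a maximal run of pairs $(i,j), (i+1,j-1), \ldots$ with no bulges. The helper name `find_stems` is hypothetical.

```python
def find_stems(pairs):
    """Group base pairs into stems: maximal runs (i, j), (i+1, j-1), ..."""
    pair_set = {tuple(sorted(p)) for p in pairs}
    stems = []
    for i, j in sorted(pair_set):
        if (i - 1, j + 1) in pair_set:
            continue  # an outer neighbor exists, so this pair does not start a stem
        stem = []
        while (i, j) in pair_set:  # walk inward along the stacked pairs
            stem.append((i, j))
            i, j = i + 1, j - 1
        stems.append(stem)
    return stems

if __name__ == "__main__":
    for stem in find_stems([(1, 20), (2, 19), (3, 18), (6, 14), (7, 13)]):
        print(len(stem), "pairs,", len(stem) - 1, "stacking pairs:", stem)
```

A stem of $k$ base pairs contributes $k - 1$ stacking pairs, which is the quantity the approximation analysis later counts.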

Definition
where $E$ is the total energy of an RNA-pseudoknotted structure, $S_b$ is the total number of bases in the RNA-pseudoknotted sequence, $P_b$ is the total number of RNA pseudoknots, and $N_b$ is the number of bases that do not pair with others; $m$ is the weight of $S_b$, $n$ the weight of $P_b$, $l$ the weight of $N_b$, $u$ the weight of the coaxial stack and $v$ the weight of the dangle bases (cf. Fig. 4).

PTAS of RNA Folded Structure
An RNA structure is any set of base pairs for an RNA molecule (cf. Fig. 3). Finding the optimal RNA folded structure as a combination of stems and loops has become a new method for predicting RNA-pseudoknotted structure. An RNA secondary structure contains substructures such as hairpin loops, internal loops, multibranched loops, bulges and stems. RNAs are in fact three-dimensional molecules, and the set of base pairs in the three-dimensional structure of an RNA molecule is called its secondary structure.
We design a PTAS based on the characteristics of stems in the RNA secondary structure and analyze the optimized structure of stems. The paper divides each stem into several subsegments of length $t$, and then searches for the optimal structure formed by subsegments of length less than $t$ as the approximate structure of the given sequence.

Algorithm 1. PreAlgorithm (S, E)
1: E_1 is the energy value of the RNA-nested structure, and E_2 is the energy value of the RNA-pseudoknotted folding structure. Fill entry A[i, j] to save the energy value of E and the stems.
2: Define S_1 to save the k-stems found in RNA structures, define V_1 to save the energy value in the m-stems structure, and define P_1 to save the number of RNA pseudoknots of E_2.
3: For (k = 1; k ≤ 42; k++), search the prefix of i-stem s_i.
   For (l = 1; l ≤ 42; l++), search the suffix of j-stem s_i.
   Seek the k-stem s_k that is the relative maximum of the RNA molecule according to the MFE.
4: Mark the i-stem s_i that is the relative maximum in RNA structure S; S ← S − {s_l}.
6: Search for the stem s_j that is hypo-maximum in RNA structure S, excluding the relative maximum s_i.
7:
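The listing above is incomplete, so the following is only a loose sketch of the general idea it suggests: repeatedly take the best remaining stem and then the next best one that does not reuse its bases, allowing stems to cross. It is not the authors' PreAlgorithm. Stem length is used as a crude stand-in for free energy, and the names `candidate_stems` and `greedy_fold`, the pairing rules and the minimum loop length are all assumptions.

```python
# Assumed pairing rules: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def candidate_stems(seq, min_len=2, min_loop=3):
    """Enumerate maximal stems as lists of 0-based pairs (i, j)."""
    n = len(seq)
    stems = []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            if (seq[i], seq[j]) not in PAIRS:
                continue
            # Skip pairs that can be extended outward: they are not maximal.
            if i > 0 and j + 1 < n and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            stem, a, b = [], i, j
            while b - a > min_loop and (seq[a], seq[b]) in PAIRS:
                stem.append((a, b))
                a, b = a + 1, b - 1
            if len(stem) >= min_len:
                stems.append(stem)
    return stems

def greedy_fold(seq):
    """Pick stems longest-first; a stem is kept if it reuses no base.

    Crossing stems are allowed, so the result may contain pseudoknots.
    """
    chosen, used = [], set()
    for stem in sorted(candidate_stems(seq), key=len, reverse=True):
        bases = {base for pair in stem for base in pair}
        if used.isdisjoint(bases):
            chosen.append(stem)
            used |= bases
    return chosen

if __name__ == "__main__":
    print(greedy_fold("GGGGAAAACCCC"))
```

Enumerating candidates this way is quadratic in the number of start positions with an extra factor for stem extension; a real implementation would score stems with thermodynamic parameters rather than length.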

Terminology
Let the RNA sequence be $S = s_1 s_2 \ldots s_i \ldots s_n$, $s_i \in \{A, U, G, C\}$. Given a stem $I$ in $S$, we suppose the RNA optimum structure is $OPT(I) = \{x_1, x_2, \ldots, x_m\}$, where each $x_i$ is a stem, $x_i \in I$, $1 \le i \le m$ (cf. Fig. 5).
Definition 1. Let $LS[i,j]$ be the length of the stem $S[i,j]$ closed by the base pairs $(i,j)$ and $(k,l) \in S$, $i \le k < l \le j$, then LS[i,
Definition 2. Let $NPS(S)$ be the number of stacking pairs in stem $S[i,j]$, then $NPS(S) = LS - 1$.
Lemma 1. Let $S[i,j]$ be a stem of length $LS$, divide the stem $S[i,j]$ into $t$ segments $s_1, s_2, \ldots, s_t$, and let $L_{s_i}$ be the length of stem $s_i$; the number of stacking pairs after dividing the stem $S[i,j]$ is NPS(S) =
Proof. Let $(i,j)$ be the external base pair in stem $S[i,j]$ and, without loss of generality, let $(k,l)$ be the internal base pair in stem $S[i,j]$; according to Definitions 1 and 2, then
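The displayed formulas of Definition 1 and Lemma 1 are incomplete above. The following worked relations are a hedged reading, assuming that the stem consists of the consecutive pairs $(i,j), (i+1,j-1), \ldots, (k,l)$ and that the $t$ segments partition it; the index $r$ is introduced here for illustration only.

```latex
% Assumption: S[i,j] is the run of pairs (i,j), (i+1,j-1), ..., (k,l),
% and the segments s_1, ..., s_t partition these pairs.
\begin{align*}
LS &= k - i + 1 = j - l + 1, \\
NPS(S) &= LS - 1, \\
\sum_{r=1}^{t} L_{s_r} &= LS, \\
\sum_{r=1}^{t} \bigl(L_{s_r} - 1\bigr) &= LS - t .
\end{align*}
```

Under these assumptions, a stem of 12 base pairs has 11 stacking pairs; cut into 3 segments of 4 pairs each, it keeps $12 - 3 = 9$ of them.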

The analysis of PTAS
Given each $y_q \in OPT(I)$, let $L_q$ be the length of the stem $y_q$, $1 \le L_q \le n/2$, so the maximum number of stacking pairs for the stem $y_q$ is $L_q - 1$.
Let $MOP(y_q)$ be the maximum number of stacking pairs for the stem $y_q$, then $MOP(y_q) = L_q - 1$. If $L_q > l$ ($l > 2$), we divide the stem $y_q$ into $\lceil L_q/l \rceil$ shorter segments of length $l$.
We discuss two cases as follows.
(a) If $\lceil L_q/l \rceil \ne L_q/l$, then $\lceil L_q/l \rceil = \mathrm{int}(L_q/l) + 1$. The reduced number of stacking pairs for the primary stem $y_q$ is $\mathrm{int}(L_q/l) + 1$. For any stem $y_q \in OPT(I)$, the number of stacking pairs is more than $L_q - 1 - (L_q/l + 1)$. Suppose the approximation scheme is $AS(y_q)$, so $AS(y_q) > L_q - L_q/l$. Then $MOP(y_q)/AS(y_q) < (L_q - 1)/(L_q - L_q/l) < L_q/(L_q - L_q/l) = 1/(1 - 1/l)$, that is, $MOP(y_q)/AS(y_q) < 1 + 1/(l - 1)$.
(b) If $\lceil L_q/l \rceil = L_q/l$, the reduced number of stacking pairs for the stem $y_q$ is $L_q/l$. The number of stacking pairs, the bound on $AS(y_q)$ and the approximation ratio are then computed for this second case.
According to the two cases above, with $\varepsilon = 1/(l - 1)$, there exists a $(1 + \varepsilon)$ ($\varepsilon > 0$) PTAS for finding the maximum number of stacking pairs in RNA structure prediction.
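The bound $1 + 1/(l - 1)$ can be checked numerically. The sketch below assumes that cutting a stem of $L$ base pairs into $\lceil L/l \rceil$ segments removes exactly the stacking pairs at the cut boundaries, so $L - \lceil L/l \rceil$ stacking pairs remain; this is at least as many as the lower bounds used in the analysis. The function names are hypothetical.

```python
def stacking_after_split(L, l):
    """Stacking pairs kept when a stem of L pairs is cut into ceil(L / l) segments."""
    segments = -(-L // l)  # integer ceiling of L / l
    return L - segments

def ratio(L, l):
    """Ratio of the uncut stem's stacking pairs (L - 1) to those kept after cutting."""
    return (L - 1) / stacking_after_split(L, l)

if __name__ == "__main__":
    for l in (3, 5, 11):
        worst = max(ratio(L, l) for L in range(l + 1, 500))
        # Under this model the worst ratio reaches the bound exactly when
        # L leaves remainder 1 modulo l, and never exceeds it.
        print(f"l={l}: worst ratio {worst:.4f}, bound {1 + 1 / (l - 1):.4f}")
```

To reach a target $\varepsilon$, it suffices to pick a segment length $l \ge 1 + 1/\varepsilon$; for example $\varepsilon = 0.1$ corresponds to $l = 11$.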

Types of Pseudoknots Considered in RNA Folding Structure
If $a$ and $b$ are two local minima, then there exists a zigzag path connecting $a$ and $b$ [14]. Reidys provides the gfold software, which can implement Boltzmann sampling [24,25]. RNA secondary structures can be drawn with VARNA [8], and RNA topological structures can be computed [25,7]. We generalize the RNA-pseudoknotted framework based on the basin hopping graph (BHG) and study the differences in predicted folding behavior [17]. For a given RNA sequence $S$, its energy landscape $L$ is connected: for any two secondary structures $S_1$ and $S_2$ of $S$ there exists a path between $S_1$ and $S_2$, and for any two local minima $m_1$ and $m_2$ there exists a zigzag path connecting $m_1$ and $m_2$. We define such a path as follows: if a direct saddle separates the nearest valley points that the path $P$ passes before and after a point $v$, and there is a minimal shelf $L$, we declare the path $P$ a zigzag path. $P$ can be called the BHG, and the BHG is connected.
We can generalize RNA structures with pseudoknots using the BHG and the sampling strategy for local minima. To implement the gradient walk, we define a class of structures comprising five types: Type S, Type H, Type K, Type L and Type M (cf. Fig. 6). Type S refers to structures without pseudoknots. Removing base pairs is relatively simple, since it never results in an invalid structure; the general case involving five types of pseudoknots is nevertheless rather involved, even with the restriction to structures with at most one pseudoknot. See Table 1.
Adding base pairs is harder, since it is difficult to tell which base pairs can be added without changing the class of the structure; the general case involving five types of pseudoknots is rather involved, even with the restriction to structures with at most one pseudoknot. See Table 2. As an example, the paper presents PKB92 from tobacco mild green mosaic virus; we investigate a 27-base sequence with a pseudoknot, named PK1. Its structure is correctly predicted by gfold with an energy of −4.3 kcal/mol (cf. Fig. 7).
Table 1. Possible transitions between types of pseudoknots upon removing a single base pair.
Table 2. Possible transitions between types of pseudoknots upon adding a single base pair.
The next pseudoknot-free MFE secondary structure has an energy of 3.9 kcal/mol (cf. Fig. 8).
It is difficult to determine which base pairs can be added without changing the class of the RNA structure, and to compute the resulting change in energy without re-evaluating the structure. We restrict attention to the subset of structures with H-type pseudoknots (cf. Fig. 9). Following the principles of the BHG and MFE, the paper provides a path-searching algorithm to connect the graph of local minima (LM). We investigate the low-energy part of the BHG for the PKB92 sequence: PKB92 is more likely to fold first into the most stable secondary structure and then refold to form the pseudoknot. We label the LM by "Lx" and the saddles by "Sx", and label the edges by their energy barriers in kcal/mol. The optimal folding pathway of the PKB92 sequence based on MFE, and the suboptimal pathways that form the stem of the pseudoknot, can be drawn in the BHG. The local minima (Lx) and saddle points (Sx) of the structures are shown in Fig. 10.
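The path-searching step is not specified in detail. One common way to connect local minima in a barrier-labelled graph is to look for the path whose highest saddle is as low as possible; the sketch below does this with a Dijkstra-style search. It is an illustration under that assumption, not the paper's algorithm, and the node names and saddle energies in the example are made-up values.

```python
import heapq

def lowest_barrier_path(edges, start, goal):
    """Return (highest_saddle, path) for the path minimizing the highest saddle.

    edges maps a local minimum to a list of (neighbor, saddle_energy) tuples.
    """
    best = {start: float("-inf")}
    heap = [(float("-inf"), start, [start])]
    while heap:
        barrier, node, path = heapq.heappop(heap)
        if node == goal:
            return barrier, path
        if barrier > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, saddle in edges.get(node, []):
            cand = max(barrier, saddle)  # highest saddle seen along this path
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(heap, (cand, nxt, path + [nxt]))
    return None  # goal not reachable from start

if __name__ == "__main__":
    # Hypothetical saddle energies in kcal/mol, for illustration only.
    graph = {
        "L1": [("L2", 2.0), ("L3", 3.5)],
        "L2": [("L1", 2.0), ("L3", 1.0)],
        "L3": [("L1", 3.5), ("L2", 1.0)],
    }
    print(lowest_barrier_path(graph, "L1", "L3"))
```

The search visits each edge a bounded number of times through the priority queue, so it scales to graphs with many local minima.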

Experimental Comparison
We selected many sequences from Rfam10.1 and PseudoBase for the experiments and computed the energy values of stems, including pseudoknots, according to the energy parameters [31,11,29]. The polynomial-time algorithm can compute both RNA-nested and pseudoknotted structures in RNA sequences. RNA subsequences were picked at random from Rfam10.1 and PseudoBase for the experiments [19]. Many experiments on RNA-pseudoknotted structures indicate that the algorithm has an average prediction accuracy above 83%. Our algorithm can handle RNA sequences of more than 3300 bases. The prediction accuracy decreases as the length of the RNA sequence increases, and we have designed effective ways to improve the prediction accuracy for long sequences.
RNA sequences were generated from the bases A, C, G and U; four experiments on PseudoBase families were computed in less than 45 seconds on a quad-core CPU with 32 GB of memory. The experiments show that the prediction accuracy outperforms that of existing algorithms such as the MWM, PKNOTS and ILM algorithms [35,28,21,9]. Evolutionary algorithms also provide an important method for RNA structure prediction [33]. Structural alignment of RNA has proved to be a useful computational technique for identifying noncoding RNA (ncRNA) [32,20]. Our algorithm is faster than the other related algorithms on RNA secondary structure and target structure (see Tables 3 and 4) [33,32,20].
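The text does not say how prediction accuracy is measured. A common convention for structure prediction is to compare predicted and reference base pairs and report sensitivity and positive predictive value; the sketch below assumes that convention, and the function name `pair_accuracy` and the example pairs are hypothetical.

```python
def pair_accuracy(predicted, reference):
    """Return (sensitivity, ppv) for predicted versus reference base pairs."""
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    true_pos = len(pred & ref)  # pairs present in both structures
    sensitivity = true_pos / len(ref) if ref else 0.0
    ppv = true_pos / len(pred) if pred else 0.0
    return sensitivity, ppv

if __name__ == "__main__":
    # Made-up pairs for illustration; not taken from Rfam or PseudoBase.
    predicted = [(1, 10), (2, 9), (4, 7)]
    reference = [(1, 10), (2, 9), (3, 8), (5, 15)]
    sens, ppv = pair_accuracy(predicted, reference)
    print(f"sensitivity={sens:.2f}, ppv={ppv:.2f}")
```

Because the pairs are compared as sets, the measure works unchanged for structures with pseudoknots.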

Conclusion and Future Work
In this paper, we have presented an efficient algorithm for predicting RNA structure with pseudoknots that takes $O(n^3)$ time and $O(n^2)$ space. Its prediction accuracy, time complexity and space complexity outperform those of existing algorithms such as the MWM, PKNOTS and ILM algorithms, and it can predict arbitrary pseudoknots. We have presented a $(1 + \varepsilon)$ ($\varepsilon > 0$) PTAS for finding the maximum number of stackings and given the proof of the approximation scheme for RNA-pseudoknotted structure. We have improved several types of pseudoknots considered in RNA folding structure and analyzed the possible transitions between them.
Characterizing RNA folding structure using the BHG is a good computational method [16]. The RNA adopts an unexpected tandem three-way junction structure, and unspliced dimeric genomes are selected by the RNA conformer that directs packaging [13]. One of the strategies for feature selection often applied by brain-computer interface researchers is based on genetic algorithms [26]. Fuzzy-rough feature selection and vaguely quantified rough set feature selection have been coupled with CLONALG and AIRS for improved detection and computational efficiency [18]. Interactive segmentation of images has become an integral part of image-processing applications. Several graph-based segmentation techniques have been developed that depend on global minimization of an energy cost function; an adequate interactive segmentation scheme still needs a skilled initialization of regions with user-defined seed pixels distributed over the entire image [4]. The algorithmic ideas of these works are important for our work on RNA structure prediction [26,18,4].

Fig. 1. The model of V and W.

Fig. 2. The model of V, W and WM.

Fig. 4. The representation (a) and formulas (b) of coaxial stack and dangle bases.

Table 3. Comparison of the PreAlgorithm and PKNOTS algorithms.