A Java High Performance Tool For Topological Data Analysis
Figure 1: Graph over all possible base-pairings (left), weighted graph (middle) and clique weight rank persistent homology filtration (right).. The weights from w1 to wi refer to the weights of base-pairings while the weights form wi+1 to wn refer to the weights of backbone interactions.
Understanding secondary structure is important for inferring structure-function relationship of RNA molecules. Several methods have been forwarded for predicting RNA secondary structures; however, improving the robustness of the prediction algorithms remains an ongoing challenge in computational biology. In this ongoing work, topological data analysis, persistent homology, is introduced to probe the partition function landscape and filter out persistent features (loops), from which the lowest minimum free energy structure and its suboptimal cousins can be derived. RNA folding partition function landscape contains information, i. e., base-pair binding probability, about the entire ensemble of secondary structures; hence the right topologization over the partition function landscape is performed to derive persistent structural elements associated with an RNA sequence. To do so, we use clique weight rank persistent homology filtrations (CWRPH), jHoles.
The CWRPH constructs a filtered simplicial complex over a weighted graph. First, a graph is constructed from an RNA sequence by taking into account all possible Watson-Crick (A-U, G-C) and wobble (G-U) base pairings, see Figure-1 (left). Then, partition function algorithm from RNAstructure is used to compute the base-pairing probability between each base pair . The resulted probability values are used as weights of the base pairing interactions. Besides, the weights for backbone interactions are considered as one fixed value, see Figure-1 (middle). In this way, the weighted graph is constructed and is taken as the scaffold of the topological space from which a simplicial complex representation is derived. On the weighted graph, jHoles filtration computes Betti numbers, such as Betti0 (connected components), Betti1 (1-dimensional hole) Betti2 (2-dimensional hole), etc. At each jHoles filtration step, the family of simplices associated with the current filter value are introduced into the topological space, the Betti numbers are computed and the generators of the the persistent holes are displayed, as shown in Figure-1 (right).
Persistent hole generators of Betti1 are used to identify loops associated with the five classes of RNA structural elements, namely hairpin, helix, bulge, internal-loop and multi-branched-loops; for instance, on Figure-1-right the yellow and green colored generators represent hairpin and helix, respectively. However, since jHoles filtration depends on the weight of each interactions and highly probable base pairs are most likely to be in the known structure in a database of diverse RNA sequences , ignoring the low probable base-pairs yield even better result.
Once we generate the persistent loops, we can use graph transformation for capturing the possible folding transformations of the specified RNA sequences. This is possible by defining a graph-grammar in which the biologically accepted persistent loops are the basic primitives. Each rewriting rule of a graph-grammar defines a partial correspondence between the elements of its left-hand side, a pattern graph, and of its right-hand side, a rewrite graph. In the process of constructing a derivation from a graph grammar, the main challenge is how to find the occurrence of a pattern graph. In this study, our pattern graph search space is depend on the number of loops generated by jHoles, i.e., a fixed search space. Hence, as far as the biological criterion (in a single RNA secondary structure each nucleotide can form a base pair by interacting with at most one other nucleotide) are fulfilled, persistent loops constructed from high probable base-pairs can be used as a rewrite graph. The graph language generated by the graph-grammar is the set of all RNA secondary structures that are the derivations of the grammar.
In RNA folding problem the ultimate goal is to predict the optimal RNA secondary, a secondary structure with minimum free energy, of the given RNA primary sequence. As a result, the free energy of each structure, in the graph language, is the sum of the energies of its constituent persistent loops; thus, a computation technique will be used to compute the free energy of a secondary structure from its loop energies.