Abstract (eng)
Formal languages and grammars are a classical topic in computer science and a powerful tool to deal with complexity in terms of algorithmic design. In this thesis, four scientific works are presented that aim to solve problems in computational biology. These problems, from the area of RNA bioinformatics, are prediction of RNA secondary structure and the search for homologous sequences of known non-coding (nc-) RNA families. Also, an efficient embedding of these grammars in a functional programming language is presented.
First, two algorithms on structural non-coding RNA families are presented. A structural ncRNA family is an alignment of related RNA sequences together with their consensus structure. From it a stochastic model can be calculated which, in turn, can be used to search for further related sequences on a genome-wide scale. A number of different possibilities exist to produce a stochastic model from a structural alignment. In Chapter 5 the ramifications of different encodings are discussed.
The algorithm in Chapter 6 provides a solution to another problem on ncRNA families. In order to facilitate quality control on ncRNA family libraries, a method is provided to determine whether an RNA family is sufficiently well separated from all other families.
The algorithm on extended RNA secondary structures presented in Chapter 7 extends the nearest-neighbor secondary structure model toward better reflection of the knowledge gained from RNA tertiary structure. In particular, base pairing beyond the six canonical Watson-Crick pairs is taken into account. Some regions of the RNA with important biological roles contain almost exclusively non-canonical base pairs which can now be predicted in contrast to previous approaches which would model such regions as essentially unstructured.
Finally, in Chapter 8, a domain-specific language embedded in the functional programming language Haskell is presented. This embedding allows for simplified algorithmic development on a high level. In particular, this embedded language makes it possible to write and extend the previous algorithms easily, while providing performance close to that of the C programming language.