Grammatical approaches to problems in RNA bioinformatics

Christian Höner zu Siederdissen

doi:10.25365/thesis.28648

Title (eng)

Grammatical approaches to problems in RNA bioinformatics

Parallel title (deu)

Grammatikalische Ansätze für Probleme in der RNA Bioinformatik

Author

Christian Höner zu Siederdissen

Advisor

Ivo Hofacker

Assessor

Rolf Backofen

Thomas Rattei

Abstract (deu)

Formale Sprachen und Grammatiken sind ein klassisches Thema in der Informatik und maechtiges Werkzeug um die Komplexitaet algorithmischen Designs zu beherrschen. In dieser Arbeit werden vier wissenschaftliche Arbeiten zur Loesung von Problemen in der computergestuetzten Biologie praesentiert. Diese Probleme, ausgewaehlt aus dem Bereich der RNA-Bioinformatik, sind die Vorhersage von RNA Sekundaerstrukturen und die Suche nach homologen Sequenzen bekannter nichtkodierender (nc-) RNA Familien. Ausserdem wird eine effiziente Einbettung dieser Grammatiken in eine funktionale Programmiersprache vorgestellt. Zuerst werden zwei Algorithmen zu strukturellen RNA Familien praesentiert. Eine strukturelle ncRNA Familie ist ein Alignment von verwandten RNA Sequenzen zusammen mit ihrer gemeinsamen Struktur. Sie kann in ein stochastisches Modell verwandelt werden. Solch ein Modell ermoeglicht das Auffinden von weiteren verwandten Sequenzen auf genomweiter Ebene. Verschiedene Moeglichkeiten existieren um solche Modelle zu erzeugen. In Kapitel 5 werden die Konsequenzen verschiedener Kodierungen diskutiert. Der Algorithmus in Kapitel 6 bietet eine Loesung fuer ein anderes Problem mit ncRNA Familien. Um die Qualitaet der vorhandenen Familien sicherzustellen wird eine Methode bereitgestellt, die es erlaubt festzustellen ob eine Familie genuegend stark von allen anderen Familien differenziert. Der Algorithmus zu erweiterten RNA Sekundaerstrukturen in Kapitel 7 erweitert das nearest-neighbor Sekundaerstrukturmodell zu einem das das bekannte Wissen zu RNA Strukturen besser reflektiert. Insbesondere sind Basenpaarungen ueber das kanonische Watson-Crick Paarungsmodell hinaus moeglich. Solch ein erweitertes Modell dient der besseren Vorhersage von Basenpaarungen in Regionen welche als wichtig in der biologischen Rolle vieler RNAs erkannt wurden. Zuletzt wird eine domaenenspezifische Sprache, eingebettet in die funktionale Programmiersprache Haskell, in Kapitel 8 besprochen.
Eine solche Einbettung ermoeglicht ein einfacheres Entwickeln in einer Hochsprache. Insbesondere ist es moeglich die vorherigen Algorithmen einfach zu formulieren und zu erweitern ohne auf C-nahe Geschwindigkeit verzichten zu muessen.

Abstract (eng)

Formal languages and grammars are a classical topic in computer science and a powerful tool to deal with complexity in terms of algorithmic design. In this thesis, four scientific works are presented that aim to solve problems in computational biology. These problems, from the area of RNA bioinformatics, are prediction of RNA secondary structure and the search for homologous sequences of known non-coding (nc-) RNA families. Also, an efficient embedding of these grammars in a functional programming language is presented. First, two algorithms on structural non-coding RNA families are presented. A structural ncRNA family is an alignment of related RNA sequences together with their consensus structure. From it a stochastic model can be calculated which, in turn, can be used to search for further related sequences on a genome-wide scale. A number of different possibilities exist to produce a stochastic model from a structural alignment. In Chapter 5 the ramifications of different encodings are discussed. The algorithm in Chapter 6 provides a solution to another problem on ncRNA families. In order to facilitate quality control on ncRNA family libraries, a method is provided to determine whether an RNA family is sufficiently well separated from all other families. The algorithm on extended RNA secondary structures presented in Chapter 7 extends the nearest-neighbor secondary structure model toward better reflection of the knowledge gained from RNA tertiary structure. In particular, base pairing beyond the six canonical Watson-Crick pairs is taken into account. Some regions of the RNA with important biological roles contain almost exclusively non-canonical base pairs which can now be predicted in contrast to previous approaches which would model such regions as essentially unstructured. Finally, in Chapter 8, a domain-specific language embedded in the functional programming language Haskell is presented. 
This embedding allows for simplified algorithmic development on a high level. In particular, this embedded language makes it possible to write and extend the previous algorithms easily, while providing performance close to that of the C programming language.

Keywords (eng)

RNA secondary structure; RNA families; covariance model; semantic ambiguity; discriminatory power; context-free grammar; domain-specific language; functional language; Haskell

Keywords (deu)

RNA Sekundaerstruktur; RNA Familien; Covarianzmodell; Semantische Mehrdeutigkeit; Trennschaerfe; kontext-freie Grammatik; domaenenspezifische Sprache; funktionale Sprache; Haskell

Subject (deu)

Molekularbiologie

Subject (deu)

Angewandte Informatik: Sonstiges

Type (deu)

Dissertation

Persistent identifier

https://phaidra.univie.ac.at/o:1301508

DOI

10.25365/thesis.28648

URN

urn:nbn:at:at-ubw:1-29596.38582.317055-2

URI

https://utheses.univie.ac.at/detail/25577

Extent (deu)

142 S. : graph. Darst.

Number of pages

144

Association (deu)

Fakultät für Lebenswissenschaften

Persistent identifier

https://phaidra.univie.ac.at/o:1301509

Citable links

Persistent identifier
https://phaidra.univie.ac.at/o:1301508
Handle
https://hdl.handle.net/11353/10.1301508
DOI
https://doi.org/10.25365/thesis.28648
URN
https://nbn-resolving.org/nbn:at:at-ubw:1-29596.38582.317055-2
Managed by

u:theses
Details

Uploader

Universitätsbibliothek Wien / u:theses

Object type

Container

Created

29.10.2021 10:26:20