Abstract (eng)
Epigenetics, the study of genomic information beyond that encoded in the DNA sequence, has become an active research area, driven by the rapid development of high-throughput technologies. Bioinformatics therefore plays an important role in analyzing the resulting massive datasets and in formulating biological hypotheses.
DNA methylation is an important epigenetic mark in developmental and disease biology. A widely used technique for profiling genome-wide DNA methylation, BS-Seq, combines bisulfite treatment, after which unmethylated cytosines (C) are read as thymines (T), with deep sequencing. The C-to-T conversion raises a number of challenges in mapping the bisulfite-converted short reads to the reference genome. In addition, current technology does not account for the heterogeneity of DNA methylation in mixtures of cells, which reduces the accuracy of the estimated methylation patterns in the genome. Hence, new bioinformatics methods are required to estimate cell-type-specific DNA methylation.
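The mapping difficulty caused by bisulfite conversion can be illustrated with a minimal Python sketch. It is not the tool developed in this thesis: the function names `bisulfite_convert` and `methylation_calls` are hypothetical, only the forward strand is modeled, and the read is assumed to be already aligned to the reference without gaps.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulfite treatment of the forward strand.

    Every unmethylated C is read as T; methylated Cs are protected.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def methylation_calls(reference, read, offset=0):
    """Classify each reference cytosine covered by a gap-free aligned read."""
    calls = {}
    for i, base in enumerate(read):
        pos = offset + i
        if reference[pos] == "C":
            if base == "C":
                calls[pos] = "methylated"
            elif base == "T":
                calls[pos] = "unmethylated"
    return calls


reference = "ACGTCCGATC"
# Assume only the cytosine at position 1 is methylated (illustrative value).
read = bisulfite_convert(reference, methylated_positions={1})

# A naive aligner sees every converted C as a mismatch against the reference.
mismatches = sum(r != q for r, q in zip(reference, read))
calls = methylation_calls(reference, read)
```

In this toy case three of the ten bases appear as mismatches although the read contains no sequencing error, which is why a mapper for BS-Seq reads has to treat a T in the read opposite a C in the reference as a possible match.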
Integrating multiple datasets that profile epigenetic or chromatin marks across many samples, conditions and organisms is also an underdeveloped field in bioinformatics, even though biological data are growing rapidly. Such integration is essential for finding epigenomic patterns, such as a chromatin-based epigenetic code. However, comparative analysis is difficult because the marks differ in distribution and scale.
In this thesis, I developed bioinformatics tools and applied them to the model organism Arabidopsis thaliana. First, I implemented a new, sensitive tool for analyzing BS-Seq data based on Smith-Waterman local alignment. Second, I developed an efficient algorithm to deal with heterogeneity in DNA methylation data derived from BS-Seq. Finally, I propose a method for integrating epigenomic signals from multiple genome-wide profiling datasets for downstream data mining, e.g. epigenetic signature discovery.