Ab initio gene finding, or gene prediction, is still an interesting problem. Current gene finders work well, but predictions of eukaryotic genes in particular leave much room for improvement. Our gene finder, based on Anders Krogh's HMMGene, is used to test ideas for improving the predictions. One idea is to incorporate external data such as gene expression to aid the prediction, which is otherwise driven by signals and sequence composition. The gene finder is implemented in Python using the Python bindings of our own free (LGPL) HMM library, the General Hidden Markov Model library (GHMM).
Finding the genetic causes of complex diseases such as autism and ADHD is complicated by ambiguity and subjectivity in the diagnostic process and by the simultaneous involvement of multiple genes and environmental factors. We investigate the application of mixture-model-based clustering to fused genotype and phenotype data. This joint analysis might yield further insight into the complex interactions between genotypes and phenotypes which underlie a specific disease pattern.
Detecting chromosomal aberrations from ArrayCGH and gene expression data
Chromosomal aberrations such as deletions or duplications of chromosomal regions are a crucial contributing factor to cancer. The aberrations can be detected by observing the relative hybridization intensities of healthy vs. diseased patients for BAC clones covering complete genomes. A Hidden Markov Model with an inhomogeneous Markov chain makes it possible to reflect dependencies between overlapping clones.
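As an illustration of the idea, the following minimal sketch decodes a series of log2 ratios into the states loss, normal and gain with the Viterbi algorithm. It is not the project's implementation and does not use GHMM; the state means, the shared standard deviation and the per-clone stay probabilities are assumptions chosen for the example. The chain is inhomogeneous in the sense that the probability of staying in the same state may differ from clone to clone, for example being higher where two neighbouring clones overlap.

```python
import math

STATES = ("loss", "normal", "gain")
# Assumed log2-ratio means per state and an assumed shared standard deviation.
MEANS = {"loss": -1.0, "normal": 0.0, "gain": 0.58}
SIGMA = 0.3


def log_gauss(x, mean, sigma=SIGMA):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)


def log_trans(prev, cur, stay):
    """Log transition probability; `stay` may differ per position."""
    if prev == cur:
        return math.log(stay)
    return math.log((1.0 - stay) / (len(STATES) - 1))


def viterbi(log_ratios, stay_probs):
    """Most likely state path; stay_probs[t] governs the step from clone t-1 to t."""
    score = {s: math.log(1.0 / len(STATES)) + log_gauss(log_ratios[0], MEANS[s])
             for s in STATES}
    back = []
    for t in range(1, len(log_ratios)):
        new_score, pointers = {}, {}
        for cur in STATES:
            # Best predecessor state for `cur` at position t.
            best_prev = max(STATES,
                            key=lambda p: score[p] + log_trans(p, cur, stay_probs[t]))
            new_score[cur] = (score[best_prev]
                              + log_trans(best_prev, cur, stay_probs[t])
                              + log_gauss(log_ratios[t], MEANS[cur]))
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


ratios = [0.05, -0.1, -0.9, -1.1, -0.95, 0.02, -0.05, 0.1, 0.6, 0.55, 0.65]
path = viterbi(ratios, [0.9] * len(ratios))
```

In a real analysis the emission parameters would be estimated from data, and the stay probabilities would be derived from the actual clone layout rather than set to a constant.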
The modeling and analysis of sequence motifs is a central task in the elucidation of biological processes such as gene regulation. The choice of model class is crucial for obtaining a representation of the motif that is suitable for the biological application. For instance, previous studies showed that for transcription factors which bind to divergent binding sites, mixtures of multiple PWMs increase performance. However, estimating a conventional mixture distribution for each position will in many cases cause overfitting. We avoid this problem by employing a context-specific independence (CSI) framework. In CSI mixtures, model complexity is automatically adapted to match the variability found in a given data set.
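To make the parameter-sharing idea concrete, here is a small sketch of a two-component mixture over a motif of length three in which components share a column wherever the data would not justify separate ones. The weights, the probabilities and the fixed grouping are invented for illustration; in a CSI mixture the grouping would be learned from the data, which this sketch does not attempt.

```python
import math

WEIGHTS = [0.6, 0.4]  # assumed mixture weights

# One entry per motif position: a list of (components sharing it, distribution).
POSITIONS = [
    # Position 1: both components share one column.
    [((0, 1), {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1})],
    # Position 2: each component has its own column.
    [((0,), {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}),
     ((1,), {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7})],
    # Position 3: shared again.
    [((0, 1), {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1})],
]


def column(position, component):
    """Return the distribution a component uses at a position."""
    for members, dist in position:
        if component in members:
            return dist
    raise ValueError("component not covered at this position")


def log_likelihood(site):
    """Log-likelihood of one binding site under the mixture."""
    total = 0.0
    for k, weight in enumerate(WEIGHTS):
        prob = weight
        for position, base in zip(POSITIONS, site):
            prob *= column(position, k)[base]
        total += prob
    return math.log(total)


def free_parameters():
    """Free mixture weights plus 3 free probabilities per distinct column."""
    return (len(WEIGHTS) - 1) + sum(3 * len(position) for position in POSITIONS)
```

With this grouping the model has 13 free parameters, compared with 19 for a conventional two-component mixture that keeps a separate column for every component at every position.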
The delineation of protein complexes from protein-protein interaction data is not as trivial as it may seem. We developed a simple probabilistic framework to cluster purifications while preserving the partial order relation among purifications. With a simple graph-based approach motivated by the asymmetric relationship between purifications, we can visualize overlapping components of protein complexes as supported by the experiment.
Genomic tiling arrays are universal arrays in the sense that they cover complete genomes or chromosomes uniformly, in contrast to most other types of DNA microarrays, for which specific sites of interest such as genes or splice sites are defined a priori. We define the problem of choosing optimal oligonucleotide probes from large candidate sets and provide efficient algorithms for solving it, which run in linear time in most instances.
DNA microarrays, well known for measuring gene expression levels, can also be used to detect the presence or absence of biological targets (viruses or bacteria) from hybridization patterns of oligonucleotide probes and genomic DNA of the agents. Due to the sequence similarity of possible targets, the use of non-unique oligonucleotides becomes necessary. With the use of statistical group testing and phylogenetic information about the targets, even the detection of novel targets becomes viable.
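The role of non-unique probes can be illustrated with a deliberately simplified, noise-free decoder. It uses plain combinatorial group testing, not the statistical approach described above: every target that should have lit up a negative probe is ruled out, and the remaining targets are candidates. The probe and target names are hypothetical.

```python
def candidate_targets(probe_targets, positive_probes):
    """Targets not ruled out by any negative probe (noise-free assumption).

    probe_targets maps each probe to the set of targets it hybridizes to;
    positive_probes is the set of probes with an observed signal.
    """
    all_targets = set().union(*probe_targets.values())
    excluded = set()
    for probe, targets in probe_targets.items():
        if probe not in positive_probes:
            # A negative probe excludes every target it would have detected.
            excluded |= targets
    return all_targets - excluded


# Hypothetical design: p1 and p2 are non-unique, p3 and p4 are unique.
design = {
    "p1": {"A", "B"},
    "p2": {"B", "C"},
    "p3": {"C"},
    "p4": {"A"},
}
candidates = candidate_targets(design, {"p1", "p2"})
```

Real hybridization data contain false positives and false negatives, which is why a statistical treatment is needed in practice; this sketch only shows how overlapping probe sets can still single out a target.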
Detecting whether two proteins are homologs is one of the fundamental problems in bioinformatics. Classically, their sequence similarity is measured with a sequence alignment score and a decision about homology is made using score statistics. How well one can solve this classification problem is strongly influenced by the assumptions necessary for the statistics to hold. We use an approach based on Support Vector Machines to address this problem.
In situ hybridization experiments elucidate the spatial distribution of expressed mRNA in organisms. For Drosophila in particular, large amounts of data are available for several developmental stages, complementing the DNA microarray gene expression experiments. We have developed an image processing pipeline and a framework for joint analysis, which allows the detection of co-located, co-expressed genes from fused data sets.
Whether to cluster at all, which clustering method to use and how many clusters to choose are pressing questions in bioinformatics. Mostly, decisions are made by users of clustering software based on experience, guided by benchmarking or by indicators for the reliability of solutions or model fit. However, as clustering algorithms always produce solutions, inappropriate methods or parameters are often used and invalid results produced. Meta-learning refers to the application of machine learning techniques to choosing methods and to guiding the setting of parameters. We intend to build a computational framework to perform cluster validation and apply meta-learning to the problem of analyzing gene expression time-courses. More information at the Project Page. Joint work funded by CAPES (Brazil) and DAAD (Germany) under the program Probral.
We work in collaboration with Ralf Spörle from the Department of Developmental Genetics, Christian Hege, head of the Visualization Department at the Konrad-Zuse Zentrum (ZIB), and Bernd Fischer, Professor at the University of Lübeck, on the construction of an atlas of gene expression patterns in mouse embryos. The central piece is the construction of a non-linear registration that maps numerous in situ tomograms onto an annotated standard model. This mapping then yields an automatic anatomical annotation of high-resolution 3D spatial expression patterns as well as the fusion of all patterns into one standard model. The mapped expression patterns can then be viewed and analyzed together within the standard model. Analysis of the data involves statistical group testing for functional territories.
The regulatory processes that govern cell proliferation and differentiation are central to developmental biology. Particularly well studied in this respect is the hematopoietic system. Gene expression data from cells at various distinguishable developmental stages foster the elucidation of the underlying molecular processes, which change gradually over time and lock cells into certain lineages. We developed a statistical framework for tasks ranging from visualization, querying, and finding clusters of similar genes, to answering detailed questions about the functional roles of individual genes and their similarities and differences.
The molecular processes of life are dynamic over time. Microarray experiments measuring the expression levels of a multitude of genes over time are one way of gaining insight into these dynamic processes. As a first analysis, groups of similar expression patterns are routinely identified. We have developed an approach which allows the use of prior knowledge, is flexible and is very robust to noise. The method is implemented in the software GQL, which allows control of the analysis process through graphical user interfaces. Currently, we are extending our framework to allow the integration of further data related to transcription or protein interactions. Furthermore, we are investigating methodologies for validating clusterings of genes with functional annotation.
Detecting proteins which share a common ancestor is an important step in understanding protein structure and function. Multi-domain proteins normally cause problems due to the spurious similarities they induce; with a simple graph-based approach based on the concept of asymmetric similarity, we were able to clearly outperform PSI-BLAST.