The modeling and analysis of sequence motifs is a central task in the elucidation of biological processes such as gene regulation. The choice of model class is crucial for obtaining a representation of the motif that suits the biological application. For instance, previous studies showed that for transcription factors that bind to divergent binding sites, mixtures of multiple position weight matrices (PWMs) improve performance. However, estimating a conventional mixture distribution for each position will in many cases cause overfitting. We avoid this problem by employing a context-specific independence (CSI) framework. In CSI mixtures, model complexity is automatically adapted to match the variability found in a given data set.
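The idea of sharing parameters between mixture components can be illustrated with a minimal sketch. It assumes a simple representation in which, at each motif position, the components are partitioned into groups and all components in a group share one nucleotide distribution. The names `csi_likelihood`, `free_parameters`, `structure`, and `params` are hypothetical and chosen for illustration only. The grouping is fixed by hand here, whereas the CSI framework learns it from the data, so this is not the authors' implementation.

```python
def csi_likelihood(seq, weights, structure, params):
    """Likelihood of one sequence under a mixture with shared parameters.

    weights[k]      : mixture weight of component k
    structure[j][k] : group index of component k at position j
    params[j][g]    : dict symbol -> probability, shared by group g at position j
    """
    total = 0.0
    for k, w in enumerate(weights):
        p = w
        for j, symbol in enumerate(seq):
            # Components in the same group read the same distribution.
            p *= params[j][structure[j][k]][symbol]
        total += p
    return total


def free_parameters(structure, alphabet_size):
    """Count free emission parameters: one distribution per distinct group."""
    return sum(len(set(groups)) * (alphabet_size - 1) for groups in structure)


# Toy motif of length 3 with two components (assumed values, not real data).
# Only position 1 distinguishes the components; positions 0 and 2 are shared.
weights = [0.5, 0.5]
structure = [[0, 0], [0, 1], [0, 0]]
params = [
    [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}],
    [{"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
     {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}],
    [{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}],
]

# A conventional mixture gives every component its own distribution everywhere.
conventional = [[0, 1], [0, 1], [0, 1]]

print(csi_likelihood("ACT", weights, structure, params))
print(free_parameters(structure, 4), free_parameters(conventional, 4))
```

In this toy setting the shared structure needs 12 free emission parameters where the conventional two-component mixture needs 18, which shows how fitting separate distributions only at positions that actually differ reduces the risk of overfitting.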
Another application of the CSI mixture framework is the clustering of protein families, in which subgroups are inferred and specificity-determining residues are predicted simultaneously from multiple sequence alignments of the families. A Dirichlet mixture prior based on nine basic chemical properties of the standard amino acids is used to regularize the structure learning for protein domain data. Evaluation of the method on several well-studied families revealed good clustering performance and ample biological support for the predicted positions.
Georgi, B., Schultz, J., and Schliep, A. Partially-supervised protein subclass discovery with simultaneous annotation of functional residues (2009)
Georgi, B. and Schliep, A. Partially-supervised context-specific independence mixture modeling (2007)
Georgi, B., Schultz, J., and Schliep, A. Context-Specific Independence Mixture Modelling for Protein Families (2007)
Georgi, B., Spence, M. A., Flodman, P., and Schliep, A. Mixture model based group inference in fused genotype and phenotype data (2007)
Georgi, B. and Schliep, A. Context-specific independence mixture modeling for positional weight matrices (2006)