International Journal On Advances in Life Sciences, volume 1, number 1, 2009


A Stratified Beta-Gaussian Finite Mixture Model for Clustering Genes With Multiple Data Sources

Authors:
Xiaofeng Dai
Harri Lähdesmäki
Olli Yli-Harja

Keywords: stratified finite mixture model; gene clustering; multiple data fusion; prior

Abstract:
This paper presents sBGMM, a clustering framework based on a stratified mixture model. It extends our previously developed beta-Gaussian mixture model (BGMM): in addition to clustering genes based on beta- and Gaussian-distributed data, sBGMM converts information from a third data source into priors, according to which genes are prepartitioned into several groups. By assigning genes in the same pre-group the same prior probabilities of belonging to a given cluster, sBGMM transfers information from the third data source into the results and allows great flexibility in the choice of that source. Unlike data sources that are modeled as components of the joint model, the information used for prior construction can come from any source and have any level of sparsity. Besides this flexible choice of prior, sBGMM can also be extended to data following other parametric distributions, which adds further flexibility to this model-based clustering framework. We developed an expectation maximization algorithm for jointly estimating the parameters of sBGMM, and we propose to tackle the model selection problem with approximation-based model selection criteria. Four well-known penalized criteria are tested and compared: the Akaike information criterion, a modified Akaike information criterion, the Bayesian information criterion, and the integrated classification likelihood-Bayesian information criterion. Both simulations and a real case study indicate that information from different data sources can reinforce each other, and that using information from one data source to stratify the model can improve clustering accuracy, especially when the noise is comparatively high in both the beta- and Gaussian-distributed data.
Applying sBGMM to a full set of real mouse gene expression data (modeled as Gaussian distributed) and protein-DNA binding probabilities (modeled as beta distributed) not only yields more biologically reasonable results than the nonstratified version, but also reveals a relationship between two sets of genes and eight transcription factors (TFs), all of which are likely to be involved in Myd88-dependent Toll-like receptor 3/4 (TLR-3/4) signaling cascades.
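
The stratification idea in the abstract (genes in the same pre-group share the same prior probabilities of cluster membership) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one beta-distributed and one Gaussian-distributed measurement per gene, conditional independence of the two given the cluster, and fixed component parameters; all function and variable names (`e_step`, `update_weights`, `strata`, `weights`) are hypothetical.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def gauss_pdf(y, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def e_step(xs, ys, strata, weights, beta_params, gauss_params):
    """Posterior cluster probabilities tau[j][k] for gene j and cluster k.

    weights[s][k] is the prior probability of cluster k for every gene in
    stratum s (assumption: strata come from the third data source).
    """
    tau = []
    for x, y, s in zip(xs, ys, strata):
        joint = [
            weights[s][k]
            * beta_pdf(x, *beta_params[k])
            * gauss_pdf(y, *gauss_params[k])
            for k in range(len(beta_params))
        ]
        total = sum(joint)
        tau.append([v / total for v in joint])
    return tau


def update_weights(tau, strata, n_strata):
    """Re-estimate stratum-specific priors as the mean of tau within each stratum."""
    n_clusters = len(tau[0])
    sums = [[0.0] * n_clusters for _ in range(n_strata)]
    counts = [0] * n_strata
    for row, s in zip(tau, strata):
        counts[s] += 1
        for k, v in enumerate(row):
            sums[s][k] += v
    # An empty stratum falls back to a uniform prior (a choice made for this sketch).
    return [
        [v / counts[s] for v in sums[s]] if counts[s] else [1.0 / n_clusters] * n_clusters
        for s in range(n_strata)
    ]
```

A full EM loop would alternate `e_step` with updates of the beta and Gaussian component parameters as well as the priors; those updates are omitted here because the abstract does not specify them.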

Pages: 14 to 25

Copyright: Copyright (c) to authors, 2009. Used with permission.

Publication date: June 7, 2009

Published in: journal

ISSN: 1942-2660