On the Maximum Likelihood Estimation and Goodness-of-fit of the Dirichlet Distribution in Microbiome Compositional Analysis
Student: Sucharitha Dodamgodage
Advisor: Dr. Sumona Mondal
Co-advisor: Dr. Shantanu Sur
PhD Committee members: Dr. Kathleen Kavanagh, Dr. James Greene, Dr. Mohammed Meysami, Dr. Nabendu Pal
Wednesday, May 21st, 2025, 9:00 am, Snell 212, Zoom link
Abstract
The importance of microbiomes in sustaining stable and healthy ecosystems is gaining increasing attention across fields including human health, agriculture, and environmental studies. Microbiome count data, commonly obtained through sequencing techniques, retain only information about the relative proportions of taxa and are therefore compositional in nature. Models based on the Dirichlet distribution could be an ideal choice for such data and would not require further data transformation before analysis. However, such models are not commonly used in microbiome analysis because of two major challenges, which are the central focus of this proposal: (1) the literature offers limited guidance on robust estimation of Dirichlet model parameters, and (2) no goodness-of-fit (GOF) test is available that can reliably evaluate whether a Dirichlet model fits observed microbial relative abundance data. Maximum likelihood estimation (MLE) is widely regarded as the most effective method for estimating Dirichlet parameters, under the assumption that the estimates exist and are unique. However, without an analytical proof of these properties, interpretations of the MLE’s bias and mean squared error remain uncertain and can be misleading. We address this issue by analytically demonstrating the existence and uniqueness of the MLEs for the parameters of the general Dirichlet distribution and of the MLE for the single scalar parameter of the symmetric Dirichlet distribution.
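As an illustration of the symmetric case mentioned above, the following minimal sketch computes the MLE of the single scalar parameter numerically. It is not the analytical argument of the proposal. The function names (`sample_dirichlet`, `symmetric_loglik`, `fit_symmetric_dirichlet`), the search bracket, and the simulated data are assumptions made for this example. A golden-section search is used because it needs only that the log-likelihood has a single interior maximum within the bracket.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one composition from Dirichlet(alpha) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def symmetric_loglik(a, n, k, sum_log_x):
    """Log-likelihood of n compositions with k parts under Dirichlet(a, ..., a)."""
    return n * (math.lgamma(k * a) - k * math.lgamma(a)) + (a - 1.0) * sum_log_x


def fit_symmetric_dirichlet(data, lo=1e-3, hi=1e3, tol=1e-6):
    """Maximize the log-likelihood over log(a) by golden-section search.

    Assumes every part is strictly positive (no zeros) and that the
    maximizer lies inside the hypothetical bracket [lo, hi].
    """
    n, k = len(data), len(data[0])
    sum_log_x = sum(math.log(x) for row in data for x in row)

    def objective(t):
        return symmetric_loglik(math.exp(t), n, k, sum_log_x)

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    left, right = math.log(lo), math.log(hi)
    while right - left > tol:
        c = right - inv_phi * (right - left)
        d = left + inv_phi * (right - left)
        if objective(c) > objective(d):
            right = d  # the maximum lies in [left, d]
        else:
            left = c   # the maximum lies in [c, right]
    return math.exp((left + right) / 2.0)


# Simulated example: 500 compositions with 4 parts and true parameter 2.0.
rng = random.Random(0)
data = [sample_dirichlet([2.0] * 4, rng) for _ in range(500)]
alpha_hat = fit_symmetric_dirichlet(data)
print(f"estimated symmetric parameter: {alpha_hat:.3f}")
```

Real relative abundance data often contain zeros, which make the log-likelihood undefined; handling them is outside the scope of this sketch.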
Furthermore, we aim to conduct a detailed GOF analysis for the Dirichlet distribution, an area with very limited prior research. GOF testing for the Dirichlet distribution is challenging because of the high dimensionality of microbiome data and the complexity of the Dirichlet cumulative distribution function. The few GOF methods explored so far are often computationally demanding and perform poorly. We propose adopting a chi-square GOF test and explore its behavior when applied to continuous Dirichlet-distributed data, focusing on how binning strategies influence test performance for different parameter values. Our next objective is to improve the performance of the chi-square GOF test for the Dirichlet distribution by implementing a parametric bootstrap procedure to enhance test power against various alternatives and by extending the test’s applicability to higher-dimensional settings.
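To make the idea of a binned chi-square test with parametric bootstrap calibration concrete, here is a minimal sketch under simplifying assumptions. It is not the procedure developed in this work. Three choices are stand-ins made for brevity: a method-of-moments fit is used in place of the MLE, the bins are equal-width intervals of the first part only, and expected bin probabilities are approximated by Monte Carlo with add-one smoothing. All function names and default values are hypothetical.

```python
import random


def sample_dirichlet(alpha, n, rng):
    """Draw n compositions from Dirichlet(alpha) by normalizing gamma draws."""
    rows = []
    for _ in range(n):
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        s = sum(g)
        rows.append([v / s for v in g])
    return rows


def fit_moments(data):
    """Method-of-moments Dirichlet fit (a stand-in for the MLE in this sketch)."""
    n, k = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    var1 = sum((row[0] - means[0]) ** 2 for row in data) / (n - 1)
    # Var(x_1) = m_1 (1 - m_1) / (a_0 + 1) gives the total concentration a_0.
    precision = means[0] * (1.0 - means[0]) / var1 - 1.0
    return [m * precision for m in means]


def chi_square_stat(data, alpha, bins, n_ref, rng):
    """Pearson statistic on equal-width bins of the first part.

    Expected bin probabilities under Dirichlet(alpha) are approximated by
    Monte Carlo, with add-one smoothing so that no expected count is zero.
    """
    def counts(rows):
        c = [0] * bins
        for row in rows:
            c[min(int(row[0] * bins), bins - 1)] += 1
        return c

    ref = counts(sample_dirichlet(alpha, n_ref, rng))
    probs = [(r + 1.0) / (n_ref + bins) for r in ref]
    obs, n = counts(data), len(data)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(obs, probs))


def bootstrap_gof(data, bins=5, n_boot=99, n_ref=1000, seed=0):
    """Parametric-bootstrap p-value for the binned chi-square statistic."""
    rng = random.Random(seed)
    alpha_hat = fit_moments(data)
    t_obs = chi_square_stat(data, alpha_hat, bins, n_ref, rng)
    exceed = 0
    for _ in range(n_boot):
        # Simulate from the fitted model, then refit and recompute the statistic.
        boot = sample_dirichlet(alpha_hat, len(data), rng)
        t_boot = chi_square_stat(boot, fit_moments(boot), bins, n_ref, rng)
        if t_boot >= t_obs:
            exceed += 1
    return t_obs, (1 + exceed) / (n_boot + 1)


# Simulated example: data that really are Dirichlet-distributed.
rng = random.Random(42)
null_data = sample_dirichlet([2.0, 3.0, 5.0], 200, rng)
t_obs, p_value = bootstrap_gof(null_data)
print(f"chi-square statistic = {t_obs:.2f}, bootstrap p-value = {p_value:.3f}")
```

Binning on a single part ignores most of the joint structure of the composition, so this sketch says little about power in higher dimensions; it only shows how refitting inside each bootstrap replicate calibrates the statistic when the parameters are estimated.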