Bayesian Viral Substitution Analysis and Covariance Estimation via Generalized Fiducial Inference

With advances in biology and computing technologies, an increasing amount of large-scale biological data awaits analysis. Aiming to develop statistical tools for omics data, we focus on the problem of modeling viral sequencing data, as well as a fundamental statistical question with applications in biology and many other fields. This dissertation comprises three major parts. Motivated by a case-control study of influenza viral populations sampled at multiple time points, in the first part we model the sequencing data of a viral population under a Bayesian Dirichlet mixture distribution. We have developed an efficient clustering scheme that enables us to distinguish changes caused by treatment from variation within viral populations. As a proof of concept, we applied our method to a well-studied HIV dataset and successfully identified known drug-resistant regions and additional potential sites. For the influenza data, our algorithm revealed two genome sites with strong evidence of a treatment effect. The second part of the thesis concerns covariance matrix estimation in high-dimensional multivariate linear models with sparse covariate settings, using fiducial inference. The sparsity imposed on the covariate matrix allows us to estimate relationships between a list of gene expressions and several metabolic levels in a high-dimension, low-sample-size setting. Aiming to quantify the uncertainty of the estimators without having to choose a prior, we have developed a fiducial approach to the estimation of the covariance matrix. Building upon the fiducial Bernstein-von Mises theorem, we show that the fiducial distribution of the covariance matrix is consistent under our framework. Furthermore, we propose an adaptive, efficient reversible jump Markov chain Monte Carlo algorithm for sampling from the fiducial distribution, which enables us to define a meaningful confidence region for the covariance matrix.
In the last part of the thesis, we examine stochastic models for capturing the evolutionary processes of gene expression levels. Generalizing a Brownian motion (BM) model for microarray data, we have developed a BM model for high-throughput sequencing data that takes sampling variance into account. To allow for conservation in the evolutionary process, we also investigate Ornstein-Uhlenbeck (OU) models. Applying these models to a multiple-tissue mammalian dataset, we showed that the OU model is more appropriate for the top 10 most highly expressed genes in the dataset, and we performed hypothesis tests for significant changes in gene expression levels along specific lineages.
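
To illustrate the distinction between the two model families mentioned above, the following is a minimal sketch, not the dissertation's model: it simulates a single trait along one lineage and omits both the phylogeny and the sampling variance that the thesis accounts for. The function name `simulate_ou` and the parameters `theta` (optimum), `alpha` (strength of pull toward the optimum), and `sigma` (diffusion scale) are assumptions chosen for illustration. Setting `alpha = 0` recovers Brownian motion, whose variance grows without bound, whereas a positive `alpha` keeps the trait near `theta`, which is one way to represent conservation.

```python
import numpy as np

def simulate_ou(x0, theta, alpha, sigma, dt, n_steps, rng):
    """Simulate one path of dX = alpha * (theta - X) dt + sigma dW.

    Uses the exact Gaussian transition over each step of length dt.
    With alpha == 0 the process reduces to Brownian motion.
    """
    x = np.empty(n_steps + 1)
    x[0] = x0
    for i in range(n_steps):
        if alpha == 0.0:
            # Brownian motion: no pull toward theta, variance sigma^2 * dt
            mean = x[i]
            var = sigma**2 * dt
        else:
            # OU: exponential decay toward theta, bounded step variance
            decay = np.exp(-alpha * dt)
            mean = theta + (x[i] - theta) * decay
            var = sigma**2 * (1.0 - decay**2) / (2.0 * alpha)
        x[i + 1] = mean + np.sqrt(var) * rng.standard_normal()
    return x

# Hypothetical settings: 500 replicate lineages, 200 steps of length 0.1
rng = np.random.default_rng(0)
bm_final = np.array(
    [simulate_ou(0.0, 0.0, 0.0, 1.0, 0.1, 200, rng)[-1] for _ in range(500)]
)
ou_final = np.array(
    [simulate_ou(0.0, 0.0, 1.0, 1.0, 0.1, 200, rng)[-1] for _ in range(500)]
)

print("BM variance at final time:", bm_final.var())
print("OU variance at final time:", ou_final.var())
```

Under these illustrative settings, the spread of final values across replicates should be much larger for BM than for OU, since the OU variance approaches the stationary value `sigma**2 / (2 * alpha)` while the BM variance keeps growing linearly in time.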