Statistical Essays Motivated by Genome-Wide Association Study

Last Modified
  • March 19, 2019
  • Wang, Ling
    • Affiliation: College of Arts and Sciences, Department of Statistics and Operations Research
  • Genome-wide association studies (GWAS) have gained popularity in recent years and have generated considerable interest in statistics. In this dissertation, motivated by GWAS, we develop statistical methods to identify significant single-nucleotide polymorphisms (SNPs) that are associated with phenotypic traits of interest. In GWAS, the number of SNPs is usually much larger than the number of individuals. Hence identifying significant SNPs and estimating their effects is a high-dimensional selection and estimation problem, sometimes referred to as the large p, small n (p >> n) paradigm. We propose three approaches to estimate the proportion of SNPs that are significantly associated with the trait of interest in GWAS, as well as the distribution of their effects. The first extends earlier work that models the SNP effects as random effects in a linear mixed model. We instead assume a mixture prior on the random effects, consisting of a point mass at zero for the non-significant SNPs plus a normal component for the significant SNPs. We develop a fast Markov chain Monte Carlo (MCMC) algorithm to estimate the model parameters. The proposed algorithm reduces computation time significantly by calculating the posterior conditional on a set of latent variables that indicate whether each SNP is associated with the trait of interest. We further relax the prior distribution to a mixture of a point mass and a nonparametric distribution. Two types of sieve estimators are proposed, based on a least squares (LS) method for probability distributions under the framework of measurement error models. The estimators are obtained by minimizing the distance between the empirical and model distribution functions, and between the empirical and model characteristic functions, respectively.
In the last part, we propose an estimator for the normal mean problem that adapts to the sparsity of the mean signals and incorporates correlation among the signals. The proposed estimator decomposes the arbitrary covariance matrix of the observed signals into two parts: principal factors that drive the strong dependence, and weakly dependent error terms. By taking out the largest common factors, the correlation among the signals is significantly weakened. An automatic nonparametric empirical Bayes method is then used to estimate the sparsity and identify the nonzero means.
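
As a rough illustration of the factor-removal idea in the last part of the abstract, the following minimal sketch splits a covariance matrix into a low-rank principal-factor part and a remainder, then compares the average absolute off-diagonal correlation before and after. It is not the dissertation's estimator: the function names, the toy one-factor covariance, and the choice k = 1 are all assumptions made for illustration.

```python
import numpy as np

def split_covariance(sigma, k):
    """Split sigma into a rank-k principal-factor part and a remainder.

    Returns the p x k loadings and the remainder sigma - loadings @ loadings.T.
    """
    eigvals, eigvecs = np.linalg.eigh(sigma)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]               # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])
    remainder = sigma - loadings @ loadings.T
    return loadings, remainder

def mean_abs_offdiag_corr(cov):
    """Average absolute off-diagonal entry of the correlation matrix of cov."""
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)
    off_diagonal = corr[~np.eye(cov.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diagonal)))

# Toy covariance (an assumption): one strong common factor plus unit noise.
rng = np.random.default_rng(0)
p = 50
b = rng.normal(size=(p, 1))
sigma = b @ b.T + np.eye(p)

loadings, remainder = split_covariance(sigma, k=1)
before = mean_abs_offdiag_corr(sigma)
after = mean_abs_offdiag_corr(remainder)
print(f"mean |corr| before: {before:.3f}, after removing 1 factor: {after:.3f}")
```

In this toy setting the remainder shows much weaker dependence than the original matrix, which is the property the abstract relies on before applying the empirical Bayes step. How many factors to remove in practice is not specified in the abstract and is left open here.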
Date of publication
Resource type
Rights statement
  • In Copyright
  • Ji, Chuanshu
  • Carlstein, Edward
  • Guo, Guang
  • Shen, Haipeng
  • Smith, Richard L.
  • Doctor of Philosophy
Degree granting institution
  • University of North Carolina at Chapel Hill Graduate School
Graduation year
  • 2015
Place of publication
  • Chapel Hill, NC
  • There are no restrictions to this item.