STATISTICAL ANALYSES OF HIGH THROUGHPUT GENETICS AND GENOMICS DATA

Mixed effects models are commonly used to model the dependence structure between twin pairs in twin studies. However, they are extremely computationally intensive for eQTL (expression quantitative trait loci) analysis. To overcome this computational challenge, twin pairs can be randomly split into two groups, each consisting of independent individuals, and multiple linear regression can be performed within each group. In my first topic, a computationally efficient score statistic is proposed to combine the non-independent analysis results from the two groups.

Genome-wide association studies (GWAS) aim to identify genetic variants associated with complex traits. The standard first-pass GWAS analysis, in which SNPs are tested one at a time, may fail to detect associations, for example when multiple causal SNPs are present. As an alternative, regional SNP-set analyses have been established to test the association between a set of SNPs and a phenotype through a mixed effects model, where testing the association is equivalent to testing whether one or more of the variance components are equal to 0. However, the null distribution of the likelihood ratio test (LRT) does not follow the conventional 50:50 mixture chi-square distribution in this setting. My second topic investigates the spectral representation of the LRT, based on which an empirical resampling procedure is proposed to approximate the null distribution of the LRT.

When both GWAS and gene expression data are available on the same set of samples, it is natural to add gene expression as a covariate in the SNP-set analysis to jointly model the SNP and transcript associations with the trait. One biologically interesting question is whether the complex phenotype is associated with the gene expression conditional on the SNP effects. My last research topic jointly models the association of the gene expression and the SNP set with the trait. Unlike traditional mixed effects models, the proposed model allows the gene expression to depend on the random SNP effects, since the independence assumption is likely to be violated when the gene expression is also associated with the SNP set. With the independence assumption relaxed, the model supports valid statistical inference and parameter estimation.
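
For readers unfamiliar with the conventional 50:50 mixture chi-square reference distribution mentioned in the second topic, the following is a minimal sketch of how a p-value is computed under it when a single variance component is tested. It illustrates only the conventional reference that the abstract says does not hold in the SNP-set setting; it is not the resampling procedure proposed in the thesis. The function names are hypothetical, and the mixture is assumed to be an equal-weight combination of a point mass at 0 and a chi-square distribution with 1 degree of freedom.

```python
import math


def chi2_1_sf(x: float) -> float:
    """Upper-tail probability P(X > x) for X ~ chi-square with 1 df.

    Uses the identity X = Z**2 with Z standard normal, so
    P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


def mixture_pvalue(lrt_stat: float) -> float:
    """P-value under the assumed 0.5 * chi2_0 + 0.5 * chi2_1 mixture.

    chi2_0 is a point mass at 0, so it contributes nothing to the
    upper tail of any strictly positive statistic.
    """
    if lrt_stat <= 0:
        return 1.0
    return 0.5 * chi2_1_sf(lrt_stat)


if __name__ == "__main__":
    # Illustrative statistic only, not a result from the thesis.
    print(mixture_pvalue(2.5))
```

Under this reference, the p-value for a positive statistic is half of the ordinary 1-degree-of-freedom chi-square p-value, so a test that wrongly relies on it when the true null distribution differs will be miscalibrated.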