Variable Selection, Sparse Meta-Analysis and Genetic Risk Prediction for Genome-Wide Association Studies

He, Qianchuan

Last Modified

March 22, 2019

Creator

He, Qianchuan
- Affiliation: Gillings School of Global Public Health, Department of Biostatistics

Abstract

Genome-wide association studies (GWAS) usually involve more than half a million single nucleotide polymorphisms (SNPs). The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict risk of disease. Recently developed variable selection methods allow joint analysis of GWAS data, but they tend to miss causal SNPs that are marginally uncorrelated with disease, and they have high false discovery rates (FDRs). Genetic risk prediction becomes highly challenging when the number of causal variants is large and many of the effects are weak. Existing methods mostly rely on marginal regression estimates, and their prediction power is quite limited. In meta-analysis, the involvement of multiple studies adds one more layer of complexity to variable selection. While existing variable selection methods can potentially be applied to meta-analysis, they require direct access to raw data, which are often difficult to obtain. In the first part of this dissertation, we introduce GWASelect, a statistically powerful and computationally efficient variable selection method for analyzing GWAS data. This method searches iteratively over the potential SNPs conditional on previously selected SNPs and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false-positive findings. Simulation studies demonstrate that GWASelect performs well under a wide spectrum of linkage disequilibrium patterns and can be substantially more powerful than existing methods in capturing causal variants while having a lower FDR. In addition, regression models based on GWASelect tend to yield more accurate prediction of disease risk than existing methods.
In the second part, we propose a new approach, Sparse Meta-Analysis (SMA), which performs variable selection for meta-analysis based solely on summary statistics and allows the effect sizes of each covariate to vary among studies. We show that SMA enjoys the oracle property if the estimated covariance matrix of the parameter estimators from each study is available. We also consider the situations in which the summary statistics include only the variances or no variance/covariance information at all. Simulation studies and real data analysis demonstrate that the proposed methods perform well. Since summary statistics are far more accessible than raw data, our methods have broader applications in high-dimensional meta-analysis than existing ones. In the third part, we investigate the issue of genetic risk prediction when the number of true causal SNPs is large and many of the effect sizes are small. We show that the estimators obtained from marginal logistic regression can be severely biased and that using these estimators for prediction can lead to highly inaccurate results. To construct a joint-effects model, we propose a new method based on the smoothly clipped absolute deviation support vector machine (SCAD-SVM). We conduct a series of simulation studies to show that our method outperforms the methods based on marginal estimators. We further assess the performance of our method by applying it to real GWAS data.
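
The abstract notes that a causal SNP can be marginally uncorrelated with disease yet still be found by searching conditional on SNPs already selected. The following is a minimal toy sketch of that idea only, not the GWASelect algorithm: it assumes a noise-free linear outcome rather than a logistic disease model, uses six hypothetical individuals with two binary indicators `x1` and `x2` invented for illustration, and omits the resampling step entirely. Exact rational arithmetic is used so that the zero marginal covariance is exact rather than approximate.

```python
from fractions import Fraction


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Population covariance of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def residualize(v, on):
    """Residuals of v after least-squares regression on `on` with an intercept."""
    slope = cov(v, on) / cov(on, on)
    mv, mo = mean(v), mean(on)
    return [a - mv - slope * (b - mo) for a, b in zip(v, on)]


# Hypothetical toy data (assumption): two correlated binary indicators.
x1 = [Fraction(v) for v in (0, 0, 1, 1, 0, 1)]
x2 = [Fraction(v) for v in (0, 0, 1, 1, 1, 0)]

# Both variables are causal; the effect of x2 is chosen so that it
# cancels exactly in the marginal covariance with y.
y = [a - Fraction(1, 3) * b for a, b in zip(x1, x2)]

# One-at-a-time (marginal) view: x2 appears unrelated to y.
marginal_x2 = cov(y, x2)

# Conditional view: remove x1 from both y and x2, then regress the residuals.
r_y = residualize(y, x1)
r_x2 = residualize(x2, x1)
conditional_x2 = cov(r_y, r_x2) / cov(r_x2, r_x2)

print(marginal_x2)     # 0
print(conditional_x2)  # -1/3
```

In this sketch a marginal scan would pick up `x1` (its covariance with `y` is nonzero) and pass over `x2`, while the conditional step recovers the true coefficient of `x2`. Real GWAS data involve noise, binary disease status, and hundreds of thousands of SNPs, so this shows the motivation only, not the performance of any method.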

Date of publication