Principal component analysis in high dimensional data: application for genomewide association studies

Last Modified
  • March 21, 2019
  • Lee, Seunggeun
    • Affiliation: Gillings School of Global Public Health, Department of Biostatistics
  • In genomewide association studies (GWAS), population stratification (PS) is a major confounding factor that causes spurious associations by inflating test statistics. PS refers to differences in allele frequencies by disease status that arise from systematic differences in ancestry rather than from causal association of genes with disease. Principal component analysis (PCA) is commonly used to infer population structure by computing principal component (PC) scores, which are subsequently used to control for population stratification. Even though PCA is now widely used for PS adjustment, effective PCA-based PS control still faces challenges. One common feature of genomic data is the strong local correlation among adjacent loci/markers caused by linkage disequilibrium (LD). It is known that this local correlation can have a negative effect on estimated PC scores and produce spurious PCs that do not truly reflect the underlying population structure. To address this problem, we have employed a shrinkage PCA approach in which coefficients are used to down-weight the contribution of highly correlated SNPs in PCA. Another challenge in PC analysis is choosing which PCs to include as covariates to adjust for population stratification. While searching for a reasonable measure for PC selection, we found the precise relationship between genotype principal components and inflation of association test statistics. Based on this relationship, we propose a new approach, called EigenCorr, which selects principal components based on both their eigenvalues and their correlation with the (disease) phenotype. Our approach tends to select fewer principal components for stratification control than does testing of eigenvalues alone, providing substantial computational savings and improvements in power. Under many circumstances, it is of interest to predict PC scores. Although PC score prediction is commonly used in practice, the characteristics of predicted PC scores have not been systematically studied.
Under high-dimensional settings, we have found that the naive predicted PC scores are systematically biased toward 0, and this phenomenon is largely due to the inconsistency of the sample eigenvalues and eigenvectors. We have extended existing convergence results for sample eigenvalues and eigenvectors and derived asymptotic shrinkage factors. Based on these asymptotic results, we propose a bias-adjusted PC score prediction.
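The shrinkage of naive predicted PC scores described in the abstract can be illustrated with a small simulation. The sketch below is not the thesis's method or its bias adjustment. It assumes a hypothetical single-spike Gaussian model (one high-variance feature among many unit-variance features), and the function name `simulate_pc_score_shrinkage` and all parameter values are chosen here for illustration only. It estimates the leading PC direction from a training sample, projects both the training sample and an independent sample onto it, and compares the spread of the two sets of scores.

```python
import numpy as np


def simulate_pc_score_shrinkage(n_train=100, n_test=500, p=2000,
                                spike=50.0, seed=0):
    """Return (sd of training PC1 scores, sd of naively predicted PC1 scores).

    Hypothetical setup: p independent Gaussian features with unit variance,
    except the first feature, whose variance is `spike`.
    """
    rng = np.random.default_rng(seed)

    def draw(n):
        x = rng.standard_normal((n, p))
        x[:, 0] *= np.sqrt(spike)  # one strong direction of variation
        return x

    train, test = draw(n_train), draw(n_test)

    # Center both samples with the training mean, as a naive predictor would.
    mean = train.mean(axis=0)

    # Rows of vt are the sample PC directions of the centered training data.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    v1 = vt[0]

    train_scores = (train - mean) @ v1  # scores used to estimate the PC
    test_scores = (test - mean) @ v1    # naive predicted scores for new data
    return train_scores.std(ddof=1), test_scores.std(ddof=1)


train_sd, test_sd = simulate_pc_score_shrinkage()
print(f"sd of training PC1 scores:  {train_sd:.2f}")
print(f"sd of predicted PC1 scores: {test_sd:.2f}")
```

With many more features than training samples, the predicted scores should show a visibly smaller spread than the training scores, which is the pull toward 0 that the abstract attributes to the inconsistency of sample eigenvalues and eigenvectors.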
Date of publication
Resource type
Rights statement
  • In Copyright
  • "... in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Biostatistics."
  • Zou, Fei
Degree granting institution
  • University of North Carolina at Chapel Hill
Place of publication
  • Chapel Hill, NC
  • Open access
