The statistical analysis of genetic sequencing and rare variant association studies

Urrutia, Eugene

Last Modified

March 22, 2019

Creator

Urrutia, Eugene
- Affiliation: Gillings School of Global Public Health, Department of Biostatistics

Abstract

Understanding the role of genetic variability in complex traits is a central goal of modern human genetics research. So far, genome-wide association tests have not been able to discover SNPs that explain a large proportion of the heritability of disease. It is hoped that, with the advent of accessible DNA sequencing data, investigators can uncover more of the so-called missing heritability. The added information contained in sequencing data includes rare variants, that is, minor alleles whose population frequency is low. We examine several existing region-based rare variant association tests, including burden-based tests and similarity-based tests, and show that each is most powerful under a particular set of conditions that is unknown to the investigator. While some have proposed tests that combine the features of several existing tests, none has yet provided a test that combines the features of all existing tests. Here, we propose one such test under the framework of the sequence kernel association test (SKAT) and show that it is nearly as powerful as the most appropriately chosen test under a range of scenarios. Existing methods also do not allow for missing values in the covariates, and the standard use of complete case analysis may yield misleading results, including false positives and biased parameter estimates. To address this problem, we extend an existing maximum likelihood strategy for accommodating partially missing covariates to the SKAT framework for rare variant association testing. This results in a test with high power to identify genetic regions associated with quantitative traits while still providing unbiased estimation and correct control of type I error when covariates are missing at random. Since the framework is generic, we also consider the application of this approach to epigenetic data. Finally, a wide range of variable selection approaches can be applied to isolate individual rare variants within a region, yet there has been little evaluation of these approaches. We examine key methods for prioritizing individual variants and evaluate how these procedures perform with respect to false positives and power via application to simulated data and real data.
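To illustrate the contrast the abstract draws between burden-based and similarity-based (SKAT-type) tests, the following is a minimal sketch and not the dissertation's implementation. It assumes a quantitative trait, an intercept-only null model with no covariates, unnormalized Beta(1, 25)-shaped weights, and permutation p-values in place of an analytic null distribution. The function names `beta_weights`, `score_statistics`, and `permutation_pvalues` are hypothetical and chosen for this example only.

```python
import numpy as np


def beta_weights(maf, a=1.0, b=25.0):
    """Unnormalized Beta(a, b) density evaluated at each minor allele frequency.

    With a=1, b=25 (an assumed, commonly used choice) rarer variants get larger weights.
    """
    maf = np.asarray(maf, dtype=float)
    return maf ** (a - 1.0) * (1.0 - maf) ** (b - 1.0)


def score_statistics(y, G, weights):
    """Return (burden, skat) score-type statistics for one region.

    y       : (n,) quantitative trait
    G       : (n, m) genotype matrix (minor allele counts)
    weights : (m,) per-variant weights
    Assumes an intercept-only null model, so residuals are y minus its mean.
    """
    resid = y - y.mean()
    s = G.T @ resid                             # per-variant score contributions
    burden = float((weights @ s) ** 2)          # collapse first, then square
    skat = float(np.sum((weights * s) ** 2))    # square first, then sum
    return burden, skat


def permutation_pvalues(y, G, weights, n_perm=999, seed=0):
    """Permutation p-values for (burden, skat); a stand-in for analytic p-values."""
    rng = np.random.default_rng(seed)
    observed = np.array(score_statistics(y, G, weights))
    exceed = np.zeros(2)
    for _ in range(n_perm):
        permuted = np.array(score_statistics(rng.permutation(y), G, weights))
        exceed += permuted >= observed
    return (exceed + 1.0) / (n_perm + 1.0)


# Toy region: two variants with equal and opposite effects on the trait.
y_toy = np.array([1.0, -1.0, 0.0, 0.0])
G_toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
w_toy = np.array([1.0, 1.0])
burden_toy, skat_toy = score_statistics(y_toy, G_toy, w_toy)
# burden_toy is 0.0 (opposite effects cancel); skat_toy is 2.0 (they accumulate)
```

In the toy region the two variants act in opposite directions, so the collapsed burden statistic vanishes while the SKAT-type statistic does not. When all variants act in the same direction the burden statistic tends to be the more powerful of the two, which is one instance of the abstract's point that each test is most powerful under conditions the investigator does not know.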

Date of publication