Statistical analysis of haplotypes, untyped SNPs, and CNVs in genome-wide association studies

Last Modified
  • March 21, 2019
  • Hu, Yijuan
    • Affiliation: Gillings School of Global Public Health, Department of Biostatistics
  • Missing data arise in genetic association studies when one is interested in assessing the effects of haplotypes, untyped single nucleotide polymorphisms (SNPs), or copy number variants (CNVs). Haplotypes are combinations of nucleotides at multiple loci along individual homologous chromosomes, and the use of haplotypes tends to yield more efficient analysis of disease association than the use of individual SNPs. Untyped SNPs are SNPs that are not on the genotyping chips used in the study (i.e., missing on all study subjects), and the analysis of untyped SNPs can facilitate localization of disease-causing variants and permit meta-analysis of association studies with different genotyping platforms. A CNV is a duplication or deletion of a segment of DNA sequence relative to a reference genome assembly, and can play a causal role in genetic diseases. In the first part of the proposal, we provide a general likelihood-based framework for making inference on the effects of haplotypes or untyped SNPs and their interactions with environmental variables. Unlike most of the existing methods, we allow genetic and environmental variables to be correlated. We show that the maximum likelihood estimators are consistent, asymptotically normal, and asymptotically efficient, and we develop EM algorithms to implement the corresponding inference procedures. We conduct extensive simulation studies and apply the methods to a genome-wide association study (GWAS) of lung cancer. In the second part, we compare two approaches to the analysis of untyped SNPs. The maximum likelihood approach integrates prediction of untyped genotypes and estimation of association parameters into a single framework and yields consistent and efficient estimators of genetic effects and gene-environment interactions with proper variance estimators. The imputation approach is a two-stage strategy that first imputes the untyped genotypes by either the most likely genotypes or the expected genotype counts and then uses the imputed values in downstream association analysis. We conduct extensive simulation studies to compare the bias, type I error, power, and confidence interval coverage of the two methods under various scenarios. In addition, we provide an illustration with genome-wide data from the Wellcome Trust Case-Control Consortium (WTCCC). In the third part, we present a general framework for the integrated analysis of CNVs and SNPs in association studies, including the analysis of total copy number as a special case. We use allele-specific copy numbers (ASCNs) to describe both the copy number and allelic variations of a locus. Our approach combines ASCN calling and association analysis into a single step while allowing for differential errors. We construct likelihood functions that properly account for the case-control sampling and measurement errors. We establish the asymptotic properties of the maximum likelihood estimators and develop EM algorithms to implement the proposed inference procedures. The advantages of the proposed methods over the existing ones are demonstrated through realistic simulation studies and an application to a GWAS of schizophrenia.
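The two imputation summaries named in the abstract (the most likely genotype and the expected genotype count) can be illustrated with a minimal sketch. It is not the author's code: the function name `impute_genotypes` and the posterior probabilities are hypothetical and invented purely for illustration, and the sketch assumes each subject already has posterior probabilities for carrying 0, 1, or 2 copies of an allele at an untyped SNP.

```python
from typing import List, Sequence, Tuple


def impute_genotypes(
    probs: Sequence[Sequence[float]],
) -> Tuple[List[int], List[float]]:
    """Summarize genotype posteriors in the two ways used by imputation.

    probs[i] holds (P(G=0), P(G=1), P(G=2)) for subject i, where G is the
    number of copies of the allele of interest. These inputs are assumed
    to come from some upstream prediction step (hypothetical here).
    """
    best_guess: List[int] = []
    dosage: List[float] = []
    for p in probs:
        if len(p) != 3 or abs(sum(p) - 1.0) > 1e-8:
            raise ValueError("each row must hold three probabilities summing to 1")
        # Most likely genotype: the copy number with the highest posterior.
        best_guess.append(max(range(3), key=lambda g: p[g]))
        # Expected genotype count: 0*P(G=0) + 1*P(G=1) + 2*P(G=2).
        dosage.append(p[1] + 2.0 * p[2])
    return best_guess, dosage


# Made-up posteriors for three subjects (illustration only).
posterior = [
    (0.90, 0.09, 0.01),
    (0.40, 0.35, 0.25),
    (0.05, 0.50, 0.45),
]
best_guess, dosage = impute_genotypes(posterior)
```

For the second subject, the most likely genotype is 0 even though its posterior probability is only 0.40, while the expected count of about 0.85 retains that uncertainty. Either summary is then treated as if it were observed in the downstream association analysis, which is the two-stage feature that the abstract contrasts with the single-framework maximum likelihood approach.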
Rights statement
  • In Copyright
  • Lin, Danyu
Degree granting institution
  • University of North Carolina at Chapel Hill
Place of publication
  • Chapel Hill, NC
  • Open access
