Designs and Analysis of Two-Phase Studies, with Applications to Genetic Association Studies

Last Modified
  • March 20, 2019
Creator
  • Tao, Ran
    • Affiliation: Gillings School of Global Public Health, Department of Biostatistics
Abstract
  • The two-phase design is a cost-effective sampling strategy when investigators are interested in evaluating the effects of covariates on an outcome but certain covariates are too expensive to be measured on all study subjects. Under such a design, the outcome of interest and the covariates that are inexpensive to measure are observed for all subjects during the first phase, and the first-phase information is used to select subjects for measurements of "expensive covariates" during the second phase. This design greatly reduces the cost associated with the collection of expensive covariate data and thus has been widely used in large epidemiological studies.

    In two-phase studies, if the second-phase selection depends on multiple outcomes, then one should consider all of them simultaneously in a multivariate regression model in order to obtain valid inference. We develop an efficient likelihood-based approach to making inference under multivariate outcome-dependent sampling. We implement a computationally efficient expectation-maximization algorithm and establish the theoretical properties of the resulting maximum likelihood estimators. We demonstrate the superiority of the proposed methods over standard linear regression through extensive simulation studies. We provide applications to two large-scale sequencing studies.

    In two-phase studies, the "inexpensive covariates" can be used to improve the design efficiency of second-phase sampling and control for confounding. However, accommodating continuous inexpensive covariates that are correlated with expensive covariates is very challenging because the likelihood function involves the conditional density functions of expensive covariates given continuous inexpensive covariates. We develop a semiparametric approach to regression analysis by approximating the conditional density functions with B-spline sieves. We establish the theoretical properties of the resulting estimators. We demonstrate the superiority of the proposed methods over existing ones through extensive simulation studies. We provide applications to a large-scale whole-exome sequencing study.

    Previous research on two-phase studies has largely focused on the inference procedures rather than the design aspects of two-phase studies. An important topic of investigation is the optimal study design when the primary interest is to estimate the regression coefficients of the expensive covariates. We derive optimal two-phase designs, which can be substantially more efficient than the current designs.
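As an illustration of the two-phase design described in the abstract, the following minimal Python sketch simulates phase one (outcome and inexpensive covariate recorded for every subject) and then selects phase-two subjects from the two tails of the outcome distribution. The extreme-tail selection rule, the function names `select_extreme_tails` and `simulate_two_phase`, the linear data-generating model, and all numeric values are assumptions made for illustration only. They are not taken from the dissertation, and the sketch does not implement its estimation methods.

```python
import random


def select_extreme_tails(outcomes, n_select):
    """Return sorted indices of the n_select subjects with the most extreme outcomes.

    Half of the indices come from the lower tail and half from the upper tail.
    This is one hypothetical outcome-dependent selection rule.
    """
    if n_select < 2 or n_select % 2 != 0 or n_select > len(outcomes):
        raise ValueError("n_select must be an even number between 2 and len(outcomes)")
    order = sorted(range(len(outcomes)), key=lambda i: outcomes[i])
    half = n_select // 2
    return sorted(order[:half] + order[len(order) - half:])


def simulate_two_phase(n=1000, n_select=200, seed=0):
    """Simulate a two-phase study; all model settings are illustrative assumptions."""
    rng = random.Random(seed)

    # Phase one: outcome y and inexpensive covariate z are recorded for everyone.
    # The expensive covariate x exists for everyone but has not been measured yet.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [0.5 * xi + 0.3 * zi + rng.gauss(0.0, 1.0) for xi, zi in zip(x, z)]

    # Phase two: use phase-one information (here, only y) to choose who gets x measured.
    selected = select_extreme_tails(y, n_select)
    x_observed = {i: x[i] for i in selected}
    return y, z, x_observed


if __name__ == "__main__":
    y, z, x_observed = simulate_two_phase()
    print(f"phase one: {len(y)} subjects; phase two: {len(x_observed)} with x measured")
```

Because selection depends on the outcome, the phase-two subsample is not a simple random sample of the cohort, which is why the abstract calls for inference procedures that account for the sampling scheme.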
Date of publication
Resource type
Rights statement
  • In Copyright
  • Li, Yun
  • Lin, Danyu
  • Zeng, Donglin
  • North, Kari
  • Li, Quefeng
Degree
  • Doctor of Philosophy
Degree granting institution
  • University of North Carolina at Chapel Hill Graduate School
Graduation year
  • 2016