Statistical Learning for Biomedical Data under Various Forms of Heterogeneity

Chen, Guanhua

Last Modified

March 19, 2019

Creator

Chen, Guanhua
- Affiliation: Gillings School of Global Public Health, Department of Biostatistics

Abstract

In modern biomedical research, an emerging challenge is data heterogeneity. Ignoring such heterogeneity can lead to poor modeling results.

In cancer research, clustering methods are applied to find subgroups of homogeneous individuals based on genetic profiles together with heuristic clinical analysis. A notable drawback of existing clustering methods is that they ignore the possibility that the variance of gene expression profile measurements can be heterogeneous across subgroups, leading to inaccurate subgroup prediction. In Chapter 2, we present a statistical approach that can capture both mean and variance structure in gene expression data. We demonstrate the strength of our method on both synthetic data and two cancer data sets.

For a binary classification problem, there can be potential subclasses within the two classes of interest. These subclasses are latent and usually heterogeneous. In Chapter 3, we propose the Composite Large Margin Classifier (CLM) to address classification with latent subclasses. The CLM aims to find three linear functions simultaneously: one splits the data into two parts, and the other two each classify one of those parts. Our method achieves prediction accuracy comparable to that of a general nonlinear kernel classifier without overfitting the training data, while maintaining the interpretability of traditional linear classifiers.

There is a growing recognition of the importance of considering individual-level heterogeneity when searching for optimal treatment doses. Such optimal individualized treatment rules (ITRs) for dosing should maximize the expected clinical benefit. In Chapter 4, we consider a randomized trial design where the candidate dose levels are continuous. To find the optimal ITR under such a design, we propose an outcome weighted learning method which directly maximizes the expected beneficial clinical outcome. This method converts the individualized dose selection problem into a nonstandard weighted regression problem. A difference of convex functions (DC) algorithm is adopted to efficiently solve the associated non-convex optimization problem. The consistency and convergence rates for the estimated ITR are derived, and small-sample performance is evaluated via simulation studies. We illustrate the method using data from a clinical trial for warfarin dosing.
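
The three-function structure described for the CLM can be illustrated with a minimal Python sketch of a prediction rule. This is not the dissertation's implementation: the hard (sign-based) gating, the function name `composite_linear_predict`, and all coefficient values below are assumptions chosen only to show how one linear function can route each observation to one of two linear classifiers. The abstract does not specify how the three functions are combined or fitted.

```python
import numpy as np

def composite_linear_predict(X, w_gate, b_gate, w_a, b_a, w_b, b_b):
    """Hypothetical hard-gated composite of three linear functions.

    The gate function splits the data into two parts; each part is
    then labeled by its own linear classifier. Labels are in {-1, +1}.
    """
    X = np.asarray(X, dtype=float)
    gate = X @ np.asarray(w_gate, dtype=float) + b_gate  # splitting function
    score_a = X @ np.asarray(w_a, dtype=float) + b_a     # classifier for gate >= 0
    score_b = X @ np.asarray(w_b, dtype=float) + b_b     # classifier for gate < 0
    score = np.where(gate >= 0, score_a, score_b)
    return np.where(score >= 0, 1, -1)

# Toy XOR-like data that no single linear classifier separates.
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])

# Assumed coefficients: split on the first feature, then classify each
# half by the sign of the second feature, with opposite orientations.
labels = composite_linear_predict(
    X,
    w_gate=[1.0, 0.0], b_gate=0.0,
    w_a=[0.0, 1.0], b_a=0.0,
    w_b=[0.0, -1.0], b_b=0.0,
)
print(labels.tolist())
```

Each of the three functions remains an ordinary linear rule that can be inspected coefficient by coefficient, which is the interpretability property the abstract attributes to the CLM.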

Date of publication