Machine Learning Techniques for Heterogeneous Data Sets

Last Modified
  • March 21, 2019
  • Chen, Jingxiang
    • Affiliation: Gillings School of Global Public Health, Department of Biostatistics
  • Over the past few decades, machine learning tools have developed rapidly across application fields to support statistical decision making. In this dissertation, we investigate new supervised machine learning techniques that can contribute to the analysis of complex datasets. First, we discuss a new learning method under Reproducing Kernel Hilbert Spaces (RKHS) that achieves variable selection and data extraction simultaneously. In particular, we propose a unified RKHS learning method, namely DOuble Sparsity Kernel (DOSK) learning, for this purpose. We prove that, under certain conditions, the new method asymptotically achieves variable selection consistency. Numerical studies demonstrate that DOSK is highly competitive among existing approaches for RKHS learning. Second, we study how machine learning can be applied to heterogeneous data analysis by finding an optimal individual treatment rule (ITR) for the ordinal treatment case. Obtaining an optimal ITR is one of the primary goals of precision medicine. Recently, outcome weighted learning (OWL) has been proposed to estimate such an optimal ITR in a binary treatment setting by maximizing the expected clinical outcome. However, for ordinal treatment settings such as dose level finding, it is unclear how to use OWL. We propose a new technique for estimating an ITR with ordinal treatments. Simulated examples and an application to a type-2 diabetes study demonstrate the highly competitive performance of the proposed method. Third, we analyze heterogeneous data from a different point of view. In particular, we develop a new exploratory machine learning tool to identify heterogeneous subpopulations without much prior knowledge. To achieve this goal, we formulate a regression problem with subject-specific regression coefficients and use adaptive fusion to cluster the coefficients into subpopulations. This method has two main advantages: it relies on little prior knowledge of the underlying subpopulation structure, and it makes use of the outcome-predictor relationship and hence can achieve competitive estimation and prediction accuracy. To estimate the parameters, we design a highly efficient accelerated proximal gradient algorithm. Numerical studies show that the proposed method has competitive estimation and prediction accuracy.
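As a rough illustration of the "expected clinical outcome" that outcome weighted learning maximizes, the sketch below computes the standard inverse-probability-weighted value of a fixed treatment rule d: the average of R · 1{A = d(X)} / π(A | X) over subjects. It is not the dissertation's ordinal-treatment method. The function name `ipw_value`, the toy data, and the `threshold_rule` are hypothetical, and the treatment propensities are assumed known, as in a randomized trial.

```python
from typing import Callable, Sequence


def ipw_value(
    covariates: Sequence[Sequence[float]],
    treatments: Sequence[int],
    outcomes: Sequence[float],
    propensities: Sequence[float],
    rule: Callable[[Sequence[float]], int],
) -> float:
    """Inverse-probability-weighted estimate of the mean outcome under `rule`.

    Only subjects whose observed treatment agrees with the rule contribute,
    each weighted by 1 / P(observed treatment | covariates).
    """
    total = 0.0
    for x, a, r, p in zip(covariates, treatments, outcomes, propensities):
        if rule(x) == a:
            total += r / p
    return total / len(outcomes)


# Hypothetical toy data: one covariate, two treatment levels (0 and 1),
# larger outcomes are better, treatment assigned with probability 0.5.
covariates = [[0.2], [0.8], [0.4], [0.9]]
treatments = [0, 1, 1, 0]
outcomes = [1.0, 3.0, 2.0, 0.5]
propensities = [0.5, 0.5, 0.5, 0.5]


def threshold_rule(x: Sequence[float]) -> int:
    """Hypothetical rule: give treatment 1 when the covariate exceeds 0.5."""
    return 1 if x[0] > 0.5 else 0


def always_zero(x: Sequence[float]) -> int:
    """Baseline rule that assigns treatment 0 to everyone."""
    return 0


value_threshold = ipw_value(covariates, treatments, outcomes, propensities, threshold_rule)
value_baseline = ipw_value(covariates, treatments, outcomes, propensities, always_zero)
print(value_threshold, value_baseline)
```

On this toy data the threshold rule gets a higher estimated value than the always-0 baseline. Because treatments are plain integers here, the same estimator can score a rule over more than two levels, but it ignores their ordering, which is the gap the abstract describes for ordinal settings such as dose finding.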
Date of publication
Resource type
Rights statement
  • In Copyright
  • Laber, Eric
  • Kosorok, Michael
  • Cole, Stephen
  • Zeng, Donglin
  • Liu, Yufeng
  • Doctor of Philosophy
Degree granting institution
  • University of North Carolina at Chapel Hill
Graduation year
  • 2017
