Machine Learning Techniques for Heterogeneous Data Sets

Chen, Jingxiang

Last Modified

March 21, 2019

Creator

Chen, Jingxiang
- Affiliation: Gillings School of Global Public Health, Department of Biostatistics

Abstract

Over the past few decades, machine learning tools have developed rapidly across application fields to support statistical decision making. In this dissertation, we investigate new supervised machine learning techniques that can contribute to the analysis of complex datasets. First, we discuss a new learning method in Reproducing Kernel Hilbert Spaces (RKHS) that achieves variable selection and data extraction simultaneously. In particular, we propose a unified RKHS learning method, namely DOuble Sparsity Kernel (DOSK) learning, for this purpose. We prove that, under certain conditions, our new method asymptotically achieves variable selection consistency. Numerical study results demonstrate that DOSK is highly competitive among existing approaches for RKHS learning. Second, we study how machine learning can be applied to heterogeneous data analysis by detecting an optimal individual treatment rule (ITR) for the ordinal treatment case. One of the primary goals in precision medicine is to obtain an optimal ITR. Recently, outcome weighted learning (OWL) has been proposed to estimate such an optimal ITR in a binary treatment setting by maximizing the expected clinical outcome. However, for ordinal treatment settings such as dose level finding, it is unclear how to use OWL. We propose a new technique for estimating the ITR with ordinal treatments. Simulated examples and an application to a type-2 diabetes study demonstrate the highly competitive performance of the proposed method. Third, we again analyze heterogeneous data, but from a different point of view. In particular, we develop a new exploratory machine learning tool to identify heterogeneous subpopulations without much prior knowledge. To achieve this goal, we formulate a regression problem with subject-specific regression coefficients and use adaptive fusion to cluster the coefficients into subpopulations. This method has two main advantages. First, it relies on little prior knowledge of the underlying subpopulation structure. Second, it makes use of the outcome-predictor relationship and hence can have competitive estimation and prediction accuracy. To estimate the parameters, we design a highly efficient accelerated proximal gradient algorithm. Numerical studies show that the proposed method has competitive estimation and prediction accuracy.

Date of publication

May 2017

Keyword

DOI

https://doi.org/10.17615/h5ab-eg95

Resource type

Dissertation

Rights statement

In Copyright

Advisor

Zeng, Donglin
Kosorok, Michael
Liu, Yufeng
Laber, Eric
Cole, Stephen

Degree

Doctor of Philosophy

Degree granting institution

University of North Carolina at Chapel Hill

Graduation year

2017

Language

English

Date uploaded

August 15, 2017

Relations

Parents:

This work has no parents.

In Collection:

UNC-Chapel Hill Artificial Intelligence Resources

Items

Title	Date Uploaded	Visibility
Chen_unc_0153D_17154.pdf	August 15, 2017	Public
PREMIS_Events_Metadata_0_70d9a95c-9c7b-4bfc-97ef-b31aef91ece8.txt	2019-04-10	Public
original_metadata_file_70d9a95c-9c7b-4bfc-97ef-b31aef91ece8.xml	2019-04-10	Public
