Statistical Learning with Missing Data

Stewart, Thomas

Last Modified

March 19, 2019

Creator

Stewart, Thomas
- Affiliation: Gillings School of Global Public Health, Department of Biostatistics

Abstract

Statistical learning is a popular family of data analysis methods that has been successfully employed in biomedical research, the social sciences, public safety applications, and most other data-dependent areas of research. A major goal of statistical learning methods is to construct rules that predict an outcome y from a set of predictors x; for example, predicting treatment response from a set of pre-treatment biomarkers. Accurate prediction rules for treatment response can guide health care providers in selecting the best treatment options. The support vector machine (SVM) is a statistical learning method profitably employed in a number of research areas such as biomedical computer vision tasks, drug design, and genetics. Because SVMs admit nonlinear prediction rules, they are a natural choice for analyzing data with potentially complex relationships. One drawback of SVMs is the limited means of handling missing data in the training set, yet missing data is ubiquitous in studies of health-related outcomes. In this research, we review the literature on missing data, and we summarize the scenarios in which missing data may bias statistical analysis. We also provide an overview of supervised classification methods, especially those methods that accommodate missing data. We pay special attention to SVMs, as this family of methods is the focus of our proposed contributions to this body of work. We propose three methods involving SVMs and missing data. The first paper proposes an EM-based solution for constructing SVMs when the training set includes observations with missing covariates. We present the method for continuous covariates, but the method is applicable to discrete covariates as well. The second paper proposes weighting methods inspired by weighted estimating equations, also for the purpose of constructing SVMs when the training set includes observations with missing covariates.
The third paper considers scenarios in which class labels are missing or are partially observed, an area of study commonly called semi-supervised learning. We propose an EM-type solution for the semi-supervised learning scenario, and we apply the method to both two-class and multi-class SVMs. In each paper, the proposed methods will be demonstrated in the context of a large multi-center observational study of Hepatitis C patients.

Date of publication

August 2015

Subject

Statistics

DOI

https://doi.org/10.17615/c1c8-6z96

Identifier

Stewart_unc_0153D_15589.pdf

Resource type

Dissertation

Rights statement

In Copyright

Advisor

Herring, Amy
Liu, Yufeng
Zeng, Donglin
Wu, Michael
Hayes, Neil

Degree

Doctor of Philosophy

Degree granting institution

University of North Carolina at Chapel Hill Graduate School

Graduation year

2015

Language

English

Publisher

University of North Carolina at Chapel Hill Graduate School

Place of publication

Chapel Hill, NC

Access right

There are no restrictions to this item.

Date uploaded

August 25, 2015

Items

Title	Date Uploaded	Visibility
Stewart_unc_0153D_15589.pdf	2019-04-10	Public
