Statistical Learning with Missing Data
MLA: Stewart, Thomas. Statistical Learning with Missing Data. Chapel Hill, NC: University of North Carolina at Chapel Hill Graduate School, 2015. https://doi.org/10.17615/c1c8-6z96
APA: Stewart, T. (2015). Statistical Learning with Missing Data. Chapel Hill, NC: University of North Carolina at Chapel Hill Graduate School. https://doi.org/10.17615/c1c8-6z96
Chicago: Stewart, Thomas. 2015. Statistical Learning with Missing Data. Chapel Hill, NC: University of North Carolina at Chapel Hill Graduate School. https://doi.org/10.17615/c1c8-6z96
- Last Modified: March 19, 2019
- Affiliation: Gillings School of Global Public Health, Department of Biostatistics
- Statistical learning is a popular family of data analysis methods that has been successfully employed in biomedical research, the social sciences, public safety applications, and most data-dependent areas of research. A major goal of statistical learning methods is to construct rules that predict an outcome y from a set of predictors x; for example, predicting treatment response from a set of pre-treatment biomarkers. Accurate prediction rules for treatment response can guide health care providers in selecting the best treatment options. The support vector machine (SVM) is a statistical learning method profitably employed in a number of research areas such as biomedical computer vision tasks, drug design, and genetics. Because SVMs admit nonlinear prediction rules, they are a natural choice for analyzing data with potentially complex relationships. One drawback of SVMs is the limited means of handling missing data in the training set, yet missing data is ubiquitous in studies of health-related outcomes. In this research, we review the literature on missing data and summarize the scenarios in which missing data may bias statistical analysis. We also provide an overview of supervised classification methods, especially those that accommodate missing data. We pay special attention to SVMs, as this family of methods is the focus of our proposed contributions to this body of work. We propose three methods involving SVMs and missing data. The first paper proposes an EM-based solution for constructing SVMs when the training set includes observations with missing covariates. We present the method for continuous covariates, but the method is applicable to discrete covariates as well. The second paper proposes weighting methods inspired by weighted estimating equations, also for the purpose of constructing SVMs when the training set includes observations with missing covariates. The third paper considers scenarios in which class labels are missing or partially observed, an area of study commonly called semi-supervised learning. We propose an EM-type solution for the semi-supervised learning scenario, and we apply the method to both two-class and multi-class SVMs. In each paper, the proposed methods are demonstrated in the context of a large multi-center observational study of hepatitis C patients.
- Date of publication: August 2015
- Resource type
- Rights statement: In Copyright
- Herring, Amy
- Liu, Yufeng
- Zeng, Donglin
- Wu, Michael
- Hayes, Neil
- Doctor of Philosophy
- Degree granting institution: University of North Carolina at Chapel Hill Graduate School
- Graduation year
- Place of publication: Chapel Hill, NC
- There are no restrictions to this item.