NON-PARAMETRIC MACHINE LEARNING METHODS FOR CLUSTERING AND VARIABLE SELECTION

Last Modified
  • March 19, 2019
Creator
  • Liu, Qian
    • Affiliation: Gillings School of Global Public Health, Department of Biostatistics
Abstract
  • Non-parametric machine learning methods are popular and widely used in many scientific research areas, especially when dealing with high-dimension, low-sample-size (HDLSS) data. In particular, clustering and biclustering approaches can serve as exploratory analysis tools to uncover informative data structures, and random forest models are well suited to coping with complex variable interactions. In many situations it is desirable to identify clusters that differ with respect to only a subset of features. Such clusters may represent homogeneous subgroups of patients with a disease. In this dissertation, we first propose a general framework for biclustering based on the sparse clustering method. Specifically, we develop an algorithm for identifying features that belong to biclusters. This framework can be used to identify biclusters that differ with respect to the means of the features, the variances of the features, or more general differences. We apply these methods to several simulated and real-world data sets, and the results of our methods compare favourably with previously published methods, with respect to both predictive accuracy and computing time. As a follow-up to the biclustering study, we further examine the sparse clustering algorithm and point out a few limitations of its tuning parameter selection procedure. We propose an alternative approach to select the tuning parameter and to better identify features with positive weights. We compare our algorithm with the existing sparse clustering method on both simulated and real-world data sets, and the results suggest that our method outperforms the existing method, especially in the presence of a weak clustering signal. For the last project, we consider random forest variable importance (VIMP) scores. We propose an alternative algorithm to calculate conditional VIMP scores. We test our proposed algorithm on both simulated and real-world data sets, and the results suggest that our conditional VIMP scores better reveal the association between predictor variables and the modelling outcome, even in the presence of correlation among the predictor variables.
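The dissertation's conditional VIMP algorithm itself is not reproduced here; as background, the following is a minimal sketch of the standard (unconditional) permutation VIMP that such methods build on: a feature's importance is the drop in model accuracy when that feature's column is randomly shuffled, breaking its link to the outcome. The function and variable names (`permutation_vimp`, the toy threshold model) are illustrative, not from the dissertation.

```python
import random

def permutation_vimp(model, X, y, feature, metric, n_repeats=10, seed=0):
    """Permutation variable importance: average drop in the metric when
    one feature's values are shuffled across observations."""
    rng = random.Random(seed)
    baseline = metric(model, X, y)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature] for row in X]
        rng.shuffle(col)
        # Rebuild the data with only this feature's column permuted.
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, col)]
        drops.append(baseline - metric(model, X_perm, y))
    return sum(drops) / n_repeats

def accuracy(model, X, y):
    return sum(model(row) == yi for row, yi in zip(X, y)) / len(y)

# Toy model: predicts from feature 0 only; feature 1 is pure noise.
model = lambda row: 1 if row[0] > 0.5 else 0
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]

vimp0 = permutation_vimp(model, X, y, 0, accuracy)  # informative feature
vimp1 = permutation_vimp(model, X, y, 1, accuracy)  # noise feature
```

On this toy example, permuting the informative feature sharply degrades accuracy, while permuting the noise feature leaves it unchanged. The limitation the dissertation targets is that when predictors are correlated, such unconditional permutation can inflate the apparent importance of variables that merely proxy for others, which motivates conditional variants.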
Rights statement
  • In Copyright
Advisor
  • Zeng, Donglin
  • Kosorok, Michael
  • Nobel, Andrew
  • Slade, Gary
  • Bair, Eric
Degree
  • Doctor of Philosophy
Degree granting institution
  • University of North Carolina at Chapel Hill Graduate School
Graduation year
  • 2014
Place of publication
  • Chapel Hill, NC
Access
  • This item is restricted from public view for 2 years after publication.