NON-PARAMETRIC MACHINE LEARNING METHODS FOR CLUSTERING AND VARIABLE SELECTION
Citation
MLA
Liu, Qian. Non-parametric Machine Learning Methods for Clustering and Variable Selection. Chapel Hill, NC: University of North Carolina at Chapel Hill Graduate School, 2014. https://doi.org/10.17615/swym-w286
APA
Liu, Q. (2014). Non-parametric machine learning methods for clustering and variable selection. Chapel Hill, NC: University of North Carolina at Chapel Hill Graduate School. https://doi.org/10.17615/swym-w286
Chicago
Liu, Qian. 2014. Non-Parametric Machine Learning Methods for Clustering and Variable Selection. Chapel Hill, NC: University of North Carolina at Chapel Hill Graduate School. https://doi.org/10.17615/swym-w286
- Last Modified
- March 19, 2019
- Creator
- Liu, Qian
- Affiliation: Gillings School of Global Public Health, Department of Biostatistics
- Abstract
- Non-parametric machine learning methods are popular and widely used in many scientific research areas, especially when dealing with high-dimension, low-sample-size (HDLSS) data. In particular, clustering and biclustering approaches can serve as exploratory analysis tools to uncover informative data structures, and random forest models are well suited to coping with complex variable interactions. In many situations it is desirable to identify clusters that differ with respect to only a subset of features; such clusters may represent homogeneous subgroups of patients with a disease. In this dissertation, we first propose a general framework for biclustering based on the sparse clustering method. Specifically, we develop an algorithm for identifying features that belong to biclusters. This framework can be used to identify biclusters that differ with respect to the means of the features, the variances of the features, or more general differences. We apply these methods to several simulated and real-world data sets, and the results of our methods compare favourably with previously published methods with respect to both predictive accuracy and computing time. As a follow-up to the biclustering study, we examine the sparse clustering algorithm further and point out a few limitations of its procedure for tuning-parameter selection. We propose an alternative approach to select the tuning parameter and to better identify features with positive weights. We compare our algorithm with the existing sparse clustering method on both simulated and real-world data sets, and the results suggest that our method outperforms the existing method, especially in the presence of a weak clustering signal. For the last project, we consider random forest variable importance (VIMP) scores and propose an alternative algorithm to calculate conditional VIMP scores. We test the proposed algorithm on both simulated and real-world data sets, and the results suggest that our conditional VIMP scores better reveal the association between predictor variables and the modelling outcome, despite correlation among the predictor variables.
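The conditional VIMP algorithm itself is developed in the dissertation; as background for readers, a minimal sketch of the standard marginal permutation importance that conditional VIMP refines might look like the following. Everything here (the synthetic data, the least-squares stand-in model, the `permutation_vimp` helper) is illustrative and assumed, not taken from the dissertation; the score for a predictor is simply the mean increase in prediction error after that predictor's column is shuffled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends strongly on x0, weakly on x1, and not on x2.
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Fit ordinary least squares as a stand-in for any predictive model
# (a random forest would play this role in the dissertation's setting).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
predict = lambda M: M @ beta

def permutation_vimp(X, y, predict, rng, n_repeats=20):
    """Marginal permutation VIMP: mean increase in MSE when one
    predictor column is shuffled, breaking its link to the outcome."""
    base_mse = np.mean((y - predict(X)) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            scores[j] += np.mean((y - predict(Xp)) ** 2) - base_mse
    return scores / n_repeats

vimp = permutation_vimp(X, y, predict, rng)
print(vimp)  # importance ordering: x0 >> x1 > x2 (near zero)
```

Because this marginal version permutes each column independently, correlated predictors can "borrow" importance from one another; a conditional variant, such as the one proposed above, instead permutes a predictor within groups defined by its correlated companions.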
- Date of publication
- December 2014
- Keyword
- Subject
- DOI
- Identifier
- Resource type
- Rights statement
- In Copyright
- Advisor
- Kosorok, Michael
- Slade, Gary
- Zeng, Donglin
- Bair, Eric
- Nobel, Andrew
- Degree
- Doctor of Philosophy
- Degree granting institution
- University of North Carolina at Chapel Hill Graduate School
- Graduation year
- 2014
- Language
- Publisher
- Place of publication
- Chapel Hill, NC
- Access right
- This item is restricted from public view for 2 years after publication.
- Date uploaded
- April 22, 2015
Items
| Thumbnail | Title | Date Uploaded | Visibility | Actions |
|---|---|---|---|---|
| | Liu_unc_0153D_14880.pdf | 2019-04-12 | Public | Download |