Nonparametric methods for machine learning and association testing

Last Modified
  • March 20, 2019
  • Helgeson, Erika
    • Affiliation: Gillings School of Global Public Health, Department of Biostatistics
  • As data collection becomes easier, non-parametric machine learning methods are increasing in popularity due to their ability to quickly discover informative data structures useful for prediction. Unsupervised clustering methods can be especially valuable for identifying subgroups in high-dimensional gene expression data. Another important goal is prediction of disease or symptom outcomes from a given genotype. One way to achieve this goal is to first identify genetic factors associated with the outcome of interest. This may be especially challenging if the association is investigated in a sample with selection stratified with respect to a third variable, as occurs when studying secondary phenotypes in case-control studies. In Chapter 2 we develop a non-parametric cluster significance testing algorithm. This algorithm compares the strength of identified clusters to the strength of spurious clusters produced from unimodal reference data. The method utilizes dimension reduction and sparse covariance estimation, making it especially relevant for high-dimensional data sets. We also extend the method to estimate the number of clusters present. The method is applied to several simulated and real-world data sets. We find it has comparable accuracy to existing methods and, in addition, can be used in a wider array of settings. We next develop a permutation-based sparse biclustering algorithm, built upon the method of Witten and Tibshirani (2010), which iteratively employs a cluster significance testing step. Biclustering identifies a submatrix such that the pattern of the features for the observations within the submatrix is different from the pattern outside the submatrix. We present simulation and real data results, with comparisons to existing methods, illustrating the accuracy of the proposed method in assigning observations to clusters and identifying distinguishing features.
In the last chapter we develop a permutation-based method for assessing the association between genetic factors and secondary phenotypes within a case-control study. Conventional inverse-probability-of-sampling-weighted (IPW) regression (Monsees, Tamimi, and Kraft, 2009; Richardson et al., 2007) may produce invalid estimates of association strength in situations where most of the variation in the secondary phenotype is found in the cases. Simulation and real data results indicate that the proposed method has better type I error rates than, and power comparable to, the conventional IPW method and can be used to identify novel SNPs associated with clinical orofacial pain.
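As a rough illustration of the permutation idea described in the last paragraph, the sketch below computes a permutation p-value for an inverse-probability-weighted association between a genotype and a secondary phenotype. It is not the dissertation's algorithm, whose details are not given in this record. The function names, the use of a weighted least-squares slope as the test statistic, and the choice to shuffle genotypes within case and control strata are all assumptions made for illustration.

```python
import numpy as np


def weighted_slope(x, y, w):
    """Weighted least-squares slope of y on x, with sampling weights w."""
    xm = np.average(x, weights=w)
    ym = np.average(y, weights=w)
    return np.sum(w * (x - xm) * (y - ym)) / np.sum(w * (x - xm) ** 2)


def stratified_permutation_pvalue(x, y, w, strata, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the weighted slope of y on x.

    Assumption (illustrative only): genotypes x are shuffled within each
    stratum (for example, cases and controls), so each stratum keeps its
    own phenotypes and weights under the null of no association.
    """
    rng = np.random.default_rng(seed)
    observed = abs(weighted_slope(x, y, w))
    # Index sets for each stratum, computed once up front.
    groups = [np.flatnonzero(strata == s) for s in np.unique(strata)]
    exceed = 0
    for _ in range(n_perm):
        x_perm = x.copy()
        for idx in groups:
            # Reassign genotypes among the members of this stratum.
            x_perm[idx] = x[rng.permutation(idx)]
        if abs(weighted_slope(x_perm, y, w)) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Here `x` would hold genotype codes (for example, minor-allele counts), `y` the secondary phenotype, `w` the inverse sampling probabilities, and `strata` the case-control indicator, all as NumPy arrays of equal length. Comparing the observed statistic to its permutation distribution avoids relying on the large-sample standard errors of the weighted regression.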
Date of publication
Resource type
Rights statement
  • In Copyright
  • Bair, Eric
  • Slade, Gary
  • Marron, James Stephen
  • Kosorok, Michael
  • Liu, Yufeng
  • Doctor of Philosophy
Degree granting institution
  • University of North Carolina at Chapel Hill Graduate School
Graduation year
  • 2017
