New Statistical Learning Approaches with Applications to RNA-seq Data

This dissertation examines statistical learning problems in both supervised and unsupervised settings and consists of three major parts. The first two address the significance of clustering; the third describes a novel framework for unifying hard and soft classification through a spectrum of binary learning problems. In the unsupervised task of clustering, determining whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation has been a critical and challenging question. We introduce two new methods for addressing this question using statistical significance. In the first part, we describe SigFuge, an approach for identifying genomic loci exhibiting differential transcription patterns across many RNA-seq samples. In the second part, we describe statistical Significance of Hierarchical Clustering (SHC), a Monte Carlo-based approach for testing significance in hierarchical clustering, and demonstrate the power of the method to identify significant clustering using two cancer gene expression datasets. Both methods were implemented and made available as open-source packages in R. In the final part, we propose a spectrum of supervised learning problems that spans the hard and soft classification tasks, based on fitting multiple decision rules to a dataset. By doing so, we reveal a novel collection of binary supervised learning problems. We study these problems using the framework of large-margin classification and a class of piecewise linear surrogate losses, for which we derive statistical properties. We evaluate our approach using simulations and a magnetic resonance imaging (MRI) dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study.
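The general idea of a Monte Carlo test for clustering significance can be illustrated with a small sketch. This is not the SigFuge or SHC implementation (those are R packages, and their details are not given here); it is a simplified stand-in under stated assumptions: the data are one-dimensional, the test statistic is a 2-means cluster index (within-cluster sum of squares divided by total sum of squares), and the null hypothesis is a single Gaussian. The function names `two_means_cluster_index` and `monte_carlo_cluster_pvalue` are hypothetical.

```python
import random


def two_means_cluster_index(values):
    """Return the 2-means cluster index (within-cluster SS / total SS) for 1-D data.

    In one dimension the optimal 2-means split is a threshold on the sorted
    values, so every split point can be checked exactly.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    total_ss = total_sq - total * total / n
    if total_ss == 0:
        return 1.0  # all points identical: no spread to explain
    best = total_ss
    left_sum = 0.0
    left_sq = 0.0
    for k in range(1, n):  # k = size of the left cluster
        left_sum += xs[k - 1]
        left_sq += xs[k - 1] ** 2
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        within = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / (n - k))
        best = min(best, within)
    return best / total_ss


def monte_carlo_cluster_pvalue(values, n_sim=500, seed=0):
    """Estimate a p-value for 'two clusters' against a single-Gaussian null.

    Assumption: the cluster index is invariant to shift and scale, so null
    datasets are drawn from a standard normal of the same size.
    """
    rng = random.Random(seed)
    observed = two_means_cluster_index(values)
    n = len(values)
    hits = 0
    for _ in range(n_sim):
        null_sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # A small index means strong clustering, so count null draws that
        # look at least as clustered as the observed data.
        if two_means_cluster_index(null_sample) <= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)  # add-one correction keeps p > 0


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [rng.gauss(-5, 1) for _ in range(30)] + [rng.gauss(5, 1) for _ in range(30)]
    print(monte_carlo_cluster_pvalue(sample, n_sim=200))
```

A small p-value suggests the observed split is tighter than what sampling variation from a single Gaussian would typically produce. Extending this to high-dimensional data and to every node of a hierarchical clustering tree requires further choices (null covariance estimation, multiplicity control) that this sketch does not attempt.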