Methods of Association Mining by Variable-to-Set Affinity Testing

Last Modified
  • March 20, 2019
  • Bodwin, Kelly
    • Affiliation: College of Arts and Sciences, Department of Statistics and Operations Research
  • Statistical data mining refers to methods for identifying and validating interesting patterns from an overabundance of data. Data mining tasks in which the objective involves pairwise relationships between variables are known as association mining. In general, features sought by association mining methods are sets of variables, often small subsets of a larger collection, that are more associated internally than externally. Methods vary in both the measure of association that is studied and the algorithm by which associated sets are identified. This dissertation provides a generalized framework for association mining called Variable-to-Set Affinity Testing (VSAT). Unlike conventional techniques for clustering or community detection, which usually maximize a score from a dissimilarity or adjacency matrix, the VSAT approach is an adaptive procedure grounded in statistical hypothesis testing principles. The framework is adaptable to a broad class of measurements for variable relationships, and is equipped with theoretical guarantees of error control. This dissertation also presents in detail two new association mining methods built in the VSAT framework. The first, Differential Correlation Mining (DCM), identifies variable sets that have higher average pairwise correlation in one sample condition than in another. Such artifacts are of scientific interest in many fields, including statistical genetics and neuroscience. Differential Correlation Mining is applied to high-dimensional data sets in these two fields. The second method, Coherent Set Mining (CSM), is a novel approach to association mining in binary data. Dichotomous observations are assumed to derive from a latent variable of interest via thresholding. The Coherent Set Mining method identifies variable sets that are strongly associated in the latent measure, despite distortions in the association structure of the observed data due to the thresholding process. Coherent Set Mining is applied to problems in text mining, statistical genetics, and product recommendation.
Date of publication
Resource type
Rights statement
  • In Copyright
  • Nobel, Andrew
  • Marron, James Stephen
  • Bhamidi, Shankar
  • Zhang, Kai
  • Xia, Yin
  • Doctor of Philosophy
Degree granting institution
  • University of North Carolina at Chapel Hill Graduate School
Graduation year
  • 2017
