Robust Clustering Methods with Subpopulation-specific Deviations

Last Modified
  • March 19, 2019
  • Stephenson, Briana
    • Affiliation: Gillings School of Global Public Health, Department of Biostatistics
  • Large populations have been found to comprise distinct subpopulations. Members of a subpopulation tend to exhibit similar behaviors and to respond to various outcomes differently from members of other subpopulations. Mixture models have become a useful tool for modeling these differences by assigning each subpopulation its own distribution. The underlying subpopulation inferred from the mixture model is referred to as a latent class or cluster. Traditional clustering methods operate under the assumption that two subjects allocated to the same cluster will respond identically to all measured variables. Yet deviations in individual measured variables can yield valuable information. Furthermore, the number of clusters these models produce often grows with the sample size and the number of variables. This can reduce interpretability, owing to the large number of clusters and an oversensitivity to minor deviations among groups. First, we develop a parsimonious clustering method to address these complexities. Motivated by a local partition process framework, we propose a new method, Robust Profile Clustering (RPC), that allows subjects to aggregate at two levels: (1) globally, where subjects are allotted to overall population-level clusters, and (2) locally, where individual measured items can deviate from their global indicators via a Beta-Bernoulli process to accommodate differences across groups of individuals. Second, we build upon this to create a predictive clustering model that links the clustering model generated by RPC with a probit response model via a supervised RPC joint model. Here, subjects more likely to exhibit the outcome of interest can cluster in accordance with their global and local profiles.
Finally, we discuss the impact and practicality of these methods, as well as other recent machine learning techniques in nutritional epidemiology, for improving dietary pattern analysis in large heterogeneous populations, such as the United States, while adjusting for potential state-level differences. Using data from the 1997-2009 National Birth Defects Prevention Study, we focus our application on maternal diet and its association with oral cleft birth defects.
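The two-level structure described in the abstract can be illustrated with a small generative simulation. This is only a sketch under stated assumptions, not the dissertation's model specification or its fitting procedure: the function name `simulate_rpc`, the uniform cluster weights, the single local profile per subpopulation, the Dirichlet(1, ..., 1) item profiles, and all default parameter values are hypothetical choices made for illustration. What it tries to show is the switching idea: for each subpopulation and item, a Beta-distributed probability governs a Bernoulli indicator that decides whether a subject's response to that item follows the global cluster profile or a subpopulation-specific one.

```python
import random


def simulate_rpc(n_subjects=200, n_items=6, n_global=3, n_subpops=2,
                 n_levels=3, a=1.0, b=1.0, seed=0):
    """Simulate data from a simplified two-level (global/local) profile model.

    All names and defaults are illustrative assumptions, not the source's.
    """
    rng = random.Random(seed)

    def random_profile():
        # Normalized Gamma(1, 1) draws give one Dirichlet(1, ..., 1) vector
        # of response-level probabilities for a single item.
        w = [rng.gammavariate(1.0, 1.0) for _ in range(n_levels)]
        total = sum(w)
        return [x / total for x in w]

    # Item-response profiles for each global (population-level) cluster.
    global_profiles = [[random_profile() for _ in range(n_items)]
                       for _ in range(n_global)]
    # Assumption: one local profile per subpopulation (e.g. per state).
    local_profiles = [[random_profile() for _ in range(n_items)]
                      for _ in range(n_subpops)]
    # Beta-Bernoulli switch: nu[s][j] is the probability that item j
    # follows the global profile within subpopulation s.
    nu = [[rng.betavariate(a, b) for _ in range(n_items)]
          for _ in range(n_subpops)]

    subjects = []
    for _ in range(n_subjects):
        s = rng.randrange(n_subpops)   # subpopulation membership
        c = rng.randrange(n_global)    # global cluster (uniform weights assumed)
        follows_global = [rng.random() < nu[s][j] for j in range(n_items)]
        responses = []
        for j in range(n_items):
            probs = global_profiles[c][j] if follows_global[j] else local_profiles[s][j]
            responses.append(rng.choices(range(n_levels), weights=probs)[0])
        subjects.append({"subpop": s, "global_cluster": c,
                         "follows_global": follows_global,
                         "responses": responses})
    return subjects, nu
```

Calling `simulate_rpc(n_subjects=50, seed=1)` returns 50 simulated subjects and the subpopulation-by-item switching probabilities. Items whose `nu` value is close to 1 behave like an ordinary mixture model, while items with values close to 0 are driven mostly by the subpopulation-specific profile.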
Date of publication
Resource type
Rights statement
  • In Copyright
  • Dunson, David B.
  • Olshan, Andrew
  • Sotres-Alvarez, Daniela
  • Zhou, Haibo
  • Herring, Amy
  • Doctor of Philosophy
Degree granting institution
  • University of North Carolina at Chapel Hill Graduate School
Graduation year
  • 2017
