Study on correlations in high dimensional data

Last Modified
  • March 21, 2019
Creator
  • Gong, Siliang
    • Affiliation: College of Arts and Sciences, Department of Statistics and Operations Research
Abstract
  • With the prevalence of high-dimensional data, variable selection is crucial in many real applications. Although various methods have been investigated over the past decades, challenges remain when tens of thousands of predictor variables are available for modeling. One difficulty arises from spurious correlation: the sample correlation between two variables can be large when the dimension is relatively high, even if the variables are independent. Because many classical variable selection methods choose a variable based on its marginal correlation with the response, spurious correlation may result in a high false discovery rate. On the other hand, when important variables are highly correlated, it is desirable to include all of them in the model, yet many existing methods offer no such guarantee. Another challenge is that most variable selection approaches require model selection to control model complexity. Cross-validation is commonly used for this purpose, but it is computationally expensive and lacks statistical interpretation. In this work, we introduce novel variable selection approaches to address these challenges. The proposed methods are based on an investigation of the limiting distribution of the spurious correlation. In the first project, we study the maximal absolute sample partial correlation between the covariates and the response and introduce a testing-based variable selection procedure. In the second project, we take advantage of asymptotic results for the maximal absolute sample correlation among covariates and incorporate them into a penalized variable selection approach. The third project considers applications of the asymptotic results in multiple-response regression. Numerical studies demonstrate the effectiveness of the proposed methods.
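The spurious correlation phenomenon described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the dissertation: the sample size, the checkpoints, the seed, and the function names `pearson` and `running_max_abs_correlation` are all assumptions chosen for illustration. It draws a response and a growing pool of predictors as independent standard normals, so every true correlation is zero, and tracks the largest absolute sample correlation seen so far.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def running_max_abs_correlation(n, checkpoints, rng):
    """Track max |sample correlation| between one response and a growing
    pool of predictors. All variables are independent standard normals
    (an illustrative assumption), so any large value is spurious."""
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = 0.0
    out = {}
    for j in range(1, max(checkpoints) + 1):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        best = max(best, abs(pearson(x, y)))
        if j in checkpoints:
            out[j] = best  # maximum over the first j predictors
    return out


rng = random.Random(0)  # fixed seed, chosen arbitrarily
n = 50                  # hypothetical sample size
results = running_max_abs_correlation(n, {10, 100, 1000}, rng)
for p in sorted(results):
    print(f"n={n}, p={p}: max |r| = {results[p]:.3f}")
```

Because the maximum is taken over a nested, growing set of predictors, the reported value can only increase with the number of predictors, which is the effect that inflates false discoveries for selection rules based on marginal correlation.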
Rights statement
  • In Copyright
Advisor
  • Marron, Steve
  • Bhamidi, Shankar
  • Zeng, Donglin
  • Zhang, Kai
  • Liu, Yufeng
Degree
  • Doctor of Philosophy
Degree granting institution
  • University of North Carolina at Chapel Hill Graduate School
Graduation year
  • 2018