High dimension, low sample size data analysis

Last Modified
  • March 20, 2019
  • Ahn, Jeongyoun
    • Affiliation: College of Arts and Sciences, Department of Statistics and Operations Research
  • This dissertation consists of three research topics in High Dimension, Low Sample Size (HDLSS) data analysis. The first topic is a study of the sample covariance matrix of a data set with extremely large dimensionality but relatively small sample size. In particular, the study focuses on the asymptotic behavior of the eigenvalues and eigenvectors of the sample covariance matrix. Assuming that the true population covariance matrix of the data is not too far from the identity matrix (i.e., spherical in the Gaussian case), we show that the sample eigenvalues and eigenvectors tend to behave as if the true covariance were indeed the identity. Based on this, the asymptotic geometric representation of HDLSS data is extended to a wide range of underlying distributions. The representation essentially states that the data vectors form a regular simplex in the data space, with the number of vertices equal to the sample size. The second part of the dissertation studies a discriminant direction vector that is of interest only in HDLSS settings. This direction is characterized by the property that it projects all the data vectors, which are generated from two classes, onto two distinct values, one for each class. It will be seen that this Maximal Data Piling (MDP) direction lies within the hyperplane generated by all the data vectors, while it is orthogonal to the hyperplanes generated by each class. It has the largest distance between piling sites among all possible piling direction vectors and also maximizes the amount of piling. The formula for MDP is equivalent to Fisher's linear discriminant when the dimension is less than the sample size. As a classification method, MDP is heuristically desirable when the data are well approximated by the HDLSS geometric representation. The third topic relates to kernel methods in statistical learning, especially the kernel-based classification problem. Taking the case of the Gaussian kernel function for support vector machines and mean difference methods, we propose a novel approach to selecting the bandwidth parameter in kernel functions. The derivation is based on the fact that the bandwidth parameter in a kernel function determines the geometry of the high-dimensional kernel-embedded feature space. Compared with cross-validation and other tuning criteria from the literature, our approach is demonstrated, in real and simulated data examples, to be robust to sampling variation while maintaining comparable classification power and low computing cost.
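The data piling property described in the abstract can be illustrated with a short numerical sketch. It is not taken from the dissertation. It assumes the commonly cited form of the MDP direction, the Moore-Penrose pseudoinverse of the global sample covariance applied to the difference of class means, and it uses simulated Gaussian data. The function name `mdp_direction`, the cutoff `rcond`, and all data settings are hypothetical choices made for illustration.

```python
import numpy as np


def mdp_direction(X1, X2, rcond=1e-10):
    """Unit-length MDP direction for two classes (rows are observations).

    Assumed form: pinv(global sample covariance) @ (mean1 - mean2).
    """
    X = np.vstack([X1, X2])
    Xc = X - X.mean(axis=0)              # center at the overall mean
    S = Xc.T @ Xc / (X.shape[0] - 1)     # global covariance, rank <= n - 1
    mean_diff = X1.mean(axis=0) - X2.mean(axis=0)
    # Explicit cutoff so numerically zero eigenvalues are not inverted.
    w = np.linalg.pinv(S, rcond=rcond, hermitian=True) @ mean_diff
    return w / np.linalg.norm(w)


# Simulated HDLSS data (assumption): dimension 200, 10 observations per class.
rng = np.random.default_rng(0)
d, n1, n2 = 200, 10, 10
X1 = rng.normal(size=(n1, d)) + 0.5
X2 = rng.normal(size=(n2, d)) - 0.5

w = mdp_direction(X1, X2)
p1, p2 = X1 @ w, X2 @ w                  # projections onto the direction

print("class 1 projection spread:", np.ptp(p1))
print("class 2 projection spread:", np.ptp(p2))
print("distance between piling sites:", abs(p1.mean() - p2.mean()))
```

When the dimension exceeds the sample size, the within-class spreads should be zero up to floating-point error, so each class collapses to a single projected value. With more observations than dimensions the same formula gives a direction proportional to Fisher's linear discriminant and no exact piling occurs.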
Rights statement
  • In Copyright
  • Marron, James Stephen
  • Open access
