High dimension, low sample size data analysis
- Last Modified: March 20, 2019
- Affiliation: College of Arts and Sciences, Department of Statistics and Operations Research
- This dissertation consists of three research topics in High Dimension, Low Sample Size (HDLSS) data analysis.

The first topic is a study of the sample covariance matrix of a data set whose dimensionality is extremely large relative to the sample size. In particular, we focus on the asymptotic behavior of the eigenvalues and eigenvectors of the sample covariance matrix. Assuming that the true population covariance matrix is not too far from the identity matrix (i.e., spherical in the Gaussian case), we show that the sample eigenvalues and eigenvectors tend to behave as if the underlying covariance were indeed the identity. Based on this, the asymptotic geometric representation of HDLSS data is extended to a wide range of underlying distributions. The representation essentially states that the data vectors form a regular simplex in the data space, with the number of vertices equal to the sample size.

The second part of the dissertation studies a discriminant direction vector that is of interest only in HDLSS settings. This direction is characterized by the property that it projects all the data vectors, generated from two classes, onto exactly two distinct values, one for each class. This Maximal Data Piling (MDP) direction lies within the hyperplane generated by all the data vectors, while being orthogonal to the hyperplanes generated by each class. Among all direction vectors that pile the data in this way, it yields the largest distance between the two piling sites and also maximizes the amount of piling. The MDP formula is equivalent to Fisher's linear discriminant rule when the dimension is less than the sample size. As a classification method, MDP is heuristically desirable when the data are well approximated by the HDLSS geometric representation.

The third topic concerns kernel methods in statistical learning, in particular the kernel-based classification problem.
Taking the Gaussian kernel for the support vector machine and mean difference methods as our case study, we propose a novel approach to selecting the bandwidth parameter of the kernel function. The derivation rests on the fact that the bandwidth determines the geometry of the high-dimensional feature space into which the kernel embeds the data. In real and simulated data examples, our approach is shown to be robust to sampling variation while maintaining comparable classification power and low computing cost, compared with cross-validation and other tuning criteria from the literature.
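The regular-simplex representation from the first topic can be checked numerically: for independent N(0, I_d) vectors, every pairwise distance divided by sqrt(2d) converges to 1 as d grows. A minimal sketch (the dimension, sample size, and random seed below are illustrative choices, not values from the dissertation):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 100_000, 5                      # dimension far exceeds sample size
X = rng.standard_normal((n, d))        # n draws from N(0, I_d)

# Pairwise distances scaled by sqrt(2d): under the HDLSS geometric
# representation these all concentrate near 1, so the n points form an
# approximately regular simplex.
scaled = np.array([
    np.linalg.norm(X[i] - X[j]) / np.sqrt(2 * d)
    for i in range(n) for j in range(i + 1, n)
])
print(scaled.round(3))   # all entries close to 1
```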
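The piling property of the second topic can also be sketched. The abstract does not reproduce the MDP formula; the sketch below assumes the pseudoinverse form from the HDLSS literature, w = S⁺(x̄₁ − x̄₂) with S the total sample covariance centered at the overall mean, which is the form that reduces to Fisher's rule when the dimension is below the sample size. When d exceeds the sample size, each class projects onto a single value:

```python
import numpy as np

rng = np.random.default_rng(1)

d, n1, n2 = 200, 5, 5                       # HDLSS: d >> n1 + n2
X1 = rng.standard_normal((d, n1))           # class 1, columns are data vectors
X2 = rng.standard_normal((d, n2)) + 1.0     # class 2, shifted mean
X = np.hstack([X1, X2])

xbar = X.mean(axis=1, keepdims=True)
Xc = X - xbar                          # center at the overall mean
S = Xc @ Xc.T                          # (unnormalized) total covariance
diff = X1.mean(axis=1) - X2.mean(axis=1)

# Assumed MDP formula: pseudoinverse of the total covariance applied to
# the mean-difference vector.
w = np.linalg.pinv(S) @ diff

proj1 = X1.T @ w                       # projections of class 1
proj2 = X2.T @ w                       # projections of class 2
# Each class piles onto a single value (zero spread up to float error),
# while the two piling sites remain separated.
print(np.ptp(proj1), np.ptp(proj2))
```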
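The claim that the bandwidth controls the geometry of the kernel-embedded feature space can be illustrated through the kernel (Gram) matrix, since its entries are inner products of the embedded points. This is only a sketch of the two degenerate extremes, not the selection criterion proposed in the dissertation:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 10))       # 6 points in 10 dimensions

def gaussian_kernel(X, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Tiny bandwidth: K -> I, so the embedded points become mutually
# orthogonal unit vectors and all between-point structure is lost.
K_small = gaussian_kernel(X, 1e-3)

# Huge bandwidth: K -> all-ones, so every point embeds to (nearly) the
# same feature vector and the classes collapse together.
K_large = gaussian_kernel(X, 1e3)

print(np.abs(K_small - np.eye(6)).max())        # ~ 0
print(np.abs(K_large - np.ones((6, 6))).max())  # ~ 0
```

A useful bandwidth therefore lies between these extremes, which is the geometric intuition behind tuning the parameter.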
- Date of publication: August 2006
- Rights statement: In Copyright
- Marron, James Stephen
- Open access