Affiliation: College of Arts and Sciences, Department of Statistics and Operations Research
Research in a number of fields requires the analysis of complex datasets. Principal Components Analysis (PCA) is a popular exploratory method. However, it is driven entirely by variation in the dataset and does not use any predefined class label information. Linear classifiers make up a family of popular discrimination methods. However, these often suffer from the data piling issue as the dimension of the dataset grows. In this dissertation, we first study the geometric representation of a dataset with strongly auto-regressive errors under the High Dimensional Low Sample Size (HDLSS) setting, and explain why Maximal Data Piling (MDP), proposed by Ahn et al. (2007), performs best in terms of classification compared with several other commonly used linear discrimination methods. We then introduce Class-Sensitive Principal Components Analysis (CSPCA), a compromise between PCA and MDP that seeks new direction vectors for better class-sensitive visualization. This method is applied to the Thyroid Cancer dataset (see Agrawal et al. (2014)). Additionally, we investigate the asymptotic behavior of the sample and population MDP normal vectors and Class-Sensitive Principal Component directions under the HDLSS setting. Finally, the multi-class version of CSPCA (MCSP) is introduced in the last part of this dissertation.