Principal component analyses for tree structured objects

This study belongs to the relatively new statistical area of Object Oriented Data Analysis, which treats general data objects (3D images, movies, etc.) as the atoms of interest. The focus is on populations of tree-structured objects. Because the space of binary trees is highly non-Euclidean, replacing classical analysis ideas with their counterparts in this new setting is a challenging task. Ideas analogous to Principal Component Analysis (PCA) for trees have previously been developed based on tree-lines. In this work, numerically fast (linear-time) algorithms are developed for tree-line-based PCA, which enable the first large-scale data analysis of trees. Our analysis of tree-line PCA has led to the invention of improved Principal Component Analyses, based on the new concepts of k-tree-lines and tree-curves. The tree-line analysis gives promising results. However, many tree-lines are required to explain most of the variation in the data. The idea of tree-curves directly targets this drawback of tree-lines. However, no polynomial-time algorithm for finding optimal tree-curves exists. The heuristics developed give results that explain more variation than was observed previously. The k-tree-line study is proposed as a bridge between the tree-line and tree-curve ideas. Polynomial-time algorithms are sought for this group of problems. These three proposed PCA methods are used to conduct a study that compares the three existing data sets and measures the age effect on each subpopulation within the sets. The advantages and shortcomings of the methods relative to one another are also discussed in the context of the data analysis. The motivating data set of this study is a collection of the brain vessel structures of 105 subjects. Due to inaccuracies in the scanning and tracking of these vessels, this data set is known to include a high amount of noise.
A detailed visualization method is proposed in this work to spot the instances that require manual cleaning or need to be excluded.
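
To make the tree-line idea in the abstract more concrete, the following is a minimal sketch of how a first principal tree-line might be computed. It is not the thesis's algorithm, and it rests on several assumptions that the abstract does not state: each binary tree is stored as a set of node indices with the root at 1 and the children of node i at 2i and 2i+1; the distance between two trees is the size of the symmetric difference of their node sets; the starting tree is the intersection of all data trees; and a tree-line grows the starting tree by one node at a time, each new node being a child of the previous one. Under these assumptions, minimizing the total distance from the data trees to their projections amounts to choosing the downward path whose nodes occur in the most data trees. The names `first_tree_line` and `project` are hypothetical.

```python
from collections import Counter


def first_tree_line(trees):
    """Return (start, path) for a first principal tree-line.

    Assumes every tree is a non-empty set of heap-style node indices
    (root = 1, children of i are 2*i and 2*i + 1) that contains the root.
    """
    trees = [set(t) for t in trees]
    start = set.intersection(*trees)               # assumed starting tree
    weight = Counter(v for t in trees for v in t)  # trees containing each node
    support = set(weight)                          # union of all data trees

    best, nxt = {}, {}
    # Children have larger indices than parents, so a descending pass
    # visits every child before its parent.
    for v in sorted(support - start, reverse=True):
        kids = [c for c in (2 * v, 2 * v + 1) if c in best]
        child = max(kids, key=best.__getitem__, default=None)
        best[v] = weight[v] + (best[child] if child is not None else 0)
        nxt[v] = child

    # The path must begin at a node attached directly to the starting tree.
    heads = [v for v in best if v // 2 in start]
    if not heads:
        return start, []
    v = max(heads, key=best.__getitem__)
    path = []
    while v is not None:
        path.append(v)
        v = nxt[v]
    return start, path


def project(tree, start, path):
    """Project a tree onto the tree-line: keep the path prefix it contains."""
    proj = set(start)
    for v in path:
        if v not in tree:
            break
        proj.add(v)
    return proj
```

For example, `first_tree_line([{1, 2, 4}, {1, 2, 4, 8}, {1, 3}, {1, 2}])` starts from the common tree `{1}` and follows the path `[2, 4, 8]`, since those nodes are shared by more trees than the alternative branch through node 3. The sort makes this sketch slightly slower than linear time; it is meant only to illustrate the objective, not to reproduce the linear-time algorithms developed in the thesis.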