Principal Component Analysis in Phylogenetic Tree Space
Citation
MLA
Zhai, Haojin. Principal Component Analysis In Phylogenetic Tree Space. 2016. https://doi.org/10.17615/tx71-2y54
APA
Zhai, H. (2016). Principal Component Analysis in Phylogenetic Tree Space. https://doi.org/10.17615/tx71-2y54
Chicago
Zhai, Haojin. 2016. Principal Component Analysis In Phylogenetic Tree Space. https://doi.org/10.17615/tx71-2y54
- Last Modified
- March 20, 2019
- Creator
- Zhai, Haojin
- Affiliation: College of Arts and Sciences, Department of Statistics and Operations Research
- Abstract
- Complex data objects arise in many fields of modern science, including drug discovery, psychology, dynamics of gene expression, and anatomy. Object oriented data analysis is the statistical analysis of populations of complex data objects. The specific case of tree-structured data objects is a large and promising research area with many interesting questions and challenging problems. This dissertation focuses on principal component analysis in the tree space introduced by Billera, Holmes, and Vogtmann. Principal component analysis is widely used to aid visualization and reduce dimension, and it is natural to extend this type of analysis to tree space. In this dissertation, we discuss three approaches to this extension. The first approach is multidimensional scaling, which focuses on better visualization of data in tree space, and in particular on the out-of-sample embedding problem of inserting additional points into a previously constructed multidimensional scaling configuration. It is shown that a better visualization can be achieved by choosing a higher dimensional embedding space and displaying only the first two dimensions. The other two approaches rely on our novel definition of a tree space line, and it is proven that there are only two types of such lines. The second approach is the sample-limited geodesic, which is an analog of the first type of line. This idea defines the first principal component for a set of trees by maximizing the data projection variance over geodesic segments connecting pairs of trees. Our study shows that the sample-limited geodesic is not an effective principal component object in terms of capturing data variation, due to the intrinsic geometry of the data used in this dissertation, and that it does not generalize naturally to higher-order principal component objects. The third approach is based on the principal ray set, which is a representative of the second type of line. We develop heuristic search algorithms for first order principal ray sets and higher order principal axis sets, which are special cases of principal ray sets. Principal ray sets are better summaries for less variable data, but capture very limited information for data with larger spread.
- Date of publication
- May 2016
- Keyword
- DOI
- https://doi.org/10.17615/tx71-2y54
- Resource type
- Rights statement
- In Copyright
- Advisor
- Marron, James Stephen
- Provan, John
- Lu, Shu
- Pataki, Gabor
- Miller, Ezra
- Degree
- Doctor of Philosophy
- Degree granting institution
- University of North Carolina at Chapel Hill Graduate School
- Graduation year
- 2016
- Language
Items
Title | Date Uploaded | Visibility |
---|---|---|
Zhai_unc_0153D_15893.pdf | 2019-04-12 | Public |
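
The abstract describes the sample-limited geodesic as the segment between a pair of sample points that maximizes the variance of the projected data. The sketch below illustrates that selection rule only, under a strong simplifying assumption: ordinary Euclidean space stands in for tree space, so straight segments replace geodesics and projection is the usual closest-point map. The function name `sample_limited_segment` is hypothetical and is not taken from the dissertation.

```python
import itertools
import numpy as np


def sample_limited_segment(points):
    """Return (i, j, variance) for the pair of sample points whose
    connecting segment gives the largest variance of projected data.

    Assumption: Euclidean stand-in for tree space. Each point is
    projected to its closest point on the closed segment.
    """
    points = np.asarray(points, dtype=float)
    best = None
    for i, j in itertools.combinations(range(len(points)), 2):
        a, b = points[i], points[j]
        d = b - a
        sq_len = d @ d
        if sq_len == 0.0:
            continue  # identical endpoints define no segment
        # Parameter of the closest point on the segment, clipped to [0, 1].
        t = np.clip((points - a) @ d / sq_len, 0.0, 1.0)
        # Position along the segment in the original distance units.
        scores = t * np.sqrt(sq_len)
        variance = np.var(scores)
        if best is None or variance > best[2]:
            best = (i, j, variance)
    return best


if __name__ == "__main__":
    sample = [[0.0, 0.0], [10.0, 0.0], [5.0, 1.0], [5.0, -1.0]]
    i, j, variance = sample_limited_segment(sample)
    print(f"best pair: ({i}, {j}), projected variance: {variance:.3f}")
```

The search is exhaustive over all pairs, so the cost grows quadratically in the number of sample points, with a further linear factor for projecting the data onto each candidate segment.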