Principal Component Analysis in Phylogenetic Tree Space

Zhai, Haojin

Last Modified

March 20, 2019

Creator

Zhai, Haojin
- Affiliation: College of Arts and Sciences, Department of Statistics and Operations Research

Abstract

Complex data objects arise in many fields of modern science, including drug discovery, psychology, the dynamics of gene expression, and anatomy. Object oriented data analysis is the statistical analysis of a population of complex data objects. The specific case of tree-structured data objects is a large and promising research area with many interesting questions and challenging problems. This dissertation focuses on principal component analysis in the tree space introduced by Billera, Holmes, and Vogtmann. Principal component analysis is widely used to aid visualization and reduce dimension, and it is natural to extend this type of analysis to tree space. In this dissertation, we discuss three approaches to this extension. The first approach is multidimensional scaling, which focuses on better visualization of data in tree space and, in particular, on the out-of-sample embedding problem of inserting additional points into a previously constructed multidimensional scaling configuration. It is shown that a better visualization can be achieved by choosing a higher dimensional embedding space and displaying only the first two dimensions. The other two approaches rely on our novel definition of a tree space line, and it is proven that there are only two types of such lines. The second approach is the sample-limited geodesic, an analog of the first type of line. It defines the first principal component of a set of trees by maximizing the variance of the projected data over geodesic segments connecting pairs of trees. Our study shows that the sample-limited geodesic is not an effective principal component object in terms of capturing data variation, owing to the intrinsic geometry of the data used in this dissertation, and that it does not generalize naturally to higher-order principal component objects. The third approach is based on the principal ray set, which is a representative of the second type of line. We develop heuristic search algorithms for first order principal ray sets and for higher order principal axis sets, which are special cases of principal ray sets. Principal ray sets are better summaries for less variable data, but capture very limited information for data with larger spread.
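
As a rough illustration of the sample-limited idea described in the abstract, the sketch below searches over segments joining pairs of sample points and keeps the one with the largest projected variance. It is not the dissertation's algorithm: it assumes ordinary Euclidean space in place of the tree space of Billera, Holmes, and Vogtmann, uses straight segments in place of tree space geodesics, and takes the arithmetic mean of the projections as the center. The function names `project_onto_segment` and `sample_limited_segment` are hypothetical.

```python
import itertools

import numpy as np


def project_onto_segment(points, a, b):
    """Project each row of `points` onto the closed segment from a to b.

    Euclidean stand-in (an assumption) for projection onto a geodesic segment.
    """
    direction = b - a
    length_sq = float(direction @ direction)
    if length_sq == 0.0:
        # Degenerate segment: every point projects to the single endpoint.
        return np.tile(a, (len(points), 1))
    # Position along the segment, clamped so projections stay between a and b.
    t = np.clip((points - a) @ direction / length_sq, 0.0, 1.0)
    return a + t[:, None] * direction


def sample_limited_segment(points):
    """Return (i, j, variance) for the pair of sample points whose
    connecting segment maximizes the variance of the projected data."""
    best = (None, None, -1.0)
    for i, j in itertools.combinations(range(len(points)), 2):
        proj = project_onto_segment(points, points[i], points[j])
        # Variance as the mean squared distance of projections to their mean.
        centered = proj - proj.mean(axis=0)
        variance = float(np.mean(np.sum(centered ** 2, axis=1)))
        if variance > best[2]:
            best = (i, j, variance)
    return best


if __name__ == "__main__":
    sample = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 1.0], [2.0, -1.0]])
    print(sample_limited_segment(sample))
```

The search is exhaustive over all pairs, so its cost grows quadratically in the number of sample points, with one full projection pass per pair. Restricting candidates to segments between data points is what makes the approach "sample-limited".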

Date of publication

May 2016

DOI

https://doi.org/10.17615/tx71-2y54

Resource type

Dissertation

Rights statement

In Copyright

Advisor

Marron, James Stephen
Provan, John
Lu, Shu
Pataki, Gabor
Miller, Ezra

Degree

Doctor of Philosophy

Degree granting institution

University of North Carolina at Chapel Hill Graduate School

Graduation year

2016

Language

English

Items

Title	Date Uploaded	Visibility
Zhai_unc_0153D_15893.pdf	2019-04-12	Public
