Efficient algorithms in analyzing genomic data

Pan, Feng

Download PDF

Request Version for Screen Reader

Last Modified

March 20, 2019

Creator

Pan, Feng
- Affiliation: College of Arts and Sciences, Department of Computer Science

Abstract

With the development of high-throughput and low-cost genotyping technologies, immense data can be cheaply and efficiently produced for various genetic studies. A typical dataset may contain hundreds of samples with millions of genotypes/haplotypes. In order to prevent data analysis from becoming a bottleneck, there is an evident need for fast and efficient analysis methods. My thesis focuses on two interesting and important genetic analyzing problems. Genome-wide Association mapping. The goal of genome wide association mapping is to identify genes or narrow regions in the genome which have significant statistical correlations to the given phenotypes. The discovery of these genes offers the potential for increased understanding of biological processes affecting phenotypes such as body weight and blood pressure. Sample selection for maximal Genetic Diversity. Given a large set of samples, it is usually more efficient to first conduct experiments on a small subset. Then the following question arises: What subset to use? There are many experimental scenarios where the ultimate objective is to maintain, or at least maximize, the genetic diversity within relatively small breeding populations. In my thesis, I developed the following efficient and effective algorithms to address these problems. Phylogeny-based Genom-wide association mapping: TreeQA: The algorithm uses local perfect phylogeny tree in genome wide analysis for genotype/phenotype association mapping. Samples are partitioned according to the sub-trees they belong to. The association between a tree and the phenotype is measured by some statistic tests. TreeQA+: TreeQA+ inherits all the advantages of TreeQA. Moreover, it improves TreeQA by incorporating sample correlations into the association study. Sample selection for maximal genetic diversity: Sample Selection in biallelic SNP Data: Samples are selected based on their genetic diversity among a set of SNPs. Given a set of samples, the algorithms search for the minimum subset that retains all diversity (or a high percentage of diversity). Representative Sample Selection in Non-Biallelic Data: For more general data (non-biallelic), information-theoretic measurements such as entropy and mutual information are used to measure the diversity of a sample subset. Samples are selected to maximize the original information retained.

Date of publication

August 2009

DOI

https://doi.org/10.17615/k7y0-1349

Resource type

Dissertation

Rights statement

In Copyright

Advisor

Wang, Wei

Language

English

Access right

Open access

Date uploaded

October 11, 2010

Relations

Parents:

Items

Thumbnail	Title	Date Uploaded	Visibility	Actions
	Efficient algorithms in analyzing genomic data	2019-04-10	Public	Download

Efficient algorithms in analyzing genomic data

Downloadable Content

Relations

Items