Detection of Low Rank Signals in Noise and Fast Correlation Mining with Applications to Large Biological Data

Shabalin, Andrey A.

Last Modified

March 20, 2019

Creator

Shabalin, Andrey A.
- Affiliation: College of Arts and Sciences, Department of Statistics and Operations Research

Abstract

Ongoing technological advances in high-throughput measurement have given biomedical researchers access to a wealth of genomic information. The increasing size and dimensionality of the resulting data sets require new modes of analysis. In this thesis we propose, analyze and validate several new methods for the analysis of biomedical data. We seek methods that are at once biologically relevant, computationally efficient, and statistically sound. The thesis is composed of two parts.
The first part concerns the problem of reconstructing a low-rank signal matrix observed in the presence of noise. In Chapter 1 we consider the general reconstruction problem, with no restrictions on the low-rank signal. We establish a connection with the singular value decomposition. This connection and recent results in random matrix theory are used to develop a new denoising scheme that outperforms existing methods on a wide range of simulated matrices. Chapter 2 is devoted to a data mining tool that searches for low-rank signals equal to a sum of raised submatrices. The method, called LAS, searches for large average submatrices, also called biclusters, using an iterative search procedure that seeks to maximize a statistically motivated score function. We perform extensive validation of LAS and other biclustering methods on real datasets and assess the biological relevance of their findings.
The second part of the thesis considers the joint analysis of two biological datasets. In Chapter 3 we address the problem of finding associations between single nucleotide polymorphisms (SNPs) and gene expression. The huge number of possible associations requires careful attention to issues of computational efficiency and multiple comparisons. We propose a new method, called FastMap, that exploits the discreteness of SNPs and uses a permutation approach to account for multiple comparisons. In Chapter 4 we describe a method for combining gene expression data produced from different measurement platforms. The method, called XPN, estimates and removes the systematic differences between datasets by fitting a simple block-linear model to the available data. The method is validated on real gene expression data. The methods described in Chapters 2-4 have been implemented and are publicly available online.

Date of publication

August 2010

DOI

https://doi.org/10.17615/bvsk-z309

Resource type

Dissertation

Rights statement

In Copyright

Note

"... in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Statistics and Operations Research (Statistics)."

Advisor

Nobel, Andrew

Language

English

Publisher

University of North Carolina at Chapel Hill

Place of publication

Chapel Hill, NC

Access right

Open access

Date uploaded

March 18, 2013

Items

Thumbnail	Title	Date Uploaded	Visibility	Actions
		2019-04-10	Public	Download
