Detection of Low Rank Signals in Noise and Fast Correlation Mining with Applications to Large Biological Data

Ongoing technological advances in high-throughput measurement have given biomedical researchers access to a wealth of genomic information. The increasing size and dimensionality of the resulting data sets require new modes of analysis. In this thesis we propose, analyze and validate several new methods for the analysis of biomedical data. We seek methods that are at once biologically relevant, computationally efficient, and statistically sound. The thesis is composed of two parts. The first concerns the problem of reconstructing a low-rank signal matrix observed in the presence of noise. In Chapter 1 we consider the general reconstruction problem, with no restrictions on the low-rank signal. We establish a connection with the singular value decomposition. This connection and recent results in random matrix theory are used to develop a new denoising scheme that outperforms existing methods on a wide range of simulated matrices. Chapter 2 is devoted to a data mining tool that searches for low-rank signals equal to a sum of raised submatrices. The method, called LAS, searches for large average submatrices, also called biclusters, using an iterative search procedure that seeks to maximize a statistically motivated score function. We perform extensive validation of LAS and other biclustering methods on real datasets and assess the biological relevance of their findings. The second part of the thesis considers the joint analysis of two biological datasets. In Chapter 3 we address the problem of finding associations between single nucleotide polymorphisms (SNPs) and gene expression. The huge number of possible associations requires careful attention to issues of computational efficiency and multiple comparisons. We propose a new method, called FastMap, that exploits the discreteness of SNPs and uses a permutation approach to account for multiple comparisons.
In Chapter 4 we describe a method for combining gene expression data produced from different measurement platforms. The method, called XPN, estimates and removes the systematic differences between datasets by fitting a simple block-linear model to the available data. The method is validated on real gene expression data. The methods described in Chapters 2-4 have been implemented and are publicly available online.
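
The iterative search for a large average submatrix described for Chapter 2 can be illustrated with a minimal sketch. It is not the LAS implementation: it assumes the submatrix size (k rows, l columns) is fixed in advance and uses the plain submatrix average as the objective, whereas the abstract describes a statistically motivated score function. The function name `find_large_average_submatrix` and its parameters are hypothetical.

```python
import random


def find_large_average_submatrix(matrix, k, l, n_iter=100, seed=0):
    """Alternating search for a k-by-l submatrix with a large average.

    Illustrative assumption: k and l are fixed and the objective is the
    plain average, not the score function used by LAS.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Start from a random set of l columns.
    cols = sorted(rng.sample(range(n_cols), l))
    rows = []
    for _ in range(n_iter):
        # Keep the k rows with the largest sums over the current columns.
        row_sums = [sum(matrix[i][j] for j in cols) for i in range(n_rows)]
        new_rows = sorted(sorted(range(n_rows), key=lambda i: -row_sums[i])[:k])
        # Keep the l columns with the largest sums over those rows.
        col_sums = [sum(matrix[i][j] for i in new_rows) for j in range(n_cols)]
        new_cols = sorted(sorted(range(n_cols), key=lambda j: -col_sums[j])[:l])
        # Stop once neither the row set nor the column set changes.
        if new_rows == rows and new_cols == cols:
            break
        rows, cols = new_rows, new_cols
    average = sum(matrix[i][j] for i in rows for j in cols) / (k * l)
    return rows, cols, average


# Toy data: a 6x6 zero matrix with one raised 2x2 block.
data = [[0.0] * 6 for _ in range(6)]
for i in (1, 2):
    for j in (3, 4):
        data[i][j] = 5.0

rows, cols, average = find_large_average_submatrix(data, k=2, l=2)
print(rows, cols, average)
```

Each step cannot decrease the submatrix sum, so the search settles on a local optimum; in practice a search of this kind would typically be restarted from several random column sets.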