Mining emerging massive scientific sequence data using block-wise decomposition methods

Last Modified
  • March 21, 2019
  • Zhang, Qi
    • Affiliation: College of Arts and Sciences, Department of Computer Science
  • I present efficient data mining algorithms for knowledge discovery on two types of emerging large-scale sequence-based scientific datasets: 1) static sequence data generated from SNP diversity arrays for genomic studies, and 2) dynamic sequence data collected in streaming and sensor network systems for environmental studies. The massive, noisy nature of the SNP arrays and the distributed, online nature of sensor network data pose challenges for knowledge discovery, including scalability, robustness, and efficiency. Despite their different characteristics, both SNP arrays and streaming sensor data can be viewed as sequences of ordered observations and mined efficiently using algorithms based on block-wise decomposition methods.
I present models and mining algorithms for inferring the genetic variation structure in genome-wide Single-Nucleotide Polymorphism (SNP) arrays. Genome-wide SNP arrays provide a comprehensive view of genome variation and serve as powerful resources for genetic and biomedical studies. Understanding the patterns of genetic variation in a population of individuals plays an important role in solving many genetics problems, such as genealogy reconstruction and gene association studies. In this thesis, I propose data mining models and algorithms to efficiently infer genetic variation structure from massive SNP panels of recombinant sequences resulting from meiotic recombination. I introduce the Minimum Segmentation Problem (MSP) to infer the segmentation structure of a single recombinant strain, and the Minimum Mosaic Problem (MMP) to infer the mosaic structure of a panel of recombinant strains. Both MSP and MMP estimate the ancestral polymorphism patterns exhibited in recombinant strains, providing important inputs for subsequent association analysis. I propose efficient dynamic programming and graph algorithms based on block-wise decomposition that solve MSP and MMP on large-scale, genome-wide panels.
I also present efficient algorithms for mining massive streaming and sensor network data for observational sciences such as ecology and environmental studies. These algorithms use block-wise synopsis construction to capture, online, the distribution of the dynamic sequence data collected in sensor network and streaming systems. They include clustering analysis and order-statistics computation, which are critical for real-time monitoring, anomaly detection, and other domain-specific analysis.
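To make the idea of a block-wise synopsis for order statistics concrete, the following is a minimal illustrative sketch in Python. It is not the algorithm from the thesis, which the abstract does not specify. It assumes a simple scheme in which the stream is cut into fixed-size blocks, each block is reduced to a few evenly spaced order statistics with equal weights, and quantile queries are answered from the combined weighted points. The names `block_synopsis`, `BlockQuantileSketch`, `block_size`, and `points_per_block` are hypothetical.

```python
from typing import List, Tuple


def block_synopsis(block: List[float], k: int) -> List[Tuple[float, float]]:
    """Reduce one block to at most k (value, weight) pairs.

    Assumption for illustration: each pair is the midpoint element of one
    of k equal-depth buckets of the sorted block, weighted by bucket size.
    """
    ordered = sorted(block)
    n = len(ordered)
    k = min(k, n)
    weight = n / k
    return [(ordered[int((i + 0.5) * n / k)], weight) for i in range(k)]


class BlockQuantileSketch:
    """Approximate quantiles over a stream using per-block synopses."""

    def __init__(self, block_size: int = 1000, points_per_block: int = 20):
        self.block_size = block_size
        self.points_per_block = points_per_block
        self._buffer: List[float] = []                 # current, unsummarized block
        self._summary: List[Tuple[float, float]] = []  # weighted points of past blocks

    def add(self, x: float) -> None:
        """Consume one observation; summarize the block once it is full."""
        self._buffer.append(x)
        if len(self._buffer) == self.block_size:
            self._summary.extend(
                block_synopsis(self._buffer, self.points_per_block)
            )
            self._buffer = []

    def quantile(self, q: float) -> float:
        """Return an approximate q-quantile (0 <= q <= 1) of all data seen."""
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must lie in [0, 1]")
        # Buffered items have not been summarized yet, so each counts once.
        points = sorted(self._summary + [(x, 1.0) for x in self._buffer])
        if not points:
            raise ValueError("no data seen yet")
        target = q * sum(w for _, w in points)
        running = 0.0
        for value, weight in points:
            running += weight
            if running >= target:
                return value
        return points[-1][0]
```

In this sketch the rank error contributed by each full block is bounded by half a bucket, so accuracy can be traded against memory through `points_per_block`. The summary here grows with the number of blocks; a practical streaming version would also merge or compact old block summaries to keep memory bounded.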
Date of publication
Resource type
Rights statement
  • In Copyright
  • Wang, Wei
Degree granting institution
  • University of North Carolina at Chapel Hill
  • Open access
