USING THE MULTI-STRING BURROW-WHEELER TRANSFORM FOR HIGH-THROUGHPUT SEQUENCE ANALYSIS

Last Modified
  • March 20, 2019
Creator
  • Holt, James
    • Affiliation: College of Arts and Sciences, Department of Computer Science
Abstract
  • The throughput of sequencing technologies has created a bottleneck where raw sequence files are stored in an un-indexed format on disk. Alignment to a reference genome is the most common pre-processing method for indexing this data, but alignment requires a priori knowledge of a reference sequence, and often loses a significant amount of sequencing data due to biases. Sequencing data can instead be stored in a lossless, compressed, indexed format using the multi-string Burrows-Wheeler Transform (BWT). This dissertation introduces three algorithms that enable faster construction of the BWT for sequencing datasets. The first two algorithms are a merge algorithm for merging two or more BWTs into a single BWT and a merge-based divide-and-conquer algorithm that will construct a BWT from any sequencing dataset. The third algorithm is an induced sorting algorithm that constructs the BWT from any string collection and is well-suited for building BWTs of long-read sequencing datasets. These algorithms are evaluated based on their efficiency and utility in constructing BWTs of different types of sequencing data. This dissertation also introduces two applications of the BWT: long-read error correction and a set of biologically motivated sequence search tools. The long-read error correction is evaluated based on accuracy and efficiency of the correction. Our analyses show that the BWT of almost all sequencing datasets can now be efficiently constructed. Once constructed, we show that the BWT offers significant utility in performing fast searches as well as fast and accurate long-read corrections. Additionally, we highlight several use cases of the BWT-based web tools in answering biologically motivated problems.
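As a rough illustration of the indexed format the abstract describes, the sketch below builds a multi-string BWT by naively sorting every suffix of every read and then counts pattern occurrences with backward search. It is not one of the dissertation's algorithms (those are merge-based and induced-sorting constructions); the function names `multi_string_bwt` and `count_occurrences` are hypothetical, and the quadratic-space suffix sort is only practical for toy inputs.

```python
def multi_string_bwt(reads):
    """Naive multi-string BWT (illustrative only, not the dissertation's method).

    Each read gets a '$' terminator; identical suffixes from different reads
    are ordered by read index, which mimics distinct terminators $_0 < $_1 < ...
    """
    suffixes = []
    for idx, read in enumerate(reads):
        text = read + "$"
        for pos in range(len(text)):
            suffixes.append((text[pos:], idx, pos))
    suffixes.sort()
    # The BWT symbol is the character preceding each suffix, wrapping around
    # within the suffix's own read (position 0 is preceded by that read's '$').
    return "".join((reads[idx] + "$")[pos - 1] for _, idx, pos in suffixes)


def count_occurrences(bwt, pattern):
    """Count occurrences of `pattern` across all reads via backward search."""
    # first[c] = number of symbols in the BWT that sort strictly before c.
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt)  # half-open range of suffixes matching the current suffix of pattern
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # Rank queries are done by slicing here; a real index would use
        # sampled occurrence tables instead.
        lo = first[ch] + bwt[:lo].count(ch)
        hi = first[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


reads = ["ACGT", "CGTA", "ACCA"]
bwt = multi_string_bwt(reads)
hits = count_occurrences(bwt, "CG")  # k-mer count without scanning the reads
```

The search touches only the BWT string, which is why a collection stored this way can answer k-mer queries without an alignment to a reference sequence.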
Rights statement
  • In Copyright
Advisor
  • Li, Yun
  • Prins, Jan
  • McMillan, Leonard
  • Jojic, Vladimir
  • Pardo-Manuel de Villena, Fernando
Degree
  • Doctor of Philosophy
Degree granting institution
  • University of North Carolina at Chapel Hill Graduate School
Graduation year
  • 2016