Collections > Electronic Theses and Dissertations > Correcting Reference Bias in High-throughput Sequencing Analysis

Mapping reads to a reference sequence is a common step when analyzing high throughput sequencing data. The choice of reference is critical because its effect on quantitative sequence analysis is non-negligible. Recent studies suggest aligning to a single standard reference sequence, as is common practice, can lead to an underlying bias depending the genetic distances of the target sequences from the reference. To avoid this bias researchers have resorted to using modified reference sequences. Even with this improvement, various limitations and problems remain unsolved, which include reduced mapping ratios, shifts in read mappings, and the selection of which variants to include to remove biases. To address these issues, I proposed novel and generic pipelines that integrate the genomic variations from known or suspected founders into reference sequences and then perform read alignment. Experiments show that my pipelines can align more reads with much lower reference bias than the traditional pipeline where reads are mapped against the standard reference sequence. They can be applied to a wide range of organisms, including inbreds, F1s, and outbreds, and various high throughput sequencing approaches, such as RNAseq, DNAseq, ChiPseq, etc.