A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection

Wang, WeiBo; Sun, Wei; Wang, Wei; Szatkiewicz, Jin

Creator

Wang, WeiBo
- Affiliation: College of Arts and Sciences, Department of Computer Science
Sun, Wei
- Other Affiliation: Biostatistics Program, Fred Hutchinson Cancer Research Center
Wang, Wei
- Other Affiliation: Department of Computer Science, University of California, Los Angeles
Szatkiewicz, Jin
- Affiliation: School of Medicine, Department of Genetics

Abstract

Background: The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows, and windowed read-count data are generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging the ability of this combination to perform bias correction and signal detection simultaneously. However, although statistically powerful, the GLM+NB method has quadratic computational complexity and therefore runs slowly when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate the utility of our approach in detecting copy number variants (CNVs) using a real example.

Results: We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs. We named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection.

Conclusions: Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving their analytic power.
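The idea of estimating GLM+NB coefficients from a sample of the windows can be illustrated with a small sketch. This is not the authors' RGE or GENSENG code: the uniform sampling scheme, the single covariate, the fixed dispersion `phi`, and every function name below are assumptions made for illustration only. It fits a log-link NB regression by iteratively reweighted least squares on all simulated windows and again on a random subset, so the two coefficient estimates can be compared.

```python
import math
import random


def poisson(lam, rng):
    """Knuth's multiplication sampler; adequate for the small means used here."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_windows(n, b0, b1, phi, rng):
    """Simulate (covariate, read count) pairs for n windows.

    Counts are NB with mean exp(b0 + b1 * x) and variance mu + phi * mu**2,
    drawn as a gamma-Poisson mixture. The covariate x is a stand-in for a
    bias term such as centred GC content (hypothetical).
    """
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        mu = math.exp(b0 + b1 * x)
        lam = rng.gammavariate(1.0 / phi, phi * mu)  # shape, scale: mean is mu
        data.append((x, poisson(lam, rng)))
    return data


def fit_nb_glm(data, phi, iters=25):
    """Fit a log-link NB GLM with known dispersion phi by IRLS.

    Assumes at least one non-zero count so that log(mean) is defined.
    """
    b0 = math.log(sum(y for _, y in data) / len(data))
    b1 = 0.0
    for _ in range(iters):
        sw = swx = swxx = swz = swxz = 0.0
        for x, y in data:
            eta = b0 + b1 * x
            mu = math.exp(eta)
            w = mu / (1.0 + phi * mu)   # IRLS weight for NB with log link
            z = eta + (y - mu) / mu     # working response
            sw += w
            swx += w * x
            swxx += w * x * x
            swz += w * z
            swxz += w * x * z
        # Solve the 2x2 weighted least-squares normal equations directly.
        det = sw * swxx - swx * swx
        b0, b1 = (swxx * swz - swx * swxz) / det, (sw * swxz - swx * swz) / det
    return b0, b1


def randomized_fit(data, phi, fraction, rng):
    """Estimate the coefficients from a uniform random subset of the windows."""
    m = max(2, int(fraction * len(data)))
    return fit_nb_glm(rng.sample(data, m), phi)


rng = random.Random(0)
windows = simulate_windows(5000, 3.0, 0.5, 0.1, rng)
print("all windows :", fit_nb_glm(windows, 0.1))
print("10% sample  :", randomized_fit(windows, 0.1, 0.1, rng))
```

The subsample fit touches a tenth of the windows per iteration, which is where the saving would come from in this toy setting; the sampling design and variance analysis of the actual estimator are given in the article itself.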

Date of publication

March 1, 2018

DOI

https://doi.org/10.17615/nc6s-fb47

Identifier

https://doi.org/10.1186/s12859-018-2077-6

Resource type

Article

Rights statement

In Copyright

Rights holder

The Author(s)

Journal title

BMC Bioinformatics

Journal volume

19

Journal issue

1

Page start

74

Language

English

Bibliographic citation

BMC Bioinformatics. 2018 Mar 01;19(1):74

Publisher

BioMed Central

Items

Title	Date Uploaded	Visibility
12859_2018_article_2077.pdf	2019-05-06	Public
12859_2018_2077_moesm1_esm.pdf	2019-05-06	Public
