The Effect of Data Curation on the Accuracy of Quantitative Structure-Activity Relationship Models

Fant, Andrew

Last Modified

March 19, 2019

Creator

Fant, Andrew
- Affiliation: Eshelman School of Pharmacy, Division of Chemical Biology and Medicinal Chemistry

Abstract

In the 33 years since the first public release of GenBank, and the 15 years since the publication of the first pilot assembly of the human genome, drug discovery has been awash in a tsunami of data. Only within the past decade, however, have medicinal chemists and chemical biologists had access to the same sorts of large-scale, public-access databases that bioinformaticians and molecular biologists have long had. The release of this data has sparked a renewed interest in computational methods for rational drug design, but questions have recently arisen about its accuracy and quality. The same questions have arisen in other scientific disciplines, but they have particular urgency for practitioners of Quantitative Structure-Activity Relationship (QSAR) modeling. By its nature, QSAR modeling depends on both activity data and chemical structures. While activities are usually expressed as numerical scalar values, a form ubiquitous throughout the sciences, chemical structures (especially those that must be interpretable as such by computer software) are stored in a variety of specialized formats that are much less common and mostly ignored outside of cheminformatics and related fields. While previous research has determined that a 5% error rate in data being used for modeling can cause a QSAR model to be non-predictive and useless for its intended purpose, and workflows have been proposed which reduce the effect of inconsistent chemical structure representations on model accuracy, a fundamental question remains: “how accurate are the structure and activity data freely available to researchers?” To this end, we have undertaken two surveys of data quality, one focusing on chemical structure information in Internet resources and a second examining the uncertainty associated with compounds reported in the medicinal chemistry literature as abstracted in ChEMBL.
The results of these studies have informed the creation of an improved workflow for the curation of structure-activity data. The workflow is intended to identify problematic data points in raw data extracted from databases so that an expert human curator can examine the underlying literature and resolve discrepancies between reported values. This workflow was in turn applied to the creation of two QSAR models that were used to implement a virtual screen seeking molecules capable of binding to both the serotonergic reuptake transporter and the alpha2a adrenergic receptor. While no suitable compounds were identified in the initial screening process, regions of chemical space that may yield truly novel alpha2a adrenergic receptor ligands have been identified. These regions can be targeted in future efforts. Basing data curation workflows on manual processes by human curators is not particularly viable, as humans have a tendency to introduce errors through inattention even as they identify and repair other problems. Computers alone cannot effectively curate data either. While they are highly accurate when programmed properly, they lack the human creativity and insight that would allow them to determine which data points represent truly inaccurate information. In order to effectively curate data, humans and computers must both be incorporated into a workflow that harnesses their strengths and limits their liabilities.

Date of publication

December 2015

DOI

https://doi.org/10.17615/5m3a-qf34

Identifier

Fant_unc_0153D_15809.pdf

Resource type

Dissertation

Rights statement

In Copyright

Advisor

Singleton, Scott
Elston, Timothy
Tropsha, Alexander
Rusyn, Ivan
Lee, Andrew

Degree

Doctor of Philosophy

Degree granting institution

University of North Carolina at Chapel Hill Graduate School

Graduation year

2015

Language

English

Publisher

University of North Carolina at Chapel Hill Graduate School

Place of publication

Chapel Hill, NC

Access right

There are no restrictions to this item.

Date uploaded

January 21, 2016

Items

Title	Date Uploaded	Visibility
Fant_unc_0153D_15809.pdf	2019-04-15	Public
