The Effect of Data Curation on the Accuracy of Quantitative Structure-Activity Relationship Models

In the 33 years since the first public release of GenBank, and the 15 years since the publication of the first pilot assembly of the human genome, drug discovery has been awash in a tsunami of data. But it has only been within the past decade that medicinal chemists and chemical biologists have had access to the same sorts of large-scale, public-access databases that bioinformaticians and molecular biologists have had for so long. The release of this data has sparked a renewed interest in computational methods for rational drug design, but questions have arisen recently about the accuracy and quality of this data. The same question has arisen in other scientific disciplines, but it has a particular urgency for practitioners of Quantitative Structure-Activity Relationship (QSAR) modeling. By its nature, QSAR modeling depends on both activity data and chemical structures. While activities are usually expressed as numerical scalar values, a form ubiquitous throughout the sciences, chemical structures (especially those that must be interpretable as such by computer software) are stored in a variety of specialized formats that are much less common and mostly ignored outside of cheminformatics and related fields. Previous research has determined that a 5% error rate in data being used for modeling can cause a QSAR model to be non-predictive and useless for its intended purpose, and workflows have been proposed that reduce the effect of inconsistent chemical structure representations on model accuracy. Nevertheless, a fundamental question remains: “how accurate are the structure and activity data freely available to researchers?” To this end, we have undertaken two surveys of data quality, one focusing on chemical structure information in Internet resources and a second examining the uncertainty associated with compounds reported in the medicinal chemistry literature as abstracted in ChEMBL.
The results of these studies have informed the creation of an improved workflow for the curation of structure-activity data. The workflow is intended to identify problematic data points in raw data extracted from databases so that an expert human curator can examine the underlying literature and resolve discrepancies between reported values. This workflow was in turn applied to the creation of two QSAR models that were used to implement a virtual screen seeking molecules capable of binding to both the serotonergic reuptake transporter and the alpha2a adrenergic receptor. While no suitable compounds were identified in the initial screening process, regions of chemical space that may yield truly novel alpha2a receptor ligands have been identified. These regions can be targeted in future efforts. Basing data curation workflows on manual processes by human curators is not particularly viable, as humans have a tendency to introduce errors through inattention even as they identify and repair other problems. Computers cannot effectively curate data on their own either. While they are highly accurate when programmed properly, they lack the human creativity and insight that would allow them to determine which data points represent truly inaccurate information. In order to curate data effectively, humans and computers must both be incorporated into a workflow that harnesses their strengths and limits their liabilities.
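
The division of labor described above, in which software flags suspect data points and a human curator resolves them, might be sketched roughly as follows. This is only an illustrative sketch and not the workflow developed in this work: the function name `flag_discrepant_activities`, the record layout, and the threshold of 1.0 log unit are all assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import mean


def flag_discrepant_activities(records, max_spread=1.0):
    """Split activity records into consistent and flagged groups.

    records: iterable of (compound_id, target_id, value) tuples, where
    value is a log-scale activity such as pKi (hypothetical layout).
    max_spread: largest tolerated difference between the highest and
    lowest reported value for one compound-target pair (an assumed
    threshold, not one taken from the text).
    """
    # Collect every reported value for each compound-target pair.
    grouped = defaultdict(list)
    for compound_id, target_id, value in records:
        grouped[(compound_id, target_id)].append(value)

    consistent = {}  # pairs whose values agree: store the mean
    flagged = {}     # pairs needing human review: keep all raw values
    for key, values in grouped.items():
        if max(values) - min(values) > max_spread:
            flagged[key] = sorted(values)
        else:
            consistent[key] = mean(values)
    return consistent, flagged


if __name__ == "__main__":
    # Made-up example records, for illustration only.
    example = [
        ("C1", "SERT", 7.0),
        ("C1", "SERT", 7.5),
        ("C2", "SERT", 5.0),
        ("C2", "SERT", 8.0),
        ("C3", "ADRA2A", 6.5),
    ]
    consistent, flagged = flag_discrepant_activities(example)
    print("consistent:", consistent)
    print("needs review:", flagged)
```

In this sketch the program only sorts records into agreeing and disagreeing groups. Deciding which of the conflicting values is correct is left to the curator, who can consult the underlying literature.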