A Features Analysis Tool For Assessing And Improving Computational Models In Structural Biology

O'Meara, Matthew James

Download PDF

Request Version for Screen Reader

Last Modified

March 20, 2019

Creator

O'Meara, Matthew James
- Affiliation: College of Arts and Sciences, Department of Computer Science

Abstract

The protein-folding problem is to predict, from a protein's amino acid sequence, its folded 3D conformation. State of the art computational models are complex collaboratively maintained prediction software. Like other complex software, they become brittle without support for testing and refactoring. Features analysis, a language of `scientific unit testing', is the visual and quantitative comparison of distributions of features (local geometric measures) sampled from ensembles of native and predicted conformations. To support features analysis I develop a features analysis tool--a modular database framework for extracting and managing sampled feature instance and an exploratory data analysis framework for rapidly comparing feature distributions. In supporting features analysis, the tool supports the creation, tuning, and assessment of computational models, improving protein prediction and design. I demonstrate the features analysis tool through 6 case studies with the Rosetta molecular modeling suite. The first three demonstrate the tool usage mechanics through constructing and checking models. The first evaluates bond angle restraint models when used with the Backrub local sampling heuristic. The second identifies and resolves energy function derivative discontinuities that frustrate gradient-based minimization. The third constructs a model for disulfide bonds. The second three demonstrate using the tool to evaluate and improve how models represent molecular structure. I focus on modeling H-bonds because of their geometric specificity and environmental dependence lead to complex feature distributions. The fourth case study develops a novel functional form for Sp2 acceptor H-bonds. The fifth fits parameters for a refined H-bond model. The sixth combines the refined model with an electrostatics model and harmonizes them with the rest of the energy function. Next, to facilitate assessing model improvements, I develop recovery tests that measure predictive accuracy by asking models to recover native conformations that have been partially randomized. Finally, to demonstrate that the features analysis and recovery test tools support improving protein prediction and design, I evaluated the refined H-bond model and electrostatics model with additional corrections from the Rosetta community. Based on positive results, I recommend a new standard energy function, which has been accepted by the Rosetta community as the largest systematic improvement in nearly a decade.

Date of publication