Extracting information from deep learning models for computational biology

Last Modified
  • March 20, 2019
Creator
  • Gao, Tianxiang
    • Affiliation: College of Arts and Sciences, Department of Computer Science
Abstract
  • The advances in deep learning technologies over the past decade have provided powerful tools for many machine learning tasks. In contrast to traditional linear models, deep learning models can learn nonlinear functions and high-order features, which enables exceptional performance. In computational biology, the rapid growth in the scale and complexity of data increases the demand for powerful deep-learning-based tools. Despite the success of deep learning methods, understanding why they are effective and how to interpret the trained models remains elusive. This dissertation provides several approaches for extracting information from deep models; this information can be used to address the problems of model complexity and model interpretability.

    The amount of data needed to train a model depends on the model's complexity. Because generating data in biology is typically expensive, collecting datasets on a scale comparable to other deep learning application areas, such as computer vision and speech recognition, is prohibitively costly, and datasets are consequently small. Training a highly complex model on a small dataset can result in overfitting: the model over-explains the observed data and predicts poorly on unobserved data. The number of parameters is often taken as a measure of a model's complexity. Yet deep learning models usually have thousands to millions of parameters and are still capable of yielding meaningful results and avoiding overfitting, even on modest datasets. To explain this phenomenon, I propose a method to estimate the degrees of freedom -- a proper estimate of the complexity -- of deep learning models. My results show that the actual complexity of a deep learning model is much smaller than its number of parameters. Using this measure of complexity, I propose a new model selection score that obviates the need for cross-validation.

    Another concern for deep learning models is the ability to extract comprehensible knowledge from them. In a linear model, the coefficient of an input variable represents that variable's influence on the prediction; in a deep neural network, the relationship between input and output is far more complex. In biological and medical applications, this lack of interpretability prevents deep neural networks from becoming a source of new scientific knowledge. To address this problem, I provide 1) a framework for selecting hypotheses about perturbations that lead to the largest phenotypic change, and 2) a novel autoencoder with guided training that selects a representation of a biological system informative of a target phenotype. Case studies in computational biology illustrate the effectiveness of both methods.
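    The degrees-of-freedom idea in the abstract can be grounded in Efron's covariance identity, df = (1/σ²) Σᵢ Cov(ŷᵢ, yᵢ). Below is a minimal Monte Carlo sketch of that estimator for a regression model; the `fit_predict` callable, the Gaussian noise scale, and the sample count are illustrative assumptions, not the dissertation's exact procedure.

    ```python
    import numpy as np

    def estimate_df(fit_predict, X, y, sigma=1.0, n_samples=20, seed=None):
        """Monte Carlo estimate of effective degrees of freedom via
        Efron's identity df = (1/sigma^2) * sum_i Cov(yhat_i, y_i).
        `fit_predict(X, y)` is a hypothetical callable that trains the
        model on (X, y) and returns its predictions on X."""
        rng = np.random.default_rng(seed)
        n = len(y)
        targets = np.empty((n_samples, n))
        preds = np.empty((n_samples, n))
        for t in range(n_samples):
            # perturb the targets with Gaussian noise and refit
            targets[t] = y + sigma * rng.standard_normal(n)
            preds[t] = fit_predict(X, targets[t])
        # per-coordinate sample covariance between target and prediction
        cov = ((targets - targets.mean(0)) *
               (preds - preds.mean(0))).sum(0) / (n_samples - 1)
        return cov.sum() / sigma**2
    ```

    With such an estimate in hand, one could form an AIC-style selection score, e.g. training MSE + 2·df·σ²/n, to compare architectures without cross-validation; the score actually proposed in the dissertation may be defined differently.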
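    The guided-training autoencoder can likewise be sketched as a joint objective: the latent code must both reconstruct the input and predict the target phenotype. The PyTorch sketch below is hypothetical; the layer widths, the linear phenotype head, and the weighting `lam` are assumptions, not the dissertation's architecture.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GuidedAutoencoder(nn.Module):
        """Autoencoder whose latent code is also trained to predict a
        phenotype, guiding the representation toward phenotype-relevant
        structure."""
        def __init__(self, n_features, n_latent, n_phenotypes):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(n_features, 128), nn.ReLU(),
                nn.Linear(128, n_latent))
            self.decoder = nn.Sequential(
                nn.Linear(n_latent, 128), nn.ReLU(),
                nn.Linear(128, n_features))
            self.head = nn.Linear(n_latent, n_phenotypes)

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), self.head(z)

    def guided_loss(x, y, x_hat, y_hat, lam=1.0):
        # reconstruction keeps the code faithful to the input;
        # the phenotype term supplies the "guidance"
        return F.mse_loss(x_hat, x) + lam * F.mse_loss(y_hat, y)
    ```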
Rights statement
  • In Copyright
Advisor
  • Dangl, Jeffery L.
  • Bansal, Mohit
  • Niethammer, Marc
  • Jojic, Vladimir
  • McMillan, Leonard
Degree
  • Doctor of Philosophy
Degree granting institution
  • University of North Carolina at Chapel Hill Graduate School
Graduation year
  • 2017