Event log for object d591f2cd-3da7-4b31-9dd8-ee27dcb6a3ee:
ingest (cdrApp, 2018-03-15T16:02:11.114Z)
modifyDatastreamByValue RELS-EXT (fedoraAdmin, 2018-03-15T16:03:00.313Z): Setting exclusive relation
addDatastream MD_TECHNICAL (fedoraAdmin, 2018-03-15T16:03:11.502Z): Adding technical metadata derived by FITS
addDatastream MD_FULL_TEXT (fedoraAdmin, 2018-03-15T16:03:35.159Z): Adding full text metadata extracted by Apache Tika
modifyDatastreamByValue RELS-EXT (fedoraAdmin, 2018-03-15T16:03:56.981Z): Setting exclusive relation
modifyDatastreamByValue MD_DESCRIPTIVE (cdrApp, 2018-05-17T18:49:59.964Z)
modifyDatastreamByValue MD_DESCRIPTIVE (cdrApp, 2018-07-11T05:40:01.677Z)
modifyDatastreamByValue MD_DESCRIPTIVE (cdrApp, 2018-07-18T01:54:14.303Z)
modifyDatastreamByValue MD_DESCRIPTIVE (cdrApp, 2018-08-16T15:05:41.943Z)
modifyDatastreamByValue MD_DESCRIPTIVE (cdrApp, 2018-09-27T01:39:10.052Z)
modifyDatastreamByValue MD_DESCRIPTIVE (cdrApp, 2018-10-12T02:07:40.477Z)
modifyDatastreamByValue MD_DESCRIPTIVE (cdrApp, 2019-03-20T20:25:49.178Z)

Descriptive record:
Creator: Tianxiang Gao (Department of Computer Science, College of Arts and Sciences)
Title: Extracting information from deep learning models for computational biology
Date issued: 2017-12 (Winter 2017)
Resource type: Dissertation (text)
Degree: Doctor of Philosophy
Degree granting institution: University of North Carolina at Chapel Hill Graduate School
Degree program: Computer Science
Thesis advisors: Vladimir Jojic; Jeffery L. Dangl; Leonard McMillan; Marc Niethammer; Mohit Bansal
Subject: Computer science
Keywords: deep learning; hypotheses generation; informative representation; interpretable learning; machine learning; model complexity
Language: eng

Abstract:
The advances in deep learning technologies in this decade are providing powerful tools for many machine learning tasks. Deep learning models, in contrast to traditional linear models, can learn nonlinear functions and high-order features, enabling exceptional performance. In computational biology, the rapid growth of data scale and complexity increases the demand for powerful deep-learning-based tools. Despite the success of deep learning methods, an understanding of the reasons for their effectiveness and an interpretation of the trained models remain elusive. This dissertation provides several approaches to extracting information from deep models; this information can be used to address the problems of model complexity and model interpretability.

The amount of data needed to train a model depends on the complexity of the model. The cost of generating data in biology is typically high, so collecting data on a scale comparable to other deep learning application areas, such as computer vision and speech understanding, is prohibitively expensive, and datasets are consequently small. Training high-complexity models on small datasets can result in overfitting: the model over-explains the observed data and predicts unobserved data poorly. The number of parameters is often taken as a measure of a model's complexity. However, deep learning models usually have thousands to millions of parameters, yet they are still capable of yielding meaningful results and avoiding overfitting even on modest datasets. To explain this phenomenon, I propose a method to estimate the degrees of freedom, a more appropriate measure of complexity, of deep learning models. My results show that the actual complexity of a deep learning model is much smaller than its number of parameters. Using this measure of complexity, I propose a new model selection score that obviates the need for cross-validation.

Another concern with deep learning models is the difficulty of extracting comprehensible knowledge from them. In linear models, the coefficient of an input variable represents that variable's influence on the prediction; in a deep neural network, the relationship between input and output is far more complex. In biological and medical applications, this lack of interpretability prevents deep neural networks from becoming a source of new scientific knowledge. To address this problem, I provide 1) a framework for selecting hypotheses about perturbations that lead to the largest phenotypic change, and 2) a novel auto-encoder with guided training that selects a representation of a biological system informative of a target phenotype. Case studies in computational biology illustrate the success of both methods.
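The abstract's central quantitative idea, that a network's effective complexity is better measured by its degrees of freedom than by its parameter count, can be illustrated with a short sketch. The code below is a minimal Monte Carlo estimate in the spirit of generalized degrees of freedom (the sensitivity of fitted values to perturbations of the training targets), paired with a Cp/AIC-style selection score. It is an illustrative assumption, not the dissertation's estimator, and fit_and_predict is a hypothetical callable standing in for any training procedure (for example, a small neural-network regressor).

```python
import numpy as np

def estimate_dof(fit_and_predict, X, y, noise_sd=0.1, n_draws=20, seed=0):
    """Monte Carlo estimate of effective degrees of freedom.

    dof ~= (1 / sigma^2) * sum_i Cov(yhat_i, ytilde_i), where ytilde is the
    training target perturbed with Gaussian noise of scale noise_sd and yhat
    is the refitted model's prediction on the training inputs.
    fit_and_predict(X, y) is assumed to train a fresh model on (X, y) and
    return its predictions on X.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    ytil = np.empty((n_draws, n))
    yhat = np.empty((n_draws, n))
    for s in range(n_draws):
        ytil[s] = y + rng.normal(scale=noise_sd, size=n)  # perturb the targets
        yhat[s] = fit_and_predict(X, ytil[s])             # refit and predict
    # Per-example sample covariance between perturbed targets and predictions.
    cov = ((ytil - ytil.mean(0)) * (yhat - yhat.mean(0))).sum(0) / (n_draws - 1)
    return float(cov.sum() / noise_sd**2)

def selection_score(train_mse, dof, n, sigma2):
    """Cp/AIC-style score: training error plus a complexity penalty, so that
    candidate models can be compared without cross-validation."""
    return train_mse + 2.0 * sigma2 * dof / n
```

For a linear model fit by least squares this estimate recovers the number of regression coefficients, which is what makes it a plausible yardstick for the much smaller effective complexity reported for deep networks.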
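The framework for selecting perturbation hypotheses is described only at a high level in the abstract. One hedged way to picture it, not necessarily the dissertation's actual method, is a gradient-based search over the inputs of a trained differentiable model: starting from an observed profile, follow the gradient of the predicted phenotype to find a norm-bounded input perturbation that increases the prediction the most. The function name, step size, and L2 budget below are illustrative assumptions.

```python
import torch

def largest_effect_perturbation(model, x0, step=0.01, n_steps=100, budget=1.0):
    """Gradient-ascent search for an input perturbation that maximizes the
    predicted phenotype of a trained differentiable model, subject to an
    L2 budget on the perturbation (an illustrative constraint)."""
    x0 = x0.detach()
    baseline = model(x0).detach()
    delta = torch.zeros_like(x0, requires_grad=True)
    for _ in range(n_steps):
        change = (model(x0 + delta) - baseline).sum()  # predicted phenotypic change
        change.backward()
        with torch.no_grad():
            delta += step * delta.grad   # ascend the predicted change
            delta.grad.zero_()
            norm = delta.norm()
            if norm > budget:            # project back onto the L2 ball
                delta *= budget / norm
    with torch.no_grad():
        return delta.detach(), model(x0 + delta) - baseline
```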
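The guided auto-encoder can likewise be sketched: the latent code is trained both to reconstruct the input and to predict the target phenotype, so the learned representation stays informative of that phenotype. The layer sizes, the single linear phenotype head, and the weight alpha below are illustrative assumptions rather than the architecture used in the dissertation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAutoencoder(nn.Module):
    """Auto-encoder whose training is guided by a target phenotype."""
    def __init__(self, n_features, n_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(),
                                     nn.Linear(128, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 128), nn.ReLU(),
                                     nn.Linear(128, n_features))
        self.phenotype_head = nn.Linear(n_latent, 1)  # predicts the phenotype from the code

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.phenotype_head(z).squeeze(-1)

def guided_loss(x, y, x_hat, y_hat, alpha=0.5):
    # Reconstruction keeps the code faithful to the data; the supervised term
    # keeps the code informative of the phenotype y.
    return F.mse_loss(x_hat, x) + alpha * F.mse_loss(y_hat, y)
```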
File information:
Filename: Gao_unc_0153D_17407.pdf
Identifier: uuid:800e9020-fb59-4040-a6e2-2a1b7be5e3ef
Dates: 2019-12-31T00:00:00; 2017-12-11T03:52:20Z
Source: proquest
MIME type: application/pdf
File size: 2264167