Novel Cheminformatics Methods for Modeling Biomolecular Data in High Dimension Low Sample Size (HDLSS) Chemistry Space

Last Modified
  • March 22, 2019
  • Wu, Tong-Yng
    • Affiliation: School of Medicine, UNC/NCSU Joint Department of Biomedical Engineering
  • The increasing availability of biological and chemical data has led to a critical need for cheminformatics and bioinformatics tools to analyze the data. However, not all datasets contain sufficient information for meaningful analysis. One such problem is the High Dimension Low Sample Size (HDLSS) setting, in which the number of structural characteristics (e.g., molecular descriptors) that can be calculated for a single compound (high dimensionality) far exceeds the number of compounds (low sample size). A major challenge in modeling HDLSS data is overfitting, and specialized tools are required to overcome the statistical difficulties inherent to HDLSS. We improved the Distance Weighted Discrimination (DWD) statistical learning method with a new variable selection technique for estimating the intrinsic dimension of a dataset, i.e., the minimum number of descriptors needed to classify the data. Compared to SVM and to DWD without variable selection, DWD with variable selection achieved higher prediction accuracy on several benchmark datasets and allowed the identification of key molecular features that contribute to the investigated biological properties, e.g., inhibition of AmpC β-lactamase and binding affinity for various serotonin receptors. For analyzing and modeling stereochemistry-dependent datasets, we developed chiral atom-pair descriptors (3D chiral atom-pair), which are calculated from three-dimensional molecular structures. QSAR models built with these descriptors predicted antimalarial activity and the binding affinities of small molecules toward various receptors more accurately than models built with either 3D non-chiral atom-pair or 2D Dragon descriptors. Our method not only led to the identification of a subset of chiral atoms that are expected to affect the selected biological property, e.g., antimalarial activity, but also enabled the development of models that would not be possible otherwise.
To aid automatic protein function annotation, especially in the case of functional homologs, we developed new protein descriptors based solely on protein structure. Our method showed sensitivity comparable to that of ScanPROSITE. When predicted annotations from both ScanPROSITE and our method were combined into a consensus model, we observed a significant gain in prediction reliability and the successful functional annotation of proteins with low sequence similarity.
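The overfitting risk that the abstract attributes to HDLSS data can be illustrated with a small sketch. It is not the DWD or SVM procedure from the thesis: it swaps in a plain perceptron, and the data are synthetic random numbers with coin-flip labels, so there is no real signal to learn. All names and sizes (make_data, train_perceptron, 20 samples, 500 descriptors) are illustrative assumptions. With far more descriptors than samples, the classifier can still separate the training set perfectly, which is why training accuracy alone says little in this setting.

```python
import random

def make_data(n_samples, n_features, rng):
    """Random Gaussian 'descriptors' with coin-flip labels (no real signal)."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [rng.choice([-1, 1]) for _ in range(n_samples)]
    return X, y

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_perceptron(X, y, max_epochs=1000):
    """Plain perceptron without a bias term; stops after an error-free pass."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) <= 0:  # misclassified or on the boundary
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def accuracy(w, X, y):
    correct = sum(1 for xi, yi in zip(X, y) if yi * dot(w, xi) > 0)
    return correct / len(y)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical HDLSS setting: 20 "compounds", 500 "descriptors".
X_train, y_train = make_data(20, 500, rng)
X_test, y_test = make_data(200, 500, rng)

w = train_perceptron(X_train, y_train)
train_acc = accuracy(w, X_train, y_train)
test_acc = accuracy(w, X_test, y_test)
print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {test_acc:.2f}")
```

Because the labels are random, held-out accuracy should sit near chance even though the training set is fit without error. Variable selection, as described in the abstract, is one way to reduce the number of descriptors the classifier can exploit in this manner.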
Date of publication
Resource type
Rights statement
  • In Copyright
  • Lalush, David Scott
  • Doctor of Philosophy
Graduation year
  • 2012
