New statistical learning methods for chemical toxicity data analysis

Last Modified
  • March 22, 2019
  • Kang, Chae Ryon
    • Affiliation: Gillings School of Global Public Health, Department of Biostatistics
  • In the first part of the dissertation, we introduce the change-line classification and regression method for studying latent subgroups. The method finds a line that optimally divides the feature space into two heterogeneous subgroups, in each of which the response follows a different probability distribution or a different regression model. The procedure is useful for classifying biochemicals on the basis of toxicity, where the feature space consists of chemical descriptors and the response is toxicity activity; the goal in this setting is to identify subgroups of chemicals with different toxicity profiles. The split-line algorithm is used to reduce computational complexity. A two-step estimation procedure, implemented with either least squares or maximum likelihood, is described. Two sets of simulation studies and an application to rat acute toxicity data demonstrate the utility of the method. In the second part, we study the asymptotic properties of the change-line regression model, including consistency and rates of convergence of M-estimators, using empirical process techniques. We prove that the estimators of the regression parameters are square-root-n consistent, while the estimators of the change-line parameters are n-consistent. In the last part, we introduce the Interactive Decision Committee method for classification when high-dimensional feature variables are grouped into feature categories. The method uses the interactive relationships among feature categories to build base classifiers, which are then combined using decision committees. It is useful for classifying biochemicals on the basis of toxicity activity, where the feature space consists of chemical descriptors, each belonging to at least one feature category, and the responses are binary indicators of toxicity activity. Support vector machines, random forests, and tree-based AdaBoost are used as classifier inducers. To combine the base classifiers, we use a voting method with forward selection, with the number of base classifiers chosen by 5-fold cross-validation, and stacked generalization with two different learning algorithms. We applied the method to two chemical toxicity data sets. For these data sets, it improved average prediction accuracy compared with a single classifier.
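To make the change-line regression idea in the abstract concrete, here is a minimal sketch on simulated two-dimensional data. It is not the dissertation's split-line algorithm or its two-step estimator. As a stand-in, it assumes a brute-force profile least-squares search: for each candidate line, fit a separate linear regression on each side and keep the line with the smallest total residual sum of squares. The function names (`fit_side`, `fit_change_line`), the angle/offset parameterization of the line, the grid sizes, and the simulated data are all illustrative assumptions.

```python
import numpy as np


def fit_side(X, y):
    """Ordinary least squares with an intercept.

    Returns the coefficient vector and the residual sum of squares.
    """
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return beta, float(resid @ resid)


def fit_change_line(X, y, n_angles=60, n_offsets=40, min_size=10):
    """Grid search over lines {x : cos(theta)*x1 + sin(theta)*x2 = c}.

    Assumption: this exhaustive grid replaces the split-line algorithm,
    which the abstract names but does not specify.
    """
    best = None
    for theta in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        direction = np.array([np.cos(theta), np.sin(theta)])
        proj = X @ direction
        # Candidate offsets are empirical quantiles of the projections.
        for c in np.quantile(proj, np.linspace(0.05, 0.95, n_offsets)):
            left = proj <= c
            if left.sum() < min_size or (~left).sum() < min_size:
                continue
            beta_left, rss_left = fit_side(X[left], y[left])
            beta_right, rss_right = fit_side(X[~left], y[~left])
            rss = rss_left + rss_right
            if best is None or rss < best["rss"]:
                best = {
                    "theta": float(theta),
                    "offset": float(c),
                    "beta_left": beta_left,
                    "beta_right": beta_right,
                    "rss": rss,
                    "left": left,
                }
    return best


# Simulated data (hypothetical): the true change line is x1 + x2 = 0.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
true_left = X[:, 0] + X[:, 1] <= 0.0
y = np.where(true_left, 1.0 + 2.0 * X[:, 0], -1.0 - 2.0 * X[:, 0] + X[:, 1])
y = y + rng.normal(scale=0.1, size=len(y))

fit = fit_change_line(X, y)
# Share of points assigned to the same side as in the simulation.
agreement = float(np.mean(fit["left"] == true_left))
print("estimated angle:", fit["theta"], "offset:", fit["offset"])
print("subgroup agreement:", agreement)
```

The grid search costs one pair of least-squares fits per candidate line, which grows quickly with the number of descriptors. That cost is presumably what a dedicated search procedure such as the split-line algorithm is meant to reduce.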
Date of publication
Resource type
Rights statement
  • In Copyright
  • Kosorok, Michael
  • Doctor of Philosophy
Graduation year
  • 2011
