A Comparison of Machine Learning Algorithms for Chemical Toxicity Classification Using a Simulated Multi-Scale Data Model

Judson, Richard; Elloumi, Fathi; Setzer, R Woodrow; Li, Zhen; Shah, Imran

Creator

Judson, Richard
- Other Affiliation: National Center for Computational Toxicology
Elloumi, Fathi
- Other Affiliation: National Center for Computational Toxicology
Setzer, R Woodrow
- Other Affiliation: National Center for Computational Toxicology
Li, Zhen
- Affiliation: Gillings School of Global Public Health, Department of Biostatistics
Shah, Imran
- Other Affiliation: National Center for Computational Toxicology

Abstract

Background: Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints. Supervised machine learning is a powerful approach to discover combinatorial relationships in complex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machine learning (ML) methods.
Results: The classification performance of Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Naïve Bayes (NB), Recursive Partitioning and Regression Trees (RPART), and Support Vector Machines (SVM) in the presence and absence of filter-based feature selection was analyzed using K-way cross-validation testing and independent validation on simulated in vitro assay data sets with varying levels of model complexity, number of irrelevant features and measurement noise. While the prediction accuracy of all ML methods decreased as non-causal (irrelevant) features were added, some ML methods performed better than others. In the limit of using a large number of features, ANN and SVM were always in the top performing set of methods while RPART and KNN (k = 5) were always in the poorest performing set. The addition of measurement noise and irrelevant features decreased the classification accuracy of all ML methods, with LDA suffering the greatest performance degradation. LDA performance is especially sensitive to the use of feature selection. Filter-based feature selection generally improved performance, most strikingly for LDA.
Conclusion: We have developed a novel simulation model to evaluate machine learning methods for the analysis of data sets in which in vitro bioassay data is being used to predict in vivo chemical toxicology. From our analysis, we can recommend that several ML methods, most notably SVM and ANN, are good candidates for use in real world applications in this area.
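
The evaluation protocol summarized in the abstract (simulated assay data with irrelevant features and measurement noise, filter-based feature selection, K-way cross-validation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy simulator is far simpler than the multi-scale model the article describes, the filter score (absolute difference of class means) is one common choice and not necessarily the authors', only KNN with k = 5 is shown, and all function names and parameter values are hypothetical.

```python
import random
import statistics


def simulate(n_samples, n_causal, n_irrelevant, noise_sd, rng):
    """Toy simulator (an assumption, not the article's model): the class label
    is 1 when the sum of the noise-free causal features is positive."""
    X, y = [], []
    for _ in range(n_samples):
        causal = [rng.gauss(0, 1) for _ in range(n_causal)]
        irrelevant = [rng.gauss(0, 1) for _ in range(n_irrelevant)]
        # Measurement noise is added to every observed feature.
        X.append([v + rng.gauss(0, noise_sd) for v in causal + irrelevant])
        y.append(1 if sum(causal) > 0 else 0)
    return X, y


def filter_select(X, y, n_keep):
    """Filter-based selection: rank features by |mean(class 1) - mean(class 0)|."""
    def score(j):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == 0]
        if not pos or not neg:  # degenerate split: no class contrast to measure
            return 0.0
        return abs(statistics.fmean(pos) - statistics.fmean(neg))
    return sorted(range(len(X[0])), key=score, reverse=True)[:n_keep]


def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), lab)
        for row, lab in zip(X_train, y_train)
    )[:k]
    return 1 if 2 * sum(lab for _, lab in nearest) > k else 0


def cross_validate(X, y, n_folds=5, n_keep=None, k=5):
    """K-fold cross-validation; features are selected on the training folds only."""
    accuracies = []
    for fold in range(n_folds):
        test = set(range(fold, len(X), n_folds))
        train = [i for i in range(len(X)) if i not in test]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        cols = filter_select(X_tr, y_tr, n_keep) if n_keep else range(len(X[0]))
        cols = list(cols)
        X_tr = [[row[j] for j in cols] for row in X_tr]
        hits = sum(
            knn_predict(X_tr, y_tr, [X[i][j] for j in cols], k) == y[i] for i in test
        )
        accuracies.append(hits / len(test))
    return statistics.fmean(accuracies)


rng = random.Random(0)
X, y = simulate(n_samples=200, n_causal=3, n_irrelevant=50, noise_sd=0.5, rng=rng)
acc_all = cross_validate(X, y)
acc_selected = cross_validate(X, y, n_keep=5)
print(f"all features: {acc_all:.2f}, top 5 by filter: {acc_selected:.2f}")
```

In this sketch the filter is fitted inside each fold on the training rows only, so the held-out rows never influence which features are kept. That is a design choice of the sketch, not a detail taken from the abstract.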

Date of publication

May 19, 2008

DOI

https://doi.org/10.17615/ve9n-qx35

Identifier

18489778
https://doi.org/10.1186/1471-2105-9-241

Resource type

Article

Rights statement

In Copyright

Rights holder

Richard Judson et al.; licensee BioMed Central Ltd.

License

http://creativecommons.org/licenses/by/2.0

Journal title

BMC Bioinformatics

Journal volume

9

Journal issue

1

Page start

241

Language

English

Is the article or chapter peer-reviewed?

Yes

ISSN

1471-2105

Bibliographic citation

BMC Bioinformatics. 2008 May 19;9(1):241

Publisher

BioMed Central Ltd

Access right

Open Access

Date uploaded

August 24, 2012

Relations

Parents:

This work has no parents.

In Collection:

UNC-Chapel Hill Artificial Intelligence Resources

Items

Title	Date Uploaded	Visibility
1471-2105-9-241.pdf	August 24, 2012	Public
1471-2105-9-241.xml	August 24, 2012	Public
