Classifier Design to Improve Pattern Classification and Knowledge Discovery for Imbalanced Datasets

Wang, Kun

Last Modified

March 19, 2019

Creator

Wang, Kun
- Affiliation: Eshelman School of Pharmacy, Division of Chemical Biology and Medicinal Chemistry

Abstract

Mining imbalanced datasets is a nontrivial problem with applications in many fields, such as scientific research, medical diagnosis, business, and industry. Standard machine learning algorithms fail to produce satisfactory classifiers: they tend to over-fit the larger class and ignore the smaller class. Numerous algorithms have been developed to handle class imbalance, and limited progress has been made in improving prediction accuracy for the smaller class. However, real-world datasets may have hidden detrimental characteristics other than class imbalance. Those characteristics are usually dataset-specific and can defeat algorithms that are otherwise robust on other imbalanced datasets. Mining such datasets can only be improved by algorithms tailored to domain characteristics (Weiss, 2004); therefore, exploratory data analysis is necessary before classifier design. In addition, unmet needs in knowledge discovery, such as lead optimization during drug discovery, demand novel algorithms. In this study, we developed a framework for imbalanced dataset mining that is tailored to data characteristics and adapted to knowledge discovery in chemical datasets. First, we explored the dataset and visualized domain characteristics, and then we designed different classifiers accordingly: for class imbalance, active learning (AL), cost-sensitive learning (CSL), and re-sampling methods were designed; for class overlap, Class Boundary Cleaning (CBC) and Class Boundary Mining (CBM) were developed. CBM was also designed for lead optimization: ideally, it detects fine structural differences between classes of compounds, and these differences can suggest options for lead optimization. The methods were applied to two datasets, hERG and CPDB. The results from the imbalanced hERG liability dataset showed that CBC, CBM, and AL were effective in correcting class imbalance and overlap and in improving the classifier's performance.
Highly predictive models were built, discriminating patterns were discovered, and lead optimization options were proposed. The methodology developed and the knowledge discovered will benefit drug discovery and improve hazard test prioritization, risk assessment, and governmental regulatory work on human health and environmental protection.
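
For readers unfamiliar with the re-sampling and cost-sensitive learning terms used in the abstract, the following is a minimal, generic sketch of both ideas in Python. It is not the dissertation's implementation and does not cover AL, CBC, or CBM; the function names, the toy data, and the inverse-frequency weighting formula are illustrative assumptions.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(features, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


def inverse_frequency_weights(labels):
    """Per-class misclassification weights, n_samples / (n_classes * count),
    so that errors on rarer classes cost more (an assumed, common choice)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Toy data (made up): 8 majority-class rows (0) and 2 minority-class rows (1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y)
weights = inverse_frequency_weights(y)
```

Re-sampling changes the training set itself, while cost-sensitive weights leave the data untouched and are passed to a learner that accepts per-class costs; which one suits a given dataset depends on the data characteristics the abstract emphasizes.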

Date of publication

May 2010

DOI

https://doi.org/10.17615/g3gq-9f55

Resource type

Dissertation

Rights statement

In Copyright

Advisor

Marron, James Stephen
Tropsha, Alexander
Zheng, Weifan
Roth, Bryan
Golbraikh, Alexander

Degree

Doctor of Philosophy

Degree granting institution

University of North Carolina at Chapel Hill Graduate School

Graduation year

2010

Language

English

Items

Title	Date Uploaded	Visibility
Wang_unc_0153D_10841.pdf	2019-04-10	Public
