Comparing the performance of AdaBoost, XGBoost, and logistic regression for imbalanced data

Bibliographic Details
Published in: Mathematics and Statistics
Main Author: Shahri N.H.N.B.M.; Lai S.B.S.; Mohamad M.B.; Rahman H.A.B.A.; Rambli A.B.
Format: Article
Language: English
Published: Horizon Research Publishing 2021
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85108372262&doi=10.13189%2fms.2021.090320&partnerID=40&md5=4a04a9857a176aa9065fc1d547481b85
Description
Summary: An imbalanced data problem arises when the classes in a dataset are unevenly distributed. Imbalanced data biases a classifier toward the majority class, because standard classification algorithms assume that the training set is balanced. It is therefore crucial to find a classifier that can handle imbalanced data for a given classification task. This research aims to identify the best method among AdaBoost, XGBoost, and logistic regression for imbalanced simulated and real datasets. The performance of the three methods on both kinds of dataset is compared using five measures: sensitivity, specificity, precision, F1-score, and g-mean. On the simulated datasets, logistic regression performs better than AdaBoost and XGBoost when the data are highly imbalanced, whereas on the real imbalanced datasets AdaBoost and logistic regression perform similarly well. All methods appear to perform well on datasets that are not severely imbalanced. Compared with AdaBoost and XGBoost, logistic regression predicts better on datasets with severe imbalance ratios. However, all three methods perform poorly on data with a 5% minority class and a sample size of n = 100. The study finds that different methods perform best for data with different minority percentages. © 2021 by authors, all rights reserved.
ISSN: 23322071
DOI: 10.13189/ms.2021.090320
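
The five performance measures named in the summary can be illustrated with a minimal sketch. It is not taken from the article: the function name `imbalance_metrics`, the choice to report 0.0 when a denominator is zero, and the example labels are all assumptions made for illustration, and g-mean is assumed to follow the common definition as the square root of sensitivity times specificity.

```python
from math import sqrt


def imbalance_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, precision, F1-score and g-mean.

    Hypothetical helper for a binary task; `positive` marks the minority class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall on the minority class
    specificity = ratio(tn, tn + fp)   # recall on the majority class
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    g_mean = sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
    }


# Made-up labels: n = 100 with a 5% minority class (label 1).
y_true = [1] * 5 + [0] * 95

# A classifier that always predicts the majority class.
always_majority = [0] * 100
baseline = imbalance_metrics(y_true, always_majority)

# A classifier that finds 3 of the 5 minority cases at the cost of 5 false alarms.
some_recall = [1, 1, 1, 0, 0] + [1] * 5 + [0] * 90
improved = imbalance_metrics(y_true, some_recall)
```

In this made-up example the always-majority classifier is correct on 95 of 100 cases, yet its sensitivity and g-mean are both zero, which is the majority-class bias the summary describes and a reason such measures are preferred over plain accuracy for imbalanced data. With only five minority cases, each one shifts sensitivity by 0.2, so estimates at this sample size are coarse.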