Comparing the performance of AdaBoost, XGBoost, and logistic regression for imbalanced data

An imbalanced data problem occurs in the absence of a good class distribution between classes. Imbalanced data will cause the classifier to be biased to the majority class as the standard classification algorithms are based on the belief that the training set is balanced. Therefore, it is crucial to...

Bibliographic Details
Published in: Mathematics and Statistics
Main Author: Shahri N.H.N.B.M.; Lai S.B.S.; Mohamad M.B.; Rahman H.A.B.A.; Rambli A.B.
Format: Article
Language: English
Published: Horizon Research Publishing 2021
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85108372262&doi=10.13189%2fms.2021.090320&partnerID=40&md5=4a04a9857a176aa9065fc1d547481b85
id 2-s2.0-85108372262
author Shahri N.H.N.B.M.; Lai S.B.S.; Mohamad M.B.; Rahman H.A.B.A.; Rambli A.B.
title Comparing the performance of AdaBoost, XGBoost, and logistic regression for imbalanced data
publishDate 2021
container_title Mathematics and Statistics
container_volume 9
container_issue 3
doi_str_mv 10.13189/ms.2021.090320
url https://www.scopus.com/inward/record.uri?eid=2-s2.0-85108372262&doi=10.13189%2fms.2021.090320&partnerID=40&md5=4a04a9857a176aa9065fc1d547481b85
description An imbalanced data problem occurs when the classes are not evenly distributed. Imbalanced data biases a classifier toward the majority class because standard classification algorithms assume that the training set is balanced. Therefore, it is crucial to find a classifier that can deal with imbalanced data for any given classification task. The aim of this research is to find the best method among AdaBoost, XGBoost, and logistic regression for imbalanced simulated and real datasets. The performance of the three methods on both simulated and real imbalanced datasets is compared using five performance measures, namely sensitivity, specificity, precision, F1-score, and g-mean. The results on the simulated datasets show that logistic regression performs better than AdaBoost and XGBoost on highly imbalanced datasets, whereas on the real imbalanced datasets, AdaBoost and logistic regression demonstrate similarly good performance. All methods seem to perform well on datasets that are not severely imbalanced. Compared to AdaBoost and XGBoost, logistic regression is found to predict better for datasets with severe imbalance ratios. However, all three methods perform poorly for data with a 5% minority class and a sample size of n = 100. The study finds that different methods perform best for data with different minority percentages. © 2021 by authors, all rights reserved.
publisher Horizon Research Publishing
issn 23322071
language English
format Article
accesstype All Open Access; Gold Open Access
record_format scopus
collection Scopus
_version_ 1818940561146511360
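
Illustrative note on the five performance measures named in the description. The record does not give their formulas, so the sketch below uses the standard confusion-matrix definitions and assumes g-mean is the geometric mean of sensitivity and specificity. The function name `imbalance_metrics` and the example labels are hypothetical and are not taken from the article.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, precision, F1-score and g-mean.

    Assumes binary labels; `positive` marks the minority class.
    Ratios with a zero denominator are reported as 0.0 (an assumption).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall on the minority class
    specificity = ratio(tn, tn + fp)   # recall on the majority class
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical data: n = 100 with a 5% minority class.
y_true = [1] * 5 + [0] * 95

# A classifier that always predicts the majority class.
always_majority = [0] * 100
majority_scores = imbalance_metrics(y_true, always_majority)

# A classifier that finds 3 of the 5 minority cases with 5 false alarms.
mixed = [1, 1, 1, 0, 0] + [1] * 5 + [0] * 90
mixed_scores = imbalance_metrics(y_true, mixed)
```

On the hypothetical labels, the always-majority classifier is right on 95 of 100 cases yet has zero sensitivity and zero g-mean, which is the kind of majority-class bias these measures are meant to expose.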