Comparing the performance of AdaBoost, XGBoost, and logistic regression for imbalanced data
An imbalanced data problem occurs in the absence of a good class distribution between classes. Imbalanced data will cause the classifier to be biased to the majority class as the standard classification algorithms are based on the belief that the training set is balanced. Therefore, it is crucial to...
Published in: | Mathematics and Statistics |
---|---|
Main Author: | Shahri N.H.N.B.M. |
Format: | Article |
Language: | English |
Published: | Horizon Research Publishing, 2021 |
Online Access: | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85108372262&doi=10.13189%2fms.2021.090320&partnerID=40&md5=4a04a9857a176aa9065fc1d547481b85 |
id |
2-s2.0-85108372262 |
---|---|
spelling |
2-s2.0-85108372262 Shahri N.H.N.B.M.; Lai S.B.S.; Mohamad M.B.; Rahman H.A.B.A.; Rambli A.B. Comparing the performance of adaboost, xgboost, and logistic regression for imbalanced data 2021 Mathematics and Statistics 9 3 10.13189/ms.2021.090320 https://www.scopus.com/inward/record.uri?eid=2-s2.0-85108372262&doi=10.13189%2fms.2021.090320&partnerID=40&md5=4a04a9857a176aa9065fc1d547481b85 An imbalanced data problem occurs in the absence of a good class distribution between classes. Imbalanced data will cause the classifier to be biased to the majority class as the standard classification algorithms are based on the belief that the training set is balanced. Therefore, it is crucial to find a classifier that can deal with imbalanced data for any given classification task. The aim of this research is to find the best method among AdaBoost, XGBoost, and Logistic Regression to deal with imbalanced simulated datasets and real datasets. The performances of these three methods in both simulated and real imbalanced datasets are compared using five performance measures, namely sensitivity, specificity, precision, F1-score, and g-mean. The results of the simulated datasets show that logistic regression performs better than AdaBoost and XGBoost in highly imbalanced datasets, whereas in the real imbalanced datasets, AdaBoost and logistic regression demonstrated similarly good performance. All methods seem to perform well in datasets that are not severely imbalanced. Compared to AdaBoost and XGBoost, logistic regression is found to predict better for datasets with severe imbalanced ratios. However, all three methods perform poorly for data with a 5% minority, with a sample size of n = 100. In this study, it is found that different methods perform the best for data with different minority percentages. © 2021 by authors, all rights reserved. Horizon Research Publishing 23322071 English Article All Open Access; Gold Open Access |
author |
Shahri N.H.N.B.M.; Lai S.B.S.; Mohamad M.B.; Rahman H.A.B.A.; Rambli A.B. |
spellingShingle |
Shahri N.H.N.B.M.; Lai S.B.S.; Mohamad M.B.; Rahman H.A.B.A.; Rambli A.B. Comparing the performance of adaboost, xgboost, and logistic regression for imbalanced data |
title |
Comparing the performance of AdaBoost, XGBoost, and logistic regression for imbalanced data |
publishDate |
2021 |
container_title |
Mathematics and Statistics |
container_volume |
9 |
container_issue |
3 |
doi_str_mv |
10.13189/ms.2021.090320 |
url |
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85108372262&doi=10.13189%2fms.2021.090320&partnerID=40&md5=4a04a9857a176aa9065fc1d547481b85 |
description |
An imbalanced data problem occurs in the absence of a good class distribution between classes. Imbalanced data will cause the classifier to be biased to the majority class as the standard classification algorithms are based on the belief that the training set is balanced. Therefore, it is crucial to find a classifier that can deal with imbalanced data for any given classification task. The aim of this research is to find the best method among AdaBoost, XGBoost, and Logistic Regression to deal with imbalanced simulated datasets and real datasets. The performances of these three methods in both simulated and real imbalanced datasets are compared using five performance measures, namely sensitivity, specificity, precision, F1-score, and g-mean. The results of the simulated datasets show that logistic regression performs better than AdaBoost and XGBoost in highly imbalanced datasets, whereas in the real imbalanced datasets, AdaBoost and logistic regression demonstrated similarly good performance. All methods seem to perform well in datasets that are not severely imbalanced. Compared to AdaBoost and XGBoost, logistic regression is found to predict better for datasets with severe imbalanced ratios. However, all three methods perform poorly for data with a 5% minority, with a sample size of n = 100. In this study, it is found that different methods perform the best for data with different minority percentages. © 2021 by authors, all rights reserved. |
publisher |
Horizon Research Publishing |
issn |
23322071 |
language |
English |
format |
Article |
accesstype |
All Open Access; Gold Open Access |
record_format |
scopus |
collection |
Scopus |
_version_ |
1818940561146511360 |
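
The abstract in this record names five performance measures (sensitivity, specificity, precision, F1-score, and g-mean) but the record does not give their formulas. The sketch below shows one common way to compute them from binary confusion-matrix counts. It rests on assumptions not stated in the record: the minority class is treated as the positive class, g-mean is taken as the geometric mean of sensitivity and specificity, and undefined ratios are reported as 0.0. The function name `imbalance_metrics` and the example counts are hypothetical, not taken from the article.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute five measures from binary confusion-matrix counts.

    Assumption: the minority class is the positive class.
    Ratios with a zero denominator are reported as 0.0 (a convention
    chosen for this sketch, not taken from the article).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)     # positive predictive value
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    # Assumed definition: geometric mean of sensitivity and specificity
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical counts: n = 100 with 5 minority cases,
# and a classifier that predicts the majority class for every case.
all_majority = imbalance_metrics(tp=0, fp=0, tn=95, fn=5)
print(all_majority)
```

In this hypothetical case the classifier is right on 95 of 100 cases, yet sensitivity, F1-score, and g-mean are all zero because no minority case is detected. That is one plausible reason to compare classifiers on measures such as these rather than on raw accuracy when the data are imbalanced.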