Comparison of ensemble hybrid sampling with bagging and boosting machine learning approach for imbalanced data

Training on an imbalanced dataset can cause classifiers to overfit the majority class and increase the possibility of information loss for the minority class. Moreover, accuracy may not give a clear picture of the classifier’s performance. This paper utilized decision tree (DT), support vector machine...


Bibliographic Details
Published in: Indonesian Journal of Electrical Engineering and Computer Science
Main Author: Malek N.H.A.; Yaacob W.F.W.; Wah Y.B.; Md Nasir S.A.; Shaadan N.; Indratno S.W.
Format: Article
Language: English
Published: Institute of Advanced Engineering and Science 2023
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85142097924&doi=10.11591%2fijeecs.v29.i1.pp598-608&partnerID=40&md5=61804174165e22ee4efed7401972f189
id 2-s2.0-85142097924
spelling 2-s2.0-85142097924
Malek N.H.A.; Yaacob W.F.W.; Wah Y.B.; Md Nasir S.A.; Shaadan N.; Indratno S.W.
Comparison of ensemble hybrid sampling with bagging and boosting machine learning approach for imbalanced data
2023
Indonesian Journal of Electrical Engineering and Computer Science
29
1
10.11591/ijeecs.v29.i1.pp598-608
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85142097924&doi=10.11591%2fijeecs.v29.i1.pp598-608&partnerID=40&md5=61804174165e22ee4efed7401972f189
Training on an imbalanced dataset can cause classifiers to overfit the majority class and increase the possibility of information loss for the minority class. Moreover, accuracy may not give a clear picture of the classifier’s performance. This paper utilized decision tree (DT), support vector machine (SVM), artificial neural networks (ANN), K-nearest neighbors (KNN) and Naïve Bayes (NB), as well as the ensemble models random forest (RF) and gradient boosting (GB), which use bagging and boosting methods, together with three sampling approaches and seven performance metrics, to investigate the effect of class imbalance on water quality data. Based on the results, the best model was gradient boosting without resampling for almost all metrics except balanced accuracy, sensitivity and area under the curve (AUC), followed by the random forest model without resampling in terms of specificity, precision and AUC. However, in terms of balanced accuracy and sensitivity, the highest performance was achieved by random forest on the randomly under-sampled dataset. Focusing on each performance metric separately, the results showed that for specificity and precision, it is better not to resample the data for either ensemble classifier. Nevertheless, the results for balanced accuracy and sensitivity showed improvement for both ensemble classifiers when using the resampled datasets. © 2023 Institute of Advanced Engineering and Science. All rights reserved.
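The comparison the abstract describes — bagging and boosting ensembles evaluated with and without random under-sampling, scored on balanced accuracy and sensitivity — can be sketched as follows. This is an illustrative sketch only, not the paper's pipeline: it uses a synthetic imbalanced dataset and scikit-learn's `RandomForestClassifier` and `GradientBoostingClassifier` in place of the authors' water quality data, and the `random_undersample` helper is a hypothetical minimal implementation of random under-sampling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split


def random_undersample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    minority = np.bincount(y).argmin()
    minority_idx = np.flatnonzero(y == minority)
    majority_idx = np.flatnonzero(y != minority)
    keep = np.concatenate(
        [minority_idx,
         rng.choice(majority_idx, size=minority_idx.size, replace=False)]
    )
    return X[keep], y[keep]


# Synthetic stand-in for an imbalanced dataset: ~90% majority, ~10% minority.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
X_us, y_us = random_undersample(X_tr, y_tr)

# Bagging (RF) vs. boosting (GB), trained on the original and the
# under-sampled training set; evaluated on the untouched test set.
for name, model in [("RF", RandomForestClassifier(random_state=42)),
                    ("GB", GradientBoostingClassifier(random_state=42))]:
    for label, (Xf, yf) in [("no resampling", (X_tr, y_tr)),
                            ("under-sampled", (X_us, y_us))]:
        pred = model.fit(Xf, yf).predict(X_te)
        print(f"{name} ({label}): "
              f"balanced acc={balanced_accuracy_score(y_te, pred):.3f}, "
              f"sensitivity={recall_score(y_te, pred):.3f}")
```

Under-sampling typically trades some specificity and precision for higher minority-class sensitivity, which matches the paper's finding that resampling helped balanced accuracy and sensitivity but not specificity or precision.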
Institute of Advanced Engineering and Science
25024752
English
Article
All Open Access; Gold Open Access
author Malek N.H.A.; Yaacob W.F.W.; Wah Y.B.; Md Nasir S.A.; Shaadan N.; Indratno S.W.
spellingShingle Malek N.H.A.; Yaacob W.F.W.; Wah Y.B.; Md Nasir S.A.; Shaadan N.; Indratno S.W.
Comparison of ensemble hybrid sampling with bagging and boosting machine learning approach for imbalanced data
author_facet Malek N.H.A.; Yaacob W.F.W.; Wah Y.B.; Md Nasir S.A.; Shaadan N.; Indratno S.W.
author_sort Malek N.H.A.; Yaacob W.F.W.; Wah Y.B.; Md Nasir S.A.; Shaadan N.; Indratno S.W.
title Comparison of ensemble hybrid sampling with bagging and boosting machine learning approach for imbalanced data
title_short Comparison of ensemble hybrid sampling with bagging and boosting machine learning approach for imbalanced data
title_full Comparison of ensemble hybrid sampling with bagging and boosting machine learning approach for imbalanced data
title_fullStr Comparison of ensemble hybrid sampling with bagging and boosting machine learning approach for imbalanced data
title_full_unstemmed Comparison of ensemble hybrid sampling with bagging and boosting machine learning approach for imbalanced data
title_sort Comparison of ensemble hybrid sampling with bagging and boosting machine learning approach for imbalanced data
publishDate 2023
container_title Indonesian Journal of Electrical Engineering and Computer Science
container_volume 29
container_issue 1
doi_str_mv 10.11591/ijeecs.v29.i1.pp598-608
url https://www.scopus.com/inward/record.uri?eid=2-s2.0-85142097924&doi=10.11591%2fijeecs.v29.i1.pp598-608&partnerID=40&md5=61804174165e22ee4efed7401972f189
description Training on an imbalanced dataset can cause classifiers to overfit the majority class and increase the possibility of information loss for the minority class. Moreover, accuracy may not give a clear picture of the classifier’s performance. This paper utilized decision tree (DT), support vector machine (SVM), artificial neural networks (ANN), K-nearest neighbors (KNN) and Naïve Bayes (NB), as well as the ensemble models random forest (RF) and gradient boosting (GB), which use bagging and boosting methods, together with three sampling approaches and seven performance metrics, to investigate the effect of class imbalance on water quality data. Based on the results, the best model was gradient boosting without resampling for almost all metrics except balanced accuracy, sensitivity and area under the curve (AUC), followed by the random forest model without resampling in terms of specificity, precision and AUC. However, in terms of balanced accuracy and sensitivity, the highest performance was achieved by random forest on the randomly under-sampled dataset. Focusing on each performance metric separately, the results showed that for specificity and precision, it is better not to resample the data for either ensemble classifier. Nevertheless, the results for balanced accuracy and sensitivity showed improvement for both ensemble classifiers when using the resampled datasets. © 2023 Institute of Advanced Engineering and Science. All rights reserved.
publisher Institute of Advanced Engineering and Science
issn 25024752
language English
format Article
accesstype All Open Access; Gold Open Access
record_format scopus
collection Scopus
_version_ 1809678022807126016