Oversampling Methods for Handling Imbalance Data in Binary Classification

Data preparation occupies the majority of data science, about 60–80%. The process of data preparation can produce an accurate output of information to be used in decision making. That is why, in the context of data science, it is so critical. However, in reality, data does not always come in a prede...

Bibliographic Details
Published in:	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Main Authors:	Riston T.; Suherman S.N.; Yonnatan Y.; Indrayatna F.; Pravitasari A.A.; Sari E.N.; Herawan T.
Format:	Conference paper
Language:	English
Published:	Springer Science and Business Media Deutschland GmbH 2023
Online Access:	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85168762818&doi=10.1007%2f978-3-031-37108-0_1&partnerID=40&md5=da50f78b048c1911c8044b45cdb4bf93

id	2-s2.0-85168762818
author	Riston T.; Suherman S.N.; Yonnatan Y.; Indrayatna F.; Pravitasari A.A.; Sari E.N.; Herawan T.
title	Oversampling Methods for Handling Imbalance Data in Binary Classification
publishDate	2023
container_title	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
container_volume	14105 LNCS
container_issue
doi_str_mv	10.1007/978-3-031-37108-0_1
url	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85168762818&doi=10.1007%2f978-3-031-37108-0_1&partnerID=40&md5=da50f78b048c1911c8044b45cdb4bf93
description	Data preparation occupies the majority of data science, about 60–80%. The process of data preparation can produce an accurate output of information to be used in decision making. That is why, in the context of data science, it is so critical. However, in reality, data does not always come in a predefined distribution with parameters, and it can even arrive with an imbalance. Imbalanced data causes many problems, especially in classification. This study employs several oversampling methods in machine learning, i.e., Random Oversampling (ROS), Adaptive Synthetic Sampling (ADASYN), Synthetic Minority Over-sampling Technique (SMOTE), and Borderline-SMOTE (B-SMOTE), to handle imbalanced data in binary classification with Naïve Bayes and Support Vector Machine (SVM). The five methods are run under the same experimental design and compared in search of the best and most accurate model for the datasets. The evaluation is based on confusion matrices, with precision, recall, and F1-score calculated for comparison. ROC curves and AUC values are also provided to illustrate the performance of each method. The results show that SVM with B-SMOTE has better classification performance, especially on datasets where the minority and majority classes have highly similar characteristics. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
publisher	Springer Science and Business Media Deutschland GmbH
issn	0302-9743
language	English
format	Conference paper
accesstype
record_format	scopus
collection	Scopus
_version_	1809677590188785664