Reduced Noise SMOTE in Machine Learning Model: Application in Water Quality Classification with Imbalanced Datasets

Bibliographic Details
Published in: 2024 5th International Conference on Artificial Intelligence and Data Sciences, AiDAS 2024 - Proceedings
Main Authors: Nasaruddin N.; Masseran N.; Idris W.M.R.; Ul-Saufie A.Z.
Format: Conference paper
Language: English
Published: Institute of Electrical and Electronics Engineers Inc. 2024
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85209693380&doi=10.1109%2fAiDAS63860.2024.10730391&partnerID=40&md5=2676def56948cc6cbe10c9ca564deb1c
Description
Summary: Achieving accurate classification on imbalanced datasets, especially environmental data such as water quality assessments, is a major challenge for machine learning classifiers. This study introduces the Reduced Noise-Synthetic Minority Oversampling Technique (RN-SMOTE) to address the underrepresentation of minority classes and the presence of noise in imbalanced multiclass datasets for Water Quality Classification (WQC). Current state-of-the-art techniques, such as the standard Synthetic Minority Oversampling Technique (SMOTE) and its variants, improve class balance but often fail to adequately address noise in synthetic samples. Our research extends RN-SMOTE, previously applied to binary data, to multiclass scenarios. RN-SMOTE improves classification performance by oversampling the minority classes and eliminating noisy synthetic instances. We evaluate the effectiveness of RN-SMOTE using three classifiers: Decision Tree (DT), Random Forest (RF), and Extreme Gradient Boosting (XGBoost). The experimental results show that RN-SMOTE improves classification accuracy and sensitivity for the RF classifier, which achieved an accuracy of 71.17% and sensitivities of 75.24%, 69.23% and 72.14% for the clean, slightly polluted and polluted classes, respectively, outperforming both the original dataset and traditional SMOTE. However, RN-SMOTE did not outperform traditional SMOTE for the DT and XGBoost models. Applying RN-SMOTE to multiclass water quality data extends its utility and advances imbalanced classification in environmental science. © 2024 IEEE.
DOI: 10.1109/AiDAS63860.2024.10730391
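
Illustrative sketch: The summary describes a two-step idea, oversampling the minority classes and then eliminating noisy synthetic instances, and reports sensitivity per class. The Python below is a minimal, standard-library sketch of that general idea, not the authors' implementation. The summary does not say how RN-SMOTE detects noise, so the nearest-neighbour vote in `drop_noisy` is a stand-in assumption. All function names, the parameter `k`, and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def nearest(point, candidates, k):
    """Return the k candidates closest to point (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]


def oversample_multiclass(X, y, k=3, seed=0):
    """SMOTE-style interpolation that grows every class to the majority size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    synthetic = []  # list of (point, label) pairs
    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if len(members) < 2:
            continue  # interpolation needs at least two points in the class
        for _ in range(target - n):
            i = rng.randrange(len(members))
            base = members[i]
            others = members[:i] + members[i + 1:]
            neighbour = rng.choice(nearest(base, others, k))
            gap = rng.random()  # position along the segment base -> neighbour
            point = tuple(b + gap * (nb - b) for b, nb in zip(base, neighbour))
            synthetic.append((point, label))
    return synthetic


def drop_noisy(synthetic, X, y, k=3):
    """Assumed noise filter: keep a synthetic point only if a strict majority
    of its k nearest ORIGINAL points carry the same label."""
    kept = []
    for point, label in synthetic:
        idx = sorted(range(len(X)), key=lambda i: math.dist(point, X[i]))[:k]
        agree = sum(1 for i in idx if y[i] == label)
        if agree * 2 > len(idx):
            kept.append((point, label))
    return kept


def per_class_sensitivity(y_true, y_pred):
    """Sensitivity (recall) per class: correct predictions / true members."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical two-feature toy data with three imbalanced classes.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.3), (0.1, 0.0),
     (1.0, 1.0), (1.1, 0.9),
     (2.0, 0.0), (2.1, 0.2), (1.9, 0.1)]
y = ["clean"] * 6 + ["slightly polluted"] * 2 + ["polluted"] * 3

synthetic = oversample_multiclass(X, y)
kept = drop_noisy(synthetic, X, y)
X_balanced = X + [point for point, _ in kept]
y_balanced = y + [label for _, label in kept]
print(Counter(y_balanced))
```

In a real experiment the balanced training set would be passed to a classifier such as a random forest, and `per_class_sensitivity` would be computed on held-out, non-resampled test data so that synthetic points do not inflate the scores.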