Summary: | Achieving accurate classification on imbalanced datasets, especially environmental data such as water quality assessments, is a major challenge for machine learning classifiers. This study introduces the Reduced Noise-Synthetic Minority Oversampling Technique (RN-SMOTE) to address the underrepresentation of minority classes and the presence of noise in imbalanced multiclass datasets for Water Quality Classification (WQC). Current state-of-the-art techniques, such as the standard Synthetic Minority Oversampling Technique (SMOTE) and its variants, improve class balance but often fail to adequately address noise in the synthetic samples. Our research extends RN-SMOTE, previously applied to binary data, to multiclass scenarios. RN-SMOTE improves classification performance by oversampling the minority classes and eliminating noisy synthetic instances. We evaluate the effectiveness of RN-SMOTE using three classifiers: Decision Tree (DT), Random Forest (RF), and Extreme Gradient Boosting (XGBoost). The experimental results show that RN-SMOTE significantly improves classification accuracy and sensitivity. For instance, the RF classifier with RN-SMOTE achieved an accuracy of 71.17% and sensitivities of 75.24%, 69.23%, and 72.14% for the clean, slightly polluted, and polluted classes, respectively, outperforming both the original dataset and traditional SMOTE. However, RN-SMOTE did not outperform traditional SMOTE for the DT and XGBoost models. Applying RN-SMOTE to multiclass water quality data extends its utility and advances imbalanced classification in environmental science. © 2024 IEEE.
|
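The summary reports overall accuracy alongside one sensitivity value per class. As a minimal sketch of how such figures are typically computed, the snippet below takes sensitivity to mean per-class recall, TP / (TP + FN), which is an assumption since the summary does not define it. The function name `per_class_sensitivity` and the toy labels are hypothetical and are not data from the study.

```python
from collections import Counter


def per_class_sensitivity(y_true, y_pred):
    """Return {class: recall}, where recall = TP / (TP + FN) for each class."""
    support = Counter(y_true)  # number of true samples per class (TP + FN)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # TP per class
    return {label: hits[label] / support[label] for label in support}


# Toy labels invented for illustration only (not results from the study).
y_true = ["clean"] * 4 + ["slightly polluted"] * 2 + ["polluted"] * 4
y_pred = [
    "clean", "clean", "clean", "polluted",   # 3 of 4 clean samples recovered
    "slightly polluted", "clean",            # 1 of 2 slightly polluted recovered
    "polluted", "polluted", "polluted", "slightly polluted",  # 3 of 4 polluted
]

sens = per_class_sensitivity(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, value in sens.items():
    print(f"{label}: sensitivity = {value:.2%}")
print(f"accuracy = {accuracy:.2%}")
```

Reporting sensitivity per class matters on imbalanced data because overall accuracy can stay high even when a minority class is rarely recovered.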