Comparisons of ADABOOST, KNN, SVM and logistic regression in classification of imbalanced dataset
Data mining classification techniques are affected by imbalances between the classes of a response variable. The difficulty of handling imbalanced data has led to an influx of methods that resolve the imbalance at either the data level or the algorithmic level. The R programming language...
Published in: | Communications in Computer and Information Science |
---|---|
Main Author: | Rahman H.A.A. |
Format: | Conference paper |
Language: | English |
Published: | Springer Verlag 2015 |
Online Access: | https://www.scopus.com/inward/record.uri?eid=2-s2.0-84946092788&doi=10.1007%2f978-981-287-936-3_6&partnerID=40&md5=c6e2d7a002f123ebbbc0d3de35a84e64 |
id |
2-s2.0-84946092788 |
---|---|
spelling |
2-s2.0-84946092788 Rahman H.A.A.; Wah Y.B.; He H.; Bulgiba A. Comparisons of ADABOOST, KNN, SVM and logistic regression in classification of imbalanced dataset 2015 Communications in Computer and Information Science 545 10.1007/978-981-287-936-3_6 https://www.scopus.com/inward/record.uri?eid=2-s2.0-84946092788&doi=10.1007%2f978-981-287-936-3_6&partnerID=40&md5=c6e2d7a002f123ebbbc0d3de35a84e64 Data mining classification techniques are affected by imbalances between the classes of a response variable. The difficulty of handling imbalanced data has led to an influx of methods that resolve the imbalance at either the data level or the algorithmic level. The R programming language is one of the many tools available for data mining. This paper compares some classification algorithms in R for an imbalanced medical data set. The classifiers ADABOOST, KNN, SVM-RBF and logistic regression were applied to the original, random oversampling and undersampling data sets. Results show that ADABOOST, KNN and SVM-RBF exhibit over-fitting when applied to the original data set. No over-fitting occurs for the random oversampling data set, where SVM-RBF has the highest accuracy (Training: 91.5%, Testing: 90.6%), sensitivity (Training: 91.0%, Testing: 91.0%), specificity (Training: 92.0%, Testing: 90.2%) and precision (Training: 91.9%, Testing: 90.5%) on both the training and testing data sets. For random undersampling, only ADABOOST and logistic regression show no over-fitting. Logistic regression is the most stable classifier, exhibiting consistent training and testing results. © Springer Science+Business Media Singapore 2015. Springer Verlag 18650929 English Conference paper |
author |
Rahman H.A.A.; Wah Y.B.; He H.; Bulgiba A. |
spellingShingle |
Rahman H.A.A.; Wah Y.B.; He H.; Bulgiba A. Comparisons of ADABOOST, KNN, SVM and logistic regression in classification of imbalanced dataset |
author_facet |
Rahman H.A.A.; Wah Y.B.; He H.; Bulgiba A. |
author_sort |
Rahman H.A.A.; Wah Y.B.; He H.; Bulgiba A. |
title |
Comparisons of ADABOOST, KNN, SVM and logistic regression in classification of imbalanced dataset |
title_short |
Comparisons of ADABOOST, KNN, SVM and logistic regression in classification of imbalanced dataset |
title_full |
Comparisons of ADABOOST, KNN, SVM and logistic regression in classification of imbalanced dataset |
title_fullStr |
Comparisons of ADABOOST, KNN, SVM and logistic regression in classification of imbalanced dataset |
title_full_unstemmed |
Comparisons of ADABOOST, KNN, SVM and logistic regression in classification of imbalanced dataset |
title_sort |
Comparisons of ADABOOST, KNN, SVM and logistic regression in classification of imbalanced dataset |
publishDate |
2015 |
container_title |
Communications in Computer and Information Science |
container_volume |
545 |
container_issue |
|
doi_str_mv |
10.1007/978-981-287-936-3_6 |
url |
https://www.scopus.com/inward/record.uri?eid=2-s2.0-84946092788&doi=10.1007%2f978-981-287-936-3_6&partnerID=40&md5=c6e2d7a002f123ebbbc0d3de35a84e64 |
description |
Data mining classification techniques are affected by imbalances between the classes of a response variable. The difficulty of handling imbalanced data has led to an influx of methods that resolve the imbalance at either the data level or the algorithmic level. The R programming language is one of the many tools available for data mining. This paper compares some classification algorithms in R for an imbalanced medical data set. The classifiers ADABOOST, KNN, SVM-RBF and logistic regression were applied to the original, random oversampling and undersampling data sets. Results show that ADABOOST, KNN and SVM-RBF exhibit over-fitting when applied to the original data set. No over-fitting occurs for the random oversampling data set, where SVM-RBF has the highest accuracy (Training: 91.5%, Testing: 90.6%), sensitivity (Training: 91.0%, Testing: 91.0%), specificity (Training: 92.0%, Testing: 90.2%) and precision (Training: 91.9%, Testing: 90.5%) on both the training and testing data sets. For random undersampling, only ADABOOST and logistic regression show no over-fitting. Logistic regression is the most stable classifier, exhibiting consistent training and testing results. © Springer Science+Business Media Singapore 2015. |
publisher |
Springer Verlag |
issn |
18650929 |
language |
English |
format |
Conference paper |
accesstype |
|
record_format |
scopus |
collection |
Scopus |
_version_ |
1809677610112778240 |
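
The description field refers to random oversampling, random undersampling, and four evaluation measures (accuracy, sensitivity, specificity, precision). The following is a minimal sketch of those ideas only, written in Python with the standard library rather than in R, which the paper used. The function names `random_oversample`, `random_undersample` and `binary_metrics` are hypothetical, and nothing here is taken from the authors' code or data.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class so that every class has
    as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        out_rows.extend(rng.sample(pool, target))
        out_labels.extend([cls] * target)
    return out_rows, out_labels


def binary_metrics(y_true, y_pred, positive=1):
    """Compute the four measures from the binary confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "precision": ratio(tp, tp + fp),
    }
```

Under this sketch, resampling would normally be applied to the training portion only, and `binary_metrics` would be computed once on the training predictions and once on the testing predictions. A large gap between the two sets of values is the kind of over-fitting signal the description reports.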