Comparisons of ADABOOST, KNN, SVM and logistic regression in classification of imbalanced dataset

Data mining classification techniques are affected by imbalance between the classes of a response variable. The difficulty of handling imbalanced data has led to an influx of methods that address the imbalance at either the data level or the algorithmic level. The R programming language...

Bibliographic Details
Published in:	Communications in Computer and Information Science
Main Author:	Rahman H.A.A.; Wah Y.B.; He H.; Bulgiba A.
Format:	Conference paper
Language:	English
Published:	Springer Verlag 2015
Online Access:	https://www.scopus.com/inward/record.uri?eid=2-s2.0-84946092788&doi=10.1007%2f978-981-287-936-3_6&partnerID=40&md5=c6e2d7a002f123ebbbc0d3de35a84e64

id	2-s2.0-84946092788
author	Rahman H.A.A.; Wah Y.B.; He H.; Bulgiba A.
title	Comparisons of ADABOOST, KNN, SVM and logistic regression in classification of imbalanced dataset
publishDate	2015
container_title	Communications in Computer and Information Science
container_volume	545
container_issue
doi_str_mv	10.1007/978-981-287-936-3_6
url	https://www.scopus.com/inward/record.uri?eid=2-s2.0-84946092788&doi=10.1007%2f978-981-287-936-3_6&partnerID=40&md5=c6e2d7a002f123ebbbc0d3de35a84e64
description	Data mining classification techniques are affected by imbalance between the classes of a response variable. The difficulty of handling imbalanced data has led to an influx of methods that address the imbalance at either the data level or the algorithmic level. The R programming language is one of the many tools available for data mining. This paper compares some classification algorithms in R for an imbalanced medical data set. The classifiers ADABOOST, KNN, SVM-RBF and logistic regression were applied to the original, random oversampling and random undersampling data sets. Results show that ADABOOST, KNN and SVM-RBF exhibit over-fitting when applied to the original dataset. No over-fitting occurs for the random oversampling dataset, where SVM-RBF has the highest accuracy (Training: 91.5%, Testing: 90.6%), sensitivity (Training: 91.0%, Testing: 91.0%), specificity (Training: 92.0%, Testing: 90.2%) and precision (Training: 91.9%, Testing: 90.5%) on both the training and testing data sets. For random undersampling, only ADABOOST and logistic regression show no over-fitting. Logistic regression is the most stable classifier, exhibiting consistent training and testing results. © Springer Science+Business Media Singapore 2015.
publisher	Springer Verlag
issn	18650929
language	English
format	Conference paper
accesstype
record_format	scopus
collection	Scopus
_version_	1809677610112778240
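
Illustrative note: the abstract refers to random oversampling, random undersampling, and four evaluation measures (accuracy, sensitivity, specificity, precision). The sketch below shows one common way these could be computed. It is not the authors' code: the paper used R, whereas this is plain Python, and the function names (random_oversample, random_undersample, binary_metrics) and the toy data are assumptions made only for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sampling with replacement
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class
    (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):  # sampling without replacement
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity and precision from the
    confusion-matrix counts of a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "precision": ratio(tp, tp + fp),
    }


if __name__ == "__main__":
    # Toy imbalanced data: 2 positives, 8 negatives (made up for illustration).
    rows = list(range(10))
    labels = [1, 1] + [0] * 8
    print(Counter(random_oversample(rows, labels)[1]))
    print(Counter(random_undersample(rows, labels)[1]))
    print(binary_metrics([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))
```

Comparing these measures on the training and testing sets, as the abstract does, is one way to judge over-fitting: a large gap between the two suggests the classifier does not generalize. Resampling would normally be applied to the training data only, so that the test set keeps its original class distribution; whether the paper followed that convention is not stated in this record.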

Similar Items