Enhancing software defect prediction: a framework with improved feature selection and ensemble machine learning

Effective software defect prediction is a crucial aspect of software quality assurance, enabling the identification of defective modules before the testing phase. This study aims to propose a comprehensive five-stage framework for software defect prediction, addressing the current challenges in the...


Bibliographic Details
Published in: PeerJ Computer Science
Main Authors: Ali M.; Mazhar T.; Al-Rasheed A.; Shahzad T.; Ghadi Y.Y.; Khan M.A.
Format: Article
Language: English
Published: PeerJ Inc. 2024
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85190276232&doi=10.7717%2fpeerj-cs.1860&partnerID=40&md5=2d75fed0160d49f179045f7ff813eafc
Volume: 10
DOI: 10.7717/peerj-cs.1860
Description: Effective software defect prediction is a crucial aspect of software quality assurance, enabling the identification of defective modules before the testing phase. This study proposes a comprehensive five-stage framework for software defect prediction that addresses current challenges in the field. The first stage selects a cleaned version of NASA’s defect datasets (CM1, JM1, MC2, MW1, PC1, PC3, and PC4) to ensure data integrity. In the second stage, a feature selection technique based on the genetic algorithm is applied to identify the optimal subset of features. In the third stage, three heterogeneous binary classifiers, namely random forest, support vector machine, and naïve Bayes, are implemented as base classifiers. Through iterative tuning, each classifier is optimized individually for the highest accuracy. In the fourth stage, an ensemble machine-learning technique known as voting is applied as a master classifier, combining the decisions of the base classifiers. The final stage evaluates the performance of the proposed framework using five widely recognized measures: precision, recall, accuracy, F-measure, and area under the curve. Experimental results show that the proposed framework outperforms state-of-the-art ensemble and base classifiers employed in software defect prediction and achieves a maximum accuracy of 95.1%. Efficiency is also assessed by measuring execution times: the framework reduces training and testing times by an average of 51.52% and 52.31%, respectively, making it a more computationally economical solution for accurate software defect prediction. © 2024 Ali et al.
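The voting and evaluation stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes hard (majority) voting, which the abstract does not specify, and the label vectors standing in for random forest, support vector machine, and naïve Bayes outputs are hypothetical values invented for illustration. Area under the curve is omitted because it needs classifier scores rather than hard labels.

```python
from collections import Counter
from typing import Dict, List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-classifier label lists by hard (majority) voting.

    With three binary voters there is always a strict majority, so no
    tie-breaking rule is needed in this sketch.
    """
    combined = []
    for votes in zip(*predictions):
        # most_common(1) yields the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, accuracy, and F-measure for the positive class (1 = defective)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical labels for eight modules (1 = defective, 0 = clean).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
rf_pred = [1, 0, 1, 0, 0, 0, 1, 0]   # stand-in for random forest output
svm_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # stand-in for support vector machine output
nb_pred = [0, 0, 1, 1, 0, 1, 0, 1]   # stand-in for naive Bayes output

ensemble_pred = majority_vote([rf_pred, svm_pred, nb_pred])
print(ensemble_pred)
print(binary_metrics(y_true, ensemble_pred))
```

In this toy data each base classifier makes at least one error on its own, while the vote corrects all but one of them, which is the effect the ensemble stage relies on.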
ISSN: 2376-5992
Access: All Open Access