Cloud failure prediction based on traditional machine learning and deep learning

Cloud failure is one of the critical issues since it can cost millions of dollars to cloud service providers, in addition to the loss of productivity suffered by industrial users. Fault tolerance management is the key approach to address this issue, and failure prediction is one of the techniques to...

Bibliographic Details
Published in:	Journal of Cloud Computing
Authors:	Tengku Asmawi T.N.; Ismail A.; Shen J.
Format:	Article
Language:	English
Published:	Springer Science and Business Media Deutschland GmbH 2022
Online Access:	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85138294341&doi=10.1186%2fs13677-022-00327-0&partnerID=40&md5=8b03d97afd11a7c8d13e5a904dee326b

id	2-s2.0-85138294341
author	Tengku Asmawi T.N.; Ismail A.; Shen J.
title	Cloud failure prediction based on traditional machine learning and deep learning
publishDate	2022
container_title	Journal of Cloud Computing
container_volume	11
container_issue	1
doi_str_mv	10.1186/s13677-022-00327-0
url	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85138294341&doi=10.1186%2fs13677-022-00327-0&partnerID=40&md5=8b03d97afd11a7c8d13e5a904dee326b
description	Cloud failure is one of the critical issues since it can cost millions of dollars to cloud service providers, in addition to the loss of productivity suffered by industrial users. Fault tolerance management is the key approach to address this issue, and failure prediction is one of the techniques to prevent the occurrence of a failure. One of the main challenges in performing failure prediction is to produce a highly accurate predictive model. Although some work on failure prediction models has been proposed, there is still a lack of a comprehensive evaluation of models based on different types of machine learning algorithms. Therefore, in this paper, we propose a comprehensive comparison and model evaluation for predictive models for job and task failure. These models are built and trained using five traditional machine learning algorithms and three variants of deep learning algorithms. We use a benchmark dataset, called Google Cloud Traces, for training and testing the models. We evaluated the performance of models using multiple metrics and determined their important features, as well as measured their scalability. Our analysis resulted in the following findings. First, in the case of job failure prediction, we found that Extreme Gradient Boosting produces the best model, where the disk space request and CPU request are the most important features that influence the prediction. Second, for task failure prediction, we found that Decision Tree and Random Forest produce the best models, where the priority of the task is the most important feature for both models. Our scalability analysis has determined that the Logistic Regression model is the most scalable compared with the others. © 2022, The Author(s).
publisher	Springer Science and Business Media Deutschland GmbH
issn	2192113X
language	English
format	Article
accesstype	All Open Access; Gold Open Access
record_format	scopus
collection	Scopus

Similar Items