A comparison of model-based imputation methods for handling missing predictor values in a linear regression model: A simulation study

In regression analysis, missing covariate data has been a common problem. Many researchers use ad hoc methods to overcome this problem due to the ease of implementation. However, these methods require assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likeli...

Bibliographic Details
Published in:	AIP Conference Proceedings
Main Author:	Hasan H.; Ahmad S.; Osman B.M.; Sapri S.; Othman N.
Format:	Conference paper
Language:	English
Published:	American Institute of Physics Inc. 2017
Online Access:	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85028346649&doi=10.1063%2f1.4995930&partnerID=40&md5=92f9ba68656bd545d34b8ab0716797e9

id	2-s2.0-85028346649
author	Hasan H.; Ahmad S.; Osman B.M.; Sapri S.; Othman N.
title	A comparison of model-based imputation methods for handling missing predictor values in a linear regression model: A simulation study
publishDate	2017
container_title	AIP Conference Proceedings
container_volume	1870
container_issue
doi_str_mv	10.1063/1.4995930
url	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85028346649&doi=10.1063%2f1.4995930&partnerID=40&md5=92f9ba68656bd545d34b8ab0716797e9
description	In regression analysis, missing covariate data is a common problem. Many researchers use ad hoc methods to handle it because they are easy to implement. However, these methods rely on assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likelihood (ML) using the expectation-maximization (EM) algorithm and Multiple Imputation (MI) are more promising for dealing with the difficulties caused by missing data. Even so, an inappropriate imputation method can introduce serious bias into the parameter estimates. The main objective of this study is to provide a better understanding of missing-data concepts so that researchers can select an appropriate imputation method. A simulation study was performed to assess the effects of different missing-data techniques on the performance of a regression model. The covariate data were generated from an underlying multivariate normal distribution, and the dependent variable was generated as a combination of the explanatory variables. Missing values in the covariate data were simulated under a missing at random (MAR) mechanism. Four levels of missingness (10%, 20%, 30% and 40%) were imposed. The ML and MI techniques available in SAS software were investigated. A linear regression model was fitted, and two performance measures, MSE and R-squared, were obtained. The results showed that MI is superior in handling missing data, with the highest R-squared and lowest MSE, when the percentage of missingness is less than 30%. Neither method is able to handle levels of missingness above 30%. © 2017 Author(s).
publisher	American Institute of Physics Inc.
issn	0094243X
language	English
format	Conference paper
accesstype
record_format	scopus
collection	Scopus
_version_	1809677908755611648
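
The simulation design summarized in the description field can be sketched in code. The example below is a minimal illustration only and is not the authors' procedure: the study used the ML and MI routines in SAS, whereas this sketch uses Python with NumPy and substitutes a simplified stochastic regression imputation, repeated m times and averaged, for a full multiple-imputation routine. The sample size, correlation, regression coefficients, number of imputations, and the rule that makes x2 more likely to be missing when x1 is large are all assumptions chosen for illustration.

```python
import math
import numpy as np

def simulate(n, rho, beta, rng):
    """Draw two correlated standard-normal covariates and a linear response."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    y = beta[0] + X @ np.asarray(beta[1:]) + rng.normal(size=n)
    return X, y

def impose_mar(X, rate, rng):
    """Delete x2 with probability 2*rate*Phi(x1).

    The probability depends only on the fully observed x1, so the
    mechanism is MAR; its average over x1 ~ N(0, 1) equals `rate`.
    """
    phi = 0.5 * (1.0 + np.vectorize(math.erf)(X[:, 0] / math.sqrt(2.0)))
    miss = rng.random(X.shape[0]) < 2.0 * rate * phi
    X_miss = X.copy()
    X_miss[miss, 1] = np.nan
    return X_miss, miss

def ols(X, y):
    """Least-squares fit with intercept; returns coefficients, MSE, R-squared."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    mse = resid @ resid / (len(y) - A.shape[1])
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return coef, mse, r2

def impute_and_pool(X_miss, y, miss, m, rng):
    """Simplified MI: m stochastic regression imputations, results averaged."""
    obs = ~miss
    Z = np.column_stack([X_miss[:, 0], y])           # predictors of x2
    imp_coef, imp_mse, _ = ols(Z[obs], X_miss[obs, 1])
    fits = []
    for _ in range(m):
        X_imp = X_miss.copy()
        pred = imp_coef[0] + Z[miss] @ imp_coef[1:]
        noise = rng.normal(scale=math.sqrt(imp_mse), size=miss.sum())
        X_imp[miss, 1] = pred + noise                # fill in missing x2
        fits.append(ols(X_imp, y))
    coefs = np.mean([f[0] for f in fits], axis=0)
    mse = float(np.mean([f[1] for f in fits]))
    r2 = float(np.mean([f[2] for f in fits]))
    return coefs, mse, r2

rng = np.random.default_rng(0)
X, y = simulate(2000, 0.5, (1.0, 2.0, 3.0), rng)     # assumed settings
results = {}
for rate in (0.1, 0.2, 0.3, 0.4):                    # levels from the abstract
    X_miss, miss = impose_mar(X, rate, rng)
    results[rate] = (miss.mean(),) + impute_and_pool(X_miss, y, miss, 5, rng)
    print(rate, results[rate])
```

Because this sketch reuses one fixed imputation model across all m draws, it understates the between-imputation variability that a proper MI procedure would carry, so it should be read as a way to explore the design rather than as a reproduction of the reported results.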

Similar Items