A comparison of model-based imputation methods for handling missing predictor values in a linear regression model: A simulation study

In regression analysis, missing covariate data is a common problem. Many researchers use ad hoc methods to handle it because they are easy to implement. However, these methods rely on assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likeli...


Bibliographic Details
Published in: AIP Conference Proceedings
Main Authors: Hasan H.; Ahmad S.; Osman B.M.; Sapri S.; Othman N.
Format: Conference paper
Language: English
Published: American Institute of Physics Inc. 2017
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85028346649&doi=10.1063%2f1.4995930&partnerID=40&md5=92f9ba68656bd545d34b8ab0716797e9
id 2-s2.0-85028346649
spelling 2-s2.0-85028346649
1870

10.1063/1.4995930
In regression analysis, missing covariate data is a common problem. Many researchers use ad hoc methods to handle it because they are easy to implement. However, these methods rely on assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likelihood (ML) using the expectation-maximization (EM) algorithm and Multiple Imputation (MI) are more promising for dealing with the difficulties caused by missing data. Even so, an inappropriate imputation method can introduce serious bias into the parameter estimates. The main objective of this study is to provide a better understanding of missing data concepts, to help researchers select an appropriate imputation method. A simulation study was performed to assess the effects of different missing data techniques on the performance of a regression model. The covariate data were generated from an underlying multivariate normal distribution, and the dependent variable was generated as a combination of the explanatory variables. Missing values in the covariates were simulated under a missing at random (MAR) mechanism. Four levels of missingness (10%, 20%, 30% and 40%) were imposed. The ML and MI techniques available within SAS software were investigated. A linear regression model was fitted and the model performance measures, MSE and R-squared, were obtained. The results showed that MI is superior in handling missing data, with the highest R-squared and lowest MSE, when the percentage of missingness is less than 30%. Both methods are unable to handle levels of missingness larger than 30%. © 2017 Author(s).
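The simulation design described in the abstract can be sketched roughly as follows. This is an illustrative Python/NumPy sketch only, not the authors' SAS code: the sample size, covariance matrix, regression coefficients, the logistic weighting used to induce MAR, and the function names (ols, impute_once, run) are all assumptions made for the example. The imputation step is a simplified stochastic regression imputation repeated m times with averaged results, which stands in for, but is not equivalent to, a full MI procedure.

```python
import numpy as np


def ols(X, y):
    """Fit OLS with an intercept; return coefficients, MSE and R-squared."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sse = resid @ resid
    mse = sse / (len(y) - A.shape[1])
    r2 = 1.0 - sse / np.sum((y - y.mean()) ** 2)
    return coef, mse, r2


def impute_once(X, y, miss, rng):
    """Stochastic regression imputation of x1 from x2, x3 and y.

    Simplification (assumption): regression parameters are not redrawn
    from a posterior, so this is not a fully proper multiple imputation.
    """
    obs = ~miss
    Z = np.column_stack([X[:, 1:], y])           # fully observed predictors
    coef, mse, _ = ols(Z[obs], X[obs, 0])        # fit on complete cases
    X_completed = X.copy()
    pred = coef[0] + Z[miss] @ coef[1:]
    X_completed[miss, 0] = pred + rng.normal(0.0, np.sqrt(mse), miss.sum())
    return X_completed


def run(miss_rate, n=500, m=5, seed=0):
    """One replicate: generate data, impose MAR, impute m times, refit."""
    rng = np.random.default_rng(seed)
    # Hypothetical covariance and coefficients, chosen for illustration.
    cov = np.array([[1.0, 0.5, 0.3],
                    [0.5, 1.0, 0.4],
                    [0.3, 0.4, 1.0]])
    X = rng.multivariate_normal(np.zeros(3), cov, size=n)
    y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 1.0, n)

    # MAR: the chance that x1 is missing depends only on the observed x2.
    w = 1.0 / (1.0 + np.exp(-X[:, 1]))
    k = int(round(miss_rate * n))
    idx = rng.choice(n, size=k, replace=False, p=w / w.sum())
    miss = np.zeros(n, dtype=bool)
    miss[idx] = True
    X_mis = X.copy()
    X_mis[miss, 0] = np.nan

    # Analyse each completed data set, then average the results.
    fits = [ols(impute_once(X_mis, y, miss, rng), y) for _ in range(m)]
    return {
        "missing_fraction": float(miss.mean()),
        "coef": np.mean([f[0] for f in fits], axis=0),
        "mse": float(np.mean([f[1] for f in fits])),
        "r2": float(np.mean([f[2] for f in fits])),
    }


if __name__ == "__main__":
    for rate in (0.1, 0.2, 0.3, 0.4):
        res = run(rate)
        print(rate, round(res["mse"], 3), round(res["r2"], 3))
```

Looping over the four missingness levels, as in the last lines, mirrors the 10% to 40% design in the abstract. A real study would repeat each setting over many replicates and use a complete MI or ML (EM) implementation.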
American Institute of Physics Inc.
0094243X
English
Conference paper

publishDate 2017
container_title AIP Conference Proceedings
container_volume 1870
container_issue
doi_str_mv 10.1063/1.4995930
publisher American Institute of Physics Inc.
issn 0094243X
language English
format Conference paper
accesstype
record_format scopus
collection Scopus
_version_ 1809677908755611648