A comparison of model-based imputation methods for handling missing predictor values in a linear regression model: A simulation study
In regression analysis, missing covariate data has been a common problem. Many researchers use ad hoc methods to overcome this problem due to the ease of implementation. However, these methods require assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likeli...
Published in: | AIP Conference Proceedings |
---|---|
Main Author: | |
Format: | Conference paper |
Language: | English |
Published: | American Institute of Physics Inc. 2017 |
Online Access: | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85028346649&doi=10.1063%2f1.4995930&partnerID=40&md5=92f9ba68656bd545d34b8ab0716797e9 |
id |
2-s2.0-85028346649 |
---|---|
author |
Hasan H.; Ahmad S.; Osman B.M.; Sapri S.; Othman N. |
title |
A comparison of model-based imputation methods for handling missing predictor values in a linear regression model: A simulation study |
publishDate |
2017 |
container_title |
AIP Conference Proceedings |
container_volume |
1870 |
container_issue |
|
doi_str_mv |
10.1063/1.4995930 |
url |
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85028346649&doi=10.1063%2f1.4995930&partnerID=40&md5=92f9ba68656bd545d34b8ab0716797e9 |
description |
In regression analysis, missing covariate data are a common problem. Many researchers use ad hoc methods to handle it because they are easy to implement. However, these methods rely on assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likelihood (ML) using the expectation-maximization (EM) algorithm and Multiple Imputation (MI) are more promising for dealing with the difficulties caused by missing data. Even so, an inappropriate imputation method can introduce serious bias that severely affects the parameter estimates. The main objective of this study is to provide a better understanding of missing data concepts so that researchers can select an appropriate imputation method. A simulation study was performed to assess the effects of different missing data techniques on the performance of a regression model. The covariate data were generated from an underlying multivariate normal distribution, and the dependent variable was generated as a combination of the explanatory variables. Missing values in the covariates were simulated under a missing at random (MAR) mechanism. Four levels of missingness (10%, 20%, 30% and 40%) were imposed. The ML and MI techniques available in SAS software were investigated. A linear regression model was fitted and two model performance measures, MSE and R-squared, were obtained. The results showed that MI is superior in handling missing data, with the highest R-squared and lowest MSE, when the percentage of missingness is less than 30%. Both methods are unable to handle levels of missingness larger than 30%. © 2017 Author(s). |
publisher |
American Institute of Physics Inc. |
issn |
0094243X |
language |
English |
format |
Conference paper |
accesstype |
|
record_format |
scopus |
collection |
Scopus |
_version_ |
1809677908755611648 |
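
The simulation design summarized in the description field can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' SAS code: the sample size, covariance matrix, regression coefficients, the logistic MAR rule, and the function names `simulate`, `ols` and `multiple_imputation` are all assumptions made for the example. Only one covariate is given missing values, the imputation step is a simplified bootstrap-based stochastic regression imputation, and the ML/EM comparison is not shown.

```python
import numpy as np


def simulate(n=500, miss_rate=0.2, seed=0):
    """Draw MVN covariates, a linear response, and MAR missingness in x1.

    All numeric settings here are assumptions chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.5, 0.3],
                    [0.5, 1.0, 0.4],
                    [0.3, 0.4, 1.0]])
    X = rng.multivariate_normal(np.zeros(3), cov, size=n)
    y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 1.0, n)
    # MAR: the probability that x1 is missing depends only on x2, which is
    # always observed. The factor 2 makes the average rate about miss_rate.
    p_miss = 2.0 * miss_rate / (1.0 + np.exp(-2.0 * X[:, 1]))
    X_miss = X.copy()
    X_miss[rng.random(n) < p_miss, 0] = np.nan
    return X_miss, y, X


def ols(X, y):
    """Least-squares fit with intercept; returns (coefficients, MSE, R-squared)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = resid @ resid
    mse = sse / (len(y) - A.shape[1])
    r2 = 1.0 - sse / np.sum((y - y.mean()) ** 2)
    return beta, mse, r2


def multiple_imputation(X_miss, y, m=5, seed=1):
    """Simplified MI: impute x1 m times, fit the model each time, average."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(X_miss[:, 0])
    Z = np.column_stack([X_miss[:, 1:], y])  # predictors of x1: x2, x3, y
    obs_idx = np.flatnonzero(~miss)
    fits = []
    for _ in range(m):
        # Bootstrap the observed cases so that each imputation model
        # reflects uncertainty in its own parameters.
        boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
        g, s2, _ = ols(Z[boot], X_miss[boot, 0])
        X_imp = X_miss.copy()
        pred = g[0] + Z[miss] @ g[1:]
        # Add residual noise so imputed values are draws, not fitted means.
        X_imp[miss, 0] = pred + rng.normal(0.0, np.sqrt(s2), miss.sum())
        fits.append(ols(X_imp, y))
    beta = np.mean([f[0] for f in fits], axis=0)
    mse = float(np.mean([f[1] for f in fits]))
    r2 = float(np.mean([f[2] for f in fits]))
    return beta, mse, r2


# Repeat the comparison at the four missingness levels named in the abstract.
for rate in (0.1, 0.2, 0.3, 0.4):
    X_miss, y, _ = simulate(miss_rate=rate)
    beta, mse, r2 = multiple_imputation(X_miss, y)
    print(f"missing={rate:.0%}  MSE={mse:.3f}  R2={r2:.3f}")
```

Averaging the coefficients across imputations corresponds to the point-estimate part of Rubin's rules; pooled standard errors, which a full MI analysis would also report, are left out to keep the sketch short.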