A comparison of model-based imputation methods for handling missing predictor values in a linear regression model: A simulation study

Bibliographic Details
Published in: AIP Conference Proceedings
Main Author: Hasan H.; Ahmad S.; Osman B.M.; Sapri S.; Othman N.
Format: Conference paper
Language: English
Published: American Institute of Physics Inc. 2017
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85028346649&doi=10.1063%2f1.4995930&partnerID=40&md5=92f9ba68656bd545d34b8ab0716797e9
Description
Summary: Missing covariate data is a common problem in regression analysis. Many researchers handle it with ad hoc methods because they are easy to implement, but these methods rest on assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likelihood (ML) using the expectation-maximization (EM) algorithm and Multiple Imputation (MI) are more promising for dealing with the difficulties caused by missing data. Even so, an inappropriate imputation method can introduce serious bias into the parameter estimates. The main objective of this study is to provide a better understanding of missing data concepts so that researchers can select appropriate imputation methods. A simulation study was performed to assess the effects of different missing data techniques on the performance of a regression model. The covariate data were generated from an underlying multivariate normal distribution, and the dependent variable was generated as a combination of the explanatory variables. Missing values in the covariates were simulated under a missing at random (MAR) mechanism at four levels of missingness (10%, 20%, 30% and 40%). The ML and MI techniques available in SAS software were investigated. A linear regression model was fitted and two performance measures, mean squared error (MSE) and R-squared, were obtained. The results showed that MI is superior in handling missing data, with the highest R-squared and lowest MSE, when the percentage of missingness is less than 30%. Neither method was able to handle missingness levels above 30%. © 2017 Author(s).
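The simulation design described in the summary could be sketched roughly as follows. This is not the authors' SAS code: the sample size, covariance matrix, regression coefficients, the logistic weights used for the MAR mechanism, and the helper names (`simulate_data`, `impose_mar`, `impute_once`, `fit_ols`) are all illustrative assumptions. The imputation step is a simplified stochastic regression imputation repeated several times; it stands in for the MI procedure and does not draw the imputation-model parameters, so it is not a proper MI.

```python
import numpy as np


def simulate_data(n=500, seed=0):
    """Covariates from a multivariate normal; y as a linear combination plus noise."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.5, 0.3],
                    [0.5, 1.0, 0.4],
                    [0.3, 0.4, 1.0]])          # assumed covariance
    X = rng.multivariate_normal(np.zeros(3), cov, size=n)
    beta = np.array([1.0, 0.5, -0.5])          # assumed coefficients
    y = 2.0 + X @ beta + rng.normal(0.0, 1.0, size=n)
    return X, y


def impose_mar(X, rate, seed=0):
    """Delete values in column 0; the chance of deletion depends only on the
    fully observed column 1, which makes the mechanism MAR."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = 1.0 / (1.0 + np.exp(-X[:, 1]))         # larger x2 -> more likely missing
    idx = rng.choice(n, size=int(round(rate * n)), replace=False, p=w / w.sum())
    X_miss = X.copy()
    X_miss[idx, 0] = np.nan
    return X_miss


def impute_once(X_miss, y, rng):
    """Stochastic regression imputation of column 0 from the other covariates and y."""
    miss = np.isnan(X_miss[:, 0])
    Z = np.column_stack([np.ones(len(y)), X_miss[:, 1:], y])
    g, *_ = np.linalg.lstsq(Z[~miss], X_miss[~miss, 0], rcond=None)
    resid = X_miss[~miss, 0] - Z[~miss] @ g
    sigma = resid.std(ddof=Z.shape[1])
    X_imp = X_miss.copy()
    X_imp[miss, 0] = Z[miss] @ g + rng.normal(0.0, sigma, size=int(miss.sum()))
    return X_imp


def fit_ols(X, y):
    """Fit a linear regression with intercept; return coefficients, MSE, R-squared."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sse = float(resid @ resid)
    mse = sse / (len(y) - A.shape[1])
    r2 = 1.0 - sse / float(np.sum((y - y.mean()) ** 2))
    return coef, mse, r2


if __name__ == "__main__":
    X, y = simulate_data()
    for rate in (0.10, 0.20, 0.30, 0.40):      # missingness levels from the summary
        X_miss = impose_mar(X, rate, seed=1)
        rng = np.random.default_rng(2)
        fits = [fit_ols(impute_once(X_miss, y, rng), y) for _ in range(5)]
        mse = np.mean([f[1] for f in fits])
        r2 = np.mean([f[2] for f in fits])
        print(f"{rate:.0%} missing: MSE={mse:.3f}, R2={r2:.3f}")
```

Averaging MSE and R-squared over the imputed data sets is a simplification chosen for brevity; a full analysis would pool the coefficient estimates and their variances across imputations.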
ISSN: 0094-243X
DOI: 10.1063/1.4995930