Estimation of missing values in air pollution dataset by using various imputation methods

Bibliographic Details
Published in: International Journal of Conservation Science
Authors: Sukatis F.F.; Noor N.M.; Zakaria N.A.; Ul-Saufie A.Z.; Suwardi A.
Format: Article
Language: English
Published: Alexandru Ioan Cuza University of Iasi, 2019
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85087395821&partnerID=40&md5=a921f877d0da2d9d3c0a31ace6b34297
ID: 2-s2.0-85087395821
Volume: 10
Issue: 4
ISSN: 2067-533X
Abstract: The aim of this study is to determine the best imputation method to fill the various gaps of missing values in an air pollution dataset. Ten imputation methods were applied to fill in the missing values: Series Mean, Linear Interpolation, Mean Nearest Neighbour, Expectation Maximization, Markov Chain Monte Carlo, 12-hour Moving Average, 24-hour Moving Average, and Exponential Smoothing (α = 0.2, 0.5, and 0.8). Hourly monitoring data for ambient temperature, wind speed, humidity, SO2, NO2, O3, CO, and PM10 from Petaling Jaya and Shah Alam, covering 2012 to 2016, were used. Missing data were simulated in these datasets under three patterns that differ in gap length: simple, medium, and complex. Each pattern was simulated at two percentages of missing data, 10% and 20%. The performance of the imputation methods was evaluated using four performance indicators: mean absolute error, root mean squared error, prediction accuracy, and index of agreement. Overall, the Expectation Maximization method was selected as the best imputation method for the simple, medium, and complex patterns of simulated missing data, while the Series Mean method performed worst. © 2020 Universitatea "Alexandru Ioan Cuza" din Iasi.
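
The simpler imputation methods named in the abstract can be sketched in Python with pandas. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `simulate_gaps` and `impute_all`, the gap-generation scheme, the centred moving-average window, and the synthetic series are all hypothetical, since the abstract gives no such details. Expectation Maximization, Markov Chain Monte Carlo, and Mean Nearest Neighbour are left out because the abstract does not specify how they were configured.

```python
import numpy as np
import pandas as pd


def simulate_gaps(series, frac, max_gap, seed=0):
    """Blank out roughly `frac` of the values in runs of 1..max_gap points.

    Hypothetical helper: the abstract does not say how gaps were generated.
    """
    rng = np.random.default_rng(seed)
    n = len(series)
    target = int(round(frac * n))
    mask = np.zeros(n, dtype=bool)
    while mask.sum() < target:
        gap = int(rng.integers(1, max_gap + 1))
        # Start at index >= 1 and stop before the last point, so both
        # endpoints stay observed and interpolation always has anchors.
        start = int(rng.integers(1, n - gap))
        mask[start:start + gap] = True
    gappy = series.astype(float).mask(mask)  # masked positions become NaN
    return gappy, mask


def impute_all(gappy):
    """Return a dict of imputed copies of `gappy`, one per method."""
    overall_mean = gappy.mean()  # mean of the observed values only
    filled = {
        "series_mean": gappy.fillna(overall_mean),
        "linear_interpolation": gappy.interpolate(method="linear"),
    }
    # Moving average (assumed centred): mean of the observed values inside
    # the window; fall back to the series mean if the window is all-missing.
    for hours in (12, 24):
        window_mean = gappy.rolling(window=hours, center=True, min_periods=1).mean()
        filled[f"moving_average_{hours}h"] = gappy.fillna(window_mean).fillna(overall_mean)
    # Exponential smoothing (one-sided): a gap is filled with the last
    # smoothed level computed from the observations before it.
    for alpha in (0.2, 0.5, 0.8):
        smoothed = gappy.ewm(alpha=alpha, adjust=False, ignore_na=True).mean()
        filled[f"exp_smoothing_{alpha}"] = gappy.fillna(smoothed)
    return filled


# Synthetic hourly signal with a daily cycle (illustrative, not the study's data).
rng = np.random.default_rng(42)
t = np.arange(24 * 30)
truth = pd.Series(50 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, t.size))

gappy, mask = simulate_gaps(truth, frac=0.10, max_gap=6)
filled = impute_all(gappy)
```

Raising `frac` to 0.20 or increasing `max_gap` gives a rough stand-in for the higher missing percentage and the longer-gap patterns described in the abstract.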
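
The four performance indicators named in the abstract can be computed as in the sketch below. Mean absolute error and root mean squared error follow their standard definitions, and the index of agreement follows Willmott's formula. The abstract gives no formula for prediction accuracy, so taking it as the Pearson correlation between observed and imputed values is an assumption here, and the function names are hypothetical.

```python
import numpy as np


def mae(observed, predicted):
    """Mean absolute error."""
    o, p = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(p - o)))


def rmse(observed, predicted):
    """Root mean squared error."""
    o, p = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((p - o) ** 2)))


def index_of_agreement(observed, predicted):
    """Willmott's index of agreement: 1 means perfect agreement."""
    o, p = np.asarray(observed, float), np.asarray(predicted, float)
    o_mean = o.mean()
    denom = np.sum((np.abs(p - o_mean) + np.abs(o - o_mean)) ** 2)
    return float(1.0 - np.sum((p - o) ** 2) / denom)


def prediction_accuracy(observed, predicted):
    """ASSUMPTION: Pearson correlation between observed and predicted."""
    o, p = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.corrcoef(o, p)[0, 1])


# Tiny illustrative input (made-up numbers, not results from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
scores = {
    "MAE": mae(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "IA": index_of_agreement(observed, predicted),
    "PA": prediction_accuracy(observed, predicted),
}
```

In an imputation study these scores would normally be computed only at the positions that were artificially removed, comparing the imputed values with the originals that were held back.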