Exploratory analysis on the performance of K-means, Kmeans.fd and K-median in clustering contaminated PM10 functional data

The ultimate goal of clustering analysis is to group observations into clusters where there is maximum similarity within a cluster and dissimilarity between clusters. Due to its effectiveness and simplicity, K-means has been one of the most widely used methods for grouping multivariate data. However, th...


Bibliographic Details
Published in: AIP Conference Proceedings
Main Author: Kamarulzalis A.H.; Shaadan N.; Deni S.M.
Format: Conference paper
Language: English
Published: American Institute of Physics 2024
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85203193589&doi=10.1063%2f5.0224189&partnerID=40&md5=04ff800bb2b432b15da598e560f841d3
id 2-s2.0-85203193589
spelling 2-s2.0-85203193589
Kamarulzalis A.H.; Shaadan N.; Deni S.M.
Exploratory analysis on the performance of K-means, Kmeans.fd and K-median in clustering contaminated PM10 functional data
2024
AIP Conference Proceedings
3123
1
10.1063/5.0224189
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85203193589&doi=10.1063%2f5.0224189&partnerID=40&md5=04ff800bb2b432b15da598e560f841d3
The ultimate goal of cluster analysis is to group observations into clusters so that similarity within each cluster and dissimilarity between clusters are both maximized. Owing to its effectiveness and simplicity, K-means has been one of the most widely used methods for grouping multivariate data. However, the K-means algorithm is believed to be sensitive to outliers. This study examines the capability of the K-means algorithm for clustering functional data (curves), especially when the data contain outliers. Its performance is compared with that of K-median and Kmeans.fd when clustering PM10 daily functional data (curves). The methodology involved first building the curve data set using the B-spline basis expansion technique. Then, three data sets were created for comparative analysis: the full set with contaminated data, the normal group, and the sparse (outliers) group. The RobMah technique was used to identify the outliers (sparse) group. The Rand index (RI) was used to evaluate clustering performance. Using a pre-determined number of clusters, K = 2, obtained from NbClust, the results show that K-means works best for the full data set with contaminated data (i.e., moderately compact curves). Meanwhile, K-median works best for the outliers (sparse) group (i.e., curves with low compactness), and Kmeans.fd works best for normal curves (i.e., highly compact curves). The results indicate that each method's performance depends on the degree of curve compactness, which is in turn described by the degree of outlyingness in the data set. The study also concludes that the optimum K = 2 provided by NbClust is insufficient and that a better algorithm for determining the number of clusters K is needed when applying K-means to contaminated functional data, because K = 2 ignores other hidden significant clusters. With K = 2, the algorithms can only distinguish between the group of normal curves and the high-magnitude outliers, not the low-magnitude ones. Consequently, when the data set contains outliers, a hybrid approach that combines functional outlier trimming, filtering, buffering, or sample-reducing techniques is recommended to improve the K-means algorithm for clustering functional data. © 2024 Author(s).
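The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes each curve has already been evaluated on a common grid, it stands in plain Lloyd-style iterations for K-means (coordinate-wise mean, squared Euclidean distance) and K-median (coordinate-wise median, absolute distance), and it does not reproduce Kmeans.fd, RobMah, or NbClust. The names `rand_index`, `cluster_curves`, and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean, median


def rand_index(labels_a, labels_b):
    """Rand index: share of observation pairs on which two partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two observations are required")
    # A pair agrees when both partitions put it together, or both split it.
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def _distance(curve, center, use_median):
    """Absolute (L1) distance for K-median, squared Euclidean for K-means."""
    if use_median:
        return sum(abs(x - c) for x, c in zip(curve, center))
    return sum((x - c) ** 2 for x, c in zip(curve, center))


def cluster_curves(curves, init_centers, use_median=False, max_iter=100):
    """Lloyd-style clustering of discretized curves from given initial centers."""
    centers = [list(c) for c in init_centers]
    aggregate = median if use_median else mean
    labels = None
    for _ in range(max_iter):
        # Assignment step: each curve goes to its nearest center.
        new_labels = [
            min(range(len(centers)),
                key=lambda k: _distance(curve, centers[k], use_median))
            for curve in curves
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move
        labels = new_labels
        # Update step: recompute each center point by point along the grid.
        for k in range(len(centers)):
            members = [c for c, lab in zip(curves, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [aggregate(col) for col in zip(*members)]
    return labels, centers


# Toy data (made up): three low-level curves and two high-magnitude curves.
toy_curves = [
    [1.0, 1.1, 0.9], [1.2, 1.0, 1.1], [0.9, 1.0, 1.0],
    [5.0, 5.2, 4.9], [5.1, 4.8, 5.0],
]
toy_truth = [0, 0, 0, 1, 1]
toy_init = [toy_curves[0], toy_curves[3]]

kmeans_labels, _ = cluster_curves(toy_curves, toy_init)
kmedian_labels, _ = cluster_curves(toy_curves, toy_init, use_median=True)
print("K-means RI: ", rand_index(kmeans_labels, toy_truth))
print("K-median RI:", rand_index(kmedian_labels, toy_truth))
```

Passing the initial centers explicitly keeps the sketch deterministic; a real comparison would repeat the runs over several random starts and use reference labels such as the normal versus outlier grouping mentioned in the abstract.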
American Institute of Physics
0094243X
English
Conference paper

author Kamarulzalis A.H.; Shaadan N.; Deni S.M.
title Exploratory analysis on the performance of K-means, Kmeans.fd and K-median in clustering contaminated PM10 functional data
publishDate 2024
container_title AIP Conference Proceedings
container_volume 3123
container_issue 1
doi_str_mv 10.1063/5.0224189
url https://www.scopus.com/inward/record.uri?eid=2-s2.0-85203193589&doi=10.1063%2f5.0224189&partnerID=40&md5=04ff800bb2b432b15da598e560f841d3
publisher American Institute of Physics
issn 0094243X
language English
format Conference paper
accesstype
record_format scopus
collection Scopus
_version_ 1812871793994629120