Exploratory analysis on the performance of K-means, Kmeans.fd and K-median in clustering contaminated PM10 functional data

The ultimate goal of clustering analysis is to group observations into clusters with maximum similarity within each cluster and maximum dissimilarity between clusters. Owing to its effectiveness and simplicity, K-means has been one of the most widely used methods for grouping multivariate data. However, th...


Bibliographic Details
Published in:	AIP Conference Proceedings
Main Author:	Kamarulzalis A.H.; Shaadan N.; Deni S.M.
Format:	Conference paper
Language:	English
Published:	American Institute of Physics 2024
Online Access:	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85203193589&doi=10.1063%2f5.0224189&partnerID=40&md5=04ff800bb2b432b15da598e560f841d3

id	2-s2.0-85203193589
author	Kamarulzalis A.H.; Shaadan N.; Deni S.M.
title	Exploratory analysis on the performance of K-means, Kmeans.fd and K-median in clustering contaminated PM10 functional data
publishDate	2024
container_title	AIP Conference Proceedings
container_volume	3123
container_issue	1
doi_str_mv	10.1063/5.0224189
url	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85203193589&doi=10.1063%2f5.0224189&partnerID=40&md5=04ff800bb2b432b15da598e560f841d3
description	The ultimate goal of clustering analysis is to group observations into clusters with maximum similarity within each cluster and maximum dissimilarity between clusters. Owing to its effectiveness and simplicity, K-means has been one of the most widely used methods for grouping multivariate data. However, the K-means algorithm is believed to be sensitive to outliers. This study examines the K-means algorithm's capability for clustering functional data (curves), especially when the data contain outliers. Its performance is compared with that of K-median and Kmeans.fd when used to cluster PM10 daily functional data (curves). The methodology first built the curve data set using the B-spline basis expansion technique. Then, three data sets were created for comparison: the full set with contaminated data, the normal group, and the sparse (outliers) group. The RobMah technique was used to identify the outliers (sparse) group. The Rand index (RI) was used to evaluate clustering performance. Using a pre-determined number of clusters, K = 2, obtained from NbClust, the results show that K-means works best for the full data set with contaminated data (i.e., moderately compact curves). Meanwhile, K-median works best for the outliers (sparse) group (i.e., low-compactness curves), and Kmeans.fd works best for normal curves (i.e., highly compact curves). The results indicate that each method's performance depends on the degree of curve compactness, which is also described by the degree of outlyingness in the data set. The study also concludes that the optimum K = 2 provided by NbClust is insufficient and that a better algorithm for determining the number of clusters K is needed when applying K-means to contaminated functional data. This is because K = 2 ignores other hidden significant clusters. With K = 2, the algorithms can only distinguish between the group of normal curves and the outliers of high magnitude, not those of low magnitude. Consequently, when the data set contains outliers, hybridizing K-means with functional outlier trimming, filtering, buffering, or sample-reducing techniques is recommended to improve the algorithm for clustering functional data. © 2024 Author(s).
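The abstract above evaluates clustering performance with the Rand index (RI). As a rough illustration of what that measure computes, here is a minimal sketch in plain Python. It is not the authors' code: the function name `rand_index` and the six example labels are hypothetical, and the record does not state which software the study used to compute RI.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the fraction of observation pairs on which two partitions agree.

    A pair "agrees" when both partitions place the two observations in the
    same cluster, or both place them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two observations")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical labels for six curves: a reference grouping (for example,
# normal vs. outlying curves) and the grouping returned by some clustering
# method with K = 2. These values are made up for illustration only.
reference = [0, 0, 0, 1, 1, 1]
clustered = [1, 1, 0, 0, 0, 0]
print(rand_index(reference, clustered))
```

RI ranges from 0 to 1, and it depends only on which observations are grouped together, so relabeling the clusters (swapping 0 and 1) leaves the value unchanged.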
publisher	American Institute of Physics
issn	0094243X
language	English
format	Conference paper
accesstype
record_format	scopus
collection	Scopus
_version_	1812871793994629120


Similar Items