Exploratory analysis on the performance of K-means, Kmeans.fd and K-median in clustering contaminated PM10 functional data

Bibliographic Details
Published in: AIP Conference Proceedings
Main Author: Kamarulzalis A.H.; Shaadan N.; Deni S.M.
Format: Conference paper
Language: English
Published: American Institute of Physics 2024
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85203193589&doi=10.1063%2f5.0224189&partnerID=40&md5=04ff800bb2b432b15da598e560f841d3
Description
Summary: The ultimate goal of cluster analysis is to group observations so that similarity within each cluster and dissimilarity between clusters are both maximized. Owing to its effectiveness and simplicity, K-means has been one of the most widely used methods for grouping multivariate data. However, the K-means algorithm is believed to be sensitive to outliers. This study examines the capability of the K-means algorithm to cluster functional data (curves), especially when the data contain outliers. Its performance is compared with that of K-median and Kmeans.fd in clustering daily PM10 functional data (curves). The methodology first builds the curve data set using the B-spline basis expansion technique. Three data sets are then created for comparison: the full set, which includes contaminated data; the normal group; and the sparse (outlier) group. The RobMah technique is used to identify the outlier (sparse) group, and the Rand index (RI) is used to evaluate clustering performance. Using a predetermined number of clusters, K = 2, obtained from NbClust, the results show that K-means works best for the full data set with contaminated data (i.e., moderately compact curves), K-median works best for the outlier (sparse) group (i.e., low-compactness curves), and Kmeans.fd works best for the normal curves (i.e., highly compact curves). The results indicate that each method's performance depends on the degree of curve compactness, which is in turn described by the degree of outlyingness in the data set. The study also concludes that the optimum K = 2 provided by NbClust is insufficient and that a better algorithm for determining the number of clusters K is needed when applying K-means to contaminated functional data, because K = 2 ignores other hidden significant clusters. With K = 2, the algorithms can distinguish the normal curves only from high-magnitude outliers, not from low-magnitude ones. Consequently, when the data set contains outliers, hybridizing K-means with functional outlier trimming, filtering, buffering, or sample-reducing techniques is recommended to improve its performance in clustering functional data. © 2024 Author(s).
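Note: The summary names the Rand index (RI) as the evaluation measure but does not show how it is computed. As a rough illustration only, the Rand index is the fraction of observation pairs on which two partitions agree, counting a pair as an agreement when both partitions put it in the same cluster or both put it in different clusters. The Python sketch below is not the authors' code; the function name `rand_index` and the example labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of observation pairs on which two partitions agree.

    A pair agrees when both partitions place the two observations in the
    same cluster, or both place them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same observations")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two observations are needed to form a pair")

    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(n), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


if __name__ == "__main__":
    # Hypothetical labels for six curves (illustration only, not paper data):
    # a reference grouping versus the grouping returned by some K = 2 clustering.
    reference = [0, 0, 0, 0, 1, 1]
    predicted = [0, 0, 0, 1, 1, 1]
    print(rand_index(reference, predicted))
```

Because only pairwise co-membership is compared, the index does not depend on how the cluster labels are numbered: swapping the labels 0 and 1 in one partition leaves the value unchanged.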
ISSN: 0094243X
DOI: 10.1063/5.0224189