Exploratory analysis on the performance of K-means, Kmeans.fd and K-median in clustering contaminated PM10 functional data
The ultimate goal of cluster analysis is to group observations into clusters so that similarity is maximal within each cluster and dissimilarity is maximal between clusters. Because of its effectiveness and simplicity, K-means has been one of the most widely used methods for grouping multivariate data. However, th...
Published in: | AIP Conference Proceedings |
---|---|
Main Author: | Kamarulzalis A.H. |
Format: | Conference paper |
Language: | English |
Published: | American Institute of Physics, 2024 |
Online Access: | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85203193589&doi=10.1063%2f5.0224189&partnerID=40&md5=04ff800bb2b432b15da598e560f841d3 |
id |
2-s2.0-85203193589 |
---|---|
author |
Kamarulzalis A.H.; Shaadan N.; Deni S.M. |
title |
Exploratory analysis on the performance of K-means, Kmeans.fd and K-median in clustering contaminated PM10 functional data |
publishDate |
2024 |
container_title |
AIP Conference Proceedings |
container_volume |
3123 |
container_issue |
1 |
doi_str_mv |
10.1063/5.0224189 |
url |
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85203193589&doi=10.1063%2f5.0224189&partnerID=40&md5=04ff800bb2b432b15da598e560f841d3 |
description |
The ultimate goal of cluster analysis is to group observations into clusters so that similarity is maximal within each cluster and dissimilarity is maximal between clusters. Because of its effectiveness and simplicity, K-means has been one of the most widely used methods for grouping multivariate data. However, the K-means algorithm is believed to be sensitive to outliers. This study examines the capability of the K-means algorithm for clustering functional data (curves), especially when the data contain outliers. Its performance is compared with that of K-median and Kmeans.fd when used to cluster daily PM10 functional data (curves). The methodology first built the curve data set using the B-spline basis expansion technique. Then, three data sets were created for comparative analysis: the full set with contaminated data, the normal group, and the sparse (outliers) group. The RobMah technique was used to identify the outliers (sparse) group. The Rand index (RI) was used to evaluate clustering performance. Using a pre-determined number of clusters, K = 2, obtained from NbClust, the results show that K-means works best for the full data set with contaminated data (i.e., moderately compact curves). Meanwhile, K-median works best for the outliers (sparse) group (i.e., low-compactness curves), and Kmeans.fd works best for normal curves (i.e., highly compact curves). The results indicate that each method's performance depends on the degree of curve compactness, which is in turn described by the degree of outlyingness in the data set. The study also concludes that the optimum K = 2 provided by NbClust is insufficient and that a better algorithm for determining the number of clusters K is needed when applying K-means to contaminated functional data. This is because K = 2 ignores other hidden significant clusters. With K = 2, the algorithms can only distinguish the group of normal curves from the high-magnitude outliers, not from the low-magnitude ones. Therefore, when the data set contains outliers, hybridizing K-means with functional outlier trimming, filtering, buffering, or sample-reducing techniques is recommended to improve its performance in clustering functional data. © 2024 Author(s). |
publisher |
American Institute of Physics |
issn |
0094243X |
language |
English |
format |
Conference paper |
accesstype |
|
record_format |
scopus |
collection |
Scopus |
_version_ |
1812871793994629120 |
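
The description above names the Rand index (RI) as the measure used to evaluate clustering performance. As an illustration only, the following is a minimal sketch of how an RI can be computed from two label vectors: it counts the pairs of observations on which the two partitions agree (both place the pair together, or both place it apart) and divides by the total number of pairs. This is not the authors' code; the function name `rand_index` and the label vectors are hypothetical, made-up values rather than data or results from the paper.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Rand index between two partitions of the same observations.

    Returns the fraction of observation pairs on which the two
    partitions agree: both put the pair in the same cluster, or
    both put it in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two observations are required")

    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(n), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


# Hypothetical labels for six curves (illustrative only, not from the study).
reference = [0, 0, 0, 1, 1, 1]   # assumed reference grouping
predicted = [1, 1, 0, 0, 0, 0]   # assumed output of some clustering run

print(rand_index(reference, predicted))
```

The index depends only on which observations are grouped together, not on the label values themselves, so relabelling the clusters leaves it unchanged. An RI of 1 means the two partitions are identical.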