TY - GEN
T1 - Efficient hierarchical clustering algorithms using partially overlapping partitions
AU - Dash, Manoranjan
AU - Liu, Huan
N1 - Publisher Copyright: © Springer-Verlag Berlin Heidelberg 2001.
PY - 2001
Y1 - 2001
N2 - Clustering is an important data exploration task. A prominent clustering algorithm is agglomerative hierarchical clustering. Roughly, in each iteration it merges the closest pair of clusters. It was first proposed as early as 1951, and since then there have been numerous modifications. Some of its good features are: a natural, simple, and non-parametric grouping of similar objects that is capable of finding clusters of different shapes, such as spherical and arbitrary. But large CPU time and high memory requirements limit its use for large data. In this paper we show that geometric metric (centroid, median, and minimum variance) algorithms obey a 90-10 relationship, where roughly the first 90% of iterations are spent on merging clusters with distance less than 10% of the maximum merging distance. This characteristic is exploited by partially overlapping partitioning. It is shown with experiments and analyses that different types of existing algorithms benefit substantially, with drastic reductions in CPU time and memory. Other contributions of this paper include a comparative study of multi-dimensional versus single-dimensional partitioning, and analytical and experimental discussions on setting parameters such as the number of partitions and the dimensions used for partitioning.
AB - Clustering is an important data exploration task. A prominent clustering algorithm is agglomerative hierarchical clustering. Roughly, in each iteration it merges the closest pair of clusters. It was first proposed as early as 1951, and since then there have been numerous modifications. Some of its good features are: a natural, simple, and non-parametric grouping of similar objects that is capable of finding clusters of different shapes, such as spherical and arbitrary. But large CPU time and high memory requirements limit its use for large data. In this paper we show that geometric metric (centroid, median, and minimum variance) algorithms obey a 90-10 relationship, where roughly the first 90% of iterations are spent on merging clusters with distance less than 10% of the maximum merging distance. This characteristic is exploited by partially overlapping partitioning. It is shown with experiments and analyses that different types of existing algorithms benefit substantially, with drastic reductions in CPU time and memory. Other contributions of this paper include a comparative study of multi-dimensional versus single-dimensional partitioning, and analytical and experimental discussions on setting parameters such as the number of partitions and the dimensions used for partitioning.
UR - http://www.scopus.com/inward/record.url?scp=84942935904&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84942935904&partnerID=8YFLogxK
U2 - 10.1007/3-540-45357-1_52
DO - 10.1007/3-540-45357-1_52
M3 - Conference contribution
SN - 3540419101
SN - 9783540419105
T3 - Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science)
SP - 495
EP - 506
BT - Advances in Knowledge Discovery and Data Mining - 5th Pacific-Asia Conference, PAKDD 2001, Proceedings
A2 - Cheung, David
A2 - Williams, Graham J.
A2 - Li, Qing
PB - Springer Verlag
T2 - 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2001
Y2 - 16 April 2001 through 18 April 2001
ER -