TY - GEN
T1 - Feature selection for clustering - A filter solution
AU - Dash, Manoranjan
AU - Choi, Kiseok
AU - Scheuermann, Peter
AU - Liu, Huan
N1 - Copyright 2010 Elsevier B.V. All rights reserved.
PY - 2002
Y1 - 2002
AB - Processing applications with a large number of dimensions has been a challenge to the KDD community. Feature selection, an effective dimensionality reduction technique, is an essential pre-processing method to remove noisy features. Only a few methods for feature selection for clustering have been proposed in the literature, and almost all of them are 'wrapper' techniques that require a clustering algorithm to evaluate the candidate feature subsets. The wrapper approach is largely unsuitable in real-world applications due to its heavy reliance on clustering algorithms that require parameters such as the number of clusters, and due to the lack of suitable clustering criteria to evaluate clustering in different subspaces. In this paper we propose a 'filter' method that is independent of any clustering algorithm. The proposed method is based on the observation that data with clusters has a very different point-to-point distance histogram than data without clusters. Using this observation, we propose an entropy measure that is low if the data has distinct clusters and high otherwise. The entropy measure is suitable for selecting the most important subset of features because it is invariant to the number of dimensions and is affected only by the quality of clustering. Extensive performance evaluation over synthetic, benchmark, and real datasets shows its effectiveness.
UR - http://www.scopus.com/inward/record.url?scp=78149289039&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78149289039&partnerID=8YFLogxK
M3 - Conference contribution
SN - 0769517544
SN - 9780769517544
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 115
EP - 122
BT - Proceedings - 2002 IEEE International Conference on Data Mining, ICDM 2002
T2 - 2nd IEEE International Conference on Data Mining, ICDM '02
Y2 - 9 December 2002 through 12 December 2002
ER -