TY - GEN
T1 - Feature subset selection bias for classification learning
AU - Singhi, Surendra K.
AU - Liu, Huan
PY - 2006
Y1 - 2006
N2 - Feature selection is often applied to high-dimensional data prior to classification learning. Using the same training dataset in both selection and learning can result in so-called feature subset selection bias. This bias putatively can exacerbate data over-fitting and negatively affect classification performance. However, in current practice separate datasets are seldom employed for selection and learning, because dividing the training data into two datasets for feature selection and classifier learning respectively reduces the amount of data that can be used in either task. This work attempts to address this dilemma. We formalize selection bias for classification learning, analyze its statistical properties, and study factors that affect selection bias, as well as how the bias impacts classification learning via various experiments. This research endeavors to provide illustration and explanation why the bias may not cause negative impact in classification as much as expected in regression.
AB - Feature selection is often applied to high-dimensional data prior to classification learning. Using the same training dataset in both selection and learning can result in so-called feature subset selection bias. This bias putatively can exacerbate data over-fitting and negatively affect classification performance. However, in current practice separate datasets are seldom employed for selection and learning, because dividing the training data into two datasets for feature selection and classifier learning respectively reduces the amount of data that can be used in either task. This work attempts to address this dilemma. We formalize selection bias for classification learning, analyze its statistical properties, and study factors that affect selection bias, as well as how the bias impacts classification learning via various experiments. This research endeavors to provide illustration and explanation why the bias may not cause negative impact in classification as much as expected in regression.
UR - http://www.scopus.com/inward/record.url?scp=34250694929&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34250694929&partnerID=8YFLogxK
U2 - 10.1145/1143844.1143951
DO - 10.1145/1143844.1143951
M3 - Conference contribution
SN - 1595933832
SN - 9781595933836
T3 - ACM International Conference Proceeding Series
SP - 849
EP - 856
BT - ACM International Conference Proceeding Series - Proceedings of the 23rd International Conference on Machine Learning, ICML 2006
T2 - 23rd International Conference on Machine Learning, ICML 2006
Y2 - 25 June 2006 through 29 June 2006
ER -