TY - GEN
T1 - Bias analysis in text classification for highly skewed data
AU - Tang, Lei
AU - Liu, Huan
PY - 2005
Y1 - 2005
AB - Feature selection is often applied to high-dimensional data as a preprocessing step in text classification. When dealing with highly skewed data, we observe that typical feature selection metrics such as information gain and chi-squared are biased toward selecting features for the minor class, whereas bi-normal separation can select features for both the minor and major classes. In this work, we use bias analysis to investigate how these feature selection metrics affect the performance of frequently used classifiers such as Decision Trees, Naïve Bayes, and Support Vector Machines on highly skewed data. We consider three types of bias: metric bias, class bias, and classifier bias. Extensive experiments are designed to understand how these biases can be employed together efficiently to achieve good classification performance. We report our findings and recommend approaches to text classification based on the bias analysis and the empirical study.
UR - http://www.scopus.com/inward/record.url?scp=34548548958&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34548548958&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2005.34
DO - 10.1109/ICDM.2005.34
M3 - Conference contribution
SN - 0769522785
SN - 9780769522784
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 781
EP - 784
BT - Proceedings - Fifth IEEE International Conference on Data Mining, ICDM 2005
T2 - 5th IEEE International Conference on Data Mining, ICDM 2005
Y2 - 27 November 2005 through 30 November 2005
ER -