TY - GEN
T1 - Modelling classification performance for large data sets
T2 - 2nd International Conference on Web-Age Information Management, WAIM 2001
AU - Gu, Baohua
AU - Hu, Feifang
AU - Liu, Huan
N1 - Publisher Copyright: © Springer-Verlag Berlin Heidelberg 2001.
PY - 2001
Y1 - 2001
N2 - For many learning algorithms, learning accuracy increases with the size of the training data, forming the well-known learning curve. Usually a learning curve can be fitted by interpolating or extrapolating some points on it with a specified model. The fitted learning curve can then be used to predict the maximum achievable learning accuracy, or to estimate the amount of data needed to achieve an expected learning accuracy, both of which are especially meaningful for data mining on large data sets. Although some models have been proposed for learning curves, most have not been tested for applicability to large data sets. In this paper, we focus on this issue. We empirically compare six potentially useful models by fitting learning curves of two typical classification algorithms, C4.5 (decision tree) and LOG (logistic discrimination), on eight large UCI benchmark data sets. Using all available data for learning, we fit a full-length learning curve; using a small portion of the data, we fit a part-length learning curve. The models are then compared on two criteria: (1) how well they fit a full-length learning curve, and (2) how well a fitted part-length learning curve predicts learning accuracy at the full length. Experimental results show that the power law (y = a - b*x^(-c)) is the best of the six models on both criteria, for both algorithms and all data sets. These results support the applicability of learning curves to data mining.
AB - For many learning algorithms, learning accuracy increases with the size of the training data, forming the well-known learning curve. Usually a learning curve can be fitted by interpolating or extrapolating some points on it with a specified model. The fitted learning curve can then be used to predict the maximum achievable learning accuracy, or to estimate the amount of data needed to achieve an expected learning accuracy, both of which are especially meaningful for data mining on large data sets. Although some models have been proposed for learning curves, most have not been tested for applicability to large data sets. In this paper, we focus on this issue. We empirically compare six potentially useful models by fitting learning curves of two typical classification algorithms, C4.5 (decision tree) and LOG (logistic discrimination), on eight large UCI benchmark data sets. Using all available data for learning, we fit a full-length learning curve; using a small portion of the data, we fit a part-length learning curve. The models are then compared on two criteria: (1) how well they fit a full-length learning curve, and (2) how well a fitted part-length learning curve predicts learning accuracy at the full length. Experimental results show that the power law (y = a - b*x^(-c)) is the best of the six models on both criteria, for both algorithms and all data sets. These results support the applicability of learning curves to data mining.
UR - http://www.scopus.com/inward/record.url?scp=84974711038&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84974711038&partnerID=8YFLogxK
U2 - 10.1007/3-540-47714-4_29
DO - 10.1007/3-540-47714-4_29
M3 - Conference contribution
SN - 9783540477143
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 317
EP - 328
BT - Advances in Web-Age Information Management - 2nd International Conference, WAIM 2001, Proceedings
A2 - Wang, X. Sean
A2 - Yu, Ge
A2 - Lu, Hongjun
PB - Springer Verlag
Y2 - 9 July 2001 through 11 July 2001
ER -