TY - GEN
T1 - Query selection techniques for efficient crawling of structured Web sources
AU - Wu, Ping
AU - Wen, Ji Rong
AU - Liu, Huan
AU - Ma, Wei-Ying
PY - 2006
Y1 - 2006
AB - The high quality, structured data from Web structured sources is invaluable for many applications. Hidden Web databases are not directly crawlable by Web search engines and are only accessible through Web query forms or via Web service interfaces. Recent research efforts have been focusing on understanding these Web query forms. A critical but still largely unresolved question is: how to efficiently acquire the structured information inside Web databases through iteratively issuing meaningful queries? In this paper we focus on the central issue of enabling efficient Web database crawling through query selection, i.e. how to select good queries to rapidly harvest data records from Web databases. We model each structured Web database as a distinct attribute-value graph. Under this theoretical framework, the database crawling problem is transformed into a graph traversal one that follows "relational" links. We show that finding an optimal query selection plan is equivalent to finding a Minimum Weighted Dominating Set of the corresponding database graph, a well-known NP-Complete problem. We propose a suite of query selection techniques aiming at optimizing the query harvest rate. Extensive experimental evaluations over real Web sources and simulations over controlled database servers validate the effectiveness of our techniques and provide insights for future efforts in this direction.
UR - http://www.scopus.com/inward/record.url?scp=33749617417&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33749617417&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2006.124
DO - 10.1109/ICDE.2006.124
M3 - Conference contribution
SN - 0769525709
SN - 9780769525709
T3 - Proceedings - International Conference on Data Engineering
SP - 48
EP - 57
BT - Proceedings of the 22nd International Conference on Data Engineering, ICDE '06
PB - IEEE Computer Society
T2 - 22nd International Conference on Data Engineering, ICDE '06
Y2 - 3 April 2006 through 7 April 2006
ER -