ICIW 2013, The Eighth International Conference on Internet and Web Applications and Services


Synergic Data Extraction and Crawling for Large Web Sites

Authors:
Celine Badr
Paolo Merialdo
Valter Crescenzi

Keywords: data extraction; crawler; web wrapper; sampling

Abstract:
Data collected from data-intensive web sites is widely used today in various applications and online services. We present a new methodology for a synergic specification of crawling and wrapping tasks on large data-intensive web sites, allowing the execution of wrappers while the crawler is collecting pages at the different levels of the derived web site structure. It is supported by a working system devoted to non-expert users, built over a semi-automatic inference engine. By tracking and learning from the browsing activity of the non-expert user, the system derives a model that describes the topological structures of the site's navigational paths as well as the inner structures of the HTML pages. This model allows the system to generate and execute crawling and wrapping definitions in an interleaved process. To collect a representative sample set that feeds the inference engine, we propose in this context a solution to an often neglected problem, called the Sampling Problem. An extensive experimental evaluation shows that our system and the underlying methodology can successfully operate on most of the structured sites available on the Web.

Pages: 200 to 205

Copyright: Copyright (c) IARIA, 2013

Publication date: June 23, 2013

Published in: conference

ISSN: 2308-3972

ISBN: 978-1-61208-280-6

Location: Rome, Italy

Dates: from June 23, 2013 to June 28, 2013