ICIW 2013, The Eighth International Conference on Internet and Web Applications and Services
Synergic Data Extraction and Crawling for Large Web Sites
Authors:
Celine Badr
Paolo Merialdo
Valter Crescenzi
Keywords: data extraction; crawler; web wrapper; sampling
Abstract:
Data collected from data-intensive web sites is widely used today in various applications and online services. We present a new methodology for the synergic specification of crawling and wrapping tasks on large data-intensive web sites, which allows wrappers to be executed while the crawler is collecting pages at the different levels of the derived web site structure. The methodology is supported by a working system aimed at non-expert users and built on a semi-automatic inference engine. By tracking and learning from the browsing activity of the non-expert user, the system derives a model that describes both the topological structure of the site's navigational paths and the inner structure of the HTML pages. This model allows the system to generate and execute crawling and wrapping definitions in an interleaved process. To collect a representative sample set that feeds the inference engine, we also propose a solution to an often-neglected problem, called the Sampling Problem. An extensive experimental evaluation shows that our system and the underlying methodology can successfully operate on most of the structured sites available on the Web.
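The interleaved process described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the in-memory `SITE`, the `MODEL` layout, the regular-expression rules, and the function name `crawl_and_wrap` are all assumptions made for illustration. The sketch only shows the general idea of running an extraction rule on each page as soon as the crawler reaches it, using a per-level model of the navigational path.

```python
import re
from collections import deque

# Hypothetical in-memory "site" (URL -> HTML); a real crawler would fetch over HTTP.
SITE = {
    "/": '<a class="cat" href="/books">Books</a>',
    "/books": '<a class="item" href="/books/1">A</a>'
              '<a class="item" href="/books/2">B</a>',
    "/books/1": '<h1>Dune</h1><span class="price">9.99</span>',
    "/books/2": '<h1>Emma</h1><span class="price">4.50</span>',
}

# Hypothetical site model, one entry per level of the navigational path.
# "links" is the crawling rule (links leading to the next level);
# "wrapper" maps field names to extraction rules (None = nothing to extract).
MODEL = [
    {"links": r'<a class="cat" href="([^"]+)"', "wrapper": None},
    {"links": r'<a class="item" href="([^"]+)"', "wrapper": None},
    {"links": None,
     "wrapper": {"title": r"<h1>(.*?)</h1>",
                 "price": r'<span class="price">(.*?)</span>'}},
]


def crawl_and_wrap(site, model, start="/"):
    """Breadth-first crawl that applies each level's wrapper as pages arrive."""
    records = []
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        url, level = queue.popleft()
        html = site.get(url)
        if html is None:  # missing page: skip it
            continue
        rules = model[level]
        # Wrapping step: extract data from the page that was just collected.
        if rules["wrapper"]:
            record = {"url": url}
            for field, pattern in rules["wrapper"].items():
                match = re.search(pattern, html)
                record[field] = match.group(1) if match else None
            records.append(record)
        # Crawling step: follow only the links this level's rule selects.
        if rules["links"] and level + 1 < len(model):
            for href in re.findall(rules["links"], html):
                if href not in seen:
                    seen.add(href)
                    queue.append((href, level + 1))
    return records


records = crawl_and_wrap(SITE, MODEL)
```

In this toy setting, `records` holds one dictionary per leaf page with its `url`, `title`, and `price`. Regular expressions stand in for whatever rule language a real inference engine would produce; they keep the sketch self-contained but would be too brittle for real HTML.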
Pages: 200 to 205
Copyright: Copyright (c) IARIA, 2013
Publication date: June 23, 2013
Published in: conference
ISSN: 2308-3972
ISBN: 978-1-61208-280-6
Location: Rome, Italy
Dates: from June 23, 2013 to June 28, 2013