Redundancy-Driven Vertical Domain Explorer

Badr, Celine

Conference: ICIW 2014, The Ninth International Conference on Internet and Web Applications and Services

Authors:
Celine Badr

Keywords: entity discovery; vertical domain; search; keywords.

Abstract:
Entities generally represent real-world concepts, such as a person (writer, singer, etc.), a product (book, camera, etc.), or a business. In large data-intensive websites, sections related to an entity in a given vertical domain consist of thousands of data-rich pages, each displaying attribute values for one instance of the given entity. Ideally, to build a rich repository of entity instances that serves the unlimited search needs of Web users, data aggregators aim to collect all the possible instances available for that given entity and apply data extraction to its attributes. A manual approach would be costly in time and effort. In this work, we propose a system that automatically discovers new large websites publishing pages about a conceptual entity, by exploiting the large amount of overlap on the Web among sources in the same vertical domain. Starting with information from one training site, specific queries are generated, and the results returned by search engines are analyzed and filtered. The sources retained from these search results then undergo a semantic, syntactic, and structural evaluation to detect data-intensive pages for the domain entity. The location of semi-structured attributes is also identified on the discovered entity pages. Our approach can thus be exploited by vertical search engines in pre-processing to enhance web page crawling, as well as in data extraction.
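
The redundancy idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the quoted-keyword query format, the `min_queries` threshold, and the example hosts are all assumptions made for illustration. Search results are passed in as plain lists of URLs, so no search engine API is assumed. The sketch builds one query per training instance, then retains the hosts that recur across the results of several queries.

```python
from collections import Counter
from urllib.parse import urlparse


def build_queries(instances, attributes):
    """Build one quoted-keyword query per training instance (hypothetical format)."""
    queries = []
    for instance in instances:
        values = [instance[a] for a in attributes if instance.get(a)]
        if values:
            queries.append(" ".join(f'"{v}"' for v in values))
    return queries


def rank_candidate_sites(results_by_query, training_host, min_queries=2):
    """Count how many distinct queries returned each host and keep frequent ones.

    A host that keeps reappearing for different entity instances is assumed
    to overlap with the training site and is kept as a candidate source.
    """
    hits = Counter()
    for urls in results_by_query.values():
        # Count each host at most once per query.
        hosts = {urlparse(u).netloc.lower() for u in urls}
        hosts.discard(training_host)  # the training site is already known
        hosts.discard("")             # skip URLs without a host part
        hits.update(hosts)
    kept = [(h, n) for h, n in hits.items() if n >= min_queries]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Made-up training instances and made-up search results.
instances = [
    {"title": "Dune", "author": "Frank Herbert"},
    {"title": "Emma", "author": "Jane Austen"},
]
queries = build_queries(instances, ["title", "author"])
results_by_query = {
    queries[0]: [
        "https://books.example.org/dune",
        "https://training.example.com/b/1",
        "https://blog.example.net/post",
    ],
    queries[1]: [
        "https://books.example.org/emma",
        "https://training.example.com/b/2",
    ],
}
candidates = rank_candidate_sites(results_by_query, "training.example.com")
print(candidates)
```

In this sketch, a host seen for only one query is dropped as incidental. The retained hosts would still need the semantic, syntactic, and structural evaluation described in the abstract, which is not modeled here.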

Pages: 60 to 65

Copyright: Copyright (c) IARIA, 2014

Publication date: July 20, 2014

Published in: conference

ISSN: 2308-3972

ISBN: 978-1-61208-361-2

Location: Paris, France

Dates: from July 20, 2014 to July 24, 2014