ICIW 2016, The Eleventh International Conference on Internet and Web Applications and Services
A Filtered-Page Ranking: An Approach for Previously Filtered HTML Documents Ranking
Authors:
Jose Costa
Carina Dorneles
Keywords: Web content automatic extraction; Irrelevant content removal
Abstract:
This paper describes a ranking approach applied to previously filtered documents, which relies on a segmentation process. The ranking method, called Filtered-Page Ranking, has two main steps: (i) page segmentation and irrelevant block removal; and (ii) document ranking. The first step eliminates content that has no relevance to the user query by means of the Query-Based Blocks Mining algorithm, creating a filtered document that is then evaluated in the ranking process. The ranking step calculates the relevance of each filtered document for a given query, using criteria that prioritize specific parts of the document and the highlighted features of some HTML elements. As shown in our experiments, our approach outperforms the baseline Lucene implementation of the vector space model. In addition, the results demonstrate that our irrelevant content removal algorithm improves the results and that our relevance criteria make a difference to the process.
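The two-step flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no details of the Query-Based Blocks Mining algorithm or of the relevance criteria, so the term-overlap filter, the `TAG_WEIGHTS` values, and all function names below are assumptions chosen only for illustration. A page is assumed to be already segmented into `(tag, text)` blocks.

```python
from collections import Counter
import re

# Hypothetical weights: the abstract only says that some HTML elements are
# prioritized, not which ones or by how much.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5}
DEFAULT_WEIGHT = 1.0


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_blocks(blocks, query):
    """Step (i) stand-in: keep only blocks that share a term with the query.

    This is a simple term-overlap test, not the Query-Based Blocks Mining
    algorithm, whose details the abstract does not give.
    """
    query_terms = set(tokenize(query))
    return [(tag, text) for tag, text in blocks
            if query_terms & set(tokenize(text))]


def score_document(blocks, query):
    """Step (ii) stand-in: sum tag-weighted query-term counts over blocks."""
    query_terms = set(tokenize(query))
    score = 0.0
    for tag, text in blocks:
        counts = Counter(tokenize(text))
        hits = sum(counts[term] for term in query_terms)
        score += TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT) * hits
    return score


def rank(documents, query):
    """Filter each document, score it, and return names by descending score."""
    scored = {name: score_document(filter_blocks(blocks, query), query)
              for name, blocks in documents.items()}
    return sorted(scored, key=scored.get, reverse=True)
```

In this toy scoring, removed blocks would contribute zero anyway; filtering starts to matter once scores are length-normalized, as they are in a vector space model.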
Pages: 12 to 18
Copyright: Copyright (c) IARIA, 2016
Publication date: May 22, 2016
Published in: conference
ISSN: 2308-3972
ISBN: 978-1-61208-474-9
Location: Valencia, Spain
Dates: from May 22, 2016 to May 26, 2016