ICIW 2016, The Eleventh International Conference on Internet and Web Applications and Services


A Filtered-Page Ranking: An Approach for Previously Filtered HTML Documents Ranking

Authors:
Jose Costa
Carina Dorneles

Keywords: Web content automatic extraction; Irrelevant content removal

Abstract:
This paper describes a ranking approach applied to previously filtered documents, which relies on a segmentation process. The ranking method, called Filtered-Page Ranking, has two main steps: (i) page segmentation and irrelevant block removal; and (ii) document ranking. The first step eliminates content that is irrelevant to the user query by means of the Query-Based Blocks Mining algorithm, creating a filtered document that is then evaluated in the ranking process. The ranking step calculates the relevance of each filtered document for a given query, using criteria that prioritize specific parts of the document and the highlighted features of some HTML elements. In our experiments, our approach outperforms the baseline Lucene implementation of the vector space model. In addition, the results demonstrate that our irrelevant content removal algorithm improves the results and that our relevance criteria make a difference to the process.
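
The two-step idea in the abstract (filter blocks against the query, then rank the filtered documents with extra weight for certain HTML elements) can be illustrated with a small sketch. This is not the Query-Based Blocks Mining algorithm or the paper's relevance criteria, which the abstract does not specify. The block boundaries, the tag weights in `TAG_WEIGHTS`, the term-overlap filter, and the scoring formula are all assumptions chosen only to make the pipeline concrete.

```python
import re
from html.parser import HTMLParser

# Hypothetical weights for "highlighted" elements; not taken from the paper.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "b": 1.5, "strong": 1.5}
# Hypothetical choice of tags that start a new block (a crude segmentation).
BLOCK_TAGS = {"title", "h1", "h2", "p", "div", "li", "td"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BlockCollector(HTMLParser):
    """Splits a page into blocks, each a list of (word, weight) pairs."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.blocks = []  # one list of (word, weight) per block

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag in BLOCK_TAGS:
            self.blocks.append([])

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        words = tokenize(data)
        if not words:
            return
        if not self.blocks:
            self.blocks.append([])
        # A word inherits the largest weight among its enclosing tags.
        weight = max([TAG_WEIGHTS.get(t, 1.0) for t in self.stack], default=1.0)
        self.blocks[-1].extend((w, weight) for w in words)


def filter_blocks(html, query):
    """Step (i): keep only blocks that share at least one term with the query."""
    terms = set(tokenize(query))
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks if any(w in terms for w, _ in b)]


def score(html, query):
    """Step (ii): weighted query-term hits, normalized by filtered length."""
    terms = set(tokenize(query))
    kept = filter_blocks(html, query)
    total = sum(len(b) for b in kept)
    if total == 0:
        return 0.0
    hits = sum(wt for b in kept for w, wt in b if w in terms)
    return hits / total


def rank(pages, query):
    """Order page names by descending score for the query."""
    return sorted(pages, key=lambda name: score(pages[name], query), reverse=True)
```

Calling `rank(pages, "web ranking")` on a dictionary that maps page names to HTML strings returns the names ordered by score. Because blocks with no query terms are dropped before scoring, boilerplate such as advertisements does not dilute the length normalization, which is the intuition the abstract attributes to the filtering step.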

Pages: 12 to 18

Copyright: Copyright (c) IARIA, 2016

Publication date: May 22, 2016

Published in: conference

ISSN: 2308-3972

ISBN: 978-1-61208-474-9

Location: Valencia, Spain

Dates: from May 22, 2016 to May 26, 2016