This tool takes as input the webpage loaded in the current tab or window (called the key page) and outputs its main content. The tool is based on a site-level algorithm: it loads and analyzes several other webpages from the same website to infer the main content. This increases its accuracy, because it can detect template (i.e., repeated) content.
This technique is divided into three phases:
- An algorithm examines the links of the key page and selects a set of webpages from the same website.
- An algorithm maps the DOM nodes of each webpage in the set to the DOM nodes of the key page. Whenever it finds that a node of the key page is present in another webpage, it increments a counter that records how many times each node of the key page appears in the other webpages.
- The DOM nodes of the key page that are not present in any other webpage are added to a set of candidate nodes. These nodes are then analyzed in the following way:
  - The DOM nodes of the set that have no ancestor in the set are selected. They form the reduced set of candidate nodes.
  - If the reduced set of candidate nodes contains exactly one node, that node and all its descendants correspond to the main content. Otherwise, if there are several candidate nodes in the set:
    - Each candidate node in the set is analyzed to detect the branch of the DOM tree that most likely contains the main content.
    - Finally, an algorithm selects the candidate nodes that belong to the main-content branch.
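As an illustration of the first phase, the sketch below selects a set of same-site URLs from the links of the key page. This is a minimal stand-in, not the tool's actual selection heuristic: the function name `select_same_site_links` and the `limit` parameter are assumptions, and matching by host name is only one possible notion of "same website".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    # Collects the href attribute of every anchor tag in the page.
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def select_same_site_links(key_url, html, limit=5):
    # Keep links that resolve to the same host as the key page,
    # deduplicated and excluding the key page itself.
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(key_url).netloc
    selected = []
    for href in parser.hrefs:
        url = urljoin(key_url, href)
        if urlparse(url).netloc == host and url != key_url and url not in selected:
            selected.append(url)
    return selected[:limit]
```

The selected URLs would then be fetched and parsed so that their DOM trees can be compared against the key page in the second phase.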
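The second and third phases can be sketched as follows. This is a simplified illustration, not the tool's implementation: node equality is approximated here by comparing tag name and text content (`node_signature`), and the branch analysis is reduced to keeping the candidates that fall inside the root's child subtree containing the most candidates. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def node_signature(node):
    # Simplified notion of node equality: tag name plus stripped text.
    # The real node mapping is assumed to be more sophisticated.
    return (node.tag, (node.text or "").strip())

def count_appearances(key_root, other_roots):
    # Phase 2: for each node of the key page, count in how many of the
    # other webpages a matching node appears.
    other_sigs = [{node_signature(n) for n in r.iter()} for r in other_roots]
    return {n: sum(node_signature(n) in sigs for sigs in other_sigs)
            for n in key_root.iter()}

def reduce_candidates(key_root, candidates):
    # Keep only the candidates that have no ancestor in the candidate set.
    cand = set(candidates)
    parent = {child: p for p in key_root.iter() for child in p}
    def ancestor_in_set(n):
        p = parent.get(n)
        while p is not None:
            if p in cand:
                return True
            p = parent.get(p)
        return False
    return [n for n in candidates if not ancestor_in_set(n)]

def main_content(key_root, other_roots):
    counts = count_appearances(key_root, other_roots)
    # Phase 3: nodes that appear in no other webpage are candidates.
    candidates = [n for n in key_root.iter() if counts[n] == 0]
    reduced = reduce_candidates(key_root, candidates)
    if len(reduced) <= 1:
        # A single candidate: it and its descendants are the main content.
        return reduced
    # Several candidates: keep those in the branch (child subtree of the
    # root) that holds the most candidates, a stand-in for the tool's
    # branch analysis.
    parent = {child: p for p in key_root.iter() for child in p}
    def branch(n):
        while parent.get(n) is not None and parent[n] is not key_root:
            n = parent[n]
        return n
    groups = {}
    for n in reduced:
        groups.setdefault(branch(n), []).append(n)
    return max(groups.values(), key=len)
```

For example, given a key page whose menu and footer also appear in two sibling pages, only the article node survives as a candidate and is returned as the main content.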