Extracting information from a website is useful, but it is not always an easy task. Webpages often contain many noisy, irrelevant elements that get in the way of the content the user is actually looking for.

There are two versions of ConEx depending on the information that the tool analyzes to infer the main content. While site-level ConEx infers the main content of a webpage by comparing several webpages from the same website, page-level ConEx obtains the main content of a webpage by analyzing several properties of its nodes.
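The site-level idea can be illustrated with a minimal sketch. It is not ConEx's actual algorithm, which is not detailed here. It assumes each page has already been split into text blocks, and it treats any block that also appears on another page of the same website as template noise. The function name `site_level_main_content` and the sample pages are hypothetical.

```python
from collections import Counter


def site_level_main_content(target_blocks, other_pages_blocks):
    """Keep the blocks of the target page that do not recur on sibling pages.

    target_blocks: list of text blocks from the page of interest.
    other_pages_blocks: list of block lists, one per other page of the same site.
    """
    # Count on how many sibling pages each block appears (once per page).
    seen_on = Counter()
    for blocks in other_pages_blocks:
        seen_on.update(set(blocks))

    # Assumption: a block repeated on any sibling page is template noise.
    return [block for block in target_blocks if seen_on[block] == 0]


# Two hypothetical pages of the same website, already split into blocks.
article = [
    "Home | News | Contact",
    "ConEx released",
    "The tool extracts main content.",
    "(c) Example Site",
]
about = [
    "Home | News | Contact",
    "About us",
    "We build web tools.",
    "(c) Example Site",
]

print(site_level_main_content(article, [about]))
# ['ConEx released', 'The tool extracts main content.']
```

The menu and the footer are shared by both pages, so they are discarded. A real implementation would compare DOM nodes rather than plain strings and would have to tolerate small differences between pages.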

These tools implement two techniques for main content extraction. The main content of a webpage is the content that is relevant to the user. It usually consists of text, images, and other multimedia, and it is usually surrounded by, or even mixed with, irrelevant information such as headers, footers, menus, banners, and advertisements.

The main content in a webpage can be useful for:

Website developers, because they can automatically extract clean HTML containing only the main content of any webpage.
Other systems and tools, such as indexers or wrappers, as a preliminary stage that filters out banners and other content that is unnecessary for the user.

One important advantage of both tools is that they not only extract the main content text from the webpage, but also images, videos, and any other multimedia. The main advantage of the page-level version of ConEx is its performance, because it does not need to load any additional webpages to analyze them.
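The page-level approach needs only the page itself, as the following sketch shows. It is an illustration under stated assumptions, not the ConEx implementation. The `Node` class is a hypothetical stand-in for a DOM node, and the only property scored is the amount of text outside links, which is one property such a tool might plausibly use. Returning the whole winning node, rather than its text, keeps any images or other multimedia it contains.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a DOM node."""
    tag: str
    text: str = ""  # text directly inside this node
    children: list = field(default_factory=list)


def text_length(node):
    """Total number of characters of text in the subtree."""
    return len(node.text) + sum(text_length(child) for child in node.children)


def link_text_length(node):
    """Number of characters of text that sit inside <a> elements."""
    if node.tag == "a":
        return text_length(node)
    return sum(link_text_length(child) for child in node.children)


def score(node):
    # Assumed property: text outside links suggests content, not navigation.
    return text_length(node) - link_text_length(node)


def page_level_main_content(body):
    """Return the top-level block of the page with the highest score."""
    return max(body.children, key=score)


page = Node("body", children=[
    Node("nav", children=[Node("a", "Home"), Node("a", "News"), Node("a", "Contact")]),
    Node("article", "ConEx extracts the main content of a webpage.",
         children=[Node("img"), Node("a", "Read more")]),
    Node("footer", "Example Site"),
])

main = page_level_main_content(page)
print(main.tag)                                # article
print([child.tag for child in main.children])  # ['img', 'a']
```

No other page is loaded at any point, which is where the performance advantage comes from. The menu scores zero because all of its text is link text.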

Open source

These plugins are distributed as open source under the BSD open source license. Any redistribution of any software that contains or makes use of these plugins must retain the same BSD open source license.

Feedback

We greatly appreciate any feedback from the users of these plugins. Any contribution that helps us improve their usability or performance is very welcome. Please send your feedback to jsilva@dsic.upv.es.

ConEx: The Web Content Extractor

Automatically extracting web content from webpages
