HybEx Firefox/Chrome/Opera Add-on

Web templates are useful for website developers. Content providers and website developers can automatically insert content into web templates, which increases their productivity and, because the resulting pages are uniform, produces more usable and scalable websites.

HybEx is a hybrid tool because it combines two techniques for template and content extraction:

  • TemEx, a site-level template detection technique that explores the initial webpage and builds a set of candidate webpages from the same website that probably share the same template. Then, it carries out a mapping between them to compute the template.
  • ConEx, a page-level technique that extracts the main content of a webpage. It analyzes some features of the initial webpage and translates them into points in a 4-dimensional Euclidean space. Then it computes the Euclidean distances between them and selects as main content the nodes located farthest from the centroid node.
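
The two techniques above can be illustrated with a minimal Python sketch. It is not the HybEx implementation: the Node class, the tag-only top-down mapping used for the template, and the four feature values are all assumptions made for illustration, since this description does not specify how candidate pages are mapped or which features are measured.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM node; not part of the HybEx code base."""
    tag: str
    children: list[Node] = field(default_factory=list)
    # Four placeholder feature values (assumption: the source does not name them).
    features: tuple[float, float, float, float] = (0.0, 0.0, 0.0, 0.0)


def common_template(a: Node, b: Node) -> Node | None:
    """Simplified top-down mapping: keep nodes whose tags match at the same position."""
    if a.tag != b.tag:
        return None
    kept = []
    for child_a, child_b in zip(a.children, b.children):
        mapped = common_template(child_a, child_b)
        if mapped is not None:
            kept.append(mapped)
    return Node(a.tag, kept, a.features)


def walk(node: Node):
    """Yield the node and all its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def farthest_from_centroid(nodes: list[Node], k: int = 1) -> list[Node]:
    """Return the k nodes whose 4-D feature points lie farthest from the centroid."""
    centroid = tuple(
        sum(n.features[i] for n in nodes) / len(nodes) for i in range(4)
    )
    ranked = sorted(
        nodes, key=lambda n: math.dist(n.features, centroid), reverse=True
    )
    return ranked[:k]


# Two pages of the same (made-up) website that differ only in their main block.
page_a = Node("html", [Node("body", [Node("nav"), Node("div", [Node("p")]), Node("footer")])])
page_b = Node("html", [Node("body", [Node("nav"), Node("div", [Node("table")]), Node("footer")])])
template = common_template(page_a, page_b)
template_tags = [n.tag for n in walk(template)]
print(template_tags)  # ['html', 'body', 'nav', 'div', 'footer']

# Made-up feature points: three similar nodes and one outlier.
candidates = [
    Node("nav", features=(1, 1, 1, 1)),
    Node("footer", features=(1, 2, 1, 1)),
    Node("aside", features=(2, 1, 1, 1)),
    Node("article", features=(9, 8, 7, 9)),
]
main_content = farthest_from_centroid(candidates)
print(main_content[0].tag)  # article
```

In this sketch the template is whatever survives the mapping between the two pages, so the differing "p" and "table" children are dropped while their shared container is kept. The content step simply ranks nodes by distance from the centroid; a real extractor would also need a rule for how many nodes to keep.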

HybEx, as a template extractor, is mostly useful for:

  • Website developers, because it allows them to automatically extract a clean HTML template from any webpage, which is especially useful for reusing components of other webpages.
  • Other systems and tools, such as indexers or wrappers, as a preprocessing phase. For instance, extracting the template allows them to identify the topology of the website and the structure of the webpage. Template extraction is also useful for identifying pagelets and advertisements, and for isolating the main content.