This main content extraction tool ranks the DOM nodes of the loaded webpage using several features to identify its main content. It takes as input the webpage loaded in the current tab or window and outputs that page's main content. The main benefit of being a page-level tool is that it only needs to load and analyze a single webpage to detect the main content, which makes it faster than site-level techniques that must download and compare several pages of the same website.
The technique is divided into four phases:
- An algorithm selects a set of DOM nodes of the webpage and, for each one, computes four weights: word ratio, hyperlink ratio, children ratio, and position ratio.
- Once the weights of every selected node have been computed, another algorithm standardizes their values across all nodes.
- Each node, with its standardized weights, is considered a point in R4. An algorithm then computes the centroid of these points, and the DOM nodes (points) that lie farther from the centroid than the average distance are added to a set of candidate nodes.
- Finally, an algorithm analyzes the nodes in the candidate set in the following way:
- If two nodes have exactly the same text, the node that is a descendant of the other is removed from the set, so the ancestor remains in the set.
- Then, a ratio between words and tags is computed for the nodes remaining in the set. The node with the best ratio, together with its siblings, is selected as the main content.
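The four phases above can be sketched in Python over a simplified DOM model. The `Node` fields, the exact weight formulas, and the "average distance" cutoff are assumptions made for illustration, not the tool's actual definitions:

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean, pstdev

@dataclass
class Node:
    text: str                  # text contained under this node
    links: int                 # number of hyperlinks under this node
    children: int              # number of direct child elements
    position: float            # relative vertical position in [0, 1]
    tags: int = 1              # number of element tags under this node
    parent: "Node | None" = None

def weights(node: Node, total_words: int) -> list[float]:
    # Phase 1: compute the four weights (assumed formulas).
    words = len(node.text.split())
    return [
        words / max(total_words, 1),        # word ratio
        node.links / max(words, 1),         # hyperlink ratio
        node.children / max(node.tags, 1),  # children ratio
        node.position,                      # position ratio
    ]

def standardize(vectors: list[list[float]]) -> list[list[float]]:
    # Phase 2: z-score standardization of each weight across all nodes.
    cols = list(zip(*vectors))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(vec, mus, sds)] for vec in vectors]

def candidates(nodes: list[Node]) -> list[Node]:
    # Phase 3: treat each node as a point in R^4, compute the centroid,
    # and keep the nodes farther from it than the average distance.
    total_words = sum(len(n.text.split()) for n in nodes)
    points = standardize([weights(n, total_words) for n in nodes])
    centroid = [mean(c) for c in zip(*points)]
    dists = [sqrt(sum((a - b) ** 2 for a, b in zip(p, centroid)))
             for p in points]
    cutoff = mean(dists)
    return [n for n, d in zip(nodes, dists) if d > cutoff]

def is_ancestor(a: Node, n: Node) -> bool:
    p = n.parent
    while p is not None:
        if p is a:
            return True
        p = p.parent
    return False

def main_content(cands: list[Node]) -> Node:
    # Phase 4: drop candidates whose text duplicates an ancestor's text,
    # then pick the node with the best word/tag ratio (the tool described
    # above also includes that node's siblings in the result).
    kept = [n for n in cands
            if not any(a is not n and a.text == n.text and is_ancestor(a, n)
                       for a in cands)]
    return max(kept, key=lambda n: len(n.text.split()) / max(n.tags, 1))
```

Because the weights are standardized, the centroid represents an "average" node of the page; navigation bars and the main content block both tend to be outliers with respect to it, which is why a second pass (the word/tag ratio) is needed to tell them apart.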