SOTICS 2019, The Ninth International Conference on Social Media Technologies, Communication, and Informatics


Automating Blog Crawling Using Pattern Recognition

Authors:
Anal Kanti Roy
Nitin Agarwal

Keywords: blog crawling; generic crawler; blogs; blog posts; metadata; title; author; date; content; patterns; html.

Abstract:
Social media plays an important role in the propagation and dissemination of ideas and thoughts, leading to the formation of diverse online communities. Compared to the myriad of other social media sites and applications, blogs provide a convenient platform for users to post detailed information, engage in active discussions, and share content on other social media sites, such as Facebook and Twitter. Thus, the blogosphere has been an enormous and ever-growing part of open-source intelligence. To track and monitor online social behavior, particularly from blogs, the first challenge is to mine the vast pool of unstructured data. Several approaches have been developed to extract blog data using focused crawling, which requires a lot of time, effort, and manual intervention. To scale up this process and cope with continuously changing blog structures, we propose a generic and scalable automated blog crawler that identifies different patterns in the Hypertext Markup Language (HTML) structure of blog pages and extracts data such as title, author, date, content, and tags from different blog posts. Using the crawler, we have crawled 530 blog sites with 894,856 blog posts so far.
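
The abstract does not describe how the crawler works internally. As a rough illustration of what pattern-based extraction from blog HTML might look like, the following minimal sketch checks a page against a small table of candidate patterns and returns whichever fields it finds. The pattern tables, the class name `BlogPostParser`, and the function `extract_post` are hypothetical and are not taken from the paper; only the Python standard library is used.

```python
from html.parser import HTMLParser

# Hypothetical pattern table (an assumption, not the authors' list):
# field -> (attribute, value) pairs on <meta> tags that often carry it.
META_PATTERNS = {
    "title": [("property", "og:title"), ("name", "title")],
    "author": [("name", "author"), ("property", "article:author")],
    "date": [("property", "article:published_time"), ("name", "date")],
}

# Hypothetical class names that blog themes often use to mark a field.
CLASS_PATTERNS = {
    "title": ["entry-title", "post-title"],
    "author": ["author-name", "byline"],
    "content": ["entry-content", "post-content", "post-body"],
}

# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"meta", "link", "br", "img", "hr", "input"}


class BlogPostParser(HTMLParser):
    """Collects blog post fields that match the pattern tables above."""

    def __init__(self):
        super().__init__()
        self.meta = {}      # field -> value found in a <meta> tag
        self.chunks = {}    # field -> text pieces found inside marked elements
        self._stack = []    # open elements as (tag, matched field or None)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            self._match_meta(attrs)
        if tag in VOID_TAGS:
            return
        classes = (attrs.get("class") or "").split()
        field = None
        for name, candidates in CLASS_PATTERNS.items():
            if any(c in classes for c in candidates):
                field = name
                break
        self._stack.append((tag, field))

    def handle_endtag(self, tag):
        # Close the most recent matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Text belongs to every marked element that is currently open.
        for _, field in self._stack:
            if field is not None:
                self.chunks.setdefault(field, []).append(text)

    def _match_meta(self, attrs):
        value = attrs.get("content")
        if not value:
            return
        for field, patterns in META_PATTERNS.items():
            if field in self.meta:
                continue  # keep the first match for each field
            if any(attrs.get(key) == val for key, val in patterns):
                self.meta[field] = value.strip()


def extract_post(html):
    """Return a dict of the fields found; <meta> values take precedence."""
    parser = BlogPostParser()
    parser.feed(html)
    parser.close()
    result = dict(parser.meta)
    for field, pieces in parser.chunks.items():
        result.setdefault(field, " ".join(pieces))
    return result
```

Calling `extract_post(page_html)` on the HTML of a single post returns a dictionary containing only the fields whose patterns matched. Keeping the patterns in plain tables is one possible way to let a single generic crawler cover many blog layouts, since supporting a new layout means adding an entry instead of writing a new site-specific crawler.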

Pages: 32 to 38

Copyright: Copyright (c) IARIA, 2019

Publication date: November 24, 2019

Published in: conference

ISSN: 2326-9294

ISBN: 978-1-61208-757-3

Location: Valencia, Spain

Dates: from November 24, 2019 to November 28, 2019