COGNITIVE 2013, The Fifth International Conference on Advanced Cognitive Technologies and Applications
Linearithmic Corpus to Corpus Comparison by Sentence Hashing Algorithm SHAPD2
Authors:
Dariusz Ceglarek
Keywords: document comparison, plagiary detection, longest common subsequence, sentence hashing, Natural Language Processing, text mining, pattern matching
Abstract:
This work presents an innovative method for comparing sets of textual documents in order to identify common phrase sequences. The SHAPD2 (Sentence Hashing Algorithm for Plagiarism Detection 2) algorithm was designed to perform corpus-to-corpus comparison in a single pass. It was developed taking into account results and observations from previous research. It is a highly efficient solution suited to considerable amounts of data, and it outperforms other approaches. One possible application is the detection of potential plagiarism by comparing not a single document against a corpus, but one corpus against another. The algorithm's performance allows its use in situations where results must be served an instant after a query is issued. This makes the SHAPD2 algorithm a valuable alternative to the available solutions.
Pages: 141 to 146
Copyright: Copyright (c) IARIA, 2013
Publication date: May 27, 2013
Published in: conference
ISSN: 2308-4197
ISBN: 978-1-61208-273-8
Location: Valencia, Spain
Dates: from May 27, 2013 to June 1, 2013