Linearithmic Corpus to Corpus Comparison by Sentence Hashing Algorithm SHAPD2

Ceglarek, Dariusz

COGNITIVE 2013, The Fifth International Conference on Advanced Cognitive Technologies and Applications

Authors:
Dariusz Ceglarek

Keywords: document comparison, plagiary detection, longest common subsequence, sentence hashing, Natural Language Processing, text mining, pattern matching

Abstract:
This work presents an innovative method of comparing sets of textual documents with the aim of identifying common phrase sequences. The SHAPD2 (Sentence Hashing Algorithm for Plagiarism Detection 2) algorithm was designed to perform corpus-to-corpus comparison in a single pass. It was developed taking into account results and observations from previous research. It is a highly efficient solution that is applicable to considerable amounts of data and outperforms other approaches. One possible application is the detection of potential plagiarism by comparing corpus to corpus rather than a single document against a corpus. The algorithm's performance allows for use in situations where results have to be served an instant after a query is issued. This makes the SHAPD2 algorithm a valuable alternative to the available solutions.

Pages: 141 to 146

Copyright: Copyright (c) IARIA, 2013

Publication date: May 27, 2013

Published in: conference

ISSN: 2308-4197

ISBN: 978-1-61208-273-8

Location: Valencia, Spain

Dates: from May 27, 2013 to June 1, 2013