Improving Near Duplicate Data Detection via DSound Phonetic Matching Algorithm: A Solution to Address Typographical Problems

Varol, Cihan; Hari, Sairam

DBKDA 2015, The Seventh International Conference on Advances in Databases, Knowledge, and Data Applications

Authors:
Cihan Varol
Sairam Hari

Keywords: data cleansing; data quality; duplicate detection; DSound; Shingling

Abstract:
Near-duplicate data increase not only the cost of information processing but also the time needed to reach a decision. Detecting and eliminating them is therefore vital for business decisions. The Shingling algorithm has been used to detect near duplicates in large-scale text databases. The algorithm is based on the number of common tokens in two or more sets of information. Consequently, if one of the documents contains a slight variation of the text, such as a misspelling, the performance of the algorithm decreases. In this work, we therefore propose embedding a new phonetic approximate matching algorithm, DSound, into the Shingling algorithm to improve near-duplicate detection in the presence of typographical errors. In experiments on a real dataset, the proposed framework improved the Shingling algorithm's performance by 16 percent.
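The effect described in the abstract can be illustrated with a minimal sketch. It assumes word-level shingles of size 2 and Jaccard similarity as the overlap measure, neither of which is specified in the abstract. Because the abstract does not give the DSound rules, classic Soundex is used here purely as a stand-in phonetic encoder; the function names (`tokenize`, `shingles`, `jaccard`, `soundex`, `phonetic_shingles`) and the sample sentences are illustrative and not taken from the paper.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def shingles(tokens, k=2):
    """Return the set of contiguous k-token windows (shingles)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Soundex digit classes. Soundex is a stand-in here, NOT the DSound algorithm.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Classic four-character Soundex code of a word."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (result + "000")[:4]


def phonetic_shingles(text, k=2):
    """Shingle the text after replacing each token with its phonetic code."""
    return shingles([soundex(t) for t in tokenize(text)], k)


original = "detecting near duplicate records in large databases"
misspelled = "detecting near duplicat records in large databases"

# One misspelled token changes two of the six 2-word shingles.
plain = jaccard(shingles(tokenize(original)), shingles(tokenize(misspelled)))

# After phonetic encoding, both spellings map to the same code.
phonetic = jaccard(phonetic_shingles(original), phonetic_shingles(misspelled))

print(plain, phonetic)  # 0.5 1.0
```

In this toy case a single typographical error halves the plain shingle similarity, while encoding tokens phonetically before shingling restores a full match. How DSound itself encodes tokens, and how it is combined with Shingling in the paper, may differ from this sketch.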

Pages: 9 to 14

Copyright: Copyright (c) IARIA, 2015

Publication date: May 24, 2015

Published in: conference

ISSN: 2308-4332

ISBN: 978-1-61208-408-4

Location: Rome, Italy

Dates: from May 24, 2015 to May 29, 2015