DBKDA 2015, The Seventh International Conference on Advances in Databases, Knowledge, and Data Applications
Authors:
Cihan Varol
Sairam Hari
Keywords: data cleansing; data quality; duplicate detection; DSound; Shingling
Abstract:
Near-duplicate data increase both the cost of information processing and the time needed to reach a decision. Detecting and eliminating them is therefore vital for business decisions. The Shingling algorithm has been used to detect near duplicates in large-scale text databases. The algorithm is based on the number of common tokens in two or more sets of information. Consequently, if one of the documents contains a slight variation of the text, such as a misspelling, the performance of the algorithm decreases. In this work, we therefore propose embedding a new approximate phonetic algorithm, DSound, into the Shingling algorithm to improve near-duplicate detection in the presence of typographical errors. In experiments on a real dataset, the proposed framework improved the Shingling algorithm’s performance by 16 percent.
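The effect described in the abstract can be illustrated with a minimal sketch. It assumes word-level shingles of length 2 compared with Jaccard similarity, which is one common way to count shared tokens; the abstract does not specify the exact variant used. The names `shingles`, `jaccard`, and `toy_phonetic_key` are hypothetical, and `toy_phonetic_key` is only a crude stand-in for a phonetic encoder. It is not the DSound algorithm, whose rules are not given here.

```python
from typing import Callable, Set, Tuple


def shingles(text: str, k: int = 2,
             encode: Callable[[str], str] = lambda t: t) -> Set[Tuple[str, ...]]:
    """Return the set of k-token shingles, optionally encoding each token first."""
    tokens = [encode(t) for t in text.lower().split()]
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Share of shingles common to both sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def toy_phonetic_key(token: str) -> str:
    """Assumed stand-in encoder, NOT DSound: keep the first letter,
    drop later vowels (including 'y'), collapse adjacent repeated letters."""
    out = token[:1]
    for c in token[1:]:
        if c in "aeiouy":
            continue
        if out[-1] != c:
            out += c
    return out


clean = "john smith lives in rome"
typo = "john smyth lives in rome"  # one misspelled token

# Plain shingling: the misspelling breaks every shingle that contains it.
plain_score = jaccard(shingles(clean), shingles(typo))

# Encoding tokens before shingling maps both spellings to the same key.
encoded_score = jaccard(shingles(clean, encode=toy_phonetic_key),
                        shingles(typo, encode=toy_phonetic_key))

print(plain_score, encoded_score)
```

In this example a single misspelled token changes two of the four shingles, so the plain score falls well below 1, while encoding the tokens first restores a full match. A real phonetic encoder would be tuned to avoid merging genuinely different words, which this toy key does not attempt.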
Pages: 9 to 14
Copyright: Copyright (c) IARIA, 2015
Publication date: May 24, 2015
Published in: conference
ISSN: 2308-4332
ISBN: 978-1-61208-408-4
Location: Rome, Italy
Dates: from May 24, 2015 to May 29, 2015