DATA ANALYTICS 2016, The Fifth International Conference on Data Analytics
Performance of Spanish Encoding Functions during Record Linkage
Authors:
Maria del Pilar Angeles
Noemi Bailón-Miguel
Keywords: data mining; data matching; de-duplication; record linkage.
Abstract:
Many businesses suffer from duplicate records. For instance, information about the same provider, customer, or product appears in multiple systems and in multiple formats across the company and does not tally from system to system. This situation seriously hinders managers from making well-informed decisions. When low-quality data is written in Spanish, coding techniques based on the English language are not suitable for identifying and correcting problems such as spelling errors. In this paper, we have implemented, modified, and utilized three Spanish phonetic coding functions in our prototype, the Universal Evaluation System of Data Quality (SEUCAD). A Spanish phonetic coding based on the Soundex algorithm, a Spanish Metaphone coding, and a modified version of the latter were used to detect duplicate text strings in the presence of spelling errors in Spanish. The results were satisfactory, and the Spanish phonetic algorithm performed well most of the time, demonstrating opportunities for improved performance of Spanish encoding during the record linkage process.
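As a rough illustration of how a phonetic key can bring together Spanish strings that differ only by spelling errors, the following minimal sketch may help. It is not the Spanish Soundex, the Spanish Metaphone, or the modified Metaphone evaluated in the paper, and it is not taken from SEUCAD. The function names `spanish_key` and `group_by_key` and the small rule set (for example V to B, Z to S, silent H) are assumptions chosen only for illustration.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical, simplified rule set (an assumption, not the paper's rules).
# Order matters: digraphs are rewritten before single letters.
RULES = [
    (r"CH", "X"),         # treat the digraph CH as one symbol
    (r"LL", "Y"),         # LL and Y usually sound alike
    (r"QU", "K"),         # QU sounds like K
    (r"C(?=[EI])", "S"),  # soft C before E or I
    (r"G(?=[EI])", "J"),  # soft G before E or I
    (r"[CQ]", "K"),       # remaining hard C and Q
    (r"V", "B"),          # V and B sound alike
    (r"Z", "S"),          # Z and S sound alike in many dialects
    (r"H", ""),           # H is silent
]

def spanish_key(name: str) -> str:
    """Return an illustrative phonetic key for a Spanish string."""
    # Uppercase and strip diacritics (note: this also maps Ñ to N).
    text = unicodedata.normalize("NFD", name.upper())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^A-Z]", "", text)
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    # Collapse runs of the same letter.
    text = re.sub(r"(.)\1+", r"\1", text)
    if not text:
        return ""
    # Keep the first letter, then drop vowels from the rest.
    return text[0] + re.sub(r"[AEIOU]", "", text[1:])

def group_by_key(names):
    """Group strings sharing a key; return only groups with candidates."""
    groups = defaultdict(list)
    for name in names:
        groups[spanish_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]

candidates = group_by_key(["González", "Gonsales", "García", "Vázquez", "Basques"])
```

Strings that share a key are only candidate duplicates. A record linkage process would normally confirm them with further comparison before merging records.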
Pages: 1 to 7
Copyright: Copyright (c) IARIA, 2016
Publication date: October 9, 2016
Published in: conference
ISSN: 2308-4464
ISBN: 978-1-61208-510-4
Location: Venice, Italy
Dates: from October 9, 2016 to October 13, 2016