DATA ANALYTICS 2016, The Fifth International Conference on Data Analytics


Performance of Spanish Encoding Functions during Record Linkage

Authors:
Maria del Pilar Angeles
Noemi Bailón-Miguel

Keywords: data mining; data matching; de-duplication; record linkage.

Abstract:
Nowadays, many businesses suffer from duplicate records. For instance, information about the same provider, customer, or product appears in multiple systems and in multiple formats across the company and simply does not tally from system to system. This situation seriously hinders managers from making well-informed decisions. When low-quality data is written in Spanish, coding techniques based on the English language are not suitable for identifying and correcting problems such as spelling errors. In this paper, we have implemented, modified, and utilized three Spanish phonetic coding functions in our prototype, called Universal Evaluation System of Data Quality (SEUCAD). A Spanish phonetic coding based on the Soundex algorithm, a Spanish Metaphone coding, and a modified version of the latter were used to detect duplicate text strings in the presence of spelling errors in Spanish. The results were satisfactory, and the Spanish phonetic algorithm performed well most of the time, demonstrating opportunities for improved performance of Spanish encoding during the record linkage process.
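To illustrate the general idea of phonetic coding for duplicate detection, the following is a minimal sketch of a toy Spanish phonetic key. It is not the Soundex-based, Metaphone, or modified Metaphone function described in the abstract; the function name `spanish_phonetic_key` and the substitution rules are assumptions chosen only to show how sound-alike spellings can collapse to the same code.

```python
import unicodedata


def spanish_phonetic_key(word: str) -> str:
    """Toy phonetic key for Spanish (hypothetical; not the paper's algorithm)."""
    # Lowercase and strip accents (á -> a). In this simplification ñ becomes n.
    text = unicodedata.normalize("NFD", word.lower())
    text = "".join(ch for ch in text if ch.isalpha() and not unicodedata.combining(ch))

    # Collapse common sound-alike digraphs. "ch" is protected with an
    # uppercase placeholder so the later c -> k rule does not touch it.
    for old, new in (("ch", "C"), ("ll", "y"), ("qu", "k"),
                     ("ce", "se"), ("ci", "si"), ("ge", "je"), ("gi", "ji")):
        text = text.replace(old, new)

    # Single-letter rules: v sounds like b, z like s, remaining c like k, h is silent.
    text = text.translate(str.maketrans({"v": "b", "z": "s", "c": "k", "h": None}))

    # Drop consecutive repeated letters.
    key = []
    for ch in text:
        if not key or key[-1] != ch:
            key.append(ch)
    return "".join(key)


def same_phonetic_key(a: str, b: str) -> bool:
    # Two strings are duplicate candidates when their keys coincide.
    return spanish_phonetic_key(a) == spanish_phonetic_key(b)
```

Under these assumed rules, misspellings such as "Basquez" for "Vázquez" or "Gimenes" for "Jiménez" produce the same key as the correct form, so the pair would be flagged as a candidate duplicate, while unrelated names such as "Pérez" and "López" would not.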

Pages: 1 to 7

Copyright: Copyright (c) IARIA, 2016

Publication date: October 9, 2016

Published in: conference

ISSN: 2308-4464

ISBN: 978-1-61208-510-4

Location: Venice, Italy

Dates: from October 9, 2016 to October 13, 2016