ALLDATA 2017, The Third International Conference on Big Data, Small Data, Linked Data and Open Data


Industry Experience: Chinese Names Duplicate Records Detection

Authors:
Tong Khin Thong
Badrul Affandy Bin Ahmad Latfi
Xiaomei Wang
Arichandran A/L M.S. Kandiah

Keywords: duplicate detection; Chinese names; Soundex; false positive.

Abstract:
Soundex is the preferred method for detecting duplicate records among Malaysian Chinese names. The names are written in English letters but are phonetic transliterations from various Chinese dialects. With the Russell Soundex method, both the number of detected duplicates and the number of false positives are high. Because Soundex codes can be adapted, the method can be optimized for names from other languages, such as Chinese names. Through a series of tests, this study optimized the Soundex codes for general Malaysian Chinese names. The test results show that a few short Chinese surnames contribute to the false positives.
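
To illustrate why short surnames can collide under the baseline method, here is a minimal Python sketch of the standard (Russell/American) Soundex coding. It is not the optimized coding from the study, which the abstract does not specify. The function names `soundex` and `group_by_code` and the sample surnames are illustrative assumptions, not data from the paper.

```python
from collections import defaultdict

# Standard Soundex digit groups; vowels, Y, H and W carry no digit.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character standard Soundex code (hypothetical helper)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            # H and W are skipped without resetting prev,
            # so equal codes on either side of them merge.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev to "", allowing a repeated digit
    return (first + "".join(digits) + "000")[:4]


def group_by_code(names):
    """Group names by Soundex code; groups of 2+ are duplicate candidates."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Illustrative short surnames (assumed examples, not the study's data).
    for code, members in group_by_code(["Lee", "Lau", "Low", "Tan", "Tin"]).items():
        print(code, members)
```

In this sketch, a short surname made up mostly of vowels keeps only its first letter plus zero padding, so differently spelled surnames fall into the same candidate group. That is one plausible way the false positives described in the abstract can arise.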

Pages: 74 to 77

Copyright: Copyright (c) The Government of Malaysia, 2017. Used by permission to IARIA.

Publication date: April 23, 2017

Published in: conference

ISSN: 2519-8386

ISBN: 978-1-61208-552-4

Location: Venice, Italy

Dates: from April 23, 2017 to April 27, 2017