Unified Vectorization of Numerical and Textual Data using Self-Organizing Map

Bourennani, Farid; Pu, Ken Q.; Zhu, Ying

International Journal On Advances in Systems and Measurements, volume 2, numbers 2 and 3, 2009

Authors:
Farid Bourennani
Ken Q. Pu
Ying Zhu

Keywords: Pre-Processing, Data Integration, Heterogeneous Data Mining (HDM), Unified Vectorization (UV), Self-Organizing Map (SOM)

Abstract:
Data integration is the problem of combining data residing in different sources and providing the user with a unified view of these data. One of the critical issues in data integration is detecting similar entities based on their content. The difficulty stems from three factors: the data types of the databases are heterogeneous, the database schemas are unfamiliar and heterogeneous as well, and the records are voluminous and time-consuming to analyze. Firstly, to accommodate heterogeneous textual and numerical data types, we propose a new weighting measure for numerical data called Bin Frequency - Inverse Document Bin Frequency (BF-IDBF). The proposed BF-IDBF measure is more efficient than histograms when combined with the Term Frequency - Inverse Document Frequency (TF-IDF) measure for Heterogeneous Data Mining (HDM) by Unified Vectorization (UV). UV makes it possible to combine the algebraic models representing heterogeneous data documents, e.g. textual and numerical, which makes simultaneous HDM simpler and faster than traditional attempts to process the data sequentially by data type. Secondly, to handle the unfamiliar data structure, we use an unsupervised algorithm, the Self-Organizing Map (SOM). Finally, to help the user explore and browse semantically similar entities among the copious amounts of data, we use a SOM-based visualization tool to map the database entities based on their semantic content.
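
The abstract does not state the BF-IDBF formula, so the following is only a minimal sketch of the general idea, assuming that BF-IDBF is computed like TF-IDF with equal-width numeric bins playing the role of terms. The function names (`tf_idf`, `to_bins`, `unified_vectors`), the bin range, and the sample data are all hypothetical and are not taken from the article. The sketch concatenates the textual and numerical weight vectors into one vector per document, which is the kind of input a SOM could then be trained on.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each token by (frequency in doc) * log(N / document frequency).

    docs is a list of token lists. Returns (vocabulary, weight vectors).
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def to_bins(values, lo, hi, n_bins):
    """Map each number to an equal-width bin label such as 'bin3'.

    Assumes hi > lo; values outside [lo, hi] are clamped to the edge bins.
    """
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = max(0, min(int((v - lo) / width), n_bins - 1))
        labels.append(f"bin{idx}")
    return labels


def unified_vectors(text_docs, numeric_docs, lo, hi, n_bins):
    """Concatenate text weights and bin weights into one vector per document.

    ASSUMPTION: bin weights reuse the TF-IDF formula with bins as terms;
    the article's actual BF-IDBF definition may differ.
    """
    _, text_vecs = tf_idf([d.lower().split() for d in text_docs])
    _, bin_vecs = tf_idf([to_bins(d, lo, hi, n_bins) for d in numeric_docs])
    return [t + b for t, b in zip(text_vecs, bin_vecs)]


# Hypothetical sample: three documents, each with a text part and a numeric part.
sample_text = ["red apple fruit", "green apple fruit", "steel bolt"]
sample_numbers = [[1.0, 2.0], [1.5, 9.0], [8.0, 9.5]]
sample_vectors = unified_vectors(sample_text, sample_numbers, lo=0.0, hi=10.0, n_bins=5)
```

Each row of `sample_vectors` has one component per distinct word followed by one component per occupied bin, so textual and numerical evidence live in the same vector space and can be compared with a single distance measure.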

Pages: 142 to 155

Publication date: December 1, 2009

Published in: journal

ISSN: 1942-261x