Automatic KDD Data Preparation Using Parallelism

Hmamouche, Youssef; Ernst, Christian; Casali, Alain

International Journal On Advances in Software, volume 9, numbers 3 and 4, 2016

Authors:
Youssef Hmamouche
Christian Ernst
Alain Casali

Keywords: Data Mining; Data Preparation; Outlier detection and cleaning; Discretization Methods; Task parallelization

Abstract:
We present an original framework for automatic data preparation, applicable to most Knowledge Discovery and Data Mining systems. It is based on the study of statistical features of samples of the target database. For each attribute of the database, we automatically propose an optimized approach to (i) detect and eliminate outliers, and (ii) identify the most appropriate discretization method. Concerning the former, we show that the detection of an outlier depends on whether the data distribution is normal or not. For the latter, what matters is the shape of the density function of the attribute's distribution law. For this reason, we propose an automatic choice of the optimized discretization method, based on a multi-criteria (Entropy, Variance, Stability) evaluation. Most of the associated processing is performed in parallel, using the capabilities of multicore computers. The experiments conducted validate our approach, both on rule detection and on time series prediction. In particular, we show that no single discretization method is the best for all the attributes of a specific database.
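As an illustration of the per-attribute idea described in the abstract, the following is a minimal sketch, not the authors' implementation. The abstract does not say which normality test or which outlier rules the framework uses, so everything here is an assumption made for illustration: the function names, the crude skewness/kurtosis check, the three-sigma rule for attributes that look normal, and the interquartile-range rule for the others.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor


def looks_normal(values, tol=0.5):
    """Crude normality check (an assumption, not the paper's test):
    sample skewness and excess kurtosis must both be close to 0."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return False  # constant attribute: treat as non-normal
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n
    kurt = sum(((v - mean) / sd) ** 4 for v in values) / n - 3.0
    return abs(skew) < tol and abs(kurt) < 2 * tol


def remove_outliers(values):
    """Pick the outlier rule according to the shape of the distribution."""
    if looks_normal(values):
        # Assumed rule for normal data: keep values within mean +/- 3 sd.
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)
        low, high = mean - 3 * sd, mean + 3 * sd
    else:
        # Assumed rule otherwise: Tukey fences based on the quartiles.
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if low <= v <= high]


def clean_table(columns, workers=4):
    """Clean every attribute independently, one task per attribute."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cleaned = pool.map(remove_outliers, columns.values())
        return dict(zip(columns.keys(), cleaned))


if __name__ == "__main__":
    table = {"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]}
    print(clean_table(table))
```

Each attribute is handled as an independent task, which is what makes the work easy to spread over several cores. In CPython, threads do not run CPU-bound code in parallel, so a process pool (`concurrent.futures.ProcessPoolExecutor`) would be the more natural choice for real multicore speed-ups; a thread pool is used here only to keep the sketch short.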

Pages: 167 to 178

Publication date: December 31, 2016

Published in: journal

ISSN: 1942-2628