International Journal On Advances in Intelligent Systems, volume 4, numbers 3 and 4, 2011
An Integrated Approach for Data- and Compute-intensive Mining of Large Data Sets in the GRID
Authors:
Matthias Röhm
Matthias Grabert
Franz Schweiggert
Keywords: data-intensive; data mining; Grid; MapReduce; scheduling.
Abstract:
The growing computerization of modern academic and industrial sectors is generating huge volumes of electronic data. Data mining is considered the key technology for extracting knowledge from these data. Grid and Cloud technologies promise to meet the rapidly rising resource requirements of heterogeneous, large-scale and distributed data mining applications. While most projects addressing these new challenges focus strongly on compute-intensive applications, we introduce a new paradigm to support the development of both compute- and data-intensive applications in heterogeneous environments. Combined storage and compute resources form the basis of this new approach: they allow programs to be executed on the resources that store the data sets and are thus the key to avoiding data transfer. A data-aware scheduling algorithm was developed to utilize all available resources efficiently and to reduce the data transfer of global data-intensive applications, while also supporting compute-intensive applications. Based on the results of the DataMiningGrid project, we developed the DataMiningGrid-Divide&Conquer system, which combines this approach with current Grid and Cloud technologies into a general-purpose data mining system suited to the different aspects of today's data analysis challenges. The system forms the core of the Fleet Data Acquisition Miner, which analyzes the data generated by the Daimler fuel cell vehicle fleet.
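The idea of data-aware scheduling described in the abstract can be illustrated with a minimal sketch. It is not the algorithm from the article, whose details are not given here. It is a simple greedy heuristic under stated assumptions: each job reads at most one named data set, each resource both stores data sets and runs jobs, and load is a single number. All names (`Node`, `Job`, `schedule`, the data set labels) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A combined storage and compute resource (hypothetical model)."""
    name: str
    datasets: set = field(default_factory=set)  # data sets stored on this node
    load: float = 0.0                           # accumulated cost of assigned jobs


@dataclass
class Job:
    """A unit of work; dataset is None for a purely compute-intensive job."""
    name: str
    dataset: Optional[str] = None
    cost: float = 1.0


def schedule(jobs, nodes):
    """Greedy placement: prefer nodes that already store the job's data.

    Returns (plan, transfers), where plan maps job name to node name and
    transfers lists the (dataset, node) pairs that would need a data copy.
    """
    plan = {}
    transfers = []
    for job in jobs:
        # Nodes holding the required data set; empty for compute-only jobs.
        local = [n for n in nodes if job.dataset in n.datasets]
        # Fall back to every node when no node holds the data.
        candidates = local or nodes
        target = min(candidates, key=lambda n: n.load)
        if job.dataset is not None and not local:
            transfers.append((job.dataset, target.name))
        target.load += job.cost
        plan[job.name] = target.name
    return plan, transfers


nodes = [
    Node("A", {"fleet_2010"}),
    Node("B", {"fleet_2011"}),
    Node("C"),
]
jobs = [
    Job("j1", "fleet_2010", cost=2.0),
    Job("j2", "fleet_2011", cost=1.0),
    Job("j3", None, cost=3.0),        # compute-intensive, no input data
    Job("j4", "fleet_2010", cost=1.0),
    Job("j5", "unplaced", cost=1.0),  # data set stored on no node
]
plan, transfers = schedule(jobs, nodes)
```

In this sketch, jobs whose data are already stored somewhere run on that resource, so only `j5` would require a transfer, while the compute-only job `j3` goes to the least-loaded resource. A real scheduler would also have to weigh transfer cost against queue time, handle replicated data sets, and cope with heterogeneous resources.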
Pages: 318 to 331
Copyright: Copyright (c) to authors, 2011. Used with permission.
Publication date: April 30, 2012
Published in: journal
ISSN: 1942-2679