CLOUD COMPUTING 2012, The Third International Conference on Cloud Computing, GRIDs, and Virtualization
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture
Authors:
He Huang
Shanshan Li
Xiaodong Yi
Feng Zhang
Xiangke Liao
Pan Dong
Keywords: high-performance computer; massive data processing; MapReduce paradigm.
Abstract:
MapReduce has emerged as a popular, easy-to-use programming model that numerous organizations adopt for massive data processing. Most existing work on improving MapReduce targets commodity clusters, while little has been done on HPC architectures. With their high-capability compute nodes, networking, and storage systems, HPCs are a promising platform on which to build a massive data processing paradigm. Instead of a distributed file system (DFS), HPCs use a dedicated storage subsystem. We first analyze the performance of MapReduce on a dedicated storage subsystem. The results show that DFS performance scales better as the number of nodes increases, but, at a fixed scale and equal I/O capability, the centralized storage subsystem does a better job of processing large amounts of data. Based on this analysis, two strategies are presented for reducing the data transmitted over the network and distributing the storage I/O, so as to overcome the limited data I/O capability of HPCs. The optimizations for storage localization and network alleviation in the HPC environment improve MapReduce performance by 32.5% and 16.9%, respectively.
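To make the paradigm under discussion concrete, the following is a minimal, single-process sketch of the map/shuffle/reduce phases (a word count); it is purely illustrative and not the authors' implementation. The shuffle phase is the one whose network traffic the storage-localization and network-alleviation strategies aim to reduce.

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce phases (word count).
# Input records and function names are hypothetical placeholders.

def map_phase(record):
    # Emit one (key, 1) pair per word in an input record.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Group intermediate values by key. In a real cluster this phase
    # moves data across the network between map and reduce nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

records = ["big data on hpc", "big data processing"]
intermediate = [pair for r in records for pair in map_phase(r)]
result = reduce_phase(shuffle(intermediate))
# result["big"] == 2 and result["data"] == 2
```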
Pages: 186 to 191
Copyright: Copyright (c) IARIA, 2012
Publication date: July 22, 2012
Published in: conference
ISSN: 2308-4294
ISBN: 978-1-61208-216-5
Location: Nice, France
Dates: from July 22, 2012 to July 27, 2012