Design and Implementation of Context-aware Hadoop InputFormat for Large-scale Scientific Dataset

Kwak, Jae-Hyuck; Yoon, Jun Weon; Hwang, Soonwook

INFOCOMP 2012, The Second International Conference on Advanced Communications and Computation

Authors:
Jae-Hyuck Kwak
Jun Weon Yoon
Soonwook Hwang

Keywords: Data-intensive computing; Hadoop; MapReduce; Context-aware InputFormat

Abstract:
Hadoop is an open-source software framework for the distributed processing of large-scale data across computer clusters using the MapReduce programming model. It is becoming increasingly popular in scientific communities, including bioinformatics, astronomy, and high-energy physics, because of its reliable, scalable data processing. A Hadoop InputFormat describes the input specification for a MapReduce job and defines how data is read from a file into a Mapper instance. Hadoop comes with several implementations of InputFormat. However, they are basically line-oriented and not well suited to context-oriented scientific data processing. In this paper, we design and implement CxtHadoopInputFormat, a context-aware Hadoop InputFormat for large-scale scientific datasets. A scientific dataset consists of many variable-length data units delimited by a user-defined context. CxtHadoopInputFormat is aware of the context in the scientific dataset and enables Hadoop to be used for distributed processing of context-oriented scientific data.
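
The abstract does not say how CxtHadoopInputFormat locates record boundaries, so the following is only a minimal sketch of the general idea, written in plain Python rather than against the Hadoop API. It assumes that each record begins with a fixed byte marker (the hypothetical `#EVENT\n` below stands in for the "user-defined context") and that a split owns every record whose marker starts inside its byte range, which mirrors how line-oriented readers treat lines that cross a split boundary. The function name `read_split` and the sample data are illustrative assumptions, not taken from the paper.

```python
from typing import List


def read_split(data: bytes, start: int, end: int, marker: bytes) -> List[bytes]:
    """Return the records owned by the byte range [start, end).

    Assumption: a record runs from one marker up to the next marker (or
    to the end of the data). A split owns a record if the record's marker
    starts inside the split, even when the record body runs past `end`.
    Bytes that precede the first marker in the data are ignored.
    """
    records = []
    # Skip forward to the first marker that starts at or after `start`.
    pos = data.find(marker, start)
    while pos != -1 and pos < end:
        # The record ends where the next marker begins, or at end of data.
        nxt = data.find(marker, pos + len(marker))
        stop = nxt if nxt != -1 else len(data)
        records.append(data[pos:stop])
        pos = nxt
    return records


# Hypothetical dataset: three variable-length records, cut into 10-byte splits.
data = b"#EVENT\na b c\n#EVENT\nd\n#EVENT\ne f g h\n"
marker = b"#EVENT\n"
split_size = 10
splits = [(s, min(s + split_size, len(data))) for s in range(0, len(data), split_size)]
per_split = [read_split(data, s, e, marker) for s, e in splits]
```

Because ownership depends only on where a marker starts, each record is returned by exactly one split regardless of where the split boundaries fall, so concatenating `per_split` reproduces every record once. A line-oriented reader applied to the same data would instead hand the Mapper individual lines and lose the grouping.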

Pages: 90 to 93

Copyright: Copyright (c) IARIA, 2012

Publication date: October 21, 2012

Published in: conference

ISSN: 2308-3484

ISBN: 978-1-61208-226-4

Location: Venice, Italy

Dates: from October 21, 2012 to October 26, 2012