INFOCOMP 2012, The Second International Conference on Advanced Communications and Computation
Design and Implementation of Context-aware Hadoop InputFormat for Large-scale Scientific Dataset
Authors:
Jae-Hyuck Kwak
Jun Weon Yoon
Soonwook Hwang
Keywords: Data-intensive computing; Hadoop; MapReduce; Context-aware InputFormat
Abstract:
Hadoop is an open-source software framework for the distributed processing of large-scale data across computer clusters using the MapReduce programming model. It is becoming increasingly popular in scientific communities, including bioinformatics, astronomy, and high-energy physics, because it offers reliable, scalable data processing. A Hadoop InputFormat describes the input specification for a MapReduce job and defines how data is read from a file into a Mapper instance. Hadoop comes with several implementations of InputFormat. However, these are basically line-oriented and not suitable for context-oriented scientific data processing. In this paper, we design and implement CxtHadoopInputFormat, a context-aware Hadoop InputFormat for large-scale scientific datasets. A scientific dataset consists of many variable-length data items delimited by a user-defined context. CxtHadoopInputFormat is aware of the context in the scientific dataset and enables Hadoop to be used for distributed processing of context-oriented scientific data.
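The idea of context-delimited records can be illustrated with a minimal sketch. It is plain Python, not Hadoop code, and it is not the authors' CxtHadoopInputFormat. The function name `read_context_records`, the `#EVENT` marker, and the use of line indices in place of byte offsets are all assumptions made for illustration. The sketch borrows the usual convention of split-based readers: a reader skips any partial record at the head of its split and reads past the end of its split to finish its last record, so each record goes to exactly one reader.

```python
from typing import Callable, Iterator, List


def read_context_records(
    lines: List[str],
    start: int,
    end: int,
    is_context_start: Callable[[str], bool],
) -> Iterator[List[str]]:
    """Yield the records whose first line index falls in [start, end).

    A record is one context-start line plus every following line up to,
    but not including, the next context-start line. `is_context_start`
    is the user-defined context test (hypothetical, for illustration).
    """
    i = start
    # Skip lines that belong to a record begun in the previous split;
    # that split's reader is responsible for finishing it. (At start=0
    # this also drops any lines before the first context marker.)
    while i < len(lines) and not is_context_start(lines[i]):
        i += 1
    # Emit every record that begins inside this split.
    while i < end and i < len(lines):
        record = [lines[i]]
        i += 1
        # Read past `end` if necessary so the last record stays whole.
        while i < len(lines) and not is_context_start(lines[i]):
            record.append(lines[i])
            i += 1
        yield record


if __name__ == "__main__":
    # Hypothetical dataset: variable-length records introduced by "#EVENT".
    data = ["#EVENT 1", "a", "b", "#EVENT 2", "c", "#EVENT 3", "d", "e", "f"]

    def marker(line: str) -> bool:
        return line.startswith("#EVENT")

    # Two splits that cut the second record in the middle.
    for split in [(0, 4), (4, 9)]:
        for rec in read_context_records(data, split[0], split[1], marker):
            print(split, rec)
```

In this example the split boundary at index 4 falls inside the second record. The first reader still returns that record whole, and the second reader skips its tail and starts at the third record. A purely line-oriented reader would instead hand each line to a Mapper separately and lose the grouping.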
Pages: 90 to 93
Copyright: Copyright (c) IARIA, 2012
Publication date: October 21, 2012
Published in: conference
ISSN: 2308-3484
ISBN: 978-1-61208-226-4
Location: Venice, Italy
Dates: from October 21, 2012 to October 26, 2012