DATA ANALYTICS 2014, The Third International Conference on Data Analytics


Document Identification with MapReduce Framework

Authors:
Yenumula Reddy

Keywords: Hadoop Distributed File System; Big data; key; shuffle; Apache Zookeeper

Abstract:
Hadoop technology made a breakthrough in processing unformatted data and generates results faster than ever. Before Hadoop, results were produced only for formatted data using SQL and related techniques, which allowed memory, central processing unit, disk, and network input/output to be shared efficiently, but there was no proper system to analyze unformatted data. This paper discusses the MapReduce framework to identify a required document from a stream of documents. We propose a MapReduce-based algorithm to detect sensitive documents, which identifies a sensitive or required document among streams of documents. The algorithm was tested using the Hadoop package and a Java program. The results conclude that the Java program is useful for small documents, while Hadoop technology helps with streams of documents and produces the results faster.
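As a rough illustration of the kind of job the abstract describes, the sketch below shows a minimal Hadoop MapReduce program in Java that flags documents containing terms from a keyword list. The class name SensitiveDocumentJob, the placeholder keyword list, and the command-line input/output paths are assumptions for illustration only and are not taken from the paper's algorithm.

```java
// A minimal sketch (not the authors' published algorithm): a Hadoop MapReduce job
// that flags documents containing terms from a hypothetical sensitive-keyword list.
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SensitiveDocumentJob {

    // Mapper: emits (document name, 1) for every line that contains a sensitive term.
    public static class ScanMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Set<String> SENSITIVE =
                new HashSet<>(Arrays.asList("confidential", "secret", "restricted")); // placeholder terms
        private static final IntWritable ONE = new IntWritable(1);
        private final Text docName = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            for (String token : value.toString().toLowerCase().split("\\W+")) {
                if (SENSITIVE.contains(token)) {
                    docName.set(fileName);
                    context.write(docName, ONE);   // the shuffle groups hits by document name
                    break;                         // one hit per line is enough
                }
            }
        }
    }

    // Reducer: sums hits per document; any document appearing in the output is flagged.
    public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int hits = 0;
            for (IntWritable v : values) {
                hits += v.get();
            }
            context.write(key, new IntWritable(hits));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sensitive document identification");
        job.setJarByClass(SensitiveDocumentJob.class);
        job.setMapperClass(ScanMapper.class);
        job.setCombinerClass(CountReducer.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input: directory of documents in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output: flagged documents with hit counts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

For small document sets the same scan can be done in a plain Java program; the MapReduce version pays off when the stream of documents is large enough to spread across a Hadoop cluster.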

Pages: 81 to 86

Copyright: Copyright (c) IARIA, 2014

Publication date: August 24, 2014

Published in: conference

ISSN: 2308-4464

ISBN: 978-1-61208-358-2

Location: Rome, Italy

Dates: from August 24, 2014 to August 28, 2014