INNOV 2015, The Fourth International Conference on Communications, Computation, Networks and Technologies


Using High Performance Parallel Data Warehouse (HPDW) Big Data Analytical Platform for Big Data Analysis

Authors:
Boon Keong Seah

Keywords: big data; hadoop; hive; parallel process; infiniband; RDBMS; MPP; streaming data; data warehouse; cube; Extract Transform Load (ETL); OLAP; business intelligence; data marts

Abstract:
Data warehouses have traditionally been implemented in Relational Database Management Systems (RDBMS), from the operational data store through to the data marts and OLAP (online analytical processing) cubes used for data analysis. However, analyzing big data on an RDBMS is time-consuming. In addition, with the advent of IoT, social media and other sources of big data, the challenge of processing enormous volumes of streaming data while delivering analysis in near real time calls for a new platform. Incorporating big data into analysis is important because it enlarges the scope of analysis, for example by correlating weather, device information and real-time data with existing historical data. At present, RDBMSs are not designed to handle large data sets, nor to perform join queries between historical and streaming data for deeper insight. In this paper, we introduce the HPDW appliance, a new big data platform that spans stream and batch data processing, data querying through JDBC and ODBC, and integrated multi-data-source BI dashboarding and data science tools. Because it is an appliance, the nodes and all required components are pre-configured, so data scientists and BI analysts can focus on using big data for analysis rather than on the time-consuming setup of big data infrastructure. The HPDW appliance is built on Massively Parallel Processing (MPP) to achieve the in-memory speed it requires, with the Hadoop Distributed File System (HDFS) as the storage layer and high-speed InfiniBand for node connectivity. We describe experimental results on the performance of its query processing. We compare the performance of an RDBMS and the HPDW system on a physical cluster, varying the data warehouse size for fact table queries from 7 GB to 23 GB. Our experimental results show that the HPDW system processes the same SQL queries between 11 and 200 times faster than the RDBMS. In addition, we show the data analysis and data mining that can be performed on HPDW.

Pages: 32 to 39

Copyright: Copyright (c) IARIA, 2015

Publication date: November 15, 2015

Published in: conference

ISSN: 2326-9286

ISBN: 978-1-61208-444-2

Location: Barcelona, Spain

Dates: from November 15, 2015 to November 20, 2015