Closest-Pairs Query Processing in Apache Spark

Mavrommatis, George; Moutafis, Panagiotis; Vassilakopoulos, Michael

CLOUD COMPUTING 2017, The Eighth International Conference on Cloud Computing, GRIDs, and Virtualization

Authors:
George Mavrommatis
Panagiotis Moutafis
Michael Vassilakopoulos

Keywords: Closest-Pairs Query; Spatial Query Processing; Apache Spark

Abstract:
Spatial queries over big datasets can be processed efficiently in a parallel and distributed environment. The (K) Closest-Pair(s) Query, KCPQ, is a common query in many real-life applications involving geographical or, more generally, spatial data. Given two spatial datasets, it finds the (K) closest pair(s) of objects, where each pair takes one object from each dataset. Although the processing of this query has been studied extensively for centralized environments, few solutions have appeared for parallel and distributed frameworks. Apache Spark is one such framework, and it has several advantages over other popular ones, such as Hadoop MapReduce. In this work, we present an algorithm for processing the KCPQ in Apache Spark and experimentally study its efficiency and scalability using big real-world datasets.
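
To make the query definition concrete, the following is a minimal brute-force sketch of the KCPQ in plain Python. It is not the Spark algorithm presented in the paper, which the abstract does not describe. It only illustrates what the query returns, assuming 2D points stored as tuples and Euclidean distance. The function name kcpq_brute_force and the sample datasets P and Q are hypothetical.

```python
import heapq
import math
from itertools import product


def kcpq_brute_force(p, q, k):
    """Return the k closest (distance, a, b) triples, with a from p and b from q.

    Assumption: points are 2D tuples and distance is Euclidean.
    Every cross pair is examined, so the cost is O(|p| * |q|).
    """
    pairs = ((math.dist(a, b), a, b) for a, b in product(p, q))
    # nsmallest keeps only k candidates in memory while scanning all pairs.
    return heapq.nsmallest(k, pairs)


# Hypothetical sample datasets, for illustration only.
P = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]
Q = [(1.0, 0.0), (5.0, 7.0), (20.0, 20.0)]

result = kcpq_brute_force(P, Q, 2)
for distance, a, b in result:
    print(f"{a} - {b}: {distance:.2f}")
```

Because this sketch compares every point of one dataset with every point of the other, its cost grows with the product of the dataset sizes, which is what makes parallel and distributed processing attractive for big datasets.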

Pages: 26 to 31

Copyright: Copyright (c) IARIA, 2017

Publication date: February 19, 2017

Published in: conference

ISSN: 2308-4294

ISBN: 978-1-61208-529-6

Location: Athens, Greece

Dates: from February 19, 2017 to February 23, 2017