International Journal On Advances in Software, volume 10, numbers 3 and 4, 2017


Binary Space Partitioning for Parallel and Distributed Closest-Pairs Query Processing

Authors:
George Mavrommatis
Panagiotis Moutafis
Michael Vassilakopoulos

Keywords: Closest-Pairs Query; Spatial Query Processing; Apache Spark; Binary Space Partitioning

Abstract:
The (k) Closest-Pair(s) Query, kCPQ, consists in finding the (k) closest pair(s) of objects between two spatial datasets. To date, only a few solutions have appeared that process the kCPQ in parallel and distributed frameworks. Currently, Apache Spark is the state of the art among parallel and distributed frameworks, with several advantages over other popular ones, such as Hadoop MapReduce. A major step towards answering a query is proper partitioning, a task that is of even greater importance in distributed environments. In this work, we present algorithms for processing the kCPQ in Apache Spark that split the datasets into strips along an axis. Two variations of the Binary Space Partitioning (BSP) technique are used to partition the data, based on two different criteria: equal size and equal width of the child strips. These schemes are compared to a third strategy (previously developed by us), namely splitting into a predefined number of strips. We have performed an extensive set of experiments, using large real-world datasets, to evaluate the efficiency and scalability of the algorithm and the performance of the different partitioning schemes. Results show that splitting into strips by means of BSP achieves better performance. This is mainly because using the number of points within each strip as the preset criterion, instead of the number of strips, provides more flexibility in fine-tuning the system.
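To make the two ideas in the abstract concrete, the following is a minimal single-machine sketch in plain Python, not the authors' Spark implementation. It assumes 2D points given as `(x, y)` tuples, Euclidean distance, and strips cut along the x-axis. The function names `k_closest_pairs` and `bsp_strips` and the parameter `max_points` are hypothetical and chosen only for illustration. The first function is a brute-force reference for the kCPQ definition; the second recursively halves a strip until it holds at most `max_points` points, using either the equal-size or the equal-width criterion.

```python
import heapq
from itertools import product
from math import hypot


def k_closest_pairs(P, Q, k):
    """Brute-force kCPQ: the k pairs (p, q), p in P and q in Q,
    with the smallest Euclidean distance. Returns (distance, p, q) tuples."""
    pairs = ((hypot(p[0] - q[0], p[1] - q[1]), p, q) for p, q in product(P, Q))
    return heapq.nsmallest(k, pairs)


def bsp_strips(points, max_points, equal_size=True):
    """Recursively split points into vertical strips of at most max_points points.

    equal_size=True  : cut at the median point, so both child strips
                       hold (almost) the same number of points.
    equal_size=False : cut at the midpoint of the strip's x-extent, so both
                       child strips have the same width.
    This is an assumed, simplified reading of the two BSP criteria.
    """
    if max_points < 1:
        raise ValueError("max_points must be at least 1")
    pts = sorted(points)  # sorts by x first, then y
    if len(pts) <= max_points:
        return [pts]
    if equal_size:
        mid = len(pts) // 2
    else:
        cut = (pts[0][0] + pts[-1][0]) / 2
        mid = sum(1 for p in pts if p[0] <= cut)
    # Guard: if the cut cannot separate the points (e.g. all share one x),
    # stop splitting and accept an oversized strip.
    if mid == 0 or mid == len(pts):
        return [pts]
    return (bsp_strips(pts[:mid], max_points, equal_size)
            + bsp_strips(pts[mid:], max_points, equal_size))


if __name__ == "__main__":
    data = [(0, 0), (1, 0), (2, 0), (10, 0)]
    print(bsp_strips(data, 2, equal_size=True))
    print(bsp_strips(data, 2, equal_size=False))
    print(k_closest_pairs([(0, 0), (5, 5)], [(1, 0), (5, 7), (9, 9)], 2))
```

On skewed data the two criteria behave differently: the equal-size cut always yields balanced strips, while the equal-width cut may need more levels of recursion in dense regions and produces strips of uneven population. In both cases the tunable quantity is the per-strip point limit rather than the number of strips, which is the property the abstract credits for the added flexibility.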

Pages: 275 to 285

Copyright: Copyright (c) to authors, 2017. Used with permission.

Publication date: December 31, 2017

Published in: journal

ISSN: 1942-2628