Performance Comparison of Big-Data Technologies in Locating Intersections in...

Doan, K., A. Oloso, K. Kuo, and T. L. Clune (2014), Performance Comparison of Big-Data Technologies in Locating Intersections in Satellite Ground Tracks, Conference, Harvard University, December, 2014, 14-16.

The performance and ease of extensibility of two Big-Data technologies, SciDB and Hadoop/MapReduce (HD/MR), are evaluated on identical hardware for an Earth science use case: locating intersections between the ground tracks of two NASA remote sensing satellites. SciDB is found to be 1.5 to 2.5 times faster than HD/MR, although the performance of HD/MR approaches that of SciDB as the data size or the cluster size increases. Performance in both SciDB and HD/MR is largely insensitive to the chunk size (i.e., granularity). At this time, we find HD/MR easier to extend than SciDB. abstractions and tools that make it possible for different types of users to work with data efficiently without detailed knowledge of the underlying implementation.

Since the publication of MapReduce (MR) [1], data scientists and technologists have tried to adapt and extend it to data analysis applications in many domains. Hadoop (HD) [2], the open-source implementation of MapReduce, has thus become the default choice for almost every Big-Data analysis application, but its sub-optimal performance has been noted in a number of scenarios [3, 4].

Recent technological developments, such as SciDB [5], which specifically target multidimensional arrays, are providing an attractive alternative to Hadoop/MapReduce (HD/MR) for scientific data analysis. SciDB, a next-generation array-model parallel database system, not only indexes the data it ingests for fast extraction and retrieval, but also provides an attractive, albeit still basic, mathematical/statistical toolbox for data analysis. Like HD/MR, SciDB exploits the affinity of compute and data.

In this paper, we compare two technologies, Hadoop and SciDB, in terms of 1) performance and 2) ease of implementation, using a common use case in Earth science remote sensing. We first describe our use case scenario in Section 2. In Section 3, we elaborate on a few key considerations regarding the processing of ground track arrays, and in Section 4 we describe the array data used. The Big-Data algorithms used for our evaluation are introduced in Section 5. In Section 6, we describe our hardware platform, detail our experiments, and report results. We conclude the paper with a discussion and our plan for future work.

2. Use Case Description

The problems we face today concerning Earth’s future are complex and carry grave consequences. We need long-term and comprehensive observations of Earth’s conditions to understand this complex system of systems. However, approximately two-thirds of Earth’s surface is ocean, where direct and dense measurements are difficult to obtain. Remote sensing is therefore the more cost-effective means of obtaining the measurements required to monitor Earth’s current health and to provide data for predicting its future.

Remote sensing problems, however, are usually underconstrained; that is, the problem space is often of higher dimensionality than that covered by the instruments’ observations. To gain better constraints and to reduce ambiguity, scientists strive to obtain as much simultaneous, colocated, and independent information as possible concerning the problem space. Our use case is thus to find nearly coincident spaceborne radar measurements of two NASA Earth science