Section 1: Publication
Publication Type
Book Chapter
Authorship
Debasish Chakroborti, Banani Roy, Amit Kumar Mondal, Golam Mostaeen, Ralph Deters, Chanchal K. Roy and Kevin A. Schneider.
Title
A Data Management Scheme for Micro-Level Modular Computation-intensive Programs in Big Data Platforms
Year
2019
Publication Outlet
In: Moshirpour M., Far B., Alhajj R. (eds) Highlighting the Importance of Big Data Management and Analysis for Various Applications, 20pp., vol x. Springer (to appear with minor revisions)
DOI
ISBN
978-3-030-32586-2
ISSN
Citation
Debasish Chakroborti, Banani Roy, Amit Kumar Mondal, Golam Mostaeen, Ralph Deters, Chanchal K. Roy and Kevin A. Schneider. A Data Management Scheme for Micro-Level Modular Computation-intensive Programs in Big Data Platforms, In: Moshirpour M., Far B., Alhajj R. (eds) Highlighting the Importance of Big Data Management and Analysis for Various Applications, 20pp., vol x. Springer (to appear with minor revisions). Book Chapter
Abstract
Big Data analytics systems developed with parallel distributed processing frameworks (e.g., Hadoop and Spark) are becoming popular for finding important insights in huge amounts of heterogeneous data (e.g., image, text, and sensor data). These systems offer a wide range of tools and connect them to form workflows for processing Big Data. Many researchers have already proposed independent schemes for managing the programs and data of workflows, and most of these systems include some form of data or metadata management. However, to the best of our knowledge, no study particularly discusses the performance implications of utilizing the intermediate states of data and programs generated at various execution steps of a workflow in distributed platforms. To address this shortcoming, we propose a Big Data management scheme for micro-level modular computation-intensive programs in a Spark- and Hadoop-based platform. In this paper, we investigate whether managing the intermediate states can speed up the execution of an image processing pipeline consisting of various image processing tools/APIs in the Hadoop Distributed File System (HDFS) while ensuring appropriate reusability and error monitoring. Our experiments yield promising results; for example, we report that with intermediate data management we can save up to 87% of the computation time for an image processing job.
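To illustrate the idea of intermediate-state reuse described in the abstract, the following is a minimal PySpark sketch, not the paper's actual scheme: each pipeline stage's output is materialized in HDFS and reused on later runs instead of being recomputed. The base path, stage names, and stand-in transforms are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intermediate-state-reuse").getOrCreate()
sc = spark.sparkContext

HDFS_BASE = "hdfs:///pipeline/intermediate"  # hypothetical base directory

def hdfs_exists(path):
    # Ask HDFS (via Spark's JVM gateway) whether a stage output is present.
    jvm = sc._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    return fs.exists(jvm.org.apache.hadoop.fs.Path(path))

def run_stage(name, input_rdd, transform):
    # Run one pipeline stage; reuse its materialized HDFS output if present.
    out = "{}/{}".format(HDFS_BASE, name)
    if hdfs_exists(out):
        return sc.pickleFile(out)       # reuse the intermediate state
    result = input_rdd.map(transform)
    result.saveAsPickleFile(out)        # materialize for future runs
    return result

# Two chained image-processing stages; the transforms are placeholders.
raw = sc.binaryFiles("hdfs:///images/raw")            # (path, bytes) pairs
gray = run_stage("grayscale", raw, lambda rec: rec)   # stand-in transform
edges = run_stage("edges", gray, lambda rec: rec)
print(edges.count())

On a repeated run, both stages are loaded directly from HDFS rather than recomputed, which is the kind of saving the abstract's 87% figure refers to.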
Plain Language Summary