Section 1: Publication
Publication Type
Book Chapter
Authorship
Debasish Chakroborti, Banani Roy, Amit Kumar Mondal, Golam Mostaeen, Ralph Deters, Chanchal K. Roy and Kevin A. Schneider.
Title
A Data Management Scheme for Micro-Level Modular Computation-intensive Programs in Big Data Platforms
Year
2019
Publication Outlet
In: Moshirpour M., Far B., Alhajj R. (eds) Highlighting the Importance of Big Data Management and Analysis for Various Applications, 20pp., vol x. Springer (to appear with minor revisions)
DOI
ISBN
978-3-030-32586-2
ISSN
Citation
Debasish Chakroborti, Banani Roy, Amit Kumar Mondal, Golam Mostaeen, Ralph Deters, Chanchal K. Roy and Kevin A. Schneider. A Data Management Scheme for Micro-Level Modular Computation-intensive Programs in Big Data Platforms, In: Moshirpour M., Far B., Alhajj R. (eds) Highlighting the Importance of Big Data Management and Analysis for Various Applications, 20pp., vol x. Springer (to appear with minor revisions). Book Chapter
Abstract
Big Data analytics systems developed with parallel distributed processing frameworks (e.g., Hadoop and Spark) are becoming popular for finding important insights in huge amounts of heterogeneous data (e.g., image, text, and sensor data). These systems offer a wide range of tools and connect them to form workflows for processing Big Data. Many researchers have already proposed independent schemes for managing the programs and data of workflows, and most of these systems include some form of data or metadata management. However, to the best of our knowledge, no study particularly discusses the performance implications of utilizing the intermediate states of data and programs generated at various execution steps of a workflow in distributed platforms. To address this shortcoming, we propose a Big Data management scheme for micro-level modular computation-intensive programs in a Spark- and Hadoop-based platform. In this paper, we investigate whether managing the intermediate states can speed up the execution of an image processing pipeline consisting of various image processing tools/APIs in the Hadoop Distributed File System (HDFS) while ensuring appropriate reusability and error monitoring. Our experiments yield promising results; for example, we report that with intermediate data management we can save up to 87% of the computation time for an image processing job.
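To illustrate the idea of intermediate-state reuse described in the abstract, the following is a minimal PySpark sketch, not the paper's actual scheme: each pipeline stage's output is materialized in HDFS and reused on later runs instead of being recomputed. The base path, stage names, and stand-in transforms are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intermediate-state-reuse").getOrCreate()
sc = spark.sparkContext

HDFS_BASE = "hdfs:///pipeline/intermediate"  # hypothetical base directory

def hdfs_exists(path):
    # Ask HDFS (via Spark's JVM gateway) whether a stage output is present.
    jvm = sc._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    return fs.exists(jvm.org.apache.hadoop.fs.Path(path))

def run_stage(name, input_rdd, transform):
    # Run one pipeline stage; reuse its materialized HDFS output if present.
    out = "{}/{}".format(HDFS_BASE, name)
    if hdfs_exists(out):
        return sc.pickleFile(out)       # reuse the intermediate state
    result = input_rdd.map(transform)
    result.saveAsPickleFile(out)        # materialize for future runs
    return result

# Two chained image-processing stages; the transforms are placeholders.
raw = sc.binaryFiles("hdfs:///images/raw")            # (path, bytes) pairs
gray = run_stage("grayscale", raw, lambda rec: rec)   # stand-in transform
edges = run_stage("edges", gray, lambda rec: rec)
print(edges.count())

On a repeated run, both stages are loaded directly from HDFS rather than recomputed, which is the kind of saving the abstract's 87% figure refers to.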
Plain Language Summary