Improving resource utilization and fault tolerance in large simulations via actors

Section 1: Publication

Publication Type

Authorship

Klenk, K., Spiteri, R.J.

Title

Improving resource utilization and fault tolerance in large simulations via actors

Year

2024

Publication Outlet

Cluster Computing

DOI

https://doi.org/10.1007/s10586-024-04318-5

ISBN

ISSN

Citation

Klenk, K., Spiteri, R.J. (2024). Improving resource utilization and fault tolerance in large simulations via actors. Cluster Computing. https://doi.org/10.1007/s10586-024-04318-5

Abstract

Large simulations with many independent sub-simulations are common in scientific computing. There are numerous challenges, however, associated with performing such simulations in shared computing environments. For example, sub-simulations may have wildly varying completion times or not complete at all, leading to unpredictable runtimes as well as unbalanced and inefficient use of human and computational resources. In this study, we use the actor model of concurrent computation to improve both the resource utilization and fault tolerance for large-scale scientific computing simulations. More specifically, we use actors in the SUMMA model to manage a large-scale hydrological simulation over the North American continent with over 500,000 independent sub-simulations. We find that the actors implementation outperforms a standard array job submission as well as the job submission tool GNU Parallel by better balancing the computational load across processors. The actors implementation also improves fault tolerance and can eliminate the user intervention required to detect and re-submit failed jobs.
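The load-balancing and fault-tolerance ideas in the abstract can be illustrated with a minimal sketch. It is not the SUMMA implementation and does not use an actor library. It only models two scheduling policies on a list of task durations, plus a supervisor loop that re-queues failed tasks. All names (static_makespan, dynamic_makespan, supervise) and the example durations are hypothetical and chosen for illustration.

```python
import heapq
from collections import deque


def static_makespan(durations, n_workers):
    """Array-job style: tasks are split up front into contiguous,
    equally sized chunks, one chunk per worker. Returns the time at
    which the slowest worker finishes."""
    chunk = -(-len(durations) // n_workers)  # ceiling division
    return max(
        sum(durations[i:i + chunk])
        for i in range(0, len(durations), chunk)
    )


def dynamic_makespan(durations, n_workers):
    """Actor-style: a coordinator hands the next task to whichever
    worker becomes idle first. Returns the finish time of the last
    worker."""
    finish_times = [0.0] * n_workers
    heapq.heapify(finish_times)
    for d in durations:
        earliest_idle = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest_idle + d)
    return max(finish_times)


def supervise(task_ids, run, max_attempts=3):
    """Re-queue failed tasks automatically instead of relying on the
    user to detect and re-submit them. `run(task_id, attempt)` is a
    hypothetical callable that raises RuntimeError on failure."""
    queue = deque((tid, 1) for tid in task_ids)
    done, failed = [], []
    while queue:
        tid, attempt = queue.popleft()
        try:
            run(tid, attempt)
            done.append(tid)
        except RuntimeError:
            if attempt < max_attempts:
                queue.append((tid, attempt + 1))
            else:
                failed.append(tid)
    return done, failed


if __name__ == "__main__":
    # Four slow sub-simulations followed by four fast ones, two workers.
    durations = [10, 10, 10, 10, 1, 1, 1, 1]
    print("static :", static_makespan(durations, 2))
    print("dynamic:", dynamic_makespan(durations, 2))
```

In this toy case the static split gives one worker all of the slow tasks, while the dynamic policy keeps both workers busy until the end. Dynamic dispatch is not guaranteed to beat every possible static split; the sketch only shows why uneven completion times penalize a fixed partition.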

Plain Language Summary