Publication Type
Conference Presentation
Authorship
Kyle Klenk, Raymond J. Spiteri, Reza Zolfaghari, Kevin R. Green
Title
Using actors to increase scalability and fault tolerance of SUMMA
Year
2022
Publication Outlet
AOSM2022
Citation
Kyle Klenk, Raymond J. Spiteri, Reza Zolfaghari, Kevin R. Green (2022). Using actors to increase scalability and fault tolerance of SUMMA. Proceedings of the GWF Annual Open Science Meeting, May 16-18, 2022.
Abstract
SUMMA is a modeling framework that is used for hydrological simulations over large-scale domains, such as the North American continent, which consists of more than half a million hydrological response units (HRUs). In the standard approach to performing such simulations on shared computing resources such as Compute Canada, the HRUs are divided into batches, and the batches are then submitted as individual jobs. For the continental North America run described, the batch size is around 500, which results in approximately 1000 jobs. There are a few issues with this approach. First, each job can utilize only one CPU. Second, if any HRU fails, the job is halted. The failed HRU then has to be identified, have its settings adjusted, and be resubmitted manually. Besides the labour-intensive nature of this task, the resubmission to the queue risks further delay because the priority within the queue may decrease with each job submission. In other words, the current approach to running large simulations is neither scalable nor fault tolerant. To address these issues, we redesigned SUMMA to leverage the actor model to separate SUMMA's state from the global structure of HRUs. The actor model is an abstraction of concurrent computation that uses actors as the basic units of computation. An actor has a private state and its own thread of execution and can communicate with other actors only through messages. We developed a new implementation known as SUMMA-Actors that represents each HRU as an HRU-Actor. Separating HRUs into actor components allows us to run them concurrently, thus increasing scalability because each job can utilize more CPUs, resulting in decreased run-time. We have observed essentially perfect scaling when solving one job of 500 HRUs with one, two, and four CPUs, with run-times (HH:MM:SS) of 16:24:50, 08:17:48, and 04:06:53, respectively. By comparison, the standard implementation has a run-time of 14:32:22.
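The scaling claim can be checked with a little arithmetic on the run-times quoted above. The following is a minimal sketch, not part of the presentation: the helper names `to_seconds` and `scaling_table` are illustrative, and the only data used are the four run-times reported in the abstract.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Run-times reported in the abstract for one job of 500 HRUs.
actor_runtimes = {1: "16:24:50", 2: "08:17:48", 4: "04:06:53"}
standard_runtime = "14:32:22"


def scaling_table(runtimes, baseline_cpus=1):
    """Return (cpus, speedup, parallel efficiency) relative to the baseline run."""
    base = to_seconds(runtimes[baseline_cpus])
    rows = []
    for cpus, hms in sorted(runtimes.items()):
        speedup = base / to_seconds(hms)
        efficiency = speedup / (cpus / baseline_cpus)
        rows.append((cpus, speedup, efficiency))
    return rows


for cpus, speedup, efficiency in scaling_table(actor_runtimes):
    print(f"{cpus} CPU(s): speedup {speedup:.2f}, efficiency {efficiency:.1%}")

# Ratio of the single-CPU actor run to the standard implementation.
overhead = to_seconds(actor_runtimes[1]) / to_seconds(standard_runtime)
print(f"1-CPU actor run / standard run: {overhead:.2f}")
```

With these figures the speedup relative to one CPU is about 1.98 on two CPUs and about 3.99 on four. The single-CPU actor run takes roughly 13% longer than the standard implementation, while the four-CPU run is roughly 3.5 times faster than it.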
To enable fault tolerance, SUMMA-Actors uses state separation to contain failures within a single HRU and a hierarchical supervision strategy provided by the actor model. The former prevents the failure of one HRU from causing the whole job to fail, allowing the remaining HRUs to continue. The latter allows for the implementation of a supervisor actor called the job-actor. The job-actor allows SUMMA-Actors to address failures at run-time, modify the settings of the failed HRU, and restart it without going back into the queue. All told, SUMMA-Actors provides a substantive reduction in the wall-clock time and human effort required to complete large-scale SUMMA simulations.
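The supervision pattern described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the SUMMA-Actors implementation, which the abstract does not detail: the classes `HRUActor` and `JobActor`, the message names, the stand-in solver `simulate_hru`, the `dt` setting, and the retry policy of halving `dt` are all hypothetical. Each actor owns its state, runs on its own thread, and interacts with the supervisor only through mailbox messages.

```python
import queue
import threading


def simulate_hru(hru_id, settings):
    """Hypothetical stand-in for one HRU solve; some HRUs fail with a large dt."""
    if hru_id % 3 == 0 and settings["dt"] > 450:
        raise RuntimeError(f"HRU {hru_id} did not converge with dt={settings['dt']}")
    return {"hru": hru_id, "dt": settings["dt"]}


class HRUActor(threading.Thread):
    """Actor with private state, its own thread, and a mailbox."""

    def __init__(self, hru_id, supervisor_mailbox):
        super().__init__(daemon=True)
        self.hru_id = hru_id
        self.mailbox = queue.Queue()
        self._supervisor = supervisor_mailbox

    def run(self):
        kind, settings = self.mailbox.get()  # wait for work from the supervisor
        if kind != "run":
            return
        try:
            result = simulate_hru(self.hru_id, settings)
            self._supervisor.put(("done", self.hru_id, result))
        except Exception as err:  # the failure is contained in this actor
            self._supervisor.put(("failed", self.hru_id, settings, str(err)))


class JobActor:
    """Supervisor: spawns HRU actors and restarts failed ones with new settings."""

    def __init__(self, hru_ids, settings, max_attempts=4):
        self.mailbox = queue.Queue()
        self.hru_ids = list(hru_ids)
        self.settings = settings
        self.max_attempts = max_attempts

    def _spawn(self, hru_id, settings):
        actor = HRUActor(hru_id, self.mailbox)
        actor.start()
        actor.mailbox.put(("run", settings))

    def run(self):
        attempts = {hru_id: 1 for hru_id in self.hru_ids}
        results, abandoned = {}, {}
        for hru_id in self.hru_ids:
            self._spawn(hru_id, dict(self.settings))
        while len(results) + len(abandoned) < len(self.hru_ids):
            message = self.mailbox.get()
            if message[0] == "done":
                _, hru_id, result = message
                results[hru_id] = result
                continue
            _, hru_id, settings, error = message
            if attempts[hru_id] >= self.max_attempts:
                abandoned[hru_id] = error  # give up on this HRU only
                continue
            attempts[hru_id] += 1
            # Assumed recovery policy: retry the failed HRU with half the step.
            self._spawn(hru_id, dict(settings, dt=settings["dt"] / 2))
        return results, abandoned, attempts


results, abandoned, attempts = JobActor(range(6), {"dt": 900}).run()
print(sorted(results), abandoned, attempts)
```

In this sketch a failing HRU never stops its siblings, and the retry happens inside the running job rather than through a new queue submission. Note that CPython threads do not run CPU-bound work in parallel, so the example shows the message-passing and supervision structure only, not the reported speedup.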
Program Affiliations
GWF: Global Water Futures
Project Affiliations
GWF-CS: Computer Science
Publication Stage
N/A
Theme
Hydrology and Terrestrial Ecosystems
Presentation Format
10-minute oral presentation
Additional Information
AOSM2022 core-CS
First Author: Kyle Klenk, University of Saskatchewan
Additional Authors: Raymond J. Spiteri, University of Saskatchewan; Reza Zolfaghari, University of Saskatchewan; Kevin R. Green, University of Saskatchewan