A General Framework for Predicting the Optimal Computing Configurations for Climate-Driven Ecological Forecasting Models
Farley, Scott Sherwin
Rapidly growing databases are swiftly transforming the field of biodiversity modeling into a big-data science, characterized by high-volume, heterogeneous datasets with high uncertainty. As climate warming and land use change accelerate, it is imperative that scientists leverage all available data to generate robust, high-resolution, and accurate predictions of biodiversity changes, in order to protect vital ecosystem services and minimize loss. In recent years, cloud computing’s flexibility and scalability have caused it to emerge in many fields as the standard for analyzing massive datasets. However, cloud computing has been underutilized in biodiversity studies and climate-driven ecological forecasting specifically. While the cloud’s novel operating model allows users to provision and release virtual instances from a utility provider on demand, ecological researchers currently have little guidance about the most efficient configuration to use. Researchers face tradeoffs between model accuracy, computing cost, and model execution time, and the choice of configuration has scientific and financial ramifications. In this thesis, I present a general conceptual framework for approaching these tradeoffs and introduce a model for determining the optimal data-hardware configuration for a species distribution modeling (SDM) workflow. I develop and test three hypotheses relating model accuracy and cost to algorithm inputs and computer hardware and collect an empirical dataset of over 25,000 experimental trials of four leading SDMs (generalized additive models, GAM; boosted regression trees, GBM-BRT; multivariate adaptive regression splines, MARS; random forests, RF), using Bayesian regression trees to model the drivers of SDM accuracy and execution time.
The computing performance models (CPMs) explained more than half of the variance in the dataset in all cases and can improve future allocation of time and money, as well as inform model developers on future priorities. The CPMs are also a key component in identifying the data-hardware configurations that maximize accuracy and jointly minimize SDM execution time and cost. In general, the SDMs examined were most accurate when fit with a large number of training examples and many covariates, and accuracy was largely unaffected by hardware configuration. The optimal hardware for GAM, GBM-BRT, and MARS was low memory with few CPUs. RF, an ensemble technique, can more easily leverage parallel infrastructure, resulting in an optimal hardware configuration of four to seven CPU cores. Many widely used SDMs are structurally unable to take advantage of the increased computing power offered by cloud computing. As datasets continue to grow, new algorithms and software packages must be developed to explicitly take advantage of modern high-performance computing techniques. The CPMs developed here are extensible to other forms of biodiversity and ecological modeling studies. Efficiency studies such as the one presented here are valuable in facilitating uptake of new technologies by providing rigorous evidence of their utility.
Ecological forecasting models
Species distribution modeling (SDM)
Includes Equations, Figures, Charts, Graphs, Tables, Appendices and Bibliography.