COMBUSTI/O. Abstractions facilitating parallel execution of programs implementing common I/O patterns in a pipelined fashion as workflows in Spark
In light of the recent explosion in data generation in the life sciences, increasing downstream analysis capabilities is paramount to address the asymmetry between innovation in data creation and processing capacity. Many tools in current use are sequential programs, often with convoluted dependencies that cause workflows to crash due to misconfiguration, which is detrimental to both development and production and induces duplicate work upon re-execution. This thesis proposes a distributed, easy-to-use, general framework for workflow creation and ad hoc parallelization of existing serial programs. To reduce the wall-clock time consumed by big data processing pipelines, processing is scaled out horizontally, while supporting recovery and tool validation. COMBUSTI/O is a cloud- and HPC-ready framework for pipelined execution of unmodified third-party program binaries on Spark. It supports tools that require named input and output files, use and redirection of standard streams, and combinations of these, as well as both coarse- and fine-grained state recovery. Designed to run independently, its scalability reduces to that of Spark and the underlying fault-tolerant big data frameworks. We evaluate COMBUSTI/O on real and synthetic workflows, demonstrating its suitability for complex compute-intensive workflows as well as its applicability to data-intensive and latency-sensitive workflows, and validate the coarse-grained recovery mechanism and its cost for the different flavors of workflows. We show stage recovery to be beneficial during development, for compute-intensive workflows, and for error-prone data-intensive workflows. Moreover, we show that the I/O overhead of COMBUSTI/O grows for data-intensive workflows, and that our remote tool execution is inexpensive. COMBUSTI/O is open-sourced at https://github.com/jarlebass/combustio and is currently used by SfB at the University of Tromsø.
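As a rough illustration of the standard-stream pattern the abstract mentions, the sketch below splits records into partitions, feeds each partition to an unmodified external program on stdin, and collects its stdout in order. It is not COMBUSTI/O's API and does not use Spark: the names `TOOL`, `run_tool`, and `run_partitioned` are hypothetical, the "tool" is a stand-in one-line Python process, and a local thread pool takes the place of a cluster.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an unmodified third-party binary:
# it reads text on stdin and writes the uppercased text to stdout.
TOOL = [sys.executable, "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())"]


def run_tool(partition: str) -> str:
    """Feed one partition to the tool on stdin and capture its stdout."""
    result = subprocess.run(
        TOOL, input=partition, capture_output=True, text=True, check=True
    )
    return result.stdout


def run_partitioned(records, num_partitions=2):
    """Run the tool once per contiguous partition, in parallel.

    Outputs are concatenated in partition order, so the result matches
    a serial run for tools that process records independently.
    """
    size = max(1, -(-len(records) // num_partitions))  # ceiling division
    partitions = ["".join(records[i:i + size])
                  for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        outputs = list(pool.map(run_tool, partitions))
    return "".join(outputs)


if __name__ == "__main__":
    print(run_partitioned(["acgt\n", "ttga\n", "ccat\n"]), end="")
```

The sketch only works as a parallelization strategy when the tool treats records independently; tools that need named input and output files or coarse-grained recovery, as described in the abstract, would require additional handling that is not shown here.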
Publisher: UiT Norges arktiske universitet
UiT The Arctic University of Norway