COMBUSTI/O. Abstractions facilitating parallel execution of programs implementing common I/O patterns in a pipelined fashion as workflows in Spark

Fagerli, Jarl

dc.contributor.advisor	Bongo, Lars Ailo
dc.contributor.author	Fagerli, Jarl
dc.date.accessioned	2016-07-01T10:32:02Z
dc.date.available	2016-07-01T10:32:02Z
dc.date.issued	2016-05-31
dc.description.abstract	In light of recent years’ exploding data generation in life sciences, increasing downstream analysis capabilities is paramount to address the asymmetry of innovation in data creation contra processing capacities. Many contemporaneously used tools are sequential programs, ofttimes including convoluted dependencies leading to workflows crashing due to misconfiguration, detrimental to both development efforts and production, also inducing duplicate work upon re-execution. This thesis proposes a distributed and easy-to-use general framework for workflow creation and ad hoc parallelization of existing serial programs. In furtherance of reducing wall-clock time consumed by big data processing pipelines, its processing is horizontally scaled out, whilst supporting recovery and tool validation. COMBUSTI/O is a cloud- and HPC-ready framework for pipelined execution of unmodified third-party program binaries on Spark. It supports tool requirements of named input and output files, usage and redirection of standard streams, and combinations of these, as well as both coarse and fine granularity state recovery. Designed to run independently, its scalability is reduced to Spark and the underlying fault-tolerant big data frameworks. We evaluate COMBUSTI/O on real and synthetic workflows, demonstrating its propriety for facilitation of complex compute-intensive workflows, as well as its applicability for data-intensive and latency-sensitive workflows, and validate the coarse-grained recovery mechanism and its cost for the different flavors of workflows. We show stage recovery to be beneficial during development, for compute-intensive workflows, and for error-prone data-intensive workflows. Moreover, we show that the I/O overhead of COMBUSTI/O grows for data-intensive workflows, and that our remote tool execution is inexpensive. COMBUSTI/O is open-sourced at https://github.com/jarlebass/combustio, and currently used by SfB at the University of Tromsø.	en_US
dc.identifier.uri	https://hdl.handle.net/10037/9361
dc.identifier.urn	URN:NBN:no-uit_munin_8919
dc.language.iso	eng	en_US
dc.publisher	UiT Norges arktiske universitet	en_US
dc.publisher	UiT The Arctic University of Norway	en_US
dc.rights.accessRights	openAccess
dc.rights.holder	Copyright 2016 The Author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/3.0	en_US
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)	en_US
dc.subject.courseID	INF-3981
dc.subject	VDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420::Systemutvikling og – arbeid: 426	en_US
dc.subject	VDP::Mathematics and natural science: 400::Information and communication science: 420::System development and system design: 426	en_US
dc.title	COMBUSTI/O. Abstractions facilitating parallel execution of programs implementing common I/O patterns in a pipelined fashion as workflows in Spark	en_US
dc.type	Master thesis	en_US
dc.type	Mastergradsoppgave	en_US