Mario. A system for iterative and interactive processing of biological data

Ernstsen, Martin

dc.contributor.advisor	Bongo, Lars Ailo
dc.contributor.advisor	Willassen, Nils-Peder
dc.contributor.author	Ernstsen, Martin
dc.date.accessioned	2014-01-13T09:45:40Z
dc.date.available	2014-01-13T09:45:40Z
dc.date.issued	2013-11-15
dc.description.abstract	This thesis addresses challenges in metagenomic data processing on clusters of computers; in particular, the need for interactive response times during development, debugging, and tuning of data processing pipelines. Typical metagenomics pipelines batch process data and have execution times ranging from hours to months, making configuration and tuning time-consuming and impractical. We have analyzed the data usage of metagenomic pipelines, including a visualization frontend, to develop an approach that uses an online, data-parallel processing model, where changes in the pipeline configuration are quickly reflected in updated pipeline output available to the user. We describe the design and implementation of the Mario system, which realizes the approach. Mario is a distributed system built on top of the HBase storage system that provides data processing using commonly used bioinformatics applications, interactive tuning, automatic parallelization, and data provenance support. We evaluate Mario and its underlying storage system, HBase, using a benchmark developed to simulate I/O loads that are representative of biological data processing. The results show that Mario adds less than 100 milliseconds to the end-to-end latency of processing one item of data. This low latency, combined with Mario’s storage of all intermediate data generated by the processing, enables easy parameter tuning. In addition to improved interactivity, Mario also offers integrated data provenance by storing detailed pipeline configurations associated with the data. The evaluation of Mario demonstrates that it can be used to achieve more interactivity in the configuration of pipelines for processing biological data. We believe that biology researchers can take advantage of this interactivity to perform better parameter tuning, which may lead to more accurate analyses and ultimately to new scientific discoveries.	en
dc.identifier.uri	https://hdl.handle.net/10037/5762
dc.identifier.urn	URN:NBN:no-uit_munin_5457
dc.language.iso	eng	en
dc.publisher	UiT Norges arktiske universitet	en
dc.publisher	UiT The Arctic University of Norway	en
dc.rights.accessRights	openAccess
dc.rights.holder	Copyright 2013 The Author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/3.0	en_US
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)	en_US
dc.subject.courseID	INF-3990	en
dc.subject	VDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420::Systemutvikling og -arbeid: 426	en
dc.subject	VDP::Mathematics and natural science: 400::Information and communication science: 420::System development and system design: 426	en
dc.title	Mario. A system for iterative and interactive processing of biological data	en
dc.type	Master thesis	en
dc.type	Mastergradsoppgave	en