dc.contributor.advisor | Bongo, Lars Ailo | |
dc.contributor.advisor | Willassen, Nils-Peder | |
dc.contributor.author | Ernstsen, Martin | |
dc.date.accessioned | 2014-01-13T09:45:40Z | |
dc.date.available | 2014-01-13T09:45:40Z | |
dc.date.issued | 2013-11-15 | |
dc.description.abstract | This thesis addresses challenges in metagenomic data processing on clusters of computers, in particular the need for interactive response times during development, debugging, and tuning of data processing pipelines. Typical metagenomics pipelines batch process data and have execution times ranging from hours to months, making configuration and tuning time-consuming and impractical.
We have analyzed the data usage of metagenomic pipelines, including a visualization frontend, to develop an approach that uses an online, data-parallel processing model, in which changes to the pipeline configuration are quickly reflected in updated pipeline output available to the user.
We describe the design and implementation of the Mario system that realizes the approach. Mario is a distributed system, built on top of the HBase storage system, that provides data processing using commonly used bioinformatics applications, interactive tuning, automatic parallelization, and data provenance support.
We evaluate Mario and its underlying storage system, HBase, using a benchmark developed to simulate I/O loads that are representative of biological data processing. The results show that Mario adds less than 100 milliseconds to the end-to-end latency of processing one item of data. This low latency, combined with Mario’s storage of all intermediate data generated by the processing, enables easy parameter tuning. In addition to improved interactivity, Mario also offers integrated data provenance by storing detailed pipeline configurations associated with the data.
The evaluation of Mario demonstrates that it can be used to achieve more interactivity in the configuration of pipelines for processing biological data. We believe that biology researchers can take advantage of this interactivity to perform better parameter tuning, which may lead to more accurate analyses, and ultimately to new scientific discoveries. | en |
dc.identifier.uri | https://hdl.handle.net/10037/5762 | |
dc.identifier.urn | URN:NBN:no-uit_munin_5457 | |
dc.language.iso | eng | en |
dc.publisher | UiT Norges arktiske universitet | en |
dc.publisher | UiT The Arctic University of Norway | en |
dc.rights.accessRights | openAccess | |
dc.rights.holder | Copyright 2013 The Author(s) | |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/3.0 | en_US |
dc.rights | Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) | en_US |
dc.subject.courseID | INF-3990 | en |
dc.subject | VDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420::Systemutvikling og -arbeid: 426 | en |
dc.subject | VDP::Mathematics and natural science: 400::Information and communication science: 420::System development and system design: 426 | en |
dc.title | Mario. A system for iterative and interactive processing of biological data | en |
dc.type | Master thesis | en |
dc.type | Mastergradsoppgave | en |