Show simple item record

dc.contributor.advisor: Bongo, Lars Ailo
dc.contributor.advisor: Willassen, Nils-Peder
dc.contributor.author: Ernstsen, Martin
dc.date.accessioned: 2014-01-13T09:45:40Z
dc.date.available: 2014-01-13T09:45:40Z
dc.date.issued: 2013-11-15
dc.description.abstract: This thesis addresses challenges in metagenomic data processing on clusters of computers, in particular the need for interactive response times during development, debugging and tuning of data processing pipelines. Typical metagenomics pipelines batch-process data and have execution times ranging from hours to months, which makes configuration and tuning time-consuming and impractical. We have analyzed the data usage of metagenomic pipelines, including a visualization frontend, to develop an approach that uses an online, data-parallel processing model in which changes in the pipeline configuration are quickly reflected in updated pipeline output available to the user. We describe the design and implementation of the Mario system, which realizes the approach. Mario is a distributed system built on top of the HBase storage system that provides data processing using commonly used bioinformatics applications, interactive tuning, automatic parallelization and data provenance support. We evaluate Mario and its underlying storage system, HBase, using a benchmark developed to simulate I/O loads that are representative of biological data processing. The results show that Mario adds less than 100 milliseconds to the end-to-end latency of processing one item of data. This low latency, combined with Mario’s storage of all intermediate data generated by the processing, enables easy parameter tuning. In addition to improved interactivity, Mario also offers integrated data provenance by storing detailed pipeline configurations associated with the data. The evaluation of Mario demonstrates that it can be used to achieve more interactivity in the configuration of pipelines for processing biological data. We believe that biology researchers can take advantage of this interactivity to perform better parameter tuning, which may lead to more accurate analyses, and ultimately to new scientific discoveries. [en]
dc.identifier.uri: https://hdl.handle.net/10037/5762
dc.identifier.urn: URN:NBN:no-uit_munin_5457
dc.language.iso: eng [en]
dc.publisher: UiT Norges arktiske universitet [en]
dc.publisher: UiT The Arctic University of Norway [en]
dc.rights.accessRights: openAccess
dc.rights.holder: Copyright 2013 The Author(s)
dc.rights.uri: https://creativecommons.org/licenses/by-nc-sa/3.0 [en_US]
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) [en_US]
dc.subject.courseID: INF-3990 [en]
dc.subject: VDP::Mathematics and natural science: 400::Information and communication science: 420::System development and system design: 426 [en]
dc.title: Mario. A system for iterative and interactive processing of biological data [en]
dc.type: Master thesis [en]


Unless otherwise stated, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)