A Data Management Model For Large-Scale Bioinformatics Analysis

Pedersen, Edvard

dc.contributor.advisor	Bongo, Lars Ailo
dc.contributor.author	Pedersen, Edvard
dc.date.accessioned	2017-04-07T08:17:06Z
dc.date.available	2017-04-07T08:17:06Z
dc.date.issued	2017-01-18
dc.description.abstract	Bioinformatics has seen extreme data growth in recent years due to the reduction in cost per megabase of sequencing, which today is around 1/400,000th of the cost in 2001. This reduction in cost enables new types of studies, such as searching for novel enzymes in marine environments using metagenomic approaches. However, it also leads to an increase in the volume of data, which shifts overall cost from sequencing to analysis and data management. In addition, the data growth means that the analysis must move from personal computers to cluster, cloud and supercomputer infrastructure, which further complicates data management and processing. This increase in data volume applies both to the raw data produced and to the size of reference databases. These reference databases are used in analysis, for example to compare sequences against all known sequences, so larger reference databases provide more accurate results. However, this increase in analysis accuracy also increases the volume of both input data and reference databases, which further increases analysis cost as well as the complexity of data management and processing. In this dissertation, we examine the challenge of data management, particularly how existing bioinformatics analysis pipelines can reduce the runtime and hence the cost of analysis through a better data management approach. We provide the file-based distributed data materialization (FDDM) approach and realize it as the GeStore system to provide data management for real-world bioinformatics pipelines. The commonly used bioinformatics analysis frameworks do not provide efficient large-scale data management; in particular, updating analysis results with updated reference databases and reproducing previously computed results are costly and time-consuming.
Technologies such as distributed databases and processing systems can efficiently process large amounts of data, but such systems are not straightforward to integrate with existing bioinformatics workflows, since these workflows typically comprise legacy tools that are costly and time-consuming to port to new frameworks. Our approach bridges this gap by providing a simple file-based interface that makes it easy to integrate workflows using legacy tools with modern distributed databases and data processing frameworks. We show the need for such a system through an evaluation of the tools of a bioinformatics pipeline that is provided as a data analysis service. Our results show that the runtime of many of the most computationally intensive tools in the pipeline scales approximately linearly with input data size, so that runtime can be reduced by limiting the volume of data. We evaluate our implementation of the FDDM model using synthetic and application benchmarks. Our results show that our implementation stores data efficiently with regard to storage space, and retrieves data quickly. We can therefore increase the speed of updates by up to 14 times. We integrate GeStore with three different workflow managers to demonstrate how popular workflow managers can easily use the FDDM approach.	en_US
dc.description.doctoraltype	ph.d.	en_US
dc.description.popularabstract	Bioinformatics has experienced explosive growth in data volume in recent years. As a result, the cost of experiments is increasingly dominated by the analysis of data. In this dissertation, modern data processing techniques are used to keep analysis costs down. This cost reduction is achieved by limiting the amount of data that is processed, through the use of distributed data processing systems. This approach is formalized in the FDDM model for data management and implemented in the GeStore system. Furthermore, this system is integrated with three different analysis pipelines, and its performance is evaluated. These experiments show that the model can easily be used by existing analysis systems and can increase performance by up to 14 times for monthly updates of results. This research can help new analysis methods be adopted in bioinformatics through the reduction in costs, especially with regard to keeping results up to date.	en_US
dc.description	The papers of this thesis are not available in Munin. <p> Paper 1: Robertsen, E.M., Kahlke, T., Raknes, I.A., Pedersen, E., Semb, E.K., Ernstsen, M., Bongo, L.A., Willassen, N.P.: Metapipe - pipeline annotation, analysis and visualization of marine metagenomic sequence data. (Manuscript). Preprint version available at <a href=https://arxiv.org/abs/1604.04103> https://arxiv.org/abs/1604.04103 </a><br> Paper 2: Pedersen, E., Bongo, L. A.: "Large-scale Biological reference database Management". (Manuscript). Published version available in <a href=http://dx.doi.org/10.1016/j.future.2016.02.010> Future Generation Computer Systems 2017, 67:481-489. </a> <br> Paper 3: Pedersen, E., Raknes, I. A., Ernstsen, M., Bongo, L. A.: “Integrating Data Intensive Computing Systems with Biological Data Analysis Frameworks”. Available in <a href=http://dx.doi.org/10.1109/PDP.2015.106> Proc. of 23rd Euromicro International Conference on Parallel, Distributed and Network-based Processing 2015, 733–740. </a> <br> Paper 4: Bongo, L. A., Pedersen, E., Ernstsen, M.: “Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines”. Available in <a href=http://dx.doi.org/10.1007/978-3-319-24462-4_22> Di Serio C., Liò P., Nonis A., Tagliaferri R. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. Lecture Notes in Computer Science, 2014, 8623. Springer. </a> <br> Paper 5: Pedersen, E., Bongo, L. A.: “Big Biological Data Management”. Available in <a href=http://dx.doi.org/10.1007/978-3-319-44881-7_13> Pop, F. et al. (eds.): Resource Management for Big Data Platforms. 2016, ISBN: 978-3-319-44881-7. Springer. </a>	en_US
dc.identifier.isbn	978-82-8236-242-9 (print) and 978-82-8236-243-6 (pdf)
dc.identifier.uri	https://hdl.handle.net/10037/10944
dc.language.iso	eng	en_US
dc.publisher	UiT Norges arktiske universitet	en_US
dc.publisher	UiT The Arctic University of Norway	en_US
dc.rights.accessRights	openAccess	en_US
dc.rights.holder	Copyright 2017 The Author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/3.0	en_US
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)	en_US
dc.subject	VDP::Mathematics and natural science: 400::Information and communication science: 420	en_US
dc.subject	VDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420	en_US
dc.title	A Data Management Model For Large-Scale Bioinformatics Analysis	en_US
dc.type	Doctoral thesis	en_US
dc.type	Doktorgradsavhandling	en_US

Associated file(s)

Name: license.txt
Size: 1.402Kb
Format: Text file

Name: thesis.pdf
Size: 5.082Mb
Format: PDF
Description: Thesis

This item appears in the following collection(s)

Doctoral theses (NT-fak) [322]

Unless otherwise stated, the license of this item is described as Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)