
dc.contributor.advisor: Bongo, Lars Ailo
dc.contributor.author: Pedersen, Edvard
dc.date.accessioned: 2017-04-07T08:17:06Z
dc.date.available: 2017-04-07T08:17:06Z
dc.date.issued: 2017-01-18
dc.description.abstract: Bioinformatics has seen extreme data growth in recent years due to the reduction in cost per megabase of sequencing, which today is around 1/400,000th of the cost in 2001. This reduction in cost enables new types of studies, such as searching for novel enzymes in marine environments using metagenomic approaches. However, it also leads to an increase in data volume, which shifts the overall cost from sequencing to analysis and data management. In addition, the data growth means that analysis must move from personal computers to cluster, cloud, and supercomputer infrastructure, which further complicates data management and processing. This increase in data volume applies both to the raw data produced and to the size of reference databases. These reference databases are used in analysis to, for example, compare sequences against all known sequences, so larger reference databases provide more accurate results. However, this increase in analysis accuracy also increases the volume of both input data and reference databases, which further increases analysis cost as well as the complexity of data management and processing. In this dissertation, we examine the challenge of data management, particularly how existing bioinformatics analysis pipelines can reduce the runtime, and hence the cost, of analysis through a better data management approach. We provide the file-based distributed data materialization (FDDM) approach and realize it as the GeStore system to provide data management for real-world bioinformatics pipelines. Commonly used bioinformatics analysis frameworks do not provide efficient large-scale data management; in particular, updating analysis results with updated reference databases and reproducing previously computed results are costly and time-consuming. Technologies such as distributed databases and processing systems can efficiently process large amounts of data, but such systems are not straightforward to integrate with existing bioinformatics workflows, since these workflows typically comprise legacy tools that are costly and time-consuming to port to new frameworks. Our approach bridges the gap between these by providing a simple file-based interface that makes it easy to integrate workflows using legacy tools with modern distributed databases and data processing frameworks. We show the need for such a system through an evaluation of the tools of a bioinformatics pipeline that is provided as a data analysis service. Our results show that the runtime of many of the most computationally intensive tools in the pipeline scales approximately linearly with input data size, so runtime can be reduced by limiting the volume of data. We evaluate our implementation of the FDDM model using synthetic and application benchmarks. Our results show that our implementation stores data efficiently with regard to storage space and retrieves data quickly. We can therefore increase the speed of updates by up to 14 times. We integrate GeStore with three different workflow managers to demonstrate how popular workflow managers can easily use the FDDM approach. [en_US]
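To illustrate the incremental-update idea behind FDDM described in the abstract, the sketch below shows one way a pipeline could process only the delta between two reference database releases and run an unchanged legacy tool on that delta. This is a minimal illustrative sketch, not the GeStore implementation: the FASTA parsing, file names, and the legacy_annotator command are assumptions for illustration, and GeStore itself materializes such deltas from a distributed storage backend rather than from local files as shown here.

```python
# Minimal sketch of the incremental-update idea behind FDDM, assuming a
# FASTA-like reference database keyed by the header line. File names and the
# legacy tool invocation are hypothetical stand-ins.
import subprocess

def read_fasta(path):
    """Return {header: sequence} for a FASTA file."""
    entries, header, seq = {}, None, []
    with open(path) as f:
        for line in f:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    entries[header] = "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        entries[header] = "".join(seq)
    return entries

def write_fasta(entries, path):
    """Write {header: sequence} back out as FASTA."""
    with open(path, "w") as f:
        for header, seq in entries.items():
            f.write(f">{header}\n{seq}\n")

# Compute the delta: entries present in the new release but not in the old one.
old = read_fasta("refdb_previous.fasta")   # hypothetical previous release
new = read_fasta("refdb_current.fasta")    # hypothetical current release
delta = {h: s for h, s in new.items() if h not in old}
write_fasta(delta, "refdb_delta.fasta")

# Run the unchanged legacy analysis tool on the delta only; its output can then
# be merged with previously materialized results instead of recomputing them.
subprocess.run(["legacy_annotator", "--db", "refdb_delta.fasta",
                "--out", "results_delta.tsv"], check=True)
```

Because the most computationally intensive tools scale roughly linearly with input size, restricting the input to the delta is what makes the reported update speedups possible while the legacy tools themselves stay unmodified.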
dc.description.doctoraltype: ph.d. [en_US]
dc.description.popularabstract: Bioinformatics has experienced explosive growth in data volume in recent years. As a result, the cost of experiments is increasingly dominated by the analysis of data. In this dissertation, modern data management techniques are used to keep the cost of analysis down. This cost reduction is achieved by limiting the amount of data that is processed, through the use of distributed data processing systems. The approach is formalized in the FDDM data management model and implemented in the GeStore system. The system is further integrated with three different analysis pipelines, and its performance is evaluated. These experiments show that the model can easily be used by existing analysis systems and can improve performance by up to 14 times for monthly updates of results. Through this reduction in cost, especially the cost of keeping results up to date, this research can help new analysis methods be adopted in bioinformatics. [en_US]
dc.description: The papers of this thesis are not available in Munin.
Paper 1: Robertsen, E.M., Kahlke, T., Raknes, I.A., Pedersen, E., Semb, E.K., Ernstsen, M., Bongo, L.A., Willassen, N.P.: "Metapipe - pipeline annotation, analysis and visualization of marine metagenomic sequence data." (Manuscript). Preprint version available at https://arxiv.org/abs/1604.04103
Paper 2: Pedersen, E., Bongo, L.A.: "Large-scale Biological reference database Management." (Manuscript). Published version available in Future Generation Computer Systems 2017, 67:481-489. http://dx.doi.org/10.1016/j.future.2016.02.010
Paper 3: Pedersen, E., Raknes, I.A., Ernstsen, M., Bongo, L.A.: "Integrating Data Intensive Computing Systems with Biological Data Analysis Frameworks." Available in Proc. of 23rd Euromicro International Conference on Parallel, Distributed and Network-based Processing 2015, 733-740. http://dx.doi.org/10.1109/PDP.2015.106
Paper 4: Bongo, L.A., Pedersen, E., Ernstsen, M.: "Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines." Available in Di Serio C., Liò P., Nonis A., Tagliaferri R. (eds.): Computational Intelligence Methods for Bioinformatics and Biostatistics. Lecture Notes in Computer Science, 2014, 8623. Springer. http://dx.doi.org/10.1007/978-3-319-24462-4_22
Paper 5: Pedersen, E., Bongo, L.A.: "Big Biological Data Management." Available in Pop, F. et al. (eds.): Resource Management for Big Data Platforms. 2016, ISBN: 978-3-319-44881-7. Springer. http://dx.doi.org/10.1007/978-3-319-44881-7_13
[en_US]
dc.identifier.isbn: 978-82-8236-242-9 (print) and 978-82-8236-243-6 (PDF)
dc.identifier.uri: https://hdl.handle.net/10037/10944
dc.language.iso: eng [en_US]
dc.publisher: UiT Norges arktiske universitet [en_US]
dc.publisher: UiT The Arctic University of Norway [en_US]
dc.rights.accessRights: openAccess [en_US]
dc.rights.holder: Copyright 2017 The Author(s)
dc.rights.uri: https://creativecommons.org/licenses/by-nc-sa/3.0 [en_US]
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) [en_US]
dc.subject: VDP::Mathematics and natural science: 400::Information and communication science: 420 [en_US]
dc.subject: VDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420 [en_US]
dc.title: A Data Management Model For Large-Scale Bioinformatics Analysis [en_US]
dc.type: Doctoral thesis [en_US]
dc.type: Doktorgradsavhandling [en_US]

