Abstract
Bioinformatics has seen extreme data growth in recent years due to the reduction in cost per megabase of sequencing, which today is around 1/400,000th of the cost in 2001. This reduction in cost enables new types of studies, such as searching for novel enzymes in marine environments using metagenomic approaches. However, it also increases the volume of data, which shifts the overall cost from sequencing to analysis and data management. In addition, the data growth means that analysis must move from personal computers to cluster, cloud and supercomputer infrastructure, which further complicates data management and processing.
This increase in data volume applies both to the raw data produced and to the size of reference databases. These reference databases are used in analysis, for example to compare sequences against all known sequences, so larger reference databases provide more accurate results. However, this gain in accuracy comes with a larger volume of both input data and reference databases, which further increases analysis cost as well as the complexity of data management and processing.
In this dissertation, we examine the challenge of data management, particularly how existing bioinformatics analysis pipelines can reduce the runtime and hence the cost of analysis through a better data management approach.
We present the file-based distributed data materialization (FDDM) approach and realize it as the GeStore system, which provides data management for real-world bioinformatics pipelines.
The commonly used bioinformatics analysis frameworks do not provide efficient large-scale data management. In particular, updating analysis results with updated reference databases and reproducing previously computed results are costly and time-consuming. Technologies such as distributed databases and processing systems can efficiently process large amounts of data, but such systems are not straightforward to integrate with existing bioinformatics workflows, since these workflows typically comprise legacy tools that are costly and time-consuming to port to new frameworks. Our approach bridges the gap between the two by providing a simple file-based interface that makes it easy to integrate workflows using legacy tools with modern distributed databases and data processing frameworks.
We show the need for such a system through an evaluation of the tools of a bioinformatics pipeline that is provided as a data analysis service. Our results show that the runtime of many of the most computationally intensive tools in the pipeline scales approximately linearly with input data size, so that runtime can be reduced by limiting the volume of data. We evaluate our implementation of the FDDM model using synthetic and application benchmarks. Our results show that our implementation stores data efficiently with regard to storage space, and retrieves data quickly. We can therefore increase the speed of updates by up to 14 times. We integrate GeStore with three different workflow managers to demonstrate how popular workflow managers can easily use the FDDM approach.
Description
The papers of this thesis are not available in Munin.
Paper 1: Robertsen, E.M., Kahlke, T., Raknes, I.A., Pedersen, E., Semb, E.K., Ernstsen, M., Bongo, L.A., Willassen, N.P.: Metapipe - pipeline annotation, analysis and visualization of marine metagenomic sequence data. (Manuscript). Preprint version available at https://arxiv.org/abs/1604.04103
Paper 2: Pedersen, E., Bongo, L. A.: “Large-scale Biological reference database Management”. (Manuscript). Published version available in Future Generation Computer Systems 2017, 67:481-489.
Paper 3: Pedersen, E., Raknes, I. A., Ernstsen, M., Bongo, L. A.: “Integrating Data Intensive Computing Systems with Biological Data Analysis Frameworks”. Available in Proc. of 23rd Euromicro International Conference on Parallel, Distributed and Network-based Processing 2015, 733–740.
Paper 4: Bongo, L. A., Pedersen, E., Ernstsen, M.: “Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines”. Available in Di Serio C., Liò P., Nonis A., Tagliaferri R. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. Lecture Notes in Computer Science, 2014, 8623. Springer.
Paper 5: Pedersen, E., Bongo, L. A.: “Big Biological Data Management”. Available in Pop, F. et al. (eds.): Resource Management for Big Data Platforms. 2016, ISBN: 978-3-319-44881-7. Springer.