GeStore : incremental computation for metagenomic pipelines
Genomics is the study of the genomes of organisms, and involves cultivating organisms in a lab and analyzing them. Metagenomics is the study of genomic samples collected directly from the environment, allowing researchers to study organisms that are difficult to cultivate in a petri dish. DNA sequencing, and the analysis of the resulting sequences, is an important tool for both genomics and metagenomics. The integration of the data produced by sequencing with existing meta-data collections is particularly interesting for metagenomics, as a single biological sample can contain thousands of different organisms. Recent developments in DNA sequencing technology mean that the volume of data that can be produced per dollar is increasing faster than the volume of data that can be analyzed and stored per dollar. This data growth makes the initial analysis of these massive data sets increasingly expensive. In addition, there is a need to periodically update old results using new meta-data from the many knowledge bases (meta-data collections) for biological data. Today, this typically requires rerunning the experimental analysis. Updating results incrementally instead is interesting for metagenomics, since environmental samples potentially contain thousands of organisms. In metagenomic analysis, different sets of tools are used depending on the type of information required. These tools are generally arranged in a pipeline, where the output files of one tool act as the input for the next. The analysis done by some steps depends on different meta-data collections. When meta-data is updated, these steps and all subsequent steps typically need to be executed again.
Incremental updates can save significant computation time by running these pipelines against the updated segments, rather than the full meta-data collections. We believe that systems for incremental updates for metagenomic analysis pipelines have the following requirements: (i) reduce the computational resource requirements by using incremental update techniques; (ii) make the meta-data collections accessible without the use of proprietary or computationally expensive techniques; (iii) do the incremental updates on demand, due to the different needs of experiments, by handling meta-data updates and generating arbitrary delta meta-data collections; (iv) support most genomic analysis tools and run on most job management systems; (v) require no changes to the tools that make up the pipeline, since modifying the many available tools is impractical; (vi) keep the changes to the job management and resource allocation system minimal, to save implementation time for the pipeline system maintainer; (vii) maintain a view of previous meta-data collections, so old experiments can be repeated with the correct meta-data collection version. To our knowledge, no existing incremental update system satisfies all seven requirements. Often they do not support on-demand processing or maintaining views of old data. In addition, many systems require computations to be done within a specific framework or programming language. In this thesis we describe the GeStore incremental update system, which satisfies all seven requirements. GeStore reduces the size of the meta-data collections, and thus the computational requirements for the pipeline, by leveraging incremental update techniques, satisfying requirements (i) and (iii). In addition, it reduces the storage requirements of the meta-data collections, while still maintaining a complete view of the meta-data collection in a plain-text format, fulfilling requirements (ii) and (vii).
It also presents a simple interface to the application programmer, so that integrating the system with existing pipeline solutions does not require large changes to the pipeline system or tools, in accordance with requirements (iv), (v) and (vi). GeStore has been implemented using the MapReduce framework, along with HBase, to provide scalable meta-data processing. We demonstrate the system by generating subsets of meta-data collections for use by the widely used genomic tool BLAST. In our evaluation, we have integrated GeStore with an existing pipelining system, GePan, a metagenomic pipeline system developed for a local biotech company in Tromsø, Norway, and used real-world data to evaluate the performance and benefits of GeStore. Our experimental results show that GeStore is able to reduce the runtime of the incremental updates by up to 65% when compared to unmodified GePan, while introducing a low storage overhead and requiring minimal changes to GePan. We believe that efficient on-demand updates of metagenomic data, as provided by GeStore, will be useful to our biology collaborators.
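The idea behind requirements (iii) and (vii), generating a delta meta-data collection on demand while keeping a view of earlier versions, can be illustrated with a small sketch. This is not GeStore's implementation, which the abstract describes as built on MapReduce and HBase. The class `VersionedCollection`, its methods, the entry identifiers and the integer timestamps are all hypothetical, and the sketch assumes an in-memory store, timestamps that never decrease for a given entry, and no deleted entries.

```python
from collections import defaultdict


class VersionedCollection:
    """Toy in-memory stand-in for a versioned meta-data collection (hypothetical)."""

    def __init__(self):
        # entry_id -> list of (timestamp, record), appended in timestamp order
        self._versions = defaultdict(list)

    def put(self, entry_id, timestamp, record):
        """Store a new version only if the record differs from the latest one."""
        history = self._versions[entry_id]
        if history and history[-1][1] == record:
            return
        history.append((timestamp, record))

    def view(self, timestamp):
        """Return the full collection as it looked at `timestamp`."""
        snapshot = {}
        for entry_id, history in self._versions.items():
            visible = [rec for ts, rec in history if ts <= timestamp]
            if visible:
                snapshot[entry_id] = visible[-1]
        return snapshot

    def delta(self, start, end):
        """Return entries that were added or changed in the interval (start, end]."""
        old, new = self.view(start), self.view(end)
        return {k: v for k, v in new.items() if old.get(k) != v}


store = VersionedCollection()
# Release 1 of a made-up meta-data collection
store.put("P1", 1, "MKTAYIAK")
store.put("P2", 1, "MSDNE")
# Release 2: P1 unchanged, P2 modified, P3 new
store.put("P1", 2, "MKTAYIAK")
store.put("P2", 2, "MSDNQ")
store.put("P3", 2, "MAAAL")

changed = store.delta(1, 2)   # only P2 and P3 would need to be reprocessed
old_view = store.view(1)      # release 1 can still be reconstructed
```

In this sketch a pipeline step would be rerun against `changed` rather than the whole of release 2, while `old_view` allows an earlier experiment to be repeated with the meta-data it originally used.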
Publisher: Universitetet i Tromsø
University of Tromsø