Data-intensive computing infrastructure systems for unmodified biological data analysis pipelines

Bongo, Lars Ailo; Pedersen, Edvard; Ernstsen, Martin

dc.contributor.author	Bongo, Lars Ailo
dc.contributor.author	Pedersen, Edvard
dc.contributor.author	Ernstsen, Martin
dc.date.accessioned	2016-03-09T14:08:04Z
dc.date.available	2016-03-09T14:08:04Z
dc.date.issued	2015-11-18
dc.description.abstract	Biological data analysis is typically implemented using a deep pipeline that combines a wide array of tools and databases. These pipelines must scale to very large datasets, and consequently require parallel and distributed computing. It is therefore important to choose a hardware platform and underlying data management and processing systems well suited for processing large datasets. There are many infrastructure systems for such data-intensive computing. However, in our experience, most biological data analysis pipelines do not leverage these systems. We give an overview of data-intensive computing infrastructure systems, and describe how we have leveraged them for: (i) scalable fault-tolerant computing for large-scale biological data; (ii) incremental updates to reduce the resource usage required to update a large-scale compendium; and (iii) interactive data analysis and exploration. We provide lessons learned and describe problems we have encountered during development and deployment. We also provide a literature survey on the use of data-intensive computing systems for biological data processing. Our results show how unmodified biological data analysis tools can benefit from infrastructure systems for data-intensive computing.	en_US
dc.description	Accepted manuscript version. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-24462-4_22.	en_US
dc.identifier.citation	Lecture Notes in Computer Science 2015, 8623:259-272	en_US
dc.identifier.cristinID	FRIDAID 1319765
dc.identifier.doi	10.1007/978-3-319-24462-4_22
dc.identifier.issn	1611-3349
dc.identifier.uri	https://hdl.handle.net/10037/8816
dc.identifier.urn	URN:NBN:no-uit_munin_8358
dc.language.iso	eng	en_US
dc.publisher	Springer	en_US
dc.rights.accessRights	openAccess
dc.subject	data-intensive computing	en_US
dc.subject	biological data analysis	en_US
dc.subject	flexible pipelines	en_US
dc.subject	infrastructure systems	en_US
dc.subject	VDP::Technology: 500::Information and communication technology: 550::Computer technology: 551	en_US
dc.title	Data-intensive computing infrastructure systems for unmodified biological data analysis pipelines	en_US
dc.type	Journal article	en_US
dc.type	Peer reviewed	en_US

File(s) in this item

Name: article.pdf
Size: 413.7 KB
Format: PDF

This item appears in the following collection(s)

Artikler, rapporter og annet (informatikk) [486]