Distributed media versioning

Murphy, Michael J.

dc.contributor.advisor	Anshus, Otto J.
dc.contributor.author	Murphy, Michael J.
dc.date.accessioned	2017-06-30T10:44:14Z
dc.date.available	2017-06-30T10:44:14Z
dc.date.issued	2017-05-15
dc.description.abstract	It is still strangely difficult to back up and synchronize data. Cloud computing solves the problem by centralizing everything and letting someone else handle the backups. But what about situations with low connectivity or sensitive data? For this, software developers have an interesting distributed, decentralized, and partition-tolerant data storage system right at their fingertips: distributed version control. Inspired by distributed version control, we have researched and developed a prototype for a scalable high-availability system called Distributed Media Versioning (DMV). DMV expands Git's data model to allow files to be broken into more digestible chunks via a rolling hash algorithm. DMV will also allow data to be sharded according to data locality needs, slicing the data set in space (subset of data with full history), time (subset of history for full data set), or both. DMV repositories will be able to read and to update any subset of the data that they have locally, and then synchronize with other repositories in an ad-hoc network. We have performed experiments to probe the scalability limits of existing version control systems, specifically what happens as file sizes grow ever larger or as the number of files grows. We found that processing files whole limits maximum file size to what can fit in RAM, and that storing millions of objects loose as files with hash-based names incurs disk space overhead and write speed penalties. We have observed a system needing 24 seconds to store a 6.8 KiB file. We conclude that the key to storing large files is to break them into many small chunks, and that the key to storing many chunks is to aggregate them into pack files. And though the current DMV prototype does only the former, we have a clear path forward as we continue our work.	en_US
dc.identifier.uri	https://hdl.handle.net/10037/11213
dc.language.iso	eng	en_US
dc.publisher	UiT Norges arktiske universitet	en_US
dc.publisher	UiT The Arctic University of Norway	en_US
dc.rights.accessRights	openAccess	en_US
dc.rights.holder	Copyright 2017 The Author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/3.0	en_US
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)	en_US
dc.subject.courseID	INF-3990
dc.subject	VDP::Mathematics and natural science: 400::Information and communication science: 420::Communication and distributed systems: 423	en_US
dc.subject	VDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420::Kommunikasjon og distribuerte systemer: 423	en_US
dc.title	Distributed media versioning	en_US
dc.type	Master thesis	en_US
dc.type	Mastergradsoppgave	en_US