|dc.description.abstract||It is still strangely difficult to back up and synchronize data. Cloud computing solves the problem by centralizing everything and letting someone else handle the backups. But what about situations with low connectivity or sensitive data?
For this, software developers have an interesting distributed, decentralized, and partition-tolerant data storage system right at their fingertips: distributed version control.
Inspired by distributed version control, we have researched and developed a prototype for a scalable high-availability system called Distributed Media Versioning (DMV). DMV expands Git's data model to allow files to be broken into more digestible chunks via a rolling hash algorithm. DMV will also allow data to be sharded according to data locality needs, slicing the data set in space (subset of data with full history), time (subset of history for full data set), or both. DMV repositories will be able to read and to update any subset of the data that they have locally, and then synchronize with other repositories in an ad-hoc network.
We have performed experiments to probe the scalability limits of existing version control systems, specifically what happens as file sizes grow ever larger or as the number of files grows. We found that processing files whole limits maximum file size to what can fit in RAM, and that storing millions of objects loose as files with hash-based names incurs disk space overhead and write speed penalties. We have observed a system needing 24 seconds to store a 6.8 KiB file.
We conclude that the key to storing large files is to break them into many small chunks, and that the key to storing many chunks is to aggregate them into pack files. And though the current DMV prototype does only the former, we have a clear path forward as we continue our work.||en_US