Klaus Birkelund Abildgaard Jensen
A thesis submitted December 21, 2017 for the degree of Doctor of Philosophy and defended January 2018.
The PhD School of Science
Faculty of Science, the eScience group, Niels Bohr Institute, University of Copenhagen
Professor Brian Vinter
Tape as Primary Storage for Large Scientific Data Sets
This work investigates how magnetic tape technology can be used to provide efficient and reliable low-cost storage for large scientific data sets. The low cost follows directly from the fact that power consumption can be reduced to a constant that depends only on the size of the tape system and infrastructure, not on the size of the data set. Thus, one conclusion from this work is that data from Big Science facilities can in fact be stored on tape, thereby avoiding an enormous energy consumption for data storage.
My work focuses on the challenges that face scientists working with ever-increasing data volumes from large-scale scientific experiments and facilities. I discuss which challenges prevent tape from becoming a primary tier for such data in the high performance computing data center. The work includes a literature study on tape technology in general and the data sets it can be made to support, as well as a survey of state-of-the-art tape storage systems. I describe and motivate Tapr, a highly extendable parallel I/O gateway and tape library management system optimized for high-throughput data streams with special semantic support for scientific data. I discuss its motivation, its core data models and transfer protocols, and features such as disaster recovery, inline simulation and possible support for retrieval latency prediction, all intended to strengthen the adoption of tape as a primary storage tier in a High Performance Computing environment. I then describe in detail a storage backend for Tapr: a binary extension to the Linear Tape File System (LTFS) that provides higher scalability and superior fault tolerance. The format extension is based on embedding and is completely backward compatible with existing LTFS software. I provide an overview of the inner workings of the binary index and of the recovery log that is integral to the format. Finally, I present a framework for generic redundant streaming based on redundant I/O behaviors. I describe possible extensions to the framework, how it can be integrated into Tapr, and how several I/O behaviors can be composed into more powerful behaviors.