Tape as Primary Storage for Large Scientific Data Sets

Research output: Book/Report; Ph.D. thesis; Research

  • Klaus Birkelund Abildgaard Jensen
This work investigates how magnetic tape technology can be used to
provide efficient and reliable low-cost storage for large scientific data sets.
The low cost follows directly from the fact that power consumption can
be reduced to a constant that depends only on the size of the tape system
and infrastructure, not on the size of the data set. Thus, one conclusion
from this work is that data from Big Science facilities can in fact be stored
on tape, thereby avoiding enormous energy consumption for data storage.
My work is focused on the challenges that face scientists working with
ever increasing data volumes from large-scale scientific experiments and
facilities. I discuss what challenges are preventing tape from becoming
a primary tier in the high performance computing data center for such
data. The work includes a literature study on tape technology in general
and the data sets it can be made to support, as well as a survey of
state-of-the-art tape storage systems.
I describe and motivate Tapr, a highly extensible parallel I/O gateway
and tape library management system optimized for high-throughput
data streams, with special semantic support for scientific data. I discuss
the motivation as well as the core data models, transfer protocols, and
features such as disaster recovery, inline simulation, and possible support
for retrieval latency prediction, all intended to strengthen the adoption of
tape as a primary storage tier in a High Performance Computing environment.
I then describe in detail a storage backend for Tapr: a binary extension to
the Linear Tape File System (LTFS) that provides higher scalability and
superior fault tolerance. The format extension is based on embedding
and is completely backward compatible with existing LTFS software. I
provide an overview of the inner workings of the binary index and the
recovery log that is integral to the format.
Finally, I present a framework for generic redundant streaming based
on redundant I/O behaviors. I describe possible extensions to it, how it
can be integrated into Tapr, and how several I/O behaviors can be
composed into more powerful behaviors.
Original language: English
Publisher: The Niels Bohr Institute, Faculty of Science, University of Copenhagen
Publication status: Published - 2017

ID: 200495953