RDSF Evolution

As we look at the future archive data storage requirements for researchers at the University, we see a need to support new projects with very large allocations (hundreds of TBytes per project) and a demand for more space from many existing projects.

To meet this need we are working on introducing some new technologies into the RDSF. Some of these have different operational characteristics which will be noticed by the users and some may require changes in the way you use the RDSF if you want to take advantage of the new features.

There are two main new features we are currently working on:

1) Compression. The space occupied by some files can be significantly reduced by data compression. Tests on some datasets have produced a 30% reduction in the space occupied. The compression and decompression are lossless and are done on the fly, so there is no significant delay in accessing the data and no significant loss of data transfer bandwidth. Selected parts of the RDSF are now compressed automatically, and this enables us to make better use of our existing disk space.

2) Tape. The cost of tape per TByte is much lower than that of disk. The future major expansion of the RDSF capacity will be in the form of a tape storage tier. This will give us much larger capacity and the flexibility to expand further very easily by purchasing more tapes. However, the access characteristics of tape are very different to those of disk, and this affects how you can use the space. Tape has a much greater access latency than disk: accessing a disk file takes milliseconds, whereas accessing a tape file can take minutes. This is a big difference, often 100,000 times longer for tape. Tape does have very good bandwidth, though, and once a tape file has been accessed the data can move as fast as from disk. Also, most applications cannot access tape files directly, so when a user opens a file on tape, the system automatically copies it back to disk so that it can be accessed. This process takes minutes to complete.
From the user's point of view, the outcome of this latency is that tape cannot be used for small files. Latency dominates the access time for small files; for example, it could take hours to access a hundred small files. For large files, the latency is comparable to the time it takes to read the file contents and is therefore relatively insignificant. In practice we will set a threshold of 1 GByte as the minimum size of file that can be resident on tape. Files smaller than this will always remain on disk.
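
The effect of latency on small files can be illustrated with some rough arithmetic. The sketch below assumes a recall latency of 2 minutes per file and a streaming bandwidth of 100 MByte/s; both figures are illustrative assumptions, not measurements of the RDSF, and the function name is made up for this example.

```python
# Rough access-time model for tape: each file pays one recall latency,
# then its contents stream at the tape bandwidth.
# Both figures below are assumptions for illustration, not RDSF measurements.
RECALL_LATENCY_S = 120.0     # assumed: 2 minutes to recall one file from tape
BANDWIDTH_BYTES_S = 100e6    # assumed: 100 MByte/s once the file is streaming


def tape_access_seconds(file_sizes_bytes):
    """Total time to recall the given files one after another."""
    return sum(RECALL_LATENCY_S + size / BANDWIDTH_BYTES_S
               for size in file_sizes_bytes)


# One hundred 10 MByte files: latency dominates.
small_files = [10e6] * 100
# The same 1 GByte of data stored as a single file.
one_large_file = [1e9]

small_total = tape_access_seconds(small_files)
large_total = tape_access_seconds(one_large_file)

print(f"100 small files: {small_total / 3600:.1f} hours")
print(f"1 large file:    {large_total / 60:.1f} minutes")
```

Under these assumptions the hundred small files take a little over three hours to recall, while the same amount of data in one file takes a little over two minutes. Real figures will depend on tape drive availability and on how the files are laid out on tape.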

What you need to do

1) For compression you need do nothing. You may notice that your files take up less space than they used to, and you can take advantage of the extra space in your quota to store more.

2) For tape, if your files are large (> 1 GByte in size) you need do nothing.

If your files are smaller than 1 GByte, you will need to consider how you organise that data if you want to take advantage of tape. You’ll need to collect the small files together and store them in larger chunks or blobs. It is wise to think carefully about this and to group together files that are going to be used together. Here are some examples to help illustrate this:

a) Where many small files are stored in a directory tree, e.g. where samples from an experiment are organised into different directories for each day or week, these directories can be stored in a tarball using the tar utility, or on Windows in a zip file. Each directory, or even a collection of directories and their files, could be replaced by an equivalent tar file. Provided the resulting tar files are larger than 1 GByte, they could reside on tape. When you access the tar file from tape, there is only one latency period, after which you have immediate access to all the contents.

b) Consider using different file formats. For example, if you are working on digital images from a movie and each frame is a separate file, consider changing from a single-frame-per-file format to a movie file format, where many frames are stored in one file. This is not only more space efficient and may enable the use of tape, it is also more efficient in terms of the computational overhead when processing the data: a single large file is much more efficient than thousands of tiny ones.

c) In general, the better organised your data holding is, the more likely it is that you will be able to take advantage of the tape storage. For any project, when you archive your results and data to the RDSF, collect the data into a single file, using tar (optionally with gzip), zip or another appropriate utility, and store the resulting package as a single piece on the RDSF.
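
As an illustration of the bundling described in the examples above, here is a minimal sketch using Python's standard tarfile module. The function names are made up for this example, and the threshold is assumed to mean 10**9 bytes, since the exact definition of 1 GByte is not stated here. It is one possible approach, not an RDSF-provided tool.

```python
import tarfile
from pathlib import Path

# Minimum size for a file to be resident on tape, as stated above.
# Assumed here to mean 10**9 bytes; the exact definition is not given.
TAPE_THRESHOLD_BYTES = 10**9


def bundle_directory(source_dir, archive_path):
    """Pack source_dir and everything below it into one uncompressed tar file.

    The archive should be written outside source_dir so that it does not
    try to include itself. Returns the size of the archive in bytes.
    """
    source = Path(source_dir)
    archive = Path(archive_path)
    with tarfile.open(archive, "w") as tar:
        # arcname stores paths relative to the directory's own name
        tar.add(source, arcname=source.name)
    return archive.stat().st_size


def is_tape_sized(size_bytes):
    """True if a file of this size meets the assumed tape threshold."""
    return size_bytes >= TAPE_THRESHOLD_BYTES
```

For example, calling bundle_directory("experiment/week01", "week01.tar") (hypothetical names) would produce a single tar file, and is_tape_sized on the returned size indicates whether it meets the assumed threshold. If it does not, several related directories could be bundled into one archive instead.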