There are several types of storage available on the cluster. Their usage is described below.
The cluster's main network storage device is a dual-head NetApp FAS6210. Most users have their home directories on one of its two heads. The NetApp provides high-performance, highly reliable storage; however, there are limitations on its usage:
Space on the NetApp is very expensive, so it is to be used only for temporary storage of actively used data and programs. All simulation results must be moved off the NetApp as soon as possible.
- Each user has a quota of 200GB. You can check your current usage via the command "quota -s".
- Users over this limit will be notified by email (as will the cluster sysadmins).
- 200GB is a soft quota. Once an account reaches the hard quota of 300GB, that account will not be able to write any more data until some is deleted.
The NetApp has nowhere near enough space for every user to use their entire quota. Again, keep only active data on the NetApp.
While the NetApp is a high-performance unit, the shared cluster has many nodes. Jobs running on many of them at once, all writing to the NetApp, can easily overwhelm it. When this happens, every user of the cluster is affected.
- For this reason, the scratch disks on the nodes should be used for temporary job workspace as much as possible.
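As a sketch of this pattern, a job script can create a node-local scratch directory, work there, and copy only the final results back to the NetApp. The scratch path and naming convention below are assumptions for illustration; see Cluster_Usage for the cluster's actual $MYTMP recipe.

```shell
#!/bin/bash
# Sketch of a job script that works in node-local scratch instead of the
# NetApp. The mktemp template is an assumed convention, not the cluster's
# official one (see Cluster_Usage).
MYTMP=$(mktemp -d "${TMPDIR:-/tmp}/${USER:-user}.XXXXXX")
cd "$MYTMP"

# Record the two pieces of information needed to recover this directory
# if the job crashes (see the node scratch disk section of this document):
echo "node:    $(hostname)"
echo "scratch: $MYTMP"

# ... run the simulation here, writing all intermediate output locally ...
echo "fake results" > results.dat    # stand-in for real simulation output

# Copy only the final results back to the NetApp home, then clean up:
cp results.dat "$HOME/"
cd "$HOME" && rm -rf "$MYTMP"
```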
As the NetApp is not a permanent data storage device, it is not backed up. Should anything physically happen to the unit, there will be no way to retrieve the data.
There is, however, a facility for immediate retrieval of mistakenly deleted data:
- In the root of the mounted NetApp volume (e.g. '/netapp/home' for most users), there is a directory named '.snapshot'.
- That directory contains multiple subdirectories, named 'hourly.N' and 'nightly.N'.
- These are SnapShot directories; they contain the volume contents as they were at specific points in time.
- To retrieve something from a SnapShot directory, simply 'cp' it into your normal home directory.
- Note that the snapshots only go back ~24 hours -- after that amount of time, no file retrieval is possible.
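Retrieval is just a copy. The following self-contained sketch simulates the snapshot layout under a temporary directory so it can be run anywhere; on the cluster you would use the real path (e.g. '/netapp/home/.snapshot'), and the user and file names here are made up.

```shell
# Simulate a volume root containing a .snapshot directory (a stand-in for
# /netapp/home; the user name and file name are hypothetical).
U=${USER:-alice}
VOL=$(mktemp -d)
mkdir -p "$VOL/.snapshot/hourly.0/$U" "$VOL/$U"
echo "precious data" > "$VOL/.snapshot/hourly.0/$U/results.csv"

# On the cluster: ls /netapp/home/.snapshot -> hourly.0 hourly.1 ... nightly.0 ...
ls "$VOL/.snapshot"

# hourly.0 holds the most recent hourly state. Restore a mistakenly deleted
# file with a plain cp into your normal home directory:
cp "$VOL/.snapshot/hourly.0/$U/results.csv" "$VOL/$U/"
cat "$VOL/$U/results.csv"    # precious data
```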
Node Scratch Disks
Each node contains a local disk which can be used by an SGE job for temporary data storage. Data on these disks is automatically deleted after 2 weeks (or sooner if circumstances dictate). While these directories are generally only accessible for the lifetime of the job that creates them, it is possible to access their contents should a job crash or be terminated. To do so:
- You will need two pieces of information:
- The name of the node the job ran on -- this can be obtained by including the command 'hostname' in your job script.
- The (presumably random) name of the scratch directory created by your job. If you follow the syntax from Cluster_Usage to create this scratch directory, include the command 'echo $MYTMP' in your job script.
- Given those two pieces of information, submit a job specifically to the node the crashed job ran on (using the SGE flag '-l hostname=$HOSTNAME_FROM_JOB_OUTPUT_SCRIPT') which copies the contents of the scratch directory to your $HOME.
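A recovery job might look like the following sketch. The node name and scratch directory name are placeholders standing in for the values taken from the crashed job's output; the '/scratch' mount point is also an assumption.

```shell
#!/bin/bash
#$ -l hostname=node42
#$ -cwd
# Recovery job, pinned to the node that hosted the crashed job.
# 'node42' and the scratch directory name below are hypothetical values
# taken from the crashed job's 'hostname' and 'echo $MYTMP' output.
CRASHED_SCRATCH=/scratch/alice.XyZ123
mkdir -p "$HOME/recovered"
cp -r "$CRASHED_SCRATCH" "$HOME/recovered/"
```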
Cluster-Wide Scratch Space
In August 2011, we upgraded our primary NetApp from a dual 3070 system to the dual 6210 system discussed above. Given an exceedingly low trade-in value, we kept the old dual 3070 system. It is now set up as cluster-wide scratch space.
There are two volumes, one mounted on each NetApp head: /scrapp (7.5TB) and /scrapp2 (3.3TB). They are available on all nodes, including the login nodes chef and sous.
- Jobs can require a minimum amount of free space in these volumes using "-l scrapp=NG" and/or "-l scrapp2=NG" (replacing N with the number of free GB your job requires, of course).
- We have no hardware or software support on these systems, so this is truly scratch space. In the case of a data loss event, no effort will be made to recover any data.
- As with the node scratch disks, data older than 2 weeks will be automatically deleted.
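For example, a job needing 50GB of free space on /scrapp would be submitted like this (the job script name is hypothetical):

```shell
# Ask SGE to schedule the job only where/when /scrapp reports >= 50GB free:
qsub -l scrapp=50G my_job.sh
```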
One of the main reasons we bought a new NetApp is that, given the number of nodes we have, it is trivial to completely overwhelm the old system. So you still have to be judicious in your use of the old NetApp in its new role. While hammering /scrapp and/or /scrapp2 will no longer slow down *every* cluster user, it will significantly slow down the jobs using those volumes. So, please, continue to take care in your data access patterns.