Local Scratch
Legacy documentation
This page describes a service provided by a retired ACENET system. Most ACENET services are now provided by national systems; please visit https://docs.computecanada.ca.
- Main page: Storage System
Each compute node has its own disk (or in some cases, solid state memory) which is not shared with other compute nodes. We refer to this as local disk. If it is used to store temporary files for an individual job, then we refer to that as "local scratch storage".
Grid Engine and Local Scratch
Local scratch is not organized consistently across all clusters and hosts. In most cases it is in /scratch/tmp, but there are some hosts where /scratch/tmp doesn't exist or where another location is preferred. Grid Engine provides an environment variable TMPDIR which points to a local disk location that always exists and is unique to each job (or each task in a job array), hence
$ cd $TMPDIR
should always succeed inside your submission script.
However, the amount of space in $TMPDIR varies from cluster to cluster and from host to host, so while it always exists, it may or may not be large enough for your purposes. In particular, the X6440 "Blade Servers" introduced in late 2009 at Placentia and Glooscap have small local scratch (Placentia nodes cl135-cl266, Glooscap nodes cl059-cl097). If you want to see how much space is available,
$ qstat -F localscratch
will give you a list of all hosts and the local scratch space on each.
You can request localscratch from Grid Engine as a custom resource, much like you request time or memory. For example, a job submitted with:
$ qsub -l localscratch=10G job.script
will only be assigned to hosts with at least 10 gigabytes in $TMPDIR.
- Obviously, the more accurately you can predict how much space your code needs, the better this will work: Ask for too much and your job might not schedule quickly (or at all); ask for too little and the job could fail.
- As with other requestable resources like h_vmem, the space is allocated per process for a parallel job; see below for more about parallel jobs, and the example just after this list.
- Grid Engine only knows about the total disk space in the filesystem (used plus unused) and about any other jobs which explicitly request localscratch. If other jobs write to the same filesystem without requesting localscratch, then there could be less free space than requested.
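For instance, because the request is multiplied by the number of slots, a parallel job submitted like the following (the parallel environment name "ompi*" is taken from the example further down this page; substitute whatever PE you normally use) asks each host to supply 5 gigabytes of local scratch for every slot of the job that it runs:
$ qsub -l localscratch=5G -pe "ompi*" 4 job.script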
For this last reason you might also wish to have your job script check the available space, in order to avoid "File system full" or "No space left on device" errors. Here's a script fragment that prints the available space in $TMPDIR in gigabytes:
$ df --block-size=1G $TMPDIR | awk 'END {print $4}'
This can be used in a conditional:
#$ -S /bin/bash
scratchdir=$TMPDIR
freespace=`df --block-size=1G $scratchdir | awk 'END {print $4}'`
if (( $freespace < 10 )); then
    echo "Not enough free space in TMPDIR $TMPDIR, using /nqs..."
    scratchdir=/nqs/$USER/$JOB_ID
    mkdir $scratchdir
fi
We strongly recommend that you use $TMPDIR and localscratch if you want to use node-local disk. $TMPDIR is unique to each job, and Grid Engine deletes the directory at the end of the job. Therefore your script should ensure that any output files written to $TMPDIR are copied to Main Storage before the end of the job script.
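A minimal sketch of that pattern, assuming a hypothetical program my_app with input file input.dat and output file results.dat (SGE_O_WORKDIR is set by Grid Engine to the directory the job was submitted from):
#$ -S /bin/bash
cd $TMPDIR                                        # work on node-local disk
cp $SGE_O_WORKDIR/input.dat .                     # stage input in (hypothetical file name)
$SGE_O_WORKDIR/my_app input.dat > results.dat     # hypothetical program
cp results.dat $SGE_O_WORKDIR/                    # copy results back before the job ends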
If you choose not to use $TMPDIR you must:
- check for the existence of /scratch/tmp;
- create a subdirectory with your username, /scratch/tmp/$USER, or Grid Engine job number, /scratch/tmp/$JOB_ID;
- ensure at the end of the job that the directory is cleaned up and deleted (see the sketch after this list).
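Here is a minimal sketch of that approach; the directory layout and the use of a bash trap for cleanup are only illustrative, not a prescribed recipe:
#$ -S /bin/bash
if [ ! -d /scratch/tmp ]; then
    echo "No /scratch/tmp on $HOSTNAME" >&2
    exit 1
fi
scratchdir=/scratch/tmp/$USER/$JOB_ID
mkdir -p $scratchdir
trap "rm -rf $scratchdir" EXIT    # clean up even if the script exits early
cd $scratchdir
# .... do some computing here ....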
It can be tricky to write job scripts that clean up all files under every possible failure. Therefore you should also manually patrol your Local Scratch directories to ensure that the space is not occupied by outdated files from failed or finished jobs.
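Remember that local scratch is node-local, so such a patrol has to be done on each compute node. One illustrative way to look for your files older than 30 days on a particular node (cl001 here is just a placeholder hostname):
$ ssh cl001 find /scratch/tmp -user $USER -mtime +30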
Multiple hosts
Grid Engine only creates $TMPDIR on the master or "shepherd" host for a job. Therefore the above comments technically apply to serial jobs, or jobs where all the processes reside on a single host (e.g. Gaussian). If you are using the Open MPI library for running your application, please note that Open MPI will use $TMPDIR for its own purposes and thus will create the directory on those hosts where it is missing, effectively providing $TMPDIR on all hosts assigned to the job. Nonetheless, if your application expects each process to read and write local disk on different hosts, you need to manage things much more carefully.
The hosts attached to a parallel job are described in the hostfile. The name of the hostfile is given to the job script in the environment variable PE_HOSTFILE. You can, for example, fill a shell variable with a list of hostnames with code like this:
hostlist=`awk '{print $1}' $PE_HOSTFILE`
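Each line of the hostfile begins with a hostname and the number of slots granted on that host, typically followed by queue and processor information (the exact trailing columns vary between Grid Engine versions; the hostnames and queue name below are only illustrative):
cl001 4 ompi.q@cl001 UNDEFINED
cl002 4 ompi.q@cl002 UNDEFINED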
The Grid Engine requestable resource localscratch can and should be used with parallel jobs. While $TMPDIR is only created on the master host, localscratch will check for sufficient disk space on the same filesystem (typically /scratch/tmp) on each host when scheduling the job. There is no reason you couldn't create and destroy directories named $TMPDIR on worker hosts to match the one on the master host:
#$ -l h_rt=0:1:0,test=true
#$ -pe ompi* 2
hostlist=`awk '{print $1}' $PE_HOSTFILE`
echo "------ host ------- GB free"
for host in $hostlist; do
    ssh $host mkdir -p $TMPDIR
    freespace=`ssh $host df --block-size=1G $TMPDIR | awk 'END {print $4}'`
    echo "$host $freespace"
done
# .... do some computing here ....
# then clean up:
for host in $hostlist; do
    ssh $host rm -r $TMPDIR
done
To further complicate matters, Open MPI does create $TMPDIR on each subordinate host when mpirun is invoked, but then also destroys these $TMPDIRs when the child MPI processes complete.