Job Control
Legacy documentation
This page describes a service provided by a retired ACENET system. Most ACENET services are now provided by national systems; for those, please visit https://docs.computecanada.ca.
All production-class jobs must be submitted via the scheduler, which manages the available computing resources and assigns them to the waiting jobs. The scheduler used on all ACENET clusters is Sun Grid Engine (SGE).
Main commands
The three most important SGE commands are:
qsub
- Submits a batch job
qstat
- Shows the status of jobs and queues
qdel
- Deletes or kills a job
Submitting a simple job
Write a script that describes your job. Here's a trivial example:
#$ -cwd
#$ -j y
#$ -l h_rt=00:03:00
echo Hello from inside a Grid Engine job running on `hostname`
echo Job beginning at `date`
sleep 120
echo Job ending at `date`
Note that this is just a shell script: a list of commands (echo, sleep) to be executed in order. The comment lines beginning with #$ provide extra information to the job scheduler. More about them below. The default execution shell is bash unless specified otherwise with the -S option.
Save the script with some name, like trivial.sh. Then submit the script to the scheduler by typing
$ qsub trivial.sh
The system will reply something like
Your job 7635 ("trivial.sh") has been submitted
and queue up your job to wait its turn. The number is called the JOB_ID and will be different for every job.
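Because later commands such as qdel and qalter take the JOB_ID, it can be handy to capture it when submitting from a script. The following is a minimal sketch, assuming the reply has exactly the form shown above; parse_job_id is a hypothetical helper name, not a Grid Engine command, and the demonstration runs on the sample reply rather than on a real scheduler.

```shell
#!/bin/sh
# parse_job_id: hypothetical helper, not part of Grid Engine.
# Reads a qsub reply on standard input, for example
#   Your job 7635 ("trivial.sh") has been submitted
# and prints the third word, which is the JOB_ID.
parse_job_id() {
    awk '/^Your job / { print $3 }'
}

# Demonstration on the sample reply from this page (no scheduler needed).
sample_reply='Your job 7635 ("trivial.sh") has been submitted'
job_id=$(printf '%s\n' "$sample_reply" | parse_job_id)
echo "JOB_ID is $job_id"

# On a real cluster one might write instead:
#   job_id=$(qsub trivial.sh | parse_job_id)
```

The pattern only matches replies that begin with "Your job ", so a reply in any other form yields an empty result rather than a wrong number.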
If you are looking for a parallel job script example, refer to the Parallel Jobs page.
Monitoring jobs
How can you tell if your job has run? The usual way is with
$ qstat
The output from qstat is very wide and looks something like this:
job-ID  prior   name       user   state submit/start at     queue          slots ja-task-ID
-------------------------------------------------------------------------------------------
   7635 0.5404  trivial    jdoe   r     11/18/2011 23:16:10 short.q@cl061      1
While your job is waiting to run there will be "qw" in the state column.
When it starts to run it changes to "r".
When the job ends, either because it finished or because it crashed, it disappears from the list and qstat will return nothing at all, unless you have other jobs submitted.
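The state check described above can be scripted by picking the fifth column out of the qstat listing. This is a rough sketch under the assumption that the listing has the column layout shown on this page; job_state is a hypothetical helper name, and the demonstration uses the sample listing instead of calling qstat.

```shell
#!/bin/sh
# job_state: hypothetical helper, not part of Grid Engine.
# Reads qstat-style text on standard input and prints the "state"
# column (the fifth field) for the given job ID, or "absent" if the
# job is not listed (for example because it has ended).
job_state() {
    awk -v id="$1" '
        $1 == id { print $5; found = 1 }
        END      { if (!found) print "absent" }
    '
}

# Demonstration on the sample listing from this page (no scheduler needed).
sample='job-ID  prior   name       user   state submit/start at     queue          slots ja-task-ID
-------------------------------------------------------------------------------------------
   7635 0.5404  trivial    jdoe   r     11/18/2011 23:16:10 short.q@cl061      1'

printf '%s\n' "$sample" | job_state 7635   # the job that is listed
printf '%s\n' "$sample" | job_state 9999   # a job ID that is not listed

# On a real cluster one might write instead:
#   qstat | job_state 7635
```

An array job can appear on several lines of the listing, in which case this sketch prints one state per line.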
When the job is done, the output appears in a file with a name like trivial.sh.o7635. The components of the output file name are the job name (trivial.sh in our example), the job id (7635), and between them ".o" for "output".
There are other utilities that can show you some useful information about your jobs:
- qsum, for a simplified view of the entire system load
- showq, for insight into when your job might begin running
- More on qstat, including the meaning of job status codes
- qacct, for data about a finished job (memory used, run time, error codes, etc.)
Deleting jobs
If you want to remove your job from the queue, whether it's running or waiting, you can use the qdel command. You can delete one or more jobs by specifying their names or job IDs like so:
$ qdel job_name1 job_id2
If you delete a job by name, and several of your jobs have the same name, then all of them will be deleted. To delete exactly one job, use its job ID, which is unique.
You can delete all of your jobs by using a wildcard like so:
$ qdel "*"
If you are running an array job and want to delete only one task, then you need to specify a job ID as well as a task ID separated with a full stop, like so:
$ qdel job_id.task_id
You can also delete a range of tasks like so:
$ qdel job_id.task_id1-task_id2
You can find a task ID in the qstat output.
Finally, if you cannot delete your job and it is stuck in the d state for a long time, you can force its deletion by providing the -f option to qdel. However, before doing so, please read the relevant section in our FAQ.
Parameters
- Complete parameter list: Grid Engine
Here are the most commonly-used job parameters:
Option | Description |
---|---|
-l h_rt=time | Run time limit, either in seconds or in hh:mm:ss format |
-l h_vmem=mem | Hard virtual memory limit; the mem specifier may include k, K, m, M, g, G; details at man queue_conf |
-cwd | Start the job script in the same directory it was submitted from, the "current working directory". If absent, the job will start in your home directory. |
-j y | Join the stderr output stream to the stdout stream. Error messages will be mixed in with the job script standard output. If absent, standard error will go into job_name.ejob_id |
-N name | Assigns a name to the job other than the name of the job script |
-o file | Redirects the standard output to the named file |
-S shell | Shell to interpret the job script: /bin/bash (default) or /bin/csh |
Every job must be submitted with a run time limit, h_rt. This is a hard limit, which means your job will be killed after it has been running for that length of time, so you should give yourself a margin of error. If you really don't know what run time to set, 48 hours is an acceptable choice. All other parameters are optional.
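Since h_rt accepts either a number of seconds or hh:mm:ss, it can help to see how the two forms correspond when choosing a margin of error. The sketch below converts hh:mm:ss to seconds; hms_to_seconds is a hypothetical helper name, and the sketch assumes all three fields are present.

```shell
#!/bin/sh
# hms_to_seconds: hypothetical helper, not part of Grid Engine.
# Converts a run time written as hh:mm:ss into a number of seconds.
# awk treats fields with leading zeros (such as "03") as decimal numbers.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

hms_to_seconds 00:03:00   # the limit used in trivial.sh
hms_to_seconds 48:00:00   # the 48-hour fallback suggested above
```

With these values, -l h_rt=00:03:00 asks for the same limit as -l h_rt=180.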
There are three ways to set a parameter or supply an option to a job:
- With #$ directives inside the job script, as shown above
- With flags to qsub when the job is submitted
- With flags to qalter while the job is waiting to run
The second method follows this pattern:
$ qsub -l h_rt=0:1:0 trivial.sh
Options to the qsub command override any conflicting options set with directives inside the job script trivial.sh. So the job in this example will initially have a run-time limit of one minute (0:1:0) regardless of what is given inside the script. Note that when using qsub, the script name (and any arguments to the script) must appear after all the Grid Engine flags.
The third method follows this pattern:
$ qalter -l h_rt=0:2:0 job_id
After the qalter command the run time limit will change to 2 minutes, but this will only have an effect if the job has not yet started. Please note that:
- Changing a parameter on a job that is already executing, for example to give it more time or more memory, has no effect.
- You must re-supply the h_rt and any other arguments to -l when you use qalter.
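Because every -l argument has to be supplied again with qalter, one way to avoid dropping a limit is to build the whole resource list in one place. The following is a small sketch; resource_list is a hypothetical helper name and h_vmem=2G is an illustrative value only. It relies on -l accepting a comma-separated list of resource=value pairs.

```shell
#!/bin/sh
# resource_list: hypothetical helper, not part of Grid Engine.
# Joins its arguments with commas so that all resource requests
# can be passed together in a single -l option.
resource_list() {
    # The subshell keeps the change to IFS from leaking to the caller.
    ( IFS=,; printf '%s\n' "$*" )
}

# Demonstration (no scheduler needed); h_vmem=2G is only an example value.
resource_list h_rt=0:2:0 h_vmem=2G

# On a real cluster one might write instead:
#   qalter -l "$(resource_list h_rt=0:2:0 h_vmem=2G)" job_id
```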