Qsum

Legacy documentation

This page describes a service provided by a retired ACENET system. Most ACENET services are now provided by national systems; for those, please visit https://docs.computecanada.ca.


Main page: Job Control

qsum is a custom-made ACENET utility that prints a short summary of how busy the Grid Engine queues and waiting list are.

Here is an example of qsum output from Mahone:

$ qsum

--------------------------------------------------
Queue Info                 Processors
--------------------------------------------------
Queue Name    IN USE     AVAIL   UNAVAIL     TOTAL
    long.q        55         1         0        56
  medium.q       108         0         4       112
   short.q       348        16        16       380
    test.q         0         8         0         8

---------------------------------------------------------
Users Info       Running        Waiting      Error/Other
---------------------------------------------------------
        ID    #JOBS  #CPUS   #JOBS  #CPUS    #JOBS  #CPUS
    asmith        1      8
    bbrown       11     89       3     24
    cjones        6     96      16    256      18    288
    dduffy        3    248
   ewilson        2     32
  ffreleng        3     19
   ggeorge        3      3
   hpotter        2     16
---------------------------------------------------------
     Total       31    511      19    280      18    288

The first table ("Queue Info") summarizes how many slots (CPU cores) are occupied by jobs and how many are available. Each row (long.q, medium.q, short.q, test.q) shows how many slots are in use, available, or out of service for

  • long jobs, i.e. those requesting more than 168 hours run time,
  • medium jobs, requesting more than 48 hours,
  • short jobs, requesting less than 48 hours, and
  • test jobs, requesting less than 1 hour and "test=true".

Short jobs can also run in medium.q and long.q, and medium jobs can run in long.q.
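
The run-time rules above can be sketched as a small lookup. This is only an illustration, not part of qsum or Grid Engine: the function name `eligible_queues` is hypothetical, and the handling of the boundaries (exactly 48 or 168 hours) and of test jobs is an assumption, since the text does not spell those cases out.

```python
def eligible_queues(hours, test=False):
    """Return the cluster queues a job may run in, per the rules listed above.

    Assumptions (not stated in the source): a request of exactly 48 hours is
    treated as short, exactly 168 hours as medium, and a test job is listed
    only for test.q.
    """
    if hours <= 0:
        raise ValueError("requested run time must be positive")
    if test and hours < 1:
        return ["test.q"]
    if hours <= 48:
        # Short jobs can also run in medium.q and long.q.
        return ["short.q", "medium.q", "long.q"]
    if hours <= 168:
        # Medium jobs can also run in long.q.
        return ["medium.q", "long.q"]
    return ["long.q"]


if __name__ == "__main__":
    for request in (0.5, 24, 100, 200):
        print(request, eligible_queues(request, test=(request < 1)))
```

A job with more eligible queues has more rows of the "Queue Info" table to draw slots from, which is why short jobs generally have the most places to run.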

The second table ("Users Info") summarizes, for each user, the number of jobs and slots that are running and the number that are waiting.

The columns headed "Error/Other" show whether you have any jobs that are being held back from running due to an error condition. These error conditions are usually correctable by the user, not the system operator. Contact support if your jobs are in an error state and you don't know what to do about it. Jobs in various other non-runnable states like "hqw" or "dr" will also appear in this column.
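
Because a user's Waiting and Error/Other columns are left blank when empty, the "Users Info" rows cannot be split on whitespace alone. The following is a minimal sketch of one way to read them, assuming the right-aligned layout shown in the example above: each number is assigned to the header column whose right edge is nearest. The names `parse_users` and `FIELDS` are hypothetical, and treating a blank as zero is an assumption.

```python
import re

# Header and a few rows copied from the Mahone example above.
SAMPLE = """\
        ID    #JOBS  #CPUS   #JOBS  #CPUS    #JOBS  #CPUS
    asmith        1      8
    bbrown       11     89       3     24
    cjones        6     96      16    256      18    288
"""

FIELDS = ["running_jobs", "running_cpus", "waiting_jobs",
          "waiting_cpus", "other_jobs", "other_cpus"]


def parse_users(text):
    """Map each user ID to its six counts; blank columns become 0."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # Right edge of every header token after "ID".
    ends = [m.end() for m in re.finditer(r"\S+", header)][1:]
    users = {}
    for row in rows:
        tokens = list(re.finditer(r"\S+", row))
        record = dict.fromkeys(FIELDS, 0)
        for tok in tokens[1:]:
            # Pick the header column whose right edge is closest to this number.
            idx = min(range(len(ends)), key=lambda i: abs(ends[i] - tok.end()))
            record[FIELDS[idx]] = int(tok.group())
        users[tokens[0].group()] = record
    return users


if __name__ == "__main__":
    for user, counts in parse_users(SAMPLE).items():
        print(user, counts)
```

Matching on the nearest column edge rather than on exact character offsets tolerates rows that are misaligned by a character or two.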

$ qsum

--------------------------------------------------
Queue Info                 Processors
--------------------------------------------------
Queue Name    IN USE     AVAIL   UNAVAIL     TOTAL
    cmms.q        63        17         0        80
 demirov.q        32         0         0        32
gaussian.q        75         1         0        76
interact.q         0         8         0         8
    long.q        43         5         0        48
  medium.q        96         0         0        96
   short.q       299        33         0       332
     sub.q         0        72        88       160
 tarasov.q         4        76         0        80
    test.q         0         8         0         8

---------------------------------------------------------
Users Info       Running        Waiting      Error/Other
---------------------------------------------------------
        ID    #JOBS  #CPUS   #JOBS  #CPUS    #JOBS  #CPUS
    asmith       10    160       2     32       2
    bbrown        7    112      18    288
    cjones        1      1
    dduffy        3      7                      1      4
                ..... lines omitted .....
      zeno        1      4
---------------------------------------------------------
     Total      103    596     768   1108       4      5

This example from Placentia shows the large number of cluster queues present at that site.

  • The standard cluster queues (long.q, medium.q, short.q and test.q) are present, as at other sites.
  • There are several Green ACENET queues, e.g. demirov.q, cmms.q, tarasov.q. Access to these queues is restricted to members of the research group which funded the purchase of the nodes.
  • sub.q is a subordinate queue which allows users to run jobs on these Green ACENET nodes. See the sub.q page for how to do this.
  • gaussian.q is a set of nodes purpose-built for the Gaussian computational chemistry code, and accessible to all users of that package.

Caveat

The queueing system has a much more complicated state than can be represented in a one-screen summary. In particular, the "AVAIL" column should not simply be read as "idle and waiting for your job". For example, your job may require memory or other resources the available slots don't have. See "Job won't start" in our FAQ for more on this. Also, if adjustments have been made to the Grid Engine configuration, as technical staff have to do from time to time, there may be overcounting of available slots for a day or two.
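
As an illustration of reading the "Queue Info" table with this caveat in mind, here is a minimal sketch that parses the rows and reports AVAIL only as an optimistic figure. It assumes the column layout shown in the Mahone example; the function name `parse_queues` is hypothetical and is not part of qsum.

```python
# Queue Info rows copied from the Mahone example.
SAMPLE = """\
Queue Name    IN USE     AVAIL   UNAVAIL     TOTAL
    long.q        55         1         0        56
  medium.q       108         0         4       112
   short.q       348        16        16       380
    test.q         0         8         0         8
"""


def parse_queues(text):
    """Return {queue name: slot counts} for every data row in the table."""
    queues = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows are a queue name followed by exactly four integers;
        # headers, rulers and blank lines are skipped.
        if len(parts) != 5 or not all(p.isdigit() for p in parts[1:]):
            continue
        in_use, avail, unavail, total = map(int, parts[1:])
        queues[parts[0]] = {"in_use": in_use, "avail": avail,
                            "unavail": unavail, "total": total}
    return queues


queues = parse_queues(SAMPLE)
for name, q in queues.items():
    # In the examples on this page the three counts add up to TOTAL.
    assert q["in_use"] + q["avail"] + q["unavail"] == q["total"]
    # AVAIL is optimistic: slots may lack the memory or other resources
    # a particular job needs, and may be overcounted after reconfiguration.
    print(f"{name}: up to {q['avail']} of {q['total']} slots may be free")
```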
