Argo

Argo is a computer cluster located at Memorial University of Newfoundland, installed in late 2023, and managed by ACENET on behalf of researchers who have purchased the equipment. The environment is very similar to that of the Alliance general-purpose systems, with the exceptions described below.

Access

You may only access Argo with the permission of one of the contributing Principal Investigators. If you don't have access to Argo and believe you should, please write to support@ace-net.ca to request it. Copy the PI with whom you are associated, and include your Digital Research Alliance of Canada username in the email.

When your Alliance username has been added to the access control list for Argo, you can log in like so:

ssh username@argo.ace-net.ca

Argo uses the Alliance's multi-factor authentication and SSH-key authentication mechanisms. We strongly recommend that you register an SSH key and use that in place of password authentication.
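
For example, you can generate a key pair on your own computer and register the public key with the Alliance (normally through the CCDB web site). The file name below is the ssh-keygen default; your setup may differ:

ssh-keygen -t ed25519
cat ~/.ssh/id_ed25519.pub    # this is the public key to register; never share the private key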

Data transfer

A dedicated node is available for data transfers: dtn.argo.ace-net.ca
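
For example, assuming standard rsync-over-SSH access and using placeholder paths, you could copy a directory from your own computer to your Argo home directory with:

rsync -avP /path/to/local_dir username@dtn.argo.ace-net.ca:~/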

Software

Argo uses the same modules system as Alliance clusters, providing access to the same list of available software.
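
The usual Lmod commands apply; for example (the package and version below are only illustrative):

module avail                 # list software visible with the currently loaded toolchain
module spider gcc            # search all available versions of a package
module load gcc/12.3         # load a particular version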

Job scheduling

The basic guidance on job scheduling is to ask for the run time, memory, CPUs, and GPUs you need and let the scheduler assign your jobs to nodes or partitions automatically.
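
As a sketch, a minimal CPU-only batch script might look like the following; the resource values and program name are placeholders, not recommendations:

#!/bin/bash
#SBATCH --time=0-12:00          # run time in D-HH:MM
#SBATCH --cpus-per-task=4       # CPU cores for a multi-threaded program
#SBATCH --mem-per-cpu=4000M     # memory per CPU core
./my_program                    # placeholder for your own executable

Submit it with "sbatch jobscript.sh", where jobscript.sh is whatever you named the file.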

Maximum run time is 14 days, which was determined in consultation with the two largest contributing Principal Investigators. If this is insufficient, have your PI contact support@ace-net.ca to request a review of this limit.

All compute nodes at Argo were contributed by one research group or another. Members of each research group (or any other group designated by the contributing PI) can run jobs with run times of up to 14 days on their group's contributed nodes.

Idle compute nodes contributed by research groups other than your own may be used by jobs with run times of up to 1 day (24 hours) if the node is CPU-only, and up to 3 hours if it is a GPU node.

You should never have to use --partition; your job should automatically be assigned a list of suitable partitions at submission time. Likewise, you should never have to use --account, since every user on Argo belongs to exactly one Slurm account.

GPUs may be requested using either the --gres=gpu syntax or the newer --gpus= syntax, e.g. one of:

#SBATCH --gres=gpu:a100:1
#SBATCH --gpus=a100:1
#SBATCH --gres=gpu:h100:1
#SBATCH --gpus=h100:1
#SBATCH --gres=gpu:l40:1
#SBATCH --gpus=l40:1
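
Putting this together, a GPU job script might look like the sketch below; the GPU type and count, and the other resource values, are placeholders:

#!/bin/bash
#SBATCH --gpus=a100:1           # one A100 GPU
#SBATCH --cpus-per-task=8       # CPU cores to accompany the GPU
#SBATCH --mem=32000M            # memory for the whole job
#SBATCH --time=0-03:00          # run time in D-HH:MM
nvidia-smi                      # placeholder: report the allocated GPU, then run your own program here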

See "Node characteristics" below for the numbers and types of GPUs installed.

Storage

MORE TO COME...

Node characteristics

Nodes | Cores | Available memory  | CPU                                    | Storage | GPU
20    | 64    | 500G or 512000M   | 2 x Intel Xeon Platinum 8358 @ 2.6GHz  | ~750G   | -
47    | 64    | 250G or 256000M   | 2 x Intel Xeon Platinum 8358 @ 2.6GHz  | ~750G   | -
1     | 64    | 500G or 512000M   | 2 x Intel Xeon Platinum 8358 @ 2.6GHz  | ~720G   | 1 x NVIDIA Tesla A100 (80GB memory)
1     | 48    | 1000G or 1024000M | 2 x Intel Xeon Gold 5418Y @ 3.5GHz     | ~1800G  | 4 x NVIDIA Tesla H100 (93GB memory)
1     | 64    | 1000G or 1024000M | 2 x Intel Xeon Platinum 8352S @ 2.2GHz | ~1700G  | 3 x NVIDIA Tesla L40 (46GB memory)
  • "Available memory" is the amount of memory configured for use by Slurm jobs. Actual memory is slightly larger to allow for operating system overhead.
  • "Storage" is node-local storage. Access it via the $SLURM_TMPDIR environment variable.
  • Hyperthreading is turned off.
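
A minimal sketch of using node-local storage in a job script, with placeholder file names and with input and output kept in your home directory:

cd $SLURM_TMPDIR                # job-specific directory on the node-local disk
cp $HOME/input.dat .            # copy input to fast local storage
./my_program input.dat          # placeholder for your own executable
cp results.dat $HOME/           # copy results back before the job ends; local files are removed afterwards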

Operating system: Rocky Linux 9