Cluster Status
Clusters
Cluster | Status | Planned Outage | Notes |
---|---|---|---|
Siku | Online | - | |
Placentia | Online | no outages | Restricted since March 2019 |
Nefelibata | Online | - | No scheduler |
Humus | Online | - | No scheduler |
Argo | Online | - | |
For national clusters (Arbutus, Béluga, Cedar, Graham, Narval, Niagara) see status.alliancecan.ca
Services
Service | Status | Planned Outage | Notes |
---|---|---|---|
Globus at Argo | Online | - | |
Globus at Siku | Online | - | Academic users only |
Account creation | Manual | No outages | Write to support |
PGI and Intel licenses | Online | No outages | |
- Legend:
Online | cluster is up and running |
Offline | all users cannot login or submit jobs, or service is not working |
Online | some users can login and/or there are problems affecting your work |
Outage schedule
Jobs will not be scheduled with a run time (--time=) that extends into the beginning of a planned outage period. This ensures that a job is not terminated prematurely when the system goes down.
- There are currently no planned outages.
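The scheduling rule above can be checked by hand before submitting a job. The following is a minimal sketch, not the scheduler's actual implementation: the function names `parse_slurm_time` and `fits_before_outage` are hypothetical, the datetimes are naive local times, and it assumes the job would start immediately. It parses the time formats that Slurm documents for --time and compares the resulting end time with the start of an outage.

```python
from datetime import datetime, timedelta


def parse_slurm_time(spec: str) -> timedelta:
    """Convert a Slurm --time value into a timedelta.

    Handles the documented formats: minutes, minutes:seconds,
    hours:minutes:seconds, days-hours, days-hours:minutes and
    days-hours:minutes:seconds.
    """
    days = 0
    if "-" in spec:
        # days-hours[:minutes[:seconds]]
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        parts += [0] * (3 - len(parts))  # pad missing minutes/seconds
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) == 1:      # minutes
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:    # minutes:seconds
            hours, minutes, seconds = 0, parts[0], parts[1]
        elif len(parts) == 3:    # hours:minutes:seconds
            hours, minutes, seconds = parts
        else:
            raise ValueError(f"unrecognised time format: {spec!r}")
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def fits_before_outage(start: datetime, time_limit: str,
                       outage_start: datetime) -> bool:
    """Return True if a job starting at `start` ends before the outage."""
    return start + parse_slurm_time(time_limit) <= outage_start


if __name__ == "__main__":
    # Example values only: a maintenance window starting March 18, 08:30.
    outage = datetime(2025, 3, 18, 8, 30)
    now = datetime(2025, 3, 17, 8, 30)
    print(fits_before_outage(now, "12:00:00", outage))  # True
    print(fits_before_outage(now, "2-00", outage))      # False
```

If the check returns False, requesting a shorter run time is one way to let the job start before the outage, provided the work can finish within that limit.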
Siku
2025
- Both Siku and Argo were offline from March 18 to 20 for network and system maintenance.
During the outage, the public IP addresses of both clusters were changed and moved to a different subnet, and software updates were installed.
The maintenance also resolved a performance regression for certain MPI jobs that ran on nodes connected across more than one InfiniBand leaf switch and were close to their scaling limit.
- Wed Mar 12 2025 11:30 NDT
- UPDATE March 18, 08h30 NDT: The planned maintenance has started. We will continue to post updates here.
- UPDATE March 20, 10h10 NDT: The planned maintenance has been completed and job scheduling has been resumed.
- Siku is operating at reduced capacity due to problems with the cooling in the data centre. New jobs will not be started until we are confident that the temperature in the room will remain stable.
- Thu, 06 Mar 2025 16:08 (UTC)
- Update: Further work on the A/C unit has been postponed and jobs have been allowed to continue. We will communicate the start and duration of the upcoming A/C work once we know more. Rolling updates of compute nodes will continue next week (see below).
- Thu, 06 Mar 2025 19:33 (UTC)
- On Thu Feb 20 and between Mon Mar 03 and Thu Mar 13, we will be performing rolling updates of all compute nodes, causing Siku to operate at reduced total capacity. Because we reserve only a small fraction of nodes each day and all other nodes remain available, the impact on user jobs should be small.
- 11:00, January 30, 2025 (NST)
For older outages see: Previous outages
Argo
2025
- Both Siku and Argo were offline from March 18 to 20 for network and system maintenance.
During the outage, the public IP addresses of both clusters were changed and moved to a different subnet, and software updates were installed.
Storage quotas are also now being enforced on Argo.
- Wed Mar 12 2025 11:30 NDT
- UPDATE March 18, 08h30 NDT: The planned maintenance has started. We will continue to post updates here.
- UPDATE March 20, 10h10 NDT: The planned maintenance has been completed and job scheduling has been resumed.
- UPDATE March 27, 09h30 NDT: Globus file transfer at Argo has been restored.
- Due to a critical cooling failure in the data centre, we had to perform an emergency shutdown of Argo on the morning of Saturday, February 15. We expect Argo to become available again sometime on Monday, February 17.
- 12:30, Feb 15, 2025 (NST)
- Update #1: Argo's login nodes and filesystems are available again; however, the compute nodes will remain offline until next week.
- 14:30, Feb 15, 2025 (NST)
- Update #2: Over the course of today we have released about half of Argo's CPU nodes and all GPU nodes back into production. We continue to work on the remaining nodes.
- 16:30, Feb 17, 2025 (NST)
- Update #3: Most of Argo's compute nodes are back in production and we will continue enabling the remaining ones as soon as they are available.
- 13:30, Feb 19, 2025 (NST)
- Argo suffered an electrical power event on Friday evening (Jan 17) around 18h00 NST (21h30 UTC), which brought down some components. The cluster is back in production at this hour. Some compute nodes have not yet recovered; we are working to bring them back.
- 10:30, Jan 20, 2025 (NST)
2024
- Argo suffered an electrical power event last night (Nov 19-20), which brought down some components. The cluster is back in production at this hour. Some compute nodes have not yet recovered; sysadmins are working to bring them back.
- 12:10, Nov 20, 2024 (NST)
- Argo was offline from October 28 to 30, 2024 for electrical power work, upgrades of some infrastructure machines, and software and firmware updates. Service resumed on Thursday, October 31 at around 14h00 NDT with about 75% of its CPU capacity; the remaining nodes are still being worked on.
- 14:40, Oct 31, 2024 (NDT)
- Update: The GPU nodes argo[72-73] have been returned to service.
- 17:00, Nov 1, 2024 (NDT)
Placentia
- Placentia was retired from general service as of 2019 Mar 31. A reduced number of compute nodes remain in service, with access restricted to MUN users who have made suitable arrangements. Contact support@ace-net.ca if you believe you should have access.
Nefelibata
- Nefelibata has had its shared storage replaced, but the Slurm scheduler service has not been restored.
- 2024-03-18
- Nefelibata will be unavailable on 2023 September 5, Tuesday, for operating system and driver updates. We expect return-to-service on Wednesday Sept 6.
- Update at 2023-09-07 12:00 NDT: Outage complete, Nefelibata back in service.