Cluster Status

From ACENET
This page is maintained manually. It gets updated as soon as we learn new information.

Clusters

Cluster    | Status | Planned Outage          | Notes
Siku       | Online | Dec. 17 & Dec. 19, 2024 | MUN network maintenance
Placentia  | Online | No outages              | Restricted since March 2019
Nefelibata | Online | -                       | No scheduler
Argo       | Online | Dec. 17 & Dec. 19, 2024 | MUN network maintenance

For national clusters (Arbutus, Béluga, Cedar, Graham, Narval, Niagara) see status.alliancecan.ca

Services

Service                | Status | Planned Outage | Notes
Globus at Argo         | Online | -              |
Globus at Siku         | Online | -              | Academic users only
Account creation       | Manual | No outages     | Write support
PGI and Intel licenses | Online | No outages     |
Legend:
Online: cluster is up and running
Offline: no users can log in or submit jobs, or the service is not working
Online: some users can log in and/or there are problems affecting your work

Outage schedule

Jobs will not be scheduled with a run time (--time=) that extends into the beginning of a planned outage period. This is so the job will not be terminated prematurely when the system goes down.

  • Memorial University's IT services will carry out network maintenance on Tuesday, Dec 17 between 11 pm and 1 am NST (Dec 18 2h30 to 4h30 UTC) and on Thursday, Dec 19 between 11 pm and 1 am NST (Dec 20 2h30 to 4h30 UTC). This might cause network interruptions and dropped connections to and from Siku and Argo.
    Connections from outside the campus were also briefly dropped on Dec 17 at about 2 pm NST (17h30 UTC), so brief interruptions may occur at other times as well.
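
As a rough illustration of the scheduling rule above, the following Python sketch estimates the longest --time value that would still end before a planned outage begins. It is not part of the scheduler: the function names, the 5-minute safety margin, and the submission time are assumptions made for the example, and whether a given maintenance window is actually enforced through a scheduler reservation is also an assumption. The outage start is the Dec 17, 11 pm NST window listed above, with NST taken as a fixed UTC-3:30 offset.

```python
from datetime import datetime, timedelta, timezone

# Newfoundland Standard Time as a fixed offset (UTC-3:30); ignores daylight time.
NST = timezone(timedelta(hours=-3, minutes=-30), "NST")


def max_time_limit(now, outage_start, margin=timedelta(minutes=5)):
    """Longest run time that ends before outage_start (hypothetical helper).

    Returns None if no job can finish before the outage begins.
    """
    available = outage_start - now - margin
    return available if available > timedelta(0) else None


def format_slurm_time(delta):
    """Format a timedelta as D-HH:MM:SS, one of the forms --time accepts."""
    total = int(delta.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Outage start from the schedule above: Dec 17, 2024, 11 pm NST.
outage_start_local = datetime(2024, 12, 17, 23, 0, tzinfo=NST)
outage_start_utc = outage_start_local.astimezone(timezone.utc)

# Assumed submission time, chosen only for illustration.
now = datetime(2024, 12, 16, 12, 0, tzinfo=timezone.utc)

limit = max_time_limit(now, outage_start_utc)
print(outage_start_utc.isoformat())   # 2024-12-18T02:30:00+00:00
print(format_slurm_time(limit))       # 1-14:25:00
```

With these assumed values, a job submitted with --time=1-14:25:00 or less would end before the outage begins; whether it can start immediately still depends on the queue.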


Siku

2024

  • On the morning of Tuesday, December 3rd, at around 8:00 am Nfld (7h30 Atlantic; 11h30 UTC), an unexpected power event affected the Siku data centre, causing compute nodes to crash and running jobs to fail. UPDATE: Normal operations resumed shortly after 10h00 Nfld.
Updated: 10:30, Dec 3, 2024 (NST)
  • On the morning of Wednesday, October 30th, there was a brief power outage affecting several buildings on the South Campus, including the data centre that houses Siku. A reservation beginning at 6:00 am Nfld on Wednesday morning prevented jobs from starting unless they would finish by that time. Regular production resumed at 15h40 UTC (13h10 NDT).
13:30, Oct 30, 2024 (NDT)
  • Siku underwent a rolling outage between Monday, Aug 26 and Monday, Sep 9, 2024, to facilitate kernel and other smaller updates. Over the course of two weeks the total capacity was reduced, as nodes were drained in small batches. The outage concluded with updating and rebooting the remaining login nodes on Monday, Sep 9, 2024.
17:45, Sep 9, 2024 (NDT)
  • Siku compute nodes were unavailable for several hours overnight July 18-19 due to electrical work by the city. Regular production resumed at 2024-07-19 11h54 UTC.
09:33, July 19, 2024 (NDT)
  • We started Siku's maintenance outage this morning at 10h00 UTC (7h30 NDT, 7h00 ADT). Over the next two weeks we will perform operating system and software upgrades of the login, compute and backend machines, including the GPFS filesystem.
09:30, June 17, 2024 (NDT)
  • There was an unplanned power outage between 16h15 and 16h30 UTC (13h45 and 14h00 NDT), during which many but not all jobs were lost. Normal operation resumed at about 18h00 UTC (15h30 NDT).
15:38, March 26, 2024 (NDT)
  • The Slurm job scheduler was off-line on Monday, March 25, 2024, from 11h00 UTC (08h30 NDT) until 12h45 UTC (10h15 NDT) for a second urgent maintenance on the machine running the Slurm controller. This has now been completed and normal operation has resumed.
10:23, March 25, 2024 (NDT)
  • Siku scheduler is available again.
    The emergency maintenance was completed and normal operation resumed at 11h50 NDT (14h20 UTC).
12:00, March 19, 2024 (NDT)
  • Slurm job scheduler will be off-line Tuesday March 19, 2024, beginning at 13h30 UTC for emergency maintenance on the machine running the Slurm controller. We anticipate an outage of approximately two hours. New jobs are being accepted but none will be launched until after the outage. Access to the cluster will still be permitted and storage will remain accessible.

For older outages see: Previous outages

  • Our newest cluster, Siku, is now in production. Access is currently restricted to invited users only. Access request form.
13:00, December 10, 2019 (NST)

Argo

  • Argo suffered an electrical power event last night (Nov 19-20) which brought down some components. The cluster is back in production at this hour. Some compute nodes have not yet recovered; sysadmins are working to bring them back.
12:10, Nov 20, 2024 (NST)
  • Argo was offline from October 28 to 30, 2024, for electrical power work, some upgrades of infrastructure machines, and some software and firmware updates. Service resumed on Thursday, October 31st at around 14h00 NDT with about 75% of its CPU capacity, while the remaining nodes are being worked on.
14:40, Oct 31, 2024 (NDT)
Update: The GPU nodes argo[72-73] have been returned to service
17:00, Nov 1, 2024 (NDT)

Placentia

  • Placentia was retired from general service as of 2019 Mar 31. A reduced number of compute nodes remain in service, with access restricted to MUN users who have made suitable arrangements. Contact support@ace-net.ca if you believe you should have access.

Nefelibata

  • Nefelibata has had its shared storage replaced, but Slurm scheduler service has not yet been restored. This is waiting for personnel to become available from other work.
2024-03-18
  • Nefelibata will be unavailable on Tuesday, September 5, 2023, for operating system and driver updates. We expect return to service on Wednesday, September 6.
Update at 2023-09-07 12:00 NDT: Outage complete, Nefelibata back in service.