Cluster Status

This page is maintained manually. It gets updated as soon as we learn new information.

Clusters

Cluster | Status | Planned Outage | Notes
Siku | Online | No outages |
Placentia | Online | | Restricted since Mar 2019
Arbutus | See status.computecanada.ca (west.cloud.computecanada.ca)
Béluga | See status.computecanada.ca
Cedar | See status.computecanada.ca
Graham | See status.computecanada.ca
Niagara | See status.computecanada.ca

Services

Service | Status | Planned Outage | Notes
WebMO | Retired | End of service 2019 Mar 31 | Retired with Placentia
Account creation | Manual | No outages | Write support
PGI and Intel licenses | Online | No outages |
Legend:
Online: cluster is up and running
Offline: no users can log in or submit jobs, or the service is not working
Online: some users can log in and/or there are problems affecting your work

Outage schedule

The scheduler will not start a job whose requested run time (--time=) would extend into a planned outage period, so that the job is not terminated prematurely when the system goes down.

  • No outages currently scheduled.
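To illustrate the run-time rule above, here is a minimal Python sketch that works out the longest --time value a job could request and still finish before an outage begins. The function name `max_time_limit` and the 10-minute safety margin are hypothetical choices made for this example, not part of Slurm or of ACENET's configuration. Both timestamps are assumed to be in the same time zone. The example dates use the 8 June 2021 06:30 outage start mentioned further down this page.

```python
from datetime import datetime, timedelta


def max_time_limit(now, outage_start, margin=timedelta(minutes=10)):
    """Return the longest run time, as D-HH:MM:SS, that ends before outage_start.

    `margin` is an assumed safety buffer for this sketch; the scheduler's
    own rule may differ.
    """
    available = outage_start - now - margin
    if available <= timedelta(0):
        raise ValueError("no time left before the outage begins")
    total_seconds = int(available.total_seconds())
    days, rem = divmod(total_seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    # Slurm accepts days-hours:minutes:seconds as a --time value.
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    now = datetime(2021, 6, 7, 9, 0)      # hypothetical submission time
    outage = datetime(2021, 6, 8, 6, 30)  # outage start taken from this page
    print(max_time_limit(now, outage))    # prints 0-21:20:00
```

A job submitted at that moment with --time=0-21:20:00 or less could therefore complete before the outage, while a longer request would wait until the system returns to service.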

Siku

  • In the early morning hours of Saturday, Sep. 11, 2021, Hurricane "Larry" caused significant power outages across eastern Newfoundland. Power interruptions in the MUN data centre caused several compute nodes to reboot and the scheduler service to crash. The scheduler resumed operation about two hours ago and, as of now, all compute nodes are back in service.
13:15, September 11, 2021 (NDT)
  • During the night of Saturday to Sunday (Aug. 7/8, 2021) there was a short power interruption in the MUN data centre due to a thunderstorm over St. John's. This caused a number of compute nodes to reboot and crashed the Slurm scheduler. The scheduler was restarted around 2021-08-08 18:30 NDT, and as of 10:00 am on Monday, August 9, all compute nodes are back in production.
11:55, August 9, 2021 (NDT)
  • Air conditioning maintenance is planned for the Siku data centre on Thursday July 22. MUN IT Services has asked ACENET to reduce the heat in the room, so we are preventing new jobs from starting. The current plan is that logins will continue to be accepted, the filesystem will continue to be accessible, new jobs will be accepted (but will not start), and running jobs will be allowed to complete normally. However, if temperature becomes a problem during the maintenance we may have to take stronger measures (such as terminating jobs prematurely) on short notice or no notice.
10:37, July 21, 2021 (ADT)
  • UPDATE at 2021-07-22 16:40 NDT: Siku has been partially returned to service. Out of an abundance of caution, we have released 10 nodes overnight, leaving 50 idle. In the morning (Fri July 23) we will release the remaining nodes while staff are on duty to monitor and address any unforeseen problems arising from the air conditioning.
  • UPDATE at 2021-07-23 13:00 NDT: Most compute nodes have been returned to service. A few compute nodes remain offline/draining while we install important updates to the Linux kernel. We don't expect any further disruptions.
  • Siku is in a planned outage that commenced on Tuesday, 8 June 2021 at 6:30 am NDT in order to perform maintenance and incorporate new equipment, affecting both Siku's HPC nodes and the cloud.
    During this outage the default software environment on Siku-HPC will be changed to StdEnv/2020 to bring Siku in line with other Compute Canada HPC systems.
    Please see Migration to the 2020 standard environment for more information about this change.
  • UPDATE at 2021-06-08 16:00 NDT: The outage will take longer than anticipated. The revised expected return to service is noon Thursday, 10 June 2021.
  • UPDATE at 2021-06-10 11:30 NDT: Expected return to service is now noon Friday, 11 June 2021.
  • UPDATE at 2021-06-11 14:30 NDT: The outage has completed and job scheduling has resumed at 14:00 NDT. Remember that StdEnv/2020 is the new default.
  • Siku was offline between 12:00 pm NST on November 27th and 12:00 pm NST on November 30th due to a planned campus-wide power outage during that weekend. We took this opportunity to upgrade Slurm to version 20.11.0.
  • On November 19, 2020, job scheduling was stopped between 8:30 am and 1:00 pm NST to facilitate servicing the air conditioning unit in the data centre. Access to login nodes and storage was maintained during that time.
  • Our newest cluster, Siku, is now in production. Access is currently restricted to invited users only. Access request form.
13:00, December 10, 2019 (NST)

Placentia

  • Placentia was retired from general service as of 2019 Mar 31. A reduced number of compute nodes remain in service, with access restricted to MUN users who have made suitable arrangements. Contact support@ace-net.ca if you believe you should have access.