Cluster Status/Previous outages

3,291 bytes added, 16:07, October 27, 2021
This page contains information on previous outages that have been removed from the [[Cluster Status|main Cluster Status page]].

== Siku ==

==== 2021 ====

* During the night of Saturday to Sunday (August 7-8, 2021), a brief power interruption occurred in the MUN data centre due to a thunderstorm over St. John's. This caused a number of compute nodes to reboot and crashed the Slurm scheduler. The scheduler was restarted around 2021-08-08 18:30 NDT, and as of 10:00 am on Monday, August 9, all compute nodes are back in production.
: 11:55, August 9, 2021 (NDT)

* Air conditioning maintenance is planned for the Siku data centre on Thursday, July 22. MUN IT Services has asked ACENET to reduce the heat in the room, so we are preventing new jobs from starting. The current plan is that logins will continue to be accepted, the filesystem will remain accessible, new jobs will be accepted (but will not start), and running jobs will be allowed to complete normally. However, if temperature becomes a problem during the maintenance, we may have to take stronger measures (such as terminating jobs prematurely) with little or no notice.
: 10:37, July 21, 2021 (ADT)
:* '''UPDATE''' at ''2021-07-22 16:40 NDT'': Siku has been partially returned to service. Out of an abundance of caution, we have released 10 nodes overnight, leaving 50 idle. In the morning (Friday, July 23) we will release the remaining nodes while staff are on duty to monitor and address any unforeseen problems arising from the air conditioning.
:* '''UPDATE''' at ''2021-07-23 13:00 NDT'': Most compute nodes have been returned to service. A few compute nodes remain offline or draining while we install important updates to the Linux kernel. We don't expect any further disruptions.

* Siku was in a planned outage that commenced on ''Tuesday, 8 June 2021 at 6:30 am NDT'' in order to perform maintenance and incorporate new equipment, affecting both Siku's HPC nodes and the cloud. <br>During this outage the [https://docs.computecanada.ca/wiki/Standard_software_environments default software environment] on Siku-HPC was changed to <code>StdEnv/2020</code> to bring Siku in line with other Compute Canada HPC systems.<br>Please see [https://docs.computecanada.ca/wiki/Migration_to_the_2020_standard_environment Migration to the 2020 standard environment] for more information about this change.
:* '''UPDATE''' at ''2021-06-08 16:00 NDT'': The outage will take longer than anticipated. The revised expected return to service is noon Thursday, 10 June 2021.
:* '''UPDATE''' at ''2021-06-10 11:30 NDT'': Expected return to service is now noon Friday, 11 June 2021.
:* '''UPDATE''' at ''2021-06-11 14:30 NDT'': The outage is complete and job scheduling resumed at 14:00 NDT. Remember that '''<code>StdEnv/2020</code>''' is the new default.

==== 2020 ====

* Siku was offline between 12:00 pm NST on November 27 and 12:00 pm NST on November 30 due to a planned campus-wide power outage during that weekend. We took this opportunity to upgrade Slurm to version 20.11.0.

* On November 19, 2020, job scheduling was stopped between 8:30 am and 1:00 pm NST to allow servicing of the air conditioning unit in the data centre. Access to login nodes and storage was maintained during that time.