== Siku ==
==== 2023 ====
* '''UPDATE''': Siku is available again.<br>The two V100-GPU nodes are known to boot very slowly and are still unavailable at this time. We will return them to service later today.
: 11:00, December 22, 2023 (NST)
* Newfoundland Power has advised us of a planned power outage on Memorial University's south campus on '''Thursday, December 21st 2023''' to facilitate the relocation of an overhead powerline and pole. We will start shutting down Siku at 1100h Nfld (1030h Atlantic) that day and plan to have Siku up and running again around noon on Friday, December 22nd.
: 15:30, December 18, 2023 (NST)
* Last night there was a power outage in the data centre that hosts Siku. The whole system is currently unavailable. We are actively working on booting everything up again and expect Siku to be operational later today.
: 09:10, December 14, 2023 (NST)
: '''UPDATE''' at 13:00, December 14, 2023 (NST): We have completed recovery after this unplanned power outage and Siku is operational again.<br>The two V100-GPU nodes are known to boot very slowly and are still unavailable at this time. We will return them to service later today.
* Facilities Management has advised us of a short power outage on the morning of Thursday, August 3rd 2023, for which we need to shut down all compute nodes.<br>Access to the login nodes and storage system is expected to be maintained throughout the outage, though a short (under 5 minutes) network outage may be experienced around 6 am NDT (8:30 am UTC).<br>The scheduler will only start jobs that can finish by 4:30 am NDT (7 am UTC) on Thursday, August 3rd 2023.<br>We expect to resume normal operations later the same day.
: 16:20, July 28, 2023 (NDT)
: '''UPDATE''' at 09:45h, August 3, 2023 (NDT): The power outage was completed and Siku is operational again.
* On the morning of July 17th we noticed that our air conditioning (A/C) unit was leaking water and had to be turned off. Without a working A/C unit, we are now powering off all compute nodes to reduce heat in the data centre.<br>We will post an update here as soon as we have a better estimate of when service can be resumed.
: 10:45, July 17, 2023 (NDT)
: '''UPDATE''' at 2023-07-17 11:50 NDT: The issue (a clogged drain) has been resolved and we are in the process of powering up the compute nodes again. We will provide an update as soon as Siku is available again.
: '''UPDATE''' at 2023-07-17 12:50 NDT: Siku is available again. Jobs resumed 30 minutes ago and users can log in again. Three of our GPU nodes are still offline, but we are working on putting them back into service later today.
* On June 26 there was a brief power interruption in the MUN data centre that caused several compute nodes to reboot and interrupted the cluster as well as its internal network. We are in the process of resolving the resulting issues and making all resources available again.
: 11:00, June 27, 2023 (NDT)
: '''UPDATE''' at 2023-06-27 12:00 NDT: Siku is fully available again. Unfortunately, we had to reboot all compute nodes to resolve filesystem issues caused by the power event.
* The Siku outage started at 07:30 NDT (10h00 UTC). We anticipate restoring service by Wednesday, May 10 at 20:00 UTC, sooner if possible.
: 7:47, May 8, 2023 (NDT)
: '''UPDATE''' at 2023-05-10 19:00 NDT: We are still experiencing several issues with Siku. Expected return to service is now Thursday, 11 May 2023.
: '''UPDATE #2''' at 2023-05-12 09:12 NDT: The outage was successfully completed. We informed all Siku users via email.
==== 2022 ====
* Siku has been back online since 12:30pm NDT (15h00 UTC). This was the last of three scheduled power outages.