Cluster Status/Previous outages

1,497 bytes added, 17:55, October 29, 2021
Siku
==== 2021 ====
* Siku is now in a planned outage for urgent maintenance of the Uninterruptible Power Supply (UPS) units in the data centre that houses Siku and other equipment. We anticipate return to service at mid-day on Friday, October 29.
: 13:30, October 27, 2021 (NDT)
 
* On Wednesday, Oct. 6, 2021, around 5:30 pm NDT (8 pm UTC), what appears to have been a power event interrupted the GPFS filesystem and crashed the Slurm controller (scheduler). All running jobs were lost. As of now (Oct. 7, 9:30 am NDT) everything is back up and scheduling has resumed.
: 09:40, October 7, 2021 (NDT)
 
* The Uninterruptible Power Supply (UPS) units in the machine room serving Siku, Placentia, etc., will undergo maintenance on Thursday, October 28. We are advised that we will not be able to run on "street power" during this maintenance, so all clusters will be powered down on Wednesday, October 27, beginning at 12:00 noon Newfoundland time (14:30 UTC). We anticipate return to service at mid-day on Friday, October 29.
: 15:30, October 5, 2021 (NDT)
 
* In the early morning hours of Saturday, Sep. 11, 2021, Hurricane "Larry" caused significant power outages across eastern Newfoundland. Power interruptions in the MUN data centre caused several compute nodes to reboot and the scheduler service to crash. The scheduler resumed operation about two hours ago and, as of now, all compute nodes are back in service.
: 13:15, September 11, 2021 (NDT)
* During the night of Saturday to Sunday (Aug. 7/8, 2021) there was a short power interruption in the MUN data centre due to a thunderstorm over St. John's. This caused a number of compute nodes to reboot and crashed the Slurm scheduler. The scheduler was restarted around 2021-08-08 18:30 NDT, and as of 10:00 am on Monday, August 9, all compute nodes are back in production.