Cluster Status/Previous outages

From ACENET

This page contains information on previous outages that have been removed from the main Cluster Status page.

Siku

2021

  • Today's outage was successful. Backups, collection of quota info, and generation of home-directories have resumed. We are aware that "srun --x11" and "salloc --x11" don't work at this time and are still investigating.
12:10, December 20, 2021 (NST)
  • Trouble with the filesystem and/or network earlier today resulted in the loss of several jobs. Siku is now back in production, but backups are disabled and the output of the 'quota' command will be out-of-date until we are able to correct the underlying problem next week.
15:20, December 10, 2021 (NST)
  • Trouble with the filesystem began about 13:55 UTC today, causing Slurm to remove many compute nodes from service. Staff are investigating.
13:00, December 10, 2021 (NST)
  • Maintenance which began yesterday is now complete, and Siku is back in production.
14:10, December 8, 2021 (NST)
  • UPS maintenance of Oct 27-29 has ended.
16:20, October 29, 2021 (NDT)
  • Siku is now in a planned outage to facilitate urgent maintenance of the Uninterruptible Power Supply (UPS) units in the data centre that houses Siku and other equipment. We anticipate return-to-service mid-day on Friday, October 29th.
13:30, October 27, 2021 (NDT)
  • On Wednesday, Oct. 6, 2021 around 5:30pm NDT (8pm UTC) there was what seems to be a power event, which caused an interruption in the GPFS filesystem and crashed the Slurm controller (scheduler). All running jobs have been lost. As of now (Oct 7th, 9:30 am NDT) everything is back up and scheduling has resumed.
09:40, October 7, 2021 (NDT)
  • The Uninterruptible Power Supply (UPS) units in the machine room serving Siku, Placentia, etc., will undergo maintenance on Thursday, October 28. We are advised that we will not be able to run on "street power" during this maintenance, so all clusters will be powered down on Wednesday, October 27, beginning at 12:00 noon Newfoundland time (14h30 UTC). We anticipate return-to-service mid-day on Friday, October 29th.
15:30, October 5, 2021 (NDT)
  • In the early morning hours of Saturday, Sep. 11, 2021, Hurricane "Larry" caused significant power outages across eastern Newfoundland. Power interruptions in the MUN data centre caused several compute nodes to reboot and the scheduler service to crash. Operation of the scheduler resumed about two hours ago and, as of now, all compute nodes are back in service.
13:15, September 11, 2021 (NDT)
  • During the night of Saturday to Sunday (Aug. 7/8, 2021) there was a short power interruption in the MUN data centre due to a thunderstorm over St. John's. This caused a number of compute nodes to reboot and crashed the Slurm scheduler. The scheduler was restarted around 2021-08-08 18:30 NDT, and as of 10:00 am on Monday, August 9th, all compute nodes are back in production.
11:55, August 9, 2021 (NDT)
  • Air conditioning maintenance is planned for the Siku data centre on Thursday July 22. MUN IT Services has asked ACENET to reduce the heat in the room, so we are preventing new jobs from starting. The current plan is that logins will continue to be accepted, the filesystem will continue to be accessible, new jobs will be accepted (but will not start), and running jobs will be allowed to complete normally. However, if temperature becomes a problem during the maintenance we may have to take stronger measures (such as terminating jobs prematurely) on short notice or no notice.
10:37, July 21, 2021 (ADT)
  • UPDATE at 2021-07-22 16:40 NDT: Siku has been partially returned to service. Out of an abundance of caution, we have released 10 nodes overnight, leaving 50 idle. In the morning (Fri July 23) we will release the remaining nodes while staff are on duty to monitor and address any unforeseen problems arising from the air conditioning.
  • UPDATE at 2021-07-23 13:00 NDT: Most compute nodes have been returned to service. Only a few compute nodes remain offline/draining while we install important updates to the Linux kernel. We don't expect any further disruptions.
  • Siku was in a planned outage that commenced on Tuesday, 8 June 2021 at 6:30am NDT in order to perform maintenance and incorporate new equipment, affecting both Siku's HPC nodes and the cloud.
    During this outage the default software environment on Siku-HPC will be changed to StdEnv/2020 to bring Siku in line with other Compute Canada HPC systems.
    Please see Migration to the 2020 standard environment for more information about this change.
  • UPDATE at 2021-06-08 16:00 NDT: The outage will take longer than anticipated. The revised expected return to service is noon Thursday, 10 June 2021.
  • UPDATE at 2021-06-10 11:30 NDT: Expected return to service is now noon Friday, 11 June 2021.
  • UPDATE at 2021-06-11 14:30 NDT: The outage has completed and job scheduling has resumed at 14:00 NDT. Remember that StdEnv/2020 is the new default.

2020

  • Siku was offline between 12:00 pm NST on November 27th and 12:00 pm NST on November 30th due to a planned campus-wide power outage during that weekend. We took this opportunity to upgrade Slurm to version 20.11.0.
  • On November 19, 2020, job scheduling was stopped between 8:30 am and 1:00 pm NST to facilitate servicing the air conditioning unit in the data centre. Access to login nodes and storage was maintained during that time.
  • On the morning of July 2, 2020, around 9:30 am NDT, there was a power fluctuation in the data centre which caused most nodes to reboot. As of 10:30 am NDT, the scheduler is back in operation and job execution has resumed.
11:12, July 2, 2020 (NDT)
  • Siku had been offline for maintenance and the addition of contributed equipment since Monday, June 1 at 10h00 UTC (07h30 NDT). The maintenance has now concluded; Siku is accessible again and job scheduling has resumed.
12:00, June 2, 2020 (NDT)
  • After a power failure on April 16, 2020 at approximately 11:20 NDT, Siku returned to service on April 24, 2020 after the UPS batteries were replaced.
14:49, April 24, 2020 (NDT)
  • Siku is back online. The UPS that failed is conditioning power as it should but is still in need of battery replacement.
16:00, April 1, 2020 (NDT)
  • UPS failure. Returning to service on street power is deemed risky. We are investigating repair options for the UPS.
15:01, March 30, 2020 (NDT)
  • Electrical power work at MUN data centre is complete. Siku and Placentia are back in production.
17:15, March 17, 2020 (NDT)
  • The power fluctuation at Memorial University caused issues with the InfiniBand network. In the process we had to temporarily suspend the scheduler and terminate all jobs that had not already failed right away. The scheduler was resumed on Wednesday at 4:40pm NST, and Siku is back online.
9:00, March 12, 2020 (NST)
  • We had a campus-wide power event at Memorial University. Some compute nodes were affected and some jobs have crashed. We are still investigating and fixing issues and will prevent jobs from starting in the meantime.
13:51, March 11, 2020 (NST)