Cluster Status/Previous outages

From ACENET

This page contains information on previous outages that have been removed from the main Cluster Status page.

Siku

2023

  • UPDATE: Siku is available again.
    The two V100-GPU nodes are known to boot very slowly and are still unavailable at this time. We will return them to service later today.
11:00, December 22, 2023 (NST)
  • Newfoundland Power has advised us of a planned power outage of Memorial University's south campus on Thursday, December 21st 2023, in order to facilitate relocation of an overhead powerline & pole. We will start shutting down Siku at 1100h Nfld (1030h Atlantic) that day and are planning to have Siku up and running again around noon on Friday, December 22nd.
15:30, December 18, 2023 (NST)
  • Last night there was a power outage in the data centre that hosts Siku. Currently the whole system is unavailable; however, we are actively working on booting everything up again and expect Siku to be operational later today.
09:10, December 14, 2023 (NST)
UPDATE at 13:00, December 14, 2023 (NST): We have completed recovery after this unplanned power outage and Siku is operational again.
The two V100-GPU nodes are known to boot very slowly and are still unavailable at this time. We will return them to service later today.
  • Facilities Management has advised us of a short power outage on the morning of Thursday, August 3rd 2023, for which we need to shut down all compute-nodes.
    Access to the login-nodes and storage system is expected to be maintained throughout the outage, though a short (<5min) network outage may be experienced around 6 am NDT (8:30 am UTC).
    The scheduler won't start any jobs that won't finish by 4:30 am NDT (7 am UTC) on Thu Aug 3rd 2023.
    We expect to resume normal operations later the same day.
16:20, July 28, 2023 (NDT)
UPDATE at 09:45h, August 3, 2023 (NDT): The power outage was completed and Siku is operational again.
  • On the morning of July 17th we noticed that our air conditioning (A/C) unit was leaking water and had to be turned off. Without a working A/C unit we are now powering off all compute nodes in order to reduce heat in the data centre.
    We will post an update here as soon as we have a better estimate about when service can be resumed.
10:45, July 17, 2023 (NDT)
UPDATE at 2023-07-17 11:50 NDT: The issue (a clogged drain) has been resolved and we are in the process of powering up the compute nodes again. We will provide an update as soon as Siku is available again.
UPDATE at 2023-07-17 12:50 NDT: Siku is available again. Jobs resumed 30 minutes ago and users can log in again. Three of our GPU nodes are still offline, but we are working on putting them back into service later today.
  • On June 26 there was a brief power interruption in the MUN data centre that caused several compute nodes to reboot, as well as interruptions to the cluster and its internal network. We are in the process of resolving the issues caused by this and making all resources available again.
11:00, June 27, 2023 (NDT)
UPDATE at 2023-06-27 12:00 NDT: Siku is fully available again. Unfortunately we had to reboot all compute nodes to resolve filesystem issues that were caused by the power-event.
  • The Siku outage started at 07:30 NDT (10h00 UTC). We anticipate restoring service by Wednesday May 10 at 20:00 UTC, sooner if possible.
7:47, May 8, 2023 (NDT)
UPDATE at 2023-05-10 19:00 NDT: We are still experiencing several issues with Siku. Expected return to service is now Thursday, 11 May 2023.
UPDATE #2 at 2023-05-12 09:12 NDT: The outage was successfully completed. We informed all Siku users via email.

2022

  • Siku has been back online since 12:30pm NDT (15h00 UTC). This was the last of three scheduled power outages.
12:45, May 30, 2022 (NDT)
  • Siku has been back online since 12:00pm NDT (14h30 UTC). There will be one further outage from 11:30am NDT (14h00 UTC) on Friday May 27 until midday on Monday May 30.
13:00, May 16, 2022 (NDT)
  • Siku has been back online since 12:30pm NDT (15h00 UTC). There will be two similar outages: May 13-16 and May 27-30.
12:32, May 2, 2022 (NDT)
  • Siku has been offline since 11:30am NDT (14h00 UTC) to facilitate electrical work by Memorial University facilities management in the data centre. We expect a return to service by midday on Monday, May 2nd 2022.
11:40, April 29, 2022 (NDT)
  • A time-sensitive maintenance outage was carried out on Monday March 28. Work began at 7:30AM Newfoundland time (10h00 UTC) and was completed by 5:30pm Newfoundland time (20h00 UTC). The work expanded our InfiniBand network and increased the capacity of our backend infrastructure to allow the addition of almost 30 nodes, which will be added over the coming days.
17:30, March 28, 2022 (NDT)
  • Memorial University IT services interrupted network service to Siku just after midnight Newfoundland time (03h30 UTC) on Tuesday Mar 1, 2022, to perform maintenance. The interruption lasted less than 30 minutes. During this time, jobs were prevented from starting to avoid failures caused by the lack of an external network connection; job starts have now resumed.
00:25, March 1, 2022 (NST)
  • Memorial University's networks are online again and access to Siku has been restored. Siku's scheduler had stopped at some point during the outage, but was restarted on Sat Jan. 8th at 10:20am (NST). Jobs have been running on Siku since then.
Update Monday Jan 10, 13:15 (NST): The onset of the network interruption was also accompanied by a power fluctuation that caused some (but not all) compute nodes to reboot.
13:00, January 8, 2022 (NST)
  • Memorial University has announced that they are experiencing a widespread internet outage. Access to Siku is therefore currently not possible, but we expect the system to continue running jobs until internet access has been restored.
Update 14:10 NST: Memorial University has announced on their Twitter account that the issue was caused by an internal technology malfunction. MUN-ITS is working on fixing it.
13:30, January 7, 2022 (NST)

2021

  • Today's outage was successful. Backups, collection of quota info, and generation of home-directories have resumed. We are aware that "srun --x11" and "salloc --x11" don't work at this time and are still investigating.
12:10, December 20, 2021 (NST)
  • Trouble with the filesystem and/or network earlier today resulted in the loss of several jobs. Siku is now back in production, but backups are disabled and the output of the 'quota' command will be out-of-date until we are able to correct the underlying problem next week.
15:20, December 10, 2021 (NST)
  • Trouble with the filesystem began about 13:55 UTC today, causing Slurm to remove many compute nodes from service. Staff are investigating.
13:00, December 10, 2021 (NST)
  • Maintenance which began yesterday is now complete, and Siku is back in production.
14:10, December 8, 2021 (NST)
  • UPS maintenance of Oct 27-29 has ended.
16:20, October 29, 2021 (NDT)
  • Siku is now in a planned outage to facilitate urgent maintenance of the Uninterruptible Power Supply (UPS) units in the data centre that houses Siku and other equipment. We anticipate return-to-service mid-day on Friday October 29th.
13:30, October 27, 2021 (NDT)
  • On Wednesday, Oct. 6, 2021 around 5:30pm NDT (8pm UTC) there was what seems to be a power event, which caused an interruption in the GPFS filesystem and crashed the Slurm controller (scheduler). All running jobs have been lost. As of now (Oct 7th, 9:30 am NDT) everything is back up and scheduling has resumed.
09:40, October 7, 2021 (NDT)
  • The Uninterruptible Power Supply (UPS) units in the machine room serving Siku, Placentia, etc, will undergo maintenance on Thursday, October 28. We are advised that we will not be able to run on "street power" for this maintenance, so all clusters will be powered down on Wednesday, October 27, beginning at 12:00 noon Newfoundland time (14h30 UTC). We anticipate return-to-service mid-day on Friday October 29th.
15:30, October 5, 2021 (NDT)
  • In the early morning hours of Saturday, Sep. 11, 2021, Hurricane "Larry" caused significant power outages across eastern Newfoundland. There were power interruptions in the MUN data centre that caused several compute nodes to reboot and the scheduler service to crash. Operation of the scheduler resumed about two hours ago and as of now, all compute nodes are back in service.
13:15, September 11, 2021 (NDT)
  • In the night from Saturday to Sunday (Aug. 7/8 2021) there was a short power-interruption in the MUN data centre due to a thunderstorm over St. John's. This caused a number of compute nodes to reboot and crashed the Slurm scheduler. The scheduler was restarted around 2021-08-08 18:30 NDT and as of 10:00 am on Monday August 9th all compute nodes are back in production.
11:55, August 9, 2021 (NDT)
  • Air conditioning maintenance is planned for the Siku data centre on Thursday July 22. MUN IT Services has asked ACENET to reduce the heat in the room, so we are preventing new jobs from starting. The current plan is that logins will continue to be accepted, the filesystem will continue to be accessible, new jobs will be accepted (but will not start), and running jobs will be allowed to complete normally. However, if temperature becomes a problem during the maintenance we may have to take stronger measures (such as terminating jobs prematurely) on short notice or no notice.
10:37, July 21, 2021 (ADT)
  • UPDATE at 2021-07-22 16:40 NDT: Siku has been partially returned to service. Out of an abundance of caution, we have released 10 nodes overnight, leaving 50 idle. In the morning (Fri July 23) we will release the remaining nodes while staff are on duty to monitor and address any unforeseen problems arising from the air conditioning.
  • UPDATE at 2021-07-23 13:00 NDT: Most compute nodes have been returned to service. Only a few compute nodes remain offline/draining as we are installing important updates to the Linux Kernel. We don't expect any further disruptions.
  • Siku was in a planned outage that commenced on Tuesday, 8 June 2021 at 6:30am NDT in order to perform maintenance and incorporate new equipment, affecting both Siku's HPC nodes and the cloud.
    During this outage the default software environment on Siku-HPC will be changed to StdEnv/2020 to bring Siku in line with other Compute Canada HPC systems.
    Please see Migration to the 2020 standard environment for more information about this change.
  • UPDATE at 2021-06-08 16:00 NDT: The outage will take longer than anticipated. The revised expected return to service is noon Thursday, 10 June 2021.
  • UPDATE at 2021-06-10 11:30 NDT: Expected return to service is now noon Friday, 11 June 2021.
  • UPDATE at 2021-06-11 14:30 NDT: The outage is complete and job scheduling resumed at 14:00 NDT. Remember that StdEnv/2020 is the new default.

2020

  • Siku was offline between 12:00 pm NST on November 27th and 12:00 pm NST on November 30th due to a planned campus-wide power outage during that weekend. We took this opportunity to upgrade Slurm to version 20.11.0.
  • On November 19, 2020, job scheduling was stopped between 8:30 am and 1:00 pm NST to facilitate servicing the air conditioning unit in the data centre. Access to login-nodes and storage was maintained during that time.
  • On the morning of July 2, 2020, around 9:30 am NDT, there was a power fluctuation in the data centre which caused most nodes to reboot. As of 10:30 am NDT, the scheduler is back in operation and job execution has resumed.
11:12, July 2, 2020 (NDT)
  • Siku had been offline for maintenance and the addition of contributed equipment since Monday, June 1 at 10h00 UTC (07h30 NDT). The maintenance has now concluded; Siku is accessible again and job scheduling has resumed.
12:00, June 2, 2020 (NDT)
  • After a power failure on April 16, 2020 at approximately 11:20 NDT, Siku returned to service on April 24, 2020 after the UPS batteries were replaced.
14:49, April 24, 2020 (NDT)
  • Siku is back online. The UPS that failed is conditioning power as it should but is still in need of battery replacement.
16:00, April 1, 2020 (NDT)
  • UPS failure. Returning to service on street power is deemed risky. We are investigating repair options for the UPS.
15:01, March 30, 2020 (NDT)
  • Electrical power work at MUN data centre is complete. Siku and Placentia are back in production.
17:15, March 17, 2020 (NDT)
  • The power fluctuation at Memorial University caused issues with the InfiniBand network. While resolving them, we had to temporarily suspend the scheduler and terminate all jobs that had not already failed. As of Wednesday 4:40pm NST the scheduler has resumed and Siku is back online.
9:00, March 12, 2020 (NST)
  • We had a campus-wide power event at Memorial University. Some compute nodes were affected and some jobs have crashed. We are still investigating and fixing issues and will prevent jobs from starting in the meantime.
13:51, March 11, 2020 (NST)