→Siku: archive 2024 outages
== Siku ==
==== 2025 ====
* On Thu Feb 20, Mon Mar 03 and Thu Mar 13, we will be performing rolling updates, causing SIKU to operate at reduced total capacity. Because we only reserve a small fraction of nodes each day, the impact on user jobs should be small; all other nodes will still be available.
: 11:00, January 30, 2025 (NST)

==== 2024 ====
* Memorial University's IT services carried out network maintenance on Tuesday, Dec 17 between 11 pm and 1 am NST (Dec 18 2h30 to 4h30 UTC) and on Thursday, Dec 19 between 11 pm and 1 am NST (Dec 20 2h30 to 4h30 UTC), which caused network interruptions and dropped connections to and from Siku and Argo.<br>Connections from outside the campus were also briefly dropped on Dec 17 at about 2 pm NST (17h30 UTC).
: Updated: 10:30, Jan 6, 2025 (NST)
* On the morning of Tuesday, December 3rd at around 8:00 am Nfld (7h30 Atlantic; 11h30 UTC) there was an unexpected power-event that affected the Siku data-centre, causing all compute-nodes to crash and running jobs to fail. '''UPDATE:''' Normal operations resumed shortly after 10h00 Nfld.
: Updated: 10:30, Dec 3, 2024 (NST)
* On the morning of Wednesday, October 30th there was a brief power-outage affecting several buildings on the South Campus, including the data center that houses Siku. A reservation beginning at 6:00 am Nfld on Wednesday morning prevented jobs from starting unless they finished by that time. Regular production resumed at 15h40 UTC (13h10 NDT).
: 13:30, Oct 30, 2024 (NDT)
* Siku underwent a rolling outage between Monday, Aug 26 and Monday, Sep 9, 2024, to facilitate kernel- and other smaller updates. Over the course of two weeks the total capacity was reduced, as nodes were drained in small batches. This outage concluded with updating and rebooting the remaining login nodes on Monday, Sep 9, 2024.
: 17:45, Sep 9, 2024 (NDT)
* Siku compute nodes were unavailable for several hours overnight July 18-19 due to electrical work by the city. Regular production resumed at 2024-07-19 11h54 UTC.
: 09:33, July 19, 2024 (NDT)
* We started Siku's maintenance outage this morning at 10h00 UTC (7h30 NDT, 7h00 ADT). Over the next two weeks we will perform operating system and software upgrades of the login-, compute- and backend-machines, including the GPFS filesystem.
: 09:30, June 17, 2024 (NDT)
* There was an unplanned power outage between 16h15 and 16h30 UTC (13h45 and 14h00 NDT), during which many but not all jobs were lost. Normal operation resumed at about 18h00 UTC (15h30 NDT).
: 15:38, March 26, 2024 (NDT)
* Slurm job scheduler was off-line '''Monday March 25, 2024''', from 11h00 UTC (08h30 NDT) until 12h45 UTC (10h15 NDT), for a second urgent maintenance on the machine running the Slurm controller. This has now been completed and normal operation has resumed.
: 10:23, March 25, 2024 (NDT)
* Siku scheduler is available again.<br>The emergency maintenance was completed and normal operation resumed at 11h50 NDT (14h20 UTC).
: 12:00, March 19, 2024 (NDT)
* Slurm job scheduler will be off-line Tuesday March 19, 2024, beginning at 13h30 UTC for emergency maintenance on the machine running the Slurm controller. We anticipate an outage of approximately two hours. New jobs are being accepted but none will be launched until after the outage. Access to the cluster will still be permitted and storage will remain accessible.
For older outages see: [[Cluster Status/Previous outages#Siku|Previous outages]]