Difference between revisions of "Cluster Status"
Revision as of 15:34, December 24, 2018
Clusters
Please click on a cluster name in the table below to jump to the corresponding section of this page. The outage schedule section collects information about all scheduled outages in one place.
| Cluster | Status | Planned Outage | Notes |
|---|---|---|---|
| Placentia | Online | Partial until Jan 3 | |
| Glooscap | Online | Partial Nov 26-Jan 3 | |
| Arbutus | See status.computecanada.ca | | (west.cloud.computecanada.ca) |
| Cedar | See status.computecanada.ca | | |
| Graham | See status.computecanada.ca | | |
| Niagara | See status.computecanada.ca | | |
Services
| Service | Status | Planned Outage | Notes |
|---|---|---|---|
| WebMO | Online | No outages | |
| Account creation | Manual | No outages | Write support |
| PGI and Intel licenses | Online | No outages | |
- Legend:
  - Online: cluster is up and running
  - Offline: no users can log in or submit jobs, or the service is not working
  - Online (shown in orange): some users can log in and/or there are problems affecting your work
Outage schedule
Grid Engine will not schedule any job with a run time (`h_rt`) that extends into the beginning of a planned outage period, so that the job will not be terminated prematurely when the system goes down.
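The check the scheduler applies can be sketched roughly as follows. This is an illustrative sketch only, not Grid Engine's actual implementation; the outage date and requested run time are assumed values, and GNU `date` is assumed for the `-d` option.

```shell
# Hypothetical sketch: a job whose requested run time (h_rt, in seconds)
# would extend past the start of a planned outage is held back.
outage_start=$(date -d "2019-01-03 08:00" +%s)  # assumed outage start (GNU date)
now=$(date +%s)                                 # current time, seconds since epoch
h_rt=$((24 * 3600))                             # assumed requested run time: 24 hours

if [ $((now + h_rt)) -lt "$outage_start" ]; then
    echo "job finishes before the outage: eligible to schedule"
else
    echo "job would run into the outage: held until after the outage"
fi
```

In practice, users declare the run time at submission, e.g. `qsub -l h_rt=24:00:00 job.sh`; shorter requested run times are therefore more likely to be scheduled in the days just before a planned outage.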
- Glooscap capacity will be temporarily reduced by 1120 cores beginning the morning of Monday November 26, in order to facilitate maintenance on the cooling systems in the Killam Data Centre. The head node and storage will remain accessible, as will about 30% of the compute capacity. Originally scheduled to be complete by Dec 15, the work has been extended to Dec 21, which does not leave us time to restore service before the Christmas break. Return to service is now expected on 2019 January 3.
Placentia
- Due to an issue with an A/C unit, some compute nodes (up to cl107) have been taken down. Repairs are complete, but at the request of ITS we are leaving the nodes down until Jan 3.
- 13:30, December 23, 2018 (NST)
- Work on the cooling systems was complete mid-afternoon on Wednesday December 19, and the idled nodes were returned to service.
- 15:28, December 19, 2018 (AST)
- Over the past few days we have seen large numbers of jobs failing immediately with Eqw errors. This was caused by one of the switches in our data network. The problem has now been resolved.
- 11:30, November 22, 2018 (AST)
- The WebMO web service was migrated to a different physical machine on the afternoon of Thursday, November 15, 2018. Further reboots were necessary on the morning of Friday, November 16. Running jobs have not been affected.
- 08:59, November 16, 2018 (AST)
- A power fluctuation in the Placentia data centre around 6:30 am on the morning of Wednesday, November 14, 2018, caused about 20 compute nodes to crash; these had to be rebooted. This would have affected a small number of jobs.
- 9:25, November 15, 2018 (ADT)
Glooscap
- Glooscap is back in service after a planned interruption this weekend (Sep 7-10). The metadata server component of the file system has been relocated and a full fsck has been run. We hope this will alleviate file system load problems.
- 10:54, September 10, 2018 (ADT)
- Glooscap is back in service again. We believe we may have identified a source of unusual load that was causing the trouble. Please check on the status of any jobs you have in the system to ensure they are running properly.
- 16:25, August 23, 2018 (ADT)
- Glooscap is not accepting logins again. The sysadmin is investigating the cause.
- 10:48, August 23, 2018 (ADT)
Fundy
- Fundy has been retired from service.
- 10:01, April 5, 2018 (ADT)
Mahone
- Mahone has been retired from service.
- 10:00, April 5, 2018 (ADT)