Difference between revisions of "Cluster Status"
Revision as of 17:35, February 22, 2013
Click the name of a cluster in the table below to jump to the corresponding section of this page. The outage schedule section lists all scheduled ACEnet outages in one place.
Cluster | Status | Planned Outage | Notes |
---|---|---|---|
Brasdor | Offline | No outages | |
Mahone | Online | No outages | |
Placentia | Online | No outages | |
Fundy | Online | No outages | |
Glooscap | Online | No outages | |
Courtenay | Online | No outages |
- Legend:
Online | cluster is up and running |
Offline | no users can log in or submit jobs |
Online | some users can log in and/or there are problems |
Outage schedule
Grid Engine will not schedule any job with a run time (h_rt) that extends into the beginning of a planned outage period. This is so the job will not be terminated prematurely when the system goes down.
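The rule above can be illustrated with a minimal Python sketch. It is not Grid Engine's actual scheduling code; the function names, the outage time, and the treatment of a job that ends exactly when the outage begins are assumptions made for illustration. The run time is given in the HH:MM:SS form used when requesting h_rt.

```python
from datetime import datetime, timedelta


def parse_h_rt(h_rt):
    """Convert an h_rt string in HH:MM:SS form into a timedelta."""
    hours, minutes, seconds = (int(part) for part in h_rt.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def fits_before_outage(start, h_rt, outage_start):
    """Return True if a job starting at `start` would finish by `outage_start`.

    Assumption: a job that ends exactly at the start of the outage
    is treated as fitting.
    """
    return start + parse_h_rt(h_rt) <= outage_start


# Hypothetical times, for illustration only (not a real outage).
outage_start = datetime(2013, 3, 1, 8, 0)
now = datetime(2013, 2, 28, 20, 0)

print(fits_before_outage(now, "10:00:00", outage_start))  # would end at 06:00
print(fits_before_outage(now, "24:00:00", outage_start))  # would run past 08:00
```

In practice this means a job that requests a shorter h_rt can still start shortly before an outage, while a job that requests a long h_rt waits in the queue until the outage is over.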
Brasdor
- Brasdor is not responding to any login attempts. We are investigating.
- 13:35, February 22, 2013 (AST)
- Brasdor is back up after the AC outage. The delay was due to a faulty motor in one of the roof compressors that kept tripping the power breaker to the room.
- 13:48, February 21, 2013 (AST)
- The temperature in the Brasdor machine room is still high. It is not yet known when the AC will be properly repaired. We will post updates as we learn more.
- 13:44, February 18, 2013 (AST)
- One AC is working again, and the other needs more work. Brasdor should be back up later today. All nodes had to be turned off, so jobs will have been lost.
- 12:35, February 15, 2013 (AST)
- The AC has failed in the Brasdor machine room, possibly due to a power failure. FacMan is looking into the issue. More information to come.
- 15:59, February 14, 2013 (AST)
Mahone
- Queues enabled. NQS is available.
- 17:09, January 15, 2013 (AST)
- All queues have been temporarily disabled to prevent jobs from using the NQS filesystem, which will be temporarily unmounted for a filesystem check to resolve the utilization problem.
- 12:04, January 15, 2013 (AST)
Placentia
- There was a power surge on Feb 10 around 9am that brought down some compute nodes.
- 13:07, February 11, 2013 (AST)
- Back online after the power outage due to a snow storm.
- 14:46, January 14, 2013 (AST)
- Placentia is inaccessible. We are investigating whether it is related to the snow storm in Newfoundland. MUN is closed.
- 08:50, January 11, 2013 (AST)
Fundy
- Head node rebooted due to abuse of the head node policy.
- 08:33, February 8, 2013 (AST)
- We are having problems with the NFS servers. Affected compute nodes have been disabled, but existing jobs might crash. Support calls have been opened with Oracle.
- 09:59, February 6, 2013 (AST)
Glooscap
- Network switches were replaced on nodes cl098-cl183 to fix cooling and airflow problems. The work was completed ahead of schedule and the entire cluster is back in service.
- 15:43, December 17, 2012 (AST)
- The Grid Engine queue master is running.
- 08:11, November 19, 2012 (AST)
Courtenay
- Courtenay is back online.
- 11:47, November 28, 2012 (AST)
- Courtenay is offline due to an NFS/network problem. We are sorting this out.
- 08:23, November 28, 2012 (AST)