Difference between revisions of "Cluster Status"

From ACENET

Revision as of 17:37, December 5, 2013

Click the name of a cluster in the table below to jump to the corresponding section of this page. The outage schedule section is the single place where all scheduled ACEnet outages are listed.

Cluster    Status   Planned Outage  Notes
Brasdor    Online   No outages
Mahone     Online   No outages
Placentia  Online   Dec 6-9         Outage rescheduled by MUN Facilities
Fundy      Online   No outages
Glooscap   Online   No outages
Courtenay  Offline  No outages      The head node has a hardware problem.
Legend:
  Online: cluster is up and running
  Offline: no users can log in or submit jobs
  Online: some users can log in and/or there are problems

Outage schedule

Grid Engine will not schedule any job with a run time (h_rt) that extends into the beginning of a planned outage period. This is so the job will not be terminated prematurely when the system goes down.

  • Placentia: Following work on an emergency generator at Memorial University, Facilities Management will be interrupting the power to the Placentia machine room again at the completion of the work. To minimize the anticipated job loss we have scheduled a queue outage beginning at noon on Friday December 6. Placentia will return to normal service on Monday December 9.
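
The scheduling rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Grid Engine's actual implementation: the function names are hypothetical, the "current time" is an assumed example value, times are treated as naive local times, and a job ending exactly at the outage start is assumed to be acceptable.

```python
from datetime import datetime, timedelta


def fits_before_outage(start, h_rt, outage_start):
    """Return True if a job starting at `start` with run-time limit `h_rt`
    would end no later than `outage_start`.

    Assumption: a job ending exactly at the outage start is allowed.
    """
    return start + h_rt <= outage_start


def longest_schedulable_h_rt(start, outage_start):
    """Longest h_rt that still ends by the outage start (never negative)."""
    return max(outage_start - start, timedelta(0))


# Placentia queue outage from the schedule: noon on Friday, December 6, 2013.
outage_start = datetime(2013, 12, 6, 12, 0)

# Hypothetical moment at which the scheduler considers the job.
now = datetime(2013, 12, 5, 17, 37)

print(fits_before_outage(now, timedelta(hours=48), outage_start))  # False
print(fits_before_outage(now, timedelta(hours=12), outage_start))  # True
print(longest_schedulable_h_rt(now, outage_start))                 # 18:23:00
```

In practice this means that a job waiting in the queue shortly before an outage will only start if its requested h_rt (for example, `qsub -l h_rt=12:00:00`) is short enough to end before the outage begins; longer jobs remain queued until the cluster returns to service.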

Brasdor

  • Brasdor is back online after some file system work. Jobs have been restarted, but please check to make sure they are working. If you experience any performance problems or find Brasdor unresponsive, please email support@ace-net.ca.
13:37, December 5, 2013 (AST)
  • We've been advised to upgrade to the latest patch cluster on our NFS servers. We'll need to stop currently running jobs while we do this. They will be restarted when the work has been completed. Sorry for any inconvenience this may cause.
10:37, November 30, 2013 (AST)
  • We'll be keeping Brasdor offline until at least Monday for work and testing. This isn't affecting most running jobs; if that changes, we'll update this space. We have temporarily disabled NQS deletion, so don't worry about these files getting cleaned. We'll be working to get this problem sorted out as soon as possible.
09:12, November 29, 2013 (AST)
  • We have opened a support ticket with Oracle to resolve the NFS server issue.
13:38, November 28, 2013 (AST)
  • We are having NFS problems again.
07:46, November 28, 2013 (AST)
  • Over the last few days we have been having an issue with the network file system (NFS) that affected the head node as well as some of the compute nodes. The cause of the problem has not been determined yet, but we have tried implementing various measures to prevent this from happening. There is no way for us to reproduce the problem, so we are releasing Brasdor into production once again to see if it has been fixed this time.
11:25, November 27, 2013 (AST)
  • It appears there is a recurring problem at Brasdor. We are trying to resolve it.
07:28, November 27, 2013 (AST)
  • Login issues are resolved. Some jobs may have been affected; please check your jobs.
10:39, November 26, 2013 (AST)
  • There are login issues. We are investigating.
18:41, November 25, 2013 (AST)
  • There was an NFS server software hang; it has been fixed. Check your jobs to make sure you were not affected.
07:33, November 25, 2013 (AST)

Mahone

  • Outage complete.
11:17, November 18, 2013 (AST)
  • The preventive maintenance formerly scheduled for September 9, 2013, has been withdrawn.
09:52, August 20, 2013 (ADT)

Placentia

  • Memorial University Facilities Management has rescheduled the next power interruption from Saturday November 30 to Saturday December 7, on very short notice. The associated queue outage has been pushed back a week.
13:52, November 29, 2013 (NST)
  • Placentia is back online after the outage.
14:54, November 25, 2013 (AST)

Fundy

  • Fundy is back online.
09:36, July 16, 2013 (ADT)
  • The storage problems have been solved. We are checking and testing the system, and we expect to release it early tomorrow morning.
16:20, July 15, 2013 (ADT)

Glooscap

  • Back online.
11:14, November 18, 2013 (AST)
  • On Friday afternoon a vendor technician accidentally caused a switch outage affecting nodes cl059-cl097. Jobs running on those nodes crashed or hung. The switch is back in service this morning and ACEnet staff are testing the affected nodes and returning them to service.
10:35, November 18, 2013 (AST)
  • There seems to be a problem with two NFS servers. Users are likely to see their jobs get stuck or fail.
18:47, November 15, 2013 (AST)

Courtenay

  • The head node has a hardware problem and no longer responds. We are configuring another machine to be the new head node.
08:54, April 22, 2013 (ADT)