Difference between revisions of "Cluster Status"

Revision as of 17:37, December 5, 2013

Click a cluster name in the table below to jump to the corresponding section of this page. The outage schedule section collects information about all scheduled ACEnet outages in one place.

Cluster	Status	Planned Outage	Notes
Brasdor	Online	No outages
Mahone	Online	No outages
Placentia	Online	Dec 6-9	Outage rescheduled by MUN Facilities
Fundy	Online	No outages
Glooscap	Online	No outages
Courtenay	Offline	No outages	The head node has a hardware problem.

Legend:

Online	cluster is up and running
Offline	no users can log in or submit jobs
Online	some users can log in and/or there are problems

Outage schedule

Grid Engine will not schedule any job with a run time (h_rt) that extends into the beginning of a planned outage period. This is so the job will not be terminated prematurely when the system goes down.

Placentia: Following work on an emergency generator at Memorial University, Facilities Management will be interrupting the power to the Placentia machine room again at the completion of the work. To minimize the anticipated job loss we have scheduled a queue outage beginning at noon on Friday December 6. Placentia will return to normal service on Monday December 9.
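
As a rough illustration of the h_rt rule above, the following sketch checks whether a requested run time would end before an outage begins, and computes the longest run time that still fits. It is not Grid Engine's own logic: the function names are hypothetical, the job start time is made up for the example, times are treated as naive local times, and a job that ends exactly at the outage start is assumed to fit.

```python
from datetime import datetime, timedelta


def fits_before_outage(start: datetime, h_rt_seconds: int, outage_start: datetime) -> bool:
    """True if a job starting at `start` with the given h_rt ends by `outage_start`."""
    return start + timedelta(seconds=h_rt_seconds) <= outage_start


def longest_h_rt(start: datetime, outage_start: datetime) -> str:
    """Longest h_rt, formatted H:MM:SS, that still ends by `outage_start`."""
    seconds = max(int((outage_start - start).total_seconds()), 0)
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Placentia queue outage from this page: noon on Friday December 6, 2013.
outage = datetime(2013, 12, 6, 12, 0)
# Hypothetical job start time, chosen only for illustration.
start = datetime(2013, 12, 5, 18, 0)

print(longest_h_rt(start, outage))                   # 18:00:00
print(fits_before_outage(start, 24 * 3600, outage))  # False
```

In this example a job requesting 24 hours would not be scheduled before the outage, while one requesting 18 hours or less could be, for instance with `qsub -l h_rt=18:00:00 job.sh` (where `job.sh` is a placeholder script name).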

Brasdor

Brasdor is back online after some file system work. Jobs have been restarted, but please check to make sure they are working. If you experience any performance problems or find Brasdor unresponsive, please email support@ace-net.ca

13:37, December 5, 2013 (AST)

We've been advised to upgrade to the latest patch cluster on our NFS servers. We'll need to stop currently running jobs while we do this. They will be restarted when the work has been completed. Sorry for any inconvenience this may cause.

10:37, November 30, 2013 (AST)

We'll be keeping Brasdor offline until at least Monday for work and testing. This isn't affecting most running jobs; if that changes, we'll update this space. We have temporarily disabled NQS deletion, so don't worry about these files getting cleaned. We'll be working to get this problem sorted out as soon as possible.

09:12, November 29, 2013 (AST)

We have opened a support ticket with Oracle to resolve the NFS server issue.

13:38, November 28, 2013 (AST)

We are having NFS problems again.

07:46, November 28, 2013 (AST)

Over the last few days we have been having an issue with the network file system (NFS) that affected the head node as well as some of the compute nodes. The cause of the problem has not been determined yet, but we have tried implementing various measures to prevent this from happening. There is no way for us to reproduce the problem, so we are releasing Brasdor into production once again to see if it has been fixed this time.

11:25, November 27, 2013 (AST)

It appears there is a recurring problem at Brasdor. We are trying to resolve it.

07:28, November 27, 2013 (AST)

Login issues are resolved. Some jobs may have been affected, so please check your jobs.

10:39, November 26, 2013 (AST)

There are login issues. We are investigating.

18:41, November 25, 2013 (AST)

There was an NFS server software hang; it has been fixed. Check your jobs to make sure you were not affected.

07:33, November 25, 2013 (AST)

Mahone

Outage complete.

11:17, November 18, 2013 (AST)

The preventive maintenance formerly scheduled for September 9, 2013, has been withdrawn.

9:52, August 20, 2013 (ADT)

Placentia

Memorial University Facilities Management has rescheduled the next power interruption from Saturday November 30 to Saturday December 7, on very short notice. The associated queue outage has been pushed back a week.

13:52, November 29, 2013 (NST)

Placentia is back online after the outage.

14:54, November 25, 2013 (AST)

Fundy

Fundy is back online.

09:36, July 16, 2013 (ADT)

The storage problems have been solved. We are now checking and testing the system, and we expect to release it early tomorrow morning.

16:20, July 15, 2013 (ADT)

Glooscap

Back online.

11:14, November 18, 2013 (AST)

On Friday afternoon a vendor technician accidentally caused a switch outage affecting nodes cl059-cl097. Jobs running on those nodes crashed or hung. The switch is back in service this morning and ACEnet staff are testing the affected nodes and returning them to service.

10:35, November 18, 2013 (AST)

There seems to be a problem with two NFS servers. Users are likely to see their jobs get stuck or fail.

18:47, November 15, 2013 (AST)

Courtenay

The head node has a hardware problem and does not respond any more. We are configuring another box to be the new head.

08:54, April 22, 2013 (ADT)

@@ Line 9: / Line 9: @@
 ! scope="col" align=left width="250px" | Notes
 |- valign=top bgcolor="#f5faff"
-| [[Cluster Status#Brasdor | Brasdor]] || style="color:red" | '''Offline''' || [[Cluster Status#Outage schedule | No outages ]] || NFS issue
+| [[Cluster Status#Brasdor | Brasdor]] || style="color:green" | '''Online''' || [[Cluster Status#Outage schedule | No outages ]] ||
 |- valign=top bgcolor="#f5faff"
 | [[Cluster Status#Mahone | Mahone]] || style="color:green" | '''Online''' || [[Cluster Status#Outage schedule | No outages]] ||
@@ Line 37: / Line 37: @@
 == Brasdor ==
+* Brasdor is back online after some file system work.  Jobs have been restarted, but please check to make sure they are working.  If you experience any performance troubles, or find Brasdor unresponsive please email support@ace-net.ca
+: 13:37, December 5, 2013 (AST)
 * We've been advised to upgrade to the latest patch cluster on our nfs servers.  We'll need to stop currently running jobs while we do this.  They will be restarted when the work has been complete.  Sorry to any inconvenience this may cause.
 : 10:37, November 30, 2013 (AST)
