Cluster Status
Revision as of 19:12, September 27, 2014
Clusters
Please click on a cluster name in the table below to jump to the corresponding section of this page. The outage schedule section gathers information about all scheduled ACEnet outages in one place.
Cluster | Status | Planned Outage | Notes
---|---|---|---
Brasdor | Offline | No outages | Extensive damage
Mahone | Online | No outages |
Placentia | Online | No outages |
Fundy | Offline | No outages | Cannot currently log in
Glooscap | Online | No outages |
Services
Service | Status | Planned Outage | Notes
---|---|---|---
WebMO | Online | No outages |
Account creation | Online | No outages |
PGI and Intel licenses | Online | No outages |
- Legend:
Online | cluster is up and running
Offline | no users can log in or submit jobs, or the service is not working
Online (degraded) | some users can log in and/or there are problems
Outage schedule
Grid Engine will not schedule any job with a run time (h_rt) that extends into the beginning of a planned outage period, so that the job will not be terminated prematurely when the system goes down.
- No outages currently planned
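The scheduling rule above can be sketched as a simple time comparison. This is an illustrative sketch only, not ACEnet's scheduler code; the function name and the example dates are assumptions:

```python
from datetime import datetime, timedelta

def fits_before_outage(submit_time, h_rt_hours, outage_start):
    """Return True if a job with the given run-time limit (h_rt, in hours)
    would finish at or before the start of a planned outage.
    Grid Engine applies the same kind of check before dispatching a job."""
    return submit_time + timedelta(hours=h_rt_hours) <= outage_start

# A 48-hour job submitted two days before a hypothetical outage window:
submit = datetime(2014, 9, 1, 9, 0)
outage = datetime(2014, 9, 3, 9, 0)
print(fits_before_outage(submit, 48, outage))  # True: ends exactly at the boundary
print(fits_before_outage(submit, 49, outage))  # False: would be cut short
```

In practice this means that shortly before an outage only jobs with small h_rt requests will start, so requesting a realistic run time improves your chances of being scheduled.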
Brasdor
A central concern of our recovery work has been the possibility of restoring user data. Data written to /home or /globalscratch on or before February 15, 2014 may have a surviving copy on tape. We have been able to restore such data using Mahone's tape library. Due to disk space limitations, restoration must be done user by user. Any user requiring recovery of Brasdor data should contact support, specifying which file system to recover (/home and/or /globalscratch) and using the subject line "File recovery at Brasdor - your_username". Please note that /nqs cannot be recovered.
Mahone
- The Mahone head node has been cleansed and returned to service following yesterday's security incident. We have no evidence that any user accounts have been compromised, but this would be an excellent time to CHANGE YOUR PASSWORD, and ensure it is strong [1].
- 15:26, September 4, 2014 (ADT)
- There is evidence of a security breach this morning at Placentia and Mahone. Access to these systems has been blocked until the situation is resolved. We expect return to service by Friday morning, Sept 5.
- 15:30, September 3, 2014 (ADT)
Placentia
- Tape library has been repaired and is functioning normally again.
- 14:44, September 12, 2014 (NDT)
- The tape library which forms part of the /home and /globalscratch filesystems at Placentia is down and awaiting a replacement part, expected Monday. During this time, files which have not been accessed recently might not be available. Please minimize activities which touch old files as such operations will hang. Unfortunately, there is no way for the user to determine whether a given file resides only on tape and will therefore hang on access.
- 16:20, September 11, 2014 (NDT)
- The Placentia head node has been cleansed and returned to service following Wednesday's security incident. We have no evidence that any user accounts have been compromised, but this would be an excellent time to CHANGE YOUR PASSWORD, and ensure it is strong [2].
- 13:50, September 5, 2014 (NDT)
- There is evidence of a security breach this morning at Placentia and Mahone. Access to these systems has been blocked until the situation is resolved. We expect return to service by Friday morning, Sept 5.
- 16:00, September 3, 2014 (NDT)
- On August 27, some Grid Engine jobs were found to be in the running ("r") state but not actually executing. This may have been a side-effect of the storage problems of August 19-21. To restore consistency, some jobs were forcibly rescheduled (state "Rq" or "Rr") and others deleted. Please check that your jobs are making progress, and resubmit work as necessary. We regret any inconvenience.
- 09:40, August 28, 2014 (NDT)
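One way to spot the rescheduled jobs is to filter qstat output for the "Rq" and "Rr" state codes. A minimal sketch over sample qstat-style text; the exact column layout of your qstat output is an assumption, so adjust the index if yours differs:

```python
# Sample output in the usual qstat layout (job-ID, priority, name, user, state, ...).
# The job IDs and names here are hypothetical.
sample = """\
job-ID prior name user state submit/start
101 0.5 sim1 alice r 08/27/2014
102 0.5 sim2 alice Rq 08/27/2014
103 0.5 sim3 alice Rr 08/27/2014
"""

# Collect the IDs of jobs in a rescheduled state (column 5 is the state).
rescheduled = [line.split()[0] for line in sample.splitlines()[1:]
               if line.split()[4] in ("Rq", "Rr")]
print(rescheduled)  # ['102', '103']
```

Rescheduled jobs restart from the beginning, so check their output for duplicated work as well.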
- /globalscratch has been restored.
- 14:30, August 21, 2014 (NDT)
- One of the storage arrays failed overnight. This has affected /globalscratch. Logins have been disabled.
- 09:00, August 19, 2014 (NDT)
Fundy
- Users cannot log in, either from off-campus or on-campus.
- 16:00, September 27, 2014 (ADT)
- Fundy is back up on-line.
- 12:00, August 25, 2014 (ADT)
- The power failure problem on the NFS server has been solved, but we have detected a defective hard drive. We will replace the drive Monday morning and expect to bring Fundy back into service early Monday afternoon.
- 15:39, August 22, 2014 (ADT)
- We have been instructed to replace more hardware on the NFS server. The replacement will happen tomorrow morning.
- 13:51, August 21, 2014 (ADT)
- We are sorry that the field engineer is still working on problems caused by the motherboard replacement.
- 16:31, August 20, 2014 (ADT)
- The motherboard has been replaced, but we are still working on some remaining problems.
- 16:01, August 19, 2014 (ADT)
- We expect to get the replacement motherboard tomorrow afternoon.
- 13:32, August 18, 2014 (ADT)
- The motherboard of the NFS server has to be replaced.
- 14:43, August 15, 2014 (ADT)
- Fundy's main NFS server does not power up. We are working with Oracle on this right now.
- 11:13, August 15, 2014 (ADT)
Glooscap
- A replacement switch equivalent to the original has been installed and Glooscap has been returned to service. Most running jobs were lost during repairs. Please check that any remaining jobs you have in the system are progressing properly.
- 10:03, July 25, 2014 (ADT)
- A spare switch of lower capacity has been swapped in for the failed network switch. Users can log in and access their data, but many compute nodes are inaccessible and so queue capacity will be limited until we can obtain a better replacement. Jobs listed in qstat as "running" may in fact be hung. Users should check for output dated later than 09:30 July 22, and if there is none, consider submitting replacement jobs to other ACEnet clusters.
- 13:44, July 22, 2014 (ADT)
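The output check described above can be done with find. A sketch, assuming Grid Engine's default job output naming (job_name.oJOBID) and files kept near the top of your home directory; -newermt requires GNU find:

```shell
# List job output files modified after the switch failure on
# 2014-07-22 09:30. If a "running" job has produced no output since
# then, it is likely hung and should be resubmitted elsewhere.
find "$HOME" -maxdepth 2 -name '*.o*' -newermt '2014-07-22 09:30'
```

If your jobs write to a different directory or use custom -o paths, point find there instead.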
- A network switch has failed, making the head node unusable. We cannot yet estimate time of return-to-service.
- 09:47, July 22, 2014 (ADT)