Cluster Status
Revision as of 19:12, September 27, 2014
Clusters
Please click on a cluster name in the table below to jump to the corresponding section of this page. The outage schedule section gathers information about all scheduled ACEnet outages in one place.
Cluster | Status | Planned Outage | Notes
---|---|---|---
Brasdor | Offline | No outages | Extensive damage
Mahone | Online | No outages |
Placentia | Online | No outages |
Fundy | Offline | No outages | Cannot currently log in
Glooscap | Online | No outages |
Services
Service | Status | Planned Outage | Notes
---|---|---|---
WebMO | Online | No outages |
Account creation | Online | No outages |
PGI and Intel licenses | Online | No outages |
- Legend:
Online | cluster is up and running
Offline | no users can log in or submit jobs, or the service is not working
Online (degraded) | some users can log in and/or there are problems
Outage schedule
Grid Engine will not schedule any job with a run time (h_rt) that extends into the beginning of a planned outage period, so that the job will not be terminated prematurely when the system goes down.
- No outages currently planned
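The scheduling rule above can be sketched as a simple time comparison. This is an illustrative sketch only, not ACEnet's scheduler code; the function name and the example dates are assumptions:

```python
from datetime import datetime, timedelta

def fits_before_outage(submit_time, h_rt_hours, outage_start):
    """Return True if a job with the given run-time limit (h_rt, in hours)
    would finish at or before the start of a planned outage.
    Grid Engine applies the same kind of check before dispatching a job."""
    return submit_time + timedelta(hours=h_rt_hours) <= outage_start

# A 48-hour job submitted two days before a hypothetical outage window:
submit = datetime(2014, 9, 1, 9, 0)
outage = datetime(2014, 9, 3, 9, 0)
print(fits_before_outage(submit, 48, outage))  # True: ends exactly at the boundary
print(fits_before_outage(submit, 49, outage))  # False: would be cut short
```

In practice this means that shortly before an outage only jobs with small h_rt requests will start, so requesting a realistic run time improves your chances of being scheduled.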
Brasdor
A central concern of our recovery work has been the possibility of restoring user data. Data written to /home or /globalscratch on or before February 15, 2014 may have a surviving copy on tape. We have been able to restore such data using Mahone's tape library. Due to disk space limitations, restoration must be done user by user. Any user requiring recovery of Brasdor data should contact support, specifying which file system to recover (/home and/or /globalscratch) and using the subject line "File recovery at Brasdor - your_username". Please note that /nqs cannot be recovered.
Mahone
- The Mahone head node has been cleansed and returned to service following yesterday's security incident. We have no evidence that any user accounts have been compromised, but this would be an excellent time to CHANGE YOUR PASSWORD, and ensure it is strong [1].
- 15:26, September 4, 2014 (ADT)
- There is evidence of a security breach this morning at Placentia and Mahone. Access to these systems has been blocked until the situation is resolved. We expect return to service by Friday morning, Sept 5.
- 15:30, September 3, 2014 (ADT)
Placentia
- Tape library has been repaired and is functioning normally again.
- 14:44, September 12, 2014 (NDT)
- The tape library which forms part of the /home and /globalscratch filesystems at Placentia is down and awaiting a replacement part, expected Monday. During this time, files which have not been accessed recently might not be available. Please minimize activities which touch old files as such operations will hang. Unfortunately, there is no way for the user to determine whether a given file resides only on tape and will therefore hang on access.
- 16:20, September 11, 2014 (NDT)
- The Placentia head node has been cleansed and returned to service following Wednesday's security incident. We have no evidence that any user accounts have been compromised, but this would be an excellent time to CHANGE YOUR PASSWORD, and ensure it is strong [2].
- 13:50, September 5, 2014 (NDT)
- There is evidence of a security breach this morning at Placentia and Mahone. Access to these systems has been blocked until the situation is resolved. We expect return to service by Friday morning, Sept 5.
- 16:00, September 3, 2014 (NDT)
- On August 27, some Grid Engine jobs were found to be in the running ("r") state but not actually executing. This may have been a side-effect of the storage problems of August 19-21. To restore consistency, some jobs were forcibly rescheduled (state "Rq" or "Rr") and others deleted. Please check that your jobs are making progress, and resubmit work as necessary. We regret any inconvenience.
- 09:40, August 28, 2014 (NDT)
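One way to spot the rescheduled jobs is to filter qstat output for the "Rq" and "Rr" state codes. A minimal sketch over sample qstat-style text; the exact column layout of your qstat output is an assumption, so adjust the index if yours differs:

```python
# Sample output in the usual qstat layout (job-ID, priority, name, user, state, ...).
# The job IDs and names here are hypothetical.
sample = """\
job-ID prior name user state submit/start
101 0.5 sim1 alice r 08/27/2014
102 0.5 sim2 alice Rq 08/27/2014
103 0.5 sim3 alice Rr 08/27/2014
"""

# Collect the IDs of jobs in a rescheduled state (column 5 is the state).
rescheduled = [line.split()[0] for line in sample.splitlines()[1:]
               if line.split()[4] in ("Rq", "Rr")]
print(rescheduled)  # ['102', '103']
```

Rescheduled jobs restart from the beginning, so check their output for duplicated work as well.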
- /globalscratch has been restored.
- 14:30, August 21, 2014 (NDT)
- One of the storage arrays failed overnight. This has affected /globalscratch. Logins have been disabled.
- 09:00, August 19, 2014 (NDT)
Fundy
- Users cannot log in, either from off-campus or on-campus.
- 16:00, September 27, 2014 (ADT)
- Fundy is back up on-line.
- 12:00, August 25, 2014 (ADT)
- The power failure problem on the NFS server has been solved, but we have detected a defective hard drive. We will replace the drive Monday morning and expect to bring Fundy back into service early Monday afternoon.
- 15:39, August 22, 2014 (ADT)
- We have been instructed to replace more hardware on the NFS server. The replacement will happen tomorrow morning.
- 13:51, August 21, 2014 (ADT)
- We are sorry that the field engineer is still working on problems caused by the motherboard replacement.
- 16:31, August 20, 2014 (ADT)
- The motherboard has been replaced, but we are still working on some remaining problems.
- 16:01, August 19, 2014 (ADT)
- We expect to get the replacement motherboard tomorrow afternoon.
- 13:32, August 18, 2014 (ADT)
- The motherboard of the NFS server has to be replaced.
- 14:43, August 15, 2014 (ADT)
- Fundy's main NFS server does not power up. We are working with Oracle on this right now.
- 11:13, August 15, 2014 (ADT)
Glooscap
- A replacement switch equivalent to the original has been installed and Glooscap has been returned to service. Most running jobs were lost during repairs. Please check that any remaining jobs you have in the system are progressing properly.
- 10:03, July 25, 2014 (ADT)
- A spare switch of lower capacity has been swapped in for the failed network switch. Users can log in and access their data, but many compute nodes are inaccessible and so queue capacity will be limited until we can obtain a better replacement. Jobs listed in qstat as "running" may in fact be hung. Users should check for output dated later than 09:30 July 22, and if there is none, consider submitting replacement jobs to other ACEnet clusters.
- 13:44, July 22, 2014 (ADT)
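The output check described above can be done with find. A sketch, assuming Grid Engine's default job output naming (job_name.oJOBID) and files kept near the top of your home directory; -newermt requires GNU find:

```shell
# List job output files modified after the switch failure on
# 2014-07-22 09:30. If a "running" job has produced no output since
# then, it is likely hung and should be resubmitted elsewhere.
find "$HOME" -maxdepth 2 -name '*.o*' -newermt '2014-07-22 09:30'
```

If your jobs write to a different directory or use custom -o paths, point find there instead.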
- A network switch has failed, making the head node unusable. We cannot yet estimate time of return-to-service.
- 09:47, July 22, 2014 (ADT)