This page is maintained manually and is updated as soon as we learn new information.
Clusters
Click the name of a cluster in the table below to jump to the corresponding section of this page. The Outage schedule section collects information about all scheduled outages in one place.
Services
- Legend:
  - Online: cluster is up and running
  - Offline: no users can log in or submit jobs, or a service is not working
  - Degraded: some users cannot log in and/or there are problems affecting your work
Outage schedule
Grid Engine will not schedule any job with a run time (h_rt) that extends into the beginning of a planned outage period. This is so the job will not be terminated prematurely when the system goes down.
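For example, if a planned outage begins 48 hours from now, a job requesting a 72-hour run time will wait in the queue until after the outage, while a 24-hour request can still start. The script name and h_rt values below are illustrative:

    # Held until after the outage if one begins within the next 72 hours:
    qsub -l h_rt=72:00:00 myjob.sh

    # A shorter run time that fits before the outage window can start right away:
    qsub -l h_rt=24:00:00 myjob.sh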
- No outages currently scheduled. All clusters and services online.
Mahone
- The default storage quota for users on Mahone has been reduced to 150 gigabytes. (See the note at the end of this section for checking your usage.)
- 14:34, March 8, 2016 (AST)
- Back online with the new storage.
- 09:39, January 25, 2016 (AST)
- Mahone is undergoing a major upgrade involving new storage hardware and software. In addition, the operating system on all nodes will be upgraded to Red Hat Enterprise Linux 6 (RHEL6).
- 10:59, January 11, 2016 (AST)
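Since Mahone's new filesystem is Lustre (see Storage System), usage against the 150 GB quota should be reportable with the standard Lustre client tool. The mount point below is an assumption:

    # Report current usage and quota limits for your account (path is illustrative):
    lfs quota -u $USER /home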
Placentia
- The outage is complete. Major changes are described at Storage System#Changes. WebMO will remain offline until next week, and Q-Chem licenses will be unavailable pending renewal.
- 10:03, April 8, 2016 (ADT)
- Placentia is undergoing a major upgrade involving new storage hardware and software. In addition, the operating system on all nodes will be upgraded to Red Hat Enterprise Linux 6 (RHEL6).
- 08:00, March 21, 2016 (ADT)
Fundy
- Fundy is once again in production. Grid Engine jobs that were enqueued when we went offline for maintenance were deleted; we regret the inconvenience. /globalscratch has been merged with /home, as was done at Mahone, and the filesystem is now Lustre in place of SAM-QFS. Read more about it at Storage System. Password changes must temporarily be made at Placentia.
- 13:50, March 9, 2016 (AST)
- Return to service has been delayed one more day to Wednesday, March 9, while we finish staging old files from tape onto the new disk array.
- 12:10, March 8, 2016 (AST)
Glooscap
- NFS server nfs5 failed over the weekend. Oracle kernel engineers have been engaged to troubleshoot the ongoing issues. Nodes cl098-cl183 are unavailable until further notice.
- 08:14, May 2, 2016 (ADT)
- NFS server nfs5 continues to fail sporadically. Vendor support is engaged and seeking a solution. Users can expect jobs occasionally to go into an error state ("Eqw"), and the head node may become unresponsive during failure events, but the effect on running jobs appears to be minimal. ACENET staff will clear error states on jobs affected by nfs5 failures; users can also do this for themselves with "qmod -cj <jobid>" (see the example at the end of this section).
- 13:42, April 22, 2016 (ADT)
- The head node has become unresponsive for some users. Diagnosis and resolution are expected tomorrow (Tuesday) morning.
- 16:57, April 18, 2016 (ADT)
- The NFS server and associated compute hosts have been returned to service at the vendor's suggestion. Please check any jobs you have running to ensure that they are writing output as expected and not stalled.
- 11:28, April 15, 2016 (ADT)
- The failure logged earlier (see Feb 26) is recurring on the NFS server assigned to hosts cl098-cl183. The vendor has been contacted for detailed troubleshooting. Hosts cl098-cl183 will not accept new jobs.
- 10:36, April 15, 2016 (ADT)
- The outage is complete, and the filesystem software has been upgraded to try to ameliorate the recurring NFS issues logged earlier. The Grid Engine master has been migrated to new hardware. Unfortunately this required the deletion of all waiting jobs, so users must resubmit work that had not been scheduled before the outage began Monday.
- 12:01, April 13, 2016 (ADT)
- NFS issues have been resolved. The problem appears to have been a temporary I/O lock on cl098 and higher systems; once it was cleared, the mounts started working properly again. There is no indication of lost jobs or data.
- 16:37, February 26, 2016 (AST)
- There appears to be another NFS problem, with compute nodes cl098 and higher failing to mount filesystems properly. Sysadmins are looking into it.
- 16:21, February 26, 2016 (AST)
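For reference, a minimal sequence for spotting and clearing these error states with standard Grid Engine commands (the job ID below is illustrative):

    # Jobs stuck in an error state show "Eqw" in the state column:
    qstat -u $USER

    # Clear the error state so the job becomes eligible to be scheduled again:
    qmod -cj 123456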