Cluster Status

Notice: This page is maintained manually. It is updated as soon as we learn new information.

Clusters

Please click on the name of a cluster in the table below to jump to the corresponding section of this page. The Outage schedule section collects information about all scheduled outages in one place.

Cluster     Status    Planned Outage                Notes
Mahone      Online    No outages
Placentia   Online    No outages
Fundy       Offline   No outages                    Campus-wide power outage
Glooscap    Online    Compute nodes July 16 & 22

Services

Service                           Status    Planned Outage    Notes
WebMO                             Online    No outages
Account creation                  Online    No outages
PGI and Intel licenses            Online    No outages
Videoconferencing (IOCOM Server)  Online    No outages
Legend:
  • Online: the cluster is up and running
  • Offline: no users can log in or submit jobs, or the service is not working
  • Online: some users can log in and/or there are problems affecting your work

Outage schedule

Grid Engine will not schedule any job with a run time (h_rt) that extends into the beginning of a planned outage period. This is so the job will not be terminated prematurely when the system goes down.
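
For illustration, here is a minimal sketch of a job submission with an explicit run-time request, so the scheduler can confirm the job will finish before an outage begins (the 12-hour limit and the script name myjob.sh are placeholders):

  # Request a hard run-time limit of 12 hours (h_rt is given as HH:MM:SS).
  # Grid Engine will only start the job if those 12 hours end before the
  # start of any planned outage period.
  qsub -l h_rt=12:00:00 myjob.sh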

  • Electrical power work at the Killam Data Centre will require that Glooscap compute nodes be powered down on Saturday July 16 and then again on Friday July 22. Nodes will be drained at 07:00 each morning. We expect the outage on July 16 to last a few hours. On July 22 we will take the opportunity to apply more vendor-recommended changes with regard to the ongoing nfs5 issue (see below), which will take longer.

Mahone

  • The default storage quota for users at Mahone has been reduced to 150 gigabytes.
14:34, March 8, 2016 (AST)
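
As a rough illustration (a standard Linux command, not specific to Mahone), you can check how much space your home directory currently uses against that quota:

  # Report the total disk usage of your home directory in human-readable units.
  du -sh $HOME
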
  • Back online with the new storage.
09:39, January 25, 2016 (AST)
  • Mahone is undergoing a major upgrade involving new storage hardware and software. In addition, the operating system on all nodes will be upgraded to Red Hat Enterprise Linux 6 (RHEL6).
10:59, January 11, 2016 (AST)

Placentia

  • The outage is complete. Major changes are described at Storage System#Changes. WebMO will temporarily remain offline until next week, and Q-Chem licenses will be unavailable pending renewal.
10:03, April 8, 2016 (ADT)
  • Placentia is undergoing a major upgrade involving new storage hardware and software. In addition, the operating system on all nodes will be upgraded to Red Hat Enterprise Linux 6 (RHEL6).
08:00, March 21, 2016 (ADT)

Fundy

  • We will try to bring Fundy back late this morning.
08:43, June 27, 2016 (ADT)
  • There was a power outage on Saturday on the south side of Fredericton; UNB lost power campus-wide.
08:17, June 27, 2016 (ADT)
  • The cluster is unreachable. We are investigating.
07:31, June 27, 2016 (ADT)

Glooscap

  • Vendor-supplied software updates applied last week have not resolved the nfs5 lock-up issues. Conversation with vendor support continues. Another whole-system outage is required to apply the next vendor-requested patch; see the Outage schedule above. Technical staff are actively adjusting the number of available cores to balance job throughput against the risk of failure, which appears to be load-driven.
11:17, May 16, 2016 (ADT)
  • The head node is failing to mount home directories. The fault will be addressed in the morning.
22:55, May 3, 2016 (ADT)
  • The NFS server nfs5 failed over the weekend. Oracle kernel engineers have been engaged to troubleshoot the ongoing issues with nfs5. Nodes cl098-cl183 are unavailable until further notice.
08:14, May 2, 2016 (ADT)
  • NFS server nfs5 continues to fail sporadically. Vendor support is engaged and seeking a solution. Users can expect jobs occasionally to go into error state ("Eqw"), and the head node may become unresponsive during failure events, but the effect on running jobs seems to be minimal. ACENET staff will clear error states on jobs connected with nfs5 failures; users can also do this for themselves with "qmod -cj <jobid>".
13:42, April 22, 2016 (ADT)
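
For reference, a minimal sketch of spotting and clearing an error-state job (the job ID 123456 is a placeholder):

  # List your own jobs; those affected by an nfs5 failure show the "Eqw" state.
  qstat -u $USER
  # Clear the error state so the job can be rescheduled.
  qmod -cj 123456
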
  • The head node has become unresponsive to some users. Diagnosis and resolution should be expected tomorrow (Tuesday) morning.
16:57, April 18, 2016 (ADT)
  • The NFS server and associated compute hosts have been returned to service at the vendor's suggestion. Please check any jobs you have running to ensure that they are writing output as expected and not stalled.
11:28, April 15, 2016 (ADT)
  • The failure logged earlier (see Feb 26) is recurring on the NFS server assigned to hosts cl098-cl183. The vendor has been contacted for detailed troubleshooting. Hosts cl098-cl183 will not accept new jobs.
10:36, April 15, 2016 (ADT)
  • The outage is complete and the file system software has been upgraded to try to ameliorate the recurring NFS issues logged earlier. The Grid Engine master has been migrated to new hardware. Unfortunately, this required the deletion of all waiting jobs, so users must resubmit any work that was not scheduled before the outage began on Monday.
12:01, April 13, 2016 (ADT)
  • The NFS issues are resolved. They appear to have been the result of a temporary I/O lock on cl098 and higher systems; once it was cleared, the mounts started working properly again. There is no indication of lost jobs or data.
16:37, February 26, 2016 (AST)
  • There appears to be another NFS problem, with compute nodes cl098 and higher failing to mount filesystems properly. Sysadmins are looking into it.
16:21, February 26, 2016 (AST)