Cluster Status

This page is maintained manually. It is updated as soon as we learn new information.

Clusters

Click on the name of a cluster in the table below to jump to the corresponding section of this page. The Outage schedule section collects information about all scheduled outages in one place.

Cluster   | Status | Planned Outage Notes
Mahone    | Online | No outages
Placentia | Online | No outages
Fundy     | Online | No outages
Glooscap  | Online | UPS maintenance Aug 16 & 23

Services

Service                          | Status | Planned Outage Notes
WebMO                            | Online | Date to come; we are experiencing problems submitting WebMO jobs
Account creation                 | Online | No outages
PGI and Intel licenses           | Online | No outages
Videoconferencing (IOCOM Server) | Online | No outages
Legend:
  • Online – the cluster or service is up and running
  • Offline – no users can log in or submit jobs, or the service is not working
  • Online – only some users can log in, and/or there are problems affecting your work

Outage schedule

Grid Engine will not schedule any job whose requested run time (h_rt) extends into the beginning of a planned outage period, so that the job is not terminated prematurely when the system goes down; see the sketch after the list below.

  • Glooscap will be offline on Wednesday, August 16 and again on Wednesday, August 23 to permit maintenance on UPS units. Note that this will block the scheduling of long-running jobs until after August 23.
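
For illustration, here is how the requested run time interacts with an outage window. This is a minimal sketch; the script name and run times are hypothetical:

  # Requesting a long run time: if the outage begins before the requested
  # time could elapse, Grid Engine holds the job in the queue (state "qw")
  # until after the outage rather than start it and terminate it mid-run.
  qsub -l h_rt=168:00:00 myjob.sh

  # A shorter request that can finish before the outage begins is
  # scheduled normally.
  qsub -l h_rt=04:00:00 myjob.sh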

Mahone

  • The cluster may be unreachable due to an upstream provider networking issue.
08:11, December 8, 2016 (AST)

Placentia

  • Placentia is back up after a planned power outage on the Memorial University campus.
Jobs have been restarted and the vast majority of them are running fine; however, a few jobs have failed in the process. Please check your jobs to make sure none of them belong to the latter group (see the sketch at the end of this section).
14:07, June 3, 2017 (ADT)
  • The file system check is complete and all clear. Tests this morning showed evidence that the poor responsiveness and excessive filesystem load were due to one popular chemistry application running in large numbers. We are revising the advice to users on our wiki regarding this application, and will be consulting with certain research groups about modifying their workflow for everyone's benefit.
13:23, March 23, 2017 (ADT)
  • We are running a consistency check on the main file system (/home). This is expected to run at least overnight. New information will be posted tomorrow morning, Thursday March 23.
15:43, March 22, 2017 (ADT)
  • Commands on the Placentia head node (clhead) are responding slowly, and load on the Lustre Object Storage Server (OSS) has been very high since approximately mid-day Saturday, March 18.
13:25, March 21, 2017 (ADT)
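
To check on your jobs after a restart such as the one above, Grid Engine's qstat and qacct commands can be used. This is a minimal sketch; the job ID is hypothetical:

  # List your own pending and running jobs
  qstat -u $USER

  # Query the accounting record for a finished job; non-zero values in
  # the "failed" or "exit_status" fields mean it did not complete cleanly
  qacct -j 123456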

Fundy

  • The Fundy head node is unresponsive. Technical staff are investigating the cause.
12:19, April 18, 2017 (ADT)
  • Fundy is back now.
10:59, August 8, 2016 (ADT)

Glooscap

  • The network problem has been resolved. Glooscap is reachable again.
16:36, April 24, 2017 (ADT)
  • Glooscap is inaccessible due to a network problem at the host university. We expect most jobs will continue running uninterrupted while we diagnose the problem.
13:06, April 24, 2017 (ADT)
  • The metadata server was hung all night March 7-8. It was rebooted this morning and Glooscap is operating once again, although technical staff continue to be cautious about its future behaviour. To try to alleviate the load on the metadata server we are withdrawing compute nodes cl002 through cl058 from service. This represents a reduction of 188 cores in the capacity of the cluster.
11:24, March 8, 2017 (AST)
  • The cluster is unresponsive.
17:05, March 7, 2017 (AST)
  • A file system consistency check (fsck) has been completed, jobs have been restarted and logins are once again enabled. We will be monitoring to see if the rate or severity of slowdowns has changed.
10:20, March 7, 2017 (AST)
  • The intermittent slow response on Glooscap continues, with many such events logged Feb 16-18 and Feb 26-Mar 1. Technical staff continue to investigate the cause without vendor support.
09:13, March 1, 2017 (AST)
  • Users report intermittent slowness in interactive use of Glooscap. Symptoms include pauses of several seconds to over a minute in response to shell commands involving files or file metadata (such as "ls"). This is believed to be due to load on the file system, and therefore may also be affecting the run times of jobs doing extensive I/O. Vendor support for the file system is no longer available so deep troubleshooting is out of reach. We have no reports of loss of data or other actual failures. All we can recommend is great patience.
12:18, February 9, 2017 (AST)