Cluster Status

Notice: This page is maintained manually and is updated as soon as we learn new information.

Clusters

Click on a cluster name in the table below to jump to the corresponding section of this page. The Outage schedule section collects information about all scheduled outages in one place.

Cluster   | Status  | Planned Outage | Notes
Mahone    | Online  | No outages     |
Placentia | Offline | No outages     | Loss of network connection. Jobs are not affected.
Fundy     | Online  | No outages     |
Glooscap  | Online  | No outages     |

Services

Service                          | Status | Planned Outage | Notes
WebMO                            | Online | Date to come   | We are experiencing problems submitting WebMO jobs.
Account creation                 | Online | No outages     |
PGI and Intel licenses           | Online | No outages     |
Videoconferencing (IOCOM Server) | Online | No outages     |
Legend:
  • Online: the cluster or service is up and running
  • Offline: no users can log in or submit jobs, or the service is not working
  • Online: some users can log in and/or there are problems affecting your work

Outage schedule

Grid Engine will not schedule any job with a requested run time (h_rt) that extends into a planned outage period. This ensures the job will not be terminated prematurely when the system goes down.
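For instance, a job whose requested run time ends before a scheduled outage begins can still start, while a longer request will wait until after the outage. A minimal sketch of a submission with an explicit run-time request (the script name myjob.sh and the 12-hour value are placeholders):

  qsub -l h_rt=12:00:00 myjob.sh    # request a 12-hour run-time limit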

  • None

Mahone

  • The cluster may be unreachable due to an upstream provider networking issue.
08:11, December 8, 2016 (AST)

Placentia

  • Due to a network problem at Memorial University, the Placentia cluster has lost its external network connection.
Jobs are not affected as the internal network is still available. We expect this outage to last a few hours.
10:41, September 26, 2017 (ADT)
  • Placentia's head node (clhead) spontaneously rebooted last night around 2:15 am NST.
As far as we can tell, no jobs were affected.
08:14, September 8, 2017 (ADT)
  • A/C repairs have been completed and Placentia is back in production.
Fortunately, we didn't have to kill any jobs or shut down any equipment. Jobs that had been submitted previously are starting normally.
We don't expect any negative effects besides the longer wait times over the past two days.
13:33, August 25, 2017 (ADT)
  • Service technicians have started working on the affected A/C unit.
We are trying to avoid killing jobs that are already running or shutting down compute nodes; however, we are prepared to do so if the temperature rises too high during the maintenance.
08:49, August 25, 2017 (ADT)
  • The Memorial University data centre is having A/C problems, so we are reducing Placentia's capacity.
For now, we are preventing new jobs from starting. If this proves sufficient, already running jobs won't be affected.
15:20, August 23, 2017 (ADT)
  • Placentia is back up after a planned power outage on the Memorial University campus.
The jobs have been restarted and the vast majority are running fine; however, a few failed in the process. Please check whether any of your jobs are among those that failed (example commands are sketched below).
14:07, June 3, 2017 (ADT)
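A minimal sketch of how you might check, assuming the standard Grid Engine client tools are available (the job ID 123456 is a placeholder):

  qstat -u $USER     # list your jobs that are still queued or running
  qacct -j 123456    # accounting record for a finished job, including its exit_status

A job that no longer appears in qstat and shows a non-zero exit_status in qacct most likely failed during the restart and may need to be resubmitted.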

Fundy

  • Fundy head node is unresponsive. Technical staff are investigating the cause.
12:19, April 18, 2017 (ADT)
  • Fundy is back now.
10:59, August 8, 2016 (ADT)

Glooscap

  • The network problem has been resolved. Glooscap is reachable again.
16:36, April 24, 2017 (ADT)
  • Glooscap is inaccessible due to a network problem at the host university. We expect most jobs will continue running uninterrupted while we diagnose the problem.
13:06, April 24, 2017 (ADT)
  • The metadata server was hung overnight on March 7-8. It was rebooted this morning and Glooscap is operating once again, although technical staff continue to be cautious about its future behaviour. To try to alleviate the load on the metadata server, we are withdrawing compute nodes cl002 through cl058 from service. This represents a reduction of 188 cores in the capacity of the cluster.
11:24, March 8, 2017 (AST)
  • The cluster is unresponsive.
17:05, March 7, 2017 (AST)
  • A file system consistency check (fsck) has been completed, jobs have been restarted and logins are once again enabled. We will be monitoring to see if the rate or severity of slowdowns has changed.
10:20, March 7, 2017 (AST)
  • The intermittent slow response on Glooscap continues, with many such events logged Feb 16-18 and Feb 26-Mar 1. Technical staff continue to investigate the cause without vendor support.
09:13, March 1, 2017 (AST)
  • Users report intermittent slowness in interactive use of Glooscap. Symptoms include pauses of several seconds to over a minute in response to shell commands involving files or file metadata (such as "ls"). This is believed to be due to load on the file system, and therefore may also be affecting the run times of jobs doing extensive I/O. Vendor support for the file system is no longer available, so deep troubleshooting is out of reach. We have no reports of loss of data or other actual failures. All we can recommend is great patience.
12:18, February 9, 2017 (AST)