Cluster Status

From ACENET
Revision as of 15:36, November 21, 2014 by Luyang (talk | contribs) (Fundy)
This page is maintained manually. It gets updated as soon as we learn new information.

Clusters

Click a cluster name in the table below to jump to the corresponding section of this page. The outage schedule section lists all scheduled ACEnet outages in one place.

Cluster    Status   Planned Outage  Notes
Brasdor    Offline  No outages      Extensive damage
Mahone     Online   No outages      Maintenance complete
Placentia  Online   No outages
Fundy      Online   No outages
Glooscap   Online   No outages

Services

Service                 Status  Planned Outage  Notes
WebMO                   Online  No outages
Account creation        Online  No outages
PGI and Intel licenses  Online  No outages
Legend:
Online: cluster is up and running
Offline: no users can log in or submit jobs, or the service is not working
Online: some users can log in and/or there are problems

Outage schedule

Grid Engine will not schedule any job with a run time (h_rt) that extends into the beginning of a planned outage period. This is so the job will not be terminated prematurely when the system goes down.
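
To estimate whether a job could still be scheduled ahead of an outage, the rule above can be sketched as a simple time comparison. This is only an illustration, not Grid Engine's actual scheduler logic: the function names, the example dates, and the treatment of a job that ends exactly when the outage begins are all assumptions, and the parser handles only h_rt values written as plain seconds or as H:M:S.

```python
from datetime import datetime, timedelta


def parse_h_rt(h_rt: str) -> timedelta:
    """Convert an h_rt value written as seconds ("3600") or "H:M:S" ("01:00:00")."""
    parts = [int(p) for p in h_rt.split(":")]
    if len(parts) == 1:
        return timedelta(seconds=parts[0])
    if len(parts) == 3:
        hours, minutes, seconds = parts
        return timedelta(hours=hours, minutes=minutes, seconds=seconds)
    raise ValueError(f"unsupported h_rt format: {h_rt!r}")


def fits_before_outage(start: datetime, h_rt: str, outage_start: datetime) -> bool:
    """Return True if a job starting at `start` would finish by `outage_start`.

    Assumption: a job ending exactly at the outage start is treated as fitting.
    """
    return start + parse_h_rt(h_rt) <= outage_start


# Hypothetical outage and submission times, chosen only for illustration.
outage = datetime(2014, 12, 1, 9, 0)
now = datetime(2014, 11, 30, 9, 0)  # 24 hours before the outage

print(fits_before_outage(now, "12:00:00", outage))  # True
print(fits_before_outage(now, "48:00:00", outage))  # False
```

In practice this means that a job waiting in the queue as an outage approaches may only start if its requested h_rt is short enough to finish beforehand.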

Brasdor

On February 21, 2014, ACEnet's Brasdor cluster suffered serious damage when an A/C malfunction over-cooled the room, causing a sprinkler head to deploy. Assessment is ongoing; however, it is clear that the water damage was extensive enough that we will be unable to return the cluster to service.

A central concern of our recovery work has been the possibility of restoring user data. Data written to /home or /globalscratch on or before February 15, 2014 may have a surviving copy on tape. We have been able to restore such data using Mahone's tape library. Due to disk space limitations, data must be restored one user at a time.

If you need Brasdor data recovered, please contact support and specify which file system you want us to recover (/home and/or /globalscratch). Please use the subject line "File recovery at Brasdor - your_username". Also, please note that /nqs cannot be recovered.

Mahone

  • Maintenance complete.
12:01, November 18, 2014 (AST)
  • The cluster is offline for unscheduled NFS maintenance.
09:35, November 18, 2014 (AST)

Placentia

  • An NFS server has been rebooted. Please check whether your jobs are progressing normally or need to be resubmitted.
08:13, November 17, 2014 (AST)
  • NFS issues. Users' home directories may not get mounted on the compute nodes; jobs could fail or not start.
07:13, November 17, 2014 (AST)

Fundy

  • Fundy is back online.
11:36, November 21, 2014 (AST)
  • The cluster is offline to investigate and fix storage system problems.
10:46, November 20, 2014 (AST)
  • NFS problems once again. Users might not be able to log in.
23:01, November 19, 2014 (AST)

Glooscap

  • Head node locked up late Thursday afternoon, November 6. Service has been restored. Jobs were unaffected.
08:56, November 7, 2014 (AST)
  • All general production hosts (short.q, medium.q, long.q) at Glooscap are now running the RHEL 6 operating system. Upgrade of the head node to RHEL 6 is being planned.
11:16, October 23, 2014 (ADT)