This page is maintained manually. It gets updated as soon as we learn new information.
Clusters
Click the name of a cluster in the table below to jump to the corresponding section of this page. The outage schedule section lists all scheduled ACEnet outages in one place.
Services
- Legend:
  - Online: cluster is up and running
  - Offline: users cannot log in or submit jobs, or the service is not working
  - Online: some users can log in and/or there are problems
Outage schedule
Grid Engine will not schedule any job with a run time (h_rt) that extends into the beginning of a planned outage period. This is so the job will not be terminated prematurely when the system goes down.
- Glooscap will be inaccessible for a few hours on Sunday morning August 24, 2014, while Dalhousie IT Services installs new networking equipment. The cluster will continue to run jobs, but logins and external data transfers will be unavailable.
- 11:11, August 11, 2014 (ADT)
- Mahone will be offline beginning 07h00 on Monday August 25, 2014, for preventive maintenance.
- 11:26, July 21, 2014 (ADT)
- Glooscap will be offline beginning 06h00 Tuesday September 2, 2014, to fsck the filesystem and correct any inconsistencies left over from the July switch failure.
- 10:10, July 25, 2014 (ADT)
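As a rough illustration of the run-time rule described at the top of this section, the sketch below works out the longest h_rt that would still finish before a planned outage. The function names are hypothetical, and the code shows only the arithmetic, not Grid Engine's actual scheduling logic. It assumes the job starts running at the given time (a busy queue does not guarantee this) and that a job ending exactly at the outage start is acceptable.

```python
from datetime import datetime, timedelta


def fits_before_outage(start, h_rt, outage_start):
    """Return True if a job starting at `start` with run time `h_rt`
    would finish no later than `outage_start` (assumed boundary rule)."""
    return start + h_rt <= outage_start


def max_h_rt(start, outage_start):
    """Return the longest run time, formatted as h:mm:ss, that still
    ends before the outage. Returns "0:00:00" if the outage has begun."""
    remaining = outage_start - start
    if remaining <= timedelta(0):
        return "0:00:00"
    total_seconds = int(remaining.total_seconds())
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # Mahone outage start taken from the schedule above; the job start
    # time is an illustrative assumption.
    outage = datetime(2014, 8, 25, 7, 0)
    job_start = datetime(2014, 8, 24, 9, 0)
    print(fits_before_outage(job_start, timedelta(hours=48), outage))  # False
    print(max_h_rt(job_start, outage))  # 22:00:00
```

A value computed this way can be given as the run-time request at submission, for example with `qsub -l h_rt=22:00:00`.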
Brasdor
On February 21, 2014, ACEnet's Brasdor cluster suffered serious damage when an A/C malfunction over-cooled the room, causing a sprinkler head to deploy. Assessment is ongoing; however, it is clear that the water damage was extensive enough that we will be unable to return the cluster to service.
A central concern of our recovery work has been the possibility of restoring user data. Data written to /home or /globalscratch on or before February 15, 2014 may have a surviving copy on tape. We have been able to restore such data using Mahone's tape library. Due to disk space limitations, data must be restored one user at a time.
We ask any user who needs Brasdor data recovered to contact support and specify which file system to recover (/home and/or /globalscratch). Please use the subject line "File recovery at Brasdor - your_username". Also, please note that /nqs cannot be recovered.
Mahone
- 11:49, June 17, 2014 (ADT)
- The planned outage has begun.
- 07:37, June 17, 2014 (ADT)
Placentia
- The A/C has been fixed. We are back at full capacity.
- 14:50, May 29, 2014 (ADT)
- We received word today from Memorial University Facilities Management that the air conditioning fault in the Placentia machine room continues: "It was discovered subsequent to the compressor replacement that another piece of the chiller has failed... This is another part that has to be ordered in from the mainland and we're not sure where from yet (Dartmouth, Montreal or further afield) so we don't have a time frame." Consequently the return to service will not take place this week.
- 15:18, May 14, 2014 (NDT)
- The problem with an NFS server has been fixed.
- 10:01, April 29, 2014 (ADT)
- There is a problem with one of the NFS servers.
- 09:02, April 29, 2014 (ADT)
- The A/C failed some time over the weekend. We are shutting down all the compute nodes closest to the failed A/C: cl001-108.
- 08:02, April 14, 2014 (ADT)
Fundy
- Grid Engine was unavailable briefly around 1:00 pm on July 10 while we backed up the accounting database and reduced its size. This should result in better performance for the 'qacct' command.
- 13:15, July 10, 2014 (ADT)
- 13:11, July 8, 2014 (ADT)
- Fundy is inaccessible in the wake of Tropical Storm Arthur and extensive power outages in the Fredericton area.
- 15:12, July 6, 2014 (ADT)
Glooscap
- A replacement switch equivalent to the original has been installed and Glooscap has been returned to service. Most running jobs were lost during repairs. Please check that any remaining jobs you have in the system are progressing properly.
- 10:03, July 25, 2014 (ADT)
- A spare switch of lower capacity has been swapped in for the failed network switch. Users can log in and access their data, but many compute nodes are inaccessible and so queue capacity will be limited until we can obtain a better replacement. Jobs listed in qstat as "running" may in fact be hung. Users should check for output dated later than 09:30 July 22, and if there is none, consider submitting replacement jobs to other ACEnet clusters.
- 13:44, July 22, 2014 (ADT)
- A network switch has failed, making the head node unusable. We cannot yet estimate time of return-to-service.
- 09:47, July 22, 2014 (ADT)
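For the check suggested above (looking for job output dated later than 09:30 July 22), a minimal sketch follows. The function name and the directory path are hypothetical. The cutoff is interpreted in the local time zone of the machine where the script runs, which is assumed to match the cluster's.

```python
from datetime import datetime
from pathlib import Path


def files_modified_after(directory, cutoff):
    """Return a sorted list of regular files under `directory` whose
    modification time is later than `cutoff` (a naive local datetime)."""
    cutoff_ts = cutoff.timestamp()
    return sorted(
        path
        for path in Path(directory).rglob("*")
        if path.is_file() and path.stat().st_mtime > cutoff_ts
    )


# Example call (the path is a placeholder, not a real location):
#   recent = files_modified_after("/path/to/job/output",
#                                 datetime(2014, 7, 22, 9, 30))
#   An empty list suggests the job has produced no output since the cutoff.
```

An empty result does not prove a job is hung, since some jobs write output only rarely, so it is best read alongside the job's expected behaviour.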