Changes

Siku

1,648 bytes removed, March 27
remove section on July 2024 changes
===== Mac Users =====
If you are using macOS 12 or later, you should be able to SSH with the new key types without issue. Users with older versions of the OS can often update to gain support for them. If you cannot update, you can either install a current version of OpenSSH via the [https://brew.sh/ Homebrew] package manager, or use the cross-platform [https://termius.com/ Termius] application, which supports generating and using the new key types on a trial basis.
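To see whether a given machine already supports a key type before updating anything, a check along the following lines may help. This is a minimal sketch: <code>supports_key_type</code> is a hypothetical helper name, and <code>ssh-ed25519</code> is only an assumed example, since this page does not name the new key types.

```shell
# Sketch: check whether the local OpenSSH client knows a given key type.
# "ssh-ed25519" is an assumed example; the page does not name the key types.
supports_key_type() {
    # "ssh -Q key" lists the key types the client supports, one per line.
    ssh -Q key 2>/dev/null | grep -qx "$1"
}

# Print the installed OpenSSH version, if a client is present at all.
ssh -V || echo "No OpenSSH client found"

if supports_key_type ssh-ed25519; then
    echo "ssh-ed25519 is supported"
else
    echo "ssh-ed25519 is not supported; consider installing a newer OpenSSH"
fi
```

If the type is supported, a key of that type can then be generated with <code>ssh-keygen -t ed25519</code>.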
 
== Known issues ==
* We have enabled [https://apptainer.org/docs/admin/main/user_namespace.html user namespaces] to support the operation of Apptainer <i>only on compute nodes</i>. You will not be able to test an Apptainer image on a login node. If you wish to do interactive testing, use <code>salloc</code> to get an interactive shell on a compute node.
 
* If you are ''not'' an academic user and intend to compile programs, please read [[License considerations for commercial users]].
 
* Multi-Processing using <code>libverbs</code> is not working as expected. MPI implementations, however, should work.
 
* Directories are automatically created at first login. This may produce a race condition that results in errors like the following:
 Could not chdir to home directory /home/username: No such file or directory
 /usr/bin/xauth: error in locking authority file /home/username/.Xauthority
 Lmod has detected the following error: Unable to load module because of error when evaluating modulefile: ...
Should this occur on first login, simply log out, wait a minute, and log back in again.
 
* Julia problems:
** Multi-node Julia jobs currently end with an <code>Authentication failed</code> message. Workarounds are to do the calculation on a single node, or [https://docs.alliancecan.ca/wiki/Julia use an Alliance cluster].
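For the Apptainer item above, interactive testing could look roughly like the sketch below. The resource values, the image name <code>my_image.sif</code> and the helper <code>apptainer_test_shell</code> are illustrative assumptions, not site recommendations. The helper relies on Slurm setting <code>SLURM_JOB_ID</code> inside an allocation.

```shell
# Step 1, on a login node: request an interactive allocation.
# The resource values are placeholders, not site recommendations:
#   salloc --time=1:00:00 --cpus-per-task=2 --mem=4G

# Step 2, in the shell that salloc opens on the compute node:
# hypothetical helper that refuses to run outside a Slurm job.
apptainer_test_shell() {
    image="$1"
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "Not inside a Slurm job; run salloc first." >&2
        return 1
    fi
    apptainer shell "$image"
}

# Usage (my_image.sif is a placeholder file name):
#   apptainer_test_shell my_image.sif
```

The guard only catches the common mistake of starting the container on a login node, where user namespaces are not enabled.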
== Similarities and differences with national GP clusters ==
: <code>MD5:62:c6:d9:d9:53:c5:71:e9:4f:d9:69:46:e4:3a:0c:38</code>
== Known issues ==
* Globus file transfer service is only available to academic users, since it is integrated with the Digital Research Alliance's authentication and authorization systems.
* The V100 GPU nodes are out of service. Only RTX6000 GPUs are available.

== July 2024 changes ==
During an extended outage between June 17th and July 3rd we implemented a number of changes.

=== List of changes ===
* The operating system was updated to Rocky Linux 9.4
* The Slurm scheduler has been updated to version 23.11.8
* Login nodes were renamed to <code>sikulogin1</code>/<code>sikulogin2</code>
* Compute nodes were renamed to <code>siku1</code> through <code>siku104</code> with <code>siku101</code> to <code>siku104</code> being the GPU nodes (formerly <code>cg001</code> to <code>cg004</code>).
* Industry users need to use the new login node <code>industry.siku.ace-net.ca</code>.
* The SSH hostkeys have changed. The fingerprints of the new keys can be found [[#SSH_host_keys|above]].
: You may need to remove the old keys by the following commands on your machine:
 ssh-keygen -f ~/.ssh/known_hosts -R "siku.ace-net.ca"
 ssh-keygen -f ~/.ssh/known_hosts -R "134.153.246.145"
 ssh-keygen -f ~/.ssh/known_hosts -R "134.153.246.158"
 ssh-keygen -f ~/.ssh/known_hosts -R "dtn.siku.ace-net.ca"
 ssh-keygen -f ~/.ssh/known_hosts -R "134.153.246.139"

=== Known Issues ===
These issues are being worked at:
* Currently only two GPU nodes with RTX6000 GPUs are available. We are working on making the other GPU nodes available.
* Currently a few CPU compute nodes are still unavailable. We are working on bringing the remaining compute nodes online.
* Using ssh to connect to running jobs is currently not possible.
* The <code>diskusage_report</code> (<code>quota</code>) command is currently not being updated.
* The <code>diskusage_report</code> (<code>quota</code>) command is currently not showing usage of the project filesystem.
* The following web-services are currently not available:
** JupyterHub
** WebMO
** Siku User Portal

==== Resolved Issues ====
* The Data Transfer Node (<code>dtn.siku.ace-net.ca</code>) is now available.
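The host-key cleanup commands listed under the July 2024 changes could also be run as a single loop. The following is a sketch only. <code>remove_old_host_keys</code> is a hypothetical helper name, and it takes the known-hosts file as its first argument so that it can be tried on a copy first.

```shell
# Sketch: remove stale host keys for several hosts in one pass.
# remove_old_host_keys is a hypothetical helper name.
# Usage: remove_old_host_keys KNOWN_HOSTS_FILE HOST...
remove_old_host_keys() {
    known_hosts="$1"
    shift
    for host in "$@"; do
        # -R deletes every key that belongs to the given host name or address.
        ssh-keygen -f "$known_hosts" -R "$host"
    done
}

# Example call with the hosts named on this page:
#   remove_old_host_keys ~/.ssh/known_hosts siku.ace-net.ca dtn.siku.ace-net.ca \
#       134.153.246.145 134.153.246.158 134.153.246.139
```

When it removes an entry, <code>ssh-keygen</code> keeps the previous file as a backup with the suffix <code>.old</code>.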
