2014 Cluster Upgrade Information
Cluster Changes Overview
- Compute nodes comprise Intel-based Cisco and IBM servers.
- slurm replaces the LSF scheduling software
- ssh logins to compute nodes are restricted
- cross mounts of local disk storage are removed
- slurm interactive partition is limited to 2 nodes for increased performance and quality of service (see the example after this list)
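For example, under slurm an interactive shell on the interactive partition can typically be requested with srun. This is a general sketch; the partition name (interactive) and resource counts shown here are assumptions and may differ on the Tufts cluster:
srun -p interactive -N 1 -n 1 --pty bash
The --pty option attaches a pseudo-terminal so the shell behaves like a normal login session on the allocated compute node.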
Every 3 years, the Tufts high-performance computing cluster environment is partially refreshed to offer increased capacity. We did so in 2008 and 2011, and again in 2014. We have had 4 generations of IBM hardware and will soon welcome a new generation of Cisco UCS hardware. Our oldest IBM hardware has been retired and our newest Cisco hardware has been installed. We will gradually integrate the current cluster's IBM hardware into the new Cisco environment during Fall 2014, allowing for a smooth, non-disruptive fall transition for users and time for testing.
By late December 2014, the transition to the new production cluster environment will provide a total of 2680 cores (1680 IBM and 1000 Cisco), compared to 2110 cores during the 2013-2014 academic year. Because they are built on a significantly newer architecture, the 1000 Cisco cores are considerably more powerful than their predecessors. In practical terms, this refresh represents the largest increase in compute power for the Tufts HPC community.
As part of this hardware refresh, we will also be replacing our LSF scheduler with slurm on the new cluster. We made this decision in line with many other high-performance supercomputing centers worldwide. The benefit to users is a scheduler built upon a common understanding and implementation strategy for scaling HPC jobs.
See the wiki page about slurm for LSF users.
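As a quick orientation (a rough sketch assuming a standard slurm installation; exact options may differ on the Tufts cluster), common LSF commands map approximately as follows:
- bsub < myjob.sh roughly corresponds to sbatch myjob.sh
- bjobs roughly corresponds to squeue -u $USER
- bkill <jobid> roughly corresponds to scancel <jobid>
- bqueues roughly corresponds to sinfo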
Legacy Cluster Information
May 2014 Announcement
Every 3 years, the Tufts high-performance computing cluster environment is partially refreshed to offer increased capacity. We did so in 2008 and 2011, and we have just completed our competitive bid for our 2014 refresh purchase. As you may know, we currently have 4 generations of IBM hardware and we are proud to announce that we will soon welcome a new generation of Cisco UCS hardware.
...
To allow for a smooth transition to the new slurm scheduler, LSF will remain on the current cluster until December 31, 2014, at which time it will also be retired and replaced by slurm. We are creating this 3-month+ transition period from mid-September to December, with both LSF and slurm available, to allow users to easily switch from the former to the latter. Workshops, tip sheets, and training will be organized during the transition period to make sure no Tufts cluster user is left behind.
Update Announcement 2011
TTS is pleased to announce the completion of the Tufts research cluster upgrade project. The new High-Performance Computing (HPC) research environment is in production, bringing more than 1000 cores of computing power to the Tufts community. Over the past three years, TTS has observed an increasing demand for our HPC research cluster. In anticipation of the ongoing need for additional resources, we began a project to increase resources more than threefold.
...
As of August 29, 2011, the new cluster environment is at 77% of total capacity, with 776 cores in production (out of 1004), and we anticipate reaching the full 1004-core production capacity by early September.
Please read the information below carefully since some of the changes to the new cluster environment may affect you.
User Interface:
There are no changes to your login home directory, user account names, or passwords. If you don't remember your password, please visit Tufts tools at http://tuftstools.tufts.edu/. Home directories have retained their previous backup schedules and the default login shell environment remains bash.
Please note that you may receive a warning that the SSH key has changed when logging in for the first time to the new research cluster at cluster6.uit.tufts.edu. This is the result of the new installation and is not a security issue. Your login credentials and login hostname have not changed, and simple fixes are detailed below.
Storage:
Data on temporary file systems (/tmp, /scratch, or /scratch2) were not migrated to the new cluster. Users were encouraged to transfer any important data prior to August 12. Temporary storage on the file system /cluster/shared/your_user_name is available on the new cluster as before. Similarly, any directory on the file system /cluster/tufts/ is available on the new cluster.
Compute Nodes:
There has been a slight change to the compute node naming convention to help reflect hardware differences:
...
Going forward, new iDataPlex nodes contributed by Tufts faculty will be named contribNN (contrib01, contrib02, etc.).
LSF Queues:
All LSF queues keep the same name on the new cluster.
Parallel related:
Access to the new GPU resource will be activated in September after the new cluster is stable and in full production.
SSH-related login key issues
We are aware that changes to the ssh keys have caused some connectivity problems with the new cluster head node. A simple solution is to remove the old cluster keys from your ssh client. This process depends on how you connect. For those who use ssh directly (Mac OS X or Windows with Cygwin), the old key may be removed in the following manner:
...
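For example, on an OpenSSH client the stale host key for the head node can typically be removed with ssh-keygen (a general example; adjust the hostname if you connect using a different alias):
ssh-keygen -R cluster6.uit.tufts.edu
Alternatively, you can edit ~/.ssh/known_hosts by hand and delete the line containing the old cluster entry.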
Once you have connected to the new head node, you may need to clear old ssh keys to ensure connectivity with all the new cluster nodes. The easiest way to do this is to run the following command once you have connected:
cleanSSH.sh
SSH debug information
If removing the old ssh keys as described above does not resolve the login problem, try connecting with ssh in verbose mode.
...
where each additional -v adds more detail. Usually one -v is sufficient. Log in, then log out, and copy and paste the full debug output into a mail message and send it to cluster-support@tufts.edu.
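For example (a general illustration using the new head node hostname; ssh writes its debug messages to standard error, so they can be captured in a file for pasting into your message):
ssh -v your_user_name@cluster6.uit.tufts.edu 2> ssh_debug.txt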
Support
Please check your mailbox for follow-up announcements. If you have any questions or concerns about the new Tufts HPC research cluster environment, please direct them via email to cluster-support@tufts.edu.